Title: Allegorithmic Substance
1Allegorithmic Substance
2Procedural textures on multi-core
- Other than framerate and features, what else can
you do with extra CPU power?
- Well, look at Allegorithmic's middleware,
Substance.
3Procedural textures are valuable for modern games
- Have a LOT of textures.
- Want shorter loading times (faster starts,
teleportations or zooms).
- Need to reduce texture memory on disc, for
download, and/or in RAM.
- Can benefit from more flexible and reusable
assets.
4Introducing Substance
- In Q2 2007 Allegorithmic started a complete
re-engineering of ProFX2, authoring tool and
engine, named Substance.
- Unit tests were done very early to ensure that
Substance could target streaming.
- Cross-platform: PC, PS3, Xbox, etc.
- Expected linear multi-thread scalability.
5What is Substance?
- Substance is a middleware product composed of two
elements.
- Substance Authoring Tool lets you
- create procedural textures
- create texture packages of a few kilobytes!
- A cooker compiles generic data into binaries
optimized for a specific platform or user.
- Substance Engine
- generates bitmap textures on the fly.
6Less FPS?
- More textures, not less FPS
- Substance consumes idle cycles, not frames
- Graphics bitrates follow Moore's law
- Higher poly count → bigger worlds
- Higher filter rate → larger textures
- Desired texture volume grows faster than RAM
- Streaming is a necessity
- But HDD net bitrate does not follow. Bottleneck!
- Modern gameplay entails sudden bitrate bursts
- This is worsened by HDD seeks and entails stalls.
7No, a stable and high FPS.
- Even masked, a stall is actually an FPS drop
- Substance works in Random Access Memory
- The gamer zooms or teleports
- Give 4 cores and a GPU to Substance
- Sacrifice 1 or 2 frames
- Substance generates and caches 1-2M new texels.
- The stall does not hinder gameplay.
- Substance diminishes stalls
- Substance helps to maintain a high FPS.
8Performance issue: streaming in games
- DVD or HDD net bitrate is 2 or 6 MB/s
- Our aim: add a stable 4 MB/s without the GPU
- Requires billions of intermediate pixels/s.
- Can CPUs compete with GPUs?
- Opportunity: cores are still under-exploited in
most game engines.
- Texture processing is privileged in the new
multi-core architectures.
9The architecture was designed with these issues
in mind
- Homogeneous CPU and GPU versions
- Streaming (1-10 CPU cycles per pixel)
- SIMD and MT for the multi-core generations
- No cache or threading pollution
- Fine-grained jobs and lockless sync.
- Low memory footprint
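As a rough check on the 1-10 cycles per pixel budget, the sketch below converts a per-pixel cycle cost into pixels per second. The 3 GHz clock and the helper name `pixels_per_second` are assumptions for illustration only; the slides do not state a clock rate.

```python
# Assumed clock rate; the slides do not give one.
ASSUMED_CLOCK_HZ = 3.0e9


def pixels_per_second(cycles_per_pixel, clock_hz=ASSUMED_CLOCK_HZ, cores=1):
    """Upper bound on pixel throughput for a given per-pixel cycle cost."""
    if cycles_per_pixel <= 0:
        raise ValueError("cycles_per_pixel must be positive")
    return cores * clock_hz / cycles_per_pixel


for cost in (1, 10):
    rate = pixels_per_second(cost)
    print(f"{cost:>2} cycles/pixel: {rate / 1e9:.1f} Gpixels/s per core")
```

Under that assumed clock, the 1-10 cycle budget corresponds to roughly 0.3 to 3 Gpixels/s per core, which is the order of magnitude the "billions of intermediate pixels/s" requirement calls for.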
10The theoretical benefit was calculated
- New architectures come with enhanced SIMD.
Expected x10 compared to standard C.
- Tricks and algorithmic changes could give another
x10 on some filters, like DXT.
- We were confident that our image processes could
be well threaded, partly because we generate
textures asynchronously.
- Hence the CPU version of ProFX2 could be
accelerated by a factor of x25-x100.
11This is the approach taken to address the issue
- Simple inner-loop tests actually showed that
optimized SSE2-4 code could give a boost of x10.
- Find a data layout coherent with micro
parallelism (SIMD and pipeline), low level
threading, cache and memory handling.
- OpenMP is then used to test strategies before
designing a specific MT HAL.
12Here's the code that was developed to make this
possible
- A SIMD HAL is ready for PC, Xbox, PS3.
- OpenMP easily gives 85% MT linearity.
- Our MT HAL is converging towards a model of
lockless synchronization, 95% expected.
- The cooker precomputes data that will help
synchronization and MT efficiency.
- Our API exposes asynchronous commands. Perfect to
share cores with a game loop!
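To make the linearity figures concrete, here is a small sketch that reads the 85 and 95 values as percentages of ideal linear scaling (an assumption about what "MT linearity" means here) and turns them into a speedup on 4 cores. The function names are hypothetical.

```python
def expected_speedup(cores, efficiency):
    """Speedup over one core when each core delivers `efficiency` of ideal scaling."""
    return cores * efficiency


def measured_linearity(cores, speedup):
    """Fraction of ideal linear scaling achieved by a measured speedup."""
    return speedup / cores


for label, efficiency in (("OpenMP", 0.85), ("lockless MT HAL", 0.95)):
    print(f"{label}: x{expected_speedup(4, efficiency):.1f} on 4 cores")
```

Read this way, 95% linearity on 4 cores is the same statement as the x3.8 boost quoted later in the deck.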
13The compositing graph, node-based image processing
- Authoring Tool: non-linear editing
- Engine: efficient high-level structure
- Graph (DAG) contains 3 types of nodes
- Sources: procedural noise, bitmaps, SVGs
- Filters: blend, HSL, TRS, warp, blur, etc.
- Outputs: coherent diffuse, normal maps, etc.
- Main advantages
- Libraries, capsules: instantiation of subgraphs
- Complex variants: fast to create and compute
- Dynamic custom branches (e.g. aging textures)
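A minimal sketch of such a graph, assuming a toy one-dimensional "image" and hypothetical node names: it only illustrates the source / filter / output split and evaluation in dependency order, not Substance's actual data model or API. It uses `graphlib` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter


def noise(n=8):
    # Deterministic stand-in for a procedural noise source.
    return [((i * 37) % 11) / 10.0 for i in range(n)]


def invert(img):
    return [1.0 - p for p in img]


def blend(a, b):
    return [(x + y) / 2.0 for x, y in zip(a, b)]


# Hypothetical graph: name -> (kind, input names, function).
GRAPH = {
    "noise": ("source", [], noise),
    "invert": ("filter", ["noise"], invert),
    "blend": ("filter", ["noise", "invert"], blend),
    "diffuse": ("output", ["blend"], lambda img: img),
}


def evaluate(graph):
    """Evaluate every node after its inputs and return the output nodes."""
    order = TopologicalSorter({name: node[1] for name, node in graph.items()})
    results = {}
    for name in order.static_order():
        _, inputs, fn = graph[name]
        results[name] = fn(*(results[i] for i in inputs))
    return {n: results[n] for n, node in graph.items() if node[0] == "output"}


print(evaluate(GRAPH)["diffuse"])
```

Because the graph is acyclic, any branch whose inputs are ready can be computed independently, which is what makes node-level task decomposition possible.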
14The compositing graph, node-based image processing
15Threading strategies
- High level threading
- Task decomposition: 1 node (filter) per thread
- Graph splitting ensures task independence
- Low level threading
- Data decomposition: 1 strip of blocks per thread
- Dispatcher ensures non-conflicting areas
- Pixel to pixel filters are concatenated.
- Streamed R/W, no L2 cache pollution
- Temporary blocks in private L1 double buffers
- Intermediate images never allocated
- Lockless reactive sync, cache friendly
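The low level strategy can be sketched as follows, under simplifying assumptions: strips are ranges of rows rather than blocks, and all helper names are hypothetical. Pixel-to-pixel filters are fused into one function so no intermediate image is allocated, and each worker writes only its own rows, so no lock is needed. Python threads will not show a real speedup here; the sketch only illustrates the decomposition.

```python
from concurrent.futures import ThreadPoolExecutor


def fuse(*pixel_filters):
    """Concatenate pixel-to-pixel filters into a single per-pixel function."""
    def fused(p):
        for f in pixel_filters:
            p = f(p)
        return p
    return fused


def split_into_strips(height, strip_count):
    """Return non-overlapping (start_row, end_row) ranges covering all rows."""
    base, extra = divmod(height, strip_count)
    strips, start = [], 0
    for i in range(strip_count):
        end = start + base + (1 if i < extra else 0)
        if end > start:
            strips.append((start, end))
        start = end
    return strips


def run_by_strips(image, pixel_fn, workers=4):
    """Apply pixel_fn to every pixel, one strip of rows per worker."""
    out = [None] * len(image)

    def work(rows):
        start, end = rows
        for y in range(start, end):
            # Each strip writes a disjoint set of rows.
            out[y] = [pixel_fn(p) for p in image[y]]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, split_into_strips(len(image), workers)))
    return out
```

For example, `run_by_strips(image, fuse(brighten, clamp))` would run two hypothetical filters `brighten` and `clamp` in a single pass over each strip.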
16Threading sub graphs (1/11) by nodes (high level)
17Threading sub graphs (2/11) by nodes, caching
18Threading sub graphs (3/11) by nodes
19Threading sub graphs (4/11) by strips (low level)
20Threading sub graphs (5/11) remove from cache
21Threading sub graphs (6/11) by strips
22Threading sub graphs (7/11) remove from cache
23Threading sub graphs (8/11) by strips
24Threading sub graphs (9/11) remove from cache
25Threading sub graphs (10/11) by strips
26Threading sub graphs (11/11) update cache, and
finished
27Expect more streaming bandwidth
- Substance generates 4 MB/s of compressed textures.
- Cumulate this with classical streaming.
- 50 MB/s loading with 4 cores and 1 GPU
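For a sense of scale, the sketch below converts a compressed-texture byte rate into texels per second. The slides do not say which DXT format is meant, so both common cases are shown: DXT1 stores each 4x4 texel block in 8 bytes and DXT5 in 16 bytes. Taking 1 MB as 10**6 bytes is also an assumption.

```python
# Bytes used by one compressed 4x4 texel block.
BYTES_PER_BLOCK = {"DXT1": 8, "DXT5": 16}
TEXELS_PER_BLOCK = 16


def texels_per_second(bytes_per_second, fmt):
    """Texels delivered per second at a given compressed byte rate."""
    return bytes_per_second * TEXELS_PER_BLOCK // BYTES_PER_BLOCK[fmt]


RATE = 4 * 10**6  # 4 MB/s, with 1 MB taken as 10**6 bytes (assumption)

for fmt in BYTES_PER_BLOCK:
    print(f"{fmt}: {texels_per_second(RATE, fmt) / 1e6:.0f} Mtexels/s")
```

So 4 MB/s of compressed output corresponds to roughly 4 to 8 million final texels per second, depending on the format.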
28Here's how close we got to the theoretical best
performance
- DXT compression at 2G pixels/s (same as what
high-end GPUs can do in 2007).
- 8-bit SVG (cooked) rendering at 20G/s. 8G/s
anti-aliasing with 4 sub-samples.
- In most cases 4 cores give a x3.8 boost.
- Some filters are more problematic, but solutions
have been worked out in detail, and will be
implemented between Q2 and Q4 2008.
29Here's the new performance profile
- Substance and ProFX2 figures are for one core.
- 4 cores: 3.8 times more fillrate.
- ProFX2: SVG GPU
- Substance: SVG CPU
- SVG AA: 2G pixels/s per core
30This is future-proofed
- The cooker precomputes whatever helps to
linearise computations.
- Scalable code: SSE4 added in one day thanks to
the SIMD HAL.
- Scalable threading: our two strategies scale.
- A few functions dispatch virtual CPU "shaders".
- 64-core ready? Code a new dispatcher.
- Multiplatform design.
31What's next?
32Procedural diffuse map
33Coherent procedural normal map
34Complex procedural environment map
35This scene is made entirely of procedural textures
36Future sources of bandwidth
- SIMD code can be better pipelined in ASM.
- Our cooker can optimize a lot of things.
- Authoring tool will have an RT profiler.
- Artists gaining experience with Substance will
also optimize their packages better.
- Artist feedback will also help us to improve the
expressiveness of each filter.
- 30-50 filters per texture: main perf. divisor.
37Here's how you can best take advantage of
procedural textures
- Anticipate texture generation requests.
- Predict visibility (HOM, PVS).
- Create mipmaps. Access levels JIT.
- Cache the useful texels.
- Adapt texture resolution to workload.
- Use texture variants, less tiling textures or
details. Show a higher texel/pixel ratio.
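One way to picture "access levels JIT" and "adapt texture resolution to workload" is to pick, for each request, the most detailed mip level that fits a texel budget. The sketch below is an illustration under that assumption; the function names and the budget policy are hypothetical and not taken from the Substance API.

```python
def mip_chain(width, height):
    """Sizes of every mip level, from full resolution down to 1x1."""
    levels = [(width, height)]
    while width > 1 or height > 1:
        width, height = max(1, width // 2), max(1, height // 2)
        levels.append((width, height))
    return levels


def finest_level_within(width, height, texel_budget):
    """Index of the most detailed mip level whose texel count fits the budget."""
    levels = mip_chain(width, height)
    for index, (w, h) in enumerate(levels):
        if w * h <= texel_budget:
            return index
    # Nothing fits: fall back to the coarsest level.
    return len(levels) - 1


# Budget in the 1-2M texel range quoted earlier for a zoom or teleport.
for budget in (1_000_000, 2_000_000):
    level = finest_level_within(1024, 1024, budget)
    print(f"budget {budget}: generate mip level {level} first")
```

A game could generate the chosen level immediately and refine towards level 0 on later frames when spare cycles allow.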
38What do you think?
- Have you tried something like this?
- Have you rejected trying something like this?