Allegorithmic Substance - PowerPoint PPT Presentation

Provided by: cmpmedia
Transcript and Presenter's Notes

1
Allegorithmic Substance
  • Threaded Middleware

2
Procedural textures on multi-core
  • Other than framerate and features, what else can
    you do with extra CPU power?
  • We'll look at Allegorithmic's middleware,
    Substance.

3
Procedural textures are valuable for modern games
  • Have a LOT of textures.
  • Want shorter loading times (faster starts,
    teleportations or zooms).
  • Need to reduce texture size on disc, for
    download, and/or in RAM.
  • Can benefit from more flexible and reusable
    assets.

4
Introducing Substance
  • In Q2 2007 Allegorithmic started a complete
    reengineering of ProFX2, authoring tool and
    engine, named Substance.
  • Unit tests were done very early to ensure that
    Substance could target streaming.
  • Cross-platform: PC, PS3, Xbox, etc.
  • Expected linear multi-thread scalability.

5
What is Substance?
  • Substance is a middleware product composed of two
    elements.
  • The Substance Authoring Tool lets you create
    procedural textures and texture packages of a few
    kilobytes.
  • A cooker compiles generic data into binaries
    optimized for a specific platform or user.
  • The Substance Engine generates bitmap textures on
    the fly.

6
Fewer FPS?
  • More textures, not fewer FPS
  • Substance consumes idle cycles, not frames
  • Graphics bitrates follow Moore's law
  • Higher poly count → bigger worlds
  • Higher filter rate → larger textures
  • Desired texture volume grows faster than RAM
  • Streaming is a necessity
  • But HDD net bitrate does not keep up: a
    bottleneck.
  • Modern gameplay entails sudden bitrate bursts
  • HDD seeks make this worse and cause stalls.

7
No: a stable, high FPS.
  • Even when masked, a stall is actually an FPS drop
  • Substance works in Random Access Memory
  • The gamer zooms or teleports
  • Give 4 cores and a GPU to Substance
  • Sacrifice 1 or 2 frames
  • Substance generates and caches 1-2M new texels.
  • The stall does not hinder gameplay.
  • Substance reduces stalls
  • Substance helps to maintain a high FPS.

8
Performance issue: streaming in games
  • DVD or HDD net bitrate is 2 or 6 MB/s
  • Our aim: add a stable 4 MB/s without the GPU
  • Requires billions of intermediate pixels/s.
  • Can CPUs compete with GPUs?
  • Opportunity: cores are still under-exploited in
    most game engines.
  • Texture processing is privileged in the new
    multi-core architectures.

9
The architecture was designed with these issues
in mind
  • Homogeneous CPU and GPU versions
  • Streaming (1-10 CPU cycles per pixel)
  • SIMD and MT for the multi-core generations
  • No cache or threading pollution
  • Fine grained jobs and lockless sync.
  • Low memory footprint

10
The theoretical benefit was calculated
  • New architectures come with enhanced SIMD.
    Expected: x10 compared to standard C.
  • Tricks and algorithmic changes could give another
    x10 on some filters, such as DXT.
  • We were confident that our image processes could
    be threaded well, partly because we generate
    textures asynchronously.
  • Hence the CPU version of ProFX2 could be
    accelerated by a factor of x25-x100.

11
This is the approach taken to address the issue
  • Simple inner-loop tests showed that optimized
    SSE2-4 code could give a boost of x10
  • Find a data layout coherent with micro
    parallelism (SIMD and pipeline), low level
    threading, cache and memory handling.
  • OpenMP is then used to test strategies before
    designing a specific MT HAL

12
Here's the code that was developed to make this
possible
  • A SIMD HAL is ready for PC, Xbox, PS3.
  • OpenMP easily gives 85% MT linearity.
  • Our MT HAL is converging towards a model of
    lockless synchronization; 95% expected.
  • The cooker precomputes data that will help
    synchronization and MT efficiency.
  • Our API exposes asynchronous commands. Perfect to

13
The compositing graph: node-based image processing
  • Authoring Tool: non-linear editing
  • Engine: efficient high-level structure
  • Graph (DAG) contains 3 types of nodes
  • Sources: procedural noise, bitmaps, SVGs
  • Filters: blend, HSL, TRS, warp, blur, etc.
  • Outputs: coherent diffuse, normal maps, etc.
  • Main advantages
  • Libraries, capsules: instantiation of subgraphs
  • Complex variants: fast to create and compute
  • Dynamic custom branches (e.g. aging textures)
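
The three node types can be sketched as a tiny DAG evaluator. This is an illustration under stated assumptions, not the engine's data structure: `Node`, `evaluate`, and the toy `noise_source`, `invert` and `blend` functions are hypothetical names, and images are plain nested lists. A shared source feeds two branches and is computed only once.

```python
class Node:
    """One graph node: a function plus the nodes whose outputs it consumes."""

    def __init__(self, name, func, inputs=()):
        self.name = name
        self.func = func
        self.inputs = tuple(inputs)


def evaluate(node, cache=None):
    """Evaluate a node, computing each shared ancestor only once."""
    if cache is None:
        cache = {}
    if node.name not in cache:
        args = [evaluate(dep, cache) for dep in node.inputs]
        cache[node.name] = node.func(*args)
    return cache[node.name]


SIZE = 4


def noise_source():
    # Source node: deterministic stand-in for procedural noise.
    return [[(x * 37 + y * 101) % 256 for x in range(SIZE)] for y in range(SIZE)]


def invert(img):
    # Filter node with one input.
    return [[255 - p for p in row] for row in img]


def blend(a, b):
    # Filter node with two inputs: per-pixel average.
    return [[(pa + pb) // 2 for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


noise = Node("noise", noise_source)
inverted = Node("inverted", invert, [noise])
diffuse = Node("diffuse", blend, [noise, inverted])  # output node

result = evaluate(diffuse)
```

Averaging an image with its own inverse gives a flat grey, which makes the result easy to check by eye.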

14
The compositing graph: node-based image processing

15
Threading strategies
  • High-level threading
  • Task decomposition: 1 node (filter) per thread
  • Graph splitting ensures task independence
  • Low-level threading
  • Data decomposition: 1 strip of blocks per thread
  • Dispatcher ensures non-conflicting areas
  • Pixel-to-pixel filters are concatenated.
  • Streamed R/W, no L2 cache pollution
  • Temporary blocks in private L1 double buffers
  • Intermediate images are never allocated
  • Lockless, reactive sync and cache friendly
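
The low-level strategy can be sketched as follows, assuming images stored as lists of rows. The helper names (`fuse`, `split_into_strips`, `process_in_strips`) are hypothetical. Pixel-to-pixel filters are concatenated into one function so no intermediate image is allocated, and each worker writes only to its own strip of rows, so no locks are needed.

```python
from concurrent.futures import ThreadPoolExecutor


def fuse(*pixel_filters):
    """Concatenate pixel-to-pixel filters into a single per-pixel function."""
    def fused(p):
        for f in pixel_filters:
            p = f(p)
        return p
    return fused


def split_into_strips(height, strip_count):
    """Return non-overlapping (start, stop) row ranges covering the image."""
    base, extra = divmod(height, strip_count)
    strips, start = [], 0
    for i in range(strip_count):
        stop = start + base + (1 if i < extra else 0)
        if stop > start:
            strips.append((start, stop))
        start = stop
    return strips


def process_in_strips(image, pixel_filter, workers=4):
    """Apply a per-pixel filter, one strip of rows per worker."""
    out = [None] * len(image)

    def work(bounds):
        start, stop = bounds
        for y in range(start, stop):
            # Each worker touches only rows in its own strip.
            out[y] = [pixel_filter(p) for p in image[y]]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, split_into_strips(len(image), workers)))
    return out


def brighten(p):
    return min(p + 10, 255)


def invert(p):
    return 255 - p


image = [[x * 30 for x in range(8)] for _ in range(10)]
result = process_in_strips(image, fuse(brighten, invert))
```

CPython threads will not speed up this pure-Python loop; the sketch shows the decomposition only. The engine described here works on blocks with SIMD code, which this example does not attempt to model.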

16
Threading sub-graphs (1/11): by nodes (high level)

17
Threading sub-graphs (2/11): by nodes, caching

18
Threading sub-graphs (3/11): by nodes

19
Threading sub-graphs (4/11): by strips (low level)

20
Threading sub-graphs (5/11): remove from cache

21
Threading sub-graphs (6/11): by strips

22
Threading sub-graphs (7/11): remove from cache

23
Threading sub-graphs (8/11): by strips

24
Threading sub-graphs (9/11): remove from cache

25
Threading sub-graphs (10/11): by strips

26
Threading sub-graphs (11/11): update cache, and
finished
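
The slide images for this walkthrough are not part of the transcript, so the following is only a guess at the "caching" and "remove from cache" steps: a sketch that evaluates a graph in dependency order and drops each intermediate result as soon as its last consumer has run. The function `run_graph` and the toy graph are hypothetical, and integers stand in for images.

```python
def run_graph(graph, funcs, outputs):
    """Evaluate nodes in order, evicting intermediates no longer needed.

    graph:   node name -> list of input node names, in dependency order
             (assumed: every input appears before its consumers)
    funcs:   node name -> function taking the input results
    outputs: set of node names that must stay in the cache
    """
    # Count how many consumers still need each node's result.
    remaining = {name: 0 for name in graph}
    for inputs in graph.values():
        for dep in inputs:
            remaining[dep] += 1

    cache, peak, evicted = {}, 0, []
    for name, inputs in graph.items():
        cache[name] = funcs[name](*(cache[dep] for dep in inputs))
        peak = max(peak, len(cache))
        for dep in inputs:
            remaining[dep] -= 1
            if remaining[dep] == 0 and dep not in outputs:
                del cache[dep]  # last consumer has run: remove from cache
                evicted.append(dep)
    return cache, peak, evicted


graph = {
    "noise": [],
    "blur": ["noise"],
    "warp": ["blur"],
    "blend": ["noise", "warp"],
}
funcs = {
    "noise": lambda: 8,
    "blur": lambda a: a + 1,
    "warp": lambda a: a * 2,
    "blend": lambda a, b: a + b,
}
cache, peak, evicted = run_graph(graph, funcs, outputs={"blend"})
```

Here at most three of the four results are alive at once, and only the output remains in the cache at the end.
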
27
Expect more streaming bandwidth
  • Substance generates 4 MB/s of compressed
    textures
  • Add this to classical streaming
  • 50 MB/s loading with 4 cores and 1 GPU

28
Here's how close we got to the theoretical best
performance
  • DXT compression at 2G pixels/s (the same as
    high-end GPUs could do in 2007).
  • 8-bit SVG (cooked) rendering at 20G pixels/s;
    8G pixels/s with 4-sub-sample anti-aliasing.
  • In most cases 4 cores give a x3.8 boost
  • Some filters are more problematic, but solutions
    have been worked out in detail and will be
    implemented between Q2 and Q4 2008.
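
A x3.8 boost on 4 cores corresponds to a parallel efficiency of 3.8 / 4 = 95%. The sketch below computes that figure and, as an assumption that is not part of the talk, the serial fraction that Amdahl's law would imply for such a measurement. The function names are illustrative.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / cores


def serial_fraction(speedup, cores):
    """Serial fraction s implied by Amdahl's law, S = 1 / (s + (1 - s) / N)."""
    return (cores / speedup - 1.0) / (cores - 1.0)


def amdahl_speedup(serial, cores):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial + (1.0 - serial) / cores)


efficiency = parallel_efficiency(3.8, 4)
s = serial_fraction(3.8, 4)
```

Calling `amdahl_speedup(s, cores)` with a larger core count gives a rough extrapolation, valid only if the Amdahl model holds for the workload.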

29
Here's the new performance profile
  • Substance and ProFX2 figures are for one core.
  • 4 cores: 3.8 times more fill rate.
  • ProFX2 SVG: GPU
  • Substance SVG: CPU
  • SVG AA: 2G pixels/s per core

30
This is future-proofed
  • The cooker precomputes whatever helps to
    linearise computations.
  • Scalable code: SSE4 added in one day thanks to
    the SIMD HAL
  • Scalable threading: our two strategies scale
  • A few functions dispatch virtual CPU "shaders"
  • 64-core ready? Code a new dispatcher.
  • Multiplatform design.

31
What's next?

32
Procedural diffuse map

33
Coherent procedural normal map

34
Complex procedural environment map

35
This scene is made entirely of procedural textures

36
Future sources of bandwidth
  • SIMD code can be better pipelined in ASM.
  • Our cooker can optimize a lot of things.
  • The authoring tool will have an RT profiler
  • Artists gaining experience with Substance will
    also optimize their packages better.
  • Artist feedback will also help us to improve the
    expressiveness of each filter
  • 30-50 filters per texture: the main performance
    divisor.

37
Here's how you can best take advantage of
procedural textures
  • Anticipate texture generation requests.
  • Predict visibility (HOM, PVS).
  • Create mipmaps. Access levels JIT.
  • Cache the useful texels.
  • Adapt texture resolution to workload.
  • Use texture variants, fewer tiling textures or
    details. Show a higher texel/pixel ratio.
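
Two of these tips, accessing mip levels just in time and caching useful texels, can be sketched together. This is a hedged illustration: `mip_level_for`, `generate_level` and the XOR pattern are hypothetical stand-ins, and the selection rule (match texture resolution to on-screen size) is one simple heuristic among many.

```python
import math
from functools import lru_cache


def mip_level_for(texture_size, screen_pixels, max_level):
    """Pick the mip level whose resolution roughly matches the on-screen size."""
    if screen_pixels <= 0:
        return max_level  # not visible: the smallest level is enough
    ratio = max(texture_size / screen_pixels, 1.0)
    return min(math.floor(math.log2(ratio)), max_level)


@lru_cache(maxsize=64)
def generate_level(name, base_size, level):
    """Stand-in generator: builds only the requested level, on demand."""
    size = max(base_size >> level, 1)
    return tuple(tuple((x ^ y) & 255 for x in range(size)) for y in range(size))


# An object covering about 256 screen pixels needs level 2 of a 1024 texture.
level = mip_level_for(1024, 256, max_level=10)
first = generate_level("rock", 1024, level)
again = generate_level("rock", 1024, level)  # served from the cache
```

Lowering `max_level` requests or capping `base_size` under load is one way to adapt resolution to the workload, since smaller levels cost fewer generated texels.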

38
What do you think?
  • Have you tried something like this?
  • Have you rejected trying something like this?