Title: Developing Efficient Graphics Software
1Developing Efficient Graphics Software
2Developing Efficient Graphics Software
- Intent of Course
- Identify application and hardware interaction
- Quantify and optimize interaction
- Identify efficient software structure
- Balance software and hardware system component use
3Developing Efficient Graphics Software
- Outline
- 1:35 Hardware and graphics architecture and performance
- 2:05 Software and system performance
- Break
- 2:55 Software profiling and performance analysis
- 3:20 C/C++ language issues
- 3:50 Graphics techniques and algorithms
- 4:40 Performance hints
4Developing Efficient Graphics Software
- Speakers
- Applications Consulting Engineers for SGI
- optimizing, differentiating, graphics
- Keith Cok, Bob Kuehne, Thomas True, Alan Commike
5Hardware Graphics Architecture Performance
6Course Overview
- Why is your application drawing so slowly?
- Could actually be the graphics
- Could be the data traversal
- Could be something entirely different
7Tour Guide
- Platform architecture components
- CPU
- Memory
- Graphics
- Graphics performance
- Measurements: triangle rate, fill rate, misc.
- Reproduce and maximize
8Bottlenecks Balance
- Bottlenecks
- Find them
- Eliminate them (sort of - move them around)
- Balance
- Understand hardware architecture
- Fully utilize hardware
9Yin Yang
- "Yin and yang are the two primal cosmic principles of the universe"
- "The best state for everything in the universe is a state of harmony represented by a balance of yin and yang."
- Skeptic's Dictionary -- http://skepdic.com/yinyang.html
10Write Once Run Everywhere?
- "My application ran fast on that platform! Why is this one so slow?"
- Different platforms require different tuning
- Different platforms implement hardware differently
- Macro: Architecture features
- Micro: Storage capacities, buffers, caches
- Effect: Bandwidth and latency
11Latency Bandwidth
- Definitions
- Latency: time required to communicate a unit of data
- Bandwidth: data transferred per unit time
- Example
- Latency bottleneck
- Bandwidth bottleneck
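The two definitions can be combined into a rough cost estimate. The sketch below assumes a simple model that is not stated on the slide: total transfer time is a fixed latency plus size divided by bandwidth. The function names are illustrative.

```c
#include <assert.h>

/* Estimated time, in seconds, to move `bytes` of data over a link:
 * a fixed startup latency plus the size divided by the bandwidth.
 * This additive model is an assumption, not a measured law. */
double transfer_seconds(double latency_s, double bandwidth_bytes_per_s,
                        double bytes)
{
    assert(bandwidth_bytes_per_s > 0.0);
    return latency_s + bytes / bandwidth_bytes_per_s;
}

/* Returns 1 when the fixed latency is the larger share of the total
 * (latency bottleneck), 0 when the size-dependent term is at least as
 * large (bandwidth bottleneck). */
int latency_dominates(double latency_s, double bandwidth_bytes_per_s,
                      double bytes)
{
    assert(bandwidth_bytes_per_s > 0.0);
    return latency_s > bytes / bandwidth_bytes_per_s;
}
```

Under this model many small transfers tend to be latency bound, while a few large ones tend to be bandwidth bound.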
12Platform Software View
- Diagram: the platform as software sees it: graphics, CPU, I/O, memory, net, misc.
13Platform PCI, AGP
- Diagram: two block diagrams, one PCI and one AGP, each showing CPU, memory, glue logic, disk, net, graphics, and I/O
14Platform UMA, Switched Hub
- Diagram: two block diagrams, UMA and switched hub, built from CPU, memory, glue logic, PCI, disk, net, graphics, and I/O
15Platform The Points
- Why learn about hardware?
- To understand how your app interacts with it
- To best utilize the hardware
- Potentially can use extra hardware features
- Where?
- Platform documentation
- Talk with hardware vendor
16CPU Overview
- CPU Operation
- Data transferred from main memory to registers
- CPU works on data in registers
- Latency
- Registers: 0 (free)
- Level-1 (L1) cache: 1
- Level-2 (L2) cache: 10x L1
- Main memory: 100x L1
- Diagram: CPU with registers (R), then L1, L2, and main memory
17CPU, Cache, and Memory
- Caches designed to exploit data locality
- Temporal locality
- Spatial locality
- Diagram: main memory, L2, L1, registers, CPU
18Memory Cache Logical Flow
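- Flowchart: is the data in a register? If not, is it in L1? If not, is it in L2?
- Not in L2: copy to L2 (100)
- Not in L1: copy to L1 (10)
- Not in a register: copy to register (1)
- Compute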
19Memory Cache Physical Flow
- Diagram: data moves from main memory (pages) through the L2 cache and L1 cache into registers and the CPU
20Memory Allocation Pools
- List elements are often allocated as-needed
- This leads to spatial disparity
- Mitigated by use of application memory management
- Bad: malloc, malloc, malloc, malloc, ...
- Good: pools - pool_init, pool_alloc, ...
- Graphics example
- Vertices, normals, textures, etc.
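One possible shape for the pool interface named above is sketched here. Only the names pool_init and pool_alloc come from the slide; the pool_t type, pool_destroy, the signatures, and the fixed-capacity design are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A fixed-size pool: one contiguous block carved into equal slots.
 * pool_t and pool_destroy are hypothetical names. */
typedef struct {
    unsigned char *base;   /* start of the contiguous block */
    size_t slot_size;      /* bytes per element             */
    size_t capacity;       /* number of slots in the block  */
    size_t used;           /* slots handed out so far       */
} pool_t;

/* Allocate the whole block up front; returns 0 on success, -1 on failure.
 * Pass slot_size = sizeof(element type) so every slot stays aligned. */
int pool_init(pool_t *p, size_t slot_size, size_t capacity)
{
    assert(p != NULL && slot_size > 0 && capacity > 0);
    p->base = malloc(slot_size * capacity);
    p->slot_size = slot_size;
    p->capacity = capacity;
    p->used = 0;
    return p->base ? 0 : -1;
}

/* Hand out the next slot; consecutive calls return adjacent memory. */
void *pool_alloc(pool_t *p)
{
    assert(p != NULL);
    if (p->used == p->capacity)
        return NULL;       /* pool exhausted */
    return p->base + p->slot_size * p->used++;
}

/* Release the whole block at once. */
void pool_destroy(pool_t *p)
{
    assert(p != NULL);
    free(p->base);
    p->base = NULL;
    p->used = p->capacity = 0;
}
```

Because every element comes from one block, a traversal over vertices or normals touches neighboring addresses instead of wherever malloc happened to place each node.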
21Memory Graphics! Vertex Arrays
22Graphics Pipe
- xf: world to screen
- light: apply light
- clip: clip to view
- rast: convert to pixels
- fx: apply texture, etc.
- fops: test pixel ops
23Graphics Pipe Akeley Taxonomy
- G - Generate geometric data
- T - Traverse data structures
- X - Transform primitives world to screen
- R - Rasterize triangles to pixels
- D - Display framebuffer on output device
24Graphics Hardware
- 4 types of hardware are common
- G-TXRD: all hardware
- GT-XRD
- GTX-RD
- GTXR-D: all software
25Graphics Performance
- Benchmarks
- "Trust, but verify." - an ex-president
- Definitions
- Triangle rate: speed at which primitives are transformed (X)
- Fill rate: speed at which primitives are rasterized (R)
- Depth complexity: number of times a pixel is filled
- Caveats
- Quantization, fastpath
26Graphics Quantization
- Frame quantization is the result of swapbuffers occurring at the next vertical retrace.
- Necessary to avoid image artifacts such as tearing
- Example: 100 Hz display refresh
27Graphics Quantization
- Diagram: timeline from t0 to t7 comparing frame rates: 120 Hz with no sync, and 100 Hz, 50 Hz, 50 Hz, and 33 Hz when swaps wait for the retrace
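The quantized rates follow from rounding each frame's render time up to a whole number of refresh intervals. A minimal sketch of that arithmetic, with illustrative function names:

```c
#include <assert.h>

/* Number of whole refresh intervals a frame occupies when buffer swaps
 * wait for the next vertical retrace. */
int refreshes_per_frame(double render_ms, double refresh_hz)
{
    double period_ms;
    int n;
    assert(render_ms > 0.0 && refresh_hz > 0.0);
    period_ms = 1000.0 / refresh_hz;
    n = (int)(render_ms / period_ms);
    if (n * period_ms < render_ms)
        n++;               /* round up to the next retrace */
    return n;
}

/* Frame rate actually displayed after quantization. */
double quantized_fps(double render_ms, double refresh_hz)
{
    return refresh_hz / refreshes_per_frame(render_ms, refresh_hz);
}
```

On a 100 Hz display, a frame that takes 8 ms shows at 100 Hz, but one that takes 10.5 ms waits for the second retrace and shows at 50 Hz; the next step down is 33 Hz.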
28Graphics Fastpath
- Definition
- Fastpath: the most optimized path through graphics hardware
- Example
- fast path: float verts, float norms, ABGR textures, z-test
- less fast path: float verts, float norms, RGBA textures, z-test
29Graphics Fastpath Example
30Graphics Fastpath Points
- Fast path is often synonymous with ideal path.
- Real usage of graphics falls on a continuum.
- Must quantify what hardware can do
- Quality speed
31Graphics Hardware Testing
- Duplicate performance numbers simply
- Good: build a simple test program
- Better: glPerf - http://www.spec.org
- Maximize performance in an app
- Good: Use fast API extensions
- Better: Create an is-fast test, use what is verified as fast
32Graphics Hardware Is-Fast
- Test each platform to determine fast path
- Once, per-machine, test primitives and modes
- Vertex array format, texture format, display list, etc.
- Store data in database
- Detect hardware changes or time-to-live
- Read data from database at startup
- Check database or re-generate data
33Graphics Hardware Is-Fast
if (new_machine() || hardware_changed()) {
    test_interesting_modes();
    store_in_database();
} else {
    // have database entry
    get_performance_data_from_database();
}
// use the modes and primitives that are fast when rendering
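The measuring step of an is-fast test could look like the sketch below. Everything here is hypothetical: the render_mode type, time_mode, and fastest_index are illustrative names, and clock() measures CPU time, so a real test would more likely use a wall-clock timer and wait for the graphics pipeline to drain before stopping it.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* One candidate rendering mode and a function that draws a fixed
 * workload with it (hypothetical type). */
typedef struct {
    const char *name;
    void (*draw)(void);
} render_mode;

/* Time `reps` calls of one mode with the C library clock. */
double time_mode(const render_mode *m, int reps)
{
    clock_t start;
    int i;
    assert(m != NULL && m->draw != NULL && reps > 0);
    start = clock();
    for (i = 0; i < reps; i++)
        m->draw();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Return the index of the smallest measured time. */
size_t fastest_index(const double *seconds, size_t count)
{
    size_t i, best = 0;
    assert(seconds != NULL && count > 0);
    for (i = 1; i < count; i++)
        if (seconds[i] < seconds[best])
            best = i;
    return best;
}
```

The winning index is what would be stored in the per-machine database and read back at startup.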
34Think Globally, Act Locally
- Think globally
- Know the platform's graphics hardware
- Use hardware effectively in your app
- Balance hardware utilization
- Act locally
- Use in-cache data
- Understand hardware graphics fastpaths
- Balance quality vs. performance
35Software and System Performance
36A Four Step Process
37Quantify
- Characterize
- Application Space
- Primitive Types
- Primitive Counts
- Rendering Characteristics
- Frame Rate
38Quantify
39Examine System Configuration
- Resources
- Memory
- Disk
- Setup
- Display
- Network
40Graphics Analysis
- Ideal Performance
- Keep graphics pipeline full.
- 100% CPU utilization running application code.
- 100% graphics utilization.
41Graphics Analysis
42Graphics Analysis
- Graphics Bound
- Graphics subsystem processes data slower than CPU can feed it.
- Graphics subsystem issues an interrupt which causes the CPU to stall.
- Data processing within application stops until graphics subsystem can again accept data.
43Graphics Analysis
- Geometry Limited
- Limited by the rate at which vertices can be transformed and clipped.
- Fill Limited
- Limited by the rate at which transformed vertices can be rasterized.
44Graphics Analysis
45Graphics Analysis
- CPU Bound
- CPU at 100% utilization but can't feed graphics fast enough.
- Graphics subsystem at less than 100% utilization.
- All CPU cycles consumed by data processing.
46Graphics Analysis
- Determination Techniques
- Remove graphics API calls.
- Shrink graphics window.
- Reduce geometry processing requirements.
- Use system monitoring tool.
47Graphics Analysis
- Flowchart: start by removing graphics API calls
- No change in frame rate: performance problem is not graphics; look for excessive or unexpected CPU activity
- Frame rate increase: graphics work is part of the problem
48Graphics Analysis
- Graphics Architecture GTXR-D
49Graphics Analysis
- Graphics Architecture GTXR-D
- (aka Dumb Frame Buffer)
- CPU does everything.
- Typically CPU bound.
- To remedy, buy a real graphics board.
50Graphics Analysis
- Graphics Architecture GTX-RD
51Graphics Analysis
- Graphics Architecture GTX-RD
- Screen space operations performed by graphics.
- Object-space to screen-space transform on host.
- Can easily become CPU bound.
- "Roughly 100 single-precision floating point operations are required to transform, light, clip test, project and map an object-space vertex to screen-space." - K. Akeley and T. Jermoluk
- Beware of fast-path and slow-path issues.
52Graphics Analysis
- Graphics Architecture GTX-RD
- If Graphics Bound
- Reduce per-pixel operations.
- Reduce depth complexity.
- Use native-format data.
53Graphics Analysis
- Graphics Architecture GTX-RD
- If CPU Bound
- Reduce scene complexity.
- Use more efficient graphics algorithms.
54Graphics Analysis
Graphics Architecture GT-XRD
55Graphics Analysis
- Graphics Architecture GT-XRD
- Transformation and rasterization performed by graphics.
- Can be CPU or graphics bound.
- Beware of fast-path and slow-path issues.
- Subject to host bandwidth limitations.
56Graphics Analysis
- Graphics Architecture GT-XRD
- If Graphics Bound
- Move lighting back to CPU.
- Use native data formats within application.
- Use display lists or vertex arrays.
- Use less expensive lighting modes.
57Graphics Analysis
- Graphics Architecture GT-XRD
- If CPU Bound
- Move lighting from CPU to graphics subsystem.
- Do matrix operations in graphics hardware.
- Profile in search of computational performance
issues.
58Bottleneck Elimination
59Bottleneck Elimination
- Bottlenecks
- Understanding, crucial to effective tuning.
- Will always exist, tune to balance.
- Not always a bad thing.
60Bottleneck Elimination
- Graphics
- Use native graphics formats.
- Remove excessive state changes.
- Package graphics primitives efficiently.
- Use textures that fit in texture cache.
- Don't use unnecessary rendering modes.
- Decrease depth complexity.
- Cull out excessive geometry.
61Bottleneck Elimination
- Memory
- Don't allocate memory in rendering loop.
- Avoid copying and repackaging of graphics data.
- Organize graphics data.
- Avoid memory fragmentation.
62Bottleneck Elimination
- Memory Bandwidth and Fragmentation
- Independent Triangles: 9 vertices, 504 bytes
- Triangle Strip: 5 vertices, 280 bytes
- Vertex Array: 5 vertices, 280 bytes
- Vertex (RGBA, XYZW, XYZ, STR): 56 bytes
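The byte counts can be reproduced from the vertex layout. The sketch below assumes 4-byte floats and that the example geometry is three connected triangles (which gives 9 independent vertices or a 5-vertex strip); the type and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* One vertex as laid out on the slide: RGBA color, XYZW position,
 * XYZ normal, STR texture coordinate = 14 floats. */
typedef struct {
    float rgba[4];
    float xyzw[4];
    float normal[3];
    float str[3];
} vertex;

/* Vertices sent for `triangles` independent triangles. */
size_t independent_vertices(size_t triangles)
{
    return 3 * triangles;
}

/* Vertices sent for one strip of `triangles` connected triangles. */
size_t strip_vertices(size_t triangles)
{
    assert(triangles > 0);
    return triangles + 2;
}

/* Bytes moved to the graphics subsystem for a given vertex count. */
size_t bytes_for(size_t vertices)
{
    return vertices * sizeof(vertex);
}
```

With 56-byte vertices, three independent triangles cost 9 x 56 = 504 bytes, while the same triangles as a strip cost 5 x 56 = 280 bytes.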
63Bottleneck Elimination
- Code and Language
- Use native data types.
- Avoid contention for a single shared resource.
- Avoid application bottlenecks in non-graphics code.
- Reduce API call overhead.
64Bottleneck Elimination
- Independent Triangles (XYZW RGBA XYZ STR): 9 vertices, 36 function calls
- Triangle Strips (XYZW RGBA XYZ STR): 5 vertices, 20 function calls
- Vertex Array: 5 function calls
- Display List: 1 function call
65Conclusion
- Performance Tuning an Iterative Process
66Conclusion
67Profiling and Performance Analysis
68Profile and Performance Analysis
- Profiling points out code areas that take up most time
- Imperative for well balanced application
- Points out code and system bottlenecks
69Two Methods of Software Profiling
- Basic block
- A section of code that has one entry and one exit
- Measures ideal time
- Statistical sampling
- Interrupts program execution and examines current location
- Measures actual CPU cycles spent executing a line of code
70How Do You Profile Code?
- Compile/link with compiler optimizations turned on
- cc foo.c -use_all_optimization_flags ....
- Instrument the code
- Unix: pixie foo.exe -> foo.exe.pixie
- Visual Studio: embedded in tool suite
- Run the application with relevant data sets
- foo.exe.pixie - args -> produces results data file
71Profiling Finding the Hot Spot
- Function list, in descending order by exclusive ideal time
        excl.%  cum.%  instructions    calls  function (dso: file, line)
  [1]    10.3   10.3     190583064    11484  GL_CreateSurfaceLightmap (foo: gl_rsurf.c, 1293)
  [2]     8.9   19.2     173920781     3203  S_Update_ (foo: snd_dma.c, 848)
  [3]     8.2   27.4     145950460   338787  R_RenderBrushPoly (foo: gl_rsurf.c, 641)
  [4]     5.9   33.3      97798122  1975976  __sin (libm.so: sin.c, 194)
  [5]     4.1   37.4      82310479      240  GL_LoadTexture (foo: gl_draw.c, 990)
  [6]     3.4   40.8      50786176  1204269  __glMgrim_Begin (libGLcore.so: mgras_prim.c, 221)
  [7]     3.2   44.0      58099072    16797  R_DrawAliasModel (foo: gl_rmain.c, 232)
  [8]     3.1   47.1      53832546   290970  R_RecursiveWorldNode (foo: gl_rsurf.c, 894)
  [9]     3.1   50.2      43855299   437627  R_CullBox (foo: gl_rlight.c, 313; compiled in gl_rmain.c)
  [10]    2.8   53.0      44666700    30981  EmitWaterPolys (foo: gl_warp.c, 187)
72Profiling Fixing the Hot Spot
- What do you look for?
- Common sub-expressions
- Loop invariant code
- Repeated pointer de-referencing
- Global variables and cache misses
- Thin loops
73Profiling Example
- // Code the old way
  19 void old_loop() {
  20   sum = 0;
  21   for (i = 0; i < NUM; i++)
  22     sum += x[i];
  23   printf("sum = %f\n", sum);
  24 }
- // Code the new way
  27 void new_loop() {
  28   sum = 0;
  29   ii = NUM % 4;
  30   for (i = 0; i < ii; i++)
  31     sum += x[i];
  32   for (i = ii; i < NUM; i += 4) {
  33     sum += x[i];
  34     sum += x[i+1];
  35     sum += x[i+2];
  36     sum += x[i+3];
  37   }
  38   printf("sum = %f\n", sum);
  39 }
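A self-contained version of the two loops is sketched below so they can be compiled and compared. The function names and the choice to pass the array and its length as parameters are assumptions; the slide code uses globals.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop: one add and one loop test per element. */
double sum_simple(const double *x, size_t n)
{
    double sum = 0.0;
    size_t i;
    assert(x != NULL || n == 0);
    for (i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Unrolled by four: handle the n % 4 leftover elements first, then
 * do four adds per loop test. */
double sum_unrolled(const double *x, size_t n)
{
    double sum = 0.0;
    size_t i, ii = n % 4;
    assert(x != NULL || n == 0);
    for (i = 0; i < ii; i++)
        sum += x[i];
    for (i = ii; i < n; i += 4) {
        sum += x[i];
        sum += x[i + 1];
        sum += x[i + 2];
        sum += x[i + 3];
    }
    return sum;
}
```

Both functions add the elements in the same order, so they return identical results; only the number of loop tests and branches differs.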
74Profiling Example Profile Results
        cycles  instructions  calls  function (dso: file, line)
- Run with old_loop
  [1]     6160          6168      1  old_loop (blahdso.so: blahdso.c, 19)
  [2]     4869          8714      1  setup_data (blahdso.so: blahdso.c, 11)
- Run with new_loop
  [1]     4869          8714      1  setup_data (blahdso.so: blahdso.c, 11)
  [2]     4625          4891      1  new_loop (blahdso.so: blahdso.c, 27)
75Profile Example Line Analysis
- Line list, in descending order by time
  cycles  invocations  function: line
    4096         1024  old_loop: sum += x[i]
    2061         1024  old_loop: for (i = 0; i < NUM; i++)
     978          256  new_loop: sum += x[i+3]
     968          256  new_loop: sum += x[i+2]
     968          256  new_loop: sum += x[i+1]
     968          256  new_loop: sum += x[i]
     733          256  new_loop: for (i = ii; i < NUM; i += 4)
       7            1  new_loop: ii = NUM % 4
76Profile and Performance Analysis
- Profile Example: Visual C++/Intel
  Function Time (s)  Percent of Run Time  Hit Count  Function
              0.410                 39.4          1  _old_loop
              0.249                 23.9          1  _new_loop
77Statistical vs. Basic Block Profile
void ijk_loop() {   // loops kji and ikj as well
    sum = 0;
    for (i = 0; i < YNUM; i++)
        for (j = 0; j < YNUM; j++)
            for (k = 0; k < YNUM; k++)
                sum += y[i][j][k];
    printf("sum = %f\n", sum);
}
78Basic Block vs. Statistical Sampling
- Basic Block
       Percent    cycles      inst  calls  function
  [1]     25.3  51141434  37101028      1  ijk_loop (foo.c, 47)
  [2]     25.3  51141434  37101028      1  kji_loop (foo.c, 57)
  [3]     25.3  51141434  37101028      1  ikj_loop (foo.c, 66)
- Statistical Sampling
       Percent  Samples  Procedure
  [1]     38.0     2700  kji_loop (foo.c, 57)
  [2]     23.9     1700  setup_data (foo.c, 15)
  [3]     19.7     1400  ikj_loop (foo.c, 66)
  [4]     18.3     1300  ijk_loop (foo.c, 47)
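The loop variants execute the same instructions, which is why basic-block counting reports identical costs, while sampling sees the time actually spent. A likely reason for the difference is memory access order: in C the last index is contiguous, so the kji order strides across memory on every access. The sketch below shows two of the orders; the array size and function names are illustrative, since the slides do not give YNUM.

```c
#include <assert.h>
#include <stddef.h>

#define YNUM 16  /* illustrative size; the slides do not give the real one */

/* Set every element of the array to the same value. */
void fill(double y[YNUM][YNUM][YNUM], double value)
{
    size_t i, j, k;
    assert(y != NULL);
    for (i = 0; i < YNUM; i++)
        for (j = 0; j < YNUM; j++)
            for (k = 0; k < YNUM; k++)
                y[i][j][k] = value;
}

/* Innermost index k walks adjacent memory (stride of one element). */
double sum_ijk(double y[YNUM][YNUM][YNUM])
{
    double sum = 0.0;
    size_t i, j, k;
    assert(y != NULL);
    for (i = 0; i < YNUM; i++)
        for (j = 0; j < YNUM; j++)
            for (k = 0; k < YNUM; k++)
                sum += y[i][j][k];
    return sum;
}

/* Innermost index i jumps YNUM * YNUM elements on every access. */
double sum_kji(double y[YNUM][YNUM][YNUM])
{
    double sum = 0.0;
    size_t i, j, k;
    assert(y != NULL);
    for (k = 0; k < YNUM; k++)
        for (j = 0; j < YNUM; j++)
            for (i = 0; i < YNUM; i++)
                sum += y[i][j][k];
    return sum;
}
```

Timing the two functions on a much larger array is one way to observe the gap that the sampling profile reports.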
79Now We Know About Hot Spots...
- What do we do next?
- Use compilers to fine-tune code
- Use knowledge of language to optimize
- Hand-tune code
- Profiling is fun, hard, and iterative and it can
be highly effective
80 Compiler and Language Issues
- Keith Cok, SGI
- Bob Kuehne, SGI
81Compiler and Language Issues
- Compiler Optimizations
- Occur within a compromise of
- speed and memory space
- vs.
- time to compile and link
- An iterative process to discover what does and doesn't work
- Important to keep at it
82Compiler Issues Trade-Offs
- Trade-offs
- Round-off vs. needed precision
- Inter-procedural analysis vs. link time
- Pointer aliasing vs. coding constraints
- Optimizing for processor architectures vs. work of multiple binaries (support, test)
- Explore other compilers than your first choice
- Different source code - different flags
83Compiler and Language Issues
- Comments on 32 vs. 64 bit code
- Benefits of 64 bit code
- Increased address space
- Higher precision
- Downsides of 64 bit code
- Application memory footprint
- Need to port which can be difficult!
- Performance issues
84Language Issues
- Data Management
- Unrolling loops
- Arrays
- Temporary variables
- Pointer aliasing
85Language Issues Data Management
- Manipulate data structures efficiently since graphics IS data
  struct str {                 struct str {
      struct str *next;            struct str *next;
      struct str *prev;            struct str *prev;
      large_type foo;              int key;
      int key;                     large_type foo;
  };                           };
86Language Issues Data Management
- Pack data efficiently
  struct foo {                          struct foo_better {
      char aa;  // 8 bits + 24 pad          float bb; // 32 bits
      float bb; // 32 bits                  char aa;  // 8 bits
      char cc;  // 8 bits + 24 pad          char cc;  // 8 bits
      float dd; // 32 bits                  char ee;  // 8 bits + 8 pad
      char ee;  // 8 bits + 24 pad          float dd; // 32 bits
  } foo_t;      // 160 bits             } foo_t;      // 96 bits
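The padding can be checked with sizeof rather than counted by hand. This sketch assumes a typical ABI with 4-byte floats aligned to 4 bytes, where the two structs come out at 20 and 12 bytes; other platforms may differ, so the helper names and expected sizes are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Members in the original order: each char is followed by padding. */
struct foo {
    char aa;
    float bb;
    char cc;
    float dd;
    char ee;
};

/* Same members with the chars grouped together. */
struct foo_better {
    float bb;
    char aa;
    char cc;
    char ee;
    float dd;
};

/* Bytes of real data in either layout: three chars and two floats. */
size_t payload_bytes(void)
{
    return 3 * sizeof(char) + 2 * sizeof(float);
}

/* Bytes lost to padding in the original layout. */
size_t foo_padding(void)
{
    assert(sizeof(struct foo) >= payload_bytes());
    return sizeof(struct foo) - payload_bytes();
}

/* Bytes lost to padding in the reordered layout. */
size_t foo_better_padding(void)
{
    assert(sizeof(struct foo_better) >= payload_bytes());
    return sizeof(struct foo_better) - payload_bytes();
}
```

For an array of a million such records, the reordering would save about 8 MB on that typical ABI, which also means fewer cache lines touched per traversal.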
87Language Issues Data Management
- Examine your arrays and note their caching behavior
- Break up large arrays into smaller sub-arrays for better memory access patterns
- Understand the implications of data layout and cache behavior
88Language Issues Loop Unrolling
- Profiling Example
- // Code the old way
  19 void old_loop() {
  20   sum = 0;
  21   for (i = 0; i < NUM; i++)
  22     sum += x[i];
  23   printf("sum = %f\n", sum);
  24 }
- // Code the new way
  27 void new_loop() {
  28   sum = 0;
  29   ii = NUM % 4;
  30   for (i = 0; i < ii; i++)
  31     sum += x[i];
  32   for (i = ii; i < NUM; i += 4) {
  33     sum += x[i];
  34     sum += x[i+1];
  35     sum += x[i+2];
  36     sum += x[i+3];
  37   }
  38   printf("sum = %f\n", sum);
  39 }
89Language Issues Loop Unrolling
- Profile Example Line Analysis
- Line list, in descending order by time
  cycles  invocations  function: line
    4096         1024  old_loop: sum += x[i]
    2061         1024  old_loop: for (i = 0; i < NUM; i++)
     978          256  new_loop: sum += x[i+3]
     968          256  new_loop: sum += x[i+2]
     968          256  new_loop: sum += x[i+1]
     968          256  new_loop: sum += x[i]
     733          256  new_loop: for (i = ii; i < NUM; i += 4)
       7            1  new_loop: ii = NUM % 4
90Language Issues Loop Unrolling
- Issues with loop unrolling
- Code complexity
- Clutter
- Compiler may/may not do this
- Flags may affect compiler time spent optimizing
- Only thin loops gain performance
- Use application knowledge to take advantage of
loop unrolling
91Language Issues Local temporary variables
- Use local temporary variables to avoid repeatedly de-referencing a pointer structure
- Example:
  x = global_ptr->record_str->a;
  y = global_ptr->record_str->b;
- Use:
  tmp = global_ptr->record_str;
  x = tmp->a;
  y = tmp->b;
92Language Issues Using tmp vars for global vars
within a function
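void tr_point(FLOAT *old_pt, FLOAT *m, FLOAT *new_pt)
{
    FLOAT *c1, *c2, *c3, *c4, *op, *np, tmp;
    int j;

    c1 = m; c2 = m + 4; c3 = m + 8; c4 = m + 12;
    /* Accumulating in the local temporary tmp: */
    for (j = 0, np = new_pt; j < 4; j++) {
        op = old_pt;
        tmp  = *op++ * *c1++;
        tmp += *op++ * *c2++;
        tmp += *op++ * *c3++;
        *np++ = tmp + (*op * *c4++);
    }
}
/* The same loop body accumulating through the output pointer np: */
        *np  = *op++ * *c1++;
        *np += *op++ * *c2++;
        *np += *op++ * *c3++;
        *np++ += *op * *c4++;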
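A compilable reading of the temporary-variable version is sketched below. The FLOAT typedef, the const qualifiers, and the interpretation of m as 16 consecutive values are assumptions; the slide does not define them.

```c
#include <assert.h>

typedef float FLOAT;   /* assumed; the slide does not define FLOAT */

/* new_pt[j] = sum over i of old_pt[i] * m[4 * i + j], accumulated in a
 * local temporary so each output element is written exactly once. */
void tr_point(const FLOAT *old_pt, const FLOAT *m, FLOAT *new_pt)
{
    const FLOAT *c1 = m, *c2 = m + 4, *c3 = m + 8, *c4 = m + 12;
    int j;
    assert(old_pt != new_pt);   /* in-place use would overwrite inputs */
    for (j = 0; j < 4; j++) {
        const FLOAT *op = old_pt;
        FLOAT tmp;
        tmp  = *op++ * *c1++;
        tmp += *op++ * *c2++;
        tmp += *op++ * *c3++;
        tmp += *op   * *c4++;
        new_pt[j] = tmp;
    }
}
```

With m holding an OpenGL-style column-major translation by (1, 2, 3), the point (1, 1, 1, 1) maps to (2, 3, 4, 1). Keeping the running sum in tmp lets the compiler hold it in a register, since it does not have to assume that a store through new_pt might change old_pt or m.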
93Language Issues Pointer Aliasing
- Pointers are aliases when they point to potentially overlapping regions of memory
- If regions never overlap, may optimize for this case. Not possible, though, in general
- Compiler can't tell when pointers are aliased
- Use restrict keyword or compiler option
94Language Issues Pointer Aliasing
- Unaliased pointers: compilers may use
- Parallelism
- Pipelining
- Diagram: separate in and out regions (unaliased pointers) vs. overlapping in and out regions (aliased pointers)
95Language Issues Pointer Aliasing
void process_data(float * restrict in, float * restrict out, float gain)
{
    int i;
    for (i = 0; i < NSAMPS; i++)
        out[i] = in[i] * gain;
}
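A self-contained C99 variant of the function above is sketched here. Passing the sample count as a parameter instead of using the NSAMPS constant, and marking the input const, are changes made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* restrict (C99) is a promise from the caller that in and out never
 * overlap, which frees the compiler to reorder and pipeline the loads
 * and stores. Calling this with overlapping buffers is undefined. */
void process_data(const float *restrict in, float *restrict out,
                  float gain, size_t nsamps)
{
    size_t i;
    assert(in != NULL && out != NULL);
    for (i = 0; i < nsamps; i++)
        out[i] = in[i] * gain;
}
```

The keyword changes no results for correct callers; it only removes the compiler's need to assume that a store to out[i] might modify a later in[i].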
96C++ General Issues
- Language features
- RTTI, safe casts, etc.
- Use const, mutable, volatile, inline
- hints to compilers
- Object construction
- arrays, default constructors, arguments, etc.
- Method invocation issues
- operators, overloads, conversion, etc.
97C++ Virtual Functions
- Good - used to invoke child method when managing base-class handles
- Expensive - incur an additional pointer de-reference
- one, find VTBL; two, find method; invoke
- bad for caching
- Use when necessary, but not for common objects
- Good for large methods that do lots of work
- Bad for small methods, like a vertex query
98C++ Exceptions and Templates
- Exceptions
- Great for error checking
- Performance penalty
- Additional stack information required
- Templates
- Great for code re-use
- Memory penalty
- Across libraries, across object files
99Code Language Issues The End
- Balance
- Know your compiler
- Features performance
- Know your language
- Features performance
- Know your app
- Features performance
100Idioms and Application Architectures
101Starting Quote
- "The best tuned, most efficient bubble sort is still a bubble sort. Additional tweaking won't improve performance."
- Change The Algorithm!
- Commike '99
102Introduction
- To write an efficient graphics application, one must
- Understand the platform
- Use graphics efficiently
- Write good code
- Use efficient application structures and algorithms
103Outline
- Outline
- Background
- Culling
- Level of Detail (LOD) management
- Application architectures
104Application ArchitecturesRendering Path
- Application work, culling, LOD, drawing
- Pipelined rendering path
107Application ArchitecturesTarget Frame Rate
- A target frame rate attempts to bound the maximum render time
- Control culling and LOD aggressiveness
- Maintain a constant frame rate
- Achieve an acceptable interactive frame rate
108Graphics Idioms
- Culling
- Removing geometry that isn't visible
- Level of Detail Management
- Reducing geometric complexity
109Culling
- Don't draw what you can't see
110CullingCulling Types
- Use one. Use all. Pipeline them together.
- View Frustum Culling
- Backface Culling
- Contribution Culling
- Occlusion Culling
111CullingBounding Volumes
- Test against a bounding volume, not individual primitives
- Can be bounding sphere, box, oriented box, or any enclosing volume
- Hierarchical bounding volumes to reduce cull time
- Spheres are fast, boxes are more accurate
- Use a combination of both
112Culling View Frustum
- Graphics pipeline clips data that falls outside the View Frustum
- If it will be clipped, don't bother drawing
113Culling View Frustum Usefulness
- Improves geometry rate
- Culled vertices are not transformed, lit, and clipped
- Improves host download rate
- Less data moved from memory into graphics
- Does not change fill rate
- Triangles outside the View Frustum would not have been drawn anyway
114Culling View Frustum Implementation
- Transform vertices to clip coordinates (in OpenGL, multiply by Model-View and Projection matrix)
- Check each vertex against View Frustum
- Geometry is either In, Out, or Partial
- Render In and Partial
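The classification step can be sketched with clip-space outcodes. This assumes the vertices (for example the corners of a bounding box) have already been multiplied by the Model-View and Projection matrices, and that the usual OpenGL clip volume -w <= x, y, z <= w applies; the type and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CULL_IN, CULL_OUT, CULL_PARTIAL } cull_result;

/* Bit mask of the clip planes that a clip-space vertex (x, y, z, w)
 * lies outside. Zero means the vertex is inside the frustum. */
unsigned outcode(const float *v)
{
    unsigned code = 0;
    if (v[0] < -v[3]) code |= 1u;
    if (v[0] >  v[3]) code |= 2u;
    if (v[1] < -v[3]) code |= 4u;
    if (v[1] >  v[3]) code |= 8u;
    if (v[2] < -v[3]) code |= 16u;
    if (v[2] >  v[3]) code |= 32u;
    return code;
}

/* Classify `count` clip-space vertices stored as 4 floats each.
 * In: every vertex is inside. Out: every vertex is outside the same
 * plane. Partial: anything else (a conservative answer). */
cull_result classify(const float *verts, size_t count)
{
    unsigned all_and = ~0u, any_or = 0;
    size_t i;
    assert(verts != NULL && count > 0);
    for (i = 0; i < count; i++) {
        unsigned c = outcode(verts + 4 * i);
        all_and &= c;
        any_or |= c;
    }
    if (any_or == 0)
        return CULL_IN;
    if (all_and != 0)
        return CULL_OUT;
    return CULL_PARTIAL;
}
```

Geometry classified as Out is skipped; In and Partial are sent to the pipeline.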
115Culling Skip the Clip
- In software transform systems (GTX-RD) skip the clip
- Partial and In geometry classified
- Pipe renders Partial as usual
- Pipe can render In without a View Frustum clip
- Might be a hint to render
- Can improve geometry rates if not already fill-limited
116Culling Backface
- Only half of any closed polyhedron is visible at any one time
- Don't render what you can't see
117Culling Backface Usefulness
- Improves fill rate when using a native implementation
- Primitives are transformed and lit before culling
- Helps both geometry and fill with an application-specific algorithm
- More computationally expensive
- Balance graphics and CPU work
- This may not work well when you can enter closed geometry or need two-sided lighting
118Random Image
119Lava. Hot!
120Random Quote
- "Try not. Do, or do not. There is no try."
- Yoda '80
121Culling Contribution
- If it's too small to make a difference
- don't render it
122Culling Contribution Usefulness
- Improves geometry rate
- Culled vertices are not transformed, lit, and clipped
- Improves host download rate
- Less data moved from memory into graphics
- Does not change fill rate
- Screen space projection already minimal
- Removes few pixels from rasterization stage
123Culling Contribution Implementation
- Don't render items that fall below a size threshold
- Screen space size of bounding volume
- A less computational approach
- Distance to object combined with some notion of global object size
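The cheaper distance-based test might look like the sketch below. Using the ratio of bounding-sphere radius to eye distance as the "notion of global object size" is an assumption, as are the function name and the threshold parameter.

```c
#include <assert.h>

/* Returns 1 when an object is small enough on screen to skip.
 * radius / distance is a cheap stand-in for projected size; no
 * projection of the bounding volume is needed. */
int too_small_to_draw(float radius, float distance, float threshold)
{
    assert(radius >= 0.0f && threshold >= 0.0f);
    if (distance <= radius)
        return 0;   /* eye is inside or touching the volume: always draw */
    return radius / distance < threshold;
}
```

The threshold would be tuned per application, for example tightened when the frame rate drops below its target.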
124Culling Occlusion
- If you can't see it
- don't draw it
- Diagram: front and side views of the scene
125Culling Occlusion Goals
- Find the optimal set of occluders that will enable drawing the minimal number of occludees
- Occluders: the geometry that is visible
- Occludees: the geometry that is not visible
- Use general purpose occlusion culling algorithms
- Use application-specific spatial knowledge if possible
126Culling Occlusion Culling Usefulness
- Can improve both transform-limited and fill-limited applications
- Computationally expensive
- Beware of time trade-offs
- Possible hardware support
127Culling General Occlusion Culling
- Used for arbitrary scenes
- Can improve both transform-limited and fill-limited applications
- Computationally expensive for arbitrary scenes
128Culling Occlusion Spatial Partitioning
- Cell and Portal Culling
- Spatial organization leads to Cells and Portals
- Games that move from room to room
- Architectural walkthroughs
129LOD Overview
- After culling, need to draw what is left
- Still too much geometry
- Use multiple Levels of Detail, i.e. multi-resolution objects
- Match geometric complexity to visible on-screen space coverage
- Reduce geometric complexity to maintain target frame rate
130LOD Issues
- Generating LODs
- Height Fields vs 3D objects
- View-Dependent: nice, but compute intensive
- View-Independent: fast, memory intensive
- Need to decide which LOD level to use
- Not trivial!
- Need smooth transitions between levels
- Geomorphs
131LOD Height Fields
- Generally thought of as infinite terrain
- Specialized algorithms can be used
132LOD 3D Models
- General purpose simplification algorithm
- Can use on height fields also
- Some recent real-time view-dependent algorithms
- Also used for compression
133LOD When to switch LOD levels
- Ability to only generate LOD models is not sufficient
- Need to know when to use which LOD level
- Single constant hard metric: distance from eye
- Multiple heuristics: cost, benefit, rankings
- Can bias LODs to ensure frame rate targets are reached
134LODLevel determination
- Determine system rendering characteristics
- Determine cost of rendering each object
- Render objects with highest benefit while remaining under the target frame rate
- Level determination can be time consuming!
- Take the time to time the time taken to reduce the rendering time
135Going, and going, and going...
136LOD Determining cost of rendering
- Cost is affected by many factors
- Graphics hardware: published benchmarks, startup tests
- Number of vertices: primarily a function of LOD algorithm
- Rendering quality: lighting, shading, wire frame, anti-aliasing, etc.
- Global factors: total texture memory, dirty internal state
137LOD Benefit Function
- Cost alone is not good enough, need benefit also
- Rendered size of object
- Error tolerance between LOD level and reference model
- Importance in scene
- Frame-to-frame coherency
138LOD The Optimal LODs
- For all Objects, at each LOD Level, rendered with each RenderType
- Maximize the Benefit function
- Benefit(Object, Level, RenderType)
- Subject to
- Cost(Object, Level, RenderType) < TargetFrameRate
139LOD Optimal Optimizations
- Simulated Annealing
- Monte Carlo Simulations
- Simplex Searches
140LOD Optimal Optimizations
- Simulated Annealing
- Monte Carlo Simulations
- Simplex Searches
- Dude,
- Can you spare a few dozen CPUs?
141LOD Trade-offs
- Don't have enough time to run full LOD optimization problem and render the scene
- Simplify cost and benefit functions
- Simplify optimization problem into a ranking of Benefit/Cost
- Use frame-to-frame coherency
- Be sure to consider time taken to calculate LODs
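One way to turn the Benefit/Cost ranking into code is a greedy upgrade loop, sketched below. This is an illustrative heuristic, not the course's algorithm: the lod_object type, MAX_LEVELS, choose_lods, and the per-level cost and benefit numbers are all assumptions, with cost expressed as estimated render time and the budget as the time available per frame.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVELS 4

typedef struct {
    size_t levels;                 /* number of LODs, coarse (0) to fine */
    double cost[MAX_LEVELS];       /* estimated render time per level    */
    double benefit[MAX_LEVELS];    /* estimated visual value per level   */
    size_t chosen;                 /* output: selected level             */
} lod_object;

/* Start every object at its coarsest level, then keep applying the
 * upgrade with the best added benefit per added cost that still fits
 * the budget. Returns the total cost of the selection, which can
 * exceed the budget only if the coarsest levels alone already do. */
double choose_lods(lod_object *objs, size_t count, double budget)
{
    double total = 0.0;
    size_t i;
    assert(objs != NULL || count == 0);
    for (i = 0; i < count; i++) {
        assert(objs[i].levels >= 1 && objs[i].levels <= MAX_LEVELS);
        objs[i].chosen = 0;
        total += objs[i].cost[0];
    }
    for (;;) {
        size_t best = count;       /* count means "no candidate yet" */
        double best_ratio = 0.0;
        for (i = 0; i < count; i++) {
            size_t next = objs[i].chosen + 1;
            double dc, db;
            if (next >= objs[i].levels)
                continue;          /* already at the finest level */
            dc = objs[i].cost[next] - objs[i].cost[next - 1];
            db = objs[i].benefit[next] - objs[i].benefit[next - 1];
            if (dc <= 0.0 || total + dc > budget)
                continue;          /* sketch: skip free or unaffordable steps */
            if (db / dc > best_ratio) {
                best_ratio = db / dc;
                best = i;
            }
        }
        if (best == count)
            break;                 /* no affordable upgrade remains */
        total += objs[best].cost[objs[best].chosen + 1]
               - objs[best].cost[objs[best].chosen];
        objs[best].chosen++;
    }
    return total;
}
```

Frame-to-frame coherency could be exploited by starting each frame from the previous frame's selection instead of from the coarsest levels.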
142Application Architectures Multi-Threading
- More stages give more time to cull or generate LODs
- Each stage adds latency
143Application Architectures Multi-Threading
- Hard part is data synchronization
- Watch out for memory bloat
144Application Architectures Scene Graphs
- A scene graph is the basic data structure holding the description of your scene
- Cull-able, sort-able, and can contain multi-resolution objects
- Hierarchical Bounding Volumes
- Statistics gathering and timing infrastructure
- For large scenes can do memory management and database paging
145Application Architectures Trade-offs
- Quality
- Speed
- Memory
- Complexity
146Conclusion
- Most importantly - Think about balance!
147Performance Hints
148Performance HintsPipeline Management
- Avoid round trips to graphics server
- Cache own state/attribute information
- Avoid pipeline queries (e.g., glGet)
- Flush buffer efficiently (glFlush vs. glFinish)
- Reduce state changes. Sort by expense. For example, sort geometry by type (triangles, quads, etc.) and then by color
- Eliminate unused attributes
149Performance Hints Debugging
- Detect graphic errors
#ifdef DEBUG
#define GLEND() do { \
    GLenum err; \
    glEnd(); \
    err = glGetError(); \
    if (err != GL_NO_ERROR) \
        printf("%s\n", (const char *)gluErrorString(err)); \
    assert(err == GL_NO_ERROR); \
} while (0)
#else
#define GLEND() glEnd()
#endif
150Performance Hints Geometry
- Maximize data between glBegin/glEnd
- Sort geometry by type (triangle, quad, etc.) and group them together
- Find best fit for length of glBegin/glEnd pair
- Use strip primitives (GL_TRIANGLE_STRIP...) to reduce geometry data sent to the pipeline
- Avoid GL_POLYGON. Use specific geometric primitives instead (GL_TRIANGLES, GL_QUADS, etc.)
- Use GL_FASTEST with glHint calls where possible
151Performance Hints Geometry
- Use flat display lists for static geometry. Deep display lists may induce unwanted memory thrashing
- Use API matrix operations instead of your own
- Use texture to simulate complex geometry
- Use vertex arrays. Test vertex, interleaved, precompiled arrays
152Performance Hints Geometry
- Pass one normal (not 3 or 4) per flat shaded polygon
- Use a data format suitable for quick transfer to the graphics subsystem
- Disable unneeded operations (alpha blending, depth, stencil, dithering, fog, etc.)
153Performance Hints Lighting
- Reduce lighting requirements
- Use as few lights as possible
- Use directional (infinite) lighting. Use glLightfv(GL_LIGHTn, GL_POSITION, {x, y, z, 0})
- Use positional lights rather than spot lights
- Use one-sided lighting when possible (be aware of issues associated with normals)
- Don't change material properties frequently
154Performance Hints Lighting
- Use normalized normal vectors
- Supply unit length vectors
- Don't enable GL_NORMALIZE
- Don't scale using model-view matrix
- Pre-multiply geometry, if possible
155Performance Hints Visuals/Pixel Formats
- Pick the correct visual. Use hardware accelerated visuals
- Structure windows and contexts to maximize performance (app may block after context swaps)
- Put GUI elements in overlay planes to avoid unwanted graphics window refreshes
156Performance Hints Buffers
- Turn off depth buffer when possible
- Use HW accelerated off-screen buffer for backing-store
- Use stencil buffer for interactive picking and quick re-render (see course notes for full algorithm)
- Use color/depth buffer data for interactive editing of complex scenes (see course notes for full algorithm)
157Performance Hints Textures
- Be aware of texture sizes
- Reduce texture resolution
- Use texture LOD extension (OpenGL 1.2)
- Use texture objects. Create textures once
- Don't swap textures frequently, if possible
- Mosaic multiple textures into one large texture
- Sort geometry by texture
158Performance Hints Textures
- Use texture as an additional data lookup to simulate more complex data
- Lighting, geometry, color, clipping, application-space data
- Use glTexSubImage to replace part of a texture rather than creating a whole new texture
- Avoid expensive texture filter modes
- Use texture lookup tables instead of multi-channel textures
159Conclusion
- Know how your application works within the system
- Don't let caches, latencies, bandwidths, etc. slow you down
- Know how fast you can go
- Identify system performance characteristics
- Work your compiler
- Get all you can out of the hardware
160Questions and Answers