Developing Efficient Graphics Software presentation

Title: Developing Efficient Graphics Software

1
Developing Efficient Graphics Software

2
Developing Efficient Graphics Software

Intent of Course
Identify application and hardware interaction
Quantify and optimize interaction
Identify efficient software structure
Balance software and hardware system component use

3
Developing Efficient Graphics Software

Outline
1:35 Hardware and graphics architecture and performance
2:05 Software and System Performance
Break
2:55 Software profiling and performance analysis
3:20 C/C++ language issues
3:50 Graphics techniques and algorithms
4:40 Performance Hints

4
Developing Efficient Graphics Software

Speakers
Applications Consulting Engineers for SGI
optimizing, differentiating, graphics
Keith Cok, Bob Kuehne, Thomas True, Alan Commike

5
Hardware and Graphics Architecture and Performance

Bob Kuehne, SGI

6
Course Overview

Why is your application drawing so slowly?
Could actually be the graphics
Could be the data traversal
Could be something entirely different

7
Tour Guide

Platform architecture components
CPU
Memory
Graphics
Graphics performance
Measurements: triangle rate, fill rate, misc.
Reproduce and maximize

8
Bottlenecks and Balance

Bottlenecks
Find them
Eliminate them (sort of - move them around)
Balance
Understand hardware architecture
Fully utilize hardware

9
Yin and Yang

"Yin and yang are the two primal cosmic principles of the universe."
"The best state for everything in the universe is a state of harmony represented by a balance of yin and yang."
Skeptic's Dictionary -- http://skepdic.com/yinyang.html

10
Write Once Run Everywhere?

"My application ran fast on that platform! Why is this one so slow?"
Different platforms require different tuning
Different platforms implement hardware differently
Macro: architecture features
Micro: storage capacities, buffers, caches
Effect: bandwidth and latency

11
Latency and Bandwidth

Definitions
Latency: time required to communicate a unit of data
Bandwidth: data transferred per unit time
Example
Latency bottleneck
Bandwidth bottleneck
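
The two definitions combine into a rough transfer-time estimate: a fixed latency plus size divided by bandwidth. The sketch below assumes that simple linear model; the function names and any numbers fed to them are illustrative assumptions, not figures from the course.

```c
#include <assert.h>

/* Estimated seconds to move `bytes` across a link, assuming a linear
 * model: fixed per-transfer latency plus size / bandwidth.
 * transfer_time is a hypothetical helper, not part of any API. */
double transfer_time(double latency_s, double bandwidth_bytes_per_s,
                     double bytes)
{
    assert(bandwidth_bytes_per_s > 0.0);
    return latency_s + bytes / bandwidth_bytes_per_s;
}

/* Returns 1 when the fixed latency is the larger term (latency
 * bottleneck), 0 when the size / bandwidth term dominates
 * (bandwidth bottleneck). */
int is_latency_bound(double latency_s, double bandwidth_bytes_per_s,
                     double bytes)
{
    assert(bandwidth_bytes_per_s > 0.0);
    return latency_s > bytes / bandwidth_bytes_per_s;
}
```

Under this model many small transfers tend to be latency bound, while a few large ones tend to be bandwidth bound, which is one reason to batch data sent to the graphics subsystem.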

12
Platform: Software View

[Diagram: the platform as software sees it. Labels: CPU, memory, graphics, I/O, net, misc.]

13
Platform: PCI, AGP

[Diagram: PCI-based and AGP-based platform layouts. Labels: CPU, memory, glue, PCI, AGP, disk, net, graphics, I/O.]

14
Platform: UMA, Switched Hub

[Diagram: UMA and switched-hub platform layouts. Labels: CPU, memory, glue, UMA, PCI, disk, net, graphics, I/O.]

15
Platform: The Points

Why learn about hardware?
To understand how your app interacts with it
To best utilize the hardware
Potentially can use extra hardware features
Where?
Platform documentation
Talk with hardware vendor

16
CPU Overview

CPU Operation
Data transferred from main memory to registers
CPU works on data in registers
Latency
Registers: 0 (free)
Level-1 (L1) cache: 1
Level-2 (L2) cache: 10x L1
Main memory: 100x L1

[Diagram: CPU with registers (R), L1 cache, L2 cache, and main memory]

17
CPU, Cache, and Memory

Caches designed to exploit data locality
Temporal locality
Spatial locality

[Diagram: memory hierarchy. Labels: main memory, L2, L1, registers, CPU.]

18
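Memory Cache Logical Flow

[Flowchart: Is the data in a register? If so, compute. If not, is it in L1? If so, copy to register (cost 1). If not, is it in L2? If so, copy to L1 (cost 10). Otherwise copy to L2 (cost 100).]
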
19
Memory Cache Physical Flow

[Diagram: physical data path. Labels: main memory, page, L2 cache, L1 cache, registers, CPU.]

20
Memory Allocation Pools

List elements are often allocated as-needed
This leads to spatial disparity
Mitigated by use of application memory management
Bad: malloc, malloc, malloc, malloc, ...
Good: pools - pool_init, pool_alloc, ...
Graphics example
Vertices, normals, textures, etc.
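
A pool can be as simple as one large allocation carved into fixed-size elements, so that successive elements sit next to each other in memory. The sketch below is a minimal illustration under that assumption. The names pool_t, pool_init, pool_alloc, and pool_destroy echo the slide but are hypothetical, and the pool never returns individual elements, which is often acceptable for per-scene vertex data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Used only to pick a conservative element alignment. */
typedef union { long double ld; void *p; long long ll; } pool_align_t;

/* Fixed-size element pool backed by a single malloc. */
typedef struct {
    unsigned char *base;   /* start of the backing block           */
    size_t elem_size;      /* bytes per element, rounded up        */
    size_t capacity;       /* number of elements in the pool       */
    size_t used;           /* elements handed out so far           */
} pool_t;

/* Returns 1 on success, 0 if the backing allocation failed. */
int pool_init(pool_t *p, size_t elem_size, size_t capacity)
{
    size_t align = sizeof(pool_align_t);

    assert(elem_size > 0 && capacity > 0);
    /* Round up so every element starts on an aligned address. */
    p->elem_size = (elem_size + align - 1) / align * align;
    p->capacity = capacity;
    p->used = 0;
    p->base = malloc(p->elem_size * capacity);
    return p->base != NULL;
}

/* Hands out the next element, or NULL when the pool is exhausted. */
void *pool_alloc(pool_t *p)
{
    if (p->used == p->capacity)
        return NULL;
    return p->base + p->elem_size * p->used++;
}

/* Releases every element at once. */
void pool_destroy(pool_t *p)
{
    free(p->base);
    p->base = NULL;
    p->used = p->capacity = 0;
}
```

Because consecutive pool_alloc calls return adjacent addresses, walking the elements in allocation order touches memory sequentially, which is the spatial locality the caches reward.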

21
Memory Graphics! Vertex Arrays
22
Graphics Pipe
xf: world to screen
light: apply light
clip: clip to view
rast: convert to pixels
fx: apply texture, etc.
fops: test pixel ops
23
Graphics Pipe Akeley Taxonomy

G - Generate geometric data
T - Traverse data structures
X - Transform primitives world to screen
R - Rasterize triangles to pixels
D - Display framebuffer on output device

[Diagram: the pipeline stages G, T, X, R, D]

24
Graphics Hardware

Four types of hardware are common
G-TXRD: all hardware
GT-XRD
GTX-RD
GTXR-D: all software

25
Graphics Performance

Benchmarks
"Trust, but verify." - an ex-president
Definitions
Triangle rate: speed at which primitives are transformed (X)
Fill rate: speed at which primitives are rasterized (R)
Depth complexity: number of times a pixel is filled
Caveats
Quantization, fastpath

26
Graphics Quantization

Frame quantization is the result of swapbuffers occurring at the next vertical retrace.
Necessary to avoid image artifacts such as tearing
Example: 100 Hz display refresh
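
The effect can be estimated with a small calculation: if a buffer swap waits for the next retrace, a frame occupies a whole number of refresh intervals, so the frame rate is the refresh rate divided by that count. The sketch below assumes a constant render time per frame; quantized_frame_rate is a hypothetical helper, not an API call.

```c
#include <assert.h>

/* Frame rate seen by the user when buffer swaps wait for vertical
 * retrace. A frame that takes any fraction of a refresh interval
 * occupies the whole interval. */
double quantized_frame_rate(double refresh_hz, double render_time_s)
{
    double intervals_needed;
    long whole;

    assert(refresh_hz > 0.0 && render_time_s > 0.0);
    intervals_needed = render_time_s * refresh_hz;
    whole = (long)intervals_needed;        /* truncate toward zero     */
    if ((double)whole < intervals_needed)
        whole++;                           /* round up to next retrace */
    return refresh_hz / (double)whole;
}
```

On a 100 Hz display, a frame rendered in 8 ms is shown at 100 Hz, but one rendered in 12 ms drops to 50 Hz and one rendered in 21 ms to about 33 Hz, so a small increase in render time can halve the frame rate.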

27
Graphics Quantization

[Timeline figure, t0 through t7. Labels: no-sync 120 Hz, 100 Hz, 50 Hz, 50 Hz, 33 Hz.]

28
Graphics Fastpath

Definition
Fastpath: the most optimized path through graphics hardware
Example
Fast path: float verts, float norms, ABGR textures, z-test
Less fast path: float verts, float norms, RGBA textures, z-test

29
Graphics Fastpath Example
30
Graphics Fastpath Points

Fast path is often synonymous with ideal path.
Real usage of graphics falls on a continuum.
Must quantify what hardware can do
Quality vs. speed

31
Graphics Hardware Testing

Duplicate performance numbers simply
Good: build a simple test program
Better: glPerf - http://www.spec.org
Maximize performance in an app
Good: use fast API extensions
Better: create an is-fast test, use what is verified as fast

32
Graphics Hardware Is-Fast

Test each platform to determine fast path
Once, per-machine, test primitives and modes
Vertex array format, texture format, display
list, etc.
Store data in database
Detect hardware changes or time-to-live
Read data from database at startup
Check database or re-generate data

33
Graphics Hardware Is-Fast

Pseudo-code

if ( new_machine() || hardware_changed() ) {
    test_interesting_modes();
    store_in_database();
} else {
    // have database entry
    get_performance_data_from_database();
}
// use the modes and primitives that are fast when rendering

34
Think Globally, Act Locally

Think globally
Know the platform's graphics hardware
Use hardware effectively in your app
Balance hardware utilization
Act locally
Use in-cache data
Understand hardware graphics fastpaths
Balance quality vs. performance

35
Software and System Performance

Thomas J. True, SGI

36
A Four Step Process
37
Quantify

Characterize
Application Space
Primitive Types
Primitive Counts
Rendering Characteristics
Frame Rate

38
Quantify

Compare

39
Examine System Configuration

Resources
Memory
Disk
Setup
Display
Network

40
Graphics Analysis

Ideal Performance
Keep graphics pipeline full.
100% CPU utilization running application code.
100% graphics utilization.

41
Graphics Analysis

Graphics Bound

42
Graphics Analysis

Graphics Bound
Graphics subsystem processes data slower than CPU
can feed it.
Graphics subsystem issues an interrupt which
causes the CPU to stall.
Data processing within application stops until
graphics subsystem can again accept data.

43
Graphics Analysis

Geometry Limited
Limited by the rate at which vertices can be
transformed and clipped.
Fill Limited
Limited by the rate at which transformed vertices
can be rasterized.

44
Graphics Analysis

CPU Bound

45
Graphics Analysis

CPU Bound
CPU at 100% utilization but can't feed graphics fast enough.
Graphics subsystem at less than 100% utilization.
All CPU cycles consumed by data processing.

46
Graphics Analysis

Determination Techniques
Remove graphics API calls.
Shrink graphics window.
Reduce geometry processing requirements.
Use system monitoring tool.

47
Graphics Analysis
[Flowchart: start by removing graphics API calls. If there is no change in frame rate, the performance problem is not graphics; look for excessive or unexpected CPU activity. If the frame rate increases, continue with the graphics analysis.]

48
Graphics Analysis

Graphics Architecture GTXR-D

49
Graphics Analysis

Graphics Architecture GTXR-D
(aka Dumb Frame Buffer)
CPU does everything.
Typically CPU bound.
To remedy, buy a real graphics board.

50
Graphics Analysis

Graphics Architecture GTX-RD

51
Graphics Analysis

Graphics Architecture GTX-RD
Screen space operations performed by graphics.
Object-space to screen-space transform on host.
Can easily become CPU bound.
"Roughly 100 single-precision floating point operations are required to transform, light, clip test, project and map an object-space vertex to screen-space." - K. Akeley and T. Jermoluk
Beware of fast-path and slow-path issues.

52
Graphics Analysis

Graphics Architecture GTX-RD
If Graphics Bound
Reduce per-pixel operations.
Reduce depth complexity.
Use native-format data.

53
Graphics Analysis

Graphics Architecture GTX-RD
If CPU Bound
Reduce scene complexity.
Use more efficient graphics algorithms.

54
Graphics Analysis
Graphics Architecture GT-XRD
55
Graphics Analysis

Graphics Architecture GT-XRD
Transformation and rasterization performed by
graphics.
Can be CPU or graphics bound.
Beware of fast-path and slow-path issues.
Subject to host bandwidth limitations.

56
Graphics Analysis

Graphics Architecture GT-XRD
If Graphics Bound
Move lighting back to CPU.
Use native data formats within application.
Use display lists or vertex arrays.
Use less expensive lighting modes.

57
Graphics Analysis

Graphics Architecture GT-XRD
If CPU Bound
Move lighting from CPU to graphics subsystem.
Do matrix operations in graphics hardware.
Profile in search of computational performance
issues.

58
Bottleneck Elimination

Bottlenecks

59
Bottleneck Elimination

Bottlenecks
Understanding, crucial to effective tuning.
Will always exist, tune to balance.
Not always a bad thing.

60
Bottleneck Elimination

Graphics
Use native graphics formats.
Remove excessive state changes.
Package graphics primitives efficiently.
Use textures that fit in texture cache.
Don't use unnecessary rendering modes.
Decrease depth complexity.
Cull out excessive geometry.

61
Bottleneck Elimination

Memory
Don't allocate memory in rendering loop.
Avoid copying and repackaging of graphics data.
Organize graphics data.
Avoid memory fragmentation.

62
Bottleneck Elimination

Memory Bandwidth and Fragmentation

Independent Triangles: 9 vertices, 504 bytes
Triangle Strip: 5 vertices, 280 bytes
Vertex Array: 5 vertices, 280 bytes
Vertex (RGBA XYZW XYZ STR): 56 bytes

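
The byte counts follow from the vertex layout: 14 floats at 4 bytes each is 56 bytes, and three triangles need 9 vertices when drawn independently but only 5 as a strip. The sketch below reproduces that arithmetic. The vertex_t layout and helper names are assumptions for illustration, and the 56-byte figure holds only where float is 4 bytes and the struct has no padding.

```c
#include <assert.h>
#include <stddef.h>

/* One vertex as listed on the slide, all 32-bit floats. */
typedef struct {
    float rgba[4];   /* color              */
    float xyzw[4];   /* position           */
    float nxyz[3];   /* normal             */
    float str[3];    /* texture coordinate */
} vertex_t;

/* Vertices sent for n triangles drawn independently: 3 per triangle. */
size_t independent_vertex_count(size_t n_triangles)
{
    assert(n_triangles > 0);
    return 3 * n_triangles;
}

/* Vertices sent for n triangles in one strip: the first triangle
 * needs 3, each later one reuses 2 and adds 1. */
size_t strip_vertex_count(size_t n_triangles)
{
    assert(n_triangles > 0);
    return n_triangles + 2;
}

/* Bytes transferred for a given number of vertices. */
size_t vertex_bytes(size_t n_vertices)
{
    return n_vertices * sizeof(vertex_t);
}
```

For three triangles this gives 9 x 56 = 504 bytes independently and 5 x 56 = 280 bytes as a strip, matching the figures above.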
63
Bottleneck Elimination

Code and Language
Use native data types.
Avoid contention for a single shared resource.
Avoid application bottlenecks in non-graphics
code.
Reduce API call overhead.

64
Bottleneck Elimination

API Call Overhead

Independent Triangles (XYZW RGBA XYZ STR): 9 vertices, 36 function calls
Triangle Strips (XYZW RGBA XYZ STR): 5 vertices, 20 function calls
Vertex Array: 5 function calls
Display List: 1 function call

65
Conclusion

Performance Tuning an Iterative Process

66
Conclusion

It's all about balance!

67
Profiling and Performance Analysis

Keith Cok, SGI

68
Profile and Performance Analysis

Profiling points out code areas that take up most
time
Imperative for well balanced application
Points out code and system bottlenecks

69
Two Methods of Software Profiling

Basic block
A section of code that has one entry and one exit
Measures ideal time
Statistical sampling
Interrupts program execution and examines current
location
Measures actual CPU cycles spent executing a line
of code

70
How Do You Profile Code?

Compile/link with compiler optimizations turned on
cc foo.c -use_all_optimization_flags ....
Instrument the code
Unix: pixie foo.exe -> foo.exe.pixie
Visual Studio: embedded in tool suite
Run the application with relevant data sets
foo.exe.pixie -args -> produces results data file

71
Profiling Finding the Hot Spot

Function list, in descending order by exclusive ideal time

     excl.%  cum.%  instructions    calls  function (dso: file, line)
  1    10.3   10.3     190583064    11484  GL_CreateSurfaceLightmap (foo: gl_rsurf.c, 1293)
  2     8.9   19.2     173920781     3203  S_Update_ (foo: snd_dma.c, 848)
  3     8.2   27.4     145950460   338787  R_RenderBrushPoly (foo: gl_rsurf.c, 641)
  4     5.9   33.3      97798122  1975976  __sin (libm.so: sin.c, 194)
  5     4.1   37.4      82310479      240  GL_LoadTexture (foo: gl_draw.c, 990)
  6     3.4   40.8      50786176  1204269  __glMgrim_Begin (libGLcore.so: mgras_prim.c, 221)
  7     3.2   44.0      58099072    16797  R_DrawAliasModel (foo: gl_rmain.c, 232)
  8     3.1   47.1      53832546   290970  R_RecursiveWorldNode (foo: gl_rsurf.c, 894)
  9     3.1   50.2      43855299   437627  R_CullBox (foo: gl_rlight.c, 313; compiled in gl_rmain.c)
 10     2.8   53.0      44666700    30981  EmitWaterPolys (foo: gl_warp.c, 187)

72
Profiling Fixing the Hot Spot

What do you look for?
Common sub-expressions
Loop invariant code
Repeated pointer de-referencing
Global variables and cache misses
Thin loops

73
Profiling Example

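// Code the old way (old_loop starts at line 19 of the source file)
void old_loop()
{
    sum = 0;
    for (i = 0; i < NUM; i++)
        sum += x[i];
    printf("sum = %f\n", sum);
}

// Code the new way (new_loop starts at line 27 of the source file)
void new_loop()
{
    sum = 0;
    ii = NUM % 4;                 // leftover elements handled first
    for (i = 0; i < ii; i++)
        sum += x[i];
    for (i = ii; i < NUM; i += 4) {
        sum += x[i];
        sum += x[i+1];
        sum += x[i+2];
        sum += x[i+3];
    }
    printf("sum = %f\n", sum);
}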

74
Profiling Example Profile Results

     cycles  instructions  calls  function (dso: file, line)
  1    6160          6168      1  old_loop (blahdso.so: blahdso.c, 19)
  2    4869          8714      1  setup_data (blahdso.so: blahdso.c, 11)

  1    4869          8714      1  setup_data (blahdso.so: blahdso.c, 11)
  2    4625          4891      1  new_loop (blahdso.so: blahdso.c, 27)

75
Profile Example Line Analysis

Line list, in descending order by time
------------------------------------------------------
 cycles  invocations  function: line
   4096         1024  old_loop: sum += x[i]
   2061         1024  old_loop: for (i = 0; i < NUM; i++)
    978          256  new_loop: sum += x[i+3]
    968          256  new_loop: sum += x[i+2]
    968          256  new_loop: sum += x[i+1]
    968          256  new_loop: sum += x[i]
    733          256  new_loop: for (i = ii; i < NUM; i += 4)
      7            1  new_loop: ii = NUM % 4

76
Profile and Performance Analysis

Profile Example: Visual C++/Intel

 Function   Percent of   Hit
 Time (s)   Run Time     Count   Function
 ---------------------------------------------
    0.410       39.4         1   _old_loop
    0.249       23.9         1   _new_loop

77
Statistical vs. Basic Block Profile

void ijk_loop()    // loops kji and ikj as well
{
    sum = 0;
    for (i = 0; i < YNUM; i++)
        for (j = 0; j < YNUM; j++)
            for (k = 0; k < YNUM; k++)
                sum += y[i][j][k];
    printf("sum = %f\n", sum);
}

78
Basic Block vs. Statistical Sampling

Basic Block

     percent    cycles      inst  calls  function (file, line)
  1     25.3  51141434  37101028      1  ijk_loop (foo.c, 47)
  2     25.3  51141434  37101028      1  kji_loop (foo.c, 57)
  3     25.3  51141434  37101028      1  ikj_loop (foo.c, 66)

Statistical Sampling

     percent  samples  procedure (file, line)
  1     38.0     2700  kji_loop (foo.c, 57)
  2     23.9     1700  setup_data (foo.c, 15)
  3     19.7     1400  ikj_loop (foo.c, 66)
  4     18.3     1300  ijk_loop (foo.c, 47)

79
Now We Know About Hot Spots...

What do we do next?
Use compilers to fine-tune code
Use knowledge of language to optimize
Hand-tune code
Profiling is fun, hard, and iterative and it can
be highly effective

80
Compiler and Language Issues

Keith Cok, SGI
Bob Kuehne, SGI

81
Compiler and Language Issues

Compiler Optimizations
Occur within a compromise of
speed and memory space
vs.
time to compile and link
An iterative process to discover what does and doesn't work
Important to keep at it

82
Compiler Issues Trade-Offs

Trade-offs
Round-off vs. needed precision
Inter-procedural analysis vs. link time
Pointer aliasing vs. coding constraints
Optimizing for processor architectures vs. work
of multiple binaries (support, test)
Explore other compilers than your first choice
Different source code - different flags

83
Compiler and Language Issues

Comments on 32 vs. 64 bit code
Benefits of 64 bit code
Increased address space
Higher precision
Downsides of 64 bit code
Application memory footprint
Need to port which can be difficult!
Performance issues

84
Language Issues

Data Management
Unrolling loops
Arrays
Temporary variables
Pointer aliasing

85
Language Issues Data Management

Manipulate data structures efficiently since
graphics IS data
struct str {                    struct str {
    struct str *next;               struct str *next;
    struct str *prev;               struct str *prev;
    large_type  foo;                int         key;
    int         key;                large_type  foo;
} str;                          } str;

86
Language Issues Data Management

Pack data efficiently
struct foo {                            struct foo_better {
    char  aa;  // 8 bits + 24 pad           float bb;  // 32 bits
    float bb;  // 32 bits                   char  aa;  // 8 bits
    char  cc;  // 8 bits + 24 pad           char  cc;  // 8 bits
    float dd;  // 32 bits                   char  ee;  // 8 bits + 8 pad
    char  ee;  // 8 bits + 24 pad           float dd;  // 32 bits
} foo_t;       // 160 bits              } foo_t;       // 96 bits

87
Language Issues Data Management

Examine your arrays and note their caching
behavior
Break up large arrays into smaller sub-arrays for
better memory access patterns
Understand the implications of data layout and
cache behavior

88
Language Issues Loop Unrolling

Profiling Example
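// Code the old way (old_loop starts at line 19 of the source file)
void old_loop()
{
    sum = 0;
    for (i = 0; i < NUM; i++)
        sum += x[i];
    printf("sum = %f\n", sum);
}

// Code the new way (new_loop starts at line 27 of the source file)
void new_loop()
{
    sum = 0;
    ii = NUM % 4;                 // leftover elements handled first
    for (i = 0; i < ii; i++)
        sum += x[i];
    for (i = ii; i < NUM; i += 4) {
        sum += x[i];
        sum += x[i+1];
        sum += x[i+2];
        sum += x[i+3];
    }
    printf("sum = %f\n", sum);
}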

89
Language Issues Loop Unrolling

Profile Example Line Analysis
Line list, in descending order by time
------------------------------------------------------
 cycles  invocations  function: line
   4096         1024  old_loop: sum += x[i]
   2061         1024  old_loop: for (i = 0; i < NUM; i++)
    978          256  new_loop: sum += x[i+3]
    968          256  new_loop: sum += x[i+2]
    968          256  new_loop: sum += x[i+1]
    968          256  new_loop: sum += x[i]
    733          256  new_loop: for (i = ii; i < NUM; i += 4)
      7            1  new_loop: ii = NUM % 4

90
Language Issues Loop Unrolling

Issues with loop unrolling
Code complexity
Clutter
Compiler may/may not do this
Flags may affect compiler time spent optimizing
Only thin loops gain performance
Use application knowledge to take advantage of
loop unrolling

91
Language Issues Local temporary variables

Use local temporary variables to avoid repeatedly
de-referencing a pointer structure
Example
x = global_ptr->record_str->a;
y = global_ptr->record_str->b;
Use
tmp = global_ptr->record_str;
x = tmp->a;
y = tmp->b;

92
Language Issues Using tmp vars for global vars
within a function

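void tr_point(FLOAT *old_pt, FLOAT *m, FLOAT *new_pt)
{
    FLOAT *c1, *c2, *c3, *c4, *op, *np, tmp;
    int j;
    c1 = m; c2 = m + 4; c3 = m + 8; c4 = m + 12;

    // Using a local temporary            // Writing through np every step
    for (j = 0, np = new_pt; j < 4; j++) {    for (j = 0, np = new_pt; j < 4; j++) {
        op = old_pt;                              op = old_pt;
        tmp  = *op++ * *c1++;                     *np    = *op++ * *c1++;
        tmp += *op++ * *c2++;                     *np   += *op++ * *c2++;
        tmp += *op++ * *c3++;                     *np   += *op++ * *c3++;
        *np++ = tmp + (*op * *c4++);              *np++ += *op   * *c4++;
    }                                         }
}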

93
Language Issues Pointer Aliasing

Pointers are aliases when they point to
potentially overlapping regions of memory
If regions never overlap, may optimize for this
case. Not possible, though, in general
Compiler can't tell when pointers are aliased
Use the restrict keyword or a compiler option

94
Language Issues Pointer Aliasing
Unaliased pointers: compilers may use
- Parallelism
- Pipelining
[Diagram: the in and out buffers, shown for unaliased and for aliased pointers]

95
Language Issues Pointer Aliasing

void process_data(float * restrict in,
                  float * restrict out,
                  float gain)
{
    int i;
    for (i = 0; i < NSAMPS; i++)
        out[i] = in[i] * gain;
}

96
C++ General Issues

Language features
RTTI, safe casts, etc.
Use const, mutable, volatile, inline
hints to compilers
Object construction
arrays, default constructors, arguments, etc.
Method invocation issues
operators, overloads, conversion, etc.

97
C++ Virtual Functions

Good - used to invoke child method when managing
base-class handles
Expensive - incur an additional pointer
de-reference
one, find VTBL, two, find method, invoke
bad for caching
Use when necessary, but not for common objects
Good for large methods that do lots of work
Bad for small methods, like a vertex query

98
C++ Exceptions and Templates

Exceptions
Great for error checking
Performance penalty
Additional stack information required
Templates
Great for code re-use
Memory penalty
Across libraries, across object files

99
Code and Language Issues: The End

Balance
Know your compiler
Features and performance
Know your language
Features and performance
Know your app
Features and performance

100
Idioms and Application Architectures

Alan Commike, SGI

101
Starting Quote

"The best tuned, most efficient bubble sort is still a bubble sort. Additional tweaking won't improve performance.
Change The Algorithm!"
- Commike '99

102
Introduction

To write an efficient graphics application, one
must
Understand the platform
Use graphics efficiently
Write good code
Use efficient application structures and
algorithms

103
Outline

Outline
Background
Culling
Level of Detail (LOD) management
Application architectures

104
Application Architectures: Rendering Path

Application work, culling, LOD, drawing
Pipelined rendering path

105
Application Architectures: Rendering Path

Application work, culling, LOD, drawing
Pipelined rendering path

106
Application Architectures: Rendering Path

Application work, culling, LOD, drawing
Pipelined rendering path

107
Application Architectures: Target Frame Rate

A target frame rate attempts to bound the maximum
render time
Control Culling and LOD aggressiveness
Maintain a constant frame rate
Achieve an acceptable interactive frame rate

108
Graphics Idioms

Culling
Removing geometry that isn't visible
Level of Detail Management
Reducing geometric complexity

109
Culling

Don't draw what you can't see

110
Culling: Culling Types

Use one. Use all. Pipeline them together.
View Frustum Culling
Backface Culling
Contribution Culling
Occlusion Culling

111
Culling: Bounding Volumes

Test against a bounding volume, not individual primitives
Can be bounding sphere, box, oriented box, or any
enclosing volume
Hierarchical bounding volumes to reduce cull time
Spheres are fast, boxes are more accurate
Use a combination of both

112
Culling View Frustum

Graphics pipeline clips data that falls outside
the View Frustum
If it will be clipped, don't bother drawing

113
Culling View Frustum Usefulness

Improves geometry rate
Culled vertices are not transformed, lit, and
clipped
Improves host download rate
Less data moved from memory into graphics
Does not change fill rate
Triangles outside the View Frustum would not have
been drawn anyway

114
Culling View Frustum Implementation

Transform vertices to clip coordinates (in OpenGL
multiply by Model-View and Projection matrix)
Check each vertex against View Frustum
Geometry is either In, Out, or Partial
Render In and Partial
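
The In, Out, or Partial test can be sketched with per-vertex outcodes. In clip coordinates a vertex is inside the frustum when -w <= x, y, z <= w. The code below assumes the caller has already multiplied the corners of a convex bounding volume by the model-view and projection matrices; the type and function names are hypothetical. The test is conservative: a volume whose corners lie outside different planes is reported as Partial even if it is actually invisible.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CULL_OUT, CULL_PARTIAL, CULL_IN } cull_result_t;

/* Bit i is set when the clip-space vertex (x, y, z, w) lies outside
 * frustum plane i. */
unsigned clip_outcode(const float v[4])
{
    unsigned code = 0;
    if (v[0] < -v[3]) code |= 1u;     /* left   */
    if (v[0] >  v[3]) code |= 2u;     /* right  */
    if (v[1] < -v[3]) code |= 4u;     /* bottom */
    if (v[1] >  v[3]) code |= 8u;     /* top    */
    if (v[2] < -v[3]) code |= 16u;    /* near   */
    if (v[2] >  v[3]) code |= 32u;    /* far    */
    return code;
}

/* Classify n clip-space vertices, e.g. the 8 corners of a bounding box. */
cull_result_t classify_clip_vertices(const float (*verts)[4], size_t n)
{
    unsigned any_out = 0, all_out = ~0u;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        unsigned code = clip_outcode(verts[i]);
        any_out |= code;    /* planes that some vertex is outside  */
        all_out &= code;    /* planes that every vertex is outside */
    }
    if (all_out != 0)
        return CULL_OUT;        /* everything beyond a single plane */
    if (any_out == 0)
        return CULL_IN;         /* everything inside all six planes */
    return CULL_PARTIAL;
}
```

Geometry classified CULL_OUT is skipped; CULL_IN and CULL_PARTIAL are rendered.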

115
Culling Skip the Clip

In software transform systems (GTX-RD) skip the
clip
Partial and In geometry classified
Pipe renders Partial as usual
Pipe can render In without a View Frustum clip
Might be a hint to render
Can improve geometry rates if not already
fill-limited

116
Culling Backface

Only half of any closed polyhedron is visible at
any one time
Don't render what you can't see

117
Culling Backface Usefulness

Improves fill rate when using a native
implementation
Primitives are transformed and lit before culling
Helps both geometry and fill with an application
specific algorithm
More computationally expensive
Balance graphics and CPU work
This may not work well when you can enter closed
geometry or need two-sided lighting

118
Random Image
119
Lava. Hot!

120
Random Quote

"Try not. Do, or do not. There is no try."
- Yoda '80

121
Culling Contribution

If it's too small to make a difference,
don't render it

122
Culling Contribution Usefulness

Improves geometry rate
Culled vertices are not transformed, lit, and
clipped
Improves host download rate
Less data moved from memory into graphics
Does not change fill rate
Screen space projection already minimal
Removes few pixels from rasterization stage

123
Culling Contribution Implementation

Don't render items that fall below a size threshold
Screen space size of bounding volume
A less computational approach
Distance to object combined with some notion of
global object size
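
The cheaper distance-based approach might look like the sketch below. It uses bounding-sphere radius divided by eye distance as a stand-in for projected size and compares squared values to avoid a square root. The function name and the min_ratio threshold are assumptions to be tuned per application, not values from the course.

```c
#include <assert.h>

/* Returns 1 when an object is too small on screen to be worth drawing.
 * radius / distance approximates the projected size; the comparison is
 * done on squared values so no sqrt or divide is needed. */
int contribution_cull(const float center[3], float radius,
                      const float eye[3], float min_ratio)
{
    float dx = center[0] - eye[0];
    float dy = center[1] - eye[1];
    float dz = center[2] - eye[2];
    float dist2 = dx * dx + dy * dy + dz * dz;

    assert(radius >= 0.0f && min_ratio >= 0.0f);
    return radius * radius < min_ratio * min_ratio * dist2;
}
```

With min_ratio = 0.05, an object of radius 1 at distance 100 is culled, while an object of radius 10 at the same distance is kept.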

124
Culling Occlusion

If you can't see it,
don't draw it

[Figure: front and side views]
125
Culling Occlusion Goals

Find the optimal set of occluders that will
enable drawing the minimal number of occludees
Occluders: the geometry that is visible
Occludees: the geometry that is not visible
Use general purpose occlusion culling algorithms
Use application specific spatial knowledge if
possible

126
Culling Occlusion Culling Usefulness

Can improve both transform-limited and
fill-limited applications
Computationally expensive
Beware of time trade-offs
Possible hardware support

127
Culling General Occlusion Culling

Used for arbitrary scenes
Can improve both transform limited and fill
limited applications
Computationally expensive for arbitrary scenes

128
Culling Occlusion Spatial Partitioning

Cell and Portal Culling
Spatial organization leads to Cells and Portals
Games that move from room to room
Architectural walkthroughs

129
LOD Overview

After culling, need to draw what is left
Still too much geometry
Use multiple Levels of Detail, i.e., multi-resolution objects
Match geometric complexity to visible on-screen
space coverage
Reduce geometric complexity to maintain target
frame rate

130
LOD Issues

Generating LODs
Height Fields vs 3D objects
View-Dependent: nice, but compute intensive
View-Independent: fast, memory intensive
Need to decide which LOD level to use
Not trivial!
Need smooth transitions between levels
Geomorphs

131
LOD Height Fields

Generally thought of as infinite terrain
Specialized algorithms can be used

132
LOD 3D Models

General purpose simplification algorithm
Can use on height fields also
Some recent real-time view-dependent algorithms
Also used for compression

133
LOD When to switch LOD levels

Ability to only generate LOD models is not
sufficient
Need to know when to use which LOD level
Single constant hard metric: distance from eye
Multiple heuristics: cost, benefit, rankings
Can bias LODs to ensure frame rate targets are
reached
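
The simplest metric, distance from the eye with a bias, could be sketched as follows. The switch-distance table, the bias parameter, and the function name are illustrative assumptions: level 0 is the most detailed model, and a bias above 1 makes objects switch to coarser levels sooner when the frame rate target is being missed.

```c
#include <assert.h>

/* Pick an LOD level from the distance to the eye.
 * switch_dist holds n_levels - 1 ascending distances; beyond
 * switch_dist[i] the next coarser level (i + 1) is used.
 * bias scales the effective distance: > 1 favors coarser levels,
 * < 1 favors finer levels. */
int select_lod(const float *switch_dist, int n_levels,
               float distance, float bias)
{
    int level = 0;

    assert(n_levels >= 1 && bias > 0.0f);
    while (level < n_levels - 1 && distance * bias > switch_dist[level])
        level++;
    return level;
}
```

A frame-rate controller can raise the bias when the last frame ran long and lower it again when there is time to spare.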

134
LOD: Level determination

Determine system rendering characteristics
Determine cost of rendering each object
Render objects with highest benefit while
remaining under the target frame rate
Level determination can be time consuming!
take the time to time the time taken to reduce
the rendering time

135
Going, and going, and going...
136
LOD Determining cost of rendering

Cost is affected by many factors
Graphics hardware: published benchmarks, startup tests
Number of vertices: primarily a function of LOD algorithm
Rendering quality: lighting, shading, wire frame, anti-aliasing, etc.
Global factors: total texture memory, dirty internal state

137
LOD Benefit Function

Cost alone is not good enough, need benefit also
Rendered size of object
Error tolerance between LOD level and reference
model
Importance in scene
Frame-to-frame coherency

138
LOD The Optimal LODs

For all Objects, at each LOD Level, rendered with
each RenderType
Maximize the Benefit function
Benefit(Object, Level, RenderType)
Subject to
Cost(Object, Level, RenderType) < TargetFrameRate

139
LOD Optimal Optimizations

Simulated Annealing
Monte Carlo Simulations
Simplex Searches

140
LOD Optimal Optimizations

Simulated Annealing
Monte Carlo Simulations
Simplex Searches
Dude,
Can you spare a few dozen CPUs?

141
LOD Trade-offs

Don't have enough time to run full LOD optimization problem and render the scene
Simplify cost and benefit functions
Simplify optimization problem into a ranking of
Benefit/Cost
Use frame-to-frame coherency
Be sure to consider time taken to calculate LODs
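
One way to turn the optimization into a Benefit/Cost ranking is a greedy pass: start every object at its coarsest level, then repeatedly take the upgrade with the best extra benefit per extra cost that still fits the frame budget. The sketch below assumes cost and benefit tables are already available per object and level; lod_object_t, MAX_LEVELS, and choose_lod_levels are hypothetical names. It is a heuristic, not an optimal solver.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVELS 4

/* Hypothetical per-object LOD table; level 0 is the coarsest. */
typedef struct {
    int   n_levels;              /* 1..MAX_LEVELS                       */
    float cost[MAX_LEVELS];      /* estimated render time of each level */
    float benefit[MAX_LEVELS];   /* estimated image value of each level */
    int   level;                 /* output: level chosen for this frame */
} lod_object_t;

/* Returns the total cost of the chosen levels. If even the coarsest
 * levels exceed the budget, the result is larger than budget. */
float choose_lod_levels(lod_object_t *objs, size_t n, float budget)
{
    float spent = 0.0f;
    size_t i;

    for (i = 0; i < n; i++) {
        assert(objs[i].n_levels >= 1 && objs[i].n_levels <= MAX_LEVELS);
        objs[i].level = 0;
        spent += objs[i].cost[0];
    }
    for (;;) {
        float best_ratio = 0.0f;
        size_t best = n;                    /* n means "no upgrade found" */

        for (i = 0; i < n; i++) {
            int l = objs[i].level;
            float dc, db;

            if (l + 1 >= objs[i].n_levels)
                continue;                   /* already at finest level */
            dc = objs[i].cost[l + 1] - objs[i].cost[l];
            db = objs[i].benefit[l + 1] - objs[i].benefit[l];
            if (dc > 0.0f && spent + dc <= budget && db / dc > best_ratio) {
                best_ratio = db / dc;
                best = i;
            }
        }
        if (best == n)
            break;
        spent += objs[best].cost[objs[best].level + 1]
               - objs[best].cost[objs[best].level];
        objs[best].level++;
    }
    return spent;
}
```

Frame-to-frame coherency can cut the work further by starting from the previous frame's levels instead of from the coarsest ones.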

142
Application Architectures Multi-Threading

More stages give more time to cull or generate
LODs
Each stage adds latency

143
Application Architectures Multi-Threading

Hard part is data synchronization
Watch out for memory bloat

144
Application Architectures Scene Graphs

A scene graph is the basic data structure holding the description of your scene
Cull-able, sort-able, and can contain
multi-resolution objects
Hierarchical Bounding Volumes
Statistics gathering and timing infrastructure
For large scenes can do memory management and
database paging

145
Application Architectures Trade-offs

Quality
Speed
Memory
Complexity

146
Conclusion

Most importantly - Think about balance!

147
Performance Hints

Keith Cok, SGI

148
Performance Hints: Pipeline Management

Avoid round trips to graphics server
Cache own state/attribute information
Avoid pipeline queries (e.g., glGet)
Flush buffer efficiently (glFlush vs. glFinish)
Reduce state changes. Sort by expense. For
example, sort geometry by type (triangles, quads,
etc) and then by color
Eliminate unused attributes
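
Sorting by expense can be done with a two-key comparison, putting the costlier state change first. The sketch below follows the slide's example and sorts by primitive type, then by color. draw_item_t and its fields are assumptions for illustration; which state change is most expensive depends on the hardware and should be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical draw record: one batch of geometry and its state. */
typedef struct {
    int      prim_type;   /* e.g. 0 = triangles, 1 = quads */
    unsigned color;       /* packed RGBA                   */
    int      mesh_id;     /* which geometry to draw        */
} draw_item_t;

/* Order by primitive type first, then by color. */
int compare_state(const void *a, const void *b)
{
    const draw_item_t *x = a, *y = b;

    if (x->prim_type != y->prim_type)
        return (x->prim_type > y->prim_type) - (x->prim_type < y->prim_type);
    return (x->color > y->color) - (x->color < y->color);
}

void sort_by_state(draw_item_t *items, size_t n)
{
    qsort(items, n, sizeof *items, compare_state);
}

/* Number of (type, color) changes incurred by drawing items in order. */
size_t count_state_changes(const draw_item_t *items, size_t n)
{
    size_t i, changes = 0;

    assert(items != NULL || n == 0);
    for (i = 1; i < n; i++)
        if (items[i].prim_type != items[i - 1].prim_type ||
            items[i].color != items[i - 1].color)
            changes++;
    return changes;
}
```

Comparing count_state_changes before and after sort_by_state gives a quick measure of how many state changes the sort removes.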

149
Performance Hints Debugging

Detect graphic errors
#ifdef DEBUG
#define GLEND() glEnd(); \
    { \
        int err; \
        err = glGetError(); \
        if (err != GL_NO_ERROR) \
            printf("%s\n", gluErrorString(err)); \
        assert(err == GL_NO_ERROR); \
    }
#else
#define GLEND() glEnd()
#endif

150
Performance Hints Geometry

Maximize data between glBegin/glEnd
Sort geometry by type (triangle, quad, etc.) and
group them together
Find best fit for length of glBegin/glEnd pair
Use stripped primitives (GL_TRIANGLE_STRIP...) to
reduce geometry data sent to the pipeline
Avoid GL_POLYGON. Use specific geometric primitives instead (GL_TRIANGLES, GL_QUADS, etc.)
Use GL_FASTEST with glHint calls where possible

151
Performance Hints Geometry

Use flat display lists for static geometry. Deep
display lists may induce unwanted memory
thrashing
Use API matrix operations instead of your own
Use texture to simulate complex geometry
Use vertex arrays. Test vertex, interleaved,
precompiled arrays

152
Performance Hints Geometry

Pass one normal (not 3 or 4) per flat shaded
polygon
Use a data format suitable for quick transfer to
the graphics subsystem
Disable unneeded operations (alpha blending,
depth, stencil, blending, dithering, fog, etc.)

153
Performance Hints Lighting

Reduce lighting requirements
Use as few lights as possible
Use directional (infinite) lighting. Use
glLightfv(GL_LIGHTn, GL_POSITION, {x, y, z, 0})
Use positional lights rather than spot lights
Use one-sided lighting when possible (be aware of
issues associated with normals)
Don't change material properties frequently

154
Performance Hints Lighting

Use normalized normal vectors
Supply unit length vectors
Don't enable GL_NORMALIZE
Don't scale using model-view matrix
Pre-multiply geometry, if possible

155
Performance Hints Visuals/Pixel Formats

Pick the correct visual. Use hardware accelerated
visuals
Structure windows and contexts to maximize
performance (app may block after context swaps)
Put GUI elements in overlay planes to avoid
unwanted graphics window refreshes

156
Performance Hints Buffers

Turn off depth buffer when possible
Use HW accelerated off-screen buffer for
backing-store
Use stencil buffer for interactive picking and
quick re-render (see course notes for full
algorithm)
Use color/depth buffer data for interactive
editing of complex scenes (see course notes for
full algorithm)

157
Performance Hints Textures

Be aware of texture sizes
Reduce texture resolution
Use texture LOD extension (OpenGL 1.2)
Use texture objects. Create textures once
Don't swap textures frequently, if possible
Mosaic multiple textures into one large texture
Sort geometry by texture

158
Performance Hints Textures

Use texture as an additional data lookup to
simulate more complex data
Lighting, geometry, color, clipping,
application-space data
Use glTexSubImage to replace part of a texture
rather than creating a whole new texture
Avoid expensive texture filter modes
Use texture lookup tables instead of
multi-channel textures

159
Conclusion

Know how your application works within the system
Don't let caches, latencies, bandwidths, etc. slow you down
Know how fast you can go
Identify system performance characteristics
Work your compiler
Get all you can out of the hardware

160
Questions and Answers
