Developing Efficient Graphics Software - PowerPoint PPT Presentation

About This Presentation
Title:

Developing Efficient Graphics Software


Slides: 160
Provided by: keit82

Transcript and Presenter's Notes

Title: Developing Efficient Graphics Software


1
Developing Efficient Graphics Software

2
Developing Efficient Graphics Software
  • Intent of Course
  • Identify application and hardware interaction
  • Quantify and optimize interaction
  • Identify efficient software structure
  • Balance software and hardware system component use

3
Developing Efficient Graphics Software
  • Outline
  • 1:35 Hardware and graphics architecture and
    performance
  • 2:05 Software and System Performance
  • Break
  • 2:55 Software profiling and performance analysis
  • 3:20 C/C++ language issues
  • 3:50 Graphics techniques and algorithms
  • 4:40 Performance Hints

4
Developing Efficient Graphics Software
  • Speakers
  • Applications Consulting Engineers for SGI
  • optimizing, differentiating, graphics
  • Keith Cok, Bob Kuehne, Thomas True, Alan Commike

5
Hardware and Graphics Architecture and Performance
  • Bob Kuehne, SGI

6
Course Overview
  • Why is your application drawing so slowly?
  • Could actually be the graphics
  • Could be the data traversal
  • Could be something entirely different

7
Tour Guide
  • Platform architecture components
  • CPU
  • Memory
  • Graphics
  • Graphics performance
  • Measurements: triangle rate, fill rate, misc.
  • Reproduce & maximize

8
Bottlenecks & Balance
  • Bottlenecks
  • Find them
  • Eliminate them (sort of - move them around)
  • Balance
  • Understand hardware architecture
  • Fully utilize hardware

9
Yin & Yang
  • "Yin and yang are the two primal cosmic
    principles of the universe"
  • "The best state for everything in the universe is
    a state of harmony represented by a balance of
    yin and yang."
  • Skeptic's Dictionary -- http://skepdic.com/yinyang.html

10
Write Once Run Everywhere?
  • My application ran fast on that platform! Why is
    this one so slow?
  • Different platforms require different tuning
  • Different platforms implement hardware
    differently
  • Macro: architecture features
  • Micro: storage capacities, buffers, caches
  • Effect: bandwidth & latency

11
Latency & Bandwidth
  • Definitions
  • Latency: time required to communicate a unit of
    data
  • Bandwidth: data transferred per unit time
  • Example
  • Latency bottleneck
  • Bandwidth bottleneck

12
Platform: Software View
[Diagram: the platform as software sees it, with blocks for graphics, CPU, I/O, memory, net, and misc.]
13
Platform: PCI, AGP
[Diagrams: two platform layouts, one PCI-based and one AGP-based. Labels: CPU, memory, glue, PCI, AGP, disk, net, graphics, I/O.]
14
Platform: UMA, Switched Hub
[Diagrams: two platform layouts, UMA and switched hub. Labels: CPU, memory, glue, UMA, PCI, disk, net, graphics, I/O.]
15
Platform The Points
  • Why learn about hardware?
  • To understand how your app interacts with it
  • To best utilize the hardware
  • Potentially can use extra hardware features
  • Where?
  • Platform documentation
  • Talk with hardware vendor

16
CPU Overview
  • CPU Operation
  • Data transferred from main memory to registers
  • CPU works on data in registers
  • Latency
  • Registers: 0 (free)
  • Level-1 (L1) cache: 1
  • Level-2 (L2) cache: 10x L1
  • Main memory: 100x L1

[Diagram: CPU, registers (R), L1, L2, main memory]
17
CPU, Cache, and Memory
  • Caches designed to exploit data locality
  • Temporal locality
  • Spatial locality

[Diagram: main memory, L2, L1, registers, CPU]
18
Memory Cache Logical Flow
[Flowchart: In register? If not, in L1? If not, in L2? If not, copy to L2 (cost 100), then copy to L1 (cost 10), then copy to register (cost 1), then compute.]
19
Memory Cache Physical Flow
[Diagram labels: Main Memory, Page, L2 Cache, L1 Cache, Registers, CPU]
20
Memory Allocation Pools
  • List elements are often allocated as-needed
  • This leads to spatial disparity
  • Mitigated by use of application memory management
  • Bad: malloc, malloc, malloc, malloc, ...
  • Good: pools - pool_init, pool_alloc, ...
  • Graphics example
  • Vertices, normals, textures, etc.
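
The pool idea above can be sketched in a few lines of C. This is a minimal illustration, not the course's implementation: the slide names only `pool_init` and `pool_alloc`, so the signatures, the `pool_t` struct, and `pool_destroy` are assumptions. It makes one `malloc` call and then hands out adjacent slots, so consecutive elements (vertices, normals, and so on) sit next to each other in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size pool: one malloc, then contiguous hand-outs. */
typedef struct {
    unsigned char *base;   /* start of the single contiguous block */
    size_t elem_size;      /* bytes per element (use sizeof(T))    */
    size_t capacity;       /* number of elements the pool can hold */
    size_t used;           /* elements handed out so far           */
} pool_t;

/* Returns 0 on success, -1 if the underlying allocation fails. */
int pool_init(pool_t *p, size_t elem_size, size_t capacity)
{
    assert(p != NULL && elem_size > 0 && capacity > 0);
    p->base = malloc(elem_size * capacity);
    p->elem_size = elem_size;
    p->capacity = capacity;
    p->used = 0;
    return p->base ? 0 : -1;
}

/* Hands out the next slot, or NULL when the pool is exhausted. */
void *pool_alloc(pool_t *p)
{
    if (p->used == p->capacity)
        return NULL;
    return p->base + p->elem_size * p->used++;
}

/* Releases the whole pool at once. */
void pool_destroy(pool_t *p)
{
    free(p->base);
    p->base = NULL;
    p->used = p->capacity = 0;
}
```

Because every slot comes from the same block, walking the elements in allocation order touches memory sequentially, which is the spatial locality the cache slides describe.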

21
Memory Graphics! Vertex Arrays
22
Graphics Pipe
  • xf: world to screen
  • light: apply light
  • clip: clip to view
  • rast: convert to pixels
  • fx: apply texture, etc.
  • fops: test pixel ops
23
Graphics Pipe Akeley Taxonomy
  • G - Generate geometric data
  • T - Traverse data structures
  • X - Transform primitives world to screen
  • R - Rasterize triangles to pixels
  • D - Display framebuffer on output device

[Diagram: pipeline stages G, T, X, R, D]
24
Graphics Hardware
  • 4 types of hardware are common
  • G-TXRD: all hardware
  • GT-XRD
  • GTX-RD
  • GTXR-D: all software

25
Graphics Performance
  • Benchmarks
  • "Trust, but verify." - an ex-president
  • Definitions
  • Triangle rate: speed at which primitives are
    transformed (X)
  • Fill rate: speed at which primitives are
    rasterized (R)
  • Depth complexity: number of times a pixel is filled
  • Caveats
  • Quantization, fastpath

26
Graphics Quantization
  • Frame quantization is the result of swapbuffers
    occurring at the next vertical retrace.
  • Necessary to avoid image artifacts such as
    tearing
  • Example: 100 Hz display refresh
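
A small worked example may make the quantization effect concrete. The sketch below assumes that a buffer swap always waits for the next vertical retrace, so a frame occupies a whole number of refresh periods; the function names are illustrative only. Times are kept in integer microseconds to avoid rounding surprises.

```c
#include <assert.h>

/* Number of whole refresh periods one frame occupies when the swap
 * waits for the next vertical retrace. */
unsigned periods_per_frame(unsigned render_us, unsigned refresh_hz)
{
    assert(refresh_hz > 0);
    unsigned period_us = 1000000u / refresh_hz;            /* 10000 us at 100 Hz */
    unsigned n = (render_us + period_us - 1) / period_us;  /* round up */
    return n ? n : 1;                                      /* at least one period */
}

/* Displayed frame rate in Hz after quantization. */
double quantized_hz(unsigned render_us, unsigned refresh_hz)
{
    return (double)refresh_hz / periods_per_frame(render_us, refresh_hz);
}
```

On a 100 Hz display, a frame rendered in 8.3 ms is shown at 100 Hz, while frames taking 11 ms or 19 ms are both shown at 50 Hz, and a 21 ms frame drops to about 33 Hz. A small increase in render time can therefore halve the displayed rate.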

27
Graphics Quantization
[Timeline figure: swap times t0 through t7, with rows labeled no-sync 120 Hz, 100 Hz, 50 Hz, 50 Hz, and 33 Hz]
28
Graphics Fastpath
  • Definition
  • Fastpath: the most optimized path through
    graphics hardware
  • Example
  • Fast path: float verts, float norms, ABGR
    textures, z-test
  • Less fast path: float verts, float norms, RGBA
    textures, z-test

29
Graphics Fastpath Example
30
Graphics Fastpath Points
  • Fast path is often synonymous with ideal path.
  • Real usage of graphics falls on a continuum.
  • Must quantify what hardware can do
  • Quality speed

31
Graphics Hardware Testing
  • Duplicate performance numbers simply
  • Good: build a simple test program
  • Better: glPerf - http://www.spec.org
  • Maximize performance in an app
  • Good: use fast API extensions
  • Better: create an is-fast test, use what is
    verified as fast

32
Graphics Hardware Is-Fast
  • Test each platform to determine fast path
  • Once, per-machine, test primitives and modes
  • Vertex array format, texture format, display
    list, etc.
  • Store data in database
  • Detect hardware changes or time-to-live
  • Read data from database at startup
  • Check database or re-generate data

33
Graphics Hardware Is-Fast
  • Pseudo-code

    if ( new_machine() || hardware_changed() ) {
        test_interesting_modes();
        store_in_database();
    } else {
        // have database entry
        get_performance_data_from_database();
    }
    // use the modes & primitives that are fast when rendering
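
The "detect hardware changes or time-to-live" decision from the pseudo-code can be sketched as a pure function. Everything here is an assumption for illustration: the `isfast_entry_t` record, its fields, and the function name are hypothetical, and a real is-fast database would also store the measured timings for each mode.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* Hypothetical record stored per machine after running the is-fast tests. */
typedef struct {
    char   hardware_id[64];  /* e.g. a renderer string captured at test time */
    time_t tested_at;        /* when the tests were last run */
    double ttl_seconds;      /* how long the results are trusted */
    int    valid;            /* 0 = no entry yet (new machine) */
} isfast_entry_t;

/* Returns 1 when the performance tests should be re-run, 0 when the
 * stored results can be used as-is. */
int isfast_needs_retest(const isfast_entry_t *e, const char *current_hw,
                        time_t now)
{
    assert(e != NULL && current_hw != NULL);
    if (!e->valid)
        return 1;                                   /* new machine */
    if (strcmp(e->hardware_id, current_hw) != 0)
        return 1;                                   /* hardware changed */
    if (difftime(now, e->tested_at) > e->ttl_seconds)
        return 1;                                   /* entry expired */
    return 0;
}
```

At startup the application would call this once, run the mode tests only when it returns 1, and otherwise read the stored results.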
34
Think Globally, Act Locally
  • Think globally
  • Know the platform's graphics hardware
  • Use hardware effectively in your app
  • Balance hardware utilization
  • Act locally
  • Use in-cache data
  • Understand hardware graphics fastpaths
  • Balance quality vs. performance

35
Software and System Performance
  • Thomas J. True, SGI

36
A Four Step Process
37
Quantify
  • Characterize
  • Application Space
  • Primitive Types
  • Primitive Counts
  • Rendering Characteristics
  • Frame Rate

38
Quantify
  • Compare

39
Examine System Configuration
  • Resources
  • Memory
  • Disk
  • Setup
  • Display
  • Network

40
Graphics Analysis
  • Ideal Performance
  • Keep graphics pipeline full.
  • 100% CPU utilization running application code.
  • 100% graphics utilization.

41
Graphics Analysis
  • Graphics Bound

42
Graphics Analysis
  • Graphics Bound
  • Graphics subsystem processes data slower than CPU
    can feed it.
  • Graphics subsystem issues an interrupt which
    causes the CPU to stall.
  • Data processing within application stops until
    graphics subsystem can again accept data.

43
Graphics Analysis
  • Geometry Limited
  • Limited by the rate at which vertices can be
    transformed and clipped.
  • Fill Limited
  • Limited by the rate at which transformed vertices
    can be rasterized.

44
Graphics Analysis
  • CPU Bound

45
Graphics Analysis
  • CPU Bound
  • CPU at 100% utilization but can't feed graphics
    fast enough.
  • Graphics subsystem at less than 100% utilization.
  • All CPU cycles consumed by data processing.

46
Graphics Analysis
  • Determination Techniques
  • Remove graphics API calls.
  • Shrink graphics window.
  • Reduce geometry processing requirements.
  • Use system monitoring tool.

47
Graphics Analysis
[Flowchart: Start, then remove graphics API calls. Branches: frame rate increase; no change in frame rate, in which case the performance problem is not graphics (excessive or unexpected CPU activity).]
48
Graphics Analysis
  • Graphics Architecture GTXR-D

49
Graphics Analysis
  • Graphics Architecture GTXR-D
  • (aka Dumb Frame Buffer)
  • CPU does everything.
  • Typically CPU bound.
  • To remedy, buy a real graphics board.

50
Graphics Analysis
  • Graphics Architecture GTX-RD

51
Graphics Analysis
  • Graphics Architecture GTX-RD
  • Screen space operations performed by graphics.
  • Object-space to screen-space transform on host.
  • Can easily become CPU bound.
  • "Roughly 100 single-precision floating point
    operations are required to transform, light, clip
    test, project and map an object-space vertex to
    screen-space." - K. Akeley & T. Jermoluk
  • Beware of fast-path and slow-path issues.

52
Graphics Analysis
  • Graphics Architecture GTX-RD
  • If Graphics Bound
  • Reduce per-pixel operations.
  • Reduce depth complexity.
  • Use native-format data.

53
Graphics Analysis
  • Graphics Architecture GTX-RD
  • If CPU Bound
  • Reduce scene complexity.
  • Use more efficient graphics algorithms.

54
Graphics Analysis
Graphics Architecture GT-XRD
55
Graphics Analysis
  • Graphics Architecture GT-XRD
  • Transformation and rasterization performed by
    graphics.
  • Can be CPU or graphics bound.
  • Beware of fast-path and slow-path issues.
  • Subject to host bandwidth limitations.

56
Graphics Analysis
  • Graphics Architecture GT-XRD
  • If Graphics Bound
  • Move lighting back to CPU.
  • Use native data formats within application.
  • Use display lists or vertex arrays.
  • Use less expensive lighting modes.

57
Graphics Analysis
  • Graphics Architecture GT-XRD
  • If CPU Bound
  • Move lighting from CPU to graphics subsystem.
  • Do matrix operations in graphics hardware.
  • Profile in search of computational performance
    issues.

58
Bottleneck Elimination
  • Bottlenecks

59
Bottleneck Elimination
  • Bottlenecks
  • Understanding, crucial to effective tuning.
  • Will always exist, tune to balance.
  • Not always a bad thing.

60
Bottleneck Elimination
  • Graphics
  • Use native graphics formats.
  • Remove excessive state changes.
  • Package graphics primitives efficiently.
  • Use textures that fit in texture cache.
  • Don't use unnecessary rendering modes.
  • Decrease depth complexity.
  • Cull out excessive geometry.

61
Bottleneck Elimination
  • Memory
  • Don't allocate memory in rendering loop.
  • Avoid copying and repackaging of graphics data.
  • Organize graphics data.
  • Avoid memory fragmentation.

62
Bottleneck Elimination
  • Memory Bandwidth and Fragmentation

Primitive               Vertices   Bytes
Independent triangles   9          504
Triangle strip          5          280
Vertex array            5          280
Vertex layout: RGBA + XYZW + XYZ + STR = 56 bytes
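
The byte counts in the table can be reproduced with a short calculation. The sketch below assumes 4-byte floats and reads the vertex layout as color, position, normal, and texture coordinate; the struct and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* One vertex as listed on the slide: RGBA + XYZW + XYZ + STR. */
typedef struct {
    float color[4];     /* RGBA */
    float position[4];  /* XYZW */
    float normal[3];    /* XYZ  */
    float texcoord[3];  /* STR  */
} vertex_t;

/* Independent triangles repeat every shared vertex. */
size_t vertices_independent(size_t triangles) { return 3 * triangles; }

/* A strip adds one vertex per triangle after the first two. */
size_t vertices_strip(size_t triangles) { return triangles + 2; }

size_t bytes_for(size_t vertices) { return vertices * sizeof(vertex_t); }

/* Sanity check that the compiler added no padding (14 floats). */
void check_vertex_layout(void) { assert(sizeof(vertex_t) == 14 * sizeof(float)); }
```

For three triangles this gives 9 vertices (504 bytes) when sent independently and 5 vertices (280 bytes) as a strip, matching the slide.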
63
Bottleneck Elimination
  • Code and Language
  • Use native data types.
  • Avoid contention for a single shared resource.
  • Avoid application bottlenecks in non-graphics
    code.
  • Reduce API call overhead.

64
Bottleneck Elimination
  • API Call Overhead

Primitive (XYZW, RGBA, XYZ, STR per vertex)   Vertices   Function calls
Independent triangles                         9          36
Triangle strips                               5          20
Vertex array                                  -          5
Display list                                  -          1
65
Conclusion
  • Performance Tuning an Iterative Process

66
Conclusion
  • It's all about balance!

67
Profiling and Performance Analysis
  • Keith Cok, SGI

68
Profile and Performance Analysis
  • Profiling points out code areas that take up most
    time
  • Imperative for well balanced application
  • Points out code and system bottlenecks

69
Two Methods of Software Profiling
  • Basic block
  • A section of code that has one entry and one exit
  • Measures ideal time
  • Statistical sampling
  • Interrupts program execution and examines current
    location
  • Measures actual CPU cycles spent executing a line
    of code

70
How Do You Profile Code?
  • Compile/link with compiler optimizations turned
    on
  • cc foo.c -use_all_optimization_flags ....
  • Instrument the code
  • Unix: pixie foo.exe -> foo.exe.pixie
  • Visual Studio: embedded in tool suite
  • Run the application with relevant data sets
  • foo.exe.pixie - args -> produces results data
    file

71
Profiling Finding the Hot Spot
  • Function list, in descending order by exclusive
    ideal time

      excl.%  cum.%  instructions    calls  function (dso: file, line)
   1   10.3   10.3     190583064    11484  GL_CreateSurfaceLightmap (foo: gl_rsurf.c, 1293)
   2    8.9   19.2     173920781     3203  S_Update_ (foo: snd_dma.c, 848)
   3    8.2   27.4     145950460   338787  R_RenderBrushPoly (foo: gl_rsurf.c, 641)
   4    5.9   33.3      97798122  1975976  __sin (libm.so: sin.c, 194)
   5    4.1   37.4      82310479      240  GL_LoadTexture (foo: gl_draw.c, 990)
   6    3.4   40.8      50786176  1204269  __glMgrim_Begin (libGLcore.so: mgras_prim.c, 221)
   7    3.2   44.0      58099072    16797  R_DrawAliasModel (foo: gl_rmain.c, 232)
   8    3.1   47.1      53832546   290970  R_RecursiveWorldNode (foo: gl_rsurf.c, 894)
   9    3.1   50.2      43855299   437627  R_CullBox (foo: gl_rlight.c, 313; compiled in gl_rmain.c)
  10    2.8   53.0      44666700    30981  EmitWaterPolys (foo: gl_warp.c, 187)

72
Profiling Fixing the Hot Spot
  • What do you look for?
  • Common sub-expressions
  • Loop invariant code
  • Repeated pointer de-referencing
  • Global variables and cache misses
  • Thin loops

73
Profiling Example
    // Code the old way
    19 void old_loop() {
    20   sum = 0;
    21   for (i = 0; i < NUM; i++)
    22     sum += x[i];
    23   printf("sum = %f\n", sum);
    24 }

    // Code the new way
    27 void new_loop() {
    28   sum = 0;
    29   ii = NUM % 4;
    30   for (i = 0; i < ii; i++)
    31     sum += x[i];
    32   for (i = ii; i < NUM; i += 4) {
    33     sum += x[i];
    34     sum += x[i+1];
    35     sum += x[i+2];
    36     sum += x[i+3];
    37   }
    38   printf("sum = %f\n", sum);
    39 }

74
Profiling Example Profile Results
     cycles  instructions  calls  function (dso: file, line)
  1    6160          6168      1  old_loop (blahdso.so: blahdso.c, 19)
  2    4869          8714      1  setup_data (blahdso.so: blahdso.c, 11)

  1    4869          8714      1  setup_data (blahdso.so: blahdso.c, 11)
  2    4625          4891      1  new_loop (blahdso.so: blahdso.c, 27)

75
Profile Example Line Analysis
  • Line list, in descending order by time

    cycles  invocations  function  line
      4096         1024  old_loop  sum += x[i];
      2061         1024  old_loop  for (i = 0; i < NUM; i++)
       978          256  new_loop  sum += x[i+3];
       968          256  new_loop  sum += x[i+2];
       968          256  new_loop  sum += x[i+1];
       968          256  new_loop  sum += x[i];
       733          256  new_loop  for (i = ii; i < NUM; i += 4)
         7            1  new_loop  ii = NUM % 4;

76
Profile and Performance Analysis
  • Profile Example: Visual C++/Intel

    Function Time (s)  Percent of Run Time  Hit Count  Function
                0.410                 39.4          1  _old_loop
                0.249                 23.9          1  _new_loop

77
Statistical vs. Basic Block Profile
  • void ijk_loop()    // loops kji and ikj as well
    {
      sum = 0;
      for (i = 0; i < YNUM; i++)
        for (j = 0; j < YNUM; j++)
          for (k = 0; k < YNUM; k++)
            sum += y[i][j][k];
      printf("sum = %f\n", sum);
    }

78
Basic Block vs. Statistical Sampling
  • Basic Block

       Percent    cycles      inst  calls  function (file, line)
    1     25.3  51141434  37101028      1  ijk_loop (foo.c, 47)
    2     25.3  51141434  37101028      1  kji_loop (foo.c, 57)
    3     25.3  51141434  37101028      1  ikj_loop (foo.c, 66)

  • Statistical Sampling

       Percent  Samples  Procedure (file, line)
    1     38.0     2700  kji_loop (foo.c, 57)
    2     23.9     1700  setup_data (foo.c, 15)
    3     19.7     1400  ikj_loop (foo.c, 66)
    4     18.3     1300  ijk_loop (foo.c, 47)
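
The two profiles differ because the loop variants execute the same instructions but touch memory in different orders. The sketch below is an illustrative reconstruction, not the course's test program: the array size and function names are assumptions. Both functions compute the same sum, but `sum_ijk` walks the array in the order C stores it, while `sum_kji` jumps a full plane between consecutive accesses.

```c
#include <assert.h>
#include <stddef.h>

#define YNUM 32
static double y[YNUM][YNUM][YNUM];

/* Fill the array with a known value so the sums are easy to check. */
void fill_y(double value)
{
    size_t i, j, k;
    for (i = 0; i < YNUM; i++)
        for (j = 0; j < YNUM; j++)
            for (k = 0; k < YNUM; k++)
                y[i][j][k] = value;
}

/* Innermost index varies fastest: consecutive addresses, cache friendly. */
double sum_ijk(void)
{
    double sum = 0.0;
    size_t i, j, k;
    for (i = 0; i < YNUM; i++)
        for (j = 0; j < YNUM; j++)
            for (k = 0; k < YNUM; k++)
                sum += y[i][j][k];
    return sum;
}

/* Outermost index varies fastest: each access strides YNUM*YNUM doubles. */
double sum_kji(void)
{
    double sum = 0.0;
    size_t i, j, k;
    for (k = 0; k < YNUM; k++)
        for (j = 0; j < YNUM; j++)
            for (i = 0; i < YNUM; i++)
                sum += y[i][j][k];
    return sum;
}
```

A basic-block profiler counts identical work for both functions, which is why it reports equal ideal time. Only sampling, which measures elapsed time including cache misses, shows the strided version as the hot spot.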

79
Now We Know About Hot Spots...
  • What do we do next?
  • Use compilers to fine-tune code
  • Use knowledge of language to optimize
  • Hand-tune code
  • Profiling is fun, hard, and iterative and it can
    be highly effective

80
Compiler and Language Issues
  • Keith Cok, SGI
  • Bob Kuehne, SGI

81
Compiler and Language Issues
  • Compiler Optimizations
  • Occur within a compromise of
  • speed and memory space
  • vs.
  • time to compile and link
  • An iterative process to discover what does and
    doesn't work
  • Important to keep at it

82
Compiler Issues Trade-Offs
  • Trade-offs
  • Round-off vs. needed precision
  • Inter-procedural analysis vs. link time
  • Pointer aliasing vs. coding constraints
  • Optimizing for processor architectures vs. work
    of multiple binaries (support, test)
  • Explore other compilers than your first choice
  • Different source code - different flags

83
Compiler and Language Issues
  • Comments on 32 vs. 64 bit code
  • Benefits of 64 bit code
  • Increased address space
  • Higher precision
  • Downsides of 64 bit code
  • Application memory footprint
  • Need to port which can be difficult!
  • Performance issues

84
Language Issues
  • Data Management
  • Unrolling loops
  • Arrays
  • Temporary variables
  • Pointer aliasing

85
Language Issues Data Management
  • Manipulate data structures efficiently since
    graphics IS data

    struct str {                  struct str {
      struct str *next;             struct str *next;
      struct str *prev;             struct str *prev;
      large_type foo;               int key;
      int key;                      large_type foo;
    } str;                        } str;
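
One way to see the difference between the two layouts is to look at where `key` lands. In the sketch below the 256-byte `large_type` is a stand-in chosen for illustration, and the struct names are hypothetical. With the key placed right after the links, a list search that follows `next` and compares `key` stays within the first few bytes of each node.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { char bytes[256]; } large_type;  /* stand-in payload */

/* Key stored after the large payload: far from the links. */
struct node_far {
    struct node_far *next;
    struct node_far *prev;
    large_type       foo;
    int              key;
};

/* Key stored right after the links: likely on the same cache line. */
struct node_near {
    struct node_near *next;
    struct node_near *prev;
    int               key;
    large_type        foo;
};

/* Distance in bytes from the start of a node to its key. */
size_t key_offset_far(void)  { return offsetof(struct node_far, key); }
size_t key_offset_near(void) { return offsetof(struct node_near, key); }
```

Whether the near layout actually fits a single cache line depends on the pointer size and the platform's line size, so it is worth checking on the target machine.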

86
Language Issues Data Management
  • Pack data efficiently

    struct foo {                          struct foo_better {
      char  aa;  // 8 bits + 24 pad         float bb;  // 32 bits
      float bb;  // 32 bits                 char  aa;  // 8 bits
      char  cc;  // 8 bits + 24 pad         char  cc;  // 8 bits
      float dd;  // 32 bits                 char  ee;  // 8 bits + 8 pad
      char  ee;  // 8 bits + 24 pad         float dd;  // 32 bits
    } foo_t;     // 160 bits              } foo_t;     // 96 bits
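
The padding figures on the slide can be checked with `sizeof`. The exact numbers depend on the compiler and ABI, so the sketch below only assumes that `float` has a stricter alignment than `char`; on a typical platform with 4-byte floats it reports 20 bytes (160 bits) against 12 bytes (96 bits).

```c
#include <assert.h>
#include <stddef.h>

/* Members interleaved: each char is followed by alignment padding. */
struct foo {
    char  aa;
    float bb;
    char  cc;
    float dd;
    char  ee;
};

/* Chars grouped together so they share one padded slot. */
struct foo_better {
    float bb;
    char  aa;
    char  cc;
    char  ee;
    float dd;
};

size_t foo_size(void)        { return sizeof(struct foo); }
size_t foo_better_size(void) { return sizeof(struct foo_better); }
```

For an array of a million such records, the reordered struct saves several megabytes and lets more records fit in each cache line.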

87
Language Issues Data Management
  • Examine your arrays and note their caching
    behavior
  • Break up large arrays into smaller sub-arrays for
    better memory access patterns
  • Understand the implications of data layout and
    cache behavior

88
Language Issues Loop Unrolling
  • Profiling Example
    // Code the old way
    19 void old_loop() {
    20   sum = 0;
    21   for (i = 0; i < NUM; i++)
    22     sum += x[i];
    23   printf("sum = %f\n", sum);
    24 }

    // Code the new way
    27 void new_loop() {
    28   sum = 0;
    29   ii = NUM % 4;
    30   for (i = 0; i < ii; i++)
    31     sum += x[i];
    32   for (i = ii; i < NUM; i += 4) {
    33     sum += x[i];
    34     sum += x[i+1];
    35     sum += x[i+2];
    36     sum += x[i+3];
    37   }
    38   printf("sum = %f\n", sum);
    39 }

89
Language Issues Loop Unrolling
  • Profile Example Line Analysis
  • Line list, in descending order by time

    cycles  invocations  function  line
      4096         1024  old_loop  sum += x[i];
      2061         1024  old_loop  for (i = 0; i < NUM; i++)
       978          256  new_loop  sum += x[i+3];
       968          256  new_loop  sum += x[i+2];
       968          256  new_loop  sum += x[i+1];
       968          256  new_loop  sum += x[i];
       733          256  new_loop  for (i = ii; i < NUM; i += 4)
         7            1  new_loop  ii = NUM % 4;

90
Language Issues Loop Unrolling
  • Issues with loop unrolling
  • Code complexity
  • Clutter
  • Compiler may/may not do this
  • Flags may affect compiler time spent optimizing
  • Only thin loops gain performance
  • Use application knowledge to take advantage of
    loop unrolling
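
A self-contained version of the unrolled loop may help when experimenting with a compiler's own unrolling. The sketch below takes the array and its length as parameters instead of using globals, which is a change from the slide code; the function names are illustrative. The short first loop handles the leftover `n % 4` elements so the main loop can always step by four.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop: one add and one loop test per element. */
double sum_simple(const double *x, size_t n)
{
    double sum = 0.0;
    size_t i;
    assert(x != NULL || n == 0);
    for (i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Unrolled by four: one loop test per four adds. */
double sum_unrolled(const double *x, size_t n)
{
    double sum = 0.0;
    size_t i, head = n % 4;
    assert(x != NULL || n == 0);
    for (i = 0; i < head; i++)       /* leftover elements first */
        sum += x[i];
    for (i = head; i < n; i += 4) {  /* remaining count is a multiple of 4 */
        sum += x[i];
        sum += x[i + 1];
        sum += x[i + 2];
        sum += x[i + 3];
    }
    return sum;
}
```

Both functions add the elements in the same order, so they return identical results. Whether the unrolled form is faster depends on the compiler and flags, which is why the slides recommend profiling both.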

91
Language Issues Local temporary variables
  • Use local temporary variables to avoid repeatedly
    de-referencing a pointer structure
  • Example
  • x = global_ptr->record_str->a;
  • y = global_ptr->record_str->b;
  • Use
  • tmp = global_ptr->record_str;
  • x = tmp->a;
  • y = tmp->b;

92
Language Issues Using tmp vars for global vars
within a function
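  • void tr_point(FLOAT *old_pt, FLOAT *m, FLOAT *new_pt)
    {
      FLOAT *c1, *c2, *c3, *c4, *op, *np, tmp;
      c1 = m; c2 = m+4; c3 = m+8; c4 = m+12;

      // Accumulate in tmp                  // Accumulate through np
      for (j=0, np=new_pt; j<4; j++) {      for (j=0, np=new_pt; j<4; j++) {
        op = old_pt;                          op = old_pt;
        tmp  = *op++ * *c1++;                 *np  = *op++ * *c1++;
        tmp += *op++ * *c2++;                 *np += *op++ * *c2++;
        tmp += *op++ * *c3++;                 *np += *op++ * *c3++;
        *np++ = tmp + (*op * *c4++);          *np++ += *op * *c4++;
      }                                     }
    }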
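
The same idea can be written with array indexing, which may be easier to read. This is a sketch under assumptions: it treats `old_pt` as a row vector multiplied by a 4x4 matrix stored as 16 consecutive floats, which is one reading of the pointer arithmetic on the slide, and it uses plain `float` in place of the `FLOAT` type. The local `tmp` can live in a register, so each output component is stored exactly once.

```c
#include <assert.h>
#include <stddef.h>

/* new_pt = old_pt * m, with m stored as 16 consecutive floats. */
void tr_point(const float *old_pt, const float *m, float *new_pt)
{
    int j;
    assert(old_pt != NULL && m != NULL && new_pt != NULL);
    for (j = 0; j < 4; j++) {
        float tmp;                     /* local accumulator */
        tmp  = old_pt[0] * m[j];
        tmp += old_pt[1] * m[4 + j];
        tmp += old_pt[2] * m[8 + j];
        tmp += old_pt[3] * m[12 + j];
        new_pt[j] = tmp;               /* single store per component */
    }
}
```

Accumulating directly into `new_pt[j]` would force the compiler to assume that each store might change `old_pt` or `m`, so it would reload them after every step. The temporary removes that dependency.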

93
Language Issues Pointer Aliasing
  • Pointers are aliases when they point to
    potentially overlapping regions of memory
  • If regions never overlap, may optimize for this
    case. Not possible, though, in general
  • Compiler can't tell when pointers are aliased
  • Use restrict key word or compiler option

94
Language Issues Pointer Aliasing
  • Unaliased pointers: compilers may use
  • Parallelism
  • Pipelining
[Diagram: in and out buffers, shown once for unaliased pointers and once for aliased pointers]
95
Language Issues Pointer Aliasing
  • void process_data(float * restrict in,
                      float * restrict out,
                      float gain)
    {
      int i;
      for (i = 0; i < NSAMPS; i++)
        out[i] = in[i] * gain;
    }
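
A compilable variant of the slide's function is sketched below. It uses the C99 `restrict` qualifier and passes the sample count as a parameter instead of the `NSAMPS` constant, which is a change made here so the example stands alone. The qualifier is a promise from the caller that `in` and `out` never overlap; the compiler does not verify it.

```c
#include <assert.h>
#include <stddef.h>

/* Scales n samples from in to out. The restrict qualifiers tell the
 * compiler the two buffers do not overlap, so it may reorder or
 * vectorize the loads and stores. */
void process_data(const float * restrict in, float * restrict out,
                  size_t n, float gain)
{
    size_t i;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++)
        out[i] = in[i] * gain;
}
```

Calling this with overlapping buffers is undefined behavior, so the promise has to hold at every call site.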

96
C++ General Issues
  • Language features
  • RTTI, safe casts, etc.
  • Use const, mutable, volatile, inline
  • hints to compilers
  • Object construction
  • arrays, default constructors, arguments, etc.
  • Method invocation issues
  • operators, overloads, conversion, etc.

97
C++ Virtual Functions
  • Good - used to invoke child method when managing
    base-class handles
  • Expensive - incur an additional pointer
    de-reference
  • one, find VTBL, two, find method, invoke
  • bad for caching
  • Use when necessary, but not for common objects
  • Good for large methods that do lots of work
  • Bad for small methods, like a vertex query

98
C++ Exceptions & Templates
  • Exceptions
  • Great for error checking
  • Performance penalty
  • Additional stack information required
  • Templates
  • Great for code re-use
  • Memory penalty
  • Across libraries, across object files

99
Code Language Issues The End
  • Balance
  • Know your compiler
  • Features performance
  • Know your language
  • Features performance
  • Know your app
  • Features performance

100
Idioms and Application Architectures
  • Alan Commike, SGI

101
Starting Quote
  • The best tuned most efficient bubble sort is
    still a bubble sort. Additional tweaking won't
    improve performance.
  • Change The Algorithm!

  • - Commike 99

102
Introduction
  • To write an efficient graphics application, one
    must
  • Understand the platform
  • Use graphics efficiently
  • Write good code
  • Use efficient application structures and
    algorithms

103
Outline
  • Outline
  • Background
  • Culling
  • Level of Detail (LOD) management
  • Application architectures

104
Application Architectures: Rendering Path
  • Application work, culling, LOD, drawing
  • Pipelined rendering path

107
Application Architectures: Target Frame Rate
  • A target frame rate attempts to bound the maximum
    render time
  • Control Culling and LOD aggressiveness
  • Maintain a constant frame rate
  • Achieve an acceptable interactive frame rate

108
Graphics Idioms
  • Culling
  • Removing geometry that isn't visible
  • Level of Detail Management
  • Reducing geometric complexity

109
Culling
  • Don't draw what you can't see

110
Culling: Culling Types
  • Use one. Use all. Pipeline them together.
  • View Frustum Culling
  • Backface Culling
  • Contribution Culling
  • Occlusion Culling

111
Culling: Bounding Volumes
  • Test against a bounding volume not individual
    primitives
  • Can be bounding sphere, box, oriented box, or any
    enclosing volume
  • Hierarchical bounding volumes to reduce cull time
  • Spheres are fast, boxes are more accurate
  • Use a combination of both

112
Culling View Frustum
  • Graphics pipeline clips data that falls outside
    the View Frustum
  • If it will be clipped, don't bother drawing

113
Culling View Frustum Usefulness
  • Improves geometry rate
  • Culled vertices are not transformed, lit, and
    clipped
  • Improves host download rate
  • Less data moved from memory into graphics
  • Does not change fill rate
  • Triangles outside the View Frustum would not have
    been drawn anyway

114
Culling View Frustum Implementation
  • Transform vertices to clip coordinates (in OpenGL
    multiply by Model-View and Projection matrix)
  • Check each vertex against View Frustum
  • Geometry is either In, Out, or Partial
  • Render In and Partial
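
The In / Out / Partial classification can be sketched against a bounding sphere rather than individual vertices, combining this slide with the bounding-volume slide. The sketch assumes the frustum is already available as six planes with unit-length normals pointing inward; the type and function names are hypothetical, and extracting the planes from the model-view and projection matrices is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Plane a*x + b*y + c*z + d = 0 with unit normal (a, b, c) pointing
 * into the frustum, so a positive signed distance means inside. */
typedef struct { float a, b, c, d; } plane_t;
typedef struct { float x, y, z, radius; } sphere_t;
typedef enum { CULL_OUT, CULL_PARTIAL, CULL_IN } cull_result_t;

/* Classifies a bounding sphere against six frustum planes. */
cull_result_t classify_sphere(const plane_t planes[6], const sphere_t *s)
{
    cull_result_t result = CULL_IN;
    int i;
    assert(planes != NULL && s != NULL);
    for (i = 0; i < 6; i++) {
        float dist = planes[i].a * s->x + planes[i].b * s->y
                   + planes[i].c * s->z + planes[i].d;
        if (dist < -s->radius)
            return CULL_OUT;        /* entirely behind one plane */
        if (dist < s->radius)
            result = CULL_PARTIAL;  /* straddles this plane */
    }
    return result;
}
```

The test is conservative: a sphere near a frustum corner can be reported as Partial even though it is outside, which costs some wasted drawing but never drops visible geometry.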

115
Culling Skip the Clip
  • In software transform systems (GTX-RD) skip the
    clip
  • Partial and In geometry classified
  • Pipe renders Partial as usual
  • Pipe can render In without a View Frustum clip
  • Might be a hint to render
  • Can improve geometry rates if not already
    fill-limited

116
Culling Backface
  • Only half of any closed polyhedron is visible at
    any one time
  • Don't render what you can't see

117
Culling Backface Usefulness
  • Improves fill rate when using a native
    implementation
  • Primitives are transformed and lit before culling
  • Helps both geometry and fill with an application
    specific algorithm
  • More computationally expensive
  • Balance graphics and CPU work
  • This may not work well when you can enter closed
    geometry or need two-sided lighting

118
Random Image
119
Lava. Hot!

120
Random Quote
  • Try not. Do, or do not. There is no try.

  • - Yoda 80

121
Culling Contribution
  • If it's too small to make a difference
  • don't render it

122
Culling Contribution Usefulness
  • Improves geometry rate
  • Culled vertices are not transformed, lit, and
    clipped
  • Improves host download rate
  • Less data moved from memory into graphics
  • Does not change fill rate
  • Screen space projection already minimal
  • Removes few pixels from rasterization stage

123
Culling Contribution Implementation
  • Don't render items that fall below a size
    threshold
  • Screen space size of bounding volume
  • A less computational approach
  • Distance to object combined with some notion of
    global object size
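
The cheaper distance-based test might look like the sketch below. It assumes a perspective projection in which projected size shrinks in proportion to distance; `pixels_at_unit_distance` is a hypothetical scale factor that an application would derive from its field of view and window size, and the function name is illustrative.

```c
#include <assert.h>

/* Returns 1 when an object's estimated on-screen diameter falls below
 * min_pixels, using only its bounding radius and distance from the eye. */
int too_small_to_draw(float radius, float distance,
                      float pixels_at_unit_distance, float min_pixels)
{
    assert(distance > 0.0f);
    float projected = 2.0f * radius * pixels_at_unit_distance / distance;
    return projected < min_pixels;
}
```

With a scale of 500 pixels at unit distance and a 2-pixel threshold, an object of radius 1 is drawn at distance 10 (about 100 pixels across) but skipped at distance 1000 (about 1 pixel).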

124
Culling Occlusion
  • If you can't see it
  • don't draw it

[Figure panels: Front, Side]
125
Culling Occlusion Goals
  • Find the optimal set of occluders that will
    enable drawing the minimal number of occludees
  • Occluders: the geometry that is visible
  • Occludees: the geometry that is not visible
  • Use general purpose occlusion culling algorithms
  • Use application specific spatial knowledge if
    possible

126
Culling Occlusion Culling Usefulness
  • Can improve both transform-limited and
    fill-limited applications
  • Computationally expensive
  • Beware of time trade-offs
  • Possible hardware support

127
Culling General Occlusion Culling
  • Used for arbitrary scenes
  • Can improve both transform limited and fill
    limited applications
  • Computationally expensive for arbitrary scenes

128
Culling Occlusion Spatial Partitioning
  • Cell and Portal Culling
  • Spatial organization leads to Cells and Portals
  • Games that move from room to room
  • Architectural walkthroughs

129
LOD Overview
  • After culling, need to draw what is left
  • Still too much geometry
  • Use multiple Levels of Detail, i.e.
    multi-resolution objects
  • Match geometric complexity to visible on-screen
    space coverage
  • Reduce geometric complexity to maintain target
    frame rate

130
LOD Issues
  • Generating LODs
  • Height Fields vs 3D objects
  • View-dependent: nice, but compute intensive
  • View-independent: fast, memory intensive
  • Need to decide which LOD level to use
  • Not trivial!
  • Need smooth transitions between levels
  • Geomorphs

131
LOD Height Fields
  • Generally thought of as infinite terrain
  • Specialized algorithms can be used

132
LOD 3D Models
  • General purpose simplification algorithm
  • Can use on height fields also
  • Some recent real-time view-dependent algorithms
  • Also used for compression

133
LOD When to switch LOD levels
  • Ability to only generate LOD models is not
    sufficient
  • Need to know when to use which LOD level
  • Single constant hard metric: distance from eye
  • Multiple heuristics: cost, benefit, rankings
  • Can bias LODs to ensure frame rate targets are
    reached

134
LOD: Level determination
  • Determine system rendering characteristics
  • Determine cost of rendering each object
  • Render objects with highest benefit while
    remaining under the target frame rate
  • Level determination can be time consuming!
  • take the time to time the time taken to reduce
    the rendering time

135
Going, and going, and going...
136
LOD Determining cost of rendering
  • Cost is affected by many factors
  • Graphics hardware: published benchmarks, startup
    tests
  • Number of vertices: primarily a function of LOD
    algorithm
  • Rendering quality: lighting, shading, wire frame,
    anti-aliasing, etc.
  • Global factors: total texture memory, dirty
    internal state

137
LOD Benefit Function
  • Cost alone is not good enough, need benefit also
  • Rendered size of object
  • Error tolerance between LOD level and reference
    model
  • Importance in scene
  • Frame-to-frame coherency

138
LOD The Optimal LODs
  • For all Objects, at each LOD Level, rendered with
    each RenderType
  • Maximize the Benefit function
  • Benefit(Object, Level, RenderType)
  • Subject to
  • Cost(Object, Level, RenderType) <
    TargetFrameRate

139
LOD Optimal Optimizations
  • Simulated Annealing
  • Monte Carlo Simulations
  • Simplex Searches

140
LOD Optimal Optimizations
  • Simulated Annealing
  • Monte Carlo Simulations
  • Simplex Searches
  • Dude,
  • Can you spare a few dozen CPUs?

141
LOD Trade-offs
  • Don't have enough time to run full LOD
    optimization problem and render the scene
  • Simplify cost and benefit functions
  • Simplify optimization problem into a ranking of
    Benefit/Cost
  • Use frame-to-frame coherency
  • Be sure to consider time taken to calculate LODs

142
Application Architectures Multi-Threading
  • More stages give more time to cull or generate
    LODs
  • Each stage adds latency

143
Application Architectures Multi-Threading
  • Hard part is data synchronization
  • Watch out for memory bloat

144
Application Architectures Scene Graphs
  • A scene graph is the basic data structure
    holding the description of your scene
  • Cull-able, sort-able, and can contain
    multi-resolution objects
  • Hierarchical Bounding Volumes
  • Statistics gathering and timing infrastructure
  • For large scenes can do memory management and
    database paging

145
Application Architectures Trade-offs
  • Quality
  • Speed
  • Memory
  • Complexity

146
Conclusion
  • Most importantly - Think about balance!

147
Performance Hints
  • Keith Cok, SGI

148
Performance Hints: Pipeline Management
  • Avoid round trips to graphics server
  • Cache own state/attribute information
  • Avoid pipeline queries (e.g., glGet)
  • Flush buffer efficiently (glFlush vs. glFinish)
  • Reduce state changes. Sort by expense. For
    example, sort geometry by type (triangles, quads,
    etc) and then by color
  • Eliminate unused attributes
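
Sorting by state can be sketched without any graphics calls. The example below assumes each draw item carries a primitive type code and a packed color, both hypothetical, and sorts by type first and color second as the slide suggests. The counting function estimates how many state changes the renderer would issue when walking the list in order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    int      prim_type;  /* hypothetical code, e.g. 0 = triangles, 1 = quads */
    unsigned color;      /* packed RGBA */
} draw_item_t;

/* Orders by primitive type first, then by color. */
static int by_type_then_color(const void *pa, const void *pb)
{
    const draw_item_t *a = pa, *b = pb;
    if (a->prim_type != b->prim_type)
        return (a->prim_type > b->prim_type) - (a->prim_type < b->prim_type);
    return (a->color > b->color) - (a->color < b->color);
}

void sort_for_rendering(draw_item_t *items, size_t n)
{
    assert(items != NULL || n == 0);
    qsort(items, n, sizeof items[0], by_type_then_color);
}

/* Counts how often type or color differs between consecutive items. */
size_t count_state_changes(const draw_item_t *items, size_t n)
{
    size_t i, changes = 0;
    for (i = 1; i < n; i++)
        if (items[i].prim_type != items[i - 1].prim_type ||
            items[i].color != items[i - 1].color)
            changes++;
    return changes;
}
```

In a real renderer the sort key would put the most expensive state first, and which state is most expensive is something the is-fast tests described earlier can measure.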

149
Performance Hints Debugging
  • Detect graphic errors
    #ifdef DEBUG
    #define GLEND() glEnd(); \
        { \
            int err; \
            err = glGetError(); \
            if (err != GL_NO_ERROR) \
                printf("%s\n", gluErrorString(err)); \
            assert(err == GL_NO_ERROR); \
        }
    #else
    #define GLEND() glEnd()
    #endif

150
Performance Hints Geometry
  • Maximize data between glBegin/glEnd
  • Sort geometry by type (triangle, quad, etc.) and
    group them together
  • Find best fit for length of glBegin/glEnd pair
  • Use strip primitives (GL_TRIANGLE_STRIP...) to
    reduce geometry data sent to the pipeline
  • Avoid GL_POLYGON. Use specific geometric
    primitives instead (GL_TRIANGLES, GL_QUADS, etc.)
  • Use GL_FASTEST with glHint calls where possible

151
Performance Hints Geometry
  • Use flat display lists for static geometry. Deep
    display lists may induce unwanted memory
    thrashing
  • Use API matrix operations instead of your own
  • Use texture to simulate complex geometry
  • Use vertex arrays. Test vertex, interleaved,
    precompiled arrays

152
Performance Hints Geometry
  • Pass one normal (not 3 or 4) per flat shaded
    polygon
  • Use a data format suitable for quick transfer to
    the graphics subsystem
  • Disable unneeded operations (alpha blending,
    depth, stencil, blending, dithering, fog, etc.)

153
Performance Hints Lighting
  • Reduce lighting requirements
  • Use as few lights as possible
  • Use directional (infinite) lighting. Use
    glLightfv(GL_LIGHTn, GL_POSITION, {x, y, z, 0})
  • Use positional lights rather than spot lights
  • Use one-sided lighting when possible (be aware of
    issues associated with normals)
  • Don't change material properties frequently

154
Performance Hints Lighting
  • Use normalized normal vectors
  • Supply unit length vectors
  • Don't enable GL_NORMALIZE
  • Don't scale using model-view matrix
  • Pre-multiply geometry, if possible

155
Performance Hints Visuals/Pixel Formats
  • Pick the correct visual. Use hardware accelerated
    visuals
  • Structure windows and contexts to maximize
    performance (app may block after context swaps)
  • Put GUI elements in overlay planes to avoid
    unwanted graphics window refreshes

156
Performance Hints Buffers
  • Turn off depth buffer when possible
  • Use HW accelerated off-screen buffer for
    backing-store
  • Use stencil buffer for interactive picking and
    quick re-render (see course notes for full
    algorithm)
  • Use color/depth buffer data for interactive
    editing of complex scenes (see course notes for
    full algorithm)

157
Performance Hints Textures
  • Be aware of texture sizes
  • Reduce texture resolution
  • Use texture LOD extension (OpenGL 1.2)
  • Use texture objects. Create textures once
  • Don't swap textures frequently, if possible
  • Mosaic multiple textures into one large texture
  • Sort geometry by texture

158
Performance Hints Textures
  • Use texture as an additional data lookup to
    simulate more complex data
  • Lighting, geometry, color, clipping,
    application-space data
  • Use glTexSubImage to replace part of a texture
    rather than creating a whole new texture
  • Avoid expensive texture filter modes
  • Use texture lookup tables instead of
    multi-channel textures

159
Conclusion
  • Know how your application works within the system
  • Don't let caches, latencies, bandwidths, etc.
    slow you down
  • Know how fast you can go
  • Identify system performance characteristics
  • Work your compiler
  • Get all you can out of the hardware

160
Questions and Answers