Title: Introduction to Realtime Ray Tracing (Course 41)
1. Introduction to Realtime Ray Tracing (Course 41)
- Philipp Slusallek, Peter Shirley
- Bill Mark, Gordon Stoll, Ingo Wald
2. Introduction to Realtime Ray Tracing
- Parallel / Distributed Processing
- Characteristics of Ray Tracing
- Parallelism
- Communication
- Caching
- Frame Synchronization
- Results
3. Promises and Challenges
- Characteristics
- Independence of every ray tree
- Serial dependency between generations of rays
- Often dependency between child rays due to shader
- Significant coherence between adjacent rays
- Geometry can be cached → rays cannot
- Coherence generally diminishes with generation
- Single shared, (mostly) read-only scene database
4. Parallelism
- Task versus Data Parallelism (I)
- Data parallelism: tasks follow data
- Distribute scene among processors, migrate tasks (rays)
- Seems suitable for massive scenes
- Drawbacks
- Large bandwidth due to many rays (difficult to cache)
- Hotspots at camera, lights, and other locations
5. Parallelism
- Task versus Data Parallelism (II)
- Task parallelism: data follows tasks
- Distribute pixels (tiles) among processors
- Load data on demand, cache locally
- Cache size accumulates across processors
- Should assign similar tasks to the same processor (coherence; see the sketch below)
- Within a frame and between frames
- Dynamic load balancing is simple, but conflicts with coherence
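To make the task-parallel option concrete, here is a minimal sketch (not from the course code) of a static tile-to-processor assignment that favors coherence: the image is cut into tiles and each processor gets a contiguous block of neighboring tiles, which it can keep across frames. The Tile struct and function names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Tile { int x0, y0, x1, y1; };   // pixel rectangle (illustrative type)

// Assign tiles to processors in contiguous blocks so each processor sees
// neighboring tiles (coherence within a frame) and, if the assignment is
// reused, the same tiles in the next frame (coherence between frames).
std::vector<std::vector<Tile>> assignTiles(int imgW, int imgH,
                                           int tileSize, int numProcs) {
    std::vector<Tile> tiles;
    for (int y = 0; y < imgH; y += tileSize)
        for (int x = 0; x < imgW; x += tileSize)
            tiles.push_back({x, y, std::min(x + tileSize, imgW),
                                   std::min(y + tileSize, imgH)});

    std::vector<std::vector<Tile>> perProc(numProcs);
    const std::size_t block = (tiles.size() + numProcs - 1) / numProcs;
    for (std::size_t i = 0; i < tiles.size(); ++i)
        perProc[i / block].push_back(tiles[i]);   // contiguous block per CPU
    return perProc;
}
```

Fully dynamic load balancing would instead hand out tiles on demand, which is easier to balance but breaks exactly this adjacency, as the last bullet notes.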
6. Communication
- Shared Memory versus Message Passing
- Conceptually highly similar
- Separate memories with fast interconnect network
- Both need low-latency, high-bandwidth networks
- Often special HW support for SHM (cache coherence)
- Shared memory: user-space illusion through the OS
- Convenient and efficient cache, no explicit programming
- Fully transparent, but can introduce long thread stalls on a miss
- Ideal: SHM with OS support under user control
7. Hardware Options
- PC cluster
- Good scalability, reasonable price
- Rather slow networks with long latency (Ethernet)
- Limited communication due to latency
- Cannot use bi-directional communication (!!)
- Must operate in streaming mode
- Send data in advance, keep the pipeline filled (depending on latency)
- Better networks are coming (InfiniBand)
8. Hardware Options
- PC-Cluster
- Setup with commodity HW
- Dual Athlon/Pentium-4 PCs
- Fast- and Gigabit Ethernet
- Master
- Application and OpenRT library
- Job distribution and load-balancing
- Slaves
- Ray tracing computation only
[Diagram: master connected through a network switch to multiple slaves; a protocol sketch follows.]
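A sketch of the master's job-distribution loop under this setup, with the transport abstracted as callbacks (the real OpenRT/cluster protocol is not shown here; the message layouts and names are assumptions): the master ships only 2D tile coordinates and receives rendered RGB tiles, refilling whichever slave answers.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical message types; the real OpenRT protocol differs.
struct TileJob    { int frame, x, y, w, h; };
struct TileResult { int slave, x, y, w, h; std::vector<std::uint8_t> rgb; };

// Master loop: hand out 2D tile coordinates (tiny messages), collect RGB
// tiles, and immediately refill whichever slave just answered
// (demand-driven load balancing). The transport is passed in as callbacks
// so this sketch stays independent of the actual network layer.
void masterFrame(std::vector<TileJob> jobs, int numSlaves, int imgW,
                 std::vector<std::uint8_t>& framebuffer,
                 const std::function<void(int, const TileJob&)>& send,
                 const std::function<TileResult()>& recvAny) {
    std::size_t next = 0;
    for (int s = 0; s < numSlaves && next < jobs.size(); ++s)
        send(s, jobs[next++]);                    // prime every slave

    for (std::size_t done = 0; done < jobs.size(); ++done) {
        TileResult r = recvAny();
        for (int row = 0; row < r.h; ++row)       // 3 bytes per pixel
            std::copy_n(&r.rgb[row * r.w * 3], r.w * 3,
                        &framebuffer[((r.y + row) * imgW + r.x) * 3]);
        if (next < jobs.size())
            send(r.slave, jobs[next++]);          // keep the slave busy
    }
}
```

Because each message is either a handful of integers or a single tile of pixels, the master's side of the traffic stays small even with many slaves.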
9. Hardware Options
- Shared-Memory Computer
- High scalability with fast low latency networks
- Single virtual address space, lots of RAM
- Single admin domain
- Future computers will use shared memory
- Today: 8 dual-core sockets = 16 CPUs, 64 GB
- Heavy multi-core designs with fast, high-bandwidth interconnects
- May become all we need (at some point :-)
10. Hardware Options
- Cell processor
- No caches but low latency DMA messages
- Globally shared memory with high bandwidth
- 256 KB Local Store: a manually managed cache
- 128x 4-vector registers keep many packets in flight
- Packets allow for hiding DMA latency (hopefully); see the sketch below
- Similar to a PC cluster, but at a different level/granularity
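A generic double-buffering sketch of how packets can hide DMA latency: while one local buffer is being processed, the transfer for the next one is already in flight. The AsyncDma interface is a hypothetical stand-in for the Cell's asynchronous DMA primitives, not its actual API.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical asynchronous-transfer interface standing in for a DMA engine:
// start(localBuf, remoteAddr, bytes, tag) begins a transfer,
// wait(tag) blocks until the transfer tagged 'tag' has completed.
struct AsyncDma {
    std::function<void(void*, std::size_t, std::size_t, int)> start;
    std::function<void(int)> wait;
};

// Process 'count' fixed-size packets of scene data that live in main memory,
// using two local buffers: while buffer A is being processed, the DMA for
// buffer B is already in flight.
template <typename Packet, typename Fn>
void processDoubleBuffered(const AsyncDma& dma, std::size_t remoteBase,
                           std::size_t count, Fn process) {
    Packet local[2];                                   // "local store" buffers
    dma.start(&local[0], remoteBase, sizeof(Packet), /*tag=*/0);
    for (std::size_t i = 0; i < count; ++i) {
        int cur = static_cast<int>(i & 1);
        int nxt = cur ^ 1;
        if (i + 1 < count)                             // prefetch next packet
            dma.start(&local[nxt], remoteBase + (i + 1) * sizeof(Packet),
                      sizeof(Packet), nxt);
        dma.wait(cur);                                 // current packet ready
        process(local[cur]);                           // overlaps the DMA above
    }
}
```

On the real Cell the transfers would go through the MFC DMA calls and the 256 KB Local Store; the structure of overlapping transfer and compute is the same.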
11. Caching
- Caching and Working Sets (Level I)
- Sharing between adjacent (groups of) rays
- Even small caches work extremely well
- SaarCOR I: 4 KB cache → >98% hit rate (2x2 traversal), less for lists and triangles
- Larger tiles reduce coherence
- Diminishing return on bandwidth due to reduced coherence
- Increased working set requires larger caches
12. Caching
- Caching and Working Sets (Level II)
- Sharing between frames
- Size of working set (with paging):
- rays × avg(pages-per-ray) × sizeof(page) × (1 − avg(sharing)), spelled out below
- Little sharing at leaf nodes of large scenes
- Only changed and new data must be sent
- Can significantly reduce bandwidth
- Might even use procedural approach
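Written out as a formula (notation chosen here for readability; it restates the bullet above rather than adding anything new):

```latex
W \;=\; N_{\text{rays}} \cdot \overline{\text{pages per ray}} \cdot \text{sizeof(page)} \cdot \bigl(1 - \overline{\text{sharing}}\bigr)
```

A higher average sharing factor between rays (and between frames) directly shrinks the per-frame working set that has to be fetched or updated.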
13. Caching
- Caching and Working Sets (Level III)
- Out-of-core rendering (loading from disk, e.g. the Boeing 777 model)
- In the renderer and/or application (with feedback)
- Similar characteristics to Level II
- But latency on the order of frame times
- Must use pre-fetching of data to avoid artifacts
14. Caching
- Caching and Working Sets (Summary)
- Caching generally works very well
- RT has surprising amount of coherence
- Across all levels of hierarchy
- Reduces required bandwidth
- May fail occasionally → high peak bandwidth
- Completely incoherent rays (no sharing)
- Sudden changes of the working set (e.g. fast movements)
15. Load Balancing
- Load Balancing of Tasks (packets of rays)
- Centrally assign tasks to processors on demand
- Goal: keep processors busy at all times
- Size of tasks: overhead versus granularity
- Overhead through network bandwidth and processing
- Small granularity: allows better distribution of load
- Large granularity: not enough tasks for a large cluster
- Typical on the PC cluster: 32x32 to 16x16 tiles (4 packets queued)
- Keeps the queue filled at 1 ms latency and 4 Mrays/s (see the calculation below)
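The queue depth follows from a quick latency calculation: to keep a slave busy across a round trip, the number of rays in flight must cover latency times ray rate. With the figures from this slide (1 ms, 4 Mrays/s) and 32x32 tiles:

```latex
\text{rays in flight} \;=\; t_{\text{latency}} \cdot R \;=\; 1\,\text{ms} \times 4\,\text{Mrays/s} \;=\; 4000\ \text{rays} \;\approx\; 4 \text{ tiles of } 32 \times 32 = 1024 \text{ rays}
```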
16. Load Balancing
- Fairness
- Usually not an issue: enough tasks to distribute
- 640x480 / (32x32) = 300 tasks per frame
- At 30 processors, can tolerate a 10:1 imbalance (a lot!)
- End-of-frame synchronization
- Big worry in offline rendering: must balance each frame
- Not an issue in online rendering / stream computing
- Start assigning tasks from the next frame to idle processors
- Even higher imbalance can be handled
17. Frame Synchronization
One frame of application latency in rtSwapBuffers()
[Timeline diagram, not to scale, with time marks t0, t1, t2: the Application sends OpenRT commands for frame n and calls rtSwapBuffers, while OpenRT and the Rendering Clients send/receive tiles for and render frame n-1, and the Display shows frame n-2; the pattern repeats for frames n+1 and n+2.]
18. Load Balancing
- Criteria for task allocation
- Allow tasks to move freely for load balancing
- Keep adjacent tasks on same processor (caching)
- Keep visible geometry on the same processor (cache accumulation)
- Must re-project samples per frame as a preprocess
- Assign tasks to processors that have the data
- Large models may need to load lots of data → large latency
- Must balance different criteria
- Simple first-come, first-served is sufficient for most scenarios
19. Scene Updates
- Synchronization for Scene Update
- May need to update scene between frames
- Must adhere to temporal consistency
- All packets of the old frame must be finished, no new ones started yet
- Approach (Copy & Update); see the sketch below
- Separate receiver thread updates the scene
- Changed objects are copied before applying changes
- At the end of the frame, pointers are updated → fast pointer copy
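A minimal sketch of this copy-and-update scheme, assuming a hypothetical Object type and a per-object slot (not the OpenRT data structures): the receiver thread edits a copy, and the end-of-frame commit is just pointer assignments.

```cpp
#include <memory>
#include <vector>

struct Object { /* geometry, transform, shader data ... */ };

// One slot per scene object. Renderer threads read 'current' during the
// frame; the receiver thread builds 'pending' copies on the side.
struct ObjectSlot {
    std::shared_ptr<const Object> current;   // read by ray tracing threads
    std::shared_ptr<Object>       pending;   // written by receiver thread
};

// Receiver thread: copy the object before applying changes, so rendering
// of the current frame is never disturbed.
template <typename ApplyFn>
void stageUpdate(ObjectSlot& slot, ApplyFn apply) {
    slot.pending = std::make_shared<Object>(*slot.current);  // copy
    apply(*slot.pending);                                    // modify the copy
}

// At the end-of-frame barrier (all packets of the old frame finished, none
// of the new frame started): commit all staged copies with cheap pointer
// assignments. No locking is needed because no rendering runs at this point.
void commitUpdates(std::vector<ObjectSlot>& scene) {
    for (ObjectSlot& slot : scene)
        if (slot.pending) {
            slot.current = std::move(slot.pending);
            slot.pending.reset();
        }
}
```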
20. Networking
- Bandwidth requirements
- From master
- Scene updates (it depends ...)
- 2D tile coordinates (tiny)
- Enough bandwidth to add other data (e.g. AR background)
- To master
- RGB data (3 bytes per pixel → >3 MB × N fps); see the calculation below
- Could use low-latency compression (MJPEG, H.263, ...)
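For a feeling of the numbers, the uncompressed return bandwidth is just resolution times 3 bytes times frame rate; the resolutions and frame rate below are illustrative, not from the course:

```cpp
#include <cstdio>

// Uncompressed RGB bandwidth from the slaves back to the master:
// width * height * 3 bytes per pixel * frames per second, in MB/s.
double rgbBandwidthMBps(int width, int height, double fps) {
    return width * height * 3.0 * fps / (1024.0 * 1024.0);
}

int main() {
    std::printf("640x480   @ 30 fps: %6.1f MB/s\n", rgbBandwidthMBps(640, 480, 30));
    std::printf("1280x1024 @ 30 fps: %6.1f MB/s\n", rgbBandwidthMBps(1280, 1024, 30));
}
```

Gigabit Ethernet's roughly 100 MB/s covers the smaller resolution comfortably but not the larger one at 30 fps, which is why low-latency compression becomes attractive.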
21. Networking
- Available Options
- Fast Ethernet: 10 MB/s → not really
- Gigabit Ethernet: 100 MB/s → reasonable (up to 23 fps uncompressed)
- InfiniBand 4x: 1 GB/s → nice
- InfiniBand 12x: 3 GB/s (???)
- Proprietary
- SGI NUMAlink 4: max. 3.2 GB/s, 3-4x lower latency
22. Network Scalability
[Graph: scalability on dual-processor AMD 1800+ clients, up to 48 CPUs @ 640x480.]
23. Hybrid Distribution Model
- Cluster of SHM machines
- SHM: better caching and sharing of scene data
- Single link → many parallel threads
- Reduced bandwidth requirements with a large cache
- Several processes on one SHM machine
- One link per process → increased bandwidth
- SHM leads to frame sync between threads (!)
- Better scalability with shared memory mapping
24. Summary and Limitations
- Ideally scalable due to independent rays
- Limited only by bandwidth to data
- Scene updates, cache misses, pixel data
- Reduced bandwidth through caching
- Needs good task scheduling and load balancing
- Generally works very well
- → OK at computer scale, translate to chip scale
25. Shared-memory multiprocessors
- University of Utah implementations
- Designed mainly for SGIs
- Many CPUs, shared address space, frame buffer
- Chief architect: Steve Parker
- rtrt (1997, workhorse)
- manta (2004, demo in SGI booth today/tomorrow)
- Tested on up to a 512-processor system
26. Shared Memory Implementation
- Each processor can randomly access the database
- Database must be duplicated or shared
- Each processor can write to an arbitrary portion of the image
- Image must be shared or composited later
27. Strategy: KISS
- Break the image into tiles of cache-line size
- 32x4 = 128 pixels for most of our machines
- Share the database using the global address space (see the sketch below)
- Use C, emphasize locality
- Worry about memory more than arithmetic
- Biggest change for us: ray packets
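A minimal shared-memory sketch in the spirit of this strategy (not the rtrt/manta source): the scene and framebuffer are simply shared through the address space, and worker threads pull cache-line-sized tiles from one atomic counter.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Tiles are indexed 0..numTiles-1 in scanline order; each tile is a small,
// cache-line-friendly strip of pixels (e.g. 32x4 = 128 pixels).
// renderTile(i) traces the rays of tile i and writes its pixels; the scene
// database is shared implicitly through the global address space.
void renderFrame(int numTiles, int numThreads,
                 const std::function<void(int)>& renderTile) {
    std::atomic<int> nextTile{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t)
        workers.emplace_back([&] {
            // Grab the next unrendered tile until the frame is done.
            for (int tile = nextTile.fetch_add(1); tile < numTiles;
                 tile = nextTile.fetch_add(1))
                renderTile(tile);
        });
    for (std::thread& w : workers) w.join();
}
```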
28. Scalability
- Caching works well for our data
- Almost perfect up to 32 processors
- 91% for 512 processors (466x speedup)
29. Load Balancing
- Work-queue based
- Each tile becomes a unit of work
- Work is doled out in larger chunks at the beginning of a frame, and smaller chunks nearer the end (see the sketch below)
- Like stacking blocks
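One simple way to realize "larger chunks first, smaller chunks near the end" is a guided-self-scheduling style rule, sketched below (not necessarily the exact rule used in rtrt/manta): each request takes a fixed fraction of the remaining tiles, clamped to a minimum chunk.

```cpp
#include <algorithm>
#include <atomic>

// Hand out work in chunks proportional to what is left: early requests get
// big chunks (low queue overhead), late requests get small ones (good
// end-of-frame balance).
struct ChunkQueue {
    std::atomic<int> next{0};   // first unclaimed tile
    int total = 0;              // tiles in this frame
    int numCpus = 1;
    int minChunk = 1;

    // Claim the next chunk of tiles for one worker.
    // Returns the chunk size and sets 'first'; returns 0 when the frame is done.
    int grab(int& first) {
        int cur = next.load();
        while (true) {
            int remaining = total - cur;
            if (remaining <= 0) return 0;
            int count = std::max(minChunk, remaining / (2 * numCpus));
            count = std::min(count, remaining);
            if (next.compare_exchange_weak(cur, cur + count)) {
                first = cur;
                return count;
            }
            // 'cur' was refreshed by compare_exchange_weak; retry.
        }
    }
};
```

A worker thread calls grab(first) in a loop and renders tiles [first, first + count) until grab returns 0.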
30. Bottlenecks
- Tile manager
- Frame boundaries
[Diagram: tiles being handed out to cpu 1, cpu 2, and cpu 3 around the start of a new frame.]
31. 60 million triangles
32. Boeing 777: 350M triangles
33. 35 million spheres
34. 68 billion height grid
35. Summary
- Shared memory multiprocessors work
- But they are expensive
- May give lessons for multi-core CPUs
- Clever programming not needed
- Encourage natural memory locality
- For big datasets, much better than GPUs