Introduction to Realtime Ray Tracing (Course 41)

1
Introduction to Realtime Ray Tracing (Course 41)
  • Philipp Slusallek, Peter Shirley
  • Bill Mark, Gordon Stoll, Ingo Wald

2
Introduction to Realtime Ray Tracing
  • Parallel / Distributed Processing
  • Characteristics of Ray Tracing
  • Parallelism
  • Communication
  • Caching
  • Frame Synchronization
  • Results

3
Promises and Challenges
  • Characteristics
  • Independence of every ray tree
  • Serial dependency between generations of rays
  • Often dependencies between child rays due to shaders
  • Significant coherence between adjacent rays
  • Geometry can be cached → rays cannot
  • Coherence generally diminishes with generation
  • Single shared, (mostly) read-only scene database

4
Parallelism
  • Task versus Data Parallelism (I)
  • Data parallelism: task follows data
  • Distribute scene among processors, migrate tasks
    (rays)
  • Seems suitable for massive scenes
  • Drawbacks
  • Large bandwidth due to many rays (difficult to
    cache)
  • Hotspots at camera, lights, and other locations

5
Parallelism
  • Task versus Data Parallelism (II)
  • Task parallelism: data follows tasks
  • Distribute pixels (tiles) among processors
  • Load data on demand, cache locally (see the sketch
    below)
  • Cache size accumulates among processors
  • Should assign similar tasks to same processor
    (coherence)
  • Within frame and between frames
  • Dynamic load balancing is simple, but conflicts
    with coherence
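
A minimal sketch of this task-parallel, demand-driven tile scheme, assuming C++ threads, a 640x480 image with 32x32 tiles, and hypothetical SceneCache/shadePixel placeholders; it is an illustration of the idea, not the OpenRT implementation:

```cpp
// Task parallelism ("data follows tasks"): tiles are handed out on demand
// from a shared counter, and each worker keeps its own scene cache that
// fills as rays request geometry. Names and sizes are illustrative only.
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

constexpr int W = 640, H = 480, TILE = 32;          // assumed resolution / tile size
constexpr int TILES_X = W / TILE, TILES_Y = H / TILE;

struct SceneCache { /* per-worker geometry cache, filled on demand by rays */ };

// Placeholder for "trace one pixel using the worker's local cache".
static std::uint8_t shadePixel(SceneCache&, int /*x*/, int /*y*/) { return 0; }

void renderFrame(std::vector<std::uint8_t>& framebuffer, int numWorkers) {
    std::atomic<int> nextTile{0};                   // demand-driven task counter
    auto worker = [&] {
        SceneCache cache;                           // filled once, reused by adjacent rays
        for (int t; (t = nextTile.fetch_add(1)) < TILES_X * TILES_Y; ) {
            const int tx = (t % TILES_X) * TILE, ty = (t / TILES_X) * TILE;
            for (int y = ty; y < ty + TILE; ++y)    // rays within a tile are coherent
                for (int x = tx; x < tx + TILE; ++x)
                    framebuffer[y * W + x] = shadePixel(cache, x, y);
        }
    };
    std::vector<std::thread> pool;
    for (int i = 0; i < numWorkers; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
```

The same structure maps onto a cluster: the atomic counter becomes the master's tile queue and the per-worker cache becomes each slave's local scene cache.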

6
Communication
  • Shared Memory versus Message Passing
  • Conceptually highly similar
  • Separate memories with fast interconnect network
  • Both need low latency, high bandwidth networks
  • Often special HW support for SHM (cache
    coherence)
  • Shared memory: user-space illusion through the OS
  • Convenient and efficient cache, no explicit
    programming
  • Fully transparent, but can introduce long thread
    stalls on a miss
  • Ideal: SHM with OS support under user control

7
Hardware Options
  • PC cluster
  • Good scalability, reasonable price
  • Rather slow networks with long latency (Ethernet)
  • Limited communication due to latency
  • Cannot use bi-directional communication (!!)
  • Must operate in streaming mode
  • Send data in advance, keep the pipeline filled
    (depending on latency)
  • Better networks are coming (InfiniBand)

8
Hardware Options
  • PC-Cluster
  • Setup with commodity HW
  • Dual Athlon/Pentium-4 PCs
  • Fast- and Gigabit Ethernet
  • Master
  • Application and OpenRT library
  • Job distribution and load-balancing
  • Slaves
  • Ray tracing computation only

[Diagram: master PC connected to six slave PCs through a network switch]
9
Hardware Options
  • Shared-Memory Computer
  • High scalability with fast low latency networks
  • Single virtual address space, lots of RAM
  • Single admin domain
  • Future computers will use shared memory
  • Today: 8 sockets x dual-core = 16 CPUs, 64 GB
  • Heavy multi-core designs with fast high-bw
    interconnects
  • May become all we need (at some point)

10
Hardware Options
  • Cell processor
  • No caches, but low latency DMA messages
  • Globally shared memory with high bandwidth
  • 256 KB Local Store = manually managed cache
  • 128 x 4-vector registers keep many packets
    in flight
  • Packets allow hiding DMA latency (hopefully; see
    the sketch below)
  • Similar to a PC cluster, but at a different
    level/granularity
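
A sketch of that latency-hiding pattern: keep at least two packets in flight and overlap the transfer of the next packet with computation on the current one. dmaGet/dmaWait/tracePacket are hypothetical stand-ins (real SPE code would use the memory-flow-controller intrinsics), so only the double-buffering structure is meant to carry over:

```cpp
// Hide DMA latency with two ray-packet buffers: start the fetch of packet
// i+1, then wait only for packet i and trace it while the other transfer
// is in flight. All functions below are stubs standing in for real DMA.
#include <cstddef>

struct RayPacket { /* e.g. a handful of rays laid out for SIMD */ };

void dmaGet(RayPacket* /*dst*/, std::size_t /*srcIndex*/, int /*tag*/) {}  // start async fetch
void dmaWait(int /*tag*/) {}                                               // block until tag done
void tracePacket(RayPacket& /*p*/) {}                                      // compute on local store

void traceAll(std::size_t numPackets) {
    RayPacket buf[2];                        // two local-store buffers
    if (numPackets == 0) return;
    dmaGet(&buf[0], 0, /*tag=*/0);           // prefetch the first packet
    for (std::size_t i = 0; i < numPackets; ++i) {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < numPackets)
            dmaGet(&buf[nxt], i + 1, nxt);   // start the next transfer...
        dmaWait(cur);                        // ...while waiting only for the current one
        tracePacket(buf[cur]);               // computation overlaps the in-flight DMA
    }
}
```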

11
Caching
  • Caching and Working Sets (Level I)
  • Sharing between adjacent (groups of) rays
  • Even small caches work extremely well
  • SaarCOR I: 4 KB cache → >98% hit rate (2x2
    traversal), less for lists and triangles
  • Larger tiles reduce coherence
  • Diminishing return on bandwidth due to reduced
    coherence
  • Increased working set requires larger caches

12
Caching
  • Caching and Working Sets (Level II)
  • Sharing between frames
  • Size of working set (with paging):
    #rays x avg(pages per ray) x sizeof(page) x
    (1 - avg(sharing))  (worked example below)
  • Little sharing at leaf nodes of large scenes
  • Only changed and new data must be sent
  • Can significantly reduce bandwidth
  • Might even use a procedural approach
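
A worked instance of the formula above; every input number here is purely hypothetical and only meant to show how the terms combine:

```cpp
// Working-set estimate from the formula above. All numbers are hypothetical
// examples, not measurements from the course.
#include <cstdio>

int main() {
    const double rays        = 640.0 * 480;  // primary rays per frame (assumed)
    const double pagesPerRay = 4.0;          // avg pages touched per ray (assumed)
    const double pageSize    = 4096.0;       // bytes per page
    const double avgSharing  = 0.95;         // fraction of pages shared with other rays (assumed)

    const double workingSet = rays * pagesPerRay * pageSize * (1.0 - avgSharing);
    std::printf("estimated per-frame working set: %.1f MB\n", workingSet / (1 << 20));
}
```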

13
Caching
  • Caching and Working Sets (Level III)
  • Out-of-core rendering (loading from disk, e.g.
    Boeing)
  • In renderer and/or application (with feedback)
  • Similar characteristics to Level II
  • But latency on the order of frame times
  • Must use pre-fetching of data to avoid artifacts

14
Caching
  • Caching and Working Sets (Summary)
  • Caching generally works very well
  • RT has a surprising amount of coherence
  • Across all levels of hierarchy
  • Reduces required bandwidth
  • May fail occasionally → high peak bandwidth
  • Completely incoherent rays (no sharing)
  • Sudden changes of working set (e.g. fast
    movements)

15
Load Balancing
  • Load Balancing of Tasks (packets of rays)
  • Centrally assign tasks to processors on demand
  • Goal: keep processors busy at all times
  • Size of tasks: overhead versus granularity
  • Overhead through network bandwidth and processing
  • Small granularity: allows better distribution of
    load
  • Large granularity: not enough tasks for a large
    cluster
  • Typical on PC cluster: 32x32 to 16x16 tiles (4 packets
    queued)
  • Keeps the queue filled at 1 ms latency and 4 Mrays/s
    (see the arithmetic below)
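
The queue depth quoted above follows directly from the slide's numbers (1 ms latency, 4 Mrays/s per client, 32x32 tiles); a quick check:

```cpp
// Why ~4 queued 32x32 tiles cover the network latency: the client can finish
// about 4000 rays during one 1 ms round trip, and one tile is 1024 primary rays.
#include <cmath>
#include <cstdio>

int main() {
    const double latencySeconds = 1e-3;        // request/response round trip
    const double raysPerSecond  = 4e6;         // client ray-tracing rate
    const double raysPerTile    = 32.0 * 32.0; // one task = one 32x32 tile

    const double raysDuringLatency = latencySeconds * raysPerSecond;  // = 4000
    const double tilesNeeded = std::ceil(raysDuringLatency / raysPerTile);
    std::printf("queue depth needed: %.0f tiles\n", tilesNeeded);     // ~4
}
```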

16
Load Balancing
  • Fairness
  • Usually not an issue: enough tasks to distribute
  • 640x480 / 32x32 = 300 tasks per frame
  • At 30 processors, can tolerate a 10:1 imbalance (a
    lot!)
  • End-of-frame synchronization
  • Big worry in offline rendering: must balance each
    frame
  • Not an issue in online rendering / stream
    computing
  • Start assigning tasks from the next frame to idle
    processors
  • Even higher imbalance can be handled

17
Frame Synchronization
One frame of application latency in rtSwapBuffers()
[Timeline diagram, not to scale: while the application sends OpenRT commands
for frame n and calls rtSwapBuffers(), the OpenRT layer and the rendering
clients are still exchanging tiles and rendering frame n-1, and the display
shows frame n-2; each stage lags the one above by one frame.]
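
The pipeline in the timeline can be mimicked in a few lines: the application issues the next frame while the previous one is still being rendered, so the image that appears always trails by one frame. Everything below except the idea of rtSwapBuffers-style pipelining is a hypothetical stand-in, not the OpenRT API:

```cpp
// One-frame pipeline latency: issue commands for frame n, kick off its
// rendering asynchronously, and display the frame that finished previously.
#include <future>
#include <utility>
#include <vector>

using Image = std::vector<unsigned char>;

static void  issueCommands(int /*frame*/) {}                              // stand-in: app sends scene/camera updates
static Image renderFrame(int /*frame*/) { return Image(640 * 480 * 3); }  // stand-in: cluster renders the frame
static void  display(const Image& /*img*/) {}                             // stand-in: show the finished image

void renderLoop(int numFrames) {
    std::future<Image> inFlight;                      // frame currently being rendered
    for (int n = 0; n < numFrames; ++n) {
        issueCommands(n);                             // commands for frame n
        std::future<Image> next =
            std::async(std::launch::async, renderFrame, n);
        if (inFlight.valid())
            display(inFlight.get());                  // the "swap": frame n-1 appears now
        inFlight = std::move(next);                   // frame n keeps rendering in the background
    }
    if (inFlight.valid()) display(inFlight.get());    // flush the last frame
}
```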
18
Load Balancing
  • Criteria for task allocation
  • Allow tasks to move freely for load balancing
  • Keep adjacent tasks on same processor (caching)
  • Keep visible geometry on same processor (cache
    accumulation)
  • Must re-project samples per frame as a preprocess
  • Assign tasks to processors that already have the data
  • Large models may need to load lots of data →
    large latency
  • Must balance different criteria
  • Simple first-come-first-served is sufficient for
    most scenarios

19
Scene Updates
  • Synchronization for Scene Updates
  • May need to update the scene between frames
  • Must adhere to temporal consistency
  • All packets of the old frame must be finished, none of
    the new frame started yet
  • Approach (copy & update, sketched below):
  • Separate receiver thread updates the scene
  • Changed objects are copied before applying
    changes
  • At the end of the frame, pointers are updated → fast
    pointer copy
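
A minimal sketch of the copy & update idea, assuming shared_ptr ownership and an illustrative Object type; the end-of-frame barrier that guarantees no packets of the old frame are still in flight is outside the sketch:

```cpp
// Copy & update: incoming changes are applied to a private copy while
// rendering still reads the old object; at the frame boundary the scene
// switches to the new version with a cheap pointer swap.
#include <memory>
#include <vector>

struct Object { std::vector<float> vertices; /* ... */ };

class SceneSlot {
    std::shared_ptr<const Object> current_;   // what the renderer reads
    std::shared_ptr<Object>       pending_;   // copy being updated by the receiver thread
public:
    explicit SceneSlot(std::shared_ptr<const Object> o) : current_(std::move(o)) {}

    // Receiver thread: copy the object once, then apply changes to the copy.
    Object& beginUpdate() {
        if (!pending_) pending_ = std::make_shared<Object>(*current_);
        return *pending_;
    }

    // End of frame, all old packets finished: fast pointer swap to the new version.
    void commit() {
        if (pending_) { current_ = std::move(pending_); pending_.reset(); }
    }

    const Object& read() const { return *current_; }   // renderer-side access
};
```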

20
Networking
  • Bandwidth requirements
  • From the master:
  • Scene updates (it depends ...)
  • 2D tile coordinates (tiny)
  • Enough bandwidth left to add other data (e.g. AR
    background)
  • To the master:
  • RGB data (3 bytes per pixel → >3 MB x N fps)
  • Could use low latency compression (MJPEG, H.263,
    ...)

21
Networking
  • Available Options
  • Fast Ethernet: 10 MB/s → not really
  • Gigabit Ethernet: 100 MB/s → reasonable
    (up to 23 fps uncompressed; see the estimate below)
  • InfiniBand 4x: 1 GB/s → nice
  • InfiniBand 12x: 3 GB/s (???)
  • Proprietary:
  • SGI NUMAlink 4: max. 3.2 GB/s, 3-4x lower latency
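
A rough feel for what those link speeds mean for the uncompressed RGB tile traffic of the previous slide (>3 MB per frame); the 1280x1024 resolution below is an assumed example, and protocol overhead is ignored:

```cpp
// Upper bound on frames per second per link if every pixel is returned
// uncompressed as 3-byte RGB (scene updates and protocol overhead ignored).
#include <cstdio>

int main() {
    const double width = 1280, height = 1024, bytesPerPixel = 3;   // assumed resolution
    const double bytesPerFrame = width * height * bytesPerPixel;   // ~3.9 MB, matches ">3 MB"
    const double linkMBps[] = {10, 100, 1000, 3000};               // Fast / GigE / IB 4x / IB 12x
    const char*  name[]     = {"Fast Ethernet", "Gigabit Ethernet",
                               "InfiniBand 4x", "InfiniBand 12x"};
    for (int i = 0; i < 4; ++i)
        std::printf("%-17s ~%.0f fps\n", name[i],
                    linkMBps[i] * 1e6 / bytesPerFrame);
}
```

Under these assumptions Fast Ethernet comes out at roughly 2-3 fps and Gigabit Ethernet at around 25 fps ideal, in the same range as the "up to 23 fps" quoted above.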

22
Network Scalability
[Graph: network scalability on dual-processor AMD 1800 clients (up to 48 CPUs @ 640x480)]
23
Hybrid Distribution Model
  • Cluster of SHM machines
  • SHM: better caching and sharing of scene data
  • Single link → many parallel threads
  • Reduced bandwidth requirements with a large cache
  • Several processes per SHM machine:
  • One link per process → increased bandwidth
  • SHM leads to frame sync between threads (!)
  • Better scalability with shared memory mapping

24
Summary and Limitations
  • Ideally scalable due to independent rays
  • Limited only by bandwidth to data
  • Scene updates, cache misses, pixel data
  • Reduced bandwidth through caching
  • Needs good task scheduling and load balancing
  • Generally works very well
  • → OK at computer scale; translates to chip scale

25
Shared-memory multiprocessors
  • University of Utah implementations
  • Designed mainly for SGIs
  • Many CPUs, shared address space, frame buffer
  • Chief architect: Steve Parker
  • rtrt (1997, workhorse)
  • manta (2004, demo in SGI booth today/tomorrow)
  • Tested on up to 512 processor system

26
Shared Memory Implementation
  • Each processor can randomly access the database
  • Database must be duplicated or shared
  • Each processor can write to an arbitrary portion
    of the image
  • Image must be shared or composited later

27
Strategy: KISS
  • Break image into tiles of cache line size
  • 32x4 = 128 pixels for most of our machines
  • Share database using global address space
  • Use C, emphasize locality
  • Worry about memory more than arithmetic
  • Biggest change for us: ray packets

28
Scalability
  • Caching works well for our data
  • Almost perfect up to 32 processors
  • 91% for 512 processors (466x speedup)

29
Load Balancing
  • Work-queue based
  • Each tile becomes a unit of work
  • Work is doled out in larger chunks at the
    beginning of a frame, and smaller chunks nearer
    the end
  • Like stacking blocks (see the sketch below)
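
One way to get the "larger chunks first, smaller chunks near the end" behaviour is a guided-self-scheduling-style queue that hands each requester a fraction of the remaining tiles; the divisor below is an assumption, not necessarily what rtrt/manta use:

```cpp
// Work queue with shrinking chunk sizes: each request receives a share of
// the remaining tiles, so chunks are large at the start of the frame and
// small near the end. The chunk formula is illustrative only.
#include <algorithm>
#include <mutex>

class TileQueue {
    int next_ = 0;
    int total_, cpus_;
    std::mutex m_;
public:
    TileQueue(int totalTiles, int numCpus) : total_(totalTiles), cpus_(numCpus) {}

    // Returns the tile range [first, first + count); count == 0 when the frame is done.
    bool grab(int& first, int& count) {
        std::lock_guard<std::mutex> lock(m_);
        const int remaining = total_ - next_;
        if (remaining <= 0) { count = 0; return false; }
        count = std::max(1, remaining / (2 * cpus_));   // shrink as the frame drains
        first = next_;
        next_ += count;
        return true;
    }
};
```

Workers call grab() repeatedly and render the returned tile range; chunk sizes fall off automatically toward the end of the frame.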

30
Bottlenecks
  • Tile manager
  • Frame boundaries

[Diagram: tile assignments to CPUs 1-3 around the start of a new frame]
31
60 million triangles
32
Boeing 777 350M triangles
33
35 Million spheres
34
68 billion height grid
35
Summary
  • Shared memory multiprocessors work
  • But they are expensive
  • May give lessons for multi-core CPUs
  • Clever programming not needed
  • Encourage natural memory locality
  • For big datasets, much better than GPUs
