Introduction to Realtime Ray Tracing (Course 41)

1
Introduction to Realtime Ray Tracing (Course 41)
  • Philipp Slusallek, Peter Shirley
  • Bill Mark, Gordon Stoll, Ingo Wald

2
Introduction to Realtime Ray Tracing
  • Parallel / Distributed Processing
  • Characteristics of Ray Tracing
  • Parallelism
  • Communication
  • Caching
  • Frame Synchronization
  • Results

3
Promises and Challenges
  • Characteristics
  • Independence of every ray tree
  • Serial dependency between generations of rays
  • Often dependencies between child rays due to shaders
  • Significant coherence between adjacent rays
  • Geometry can be cached → rays cannot
  • Coherence generally diminishes with generation
  • Single shared, (mostly) read-only scene database

4
Parallelism
  • Task versus Data Parallelism (I)
  • Data parallelism: task follows data
  • Distribute scene among processors, migrate tasks
    (rays)
  • Seems suitable for massive scenes
  • Drawbacks
  • Large bandwidth due to many rays (difficult to
    cache)
  • Hotspots at camera, lights, and other locations

5
Parallelism
  • Task versus Data Parallelism (II)
  • Task parallelism: data follows tasks
  • Distribute pixels (tiles) among processors
  • Load data on demand, cache locally (see the sketch
    below)
  • Cache size accumulates among processors
  • Should assign similar tasks to same processor
    (coherence)
  • Within frame and between frames
  • Dynamic load balancing is simple, but conflicts
    with coherence
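
A minimal sketch of this task-parallel, demand-driven tile scheme, assuming C++ threads, a 640x480 image with 32x32 tiles, and hypothetical SceneCache/shadePixel placeholders; it is an illustration of the idea, not the OpenRT implementation:

```cpp
// Task parallelism ("data follows tasks"): tiles are handed out on demand
// from a shared counter, and each worker keeps its own scene cache that
// fills as rays request geometry. Names and sizes are illustrative only.
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

constexpr int W = 640, H = 480, TILE = 32;          // assumed resolution / tile size
constexpr int TILES_X = W / TILE, TILES_Y = H / TILE;

struct SceneCache { /* per-worker geometry cache, filled on demand by rays */ };

// Placeholder for "trace one pixel using the worker's local cache".
static std::uint8_t shadePixel(SceneCache&, int /*x*/, int /*y*/) { return 0; }

void renderFrame(std::vector<std::uint8_t>& framebuffer, int numWorkers) {
    std::atomic<int> nextTile{0};                   // demand-driven task counter
    auto worker = [&] {
        SceneCache cache;                           // filled once, reused by adjacent rays
        for (int t; (t = nextTile.fetch_add(1)) < TILES_X * TILES_Y; ) {
            const int tx = (t % TILES_X) * TILE, ty = (t / TILES_X) * TILE;
            for (int y = ty; y < ty + TILE; ++y)    // rays within a tile are coherent
                for (int x = tx; x < tx + TILE; ++x)
                    framebuffer[y * W + x] = shadePixel(cache, x, y);
        }
    };
    std::vector<std::thread> pool;
    for (int i = 0; i < numWorkers; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
```

The same structure maps onto a cluster: the atomic counter becomes the master's tile queue and the per-worker cache becomes each slave's local scene cache.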

6
Communication
  • Shared Memory versus Message Passing
  • Conceptually highly similar
  • Separate memories with fast interconnect network
  • Both need low latency, high bandwidth networks
  • Often special HW support for SHM (cache
    coherence)
  • Shared memory: user-space illusion through the OS
  • Convenient and efficient cache, no explicit
    programming
  • Fully transparent, but can introduce long thread
    stalls on a miss
  • Ideal: SHM with OS support under user control

7
Hardware Options
  • PC cluster
  • Good scalability, reasonable price
  • Rather slow networks with long latency (Ethernet)
  • Limited communication due to latency
  • Cannot use bi-directional communication (!!)
  • Must operate in streaming mode
  • Send data in advance, keep the pipeline filled
    (depending on latency)
  • Better networks are coming (InfiniBand)

8
Hardware Options
  • PC-Cluster
  • Setup with commodity HW
  • Dual Athlon/Pentium-4 PCs
  • Fast- and Gigabit Ethernet
  • Master
  • Application and OpenRT library
  • Job distribution and load-balancing
  • Slaves
  • Ray tracing computation only

[Diagram: master PC connected to six slave PCs through a network switch]
9
Hardware Options
  • Shared-Memory Computer
  • High scalability with fast low latency networks
  • Single virtual address space, lots of RAM
  • Single admin domain
  • Future computers will use shared memory
  • Today: 8 sockets x dual-core = 16 CPUs, 64 GB
  • Heavy multi-core designs with fast high-bw
    interconnects
  • May become all we need (at some point)

10
Hardware Options
  • Cell processor
  • No caches, but low latency DMA messages
  • Globally shared memory with high bandwidth
  • 256 KB Local Store = manually managed cache
  • 128 x 4-vector registers keep many packets
    in flight
  • Packets allow hiding DMA latency (hopefully; see
    the sketch below)
  • Similar to a PC cluster, but at a different
    level/granularity
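
A sketch of that latency-hiding pattern: keep at least two packets in flight and overlap the transfer of the next packet with computation on the current one. dmaGet/dmaWait/tracePacket are hypothetical stand-ins (real SPE code would use the memory-flow-controller intrinsics), so only the double-buffering structure is meant to carry over:

```cpp
// Hide DMA latency with two ray-packet buffers: start the fetch of packet
// i+1, then wait only for packet i and trace it while the other transfer
// is in flight. All functions below are stubs standing in for real DMA.
#include <cstddef>

struct RayPacket { /* e.g. a handful of rays laid out for SIMD */ };

void dmaGet(RayPacket* /*dst*/, std::size_t /*srcIndex*/, int /*tag*/) {}  // start async fetch
void dmaWait(int /*tag*/) {}                                               // block until tag done
void tracePacket(RayPacket& /*p*/) {}                                      // compute on local store

void traceAll(std::size_t numPackets) {
    RayPacket buf[2];                        // two local-store buffers
    if (numPackets == 0) return;
    dmaGet(&buf[0], 0, /*tag=*/0);           // prefetch the first packet
    for (std::size_t i = 0; i < numPackets; ++i) {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < numPackets)
            dmaGet(&buf[nxt], i + 1, nxt);   // start the next transfer...
        dmaWait(cur);                        // ...while waiting only for the current one
        tracePacket(buf[cur]);               // computation overlaps the in-flight DMA
    }
}
```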

11
Caching
  • Caching and Working Sets (Level I)
  • Sharing between adjacent (groups of) rays
  • Even small caches work extremely well
  • SaarCOR I: 4 KB cache → >98% hit rate (2x2
    traversal), less for lists and triangles
  • Larger tiles reduce coherence
  • Diminishing return on bandwidth due to reduced
    coherence
  • Increased working set requires larger caches

12
Caching
  • Caching and Working Sets (Level II)
  • Sharing between frames
  • Size of working set (with paging):
    #rays x avg(pages per ray) x sizeof(page) x
    (1 - avg(sharing))  (worked example below)
  • Little sharing at leaf nodes of large scenes
  • Only changed and new data must be sent
  • Can significantly reduce bandwidth
  • Might even use a procedural approach
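
A worked instance of the formula above; every input number here is purely hypothetical and only meant to show how the terms combine:

```cpp
// Working-set estimate from the formula above. All numbers are hypothetical
// examples, not measurements from the course.
#include <cstdio>

int main() {
    const double rays        = 640.0 * 480;  // primary rays per frame (assumed)
    const double pagesPerRay = 4.0;          // avg pages touched per ray (assumed)
    const double pageSize    = 4096.0;       // bytes per page
    const double avgSharing  = 0.95;         // fraction of pages shared with other rays (assumed)

    const double workingSet = rays * pagesPerRay * pageSize * (1.0 - avgSharing);
    std::printf("estimated per-frame working set: %.1f MB\n", workingSet / (1 << 20));
}
```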

13
Caching
  • Caching and Working Sets (Level III)
  • Out-of-core rendering (loading from disk, e.g.
    Boeing)
  • In renderer and/or application (with feedback)
  • Similar characteristics to Level II
  • But latency on the order of frame times
  • Must use pre-fetching of data to avoid artifacts

14
Caching
  • Caching and Working Sets (Summary)
  • Caching generally works very well
  • RT has a surprising amount of coherence
  • Across all levels of hierarchy
  • Reduces required bandwidth
  • May fail occasionally → high peak bandwidth
  • Completely incoherent rays (no sharing)
  • Sudden changes of working set (e.g. fast
    movements)

15
Load Balancing
  • Load Balancing of Tasks (packets of rays)
  • Centrally assign tasks to processors on demand
  • Goal: keep processors busy at all times
  • Size of tasks: overhead versus granularity
  • Overhead through network bandwidth and processing
  • Small granularity: allows better distribution of
    load
  • Large granularity: not enough tasks for a large
    cluster
  • Typical on PC cluster: 32x32 to 16x16 tiles (4 packets
    queued)
  • Keeps the queue filled at 1 ms latency and 4 Mrays/s
    (see the arithmetic below)
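
The queue depth quoted above follows directly from the slide's numbers (1 ms latency, 4 Mrays/s per client, 32x32 tiles); a quick check:

```cpp
// Why ~4 queued 32x32 tiles cover the network latency: the client can finish
// about 4000 rays during one 1 ms round trip, and one tile is 1024 primary rays.
#include <cmath>
#include <cstdio>

int main() {
    const double latencySeconds = 1e-3;        // request/response round trip
    const double raysPerSecond  = 4e6;         // client ray-tracing rate
    const double raysPerTile    = 32.0 * 32.0; // one task = one 32x32 tile

    const double raysDuringLatency = latencySeconds * raysPerSecond;  // = 4000
    const double tilesNeeded = std::ceil(raysDuringLatency / raysPerTile);
    std::printf("queue depth needed: %.0f tiles\n", tilesNeeded);     // ~4
}
```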

16
Load Balancing
  • Fairness
  • Usually not an issue: enough tasks to distribute
  • 640x480 / 32x32 = 300 tasks per frame
  • At 30 processors, can tolerate a 10:1 imbalance (a
    lot!)
  • End-of-frame synchronization
  • Big worry in offline rendering: must balance each
    frame
  • Not an issue in online rendering / stream
    computing
  • Start assigning tasks from the next frame to idle
    processors
  • Even higher imbalance can be handled

17
Frame Synchronization
One frame of application latency in rtSwapBuffers()
[Timeline diagram, not to scale: while the application sends OpenRT commands
for frame n and calls rtSwapBuffers(), the OpenRT layer and the rendering
clients are still exchanging tiles and rendering frame n-1, and the display
shows frame n-2; each stage lags the one above by one frame.]
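
The pipeline in the timeline can be mimicked in a few lines: the application issues the next frame while the previous one is still being rendered, so the image that appears always trails by one frame. Everything below except the idea of rtSwapBuffers-style pipelining is a hypothetical stand-in, not the OpenRT API:

```cpp
// One-frame pipeline latency: issue commands for frame n, kick off its
// rendering asynchronously, and display the frame that finished previously.
#include <future>
#include <utility>
#include <vector>

using Image = std::vector<unsigned char>;

static void  issueCommands(int /*frame*/) {}                              // stand-in: app sends scene/camera updates
static Image renderFrame(int /*frame*/) { return Image(640 * 480 * 3); }  // stand-in: cluster renders the frame
static void  display(const Image& /*img*/) {}                             // stand-in: show the finished image

void renderLoop(int numFrames) {
    std::future<Image> inFlight;                      // frame currently being rendered
    for (int n = 0; n < numFrames; ++n) {
        issueCommands(n);                             // commands for frame n
        std::future<Image> next =
            std::async(std::launch::async, renderFrame, n);
        if (inFlight.valid())
            display(inFlight.get());                  // the "swap": frame n-1 appears now
        inFlight = std::move(next);                   // frame n keeps rendering in the background
    }
    if (inFlight.valid()) display(inFlight.get());    // flush the last frame
}
```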
18
Load Balancing
  • Criteria for task allocation
  • Allow tasks to move freely for load balancing
  • Keep adjacent tasks on same processor (caching)
  • Keep visible geometry on same processor (cache
    accumulation)
  • Must re-project samples per frame as a preprocess
  • Assign tasks to processors that already have the data
  • Large models may need to load lots of data →
    large latency
  • Must balance different criteria
  • Simple first-come-first-served is sufficient for
    most scenarios

19
Scene Updates
  • Synchronization for Scene Updates
  • May need to update the scene between frames
  • Must adhere to temporal consistency
  • All packets of the old frame must be finished, none of
    the new frame started yet
  • Approach (copy & update, sketched below):
  • Separate receiver thread updates the scene
  • Changed objects are copied before applying
    changes
  • At the end of the frame, pointers are updated → fast
    pointer copy
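
A minimal sketch of the copy & update idea, assuming shared_ptr ownership and an illustrative Object type; the end-of-frame barrier that guarantees no packets of the old frame are still in flight is outside the sketch:

```cpp
// Copy & update: incoming changes are applied to a private copy while
// rendering still reads the old object; at the frame boundary the scene
// switches to the new version with a cheap pointer swap.
#include <memory>
#include <vector>

struct Object { std::vector<float> vertices; /* ... */ };

class SceneSlot {
    std::shared_ptr<const Object> current_;   // what the renderer reads
    std::shared_ptr<Object>       pending_;   // copy being updated by the receiver thread
public:
    explicit SceneSlot(std::shared_ptr<const Object> o) : current_(std::move(o)) {}

    // Receiver thread: copy the object once, then apply changes to the copy.
    Object& beginUpdate() {
        if (!pending_) pending_ = std::make_shared<Object>(*current_);
        return *pending_;
    }

    // End of frame, all old packets finished: fast pointer swap to the new version.
    void commit() {
        if (pending_) { current_ = std::move(pending_); pending_.reset(); }
    }

    const Object& read() const { return *current_; }   // renderer-side access
};
```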

20
Networking
  • Bandwidth requirements
  • From the master:
  • Scene updates (it depends ...)
  • 2D tile coordinates (tiny)
  • Enough bandwidth left to add other data (e.g. AR
    background)
  • To the master:
  • RGB data (3 bytes per pixel → >3 MB x N fps)
  • Could use low latency compression (MJPEG, H.263,
    ...)

21
Networking
  • Available Options
  • Fast Ethernet: 10 MB/s → not really
  • Gigabit Ethernet: 100 MB/s → reasonable
    (up to 23 fps uncompressed; see the estimate below)
  • InfiniBand 4x: 1 GB/s → nice
  • InfiniBand 12x: 3 GB/s (???)
  • Proprietary:
  • SGI NUMAlink 4: max. 3.2 GB/s, 3-4x lower latency
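
A rough feel for what those link speeds mean for the uncompressed RGB tile traffic of the previous slide (>3 MB per frame); the 1280x1024 resolution below is an assumed example, and protocol overhead is ignored:

```cpp
// Upper bound on frames per second per link if every pixel is returned
// uncompressed as 3-byte RGB (scene updates and protocol overhead ignored).
#include <cstdio>

int main() {
    const double width = 1280, height = 1024, bytesPerPixel = 3;   // assumed resolution
    const double bytesPerFrame = width * height * bytesPerPixel;   // ~3.9 MB, matches ">3 MB"
    const double linkMBps[] = {10, 100, 1000, 3000};               // Fast / GigE / IB 4x / IB 12x
    const char*  name[]     = {"Fast Ethernet", "Gigabit Ethernet",
                               "InfiniBand 4x", "InfiniBand 12x"};
    for (int i = 0; i < 4; ++i)
        std::printf("%-17s ~%.0f fps\n", name[i],
                    linkMBps[i] * 1e6 / bytesPerFrame);
}
```

Under these assumptions Fast Ethernet comes out at roughly 2-3 fps and Gigabit Ethernet at around 25 fps ideal, in the same range as the "up to 23 fps" quoted above.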

22
Network Scalability
[Graph: network scalability on dual-processor AMD 1800 clients (up to 48 CPUs @ 640x480)]
23
Hybrid Distribution Model
  • Cluster of SHM machines
  • SHM: better caching and sharing of scene data
  • Single link → many parallel threads
  • Reduced bandwidth requirements with a large cache
  • Several processes per SHM machine:
  • One link per process → increased bandwidth
  • SHM leads to frame sync between threads (!)
  • Better scalability with shared memory mapping

24
Summary and Limitations
  • Ideally scalable due to independent rays
  • Limited only by bandwidth to data
  • Scene updates, cache misses, pixel data
  • Reduced bandwidth through caching
  • Needs good task scheduling and load balancing
  • Generally works very well
  • → OK at computer scale; translates to chip scale

25
Shared-memory multiprocessors
  • University of Utah implementations
  • Designed mainly for SGIs
  • Many CPUs, shared address space, frame buffer
  • Chief architect: Steve Parker
  • rtrt (1997, workhorse)
  • manta (2004, demo in SGI booth today/tomorrow)
  • Tested on up to 512 processor system

26
Shared Memory Implementation
  • Each processor can randomly access the database
  • Database must be duplicated or shared
  • Each processor can write to an arbitrary portion
    of the image
  • Image must be shared or composited later

27
Strategy: KISS
  • Break image into tiles of cache line size
  • 32x4 = 128 pixels for most of our machines
  • Share database using global address space
  • Use C, emphasize locality
  • Worry about memory more than arithmetic
  • Biggest change for us: ray packets

28
Scalability
  • Caching works well for our data
  • Almost perfect up to 32 processors
  • 91% for 512 processors (466x speedup)

29
Load Balancing
  • Work-queue based
  • Each tile becomes a unit of work
  • Work is doled out in larger chunks at the
    beginning of a frame, and smaller chunks nearer
    the end
  • Like stacking blocks (see the sketch below)
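
One way to get the "larger chunks first, smaller chunks near the end" behaviour is a guided-self-scheduling-style queue that hands each requester a fraction of the remaining tiles; the divisor below is an assumption, not necessarily what rtrt/manta use:

```cpp
// Work queue with shrinking chunk sizes: each request receives a share of
// the remaining tiles, so chunks are large at the start of the frame and
// small near the end. The chunk formula is illustrative only.
#include <algorithm>
#include <mutex>

class TileQueue {
    int next_ = 0;
    int total_, cpus_;
    std::mutex m_;
public:
    TileQueue(int totalTiles, int numCpus) : total_(totalTiles), cpus_(numCpus) {}

    // Returns the tile range [first, first + count); count == 0 when the frame is done.
    bool grab(int& first, int& count) {
        std::lock_guard<std::mutex> lock(m_);
        const int remaining = total_ - next_;
        if (remaining <= 0) { count = 0; return false; }
        count = std::max(1, remaining / (2 * cpus_));   // shrink as the frame drains
        first = next_;
        next_ += count;
        return true;
    }
};
```

Workers call grab() repeatedly and render the returned tile range; chunk sizes fall off automatically toward the end of the frame.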

30
Bottlenecks
  • Tile manager
  • Frame boundaries

[Diagram: tile assignments to CPUs 1-3 around the start of a new frame]
31
60 million triangles
32
Boeing 777 350M triangles
33
35 Million spheres
34
68 billion height grid
35
Summary
  • Shared memory multiprocessors work
  • But they are expensive
  • May give lessons for multi-core CPUs
  • Clever programming not needed
  • Encourage natural memory locality
  • For big datasets, much better than GPUs
