Title: Optimizing task and data representations
- Tim Todman
- Imperial College
- London
- 18 January 2008
Overview
- 1. Outline of hArtes project
- 2. Task transformation
- 3. Case study: ray tracing
- 4. Data representation optimisation
- 5. Conclusion
- Thanks to Gabriel Coutinho and William Osborne
1. The hArtes Project
- Multimedia systems getting larger
- Increase flexibility
- Reduce time to market
- C annotations as intermediate language
Three key compilation stages
- Task Transformation
- support CPU, DSP and FPGA optimisation
- automatic task restructuring into efficient architecture
- Data Representation Optimisation
- mainly for FPGA
- trade-offs in accuracy, speed, area, power consumption
- Task Mapping and Scheduling
- decides which task runs on which processing element
- optimises cost metrics
Diagram labels:
Partitioning (WP2.1)
System Parameterisation and optimisation (WP2.3.4)
Task Transformation (WP2.2.3)
Data Representation Optimisation (WP2.2.4)
Cost Estimation and Metrics (WP2.2.5)
Task Mapping and Scheduling (WP2.3.1)
Code Generation (WP2.3.3)
2. Exploring task transformation
- Transform a single node in task graph
- Start: obvious, but may lead to inefficient design
- End: non-obvious, better implementation
- Source-to-source transformations from start to end
- Domain-specific language to implement transforms
- Compact description of transform
- Abstract from implementation in compiler framework
- Automate housekeeping functions (e.g. Visitor pattern)
Requirements: CML language
- Aim: compact transformation description
- Describe transformations on
- Abstract Syntax Tree (AST)
- Data Flow Graph (DFG)
- Support transformations specific to
- Application domain: embedded media
- Target technology: CPU, DSP, FPGA
- Allow parameterisable transforms
- e.g. unrolling factor
- Interface to data representation optimisation
- data representation optimisation as transform
- Facilitate cost estimate, e.g. number of registers
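As an illustration of a parameterisable transform, the sketch below shows a loop before and after unrolling by a factor of four. The function names and the UNROLL constant are hypothetical, and the slides do not show how CML itself would express this transform.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical unrolling factor; a parameterised transform would take this
   as an input. The unrolled body below is written out for a factor of 4. */
#define UNROLL 4

/* Original loop: one addition per iteration. */
long sum_rolled(const int *a, size_t n) {
    assert(a != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* After unrolling: the main loop handles UNROLL elements per iteration and
   an epilogue loop handles the n % UNROLL leftover elements. */
long sum_unrolled(const int *a, size_t n) {
    assert(a != NULL || n == 0);
    long sum = 0;
    size_t i = 0;
    for (; i + UNROLL <= n; i += UNROLL) {
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    for (; i < n; i++)
        sum += a[i];
    return sum;
}
```

A larger factor gives the hardware more work per iteration at the cost of more registers, which is the kind of cost estimate the last requirement refers to.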
2. CML design flow
Partitioning
Code (C+OpenMP)
Requirements
Library of transformations (CML)
Transform engine
Data representation transform
Code (C+OpenMP)
Cost estimate
Partitioning / code generation
CML for task transformations
- Basic CML: 3 parts to a transform
- Pattern: syntax to match, label elements
- Conditions: based on dataflow
- Result: pattern to substitute
- Proposed novel aspects of extended CML
- Systematic description of dataflow conditions
- Parameterised transforms
- Features for labelling subpatterns
- Probabilities for machine learning
- Extend CML: code matching DFGs
- s1 -> s2 matches true dependence arc from s1 to s2
- s1 -/> s2 matches antidependence arc from s1 to s2
- s1 -@-> s2 matches output dependence arc from s1 to s2
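The three kinds of dependence arc can be illustrated with plain C statements. This is only a sketch of what each arc means; the labels s1 and s2 follow the slide, while the function and variable names are made up for illustration.

```c
#include <assert.h>

/* True (flow) dependence: s2 reads the value that s1 wrote. */
int true_dependence(int b) {
    int a = b + 1;      /* s1: writes a */
    int d = a * 2;      /* s2: reads a, so it must run after s1 */
    return d;
}

/* Antidependence: s2 overwrites a variable that s1 reads. */
int anti_dependence(int y) {
    int x = y + 1;      /* s1: reads y */
    y = 0;              /* s2: writes y, so it must not move above s1 */
    return x + y;
}

/* Output dependence: s1 and s2 write the same variable. */
int output_dependence(void) {
    int z;
    z = 1;              /* s1: writes z */
    z = 2;              /* s2: writes z; their order decides the final value */
    return z;
}
```

A transform that reorders s1 and s2 is only safe when none of these arcs connects them, which is why the conditions part of a transform needs to match on them.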
Related work
- CoSy compiler framework
- build compiler for new architectures
- cost criteria for instruction selection
- Machine learning (Mike O'Boyle)
- find optimal set of transforms (from a small library)
- possible use in task transformation
- transform ordering: add suggested following transforms
- initial probability for machine learning
3. Case study: ray tracing
- A classical computer graphics algorithm
- Also has applications in
- Seismology
- Acoustics
- Strengths: photorealistic, global illumination
- Weaknesses: diffuse reflections, soft shadows
- Very computationally expensive
- Sublinear time complexity
Example images
[Image: from POV-Ray (hof.povray.org)]
[Image: our raytracer]
Ray tracing characteristics
- Very processor intensive
- Naturally recursive
- Massively parallel: each pixel is independent
- For each pixel, rays depend on results of previous rays
- Ray-object intersection calculations dominate the computation time
- Ray-object intersection calculations are relatively simple
Basic algorithm
- Trace rays for each screen pixel
- If ray hits object
- Trace to light sources (shadow rays)
- Trace reflection
- Trace refraction
- End tracing when below threshold
[Diagram labels: Camera, Object, Light source, Shadow ray, Reflected ray, Refracted ray]
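Dataflow in ray tracing
[Diagram: an FPGA datapath, including a sqrt unit, takes inputs d, s, C and R2 from banks 1 to 3 and produces dist1 and dist2 in bank 4]
Bank 1: Ray directions d
Bank 2: Ray start points s
Bank 3: Sphere C, R2
Bank 4: Intersection results dist1, dist2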
Dataflow: software and hardware
Software
- Process results of batch n
- Generate batch n+1 in shared memory
- Write: Bank 1 (ray directions d), Bank 2 (ray start points s), Bank 3 (sphere C, R2)
- Read: Bank 4 (intersection results dist1, dist2)
Hardware
- FPGA
Call graph of depth-first ray tracing (after Heckbert)
Main
- Screen: generate primary rays
- Trace: if intersection then shade point
- Shade: test for shadows, compute illumination, recurse for reflection and refraction
- Intersect
[Edge labels: Secondary rays, Shadow rays]
Depth-first: poor match for hardware
- Ray-object intersection calculations tightly coupled to rest of algorithm
- Hardware called for small batches of rays
- Limits pipelining
- Most time spent communicating over bus
- Solution: transform algorithm
- Marshal batches of independent rays together
- Runs much slower in software, but much faster in hardware
Call graph of breadth-first ray tracing
Main
- Calculate pixel colours: visit each ray tree
- Trace rays
- Add rays to buffer: visit ray tree roots in order
- Intersect ray batch
- Process intersection results
- Calculate final colour: traverse ray tree
- Process ray results
[Edge label: Secondary rays]
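The difference between the two call graphs can be sketched with a toy model in which a ray is reduced to its contribution weight, and every ray spawns a reflected and a refracted ray until the weight falls below a threshold. The attenuation factors, the MAX_RAYS capacity and all names are assumptions made for illustration; no real intersection or shading is performed.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RAYS 1024   /* hypothetical capacity of one generation buffer */

/* Depth-first: recursion follows each ray tree branch to its end.
   Returns the number of rays traced. */
int trace_depth_first(double weight, double k_reflect, double k_refract,
                      double threshold) {
    assert(k_reflect >= 0.0 && k_reflect < 1.0);
    assert(k_refract >= 0.0 && k_refract < 1.0 && threshold > 0.0);
    if (weight < threshold)
        return 0;
    return 1
        + trace_depth_first(weight * k_reflect, k_reflect, k_refract, threshold)
        + trace_depth_first(weight * k_refract, k_reflect, k_refract, threshold);
}

/* Breadth-first: all rays of one generation sit together in an array, so
   the whole batch could be handed to an intersection unit at once.
   Secondary rays are added to the buffer for the next generation. */
int trace_breadth_first(double weight, double k_reflect, double k_refract,
                        double threshold) {
    assert(k_reflect >= 0.0 && k_reflect < 1.0);
    assert(k_refract >= 0.0 && k_refract < 1.0 && threshold > 0.0);
    double current[MAX_RAYS], next[MAX_RAYS];
    size_t n_current = 0;
    int traced = 0;
    if (weight >= threshold)
        current[n_current++] = weight;
    while (n_current > 0) {
        size_t n_next = 0;
        traced += (int)n_current;   /* the batch would be intersected here */
        for (size_t i = 0; i < n_current; i++) {
            double child[2] = { current[i] * k_reflect,
                                current[i] * k_refract };
            for (int j = 0; j < 2; j++) {
                if (child[j] >= threshold) {
                    assert(n_next < MAX_RAYS);
                    next[n_next++] = child[j];
                }
            }
        }
        for (size_t i = 0; i < n_next; i++)
            current[i] = next[i];
        n_current = n_next;
    }
    return traced;
}
```

Both functions visit the same set of rays; only the order changes, and the breadth-first order is what makes large batches available.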
Automate the restructuring
- Aim: depth-first to breadth-first algorithm
- Restructure to intersect rays in large batches
- Standard passes
- Hoisting initialisation
- Loop interchange
- Index normalisation
- Custom passes: specific recursive structure to iteration
- Arrays replace stacks
- For-loops with extra variables as guards
- Custom passes: strip mining of rays from data structure
- Marshal into batches for intersection
- Parameterise by hardware buffer size
- Split loops to separate buffer fill, intersect and read back
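The strip-mining and loop-splitting steps can be sketched as follows. This is a minimal illustration, assuming a hypothetical BUFFER_SIZE and a software stand-in for the hardware intersection unit (it merely squares its input); rays are reduced to single floats to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define BUFFER_SIZE 4   /* hypothetical hardware buffer size, in rays */

/* Stand-in for the hardware intersection unit: processes a whole batch.
   A real system would hand the buffer to the FPGA instead. */
static void intersect_batch(const float *in, float *out, size_t count) {
    for (size_t i = 0; i < count; i++)
        out[i] = in[i] * in[i];
}

/* Strip-mined loop over n rays: the single loop over all rays becomes an
   outer loop over strips of at most BUFFER_SIZE rays, and the work on each
   strip is split into three separate steps. */
void process_rays(const float *rays, float *results, size_t n) {
    assert((rays != NULL && results != NULL) || n == 0);
    float in_buf[BUFFER_SIZE], out_buf[BUFFER_SIZE];
    for (size_t base = 0; base < n; base += BUFFER_SIZE) {
        size_t count = (n - base < BUFFER_SIZE) ? n - base : BUFFER_SIZE;
        for (size_t i = 0; i < count; i++)        /* 1. buffer fill */
            in_buf[i] = rays[base + i];
        intersect_batch(in_buf, out_buf, count);  /* 2. intersect */
        for (size_t i = 0; i < count; i++)        /* 3. read back */
            results[base + i] = out_buf[i];
    }
}
```

Because the buffer size is a single parameter, the same transformed code can be retargeted to hardware with a different buffer capacity.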
Performance estimate
- Hardware: 16 MHz, result every three cycles
- 5 x 10^6 intersections per second (ips)
- Software: 2 x 10^6 ips
- 100 MB/s bus with 10 ms startup latency
- Application
- 640 by 480 pixels, 20 objects
- Depth-first
- 6140 seconds
- needs bus read / write per pixel
- Breadth-first
- Batch size 1024: 6.8 seconds
- Batch size 10240: 1.3 seconds
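The slide does not give the model behind these figures. The sketch below computes only the bus startup cost, under the assumption that every batch needs one write and one read and that each pays the 10 ms startup latency once. Compute time and data transfer time are left out, so it is not expected to reproduce the estimates exactly; the function name is hypothetical.

```c
#include <assert.h>

/* Bus startup cost only: one write and one read per batch, each paying the
   startup latency once. This formula is an assumption, not the model used
   for the figures on the slide. */
double bus_startup_seconds(long rays, long batch_size, double latency_s) {
    assert(rays > 0 && batch_size > 0 && latency_s >= 0.0);
    long batches = (rays + batch_size - 1) / batch_size;  /* round up */
    return 2.0 * (double)batches * latency_s;
}
```

With 640 x 480 = 307200 primary rays and a 10 ms latency, this gives about 6144 s for a batch size of 1, 6.0 s for a batch size of 1024 and 0.6 s for a batch size of 10240, which suggests that startup latency accounts for most of the gap between the depth-first and breadth-first estimates.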
4. Data representation optimization
Uniform vs variable word-lengths
[Chart: area (slices) for each design]
- Independent of input data
Static word-length optimization
- Multi-stage
- - Range analysis
- - Low-effort
- - High-effort
- Guaranteed accuracy
- - Reduce area
- - Increase speed
- - Reduce power consumption
Our approach
- Range analysis
- Interval, affine, instrumentation
- Precision analysis
- Partitioned, heuristic algorithm
- Accuracy
- Genetic algorithm increases accuracy
- Extension: dynamic analysis
- Input range analysis
- Black-box function analysis
- Branch analysis
Range analysis: instrumentation
- Loops
- instrument code
- calculate number of iterations
- Benefits
- Increased accuracy
- reduced area
- Interval Arithmetic
- Simplistic
- no correlation information
- Affine Arithmetic
- correlation information used
- can produce misleading results
Original loop:
    while (acc < in_x) acc = acc + 1;
Instrumented loop:
    while (acc < in_x) {
        analyze_loop(loop_1);
        acc = acc + 1;
    }
    analyze_end(loop_1);
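The point about correlation information can be seen in the expression x - x. The sketch below compares interval arithmetic with an affine form that is deliberately restricted to a single noise symbol; full affine arithmetic keeps one coefficient per noise symbol. The type and function names are illustrative only.

```c
#include <assert.h>

/* Interval arithmetic: each value is a range [lo, hi]. */
typedef struct { double lo, hi; } Interval;

Interval interval_sub(Interval a, Interval b) {
    assert(a.lo <= a.hi && b.lo <= b.hi);
    Interval r = { a.lo - b.hi, a.hi - b.lo };
    return r;
}

/* Affine form with ONE noise symbol e in [-1, 1]: value = centre + coeff * e.
   Both operands of affine_sub are assumed to share that noise symbol. */
typedef struct { double centre, coeff; } Affine1;

Affine1 affine_sub(Affine1 a, Affine1 b) {
    Affine1 r = { a.centre - b.centre, a.coeff - b.coeff };
    return r;
}

/* Range covered by an affine form. */
Interval affine_range(Affine1 a) {
    double radius = (a.coeff < 0.0) ? -a.coeff : a.coeff;
    Interval r = { a.centre - radius, a.centre + radius };
    return r;
}
```

For x in [1, 3], interval arithmetic reports x - x as [-2, 2] because it treats the two operands as unrelated, whereas the affine form cancels the shared noise term and reports exactly 0. A tighter range means fewer integer bits and hence less area.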
Ray tracing: accuracy vs precision
Trade-offs
- Reduce precision
- reduce area
- higher speed
- reduce power consumption
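The cost of reducing precision is rounding error. As a rough illustration, the sketch below rounds a non-negative value to a fixed-point grid with a given number of fractional bits and reports the resulting error; the function names are hypothetical and the rounding mode (round to nearest) is an assumption.

```c
#include <assert.h>

/* Round a non-negative value to a fixed-point grid with frac_bits
   fractional bits. Assumes x * 2^frac_bits fits in a long long. */
double quantise(double x, int frac_bits) {
    assert(x >= 0.0 && frac_bits >= 0 && frac_bits < 31);
    double scale = (double)(1L << frac_bits);
    long long steps = (long long)(x * scale + 0.5);  /* round to nearest */
    return (double)steps / scale;
}

/* Absolute error introduced by the quantisation. */
double quantisation_error(double x, int frac_bits) {
    double e = quantise(x, frac_bits) - x;
    return (e < 0.0) ? -e : e;
}
```

With round to nearest, the error is at most half of one step, that is 2^-(frac_bits + 1), so each fractional bit removed doubles the worst-case error while saving hardware.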
5. Conclusion
- Task Transformations
- Automation of transformation
- Proposed extensions to CML
- Case study: ray tracing
- Map ray tracing to hardware
- Manually use breadth-first transform for best use of slow bus
- Estimate hardware for complex scenes
- Data Representation Optimisation
- Interval arithmetic, affine arithmetic
- Instrumented source code
- Partitioned heuristic algorithm
- Same accuracy, 200 times faster
- Applications: ray tracing, molecular dynamics, string simulations