
1
Optimizing task and data representations

Tim Todman
Imperial College
London
18 January 2008

2
Overview

1. Outline of hArtes project
2. Task transformation
3. Case study: ray tracing
4. Data representation optimisation
5. Conclusion
Thanks to Gabriel Coutinho and William Osborne

3
1. The hArtes Project

Multimedia systems: getting larger
Increase flexibility
Reduce time to market
C + annotations as intermediate language

4
Three key compilation stages

Task Transformation
- support CPU, DSP and FPGA optimisation
- automatic task restructuring into efficient architecture
Data Representation Optimisation
- mainly for FPGA
- trade-offs in accuracy, speed, area, power consumption
Task Mapping and Scheduling
- decides which task runs on which processing element
- optimises cost metrics

Diagram labels:
Partitioning (WP2.1)
System Parameterisation and Optimisation (WP2.3.4)
Task Transformation (WP2.2.3)
Data Representation Optimisation (WP2.2.4)
Cost Estimation and Metrics (WP2.2.5)
Task Mapping and Scheduling (WP2.3.1)
Code Generation (WP2.3.3)
5
2. Exploring Task transformation

Transform a single node in task graph
Start: obvious, but may lead to inefficient design
End: non-obvious, better implementation
Source-to-source transformations from start to end
Domain-specific language to implement transforms
- Compact description of transform
- Abstract from implementation in compiler framework
- Automate housekeeping functions (e.g. Visitor pattern)

6
Requirements: CML language

Aim: compact transformation description
Describe transformations on
- Abstract Syntax Tree (AST)
- Data Flow Graph (DFG)
Support transformations specific to
- Application domain: embedded media
- Target technology: CPU, DSP, FPGA
Allow parameterisable transforms
- e.g. unrolling factor
Interface to data representation optimisation
- data representation optimisation as transform
Facilitate cost estimate, e.g. number of registers
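To make "parameterisable transform" concrete, here is a minimal C sketch of what unrolling by a factor of 4 does to a simple summation loop. The function names and the chosen factor are assumptions for illustration only; this is not CML output or part of the hArtes tools.

```c
#include <stddef.h>

/* Original loop: one addition per iteration. */
int sum_rolled(const int *a, size_t n) {
    int acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += a[i];
    return acc;
}

/* Same loop unrolled by 4; the factor is the transform's parameter. */
int sum_unrolled4(const int *a, size_t n) {
    int acc = 0;
    size_t i = 0;
    /* Main body: four elements per iteration. */
    for (; i + 4 <= n; i += 4)
        acc += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        acc += a[i];
    return acc;
}
```

Both functions return the same sum; a larger factor trades code size (or hardware area) for fewer loop iterations.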

7
2. CML design flow
Partitioning
Code (C + OpenMP)
Requirements
Library of transformations (CML)
Transform engine
Data representation transform
Code (C + OpenMP)
Cost estimate
Partitioning / code generation
8
CML for task transformations

Basic CML: 3 parts to a transform
- Pattern: syntax to match, label elements
- Conditions based on dataflow
- Resulting pattern to substitute
Proposed novel aspects of extended CML
- Systematic description of dataflow conditions
- Parameterised transforms
- Features for labelling subpatterns
- Probabilities for machine learning
Extend CML: code matching DFGs
- s1 -> s2 matches true dependence arc from s1 to s2
- s1 -/> s2 matches antidependence arc from s1 to s2
- s1 -@-> s2 matches output dependence arc from s1 to s2
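The three kinds of dependence arc can be illustrated with a small C sketch. It assumes each statement is summarised by bitmasks of the variables it reads and writes, and that s1 executes before s2. The `Stmt` type, the flag names and `classify_dependence` are hypothetical; they are not part of CML. The sketch also ignores intervening writes that would kill a dependence.

```c
#include <stdint.h>

/* Hypothetical statement summary: bit i set means variable i is used. */
typedef struct {
    uint32_t reads;
    uint32_t writes;
} Stmt;

enum { DEP_TRUE = 1, DEP_ANTI = 2, DEP_OUTPUT = 4 };

/* Classify the dependences from s1 to s2, where s1 executes first. */
int classify_dependence(Stmt s1, Stmt s2) {
    int deps = 0;
    if (s1.writes & s2.reads)  deps |= DEP_TRUE;   /* read after write  */
    if (s1.reads  & s2.writes) deps |= DEP_ANTI;   /* write after read  */
    if (s1.writes & s2.writes) deps |= DEP_OUTPUT; /* write after write */
    return deps;
}
```

For example, with `a = b + c` as s1, a later `d = a * 2` gives a true dependence, `b = 5` gives an antidependence, and `a = 7` gives an output dependence.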

9
Related work

CoSy compiler framework
build compiler for new architectures
cost criteria for instruction selection
Machine learning (Mike O'Boyle)
find optimal set of transforms (from a small library)
possible use in task transformation
transform ordering: add suggested following transforms
initial probability for machine learning

10
3. Case study: ray tracing

A classical computer graphics algorithm
Also has applications in
- Seismology
- Acoustics
Strengths: photorealistic, global illumination
Weaknesses: diffuse reflections, soft shadows
Very computationally expensive
Sublinear time complexity

11
Example images
[Images: from POV-Ray (hof.povray.org); our raytracer]
12
Ray tracing characteristics

Very processor intensive
Naturally recursive
Massively parallel: each pixel is independent
For each pixel, rays depend on results of previous rays
Ray-object intersection calculations dominate the computation time
Ray-object intersection calculations are relatively simple

13
Basic algorithm

Trace rays for each screen pixel
If ray hits object
- Trace to light sources (shadow rays)
- Trace reflection
- Trace refraction
End tracing when below threshold

[Diagram labels: Camera, Object, Light source, Shadow ray, Reflected ray, Refracted ray]
14
Dataflow in ray tracing

Bank 1: Ray directions d
Bank 2: Ray start points s
Bank 3: Sphere C, R2
FPGA: Ray-object intersector (with sqrt), inputs d, s, C, R2
Bank 4: Intersection results dist1, dist2
15
Dataflow: software and hardware

Software: process results of batch n; generate batch n+1 in shared memory
Software writes:
Bank 1: Ray directions d
Bank 2: Ray start points s
Bank 3: Sphere C, R2
Hardware: FPGA
Software reads:
Bank 4: Intersection results dist1, dist2
16
Call graph of depth-first ray tracing (after Heckbert)

Main
Screen: generate primary rays
Trace: if intersection then shade point
Shade: test for shadows, compute illumination, recurse for reflection and refraction
Intersect
Edge labels: Secondary rays, Shadow rays
17
Depth-first: poor match for hardware

Ray-object intersection calculations tightly coupled to rest of algorithm
Hardware called for small batches of rays
- Limits pipelining
- Most time spent communicating over bus
Solution: transform algorithm
- Marshal batches of independent rays together
- Runs much slower in software, but much faster in hardware

18
Call graph of breadth-first ray tracing
Main
Calculate pixel colours: visit each ray tree
Trace rays
Add rays to buffer: visit ray tree roots in order
Intersect ray batch
Process intersection results
Calculate final colour: traverse ray tree
Process ray results
Edge label: Secondary rays
19
Automate the restructuring

Aim: depth-first to breadth-first algorithm
Restructure to intersect rays in large batches
Standard passes
- Hoisting initialisation
- Loop interchange
- Index normalisation
Custom passes: specific recursive structure to iteration
- Arrays replace stacks
- For-loops with extra variables as guards
Custom passes: strip mining of rays from data structure
- Marshal into batches for intersection
- Parameterise by hardware buffer size
- Split loops to separate buffer fill, intersect and read back
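The strip-mining and loop-splitting steps can be sketched in C as follows. Each strip is processed in three separate loops (fill, intersect, read back), with the batch size as a parameter. The names `process_in_batches`, `intersect_batch` and `MAX_BATCH` are hypothetical. The stand-in "hardware" simply doubles each value so the sketch stays self-contained; it is not the real intersector interface.

```c
#include <stddef.h>

#define MAX_BATCH 4096  /* assumed upper bound on the hardware buffer size */

/* Hypothetical stand-in for the hardware call: doubles each input. */
static void intersect_batch(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = 2.0f * in[i];
}

/* Process n_rays items in strips of batch_size; returns the batch count,
 * or 0 if batch_size is invalid. */
size_t process_in_batches(const float *rays, float *results,
                          size_t n_rays, size_t batch_size) {
    float in_buf[MAX_BATCH], out_buf[MAX_BATCH];
    size_t n_batches = 0;
    if (batch_size == 0 || batch_size > MAX_BATCH)
        return 0;
    for (size_t base = 0; base < n_rays; base += batch_size) {
        size_t left = n_rays - base;
        size_t n = left < batch_size ? left : batch_size;
        for (size_t i = 0; i < n; i++)      /* 1. fill the buffer */
            in_buf[i] = rays[base + i];
        intersect_batch(in_buf, out_buf, n); /* 2. one call per batch */
        for (size_t i = 0; i < n; i++)      /* 3. read results back */
            results[base + i] = out_buf[i];
        n_batches++;
    }
    return n_batches;
}
```

With 10 rays and a batch size of 4, the function makes 3 batch calls instead of 10 individual ones.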

20
Performance estimate

Hardware: 16 MHz, result every three cycles
- 5 x 10^6 intersections per second (ips)
Software: 2 x 10^6 ips
100 MB/s bus with 10 ms startup latency
Application
- 640 by 480 pixels, 20 objects
Depth-first
- 6140 seconds
- needs bus read / write per pixel
Breadth-first
- Batch size 1024 => 6.8 seconds
- Batch size 10240 => 1.3 seconds
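The bus start-up latency alone explains most of these figures, as the C sketch below suggests. It assumes one write and one read transfer per pixel for depth-first, one write and one read per batch for breadth-first, and one ray per pixel. These are a reading of the slide, not something it states, and the function names are hypothetical.

```c
/* Total start-up cost in seconds for a number of bus transfers. */
double startup_cost_s(long transfers, double latency_s) {
    return (double)transfers * latency_s;
}

/* Number of batches needed to cover `rays` items (rounds up). */
long batches_needed(long rays, long batch) {
    return (rays + batch - 1) / batch;
}

/* Assumed model: one write plus one read per hardware call. */
double depth_first_latency_s(long pixels, double latency_s) {
    return startup_cost_s(2 * pixels, latency_s);
}

double breadth_first_latency_s(long rays, long batch, double latency_s) {
    return startup_cost_s(2 * batches_needed(rays, batch), latency_s);
}
```

For 640 x 480 = 307200 pixels and 10 ms latency, the depth-first cost is 307200 x 2 x 0.01 = 6144 s, in line with the 6140 seconds quoted. Under the same assumptions, breadth-first latency is 6.0 s for batches of 1024 (300 batches) and 0.6 s for batches of 10240 (30 batches). The remaining time in the quoted 6.8 s and 1.3 s is presumably computation and data transfer, which this sketch does not model.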

21
4. Data representation optimization
[Chart: uniform vs variable word-lengths; axes: Design, Area / Slices]

Independent of input data

22
Static word-length optimization

Multi-stage
- Range analysis
- Low-effort
- High-effort
Guaranteed accuracy
- Reduce area
- Increase speed
- Reduce power consumption

23
Our approach

Range analysis
- Interval, Affine, Instrumentation
Precision analysis
- Partitioned, heuristic algorithm
Accuracy
- Genetic Algorithm increases accuracy
Extension: dynamic analysis
- Input range analysis
- Black-box function analysis
- Branch analysis

24
Range analysis instrumentation

Loops
- instrument code
- calculate number of iterations
Benefits
- Increased accuracy
- Reduced area
Interval Arithmetic
- Simplistic
- no correlation information
Affine Arithmetic
- correlation information used
- can produce misleading results

Original loop:
while (acc < in_x) { acc = acc + 1; }
Instrumented loop:
while (acc < in_x) { analyze_loop(loop_1); acc = acc + 1; } analyze_end(loop_1);
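The "no correlation information" limitation of interval arithmetic shows up in a few lines of C. The `Interval` type and the `iv_add` / `iv_sub` helpers are hypothetical names for this sketch. They also skip outward rounding, which a real range analyser would need.

```c
typedef struct { double lo, hi; } Interval;

/* [a.lo, a.hi] + [b.lo, b.hi] */
Interval iv_add(Interval a, Interval b) {
    Interval r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

/* [a.lo, a.hi] - [b.lo, b.hi]: the operands are treated as independent. */
Interval iv_sub(Interval a, Interval b) {
    Interval r = { a.lo - b.hi, a.hi - b.lo };
    return r;
}
```

For x in [0, 1], `iv_sub(x, x)` yields [-1, 1] even though x - x is always 0. Affine arithmetic keeps track of the fact that both operands are the same quantity and so avoids this over-estimate.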
25
Ray tracing accuracy vs precision
26
Trade-offs