Application Exploration Key Learnings

Christopher Rodrigues, Sara Sadeghi, Christopher Kung, John Stratton,
Ian Steiner, Sain-Zee Ueng, Shane Ryoo, Wen-Mei Hwu

GSRC Soft Systems
Abstract

- Media applications are compute-intensive, soft real-time applications. In single-threaded form, they can be highly demanding on both compute speed and memory bandwidth. Yet at the algorithmic level they are parallel, opening the possibility of using parallel computation to run them on cheap, general-purpose hardware.
- We know that media codecs are parallel by design. But what needs to be done to a common, sequential implementation in order to make it parallel? To answer this question, we hand-parallelized several implementations of common media coders and decoders.

LAME has four encoding modes: Constant Bit Rate, Average Bit Rate, Variable Bit Rate with reservoir enabled, and Variable Bit Rate with reservoir disabled.

Key Observations
- Several well-known parallelizing transformations can be applied to a program:
  - Data Distribution, Pipelining, Task Distribution
- These transformations depend on the program being in the right form:
  - No data or control dependences between the elements to parallelize (either loop iterations or statements)
- Enabling transformations eliminate benign dependences
- Identify latent parallelism, then expose it with enabling transformations
- Goal is to make transformations realizable in the compiler, guided by analysis and/or programmer interaction
- Other issues are still obstacles in some programs:
  - Recursive data structures
  - Unfamiliar forms of control
Constant and Average Bit Rate modes have a low level of available parallelism.
Data flow suggests fission
Control Flow
Data flow properties maintained
There is a similar dependency due to the static variable resv_size in Variable Bit Rate mode, unless the user chooses to disable the reservoir at invocation.
ISOLATING CODE PATHS (LAME)
Data Flow
PRIVATIZATION (MP3Dec)
Loop Fission
Side exit redirected to IDCT loop
Side exit prevents fission
Critical Recurrence in Precomputation
Iterations reflect side exit
OBSTACLES (JPEG)
Front End
Back End
The Precomputation block depends on previous invocations of synth_1to1.
Data buffers need to be saved for the Compute Intensive block.
The Compute Intensive block depends only on Precomputation.
Program Dependence Graphs
High-level
Detailed
- Exposes high-level parallelism for one user option (maximum performance)
- Does not increase performance of other code paths
- Increases code footprint

Problem complexity does not scale linearly when pursuing high levels of parallelism.
Static: dependence between calls
Narrowing b0 selection using pointers
RETILING (MPEG)
Initial transformations prepare loops for parallelization:
- Re-initialize parts of the buffs array
- Not possible to prove that previous writes are killed
Execution Ordering
Fuse
Data Distribute
Source Code
Retile
for (i = 0; i < blocks; i++) mv[i] = ...;
Motion Estimation
mv
Writes in the second loop are not in the same order as reads in the third loop. Retiling the third loop makes the access patterns match, allowing the loops to execute in parallel with respect to each other.
- Retiling is the key to including the subtraction loop in the parallelized loop.
- Choosing a useful retiling is nontrivial:
  - Multiple simultaneous constraints
  - Multiple array accesses per loop (e.g., from hand-unrolling)
  - More than one array carrying data from one loop to the next
  - Loop-carried dependences
Use of b0, part of buffs array
for (i = 0; i < blocks; i++)   /* Use mv[i] */
    for (x = 0; x < 16; x++)
        for (y = 0; y < 16; y++)
            pred[i*256 + x + y*16] = ...;
Motion Compensation
Access Patterns
pred
- Only part of the buffs array is used in the Compute Intensive block
- Minimize the amount of data that is privatized to reduce fission overhead
Subtraction
for (n = 0; n < 256*blocks; n++) diff[n] = orig[n] - pred[n];
Retile