Application Exploration Key Learnings
Christopher Rodrigues, Sara Sadeghi, Christopher Kung, John Stratton, Ian Steiner, Sain-Zee Ueng, Shane Ryoo, Wen-Mei Hwu
GSRC Soft Systems
Abstract
  • Media applications are compute-intensive, soft real-time applications. In single-threaded form, they can be highly demanding on both compute speed and memory bandwidth. Yet at the algorithmic level they are parallel, opening the possibility of using parallel computation to run them on cheap, general-purpose hardware.
  • We know that media codecs are parallel by design. But what needs to be done to a common, sequential implementation in order to make it parallel? To answer this question, we hand-parallelized some implementations of common media coders and decoders.

Key Observations
  • Several well-known parallelizing transformations can be applied to a program:
      • Data Distribution, Pipelining, Task Distribution
  • These transformations depend on the program being in the right form:
      • No data or control dependences between the elements to parallelize (either loop iterations or statements)
  • Enabling transformations eliminate benign dependences:
      • Identify latent parallelism, then expose it with enabling transformations
  • Goal is to make transformations realizable in the compiler, guided by analysis and/or programmer interaction
  • Other issues are still obstacles in some programs:
      • Recursive data structures
      • Unfamiliar forms of control

ISOLATING CODE PATHS (LAME)
LAME has four encoding modes: Constant Bit Rate, Average Bit Rate, Variable Bit Rate with reservoir enabled, and Variable Bit Rate with reservoir disabled. Constant and Average Bit Rate have a low level of available parallelism. There is a similar dependency due to the static variable resv_size in Variable Bit Rate, unless the user chooses to disable the reservoir at invocation.

PRIVATIZATION (MP3Dec)
Program Dependence Graphs: High-level, Detailed
'Critical Recurrence' in Precomputation
Precomputation block depends on the previous invocations of synth_1to1.
Data buffers need to be saved for Compute Intensive.
Compute Intensive only depends on Precomputation.

OBSTACLES (JPEG)
Front End
Back End
Loop Fission
Control Flow
Data Flow
Data flow suggests fission
Side exit prevents fission
Side exit redirected to IDCT loop
Iterations reflect side exit
Data flow properties maintained
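The JPEG labels describe a loop whose side exit blocks fission until the exit is redirected to the IDCT loop and the IDCT iteration count reflects it. The following is a minimal sketch of one way to read those labels, not the JPEG decoder's code: `decode_block`, `idct_block`, `NBLK`, and the negative-value exit condition are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

#define NBLK 6   /* hypothetical number of blocks */

/* Hypothetical front end: returns 0 to signal the side exit. */
static int decode_block(const int *in, int i, int *coef) {
    if (in[i] < 0) return 0;          /* assumed exit condition */
    coef[i] = in[i] * 2;
    return 1;
}

/* Hypothetical back end standing in for the IDCT. */
static void idct_block(const int *coef, int i, int *out) {
    out[i] = coef[i] + 1;
}

/* Original shape: the side exit sits between the two stages,
   so the loop cannot simply be split in two. */
int decode_fused(const int *in, int *out) {
    int coef[NBLK];
    int i;
    assert(in != NULL && out != NULL);
    for (i = 0; i < NBLK; i++) {
        if (!decode_block(in, i, coef)) break;
        idct_block(coef, i, out);
    }
    return i;
}

/* After fission: the exit from the first loop leads into the second
   loop, whose trip count records how far the first loop got. */
int decode_fissioned(const int *in, int *out) {
    int coef[NBLK];
    int done = 0;
    assert(in != NULL && out != NULL);
    while (done < NBLK && decode_block(in, done, coef))
        done++;                       /* sequential front end */
    for (int i = 0; i < done; i++)    /* iterations are independent */
        idct_block(coef, i, out);
    return done;
}
```

Both functions process the same blocks and return the same count; only the second form leaves a loop with no dependences between iterations.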
Effects of isolating code paths in LAME:
  • Exposes high-level parallelism for one user option for maximum performance
  • Does not increase performance of other code paths
  • Increases code footprint

Problem complexity does not scale linearly for achieving high levels of parallelism.
Static. Dependence between calls
Narrowing b0 selection using pointers
  • Re-initialize parts of buffs array
  • Not possible to prove that previous writes are killed

RETILING (MPEG)
Initial transformations prepare loops for parallelization.

Execution Ordering
Fuse
Data Distribute
Source Code
Retile
Motion Estimation
for (i = 0; i < blocks; i++) mv[i] = ...
mv
Writes in the second loop are not in the same
order as reads in the third loop. Retiling the
third loop makes the access patterns match,
allowing the loops to execute in parallel with
respect to each other.
  • Retiling is the key to including the subtraction loop in the parallelized loop.
  • Choosing a useful retiling is nontrivial:
      • Multiple simultaneous constraints
      • Multiple array accesses per loop (e.g., from hand-unrolling)
      • More than one array carrying data from one loop to the next
      • Loop-carried dependences
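A minimal sketch of the retiling idea, assuming the motion-compensation loop writes `pred` at `i*256 + x + y*16` and the subtraction loop reads `pred` linearly. `BLOCKS`, `NELEMS`, the function names, and the stand-in predictor `i + x * y` are illustrative assumptions. The real loops may need a coarser tiling than the element-level match used here.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCKS 3                 /* hypothetical block count */
#define NELEMS (256 * BLOCKS)

/* Before: the producer writes pred in (i, x, y) order and the
   consumer reads it linearly, so the two loops cannot be fused. */
void unfused(const int *orig, int *pred, int *diff) {
    assert(orig != NULL && pred != NULL && diff != NULL);
    for (int i = 0; i < BLOCKS; i++)
        for (int x = 0; x < 16; x++)
            for (int y = 0; y < 16; y++)
                pred[i * 256 + x + y * 16] = i + x * y;  /* stand-in predictor */
    for (int n = 0; n < NELEMS; n++)
        diff[n] = orig[n] - pred[n];
}

/* After: the subtraction is retiled to visit elements in the same
   (i, x, y) order and fused into the producer loop nest. */
void retiled_fused(const int *orig, int *pred, int *diff) {
    assert(orig != NULL && pred != NULL && diff != NULL);
    for (int i = 0; i < BLOCKS; i++)      /* blocks are now independent */
        for (int x = 0; x < 16; x++)
            for (int y = 0; y < 16; y++) {
                int n = i * 256 + x + y * 16;
                pred[n] = i + x * y;
                diff[n] = orig[n] - pred[n];
            }
}
```

Once the access patterns match, each iteration of the outer `i` loop touches only its own 256 elements of `pred` and `diff`. That independence is what lets the subtraction join the parallelized loop.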

Use of b0, part of buffs array
Motion Compensation
for (i = 0; i < blocks; i++)        /* Use mv[i] */
  for (x = 0; x < 16; x++)
    for (y = 0; y < 16; y++)
      pred[i*256 + x + y*16] = ...
Access Patterns
pred
  • Only part of buffs array used in Compute Intensive block
  • Minimize the amount of data that is privatized to reduce fission overhead
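One way to picture privatizing only the used part of the buffer is sketched below. This is not MP3Dec code: the sizes, the update rule in `precompute`, and the summing `compute` are invented for illustration, and `b0` is modelled as a plain offset into a hypothetical `buffs` array.

```c
#include <assert.h>
#include <string.h>

#define CALLS  6     /* hypothetical number of invocations */
#define BUFLEN 32    /* hypothetical size of the shared buffer */
#define WINDOW 8     /* hypothetical size of the part that compute reads */

/* Precomputation: updates the shared buffer from data written by
   earlier calls and returns b0, the start of this call's window. */
static int precompute(int call, int *buffs) {
    int b0 = (call * WINDOW) % BUFLEN;
    for (int k = 0; k < WINDOW; k++)
        buffs[b0 + k] = buffs[(b0 + k + WINDOW) % BUFLEN] + call + k;
    return b0;
}

/* Compute-intensive stand-in: reads only one window. */
static long compute(const int *win) {
    long acc = 0;
    for (int k = 0; k < WINDOW; k++) acc += win[k];
    return acc;
}

/* Original shape: both blocks interleaved in one sequential loop. */
void run_interleaved(long out[CALLS]) {
    int buffs[BUFLEN] = {0};
    assert(out != 0);
    for (int c = 0; c < CALLS; c++) {
        int b0 = precompute(c, buffs);
        out[c] = compute(buffs + b0);
    }
}

/* After fission: save only the window each call needs, then run
   the compute block on the private copies in any order. */
void run_privatized(long out[CALLS]) {
    int buffs[BUFLEN] = {0};
    int priv[CALLS][WINDOW];
    assert(out != 0);
    for (int c = 0; c < CALLS; c++) {          /* sequential recurrence */
        int b0 = precompute(c, buffs);
        memcpy(priv[c], buffs + b0, sizeof priv[c]);
    }
    for (int c = CALLS - 1; c >= 0; c--)       /* independent iterations */
        out[c] = compute(priv[c]);
}
```

Copying `WINDOW` elements per call instead of all `BUFLEN` is the saving the bullets point at. Later calls overwrite earlier windows in `buffs`, so the deferred compute loop would read the wrong data without the private copies.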

Subtraction
for (n = 0; n < 256*blocks; n++) diff[n] = orig[n] - pred[n]
Retile
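Data distribution of the subtraction work can be sketched as a split of the block range into disjoint chunks, assuming each block owns 256 consecutive elements. The function names and the `workers` parameter are hypothetical. The chunks are issued one after another here to keep the example portable, although their disjoint ranges mean each could go to its own thread.

```c
#include <assert.h>
#include <stddef.h>

/* Process blocks [lo, hi); each block owns 256 consecutive elements. */
void subtract_blocks(const int *orig, const int *pred, int *diff,
                     int lo, int hi) {
    for (int n = lo * 256; n < hi * 256; n++)
        diff[n] = orig[n] - pred[n];
}

/* Split `blocks` into `workers` contiguous, non-overlapping ranges. */
void subtract_distributed(const int *orig, const int *pred, int *diff,
                          int blocks, int workers) {
    assert(orig != NULL && pred != NULL && diff != NULL);
    assert(workers > 0 && blocks >= 0);
    for (int w = 0; w < workers; w++) {
        int lo = (int)((long)blocks * w / workers);
        int hi = (int)((long)blocks * (w + 1) / workers);
        subtract_blocks(orig, pred, diff, lo, hi);   /* one chunk per worker */
    }
}
```

With 5 blocks and 3 workers the ranges come out as [0, 1), [1, 3), and [3, 5), so every block is covered exactly once.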