Title: Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories
1. Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories
- Muthu Baskaran(1), Uday Bondhugula(1), Sriram Krishnamoorthy(1), J. Ramanujam(2), Atanas Rountev(1), P. Sadayappan(1)
- (1) Department of Computer Science and Engineering, The Ohio State University
- (2) Department of Electrical and Computer Engineering, Louisiana State University
2. Talk Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
3. Emergence of Multi-core Architectures
- Single-processor performance
- Improved by ~50%/yr for almost two decades
- Clock speed, ILP, ...
- Clock speed increased over 100x
- Limits to single-processor performance growth
- Increase in power density
- Flattening of clock speed due to power limitations
- Transistor density continues to rise unabated
- Multiple cores are now the best option for sustained performance growth
4. Scratchpad Memories (1/2)
- Need to optimize memory bandwidth and latency in multi-core architectures
- Traditional solution: introduce a cache hierarchy
- Drawback: caches are hardware-managed, making it difficult to model miss behavior and to predict program execution times
- Solution in many modern architectures: fast on-chip explicitly managed memory, i.e., scratchpad memory (local memory store)
5. Scratchpad Memories (2/2)
- Scratchpads
- Software-managed
- Control over data movement
- Easier to model performance
- Burden on programmer/compiler to manage and utilize
- Lower power per chip area compared to caches
- Some modern architectures with scratchpad memories
- GPUs
- Cell
- MPSoCs
6. Talk Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
7. Challenges
- Effective management of on-chip scratchpads in multi-core architectures
- Utilize the limited capacity of the scratchpad
- Optimize data movement
- Effective computation mapping in many-core architectures with multiple levels of parallelism
- Exploit available parallelism
- Account for scratchpad capacity constraints
8. Talk Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
9. Data Management Issues
- Orchestration of data movement between off-chip (global) and on-chip (scratchpad) memory
- Decisions on
- What data elements to move in and out of scratchpad
- When to move data
- How to move data
- How to access the data elements copied to scratchpad
10. Overview of Automatic Data Management Approach (1/2)
- Allocation of storage space (as arrays) in the scratchpad memory for local copies
- Determination of access functions of arrays in scratchpad memories
- Generation of code for moving data between scratchpad (local) and off-chip (global) memories
11. Overview of Automatic Data Management Approach (2/2)
- Targeted at affine programs
- Dense arrays
- Loop bounds: affine functions of outer loop variables, constants, and program parameters
- Array access functions: affine functions of surrounding loop variables, constants, and program parameters
- Developed using the polyhedral model
- An algebraic framework for representing affine programs: statement domains, dependences, array access functions, and affine program transformations
12. Polyhedral Model
for (i = 1; i < 4; i++)
  for (j = 2; j < 4; j++)
S1:     a[i][j] = a[j][i] + a[i][j-1];

Statement domain: D_S1 = { (i, j) | 1 <= i < 4, 2 <= j < 4 }, scanned by the iteration vector of I_S1; array access functions (e.g., (i, j) -> (j, i) for a[j][i]) are affine functions of the iteration vector.
13. Automatic Data Allocation
- Given a program block, identify the storage space needed for each non-overlapping accessed region of all arrays
- Access functions of array references may be non-uniformly generated
- For architectures (e.g., the NVIDIA GeForce GPU) supporting direct data access from off-chip memory
- Estimate the extent of reuse of data to determine whether or not to copy it to scratchpad (see the note below)
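Note: in the illustration on the next slide, the reference A[i][k] does not depend on j, so each element it touches is read in all five iterations of the j loop, a reuse factor of 5 that favors copying it to scratchpad; a reference whose elements are each touched only once may instead be accessed directly from off-chip memory.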
14. Algorithm and Illustration
for (i = 10; i <= 14; i++)
  for (j = 10; j <= 14; j++) {
    A[i][j+1] = A[i+j][j+1] + 3;
    for (k = 11; k <= 20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }

- Find the set of all data spaces accessed by all references to an array, using
- The access function of the reference
- The iteration space of the statement that holds the reference
- Partition the set of all data spaces into maximal subsets such that data spaces in different subsets do not overlap
- Find the bounding box of each partition of data spaces (a small sketch follows below)
- Allocate a local memory array for each bounding box

Local array LA0 (for A[i][j+1] and A[i][k]): lb(i) = 10, ub(i) = 14; lb(j) = 11, ub(j) = 20
Local array LA1 (for A[i+j][j+1]): lb(i) = 20, ub(i) = 28; lb(j) = 11, ub(j) = 15
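A minimal C sketch of the bounding-box step (our illustration, not the paper's implementation; the function name dim_bounds and the two-loop specialization are ours): for an affine access expression over a rectangular iteration space, the extremes occur at the corner points, so each dimension's bounds follow from the signs of the coefficients.

#include <stdio.h>

/* Bounds of the affine expression c0 + ci*i + cj*j over the rectangle
   i in [ilb, iub], j in [jlb, jub]; the extremes occur at the corners. */
static void dim_bounds(int c0, int ci, int cj, int ilb, int iub,
                       int jlb, int jub, int *lb, int *ub)
{
    *lb = c0 + ci * (ci >= 0 ? ilb : iub) + cj * (cj >= 0 ? jlb : jub);
    *ub = c0 + ci * (ci >= 0 ? iub : ilb) + cj * (cj >= 0 ? jub : jlb);
}

int main(void)
{
    int lb, ub;
    dim_bounds(0, 1, 1, 10, 14, 10, 14, &lb, &ub);  /* dim 0 of A[i+j][j+1] */
    printf("LA1 dim 0: [%d, %d]\n", lb, ub);        /* prints [20, 28]      */
    dim_bounds(1, 0, 1, 10, 14, 10, 14, &lb, &ub);  /* dim 1 of A[i+j][j+1] */
    printf("LA1 dim 1: [%d, %d]\n", lb, ub);        /* prints [11, 15]      */
    return 0;
}

These bounds match the LA1 extents above; applying the same step to the overlapping references A[i][j+1] and A[i][k] and taking the union of their boxes yields the LA0 extents.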
15. Accessing Arrays in Scratchpad
- The array dimensionality in scratchpad may be lower than the original array dimensionality, depending on the accessed data
- Access function into the local memory array: the original (or reduced) access function, offset by the lower bounds (in each dimension) of the scratchpad array

Original code:
for (i = 10; i <= 14; i++)
  for (j = 10; j <= 14; j++) {
    A[i][j+1] = A[i+j][j+1] + 3;
    for (k = 11; k <= 20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }

Code with scratchpad accesses:
for (i = 10; i <= 14; i++)
  for (j = 10; j <= 14; j++) {
    LA0[i-10][j+1-11] = LA1[i+j-20][j+1-11] + 3;
    for (k = 11; k <= 20; k++)
      LB0[i-10][j+k-21] = LA0[i-10][k-11] + LB1[i+j-20][k-11];
  }
16. Data Movement Code Generation
- Generation of the loop structure
- Scanning of polytopes (using CLooG, a tool for code generation) corresponding to the data spaces of
- read references, for moving data into scratchpad
- write references, for moving data out of scratchpad
- Generation of the loop body (data movement statement)
- Copy from a location in the scratchpad buffer to an off-chip memory location, or vice versa

/* Data move-in code */
for (i = 10; i <= 14; i++)
  for (j = 11; j <= 20; j++)
    LA0[i-10][j-11] = A[i][j];
for (i = 20; i <= 28; i++)
  for (j = max(i-13, 11); j <= min(15, i-9); j++)
    LA1[i-20][j-11] = A[i][j];

/* Data move-out code */
for (i = 10; i <= 14; i++)
  for (j = 11; j <= 15; j++)
    A[i][j] = LA0[i-10][j-11];
17. Talk Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
18. GPU Architecture
- Architectural components
- Slow off-chip (global) memory
- Two levels of parallelism
- A set of multiprocessors
- A set of processor cores in each multiprocessor
- A scratchpad on each multiprocessor, shared by its processor cores
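As a minimal CUDA sketch (our illustration; the kernel name, tile size, and computation are hypothetical, not code from the paper), the scratchpad of a multiprocessor is exposed as __shared__ memory, and the processor cores run the threads of a block that cooperatively stage a tile in and out:

// Hypothetical kernel: each thread block stages one 256-element tile
// through the multiprocessor's scratchpad (__shared__ memory), computes
// on the local copy, and writes the result back to global memory.
__global__ void tile_kernel(const float *g_in, float *g_out, int n)
{
    __shared__ float buf[256];                 // scratchpad copy of one tile
    int idx = blockIdx.x * 256 + threadIdx.x;  // assumes blockDim.x == 256

    if (idx < n)
        buf[threadIdx.x] = g_in[idx];          // data move-in
    __syncthreads();                           // consistency before compute

    if (idx < n)
        g_out[idx] = 2.0f * buf[threadIdx.x];  // compute on the local copy
}                                              // and data move-out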
19. Multi-level Tiling Approach
- Tiling transformation framework recently developed at OSU by Bondhugula (CC-08, PLDI-08)
- Finds tiling transformations (hyperplanes) for sequences of imperfectly nested loops
- Enables communication-minimal parallelization and locality optimization
- Identifies loops to tile for parallelism and data locality
- Multiple levels of tiling
- For exploiting parallelism across multiple parallel levels
- Additional (sequential) tiling at each level with scratchpad memory, if the data required by a tile executing at that level exceeds the memory
- Data movement at the start and end of each sequential tile
- Synchronization points to ensure consistency
20. Example

Original loop nest:
FORALL i = 1, Ni
  FORALL j = 1, Nj
    FOR k = 1, WS
      FOR l = 1, WS
        S1
      END FOR
    END FOR
  END FORALL
END FORALL

Multi-level tiled code:
// Tiling to distribute at the outer level
FORALL iT = 1, Ni, Ti
  FORALL jT = 1, Nj, Tj
    // Tiling to satisfy the scratchpad memory limit
    FOR i' = iT, min(iT+Ti-1, Ni), ti'
      FOR j' = jT, min(jT+Tj-1, Nj), tj'
        FOR k' = 1, WS, tk'
          FOR l' = 1, WS, tl'
            <Data move-in code>
            // Tiling to distribute at the inner level
            FORALL it = i', min(i'+ti'-1, Ni), ti
              FORALL jt = j', min(j'+tj'-1, Nj), tj
                FOR i = it, min(it+ti-1, Ni)
                  FOR j = jt, min(jt+tj-1, Nj)
                    FOR k = k', min(k'+tk'-1, WS)
                      FOR l = l', min(l'+tl'-1, WS)
                        S1
                      END FOR
                    END FOR
                  END FOR
                END FOR
              END FORALL
            END FORALL
            <Data move-out code>
          END FOR
        END FOR
      END FOR
    END FOR
  END FORALL
END FORALL
21. Tile Size Determination
- Handling scratchpad memory constraints
- Cost model for data movement:
- C = N x (S + (V x L)/P), where
- N = number of data movements
- S = synchronization cost per data movement
- V = number of elements per data movement (based on tile sizes)
- L = cost to transfer one element
- P = number of processes involved in a data movement
- Tile size search formulation
- Constraint: memory requirement within the scratchpad limit
- Objective function: minimize the data movement cost C (see the worked example below)
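As a worked example with purely hypothetical numbers: for N = 16 data movements, S = 200 cycles of synchronization per movement, V = 1024 elements per movement, L = 4 cycles per element, and P = 128 processes, the model gives C = 16 x (200 + (1024 x 4)/128) = 16 x 232 = 3712 cycles. Larger tiles reduce N but increase V and the scratchpad footprint, which is the trade-off the search on the next slide resolves.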
22. Illustration of the Tile Size Search Formulation
- A loop nest of m loops with tile sizes t1, t2, ..., tm
- nl local arrays
- Mj: memory (as a function of the tile sizes) for local array j
- Vin_j and Vout_j: volume (as a function of the tile sizes) moved into and out of local array j, respectively
- rj: position in the loop nest where the data movement code of array j is placed
- Mup: total scratchpad memory

Variables: t1, t2, ..., tm
Memory constraint: the sum of the Mj over all nl local arrays must not exceed Mup
Objective function: minimize the total data movement cost, accumulated over all local arrays using the cost model of the previous slide, where array j moves Vin_j + Vout_j elements per movement and the number of movements is determined by the loops enclosing position rj (a minimal search sketch follows)
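A minimal C sketch of the search (our illustration: the two-dimensional specialization, the power-of-two search space, and the footprint/volume expressions are hypothetical stand-ins for the compiler-derived functions of the tile sizes):

#include <stdio.h>

#define S_COST 200.0   /* sync cost per data movement (hypothetical)     */
#define L_COST 4.0     /* cost to transfer one element (hypothetical)    */
#define P_NUM  128.0   /* processes involved per movement (hypothetical) */
#define M_UP   16384   /* total scratchpad capacity, in elements         */
#define N_SIZE 1024    /* extent of each tiled loop (hypothetical)       */

int main(void)
{
    double best_c = -1.0;
    int best_t1 = 0, best_t2 = 0;

    /* Enumerate power-of-two tile sizes t1, t2 for two local arrays. */
    for (int t1 = 1; t1 <= N_SIZE; t1 *= 2)
        for (int t2 = 1; t2 <= N_SIZE; t2 *= 2) {
            int mem = 2 * t1 * t2;        /* sum of the M_j              */
            if (mem > M_UP)
                continue;                 /* memory constraint           */

            double n = ((double)N_SIZE / t1) * ((double)N_SIZE / t2);
            double v = 2.0 * t1 * t2;     /* V_in + V_out per movement   */
            double c = n * (S_COST + v * L_COST / P_NUM);  /* objective  */

            if (best_c < 0 || c < best_c) {
                best_c = c; best_t1 = t1; best_t2 = t2;
            }
        }

    printf("best tile sizes: %d x %d, estimated cost %.0f\n",
           best_t1, best_t2, best_c);
    return 0;
}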
23. Talk Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
24. Motion Estimation Kernel (1/2)
Machine information: NVIDIA GeForce 8800 GTX; 16 x 8 cores @ 1.35 GHz; 768 MB off-chip memory; 16 x 16 KB scratchpad
[Performance figure]
25. 1D Jacobi Kernel (1/2)
Machine information: NVIDIA GeForce 8800 GTX; 16 x 8 cores @ 1.35 GHz; 768 MB off-chip memory; 16 x 16 KB scratchpad
[Performance figure]
26. Motion Estimation Kernel (2/2)
Machine information: NVIDIA GeForce 8800 GTX; 16 x 8 cores @ 1.35 GHz; 768 MB off-chip memory; 16 x 16 KB scratchpad
[Performance figure; the tile size chosen by the model is marked]
27. 1D Jacobi Kernel (2/2)
Machine information: NVIDIA GeForce 8800 GTX; 16 x 8 cores @ 1.35 GHz; 768 MB off-chip memory; 16 x 16 KB scratchpad
[Performance figure; the tile size chosen by the model is marked]
28. Talk Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
29. Related Work
- Scratchpad memory management
- Data reuse: Issenin et al., DAC '06
- Allocation for uniformly generated references
- Schreiber and Cronquist, HPL TR '04
- Anantharaman and Pande, RTSS '98
- Kandemir et al., CAD '04
- Improving performance on cached architectures
- Ferrante et al., LCPC '92
- Gallivan et al., ICS '88
- Multi-level tiling
- Fatahalian et al., SC '06: various levels of memory
- Bikshandi et al., PPoPP '06 and Renganarayanan et al., SC '07, IPDPS '07: parallelism and locality
30. Talk Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
31. Summary
- Addressed two issues in compiling for modern multi-level parallel architectures with scratchpads
- Data management in scratchpad memory
- Data allocation
- Access in scratchpad
- Code generation for data movement
- Mapping of computation in regular programs onto multiple levels of parallel units
- Experimental evaluation using an NVIDIA GPU
32. Talk Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
33. Ongoing and Future Work
- Developing an end-to-end compiler framework for modern many-core architectures like GPUs
- The algorithms developed in this work are an integral part of the overall compiler framework
- Further optimizing transformations such as tiling for modern architectures like GPUs, using model-driven empirical search
34. Thank You