Title: Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories
1. Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories
- Muthu Baskaran(1), Uday Bondhugula(1), Sriram Krishnamoorthy(1), J. Ramanujam(2), Atanas Rountev(1), P. Sadayappan(1)
- (1) Department of Computer Science and Engineering, The Ohio State University
- (2) Department of Electrical and Computer Engineering, Louisiana State University
2. Talk Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
3. Emergence of Multi-core Architectures
- Single-processor performance
- Improved by ~50%/yr for almost two decades
- Clock speed, ILP, ...
- Clock speed increased over 100x
- Limits to single-processor performance growth
- Increase in power density
- Flattening of clock speed due to power limitations
- Transistor density continues to rise unabated
- Multiple cores are now the best option for sustained performance growth
4. Scratchpad Memories (1/2)
- Need to optimize memory bandwidth and latency in multi-core architectures
- Traditional solution: introduce a cache hierarchy
- Drawback: caches are hardware-managed, making it difficult to model miss behavior and to predict program execution times
- Solution in many modern architectures: fast on-chip explicitly managed memory, i.e., scratchpad memory (local memory store)
5. Scratchpad Memories (2/2)
- Scratchpads
- Software-managed
- Control over data movement
- Easier to model performance
- Burden on programmer/compiler to manage and utilize
- Lower power per chip area compared to caches
- Some modern architectures with scratchpad memories
- GPUs
- Cell
- MPSoCs
6. Talk Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
7. Challenges
- Effective management of on-chip scratchpads in multi-core architectures
- Utilize the limited capacity of the scratchpad
- Optimize data movement
- Effective computation mapping in many-core architectures with multiple levels of parallelism
- Exploit available parallelism
- Account for scratchpad capacity constraints
8. Talk Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
9. Data Management Issues
- Orchestration of data movement between off-chip (global) and on-chip (scratchpad) memory
- Decisions on
- What data elements to move in and out of scratchpad
- When to move data
- How to move data
- How to access the data elements copied to scratchpad
10. Overview of Automatic Data Management Approach (1/2)
- Allocation of storage space (as arrays) in the scratchpad memory for local copies
- Determination of access functions of arrays in scratchpad memories
- Generation of code for moving data between scratchpad (local) and off-chip (global) memories
11. Overview of Automatic Data Management Approach (2/2)
- Targeted at affine programs
- Dense arrays
- Loop bounds: affine functions of outer loop variables, constants, and program parameters
- Array access functions: affine functions of surrounding loop variables, constants, and program parameters
- Developed using the polyhedral model
- An algebraic framework for representing affine programs: statement domains, dependences, array access functions, and affine program transformations
12. Polyhedral Model
for (i = 1; i < 4; i++)
  for (j = 2; j < 4; j++)
S1:     a[i][j] = a[j][i] + a[i][j-1];

Statement domain: D_S1 = { (i, j) | 1 <= i < 4, 2 <= j < 4 }, scanned by the iteration vector of I_S1; array access functions (e.g., (i, j) -> (j, i) for a[j][i]) are affine functions of the iteration vector.
13. Automatic Data Allocation
- Given a program block, identify the storage space needed for each non-overlapping accessed region of all arrays
- Access functions of array references may be non-uniformly generated
- For architectures (e.g., the NVIDIA GeForce GPU) supporting direct data access from off-chip memory
- Estimate the extent of reuse of data to determine whether or not to copy it to scratchpad (see the note below)
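Note: in the illustration on the next slide, the reference A[i][k] does not depend on j, so each element it touches is read in all five iterations of the j loop, a reuse factor of 5 that favors copying it to scratchpad; a reference whose elements are each touched only once may instead be accessed directly from off-chip memory.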
14. Algorithm and Illustration
for (i = 10; i <= 14; i++)
  for (j = 10; j <= 14; j++) {
    A[i][j+1] = A[i+j][j+1] + 3;
    for (k = 11; k <= 20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }

- Find the set of all data spaces accessed by all references to an array, using
- The access function of the reference
- The iteration space of the statement that holds the reference
- Partition the set of all data spaces into maximal subsets such that data spaces in different subsets do not overlap
- Find the bounding box of each partition of data spaces (a small sketch follows below)
- Allocate a local memory array for each bounding box

Local array LA0 (for A[i][j+1] and A[i][k]): lb(i) = 10, ub(i) = 14; lb(j) = 11, ub(j) = 20
Local array LA1 (for A[i+j][j+1]): lb(i) = 20, ub(i) = 28; lb(j) = 11, ub(j) = 15
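A minimal C sketch of the bounding-box step (our illustration, not the paper's implementation; the function name dim_bounds and the two-loop specialization are ours): for an affine access expression over a rectangular iteration space, the extremes occur at the corner points, so each dimension's bounds follow from the signs of the coefficients.

#include <stdio.h>

/* Bounds of the affine expression c0 + ci*i + cj*j over the rectangle
   i in [ilb, iub], j in [jlb, jub]; the extremes occur at the corners. */
static void dim_bounds(int c0, int ci, int cj, int ilb, int iub,
                       int jlb, int jub, int *lb, int *ub)
{
    *lb = c0 + ci * (ci >= 0 ? ilb : iub) + cj * (cj >= 0 ? jlb : jub);
    *ub = c0 + ci * (ci >= 0 ? iub : ilb) + cj * (cj >= 0 ? jub : jlb);
}

int main(void)
{
    int lb, ub;
    dim_bounds(0, 1, 1, 10, 14, 10, 14, &lb, &ub);  /* dim 0 of A[i+j][j+1] */
    printf("LA1 dim 0: [%d, %d]\n", lb, ub);        /* prints [20, 28]      */
    dim_bounds(1, 0, 1, 10, 14, 10, 14, &lb, &ub);  /* dim 1 of A[i+j][j+1] */
    printf("LA1 dim 1: [%d, %d]\n", lb, ub);        /* prints [11, 15]      */
    return 0;
}

These bounds match the LA1 extents above; applying the same step to the overlapping references A[i][j+1] and A[i][k] and taking the union of their boxes yields the LA0 extents.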
15. Accessing Arrays in Scratchpad
- The array dimensionality in scratchpad may be lower than the original array dimensionality, depending on the accessed data
- Access function into the local memory array: the original (or reduced) access function, offset by the lower bounds (in each dimension) of the scratchpad array

Original code:
for (i = 10; i <= 14; i++)
  for (j = 10; j <= 14; j++) {
    A[i][j+1] = A[i+j][j+1] + 3;
    for (k = 11; k <= 20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }

Code with scratchpad accesses:
for (i = 10; i <= 14; i++)
  for (j = 10; j <= 14; j++) {
    LA0[i-10][j+1-11] = LA1[i+j-20][j+1-11] + 3;
    for (k = 11; k <= 20; k++)
      LB0[i-10][j+k-21] = LA0[i-10][k-11] + LB1[i+j-20][k-11];
  }
16. Data Movement Code Generation
- Generation of the loop structure
- Scanning of polytopes (using CLooG, a tool for code generation) corresponding to the data spaces of
- read references, for moving data into scratchpad
- write references, for moving data out of scratchpad
- Generation of the loop body (data movement statement)
- Copy from a location in the scratchpad buffer to an off-chip memory location, or vice versa

/* Data move-in code */
for (i = 10; i <= 14; i++)
  for (j = 11; j <= 20; j++)
    LA0[i-10][j-11] = A[i][j];
for (i = 20; i <= 28; i++)
  for (j = max(i-13, 11); j <= min(15, i-9); j++)
    LA1[i-20][j-11] = A[i][j];

/* Data move-out code */
for (i = 10; i <= 14; i++)
  for (j = 11; j <= 15; j++)
    A[i][j] = LA0[i-10][j-11];
17. Talk Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
18. GPU Architecture
- Architectural components
- Slow off-chip (global) memory
- Two levels of parallelism
- A set of multiprocessors
- A set of processor cores in each multiprocessor
- A scratchpad on each multiprocessor, shared by its processor cores
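As a minimal CUDA sketch (our illustration; the kernel name, tile size, and computation are hypothetical, not code from the paper), the scratchpad of a multiprocessor is exposed as __shared__ memory, and the processor cores run the threads of a block that cooperatively stage a tile in and out:

// Hypothetical kernel: each thread block stages one 256-element tile
// through the multiprocessor's scratchpad (__shared__ memory), computes
// on the local copy, and writes the result back to global memory.
__global__ void tile_kernel(const float *g_in, float *g_out, int n)
{
    __shared__ float buf[256];                 // scratchpad copy of one tile
    int idx = blockIdx.x * 256 + threadIdx.x;  // assumes blockDim.x == 256

    if (idx < n)
        buf[threadIdx.x] = g_in[idx];          // data move-in
    __syncthreads();                           // consistency before compute

    if (idx < n)
        g_out[idx] = 2.0f * buf[threadIdx.x];  // compute on the local copy
}                                              // and data move-out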
19. Multi-level Tiling Approach
- Tiling transformation framework recently developed at OSU by Bondhugula (CC-08, PLDI-08)
- Finds tiling transformations (hyperplanes) for sequences of imperfectly nested loops
- Enables communication-minimal parallelization and locality optimization
- Identifies loops to tile for parallelism and data locality
- Multiple levels of tiling
- For exploiting parallelism across multiple parallel levels
- Additional (sequential) tiling at each level with scratchpad memory, if the data required by a tile executing at that level exceeds the memory
- Data movement at the start and end of each sequential tile
- Synchronization points to ensure consistency
20. Example

Original loop nest:
FORALL i = 1, Ni
  FORALL j = 1, Nj
    FOR k = 1, WS
      FOR l = 1, WS
        S1
      END FOR
    END FOR
  END FORALL
END FORALL

Multi-level tiled code:
// Tiling to distribute at the outer level
FORALL iT = 1, Ni, Ti
  FORALL jT = 1, Nj, Tj
    // Tiling to satisfy the scratchpad memory limit
    FOR i' = iT, min(iT+Ti-1, Ni), ti'
      FOR j' = jT, min(jT+Tj-1, Nj), tj'
        FOR k' = 1, WS, tk'
          FOR l' = 1, WS, tl'
            <Data move-in code>
            // Tiling to distribute at the inner level
            FORALL it = i', min(i'+ti'-1, Ni), ti
              FORALL jt = j', min(j'+tj'-1, Nj), tj
                FOR i = it, min(it+ti-1, Ni)
                  FOR j = jt, min(jt+tj-1, Nj)
                    FOR k = k', min(k'+tk'-1, WS)
                      FOR l = l', min(l'+tl'-1, WS)
                        S1
                      END FOR
                    END FOR
                  END FOR
                END FOR
              END FORALL
            END FORALL
            <Data move-out code>
          END FOR
        END FOR
      END FOR
    END FOR
  END FORALL
END FORALL
21. Tile Size Determination
- Handling scratchpad memory constraints
- Cost model for data movement:
- C = N x (S + (V x L)/P), where
- N = number of data movements
- S = synchronization cost per data movement
- V = number of elements per data movement (based on tile sizes)
- L = cost to transfer one element
- P = number of processes involved in a data movement
- Tile size search formulation
- Constraint: memory requirement within the scratchpad limit
- Objective function: minimize the data movement cost C (see the worked example below)
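As a worked example with purely hypothetical numbers: for N = 16 data movements, S = 200 cycles of synchronization per movement, V = 1024 elements per movement, L = 4 cycles per element, and P = 128 processes, the model gives C = 16 x (200 + (1024 x 4)/128) = 16 x 232 = 3712 cycles. Larger tiles reduce N but increase V and the scratchpad footprint, which is the trade-off the search on the next slide resolves.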
22. Illustration of the Tile Size Search Formulation
- A loop nest of m loops with tile sizes t1, t2, ..., tm
- nl local arrays
- Mj: memory (as a function of the tile sizes) for local array j
- Vin_j and Vout_j: volume (as a function of the tile sizes) moved into and out of local array j, respectively
- rj: position in the loop nest where the data movement code of array j is placed
- Mup: total scratchpad memory

Variables: t1, t2, ..., tm
Memory constraint: the sum of the Mj over all nl local arrays must not exceed Mup
Objective function: minimize the total data movement cost, accumulated over all local arrays using the cost model of the previous slide, where array j moves Vin_j + Vout_j elements per movement and the number of movements is determined by the loops enclosing position rj (a minimal search sketch follows)
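A minimal C sketch of the search (our illustration: the two-dimensional specialization, the power-of-two search space, and the footprint/volume expressions are hypothetical stand-ins for the compiler-derived functions of the tile sizes):

#include <stdio.h>

#define S_COST 200.0   /* sync cost per data movement (hypothetical)     */
#define L_COST 4.0     /* cost to transfer one element (hypothetical)    */
#define P_NUM  128.0   /* processes involved per movement (hypothetical) */
#define M_UP   16384   /* total scratchpad capacity, in elements         */
#define N_SIZE 1024    /* extent of each tiled loop (hypothetical)       */

int main(void)
{
    double best_c = -1.0;
    int best_t1 = 0, best_t2 = 0;

    /* Enumerate power-of-two tile sizes t1, t2 for two local arrays. */
    for (int t1 = 1; t1 <= N_SIZE; t1 *= 2)
        for (int t2 = 1; t2 <= N_SIZE; t2 *= 2) {
            int mem = 2 * t1 * t2;        /* sum of the M_j              */
            if (mem > M_UP)
                continue;                 /* memory constraint           */

            double n = ((double)N_SIZE / t1) * ((double)N_SIZE / t2);
            double v = 2.0 * t1 * t2;     /* V_in + V_out per movement   */
            double c = n * (S_COST + v * L_COST / P_NUM);  /* objective  */

            if (best_c < 0 || c < best_c) {
                best_c = c; best_t1 = t1; best_t2 = t2;
            }
        }

    printf("best tile sizes: %d x %d, estimated cost %.0f\n",
           best_t1, best_t2, best_c);
    return 0;
}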
23. Talk Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
24. Motion Estimation Kernel (1/2)
Machine information: NVIDIA GeForce 8800 GTX; 16 x 8 cores @ 1.35 GHz; 768 MB off-chip memory; 16 x 16 KB scratchpad
[Performance figure]
25. 1D Jacobi Kernel (1/2)
Machine information: NVIDIA GeForce 8800 GTX; 16 x 8 cores @ 1.35 GHz; 768 MB off-chip memory; 16 x 16 KB scratchpad
[Performance figure]
26. Motion Estimation Kernel (2/2)
Machine information: NVIDIA GeForce 8800 GTX; 16 x 8 cores @ 1.35 GHz; 768 MB off-chip memory; 16 x 16 KB scratchpad
[Performance figure; the tile size chosen by the model is marked]
27. 1D Jacobi Kernel (2/2)
Machine information: NVIDIA GeForce 8800 GTX; 16 x 8 cores @ 1.35 GHz; 768 MB off-chip memory; 16 x 16 KB scratchpad
[Performance figure; the tile size chosen by the model is marked]
28. Talk Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
29. Related Work
- Scratchpad memory management
- Data reuse: Issenin et al., DAC '06
- Allocation for uniformly generated references
- Schreiber and Cronquist, HPL TR '04
- Anantharaman and Pande, RTSS '98
- Kandemir et al., CAD '04
- Improving performance on cached architectures
- Ferrante et al., LCPC '92
- Gallivan et al., ICS '88
- Multi-level tiling
- Fatahalian et al., SC '06: various levels of memory
- Bikshandi et al., PPoPP '06 and Renganarayanan et al., SC '07, IPDPS '07: parallelism and locality
30. Talk Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
31. Summary
- Addressed two issues in compiling for modern multi-level parallel architectures with scratchpads
- Data management in scratchpad memory
- Data allocation
- Access in scratchpad
- Code generation for data movement
- Mapping of computation in regular programs onto multiple levels of parallel units
- Experimental evaluation using an NVIDIA GPU
32. Talk Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
33. Ongoing and Future Work
- Developing an end-to-end compiler framework for modern many-core architectures like GPUs
- The algorithms developed in this work are an integral part of the overall compiler framework
- Further optimizing transformations such as tiling for modern architectures like GPUs, using model-driven empirical search
34. Thank You