Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories

1
Automatic Data Movement and Computation Mapping
for Multi-level Parallel Architectures with
Explicitly Managed Memories
  • Muthu Baskaran (1), Uday Bondhugula (1),
    Sriram Krishnamoorthy (1)
  • J. Ramanujam (2), Atanas Rountev (1),
    P. Sadayappan (1)
  • (1) Department of Computer Science and Engineering,
    The Ohio State University
  • (2) Department of Electrical and Computer
    Engineering, Louisiana State University

2
Talk Outline
  • Introduction
  • Challenges
  • Automatic Data Management
  • Multi-level Tiling
  • Experiments
  • Related Work
  • Summary
  • Ongoing and Future Work

3
Emergence of Multi-core Architectures
  • Single-processor performance
  • Improved by 50%/yr for almost two decades
  • Clock speed, ILP, ...
  • Clock speed increased over 100x
  • Limits to single-processor performance growth
  • Increase in power density
  • Flattening of clock speed due to power limitation
  • Transistor density continues to rise unabated
  • Multiple cores are now the best option for
    sustained performance growth

4
Scratchpad Memories (1/2)
  • Need to optimize memory bandwidth and latency in
    multi-core architectures
  • Traditional solution: introducing a cache
    hierarchy
  • Drawback:
  • Caches are hardware-managed, so it is difficult to
    model miss behavior and to predict program
    execution times
  • Solution in many modern architectures: fast
    on-chip explicitly managed memory, i.e. scratchpad
    memory (local memory store)

5
Scratchpad Memories (2/2)
  • Scratchpads
  • Software-managed
  • Control over data movement
  • Easier to model performance
  • Burden on programmer/compiler to manage and
    utilize
  • Lower power per chip area required compared to
    cache
  • Some modern architectures with scratchpad
    memories:
  • GPU
  • Cell
  • MPSoC

6
Talk Outline
  • Introduction
  • Challenges
  • Automatic Data Management
  • Multi-level Tiling
  • Experiments
  • Related Work
  • Summary
  • Ongoing and Future Work

7
Challenges
  • Effective management of on-chip scratchpads in
    multi-core architectures
  • Utilize limited capacity of scratchpad
  • Optimize data movement
  • Effective computation mapping in many-core
    architectures with multiple levels of parallelism
  • Exploit available parallelism
  • Account for scratchpad capacity constraints

8
Talk Outline
  • Introduction
  • Challenges
  • Automatic Data Management
  • Multi-level Tiling
  • Experiments
  • Related Work
  • Summary
  • Ongoing and Future Work

9
Data Management Issues
  • Orchestration of data movement between off-chip
    global and on-chip scratchpad memory
  • Decisions on
  • What data elements to move in and out of
    scratchpad
  • When to move data
  • How to move data
  • How to access the data elements copied to
    scratchpad

10
Overview of Automatic Data Management Approach
(1/2)
  • Allocation of storage space (as arrays) in the
    scratchpad memory for local copies
  • Determination of access functions of arrays in
    scratchpad memories
  • Generation of code for moving data between
    scratchpad (local) and off-chip (global) memories

11
Overview of Automatic Data Management Approach
(2/2)
  • Targeted at affine programs
  • Dense arrays
  • Loop bounds: affine functions of outer loop
    variables, constants and program parameters
  • Array access functions: affine functions of
    surrounding loop variables, constants and program
    parameters
  • Developed using the polyhedral model
  • An algebraic framework for representing affine
    programs (statement domains, dependences, array
    access functions) and affine program
    transformations

12
Polyhedral Model
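for (i = 1; i <= 4; i++)
  for (j = 2; j <= 4; j++)
    S1: a[i][j] = a[j][i] + a[i][j-1];
Data space of a reference to a in S1: $\mathcal{D}_{S1}^{a} = \mathcal{F}_{1a} \, \mathcal{I}_{S1}$
(access function $\mathcal{F}_{1a}$ applied to the iteration space $\mathcal{I}_{S1}$ of statement S1)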
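As a rough illustration of the polyhedral view, the sketch below encodes the loop nest of this slide (read here as i from 1 to 4 and j from 2 to 4, with references a[i][j], a[j][i] and a[i][j-1]) as a set of affine inequalities plus affine access functions, and computes each data space by brute-force enumeration. The names `DOMAIN`, `ACCESS`, `iteration_space` and `data_space` are made up for this example, and a real compiler would use a polyhedral library rather than enumerating points.

```python
from itertools import product

# Iteration space of S1: one row (ci, cj, c0) per bound, meaning ci*i + cj*j + c0 >= 0.
# Rows encode: i - 1 >= 0, 4 - i >= 0, j - 2 >= 0, 4 - j >= 0.
DOMAIN = [
    (1, 0, -1),
    (-1, 0, 4),
    (0, 1, -2),
    (0, -1, 4),
]

# Access functions as affine maps: each row gives one array subscript
# as coefficients (ci, cj, c0) of (i, j, 1).
ACCESS = {
    "a[i][j] (write)": [(1, 0, 0), (0, 1, 0)],
    "a[j][i] (read)": [(0, 1, 0), (1, 0, 0)],
    "a[i][j-1] (read)": [(1, 0, 0), (0, 1, -1)],
}


def in_domain(i, j):
    """True if (i, j) satisfies every inequality of the iteration domain."""
    return all(ci * i + cj * j + c0 >= 0 for ci, cj, c0 in DOMAIN)


def iteration_space(search=range(0, 8)):
    """Enumerate the integer points of the domain inside a small search window."""
    return [(i, j) for i, j in product(search, repeat=2) if in_domain(i, j)]


def data_space(access, points):
    """Image of the iteration points under an affine access function."""
    return {tuple(ci * i + cj * j + c0 for ci, cj, c0 in access) for i, j in points}


points = iteration_space()
spaces = {name: data_space(f, points) for name, f in ACCESS.items()}
```

Here `spaces["a[j][i] (read)"]` holds the elements touched by the transposed read, which is the kind of set the later allocation steps reason about.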
Automatic Data Movement and Computation Mapping
for Multi-level Parallel Architectures with
Explicitly Managed Memories, PPoPP 2008
13
Automatic Data Allocation
  • Given a program block, identify the storage space
    needed for each non-overlapping accessed region
    of all arrays
  • Access functions of array references may be
    non-uniformly generated
  • For architectures (e.g. nVIDIA GeForce GPU)
    supporting direct data access from off-chip
    memory
  • Estimate extent of reuse of data to determine
    whether or not to copy to scratchpad

14
Algorithm and Illustration
for (i = 10; i <= 14; i++)
  for (j = 10; j <= 14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k = 11; k <= 20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
  • Find the set of all data spaces accessed by all
    references to an array, using
  • Access function of the reference
  • Iteration space of the statement that holds the
    reference
  • Partition the set of all data spaces into maximal
    disjoint (non-overlapping) subsets of data spaces
  • Find the bounding box of each partition of data
    spaces
  • Local memory array for each bounding box
Local array LA0: lb(i) = 10, ub(i) = 14, lb(j) = 11, ub(j) = 20
Local array LA1: lb(i) = 20, ub(i) = 28, lb(j) = 11, ub(j) = 15
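The allocation steps can be sketched for the references to A in this example (A[i][j+1], A[i+j][j+1] and A[i][k], with i and j from 10 to 14 and k from 11 to 20). This is a minimal brute-force version that enumerates the touched elements instead of operating on polytopes; the function names are hypothetical and not taken from the talk.

```python
from itertools import product

I_RANGE = range(10, 15)   # i = 10..14
J_RANGE = range(10, 15)   # j = 10..14
K_RANGE = range(11, 21)   # k = 11..20


def refs_to_A():
    """Data space (set of touched elements) of each reference to A."""
    return {
        "A[i][j+1]": {(i, j + 1) for i, j in product(I_RANGE, J_RANGE)},
        "A[i+j][j+1]": {(i + j, j + 1) for i, j in product(I_RANGE, J_RANGE)},
        "A[i][k]": {(i, k) for i, k in product(I_RANGE, K_RANGE)},
    }


def partition_overlapping(spaces):
    """Merge data spaces that share elements until all groups are disjoint."""
    groups = []
    for space in spaces:
        merged = set(space)
        rest = []
        for g in groups:
            if g & merged:
                merged |= g      # overlapping group is absorbed
            else:
                rest.append(g)   # disjoint group is kept as is
        groups = rest + [merged]
    return groups


def bounding_box(points):
    """((lb, ub) of dimension 0, (lb, ub) of dimension 1) of a set of points."""
    rows = [p[0] for p in points]
    cols = [p[1] for p in points]
    return (min(rows), max(rows)), (min(cols), max(cols))


boxes = sorted(bounding_box(g) for g in partition_overlapping(refs_to_A().values()))
```

The two boxes obtained this way match the bounds listed for LA0 and LA1: A[i][j+1] and A[i][k] overlap and share one local array, while A[i+j][j+1] gets its own.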

15
Accessing Arrays in Scratchpad
  • Array dimension in scratchpad may be lower than
    original array dimension, depending on accessed
    data
  • Access function in local memory array
  • Original access function or reduced access
    function with offsets lower bounds (in each
    dimension) of scratchpad array

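Code using the scratchpad arrays:
for (i = 10; i <= 14; i++)
  for (j = 10; j <= 14; j++) {
    LA0[i-10][j+1-11] = LA1[i+j-20][j+1-11] * 3;
    for (k = 11; k <= 20; k++)
      LB0[i-10][j+k-21] = LA0[i-10][k-11] + LB1[i+j-20][k-11];
  }
Original code:
for (i = 10; i <= 14; i++)
  for (j = 10; j <= 14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k = 11; k <= 20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }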
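The rewriting of access functions amounts to subtracting the lower bound of the local array in each dimension. The sketch below shows this for the reference B[i][j+k], assuming affine subscripts are stored as coefficient tuples; `LOWER_BOUNDS`, `to_local` and `evaluate` are illustrative names, and the offsets are the ones visible in the scratchpad version of the code.

```python
# Lower bounds of the local arrays in each dimension, as used in the example.
LOWER_BOUNDS = {"LA0": (10, 11), "LA1": (20, 11), "LB0": (10, 21), "LB1": (20, 11)}


def to_local(subscripts, lower_bounds):
    """Subtract the local array's lower bound from each affine subscript.

    A subscript is a tuple (ci, cj, ck, c0) meaning ci*i + cj*j + ck*k + c0.
    """
    return [
        (ci, cj, ck, c0 - lb)
        for (ci, cj, ck, c0), lb in zip(subscripts, lower_bounds)
    ]


def evaluate(subscripts, i, j, k=0):
    """Evaluate every subscript at a concrete iteration (i, j, k)."""
    return tuple(ci * i + cj * j + ck * k + c0 for ci, cj, ck, c0 in subscripts)


# B[i][j+k] becomes LB0[i-10][j+k-21].
global_ref = [(1, 0, 0, 0), (0, 1, 1, 0)]
local_ref = to_local(global_ref, LOWER_BOUNDS["LB0"])
```

Evaluating `local_ref` at the first iteration (i = 10, j = 10, k = 11) gives index (0, 0), so the local array starts at its origin.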
16
Data Movement Code Generation
  • Generation of loop structure
  • Scanning of polytopes (using CLooG - a tool for
    code generation) corresponding to data spaces of
  • read references for moving data into scratchpad
  • write references for moving data out of
    scratchpad
  • Generation of loop body (data movement statement)
  • Copy from a location in scratchpad buffer to
    off-chip memory location or vice versa

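/* Data move in code */
for (i = 10; i <= 14; i++)
  for (j = 11; j <= 20; j++)
    LA0[i-10][j-11] = A[i][j];
for (i = 20; i <= 28; i++)
  for (j = max(i-13, 11); j <= min(15, i-9); j++)
    LA1[i-20][j-11] = A[i][j];
/* Data move out code */
for (i = 10; i <= 14; i++)
  for (j = 11; j <= 15; j++)
    A[i][j] = LA0[i-10][j-11];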
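One way to check generated move-in code is to confirm that its loops scan exactly the data space of the read reference. The sketch below does this for the LA1 loop nest (i from 20 to 28, j from max(i-13, 11) to min(15, i-9)) against the reference A[i+j][j+1] with i and j from 10 to 14. It is a small verification aid with made-up function names, not the CLooG-based generator described in the talk.

```python
def scan_la1():
    """Elements of A copied into LA1 by the second move-in loop nest."""
    return [
        (i, j)
        for i in range(20, 29)
        for j in range(max(i - 13, 11), min(15, i - 9) + 1)
    ]


def read_space_la1():
    """Elements touched by the read reference A[i+j][j+1], 10 <= i, j <= 14."""
    return {(i + j, j + 1) for i in range(10, 15) for j in range(10, 15)}


def move_in(A):
    """Copy the scanned elements of A (a dict keyed by (i, j)) into LA1."""
    LA1 = [[None] * 5 for _ in range(9)]   # 9 x 5 bounding box of the data space
    for i, j in scan_la1():
        LA1[i - 20][j - 11] = A[(i, j)]
    return LA1


scanned = scan_la1()
```

Only 25 of the 45 bounding-box entries are copied, because the loop bounds follow the skewed shape of the data space rather than the box itself.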
17
Talk Outline
  • Introduction
  • Challenges
  • Automatic Data Management
  • Multi-level Tiling
  • Experiments
  • Related Work
  • Summary
  • Ongoing and Future Work

18
GPU architecture
  • Architectural components
  • Slow off-chip (global) memory
  • Two levels of parallelism
  • Set of multiprocessors
  • Set of processor cores in each multiprocessor
  • Scratchpad on each multiprocessor, shared by its
    processor cores

19
Multi-level Tiling Approach
  • Tiling transformation framework recently
    developed at OSU by Bondhugula (CC-08, PLDI-08)
  • Finds tiling transformations or hyperplanes
  • for sequences of imperfectly nested loops
  • enables communication-minimal parallelization and
    locality optimization
  • Identifies loops to tile for parallelism and data
    locality
  • Multiple levels of tiling
  • for exploiting parallelism across multiple
    parallel levels
  • Additional tiling (sequential) at each level with
    scratchpad memory
  • Needed if the data required by a tile executing at
    that level exceeds the scratchpad capacity
  • Data movement at the start and end of each
    sequential tile
  • Synchronization points to ensure consistency

20
Example
Original loop nest:
FORALL i = 1, Ni
  FORALL j = 1, Nj
    FOR k = 1, WS
      FOR l = 1, WS
        S1
      END FOR
    END FOR
  END FORALL
END FORALL
Tiled loop nest:
// Tiling to distribute at the outer level
FORALL iT = 1, Ni, Ti
 FORALL jT = 1, Nj, Tj
  // Tiling to satisfy scratchpad memory limit
  FOR i' = iT, min(iT+Ti-1, Ni), ti'
   FOR j' = jT, min(jT+Tj-1, Nj), tj'
    FOR k' = 1, WS, tk'
     FOR l' = 1, WS, tl'
      <Data move in Code>
      // Tiling to distribute at the inner level
      FORALL it = i', min(i'+ti'-1, Ni), ti
       FORALL jt = j', min(j'+tj'-1, Nj), tj
        FOR i = it, min(it+ti-1, Ni)
         FOR j = jt, min(jt+tj-1, Nj)
          FOR k = k', min(k'+tk'-1, WS)
           FOR l = l', min(l'+tl'-1, WS)
            S1
           END FOR
          END FOR
         END FOR
        END FOR
       END FORALL
      END FORALL
      <Data move out Code>
     END FOR
    END FOR
   END FOR
  END FOR
 END FORALL
END FORALL
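The three tiling levels in this example (outer parallel tiles, sequential tiles sized for the scratchpad, inner parallel tiles) can be tried out on a small two-dimensional iteration space. The sketch below is sequential Python with invented names (`tiles`, `tiled`), and unlike the pseudocode it clamps each tile to its enclosing tile, so that tile sizes need not divide each other evenly.

```python
def tiles(lo, hi, size):
    """Split the inclusive range [lo, hi] into consecutive tiles of at most `size`."""
    return [(s, min(s + size - 1, hi)) for s in range(lo, hi + 1, size)]


def original(Ni, Nj):
    """Iterations of the untiled nest: i = 1..Ni, j = 1..Nj."""
    return [(i, j) for i in range(1, Ni + 1) for j in range(1, Nj + 1)]


def tiled(Ni, Nj, Ti, Tj, ti_seq, tj_seq, ti, tj):
    """Visit the same iterations through three levels of tiling."""
    order = []
    for iT_lo, iT_hi in tiles(1, Ni, Ti):                 # outer parallel level
        for jT_lo, jT_hi in tiles(1, Nj, Tj):
            for ip_lo, ip_hi in tiles(iT_lo, iT_hi, ti_seq):   # sequential, scratchpad-sized
                for jp_lo, jp_hi in tiles(jT_lo, jT_hi, tj_seq):
                    # data move-in for this sequential tile would go here
                    for it_lo, it_hi in tiles(ip_lo, ip_hi, ti):   # inner parallel level
                        for jt_lo, jt_hi in tiles(jp_lo, jp_hi, tj):
                            for i in range(it_lo, it_hi + 1):
                                for j in range(jt_lo, jt_hi + 1):
                                    order.append((i, j))
                    # data move-out for this sequential tile would go here
    return order


visited = tiled(Ni=10, Nj=7, Ti=4, Tj=4, ti_seq=3, tj_seq=2, ti=2, tj=1)
```

Tiling changes the order of iterations but not the set: every point of the original nest is visited exactly once.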
21
Tile Size Determination
  • Handling scratchpad memory constraints
  • Cost model for data movement:
  • $C = N \times (S + (V \times L)/P)$
  • N: number of data movements
  • S: synchronization cost per data movement
  • V: number of elements per data movement
    (based on tile sizes)
  • L: cost to transfer one element
  • P: number of processes involved in data
    movement
  • Tile size search formulation:
  • Constraint: memory requirement within limit
  • Objective function: minimize data movement cost,
    C
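The cost model is simple enough to evaluate directly. The sketch below implements C = N x (S + (V x L)/P) and compares two tile volumes for a fixed total amount of data; all numeric values are made up for illustration and do not come from the talk.

```python
def movement_cost(N, S, V, L, P):
    """Data movement cost C = N * (S + (V * L) / P)."""
    return N * (S + (V * L) / P)


# Hypothetical setting: 1024 elements moved in total, so N = 1024 / V.
TOTAL = 1024
S, L, P = 10.0, 2.0, 8

small_tiles = movement_cost(N=TOTAL // 64, S=S, V=64, L=L, P=P)
large_tiles = movement_cost(N=TOTAL // 256, S=S, V=256, L=L, P=P)
```

With these numbers the larger tile is cheaper because the synchronization cost S is paid fewer times, which is why the search pushes tile sizes up until the scratchpad limit is reached.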

22
Illustration of tile size search formulation
  • Loop nest of m loops with tile sizes t_1, t_2, ...,
    t_m
  • nl local arrays
  • M_j: memory (as a function of tile sizes) for
    local array j
  • Vin_j and Vout_j: volume (as a function of tile
    sizes) moved into and out of local array
    j, respectively
  • r_j: position in the loop nest where the data
    movement code of array j is placed
  • M_up: total scratchpad memory
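Variables: $t_1, t_2, \ldots, t_m$
Memory constraint: $\sum_{j=1}^{nl} M_j \le M_{up}$
Objective function: minimize $\sum_{j=1}^{nl} N_j \times \left[ (S + (Vin_j \times L)/P) + (S + (Vout_j \times L)/P) \right]$,
where $N_j$ is the number of times the data movement code placed at position $r_j$ executes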
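A tile size search of this shape can be sketched as an exhaustive search over a small candidate set. Everything specific below is an assumption made for illustration: a two-loop nest, one local array X of t1 x t2 elements that is moved in and out, one local array Y of t2 elements that is only moved in, movement code placed inside both tile loops, and invented machine constants. The talk does not say how the actual search is carried out.

```python
from itertools import product
from math import ceil

S, L, P = 50.0, 1.0, 8        # sync cost, per-element cost, processes (made-up values)
M_UP = 1024                   # scratchpad capacity in elements (made-up value)
N1, N2 = 256, 256             # loop extents (made-up values)
CANDIDATES = (1, 2, 4, 8, 16, 32, 64, 128, 256)


def memory(t1, t2):
    """Sum of M_j: X needs a t1 x t2 buffer, Y needs t2 elements."""
    return t1 * t2 + t2


def cost(t1, t2):
    """Sum over local arrays of N * (S + V*L/P) for move-in and move-out."""
    n_moves = ceil(N1 / t1) * ceil(N2 / t2)   # movement code sits inside both tile loops
    move_in_x = S + (t1 * t2) * L / P
    move_out_x = S + (t1 * t2) * L / P
    move_in_y = S + t2 * L / P
    return n_moves * (move_in_x + move_out_x + move_in_y)


def feasible_tiles():
    """Candidate tile sizes that satisfy the memory constraint."""
    return [t for t in product(CANDIDATES, repeat=2) if memory(*t) <= M_UP]


def search():
    """Feasible tile sizes with the lowest data movement cost."""
    return min(feasible_tiles(), key=lambda t: cost(*t))


best = search()
```

Restricting candidates to powers of two keeps the search tiny; a finer grid or a nonlinear solver could replace `search` without changing `memory` or `cost`.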
23
Talk Outline
  • Introduction
  • Challenges
  • Automatic Data Management
  • Multi-level Tiling
  • Experiments
  • Related Work
  • Summary
  • Ongoing and Future Work

24
Motion Estimation Kernel (1/2)
Machine information: NVIDIA GeForce 8800 GTX, 16 x 8 cores @ 1.35 GHz,
768 MB off-chip memory, 16 x 16 KB scratchpad
25
1D Jacobi Kernel (1/2)
Machine information: NVIDIA GeForce 8800 GTX, 16 x 8 cores @ 1.35 GHz,
768 MB off-chip memory, 16 x 16 KB scratchpad
26
Motion Estimation Kernel (2/2)
Machine information: NVIDIA GeForce 8800 GTX, 16 x 8 cores @ 1.35 GHz,
768 MB off-chip memory, 16 x 16 KB scratchpad
Tile size from model
27
1D Jacobi Kernel (2/2)
Machine information: NVIDIA GeForce 8800 GTX, 16 x 8 cores @ 1.35 GHz,
768 MB off-chip memory, 16 x 16 KB scratchpad
Tile size from model
28
Talk Outline
  • Introduction
  • Challenges
  • Automatic Data Management
  • Multi-level Tiling
  • Experiments
  • Related Work
  • Summary
  • Ongoing and Future Work

29
Related Work
  • Scratchpad memory management
  • Data reuse: Issenin et al. [DAC06]
  • Allocation for uniformly generated references
  • Schreiber and Cronquist [HPLTR04]
  • Anantharaman and Pande [RTSS98]
  • Kandemir et al. [CAD04]
  • Improving performance on cached architectures
  • Ferrante et al. [LCPC92]
  • Gallivan et al. [ICS88]
  • Multi-level tiling
  • Fatahalian et al. [SC06]: various levels of
    memory
  • Bikshandi et al. [PPOPP06] and Renganarayanan et
    al. [SC07, IPDPS07]: parallelism and locality

30
Talk Outline
  • Introduction
  • Challenges
  • Automatic Data Management
  • Multi-level Tiling
  • Experiments
  • Related Work
  • Summary
  • Ongoing and Future Work

31
Summary
  • Addressed two issues in compiling for modern
    multi-level parallel architectures with
    scratchpads
  • Data management in scratchpad memory
  • Data allocation
  • Access in scratchpad
  • Code generation for data movement
  • Mapping of computation in regular programs on to
    multiple levels of parallel units
  • Experimental evaluation using nVIDIA GPU

32
Talk Outline
  • Introduction
  • Challenges
  • Automatic Data Management
  • Multi-level Tiling
  • Experiments
  • Related Work
  • Summary
  • Ongoing and Future Work

33
Ongoing and Future Work
  • Developing an end-to-end compiler framework for
    modern many-core architectures like GPUs
  • Algorithms developed in this work are an integral
    part of the overall compiler framework
  • Further optimize transformations like tiling, for
    modern architectures like GPUs, using
    model-driven empirical search

34
Thank you