1
Locality-Conscious Workload Assignment for
Array-Based Computations in MPSoC Architectures
  • Feihui Li and Mahmut Kandemir
  • Dept. of Computer Science and Engineering
  • The Pennsylvania State University
  • June 14, 2005

2
Outline
  • Introduction
  • Loop-based code parallelization
  • Global optimization
  • Experimental evaluation
  • Conclusions

3
Introduction
  • Why MPSoC?
  • Efficient area utilization
  • Easy design verification
  • Flexible workload assignment
  • Important issues in workload assignment
  • Locality
  • Parallelism
  • Load balance
  • Our focus: improving data locality for MPSoCs
    (minimizing off-chip references) through
    compiler support

4
Related work
  • Customized memory hierarchy design and loop
    transformations for low power (F. Catthoor et
    al., Kluwer Academic Publishers, 1998)
  • Data shackling: data-centric code restructuring
    for cache-based single-processor systems (I.
    Kodukula et al., PLDI'97)
  • Memory optimization for low-power embedded
    systems (W.-T. Shiue and C. Chakrabarti, DAC'99)
  • Energy saving based on adaptive loop
    parallelization (I. Kadayif et al., DAC'02)
  • Energy-efficient synchronization (J. Li et al.,
    HPCA'04)
  • Energy efficiency of CMP vs. SMT (R. Sasanka et
    al., ICS'04)

5
Loop-based code parallelization

for i = 1, 100
  for j = 1, 100
    ...

  • Coarse-grain parallelism at the loop level
  • Data dependence constraints
  • Parallelize each loop nest individually (see the
    sketch below)
  • Drawback: does not capture data locality between
    different loop nests
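
A minimal sketch of this loop-level scheme, assuming a
hypothetical 4-processor MPSoC and a block distribution of
the outer loop; the loop body and all names are illustrative,
not the paper's generated code:

#include <stdio.h>

#define N     100
#define NPROC 4                       /* assumed processor count */

double A[N][N];

/* Each processor gets a contiguous block of the outer-loop
 * iterations of one nest, decided per nest, independently of
 * how the other nests were parallelized. */
void run_nest_on_processor(int p)
{
    int lo = p * N / NPROC;           /* first outer iteration of p  */
    int hi = (p + 1) * N / NPROC;     /* one past its last iteration */
    for (int i = lo; i < hi; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = A[i][j] * 2.0;  /* placeholder loop body */
}

int main(void)
{
    for (int p = 0; p < NPROC; p++)   /* sequential stand-in for the cores */
        run_nest_on_processor(p);
    printf("A[0][0] = %f\n", A[0][0]);
    return 0;
}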

6
Iteration vector and iteration space
for i1 = ...
  for i2 = ...
    ...
      for in = ...

for i = 1, N
  for j = 1, N
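
The formulas on this slide did not survive extraction; a
standard reconstruction of the two definitions the loops
illustrate:

% Iteration vector of the n-deep nest, and iteration space of
% the two-deep example (standard definitions, reconstructed
% from context).
\vec{i} = (i_1, i_2, \ldots, i_n), \qquad
I = \{\, (i, j) : 1 \le i \le N,\ 1 \le j \le N \,\}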
7
A motivating example
8
Loop-based vs. global parallelization
9
Global parallelization scheme
  • Main goal: improve data locality for each
    processor; loop iterations accessing the same
    array region should be executed by the same
    processor
  • Methodology: inter-loop-nest data-reuse-aware
    workload assignment
  • Constraint: inter-loop-iteration data dependences

10
Mathematical model
  • Assign loop iterations to each processor based on
    array access patterns (sketched below)
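
A hedged reconstruction of the assignment condition, using the
Is, Rs, and Ds notation of the later slides rather than the
paper's exact formulation: processor s executes exactly those
iterations whose referenced array elements fall in its
partition.

% Reconstruction (assumption): I is the iteration space,
% R(\vec{i}) the set of array elements referenced by iteration
% \vec{i}, and D_s the array region assigned to processor s.
I_s = \{\, \vec{i} \in I : R(\vec{i}) \subseteq D_s \,\}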

11
Issues in global parallelization
  • Array element partitioning
  • Array access patterns
  • Cache locality for each processor
  • Parallelism in the application program
  • Partial array access
  • A loop nest may access only a portion of an array
  • Multiple arrays
  • Multiple arrays accessed by the same processor
    share the same data cache
  • Loop iteration mapping must consider all the
    arrays to improve data locality

12
Issue 1: array element partitioning
  • Use loop-based parallelization to extract maximum
    parallelism
  • Given the iteration set Is executed by processor
    s and the array reference function set Rs,
    determine the accessed array region Ds
  • Ds can be different for different loop nests
  • A unification step obtains a globally acceptable
    array partitioning (see the sketch below)
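
A minimal sketch of the per-nest region computation and
unification, assuming regions are contiguous row ranges of a
1000-row array; the per-nest ranges and the bounding-range
merge are illustrative assumptions, and a real unification
step would further reconcile overlaps between processors:

#include <stdio.h>

#define NPROC 4

/* Row range [lo, hi) of the array accessed by one processor in
 * one nest: a simplified stand-in for the region D_s. */
typedef struct { int lo, hi; } Region;

/* Hypothetical regions per processor s under loop-based
 * parallelization of two different nests. */
Region nest1_region(int s) { return (Region){ 250 * s, 250 * (s + 1) }; }
Region nest2_region(int s) { return (Region){ 200 * s, 200 * (s + 1) + 200 }; }

/* Unification sketch: merge the ranges a processor touches
 * across nests into one bounding range; a real unification step
 * would then reconcile overlaps between processors into one
 * consistent partitioning (row-block in the slides' example). */
Region unify(Region a, Region b)
{
    return (Region){ a.lo < b.lo ? a.lo : b.lo,
                     a.hi > b.hi ? a.hi : b.hi };
}

int main(void)
{
    for (int s = 0; s < NPROC; s++) {
        Region d = unify(nest1_region(s), nest2_region(s));
        printf("processor %d: rows [%d, %d)\n", s, d.lo, d.hi);
    }
    return 0;
}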

13
A unification example
Array A[1000, 1000]
14
A unification example
(Figure: regions of array A assigned to processors
P1-P4 in loop nests L1, L2, and L3; row-block array
partitioning is selected as the unified partitioning.)
15
Data mapping and loop iteration assignment
16
Issue 2: partial array access pattern

L1: for I = 10 to 300
      for J = 100 to 900
        ... A[I, J] ...

L2: for I = 10 to 600
      for J = 500 to 900
        ... A[I, J] ...

(Figure: the regions of A accessed by L1 and L2
overlap only partially.)
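
A minimal sketch of computing the common (overlapping) region
of A touched by both nests, using the bounds above;
representing an access region as one rectangle is an
assumption:

#include <stdio.h>

/* Rectangular region of a 2-D array: [ilo, ihi] x [jlo, jhi]. */
typedef struct { int ilo, ihi, jlo, jhi; } Rect;

/* Intersection of two rectangular access regions; an empty
 * result has ilo > ihi or jlo > jhi. */
Rect intersect(Rect a, Rect b)
{
    Rect r;
    r.ilo = a.ilo > b.ilo ? a.ilo : b.ilo;
    r.ihi = a.ihi < b.ihi ? a.ihi : b.ihi;
    r.jlo = a.jlo > b.jlo ? a.jlo : b.jlo;
    r.jhi = a.jhi < b.jhi ? a.jhi : b.jhi;
    return r;
}

int main(void)
{
    Rect l1 = { 10, 300, 100, 900 };  /* region accessed by nest L1 */
    Rect l2 = { 10, 600, 500, 900 };  /* region accessed by nest L2 */
    Rect c  = intersect(l1, l2);
    printf("common region: I in [%d, %d], J in [%d, %d]\n",
           c.ilo, c.ihi, c.jlo, c.jhi);
    return 0;
}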
17
Issue 3: multiple arrays
  • Affinity: two data items accessed by the same
    loop iteration are affinitive
  • Affinity class: a set of affinitive data elements
    (see the sketch below)

do t1 = 1, M-2
  do t2 = 4, N
    Z1[t1][t2] = Z2[t2][t1] + Z3[t1+2][t2-3]
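
A minimal sketch of forming affinity classes for the nest
above. Here each iteration touches three distinct elements, so
one class per iteration suffices; in general, elements shared
by several iterations would require merging classes (e.g., with
union-find). The bounds M and N are illustrative:

#include <stdio.h>

#define M 6   /* illustrative bounds */
#define N 8

/* class_id[k][i][j] = affinity class of element Z(k+1)[i][j];
 * -1 means the element is never accessed. */
int class_id[3][M + 3][N + 1];

int main(void)
{
    for (int k = 0; k < 3; k++)
        for (int i = 0; i < M + 3; i++)
            for (int j = 0; j < N + 1; j++)
                class_id[k][i][j] = -1;

    int next = 0;
    for (int t1 = 1; t1 <= M - 2; t1++)
        for (int t2 = 4; t2 <= N; t2++) {
            int c = next++;                  /* class of iteration (t1, t2) */
            class_id[0][t1][t2]         = c; /* Z1[t1][t2]     */
            class_id[1][t2][t1]         = c; /* Z2[t2][t1]     */
            class_id[2][t1 + 2][t2 - 3] = c; /* Z3[t1+2][t2-3] */
        }

    printf("classes of Z1[1][4], Z2[4][1], Z3[3][1]: %d %d %d\n",
           class_id[0][1][4], class_id[1][4][1], class_id[2][3][1]);
    return 0;
}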
18
Determining the workload of a processor under the
multiple-arrays scenario
19
An example
L1: do t1 = 1, N
      do t2 = 1, N
        Z1[t1][t2] = Z2[t2][t1]

L2: do t1 = 1, N
      do t2 = 1, N
        Z1[t2][t1] = Z3[t2][t1]
No cross-loop data dependence, except through Z1,
which is written in loop nest L1
20
Experimental environment
  • Global optimization implemented within SUIF
  • Omega library used to generate code after loop
    iteration assignment
  • MPSoC architecture:
  • 8 KB L1 cache per processor
  • L1 latency: 2 cycles
  • Main memory latency: 80 cycles
  • Energy values based on CACTI under 70 nm
    technology

We thank I. Kolcu of the University of Manchester
for his help with the implementation.
21
Memory energy improvement
22
Impact of load balancing
  • Load imbalance may degrade performance
  • How to balance load:
  • Measure load-imbalance severity
  • Re-distribute loop iterations (see the sketch
    below)
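
A minimal sketch of the two steps above, measuring severity as
maximum load over average load; the loads, the 1.2 threshold,
and the locality-oblivious even redistribution are illustrative
assumptions, not the paper's method:

#include <stdio.h>

#define NPROC 4

int main(void)
{
    int load[NPROC] = { 400, 100, 300, 200 };  /* hypothetical counts */

    /* Step 1: measure imbalance severity as max load / avg load. */
    int max = 0, total = 0;
    for (int p = 0; p < NPROC; p++) {
        total += load[p];
        if (load[p] > max)
            max = load[p];
    }
    double severity = max / ((double)total / NPROC);
    printf("imbalance severity: %.2f\n", severity);

    /* Step 2: if severity exceeds the (assumed) threshold, spread
     * the iterations evenly across processors, for illustration. */
    if (severity > 1.2) {
        for (int p = 0; p < NPROC; p++)
            load[p] = total / NPROC + (p < total % NPROC ? 1 : 0);
    }
    for (int p = 0; p < NPROC; p++)
        printf("processor %d: %d iterations\n", p, load[p]);
    return 0;
}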

23
Impact of workload balancing
24
Impact of data size (Conv-Spa)
25
Performance improvement
26
Conclusions
  • A global (application-wide) parallelization
    strategy for improving data locality across loop
    nests
  • Data-locality-aware loop iteration assignment
    for MPSoCs
  • Experimental evaluation demonstrating the energy
    and performance improvements

27
  • Thank you!

28
Discussion
  • Obtain a globally good data mapping through
    loop-based parallelization and unification
  • Possible negative impact on parallelism
  • We expect the gains from data locality to
    outweigh the losses in loop-level parallelism

29
Issue 2: array access pattern
  • Problem: how to partition each array such that
    iterations from the above two nests access the
    same set of elements as much as possible

30
Procedure under partial array access
  • Determine the common elements accessed by both
    nests
  • Assign the first group of common elements to the
    first processor, the next group to the second
    processor, and so on
  • Assign the other elements to the remaining
    processors
  • Follow a similar process for the second nest,
    except that the same set of common elements is
    assigned to the same processor as in the first
    nest (see the sketch below)
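
A minimal sketch of the block-wise assignment of common
elements, treating the common region as a flat list of n
elements split into consecutive equal blocks so that both
nests map the same common element to the same processor; the
flat indexing and even split are assumptions:

#include <stdio.h>

#define NPROC 4

/* Returns the processor that owns the k-th of n common elements
 * when they are split into NPROC consecutive blocks; using the
 * same function for both nests keeps the mapping consistent. */
int owner_of_common(int k, int n)
{
    int block = (n + NPROC - 1) / NPROC;  /* ceil(n / NPROC) */
    return k / block;
}

int main(void)
{
    int n = 10;  /* hypothetical number of common elements */
    for (int k = 0; k < n; k++)
        printf("common element %d -> processor %d\n",
               k, owner_of_common(k, n));
    return 0;
}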