Title: Locality-Conscious Workload Assignment for Array-Based Computations in MPSoC Architectures
1. Locality-Conscious Workload Assignment for Array-Based Computations in MPSoC Architectures
- Feihui Li and Mahmut Kandemir
- Dept. of Computer Science and Engineering
- The Pennsylvania State University
- June 14, 2005
2. Outline
- Introduction
- Loop-based code parallelization
- Global optimization
- Experimental evaluation
- Conclusions
3. Introduction
- Why MPSoC?
- Efficient area utilization
- Easy design verification
- Flexible workload assignment
- Important issues in workload assignment
- Locality
- Parallelism
- Load balance
- Our focus: improving data locality for MPSoCs (minimizing off-chip references) through the compiler
4. Related work
- Customized memory hierarchy design and loop transformations for low power: F. Catthoor et al., Kluwer Academic Publishers, 1998
- Data shackling: data-centric code restructuring for cache-based single-processor systems: I. Kodukula et al., PLDI'97
- Memory optimization for low power embedded systems: W.-T. Shiue and C. Chakrabarti, DAC'99
- Energy saving based on adaptive loop parallelization: I. Kadayif et al., DAC'02
- Energy-efficient synchronization: J. Li et al., HPCA'04
- Energy efficiency of CMP vs. SMT: R. Sasanka et al., ICS'04
5. Loop-based code parallelization
for i = 1, 100
  for j = 1, 100
- Coarse-grain parallelism: loop level
- Data dependence constraints
- Parallelize each loop nest individually (sketched below)
- Drawback: does not capture data locality between different loop nests
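To make the per-nest scheme concrete, here is a minimal sketch in C, assuming a block distribution of the outer loop over four processors; NUM_PROCS, body1, and body2 are illustrative names, not taken from the paper.

/* Hypothetical sketch: each loop nest is parallelized in isolation by
 * block-distributing its outer-loop iterations over NUM_PROCS processors. */
#include <stdio.h>

#define NUM_PROCS 4
#define N 100

static double A[N + 1][N + 1], B[N + 1][N + 1];

/* Illustrative loop bodies standing in for the statements of the two nests. */
static void body1(int i, int j) { A[i][j] = i + j; }
static void body2(int i, int j) { B[i][j] = A[i][j] * 2.0; }

/* Iterations of one nest executed by processor p: a contiguous block of the
 * i-loop, chosen without looking at what any other nest does with the data. */
static void run_nest_on_proc(int p, void (*body)(int, int)) {
    int chunk = (N + NUM_PROCS - 1) / NUM_PROCS;
    int lo = 1 + p * chunk;
    int hi = (lo + chunk - 1 > N) ? N : lo + chunk - 1;
    for (int i = lo; i <= hi; i++)
        for (int j = 1; j <= N; j++)
            body(i, j);
}

int main(void) {
    /* Each nest is distributed independently, so iterations of different
     * nests that touch the same data may land on different processors --
     * the locality drawback noted in the slide. */
    for (int p = 0; p < NUM_PROCS; p++) run_nest_on_proc(p, body1);
    for (int p = 0; p < NUM_PROCS; p++) run_nest_on_proc(p, body2);
    printf("B[N][N] = %g\n", B[N][N]);
    return 0;
}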
6. Iteration vector and iteration space
- For an n-deep loop nest with loops i1, i2, ..., in, each iteration is identified by its iteration vector (i1, i2, ..., in)
- Example nest:
for i = 1, N
  for j = 1, N
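The slide's notation did not survive extraction; the following are the standard definitions, written to match the two-deep example above (the bounds L_k and U_k are generic symbols, not from the slide).

\[ \bar{I} = (i_1, i_2, \ldots, i_n) \]
\[ \mathcal{I} = \{ (i_1, \ldots, i_n) \mid L_k \le i_k \le U_k,\ 1 \le k \le n \} \]
For the example nest above: \( \mathcal{I} = \{ (i, j) \mid 1 \le i \le N,\ 1 \le j \le N \} \).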
7. A motivating example
8. Loop-based vs. global parallelization
9. Global parallelization scheme
- Main goal: improve data locality for each processor; loop iterations accessing the same array region should be executed by the same processor
- Methodology: inter-loop-nest data-reuse-aware workload assignment (see the sketch below)
- Constraint: inter-loop-iteration data dependences
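As a hedged illustration of this scheme (not the paper's algorithm), the sketch below partitions the rows of a shared array once and lets every nest map each iteration to the processor that owns the row it touches; owner_of_row and the loop bodies are assumed names.

/* Hypothetical sketch of the global idea: array A is row-block partitioned
 * once, and every nest hands iteration (i, *) to the owner of row i, so
 * cross-nest reuse of the same rows stays in one processor's cache. */
#include <stdio.h>

#define NUM_PROCS 4
#define N 100

static double A[N + 1][N + 1];

/* Owner of row i under a row-block partitioning of A. */
static int owner_of_row(int i) {
    int chunk = (N + NUM_PROCS - 1) / NUM_PROCS;
    return (i - 1) / chunk;
}

/* Nest L1 writes row i of A; nest L2 reads and rewrites row i.  Both nests
 * map iteration (i, *) to owner_of_row(i), unlike per-nest parallelization,
 * which could split the same rows differently in each nest. */
static void run_on_proc(int p) {
    for (int i = 1; i <= N; i++) {
        if (owner_of_row(i) != p) continue;              /* not my workload */
        for (int j = 1; j <= N; j++) A[i][j] = i + j;    /* L1 */
    }
    for (int i = 1; i <= N; i++) {
        if (owner_of_row(i) != p) continue;              /* same owner -> reuse */
        for (int j = 1; j <= N; j++) A[i][j] *= 2.0;     /* L2 */
    }
}

int main(void) {
    for (int p = 0; p < NUM_PROCS; p++) run_on_proc(p);
    printf("A[N][N] = %g\n", A[N][N]);
    return 0;
}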
10. Mathematical model
- Assign loop iterations to each processor based on array access patterns (one possible formalization is sketched below)
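The slide's formulas were lost in conversion; one plausible way to write such a model down is sketched below, using assumed notation (\(\pi\) for the iteration-to-processor map, \(\mathcal{I}_\ell\) and \(\mathcal{R}_\ell\) for the iteration space and array references of nest \(\ell\), \(D_p\) for the data touched by processor p). This is an illustration, not the paper's exact formulation.

\[ D_p(\pi) = \bigcup_{\ell}\ \bigcup_{\{\bar{I} \in \mathcal{I}_\ell \,:\, \pi(\ell,\bar{I}) = p\}}\ \bigcup_{R \in \mathcal{R}_\ell} \{\, R(\bar{I}) \,\} \]
\[ \min_{\pi} \ \sum_{p} \lvert D_p(\pi) \rvert \quad \text{subject to inter-iteration data dependences and load balance.} \]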
11. Issues in global parallelization
- Array element partitioning
- Array access patterns
- Cache locality for each processor
- Parallelism in the application program
- Partial array access
- A loop nest accesses a portion of the array
- Multiple arrays
- Multiple arrays accessed by the same processor share the same data cache
- Loop iteration mapping must consider all the arrays to improve data locality
12. Issue-1: array element partitioning
- Use loop-based parallelization to extract maximum parallelism
- Given the iteration set Is executed by processor s and its set of array reference functions Rs, determine the accessed array region Ds
- Ds can be different for different loop nests
- A unification step obtains a globally acceptable array partitioning (a sketch follows this list)
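A minimal sketch of this step, assuming a block-distributed i-loop and a single reference A[i][j]; the names touched and region_of_proc are illustrative, and real reference functions would be handled symbolically rather than by enumeration.

/* Hypothetical sketch: compute the array region D_s touched by processor s
 * for one nest, represented here as a bitmap over A's index space. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NUM_PROCS 4
#define N 100

static bool touched[N + 1][N + 1];   /* D_s as a bitmap over A's index space */

/* Mark every element A[i][j] referenced by the iterations (i, j) that
 * processor s executes in this nest. */
static void region_of_proc(int s) {
    memset(touched, 0, sizeof touched);
    int chunk = (N + NUM_PROCS - 1) / NUM_PROCS;
    int lo = 1 + s * chunk;
    int hi = (lo + chunk - 1 > N) ? N : lo + chunk - 1;
    for (int i = lo; i <= hi; i++)        /* iteration set I_s */
        for (int j = 1; j <= N; j++)
            touched[i][j] = true;         /* reference function R(i,j) = (i,j) */
}

int main(void) {
    region_of_proc(1);
    int count = 0;
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            count += touched[i][j];
    printf("|D_1| = %d elements\n", count);   /* 25 rows x 100 columns = 2500 */
    return 0;
}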
13. A unification example
Array A[1000, 1000]
14. A unification example
(figure: the partitions of A accessed by processors P1-P4 in loop nests L1, L2, and L3; row-block array partitioning is selected as the unified partitioning)
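As a purely illustrative sketch of what a unification step might do (the selection rule and the per-nest preferences below are assumptions, not the paper's example), each nest votes for the partitioning that fits its own access pattern and the majority choice is applied to the whole array.

/* Illustrative sketch only: choose one partitioning for all nests by
 * majority vote over the per-nest preferred partitionings. */
#include <stdio.h>

enum partitioning { ROW_BLOCK, COLUMN_BLOCK };

/* Assumed per-nest preferences, e.g. derived from whether the parallelized
 * loop index appears in the row or the column subscript of the array. */
static const enum partitioning nest_pref[] = { ROW_BLOCK, ROW_BLOCK, COLUMN_BLOCK };

static enum partitioning unify(const enum partitioning *pref, int nests) {
    int row_votes = 0;
    for (int l = 0; l < nests; l++)
        if (pref[l] == ROW_BLOCK) row_votes++;
    return (2 * row_votes >= nests) ? ROW_BLOCK : COLUMN_BLOCK;
}

int main(void) {
    enum partitioning chosen = unify(nest_pref, 3);
    printf("unified partitioning: %s\n",
           chosen == ROW_BLOCK ? "row-block" : "column-block");
    return 0;
}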
15. Data mapping and loop iteration assignment
16. Issue-2: partial array access pattern
L1: for I = 10 to 300
      for J = 100 to 900
        A[I, J]
L2: for I = 10 to 600
      for J = 500 to 900
        A[I, J]
(figure: the regions of A accessed by L1 and L2 overlap only partially)
17. Issue-3: multiple arrays
- Affinity: two data items accessed by the same loop iteration are affinitive
- Affinity class: a set of affinitive data elements
do t1 = 1, M-2
  do t2 = 4, N
    Z1[t1][t2] = Z2[t2][t1] + Z3[t1+2][t2-3]
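A small sketch of how affinity classes could be built for the nest above; the data structure class_of and the one-class-per-iteration policy are assumptions (they suffice here because each iteration touches distinct elements of Z1, Z2, and Z3, so no class merging is needed).

/* Hypothetical sketch: an affinity class groups the elements of Z1, Z2, and
 * Z3 touched by the same iteration (t1, t2), so a later mapping step can
 * keep the whole class on one processor's cache. */
#include <stdio.h>

#define M 8
#define N 8

/* class_of[array][row][col] records which affinity class an element joined;
 * -1 means "not accessed by the nest". */
static int class_of[3][M + 3][N + 1];

static void build_affinity_classes(void) {
    for (int a = 0; a < 3; a++)
        for (int i = 0; i < M + 3; i++)
            for (int j = 0; j < N + 1; j++)
                class_of[a][i][j] = -1;

    int next_class = 0;
    /* The nest from the slide: Z1[t1][t2] = Z2[t2][t1] + Z3[t1+2][t2-3] */
    for (int t1 = 1; t1 <= M - 2; t1++)
        for (int t2 = 4; t2 <= N; t2++) {
            int c = next_class++;                 /* one class per iteration */
            class_of[0][t1][t2] = c;              /* Z1[t1][t2]     */
            class_of[1][t2][t1] = c;              /* Z2[t2][t1]     */
            class_of[2][t1 + 2][t2 - 3] = c;      /* Z3[t1+2][t2-3] */
        }
    printf("%d affinity classes built\n", next_class);
}

int main(void) { build_affinity_classes(); return 0; }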
18. Determining a processor's workload under the multiple-arrays scenario
19. An example
L1: do t1 = 1, N
      do t2 = 1, N
        Z1[t1][t2] = Z2[t2][t1]
L2: do t1 = 1, N
      do t2 = 1, N
        Z1[t2][t1] = Z3[t2][t1]
No cross-loop data dependence except for t2 in loop nest L1.
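For this example, a hedged sketch of a locality-preserving mapping (the row-block choice and function names are assumptions): L1's iteration (t1, t2) writes Z1[t1][t2] and is mapped by t1, while L2's iteration (t1, t2) writes Z1[t2][t1] and is mapped by t2, so both nests touch the same rows of Z1 on the same processor.

/* Hypothetical sketch tied to the example above, not the paper's algorithm. */
#include <stdio.h>

#define NUM_PROCS 4
#define N 8

static int owner_of_row(int r) {               /* row-block owner of Z1 row r */
    int chunk = (N + NUM_PROCS - 1) / NUM_PROCS;
    return (r - 1) / chunk;
}

/* L1 touches Z1 row t1; L2 touches Z1 row t2. */
static int proc_for_L1(int t1, int t2) { (void)t2; return owner_of_row(t1); }
static int proc_for_L2(int t1, int t2) { (void)t1; return owner_of_row(t2); }

int main(void) {
    /* L1's (3, 5) and L2's (5, 3) both touch Z1[3][5] and land on one processor. */
    printf("L1(3,5) -> P%d, L2(5,3) -> P%d\n",
           proc_for_L1(3, 5), proc_for_L2(5, 3));
    return 0;
}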
20. Experimental environment
- Global optimization implemented within SUIF
- Omega library used to generate code after loop iteration assignment
- MPSoC architecture
- 8KB L1 cache per processor
- L1 latency: 2 cycles
- Main memory latency: 80 cycles
- Energy values based on CACTI under 70nm technology
We thank I. Kolcu of Univ. of Manchester for
his help in implementation.
21. Memory energy improvement
22. Impact of load balancing
- Load imbalance may degrade performance
- How to balance the load (see the sketch after this list):
- Measure load imbalance severity
- Re-distribute loop iterations
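A minimal sketch of one possible balancing step, assuming per-processor iteration counts and a tolerance threshold (both illustrative, not the paper's heuristic): measure how far the heaviest processor is above the average load and, if the imbalance is severe, shift iterations toward the lightest processor.

#include <stdio.h>

#define NUM_PROCS 4

static void rebalance(int load[NUM_PROCS], double threshold) {
    int total = 0, max_p = 0, min_p = 0;
    for (int p = 0; p < NUM_PROCS; p++) {
        total += load[p];
        if (load[p] > load[max_p]) max_p = p;
        if (load[p] < load[min_p]) min_p = p;
    }
    double avg = (double)total / NUM_PROCS;
    double severity = load[max_p] / avg;     /* 1.0 means perfectly balanced */
    if (severity <= threshold) return;       /* imbalance tolerated */

    /* One redistribution step: move iterations from the heaviest to the
     * lightest processor, without exceeding either one's gap to the average.
     * Locality-aware selection of *which* iterations to move is omitted. */
    int excess = load[max_p] - (int)avg;
    int deficit = (int)avg - load[min_p];
    int moved = excess < deficit ? excess : deficit;
    load[max_p] -= moved;
    load[min_p] += moved;
    printf("moved %d iterations from P%d to P%d\n", moved, max_p, min_p);
}

int main(void) {
    int load[NUM_PROCS] = { 400, 120, 100, 80 };   /* iterations per processor */
    rebalance(load, 1.2);
    for (int p = 0; p < NUM_PROCS; p++) printf("P%d: %d\n", p, load[p]);
    return 0;
}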
23. Impact of workload balancing
24. Impact of data size (Conv-Spa)
25. Performance improvement
26. Conclusions
- A global (application-wide) parallelization strategy for improving data locality across loop nests
- Data-locality-aware loop iteration assignment for MPSoC
- Experimental evaluation demonstrating the energy and performance improvements
28. Discussion
- Obtain a globally good data mapping through loop-based parallelization and unification
- Possible negative impact on parallelism
- The gains from data locality are expected to outweigh the losses in loop-level parallelism
29. Issue-2: array access pattern
- Problem: how to partition each array such that iterations from the above two nests access the same set of elements as much as possible
30. Procedure under partial array access
- Determine the common elements accessed by the two nests
- Assign the first portion of the common elements to the first processor, the next portion to the second processor, and so on
- Assign the other elements to the remaining processors
- Follow a similar process for the second nest, except that the same set of common elements is assigned to the same processor as in the previous nest (a sketch follows this list)
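A sketch of these steps for the regions of A from the earlier example (L1: rows 10-300, columns 100-900; L2: rows 10-600, columns 500-900); the even split of the common elements and the owner map are illustrative assumptions about how the procedure could be realized.

#include <stdio.h>

#define NUM_PROCS 4
#define DIM 1001

static signed char owner[DIM][DIM];   /* -1: unassigned, else processor id */

/* Step 1: the common elements are the intersection of the two regions. */
static int in_common(int i, int j) {
    return i >= 10 && i <= 300 && j >= 500 && j <= 900;
}

static void assign_nest1(void) {
    /* Step 2: hand out the common elements in equal consecutive shares. */
    long common = 291L * 401L, share = (common + NUM_PROCS - 1) / NUM_PROCS, k = 0;
    for (int i = 10; i <= 300; i++)
        for (int j = 100; j <= 900; j++) {
            if (in_common(i, j)) owner[i][j] = (signed char)(k++ / share);
            else owner[i][j] = -1;            /* non-common, assigned below */
        }
    /* Step 3: spread the remaining (non-common) elements of L1's region. */
    long r = 0;
    for (int i = 10; i <= 300; i++)
        for (int j = 100; j <= 900; j++)
            if (owner[i][j] == -1) owner[i][j] = (signed char)(r++ % NUM_PROCS);
}

/* Step 4: for the second nest, an iteration touching a common element runs on
 * the processor that already owns it; L2's private elements get new owners
 * (the i % NUM_PROCS choice here is purely illustrative). */
static int proc_for_nest2(int i, int j) {
    return in_common(i, j) ? owner[i][j] : (i % NUM_PROCS);
}

int main(void) {
    assign_nest1();
    printf("A[100][600]: nest1 owner P%d, nest2 runs it on P%d\n",
           owner[100][600], proc_for_nest2(100, 600));
    return 0;
}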