1
Locality-Conscious Workload Assignment for
Array-Based Computations in MPSoC Architectures
  • Feihui Li and Mahmut Kandemir
  • Dept. of Computer Science and Engineering
  • The Pennsylvania State University
  • June 14, 2005

2
Outline
  • Introduction
  • Loop-based code parallelization
  • Global optimization
  • Experimental evaluation
  • Conclusions

3
Introduction
  • Why MPSoC?
  • Efficient area utilization
  • Easy design verification
  • Flexible workload assignment
  • Important issues in workload assignment
  • Locality
  • Parallelism
  • Load balance
  • Our focus: improving data locality for MPSoCs
    (minimizing off-chip references) through
    compiler support

4
Related work
  • Customized memory hierarchy design and loop
    transformations for low power (F. Catthoor et
    al., Kluwer Academic Publishers, 1998)
  • Data shackling: data-centric code restructuring
    for cache-based single-processor systems (I.
    Kodukula et al., PLDI'97)
  • Memory optimization for low-power embedded
    systems (W.-T. Shiue and C. Chakrabarti, DAC'99)
  • Energy saving based on adaptive loop
    parallelization (I. Kadayif et al., DAC'02)
  • Energy-efficient synchronization (J. Li et al.,
    HPCA'04)
  • Energy efficiency of CMP vs. SMT (R. Sasanka et
    al., ICS'04)

5
Loop-based code parallelization

for i = 1, 100
  for j = 1, 100
    ...

  • Coarse-grain parallelism at the loop level
  • Data dependence constraints
  • Parallelize each loop nest individually (see the
    sketch below)
  • Drawback: does not capture data locality between
    different loop nests
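
A minimal sketch of this loop-level scheme, assuming a
hypothetical 4-processor MPSoC and a block distribution of
the outer loop; the loop body and all names are illustrative,
not the paper's generated code:

#include <stdio.h>

#define N     100
#define NPROC 4                       /* assumed processor count */

double A[N][N];

/* Each processor gets a contiguous block of the outer-loop
 * iterations of one nest, decided per nest, independently of
 * how the other nests were parallelized. */
void run_nest_on_processor(int p)
{
    int lo = p * N / NPROC;           /* first outer iteration of p  */
    int hi = (p + 1) * N / NPROC;     /* one past its last iteration */
    for (int i = lo; i < hi; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = A[i][j] * 2.0;  /* placeholder loop body */
}

int main(void)
{
    for (int p = 0; p < NPROC; p++)   /* sequential stand-in for the cores */
        run_nest_on_processor(p);
    printf("A[0][0] = %f\n", A[0][0]);
    return 0;
}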

6
Iteration vector and iteration space
for i1 = ...
  for i2 = ...
    ...
      for in = ...

for i = 1, N
  for j = 1, N
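
The formulas on this slide did not survive extraction; a
standard reconstruction of the two definitions the loops
illustrate:

% Iteration vector of the n-deep nest, and iteration space of
% the two-deep example (standard definitions, reconstructed
% from context).
\vec{i} = (i_1, i_2, \ldots, i_n), \qquad
I = \{\, (i, j) : 1 \le i \le N,\ 1 \le j \le N \,\}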
7
A motivating example
8
Loop-based vs. global parallelization
9
Global parallelization scheme
  • Main goal: improve data locality for each
    processor; loop iterations accessing the same
    array region should be executed by the same
    processor
  • Methodology: inter-loop-nest data-reuse-aware
    workload assignment
  • Constraint: inter-loop-iteration data dependences

10
Mathematical model
  • Assign loop iterations to each processor based on
    array access patterns (sketched below)
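
A hedged reconstruction of the assignment condition, using the
Is, Rs, and Ds notation of the later slides rather than the
paper's exact formulation: processor s executes exactly those
iterations whose referenced array elements fall in its
partition.

% Reconstruction (assumption): I is the iteration space,
% R(\vec{i}) the set of array elements referenced by iteration
% \vec{i}, and D_s the array region assigned to processor s.
I_s = \{\, \vec{i} \in I : R(\vec{i}) \subseteq D_s \,\}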

11
Issues in global parallelization
  • Array element partitioning
  • Array access patterns
  • Cache locality for each processor
  • Parallelism in the application program
  • Partial array access
  • A loop nest may access only a portion of an array
  • Multiple arrays
  • Multiple arrays accessed by the same processor
    share the same data cache
  • Loop iteration mapping must consider all the
    arrays to improve data locality

12
Issue 1: array element partitioning
  • Use loop-based parallelization to extract maximum
    parallelism
  • Given the iteration set Is executed by processor
    s and the array reference function set Rs,
    determine the accessed array region Ds
  • Ds can be different for different loop nests
  • A unification step obtains a globally acceptable
    array partitioning (see the sketch below)
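
A minimal sketch of the per-nest region computation and
unification, assuming regions are contiguous row ranges of a
1000-row array; the per-nest ranges and the bounding-range
merge are illustrative assumptions, and a real unification
step would further reconcile overlaps between processors:

#include <stdio.h>

#define NPROC 4

/* Row range [lo, hi) of the array accessed by one processor in
 * one nest: a simplified stand-in for the region D_s. */
typedef struct { int lo, hi; } Region;

/* Hypothetical regions per processor s under loop-based
 * parallelization of two different nests. */
Region nest1_region(int s) { return (Region){ 250 * s, 250 * (s + 1) }; }
Region nest2_region(int s) { return (Region){ 200 * s, 200 * (s + 1) + 200 }; }

/* Unification sketch: merge the ranges a processor touches
 * across nests into one bounding range; a real unification step
 * would then reconcile overlaps between processors into one
 * consistent partitioning (row-block in the slides' example). */
Region unify(Region a, Region b)
{
    return (Region){ a.lo < b.lo ? a.lo : b.lo,
                     a.hi > b.hi ? a.hi : b.hi };
}

int main(void)
{
    for (int s = 0; s < NPROC; s++) {
        Region d = unify(nest1_region(s), nest2_region(s));
        printf("processor %d: rows [%d, %d)\n", s, d.lo, d.hi);
    }
    return 0;
}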

13
A unification example
Array A[1000, 1000]
14
A unification example
(Figure: regions of array A assigned to processors
P1-P4 in loop nests L1, L2, and L3; row-block array
partitioning is selected as the unified partitioning.)
15
Data mapping and loop iteration assignment
16
Issue 2: partial array access pattern

L1: for I = 10 to 300
      for J = 100 to 900
        ... A[I, J] ...

L2: for I = 10 to 600
      for J = 500 to 900
        ... A[I, J] ...

(Figure: the regions of A accessed by L1 and L2
overlap only partially.)
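
A minimal sketch of computing the common (overlapping) region
of A touched by both nests, using the bounds above;
representing an access region as one rectangle is an
assumption:

#include <stdio.h>

/* Rectangular region of a 2-D array: [ilo, ihi] x [jlo, jhi]. */
typedef struct { int ilo, ihi, jlo, jhi; } Rect;

/* Intersection of two rectangular access regions; an empty
 * result has ilo > ihi or jlo > jhi. */
Rect intersect(Rect a, Rect b)
{
    Rect r;
    r.ilo = a.ilo > b.ilo ? a.ilo : b.ilo;
    r.ihi = a.ihi < b.ihi ? a.ihi : b.ihi;
    r.jlo = a.jlo > b.jlo ? a.jlo : b.jlo;
    r.jhi = a.jhi < b.jhi ? a.jhi : b.jhi;
    return r;
}

int main(void)
{
    Rect l1 = { 10, 300, 100, 900 };  /* region accessed by nest L1 */
    Rect l2 = { 10, 600, 500, 900 };  /* region accessed by nest L2 */
    Rect c  = intersect(l1, l2);
    printf("common region: I in [%d, %d], J in [%d, %d]\n",
           c.ilo, c.ihi, c.jlo, c.jhi);
    return 0;
}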
17
Issue 3: multiple arrays
  • Affinity: two data items accessed by the same
    loop iteration are affinitive
  • Affinity class: a set of affinitive data elements
    (see the sketch below)

do t1 = 1, M-2
  do t2 = 4, N
    Z1[t1][t2] = Z2[t2][t1] + Z3[t1+2][t2-3]
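
A minimal sketch of forming affinity classes for the nest
above. Here each iteration touches three distinct elements, so
one class per iteration suffices; in general, elements shared
by several iterations would require merging classes (e.g., with
union-find). The bounds M and N are illustrative:

#include <stdio.h>

#define M 6   /* illustrative bounds */
#define N 8

/* class_id[k][i][j] = affinity class of element Z(k+1)[i][j];
 * -1 means the element is never accessed. */
int class_id[3][M + 3][N + 1];

int main(void)
{
    for (int k = 0; k < 3; k++)
        for (int i = 0; i < M + 3; i++)
            for (int j = 0; j < N + 1; j++)
                class_id[k][i][j] = -1;

    int next = 0;
    for (int t1 = 1; t1 <= M - 2; t1++)
        for (int t2 = 4; t2 <= N; t2++) {
            int c = next++;                  /* class of iteration (t1, t2) */
            class_id[0][t1][t2]         = c; /* Z1[t1][t2]     */
            class_id[1][t2][t1]         = c; /* Z2[t2][t1]     */
            class_id[2][t1 + 2][t2 - 3] = c; /* Z3[t1+2][t2-3] */
        }

    printf("classes of Z1[1][4], Z2[4][1], Z3[3][1]: %d %d %d\n",
           class_id[0][1][4], class_id[1][4][1], class_id[2][3][1]);
    return 0;
}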
18
Determining the workload of a processor under the
multiple-arrays scenario
19
An example
L1: do t1 = 1, N
      do t2 = 1, N
        Z1[t1][t2] = Z2[t2][t1]

L2: do t1 = 1, N
      do t2 = 1, N
        Z1[t2][t1] = Z3[t2][t1]
No cross-loop data dependence, except through Z1,
which is written in loop nest L1
20
Experimental environment
  • Global optimization implemented within SUIF
  • Omega library used to generate code after loop
    iteration assignment
  • MPSoC architecture:
  • 8 KB L1 cache per processor
  • L1 latency: 2 cycles
  • Main memory latency: 80 cycles
  • Energy values based on CACTI under 70 nm
    technology

We thank I. Kolcu of the University of Manchester
for his help with the implementation.
21
Memory energy improvement
22
Impact of load balancing
  • Load imbalance may degrade performance
  • How to balance load:
  • Measure load-imbalance severity
  • Re-distribute loop iterations (see the sketch
    below)
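
A minimal sketch of the two steps above, measuring severity as
maximum load over average load; the loads, the 1.2 threshold,
and the locality-oblivious even redistribution are illustrative
assumptions, not the paper's method:

#include <stdio.h>

#define NPROC 4

int main(void)
{
    int load[NPROC] = { 400, 100, 300, 200 };  /* hypothetical counts */

    /* Step 1: measure imbalance severity as max load / avg load. */
    int max = 0, total = 0;
    for (int p = 0; p < NPROC; p++) {
        total += load[p];
        if (load[p] > max)
            max = load[p];
    }
    double severity = max / ((double)total / NPROC);
    printf("imbalance severity: %.2f\n", severity);

    /* Step 2: if severity exceeds the (assumed) threshold, spread
     * the iterations evenly across processors, for illustration. */
    if (severity > 1.2) {
        for (int p = 0; p < NPROC; p++)
            load[p] = total / NPROC + (p < total % NPROC ? 1 : 0);
    }
    for (int p = 0; p < NPROC; p++)
        printf("processor %d: %d iterations\n", p, load[p]);
    return 0;
}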

23
Impact of workload balancing
24
Impact of data size (Conv-Spa)
25
Performance improvement
26
Conclusions
  • A global (application-wide) parallelization
    strategy for improving data locality across loop
    nests
  • Data-locality-aware loop iteration assignment
    for MPSoCs
  • Experimental evaluation demonstrating the energy
    and performance improvements

27
  • Thank you!

28
Discussion
  • Obtain a globally good data mapping through
    loop-based parallelization and unification
  • Possible negative impact on parallelism
  • We expect the gains from data locality to
    outweigh the losses in loop-level parallelism

29
Issue 2: array access pattern
  • Problem: how to partition each array such that
    iterations from the above two nests access the
    same set of elements as much as possible

30
Procedure under partial array access
  • Determine the common elements accessed by both
    nests
  • Assign the first group of common elements to the
    first processor, the next group to the second
    processor, and so on
  • Assign the other elements to the remaining
    processors
  • Follow a similar process for the second nest,
    except that the same set of common elements is
    assigned to the same processor as in the first
    nest (see the sketch below)
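
A minimal sketch of the block-wise assignment of common
elements, treating the common region as a flat list of n
elements split into consecutive equal blocks so that both
nests map the same common element to the same processor; the
flat indexing and even split are assumptions:

#include <stdio.h>

#define NPROC 4

/* Returns the processor that owns the k-th of n common elements
 * when they are split into NPROC consecutive blocks; using the
 * same function for both nests keeps the mapping consistent. */
int owner_of_common(int k, int n)
{
    int block = (n + NPROC - 1) / NPROC;  /* ceil(n / NPROC) */
    return k / block;
}

int main(void)
{
    int n = 10;  /* hypothetical number of common elements */
    for (int k = 0; k < n; k++)
        printf("common element %d -> processor %d\n",
               k, owner_of_common(k, n));
    return 0;
}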