Title: Exploiting Shared Scratch Pad Memory Space in Embedded Multiprocessor Systems

Slide 1: Exploiting Shared Scratch Pad Memory Space in Embedded Multiprocessor Systems
- Mahmut Kandemir, Penn State Univ.
- J. Ramanujam, Louisiana State Univ.
- Alok Choudhary, Northwestern Univ.
- June 2002, DAC '02, New Orleans
Slide 2: Outline
- Motivation: memory size affects power and performance
- Increasing use of embedded multiprocessors
- Software optimizations play a crucial role in exploiting features like shared scratch-pad memory
- Preliminaries: DSP and VSP codes, nested loops, array computations
- Compiler strategy for improving data sharing in shared scratch-pad memories
- Off-chip accesses are expensive
- Careful scheduling of computations and data accesses can eliminate extra off-chip accesses
- Reduced energy-delay product (24% reduction) for a four-processor system with shared scratch-pad memories on several applications
Slide 3: Motivation
- Software optimizations are important for mobile, embedded systems, beyond circuit and architectural solutions
- Embedded multiprocessors are used to meet the computational demands of several image and video signal processing applications
- Memory architecture is customized in embedded systems
- DSP and video signal processing codes
- Computations on large multi-dimensional arrays
- Nested loops
- Storage and access of data consume significant energy
- Off-chip access in a shared scratch-pad system is expensive
- Idea: coordinate the schedule of computations on the processors so that the data needed by one processor is already in the on-chip shared scratch-pad
Slide 4: System Architecture - Virtually Shared Scratch Pad Memory
Slide 5: System Architecture - 2
- Scratch-pad memory (SPM): software-managed on-chip SRAM
- Virtually shared SPM (VS-SPM): fast communication links between the SPMs of the processors
- Multiprocessor-on-a-chip: each processor can access its (local) SPM and the (remote) SPMs of the other processors
- Per-access energy and latency for the off-chip DRAM are much higher than those of the on-chip SPMs
- Example system: VME/C6420 from Bluewave Systems (provides a crossbar on-chip interconnect)
Slide 6: Execution Model
- Input: a loop-level parallelized application
- Different processors execute a subset of the loop iterations
- Communication and synchronization among processors occur via fast on-chip links
- Barrier synchronization between loop nests
- The local SPM is much smaller than the portion of the array accessed by each processor
- Use data tiles to reduce off-chip accesses
7Example Code Fragment - 1
parfor (I2 I lt N-1 I) parfor (j2 j lt
N-1 j) U2Ij f( U1Ij
U1Ij-1 U1I-1j
U1Ij1
U1I1j )
Slide 8: Data Accessed by a Processor - 2
(a) Non-local portion for processor P1; (b) local portion for each processor; (c) data tiles processed by P4
Slide 9: Tile Access Pattern - 1
Slide 10: Tile Access Pattern - 2
Slide 11: Coordinated Tile Access Pattern - 1
Slide 12: Coordinated Tile Access Pattern - 2
Slide 13: Problems in Compiler Optimization
- Shape and size of data tiles
- Local SPM size determines tile size
- Assume rectilinear tile shapes
- SPM size is the same for all processors
- Tile access pattern (a.k.a. scheduling)
- Goal: find a tile access pattern for each processor so that unnecessary off-chip memory accesses arising from non-local SPM accesses are eliminated
- Need to find a tile access pattern for each processor; it need not be the same for all of them
- For row and column tiles in two dimensions, there is only one direction of movement; for rectangular tiles, two directions of movement
- Coordinated scheduling accounts for movement
Slide 14: Coordinated Legal Schedules
- Use a matrix notation to denote schedule directions
- Derived a scheduling equality that must be satisfied by all pairs of communicating processors
- The per-communicating-processor-pair equality depends on the tile shape (1D row/column or 2D rectangular)
- Details in the paper
- If some of the loops in the nest are not parallel (due to dependence constraints), this may not result in the elimination of all extra off-chip accesses
Slide 15: Experimental Setup - 1
- Four array-dominated image processing codes
- 3D, 305 KB (building models and scenes)
- dfe, 286 KB (digital image filtering and enhancement)
- splat, 635 KB (volume rendering)
- wave, 628 KB (wavelet compression code)
- Simulated system
- Each processor is a 100 MHz MIPS 4Kp core
- Local SPM access latency: 2 cycles
- Non-local SPM access latency: 4 to 16 cycles (1 extra cycle for synchronization)
- Off-chip DRAM: 4 MB, access latency 80 cycles
Slide 16: Experimental Setup - 2
- Aggressive parallelization strategy (parallelize as many loops as possible)
- Energy model
- For SPM: similar to that of a cache (Shiue and Chakrabarti, 1999), except that it assumes full associativity and ignores tag arrays
- For interconnects: transition-sensitive, similar to that of Zhang et al., 1999
- Focus on data memory energy and performance (instruction accesses and datapath activity not included)
- However, the total execution cycle count accounts for cycles spent in the datapath (stall cycles)
Slide 17: Energy-Delay Product (Base Configuration)
[Chart: percentage savings; row tiles, 4 processors, SPM size = 1/8 of local data size, SPM latency = 2 cycles, off-chip latency = 80 cycles, remote SPM latency varied]
Slide 18: Effect of Number of Processors
Slide 19: Effect of Tile Shape on Percentage Savings in Energy-Delay Product
Slide 20: Experimental Results Summary
- The percentage reduction in energy-delay product when only the remote (non-local) SPM latency is changed increases with decreasing remote SPM latency (as expected)
- Number of processors
- Our solution is more effective with a larger number of processors, due to an increased volume of inter-processor communication
- Highlights the effect of the number of processors
- Tile shape and size of available SPM (slab ratio)
- More effective with smaller slab ratios (more pressure on data memory)
- Square tiles are better than row or column tiles in two dimensions
Slide 21: Conclusions
- Developed scheduling solutions to eliminate, where possible, extra off-chip accesses arising from computations on embedded multiprocessors with shared on-chip scratch-pad memories
- Significant percentage savings in energy-delay product compared to the case where no specific strategy is used
- Work in progress
- Extend to multiple levels of (software-controlled) SPMs
- Effect of using both SPMs and data caches
- Heterogeneous multiprocessor environments
- Effect of inter-processor communication on off-chip accesses