Title: Exploiting Shared Scratch Pad Memory Space in Embedded Multiprocessor Systems

Slide 1: Exploiting Shared Scratch Pad Memory Space in Embedded Multiprocessor Systems
- Mahmut Kandemir, Penn State Univ.
- J. Ramanujam, Louisiana State Univ.
- Alok Choudhary, Northwestern Univ.
- June 2002, DAC '02, New Orleans
Slide 2: Outline
- Motivation: memory size affects power and performance
- Increasing use of embedded multiprocessors
- Software optimizations play a crucial role in exploiting features like shared scratch-pad memory
- Preliminaries: DSP and VSP codes, nested loops, array computations
- Compiler strategy for improving data sharing in shared scratch-pad memories
- Off-chip accesses are expensive
- Careful scheduling of computations and data accesses can eliminate extra off-chip accesses
- Reduced energy-delay product (24% reduction) for a four-processor system with shared scratch-pad memories on several applications
Slide 3: Motivation
- Software optimizations are important for mobile, embedded systems, beyond circuit and architectural solutions
- Embedded multiprocessors are used to meet the computational demands of several image and video signal processing applications
- Memory architecture is customized in embedded systems
- DSP and video signal processing codes
- Computations on large multi-dimensional arrays
- Nested loops
- Storage and access of data consume significant energy
- Off-chip access in a shared scratch-pad system is expensive
- Idea: coordinate the schedule of computations on the processors so that the data needed by one processor is already in the on-chip shared scratch-pad
Slide 4: System Architecture - Virtually Shared Scratch Pad Memory
Slide 5: System Architecture - 2
- Scratch-pad memory (SPM): software-managed on-chip SRAM
- Virtually shared SPM (VS-SPM): fast communication links between the SPMs of the processors
- Multiprocessor-on-a-chip: each processor can access its (local) SPM and the (remote) SPMs of the other processors
- Per-access energy and latency for the off-chip DRAM are much higher than those of the on-chip SPMs
- Example system: VME/C6420 from Bluewave Systems (provides a crossbar on-chip interconnect)
Slide 6: Execution Model
- Input: a loop-level parallelized application
- Different processors execute a subset of the loop iterations
- Communication and synchronization among processors occur via fast on-chip links
- Barrier synchronization between loop nests
- The local SPM is much smaller than the portion of the array accessed by each processor
- Use data tiles to reduce off-chip accesses
7Example Code Fragment - 1
parfor (I2 I lt N-1 I) parfor (j2 j lt
N-1 j) U2Ij f( U1Ij
U1Ij-1 U1I-1j
U1Ij1
U1I1j )
Slide 8: Data Accessed by a Processor - 2
(a) Non-local portion for processor P1; (b) local portion for each processor; (c) data tiles processed by P4
Slide 9: Tile Access Pattern - 1
Slide 10: Tile Access Pattern - 2
Slide 11: Coordinated Tile Access Pattern - 1
Slide 12: Coordinated Tile Access Pattern - 2
Slide 13: Problems in Compiler Optimization
- Shape and size of data tiles
- Local SPM size determines tile size
- Assume rectilinear tile shapes
- SPM size is the same for all processors
- Tile access pattern (a.k.a. scheduling)
- Goal: find a tile access pattern for each processor so that unnecessary off-chip memory accesses arising from non-local SPM accesses are eliminated
- Need to find a tile access pattern for each processor; it need not be the same for all of them
- For row and column tiles in two dimensions, there is only one direction of movement; for rectangular tiles, two directions of movement
- Coordinated scheduling accounts for movement
Slide 14: Coordinated Legal Schedules
- Use a matrix notation to denote schedule directions
- Derived a scheduling equality that must be satisfied by all pairs of communicating processors
- The per-communicating-processor-pair equality depends on the tile shape (1D row/column or 2D rectangular)
- Details in the paper
- If some of the loops in the nest are not parallel (due to dependence constraints), this may not result in the elimination of all extra off-chip accesses
Slide 15: Experimental Setup - 1
- Four array-dominated image processing codes
- 3D, 305 KB (building models and scenes)
- dfe, 286 KB (digital image filtering and enhancement)
- splat, 635 KB (volume rendering)
- wave, 628 KB (wavelet compression code)
- Simulated system
- Each processor is a 100 MHz MIPS 4Kp core
- Local SPM access latency: 2 cycles
- Non-local SPM access latency: 4 to 16 cycles (1 extra cycle for synchronization)
- Off-chip DRAM: 4 MB, access latency 80 cycles
Slide 16: Experimental Setup - 2
- Aggressive parallelization strategy (parallelize as many loops as possible)
- Energy model
- For SPM: similar to that of a cache (Shiue and Chakrabarti, 1999), except that it assumes full associativity and ignores tag arrays
- For interconnects: transition-sensitive, similar to that of Zhang et al., 1999
- Focus on data memory energy and performance (instruction accesses and datapath activity not included)
- However, the total execution cycle count accounts for cycles spent in the datapath (stall cycles)
Slide 17: Energy-Delay Product (Base Configuration)
[Chart: percentage savings; row tiles, 4 processors, SPM size = 1/8 of local data size, SPM latency = 2 cycles, off-chip latency = 80 cycles, remote SPM latency varied]
Slide 18: Effect of Number of Processors
Slide 19: Effect of Tile Shape on Percentage Savings in Energy-Delay Product
Slide 20: Experimental Results Summary
- The percentage reduction in energy-delay product when only the remote (non-local) SPM latency is changed increases with decreasing remote SPM latency (as expected)
- Number of processors
- Our solution is more effective with a larger number of processors, due to an increased volume of inter-processor communication
- Highlights the effect of the number of processors
- Tile shape and size of available SPM (slab ratio)
- More effective with smaller slab ratios (more pressure on data memory)
- Square tiles are better than row or column tiles in two dimensions
Slide 21: Conclusions
- Developed scheduling solutions to eliminate, where possible, extra off-chip accesses arising from computations on embedded multiprocessors with shared on-chip scratch-pad memories
- Significant percentage savings in energy-delay product compared to the case where no specific strategy is used
- Work in progress
- Extend to multiple levels of (software-controlled) SPMs
- Effect of using both SPMs and data caches
- Heterogeneous multiprocessor environments
- Effect of inter-processor communication on off-chip accesses