1
Exploiting Shared Scratch Pad Memory Space in
Embedded Multiprocessor Systems
  • Mahmut Kandemir, Penn State Univ.
  • J. Ramanujam, Louisiana State Univ.
  • Alok Choudhary, Northwestern Univ.
  • DAC '02, New Orleans, June 2002

2
Outline
  • Motivation: Memory size affects power and
    performance
  • Increasing use of embedded multiprocessors
  • Software optimizations play a crucial role in
    exploiting features like shared scratch-pad
    memory
  • Preliminaries: DSP and video signal processing
    (VSP) codes, nested loops, array computations
  • Compiler strategy for improving data sharing in
    shared scratch-pad memories
  • Off-chip accesses expensive
  • Careful scheduling of computations and data
    accesses can eliminate extra off-chip accesses
  • Reduced energy-delay product (24% reduction) on
    several applications for a four-processor system
    with shared scratch-pad memories

3
Motivation
  • Software optimizations are important for mobile,
    embedded systems beyond circuit and architectural
    solutions
  • Embedded multiprocessors are used to meet the
    computational demands of several image and
    video signal processing applications
  • Memory architecture customized in embedded
    systems
  • DSP and video signal processing codes
  • Computations on large multi-dimensional arrays
  • Nested loops
  • Storage and access of data consumes significant
    energy
  • Off-chip access in shared scratch-pad expensive
  • Idea: Coordinate the schedule of computations on
    the processors so that data needed by one
    processor is already available in the on-chip
    shared scratch-pad

4
System Architecture: Virtually Shared Scratch
Pad Memory
5
System Architecture - 2
  • Scratch-pad memory (SPM): Software-managed on-chip SRAM
  • Virtually shared SPM (VS-SPM): Fast communication
    links between the SPMs of the processors
  • Multiprocessor-on-a-chip: each processor can access
    its (local) SPM and the (remote) SPMs of other
    processors
  • Per-access energy and latency for the off-chip
    DRAM are much higher than those of the on-chip SPMs
  • Example system: VME/C6420 from Bluewave Systems
    (provides a cross-bar on-chip interconnect)

6
Execution Model
  • Input: Loop-level parallelized application
  • Different processors execute a subset of the
    loop iterations
  • Communication and synchronization among
    processors via fast on-chip links
  • Barrier synchronization between loop nests
  • Local SPM is much smaller than the part of the
    array accessed by each processor
  • Use data tiles to reduce off-chip accesses
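
The data-tile idea can be sketched in C. Everything below is illustrative (N, TILE, and the doubling kernel are made-up stand-ins, not values from the paper); the point is the structure: stage a tile into on-chip memory, compute on it there, write it back.

```c
#include <assert.h>

#define N    64   /* array dimension (illustrative) */
#define TILE  8   /* tile edge, chosen so a TILE x TILE tile fits in the local SPM */

/* Process one TILE x TILE data tile: stage it in the (simulated) local SPM,
 * compute on it there, then write it back. */
static void process_tile(double a[N][N], int ti, int tj)
{
    double spm[TILE][TILE];                   /* stand-in for the on-chip SPM */

    for (int i = 0; i < TILE; i++)            /* one off-chip read per element */
        for (int j = 0; j < TILE; j++)
            spm[i][j] = a[ti + i][tj + j];

    for (int i = 0; i < TILE; i++)            /* all reuse happens on-chip */
        for (int j = 0; j < TILE; j++)
            spm[i][j] *= 2.0;

    for (int i = 0; i < TILE; i++)            /* one off-chip write per element */
        for (int j = 0; j < TILE; j++)
            a[ti + i][tj + j] = spm[i][j];
}

/* Walk the array tile by tile instead of element by element, so each
 * off-chip element is fetched only once per tile visit. */
void process_array(double a[N][N])
{
    for (int ti = 0; ti < N; ti += TILE)
        for (int tj = 0; tj < N; tj += TILE)
            process_tile(a, ti, tj);
}
```

Because the local SPM is much smaller than each processor's share of the array, TILE is what the SPM capacity dictates, not what the loop structure suggests.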

7
Example Code Fragment - 1
parfor (i = 2; i < N-1; i++)
  parfor (j = 2; j < N-1; j++)
    U2[i][j] = f( U1[i][j],
                  U1[i][j-1], U1[i-1][j],
                  U1[i][j+1], U1[i+1][j] )
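
A sequential, runnable rendering of this fragment, with an assumed f (a five-point average; the deck does not define f) and the parfor loops run as ordinary loops:

```c
#include <assert.h>

#define N 16   /* illustrative array size */

/* Assumed kernel: average of the five stencil points (the deck does not
 * specify f; any pointwise function would do). */
static double f(double c, double w, double n, double e, double s)
{
    return (c + w + n + e + s) / 5.0;
}

/* Sequential version of the parfor fragment: a five-point stencil that
 * reads U1 and writes U2. */
void stencil(double U1[N][N], double U2[N][N])
{
    for (int i = 2; i < N - 1; i++)       /* parfor in the parallel version */
        for (int j = 2; j < N - 1; j++)
            U2[i][j] = f(U1[i][j],
                         U1[i][j - 1], U1[i - 1][j],
                         U1[i][j + 1], U1[i + 1][j]);
}
```

Each U2[i][j] needs its four neighbors in U1, which is why a processor's iterations touch data owned by the neighboring processors.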
8
Data accessed by a processor - 2
(a) Non-local portion for processor P1
(b) Local portion for each processor
(c) Data tiles processed by P4
9
Tile Access Pattern - 1
10
Tile Access Pattern - 2
11
Coordinated Tile Access Pattern - 1
12
Coordinated Tile Access Pattern - 2
13
Problems in Compiler Optimization
  • Shape and size of data tiles
  • Local SPM size determines tile size
  • Assume rectilinear tile shape
  • SPM size is same for all processors
  • Tile access pattern (a.k.a. scheduling)
  • Goal Find a tile access pattern for each
    processor so that unnecessary off-chip memory
    accesses arising from non-local SPM access are
    eliminated
  • Need to find a tile access pattern for each
    processor
  • Need not be the same for all of them
  • For row and column tiles in two dimensions, there
    is only one direction of movement; for rectangular
    tiles, there are two directions of movement
  • Coordinated scheduling accounts for movement
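
The coordination idea can be made concrete with a one-dimensional sketch (helper names and TILES are hypothetical): two neighboring processors scan their tiles in opposite directions, so both reach the shared boundary at the same step, while the boundary data is still resident in the producer's SPM.

```c
#include <assert.h>

#define TILES 4   /* tiles per processor along the shared dimension (illustrative) */

/* Step at which a processor touches its local tile t when scanning
 * left-to-right... */
static int step_ltr(int t) { return t; }

/* ...and when scanning right-to-left. */
static int step_rtl(int t) { return TILES - 1 - t; }

/* P1's boundary tile with P2 is its rightmost tile (TILES-1); P2's
 * boundary tile with P1 is its leftmost tile (0).  In the coordinated
 * schedule, P1 scans left-to-right and P2 right-to-left, so both touch
 * the shared boundary at the same step. */
int boundary_steps_match(void)
{
    int p1 = step_ltr(TILES - 1);   /* step at which P1 reaches the boundary */
    int p2 = step_rtl(0);           /* step at which P2 reaches the boundary */
    return p1 == p2;
}
```

If both processors instead scanned left-to-right, P2 would finish its boundary tile at step 0 and evict it from its SPM long before P1 needed it at step TILES-1, forcing an off-chip access.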

14
Coordinated Legal Schedules
  • Use a matrix notation to denote schedule
    directions
  • Derived a scheduling equality that must be
    satisfied by all pairs of communicating
    processors
  • The per-communicating-processor-pair equality
    depends on the tile shape (1D row/column or
    2D rectangular)
  • Details in the paper
  • If some of the loops in the nest are not parallel
    (due to dependence constraints), not all extra
    off-chip accesses can be eliminated

15
Experimental Setup - 1
  • Four array-dominated image processing codes
  • 3D: 305 KB (building models and scenes)
  • dfe: 286 KB (digital image filtering and
    enhancement)
  • splat: 635 KB (volume rendering)
  • wave: 628 KB (wavelet compression code)
  • Simulated system
  • Each processor is a 100 MHz MIPS 4Kp core
  • Local SPM access latency: 2 cycles
  • Non-local SPM access latency: 4 to 16 cycles (plus
    1 extra cycle for synchronization)
  • Off-chip DRAM: 4 MB, access latency 80 cycles
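
With these latencies, the benefit of converting off-chip accesses into remote-SPM accesses can be quantified as an expected per-access latency. A minimal sketch: the 8-cycle remote latency is just one point in the 4-16 range, and the access-mix fractions used below are made up, not measured.

```c
#include <assert.h>

#define LOCAL_SPM_LAT   2    /* cycles, from the simulated system */
#define REMOTE_SPM_LAT  8    /* cycles; one point in the 4..16 range */
#define OFFCHIP_LAT    80    /* cycles */

/* Expected latency per access, given the fraction of accesses satisfied
 * by the local SPM, by remote SPMs (+1 cycle for synchronization), and
 * by off-chip DRAM.  The three fractions must sum to 1. */
double avg_latency(double local, double remote, double offchip)
{
    return local   * LOCAL_SPM_LAT
         + remote  * (REMOTE_SPM_LAT + 1)
         + offchip * OFFCHIP_LAT;
}
```

Even turning a modest fraction of off-chip accesses into remote-SPM accesses pays off: with these constants, avg_latency(0.5, 0.4, 0.1) is far below avg_latency(0.5, 0.0, 0.5).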

16
Experimental Setup - 2
  • Aggressive parallelization strategy (parallelize
    as many loops as possible)
  • Energy model
  • For SPM: Similar to that of a cache (Shiue and
    Chakrabarti 1999), except that it assumes full
    associativity and ignores tag arrays
  • For interconnects: Transition-sensitive, similar
    to that of Zhang et al. 1999
  • Focus on data memory energy and performance
    (instruction access and datapath activity not
    included)
  • However, the total execution cycles account for
    cycles spent in the datapath (stall cycles)
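
The reported metric is the energy-delay product (EDP). A small helper pair makes the "percentage savings" computation explicit (units and values are illustrative):

```c
#include <assert.h>

/* Energy-delay product: the metric reported in the results.  A scheme
 * that lowers both energy and execution time improves EDP
 * multiplicatively. */
double edp(double energy, double delay)
{
    return energy * delay;
}

/* Percentage savings of an optimized configuration over a base one. */
double edp_savings_pct(double base_e, double base_d,
                       double opt_e,  double opt_d)
{
    return 100.0 * (1.0 - edp(opt_e, opt_d) / edp(base_e, base_d));
}
```

For example, halving both energy and delay yields a 75% EDP saving, which is why EDP rewards optimizations that help on both axes at once.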

17
Energy-delay product (base config)
Percentage savings: row tiles, 4 processors, SPM
size 1/8 of local data size, SPM latency 2 cycles,
off-chip latency 80 cycles, remote SPM latency
varied
18
Effect of number of processors
19
Effect of tile shape on percentage savings in
energy-delay product
20
Experimental Results Summary
  • Percentage reduction in energy-delay product,
    when only the remote (non-local) SPM latency is
    changed, increases with decreasing remote SPM
    latency (as expected)
  • Number of processors
  • Our solution is more effective with a larger
    number of processors, due to an increased volume
    of inter-processor communication
  • Highlights the effect of the number of processors
  • Tile shape and size of available SPM (slab
    ratio)
  • More effective with smaller slab ratios (more
    pressure on data memory)
  • Square tiles better than row or column tiles in
    two dimensions

21
Conclusions
  • Developed scheduling solutions to eliminate,
    where possible, extra off-chip accesses arising
    from computations on embedded multiprocessors
    with shared on-chip scratch-pad memories
  • Significant percentage savings in energy-delay
    product compared to the case where no specific
    strategy is used
  • Work in progress
  • Extend to multiple levels of SPMs
    (software-controlled)
  • Effect of using both SPMs and data caches
  • Heterogeneous multiprocessor environments
  • Effect of inter-processor communication on
    off-chip accesses