Outline - PowerPoint PPT Presentation

Title: Outline
Slides: 55
Provided by: tsen
Transcript and Presenter's Notes

Title: Outline


1
Outline
  • Introduction
  • Different Scratch Pad Memories
  • Cache and Scratch Pad for embedded applications

2
Memories in Embedded Systems
  • Each type of memory has its own advantages
  • For better performance, memory accesses have to
    be fast

CPU
Internal ROM
Internal SRAM
External DRAM
3
Efficient Utilization of Scratch-Pad Memory in
Embedded Processor Applications
4
What is Scratchpad Memory?
  • Fast on-chip SRAM
  • Abbreviated as SPM
  • 2 types of SPM:
  • Static → SPM allocation doesn't change at runtime
  • Dynamic → SPM allocation changes at runtime

5
Objective
  • Find a technique for efficiently exploiting
    on-chip SPM by partitioning the application's
    scalar and array variables between off-chip DRAM
    and on-chip SPM.
  • Minimize the total execution time of the
    application.

6
SPM and Cache
  • Similarities
  • Connected to the same address and data buses.
  • Access latency of 1 processor cycle.
  • Difference
  • SPM guarantees single cycle access time while an
    access to cache is subject to a miss.

7
Block Diagram of Embedded Processor Application
8
Division of Data Address Space between SRAM and
DRAM
9
Example: Histogram Evaluation Code
  • Builds a histogram of 256 brightness levels for
    the pixels of an N x N image

char BrightnessLevel[512][512];
int Hist[256];  /* Elements initialized to 0 */

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    /* For each pixel (i, j) in image */
    level = BrightnessLevel[i][j];
    Hist[level] = Hist[level] + 1;
  }

10
Problem Description
  • If the code is executed on a processor configured
    with a data cache of size 1 KB,
  • performance will be degraded by conflict misses
    in the cache between elements of the 2 arrays
    Hist and BrightnessLevel.
  • Solution: selectively map to SPM those variables
    that cause the maximum number of conflicts in the
    data cache.

11
Partitioning Strategy
  • Features affecting partitioning
  • Scalar variables and constants
  • Size of arrays
  • Life-times of array variables
  • Access frequency of array variables
  • Conflicts in loops
  • Partitioning Algorithm

12
Features affecting partitioning
  • Scalar variables and constants
  • All scalar variables and scalar constants are
    mapped onto SPM.
  • Size of Arrays
  • Arrays that are larger than SRAM are mapped onto
    off-chip memory.

13
Features affecting partitioning
  • Lifetime of an Array Variable
  • Definition - period between its definition and
    its last use.
  • Variables with disjoint lifetimes can be stored
    in the same processor register.
  • Arrays with different lifetimes can share the
    same memory space.

14
Features affecting partitioning
  • Intersecting Life Times → ILT(u)
  • Definition - number of array variables having a
    non-null intersection of lifetimes with u.
  • Indicates the number of other arrays u could
    possibly interact with in the cache.
  • So map arrays with the highest ILT values into
    SPM, thereby eliminating a large number of
    potential conflicts.

15
Features affecting partitioning
  • Access frequency of Array Variables
  • Variable Access Count → VAC(u)
  • Definition - number of accesses to elements of
    u during its lifetime.
  • Interference Access Count → IAC(u)
  • Definition - number of accesses to other arrays
    during the lifetime of u.
  • Interference Factor → IF(u) = VAC(u) + IAC(u)

16
Features affecting partitioning
Conflicts in Loops

for i = 0 to N-1
  access a[i]
  access b[i]
  access c[2i]
  access c[2i + 1]
end for

Loop Conflict Graph (LCG): nodes a, b, c; edges
a—c and b—c, each of weight 3N.

Edge weight e(u, v) = Σ_{i=1}^{p} (k_i(u) + k_i(v)),
where k_i(u) is the total no. of accesses to u in
loop i. Total no. of accesses to a and c combined =
(1 + 2)N = 3N, so e(a, c) = 3N, e(b, c) = 3N,
e(a, b) = 0.
17
Features affecting partitioning
  • Loop Conflict Factor → LCF(u)
  • Definition - sum of the edge weights incident on
    node u.
  • LCF(u) = Σ_{v ∈ LCG − {u}} e(u, v)
  • The higher the LCF, the more cache conflicts are
    likely for an array, and the more desirable it is
    to map the array to the SPM.

18
Partitioning Strategy
  • Features affecting partitioning
  • Scalar variables and constants
  • Size of arrays
  • Life-times of array variables
  • Access frequency of array variables
  • Conflicts in loops
  • Partitioning Algorithm

19
Partitioning Algorithm
  • Algorithm for determining the mapping of each
    (scalar and array) program variable to SPM or
    DRAM/cache.
  • First assigns scalar constants and variables to
    SPM.
  • Arrays that are larger than SPM are mapped onto
    DRAM.

20
Partitioning Algorithm
  • For the remaining n arrays, generates lifetime
    intervals and computes LCF and IF values.
  • Sorts the 2n interval points thus generated and
    traverses them in increasing order.
  • For each array u encountered: if there is
    sufficient SRAM space for u and for all arrays
    whose lifetimes intersect the lifetime interval
    of u and have more critical LCF and IF values,
    then maps u to SPM, else to DRAM/cache.

21
Performance Details for Beamformer Example
22
Typical Applications
  • Dequant → de-quantization routine in the MPEG
    decoder application
  • IDCT → Inverse Discrete Cosine Transform
  • SOR → Successive Over-Relaxation algorithm
  • MatrixMult → matrix multiplication
  • FFT → Fast Fourier Transform
  • DHRC → Differential Heat Release Computation
    algorithm

23
Performance Comparison of Configurations A, B, C
and D
24
Conclusion
  • Average improvement of 31.4% over A (only SRAM)
  • Average improvement of 30.0% over B (only cache)
  • Average improvement of 33.1% over C (random
    partitioning)

25
  • Compiler Decided Dynamic Memory
    allocation for Scratch Pad Based Embedded Systems.

26
Cache is one of the options for On-chip Memory

CPU
Internal ROM
Cache
External DRAM
27
Why All Embedded Systems Don't Have Cache
Memory
  • The reasons could be:
  • Increased on-chip area
  • Increased energy consumption
  • Increased cost
  • Hit latency and non-deterministic cache access
    times

28
  • A method for allocating program data to
    non-cached SRAM
  • Dynamic, i.e. the allocation changes at runtime
  • Compiler-decided transfers
  • Zero overhead per memory instruction, unlike
    software or hardware caching
  • No software caching tags
  • Requires no run-time checks
  • Highly predictable memory access times

29
Static Approach
Internal SRAM:

int a[100]; int b[100];
while (i < 100) { ...a[i]... }
while (i < 100) { ...b[i]... }

Allocator
External DRAM: int b[100]
30
Static Approach
Internal SRAM: int a[100]

int a[100]; int b[100];
while (i < 100) { ...a[i]... }
while (i < 100) { ...b[i]... }

Allocator
External DRAM: int b[100]
31
Dynamic Approach
Internal SRAM: int a[100]

int a[100]; int b[100];
while (i < 100) { ...a[i]... }
while (i < 100) { ...b[i]... }

Allocator
External DRAM: int b[100]
32
Dynamic Approach
Internal SRAM: int b[100]

int a[100]; int b[100];
while (i < 100) { ...a[i]... }
while (i < 100) { ...b[i]... }

Allocator
External DRAM: int a[100]

It is similar to caching, but under compiler
control
33
Compiler-Decided Dynamic Approach
  • Need to minimize costs for greater benefit
  • Accounts for changing program requirements at
    run time
  • Compiler manages and decides the transfers
    between SRAM and DRAM

int a[100]; int b[100];
// a is in SRAM
while (i < 100) { ...a[i]... }
// Copy a out to DRAM
// Copy b in to SRAM
while (i < 100) { ...b[i]... }

Decide on dynamic behavior statically
34
Approach
  • The method is to:
  • Use profiling to estimate reuse
  • Copy variables into SRAM when they are reused
  • A cost model ensures that benefit exceeds cost
  • Transfer data between on-chip and off-chip
    memory under compiler supervision
  • Compiler-known data allocation at each point in
    the code

35
  • Advantages
  • Benefits with no software translation overhead
  • Predictable SRAM accesses ensuring better
    real-time guarantees than Hardware or Software
    caching
  • No more data transfers than caching

36
Overview of Strategy

Divide the complete program into different regions.

for (starting point of each region) {
  Remove some variables from SRAM
  Copy some variables into SRAM from DRAM
}
37
Some Important Questions

  • What are regions?
  • What to bring in to SRAM?
  • What to evict from SRAM?
  • The problem has an exponential number of
    solutions (it is NP-complete)
38
Regions
  • A region is the code between successive program
    points
  • Regions coincide with changes in program behavior
  • New regions start at:
  • The start of each procedure
  • Before the start of each loop
  • Before conditional statements containing loops or
    procedure calls

39
What to Bring in to SRAM?
  • Bring in variables that are re-used in the
    region, provided the cost of the transfer is
    recovered.
  • These transfers will reduce the memory access
    time
  • The cost model accounts for:
  • Profile-estimated re-use
  • Benefit from reuse
  • Detailed cost of transfer:
  • Bring-in cost
  • Eviction cost

40
What to Remove from SRAM?

Evict the data variables whose next use is furthest
in the future. This time can be obtained by
assigning timestamps to each of the nodes.
Needs the concept of a time order among different
code regions.
41
The Data-Program Relationship Graph
  • The DPRG is a new data structure that helps in
    the identification of regions and the marking of
    time stamps
  • It is essentially the program's call graph
    appended with additional nodes for:
  • Loop nodes
  • Variable nodes

42
Data-Program Relationship Graph
  • Defines regions
  • Depth-first search order reveals execution time
    order
  • Allocation-change points at region changes

(Figure: DPRG for a program — main calls Proc_B and
a loop; the loop calls Proc_C and contains an inner
loop accessing variables a and b; nodes are
numbered in depth-first order.)
43
Time Stamps
  • The method associates a time stamp with every
    program point
  • The time stamps form a total order among
    themselves
  • The program points are reached at runtime in
    time-stamp order

44
Optimizations
  • There is no need to write back unmodified or
    dead SRAM variables to DRAM
  • Optimize data transfer code using DMA when it is
    available
  • Data transfer code can be placed in special
    memory-block-copy procedures

45
Multiple Allocations due to Multiple Paths
  • Contents of SRAM could be different on different
    incoming paths to a node in DPRG
  • Problem can happen in
  • Loops
  • Conditional execution
  • Multiple calls to same procedure

46
Conditional Join Nodes
  • Favor the most frequent path
  • A consensus allocation is chosen assuming the
    incoming allocation from the most probable
    predecessor

47
Procedure Join Nodes
  • A few program points have multiple timestamps
  • The nodes with multiple timestamps are called
    join nodes, as they join multiple paths from
    main()
  • A strategy is used that adopts different
    allocations for different paths through the same
    code

48
Offsets in SRAM
  • SRAM can get fragmented when variables are
    swapped out
  • An intelligent offset mechanism is required
  • In this method:
  • Place memory variables with similar lifetimes
    together → larger fragments when evicted together

49
Experimental Setup
  • Architecture: Motorola MCORE
  • Memory architecture: 2 levels of memory
  • SRAM size: estimated as 25% of the total data
    requirement
  • DRAM latency: 10 cycles
  • Compiler: GCC

50
Results
51
Conclusion
The designer has to choose the right mix of
scratch pad and cache for performance advantages.
52
References
  • Sumesh U., Rajeev B. "Compiler-Decided Dynamic
    Memory Allocation for Scratch-Pad Based Embedded
    Systems."
  • Alexandru N., Preeti P., N. Dutt. "Efficient Use
    of Scratch-Pads in Embedded Applications."
  • Josh Pfrimmer, Kin F. Li, and Daler Rakhmatov.
    "Balancing Scratch Pad and Cache in Embedded
    Systems for Power and Speed Performance."


53
Questions
54
Thank you