Outline - PowerPoint PPT Presentation

About This Presentation

Title:

Outline

Description:

Requires no run-time checks. Highly predictable memory access ... The designer has to choose the right mix of scratch pad and cache for performance advantages. ...

Slides: 55

Provided by: tsen

Learn more at: http://www.ann.ece.ufl.edu

Transcript and Presenter's Notes

1
Outline

Introduction
Different Scratch Pad Memories
Cache and Scratch Pad for embedded applications

2
Memories in Embedded Systems

Each memory has its own advantages
For better performance, memory accesses have to be fast

(Diagram: CPU, internal ROM, internal SRAM, external DRAM)
3
Efficient Utilization of Scratch-Pad Memory in
Embedded Processor Applications
4
What is Scratchpad Memory?

Fast on-chip SRAM
Abbreviated as SPM
Two types of SPM:
Static: SPM locations don't change at runtime
Dynamic: SPM locations change at runtime

5
Objective

Find a technique for efficiently exploiting on-chip SPM by partitioning the application's scalar and array variables into off-chip DRAM and on-chip SPM.
Minimize the total execution time of the application.

6
SPM and Cache

Similarities
Connected to the same address and data buses.
Access latency of 1 processor cycle.
Difference
SPM guarantees single cycle access time while an
access to cache is subject to a miss.

7
Block Diagram of Embedded Processor Application
8
Division of Data Address Space between SRAM and
DRAM
9
Example: Histogram Evaluation Code

Builds a histogram of 256 brightness levels for the pixels of an N × N image

    /* unsigned so that levels 128..255 index Hist correctly */
    unsigned char BrightnessLevel[512][512];
    int Hist[256]; /* Elements initialized to 0 */

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            /* For each pixel (i, j) in image */
            level = BrightnessLevel[i][j];
            Hist[level] = Hist[level] + 1;
        }

10
Problem Description

If the code is executed on a processor configured with a data cache of size 1 KB, performance will be degraded by conflict misses in the cache between elements of the two arrays Hist and BrightnessLevel.
Solution: selectively map to SPM those variables that cause the maximum number of conflicts in the data cache.

11
Partitioning Strategy

Features affecting partitioning
Scalar variables and constants
Size of arrays
Life-times of array variables
Access frequency of array variables
Conflicts in loops
Partitioning Algorithm

12
Features affecting partitioning

Scalar variables and constants
All scalar variables and scalar constants are
mapped onto SPM.
Size of Arrays
Arrays that are larger than SRAM are mapped onto
off-chip memory.

13
Features affecting partitioning

Lifetime of an Array Variable
Definition - period between its definition and
its last use.
Variables with disjoint lifetimes can be stored
in the same processor register.
Arrays with different lifetimes can share the
same memory space.

14
Features affecting partitioning

Intersecting Life Times: ILT(u)
Definition - Number of array variables having a
non-null intersection of lifetimes with u.
Indicates the number of other arrays it could
possibly interact with, in cache.
So map arrays with highest ILT values into SPM,
thereby eliminating a large number of potential
conflicts.
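
The ILT definition can be illustrated with a minimal sketch in C. The `lifetime` record, its field names, and the choice of closed intervals (two lifetimes that merely touch count as intersecting) are assumptions made for this example, not part of the original slides.

```c
#include <stddef.h>

/* Hypothetical record: lifetime of one array as a closed interval of
 * program points [def, last_use]. */
struct lifetime {
    int def;       /* program point of the definition */
    int last_use;  /* program point of the last use */
};

/* Two closed intervals intersect when neither ends before the other begins. */
int lifetimes_intersect(struct lifetime a, struct lifetime b)
{
    return a.def <= b.last_use && b.def <= a.last_use;
}

/* ILT(u): number of other arrays whose lifetime intersects that of arrays[u]. */
int ilt(const struct lifetime *arrays, size_t n, size_t u)
{
    int count = 0;
    for (size_t v = 0; v < n; v++) {
        if (v != u && lifetimes_intersect(arrays[u], arrays[v]))
            count++;
    }
    return count;
}
```

For three arrays with lifetimes [0, 10], [5, 15] and [20, 30], the first two each have an ILT of 1 and the third has an ILT of 0, so the third could share memory space with either of the others.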

15
Features affecting partitioning

Access frequency of Array Variables
Variable Access Count: VAC(u)
Definition - Number of accesses to elements of u during its lifetime.
Interference Access Count: IAC(u)
Definition - Number of accesses to other arrays during the lifetime of u.
Interference Factor: $IF(u) = VAC(u) \times IAC(u)$
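
A small sketch of how these counts might be gathered, reading the interference factor as the product of VAC(u) and IAC(u). The access trace (`trace[t]` holds the id of the array touched by the t-th access) and the approximation of a lifetime as the span from first to last access are assumptions of this example.

```c
#include <stddef.h>

/* Interference factor of array `u` for a hypothetical access trace, where
 * trace[t] is the id of the array touched by the t-th memory access.
 * The lifetime of u is approximated as the span from its first to its
 * last access in the trace (an assumption of this sketch). */
long interference_factor(const int *trace, size_t len, int u)
{
    size_t first = len, last = 0;
    for (size_t t = 0; t < len; t++) {
        if (trace[t] == u) {
            if (first == len)
                first = t;
            last = t;
        }
    }
    if (first == len)
        return 0;               /* u is never accessed */

    long vac = 0, iac = 0;
    for (size_t t = first; t <= last; t++) {
        if (trace[t] == u)
            vac++;              /* access to u itself */
        else
            iac++;              /* access to another array during u's lifetime */
    }
    return vac * iac;           /* IF(u) = VAC(u) * IAC(u) */
}
```

For the trace 0, 1, 0, 2, 1, 0, 2, array 0 is live over the first six accesses, giving VAC = 3, IAC = 3 and IF = 9.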

16
Features affecting partitioning
Conflicts in Loops
    for i = 0 to N-1
        access a[i]
        access b[i]
        access c[2i]
        access c[2i + 1]
    end for
Loop Conflict Graph (LCG): nodes a, b and c, with edges a-c and b-c of weight 3N each
Edge weight $e(u, v) = \sum_{i=1}^{p} k_i$, where $k_i$ is the total number of accesses to u and v in loop i
Total number of accesses to a and c combined: (1 + 2)N = 3N, so e(a, c) = 3N
e(b, c) = 3N
e(a, b) = 0
17
Features affecting partitioning

Loop Conflict Factor: LCF(u)
Definition - sum of incident edge weights to node u.
$LCF(u) = \sum_{v \in LCG - \{u\}} e(u, v)$
The higher the LCF, the more conflicts are likely for an array, and the more desirable it is to map the array to the SPM.
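
As a worked instance, the sketch below computes LCF for the three-array loop example (e(a, c) = 3N, e(b, c) = 3N, e(a, b) = 0). The matrix representation and the function names `lcf` and `example_lcg` are illustrative assumptions.

```c
#include <stddef.h>

#define NUM_ARRAYS 3   /* a, b, c in the loop example */

/* LCF(u): sum of the weights of all LCG edges incident to node u.
 * e is a symmetric edge-weight matrix; e[u][v] == 0 means no edge. */
long lcf(long e[NUM_ARRAYS][NUM_ARRAYS], size_t u)
{
    long sum = 0;
    for (size_t v = 0; v < NUM_ARRAYS; v++) {
        if (v != u)
            sum += e[u][v];
    }
    return sum;
}

/* Fill e with the edge weights of the loop example for trip count n:
 * e(a,c) = 3n, e(b,c) = 3n, e(a,b) = 0. */
void example_lcg(long n, long e[NUM_ARRAYS][NUM_ARRAYS])
{
    enum { A, B, C };
    for (size_t u = 0; u < NUM_ARRAYS; u++)
        for (size_t v = 0; v < NUM_ARRAYS; v++)
            e[u][v] = 0;
    e[A][C] = e[C][A] = 3 * n;
    e[B][C] = e[C][B] = 3 * n;
}
```

With N = 100 this gives LCF(a) = 300, LCF(b) = 300 and LCF(c) = 600, so c is the strongest candidate for the SPM.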

18
Partitioning Strategy

Features affecting partitioning
Scalar variables and constants
Size of arrays
Life-times of array variables
Access frequency of array variables
Conflicts in loops
Partitioning Algorithm

19
Partitioning Algorithm

Algorithm for determining the mapping decision of
each (scalar and array) program variable to SPM or
DRAM/cache.
First assigns scalar constants and variables to
SPM.
Arrays that are larger than SPM are mapped onto
DRAM.

20
Partitioning Algorithm

For the remaining n arrays, generates lifetime intervals and computes LCF and IF values.
Sorts the 2n interval points thus generated and traverses them in increasing order.
For each array u encountered: if there is sufficient SRAM space for u and for all arrays whose lifetimes intersect the lifetime interval of u and whose LCF and IF numbers are more critical, then maps u to SPM; otherwise maps u to DRAM/cache.
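
The array-mapping step can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the `array_info` record, the criticality order (higher LCF first, ties broken by higher IF) and the conservative summing of all more-critical overlapping arrays are assumptions. Because the test for each array here depends only on sizes and criticality, the sketch skips the sorted traversal of the 2n interval points. `spm_size` is taken to be the space left after scalars have been placed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-array record; field names are illustrative. */
struct array_info {
    int start, end;   /* lifetime interval [start, end] */
    long size;        /* bytes */
    long lcf;         /* loop conflict factor */
    long ifactor;     /* interference factor */
    bool in_spm;      /* output: mapping decision */
};

static bool overlaps(const struct array_info *a, const struct array_info *b)
{
    return a->start <= b->end && b->start <= a->end;
}

/* Assumed criticality order: higher LCF first, ties broken by higher IF. */
static bool more_critical(const struct array_info *a, const struct array_info *b)
{
    if (a->lcf != b->lcf)
        return a->lcf > b->lcf;
    return a->ifactor > b->ifactor;
}

/* Map u to SPM only if u plus every more-critical array with an
 * intersecting lifetime fits in the available SPM space. */
void partition(struct array_info *arrays, size_t n, long spm_size)
{
    for (size_t u = 0; u < n; u++) {
        long needed = arrays[u].size;
        for (size_t v = 0; v < n; v++) {
            if (v != u && overlaps(&arrays[u], &arrays[v]) &&
                more_critical(&arrays[v], &arrays[u]))
                needed += arrays[v].size;
        }
        arrays[u].in_spm = needed <= spm_size;
    }
}
```

An array larger than the SPM fails the size test on its own, so it falls through to DRAM without a separate check.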

21
Performance Details for Beamformer Example
22
Typical Applications

Dequant: de-quantization routine in MPEG decoder application
IDCT: Inverse Discrete Cosine Transform
SOR: Successive Over Relaxation Algorithm
MatrixMult: Matrix multiplication
FFT: Fast Fourier Transform
DHRC: Differential Heat Release Computation Algorithm

23
Performance Comparison of Configurations A, B, C
and D
24
Conclusion

Average improvement of 31.4% over A (only SRAM)
Average improvement of 30.0% over B (only cache)
Average improvement of 33.1% over C (random partitioning)

25
Compiler-Decided Dynamic Memory Allocation for Scratch Pad Based Embedded Systems

26
Cache is one of the options for on-chip memory

CPU
Internal ROM
Cache
External DRAM
27
Why Not All Embedded Systems Have Cache Memory

The reasons could be
Increased on-chip area
Increased energy
Increased cost
Hit latency and non-deterministic cache access

28
A method for allocating program data to non-cached SRAM
Dynamic, i.e. allocation changes at runtime
Compiler-decided transfers
Zero overhead per memory instruction, unlike software or hardware caching
Has no software caching tags
Requires no run-time checks
Highly predictable memory access times

29
Static Approach
Internal SRAM

    int a[100];
    int b[100];
    while (i < 100) ...a...
    while (i < 100) ...b...
Allocator
External DRAM
int b[100]
30
Static Approach
Internal SRAM
int a[100]
    int a[100];
    int b[100];
    while (i < 100) ...a...
    while (i < 100) ...b...
Allocator
External DRAM
int b[100]
31
Dynamic Approach
Internal SRAM
int a[100]
    int a[100];
    int b[100];
    while (i < 100) ...a...
    while (i < 100) ...b...
Allocator
External DRAM
int b[100]
32
Dynamic Approach
Internal SRAM
int b[100]
    int a[100];
    int b[100];
    while (i < 100) ...a...
    while (i < 100) ...b...
Allocator
External DRAM
int a[100]
It is similar to caching, but under compiler control
33
Compiler-Decided Dynamic Approach

Need to minimize costs for greater benefit
Accounts for changing program requirements at run time
Compiler manages and decides the transfers between SRAM and DRAM

    int a[100];
    int b[100];
    // a is in SRAM
    while (i < 100) ...a...
    // Copy a out to DRAM
    // Copy b in to SRAM
    while (i < 100) ...b...
Decide on dynamic behavior statically
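
A minimal sketch of what such compiler-inserted transfer code might look like, with plain C arrays standing in for the two memories. The buffer names `dram_a`, `dram_b` and `sram_buf`, the loop bodies and the use of `memcpy` are assumptions for illustration. A real system would place `sram_buf` in the scratch-pad address range and might use DMA for the copies.

```c
#include <string.h>

#define N 100

static int dram_a[N], dram_b[N];   /* home locations in (simulated) DRAM */
static int sram_buf[N];            /* one SRAM slot shared by a and b */

/* Runs both loops; returns the sum of all elements of b after phase 2. */
long run_phases(void)
{
    int *a, *b;
    long sum = 0;

    /* Phase 1: a is in SRAM. */
    memcpy(sram_buf, dram_a, sizeof sram_buf);
    a = sram_buf;
    for (int i = 0; i < N; i++)
        a[i] = i;

    /* Region boundary: copy a out to DRAM, copy b in to SRAM. */
    memcpy(dram_a, sram_buf, sizeof sram_buf);
    memcpy(sram_buf, dram_b, sizeof sram_buf);
    b = sram_buf;

    /* Phase 2: b is in SRAM. */
    for (int i = 0; i < N; i++) {
        b[i] = 2 * i;
        sum += b[i];
    }
    memcpy(dram_b, sram_buf, sizeof sram_buf);
    return sum;
}
```

The copies sit at a fixed point in the code, so every access inside either loop is a plain SRAM access with no tag check or address translation.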
34
Approach

The method is to
Use profiling to estimate reuse
Copy variables into SRAM when they are reused
Use a cost model to ensure that benefit exceeds cost
Transfer data between the on-chip and off-chip memory under compiler supervision
Keep the data allocation known to the compiler at each point in the code

35
Advantages
Benefits with no software translation overhead
Predictable SRAM accesses, ensuring better real-time guarantees than hardware or software caching
No more data transfers than caching

36
Overview of Strategy

Divide the complete program into different regions
For (starting point of each region):
    Remove some variables from SRAM
    Copy some variables into SRAM from DRAM
37
Some Important Questions

What are regions?
What to bring in to SRAM?
What to evict from SRAM?
The problem has an exponential number of solutions (NP-complete)
38
Regions

A region is the code between successive program points
Regions coincide with changes in program behavior
New regions start at
Start of each procedure
Before start of each loop
Before conditional statements containing loops or procedures

39
What to Bring in to SRAM ?

Bring in variables that are re-used in region,
provided cost of transfer is recovered.
These transfers will reduce the memory access
time
Cost model accounts for
Profile estimated re-use
Benefit from reuse
Detailed Cost of transfer
Bring in cost
Eviction cost
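
The benefit-versus-cost test can be sketched as below. The `transfer_estimate` record and the linear benefit formula (reuse count times the per-access latency saving) are assumptions made for illustration. The slides list the cost-model ingredients but not its exact form.

```c
#include <stdbool.h>

/* Hypothetical cost-model inputs, all in processor cycles except reuse. */
struct transfer_estimate {
    long reuse;          /* profile-estimated accesses to the variable in the region */
    long dram_latency;   /* cycles per DRAM access */
    long sram_latency;   /* cycles per SRAM access */
    long bring_in_cost;  /* cycles to copy the variable into SRAM */
    long eviction_cost;  /* cycles to copy displaced data back to DRAM */
};

/* Cycles saved by serving the region's accesses from SRAM, minus transfer cost. */
long net_benefit(const struct transfer_estimate *t)
{
    long benefit = t->reuse * (t->dram_latency - t->sram_latency);
    return benefit - (t->bring_in_cost + t->eviction_cost);
}

/* Bring the variable in only when the transfer cost is recovered. */
bool should_bring_in(const struct transfer_estimate *t)
{
    return net_benefit(t) > 0;
}
```

With a 10-cycle DRAM, a 1-cycle SRAM and 800 cycles of combined transfer cost, 100 reuses give a net benefit of 100 cycles, while 50 reuses give a net loss of 350 cycles.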

40
What to Remove from SRAM?

Evict the data variables whose next use is furthest in the future
This time can be obtained by assigning timestamps to each of the nodes
Need a concept of time order among the different code regions
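
A sketch of furthest-in-the-future victim selection, assuming each SRAM-resident variable carries the timestamp of the next region that accesses it. The `resident` record, the `-1` convention for "never used again" and the function name `pick_victim` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical record for a variable currently resident in SRAM. */
struct resident {
    const char *name;
    long next_use;   /* timestamp of the next region that accesses it; -1 if never */
};

/* Returns the index of the variable to evict: one that is never used again
 * if any, otherwise the one whose next use has the largest timestamp.
 * Returns n when the SRAM is empty. */
size_t pick_victim(const struct resident *vars, size_t n)
{
    size_t victim = n;
    for (size_t i = 0; i < n; i++) {
        if (vars[i].next_use < 0)
            return i;   /* dead variable: best possible victim */
        if (victim == n || vars[i].next_use > vars[victim].next_use)
            victim = i;
    }
    return victim;
}
```

Given residents with next-use timestamps 4, 9 and 6, the variable with timestamp 9 is chosen.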
41
The Data-Program Relationship Graph

The DPRG is a new data structure that helps in the identification of regions and the marking of timestamps
It is essentially a program's call graph, extended with additional nodes:
Loop nodes
Variable nodes

42
Data-Program Relationship Graph

Defines regions
Depth-first search order reveals execution time order
Allocation-change points at region changes

(Figure: DPRG with numbered nodes, including procedure nodes Proc_B and Proc_C, loop nodes, and variable nodes a and b)
43
Time Stamps

The method associates a timestamp with every program point
The timestamps form a total order among themselves
The program points are reached at run time in timestamp order

44
Optimizations

There is no need to write back unmodified or dead
SRAM variables into DRAM
Optimize data transfer code using DMA when it is
available
Data transfer code can be placed in special
memory block copy procedures
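
The first optimization amounts to a simple predicate at eviction time. The sketch below assumes the compiler tracks two facts per SRAM-resident variable; the `sram_var` record and its field names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical compile-time facts about an SRAM-resident variable. */
struct sram_var {
    bool modified;   /* written since it was copied into SRAM */
    bool live;       /* read again later in the program */
};

/* A write-back to DRAM is needed only for data that is both modified and live. */
bool needs_writeback(struct sram_var v)
{
    return v.modified && v.live;
}
```

An unmodified variable still has a valid copy in DRAM, and a dead variable is never read again, so in both cases the eviction can simply overwrite the SRAM space.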

45
Multiple Allocations due to Multiple Paths

Contents of SRAM could be different on different
incoming paths to a node in DPRG
Problem can happen in
Loops
Conditional execution
Multiple calls to same procedure

46
Conditional join nodes
Join Node

Favor the most frequent path
Consensus allocation is chosen assuming the
incoming allocation from the most probable
predecessor

47
Procedure join nodes

A few program points have multiple timestamps
The nodes with multiple timestamps are called join nodes, as they join multiple paths from main()
The method adopts different allocations for different paths, but with the same code

48
Offsets in SRAM

SRAM can get fragmented when variables are
swapped out
Intelligent offset mechanism required
In this method
Place memory variables with similar lifetimes together, which yields larger fragments when they are evicted together

49
Experimental Setup

Architecture: Motorola MCORE
Memory architecture: 2 levels of memory
SRAM size: estimated as 25% of the total data requirement
DRAM latency: 10 cycles
Compiler: GCC

50
Results
51
Conclusion

The designer has to choose the right mix of scratch pad and cache for performance advantages.
52
References

Sumesh U., Rajeev B. Compiler Decided Dynamic Memory Allocation for Scratch Pad Based Embedded Systems.
Alexandru N., Preeti P., N. Dutt. Efficient Use of Scratch Pads in Embedded Applications.
Josh Pfrimmer, Kin F. Li, and Daler Rakhmatov. Balancing Scratch Pad and Cache in Embedded Systems for Power and Speed Performance.