Efficient Dynamic Heap Allocation of Scratch-Pad Memory - PowerPoint PPT Presentation



1
Efficient Dynamic Heap Allocation of Scratch-Pad Memory
Carnegie Trust for the Universities of Scotland

Ross McIlroy, Peter Dickman and Joe Sventek

2
Scratch-Pad Memory Allocator

SMA: a dynamic memory allocator targeting
extremely small memories
(< 1 MB in size)
Why target such tiny memories?
Why provide dynamic memory allocation for such
small memories?

3
Outline

Rationale for SMA
SMA Approach
Results
Concurrent SMA
Conclusion / Future work

5
What Tiny Memories?

Embedded Systems
Sensor Network Motes
Vehicular Devices
Scratch-Pad Memories
Network Processors
Heterogeneous Multi-Core Processors

6
Scratch-Pad Memories

Memory structured as a hierarchy
Small fast memories, large slow memories
Usually hidden by hardware caches
Some processor architectures employ scratch-pad
memories instead
Similar in size and speed to caches, but
explicitly accessible by software
Examples
IBM Cell processor
Intel IXP network processors
Intel PXA mobile phone processors

7
Why Dynamic Management?

Developers want as much useful data in the fast
Scratch-Pad memory as possible
They don't want to deal with the fragmented
memory hierarchy

                               Manual   Static   Dynamic
Developer ease                   ?        ?        ?
Make full use of Scratch-Pad     ?        ?        ?
8
Why SMA?
Resource                       Doug Lea malloc    SMA
State Memory (bytes)           516                40
Code Memory (instructions)     1634               297
Avg. Alloc Time (cycles)       70.7               72.8
Avg. Free Time (cycles)        95.2               52.4
Managing 4 kB Scratch-Pad memory on an Intel IXP processor
10
Basic Approach

By default, memory is represented coarsely as a
series of fixed-size blocks
This allows a very simple bitmap-based allocation
/ free algorithm
When required, blocks are split into variable-sized
regions
This prevents excessive internal fragmentation

11
Large Block Allocation

Each block in memory represented by a bit in a
free-block bitmap

[Figure: free-block bitmap example, one bit per block; the search is expressed in terms of rem_blocks, blocks_bm, mask, next_pos and in_use, using ffs(rem_blocks) and fls(in_use) + 1]
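The bitmap search on this slide can be sketched in C as follows. This is a minimal first-fit sketch, not the authors' code: the figure only preserves the names rem_blocks, blocks_bm, mask, next_pos and in_use and the calls ffs and fls, so the way they are combined here is an assumption. The sketch also assumes a hypothetical 4 kB scratch-pad divided into 32 blocks of 128 bytes, a set bit meaning "free", and the GCC/Clang builtins __builtin_ctz and __builtin_clz standing in for ffs and fls (0-based here, unlike POSIX ffs). The helpers alloc_blocks, free_blocks, lowest_set, highest_set and run_mask are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry: 4 kB scratch-pad / 128-byte blocks = 32 blocks,
   so the whole free-block bitmap fits in one 32-bit word. */
#define NUM_BLOCKS 32u

/* Bit i set => block i is free. */
static uint32_t blocks_bm = 0xFFFFFFFFu;

/* 0-based index of the lowest / highest set bit; x must be non-zero. */
static unsigned lowest_set(uint32_t x)  { return (unsigned)__builtin_ctz(x); }
static unsigned highest_set(uint32_t x) { return 31u - (unsigned)__builtin_clz(x); }

/* Mask with the low n bits set. */
static uint32_t run_mask(unsigned n) { return n >= 32u ? 0xFFFFFFFFu : ((1u << n) - 1u); }

/* First-fit search for n contiguous free blocks; returns the first index or -1. */
static int alloc_blocks(unsigned n) {
    if (n == 0 || n > NUM_BLOCKS) return -1;
    uint32_t mask = run_mask(n);
    unsigned next_pos = 0;
    while (next_pos + n <= NUM_BLOCKS) {
        uint32_t rem_blocks = blocks_bm >> next_pos;        /* free blocks from next_pos up */
        if (rem_blocks == 0) return -1;
        next_pos += lowest_set(rem_blocks);                 /* jump to the next free block */
        if (next_pos + n > NUM_BLOCKS) return -1;
        uint32_t in_use = (mask << next_pos) & ~blocks_bm;  /* used blocks inside the window */
        if (in_use == 0) {
            blocks_bm &= ~(mask << next_pos);               /* claim the whole run */
            return (int)next_pos;
        }
        next_pos = highest_set(in_use) + 1u;                /* restart past the last used block */
    }
    return -1;
}

/* Return a run of n blocks starting at start to the bitmap. */
static void free_blocks(unsigned start, unsigned n) {
    assert(n >= 1 && start + n <= NUM_BLOCKS);
    assert((blocks_bm & (run_mask(n) << start)) == 0);     /* run must be allocated */
    blocks_bm |= run_mask(n) << start;
}
```

For example, after alloc_blocks(3) returns 0 and alloc_blocks(2) returns 3, freeing blocks 0 to 2 and then requesting 4 blocks skips past the used blocks 3 and 4 in two jumps and returns 5. Because all block state fits in one word in this sketch, each search step is only a few shifts, masks and bit scans.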
12
Small Region Allocation

Unused parts of an allocated block can be reused
by sub-block-sized allocations
Blocks are split into power-of-two-sized regions,
in a binary-buddy-style approach
Free regions are stored in per-size free lists
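Power-of-two splitting with per-size free lists might look like the minimal sketch below. It assumes a hypothetical 128-byte block and regions of 8 to 128 bytes, with each free list's link stored inside the free region itself. All names (region_alloc, size_class, push, pop, NIL) are hypothetical; the slides do not give SMA's actual data layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical geometry: one 128-byte block carved into regions of 8..128 bytes. */
#define BLOCK_SIZE  128u
#define MIN_REGION  8u
#define NUM_CLASSES 5          /* 8, 16, 32, 64, 128 bytes */
#define NIL         0xFFFFu    /* empty-list marker */

static uint8_t  block[BLOCK_SIZE];        /* simulated memory of one block */
static uint16_t free_head[NUM_CLASSES];   /* per-size free lists (offsets into block) */

/* Push/pop with the "next" link stored inside the free region itself. */
static void push(int c, uint16_t off) {
    memcpy(&block[off], &free_head[c], sizeof free_head[c]);
    free_head[c] = off;
}

static uint16_t pop(int c) {
    uint16_t off = free_head[c];
    if (off != NIL) memcpy(&free_head[c], &block[off], sizeof free_head[c]);
    return off;
}

static void regions_init(void) {
    for (int c = 0; c < NUM_CLASSES; c++) free_head[c] = NIL;
    push(NUM_CLASSES - 1, 0);             /* whole block starts as one free region */
}

/* Smallest class whose region size is >= size (NUM_CLASSES if too large). */
static int size_class(unsigned size) {
    int c = 0;
    while (c < NUM_CLASSES && (MIN_REGION << c) < size) c++;
    return c;
}

/* Returns the offset of a region of at least size bytes, or NIL. */
static uint16_t region_alloc(unsigned size) {
    int want = size_class(size);
    if (want >= NUM_CLASSES) return NIL;
    int c = want;
    while (c < NUM_CLASSES && free_head[c] == NIL) c++;   /* smallest non-empty list */
    if (c == NUM_CLASSES) return NIL;
    uint16_t off = pop(c);
    assert(off % (MIN_REGION << c) == 0);                  /* regions are size-aligned */
    while (c > want) {                                     /* split, keeping the lower half */
        c--;
        push(c, (uint16_t)(off + (MIN_REGION << c)));      /* upper half goes on list c */
    }
    return off;
}
```

A 20-byte request rounds up to a 32-byte region: the 128-byte block is split into 64 + 32 + 32 bytes, the request takes offset 0, and the two unused halves wait on their free lists for later sub-block requests.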

13
Coalescing Freed Regions

We wanted to avoid boundary tags
Instead, the orderly way in which regions are
split is exploited
A word-sized coalesce tag stores the coalescing
details for all regions in a block
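The slides do not specify how the coalesce tag is encoded. One way a single word could describe every region in a block, sketched here purely as an assumption, is to number the nodes of the binary split tree heap-style (the whole block is node 1, and the halves of node i are 2i and 2i + 1) and keep one bit per node that is set while that region is free. A 128-byte block with 8-byte minimum regions has 31 tree nodes, so the tag fits in a 32-bit word, and a region's buddy is found by flipping the low bit of its node index, with no boundary tags inside the regions. The names node_index, tag_free and tag_take are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry: 128-byte block, 8-byte minimum region -> 31 tree nodes. */
#define BLOCK_SIZE 128u
#define MIN_REGION 8u

typedef struct { unsigned off, size; } region_t;

/* Heap-style index of the split-tree node for a region: node 1 is the whole
   block, and the halves of node i are 2i and 2i + 1. */
static unsigned node_index(unsigned off, unsigned size) {
    assert(size >= MIN_REGION && size <= BLOCK_SIZE && (size & (size - 1u)) == 0);
    assert(off % size == 0 && off + size <= BLOCK_SIZE);
    unsigned depth = 0;
    while ((BLOCK_SIZE >> depth) > size) depth++;
    return (1u << depth) + off / size;
}

/* Mark a region free in the tag, merging with free buddies as far as possible.
   Bit i of *tag set => the region for node i is free (assumed encoding).
   Returns the final, possibly merged, region. */
static region_t tag_free(uint32_t *tag, unsigned off, unsigned size) {
    unsigned i = node_index(off, size);
    while (i > 1 && (*tag & (1u << (i ^ 1u)))) {
        *tag &= ~(1u << (i ^ 1u));   /* buddy absorbed; its free-list entry must also be unlinked */
        i >>= 1;                      /* move up to the parent region */
        size <<= 1;
        off &= ~(size - 1u);          /* merged region starts at the lower buddy */
    }
    *tag |= 1u << i;
    return (region_t){ off, size };
}

/* Clear a region's bit when it is taken off a free list for allocation or splitting. */
static void tag_take(uint32_t *tag, unsigned off, unsigned size) {
    *tag &= ~(1u << node_index(off, size));
}
```

Freeing the region at offset 0 after its 32-byte and 64-byte buddies have already been freed merges all the way back to the full 128-byte block, leaving only bit 1 set in the tag.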

14
Deferred Coalescing

SMA (CAM)
Coalescing can be deferred for any size
Content-addressable memory is used to associate
the size of each deferred region with the
region itself
SMA (LM)
The sizes for which coalescing can be deferred are
chosen at compile time
Deferred regions are stored in an array in local
memory
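The SMA (LM) variant can be pictured as a small per-size cache in local memory. In this minimal sketch, which assumes two hypothetical deferred size classes with four slots each, freeing a region of a deferred size parks it in the array instead of coalescing it, and a later allocation of that size is served from the array first. The names deferred_free and deferred_alloc and the slot counts are assumptions, not values from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical compile-time choice: classes 0 and 1 (e.g. 8- and 16-byte
   regions) may have coalescing deferred, with room for 4 regions each. */
#define DEFER_CLASSES 2
#define DEFER_SLOTS   4

/* Deferred regions kept in an array in local memory, one row per deferred class. */
static uint16_t deferred[DEFER_CLASSES][DEFER_SLOTS];
static unsigned deferred_count[DEFER_CLASSES];

/* Returns 1 if the freed region was parked without coalescing, 0 if the caller
   must take the normal free-and-coalesce path. */
static int deferred_free(int cls, uint16_t off) {
    assert(cls >= 0);
    if (cls >= DEFER_CLASSES || deferred_count[cls] == DEFER_SLOTS)
        return 0;
    deferred[cls][deferred_count[cls]++] = off;
    return 1;
}

/* Returns a parked region of class cls, or -1 if none is available. */
static int deferred_alloc(int cls) {
    assert(cls >= 0);
    if (cls >= DEFER_CLASSES || deferred_count[cls] == 0)
        return -1;
    return deferred[cls][--deferred_count[cls]];
}
```

When a class's slots are full, or the size was not selected for deferral at compile time, deferred_free returns 0 and the caller falls back to the ordinary coalescing free.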

16
Experimental Setup

Intel IXP 2350
Network processor
4 microengine cores with 4 kB local scratch-pad
each
Access to another 16 kB of shared scratch-pad
Compared against Doug Lea's malloc

a2p: conversion of a 15 kB text file to PostScript
gcc: compilation of the file combine.c in the gcc source, using gcc
gst: Ghostscript extraction of a 682 kB PostScript file
cvt: application of the charcoal filter to a 1024x768 JPEG image using ImageMagick
ogg: encoding of a 20-second WAV file using the Ogg encoder
pyt: execution of the Python example file md5driver.py
tar: archiving and gzip compression of 27 files in 4 directories into a 1 MB archive
17
Allocation Performance
18
Free Performance
19
Memory Wastage
20
Memory Wastage
22
Lock-Free Block Allocation

State for large blocks is stored in the
free-block bitmap
A simple lock-free update algorithm can be used
to protect this bitmap
It uses the test-and-clear primitive

[Figure: Thread 1 and Thread 2 operating on the global free-block bitmap with test-and-clear and atomic-set operations]
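The test-and-clear scheme can be approximated with C11 atomics. In this minimal sketch, which assumes a single 32-bit global bitmap with set bits meaning "free", atomic_fetch_and plays the role of test-and-clear and atomic_fetch_or the role of atomic set; these are stand-ins for whatever primitives the IXP hardware actually provides. A thread that finds some of its requested blocks already taken restores exactly the bits it cleared. The names try_claim and release are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Global free-block bitmap shared by all cores: bit set => block free. */
static _Atomic uint32_t global_bm = 0xFFFFFFFFu;

/* Try to claim every block in mask without taking a lock. */
static bool try_claim(uint32_t mask) {
    /* Test-and-clear: clear the bits and learn which of them were set. */
    uint32_t old = atomic_fetch_and(&global_bm, ~mask);
    if ((old & mask) == mask)
        return true;                        /* every requested block was free */
    /* Some blocks were already taken: restore only the bits this call cleared. */
    atomic_fetch_or(&global_bm, old & mask);
    return false;
}

/* Atomic set: return previously claimed blocks to the bitmap. */
static void release(uint32_t mask) {
    assert((atomic_load(&global_bm) & mask) == 0);   /* caller must own these blocks */
    atomic_fetch_or(&global_bm, mask);
}
```

No block can be handed to two threads, because only one fetch_and can observe a given bit as set. A claim can fail spuriously while another thread briefly holds bits it is about to restore, so callers would simply retry.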
23
Protecting Small Region Lists

Locks are used to protect the free lists used for
small-size allocation
SMA Coarse uses a single lock
SMA Fine uses one lock per size class
In SMA Fine, two locks must be held briefly
while regions are being coalesced
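The SMA Fine discipline can be sketched with POSIX threads. This minimal sketch assumes one mutex per size class, array-backed free lists of region offsets within a block, and a lock order of smaller class before larger class so that two threads coalescing at the same time cannot deadlock. It performs only a single level of merging. The names classes, class_push and free_and_merge are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_CLASSES 5
#define MAX_FREE    16
#define MIN_REGION  8u   /* hypothetical smallest region size in bytes */

/* One lock and one (array-backed) free list per size class. */
typedef struct {
    pthread_mutex_t lock;
    unsigned offs[MAX_FREE];   /* offsets of free regions within a block */
    int n;
} size_class_t;

static size_class_t classes[NUM_CLASSES];

static void classes_init(void) {
    for (int c = 0; c < NUM_CLASSES; c++) {
        pthread_mutex_init(&classes[c].lock, NULL);
        classes[c].n = 0;
    }
}

static void class_push(int c, unsigned off) {
    pthread_mutex_lock(&classes[c].lock);
    assert(classes[c].n < MAX_FREE);
    classes[c].offs[classes[c].n++] = off;
    pthread_mutex_unlock(&classes[c].lock);
}

/* Free a region of class c. If its buddy is on list c, unlink the buddy and put
   the merged region on list c + 1. Both locks are held only for this transfer and
   are always taken smaller class first. Returns 1 if a merge happened. */
static int free_and_merge(int c, unsigned off) {
    assert(c >= 0 && c + 1 < NUM_CLASSES);
    unsigned size = MIN_REGION << c;
    unsigned buddy = off ^ size;              /* buddy differs only in the size bit */
    size_class_t *lo = &classes[c], *hi = &classes[c + 1];
    int merged = 0;

    pthread_mutex_lock(&lo->lock);
    pthread_mutex_lock(&hi->lock);
    for (int i = 0; i < lo->n; i++) {
        if (lo->offs[i] == buddy) {
            lo->n--;
            lo->offs[i] = lo->offs[lo->n];    /* unlink buddy (order not preserved) */
            assert(hi->n < MAX_FREE);
            hi->offs[hi->n++] = off & ~size;  /* merged region starts at the lower buddy */
            merged = 1;
            break;
        }
    }
    if (!merged) {
        assert(lo->n < MAX_FREE);
        lo->offs[lo->n++] = off;
    }
    pthread_mutex_unlock(&hi->lock);
    pthread_mutex_unlock(&lo->lock);
    return merged;
}
```

Any other path that needs two class locks, such as splitting a larger region into smaller ones, would have to take them in the same smaller-first order for the deadlock-freedom argument to hold.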

24
Concurrency Scaling
26
Future Work

Provide the illusion of a single memory
Let the runtime worry about data placement
Data can be annotated to give hints to the
runtime system

27
Conclusion

Tiny memories need to be managed too
SMA is a simple and efficient algorithm for
dynamic management of small memories
Fixed-size block allocation is simple and has low
state overheads
Splitting partially used blocks to be reused by
small allocations limits fragmentation
SMA can be augmented to support concurrent
requests from multiple cores