Title: Efficient Dynamic Heap Allocation of Scratch-Pad Memory
1Efficient Dynamic Heap Allocation of Scratch-Pad Memory
Carnegie Trust for the Universities of Scotland
- Ross McIlroy, Peter Dickman and Joe Sventek
2Scratch-Pad Memory Allocator
- SMA: a dynamic memory allocator targeting extremely small memories (< 1MB in size)
- Why target such tiny memories?
- Why provide dynamic memory allocation for such
small memories?
3Outline
- Rationale for SMA
- SMA Approach
- Results
- Concurrent SMA
- Conclusion / Future work
5What Tiny Memories?
- Embedded Systems
- Sensor Network Motes
- Vehicular Devices
- Scratch-Pad Memories
- Network Processors
- Heterogeneous Multi-Core Processors
6Scratch-Pad Memories
- Memory structured as a hierarchy
- Small fast memories, large slow memories
- Usually hidden by hardware caches
- Some processor architectures employ scratch-pad memories instead
- Similar in size and speed to caches, but explicitly accessible by software
- Examples
- IBM Cell processor
- Intel IXP network processors
- Intel PXA mobile phone processors
7Why Dynamic Management?
- Developers want as much useful data in the fast Scratch-Pad memory as possible
- They don't want to deal with the fragmented memory hierarchy
                              Manual   Static   Dynamic
Developer ease                ?        ?        ?
Make full use of Scratch-Pad  ?        ?        ?
8Why SMA?
Resource                     Doug Lea malloc   SMA
State Memory (bytes)         516               40
Code Memory (instructions)   1634              297
Avg. Alloc Time (cycles)     70.7              72.8
Avg. Free Time (cycles)      95.2              52.4
Managing 4kB Scratch-Pad memory on an Intel IXP
processor
9Outline
- Rationale for SMA
- SMA Approach
- Results
- Concurrent SMA
- Conclusion / Future work
10Basic Approach
- By default, represent memory coarsely as a series of fixed-size blocks
- Can employ a very simple bitmap-based allocation / free algorithm
- When required, split blocks into variable-sized regions
- Prevents excessive internal fragmentation
11Large Block Allocation
- Each block in memory represented by a bit in a
free-block bitmap
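[Figure: free-block bitmap with example bits set to 1. Labels: rem_blocks, derived from blocks_bm, mask and next_pos, searched with ffs(rem_blocks); in_use, derived from mask, blocks_bm and next_pos, searched with fls(in_use)]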
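The bitmap search can be sketched in C as follows. This is a minimal illustration and not the SMA source: it assumes 32 blocks tracked by one 32-bit word, assumes a set bit means "free", and the names `alloc_blocks`, `free_blocks`, `run_mask` and `lowest_set` are hypothetical. `lowest_set` is a portable stand-in for an `ffs` instruction.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BLOCKS 32u              /* assumed: one 32-bit word covers all blocks */

/* Assumed polarity: bit i set means block i is free. */
static uint32_t blocks_bm = UINT32_MAX;

/* Index of the lowest set bit, or -1 if the word is zero. */
static int lowest_set(uint32_t w) {
    for (int i = 0; i < 32; i++)
        if (w & (UINT32_C(1) << i)) return i;
    return -1;
}

/* A word with the low n bits set (1 <= n <= 32). */
static uint32_t run_mask(unsigned n) {
    return (n == NUM_BLOCKS) ? UINT32_MAX : ((UINT32_C(1) << n) - 1u);
}

/* Allocate n contiguous blocks; returns the first block index or -1. */
int alloc_blocks(unsigned n) {
    assert(n >= 1 && n <= NUM_BLOCKS);
    unsigned pos = 0;
    while (pos + n <= NUM_BLOCKS) {
        /* Free blocks at or above pos. */
        uint32_t rem = blocks_bm & (uint32_t)(UINT32_MAX << pos);
        int first = lowest_set(rem);
        if (first < 0 || (unsigned)first + n > NUM_BLOCKS) return -1;
        uint32_t mask = run_mask(n) << first;
        if ((blocks_bm & mask) == mask) {   /* the whole run is free */
            blocks_bm &= ~mask;             /* mark it as in use */
            return first;
        }
        pos = (unsigned)first + 1;          /* run too short, keep searching */
    }
    return -1;
}

/* Return n blocks starting at index first to the bitmap. */
void free_blocks(unsigned first, unsigned n) {
    assert(n >= 1 && first + n <= NUM_BLOCKS);
    blocks_bm |= run_mask(n) << first;
}
```

The only allocator state in this sketch is the single bitmap word, which is the property the slide attributes to the coarse block representation.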
12Small Region Allocation
- Unused parts of an allocated block can be reused by sub-block sized allocations
- Blocks are split into power-of-two sized regions, in a Binary Buddy type approach
- Free regions are stored in per-size free lists
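A buddy-style split with per-size free lists might look like the sketch below. The block size (128 bytes), the smallest region (16 bytes), the single-block pool, and all names (`small_alloc`, `free_head`, `next_free`) are assumptions made for illustration; the slides do not give these details.

```c
#include <assert.h>

#define BLOCK_SIZE  128u   /* assumed block size in bytes */
#define MIN_REGION  16u    /* assumed smallest region size */
#define NUM_CLASSES 3      /* region sizes 16, 32 and 64 bytes */
#define NIL (-1)

/* free_head[c] is the offset of the first free region of size MIN_REGION << c. */
static int free_head[NUM_CLASSES] = { NIL, NIL, NIL };
/* next_free[offset / MIN_REGION] links free regions of the same size. */
static int next_free[BLOCK_SIZE / MIN_REGION];
/* This sketch manages a single block that can be taken once. */
static int block_taken = 0;

static void push(int cls, int off) {
    next_free[off / (int)MIN_REGION] = free_head[cls];
    free_head[cls] = off;
}

static int pop(int cls) {
    int off = free_head[cls];
    if (off != NIL) free_head[cls] = next_free[off / (int)MIN_REGION];
    return off;
}

/* Smallest size class whose regions can hold size bytes. */
static int size_class(unsigned size) {
    int c = 0;
    while ((MIN_REGION << c) < size) c++;
    return c;
}

/* Allocate a sub-block region; returns its byte offset or NIL. */
int small_alloc(unsigned size) {
    assert(size >= 1 && size <= BLOCK_SIZE / 2);
    int want = size_class(size);
    int k = want, off = NIL;
    /* Look for a free region of the wanted size, then of larger sizes. */
    while (k < NUM_CLASSES && (off = pop(k)) == NIL) k++;
    if (off == NIL) {
        if (block_taken) return NIL;   /* no memory left in this sketch */
        block_taken = 1;
        off = 0;
        k = NUM_CLASSES;               /* a whole block is one class above the largest */
    }
    /* Split in half repeatedly, putting each upper half on its free list. */
    while (k > want) {
        k--;
        push(k, off + (int)(MIN_REGION << k));
    }
    return off;
}
```

For example, the first 16-byte request takes the block and leaves free regions of 64, 32 and 16 bytes on the lists, so later small requests are served without touching another block.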
13Coalescing Freed Regions
- We wanted to avoid boundary tags
- Instead, the orderly way in which regions are split is exploited
- A word-sized coalesce tag stores the coalesce details for all regions in a block
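The slides do not give the layout of the coalesce tag, so the encoding below is purely an assumption: one bit per node of the block's buddy tree, set when that node is a free region. It only illustrates how a single word per block can replace boundary tags. The names `coalesce_tag`, `node_of` and `free_region`, the 128-byte block and the 16-byte minimum region are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 128u   /* assumed block size in bytes */
#define MIN_REGION 16u    /* assumed smallest region: at most 3 splits */

/* Hypothetical tag: bit i is set when buddy-tree node i is a free region.
   Node 1 is the whole block; node i has children 2i and 2i + 1. */
typedef uint32_t coalesce_tag;

/* Tree node for the region at offset with the given power-of-two size. */
static unsigned node_of(unsigned offset, unsigned size) {
    assert(size >= MIN_REGION && size <= BLOCK_SIZE && offset % size == 0);
    return BLOCK_SIZE / size + offset / size;
}

/* Mark a region free and merge it with its buddy while the buddy is free.
   Returns the size of the resulting free region. */
unsigned free_region(coalesce_tag *tag, unsigned offset, unsigned size) {
    unsigned node = node_of(offset, size);
    while (node > 1 && (*tag & (UINT32_C(1) << (node ^ 1u)))) {
        *tag &= ~(UINT32_C(1) << (node ^ 1u));  /* buddy is absorbed */
        node /= 2;                              /* move up to the parent */
        size *= 2;
    }
    *tag |= UINT32_C(1) << node;
    return size;
}
```

When the returned size equals `BLOCK_SIZE`, the whole block is free again and could be handed back to the free-block bitmap. A full allocator would also unlink absorbed buddies from their free lists, which this sketch omits.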
14 Deferred Coalescing
- SMA (CAM)
- Any size can have coalescing deferred
- Content addressable memory used to associate the size of deferred coalesced regions with the regions themselves
- SMA (LM)
- Sizes for which coalescing can be deferred are chosen at compile time
- Deferred regions stored in an array in local memory
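The SMA (LM) idea of deferring coalescing for a size chosen at compile time can be sketched as a small array that sits in front of the coalescing path. The deferred size (32 bytes), the array depth (4) and the names `defer_free` and `take_deferred` are assumptions for illustration only.

```c
#include <assert.h>

#define DEFERRED_SIZE 32u   /* assumed: the size chosen at compile time */
#define DEFER_SLOTS   4     /* assumed capacity of the local-memory array */
#define NIL (-1)

/* Offsets of freed regions whose coalescing has been deferred. */
static int deferred[DEFER_SLOTS];
static int deferred_count = 0;

/* Called on free. Returns 1 if the region was parked in the array,
   or 0 if the caller must coalesce it immediately. */
int defer_free(unsigned size, int offset) {
    assert(offset >= 0);
    if (size != DEFERRED_SIZE || deferred_count == DEFER_SLOTS) return 0;
    deferred[deferred_count++] = offset;
    return 1;
}

/* Called on allocation. Returns a parked region of the right size, or NIL. */
int take_deferred(unsigned size) {
    if (size != DEFERRED_SIZE || deferred_count == 0) return NIL;
    return deferred[--deferred_count];
}
```

A free followed shortly by an allocation of the same size then skips both the merge and the later split, which is the usual motivation for deferring coalescing.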
15Outline
- Rationale for SMA
- SMA Approach
- Results
- Concurrent SMA
- Conclusion / Future work
16Experimental Setup
- Intel IXP 2350
- Network processor
- 4 microengine cores with 4kB local scratch-pad each
- Access to another 16kB of shared scratch-pad
- Compared against Doug Lea's malloc
a2p Conversion of a 15kB text file to postscript
gcc Compilation of the file combine.c in the gcc source, using gcc
gst Ghostscript extraction of a 682kB postscript file
cvt Application of the charcoal filter to a 1024x768 JPEG image using ImageMagick
ogg Encoding of a 20 second wav file using the ogg encoder
pyt Execution of the python example file md5driver.py
tar Archive and gzip compression of 27 files in 4 directories into a 1Mb archive
17Allocation Performance
18Free Performance
19Memory Wastage
20Memory Wastage
21Outline
- Rationale for SMA
- SMA Approach
- Results
- Concurrent SMA
- Conclusion / Future work
22Lock-Free Block Allocation
- State for large blocks is stored in the free-block bitmap
- A simple lock-free update algorithm can be used to protect this bitmap
- Uses the test and clear primitive
[Figure: Thread 1 and Thread 2 each apply Test and Clear to the Global bitmap, followed by an Atomic Set]
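A lock-free claim on the bitmap can be sketched with C11 atomics, using `atomic_fetch_and` as a stand-in for the hardware test and clear primitive and `atomic_fetch_or` as the atomic set. The polarity (set bit means free) and the names `try_claim` and `release_blocks` are assumptions, not taken from the SMA code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed polarity: bit i set means block i is free. */
static _Atomic uint32_t blocks_bm = UINT32_MAX;

/* Try to claim every block in mask. Returns 1 on success, 0 otherwise. */
int try_claim(uint32_t mask) {
    assert(mask != 0);
    /* Test and clear: atomically clear the bits and read what was there. */
    uint32_t old = atomic_fetch_and(&blocks_bm, ~mask);
    uint32_t got = old & mask;      /* blocks this call actually took */
    if (got == mask) return 1;
    /* Another thread owns part of the run: atomically set back only the
       bits this call cleared, leaving the other thread's blocks alone. */
    atomic_fetch_or(&blocks_bm, got);
    return 0;
}

/* Return blocks to the bitmap. */
void release_blocks(uint32_t mask) {
    atomic_fetch_or(&blocks_bm, mask);
}
```

No lock is needed because a block's bit can only go from set to clear for one thread: a competing thread sees the bit already clear in its returned old value and backs off.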
23Protecting Small Region Lists
- Locks are used to protect the free-lists used for small size allocation
- SMA Coarse uses one lock
- SMA Fine uses one lock per size class
- In SMA Fine, when regions are being coalesced,
two locks must be held briefly
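The SMA Fine scheme can be sketched with one spinlock per size class. Everything here is illustrative: the locks are C11 `atomic_flag` spinlocks rather than whatever the IXP provides, per-class counters stand in for the real free lists, the buddy check is omitted, and taking the smaller class's lock first is an assumed ordering chosen to avoid deadlock.

```c
#include <assert.h>
#include <stdatomic.h>

#define NUM_CLASSES 3   /* assumed number of small size classes */

/* One lock per size class (the SMA Fine arrangement). */
static atomic_flag class_lock[NUM_CLASSES] = {
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT
};
/* Stand-in for the per-size free lists: number of free regions per class. */
static int free_count[NUM_CLASSES];

static void lock(int c)   { while (atomic_flag_test_and_set(&class_lock[c])) { } }
static void unlock(int c) { atomic_flag_clear(&class_lock[c]); }

/* Free a region of class c: only that class's lock is needed. */
void free_small(int c) {
    assert(c >= 0 && c < NUM_CLASSES);
    lock(c);
    free_count[c]++;
    unlock(c);
}

/* Merge two free regions of class c into one of class c + 1.
   Both classes change, so both locks are held for the update. */
int coalesce_pair(int c) {
    assert(c >= 0 && c + 1 < NUM_CLASSES);
    int merged = 0;
    lock(c);          /* assumed order: smaller class first */
    lock(c + 1);
    if (free_count[c] >= 2) {
        free_count[c] -= 2;
        free_count[c + 1] += 1;
        merged = 1;
    }
    unlock(c + 1);
    unlock(c);
    return merged;
}
```

SMA Coarse corresponds to replacing the array with a single lock, which removes the two-lock step but serialises requests for different sizes.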
24Concurrency Scaling
25Outline
- Rationale for SMA
- SMA Approach
- Results
- Concurrent SMA
- Conclusion / Future work
26Future Work
- Provide the illusion of a single memory
- Let runtime worry about data placement
- Data can be annotated to give hints to the
runtime system
27Conclusion
- Tiny memories need to be managed too
- SMA is a simple and efficient algorithm for dynamic management of small memories
- Fixed-size block allocation is simple and has low state overheads
- Splitting partially used blocks to be reused by small allocations limits fragmentation
- SMA can be augmented to support concurrent requests from multiple cores
28Questions?
29 16kB Management Allocation
30 16kB Management Free
31 16kB Management Waste