Title: Getting Real, Getting Dirty, without getting real dirty
1. Fast and Bounded-Time Storage Allocation

Funded by the National Science Foundation under grant ITR-008124

Steven M. Donahue, Matt Hampton, Ron K. Cytron, Mark Franklin
Center for Distributed Object Computing
Department of Computer Science, Washington University
http://deuce.doc.wustl.edu/doc/

Joint work with Krishna Kavi, University of North Texas

WMPI, May 2002
2. Outline
- Motivation
- Background
- Our approach
- Analysis and results
- Conclusions and future work
3. Motivation: Real Time

[Figure: the real-time software stack]
- Application
- Language: Java, Timber, Ada
- Operating System: LynxOS, VxWorks, QNX
- Hardware
4. Motivation: Real-Time Java

- Specify time properties:
- Start
- Period
- Cost
- Deadline (relative)
- Storage allocation must be predictable and reasonably bounded

new Foo()
5. Motivation: Intelligent RAM

[Figure: IRAM storage management — a CPU with cache and L2 cache connected over the data bus to RAM modules; the allocation logic (alloc, dealloc) sits with the memory]
6. Storage Allocation for Real Time

- Fast is good, but predictable is required
- Develop an allocation system that provides performance, robustness, portability, and quick development time
7. Common Storage Allocation: Free-list Algorithm

- Linked list of free blocks

[Figure: Allocation Times for List Allocator (ns)]

- Allocation worst-case?
- O(n) for n blocks in the list
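The free-list algorithm above can be sketched in a few lines. This is a minimal, hypothetical first-fit version for illustration, not the allocator measured in the talk; the worst case is visible in the loop, which may visit every block before finding a fit.

```c
#include <stddef.h>

/* A free block carries its size and a link to the next free block. */
typedef struct Block {
    size_t size;
    struct Block *next;
} Block;

static Block *free_list = NULL;

/* First-fit allocation: O(n) worst case, since the scan may traverse
   the entire list before finding a large-enough block (or failing). */
Block *list_alloc(size_t size) {
    Block **prev = &free_list;
    for (Block *b = free_list; b != NULL; prev = &b->next, b = b->next) {
        if (b->size >= size) {
            *prev = b->next;   /* unlink the block and hand it out */
            return b;
        }
    }
    return NULL;               /* no block fits */
}

/* Deallocation is O(1): push the block back on the list head. */
void list_free(Block *b) {
    b->next = free_list;
    free_list = b;
}
```

The unbounded scan is exactly why the worst case grows with list length, motivating the segregated schemes on the next slides.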
8. Application-Specific Allocator

- Example: suppose the application allocates only blocks of size 135, 65, 23, and 5
- Have n free-list allocators, one for each size
- Allocation worst-case?
- O(1)
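A minimal sketch of the scheme above, using the example sizes from the slide: one free list per known request size, so allocation is a single list-head pop. The names and the table of sizes are illustrative assumptions, not code from the talk.

```c
#include <stddef.h>

#define NSIZES 4
/* The only block sizes this hypothetical application ever requests. */
static const size_t sizes[NSIZES] = {135, 65, 23, 5};
static void *heads[NSIZES];          /* one free-list head per size */

static int size_index(size_t size) {
    for (int i = 0; i < NSIZES; i++)
        if (sizes[i] == size) return i;
    return -1;                       /* not a size this application uses */
}

/* O(1) given the size index: pop the head of that size's list. */
void *sized_alloc(size_t size) {
    int i = size_index(size);
    if (i < 0 || heads[i] == NULL) return NULL;
    void *b = heads[i];
    heads[i] = *(void **)b;          /* a free block's first word is its next link */
    return b;
}

void sized_free(void *b, size_t size) {
    int i = size_index(size);
    *(void **)b = heads[i];
    heads[i] = b;
}
```

Note the disadvantage listed on the next slide: the size table is baked into the allocator, so the code is neither general-purpose nor portable across applications.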
9. Application-Specific Disadvantages

- Not general purpose
- Requires a priori knowledge of allocated block sizes
- Makes code ugly and nonportable
- Can require extra storage:
- Number of allocators times maxlive
10. Ideal Allocator

- General purpose
- Minimal impact on application footprint
- Ratio of worst-case to average-case close to 1
- Knuth's Buddy Algorithm
- Overall speed is as fast as possible
- Hardware and optimizations
11. Knuth's Buddy System

- Free lists segregated by size
- All requests input to the system are rounded up to a power of 2
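The rounding rule above also determines which free list a request maps to: a block of size 2^k lives at level k. A small sketch (function names are illustrative):

```c
#include <stddef.h>

/* Round a request up to the next power of two, as the buddy system does. */
size_t round_up_pow2(size_t n) {
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

/* The free-list level of a rounded size: size 2^k sits at level k. */
int level_of(size_t pow2) {
    int k = 0;
    while ((pow2 >> k) > 1) k++;
    return k;
}
```

So a request for 16 bytes stays at 16 (level 4), while a request for 17 bytes is served from the 32-byte list (level 5).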
12. Knuth's Buddy System (1)

- The system begins with one large block of size 256
- Example: allocate a block of size 16
- Three operations:
- Find
- Wait
- Return
- First, Find a chunk of free storage

[Figure: free-list levels for sizes 256 down to 1, with the single 256 block free]
13. Knuth's Buddy System (2)

- Wait: recursively subdivide

[Figure: the 256 block is split, leaving a free 128 block]
14. Knuth's Buddy System (3)

- Wait: recursively subdivide

[Figure: the subdivision continues one level further down]
15. Knuth's Buddy System (4)

- Wait: recursively subdivide

[Figure: the subdivision continues one level further down]
16. Knuth's Buddy System (5)

[Figure: free-list levels after subdividing down to size 16]
17. Knuth's Buddy System (6)

- Return: one of the size-16 blocks can be given to the program

[Figure: free-list levels with free blocks left at sizes 128, 64, 32, and 16]
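The Find/Wait/Return walkthrough above can be sketched in software. To keep the sketch self-contained, free lists are modeled as per-level counters of available blocks (an assumption for brevity; the real system links actual blocks), with level k holding blocks of size 2^k.

```c
#define LEVELS 9            /* levels 0..8: sizes 1 through 256, as in the example */
static int avail[LEVELS];   /* number of free blocks at each level */

/* Allocate a block at level `want` (size 2^want).
   Returns `want` on success, or -1 if the heap is exhausted. */
int buddy_alloc_level(int want) {
    int k = want;
    while (k < LEVELS && avail[k] == 0) k++;   /* Find: smallest sufficient level */
    if (k == LEVELS) return -1;
    avail[k]--;                                /* take the block at level k */
    while (k > want) {                         /* Wait: recursively subdivide */
        k--;
        avail[k]++;                            /* one buddy stays free per level */
    }
    return want;                               /* Return: block of size 2^want */
}
```

Starting from one free 256 block, a request for 16 leaves one free block each at 128, 64, 32, and 16, matching the figures above.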
18. General Architecture
19. Finding a Block

Allocate a block of size 16

[Figure: side-by-side free-list ladders (256 down to 1) for the software approach and our hardware trick]

- Software approach: Find is O(log(N))
- Our hardware trick: Find is O(1) in practice
20. Return

Allocate a block of size 16

[Figure: the application issues new Foo() to the allocator; the software approach performs Find, Wait, Return in sequence, while our hardware trick performs Find, then Return, with the Wait overlapped]
21. Results and Analysis

- Two versions of the Buddy System:
- Reference version
- Optimized version with Fast Find and Fast Return
- Experiments compared the hardware Buddy Systems to two other systems:
- Java VM 1.1.18 free-list allocator
- glibc malloc (black-box allocator)
- Two suites of benchmarks:
- SPEC Java benchmarks
- C malloc benchmarks
22. Overview of Java Benchmarks

[Figure: allocation time as a percentage of execution time]
23. Overview of C Benchmarks

[Figure: allocation time as a percentage of execution time]
24. How Much Does the Reference Version Save for Java?

- The savings the reference version provides are not substantial compared to overall application execution time
- But it is significantly more efficient in allocation itself!
25. Fast Find Results

- Compare the ratio of average-case to worst-case times
- The optimized version has a much better ratio
- A system using the optimized version would suffer from less over-provisioning
26. Fast Return Results

- Could Fast Return affect performance?
- In the C malloc benchmarks, the minimum inter-arrival time of allocation requests is greater than the Wait time of the Buddy System
- The Wait time could therefore complete in parallel within the inter-arrival time
27. Fast Return Results

- To what extent could Fast Return affect performance?
- Factor the Wait time out of the allocation time
- Could save 96% of allocation time
- Affects overall application efficiency only to a limited extent
28. Results Analysis

- Fast Return's effectiveness depends on the inter-arrival time of allocations
- Some Java allocations came too quickly
- But how often?
- In one benchmark, less than 15% of the requests arrived before the Wait time could complete in parallel
29. Conclusions

- Let's revisit the ideal allocator:
- Can be used without modification by any application
- Does not require an unreasonable amount of storage
- The difference between worst-case and average-case performance is small
- Overall speed is as fast as possible
30. Future Work

- Improve the Wait time, similar to the improvements for Find and Return
- Algorithms for defragmenting the heap with the Buddy System
- Use the allocator as a building block:
- Garbage collection
- Goal of IRAM (intelligent storage)
31. Questions?
32. Motivation

- Bring modern, high-level languages to the arena of real-time applications
- Allow embedded developers to author applications with high-level languages
- RTSJ?
- IRAM?
33. Overview of Java Benchmarks

- Compress: data compression utility
- Jess: Java Expert System Shell
- Raytrace: graphics ray tracer
- DB: database engine
- Javac: Java compiler
- Mpegaudio: MPEG decompressor
- MTRT: multi-threaded ray tracer
- Jack: parser generator
34. Overview of C Benchmarks

- Cfrac: factoring program
- Gawk: GNU Awk interpreter
- GS: Ghostscript image interpreter
- P2C: Pascal-to-C converter
- PTC: another Pascal-to-C converter
35. Optimizations

- Think of allocation as having three parts:
- Find: locate an appropriately sized block to return and perhaps break apart
- Wait: while the block is broken apart and the lists are manipulated
- Return: hand the requested-size block to the requestor
- Fast Find
- Fast Return
36. Current Allocators

- Unorganized free list of blocks
- Organized free list:
- Segregate available blocks by size
- Application-specific allocator:
- The application knows what sizes of blocks it consumes
- Use a free list for each block size
37. Worst-case Free-list Behavior

- The longer the free list, the more pronounced the effect
- No a priori bound on how much worse the list-based scheme could get
- Over-provisioning must be based on this relatively unbounded worst-case behavior

[Figure: Allocation Times for List Allocator (ns)]
38. Organized Free List

- Example: Knuth's Buddy System
- Free lists segregated by size
- Sizes are powers of two
- Allocation worst-case?
- O(log(N)), where N is the size of the heap
39. General Architecture

- 32-bit general-purpose registers
- 32-bit registers for the head pointers of the segregated free lists
- Shift registers to track size
40. General Architecture

- Simple ALU for address calculation and comparisons
- Memory I/O registers
- Controller for the algorithm
41. Fast Find Results

[Figure: average-case vs. worst-case allocation times]

- The optimized average case suffers a little from Fast Find compared to the reference version
- But worst-case times are significantly improved
42. How Much Does the Reference Version Save for C?

- Results are similar to the Java results
- The reference version can save a substantial amount of allocation time, but not overall application time
43. Finding a Block -- Fast

- In the FPGA, each level has a bit to indicate its status
- Suppose we want a block of size 16
- Levels too small for a given request are masked out
- A leading-ones detector Finds the desired level
- Allocation completes by recursively subdividing and returning the block

[Figure: free-list levels 256 down to 1, each with a status bit]
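A software analogue of the bit-per-level trick above: keep one status bit per level (bit k set when level k's free list is non-empty), mask off the levels that are too small, and take the lowest remaining set bit. The bit-scan plays the role of the leading-ones detector, finding the target level in one step instead of a loop. `__builtin_ctz` is a GCC/Clang intrinsic, assumed available here.

```c
#include <stdint.h>

/* Returns the smallest non-empty level >= want, or -1 if none exists.
   level_bits: bit k is set when level k (blocks of size 2^k) has a free block. */
int fast_find(uint32_t level_bits, int want) {
    /* Mask out levels too small for the request. */
    uint32_t eligible = level_bits & ~((1u << want) - 1u);
    if (eligible == 0) return -1;
    /* Lowest remaining set bit = smallest sufficient level. */
    return __builtin_ctz(eligible);
}
```

For a size-16 request (level 4) with free blocks only at levels 6 and 8, the detector lands on level 6, and allocation would complete by subdividing from there.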
44. Fast Return

- Want 16 bytes
- Use Fast Find to get to the size-64 level
- Return the first part of that block immediately to the requestor
- Software would have to reorganize first

[Figure: a size-64 block split into 32 and 16, with the first 16 bytes returned]
45. Fast Return

- Want 16 bytes
- Use Fast Find to get to the size-64 level
- Return the first part of that block immediately to the requestor
- Adjustment to the structures happens in parallel with the return

[Figure: a size-64 block split into 32 and 16, with the first 16 bytes returned]
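The Fast Return idea above can be sketched as follows: the requestor's address comes out of the found block immediately, while the leftover buddies are recorded for later insertion into the free lists. This sketch runs sequentially; the hardware's point is that the bookkeeping overlaps the return. Names and the leftover-array interface are illustrative assumptions.

```c
#include <stddef.h>

/* Given a free block of `have` bytes at `addr` and a request for `want`
   bytes (both powers of two, want <= have), return the granted address
   and record each leftover buddy's address in `leftover` (one per level),
   with the count written to *nleft. */
size_t fast_return(size_t addr, size_t have, size_t want,
                   size_t leftover[], int *nleft) {
    size_t granted = addr;          /* Return: the first part goes out at once */
    *nleft = 0;
    while (have > want) {           /* Wait: this bookkeeping can be overlapped */
        have >>= 1;
        leftover[(*nleft)++] = addr + have;  /* the upper buddy stays free */
    }
    return granted;
}
```

For the example above (a 64-byte block serving a 16-byte request at address 0), the caller gets address 0 immediately, and buddies of size 32 (at offset 32) and 16 (at offset 16) are queued for the free lists.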
46. Effect of Optimizations

- Fast Find improves the Find portion of allocation to O(1) in practice
- However, the Wait time is still O(log(N))
- But Fast Return could allow the Wait time to execute in parallel with application execution

new Foo()