Title: Getting Real, Getting Dirty (without getting real dirty)
1. Getting Real, Getting Dirty (without getting real dirty)
Ron K. Cytron
Joint work with Krishna Kavi, University of Alabama at Huntsville
Dante Cannarozzi, Sharath Cholleti, Morgan Deters, Steve Donahue, Mark Franklin, Matt Hampton, Michael Henrichs, Nicholas Leidenfrost, Jonathan Nye, Michael Plezbert, Conrad Warmbold
Center for Distributed Object Computing, Department of Computer Science, Washington University
April 2001
Funded by the National Science Foundation under grant 0081214 and by DARPA under contract F33615-00-C-1697
2. Outline
- Motivation
- Allocation
- Collection
- Conclusion
3. Traditional architecture and object-oriented programs
- Caches are still biased toward Fortran-like behavior
- CPU is still responsible for storage management
- Object-management activity invalidates caches
- GC is disruptive
- Compaction
4. An OO-biased design using IRAMs (with Krishna Kavi)
- CPU and cache stay the same, off-the-shelf
- Memory system redesigned to support OO programs
[diagram: CPU, L1 cache, and L2 cache connected to an IRAM consisting of logic plus DRAM banks]
5. IRAM interface
- A stable address for an object allows better cache behavior
- An object can be relocated within the IRAM, but its address as seen by the CPU is constant
[diagram: CPU issues malloc to the IRAM logic, which returns an address]
6. IRAM interface
- Object referencing is tracked inside the IRAM, which supports garbage collection
[diagram: CPU issues putfield/getfield to the IRAM logic, which returns the value]
7. IRAM interface
- Goal: relegate storage-management functions to the IRAM
[diagram: gc, compact, and prefetch handled inside the IRAM logic]
8. Macro accesses
Observation: code sequences contain common gestures (superoperators), e.g.
p.getLeft().getNext() → *(*(p+12)+32)
[diagram: CPU and IRAM]
9. Gesture abstraction
Goal: decrease traffic between CPU and storage
M143(x) ≡ *(*(x+12)+32)
p.getLeft().getNext() → *(*(p+12)+32)
(a toy sketch of the macro idea follows)
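To make the gesture concrete, here is a toy software model. The flat int-array "memory", the use of 12 and 32 as element offsets, and the class and method names are illustrative assumptions, not the talk's actual IRAM encoding; the point is only that one macro request replaces two CPU-to-memory round trips.

// Toy sketch (not the talk's actual encoding): memory is an int array of
// word addresses, and macro 143 captures the two-dereference gesture
// p.getLeft().getNext(), i.e. *(*(p+12)+32), so the CPU can ship a single
// "Macro 143(p)" request to the IRAM instead of two round trips.
public class GestureSketch {
    static final int[] mem = new int[1024];          // simulated IRAM storage

    // The gesture as the CPU would otherwise perform it: two loads.
    static int getLeftThenNext(int p) {
        int left = mem[p + 12];                       // p.getLeft()
        return mem[left + 32];                        // .getNext()
    }

    // The same gesture registered as macro 143 inside the IRAM logic.
    static int macro143(int p) {
        return mem[mem[p + 12] + 32];
    }

    public static void main(String[] args) {
        mem[100 + 12] = 200;                          // p.left lives at address 200
        mem[200 + 32] = 7;                            // left.next is 7
        System.out.println(getLeftThenNext(100));     // 7, two CPU-memory trips
        System.out.println(macro143(100));            // 7, one macro request
    }
}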
10. Gesture application
M143(x) ≡ *(*(x+12)+32)
p.getLeft().getNext()
[diagram: the CPU issues Macro 143(p) to the IRAM]
11. Gesture application
M143(x) ≡ *(*(x+12)+32)
p.getLeft().getNext()
[diagram: the IRAM logic evaluates Macro 143(p) and returns the value of p.getLeft().getNext()]
12. Automatic prefetching
Goal: decrease traffic between CPU and storage
[diagram: the CPU fetches p through the L2 cache; the IRAM logic observes the fetch]
13. Automatic prefetching
Goal: decrease traffic between CPU and storage
[diagram: the IRAM logic prefetches p.getLeft().getNext() toward the CPU cache and L2 cache]
14. Challenges
- Algorithmic
- Bounded-time methods for allocation and collection
- Good average performance as well
- Architectural
- Lean interface between the CPU and IRAM
- Efficient realization
15. Storage Allocation (Real Time)
- Not necessarily fast
- Necessarily predictable
- Able to satisfy any reasonable request
- Developer should know the maxlive characteristics of the application
- This is true for non-embedded systems as well
16. How much storage?
- curlive: the number of objects live at a point in time
- curspace: the number of bytes live at a point in time
[diagram: handles pointing into object space]
17. Objects concurrently live [chart]
18. How much object space? [chart]
19. Storage Allocation: Free List
- Linked list of free blocks
- Worst case O(n) for n blocks in the list
20. Worst-case free-list behavior
- The longer the free list, the more pronounced the effect
- No a priori bound on how much worse the list-based scheme could get
- Average performance similar
21. Knuth's Buddy System
- Free lists segregated by size
- All requests rounded up to a power of 2
22. Knuth's Buddy System (1)
- Begin with one large block
- Suppose we want a block of size 16
[diagram: free lists segregated by size, 1 through 256; slides 22–26 animate the repeated splitting of the large block]
27. Knuth's Buddy System (6)
- One of those blocks can be given to the program
28. Worst-case free-list behavior
- The longer the free list, the more pronounced the effect
- No a priori bound on how much worse the list-based scheme could get
- Average performance similar
29. Spec Benchmark Results [chart]
30. Buddy System
- If a block can be found, it can be found in O(log N) time, where N is the size of the heap (a software sketch follows)
- The application cannot make that worse
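A minimal software sketch of the splitting logic behind that bound. The free lists here are ordinary Java deques of block addresses; headers, alignment, and freeing are omitted, and the class is an illustration rather than the VHDL design described later.

import java.util.ArrayDeque;
import java.util.Deque;

// Minimal buddy-allocation sketch: one free list per power-of-two size.
// An allocation either finds a block at the requested level or walks up
// at most log2(heapSize) levels, splitting on the way back down -- which
// is where the O(log N) worst case comes from.
public class BuddySketch {
    final int levels;                       // level i holds blocks of size 2^i
    final Deque<Integer>[] freeLists;       // addresses of free blocks per level

    @SuppressWarnings("unchecked")
    BuddySketch(int heapSize) {             // heapSize must be a power of two
        levels = Integer.numberOfTrailingZeros(heapSize) + 1;
        freeLists = new Deque[levels];
        for (int i = 0; i < levels; i++) freeLists[i] = new ArrayDeque<>();
        freeLists[levels - 1].push(0);      // one big block at address 0
    }

    /** Returns the address of a free 2^level block, or -1 if none exists. */
    int allocate(int level) {
        int i = level;
        while (i < levels && freeLists[i].isEmpty()) i++;   // search upward
        if (i == levels) return -1;                         // heap exhausted
        int addr = freeLists[i].pop();
        while (i > level) {                                 // split back down
            i--;
            freeLists[i].push(addr + (1 << i));             // buddy stays free
        }
        return addr;
    }

    public static void main(String[] args) {
        BuddySketch heap = new BuddySketch(256);
        System.out.println(heap.allocate(4));   // 16-byte block at address 0
        System.out.println(heap.allocate(4));   // its buddy, at address 16
    }
}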
31. Defragmentation
- To keep up with the diversity of requested block sizes, an allocator may have to reorganize smaller blocks into larger ones
32. Defragmentation: Free List
- Free list permutes adjacent blocks
- Storage becomes fragmented, with many small blocks and no large ones
[diagram: the free list superimposed on blocks in memory; slides 32–35 share this figure]
33. Defragmentation: Free List
- Free list permutes adjacent blocks
- Two issues
34. Defragmentation: Free List
- Two issues
- Reorganize holes (move live storage)
35. Defragmentation: Free List
- Two issues
- Organization by address can help [Kavi]
36. Buddies: joining adjacent blocks
- The blocks resulting from subdivision are viewed as buddies
- Their addresses differ by exactly one bit
- The address of a block of size 2^n differs from its buddy's address at bit n (sketch below)
[diagram: two addresses identical except at bit n]
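That property gives a constant-time way to locate a block's buddy. A one-line sketch, assuming addresses are aligned to the block size:

// Sketch: the buddy of a 2^n-sized block differs from it only at bit n,
// so flipping that bit with XOR locates the buddy in constant time.
public class BuddyAddress {
    static int buddyOf(int addr, int n) {
        return addr ^ (1 << n);             // flip bit n
    }

    public static void main(String[] args) {
        System.out.println(buddyOf(32, 5)); // 0:  addresses 32 and 0 are size-32 buddies
        System.out.println(buddyOf(48, 4)); // 32: addresses 48 and 32 are size-16 buddies
    }
}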
37. Knuth's Buddy System (6)
[diagram: free lists segregated by size, 1 through 256]
38. Knuth's Buddy System (5)
- When a block becomes free, it tries to rejoin its buddy
- A bit in its buddy tells whether the buddy is free
- If so, they glue together and make a block twice as big
[slides 39–42 animate the gluing back up to a single large block]
43. Two problems
- Oscillation: Buddy looks like it may split, glue, split, glue... isn't this wasted effort?
- Fragmentation: what happens when Buddy can't glue but has space it would like to combine?
44. Buddy: oscillation
[diagram: free lists by size; slides 44–49 animate a block being repeatedly glued and re-split]
50. Problem is lack of hysteresis
- Some programs allocate objects which are almost immediately deallocated
- Continuous, incremental approaches to garbage collection only make this worse!
- Oscillation is expensive: blocks are glued only to be quickly subdivided again
51. Estranged Buddy System
- Variant of Knuth's idea
- When deallocated, blocks are not eager to rejoin their buddies
- Evidence of value: [Kaufman, TOPLAS '84]
- Slight improvement on Spec benchmarks
- Algorithmic improvement over Kaufman
52. Buddy-Busy and Buddy-Free
[diagram: at level 2^k, two free lists: blocks whose buddies are busy, and blocks whose buddies are free]
53. Estranged Buddy: Allocation
- Allocation heuristic
- Buddy-busy
- Buddy-free
- Glue one level below, buddy-free
- Search up (Knuth)
- Glue below
54. How well does Estranged Buddy do? (contrived example) [chart]
55. Estranged Buddy on Spec [chart]
56. Recall two problems
- Oscillation: Buddy looks like it may split, glue, split, glue... isn't this wasted effort?
- Typically not, but can be
- Fragmentation: what happens when Buddy can't glue but has space it would like to combine?
57. Buddy System: Fragmentation
- Internal fragmentation from rounding up requests to powers of two
- Not really a concern these days
- Assume a program can run in maxlive bytes
- How much storage is needed so Buddy never has to defragment?
- What is a good algorithm for Buddy defragmentation?
58. Buddy Configurations
[diagram: buddy levels 1 through 8]
59. Buddy Configurations
60. Heap Full
61. Buddy can't allocate a size-2 block
62. How Big a Heap for Non-Blocking Buddy (M = maxlive)?
- Easy bound: M log M (M bytes at each level)
- Better bound: M × k, where k is the number of distinct sizes to be allocated
- Sounds like a good bound, but it isn't
- Defragmentation may be necessary
[diagram: the 1–256 size levels, with M bytes reserved at each level]
63. Managing object relocation
- Every object has a stable handle, whose address does not change
- Every handle points to its object's current location
- All references to objects are indirect, through a handle (sketch below)
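A small sketch of that indirection. The handle table below is illustrative, not the IRAM's actual layout: clients keep a handle index that never changes, while the location field is rewritten when the object moves.

import java.util.ArrayList;
import java.util.List;

// Sketch of handle-based indirection: callers keep a stable handle index;
// only the handle's "location" field changes when the object is relocated.
public class HandleTable {
    static class Handle {
        int location;                       // object's current address
        Handle(int location) { this.location = location; }
    }

    private final List<Handle> handles = new ArrayList<>();

    int newObject(int initialLocation) {    // returns a stable handle index
        handles.add(new Handle(initialLocation));
        return handles.size() - 1;
    }

    int locate(int handle) {                // every access is indirect
        return handles.get(handle).location;
    }

    void relocate(int handle, int newLocation) {  // e.g. during compaction
        handles.get(handle).location = newLocation;
    }

    public static void main(String[] args) {
        HandleTable table = new HandleTable();
        int h = table.newObject(1024);
        System.out.println(table.locate(h));      // 1024
        table.relocate(h, 64);                    // object moved by the IRAM
        System.out.println(table.locate(h));      // 64 -- same handle, new place
    }
}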
64. Buddy Defragmentation
- When stuck at level k:
- No blocks free above level k
- No glueable blocks free below level k
- Assume maxlive still suffices
- Example: k = 6, size-64 blocks not available
[diagram: levels 1 through 256]
65. Defragmentation Algorithm
[diagram: levels 16, 32, and 64; slides 65–68 animate a swap of a live block with a free one, then a glue, yielding a free 64-byte block]
69. Defragmentation Algorithm
- Recursively visit lower levels to develop two buddies that can be glued (sketched below)
- Analogous to the recursive allocation algorithm
- Still, choices to be made... studies underway
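The sketch below is a much-simplified software rendering of a single "swap, then glue" step at level k. The data structures, class name, and the two simplifying assumptions stated in the comments are mine, not the talk's, and the recursion to lower levels is elided.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of one "swap, then glue" defragmentation step at level k
// (block size 2^k), under two assumptions the real algorithm relaxes by
// recursing to lower levels: (1) at least two level-(k-1) blocks are free,
// and (2) the buddy of a free level-(k-1) block holds a single level-(k-1)
// object. Objects are reached through handles, so a move only rewrites the
// handle's recorded location.
public class DefragSketch {
    final Map<Integer, Deque<Integer>> freeLists = new HashMap<>(); // level -> free addresses
    final Map<Integer, Integer> occupant = new HashMap<>();         // address -> handle
    final Map<Integer, Integer> handleLoc = new HashMap<>();        // handle  -> address

    Deque<Integer> free(int level) {
        return freeLists.computeIfAbsent(level, x -> new ArrayDeque<>());
    }

    /** Produce one free block at level k by gluing two buddies at level k-1. */
    Integer defragment(int k) {
        Deque<Integer> below = free(k - 1);
        int size = 1 << (k - 1);
        for (int f : below) {                       // case 1: a glueable pair exists
            int b = f ^ size;
            if (below.contains(b)) return glue(k, f, b);
        }
        if (below.size() < 2) return null;          // would recurse to level k-2 here
        int f = below.peek();                       // case 2: swap, then glue
        int b = f ^ size;                           // f's buddy, assumed occupied
        int g = -1;
        for (int cand : below) if (cand != f) { g = cand; break; }
        if (g < 0) return null;
        Integer moved = occupant.remove(b);         // relocate b's occupant into g
        if (moved == null) return null;
        occupant.put(g, moved);
        handleLoc.put(moved, g);                    // callers only see the handle
        below.remove(g);
        below.push(b);                              // b is free now; f already was
        return glue(k, f, b);
    }

    private int glue(int k, int f, int b) {
        free(k - 1).remove(f);
        free(k - 1).remove(b);
        int glued = Math.min(f, b);
        free(k).push(glued);
        return glued;
    }

    public static void main(String[] args) {
        DefragSketch heap = new DefragSketch();
        // Size-16 blocks (level 4): 0 and 32 free, 16 and 48 hold objects 1 and 2.
        heap.free(4).push(0);
        heap.free(4).push(32);
        heap.occupant.put(16, 1); heap.handleLoc.put(1, 16);
        heap.occupant.put(48, 2); heap.handleLoc.put(2, 48);
        System.out.println(heap.defragment(5));     // 32: a free size-32 block now spans 32-63
        System.out.println(heap.handleLoc.get(2));  // 0: object 2 was relocated, handle updated
    }
}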
70. Need 4 bytes
- Move 3 bytes?
- Move 1 byte?
71. Recall two problems
- Oscillation: Buddy looks like it may split, glue, split, glue... isn't this wasted effort?
- Typically not, but can be
- Fragmentation: what happens when Buddy can't glue but has space it would like to combine?
- New algorithm to defragment Buddy
- Selective approach: should beat List
- Optimizations needed
72. Towards an IRAM implementation
- VHDL of the Buddy System is complete
- DRAM clocked at 150 MHz
- 10 cycles per DRAM access
- Need 7 accesses per level to split blocks
- For a 16-Mbyte heap: 24 levels
- 1680 cycles worst case, about 11 µs
- 168× slower than a read
- Can we do better?
73. Two tricks
- Find a suitable free block quickly
- Return its address quickly
74. Finding a suitable free block
- No space at level 16, but level 16 points to the level above it that has a block to offer
75. Finding a suitable free block
- Every level points to the level above it that has a block to offer
- Pointers are maintained using Tarjan's path compression (sketch below)
- The locating pointers are not stored in DRAM
[diagram: levels 1 through 256 with locator pointers]
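The level-locator idea can be sketched as a tiny find structure over the levels themselves; this is an illustrative software analogue of pointers the talk keeps in IRAM logic, not the hardware design.

// Sketch of the free-block locator: every empty level points (directly or
// transitively) at a level above it that has a free block; find() compresses
// paths so repeated lookups stay nearly constant time. After compression a
// lookup may skip a level that has since refilled; the allocator can still
// split the larger block it is handed.
public class LevelLocator {
    private final int[] next;          // next[i] = i if level i has a free block,
                                       // else a level above i to try instead
    LevelLocator(int levels) {
        next = new int[levels + 2];    // index levels+1 is the "out of memory" sentinel
        for (int i = 0; i <= levels + 1; i++) next[i] = i;
    }

    /** Lowest level >= wanted that has a free block (path-compressing). */
    int find(int wanted) {
        if (next[wanted] != wanted) next[wanted] = find(next[wanted]);
        return next[wanted];
    }

    /** Level i just ran out of free blocks: route its lookups upward. */
    void markEmpty(int i) { next[i] = i + 1; }

    /** Level i just gained a free block: lookups for i stop here again. */
    void markNonEmpty(int i) { next[i] = i; }

    public static void main(String[] args) {
        LevelLocator loc = new LevelLocator(8);      // levels 0..8 (sizes 1..256)
        loc.markEmpty(4);                            // no size-16 blocks
        loc.markEmpty(5);                            // no size-32 blocks
        System.out.println(loc.find(4));             // 6: nearest level with a block
        loc.markEmpty(6);
        System.out.println(loc.find(4));             // 7, found via compressed hops
    }
}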
76. Alternative free-block finder
- Path compression may be too complex for hardware
- Instead, track the largest available free block
77. Alternative free-block finder
- Path compression may be too complex for hardware
- Instead, track the largest available free block
- Tends to break up large blocks and favor formation of small ones
[diagram: levels 1 through 256]
78. Fast return for malloc
- Want 16 bytes
- Zip to the 64 display
- WLOG we return the first part of that block immediately to the requestor
[diagram: levels 16, 32, and 64]
79. Fast return for malloc
- Want 16 bytes
- Zip to the 64 display
- WLOG we return the first part of that block immediately to the requestor
- Adjustment to the structures happens in parallel with the return
80. Improved IRAM allocator
- 10 cycles fast return
- 1000 cycles to recover, worst case
- Is this good enough?
- Compare a software implementation
- 1000 cycles worst case
- 600 cycles average on Spec benchmarks
- Hardware can be much faster
- Depends on recover time
81. Do programs allow us to recover?
- Run of jack: JVM instructions between requests
- 56% of requests are separated by at least 100 JVM instructions
- Assume 10× expansion, JVM to native code
- For those 56%, we return in 10 cycles
- Code motion might improve the others
JVM instructions between requests: min 3, median 181, max 174053
82. Garbage Collection
- While allocators are needed for most modern languages, garbage collection is not universally accepted
- Generational and incremental approaches help most applications
- Embedded and real-time systems need assurances of bounded behavior
83. Why not garbage collect?
- Some programmers want ultimate control over storage
- Real-time applications need bounded-time overhead
- The RT Java spec relegates allocation and collection to user control
- Isn't this a step back from Java?
84. Marking Phase: the problem
- To discover the dead objects, we use calculatus eliminatus
- Find live objects
- All others are dead
85. Marking Phase: the problem
- To discover the dead objects, we find live objects
- Pointers from the stack to the heap make objects live
[diagram: stack cells pointing into the heap]
86. Marking Phase: the problem
- Pointers from the stack to the heap make objects live
- These objects make other objects live
87. Marking Phase: the problem
- Find live objects
- Sweep all others away as dead
88. Marking Phase: the problem
- Find live objects
- Sweep all others away as dead
- Perhaps compact the heap
89. Problems with mark phase
- Takes an unbounded amount of time
- Can limit it using generational collection, but then it's not clear what will get collected
- We seek an approach that spends a constant amount of time per program operation and collects objects continuously
90. Two Approaches
- Variation on reference counting
- Contaminated garbage collection [PLDI '00]
91. Reference Counting
- An integer is associated with every object, summing
- Stack references
- Heap references
- Objects with a reference count of zero are dead
[diagram: stack and heap objects annotated with reference counts]
92. Problems with Reference Counting
- The standard problem is that objects in cycles (and those touched by such objects) cannot be collected
[diagram: heap objects forming a reference cycle; slides 92–93 build this figure]
94. Problems with Reference Counting
- Contaminated GC will collect such objects
- The overhead of counting can be high
- The untyped stack complicates things
95. The Untyped Stack
- The stack is a collection of untyped cells
- In the JVM, safety is verified at class-load time
- No need to tag stack locations with what they contain
- Leads to imprecision in all GC methods
[diagram: a stack cell holding a value that may or may not be a heap address]
96. Idea
- When a stack frame pops, all of its cells are dead
- Don't worry about tracking cell pointers
- Instead, associate an object with the last stack frame that can reference the object
97. Reference Counting Approach
- s is zero or one, indicating no stack reference or at least one stack reference to the object
- h precisely reflects the number of heap references to the object
- If s + h = 0, the object is dead (sketch below)
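A sketch of the per-object counts as they might look in software; the field and class names are illustrative.

// Sketch of the per-object counts: s is a single bit recording whether any
// stack frame may still reference the object, h counts heap references
// exactly. The object is reclaimable once both are zero.
public class CountedObject {
    boolean s;      // stack bit: false = no frame can reference this object
    int h;          // exact number of heap references

    boolean isDead() {
        return !s && h == 0;     // s + h = 0
    }

    public static void main(String[] args) {
        CountedObject o = new CountedObject();
        o.s = true;              // created by (and associated with) some frame
        o.h = 1;                 // one heap reference installed via putfield
        o.h--;                   // that heap reference is overwritten
        o.s = false;             // the owning frame pops
        System.out.println(o.isDead());   // true -- collect it
    }
}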
98. Our treatment of stack activity
- An object is associated with the last-to-be-popped frame that can reference it
[diagram: stack frames]
99. Our treatment of stack activity
- When that frame pops:
- If the object is returned, the receiving frame owns the object
100. Our treatment of stack activity
- When that frame pops:
- Otherwise the object is dead
101. Our reference-counting implementation
- The objects associated with a frame are linked together
[diagram: stack frames, heap objects with counts, and a per-frame list of objects; slides 101–113 animate a frame pop]
102. Our reference-counting implementation
- When a stack frame pops, all of its cells no longer can point at anything
103. Our reference-counting implementation
- This object is unlinked, but still thought to be live
104. Our reference-counting implementation
- This object is dead and is collected
105. Our reference-counting implementation
- This object is also dead
106. Our reference-counting implementation
- This object is still live
107. Our reference-counting implementation
- Now the frame is gone
108. Our reference-counting implementation
- This object was linked to its frame all along
109. Our reference-counting implementation
- When its heap count becomes zero, the object is scheduled for deletion in that frame
110. Our reference-counting implementation
- When the frame pops, all are dead
[slides 111–113 complete the animation]
114. Reference Counting
- Predictable, constant overhead for each JVM instruction
- putfield decreases the count at the old pointed-to object and increases the count at the new pointed-to object
- areturn associates the object with a stack frame if it is not already associated below (sketch follows)
- How well does it do? We shall see!
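A sketch of the two hooks just described. Obj, Frame, and the method names are illustrative stand-ins for the modified JDK 1.1 interpreter, not its actual code.

import java.util.ArrayList;
import java.util.List;

// Sketch of the two constant-work hooks: putfield adjusts exact heap counts
// on the old and new targets; areturn re-associates the returned object with
// the caller's frame unless a frame even lower on the stack already owns it.
public class RefCountHooks {
    static class Obj { boolean s; int h; Frame owner; }
    static class Frame {
        final int depth;                              // 0 = bottom of the stack
        final List<Obj> owned = new ArrayList<>();    // objects that die with this frame
        Frame(int depth) { this.depth = depth; }
    }

    /** x.f = newTarget, where oldTarget was the previous value of x.f. */
    void putfield(Obj oldTarget, Obj newTarget) {
        if (oldTarget != null) oldTarget.h--;         // one fewer heap reference
        if (newTarget != null) newTarget.h++;         // one more heap reference
    }

    /** The current frame returns obj to its caller. */
    void areturn(Obj obj, Frame caller) {
        if (obj.owner == null || obj.owner.depth > caller.depth) {
            if (obj.owner != null) obj.owner.owned.remove(obj);
            obj.owner = caller;                       // the receiving frame owns it now
            caller.owned.add(obj);
            obj.s = true;
        }
    }

    /** When a frame pops, the objects it owns lose their stack bit. */
    void popFrame(Frame f) {
        for (Obj o : f.owned) {
            o.s = false;
            if (o.h == 0) System.out.println("collect " + o);   // s + h = 0: dead
        }
    }

    public static void main(String[] args) {
        RefCountHooks vm = new RefCountHooks();
        Frame outer = new Frame(0), callee = new Frame(1);
        Obj o = new Obj();
        o.owner = callee; callee.owned.add(o); o.s = true;   // allocated in callee
        vm.areturn(o, outer);        // returned to the caller: outer now owns o
        vm.popFrame(callee);         // o survives, because it moved to outer
        vm.popFrame(outer);          // prints "collect ...": s and h are both zero
    }
}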
115. Contaminated Garbage Collection
- Need to collect objects involved in reference cycles without resorting to marking live objects
- Idea: associate each object with a stack frame such that when that frame returns, the object is known to be dead
- Like escape analysis, but dynamic
116. Contaminated garbage collection
- Initially each object is associated with the frame in which it is instantiated
[diagram: stack with objects A–E, each tied to a frame; slides 116–121 animate contamination]
117. Contaminated garbage collection
- When B references A, A becomes as live as B
118. Contaminated garbage collection
- Now A, B, and C are as live as C
119. Contaminated garbage collection
- Even though D is less live than C, it gets contaminated
- Should something reference D later, all will be affected
120. Contaminated garbage collection
- Static finger of life
- Now all objects appear to live forever
121. Contaminated garbage collection
- Even if E points away!
122. Contaminated garbage collection
- Every object is a member of an equilive set
- All objects in a set are scheduled for deallocation at the same time
- Sets are maintained using Tarjan's disjoint union/find algorithm (sketch below)
- Nearly constant amount of overhead per operation
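A sketch of the union/find bookkeeping. The talk uses Tarjan's disjoint union/find; the class below is a plain path-compression version with an illustrative frame depth attached to each set representative.

// Sketch of equilive sets kept with union/find. Each set carries the depth of
// the stack frame it is associated with; unioning two sets ("contamination")
// keeps the shallower frame, i.e. the one that pops later, so every member now
// dies no earlier than that frame's return.
public class EquiliveSets {
    private final int[] parent;
    private final int[] frameDepth;   // meaningful only at set representatives

    EquiliveSets(int nObjects, int[] allocationFrameDepth) {
        parent = new int[nObjects];
        frameDepth = new int[nObjects];
        for (int i = 0; i < nObjects; i++) {
            parent[i] = i;                               // each object starts alone
            frameDepth[i] = allocationFrameDepth[i];     // frame that allocated it
        }
    }

    int find(int x) {                                    // with path compression
        if (parent[x] != x) parent[x] = find(parent[x]);
        return parent[x];
    }

    /** Object a references object b (or vice versa): contaminate. */
    void union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        parent[rb] = ra;
        frameDepth[ra] = Math.min(frameDepth[ra], frameDepth[rb]);  // longer-lived frame wins
    }

    /** When the frame at this depth pops, is this object now dead? */
    boolean deadWhenFramePops(int obj, int depth) {
        return frameDepth[find(obj)] >= depth;
    }

    public static void main(String[] args) {
        // Objects 0..2 allocated in frames at depths 1, 2, 3 respectively.
        EquiliveSets sets = new EquiliveSets(3, new int[] {1, 2, 3});
        sets.union(2, 1);      // object in frame 3 references object in frame 2
        System.out.println(sets.deadWhenFramePops(2, 3));  // false: its set is tied to depth 2
        System.out.println(sets.deadWhenFramePops(2, 2));  // true: dies when the depth-2 frame pops
    }
}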
123. Contaminated GC
- Each equilive set is associated with a frame
[diagram: stack frames with equilive sets; slides 123–128 animate a union and a frame pop]
124. Contaminated GC
- Suppose an object in one set references an object in another set (in either direction)
125. Contaminated GC
- Contamination! The sets are unioned
126. Contaminated GC
- When a frame pops, the objects associated with it are dead
[slides 127–128 complete the animation]
129. Summary of methods
- Reference counting
- Can't handle cycles
- Handles pointing at and then away
- Contaminated GC
- Tolerates cycles
- Can't track pointing and then pointing away
- Both techniques
- Incur cost at putfield, areturn
- (Nearly) constant overhead per operation
130. Implementation details
- Sun JDK 1.1, interpreter version
- Many subtle places where references are generated: String.intern(), the ldc instruction, the class loader, JNI
- Each GC method took about 3 months to implement
- Can run either method or both in concert
- Fairly optimized; more is possible
131. Spec benchmark effectiveness [chart]
132. Spec benchmark effectiveness [chart]
133. Exactness of Equilive Sets [chart]
134. Distance to die, in frames [chart]
135. Speed of CGC [chart]
136. Speedups of Mark-Free Approaches [chart]
137. Future Plans
- VHDL simulation of a more efficient buddy allocator
- VHDL simulation of the garbage collection methods
- Better buddy defragmentation
- Experiment with informed allocation
- Comparison/integration with other IRAM-based methods (with Krishna Kavi)
138. Informed Storage Management
- Evidence that programs allocate many objects of the same size
139. Benchmark jack: 20% fragmentation [chart]
140. Benchmark raytrace: 12% fragmentation [chart]
141. Benchmark compress: 34% fragmentation [chart]
142. Informed Storage Management
- Evidence that programs allocate many objects of the same size
143. Informed Storage Management
- Not surprising, in Java: same type ⇒ same size
144. Informed Storage Management
- In C and C++, programmers brew their own allocators to take advantage of this
- What can we do automatically?
145. Informed Storage Management
- Capture the program's malloc requests by phase (a sketch follows)
- Generate a .class file and put it in the CLASSPATH
- Load the .class file and inform the allocator
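One way to picture the capture step. The profiler class, the phase length, and the plain-text output are illustrative assumptions; the talk instead generates a .class file that the allocator loads.

import java.util.Map;
import java.util.TreeMap;

// Sketch of the capture step: record a histogram of requested sizes per
// allocation phase, where a "phase" is a fixed number of allocations rather
// than a time interval (so the profile is portable across machines).
public class AllocationProfiler {
    static final int ALLOCATIONS_PER_PHASE = 10_000;   // illustrative phase length

    private long count = 0;
    private Map<Integer, Integer> histogram = new TreeMap<>();   // size -> frequency

    /** Called on every malloc/new with the rounded-up request size. */
    void record(int size) {
        histogram.merge(size, 1, Integer::sum);
        if (++count % ALLOCATIONS_PER_PHASE == 0) {
            emitPhase(count / ALLOCATIONS_PER_PHASE, histogram);
            histogram = new TreeMap<>();                // start the next phase
        }
    }

    /** Stand-in for generating the .class file the allocator will load. */
    void emitPhase(long phase, Map<Integer, Integer> sizes) {
        System.out.println("phase " + phase + ": " + sizes);
    }

    public static void main(String[] args) {
        AllocationProfiler profiler = new AllocationProfiler();
        for (int i = 0; i < 25_000; i++) {
            profiler.record(i % 3 == 0 ? 16 : 32);      // fake workload
        }
        // Phases 1 and 2 are printed; the allocator could pre-split blocks
        // of size 16 and 32 before each phase begins.
    }
}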
146. Different phases ⇒ different distributions
[chart: raytrace]
147. How long is a phase?
- Phases 1–3 are common to all programs: JVM startup
- Phases are keyed to allocations, not time, for portability
148. Questions?