Title: Getting Real, Getting Dirty (without getting real dirty)
1. Getting Real, Getting Dirty (without getting real dirty)
Ron K. Cytron
Joint work with Krishna Kavi, University of Alabama at Huntsville
Dante Cannarozzi, Sharath Cholleti, Morgan Deters, Steve Donahue, Mark Franklin, Matt Hampton, Michael Henrichs, Nicholas Leidenfrost, Jonathan Nye, Michael Plezbert, Conrad Warmbold
Center for Distributed Object Computing, Department of Computer Science, Washington University
April 2001
Funded by the National Science Foundation under grant 0081214 and by DARPA under contract F33615-00-C-1697
2. Outline
- Motivation
- Allocation
- Collection
- Conclusion
3. Traditional architecture and object-oriented programs
- Caches are still biased toward Fortran-like behavior
- CPU is still responsible for storage management
- Object-management activity invalidates caches
- GC is disruptive
- Compaction
4. An OO-biased design using IRAMs (with Krishna Kavi)
- CPU and cache stay the same, off-the-shelf
- Memory system redesigned to support OO programs
[diagram: CPU, L1 cache, and L2 cache connected to an IRAM consisting of logic plus DRAM banks]
5. IRAM interface
- A stable address for an object allows better cache behavior
- An object can be relocated within the IRAM, but its address as seen by the CPU is constant
[diagram: CPU issues malloc to the IRAM logic, which returns an address]
6. IRAM interface
- Object referencing is tracked inside the IRAM, which supports garbage collection
[diagram: CPU issues putfield/getfield to the IRAM logic, which returns the value]
7. IRAM interface
- Goal: relegate storage-management functions to the IRAM
[diagram: gc, compact, and prefetch handled inside the IRAM logic]
8. Macro accesses
Observation: code sequences contain common gestures (superoperators), e.g.
p.getLeft().getNext() → *(*(p+12)+32)
[diagram: CPU and IRAM]
9. Gesture abstraction
Goal: decrease traffic between CPU and storage
M143(x) ≡ *(*(x+12)+32)
p.getLeft().getNext() → *(*(p+12)+32)
(a toy sketch of the macro idea follows)
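To make the gesture concrete, here is a toy software model. The flat int-array "memory", the use of 12 and 32 as element offsets, and the class and method names are illustrative assumptions, not the talk's actual IRAM encoding; the point is only that one macro request replaces two CPU-to-memory round trips.

// Toy sketch (not the talk's actual encoding): memory is an int array of
// word addresses, and macro 143 captures the two-dereference gesture
// p.getLeft().getNext(), i.e. *(*(p+12)+32), so the CPU can ship a single
// "Macro 143(p)" request to the IRAM instead of two round trips.
public class GestureSketch {
    static final int[] mem = new int[1024];          // simulated IRAM storage

    // The gesture as the CPU would otherwise perform it: two loads.
    static int getLeftThenNext(int p) {
        int left = mem[p + 12];                       // p.getLeft()
        return mem[left + 32];                        // .getNext()
    }

    // The same gesture registered as macro 143 inside the IRAM logic.
    static int macro143(int p) {
        return mem[mem[p + 12] + 32];
    }

    public static void main(String[] args) {
        mem[100 + 12] = 200;                          // p.left lives at address 200
        mem[200 + 32] = 7;                            // left.next is 7
        System.out.println(getLeftThenNext(100));     // 7, two CPU-memory trips
        System.out.println(macro143(100));            // 7, one macro request
    }
}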
10. Gesture application
M143(x) ≡ *(*(x+12)+32)
p.getLeft().getNext()
[diagram: the CPU issues Macro 143(p) to the IRAM]
11. Gesture application
M143(x) ≡ *(*(x+12)+32)
p.getLeft().getNext()
[diagram: the IRAM logic evaluates Macro 143(p) and returns the value of p.getLeft().getNext()]
12. Automatic prefetching
Goal: decrease traffic between CPU and storage
[diagram: the CPU fetches p through the L2 cache; the IRAM logic observes the fetch]
13. Automatic prefetching
Goal: decrease traffic between CPU and storage
[diagram: the IRAM logic prefetches p.getLeft().getNext() toward the CPU cache and L2 cache]
14. Challenges
- Algorithmic
- Bounded-time methods for allocation and collection
- Good average performance as well
- Architectural
- Lean interface between the CPU and IRAM
- Efficient realization
15. Storage Allocation (Real Time)
- Not necessarily fast
- Necessarily predictable
- Able to satisfy any reasonable request
- Developer should know the maxlive characteristics of the application
- This is true for non-embedded systems as well
16. How much storage?
- curlive: the number of objects live at a point in time
- curspace: the number of bytes live at a point in time
[diagram: handles pointing into object space]
17. Objects concurrently live [chart]
18. How much object space? [chart]
19. Storage Allocation: Free List
- Linked list of free blocks
- Worst case O(n) for n blocks in the list
20. Worst-case free-list behavior
- The longer the free list, the more pronounced the effect
- No a priori bound on how much worse the list-based scheme could get
- Average performance similar
21. Knuth's Buddy System
- Free lists segregated by size
- All requests rounded up to a power of 2
22. Knuth's Buddy System (1)
- Begin with one large block
- Suppose we want a block of size 16
[diagram: free lists segregated by size, 1 through 256; slides 22–26 animate the repeated splitting of the large block]
27. Knuth's Buddy System (6)
- One of those blocks can be given to the program
28. Worst-case free-list behavior
- The longer the free list, the more pronounced the effect
- No a priori bound on how much worse the list-based scheme could get
- Average performance similar
29. Spec Benchmark Results [chart]
30. Buddy System
- If a block can be found, it can be found in O(log N) time, where N is the size of the heap (a software sketch follows)
- The application cannot make that worse
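A minimal software sketch of the splitting logic behind that bound. The free lists here are ordinary Java deques of block addresses; headers, alignment, and freeing are omitted, and the class is an illustration rather than the VHDL design described later.

import java.util.ArrayDeque;
import java.util.Deque;

// Minimal buddy-allocation sketch: one free list per power-of-two size.
// An allocation either finds a block at the requested level or walks up
// at most log2(heapSize) levels, splitting on the way back down -- which
// is where the O(log N) worst case comes from.
public class BuddySketch {
    final int levels;                       // level i holds blocks of size 2^i
    final Deque<Integer>[] freeLists;       // addresses of free blocks per level

    @SuppressWarnings("unchecked")
    BuddySketch(int heapSize) {             // heapSize must be a power of two
        levels = Integer.numberOfTrailingZeros(heapSize) + 1;
        freeLists = new Deque[levels];
        for (int i = 0; i < levels; i++) freeLists[i] = new ArrayDeque<>();
        freeLists[levels - 1].push(0);      // one big block at address 0
    }

    /** Returns the address of a free 2^level block, or -1 if none exists. */
    int allocate(int level) {
        int i = level;
        while (i < levels && freeLists[i].isEmpty()) i++;   // search upward
        if (i == levels) return -1;                         // heap exhausted
        int addr = freeLists[i].pop();
        while (i > level) {                                 // split back down
            i--;
            freeLists[i].push(addr + (1 << i));             // buddy stays free
        }
        return addr;
    }

    public static void main(String[] args) {
        BuddySketch heap = new BuddySketch(256);
        System.out.println(heap.allocate(4));   // 16-byte block at address 0
        System.out.println(heap.allocate(4));   // its buddy, at address 16
    }
}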
31. Defragmentation
- To keep up with the diversity of requested block sizes, an allocator may have to reorganize smaller blocks into larger ones
32. Defragmentation: Free List
- Free list permutes adjacent blocks
- Storage becomes fragmented, with many small blocks and no large ones
[diagram: the free list superimposed on blocks in memory; slides 32–35 share this figure]
33. Defragmentation: Free List
- Free list permutes adjacent blocks
- Two issues
34. Defragmentation: Free List
- Two issues
- Reorganize holes (move live storage)
35. Defragmentation: Free List
- Two issues
- Organization by address can help [Kavi]
36. Buddies: joining adjacent blocks
- The blocks resulting from subdivision are viewed as buddies
- Their addresses differ by exactly one bit
- The address of a block of size 2^n differs from its buddy's address at bit n (sketch below)
[diagram: two addresses identical except at bit n]
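That property gives a constant-time way to locate a block's buddy. A one-line sketch, assuming addresses are aligned to the block size:

// Sketch: the buddy of a 2^n-sized block differs from it only at bit n,
// so flipping that bit with XOR locates the buddy in constant time.
public class BuddyAddress {
    static int buddyOf(int addr, int n) {
        return addr ^ (1 << n);             // flip bit n
    }

    public static void main(String[] args) {
        System.out.println(buddyOf(32, 5)); // 0:  addresses 32 and 0 are size-32 buddies
        System.out.println(buddyOf(48, 4)); // 32: addresses 48 and 32 are size-16 buddies
    }
}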
37. Knuth's Buddy System (6)
[diagram: free lists segregated by size, 1 through 256]
38. Knuth's Buddy System (5)
- When a block becomes free, it tries to rejoin its buddy
- A bit in its buddy tells whether the buddy is free
- If so, they glue together and make a block twice as big
[slides 39–42 animate the gluing back up to a single large block]
43. Two problems
- Oscillation: Buddy looks like it may split, glue, split, glue... isn't this wasted effort?
- Fragmentation: what happens when Buddy can't glue but has space it would like to combine?
44. Buddy: oscillation
[diagram: free lists by size; slides 44–49 animate a block being repeatedly glued and re-split]
50. Problem is lack of hysteresis
- Some programs allocate objects which are almost immediately deallocated
- Continuous, incremental approaches to garbage collection only make this worse!
- Oscillation is expensive: blocks are glued only to be quickly subdivided again
51. Estranged Buddy System
- Variant of Knuth's idea
- When deallocated, blocks are not eager to rejoin their buddies
- Evidence of value: [Kaufman, TOPLAS '84]
- Slight improvement on Spec benchmarks
- Algorithmic improvement over Kaufman
52. Buddy-Busy and Buddy-Free
[diagram: at level 2^k, two free lists: blocks whose buddies are busy, and blocks whose buddies are free]
53. Estranged Buddy: Allocation
- Allocation heuristic
- Buddy-busy
- Buddy-free
- Glue one level below, buddy-free
- Search up (Knuth)
- Glue below
54. How well does Estranged Buddy do? (contrived example) [chart]
55. Estranged Buddy on Spec [chart]
56. Recall two problems
- Oscillation: Buddy looks like it may split, glue, split, glue... isn't this wasted effort?
- Typically not, but can be
- Fragmentation: what happens when Buddy can't glue but has space it would like to combine?
57. Buddy System: Fragmentation
- Internal fragmentation from rounding up requests to powers of two
- Not really a concern these days
- Assume a program can run in maxlive bytes
- How much storage is needed so Buddy never has to defragment?
- What is a good algorithm for Buddy defragmentation?
58. Buddy Configurations
[diagram: buddy levels 1 through 8]
59. Buddy Configurations
60. Heap Full
61. Buddy can't allocate a size-2 block
62. How Big a Heap for Non-Blocking Buddy (M = maxlive)?
- Easy bound: M log M (M bytes at each level)
- Better bound: M × k, where k is the number of distinct sizes to be allocated
- Sounds like a good bound, but it isn't
- Defragmentation may be necessary
[diagram: the 1–256 size levels, with M bytes reserved at each level]
63. Managing object relocation
- Every object has a stable handle, whose address does not change
- Every handle points to its object's current location
- All references to objects are indirect, through a handle (sketch below)
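A small sketch of that indirection. The handle table below is illustrative, not the IRAM's actual layout: clients keep a handle index that never changes, while the location field is rewritten when the object moves.

import java.util.ArrayList;
import java.util.List;

// Sketch of handle-based indirection: callers keep a stable handle index;
// only the handle's "location" field changes when the object is relocated.
public class HandleTable {
    static class Handle {
        int location;                       // object's current address
        Handle(int location) { this.location = location; }
    }

    private final List<Handle> handles = new ArrayList<>();

    int newObject(int initialLocation) {    // returns a stable handle index
        handles.add(new Handle(initialLocation));
        return handles.size() - 1;
    }

    int locate(int handle) {                // every access is indirect
        return handles.get(handle).location;
    }

    void relocate(int handle, int newLocation) {  // e.g. during compaction
        handles.get(handle).location = newLocation;
    }

    public static void main(String[] args) {
        HandleTable table = new HandleTable();
        int h = table.newObject(1024);
        System.out.println(table.locate(h));      // 1024
        table.relocate(h, 64);                    // object moved by the IRAM
        System.out.println(table.locate(h));      // 64 -- same handle, new place
    }
}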
64. Buddy Defragmentation
- When stuck at level k:
- No blocks free above level k
- No glueable blocks free below level k
- Assume maxlive still suffices
- Example: k = 6, size-64 blocks not available
[diagram: levels 1 through 256]
65. Defragmentation Algorithm
[diagram: levels 16, 32, and 64; slides 65–68 animate a swap of a live block with a free one, then a glue, yielding a free 64-byte block]
69. Defragmentation Algorithm
- Recursively visit lower levels to develop two buddies that can be glued (sketched below)
- Analogous to the recursive allocation algorithm
- Still, choices to be made... studies underway
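The sketch below is a much-simplified software rendering of a single "swap, then glue" step at level k. The data structures, class name, and the two simplifying assumptions stated in the comments are mine, not the talk's, and the recursion to lower levels is elided.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of one "swap, then glue" defragmentation step at level k
// (block size 2^k), under two assumptions the real algorithm relaxes by
// recursing to lower levels: (1) at least two level-(k-1) blocks are free,
// and (2) the buddy of a free level-(k-1) block holds a single level-(k-1)
// object. Objects are reached through handles, so a move only rewrites the
// handle's recorded location.
public class DefragSketch {
    final Map<Integer, Deque<Integer>> freeLists = new HashMap<>(); // level -> free addresses
    final Map<Integer, Integer> occupant = new HashMap<>();         // address -> handle
    final Map<Integer, Integer> handleLoc = new HashMap<>();        // handle  -> address

    Deque<Integer> free(int level) {
        return freeLists.computeIfAbsent(level, x -> new ArrayDeque<>());
    }

    /** Produce one free block at level k by gluing two buddies at level k-1. */
    Integer defragment(int k) {
        Deque<Integer> below = free(k - 1);
        int size = 1 << (k - 1);
        for (int f : below) {                       // case 1: a glueable pair exists
            int b = f ^ size;
            if (below.contains(b)) return glue(k, f, b);
        }
        if (below.size() < 2) return null;          // would recurse to level k-2 here
        int f = below.peek();                       // case 2: swap, then glue
        int b = f ^ size;                           // f's buddy, assumed occupied
        int g = -1;
        for (int cand : below) if (cand != f) { g = cand; break; }
        if (g < 0) return null;
        Integer moved = occupant.remove(b);         // relocate b's occupant into g
        if (moved == null) return null;
        occupant.put(g, moved);
        handleLoc.put(moved, g);                    // callers only see the handle
        below.remove(g);
        below.push(b);                              // b is free now; f already was
        return glue(k, f, b);
    }

    private int glue(int k, int f, int b) {
        free(k - 1).remove(f);
        free(k - 1).remove(b);
        int glued = Math.min(f, b);
        free(k).push(glued);
        return glued;
    }

    public static void main(String[] args) {
        DefragSketch heap = new DefragSketch();
        // Size-16 blocks (level 4): 0 and 32 free, 16 and 48 hold objects 1 and 2.
        heap.free(4).push(0);
        heap.free(4).push(32);
        heap.occupant.put(16, 1); heap.handleLoc.put(1, 16);
        heap.occupant.put(48, 2); heap.handleLoc.put(2, 48);
        System.out.println(heap.defragment(5));     // 32: a free size-32 block now spans 32-63
        System.out.println(heap.handleLoc.get(2));  // 0: object 2 was relocated, handle updated
    }
}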
70. Need 4 bytes
- Move 3 bytes?
- Move 1 byte?
71. Recall two problems
- Oscillation: Buddy looks like it may split, glue, split, glue... isn't this wasted effort?
- Typically not, but can be
- Fragmentation: what happens when Buddy can't glue but has space it would like to combine?
- New algorithm to defragment Buddy
- Selective approach: should beat List
- Optimizations needed
72. Towards an IRAM implementation
- VHDL of the Buddy System is complete
- DRAM clocked at 150 MHz
- 10 cycles per DRAM access
- Need 7 accesses per level to split blocks
- For a 16-Mbyte heap: 24 levels
- 1680 cycles worst case, about 11 µs
- 168× slower than a read
- Can we do better?
73. Two tricks
- Find a suitable free block quickly
- Return its address quickly
74. Finding a suitable free block
- No space at level 16, but level 16 points to the level above it that has a block to offer
75. Finding a suitable free block
- Every level points to the level above it that has a block to offer
- Pointers are maintained using Tarjan's path compression (sketch below)
- The locating pointers are not stored in DRAM
[diagram: levels 1 through 256 with locator pointers]
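The level-locator idea can be sketched as a tiny find structure over the levels themselves; this is an illustrative software analogue of pointers the talk keeps in IRAM logic, not the hardware design.

// Sketch of the free-block locator: every empty level points (directly or
// transitively) at a level above it that has a free block; find() compresses
// paths so repeated lookups stay nearly constant time. After compression a
// lookup may skip a level that has since refilled; the allocator can still
// split the larger block it is handed.
public class LevelLocator {
    private final int[] next;          // next[i] = i if level i has a free block,
                                       // else a level above i to try instead
    LevelLocator(int levels) {
        next = new int[levels + 2];    // index levels+1 is the "out of memory" sentinel
        for (int i = 0; i <= levels + 1; i++) next[i] = i;
    }

    /** Lowest level >= wanted that has a free block (path-compressing). */
    int find(int wanted) {
        if (next[wanted] != wanted) next[wanted] = find(next[wanted]);
        return next[wanted];
    }

    /** Level i just ran out of free blocks: route its lookups upward. */
    void markEmpty(int i) { next[i] = i + 1; }

    /** Level i just gained a free block: lookups for i stop here again. */
    void markNonEmpty(int i) { next[i] = i; }

    public static void main(String[] args) {
        LevelLocator loc = new LevelLocator(8);      // levels 0..8 (sizes 1..256)
        loc.markEmpty(4);                            // no size-16 blocks
        loc.markEmpty(5);                            // no size-32 blocks
        System.out.println(loc.find(4));             // 6: nearest level with a block
        loc.markEmpty(6);
        System.out.println(loc.find(4));             // 7, found via compressed hops
    }
}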
76. Alternative free-block finder
- Path compression may be too complex for hardware
- Instead, track the largest available free block
77. Alternative free-block finder
- Path compression may be too complex for hardware
- Instead, track the largest available free block
- Tends to break up large blocks and favor formation of small ones
[diagram: levels 1 through 256]
78. Fast return for malloc
- Want 16 bytes
- Zip to the 64 display
- WLOG we return the first part of that block immediately to the requestor
[diagram: levels 16, 32, and 64]
79. Fast return for malloc
- Want 16 bytes
- Zip to the 64 display
- WLOG we return the first part of that block immediately to the requestor
- Adjustment to the structures happens in parallel with the return
80. Improved IRAM allocator
- 10 cycles fast return
- 1000 cycles to recover, worst case
- Is this good enough?
- Compare a software implementation
- 1000 cycles worst case
- 600 cycles average on Spec benchmarks
- Hardware can be much faster
- Depends on recover time
81. Do programs allow us to recover?
- Run of jack: JVM instructions between requests
- 56% of requests are separated by at least 100 JVM instructions
- Assume 10× expansion, JVM to native code
- For those 56%, we return in 10 cycles
- Code motion might improve the others
JVM instructions between requests: min 3, median 181, max 174053
82. Garbage Collection
- While allocators are needed for most modern languages, garbage collection is not universally accepted
- Generational and incremental approaches help most applications
- Embedded and real-time systems need assurances of bounded behavior
83. Why not garbage collect?
- Some programmers want ultimate control over storage
- Real-time applications need bounded-time overhead
- The RT Java spec relegates allocation and collection to user control
- Isn't this a step back from Java?
84. Marking Phase: the problem
- To discover the dead objects, we use calculatus eliminatus
- Find live objects
- All others are dead
85. Marking Phase: the problem
- To discover the dead objects, we find live objects
- Pointers from the stack to the heap make objects live
[diagram: stack cells pointing into the heap]
86. Marking Phase: the problem
- Pointers from the stack to the heap make objects live
- These objects make other objects live
87. Marking Phase: the problem
- Find live objects
- Sweep all others away as dead
88. Marking Phase: the problem
- Find live objects
- Sweep all others away as dead
- Perhaps compact the heap
89. Problems with mark phase
- Takes an unbounded amount of time
- Can limit it using generational collection, but then it's not clear what will get collected
- We seek an approach that spends a constant amount of time per program operation and collects objects continuously
90. Two Approaches
- Variation on reference counting
- Contaminated garbage collection [PLDI '00]
91. Reference Counting
- An integer is associated with every object, summing
- Stack references
- Heap references
- Objects with a reference count of zero are dead
[diagram: stack and heap objects annotated with reference counts]
92. Problems with Reference Counting
- The standard problem is that objects in cycles (and those touched by such objects) cannot be collected
[diagram: heap objects forming a reference cycle; slides 92–93 build this figure]
94. Problems with Reference Counting
- Contaminated GC will collect such objects
- The overhead of counting can be high
- The untyped stack complicates things
95. The Untyped Stack
- The stack is a collection of untyped cells
- In the JVM, safety is verified at class-load time
- No need to tag stack locations with what they contain
- Leads to imprecision in all GC methods
[diagram: a stack cell holding a value that may or may not be a heap address]
96. Idea
- When a stack frame pops, all of its cells are dead
- Don't worry about tracking cell pointers
- Instead, associate an object with the last stack frame that can reference the object
97. Reference Counting Approach
- s is zero or one, indicating no stack reference or at least one stack reference to the object
- h precisely reflects the number of heap references to the object
- If s + h = 0, the object is dead (sketch below)
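A sketch of the per-object counts as they might look in software; the field and class names are illustrative.

// Sketch of the per-object counts: s is a single bit recording whether any
// stack frame may still reference the object, h counts heap references
// exactly. The object is reclaimable once both are zero.
public class CountedObject {
    boolean s;      // stack bit: false = no frame can reference this object
    int h;          // exact number of heap references

    boolean isDead() {
        return !s && h == 0;     // s + h = 0
    }

    public static void main(String[] args) {
        CountedObject o = new CountedObject();
        o.s = true;              // created by (and associated with) some frame
        o.h = 1;                 // one heap reference installed via putfield
        o.h--;                   // that heap reference is overwritten
        o.s = false;             // the owning frame pops
        System.out.println(o.isDead());   // true -- collect it
    }
}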
98. Our treatment of stack activity
- An object is associated with the last-to-be-popped frame that can reference it
[diagram: stack frames]
99. Our treatment of stack activity
- When that frame pops:
- If the object is returned, the receiving frame owns the object
100. Our treatment of stack activity
- When that frame pops:
- Otherwise the object is dead
101. Our reference-counting implementation
- The objects associated with a frame are linked together
[diagram: stack frames, heap objects with counts, and a per-frame list of objects; slides 101–113 animate a frame pop]
102. Our reference-counting implementation
- When a stack frame pops, all of its cells no longer can point at anything
103. Our reference-counting implementation
- This object is unlinked, but still thought to be live
104. Our reference-counting implementation
- This object is dead and is collected
105. Our reference-counting implementation
- This object is also dead
106. Our reference-counting implementation
- This object is still live
107. Our reference-counting implementation
- Now the frame is gone
108. Our reference-counting implementation
- This object was linked to its frame all along
109. Our reference-counting implementation
- When its heap count becomes zero, the object is scheduled for deletion in that frame
110. Our reference-counting implementation
- When the frame pops, all are dead
[slides 111–113 complete the animation]
114. Reference Counting
- Predictable, constant overhead for each JVM instruction
- putfield decreases the count at the old pointed-to object and increases the count at the new pointed-to object
- areturn associates the object with a stack frame if it is not already associated below (sketch follows)
- How well does it do? We shall see!
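A sketch of the two hooks just described. Obj, Frame, and the method names are illustrative stand-ins for the modified JDK 1.1 interpreter, not its actual code.

import java.util.ArrayList;
import java.util.List;

// Sketch of the two constant-work hooks: putfield adjusts exact heap counts
// on the old and new targets; areturn re-associates the returned object with
// the caller's frame unless a frame even lower on the stack already owns it.
public class RefCountHooks {
    static class Obj { boolean s; int h; Frame owner; }
    static class Frame {
        final int depth;                              // 0 = bottom of the stack
        final List<Obj> owned = new ArrayList<>();    // objects that die with this frame
        Frame(int depth) { this.depth = depth; }
    }

    /** x.f = newTarget, where oldTarget was the previous value of x.f. */
    void putfield(Obj oldTarget, Obj newTarget) {
        if (oldTarget != null) oldTarget.h--;         // one fewer heap reference
        if (newTarget != null) newTarget.h++;         // one more heap reference
    }

    /** The current frame returns obj to its caller. */
    void areturn(Obj obj, Frame caller) {
        if (obj.owner == null || obj.owner.depth > caller.depth) {
            if (obj.owner != null) obj.owner.owned.remove(obj);
            obj.owner = caller;                       // the receiving frame owns it now
            caller.owned.add(obj);
            obj.s = true;
        }
    }

    /** When a frame pops, the objects it owns lose their stack bit. */
    void popFrame(Frame f) {
        for (Obj o : f.owned) {
            o.s = false;
            if (o.h == 0) System.out.println("collect " + o);   // s + h = 0: dead
        }
    }

    public static void main(String[] args) {
        RefCountHooks vm = new RefCountHooks();
        Frame outer = new Frame(0), callee = new Frame(1);
        Obj o = new Obj();
        o.owner = callee; callee.owned.add(o); o.s = true;   // allocated in callee
        vm.areturn(o, outer);        // returned to the caller: outer now owns o
        vm.popFrame(callee);         // o survives, because it moved to outer
        vm.popFrame(outer);          // prints "collect ...": s and h are both zero
    }
}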
115. Contaminated Garbage Collection
- Need to collect objects involved in reference cycles without resorting to marking live objects
- Idea: associate each object with a stack frame such that when that frame returns, the object is known to be dead
- Like escape analysis, but dynamic
116. Contaminated garbage collection
- Initially each object is associated with the frame in which it is instantiated
[diagram: stack with objects A–E, each tied to a frame; slides 116–121 animate contamination]
117. Contaminated garbage collection
- When B references A, A becomes as live as B
118. Contaminated garbage collection
- Now A, B, and C are as live as C
119. Contaminated garbage collection
- Even though D is less live than C, it gets contaminated
- Should something reference D later, all will be affected
120. Contaminated garbage collection
- Static finger of life
- Now all objects appear to live forever
121. Contaminated garbage collection
- Even if E points away!
122. Contaminated garbage collection
- Every object is a member of an equilive set
- All objects in a set are scheduled for deallocation at the same time
- Sets are maintained using Tarjan's disjoint union/find algorithm (sketch below)
- Nearly constant amount of overhead per operation
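A sketch of the union/find bookkeeping. The talk uses Tarjan's disjoint union/find; the class below is a plain path-compression version with an illustrative frame depth attached to each set representative.

// Sketch of equilive sets kept with union/find. Each set carries the depth of
// the stack frame it is associated with; unioning two sets ("contamination")
// keeps the shallower frame, i.e. the one that pops later, so every member now
// dies no earlier than that frame's return.
public class EquiliveSets {
    private final int[] parent;
    private final int[] frameDepth;   // meaningful only at set representatives

    EquiliveSets(int nObjects, int[] allocationFrameDepth) {
        parent = new int[nObjects];
        frameDepth = new int[nObjects];
        for (int i = 0; i < nObjects; i++) {
            parent[i] = i;                               // each object starts alone
            frameDepth[i] = allocationFrameDepth[i];     // frame that allocated it
        }
    }

    int find(int x) {                                    // with path compression
        if (parent[x] != x) parent[x] = find(parent[x]);
        return parent[x];
    }

    /** Object a references object b (or vice versa): contaminate. */
    void union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        parent[rb] = ra;
        frameDepth[ra] = Math.min(frameDepth[ra], frameDepth[rb]);  // longer-lived frame wins
    }

    /** When the frame at this depth pops, is this object now dead? */
    boolean deadWhenFramePops(int obj, int depth) {
        return frameDepth[find(obj)] >= depth;
    }

    public static void main(String[] args) {
        // Objects 0..2 allocated in frames at depths 1, 2, 3 respectively.
        EquiliveSets sets = new EquiliveSets(3, new int[] {1, 2, 3});
        sets.union(2, 1);      // object in frame 3 references object in frame 2
        System.out.println(sets.deadWhenFramePops(2, 3));  // false: its set is tied to depth 2
        System.out.println(sets.deadWhenFramePops(2, 2));  // true: dies when the depth-2 frame pops
    }
}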
123. Contaminated GC
- Each equilive set is associated with a frame
[diagram: stack frames with equilive sets; slides 123–128 animate a union and a frame pop]
124. Contaminated GC
- Suppose an object in one set references an object in another set (in either direction)
125. Contaminated GC
- Contamination! The sets are unioned
126. Contaminated GC
- When a frame pops, the objects associated with it are dead
[slides 127–128 complete the animation]
129. Summary of methods
- Reference counting
- Can't handle cycles
- Handles pointing at and then away
- Contaminated GC
- Tolerates cycles
- Can't track pointing and then pointing away
- Both techniques
- Incur cost at putfield, areturn
- (Nearly) constant overhead per operation
130. Implementation details
- Sun JDK 1.1, interpreter version
- Many subtle places where references are generated: String.intern(), the ldc instruction, the class loader, JNI
- Each GC method took about 3 months to implement
- Can run either method or both in concert
- Fairly optimized; more is possible
131. Spec benchmark effectiveness [chart]
132. Spec benchmark effectiveness [chart]
133. Exactness of Equilive Sets [chart]
134. Distance to die, in frames [chart]
135. Speed of CGC [chart]
136. Speedups of Mark-Free Approaches [chart]
137. Future Plans
- VHDL simulation of a more efficient buddy allocator
- VHDL simulation of the garbage collection methods
- Better buddy defragmentation
- Experiment with informed allocation
- Comparison/integration with other IRAM-based methods (with Krishna Kavi)
138. Informed Storage Management
- Evidence that programs allocate many objects of the same size
139. Benchmark jack: 20% fragmentation [chart]
140. Benchmark raytrace: 12% fragmentation [chart]
141. Benchmark compress: 34% fragmentation [chart]
142. Informed Storage Management
- Evidence that programs allocate many objects of the same size
143. Informed Storage Management
- Not surprising, in Java: same type ⇒ same size
144. Informed Storage Management
- In C and C++, programmers brew their own allocators to take advantage of this
- What can we do automatically?
145. Informed Storage Management
- Capture the program's malloc requests by phase (a sketch follows)
- Generate a .class file and put it in the CLASSPATH
- Load the .class file and inform the allocator
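One way to picture the capture step. The profiler class, the phase length, and the plain-text output are illustrative assumptions; the talk instead generates a .class file that the allocator loads.

import java.util.Map;
import java.util.TreeMap;

// Sketch of the capture step: record a histogram of requested sizes per
// allocation phase, where a "phase" is a fixed number of allocations rather
// than a time interval (so the profile is portable across machines).
public class AllocationProfiler {
    static final int ALLOCATIONS_PER_PHASE = 10_000;   // illustrative phase length

    private long count = 0;
    private Map<Integer, Integer> histogram = new TreeMap<>();   // size -> frequency

    /** Called on every malloc/new with the rounded-up request size. */
    void record(int size) {
        histogram.merge(size, 1, Integer::sum);
        if (++count % ALLOCATIONS_PER_PHASE == 0) {
            emitPhase(count / ALLOCATIONS_PER_PHASE, histogram);
            histogram = new TreeMap<>();                // start the next phase
        }
    }

    /** Stand-in for generating the .class file the allocator will load. */
    void emitPhase(long phase, Map<Integer, Integer> sizes) {
        System.out.println("phase " + phase + ": " + sizes);
    }

    public static void main(String[] args) {
        AllocationProfiler profiler = new AllocationProfiler();
        for (int i = 0; i < 25_000; i++) {
            profiler.record(i % 3 == 0 ? 16 : 32);      // fake workload
        }
        // Phases 1 and 2 are printed; the allocator could pre-split blocks
        // of size 16 and 32 before each phase begins.
    }
}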
146. Different phases ⇒ different distributions
[chart: raytrace]
147. How long is a phase?
- Phases 1–3 are common to all programs: JVM startup
- Phases are keyed to allocations, not time, for portability
148. Questions?