Transcript and Presenter's Notes

Title: Getting Real, Getting Dirty, without getting real dirty

1
Getting Real, Getting Dirty (without getting real
dirty)
Funded by the National Science Foundation under grant 0081214
Funded by DARPA under contract F33615-00-C-1697
Ron K. Cytron
Joint work with Krishna Kavi, University of Alabama at Huntsville
Dante Cannarozzi, Sharath Cholleti, Morgan Deters, Steve Donahue,
Mark Franklin, Matt Hampton, Michael Henrichs, Nicholas Leidenfrost,
Jonathan Nye, Michael Plezbert, Conrad Warmbold
Center for Distributed Object Computing
Department of Computer Science
Washington University
April 2001
2
Outline
  • Motivation
  • Allocation
  • Collection
  • Conclusion

3
Traditional architecture and object-oriented
programs
  • Caches are still biased toward Fortran-like
    behavior
  • CPU is still responsible for storage management
  • Object-management activity invalidates caches
  • GC disruptive
  • Compaction

4
An OO-biased design using IRAMs (with Krishna Kavi)
  • CPU and cache stay the same, off-the-shelf
  • Memory system redesigned to support OO programs

(Diagram: CPU cache and L2 cache connected to an IRAM that pairs logic with memory banks)
5
IRAM interface
A stable address for an object allows better cache behavior.
The object can be relocated within the IRAM, but its address as seen by the CPU is constant.
(Diagram: the CPU issues malloc to the IRAM; its logic returns a stable addr)
6
IRAM interface
Object referencing is tracked inside the IRAM, which supports garbage collection
(Diagram: putfield/getfield requests go to the IRAM; its logic returns the value)
7
IRAM interface
Goal: relegate storage-management functions to the IRAM
(Diagram: gc, compact, and prefetch commands handled inside the IRAM)
8
Macro accesses
Observation: code sequences contain common gestures (superoperators)
p.getLeft().getNext()  =  *(*(p + 12) + 32)
9
Gesture abstraction
Goal: decrease traffic between CPU and storage
M143(x) = *(*(x + 12) + 32)
p.getLeft().getNext()  =  *(*(p + 12) + 32)
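A minimal C sketch of such a superoperator, assuming the slide's example field offsets of 12 (getLeft) and 32 (getNext); the macro name and the pointer arithmetic are illustrative only:

    /* Hypothetical superoperator: two dependent field loads collapsed into
       one gesture that the memory system could evaluate in place. */
    #define M143(p)  (*(void **)((char *)(*(void **)((char *)(p) + 12)) + 32))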
10
Gesture application
M143(x) = *(*(x + 12) + 32)
p.getLeft().getNext()
(Diagram: the CPU sends Macro 143(p) to the IRAM instead of issuing two separate accesses)
11
Gesture application
M143(x) = *(*(x + 12) + 32)
p.getLeft().getNext()
(Diagram: the IRAM's logic evaluates Macro 143(p), performing p.getLeft().getNext() internally)
12
Automatic prefetching
Goal: decrease traffic between CPU and storage
(Diagram: the CPU fetches p; the IRAM returns p through the L2 and CPU caches)
13
Automatic prefetching
Goal: decrease traffic between CPU and storage
(Diagram: on fetching p, the IRAM also prefetches p.getLeft().getNext() into the caches)
14
Challenges
  • Algorithmic
  • Bounded-time methods for allocation and
    collection
  • Good average performance as well
  • Architectural
  • Lean interface between the CPU and IRAM
  • Efficient realization

15
Storage Allocation (Real Time)
  • Not necessarily fast
  • Necessarily predictable
  • Able to satisfy any reasonable request
  • Developer should know maxlive characteristics
    of the application
  • This is true for non-embedded systems as well

16
How much storage?
  • curlive: the number of objects live at a point in time
  • curspace: the number of bytes live at a point in time

(Diagram: handles pointing into object space)
17
Objects concurrently live
18
How much object space?
19
Storage Allocation: Free List
  • Linked list of free blocks
  • Search for desired fit
  • Worst case O(n) for n blocks in the list
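A minimal first-fit sketch of the free-list search just described (illustrative C, not the talk's allocator; splitting of oversized blocks is omitted):

    #include <stddef.h>

    typedef struct Block {
        size_t size;          /* usable bytes in this free block */
        struct Block *next;   /* next block on the free list     */
    } Block;

    static Block *free_list;  /* head of the singly linked free list */

    void *fl_alloc(size_t want) {
        Block **prev = &free_list;
        for (Block *b = free_list; b != NULL; b = b->next) {
            if (b->size >= want) {       /* first fit                        */
                *prev = b->next;         /* unlink the chosen block          */
                return (void *)(b + 1);  /* payload follows the header       */
            }
            prev = &b->next;
        }
        return NULL;                     /* worst case: scanned all n blocks */
    }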

20
Worst-case free-list behavior
  • The longer the free-list, the more pronounced the
    effect
  • No a priori bound on how much worse the
    list-based scheme could get
  • Average performance similar

21
Knuth's Buddy System
  • Free-list segregated by size
  • All requests rounded up to a power of 2
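For reference, the power-of-two rounding the buddy system performs (illustrative helper; a real allocator would also enforce a minimum block size):

    #include <stddef.h>

    static size_t round_up_pow2(size_t n) {
        size_t p = 1;
        while (p < n) p <<= 1;   /* e.g. 13 -> 16, 16 -> 16, 17 -> 32 */
        return p;
    }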

22
Knuth's Buddy System (1)
  • Begin with one large block
  • Suppose we want a block of size 16
(Diagram: buddy free-list levels 256, 128, 64, 32, 16, 8, 4, 2, 1; one free block of size 256)
23
Knuth's Buddy System (2)
  • Begin with one large block
  • Recursively subdivide
24
Knuth's Buddy System (3)
  • Begin with one large block
  • Recursively subdivide
25
Knuth's Buddy System (4)
  • Begin with one large block
  • Recursively subdivide
26
Knuth's Buddy System (5)
  • Begin with one large block
  • Yield 2 blocks of size 16
27
Knuth's Buddy System (6)
  • Begin with one large block
  • Yield 2 blocks of size 16
  • One of those blocks can be given to the program
28
Worst-case free-list behavior
  • The longer the free-list, the more pronounced the
    effect
  • No a priori bound on how much worse the
    list-based scheme could get
  • Average performance similar

29
SPEC Benchmark Results
30
Buddy System
  • If a block can be found, it can be found in
    O(log N) time, where N is the size of the heap
  • The application cannot make that worse

31
Defragmentation
  • To keep up with the diversity of requested block
    sizes, an allocator may have to reorganize
    smaller blocks into larger ones

32
Defragmentation: Free List
  • Free-list permutes adjacent blocks
  • Storage becomes fragmented, with many small blocks and no large ones
(Diagram: the free list overlaid on blocks in memory)
33
Defragmentation: Free List
  • Free-list permutes adjacent blocks
  • Two issues
  • Join adjacent blocks
34
Defragmentation: Free List
  • Free-list permutes adjacent blocks
  • Two issues
  • Join adjacent blocks
  • Reorganize holes (move live storage)
35
Defragmentation: Free List
  • Free-list permutes adjacent blocks
  • Two issues
  • Join adjacent blocks
  • Reorganize holes
  • Organization by address can help (Kavi)

36
Buddies: joining adjacent blocks
  • The blocks resulting from subdivision are viewed
    as buddies
  • Their address differs by exactly one bit
  • The address of a block of size 2^n differs from
    its buddy's address exactly at bit n
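As a small illustrative C helper (not the talk's code), the buddy's address is one XOR away:

    #include <stddef.h>

    /* Flip bit n of the block's offset within the heap to find its buddy. */
    static char *buddy_of(char *heap_base, char *block, unsigned n) {
        size_t off = (size_t)(block - heap_base);
        return heap_base + (off ^ ((size_t)1 << n));
    }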

37
Knuth's Buddy System (6)
38
Knuth's Buddy System (5)
  • When a block becomes free, it tries to rejoin its
    buddy
  • A bit in its buddy tells whether the buddy is
    free
  • If so, they glue together and make a block twice
    as big
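A tiny self-contained sketch of this gluing over a 256-byte heap, with free blocks modeled as a per-level bitmap (illustrative only, not the talk's VHDL/IRAM design):

    #include <stdbool.h>
    #include <stdio.h>

    #define LEVELS 9                     /* block sizes 2^0 .. 2^8 (256)     */
    static bool free_at[LEVELS][256];    /* free_at[k][off]: size-2^k block
                                            at heap offset off is free       */

    static void buddy_free(unsigned off, unsigned k) {
        while (k + 1 < LEVELS) {
            unsigned bud = off ^ (1u << k);   /* buddy differs at bit k      */
            if (!free_at[k][bud]) break;      /* buddy busy: stop gluing     */
            free_at[k][bud] = false;          /* consume the buddy           */
            off &= ~(1u << k);                /* glued block starts at the
                                                 lower of the two offsets    */
            k++;                              /* result is twice as big      */
        }
        free_at[k][off] = true;
    }

    int main(void) {
        buddy_free(16, 4);                /* free a size-16 block at offset 16 */
        buddy_free(0, 4);                 /* free its buddy: the two glue      */
        printf("size-32 block at offset 0 free? %d\n", free_at[5][0]); /* 1 */
        return 0;
    }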

39
Knuth's Buddy System (4)
40
Knuth's Buddy System (3)
41
Knuth's Buddy System (2)
42
Knuth's Buddy System (1)
43
Two problems
  • Oscillation: Buddy looks like it may split, glue,
    split, glue; isn't this wasted effort?
  • Fragmentation: What happens when Buddy can't glue
    but has space it would like to combine?

44-49
Buddy: oscillation
(Animation across slides 44-49: a block is glued only to be split again shortly afterward)
50
Problem is lack of hysteresis
  • Some programs allocate objects which are almost
    immediately deallocated.
  • Continuous, incremental approaches to garbage
    collection only make this worse!
  • Oscillation is expensive: blocks are glued only
    to be quickly subdivided again

51
Estranged Buddy System
  • Variant of Knuth's idea
  • When deallocated, blocks are not eager to rejoin
    their buddies
  • Evidence of value: Kaufman, TOPLAS '84
  • Slight improvement on SPEC benchmarks
  • Algorithmic improvement over Kaufman

52
Buddy-Busy and Buddy-Free
(Diagram: at each level of size 2^k, blocks whose buddies are busy are kept separately from blocks whose buddies are free)
53
Estranged Buddy: Allocation
  • Allocation heuristic
  • Buddy-busy
  • Buddy-free
  • Glue one level below, buddy-free
  • Search up (Knuth)
  • Glue below

54
How well does Estranged Buddy do? (contrived example)
55
Estranged Buddy on SPEC
56
Recall two problems
  • Oscillation: Buddy looks like it may split, glue,
    split, glue; isn't this wasted effort?
  • Typically not, but it can be
  • Fragmentation: What happens when Buddy can't glue
    but has space it would like to combine?

57
Buddy System: Fragmentation
  • Internal fragmentation from rounding-up of
    requests to powers of two
  • Not really a concern these days
  • Assume a program can run in maxlive bytes
  • How much storage needed so Buddy never has to
    defragment?
  • What is a good algorithm for Buddy
    defragmentation?

58
Buddy Configurations
59
Buddy Configurations
60
Heap Full
61
Buddy can't allocate a size-2 block
62
How Big a Heap for Non-Blocking Buddy (M = maxlive)?
  • Easy bound: M · log M
  • Better bound: M · k, where k is the number of
    distinct sizes to be allocated
  • Sounds like a good bound, but it isn't
  • Defragmentation may be necessary
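As a rough worked comparison of the two bounds (illustrative numbers; assuming log is base 2):

    M = 2^20 bytes (1 MB), k = 8 distinct block sizes
    M · k      = 8 MB
    M · log2 M = 20 MB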

(Diagram: buddy levels 256 down to 1, with up to M bytes at each level)
63
Managing object relocation
  • Every object has a stable handle, whose address
    does not change
  • Every handle points to its object's current
    location
  • All references to objects are indirect, through a
    handle
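A minimal C sketch of that indirection (illustrative; the names are assumptions, not the talk's interface):

    #include <stddef.h>

    /* The CPU holds only the handle's (stable) address; the IRAM may move
       the object's body and need only update handle->body. */
    typedef struct {
        void *body;                      /* current location of the object */
    } Handle;

    static int get_int_field(Handle *h, size_t offset) {
        return *(int *)((char *)h->body + offset);
    }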

64
Buddy Defragmentation
  • When stuck at level k
  • No blocks free above level k
  • No glueable blocks free below level k
  • Assume maxlive still suffices
  • Example: k = 6, size 64 not available

65-68
Defragmentation Algorithm
(Animation across slides 65-68: swap a live block with a free block at the same level, then glue the two now-adjacent free buddies into a larger block)
69
Defragmentation Algorithm
  • Recursively visit below to develop two buddies
    that can be glued
  • Analogous to the recursive allocation algorithm
  • Still, choices to be made; studies underway

70
Need 4 bytes
Move 3 bytes?
Move 1 byte?
71
Recall two problems
  • Oscillation: Buddy looks like it may split, glue,
    split, glue; isn't this wasted effort?
  • Typically not, but it can be
  • Fragmentation: What happens when Buddy can't glue
    but has space it would like to combine?
  • New algorithm to defragment Buddy
  • Selective approach should beat List
  • Optimizations needed

72
Towards an IRAM implementation
  • VHDL of Buddy System complete
  • DRAM clocked at 150 MHz
  • 10 cycles per DRAM access
  • Need 7 accesses per level to split blocks
  • For a 16-Mbyte heap: 24 levels
  • 1680 cycles worst case, about 11 µs
  • 168x slower than a single read
  • Can we do better?
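Checking the slide's arithmetic:

    24 levels × 7 DRAM accesses × 10 cycles = 1680 cycles
    1680 cycles / 150 MHz ≈ 11.2 µs
    1680 cycles / 10 cycles per access = 168x one DRAM read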

73
Two tricks
  • Find a suitable free-block quickly
  • Return its address quickly

74
Finding a suitable free block
  • No space at level 16, but level 16 points to the
    level above it that has a block to offer

75
Finding a suitable free block
  • Every level points to the level above it that has
    a block to offer
  • Pointers are maintained using Tarjan's
    path compression (see the sketch below)
  • Locating pointers are not stored in DRAM
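A minimal sketch of those level-locating pointers with path compression, in the spirit of Tarjan's union/find (illustrative C; the names and the update policy are assumptions):

    #define LEVELS 25                 /* e.g. a 16-Mbyte heap has 24 levels   */
    static int next_nonempty[LEVELS]; /* guess: nearest level >= k with space */
    static int count_free[LEVELS];    /* free blocks currently at each level  */

    static void init_levels(void) {
        for (int k = 0; k < LEVELS; k++) next_nonempty[k] = k + 1;
    }

    /* Lowest level >= k holding a free block (LEVELS if none); every level
       visited is re-pointed at the answer, so later lookups are near O(1). */
    static int find_level(int k) {
        if (k >= LEVELS || count_free[k] > 0) return k;
        return next_nonempty[k] = find_level(next_nonempty[k]);
    }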

76
Alternative free-block finder
  • Path-compression may be too complex for hardware
  • Instead, track the largest available free block

77
Alternative free-block finder
  • Path-compression may be too complex for hardware
  • Instead, track the largest available free block
  • Tends to break up large blocks and favor
    formation of small ones

78
Fast return for malloc
  • Want 16 bytes
  • Zip to the 64 display
  • WLOG we return the first part of that block
    immediately to the requestor

79
Fast return for malloc
  • Want 16 bytes
  • Zip to the 64 display
  • WLOG we return the first part of that block
    immediately to the requestor
  • Adjustment to the structures happens in parallel
    with the return

80
Improved IRAM allocator
  • 10-cycle fast return
  • 1000 cycles to recover, worst case
  • Is this good enough?
  • Compare a software implementation:
  • 1000 cycles worst case
  • 600 cycles average on SPEC benchmarks
  • Hardware can be much faster
  • Depends on recovery time
81
Do programs allow us to recover?
  • Run of jack: JVM instructions between requests
  • 56% of requests separated by at least 100 JVM
    instructions
  • Assume 10x expansion, JVM to native code
  • For the 56%, we return in 10 cycles
  • Code motion might improve the others

JVM instructions between requests: Min 3, Median 181, Max 174053
82
Garbage Collection
  • While allocators are needed for most modern
    languages, garbage collection is not universally
    accepted
  • Generational and incremental approaches help most
    applications
  • Embedded and real-time need assurances of bounded
    behavior

83
Why not garbage collect?
  • Some programmers want ultimate control over
    storage
  • Real-Time applications need bounded-time overhead
  • RT Java spec relegates allocation and collection
    to user control
  • Isn't this a step back from Java?

84
Marking Phase: the problem
  • To discover the dead objects, we use calculatus
    eliminatus
  • Find live objects
  • All others are dead

85
Marking Phase: the problem
  • To discover the dead objects, we
  • Find live objects
  • Pointers from the stack to the heap make objects
    live
(Diagram: stack cells pointing into the heap)

86
Marking Phase: the problem
  • To discover the dead objects, we
  • Find live objects
  • Pointers from the stack to the heap make objects
    live
  • These objects make other objects live

87
Marking Phase: the problem
  • To discover the dead objects, we
  • Find live objects
  • Sweep all others away as dead

88
Marking Phase: the problem
  • To discover the dead objects, we
  • Find live objects
  • Sweep all others away as dead
  • Perhaps compact the heap

89
Problems with mark phase
  • Takes an unbounded amount of time
  • Can limit it using generational collection, but
    then it's not clear what will get collected
  • We seek an approach that spends a constant amount
    of time per program operation and collects
    objects continuously

90
Two Approaches
  • Variation on reference counting
  • Contaminated garbage collection (PLDI '00)

91
Reference Counting
  • An integer is associated with every object,
    summing
  • Stack references
  • Heap references
  • Objects with reference count of zero are dead

(Diagram: stack and heap objects annotated with their reference counts)
92
Problems with Reference Counting
  • Standard problem is that objects in cycles

93
Problems with Reference Counting
  • Standard problem is that objects in cycles (and
    those touched by such objects) cannot be
    collected

94
Problems with Reference Counting
  • Standard problem is that objects in cycles (and
    those touched by such objects) cannot be
    collected
  • Contaminated gc will collect such objects
  • Overhead of counting can be high
  • Untyped stack complicates things

95
The Untyped Stack
  • The stack is a collection of untyped cells
  • In JVM, safety is verified at class-load time
  • No need to tag stack locations with what they
    contain
  • Leads to imprecision in all gc methods

(Diagram: a stack cell whose contents may or may not be a heap address)
96
Idea
  • When a stack frame pops, all of its cells are
    dead
  • Don't worry about tracking cell pointers
  • Instead, associate an object with the last stack
    frame that can reference the object

97
Reference Counting Approach
  • s is zero or one, indicating none or at least one
    stack reference to the object
  • h precisely reflects the number of heap
    references to the object
  • If s + h = 0, the object is dead
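An illustrative C header layout for that split count (a sketch, not the talk's JVM data structure):

    typedef struct Obj {
        unsigned s : 1;          /* 1 if some stack cell may reference us */
        unsigned h : 31;         /* exact count of heap references to us  */
        struct Obj *frame_next;  /* next object owned by the same frame   */
    } Obj;

    static int is_dead(const Obj *o) {
        return o->s == 0 && o->h == 0;   /* s + h = 0: unreachable         */
    }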

98
Our treatment of stack activity
  • Object is associated with the last-to-be-popped
    frame that can reference the object

99
Our treatment of stack activity
  • Object is associated with the last-to-be-popped
    frame that can reference the object
  • When that frame pops
  • If object is returned, the receiving frame owns
    the object

100
Our treatment of stack activity
  • Object is associated with the last-to-be-popped
    frame that can reference the object
  • When that frame pops
  • Otherwise the object is dead
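A small C sketch of what a frame pop does under this scheme (illustrative; a returned object would first be re-linked to the caller's frame, per the previous slide):

    typedef struct Obj {
        unsigned s, h;            /* stack bit and heap reference count  */
        struct Obj *frame_next;   /* next object owned by the same frame */
    } Obj;

    static void reclaim(Obj *o) { (void)o; /* placeholder deallocation */ }

    static void pop_frame(Obj *owned) {
        while (owned != NULL) {
            Obj *next = owned->frame_next;
            owned->s = 0;               /* no stack reference survives the pop */
            if (owned->h == 0)
                reclaim(owned);         /* nothing can reach it: dead          */
            /* else: heap references remain; the object is unlinked and lives
               until its heap count drops to zero (see the following slides)  */
            owned = next;
        }
    }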

101
Our reference-counting implementation
  • The objects associated with the frame are linked
    together

(Diagram: the frame's objects linked together, each annotated with its stack/heap counts)
102
Our reference-counting implementation
  • The objects associated with the frame are linked
    together
  • When a stack frame pops, all of its cells no
    longer can point at anything

103
Our reference-counting implementation
  • The objects associated with the frame are linked
    together
  • When a stack frame pops, all of its cells no
    longer can point at anything
  • This object is unlinked, but still thought to be
    live

104
Our reference-counting implementation
  • The objects associated with the frame are linked
    together
  • When a stack frame pops, all of its cells no
    longer can point at anything
  • This object is dead and is collected

105
Our reference-counting implementation
  • The objects associated with the frame are linked
    together
  • When a stack frame pops, all of its cells no
    longer can point at anything
  • This object is also dead

106
Our reference-counting implementation
  • The objects associated with the frame are linked
    together
  • When a stack frame pops, all of its cells no
    longer can point at anything
  • This object is still live

107
Our reference-counting implementation
  • The objects associated with the frame are linked
    together
  • When a stack frame pops, all of its cells no
    longer can point at anything
  • Now the frame is gone

108
Our reference-counting implementation
  • The objects associated with the frame are linked
    together
  • When a stack frame pops, all of its cells no
    longer can point at anything
  • This object was linked to its frame all along

109
Our reference-counting implementation
  • The objects associated with the frame are linked
    together
  • When a stack frame pops, all of its cells no
    longer can point at anything
  • This object was linked to its frame all along
  • When heap count becomes zero, the object is
    scheduled for deletion in that frame

110
Our reference-counting implementation
  • The objects associated with the frame are linked
    together
  • When a stack frame pops, all of its cells no
    longer can point at anything
  • This object was linked to its frame all along
  • When heap count becomes zero, the object is
    scheduled for deletion in that frame
  • When frame pops, all are dead

114
Reference Counting
  • Predictable, constant overhead for each JVM
    instruction
  • putfield decreases count at old pointed-to
    object, increases count at new pointed-to object
  • areturn associates object with stack frame if not
    already associated below
  • How well does it do? We shall see!
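A C sketch of the constant-work putfield barrier implied by this (illustrative; the helper and field names are assumptions, not the JDK 1.1 patch):

    typedef struct Obj {
        unsigned s;        /* 1 if some stack cell may still reference us */
        unsigned h;        /* exact count of heap references              */
    } Obj;

    static void schedule_for_deletion(Obj *o) { (void)o; /* placeholder */ }

    /* putfield: the slot's old target loses a heap reference, the new
       target gains one; constant work per instruction. */
    static void putfield(Obj **slot, Obj *newref) {
        Obj *old = *slot;
        if (newref != NULL) newref->h++;
        *slot = newref;
        if (old != NULL && --old->h == 0 && old->s == 0)
            schedule_for_deletion(old);
    }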

115
Contaminated Garbage Collection
  • Need to collect objects involved in reference
    cycles without resorting to marking live objects
  • Idea
  • Associate each object with a stack frame such
    that when that frame returns, the object is known
    to be dead
  • Like escape analysis, but dynamic

116
Contaminated garbage collection
  • Initially each object is associated with the
    frame in which it is instantiated

(Diagram: stack frames, each holding one of the objects A through E)
117
Contaminated garbage collection
  • Initially each object is associated with the
    frame in which it is instantiated
  • When B references A, A becomes as live as B

118
Contaminated garbage collection
  • Initially each object is associated with the
    frame in which it is instantiated
  • Now A, B, and C are as live as C

119
Contaminated garbage collection
  • Initially each object is associated with the
    frame in which it is instantiated
  • Even though D is less live than C, it gets
    contaminated
  • Should something reference D later, all will be
    affected

120
Contaminated garbage collection
  • Initially each object is associated with the
    frame in which it is instantiated
  • Static finger of life
  • Now all objects appear to live forever

121
Contaminated garbage collection
  • Initially each object is associated with the
    frame in which it is instantiated
  • Static finger of life
  • Now all objects appear to live forever
  • Even if E points away!

122
Contaminated garbage collection
  • Every object is a member of an equilive set
  • All objects in a set are scheduled for
    deallocation at the same time
  • Sets are maintained using Tarjan's disjoint
    union/find algorithm
  • Nearly constant amount of overhead per operation
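A small union/find sketch of equilive sets (illustrative C; the object ids, array sizes, and frame-depth convention are assumptions):

    #define MAX_OBJS 1024
    static int parent[MAX_OBJS];    /* union/find forest over object ids     */
    static int frame_of[MAX_OBJS];  /* owning frame depth for each set root
                                       (0 = main; smaller pops later)        */

    static void make_object(int x, int frame_depth) {
        parent[x] = x;              /* a new object starts in its own set    */
        frame_of[x] = frame_depth;  /* owned by the frame that allocated it  */
    }

    static int find(int x) {        /* find with path compression            */
        return parent[x] == x ? x : (parent[x] = find(parent[x]));
    }

    /* a references b (or vice versa): contamination unions the two sets,
       which then die together when the older (later-to-pop) frame returns. */
    static void contaminate(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        parent[rb] = ra;
        if (frame_of[rb] < frame_of[ra]) frame_of[ra] = frame_of[rb];
    }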

123
Contaminated GC
  • Each equilive set is associated with a frame

124
Contaminated GC
  • Each equilive set is associated with a frame
  • Suppose an object in one set references an object
    in another set (in either direction)

125
Contaminated GC
  • Each equilive set is associated with a frame
  • Suppose an object in one set references an object
    in another set (in either direction)
  • Contamination!
  • The sets are unioned

126
Contaminated GC
  • Each equilive set is associated with a frame
  • When a frame pops, objects associated with it are
    dead

129
Summary of methods
  • Reference counting
  • Can't handle cycles
  • Handles pointing at and then away
  • Contaminated GC
  • Tolerates cycles
  • Can't track pointing and then pointing away
  • Both techniques
  • Incur cost at putfield, areturn
  • (Nearly) constant overhead per operation

130
Implementation details
  • SUN JDK 1.1 interpreter version
  • Many subtle places where references are
    generated: String.intern(), ldc instruction,
    class loader, JNI
  • Each gc method took about 3 months to implement
  • Can run either method or both in concert
  • Fairly optimized, more is possible

131
SPEC benchmark effectiveness
132
SPEC benchmark effectiveness
133
Exactness of Equilive Sets
134
Distance to die in frames
135
Speed of CGC
136
Speedups of Mark-Free Approaches
137
Future Plans
  • VHDL simulation of more efficient buddy allocator
  • VHDL simulation of garbage collection methods
  • Better buddy defragmentation
  • Experiment with informed allocation
  • Comparison/integration with other IRAM-based
    methods (with Krishna Kavi)

138
Informed Storage Management
  • Evidence that programs allocate many objects of
    the same size

139
Benchmark jack: 20% fragmentation
140
Benchmark raytrace: 12% fragmentation
141
Benchmark compress: 34% fragmentation
142
Informed Storage Management
  • Evidence that programs allocate many objects of
    the same size

143
Informed Storage Management
  • Evidence that programs allocate many objects of
    the same size
  • Not surprising, in Java
  • same type ⇒ same size

144
Informed Storage Management
  • Evidence that programs allocate many objects of
    the same size
  • Not surprising, in Java
  • same type ⇒ same size
  • In C and C++, programmers brew their own
    allocators to take advantage of this
  • What can we do automatically?

145
Informed Storage Management
  • Capture program malloc requests by phase
  • Generate a .class file and put it in CLASSPATH
  • Load the .class file and inform the allocator
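A rough sketch of the capture step (illustrative C; the phase length, bucket scheme, and names are assumptions, while the real system emits a .class file describing the distribution):

    #include <stddef.h>

    #define PHASES  16
    #define BUCKETS 32                    /* bucket b holds sizes <= 2^b     */
    static unsigned histogram[PHASES][BUCKETS];
    static unsigned long requests_seen;

    /* Record each malloc request's size in a per-phase histogram so a later
       run can pre-size its pools.  Phases are keyed to allocation counts,
       not time (per the talk); 100000 requests per phase is a placeholder. */
    static void record_request(size_t size) {
        unsigned phase = (unsigned)(requests_seen++ / 100000);
        if (phase >= PHASES) phase = PHASES - 1;
        unsigned b = 0;
        while (((size_t)1 << b) < size && b + 1 < BUCKETS) b++;
        histogram[phase][b]++;
    }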

146
Different phases ⇒ different distributions
raytrace
147
How long is a phase?
  • Phases 1-3 are common to all programs: JVM startup
  • Phases are keyed to allocations, not time, for
    portability

148
Questions?