Parallel garbage collection with a block-structured heap

1
Parallel garbage collection with a
block-structured heap
  • Simon Marlow (Microsoft Research)
  • Simon Peyton Jones (Microsoft Research)
  • Roshan James (U. Indiana)
  • Tim Harris (Microsoft Research)

2
Setting the scene
  • We have an existing GC for Haskell
  • multi-generational
  • copying by default (optional mark/compact for the
    old generation)
  • GC is a bottleneck for parallel execution: we
    want to make the GC run in parallel on a
    multiprocessor to improve scaling.
  • NB. parallel, not concurrent
  • A parallel GC can speed up even single-threaded
    programs on a multiprocessor with no input from
    the programmer.
  • Target commodity multi-cores

3
High-level structure
  • Our storage manager is divided into two layers.
  • The block allocator
  • requests memory from the OS
  • provides blocks of memory to the rest of the RTS
  • manages a pool of free blocks.
  • The GC allocates memory from the block layer only.

[Diagram: layers, top to bottom: GC; Block Allocator; malloc() / mmap()]

4
Blocks of memory
  • Memory is managed in units of the fixed block
    size (currently 4k).
  • Blocks can be allocated singly, or in contiguous
    groups to accommodate larger objects.
  • Blocks can be linked together into lists to form
    areas.

5
Why divide memory into blocks?
  • Flexibility, Portability
  • The storage manager needs to track multiple
    regions (e.g. generations) which may grow and
    shrink over time.
  • contiguous memory would be problematic: how much
    space do we allocate to each one, so that they
    don't grow into each other?
  • Address space is limited.
  • With linked lists of blocks, we can grow or
    shrink a region whenever we like by
    adding/removing blocks.
  • managing large objects is easier: each gets its
    own block group, and we can link large objects
    onto lists, so no need to copy the object to move
    it.
  • some wastage due to slop (<1%)
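As a rough illustration of growing and shrinking a region by adding and removing blocks, the sketch below models a region as a singly linked list of fixed-size blocks. The names `block`, `region`, `region_grow` and `region_shrink` are hypothetical, and `malloc`/`free` stand in for the block allocator; this is not the actual RTS code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE 4096              /* 4k blocks, as on the slides */

/* Hypothetical block: a link field plus the payload. */
typedef struct block {
    struct block *link;              /* next block in the region */
    unsigned char data[BLOCK_SIZE];
} block;

/* A region (e.g. one generation) is just a list of blocks. */
typedef struct region {
    block *blocks;                   /* head of the list */
    size_t n_blocks;
} region;

/* Grow the region by one block; returns 0 on success, -1 if out of memory. */
static int region_grow(region *r) {
    block *b = malloc(sizeof *b);    /* stand-in for the block allocator */
    if (b == NULL) return -1;
    b->link = r->blocks;
    r->blocks = b;
    r->n_blocks++;
    return 0;
}

/* Shrink the region by one block; returns 0 on success, -1 if it is empty. */
static int region_shrink(region *r) {
    block *b = r->blocks;
    if (b == NULL) return -1;
    r->blocks = b->link;
    r->n_blocks--;
    free(b);
    return 0;
}
```

Starting from `region r = { NULL, 0 };`, each `region_grow(&r)` adds one block and each `region_shrink(&r)` returns one, without the region ever needing a contiguous address range.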

6
More advantages of blocks
  • Portable: all we require from the OS is a way to
    allocate memory; there are no dependencies on
    address-space layout.
  • Memory can be recycled quickly: cache-friendly
  • Sometimes we need to allocate when it is
    inconvenient to GC: we can always grab another
    block.
  • A useful granularity
  • useful for performing occasional checks during
    execution (e.g. context switching)
  • for dividing up the work when parallelising GC
    (later...)

7
How do blocks work?
  • Each block has a fixed table of data: the block
    descriptor.

    typedef struct bdescr {
        void *start;   /* start of the block */
        void *free;    /* 1st free byte in the block */
        void *link;    /* chains blocks together (or links to head of group) */
        int blocks;    /* number of blocks in group (0 if not the head) */
    } bdescr;

    bdescr *Bdescr(void *);           /* maps address to block descriptor in a
                                         few instructions, no memory accesses */
    bdescr *allocBlocks(int blocks);
    void freeBlocks(bdescr *);
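To make the role of the `start` and `free` fields concrete, here is a minimal sketch of bump allocation inside a block group. The helper `block_alloc` is hypothetical, and the fields are declared as `char *` (rather than `void *` as on the slide) only so that pointer arithmetic is valid C.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 4096

/* Simplified descriptor; char * is an assumption made for pointer arithmetic. */
typedef struct bdescr {
    char *start;            /* start of the block */
    char *free;             /* first free byte in the block */
    struct bdescr *link;    /* next block in a list */
    int blocks;             /* number of blocks in the group */
} bdescr;

/* Hypothetical helper: bump-allocate n bytes from the group described by bd.
   Returns NULL if the group cannot hold n more bytes. */
static void *block_alloc(bdescr *bd, size_t n) {
    size_t used = (size_t)(bd->free - bd->start);
    size_t capacity = (size_t)bd->blocks * BLOCK_SIZE;
    if (n > capacity - used) return NULL;
    void *p = bd->free;
    bd->free += n;          /* advance the "1st free byte" pointer */
    return p;
}
```

Allocation is a bounds check plus a pointer bump; when `block_alloc` returns NULL the caller would link in a fresh block rather than trigger a collection.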
8
Where do block descriptors live?
  • Choice 1: at the start of a block.
  • (+) Bdescr(p) is one instruction: p & 0xfffff000
  • (-) Bad for cache & TLB.
  • We often traverse block descriptors; if they are
    scattered all over memory this thrashes the TLB.
  • (-) Contiguous multi-block groups can only have a
    descriptor for the first block.
  • (-) A block contains "4k minus a bit" of space for
    data (awkward)
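For Choice 1, the descriptor lookup is a single mask. The sketch below assumes 4k blocks and works on integer addresses; the function name `bdescr_at_block_start` is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE ((uintptr_t)4096)

/* Choice 1: the descriptor sits at the start of its block, so clearing the
   low 12 bits of any address inside the block yields the descriptor address.
   On a 32-bit machine this is the same as p & 0xfffff000. */
static uintptr_t bdescr_at_block_start(uintptr_t p) {
    return p & ~(BLOCK_SIZE - 1);
}
```

For example, the address 0x12345678 maps to the descriptor at 0x12345000.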

9
Choice 2
  • Block descriptors are grouped together, and can
    be found by an address calculation.
  • (-) Bdescr(p) is 6 instructions (next slide)
  • (+) We can keep block descriptors together, better
    for cache/TLB
  • (+) Contiguous block groups are easier: all blocks
    in the group have descriptors.
  • (+) Blocks contain a full 4k of data

10
Block Allocator (cont.)
[Diagram: a megablock of 2^m bytes, aligned on a 2^m-byte boundary, containing Block 1, Block 2, ..., Block N, each of 2^k bytes]
The block allocator requests memory from the
operating system in units of a Megablock, which
is divided into N blocks and block descriptors.

11
Block Allocator (cont.)
[Diagram: the same megablock layout (2^m bytes, 2^m-byte aligned, blocks of 2^k bytes), with block descriptors of 2^d bytes each]
Bdescr(p) = (((p & (2^m - 1)) >> k) << d) | (p & ~(2^m - 1))
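The address calculation can be sketched in C as follows. The concrete values m = 20, k = 12 and d = 6 (1 MB megablocks, 4k blocks, 64-byte descriptors) are assumptions chosen for the example, and `bdescr_addr` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes: 2^20-byte megablocks, 2^12-byte blocks, 2^6-byte descriptors. */
enum { M = 20, K = 12, D = 6 };

/* Map an address to the address of its block descriptor, using only
   shifts and masks (no memory accesses). */
static uintptr_t bdescr_addr(uintptr_t p) {
    uintptr_t mask = ((uintptr_t)1 << M) - 1;
    uintptr_t block_index = (p & mask) >> K;   /* which block in the megablock */
    uintptr_t megablock = p & ~mask;           /* start of the megablock */
    return megablock | (block_index << D);     /* descriptor slot for that block */
}
```

With these sizes, an address in block 3 of the megablock at 0x40000000 maps to the descriptor at offset 3 * 64 from the start of that megablock.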
12
Parallel GC
  • First we consider how to parallelise 2-space
    copying GC, and then extend this to generational
    GC.

13
Background: copying collection
[Diagram: allocation area and to-space]

14
How can we parallelise copying GC?
  • Basically, we want each CPU to copy and scan
    different objects.
  • The main problem is finding an effective way to
    partition the problem, so we can keep all CPUs
    busy all the time (load balancing).
  • Static partitioning (e.g. partitioning the heap by
    address) is not good:
  • live data might not be evenly distributed,
    leading to poor load balancing
  • Need synchronisation when pointers cross
    partition boundaries

15
Load balancing
  • So we need dynamic load-balancing for GC.
  • The pending set is the set of objects copied but
    not yet scanned.
  • Each CPU runs:

      while (pending set non-empty)
          remove an object p from the pending set
          scan(p)

  • where scan(p) is:

      for each object q pointed to by p
          if q has not been copied
              copy q
              add q to the pending set

  • (Need synchronisation to prevent two threads
    copying the same object: later)
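The loop above can be sketched for a single CPU as follows, with an explicit pending set held in a small stack. The object layout (two pointer fields and a forwarding pointer), the fixed-size arrays and the names `evacuate`, `scan` and `collect` are all assumptions for illustration; there is no synchronisation and no overflow check here.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 16   /* assumed upper bound on live objects in this sketch */

typedef struct obj {
    struct obj *fwd;        /* forwarding pointer; NULL until copied */
    struct obj *field[2];   /* outgoing pointers; NULL if unused */
    int value;
} obj;

static obj to_space[MAX_OBJS];   /* copies live here */
static size_t n_copied = 0;

static obj *pending[MAX_OBJS];   /* explicit pending set (a stack) */
static size_t n_pending = 0;

/* Copy q into to-space unless it was already copied; return the copy. */
static obj *evacuate(obj *q) {
    if (q == NULL) return NULL;
    if (q->fwd == NULL) {
        obj *c = &to_space[n_copied++];
        *c = *q;                      /* fields still point at old objects */
        q->fwd = c;
        pending[n_pending++] = c;     /* copied but not yet scanned */
    }
    return q->fwd;
}

/* scan(p): copy everything p points to and redirect p's fields. */
static void scan(obj *p) {
    for (int i = 0; i < 2; i++)
        p->field[i] = evacuate(p->field[i]);
}

/* Collect from a single root; returns the root's copy. */
static obj *collect(obj *root) {
    obj *new_root = evacuate(root);
    while (n_pending > 0)
        scan(pending[--n_pending]);
    return new_root;
}
```

In a parallel setting every thread would run the `while` loop in `collect`, which is why the representation of `pending` and the cost of its operations matter so much.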
16
The Pending Set
  • Now the problem is reduced to finding a good
    representation for the pending set.
  • Operations on the pending set are in the inner
    loop, so heavyweight synchronisation would be
    very expensive.
  • In standard copying GC the pending set is
    represented implicitly (and cheaply!) by
    to-space. Hence any explicit representation will
    incur an overhead compared to single-threaded GC,
    and eat into any parallel speedup.
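For comparison, here is a minimal sketch of the implicit representation used by standard copying GC: the objects lying between a scan pointer and the allocation pointer in to-space are exactly the pending set, so no separate structure is needed. The object layout and the names `scan_idx`, `free_idx`, `evacuate` and `collect` are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 16   /* assumed upper bound on live objects in this sketch */

typedef struct obj {
    struct obj *fwd;        /* forwarding pointer; NULL until copied */
    struct obj *field[2];   /* outgoing pointers; NULL if unused */
    int value;
} obj;

static obj to_space[MAX_OBJS];
static size_t free_idx = 0;   /* allocation pointer: next free slot */
static size_t scan_idx = 0;   /* scan pointer: next object to scan */

/* Copy q to the end of to-space unless already copied; return the copy. */
static obj *evacuate(obj *q) {
    if (q == NULL) return NULL;
    if (q->fwd == NULL) {
        to_space[free_idx] = *q;
        q->fwd = &to_space[free_idx++];
    }
    return q->fwd;
}

/* Objects in [scan_idx, free_idx) are copied but not yet scanned. */
static obj *collect(obj *root) {
    obj *new_root = evacuate(root);
    while (scan_idx < free_idx) {
        obj *p = &to_space[scan_idx++];
        for (int i = 0; i < 2; i++)
            p->field[i] = evacuate(p->field[i]);
    }
    return new_root;
}
```

The only bookkeeping is two indices, which is the cheapness that an explicit pending set has to compete with.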

17
Previous solutions
  • per-CPU work-stealing queues (Flood et al.
    (2001)).
  • good work partitioning, but
  • some administrative overhead (quantity unknown)
  • needs clever lock-free data structures
  • some strategy for overflow (GC can't use
    arbitrary extra memory!)
  • Dividing the pending set into chunks (Imai/Tick
    (1993), others).
  • coarser granularity reduces synchronisation
    overhead
  • less effective load-balancing, especially if the
    chunk size is too high.

18
How blocks help
  • Since to-space is already a list of blocks, it is
    a natural representation for a chunked pending
    set!
  • No need for a separate pending set
    representation, no extra admin overhead relative
    to single-threaded GC.
  • Larger blocks => lower synchronisation overhead
  • Smaller blocks => better load balancing

19
But what if...
  • the pending set isn't large enough to fill a
    block? E.g. if the heap consists of a single
    linked list of integers, then the scan pointer
    will always be close to the allocation pointer,
    and we will never generate a full block of work.
  • There may be little available parallelism in the
    heap structure anyway.
  • But with e.g. 2 linked lists, we would still not
    be able to parallelise on two cores, because the
    scan pointer will only be 2 words behind the
    allocation pointer.

20
Available parallelism
  • There should be enough parallelism, at least in
    old-gen collections.

21
GC data structures
[Diagram: each GC thread (GC thread 1, GC thread 2) has its own workspace]

22
Inside a workspace
  • Objects are copied into the Alloc block
    (per-thread allocation!)
  • Loop:
  • Grab a block to be scanned from the pending set
  • Scan it
  • Push it back to the done list
  • When an Alloc block becomes full, move it to the
    pending set, grab an empty block

23
Inside a workspace
  • When the pending set is empty:
  • Make the Scan block the Alloc block
  • Scan until complete
  • Look for more full blocks
  • Note that we may now have scanned part of the
    Alloc block, so we need to remember what portion
    has been scanned. (full details of the algorithm
    are in the paper).

24
Termination
  • Keep a global counter of running threads
  • When a thread finds no work to do, it decrements
    the count of running threads
  • If it finds the count is 0, all the work is done:
    stop.
  • Poll for work: if there is work to do, increment
    the counter and continue (don't remove the work
    from the queue yet).
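A minimal sketch of this termination protocol using C11 atomics is given below. The counter `pending_blocks` is a hypothetical stand-in for "is there work in the pending set?", the function name is made up, and the busy-wait loop is a simplification.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int running;         /* GC threads currently doing work */
static atomic_int pending_blocks;  /* hypothetical: blocks waiting to be scanned */

/* Called by a thread that has found no work to do.
   Returns true when the whole GC is finished, false when the thread
   should go back and take work from the pending set. */
static bool idle_until_work_or_done(void) {
    atomic_fetch_sub(&running, 1);             /* this thread is now idle */
    for (;;) {
        if (atomic_load(&running) == 0)
            return true;                       /* nobody is working: stop */
        if (atomic_load(&pending_blocks) > 0) {
            atomic_fetch_add(&running, 1);     /* rejoin before taking the work */
            return false;
        }
    }
}
```

Incrementing the counter before removing the work keeps the count from reaching zero while a thread is about to start scanning.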

25
When work is scarce
  • We found that often the pending set is small or
    empty (despite earlier measurements), leading to
    low parallelism.
  • The only solution is to export smaller chunks of
    work to the pending set.
  • We use a dynamic chunk size: when the pending set
    is low, we export smaller chunks.
  • smaller chunks lead to a fragmentation problem:
    we want to fill up the rest of the block later,
    so we have to keep track of these partially-full
    blocks in per-thread lists.

26
Forwarding pointers and synchronisation
  • Must synchronise if two threads attempt to copy
    the same object, otherwise the object is
    duplicated.
  • Use CAS to install the forwarding pointer; if
    another thread installs the pointer first, return
    it (don't copy the object). One CAS per object!
  • If the object is immutable, then we don't mind
    copying it, and in this case we could omit the CAS
    (future work, and note that the forwarding
    pointer must not overwrite the payload).
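The CAS step might look roughly like this with C11 atomics. The object layout, with a dedicated `fwd` field, and the function `install_forwarding` are assumptions; the sketch also assumes the caller has already reserved space for its copy and simply abandons it when it loses the race.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct obj {
    _Atomic(struct obj *) fwd;   /* NULL until the object has been copied */
    int payload;
} obj;

/* Try to claim `from` by installing `copy` as its forwarding pointer.
   Returns the winning copy: `copy` if this thread won the race,
   otherwise the copy installed by the other thread. */
static obj *install_forwarding(obj *from, obj *copy) {
    obj *expected = NULL;
    if (atomic_compare_exchange_strong(&from->fwd, &expected, copy))
        return copy;
    return expected;   /* CAS failed: expected now holds the installed pointer */
}
```

Both threads end up referring to the same to-space object, at the cost of one CAS per evacuated object.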

27
Overhead due to atomic evacuation

28
Parallel Generational-copying GC
  • Fairly straightforward generalisation of parallel
    copying GC
  • Three complications:
  • maintaining the remembered sets
  • deciding which pending set to take work from
  • tenuring policy
  • see paper for the details

29
Speedup results

30
Measuring load balancing
  • Ctot is the total copied by all threads
  • Cmax is the maximum copied by any thread
  • Work balance factor = Ctot / Cmax
  • Perfect balance = N (the number of threads),
    perfect imbalance = 1.
  • balance factor = maximum possible speedup given
    the distribution of work across CPUs (speedup
    might be lower for other reasons).
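As a worked example of the work balance factor, the sketch below computes Ctot / Cmax from per-thread copy counts. The function name `work_balance` and the convention of returning 0 when nothing was copied are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Work balance factor = Ctot / Cmax, where copied[i] is the amount
   copied by thread i. Returns 0.0 if nothing was copied (assumed convention). */
static double work_balance(const size_t *copied, size_t n_threads) {
    size_t total = 0, max = 0;
    for (size_t i = 0; i < n_threads; i++) {
        total += copied[i];
        if (copied[i] > max) max = copied[i];
    }
    return max == 0 ? 0.0 : (double)total / (double)max;
}
```

Four threads copying 100 units each give a factor of 4 (perfect balance), while one thread copying all 400 units gives a factor of 1 (perfect imbalance).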

31
Load balancing results

32
Status
  • Will be available in GHC 6.10 (autumn 2008)
  • Multi-threaded GC will usually be a win on 2
    cores, although it requires increasing the heap
    size to get the most benefit: parallelising small
    GCs doesn't work so well.

33
Further work
  • Investigate/improve load-balancing
  • Avoid locking for immutable objects
  • Contention is very low
  • We might get a tiny amount of duplication per GC
  • Independent minor GCs.
  • Hard to parallelise minor GC: too quick, not
    enough parallelism
  • Stopping the world for minor GC is a severe
    bottleneck in a program running on multiple CPUs.
  • So do per-CPU independent minor GCs.
  • Main technical problem: either track or prevent
    inter-minor-generation pointers. (e.g.
    Doligez/Leroy (1993) for ML, Steensgaard (2001)).
  • Concurrent marking, with simple sweep: blocks
    with no live objects can be freed immediately,
    compact or copy occasionally to recover
    fragmentation.
  • Parallelise mark/compact too
  • Blocks make parallelising compaction easier: just
    statically partition the list of marked heap
    blocks and compact each segment, concatenate the
    result.

34
Optimisations
  • There is a long list of tweaks and optimisations
    that we tried, some of which helped, some didn't.
  • Move block descriptors to the beginning of the
    block: bad cache/TLB effects.
  • Prefetching: no joy, too fragile, and recent CPUs
    do automatic prefetching
  • Should the pending block set be FIFO or LIFO? or
    something else?
  • Some objects don't need to be scanned, copy them
    to a separate non-scanned area (not worthwhile)

35
A war story
  • This GC was first implemented by Roshan James, in
    the summer of 2006.
  • measurements showed negative speedup
  • I re-implemented it in 2007; the new
    implementation also showed negative speedup,
    despite having good load-balancing.
  • The cause of the bottleneck:
  • after copying an object, a pointer in the block
    descriptor was updated. Adjacent block
    descriptors sometimes share a cache line, so
    multiple threads were writing to the same cache
    line => bad.
  • It took multiple man-weeks and 3 profiling tools
    to find the problem.
  • Solution: cache the pointer in thread-local
    storage.
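One way to picture the fix is the sketch below: the hot path bumps a thread-local copy of the pointer, and the shared block descriptor is written only once, when the thread is finished with the block. The names `begin_block`, `alloc_in_block` and `end_block` are hypothetical, the descriptor is reduced to two fields, and there is no bounds check; this is not the actual RTS code.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced descriptor; neighbouring descriptors may share a cache line. */
typedef struct bdescr {
    char *start;
    char *free;
} bdescr;

/* Per-thread state: the block being filled and a cached free pointer. */
static _Thread_local bdescr *current_block;
static _Thread_local char *cached_free;

/* Start filling a block: read the descriptor once. */
static void begin_block(bdescr *bd) {
    current_block = bd;
    cached_free = bd->free;
}

/* Hot path: touches only thread-local state, never the descriptor. */
static void *alloc_in_block(size_t n) {
    void *p = cached_free;
    cached_free += n;
    return p;
}

/* Finish with the block: write the pointer back once. */
static void end_block(void) {
    current_block->free = cached_free;
}
```

Between `begin_block` and `end_block` the descriptor's cache line is not written at all, so threads filling neighbouring blocks no longer invalidate each other's caches on every copy.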