Title: Memory Management for High-Performance Applications
1. Memory Management for High-Performance Applications
Emery Berger
Advisor: Kathryn S. McKinley
Department of Computer Sciences
2. High-Performance Applications
- Web servers, search engines, scientific codes
- Written in C or C++
- Run on one server box or a cluster of them
- Need support at every level:
  - software
  - compiler
  - runtime system
  - operating system
  - hardware
3. New Applications, Old Memory Managers
- Applications and hardware have changed:
  - Multiprocessors now commonplace
  - Object-oriented, multithreaded
  - Increased pressure on the memory manager (malloc, free)
- But memory managers have not changed
  - Inadequate support for modern applications
4. Current Memory Managers Limit Scalability
- As we add processors, the program slows down
- Caused by heap contention
[Graph: Larson server benchmark on a 14-processor Sun]
5. The Problem
- Current memory managers are inadequate for high-performance applications on modern architectures
- They limit scalability, application performance, and robustness
6. Contributions
- Building memory managers
  - Heap Layers framework
- Problems with current memory managers
  - Contention, false sharing, space
- Solution: a provably scalable memory manager
  - Hoard
- Extended memory manager for servers
  - Reap
7. Implementing Memory Managers
- Memory managers must be:
  - Space efficient
  - Very fast
- Heavily-optimized code:
  - Hand-unrolled loops
  - Macros
  - Monolithic functions
- Hard to write, reuse, or extend
8. Building Modular Memory Managers
- Classes
  - Rigid hierarchy
  - Overhead
- Mixins
  - Flexible hierarchy
  - No overhead
9. A Heap Layer
- A mixin with malloc and free methods

template <class SuperHeap>
class GreenHeapLayer :
  public SuperHeap {...};
10. Example: Thread-Safe Heap Layer
- LockedHeap
  - Protects the superheap with a lock
- LockedMallocHeap
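The LockedHeap layer above can be sketched as a C++ mixin. This is a minimal illustration in the Heap Layers style, not the framework's actual code; the MallocHeap bottom layer and the method signatures are my own simplifications.

```cpp
#include <cstdlib>
#include <mutex>

// Bottom layer: forwards to the system allocator.
class MallocHeap {
public:
    void* malloc(size_t sz) { return std::malloc(sz); }
    void free(void* p) { std::free(p); }
};

// Mixin layer: makes any superheap thread-safe by
// protecting each malloc/free call with a lock.
template <class SuperHeap>
class LockedHeap : public SuperHeap {
public:
    void* malloc(size_t sz) {
        std::lock_guard<std::mutex> guard(lock_);
        return SuperHeap::malloc(sz);
    }
    void free(void* p) {
        std::lock_guard<std::mutex> guard(lock_);
        SuperHeap::free(p);
    }
private:
    std::mutex lock_;
};

// Composing the layers yields a thread-safe allocator.
using LockedMallocHeap = LockedHeap<MallocHeap>;
```

Because the lock lives in the mixin, the same LockedHeap layer can wrap any superheap, which is the flexibility the slide contrasts with rigid class hierarchies.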
11. Empirical Results
- Heap Layers vs. the originals:
  - KingsleyHeap vs. BSD allocator
  - LeaHeap vs. DLmalloc 2.7
- Competitive runtime and memory efficiency
12. Overview
- Building memory managers
  - Heap Layers framework
- Problems with memory managers
  - Contention, space, false sharing
- Solution: a provably scalable allocator
  - Hoard
- Extended memory manager for servers
  - Reap
13. Problems with General-Purpose Memory Managers
- Previous work for multiprocessors:
  - Concurrent single heap [Bigler et al. 85, Johnson 91, Iyengar 92]
    - Impractical
  - Multiple heaps [Larson 98, Gloger 99]
    - Reduce contention but, as we show, cause other problems:
      - P-fold or even unbounded increase in space
      - Allocator-induced false sharing
14. Multiple Heap Allocator: Pure Private Heaps
- One heap per processor:
  - malloc gets memory from its local heap
  - free puts memory on its local heap
- Used by STL, Cilk, ad hoc allocators
[Diagram: processor 0 executes x1 = malloc(1), x2 = malloc(1), free(x1), free(x2); processor 1 executes x3 = malloc(1), x4 = malloc(1), free(x3), free(x4). Key: in use by processor 0; free, on heap 1]
15. Problem: Unbounded Memory Consumption
- Producer-consumer:
  - Processor 0 allocates
  - Processor 1 frees
- Unbounded memory consumption
  - Crash!
[Diagram: processor 0 executes x1 = malloc(1), x2 = malloc(1), x3 = malloc(1), ...; processor 1 executes free(x1), free(x2), free(x3), ...; processor 1's heap grows without bound]
16. Multiple Heap Allocator: Private Heaps with Ownership
- free returns memory to the original heap
- Bounded memory consumption
  - No crash!
- Used by Ptmalloc (Linux), LKmalloc
[Diagram: processor 0 allocates x1 and x2; processor 1 frees them, returning the memory to processor 0's heap]
17. Problem: P-fold Memory Blowup
- Occurs in practice
- Round-robin producer-consumer:
  - Processor i mod P allocates
  - Processor (i+1) mod P frees
- Footprint = 1 (2GB), but space = 3 (6GB)
  - Exceeds the 32-bit address space: crash!
[Diagram for P = 3: processor 0 allocates x1; processor 1 frees x1 and allocates x2; processor 2 frees x2 and allocates x3; every heap ends up holding the footprint's worth of memory]
18. Problem: Allocator-Induced False Sharing
- False sharing: non-shared objects on the same cache line
  - The bane of parallel applications
  - Extensively studied
- All of these allocators cause false sharing!
[Diagram: processor 0 executes x1 = malloc(1) and processor 1 executes x2 = malloc(1); both objects land on one cache line, and the two processors thrash]
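The thrashing in the diagram can be reproduced with a small sketch (my own illustration, not from the talk): two counters that are logically private to their threads but adjacent in memory, so they almost certainly share a cache line and every write invalidates the other processor's cached copy.

```cpp
#include <cstdint>
#include <thread>
#include <utility>

// Two logically private counters placed on adjacent bytes: each
// thread only touches its own counter, yet the shared cache line
// ping-pongs between processors (allocator-induced false sharing
// arises the same way when malloc packs x1 and x2 together).
std::pair<uint64_t, uint64_t> run_counters(int iters) {
    struct Counters {
        volatile uint64_t c0 = 0;  // thread 0's counter
        volatile uint64_t c1 = 0;  // thread 1's counter, same cache line
    } shared;
    std::thread t0([&] { for (int i = 0; i < iters; ++i) shared.c0 = shared.c0 + 1; });
    std::thread t1([&] { for (int i = 0; i < iters; ++i) shared.c1 = shared.c1 + 1; });
    t0.join();
    t1.join();
    return {shared.c0, shared.c1};
}
```

The results are correct (no data is actually shared), but on a multiprocessor this loop runs far slower than the same loop with each counter padded to its own cache line.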
19. So What Do We Do Now?
- Where do we put free memory?
  - On a central heap: heap contention
  - On our own heap (pure private heaps): unbounded memory consumption
  - On the original heap (private heaps with ownership): P-fold blowup
- How do we avoid false sharing?
20. Overview
- Building memory managers
  - Heap Layers framework
- Problems with memory managers
  - Contention, space, false sharing
- Solution: a provably scalable allocator
  - Hoard
- Extended memory manager for servers
  - Reap
21. Hoard: Key Insights
- Bound local memory consumption
  - Explicitly track utilization
  - Move free memory to a global heap
  - Provably bounds memory consumption
- Manage memory in large chunks
  - Avoids false sharing
  - Reduces heap contention
22. Overview of Hoard
- Manage memory in heap blocks
  - Page-sized
  - Avoids false sharing
- Allocate from a local heap block
  - Avoids heap contention
- Low utilization? Move the heap block to the global heap
  - Avoids space blowup
[Diagram: per-processor heaps (processor 0 ... processor P-1) above a shared global heap]
23. Hoard Under the Hood
[Layer diagram:
- select heap based on size
- malloc from local heap, free to heap block
- get memory from, or return memory to, the global heap]
24. Summary of Analytical Results
- Space consumption: near-optimal worst case
  - Hoard: O(n log(M/m) + P), with P « n
  - Optimal (bin-packing): O(n log(M/m)) [Robson 70]
  - Private heaps with ownership: O(P n log(M/m))
- Provably low synchronization
(n = memory required, M = biggest object size, m = smallest object size, P = processors)
25. Empirical Results
- Measure runtime on a 14-processor Sun
- Allocators:
  - Solaris (system allocator)
  - Ptmalloc (GNU libc)
  - mtmalloc (Sun's "MT-hot" allocator)
- Micro-benchmarks:
  - Threadtest: no sharing
  - Larson: sharing (server-style)
  - Cache-scratch: mostly reads and writes (tests for false sharing)
- Real application experience is similar
26. Runtime Performance: threadtest
- Many threads, no sharing
- Hoard achieves linear speedup
speedup(x, P) = runtime(Solaris allocator on one processor) / runtime(x on P processors)
27. Runtime Performance: Larson
- Many threads, sharing (server-style)
- Hoard achieves linear speedup
28. Runtime Performance: false sharing
- Many threads, mostly reads and writes of heap data
- Hoard achieves linear speedup
29. Hoard in the Real World
- Open source code
  - www.hoard.org
  - 13,000 downloads
  - Solaris, Linux, Windows, IRIX, ...
- Widely used in industry
  - AOL, British Telecom, Novell, Philips
  - Reported 2x-10x ("impressive") improvements in performance
  - Search server, telecom billing systems, scene rendering, real-time messaging middleware, text-to-speech engine, telephony, JVM
- A scalable general-purpose memory manager
30. Custom Memory Allocation
- Programmers replace malloc/free to:
  - Attempt to increase performance
  - Provide extra functionality (e.g., for servers)
  - Reduce space (rarely)
- Empirical study of custom allocators:
  - The Lea allocator is often as fast or faster
  - Custom allocation is ineffective, except for regions
31. Overview of Regions
- Regions: separate areas, deletion only en masse
  region_create(r)
  region_malloc(r, sz)
  region_delete(r)
- Used in parsers, server applications
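The region interface above can be sketched as a small arena allocator. This is an illustrative sketch, not any particular region implementation; the chunk size and alignment are my own choices.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// A region (arena): pointer-bumping allocation out of large chunks,
// with deletion only en masse -- there is deliberately no way to
// free an individual object, which is the slide's point.
class Region {
public:
    // region_malloc: bump a pointer; grab a new chunk when needed.
    void* alloc(size_t sz) {
        sz = (sz + 7) & ~size_t(7);          // 8-byte alignment
        if (sz > remaining_) grow(sz);
        void* p = bump_;
        bump_ += sz;
        remaining_ -= sz;
        return p;
    }
    // region_delete: one call frees every object at once.
    ~Region() {
        for (char* c : chunks_) std::free(c);
    }
private:
    void grow(size_t need) {
        size_t chunkSize = need > 4096 ? need : 4096;
        char* c = static_cast<char*>(std::malloc(chunkSize));
        chunks_.push_back(c);
        bump_ = c;
        remaining_ = chunkSize;
    }
    std::vector<char*> chunks_;
    char* bump_ = nullptr;
    size_t remaining_ = 0;
};
```

Allocation is just an add and a compare, which is why parsers and servers like regions; the cost is that dead objects cannot be reclaimed until the whole region dies.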
32. Overview
- Building memory managers
  - Heap Layers framework
- Problems with memory managers
  - Contention, space, false sharing
- Solution: a provably scalable allocator
  - Hoard
- Extended memory manager for servers
  - Reap
33. Server Support
- Certain servers need additional support:
  - Process isolation
    - Multiple threads, many transactions per thread
    - Minimize accidental overwrites of unrelated data
  - Avoiding resource leaks
    - Tear down all memory associated with terminated connections or transactions
- Current approach (e.g., Apache): regions
34. Regions: Pros and Cons
- Regions: separate areas, deletion only en masse (region_create(r), region_malloc(r, sz), region_delete(r))
- Fast
  - Pointer-bumping allocation
  - Deletion of whole chunks
- Convenient
  - One call frees all memory
- Space
  - Can't free individual objects: "drag"
- Can't be used for all allocation patterns
35. Regions Are Limited
- Can't reclaim memory within regions → unbounded memory consumption
  - Long-running computations
  - Producer-consumer patterns
- Current situation for Apache:
  - Vulnerable to denial-of-service
  - Limits the runtime of connections
  - Limits module programming
- Regions are the wrong abstraction
36. Reap: Hybrid Allocator
- Reap = region + heap
  - Adds individual object deletion to regions
  reap_create(r)
  reap_malloc(r, sz)
  reap_free(r, p)
  reap_delete(r)
- Can reduce memory consumption
- Fast
  - Adapts to use (region-style or heap-style)
  - Cheap deletion
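The hybrid idea can be sketched as a region that also keeps a free list. This is my own illustration of the concept, not the Reap implementation; passing the size to free is a simplification (a real allocator records it in a header).

```cpp
#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

// Region-style bump allocation, plus heap-style reuse of
// individually freed blocks, plus en-masse deletion at the end.
class Reap {
public:
    void* alloc(size_t sz) {
        sz = roundUp(sz);
        // Heap-style: reuse an individually freed block if one fits.
        for (size_t i = 0; i < freed_.size(); ++i) {
            if (freed_[i].second >= sz) {
                void* p = freed_[i].first;
                freed_.erase(freed_.begin() + i);
                return p;
            }
        }
        // Region-style: pointer-bumping allocation.
        if (sz > remaining_) grow(sz);
        void* p = bump_;
        bump_ += sz;
        remaining_ -= sz;
        return p;
    }
    // Individual deletion: a plain region cannot do this.
    void free(void* p, size_t sz) { freed_.push_back({p, roundUp(sz)}); }
    // reap_delete: release everything at once, region-style.
    ~Reap() {
        for (char* c : chunks_) std::free(c);
    }
private:
    static size_t roundUp(size_t sz) { return (sz + 7) & ~size_t(7); }
    void grow(size_t need) {
        size_t chunkSize = need > 4096 ? need : 4096;
        char* c = static_cast<char*>(std::malloc(chunkSize));
        chunks_.push_back(c);
        bump_ = c;
        remaining_ = chunkSize;
    }
    std::vector<std::pair<void*, size_t>> freed_;
    std::vector<char*> chunks_;
    char* bump_ = nullptr;
    size_t remaining_ = 0;
};
```

A caller that never frees gets pure region behavior (the free list stays empty, so alloc is a bump); a caller that does free gets its memory recycled, which is how the hybrid adapts to use.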
37. Using Reap as Regions
- Reap performance nearly matches regions
38. Reap: In Progress
- Incorporate Reap in Apache
  - Rewrite modules to use Reap
  - Measure space savings
- Simplifies module programming; adds robustness against denial-of-service
39. Overview
- Building memory managers
  - Heap Layers framework
- Problems with memory managers
  - Contention, space, false sharing
- Solution: a provably scalable allocator
  - Hoard
- Extended memory manager for servers
  - Reap
40. Open Questions
- A Grand Unified Memory Manager?
  - Hoard + Reap
  - Integration with garbage collection
- Effective custom allocators?
  - Exploit sizes, lifetimes, locality, and sharing
- Challenges of newer architectures
  - SMT/CMP
41. Contributions
- Memory management for high-performance applications:
  - Framework for building high-quality memory managers (Heap Layers) [Berger, Zorn & McKinley, PLDI-01]
  - Provably scalable memory manager (Hoard) [Berger, McKinley, Blumofe & Wilson, ASPLOS-IX]
  - Study of custom memory allocation; hybrid high-performance memory manager for server applications (Reap) [Berger, Zorn & McKinley, OOPSLA-2002]
42. Backup Slides
43. Empirical Results: Runtime
44. Empirical Results: Space
45. Robust Resource Management
- User processes can bring down systems (DoS)
  - Fork bomb: use up all process ids
  - Malloc all memory: exhausts swap
- Current solutions:
  - Kill processes (Linux)
  - Die (Linux, Solaris, Windows)
- Proposed solutions limit utilization:
  - Quotas, proportional shares
- Insight: as resources become scarce, make them cost more (apply an economic model)
46. Future Work
- Performance, scalability, and robustness
- Short term:
  - Memory management
    - False sharing
    - Robust garbage collection for multiprogrammed systems (with McKinley, Blackburn & Stylos)
    - Locality: self-reorganizing data structures
  - Compiler-based static error detection [Guyer, Berger & Lin, in preparation]
- Longer term:
  - Safety and security as dataflow problems
  - Integration of OS/runtime/compiler
47. Rockall: Larson
48. Rockall: Threadtest
49. Hoard: Conclusions
- As fast as uniprocessor allocators
- Performance linear in the number of processors
- Avoids false sharing
- Worst-case space provably near optimal
- A scalable general-purpose memory manager
50. Conceptually Modular
[Diagram: malloc and free as composable interfaces]
51. A Real Memory Manager
- Modular design and implementation: KingsleyHeap
[Layer diagram: select heap based on size; add size info to objects; manage objects on a freelist; malloc/free]
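The "select heap based on size" layer in the diagram can be sketched as one more mixin. This is an illustrative sketch in the Heap Layers style; the names, the threshold parameter, and the sized free are my simplifications (a real layer records the size in an object header instead).

```cpp
#include <cstddef>
#include <cstdlib>

// Bottom layer: forwards to the system allocator.
class SystemHeap {
public:
    void* malloc(size_t sz) { return std::malloc(sz); }
    void free(void* p) { std::free(p); }
};

// Routing layer: requests at or below Threshold bytes go to
// SmallHeap, larger requests go to BigHeap.
template <size_t Threshold, class SmallHeap, class BigHeap>
class SizeSelectHeap {
public:
    void* malloc(size_t sz) {
        return sz <= Threshold ? small_.malloc(sz) : big_.malloc(sz);
    }
    // Simplification: the caller supplies the size so we can route
    // the free; a header layer would normally remember it.
    void free(void* p, size_t sz) {
        if (sz <= Threshold) small_.free(p);
        else big_.free(p);
    }
private:
    SmallHeap small_;
    BigHeap big_;
};
```

Stacking this on top of a freelist layer and a header layer gives the KingsleyHeap structure the slide describes, with each policy isolated in its own layer.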
52. Conclusion
- Memory management for high-performance applications:
  - Heap Layers framework [PLDI 2001]
    - Reusable components, no runtime cost
  - Hoard: scalable memory manager [ASPLOS-IX]
    - High-performance, provably scalable and space-efficient
  - Reap: hybrid memory manager [in preparation]
    - Provides speed and robustness for server applications
- Future work: memory management, resource management, static error detection
53. Regions Waste Memory
- Drag: wasted memory caused by unreclaimed dead objects
54. Reap Under the Hood
[Layer diagram: manage pool of free chunks; add headers, or manage as a heap; select region or heap behavior; support for nesting (hierarchy of reaps)]
55. Memory Management
- Unifying Reap & Hoard
- Comprehensive evaluation of custom allocators [in preparation]
  - Not only bad software engineering, but also largely ineffective
- GC to reduce working-set size (with Steve Blackburn & Jeffrey Stylos @ UMass, Kathryn McKinley)
  - Multiprogrammed systems
    - Twofold increase in working set → half as many programs resident, lots of paging
  - Cooperation between GC & VM
    - Examples: VM about to page out? Trigger GC. GC about to scan? Query the VM for residency.
56. Robust Resource Management (identical to slide 45)
57. Static Error Detection
- Many errors we'd like to detect statically:
  - Usage errors (double locks, sockets, files)
  - Information leaks
  - Security
- Lots of recent work:
  - Syntactic state machines (Metal) [Engler]
  - Program abstraction + theorem proving (SLAM) [Ball]
  - Type systems [Shankar, Foster]
- What's the right approach?
58. Error Detection: Observations
- Programmer time is expensive
  - Annotations or specifications are a lot of work
  - Type-theory approaches require intervention
- Computer time is cheap (but not unbounded!)
  - Theorem-based flow-sensitive techniques are still exponential
- False positives must be near zero to be useful
- Sound analysis is invaluable for security
  - Lexical approaches are unsound
59. Detecting Errors with Configurable Whole-Program Dataflow Analysis
- Insight: model errors as dataflow analysis problems on an aggressive compiler framework (with Samuel Guyer, Calvin Lin)
  - Interprocedural, flow-sensitive, precise pointer analysis
  - Drive the analysis with a simple dataflow language
- Very promising results:
  - Captures an extremely general class of errors
  - Fast (7 minutes for 200 KLOC); works on unmodified C programs
  - Lowest reported false positive rates (none!) on format string vulnerability problems
60. Example: Format String Vulnerability
- Subject of numerous CERT advisories
  - Improper use of the printf() family
  - Enables stack-smashing attacks
- Solution: taintedness analysis [Foster et al., Perl]
  - Data from untrusted sources is tainted
  - Ensure tainted data may not end up in a format string

  fgets(buffer, size, file);
  printf(buffer);   /* tainted format string */
61. Taintedness Dataflow Analysis
- Taintedness lattice: property Taint = { Tainted, Untainted }
- Transfer functions:
  - Functions that produce tainted data: scanf(), getenv(), read()
  - Functions that pass on taintedness: strdup(), strcpy(), sprintf()
- Error conditions:
  - Functions with format string arguments: printf() family, syslog()
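The lattice and transfer functions above can be expressed as a tiny sketch. This is my own executable illustration of the concept, not the dataflow language the analysis actually uses.

```cpp
#include <algorithm>

// Two-point taintedness lattice: Untainted below Tainted.
enum Taint { Untainted = 0, Tainted = 1 };

// Lattice join: merging two dataflow facts keeps the more
// pessimistic (tainted) one.
Taint join(Taint a, Taint b) { return std::max(a, b); }

// Source: models scanf()/getenv()/read(), which produce tainted data.
Taint source() { return Tainted; }

// Transfer: models strdup()/strcpy()/sprintf(), which pass
// taintedness from their input to their output.
Taint transfer(Taint input) { return input; }

// Error condition: a format-string argument (printf() family,
// syslog()) must never be Tainted.
bool formatStringError(Taint fmt) { return fmt == Tainted; }
```

Running these rules over the slide-60 example, fgets makes buffer Tainted, the fact flows unchanged to printf, and the error condition fires.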
62. Test Programs
- Five programs:
  - bftpd: FTP daemon
  - muh: IRC proxy
  - named: name server in the BIND package
  - lpd: print daemon
  - cfengine: system administration tool
- All programs have the format string vulnerability
- Note that these are real programs:
  - Mature versions were distributed with the bug
  - Exploits are known and have been used to compromise machines
63. Results
- Run on a Pentium 4, 2 GHz, 512 MB RAM

Program    Lines    Procedures  Time   Errors  Found  False Positives
bftpd      1,017    180         0:01   1       1      0
muh        5,002    228         0:06   1       1      0
named      25,820   444         1:11   1       1      0
lpd        38,174   726         23:57  1       1      0
cfengine   45,102   700         6:38   6       6      0
64. Research Methodology
- Transparently improve program and programmer efficiency
  - Compilers, runtime systems, OS involvement
- Apply theory and rigorous experimentation to systems development
- Seek economically insensitive problems
  - Brains, not money
65. Conclusions
- Memory management support for high-performance applications:
  - Framework for building high-quality allocators (Heap Layers) [Berger, Zorn & McKinley, PLDI-01]
  - Scalable high-performance general-purpose allocator (Hoard) [Berger, McKinley, Blumofe & Wilson, ASPLOS-IX]
  - Extended high-performance allocator for server applications (Reap) [Berger, Zorn & McKinley, to be submitted]
66. Reap: Conclusions
- Custom allocation is often not much faster
  - Use a better general-purpose allocator instead
- Important exception: regions
  - Fast, but waste space
- Reap: the best of both worlds
  - Extends general-purpose allocation
  - Fast, space-efficient, flexible
- Eliminates the need for most custom allocators
67. Experimental Methodology
- Built analogous allocators using heap layers:
  - KingsleyHeap (BSD allocator)
  - LeaHeap (based on Lea allocator 2.7.0)
  - Three weeks to develop
  - 500 lines vs. 2,000 lines in the original
- Compared performance with the originals
  - SPEC2000 and standard allocation benchmarks
68. Custom Allocators Are Effective
- On average 30% faster than the system allocator wrapper
69. Custom Allocators Not Effective
- The Lea allocator wrapper is as fast as custom allocation, except for regions
70. Reap: Runtime
71. Runtime Compared to General-Purpose Allocators
72. Space Compared to General-Purpose Allocators
73. A General-Purpose Memory Manager for High-Performance Applications
- To support high-performance applications, a memory manager requires:
  - Speed
    - Fast malloc/free
  - Scalability
    - Performance linear in the number of processors
  - Space efficiency
    - Worst-case and average-case
74. Contributions
- Heap Layers: framework for building memory managers
  - Reusable components, no runtime cost
  - Simplifies implementations; a good experimental platform
- Hoard: scalable general-purpose memory manager
  - High-performance, provably scalable and space-efficient
- Comprehensive evaluation of custom allocators
  - Not only bad software engineering, but also largely ineffective
- Reap: hybrid general-purpose memory manager
  - Combines regions and heaps
  - Provides robustness for server applications
75. Uniprocessor Memory Allocators
- Standard on many operating systems
- Scalability: poor
  - Heap contention: a single lock protects the heap
- Space: excellent
  - Nearly optimal for most programs [Wilson & Johnstone, 2000]
76. Example: Debugging Heap Layer
- DebugHeap
  - Protects against invalid and multiple frees
- DebugLockedMallocHeap
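A DebugHeap layer can be sketched as a mixin that tracks live allocations. This is an illustration in the Heap Layers style, not the framework's code; returning a bool from free (instead of aborting) is my simplification so the check is easy to observe.

```cpp
#include <cstdlib>
#include <set>

// Bottom layer: forwards to the system allocator.
class MallocHeap {
public:
    void* malloc(size_t sz) { return std::malloc(sz); }
    void free(void* p) { std::free(p); }
};

// Debugging mixin: remember every live pointer, and reject a free
// of any pointer that was never allocated or was already freed.
template <class SuperHeap>
class DebugHeap : public SuperHeap {
public:
    void* malloc(size_t sz) {
        void* p = SuperHeap::malloc(sz);
        if (p) live_.insert(p);
        return p;
    }
    // Returns false (and does nothing) on an invalid or double free.
    bool free(void* p) {
        if (live_.erase(p) == 0) return false;
        SuperHeap::free(p);
        return true;
    }
private:
    std::set<void*> live_;
};

using DebugMallocHeap = DebugHeap<MallocHeap>;
```

Wrapping the same layer around a locked heap would give the DebugLockedMallocHeap composition the slide names.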
77. Hoard: Example
- malloc from a heap block on the local heap
- free returns memory to its heap block
- Local heap too empty? Move a heap block to the global heap
- Empty fraction: 1/3
[Diagram: processor 0's heap and the global heap; x1 = malloc(1), some mallocs, some frees, free(x7)]
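The emptiness check driving this example can be sketched as follows. This is my own simplified illustration (Hoard's actual invariant also involves the number of superblocks); the struct and constant names are mine.

```cpp
#include <cstddef>

// Per-processor heap statistics, as the slide's "explicitly track
// utilization" bullet suggests.
struct LocalHeapStats {
    size_t inUse;     // bytes currently allocated from this heap
    size_t capacity;  // total bytes held in this heap's blocks
};

// With an empty fraction of 1/3 (as in the example), release a heap
// block to the global heap once more than a third of the local
// heap's memory is free, so other processors can reuse the space.
bool shouldReleaseBlock(const LocalHeapStats& h) {
    const size_t emptyFractionDenom = 3;  // f = 1/3
    return h.capacity > 0 &&
           (h.capacity - h.inUse) * emptyFractionDenom > h.capacity;
}
```

Bounding how empty a local heap may get is exactly what rules out the unbounded and P-fold blowups of the earlier private-heap schemes.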
78. Hoard: Details
- Segregated size-class allocator
  - Size classes above 4K are logarithmically spaced
- Superblocks hold objects of one size class
  - Empty superblocks are recycled
- Approximately radix-sorted
  - Allocate from mostly-full superblocks
  - Fast removal of mostly-empty superblocks
[Diagram: size-class bins (8, 16, 24, 32, 40, 48, ...) pointing to radix-sorted superblock lists, ordered emptiest to fullest]
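The size-class scheme on this slide can be sketched as a small mapping function. The constants here are illustrative, not Hoard's actual values: linear 8-byte spacing for small objects, and logarithmic (power-of-two) spacing above 4K, as the slide describes.

```cpp
#include <cstddef>

// Map a request size to its segregated size class: the bin from
// which a superblock of identically-sized objects will serve it.
size_t sizeClass(size_t sz) {
    const size_t step = 8;       // linear spacing: 8, 16, 24, ...
    const size_t cutoff = 4096;  // above this, classes double
    if (sz <= cutoff) {
        // Round up to the next multiple of 8.
        return (sz + step - 1) / step * step;
    }
    // Logarithmic spacing: round up to the next power of two.
    size_t cls = cutoff;
    while (cls < sz) cls *= 2;
    return cls;
}
```

Coarse classes waste a little space inside each object but keep the number of bins small and make the bin lookup constant-time, which is the usual trade-off in segregated allocators.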