1
Memory Management for High-Performance Applications
Emery Berger
Advisor: Kathryn S. McKinley
Department of Computer Sciences
2
High-Performance Applications
  • Web servers, search engines, scientific codes
  • C or C++
  • Run on one or a cluster of server boxes
  • Needs support at every level

[Figure: the support stack (software, compiler, runtime system, operating system, hardware)]
3
New Applications, Old Memory Managers
  • Applications and hardware have changed
  • Multiprocessors now commonplace
  • Object-oriented, multithreaded
  • Increased pressure on the memory manager (malloc,
    free)
  • But memory managers have not changed
  • Inadequate support for modern applications

4
Current Memory Managers Limit Scalability
  • As we add processors, program slows down
  • Caused by heap contention

[Figure: Larson server benchmark on a 14-processor Sun]
5
The Problem
  • Current memory managers are inadequate for
    high-performance applications on modern
    architectures
  • Limit scalability, application performance, and
    robustness

6
Contributions
  • Building memory managers
  • Heap Layers framework
  • Problems with current memory managers
  • Contention, false sharing, space
  • Solution: provably scalable memory manager
  • Hoard
  • Extended memory manager for servers
  • Reap

7
Implementing Memory Managers
  • Memory managers must be
  • Space efficient
  • Very fast
  • Heavily-optimized code
  • Hand-unrolled loops
  • Macros
  • Monolithic functions
  • Hard to write, reuse, or extend

8
Building Modular Memory Managers
  • Classes
  • Rigid hierarchy
  • Overhead
  • Mixins
  • Flexible hierarchy
  • No overhead

9
A Heap Layer
  • Mixin with malloc & free methods

template <class SuperHeap>
class GreenHeapLayer : public SuperHeap {};
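
As a concrete illustration, here is a minimal, self-contained sketch of the idiom; CountingHeapLayer and its behavior are hypothetical, standing in for the slide's GreenHeapLayer placeholder:

#include <cstddef>
#include <cstdlib>

// Bottom layer: obtains memory from the system allocator.
class MallocHeap {
public:
  void * malloc (size_t sz) { return std::malloc (sz); }
  void free (void * ptr)    { std::free (ptr); }
};

// A hypothetical layer: counts allocations, then delegates
// to whatever superheap it is mixed over.
template <class SuperHeap>
class CountingHeapLayer : public SuperHeap {
public:
  void * malloc (size_t sz) {
    ++_count;
    return SuperHeap::malloc (sz);
  }
  size_t count () const { return _count; }
private:
  size_t _count = 0;
};

// Layers compose by template instantiation; no virtual calls.
typedef CountingHeapLayer<MallocHeap> CountedMallocHeap;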
10
Example: Thread-Safe Heap Layer
  • LockedHeap
  • Protects the superheap with a lock

[Figure: LockedMallocHeap = LockedHeap mixed over MallocHeap]
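
A minimal sketch of such a layer, assuming the MallocHeap and mixin convention from the sketch above, with std::mutex standing in for whatever lock the real LockedHeap uses:

#include <cstddef>
#include <mutex>

// Thread-safe layer: serialize malloc and free with one lock,
// then defer to the superheap.
template <class SuperHeap>
class LockedHeap : public SuperHeap {
public:
  void * malloc (size_t sz) {
    std::lock_guard<std::mutex> guard (_lock);
    return SuperHeap::malloc (sz);
  }
  void free (void * ptr) {
    std::lock_guard<std::mutex> guard (_lock);
    SuperHeap::free (ptr);
  }
private:
  std::mutex _lock;
};

// The slide's composition: a thread-safe heap over the system allocator.
// typedef LockedHeap<MallocHeap> LockedMallocHeap;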
11
Empirical Results
  • Heap Layers vs. originals
  • KingsleyHeap vs. BSD allocator
  • LeaHeap vs. DLmalloc 2.7
  • Competitive runtime and memory efficiency

12
Overview
  • Building memory managers
  • Heap Layers framework
  • Problems with memory managers
  • Contention, space, false sharing
  • Solution: provably scalable allocator
  • Hoard
  • Extended memory manager for servers
  • Reap

13
Problems with General-Purpose Memory Managers
  • Previous work for multiprocessors
  • Concurrent single heap [Bigler et al. 85, Johnson
    91, Iyengar 92]
  • Impractical
  • Multiple heaps [Larson 98, Gloger 99]
  • Reduce contention but, as we show, cause other problems
  • P-fold or even unbounded increase in space
  • Allocator-induced false sharing
14
Multiple Heap Allocator: Pure Private Heaps
  • One heap per processor
  • malloc gets memory from its local heap
  • free puts memory on its local heap
  • STL, Cilk, ad hoc

[Figure: allocation trace on processors 0 and 1 (x1 = malloc(1), x2 = malloc(1), free(x1), free(x2), x4 = malloc(1), x3 = malloc(1), free(x3), free(x4)). Key: in use, processor 0; free, on heap 1]
15
Problem: Unbounded Memory Consumption
  • Producer-consumer
  • Processor 0 allocates
  • Processor 1 frees
  • Unbounded memory consumption
  • Crash!

[Figure: processor 0 allocates x1, x2, x3, ...; processor 1 frees each one onto its own heap, so the memory never returns to processor 0]
16
Multiple Heap Allocator: Private Heaps with Ownership
  • free returns memory to the original heap (see the sketch below)
  • Bounded memory consumption
  • No crash!
  • Ptmalloc (Linux), LKmalloc

[Figure: processor 0 allocates and frees x1; processor 1 allocates and frees x2; each free returns the object to the heap that allocated it]
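
A minimal sketch of the ownership scheme (all names hypothetical; for brevity it handles a single object size, where a real allocator would segregate free lists by size class):

#include <cstddef>
#include <cstdlib>
#include <mutex>

const int NHEAPS = 16;     // hypothetical: one heap per processor

// Every object carries a header naming the heap it came from.
struct Header {
  Header * next;           // freelist link while the object is free
  int      owner;          // index of the owning heap
};

struct Heap {
  std::mutex lock;
  Header *   freelist = nullptr;
};

Heap heaps[NHEAPS];

void * my_malloc (int myHeap, size_t sz) {
  Heap & h = heaps[myHeap];
  {
    std::lock_guard<std::mutex> g (h.lock);
    if (h.freelist != nullptr) {       // reuse memory freed to this heap
      Header * obj = h.freelist;
      h.freelist = obj->next;
      return obj + 1;
    }
  }
  Header * obj = (Header *) std::malloc (sizeof (Header) + sz);
  obj->owner = myHeap;                 // remember where it belongs
  return obj + 1;
}

void my_free (void * ptr) {
  Header * obj = (Header *) ptr - 1;
  Heap & h = heaps[obj->owner];        // return to the ORIGINAL heap,
  std::lock_guard<std::mutex> g (h.lock);  // not the caller's
  obj->next = h.freelist;
  h.freelist = obj;
}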
17
Problem: P-fold Memory Blowup
  • Occurs in practice
  • Round-robin producer-consumer
  • processor i mod P allocates
  • processor (i+1) mod P frees
  • Footprint = 1 (2GB), but space = 3 (6GB)
  • Exceeds 32-bit address space: Crash!

[Figure: round-robin trace (x1 = malloc(1) on processor 0, free(x1) on processor 1, x2 = malloc(1) on processor 1, free(x2) on processor 2, x3 = malloc(1) on processor 2, free(x3) on processor 0)]
18
Problem: Allocator-Induced False Sharing
  • False sharing
  • Non-shared objects on the same cache line
  • Bane of parallel applications
  • Extensively studied
  • All these allocators cause false sharing! (demonstrated below)

[Figure: x1 = malloc(1) on processor 0 and x2 = malloc(1) on processor 1 share one cache line; both processors thrash]
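
The effect can be seen in a small hypothetical program: if the allocator packs the two one-byte objects into one cache line, the threads invalidate each other's caches on every write even though they share no data.

#include <cstdlib>
#include <thread>

int main () {
  char * x1 = (char *) std::malloc (1);  // may share a cache line
  char * x2 = (char *) std::malloc (1);  // with x1

  auto writer = [] (char * p) {
    for (int i = 0; i < 100000000; ++i) {
      *p = (char) i;                     // ping-pongs the cache line
    }
  };

  std::thread t0 (writer, x1);
  std::thread t1 (writer, x2);
  t0.join ();
  t1.join ();
  std::free (x1);
  std::free (x2);
  return 0;
}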
19
So What Do We Do Now?
  • Where do we put free memory?
  • on central heap
  • on our own heap (pure private heaps)
  • on the original heap (private heaps with
    ownership)
  • How do we avoid false sharing?
  • Heap contention
  • Unbounded memory consumption
  • P-fold blowup

20
Overview
  • Building memory managers
  • Heap Layers framework
  • Problems with memory managers
  • Contention, space, false sharing
  • Solution: provably scalable allocator
  • Hoard
  • Extended memory manager for servers
  • Reap

21
Hoard Key Insights
  • Bound local memory consumption
  • Explicitly track utilization
  • Move free memory to a global heap (policy sketched below)
  • Provably bounds memory consumption
  • Manage memory in large chunks
  • Avoids false sharing
  • Reduces heap contention
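
A minimal sketch of the release policy these insights lead to (structure and constants are schematic; the empty fraction of 1/3 comes from the example slide later in the deck):

#include <cstddef>

// Per-processor bookkeeping: how much of the memory this heap
// holds is actually live.
struct LocalHeap {
  size_t in_use    = 0;   // bytes of live objects on this heap
  size_t allocated = 0;   // bytes held in this heap's chunks
};

const double EMPTY_FRACTION = 1.0 / 3.0;  // from the example slide
const size_t K_CHUNKS       = 4;          // hypothetical slack

// Hoard's invariant, roughly: a local heap may hold at most a
// constant number of chunks' worth of free memory, and at most a
// constant fraction of its memory may be unused.  When a free
// drops utilization below both bounds, move an empty-enough chunk
// to the global heap so other processors can reuse it.
bool should_release (const LocalHeap & h, size_t chunkSize) {
  return (h.allocated - h.in_use > K_CHUNKS * chunkSize)
      && (h.in_use < (1.0 - EMPTY_FRACTION) * h.allocated);
}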

22
Overview of Hoard
  • Manage memory in heap blocks
  • Page-sized
  • Avoids false sharing
  • Allocate from the local heap block
  • Avoids heap contention
  • Low utilization → move heap block to the global heap
  • Avoids space blowup

[Figure: per-processor heaps (0 through P-1) exchanging heap blocks with the global heap]
23
Hoard Under the Hood
[Figure: Hoard's layers (select heap based on size; malloc from local heap, free to its heap block; get or return memory to the global heap)]
24
Summary of Analytical Results
  • Space consumption: near-optimal worst-case
  • Hoard: O(n log(M/m) + P), with P « n
  • Optimal: O(n log(M/m)) [Robson 70] (bin-packing)
  • Private heaps with ownership: O(P n log(M/m))
  • Provably low synchronization

n = memory required, M = biggest object size, m = smallest object size, P = processors
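
The same bounds, restated as formulas:

\begin{align*}
\text{Optimal [Robson 70]:} \quad & O\left(n \log \tfrac{M}{m}\right) \\
\text{Hoard:} \quad & O\left(n \log \tfrac{M}{m} + P\right), \quad P \ll n \\
\text{Private heaps with ownership:} \quad & O\left(P \, n \log \tfrac{M}{m}\right)
\end{align*}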
25
Empirical Results
  • Measure runtime on 14-processor Sun
  • Allocators
  • Solaris (system allocator)
  • Ptmalloc (GNU libc)
  • mtmalloc (Sun's MT-hot allocator)
  • Micro-benchmarks
  • Threadtest: no sharing
  • Larson: sharing (server-style)
  • Cache-scratch: mostly reads & writes (tests for
    false sharing)
  • Real application experience similar

26
Runtime Performance: threadtest
  • Many threads, no sharing
  • Hoard achieves linear speedup

speedup(x, P) = runtime(Solaris allocator, one
processor) / runtime(x on P processors)
27
Runtime Performance: Larson
  • Many threads, sharing (server-style)
  • Hoard achieves linear speedup

28
Runtime Performance: false sharing
  • Many threads, mostly reads & writes of heap data
  • Hoard achieves linear speedup

29
Hoard in the Real World
  • Open source code
  • www.hoard.org
  • 13,000 downloads
  • Solaris, Linux, Windows, IRIX, …
  • Widely used in industry
  • AOL, British Telecom, Novell, Philips
  • Reports: 2x-10x, "impressive" improvement in
    performance
  • Search server, telecom billing systems, scene
    rendering, real-time messaging middleware,
    text-to-speech engine, telephony, JVM
  • Scalable general-purpose memory manager

30
Custom Memory Allocation
  • Programmers replace malloc/free
  • Attempt to increase performance
  • Provide extra functionality (e.g., for servers)
  • Reduce space (rarely)
  • Empirical study of custom allocators
  • Lea allocator often as fast or faster
  • Custom allocation ineffective, except for
    regions.

31
Overview of Regions
  • Regions: separate areas, deletion only en masse

region_create(r)
region_malloc(r, sz)
region_delete(r)
  • Used in parsers, server applications (a sketch follows below)
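
A minimal sketch of a region allocator matching the API above (chunk size and names hypothetical; assumes each request fits in one chunk):

#include <cstddef>
#include <cstdlib>

struct Chunk {
  Chunk * next;                        // chunks are freed en masse
  char    data[8192 - sizeof (Chunk *)];
};

struct Region {
  Chunk * chunks = nullptr;            // all chunks owned by this region
  char *  bump   = nullptr;            // next free byte in current chunk
  char *  limit  = nullptr;            // end of current chunk
};

Region * region_create () { return new Region; }

void * region_malloc (Region * r, size_t sz) {
  if ((size_t)(r->limit - r->bump) < sz) {   // current chunk exhausted
    Chunk * c = (Chunk *) std::malloc (sizeof (Chunk));
    c->next   = r->chunks;
    r->chunks = c;
    r->bump   = c->data;
    r->limit  = c->data + sizeof (c->data);
  }
  void * p = r->bump;                        // pointer-bumping allocation
  r->bump += sz;
  return p;
}

void region_delete (Region * r) {            // one call frees everything
  for (Chunk * c = r->chunks; c != nullptr; ) {
    Chunk * next = c->next;
    std::free (c);
    c = next;
  }
  delete r;
}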

32
Overview
  • Building memory managers
  • Heap Layers framework
  • Problems with memory managers
  • Contention, space, false sharing
  • Solution: provably scalable allocator
  • Hoard
  • Extended memory manager for servers
  • Reap

33
Server Support
  • Certain servers need additional support
  • Process isolation
  • Multiple threads, many transactions per thread
  • Minimize accidental overwrites of unrelated data
  • Avoid resource leaks
  • Tear down all memory associated with terminated
    connections or transactions
  • Current approach (e.g., Apache): regions

34
Regions Pros and Cons
  • Regions: separate areas, deletion only en masse

region_create(r)
region_malloc(r, sz)
region_delete(r)
  • Fast
  • Pointer-bumping allocation
  • Deletion of chunks
  • Convenient
  • One call frees all memory
  • Space
  • Can't free objects
  • Drag
  • Can't use for all allocation patterns

35
Regions Are Limited
  • Can't reclaim memory in regions → unbounded
    memory consumption
  • Long-running computations
  • Producer-consumer patterns
  • Current situation for Apache:
  • vulnerable to denial-of-service
  • limits runtime of connections
  • limits module programming
  • Regions: wrong abstraction

36
Reap: Hybrid Allocator
  • Reap = region + heap
  • Adds individual object deletion (heap-style frees)

reap_create(r)
reap_malloc(r, sz)
reap_free(r, p)
reap_delete(r)
  • Can reduce memory consumption
  • Fast
  • Adapts to use, region- or heap-style (usage sketched below)
  • Cheap deletion
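
A hypothetical use of this API in a server: region-style bulk teardown, plus heap-style early frees where they help.

#include <cstddef>

// Declarations assumed from the reap API named above.
struct Reap;
Reap * reap_create ();
void * reap_malloc (Reap * r, size_t sz);
void   reap_free   (Reap * r, void * p);
void   reap_delete (Reap * r);

void handle_transaction () {
  Reap * r = reap_create ();
  char * header = (char *) reap_malloc (r, 128);
  char * body   = (char *) reap_malloc (r, 4096);

  reap_free (r, body);   // heap-style: this space can be reused
                         // by later reap_mallocs on the same reap
  // ... more per-transaction allocation ...

  reap_delete (r);       // region-style: frees header and the rest
}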

37
Using Reap as Regions
Reap performance nearly matches regions
38
Reap In Progress
  • Incorporate Reap in Apache
  • Rewrite modules to use Reap
  • Measure space savings
  • Simplifies module programming & adds robustness
    against denial-of-service

39
Overview
  • Building memory managers
  • Heap Layers framework
  • Problems with memory managers
  • Contention, space, false sharing
  • Solution: provably scalable allocator
  • Hoard
  • Extended memory manager for servers
  • Reap

40
Open Questions
  • Grand Unified Memory Manager?
  • Hoard + Reap
  • Integration with garbage collection
  • Effective Custom Allocators?
  • Exploit sizes, lifetimes, locality and sharing
  • Challenges of newer architectures
  • SMT/CMP

41
Contributions
  • Memory management for high-performance
    applications
  • Framework for building high-quality memory
    managers (Heap Layers) [Berger, Zorn & McKinley,
    PLDI-01]
  • Provably scalable memory manager
    (Hoard) [Berger, McKinley, Blumofe & Wilson,
    ASPLOS-IX]
  • Study of custom memory allocation & hybrid
    high-performance memory manager for server
    applications (Reap) [Berger, Zorn & McKinley,
    OOPSLA-2002]

42
Backup Slides
43
Empirical Results, Runtime
44
Empirical Results, Space
45
Robust Resource Management
  • User processes can bring down systems (= DoS)
  • Current solutions
  • Kill processes (Linux)
  • Die (Linux, Solaris, Windows)
  • Proposed solutions limit utilization
  • Quotas, proportional shares
  • Insight: as resources become scarce, make them
    cost more (apply an economic model)

Fork bomb: uses all process ids
  • Malloc all memory: exhausts swap

46
Future Work
  • Performance, scalability, and robustness
  • Short-term
  • Memory management
  • False sharing
  • Robust garbage collection for multiprogrammed
    systems (with McKinley, Blackburn & Stylos)
  • Locality: self-reorganizing data structures
  • Compiler-based static error detection [Guyer,
    Berger & Lin, in preparation]
  • Longer term
  • Safety & security as dataflow problems
  • Integration of OS/runtime/compiler

47
Rockall: Larson
48
Rockall: Threadtest
49
Hoard Conclusions
  • As fast as uniprocessor allocators
  • Performance linear in number of processors
  • Avoids false sharing
  • Worst-case provably near optimal
  • Scalable general-purpose memory manager

50
Conceptually Modular
[Figure: the allocator interface (malloc, free)]
51
A Real Memory Manager
  • Modular design and implementation (composition sketched below)

[Figure: KingsleyHeap as layers (malloc/free interface; select heap based on size; add size info to objects; manage objects on freelists)]
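
A compact, self-contained sketch of how two of those layers might look (all names hypothetical, following the mixin convention from the earlier slides; the real KingsleyHeap also selects among many freelists by request size):

#include <cstddef>
#include <cstdlib>

class MallocHeap {                       // bottom: system allocator
public:
  void * malloc (size_t sz) { return std::malloc (sz); }
  void free (void * ptr)    { std::free (ptr); }
};

template <class SuperHeap>
class SizeHeap : public SuperHeap {      // add size info to objects
public:
  void * malloc (size_t sz) {
    size_t * p = (size_t *) SuperHeap::malloc (sz + sizeof (size_t));
    *p = sz;                             // stash the size in a header
    return p + 1;
  }
  void free (void * ptr) { SuperHeap::free ((size_t *) ptr - 1); }
  static size_t getSize (void * ptr) { return ((size_t *) ptr)[-1]; }
};

template <class SuperHeap>
class FreelistHeap : public SuperHeap {  // manage objects on a freelist
public:
  void * malloc (size_t sz) {
    if (_free != nullptr) {              // recycle a freed object
      void * p = _free;
      _free = *(void **) p;
      return p;
    }
    return SuperHeap::malloc (sz);
  }
  void free (void * ptr) {               // link object into the freelist
    *(void **) ptr = _free;              // (assumes sz >= sizeof(void*))
    _free = ptr;
  }
private:
  void * _free = nullptr;
};

// One freelist only suits one object size; the full design adds a
// layer that selects a freelist based on the request size.
typedef FreelistHeap<SizeHeap<MallocHeap> > TinyKingsleyHeap;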
52
Conclusion
  • Memory management for high-performance
    applications
  • Heap Layers framework [PLDI 2001]
  • Reusable components, no runtime cost
  • Hoard: scalable memory manager [ASPLOS-IX]
  • High-performance, provably scalable &
    space-efficient
  • Reap: hybrid memory manager [in preparation]
  • Provides speed & robustness for server
    applications
  • Future work: memory management, resource
    management, static error detection [in
    preparation]

53
Regions Waste Memory
  • Drag: wasted memory caused by unreclaimed dead
    objects

54
Reap Under the Hood
[Figure: Reap's layers (select region or heap behavior; just add headers, or manage as a heap; manage a pool of free chunks; support for nesting, a hierarchy of reaps)]
55
Memory Management
  • Unifying Reap & Hoard
  • Comprehensive evaluation of custom allocators [in
    preparation]
  • Not only bad software engineering, also largely
    ineffective
  • GC to reduce working set size (with Steve
    Blackburn & Jeffrey Stylos @ UMass, Kathryn
    McKinley)
  • Multiprogrammed systems
  • Twofold increase in WS → half as many programs
    resident, lots of paging
  • Cooperation between GC & VM
  • Examples: VM about to page out → trigger GC; GC
    about to scan → query VM for residency

57
Static Error Detection
  • Many errors we'd like to detect statically
  • Usage errors (double locks, sockets, files)
  • Information leaks
  • Security
  • Lots of recent work
  • Syntactic state machines (Metal) [Engler]
  • Program abstraction & theorem proving (SLAM)
    [Ball]
  • Type systems [Shankar, Foster]
  • What's the right approach?

58
Error Detection Observations
  • Programmer time is expensive
  • Annotations or specifications are a lot of work
  • Type theory approaches require intervention
  • Computer time is cheap (but not unbounded!)
  • Theorem-based flow-sensitive techniques still
    exponential
  • False positives must be near zero to be useful
  • Sound analysis invaluable for security
  • Lexical approaches unsound

59
Detecting Errors with Configurable Whole-Program
Dataflow Analysis
  • Insight: model errors as dataflow analysis
    problems on an aggressive compiler framework (with
    Samuel Guyer, Calvin Lin)
  • Interprocedural, flow-sensitive, precise pointer
    analysis
  • Drive the analysis with a simple dataflow language
  • Very promising results
  • Captures an extremely general class of errors
  • Fast (7 minutes for 200 KLOC), works on unmodified
    C programs
  • Lowest reported false positive rates (none!) on
    format string vulnerability problems

60
Example Format String Vulnerability
  • Subject of numerous CERT advisories
  • Improper use of printf() family
  • Enables stack-smashing attacks
  • Solution: taintedness analysis [Foster et al.,
    Perl]
  • Data from untrusted sources is tainted
  • Ensure tainted data may not end up in a format
    string

fgets(buffer, size, file);   /* buffer is now tainted */
printf(buffer);              /* tainted data used as the format string! */
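
The standard fix, shown as a minimal sketch (function name hypothetical): pass untrusted data only as an argument, never as the format string itself.

#include <cstdio>

void echo_line (FILE * file, char * buffer, int size) {
  fgets (buffer, size, file);   // buffer is now tainted (user input)
  // printf (buffer);           // VULNERABLE: user controls the format
  printf ("%s", buffer);        // safe: tainted data is an argument only
}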
61
Taintedness Dataflow Analysis
  • Taintedness lattice
  • Transfer functions
  • Functions that produce tainted data
  • scanf(), getenv(), read()
  • Functions that pass on taintedness
  • strdup(), strcpy(), sprintf()
  • Error conditions
  • Functions with format string arguments
  • printf() family, syslog()

property Taint : { Tainted, Untainted }
62
Test programs
  • Five programs
  • bftpd: FTP daemon
  • muh: IRC proxy
  • named: name server in the BIND package
  • lpd: print daemon
  • cfengine: system administration tool
  • All programs have the format string vulnerability
  • Notice these are real programs
  • Mature versions were distributed with the bug
  • Exploits are known and have been used to
    compromise machines

63
Results
  • Run on a Pentium 4 2 GHz, 512 MB RAM

Program    Lines    Procedures   Time    Errors   Found   False Positives
bftpd       1,017          180    0:01        1       1                 0
muh         5,002          228    0:06        1       1                 0
named      25,820          444    1:11        1       1                 0
lpd        38,174          726   23:57        1       1                 0
cfengine   45,102          700    6:38        6       6                 0
64
Research Methodology
  • Transparently improve program and programmer
    efficiency
  • Compilers, runtime systems, OS involvement
  • Apply theory and rigorous experimentation to
    systems development
  • Seek economically insensitive problems
  • Brains, not money

65
Conclusions
  • Memory management support for high-performance
    applications
  • Framework for building high-quality allocators
    (Heap Layers) [Berger, Zorn & McKinley, PLDI-01]
  • Scalable high-performance general-purpose
    allocator (Hoard) [Berger, McKinley, Blumofe &
    Wilson, ASPLOS-IX]
  • Extended high-performance allocator for server
    applications (Reap) [Berger, Zorn & McKinley, to
    be submitted]

66
Reap Conclusions
  • Custom allocation often not much faster
  • Use better general-purpose allocator
  • Important exception Regions
  • Fast but waste space
  • Reap best of both worlds
  • Extends general-purpose allocation
  • Fast, space-efficient & flexible
  • Eliminates the need for most custom allocators

67
Experimental Methodology
  • Built analogous allocators using heap layers
  • KingsleyHeap (BSD allocator)
  • LeaHeap (based on Lea allocator 2.7.0)
  • Three weeks to develop
  • 500 lines vs. 2,000 lines in original
  • Compared performance with originals
  • SPEC2000 & standard allocation benchmarks

68
Custom Allocators Are Effective
Average 30% faster than the system allocator wrapper
69
Custom Allocators Not Effective
Lea allocator wrapper is as fast as custom allocation,
except for regions
70
Reap Runtime
71
Runtime Compared toGeneral-Purpose Allocators
72
Space Compared toGeneral-Purpose Allocators
73
A General-Purpose Memory Manager for
High-Performance Applications
  • To support high-performance applications, a memory
    manager requires:
  • Speed
  • Fast malloc/free
  • Scalability
  • Performance linear in number of processors
  • Space-efficiency
  • Worst-case and average-case

74
Contributions
  • Heap Layers framework for building memory
    managers
  • Reusable components, no runtime cost
  • Simplifies implementations, good experimental
    platform
  • Hoard: scalable general-purpose memory manager
  • High-performance, provably scalable &
    space-efficient
  • Comprehensive evaluation of custom allocators
  • Not only bad software engineering, also largely
    ineffective
  • Reap: hybrid general-purpose memory manager
  • Combines regions and heaps
  • Provides robustness for server applications

75
Uniprocessor Memory Allocators
  • Standard on many operating systems
  • Scalability: poor
  • Heap contention
  • Single lock protects the heap
  • Space: excellent
  • Nearly optimal for most programs [Wilson &
    Johnstone, 2000]

76
Example: Debugging Heap Layer
  • DebugHeap
  • Protects against invalid & multiple frees

[Figure: DebugLockedMallocHeap = DebugHeap over LockedHeap over MallocHeap]
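
A minimal sketch of such a layer (assuming the mixin convention from the earlier sketches; std::set stands in for whatever bookkeeping the real DebugHeap uses):

#include <cassert>
#include <cstddef>
#include <set>

template <class SuperHeap>
class DebugHeap : public SuperHeap {
public:
  void * malloc (size_t sz) {
    void * ptr = SuperHeap::malloc (sz);
    _live.insert (ptr);                  // remember live allocations
    return ptr;
  }
  void free (void * ptr) {
    // Catches frees of pointers we never allocated, and double frees.
    assert (_live.erase (ptr) == 1 && "invalid or multiple free");
    SuperHeap::free (ptr);
  }
private:
  std::set<void *> _live;
};

// The slide's composition: debugging over locking over malloc.
// typedef DebugHeap<LockedHeap<MallocHeap> > DebugLockedMallocHeap;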
77
Hoard Example
  • malloc from a heap block on the local heap
  • free returns memory to its heap block
  • local heap too empty? Move a heap block
    to the global heap

[Figure: processor 0's heap and the global heap; trace: x1 = malloc(1), some mallocs, some frees, free(x7); empty fraction = 1/3]
78
Hoard Details
  • Segregated size-class allocator (mapping sketched below)
  • Size classes above 4K are logarithmically spaced
  • Superblocks hold objects of one size class
  • Empty superblocks are recycled
  • Approximately radix-sorted
  • Allocate from mostly-full superblocks
  • Fast removal of mostly-empty superblocks

[Figure: size-class bins (8, 16, 24, 32, 40, 48, ...) pointing to radix-sorted superblock lists, emptiest to fullest]
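
A hypothetical size-class mapping in the spirit of the slide: exactly spaced bins for small requests, logarithmic spacing above 4K.

#include <cstddef>

size_t size_class (size_t sz) {
  if (sz <= 4096) {
    return (sz + 7) / 8;        // 8-byte-spaced bins: 8, 16, 24, ...
  }
  size_t cls   = 4096 / 8;      // first class past the linear bins
  size_t bound = 4096;
  while (bound < sz) {
    bound *= 2;                 // powers of two beyond 4K
    ++cls;
  }
  return cls;
}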