Title: Capriccio: Scalable Threads for Internet Services
1Capriccio: Scalable Threads for Internet Services
2Introduction
- Internet services have ever-increasing scalability demands
- Current hardware is meeting these demands
- Software has lagged behind
- Recent approaches are event-based
- Pipeline stages of events
3Drawbacks of Events
- Event systems hide the control flow
- Difficult to understand and debug
- Eventually evolved into call-and-return event pairs
- Programmers need to match related events
- Need to save/restore state
- Capriccio: instead of adopting the event-based model, fix the thread-based model
4Goals of Capriccio
- Support for existing thread API
- Few changes to existing applications
- Scalability to thousands of threads
- One thread per execution
- Flexibility to address application-specific needs
[Figure: ease of programming versus performance for events, threads, and the ideal]
5Thread Design Principles
- Kernel-level threads are for true concurrency
- User-level threads provide a clean programming model with useful invariants and semantics
- Decouple user-level threads from kernel-level threads
- More portable
6Capriccio
- Thread package
- All thread operations are O(1)
- Linked stacks
- Address the problem of stack allocation for large numbers of threads
- Combination of compile-time and run-time analysis
- Resource-aware scheduler
7Thread Design and Scalability
- POSIX API
- Backward compatible
8User-Level Threads
- Advantages: performance and flexibility
- Disadvantages: complex preemption and bad interaction with the kernel scheduler
9Flexibility
- Decoupling user and kernel threads allows faster innovation
- Can use new kernel thread features without changing application code
- Scheduler tailored for applications
- Lightweight
10Performance
- Reduce the overhead of thread synchronization
- No kernel crossings for mutex operations, even with preemptive threading
- More efficient memory management at user level
11Disadvantages
- Need to replace blocking calls with nonblocking ones to hold the CPU
- Translation overhead
- Problems with multiple processors
- Synchronization becomes more expensive
12Context Switches
- Built on top of Edgar Toernig's coroutine library
- Fast context switches when threads voluntarily yield
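As a rough illustration of voluntary yielding, the sketch below uses Python generators as stand-ins for coroutines. It is not Capriccio's implementation (which is written in C on top of a coroutine library); the names `worker` and `run` are hypothetical. The ready queue is a deque, so enqueue and dequeue are O(1).

```python
from collections import deque

def worker(name, steps, log):
    """A user-level thread: does one step of work, then voluntarily yields."""
    for i in range(steps):
        log.append((name, i))
        yield  # hand control back to the scheduler

def run(threads):
    """Round-robin scheduler; popleft and append on a deque are O(1)."""
    ready = deque(threads)
    while ready:
        t = ready.popleft()
        try:
            next(t)          # "context switch" into the thread
            ready.append(t)  # still runnable: back of the queue
        except StopIteration:
            pass             # thread finished, drop it

log = []
run([worker("a", 2, log), worker("b", 3, log)])
```

The threads interleave only at the `yield` points, which is what makes the switch cheap: no kernel involvement and no preemption to guard against.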
13I/O
- Capriccio intercepts blocking I/O calls
- Uses epoll for asynchronous I/O
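The idea of turning a blocking call into "park the thread, poll for readiness, resume" can be sketched as follows. This is an illustrative model only: it uses Python's standard `selectors` module (which is backed by epoll on Linux) and a generator that yields a request where the blocking call would be. The `("read", sock)` request format and the function names are assumptions made for this example.

```python
import selectors
import socket
from collections import deque

def reader(sock, out):
    """Thread-style code: the yield stands in for an intercepted blocking read."""
    yield ("read", sock)
    out.append(sock.recv(1024))

def run(threads):
    sel = selectors.DefaultSelector()  # epoll-backed on Linux
    ready = deque(threads)
    waiting = 0
    while ready or waiting:
        while ready:
            t = ready.popleft()
            try:
                _op, sock = next(t)    # run until the next "blocking" call
            except StopIteration:
                continue               # thread finished
            sel.register(sock, selectors.EVENT_READ, t)  # park the thread
            waiting += 1
        if waiting:
            for key, _events in sel.select():
                sel.unregister(key.fileobj)
                waiting -= 1
                ready.append(key.data)  # socket is readable: thread is runnable

    sel.close()

a, b = socket.socketpair()
a.setblocking(False)
b.sendall(b"ping")
received = []
run([reader(a, received)])
a.close()
b.close()
```

The thread body reads like ordinary blocking code, while the scheduler loop is effectively an event loop hidden from the programmer.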
14Scheduling
- Very much like an event-driven application
- Events are hidden from programmers
15Synchronization
- Supports cooperative threading on single-CPU machines
- Requires only Boolean checks
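A minimal sketch of why a Boolean check is enough under cooperative threading on one CPU: a thread can only lose the CPU at an explicit yield, so nothing can run between the check and the set. `CoopMutex`, `worker`, and `run` are hypothetical names for this example, not Capriccio's API.

```python
class CoopMutex:
    """Lock for cooperative threads on one CPU: a plain Boolean flag."""
    def __init__(self):
        self.locked = False

    def try_acquire(self):
        # No atomic instruction needed: threads switch only at yields,
        # so the check and the set cannot be interleaved with another thread.
        if self.locked:
            return False
        self.locked = True
        return True

    def release(self):
        self.locked = False

def worker(name, mutex, log):
    while not mutex.try_acquire():
        yield                      # lock is held: let other threads run
    log.append(name + " in")
    yield                          # yield while holding the lock
    log.append(name + " out")
    mutex.release()

def run(threads):
    ready = list(threads)
    while ready:
        t = ready.pop(0)
        try:
            next(t)
            ready.append(t)
        except StopIteration:
            pass

log = []
m = CoopMutex()
run([worker("a", m, log), worker("b", m, log)])
```

Thread "b" never enters the critical section while "a" holds the lock, even though "a" yields in the middle of it.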
16Threading Microbenchmarks
- SMP, two 2.4 GHz Xeon processors
- 1 GB memory
- Two 10K RPM SCSI Ultra II hard drives
- Linux 2.5.70
- Compared Capriccio, LinuxThreads, and Native POSIX Threads for Linux (NPTL)
17Latencies of Thread Primitives
18Thread Scalability
- Producer-consumer microbenchmark
- LinuxThreads begins to degrade after 20 threads
- NPTL degrades after 100 threads
- Capriccio scales to 32K producers and consumers (64K threads total)
19Thread Scalability
20I/O Performance
- Network performance
- Token passing among pipes
- Simulates the effect of slow client links
- 10% overhead compared to epoll
- Twice as fast as both LinuxThreads and NPTL with more than 1000 threads
- Disk I/O comparable to kernel threads
21Linked Stack Management
- LinuxThreads allocates 2MB per stack
- 1 GB of VM holds only 500 threads
[Figure: fixed stacks]
22Linked Stack Management
- But most threads consume only a few KB of stack space at a given time
- Dynamic stack allocation can significantly reduce the amount of VM required
[Figure: linked stack]
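The arithmetic behind the fixed-stack limit can be checked directly. The 2 MB and 1 GB figures come from the slides; the 8 KB per-thread figure for dynamically allocated stacks is only an assumed stand-in for "a few KB".

```python
GIB = 1 << 30
MIB = 1 << 20
KIB = 1 << 10

def threads_that_fit(vm_bytes, stack_bytes):
    """How many thread stacks of a given size fit in the address space."""
    return vm_bytes // stack_bytes

# Fixed stacks: 2 MB reserved per thread, as LinuxThreads does.
fixed = threads_that_fit(1 * GIB, 2 * MIB)

# Dynamic stacks: assume (hypothetically) 8 KB in use per thread.
linked = threads_that_fit(1 * GIB, 8 * KIB)
```

Under these assumptions `fixed` is 512 (the "only 500 threads" on the slide) and `linked` is 131072, a 256-fold difference.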
23Compiler Analysis and Linked Stacks
- Whole-program analysis
- Based on the call graph
- Problematic for recursions
- Static estimation may be too conservative
24Compiler Analysis and Linked Stacks
- Grow and shrink the stack on demand
- Insert checkpoints to determine whether more stack must be allocated before the next checkpoint
- Results in noncontiguous stacks
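The checkpoint behavior can be modeled in a few lines. This is a simulation under stated assumptions, not Capriccio's compiler-inserted code: `LinkedStack`, `call`, and the 2048-byte default chunk size are hypothetical, and `bound` stands for the stack space needed before the next checkpoint.

```python
CHUNK = 2048  # assumed default chunk size in bytes

class LinkedStack:
    def __init__(self):
        self.chunks = [[CHUNK, 0]]   # [capacity, bytes used] per chunk
        self.links = 0               # how many times a chunk was linked

    def checkpoint(self, bound):
        """Ensure `bound` bytes are free; link a new chunk if they are not.
        Returns True if a chunk was linked (it must be unlinked on return)."""
        cap, used = self.chunks[-1]
        if cap - used >= bound:
            return False
        self.chunks.append([max(CHUNK, bound), 0])
        self.links += 1
        return True

    def push(self, nbytes):
        self.chunks[-1][1] += nbytes
        assert self.chunks[-1][1] <= self.chunks[-1][0], "stack overflow"

    def pop(self, nbytes):
        self.chunks[-1][1] -= nbytes

    def unlink(self):
        self.chunks.pop()

def call(stack, frame, bound, body=lambda: None):
    """Model of an instrumented call site: checkpoint, run, then shrink."""
    linked = stack.checkpoint(bound)
    stack.push(frame)
    body()
    stack.pop(frame)
    if linked:
        stack.unlink()

s = LinkedStack()
# Outer call fits in the first chunk; the inner call does not, so a chunk is linked.
call(s, 1500, 1500, lambda: call(s, 1000, 1000))
```

After the calls return, the extra chunk has been unlinked again, which is the "shrink on demand" half of the scheme.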
25Placing Checkpoints
- At least one checkpoint in every cycle of the call graph
- Bound the stack size between checkpoints using the deepest call path
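One way to realize the two rules above is sketched here: a depth-first search marks back edges (each cycle contains at least one), and a longest-path pass over the remaining acyclic graph gives the stack bound between checkpoints. The call graph, frame sizes, and function names are invented for illustration; the real analysis also adds checkpoints to keep paths under a tunable limit, which this sketch omits.

```python
def place_checkpoints(graph, root):
    """Return the back edges found by DFS from `root`.
    A checkpoint on each back edge covers every cycle in the call graph."""
    checkpoints, state = set(), {}   # state: 1 = on DFS stack, 2 = finished

    def dfs(f):
        state[f] = 1
        for g in graph.get(f, []):
            if state.get(g) == 1:
                checkpoints.add((f, g))   # back edge closes a cycle
            elif g not in state:
                dfs(g)
        state[f] = 2

    dfs(root)
    return checkpoints

def deepest_path(graph, frames, checkpoints, f, memo=None):
    """Largest stack use starting at `f` without crossing a checkpoint edge."""
    memo = {} if memo is None else memo
    if f not in memo:
        below = [deepest_path(graph, frames, checkpoints, g, memo)
                 for g in graph.get(f, []) if (f, g) not in checkpoints]
        memo[f] = frames[f] + max(below, default=0)
    return memo[f]

# Hypothetical call graph: `walk` is directly recursive.
graph = {"main": ["parse", "walk"], "parse": ["read"],
         "walk": ["walk", "read"], "read": []}
frames = {"main": 64, "parse": 128, "walk": 96, "read": 256}  # bytes per frame

cps = place_checkpoints(graph, "main")
bound = deepest_path(graph, frames, cps, "main")
```

Here the only checkpoint lands on the recursive `walk` call, and the deepest checkpoint-free path is main, parse, read.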
26Dealing with Special Cases
- Function pointers
- Don't know which procedure will be called at compile time
- Can find a potential set of procedures
27Dealing with Special Cases
- External functions
- Allow programmers to annotate external library functions with trusted stack bounds
- Allow larger stack chunks to be linked for external functions
28Tuning the Algorithm
- Stack space can be wasted
- Internal and external fragmentation
- Tradeoffs
- Number of stack linkings
- External fragmentation
29Memory Benefits
- Tuning can be application-specific
- No preallocation of large stacks
- Reduced memory requirement for running a large number of threads
- Better paging behavior
- Stack chunks are reused in LIFO order
30Case Study: Apache 2.0.44
- Maximum stack allocation chunk: 2 KB
- Apache under SPECweb99
- Overall slowdown is about 3%
- Dynamic allocation: 0.1%
- Link to large chunks for external functions: 0.5%
- Stack removal: 10%
31Resource-Aware Scheduling
- Advantages of event-based scheduling
- Tailored for applications
- With event handlers
- Events provide two important pieces of information for scheduling
- Whether a process is close to completion
- Whether a system is overloaded
32Resource-Aware Scheduling
- Thread-based
- View applications as a sequence of stages, separated by blocking calls
- Analogous to an event-based scheduler
33Blocking Graph
- Node: a location in the program that blocked
- Edge: connects two nodes that were consecutive blocking points
- Generated at runtime
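A blocking graph of this kind can be built incrementally as threads block. The sketch below is a simplified model: nodes are plain strings standing in for program locations, and `BlockingGraph` and `on_block` are hypothetical names.

```python
from collections import defaultdict

class BlockingGraph:
    def __init__(self):
        self.edges = defaultdict(int)  # (from_node, to_node) -> times traversed
        self.last = {}                 # thread id -> previous blocking point

    def on_block(self, thread_id, location):
        """Record that `thread_id` blocked at `location`."""
        prev = self.last.get(thread_id)
        if prev is not None:
            # Consecutive blocking points of one thread form an edge.
            self.edges[(prev, location)] += 1
        self.last[thread_id] = location

g = BlockingGraph()
for tid in (1, 2):                      # two threads serving one request each
    for loc in ("accept", "read", "write"):
        g.on_block(tid, loc)
```

Both threads follow the same path, so each edge is traversed twice and the graph reveals the application's stages without any programmer annotation.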
34Resource-Aware Scheduling
- 1. Keep track of resource utilization
- 2. Annotate each node and its outgoing edges with the resources used
- 3. Dynamically prioritize nodes
- Prefer nodes that release resources
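The prioritization step might look like the sketch below. The policy shown (plain FIFO until utilization crosses a threshold, then pick the thread whose node releases the most) is an assumption chosen for illustration, as are the names `pick_next`, `node_delta`, and `high_water`.

```python
def pick_next(ready, node_delta, utilization, high_water=0.9):
    """Choose the next thread to run.

    ready:       list of (thread, node) pairs in FIFO order
    node_delta:  node -> average net resource change after running from it
                 (negative means the node tends to release the resource)
    utilization: current fraction of the resource in use
    """
    if utilization < high_water:
        return ready[0]                       # not near the limit: plain FIFO
    # Near the limit: prefer the node that releases the most.
    return min(ready, key=lambda tn: node_delta[tn[1]])

ready = [("t1", "alloc"), ("t2", "free")]
node_delta = {"alloc": 4096, "free": -4096}   # hypothetical bytes per visit

relaxed = pick_next(ready, node_delta, utilization=0.50)
pressed = pick_next(ready, node_delta, utilization=0.95)
```

Under light load the scheduler keeps arrival order; under pressure it runs the thread that is about to free memory first.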
35Resources
- CPU
- Memory (malloc)
- File descriptors (open, close)
36Pitfalls
- Tricky to determine the maximum capacity of a resource
- Thrashing depends on the workload
- Disk can handle more requests when they are sequential rather than random
- Resources interact
- VM vs. disk
- Applications may manage memory themselves
37Yield Profiling
- User-level threads are problematic if a thread fails to yield
- Such threads are easy to detect, since their running times are orders of magnitude larger
- Yield profiling identifies places where programs fail to yield sufficiently often
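A simple detector along these lines compares each location's longest run between yields with the typical run. The threshold factor of 100 and the sample timings are assumptions for this example, and `slow_yielders` is a hypothetical helper.

```python
from statistics import median

def slow_yielders(run_times, factor=100):
    """Flag locations whose longest run before yielding dwarfs the typical run.

    run_times: location -> list of run durations in seconds between yields
    """
    typical = median(t for ts in run_times.values() for t in ts)
    return sorted(loc for loc, ts in run_times.items()
                  if max(ts) > factor * typical)

# Hypothetical measurements: `compress` hogs the CPU for 50 ms at a time.
run_times = {
    "parse":    [1e-5, 2e-5],
    "send":     [1e-5, 1.5e-5],
    "compress": [5e-2],
}
offenders = slow_yielders(run_times)
```

Because offending runs are orders of magnitude longer than normal ones, even a crude threshold separates them cleanly.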
38Web Server Performance
- 4x500 MHz Pentium server
- 2GB memory
- Intel e1000 Gigabit Ethernet card
- Linux 2.4.20
- Workload: requests for 3.2 GB of static file data
39Web Server Performance
- Request frequencies match those of SPECweb99
- A client connects to the server repeatedly and issues a series of five requests, separated by 20 ms pauses
- Apache's performance improved by 15% with Capriccio
40Resource-Aware Admission Control
- Consumer-producer applications
- Producer loops, adding memory to a pool and randomly touching pages
- Consumer loops, removing memory from the pool and freeing it
- A fast producer may run out of virtual address space
41Resource-Aware Admission Control
- Touching pages too quickly will cause thrashing
- Capriccio can quickly detect the overload conditions and limit the number of producers
42Programming Models for High Concurrency
- Event
- Application-specific optimization
- Thread
- Efficient thread runtimes
43User-Level Threads
- Capriccio is unique
- Blocking graph
- Resource-aware scheduling
- Targets a large number of blocking threads
- POSIX compliant
44Application-Specific Optimization
- Most approaches require programmers to tailor their application to manage resources
- Nonstandard APIs, less portable
45Stack Management
46Future Work
- Multi-CPU machines
- Profiling tools for system tuning