Capriccio: Scalable Threads for Internet Service

Slides: 47

Provided by: csF2

Learn more at: http://www.cs.fsu.edu

1
Capriccio: Scalable Threads for Internet Service
2
Introduction

Internet services have ever-increasing
scalability demands
Current hardware is meeting these demands
Software has lagged behind
Recent approaches are event-based
Pipeline stages of events

3
Drawbacks of Events

Event systems hide the control flow
Difficult to understand and debug
Eventually evolved into call-and-return event
pairs
Programmers need to match related events
Need to save/restore state
Capriccio: instead of adopting the event-based model, fix the
thread-based model

4
Goals of Capriccio

Support for existing thread API
Few changes to existing applications
Scalability to thousands of threads
One thread per execution
Flexibility to address application-specific needs

[Figure: ease of programming versus performance, with points labeled Threads, Events, and Ideal]
5
Thread Design Principles

Kernel-level threads are for true concurrency
User-level threads provide a clean programming
model with useful invariants and semantics
Decouple user-level threads from kernel-level threads
More portable

6
Capriccio

Thread package
All thread operations are O(1)
Linked stacks
Address the problem of stack allocation for large
numbers of threads
Combination of compile-time and run-time analysis
Resource-aware scheduler

7
Thread Design and Scalability

POSIX API
Backward compatible

8
User-Level Threads

+ Performance
+ Flexibility
- Complex preemption
- Bad interaction with kernel scheduler

9
Flexibility

Decoupling user and kernel threads allows faster
innovation
Can use new kernel thread features without
changing application code
Scheduler tailored for applications
Lightweight

10
Performance

Reduce the overhead of thread synchronization
No kernel crossings for mutex acquisition or release, even with preemptive threading
More efficient memory management at user level

11
Disadvantages

Need to replace blocking calls with nonblocking
ones to retain control of the CPU
Translation overhead
Problems with multiple processors
Synchronization becomes more expensive

12
Context Switches

Built on top of Edgar Toernig's coroutine library
Fast context switches when threads voluntarily
yield
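
A minimal sketch of cooperative context switching, using Python generators as a stand-in for a coroutine library. This is not Capriccio's code; the names `worker` and `run_round_robin` are made up for illustration. Each "thread" runs until it voluntarily yields, and the scheduler then resumes the next runnable one.

```python
from collections import deque

def worker(name, steps, log):
    """A cooperative 'thread': does one step of work, then yields voluntarily."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # voluntary yield: control returns to the scheduler

def run_round_robin(threads):
    """Resume each runnable thread in turn until all of them have finished."""
    ready = deque(threads)
    while ready:
        thread = ready.popleft()
        try:
            next(thread)          # switch into the thread until its next yield
        except StopIteration:
            continue              # thread finished, drop it
        ready.append(thread)      # still runnable, put it back in the queue

log = []
run_round_robin([worker("a", 2, log), worker("b", 3, log)])
print(log)
```

Because a switch happens only at a `yield`, the order of the log entries is fully determined by the round-robin queue.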

13
I/O

Capriccio intercepts blocking I/O calls
Uses epoll for asynchronous I/O
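
The idea of intercepting blocking I/O can be sketched with Python's standard `selectors` module, which uses epoll on Linux. This is an illustration under assumptions, not Capriccio's implementation: the generator-based threads and the `("read", sock)` request tuples are invented for the example. A thread that would block instead yields a request, and the scheduler resumes it once the descriptor is ready.

```python
import selectors
import socket
from collections import deque

def reader(sock, out):
    """Reads like blocking code, but yields to the scheduler instead of blocking."""
    yield ("read", sock)              # the "blocking" call is intercepted here
    out.append(sock.recv(1024))

def writer(sock, payload):
    yield ("write", sock)
    sock.sendall(payload)

def run(threads):
    sel = selectors.DefaultSelector()  # epoll-backed on Linux
    ready = deque(threads)
    waiting = 0                        # threads parked on a descriptor
    while ready or waiting:
        while ready:
            thread = ready.popleft()
            try:
                op, sock = next(thread)
            except StopIteration:
                continue               # thread finished
            mask = selectors.EVENT_READ if op == "read" else selectors.EVENT_WRITE
            sel.register(sock, mask, data=thread)
            waiting += 1
        if waiting:
            for key, _ in sel.select():    # wait until some descriptor is ready
                sel.unregister(key.fileobj)
                waiting -= 1
                ready.append(key.data)     # the parked thread is runnable again
    sel.close()

a, b = socket.socketpair()
received = []
run([reader(a, received), writer(b, b"ping")])
a.close()
b.close()
print(received)
```

The thread bodies never mention readiness events; only the scheduler loop deals with them.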

14
Scheduling

Very much like an event-driven application
Events are hidden from programmers

15
Synchronization

Supports cooperative threading on single-CPU
machines
Requires only Boolean checks
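
A sketch of why a Boolean check suffices under cooperative threading on one CPU: nothing can run between the check and the assignment, so no atomic instruction or kernel call is needed. The class `CoopMutex` and the generator-based threads are assumptions made for this example, not Capriccio's API.

```python
from collections import deque

class CoopMutex:
    """Mutex for cooperative threads on a single CPU: the state is one Boolean."""
    def __init__(self):
        self.locked = False

    def try_acquire(self):
        # Safe without atomics: no other thread runs until we yield.
        if self.locked:
            return False
        self.locked = True
        return True

    def release(self):
        self.locked = False

def critical_worker(name, mutex, log):
    while not mutex.try_acquire():
        yield                      # lock is held, let other threads run
    log.append(f"{name} enter")
    yield                          # yield while still holding the lock
    log.append(f"{name} exit")
    mutex.release()

def run(threads):
    ready = deque(threads)
    while ready:
        thread = ready.popleft()
        try:
            next(thread)
        except StopIteration:
            continue
        ready.append(thread)

log = []
mutex = CoopMutex()
run([critical_worker("a", mutex, log), critical_worker("b", mutex, log)])
print(log)
```

Thread `b` cannot enter its critical section until `a` has released the lock, even though `a` yields in the middle of it.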

16
Threading Microbenchmarks

SMP, two 2.4 GHz Xeon processors
1 GB memory
Two 10K RPM SCSI Ultra II hard drives
Linux 2.5.70
Compared Capriccio, LinuxThreads, and Native
POSIX Threads for Linux (NPTL)

17
Latencies of Thread Primitives
18
Thread Scalability

Producer-consumer microbenchmark
LinuxThreads begins to degrade after 20 threads
NPTL degrades after 100 threads
Capriccio scales to 32K producers and consumers
(64K threads total)

19
Thread Scalability
20
I/O Performance

Network performance
Token passing among pipes
Simulates the effect of slow client links
10% overhead compared to epoll
Twice as fast as both LinuxThreads and NPTL with
more than 1000 threads
Disk I/O comparable to kernel threads

21
Linked Stack Management

LinuxThreads allocates 2MB per stack
1 GB of VM holds only 500 threads

[Figure: fixed stacks]
22
Linked Stack Management

But most threads consume only a few KB of stack
space at a given time
Dynamic stack allocation can significantly reduce
the amount of virtual memory needed
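
The arithmetic behind the fixed-stack limit can be checked directly. The 2 MB and 1 GB figures come from the slides; the 8 KB per-thread figure is only an assumed stand-in for "a few KB".

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def max_threads(virtual_memory_bytes, stack_bytes):
    """How many per-thread stacks of the given size fit in the address space."""
    return virtual_memory_bytes // stack_bytes

fixed_stacks = max_threads(1 * GIB, 2 * MIB)   # 512, i.e. roughly 500 threads
small_stacks = max_threads(1 * GIB, 8 * KIB)   # 8 KB is an assumed figure
print(fixed_stacks, small_stacks)
```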

[Figure: linked stack]
23
Compiler Analysis and Linked Stacks

Whole-program analysis
Based on the call graph
Problematic for recursion
Static estimation may be too conservative
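
A sketch of a call-graph-based static bound, assuming each function's frame size is known. The graph, the frame sizes, and the function name `max_stack_bytes` are hypothetical. The bound is the deepest path weighted by frame size, and recursion shows up as a cycle for which no finite bound exists.

```python
import math

def max_stack_bytes(call_graph, frame_size, root):
    """Worst-case stack use starting at `root`.

    Returns math.inf when recursion (a cycle) is reachable from `root`.
    """
    memo = {}
    on_path = set()

    def visit(fn):
        if fn in on_path:
            return math.inf              # recursion: no finite static bound
        if fn in memo:
            return memo[fn]
        on_path.add(fn)
        deepest_callee = max((visit(c) for c in call_graph.get(fn, ())), default=0)
        on_path.discard(fn)
        memo[fn] = frame_size[fn] + deepest_callee
        return memo[fn]

    return visit(root)

# Hypothetical call graph and frame sizes in bytes.
graph = {"main": ["parse", "send"], "parse": ["read"], "send": [], "read": []}
frames = {"main": 64, "parse": 128, "read": 256, "send": 512}
print(max_stack_bytes(graph, frames, "main"))

recursive = {"main": ["walk"], "walk": ["walk"]}
print(max_stack_bytes(recursive, {"main": 64, "walk": 32}, "main"))
```

The bound covers the deepest path even if that path is rarely taken at run time, which is why a purely static estimate can be too conservative.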

24
Compiler Analysis and Linked Stacks

Grow and shrink the stack size on demand
Insert checkpoints to determine whether we need
to allocate more before the next checkpoint
Results in noncontiguous stacks
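
A toy model of a stack built from linked chunks, to illustrate what a checkpoint decides at run time. The class `LinkedStack`, its methods, and the byte counts are assumptions for illustration and do not reflect Capriccio's actual data structures.

```python
class LinkedStack:
    """Toy stack made of noncontiguous chunks, grown and shrunk on demand."""
    def __init__(self, chunk_bytes):
        self.chunk_bytes = chunk_bytes
        self.chunks = [[chunk_bytes, 0]]   # [capacity, used]; last one is current

    def checkpoint(self, needed):
        """Reserve `needed` bytes, the bound until the next checkpoint.

        Returns True if a new chunk had to be linked.
        """
        capacity, used = self.chunks[-1]
        if used + needed <= capacity:
            self.chunks[-1][1] += needed
            return False
        # Not enough room: link a fresh chunk, oversized if necessary.
        self.chunks.append([max(self.chunk_bytes, needed), needed])
        return True

    def unwind(self, needed, linked):
        """Undo a checkpoint on return: unlink the chunk or free the bytes."""
        if linked:
            self.chunks.pop()
        else:
            self.chunks[-1][1] -= needed

stack = LinkedStack(chunk_bytes=2048)
first = stack.checkpoint(1500)     # fits in the initial chunk
second = stack.checkpoint(1000)    # does not fit, so a chunk is linked
print(first, second, len(stack.chunks))
stack.unwind(1000, second)
stack.unwind(1500, first)
print(len(stack.chunks), stack.chunks[0][1])
```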

25
Placing Checkpoints

One checkpoint in every cycle in the call graph
Bound the size between checkpoints with the
deepest call path
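
A simplified sketch of the two steps above, on a hypothetical call graph. It places a checkpoint on every depth-first back edge, which guarantees at least one checkpoint per cycle, and then bounds the stack needed before the next checkpoint by the deepest remaining path. The function names are made up, and any further checkpoints a real implementation might add on long acyclic paths are left out.

```python
def place_checkpoints(call_graph, root):
    """Return the call edges that get a checkpoint: every DFS back edge."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    checkpoints = set()

    def visit(fn):
        color[fn] = GRAY
        for callee in call_graph.get(fn, ()):
            state = color.get(callee, WHITE)
            if state == GRAY:
                checkpoints.add((fn, callee))   # back edge: closes a cycle
            elif state == WHITE:
                visit(callee)
        color[fn] = BLACK

    visit(root)
    return checkpoints

def deepest_path(call_graph, frame_size, checkpoints, fn, memo=None):
    """Stack bytes needed from `fn` before the next checkpoint is reached."""
    memo = {} if memo is None else memo
    if fn not in memo:
        callees = [c for c in call_graph.get(fn, ()) if (fn, c) not in checkpoints]
        memo[fn] = frame_size[fn] + max(
            (deepest_path(call_graph, frame_size, checkpoints, c, memo)
             for c in callees),
            default=0,
        )
    return memo[fn]

# Hypothetical program: walk() is recursive.
graph = {"main": ["walk"], "walk": ["leaf", "walk"], "leaf": []}
frames = {"main": 100, "walk": 200, "leaf": 50}
cps = place_checkpoints(graph, "main")
print(cps, deepest_path(graph, frames, cps, "main"))
```

With the back edges removed the remaining graph is acyclic, so the deepest-path computation always terminates.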

26
Dealing with Special Cases

Function pointers
Don't know what procedure to call at compile time
Can find a potential set of procedures

27
Dealing with Special Cases

External functions
Allow programmers to annotate external library
functions with trusted stack bounds
Allow larger stack chunks to be linked for
external functions

28
Tuning the Algorithm

Stack space can be wasted
Internal and external fragmentation
Tradeoffs
Number of stack linkings
External fragmentation

29
Memory Benefits

Tuning can be application-specific
No preallocation of large stacks
Reduced memory requirement for running a large number of
threads
Better paging behavior
Stacks: LIFO

30
Case Study: Apache 2.0.44

Maximum stack allocation chunk: 2 KB
Apache under SPECweb99
Overall slowdown is about 3%
Dynamic allocation: 0.1%
Link to large chunks for external functions: 0.5%
Stack removal: 10%

31
Resource-Aware Scheduling

Advantages of event-based scheduling
Tailored for applications
With event handlers
Events provide two important pieces of
information for scheduling
Whether a process is close to completion
Whether a system is overloaded

32
Resource-Aware Scheduling

Thread-based
View applications as a sequence of stages,
separated by blocking calls
Analogous to event-based scheduler

33
Blocking Graph

Node: a location in the program that blocked
Edge: joins two nodes if they were consecutive
blocking points
Generated at runtime
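
A sketch of how such a graph could be accumulated at run time. The class `BlockingGraph` and the location strings are hypothetical; a plain string stands in for whatever identifies a blocking point in a real system.

```python
from collections import defaultdict

class BlockingGraph:
    """Nodes are blocking points; an edge links consecutive blocking points."""
    def __init__(self):
        self.edges = defaultdict(int)   # (from_node, to_node) -> times observed
        self.last = {}                  # thread id -> its previous blocking point

    def record_block(self, thread_id, location):
        """Call whenever a thread blocks at `location`."""
        previous = self.last.get(thread_id)
        if previous is not None:
            self.edges[(previous, location)] += 1
        self.last[thread_id] = location

    def nodes(self):
        found = set(self.last.values())
        for src, dst in self.edges:
            found.update((src, dst))
        return found

g = BlockingGraph()
for location in ("accept", "read", "write"):
    g.record_block(1, location)
for location in ("accept", "read"):
    g.record_block(2, location)
print(dict(g.edges))
```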

34
Resource-Aware Scheduling

1. Keep track of resource utilization
2. Annotate each node with the resources used along its
outgoing edges
3. Dynamically prioritize nodes
Prefer nodes that release resources
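
One way the three steps might fit together, as a sketch under stated assumptions: the node annotations, the 90% threshold, and the function `pick_next_node` are invented for illustration and are not taken from Capriccio. When a resource is close to its limit, the scheduler favors the node whose threads are expected to release the most of it.

```python
def pick_next_node(nodes, usage, capacity, threshold=0.9):
    """Choose which blocking-graph node to schedule threads from.

    nodes: {name: {"ready": count, "delta": {resource: expected change}}}
    A resource counts as scarce once usage reaches threshold * capacity
    (the threshold value is an assumption).
    """
    scarce = [r for r in capacity if usage[r] >= threshold * capacity[r]]
    runnable = [name for name, info in nodes.items() if info["ready"] > 0]
    if not runnable:
        return None

    def cost(name):
        # Net expected consumption of scarce resources; negative means release.
        return sum(nodes[name]["delta"].get(r, 0) for r in scarce)

    return min(runnable, key=cost)

nodes = {
    "accept": {"ready": 3, "delta": {"fds": 1, "memory": 4096}},
    "send_reply": {"ready": 2, "delta": {"memory": -4096}},
    "close": {"ready": 1, "delta": {"fds": -1}},
}
capacity = {"fds": 1000, "memory": 1_000_000}
print(pick_next_node(nodes, {"fds": 950, "memory": 100_000}, capacity))
print(pick_next_node(nodes, {"fds": 10, "memory": 100_000}, capacity))
```

With no scarce resource every node has cost zero, so the first runnable node in the dictionary is returned.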

35
Resources

CPU
Memory (malloc)
File descriptors (open, close)

36
Pitfalls

Tricky to determine the maximum capacity of a
resource
Thrashing depends on the workload
A disk can handle more requests when they are sequential
rather than random
Resources interact
VM vs. disk
Applications may manage memory themselves

37
Yield Profiling

User threads are problematic if a thread fails to
yield
Such threads are easy to detect, since their running
times are orders of magnitude larger
Yield profiling identifies places where programs
fail to yield sufficiently often
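
A sketch of the measurement behind yield profiling, using generator-based threads. The function names and the factor of 100 are assumptions. The profiler times each stretch between consecutive yields and flags stretches far longer than the typical one.

```python
import time

def profile_yields(thread, clock=time.perf_counter):
    """Run a generator-based thread alone; return the time between its yields.

    The stretch between the last yield and completion is not recorded.
    """
    intervals = []
    start = clock()
    for _ in thread:
        now = clock()
        intervals.append(now - start)
        start = now
    return intervals

def long_runs(intervals, factor=100):
    """Indices of stretches at least `factor` times the median stretch."""
    if not intervals:
        return []
    ordered = sorted(intervals)
    median = ordered[len(ordered) // 2]
    return [i for i, dt in enumerate(intervals) if dt >= factor * median]

def busy_thread():
    for step in range(4):
        if step == 2:
            sum(range(2_000_000))   # a long computation with no yield inside
        yield

intervals = profile_yields(busy_thread())
print(intervals)
```

Passing a custom `clock` makes the profiler deterministic, which is convenient for checking the flagging logic.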

38
Web Server Performance

4x500 MHz Pentium server
2GB memory
Intel e1000 Gigabit Ethernet card
Linux 2.4.20
Workload: requests for 3.2 GB of static file data

39
Web Server Performance

Request frequencies match those of SPECweb99
A client repeatedly connects to the server and issues
a series of five requests, separated by 20 ms
pauses
Apache's performance improved by 15% with
Capriccio

40
Resource-Aware Admission Control