Capriccio: Scalable Threads for Internet Services von Behren - PowerPoint PPT Presentation


Transcript and Presenter's Notes

1
Capriccio Scalable Threads for Internet Services
(von Behren)
  • Kenneth Chiu

2
Background
  • Non-blocking I/O, async I/O
  • Non-blocking: usually doesn't work well for disks.
  • Async I/O: issue a request, get a completion.
  • epoll()/poll()
  • convoy: tendency for threads to bunch up
  • priority inversion
  • call graph
  • average, weighted moving average
  • capriccio: improvisatory musical style, free form
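The readiness-notification style mentioned above can be sketched with Linux epoll(). This is an illustrative fragment, not code from the paper; `get_read_ready` and `watch_fd` are names invented here.

```c
/* Minimal sketch of level-triggered readiness notification with
 * Linux epoll(); the loop structure mirrors the event-driven
 * servers discussed in the talk. Illustrative only. */
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd for read-readiness; returns 0 on success. */
int watch_fd(int epfd, int fd) {
    struct epoll_event ev = {0};
    ev.events = EPOLLIN;                  /* notify when readable */
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Block until one watched fd is readable and return it, or -1. */
int get_read_ready(int epfd) {
    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, -1); /* block until ready */
    return n == 1 ? ev.data.fd : -1;
}
```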

3
The Problem
  • Web transactions involve a number of steps
    which must be performed in sequence.
  • For high-throughput, we want to service many of
    these requests concurrently.
  • When does concurrency help? When does it not?
  • If we use a single thread per request, we will
    have too many threads.
  • If we multiplex requests on a small set of
    threads, it's more difficult.

4
Read two numbers and add
  • Event-driven version:
    while (true) {
      fd = get_read_ready();
      state = lookup(fd);
      if (state.step == READING_FIRST) {
        c = read(fd, ..., bytes_left);
        if (have enough)
          state.step = READING_SECOND;
      } else if (state.step == READING_SECOND) {
        ...
      }
    }
  • Threaded version:
    while (true) {
      int n1, n2;
      read_exact(fd, &n1, 4);
      read_exact(fd, &n2, 4);
      printf("%d\n", n1 + n2);
    }
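The threaded version can be made compilable. `read_exact` is not defined on the slide, so the helper below is an assumption about its behavior: loop until exactly `len` bytes have arrived, hiding short reads.

```c
/* Compilable sketch of the slide's threaded version; read_exact()
 * is assumed (not shown on the slide) and loops over short reads. */
#include <unistd.h>
#include <stdio.h>

int read_exact(int fd, void *buf, size_t len) {
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0) return -1;      /* EOF or error */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read two 4-byte integers, print and return their sum. */
int add_two(int fd) {
    int n1, n2;
    if (read_exact(fd, &n1, 4) || read_exact(fd, &n2, 4)) return -1;
    printf("%d\n", n1 + n2);
    return n1 + n2;
}
```

Note how the thread's control flow (two sequential reads) carries the state that the event-driven version had to store explicitly in `state.step`.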
5
Thread Design and Scalability
6
The Case for User-Level Threads
  • Flexibility
  • Level of indirection between applications and the
    kernel, which helps decouple the two.
  • Kernel-level thread scheduling must handle all
    applications. User-level can be tailored.
  • Lightweight, which means we can use zillions of
    them.
  • Performance
  • Cooperative scheduling is nearly free.
  • Do not require kernel crossing for uncontended
    locks. (Why do contended locks require kernel
    crossings?)
  • Disadvantages
  • Non-blocking I/O requires an additional system
    call. (Why?)
  • SMPs
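On the question of why uncontended locks avoid kernel crossings: the fast path can be a single atomic instruction in user space, as in this sketch with C11 atomics. The contended path here just spins for brevity; a real implementation (e.g. a futex-style lock) would cross into the kernel to sleep, which is the case that costs a system call.

```c
/* Sketch of a lock whose uncontended path is one user-space CAS,
 * no system call. Contention is handled by spinning here only for
 * brevity; a real lock would sleep in the kernel instead. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int state; } ulock_t;  /* 0 = free, 1 = held */

bool ulock_try(ulock_t *l) {
    int expected = 0;
    /* Uncontended case: a single compare-and-swap succeeds. */
    return atomic_compare_exchange_strong(&l->state, &expected, 1);
}

void ulock_acquire(ulock_t *l) {
    while (!ulock_try(l))
        ;   /* contended: a futex-based lock would block here */
}

void ulock_release(ulock_t *l) {
    atomic_store(&l->state, 0);
}
```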

7
Implementation
  • Context switches
  • Built on coroutine library.
  • I/O
  • Intercept blocking system calls, use epoll() and
    AIO for disk.
  • Can be less efficient
  • Scheduling
  • Main scheduling loop looks very much like an
    event-driven application. (What is an EDA?)
  • Makes it relatively easy to switch schedulers.
  • Synchronization
  • Cooperative threading on UP.
  • Efficiency
  • All O(1), except sleep queue.
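The coroutine-based context switch can be illustrated with POSIX ucontext. This is an assumption for illustration only; Capriccio uses its own coroutine library, not ucontext.

```c
/* Minimal cooperative switch using POSIX ucontext: the coroutine
 * runs, yields back to main, and is resumed once. Illustrates the
 * mechanism only; not Capriccio's actual coroutine library. */
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static int steps = 0;

static void coroutine(void) {
    steps++;                              /* first slice */
    swapcontext(&co_ctx, &main_ctx);      /* yield to main */
    steps++;                              /* second slice after resume */
}

int run_coroutine_demo(void) {
    static char stack[64 * 1024];
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = stack;
    co_ctx.uc_stack.ss_size = sizeof stack;
    co_ctx.uc_link = &main_ctx;           /* where to go on return */
    makecontext(&co_ctx, coroutine, 0);
    swapcontext(&main_ctx, &co_ctx);      /* run until first yield */
    int after_yield = steps;              /* coroutine ran one slice */
    swapcontext(&main_ctx, &co_ctx);      /* resume to completion */
    return after_yield * 10 + steps;      /* encodes both observations */
}
```

No kernel involvement: both switches are ordinary user-space register save/restore, which is why cooperative scheduling is nearly free.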

8
Benchmarks
  • 2 X 2.4 GHz Xeon, 1 GB memory, 2 X 10K RPM SCSI,
    GigE.
  • 2 X 1.2 GHz US III
  • Linux 2.5.70, epoll(), AIO.
  • Solaris 8
  • Capriccio, LinuxThreads, NPTL

9
Thread Primitives
10
Thread Scalability
  • Producer-consumer

11
Thread Scalability
  • Drop between 100 and 1000 threads is due to cache
    footprint.

12
I/O Performance
  • pipetest
  • Pass a number of tokens among a set of pipes.
  • Disk scheduling
  • A number of threads perform random 4 KB reads
    from a 1 GB file.
  • Disk I/O through buffer cache
  • 200 threads reading with a fixed miss rate.

13
  • When concurrency is low, performance is poorer.

14
  • Benefits of disk head scheduling.

15
  • I/O out of buffer.
  • Performance is lower due to AIO.

16
Linked Stack Management
17
Thread Stacks
  • If a lot of threads, the cumulative stack space
    can be quite large.
  • Solution: Use a dynamic allocation policy and
    allocate on demand. Link stack chunks together.
  • Problem: How do you link stack chunks together?
    How do you know when to link a new one?

18
Weighted Call Graph
  • Use static analysis to create a weighted call
    graph.
  • Each node is weighted by the maximum stack space
    that that function might consume. (Why is it the
    maximum, and not exact?)
  • Now what?
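Given an acyclic weighted call graph, the natural next step is a longest-weighted-path computation from the entry point, which is what makes a static bound possible at all. A sketch (names and representation invented here; recursion, i.e. cycles, would make the bound infinite, which is why checkpoints are needed):

```c
/* Sketch: for an acyclic call graph where each node's weight is
 * that function's frame size, the worst-case stack need from u is
 * the longest weighted path below u, found by memoized DFS. */
#include <string.h>

#define MAXN 16
static int frame[MAXN];          /* stack bytes per function */
static int edge[MAXN][MAXN];     /* edge[u][v] = 1 if u calls v */
static int nfuncs;
static int memo[MAXN];           /* set to -1 before first call */

int max_stack(int u) {           /* worst-case bytes from u down */
    if (memo[u] >= 0) return memo[u];
    int best = 0;
    for (int v = 0; v < nfuncs; v++)
        if (edge[u][v]) {
            int d = max_stack(v);
            if (d > best) best = d;
        }
    return memo[u] = frame[u] + best;
}
```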

19
Bounds
  • Most real-world programs use recursion.
  • Even without recursion, a static bound wastes too
    much space.
  • Instead, insert checkpoints at key places to link
    in new stack chunks.
  • Chunks switched right before arguments are pushed.

20
Placing Checkpoints
  • Ensure at least one checkpoint in every cycle by
    inserting them on back edges. (How?) (Is this
    efficient?)
  • Then make sure each path (sum) is not too long.
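Finding the back edges on which to place checkpoints is a standard DFS three-coloring; an edge into a "grey" node (one still on the current DFS path) closes a cycle. A sketch, with the graph representation invented here:

```c
/* Sketch: count back edges in a call graph via DFS coloring.
 * Every cycle contains at least one back edge, so placing a
 * stack-linking checkpoint on each back edge breaks all cycles. */
#define N 8
static int adj[N][N];            /* adj[u][v] = 1 if u calls v */
static int color[N];             /* 0 white, 1 grey, 2 black */
static int back_edges;

static void dfs(int u) {
    color[u] = 1;                /* grey: on the current DFS path */
    for (int v = 0; v < N; v++)
        if (adj[u][v]) {
            if (color[v] == 1) back_edges++;  /* edge closes a cycle */
            else if (color[v] == 0) dfs(v);
        }
    color[u] = 2;                /* black: fully explored */
}

int count_back_edges(int root) {
    back_edges = 0;
    for (int i = 0; i < N; i++) color[i] = 0;
    dfs(root);
    return back_edges;
}
```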

21
  • Function B is executing.
  • Function D, both ways.
  • Recursion.

22
Special Cases
  • Function pointers
  • Difficult, but they try to analyze.
  • External functions
  • Allow annotations.
  • Alternatively, link in a large chunk.
  • Variable length arrays
  • C99

23
Question
  • What kind of a problem is this?
  • Is it being solved at the right level?

24
Resource-Aware Scheduling
25
Admission Control
  • We've seen many graphs where performance degrades
    as some variable increases.
  • Scheduling in Capriccio aims to keep performance
    in the good part of the curve.

26
Blocking Graph
  • Each node is a location where the program
    blocked.
  • Location is call chain.
  • Generated at run time.
  • Annotate with resource usage
  • Average running time (with exponentially-weighted
    moving average), memory, stack, sockets, etc.
  • Maintain a run queue for each node. Admit threads
    until resources reach maximum capacity.
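The exponentially-weighted moving average used for the running-time annotation can be sketched as follows; the smoothing factor is an illustrative choice, not a value from the paper.

```c
/* Sketch of the exponentially-weighted moving average used to
 * annotate blocking-graph nodes with average running time.
 * alpha is the smoothing factor (illustrative, not the paper's). */
typedef struct { double avg; int initialized; } ewma_t;

double ewma_update(ewma_t *e, double sample, double alpha) {
    if (!e->initialized) {       /* first sample seeds the average */
        e->avg = sample;
        e->initialized = 1;
    } else {
        e->avg = alpha * sample + (1.0 - alpha) * e->avg;
    }
    return e->avg;
}
```

The EWMA adapts to changing behavior while damping noise, which suits per-node annotations that must be cheap to maintain at run time.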

27
Pitfalls
  • Too many non-linear effects to predict.
  • One solution is to use some kind of
    instrumentation, plus feedback control.
  • But even detecting that is hard.

28
Web Server Test
29
(No Transcript)
30
Summary
  • Control flow maintains state. Control flow can be
    swapped for explicit maintenance.
  • Threads perform two functions
  • Maintain state (logical threads of programming
    model)
  • Allow concurrency (kernel)
  • Should separate the two, since the overhead of
    concurrency is not necessary when we just want to
    maintain state.
  • Cooperative multitasking has been denigrated
    before, but can be good.

31
(No Transcript)
32
(No Transcript)