Capriccio: Scalable Threads For Internet Services

1
Capriccio: Scalable Threads For Internet Services
  • Authors
  • Rob von Behren, Jeremy Condit, Feng Zhou, George
    C. Necula, Eric Brewer

Presentation by Will Hrudey
2
Introduction
  • Capriccio
  • a sprightly, improvisational musical piece
    involving multiple voices
  • Introduces a fast, scalable user-level thread
    package for thread management and synchronization

3
Motivation
  • Internet Servers And Databases
  • Have ever-increasing scalability needs
  • Need to handle hundreds of thousands of
    simultaneous connections without significant
    degradation
  • Need for a programming model to achieve
    efficient, robust servers with ease

4
Approach
  • Utilizes user level threads to provide a natural
    abstraction for high concurrency programming
  • Prior work discussed threads versus events
  • Decouples thread package from OS to take
    advantage of
  • Cooperative threading
  • New asynchronous I/O interfaces
  • Compiler support
  • Provides 3 key features
  • Scalability
  • Linked stacks
  • Resource aware scheduling

5
Goals
  • To allow high performance without high complexity
  • Support for existing thread APIs (POSIX)
  • Scalability to 100,000s of threads
  • Flexibility to address application-specific needs
  • Little or no modification of application itself

6
User Level Threads
  • Provide performance and flexibility advantages
  • Provide a clean programming model with useful
    invariants and semantics
  • Decouples thread package from OS
  • Hides both OS variation and kernel evolution
  • Integrate compiler support
  • Can complicate preemption
  • Can interact badly with kernel scheduler

7
User Level Threads
  • Flexibility
  • Take advantage of new asynchronous I/O mechanisms
  • Tailored scheduling
  • Lightweight (scale to 100,000 threads)
  • Performance
  • Reduced synchronization overhead on uniprocessors
  • More efficient memory management
  • Disadvantages
  • Blocking I/O must be intercepted
  • Wrapper layer translates blocking calls to
    non-blocking I/O (see the sketch below)
  • Benefits of lightweight synchronization are
    diminished on multiprocessors

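As a rough illustration of that wrapper layer, the sketch below shows how a blocking read() can be translated into a non-blocking attempt plus a cooperative yield. The sched_wait_for_fd() hook is hypothetical; none of these names come from the Capriccio source.

  #include <errno.h>
  #include <unistd.h>

  /* Hypothetical scheduler hook: parks the current user-level thread
   * until fd becomes readable, running other threads meanwhile. */
  void sched_wait_for_fd(int fd);

  ssize_t thread_read(int fd, void *buf, size_t len)
  {
      for (;;) {
          ssize_t n = read(fd, buf, len);   /* fd assumed O_NONBLOCK   */
          if (n >= 0 || errno != EAGAIN)
              return n;                     /* data, EOF, or real error */
          sched_wait_for_fd(fd);            /* yield until fd is ready  */
      }
  }
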
8
User Level Threads
  • Implementation (user level library for Linux)
  • Context switches
  • coroutine library
  • I/O layer intercepts blocking I/O calls
  • epoll for pollable file descriptors, Linux AIO
    for disk I/O
  • Scheduling
  • Main loop looks like an event-driven application:
    runs threads, then checks for I/O completions
    (sketched below)
  • Synchronization
  • Cooperative scheduling reduces synchronization
    overhead
  • Efficiency
  • Thread management functions have bounded
    worst-case running times

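The main loop might look like the following minimal sketch, assuming a coroutine-style switch_to(), a run queue, and a waiter_for() lookup from file descriptor to blocked thread. All of these helpers are assumptions for illustration, not Capriccio's actual API.

  #include <sys/epoll.h>

  typedef struct thread thread_t;
  extern thread_t *runqueue_pop(void);
  extern void      runqueue_push(thread_t *t);
  extern void      switch_to(thread_t *t);        /* coroutine resume */
  extern thread_t *waiter_for(int fd);

  void scheduler_loop(int epfd)
  {
      struct epoll_event ev[64];
      for (;;) {
          thread_t *t;
          while ((t = runqueue_pop()) != NULL)    /* run ready threads  */
              switch_to(t);                       /* until they block   */

          int n = epoll_wait(epfd, ev, 64, -1);   /* harvest I/O events */
          for (int i = 0; i < n; i++)
              runqueue_push(waiter_for(ev[i].data.fd)); /* wake waiters */
      }
  }
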
9
User Level Threads
  • Microbenchmark
  • Testbed
  • 2x2.4GHz Xeon / 1GB / 2x10K RPM SCSI Ultra II HD
    / 3xGigabit Ethernet / Linux 2.5.70
  • Thread packages
  • Capriccio, LinuxThreads, NPTL

10
Efficient Stack Management
  • Optimizes stack allocation for many threads
  • Reduces size of VM dedicated to stacks
  • Small non-contiguous stack chunks
  • Grow and shrink at run time
  • Compiler analysis and runtime checks
  • Generates a weighted, directed call graph

11
Efficient Stack Management
Weighted Call Graph
  • Nodes are functions weighted by max stack size
  • Edges indicate function calls between nodes
  • Path is a sequence of stack frames
  • Checkpoints are code inserted at call sites

12
Efficient Stack Management
  • Places a reasonable bound on the amount of stack
    space consumed by each thread
  • Checkpoints determine whether enough space remains
    to reach the next checkpoint without overflow
  • If not, a new stack chunk is allocated and the
    stack pointer adjusted (see the sketch below)
  • Checkpoint placement
  • Break cycles
  • Scan nodes to ensure path within desired bound

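The checkpoint test itself can be very cheap. A minimal sketch, assuming downward-growing stacks and a hypothetical switch_to_new_chunk(); the identifiers are illustrative, not taken from the paper's code:

  #include <stddef.h>

  extern char *chunk_base;                /* low end of current chunk */
  extern void  switch_to_new_chunk(size_t size);

  /* needed = worst-case bytes used on any path to the next
   * checkpoint, computed by the compiler's call-graph analysis */
  void checkpoint(size_t needed)
  {
      char probe;                         /* approximates the current SP */
      size_t avail = (size_t)(&probe - chunk_base);
      if (avail < needed)
          switch_to_new_chunk(needed);    /* link in a fresh chunk and
                                             redirect the stack pointer */
  }
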
13
Efficient Stack Management
  • Special cases
  • Function pointers complicate analysis
  • External function calls
  • Tuning to optimize memory usage
  • MaxPath
  • MinChunk
  • Linked stacks can improve paging behavior
  • Apache SPECweb99 results: 3-4% slowdown overall

14
Resource Aware Scheduling
  • Thread scheduling and admission control adapt to
    resource usage
  • Application viewed as sequence of stages
    separated by blocking points
  • Dynamic scheduling decisions are finer grained
  • Blocking graphs generated at runtime
  • Learn behavior dynamically to improve scheduling
  • Determine the impact on resource utilization of
    scheduling a thread

15
Resource Aware Scheduling
Blocking Graph
  • Nodes are program locations where threads block
  • Edges reflect consecutive blocking points
  • Edges annotated with exponentially weighted
    averages of resource usage (see the sketch below)
  • Nodes annotated with weighted averages of their
    outgoing edge values
  • Threads walk this graph independently

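The exponentially weighted averaging might look like the sketch below; the smoothing factor and the stat fields are assumptions for illustration, not values from the paper:

  #define ALPHA 0.3                     /* illustrative smoothing factor */

  struct edge_stats {
      double cpu_avg;                   /* avg CPU time on this edge    */
      double mem_avg;                   /* avg memory delta on this edge */
  };

  void edge_update(struct edge_stats *e, double cpu, double mem)
  {
      /* new average = ALPHA * latest sample + (1 - ALPHA) * old value */
      e->cpu_avg = ALPHA * cpu + (1.0 - ALPHA) * e->cpu_avg;
      e->mem_avg = ALPHA * mem + (1.0 - ALPHA) * e->mem_avg;
  }
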
16
Resource Aware Scheduling
  • Promote nodes that release resources and demote
    nodes that acquire resources
  • Dynamically prioritize nodes (threads) for
    scheduling
  • Responds to changes in resource consumption due
    to type of work and offered load
  • Implemented using a separate run queue for each
    node (see the sketch below)

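One plausible shape for that scheduler, with queue_t and its helpers assumed rather than taken from Capriccio: each blocking-graph node owns a queue of ready threads, and the dispatcher picks from the highest-priority non-empty node.

  typedef struct thread thread_t;
  typedef struct queue  queue_t;
  extern int       queue_empty(queue_t *q);
  extern thread_t *queue_pop(queue_t *q);

  struct bg_node {
      double   priority;                /* raised for nodes that release
                                           resources, lowered for ones
                                           that acquire them */
      queue_t *runq;                    /* threads parked at this node */
  };

  thread_t *pick_next(struct bg_node *nodes, int n)
  {
      struct bg_node *best = NULL;
      for (int i = 0; i < n; i++)       /* highest-priority non-empty */
          if (!queue_empty(nodes[i].runq) &&
              (best == NULL || nodes[i].priority > best->priority))
              best = &nodes[i];
      return best ? queue_pop(best->runq) : NULL;
  }
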
17
Resource Aware Scheduling
  • Usage
  • Drive each resource toward maximum capacity, then
    throttle back; coupled with hysteresis, this keeps
    the system near full throttle (sketched below)
  • Challenges
  • Determining the maximum capacity of a resource is
    tricky
  • Interaction between resources
  • Thrashing can be difficult to detect
  • Application-specific resources (e.g., memory
    management)

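The throttle-with-hysteresis idea reduces to a two-threshold admission test; the watermarks below are illustrative, not the paper's values:

  #define HIGH_WATER 0.90   /* stop admitting above 90% utilization */
  #define LOW_WATER  0.75   /* resume admitting below 75%           */

  static int throttled = 0;

  int may_admit(double utilization)
  {
      if (!throttled && utilization > HIGH_WATER)
          throttled = 1;                /* back off near capacity     */
      else if (throttled && utilization < LOW_WATER)
          throttled = 0;                /* hysteresis gap avoids
                                           oscillating at the edge    */
      return !throttled;
  }
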
18
Performance
  • Evaluate real-world web server workload
  • Testbed
  • 4x500 MHz Pentium / 2GB / Gigabit Ethernet
  • Linux 2.4.20
  • Kernel version doesn't support epoll or AIO (poll
    used instead)
  • Client load generated by up to 16 machines of
    similar configuration
  • 3.2GB static file data with various file sizes
  • Clients repeatedly connect and issue 5 requests,
    20ms apart
  • Cache sizes for Haboob and Knot limited to 200MB
    to force disk activity
  • Request frequencies for each size and file based
    on SPECweb99

19
Performance
  • 15% increase with Apache
  • Knot comparable to event-based Haboob

20
Performance
  • Overhead involved in maintaining information
    about resources at each node
  • Gathering and maintaining statistics
  • <2% for edges in Apache
  • Statistics remained fairly steady in tested
    workloads
  • Sampling ratio of 1/20 reduces aggregate overhead
    to 0.1%
  • Stack trace overhead significant
  • (8% for Apache / 36% for Knot)
  • Could be reduced with compiler integration

21
Future Work
  • Incorporate multiprocessor support
  • Reduce kernel crossings under heavy load with a
    batching interface for async I/O
  • Improve thrashing detection
  • Improve stack analysis of function pointers
    (using CCured)
  • Develop profiler tools to optimize tuning
    parameters
  • Generate blocking graph at compile time
  • Implement blocking point fairness strategies

22
Conclusion
  • Fixing the thread package enables scalable,
    high-concurrency Internet servers
  • The threading model is more useful for
    high-concurrency programming
  • User level thread package is decoupled from OS
  • Can benefit from new I/O mechanisms and compiler
    support
  • Linked stacks and scheduler delivered significant
    improvements in scalability and performance
    compared with existing systems

23
Observations
  • External function call stack size doesn't scale
  • Offloads responsibility to compiler support
  • "compiler technology will play an important
    role in the evolution of the techniques described
    in this paper"
  • Performance test
  • Data not qualified: how many runs? Are the results
    repeatable?
  • Kernel didn't have the same non-blocking call
    support, so comparison is difficult; are the
    results still meaningful?
  • Stated goal of achieving 100,000s of threads not
    explicitly evident

24
Discussion
  • It seems as though using a graph to dynamically
    adjust the stack size (vs a default large stack
    size) is a smart thing to do, especially if
    memory is a problem. I'm trying to figure out if
    this is a new era of more intelligent thread
    packages, or if this is an overly complex
    solution which has been avoided. So what is the
    expense (in terms of computation) of this
    intelligent stack management? Is it necessary for
    this application to succeed?

25
Discussion
  • Capriccio can scale to 100,000 threads, but what
    about more than 100,000 threads? Will the system
    just crash? Is there no mechanism in place if
    that happens?
  • I was wondering whether the dynamic stack chunks
    are mapped contiguously in the virtual memory of
    the thread? If this was the case, how could they
    achieve adding a chunk of memory to the stack as
    small as half a page?

26
Discussion
  • In the experimental section there is no mention
    of how many tests were performed, and from the
    looks of it, there was just one, since otherwise
    vanilla Apache seems to dip and then improve in
    bandwidth as more clients connect. Also Knot
    seems to have approximately the same performance
    as Haboob, so I'm wondering how conclusive these
    tests really are?

27
Discussion
  • The authors continually refer to their program's
    event-driven behavior (pages 3, 8, and 11). In this
    way, it is a similar implementation to SEDA (in
    that both event and thread behaviors are
    exhibited). What is the implied advantage of
    fixing threads to behave like events over fixing
    events to behave like (or use) threads?

28
Discussion
  • What the authors seem to be doing with the
    scheduling of the system is to wrap event-based
    behavior (for I/O) in a thread-based
    abstraction. Is this extra layer of abstraction
    really needed? How much does the extra layer of
    abstraction affect the performance of the system
    in general? Also, why is it that people don't
    accept the fact that events are better for this
    type of task and just use them as they are, as
    opposed to dressing them up in thread costumes?

29
Discussion
  • One assumption that the authors make is that
    resource usage is likely to be similar for many
    tasks at a blocking point. They say that this
    assumption "seems to hold in practice". This is
    of course not too convincing. Is this actually a
    good assumption to make? Are there any systems
    where this does not hold, and what would be the
    consequences on this piece of work? 

30
Discussion
  • The authors comment that the resource-aware
    scheduling is completely adaptive, but also
    concede that the system suffers from several
    parameter-tuning problems, like knowing the
    maximum capacity of each resource and adjusting
    the speed of adaptation (no reason is given for
    using exponentially weighted averages). Finding
    optimal parameters can be another huge task, too
    hard to do by hand. Doesn't this make things more
    complicated or uncontrollable?

31
Discussion
  • One of the key features that is incorporated into
    Capriccio is a new method of stack management,
    linked stack management, whose goal is to improve
    performance by reducing the amount of wasted
    stack space, typical with other types of stack
    management. Their approach is contingent on
    compiler support. Is it realistic to expect to
    see the development of a compiler for this
    purpose?

32
Discussion
  • In the case study, the authors choose MaxPath and
    MinChunk, the two tuning parameters available
    with their linked stack management algorithm,
    based on profiling information. Is it reasonable
    to expect the programmer to supply this
    information? How sensitive is the algorithm to
    these parameters?

33
Discussion
  • Would it be possible to use something like NPTL
    under low-load, since it performs better than
    Capriccio, then switch to Capriccio under higher
    loads when it begins to outperform NPTL? This
    would give the best of both and constantly
    maintain good performance.

34
Discussion
  • In Section 3.1, the authors used whole-program
    analysis to determine the maximum amount of stack
    space that a single stack frame for a function
    will consume. What about dynamic memory
    allocation? If the code allocates memory of
    various sizes at run time, how can the program
    estimate the maximum stack size (or do they just
    give a rough estimate)?