Capriccio: Scalable Threads For Internet Services

1
Capriccio: Scalable Threads For Internet Services
  • Authors
  • Rob von Behren, Jeremy Condit, Feng Zhou, George
    C. Necula, Eric Brewer

Presentation by Will Hrudey
2
Introduction
  • Capriccio
  • a sprightly, improvisational musical piece
    involving multiple voices
  • Introduces a fast, scalable user-level thread
    package for thread management and synchronization

3
Motivation
  • Internet Servers And Databases
  • Have ever-increasing scalability needs
  • Need to handle hundreds of thousands of
    simultaneous connections without significant
    degradation
  • Need for a programming model to achieve
    efficient, robust servers with ease

4
Approach
  • Utilizes user level threads to provide a natural
    abstraction for high concurrency programming
  • Prior work discussed threads versus events
  • Decouples thread package from OS to take
    advantage of
  • Cooperative threading
  • New asynchronous I/O interfaces
  • Compiler support
  • Provides 3 key features
  • Scalability
  • Linked stacks
  • Resource aware scheduling

5
Goals
  • To allow high performance without high complexity
  • Support for existing thread APIs (POSIX)
  • Scalability to 100,000s of threads
  • Flexibility to address application-specific needs
  • Little or no modification of application itself

6
User Level Threads
  • Provide performance and flexibility advantages
  • Provide a clean programming model with useful
    invariants and semantics
  • Decouples thread package from OS
  • Hides both OS variation and kernel evolution
  • Integrate compiler support
  • Can complicate preemption
  • Can interact badly with kernel scheduler

7
User Level Threads
  • Flexibility
  • Take advantage of new asynchronous I/O mechanisms
  • Tailored scheduling
  • Lightweight (scale to 100,000 threads)
  • Performance
  • Reduced synchronization overhead on uniprocessors
  • More efficient memory management
  • Disadvantages
  • Blocking I/O must be intercepted
  • Wrapper layer translates blocking calls to
    non-blocking I/O (see the sketch below)
  • Benefits of lightweight synchronization are
    diminished on multiprocessors

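As a rough illustration of that wrapper layer, the sketch below shows how a blocking read() can be translated into a non-blocking attempt plus a cooperative yield. The sched_wait_for_fd() hook is hypothetical; none of these names come from the Capriccio source.

  #include <errno.h>
  #include <unistd.h>

  /* Hypothetical scheduler hook: parks the current user-level thread
   * until fd becomes readable, running other threads meanwhile. */
  void sched_wait_for_fd(int fd);

  ssize_t thread_read(int fd, void *buf, size_t len)
  {
      for (;;) {
          ssize_t n = read(fd, buf, len);   /* fd assumed O_NONBLOCK   */
          if (n >= 0 || errno != EAGAIN)
              return n;                     /* data, EOF, or real error */
          sched_wait_for_fd(fd);            /* yield until fd is ready  */
      }
  }
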
8
User Level Threads
  • Implementation (user level library for Linux)
  • Context switches
  • coroutine library
  • I/O layer intercepts blocking I/O calls
  • epoll for pollable file descriptors, Linux AIO
    for disk I/O
  • Scheduling
  • Main loop looks like an event-driven application:
    runs threads, then checks for I/O completions
    (sketched below)
  • Synchronization
  • Cooperative scheduling reduces synchronization
    overhead
  • Efficiency
  • Thread management functions have bounded
    worst-case running times

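The main loop might look like the following minimal sketch, assuming a coroutine-style switch_to(), a run queue, and a waiter_for() lookup from file descriptor to blocked thread. All of these helpers are assumptions for illustration, not Capriccio's actual API.

  #include <sys/epoll.h>

  typedef struct thread thread_t;
  extern thread_t *runqueue_pop(void);
  extern void      runqueue_push(thread_t *t);
  extern void      switch_to(thread_t *t);        /* coroutine resume */
  extern thread_t *waiter_for(int fd);

  void scheduler_loop(int epfd)
  {
      struct epoll_event ev[64];
      for (;;) {
          thread_t *t;
          while ((t = runqueue_pop()) != NULL)    /* run ready threads  */
              switch_to(t);                       /* until they block   */

          int n = epoll_wait(epfd, ev, 64, -1);   /* harvest I/O events */
          for (int i = 0; i < n; i++)
              runqueue_push(waiter_for(ev[i].data.fd)); /* wake waiters */
      }
  }
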
9
User Level Threads
  • Microbenchmark
  • Testbed
  • 2x2.4GHz Xeon / 1GB / 2x10K RPM SCSI Ultra II HD
    / 3xGigabit Ethernet / Linux 2.5.70
  • Thread packages
  • Capriccio, LinuxThreads, NPTL

10
Efficient Stack Management
  • Optimizes stack allocation for many threads
  • Reduces size of VM dedicated to stacks
  • Small non-contiguous stack chunks
  • Grow and shrink at run time
  • Compiler analysis and runtime checks
  • Generates a weighted, directed call graph

11
Efficient Stack Management
Weighted Call Graph
  • Nodes are functions weighted by max stack size
  • Edges indicate function calls between nodes
  • Path is a sequence of stack frames
  • Checkpoints are code inserted at call sites

12
Efficient Stack Management
  • Places a reasonable bound on the amount of stack
    space consumed by each thread
  • Checkpoints determine whether enough space remains
    to reach the next checkpoint without overflow
  • If not, a new stack chunk is allocated and the
    stack pointer adjusted (see the sketch below)
  • Checkpoint placement
  • Break cycles
  • Scan nodes to ensure path within desired bound

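The checkpoint test itself can be very cheap. A minimal sketch, assuming downward-growing stacks and a hypothetical switch_to_new_chunk(); the identifiers are illustrative, not taken from the paper's code:

  #include <stddef.h>

  extern char *chunk_base;                /* low end of current chunk */
  extern void  switch_to_new_chunk(size_t size);

  /* needed = worst-case bytes used on any path to the next
   * checkpoint, computed by the compiler's call-graph analysis */
  void checkpoint(size_t needed)
  {
      char probe;                         /* approximates the current SP */
      size_t avail = (size_t)(&probe - chunk_base);
      if (avail < needed)
          switch_to_new_chunk(needed);    /* link in a fresh chunk and
                                             redirect the stack pointer */
  }
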
13
Efficient Stack Management
  • Special cases
  • Function pointers complicate analysis
  • External function calls
  • Tuning to optimize memory usage
  • MaxPath
  • MinChunk
  • Linked stacks can improve paging behavior
  • Apache SPECweb99 results: 3-4% slowdown overall

14
Resource Aware Scheduling
  • Thread scheduling and admission control adapt to
    resource usage
  • Application viewed as sequence of stages
    separated by blocking points
  • Dynamic scheduling decisions are finer grained
  • Blocking graphs generated at runtime
  • Learn behavior dynamically to improve scheduling
  • Determine the impact on resource utilization of
    scheduling a thread

15
Resource Aware Scheduling
Blocking Graph
  • Nodes are program locations where threads block
  • Edges reflect consecutive blocking points
  • Edges annotated with exponentially weighted
    averages of resource usage (see the sketch below)
  • Nodes annotated with weighted averages of their
    outgoing edge values
  • Threads walk this graph independently

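The exponentially weighted averaging might look like the sketch below; the smoothing factor and the stat fields are assumptions for illustration, not values from the paper:

  #define ALPHA 0.3                     /* illustrative smoothing factor */

  struct edge_stats {
      double cpu_avg;                   /* avg CPU time on this edge    */
      double mem_avg;                   /* avg memory delta on this edge */
  };

  void edge_update(struct edge_stats *e, double cpu, double mem)
  {
      /* new average = ALPHA * latest sample + (1 - ALPHA) * old value */
      e->cpu_avg = ALPHA * cpu + (1.0 - ALPHA) * e->cpu_avg;
      e->mem_avg = ALPHA * mem + (1.0 - ALPHA) * e->mem_avg;
  }
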
16
Resource Aware Scheduling
  • Promote nodes that release resources and demote
    nodes that acquire resources
  • Dynamically prioritize nodes (threads) for
    scheduling
  • Responds to changes in resource consumption due
    to type of work and offered load
  • Implemented using a separate run queue for each
    node (see the sketch below)

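One plausible shape for that scheduler, with queue_t and its helpers assumed rather than taken from Capriccio: each blocking-graph node owns a queue of ready threads, and the dispatcher picks from the highest-priority non-empty node.

  typedef struct thread thread_t;
  typedef struct queue  queue_t;
  extern int       queue_empty(queue_t *q);
  extern thread_t *queue_pop(queue_t *q);

  struct bg_node {
      double   priority;                /* raised for nodes that release
                                           resources, lowered for ones
                                           that acquire them */
      queue_t *runq;                    /* threads parked at this node */
  };

  thread_t *pick_next(struct bg_node *nodes, int n)
  {
      struct bg_node *best = NULL;
      for (int i = 0; i < n; i++)       /* highest-priority non-empty */
          if (!queue_empty(nodes[i].runq) &&
              (best == NULL || nodes[i].priority > best->priority))
              best = &nodes[i];
      return best ? queue_pop(best->runq) : NULL;
  }
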
17
Resource Aware Scheduling
  • Usage
  • Drive each resource toward maximum capacity, then
    throttle back; coupled with hysteresis, this keeps
    the system near full throttle (sketched below)
  • Challenges
  • Determining the maximum capacity of a resource is
    tricky
  • Interaction between resources
  • Thrashing can be difficult to detect
  • Application-specific resources (e.g., memory
    management)

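The throttle-with-hysteresis idea reduces to a two-threshold admission test; the watermarks below are illustrative, not the paper's values:

  #define HIGH_WATER 0.90   /* stop admitting above 90% utilization */
  #define LOW_WATER  0.75   /* resume admitting below 75%           */

  static int throttled = 0;

  int may_admit(double utilization)
  {
      if (!throttled && utilization > HIGH_WATER)
          throttled = 1;                /* back off near capacity     */
      else if (throttled && utilization < LOW_WATER)
          throttled = 0;                /* hysteresis gap avoids
                                           oscillating at the edge    */
      return !throttled;
  }
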
18
Performance
  • Evaluate real-world web server workload
  • Testbed
  • 4x500 MHz Pentium / 2GB / Gigabit Ethernet
  • Linux 2.4.20
  • Kernel version doesn't support epoll or AIO (poll
    used instead)
  • Client load generated by up to 16 machines of
    similar configuration
  • 3.2GB static file data with various file sizes
  • Clients repeatedly connect and issue 5 requests,
    20ms apart
  • Cache sizes for Haboob and Knot limited to 200MB
    to force disk activity
  • Request frequencies for each size and file based
    on SPECweb99

19
Performance
  • 15% increase with Apache
  • Knot comparable to event-based Haboob

20
Performance
  • Overhead involved in maintaining information
    about resources at each node
  • Gathering and maintaining statistics
  • <2% for edges in Apache
  • Statistics remained fairly steady in tested
    workloads
  • Sampling ratio of 1/20 reduces aggregate overhead
    to 0.1%
  • Stack trace overhead significant
  • (8% for Apache / 36% for Knot)
  • Could be reduced with compiler integration

21
Future Work
  • Incorporate multiprocessor support
  • Reduce kernel crossings under heavy load with a
    batching interface for async I/O
  • Improve thrashing detection
  • Improve stack analysis of function pointers
    (using CCured)
  • Develop profiler tools to optimize tuning
    parameters
  • Generate blocking graph at compile time
  • Implement blocking point fairness strategies

22
Conclusion
  • Fixing the thread package enables scalable,
    high-concurrency Internet servers
  • The threading model is more useful for
    high-concurrency programming
  • User level thread package is decoupled from OS
  • Can benefit from new I/O mechanisms and compiler
    support
  • Linked stacks and scheduler delivered significant
    improvements in scalability and performance
    compared with existing systems

23
Observations
  • External function call stack size doesn't scale
  • Offloads responsibility to compiler support
  • "compiler technology will play an important
    role in the evolution of the techniques described
    in this paper"
  • Performance test
  • Data not qualified: how many runs? Are the results
    repeatable?
  • Kernel didn't have the same non-blocking call
    support, so comparison is difficult; are the
    results still meaningful?
  • Stated goal of achieving 100,000s of threads not
    explicitly evident

24
Discussion
  • It seems as though using a graph to dynamically
    adjust the stack size (vs a default large stack
    size) is a smart thing to do, especially if
    memory is a problem. I'm trying to figure out if
    this is a new era of more intelligent thread
    packages, or if this is an overly complex
    solution which has been avoided. So what is the
    expense (in terms of computation) of this
    intelligent stack management? Is it necessary for
    this application to succeed?

25
Discussion
  • Capriccio can scale to 100,000 threads, but what
    about more than 100,000 threads? Will the system
    just crash? Is there no mechanism in place if
    that happens?
  • I was wondering whether the dynamic stack chunks
    are mapped contiguously in the virtual memory of
    the thread? If this was the case, how could they
    achieve adding a chunk of memory to the stack as
    small as half a page?

26
Discussion
  • In the experimental section there is no mention
    of how many tests were performed, and from the
    looks of it, there was just one, since otherwise
    vanilla Apache seems to dip and then improve in
    bandwidth as more clients connect. Also Knot
    seems to have approximately the same performance
    as Haboob, so I'm wondering how conclusive these
    tests really are?

27
Discussion
  • The authors continually refer to their program's
    event-driven behavior (pages 3, 8, and 11). In this
    way, it is a similar implementation to SEDA (in
    that both event and thread behaviors are
    exhibited). What is the implied advantage of
    fixing threads to behave like events over fixing
    events to behave like (or use) threads?

28
Discussion
  • What the authors seem to be doing with the
    scheduling of the system is to wrap event-based
    behavior (for I/O) in a thread-based
    abstraction. Is this extra layer of abstraction
    really needed? How much does the extra layer of
    abstraction affect the performance of the system
    in general? Also, why is it that people don't
    accept the fact that events are better for this
    type of task and just use them as they are, as
    opposed to dressing them up in thread costumes?

29
Discussion
  • One assumption that the authors make is that
    resource usage is likely to be similar for many
    tasks at a blocking point. They say that this
    assumption "seems to hold in practice". This is
    of course not too convincing. Is this actually a
    good assumption to make? Are there any systems
    where this does not hold, and what would be the
    consequences on this piece of work? 

30
Discussion
  • The authors comment that the resource-aware
    scheduling is completely adaptive, but also
    concede that the system suffers from several
    parameter-tuning problems, like knowing the
    maximum capacity of each resource and adjusting
    the speed of adaptation (no reason is given for
    using exponentially weighted averages). Finding
    optimal parameters can be another huge task, too
    hard to do by hand. Doesn't this make things more
    complicated or uncontrollable?

31
Discussion
  • One of the key features that is incorporated into
    Capriccio is a new method of stack management,
    linked stack management, whose goal is to improve
    performance by reducing the amount of wasted
    stack space, typical with other types of stack
    management. Their approach is contingent on
    compiler support. Is it realistic to expect to
    see the development of a compiler for this
    purpose?

32
Discussion
  • In the case study, the authors choose MaxPath and
    MinChunk, the two tuning parameters available
    with their linked stack management algorithm,
    based on profiling information. Is it reasonable
    to expect the programmer to supply this
    information? How sensitive is the algorithm to
    these parameters?

33
Discussion
  • Would it be possible to use something like NPTL
    under low-load, since it performs better than
    Capriccio, then switch to Capriccio under higher
    loads when it begins to outperform NPTL? This
    would give the best of both and constantly
    maintain good performance.

34
Discussion
  • In Section 3.1, the authors used whole-program
    analysis to determine the maximum amount of stack
    space that a single stack frame for a function
    will consume. What about dynamic memory
    allocation? If the code allocates memory of
    various sizes at run time, how can the program
    estimate the maximum stack size (or do they just
    give a rough estimate)?