Title: Eliminating Receive Livelock in an Interrupt-Driven Kernel
Slide 1: Eliminating Receive Livelock in an Interrupt-Driven Kernel
Slide 2: Questions
- What are buffers for?
- Amortizing overhead
- Smoothing load
- What are interrupts for?
- Compare to polling.
- Which has more latency?
- When is polling good?
- What is scalability?
Slide 3: Poll (periodic)
[Diagram: Customer -> Agent -> You (Artist)]
- Suppose you are an artist, with an agent who has some samples of your work.
- You periodically check with your agent to see if anyone wants to commission a work.
- What might be bad about this?
Slide 4: Interrupt (immediate)
[Diagram: Customer -> Agent -> You (Artist)]
- Ask the agent to call immediately whenever anyone expresses interest.
- What might be bad about this?
Slide 5: Poll (when done with a painting), interrupt (when idle)
[Diagram: Customer -> Agent -> You (Artist)]
- When done with one painting, poll for another.
- If no jobs are waiting, then enable interrupts and wait.
- The first interrupt disables interrupts.
Slide 6: Introduction
Slide 7:
- OSs were originally designed to handle devices that interrupt only once every few milliseconds.
  - Disks, slow network adapters.
- The world has changed; network adapters now interrupt much more often.
- Many network applications are not flow-controlled. (Why?)
  - Congestive collapse.
  - No negative feedback loop. Maybe even a positive feedback loop. (Explain?)
  - Example of a call center as a positive feedback loop.
- Maybe we can't accommodate the load, but we should respond gracefully.
- Interrupt-driven systems tend to respond badly under load. Tasks performed at interrupt level, by definition, have higher priority. If all time is spent responding to interrupts, nothing else will happen. This is receive livelock.
- Note that this definition of livelock is a little different than in other contexts.
  - You can have livelock in a totally contained system.
  - It is just an infinite loop across two or more threads:
    - s1, s2, s3, s1, s2, s3, ...
    - s1, t5, s3, t9, s1, t5, s3, t9, ...
Slide 8: Livelock
- Any situation where you may have unbounded input rates and a non-zero per-input cost will eventually livelock.
- Turning off interrupts gives zero cost.
- A hardware-limited device bounds the input rate.
Slide 9:
- But interrupts are very useful. Hybrid design:
  - Poll only when triggered by an interrupt; interrupt only when polling is suspended.
- Then augment with feedback control to drop packets with the least investment.
- Then connect the scheduling subsystem to the network subsystem to give some CPU time to user tasks even under overload.
Slide 10: Motivating Applications
Slide 11: Motivating Applications
- Host-based routing
  - Many products based on Linux/UNIX.
  - Experimentation also done on UNIX.
- Passive network monitoring
  - Simpler/cheaper to do with a general-purpose OS.
- Network file service
  - Can be swamped by NFS/RPC.
- High-performance networking
  - Even though flow-controlled, livelock might still be an issue.
Slide 12: Requirements for Scheduling Network Tasks
- Ideally, handle the worst-case load.
  - Too expensive.
- Graceful degradation.
- Constant overhead.
  - If overhead increases as offered load increases, it eventually consumes all of the CPU.
- Throughput
  - Defined as the rate delivered to the ultimate consumer.
  - Should keep up with offered load up to the MLFRR, and never drop below it.
  - Also must allow transmission to continue.
- Latency and jitter
  - Even during high load, avoid long queues.
  - Avoid bursty scheduling, which increases jitter. (Why is jitter bad?)
- Fair allocation
  - Must continue to process other tasks.
Slide 13: Interrupt-Driven Scheduling and Its Consequences
Slide 14: Problems
- Three kinds of problems:
  - Receive livelock under overload
  - Increased latency for packet delivery or forwarding
  - Starvation of transmission
- What causes these problems?
  - They arise from the interrupt subsystem not being a component of the scheduler.
Slide 15: Description of an Interrupt-Driven System
- Based on 4.2 BSD; others are similar.
- The network interface signals packet arrival by raising an interrupt.
- Interrupt handler in the device driver:
  - Performs some initial processing.
  - Places the packet on a queue.
  - Generates a software interrupt (at a lower IPL) to do the rest.
- No scheduler participation.
- Some amortization is done by batching of interrupts.
  - How is batching different from polling?
- But under heavy load, all time is still spent at device IPL.
- Incoming packets are given absolute priority.
  - This design was based on early adapters with little memory.
  - Not appropriate for modern devices.
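The receive path above can be sketched in a few lines (names are hypothetical and the real BSD code is considerably more involved; the queue limit of 50 is the classic BSD `ipintrq` default):

```c
#include <stdbool.h>

#define IPINTRQ_MAX 50             /* classic BSD IP input queue limit */

static int  ipintrq_len  = 0;      /* packets on the IP input queue    */
static bool softintr_req = false;  /* software interrupt requested?    */
static int  dropped      = 0;

/* Device-IPL interrupt handler: minimal work, then hand off. It runs
 * at the highest priority regardless of what the scheduler wants --
 * the root of the livelock problem. */
void device_rx_intr(void) {
    if (ipintrq_len >= IPINTRQ_MAX) {
        dropped++;                 /* drop -- but only after link-level
                                      work has already been spent      */
        return;
    }
    ipintrq_len++;                 /* enqueue for protocol processing  */
    softintr_req = true;           /* request softirq at lower IPL     */
}

/* Software-interrupt handler: IP-level processing. It runs only when
 * no device interrupt is pending, so under overload it may never run. */
int ip_softintr(void) {
    int done = 0;
    softintr_req = false;
    while (ipintrq_len > 0) { ipintrq_len--; done++; }
    return done;
}
```

Under overload `device_rx_intr` fires continuously and `ip_softintr` is starved: the queue fills, later packets are dropped after their link-level cost is already paid, and delivered throughput collapses.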
Slide 16 (figure only)
Slide 17: Receive Livelock
- The system can behave in one of three ways as load increases:
  - Ideal: throughput always matches offered load.
  - Realizable: throughput goes up to the MLFRR, then stays constant.
  - Livelock: throughput goes down as offered load increases.
- What is the effect of better performance?
- What is the effect of batching?
- The fundamental problem is not performance, but priorities/scheduling.
Slide 18: Receive Latency under Overload
- Interrupts are usually thought of as a way to reduce latency.
- A burst arrives:
  - First, link-level processing of the whole burst,
  - Then higher-level processing of each packet.
- May result in bad scheduling.
  - An NFS RPC requires the disk.
- Experiment:
  - Link-level processing at device IPL, including copying the packet into kernel buffers (no DMA).
  - Further processing following a software interrupt: locating the process, queuing the packet for delivery to that process.
  - Awakening the user process, which copies the packet into its own buffer.
Slide 19: Receive Latency under Overload
- Latency to deliver the first packet to the user application is almost linear in burst size:
  - one-packet burst: 1.23 ms
  - two-packet burst: 1.54 ms
  - four-packet burst: 2.02 ms
  - 16-packet burst: 5.03 ms
- Can we expect a total lack of effect of burst size on latency?
Slide 20: Starvation of Transmits under Load
- The context is routers/forwarding.
- Transmission is usually done at a lower priority than receiving.
  - The idea is to minimize packet loss during a burst.
- However, under load, starvation can occur.
Slide 21: Avoiding Livelock Through Better Scheduling
Slide 22: Avoiding Livelock Through Better Scheduling
- Control the rate of interrupts.
- Polling-based mechanisms to ensure fair allocation of resources.
- Techniques to avoid unnecessary preemption of downstream packet processing.
Slide 23: Limiting Interrupt Rate
- Minimize the work spent on packets that will be dropped.
- Disable interrupts when the system can't handle the load.
  - When the internal queue is full, disable.
  - Re-enable when buffer space is available, or after a delay. (Which is better, in general?)
- Guarantee some progress for user-level code.
  - Time how long is spent in packet-input code; disable interrupts if it is too much.
  - Can simulate this by using the clock interrupt to sample state.
  - Related question: How does the OS compute CPU usage? How about profiling?
Slide 24: Use of Polling
- When tasks behave unpredictably, use interrupts.
- When they behave predictably, use polling.
- Also poll to get fair allocation, by using round-robin.
Slide 25: Avoiding Preemption
- Livelock occurs because interrupts preempt everything else.
- The solution is to run downstream processing at the same IPL:
  - Run (almost) everything at a high IPL, or
  - Run (almost) everything at a low IPL.
  - Which is better?
- The interrupt handler only sets a flag and schedules the polling thread.
- The polling thread re-enables interrupts only when done.
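A minimal user-space sketch of this flag-and-poll discipline (all names are hypothetical; a real driver does this at interrupt level with the device's interrupt mask register):

```c
#include <stdbool.h>

/* Simulated device state. */
static bool intr_enabled = true;   /* device interrupts armed?      */
static bool poll_pending = false;  /* flag set by the "interrupt"   */
static int  rx_queue     = 0;      /* packets waiting on the device */
static int  delivered    = 0;      /* packets handed to the stack   */

/* Interrupt handler: do almost nothing. Set a flag, mask further
 * interrupts, and let the polling thread do the real work. */
void rx_interrupt(void) {
    if (!intr_enabled) return;     /* masked: hardware won't fire   */
    intr_enabled = false;
    poll_pending = true;
}

/* Polling thread: drain the device, then (and only then) re-arm
 * interrupts so the next packet starts a new polling cycle. */
void poll_thread(void) {
    if (!poll_pending) return;
    while (rx_queue > 0) {         /* process to completion         */
        rx_queue--;
        delivered++;
    }
    poll_pending = false;
    intr_enabled = true;           /* idle again: interrupts back on */
}
```

The point of the structure: once the first packet of a burst fires the interrupt, every later packet in the burst is picked up by polling at the poller's priority, not at device IPL, so downstream work is never preempted into starvation.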
Slide 26: Summary
- Avoid livelock by:
  - Using interrupts only to initiate polling.
  - Using round-robin polling to fairly allocate resources among sources.
  - Temporarily disabling input when feedback from a full queue, or a limit on CPU usage, indicates that other important tasks are pending.
  - Dropping packets early, rather than late, to avoid wasted work. Once we decide to receive a packet, try to process it to completion.
- Maintain high performance by:
  - Re-enabling interrupts when no work is pending, to avoid polling overhead and to keep latency low.
  - Letting the receiving interface buffer bursts, to avoid dropping packets.
  - Eliminating the IP input queue, and its associated overhead.
Slide 27: Livelock in BSD-Based Routers
Slide 28: Livelock in BSD-Based Routers
- IP packet router built using Digital UNIX.
- Goals:
  - Obtain the highest possible maximum throughput.
  - Maintain throughput even when overloaded.
  - Allocate sufficient CPU cycles to user-mode tasks.
  - Minimize latency.
  - Avoid degrading performance in other applications.
Slide 29: Measurement Methodology
- Host-based router connecting two Ethernets.
- The source host generated UDP packets carrying 4 bytes of data.
- Used a slow Alpha host, to make livelock more evident.
- Tested both the pure kernel, and the kernel plus a user-mode component (screend).
- Throughput (Y-axis) is the output rate.
Slide 30
- What's the MLFRR? Is it really?
- Where does livelock occur?
- Why is screend worse than the pure kernel?
Slide 31: Why Livelock Occurs in the 4.2 BSD Model
- Should discard packets as early as possible.
Slide 32: Fixing the Livelock Problem
- Drivers register with the polling system.
- The polling system notices which interfaces need processing, and calls their callbacks with a quota.
- The received-packet callback calls the IP processing directly.
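The registration-plus-quota scheme can be sketched as follows (the callback signature and all names here are hypothetical, not the Digital UNIX interface):

```c
#define MAX_IFACES 4

/* Hypothetical received-packet callback: drain up to `quota` packets
 * from interface `iface`, return how many were actually processed. */
typedef int (*rx_callback)(int iface, int quota);

static rx_callback callbacks[MAX_IFACES];
static int n_ifaces = 0;

/* Driver registration, as on the slide. */
void register_iface(rx_callback cb) { callbacks[n_ifaces++] = cb; }

/* One round of the polling loop: round-robin over all registered
 * interfaces, giving each at most `quota` packets per round so one
 * busy interface cannot starve the others. */
int poll_round(int quota) {
    int total = 0;
    for (int i = 0; i < n_ifaces; i++)
        total += callbacks[i](i, quota);
    return total;
}

/* --- toy driver callback for demonstration --- */
static int backlog[MAX_IFACES];   /* packets waiting per interface */
static int drain(int iface, int quota) {
    int n = backlog[iface] < quota ? backlog[iface] : quota;
    backlog[iface] -= n;
    return n;
}
```

With backlogs of 25 and 3 packets on two interfaces and a quota of 10, one round processes 10 from the busy interface and all 3 from the quiet one: the quota is what makes the allocation fair.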
Slide 33: Results of Modifications
- Why is the slope gradual in one case, and not so gradual in the other?
Slide 34: Feedback from Full Queues
- Detect when the screend queue is full.
- The quota was 10; the screend queue was 32 entries, with 25% and 75% watermarks.
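The two watermarks give hysteresis, which the following sketch illustrates (a simplified reading of the slide's parameters; names are hypothetical):

```c
#include <stdbool.h>

#define QLEN      32                  /* screend queue capacity        */
#define HIGH_MARK (QLEN * 75 / 100)   /* 24: stop input above this     */
#define LOW_MARK  (QLEN * 25 / 100)   /*  8: resume input below this   */

static bool input_enabled = true;

/* Called as the screend queue length changes: disable packet input at
 * the high watermark, and re-enable only once the queue drains below
 * the low watermark, so the system doesn't flap on every packet. */
void update_input_state(int qlen) {
    if (qlen >= HIGH_MARK)
        input_enabled = false;
    else if (qlen <= LOW_MARK)
        input_enabled = true;
    /* between the marks: keep the previous state */
}
```

While input is disabled, arriving packets are dropped in the interface (cheaply) instead of after the kernel has invested work in them, which is exactly the "drop with least investment" feedback from Slide 9.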
Slide 35: Choice of Quota
- Smaller quotas work better. (Why?)
Slide 36: Overlap
[Diagram: two pipelines of Stage 1 and Stage 2, showing the stages overlapping in time]
Slide 37: Sensitivity of Quota
- The peak rate is slightly higher with a larger quota.
Slide 38: With screend
Slide 39: Guaranteeing Progress for User-Level Processes
Slide 40: Modification
- Use a performance counter to measure how many CPU cycles are spent per period on packet processing.
- If above some threshold, then disable input handling.
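A sketch of this cycle-budget check (the period/threshold framing and all names are assumptions for illustration; the real system reads an Alpha hardware cycle counter):

```c
#include <stdbool.h>
#include <stdint.h>

static bool input_enabled = true;

/* Called once per period with the number of CPU cycles spent in
 * packet processing during that period. If packet work exceeded its
 * allowed share of the period, mask input until the next period so
 * user-level processes are guaranteed the remainder. */
void end_of_period(uint64_t pkt_cycles, uint64_t period_cycles,
                   uint64_t max_percent) {
    input_enabled = (pkt_cycles * 100 <= period_cycles * max_percent);
}
```

For example, with a 50% budget, a period in which packet processing consumed 60% of the cycles leaves input disabled for the next period, after which the check runs again and can re-enable it.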
Slide 41
- Why the discrepancy?
- Why the dip?
Slide 42: Future Work
- Selective packet dropping
  - Packets have different value.
- Interactions with application-level scheduling
  - Reduce latency for the currently scheduled process.
  - During overload, favor packets destined for the current process.
  - Run the process with the most work to do.
Slide 43: Summary
- Must be able to discard input with zero or minimal overhead.
- Balance interrupts and polling.
- Felt that the solutions were all a little ad hoc. Perhaps a more general, end-to-end system could be created. That might eliminate the need for tuning.