Title: Congestion Management for Data Centers: IEEE 802.1 Ethernet Standard
1. Congestion Management for Data Centers: IEEE 802.1 Ethernet Standard
Balaji Prabhakar, Departments of EE and CS, Stanford University
2. Background
- Data Centers see the true convergence of L3 and L2 transport
  - While TCP is the dominant L3 transport protocol, and a significant amount of L2 traffic uses it, there is other L2 traffic, notably storage and media
  - This, and other reasons, have prompted the IEEE 802.1 standards body to develop an Ethernet congestion management standard
- In this lecture, we shall see the development of the QCN (Quantized Congestion Notification) algorithm for standardization in the IEEE 802.1 Data Center Bridging standards
  - We will also review the technical background of congestion control research
- The lecture has 3 parts
  - A brief overview of the relevant congestion control background
  - A description of the QCN algorithm and its performance
  - The Averaging Principle: A new control-theoretic idea underlying the QCN and BIC-TCP algorithms which stabilizes them when loop delays increase; very useful for operating high-speed links with shallow buffers, which is the situation in 10 Gbps Ethernets
3. Managing Congestion
- Congestion is a standard feature of networked systems; in data networks,
  - Congestion occurs when links are oversubscribed, i.e. when traffic and/or link bandwidth changes
  - A congestion notification mechanism allows switches/routers to directly control the rate of the ultimate sources of the traffic
- We've been involved in developing QCN (for Quantized Congestion Notification) for standardization in the Data Center Bridging track of the IEEE 802.1 Ethernet standards
  - For deployment in 10 (and 40 and 100) Gbps Data Center Ethernets
  - Complete information on the QCN algorithm (p-code, draft of standard, detailed simulations of lots of scenarios) available at
4. Congestion control in the Internet
- In the Internet
  - Queue management schemes (e.g. RED) at the links signal congestion by either dropping packets or marking them using ECN
  - TCP at end-systems uses these signals to vary the sending rate
  - There exists a rich history of algorithm development, control-theoretic analysis and detailed simulation of queue management schemes and congestion control algorithms for the Internet
  - Jacobson, Floyd et al, Kelly et al, Low et al, Srikant et al, Misra et al, Katabi et al
- TCP is excellent, so why look for another algorithm?
  - There is other traffic on Ethernet than TCP, so native Ethernet congestion management is needed
  - TCP's "one size fits all" approach makes it too conservative for high bandwidth-delay product networks
  - A hardware-based algorithm is needed for the very high speeds of operation encountered at 10, 40 and 100 Gbps
  - Ethernet and the Internet have very different operating conditions
5. Switched Ethernet vs. the Internet
- Some significant differences
  - No per-packet acks in Ethernet, unlike in the Internet
  - Not possible to know round trip time!
  - So congestion must be signaled to the source by switches
  - Algorithm not automatically self-clocked (like TCP)
  - Links can be paused, i.e. packets may not be dropped
  - No sequence numbering of L2 packets
  - Sources do not start transmission gently (like TCP slow-start); they can potentially come on at the full line rate of 10 Gbps
  - Ethernet switch buffers are much smaller than router buffers (100s of KBs vs 100s of MBs)
  - Most importantly, the algorithm should be simple enough to be implemented completely in hardware
- Note: The QCN algorithm we have developed has Internet relatives, notably BIC-TCP at the source and the REM/PI controllers at switches
6. L2 Transport: IEEE 802.1
- IEEE 802.1 Data Center Bridging standards: Enhancements to Ethernet
  - Reliable delivery (802.1Qbb): Link-level flow control (PAUSE) prevents congestion drops
  - Ethernet congestion management (802.1Qau): Prevents congestion spreading due to PAUSE
- Consequences
  - Hardware-friendly algorithms can operate on 10-100 Gbps links
  - Partial offload of CPU: no packet retransmissions
  - Corruption losses require abort/restart; 10G over copper uses short cables to keep BER low
  - PAUSE absorption buffers: proportional to bandwidth x delay of links, and need high memory bandwidth
- NOTE: Recent work addresses the last two points; this is not covered in the course
[Figure: PAUSE absorption buffers]
7. Overview of Congestion Control Research
8. Stability
- Congestion control algorithms aim to
  - deliver high throughput, maintain low latencies/backlogs, be fair to all flows, be simple to implement and easy to deploy
- Performance is related to stability of the control loop
  - Stability refers to the non-oscillatory or non-exploding behavior of congestion control loops. In real terms, stability refers to the non-oscillatory behavior of the queues at the switch.
  - If the switch buffers are short, oscillating queues can overflow (hence drop packets/pause the link) or underflow (hence lose utilization)
  - In either case, links cannot be fully utilized, throughput is lost, flow transfers take longer
- So stability is an important property, especially for networks with high bandwidth-delay products operating with shallow buffers
9. Unit step response of the network
- The control loops are not easy to analyze
  - They are described by non-linear, delay differential equations, which are usually impossible to analyze
  - So linearized analyses are performed using Nyquist or Bode theory
- Is linearized analysis useful?
  - Yes! It is not difficult to know if a zero-delay non-linear system is stable. As the delay increases, linearization can be used to tell if the system is stable for delay (or number of sources) in some range, i.e. we get sufficient conditions
- The above stability theory is essentially studying the "unit step response" of a network
  - Apply many "infinitely long flows" at time 0 and see how long the network takes to settle them to the correct collective and individual rates; the first is about throughput, the second is about fairness
10. TCP--RED: A basic control loop
[Figure: TCP sources and a RED queue in a feedback loop]
- TCP: Slow start, then Congestion avoidance
- Congestion avoidance (AIMD): No loss: increase window by 1. Pkt loss: cut window by half
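The AIMD rule on this slide can be sketched in a few lines. This is only an illustration of the rule as stated, not TCP's actual implementation: the names `aimd_update` and `run`, the once-per-round-trip granularity, and the floor of one packet are assumptions made for the example.

```python
def aimd_update(window, loss):
    """One congestion-avoidance step, applied once per round trip.

    Assumption for this sketch: the window is counted in packets and is
    never allowed to fall below one packet.
    """
    if loss:
        return max(1.0, window / 2.0)   # multiplicative decrease: halve
    return window + 1.0                 # additive increase: one packet


def run(window, losses):
    """Apply the rule over a sequence of per-round-trip loss indications."""
    trace = []
    for loss in losses:
        window = aimd_update(window, loss)
        trace.append(window)
    return trace
```

Calling `run(10.0, [False, False, True, False])` walks the window up by one packet per loss-free round trip and halves it on the loss, which produces the familiar sawtooth.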
11. TCP--RED
- Two ways to analyze and understand this control loop
  - Simulations: ns-2
  - Theory: Delay-differential equations
- ns-2: A widely used event-driven simulator for the Internet
  - Very detailed and accurate
  - Different types of transport protocols: TCP, UDP, ...
  - Router mechanisms and algorithms: RED, DRR, ...
  - Web traffic: sessions, flows, power law flow sizes, ...
  - Different types of network: wired, wireless, satellite, mobility, ...
12. The simulation setup
13. Delay at Link 1
14. TCP--RED: Analytical model
15. TCP--RED: Analytical model
[Figure: feedback loop between Users and Network]
Notation: W = window size, RTT = round trip time, C = link capacity, q = queue length, qa = ave queue length, p = drop probability
By V. Misra, W. Gong and D. Towsley at SIGCOMM 2000. Fluid model concept originated by F. Kelly, A. Maulloo and D. Tan in Jour. Oper. Res. Society, 1998
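The model equations themselves did not survive in this transcript. For reference, the fluid model from the cited SIGCOMM 2000 paper is commonly written as below, using the notation listed above. This is a reconstruction from the published model, not from the slide; N (number of flows) and T_p (fixed propagation delay) are symbols introduced here.

```latex
\begin{align}
\frac{dW(t)}{dt} &= \frac{1}{RTT(t)}
  - \frac{W(t)\,W\bigl(t-RTT(t)\bigr)}{2\,RTT\bigl(t-RTT(t)\bigr)}\;p\bigl(t-RTT(t)\bigr) \\
\frac{dq(t)}{dt} &= \frac{N\,W(t)}{RTT(t)} - C \\
RTT(t) &= \frac{q(t)}{C} + T_p
\end{align}
```

The first term of the window equation is the additive increase of one packet per round trip; the second is the halving of the window at the delayed loss rate. The drop probability p is obtained from the averaged queue qa through the RED drop profile.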
16. Accuracy of analytical model
Recall the ns-2 simulation from earlier: Delay at Link 1
17. Accuracy of analytical model
18. Accuracy of analytical model
19. Why are the Diff Eqn models so accurate?
- They've been developed in Physics, where they are called Mean Field Models
- The main idea
  - it is very difficult to model large-scale systems: there are simply too many events, too many random quantities
  - but, it is quite easy to model the mean or average behavior of such systems
  - interestingly, when the size of the system grows, its behavior gets closer and closer to that predicted by the mean-field model!
  - physicists have been exploiting this feature to model large magnetic materials, gases, etc.
  - just as a few electrons/particles don't have a very big influence on a system, Internet resource usage is not heavily influenced by a few packets: aggregates matter more
20. TCP--RED: Stability analysis
- Given the differential equations, in principle one can figure out whether the TCP--RED control loop is stable
- However, the differential equations are very complicated
  - 3rd or 4th order, nonlinear, with delays
  - There is no general theory; specific case treatments exist
- Linearize and analyze
  - Linearize equations around the (unique) operating point
  - Analyze resultant linear, delay-differential equations using Nyquist or Bode theory
- End result
  - Design stable control loops
  - Determine stability conditions (RTT limits, number of users, etc)
  - Obtain control loop parameters: gains, drop functions, ...
21. Instability of TCP--RED
- As the bandwidth-delay-product increases, the TCP--RED control loop becomes unstable
- Parameters: 50 sources, link capacity = 9000 pkts/sec, TCP--RED
- Source: S. Low et al., Infocom 2002
22. Summary
- We saw a very brief overview of research on the analysis of congestion control systems
- As loop lags increase, the control loop becomes very oscillatory
  - This is true of any control scheme, not just congestion control schemes
- In networks, oscillatory queue sizes tend to underflow buffers, causing a loss of throughput; this is especially true for high BDP networks with shallow buffers
- This has led to much research on developing algorithms for high BDP networks, e.g. High-Speed TCP, XCP, RCP, Scalable TCP, BIC-TCP, etc
- We shall return to this later, after describing the QCN algorithm we have developed for the IEEE 802.1 standard
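The effect of loop lag can be seen even in a toy linear feedback loop. The sketch below is not a model of TCP, RED or QCN; it is a generic integrator whose controller acts on a stale measurement. The names `simulate` and `late_peak`, the gain of 0.5 and the chosen delays are all assumptions made for illustration.

```python
def simulate(gain, delay, steps=400, start=1.0):
    """Iterate e[n+1] = e[n] - gain * e[n - delay].

    e is the deviation of a queue from its target. The controller only
    sees a value that is `delay` steps old.
    """
    history = [start] * (delay + 1)      # constant initial deviation
    for _ in range(steps):
        stale = history[-1 - delay]      # feedback computed from old state
        history.append(history[-1] - gain * stale)
    return history


def late_peak(history, window=50):
    """Largest deviation magnitude over the last `window` steps."""
    return max(abs(e) for e in history[-window:])
```

With the same gain of 0.5, `late_peak(simulate(0.5, 1))` is vanishingly small, while `late_peak(simulate(0.5, 5))` is very large: a controller that settles with a one-step lag oscillates with growing amplitude when the lag is five steps.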
23. Quantized Congestion Notification (QCN): Congestion control for Ethernet
Joint work with:
- Mohammad Alizadeh, Berk Atikoglu and Abdul Kabbani, Stanford University
- Ashvin Lakshmikantha, Broadcom
- Rong Pan, Cisco Systems
- Mick Seaman, Chair, Security Group; Ex-Chair, Interworking Group, IEEE 802.1
24. Overview
- The description of QCN is brief, restricted to the main points of the algorithm
  - A fuller description is available at the IEEE 802.1 Data Center Bridging Task Group's website, including extensive simulations and pseudo-code
- We will describe the congestion control loop
  - How is congestion measured at the switches?
  - What is the signal? And how does the switch send it? (Remember: there are no per-packet acks in Ethernet)
  - What does the source do when it receives a congestion signal?
- Terminology
  - Congestion Point: Where congestion occurs, mainly switches
  - Reaction Point: Source of traffic, mainly rate limiters in Ethernet NICs
25. QCN: Congestion Point Dynamics
- Consider the single-source, single-switch loop below
- Congestion Point (Switch) Dynamics: Sample packets, compute feedback (Fb), quantize Fb to 6 bits, and reflect only negative Fb values back to the Reaction Point with a probability proportional to |Fb|.
Fb = -(Q - Qeq + w . dQ/dt) = -(queue offset + w . rate offset)
[Figure: Source and switch queue with equilibrium level Qeq; plot of Reflection Probability against Fb, ranging between Pmin and Pmax]
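The switch-side steps (sample, compute Fb, quantize, reflect with some probability) can be sketched as follows. This is a hedged illustration, not the pseudo-code of the standard: the weight `W`, the probabilities `P_MIN` and `P_MAX`, the scaling constant `FB_SCALE` and all function names are assumptions chosen for the example; `QEQ = 22` borrows the value used in the simulations later in the deck.

```python
import random

QEQ = 22          # equilibrium queue length in packets
W = 2.0           # weight on the rate offset (assumed)
P_MIN = 0.01      # reflection probability at the smallest |Fb| (assumed)
P_MAX = 0.10      # reflection probability at the largest |Fb| (assumed)
FB_LEVELS = 63    # largest magnitude representable in 6 bits
FB_SCALE = (1 + 2 * W) * QEQ   # |Fb| mapped to the top level (assumed)


def feedback(q, q_old):
    """Fb = -(queue offset + w * rate offset), in packets."""
    q_off = q - QEQ          # how far the queue is above equilibrium
    q_delta = q - q_old      # queue growth since the previous sample
    return -(q_off + W * q_delta)


def quantize(fb):
    """Map a negative Fb to a 6-bit magnitude in 0..63; 0 if Fb >= 0."""
    if fb >= 0:
        return 0
    return min(FB_LEVELS, int(-fb / FB_SCALE * FB_LEVELS))


def reflection_probability(level):
    """Grows linearly with the level from P_MIN to P_MAX; 0 at level 0."""
    if level == 0:
        return 0.0
    return P_MIN + (P_MAX - P_MIN) * level / FB_LEVELS


def sample_packet(q, q_old, rng=random):
    """Return the quantized |Fb| to send to the source, or None."""
    level = quantize(feedback(q, q_old))
    if level and rng.random() < reflection_probability(level):
        return level
    return None
```

Only congestion (negative Fb) ever produces a message, so the source hears nothing when the queue is at or below equilibrium and not growing. Sending messages more often when |Fb| is large lets heavily congested queues reach sources faster without adding per-packet overhead.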
26. QCN: Reaction Point
- Source (reaction point): Transmit regular Ethernet frames. When a congestion message arrives:
  - Multiplicative Decrease
  - Fast Recovery: similar to BIC-TCP; gives high performance in high bandwidth-delay product networks, while being very simple
  - Active Probing
[Figure: source rate over time, showing the Fast Recovery and Active Probing phases]
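The three source-side behaviours can be sketched as a small rate limiter. This is an illustrative reading of the slide, not the standard's pseudo-code: the class name `ReactionPoint`, the decrease gain `GD` and the way cycles are counted are assumptions; the 5 fast-recovery cycles and the 5 Mbps increment are taken from values quoted elsewhere in the deck.

```python
GD = 1.0 / 128      # decrease gain (assumed): |Fb| = 63 cuts the rate by about half
R_AI = 5e6          # active-probing increment in bits/s (5 Mbps)
FR_CYCLES = 5       # number of fast-recovery cycles


class ReactionPoint:
    def __init__(self, rate):
        self.current_rate = rate    # CR: rate the limiter enforces now
        self.target_rate = rate     # TR: rate just before the last decrease
        self.cycles = 0             # cycles completed since the last decrease

    def on_congestion_message(self, fb_level):
        """Multiplicative decrease, proportional to the 6-bit |Fb|."""
        self.target_rate = self.current_rate
        self.current_rate *= 1.0 - GD * fb_level
        self.cycles = 0

    def on_cycle_complete(self):
        """Called each time a byte-counter (or timer) cycle finishes."""
        if self.cycles >= FR_CYCLES:
            self.target_rate += R_AI     # active probing: raise the target
        # both phases move CR halfway toward TR
        self.current_rate = (self.current_rate + self.target_rate) / 2.0
        self.cycles += 1
```

For example, a source at 1 Gbps that receives `fb_level = 32` drops to 750 Mbps, then recovers to 875 Mbps, 937.5 Mbps and so on, halving the gap to its old rate each cycle before it starts probing for more bandwidth.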
27. Timer-supported QCN
- Byte-Counter
  - 5 cycles of FR (150KB per cycle)
  - AI cycles afterwards (75KB per cycle)
  - Fb < 0 sends timer to FR
- RL
  - In FR if both byte-ctr and timer in FR
  - In AI if only one of byte-ctr or timer in AI
  - In HAI if both byte-ctr and timer in AI
  - Note: RL goes to HAI only after 500 pkts have been sent
- Timer
  - 5 cycles of FR (T msec per cycle)
  - AI cycles afterwards (T/2 msec/cycle)
  - Fb < 0 sends timer to FR
[Figure: Byte-Ctr and Timer both feeding the RL]
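The rule that combines the byte counter and the timer into the rate limiter (RL) state can be written as a small function. This is a sketch under assumptions: the function names are hypothetical, each counter is represented only by the number of cycles it has completed since the last congestion message, and the extra condition that HAI is entered only after 500 packets have been sent is deliberately left out.

```python
FR_CYCLES = 5   # each counter spends its first 5 cycles in FR


def phase(cycles_done):
    """A counter is in FR for its first 5 cycles, then in AI."""
    return "FR" if cycles_done < FR_CYCLES else "AI"


def rate_limiter_state(byte_ctr_cycles, timer_cycles):
    """Combine the two counters into the RL state.

    Neither counter in AI -> "FR"; exactly one in AI -> "AI";
    both in AI -> "HAI".
    """
    phases = (phase(byte_ctr_cycles), phase(timer_cycles))
    return ("FR", "AI", "HAI")[phases.count("AI")]
```

The timer matters for sources whose rate has been cut very low: their byte counter advances slowly, so without the timer they would take a long time to leave FR and start increasing again.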
28. Simulations: Basic Case
- Parameters
  - 10 sources share a 10 G link, whose capacity drops to 0.5G during 2-4 secs
  - Max offered rate per source = 1.05G
  - RTT = 50 µs
  - Buffer size = 100 pkts (150KB); Qeq = 22
  - T = 10 msecs
  - RAI = 5 Mbps
  - RHAI = 50 Mbps
[Figure: topology with Source 1, Source 2, ..., Source 10; link labels 10 G, 10 G and 0.5G]
29. Recovery Time
Recovery time = 80 msec
30. Fluid Model for QCN
- Assume N flows pass through a single queue at a switch. State variables are TRi(t), CRi(t), q(t), p(t).
[Figure: plot of P = F(Fb) against Fb, with axis marks 10 and 63]
31. Accuracy: Equations vs. ns-2 simulations
[Figure panels: N = 10, RTT = 100 µs; N = 100, RTT = 500 µs; N = 10, RTT = 1 ms; N = 10, RTT = 2 ms]
32. Summary
- The algorithm has been extensively tested in deployment scenarios of interest
  - Esp. interoperability with link-level PAUSE and TCP
  - All presentations are available at the IEEE 802.1 website
- The theoretical development is interesting, most notably because QCN (and BIC-TCP) display strong stability in the face of increasing lags or, equivalently, in high bandwidth-delay product networks
- While attempting to understand why these schemes perform so well, we have uncovered a method for improving the stability of any congestion control scheme; we present this next
33. The Averaging Principle
34. Background to the AP
- When the lags in a control loop increase, the system becomes oscillatory and eventually becomes unstable
- Feedback compensation is applied to restore stability; the two main flavors of feedback compensation are
  1. Determine lags (round trip times), apply the correct gains for the loop to be stable (e.g. XCP, RCP, FAST).
  2. Include higher order queue derivatives in the congestion information fed back to the source (e.g. REM/PI, BCN).
- Method 1 is not suitable for us: we don't know RTTs in Ethernet
- Method 2 requires a change to the switch implementation
- The Averaging Principle is a different method
  - It is suited to Ethernet, where round trip times are unavailable
  - It doesn't need more feedback, hence switch implementations don't have to change
  - QCN and BIC-TCP already turn out to employ it
- A source in a congestion control loop is
instructed by the network to decrease or increase
its sending rate (randomly) periodically
- AP a source obeys the network whenever
instructed to change rate, and then voluntarily
performs averaging as below
TR Target Rate CR Current Rate
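The basic Averaging Principle can be sketched as two operations on a source's rate. This is a minimal reading of the slide, assuming that TR is the rate held just before the most recent change and that a single averaging step is performed between feedback messages; the class name `AveragingSource` and its method names are hypothetical.

```python
class AveragingSource:
    """Rate source that applies the basic Averaging Principle."""

    def __init__(self, rate):
        self.current_rate = rate    # CR
        self.target_rate = rate     # TR: rate held just before the last change

    def on_feedback(self, delta):
        """Obey the network: apply the requested rate change in full."""
        self.target_rate = self.current_rate
        self.current_rate += delta

    def average(self):
        """Voluntary step: move halfway back toward the pre-change rate."""
        self.current_rate = (self.current_rate + self.target_rate) / 2.0
```

A source at rate 10 that is told to change by -4 first drops to 6, then averages back to 8. Calling `average()` more than once gives the multi-step variant that QCN's Fast Recovery uses.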
36. Recall: QCN does 5 steps of Averaging
- In the Fast Recovery portion of QCN, there are 5 steps of averaging
- In fact, QCN and BIC-TCP are the Averaging Principle applied to TCP!
[Figure: QCN rate evolution, ending in Active Probing]
37. Applying the AP: RCP (Rate Control Protocol; Dukkipati and McKeown)
- A router computes an upper bound R on the rate of all flows traversing it.
- R is recomputed every T (= 10) msec as follows
  - [Equation: update rule for R; not recovered in this transcript]
- Where
  - d0: Round trip time estimate (set constant = 10 msec in our case)
  - C: link capacity (= 2.4 Gbps)
  - Q: Current queue size at the switch
  - y(t): incoming rate
  - α = 0.1
  - β = 1
- A flow chooses the smallest advertised rate on its path.
- We consider a scenario where 10 RCP sources share a single link.
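The update rule itself is missing from the text above. In the RCP literature the router update is usually written as R(t) = R(t - T) [1 + (T/d0) (α (C - y(t)) - β q(t)/d0) / C]; the sketch below implements that published form with the parameter values listed on the slide. Treat it as a reconstruction rather than the slide's own equation; the function name `rcp_update` and the choice of bits and seconds as units are assumptions.

```python
def rcp_update(rate, y, q, capacity=2.4e9, d0=0.010, period=0.010,
               alpha=0.1, beta=1.0):
    """One RCP rate update at the router.

    rate     -- previously advertised rate R(t - T), bits/s
    y        -- measured incoming rate y(t), bits/s
    q        -- current queue size, bits
    capacity -- link capacity C, bits/s
    d0       -- round trip time estimate, seconds
    period   -- update interval T, seconds
    """
    spare = alpha * (capacity - y)    # share of unused capacity to hand out
    drain = beta * q / d0             # rate needed to drain the queue in d0
    return rate * (1.0 + (period / d0) * (spare - drain) / capacity)
```

When the link is exactly full and the queue is empty the advertised rate is unchanged; spare capacity raises it and a standing queue lowers it. The gains α and β are what normally have to be tuned to the loop delay.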
38. AP-RCP Stability
[Figure panels: RTT = 60 msec; RTT = 65 msec]
39. AP-RCP Stability contd
[Figure panels: RTT = 120 msec; RTT = 130 msec]
40. AP-RCP Stability contd
[Figure panels: RTT = 230 msec; RTT = 240 msec]
41. Understanding the AP
- As mentioned earlier, the two major flavors of feedback compensation are
  - Determine lags, choose appropriate gains
  - Feed back higher derivatives of state
- We prove that the AP is in a sense equivalent to both of the above!
  - This is great because we don't need to change network routers and switches
  - And the AP is really very easy to apply: no lag-dependent optimizations of gain parameters needed
42. AP Equivalence: Single Source Case
[Figure: System 1, a source that does AP, driven by Fb; System 2, a regular source, driven by 0.5 Fb + 0.25 T dFb/dt]
- Systems 1 and 2 are discrete-time models for an AP-enabled source and a regular source, respectively.
- Main Result: Systems 1 and 2 are algebraically equivalent. That is, given identical input sequences, they produce identical output sequences.
- Therefore the AP is equivalent to adding a derivative to the feedback and reducing the gain!
- Thus, the AP does both known forms of feedback compensation without knowing RTTs or changing switch implementations
43. AP-RCP vs PD-RCP
[Figure panels: RTT = 120 msec; RTT = 130 msec]
44. A Generic Control Example
- As an example, we consider the plant transfer function
  - $P(s) = \dfrac{s + 1}{s^3 + 1.6 s^2 + 0.8 s + 0.6}$
45. Step Response: Basic AP, No Delay
46. Step Response: Basic AP, Delay = 8 seconds
47. Step Response: Two-step AP, Delay = 14 seconds
48. Step Response: Two-step AP, Delay = 25 seconds
Two-step AP is even more stable than Basic AP
49. Summary of AP
- The AP is a simple method for making many control loops (not just congestion control loops) more robust to increasing lags
- It gives a clear understanding of why the BIC-TCP and QCN algorithms have such good delay tolerance: they do averaging repeatedly
  - There is a theorem which deals explicitly with the QCN-type loop
- Variations of the basic principle are possible, i.e. average more than once, average by more than half-way, etc
  - The theory is fairly complete in these cases
50. QCN and Buffer Sizing
51. Background: TCP Buffer Sizing
- Standard "rule of thumb"
  - Single TCP flow: Bandwidth x Delay worth of buffering needed for 100% utilization.
- Recent result (Appenzeller et al.)
  - For N >> 1 TCP flows: Bdwdth x Delay / sqrt(N) amount of buffering is enough.
  - The essence of this result is that when many flows combine, the variance of the net sending rate decreases
- Buffer sizing problem is challenging in data centers
  - Typically, only a small number of flows are active on each path (N is small).
  - Ethernet switches are typically built with shallow buffers to keep costs down.
52. Example: Simulation Setup
[Figure: sources connected through a Switch]
- 10 Gig Ethernet
- Switch buffer is 150 Kbytes deep.
- We compare TCP and QCN for various numbers of flows and RTTs.
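To see how shallow a 150 Kbyte buffer is, the two TCP sizing rules can be evaluated for this link. The sketch below is plain arithmetic on the numbers in the setup and on the RTTs used in the comparisons that follow; the function names are hypothetical, 150 Kbytes is taken as 150,000 bytes, and applying the sqrt(N) rule at N = 10 stretches a result that is stated for N >> 1.

```python
import math

LINK_BPS = 10e9         # 10 Gig Ethernet
BUFFER_BYTES = 150e3    # switch buffer, taken as 150,000 bytes


def bdp_bytes(rtt_s, link_bps=LINK_BPS):
    """Bandwidth x delay product in bytes (single-flow rule of thumb)."""
    return link_bps * rtt_s / 8.0


def required_buffer_bytes(rtt_s, n_flows):
    """Bandwidth x delay / sqrt(N) rule for N TCP flows."""
    return bdp_bytes(rtt_s) / math.sqrt(n_flows)


for rtt_us in (120, 250, 500):
    for n in (1, 10):
        need = required_buffer_bytes(rtt_us * 1e-6, n)
        print(f"RTT {rtt_us} us, N={n}: rule asks for {need / 1e3:.0f} KB, "
              f"buffer is {BUFFER_BYTES / 1e3:.0f} KB")
```

At 120 µs the bandwidth-delay product is exactly the 150 KB buffer, so a single TCP flow is just at the rule of thumb; at 500 µs it is 625 KB, more than four times the buffer.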
53. TCP vs QCN (N = 1, RTT = 120 µs)
- TCP: Throughput = 99.5%, Standard Deviation = 265.4 Mbps
- QCN: Throughput = 99.5%, Standard Deviation = 13.8 Mbps
54. TCP vs QCN (N = 1, RTT = 250 µs)
- TCP: Throughput = 95.5%, Standard Deviation = 782.7 Mbps
- QCN: Throughput = 99.5%, Standard Deviation = 33.3 Mbps
55. TCP vs QCN (N = 1, RTT = 500 µs)
- TCP: Throughput = 88%, Standard Deviation = 1249.7 Mbps
- QCN: Throughput = 99.5%, Standard Deviation = 95.4 Mbps
56. TCP vs QCN (N = 10, RTT = 120 µs)
- TCP: Throughput = 99.5%, Standard Deviation = 625.1 Mbps
- QCN: Throughput = 99.5%, Standard Deviation = 25.1 Mbps
57. TCP vs QCN (N = 10, RTT = 250 µs)
- TCP: Throughput = 95.5%, Standard Deviation = 981 Mbps
- QCN: Throughput = 99.5%, Standard Deviation = 27.2 Mbps
58. TCP vs QCN (N = 10, RTT = 500 µs)
- TCP: Throughput = 89%, Standard Deviation = 1311.4 Mbps
- QCN: Throughput = 99.5%, Standard Deviation = 170.5 Mbps
59. QCN and shallow buffers
- In contrast to TCP, QCN is stable with shallow buffers, even with few sources.
  - Why?
- Recall that buffer requirements are closely related to sending rate variance
  - Buffer size = C x Var(R1) x Bdwdth x Delay / sqrt(N)
- TCP
  - Good performance for large N, since the denominator is large.
- QCN
  - Good performance for all N, since the numerator is small.
- Thus, averaging reduces the variance of a source's sending rate
- This is a stochastic interpretation of the Averaging Principle's success in keeping stability with shallow buffers
60. Conclusions
- We have seen the background, development and analysis of a congestion control scheme for the IEEE 802.1 Ethernet standard
- The QCN algorithm is
  - More stable with respect to control loop delays
  - Requires much smaller buffers than TCP
  - Easy to build in hardware
- The Averaging Principle is interesting and we're exploring its use in nonlinear control systems