Congestion Management for Data Centers: IEEE 802.1 Ethernet Standard

Transcript and Presenter's Notes
1
Congestion Management for Data Centers: IEEE 802.1 Ethernet Standard
Balaji Prabhakar, Departments of EE and CS, Stanford University
2
Background
  • Data Centers see the true convergence of L3 and
    L2 transport
  • While TCP is the dominant L3 transport protocol,
    and a significant amount of L2 traffic uses it,
    there is other L2 traffic, notably storage and
    media
  • This, among other reasons, has prompted the IEEE
    802.1 standards body to develop an Ethernet
    congestion management standard
  • In this lecture, we shall see the development of
    the QCN (Quantized Congestion Notification)
    algorithm for standardization in the IEEE 802.1
    Data Center Bridging standards
  • We will also review the technical background of
    congestion control research
  • The lecture has 3 parts:
  • A brief overview of the relevant congestion
    control background
  • A description of the QCN algorithm and its
    performance
  • The Averaging Principle: a new control-theoretic
    idea underlying the QCN and BIC-TCP algorithms
    which stabilizes them as loop delays increase;
    very useful for operating high-speed links with
    shallow buffers---the situation in 10 Gbps
    Ethernets

3
Managing Congestion
  • Congestion is a standard feature of networked
    data systems
  • Congestion occurs when links become oversubscribed,
    as traffic and/or link bandwidth changes
  • A congestion notification mechanism allows
    switches/routers to directly control the rate of
    the ultimate sources of the traffic
  • We've been involved in developing QCN (for
    Quantized Congestion Notification) for
    standardization in the Data Center Bridging
    track of the IEEE 802.1 Ethernet standards
  • For deployment in 10 (and 40 and 100) Gbps Data
    Center Ethernets
  • Complete information on the QCN algorithm
    (p-code, draft of standard, detailed simulations
    of lots of scenarios) is available at the IEEE
    802.1 website

4
Congestion control in the Internet
  • In the Internet
  • Queue management schemes (e.g. RED) at the links
    signal congestion by either dropping or marking
    packets using ECN
  • TCP at end-systems uses these signals to vary the
    sending rate
  • There exists a rich history of algorithm
    development, control-theoretic analysis and
    detailed simulation of queue management schemes
    and congestion control algorithms for the
    Internet
  • Jacobson, Floyd et al, Kelly et al, Low et al,
    Srikant et al, Misra et al, Katabi et al
  • TCP is excellent, so why look for another
    algorithm?
  • There is traffic on Ethernet other than TCP, so
    native Ethernet congestion management is needed
  • TCP's one-size-fits-all approach makes it too
    conservative for high bandwidth-delay product
    networks
  • A hardware-based algorithm is needed for the very
    high speeds of operation encountered at 10, 40
    and 100 Gbps
  • Ethernet and the Internet have very different
    operating conditions

5
Switched Ethernet vs. the Internet
  • Some significant differences
  • No per-packet acks in Ethernet, unlike in the
    Internet
  • Not possible to know round trip time!
  • So congestion must be signaled to the source by
    switches
  • Algorithm not automatically self-clocked (like
    TCP)
  • Links can be paused, i.e., packets may not be
    dropped
  • No sequence numbering of L2 packets
  • Sources do not start transmission gently (as in
    TCP slow-start); they can potentially come on at
    the full line rate of 10 Gbps
  • Ethernet switch buffers are much smaller than
    router buffers (100s of KBs vs 100s of MBs)
  • Most importantly, the algorithm should be simple
    enough to be implemented completely in hardware
  • Note: The QCN algorithm we have developed has
    Internet relatives, notably BIC-TCP at the source
    and the REM/PI controllers at switches

6
L2 Transport: IEEE 802.1
  • IEEE 802.1 Data Center Bridging standards:
    enhancements to Ethernet
  • Reliable delivery (802.1Qbb): link-level flow
    control (PAUSE) prevents congestion drops
  • Ethernet congestion management (802.1Qau):
    prevents congestion spreading due to PAUSE
  • Consequences
  • Hardware-friendly algorithms can operate on
    10-100 Gbps links
  • Partial offload of CPU: no packet retransmissions
  • Corruption losses require abort/restart; 10G over
    copper uses short cables to keep BER low
  • PAUSE absorption buffers proportional to bandwidth
    × delay of links; high memory bandwidth
  • NOTE: Recent work addresses the last two points;
    this is not covered in the course

[Figure: PAUSE absorption buffers]
7
Overview of Congestion Control Research
8
Stability
  • Congestion control algorithms aim to
  • deliver high throughput, maintain low
    latencies/backlogs, be fair to all flows, be
    simple to implement and easy to deploy
  • Performance is related to the stability of the
    control loop
  • Stability refers to the non-oscillatory or
    non-exploding behavior of congestion control
    loops. In real terms, stability refers to the
    non-oscillatory behavior of the queues at the
    switch.
  • If the switch buffers are short, oscillating
    queues can overflow (hence drop packets/pause the
    link) or underflow (hence lose utilization)
  • In either case, links cannot be fully utilized,
    throughput is lost, and flow transfers take longer
  • So stability is an important property, especially
    for networks with high bandwidth-delay products
    operating with shallow buffers

9
Unit step response of the network
  • The control loops are not easy to analyze
  • They are described by non-linear, delay
    differential equations which are usually
    impossible to analyze
  • So linearized analyses are performed using
    Nyquist or Bode theory
  • Is linearized analysis useful?
  • Yes! It is not difficult to know if a zero-delay
    non-linear system is stable. As the delay
    increases, linearization can be used to tell if
    the system is stable for delays (or numbers of
    sources) in some range, i.e., we get sufficient
    conditions
  • The above stability theory is essentially
    studying the unit step response of a network
  • Apply many infinitely long flows at time 0 and
    see how long the network takes to settle them to
    the correct collective and individual rates; the
    first is about throughput, the second about
    fairness

10
TCP--RED: a basic control loop
[Figure: TCP's phases: slow start, then congestion
avoidance. Congestion avoidance is AIMD: no loss,
increase window by 1; packet loss, cut window by
half. A code sketch of the AIMD rule follows.]
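To make the AIMD rule in the figure concrete, here is a minimal
Python sketch of the window update; the per-RTT granularity and the
loss pattern in the example are illustrative assumptions, not part of
any real TCP implementation.

```python
# AIMD congestion-avoidance rule from the figure above:
# no loss -> increase window by 1 packet per RTT;
# packet loss -> cut window by half.

def aimd_update(cwnd: float, loss: bool) -> float:
    """One congestion-avoidance window update per round-trip time."""
    if loss:
        return cwnd / 2.0   # multiplicative decrease
    return cwnd + 1.0       # additive increase

# Example: a flow that sees a single loss at round 10.
cwnd = 10.0
for rtt in range(20):
    cwnd = aimd_update(cwnd, loss=(rtt == 10))
    print(rtt, cwnd)
```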
11
TCP--RED
  • Two ways to analyze and understand this control
    loop
  • Simulations: ns-2
  • Theory: delay-differential equations
  • ns-2: a widely used event-driven simulator for
    the Internet
  • Very detailed and accurate
  • Different types of transport protocols: TCP, UDP, ...
  • Router mechanisms and algorithms: RED, DRR, ...
  • Web traffic: sessions, flows, power-law flow
    sizes, ...
  • Different types of networks: wired, wireless,
    satellite, mobility, ...

12
The simulation setup
13
Delay at Link 1
14
TCP--RED Analytical model
15
TCP--RED Analytical model
[Figure: users-network feedback loop]
W: window size; RTT: round trip time; C: link
capacity; q: queue length; qa: average queue length;
p: drop probability
By V. Misra, W.-B. Gong and D. Towsley, SIGCOMM
2000. Fluid model concept originated by F. Kelly,
A. Maulloo and D. Tan, Jour. Oper. Res. Society,
1998
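The differential equations themselves did not survive the transcript;
as a hedged illustration, here is a minimal Euler integration of the
standard Misra-Gong-Towsley TCP--RED fluid model. The RED profile,
propagation delay and other parameter values are assumptions for the
sketch, not numbers from the deck.

```python
# Minimal Euler integration of the TCP--RED fluid model
# (Misra, Gong, Towsley, SIGCOMM 2000):
#   dW/dt = 1/R(t) - W(t)*W(t-R)/(2*R(t-R)) * p(t-R)
#   dq/dt = N*W(t)/R(t) - C,   with R(t) = q(t)/C + Tp
import numpy as np

N, C, Tp = 50, 9000.0, 0.05           # flows, capacity (pkts/s), prop. delay (s)
qmin, qmax, pmax = 50.0, 200.0, 0.1   # assumed RED profile (pkts, max prob.)
dt, T_end = 1e-4, 10.0                # step size and horizon (s)

steps = int(T_end / dt)
W = np.ones(steps)                    # per-flow window (pkts)
q = np.zeros(steps)                   # queue length (pkts)

def red_prob(qlen):
    """Piecewise-linear RED drop/mark probability."""
    return pmax * np.clip((qlen - qmin) / (qmax - qmin), 0.0, 1.0)

for k in range(steps - 1):
    R = q[k] / C + Tp                 # current round trip time
    lag = max(k - int(R / dt), 0)     # index roughly one RTT in the past
    Rlag = q[lag] / C + Tp
    p = red_prob(q[lag])
    dW = 1.0 / R - (W[k] * W[lag] / (2.0 * Rlag)) * p   # AIMD in fluid form
    dq = N * W[k] / R - C                               # arrivals minus service
    W[k + 1] = max(W[k] + dt * dW, 1.0)
    q[k + 1] = min(max(q[k] + dt * dq, 0.0), 2.0 * qmax)
```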
16
Accuracy of analytical model
Recall the ns-2 simulation from earlier: delay at
Link 1
17
Accuracy of analytical model
18
Accuracy of analytical model
19
Why are the Diff Eqn models so accurate?
  • They've been developed in Physics, where they are
    called Mean Field Models
  • The main idea
  • it is very difficult to model large-scale systems:
    there are simply too many events, too many random
    quantities
  • but it is quite easy to model the mean or
    average behavior of such systems
  • interestingly, when the size of the system grows,
    its behavior gets closer and closer to that
    predicted by the mean-field model!
  • physicists have been exploiting this feature to
    model large magnetic materials, gases, etc.
  • just as a few electrons/particles don't have a
    very big influence on a system, Internet
    resource usage is not heavily influenced by a few
    packets; aggregates matter more

20
TCP--RED Stability analysis
  • Given the differential equations, in principle
    one can figure out whether the TCP--RED control
    loop is stable
  • However, the differential equations are very
    complicated
  • 3rd or 4th order, nonlinear, with delays
  • There is no general theory; specific case
    treatments exist
  • Linearize and analyze (see the sketch after this
    list)
  • Linearize the equations around the (unique)
    operating point
  • Analyze the resultant linear, delay-differential
    equations using Nyquist or Bode theory
  • End result
  • Design stable control loops
  • Determine stability conditions (RTT limits,
    number of users, etc.)
  • Obtain control loop parameters: gains, drop
    functions, ...
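As a small illustration of the "linearize and analyze" step, the
sketch below numerically reads a gain margin off the frequency
response of a linear loop with delay. The loop transfer function here
is an assumed first-order lag with delay, chosen for the example; it
is not the linearized TCP--RED loop.

```python
# Numerical gain-margin check for a linear loop with delay,
# L(jw) = k * e^{-jw*delay} / (1 + jw*tau): find the phase-crossover
# frequency (-180 degrees) and the gain margin there.
import numpy as np

def loop_gain(w, k=2.0, tau=1.0, delay=0.5):
    """Frequency response of an assumed first-order lag plus delay."""
    return k * np.exp(-1j * w * delay) / (1.0 + 1j * w * tau)

w = np.logspace(-2, 2, 20000)
L = loop_gain(w)
phase = np.unwrap(np.angle(L))
i = np.argmin(np.abs(phase + np.pi))      # phase-crossover index
gm_db = -20.0 * np.log10(np.abs(L[i]))    # gain margin in dB
print(f"phase crossover at w = {w[i]:.2f} rad/s, gain margin = {gm_db:.1f} dB")
# A positive gain margin indicates a stable closed loop; as the delay
# grows, the margin shrinks and eventually goes negative.
```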

21
Instability of TCP--RED
  • As the bandwidth-delay-product increases, the
    TCP--RED control loop becomes unstable
  • Parameters: 50 sources, link capacity 9000
    pkts/sec, TCP--RED
  • Source: S. Low et al., Infocom 2002

22
Summary
  • We saw a very brief overview of research on the
    analysis of congestion control systems
  • As loop lags increase, the control loop becomes
    very oscillatory
  • This is true of any control scheme, not just
    congestion control schemes
  • In networks, oscillatory queue sizes tend to
    underflow buffers, causing a loss of
    throughput; this is especially true for high BDP
    networks with shallow buffers
  • This has led to much research on developing
    algorithms for high BDP networks e.g. High-Speed
    TCP, XCP, RCP, Scalable TCP, BIC-TCP, etc
  • We shall return to this later, after describing
    the QCN algorithm we have developed for the IEEE
    802.1 standard

23
  • Quantized Congestion Notification (QCN)
  • Congestion control for Ethernet

Joint work with Mohammad Alizadeh, Berk Atikoglu
and Abdul Kabbani, Stanford University; Ashvin
Lakshmikantha, Broadcom; Rong Pan, Cisco
Systems; and Mick Seaman, Chair, Security Group and
Ex-Chair, Interworking Group, IEEE 802.1
24
Overview
  • The description of QCN is brief, restricted to
    the main points of the algorithm
  • A fuller description is available at the IEEE
    802.1 Data Center Bridging Task Group's website,
    including extensive simulations and pseudo-code
  • We will describe the congestion control loop
  • How is congestion measured at the switches?
  • What is the signal? And how does the switch
    send it? (Remember, there are no per-packet acks
    in Ethernet)
  • What does the source do when it receives a
    congestion signal?
  • Terminology
  • Congestion Point: where congestion occurs, mainly
    switches
  • Reaction Point: the source of traffic, mainly rate
    limiters in Ethernet NICs

25
QCN Congestion Point Dynamics
  • Consider the single-source, single-switch loop
    below
  • Congestion Point (Switch) Dynamics: sample
    packets, compute feedback (Fb), quantize Fb to 6
    bits, and reflect only negative Fb values back to
    the Reaction Point with a probability proportional
    to |Fb| (see the sketch below)

[Figure: the reflection probability rises from Pmin
to Pmax as |Fb| grows]
Fb = -(Q - Qeq + w * dQ/dt) = -(queue offset + w * rate offset)
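A minimal sketch of the congestion point computation just described.
Qeq is taken from the simulation slide later in the deck; the weight
w, the Pmin/Pmax values and the sampling details are assumptions for
the sketch, so consult the 802.1Qau pseudo-code for the standardized
logic.

```python
# QCN congestion point: on a sampled packet, compute
# Fb = -(Q - Qeq + w*dQ/dt), quantize to 6 bits, and reflect only
# negative values, with probability growing with |Fb|.
import random
from typing import Optional

Q_EQ = 22                  # equilibrium queue occupancy, pkts (from the deck)
W_GAIN = 2.0               # weight on the rate-offset term (assumed)
FB_MIN = -63               # 6-bit feedback range
P_MIN, P_MAX = 0.01, 0.1   # reflection probability range (assumed)

def congestion_point_sample(q_now: int, q_prev: int) -> Optional[int]:
    """Return quantized feedback to reflect, or None for no message."""
    q_off = q_now - Q_EQ             # queue offset
    q_delta = q_now - q_prev         # queue growth ~ rate offset
    fb = -(q_off + W_GAIN * q_delta)
    if fb >= 0:
        return None                  # only congestion (negative Fb) is signaled
    fb = max(int(fb), FB_MIN)        # saturate to the 6-bit range
    p = P_MIN + (P_MAX - P_MIN) * (fb / FB_MIN)   # grows with |Fb|
    return fb if random.random() < p else None
```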
26
QCN Reaction Point
  • Source (reaction point): transmit regular
    Ethernet frames. When a congestion message
    arrives:
  • Multiplicative Decrease
  • Fast Recovery: similar to BIC-TCP; gives high
    performance in high bandwidth-delay product
    networks, while being very simple (sketched in
    code below)
  • Active Probing

[Figure: rate evolution through Fast Recovery and
Active Probing]
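A minimal sketch of the reaction point just described: multiplicative
decrease on feedback, then BIC-like averaging toward the pre-decrease
rate during Fast Recovery, with the target rate probed upward during
Active Probing. The gain Gd and the method names are assumptions for
the sketch, not the standard's pseudo-code.

```python
# QCN reaction point: rate limiter state and updates.

GD = 1.0 / 128.0     # decrease gain applied to |Fb| (assumed value)

class ReactionPoint:
    def __init__(self, line_rate: float):
        self.cr = line_rate          # CR: current sending rate
        self.tr = line_rate          # TR: target rate

    def on_congestion_message(self, fb: int) -> None:
        """Multiplicative decrease: remember the old rate as the target."""
        self.tr = self.cr
        self.cr *= 1.0 - GD * abs(fb)

    def on_cycle(self, r_ai: float = 0.0) -> None:
        """One Fast Recovery / Active Probing cycle: average CR toward
        TR (a binary-search step); in Active Probing, first probe TR
        upward by r_ai (r_ai = 0 during Fast Recovery)."""
        self.tr += r_ai
        self.cr = 0.5 * (self.cr + self.tr)
```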
27
Timer-supported QCN
  • Byte-Counter
  • 5 cycles of FR (150KB per cycle)
  • AI cycles afterwards (75KB per cycle)
  • Fb < 0 sends byte-counter to FR

  • RL (Rate Limiter)
  • In FR if both byte-counter and timer are in FR
  • In AI if only one of byte-counter or timer is in AI
  • In HAI if both byte-counter and timer are in AI
  • Note: RL goes to HAI only after 500 pkts have
    been sent (this state rule is sketched in code
    below)

  • Timer
  • 5 cycles of FR (T msec per cycle)
  • AI cycles afterwards (T/2 msec per cycle)
  • Fb < 0 sends timer to FR

[Figure: the byte-counter and timer jointly drive
the RL state]
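The rate limiter's phase is a simple function of the byte-counter and
timer states; a small sketch of that rule, with the state names as
assumptions:

```python
# Combine byte-counter and timer phases into the rate limiter (RL) phase.
from enum import Enum

class Phase(Enum):
    FR = 1   # fast recovery
    AI = 2   # active increase

def rl_phase(byte_ctr: Phase, timer: Phase, pkts_sent: int) -> str:
    if byte_ctr is Phase.FR and timer is Phase.FR:
        return "FR"                  # both still recovering
    if byte_ctr is Phase.AI and timer is Phase.AI:
        # hyper-active increase only after 500 packets have been sent
        return "HAI" if pkts_sent > 500 else "AI"
    return "AI"                      # exactly one of the two is in AI
```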

28
Simulations: Basic Case
  • Parameters
  • 10 sources share a 10G link, whose capacity
    drops to 0.5G during 2-4 secs
  • Max offered rate per source: 1.05G
  • RTT: 50 µsec
  • Buffer size: 100 pkts (150KB); Qeq = 22
  • T = 10 msec
  • RAI = 5 Mbps
  • RHAI = 50 Mbps

[Figure: Sources 1-10 attached by 10G links to a
shared link whose capacity drops to 0.5G]
29
Recovery Time
Recovery time: 80 msec
30
Fluid Model for QCN
[Figure: reflection probability P = F(Fb); the
quantized Fb ranges up to 63]
  • Assume N flows pass through a single queue at a
    switch. State variables are TRi(t), CRi(t), q(t),
    p(t).

31
Accuracy: Equations vs. ns-2 simulations
[Figure panels: N = 10, RTT = 100 µs; N = 100,
RTT = 500 µs; N = 10, RTT = 1 ms; N = 10, RTT = 2 ms]
32
Summary
  • The algorithm has been extensively tested in
    deployment scenarios of interest
  • Esp. interoperability with link-level PAUSE and
    TCP
  • All presentations are available at the IEEE 802.1
    website
  • The theoretical development is interesting, most
    notably because QCN (and BIC-TCP) display
    strong stability in the face of increasing lags,
    or, equivalently, in high bandwidth-delay product
    networks
  • While attempting to understand why these schemes
    perform so well, we have uncovered a method for
    improving the stability of any congestion control
    scheme; we present this next

33
The Averaging Principle
34
Background to the AP
  • When the lags in a control loop increase, the
    system becomes oscillatory and eventually becomes
    unstable
  • Feedback compensation is applied to restore
    stability; the two main flavors of feedback
    compensation are
  • Method 1: determine lags (round trip times) and
    apply the correct gains for the loop to be stable
    (e.g. XCP, RCP, FAST)
  • Method 2: include higher-order queue derivatives
    in the congestion information fed back to the
    source (e.g. REM/PI, BCN)
  • Method 1 is not suitable for us; we don't know
    RTTs in Ethernet
  • Method 2 requires a change to the switch
    implementation
  • The Averaging Principle is a different method
  • It is suited to Ethernet, where round trip times
    are unavailable
  • It doesn't need more feedback, hence switch
    implementations don't have to change
  • QCN and BIC-TCP already turn out to employ it

35
The Averaging Principle (AP)
  • A source in a congestion control loop is
    instructed by the network (randomly, periodically)
    to decrease or increase its sending rate
  • AP: a source obeys the network whenever
    instructed to change rate, and then voluntarily
    performs averaging as below

TR: Target Rate; CR: Current Rate
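A minimal sketch of an AP source, as defined above: obey every
instructed rate change, then voluntarily average the current rate
halfway toward the target. Class and method names are illustrative.

```python
# The Averaging Principle at a source: obey, then average.

class APSource:
    def __init__(self, rate: float):
        self.cr = rate               # CR: current rate
        self.tr = rate               # TR: target rate

    def on_instruction(self, new_rate: float) -> None:
        """Obey the network: jump to the instructed rate, keeping the
        pre-change rate as the target to average toward."""
        self.tr = self.cr
        self.cr = new_rate

    def average(self) -> None:
        """Voluntary averaging: move CR halfway toward TR."""
        self.cr = 0.5 * (self.cr + self.tr)
```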
36
Recall QCN does 5 steps of Averaging
  • In the Fast Recovery portion of QCN, there are 5
    steps of averaging
  • In fact, QCN and BIC-TCP are the Averaging
    Principle applied to TCP!

[Figure: five averaging steps in Fast Recovery,
then Active Probing]
37
Applying the AP: RCP (Rate Control Protocol),
by Dukkipati and McKeown
  • A router computes an upper bound R on the rate of
    all flows traversing it.
  • R is recomputed every T (= 10) msec as follows
    (sketched in code at the end of this slide):
  • R(t) = R(t - T) * [1 + (T/d0) * (α(C - y(t)) - β*Q(t)/d0) / C]
  • Where
  • d0: round trip time estimate (set constant, 10
    msec in our case)
  • C: link capacity (= 2.4 Gbps)
  • Q: current queue size at the switch
  • y(t): incoming rate
  • α = 0.1
  • β = 1
  • A flow chooses the smallest advertised rate on
    its path.
  • We consider a scenario where 10 RCP sources share
    a single link.
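A minimal sketch of the router-side RCP computation above, using the
slide's parameter values; the clamping of R to [0, C] is an assumption
added for robustness.

```python
# RCP: recompute the advertised rate R every T seconds from the spare
# capacity and the backlog that must drain within one RTT estimate d0.

D0 = 0.010           # round trip time estimate (10 msec)
T = 0.010            # recomputation interval (10 msec)
C = 2.4e9            # link capacity (2.4 Gbps)
ALPHA, BETA = 0.1, 1.0

def rcp_update(r_prev: float, y: float, q: float) -> float:
    spare = ALPHA * (C - y)          # unclaimed capacity
    drain = BETA * q / D0            # rate needed to drain the queue in d0
    r = r_prev * (1.0 + (T / D0) * (spare - drain) / C)
    return min(max(r, 0.0), C)       # assumed clamp to [0, C]

# Each flow then sends at the smallest advertised R along its path.
```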

38
AP-RCP Stability
RTT = 60 msec
RTT = 65 msec
39
AP-RCP Stability (contd.)
RTT = 120 msec
RTT = 130 msec
40
AP-RCP Stability (contd.)
RTT = 230 msec
RTT = 240 msec
41
Understanding the AP
  • As mentioned earlier, the two major flavors of
    feedback compensation are
  • Determine lags, choose appropriate gains
  • Feed back higher derivatives of state
  • We prove that the AP is in a sense equivalent to
    both of the above!
  • This is great because we don't need to change
    network routers and switches
  • And the AP is really very easy to apply: no
    lag-dependent optimization of gain parameters is
    needed

42
AP Equivalence: Single Source Case
[Figure: System 1, a source doing AP on feedback Fb;
System 2, a regular source fed the modified feedback
0.5*Fb + 0.25*T*dFb/dt]
  • Systems 1 and 2 are discrete-time models for an
    AP-enabled source and a regular source,
    respectively.
  • Main Result Systems 1 and 2 are algebraically
    equivalent. That is, given identical input
    sequences, they produce identical output
    sequences.
  • Therefore the AP is equivalent to adding a
    derivative to the feedback and reducing the gain!
  • Thus, the AP does both known forms of feedback
    compensation without knowing RTTs or changing
    switch implementations

43
AP-RCP vs PD-RCP
RTT = 120 msec
RTT = 130 msec
44
A Generic Control Example
  • As an example, we consider the plant transfer
    function
  • P(s) = (s + 1) / (s^3 + 1.6 s^2 + 0.8 s + 0.6)
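To see the AP on this plant, here is a toy closed-loop step-response
simulation under assumed details (Euler integration, unity-gain
feedback sampled every 0.5 s, one voluntary averaging step half a
period after each update); the deck's exact controller is not
specified in the transcript.

```python
# Step response of P(s) = (s+1)/(s^3 + 1.6 s^2 + 0.8 s + 0.6) under
# delayed unity feedback, with and without the basic Averaging Principle.
import numpy as np

# Controllable canonical state-space form of P(s).
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [-0.6, -0.8, -1.6]])
B = np.array([0.0, 0.0, 1.0])
Cout = np.array([1.0, 1.0, 0.0])     # numerator s + 1

def step_response(delay: float, use_ap: bool,
                  t_end: float = 100.0, dt: float = 0.01,
                  period: float = 0.5) -> np.ndarray:
    n, d, per = int(t_end / dt), int(delay / dt), int(period / dt)
    x, y = np.zeros(3), np.zeros(n)
    u = u_prev = 0.0
    for k in range(n):
        y_delayed = y[k - d] if k >= d else 0.0
        if k % per == 0:                       # obey: new actuator setting
            u_prev, u = u, 1.0 - y_delayed     # unit-step reference, unity gain
        elif use_ap and k % per == per // 2:   # AP: average halfway back
            u = 0.5 * (u + u_prev)
        x = x + dt * (A @ x + B * u)           # Euler step of the plant
        y[k] = Cout @ x
    return y

# Compare oscillation with and without AP at a large loop delay.
y_plain = step_response(delay=8.0, use_ap=False)
y_ap = step_response(delay=8.0, use_ap=True)
```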

45
Step Response: Basic AP, No Delay
46
Step Response: Basic AP, Delay = 8 seconds
47
Step Response: Two-step AP, Delay = 14 seconds
48
Step Response: Two-step AP, Delay = 25 seconds
Two-step AP is even more stable than Basic AP
49
Summary of AP
  • The AP is a simple method for making many control
    loops (not just congestion control loops) more
    robust to increasing lags
  • Gives a clear understanding of why the BIC-TCP
    and QCN algorithms have such good delay
    tolerance: they do averaging repeatedly
  • There is a theorem which deals explicitly with
    the QCN-type loop
  • Variations of the basic principle are possible,
    e.g. average more than once, average by more than
    halfway, etc.
  • The theory is fairly complete in these cases

50
QCN and Buffer Sizing
51
Background: TCP Buffer Sizing
  • Standard rule of thumb
  • Single TCP flow: Bandwidth × Delay worth of
    buffering is needed for 100% utilization.
  • Recent result (Appenzeller et al.)
  • For N >> 1 TCP flows: Bandwidth × Delay / sqrt(N)
    worth of buffering is enough. (A worked example
    follows below.)
  • The essence of this result is that when many
    flows combine, the variance of the net sending
    rate decreases
  • The buffer sizing problem is challenging in data
    centers
  • Typically, only a small number of flows are
    active on each path (N is small)
  • Ethernet switches are typically built with
    shallow buffers to keep costs down.
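A small worked example of the two sizing rules above, with
illustrative numbers (the 10 Gbps link and 100 µs RTT are assumptions
for the example, not the deck's settings):

```python
# Buffer sizing: bandwidth-delay product, divided by sqrt(N) for N flows.
BW = 10e9           # link bandwidth, bits/s
RTT = 100e-6        # round trip time, s

bdp_bytes = BW * RTT / 8.0               # one BDP = 125 KB here
for n in (1, 4, 16, 64):
    need = bdp_bytes / n ** 0.5          # Appenzeller et al. rule
    print(f"N = {n:3d}: buffer of about {need / 1e3:.1f} KB")
```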

52
Example Simulation Setup
[Figure: topology with a single switch]
  • 10 Gig Ethernet
  • Switch buffer is 150 Kbytes deep.
  • We compare TCP and QCN for various numbers of
    flows and RTTs.

53
TCP vs QCN (N = 1, RTT = 120 µs)
TCP: Throughput 99.5%, Standard Deviation 265.4 Mbps
QCN: Throughput 99.5%, Standard Deviation 13.8 Mbps
54
TCP vs QCN (N = 1, RTT = 250 µs)
TCP: Throughput 95.5%, Standard Deviation 782.7 Mbps
QCN: Throughput 99.5%, Standard Deviation 33.3 Mbps
55
TCP vs QCN (N = 1, RTT = 500 µs)
TCP: Throughput 88%, Standard Deviation 1249.7 Mbps
QCN: Throughput 99.5%, Standard Deviation 95.4 Mbps
56
TCP vs QCN (N = 10, RTT = 120 µs)
TCP: Throughput 99.5%, Standard Deviation 625.1 Mbps
QCN: Throughput 99.5%, Standard Deviation 25.1 Mbps
57
TCP vs QCN (N = 10, RTT = 250 µs)
TCP: Throughput 95.5%, Standard Deviation 981 Mbps
QCN: Throughput 99.5%, Standard Deviation 27.2 Mbps
58
TCP vs QCN (N = 10, RTT = 500 µs)
TCP: Throughput 89%, Standard Deviation 1311.4 Mbps
QCN: Throughput 99.5%, Standard Deviation 170.5 Mbps
59
QCN and shallow buffers
  • In contrast to TCP, QCN is stable with shallow
    buffers, even with few sources.
  • Why?
  • Recall that buffer requirements are closely
    related to sending-rate variance
  • Buffer size ≈ C × Var(R1) × Bandwidth × Delay /
    sqrt(N)
  • TCP
  • Good performance for large N, since the
    denominator is large.
  • QCN
  • Good performance for all N, since the numerator
    is small.
  • Thus, averaging reduces the variance of a
    source's sending rate
  • This is a stochastic interpretation of the
    Averaging Principle's success in maintaining
    stability with shallow buffers

60
Conclusions
  • We have seen the background, development and
    analysis of a congestion control scheme for the
    IEEE 802.1 Ethernet standard
  • The QCN algorithm
  • is more stable with respect to control loop delays
  • requires much smaller buffers than TCP
  • is easy to build in hardware
  • The Averaging Principle is interesting, and we're
    exploring its use in nonlinear control systems