Congestion Management for Data Centers: IEEE 802.1 Ethernet Standard

About This Presentation

Slides: 61

Provided by: deep4

Transcript and Presenter's Notes

1
Congestion Management for Data Centers: IEEE
802.1 Ethernet Standard
Balaji Prabhakar, Departments of EE and
CS, Stanford University
2
Background

Data Centers see the true convergence of L3 and
L2 transport
While TCP is the dominant L3 transport protocol,
and a significant amount of L2 traffic uses it,
there is other L2 traffic, notably storage and
media
This, and other reasons, have prompted the IEEE
802.1 standards body to develop an Ethernet
congestion management standard
In this lecture, we shall see the development of
the QCN (Quantized Congestion Notification)
algorithm for standardization in the IEEE 802.1
Data Center Bridging standards
We will also review the technical background of
congestion control research
The lecture has 3 parts:
A brief overview of the relevant congestion
control background
A description of the QCN algorithm and its
performance
The Averaging Principle: a new control-theoretic
idea underlying the QCN and BIC-TCP algorithms
which stabilizes them when loop delays increase;
it is very useful for operating high-speed links
with shallow buffers, the situation in 10 Gbps
Ethernets

3
Managing Congestion

Congestion is a standard feature of networked
systems
In data networks, congestion occurs when links
are oversubscribed, when traffic and/or link
bandwidth changes
A congestion notification mechanism allows
switches/routers to directly control the rate of
the ultimate sources of the traffic
We've been involved in developing QCN (for
Quantized Congestion Notification) for
standardization in the Data Center Bridging
track of the IEEE 802.1 Ethernet standards
For deployment in 10 (and 40 and 100) Gbps Data
Center Ethernets
Complete information on the QCN algorithm
(p-code, draft of standard, detailed simulations
of lots of scenarios) available at

4
Congestion control in the Internet

In the Internet
Queue management schemes (e.g. RED) at the links
signal congestion by either dropping or marking
packets using ECN
TCP at end-systems uses these signals to vary the
sending rate
There exists a rich history of algorithm
development, control-theoretic analysis and
detailed simulation of queue management schemes
and congestion control algorithms for the
Internet
Jacobson, Floyd et al, Kelly et al, Low et al,
Srikant et al, Misra et al, Katabi et al
TCP is excellent, so why look for another
algorithm?
There is traffic other than TCP on Ethernet, so
native Ethernet congestion management is needed
TCP's "one size fits all" approach makes it too
conservative for high bandwidth-delay product
networks
A hardware-based algorithm is needed for the very
high speeds of operation encountered in 10, 40
and 100 Gbps
Ethernet and the Internet have very different
operating conditions

5
Switched Ethernet vs. the Internet

Some significant differences
No per-packet acks in Ethernet, unlike in the
Internet
Not possible to know round trip time!
So congestion must be signaled to the source by
switches
Algorithm is not automatically self-clocked
(unlike TCP)
Links can be paused, i.e. packets may not be
dropped
No sequence numbering of L2 packets
Sources do not start transmission gently (as in
TCP slow-start); they can potentially come on at
the full line rate of 10 Gbps
Ethernet switch buffers are much smaller than
router buffers (100s of KBs vs 100s of MBs)
Most importantly, the algorithm should be simple
enough to be implemented completely in hardware
Note: The QCN algorithm we have developed has
Internet relatives, notably BIC-TCP at the source
and the REM/PI controllers at switches

6
L2 Transport: IEEE 802.1

IEEE 802.1 Data Center Bridging standards
Enhancements to Ethernet
Reliable delivery (802.1Qbb): Link-level flow
control (PAUSE) prevents congestion drops
Ethernet congestion management (802.1Qau):
Prevents congestion spreading due to PAUSE
Consequences
Hardware-friendly algorithms can operate on
10-100 Gbps links
Partial offload of CPU: no packet retransmissions
Corruption losses require abort/restart; 10G over
copper uses short cables to keep BER low
PAUSE absorption buffers proportional to bandwidth
x delay of links, high memory bandwidth
NOTE: Recent work addresses the last two points;
this is not covered in the course

[Figure: Pause absorption buffers]
7
Overview of Congestion Control Research
8
Stability

Congestion control algorithms aim to
deliver high throughput, maintain low
latencies/backlogs, be fair to all flows, be
simple to implement and easy to deploy
Performance is related to stability of control
loop
Stability refers to the non-oscillatory or
non-exploding behavior of congestion control
loops. In real terms, stability refers to the
non-oscillatory behavior of the queues at the
switch.
If the switch buffers are short, oscillating
queues can overflow (hence drop packets/pause the
link) or underflow (hence lose utilization)
In either case, links cannot be fully utilized,
throughput is lost, flow transfers take longer
So stability is an important property, especially
for networks with high bandwidth-delay products
operating with shallow buffers

9
Unit step response of the network

The control loops are not easy to analyze
They are described by non-linear
delay-differential equations, which are usually
impossible to analyze
So linearized analyses are performed using
Nyquist or Bode theory
Is linearized analysis useful?
Yes! It is not difficult to know if a zero-delay
non-linear system is stable. As the delay
increases, linearization can be used to tell if
the system is stable for delay (or number of
sources) in some range, i.e. we get sufficient
conditions
The above stability theory is essentially
studying the unit step response of a network
Apply many infinitely long flows at time 0 and
see how long the network takes to settle them to
the correct collective and individual rates; the
first is about throughput, the second is about
fairness

10
TCP--RED A basic control loop
TCP: Slow start + Congestion avoidance
Congestion avoidance: AIMD
No loss: increase window by 1
Pkt loss: cut window by half
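
The AIMD rule above can be sketched in a few lines of Python. This is only an illustration of the window update, not a TCP implementation: the function names, the per-RTT granularity and the floor of one packet are assumptions made for the example.

```python
def aimd_update(window, loss):
    """One congestion-avoidance step, applied once per round trip."""
    if loss:
        # Multiplicative decrease: cut the window by half.
        # The floor of 1 packet is an assumption of this sketch.
        return max(1.0, window / 2.0)
    # Additive increase: grow the window by 1 packet.
    return window + 1.0


def aimd_trace(initial_window, loss_pattern):
    """Return the window after each round trip for a given loss pattern."""
    window = float(initial_window)
    trace = [window]
    for loss in loss_pattern:
        window = aimd_update(window, loss)
        trace.append(window)
    return trace


if __name__ == "__main__":
    # Losses in the 3rd and 5th round trips produce the familiar sawtooth.
    print(aimd_trace(8, [False, False, True, False, True]))
```

The sawtooth produced by this rule is what the RED queue sees as its input in the control loop.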
11
TCP--RED

Two ways to analyze and understand this control
loop:
Simulations: ns-2
Theory: delay-differential equations
ns-2: a widely used event-driven simulator for
the Internet
Very detailed and accurate
Different types of transport protocols: TCP, UDP,
...
Router mechanisms and algorithms: RED, DRR, ...
Web traffic: sessions, flows, power law flow
sizes, ...
Different types of network: wired, wireless,
satellite, mobility, ...

12
The simulation setup
13
Delay at Link 1
14
TCP--RED Analytical model
15
TCP--RED Analytical model
Users
Network
W = window size, RTT = round trip time, C = link
capacity, q = queue length, qa = average queue
length, p = drop probability
By V. Misra, W. Gong and D. Towsley at SIGCOMM
2000. Fluid model concept originated by F. Kelly,
A. Maulloo and D. Tan in Jour. Oper. Res. Society,
1998
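
For reference, the window and queue equations of the fluid model in the cited SIGCOMM 2000 paper are commonly written as below. This is a sketch from that paper rather than from the slide itself: R(t) stands for the round trip time (RTT above), and N (number of flows) and T_p (propagation delay) are symbols introduced here that do not appear in the slide's legend.

```latex
\begin{aligned}
\frac{dW(t)}{dt} &= \frac{1}{R(t)}
  - \frac{W(t)\,W(t - R(t))}{2\,R(t - R(t))}\; p(t - R(t)) \\[4pt]
\frac{dq(t)}{dt} &= \frac{N\,W(t)}{R(t)} - C,
  \qquad R(t) = \frac{q(t)}{C} + T_p
\end{aligned}
```

The first term of the window equation is the additive increase of one packet per round trip; the second is the halving of the window, driven by the drop probability seen one round trip earlier. RED computes p from the average queue length qa.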
16
Accuracy of analytical model
Recall the ns-2 simulation from earlier: Delay at
Link 1
17
Accuracy of analytical model
18
Accuracy of analytical model
19
Why are the Diff Eqn models so accurate?

They've been developed in Physics, where they are
called Mean Field Models
The main idea:
it is very difficult to model large-scale systems;
there are simply too many events, too many random
quantities
but it is quite easy to model the mean or
average behavior of such systems
interestingly, when the size of the system grows,
its behavior gets closer and closer to that
predicted by the mean-field model!
physicists have been exploiting this feature to
model large magnetic materials, gases, etc.
just as a few electrons/particles don't have a
very big influence on a system, Internet
resource usage is not heavily influenced by a few
packets; aggregates matter more

20
TCP--RED Stability analysis

Given the differential equations, in principle
one can figure out whether the TCP--RED control
loop is stable
However, the differential equations are very
complicated
3rd or 4th order, nonlinear, with delays
There is no general theory; specific case
treatments exist
Linearize and analyze
Linearize equations around the (unique) operating
point
Analyze resultant linear, delay-differential
equations using Nyquist or Bode theory
End result
Design stable control loops
Determine stability conditions (RTT limits,
number of users, etc)
Obtain control loop parameters: gains, drop
functions, ...
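
As a toy illustration of a delay-dependent stability condition (not the TCP--RED model itself), consider the scalar linear loop dx/dt = -k x(t - tau). It is a standard result that this loop is stable exactly when k * tau < pi/2. The sketch below integrates it with a simple Euler scheme; the function names, step size and initial history are assumptions of the example.

```python
def simulate_delayed_loop(gain, delay, t_end=60.0, dt=0.01):
    """Euler integration of dx/dt = -gain * x(t - delay).

    The history is taken as x(t) = 1 for t <= 0 (an assumption).
    Returns the samples of x for t = 0, dt, 2*dt, ..., t_end.
    """
    lag = int(round(delay / dt))
    x = [1.0] * (lag + 1)  # constant history on [-delay, 0]
    for _ in range(int(round(t_end / dt))):
        # x[-1] is x(t); x[-1 - lag] is x(t - delay).
        x.append(x[-1] - dt * gain * x[-1 - lag])
    return x[lag:]


def tail_amplitude(samples, dt=0.01, window=10.0):
    """Largest |x| over the last `window` seconds of the run."""
    n = int(round(window / dt))
    return max(abs(v) for v in samples[-n:])


if __name__ == "__main__":
    for delay in (1.0, 2.0):  # gain * delay below and above pi/2
        amp = tail_amplitude(simulate_delayed_loop(1.0, delay))
        print(f"gain*delay = {delay:.1f}: tail amplitude = {amp:.3g}")
```

With the gain fixed, increasing the delay alone moves the loop from decaying oscillations to growing ones, which is the qualitative behavior a Nyquist or Bode analysis of the linearized equations predicts.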

21
Instability of TCP--RED

As the bandwidth-delay-product increases, the
TCP--RED control loop becomes unstable
Parameters: 50 sources, link capacity = 9000
pkts/sec, TCP--RED
Source: S. Low et al., Infocom 2002

22
Summary

We saw a very brief overview of research on the
analysis of congestion control systems
As loop lags increase, the control loop becomes
very oscillatory
This is true of any control scheme, not just
congestion control schemes
In networks, oscillatory queue sizes tend to
underflow buffers, causing a loss of
throughput; this is especially true for high BDP
networks with shallow buffers
This has led to much research on developing
algorithms for high BDP networks, e.g. High-Speed
TCP, XCP, RCP, Scalable TCP, BIC-TCP, etc.
We shall return to this later, after describing
the QCN algorithm we have developed for the IEEE
802.1 standard

Quantized Congestion Notification (QCN)
Congestion control for Ethernet

Joint work with Mohammad Alizadeh, Berk Atikoglu
and Abdul Kabbani, Stanford University; Ashvin
Lakshmikantha, Broadcom; Rong Pan, Cisco
Systems; Mick Seaman, Chair, Security Group, and
Ex-Chair, Interworking Group, IEEE 802.1
24
Overview

The description of QCN is brief, restricted to
the main points of the algorithm
A fuller description is available at the IEEE
802.1 Data Center Bridging Task Group's website,
including extensive simulations and pseudo-code
We will describe the congestion control loop:
How is congestion measured at the switches?
What is the signal? And how does the switch
send it? (Remember: there are no per-packet acks
in Ethernet)
What does the source do when it receives a
congestion signal?
Terminology
Congestion Point: where congestion occurs, mainly
switches
Reaction Point: source of traffic, mainly rate
limiters in Ethernet NICs

25
QCN Congestion Point Dynamics

Consider the single-source, single-switch loop
below
Congestion Point (Switch) Dynamics: Sample
packets, compute feedback (Fb), quantize Fb to 6
bits, and reflect only negative Fb values back to
the Reaction Point with a probability proportional
to |Fb|.

Fb = -(Q - Qeq + w . dQ/dt) = -(queue offset +
w . rate offset)

[Figure: source and switch queue with equilibrium
level Qeq; reflection probability versus Fb,
between Pmin and Pmax]
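
The congestion point steps (compute Fb, quantize to 6 bits, reflect with a probability that grows with |Fb|) can be sketched as follows. This is a minimal sketch, not the standard's pseudo-code: the weight w = 2, the saturation value Qeq * (2w + 1), the values Pmin = 1% and Pmax = 10%, and the linear ramp between them are all assumptions of the example, as are the function names. The rate offset is approximated by the change in queue length since the previous sample.

```python
import random


def compute_fb(q, q_old, q_eq, w=2.0):
    """Fb = -(queue offset + w * rate offset), in packets."""
    queue_offset = q - q_eq
    rate_offset = q - q_old  # queue change since the last sample
    return -(queue_offset + w * rate_offset)


def quantize_fb(fb, q_eq, w=2.0, bits=6):
    """Map a negative Fb onto 0..2^bits - 1; non-negative Fb maps to 0."""
    if fb >= 0:
        return 0
    fb_max = q_eq * (2 * w + 1)  # assumed saturation value of |Fb|
    levels = (1 << bits) - 1
    return min(levels, int(round(-fb / fb_max * levels)))


def reflection_probability(fbq, p_min=0.01, p_max=0.10, bits=6):
    """Assumed linear ramp from p_min to p_max over the quantized range."""
    levels = (1 << bits) - 1
    return p_min + (p_max - p_min) * fbq / levels


def maybe_reflect(q, q_old, q_eq, rng=random):
    """Return the quantized Fb to send to the source, or None."""
    fbq = quantize_fb(compute_fb(q, q_old, q_eq), q_eq)
    if fbq > 0 and rng.random() < reflection_probability(fbq):
        return fbq
    return None


if __name__ == "__main__":
    fb = compute_fb(q=40, q_old=30, q_eq=22)
    print(fb, quantize_fb(fb, q_eq=22))
```

Only congested samples (negative Fb) ever generate a message, so the feedback traffic scales with the severity of congestion rather than with the data rate.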
26
QCN Reaction Point

Source (reaction point): Transmit regular
Ethernet frames. When a congestion message
arrives:
Multiplicative Decrease
Fast Recovery: similar to BIC-TCP; gives high
performance in high bandwidth-delay product
networks, while being very simple.
Active Probing

[Figure: Fast Recovery and Active Probing phases]
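
A minimal sketch of the reaction point behavior described above, assuming a decrease gain Gd = 1/128 (so that the largest 6-bit Fb roughly halves the rate), 5 Fast Recovery cycles, and an active-increase step of 5 Mbps. The class and method names are hypothetical, and the byte-counter/timer logic that decides when a cycle ends is left out.

```python
class ReactionPoint:
    """Rate limiter state: current rate (CR) and target rate (TR), in Mbps."""

    def __init__(self, line_rate=10_000.0, gd=1 / 128, r_ai=5.0, fr_cycles=5):
        self.cr = line_rate
        self.tr = line_rate
        self.gd = gd                # assumed decrease gain
        self.r_ai = r_ai            # assumed active-increase step
        self.fr_cycles = fr_cycles
        self.cycle = 0

    def on_congestion_message(self, fbq):
        """Multiplicative decrease; remember the old rate as the target."""
        self.tr = self.cr
        self.cr *= 1.0 - self.gd * fbq
        self.cycle = 0

    def on_cycle_complete(self):
        """Fast Recovery for the first cycles, then Active Probing."""
        self.cycle += 1
        if self.cycle > self.fr_cycles:
            self.tr += self.r_ai    # probe for extra bandwidth
        # Move CR halfway towards TR (the averaging step).
        self.cr = (self.cr + self.tr) / 2.0


if __name__ == "__main__":
    rp = ReactionPoint()
    rp.on_congestion_message(32)
    for _ in range(6):
        rp.on_cycle_complete()
        print(rp.cr)
```

During Fast Recovery the rate climbs back towards the pre-congestion rate in halving steps without any further feedback from the switch; only afterwards does the target itself start to increase.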
27
Timer-supported QCN

Byte-Counter
5 cycles of FR (150KB per cycle)
AI cycles afterwards (75KB per cycle)
Fb < 0 sends timer to FR

RL (Rate Limiter)
In FR if both byte-ctr and timer in FR
In AI if only one of byte-ctr or timer in AI
In HAI if both byte-ctr and timer in AI
Note: RL goes to HAI only after 500 pkts have
been sent

Timer
5 cycles of FR (T msec per cycle)
AI cycles afterwards (T/2 msec/cycle)
Fb < 0 sends timer to FR
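
The rule that combines the byte-counter and the timer can be sketched as a small decision function. The function and parameter names are hypothetical, and the 500-packet condition is read here as a guard on entering HAI (falling back to AI when it is not met), which is an interpretation of the note above rather than something the slide spells out.

```python
FR_CYCLES = 5  # cycles each of the byte-counter and timer spends in FR


def counter_phase(cycles_completed):
    """Phase of the byte-counter or the timer, taken on its own."""
    return "FR" if cycles_completed < FR_CYCLES else "AI"


def rate_limiter_phase(byte_ctr_cycles, timer_cycles,
                       pkts_sent=None, hai_min_pkts=500):
    """Combine the two counters into the rate limiter's phase."""
    phases = (counter_phase(byte_ctr_cycles), counter_phase(timer_cycles))
    in_ai = phases.count("AI")
    if in_ai == 0:
        return "FR"
    if in_ai == 1:
        return "AI"
    # Both in AI: hyper-active increase, subject to the packet guard.
    if pkts_sent is not None and pkts_sent < hai_min_pkts:
        return "AI"
    return "HAI"


if __name__ == "__main__":
    print(rate_limiter_phase(0, 0))   # both counters still in FR
    print(rate_limiter_phase(5, 2))   # only the byte-counter in AI
    print(rate_limiter_phase(5, 5))   # both in AI
```

The timer matters for slow sources: a rate limiter sending at a low rate would take a long time to count 150KB, so the timer lets it recover on a time basis instead.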

28
Simulations Basic Case

Parameters
10 sources share a 10 G link, whose capacity
drops to 0.5G during 2-4 secs
Max offered rate per source: 1.05G
RTT = 50 usec
Buffer size = 100 pkts (150KB), Qeq = 22
T = 10 msecs
R_AI = 5 Mbps
R_HAI = 50 Mbps

[Figure: Source 1, Source 2, ..., Source 10;
link labels 10 G, 10 G and 0.5G]
29
Recovery Time
Recovery time = 80 msec
30
Fluid Model for QCN
P = F(Fb)

Assume N flows pass through a single queue at a
switch. State variables are TR_i(t), CR_i(t), q(t),
p(t).

[Figure: P as a function of Fb; axis marks 10 and
63]
31
Accuracy: Equations vs. ns-2 simulations
N = 10, RTT = 100 us
N = 100, RTT = 500 us
N = 10, RTT = 1 ms
N = 10, RTT = 2 ms
32
Summary

The algorithm has been extensively tested in
deployment scenarios of interest
Esp. interoperability with link-level PAUSE and
TCP
All presentations are available at the IEEE 802.1
website
The theoretical development is interesting, most
notably because QCN (and BIC-TCP) display
strong stability in the face of increasing lags,
or, equivalently, in high bandwidth-delay product
networks
While attempting to understand why these schemes
perform so well, we have uncovered a method for
improving the stability of any congestion control
scheme; we present this next

33
The Averaging Principle
34
Background to the AP

When the lags in a control loop increase, the
system becomes oscillatory and eventually becomes
unstable
Feedback compensation is applied to restore
stability; the two main flavors of feedback
compensation are:
Determine lags (round trip times), apply the
correct gains for the loop to be stable (e.g.
XCP, RCP, FAST).
Include higher order queue derivatives in the
congestion information fed back to the source
(e.g. REM/PI, BCN).
Method 1 is not suitable for us: we don't know
RTTs in Ethernet
Method 2 requires a change to the switch
implementation
The Averaging Principle is a different method
It is suited to Ethernet, where round trip times
are unavailable
It doesn't need more feedback, hence switch
implementations don't have to change
QCN and BIC-TCP already turn out to employ it

35
The Averaging Principle (AP)?

A source in a congestion control loop is
instructed by the network to decrease or increase
its sending rate (randomly) periodically

AP: a source obeys the network whenever
instructed to change rate, and then voluntarily
performs averaging as below

TR = Target Rate, CR = Current Rate
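
In code, one application of the AP might look like the sketch below. It assumes the averaging step moves the rate halfway back towards the rate held before the change (the TR), which matches the Fast Recovery step of QCN; the function name and the additive form of the feedback are assumptions of the example.

```python
def ap_apply_feedback(rate, feedback):
    """Obey the network, then average with the previous rate.

    Returns (rate right after obeying, rate after the averaging step).
    """
    target = rate                     # TR: rate before the change
    obeyed = rate + feedback          # CR after obeying the network
    averaged = (obeyed + target) / 2  # voluntary averaging step
    return obeyed, averaged


if __name__ == "__main__":
    # Told to drop by 40: go to 60 first, then settle halfway back at 80.
    print(ap_apply_feedback(100.0, -40.0))
```

The source still reacts immediately and fully to the network's instruction; the averaging only tempers the change before the next instruction arrives.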
36
Recall: QCN does 5 steps of Averaging

In the Fast Recovery portion of QCN, there are 5
steps of averaging
In fact, QCN and BIC-TCP are the Averaging
Principle applied to TCP!

Active Probing
37
Applying the AP: RCP (Rate Control
Protocol), Dukkipati and McKeown

A router computes an upper bound R on the rate of
all flows traversing it.
R is recomputed every T (= 10) msec as follows:
R(t) = R(t - T) [1 + (T/d0) (α (C - y(t)) - β Q/d0) / C]
where
d0 = round trip time estimate (set constant, 10
msec in our case)
C = link capacity (= 2.4 Gbps)
Q = current queue size at the switch
y(t) = incoming rate
α = 0.1
β = 1
A flow chooses the smallest advertised rate on
its path.
We consider a scenario where 10 RCP sources share
a single link.
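
A sketch of the router's rate update, using the update rule from the published RCP description, R(t) = R(t - T) [1 + (T/d0) (alpha (C - y(t)) - beta Q/d0) / C], with the parameter values listed above as defaults. The function name and the choice of units (bits and seconds) are assumptions of the example.

```python
def rcp_update(rate, incoming_rate, queue_bits,
               capacity=2.4e9, d0=0.010, period=0.010,
               alpha=0.1, beta=1.0):
    """One RCP update of the advertised rate R, in bits per second.

    rate          -- R(t - T), the previously advertised rate
    incoming_rate -- y(t), aggregate arrival rate at the link
    queue_bits    -- Q, current queue size
    """
    # Spare capacity pushes R up; a standing queue pushes it down.
    spare = alpha * (capacity - incoming_rate) - beta * queue_bits / d0
    return rate * (1.0 + (period / d0) * spare / capacity)


if __name__ == "__main__":
    # Link half loaded and empty queue: R grows by 5% in one period.
    print(rcp_update(1.0e9, incoming_rate=1.2e9, queue_bits=0.0))
```

At equilibrium the incoming rate equals the capacity and the queue is empty, so the bracket is 1 and R stays put.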

38
AP-RCP Stability
RTT = 60 msec
RTT = 65 msec
39
AP-RCP Stability (cont'd)
RTT = 120 msec
RTT = 130 msec
40
AP-RCP Stability (cont'd)
RTT = 230 msec
RTT = 240 msec
41
Understanding the AP

As mentioned earlier, the two major flavors of
feedback compensation are:
Determine lags, choose appropriate gains
Feed back higher derivatives of state
We prove that the AP is, in a sense, equivalent to
both of the above!
This is great because we don't need to change
network routers and switches
And the AP is really very easy to apply: no
lag-dependent optimizations of gain parameters
needed

42
AP Equivalence: Single Source Case
System 1: source does AP, feedback Fb
System 2: regular source, feedback
0.5 Fb + 0.25 T dFb/dt

Systems 1 and 2 are discrete-time models for an
AP-enabled source and a regular source,
respectively.
Main Result Systems 1 and 2 are algebraically
equivalent. That is, given identical input
sequences, they produce identical output
sequences.
Therefore the AP is equivalent to adding a
derivative to the feedback and reducing the gain!
Thus, the AP does both known forms of feedback
compensation without knowing RTTs or changing
switch implementations

43
AP-RCP vs PD-RCP
RTT = 120 msec
RTT = 130 msec
44
A Generic Control Example

As an example, we consider the plant transfer
function
P(s) = (s + 1)/(s^3 + 1.6 s^2 + 0.8 s + 0.6)

45
Step Response: Basic AP, No Delay
46
Step Response: Basic AP, Delay = 8 seconds
47
Step Response: Two-step AP, Delay = 14 seconds
48
Step Response: Two-step AP, Delay = 25 seconds
Two-step AP is even more stable than Basic AP
49
Summary of AP

The AP is a simple method for making many control
loops (not just congestion control loops) more
robust to increasing lags
Gives a clear understanding of why
the BIC-TCP and QCN algorithms have such good
delay tolerance: they do averaging repeatedly
There is a theorem which deals explicitly with
the QCN-type loop
Variations of the basic principle are possible,
i.e. average more than once, average by more than
half-way, etc.
The theory is fairly complete in these cases

50
QCN and Buffer Sizing
51
Background: TCP Buffer Sizing

Standard rule of thumb
Single TCP flow: Bandwidth x Delay worth of
buffering needed for 100% utilization.
Recent result (Appenzeller et al.)
For N >> 1 TCP flows: Bdwdth x Delay/sqrt(N)
amount of buffering is enough.
The essence of this result is that when many
flows combine, the Variance of the net sending
rate decreases
Buffer sizing problem is challenging in data
centers
Typically, only a small number of flows are
active on each path. (N is small)
Ethernet switches are typically built with
shallow buffers to keep costs down.
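
To put numbers on the two rules, the sketch below evaluates them for a 10 Gbps link at the round trip times used in the simulations that follow. The function names are hypothetical; the formulas are the two rules of thumb stated above.

```python
import math


def bdp_bytes(capacity_bps, rtt_s):
    """Bandwidth x Delay product, in bytes (single-flow rule of thumb)."""
    return capacity_bps * rtt_s / 8.0


def buffer_for_n_flows(capacity_bps, rtt_s, n_flows):
    """Bandwidth x Delay / sqrt(N), in bytes (many-flow rule)."""
    return bdp_bytes(capacity_bps, rtt_s) / math.sqrt(n_flows)


if __name__ == "__main__":
    for rtt_us in (120, 250, 500):
        rtt = rtt_us * 1e-6
        one = bdp_bytes(10e9, rtt) / 1e3
        ten = buffer_for_n_flows(10e9, rtt, 10) / 1e3
        print(f"RTT {rtt_us} us: N=1 needs {one:.1f} KB, "
              f"N=10 needs {ten:.1f} KB")
```

At 120 us the bandwidth-delay product is exactly 150 KB; at 500 us it is 625 KB, and dividing by sqrt(10) still leaves roughly 198 KB, which is why a small N gives little relief.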

52
Example Simulation Setup
Switch

10 Gig Ethernet
Switch buffer is 150 Kbytes deep.
We compare TCP and QCN for various numbers of
flows and RTTs.

53
TCP vs QCN (N = 1, RTT = 120 µs)
TCP: Throughput = 99.5%, Standard Deviation = 265.4 Mbps
QCN: Throughput = 99.5%, Standard Deviation = 13.8 Mbps
54
TCP vs QCN (N = 1, RTT = 250 µs)
TCP: Throughput = 95.5%, Standard Deviation = 782.7 Mbps
QCN: Throughput = 99.5%, Standard Deviation = 33.3 Mbps
55
TCP vs QCN (N = 1, RTT = 500 µs)
TCP: Throughput = 88%, Standard Deviation = 1249.7 Mbps
QCN: Throughput = 99.5%, Standard Deviation = 95.4 Mbps
56
TCP vs QCN (N = 10, RTT = 120 µs)
TCP: Throughput = 99.5%, Standard Deviation = 625.1 Mbps
QCN: Throughput = 99.5%, Standard Deviation = 25.1 Mbps
57
TCP vs QCN (N = 10, RTT = 250 µs)
TCP: Throughput = 95.5%, Standard Deviation = 981 Mbps
QCN: Throughput = 99.5%, Standard Deviation = 27.2 Mbps
58
TCP vs QCN (N = 10, RTT = 500 µs)
TCP: Throughput = 89%, Standard Deviation = 1311.4 Mbps
QCN: Throughput = 99.5%, Standard Deviation = 170.5 Mbps
59
QCN and shallow buffers

In contrast to TCP, QCN is stable with shallow
buffers, even with few sources.
Why?
Recall that buffer requirements are closely
related to sending rate variance:
Buffer size = C x Var(R1) x Bdwdth x Delay / sqrt(N)
TCP
Good performance for large N, since the
denominator is large.
QCN
Good performance for all N, since the numerator
is small.
Thus, averaging reduces the variance of a
source's sending rate
This is a stochastic interpretation of the
Averaging Principle's success in keeping
stability with shallow buffers

60
Conclusions

We have seen the background, development and
analysis of a congestion control scheme for the
IEEE 802.1 Ethernet standard
The QCN algorithm
Is more stable with respect to control loop delays
Requires much smaller buffers than TCP
Is easy to build in hardware
The Averaging Principle is interesting and we're
exploring its use in nonlinear control systems
