Title: E2CM updates, IEEE 802.1 Interim, Geneva
1. E2CM updates, IEEE 802.1 Interim at Geneva
- Cyriel Minkenberg, Mitch Gusat
- IBM Research GmbH, Zurich
- May 29, 2007
2. Outline
- Summary of E2CM proposal
- How it works
- What has changed
- New E2CM performance results
- Managing across a non-CM domain
- Performance in fat tree topology
- Mixed link speeds (1G/10G)
3. Refresher: E2CM Operation

[Figure: source (src), Switches 1-3, destination (dst); BCN travels from the congested switch back to src, probes travel src to dst and back]

- Qeq exceeded at a switch queue: send BCN to source
- BCN arrives at source: install rate limiter
- Source injects probe with timestamp
- Probe arrives at dst: insert timestamp, return probe to source
- Probe arrives back at source: path occupancy is computed, AIMD control is applied using the same rate limiter
- Probing is triggered by BCN frames: only rate-limited flows are probed
- Insert one probe every X KB of data sent per flow, e.g., X = 75 KB
- Probes traverse the network in band; the objective is to observe the real, current queuing delay
- Variant: continuous probing (used here)
- Per flow, BCN and probes employ the same rate limiter (sketched below)
- Control per-flow (probe) as well as per-queue (BCN) occupancy
- CPID of probes = destination MAC
- Rate limiter is associated with the CPID from which the last negative feedback was received
  - Increment only on probes from the associated CPID
- Parameters relating to probes may be set differently (in particular Qeq,flow, Qmax,flow, Gd,flow, Gi,flow)
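
To make the control loop concrete, here is a minimal sketch of the per-flow rate-limiter update, assuming the ECM/BCN-style AIMD law implied by the gain definitions on the parameter slides; the function names and unit conventions (queues in KB, rates in Mb/s) are illustrative, not part of the proposal:

```python
# Minimal sketch of the E2CM/ECM per-flow rate-limiter update (illustrative;
# assumes the standard BCN-style AIMD law, with queue quantities in KB and
# rates in Mb/s as unit conventions).

W = 2.0                  # derivative weight
Q_EQ = 37.5              # equilibrium queue offset, KB (= M/4)
R_LINK = 10_000.0        # link rate, Mb/s (10 Gb/s)
R_UNIT = 1.0             # Runit = Rmin, Mb/s
G_D = 0.5 / ((2 * W + 1) * Q_EQ)                      # decrease gain
G_I = 0.1 * (R_LINK / R_UNIT) / ((2 * W + 1) * Q_EQ)  # Gi = 0.1 * Gi0

def feedback(q_off: float, q_delta: float) -> float:
    """Feedback value: negative when the queue is above Qeq and/or growing.
    For BCN, q_off/q_delta describe the congested switch queue; for a probe,
    the analogous per-flow path-occupancy quantities (Qeq,flow etc.)."""
    return -(q_off + W * q_delta)

def update_rate(rate: float, fb: float) -> float:
    """AIMD update applied when a BCN or a returned probe is received."""
    if fb < 0:
        rate *= max(1.0 + G_D * fb, 0.0)   # multiplicative decrease
    else:
        rate += G_I * fb * R_UNIT          # additive increase
    return min(max(rate, R_UNIT), R_LINK)  # clamp to [Rmin, Rlink]

# Example: queue 10 KB above Qeq and grown by 2 KB since the last sample
# throttles a 5 Gb/s limiter to roughly 4.8 Gb/s.
print(update_rate(5000.0, feedback(10.0, 2.0)))
```

Because BCNs and probes drive the same limiter, a flow is throttled by whichever signal, per-queue or per-flow path occupancy, last reported negative feedback.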
4. Synergies

- Added value of E2CM
  - Fair and stable rate allocation
  - Fine granularity owing to per-flow end-to-end probing
  - Improved initial response and queue convergence speeds
  - Transparent to network: purely end-to-end, no (additional) burden on bridges
- Added value of ECM
  - Fast initial response: feedback travels straight back to source
  - Capped aggregate queue length for large-degree hotspots: controls the sum of per-flow queue occupancies
5. Modifications since March proposal
See also au-sim-ZRL-E2CM-src-based-r1.2.pdf
6. Coexistence of CM and non-CM domains

- A concern has been raised that an end-to-end scheme requires global deployment
- We consider the case where a non-CM switch exists in the path of the congesting flows
- CM messages are terminated at the edge of the domain
  - Cannot relay notifications across the non-CM domain
  - Cannot control congestion inside the non-CM domain
- Non-CM (legacy) bridge behavior
  - Does not generate or interpret any CM notifications
  - Can it relay CM notifications as regular frames? This may depend on the bridge implementation
  - The next results make this assumption
7. Managing across a non-CM domain

[Figure: Switches 1, 2, 3, and 5 form CM domains; Switch 4 is a non-CM domain in the path between them; nodes 1-5 are sources, node 6 is the hotspot destination, node 7 receives the cold flow]

- Switches 1, 2, 3, and 5 are in congestion-managed domains; switch 4 is in a non-congestion-managed domain
- Four hot flows of 10 Gb/s each from nodes 1, 2, 3, 4 to node 6 (hotspot)
- One cold (lukewarm) flow of 10 Gb/s from node 5 to node 7
- Max-min fair allocation provides 2.0 Gb/s to each flow
8. Simulation Setup and Parameters

- Traffic
  - Mean flow size: 1500, 60000 B
  - Geometric flow size distribution
  - Source stops sending at T = 1.0 s
  - Simulation runs to completion (no frames left in the system)
- Scenario
  - See previous slide
- Switch
  - Radix N = 2, 3, 4
  - M = 150 KB/port
  - Link time of flight: 1 µs
  - Partitioned memory per input, shared among all outputs
  - No limit on per-output memory usage
  - PAUSE enabled or disabled
    - Applied on a per-input basis based on local high/low watermarks
    - watermark_high = 141.5 KB
    - watermark_low = 131.5 KB
- Adapter
  - Per-node virtual output queuing, round-robin scheduling
  - No limit on number of rate limiters
  - Ingress buffer size unlimited, round-robin VOQ service
  - Egress buffer size 150 KB
  - PAUSE enabled
    - watermark_high = 141.5 KB
    - watermark_low = 131.5 KB
- ECM (gain values evaluated in the sketch below)
  - W = 2.0
  - Qeq = 37.5 KB (= M/4)
  - Gd = 0.5 / ((2W+1) * Qeq)
  - Gi0 = (Rlink / Runit) / ((2W+1) * Qeq)
  - Gi = 0.1 * Gi0
  - Psample = 2% (on average 1 sample every 75 KB)
  - Runit = Rmin = 1 Mb/s
  - BCN_MAX enabled, threshold 150 KB
  - BCN(0,0) disabled
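
For concreteness, a short script evaluating these settings (an illustrative sketch; the division in Gi0 is reconstructed from the slide's notation, so treat it as an assumption, and queue/rate units of KB and Mb/s are conventions, not prescribed):

```python
# Worked evaluation of the ECM gain settings above (illustrative sketch).

W = 2.0
M = 150.0          # switch memory per port, KB
Q_eq = M / 4       # = 37.5 KB
R_link = 10_000.0  # 10 Gb/s, in Mb/s
R_unit = 1.0       # Runit = Rmin = 1 Mb/s

G_d = 0.5 / ((2 * W + 1) * Q_eq)                 # 0.5 / (5 * 37.5) ~ 0.00267
G_i0 = (R_link / R_unit) / ((2 * W + 1) * Q_eq)  # 10000 / 187.5  ~ 53.3
G_i = 0.1 * G_i0                                 # ~ 5.33

# Psample = 2%: with 1500 B frames, one sample per 1500 B / 0.02 = 75 KB,
# matching the "1 sample every 75 KB" note.
bytes_per_sample = 1500 / 0.02   # = 75000 B = 75 KB

print(G_d, G_i0, G_i, bytes_per_sample)
```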
9. E2CM: Per-flow throughput

[Plots: per-flow throughput for bursty and Bernoulli traffic, PAUSE disabled vs. enabled; max-min fair rates indicated]
10. E2CM: Per-node throughput

[Plots: per-node throughput for bursty and Bernoulli traffic, PAUSE disabled vs. enabled; max-min fair rates indicated]
11. E2CM: Switch queue length

[Plots: switch queue length for bursty and Bernoulli traffic, PAUSE disabled vs. enabled; stable OQ level indicated]
12. Frame drops, flow completions, FCT

- Mean FCT is longer with PAUSE
  - All flows are accounted for (without PAUSE, not all flows completed)
  - The absence of PAUSE heavily skews the results
  - In particular for hot flows → much longer FCT with PAUSE
- Cold flow FCT is independent of burst size!
- Load compression: flows wait for a long time in the adapter before being injected
  - FCT is dominated by adapter latency
  - Cold traffic also traverses the hotspot and therefore suffers from compression
13. Fat tree network

- Fat trees enable scaling to arbitrarily large networks with constant (full) bisection bandwidth
- We use static, destination-based, shortest-path routing
- For more details on construction and routing see au-sim-ZRL-fat-tree-build-and-route-r1.0.pdf
14. Fat tree network

- Switches are labeled (stageID, switchID)
  - stageID ∈ [0, S-1]
  - switchID ∈ [0, (N/2)^(L-1) - 1]
- Conventions (evaluated in the sketch below):
  - M = number of end nodes = N * (N/2)^(L-1)
  - N = number of bidirectional ports per switch
  - L = number of levels (folded)
  - S = number of stages = 2L - 1 (unfolded)
  - Switches per stage: (N/2)^(L-1)
  - Total number of switches: (2L-1) * (N/2)^(L-1)
- Nodes are connected at the left and right edges
  - Left nodes are numbered 0 through M/2 - 1
  - Right nodes are numbered M/2 through M - 1

[Figure: folded fat tree (levels 0-2, spine at top, up/down directions) and its unfolding to a Benes network (stages 0-4, switches (0,0) through (4,3), nodes at the left and right edges)]
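
Under the conventions above the network sizes follow directly; a small sketch (function name ours) that computes them for the two simulated configurations:

```python
# Sketch of the fat-tree sizing conventions from the slide (assumes radix-N
# switches, N/2 up and N/2 down ports per level; names are illustrative).

def fat_tree_dimensions(N: int, L: int) -> dict:
    """Counts implied by the slide's conventions for an N-port,
    L-level folded fat tree."""
    switches_per_stage = (N // 2) ** (L - 1)
    return {
        "end_nodes_M": N * switches_per_stage,
        "stages_S": 2 * L - 1,                    # unfolded (Benes) view
        "switches_per_stage": switches_per_stage,
        "total_switches": (2 * L - 1) * switches_per_stage,
    }

# The two simulated networks (N = 4):
print(fat_tree_dimensions(N=4, L=3))  # M = 16, S = 5, 4 switches/stage, 20 total
print(fat_tree_dimensions(N=4, L=4))  # M = 32, S = 7, 8 switches/stage, 56 total
```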
15. Simulation Setup and Parameters

- Traffic
  - Mean flow size: 1500, 60000 B
  - Geometric flow size distribution
  - Uniform destination distribution (except self)
  - Mean load 50%
  - Source stops sending at T = 1.0 s
  - Simulation runs to completion
- Scenario
  - 16-node (3-level) and 32-node (4-level) fat tree networks
  - Output-generated hotspot (rate reduction to 10% of link rate) on port 1 from 0.1 to 0.5 s
- Switch
  - Radix N = 4
  - M = 150 KB/port
  - Link time of flight: 1 µs
  - Partitioned memory per input, shared among all outputs
  - No limit on per-output memory usage
  - PAUSE enabled or disabled
- Adapter
  - Per-node virtual output queuing, round-robin scheduling
  - No limit on number of rate limiters
  - Ingress buffer size unlimited, round-robin VOQ service
  - Egress buffer size 150 KB
  - PAUSE enabled
    - watermark_high = 141.5 KB
    - watermark_low = 131.5 KB
- ECM
  - W = 2.0
  - Qeq = 37.5 KB (= M/4)
  - Gd = 0.5 / ((2W+1) * Qeq)
  - Gi0 = (Rlink / Runit) / ((2W+1) * Qeq)
  - Gi = 0.1 * Gi0
  - Psample = 2% (on average 1 sample every 75 KB)
  - Runit = Rmin = 1 Mb/s
  - BCN_MAX enabled, threshold 150 KB
  - BCN(0,0) enabled/disabled, threshold 300 KB
16. E2CM fat tree results: 16 nodes, 3 levels

[Plots: aggregate throughput and hot-queue length for bursty and Bernoulli traffic]
17. E2CM fat tree results: 32 nodes, 4 levels

[Plots: aggregate throughput and hot-queue length for bursty and Bernoulli traffic]
18. Frame drops, completed flows, FCT

[Tables: frame drops, completed flows, and FCT for the 16-node and 32-node networks]
19. Mixed link speeds

Output-generated hotspot:
[Figure: nodes 1-10 attach to Switch 1 via 1G links (50% load each); Switch 1 connects over 10G to Switch 2, which serves node 11 over 10G; the hotspot port's service rate is 10%]

- Nodes 1-10 are connected via 1G adapters and links
- Switch 1 has ten 1G ports and one 10G port to switch 2, which has two 10G ports
- Shared-memory switches → create more serious congestion
- Ten hot flows of 0.5 Gb/s each from nodes 1-10 to node 11 (hotspot)
- Node 11 sends uniformly at 5 Gb/s (cold)
- Max-min fair shares: 12.5 MB/s for 1-10 → 11 (cross-checked in the sketch below)

Input-generated hotspot:
[Figure: same topology; node 11 now drives the hot flow toward node 1]

- Same topology as above
- One hot flow of 5.0 Gb/s from node 11 to node 1 (hotspot)
- Nodes 1-10 send uniformly at 0.5 Gb/s (cold)
- Max-min fair shares: 62.5 MB/s for 11 → 1 and 6.25 MB/s for 2-10 → 1
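
To sanity-check the fair-share numbers, a generic progressive-filling computation for a single bottleneck link (an illustrative sketch; the function and the scenario encoding are ours, not from the proposal):

```python
# Progressive-filling max-min fair allocation on one bottleneck link.

def max_min_fair(capacity: float, demands: list[float]) -> list[float]:
    """Allocate `capacity` among flows with the given demands, max-min fairly."""
    alloc = [0.0] * len(demands)
    active = set(range(len(demands)))
    remaining = capacity
    while active:
        share = remaining / len(active)
        # Flows demanding less than the equal share are capped at their demand.
        capped = {i for i in active if demands[i] <= share}
        if not capped:
            for i in active:
                alloc[i] = share
            break
        for i in capped:
            alloc[i] = demands[i]
            remaining -= demands[i]
        active -= capped
    return alloc

# Output-generated hotspot: the bottleneck is node 11's port served at 10%
# of 10G = 1 Gb/s, shared by ten 0.5 Gb/s hot flows.
shares = max_min_fair(1.0, [0.5] * 10)            # Gb/s
print(shares[0], "Gb/s =", shares[0] * 1e3 / 8, "MB/s")  # 0.1 Gb/s = 12.5 MB/s
```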
20. E2CM, mixed speed, output-generated HS

[Plots: per-flow and per-node throughput, PAUSE disabled vs. enabled]
21. E2CM, mixed speed, input-generated HS

[Plots: per-flow and per-node throughput, PAUSE disabled vs. enabled]
22. Probing, mixed speed, output-generated HS

[Plots: per-flow and per-node throughput, PAUSE disabled vs. enabled; perfect bandwidth sharing indicated]
23. Probing, mixed speed, input-generated HS

[Plots: per-flow and per-node throughput, PAUSE disabled vs. enabled]
24. Conclusions

- FCT is dominated by adapter latency for rate-limited flows
- E2CM can manage across non-CM domains
  - Even a hotspot within a non-CM domain can be controlled
  - Need to ensure that CM notifications can traverse non-CM domains: they have to look like valid frames to non-CM bridges
- E2CM works excellently in multi-level fat tree topologies
- E2CM also copes well with mixed-speed networks
- Continuous probing improves E2CM's overall performance
- In low-degree hotspot scenarios, probing alone appears to be sufficient to control congestion