Title: E2CM updates, IEEE 802.1 Interim, Geneva
1. E2CM updates, IEEE 802.1 Interim at Geneva
- Cyriel Minkenberg, Mitch Gusat
- IBM Research GmbH, Zurich
- May 29, 2007
2. Outline
- Summary of E2CM proposal
- How it works
- What has changed
- New E2CM performance results
- Managing across a non-CM domain
- Performance in fat tree topology
- Mixed link speeds (1G/10G)
3. Refresher: E2CM Operation

[Figure: source (src), Switches 1-3, destination (dst); BCN travels from the congested switch back to src, probes travel src to dst and back]

- Qeq exceeded at a switch queue: send BCN to source
- BCN arrives at source: install rate limiter
- Source injects probe with timestamp
- Probe arrives at dst: insert timestamp, return probe to source
- Probe arrives back at source: path occupancy is computed, AIMD control is applied using the same rate limiter
- Probing is triggered by BCN frames: only rate-limited flows are probed
- Insert one probe every X KB of data sent per flow, e.g., X = 75 KB
- Probes traverse the network in band; the objective is to observe the real, current queuing delay
- Variant: continuous probing (used here)
- Per flow, BCN and probes employ the same rate limiter (sketched below)
- Control per-flow (probe) as well as per-queue (BCN) occupancy
- CPID of probes = destination MAC
- Rate limiter is associated with the CPID from which the last negative feedback was received
  - Increment only on probes from the associated CPID
- Parameters relating to probes may be set differently (in particular Qeq,flow, Qmax,flow, Gd,flow, Gi,flow)
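
To make the control loop concrete, here is a minimal sketch of the per-flow rate-limiter update, assuming the ECM/BCN-style AIMD law implied by the gain definitions on the parameter slides; the function names and unit conventions (queues in KB, rates in Mb/s) are illustrative, not part of the proposal:

```python
# Minimal sketch of the E2CM/ECM per-flow rate-limiter update (illustrative;
# assumes the standard BCN-style AIMD law, with queue quantities in KB and
# rates in Mb/s as unit conventions).

W = 2.0                  # derivative weight
Q_EQ = 37.5              # equilibrium queue offset, KB (= M/4)
R_LINK = 10_000.0        # link rate, Mb/s (10 Gb/s)
R_UNIT = 1.0             # Runit = Rmin, Mb/s
G_D = 0.5 / ((2 * W + 1) * Q_EQ)                      # decrease gain
G_I = 0.1 * (R_LINK / R_UNIT) / ((2 * W + 1) * Q_EQ)  # Gi = 0.1 * Gi0

def feedback(q_off: float, q_delta: float) -> float:
    """Feedback value: negative when the queue is above Qeq and/or growing.
    For BCN, q_off/q_delta describe the congested switch queue; for a probe,
    the analogous per-flow path-occupancy quantities (Qeq,flow etc.)."""
    return -(q_off + W * q_delta)

def update_rate(rate: float, fb: float) -> float:
    """AIMD update applied when a BCN or a returned probe is received."""
    if fb < 0:
        rate *= max(1.0 + G_D * fb, 0.0)   # multiplicative decrease
    else:
        rate += G_I * fb * R_UNIT          # additive increase
    return min(max(rate, R_UNIT), R_LINK)  # clamp to [Rmin, Rlink]

# Example: queue 10 KB above Qeq and grown by 2 KB since the last sample
# throttles a 5 Gb/s limiter to roughly 4.8 Gb/s.
print(update_rate(5000.0, feedback(10.0, 2.0)))
```

Because BCNs and probes drive the same limiter, a flow is throttled by whichever signal, per-queue or per-flow path occupancy, last reported negative feedback.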
4. Synergies

- Added value of E2CM
  - Fair and stable rate allocation
  - Fine granularity owing to per-flow end-to-end probing
  - Improved initial response and queue convergence speeds
  - Transparent to network: purely end-to-end, no (additional) burden on bridges
- Added value of ECM
  - Fast initial response: feedback travels straight back to source
  - Capped aggregate queue length for large-degree hotspots: controls the sum of per-flow queue occupancies
5. Modifications since March proposal
See also au-sim-ZRL-E2CM-src-based-r1.2.pdf
6. Coexistence of CM and non-CM domains

- A concern has been raised that an end-to-end scheme requires global deployment
- We consider the case where a non-CM switch exists in the path of the congesting flows
- CM messages are terminated at the edge of the domain
  - Cannot relay notifications across the non-CM domain
  - Cannot control congestion inside the non-CM domain
- Non-CM (legacy) bridge behavior
  - Does not generate or interpret any CM notifications
  - Can it relay CM notifications as regular frames? This may depend on the bridge implementation
  - The next results make this assumption
7. Managing across a non-CM domain

[Figure: Switches 1, 2, 3, and 5 form CM domains; Switch 4 is a non-CM domain in the path between them; nodes 1-5 are sources, node 6 is the hotspot destination, node 7 receives the cold flow]

- Switches 1, 2, 3, and 5 are in congestion-managed domains; switch 4 is in a non-congestion-managed domain
- Four hot flows of 10 Gb/s each from nodes 1, 2, 3, 4 to node 6 (hotspot)
- One cold (lukewarm) flow of 10 Gb/s from node 5 to node 7
- Max-min fair allocation provides 2.0 Gb/s to each flow
8. Simulation Setup and Parameters

- Traffic
  - Mean flow size: 1500, 60000 B
  - Geometric flow size distribution
  - Source stops sending at T = 1.0 s
  - Simulation runs to completion (no frames left in the system)
- Scenario
  - See previous slide
- Switch
  - Radix N = 2, 3, 4
  - M = 150 KB/port
  - Link time of flight: 1 µs
  - Partitioned memory per input, shared among all outputs
  - No limit on per-output memory usage
  - PAUSE enabled or disabled
    - Applied on a per-input basis based on local high/low watermarks
    - watermark_high = 141.5 KB
    - watermark_low = 131.5 KB
- Adapter
  - Per-node virtual output queuing, round-robin scheduling
  - No limit on number of rate limiters
  - Ingress buffer size unlimited, round-robin VOQ service
  - Egress buffer size 150 KB
  - PAUSE enabled
    - watermark_high = 141.5 KB
    - watermark_low = 131.5 KB
- ECM (gain values evaluated in the sketch below)
  - W = 2.0
  - Qeq = 37.5 KB (= M/4)
  - Gd = 0.5 / ((2W+1) * Qeq)
  - Gi0 = (Rlink / Runit) / ((2W+1) * Qeq)
  - Gi = 0.1 * Gi0
  - Psample = 2% (on average 1 sample every 75 KB)
  - Runit = Rmin = 1 Mb/s
  - BCN_MAX enabled, threshold 150 KB
  - BCN(0,0) disabled
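
For concreteness, a short script evaluating these settings (an illustrative sketch; the division in Gi0 is reconstructed from the slide's notation, so treat it as an assumption, and queue/rate units of KB and Mb/s are conventions, not prescribed):

```python
# Worked evaluation of the ECM gain settings above (illustrative sketch).

W = 2.0
M = 150.0          # switch memory per port, KB
Q_eq = M / 4       # = 37.5 KB
R_link = 10_000.0  # 10 Gb/s, in Mb/s
R_unit = 1.0       # Runit = Rmin = 1 Mb/s

G_d = 0.5 / ((2 * W + 1) * Q_eq)                 # 0.5 / (5 * 37.5) ~ 0.00267
G_i0 = (R_link / R_unit) / ((2 * W + 1) * Q_eq)  # 10000 / 187.5  ~ 53.3
G_i = 0.1 * G_i0                                 # ~ 5.33

# Psample = 2%: with 1500 B frames, one sample per 1500 B / 0.02 = 75 KB,
# matching the "1 sample every 75 KB" note.
bytes_per_sample = 1500 / 0.02   # = 75000 B = 75 KB

print(G_d, G_i0, G_i, bytes_per_sample)
```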
9. E2CM: Per-flow throughput

[Plots: per-flow throughput for bursty and Bernoulli traffic, PAUSE disabled vs. enabled; max-min fair rates indicated]
10. E2CM: Per-node throughput

[Plots: per-node throughput for bursty and Bernoulli traffic, PAUSE disabled vs. enabled; max-min fair rates indicated]
11. E2CM: Switch queue length

[Plots: switch queue length for bursty and Bernoulli traffic, PAUSE disabled vs. enabled; stable OQ level indicated]
12. Frame drops, flow completions, FCT

- Mean FCT is longer with PAUSE
  - All flows are accounted for (without PAUSE, not all flows completed)
  - The absence of PAUSE heavily skews the results
  - In particular for hot flows → much longer FCT with PAUSE
- Cold flow FCT is independent of burst size!
- Load compression: flows wait for a long time in the adapter before being injected
  - FCT is dominated by adapter latency
  - Cold traffic also traverses the hotspot and therefore suffers from compression
13. Fat tree network

- Fat trees enable scaling to arbitrarily large networks with constant (full) bisection bandwidth
- We use static, destination-based, shortest-path routing
- For more details on construction and routing see au-sim-ZRL-fat-tree-build-and-route-r1.0.pdf
14. Fat tree network

- Switches are labeled (stageID, switchID)
  - stageID ∈ [0, S-1]
  - switchID ∈ [0, (N/2)^(L-1) - 1]
- Conventions (evaluated in the sketch below):
  - M = number of end nodes = N * (N/2)^(L-1)
  - N = number of bidirectional ports per switch
  - L = number of levels (folded)
  - S = number of stages = 2L - 1 (unfolded)
  - Switches per stage: (N/2)^(L-1)
  - Total number of switches: (2L-1) * (N/2)^(L-1)
- Nodes are connected at the left and right edges
  - Left nodes are numbered 0 through M/2 - 1
  - Right nodes are numbered M/2 through M - 1

[Figure: folded fat tree (levels 0-2, spine at top, up/down directions) and its unfolding to a Benes network (stages 0-4, switches (0,0) through (4,3), nodes at the left and right edges)]
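
Under the conventions above the network sizes follow directly; a small sketch (function name ours) that computes them for the two simulated configurations:

```python
# Sketch of the fat-tree sizing conventions from the slide (assumes radix-N
# switches, N/2 up and N/2 down ports per level; names are illustrative).

def fat_tree_dimensions(N: int, L: int) -> dict:
    """Counts implied by the slide's conventions for an N-port,
    L-level folded fat tree."""
    switches_per_stage = (N // 2) ** (L - 1)
    return {
        "end_nodes_M": N * switches_per_stage,
        "stages_S": 2 * L - 1,                    # unfolded (Benes) view
        "switches_per_stage": switches_per_stage,
        "total_switches": (2 * L - 1) * switches_per_stage,
    }

# The two simulated networks (N = 4):
print(fat_tree_dimensions(N=4, L=3))  # M = 16, S = 5, 4 switches/stage, 20 total
print(fat_tree_dimensions(N=4, L=4))  # M = 32, S = 7, 8 switches/stage, 56 total
```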
15. Simulation Setup and Parameters

- Traffic
  - Mean flow size: 1500, 60000 B
  - Geometric flow size distribution
  - Uniform destination distribution (except self)
  - Mean load 50%
  - Source stops sending at T = 1.0 s
  - Simulation runs to completion
- Scenario
  - 16-node (3-level) and 32-node (4-level) fat tree networks
  - Output-generated hotspot (rate reduction to 10% of link rate) on port 1 from 0.1 to 0.5 s
- Switch
  - Radix N = 4
  - M = 150 KB/port
  - Link time of flight: 1 µs
  - Partitioned memory per input, shared among all outputs
  - No limit on per-output memory usage
  - PAUSE enabled or disabled
- Adapter
  - Per-node virtual output queuing, round-robin scheduling
  - No limit on number of rate limiters
  - Ingress buffer size unlimited, round-robin VOQ service
  - Egress buffer size 150 KB
  - PAUSE enabled
    - watermark_high = 141.5 KB
    - watermark_low = 131.5 KB
- ECM
  - W = 2.0
  - Qeq = 37.5 KB (= M/4)
  - Gd = 0.5 / ((2W+1) * Qeq)
  - Gi0 = (Rlink / Runit) / ((2W+1) * Qeq)
  - Gi = 0.1 * Gi0
  - Psample = 2% (on average 1 sample every 75 KB)
  - Runit = Rmin = 1 Mb/s
  - BCN_MAX enabled, threshold 150 KB
  - BCN(0,0) enabled/disabled, threshold 300 KB
16. E2CM fat tree results: 16 nodes, 3 levels

[Plots: aggregate throughput and hot-queue length for bursty and Bernoulli traffic]
17. E2CM fat tree results: 32 nodes, 4 levels

[Plots: aggregate throughput and hot-queue length for bursty and Bernoulli traffic]
18. Frame drops, completed flows, FCT

[Tables: frame drops, completed flows, and FCT for the 16-node and 32-node networks]
19. Mixed link speeds

Output-generated hotspot:
[Figure: nodes 1-10 attach to Switch 1 via 1G links (50% load each); Switch 1 connects over 10G to Switch 2, which serves node 11 over 10G; the hotspot port's service rate is 10%]

- Nodes 1-10 are connected via 1G adapters and links
- Switch 1 has ten 1G ports and one 10G port to switch 2, which has two 10G ports
- Shared-memory switches → create more serious congestion
- Ten hot flows of 0.5 Gb/s each from nodes 1-10 to node 11 (hotspot)
- Node 11 sends uniformly at 5 Gb/s (cold)
- Max-min fair shares: 12.5 MB/s for 1-10 → 11 (cross-checked in the sketch below)

Input-generated hotspot:
[Figure: same topology; node 11 now drives the hot flow toward node 1]

- Same topology as above
- One hot flow of 5.0 Gb/s from node 11 to node 1 (hotspot)
- Nodes 1-10 send uniformly at 0.5 Gb/s (cold)
- Max-min fair shares: 62.5 MB/s for 11 → 1 and 6.25 MB/s for 2-10 → 1
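
To sanity-check the fair-share numbers, a generic progressive-filling computation for a single bottleneck link (an illustrative sketch; the function and the scenario encoding are ours, not from the proposal):

```python
# Progressive-filling max-min fair allocation on one bottleneck link.

def max_min_fair(capacity: float, demands: list[float]) -> list[float]:
    """Allocate `capacity` among flows with the given demands, max-min fairly."""
    alloc = [0.0] * len(demands)
    active = set(range(len(demands)))
    remaining = capacity
    while active:
        share = remaining / len(active)
        # Flows demanding less than the equal share are capped at their demand.
        capped = {i for i in active if demands[i] <= share}
        if not capped:
            for i in active:
                alloc[i] = share
            break
        for i in capped:
            alloc[i] = demands[i]
            remaining -= demands[i]
        active -= capped
    return alloc

# Output-generated hotspot: the bottleneck is node 11's port served at 10%
# of 10G = 1 Gb/s, shared by ten 0.5 Gb/s hot flows.
shares = max_min_fair(1.0, [0.5] * 10)            # Gb/s
print(shares[0], "Gb/s =", shares[0] * 1e3 / 8, "MB/s")  # 0.1 Gb/s = 12.5 MB/s
```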
20. E2CM, mixed speed, output-generated HS

[Plots: per-flow and per-node throughput, PAUSE disabled vs. enabled]
21. E2CM, mixed speed, input-generated HS

[Plots: per-flow and per-node throughput, PAUSE disabled vs. enabled]
22. Probing, mixed speed, output-generated HS

[Plots: per-flow and per-node throughput, PAUSE disabled vs. enabled; perfect bandwidth sharing indicated]
23. Probing, mixed speed, input-generated HS

[Plots: per-flow and per-node throughput, PAUSE disabled vs. enabled]
24. Conclusions

- FCT is dominated by adapter latency for rate-limited flows
- E2CM can manage across non-CM domains
  - Even a hotspot within a non-CM domain can be controlled
  - Need to ensure that CM notifications can traverse non-CM domains: they have to look like valid frames to non-CM bridges
- E2CM works excellently in multi-level fat tree topologies
- E2CM also copes well with mixed-speed networks
- Continuous probing improves E2CM's overall performance
- In low-degree hotspot scenarios, probing alone appears to be sufficient to control congestion