1
E2CM updates, IEEE 802.1 Interim @ Geneva
  • Cyriel Minkenberg, Mitch Gusat
  • IBM Research GmbH, Zurich
  • May 29, 2007

2
Outline
  • Summary of E2CM proposal
  • How it works
  • What has changed
  • New E2CM performance results
  • Managing across a non-CM domain
  • Performance in fat tree topology
  • Mixed link speeds (1G/10G)

3
Refresher: E2CM Operation
[Figure: source, switches 1-3, and destination, with a BCN message returning to the source and a probe making the round trip]
  • Qeq exceeded at a switch → send BCN to source
  • BCN arrives at source → install rate limiter,
    inject probe with timestamp
  • Probe arrives at dst → insert timestamp, return
    probe to source
  • Probe arrives back at source → path occupancy
    computed, AIMD control applied using the same
    rate limiter
  • Probing is triggered by BCN frames; only
    rate-limited flows are probed (see the sketch
    after this list)
  • Insert one probe every X KB of data sent per
    flow, e.g. X = 75 KB
  • Probes traverse the network in-band; the objective
    is to observe the real, current queuing delay
  • Variant: continuous probing (used here)
  • Per flow, BCN and probes employ the same rate
    limiter
  • Control per-flow (probe) as well as per-queue
    (BCN) occupancy
  • CPID of probes = destination MAC
  • Rate limiter is associated with the CPID from
    which the last negative feedback was received
  • Increment only on probes from the associated CPID
  • Parameters relating to probes may be set
    differently (in particular Qeq,flow, Qmax,flow,
    Gd,flow, Gi,flow)
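A minimal sketch of the source-side behaviour just described. The class and parameter names (RateLimiter, Flow, a delay target standing in for Qeq,flow) are illustrative assumptions and not the simulator's code; only the 75 KB probe spacing, the shared rate limiter, and the CPID-association rule come from the slide.

```python
# Hedged sketch of source-side E2CM probing and AIMD control (assumed names/units).
PROBE_INTERVAL_BYTES = 75_000          # insert one probe every X = 75 KB per flow

class RateLimiter:
    """One AIMD rate limiter per flow, shared by BCN and probe feedback."""
    def __init__(self, link_rate_mbps, gi, gd, r_min_mbps=1.0):
        self.rate = link_rate_mbps
        self.gi, self.gd, self.r_min = gi, gd, r_min_mbps
        self.cpid = None                                # CPID that last sent negative feedback

    def apply_feedback(self, fb, cpid):
        if fb < 0:                                      # multiplicative decrease
            self.rate = max(self.r_min, self.rate * (1.0 + self.gd * fb))
            self.cpid = cpid                            # re-associate limiter with this CPID
        elif cpid == self.cpid:                         # increase only on feedback from the
            self.rate += self.gi * fb                   # associated CPID (per the slide)

class Flow:
    def __init__(self, limiter, dst_mac, delay_target_us):
        self.limiter = limiter
        self.dst_mac = dst_mac                          # CPID of probes = destination MAC
        self.delay_target_us = delay_target_us          # stand-in for Qeq,flow (assumption)
        self.bytes_since_probe = 0

    def on_frame_sent(self, length, now_us):
        """Once a flow is rate limited, emit a timestamped probe every 75 KB sent."""
        self.bytes_since_probe += length
        if self.limiter.cpid is not None and self.bytes_since_probe >= PROBE_INTERVAL_BYTES:
            self.bytes_since_probe = 0
            return {"ts_src_us": now_us, "cpid": self.dst_mac}   # probe travels in-band

    def on_probe_returned(self, probe, now_us):
        """Destination echoed the probe; round-trip delay approximates path occupancy."""
        path_delay_us = now_us - probe["ts_src_us"]
        fb = self.delay_target_us - path_delay_us       # negative when the path is over target
        self.limiter.apply_feedback(fb, probe["cpid"])
```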

4
Synergies
  • Added value of E2CM
  • Fair and stable rate allocation
  • Fine granularity owing to per-flow end-to-end
    probing
  • Improved initial-response and queue-convergence
    speeds
  • Transparent to the network
  • Purely end-to-end, no (additional) burden on
    bridges
  • Added value of ECM
  • Fast initial response
  • Feedback travels straight back to source
  • Capped aggregate queue length for large-degree
    hotspots
  • Controls sum of per-flow queue occupancies

5
Modifications since March proposal
See also au-sim-ZRL-E2CM-src-based-r1.2.pdf
6
Coexistence of CM and non-CM domains
  • A concern has been raised that an end-to-end
    scheme requires global deployment
  • We consider the case where a non-CM switch exists
    in the path of the congesting flows
  • CM messages are terminated at the edge of the
    domain
  • Cannot relay notifications across the non-CM domain
  • Cannot control congestion inside the non-CM domain
  • Non-CM (legacy) bridge behavior:
  • Does not generate or interpret any CM
    notifications
  • Can it relay CM notifications as regular frames?
  • May depend on the bridge implementation
  • The following results make this assumption

7
Managing across a non-CM domain
[Topology figure: source nodes attach to switches 1-3 (CM domains); switch 4 sits in a non-CM domain on the path to switch 5 (CM domain), which serves nodes 6 and 7]
  • Switches 1, 2, 3, and 5 are in congestion-managed
    domains; switch 4 is in a non-congestion-managed
    domain
  • Four hot flows of 10 Gb/s each from nodes 1, 2,
    3, 4 to node 6 (hotspot)
  • One cold (lukewarm) flow of 10 Gb/s from node 5
    to node 7
  • Max-min fair allocation provides 2.0 Gb/s to each
    flow (see the check below)
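The 2.0 Gb/s figure can be verified with a generic max-min water-filling routine. The assumption below is that all five flows share a single 10 Gb/s bottleneck link on the way to the right-hand domain; the routine is a standard progressive-filling sketch, not part of the E2CM proposal.

```python
def max_min_fair(capacity, demands):
    """Progressive filling on one link: unsatisfied flows repeatedly receive equal
    shares of the remaining capacity until satisfied or the link is exhausted."""
    alloc = {f: 0.0 for f in demands}
    remaining = dict(demands)
    while remaining and capacity > 1e-9:
        share = capacity / len(remaining)
        for f in list(remaining):
            grant = min(share, remaining[f])
            alloc[f] += grant
            remaining[f] -= grant
            capacity -= grant
            if remaining[f] <= 1e-9:
                del remaining[f]
    return alloc

# Four hot flows (nodes 1-4 -> node 6) and one cold flow (node 5 -> node 7),
# each offering 10 Gb/s, assumed to share one 10 Gb/s bottleneck link:
print(max_min_fair(10.0, {f"flow{i}": 10.0 for i in range(1, 6)}))
# -> 2.0 Gb/s per flow, matching the slide
```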

8
Simulation Setup and Parameters
  • Traffic
  • Mean flow size = 1500 B / 60000 B
  • Geometric flow size distribution
  • Source stops sending at T = 1.0 s
  • Simulation runs to completion (no frames left in
    the system)
  • Scenario
  • See previous slide
  • Switch
  • Radix N = 2, 3, 4
  • M = 150 KB/port
  • Link time of flight = 1 µs
  • Partitioned memory per input, shared among all
    outputs
  • No limit on per-output memory usage
  • PAUSE enabled or disabled
  • Applied on a per-input basis, based on local
    high/low watermarks
  • watermark_high = 141.5 KB
  • watermark_low = 131.5 KB
  • Adapter
  • Per-node virtual output queuing, round-robin
    scheduling
  • No limit on number of rate limiters
  • Ingress buffer size unlimited, round-robin VOQ
    service
  • Egress buffer size = 150 KB
  • PAUSE enabled
  • watermark_high = 141.5 KB
  • watermark_low = 131.5 KB
  • ECM (numeric check after this list)
  • W = 2.0
  • Qeq = 37.5 KB (= M/4)
  • Gd = 0.5 / ((2W+1) · Qeq)
  • Gi0 = (Rlink / Runit) / ((2W+1) · Qeq)
  • Gi = 0.1 · Gi0
  • Psample = 2% (on average 1 sample every 75 KB)
  • Runit = Rmin = 1 Mb/s
  • BCN_MAX enabled, threshold = 150 KB
  • BCN(0,0) disabled
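A quick numeric check of the ECM settings above. The units are an assumption (queue sizes in bytes, rates in Mb/s), and reading the Gi0 expression as a division by (2W+1)·Qeq, mirroring Gd, is also an assumption about the original slide.

```python
# Plugging in the slide's ECM parameters (assumed units: bytes and Mb/s).
W     = 2.0
Qeq   = 37_500          # bytes (= M/4 with M = 150 KB/port)
Rlink = 10_000          # Mb/s (10G link)
Runit = 1               # Mb/s (= Rmin)

Gd  = 0.5 / ((2 * W + 1) * Qeq)               # ~2.67e-6 per byte of negative feedback
Gi0 = (Rlink / Runit) / ((2 * W + 1) * Qeq)   # ~5.33e-2
Gi  = 0.1 * Gi0                               # ~5.33e-3

# Psample = 2% per frame; with 1500 B frames that is one congestion sample
# roughly every 1500 B / 0.02 = 75,000 B = 75 KB, as noted on the slide.
bytes_per_sample = 1500 / 0.02
print(Gd, Gi0, Gi, bytes_per_sample)
```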

9
E2CM: Per-flow throughput
[Plots: per-flow throughput for bursty and Bernoulli traffic, with PAUSE disabled and enabled; max-min fair rates shown for reference]
10
E2CM: Per-node throughput
[Plots: per-node throughput for bursty and Bernoulli traffic, with PAUSE disabled and enabled; max-min fair rates shown for reference]
11
E2CM: Switch queue length
[Plots: switch queue length for bursty and Bernoulli traffic, with PAUSE disabled and enabled; the stable OQ level is indicated]
12
Frame drops, flow completions, FCT
  • Mean FCT (flow completion time) is longer with
    PAUSE
  • All flows accounted for (without PAUSE, not all
    flows completed)
  • Absence of PAUSE heavily skews the results
  • In particular for hot flows → much longer FCT
    with PAUSE
  • Cold-flow FCT is independent of burst size!
  • Load compression: flows wait a long time in the
    adapter before being injected
  • FCT is dominated by adapter latency
  • Cold traffic also traverses the hotspot and
    therefore suffers from compression

13
Fat tree network
  • Fat trees enable scaling to arbitrarily large
    networks with constant (full) bisection bandwidth
  • We use static, destination-based, shortest-path
    routing
  • For more details on construction and routing, see
    au-sim-ZRL-fat-tree-build-and-route-r1.0.pdf

14
Fat tree network
[Figure: folded fat tree (levels 0-2, spine at the top) and its unfolded Benes representation (stages 0-4), with switches labeled (stageID, switchID) and nodes attached at the left and right edges]
  • Switches are labeled (stageID, switchID)
  • stageID ∈ {0, ..., S-1}
  • switchID ∈ {0, ..., (N/2)^(L-1) - 1}
  • Conventions (see the helper below):
  • N = no. of bidirectional ports per switch
  • L = no. of levels (folded)
  • S = 2L - 1 = no. of stages (unfolded)
  • M = N · (N/2)^(L-1) = no. of end nodes
  • Number of switches per stage = (N/2)^(L-1)
  • Total number of switches = (2L-1) · (N/2)^(L-1)
  • Nodes are connected at the left and right edges:
    left nodes are numbered 0 through M/2-1, right
    nodes M/2 through M-1
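The conventions are easy to check numerically. A small helper, assuming only the formulas above, reproduces the dimensions of the 16-node and 32-node networks used on the following slides (N = 4).

```python
def fat_tree_dimensions(N, L):
    """Folded fat tree with radix-N switches and L levels, per the conventions above."""
    switches_per_stage = (N // 2) ** (L - 1)
    return {
        "end_nodes M":         N * (N // 2) ** (L - 1),
        "stages S (unfolded)": 2 * L - 1,
        "switches_per_stage":  switches_per_stage,
        "total_switches":      (2 * L - 1) * switches_per_stage,
    }

print(fat_tree_dimensions(N=4, L=3))   # 16-node, 3-level network: 5 stages, 20 switches
print(fat_tree_dimensions(N=4, L=4))   # 32-node, 4-level network: 7 stages, 56 switches
```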
15
Simulation Setup and Parameters
  • Traffic
  • Mean flow size = 1500 B / 60000 B
  • Geometric flow size distribution
  • Uniform destination distribution (except self)
  • Mean load = 50%
  • Source stops sending at T = 1.0 s
  • Simulation runs to completion
  • Scenario
  • 16-node (3-level) and 32-node (4-level) fat tree
    networks
  • Output-generated hotspot (rate reduction to 10%
    of link rate) on port 1 from 0.1 to 0.5 s
  • Switch
  • Radix N = 4
  • M = 150 KB/port
  • Link time of flight = 1 µs
  • Partitioned memory per input, shared among all
    outputs
  • No limit on per-output memory usage
  • PAUSE enabled or disabled
  • Adapter
  • Per-node virtual output queuing, round-robin
    scheduling
  • No limit on number of rate limiters
  • Ingress buffer size unlimited, round-robin VOQ
    service
  • Egress buffer size = 150 KB
  • PAUSE enabled
  • watermark_high = 141.5 KB
  • watermark_low = 131.5 KB
  • ECM
  • W = 2.0
  • Qeq = 37.5 KB (= M/4)
  • Gd = 0.5 / ((2W+1) · Qeq)
  • Gi0 = (Rlink / Runit) / ((2W+1) · Qeq)
  • Gi = 0.1 · Gi0
  • Psample = 2% (on average 1 sample every 75 KB)
  • Runit = Rmin = 1 Mb/s
  • BCN_MAX enabled, threshold = 150 KB
  • BCN(0,0) enabled/disabled, threshold = 300 KB

16
E2CM fat tree results: 16 nodes, 3 levels
[Plots: aggregate throughput and hot queue length for bursty and Bernoulli traffic]
17
E2CM fat tree results: 32 nodes, 4 levels
[Plots: aggregate throughput and hot queue length for bursty and Bernoulli traffic]
18
Frame drops, completed flows, FCT
[Results shown for the 16-node and 32-node networks]
19
Mixed link speeds
Output-generated hotspot:
[Figure: nodes 1-10 attach to switch 1 via 1G links; switch 1 connects to switch 2 over 10G; node 11 attaches to switch 2 at 10G; the hotspot's service rate is reduced to 10%; offered loads of 50% are indicated]
  • Nodes 1-10 are connected via 1G adapters and
    links
  • Switch 1 has ten 1G ports and one 10G port to
    switch 2, which has two 10G ports
  • Shared-memory switches → create more serious
    congestion
  • Ten hot flows of 0.5 Gb/s each from nodes 1-10
    to node 11 (hotspot)
  • Node 11 sends uniformly at 5 Gb/s (cold)
  • Max-min fair shares = 12.5 MB/s for flows
    1-10 → 11 (see the check below)
Input-generated hotspot:
[Figure: same topology; the hot flow now runs from node 11 toward node 1 over its 1G link; offered loads of 50% are indicated]
  • Same topology as above
  • One hot flow of 5.0 Gb/s from node 11 to node 1
    (hotspot)
  • Nodes 1-10 send uniformly at 0.5 Gb/s (cold)
  • Max-min fair shares = 62.5 MB/s for flow 11 → 1
    and 6.25 MB/s for flows 2-10 → 1

20
E2CM, mixed speed, output-generated HS
[Plots: per-flow and per-node throughput, with PAUSE disabled and enabled]
21
E2CM, mixed speed, input-generated HS
[Plots: per-flow and per-node throughput, with PAUSE disabled and enabled]
22
Probing, mixed speed, output-generated HS
[Plots: per-flow and per-node throughput, with PAUSE disabled and enabled; perfect bandwidth sharing indicated]
23
Probing, mixed speed, input-generated HS
[Plots: per-flow and per-node throughput, with PAUSE disabled and enabled]
24
Conclusions
  • FCT is dominated by adapter latency for
    rate-limited flows
  • E2CM can manage across non-CM domains
  • Even a hotspot within a non-CM domain can be
    controlled
  • Need to ensure that CM notifications can traverse
    non-CM domains
  • They have to look like valid frames to non-CM
    bridges
  • E2CM works very well in multi-level fat tree
    topologies
  • E2CM also copes well with mixed-speed networks
  • Continuous probing improves E2CM's overall
    performance
  • In low-degree hotspot scenarios, probing alone
    appears to be sufficient to control congestion