Title: Scaling Internet Routers Using Optics UW, October 16th, 2003
1 Scaling Internet Routers Using Optics
UW, October 16th, 2003
Nick McKeown. Joint work with the research groups of David Miller, Mark Horowitz, and Olav Solgaard.
Students: Isaac Keslassy, Shang-Tse Chuang, Kyoungsik Yu.
Department of Electrical Engineering, Stanford University
Paper: http://klamath.stanford.edu/nickm/papers/sigcomm2003.pdf
Web site: http://klamath.stanford.edu/or
2 Backbone router capacity
[Plot: capacity axis from 1Gb/s to 1Tb/s (ticks at 1Gb/s, 10Gb/s, 100Gb/s, 1Tb/s). Router capacity per rack: 2x every 18 months.]
3 Backbone router capacity
[Plot: capacity axis from 1Gb/s to 1Tb/s. Traffic: 2x every year. Router capacity per rack: 2x every 18 months.]
4 Extrapolating
[Plot: capacity axis from 1Tb/s to 100Tb/s. Traffic: 2x every year. Router capacity: 2x every 18 months. 2015: 16x disparity.]
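The 16x figure can be checked by compounding the two doubling rates. The sketch below assumes the extrapolation starts in 2003, the year of this talk; the slide itself does not state the baseline, and the function names are only illustrative.

```python
def growth(years: float, doubling_period_years: float) -> float:
    """Growth factor after `years` for a quantity that doubles every period."""
    return 2.0 ** (years / doubling_period_years)


def disparity(start_year: int, end_year: int) -> float:
    """Ratio of traffic growth to router-capacity growth over the interval."""
    years = end_year - start_year
    traffic = growth(years, 1.0)    # traffic: 2x every year
    capacity = growth(years, 1.5)   # router capacity: 2x every 18 months
    return traffic / capacity


# Assumed baseline: 2003. Traffic grows 2^12, capacity grows 2^8.
print(disparity(2003, 2015))  # 16.0
```

Over 12 years traffic doubles 12 times and capacity only 8 times, which leaves a factor of 2^4 = 16.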
5 Consequence
- Unless something changes, operators will need:
- 16 times as many routers, consuming
- 16 times as much space,
- 256 times the power,
- costing 100 times as much.
- Actually, they will need more than that.
6 Stanford 100Tb/s Internet Router
- Goal: Study scalability
- Challenging, but not impossible
- Two orders of magnitude faster than deployed routers
- We will build components to show feasibility
7 Throughput Guarantees
- Operators increasingly demand throughput guarantees
- To maximize use of expensive long-haul links
- For predictability and planning
- Despite lots of effort and theory, no commercial router today has a throughput guarantee.
8 Requirements of our router
- 100Tb/s capacity
- 100% throughput for all traffic
- Must work with any set of linecards present
- Use technology available within 3 years
- Conform to RFC 1812
9 What limits router capacity?
Approximate power consumption per rack
Power density is the limiting factor today
10 Trend: Multi-rack routers
Reduces power density
11 [Figure labels: Juniper TX8/T640, Alcatel 7670 RSP, Avici TSR, Chiaro]
12 Limits to scaling
- Overall power is dominated by linecards
- Sheer number
- Optical WAN components
- Per-packet processing and buffering.
- But power density is dominated by the switch fabric
13 Trend: Multi-rack routers
Reduces power density
14 Multi-rack routers
[Diagram: linecards, each with WAN In and Out ports, connected to a switch fabric.]
15 Question
- Instead, can we use an optical fabric at 100Tb/s with 100% throughput?
- Conventional answer: No.
- Need to reconfigure switch too often
- 100% throughput requires complex electronic scheduler.
16 Outline
- How to guarantee 100% throughput?
- How to eliminate the scheduler?
- How to use an optical switch fabric?
- How to make it scalable and practical?
17 100% Throughput
[Diagram: three inputs (In).]
18 If traffic is uniform
[Diagram: three inputs (In), each at rate R.]
19 Real traffic is not uniform
20 Two-stage load-balancing switch
[Diagram: linecards with inputs (In) and outputs (Out) at rate R. A load-balancing stage is followed by a switching stage; the internal links of both stages run at rate R/N.]
21 [Diagram: the two-stage switch with packets labeled 1, 2, and 3 at the inputs. External links at rate R, internal links at rate R/N.]
22 [Diagram: the same two-stage switch with packets 1, 2, and 3 shown at different positions. External links at rate R, internal links at rate R/N.]
23 Chang's load-balanced switch: Good properties
- 100% throughput for broad class of traffic
- No scheduler needed → Scalable
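Why spreading removes the need for a scheduler can be seen with a small rate calculation. The sketch below is an idealized fluid model, not Chang's analysis: it assumes the first stage spreads every input's traffic equally over all N intermediate linecards regardless of destination, and it expresses rates as fractions of the line rate R. The traffic matrix is a made-up example.

```python
from fractions import Fraction


def second_stage_loads(traffic):
    """traffic[i][j]: rate from input i to output j, as a fraction of R.

    Assumes uniform spreading in the first stage. Returns loads[k][j], the
    rate that intermediate linecard k must deliver to output j.
    """
    n = len(traffic)
    return [[sum(traffic[i][j] for i in range(n)) / n for j in range(n)]
            for _ in range(n)]


# Non-uniform but admissible traffic: no input or output exceeds rate R (= 1).
traffic = [
    [Fraction(0), Fraction(1), Fraction(0)],
    [Fraction(1, 2), Fraction(0), Fraction(1, 2)],
    [Fraction(1, 2), Fraction(0), Fraction(1, 2)],
]
n = len(traffic)
loads = second_stage_loads(traffic)

# Every second-stage link carries at most R/N, so fixed R/N links suffice.
fits = all(load <= Fraction(1, n) for row in loads for load in row)
print(fits)  # True
```

In this model the load on each second-stage link is the column sum divided by N, so any traffic that does not oversubscribe an output fits on links of rate R/N without per-packet scheduling.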
24 Chang's load-balanced switch: Bad properties
- Packet mis-sequencing
- Pathological traffic patterns → Throughput 1/N-th of capacity
- Uses two switch fabrics → Hard to package
- Doesn't work with some linecards missing → Impractical
25 Single Mesh Switch
[Diagram: three linecards (In) joined by a single mesh of links, each at rate 2R/N.]
26 Packaging
[Diagram: three linecards, each with an input (In) at rate R.]
27 Many fabric options
N channels each at rate 2R/N
Any permutation network
Options:
- Space: Full uniform mesh
- Time: Round-robin crossbar
- Wavelength: Static WDM
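For the time-domain option, a round-robin crossbar can stand in for the mesh by cycling through a fixed sequence of permutations. The slides name the option but do not give the rule, so the cyclic shift below (input i connects to output (i + t) mod N in slot t) is an assumed, commonly used choice.

```python
def round_robin_schedule(n):
    """Return n permutations; in slot t, input i connects to output (i + t) % n."""
    return [[(i + t) % n for i in range(n)] for t in range(n)]


def connections_per_pair(schedule):
    """Count how often each (input, output) pair is connected over one period."""
    n = len(schedule)
    counts = [[0] * n for _ in range(n)]
    for perm in schedule:
        for i, out in enumerate(perm):
            counts[i][out] += 1
    return counts


schedule = round_robin_schedule(4)
counts = connections_per_pair(schedule)

# Every pair is connected exactly once per period of N slots.
print(all(c == 1 for row in counts for c in row))  # True
```

If each crossbar port runs at 2R, each pair gets 1/N of the time, which is the same 2R/N per channel as the full uniform mesh.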
28 Static WDM switching
Array Waveguide Router (AWGR): Passive and almost zero power
[Diagram: AWGR with ports A, B, C, D.]
29 Linecard dataflow
[Diagram: linecard with input (In), links at rate R, and a WDM stage carrying wavelengths λ1, λ2, ..., λN; numbered packets.]
30 Problems of scale
- For N < 64, WDM is a good solution.
- We want N = 640.
- Need to decompose.
31 Decomposing the mesh
[Diagram: mesh between 8 inputs and 8 outputs (numbered 1 to 8), with links at rate 2R/8.]
32 Decomposing the mesh
[Diagram: the same mesh between 8 inputs and 8 outputs, with links at rates 2R/8 and 2R/4.]
33 When N is too large: Decompose into groups (or racks)
[Diagram: Group/Rack 1 through Group/Rack G, each with links at rate 2R, connected through an Array Waveguide Router (AWGR) using wavelengths λ1, λ2, ..., λG.]
34 When a linecard is missing
- Each linecard spreads its data equally over every other linecard.
- Problem: If one is missing, or failed, then the spreading no longer works.
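A rough way to see the problem is to count the capacity a linecard can still use on a fixed mesh. The sketch below assumes a mesh built for N linecards with one channel of rate 2R/N per linecard (as in the single mesh switch), and assumes channels toward absent linecards are simply unusable; rates are in units of R and the function name is illustrative.

```python
from fractions import Fraction


def usable_rate(n_slots: int, n_present: int) -> Fraction:
    """Usable rate of one present linecard, in units of R.

    The mesh has n_slots channels of rate 2R/n_slots per linecard; only the
    channels toward the n_present installed linecards can carry data.
    """
    channel = Fraction(2, n_slots)
    return channel * n_present


print(usable_rate(3, 3))  # 2   -> the full 2R
print(usable_rate(3, 2))  # 4/3 -> less than the 2R the linecard needs
```

With three slots and one linecard missing, each remaining linecard can use only 4R/3 of the 2R it was designed to spread.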
35 When a linecard fails
[Diagram: three linecards (In) connected by links at rate 2R/3.]
- Solution:
- Move light beams
- Replace AWGR with MEMS switch.
- Reconfigure when linecard added, removed or fails.
- Finer channel granularity
- Multiple paths.
36 Solution: Use transparent MEMS switches
MEMS switches reconfigured only when linecard added, removed or fails.
[Diagram: Group/Rack 1 through Group/Rack G = 40, each with links at rate 2R, connected through MEMS switches.]
Theorems:
1. Require L+G-1 MEMS switches
2. Polynomial time reconfiguration algorithm
37 Hybrid Architecture: Logical View
38 Hybrid Electro-Optical Architecture
39 Number of MEMS Switches
[Diagram: four linecards (Linecard 1 to Linecard 4) in two groups of two, each group with crossbars, connected through static MEMS switches. All links at rate R.]
40 Number of MEMS Switches
[Diagram: three linecards in two groups (Linecards 1 and 2; Linecard 3), each group with crossbars, connected through static MEMS switches. Linecard links at rate R; links between groups at rates 4R/3, 2R/3, and R/3.]
41 Number of MEMS needed for a schedule
- Li: number of linecards in group i, 1 ≤ i ≤ G.
- Group i needs to send to group j: [equation not shown]
- Assume each group can send at most R to each MEMS.
- Number of MEMS needed between groups i and j: [equation not shown]
42 Number of MEMS needed for a schedule
- The number of MEMS needed for group i to send to group j is Aij.
- The total number of MEMS needed for group i is the sum of the Aij over all groups j.
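The expression for Aij is not preserved in this transcript. One form consistent with the rates on slide 40 (4R/3, 2R/3, R/3 for groups of 2 and 1 linecards) and with the example on slide 44 is Aij = ceil(Li * Lj / N): group i sends Li * Lj * R / N to group j, and each MEMS carries at most R. The sketch below treats that formula as an assumption, and the group sizes (40 groups of 16) are taken from elsewhere in the deck.

```python
def mems_between(l_i: int, l_j: int, n: int) -> int:
    """Assumed form of Aij: ceil(Li * Lj / N), using integer arithmetic."""
    return (l_i * l_j + n - 1) // n


def mems_for_group(i: int, group_sizes) -> int:
    """Total MEMS needed by group i: the sum of Aij over all groups j."""
    n = sum(group_sizes)
    return sum(mems_between(group_sizes[i], l_j, n) for l_j in group_sizes)


group_sizes = [16] * 40            # G = 40 groups of L = 16 linecards
total = mems_for_group(0, group_sizes)

# Under this formula each Aij < Li*Lj/N + 1, so the sum stays within L + G - 1.
bound = max(group_sizes) + len(group_sizes) - 1
print(total, bound)  # 40 55
```

For a fully populated router every Aij is 1, so each group needs 40 MEMS connections; the L + G - 1 figure covers unevenly populated groups.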
43 Constraints for the TDM Schedule
- Latin Square: In any period N, each transmitting linecard is connected to each receiving linecard exactly once.
- MEMS constraint: In any time-slot, there are at most Aij connections between transmitting group i and receiving group j, where [equation not shown]
44 Example
- Assume L1 = 3, L2 = 2, L3 = 1
- Then [values of Aij not shown]
- E.g., at most 2 packets from the first group to the first group at each time-slot
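The two schedule constraints can be checked mechanically for this example. The sketch below assumes Aij = ceil(Li * Lj / N), which reproduces the "at most 2" figure for the first group, and tests a simple cyclic-shift schedule; that schedule is an illustrative choice and not necessarily the one shown on the slides.

```python
def group_of(linecard, group_sizes):
    """Index of the group containing `linecard` (linecards numbered from 0)."""
    for g, size in enumerate(group_sizes):
        if linecard < size:
            return g
        linecard -= size
    raise ValueError("linecard index out of range")


def mems_matrix(group_sizes):
    """Assumed limits Aij = ceil(Li * Lj / N)."""
    n = sum(group_sizes)
    return [[(a * b + n - 1) // n for b in group_sizes] for a in group_sizes]


def is_latin_square(schedule):
    """Each transmitter meets each receiver exactly once per period."""
    n = len(schedule)
    full = list(range(n))
    rows_ok = all(sorted(perm) == full for perm in schedule)
    cols_ok = all(sorted(schedule[t][i] for t in range(n)) == full
                  for i in range(n))
    return rows_ok and cols_ok


def mems_violations(schedule, group_sizes):
    """List (slot, tx_group, rx_group) where connections exceed Aij."""
    limits = mems_matrix(group_sizes)
    g = len(group_sizes)
    bad = []
    for t, perm in enumerate(schedule):
        used = [[0] * g for _ in range(g)]
        for tx, rx in enumerate(perm):
            used[group_of(tx, group_sizes)][group_of(rx, group_sizes)] += 1
        bad += [(t, a, b) for a in range(g) for b in range(g)
                if used[a][b] > limits[a][b]]
    return bad


group_sizes = [3, 2, 1]
n = sum(group_sizes)
cyclic = [[(i + t) % n for i in range(n)] for t in range(n)]

print(mems_matrix(group_sizes))            # [[2, 1, 1], [1, 1, 1], [1, 1, 1]]
print(is_latin_square(cyclic))             # True
print(mems_violations(cyclic, group_sizes)[0])  # (0, 0, 0)
```

The cyclic schedule is a valid Latin square but breaks the MEMS constraint in slot 0, where all three linecards of the first group send within their own group.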
45 Bad TDM Transmit Schedule
46 Good TDM Transmit Schedule
47 Configuration Algorithm
- Assign connections between groups, so MEMS constraint is satisfied.
- Assign group connections to specific linecards, so there is exactly one connection per linecard pair in the schedule.
- Comments:
- Algorithm is surprisingly complex.
- Best running time so far: 40 seconds for 640 linecards.
48 Challenges
[Diagram: linecard dataflow at R = 160Gb/s. Input (In) path with address lookup and WDM of wavelengths λ1, λ2, ..., λG; output (Out) path with WDM of wavelengths λ1, λ2, ..., λG; numbered callouts 1 to 4.]
49 What we are building
[Diagram labels: Chip 1: 160Gb/s Packet Buffer; Buffer Manager (90nm ASIC); 250ms DRAM; 320Gb/s; 160Gb/s; Optical Detector; Optical Modulator.]
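The labels give a sense of the buffer size involved. The sketch below assumes "250ms DRAM" means 250 ms of buffering at the 160Gb/s line rate; the slide does not spell this out, and the function name is illustrative.

```python
def buffer_bits(line_rate_gbps: int, buffering_ms: int) -> int:
    """Bits of storage needed to hold `buffering_ms` of traffic at line rate."""
    return line_rate_gbps * 10**9 * buffering_ms // 1000


bits = buffer_bits(160, 250)
print(bits // 10**9, "Gbit")        # 40 Gbit
print(bits // 8 // 10**9, "GB")     # 5 GB
```

Under that reading, each linecard's packet buffer holds about 40 Gbit, or 5 GB, of DRAM.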
50 100Tb/s Load-Balanced Router
L = 16 160Gb/s linecards
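These numbers tie back to the 100Tb/s target. The sketch below assumes G = 40 groups (the "G40" label on the MEMS slide, read as G = 40) of L = 16 linecards each; the constant names are illustrative.

```python
GROUPS = 40                # G, assumed from the "Group/Rack G40" label
LINECARDS_PER_GROUP = 16   # L
LINE_RATE_GBPS = 160       # R


def router_capacity_tbps(groups: int, per_group: int, rate_gbps: int) -> float:
    """Aggregate capacity in Tb/s: number of linecards times line rate."""
    return groups * per_group * rate_gbps / 1000


linecards = GROUPS * LINECARDS_PER_GROUP
print(linecards)  # 640
print(router_capacity_tbps(GROUPS, LINECARDS_PER_GROUP, LINE_RATE_GBPS))  # 102.4
```

640 linecards at 160Gb/s give 102.4Tb/s, and the number of groups (40) stays below the N < 64 range quoted for WDM.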