High Speed Router Design

Transcript and Presenter's Notes

1
High Speed Router Design
  • Shivkumar Kalyanaraman
  • Rensselaer Polytechnic Institute
  • shivkuma@ecse.rpi.edu
  • http://www.ecse.rpi.edu/Homepages/shivkuma
  • Based in part on slides of Nick McKeown
    (Stanford), S. Keshav (Ensim), Douglas Comer
    (Purdue),
  • Raj Yavatkar (Intel), Cyriel Minkenberg (IBM
    Zurich)

2
Overview
  • Introduction
  • Evolution of High-Speed Routers
  • High Speed Router Components
  • Lookup Algorithm
  • Classification
  • Switching

3
What do switches/routers look like?
Access routers e.g. ISDN, ADSL
Core router e.g. OC48c POS
Core ATM switch
4
Dimensions, Power Consumption
Cisco GSR 12416: 19 in rack, approx. 6 ft x 2 ft, capacity 160 Gb/s, power 4.2 kW
Juniper M160: 19 in rack, approx. 3 ft x 2.5 ft, capacity 80 Gb/s, power 2.6 kW
5
Where high performance packet switches are used
- Carrier Class Core Router - ATM Switch - Frame
Relay Switch
The Internet Core
6
Where are routers? Ans: Points of Presence (POPs)
7
Why the Need for Big/Fast/Large Routers?
POP with smaller routers
POP with large routers
  • Interfaces: Price > $200k, Power > 400W
  • Space, power, interface cost economics!
  • About 50-60% of interfaces are used for interconnection
    within the POP.
  • Industry trend is towards large, single router
    per POP.

8
Job of router architect
  • For a given set of features

9
Performance metrics
  • Capacity
  • maximize C, s.t. volume < 2 m^3 and power < 5 kW
  • Throughput
  • Maximize usage of expensive long-haul links.
  • Trivial with work-conserving output-queued
    routers
  • Controllable Delay
  • Some users would like predictable delay.
  • This is feasible with output-queueing plus
    weighted fair queuing (WFQ).

10
The Problem
  • Output queued switches are impractical

(Figure: N inputs, each at line rate R, feed a single shared DRAM buffer; the memory must sustain an aggregate write rate of NR and an aggregate read rate of NR.)
11
Memory Bandwidth: Commercial DRAM
  • Memory speed is not keeping up with Moore's Law.

DRAM 1.1x / 18months
Moore's Law 2x / 18 months
Router Capacity 2.2x / 18months
Line Capacity 2x / 7 months
12
Packet processing is getting harder
CPU Instructions per minimum length packet since
1996
13
Basic Ideas
14
Forwarding Functions ATM Switch
  • Lookup cell VCI/VPI in VC table.
  • Replace old VCI/VPI with new.
  • Forward cell to outgoing interface.
  • Transmit cell onto link.

15
Functions Ethernet (L2) Switch
  • Lookup frame destination address (DA) in
    forwarding table.
  • If known, forward to correct port.
  • If unknown, broadcast to all ports.
  • Learn source address (SA) of incoming frame.
  • Forward frame to outgoing interface.
  • Transmit frame onto link.

16
Functions IP Router
  • Lookup packet DA in forwarding table.
  • If known, forward to correct port.
  • If unknown, drop packet.
  • Decrement TTL, update header checksum.
  • Forward packet to outgoing interface.
  • Transmit packet onto link.

17
Basic Architectural Components
Congestion Control
Control
Admission Control
Reservation
Routing
Datapath per-packet processing
Output Scheduling
Switching
Policing
18
Basic Architectural Components
3.
1.
Output Scheduling
2.
Forwarding Table
Interconnect
Forwarding Decision
Forwarding Table
Forwarding Decision
Forwarding Table
Forwarding Decision
19
Generic Router Architecture
Header Processing
Lookup IP Address
Update Header
Queue Packet
20
Generic Router Architecture
Buffer Manager
Buffer Memory
Buffer Manager
Buffer Memory
Buffer Manager
Buffer Memory
21
Simplest Design Software Router using PCs!
  • Idea: add special-purpose software to
    general-purpose hardware. Cheap, but slow
  • Measure of speed: aggregate data rate or
    aggregate packet rate
  • Limits number & type of interfaces, topologies,
    etc.
  • E.g. 400 Mbps aggregate rate will allow four 100
    Mbps Ethernet interfaces, but no GbE!
  • E.g. MIT's Click Router

22
Aggregate Packet vs Bit Rates
64 byte pkts
1518 byte pkts
23
Per-Packet Processing Time Budget
MIT's Click Router claims 435 Kpps with 64 byte
packets! See http://www.pdos.lcs.mit.edu/click/
(=> it can do 100 Mbps, but not GbE interfaces!)
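
A quick back-of-the-envelope check of these budgets, sketched in C (framing/IPG overhead is ignored, and the line rates and packet sizes are simply the ones discussed on these slides):

#include <stdio.h>

/* Per-packet time budget: time per packet = packet bits / line rate.
 * Framing overhead is ignored in this rough sketch. */
int main(void)
{
    const double rates_gbps[] = { 0.1, 0.622, 1.0, 2.5, 10.0, 40.0 };
    const char  *names[]      = { "100M Ethernet", "OC12c", "GbE", "OC48c", "OC192c", "OC768c" };
    const int    pkt_bytes[]  = { 40, 64 };

    for (int r = 0; r < 6; r++)
        for (int p = 0; p < 2; p++) {
            double ns_per_pkt = (pkt_bytes[p] * 8.0) / rates_gbps[r];  /* bits / (Gb/s) = ns */
            printf("%-14s %3dB packets: %8.1f ns/packet (%6.2f Mpps)\n",
                   names[r], pkt_bytes[p], ns_per_pkt, 1000.0 / ns_per_pkt);
        }
    return 0;
}

For 64-byte packets at 1 Gb/s this gives about 512 ns per packet, i.e. roughly 2 Mpps, which is why an aggregate rate of a few hundred Kpps rules out GbE interfaces at minimum packet size.
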
24
Soln: Decentralization/Parallelism
  • Fine-grained parallelism: instruction-level
  • Symmetric coarse-grain parallelism: multi-procs
  • Asymmetric coarse-grain parallelism: multi-procs
  • Co-processors (ASICs)
  • Operates under control of CPU
  • Move expensive ops to hardware
  • NICs with on-board processing
  • Attack I/O bottleneck
  • Move processing to the NIC (ASIC or embedded
    RISC)
  • Handles only 1 interface rather than aggregate
    rate!
  • Smart NICs with onboard stacks
  • Cell Switching: design protocols to suit hardware
    speeds!
  • Data pipelines

25
Optimizations (contd)
26
Demultiplexing vs Classification
  • De-multiplexing in a layered model provides
    freedom to use arbitrary protocols without
    transmission overhead, but imposes sequential
    processing limitations
  • Packet classification combines demuxing from a
    sequence of operations at multiple layers into an
    operation at one layer!

Overall goal: flow segregation
27
Classification example
28
Hardware Optimization of Classification
29
Hybrid Hardware/Software Classifier
30
Conceptual Bindings
Connectionless Network
31
Second Gen. Network Systems
32
Switch Fabric Concept
Data path (aka backplane) that provides
parallelism; connects the NICs, which have on-board
processing
33
Desired Switch Fabric Properties
34
Space Division Fabric
Asynchronous design arose from the multi-processor
context: data can be sent across the fabric at
arbitrary times
35
Blocking and Port Contention
  • Even if internally non-blocking (i.e. fully
    inter-connected), port contention can occur! Why?
  • Need blocking circuits at input and output ports

36
Crossbar Switched interconnections
  • Use switches between each input and output
    instead of separate paths; when active, data flows
    from I to O
  • Total number of paths required: N+M
  • Number of switching points: NxM

37
Crossbar Switched interconnections
  • Switch controller (centralized) handles port
    contention
  • Allows transfers in parallel (up to min(N,M)
    paths)
  • Note: port hardware can operate much slower!
  • Issues: switches, switch controller
  • Port contention still exists

38
Queuing input, output buffers
39
Time-division Switching Fabrics
  • Aka bus! (i.e. single shared link)
  • Low cost and low speed (used in computers!)
  • Need arbitration mechanism
  • e.g. fixed time-slots or data-blocks, fixed cells,
    variable packets

40
Time division switching telephony
  • Key idea: when de-multiplexing, position in frame
    determines output trunk
  • Time division switching interchanges sample
    position within a frame: time slot interchange
    (TSI)

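A minimal C sketch of a time-slot interchanger (the frame size, sample values and slot map are made-up examples): the whole incoming frame is written into memory in arrival order and read back out in the interchanged order, so a sample's slot position, not a header, determines where it goes.

#include <stdio.h>

#define N_SLOTS 4

int main(void)
{
    unsigned char in_frame[N_SLOTS] = { 0xA0, 0xB1, 0xC2, 0xD3 };  /* samples from circuits 0..3 */
    int out_slot_src[N_SLOTS]       = { 2, 0, 3, 1 };              /* output slot i carries input slot out_slot_src[i] */
    unsigned char buffer[N_SLOTS], out_frame[N_SLOTS];

    /* Write phase: store the full incoming frame (this is why a TSI adds one frame of delay). */
    for (int i = 0; i < N_SLOTS; i++)
        buffer[i] = in_frame[i];

    /* Read phase: read the samples out in interchanged order. */
    for (int i = 0; i < N_SLOTS; i++)
        out_frame[i] = buffer[out_slot_src[i]];

    for (int i = 0; i < N_SLOTS; i++)
        printf("output slot %d <- input slot %d: 0x%02X\n", i, out_slot_src[i], out_frame[i]);
    return 0;
}
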
41
Time-division Shared memory fabrics
  • Memory interface hardware is expensive => many
    ports share fewer memory interfaces
  • E.g. dual-ported memory
  • Separate low-speed bus lines for controller

42
(No Transcript)
43
Multi-Stage Fabrics
  • Compromise between pure time-division and pure
    space division
  • Attempt to combine advantages of each
  • Lower cost from time-division
  • Higher performance from space-division
  • Technique Limited Sharing
  • E.g. Banyan switch
  • Features
  • Scalable
  • Self-routing, i.e. no central controller
  • Packet queues allowed, but not required
  • Note multi-stage switches share the
    crosspoints which have now become expensive
    resources

44
Banyan Switch Fabric (Contd)
  • Basic building block: 2x2 switch, labelled by
    0/1
  • Can be synchronous or asynchronous
  • Asynch => packets can arrive at arbitrary times
  • Synchronous banyan offers TWICE the effective
    throughput!
  • Worst case: when all inputs receive packets with
    the same label

45
Banyan Fabric
More on switching later
46
Forwarding, a.k.a. Port Mapping
47
Basic Architectural Components: Forwarding
Decision
3.
1.
Output Scheduling
2.
Forwarding Table
Interconnect
Forwarding Decision
Forwarding Table
Forwarding Decision
Forwarding Table
Forwarding Decision
48
ATM and MPLS Switches: Direct Lookup
(Figure: the incoming VCI is used directly as the memory address; the entry read out supplies the data for the cell, e.g. the outgoing (Port, VCI).)
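
A small C sketch of this direct lookup (the table width, field sizes and the example entry are assumptions, not taken from any particular switch): the VCI indexes the table directly, and the entry carries the outgoing port and the replacement VCI.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct vc_entry {
    uint16_t out_port;
    uint16_t out_vci;
    uint8_t  valid;
};

#define VCI_BITS 12                          /* assumed: 4096-entry VC table */
static struct vc_entry vc_table[1 << VCI_BITS];

int main(void)
{
    memset(vc_table, 0, sizeof(vc_table));
    vc_table[42] = (struct vc_entry){ .out_port = 3, .out_vci = 17, .valid = 1 };  /* set up by signaling */

    uint16_t in_vci = 42;                    /* taken from the arriving cell header */
    struct vc_entry e = vc_table[in_vci & ((1 << VCI_BITS) - 1)];
    if (e.valid)
        printf("VCI %u -> port %u, rewrite VCI to %u\n", in_vci, e.out_port, e.out_vci);
    else
        printf("VCI %u: no connection, drop cell\n", in_vci);
    return 0;
}
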
49
Bridges and Ethernet Switches: Associative Lookups
(Figure: the 48-bit destination address is presented as search data to an associative memory (CAM); a match returns the associated data, e.g. the output port.)
50
Bridges and Ethernet Switches: Hashing
(Figure: the 48-bit search key is hashed down to a 16-bit memory address; the table entry at that address holds the data.)
51
Lookups Using Hashing: An example
(Figure: a CRC-16 hash of the 48-bit search key selects one of 2^16 buckets; colliding entries are chained in linked lists of varying length.)
52
Lookups Using Hashing: Performance of the simple
example
53
Lookups Using Hashing
  • Advantages
  • Simple
  • Expected lookup time can be small
  • Disadvantages
  • Non-deterministic lookup time
  • Inefficient use of memory

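The scheme above, sketched in C (the CRC-16 variant, chain layout and example address are assumptions): a 16-bit hash of the 48-bit MAC address selects a bucket and collisions are resolved by chaining, which is why the expected lookup time is small but the worst case is not deterministic.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HASH_BITS 16

struct bucket_entry {
    uint8_t mac[6];
    int     out_port;
    struct bucket_entry *next;
};

static struct bucket_entry *table[1u << HASH_BITS];

/* CRC-16 (the reflected 0xA001 variant); any good 16-bit hash would do. */
static uint16_t crc16(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : crc >> 1;
    }
    return crc;
}

static void learn(const uint8_t mac[6], int port)
{
    struct bucket_entry *e = malloc(sizeof(*e));
    if (!e) return;
    uint16_t h = crc16(mac, 6);
    memcpy(e->mac, mac, 6);
    e->out_port = port;
    e->next = table[h];                      /* insert at the head of the chain */
    table[h] = e;
}

static int lookup(const uint8_t mac[6])
{
    for (struct bucket_entry *e = table[crc16(mac, 6)]; e; e = e->next)
        if (memcmp(e->mac, mac, 6) == 0)
            return e->out_port;              /* expected chain is short... */
    return -1;                               /* ...but the worst case is not bounded */
}

int main(void)
{
    uint8_t a[6] = { 0x00, 0x1b, 0x21, 0x3a, 0x4f, 0x5e };
    learn(a, 7);
    printf("port = %d\n", lookup(a));
    return 0;
}
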
54
Per-packet processing in an IP Router
  • 1. Accept packet arriving on an incoming link.
  • 2. Lookup packet destination address in the
    forwarding table, to identify outgoing port(s).
  • 3. Manipulate packet header e.g., decrement TTL,
    update header checksum.
  • 4. Send (switch) packet to the outgoing port(s).
  • 5. Classify and buffer packet in the queue.
  • 6. Transmit packet onto outgoing link.

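Step 3 is the one piece of header manipulation here that repays a closer look: the checksum can be patched incrementally instead of being recomputed over the whole header. A C sketch using the RFC 1624 update formula (the example header bytes are made up; the full recompute is included only to check the incremental result):

#include <stdint.h>
#include <stdio.h>

static uint16_t fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Full one's-complement checksum over the header (used only to verify). */
static uint16_t cksum_full(const uint8_t *hdr, int len)
{
    uint32_t sum = 0;
    for (int i = 0; i < len; i += 2)
        sum += (uint32_t)(hdr[i] << 8 | hdr[i + 1]);
    return (uint16_t)~fold(sum);
}

/* Incremental update when a 16-bit header word changes (RFC 1624, eqn. 3):
 * HC' = ~(~HC + ~m + m') */
static uint16_t cksum_update(uint16_t hc, uint16_t old_w, uint16_t new_w)
{
    uint32_t sum = (uint16_t)~hc;
    sum += (uint16_t)~old_w;
    sum += new_w;
    return (uint16_t)~fold(sum);
}

int main(void)
{
    /* Minimal 20-byte IPv4 header (example values); checksum field is bytes 10-11. */
    uint8_t hdr[20] = {
        0x45, 0x00, 0x00, 0x54, 0x1c, 0x46, 0x40, 0x00,
        0x40, 0x06, 0x00, 0x00, 0xc0, 0xa8, 0x00, 0x01,
        0x8f, 0x09, 0x10, 0x0e
    };
    uint16_t c = cksum_full(hdr, 20);
    hdr[10] = c >> 8; hdr[11] = c & 0xff;

    /* Forwarding fast path: decrement TTL (byte 8), patch checksum incrementally. */
    uint16_t old_w = hdr[8] << 8 | hdr[9];
    hdr[8]--;                                /* TTL-- */
    uint16_t new_w = hdr[8] << 8 | hdr[9];
    uint16_t inc = cksum_update(c, old_w, new_w);

    hdr[10] = hdr[11] = 0;                   /* recompute from scratch to compare */
    printf("incremental = 0x%04x, recomputed = 0x%04x\n", inc, cksum_full(hdr, 20));
    return 0;
}
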
55
Caching Addresses
Slow Path
Buffer Memory
CPU
Fast Path
56
Caching Addresses
57
IP Router Lookup
  • IPv4 unicast destination address based lookup

58
Lookup and Forwarding Engine
(Figure: the destination address is extracted from the packet header and looked up in a routing lookup data structure, which returns the outgoing port; the lookup is driven by the forwarding table below.)

Dest-network     Port
65.0.0.0/8       3
128.9.0.0/16     1
149.12.0.0/19    7
59
Example Forwarding Table
Destination IP Prefix    Outgoing Port
65.0.0.0/8               3
128.9.0.0/16             1
142.12.0.0/19            7

Prefix length: an IP prefix covers 0-32 bits.
(Figure: the prefixes drawn as ranges on the address line 0 ... 2^32-1; 65.0.0.0/8 spans the 2^24 addresses 65.0.0.0 - 65.255.255.255.)
60
Prefixes can Overlap
Longest matching prefix
(Figure: the prefixes 128.9.176.0/24, 128.9.16.0/21, 128.9.172.0/21, 142.12.0.0/19, 65.0.0.0/8 and 128.9.0.0/16 drawn as overlapping ranges on the address line 0 ... 2^32-1.)
Routing lookup: find the longest matching prefix
(aka the most specific route) among all prefixes
that match the destination address.
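
A literal C sketch of that definition (a linear scan kept deliberately naive; real routers use tries, compressed tries or hardware, and the ports below are illustrative):

#include <stdint.h>
#include <stdio.h>

struct prefix {
    uint32_t net;    /* network address (host byte order) */
    int      len;    /* prefix length, 0..32 */
    int      port;
};

static int lpm(const struct prefix *tbl, int n, uint32_t dst)
{
    int best_len = -1, best_port = -1;
    for (int i = 0; i < n; i++) {
        uint32_t mask = tbl[i].len ? 0xffffffffu << (32 - tbl[i].len) : 0;
        if ((dst & mask) == tbl[i].net && tbl[i].len > best_len) {
            best_len = tbl[i].len;           /* keep the most specific match so far */
            best_port = tbl[i].port;
        }
    }
    return best_port;                        /* -1 means no matching route */
}

#define IP(a,b,c,d) (((uint32_t)(a)<<24)|((b)<<16)|((c)<<8)|(d))

int main(void)
{
    struct prefix tbl[] = {
        { IP(65,0,0,0),   8,  3 },
        { IP(128,9,0,0),  16, 1 },
        { IP(128,9,16,0), 21, 2 },
        { IP(142,12,0,0), 19, 7 },
    };
    printf("128.9.16.14 -> port %d\n", lpm(tbl, 4, IP(128,9,16,14)));  /* matches the /21, not the /16 */
    printf("65.12.34.56 -> port %d\n", lpm(tbl, 4, IP(65,12,34,56)));  /* matches the /8 */
    return 0;
}
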
61
Difficulty of Longest Prefix Match
  • 2-dimensional search
  • Prefix Length
  • Prefix Value

(Figure: the prefixes plotted by prefix value (x-axis) and prefix length (y-axis, 8 to 32); the lookup must search in both dimensions.)
62
IP Routers: Metrics for Lookups
  • Lookup time
  • Storage space
  • Update time
  • Preprocessing time

128.9.16.14
63
Lookup Rates Required
Line      Year      Line-rate (Gbps)   40B packets (Mpps)
OC12c     1998-99   0.622              1.94
OC48c     1999-00   2.5                7.81
OC192c    2000-01   10.0               31.25
OC768c    2002-03   40.0               125
64
Update Rates Required
  • Recent BGP studies show that updates can be
  • Bursty: several 100s of routes updated/withdrawn
    => insert/delete operations
  • Frequent: average 100 updates per second
  • Need data structure to be efficient in terms of
    lookup as well as update (insert/delete)
    operations.

65
Size of the Forwarding Table
Renewed Exponential Growth
(Chart: number of prefixes vs. year, 1995-2000, growing by roughly 10,000 prefixes/year.)
Renewed growth due to multi-homing of enterprise
networks!
  • Source: http://www.telstra.net/ops/bgptable.html

66
Potential Hyper-Exponential Growth!
Global routing table vs Moore's law since 1999
(Chart: global BGP prefix count from 01/99 to 04/01 on a 50,000-160,000 scale, plotted against Moore's-law and doubling-rate growth curves.)
67
Trees and Tries
(Figure: a binary search tree branches on < / > comparisons at each node; a binary search trie branches on successive address bits 0/1, with example leaves 010 and 111.)
68
Trees and Tries: Multiway tries
(Figure: a 16-ary search trie consumes four address bits per level; each node holds 16 entries of the form (bits, ptr), illustrated for the keys 000011110000 and 111111111111.)
69
Lookup: Multiway Tries Tradeoffs
Table produced from 2^15 randomly generated 48-bit
addresses
70
Routing Lookups in Hardware
(Histogram: number of prefixes vs. prefix length.)
Most prefixes are 24 bits or shorter
71
Routing Lookups in Hardware
2^24 = 16M entries
Prefixes up to 24 bits
(Figure: the top 24 bits of the address, e.g. 142.19.6 of 142.19.6.14, directly index the 16M-entry table.)
72
Routing Lookups in Hardware
Prefixes up to 24 bits
(Figure: the entry indexed by the top 24 bits, e.g. 128.3.72 of 128.3.72.44, returns the next hop directly.)
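
A C sketch of this table (sizes and next-hop values are illustrative; the real scheme also keeps a small second table for the rare prefixes longer than 24 bits): every prefix of length <= 24 is expanded into a 2^24-entry array, so a lookup is a single memory read indexed by the top 24 address bits.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint8_t *tbl;                         /* tbl[top 24 bits] = next hop, 0 = no route */

static void add_prefix(uint32_t net, int len, uint8_t nexthop)
{
    /* A /len prefix (len <= 24, net aligned to the prefix) covers
     * 2^(24-len) consecutive 24-bit indexes. */
    uint32_t base  = net >> 8;
    uint32_t count = 1u << (24 - len);
    for (uint32_t i = 0; i < count; i++)
        tbl[base + i] = nexthop;             /* install longer prefixes last so they overwrite */
}

#define IP(a,b,c,d) (((uint32_t)(a)<<24)|((b)<<16)|((c)<<8)|(d))

int main(void)
{
    tbl = calloc(1u << 24, 1);               /* 16M one-byte entries */
    if (!tbl) return 1;

    add_prefix(IP(128,3,0,0), 16, 1);        /* shorter prefix first...            */
    add_prefix(IP(128,3,72,0), 24, 7);       /* ...then the /24 overwrites its range */

    printf("128.3.72.44 -> next hop %u\n", tbl[IP(128,3,72,44) >> 8]);
    printf("128.3.9.1   -> next hop %u\n", tbl[IP(128,3,9,1) >> 8]);
    free(tbl);
    return 0;
}
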
73
Switching, a.k.a. Interconnect
74
Basic Architectural Components Interconnect
3.
1.
Output Scheduling
2.
Forwarding Table
Interconnect
Forwarding Decision
Forwarding Table
Forwarding Decision
Forwarding Table
Forwarding Decision
75
First-Generation IP Routers
Shared Backplane
Buffer Memory
CPU
  • Most Ethernet switches and cheap packet routers
  • Bottleneck can be CPU, host-adaptor or I/O bus
  • What is costly? Bus ? Memory? Interface? CPU?

76
Second-Generation IP Routers
  • Port mapping intelligence in line cards
  • Higher hit rate in local lookup cache
  • What is costly? Bus ? Memory? Interface? CPU?

77
Third-Generation Switches/Routers
Switched Backplane
Line Card
CPU Card
Line Card
Local Buffer Memory
Local Buffer Memory
MAC
MAC
  • Third generation switch provides parallel paths
    (fabric)
  • What's costly? Bus? Memory? CPU?

78
Fourth-Generation Switches/Routers: Clustering and
Multistage
(Figure: ports 1-32 distributed across multiple interconnected stages of smaller switching elements, forming a clustered, multistage system.)
79
Switching goals (telephony data)
80
Circuit switch
  • A switch that can handle N calls has N logical
    inputs and N logical outputs
  • N up to 200,000
  • Moves 8-bit samples from an input to an output
    port
  • Recall that samples have no headers
  • Destination of sample depends on time at which it
    arrives at the switch
  • In practice, input trunks are multiplexed
  • Multiplexed trunks carry frames = sets of samples
  • Goal: extract samples from frame, and depending
    on position in frame, switch to output
  • each incoming sample has to get to the right
    output line and the right slot in the output frame

81
Call blocking
  • Can't find a path from input to output
  • Internal blocking
  • slot in output frame exists, but no path
  • Output blocking
  • no slot in output frame is available
  • Output blocking is reduced in transit switches
  • need to put a sample in one of several slots
    going to the desired next hop

82
Multiplexors and demultiplexors
  • Most trunks time division multiplex voice samples
  • At a central office, trunk is demultiplexed and
    distributed to active circuits
  • Addressing not required
  • Synchronous multiplexor: N input lines
  • Output runs N times as fast as input

(Figure: N input lines into a MUX, a single shared trunk, and a DEMUX back out to N output lines.)
83
Switching: what does a switch do?
  • Transfers data from an input to an output
  • many ports (density), high speeds
  • E.g. Crossbar

84
Circuit Switch
85
Issue Call Blocking
86
Time division switching
  • Key idea: when de-multiplexing, position in frame
    determines output trunk
  • Time division switching interchanges sample
    position within a frame: time slot interchange
    (TSI)

87
Scaling Issues with TSI
88
Space division switching
  • Each sample takes a different path through the
    switch, depending on its destination

89
Crossbar
  • Simplest possible space-division switch
  • Crosspoints can be turned on or off, long enough
    to transfer a packet from an input to an output
  • Internally nonblocking
  • but needs N^2 crosspoints
  • time to set each crosspoint grows quadratically

90
Multistage crossbar
  • In a crossbar during each switching time only one
    cross-point per row or column is active
  • Can save crosspoints if a cross-point can attach
    to more than one input line (why?)
  • This is done in a multistage crossbar
  • Need to rearrange connections every switching time

91
Multistage crossbar
  • Can suffer internal blocking
  • unless sufficient number of second-level stages
  • Number of crosspoints < N^2
  • Finding a path from input to output requires a
    depth-first-search
  • Scales better than crossbar, but still not too
    well
  • 120,000 call switch needs 250 million crosspoints

92
Time-Space Switching
93
Time-Space-Time (TST) switching
Telephone switches like the 5ESS use multiple
space stages, e.g. TSSST, etc.
94
Packet switches
  • In a circuit switch, path of a sample is
    determined at time of connection establishment
  • No need for a sample header--position in frame
    used
  • In a packet switch, packets carry a destination
    field or label
  • Need to look up destination port on-the-fly
  • Datagram switches
  • lookup based on entire destination address
    (longest-prefix match)
  • Cell or Label-switches
  • lookup based on VCI or Labels

95
Blocking in packet switches
  • Can have both internal and output blocking
  • Internal
  • no path to output
  • Output
  • trunk unavailable
  • Unlike a circuit switch, cannot predict if
    packets will block (why?)
  • If a packet is blocked => must either buffer or
    drop it

96
Dealing with blocking in packet switches
  • Over-provisioning
  • internal links much faster than inputs
  • Buffers
  • at input or output
  • Backpressure
  • if the switch fabric doesn't have buffers, prevent
    packet from entering until path is available
  • Parallel switch fabrics
  • increases effective switching capacity

97
Switch Fabrics Buffered crossbar
  • What happens if packets at two inputs both want
    to go to same output?
  • Can defer one at an input buffer
  • Or, buffer the cross-points: complex arbiter

98
Switch fabric element
  • Goal: towards building self-routing fabrics
  • Can build complicated fabrics from a simple
    element
  • Routing rule: if the bit is 0, send packet to upper output,
    else to lower output
  • If both packets go to the same output, buffer or drop

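A C sketch of self-routing through an 8-port shuffle-exchange (omega/banyan-style) fabric (the wiring model and the example packets are assumptions): at stage k each 2x2 element looks only at bit k of the destination tag, and if both of its inputs want the same element output one packet is simply dropped here, where a real design would buffer it or rely on a sorter in front.

#include <stdio.h>

#define LOGN  3
#define N     (1 << LOGN)
#define EMPTY -1

static int shuffle(int pos)                  /* perfect shuffle = rotate the LOGN-bit position left by 1 */
{
    return ((pos << 1) | (pos >> (LOGN - 1))) & (N - 1);
}

int main(void)
{
    int lines[N], next[N];
    for (int i = 0; i < N; i++) lines[i] = EMPTY;
    lines[0] = 5;                            /* packet at input 0 destined to output 5 (101) */
    lines[6] = 3;                            /* packet at input 6 destined to output 3 (011) */

    for (int stage = 0; stage < LOGN; stage++) {
        int shuffled[N];
        for (int i = 0; i < N; i++) shuffled[i] = EMPTY;
        for (int i = 0; i < N; i++)
            if (lines[i] != EMPTY) shuffled[shuffle(i)] = lines[i];

        for (int i = 0; i < N; i++) next[i] = EMPTY;
        for (int e = 0; e < N / 2; e++) {            /* one 2x2 element per pair of lines */
            for (int j = 0; j < 2; j++) {
                int dst = shuffled[2 * e + j];
                if (dst == EMPTY) continue;
                int bit = (dst >> (LOGN - 1 - stage)) & 1;   /* destination bit for this stage */
                int out = 2 * e + bit;
                if (next[out] == EMPTY) next[out] = dst;
                else printf("stage %d: output contention, dropping packet for %d\n", stage, dst);
            }
        }
        for (int i = 0; i < N; i++) lines[i] = next[i];
    }

    for (int i = 0; i < N; i++)
        if (lines[i] != EMPTY)
            printf("packet for output %d arrived at output %d\n", lines[i], i);
    return 0;
}
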
99
Banyan
  • Simplest self-routing recursive fabric
  • What if two packets both want to go to the same
    output?
  • output blocking

100
Features of multi-stage switches
  • Issue: output blocking, two packets want to go to
    the same output port

101
Blocking in Banyan Fabric
102
Blocking in Banyan Switches: Sorting
  • Can avoid blocking by choosing the order in which
    packets appear at input ports
  • If we can
  • present packets at inputs sorted by output
  • remove duplicates
  • remove gaps
  • precede the banyan with a perfect shuffle stage
  • then no internal blocking
  • For example: X, 010, 010, X, 011, X, X, X
  • Sort => 010, 010, 011, X, X, X, X, X
  • Remove dups => 010, 011, X, X, X, X, X, X
  • Shuffle => 010, X, 011, X, X, X, X, X
  • Need sort, shuffle, and trap networks

103
Sorting using Merging
  • Build sorters from merge networks
  • Assume we can merge two sorted lists
  • Sort pairwise, merge, recurse

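A C sketch of Batcher's odd-even merge sort, the compare-exchange structure a hardware Batcher sorter implements; here it simply runs sequentially on an array of requested output ports (the values are an arbitrary example, and the size must be a power of two).

#include <stdio.h>

static void compare_exchange(int a[], int i, int j)
{
    if (a[i] > a[j]) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}

/* Merge the two sorted halves of a[lo..lo+n-1], comparing elements r apart. */
static void odd_even_merge(int a[], int lo, int n, int r)
{
    int m = r * 2;
    if (m < n) {
        odd_even_merge(a, lo, n, m);         /* even subsequence */
        odd_even_merge(a, lo + r, n, m);     /* odd subsequence  */
        for (int i = lo + r; i + r < lo + n; i += m)
            compare_exchange(a, i, i + r);
    } else {
        compare_exchange(a, lo, lo + r);
    }
}

static void odd_even_merge_sort(int a[], int lo, int n)
{
    if (n > 1) {
        int m = n / 2;
        odd_even_merge_sort(a, lo, m);
        odd_even_merge_sort(a, lo + m, m);
        odd_even_merge(a, lo, n, 1);
    }
}

int main(void)
{
    int dests[8] = { 3, 7, 5, 2, 6, 0, 1, 4 };       /* requested output ports */
    odd_even_merge_sort(dests, 0, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", dests[i]);                      /* sorted: safe to present to the banyan */
    printf("\n");
    return 0;
}
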
104
Putting together Batcher-Banyan
105
Non-Blocking Batcher-Banyan
(Figure: an 8-port Batcher sorter feeding a self-routing banyan network; the example set of requested output ports is sorted into ascending order by the Batcher stage before entering the banyan, which then delivers them to outputs 000-111 without internal blocking.)
  • Fabric can be used as scheduler.
  • Batcher-Banyan network is blocking for multicast.

106
Queuing, Buffer Management, Classification
107
Basic Architectural Components Queuing,
Classification
3.
1.
Output Scheduling
2.
Forwarding Table
Interconnect
Forwarding Decision
Forwarding Table
Forwarding Decision
Forwarding Table
Forwarding Decision
108
Queuing: Two basic techniques
Input Queueing
Output Queueing
Usually a non-blocking switch fabric (e.g.
crossbar)
Usually a fast bus
109
Queuing: Output Queueing
(Figure: output queueing implemented either as individual per-output queues or as a centralized shared memory serving ports 1 to N.)
110
Input Queuing
111
Input Queueing: Head of Line Blocking
(Chart: average delay vs. offered load, 0-100%; with a single FIFO per input, throughput saturates well below 100% (about 58% for uniform traffic) because of HOL blocking.)
112
Solution: Input Queueing w/ Virtual output queues
(VOQ)
113
Head-of-Line (HOL) in Input Queuing
114
Input Queues: Virtual Output Queues
(Chart: average delay vs. offered load, 0-100%; with VOQs and a good scheduler, the switch can remain stable as load approaches 100%.)
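
A C sketch of VOQs plus one round of a simple round-robin request/grant/accept match (iSLIP-like; the port count, backlog and pointer rules here are illustrative): each input keeps a separate queue per output, so a packet headed for a busy output no longer blocks packets behind it that want idle outputs.

#include <stdio.h>

#define N 4

static int voq[N][N];        /* packets queued at input i for output j */
static int grant_ptr[N];     /* per-output round-robin pointer */
static int accept_ptr[N];    /* per-input round-robin pointer */

static void match_one_cell_time(void)
{
    int granted_to[N], accepted[N];
    for (int j = 0; j < N; j++) granted_to[j] = -1;
    for (int i = 0; i < N; i++) accepted[i] = -1;

    /* Grant: each output picks the next requesting input at or after its pointer. */
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++) {
            int i = (grant_ptr[j] + k) % N;
            if (voq[i][j] > 0) { granted_to[j] = i; break; }
        }

    /* Accept: each input picks the next granting output at or after its pointer. */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            int j = (accept_ptr[i] + k) % N;
            if (granted_to[j] == i) { accepted[i] = j; break; }
        }

    /* Transfer one packet per matched pair and advance the pointers. */
    for (int i = 0; i < N; i++)
        if (accepted[i] >= 0) {
            int j = accepted[i];
            voq[i][j]--;
            grant_ptr[j] = (i + 1) % N;
            accept_ptr[i] = (j + 1) % N;
            printf("  input %d -> output %d\n", i, j);
        }
}

int main(void)
{
    voq[0][2] = 3; voq[1][2] = 1; voq[1][3] = 2; voq[2][0] = 1;  /* example backlog */
    for (int t = 0; t < 3; t++) { printf("cell time %d:\n", t); match_one_cell_time(); }
    return 0;
}
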
115
Output Queuing
116
Packet Classification
(Figure: the header of the incoming packet is matched against a set of rules to select an action.)
117
Multi-field Packet Classification
Given a classifier with N rules, find the action
associated with the highest priority rule
matching an incoming packet.
118
Prefix matching: 1-d range problem
(Figure: on the address line 0 ... 2^32-1, the prefix 128.9/16 is a contiguous range containing the point 128.9.16.14.)
119
Classification: 2D Geometry problem
(Figure: rules R1-R7 drawn as rectangles in the (Field 1, Field 2) plane; a packet, e.g. (144.24/16, 64/24) or (128.16.46.23, *), is a point or region to be located among them.)
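
A C sketch of this geometric view (the rules, actions and example packet are invented, and a linear scan stands in for the real classification algorithms): each rule is a rectangle given by a prefix on each field, and the packet, a point, receives the action of the highest-priority rule whose rectangle contains it.

#include <stdint.h>
#include <stdio.h>

struct rule {
    uint32_t f1_net; int f1_len;   /* prefix on field 1 (e.g. source address)      */
    uint32_t f2_net; int f2_len;   /* prefix on field 2 (e.g. destination address) */
    const char *action;
};

static int prefix_match(uint32_t value, uint32_t net, int len)
{
    uint32_t mask = len ? 0xffffffffu << (32 - len) : 0;
    return (value & mask) == net;
}

#define IP(a,b,c,d) (((uint32_t)(a)<<24)|((b)<<16)|((c)<<8)|(d))

int main(void)
{
    /* Rules in decreasing priority; a wildcard field uses len = 0. */
    struct rule rules[] = {
        { IP(144,24,0,0), 16, IP(64,0,0,0), 24, "rate-limit" },   /* hypothetical rule  */
        { IP(128,16,0,0), 16, 0,            0,  "permit"     },   /* hypothetical rule  */
        { 0,              0,  0,            0,  "deny"       },   /* default (matches all) */
    };
    uint32_t f1 = IP(128,16,46,23), f2 = IP(10,0,0,5);            /* the packet, a point */

    for (unsigned i = 0; i < sizeof(rules)/sizeof(rules[0]); i++)
        if (prefix_match(f1, rules[i].f1_net, rules[i].f1_len) &&
            prefix_match(f2, rules[i].f2_net, rules[i].f2_len)) {
            printf("matched rule %u: %s\n", i, rules[i].action);
            break;                                                /* first = highest priority */
        }
    return 0;
}
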
120
Network Processors: Building Block for
programmable networks
  • Slides from Raj Yavatkar, raj.yavatkar@intel.com

121
Intel IXP Network Processors
  • Microengines
  • RISC processors optimized for packet processing
  • Hardware support for multi-threading
  • Fast path
  • Embedded StrongARM/Xscale
  • Runs embedded OS and handles exception tasks
  • Slow path, Control plane

122
Various forms of Processors
Embedded Processor (run-to-completion)
Parallel architecture
Pipelined Architecture
123
Software Architectures
124
Division of Functions
125
Packet Flow Through the Hierarchy
126
Scaling Network Processors
127
Memory Scaling
128
Memory Scaling (contd)
129
Memory Types
130
Memory Caching and CAM
CACHE
Content Addressable Memory (CAM)
131
CAM and Ternary CAM
CAM Operation
Ternary CAM (T-CAM)
132
Ternary CAMs
Associative Memory

Value        Mask               Next Hop
10.0.0.0     255.0.0.0          R1
10.1.0.0     255.255.0.0        R2
10.1.1.0     255.255.255.0      R3
10.1.3.0     255.255.255.0      R4
10.1.3.1     255.255.255.255    R4

A priority encoder returns the first (highest-priority) matching entry.
Using T-CAMs for Classification
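
A software model of the T-CAM table above, sketched in C (the lookup keys in main are examples): every entry stores a value and a mask, an entry matches when (key & mask) == (value & mask), and listing the entries longest-mask first plays the role of the priority encoder.

#include <stdint.h>
#include <stdio.h>

struct tcam_entry {
    uint32_t value;
    uint32_t mask;
    const char *next_hop;
};

#define IP(a,b,c,d) (((uint32_t)(a)<<24)|((b)<<16)|((c)<<8)|(d))

static const struct tcam_entry tcam[] = {        /* same table as the slide, longest mask first */
    { IP(10,1,3,1), 0xffffffffu, "R4" },
    { IP(10,1,3,0), 0xffffff00u, "R4" },
    { IP(10,1,1,0), 0xffffff00u, "R3" },
    { IP(10,1,0,0), 0xffff0000u, "R2" },
    { IP(10,0,0,0), 0xff000000u, "R1" },
};

static const char *tcam_lookup(uint32_t key)
{
    /* A real T-CAM compares all entries in parallel; this scan plus the
     * ordering above stands in for the priority encoder. */
    for (unsigned i = 0; i < sizeof(tcam)/sizeof(tcam[0]); i++)
        if ((key & tcam[i].mask) == (tcam[i].value & tcam[i].mask))
            return tcam[i].next_hop;
    return "no match";
}

int main(void)
{
    printf("10.1.3.1  -> %s\n", tcam_lookup(IP(10,1,3,1)));   /* R4 (exact /32 entry) */
    printf("10.1.3.77 -> %s\n", tcam_lookup(IP(10,1,3,77)));  /* R4 (10.1.3.0/24)     */
    printf("10.2.9.9  -> %s\n", tcam_lookup(IP(10,2,9,9)));   /* R1 (10.0.0.0/8)      */
    return 0;
}
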
133
IXP A Building Block for Network Systems
  • Example IXP2800
  • 16 micro-engines + XScale core
  • Up to 1.4 Ghz ME speed
  • 8 HW threads/ME
  • 4K control store per ME
  • Multi-level memory hierarchy
  • Multiple inter-processor communication channels
  • NPU vs. GPU tradeoffs
  • Reduce core complexity
  • No hardware caching
  • Simpler instructions => shallow pipelines
  • Multiple cores with HW multi-threading per chip

134
IXP 2400 Block Diagram
135
IXP2800 Features
  • Half Duplex OC-192 / 10 Gb/sec Ethernet Network
    Processor
  • XScale Core
  • 700 MHz (half the ME)
  • 32 Kbytes instruction cache / 32 Kbytes data
    cache
  • Media / Switch Fabric Interface
  • 2 x 16 bit LVDS Transmit Receive
  • Configured as CSIX-L2 or SPI-4
  • PCI Interface
  • 64 bit / 66 MHz Interface for Control
  • 3 DMA Channels
  • QDR Interface (w/Parity)
  • (4) 36 bit SRAM Channels (QDR or Co-Processor)
  • Network Processor Forum LookAside-1 Standard
    Interface
  • Using a clamshell topology both Memory and
    Co-processor can be instantiated on same channel
  • RDR Interface
  • (3) Independent Direct Rambus DRAM Interfaces
  • Supports 4i Banks or 16 interleaved Banks
  • Supports 16/32 Byte bursts

136
Hardware Features to ease packet processing
  • Ring Buffers
  • For inter-block communication/synchronization
  • Producer-consumer paradigm
  • Next Neighbor Registers and Signaling
  • Allows for single cycle transfer of context to
    the next logical micro-engine to dramatically
    improve performance
  • Simple, easy transfer of state
  • Distributed data caching within each micro-engine
  • Allows for all threads to keep processing even
    when multiple threads are accessing the same
    data

137
XScale Core processor
  • Compliant with the ARM V5TE architecture
  • support for ARM's Thumb instructions
  • support for Digital Signal Processing (DSP)
    enhancements to the instruction set
  • Intel's improvements to the internal pipeline to
    improve the memory-latency hiding abilities of
    the core
  • does not implement the floating-point
    instructions of the ARM V5 instruction set

138
Microengines RISC processors
  • IXP 2800 has 16 microengines, organized into 4
    clusters (4 MEs per cluster)
  • ME instruction set specifically tuned for
    processing network data
  • 40-bit x 4K control store
  • Six-stage pipeline; an instruction on average
    takes one cycle to execute
  • Each ME has eight hardware-assisted threads of
    execution
  • can be configured to use either all eight threads
    or only four threads
  • The non-preemptive hardware thread arbiter swaps
    between threads in round-robin order

139
MicroEngine v2
(Block diagram: a control store of 4K instructions, a 640-word local memory, two banks of 128 GPRs, 128 next-neighbor registers, 128-entry S and D transfer registers in and out, LM address registers (2 per context), A/B operand paths into a 32-bit execution datapath with add/shift/logical, multiply, CRC, find-first-bit and pseudo-random units, a 16-entry CAM with tags and LRU/status logic, lock registers, local CSRs, timers and a timestamp, all connected to the S/D push and pull buses and the next-neighbor path.)
140
Why Multi-threading?
141
Packet processing using multi-threading within a
MicroEngine
142
Registers available to each ME
  • Four different types of registers
  • general purpose, SRAM transfer, DRAM transfer,
    next-neighbor (NN)
  • 256, 32-bit GPRs
  • can be accessed in thread-local or absolute mode
  • 256, 32-bit SRAM transfer registers.
  • used to read/write to all functional units on the
    IXP2xxx except the DRAM
  • 256, 32-bit DRAM transfer registers
  • divided equally into read-only and write-only
  • used exclusively for communication between the
    MEs and the DRAM
  • Benefit of having separate transfer and GPRs
  • ME can continue processing with GPRs while other
    functional units read and write the transfer
    registers

143
Different Types of Memory
Type of Memory    Logical width (bytes)   Size (bytes)   Approx. unloaded latency (cycles)   Special Notes
Local to ME       4                       2560           3                                   Indexed addressing, post incr/decr
On-chip scratch   4                       16K            60                                  Atomic ops, 16 rings w/ atomic get/put
SRAM              4                       256M           150                                 Atomic ops, 64-elem q-array
DRAM              8                       2G             300                                 Direct path to/from MSF
144
IXA Software Framework
ExternalProcessors
Control Plane Protocol Stacks
Control Plane PDK
XScaleCore
Core Components
C/C++ Language
Core Component Library
Resource Manager Library
Microblock Library
Microengine Pipeline
MicroengineC Language
Micro block
Micro block
Micro block
Utility Library
Protocol Library
Hardware Abstraction Library
145
  • Micro-engine
  • C Compiler
  • C language constructs
  • Basic types,
  • pointers, bit fields
  • In-line assembly code support
  • Aggregates
  • Structs, unions, arrays

146
What is a Microblock
  • Data plane packet processing on the microengines
    is divided into logical functions called
    microblocks
  • Coarse Grain and stateful
  • Example
  • 5-Tuple Classification, IPv4 Forwarding, NAT
  • Several microblocks running on a microengine
    thread can be combined into a microblock group.
  • A microblock group has a dispatch loop that
    defines the dataflow for packets between
    microblocks
  • A microblock group runs on each thread of one or
    more microengines
  • Microblocks can send and receive packets to/from
    an associated Xscale Core Component.

147
Core Components and Microblocks
XScale Core
Micro- engines
Microblock Library
Core Libraries
User-written code
Intel/3rd party blocks
148
Applications of Network Processors
  • Fully programmable architecture
  • Implement any packet processing applications
  • Examples from customers
  • Routing/switching, VPN, DSLAM, multi-service
    switch, storage, content processing
  • Intrusion Detection (IDS) and RMON
  • Use as a research platform
  • Experiment with new algorithms, protocols
  • Use as a teaching tool
  • Understand architectural issues
  • Gain hands-on experience with networking systems

149
Technical and Business Challenges
  • Technical Challenges
  • Shift from ASIC-based paradigm to software-based
    apps
  • Challenges in programming an NPU
  • Trade-off between power, board cost, and no. of
    NPUs
  • How to add co-processors for additional
    functions?
  • Business challenges
  • Reliance on an outside supplier for the key
    component
  • Preserving intellectual property advantages
  • Add value and differentiation through software
    algorithms in data plane, control plane, services
    plane functionality
  • Must decrease time-to-market (TTM) to be
    competitive

150
Challenges in Modern Tera-bit Class Switch Design
151
Goals
  • Design of a terabit-class system
  • Several Tb/s aggregate throughput
  • 2.5 Tb/s 256x256 OC-192 or 64x64 OC-768
  • OEM
  • Achieve wide coverage of application spectrum
  • Single-stage
  • Electronic fabric

152
System Architecture
153
Power
  • Requirement
  • Do not exceed the per shelf (2 kW), per board
    (150W), and per chip (20W) budgets
  • Forced-air cooling, avoid hot-spots
  • More throughput at same power: Gb/s/W density is
    increasing
  • I/O uses an increasing fraction of power (> 50%)
  • Electrical I/O technology has not kept pace with
    capacity demand
  • Low-power, high-density I/O technology is a must
  • CMOS density increases faster than W/gate
    decreases
  • Functionality/chip constrained by power rather
    than density
  • Power determines the number of chips and boards
  • Architecture must be able to be distributed
    accordingly

154
Packaging
  • Requirement
  • NEBS compliance
  • Constrained by
  • Standard form factors
  • Power budget at chip, card, rack level
  • Switch core
  • Link, connector, chip packaging technology
  • Connector density (pins/inch)
  • CMOS density doubles, but the number of pins grows
    only 5-10% per generation
  • This determines the maximum per-chip and per-card
    throughput
  • Line cards
  • Increasing port counts
  • Prevalent line rate granularity OC-192 (10 Gb/s)
  • 1 adapter/card
  • > 1 Tb/s systems require multi-rack solutions
  • Long cables instead of backplane (30 to 100m)
  • Interconnect accounts for large part of system
    cost

155
Packaging
  • 2.5 Tb/s, 1.6x speedup, 2.5 Gb/s links, 8b/10b =>
    about 4000 links (diff. pairs)

156
Switch-Internal Round-Trip (RT)
  • Physical system size
  • Direct consequence of packaging
  • CMOS technology
  • Clock speeds increasing much slower than density
  • More parallelism required to increase throughput
  • Shrinking packet cycle
  • Line rates have gone up drastically (OC-3 through
    OC-768)
  • Minimum packet size has remained constant
  • Large round-trip (RT) in terms of min. packet
    duration
  • Can be (many) tens of packets per port
  • Used to be only a node-to-node issue, now also
    inside the node
  • System-wide clocking and synchronization

Evolution of RT
157
Switch-Internal Round-Trip (RT)
switch fabric
line card 1
switch core
switch fabric interface chips
line card N
  • Consequences
  • Performance impact?
  • All buffers must be scaled by RT
  • Fabric-internal flow control becomes an important
    issue

158
Speed-Up
  • Requirement
  • Industry standard 2x speed-up
  • Three flavors
  • Utilization: compensate for SAR overhead
  • Performance: compensate for scheduling inefficiencies
  • OQ speed-up: memory access time
  • Switch core speed-up S is very costly
  • Bandwidth is a scarce resource: COST and POWER
  • Core buffers must run S times faster
  • Core scheduler must run S times faster
  • Is it really needed?
  • SAR overhead reduction
  • Variable-length packet switching hard to
    implement, but may be more cost-effective
  • Performance: does the gain in performance justify
    the increase in cost and power?
  • Depends on application
  • Low Internet utilization

159
Multicast
  • Requirement
  • Full multicast support
  • Many multicast groups, full link utilization, no
    blocking, QoS
  • Complicates everything
  • Buffering, queuing, scheduling, flow control, QoS
  • Sophisticated multicast support really needed?
  • Expensive
  • Often disabled in the field
  • Complexity, billing, potential for abuse, etc.
  • Again, depends on application

160
Packet size
  • Requirement
  • Support very short packets (32-64B)
  • 40B @ OC-768 = 8 ns
  • Short packet duration
  • Determines speed of control section
  • Queues and schedulers
  • Implies longer RT
  • Wider data paths
  • Do we have to switch short packets individually?
  • Aggregation techniques
  • Burst, envelope, container switching, packing
  • Single-stage, multi-path switches
  • Parallel packet switch

161
100Tb/s optical router: Stanford University
Research Project
  • Collaboration
  • 4 Professors at Stanford (Mark Horowitz, Nick
    McKeown, David Miller and Olav Solgaard), and our
    groups.
  • Objective
  • To determine the best way to incorporate optics
    into routers.
  • Push technology hard to expose new issues.
  • Photonics, Electronics, System design
  • Motivating example The design of a 100 Tb/s
    Internet router
  • Challenging but not impossible (100x current
    commercial systems)
  • It identifies some interesting research problems

162
100Tb/s optical router
(Figure: 625 electronic linecards, each handling line termination, IP packet processing and packet buffering for 40 Gb/s lines, connect at 160-320 Gb/s to a central optical switch; arbitration requests and grants flow between the linecards and a central arbiter. 100 Tb/s = 625 x 160 Gb/s.)
163
Research Problems
  • Linecard
  • Memory bottleneck: address lookup and packet
    buffering.
  • Architecture
  • Arbitration: computation complexity.
  • Switch Fabric
  • Optics: fabric scalability and speed,
  • Electronics: switch control and link electronics,
  • Packaging: three-surface problem.

164
160Gb/s Linecard Packet Buffering
(Figure: a queue manager with on-chip SRAM fronting several off-chip DRAMs, with 160 Gb/s packet streams entering and leaving the line card.)
  • Problem
  • Packet buffer needs density of DRAM (40 Gbits)
    and speed of SRAM (2ns per packet)
  • Solution
  • Hybrid solution uses on-chip SRAM and off-chip
    DRAM.
  • Identified optimal algorithms that minimize size
    of SRAM (12 Mbits).
  • Precisely emulates behavior of 40 Gbit, 2ns SRAM.

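A much-simplified C sketch of that hybrid idea, for a single queue of integer "packets" (the burst size, cache sizes and bypass rule are assumptions, not the published algorithm): the tail and head of the queue live in small on-chip SRAM, the middle lives in DRAM, and data moves to or from DRAM only in bursts of B packets so the DRAM is never touched at the per-packet rate.

#include <stdio.h>
#include <string.h>

#define B        4        /* burst size written to / read from DRAM          */
#define SRAM_MAX 8        /* head/tail cache capacity, in packets            */
#define DRAM_MAX 4096     /* modeled DRAM capacity, in packets               */

static int tail_sram[SRAM_MAX], n_tail;
static int head_sram[SRAM_MAX], n_head;
static int dram[DRAM_MAX], dram_head, dram_tail;   /* FIFO of packets parked in DRAM */

static void enqueue(int pkt)
{
    tail_sram[n_tail++] = pkt;
    if (n_tail == B) {                              /* flush a full burst to DRAM */
        memcpy(&dram[dram_tail], tail_sram, B * sizeof(int));
        dram_tail += B;
        n_tail = 0;
    }
}

static int dequeue(void)                            /* returns -1 if the queue is empty */
{
    if (n_head == 0) {
        if (dram_tail - dram_head >= B) {           /* refill the head cache from DRAM */
            memcpy(head_sram, &dram[dram_head], B * sizeof(int));
            dram_head += B;
            n_head = B;
        } else if (n_tail > 0) {                    /* nothing in DRAM: bypass it */
            memcpy(head_sram, tail_sram, n_tail * sizeof(int));
            n_head = n_tail;
            n_tail = 0;
        } else {
            return -1;
        }
    }
    int pkt = head_sram[0];
    n_head--;
    memmove(head_sram, head_sram + 1, n_head * sizeof(int));
    return pkt;
}

int main(void)
{
    for (int i = 1; i <= 10; i++) enqueue(i);
    for (int i = 0; i < 10; i++) printf("%d ", dequeue());
    printf("\n");                                    /* prints 1..10 in order */
    return 0;
}
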
165
The Arbitration Problem
  • A packet switch fabric is reconfigured for every
    packet transfer.
  • At 160Gb/s, a new IP packet can arrive every 2ns.
  • The configuration is picked to maximize
    throughput and not waste capacity.
  • Known algorithms are too slow.

166
100Tb/s Router
Optical links
Optical Switch Fabric
Racks of 160Gb/s Linecards
167
Racks with 160Gb/s linecards
168
Passive Optical Switching
Integrated AWGR or diffraction grating based
wavelength router
(Figure: ingress, midstage, and egress linecards 1 to n interconnected through the passive wavelength router.)
169
Predictions Core Internet routers
  • The need for more capacity for a given power and
    volume budget will mean
  • Fewer functions in routers
  • Little or no optimization for multicast,
  • Continued over-provisioning will lead to little
    or no support for QoS, DiffServ, ...
  • Fewer unnecessary requirements
  • Mis-sequencing will be tolerated,
  • Latency requirements will be relaxed.
  • Less programmability in routers, and hence no
    network processors (NPs used at edge).
  • Greater use of optics to reduce power in switch.

170
Likely Events
  • The need for capacity and reliability will mean
  • Widespread replacement of core routers with
    transport switching based on circuits
  • Circuit switches have proved simpler, more
    reliable, lower power, higher capacity and lower
    cost per Gb/s. Eventually, this is going to
    matter.
  • Internet will evolve to become edge routers
    interconnected by rich mesh of WDM circuit
    switches.

171
Summary
  • High speed routers: lookup, switching,
    classification, buffer management
  • Lookup: range-matching, tries, multi-way tries
  • Switching: circuit switching, crossbar, Batcher-banyan
  • Queuing: input/output queuing issues
  • Classification: multi-dimensional geometry
    problem
  • Road ahead to 100 Tbps routers