Title: Packet Switch Architecture and Buffered Crossbar Switches
1. Packet Switch Architecture and Buffered Crossbar Switches
- Manolis Katevenis
- FORTH and Univ. of Crete, Greece
- http://archvlsi.ics.forth.gr/kateveni
2. Ubiquitous Interconnects
- Information Technology Infrastructure
- - Compute: processors, specialized compute engines
- - Store: memories, disks, etc.
- - Interface: keyboards, displays, sensors, actuators, etc.
- - Communicate: interconnect all of the above together!
- - from on-chip to cross-continent range
- Packet Switches: the building block for high-performance interconnects
This Talk:
- brief background review: Packet Switch Architecture
- recent research at FORTH: Buffered Crossbars
3. Packet Switching: Unscheduled Arrivals
4. Buffer Memory Architectures
(1) Output Queueing: the reference architecture
- ideal performance, but excessive cost
- N·(N+1) total memory throughput for an N×N switch
5. (2) Input Queueing: the usual architecture
- N·2 total memory throughput
- need to solve the crossbar scheduling problem
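The two throughput figures can be checked with a short count of memory accesses per packet time. This is a minimal sketch, assuming all ports run at the same line rate and throughput is measured in units of that rate; the function names are illustrative and not from the talk.

```python
def output_queueing_throughput(n: int) -> int:
    """Total memory throughput of an N x N output-queued switch.

    Each of the N output memories must absorb up to N simultaneous
    writes (one per input) plus 1 read (the outgoing line).
    """
    return n * (n + 1)


def input_queueing_throughput(n: int) -> int:
    """Total memory throughput of an N x N input-queued switch.

    Each of the N input memories sees 1 write (the arriving line)
    plus 1 read (toward the crossbar).
    """
    return n * 2


for n in (4, 16, 32):
    oq = output_queueing_throughput(n)
    iq = input_queueing_throughput(n)
    print(f"N={n:2d}: OQ={oq:4d}  IQ={iq:3d}  ratio={oq / iq:.1f}x")
```

The ratio grows linearly with N, which is why output queueing serves as a performance reference rather than a practical design at high port counts.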
6. Crossbar Scheduling (1 of 3)
7. Crossbar Scheduling (2 of 3)
8. Crossbar Scheduling (3 of 3)
9. (3) Combined Input-Output Queueing (CIOQ): the practical architecture
10. Part 2: Buffered Crossbars @ FORTH
- Small buffers inside the crossbar (or switching fabric)
- Large buffers at the inputs (as before)
- Backpressure to keep the small buffers from overflowing
- simpler (distributed) scheduling, QoS capable (weighted round-robin, WRR)
- better scheduling efficiency
- directly operates with variable-size packets, without segmentation and reassembly (SAR)
- NO internal speedup needed
- - lower power consumption, lower cost, or
- - external lines as fast as internal core
- NO output queues needed
11. Distributed Scheduling in Buffered Crossbars
- inputs decide independently where to send to, subject to space availability
- outputs decide independently where to feed from, subject to data availability
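The two independent decisions above can be sketched as a small time-slotted model. This is only an illustration under stated assumptions: fixed-size units, plain round-robin at both inputs and outputs, and a crosspoint capacity of one unit by default. The class and method names are hypothetical; the talk itself covers variable-size packets and WRR.

```python
from collections import deque


class BufferedCrossbar:
    """Toy N x N buffered crossbar with per-input and per-output round-robin."""

    def __init__(self, n: int, xpoint_capacity: int = 1):
        self.n = n
        self.capacity = xpoint_capacity
        # voq[i][j]: large input buffer at input i holding traffic for output j
        self.voq = [[deque() for _ in range(n)] for _ in range(n)]
        # xpoint[i][j]: small buffer inside the crossbar at crosspoint (i, j)
        self.xpoint = [[deque() for _ in range(n)] for _ in range(n)]
        self.in_ptr = [0] * n   # round-robin pointer of each input scheduler
        self.out_ptr = [0] * n  # round-robin pointer of each output scheduler

    def enqueue(self, i: int, j: int, packet) -> None:
        self.voq[i][j].append(packet)

    def step(self):
        """Advance one time slot; return the list of (output, packet) delivered."""
        n = self.n
        # Input schedulers: each looks only at its own row.
        # Eligible = data waiting AND space available (backpressure).
        for i in range(n):
            for k in range(n):
                j = (self.in_ptr[i] + k) % n
                if self.voq[i][j] and len(self.xpoint[i][j]) < self.capacity:
                    self.xpoint[i][j].append(self.voq[i][j].popleft())
                    self.in_ptr[i] = (j + 1) % n
                    break
        # Output schedulers: each looks only at its own column.
        # Eligible = data available in the crosspoint buffer.
        delivered = []
        for j in range(n):
            for k in range(n):
                i = (self.out_ptr[j] + k) % n
                if self.xpoint[i][j]:
                    delivered.append((j, self.xpoint[i][j].popleft()))
                    self.out_ptr[j] = (i + 1) % n
                    break
        return delivered


xbar = BufferedCrossbar(n=2)
xbar.enqueue(0, 1, "a")  # both inputs target output 1 in the same slot
xbar.enqueue(1, 1, "b")
print(xbar.step())  # [(1, 'a')]
print(xbar.step())  # [(1, 'b')]
```

Both inputs transmit in the first slot even though they target the same output, because each lands in its own crosspoint buffer. No central matching between inputs and outputs is computed.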
12. No Speedup needed to approach Output Queueing
- Uniform destinations
- Internet-style synthetic workload: 40-1500 byte packet sizes
- Unbuffered crossbar w. SAR: one-iteration iSLIP, 64-byte segments
13. Saturation Throughput under Unbalanced Traffic
- Poisson arrivals, Pareto sizes (40-1500 bytes)
- For iSLIP, packet sizes are multiples of 64 B (→ no SAR overhead)
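The remark about multiples of 64 B can be illustrated by computing the padding that fixed-size segmentation adds. This is a rough sketch that assumes the last segment of each packet is padded to the full segment size and ignores per-segment headers; the function name is illustrative.

```python
SEGMENT_BYTES = 64  # segment size quoted for the iSLIP comparison


def sar_overhead(packet_bytes: int, segment_bytes: int = SEGMENT_BYTES) -> float:
    """Padding bytes as a fraction of the packet size.

    The packet is cut into fixed-size segments and the last one is padded.
    """
    segments = -(-packet_bytes // segment_bytes)  # integer ceiling division
    carried = segments * segment_bytes
    return (carried - packet_bytes) / packet_bytes


for size in (40, 64, 65, 128, 1500):
    print(f"{size:5d} B packet: {sar_overhead(size):6.1%} padding")
```

Packets whose size is an exact multiple of the segment size incur no padding, which is the favourable case granted to iSLIP in this comparison. A 65-byte packet, by contrast, nearly doubles in size.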
14. A VPS Buffered Crossbar Chip Design
- 32x32 ports, 300 Gbps aggregate throughput
- 2 KBytes / crosspoint buffer x 1024 crosspoints
- Variable-size packets (multiples of 4 Bytes)
- 32-bit datapaths
- Cut-through at the crosspoints
- Fully designed, in Verilog
- Core only, no pads & transceivers
- Fully verified: Verilog versus C performance simulator
- Crosspoint logic: 100 FF and 25 gates (simplicity!)
15. Chip Design: Synthesis, Placement & Routing
- 32x32 ports, 300 Gbps
- Synthesized: Synopsys
- Placed & routed: Cadence Encounter, 0.18 µm UMC
- → Clock frequency: 300 MHz @ 0.18 µm
- (operates at maximum SRAM clock frequency)
- → Core Power: 6 Watt typical @ 0.18 µm
- → Core Area: 420 mm² @ 0.18 µm, or 200 mm² @ 0.13 µm
- Conclusion:
- - 0.18 µm: 24x24 ports (or 10x10 ports w. Jumbo frames)
- - 0.13 µm: 32x32 ports @ 10 Gbps/port
- - 0.09 µm: higher port counts and line rates achievable
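The headline figures of the chip design are mutually consistent, as a quick back-of-the-envelope check shows. The sketch assumes each port moves one 32-bit word per clock cycle, which is an assumption made here and not stated in the slides.

```python
PORTS = 32
DATAPATH_BITS = 32
CLOCK_HZ = 300_000_000          # 300 MHz at 0.18 µm
XPOINT_BUFFER_BYTES = 2 * 1024  # 2 KBytes per crosspoint

# Assumption: one datapath word transferred per port per clock cycle.
per_port_bps = DATAPATH_BITS * CLOCK_HZ
aggregate_bps = PORTS * per_port_bps
total_buffer_bytes = XPOINT_BUFFER_BYTES * PORTS * PORTS

print(f"per-port rate  : {per_port_bps / 1e9:.1f} Gbps")            # 9.6 Gbps
print(f"aggregate      : {aggregate_bps / 1e9:.1f} Gbps")           # 307.2 Gbps
print(f"crosspoint SRAM: {total_buffer_bytes // 2**20} MiB total")  # 2 MiB
```

Under that assumption the per-port rate is just under the 10 Gbps/port quoted in the conclusion, and the aggregate matches the roughly 300 Gbps figure.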
16. Chip Core Layout
17. Core Area, Power Allocation
18. Optimizing for Buffer Memory Technologies
- Crosspoint buffers are too expensive when maximum packet size is large (1.5-10 Kbytes)
- DRAM on ingress line card operates efficiently only on fixed-size blocks
19. Variable-Size Multipacket Segments
[Figure labels: Fixed size segments (cells); Variable size segments; Variable size multipacket segments. Buffered Crossbars can operate on variable size units. Pack multiple packets into each segment.]
- Fixed size cells induce heavy padding overheads
- Variable size segments: small buffers, no padding
- Encapsulating multiple packets into each segment
- - reduces overhead of internal headers
- - provides better performance with smaller xpoint buffers
- - well suited to DRAM buffers on ingress line cards
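One way to picture multipacket segments is to treat each queue as a byte stream and cut off up to one maximum-size segment each time the queue is served. The sketch below is an assumption about how such packing could work, not the authors' implementation. In particular, letting a packet continue into the next segment is assumed here, and the names `next_segment` and `MAX_SEGMENT` are hypothetical.

```python
from collections import deque

MAX_SEGMENT = 512  # maximum segment size in bytes (value used in the later slides)


def next_segment(queue: deque, max_segment: int = MAX_SEGMENT):
    """Remove and return the next segment from a queue.

    `queue` holds [packet_id, remaining_bytes] entries in arrival order.
    The result is a list of (packet_id, n_bytes) pieces whose sizes sum to
    at most `max_segment`. A short queue yields a short segment (no padding).
    """
    segment, room = [], max_segment
    while queue and room > 0:
        pkt = queue[0]
        take = min(pkt[1], room)
        segment.append((pkt[0], take))
        pkt[1] -= take
        room -= take
        if pkt[1] == 0:
            queue.popleft()  # packet fully consumed
    return segment


q = deque([["p0", 40], ["p1", 40], ["p2", 1500]])
while q:
    print(next_segment(q))
```

The first segment carries two whole small packets plus the start of the large one. The final segment is only as long as the data that remains, so no padding is ever sent.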
20. SRAM & DRAM Queueing using Variable-Size Multipacket Segments
21. Multipacket Segments in CICOQ vs. CIOQ
Uniform synthetic workload including jumbo frames; Max segment size: 512 B; Crosspoint buffer size: 512 B; iSLIP with 5 iterations; Switch size: 32
- CICOQ delay curve is translated by the reassembly delay relative to OQ
- CIOQ curve reveals the timer setting problem
22. Multipacket vs. Unipacket Segments in CICOQ
Uniform traffic of 40-byte packets only; 4-byte internal header; Max segment size: 512 B; Crosspoint buffer size: 520 B
- Multipacket Segments
- - Switch stable for all loads up to 512/516 ≈ 99%, due to segment size adaptivity
- Unipacket Segments
- - Switch unstable for loads greater than 40/44 ≈ 91%
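Both load limits follow from the fraction of each segment that is payload rather than internal header. A minimal sketch, assuming internal links run at the same rate as the external lines (no speedup), so every header byte is taken from usable capacity; the function name is illustrative.

```python
HEADER_BYTES = 4  # internal header per segment


def max_stable_load(payload_bytes: int, header_bytes: int = HEADER_BYTES) -> float:
    """Largest external load the fabric can carry without speedup."""
    return payload_bytes / (payload_bytes + header_bytes)


# Multipacket: under heavy load, segments fill up to 512 B of payload.
print(f"multipacket: {max_stable_load(512):.1%}")  # 99.2%
# Unipacket: every 40-byte packet carries its own 4-byte header.
print(f"unipacket  : {max_stable_load(40):.1%}")   # 90.9%
```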
23. Multipacket vs. Unipacket Segments in CICOQ (cont.)
Unbalanced synthetic workload; Maximum pkt size: 1500 B; Max segment size: 512 B; Crosspoint buffer size: 512/1024 B; RTT: 500 byte times
- Multipacket Segments
- - Satisfactory performance with a crosspoint buffer of just 1 max. size segment
- Unipacket Segments
- - At least 1 max. size segment + 1 RTT window is needed
24. Switching Fabrics with Internal Backpressure
- Most promising scalable architecture
- Open questions still remain: congestion control
- → active research topic
- past & present research at FORTH
25. ATLAS I: Single-chip ATM Switch
- 1996-98
- 6 million transistors
- 0.35 µm CMOS
- 10 Gbit/s (16x16 @ 622 Mbps)
- multilane backpressure at the granularity of 32 K flows
- on-chip shared buffer (pipelined memory, US patent 5,774,653)
26. Commodity Architectures: an Analogy?
- Network switches
- - 1985-2005: immature switch architectures.
- - 1995-2005: Internet Routers, Digital Telephony: specialized, expensive, small market.
- - 2005(?)-: clusters (fabrics) of commodity switches?
- - SAN, LAN, WAN
- Processors
- - 1975-85: immature pre-RISC architectures.
- - 1985-95: Supercomputers: specialized, expensive, small market.
- - 1995-: clusters of low-cost, mass-market (commodity) processing nodes