Title: Packet Switching on Raw
1Packet Switching on Raw
- Research Qualifying Exam
- Gleb A Chuvpilo
- January 28, 2005
2Project Publications
- High-Bandwidth Packet Switching on the Raw General-Purpose Architecture, Gleb A. Chuvpilo and Saman Amarasinghe, In Proceedings of the International Conference on Parallel Processing (ICPP-03), Kaohsiung, Taiwan, Republic of China, October 6-9, 2003.
- High-Bandwidth Packet Switching on the Raw General-Purpose Architecture, Gleb A. Chuvpilo, S.M. Thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts, August 2002.
- RawNet: Network Processing on the Raw Processor, David Wentzlaff, Gleb A. Chuvpilo, Arvind Saraf, Saman Amarasinghe, and Anant Agarwal, In Research Abstracts of the MIT Laboratory for Computer Science, Cambridge, Massachusetts, March 2002.
- Gigabit IP Routing on Raw, Gleb A. Chuvpilo, David Wentzlaff, and Saman Amarasinghe, In Proceedings of the 1st HPCA Workshop on Network Processors, Cambridge, Massachusetts, February 3, 2002.
- Also, unpublished work on Network Calculus at the Computer Engineering and Networks Laboratory of the ETH Swiss Federal Institute of Technology
3Outline
- Introduction
- Raw Processor Overview
- Internet Router Overview
- Packet Switching on Raw
- Raw Router Architecture
- Rotating Crossbar Design for Switch Fabric
- Distributed Scheduling Algorithm
- Minimization and Scheduling
- Results
- Conclusion
4Introduction
5Goal
- Build an IP router on a general-purpose processor
- Why?
- Flexibility → new protocols and services
- Price → economies of scale
6Raw
7Raw Processor
- A scalable computation fabric
- 4 x 4 mesh of tiles, each tile is a RISC microprocessor
- Ultra fast interconnect network
- Exposes the wires to the compiler
- Compiler orchestrates the communication
8Raw Facts
- Performance
- 16 OPS/FLOPS per cycle
- 230 Gb/s of on-chip bisection bandwidth
- 201 Gb/s off-chip I/O bandwidth
- 57 GB/s of on-chip memory bandwidth
9Raw Facts
- Layout
- Longest wire is the length of tile → fast clocking
- Each tile
- MIPS R4000 router interconnect
- 32 KB IMEM
- 32 KB data cache
- 64 KB SMEM → 2 MB total per chip
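As a quick check of the per-chip total, the per-tile memories can be summed across all tiles. The sketch below is only arithmetic on the sizes listed on this slide and the 4 x 4 tile count from the Raw Processor slide; the variable names are illustrative.

```python
# Per-tile memory sizes in KB, as listed on the slide.
IMEM_KB = 32      # instruction memory
DCACHE_KB = 32    # data cache
SMEM_KB = 64      # SMEM, as named on the slide
TILES = 4 * 4     # 4 x 4 mesh of tiles

per_tile_kb = IMEM_KB + DCACHE_KB + SMEM_KB
total_mb = per_tile_kb * TILES / 1024

print(per_tile_kb, "KB per tile")
print(total_mb, "MB per chip")
```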
10Raw Facts
- Instruction Set Architecture
- Eight stage pipeline: FETCH, DECODE, RF/STALL, EXE, MUL, MEM, FPU
- MIPS instruction set
- 28 general-purpose registers
- 4 register-mapped network ports
- 2-way set-associative cache, 3 cycle latency, 32 byte lines
11Raw Facts
- Implementation
- ASIC @ 250 MHz worst case
- 122 million transistors (P4: 43 million)
- 18.2 mm x 18.2 mm die (P4: 15 mm x 15 mm)
- 1080 signal I/O pins
- 25 Watts
- IBM SA-27E 6 layer metal copper 0.15µ process (P4: 0.13µ)
12Raw Layout
13Communication Mechanisms
- 2 static networks
- 2 dynamic networks
14Static Networks
- Destinations known at compile time
- Message size known at compile time
- Cycle-by-cycle switch schedule
- Three-cycle nearest neighbor send-to-use latency
- No processing overhead
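A cycle-by-cycle switch schedule can be pictured as a fixed table that says, for every cycle, which input port feeds which output port. The sketch below is a simplified model, not Raw switch code: the port names, the `replay_schedule` helper, and the one-word-per-route-per-cycle behavior are assumptions made for illustration.

```python
from collections import deque

def replay_schedule(schedule, inputs):
    """Replay a fixed per-cycle routing table over queued input words.

    schedule: list of dicts, one per cycle, mapping source port -> destination port.
    inputs:   dict mapping source port -> iterable of words waiting on that port.
    Returns a dict mapping destination port -> list of words in arrival order.
    """
    queues = {port: deque(words) for port, words in inputs.items()}
    outputs = {}
    for routes in schedule:
        for src, dst in routes.items():
            # The schedule is fixed in advance, so a word must already be waiting.
            word = queues[src].popleft()
            outputs.setdefault(dst, []).append(word)
    return outputs

# Hypothetical three-cycle schedule for a single switch.
schedule = [
    {"west": "east"},
    {"west": "east", "north": "south"},
    {"north": "south"},
]
result = replay_schedule(schedule, {"west": ["a0", "a1"], "north": ["b0", "b1"]})
print(result)
```

Nothing in this model inspects the data being moved: the routes depend only on the cycle number, which is one way to read the "no processing overhead" property listed above.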
15Static Network Send
16Static Network Receive
17Dynamic Networks
- Unpredictable events
- External asynchronous interrupts
- Cache misses
- 15- to 30-cycle nearest neighbor send-to-use latency (message header processing overhead)
- Wormhole routed, two-stage pipelined, dimension-ordered
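Dimension-ordered routing resolves one mesh dimension completely before moving in the other. The sketch below shows the idea on a grid of tiles; routing X before Y, the `(x, y)` coordinate convention, and the `dimension_ordered_path` name are assumptions for illustration rather than details taken from the slides.

```python
def dimension_ordered_path(src, dst):
    """Return the list of (x, y) tiles visited from src to dst, X first, then Y."""
    x, y = src
    path = [(x, y)]
    # Resolve the X dimension completely.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Then resolve the Y dimension.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

path = dimension_ordered_path((0, 0), (3, 2))
print(path)
print(len(path) - 1, "hops")
```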
18Routing
19What is Routing? (OSI Reference Model)
20IP Router
21Switch Fabric
22Click Modular Router
- Modular software router
- MIT Parallel and Distributed OS Group
- 435,000 64-byte packets a second on a 700 MHz Pentium III (commodity hardware)
- Flexible, configurable, and easy to understand
- Interconnected collection of modules called elements
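For comparison with bandwidth figures elsewhere in the talk, the Click packet rate can be converted to bits per second. The sketch counts only the 64 packet bytes and ignores any link-level framing, which is an assumption; the constant names are illustrative.

```python
PACKETS_PER_SECOND = 435_000   # Click on a 700 MHz Pentium III, from the slide
PACKET_BYTES = 64

bits_per_second = PACKETS_PER_SECOND * PACKET_BYTES * 8
mbps = bits_per_second / 1e6
print(f"{mbps:.2f} Mb/s")
```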
23Click Modular Router
24Packet Switching on Raw
25Problem: Four Networks...
26...and Sixteen Tiles
27What is the Mapping?
[Figure: how to map Dynamic Communication onto the Static Interconnect]
28Solution: Rotating Crossbar
[Figure: rotating crossbar with input ports In 0 to In 3 and output ports Out 0 to Out 3]
29Switch Fabric Design
- The idea of a Token Ring network → absolute fairness
- Algorithm uses two static networks, dynamic networks are idle
- All deadlock-free configurations are scheduled at compile time
- Four headers and token location define a global configuration
- Global configuration is computed in a distributed manner at run time
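The fairness idea behind a rotating token can be sketched as a priority order that starts at the token holder and advances by one port each round. This is a deliberately simplified model, not the algorithm from the talk: it ignores the ring topology, the precomputed deadlock-free configurations, and the distributed computation, and the `arbitrate` function and its request encoding are hypothetical.

```python
NUM_PORTS = 4

def arbitrate(requests, token):
    """Grant outputs to inputs in priority order starting at the token holder.

    requests: list of length NUM_PORTS; requests[i] is the output port that
              input i wants, or None if input i has no packet.
    token:    index of the input port that currently holds the token.
    Returns (grants, next_token), where grants maps input -> output.
    """
    grants = {}
    taken = set()
    for offset in range(NUM_PORTS):
        port = (token + offset) % NUM_PORTS
        wanted = requests[port]
        if wanted is not None and wanted not in taken:
            grants[port] = wanted
            taken.add(wanted)
    # The token moves on so that a different port has top priority next round.
    return grants, (token + 1) % NUM_PORTS

requests = [2, 2, None, 0]   # inputs 0 and 1 both want output 2
token = 0
for _ in range(2):
    grants, token = arbitrate(requests, token)
    print(grants)
```

In this model input 0 wins output 2 while it holds the token, and input 1 wins it in the following round, so no input is starved by a persistent conflict.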
30Rotating Crossbar Illustrated
31Rotating Crossbar Illustrated
32Phases of the Algorithm
[Figure: algorithm phases for the TILE PROCESSOR and SWITCH PROCESSOR: headers_request, headers, send_prev_config, choose_new_config, route_body, confirm, update_token]
33Distributed Scheduling Algorithm
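- Let's enumerate the number of configurations
- $\mathrm{SPACE} = \mathrm{Hdr}_0 \times \cdots \times \mathrm{Hdr}_3 \times \mathrm{Token}$,
- where $\mathrm{Hdr}_0 = \cdots = \mathrm{Hdr}_3 = 5$,
- and $\mathrm{Token} = 4$
- therefore
- $\mathrm{SPACE} = 5^4 \times 4 = 2{,}500$ distinct configurations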
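The count can also be checked by brute-force enumeration. The slide gives only the number of header values (5) and token positions (4); reading the five values as four output ports plus an "empty" value is an assumption, as are the names used below.

```python
from itertools import product

# Assumed meaning of the 5 header values: one per output port, plus "empty".
HEADER_VALUES = ["out0", "out1", "out2", "out3", "empty"]
TOKEN_POSITIONS = range(4)

# One configuration = four headers (one per input port) plus the token location.
configurations = [
    (headers, token)
    for headers in product(HEADER_VALUES, repeat=4)
    for token in TOKEN_POSITIONS
]
print(len(configurations))
```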
34So What?...
- Each tile has 8,192 words of instruction memory, same for switch
- → 8,192 / 2,500 ≈ 3.3 instructions per configuration → not enough! → need to use off-chip memory → slow!
- → need to minimize SPACE
35Minimization
[Figure: crossbar processor connections: in, out, cwprev, cwnext, ccwprev, ccwnext]
36Clients and Servers of a Crossbar Processor
37Minimization and Scheduling
- We cut down the number of configurations by a factor of 78! Now there are only 32 entries!
- → the program can fit in the local instruction memory!
- Code generated by an automatic compile-time scheduler
- In addition, software pipelining and loop unrolling of the assembly code of the switch processors of the crossbar to avoid deadlock
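The effect of minimization on the instruction budget can be worked out from the numbers on the slides. The sketch treats one instruction as one memory word, which is an assumption, and the constant names are illustrative.

```python
IMEM_WORDS = 8_192        # instruction words per tile, from the slides
CONFIGS_BEFORE = 2_500    # full configuration space
CONFIGS_AFTER = 32        # entries after minimization

reduction = CONFIGS_BEFORE / CONFIGS_AFTER
budget_before = IMEM_WORDS / CONFIGS_BEFORE
budget_after = IMEM_WORDS / CONFIGS_AFTER

print(f"reduction: {reduction:.1f}x")
print(f"words per configuration before: {budget_before:.1f}")
print(f"words per configuration after:  {budget_after:.0f}")
```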
38Scheduler Output
/* AUTOGENERATED SCHEDULE FOR PORT 0 */
/* Tile Processor */
/* */
conf_1_0303:
    mtsri SW_PC, lo(sw_conf_1000)
    j conf_done
conf_1_0304:
    mtsri SW_PC, lo(sw_conf_1000)
    j conf_done
conf_1_0310:
    mtsri SW_PC, lo(sw_conf_2001)
    j conf_done
conf_1_0311:
    mtsri SW_PC, lo(sw_conf_1210)
    j conf_done

/* HAND-CODED SCHEDULE FOR PORT 0 */
/* Switch Processor */
/* */
/* in->out, prev->next, dist1 */
sw_conf_1210:
    nop route IN->OUT
    nop route IN->OUT, PREV->NEXT
    nop route IN->OUT, PREV->NEXT
    nop route IN->OUT, PREV->NEXT
    nop route IN->OUT, PREV->NEXT
    nop route IN->OUT, PREV->NEXT
    nop route IN->OUT, PREV->NEXT
    nop route IN->OUT, PREV->NEXT
/* */
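To illustrate what a generator for the tile-processor stubs might look like, the sketch below prints entries in the style of the listing from a small lookup table. The table holds only the four entries shown on the slide; how the real scheduler derives a switch label from a configuration is not given in the source, so `DISPATCH` and `emit_stubs` are hypothetical.

```python
# Configuration label -> switch-processor routine, copied from the slide's listing.
DISPATCH = {
    "conf_1_0303": "sw_conf_1000",
    "conf_1_0304": "sw_conf_1000",
    "conf_1_0310": "sw_conf_2001",
    "conf_1_0311": "sw_conf_1210",
}

def emit_stubs(dispatch):
    """Return assembly text with one dispatch stub per configuration label."""
    lines = []
    for conf_label, switch_label in dispatch.items():
        lines.append(f"{conf_label}:")
        # Point the switch processor at the routine for this configuration.
        lines.append(f"    mtsri SW_PC, lo({switch_label})")
        lines.append("    j conf_done")
    return "\n".join(lines)

stubs = emit_stubs(DISPATCH)
print(stubs)
```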
39Results
40Implementation
- Raw Router was tested in a cycle-accurate simulator of the Raw processor
- Raw prototype clock speed is assumed to be 250 MHz
- The focus of research is on switch fabric, NOT on route lookup, etc.
- Over 75,000 lines of assembly code, many of them hand-coded
41Raw Router Results
- Features
- 4-port edge router
- 3.3 Mpps
- 26.9 Gbps
- Uses Raw static networks to stream data
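Taken together, the two throughput figures imply an average packet size. The slide does not state the packet size used in the experiments, so the value below is only what the two rounded numbers imply, and the constant names are illustrative.

```python
THROUGHPUT_BPS = 26.9e9    # 26.9 Gbps, from the slide
PACKET_RATE_PPS = 3.3e6    # 3.3 Mpps, from the slide

bits_per_packet = THROUGHPUT_BPS / PACKET_RATE_PPS
bytes_per_packet = bits_per_packet / 8
print(f"{bytes_per_packet:.0f} bytes per packet")
```

The result is on the order of a kilobyte per packet, far larger than the 64-byte packets quoted for Click earlier, so the packet rates of the two systems are not directly comparable.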
42Conclusion
43Conclusion
- Implemented a gigabit switch on Raw
- Mapped dynamic communication to static interconnect
- Can intermix switch fabric with computation
- High-bandwidth I/O allows performance of custom ASIC processors
44Future Work Critique
- Take advantage of dynamic networks
- Implement IP route lookup
- Add computation on data (encryption)
- Add support of multicast traffic
- Implement Quality of Service
- Add virtual output queueing
- Explore larger router configurations
45End of the official part!
46Current Research
- Probabilistic Robotics with Prof. John Leonard
- Robust Feature-Relative Navigation for Autonomous
Underwater Vehicles
47Robotic Kayaks
48Questions?