Title: High-Bandwidth Packet Switching on the Raw General-Purpose Architecture
1High-Bandwidth Packet Switching on the Raw
General-Purpose Architecture
- Gleb Chuvpilo
- Saman Amarasinghe
- MIT LCS Computer Architecture Group
- January 9, 2003
2Talk at a Glance
- Raw Processor Overview
- Raw Router Architecture
- Related Work: SimpleFit Analysis Framework
- Discussion
3We are on
- Raw Processor Overview
- Raw Router Architecture
- Related Work: SimpleFit Analysis Framework
- Discussion
4What is Raw?
Next Generation General-Purpose Processor!
5More Specifically, Raw is
- A scalable computation fabric
- 4 x 4 mesh of tiles, each tile is a simple microprocessor
- Ultra fast interconnect network
- Exposes the wires to the compiler
- Compiler orchestrates the communication
6Raw Facts
- Performance
- 16 OPS/FLOPS per cycle
- 462 Gb/s of on-chip bisection bandwidth
- 201 Gb/s I/O bandwidth
- 57 GB/s of on-chip memory bandwidth
7Raw Facts
- Layout
- Longest wire is the length of a tile → fast clocking!
- 16 tiles
- Each tile
- MIPS R4000 router interconnect
- 32 KB IMEM
- 32 KB data cache
- 64 KB SMEM
8Raw Facts
- Instruction Set Architecture
- Eight-stage pipeline: FETCH, DECODE, RF/STALL, EXE, MUL, MEM, FPU
- MIPS instruction set
- 28 general-purpose registers
- 4 register-mapped network ports
- 2-way set-associative cache, 3-cycle latency, 32-byte lines
9Raw Facts
- Implementation
- ASIC @ 250 MHz
- 122 million transistors (P4: 43 million)
- 18.2mm x 18.2mm die (P4: 15mm x 15mm)
- 1080 signal I/O pins
- 25 Watts
- IBM SA-27E 6-layer metal copper 0.15µ process (P4: 0.13µ)
10Raw Layout
11Communication Mechanisms
- 2 static networks
- 2 dynamic networks
12Static Networks
- Destinations known at compile time
- Message size known at compile time
- Cycle-by-cycle switch schedule
- Three-cycle nearest neighbor send-to-use latency
- No processing overhead
13Static Network Send
14Static Network Receive
15Dynamic Networks
- Unpredictable events
- External asynchronous interrupts
- Cache misses
- 15- to 30-cycle nearest neighbor send-to-use latency (message header processing overhead)
- Wormhole routed, two-stage pipelined, dimension-ordered
16How to Program? StreamIt!
- Thies et al., 2001
- Hierarchical structures
- Pipeline
- SplitJoin
- Feedback Loop
- Basic programmable unit: Filter
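The hierarchical structures listed above can be illustrated with a minimal sketch. This is a Python analogy, not StreamIt syntax: the names `filter_stage`, `pipeline`, and `split_join` are hypothetical, the splitter and joiner are assumed to be round-robin, and the Feedback Loop is left out for brevity.

```python
from typing import Callable, List

Stream = List[int]
Stage = Callable[[Stream], Stream]

def filter_stage(fn: Callable[[int], int]) -> Stage:
    """Filter: the basic unit; here it maps one input item to one output item."""
    return lambda items: [fn(x) for x in items]

def pipeline(*stages: Stage) -> Stage:
    """Pipeline: run the stages one after another on the stream."""
    def run(items: Stream) -> Stream:
        for stage in stages:
            items = stage(items)
        return items
    return run

def split_join(*branches: Stage) -> Stage:
    """SplitJoin: deal items round-robin to the branches, then interleave the results."""
    def run(items: Stream) -> Stream:
        n = len(branches)
        outputs = [branch(items[i::n]) for i, branch in enumerate(branches)]
        joined: Stream = []
        for k in range(max(len(out) for out in outputs)):
            for out in outputs:
                if k < len(out):
                    joined.append(out[k])
        return joined
    return run

# A two-level graph: one filter feeding a two-branch SplitJoin.
graph = pipeline(
    filter_stage(lambda x: x + 1),
    split_join(filter_stage(lambda x: x * 10), filter_stage(lambda x: -x)),
)
result = graph([1, 2, 3, 4])
print(result)  # [20, -3, 40, -5]
```

Because each structure returns a stage, structures nest freely, which is the sense in which the graph is hierarchical.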
17StreamIt In Action
18Compiling StreamIt
- StreamIt language exposes the data movement
- Graph structure is architecture independent
- Each architecture is different in granularity and topology
- Communication is exposed to the compiler
- The compiler needs to efficiently bridge the abstraction
- Map the computation and communication pattern of the program to the processors, memory and the communication substrate
- The StreamIt Compiler
- Partitioning
- Placement
- Scheduling
- Code generation
19We are on
- Raw Processor Overview
- Raw Router Architecture
- Related Work: SimpleFit Analysis Framework
- Discussion
20Motivation
- Build a fast IP router on a general-purpose architecture
- Why?
- Flexibility → new protocols and services
- Price → economies of scale
21Raw Router
- Chuvpilo et al., 2002
- Features
- 4-port edge router
- 3.3 Mpps
- 26.9 Gbps
- uses one Raw static network to stream data
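The two throughput figures above can be related by a short calculation. This is a sketch under the assumption that both numbers describe the same operating point, which the slide does not state; the function name `implied_packet_bytes` is hypothetical.

```python
PACKETS_PER_SECOND = 3.3e6   # 3.3 Mpps, from the slide
BITS_PER_SECOND = 26.9e9     # 26.9 Gbps, from the slide

def implied_packet_bytes(bps: float, pps: float) -> float:
    """Average packet size in bytes implied by a bit rate and a packet rate."""
    return bps / pps / 8

size = implied_packet_bytes(BITS_PER_SECOND, PACKETS_PER_SECOND)
print(f"implied packet size: about {size:.0f} bytes")
```

Under that assumption the figures correspond to packets of roughly 1,000 bytes each.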
22What is Routing? RM OSI
23Architecture of Internet Routers
24Switch Fabric
25Click Modular Router
26Problem: Four Networks
27 and Sixteen Tiles
28What is the Mapping?
?
Static Interconnect
Dynamic Communication
29Solution: Rotating Crossbar
Figure: rotating crossbar with inputs In 0 to In 3 and outputs Out 0 to Out 3
30Switch Fabric Design
- The idea of a Token Ring network → absolute fairness
- Algorithm uses two static networks, dynamic networks are idle
- All deadlock-free configurations are scheduled at compile time
- Four headers and token location define a global configuration
- Global configuration is computed in a distributed manner at run time
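The token idea can be sketched as a rotating-priority arbiter. This is a simplified illustration, not the Raw Router's actual algorithm: it models only contention for output ports, ignores contention on the ring links between crossbar tiles, and computes grants centrally rather than in a distributed manner. The names `arbitrate` and `next_token` are hypothetical.

```python
from typing import Dict, List, Optional

NUM_PORTS = 4

def arbitrate(requests: List[Optional[int]], token: int) -> Dict[int, int]:
    """Grant outputs to inputs, visiting inputs in ring order from the token holder.

    requests[i] is the output port wanted by input i, or None for an empty input.
    Returns a mapping {input: output} of the connections granted this round.
    """
    grants: Dict[int, int] = {}
    taken = set()
    for step in range(NUM_PORTS):
        port = (token + step) % NUM_PORTS
        wanted = requests[port]
        if wanted is not None and wanted not in taken:
            grants[port] = wanted
            taken.add(wanted)
    return grants

def next_token(token: int) -> int:
    """Pass the token to the next port on the ring."""
    return (token + 1) % NUM_PORTS

# Inputs 0 and 1 both want output 2; the same requests are replayed for two rounds.
requests = [2, 2, None, 0]
token = 0
schedule = []
for _ in range(2):
    schedule.append(arbitrate(requests, token))
    token = next_token(token)
print(schedule)  # [{0: 2, 3: 0}, {1: 2, 3: 0}]
```

Input 0 wins the contended output while it holds the token and input 1 wins in the next round, so no input can be starved for more than one rotation of the token.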
31Rotating Crossbar Illustrated
32Rotating Crossbar Illustrated
33Phases of the Algorithm
Figure: phases on the TILE PROCESSOR and the SWITCH PROCESSOR (labels: headers_request, headers, send_prev_config, choose_new_config, route_body, confirm, update_token)
34Distributed Scheduling Algorithm
- Let's enumerate the number of configurations
- SPACE = Hdr0 x ... x Hdr3 x Token,
- where Hdr0 = ... = Hdr3 = 5,
- and Token = 4
- therefore
- SPACE = 5^4 x 4 = 2,500 distinct configurations
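The count above can be checked by brute-force enumeration. This is a minimal sketch that uses only the counts given on the slide (five values per header, four token locations); what each header value means is not modeled.

```python
from itertools import product

HEADER_VALUES = range(5)   # 5 possible values per header, per the slide
TOKEN_VALUES = range(4)    # 4 possible token locations, per the slide
NUM_HEADERS = 4            # Hdr0 .. Hdr3

# One tuple per global configuration: (hdr0, hdr1, hdr2, hdr3, token).
configs = list(product(*([HEADER_VALUES] * NUM_HEADERS), TOKEN_VALUES))
space = len(configs)
assert space == 5 ** 4 * 4
print(space)  # 2500
```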
35So What?...
- Each tile has 8,192 words of instruction memory, same for the switch
- → 8,192 / 2,500 ≈ 3.3 instructions per configuration → not enough! → need to use off-chip memory → slow!
- → need to minimize SPACE
36Minimization
Figure: connections of a crossbar tile (labels: in, out, cwnext, cwprev, ccwnext, ccwprev)
37Clients and Servers of a Crossbar Processor
38Outcome of Minimization
- We cut down the number of configurations by 78 times! Now there are only 32 entries!
- → the program can fit in the local instruction memory!
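The effect of the minimization on the instruction budget follows from the numbers on the slides (8,192 words of instruction memory, 2,500 configurations before, 32 entries after). A short sketch of the arithmetic, with the hypothetical helper name `words_per_config`:

```python
INSTRUCTION_WORDS = 8192   # words of instruction memory per tile, from the slides
CONFIGS_BEFORE = 2500      # full configuration space
CONFIGS_AFTER = 32         # entries left after minimization

def words_per_config(memory_words: int, configs: int) -> float:
    """Instruction words available for each configuration."""
    return memory_words / configs

reduction = CONFIGS_BEFORE / CONFIGS_AFTER
before = words_per_config(INSTRUCTION_WORDS, CONFIGS_BEFORE)
after = words_per_config(INSTRUCTION_WORDS, CONFIGS_AFTER)
print(f"reduction: {reduction:.1f}x, budget: {before:.1f} -> {after:.0f} words per configuration")
```

The budget grows from about 3.3 to 256 instruction words per configuration, which is why the program fits in local memory.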
39Implementation
- Raw Router was tested in a cycle-accurate simulator of the Raw processor and the FPGA emulator
- Raw prototype clock speed is assumed to be 250 MHz
- The focus of research is on switch fabric, NOT on route lookup, etc.
40Peak Throughput
41Average Throughput
42Future Work
- Take advantage of dynamic networks
- Implement IP route lookup
- Add computation on data (encryption)
- Add support of multicast traffic
- Implement Quality of Service
- Add virtual output queueing
- Explore larger router configurations
43Conclusion
- Implemented a gigabit switch on Raw
- Mapped dynamic communication to static interconnect
- Can intermix switch fabric with computation
- High-bandwidth I/O allows performance of custom ASIC processors
44We are on
- Raw Processor Overview
- Raw Router Architecture
- Related Work: SimpleFit Analysis Framework
- Discussion
45SimpleFit
- Moritz et al., 2001
- A Framework for Analyzing Design Tradeoffs in Raw Architectures
46Analytical Framework
47Architecture Model
Constrained optimization problem: find P, p, c, m to minimize T = max(Tp, Tc) subject to B ≥ K, where Tp and Tc are the performance of the application in terms of processing and communication, B is the area budget, and K is the cost
48Cost Model
- Processor
- Memory
- Communication
- Global communication
- Global latency
49Application Model
- Required processing per node
- Required amount of memory words
- Required number of words of local communication per node
- Required local communication events
- Required latency of events
- Required global communication
- Required global communication events
50Performance Functions
- The maximum of the runtimes in terms of processing and communication
51Optimization Problem
- Constraint-based nonlinear optimization problem
- Given a fixed chip area or budget and problem size
- Objective: minimum runtime
- Constraints: budget, balanced local and global computation and communication, sufficiency of memory on a tile
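The shape of this optimization problem can be sketched with an exhaustive search. Everything below is a hypothetical toy: the cost and performance formulas, the constants, and the two design variables (tile count and per-tile bandwidth) are invented for illustration and are not the SimpleFit models. Only the structure, minimizing T = max(Tp, Tc) under a budget constraint, follows the slides.

```python
from itertools import product
from typing import Optional, Tuple

# Hypothetical toy constants; none of these come from SimpleFit.
WORK = 64_000           # total operations in the application
COMM_PER_TILE = 8_000   # words each tile must exchange
TILE_COST = 10          # area units per processor tile
LINK_COST = 1           # area units per unit of per-tile bandwidth

def processing_time(tiles: int) -> float:
    """Tp: work is split evenly across the tiles."""
    return WORK / tiles

def communication_time(bandwidth: int) -> float:
    """Tc: each tile moves its words at the given bandwidth."""
    return COMM_PER_TILE / bandwidth

def cost(tiles: int, bandwidth: int) -> int:
    """K: total area of all tiles and their links."""
    return tiles * (TILE_COST + LINK_COST * bandwidth)

def best_design(budget: int) -> Optional[Tuple[float, int, int]]:
    """Search (tiles, bandwidth) for the minimum T = max(Tp, Tc) with K <= budget."""
    best = None
    for tiles, bandwidth in product(range(1, 65), range(1, 33)):
        if cost(tiles, bandwidth) > budget:
            continue  # violates the area budget constraint
        runtime = max(processing_time(tiles), communication_time(bandwidth))
        if best is None or runtime < best[0]:
            best = (runtime, tiles, bandwidth)
    return best

result = best_design(budget=400)
print(result)
```

In this toy model, spending the whole budget on tiles leaves communication as the bottleneck, so the best design trades some tiles for bandwidth until the two runtimes are roughly balanced.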
52Results
- Application-specific results
- Sensitivity of grain size
- Sensitivity to different processor cost model assumptions
- Sensitivity to communication overlapping assumptions
- Design comparisons
53Example: Processors vs. Problem Size
54Conclusions of the Talk
- Raw is good for streaming applications and combining computation with communication
- StreamIt is a good interface to tiled architectures
- Routing on Raw achieves the performance of custom ASICs, but remains flexible
- SimpleFit framework provides good reasoning about Raw
55We are on
- Raw Processor Overview
- Raw Router Architecture
- Related Work: SimpleFit Analysis Framework
- Discussion
56Discussion
- Questions?
- Comments?
- Ideas?
57References
- Check out the website
- http://cag.lcs.mit.edu