High-Bandwidth Packet Switching on the Raw General-Purpose Architecture

1
High-Bandwidth Packet Switching on the Raw
General-Purpose Architecture

Gleb Chuvpilo
Saman Amarasinghe
MIT LCS Computer Architecture Group
January 9, 2003

2
Talk at a Glance

Raw Processor Overview
Raw Router Architecture
Related Work: SimpleFit Analysis Framework
Discussion

3
We are on

Raw Processor Overview
Raw Router Architecture
Related Work: SimpleFit Analysis Framework
Discussion

4
What is Raw?

Taylor, 1999

Next-Generation General-Purpose Processor!
5
More Specifically, Raw is

A scalable computation fabric
4 x 4 mesh of tiles, each tile is a simple
microprocessor
Ultra fast interconnect network
Exposes the wires to the compiler
Compiler orchestrates the communication

6
Raw Facts

Performance
16 OPS/FLOPS per cycle
462 Gb/s of on-chip bisection bandwidth
201 Gb/s I/O bandwidth
57 GB/s of on-chip memory bandwidth

7
Raw Facts

Layout
Longest wire is the length of a tile → fast
clocking!
16 tiles
Each tile
MIPS R4000 router interconnect
32 KB IMEM
32 KB data cache
64 KB SMEM

8
Raw Facts

Instruction Set Architecture
Eight-stage pipeline: FETCH, DECODE, RF/STALL,
EXE, MUL, MEM, FPU
MIPS instruction set
28 general-purpose registers
4 register-mapped network ports
2-way set-associative cache, 3-cycle latency,
32-byte lines

9
Raw Facts

Implementation
ASIC @ 250 MHz
122 million transistors (P4: 43 million)
18.2mm x 18.2mm die (P4: 15mm x 15mm)
1080 signal I/O pins
25 Watts
IBM SA-27E, 6-layer metal, copper, 0.15µ process
(P4: 0.13µ)

10
Raw Layout
11
Communication Mechanisms

2 static networks
2 dynamic networks

12
Static Networks

Destinations known at compile time
Message size known at compile time
Cycle-by-cycle switch schedule
Three-cycle nearest neighbor send-to-use latency
No processing overhead

13
Static Network Send
14
Static Network Receive
15
Dynamic Networks

Unpredictable events
External asynchronous interrupts
Cache misses
15- to 30-cycle nearest neighbor send-to-use
latency (message header processing overhead)
Wormhole-routed, two-stage pipelined,
dimension-ordered
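
As a minimal sketch of what "dimension-ordered" means on a 4 x 4 mesh: a message first travels along one dimension until that coordinate matches the destination, then along the other. The function name `xy_route`, the (x, y) tile coordinates, and the choice of X before Y are assumptions made for illustration; the slides do not say which dimension Raw's dynamic networks resolve first.

```python
def xy_route(src, dst):
    """Return the tiles visited from src to dst, routing X first, then Y.

    src and dst are (x, y) tile coordinates on a mesh (hypothetical layout).
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Phase 1: move along the X dimension until the column matches.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Phase 2: move along the Y dimension until the row matches.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (3, 2)))
# [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2)]
```

Because every message finishes one dimension before starting the other, the route depends only on the source and destination tiles, so no routing table is needed.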

16
How to Program? StreamIt!

Thies et al., 2001
Hierarchical structures
Pipeline
SplitJoin
Feedback Loop
Basic programmable unit: Filter
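
The structures above can be imitated in a few lines of Python to show how filters compose. This is an analogy, not StreamIt syntax: `pipeline`, `splitjoin`, `scale`, and `offset` are hypothetical helpers, streams are modeled as plain lists, and the duplicate splitter with a round-robin joiner is one assumed choice among several. The feedback loop is left out.

```python
from itertools import chain


def pipeline(*stages):
    """Connect stages in sequence: the output of one feeds the next."""
    def run(stream):
        for stage in stages:
            stream = stage(stream)
        return stream
    return run


def splitjoin(*branches):
    """Duplicate the input to every branch, then join round-robin."""
    def run(stream):
        items = list(stream)
        outputs = [list(branch(items)) for branch in branches]
        # Round-robin join: take one item from each branch in turn.
        return list(chain.from_iterable(zip(*outputs)))
    return run


def scale(k):
    """Filter that multiplies every item by k."""
    return lambda stream: [k * v for v in stream]


def offset(k):
    """Filter that adds k to every item."""
    return lambda stream: [v + k for v in stream]


graph = pipeline(scale(2), splitjoin(offset(0), offset(100)))
print(graph([1, 2, 3]))  # [2, 102, 4, 104, 6, 106]
```

The graph is built without any reference to tiles or networks, which mirrors the point that the stream graph stays architecture independent while the compiler decides the mapping.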

17
StreamIt In Action
18
Compiling StreamIt

StreamIt language exposes the data movement
Graph structure is architecture independent
Each architecture is different in granularity and
topology
Communication is exposed to the compiler
The compiler needs to efficiently bridge the
abstraction
Map the computation and communication pattern of
the program to the processors, memory and the
communication substrate
The StreamIt Compiler
Partitioning
Placement
Scheduling
Code generation

19
We are on

Raw Processor Overview
Raw Router Architecture
Related Work: SimpleFit Analysis Framework
Discussion

20
Motivation

Build a fast IP router on a general-purpose
architecture
Why?
Flexibility → new protocols and services
Price → economies of scale

21
Raw Router

Chuvpilo et al., 2002
Features
4-port edge router
3.3 Mpps
26.9 Gbps
uses one Raw static network to stream data

22
What is Routing? RM OSI
23
Architecture of Internet Routers
24
Switch Fabric
25
Click Modular Router
26
Problem: Four Networks
27
and Sixteen Tiles
28
What is the Mapping?
?
Static Interconnect
Dynamic Communication
29
Solution: Rotating Crossbar
[Figure: rotating crossbar with inputs In 0 to In 3 and outputs Out 0 to Out 3]
30
Switch Fabric Design

The idea of a Token Ring network → absolute
fairness
Algorithm uses two static networks, dynamic
networks are idle
All deadlock-free configurations are scheduled
at compile time
Four headers and token location define a global
configuration
Global configuration is computed in a distributed
manner at run time
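
A much-simplified sketch of token-based arbitration, under loud assumptions: each input port presents a header that requests one output port (or `None` when idle), the token holder is served first, and the remaining inputs are considered in ring order. The names `choose_config` and `update_token` echo phase names from the talk, but their bodies here are hypothetical, and the model ignores the ring-link conflicts and deadlock analysis that the real rotating crossbar resolves at compile time.

```python
NUM_PORTS = 4


def choose_config(headers, token):
    """Return {input: output} grants for one round.

    headers[i] is the output port requested by input i, or None if idle.
    The input holding the token has highest priority this round.
    """
    granted = {}
    taken = set()
    for k in range(NUM_PORTS):
        i = (token + k) % NUM_PORTS  # walk the ring starting at the token
        out = headers[i]
        if out is not None and out not in taken:
            granted[i] = out
            taken.add(out)
    return granted


def update_token(token):
    """Pass the token to the next port so priority rotates every round."""
    return (token + 1) % NUM_PORTS


headers = [2, 2, 0, None]  # inputs 0 and 1 both want output 2
token = 0
for _ in range(2):
    print(token, choose_config(headers, token))
    token = update_token(token)
# 0 {0: 2, 2: 0}
# 1 {1: 2, 2: 0}
```

Inputs 0 and 1 contend for the same output; whichever is closer to the token wins, and rotating the token each round is what gives every port an equal turn.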

31
Rotating Crossbar Illustrated
32
Rotating Crossbar Illustrated
33
Phases of the Algorithm
TILE PROCESSOR
SWITCH PROCESSOR
headers_request
headers
send_prev_config
choose_new_config
route_body
confirm
update_token
34
Distributed Scheduling Algorithm

Let's enumerate the number of configurations:
SPACE = Hdr0 × ... × Hdr3 × Token,
where Hdr0 = ... = Hdr3 = 5,
and Token = 4
therefore
SPACE = 5^4 × 4 = 2,500 distinct configurations
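
The count can be checked by brute-force enumeration. The slide gives only the sizes (5 values per header, 4 token positions); reading the 5 header values as "no packet" plus one of four output ports is an assumption, as are the names used below.

```python
from itertools import product

# Assumption: a header is either empty or names one of four output ports.
HEADER_VALUES = ["empty", "out0", "out1", "out2", "out3"]
TOKEN_POSITIONS = range(4)

# One global configuration = four headers plus the token location.
configs = [
    (headers, token)
    for headers in product(HEADER_VALUES, repeat=4)
    for token in TOKEN_POSITIONS
]

print(len(configs))  # 2500
```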

35
So What?...

Each tile has 8,192 words of instruction memory,
same for switch
→ 8,192 / 2,500 ≈ 3.3 instructions per
configuration → not enough! → need to use
off-chip memory → slow!
→ need to minimize SPACE

36
Minimization
[Figure: crossbar tile connections labeled in, out, cwnext, cwprev, ccwnext, ccwprev]
37
Clients and Servers of a Crossbar Processor
38
Outcome of Minimization

We cut the number of configurations by a factor
of 78! Now there are only 32 entries
→ the program can fit in the local instruction
memory!
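
The arithmetic behind this result, as a quick check. The figures 8,192, 2,500, and 32 come from the slides; treating the whole instruction memory as available to the configuration table, with one instruction per word, is a simplifying assumption.

```python
IMEM_WORDS = 8192      # instruction memory words per tile (from the slides)
FULL_SPACE = 2500      # configurations before minimization
MINIMIZED_SPACE = 32   # entries after minimization

# Instruction budget per configuration before and after minimization.
per_config_full = IMEM_WORDS / FULL_SPACE
per_config_min = IMEM_WORDS // MINIMIZED_SPACE

# How many times smaller the configuration table became.
reduction = FULL_SPACE / MINIMIZED_SPACE

print(round(per_config_full, 1))  # 3.3
print(per_config_min)             # 256
print(round(reduction))           # 78
```

With 32 entries, each configuration can use up to 256 instruction words under this assumption, compared with roughly 3 before minimization.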

39
Implementation

Raw Router was tested in a cycle-accurate
simulator of the Raw processor and on the FPGA
emulator
Raw prototype clock speed is assumed to be 250
MHz
The focus of research is on switch fabric, NOT on
route lookup, etc.

40
Peak Throughput
41
Average Throughput
42
Future Work

Take advantage of dynamic networks
Implement IP route lookup
Add computation on data (encryption)
Add support of multicast traffic
Implement Quality of Service
Add virtual output queueing
Explore larger router configurations

43
Conclusion

Implemented a gigabit switch on Raw
Mapped dynamic communication to static
interconnect
Can intermix switch fabric with computation
High-bandwidth I/O allows performance of custom
ASIC processors

44
We are on

Raw Processor Overview
Raw Router Architecture
Related Work: SimpleFit Analysis Framework
Discussion

45
SimpleFit

Moritz et al., 2001
A Framework for Analyzing Design Tradeoffs in Raw
Architectures

46
Analytical Framework
47
Architecture Model
Constrained optimization problem: find P, p, c,
m to minimize T = max(Tp, Tc) subject to B ≥ K,
where Tp and Tc are the performance of the
application in terms of processing and
communication, B is the area budget, and K is
the cost
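
To make the shape of this optimization concrete, here is a toy brute-force search. Everything in it is hypothetical and is not SimpleFit's actual model: P is read as the number of tiles, p, c, and m as per-tile processor, communication, and memory resources, and the runtime and cost formulas are invented placeholders chosen only so the example runs.

```python
from itertools import product


def runtime(P, p, c, work=1e6, traffic=2e5):
    """Toy T = max(Tp, Tc); both terms shrink as resources grow."""
    Tp = work / (P * p)     # invented processing-time model
    Tc = traffic / (P * c)  # invented communication-time model
    return max(Tp, Tc)


def cost(P, p, c, m):
    """Toy cost K: per-tile areas add up linearly across P tiles."""
    return P * (p + c + m)


def best_design(B, m_required=4):
    """Search a small grid for the design minimizing T subject to K <= B."""
    best = None
    for P, p, c in product(range(1, 65), range(1, 17), range(1, 17)):
        m = m_required  # memory fixed at what the application needs
        if cost(P, p, c, m) > B:
            continue  # over budget, so infeasible
        T = runtime(P, p, c)
        if best is None or T < best[0]:
            best = (T, P, p, c, m)
    return best


T, P, p, c, m = best_design(B=256)
print(f"T={T:.1f} with P={P}, p={p}, c={c}, m={m}")
```

The search space here is small enough to enumerate; a real framework would use analytical models and a nonlinear solver rather than a grid.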
48
Cost Model

Processor
Memory
Communication
Global communication
Global latency

49
Application Model

Required processing per node
Required amount of memory words
Required number of words of local communication
per node
Required local communication events
Required latency of events
Required global communication
Required global communication events

50
Performance Functions

The maximum of the runtimes in terms of
processing and communication

51
Optimization Problem

Constraint-based nonlinear optimization problem
Given: a fixed chip area or budget and problem
size
Objective: minimum runtime
Constraints: budget, balanced local and global
computation and communication, sufficiency of
memory on a tile

52
Results

Application-specific results
Sensitivity of grain size
Sensitivity to different processor cost model
assumptions
Sensitivity to communication overlapping
assumptions
Design comparisons

53
Example: Processors vs. Problem Size
54
Conclusions of the Talk

Raw is good for streaming applications and
combining computation with communication
StreamIt is a good interface to tiled
architectures
Routing on Raw achieves the performance of custom
ASICs, but remains flexible
SimpleFit framework provides good reasoning about
Raw

55
We are on