High-Bandwidth Packet Switching on the Raw General-Purpose Architecture

1
High-Bandwidth Packet Switching on the Raw
General-Purpose Architecture
  • Gleb Chuvpilo
  • Saman Amarasinghe
  • MIT LCS Computer Architecture Group
  • January 9, 2003

2
Talk at a Glance
  • Raw Processor Overview
  • Raw Router Architecture
  • Related Work: SimpleFit Analysis Framework
  • Discussion

3
We are on
  • Raw Processor Overview
  • Raw Router Architecture
  • Related Work: SimpleFit Analysis Framework
  • Discussion

4
What is Raw?
  • Taylor, 1999

Next-Generation General-Purpose Processor!
5
More Specifically, Raw is
  • A scalable computation fabric
  • 4 x 4 mesh of tiles, each tile is a simple
    microprocessor
  • Ultra fast interconnect network
  • Exposes the wires to the compiler
  • Compiler orchestrates the communication

6
Raw Facts
  • Performance
  • 16 OPS/FLOPS per cycle
  • 462 Gb/s of on-chip bisection bandwidth
  • 201 Gb/s I/O bandwidth
  • 57 GB/s of on-chip memory bandwidth

7
Raw Facts
  • Layout
  • Longest wire is the length of a tile → fast
    clocking!
  • 16 tiles
  • Each tile
  • MIPS R4000 router interconnect
  • 32 KB IMEM
  • 32 KB data cache
  • 64 KB SMEM

8
Raw Facts
  • Instruction Set Architecture
  • Eight-stage pipeline: FETCH, DECODE, RF/STALL,
    EXE, MUL, MEM, FPU
  • MIPS instruction set
  • 28 general-purpose registers
  • 4 register-mapped network ports
  • 2-way set-associative cache, 3-cycle latency,
    32-byte lines

9
Raw Facts
  • Implementation
  • ASIC @ 250 MHz
  • 122 million transistors (P4: 43 million)
  • 18.2 mm x 18.2 mm die (P4: 15 mm x 15 mm)
  • 1080 signal I/O pins
  • 25 Watts
  • IBM SA-27E 6-layer-metal copper 0.15 µm process
    (P4: 0.13 µm)

10
Raw Layout
11
Communication Mechanisms
  • 2 static networks
  • 2 dynamic networks

12
Static Networks
  • Destinations known at compile time
  • Message size known at compile time
  • Cycle-by-cycle switch schedule
  • Three-cycle nearest neighbor send-to-use latency
  • No processing overhead

13
Static Network Send
14
Static Network Receive
15
Dynamic Networks
  • Unpredictable events
  • External asynchronous interrupts
  • Cache misses
  • 15- to 30-cycle nearest neighbor send-to-use
    latency (message header processing overhead)
  • Wormhole routed, two-stage pipelined,
    dimension-ordered
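
Dimension-ordered routing can be sketched in a few lines. The example assumes a mesh addressed by (x, y) coordinates and assumes the X dimension is resolved first; the slide only says "dimension-ordered", so the order and the function name `xy_route` are illustrative choices.

```python
def xy_route(src, dst):
    """Return the tiles visited from src to dst, moving along X first, then Y.

    src and dst are (x, y) tile coordinates on a 2D mesh.
    """
    x, y = src
    path = [(x, y)]
    # Resolve the X dimension completely before touching Y.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Then resolve the Y dimension.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = xy_route((0, 0), (3, 2))
print(route)
print("hops:", len(route) - 1)
```

Fixing the order of dimensions means a message never turns back into a dimension it has already finished, which is the usual argument for why this scheme avoids routing cycles on a mesh.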

16
How to Program? StreamIt!
  • Thies et al., 2001
  • Hierarchical structures
  • Pipeline
  • SplitJoin
  • Feedback Loop
  • Basic programmable unit: Filter
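
The hierarchical structures can be imitated in Python to show how they compose. This is not StreamIt syntax: the classes below are hypothetical stand-ins, every filter is assumed to consume one item and produce one item, and the SplitJoin is assumed to use a duplicating splitter with a round-robin joiner. The feedback loop is left out.

```python
class Filter:
    """Basic unit: applies a work function to each item of the stream."""

    def __init__(self, work):
        self.work = work

    def run(self, items):
        return [self.work(x) for x in items]


class Pipeline:
    """Runs its stages one after another on the same stream."""

    def __init__(self, *stages):
        self.stages = stages

    def run(self, items):
        for stage in self.stages:
            items = stage.run(items)
        return items


class SplitJoin:
    """Duplicates the stream to every branch, then joins round-robin."""

    def __init__(self, *branches):
        self.branches = branches

    def run(self, items):
        outs = [branch.run(items) for branch in self.branches]
        return [value for group in zip(*outs) for value in group]


graph = Pipeline(
    Filter(lambda x: x + 1),
    SplitJoin(Filter(lambda x: x * 2), Filter(lambda x: -x)),
)
result = graph.run([1, 2, 3])
print(result)
```

Since a Pipeline and a SplitJoin expose the same `run` interface as a Filter, they can be nested inside each other, which is what makes the structure hierarchical.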

17
StreamIt In Action
18
Compiling StreamIt
  • StreamIt language exposes the data movement
  • Graph structure is architecture independent
  • Each architecture is different in granularity and
    topology
  • Communication is exposed to the compiler
  • The compiler needs to efficiently bridge the
    abstraction
  • Map the computation and communication pattern of
    the program to the processors, memory and the
    communication substrate
  • The StreamIt Compiler
  • Partitioning
  • Placement
  • Scheduling
  • Code generation

19
We are on
  • Raw Processor Overview
  • Raw Router Architecture
  • Related Work: SimpleFit Analysis Framework
  • Discussion

20
Motivation
  • Build a fast IP router on a general-purpose
    architecture
  • Why?
  • Flexibility → new protocols and services
  • Price → economies of scale

21
Raw Router
  • Chuvpilo et al., 2002
  • Features
  • 4-port edge router
  • 3.3 Mpps
  • 26.9 Gbps
  • uses one Raw static network to stream data

22
What is Routing? (OSI Reference Model)
23
Architecture of Internet Routers
24
Switch Fabric
25
Click Modular Router
26
Problem: Four Networks
27
and Sixteen Tiles
28
What is the Mapping?
[Figure: how to map Dynamic Communication onto the Static Interconnect?]
29
Solution: Rotating Crossbar
[Figure: rotating crossbar with inputs In 0 to In 3 and outputs Out 0 to Out 3]
30
Switch Fabric Design
  • The idea of a Token Ring network → absolute
    fairness
  • The algorithm uses two static networks; the
    dynamic networks are idle
  • All deadlock-free configurations are scheduled
    at compile time
  • Four headers and token location define a global
    configuration
  • Global configuration is computed in a distributed
    manner at run time
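
The fairness idea can be sketched as a token-priority arbiter. This is a deliberately simplified model and not the rotating crossbar algorithm itself: it only resolves output-port conflicts, it ignores conflicts on the ring links between crossbar tiles, and the names `arbitrate` and `next_token` are hypothetical.

```python
def arbitrate(requests, token):
    """Grant output ports to inputs, visiting inputs in ring order from the token.

    requests[i] is the output port wanted by input i, or None if input i is idle.
    Returns a dict mapping each granted input to its output port.
    """
    n = len(requests)
    grants = {}
    taken = set()
    for k in range(n):
        i = (token + k) % n  # the token holder is served first
        out = requests[i]
        if out is not None and out not in taken:
            grants[i] = out
            taken.add(out)
    return grants


def next_token(token, n=4):
    """Pass the token to the next input port on the ring."""
    return (token + 1) % n


# Inputs 0 and 1 both want output 2; input 2 is idle; input 3 wants output 0.
requests = [2, 2, None, 0]
first = arbitrate(requests, token=0)
second = arbitrate(requests, token=next_token(0))
print(first)
print(second)
```

In the first round input 0 holds the token and wins output 2; after the token moves, input 1 wins it instead. Rotating the priority this way is what keeps any one port from being starved.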

31
Rotating Crossbar Illustrated
32
Rotating Crossbar Illustrated
33
Phases of the Algorithm
[Figure: phases split between the TILE PROCESSOR and the SWITCH PROCESSOR: headers_request, headers, send_prev_config, choose_new_config, route_body, confirm, update_token]
34
Distributed Scheduling Algorithm
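  • Let's enumerate the number of configurations:
  • $|\text{SPACE}| = |\text{Hdr}_0| \times \cdots \times |\text{Hdr}_3| \times |\text{Token}|$,
  • where $|\text{Hdr}_0| = \cdots = |\text{Hdr}_3| = 5$,
  • and $|\text{Token}| = 4$
  • therefore
  • $|\text{SPACE}| = 5^4 \times 4 = 2{,}500$ distinct configurations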
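
The count can be checked by brute-force enumeration. The slide gives only the sizes (five values per header, four token positions); reading the five header values as "one of four output ports, or no packet" is an assumption made here for illustration.

```python
from itertools import product

NUM_PORTS = 4
# Assumed meaning: header value 0..3 = requested output port, 4 = no packet.
HEADER_VALUES = range(5)
TOKEN_POSITIONS = range(4)

# One global configuration = (Hdr0, Hdr1, Hdr2, Hdr3, Token).
configs = list(product(*([HEADER_VALUES] * NUM_PORTS), TOKEN_POSITIONS))
print(len(configs))
```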

35
So What?...
  • Each tile has 8,192 words of instruction memory,
    same for the switch
  • → 8,192 / 2,500 ≈ 3.3 instructions per
    configuration → not enough! → need to use
    off-chip memory → slow!
  • → need to minimize SPACE

36
Minimization
[Figure: ports of a crossbar tile: in, out, cwnext, cwprev, ccwnext, ccwprev]
37
Clients and Servers of a Crossbar Processor
38
Outcome of Minimization
  • We cut the number of configurations by a factor
    of 78! Now there are only 32 entries!
  • → the program can fit in the local instruction
    memory!
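
The instruction budget before and after minimization follows from the figures already quoted (8,192 words of instruction memory, 2,500 versus 32 configurations). The sketch assumes, for simplicity, that the whole instruction memory is available for per-configuration code, which overstates the real budget.

```python
IMEM_WORDS = 8192        # instruction memory per tile, in words
FULL_SPACE = 2500        # configurations before minimization
MINIMIZED_SPACE = 32     # configurations after minimization


def instructions_per_config(imem_words, num_configs):
    """Average number of instruction words available per configuration."""
    return imem_words / num_configs


before = instructions_per_config(IMEM_WORDS, FULL_SPACE)
after = instructions_per_config(IMEM_WORDS, MINIMIZED_SPACE)
reduction = FULL_SPACE / MINIMIZED_SPACE

print(f"before: {before:.1f} instructions per configuration")
print(f"after:  {after:.0f} instructions per configuration")
print(f"reduction factor: {reduction:.0f}x")
```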

39
Implementation
  • The Raw Router was tested in a cycle-accurate
    simulator of the Raw processor and on the FPGA
    emulator
  • Raw prototype clock speed is assumed to be 250
    MHz
  • The focus of research is on switch fabric, NOT on
    route lookup, etc.

40
Peak Throughput
41
Average Throughput
42
Future Work
  • Take advantage of dynamic networks
  • Implement IP route lookup
  • Add computation on data (encryption)
  • Add support for multicast traffic
  • Implement Quality of Service
  • Add virtual output queueing
  • Explore larger router configurations

43
Conclusion
  • Implemented a gigabit switch on Raw
  • Mapped dynamic communication to static
    interconnect
  • Can intermix switch fabric with computation
  • High-bandwidth I/O allows performance of custom
    ASIC processors

44
We are on
  • Raw Processor Overview
  • Raw Router Architecture
  • Related Work: SimpleFit Analysis Framework
  • Discussion

45
SimpleFit
  • Moritz et al., 2001
  • A Framework for Analyzing Design Tradeoffs in Raw
    Architectures

46
Analytical Framework
47
Architecture Model
Constrained optimization problem: find $P$, $p$, $c$,
$m$ to minimize $T = \max(T_p, T_c)$ subject to $B \ge K$,
where $T_p$, $T_c$ are the performance of the application
in terms of processing and communication, $B$ is the area
budget, and $K$ is the cost
48
Cost Model
  • Processor
  • Memory
  • Communication
  • Global communication
  • Global latency

49
Application Model
  • Required processing per node
  • Required amount of memory words
  • Required number of words of local communication
    per node
  • Required local communication events
  • Required latency of events
  • Required global communication
  • Required global communication events

50
Performance Functions
  • The maximum of the runtimes in terms of
    processing and communication

51
Optimization Problem
  • Constraint-based nonlinear optimization problem
  • Given: a fixed chip area or budget and problem
    size
  • Objective: minimum runtime
  • Constraints: budget, balanced local and global
    computation and communication, sufficiency of
    memory on a tile
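
The shape of the problem (minimize the larger of the processing and communication runtimes, subject to a budget) can be shown with a toy search. Everything numeric below is hypothetical: the runtime and cost models are placeholders rather than the SimpleFit models, and only the number of processors is varied, while the framework's other design variables are ignored.

```python
def runtime(num_procs, work=1024.0, comm_per_proc=4.0):
    """Toy runtime: T = max(Tp, Tc).

    Tp shrinks as work is divided among processors;
    Tc grows with the number of processors (hypothetical model).
    """
    t_proc = work / num_procs
    t_comm = comm_per_proc * num_procs
    return max(t_proc, t_comm)


def cost(num_procs, tile_cost=10.0):
    """Toy cost model: area grows linearly with the number of tiles."""
    return num_procs * tile_cost


def best_design(budget, max_procs=64):
    """Exhaustively pick the feasible processor count with minimum runtime."""
    feasible = [p for p in range(1, max_procs + 1) if cost(p) <= budget]
    if not feasible:
        raise ValueError("budget too small for even one processor")
    return min(feasible, key=runtime)


print(best_design(budget=1000))  # budget is not the binding constraint
print(best_design(budget=80))    # budget limits the design
```

With a generous budget the optimum sits where the two runtimes balance; with a tight budget the best design simply uses all the area it can afford.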

52
Results
  • Application-specific results
  • Sensitivity of grain size
  • Sensitivity to different processor cost model
    assumptions
  • Sensitivity to communication overlapping
    assumptions
  • Design comparisons

53
Example: Processors vs. Problem Size
54
Conclusions of the Talk
  • Raw is good for streaming applications and
    combining computation with communication
  • StreamIt is a good interface to tiled
    architectures
  • Routing on Raw achieves the performance of custom
    ASICs, but remains flexible
  • SimpleFit framework provides good reasoning about
    Raw

55
We are on
  • Raw Processor Overview
  • Raw Router Architecture
  • Related Work: SimpleFit Analysis Framework
  • Discussion

56
Discussion
  • Questions?
  • Comments?
  • Ideas?

57
References
  • Check out the website
  • http://cag.lcs.mit.edu