Title: Interconnection Networks
1Interconnection Networks
2Overview
- Physical Layer and Message Switching
- Network Topologies
- Metrics
- Deadlock and Livelock
- Routing Layer
- The Messaging Layer
3Interconnection Networks
- Fabric for scalable, multiprocessor architectures
- Distinct from traditional networking architectures such as Internet Protocol (IP) based systems
- We are interested in applications to large clusters as well as embedded systems
4CLUX: A Beowulf Cluster
Interconnection Network Cables
Myrinet Switch
Images from the Clux cluster at http://www.fyslab.hut.fi/clux/
5The Practical Problem
From Ambuj Goyal, "Computer Science Grand Challenge: Simplicity of Design," Computing Research Association Conference on "Grand Research Challenges" in Computer Science and Engineering, June 2002
6Example Embedded Devices
picoChip: http://www.picochip.com/
- Issues
  - Execution performance
  - Power dissipation
  - Number of chip types
  - Size and form factor
PACT XPP Technologies: http://www.pactcorp.com/
7Physical Layer and Message Switching
8Messaging Hierarchy
Routing Layer: Where? Destination decisions, i.e., which output port
Switching Layer: When? When is data forwarded
Physical Layer: How? Synchronization of data transfer
- This organization is distinct from traditional networking implementations
- Emphasis is on low latency communication
- Only recently have standards been evolving
  - Infiniband: http://www.infinibandta.org/home
9The Physical Layer
Data
Packets
checksum
header
Flit: flow control digit
Phit: physical flow control digit
- Data is transmitted based on a hierarchical data structuring mechanism
  - Messages → packets → flits → phits
- While flits and phits are fixed size, packets and data may be variable sized
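The message → packet → flit → phit hierarchy can be sketched as a small counting routine. This is a minimal illustration, not taken from the slides: the function name `segment` and all sizes (64-byte payload, 8-byte header and checksum, 8-byte flits, 2-byte phits) are assumed values chosen only to make the arithmetic visible.

```python
from math import ceil

def segment(message_bytes, packet_payload=64, header_bytes=8,
            checksum_bytes=8, flit_bytes=8, phit_bytes=2):
    """Count the packets, flits and phits needed to carry one message.

    All sizes are hypothetical defaults; real networks fix them in hardware.
    """
    packets = ceil(message_bytes / packet_payload)
    flits = 0
    remaining = message_bytes
    for _ in range(packets):
        payload = min(packet_payload, remaining)   # last packet may be short
        remaining -= payload
        packet_bytes = header_bytes + payload + checksum_bytes
        flits += ceil(packet_bytes / flit_bytes)   # flits are fixed size
    phits = flits * ceil(flit_bytes / phit_bytes)  # each flit crosses the wire as phits
    return packets, flits, phits

print(segment(100))
```

With these assumed sizes, a 100-byte message becomes two packets, and the second packet is shorter than the first, which is the variable-size case the slide mentions.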
10Flow Control
- Flow control digit: synchronized transfer of a unit of information
  - Based on buffer management
  - Asynchronous vs. synchronous flow control
- Flow control occurs at multiple levels
  - message flow control
  - physical flow control
- Mechanisms
  - Credit based flow control
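Credit based flow control can be sketched as a counter of free downstream buffer slots. The class below is a hypothetical model (the name `CreditLink` and its methods are not from the slides) that collapses the sender's credit counter and the receiver's buffer into one object for brevity.

```python
from collections import deque

class CreditLink:
    """Toy model of one link with credit based flow control."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # free receiver slots known to the sender
        self.rx_buffer = deque()      # flits currently held at the receiver

    def send(self, flit):
        """Transmit one flit if a credit is available; return True on success."""
        if self.credits == 0:
            return False              # no free slot downstream: the sender stalls
        self.credits -= 1
        self.rx_buffer.append(flit)
        return True

    def drain(self):
        """Receiver forwards one flit onward and returns a credit upstream."""
        if not self.rx_buffer:
            return None
        flit = self.rx_buffer.popleft()
        self.credits += 1
        return flit

link = CreditLink(buffer_slots=2)
print(link.send("a"), link.send("b"), link.send("c"))
```

The third `send` is refused because both slots are occupied; it succeeds only after `drain` frees a slot and hands the credit back.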
11Switching Layer
- Comprises three sets of techniques
  - switching techniques
  - flow control
  - buffer management
- Organization and operation of routers are largely determined by the switching layer
- Connection Oriented vs. Connectionless communication
12Generic Router Architecture
Wire delay
Switching delay
Routing delay
13Virtual Channels
- Each virtual channel is a pair of unidirectional channels
- Independently managed buffers multiplexed over the physical channel
- De-couples buffers from physical channels
- Originally introduced to break cyclic dependencies
- Improves performance through reduction of blocking delay
- Virtual lanes vs. virtual channels
- As the number of virtual channels increases, the increased channel multiplexing has two effects
  - decrease in header delay
  - increase in average data flit delay
- Impact on router performance
  - switch complexity
14Circuit Switching
[Figure: time-space diagram for circuit switching. Labels: Header Probe, Acknowledgment, Data, Link, Time Busy; delays t_r, t_s, t_setup, t_data]
- Hardware path setup by a routing header or probe
- End-to-end acknowledgment initiates transfer at full hardware bandwidth
- Source routing vs. distributed routing
- System is limited by signaling rate along the circuits
15Packet Switching
- Blocking delays in circuit switching avoided in packet switched networks → full link utilization in the presence of data
- Increased storage requirements at the nodes
- Packetization and in-order delivery requirements
- Buffering
  - use of local processor memory
  - central queues
16Virtual Cut-Through
[Figure: time-space diagram for virtual cut-through. Labels: Packet Header, Message Packet cuts through the Router, Link, Time Busy; delays t_w, t_blocking, t_r, t_s]
- Messages cut-through to the next router when feasible
- In the absence of blocking, messages are pipelined
  - pipeline cycle time is the larger of intra-router and inter-router flow control delays
- When the header is blocked, the complete message is buffered
- High load behavior approaches that of packet switching
17Wormhole Switching
[Figure: time-space diagram for wormhole switching. Labels: Header Flit, Single Flit, Link, Time Busy; delays t_r, t_s, t_wormhole]
- Messages are pipelined, but buffer space is on the order of a few flits
  - Small buffers and message pipelining → small compact buffers
- Supports variable sized messages
- Messages cannot be interleaved over a channel: routing information is only associated with the header
- Base latency is equivalent to that of virtual cut-through
18Comparison of Switching Techniques
- Packet switching and virtual cut-through
  - consume network bandwidth proportional to network load
  - predictable demands
  - VCT behaves like wormhole at low loads and like packet switching at high loads
  - link level error control for packet switching
- Wormhole switching
  - provides low latency
  - lower saturation point
  - higher variance of message latency than packet or VCT switching
- Virtual channels
  - blocking delay vs. data delay
  - router flow control latency
- Optimistic vs. conservative flow control
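The gap between packet switching and the pipelined techniques can be made concrete with a no-load latency model. The sketch below uses the delays named in the diagrams (t_r routing, t_s switching, t_w wire) and follows the general shape of textbook base-latency expressions. The exact terms, the one-flit header, and the function names are assumptions made for illustration, and contention is ignored entirely.

```python
from math import ceil

def store_and_forward_latency(hops, msg_bits, width_bits, t_r, t_s, t_w):
    """Packet switching: every hop waits for the whole packet (header + data)."""
    flits = ceil((msg_bits + width_bits) / width_bits)  # assume a one-flit header
    return hops * (t_r + (t_s + t_w) * flits)

def cut_through_latency(hops, msg_bits, width_bits, t_r, t_s, t_w):
    """Virtual cut-through or wormhole with no blocking: only the header
    pays the per-hop cost, and the data flits follow in a pipeline."""
    header = hops * (t_r + t_s + t_w)
    body = max(t_s, t_w) * ceil(msg_bits / width_bits)  # pipeline cycle time
    return header + body

args = dict(hops=4, msg_bits=256, width_bits=16, t_r=1, t_s=1, t_w=1)
print(store_and_forward_latency(**args), cut_through_latency(**args))
```

In this model the store-and-forward latency grows with the product of distance and message length, while the pipelined latency grows with their sum, which is why the two diverge quickly as either increases.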
19Saturation
20Network Topologies
21Motivation
- Crossbars provide full connectivity among ports, but cost and complexity grow quadratically in the number of ports
- Buses provide minimal connectivity and do not provide scalable performance
- Network topologies span a spectrum of solutions that trade off cost, performance (latency, bandwidth), reliability, and implementation complexity
22Direct Networks
- Fixed degree
- Modular
- Topologies
  - Meshes
  - Multidimensional tori
  - Special case of tori: the binary hypercube
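The relationship between tori and the hypercube can be checked with a short neighbor function for a k-ary n-cube, where each node is a tuple of n digits in 0..k-1. The function name and the node representation are choices made for this sketch.

```python
def torus_neighbors(node, k):
    """Neighbors of `node` in a k-ary n-cube (torus with wraparound links)."""
    result = set()
    for dim in range(len(node)):
        for step in (1, -1):
            digits = list(node)
            digits[dim] = (digits[dim] + step) % k  # wraparound in this dimension
            result.add(tuple(digits))
    result.discard(node)  # only matters for the degenerate k == 1 case
    return result

print(sorted(torus_neighbors((0, 0), 4)))     # 4-ary 2-cube: degree 4
print(sorted(torus_neighbors((0, 0, 0), 2)))  # 2-ary 3-cube: binary hypercube
```

With k = 2 the +1 and -1 steps reach the same node, so the degree drops from 2n to n and the structure is exactly the binary n-dimensional hypercube.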
23Indirect Networks
- Indirect networks
- uniform base latency
- centralized or distributed control
- Engineering approximations to direct networks
Multistage Network
Backward
Forward
Fat Tree Network
Bandwidth increases as you go up the tree
24Specific MINs
[Figure: example MINs drawn with port labels 000 through 111]
- Switch sizes and interstage interconnect establish distinct MINs
- Majority of interesting MINs have been shown to be topologically equivalent
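As one concrete case of how the interstage interconnect and the switches together determine routing, the sketch below routes through an Omega network: n stages of 2x2 switches, each preceded by a perfect shuffle. The Omega network is an assumption chosen for illustration; the slide does not say which MINs its figure shows. Routing is by destination tag, so each stage consumes one bit of the destination address, most significant bit first.

```python
def omega_route(src, dst, n):
    """Port positions after each of the n stages of an Omega network
    with 2**n ports, using destination-tag routing."""
    mask = (1 << n) - 1
    pos = src
    path = []
    for stage in range(n):
        # Perfect shuffle: rotate the n-bit position left by one bit.
        pos = ((pos << 1) | (pos >> (n - 1))) & mask
        # 2x2 switch: the next destination bit selects the upper (0)
        # or lower (1) output, which overwrites the low bit.
        bit = (dst >> (n - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path

print(omega_route(0b101, 0b010, 3))
```

After n stages every bit of the position has been replaced by a destination bit, so the last entry of the returned path always equals `dst` regardless of the source.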
25Metrics
26Evaluation Metrics
- Latency
  - Message transit time
  - Determined by switching technique and traffic patterns
- Node degree (channel width)
  - Number of input/output channels
  - This metric is determined by packaging constraints
  - pin/wiring constraints
- Diameter
- Path diversity
  - A measure of reliability
27Evaluation Metrics
[Figure: a network cut into two halves, labeled "bisection"]
- Bisection bandwidth
  - This is the minimum bandwidth across any bisection of the network
- Bisection bandwidth is a limiting attribute of performance
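Diameter and bisection can be explored numerically for a k-ary n-cube. The sketch below is an illustration with hypothetical function names. It finds the diameter by breadth-first search and counts the links crossing one particular cut, the one that halves dimension 0. Because bisection bandwidth is a minimum over all bisections, that count is only an upper bound on the bisection width; multiplying it by a per-link bandwidth (not modeled here) would give the matching bound on bisection bandwidth.

```python
from collections import deque
from itertools import product

def neighbors(node, k):
    """Neighbors of a node (tuple of digits) in a k-ary n-cube."""
    out = set()
    for dim in range(len(node)):
        for step in (1, -1):
            digits = list(node)
            digits[dim] = (digits[dim] + step) % k
            out.add(tuple(digits))
    out.discard(node)
    return out

def diameter(k, n):
    """Longest shortest path, by BFS from one node (the torus is symmetric)."""
    start = (0,) * n
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in neighbors(u, k):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def halving_cut_links(k, n):
    """Links crossing the cut that splits dimension 0 into two halves."""
    links = set()
    for u in product(range(k), repeat=n):
        for v in neighbors(u, k):
            if (u[0] < k // 2) != (v[0] < k // 2):
                links.add(frozenset((u, v)))  # count each link once
    return len(links)

print(diameter(4, 2), halving_cut_links(4, 2))
```

Running `diameter(32, 2)` and `diameter(10, 3)` compares two networks of roughly a thousand nodes each: the lower-dimensional network has the larger diameter.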
28Constant Resource Analysis: Bisection Width
29Constant Resource Analysis: Pin out
30Latency Under Contention
32-ary 2-cube vs. 10-ary 3-cube
31Deadlock and Livelock
32Deadlock and Livelock
router
Virtual Channel
- Deadlock freedom can be ensured by enforcing constraints
  - For example, following dimension order routing in 2D meshes
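Dimension order routing in a 2D mesh can be sketched in a few lines: a message first corrects its X coordinate completely and only then moves in Y. The function name and the (x, y) coordinate convention are choices made for this illustration.

```python
def xy_route(src, dst):
    """Dimension-order (XY) route in a 2D mesh: finish X, then Y.
    Returns the list of nodes visited, including both endpoints."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))
```

Since no route ever turns from the Y dimension back into X, a message holding a Y channel never waits on an X channel, which rules out the cyclic waits that cause deadlock.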
33Occurrence of Deadlock
- Deadlock is caused by dependencies between buffers
34Deadlock in a Ring Network
35Deadlock Avoidance Principle
- Deadlock is caused by dependencies between buffers
36Routing Constraints on Virtual Channels
- Add multiple virtual channels to each physical channel
- Place routing restrictions between virtual channels
37Break Cycles
38Channel Dependence Graph
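A channel dependence graph has one vertex per channel and an edge from channel a to channel b whenever a message holding a may request b next. A cycle in the graph means deadlock is possible. The sketch below builds the graph for a unidirectional ring, first with one channel per link and then with two virtual channels and a "dateline" restriction. The dateline scheme and all function names are assumptions for illustration; the slides may show a different way to break the cycle.

```python
def has_cycle(graph):
    """Detect a cycle in a directed graph given as {node: [successors]}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(u):
        color[u] = GRAY                      # u is on the current DFS path
        for v in graph.get(u, ()):
            state = color.get(v, WHITE)
            if state == GRAY or (state == WHITE and visit(v)):
                return True                  # back edge found: cycle
        color[u] = BLACK
        return False

    return any(color.get(u, WHITE) == WHITE and visit(u) for u in list(graph))

def ring_cdg(n):
    """One channel per link: channel i feeds channel (i + 1) mod n."""
    return {i: [(i + 1) % n] for i in range(n)}

def ring_cdg_dateline(n):
    """Two virtual channels per link, named (vc, i). Messages start on VC0
    and move to VC1 after crossing the wraparound link out of node n - 1."""
    graph = {}
    for i in range(n - 1):
        graph[(0, i)] = [(0, i + 1)]
        graph[(1, i)] = [(1, i + 1)]
    graph[(0, n - 1)] = [(1, 0)]  # crossing the dateline switches to VC1
    graph[(1, n - 1)] = []        # minimal routes never wrap a second time
    return graph

print(has_cycle(ring_cdg(4)), has_cycle(ring_cdg_dateline(4)))
```

The single-channel ring is cyclic, while the restricted two-channel version is not: this is the sense in which routing restrictions between virtual channels break cycles.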
39Routing Layer
40Routing Protocols
Routing Algorithms
- Number of Destinations: Unicast Routing, Multicast Routing
- Routing Decisions: Centralized Routing, Source Routing, Distributed Routing, Multiphase Routing
- Implementation: Table Lookup, Finite State Machine
- Adaptivity: Deterministic Routing, Adaptive Routing
- Progressiveness: Progressive, Backtracking
- Minimality: Profitable, Misrouting
- Number of Paths: Complete, Partial
Source: J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks, Morgan Kaufmann, 2003.
41Key Routing Categories
- Deterministic
  - The path is fixed by the source-destination pair
- Source Routing
  - Path is looked up prior to message injection
  - May differ each time the network and NIs are initialized
- Adaptive routing
  - Path is determined by run-time network conditions
- Unicast
  - Single source to single destination
- Multicast
  - Single source to multiple destinations
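Source routing can be sketched as a header that carries the whole path: the sender looks up the list of output ports before injection, and each switch consumes the leading entry. The packet representation and function names below are hypothetical and do not reflect any particular network's wire format.

```python
def forward(packet):
    """One switch hop: consume the leading route entry.
    Returns (output_port, packet_with_shortened_route)."""
    route, payload = packet
    if not route:
        raise ValueError("route exhausted before reaching a host")
    return route[0], (route[1:], payload)

def deliver(route, payload):
    """Follow a precomputed route hop by hop; return the ports taken
    and the payload as it arrives at the destination."""
    packet = (list(route), payload)
    ports = []
    while packet[0]:
        port, packet = forward(packet)
        ports.append(port)
    return ports, packet[1]

print(deliver([3, 1, 7], "hello"))
```

The switches make no routing decision of their own here, in contrast with distributed routing where each router computes the next output port itself.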
42Generic Router Architecture
43Software Layer
44The Message Layer
- Message layer background
- Cluster computers
- Myrinet SAN
- Design properties
- End-to-End communication path
- Injection
- Network transmission
- Ejection
- Overall performance
45Cluster Computers
- Cost-effective alternative to supercomputers
- Number of commodity workstations
- Specialized network hardware and software
- Result: Large pool of host processors
Courtesy of C. Ulmer
46Myrinet
- Descendant of Caltech Mosaic project
- Wormhole network
- Source routing
- High-speed, Ultra-reliable network
- Configurable topology: Switches, NICs, and cables
Courtesy of C. Ulmer
47Myrinet Switches & Links
- 16 Port crossbar chip
  - 2.0+2.0 Gbps per port
  - 300 ns Latency
- Line card
  - 8 Network ports
  - 8 Backplane ports
- Backplane cabinet
  - 17 line card slots
  - 128 Hosts
Courtesy of C. Ulmer
48Myrinet NI Architecture
- Custom RISC CPU
  - 33-200 MHz
  - Big endian
  - gcc is available
- SRAM
  - 1-9 MB
  - No CPU cache
- DMA Engines
  - PCI / SRAM
  - SRAM / Tx
  - Rx / SRAM
[Figure: Network Interface Card block diagram. Labels: LANai Processor, RISC CPU, SRAM, PCI, Host DMA, SAN DMA, Tx, Rx]
Courtesy of C. Ulmer
49Message Layers
Courtesy of C. Ulmer
50Message Layer Communication Software
- Message layers are enabling technology for clusters
  - Enable cluster to function as single image multiprocessor system
- Responsible for transferring messages between resources
- Hide hardware details from end users
Courtesy of C. Ulmer
51Message Layer Design Issues
- Performance is critical
  - Competing with SMPs, where overhead is <1 us
- Use every trick to get performance
  - Single cluster user -- remove device sharing overhead
  - Little protection -- co-operative environment
  - Reliable hardware -- optimize for common case of few errors
  - Smart hardware -- offload host communication
  - Arch hacks -- x86 is a turkey, use MMX, SSE, WC..
Courtesy of C. Ulmer
52Message Layer Organization
User-space Application
Kernel NI Device Driver
User-space Message Layer Library
NI Firmware
Courtesy of C. Ulmer
53End User's Perspective
Processor A
Processor B
Msg
Courtesy of C. Ulmer
54End-to-End Communication Path
- Three phases of data transfer
  - Injection
  - Network
  - Ejection
[Figure: Source and Destination hosts, each with CPU, Memory, and NI, connected by the SAN; markers 1, 2, 3 appear along the path]
Courtesy of C. Ulmer
55TPIL Performance: LANai 9 NI with Pentium III-550 MHz Host
[Plot: Bandwidth (MBytes/s) vs. Injection Size (Bytes)]
Courtesy of C. Ulmer
56The Message Path
[Figure: message path between two hosts. Labels on each side: M, CPU, OS, PCI, Memory, NI; the two sides are joined by the Network]
- Wire bandwidth is not the bottleneck!
- Operating system and/or user level software limits performance
57Universal Performance Metrics
[Figure: timeline of one message transfer. Labels: Sender (processor busy), Time of Flight, Transmission time (size / bandwidth), Receiver Overhead, Receiver (processor busy), Transport Latency, Total Latency]
Total Latency = Sender Overhead + Time of Flight + Message Size / BW + Receiver Overhead
Includes header/trailer in BW calculation?
58Simplified Latency Model
- Total Latency = Overhead + Message Size / BW
- Overhead = Sender Overhead + Time of Flight + Receiver Overhead
- Can relate overhead to network bandwidth utilization
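The simplified model can be evaluated directly to see how overhead limits the bandwidth an application actually observes. The function names and the numbers in the usage line are assumptions chosen for illustration, not measurements from the slides.

```python
def total_latency(size_bytes, bandwidth, sender_overhead,
                  time_of_flight, receiver_overhead):
    """Total Latency = Overhead + Message Size / BW (seconds, bytes/second)."""
    overhead = sender_overhead + time_of_flight + receiver_overhead
    return overhead + size_bytes / bandwidth

def effective_bandwidth(size_bytes, **params):
    """Bytes delivered per second once the fixed overhead is included."""
    return size_bytes / total_latency(size_bytes, **params)

# Hypothetical link: 100 MB/s wire, 5 us of combined overhead.
params = dict(bandwidth=100e6, sender_overhead=2e-6,
              time_of_flight=1e-6, receiver_overhead=2e-6)
for size in (100, 1000, 100000):
    print(size, effective_bandwidth(size, **params) / params["bandwidth"])
```

The printed ratio is the fraction of wire bandwidth achieved. It approaches 1 only for large messages, because the fixed overhead dominates small transfers.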
59Commercial Example
60Scalable Switching Fabrics for Internet Routers
Router
- Internet bandwidth growth → routers with
  - large numbers of ports
  - high bisection bandwidth
- Historically these solutions have used
  - Backplanes
  - Crossbar switches
- White paper: "Scalable Switching Fabrics for Internet Routers," by W. J. Dally, http://www.avici.com/technology/whitepapers/
61Requirements
- Scalable
  - Incremental
  - Economical → cost linear in the number of nodes
- Robust
  - Fault tolerant → path diversity and reconfiguration
  - Non-blocking features
- Performance
  - High bisection bandwidth
  - Quality of Service (QoS)
  - Bounded delay
62Switching Fabric
- Three components
  - Topology → 3D torus
  - Routing → source routing with randomization
  - Flow control → virtual channels and virtual networks
- Maximum configuration: 14 x 8 x 5 = 560
- Channel speed is 10 Gbps
63Packaging
- Uniformly short wires between adjacent nodes
- Can be built in passive backplanes
- Run at high speed
- Bandwidth inversely proportional to square of wire length
- Cabling costs
- Power costs
Figures are from "Scalable Switching Fabrics for Internet Routers," by W. J. Dally (can be found at www.avici.com)
64Properties
- Path diversity
- Avoids tree saturation
- Edge disjoint paths for fault tolerance
- Heart beat checks (100 microsecs), deflecting while tables are updated
Figures are from "Scalable Switching Fabrics for Internet Routers," by W. J. Dally (can be found at www.avici.com)
65Properties
Figures are from "Scalable Switching Fabrics for Internet Routers," by W. J. Dally (can be found at www.avici.com)
66Use of Virtual Channels
- Virtual channels aggregated into virtual networks
  - Two networks for each output port
- Distinct networks prevent undesirable coupling
  - Only bandwidth on a link is shared
  - Fair arbitration mechanisms
- Distinct networks enable QoS constraints to be met
  - Separate best effort and constant bit rate traffic
67Summary
- Distinguish between traditional networking and high performance multiprocessor communication
- Hierarchy of implementations
  - Physical, switching and routing
  - Protocol families and protocol layers (the protocol stack)
  - Datapath and architecture of the switches
- Metrics
  - Bisection bandwidth
  - Reliability
  - Traditional latency and bandwidth