Networks: Switch Design - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Networks: Switch Design


1
Networks: Switch Design
2
Switch Design
3
How do you build a crossbar?
4
Input-buffered switch
  • Independent routing logic per input
  • FSM
  • Scheduler logic arbitrates each output
  • priority, FIFO, random
  • Head-of-line blocking problem
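The head-of-line blocking problem can be illustrated with a toy single-cycle model (a hypothetical sketch, not the slide's switch design): with one FIFO per input, each input may only forward its head packet, so a packet bound for an idle output can be stuck behind one bound for a busy output.

```python
from collections import deque

def simulate_cycle(input_queues, busy_outputs):
    """One switch cycle with FIFO input queues: each input may only
    forward the packet at its head. Returns the list of packets sent
    (each packet is represented just by its destination output)."""
    sent = []
    granted = set(busy_outputs)          # outputs already claimed this cycle
    for q in input_queues:
        if q and q[0] not in granted:    # head packet's output is free
            granted.add(q[0])
            sent.append(q.popleft())
    return sent

# Input 0 holds packets for outputs [2, 3]; output 2 is busy, so the
# packet for output 3 is blocked behind the head even though 3 is idle.
queues = [deque([2, 3])]
print(simulate_cycle(queues, busy_outputs={2}))   # -> []
```

Reordering the same queue so the packet for output 3 is at the head lets it through, which is exactly the throughput a shared pool or per-output (virtual output) queues would recover.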

5
Switches to avoid head-of-line blocking
  • Additional cost
  • Switch cycle time, routing delay
  • How would you build a shared pool?

6
Example: IBM SP Vulcan switch
  • Many gigabit Ethernet switches use similar design
    without the cut-through

7
Output scheduling
  • n independent arbitration problems?
  • static priority, random, round-robin
  • simplifications due to routing algorithm?
  • Dimension order routing
  • Adaptive routing
  • general case is max bipartite matching
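One of the policies named above, rotating (round-robin) priority for a single output, can be sketched as follows; the class name and interface are illustrative, not taken from the slides:

```python
class RoundRobinArbiter:
    """Per-output arbiter: grants one of the requesting inputs,
    rotating priority so the last winner has lowest priority next time."""
    def __init__(self, n_inputs):
        self.n = n_inputs
        self.last = self.n - 1            # start so input 0 has top priority

    def grant(self, requests):
        """requests: set of input indices requesting this output.
        Returns the granted input, or None if there are no requests."""
        for i in range(1, self.n + 1):
            candidate = (self.last + i) % self.n
            if candidate in requests:
                self.last = candidate
                return candidate
        return None

arb = RoundRobinArbiter(4)
print(arb.grant({0, 2}))   # -> 0
print(arb.grant({0, 2}))   # -> 2  (input 0 just won, so 2 now has priority)
print(arb.grant({0, 2}))   # -> 0
```

Running one such arbiter per output solves the n independent problems greedily; the general case of matching many inputs to many outputs at once is, as the slide notes, maximum bipartite matching.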

8
Stacked Dimension Switches
  • Dimension order on 3D cube?
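Dimension-order routing on a bidirectional torus reduces to simple arithmetic at each hop; a minimal sketch (the helper name and tuple-coordinate convention are assumptions for illustration):

```python
def dor_next_dim(cur, dst, dims):
    """Dimension-order routing: route in the lowest dimension where the
    current and destination coordinates differ; deliver locally when equal.
    cur, dst: coordinate tuples; dims: torus size per dimension.
    Returns (dimension, direction) or None on arrival."""
    for d in range(len(dims)):
        if cur[d] != dst[d]:
            # On a bidirectional torus, go whichever way around is shorter.
            fwd = (dst[d] - cur[d]) % dims[d]
            step = 1 if fwd <= dims[d] // 2 else -1
            return d, step
    return None  # arrived at destination

# Route (0,0,0) -> (3,1,0) on a 4x4x4 torus: X first, and it is
# shorter to wrap around in the -X direction than to take 3 +X hops.
print(dor_next_dim((0, 0, 0), (3, 1, 0), (4, 4, 4)))   # -> (0, -1)
```

Because each packet exhausts one dimension before starting the next, a stacked-dimension switch only ever needs to pass a packet from dimension d to dimension d+1, never backwards.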

9
Flow Control
  • What do you do when push comes to shove?
  • Ethernet: collision detection and retry after
    delay
  • FDDI, token ring: arbitration token
  • TCP/WAN: buffer, drop, adjust rate
  • any solution must adjust to output rate
  • Link-level flow control

10
Examples
  • Short Links
  • long links
  • several flits on the wire

11
Smoothing the flow
[Figure: a flit buffer between incoming and outgoing phits, with levels
Empty, Low (Go) mark, High (Stop) mark, Full; filling past the high mark
sends a Stop flow-control symbol upstream, and draining below the low
mark sends Go]
  • How much slack do you need to maximize bandwidth?
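The high/low watermark scheme in the figure can be sketched as a toy receiver-side buffer model (class name and thresholds are illustrative, not from the slides):

```python
class StopGoBuffer:
    """Receiver-side flit buffer with high/low watermarks. Filling past
    the high mark emits a Stop symbol upstream; draining below the low
    mark emits Go. The gap between the marks, plus the flits already in
    flight, is the slack the buffer must absorb."""
    def __init__(self, capacity, high, low):
        self.capacity, self.high, self.low = capacity, high, low
        self.occupancy = 0
        self.stopped = False

    def enqueue_phit(self):
        assert self.occupancy < self.capacity, "overflow: not enough slack"
        self.occupancy += 1
        if not self.stopped and self.occupancy >= self.high:
            self.stopped = True
            return "Stop"
        return None

    def dequeue_phit(self):
        self.occupancy -= 1
        if self.stopped and self.occupancy <= self.low:
            self.stopped = False
            return "Go"
        return None

buf = StopGoBuffer(capacity=8, high=6, low=2)
symbols = [buf.enqueue_phit() for _ in range(6)]
print(symbols[-1])          # -> Stop (high mark reached on the 6th phit)
```

The high mark must sit at least one link round-trip's worth of phits below capacity, since everything already on the wire still arrives after Stop is sent; that round-trip is the answer to the slack question above.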

12
Link vs End-to-End flow control
  • Hot Spots
  • back pressure
  • all the buffers in the tree from the hot spot to
    the sources are full
  • Global communication operations
  • Simple back pressure
  • with completely balanced communication patterns,
    simple end-to-end protocols in the global
    communication have been shown to mitigate this
    problem
  • a node may wait after sending a certain amount of
    data until it has also received this amount, or
    it may wait for chunks of its data to be
    acknowledged
  • Admission Control
  • NI-to-NI credit-based flow control
  • keep the packet within the source NI rather than
    blocking traffic within the network
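The NI-to-NI credit idea above can be sketched as a toy model (class and method names are invented for illustration): a packet leaves the source NI only when a credit, i.e. a free buffer at the destination NI, is available, so overflow traffic waits at the source instead of clogging the network.

```python
from collections import deque

class CreditedNI:
    """Source network interface with NI-to-NI credit-based flow control.
    Each credit represents one free packet buffer at the destination NI."""
    def __init__(self, credits):
        self.credits = credits
        self.pending = deque()        # packets held at the source NI

    def send(self, packet):
        self.pending.append(packet)
        return self._drain()

    def credit_return(self):
        self.credits += 1             # destination NI freed a buffer
        return self._drain()

    def _drain(self):
        """Inject waiting packets into the network while credits last."""
        injected = []
        while self.pending and self.credits > 0:
            self.credits -= 1
            injected.append(self.pending.popleft())
        return injected

ni = CreditedNI(credits=1)
print(ni.send("p0"))          # -> ['p0']  (credit available, injected)
print(ni.send("p1"))          # -> []      (no credit: held at source NI)
print(ni.credit_return())     # -> ['p1']  (credit came back, now injected)
```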

13
Example: T3D
  • 3D bidirectional torus, dimension order (NIC
    selected), virtual cut-through, packet sw.
  • 16 bit x 150 MHz, short, wide, synch.
  • rotating priority per output
  • logically separate request/response (two VCs
    each)
  • 3 independent, stacked switches
  • 8 16-bit flit buffers on each of 4 VCs in each
    direction

14
Example: SP
  • 8-port switch, 40 MB/s per link, 8-bit phit,
    16-bit flit, single 40 MHz clock
  • packet sw, cut-through, no virtual channel,
    source-based routing
  • variable packet < 255 bytes, 31-byte FIFO per
    input, 7 bytes per output
  • 128 8-byte chunks in central queue, LRU per
    output
  • run in shadow mode

15
Summary
  • Routing Algorithms restrict the set of routes
    within the topology
  • simple mechanism selects turn at each hop
  • arithmetic, selection, lookup
  • Deadlock-free if channel dependence graph is
    acyclic
  • limit turns to eliminate dependences
  • add separate channel resources to break
    dependences
  • combination of topology, algorithm, and switch
    design
  • Deterministic vs adaptive routing
  • Switch design issues
  • input/output/pooled buffering, routing logic,
    selection logic
  • Flow control
  • Real networks are a package of design choices

16
Cache Coherence in Scalable Machines
17
Context for Scalable Cache Coherence
  • Scalable networks: many simultaneous transactions
  • Realizing programming models through network
    transaction protocols: efficient node-to-network
    interface that interprets transactions
  • Scalable distributed memory
  • Caches naturally replicate data: coherence through
    bus snooping protocols, consistency
  • Need cache coherence protocols that scale!
  • no broadcast or single point of order
18
Generic Solution: Directories
  • Maintain state vector explicitly
  • associate with memory block
  • records state of block in each cache
  • On miss, communicate with directory
  • determine location of cached copies
  • determine action to take
  • conduct protocol to maintain coherence

19
A Cache Coherent System Must
  • Provide set of states, state transition diagram,
    and actions
  • Manage coherence protocol
  • (0) Determine when to invoke coherence protocol
  • (a) Find info about state of block in other
    caches to determine action
  • whether need to communicate with other cached
    copies
  • (b) Locate the other copies
  • (c) Communicate with those copies
    (inval/update)
  • (0) is done the same way on all systems
  • state of the line is maintained in the cache
  • protocol is invoked if an access fault occurs
    on the line
  • Different approaches distinguished by (a) to (c)

20
Bus-based Coherence
  • All of (a), (b), (c) done through broadcast on
    bus
  • faulting processor sends out a search
  • others respond to the search probe and take
    necessary action
  • Could do it in scalable network too
  • broadcast to all processors, and let them respond
  • Conceptually simple, but broadcast doesn't scale
    with p
  • on bus, bus bandwidth doesn't scale
  • on scalable network, every fault leads to at
    least p network transactions
  • Scalable coherence
  • can have same cache states and state transition
    diagram
  • different mechanisms to manage protocol

21
One Approach: Hierarchical Snooping
  • Extend snooping approach hierarchy of broadcast
    media
  • tree of buses or rings (KSR-1)
  • processors are in the bus- or ring-based
    multiprocessors at the leaves
  • parents and children connected by two-way snoopy
    interfaces
  • snoop both buses and propagate relevant
    transactions
  • main memory may be centralized at root or
    distributed among leaves
  • Issues (a) - (c) handled similarly to bus, but
    not full broadcast
  • faulting processor sends out search bus
    transaction on its bus
  • propagates up and down hierarchy based on snoop
    results
  • Problems
  • high latency multiple levels, and snoop/lookup
    at every level
  • bandwidth bottleneck at root
  • Not popular today

22
Scalable Approach: Directories
  • Every memory block has associated directory
    information
  • keeps track of copies of cached blocks and their
    states
  • on a miss, find directory entry, look it up, and
    communicate only with the nodes that have copies
    if necessary
  • in scalable networks, communication with
    directory and copies is through network
    transactions
  • Many alternatives for organizing directory
    information

23
Basic Operation of Directory
k processors. With each cache block in memory:
k presence bits, 1 dirty bit. With each cache block
in cache: 1 valid bit and 1 dirty (owner) bit.
  • Read from main memory by processor i
  • If dirty bit OFF, then: read from main memory;
    turn p_i ON
  • If dirty bit ON, then: recall line from dirty
    processor (cache state to shared); update memory;
    turn dirty bit OFF; turn p_i ON; supply recalled
    data to i
  • Write to main memory by processor i
  • If dirty bit OFF, then: supply data to i; send
    invalidations to all caches that have the block;
    turn dirty bit ON; turn p_i ON; ...
  • ...
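The read- and write-miss handling above can be sketched as a toy directory entry (a hypothetical model; method names and the returned action strings are illustrative, and network transactions are elided):

```python
class DirectoryEntry:
    """Per-block directory state: k presence bits plus one dirty bit,
    following the read/write miss handling on the slide."""
    def __init__(self, k):
        self.presence = [False] * k   # p_i: does processor i have a copy?
        self.dirty = False

    def read_miss(self, i):
        """Action the home node takes for a read miss by processor i."""
        if self.dirty:
            owner = self.presence.index(True)
            # Recall line from owner, downgrade it to shared, update memory.
            self.dirty = False
            action = f"recall from {owner}, supply to {i}"
        else:
            action = f"read memory, supply to {i}"
        self.presence[i] = True       # turn p_i ON
        return action

    def write_miss(self, i):
        """Invalidate all other copies, then make i the dirty owner."""
        sharers = [p for p, bit in enumerate(self.presence) if bit and p != i]
        self.presence = [False] * len(self.presence)
        self.presence[i] = True
        self.dirty = True
        return f"invalidate {sharers}, supply to {i}"

d = DirectoryEntry(k=4)
print(d.read_miss(1))    # -> read memory, supply to 1
print(d.write_miss(2))   # -> invalidate [1], supply to 2
print(d.read_miss(0))    # -> recall from 2, supply to 0
```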

24
Basic Directory Transactions
25
A Popular Middle Ground
  • Two-level hierarchy
  • Individual nodes are multiprocessors, connected
    non-hierarchically
  • e.g. mesh of SMPs
  • Coherence across nodes is directory-based
  • directory keeps track of nodes, not individual
    processors
  • Coherence within nodes is snooping or directory
  • orthogonal, but needs a good interface of
    functionality
  • Examples
  • Convex Exemplar: directory-directory
  • Sequent, Data General, HAL: directory-snoopy
  • SMP on a chip?

26
Example: Two-level Hierarchies
27
Advantages of Multiprocessor Nodes
  • Potential for cost and performance advantages
  • can use commodity SMPs
  • fewer nodes for directory to keep track of
  • much communication may be contained within node
    (cheaper)
  • nodes prefetch data for each other (fewer
    remote misses)
  • combining of requests (like hierarchical, only
    two-level)
  • can even share caches (overlapping of working
    sets)
  • benefits depend on sharing pattern (and mapping)
  • good for widely read-shared e.g. tree data in
    Barnes-Hut
  • good for nearest-neighbor, if properly mapped
  • not so good for all-to-all communication

28
Disadvantages of Coherent MP Nodes
  • Bandwidth shared among nodes
  • all-to-all example
  • applies to coherent or not
  • Bus increases latency to local memory
  • With coherence, typically wait for local snoop
    results before sending remote requests
  • Snoopy bus at remote node increases delays there
    too, increasing latency and reducing bandwidth
  • May hurt performance if sharing patterns don't
    comply