Title: Networks: Switch Design
1 Networks: Switch Design
2 Switch Design
3 How do you build a crossbar?
4 Input Buffered Switch
- Independent routing logic per input
- FSM
- Scheduler logic arbitrates each output
- priority, FIFO, random
- Head-of-line blocking problem
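The head-of-line blocking problem above can be illustrated with a minimal sketch: each input port has a single FIFO, and only the packet at the head of each FIFO may be scheduled. The scheduler below is a simplified stand-in, not any specific switch's arbiter.

```python
from collections import deque

# Sketch: head-of-line (HOL) blocking in an input-buffered switch.
# Each input queue holds destination output-port numbers; only the
# head packet of each queue is eligible for scheduling.
def schedule_one_cycle(input_queues):
    """Grant each output to at most one input; return forwarded packets."""
    granted_outputs = set()
    forwarded = []
    for q in input_queues:
        if q and q[0] not in granted_outputs:  # q[0] = destination output
            granted_outputs.add(q[0])
            forwarded.append(q.popleft())
    return forwarded

# Both heads target output 0, so input 1 is blocked this cycle even
# though its second packet could use the idle output 1: HOL blocking.
queues = [deque([0, 1]), deque([0, 1])]
print(schedule_one_cycle(queues))  # [0] -- only one packet moves
```

With per-output queues at each input (virtual output queuing) or a shared buffer pool, the second input's packet for output 1 could proceed in the same cycle.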
5 Switches to Avoid Head-of-Line Blocking
- Additional cost
- Switch cycle time, routing delay
- How would you build a shared pool?
6 Example: IBM SP Vulcan Switch
- Many gigabit Ethernet switches use a similar design,
  without the cut-through
7 Output Scheduling
- n independent arbitration problems?
- static priority, random, round-robin
- simplifications due to routing algorithm?
- Dimension order routing
- Adaptive routing
- general case is max bipartite matching
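The round-robin option listed above can be sketched as a rotating-priority arbiter for a single output port; the interface here is a hypothetical simplification (one grant per cycle, pointer advanced past the winner for fairness).

```python
# Sketch: rotating-priority (round-robin) arbitration for one output.
# `requests` has one boolean per input port; the search starts at
# `pointer` so every requesting input is eventually served.
def round_robin_arbitrate(requests, pointer):
    n = len(requests)
    for offset in range(n):
        i = (pointer + offset) % n
        if requests[i]:
            return i, (i + 1) % n  # grant input i, rotate priority past it
    return None, pointer           # no requester this cycle

grant, ptr = round_robin_arbitrate([True, False, True], pointer=1)
print(grant, ptr)  # 2 0 -- input 2 wins; input 0 has top priority next
```

An n-output switch runs n such arbiters in parallel, which is why the general case, where inputs contend for multiple outputs, becomes a maximum bipartite matching problem.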
8 Stacked Dimension Switches
- Dimension order on 3D cube?
9 Flow Control
- What do you do when push comes to shove?
- Ethernet: collision detection and retry after delay
- FDDI, token ring: arbitration token
- TCP/WAN: buffer, drop, adjust rate
- any solution must adjust to output rate
- Link-level flow control
10 Examples
- short links
- long links
  - several flits on the wire
11 Smoothing the Flow
[Figure: a link buffer between incoming and outgoing phits, with
flow-control symbols returned to the sender. Between the Full and
Empty limits sit a High mark, which triggers a Stop symbol, and a
Low mark, which triggers a Go symbol.]
- How much slack do you need to maximize bandwidth?
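A back-of-the-envelope answer to the slack question: after the receiver emits Stop, a full round trip of phits may still arrive; symmetrically, Go must be sent early enough that the buffer does not drain dry before new phits arrive. The numbers below are illustrative assumptions, not figures from any particular machine.

```python
# Sketch: slack needed around the Stop/Go marks so a link-level
# on/off scheme never overflows the buffer and never starves the link.
link_latency_cycles = 10  # assumed one-way wire latency, in cycles
phits_per_cycle = 1       # assumed injection rate

# In-flight phits over a full round trip (Stop travels back while
# phits keep arriving; Go travels back while the buffer drains).
round_trip_phits = 2 * link_latency_cycles * phits_per_cycle

slack_above_high_mark = round_trip_phits  # room between High mark and Full
slack_below_low_mark = round_trip_phits   # phits between Low mark and Empty
print(slack_above_high_mark, slack_below_low_mark)  # 20 20
```

So to sustain full bandwidth, the buffer must hold at least one round trip of phits above the High mark and another below the Low mark; longer links therefore need proportionally deeper buffers.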
12 Link vs End-to-End Flow Control
- Hot Spots
  - back pressure
  - all the buffers in the tree from the hot spot to
    the sources are full
- Global communication operations
  - simple back pressure
  - with completely balanced communication patterns
  - simple end-to-end protocols in the global
    communication have been shown to mitigate this
    problem
  - a node may wait after sending a certain amount of
    data until it has also received this amount, or it
    may wait for chunks of its data to be acknowledged
- Admission Control
  - NI-to-NI credit-based flow control
  - keep the packet within the source NI rather than
    blocking traffic within the network
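The NI-to-NI credit scheme above can be sketched as follows; the class and its interface are hypothetical, meant only to show how congestion backs up into the source NI instead of filling network buffers.

```python
# Sketch: credit-based admission control at the source network
# interface (NI). A packet enters the network only if a credit is
# available; otherwise it waits in the source NI's pending queue.
class CreditedSender:
    def __init__(self, credits):
        self.credits = credits  # buffer slots granted by the destination NI
        self.pending = []       # packets held at the source NI

    def send(self, packet):
        if self.credits > 0:
            self.credits -= 1
            return packet            # packet enters the network
        self.pending.append(packet)  # held at the source, not in the network
        return None

    def on_credit_return(self):
        """Destination NI freed a buffer slot; release a pending packet."""
        self.credits += 1
        if self.pending:
            self.credits -= 1
            return self.pending.pop(0)
        return None

ni = CreditedSender(credits=1)
print(ni.send("a"))           # 'a' -- credit consumed, enters network
print(ni.send("b"))           # None -- no credit, held in the source NI
print(ni.on_credit_return())  # 'b' -- released when a credit returns
```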
13 Example: T3D
- 3D bidirectional torus, dimension-order routing (NIC
  selected), virtual cut-through, packet switching
- 16 bits x 150 MHz; short, wide, synchronous links
- rotating priority per output
- logically separate request/response (two VCs each)
- 3 independent, stacked switches
- 8 16-bit flit buffers on each of 4 VCs in each
  direction
14 Example: SP
- 8-port switch, 40 MB/s per link, 8-bit phit, 16-bit
  flit, single 40 MHz clock
- packet switching, cut-through, no virtual channels,
  source-based routing
- variable packet length < 255 bytes, 31-byte FIFO per
  input, 7 bytes per output
- 128 8-byte chunks in central queue, LRU per output
- run in shadow mode
15 Summary
- Routing algorithms restrict the set of routes within
  the topology
  - simple mechanism selects the turn at each hop
  - arithmetic, selection, lookup
- Deadlock-free if the channel dependence graph is
  acyclic
  - limit turns to eliminate dependences
  - add separate channel resources to break dependences
  - combination of topology, algorithm, and switch
    design
- Deterministic vs adaptive routing
- Switch design issues
  - input/output/pooled buffering, routing logic,
    selection logic
- Flow control
- Real networks are a package of design choices
16 Cache Coherence in Scalable Machines
17 Context for Scalable Cache Coherence
- Scalable networks: many simultaneous transactions
- Realizing programming models through network
  transaction protocols
  - efficient node-to-network interface
  - interprets transactions
- Scalable distributed memory
- Caches naturally replicate data
  - coherence through bus snooping protocols
  - consistency
- Need cache coherence protocols that scale!
  - no broadcast or single point of order
18 Generic Solution: Directories
- Maintain state vector explicitly
- associate with memory block
- records state of block in each cache
- On miss, communicate with directory
- determine location of cached copies
- determine action to take
- conduct protocol to maintain coherence
19 A Cache Coherent System Must
- Provide a set of states, a state transition diagram,
  and actions
- Manage the coherence protocol
  - (0) Determine when to invoke the coherence protocol
  - (a) Find info about the state of the block in other
    caches to determine the action
    - whether we need to communicate with other cached
      copies
  - (b) Locate the other copies
  - (c) Communicate with those copies
    (invalidate/update)
- (0) is done the same way on all systems
  - state of the line is maintained in the cache
  - protocol is invoked if an access fault occurs on
    the line
- Different approaches are distinguished by (a) to (c)
20 Bus-Based Coherence
- All of (a), (b), (c) done through broadcast on the
  bus
  - faulting processor sends out a search
  - others respond to the search probe and take the
    necessary action
- Could do it in a scalable network too
  - broadcast to all processors, and let them respond
- Conceptually simple, but broadcast doesn't scale
  with p
  - on a bus, bus bandwidth doesn't scale
  - on a scalable network, every fault leads to at
    least p network transactions
- Scalable coherence
  - can have the same cache states and state transition
    diagram
  - different mechanisms to manage the protocol
21 One Approach: Hierarchical Snooping
- Extend the snooping approach: hierarchy of broadcast
  media
  - tree of buses or rings (KSR-1)
  - processors are in the bus- or ring-based
    multiprocessors at the leaves
  - parents and children connected by two-way snoopy
    interfaces
    - snoop both buses and propagate relevant
      transactions
  - main memory may be centralized at the root or
    distributed among the leaves
- Issues (a)-(c) handled similarly to a bus, but not
  full broadcast
  - faulting processor sends out a search bus
    transaction on its bus
  - propagates up and down the hierarchy based on
    snoop results
- Problems
  - high latency: multiple levels, and snoop/lookup at
    every level
  - bandwidth bottleneck at the root
- Not popular today
22 Scalable Approach: Directories
- Every memory block has associated directory
  information
  - keeps track of copies of cached blocks and their
    states
  - on a miss, find the directory entry, look it up,
    and communicate only with the nodes that have
    copies, if necessary
  - in scalable networks, communication with the
    directory and copies is through network
    transactions
- Many alternatives for organizing directory
  information
23Basic Operation of Directory
k processors. With each cache-block in
memory k presence-bits, 1 dirty-bit With
each cache-block in cache 1 valid bit, and 1
dirty (owner) bit
- Read from main memory by processor i
- If dirty-bit OFF then read from main memory
turn pi ON - if dirty-bit ON then recall line from dirty
proc (cache state to shared) update memory turn
dirty-bit OFF turn pi ON supply recalled data
to i - Write to main memory by processor i
- If dirty-bit OFF then supply data to i send
invalidations to all caches that have the block
turn dirty-bit ON turn pi ON ... - ...
24 Basic Directory Transactions
25 A Popular Middle Ground
- Two-level hierarchy
  - individual nodes are multiprocessors, connected
    non-hierarchically
  - e.g. mesh of SMPs
- Coherence across nodes is directory-based
  - directory keeps track of nodes, not individual
    processors
- Coherence within nodes is snooping or directory
  - orthogonal, but needs a good interface of
    functionality
- Examples
  - Convex Exemplar: directory-directory
  - Sequent, Data General, HAL: directory-snoopy
- SMP on a chip?
26 Example Two-Level Hierarchies
27 Advantages of Multiprocessor Nodes
- Potential for cost and performance advantages
  - can use commodity SMPs
  - fewer nodes for the directory to keep track of
  - much communication may be contained within a node
    (cheaper)
  - nodes prefetch data for each other (fewer remote
    misses)
  - combining of requests (like hierarchical, only
    two-level)
  - can even share caches (overlapping of working sets)
- Benefits depend on the sharing pattern (and mapping)
  - good for widely read-shared data, e.g. tree data
    in Barnes-Hut
  - good for nearest-neighbor, if properly mapped
  - not so good for all-to-all communication
28 Disadvantages of Coherent MP Nodes
- Bandwidth shared among nodes
  - all-to-all example
  - applies whether coherent or not
- Bus increases latency to local memory
- With coherence, typically wait for local snoop
  results before sending remote requests
- Snoopy bus at the remote node increases delays there
  too, increasing latency and reducing bandwidth
- May hurt performance if sharing patterns don't
  comply