Title: CS 2200 Lecture 15: Networking (with a focus on architectural implications)
- (Lectures based on the work of Jay Brockman,
Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
Ken MacKenzie, Richard Murphy, and Michael
Niemier)
Quiz
Networking
- Lots of XANs
- SAN (system area network; the text calls it MPP)
  - Not designed for generality; usually connects homogeneous nodes
  - Physical extent is small: less than 25 meters, usually much less
  - Connectivity usually from hundreds to thousands of nodes
  - Main focus is high bandwidth and low latency
  - Supported by the MPP industry; very model specific
- LAN (local area network)
  - Heterogeneous hosts assumed; designed for generality
  - Physical extent usually within a few hundred kms
  - Connectivity usually in the hundreds of nodes
  - Performance is typically mundane
  - Supported by the workstation industry; definite open system model
One More
- WAN (wide area network)
  - General connectivity for thousands of heterogeneous nodes
  - High bandwidth, but latency is usually horrible
  - Physical extent: 1000s of kilometers
  - Supported by the telecommunications industry
  - Open system standard model
Slightly more general model
- Software responsible for reliable transmission
- Several problem sources
  - Garbled in transit: add a checksum trailer
  - Lost in transit
    - When a message is received, a reply must be generated
    - Pick a time interval such that, if the message was received, the reply would have arrived by then
    - On a send, start a timer; if it times out, assume the message was lost and resend
- New message format
  - 2-bit header (request, reply, request-ack, reply-ack)
  - 32-bit original payload
  - 4-bit trailer checksum
Software Steps (just an example)
- Send
  - Application calls the OS to send data
  - OS copies data to an OS buffer
  - OS calculates the checksum (here it goes in the trailer) and starts the timer
  - OS sends the data to the NW interface and tells the HW to send it
- Receive
  - System copies the data from the NW interface into the OS buffer
  - System calculates the checksum and checks it against the transmitted version
  - If it matches, an ack is sent, the data is copied into the proper user-space location, and the OS signals the application to continue
  - If it doesn't match, the message is deleted, since the sender will resend after the timeout
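The send/receive steps above can be sketched in a few lines. This is a minimal illustration, assuming a simple additive checksum folded into 4 bits to match the 4-bit trailer; the function names are illustrative, not from the lecture.

```python
# Sketch of the send/receive protocol steps, with a hypothetical
# additive checksum and the 2-bit-header / payload / 4-bit-trailer
# message format from the slide.

def checksum(payload: bytes) -> int:
    return sum(payload) & 0xF          # fold into 4 bits for the trailer

def make_packet(payload: bytes) -> tuple:
    header = 0                         # 2-bit header: 0 = request
    return (header, payload, checksum(payload))

def receive(packet: tuple):
    header, payload, trailer = packet
    if checksum(payload) == trailer:
        return ("ack", payload)        # copy to user space, signal the app
    return None                        # drop it; sender resends on timeout
```

A corrupted packet is simply dropped, relying on the sender's timer for recovery, which is exactly the asymmetry the slide describes.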
Notice
- Symmetry
  - In the send and receive protocols
- Similarity
  - Quite close to a Unix UDP/IP protocol
- Note
  - Lots of things are easy here
    - Single-message basis
    - Homogeneous environment
More realistic scenario
- Heterogeneous node types
  - Enter big-endian vs. little-endian byte order
  - The protocol will determine which transmit order is required
- Wrong endian?
  - Will need to do byte reversal
  - On both sends and receives to make up for the difference
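The byte reversal can be seen directly with Python's `struct` module, which supports both byte orders:

```python
import struct

# A 32-bit value transmitted in big-endian order: a host that reads
# it with the wrong endianness sees the bytes reversed.

value = 0x01020304
wire = struct.pack(">I", value)                     # big-endian on the wire
assert wire == b"\x01\x02\x03\x04"
assert struct.unpack(">I", wire)[0] == value        # correct interpretation
assert struct.unpack("<I", wire)[0] == 0x04030201   # wrong-endian read
```

In practice the protocol fixes one network byte order, and wrong-endian hosts swap on both send and receive.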
Other reliability issues
- Duplicate messages
  - A resend happened, but the previous try got there anyway
  - A unique identifier allows the receiver to discriminate properly
  - Usually need some realistic and safe timing assumption to avoid wrap-around aliasing
- Rogue messages
  - Never received, but stay in the fabric
  - A time-to-live field in the packet provides a self-destruct mechanism
  - Must work even when the receiver's buffer is full
- Flow control
  - Requires some form of feedback to the sender, or a piecewise stall capability
- We'll defer issues like
  - Deadlock, livelock, safety, fairness
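The two mechanisms above, unique IDs for duplicate suppression and a time-to-live field for rogue messages, can be sketched as follows. The names and packet structures here are illustrative assumptions, not from the lecture.

```python
# Duplicate suppression: remember IDs already delivered, drop repeats.
seen_ids = set()

def deliver(msg_id, payload):
    # Deliver the payload only the first time this ID is seen;
    # a duplicate from a spurious resend is silently dropped.
    if msg_id in seen_ids:
        return None
    seen_ids.add(msg_id)
    return payload

# Rogue-message self-destruct: each switch decrements the TTL,
# and a packet whose TTL reaches zero is discarded.
def forward(packet):
    ttl, data = packet
    if ttl <= 1:
        return None
    return (ttl - 1, data)
```

A real implementation would also age out old IDs, which is where the "safe timing assumption" on the slide comes in.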
Performance parameters (see board)
- Bandwidth
  - Maximum rate at which the interconnection network can propagate data once a message is in the network
  - Headers and overhead bits are usually included in the calculation
  - Units are usually megabits/second, not megabytes
  - Sometimes you'll see "throughput"
    - Network bandwidth delivered to an application
- Time of flight
  - Time for the 1st bit of a message to arrive at the receiver
  - Includes delays of repeaters/switches, plus length / (m × speed of light), where m is determined by the transmission material
- Transmission time
  - Time required for the message to pass through the network
  - Size of the message divided by the bandwidth
Performance parameters (see board)
- Transport latency
  - Time of flight + transmission time
  - Time the message spends in the interconnection network
  - But not the overhead of pulling it out of or pushing it into the network
- Sender overhead
  - Time for the processor to inject a message into the interconnection network, including both HW and SW components
- Receiver overhead
  - Time for the processor to pull a message out of the interconnection network, including both HW and SW components
- So, the total latency of a message is
  - Total latency = Sender overhead + Time of flight + Transmission time + Receiver overhead
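The total-latency terms above combine as a straight sum, which is easy to check with a small calculation (the numbers below are hypothetical, chosen only for illustration):

```python
# Total latency = sender overhead + time of flight
#               + transmission time (size / bandwidth) + receiver overhead

def total_latency(sender_ovhd_s, time_of_flight_s, msg_bits,
                  bandwidth_bps, receiver_ovhd_s):
    transmission_time = msg_bits / bandwidth_bps
    return sender_ovhd_s + time_of_flight_s + transmission_time + receiver_ovhd_s

# 100 us of overhead on each side, 5 us time of flight,
# an 8000-bit message on a 100 Mbit/s link:
t = total_latency(100e-6, 5e-6, 8000, 100e6, 100e-6)   # 285 us total
```

Note that with these numbers the overheads, not the network, dominate the total.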
Example
Some more odds and ends
- Note from the example (with regard to longer distance)
  - Time of flight dominates the total latency
  - Repeater delays would factor significantly into the equation
  - Message transmission failure rates rise significantly
- It's possible to send other messages without responses to previous ones
  - If you have control of the network
  - Can help increase network use by overlapping overheads and transport latencies
- Can simplify the total latency equation to
  - Total latency = Overhead + (Message size / Bandwidth)
- Leads to
  - Effective bandwidth = Message size / Total latency
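The simplified equations above show why small messages achieve only a fraction of the raw link bandwidth. A quick calculation with hypothetical numbers:

```python
# Effective bandwidth = message size / total latency, using the
# simplified Total latency = Overhead + (Message size / Bandwidth).

def effective_bandwidth(msg_bits, overhead_s, bandwidth_bps):
    total_latency = overhead_s + msg_bits / bandwidth_bps
    return msg_bits / total_latency

# With 200 us of total overhead on a 100 Mbit/s link:
small = effective_bandwidth(1_000, 200e-6, 100e6)       # ~4.8 Mbit/s
large = effective_bandwidth(1_000_000, 200e-6, 100e6)   # ~98 Mbit/s
```

The fixed overhead swamps a 1000-bit message, while a megabit message approaches the raw link rate.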
Interconnection Networks
[Figure: several nodes attached to a shared medium (Ethernet) vs. nodes attached through a switch (switched media, e.g. ATM)]
(a.k.a. data switching interchanges, multistage interconnection networks, interface message processors)
Shared Media Networks
- Need arbitration to decide who gets to talk
- Arbitration can be centralized or distributed
- Centralized: not used much for networks
  - Special arbiter device (or must elect an arbiter)
  - Good performance if the arbiter is far away? Nah.
- Distributed arbitration
  - Check if the media is already in use (carrier sensing)
  - If the media is not in use now, start sending
  - Check if another node is also sending (collision detection)
  - If there is a collision, wait for a while and retry
  - "For a while" is random (otherwise collisions repeat forever)
  - Exponential back-off to avoid wasting bandwidth on collisions
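The randomized exponential back-off above can be sketched in a few lines. This assumes an Ethernet-style scheme where the wait is measured in slot times and the window doubles per collision up to a cap; the cap value here is an illustrative assumption.

```python
import random

# After the k-th collision, wait a random number of slot times
# drawn from [0, 2^k - 1]; the window is capped so it stops growing.

def backoff_slots(collisions: int, cap: int = 10) -> int:
    k = min(collisions, cap)
    return random.randint(0, 2**k - 1)
```

Doubling the window spreads retries out as contention rises, so repeated collisions among the same senders become increasingly unlikely.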
Switched Networks
- Need switches
  - Introduces switching overheads
- No time wasted on arbitration and collisions
- Multiple transfers can be in progress
  - If they use different links, of course
- Circuit or packet switching
  - Circuit switching: end-to-end connections
    - Reserves links for a connection (e.g. the phone network)
  - Packet switching: each packet is routed separately
    - Links are used only when data is transferred (e.g. Internet Protocol)
Routing
- Shared media has trivial routing (broadcast)
- In switched media we can have
  - Source-based (source specifies the route)
  - Virtual circuits (end-to-end route created)
    - When the connection is made, set up the route
    - Switches forward packets along the route
  - Destination-based (source specifies the destination)
    - Switches must route the packet toward the destination
- Routing can also be classified as
  - Deterministic (one route from a source to a destination)
  - Adaptive (different routes can be used)
Routing Messages
- Shared media
  - Broadcast to everyone!
- Switched media needs real routing. Options:
  - Source-based routing: the message specifies the path to the destination (changes of direction)
  - Virtual circuit: a circuit is established from source to destination; the message picks the circuit to follow
  - Destination-based routing: the message specifies the destination; the switch must pick the path
    - Deterministic: always follow the same path
    - Adaptive: pick different paths to avoid congestion or failures
    - Randomized routing: pick between several good paths to balance network load
Routing Methods for Switches
- Store-and-forward
  - The switch receives the entire packet, then forwards it
  - If an error occurs when forwarding, the switch can re-send
- Wormhole routing
  - A packet consists of flits (a few bytes each)
  - The first flit contains a header with the destination address
  - The switch gets the header and decides where to forward
  - Other flits are forwarded as they arrive
  - Looks like the packet is worming through the network
  - If an error occurs along the way, the sender must re-send
    - No switch has the entire packet to re-send it
Cut-Through Routing
- What happens when a link is busy?
  - The header arrives at a switch, but the outgoing link is busy
  - What do we do with the other flits of the packet?
- Wormhole routing: stop the tail when the head stops
  - Now each flit along the way blocks a link
  - One busy link creates other busy links → traffic jam
- Cut-through routing
  - If the outgoing link is busy, receive and buffer the incoming flits
  - The buffered flits stay there until the link becomes free
  - When the link is free, the flits start worming out of the switch
  - Needs packet-sized buffer space in each switch
  - Wormhole routing: a switch needs to buffer only one flit
Store and Forward vs. Cut-Through
- Store-and-forward policy: each switch waits for the full packet to arrive before sending it to the next switch (good for WANs)
- Cut-through (or wormhole) routing: the switch examines the header, decides where to send the message, and then starts forwarding it immediately
  - In wormhole routing, when the head of the message is blocked, the message stays strung out over the network, potentially blocking other messages (a switch needs to buffer only the piece of the packet that is sent between switches)
  - Cut-through routing lets the tail continue when the head is blocked, accordioning the whole message into a single switch (requires a buffer large enough to hold the largest packet)
- See board
Routing Network Latency
- Switch delay
  - Time from the incoming to the outgoing link in a switch
- Switches
  - Number of switches along the way
- Transfer time
  - Time to send the packet through a link
- Store-and-forward end-to-end transfer time
  - (Switches × SwitchDelay) + (TransferTime × (Switches + 1))
- Wormhole or cut-through end-to-end transfer time
  - (Switches × SwitchDelay) + TransferTime
  - Much better if there are many switches along the way
- See the example on page 811
Switch Technology
- What do we want in a switch?
  - Many input and output links
    - Usually the number of input and output links is the same
  - Low contention inside the switch
    - Best if there is none (only external links cause contention)
  - Short switching delay
Switch Technology
- Two common switching organizations
- Crossbar
  - Allows any node to communicate with any other node with 1 pass through an interconnection
  - Very low switching delay, no internal contention
  - Complexity grows as the square of the number of links
  - Cannot have too many links (i.e. 64 in, 64 out)
- Omega
  - Uses less HW ((n/2) × log₂n vs. n² switches) but has more contention
  - Builds switches with more ports using small crossbars
  - Lower complexity per link, but longer delay and more contention
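The hardware-cost comparison above is worth putting in concrete numbers. A minimal sketch (the helper names are illustrative):

```python
import math

# Per the slide: a crossbar needs n^2 crosspoints, while an omega
# network needs (n/2) * log2(n) small 2x2 switching elements.

def crossbar_cost(n: int) -> int:
    return n * n

def omega_cost(n: int) -> int:
    return (n // 2) * int(math.log2(n))

# For n = 64 ports: 4096 crosspoints vs. 192 2x2 switches.
```

The omega network trades that large hardware saving for internal contention: some port pairs block each other even when their external links are free.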
(others) crossbar vs. omega
[Figure: a crossbar network and an omega network, side by side]
(others) Fat-tree topology
[Figure: fat tree; circles are switches, squares are processor-memory nodes]
- Bandwidth is higher, higher in the tree, to match common communication patterns
(others) Ring topology
- Instead of a centralized switching element, small switches are placed at each computer
- Avoids a full interconnection network
- Disadvantages
  - Some nodes are not directly connected
    - Results in multiple stops, more overhead
  - The average message must travel through n/2 switches (n = number of nodes)
- Advantages
  - Unlike shared lines, a ring can have several transfers going at once
[Figure: example of a ring topology]
(others) Dedicated communication links
- Usually takes the form of a communication link between every switch
- An expensive alternative to a ring
  - Big performance gains, but big costs as well
  - Usually the cost scales with the square of the number of nodes
- The big costs led designers to invent things in between
  - In other words, topologies between the cost of rings and the performance of fully connected networks
  - Whether or not a topology is good typically depends on the situation
- Some popular topologies for MPPs are
  - Grids, tori, hypercubes
(others) Topologies for commercial MPPs
[Figure: a 2D grid (mesh) of 16 nodes, a 2D torus of 16 nodes, and a hypercube of 16 nodes (16 = 2^4, so n = 4)]
Practical issues with topologies
- 3D drawings have to be mapped to chips
  - This is easier said than done
  - Different layers of metal in VLSI/CMOS circuits help give you added dimensions, but only so much
  - (See board for explanation)
- Reality: things that should work perfectly in theory don't really work in practice
- What about the speed of a switch?
  - If it's fixed, more links/switch means less bandwidth/link
    - Which could make a topology less desirable
  - Latency through a switch depends on the complexity of the routing pattern, which depends on the topology
Network Topology
- What do we want in a network topology?
  - Many nodes, high bandwidth, low contention, low latency
- Low latency: few switches along any route
  - For each (src, dst) pair, we choose the shortest route
  - The longest such route over all (src, dst) pairs is the network diameter
  - We want networks with small diameter!
- Low contention: high aggregate bandwidth
  - Divide the network into two groups, each with half the nodes
  - The total bandwidth between the groups is the bisection bandwidth
  - Actually, we use the minimum over all such bisections
Bisection bandwidth
- A popular measure for MPP connections
- Calculated by dividing all of the interconnect of a machine/system into 2 equal parts
  - Each part has 1/2 of the nodes
  - Then, sum the bandwidth of the lines that cross the imaginary dividing line
- For example
  - For fully connected interconnections, the bisection bandwidth is (n/2)² (n = number of nodes)
- Problem: not all interconnections are symmetric
  - Solution: pick the worst possible configuration
  - We generally want a worst-case estimate
Example
- See board for a bisection bandwidth example
Protocol Stacks
Protocol Stack: TCP/IP
Clusters
- A kind of message-passing machine
  - Uses commodity components
  - Or even commodity PCs and LANs
- Very cost effective
  - Uses mass-produced components (cheap)
- Very good for highly parallel tasks
  - E.g. web searches: largely independent
Rack-Mounted Systems
Top 500 Supercomputers
TCO for Clusters