CSE 160 Lecture 2

Provided by: philip335
Learn more at: https://cseweb.ucsd.edu

1
CSE 160 Lecture 2
2
Today's Topics
  • Flynn's Taxonomy
  • Bit-Serial, Vector, Pipelined Processors
  • Interconnection Networks
  • Topologies
  • Routing
  • Embedding
  • Network Bisection

3
Taxonomy
  • Flynn (1966) classified machines by their data and
    control (instruction) streams

4
SIMD
  • Single Instruction, Multiple Data (SIMD)
  • All processors execute the same program in
    lockstep
  • Data that each processor sees is different
  • Single control processor
  • Individual processors can be turned on/off at
    each cycle
  • Illiac IV, CM-2, MasPar are some examples
  • Silicon Graphics Reality Graphics engine
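
As a rough illustration of lockstep execution with per-processor on/off control, the sketch below applies one operation to every data element and uses a mask to disable some "processors" for that step. The function `simd_step` and the sample data are hypothetical and only model the idea; they do not describe any particular machine.

```python
def simd_step(op, data, mask):
    """Apply the same operation to every element whose processor is active.

    Inactive processors (mask entry False) keep their old value.
    """
    return [op(x) if active else x for x, active in zip(data, mask)]


# Each "processor" holds one value; only those holding a negative
# value are switched on for the negate step (computes absolute value).
data = [3, -1, 4, -5]
mask = [x < 0 for x in data]
data = simd_step(lambda x: -x, data, mask)
print(data)  # [3, 1, 4, 5]
```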

5
MIMD
  • All processors execute their own set of
    instructions
  • Processors operate on separate datastreams
  • No centralized clock implied
  • SP-2, T3E, Clusters, Crays, etc.

6
SPMD/MPMD
  • Single/Multiple Program Multiple Data
  • SPMD processors run the same program, but
    processors are not necessarily run in lock step.
  • Very popular and scalable programming style
  • MPMD is similar except that different processors
    run different programs
  • PVM distribution has some simple examples

7
Processor Types
  • Four types
  • Bit serial
  • Vector
  • Cache-based, pipelined
  • Custom (e.g. Tera MTA or KSR-1)

8
Bit Serial
  • Only seen in SIMD machines like CM-2 or MasPar
  • Each clock cycle, one bit of the data is
    loaded/written
  • Simplifies memory system and memory trace count
  • Popular for very dense (64K) processor arrays

9
Cache-based, Pipelined
  • Garden Variety Microprocessor
  • SPARC, Intel x86, MC68xxx, MIPS
  • Register-based ALUs and FPUs
  • Registers are of scalar type
  • Pipelined execution to improve performance of
    individual chips
  • Splits up components of basic operation like
    addition into stages
  • The more stages, the greater the potential speedup, but the more
    problems with branching and data/control hazards
  • Per-processor caches make it challenging to build
    SMPs (coherency issues)
  • Now dominates the high-end market

10
Vector Processors
  • Very specialized machines
  • Registers are true vectors with power of 2
    lengths
  • Designed to efficiently perform matrix-style
    operations
  • $Ax = b$, i.e. $b(I) = \sum_J A(I,J)\,x(J)$
  • Vector registers V1, V2, V3
  • V1 = A(I,:), V2 = x(:)
  • MULV V3(I), V1, V2
  • Chaining to efficiently handle larger vectors
    than size of vector registers
  • Cray, Hitachi, SGI (now Cray SV-1) are examples
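
The matrix-vector product above can be sketched in plain Python as a minimal model of how a row longer than a vector register might be processed in register-sized pieces. The register length `VLEN`, the helper `mulv`, and the sample matrix are assumptions made for illustration; real vector hardware performs the elementwise multiply in a single instruction.

```python
VLEN = 4  # hypothetical vector register length (a power of 2)


def mulv(v1, v2):
    """Model of an elementwise vector multiply: V3 = V1 * V2."""
    return [a * b for a, b in zip(v1, v2)]


def matvec(A, x):
    """Compute b(I) = sum over J of A(I,J) * x(J), VLEN elements at a time."""
    b = []
    for row in A:
        total = 0
        for start in range(0, len(x), VLEN):
            v1 = row[start:start + VLEN]   # V1 = piece of A(I,:)
            v2 = x[start:start + VLEN]     # V2 = matching piece of x
            v3 = mulv(v1, v2)
            total += sum(v3)               # reduce the products
        b.append(total)
    return b


A = [[1, 2, 3, 4, 5],
     [0, 1, 0, 1, 0]]
x = [1, 1, 2, 2, 3]
print(matvec(A, x))  # [32, 3]
```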

11
Some Custom Processors
  • Denelcor HEP/Tera MTA
  • Multiple register sets
  • Stack Pointer, Instruction Pointer, Frame
    Pointer, etc.
  • Facilitates hardware threads
  • Switch each clock cycle to different register set
  • Why? Stalls to memory subsystem in one thread can
    be hidden by concurrency
  • KSR-1
  • Cache-only memory processor
  • Basically 2 generations behind standard micros

12
Going Parallel
  • In the late 70s, even vector monsters started to
    go parallel
  • For parallel processing to work, individual processors
    must synchronize
  • SIMD: synchronize every clock cycle
  • MIMD: explicit synchronization
  • Message passing
  • Semaphores, monitors, fetch-and-increment
  • Focus on interconnection networks for rest of
    lecture

13
Characterizing Networks
  • Bandwidth
  • Device/switch latency
  • Switching types
  • Circuit switched (e.g. telephone)
  • Packet switched (e.g. Internet)
  • Store and forward
  • Virtual Cut Through
  • Wormhole routed
  • Topology
  • Number of connections
  • Diameter (how many hops through switches)

14
Latency
  • Latency is the time between issuing a command
    and seeing its first effect
  • Time from pushing on the gas pedal until the car goes forward
  • Time from joining a line until the cashier starts on
    your order
  • First bit leaves computer A, first bit arrives at
    computer B
  • OR
  • (Message latency) First bit leaves computer A,
    last bit arrives at computer B
  • Startup latency is the amount of time to send a
    zero length message

15
Bandwidth
  • Bits/second that can travel through a connection
  • A really simple model for calculating the time to
    send a message of N bytes
  • Time = latency + N / bandwidth (with bandwidth expressed in bytes/second)
  • Bisection is the minimum number of wires that
    must be cut to divide a network of machines into
    two equal halves.
  • Bisection bandwidth is the total bandwidth
    through the bisection
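
The simple timing model on this slide can be written as a one-line function. The link parameters below (50 microseconds latency, 100 MB/s bandwidth) are hypothetical values chosen only to show how the two terms trade off.

```python
def message_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Simple model: time = latency + N / bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


# Hypothetical link parameters, not measurements of any real network.
LATENCY = 50e-6     # seconds
BANDWIDTH = 100e6   # bytes per second

for n in (0, 1_000, 1_000_000):
    print(n, message_time(n, LATENCY, BANDWIDTH))
```

Under these assumed numbers, short messages are dominated by latency and long messages by the N / bandwidth term; a zero-length message costs exactly the startup latency.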

16
Interconnection Topologies
  • Completely connected
  • Every node has a direct wire connection to every
    other node
  • N(N-1)/2 wires; clearly impractical for large N
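
To see how quickly the wire count grows, here is a small sketch that evaluates N(N-1)/2 for a few machine sizes. The function name `complete_links` and the chosen sizes are illustrative assumptions.

```python
def complete_links(n):
    """Number of wires in a completely connected network of n nodes."""
    return n * (n - 1) // 2


for n in (8, 64, 1024):
    print(n, complete_links(n))
# 8 28
# 64 2016
# 1024 523776
```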

17
Line/Ring
[Figure: nodes 1 through 7 connected in a line/ring]
  • Simple interconnection
  • First topology where routing is an issue
  • Needed when no direct connection exists between
    nodes
  • To go from node 2 to node 4, a message has to pass
    through node 3
  • What happens if 2 wants to communicate with 3 at
    the same time that 1 wants to communicate with 4?
  • What is the bisection of a line/ring?
  • If the links are of bandwidth B, what is the
    bisection bandwidth?
  • What is the aggregate bandwidth of the network?

18
Mesh/Torus
  • Generalization of line/ring to multiple
    dimensions
  • More routes between nodes
  • What is the bisection of this network?

[Figure: mesh of three rows, each with nodes 1 through 7]
19
Hop Count
  • Networks are measured by diameter
  • This is the minimum number of hops that a message
    must traverse between the two nodes that are furthest
    apart
  • Line of N nodes: diameter = N - 1
  • 2D (N x M) mesh: diameter = N + M - 2
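
The two diameter formulas can be checked by brute force. The sketch below builds a line and a 2D mesh as adjacency lists and finds the longest shortest path with breadth-first search. All function names here are illustrative, and the approach is only practical for small networks.

```python
from collections import deque


def line_graph(n):
    """Adjacency list of a line of n nodes (no wraparound link)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def mesh_graph(n, m):
    """Adjacency list of an n x m mesh (no wraparound links)."""
    adj = {}
    for r in range(n):
        for c in range(m):
            nbrs = []
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < m:
                    nbrs.append((rr, cc))
            adj[(r, c)] = nbrs
    return adj


def eccentricity(adj, start):
    """Hop count from start to the node furthest from it (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def diameter(adj):
    """Largest shortest-path hop count over all pairs of nodes."""
    return max(eccentricity(adj, s) for s in adj)


print(diameter(line_graph(7)))     # 6, i.e. N - 1
print(diameter(mesh_graph(4, 5)))  # 7, i.e. N + M - 2
```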

20
Tree-based Networks
  • Nodes organized in a tree fashion (important for
    some global algorithms)

  • What is the diameter of this network? Its bisection? Its bisection
    bandwidth?
21
Hypercubes
[Figure: 1D, 2D, 3D, and 4D hypercubes]
22
Hypercubes 2
  • A dimension-N hypercube is constructed by
    connecting the corresponding corners of two (N-1)-dimensional hypercubes
  • Relatively low wire count to build large networks
  • Multiple routes from any source to any destination
  • Exercise for the reader: what is the diameter of
    a K-dimensional hypercube?
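
The recursive construction can be sketched directly: copy the smaller cube twice and link each corner to its twin. The function `hypercube_edges` and its integer node numbering (the new dimension sets a new top bit) are assumptions made for this illustration.

```python
def hypercube_edges(dim):
    """Links of a dim-dimensional hypercube, built from two smaller cubes."""
    if dim == 0:
        return []  # a single node has no links
    lower = hypercube_edges(dim - 1)
    offset = 1 << (dim - 1)
    # Second copy: the same links, with the new top bit set on both ends.
    upper = [(a + offset, b + offset) for a, b in lower]
    # Connect each corner of the first copy to its twin in the second.
    bridges = [(i, i + offset) for i in range(offset)]
    return lower + upper + bridges


for d in range(1, 5):
    print(d, 2 ** d, len(hypercube_edges(d)))  # dimension, nodes, links
```

With this numbering every link joins two nodes whose labels differ in exactly one bit, and a d-dimensional cube ends up with d * 2^(d-1) links.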

23
Labeling/Routing in a Hypercube
  • Nodes are labeled in Gray code
  • Connected neighbors have their binary node number
    representation differ by one bit.
  • 3D cube

[Figure: 3D cube with corners labeled 000, 001, 101, 100, 010, 011, 110, 111]
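
A short sketch of the labeling idea: the standard reflected Gray code gives a sequence in which consecutive labels differ by one bit, and the neighbors of a hypercube node are found by flipping each bit of its label in turn. The helper names `gray` and `neighbors` are illustrative.

```python
def gray(i):
    """i-th value of the reflected binary Gray code."""
    return i ^ (i >> 1)


def neighbors(node, dim):
    """Hypercube neighbors of node: flip each of the dim label bits."""
    return [node ^ (1 << k) for k in range(dim)]


codes = [format(gray(i), "03b") for i in range(8)]
print(codes)
# ['000', '001', '011', '010', '110', '111', '101', '100']

print([format(n, "03b") for n in neighbors(0b000, 3)])
# ['001', '010', '100']
```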
24
The e-cube routing algorithm
  • Source address $S = S_0 S_1 S_2 \ldots S_n$
  • Destination address $D = D_0 D_1 D_2 \ldots D_n$
  • Let $R = R_0 R_1 R_2 \ldots R_n = S \oplus D$ (bitwise XOR)
  • The number of one bits in R is the distance between
    S and D
  • Starting at S, go to the neighbor for the first j with $R_j = 1$
    (if $S_j = 0$, go to the neighbor where $S_j = 1$)
  • Continue routing from this intermediate node:
    find the next $R_k$ ($k > j$) that is one and go to that
    neighbor.

25
E-cube routing example
  • 8 Dimensional Hypercube (256 Nodes)
  • S = 134 = 0x86 = 10000110
  • D = 215 = 0xD7 = 11010111
  • $S \oplus D$ = 0x51 = 01010001
  • Distance = 3
  • S → 11000110 (198)
  • → 11010110 (214)
  • → 11010111 (215)
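
The example can be reproduced with a few lines of code. This is a minimal sketch that assumes bits are corrected from the most significant end first, which matches the order of hops in the example; the function name `ecube_route` is hypothetical.

```python
def ecube_route(src, dst, dim):
    """Return the list of nodes visited from src to dst in a dim-cube.

    Assumes differing bits are fixed from the most significant bit down,
    as in the worked example.
    """
    path = [src]
    node = src
    r = src ^ dst                      # R = S XOR D
    for bit in range(dim - 1, -1, -1):
        if r & (1 << bit):             # this address bit still differs
            node ^= 1 << bit           # hop to the neighbor across it
            path.append(node)
    return path


path = ecube_route(134, 215, 8)
print(path)                                # [134, 198, 214, 215]
print([format(n, "08b") for n in path])
print(len(path) - 1)                       # 3 hops = number of one bits in R
```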

26
Embedding
  • A network is embeddable if its nodes and links can be
    mapped to a target network
  • A mesh is embeddable in a hypercube
  • There is a mapping of mesh nodes and links
    onto hypercube nodes and links
  • The dilation of an embedding is the largest number of links
    needed in the target network to represent
    a single link of the embedded network
  • Perfect embeddings have dilation 1
  • Embedding a tree into a mesh has a dilation of 2
    (see example in book)
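
One way to see that a mesh embeds in a hypercube with dilation 1 is to label each mesh coordinate with a Gray code and concatenate the bits. The sketch below does this for a mesh whose side lengths are powers of 2 and measures the dilation as the largest Hamming distance between the labels of adjacent mesh nodes. The function names and the power-of-2 restriction are assumptions of this illustration.

```python
def gray(i):
    """i-th value of the reflected binary Gray code."""
    return i ^ (i >> 1)


def embed_mesh(row_bits, col_bits):
    """Map each (row, col) of a 2^row_bits x 2^col_bits mesh to a hypercube label."""
    rows, cols = 1 << row_bits, 1 << col_bits
    return {(r, c): (gray(r) << col_bits) | gray(c)
            for r in range(rows) for c in range(cols)}


def dilation(mapping, rows, cols):
    """Largest hypercube distance between images of adjacent mesh nodes."""
    worst = 0
    for (r, c), label in mapping.items():
        for rr, cc in ((r + 1, c), (r, c + 1)):
            if rr < rows and cc < cols:
                # Hypercube hop distance is the Hamming distance of labels.
                hops = bin(label ^ mapping[(rr, cc)]).count("1")
                worst = max(worst, hops)
    return worst


mapping = embed_mesh(2, 2)          # 4 x 4 mesh into a 4D hypercube
print(dilation(mapping, 4, 4))      # 1
```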

27
Modern Parallel Machines are Packet Switched
  • Break message into smaller blocks and send these
    pieces through the network
  • Network intermediate points (routers) can be
    store-and-forward or virtual cut through
  • Store and forward requires buffering at each
    switch if an incoming packet has packets ahead of
    it on an outgoing port (congestion)
  • Virtual cut-through avoids the unconditional
    buffering of store-and-forward by cutting
    through the switch when the output port is free

28
Wormhole Routing
  • Wormhole routing is a variation of virtual cut
    through
  • Small headers (flow control digits, or flits) pass
    through the network.
  • When a flit is allowed to cut through a switch,
    the original sender is guaranteed a clear path
    through that switch.
  • A tail flit closes the connection
  • Wormhole was defined by Seitz and is used in
    Myrinet, a very popular cluster interconnect.

29
Latency of Circuit Switched and Virtual Cut
Through
  • Circuit switch latency
  • $(L_c/B) \cdot l + (L/B)$
  • $L_c$ = length of control packet
  • $B$ = bandwidth
  • $l$ = number of links
  • $L$ = length of packet
  • Virtual cut-through latency
  • $(L_h/B) \cdot l + (L/B)$
  • $L_h$ = length of header packet

30
Store-Forward and Wormhole routing Latency
  • Wormhole routing latency
  • $(L_f/B) \cdot l + (L/B)$
  • $L_f$ = length of flit
  • Store-forward latency
  • $(L/B) \cdot l$
  • Store and forward latency can be much worse for
    many hops.
  • Virtual cut-through, wormhole, and circuit switch latencies
    approach $L/B$ as message length increases
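
The four latency formulas can be compared numerically. The packet, header, and flit sizes and the link bandwidth below are hypothetical values picked only to show how the hop count affects each scheme; the function names are illustrative.

```python
def circuit_switched(lc, length, bw, links):
    """(Lc/B) * l + L/B"""
    return (lc / bw) * links + length / bw


def virtual_cut_through(lh, length, bw, links):
    """(Lh/B) * l + L/B"""
    return (lh / bw) * links + length / bw


def wormhole(lf, length, bw, links):
    """(Lf/B) * l + L/B"""
    return (lf / bw) * links + length / bw


def store_and_forward(length, bw, links):
    """(L/B) * l"""
    return (length / bw) * links


# Hypothetical parameters (bits and bits/second), not from any real machine.
B = 1e8      # link bandwidth
L = 8000     # packet length
LC = 64      # control packet length
LH = 64      # header length
LF = 8       # flit length

for hops in (1, 4, 16):
    print(hops,
          circuit_switched(LC, L, B, hops),
          virtual_cut_through(LH, L, B, hops),
          wormhole(LF, L, B, hops),
          store_and_forward(L, B, hops))
```

With these assumed sizes, the store-and-forward time grows in proportion to the hop count, while the other three stay close to L/B because only the small control, header, or flit term is multiplied by the number of links.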

31
Deadlock/Livelock
  • Livelock and deadlock are potential problems in any
    network design.
  • Livelock occurs in adaptive routing algorithms
    when a packet never finds its destination
  • Deadlock occurs when packets cannot be forwarded
    because they are waiting for other packets to move out of
    the way, while each blocking packet is itself waiting for a blocked
    packet to move

32
Next Time
  • All about clusters
  • Introduction to PVM (and MPI)