Parallel Architectures - PowerPoint PPT Presentation

1
Parallel Architectures
  • Flynn's taxonomy
  • SISD, SIMD, MISD, MIMD
  • Memory classification
  • shared, distributed, distributed shared
  • Interconnection networks
  • static, dynamic, network parameters
  • Nicely covered at http://www.top500.org/ORSC/2002/

2
Flynn's Taxonomy
  • Flynn (1966) classified computers according to
    their instruction and data streams
  • 4 basic categories
  • Single Instruction, Single Data
  • Single Instruction, Multiple Data
  • Multiple Instructions, Single Data
  • Multiple Instructions, Multiple Data

3
SISD
  • Not a parallel computer
  • Conventional serial, scalar von Neumann computer
  • One instruction stream
  • A single instruction is issued each clock cycle
  • Each instruction operates on a single (scalar)
    data element
  • Limited by the number of instructions that can
    be issued in a given unit of time
  • Current processors (Intel Pentium, AMD Athlon,
    Alpha) are not strictly SISD due to pipelining
    and wide issue, but are close enough for our
    purposes: only one thread can execute at a time.

4
SIMD
  • Also von Neumann architectures, but with more
    powerful instructions
  • Each instruction may operate on more than one
    data element
  • Usually intermediate host executes program logic
    and broadcasts instructions to other processors
  • Synchronous (lockstep)
  • Rating how fast these machines can issue
    instructions is not a good measure of their
    performance
  • Developed because there are many important
    applications that mostly operate upon arrays of
    data.
  • Two major types
  • Vector SIMD
  • Processor array SIMD

5
Vector SIMD
Vector processing operates on whole vectors
(groups) of data at a time. Example:

  float A[8], B[8], C[8];
  Init(A, B, C);
  C = A + B;    /* one vector operation over all elements */
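A rough modern analogue of the same idea are the SIMD extensions
of current CPUs. The sketch below (an illustration, not from the
slides) uses SSE intrinsics, where a single _mm_add_ps instruction
adds four floats at once:

  #include <stdio.h>
  #include <xmmintrin.h>   /* SSE intrinsics */

  int main(void) {
      float A[8] = {1,2,3,4,5,6,7,8}, B[8] = {8,7,6,5,4,3,2,1}, C[8];

      /* one vector add handles four elements, so two instructions
         replace eight scalar additions */
      for (int i = 0; i < 8; i += 4) {
          __m128 a = _mm_loadu_ps(&A[i]);
          __m128 b = _mm_loadu_ps(&B[i]);
          _mm_storeu_ps(&C[i], _mm_add_ps(a, b));
      }

      for (int i = 0; i < 8; i++) printf("%.0f ", C[i]);   /* prints 9 eight times */
      printf("\n");
      return 0;
  }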
6
Vector SIMD (cont.)
  • Examples
  • Cray 1, NEC SX-2, Fujitsu VP, Hitachi S820
  • Single processor of Cray C-90, Cray-2, NEC SX-3,
    Fujitsu VP 2000, Convex C-2
  • NEC SX-6i

7
Memory Bandwidth in Vector SIMD
C = A + B
(timing diagram: with one read/write path the Read A,
Read B and Write C accesses are fully serialized; with
two read/write paths two of them can overlap; with two
read paths and one write path all three proceed in
parallel)
8
Processor Array SIMD
  • Single instruction is issued and all processors
    execute the same instruction, operating on
    different sets of data.
  • Many simple processing elements (1000s).
  • Processors run in a synchronous, lockstep fashion

9
Processor Array SIMD (cont.)
  • Well suited only for data-parallel applications
  • Inefficiency due to the need to switch off
    processors (see the example and sketch below)
  • Out of fashion today
  • Includes systolic arrays
  • Examples
  • Connection Machine CM-2
  • Maspar MP-1, MP-2

Example:

  for i = 0 to 1000
      if a[i] > b[i] then x[i] = c[i]
      else x[i] = d[i]

  Pi   x x o x        (x - work, o - idle)
  Pj   x o x
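The serial sketch below (not from the slides) mimics how a lockstep
machine executes this branch: every processing element evaluates the
condition, then the "then" phase runs with the false-condition PEs
switched off and the "else" phase with the true-condition PEs switched
off, so roughly half the machine idles in each phase:

  #include <stdio.h>
  #define N 8   /* illustrative size */

  int main(void) {
      float a[N] = {5,1,7,2,9,0,3,8}, b[N] = {4,4,4,4,4,4,4,4};
      float c[N] = {10,10,10,10,10,10,10,10};
      float d[N] = {20,20,20,20,20,20,20,20};
      float x[N];

      for (int i = 0; i < N; i++) {        /* conceptually: PE i, all in lockstep */
          int active = (a[i] > b[i]);      /* per-PE mask */
          if (active)  x[i] = c[i];        /* phase 1: "then" PEs work, others idle */
          if (!active) x[i] = d[i];        /* phase 2: "else" PEs work, others idle */
      }

      for (int i = 0; i < N; i++) printf("%.0f ", x[i]);
      printf("\n");
      return 0;
  }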
10
MISD
  • no such computer has been built
  • there are some applications using MISD approach
  • cryptography: given the ciphertext, try different
    ways to decrypt it
  • sensor data analysis: try different
    transformations to get maximum information out of
    the measured data

11
MIMD
  • The most flexible category
  • Parallelism achieved by connecting multiple
    processors together
  • Includes all forms of multiprocessor
    configurations
  • Each processor executes its own instruction
    stream, independent of the other processors, on
    its own data stream
  • Advantages
  • Processors can execute multiple job streams
    simultaneously
  • Each processor can perform any operation
    regardless of what other processors are doing
  • Disadvantages
  • Load balancing and synchronization overhead:
    processors must be coordinated at the end of a
    parallel structure within a single application
  • Can be difficult to program

12
MIMD (cont.)
13
MIMD vs SIMD programming example
Problem: Given an upper triangular matrix A, compute
the sum of the numbers in each column.
Parallel programs:

  Program 1 (MIMD style), for processor i:
      sum[i] = 0;
      for (j = 0; j <= i; j++) sum[i] += A[j][i];

  Program 2 (SIMD style), for processor i:
      sum[i] = 0;
      for (j = 0; j < n; j++)
          if (j <= i) sum[i] += A[j][i];
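A self-contained serial simulation of the two programs (an
illustration, not from the slides; the outer loop plays the role of
"processor i"):

  #include <stdio.h>
  #define N 4   /* number of columns / processors, small for illustration */

  int main(void) {
      /* upper triangular matrix: A[j][i] == 0 for j > i */
      double A[N][N] = {{1,2,3,4},{0,5,6,7},{0,0,8,9},{0,0,0,10}};
      double sum1[N], sum2[N];

      for (int i = 0; i < N; i++) {          /* Program 1: each "processor"
                                                loops only over its own column */
          sum1[i] = 0;
          for (int j = 0; j <= i; j++) sum1[i] += A[j][i];
      }
      for (int i = 0; i < N; i++) {          /* Program 2: every "processor" runs
                                                the same n iterations and masks */
          sum2[i] = 0;
          for (int j = 0; j < N; j++)
              if (j <= i) sum2[i] += A[j][i];
      }

      for (int i = 0; i < N; i++)
          printf("column %d: %.0f  %.0f\n", i, sum1[i], sum2[i]);
      return 0;
  }

Program 1 matches the MIMD model (different processors execute
different numbers of iterations), Program 2 the SIMD model (all
processors execute the same instruction stream; unwanted iterations
are masked out by the if).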
14
Parallel Architectures
  • Flynn's taxonomy
  • SISD, SIMD, MISD, MIMD
  • Memory classification
  • shared, distributed, distributed shared
  • Interconnection networks
  • static, dynamic, network parameters
  • Nicely covered at http://www.top500.org/ORSC/2002/

15
Memory Classification
  • MIMD machines can be classified according to
    where the memory is located and how it is
    accessed.
  • Main classes
  • Shared memory with Uniform Memory Access (UMA)
    time
  • Distributed memory
  • Single address space: distributed shared memory
    with Non-Uniform Memory Access (NUMA) time
  • Multiple address spaces
  • communication only via message passing
  • called Massively Parallel Processors (MPP)
  • UMA and NUMA machines are usually cache coherent
  • if one processor updates a location in shared
    memory, all the other processors know about the
    update

16
Shared Memory
17
Shared Memory (cont.)
  • Multiple processors operate independently but
    share the same memory resources
  • Synchronization achieved by controlling tasks
    reading from and writing to the shared memory
  • Often called a Symmetric MultiProcessor (SMP)
  • Programming standard: OpenMP, see
    http://www.openmp.org (a minimal sketch follows
    after this slide)
  • Advantages
  • Easy for user to use efficiently
  • Data sharing among tasks is fast (speed of
    memory access)
  • Disadvantages
  • User is responsible for specifying
    synchronization, e.g., locks
  • Not scalable (low tens of processors)
  • Examples: Cray Y-MP, Convex C-2, Cray C-90, quad
    Pentium Xeon
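A minimal OpenMP sketch of the shared memory model (an illustration,
not from the slides): all threads see the same array, and the
reduction clause takes care of the synchronization on sum:

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      enum { N = 1000000 };
      static double a[N];                         /* shared by all threads */
      double sum = 0.0;

      #pragma omp parallel for                    /* threads split the iterations */
      for (int i = 0; i < N; i++)
          a[i] = (double)i;

      #pragma omp parallel for reduction(+:sum)   /* synchronized combination */
      for (int i = 0; i < N; i++)
          sum += a[i];

      printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
      return 0;
  }

Compile with, e.g., gcc -fopenmp.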

18
Distributed Memory - (cc)NUMA
  • Local memory is directly accessible by other
    processors
  • Has Non-Uniform Memory Access time, as accessing
    the memory of a different processor takes longer
  • Popular choice today, e.g. SGI Origin 2000
  • Moderately scalable: low hundreds of processors

19
Distributed Memory - MPP
  • Data is shared across a communications network
    using message passing
  • User responsible for synchronization using
    message passing
  • Scales very well: thousands of processors
  • Called Massively Parallel Processors
  • Advantages
  • Memory is scalable with the number of processors:
    increasing the number of processors also increases
    the total memory size and bandwidth
  • Each processor can rapidly access its own memory
    without interference
  • Summary: easy to build

20
Distributed Memory - MPP (cont.)
  • Disadvantages
  • Difficult to map existing data structures to
    this memory organization
  • User responsible for sending and receiving data
    among processors
  • To minimize overhead and latency, data should be
    blocked up in large chunks and shipped before the
    receiving node needs it
  • Summary: difficult to program (a minimal MPI-style
    sketch follows after this slide)
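A minimal MPI sketch of the message-passing model (an illustration,
not from the slides): each process computes on its own local data,
and results are combined only through explicit communication:

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* each processor works on its own part of the data ... */
      double local = 0.0;
      for (int i = rank; i < 1000; i += size)
          local += (double)i;

      /* ... and data is shared only via message passing */
      double global = 0.0;
      MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0) printf("sum = %.0f\n", global);   /* 499500 */
      MPI_Finalize();
      return 0;
  }

Compile with mpicc and run with mpirun -np <p>.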

21
Typical Combination
  • Multiple SMPs connected by a network
  • Processors within an SMP communicate via shared
    memory
  • Requires message passing between SMPs

22
Clusters: a special case of MPP
  • Main idea
  • use commodity workstations/PCs and commodity
    interconnect to get a cheap, high performance
    solution
  • Pioneering projects
  • Berkeley's Network Of Workstations (NOW)
  • Beowulf clusters: NASA Goddard Space Flight
    Center project
  • ultra low-cost approach: commodity PCs + Linux +
    Ethernet
  • Advantages
  • low cost, commodity upgradeable components
  • Disadvantages (esp. early clusters)
  • inadequate interconnect
  • good only for problems requiring little
    communication (embarrassingly parallel problems)

23
Clusters (cont.)
  • Improving communication
  • replace TCP/IP by a low latency protocol
  • replace Ethernet by more advanced hardware
  • Advanced Interconnect

24
One Page Summary
(summary matrix: instruction stream vs. data stream;
the MIMD quadrant is further split by memory location
and address space)

  • Single instruction, single data: SISD
  • sequential processors
  • Multiple instruction, single data: MISD
  • does not really exist
  • Single instruction, multiple data: SIMD
  • vector processors
  • processor arrays
  • Multiple instruction, multiple data: MIMD
  • shared memory, single address space: (cc)UMA
  • distributed memory, single address space: (cc)NUMA
  • distributed memory, multiple address spaces: MPP,
    clusters
25
Parallel Architectures
  • Flynn's taxonomy
  • SISD, SIMD, MISD, MIMD
  • Memory classification
  • shared, distributed, distributed shared
  • Interconnection networks
  • static, dynamic, network parameters
  • Nicely covered at http://www.top500.org/ORSC/2002/

26
Interconnection Networks
  • Dynamic Interconnection Networks
  • built out of links and switches (also known as
    indirect networks)
  • usually associated with shared memory
    architectures
  • examples: bus-based, crossbar, multistage
    (Ω-network)
  • Static Interconnection Networks
  • built out of point-to-point communication links
    between processors (also known as direct
    networks)
  • usually associated with message passing
    architectures
  • examples: ring, 2D and 3D mesh and torus,
    hypercube, butterfly
  • Important parameters of Interconnection Networks,
    Embedding
  • latency, bandwidth
  • degree, diameter, connectivity, bisection
    (band)width
  • embeddings and their parameters

27
Dynamic Interconnection Networks
28
BUS Based Interconnection Networks
  • processors and the memory modules are connected
    to a shared bus
  • Advantages
  • simple, low cost
  • Disadvantages
  • only one processor can access memory at a given
    time
  • bandwidth does not scale with the number of
    processors/memory modules
  • Example
  • quad Pentium Xeon

29
Crossbar
  • Advantages
  • non blocking network
  • Disadvantages
  • cost O(p·m) for p processors and m memory modules
  • Example
  • high end UMA

30
Multistage Networks (e.g. Ω-network)
  • Intermediate case between bus and crossbar
  • Blocking network (but not always)
  • Often used in NUMA computers
  • Ω-network
  • Each switch is a 2x2 crossbar
  • log(p) stages
  • cost O(p log(p))
  • Simple routing algorithm (see the sketch after
    this slide)
  • At each stage, look at the corresponding bit
    (starting with the msb) of the source and
    destination address
  • If the bits are the same, the message passes
    through, otherwise it crosses over
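A small sketch of this routing rule (an illustration, not from the
slides), printing the switch setting at each stage of a p-input
network with log2(p) stages:

  #include <stdio.h>

  /* Compare bit s (msb first) of source and destination:
     equal bits -> pass-through, different bits -> cross-over. */
  void omega_route(unsigned src, unsigned dst, int stages) {
      for (int s = stages - 1; s >= 0; s--) {
          unsigned sbit = (src >> s) & 1u;
          unsigned dbit = (dst >> s) & 1u;
          printf("stage %d: %s\n", stages - 1 - s,
                 sbit == dbit ? "pass-through" : "cross-over");
      }
  }

  int main(void) {
      omega_route(2, 7, 3);   /* 8-node network: log2(8) = 3 stages */
      return 0;
  }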

31
Ω-network
32
Dynamic network exercises
Question 1: Which of the following pairs of
(processor, memory block) requests will
collide/block?
Question 2: For a given processor/memory request
(a,b), how many requests (x,y), with x != a and
y != b, will block with (a,b) in an 8-node
Ω-network? How does this number depend on the
choice of (a,b)?
33
Ω-network
(diagram: 8 inputs labeled 0-7 connected to 8 outputs
labeled 0-7 through three stages of 2x2 switches)
34
Static Interconnection Networks
  • Network Parameters
  • latency, bandwidth
  • degree, diameter, bisection (band)width
  • Specific networks
  • Linear array
  • Ring
  • Tree
  • 2D and 3D mesh/torus
  • Hypercube
  • Butterfly
  • Fat tree
  • Embedding

35
Network Parameters
  • Latency, Bandwidth
  • hardware related
  • depends also on the communication protocol
  • Degree (maximum number of neighbours)
  • influences feasibility/cost, best is a low
    constant
  • Diameter (maximal distance between two nodes)
  • determines lower bound on time for some
    algorithms
  • Bisection (band)width
  • the minimal number of edges whose removal splits
    the network into two parts of equal size (the
    bisection bandwidth is the aggregate bandwidth of
    these edges)
  • lower bound on time for problems requiring the
    exchange of a lot of data
  • lower bound on VLSI layout cost: area Ω(B²) in 2D,
    volume Ω(B^(3/2)) in 3D, for bisection width B
    (standard values for a few networks are listed
    below)
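For reference, standard parameter values for a few of the networks
discussed on the following slides (assuming p processors; these
particular numbers are not on the original slide):

  • ring: degree 2, diameter ⌊p/2⌋, bisection width 2
  • 2D mesh: degree 4, diameter 2(√p - 1), bisection width √p
  • hypercube: degree log p, diameter log p, bisection width p/2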

36
Linear Array, Ring, Tree
  • important logical topologies
  • algorithms are often described on these
    topologies
  • actual execution is performed on the embedding
    into the physical network
  • low bisection width (1 for the line, 2 for the
    ring), high diameter for both

(diagrams: linear array p0 - p1 - p2 - ... - pn-1, and
ring with the additional link from pn-1 back to p0)
37
2D and 3D Array / Torus
  • good match for discrete simulation and matrix
    operations
  • easy to manufacture and extend
  • diameter Θ(√p), bisection width Θ(√p)
    (Θ(p^(1/3)) and Θ(p^(2/3)) for the 3D case)
  • Examples: Cray T3D (3D torus), Intel Paragon (2D
    mesh)

38
Hypercube
  • good graph-theoretic properties (low diameter,
    high bisection width)
  • nice recursive structure
  • good for simulating other topologies (they can
    be efficiently embedded into hypercube)
  • degree log(n), diameter log(n), bisection width
    n/2 (see the neighbour sketch after this slide)
  • costly/difficult to manufacture for high n, not
    so popular nowadays
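A tiny sketch of the hypercube's structure (an illustration, not
from the slides): node i is connected to the log2(n) nodes whose
labels differ from i in exactly one bit:

  #include <stdio.h>

  int main(void) {
      int d = 3;                              /* 3-dimensional hypercube, n = 8 */
      for (int i = 0; i < (1 << d); i++) {
          printf("node %d:", i);
          for (int k = 0; k < d; k++)
              printf(" %d", i ^ (1 << k));    /* flip bit k -> neighbour in dim k */
          printf("\n");
      }
      return 0;
  }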

39
Butterfly
  • Hypercube-derived network with log(n) diameter
    and constant degree
  • perfect match for the Fast Fourier Transform
  • there are other hypercube-related networks (Cube
    Connected Cycles, Shuffle-Exchange, De Bruijn and
    Beneš networks); see Leighton's book for details

40
Fat Tree
  • Main idea: exponentially increase the
    multiplicity of links as the distance from the
    bottom increases
  • keeps the nice properties of the binary tree (low
    diameter)
  • solves the low bisection width and the bottleneck
    at the top levels of an ordinary tree
  • Example: CM-5

41
Embedding
  • Problem: Assume you have an algorithm designed
    for a specific topology G. How do you get it to
    work on an interconnect with a different topology
    G'?
  • Solution: Simulate G on G'.
  • Formally: Given networks G(V,E) and G'(V',E'),
    find a mapping f which maps each vertex from V
    onto a vertex of V' and each edge from E onto a
    path in G'. Several vertices from V may map onto
    one vertex of V' (especially if G has more
    vertices than G'). Such a mapping is called an
    embedding of G into G'.
  • Goals
  • balance the number of vertices mapped to each
    node of G' (to balance the workload of the
    simulating processors)
  • each edge should map to a short path, optimally a
    single link (so each communication step can be
    simulated efficiently): small dilation
  • there should be little overlap between the
    resulting simulating paths (to prevent congestion
    on links): small congestion

42
Embedding - examples
  • Embedding ring into line
  • dilation 2, congestion 2 (a sketch of one such
    embedding follows after this slide)
  • similar idea can be used to embed torus into
    mesh
  • Embedding ring into 2D torus
  • dilation 1, congestion 1
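One possible dilation-2 embedding of a ring into a line, as a small
sketch (an illustration, not necessarily the exact mapping drawn on
the slide): the first half of the ring sits on the even line
positions and the second half is folded back onto the odd positions:

  #include <stdio.h>
  #include <stdlib.h>

  #define P 8   /* number of nodes, illustrative */

  /* line position of ring node i */
  int line_pos(int i) {
      return (i < P / 2) ? 2 * i : 2 * (P - 1 - i) + 1;
  }

  int main(void) {
      int max_dilation = 0;
      for (int i = 0; i < P; i++) {
          int j = (i + 1) % P;                        /* ring edge (i, i+1) */
          int d = abs(line_pos(i) - line_pos(j));     /* length of its line path */
          if (d > max_dilation) max_dilation = d;
          printf("ring edge (%d,%d) -> line path of length %d\n", i, j, d);
      }
      printf("dilation = %d\n", max_dilation);        /* prints 2, as on the slide */
      return 0;
  }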

43
Embedding Ring into Hypercube
  • map processor i of the ring onto node G(d, i) of
    the d-dimensional hypercube
  • the function G() is called the binary reflected
    Gray code
  • G() can be easily defined recursively
  • G(d+1) = 0·G(d) followed by 1·G(d) in reverse
    order
  • Example (see also the generator sketch below)
  • 0,1 → 00,01,11,10 → 000,001,
    011,010,110,111,101,100

(diagram: 3-dimensional hypercube with nodes labeled
000-111; the ring is embedded by visiting the nodes in
Gray-code order)
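A small generator for the d-bit binary reflected Gray code, following
the recursion above (an illustration, not from the slides):

  #include <stdio.h>

  /* codes[] must have room for 2^d entries;
     builds G(k+1) = 0·G(k) followed by 1·G(k) reversed, for k = 0..d-1 */
  void gray(int d, unsigned codes[]) {
      codes[0] = 0;
      for (int k = 0; k < d; k++) {
          int len = 1 << k;                           /* current length 2^k */
          for (int i = 0; i < len; i++)               /* reflected half, bit k set */
              codes[2 * len - 1 - i] = codes[i] | (1u << k);
      }
  }

  int main(void) {
      unsigned codes[8];
      gray(3, codes);       /* 000,001,011,010,110,111,101,100 */
      for (int i = 0; i < 8; i++)
          printf("ring position %d -> hypercube node %u%u%u\n", i,
                 (codes[i] >> 2) & 1u, (codes[i] >> 1) & 1u, codes[i] & 1u);
      return 0;
  }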
44
Embedding arrays into Hypercube
  • recursive construction using ring embeddings as
    a building block
  • assumes array sizes are powers of 2

45
Embedding trees into Hypercube
  • arbitrary binary trees can be (slightly less)
    efficiently embedded as well
  • example below assumes the processors are only at
    the leaves of the tree

46
Possible questions
  • Assume point-to-point communication with cost 1.
  • Is it possible to sort in a 2D mesh in time
    O(log n)? ...? ...?
  • Is it possible to sort the leaves of a complete
    binary tree in time O(log n)? What about ...?
  • Recall that an embedding of G into H maps each
    link of G to a path in H. Dilation is the maximal
    (over all links in G) length of such a path, while
    congestion is the maximal (over all links of H)
    number of such paths that use the link.
  • Given an embedding of G into H, what is its
    dilation? Its congestion?
  • Show how to embed a 2D torus into a 2D mesh with
    constant dilation. What is the dilation of your
    embedding? What is its congestion?
  • Show how to embed a ring into a 2D mesh. Is it
    always possible to do it with both dilation and
    congestion equal to 1? Constant?
  • Given a number x, what is its predecessor/successor
    in the d-bit binary reflected Gray code?

47
New Concepts and Terms - Summary
  • SISD, vector- and processor array- SIMD, MISD,
    MIMD
  • shared memory, distributed (shared) memory,
    cache coherence
  • (cc)UMA, SMP, (cc)NUMA, MPP
  • Clusters, NOW, Beowulf
  • Interconnection Networks Static, Dynamic
  • bus, crossbar, multistage, Ω-network
  • blocking, non-blocking network
  • Torus, Hypercube, Butterfly, Fat Tree
  • Embedding, dilation, congestion
  • Gray codes