Title: Parallel Architectures
1. Parallel Architectures
- Flynn's taxonomy
  - SISD, SIMD, MISD, MIMD
- Memory classification
  - shared, distributed, distributed shared
- Interconnection networks
  - static, dynamic, network parameters
- Nicely covered at http://www.top500.org/ORSC/2002/
2. Flynn's Taxonomy
- Flynn (1966) classified computers according to their instruction and data streams into 4 basic categories:
  - Single Instruction, Single Data (SISD)
  - Single Instruction, Multiple Data (SIMD)
  - Multiple Instructions, Single Data (MISD)
  - Multiple Instructions, Multiple Data (MIMD)
3. SISD
- Not a parallel computer
- Conventional serial, scalar von Neumann computer
- One instruction stream
  - a single instruction is issued each clock cycle
  - each instruction operates on a single (scalar) data element
- Limited by the number of instructions that can be issued in a given unit of time
- Current processors (Intel Pentium, AMD Athlon, Alpha) are not strictly SISD due to pipelining and wide issue, but are close enough for our purposes: only one thread can execute at a time.
4. SIMD
- Also von Neumann architectures, but with more powerful instructions
- Each instruction may operate on more than one data element
- Usually an intermediate host executes the program logic and broadcasts instructions to the other processors
- Synchronous (lockstep) execution
- Rating how fast these machines can issue instructions is not a good measure of their performance
- Developed because many important applications mostly operate upon arrays of data
- Two major types
  - Vector SIMD
  - Processor array SIMD
5Vector SIMD
Vector processing operates on whole vectors
(groups) of data at a time Example float
A8, B8, C8 Init(A,B,C) C AB
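As a point of comparison, here is a minimal C sketch (not from the slides) of the scalar loop that a single vector operation such as C = A + B replaces; a vector SIMD machine issues the whole element-wise operation as one instruction.

    #include <stdio.h>

    /* Scalar version of the vector operation C = A + B: a vector SIMD
     * machine performs all eight element-wise additions as a single
     * vector instruction instead of this loop. */
    int main(void)
    {
        float A[8], B[8], C[8];

        for (int i = 0; i < 8; i++) {      /* stand-in for Init(A,B,C) */
            A[i] = i;
            B[i] = 2 * i;
        }

        for (int i = 0; i < 8; i++)        /* C = A + B, element by element */
            C[i] = A[i] + B[i];

        for (int i = 0; i < 8; i++)
            printf("%.1f ", C[i]);
        printf("\n");
        return 0;
    }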
6. Vector SIMD (cont.)
- Examples
  - Cray-1, NEC SX-2, Fujitsu VP, Hitachi S820
  - single processor of Cray C90, Cray-2, NEC SX-3, Fujitsu VP2000, Convex C-2
  - NEC SX-6i
7. Memory Bandwidth in Vector SIMD
C = A + B
(Timing diagram comparing three memory configurations: one read/write path, two read/write paths, and two read paths plus one write path, showing how much of Read A, Read B and Write C can overlap in each case.)
8Processor Array SIMD
- Single instruction is issued and all processors
execute the same instruction, operating on
different sets of data. - Many, simple processing elements - 1000's.
- Processors run in a synchronous, lockstep fashion
9. Processor Array SIMD (cont.)
- Well suited only for data-parallel applications
- Inefficiency due to the need to switch off processors (see the example below)
- Out of fashion today
- Includes systolic arrays
- Examples
  - Connection Machine CM-2
  - MasPar MP-1, MP-2

Example:
    for i = 0 to 1000
        if a[i] > b[i] then x[i] = c[i]
                       else x[i] = d[i]

    Pi:  x x o x     (x: work, o: idle)
    Pj:  x o x
10. MISD
- No such computer has been built
- There are some applications using the MISD approach
  - cryptography: given the ciphertext, try different ways to decrypt it
  - sensor data analysis: try different transformations to get maximum information out of the measured data
11. MIMD
- The most flexible category
- Parallelism achieved by connecting multiple processors together
- Includes all forms of multiprocessor configurations
- Each processor executes its own instruction stream, independent of the other processors, on a unique data stream
- Advantages
  - processors can execute multiple job streams simultaneously
  - each processor can perform any operation regardless of what other processors are doing
- Disadvantages
  - load balancing overhead; synchronization is needed to coordinate processors at the end of a parallel structure in a single application
  - can be difficult to program
12. MIMD (cont.)
13. MIMD vs SIMD programming example
Problem: Given an upper triangular matrix A, compute the sum of the numbers in each column.

Parallel programs:

    Program 1 (MIMD style), for processor i:
        sum[i] = 0;
        for (j = 0; j <= i; j++)
            sum[i] += A[j][i];

    Program 2 (SIMD style), for processor i:
        sum[i] = 0;
        for (j = 0; j < n; j++)
            if (j <= i)
                sum[i] += A[j][i];
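As a rough illustration (not part of the original slide), the following C sketch counts, for n = 8, how many loop iterations each processor executes under the two programs: the MIMD version performs only the i+1 useful iterations for column i, while the SIMD version always steps through all n iterations and idles whenever j > i.

    #include <stdio.h>

    /* Count loop iterations per processor under the two schemes above
     * (n = 8, processor i handles column i of the triangular matrix). */
    int main(void)
    {
        enum { N = 8 };
        for (int i = 0; i < N; i++)
            printf("processor %d: MIMD iterations %d, SIMD iterations %d (idle %d)\n",
                   i, i + 1, N, N - (i + 1));
        return 0;
    }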
14. Parallel Architectures
- Flynn's taxonomy
  - SISD, SIMD, MISD, MIMD
- Memory classification
  - shared, distributed, distributed shared
- Interconnection networks
  - static, dynamic, network parameters
- Nicely covered at http://www.top500.org/ORSC/2002/
15. Memory Classification
- MIMD machines can be classified according to where the memory is located and how it is accessed
- Main classes
  - Shared Memory, with Uniform Memory Access (UMA) time
  - Distributed Memory
    - single address space: distributed shared memory, with Non-Uniform Memory Access (NUMA) time
    - multiple address spaces: communication only via message passing; these machines are called Massively Parallel Processors (MPP)
- UMA and NUMA machines are usually cache coherent: if one processor updates a location in shared memory, all the other processors know about the update
16. Shared Memory
17. Shared Memory (cont.)
- Multiple processors operate independently but share the same memory resources
- Synchronization is achieved by controlling tasks' reading from and writing to the shared memory
- Often called Symmetric MultiProcessor (SMP)
- Programming standard: OpenMP, see http://www.openmp.org (a minimal sketch follows below)
- Advantages
  - easy for the user to use efficiently
  - data sharing among tasks is fast (speed of memory access)
- Disadvantages
  - user is responsible for specifying synchronization, e.g., locks
  - not scalable (low tens of processors)
- Examples: Cray Y-MP, Convex C-2, Cray C-90, quad Pentium Xeon
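The sketch below is a minimal OpenMP example in C, added here only to illustrate the shared-memory model the slide refers to (it is not taken from the slides): all threads update the same shared array, and the reduction clause supplies the synchronization for the shared sum.

    #include <stdio.h>
    #include <omp.h>

    /* Minimal shared-memory sketch with OpenMP: the array a[] and the
     * variable sum are shared; the reduction clause combines the
     * per-thread partial sums so there is no data race. */
    int main(void)
    {
        enum { N = 1000000 };
        static double a[N];
        double sum = 0.0;

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = 0.5 * i;      /* every thread writes into the shared array */
            sum += a[i];         /* combined safely by the reduction          */
        }

        printf("sum = %.1f, using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }

Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp.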
18. Distributed Memory - (cc)NUMA
- Local memory is directly accessible by other processors
- Non-Uniform Memory Access time: accessing the memory of a different processor takes longer
- Popular choice today, e.g. SGI Origin 2000
- Moderately scalable (low hundreds of processors)
19. Distributed Memory - MPP
- Data is shared across a communication network using message passing
- User is responsible for synchronization using message passing
- Scales very well (thousands of processors)
- Called Massively Parallel Processors (MPP)
- Advantages
  - memory is scalable with the number of processors: increase the number of processors and the total memory size and bandwidth increase as well
  - each processor can rapidly access its own memory without interference
  - summary: easy to build
20. Distributed Memory - MPP (cont.)
- Disadvantages
  - difficult to map existing data structures to this memory organization
  - user is responsible for sending and receiving data among processors (a minimal message-passing sketch follows below)
  - to minimize overhead and latency, data should be blocked up in large chunks and shipped before the receiving node needs it
  - summary: difficult to program
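For contrast with the shared-memory example above, here is a minimal message-passing sketch in C using MPI (MPI is the usual standard on MPP and cluster machines, though the slides do not name it): each process owns its own data, and nothing is shared unless it is explicitly sent and received.

    #include <stdio.h>
    #include <mpi.h>

    /* Minimal message-passing sketch: process 0 owns a value and sends
     * it explicitly to process 1, which must post a matching receive. */
    int main(int argc, char **argv)
    {
        int rank, size, token;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            token = 42;                                   /* local data       */
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                  /* explicit receive */
            printf("process %d of %d received %d\n", rank, size, token);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes, e.g. mpirun -np 2 ./a.out.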
21. Typical Combination
- Multiple SMPs connected by a network
- Processors within an SMP communicate via shared memory
- Requires message passing between SMPs
22. Clusters: a special case of MPP
- Main idea
  - use commodity workstations/PCs and a commodity interconnect to get a cheap, high performance solution
- Pioneering projects
  - Berkeley's Network Of Workstations (NOW)
  - Beowulf clusters (NASA Goddard Space Flight Center project): ultra low-cost approach, commodity PCs + Linux + Ethernet
- Advantages
  - low cost, commodity upgradeable components
- Disadvantages (esp. early clusters)
  - inadequate interconnect
  - good only for problems requiring little communication (embarrassingly parallel problems)
23. Clusters (cont.)
- Improving communication
  - replace TCP/IP by a low latency protocol
  - replace Ethernet by more advanced hardware
- Advanced Interconnect
24. One Page Summary
Flynn's taxonomy crossed with the memory classification:
- Single instruction stream, single data stream: SISD
  - sequential processors
- Multiple instruction streams, single data stream: MISD
  - does not really exist
- Single instruction stream, multiple data streams: SIMD
  - vector processors
  - processor arrays
- Multiple instruction streams, multiple data streams: MIMD, further classified by address space and memory location:
  - single address space, shared memory: (cc)UMA
  - single address space, distributed memory: (cc)NUMA
  - multiple address spaces, distributed memory: MPP
25. Parallel Architectures
- Flynn's taxonomy
  - SISD, SIMD, MISD, MIMD
- Memory classification
  - shared, distributed, distributed shared
- Interconnection networks
  - static, dynamic, network parameters
- Nicely covered at http://www.top500.org/ORSC/2002/
26. Interconnection Networks
- Dynamic Interconnection Networks
  - built out of links and switches (also known as indirect networks)
  - usually associated with shared memory architectures
  - examples: bus-based, crossbar, multistage (Ω-network)
- Static Interconnection Networks
  - built out of point-to-point communication links between processors (also known as direct networks)
  - usually associated with message passing architectures
  - examples: ring, 2D and 3D mesh and torus, hypercube, butterfly
- Important parameters of Interconnection Networks; Embedding
  - latency, bandwidth
  - degree, diameter, connectivity, bisection (band)width
  - embeddings and their parameters
27. Dynamic Interconnection Networks
28. Bus-Based Interconnection Networks
- Processors and memory modules are connected to a shared bus
- Advantages
  - simple, low cost
- Disadvantages
  - only one processor can access memory at a given time
  - bandwidth does not scale with the number of processors/memory modules
- Example
  - quad Pentium Xeon
29Crossbar
- Advantages
- non blocking network
- Disadvantages
- cost O(pm)
- Example
- high end UMA
30. Multistage Networks (e.g. the Ω-network)
- Intermediate case between bus and crossbar
- Blocking network (but not always)
- Often used in NUMA computers
- Ω-network
  - each switch is a 2x2 crossbar
  - log(p) stages
  - cost p log(p)
- Simple routing algorithm (see the sketch below)
  - at each stage, look at the corresponding bit (starting with the msb) of the source and destination address
  - if the bits are the same, the message passes through, otherwise it crosses over
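A small C sketch of this routing rule (an illustration, following the slide's description): compare the bits of the source and destination addresses from the most significant bit down, and at each stage either pass through or cross over.

    #include <stdio.h>

    /* Print the switch setting at each stage of a 2^k-input Omega network
     * for a message from source s to destination d: compare the
     * corresponding bit of s and d, starting with the msb; equal bits
     * mean pass-through, different bits mean cross-over. */
    void omega_route(unsigned s, unsigned d, int k)
    {
        for (int stage = 0; stage < k; stage++) {
            int bit = k - 1 - stage;                 /* msb first */
            int s_bit = (s >> bit) & 1;
            int d_bit = (d >> bit) & 1;
            printf("stage %d: %s\n", stage,
                   s_bit == d_bit ? "pass-through" : "cross-over");
        }
    }

    int main(void)
    {
        omega_route(2, 7, 3);    /* 8-node network: route 010 -> 111 */
        return 0;
    }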
31. Ω-network
32. Dynamic network exercises
Question 1: Which of the following pairs of (processor, memory block) requests will collide/block?

Question 2: For a given processor/memory request (a,b), how many requests (x,y), with x != a and y != b, will block with (a,b) in an 8-node Ω-network? How does this number depend on the choice of (a,b)?
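One way to attack Question 1 mechanically is sketched below (an illustration under the standard destination-tag routing model, not from the slides): in an Ω-network with 2^k inputs, the link a message occupies after stage i is addressed by the low k-i bits of its source followed by the high i bits of its destination, so two requests block each other exactly when these link addresses coincide at some stage.

    #include <stdio.h>

    /* Do requests (s1,d1) and (s2,d2) block each other in an Omega
     * network with n = 2^k inputs?  They do iff they share the output
     * link of some switch at some stage. */
    int omega_blocks(unsigned s1, unsigned d1, unsigned s2, unsigned d2, int k)
    {
        unsigned mask = (1u << k) - 1;
        for (int i = 1; i <= k; i++) {
            unsigned link1 = ((s1 << i) | (d1 >> (k - i))) & mask;
            unsigned link2 = ((s2 << i) | (d2 >> (k - i))) & mask;
            if (link1 == link2)
                return 1;           /* same link after stage i: collision */
        }
        return 0;
    }

    int main(void)
    {
        /* 8-node network (k = 3): check a couple of request pairs */
        printf("%d\n", omega_blocks(0, 4, 2, 5, 3));  /* these collide     */
        printf("%d\n", omega_blocks(0, 0, 4, 4, 3));  /* these do not      */
        return 0;
    }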
33. Ω-network
(Figure: an 8-node Ω-network with inputs 0-7 on one side and outputs 0-7 on the other.)
34. Static Interconnection Networks
- Network Parameters
  - latency, bandwidth
  - degree, diameter, bisection (band)width
- Specific networks
  - linear array
  - ring
  - tree
  - 2D and 3D mesh/torus
  - hypercube
  - butterfly
  - fat tree
- Embedding
35. Network Parameters
- Latency, bandwidth
  - hardware related
  - depend also on the communication protocol
- Degree (maximum number of neighbours)
  - influences feasibility/cost; best is a low constant
- Diameter (maximal distance between two nodes)
  - determines a lower bound on time for some algorithms
- Bisection (band)width
  - the minimal number of edges separating two parts of equal size (respectively, the bandwidth across these edges)
  - lower bound on time for problems requiring the exchange of a lot of data
  - determines the VLSI layout area (in 2D) and volume (in 3D)
36Linear Array, Ring, Tree
- important logical topologies
- algorithms are often described on these
topologies - actual execution is performed on the embedding
into the physical network - low bisection width (1,2), high diameter for
line ring
p0
pn-1
p1
p2
p0
pn-1
p1
p2
37. 2D and 3D Array/Torus
- Good match for discrete simulation and matrix operations
- Easy to manufacture and extend
- Diameter and bisection width are both on the order of √p for p nodes (on the order of p^(1/3) and p^(2/3), respectively, in the 3D case)
- Examples: Cray T3D (3D torus), Intel Paragon (2D mesh)
38Hypercube
- good graph-theoretic properties (low diameter,
high bisection width) - nice recursive structure
- good for simulating other topologies (they can
be efficiently embedded into hypercube) - degree log (n), diameter log (n), bisection
width n/2 - costly/difficult to manufacture for high n, not
so popular nowadays
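To make the degree-log(n) property concrete, here is a small C sketch (not from the slides): the neighbours of a hypercube node are obtained by flipping each of its d address bits in turn.

    #include <stdio.h>

    /* Neighbours of a node in a d-dimensional hypercube: flip one of
     * the d address bits.  This illustrates degree = log(n) = d. */
    void print_neighbours(unsigned node, int d)
    {
        for (int i = 0; i < d; i++)
            printf("dimension %d: neighbour %u\n", i, node ^ (1u << i));
    }

    int main(void)
    {
        print_neighbours(5, 3);   /* node 101 in a 3-dimensional hypercube */
        return 0;
    }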
39. Butterfly
- Hypercube-derived network with log(n) diameter and constant degree
- Perfect match for the Fast Fourier Transform
- There are other hypercube-related networks (Cube Connected Cycles, Shuffle-Exchange, De Bruijn and Beneš networks); see Leighton's book for details
40. Fat Tree
- Main idea: exponentially increase the multiplicity of links as the distance from the bottom increases
- Keeps the nice properties of the binary tree (low diameter)
- Solves the low bisection width and the bottleneck at the top levels
- Example: CM-5
41. Embedding
- Problem: Assume you have an algorithm designed for a specific topology G. How do you get it to work on an interconnect with a different topology G'?
- Solution: Simulate G on G'.
- Formally: Given networks G(V,E) and G'(V',E'), find a mapping f which maps each vertex of V to a vertex of V' and each edge of E to a path in G'. Several vertices of V may map to one vertex of V' (especially if G has more vertices than G'). Such a mapping is called an embedding of G in G'.
- Goals
  - balance the number of vertices mapped to each node of G' (to balance the workload of the simulating processors)
  - each edge should map to a short path, optimally a single link, so that each communication step can be simulated efficiently (small dilation)
  - there should be little overlap between the resulting simulating paths, to prevent congestion on links
42. Embedding - examples
- Embedding a ring into a line
  - dilation 2, congestion 2 (see the sketch below)
  - a similar idea can be used to embed a torus into a mesh
- Embedding a ring into a 2D torus
  - dilation 1, congestion 1
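The following C sketch gives one concrete dilation-2 mapping of an n-node ring onto an n-node line, as referenced above (an illustration; the slides only state the dilation and congestion): walk outwards along the even line positions and return along the odd ones, so consecutive ring nodes land at most two positions apart.

    #include <stdio.h>

    /* One dilation-2 embedding of an n-node ring into an n-node line:
     * ring nodes 0..n/2-1 go to even positions, the rest come back
     * along the odd positions. */
    int ring_to_line(int i, int n)
    {
        return (i < (n + 1) / 2) ? 2 * i : 2 * (n - 1 - i) + 1;
    }

    int main(void)
    {
        int n = 8;
        for (int i = 0; i < n; i++)
            printf("ring node %d -> line position %d\n", i, ring_to_line(i, n));
        return 0;
    }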
43. Embedding Ring into Hypercube
- Map processor i to node G(d, i) of the d-dimensional hypercube
- The function G() is called the binary reflected Gray code
- G() can be easily defined recursively
  - G(d+1) = 0G(d), 1G(d)^R, where G(d)^R is the sequence G(d) in reverse order
- Example
  - 0, 1 -> 00, 01, 11, 10 -> 000, 001, 011, 010, 110, 111, 101, 100
(Figure: the ring embedded into a 3-dimensional hypercube with nodes labelled 000-111; see also the sketch below.)
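A compact C sketch of the mapping (the closed form gray(i) = i ^ (i >> 1) is the standard binary reflected Gray code; the slides give only the recursive definition): consecutive ring processors map to hypercube nodes that differ in exactly one bit, so the embedding has dilation 1.

    #include <stdio.h>

    /* Binary reflected Gray code: gray(i) and gray(i+1) differ in
     * exactly one bit, so consecutive ring processors are mapped to
     * hypercube neighbours (and so are the last and the first). */
    unsigned gray(unsigned i) { return i ^ (i >> 1); }

    int main(void)
    {
        int d = 3, n = 1 << d;                 /* 3-dimensional hypercube */
        for (int i = 0; i < n; i++)
            printf("ring node %d -> hypercube node %u\n", i, gray(i));
        return 0;
    }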
44. Embedding Arrays into Hypercube
- Recursive construction, using the embedding of rings as a building block
- Assumes the array sizes are powers of 2
45. Embedding Trees into Hypercube
- Arbitrary binary trees can be embedded as well, slightly less efficiently
- The example assumes the processors are only at the leaves of the tree
46. Possible questions
- Assume point-to-point communication with cost 1.
- Is it possible to sort on a 2D mesh in time O(log n)? What about ...?
- Is it possible to sort the leaves of a complete binary tree in time O(log n)? What about ...?
- Recall that an embedding of G into H maps each link of G to a path in H. Dilation is the maximal (over all links of G) length of such a path, while congestion is the maximal (over all links of H) number of such paths that use the link.
- Given an embedding of G into H, what is its dilation? Its congestion?
- Show how to embed a 2D torus into a 2D mesh with constant dilation. What is the dilation of your embedding? What is its congestion?
- Show how to embed a ring into a 2D mesh. Is it always possible to do it with both dilation and congestion equal to 1? Constant?
- Given a number x, what is its predecessor/successor in the d-bit binary reflected Gray code?
47. New Concepts and Terms - Summary
- SISD; vector and processor-array SIMD; MISD; MIMD
- Shared memory, distributed (shared) memory, cache coherence
- (cc)UMA, SMP, (cc)NUMA, MPP
- Clusters, NOW, Beowulf
- Interconnection networks: static, dynamic
  - bus, crossbar, multistage, Ω-network
  - blocking, non-blocking network
  - torus, hypercube, butterfly, fat tree
- Embedding, dilation, congestion
- Gray codes