Title: Parallel Architectures
1. Parallel Architectures
- Flynn's taxonomy
  - SISD, SIMD, MISD, MIMD
- Memory classification
  - shared, distributed, distributed shared
- Interconnection networks
  - static, dynamic, network parameters
- Nicely covered at http://www.top500.org/ORSC/2002/
2. Flynn's Taxonomy
- Flynn (1966) classified computers according to their instruction and data streams into 4 basic categories:
  - Single Instruction, Single Data (SISD)
  - Single Instruction, Multiple Data (SIMD)
  - Multiple Instructions, Single Data (MISD)
  - Multiple Instructions, Multiple Data (MIMD)
3. SISD
- Not a parallel computer
- Conventional serial, scalar von Neumann computer
- One instruction stream
  - a single instruction is issued each clock cycle
  - each instruction operates on a single (scalar) data element
- Limited by the number of instructions that can be issued in a given unit of time
- Current processors (Intel Pentium, AMD Athlon, Alpha) are not strictly SISD due to pipelining and wide issue, but are close enough for our purposes: only one thread can execute at a time.
4. SIMD
- Also von Neumann architectures, but with more powerful instructions
- Each instruction may operate on more than one data element
- Usually an intermediate host executes the program logic and broadcasts instructions to the other processors
- Synchronous (lockstep) execution
- Rating how fast these machines can issue instructions is not a good measure of their performance
- Developed because many important applications mostly operate upon arrays of data
- Two major types
  - Vector SIMD
  - Processor array SIMD
5Vector SIMD
Vector processing operates on whole vectors
(groups) of data at a time Example float
A8, B8, C8 Init(A,B,C) C AB
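As a point of comparison, here is a minimal C sketch (not from the slides) of the scalar loop that a single vector operation such as C = A + B replaces; a vector SIMD machine issues the whole element-wise operation as one instruction.

    #include <stdio.h>

    /* Scalar version of the vector operation C = A + B: a vector SIMD
     * machine performs all eight element-wise additions as a single
     * vector instruction instead of this loop. */
    int main(void)
    {
        float A[8], B[8], C[8];

        for (int i = 0; i < 8; i++) {      /* stand-in for Init(A,B,C) */
            A[i] = i;
            B[i] = 2 * i;
        }

        for (int i = 0; i < 8; i++)        /* C = A + B, element by element */
            C[i] = A[i] + B[i];

        for (int i = 0; i < 8; i++)
            printf("%.1f ", C[i]);
        printf("\n");
        return 0;
    }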
6. Vector SIMD (cont.)
- Examples
  - Cray-1, NEC SX-2, Fujitsu VP, Hitachi S820
  - single processor of Cray C90, Cray-2, NEC SX-3, Fujitsu VP2000, Convex C-2
  - NEC SX-6i
7. Memory Bandwidth in Vector SIMD
C = A + B
(Timing diagram comparing three memory configurations: one read/write path, two read/write paths, and two read paths plus one write path, showing how much of Read A, Read B and Write C can overlap in each case.)
8Processor Array SIMD
- Single instruction is issued and all processors
execute the same instruction, operating on
different sets of data. - Many, simple processing elements - 1000's.
- Processors run in a synchronous, lockstep fashion
9. Processor Array SIMD (cont.)
- Well suited only for data-parallel applications
- Inefficiency due to the need to switch off processors (see the example below)
- Out of fashion today
- Includes systolic arrays
- Examples
  - Connection Machine CM-2
  - MasPar MP-1, MP-2

Example:
    for i = 0 to 1000
        if a[i] > b[i] then x[i] = c[i]
                       else x[i] = d[i]

    Pi:  x x o x     (x: work, o: idle)
    Pj:  x o x
10. MISD
- No such computer has been built
- There are some applications using the MISD approach
  - cryptography: given the ciphertext, try different ways to decrypt it
  - sensor data analysis: try different transformations to get maximum information out of the measured data
11. MIMD
- The most flexible category
- Parallelism achieved by connecting multiple processors together
- Includes all forms of multiprocessor configurations
- Each processor executes its own instruction stream, independent of the other processors, on a unique data stream
- Advantages
  - processors can execute multiple job streams simultaneously
  - each processor can perform any operation regardless of what other processors are doing
- Disadvantages
  - load balancing overhead; synchronization is needed to coordinate processors at the end of a parallel structure in a single application
  - can be difficult to program
12. MIMD (cont.)
13. MIMD vs SIMD programming example
Problem: Given an upper triangular matrix A, compute the sum of the numbers in each column.

Parallel programs:

    Program 1 (MIMD style), for processor i:
        sum[i] = 0;
        for (j = 0; j <= i; j++)
            sum[i] += A[j][i];

    Program 2 (SIMD style), for processor i:
        sum[i] = 0;
        for (j = 0; j < n; j++)
            if (j <= i)
                sum[i] += A[j][i];
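As a rough illustration (not part of the original slide), the following C sketch counts, for n = 8, how many loop iterations each processor executes under the two programs: the MIMD version performs only the i+1 useful iterations for column i, while the SIMD version always steps through all n iterations and idles whenever j > i.

    #include <stdio.h>

    /* Count loop iterations per processor under the two schemes above
     * (n = 8, processor i handles column i of the triangular matrix). */
    int main(void)
    {
        enum { N = 8 };
        for (int i = 0; i < N; i++)
            printf("processor %d: MIMD iterations %d, SIMD iterations %d (idle %d)\n",
                   i, i + 1, N, N - (i + 1));
        return 0;
    }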
14. Parallel Architectures
- Flynn's taxonomy
  - SISD, SIMD, MISD, MIMD
- Memory classification
  - shared, distributed, distributed shared
- Interconnection networks
  - static, dynamic, network parameters
- Nicely covered at http://www.top500.org/ORSC/2002/
15. Memory Classification
- MIMD machines can be classified according to where the memory is located and how it is accessed
- Main classes
  - Shared Memory, with Uniform Memory Access (UMA) time
  - Distributed Memory
    - single address space: distributed shared memory, with Non-Uniform Memory Access (NUMA) time
    - multiple address spaces: communication only via message passing; these machines are called Massively Parallel Processors (MPP)
- UMA and NUMA machines are usually cache coherent: if one processor updates a location in shared memory, all the other processors know about the update
16. Shared Memory
17. Shared Memory (cont.)
- Multiple processors operate independently but share the same memory resources
- Synchronization is achieved by controlling tasks' reading from and writing to the shared memory
- Often called Symmetric MultiProcessor (SMP)
- Programming standard: OpenMP, see http://www.openmp.org (a minimal sketch follows below)
- Advantages
  - easy for the user to use efficiently
  - data sharing among tasks is fast (speed of memory access)
- Disadvantages
  - user is responsible for specifying synchronization, e.g., locks
  - not scalable (low tens of processors)
- Examples: Cray Y-MP, Convex C-2, Cray C-90, quad Pentium Xeon
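The sketch below is a minimal OpenMP example in C, added here only to illustrate the shared-memory model the slide refers to (it is not taken from the slides): all threads update the same shared array, and the reduction clause supplies the synchronization for the shared sum.

    #include <stdio.h>
    #include <omp.h>

    /* Minimal shared-memory sketch with OpenMP: the array a[] and the
     * variable sum are shared; the reduction clause combines the
     * per-thread partial sums so there is no data race. */
    int main(void)
    {
        enum { N = 1000000 };
        static double a[N];
        double sum = 0.0;

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = 0.5 * i;      /* every thread writes into the shared array */
            sum += a[i];         /* combined safely by the reduction          */
        }

        printf("sum = %.1f, using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }

Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp.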
18. Distributed Memory - (cc)NUMA
- Local memory is directly accessible by other processors
- Non-Uniform Memory Access time: accessing the memory of a different processor takes longer
- Popular choice today, e.g. SGI Origin 2000
- Moderately scalable (low hundreds of processors)
19. Distributed Memory - MPP
- Data is shared across a communication network using message passing
- User is responsible for synchronization using message passing
- Scales very well (thousands of processors)
- Called Massively Parallel Processors (MPP)
- Advantages
  - memory is scalable with the number of processors: increase the number of processors and the total memory size and bandwidth increase as well
  - each processor can rapidly access its own memory without interference
  - summary: easy to build
20. Distributed Memory - MPP (cont.)
- Disadvantages
  - difficult to map existing data structures to this memory organization
  - user is responsible for sending and receiving data among processors (a minimal message-passing sketch follows below)
  - to minimize overhead and latency, data should be blocked up in large chunks and shipped before the receiving node needs it
  - summary: difficult to program
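For contrast with the shared-memory example above, here is a minimal message-passing sketch in C using MPI (MPI is the usual standard on MPP and cluster machines, though the slides do not name it): each process owns its own data, and nothing is shared unless it is explicitly sent and received.

    #include <stdio.h>
    #include <mpi.h>

    /* Minimal message-passing sketch: process 0 owns a value and sends
     * it explicitly to process 1, which must post a matching receive. */
    int main(int argc, char **argv)
    {
        int rank, size, token;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            token = 42;                                   /* local data       */
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                  /* explicit receive */
            printf("process %d of %d received %d\n", rank, size, token);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes, e.g. mpirun -np 2 ./a.out.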
21. Typical Combination
- Multiple SMPs connected by a network
- Processors within an SMP communicate via shared memory
- Requires message passing between SMPs
22. Clusters: a special case of MPP
- Main idea
  - use commodity workstations/PCs and a commodity interconnect to get a cheap, high performance solution
- Pioneering projects
  - Berkeley's Network Of Workstations (NOW)
  - Beowulf clusters (NASA Goddard Space Flight Center project): ultra low-cost approach, commodity PCs + Linux + Ethernet
- Advantages
  - low cost, commodity upgradeable components
- Disadvantages (esp. early clusters)
  - inadequate interconnect
  - good only for problems requiring little communication (embarrassingly parallel problems)
23. Clusters (cont.)
- Improving communication
  - replace TCP/IP by a low latency protocol
  - replace Ethernet by more advanced hardware
- Advanced Interconnect
24. One Page Summary
Flynn's taxonomy crossed with the memory classification:
- Single instruction stream, single data stream: SISD
  - sequential processors
- Multiple instruction streams, single data stream: MISD
  - does not really exist
- Single instruction stream, multiple data streams: SIMD
  - vector processors
  - processor arrays
- Multiple instruction streams, multiple data streams: MIMD, further classified by address space and memory location:
  - single address space, shared memory: (cc)UMA
  - single address space, distributed memory: (cc)NUMA
  - multiple address spaces, distributed memory: MPP
25. Parallel Architectures
- Flynn's taxonomy
  - SISD, SIMD, MISD, MIMD
- Memory classification
  - shared, distributed, distributed shared
- Interconnection networks
  - static, dynamic, network parameters
- Nicely covered at http://www.top500.org/ORSC/2002/
26. Interconnection Networks
- Dynamic Interconnection Networks
  - built out of links and switches (also known as indirect networks)
  - usually associated with shared memory architectures
  - examples: bus-based, crossbar, multistage (Ω-network)
- Static Interconnection Networks
  - built out of point-to-point communication links between processors (also known as direct networks)
  - usually associated with message passing architectures
  - examples: ring, 2D and 3D mesh and torus, hypercube, butterfly
- Important parameters of Interconnection Networks; Embedding
  - latency, bandwidth
  - degree, diameter, connectivity, bisection (band)width
  - embeddings and their parameters
27. Dynamic Interconnection Networks
28. Bus-Based Interconnection Networks
- Processors and memory modules are connected to a shared bus
- Advantages
  - simple, low cost
- Disadvantages
  - only one processor can access memory at a given time
  - bandwidth does not scale with the number of processors/memory modules
- Example
  - quad Pentium Xeon
29Crossbar
- Advantages
- non blocking network
- Disadvantages
- cost O(pm)
- Example
- high end UMA
30. Multistage Networks (e.g. the Ω-network)
- Intermediate case between bus and crossbar
- Blocking network (but not always)
- Often used in NUMA computers
- Ω-network
  - each switch is a 2x2 crossbar
  - log(p) stages
  - cost p log(p)
- Simple routing algorithm (see the sketch below)
  - at each stage, look at the corresponding bit (starting with the msb) of the source and destination address
  - if the bits are the same, the message passes through, otherwise it crosses over
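A small C sketch of this routing rule (an illustration, following the slide's description): compare the bits of the source and destination addresses from the most significant bit down, and at each stage either pass through or cross over.

    #include <stdio.h>

    /* Print the switch setting at each stage of a 2^k-input Omega network
     * for a message from source s to destination d: compare the
     * corresponding bit of s and d, starting with the msb; equal bits
     * mean pass-through, different bits mean cross-over. */
    void omega_route(unsigned s, unsigned d, int k)
    {
        for (int stage = 0; stage < k; stage++) {
            int bit = k - 1 - stage;                 /* msb first */
            int s_bit = (s >> bit) & 1;
            int d_bit = (d >> bit) & 1;
            printf("stage %d: %s\n", stage,
                   s_bit == d_bit ? "pass-through" : "cross-over");
        }
    }

    int main(void)
    {
        omega_route(2, 7, 3);    /* 8-node network: route 010 -> 111 */
        return 0;
    }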
31. Ω-network
32. Dynamic network exercises
Question 1: Which of the following pairs of (processor, memory block) requests will collide/block?

Question 2: For a given processor/memory request (a,b), how many requests (x,y), with x != a and y != b, will block with (a,b) in an 8-node Ω-network? How does this number depend on the choice of (a,b)?
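One way to attack Question 1 mechanically is sketched below (an illustration under the standard destination-tag routing model, not from the slides): in an Ω-network with 2^k inputs, the link a message occupies after stage i is addressed by the low k-i bits of its source followed by the high i bits of its destination, so two requests block each other exactly when these link addresses coincide at some stage.

    #include <stdio.h>

    /* Do requests (s1,d1) and (s2,d2) block each other in an Omega
     * network with n = 2^k inputs?  They do iff they share the output
     * link of some switch at some stage. */
    int omega_blocks(unsigned s1, unsigned d1, unsigned s2, unsigned d2, int k)
    {
        unsigned mask = (1u << k) - 1;
        for (int i = 1; i <= k; i++) {
            unsigned link1 = ((s1 << i) | (d1 >> (k - i))) & mask;
            unsigned link2 = ((s2 << i) | (d2 >> (k - i))) & mask;
            if (link1 == link2)
                return 1;           /* same link after stage i: collision */
        }
        return 0;
    }

    int main(void)
    {
        /* 8-node network (k = 3): check a couple of request pairs */
        printf("%d\n", omega_blocks(0, 4, 2, 5, 3));  /* these collide     */
        printf("%d\n", omega_blocks(0, 0, 4, 4, 3));  /* these do not      */
        return 0;
    }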
33. Ω-network
(Figure: an 8-node Ω-network with inputs 0-7 on one side and outputs 0-7 on the other.)
34. Static Interconnection Networks
- Network Parameters
  - latency, bandwidth
  - degree, diameter, bisection (band)width
- Specific networks
  - linear array
  - ring
  - tree
  - 2D and 3D mesh/torus
  - hypercube
  - butterfly
  - fat tree
- Embedding
35. Network Parameters
- Latency, bandwidth
  - hardware related
  - depend also on the communication protocol
- Degree (maximum number of neighbours)
  - influences feasibility/cost; best is a low constant
- Diameter (maximal distance between two nodes)
  - determines a lower bound on time for some algorithms
- Bisection (band)width
  - the minimal number of edges separating two parts of equal size (respectively, the bandwidth across these edges)
  - lower bound on time for problems requiring the exchange of a lot of data
  - determines the VLSI layout area (in 2D) and volume (in 3D)
36Linear Array, Ring, Tree
- important logical topologies
- algorithms are often described on these
topologies - actual execution is performed on the embedding
into the physical network - low bisection width (1,2), high diameter for
line ring
p0
pn-1
p1
p2
p0
pn-1
p1
p2
37. 2D and 3D Array/Torus
- Good match for discrete simulation and matrix operations
- Easy to manufacture and extend
- Diameter and bisection width are both on the order of √p for p nodes (on the order of p^(1/3) and p^(2/3), respectively, in the 3D case)
- Examples: Cray T3D (3D torus), Intel Paragon (2D mesh)
38Hypercube
- good graph-theoretic properties (low diameter,
high bisection width) - nice recursive structure
- good for simulating other topologies (they can
be efficiently embedded into hypercube) - degree log (n), diameter log (n), bisection
width n/2 - costly/difficult to manufacture for high n, not
so popular nowadays
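To make the degree-log(n) property concrete, here is a small C sketch (not from the slides): the neighbours of a hypercube node are obtained by flipping each of its d address bits in turn.

    #include <stdio.h>

    /* Neighbours of a node in a d-dimensional hypercube: flip one of
     * the d address bits.  This illustrates degree = log(n) = d. */
    void print_neighbours(unsigned node, int d)
    {
        for (int i = 0; i < d; i++)
            printf("dimension %d: neighbour %u\n", i, node ^ (1u << i));
    }

    int main(void)
    {
        print_neighbours(5, 3);   /* node 101 in a 3-dimensional hypercube */
        return 0;
    }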
39. Butterfly
- Hypercube-derived network with log(n) diameter and constant degree
- Perfect match for the Fast Fourier Transform
- There are other hypercube-related networks (Cube Connected Cycles, Shuffle-Exchange, De Bruijn and Beneš networks); see Leighton's book for details
40. Fat Tree
- Main idea: exponentially increase the multiplicity of links as the distance from the bottom increases
- Keeps the nice properties of the binary tree (low diameter)
- Solves the low bisection width and the bottleneck at the top levels
- Example: CM-5
41. Embedding
- Problem: Assume you have an algorithm designed for a specific topology G. How do you get it to work on an interconnect with a different topology G'?
- Solution: Simulate G on G'.
- Formally: Given networks G(V,E) and G'(V',E'), find a mapping f which maps each vertex of V to a vertex of V' and each edge of E to a path in G'. Several vertices of V may map to one vertex of V' (especially if G has more vertices than G'). Such a mapping is called an embedding of G in G'.
- Goals
  - balance the number of vertices mapped to each node of G' (to balance the workload of the simulating processors)
  - each edge should map to a short path, optimally a single link, so that each communication step can be simulated efficiently (small dilation)
  - there should be little overlap between the resulting simulating paths, to prevent congestion on links
42. Embedding - examples
- Embedding a ring into a line
  - dilation 2, congestion 2 (see the sketch below)
  - a similar idea can be used to embed a torus into a mesh
- Embedding a ring into a 2D torus
  - dilation 1, congestion 1
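The following C sketch gives one concrete dilation-2 mapping of an n-node ring onto an n-node line, as referenced above (an illustration; the slides only state the dilation and congestion): walk outwards along the even line positions and return along the odd ones, so consecutive ring nodes land at most two positions apart.

    #include <stdio.h>

    /* One dilation-2 embedding of an n-node ring into an n-node line:
     * ring nodes 0..n/2-1 go to even positions, the rest come back
     * along the odd positions. */
    int ring_to_line(int i, int n)
    {
        return (i < (n + 1) / 2) ? 2 * i : 2 * (n - 1 - i) + 1;
    }

    int main(void)
    {
        int n = 8;
        for (int i = 0; i < n; i++)
            printf("ring node %d -> line position %d\n", i, ring_to_line(i, n));
        return 0;
    }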
43. Embedding Ring into Hypercube
- Map processor i to node G(d, i) of the d-dimensional hypercube
- The function G() is called the binary reflected Gray code
- G() can be easily defined recursively
  - G(d+1) = 0G(d), 1G(d)^R, where G(d)^R is the sequence G(d) in reverse order
- Example
  - 0, 1 -> 00, 01, 11, 10 -> 000, 001, 011, 010, 110, 111, 101, 100
(Figure: the ring embedded into a 3-dimensional hypercube with nodes labelled 000-111; see also the sketch below.)
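A compact C sketch of the mapping (the closed form gray(i) = i ^ (i >> 1) is the standard binary reflected Gray code; the slides give only the recursive definition): consecutive ring processors map to hypercube nodes that differ in exactly one bit, so the embedding has dilation 1.

    #include <stdio.h>

    /* Binary reflected Gray code: gray(i) and gray(i+1) differ in
     * exactly one bit, so consecutive ring processors are mapped to
     * hypercube neighbours (and so are the last and the first). */
    unsigned gray(unsigned i) { return i ^ (i >> 1); }

    int main(void)
    {
        int d = 3, n = 1 << d;                 /* 3-dimensional hypercube */
        for (int i = 0; i < n; i++)
            printf("ring node %d -> hypercube node %u\n", i, gray(i));
        return 0;
    }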
44. Embedding Arrays into Hypercube
- Recursive construction, using the embedding of rings as a building block
- Assumes the array sizes are powers of 2
45. Embedding Trees into Hypercube
- Arbitrary binary trees can be embedded as well, slightly less efficiently
- The example assumes the processors are only at the leaves of the tree
46. Possible questions
- Assume point-to-point communication with cost 1.
- Is it possible to sort on a 2D mesh in time O(log n)? What about ...?
- Is it possible to sort the leaves of a complete binary tree in time O(log n)? What about ...?
- Recall that an embedding of G into H maps each link of G to a path in H. Dilation is the maximal (over all links of G) length of such a path, while congestion is the maximal (over all links of H) number of such paths that use the link.
- Given an embedding of G into H, what is its dilation? Its congestion?
- Show how to embed a 2D torus into a 2D mesh with constant dilation. What is the dilation of your embedding? What is its congestion?
- Show how to embed a ring into a 2D mesh. Is it always possible to do it with both dilation and congestion equal to 1? Constant?
- Given a number x, what is its predecessor/successor in the d-bit binary reflected Gray code?
47. New Concepts and Terms - Summary
- SISD; vector and processor-array SIMD; MISD; MIMD
- Shared memory, distributed (shared) memory, cache coherence
- (cc)UMA, SMP, (cc)NUMA, MPP
- Clusters, NOW, Beowulf
- Interconnection networks: static, dynamic
  - bus, crossbar, multistage, Ω-network
  - blocking, non-blocking network
  - torus, hypercube, butterfly, fat tree
- Embedding, dilation, congestion
- Gray codes