Title: Homework from last lecture:
1. Homework from last lecture
- Given an array A of n elements, evenly distributed among p processors (you may assume p divides n), and an integer k > 0
- an index i is a k-maximum iff A[i] > A[j] for all j ≠ i with i-k ≤ j ≤ i+k
- Goal: report all k-maxima in the array A
[Figure: array A split into p contiguous blocks; processor p0 holds elements 0 .. n/p-1, ..., processor p(p-1) holds elements n(p-1)/p .. n-1.]
2. Homework from last lecture: solution
- Case 1: k ≤ n/p (a sketch of the local computation follows below)
- which processors does processor i have to communicate with?
- what data has to be communicated?
- how many communication rounds?
- Answers:
- processors i-1 and i+1 (unless i = 0 or i = p-1)
- k data items on each side, or just the k-maximum candidate for each side
- 1 or 2 rounds
- Question: Which solution is better?
- depends on the platform and k
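A minimal C sketch of Case 1 under the stated assumptions: each processor has already received k halo elements from each neighbour (e.g. via message passing; boundary processors would fill the halo with a value smaller than any element). The array names and halo layout are illustrative, not from the lecture.

    #include <stdio.h>

    /* Report which local indices are k-maxima. buf holds k halo elements
     * from the left neighbour, then the m local elements, then k halo
     * elements from the right neighbour (2k + m entries in total). */
    void report_k_maxima(const int *buf, int m, int k) {
        for (int i = 0; i < m; i++) {
            int c = k + i;                       /* position of local element i */
            int is_max = 1;
            for (int j = c - k; j <= c + k; j++)
                if (j != c && buf[j] >= buf[c]) { is_max = 0; break; }
            if (is_max) printf("local index %d is a k-maximum\n", i);
        }
    }

    int main(void) {
        /* k = 2, m = 4: halo (from left) | local block | halo (from right) */
        int buf[] = {3, 1, /* | */ 4, 7, 2, 5, /* | */ 6, 0};
        report_k_maxima(buf, 4, 2);   /* prints: local index 1 (value 7) */
        return 0;
    }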
3. Homework from last lecture: solution
- Case 2: k > n/p
- processor i has to communicate with ⌈k/(n/p)⌉ processors on each side
- k is rather high, so you want to send just candidates
- several techniques possible, will be covered later in the course:
- pipelining
- broadcasting/multicasting trees
4. Lecture Outline (2 lectures)
- A bit of Historical Perspective
- Parallelisation Approaches
- Parallel Programming Platforms
- logical organization - the programmer's view of the platform
- physical organization - the hardware view
- communication costs
- process-to-processor mappings
5. A bit of historical perspective
Parallel computing has been here since the early days of computing. Traditionally: custom HW, custom SW, high cost. The doom of Moore's law: custom HW has a hard time catching up with commodity processors. Current trend: use commodity HW components, standardize SW. The market size of High Performance Computing is comparable to the market size for disposable diapers. (Explicitly!) parallel computing has never been mainstream.
6. A bit of historical perspective (cont.)
- Parallelism sneaking into commodity computers
- instruction-level parallelism: wide issue, pipelining, OOO execution
- data-level parallelism: SSE, 3DNow!, AltiVec
- thread-level parallelism: Hyper-Threading in Pentium 4, multiple cores coming in the not-too-distant future
- Transistor budgets allow for multiple processor cores on a chip.
- Most applications would benefit from being parallelised and executed on a parallel computer.
- even PC applications, especially the most demanding ones - games, multimedia
7. A bit of historical perspective III
- Chicken & Egg Problem
- Why build parallel computers when the applications are sequential?
- Why parallelize applications when there are no parallel commodity computers?
- Answers:
- What else to do with all those transistors?
- They already are a bit parallel (wide issue, multimedia instructions, hyperthreading), and this bit is growing.
- Yet another reason to study parallel computing:
- Principles of parallel algorithm design (locality of data reference) lend themselves to cache-friendly sequential algorithms.
- The same applies to out-of-core computations (data servers).
8. Parallelisation Approaches
- Parallelizing compiler
- advantage: use your current code
- disadvantage: very limited abilities
- Parallel domain-specific libraries
- e.g. linear algebra, numerical libraries, quantum chemistry
- usually a good choice, use when possible
- Communication libraries
- message passing libraries: MPI, PVM
- shared memory libraries: declare and access shared memory variables (on MPP machines done by emulation)
- advantage: use a standard compiler
- disadvantage: low-level programming (a "parallel assembler")
9. Parallelisation Approaches (cont.)
- New parallel languages
- use a language with built-in explicit control of parallelism
- no language is the best in every domain
- needs a new compiler
- fights against inertia
- Parallel features in existing languages
- adding parallel features to an existing language
- e.g. for expressing loop parallelism (pardo) and data placement
- example: High Performance Fortran
- Additional possibilities in shared-memory systems
- use threads
- preprocessor compiler directives (OpenMP) - see the sketch below
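For illustration, a minimal OpenMP sketch of the directive style just mentioned. The pragma is standard OpenMP; the arrays and values are made up.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000

    int main(void) {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        /* the directive turns an ordinary loop into a parallel one,
           much like the pardo construct used elsewhere in these slides */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[42] = %g\n", c[42]);
        return 0;
    }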
10. Parallelisation Approaches: Our Focus
- Communication libraries: MPI, PVM
- industry standard, available for every platform
- very general, low-level approach
- perfect match for clusters
- most likely to be useful for you
- Shared memory programming
- also very important
- likely to be useful in next iterations of PCs
11. Parallel Programming Platforms
- Implicit Parallelism in Modern Microprocessors
- pipelining, superscalar execution, VLIW
- Limitations of Memory System Performance
- problem: high latency of memory vs. speed of computation
- solutions: caches, latency hiding using multithreading and prefetching
- Explicit Parallel Programming Platforms
- logical organization - the programmer's view of the platform
- physical organization - the hardware view
- communication costs
- process-to-processor mappings
12. Logical View of a PP Platform
- Control Structure - how to express parallel tasks
- Single Instruction stream, Multiple Data stream (SIMD)
- Multiple Instruction stream, Multiple Data stream (MIMD)
- Single Program, Multiple Data (SPMD)
- Communication Model - how to specify interactions between these tasks
- Shared Address Space Platforms (multiprocessors)
- Uniform Memory Access (UMA) multiprocessors
- Non-Uniform Memory Access (NUMA) multiprocessors
- cache coherence issues
- Message Passing Platforms
13. Control Structure: SIMD

Single Instruction stream, Multiple Data stream

Example:
    for (i=0; i<1000; i++) pardo
        c[i] = a[i] + b[i];

Processor k executes: c[k] = a[k] + b[k]
14. SIMD (cont.)
- early parallel machines
- Illiac IV, MPP, CM-2, MasPar MP-1
- modern settings
- multimedia extensions - MMX, SSE
- DSP chips
- positives
- less hardware needed
- easy to understand and reason about
- negatives
- proprietary hardware needed: fast obsolescence, high development costs/time
- rigid structure, suitable only for highly structured problems
- inherent inefficiency due to selective turn-off
15. SIMD inefficiency example

Example:
    for (i=0; i<10; i++)
        if (a[i] < b[i]) c[i] = a[i] + b[i];
        else c[i] = 0;

[Figure: a = (4, 1, 7, 2, 9, 3, 3, 0, 6, 7) and b = (5, 3, 4, 1, 4, 5, 3, 1, 4, 8), one element per processor p0..p9; c not yet computed.]
16. SIMD inefficiency example

Example:
    for (i=0; i<10; i++) pardo
        if (a[i] < b[i]) c[i] = a[i] + b[i];
        else c[i] = 0;

[Figure: same a and b on processors p0..p9; all processors evaluate the condition a[i] < b[i] in lock-step.]
17. SIMD inefficiency example

Example:
    for (i=0; i<10; i++) pardo
        if (a[i] < b[i]) c[i] = a[i] + b[i];
        else c[i] = 0;

[Figure: during the if branch only the processors with a[i] < b[i] (p0, p1, p5, p7, p9) are active, producing c = (9, 4, -, -, -, 8, -, 1, -, 15) so far; the other processors idle.]
18. SIMD inefficiency example

Example:
    for (i=0; i<10; i++) pardo
        if (a[i] < b[i]) c[i] = a[i] + b[i];
        else c[i] = 0;

[Figure: during the else branch the remaining processors (p2, p3, p4, p6, p8) write 0, giving c = (9, 4, 0, 0, 0, 8, 0, 1, 0, 15). In each branch half of the processors are switched off, so the machine runs at half efficiency.]
19. Control Structure: MIMD

Multiple Instruction stream, Multiple Data stream

- Single Program, Multiple Data (SPMD)
- popular way to program MIMD computers
- simplifies code maintenance/program distribution
- equivalent to MIMD (a big switch on the process ID at the beginning)
20. MIMD (cont.)

- positives
- can be built easily/quickly/cheaply from existing microprocessors
- very flexible (suitable for irregular problems)
- negatives
- requires more resources (duplicated program, OS, ...)
- more difficult to reason about / design correct programs
21. Logical View of a PP Platform

- Control Structure - how to express parallel tasks
- Single Instruction stream, Multiple Data stream (SIMD)
- Multiple Instruction stream, Multiple Data stream (MIMD)
- Single Program, Multiple Data (SPMD)
- Communication Model - how to specify interactions between these tasks
- Shared Address Space Platforms (multiprocessors)
- Uniform Memory Access (UMA) multiprocessors
- Non-Uniform Memory Access (NUMA) multiprocessors
- cache coherence issues
- Message Passing Platforms
22. Shared Address Space Platforms

[Figure: two typical organizations - shared memory (UMA) and distributed memory (NUMA).]
23. Shared Address Space Platforms

- all memory addressable by all processors
- needs an address translation mechanism
- may or may not provide cache coherence
- access time
- uniform: UMA (shared memory)
- non-uniform: NUMA (distributed memory)
- principal communication mechanisms
- put() and get()
24. Message Passing Platforms

- p processing nodes, each with its own exclusive address space
- each node can be a single processor or a shared address space multiprocessor
- inter-node communication possible only through message passing
- principal functions
- send(), receive()
- each processor has a unique ID
- mechanisms provided for learning your own ID, the number of nodes, etc.
- several standard APIs available: MPI, PVM (see the sketch below)
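A minimal MPI sketch of the send()/receive() style (illustrative; MPI_Comm_rank and MPI_Comm_size are the standard calls for learning your own ID and the number of nodes):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* my unique ID */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of nodes */

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("node %d of %d received %d\n", rank, size, value);
        }
        MPI_Finalize();
        return 0;
    }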
25. Physical Organization of Parallel Platforms

- Ideal Parallel Computer: PRAM
- Interconnection Networks for Parallel Computers
- static
- dynamic
- cost of communication
- evaluating interconnection networks
26. Parallel Random Access Machine

- p processors, each has a local memory and is connected to an unbounded shared memory
- processors work in lock-step; an access to shared memory costs 1 step
- 4 main classes, depending on how simultaneous accesses are handled:
- Exclusive read, exclusive write - EREW PRAM
- Concurrent read, exclusive write - CREW PRAM
- Exclusive read, concurrent write - ERCW PRAM (for completeness)
- Concurrent read, concurrent write - CRCW PRAM
- Resolving concurrent writes:
- Common: all writes must write the same value
- Arbitrary: an arbitrary write succeeds
- Priority: the write with the highest priority succeeds
- Sum: the sum of the written values is stored
27. Parallel Random Access Machine (cont.)

- abstracts away communication, allows one to focus on the parallel tasks
- an algorithm for PRAM might lead you to a good algorithm for a real machine
- if you prove that something cannot be efficiently solved on PRAM, it cannot be efficiently done on any practical machine (based on current technology)
- it is not feasible to manufacture PRAM
- the cost of connecting p processors to m memory cells such that their accesses do not interfere is Ω(pm), which is huge for any practical values of m
28. PRAM Algorithm Example

Problem: use an EREW PRAM to sum the numbers stored at m[0], m[1], ..., m[n-1], where n = 2^k for some k. The result should be stored at m[0].

Example for k = 3: [figure omitted]

Algorithm for processor p[i]:
    for (j=0; j<k; j++)
        if (i % 2^(j+1) == 0) {
            a = read(m[i]);
            b = read(m[i + 2^j]);
            write(a+b, m[i]);
        }
29. PRAM Example Notes

- the program is written in SIMD (and SPMD) style
- the inefficiency caused by idling processors is clearly visible
- can be easily extended to n not a power of 2
- takes log2(n) rounds to execute (a C simulation follows below)
- using a similar approach (+ some other ideas) it can be shown that:
- any CRCW PRAM can be simulated by an EREW PRAM with a slowdown factor of O(log n)
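A sequential C simulation of the summation above (a sketch; the iterations of the outer loop correspond to the PRAM rounds, so for n = 8 there are log2(8) = 3 rounds):

    #include <stdio.h>

    int main(void) {
        int m[8] = {3, 1, 4, 1, 5, 9, 2, 6};   /* n = 2^k with k = 3 */
        int n = 8;

        for (int twoj = 1; twoj < n; twoj *= 2)    /* twoj = 2^j, j = 0..k-1 */
            for (int i = 0; i < n; i += 2 * twoj)  /* processors with i % 2^(j+1) == 0 */
                m[i] = m[i] + m[i + twoj];         /* one parallel round */

        printf("sum = %d\n", m[0]);                /* prints 31 */
        return 0;
    }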
30. PRAM Example 2

- Problem: use a Sum-CRCW PRAM with n² processors to sort n numbers stored at m[0], m[1], ..., m[n-1].
- Question: How many steps would it take?
- O(n log n)
- O(n)
- O(log n)
- O(1)
- less than (n log n)/n²
31. PRAM Example 2

Note: we will mark the processors p[i,j], for 0 ≤ i,j < n.

Algorithm for processor p[i,j]:
    a = read(m[i]);
    b = read(m[j]);
    if ((a > b) || ((a == b) && (i > j)))
        write(1, m[i]);    /* Sum-CRCW: the written 1s add up to the rank of m[i] */
    if (j == 0) {
        b = read(m[i]);    /* b now holds the rank */
        write(a, m[b]);
    }
[Figure: m = (1, 7, 3, 9, 3, 0) at m[0]..m[5]; the n x n 0/1 matrix of comparison outcomes; after the concurrent Sum-writes m holds the ranks (1, 4, 2, 5, 3, 0); after the final writes m holds the sorted sequence (0, 1, 3, 3, 7, 9).]
An O(1) sorting algorithm! (A simulation sketch follows below.)
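As a sanity check, a sequential C simulation of the Sum-CRCW sort above (illustrative; the concurrent Sum-writes into m are emulated here with a separate rank array and ordinary additions):

    #include <stdio.h>

    int main(void) {
        int m[6] = {1, 7, 3, 9, 3, 0};
        int rank[6] = {0}, out[6];
        int n = 6;

        /* every processor pair (i, j) simulated; the Sum rule adds up the 1s */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (m[i] > m[j] || (m[i] == m[j] && i > j))
                    rank[i] += 1;

        for (int i = 0; i < n; i++)    /* processors p[i,0] place the elements */
            out[rank[i]] = m[i];

        for (int i = 0; i < n; i++) printf("%d ", out[i]);  /* 0 1 3 3 7 9 */
        printf("\n");
        return 0;
    }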
32. Think at home

Problem: Adapt the above algorithm to work on an EREW PRAM. What is the time complexity?
33. Physical Organization of Parallel Platforms

- Ideal Parallel Computer: PRAM
- Interconnection Networks for Parallel Computers
- dynamic
- static
- graph embeddings
- cost of communication
34. Interconnection Networks for Parallel Computers

- static networks
- point-to-point communication links among processing nodes
- also called direct networks
- dynamic networks
- communication links are connected dynamically by switches to create paths between processing nodes and memory banks/other processing nodes
- also called indirect networks
35. Interconnection Networks

[Figure: a static/direct network - processing nodes (each with its network interface/switch) connected directly by point-to-point links; a dynamic/indirect network - processing nodes connected through a cloud of switching elements.]
36. Dynamic Interconnection Networks
37. Bus-Based Interconnection Networks

- processors and memory modules are connected to a shared bus
- Advantages
- simple, low cost
- Disadvantages
- only one processor can access memory at a given time
- bandwidth does not scale with the number of processors/memory modules
- Example
- quad Pentium Xeon
38. Crossbar

- Advantages
- non-blocking network
- Disadvantages
- cost O(pm)
- Example
- high-end UMA machines
39. Multistage networks (e.g. the Ω-network)

- intermediate case between bus and crossbar
- blocking network (but not every request pattern blocks)
- often used in NUMA computers
- Ω-network
- each switch is a 2x2 crossbar
- log(p) stages
- cost: p log(p)
- Simple routing algorithm (a sketch follows below):
- at each stage, look at the corresponding bit (starting with the msb) of the source and destination addresses
- if the bits are the same, the message passes through, otherwise it crosses over
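A small C sketch of this routing rule for an 8-node network (log2(8) = 3 stages); the function name and output format are mine:

    #include <stdio.h>

    /* Print the switch setting at each stage when routing a message
       from src to dst, comparing address bits msb-first. */
    void route(int src, int dst, int stages) {
        for (int s = 0; s < stages; s++) {
            int bit = stages - 1 - s;                  /* msb first */
            int a = (src >> bit) & 1, b = (dst >> bit) & 1;
            printf("stage %d: %s\n", s,
                   a == b ? "pass through" : "cross over");
        }
    }

    int main(void) {
        route(2, 5, 3);   /* 010 -> 101: all bits differ, cross at every stage */
        return 0;
    }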
40. Ω-network

41. Ω-network

[Figure: an 8-node Ω-network with inputs 0..7 on the left, outputs 0..7 on the right, and log2(8) = 3 stages of 2x2 switches in between.]
42. Dynamic network exercises

Question 1: Which of the following pairs of (processor, memory block) requests will collide/block? [List omitted.]

Question 2 (difficult): For a given processor/memory request (a,b), how many requests (x,y), with x ≠ a and y ≠ b, will block with (a,b) in an 8-node Ω-network? How does this number depend on the choice of (a,b)?
43. Physical Organization of Parallel Platforms

- Ideal Parallel Computer: PRAM
- Interconnection Networks for Parallel Computers
- dynamic
- static
- graph embeddings
- cost of communication
44. Static Interconnection Networks
- Complete network (clique)
- Star network
- Linear array
- Ring
- Tree
- 2D & 3D mesh/torus
- Hypercube
- Butterfly
- Fat tree
45. Evaluating Interconnection Networks

- diameter
- the longest distance (number of hops) between any two nodes
- gives a lower bound on time for algorithms communicating only with direct neighbours
- connectivity
- multiplicity of paths between any two nodes
- high connectivity lowers contention for communication resources
- bisection width (bisection bandwidth)
- the minimal number of links (resp. their aggregate bandwidth) that must be removed to partition the network into two equal halves
- provides a lower bound on time when the data must be shuffled from one half of the network to the other half
- VLSI area/volume
- grows with the square of the bisection width in 2D, and with its 3/2 power in 3D
46. For each of the following networks:
- determine the diameter and bisection width
- give a programming example which would use such a communication pattern
47. Clique, Star, Linear Array, Ring, Tree

- important logical topologies, as many common communication patterns correspond to these topologies
- clique: all-to-all broadcast
- star: master-slave, broadcast
- line, ring: pipelined execution
- tree: hierarchical decomposition
- none of them is very practical
- clique: cost
- star, line, ring, tree: low bisection width
- line, ring: high diameter
- actual execution is performed on the embedding into the physical network
48. 2D & 3D Array / Torus

- good match for discrete simulation and matrix operations
- easy to manufacture and extend
- Examples: Cray T3D (3D torus), Intel Paragon (2D mesh)
49. Hypercube

- good graph-theoretic properties (low diameter, high bisection width)
- nice recursive structure
- good for simulating other topologies (they can be efficiently embedded into the hypercube)
- degree log(n), diameter log(n), bisection width n/2 (see the neighbour sketch below)
- costly/difficult to manufacture for high n, not so popular nowadays
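A small C helper illustrating the hypercube's structure: in a d-dimensional hypercube, node i is connected exactly to the d nodes whose labels differ from i in one bit.

    #include <stdio.h>

    int main(void) {
        int d = 4;                 /* dimension, p = 2^d = 16 nodes */
        int node = 5;              /* binary 0101 */
        printf("neighbours of node %d:", node);
        for (int bit = 0; bit < d; bit++)
            printf(" %d", node ^ (1 << bit));   /* flip one bit per dimension */
        printf("\n");              /* prints: 4 7 1 13 */
        return 0;
    }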
50. Butterfly

- a hypercube-derived network of log(n) diameter and constant degree
- perfect match for the Fast Fourier Transform
- there are other hypercube-related networks (Cube-Connected Cycles, Shuffle-Exchange, De Bruijn and Beneš networks), see Leighton's book for details
51. Fat Tree

- Observation: trees are nice - low diameter, simple structure
- Problem: low bandwidth
- Solution: exponentially increase the multiplicity of links as the distance from the bottom increases
- keeps the nice properties of the binary tree (low diameter)
- solves the low bisection and the bottleneck at the top levels
- Example: CM-5
52. Evaluating Interconnection Networks

Network               | Diameter       | Bisection Width | Arc Connectivity | Cost (# of links)
clique                | 1              | p²/4            | p-1              | p(p-1)/2
star                  | 2              | 1               | 1                | p-1
complete binary tree  | 2 log((p+1)/2) | 1               | 1                | p-1
linear array          | p-1            | 1               | 1                | p-1
2D mesh               | 2(√p - 1)      | √p              | 2                | 2(p - √p)
2D torus              | 2⌊√p/2⌋        | 2√p             | 4                | 2p
hypercube             | log p          | p/2             | log p            | (p log p)/2
53. Possible questions

- Assume point-to-point communication with cost 1.
- Is it possible to sort on a 2D mesh in time O(log n)? What about ... ?
- Is it possible to sort the leaves of a complete binary tree in time O(log n)? What about ... ?
- Can you find the maximum on a 2D mesh in time O(log n)? What about a complete binary tree?
54. Last Lecture

- PRAM
- what is PRAM? Why? Strong/weak points.
- PRAM types
- binary tree algorithm
- O(1) sorting algorithm
- Dynamic Interconnection Networks
- bus, crossbar, Ω-network
- blocking/non-blocking access
- advantages/disadvantages
- Static Interconnection Networks
- tree, mesh, hypercube, ...
- diameter, bisection, connectivity, cost
55. Possible questions

- Assume point-to-point communication with cost 1.
- Is it possible to sort on a 2D mesh in time O(log n)? What about ... ?
- Is it possible to sort the leaves of a complete binary tree in time O(log n)? What about ... ?
- Can you find the maximum on a 2D mesh in time O(log n)? What about a complete binary tree?
56. Homework

Algorithm for processor p[i,j]:
    a = read(m[i]);
    b = read(m[j]);
    if ((a > b) || ((a == b) && (i > j)))
        write(1, m[i]);
    if (j == 0) {
        b = read(m[i]);
        write(a, m[b]);
    }
57. Homework

Algorithm for processor p[i,j]: (as on the previous slide)

Remember, n = 2^k. Replacing the concurrent read of m[i] by a doubling broadcast:
    if (j == 0) {
        a = read(m[i]);
        write(a, b[0]);
    }
    for (l = k-1; l >= 0; l--)
        if (j % 2^(l+1) == 0) {
            a = read(b[j]);
            write(a, b[j + 2^l]);
        }
    a = read(b[j]);    /* every processor now has m[i] */

Do you see any problems? All rows use the same array b!
58. Homework

Algorithm for processor p[i,j]: (as on the previous slide)

Remember, n = 2^k. The fix: give each row i its own broadcast array b[i,*]:
    if (j == 0) {
        a = read(m[i]);
        write(a, b[i,0]);
    }
    for (l = k-1; l >= 0; l--)
        if (j % 2^(l+1) == 0) {
            a = read(b[i,j]);
            write(a, b[i,j + 2^l]);
        }
    a = read(b[i,j]);    /* every processor now has m[i] */
61. Physical Organization of Parallel Platforms

- Ideal Parallel Computer: PRAM
- Interconnection Networks for Parallel Computers
- dynamic
- static
- graph embeddings
- cost of communication
62. Motivating Graph Embeddings

You want to use this algorithm [figure: a logical communication topology], but your computer is connected like this [figure: the physical interconnection network]. How do you map processes to processors? Why?
63. Process-Processor Mappings and Graph Embeddings

Problem 1: You have an algorithm which uses p logical processes. The communication pattern between these processes is captured by a communication graph G. How do you map these processes to your real machine, which has p' processors interconnected into a network G', so that the overall communication cost is minimized/kept reasonably small?

Problem 2: Assume you have an algorithm designed for a specific topology G. How do you get it to work on a network of a different topology G'?

Solution: graph embedding - simulate G on G'.
64. Example: two mappings

[Figure: 16 processes a..p with a 4x4 mesh of interactions, mapped onto a 4x4 processor mesh 1..16. The intuitive mapping places neighbouring processes on neighbouring processors, so every interaction uses a single link; the random mapping scatters neighbouring processes across the machine, so most interactions travel multi-link paths.]
65. Embedding

- Formally: Given networks G(V,E) and G'(V',E'), find a mapping f which maps each vertex of V to a vertex of V' and each edge of E to a path in G'. Several vertices of V may map to one vertex of V' (especially if G has more vertices than G'). Such a mapping is called an embedding of G in G'.
- Goals:
- balance the number of vertices mapped to each node of G'
- to balance the workload of the simulating processors
- each edge should map to a short path, optimally a single link
- so each communication step can be simulated efficiently (small dilation)
- there should be little overlap between the resulting simulating paths
- to prevent congestion on links
66. Embedding - examples

- Embedding a ring into a line (one concrete mapping is sketched below)
- dilation 2, congestion 2
- a similar idea can be used to embed a torus into a mesh
- Embedding a ring into a 2D torus
- dilation 1, congestion 1
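One way to realise the dilation-2, congestion-2 ring-into-line embedding (an illustrative formula, not necessarily the exact mapping the slide has in mind): interleave the two "directions" of the ring along the line.

    #include <stdio.h>

    /* Map ring node i (0..p-1, p even) to a line position such that
       every ring edge spans at most 2 line links (dilation 2). */
    int ring_to_line(int i, int p) {
        return (i < p / 2) ? 2 * i : 2 * (p - i) - 1;
    }

    int main(void) {
        int p = 8;
        for (int i = 0; i < p; i++)
            printf("ring %d -> line %d\n", i, ring_to_line(i, p));
        return 0;
    }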
67. Embedding a Ring into a Hypercube

- map processor i into node G(d, i) of the d-dimensional hypercube
- the function G() is called the binary reflected Gray code
- G() can easily be defined recursively: G(d+1) = 0·G(d), 1·G(d)^R (G(d) followed by its reverse; a closed-form version is sketched below)
- Example:
- 0,1 → 00,01,11,10 → 000,001,011,010,110,111,101,100
[Figure: a ring embedded into a 3-dimensional hypercube; the nodes are visited in Gray-code order 000, 001, 011, 010, 110, 111, 101, 100.]
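The binary reflected Gray code also has a well-known closed form, G(i) = i XOR (i >> 1), which matches the sequence above; a small C check (the function name is mine):

    #include <stdio.h>

    int gray(int i) { return i ^ (i >> 1); }   /* binary reflected Gray code */

    int main(void) {
        int d = 3;                              /* 3-dimensional hypercube */
        for (int i = 0; i < (1 << d); i++)      /* consecutive ring positions */
            printf("ring %d -> hypercube node %d (binary %d%d%d)\n",
                   i, gray(i),
                   (gray(i) >> 2) & 1, (gray(i) >> 1) & 1, gray(i) & 1);
        return 0;                               /* 000 001 011 010 110 111 101 100 */
    }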
68. Embedding Trees into a Hypercube

- arbitrary binary trees can be embedded (slightly less) efficiently as well
- the example assumes the processors are only at the leaves of the tree

[Figure omitted.]
69. Embedding a Mesh into a Linear Array

- congestion? dilation?
- is there an embedding of an n x n mesh into a linear array of n² nodes with congestion less than n? dilation less than n? why?
70. Possible questions

- Given an embedding of G into H, what is its dilation? Congestion?
- Show how to embed a 2D torus into a 2D mesh with constant dilation. What is the dilation of your embedding? What is its congestion?
- Show how to embed a ring into a 2D mesh. Is it always possible to do it with both dilation and congestion equal to 1? Constant?
- Given a number x, what is its predecessor/successor in the d-bit binary reflected Gray code?
- Consider a graph G with diameter d and bisection width w, and a graph G' with diameter d' and bisection width w'. What is the best congestion we can hope for in any embedding of G into G'? What is the best dilation we can hope for?
71. Physical Organization of Parallel Platforms

- Ideal Parallel Computer: PRAM
- Interconnection Networks for Parallel Computers
- dynamic
- static
- graph embeddings
- cost of communication
72. Communication Costs in Parallel Machines

- Message Passing Costs
- Startup time (t_s)
- paid once per message transfer
- covers message preparation, routing, and establishing the interface between the local node and the router
- Per-hop time (t_h)
- incurred at each internal node of the path
- the travel time of the header between two direct neighbours
- Per-word transfer time (t_w)
- determined by the bandwidth w of the channel: t_w = 1/w
- includes network and buffering overheads
73. Routing & Message Passing Costs

- Store-and-Forward Routing
- a message of size m traveling l links
- t_comm = t_s + (m·t_w + t_h)·l ≈ t_s + m·l·t_w
- Packet Routing
- split the message into fixed-size packets, route the packets independently
- used in highly dynamic settings with high error rates (e.g. the Internet)
- t_comm = t_s + l·t_h + m·t_w
- Cut-Through Routing
- the header establishes the path, the tail (data) of the message follows like a snake
- used in networks with a low failure rate; no buffers needed
- t_comm = t_s + l·t_h + m·t_w, with smaller constants (a numerical comparison follows below)
74. Message Passing Costs

- General formula: t_comm = t_s + l·t_h + m·t_w
- In order to communicate efficiently:
- communicate in bulk
- minimize the volume of the transferred data
- minimize the traveled distance
- We will talk about the first two objectives a lot; the third is often difficult to achieve in practice
- often little control over the mapping of processes onto processors
- randomized routing commonly used
- usually t_s dominates for small messages and m·t_w for large ones, while l and t_h tend to be quite small
- Simplified formula: t_comm = t_s + m·t_w
75. Message Passing Costs and Congestion

- the previous discussion applies only if the network is not congested
- effective bandwidth
- link bandwidth scaled down by the degree of congestion
- difficult to estimate, as it depends on the process-to-processor mapping, the routing algorithm, and the communication schedule
- lower-bound estimate: scale the link bandwidth by a factor of p/b (with b the bisection width)
76. Communication Costs in Shared Address Space Machines

- Difficult to estimate properly because:
- the programmer has minimal control over memory layout
- possible cache thrashing
- cache maintenance overhead (invalidate, update) is difficult to quantify
- spatial locality is difficult to model
- prefetching uncertainties
- false sharing
- contention in shared accesses
- The simplified formula t_comm = t_s + m·t_w can still be used.
77. New Concepts and Terms - Summary

- SIMD, MIMD, SPMD
- shared memory, distributed (shared) memory, cache coherence
- (cc)UMA, SMP, (cc)NUMA, MPP
- interconnection networks: static, dynamic
- bus, crossbar, Ω-network
- blocking, non-blocking network
- torus, hypercube, butterfly, fat tree
- bisection width
- graph embedding, dilation, congestion
- Gray codes