Title: Homework from last lecture:
1. Homework from last lecture
- Given an array A of n elements, evenly distributed among p processors (you may assume p divides n), and an integer k > 0
- an index i is a k-maximum iff A[i] > A[j] for all j ≠ i with i-k ≤ j ≤ i+k
- Goal: report all k-maxima in the array A
[Figure: array A split into p contiguous blocks; processor p0 holds elements 0 .. n/p-1, ..., processor p(p-1) holds elements n(p-1)/p .. n-1.]
2. Homework from last lecture: solution
- Case 1: k ≤ n/p (a sketch of the local computation follows below)
- which processors does processor i have to communicate with?
- what data has to be communicated?
- how many communication rounds?
- Answers:
- processors i-1 and i+1 (unless i = 0 or i = p-1)
- k data items on each side, or just the k-maximum candidate for each side
- 1 or 2 rounds
- Question: Which solution is better?
- depends on the platform and k
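A minimal C sketch of Case 1 under the stated assumptions: each processor has already received k halo elements from each neighbour (e.g. via message passing; boundary processors would fill the halo with a value smaller than any element). The array names and halo layout are illustrative, not from the lecture.

    #include <stdio.h>

    /* Report which local indices are k-maxima. buf holds k halo elements
     * from the left neighbour, then the m local elements, then k halo
     * elements from the right neighbour (2k + m entries in total). */
    void report_k_maxima(const int *buf, int m, int k) {
        for (int i = 0; i < m; i++) {
            int c = k + i;                       /* position of local element i */
            int is_max = 1;
            for (int j = c - k; j <= c + k; j++)
                if (j != c && buf[j] >= buf[c]) { is_max = 0; break; }
            if (is_max) printf("local index %d is a k-maximum\n", i);
        }
    }

    int main(void) {
        /* k = 2, m = 4: halo (from left) | local block | halo (from right) */
        int buf[] = {3, 1, /* | */ 4, 7, 2, 5, /* | */ 6, 0};
        report_k_maxima(buf, 4, 2);   /* prints: local index 1 (value 7) */
        return 0;
    }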
3. Homework from last lecture: solution
- Case 2: k > n/p
- processor i has to communicate with ⌈k/(n/p)⌉ processors on each side
- k is rather high, so you want to send just candidates
- several techniques possible, will be covered later in the course:
- pipelining
- broadcasting/multicasting trees
4. Lecture Outline (2 lectures)
- A bit of Historical Perspective
- Parallelisation Approaches
- Parallel Programming Platforms
- logical organization - the programmer's view of the platform
- physical organization - the hardware view
- communication costs
- process-to-processor mappings
5. A bit of historical perspective
Parallel computing has been here since the early days of computing. Traditionally: custom HW, custom SW, high cost. The doom of Moore's law: custom HW has a hard time catching up with commodity processors. Current trend: use commodity HW components, standardize SW. The market size of High Performance Computing is comparable to the market size for disposable diapers. (Explicitly!) parallel computing has never been mainstream.
6. A bit of historical perspective (cont.)
- Parallelism sneaking into commodity computers
- instruction-level parallelism: wide issue, pipelining, OOO execution
- data-level parallelism: SSE, 3DNow!, AltiVec
- thread-level parallelism: Hyper-Threading in Pentium 4, multiple cores coming in the not-too-distant future
- Transistor budgets allow for multiple processor cores on a chip.
- Most applications would benefit from being parallelised and executed on a parallel computer.
- even PC applications, especially the most demanding ones - games, multimedia
7. A bit of historical perspective III
- Chicken & Egg Problem
- Why build parallel computers when the applications are sequential?
- Why parallelize applications when there are no parallel commodity computers?
- Answers:
- What else to do with all those transistors?
- They already are a bit parallel (wide issue, multimedia instructions, hyperthreading), and this bit is growing.
- Yet another reason to study parallel computing:
- Principles of parallel algorithm design (locality of data reference) lend themselves to cache-friendly sequential algorithms.
- The same applies to out-of-core computations (data servers).
8. Parallelisation Approaches
- Parallelizing compiler
- advantage: use your current code
- disadvantage: very limited abilities
- Parallel domain-specific libraries
- e.g. linear algebra, numerical libraries, quantum chemistry
- usually a good choice, use when possible
- Communication libraries
- message passing libraries: MPI, PVM
- shared memory libraries: declare and access shared memory variables (on MPP machines done by emulation)
- advantage: use a standard compiler
- disadvantage: low-level programming (a "parallel assembler")
9. Parallelisation Approaches (cont.)
- New parallel languages
- use a language with built-in explicit control of parallelism
- no language is the best in every domain
- needs a new compiler
- fights against inertia
- Parallel features in existing languages
- adding parallel features to an existing language
- e.g. for expressing loop parallelism (pardo) and data placement
- example: High Performance Fortran
- Additional possibilities in shared-memory systems
- use threads
- preprocessor compiler directives (OpenMP) - see the sketch below
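For illustration, a minimal OpenMP sketch of the directive style just mentioned. The pragma is standard OpenMP; the arrays and values are made up.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000

    int main(void) {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        /* the directive turns an ordinary loop into a parallel one,
           much like the pardo construct used elsewhere in these slides */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[42] = %g\n", c[42]);
        return 0;
    }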
10. Parallelisation Approaches: Our Focus
- Communication libraries: MPI, PVM
- industry standard, available for every platform
- very general, low-level approach
- perfect match for clusters
- most likely to be useful for you
- Shared memory programming
- also very important
- likely to be useful in next iterations of PCs
11. Parallel Programming Platforms
- Implicit Parallelism in Modern Microprocessors
- pipelining, superscalar execution, VLIW
- Limitations of Memory System Performance
- problem: high latency of memory vs. speed of computation
- solutions: caches, latency hiding using multithreading and prefetching
- Explicit Parallel Programming Platforms
- logical organization - the programmer's view of the platform
- physical organization - the hardware view
- communication costs
- process-to-processor mappings
12. Logical View of a PP Platform
- Control Structure - how to express parallel tasks
- Single Instruction stream, Multiple Data stream (SIMD)
- Multiple Instruction stream, Multiple Data stream (MIMD)
- Single Program, Multiple Data (SPMD)
- Communication Model - how to specify interactions between these tasks
- Shared Address Space Platforms (multiprocessors)
- Uniform Memory Access (UMA) multiprocessors
- Non-Uniform Memory Access (NUMA) multiprocessors
- cache coherence issues
- Message Passing Platforms
13. Control Structure: SIMD

Single Instruction stream, Multiple Data stream

Example:
    for (i=0; i<1000; i++) pardo
        c[i] = a[i] + b[i];

Processor k executes: c[k] = a[k] + b[k]
14. SIMD (cont.)
- early parallel machines
- Illiac IV, MPP, CM-2, MasPar MP-1
- modern settings
- multimedia extensions - MMX, SSE
- DSP chips
- positives
- less hardware needed
- easy to understand and reason about
- negatives
- proprietary hardware needed: fast obsolescence, high development costs/time
- rigid structure, suitable only for highly structured problems
- inherent inefficiency due to selective turn-off
15. SIMD inefficiency example

Example:
    for (i=0; i<10; i++)
        if (a[i] < b[i]) c[i] = a[i] + b[i];
        else c[i] = 0;

[Figure: a = (4, 1, 7, 2, 9, 3, 3, 0, 6, 7) and b = (5, 3, 4, 1, 4, 5, 3, 1, 4, 8), one element per processor p0..p9; c not yet computed.]
16. SIMD inefficiency example

Example:
    for (i=0; i<10; i++) pardo
        if (a[i] < b[i]) c[i] = a[i] + b[i];
        else c[i] = 0;

[Figure: same a and b on processors p0..p9; all processors evaluate the condition a[i] < b[i] in lock-step.]
17. SIMD inefficiency example

Example:
    for (i=0; i<10; i++) pardo
        if (a[i] < b[i]) c[i] = a[i] + b[i];
        else c[i] = 0;

[Figure: during the if branch only the processors with a[i] < b[i] (p0, p1, p5, p7, p9) are active, producing c = (9, 4, -, -, -, 8, -, 1, -, 15) so far; the other processors idle.]
18. SIMD inefficiency example

Example:
    for (i=0; i<10; i++) pardo
        if (a[i] < b[i]) c[i] = a[i] + b[i];
        else c[i] = 0;

[Figure: during the else branch the remaining processors (p2, p3, p4, p6, p8) write 0, giving c = (9, 4, 0, 0, 0, 8, 0, 1, 0, 15). In each branch half of the processors are switched off, so the machine runs at half efficiency.]
19. Control Structure: MIMD

Multiple Instruction stream, Multiple Data stream

- Single Program, Multiple Data (SPMD)
- popular way to program MIMD computers
- simplifies code maintenance/program distribution
- equivalent to MIMD (a big switch on the process ID at the beginning)
20. MIMD (cont.)

- positives
- can be built easily/quickly/cheaply from existing microprocessors
- very flexible (suitable for irregular problems)
- negatives
- requires more resources (duplicated program, OS, ...)
- more difficult to reason about / design correct programs
21. Logical View of a PP Platform

- Control Structure - how to express parallel tasks
- Single Instruction stream, Multiple Data stream (SIMD)
- Multiple Instruction stream, Multiple Data stream (MIMD)
- Single Program, Multiple Data (SPMD)
- Communication Model - how to specify interactions between these tasks
- Shared Address Space Platforms (multiprocessors)
- Uniform Memory Access (UMA) multiprocessors
- Non-Uniform Memory Access (NUMA) multiprocessors
- cache coherence issues
- Message Passing Platforms
22. Shared Address Space Platforms

[Figure: two typical organizations - shared memory (UMA) and distributed memory (NUMA).]
23. Shared Address Space Platforms

- all memory addressable by all processors
- needs an address translation mechanism
- may or may not provide cache coherence
- access time
- uniform: UMA (shared memory)
- non-uniform: NUMA (distributed memory)
- principal communication mechanisms
- put() and get()
24. Message Passing Platforms

- p processing nodes, each with its own exclusive address space
- each node can be a single processor or a shared address space multiprocessor
- inter-node communication possible only through message passing
- principal functions
- send(), receive()
- each processor has a unique ID
- mechanisms provided for learning your own ID, the number of nodes, etc.
- several standard APIs available: MPI, PVM (see the sketch below)
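A minimal MPI sketch of the send()/receive() style (illustrative; MPI_Comm_rank and MPI_Comm_size are the standard calls for learning your own ID and the number of nodes):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* my unique ID */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of nodes */

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("node %d of %d received %d\n", rank, size, value);
        }
        MPI_Finalize();
        return 0;
    }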
25. Physical Organization of Parallel Platforms

- Ideal Parallel Computer: PRAM
- Interconnection Networks for Parallel Computers
- static
- dynamic
- cost of communication
- evaluating interconnection networks
26. Parallel Random Access Machine

- p processors, each has a local memory and is connected to an unbounded shared memory
- processors work in lock-step; an access to shared memory costs 1 step
- 4 main classes, depending on how simultaneous accesses are handled:
- Exclusive read, exclusive write - EREW PRAM
- Concurrent read, exclusive write - CREW PRAM
- Exclusive read, concurrent write - ERCW PRAM (for completeness)
- Concurrent read, concurrent write - CRCW PRAM
- Resolving concurrent writes:
- Common: all writes must write the same value
- Arbitrary: an arbitrary write succeeds
- Priority: the write with the highest priority succeeds
- Sum: the sum of the written values is stored
27. Parallel Random Access Machine (cont.)

- abstracts away communication, allows one to focus on the parallel tasks
- an algorithm for PRAM might lead you to a good algorithm for a real machine
- if you prove that something cannot be efficiently solved on PRAM, it cannot be efficiently done on any practical machine (based on current technology)
- it is not feasible to manufacture PRAM
- the cost of connecting p processors to m memory cells such that their accesses do not interfere is Ω(pm), which is huge for any practical values of m
28. PRAM Algorithm Example

Problem: use an EREW PRAM to sum the numbers stored at m[0], m[1], ..., m[n-1], where n = 2^k for some k. The result should be stored at m[0].

Example for k = 3: [figure omitted]

Algorithm for processor p[i]:
    for (j=0; j<k; j++)
        if (i % 2^(j+1) == 0) {
            a = read(m[i]);
            b = read(m[i + 2^j]);
            write(a+b, m[i]);
        }
29. PRAM Example Notes

- the program is written in SIMD (and SPMD) style
- the inefficiency caused by idling processors is clearly visible
- can be easily extended to n not a power of 2
- takes log2(n) rounds to execute (a C simulation follows below)
- using a similar approach (+ some other ideas) it can be shown that:
- any CRCW PRAM can be simulated by an EREW PRAM with a slowdown factor of O(log n)
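A sequential C simulation of the summation above (a sketch; the iterations of the outer loop correspond to the PRAM rounds, so for n = 8 there are log2(8) = 3 rounds):

    #include <stdio.h>

    int main(void) {
        int m[8] = {3, 1, 4, 1, 5, 9, 2, 6};   /* n = 2^k with k = 3 */
        int n = 8;

        for (int twoj = 1; twoj < n; twoj *= 2)    /* twoj = 2^j, j = 0..k-1 */
            for (int i = 0; i < n; i += 2 * twoj)  /* processors with i % 2^(j+1) == 0 */
                m[i] = m[i] + m[i + twoj];         /* one parallel round */

        printf("sum = %d\n", m[0]);                /* prints 31 */
        return 0;
    }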
30. PRAM Example 2

- Problem: use a Sum-CRCW PRAM with n² processors to sort n numbers stored at m[0], m[1], ..., m[n-1].
- Question: How many steps would it take?
- O(n log n)
- O(n)
- O(log n)
- O(1)
- less than (n log n)/n²
31. PRAM Example 2

Note: we will mark the processors p[i,j], for 0 ≤ i,j < n.

Algorithm for processor p[i,j]:
    a = read(m[i]);
    b = read(m[j]);
    if ((a > b) || ((a == b) && (i > j)))
        write(1, m[i]);    /* Sum-CRCW: the written 1s add up to the rank of m[i] */
    if (j == 0) {
        b = read(m[i]);    /* b now holds the rank */
        write(a, m[b]);
    }
[Figure: m = (1, 7, 3, 9, 3, 0) at m[0]..m[5]; the n x n 0/1 matrix of comparison outcomes; after the concurrent Sum-writes m holds the ranks (1, 4, 2, 5, 3, 0); after the final writes m holds the sorted sequence (0, 1, 3, 3, 7, 9).]
An O(1) sorting algorithm! (A simulation sketch follows below.)
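As a sanity check, a sequential C simulation of the Sum-CRCW sort above (illustrative; the concurrent Sum-writes into m are emulated here with a separate rank array and ordinary additions):

    #include <stdio.h>

    int main(void) {
        int m[6] = {1, 7, 3, 9, 3, 0};
        int rank[6] = {0}, out[6];
        int n = 6;

        /* every processor pair (i, j) simulated; the Sum rule adds up the 1s */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (m[i] > m[j] || (m[i] == m[j] && i > j))
                    rank[i] += 1;

        for (int i = 0; i < n; i++)    /* processors p[i,0] place the elements */
            out[rank[i]] = m[i];

        for (int i = 0; i < n; i++) printf("%d ", out[i]);  /* 0 1 3 3 7 9 */
        printf("\n");
        return 0;
    }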
32. Think at home

Problem: Adapt the above algorithm to work on an EREW PRAM. What is the time complexity?
33. Physical Organization of Parallel Platforms

- Ideal Parallel Computer: PRAM
- Interconnection Networks for Parallel Computers
- dynamic
- static
- graph embeddings
- cost of communication
34. Interconnection Networks for Parallel Computers

- static networks
- point-to-point communication links among processing nodes
- also called direct networks
- dynamic networks
- communication links are connected dynamically by switches to create paths between processing nodes and memory banks/other processing nodes
- also called indirect networks
35. Interconnection Networks

[Figure: a static/direct network - processing nodes (each with its network interface/switch) connected directly by point-to-point links; a dynamic/indirect network - processing nodes connected through a cloud of switching elements.]
36. Dynamic Interconnection Networks
37. Bus-Based Interconnection Networks

- processors and memory modules are connected to a shared bus
- Advantages
- simple, low cost
- Disadvantages
- only one processor can access memory at a given time
- bandwidth does not scale with the number of processors/memory modules
- Example
- quad Pentium Xeon
38. Crossbar

- Advantages
- non-blocking network
- Disadvantages
- cost O(pm)
- Example
- high-end UMA machines
39. Multistage networks (e.g. the Ω-network)

- intermediate case between bus and crossbar
- blocking network (but not every request pattern blocks)
- often used in NUMA computers
- Ω-network
- each switch is a 2x2 crossbar
- log(p) stages
- cost: p log(p)
- Simple routing algorithm (a sketch follows below):
- at each stage, look at the corresponding bit (starting with the msb) of the source and destination addresses
- if the bits are the same, the message passes through, otherwise it crosses over
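A small C sketch of this routing rule for an 8-node network (log2(8) = 3 stages); the function name and output format are mine:

    #include <stdio.h>

    /* Print the switch setting at each stage when routing a message
       from src to dst, comparing address bits msb-first. */
    void route(int src, int dst, int stages) {
        for (int s = 0; s < stages; s++) {
            int bit = stages - 1 - s;                  /* msb first */
            int a = (src >> bit) & 1, b = (dst >> bit) & 1;
            printf("stage %d: %s\n", s,
                   a == b ? "pass through" : "cross over");
        }
    }

    int main(void) {
        route(2, 5, 3);   /* 010 -> 101: all bits differ, cross at every stage */
        return 0;
    }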
40. Ω-network

41. Ω-network

[Figure: an 8-node Ω-network with inputs 0..7 on the left, outputs 0..7 on the right, and log2(8) = 3 stages of 2x2 switches in between.]
42. Dynamic network exercises

Question 1: Which of the following pairs of (processor, memory block) requests will collide/block? [List omitted.]

Question 2 (difficult): For a given processor/memory request (a,b), how many requests (x,y), with x ≠ a and y ≠ b, will block with (a,b) in an 8-node Ω-network? How does this number depend on the choice of (a,b)?
43. Physical Organization of Parallel Platforms

- Ideal Parallel Computer: PRAM
- Interconnection Networks for Parallel Computers
- dynamic
- static
- graph embeddings
- cost of communication
44. Static Interconnection Networks
- Complete network (clique)
- Star network
- Linear array
- Ring
- Tree
- 2D & 3D mesh/torus
- Hypercube
- Butterfly
- Fat tree
45. Evaluating Interconnection Networks

- diameter
- the longest distance (number of hops) between any two nodes
- gives a lower bound on time for algorithms communicating only with direct neighbours
- connectivity
- multiplicity of paths between any two nodes
- high connectivity lowers contention for communication resources
- bisection width (bisection bandwidth)
- the minimal number of links (resp. their aggregate bandwidth) that must be removed to partition the network into two equal halves
- provides a lower bound on time when the data must be shuffled from one half of the network to the other half
- VLSI area/volume
- grows with the square of the bisection width in 2D, and with its 3/2 power in 3D
46. For each of the following networks:
- determine the diameter and bisection width
- give a programming example which would use such a communication pattern
47. Clique, Star, Linear Array, Ring, Tree

- important logical topologies, as many common communication patterns correspond to these topologies
- clique: all-to-all broadcast
- star: master-slave, broadcast
- line, ring: pipelined execution
- tree: hierarchical decomposition
- none of them is very practical
- clique: cost
- star, line, ring, tree: low bisection width
- line, ring: high diameter
- actual execution is performed on the embedding into the physical network
48. 2D & 3D Array / Torus

- good match for discrete simulation and matrix operations
- easy to manufacture and extend
- Examples: Cray T3D (3D torus), Intel Paragon (2D mesh)
49. Hypercube

- good graph-theoretic properties (low diameter, high bisection width)
- nice recursive structure
- good for simulating other topologies (they can be efficiently embedded into the hypercube)
- degree log(n), diameter log(n), bisection width n/2 (see the neighbour sketch below)
- costly/difficult to manufacture for high n, not so popular nowadays
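A small C helper illustrating the hypercube's structure: in a d-dimensional hypercube, node i is connected exactly to the d nodes whose labels differ from i in one bit.

    #include <stdio.h>

    int main(void) {
        int d = 4;                 /* dimension, p = 2^d = 16 nodes */
        int node = 5;              /* binary 0101 */
        printf("neighbours of node %d:", node);
        for (int bit = 0; bit < d; bit++)
            printf(" %d", node ^ (1 << bit));   /* flip one bit per dimension */
        printf("\n");              /* prints: 4 7 1 13 */
        return 0;
    }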
50. Butterfly

- a hypercube-derived network of log(n) diameter and constant degree
- perfect match for the Fast Fourier Transform
- there are other hypercube-related networks (Cube-Connected Cycles, Shuffle-Exchange, De Bruijn and Beneš networks), see Leighton's book for details
51. Fat Tree

- Observation: trees are nice - low diameter, simple structure
- Problem: low bandwidth
- Solution: exponentially increase the multiplicity of links as the distance from the bottom increases
- keeps the nice properties of the binary tree (low diameter)
- solves the low bisection and the bottleneck at the top levels
- Example: CM-5
52. Evaluating Interconnection Networks

Network               | Diameter       | Bisection Width | Arc Connectivity | Cost (# of links)
clique                | 1              | p²/4            | p-1              | p(p-1)/2
star                  | 2              | 1               | 1                | p-1
complete binary tree  | 2 log((p+1)/2) | 1               | 1                | p-1
linear array          | p-1            | 1               | 1                | p-1
2D mesh               | 2(√p - 1)      | √p              | 2                | 2(p - √p)
2D torus              | 2⌊√p/2⌋        | 2√p             | 4                | 2p
hypercube             | log p          | p/2             | log p            | (p log p)/2
53. Possible questions

- Assume point-to-point communication with cost 1.
- Is it possible to sort on a 2D mesh in time O(log n)? What about ... ?
- Is it possible to sort the leaves of a complete binary tree in time O(log n)? What about ... ?
- Can you find the maximum on a 2D mesh in time O(log n)? What about a complete binary tree?
54. Last Lecture

- PRAM
- what is PRAM? Why? Strong/weak points.
- PRAM types
- binary tree algorithm
- O(1) sorting algorithm
- Dynamic Interconnection Networks
- bus, crossbar, Ω-network
- blocking/non-blocking access
- advantages/disadvantages
- Static Interconnection Networks
- tree, mesh, hypercube, ...
- diameter, bisection, connectivity, cost
55. Possible questions

- Assume point-to-point communication with cost 1.
- Is it possible to sort on a 2D mesh in time O(log n)? What about ... ?
- Is it possible to sort the leaves of a complete binary tree in time O(log n)? What about ... ?
- Can you find the maximum on a 2D mesh in time O(log n)? What about a complete binary tree?
56. Homework

Algorithm for processor p[i,j]:
    a = read(m[i]);
    b = read(m[j]);
    if ((a > b) || ((a == b) && (i > j)))
        write(1, m[i]);
    if (j == 0) {
        b = read(m[i]);
        write(a, m[b]);
    }
57. Homework

Algorithm for processor p[i,j]: (as on the previous slide)

Remember, n = 2^k. Replacing the concurrent read of m[i] by a doubling broadcast:
    if (j == 0) {
        a = read(m[i]);
        write(a, b[0]);
    }
    for (l = k-1; l >= 0; l--)
        if (j % 2^(l+1) == 0) {
            a = read(b[j]);
            write(a, b[j + 2^l]);
        }
    a = read(b[j]);    /* every processor now has m[i] */

Do you see any problems? All rows use the same array b!
58. Homework

Algorithm for processor p[i,j]: (as on the previous slide)

Remember, n = 2^k. The fix: give each row i its own broadcast array b[i,*]:
    if (j == 0) {
        a = read(m[i]);
        write(a, b[i,0]);
    }
    for (l = k-1; l >= 0; l--)
        if (j % 2^(l+1) == 0) {
            a = read(b[i,j]);
            write(a, b[i,j + 2^l]);
        }
    a = read(b[i,j]);    /* every processor now has m[i] */
61. Physical Organization of Parallel Platforms

- Ideal Parallel Computer: PRAM
- Interconnection Networks for Parallel Computers
- dynamic
- static
- graph embeddings
- cost of communication
62. Motivating Graph Embeddings

You want to use this algorithm [figure: a logical communication topology], but your computer is connected like this [figure: the physical interconnection network]. How do you map processes to processors? Why?
63. Process-Processor Mappings and Graph Embeddings

Problem 1: You have an algorithm which uses p logical processes. The communication pattern between these processes is captured by a communication graph G. How do you map these processes to your real machine, which has p' processors interconnected into a network G', so that the overall communication cost is minimized/kept reasonably small?

Problem 2: Assume you have an algorithm designed for a specific topology G. How do you get it to work on a network of a different topology G'?

Solution: graph embedding - simulate G on G'.
64. Example: two mappings

[Figure: 16 processes a..p with a 4x4 mesh of interactions, mapped onto a 4x4 processor mesh 1..16. The intuitive mapping places neighbouring processes on neighbouring processors, so every interaction uses a single link; the random mapping scatters neighbouring processes across the machine, so most interactions travel multi-link paths.]
65. Embedding

- Formally: Given networks G(V,E) and G'(V',E'), find a mapping f which maps each vertex of V to a vertex of V' and each edge of E to a path in G'. Several vertices of V may map to one vertex of V' (especially if G has more vertices than G'). Such a mapping is called an embedding of G in G'.
- Goals:
- balance the number of vertices mapped to each node of G'
- to balance the workload of the simulating processors
- each edge should map to a short path, optimally a single link
- so each communication step can be simulated efficiently (small dilation)
- there should be little overlap between the resulting simulating paths
- to prevent congestion on links
66. Embedding - examples

- Embedding a ring into a line (one concrete mapping is sketched below)
- dilation 2, congestion 2
- a similar idea can be used to embed a torus into a mesh
- Embedding a ring into a 2D torus
- dilation 1, congestion 1
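One way to realise the dilation-2, congestion-2 ring-into-line embedding (an illustrative formula, not necessarily the exact mapping the slide has in mind): interleave the two "directions" of the ring along the line.

    #include <stdio.h>

    /* Map ring node i (0..p-1, p even) to a line position such that
       every ring edge spans at most 2 line links (dilation 2). */
    int ring_to_line(int i, int p) {
        return (i < p / 2) ? 2 * i : 2 * (p - i) - 1;
    }

    int main(void) {
        int p = 8;
        for (int i = 0; i < p; i++)
            printf("ring %d -> line %d\n", i, ring_to_line(i, p));
        return 0;
    }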
67. Embedding a Ring into a Hypercube

- map processor i into node G(d, i) of the d-dimensional hypercube
- the function G() is called the binary reflected Gray code
- G() can easily be defined recursively: G(d+1) = 0·G(d), 1·G(d)^R (G(d) followed by its reverse; a closed-form version is sketched below)
- Example:
- 0,1 → 00,01,11,10 → 000,001,011,010,110,111,101,100
[Figure: a ring embedded into a 3-dimensional hypercube; the nodes are visited in Gray-code order 000, 001, 011, 010, 110, 111, 101, 100.]
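The binary reflected Gray code also has a well-known closed form, G(i) = i XOR (i >> 1), which matches the sequence above; a small C check (the function name is mine):

    #include <stdio.h>

    int gray(int i) { return i ^ (i >> 1); }   /* binary reflected Gray code */

    int main(void) {
        int d = 3;                              /* 3-dimensional hypercube */
        for (int i = 0; i < (1 << d); i++)      /* consecutive ring positions */
            printf("ring %d -> hypercube node %d (binary %d%d%d)\n",
                   i, gray(i),
                   (gray(i) >> 2) & 1, (gray(i) >> 1) & 1, gray(i) & 1);
        return 0;                               /* 000 001 011 010 110 111 101 100 */
    }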
68. Embedding Trees into a Hypercube

- arbitrary binary trees can be embedded (slightly less) efficiently as well
- the example assumes the processors are only at the leaves of the tree

[Figure omitted.]
69. Embedding a Mesh into a Linear Array

- congestion? dilation?
- is there an embedding of an n x n mesh into a linear array of n² nodes with congestion less than n? dilation less than n? why?
70. Possible questions

- Given an embedding of G into H, what is its dilation? Congestion?
- Show how to embed a 2D torus into a 2D mesh with constant dilation. What is the dilation of your embedding? What is its congestion?
- Show how to embed a ring into a 2D mesh. Is it always possible to do it with both dilation and congestion equal to 1? Constant?
- Given a number x, what is its predecessor/successor in the d-bit binary reflected Gray code?
- Consider a graph G with diameter d and bisection width w, and a graph G' with diameter d' and bisection width w'. What is the best congestion we can hope for in any embedding of G into G'? What is the best dilation we can hope for?
71. Physical Organization of Parallel Platforms

- Ideal Parallel Computer: PRAM
- Interconnection Networks for Parallel Computers
- dynamic
- static
- graph embeddings
- cost of communication
72. Communication Costs in Parallel Machines

- Message Passing Costs
- Startup time (t_s)
- paid once per message transfer
- covers message preparation, routing, and establishing the interface between the local node and the router
- Per-hop time (t_h)
- incurred at each internal node of the path
- the travel time of the header between two direct neighbours
- Per-word transfer time (t_w)
- determined by the bandwidth w of the channel: t_w = 1/w
- includes network and buffering overheads
73. Routing & Message Passing Costs

- Store-and-Forward Routing
- a message of size m traveling l links
- t_comm = t_s + (m·t_w + t_h)·l ≈ t_s + m·l·t_w
- Packet Routing
- split the message into fixed-size packets, route the packets independently
- used in highly dynamic settings with high error rates (e.g. the Internet)
- t_comm = t_s + l·t_h + m·t_w
- Cut-Through Routing
- the header establishes the path, the tail (data) of the message follows like a snake
- used in networks with a low failure rate; no buffers needed
- t_comm = t_s + l·t_h + m·t_w, with smaller constants (a numerical comparison follows below)
74. Message Passing Costs

- General formula: t_comm = t_s + l·t_h + m·t_w
- In order to communicate efficiently:
- communicate in bulk
- minimize the volume of the transferred data
- minimize the traveled distance
- We will talk about the first two objectives a lot; the third is often difficult to achieve in practice
- often little control over the mapping of processes onto processors
- randomized routing commonly used
- usually t_s dominates for small messages and m·t_w for large ones, while l and t_h tend to be quite small
- Simplified formula: t_comm = t_s + m·t_w
75. Message Passing Costs and Congestion

- the previous discussion applies only if the network is not congested
- effective bandwidth
- link bandwidth scaled down by the degree of congestion
- difficult to estimate, as it depends on the process-to-processor mapping, the routing algorithm, and the communication schedule
- lower-bound estimate: scale the link bandwidth by a factor of p/b (with b the bisection width)
76. Communication Costs in Shared Address Space Machines

- Difficult to estimate properly because:
- the programmer has minimal control over memory layout
- possible cache thrashing
- cache maintenance overhead (invalidate, update) is difficult to quantify
- spatial locality is difficult to model
- prefetching uncertainties
- false sharing
- contention in shared accesses
- The simplified formula t_comm = t_s + m·t_w can still be used.
77. New Concepts and Terms - Summary

- SIMD, MIMD, SPMD
- shared memory, distributed (shared) memory, cache coherence
- (cc)UMA, SMP, (cc)NUMA, MPP
- interconnection networks: static, dynamic
- bus, crossbar, Ω-network
- blocking, non-blocking network
- torus, hypercube, butterfly, fat tree
- bisection width
- graph embedding, dilation, congestion
- Gray codes