Title: Introduction to Parallel Programming (Message Passing)
1 Introduction to Parallel Programming (Message Passing)
Francisco Almeida falmeida@ull.es
Parallel Computing Group
2 Beowulf Computers
- COTS: Commercial Off-The-Shelf computers
4 The Parallel Model
- Computational Models: PRAM, BSP, LogP
- Programming Models: PVM, MPI, HPF, Threads, OpenMP
- Architectural Models: Parallel Architectures
5 The Message Passing Model
Send(parameters), Recv(parameters)
6 Network of Workstations
Hardware
- Distributed Memory
- Non-Shared Memory Space
- Star Topology
- Sun SPARC Ultra 1 (143 MHz)
- Etherswitch
7 SGI Origin 2000
Hardware
- Distributed Shared Memory
- Hypercubic Topology
- C4-CEPBA
- 64 R10000 processors
- 8 GB memory
- 32 Gflop/s
8 Digital AlphaServer 8400
Hardware
- Shared Memory
- Bus Topology
- C4-CEPBA
- 10 Alpha 21164 processors
- 2 GB Memory
- 8.8 Gflop/s
9 Drawbacks that Arise when Solving Problems Using Parallelism
- Parallel programming is more complex than sequential programming.
- Results may vary as a consequence of intrinsic non-determinism.
- New problems appear: deadlocks, starvation...
- Parallel programs are more difficult to debug.
- Parallel programs are less portable.
10 MPI
[Figure: MPI builds on earlier message-passing systems (CMMD, PVM, Express, Zipcode, p4, PARMACS, EUI) and serves as the basis for parallel libraries, parallel applications and parallel languages.]
11 MPI
- What is MPI?
- Message Passing Interface standard
- The first standard and portable message-passing library with good performance
- "Standard" by consensus of MPI Forum participants from over 40 organizations
- Finished and published in May 1994, updated in June 1995
- What does MPI offer?
- Standardization - on many levels
- Portability - to existing and new systems
- Performance - comparable to vendors' proprietary libraries
- Richness - extensive functionality, many quality implementations
12 A Simple MPI Program

/* hello.c */
#include <stdio.h>
#include <string.h>
#include "mpi.h"

main(int argc, char *argv[])
{
  int name, p, source, dest, tag = 0;
  char message[100];
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &name);   /* name = rank of this process */
  MPI_Comm_size(MPI_COMM_WORLD, &p);      /* p = number of processes */
  if (name != 0) {
    printf("Processor %d of %d\n", name, p);
    sprintf(message, "greetings from process %d!", name);
    dest = 0;
    MPI_Send(message, strlen(message) + 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
  } else {
    printf("processor 0, p = %d\n", p);
    for (source = 1; source < p; source++) {
      MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
      printf("%s\n", message);
    }
  }
  MPI_Finalize();
}

Sample output with 4 processes:
Processor 2 of 4
Processor 3 of 4
Processor 1 of 4
processor 0, p = 4
greetings from process 1!
greetings from process 2!
greetings from process 3!

mpicc -o hello hello.c
mpirun -np 4 hello
13 Basic Communication Operations
14 One-to-all Broadcast / Single-node Accumulation
[Figure: one-to-all broadcast sends a message M from node 0 to nodes 1..p; the dual operation, single-node accumulation, combines the messages back onto node 0 in steps 1, 2, ..., p.]
15 Broadcast on Hypercubes
16 Broadcast on Hypercubes
17 MPI Broadcast
- int MPI_Bcast(
    void *buffer,
    int count,
    MPI_Datatype datatype,
    int root,
    MPI_Comm comm)
- Broadcasts a message from the process with rank "root" to all other processes of the group.
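As a usage illustration, a minimal sketch (not part of the original slides) in which the root broadcasts a problem size n to every rank; the variable name n and the value 1000 are assumptions made only for the example:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, n = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0)
    n = 1000;                    /* only the root knows the value initially */
  /* every rank calls MPI_Bcast; after the call all ranks hold n == 1000 */
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
  printf("rank %d received n = %d\n", rank, n);
  MPI_Finalize();
  return 0;
}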
18 Reduction on Hypercubes
- @ is a commutative and associative operator
- Ai resides in processor i
- Every processor has to obtain A0 @ A1 @ ... @ A(P-1)
[Figure: 3-dimensional hypercube with the values A0..A7 placed at the nodes labelled by their binary addresses 000..111.]
19 Reductions with MPI
- int MPI_Reduce(
    void *sendbuf, void *recvbuf, int count,
    MPI_Datatype datatype, MPI_Op op, int root,
    MPI_Comm comm)
- Reduces values on all processes to a single value on the root process.
- int MPI_Allreduce(
    void *sendbuf, void *recvbuf, int count,
    MPI_Datatype datatype, MPI_Op op,
    MPI_Comm comm)
- Combines values from all processes and distributes the result back to all processes.
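A minimal sketch (not from the original slides) of a global sum using both calls; each process contributes its own rank, purely for illustration:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, local, sum = 0, total = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  local = rank;                           /* local contribution of this process */
  /* the sum arrives only at root 0 */
  MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  /* the total arrives at every process */
  MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  if (rank == 0) printf("sum at root = %d\n", sum);
  printf("rank %d sees total = %d\n", rank, total);
  MPI_Finalize();
  return 0;
}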
20 All-to-All Broadcast / Multinode Accumulation
[Figure: in an all-to-all broadcast each node i starts with its own message Mi and ends with all messages M0..Mp; the dual operation is multinode accumulation. Applications: reductions, prefix sums.]
21 MPI Collective Operations
- MPI Operator   Operation
- MPI_MAX        maximum
- MPI_MIN        minimum
- MPI_SUM        sum
- MPI_PROD       product
- MPI_LAND       logical and
- MPI_BAND       bitwise and
- MPI_LOR        logical or
- MPI_BOR        bitwise or
- MPI_LXOR       logical exclusive or
- MPI_BXOR       bitwise exclusive or
- MPI_MAXLOC     max value and location
- MPI_MINLOC     min value and location
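For the location operators the buffers hold value/index pairs; a minimal sketch (not from the original slides) using the predefined MPI_DOUBLE_INT type, with the local value rank*rank assumed purely for illustration:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank;
  struct { double val; int rank; } in, out;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  in.val = (double) rank * rank;   /* some local value */
  in.rank = rank;                  /* the "location" carried along with the value */
  /* out.val becomes the global maximum, out.rank tells which process owns it */
  MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("max = %f found on rank %d\n", out.val, out.rank);
  MPI_Finalize();
  return 0;
}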
22 The Master-Slave Paradigm
[Figure: a master process distributing work to a set of slave processes.]
23 Computing π

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
mypi = 0.0;
for (i = myid + 1; i <= n; i += numprocs) {
  x = h * ((double) i - 0.5);
  mypi += f(x);                 /* f(x) = 4 / (1 + x*x) */
}
mypi = h * mypi;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

[Figure: plot of f(x) = 4 / (1 + x^2) on the interval [0, 1]; the area under the curve is π, approximated here by the midpoint rule.]

mpirun -np 3 cpi
24 The Portability of the Efficiency
25 The Sequential Algorithm
- fk[c] = max { fk-1[c], fk-1[c - wk] + pk } for c >= wk

void mochila01_sec (void)
{
  unsigned v1;
  int c, k;
  for (c = 0; c <= C; c++)
    f[0][c] = 0;
  for (k = 1; k <= N; k++)
    for (c = 0; c <= C; c++) {
      f[k][c] = f[k-1][c];              /* do not take item k */
      if (c >= w[k]) {
        v1 = f[k-1][c - w[k]] + p[k];   /* take item k */
        if (f[k][c] < v1)
          f[k][c] = v1;
      }
    }
}

[Figure: the table of values f[k][c] with n stages and capacities 0..C; column fk is computed from the previous column fk-1.]
- Complexity: O(nC)
26 The Parallel Algorithm

void transition (int stage)
{
  unsigned x;
  int c, k;
  k = stage;
  for (c = 0; c <= C; c++)
    f[c] = 0;
  for (c = 0; c <= C; c++) {
    IN(&x);                               /* receive f[k-1][c] from the previous stage */
    f[c] = max(f[c], x);
    OUT(&f[c], 1, sizeof(unsigned));      /* forward f[k][c] to the next stage */
    if (C >= c + w[k])
      f[c + w[k]] = x + p[k];
  }
}

fk[c] = max { fk-1[c], fk-1[c - wk] + pk }
27 The Evolution of the Pipeline
[Figure: the pipeline of n stages sweeping a column of size C.]
28 The Running Time
[Figure: timing diagram; with one process per stage the pipeline needs about (n - 1) start-up steps plus C steps per stage.]
29 Processor Virtualization
[Figure: block mapping of the n stages onto processors 0, 1, 2; each processor holds a band of n/p consecutive stages over a column of size C.]
30 Processor Virtualization
[Figure: block mapping, continued.]
31 Processor Virtualization
[Figure: block mapping, continued.]
32 The Running Time
[Figure: running-time diagram for the block mapping, with the quantities (n/p - 1)C, C, nC/p and nC marked for processors 0, 1, 2.]
33 Processor Virtualization
[Figure: block mapping, continued.]
34 The Running Time
[Figure: timing diagram; each processor sweeps its band of n/p stages over the C column entries.]
35 Block Mapping

void transition (void)
{
  unsigned c, k, i, inData;
  for (c = 0; c <= C; c++) {
    IN(&inData);                           /* value coming from the previous band */
    k = calcInitStage();
    for (i = 0; i < width; k++, i++) {     /* sweep the stages of this band */
      f[i][c] = max(f[i][c], inData);
      if (c + w[k] <= C)
        f[i][c + w[k]] = inData + p[k];
      inData = f[i][c];
    }
    OUT(&f[i-1][c], 1, sizeof(unsigned));  /* forward the last stage of the band */
  }
}

width = N / num_proc;
if (f_name < N % num_proc)   /* Load Balancing */
  width++;

int calcInitStage (void)
{
  return (f_name < N % num_proc) ? f_name * width
                                 : f_name * width + (N % num_proc);
}
36 Cyclic Mapping
[Figure: cyclic mapping of the stages onto processors 0, 1, 2.]
37 The Running Time
[Figure: with the cyclic mapping the start-up cost is proportional to (p - 1) and each processor performs about (n/p) C work.]
38 Cyclic Mapping

int bands = num_bands(n);
for (i = 0; i < bands; i++) {
  stage = f_name + i * num_proc;   /* stages are dealt out cyclically */
  if (stage <= n - 1)
    transition(stage);
}

unsigned num_bands (unsigned n)
{
  float aux_f;
  unsigned aux;
  aux_f = (float) n / (float) num_proc;
  aux = (unsigned) aux_f;
  if (aux_f > aux)
    return (aux + 1);              /* ceiling of n / num_proc */
  return (aux);
}

void transition (int stage)
{
  unsigned x;
  int c, k;
  k = stage;
  for (c = 0; c <= C; c++)
    f[c] = 0;
  for (c = 0; c <= C; c++) {
    IN(&x);
    f[c] = max(f[c], x);
    OUT(&f[c], 1, sizeof(unsigned));
    if (C >= c + w[k])
      f[c + w[k]] = x + p[k];
  }
}
39 Advantages and Disadvantages
- Block Distribution
- Minimizes the Number of Communications
- Penalizes the Startup Time of the Pipeline
- Cyclic Distribution
- Minimizes the Startup Time of the Pipeline
- May Produce Communication Overhead
40 Transputer Network - Local Area Network
- Transputer Network
- Fine Grain
- Parallel Communications
- Local Area Network
- Coarse Grain
- Serial Communications
41 Computational Results
[Figure: running time versus number of processors on the Transputer network and on the Local Area Network.]
42 The Resource Allocation Problem
- M units of an indivisible resource and a set of N tasks.
- fj(x): benefit obtained when x units of resource are allocated to task j.

maximize    sum(j = 1..N) fj(xj)
subject to  sum(j = 1..N) xj = M
            0 <= xj <= Bj, xj integer, Bj <= M, j = 1, ..., N
43 RAP - The Sequential Algorithm
- Gk[m] = max { Gk-1[m - i] + fk(i) : 0 <= i <= m }

int rap_seq (void)
{
  int i, k, m;
  for (m = 0; m <= M; m++)
    G[0][m] = 0;
  for (k = 1; k <= N; k++)
    for (m = 0; m <= M; m++)
      for (i = 0; i <= m; i++)
        G[k][m] = max(G[k][m], G[k-1][i] + f(k, m - i));
  return G[N][M];
}

- Complexity: O(nM^2)
44 RAP - The Parallel Algorithm

void transition (int stage)
{
  int m, j, x, k;
  for (m = 0; m <= M; m++)
    G[m] = 0;
  k = stage;
  for (m = 0; m <= M; m++) {
    IN(&x);                               /* receive G[k-1][m] from the previous stage */
    G[m] = max(G[m], x + f(k - 1, 0));
    OUT(&G[m], 1, sizeof(int));           /* forward G[k][m] to the next stage */
    for (j = m + 1; j <= M; j++)
      G[j] = max(G[j], x + f(k - 1, j - m));
  } /* for m ... */
} /* transition */

Gk[m] = max { Gk-1[m - i] + fk(i) : 0 <= i <= m }
45 The Cray T3E
- CRAY T3E
- Shared Address Space
- Three-Dimensional Toroidal Network
46 Block-Cyclic Mapping
[Figure: block-cyclic mapping of the stages onto processors 0, 1, 2 with grain g; the quantities g(p-1), gM^2 and n/(gp) are marked.]
47 Computational Results
48 Linear Model to Predict Communication Performance
- Time to send n bytes ≈ t n + b: a per-byte cost t plus a constant start-up latency b (a measurement sketch follows).
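A minimal ping-pong sketch (not from the original slides) that measures one-way transfer times for increasing message sizes; fitting a straight line to the measurements gives b as the intercept and t as the slope. It assumes at least two MPI processes:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define REPS 100

int main(int argc, char *argv[])
{
  int rank, n, i;
  char *buf = malloc(1 << 20);
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  for (n = 1; n <= (1 << 20); n *= 2) {
    double start = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
      if (rank == 0) {              /* rank 0 sends, then waits for the echo */
        MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
      } else if (rank == 1) {       /* rank 1 echoes the message back */
        MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
      }
    }
    if (rank == 0)                  /* half the round-trip time approximates b + t*n */
      printf("%8d bytes: %g s one-way\n", n, (MPI_Wtime() - start) / (2.0 * REPS));
  }
  free(buf);
  MPI_Finalize();
  return 0;
}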
49 PAPI
- http://icl.cs.utk.edu/projects/papi/
- PAPI aims to provide the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors.
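A hedged sketch (not from the slides) of how counters might be read around a kernel, assuming PAPI's classic high-level interface (PAPI_start_counters / PAPI_stop_counters from papi.h); the chosen events and the loop being measured are assumptions, and event availability depends on the processor:

#include <stdio.h>
#include <papi.h>

int main(void)
{
  int events[2] = { PAPI_TOT_CYC, PAPI_FP_OPS };  /* total cycles, floating-point ops */
  long long counts[2];
  double a = 0.0;
  int i;

  if (PAPI_start_counters(events, 2) != PAPI_OK)
    return 1;

  for (i = 1; i < 1000000; i++)   /* the kernel being measured */
    a += 1.0 / i;

  if (PAPI_stop_counters(counts, 2) != PAPI_OK)
    return 1;

  printf("cycles = %lld, fp ops = %lld (a = %f)\n", counts[0], counts[1], a);
  return 0;
}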
50 Buffering Data
- Virtual process name runs on the real processor fname if (name / grain) mod p == fname.
- Example with p = 2 and grain = 3: virtual processes 0, 1, 2 and 6, 7, 8 run on processor 0; virtual processes 3, 4, 5 run on processor 1.
- SET_BUFIO(1, size) sets the communication buffer to size B.
[Figure: the virtual processes grouped in blocks of size grain and dealt out to the real processors, with communication buffered in packets of size B.]
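A small sketch of the virtual-to-real mapping above; the helper real_processor is hypothetical, introduced only to reproduce the p = 2, grain = 3 example:

#include <stdio.h>

/* returns the real processor that runs virtual process 'name' when
   virtual processes are assigned in blocks of 'grain' round-robin over p processors */
int real_processor(int name, int grain, int p)
{
  return (name / grain) % p;
}

int main(void)
{
  int name, p = 2, grain = 3;
  for (name = 0; name < 9; name++)
    printf("virtual process %d -> processor %d\n", name, real_processor(name, grain, p));
  return 0;
}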
51 The Knapsack Problem (N = 12800, M = 12800), Cray T3E
52 The Resource Allocation Problem, Cray T3E
53 Portability of the Efficiency
- One disappointing contrast in parallel systems is between the peak performance of the parallel systems and the actual performance of parallel applications.
- Metrics, techniques and tools have been developed to understand the sources of performance degradation.
- An effective parallel program development cycle may iterate many times before achieving the desired performance.
- Performance prediction is important in achieving efficient execution of parallel programs, since it allows avoiding the coding and debugging cost of inefficient strategies.
- Most of the approaches to performance analysis fall into two categories: Analytical Modeling and Performance Profiling.
54 Performance Analysis
- Profiling may be conducted on an existing parallel system to recognize current performance bottlenecks, correct them, and identify and prevent potential future performance problems.
- Architecture Dependent.
- The majority of performance metrics and tools devised reflect their orientation towards the measurement-modify paradigm.
- PICL, Dimemas, Kpi.
- ParaGraph, Vampir, Paraver.
55 Performance Analysis
- Analytical Modeling
- Provides a structured way for understanding performance problems
- Architecture Independent
- Has predictive ability
- Modeling is not a trivial task. The model must be simple enough to be tractable, and sufficiently detailed to be accurate.
- PRAM, LogP, BSP, BSPWB, etc.
[Figure: workflow from Analytical Modeling to Optimal Run-Time Parameter Prediction, Prediction Error and Computation Prediction.]
56 Standard Loop on a Pipeline Algorithm

void f() {
  Compute(body0);
  while (running) {
    Receive();
    Compute(body1);
    Send();
    Compute(body2);
  }
}

- body0 takes constant time; body1 and body2 depend on the iteration of the loop.
- Analytical Model: numerical solutions for every case.
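To make the skeleton concrete, a minimal MPI sketch (an assumption, not from the slides) of one pipeline stage that receives from its left neighbour, computes, and sends to its right neighbour; the ranks are assumed to be arranged linearly 0..p-1 and the computation is a placeholder:

#include "mpi.h"

void pipeline_stage(int rank, int p, int iterations)
{
  int i;
  double x = 0.0;
  MPI_Status status;
  for (i = 0; i < iterations; i++) {
    if (rank > 0)                    /* Receive(): get data from the previous stage */
      MPI_Recv(&x, 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &status);
    x = x + 1.0;                     /* Compute(body1): placeholder computation */
    if (rank < p - 1)                /* Send(): forward data to the next stage */
      MPI_Send(&x, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
    /* Compute(body2) would go here */
  }
}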
57 The Analytical Model
- Ts denotes the startup time between two processors:
  Ts = t0 (G - 1) + G * sum(i = 1..B-1) (t1i + t2i) + 2 bI (G - 1) B + bE B + (b + t B)
- Tc denotes the whole evaluation of G processes, including the time to send M/B packets of size B:
  Tc = t0 G + G * sum(i = 1..M) (t1i + t2i) + 2 bI (G - 1) M + bE M + (b + t B) M/B
[Figure: each processor holds a band of G virtual processes; the M results of a band are sent in M/B packets of size B.]
58 The Analytical Model
- T1(G, B) = Ts (p - 1) + Tc N/(G p)
- 1 <= G <= N/p and 1 <= B <= M
[Figure: processors 0..p-1, each holding bands of G stages and exchanging packets of size B; two regimes R1 and R2 of values (G, B) are distinguished, according to whether Ts p is greater or smaller than Tc.]
59 Validation of the Model
60 The Tuning Problem
- Given an algorithm A, FA is the input/output function computed by the algorithm.
- FA is defined on D = D1 x ... x Dn.
- FA(z) is the output value of algorithm A for the input z belonging to D.
- TimeM(A(z)) is the execution time of algorithm A over the input z on a machine M.
- CTimeM(A(z)) is the analytical Complexity Time formula that approximates TimeM(A(z)).
- T = D1 x ... x Dk are the Tuning Parameters; I = Dk+1 x ... x Dn are the Input Parameters.
- x belongs to T if and only if x has impact only on the performance of the algorithm but not on its output:
- FA(x, z) = FA(y, z) for any x and y in T
- TimeM(A(x, z)) may differ from TimeM(A(y, z))
- The Tuning Problem is to find x0 in T such that CTimeM(A(x0, z)) = min { CTimeM(A(x, z)) : x in T }.
61 Tuning Parameters
- The list of tuning parameters in parallel computing is extensive:
- The most obvious tuning parameter is the number of processors.
- The size of the buffers used during data exchange.
- Under the Master-Slave paradigm, the size and the number of data items generated by the master.
- In the parallel Divide and Conquer technique, the size of a subproblem to be considered trivial and the processor assignment policy.
- On regular numerical HPF-like algorithms, the block size allocation.
62 The Methodology
- Profile the execution to compute the parameters needed for the Complexity Time function CTimeM(A(x, z)).
- Compute x0 in T that minimizes the Complexity Time function CTimeM(A(x, z)):
- CTimeM(A(x0, z)) = min { CTimeM(A(x, z)) : x in T }
- At this point, the predictive ability of the Complexity Time function can be used to predict the execution time TimeM(A(z)) of an optimal execution, or to execute the algorithm according to the tuning parameter x0 (a toy sketch of the minimization step follows this slide).
[Figure: workflow: Analytical Modeling -> Instrumentation -> Optimal Parameter Computation -> Run-Time Prediction -> Error Prediction Computation.]
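A toy sketch of the minimization step; the model function ctime() for a buffer size B and its coefficients are assumptions used only to illustrate searching the tuning-parameter space:

#include <stdio.h>

/* hypothetical complexity-time model CTime(B) = b*(M/B) + t*M + c*B for buffer size B:
   fewer, larger packets reduce the number of start-ups but increase other costs */
double ctime(int B, double b, double t, double c, int M)
{
  return b * ((double) M / B) + t * M + c * B;
}

int main(void)
{
  double b = 1.0e-4, t = 1.0e-7, c = 5.0e-7;   /* parameters obtained by profiling (assumed values) */
  int M = 100000, B, bestB = 1;
  double best = ctime(1, b, t, c, M);
  for (B = 1; B <= M; B++) {                   /* exhaustive search over the tuning parameter */
    double ct = ctime(B, b, t, c, M);
    if (ct < best) { best = ct; bestB = B; }
  }
  printf("predicted optimal buffer size B = %d (CTime = %g)\n", bestB, best);
  return 0;
}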
63 llp Solver
[Figure: the llp solver; the IN and OUT operations are bracketed by gettime() calls for instrumentation.]
64 The MALLBA Infrastructure
65 Performance Prediction: BA - ULL
66 The MALLBA Project
- Library for the resolution of combinatorial optimisation problems.
- 3 types of resolution techniques:
- Exact
- Heuristic
- Hybrid
- 3 implementations:
- Sequential
- LAN
- WAN
- Goals:
- Genericity
- Ease of utilization
- Locally- and geographically-distributed computation
67 References
- Wilkinson B., Allen M. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice-Hall, 1999.
- Gropp W., Lusk E., Skjellum A. Using MPI: Portable Parallel Programming with the Message-Passing Interface. The MIT Press, 1999.
- Pacheco P. Parallel Programming with MPI. Morgan Kaufmann Publishers, 1997.
- Wu X. Performance Evaluation, Prediction and Visualization of Parallel Systems.
- nereida.deioc.ull.es