Title: Parallel Algorithms
1 Parallel Algorithms
- Ashok Srinivasan
- www.cs.fsu.edu/asriniva
- Florida State University
2 Outline
- Background
- Primitives
- Algorithms
- Summary
3 Background
- Terminology
- Time complexity
- Speedup
- Efficiency
- Scalability
- Communication cost model
4 Time complexity
- Parallel computation
  - A group of processors work together to solve a problem
  - Time required for the computation is the period from when the first processor starts working until when the last processor stops
5 Other terminology
- Speedup: $S = T_1 / T_P$
- Efficiency: $E = S / P$
- Work: $W = P \, T_P$
- Scalability
  - How does $T_P$ decrease as we increase $P$ to solve the same problem?
  - How should the problem size increase with $P$, to keep $E$ constant?
- Notation
  - $P$: number of processors
  - $T_1$: time on one processor
  - $T_P$: time on $P$ processors
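The three measures can be tried on a small made-up measurement. This is only a sketch: the function names and the timings (100 time units on one processor, 20 on eight) are assumptions chosen for illustration, not values from the slides.

```python
def speedup(t1, tp):
    """S = T_1 / T_P."""
    return t1 / tp


def efficiency(t1, tp, p):
    """E = S / P."""
    return speedup(t1, tp) / p


def work(tp, p):
    """W = P * T_P."""
    return p * tp


# Hypothetical timings: T_1 = 100, T_P = 20 on P = 8 processors.
print(speedup(100.0, 20.0))        # 5.0
print(efficiency(100.0, 20.0, 8))  # 0.625
print(work(20.0, 8))               # 160.0
```

Here the work (160) exceeds $T_1$ (100), which is another way of saying that the efficiency is below 1.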
6 Communication cost model
- Processes spend some time doing useful work, and some time communicating
- Model communication cost as
  - $T_C = t_s + L \, t_b$
  - $L$: message size
- Independent of location of processes
- Any process can communicate with any other process
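A quick numerical reading of the cost model, as a sketch. It treats $t_s$ as a fixed cost paid once per message and $t_b$ as the cost per unit of message size, which is how the formula reads; the function name and the numbers are hypothetical.

```python
def comm_time(L, ts, tb):
    """T_C = t_s + L * t_b for one message of size L."""
    return ts + L * tb


ts, tb = 50.0, 1.0  # hypothetical costs, in arbitrary time units

one_big = comm_time(1000, ts, tb)          # one message of size 1000
many_small = 1000 * comm_time(1, ts, tb)   # 1000 messages of size 1

print(one_big)     # 1050.0
print(many_small)  # 51000.0
```

Under these assumed costs, sending the same data in many small messages is far more expensive, because the fixed term is paid once per message.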
7 I/O model
- We will ignore I/O issues, for the most part
- We will assume that input and output are distributed across the processors in a manner of our choosing
- Example: sorting
  - Input: $x_1, x_2, \ldots, x_n$
    - Initially, $x_i$ is on processor $i$
  - Output: $x_{p_1}, x_{p_2}, \ldots, x_{p_n}$
    - $x_{p_i} \le x_{p_{i+1}}$
    - $x_{p_i}$ on processor $i$
8 Primitives
- Reduction
- Broadcast
- Gather/Scatter
- All gather
- Prefix
9 Reduction -- 1
Compute $x_1 + x_2 + \cdots + x_n$
[Figure: processors holding $x_1, x_2, x_3, x_4, \ldots, x_n$]
- $T_n = (n-1) + (n-1)(t_s + t_b)$
- $S_n = 1/(1 + t_s + t_b)$
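10 Reduction -- 2
[Figure: Reduction-1 for $x_1, \ldots, x_{n/2}$ and Reduction-1 for $x_{n/2+1}, \ldots, x_n$, with the two results at $x_1$ and $x_{n/2+1}$]
- $T_n = (n/2 - 1) + (n/2 - 1)(t_s + t_b) + (t_s + t_b) + 1 = n/2 + (n/2)(t_s + t_b)$
- $S_n \approx 2/(1 + t_s + t_b)$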
11 Reduction -- 3
[Figure: first, Reduction-1 on the two halves $x_1, \ldots, x_{n/2}$ and $x_{n/2+1}, \ldots, x_n$; then Reduction-1 on the four quarters $x_1, \ldots, x_{n/4}$; $x_{n/4+1}, \ldots, x_{n/2}$; $x_{n/2+1}, \ldots, x_{3n/4}$; $x_{3n/4+1}, \ldots, x_n$]
- Apply reduction-2 recursively
  - Divide and conquer
- $T_n = \log_2 n + (t_s + t_b) \log_2 n$
- $S_n \approx (n / \log_2 n) \times 1/(1 + t_s + t_b)$
- Note that any associative operator can be used in place of $+$
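The logarithmic depth can be seen in a small sequential simulation. This sketch pairs up neighbours bottom-up, which gives a tree of the same depth as the recursive halving described here, though not necessarily the same pairing as in the slide's figure. The name `tree_reduce` is an assumption.

```python
from operator import add


def tree_reduce(values, op=add):
    """Return (result, rounds) for a pairwise tree reduction of a non-empty input.

    Each pass of the while loop stands for one parallel round in which
    disjoint pairs are combined at the same time.
    """
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        # Combine neighbours (2k, 2k+1); order is kept, so op only
        # needs to be associative, not commutative.
        nxt = [op(vals[k], vals[k + 1]) for k in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])  # an unpaired last value carries over
        vals = nxt
        rounds += 1
    return vals[0], rounds


print(tree_reduce(range(1, 9)))  # (36, 3)
print(tree_reduce("abcde"))      # ('abcde', 3)
```

Eight values need $\log_2 8 = 3$ rounds. The string example uses concatenation, an associative but non-commutative operator.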
12 Parallel addition features
- If $n \gg P$
  - Each processor adds $n/P$ distinct numbers
  - Perform parallel reduction on $P$ numbers
  - $T_n \approx n/P + (1 + t_s + t_b) \log P$
- Optimal $P$ obtained by differentiating with respect to $P$
  - $P_{opt} \approx n/(1 + t_s + t_b)$
  - If communication cost is high, then fewer processors ought to be used
- $E \approx [1 + (1 + t_s + t_b) P \log P / n]^{-1}$
  - As problem size increases, efficiency increases
  - As number of processors increases, efficiency decreases
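A brute-force check of the optimum and of the two efficiency trends, as a sketch. It assumes the natural logarithm in the time formula (with base 2 the optimum shifts by a constant factor), and the values of `n`, `ts` and `tb` are hypothetical.

```python
import math


def t_parallel(n, P, ts, tb):
    """T = n/P + (1 + ts + tb) * log P, using the natural log (an assumption)."""
    return n / P + (1 + ts + tb) * math.log(P)


def efficiency(n, P, ts, tb):
    """E = [1 + (1 + ts + tb) * P * log P / n]^(-1)."""
    return 1.0 / (1.0 + (1 + ts + tb) * P * math.log(P) / n)


n, ts, tb = 1000, 3.0, 1.0  # hypothetical problem size and costs

# Search every processor count instead of differentiating.
best_P = min(range(1, n + 1), key=lambda P: t_parallel(n, P, ts, tb))
print(best_P)  # 200, which equals n / (1 + ts + tb)

# Efficiency rises with n and falls with P.
print(efficiency(2000, 8, ts, tb) > efficiency(1000, 8, ts, tb))   # True
print(efficiency(1000, 16, ts, tb) < efficiency(1000, 8, ts, tb))  # True
```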
13 Broadcast
[Figure: broadcast among eight processors, with node labels $x_1, \ldots, x_8$]
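14 Gather/Scatter
[Figure: tree over 8 processes; message sizes are $4L$ on one edge, $2L$ on two edges, and $L$ on four edges]
- Gather: data move towards the root
- Scatter: data move towards the leaves
- $T \approx t_s \log P + P L t_b$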
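The cost formula can be recovered by adding up the rounds of a tree gather in which the message size doubles each round ($L$, $2L$, $4L$, ...). The sketch below assumes $P$ is a power of two and that all transfers within a round overlap; `gather_cost` is a hypothetical name.

```python
def gather_cost(P, L, ts, tb):
    """Time for a tree gather on P = 2^k processes, each contributing size L."""
    assert P > 0 and P & (P - 1) == 0, "sketch assumes P is a power of two"
    time, size = 0.0, L
    while size < P * L:
        time += ts + size * tb  # one round; its transfers run in parallel
        size *= 2               # gathered pieces double in size each round
    return time


# P = 8: three rounds with sizes L, 2L, 4L.
print(gather_cost(8, 1, 10.0, 2.0))  # 44.0 = 3 * 10 + (1 + 2 + 4) * 2
```

The exact sum is $t_s \log_2 P + (P-1) L t_b$, which the slide rounds to $t_s \log P + P L t_b$.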
15 All gather
[Figure: eight processors holding $x_1, \ldots, x_8$, one value each; message size $L$]
- Equivalent to each processor broadcasting to all the processors
16 All gather
[Figure: processors hold the pairs $(x_1, x_2)$, $(x_3, x_4)$, $(x_5, x_6)$, $(x_7, x_8)$; message sizes $L$ and $2L$]
17 All gather
[Figure: processors hold the halves $x_1, \ldots, x_4$ or $x_5, \ldots, x_8$; message sizes $L$, $2L$ and $4L$]
18 All gather
[Figure: every processor holds all of $x_1, \ldots, x_8$; message sizes $L$, $2L$ and $4L$]
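The four frames above (single values, pairs, halves, everything) match a recursive-doubling exchange. The following is a sequential sketch of that scheme, assuming $P$ is a power of two and that rank `r` exchanges with rank `r XOR d` in the round with distance `d`; the pairing rule and the name `all_gather` are assumptions, since the slides only show the data held after each step.

```python
def all_gather(values):
    """Simulate recursive doubling; values[r] starts on rank r.

    Returns (held, sizes): held[r] is what rank r ends up with, and
    sizes lists the message size (in items) sent in each round.
    """
    P = len(values)
    assert P > 0 and P & (P - 1) == 0, "sketch assumes P is a power of two"
    held = [[v] for v in values]
    sizes = []
    d = 1
    while d < P:
        sizes.append(len(held[0]))
        new = []
        for r in range(P):
            partner = r ^ d
            lo, hi = min(r, partner), max(r, partner)
            new.append(held[lo] + held[hi])  # keep items in rank order
        held = new
        d *= 2
    return held, sizes


held, sizes = all_gather(list("abcdefgh"))
print(held[0])  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
print(sizes)    # [1, 2, 4]
```

The message sizes 1, 2, 4 correspond to the $L$, $2L$, $4L$ labels in the figures.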
19 Sequential prefix
- Input
  - Values $x_i$, $1 \le i \le n$
- Output
  - $X_i = x_1 \oplus x_2 \oplus \cdots \oplus x_i$, $1 \le i \le n$
  - $\oplus$ is an associative operator
- Algorithm
  - $X_1 = x_1$
  - for $i = 2$ to $n$
    - $X_i = X_{i-1} \oplus x_i$
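20 Parallel prefix
- Input
  - Processor $i$ has $x_i$
- Output
  - Processor $i$ has $x_1 \oplus x_2 \oplus \cdots \oplus x_i$
- Define $f(a, b)$ as follows, where $X_i$ is the running prefix and $\bar{X}_i$ the running total on $P_i$
  - if $a = b$
    - $X_i = x_i$, on Proc $P_i$
    - $\bar{X}_i = x_i$, on Proc $P_i$
  - else
    - compute in parallel
      - $f(a, \lfloor (a+b)/2 \rfloor)$
      - $f(\lfloor (a+b)/2 \rfloor + 1, b)$
    - $P_i$ and $P_j$ send $\bar{X}_i$ and $\bar{X}_j$ to each other, respectively
      - $a \le i \le \lfloor (a+b)/2 \rfloor$
      - $j = i + (b - a + 1)/2$
    - then, using the values just exchanged
      - $\bar{X}_i = \bar{X}_i \oplus \bar{X}_j$ on $P_i$
      - $X_j = \bar{X}_i \oplus X_j$ on $P_j$
      - $\bar{X}_j = \bar{X}_i \oplus \bar{X}_j$ on $P_j$
- Divide and conquer
  - $f(a, b)$ yields the following
    - $X_i = x_a \oplus \cdots \oplus x_i$, Proc $P_i$
    - $\bar{X}_i = x_a \oplus \cdots \oplus x_b$, Proc $P_i$
    - $a \le i \le b$
  - $f(1, n)$ solves the problem
- $T(n) = T(n/2) + 2 + (t_s + t_b) \Rightarrow T(n) = O(\log n)$
- An iterative implementation improves the constant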
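The recursion can be simulated sequentially to check that it produces the prefixes. This sketch assumes $n$ is a power of two so that the two halves pair up exactly, keeps two values per processor (a running prefix and a running total over the current range), and uses 0-based indices. The names `prefix` and `total` are my own labels for those two quantities.

```python
from operator import add


def parallel_prefix(xs, op=add):
    """Simulate f(1, n) on a list; returns the prefix held by each processor."""
    n = len(xs)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes n is a power of two"
    prefix = list(xs)  # x_a op ... op x_i for the current range [a, b]
    total = list(xs)   # x_a op ... op x_b for the current range [a, b]

    def f(a, b):  # inclusive, 0-based range
        if a == b:
            return
        mid = (a + b) // 2
        f(a, mid)      # the two calls are independent, so real
        f(mid + 1, b)  # processors would run them at the same time
        half = mid - a + 1
        for i in range(a, mid + 1):
            j = i + half
            ti, tj = total[i], total[j]  # the two exchanged messages
            total[i] = op(ti, tj)
            prefix[j] = op(ti, prefix[j])
            total[j] = op(ti, tj)

    f(0, n - 1)
    return prefix


print(parallel_prefix([3, 1, 4, 1, 5, 9, 2, 6]))
# [3, 4, 8, 9, 14, 23, 25, 31]
print(parallel_prefix(list("abcd")))
# ['a', 'ab', 'abc', 'abcd']
```

The lower-half total is always placed on the left of `op`, so the result is also correct for non-commutative operators such as string concatenation.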
21 Algorithms
- Linear recurrence
- Matrix vector multiplication
- Linear systems
22 Linear recurrence
- Determine each $x_i$, $2 \le i \le n$
  - $x_i = a_i x_{i-1} + b_i x_{i-2}$
  - $x_0$ and $x_1$ are given
- Sequential solution
  - for $i = 2$ to $n$
    - $x_i = a_i x_{i-1} + b_i x_{i-2}$
  - Follows directly from the recurrence
- This approach is not easily parallelized
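23 Linear recurrence in parallel
- Given $x_i = a_i x_{i-1} + b_i x_{i-2}$
  - $x_{2i} = a_{2i} x_{2i-1} + b_{2i} x_{2i-2}$
  - $x_{2i+1} = a_{2i+1} x_{2i} + b_{2i+1} x_{2i-1}$
- Rewrite this in matrix form
$$
\underbrace{\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}}_{X_i}
=
\underbrace{\begin{pmatrix} b_{2i} & a_{2i} \\ a_{2i+1} b_{2i} & b_{2i+1} + a_{2i+1} a_{2i} \end{pmatrix}}_{A_i}
\underbrace{\begin{pmatrix} x_{2i-2} \\ x_{2i-1} \end{pmatrix}}_{X_{i-1}}
$$
- $X_i = A_i A_{i-1} \cdots A_1 X_0$
- This is a parallel prefix computation, since matrix multiplication is associative
- Solved in $O(\log n)$ time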
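The matrix form can be checked against the plain loop. In this sketch the prefix products $A_i \cdots A_1$ are formed with `itertools.accumulate`, a sequential scan standing in for the parallel prefix; only associativity of the 2x2 product is used, which is what would let a parallel scan replace it. The function names are assumptions, as is the restriction to odd $n$ (so the pairs $(x_{2i}, x_{2i+1})$ cover $x_2, \ldots, x_n$ exactly).

```python
from itertools import accumulate


def matmul2(p, q):
    """Product of two 2x2 matrices stored as nested tuples."""
    return (
        (p[0][0] * q[0][0] + p[0][1] * q[1][0], p[0][0] * q[0][1] + p[0][1] * q[1][1]),
        (p[1][0] * q[0][0] + p[1][1] * q[1][0], p[1][0] * q[0][1] + p[1][1] * q[1][1]),
    )


def recurrence_sequential(a, b, x0, x1):
    """x_i = a_i x_{i-1} + b_i x_{i-2}; a and b are indexed 0..n (0, 1 unused)."""
    xs = [x0, x1]
    for i in range(2, len(a)):
        xs.append(a[i] * xs[i - 1] + b[i] * xs[i - 2])
    return xs


def recurrence_by_prefix(a, b, x0, x1):
    """Same result via prefix products of the matrices A_i."""
    n = len(a) - 1
    assert n % 2 == 1, "sketch assumes n is odd"
    mats = [
        ((b[2 * i], a[2 * i]),
         (a[2 * i + 1] * b[2 * i], b[2 * i + 1] + a[2 * i + 1] * a[2 * i]))
        for i in range(1, (n - 1) // 2 + 1)
    ]
    # Scan with the operator (acc, A) -> A * acc, giving A_i ... A_1.
    xs = [x0, x1]
    for m in accumulate(mats, lambda acc, A: matmul2(A, acc)):
        xs.append(m[0][0] * x0 + m[0][1] * x1)  # x_{2i}
        xs.append(m[1][0] * x0 + m[1][1] * x1)  # x_{2i+1}
    return xs


ones = [1] * 8  # a_i = b_i = 1 gives the Fibonacci numbers
print(recurrence_by_prefix(ones, ones, 0, 1))  # [0, 1, 1, 2, 3, 5, 8, 13]
```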
24 Matrix-vector multiplication
- $c = A \, b$
  - Often performed repeatedly
    - $b_i = A \, b_{i-1}$
  - We need the same data distribution for $c$ and $b$
- One dimensional decomposition
  - Example: row-wise block striped for $A$
    - $b$ and $c$ replicated
  - Each process computes its components of $c$ independently
  - Then all-gather the components of $c$
25 1-D matrix-vector multiplication
[Figure: $c$ replicated, $A$ row-wise striped, $b$ replicated]
- Each process computes its components of $c$ independently
  - Time: $\Theta(n^2/P)$
- Then all-gather the components of $c$
  - Time: $t_s \log P + t_b n$
- Note: $P \le n$
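The two phases (independent local products, then an all-gather) can be sketched sequentially in plain Python. The function name is hypothetical, the blocks are assumed to be of equal size ($P$ divides $n$), and the all-gather is modelled simply as concatenation followed by one copy per process.

```python
def matvec_1d(A, b, P):
    """Row-wise block-striped c = A b; returns the copy of c held by each process."""
    n = len(A)
    assert 1 <= P <= n and n % P == 0, "sketch assumes P divides n"
    rows = n // P
    # Phase 1: process p multiplies its block of rows by the replicated b.
    local = [
        [sum(A[i][k] * b[k] for k in range(n)) for i in range(p * rows, (p + 1) * rows)]
        for p in range(P)
    ]
    # Phase 2: all-gather, so every process ends up with the whole of c.
    c = [v for part in local for v in part]
    return [list(c) for _ in range(P)]


A = [[1, 2], [3, 4]]
print(matvec_1d(A, [1, 1], 2))  # [[3, 7], [3, 7]]
```

Because every process finishes with a full copy of $c$, the result can be used directly as the replicated input of the next multiplication.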
26 2-D matrix-vector multiplication
[Figure: $4 \times 4$ process grid; $P_{ij}$ holds block $A_{ij}$, and $P_{i0}$ also holds $B_i$ and $C_i$]
- Process $P_{i0}$ sends $B_i$ to $P_{0i}$
  - Time: $t_s + t_b n / \sqrt{P}$
- Process $P_{0j}$ broadcasts $B_j$ to all $P_{ij}$
  - Time: $t_s \log \sqrt{P} + t_b n \log \sqrt{P} / \sqrt{P}$
- Process $P_{ij}$ computes $C_{ij} = A_{ij} B_j$
  - Time: $\Theta(n^2/P)$
- Processes $P_{ij}$ reduce $C_{ij}$ onto $P_{i0}$, $0 \le i < \sqrt{P}$
  - Time: $t_s \log \sqrt{P} + t_b n \log \sqrt{P} / \sqrt{P}$
- Total time: $\Theta(n^2/P + t_s \log P + t_b n \log P / \sqrt{P})$
- $P \le n^2$
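The four steps can be traced in a sequential sketch on a $q \times q$ grid with $q = \sqrt{P}$. The communication steps are only modelled by which data each grid position reads; the function name is hypothetical and $q$ is assumed to divide $n$.

```python
def matvec_2d(A, b, q):
    """c = A b on a q x q process grid (P = q * q), simulated sequentially."""
    n = len(A)
    assert n % q == 0, "sketch assumes q divides n"
    m = n // q  # side length of one block
    # Steps 1 and 2: after the send and the column broadcast,
    # every process P_ij has access to B_j.
    B = [b[j * m:(j + 1) * m] for j in range(q)]
    c = []
    for i in range(q):
        # Step 3: each P_ij forms the partial product C_ij = A_ij B_j.
        partial = [
            [sum(A[i * m + r][j * m + k] * B[j][k] for k in range(m)) for r in range(m)]
            for j in range(q)
        ]
        # Step 4: reduce (sum) the partial vectors along process row i onto P_i0.
        c.extend(sum(partial[j][r] for j in range(q)) for r in range(m))
    return c


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(matvec_2d(A, [1, 0, -1, 2], 2))  # [6, 14, 22, 30]
```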
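27 Linear system
- Lower triangular system
  - for $j = 0$ to $n-1$
    - $x_j = b_j / a_{jj}$
    - for $i = j+1$ to $n-1$
      - $b_i = b_i - a_{ij} x_j$
  - Time: $n^2$
$$
\begin{pmatrix}
a_{00} & & & \\
a_{10} & a_{11} & & \\
a_{20} & a_{21} & a_{22} & \\
a_{30} & a_{31} & a_{32} & a_{33}
\end{pmatrix}
\begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
$$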
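The loop on this slide translates almost line for line into Python. A minimal sketch, assuming a dense list-of-lists matrix with non-zero diagonal entries; the function name is my own.

```python
def forward_substitution(a, b):
    """Solve a x = b for lower triangular a, column by column."""
    n = len(b)
    b = list(b)  # work on a copy so the caller's right-hand side is kept
    x = [0.0] * n
    for j in range(n):
        x[j] = b[j] / a[j][j]
        for i in range(j + 1, n):
            b[i] -= a[i][j] * x[j]  # remove x_j's contribution from later rows
    return x


a = [[2, 0, 0],
     [1, 1, 0],
     [3, 2, 4]]
print(forward_substitution(a, [2, 3, 19]))  # [1.0, 2.0, 3.0]
```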
33 Parallel triangular system - 1
- Row-wise striped partitioning
  - for $j = 0$ to $n-1$
    - if I own row $j$
      - $x_j = b_j / a_{jj}$
      - broadcast $x_j$ to all processes
    - else
      - receive $x_j$
    - for all my rows $i$, $i > j$
      - $b_i = b_i - a_{ij} x_j$
- Communication cost: $n (t_s + t_b) \log P$
[Figure: the $4 \times 4$ lower triangular system with entries $a_{ij}$, unknowns $x_i$ and right-hand side $b_i$]
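A sequential simulation of this scheme, as a sketch. Each simulated process keeps only the right-hand-side entries of the rows it owns; rows are dealt out cyclically (row $i$ to process $i \bmod P$), which is an assumption because the slide does not say which striping is used. The broadcast is modelled by letting every process read `xj`, and the function name is hypothetical.

```python
def striped_solve(a, b, P):
    """Solve lower triangular a x = b with rows striped over P simulated processes.

    Returns (x, broadcasts).
    """
    n = len(b)
    # local_b[p] maps each row index owned by process p to its b entry.
    local_b = [{i: b[i] for i in range(p, n, P)} for p in range(P)]
    x = [0.0] * n
    broadcasts = 0
    for j in range(n):
        owner = j % P
        xj = local_b[owner][j] / a[j][j]  # the owner of row j computes x_j
        broadcasts += 1                   # ... and broadcasts it
        x[j] = xj
        for p in range(P):                # every process updates its own rows i > j
            for i in local_b[p]:
                if i > j:
                    local_b[p][i] -= a[i][j] * xj
    return x, broadcasts


a = [[2, 0, 0],
     [1, 1, 0],
     [3, 2, 4]]
print(striped_solve(a, [2, 3, 19], 2))  # ([1.0, 2.0, 3.0], 3)
```

One broadcast is issued per unknown, which is where the factor $n$ in the communication cost comes from.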
34 Parallel triangular system - 2
- Pipeline the communication
  - for $j = 0$ to $n-1$
    - if I own row $j$
      - $x_j = b_j / a_{jj}$
      - send $x_j$ to process myrank+1 (mod $P$)
    - else
      - receive $x_j$ from process myrank-1 (mod $P$)
      - send $x_j$ to process myrank+1 (mod $P$) unless that process owns row $j$
    - for all my rows $i$, $i > j$
      - $b_i = b_i - a_{ij} x_j$
- Communication cost: $(n + P)(t_s + t_b)$
[Figure: the $4 \times 4$ lower triangular system with entries $a_{ij}$, unknowns $x_i$ and right-hand side $b_i$]
42 Pipelining
- Useful when repeatedly and regularly performing a large number of primitive operations
- Optimal time for a broadcast: $\log P$
  - But doing this $n$ times takes $n \log P$ time
- Pipelining the broadcasts takes $n + P$ time
  - Almost constant amortized time per broadcast if $n \gg P$
  - $n + P \ll n \log P$ when $n \gg P$
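The gap between the two costs is easy to tabulate. In this sketch time is counted in units of one message step, the logarithm is taken base 2, and the values of `n` and `P` are hypothetical.

```python
import math


def repeated_broadcast_time(n, P):
    """n separate tree broadcasts, log2(P) steps each."""
    return n * math.log2(P)


def pipelined_time(n, P):
    """n broadcasts pipelined through P processes."""
    return n + P


n, P = 10_000, 64
print(repeated_broadcast_time(n, P))  # 60000.0
print(pipelined_time(n, P))           # 10064
print(pipelined_time(n, P) / n)       # 1.0064 steps per broadcast, amortized
```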
43 Load balancing
- Distribution of rows
  - Blocks of adjacent rows to each process
    - Arithmetic cost: $2 n^2 / P$
  - Cyclic distribution
    - Arithmetic cost: $n^2 / P$
  - Cyclic is better load balanced
[Figure: block-striped and cyclic-striped distributions of rows]
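The factor of two can be reproduced by counting updates. In the triangular solve, row $i$ receives one update $b_i = b_i - a_{ij} x_j$ for every $j < i$, so it costs $i$ updates. The sketch below adds up that cost per process for both distributions and reports the busiest process; the helper name and the sizes are assumptions.

```python
def max_updates(n, P, owner):
    """Largest number of b_i updates performed by any single process."""
    load = [0] * P
    for i in range(n):
        load[owner(i)] += i  # row i needs i updates in total
    return max(load)


n, P = 1024, 16
block = max_updates(n, P, lambda i: i // (n // P))  # adjacent rows per process
cyclic = max_updates(n, P, lambda i: i % P)         # rows dealt out round-robin

print(block, cyclic)  # 63456 33216
```

With one multiply and one subtract per update, these counts are close to $2n^2/P$ and $n^2/P$ arithmetic operations respectively: under the block distribution the process holding the last rows does nearly twice the work of the busiest process under the cyclic one.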
44 Summary
- Model and performance measures
- Primitives
- Applications of primitives
- General techniques
- Divide and conquer
- Pipelining