Slides: 45

Provided by: asri9

Learn more at: http://www.cs.fsu.edu

Title: Parallel Algorithms

1
Parallel Algorithms

Ashok Srinivasan
www.cs.fsu.edu/asriniva
Florida State University

2
Outline

Background
Primitives
Algorithms
Summary

3
Background

Terminology
Time complexity
Speedup
Efficiency
Scalability
Communication cost model

4
Time complexity

Parallel computation
A group of processors work together to solve a
problem
Time required for the computation is the period
from when the first processor starts working
until when the last processor stops

5
Other terminology

Speedup: $S = T_1 / T_P$
Efficiency: $E = S / P$
Work: $W = P \, T_P$
Scalability
How does $T_P$ decrease as we increase P to solve
the same problem?
How should the problem size increase with P, to
keep E constant?

Notation
P: number of processors
$T_1$: time on one processor
$T_P$: time on P processors

6
Communication cost model

Processes spend some time doing useful work, and
some time communicating
Model communication cost as
$T_C = t_s + L \, t_b$
L: message size
$t_s$: startup cost per message; $t_b$: transfer cost per unit of data
Independent of location of processes
Any process can communicate with any other process

7
I/O model

We will ignore I/O issues, for the most part
We will assume that input and output are
distributed across the processors in a manner of
our choosing
Example: Sorting
Input: $x_1, x_2, \dots, x_n$
Initially, $x_i$ is on processor i
Output: $x_{p_1}, x_{p_2}, \dots, x_{p_n}$
$x_{p_i} < x_{p_{i+1}}$
$x_{p_i}$ on processor i

8
Primitives

Reduction
Broadcast
Gather/Scatter
All gather
Prefix

9
Reduction -- 1
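Compute $x_1 + x_2 + \dots + x_n$
[Figure: the processors holding $x_2, x_3, x_4, \dots, x_n$ each send their value to the processor holding $x_1$]

$T_n = (n-1) + (n-1)(t_s + t_b)$
$S_n = 1/(1 + t_s + t_b)$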

10
Reduction -- 2
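Reduction-1 for $x_1, \dots, x_{n/2}$ (result on the processor holding $x_1$)
Reduction-1 for $x_{n/2+1}, \dots, x_n$ (result on the processor holding $x_{n/2+1}$)

$T_n = (n/2 - 1) + (n/2 - 1)(t_s + t_b) + (t_s + t_b) + 1$
$= n/2 + (n/2)(t_s + t_b)$
$S_n \approx 2/(1 + t_s + t_b)$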

11
Reduction -- 3
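[Figure: first level of the recursion, roots $x_1$ and $x_{n/2+1}$]
Reduction-1 for $x_1, \dots, x_{n/2}$
Reduction-1 for $x_{n/2+1}, \dots, x_n$
[Figure: second level of the recursion, roots $x_1$, $x_{n/4+1}$, $x_{n/2+1}$, $x_{3n/4+1}$]
Reduction-1 for $x_1, \dots, x_{n/4}$
Reduction-1 for $x_{n/4+1}, \dots, x_{n/2}$
Reduction-1 for $x_{n/2+1}, \dots, x_{3n/4}$
Reduction-1 for $x_{3n/4+1}, \dots, x_n$

Apply reduction-2 recursively
Divide and conquer
$T_n = \log_2 n + (t_s + t_b) \log_2 n$
$S_n \approx (n / \log_2 n) \times 1/(1 + t_s + t_b)$
Note that any associative operator can be used in
place of +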
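
The recursive halving can be tried out with a small sequential simulation. This is only a sketch: `tree_reduce` is a hypothetical helper, a Python list stands in for the processors, and each loop iteration stands for one parallel round in which every active pair exchanges one message and applies the operator once.

```python
from operator import add


def tree_reduce(values, op=add):
    """Simulate a tree reduction; returns (result, number_of_rounds).

    `op` must be associative. Operand order is preserved, so it does not
    need to be commutative.
    """
    current = list(values)
    if not current:
        raise ValueError("need at least one value")
    rounds = 0
    while len(current) > 1:
        # Combine neighbours (0,1), (2,3), ... in one simulated parallel round.
        nxt = [op(current[i], current[i + 1])
               for i in range(0, len(current) - 1, 2)]
        if len(current) % 2 == 1:
            nxt.append(current[-1])  # an unpaired value is carried over
        current = nxt
        rounds += 1
    return current[0], rounds
```

For 8 values the simulation finishes in 3 rounds, matching the $\log_2 n$ term in the time estimate.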

12
Parallel addition features

If $n \gg P$
Each processor adds n/P distinct numbers
Perform parallel reduction on P numbers
$T_n \approx n/P + (1 + t_s + t_b) \log P$
Optimal P obtained by differentiating with respect to P
$P_{opt} \approx n/(1 + t_s + t_b)$
If communication cost is high, then fewer
processors ought to be used
$E = [1 + (1 + t_s + t_b) \, P \log P / n]^{-1}$
As problem size increases, efficiency increases
As number of processors increases, efficiency
decreases
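
The cost model for parallel addition can be evaluated numerically to check these trends. The sketch below assumes natural logarithms (the base under which the stated optimum follows exactly from differentiating) and uses hypothetical function names; the values of ts and tb in any call are illustrative, not measurements.

```python
import math


def add_time(n, p, ts, tb):
    """Modelled time to add n numbers on p processors: n/P + (1 + ts + tb) log P."""
    return n / p + (1 + ts + tb) * math.log(p)


def optimal_p(n, ts, tb):
    """Processor count that minimises add_time (natural log assumed)."""
    return n / (1 + ts + tb)


def efficiency(n, p, ts, tb):
    """Modelled efficiency: 1 / (1 + (1 + ts + tb) P log P / n)."""
    return 1.0 / (1.0 + (1 + ts + tb) * p * math.log(p) / n)
```

With n = 1000, ts = 10 and tb = 1 the model puts the optimum near 83 processors; a larger ts or tb moves it lower.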

13
Broadcast
[Figure: broadcast tree over 8 processors, $x_1, \dots, x_8$; the number of processors holding the data doubles in each step]

$T \sim (t_s + L \, t_b) \log P$
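
One common way to reach the log P bound is recursive doubling: in every round, each process that already holds the message forwards it to one process that does not. The sketch below assumes that scheme (the slide's tree may be arranged differently), numbers processes from 0 with the root at rank 0, and uses hypothetical helper names.

```python
def broadcast_schedule(p):
    """Return one list of (sender, receiver) pairs per round.

    In the round with offset `step`, every rank r < step already holds the
    data and sends it to rank r + step, if that rank exists.
    """
    schedule = []
    step = 1
    while step < p:
        schedule.append([(r, r + step) for r in range(step) if r + step < p])
        step *= 2
    return schedule


def broadcast_time(p, L, ts, tb):
    """Modelled time: one message of size L per round, (ts + L*tb) per round."""
    return len(broadcast_schedule(p)) * (ts + L * tb)
```

For 8 processes the schedule has 3 rounds, and it also works when P is not a power of two (5 processes still need 3 rounds).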

14
Gather/Scatter
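[Figure: tree with message sizes 4L, 2L and L on successive levels, from the root to the leaves]

Gather: data move towards the root
Scatter: data move towards the leaves
$T \sim t_s \log P + P L \, t_b$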
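
The message sizes in a tree gather double at each level, which is where the $P L \, t_b$ term comes from. The sketch below simulates this for a power-of-two number of processes; `tree_gather` is a hypothetical helper and the rank numbering is an assumption.

```python
def tree_gather(items, L=1, ts=1.0, tb=1.0):
    """Gather one item per process onto rank 0; returns (gathered, modelled_time).

    Each item has size L. Messages grow as L, 2L, 4L, ... towards the root.
    """
    p = len(items)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two number of processes")
    held = [[x] for x in items]
    time = 0.0
    step = 1
    while step < p:
        for r in range(0, p, 2 * step):
            # Rank r + step sends everything it holds (step items) to rank r.
            held[r] = held[r] + held[r + step]
            held[r + step] = []
        time += ts + step * L * tb  # all sends of a round happen in parallel
        step *= 2
    return held[0], time
```

The data term adds up to (P - 1) L tb, which the formula rounds to P L tb. A scatter runs the same tree in the opposite direction.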

15
All gather
[Figure: 8 processors holding $x_1, \dots, x_8$, one item of size L each]

Equivalent to each processor broadcasting to all
the processors

16
All gather
[Figure: after exchanging messages of size L, pairs of processors hold $x_{1..2}$, $x_{3..4}$, $x_{5..6}$, $x_{7..8}$]

17
All gather
[Figure: after exchanging messages of size 2L, groups of four processors hold $x_{1..4}$ or $x_{5..8}$]

18
All gather
[Figure: after exchanging messages of size 4L, every processor holds $x_{1..8}$]

$T \sim t_s \log P + P L \, t_b$
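
The three exchange steps can be simulated as follows. This is a sketch of recursive doubling under the assumption that P is a power of two and that partners are chosen by flipping one bit of the rank; `all_gather` is a hypothetical name.

```python
def all_gather(items):
    """Simulate a recursive-doubling all-gather.

    Returns (per_process_results, items_sent_per_round).
    """
    p = len(items)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two number of processes")
    held = [{rank: x} for rank, x in enumerate(items)]
    sent = []
    step = 1
    while step < p:
        new_held = []
        for rank in range(p):
            partner = rank ^ step          # exchange with the rank differing in one bit
            merged = dict(held[rank])
            merged.update(held[partner])   # receive everything the partner holds
            new_held.append(merged)
        sent.append(step)                  # each process sends `step` items this round
        held = new_held
        step *= 2
    return [[h[i] for i in range(p)] for h in held], sent
```

For 8 processes the rounds carry 1, 2 and 4 items per process, 7 in total, which agrees with the $P L \, t_b$ term up to rounding.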

19
Sequential prefix

Input
Values $x_i$, $1 \le i \le n$
Output
$X_i = x_1 \oplus x_2 \oplus \dots \oplus x_i$, $1 \le i \le n$
$\oplus$ is an associative operator
Algorithm
$X_1 = x_1$
for i = 2 to n
  $X_i = X_{i-1} \oplus x_i$
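
A direct transcription of the sequential algorithm, as a minimal sketch with 0-based indexing; the function name is hypothetical and `op` is any associative binary function.

```python
from operator import add


def sequential_prefix(xs, op=add):
    """Return [x0, x0 op x1, x0 op x1 op x2, ...] using n - 1 applications of op."""
    out = []
    for x in xs:
        # The first value starts the running result; later ones extend it.
        out.append(x if not out else op(out[-1], x))
    return out
```

The standard library's `itertools.accumulate(xs, op)` computes the same sequence.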

20
Parallel prefix

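Input
Processor i has $x_i$
Output
Processor i has
$x_1 \oplus x_2 \oplus \dots \oplus x_i$

Define f(a,b) as follows, where processor $P_i$ keeps a prefix $X_i$ and a segment total $Y_i$
if a = b
  $X_i = x_i$, on Proc $P_i$
  $Y_i = x_i$, on Proc $P_i$
else
  compute in parallel
    f(a, (a+b)/2)
    f((a+b)/2 + 1, b)
  $P_i$ and $P_j$ send $Y_i$ and $Y_j$ to each other,
  respectively, for
    $a \le i \le (a+b)/2$
    $j = i + (b-a+1)/2$
  (the values on the right-hand sides below are the ones exchanged)
  $Y_i = Y_i \oplus Y_j$ on $P_i$
  $X_j = Y_i \oplus X_j$ on $P_j$
  $Y_j = Y_i \oplus Y_j$ on $P_j$

Divide and conquer
f(a,b) yields the following
$X_i = x_a \oplus \dots \oplus x_i$, Proc $P_i$
$Y_i = x_a \oplus \dots \oplus x_b$, Proc $P_i$
$a \le i \le b$
f(1,n) solves the problem

$T(n) = T(n/2) + 2 + (t_s + t_b) \Rightarrow T(n) = O(\log n)$
An iterative implementation improves the constant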
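
The divide-and-conquer scheme can be checked with a sequential simulation. The sketch below is an interpretation, not the presenter's code: it assumes each processor keeps two values, a prefix and a total over its current segment, that the number of processors is a power of two, and 0-based indices. `parallel_prefix` is a hypothetical name.

```python
from operator import add


def parallel_prefix(xs, op=add):
    """Simulate recursive parallel prefix; returns (prefixes, communication_rounds)."""
    n = len(xs)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of processors")
    prefix = list(xs)  # prefix[i]: x_a op ... op x_i for the current segment [a, b]
    total = list(xs)   # total[i]:  x_a op ... op x_b for the current segment [a, b]

    def f(a, b):
        if a == b:
            return 0
        mid = (a + b) // 2
        # The two halves are independent; a real machine runs them concurrently.
        depth = max(f(a, mid), f(mid + 1, b))
        half = mid - a + 1
        for i in range(a, mid + 1):
            j = i + half
            low, high = total[i], total[j]  # the two values exchanged by P_i and P_j
            total[i] = op(low, high)
            prefix[j] = op(low, prefix[j])  # upper half prepends the lower half's total
            total[j] = op(low, high)
        return depth + 1

    rounds = f(0, n - 1)
    return prefix, rounds
```

Each level of the recursion costs one exchange and at most two operator applications per processor, which gives the O(log n) bound.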

21
Algorithms

Linear recurrence
Matrix vector multiplication
Linear systems

22
Linear recurrence

Determine each $x_i$, $2 \le i \le n$
$x_i = a_i x_{i-1} + b_i x_{i-2}$
$x_0$ and $x_1$ are given
Sequential solution
for i = 2 to n
  $x_i = a_i x_{i-1} + b_i x_{i-2}$
Follows directly from the recurrence
This approach is not easily parallelized

23
Linear recurrence in parallel

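Given $x_i = a_i x_{i-1} + b_i x_{i-2}$
$x_{2i} = a_{2i} x_{2i-1} + b_{2i} x_{2i-2}$
$x_{2i+1} = a_{2i+1} x_{2i} + b_{2i+1} x_{2i-1}$
Rewrite this in matrix form

$X_i = A_i X_{i-1}$, where
$X_i = \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$, $X_{i-1} = \begin{pmatrix} x_{2i-2} \\ x_{2i-1} \end{pmatrix}$,
$A_i = \begin{pmatrix} b_{2i} & a_{2i} \\ a_{2i+1} b_{2i} & b_{2i+1} + a_{2i+1} a_{2i} \end{pmatrix}$

$X_i = A_i A_{i-1} \cdots A_1 X_0$
This is a parallel prefix computation, since
matrix multiplication is associative
Solved in O(log n) time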
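
The matrix form can be verified against the plain recurrence. In the sketch below the function names are hypothetical, the coefficient lists are indexed so that `a[i]` and `b[i]` belong to $x_i$ (entries 0 and 1 are unused), and n is assumed odd so that the pairs cover $x_2, \dots, x_n$. The running product is built with an ordinary loop that stands in for the parallel prefix.

```python
def recurrence_sequential(a, b, x0, x1, n):
    """x_i = a_i x_{i-1} + b_i x_{i-2} for i = 2..n."""
    x = [x0, x1]
    for i in range(2, n + 1):
        x.append(a[i] * x[i - 1] + b[i] * x[i - 2])
    return x


def matmul2(m, k):
    """Product of two 2x2 matrices given as nested lists."""
    return [[m[0][0] * k[0][0] + m[0][1] * k[1][0], m[0][0] * k[0][1] + m[0][1] * k[1][1]],
            [m[1][0] * k[0][0] + m[1][1] * k[1][0], m[1][0] * k[0][1] + m[1][1] * k[1][1]]]


def recurrence_by_prefix(a, b, x0, x1, n):
    """Same result via X_i = A_i A_{i-1} ... A_1 X_0 with X_i = (x_2i, x_2i+1)."""
    if n < 1 or n % 2 == 0:
        raise ValueError("this sketch assumes odd n")
    x = [x0, x1]
    prod = [[1, 0], [0, 1]]  # identity; becomes A_i ... A_1 after i steps
    for i in range(1, (n - 1) // 2 + 1):
        a_i = [[b[2 * i], a[2 * i]],
               [a[2 * i + 1] * b[2 * i], b[2 * i + 1] + a[2 * i + 1] * a[2 * i]]]
        prod = matmul2(a_i, prod)  # prefix product; a parallel prefix would go here
        x.append(prod[0][0] * x0 + prod[0][1] * x1)
        x.append(prod[1][0] * x0 + prod[1][1] * x1)
    return x
```

With all coefficients equal to 1 and starting values 0 and 1, both functions produce the Fibonacci numbers.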

24
Matrix-vector multiplication

$c = A b$
Often performed repeatedly
$b_i = A b_{i-1}$
We need the same data distribution for c and b
One-dimensional decomposition
Example: row-wise block-striped A
b and c replicated
Each process computes its components of c
independently
Then all-gather the components of c

25
1-D matrix-vector multiplication
c: Replicated
A: Row-wise
b: Replicated

Each process computes its components of c
independently
Time: $\Theta(n^2/P)$
Then all-gather the components of c
Time: $t_s \log P + t_b n$
Note: $P \le n$
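
A sequential sketch of the 1-D scheme, assuming P divides n and that each process owns a contiguous block of n/P rows; `matvec_1d` is a hypothetical name and the loop over ranks stands in for work that the processes would do concurrently.

```python
def matvec_1d(A, b, p):
    """Row-striped c = A b: each rank computes its block of c, then all-gather."""
    n = len(A)
    if p < 1 or n % p:
        raise ValueError("this sketch assumes that P divides n")
    rows = n // p
    pieces = []
    for rank in range(p):
        lo = rank * rows
        # Local work: n/P dot products of length n, so about n^2/P operations.
        pieces.append([sum(A[i][k] * b[k] for k in range(n))
                       for i in range(lo, lo + rows)])
    # All-gather: concatenating the pieces leaves every rank with the full c.
    return [value for piece in pieces for value in piece]
```

Because c ends up replicated like b, the result can be fed straight into the next multiplication.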

26
2-D matrix-vector multiplication
A00 A01 A02 A03 | B0 | C0
A10 A11 A12 A13 | B1 | C1
A20 A21 A22 A23 | B2 | C2
A30 A31 A32 A33 | B3 | C3

Process $P_{i0}$ sends $B_i$ to $P_{0i}$
Time: $t_s + t_b n/\sqrt{P}$
Process $P_{0j}$ broadcasts $B_j$ to all $P_{ij}$
Time: $t_s \log \sqrt{P} + t_b n \log \sqrt{P} / \sqrt{P}$
Process $P_{ij}$ computes $C_{ij} = A_{ij} B_j$
Time: $\Theta(n^2/P)$
Processes $P_{ij}$ reduce $C_{ij}$ on to $P_{i0}$, $0 \le i < \sqrt{P}$
Time: $t_s \log \sqrt{P} + t_b n \log \sqrt{P} / \sqrt{P}$
Total time: $\Theta(n^2/P + t_s \log P + t_b n \log P / \sqrt{P})$
$P \le n^2$
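
The four steps of the 2-D scheme can be followed in a sequential sketch. It assumes a q x q process grid with q = sqrt(P) dividing n; `matvec_2d` is a hypothetical name, and the send and broadcast steps are collapsed into slicing b, since a simulation has no separate memories.

```python
def matvec_2d(A, b, q):
    """Block 2-D c = A b on a q x q grid of simulated processes."""
    n = len(A)
    if q < 1 or n % q:
        raise ValueError("this sketch assumes that sqrt(P) divides n")
    blk = n // q
    # Steps 1 and 2: after the send and the column broadcast, P_ij holds B_j.
    B = [b[j * blk:(j + 1) * blk] for j in range(q)]
    # Step 3: P_ij computes its partial product C_ij = A_ij B_j.
    partial = [[[sum(A[i * blk + r][j * blk + k] * B[j][k] for k in range(blk))
                 for r in range(blk)]
                for j in range(q)]
               for i in range(q)]
    # Step 4: each grid row sums its partial vectors onto P_i0.
    c = []
    for i in range(q):
        c.extend(sum(partial[i][j][r] for j in range(q)) for r in range(blk))
    return c
```

Each process touches only an (n/q) x (n/q) block, which is why the compute term stays at n^2/P while the per-message data volume drops to n/sqrt(P).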

27
Linear system

Lower triangular system
for j = 0 to n-1
  $x_j = b_j / a_{jj}$
  for i = j+1 to n-1
    $b_i = b_i - a_{ij} x_j$
Time: $\sim n^2$

a00             | x0 |   | b0
a10 a11         | x1 | = | b1
a20 a21 a22     | x2 |   | b2
a30 a31 a32 a33 | x3 |   | b3
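
The column-oriented forward substitution translates directly into Python. This is a minimal sketch with a hypothetical function name; it assumes a square lower triangular matrix with a nonzero diagonal and works on a copy of b.

```python
def solve_lower_triangular(a, b):
    """Solve a x = b for lower triangular a by forward substitution."""
    n = len(b)
    rhs = list(b)  # keep the caller's right-hand side unchanged
    x = [0.0] * n
    for j in range(n):
        x[j] = rhs[j] / a[j][j]
        # Remove the contribution of x_j from every later equation.
        for i in range(j + 1, n):
            rhs[i] -= a[i][j] * x[j]
    return x
```

For example, with rows (2), (1, 1), (1, 2, 4) and right-hand side (2, 3, 11) the solution is (1, 2, 1.5).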
33
Parallel triangular system - 1

Row-wise striped partitioning
for j = 0 to n-1
  if I own row j
    $x_j = b_j / a_{jj}$
    broadcast $x_j$ to all processes
  else
    receive $x_j$
  for all my rows i, i > j
    $b_i = b_i - a_{ij} x_j$
Communication cost: $n (t_s + t_b) \log P$

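
A sequential sketch of the broadcast-based solver. The slide does not say how rows are assigned, so the cyclic mapping (row i on rank i mod P) is an assumption, as is the name `solve_row_striped`; the inner loop over ranks stands in for updates that the processes would perform concurrently.

```python
def solve_row_striped(a, b, p):
    """Row-striped forward substitution; returns (x, number_of_broadcasts)."""
    n = len(b)
    owner = [i % p for i in range(n)]  # assumed cyclic row distribution
    rhs = list(b)                      # rhs[i] is only touched by owner[i]
    x = [0.0] * n
    broadcasts = 0
    for j in range(n):
        x[j] = rhs[j] / a[j][j]        # computed by owner[j]
        broadcasts += 1                # owner[j] broadcasts x_j to everyone
        for rank in range(p):
            for i in range(j + 1, n):
                if owner[i] == rank:   # each rank updates only its own rows
                    rhs[i] -= a[i][j] * x[j]
    return x, broadcasts
```

One broadcast per unknown gives n broadcasts of cost (ts + tb) log P each, as in the communication cost above.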
34
Parallel triangular system - 2

Pipeline the communication
for j = 0 to n-1
  if I own row j
    $x_j = b_j / a_{jj}$
    send $x_j$ to process myrank+1 (mod P)
  else
    receive $x_j$ from process myrank-1 (mod P)
    send $x_j$ to process myrank+1 (mod P) unless that
    process owns row j
  for all my rows i, i > j
    $b_i = b_i - a_{ij} x_j$
Communication cost: $(n + P)(t_s + t_b)$

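
The ring version can be simulated in the same style. The sketch again assumes a cyclic row distribution and a hypothetical function name. It counts neighbour-to-neighbour hops but runs them one after another; on a real machine the hops for different unknowns overlap in time, and that overlap is what the (n + P) estimate relies on.

```python
def solve_pipelined_ring(a, b, p):
    """Forward substitution with x_j passed around a ring; returns (x, hops)."""
    n = len(b)

    def owner(i):
        return i % p  # assumed cyclic row distribution

    rhs = list(b)
    x = [0.0] * n
    hops = 0
    for j in range(n):
        rank = owner(j)
        x[j] = rhs[j] / a[j][j]
        while True:
            # The rank currently holding x_j updates the rows it owns.
            for i in range(j + 1, n):
                if owner(i) == rank:
                    rhs[i] -= a[i][j] * x[j]
            nxt = (rank + 1) % p
            if nxt == owner(j):
                break  # do not send x_j back to the process that produced it
            hops += 1  # one message to the right-hand neighbour
            rank = nxt
    return x, hops
```

Every message goes to the immediate neighbour only, so no process ever takes part in a log P deep broadcast tree.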
42
Pipelining

Useful when repeatedly and regularly performing a
large number of primitive operations
Optimal time for one broadcast: log P
But doing this n times takes n log P time
Pipelining the broadcasts takes n + P time
Almost constant amortized time per broadcast
if $n \gg P$
$n + P \ll n \log P$ when $n \gg P$
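
The two estimates are easy to compare numerically. The sketch below counts time in units of one message, takes logarithms to base 2 as an assumption, and uses hypothetical function names.

```python
import math


def repeated_broadcast_cost(n, p):
    """n independent tree broadcasts: n log2(P) message steps."""
    return n * math.log2(p)


def pipelined_cost(n, p):
    """n broadcasts pipelined along a chain of P processes: about n + P steps."""
    return n + p


def amortized_pipelined_cost(n, p):
    """Message steps per broadcast when the pipeline is kept full."""
    return pipelined_cost(n, p) / n
```

With n = 1024 and P = 16 the estimates are 4096 and 1040 message steps, and the amortized cost per pipelined broadcast is close to 1.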

43
Load balancing