Parallel Algorithms

Transcript and Presenter's Notes

1
Parallel Algorithms
  • Ashok Srinivasan
  • www.cs.fsu.edu/asriniva
  • Florida State University

2
Outline
  • Background
  • Primitives
  • Algorithms
  • Summary

3
Background
  • Terminology
  • Time complexity
  • Speedup
  • Efficiency
  • Scalability
  • Communication cost model

4
Time complexity
  • Parallel computation
  • A group of processors work together to solve a
    problem
  • Time required for the computation is the period
    from when the first processor starts working
    until when the last processor stops

5
Other terminology
  • Speedup: S = T1/TP
  • Efficiency: E = S/P
  • Work: W = P TP
  • Scalability
  • How does TP decrease as we increase P to solve
    the same problem?
  • How should the problem size increase with P, to
    keep E constant?
  • Notation
  • P = number of processors
  • T1 = time on one processor
  • TP = time on P processors
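  • Example (illustrative numbers): if T1 = 100 s and TP = 30 s on P = 4 processors,
    then S = 100/30 ≈ 3.3, E = S/P ≈ 0.83, and W = P TP = 120 s,
    compared with 100 s of work on one processor.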

6
Communication cost model
  • Processes spend some time doing useful work, and
    some time communicating
  • Model communication cost as
  • TC = ts + L tb
  • L = message size
  • Independent of location of processes
  • Any process can communicate with any other process
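  • Example (illustrative values): with ts = 10 µs and tb = 10 ns per byte, an
    L = 1000 byte message costs TC = 10 µs + 10 µs = 20 µs, so short messages
    are dominated by the latency term ts.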

7
I/O model
  • We will ignore I/O issues, for the most part
  • We will assume that input and output are
    distributed across the processors in a manner of
    our choosing
  • Example: Sorting
  • Input: x1, x2, ..., xn
  • Initially, xi is on processor i
  • Output: xp(1), xp(2), ..., xp(n), a permutation of the input
  • xp(i) ≤ xp(i+1)
  • xp(i) is on processor i

8
Primitives
  • Reduction
  • Broadcast
  • Gather/Scatter
  • All gather
  • Prefix

9
Reduction -- 1
  • Compute x1 + x2 + ... + xn
  • (Figure: processor 1 collects x2, ..., xn one at a time and adds them to its own value)
  • Tn = (n-1) + (n-1)(ts + tb)
  • Sn = 1/(1 + ts + tb)

10
Reduction -- 2
  • (Figure: Reduction-1 applied to x1, ..., xn/2 and to xn/2+1, ..., xn in parallel,
    then the two partial results are combined)
  • Tn = (n/2 - 1) + (n/2 - 1)(ts + tb) + (ts + tb) + 1
  •    = n/2 + (n/2)(ts + tb)
  • Sn = 2/(1 + ts + tb)

11
Reduction -- 3
  • (Figure: Reduction-2 applied recursively; the data is repeatedly halved,
    giving a binary reduction tree)
  • Apply Reduction-2 recursively
  • Divide and conquer
  • Tn = log2 n + (ts + tb) log2 n
  • Sn = (n / log2 n) x 1/(1 + ts + tb)
  • Note that any associative operator can be used in
    place of + (see the MPI sketch below)
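
A minimal MPI sketch of the resulting log P reduction tree (a hand-rolled illustration
with one value per process; in practice a single MPI_Reduce call performs the same
operation):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double val = rank + 1.0;   /* each process contributes one value */

    /* Binary reduction tree: about log2(P) rounds, partial sums flow toward rank 0. */
    for (int step = 1; step < p; step *= 2) {
        if (rank % (2 * step) == 0) {
            if (rank + step < p) {
                double other;
                MPI_Recv(&other, 1, MPI_DOUBLE, rank + step, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                val += other;  /* any associative operator works here */
            }
        } else {
            MPI_Send(&val, 1, MPI_DOUBLE, rank - step, 0, MPI_COMM_WORLD);
            break;             /* this process has handed off its partial sum */
        }
    }

    if (rank == 0) printf("sum = %g\n", val);   /* expect P(P+1)/2 */
    MPI_Finalize();
    return 0;
}
```

With P = 8 processes this completes in log2 P = 3 communication rounds.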

12
Parallel addition features
  • If n >> P
  • Each processor adds n/P distinct numbers
  • Perform parallel reduction on P numbers
  • Tn = n/P + (1 + ts + tb) log P
  • Optimal P obtained by differentiating with respect to P
    (see the derivation below)
  • Popt = n/(1 + ts + tb)
  • If communication cost is high, then fewer
    processors ought to be used
  • E = 1 / (1 + (1 + ts + tb) P log P / n)
  • As problem size increases, efficiency increases
  • As number of processors increases, efficiency
    decreases
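  • Derivation of Popt: minimize Tn = n/P + (1 + ts + tb) log P; treating the log
    as natural (constant factors aside),
    dTn/dP = -n/P² + (1 + ts + tb)/P = 0  =>  Popt = n/(1 + ts + tb)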

13
Broadcast
  • (Figure: broadcast of x1 from one processor to all 8 processors over a
    binomial tree, log P = 3 communication steps)
  • T = (ts + L tb) log P

14
Gather/Scatter
  • (Figure: gather/scatter tree; the message size doubles at each level toward
    the root: L, 2L, 4L, ...)
  • Gather: data moves towards the root
  • Scatter: data moves towards the leaves
  • T = ts log P + P L tb

15
All gather
  • (Figure: initially processor i holds only xi; message size L)
  • Equivalent to each processor broadcasting to all
    the processors

16
All gather
  • (Figure: step 1 - neighbors exchange their values, message size L;
    each processor now holds a pair, e.g. x1-x2)
17
All gather
  • (Figure: step 2 - pairs exchange their data, message size 2L;
    each processor now holds half of the values, e.g. x1-x4)
18
All gather
  • (Figure: step 3 - the two halves exchange their data, message size 4L;
    every processor now holds all of x1-x8)
  • Tn = ts log P + P L tb

19
Sequential prefix
  • Input
  • Values xi, 1 ≤ i ≤ n
  • Output
  • Xi = x1 ⊕ x2 ⊕ ... ⊕ xi, 1 ≤ i ≤ n
  • ⊕ is an associative operator
  • Algorithm (see the C sketch below)
  • X1 = x1
  • for i = 2 to n
  •   Xi = Xi-1 ⊕ xi
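
The same loop as compilable C, a sketch using addition as the operator ⊕
(0-indexed arrays):

```c
#include <stdio.h>

/* Inclusive prefix: X[i] = x[0] + x[1] + ... + x[i], with + as the operator. */
void prefix(const double *x, double *X, int n) {
    X[0] = x[0];
    for (int i = 1; i < n; i++)
        X[i] = X[i - 1] + x[i];
}

int main(void) {
    double x[] = {1, 2, 3, 4, 5};
    double X[5];
    prefix(x, X, 5);
    for (int i = 0; i < 5; i++)
        printf("%g ", X[i]);   /* prints 1 3 6 10 15 */
    printf("\n");
    return 0;
}
```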

20
Parallel prefix
  • Input
  • Processor i has xi
  • Output
  • Processor i has Xi = x1 ⊕ x2 ⊕ ... ⊕ xi
  • Divide and conquer: define f(a,b), which leaves on each
    processor Pi, a ≤ i ≤ b, two values
  • Xi = xa ⊕ ... ⊕ xi (the prefix within the range)
  • Ti = xa ⊕ ... ⊕ xb (the total of the range)
  • if a = b
  •   Xi = xi and Ti = xi on Proc Pi
  • else
  •   compute in parallel
  •   f(a, (a+b)/2)
  •   f((a+b)/2 + 1, b)
  •   Pi and Pj send Ti and Tj to each other, respectively,
      where a ≤ i ≤ (a+b)/2 and j = i + (b - a + 1)/2
  •   Xj = Ti ⊕ Xj on Pj
  •   Ti = Ti ⊕ Tj on Pi
  •   Tj = Ti ⊕ Tj on Pj
  • f(1,n) solves the problem
  • T(n) = T(n/2) + 2 + (ts + tb)  =>  T(n) = O(log n)
  • An iterative implementation improves the constant
    (see the MPI sketch below)
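
A sketch of the iterative O(log P) variant in MPI, using recursive doubling with
one value per process and addition as the operator (the library call MPI_Scan
provides this primitive directly):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double x = rank + 1.0;   /* local input x_i */
    double X = x;            /* running inclusive prefix x_1 + ... + x_i */

    /* Recursive doubling: after the step with distance d, X covers the last 2d inputs. */
    for (int d = 1; d < p; d *= 2) {
        double out = X, in;
        MPI_Request req = MPI_REQUEST_NULL;
        if (rank + d < p)
            MPI_Isend(&out, 1, MPI_DOUBLE, rank + d, 0, MPI_COMM_WORLD, &req);
        if (rank - d >= 0) {
            MPI_Recv(&in, 1, MPI_DOUBLE, rank - d, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            X = in + X;      /* any associative operator */
        }
        if (req != MPI_REQUEST_NULL)
            MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    printf("rank %d: prefix = %g\n", rank, X);   /* expect (rank+1)(rank+2)/2 */
    MPI_Finalize();
    return 0;
}
```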

21
Algorithms
  • Linear recurrence
  • Matrix vector multiplication
  • Linear systems

22
Linear recurrence
  • Determine each xi, 2 ≤ i ≤ n
  • xi = ai xi-1 + bi xi-2
  • x0, x1 given
  • Sequential solution
  • for i = 2 to n
  •   xi = ai xi-1 + bi xi-2
  • Follows directly from the recurrence
  • This approach is not easily parallelized

23
Linear recurrence in parallel
  • Given xi = ai xi-1 + bi xi-2, consider a pair of consecutive terms
  • x2i = a2i x2i-1 + b2i x2i-2
  • x2i+1 = a2i+1 x2i + b2i+1 x2i-1
  • Rewrite this in matrix form

    [ x2i   ]   [ b2i          a2i               ] [ x2i-2 ]
    [ x2i+1 ] = [ a2i+1 b2i    b2i+1 + a2i+1 a2i ] [ x2i-1 ]

    i.e.  Xi = Ai Xi-1,  with Xi = (x2i, x2i+1)

  • Xi = Ai Ai-1 ... A1 X0
  • This is a parallel prefix computation, since
    matrix multiplication is associative
  • Solved in O(log n) time (see the sketch below)
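
As a sequential sanity check of the reformulation (illustrative coefficients; names
such as mat2mul are made up here), the sketch below builds the 2 x 2 matrices Ai,
forms their running products, and confirms that Xi = Ai ... A1 X0 reproduces the
direct recurrence. In the parallel version these products would be formed with the
O(log n) parallel prefix primitive.

```c
#include <stdio.h>

/* Multiply 2x2 matrices: r = m * q (the associative operator of the prefix). */
static void mat2mul(const double m[4], const double q[4], double r[4]) {
    r[0] = m[0]*q[0] + m[1]*q[2];  r[1] = m[0]*q[1] + m[1]*q[3];
    r[2] = m[2]*q[0] + m[3]*q[2];  r[3] = m[2]*q[1] + m[3]*q[3];
}

int main(void) {
    enum { N = 8 };                       /* number of matrix steps */
    double a[2*N + 2], b[2*N + 2], x[2*N + 2];
    x[0] = 1.0; x[1] = 2.0;               /* given initial values x0, x1 */
    for (int i = 2; i < 2*N + 2; i++) { a[i] = 0.5; b[i] = 0.25; }

    /* Direct sequential recurrence. */
    for (int i = 2; i < 2*N + 2; i++) x[i] = a[i]*x[i-1] + b[i]*x[i-2];

    /* Matrix form: X_i = A_i A_{i-1} ... A_1 X_0 with X_i = (x_{2i}, x_{2i+1}). */
    double prefix[4] = {1, 0, 0, 1};      /* running product, starts as identity */
    for (int i = 1; i <= N; i++) {
        double Ai[4] = { b[2*i],            a[2*i],
                         a[2*i+1]*b[2*i],   b[2*i+1] + a[2*i+1]*a[2*i] };
        double tmp[4];
        mat2mul(Ai, prefix, tmp);         /* prefix = A_i * prefix */
        for (int k = 0; k < 4; k++) prefix[k] = tmp[k];
        double x2i  = prefix[0]*x[0] + prefix[1]*x[1];
        double x2i1 = prefix[2]*x[0] + prefix[3]*x[1];
        printf("i=%d  recurrence=(%g, %g)  matrix=(%g, %g)\n",
               i, x[2*i], x[2*i+1], x2i, x2i1);
    }
    return 0;
}
```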

24
Matrix-vector multiplication
  • c = A b
  • Often performed repeatedly
  • bi = A bi-1
  • We need same data distribution for c and b
  • One dimensional decomposition
  • Example row-wise block striped for A
  • b and c replicated
  • Each process computes its components of c
    independently
  • Then all-gather the components of c

25
1-D matrix-vector multiplication
  • c: replicated    A: row-wise block striped    b: replicated
  • Each process computes its components of c
    independently
  • Time: Θ(n²/P)
  • Then all-gather the components of c
  • Time: ts log P + tb n
  • Note: P ≤ n (see the MPI sketch below)
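
A sketch of the 1-D version in MPI (illustrative sizes; assumes n is divisible by P,
with b and c replicated on every process):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int n = 8;                       /* illustrative; assume n % p == 0 */
    int nb = n / p;                        /* rows owned by this process */

    double *Arows = malloc((size_t)nb * n * sizeof *Arows);  /* my rows of A */
    double *b = malloc(n * sizeof *b);     /* replicated input vector */
    double *c = malloc(n * sizeof *c);     /* replicated result vector */
    double *cpart = malloc(nb * sizeof *cpart);

    for (int i = 0; i < nb; i++)           /* toy data: A(i,j) = 1, b(j) = j */
        for (int j = 0; j < n; j++) Arows[i*n + j] = 1.0;
    for (int j = 0; j < n; j++) b[j] = j;

    /* Local work, Theta(n^2/P): each process computes its nb components of c. */
    for (int i = 0; i < nb; i++) {
        cpart[i] = 0.0;
        for (int j = 0; j < n; j++) cpart[i] += Arows[i*n + j] * b[j];
    }

    /* All-gather the pieces so every process ends up with the full c. */
    MPI_Allgather(cpart, nb, MPI_DOUBLE, c, nb, MPI_DOUBLE, MPI_COMM_WORLD);

    if (rank == 0) printf("c[0] = %g (expect %g)\n", c[0], (double)n*(n-1)/2);

    free(Arows); free(b); free(c); free(cpart);
    MPI_Finalize();
    return 0;
}
```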

26
2-D matrix-vector multiplication
  • (Figure: 4 x 4 process grid; Pij holds block Aij, and the fragments Bi of b
    and Ci of c live on column 0)
  • Process Pi0 sends Bi to P0i
  • Time: ts + tb n/√P
  • Process P0j broadcasts Bj to all Pij
  • Time: ts log √P + tb (n/√P) log √P
  • Each process Pij computes Cij = Aij Bj
  • Time: Θ(n²/P)
  • Processes Pij reduce the Cij onto Pi0, 0 ≤ i < √P
  • Time: ts log √P + tb (n/√P) log √P
  • Total time: Θ(n²/P + ts log P + tb n log P / √P)
  • P ≤ n² (see the communication skeleton below)
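
A communication skeleton of the 2-D algorithm, a sketch only (assumes P is a
perfect square, n divisible by √P, and that each process already holds its block
Aij; bfrag and cfrag are the length n/√P fragments held by column 0 before and
after the call):

```c
#include <mpi.h>
#include <stdlib.h>
#include <math.h>

/* One 2-D matrix-vector product (communication skeleton).
   ablock: this process's (nb x nb) block Aij, row-major.
   bfrag:  length-nb fragment; on entry valid only on column 0 (Pi0 holds Bi).
   cfrag:  length-nb result; on exit valid only on column 0 (Pi0 holds Ci). */
void matvec_2d(const double *ablock, double *bfrag, double *cfrag, int nb) {
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    int q = (int)(sqrt((double)p) + 0.5);   /* q x q grid, P assumed a perfect square */
    int row = rank / q, col = rank % q;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);   /* my row of the grid    */
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);   /* my column of the grid */

    /* Step 1: Pi0 sends its fragment Bi to P0i (world rank of P0i is i, of Pi0 is i*q). */
    if (col == 0 && row != 0)
        MPI_Send(bfrag, nb, MPI_DOUBLE, row, 0, MPI_COMM_WORLD);
    if (row == 0 && col != 0)
        MPI_Recv(bfrag, nb, MPI_DOUBLE, col * q, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Step 2: P0j broadcasts Bj down column j. */
    MPI_Bcast(bfrag, nb, MPI_DOUBLE, 0, col_comm);

    /* Step 3: local multiply, Theta(n^2/P) work. */
    double *partial = malloc((size_t)nb * sizeof *partial);
    for (int i = 0; i < nb; i++) {
        partial[i] = 0.0;
        for (int k = 0; k < nb; k++)
            partial[i] += ablock[i * nb + k] * bfrag[k];
    }

    /* Step 4: reduce the partial results along each row onto Pi0. */
    MPI_Reduce(partial, cfrag, nb, MPI_DOUBLE, MPI_SUM, 0, row_comm);

    free(partial);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}
```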

27
Linear system
  • Lower triangular system: solve A x = b, where aij = 0 for j > i
  • for j = 0 to n-1
  •   xj = bj / ajj
  •   for i = j+1 to n-1
  •     bi = bi - aij xj
  • Time: Θ(n²)

  • (Figure: 4 x 4 lower triangular system; the lower triangle of A, the vector x,
    and the right-hand side b)
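
The sequential loop as compilable C, a sketch (A stored row-major as a full n x n
array; b is overwritten during the elimination):

```c
#include <stdio.h>

/* Forward substitution for a lower triangular system A x = b (a[i*n+j] = a_ij).
   Theta(n^2) operations; b is overwritten with intermediate values. */
void lower_solve(int n, const double *a, double *b, double *x) {
    for (int j = 0; j < n; j++) {
        x[j] = b[j] / a[j*n + j];
        for (int i = j + 1; i < n; i++)
            b[i] -= a[i*n + j] * x[j];
    }
}

int main(void) {
    /* 3 x 3 example: 2x0 = 2; x0 + 2x1 = 5; x0 + x1 + 2x2 = 9  =>  x = (1, 2, 3) */
    double a[9] = { 2, 0, 0,
                    1, 2, 0,
                    1, 1, 2 };
    double b[3] = { 2, 5, 9 };
    double x[3];
    lower_solve(3, a, b, x);
    printf("x = (%g, %g, %g)\n", x[0], x[1], x[2]);
    return 0;
}
```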
33
Parallel triangular system - 1
  • Row-wise striped partitioning
  • for j = 0 to n-1
  •   if I own row j
  •     xj = bj / ajj
  •     broadcast xj to all processes
  •   else
  •     receive xj
  •   for all my rows i, i > j
  •     bi = bi - aij xj
  • Communication cost: n (ts + tb) log P (see the sketch below)
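
A sketch of this broadcast version in MPI, assuming block-striped rows with n
divisible by P (myrows holds this process's rows of A as full length-n rows, myb
the matching entries of b; x ends up complete on every process):

```c
#include <mpi.h>

/* Broadcast-based forward substitution, rows block-striped across processes.
   myrows: nb x n local rows of A, myb: local part of b (overwritten), x: length n. */
void lower_solve_bcast(int n, double *myrows, double *myb, double *x) {
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    int nb = n / p;                        /* rows per process, assume n % p == 0 */

    for (int j = 0; j < n; j++) {
        int owner = j / nb;
        if (rank == owner) {
            int lj = j - owner * nb;       /* local index of global row j */
            x[j] = myb[lj] / myrows[lj*n + j];
        }
        /* Everyone needs x_j before updating: a log P broadcast, n times in total. */
        MPI_Bcast(&x[j], 1, MPI_DOUBLE, owner, MPI_COMM_WORLD);

        for (int li = 0; li < nb; li++) {  /* update my rows i with i > j */
            int i = rank * nb + li;
            if (i > j) myb[li] -= myrows[li*n + j] * x[j];
        }
    }
}
```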

34
Parallel triangular system - 2
  • Pipeline the communication
  • for j = 0 to n-1
  •   if I own row j
  •     xj = bj / ajj
  •     send xj to process myrank+1 (mod P)
  •   else
  •     receive xj from process myrank-1 (mod P)
  •     send xj to process myrank+1 (mod P), unless that
      process owns row j
  •   for all my rows i, i > j
  •     bi = bi - aij xj
  • Communication cost: (n + P)(ts + tb) (see the sketch below)
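
A sketch of the pipelined version in MPI, here with cyclically striped rows
(row j on process j mod P, an assumption for concreteness), passing each xj around
a ring instead of broadcasting it:

```c
#include <mpi.h>

/* Pipelined forward substitution with cyclically striped rows (row j on rank j mod P).
   myrows: my rows of A as full length-n rows, in increasing global row order.
   myb:    the matching entries of b (overwritten).
   x:      length n; every process ends up with the full solution. */
void lower_solve_pipelined(int n, double *myrows, double *myb, double *x) {
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    int next = (rank + 1) % p, prev = (rank - 1 + p) % p;

    for (int j = 0; j < n; j++) {
        int owner = j % p;
        if (rank == owner) {
            int lj = j / p;                /* local index of global row j */
            x[j] = myb[lj] / myrows[lj * n + j];
            if (p > 1)
                MPI_Send(&x[j], 1, MPI_DOUBLE, next, j, MPI_COMM_WORLD);
        } else {
            MPI_Recv(&x[j], 1, MPI_DOUBLE, prev, j, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (next != owner)             /* forward, unless the next process owns row j */
                MPI_Send(&x[j], 1, MPI_DOUBLE, next, j, MPI_COMM_WORLD);
        }
        /* Update my remaining rows i > j. */
        for (int li = 0, i = rank; i < n; li++, i += p)
            if (i > j)
                myb[li] -= myrows[li * n + j] * x[j];
    }
}
```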

42
Pipelining
  • Useful when repeatedly and regularly performing a
    large number of primitive operations
  • Optimal time for a single broadcast: log P
  • But doing this n times takes n log P time
  • Pipelining the broadcasts takes n + P time
  • Almost constant amortized time per broadcast
    if n >> P
  • n + P << n log P when n >> P

43
Load balancing
  • Distribution of rows
  • Block striping: blocks of adjacent rows to each process
  • Arithmetic cost: 2 n²/P
  • Cyclic striping
  • Arithmetic cost: n²/P
  • Cyclic is better load balanced

  • (Figure: block striped vs. cyclic striped row distributions)
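  • To see the factor of two: with block striping the last process owns the n/P
    densest rows, each with roughly n nonzeros, so it performs about 2 n²/P of the
    roughly n² total operations, while cyclic striping gives every process an even
    mix of short and long rows and achieves the ideal n²/P.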
44
Summary
  • Model and performance measures
  • Primitives
  • Applications of primitives
  • General techniques
  • Divide and conquer
  • Pipelining