1
Introduction to parallel algorithms
COT 5405 Fall 2006
  • Ashok Srinivasan
  • www.cs.fsu.edu/asriniva
  • Florida State University

2
Outline
  • Background
  • Primitives
  • Algorithms
  • Important points

3
Background
  • Terminology
  • Time complexity
  • Speedup
  • Efficiency
  • Scalability
  • Communication cost model

4
Time complexity
  • Parallel computation
  • A group of processors work together to solve a
    problem
  • Time required for the computation is the period
    from when the first processor starts working
    until when the last processor stops

5
Other terminology
  • Speedup: S = T1/TP
  • Efficiency: E = S/P
  • Work: W = P TP
  • Scalability
  • How does TP decrease as we increase P to solve
    the same problem?
  • How should the problem size increase with P, to
    keep E constant?
  • Notation
  • P = number of processors
  • T1 = time on one processor
  • TP = time on P processors
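As a quick numeric illustration of these definitions (the timings below are made-up example values, not from the slides):

```python
# Hypothetical timings: T1 = 64 s on one processor, TP = 10 s on P = 8.
P = 8
T1 = 64.0
TP = 10.0

S = T1 / TP   # speedup: 6.4
E = S / P     # efficiency: 0.8
W = P * TP    # work: 80.0 processor-seconds
```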

6
Communication cost model
  • Processes spend some time doing useful work, and
    some time communicating
  • Model communication cost as
  • TC = ts + L tb
  • L = message size
  • Independent of location of processes
  • Any process can communicate with any other
    process
  • A process can simultaneously send and receive one
    message

7
I/O model
  • We will ignore I/O issues, for the most part
  • We will assume that input and output are
    distributed across the processors in a manner of
    our choosing
  • Example: Sorting
  • Input: x1, x2, ..., xn
  • Initially, xi is on processor i
  • Output: xp(1), xp(2), ..., xp(n)
  • xp(i) on processor i
  • xp(i) < xp(i+1)

8
Primitives
  • Reduction
  • Broadcast
  • Gather/Scatter
  • All gather
  • Prefix

9
Reduction -- 1
  • Compute x1 + x2 + ... + xn
[Figure: x2, ..., xn are each sent to one processor, which accumulates the result]
  • Tn = n - 1 + (n - 1)(ts + tb)
  • Sn = 1/(1 + ts + tb)

10
Reduction -- 2
[Figure: Reduction-1 on x1, ..., xn/2 and on xn/2+1, ..., xn in parallel; one combining step at the end]
  • Tn = n/2 - 1 + (n/2 - 1)(ts + tb) + (ts + tb) + 1
  •      = n/2 + n/2 (ts + tb)
  • Sn ≈ 2/(1 + ts + tb)

11
Reduction -- 3
[Figure: the halves are split again, giving four Reduction-1 groups and a binary combining tree]
  • Apply Reduction-2 recursively
  • Divide and conquer
  • Tn = log2 n + (ts + tb) log2 n
  • Sn = (n / log2 n) × 1/(1 + ts + tb)
  • Note that any associative operator can be used in
    place of +
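A serial sketch of the divide-and-conquer reduction (function names are illustrative; in the parallel version the two recursive calls run on disjoint halves of the processors):

```python
import operator

def reduce_dc(xs, op):
    """Divide-and-conquer reduction: log2(n) combine levels for n values."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    left = reduce_dc(xs[:mid], op)   # these two calls would run
    right = reduce_dc(xs[mid:], op)  # concurrently on disjoint processors
    return op(left, right)           # one combining step per tree level

assert reduce_dc(list(range(1, 9)), operator.add) == 36
```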

12
Parallel addition features
  • If n >> P
  • Each processor adds n/P distinct numbers
  • Perform parallel reduction on the P partial sums
  • TP = n/P + (1 + ts + tb) log P
  • Optimal P obtained by differentiating with
    respect to P
  • Popt = n/(1 + ts + tb)
  • If communication cost is high, then fewer
    processors ought to be used
  • E = [1 + (1 + ts + tb) P log P / n]^-1
  • As problem size increases, efficiency
    increases
  • As number of processors increases, efficiency
    decreases
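The optimum can be checked numerically against the cost model (the values of n, ts, and tb below are arbitrary examples):

```python
import math

def T_P(n, P, ts, tb):
    # TP = n/P + (1 + ts + tb) log P, the cost model from the slide
    return n / P + (1 + ts + tb) * math.log2(P)

n, ts, tb = 10**6, 10.0, 1.0
P_opt = n / (1 + ts + tb)  # slide's optimum (constant factors dropped)

# The modeled time near P_opt beats both far fewer and far more processors
assert T_P(n, P_opt, ts, tb) < T_P(n, 10, ts, tb)
assert T_P(n, P_opt, ts, tb) < T_P(n, n, ts, tb)
```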

13
Some common collective operations
14
Broadcast
[Figure: binomial-tree broadcast of x1 from one processor to all P = 8 processors in log P steps]
  • T = (ts + L tb) log P
  • L = length of data
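The log P factor comes from the set of data holders doubling every round, as this small simulation of a binomial-tree broadcast shows (a sketch; process 0 is assumed to be the root):

```python
def broadcast_rounds(P):
    """Count rounds until all P processes hold the data."""
    have = {0}                  # process 0 owns the data initially
    rounds = 0
    while len(have) < P:
        # every holder sends to a partner offset by the current count
        have |= {p + len(have) for p in have if p + len(have) < P}
        rounds += 1
    return rounds

assert broadcast_rounds(8) == 3   # log2(8) rounds, each costing ts + L*tb
```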

15
Gather/Scatter
Note: Σ i=0 to (log P)-1 of 2^i = (2^log P - 1)/(2 - 1) = P - 1 ≈ P
[Figure: gather tree — message sizes double toward the root: L, L, L, L, then 2L, 2L, then 4L]
  • Gather: data move towards the root
  • Scatter: review question
  • T = ts log P + P L tb

16
All gather
[Figure: 8 processors, each initially holding one value x1, ..., x8; neighbor pairs exchange messages of length L]
  • Equivalent to each processor broadcasting to all
    the processors

17
All gather
[Figure: after step 1 each processor holds a block of two values (x12, x34, x56, x78); the next exchange, at distance 2, sends messages of length 2L]
18
All gather
[Figure: after step 2 each processor holds a block of four values (x14 or x58); the final exchange, at distance 4, sends messages of length 4L]
19
All gather
[Figure: after step 3 every processor holds the full data x18]
  • Tn = ts log P + P L tb
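The recursive-doubling steps can be sketched serially as follows (a simulation, assuming P is a power of two):

```python
def all_gather(blocks):
    """Recursive-doubling all-gather: partners at distance 1, 2, 4, ..."""
    P = len(blocks)                       # assumes P is a power of two
    data = [[b] for b in blocks]          # data[i] = block held by process i
    dist = 1
    while dist < P:
        new = []
        for i in range(P):
            lo, hi = sorted((i, i ^ dist))   # XOR pairing at this distance
            new.append(data[lo] + data[hi])  # both partners get the union
        data, dist = new, dist * 2
    return data

result = all_gather(["x1", "x2", "x3", "x4"])
assert all(r == ["x1", "x2", "x3", "x4"] for r in result)
```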

20
Review question Pipelining
  • Useful when repeatedly and regularly performing
    a large number of primitive operations
  • Optimal time for a broadcast: log P
  • But doing this n times takes n log P time
  • Pipelining the broadcasts takes n + P time
  • Almost constant amortized time per broadcast
    if n >> P
  • n + P << n log P when n >> P
  • Review question: How can you accomplish this time
    complexity?

21
Sequential prefix
  • Input
  • Values xi, 1 ≤ i ≤ n
  • Output
  • Xi = x1 + x2 + ... + xi, 1 ≤ i ≤ n
  • + is an associative operator
  • Algorithm
  • X1 = x1
  • for i = 2 to n
  •   Xi = Xi-1 + xi
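In Python, with an arbitrary associative operator passed in:

```python
import operator

def seq_prefix(xs, op):
    """The slide's sequential prefix: X1 = x1; Xi = X(i-1) op xi."""
    X = [xs[0]]
    for x in xs[1:]:
        X.append(op(X[-1], x))
    return X

assert seq_prefix([1, 2, 3, 4], operator.add) == [1, 3, 6, 10]
```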

22
Parallel prefix
  • Input
  • Processor i has xi
  • Output
  • Processor i has
  • Xi = x1 + x2 + ... + xi
  • Define f(a,b) as follows, where Xi is the prefix
    and Ti is the total over the segment
  • if a == b
  • Xi = xi on Proc Pi
  • Ti = xi on Proc Pi
  • else
  • compute in parallel
  • f(a, (a+b)/2)
  • f((a+b)/2 + 1, b)
  • Pi and Pj send Ti and Tj to each other,
    respectively
  • a ≤ i ≤ (a+b)/2
  • j = i + (b - a + 1)/2
  • Ti = Ti + Tj on Pi
  • Xj = Ti + Xj on Pj
  • Tj = Ti + Tj on Pj
  • Divide and conquer
  • f(a,b) yields the following
  • Xi = xa + ... + xi on Proc Pi
  • Ti = xa + ... + xb on Proc Pi
  • a ≤ i ≤ b
  • f(1,n) solves the problem
  • T(n) = T(n/2) + 2(ts + tb), so T(n) = O(log n)
  • An iterative implementation improves the constant

23
Iterative parallel prefix example
Values held after each step (xi:j denotes xi + ... + xj)

Step 0:  x0    x1    x2    x3    x4    x5    x6    x7
Step 1:  x0    x0:1  x1:2  x2:3  x3:4  x4:5  x5:6  x6:7
Step 2:  x0    x0:1  x0:2  x0:3  x1:4  x2:5  x3:6  x4:7
Step 3:  x0    x0:1  x0:2  x0:3  x0:4  x0:5  x0:6  x0:7
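A serial sketch of the iterative scheme (each pass simulates one simultaneous step across all positions):

```python
import operator

def parallel_prefix(xs, op):
    """Iterative prefix: in step k, position i combines with i - 2^k."""
    X = list(xs)
    dist = 1
    while dist < len(X):
        prev = list(X)   # snapshot: all updates happen "simultaneously"
        for i in range(dist, len(X)):
            X[i] = op(prev[i - dist], prev[i])
        dist *= 2
    return X

assert parallel_prefix([1] * 8, operator.add) == [1, 2, 3, 4, 5, 6, 7, 8]
```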
24
Algorithms
  • Linear recurrence
  • Matrix vector multiplication

25
Linear recurrence
  • Determine each xi, 2 ≤ i ≤ n
  • xi = ai xi-1 + bi xi-2
  • x0, x1 are given initial values
  • Sequential solution
  • for i = 2 to n
  •   xi = ai xi-1 + bi xi-2
  • Follows directly from the recurrence
  • This approach is not easily parallelized

26
Linear recurrence in parallel
  • Given xi = ai xi-1 + bi xi-2
  • x2i = a2i x2i-1 + b2i x2i-2
  • x2i+1 = a2i+1 x2i + b2i+1 x2i-1
  • Rewrite this in matrix form: Xi = Ai Xi-1, with

  Xi = [ x2i   ]      Xi-1 = [ x2i-2 ]
       [ x2i+1 ]             [ x2i-1 ]

  Ai = [ b2i          a2i               ]
       [ a2i+1 b2i    b2i+1 + a2i+1 a2i ]

  • Xi = Ai Ai-1 ... A1 X0
  • This is a parallel prefix computation, since
    matrix multiplication is associative
  • Solved in O(log n) time
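A serial check of the matrix formulation, using one-step 2×2 matrices [[ai, bi], [1, 0]] acting on (xi-1, xi-2) — a simpler variant of the slide's paired form. The chained products are exactly what a parallel prefix over matrices would compute:

```python
def matmul2(A, B):
    """2x2 matrix product."""
    return [[sum(A[r][k] * B[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

def solve_recurrence(a, b, x0, x1, n):
    # M_i = [[a_i, b_i], [1, 0]] maps (x_{i-1}, x_{i-2}) to (x_i, x_{i-1});
    # accumulate M_n ... M_2 and apply it to the state vector (x1, x0).
    M = [[1, 0], [0, 1]]
    for i in range(2, n + 1):     # a prefix over these products parallelizes
        M = matmul2([[a[i], b[i]], [1, 0]], M)
    return M[0][0] * x1 + M[0][1] * x0

# cross-check against the direct sequential recurrence
a = {2: 1, 3: 2, 4: 1}
b = {2: 1, 3: 1, 4: 3}
x = {0: 1, 1: 1}
for i in range(2, 5):
    x[i] = a[i] * x[i - 1] + b[i] * x[i - 2]
assert solve_recurrence(a, b, x[0], x[1], 4) == x[4]
```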

27
Matrix-vector multiplication
  • c = A b
  • Often performed repeatedly
  • bi = A bi-1
  • We need the same data distribution for c and b
  • One-dimensional decomposition
  • Example: row-wise block striped for A
  • b and c replicated
  • Each process computes its components of c
    independently
  • Then all-gather the components of c

28
1-D matrix-vector multiplication
c: replicated
A: row-wise block striped
b: replicated
  • Each process computes its components of c
    independently
  • Time: Θ(n²/P)
  • Then all-gather the components of c
  • Time: ts log P + tb n
  • Note: P ≤ n
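A serial sketch of the row-wise scheme for P = 2: each slice plays the role of one process's local computation, and the final concatenation stands in for the all-gather.

```python
def local_matvec(A_rows, b):
    """Each process multiplies its block of rows of A by the replicated b."""
    return [sum(a * x for a, x in zip(row, b)) for row in A_rows]

A = [[1, 2], [3, 4]]
b = [1, 1]
# two "processes" own one row each; concatenation simulates the all-gather
c = local_matvec(A[:1], b) + local_matvec(A[1:], b)
assert c == [3, 7]
```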

29
2-D matrix-vector multiplication
[Figure: A partitioned into a 4×4 grid of blocks Aij; b and c partitioned into blocks B0..B3 and C0..C3]
  • Processes Pi0 send Bi to P0i
  • Time: ts + tb n/√P
  • Processes P0j broadcast Bj to all Pij
  • Time: ts log √P + tb n log √P / √P
  • Processes Pij compute Cij = Aij Bj
  • Time: Θ(n²/P)
  • Processes Pij reduce Cij onto Pi0, 0 ≤ i < √P
  • Time: ts log √P + tb n log √P / √P
  • Total time: Θ(n²/P + ts log P + tb n log P / √P)
  • P ≤ n²
  • More scalable than the 1-dimensional decomposition

30
Important points
  • Efficiency
  • Increases with increase in problem size
  • Decreases with increase in number of processors
  • Aggregation of tasks to increase granularity
  • Reduces communication overhead
  • Data distribution
  • 2-dimensional may be more scalable than
    1-dimensional
  • Has an effect on load balance too
  • General techniques
  • Divide and conquer
  • Pipelining