Introduction to Parallel Programming - PowerPoint PPT Presentation

Slides: 73
Provided by: csVu

Transcript and Presenter's Notes

1
Introduction to Parallel Programming
  • Language notation: message passing
  • Distributed-memory machine (e.g., workstations on
    a network)
  • 5 parallel algorithms of increasing complexity
  • Matrix multiplication
  • Successive overrelaxation
  • All-pairs shortest paths
  • Linear equations
  • Search problem

2
Message Passing
  • SEND (destination, message)
  • blocking: wait until message has arrived (like a
    fax)
  • non-blocking: continue immediately (like a
    mailbox)
  • RECEIVE (source, message)
  • RECEIVE_FROM_ANY (message)
  • blocking: wait until message is available
  • non-blocking: test if message is available
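
The difference between the blocking and non-blocking RECEIVE can be illustrated with a minimal Python sketch. It assumes a thread-safe queue.Queue stands in for the network link; the function names send, receive_blocking, and receive_nonblocking are hypothetical and are not part of the notation used on the slides.

```python
import queue

def send(channel, message):
    """Non-blocking SEND: deposit the message and continue (like a mailbox)."""
    channel.put(message)

def receive_blocking(channel):
    """Blocking RECEIVE: wait until a message is available."""
    return channel.get()

def receive_nonblocking(channel):
    """Non-blocking RECEIVE: test whether a message is available.

    Returns (True, message) if one was waiting, else (False, None).
    """
    try:
        return True, channel.get_nowait()
    except queue.Empty:
        return False, None

channel = queue.Queue()               # assumed stand-in for a network link
print(receive_nonblocking(channel))   # (False, None): nothing sent yet
send(channel, "hello")
print(receive_blocking(channel))      # hello
```

A blocking SEND (the "fax" case) would additionally wait for an acknowledgement from the receiver; that is left out of this sketch.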

3
Syntax
  • Use pseudo-code with C-like syntax
  • Use indentation instead of { .. } to indicate
    block structure
  • Arrays can have user-defined index ranges;
    default: start at 1
  • int A[10:100] runs from 10 to 100
  • int A[N] runs from 1 to N
  • Use array slices (sub-arrays)
  • A[i..j] = elements A[i] to A[j]
  • A[i, *] = elements A[i, 1] to A[i, N], i.e.
    row i of N x N matrix A
  • A[*, k] = elements A[1, k] to A[N, k], i.e.
    column k of A
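
As a rough illustration of the slice notation, the following Python sketch maps the 1-based slices onto 0-based Python lists. The helper names row, column, and elements are assumptions made for this example only.

```python
def row(A, i):
    """A[i, *]: row i of A, with i counted from 1 as on the slide."""
    return list(A[i - 1])

def column(A, k):
    """A[*, k]: column k of A, with k counted from 1."""
    return [r[k - 1] for r in A]

def elements(V, i, j):
    """V[i..j]: elements V[i] to V[j] inclusive, counted from 1."""
    return V[i - 1:j]

A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
print(row(A, 2))                          # [4, 5, 6]
print(column(A, 3))                       # [3, 6, 9]
print(elements([10, 20, 30, 40], 2, 3))   # [20, 30]
```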

4
Parallel Matrix Multiplication
  • Given two N x N matrices A and B
  • Compute C = A x B
  • C[i,j] = A[i,1]*B[1,j] + A[i,2]*B[2,j] + .. + A[i,N]*B[N,j]

[Figure: matrices A, B, and C]
5
Sequential Matrix Multiplication
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
            C[i,j] = 0
            for (k = 1; k <= N; k++)
                C[i,j] += A[i,k] * B[k,j]
  • The order of the operations is overspecified
  • Everything can be computed in parallel

6
Parallel Algorithm 1
  • Each processor computes 1 element of C
  • Requires N^2 processors
  • Each processor needs 1 row of A and 1 column of B
    as input

7
Structure
  • Master distributes the work and receives the
    results
  • Slaves get work and execute it
  • Slaves are numbered consecutively from 1 to P
  • How to start up master/slave processes depends on
    Operating System (not discussed here)

[Figure: the master sends rows A[1,*] .. A[N,*] and columns B[*,1] .. B[*,N] to slaves 1 .. N^2 and receives C[1,1] .. C[N,N] back]
8
Parallel Algorithm 1
Master (processor 0):
    int proc = 1
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
            SEND(proc, A[i,*], B[*,j], i, j); proc++
    for (x = 1; x <= N*N; x++)
        RECEIVE_FROM_ANY(result, i, j)
        C[i,j] = result

Slaves (processors 1 .. P):
    int Aix[N], Bxj[N], Cij
    RECEIVE(0, Aix, Bxj, i, j)
    Cij = 0
    for (k = 1; k <= N; k++)
        Cij += Aix[k] * Bxj[k]
    SEND(0, Cij, i, j)
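
The master/slave structure above can be mimicked on a single machine with a thread pool. This is only a sketch under stated assumptions: threads stand in for separate processors, submitting a job plays the role of SEND, and as_completed plays the role of RECEIVE_FROM_ANY. Indices are 0-based here, unlike the 1-based pseudo-code, and the function names master and slave are chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def slave(a_row, b_col, i, j):
    """Compute one element C[i][j] from row i of A and column j of B."""
    c_ij = 0
    for k in range(len(a_row)):
        c_ij += a_row[k] * b_col[k]
    return c_ij, i, j

def master(A, B):
    """Hand out one job per element of C and collect results in any order."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    with ThreadPoolExecutor() as pool:
        jobs = []
        for i in range(n):
            for j in range(n):
                b_col = [B[k][j] for k in range(n)]
                jobs.append(pool.submit(slave, A[i], b_col, i, j))  # SEND
        for done in as_completed(jobs):                # RECEIVE_FROM_ANY
            result, i, j = done.result()
            C[i][j] = result
    return C

print(master([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
```

Because each job carries its own (i, j), the master can store results in whatever order they arrive.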

9
Efficiency (complexity analysis)
  • Each processor needs O(N) communication to do
    O(N) computations
  • Communication: 2*N + 1 integers = O(N)
  • Computation per processor: N multiplications/additions
    = O(N)
  • Exact communication/computation costs depend on
    network and CPU
  • Still, this algorithm is inefficient for any
    existing machine
  • Need to improve communication/computation ratio

10
Parallel Algorithm 2
  • Each processor computes 1 row (N elements) of C
  • Requires N processors
  • Need entire B matrix and 1 row of A as input

11
Structure
[Figure: the master sends row A[i,*] and the entire matrix B[*,*] to slave i (1 .. N) and receives row C[i,*] back]
12
Parallel Algorithm 2
Master (processor 0):
    for (i = 1; i <= N; i++)
        SEND(i, A[i,*], B[*,*], i)
    for (x = 1; x <= N; x++)
        RECEIVE_FROM_ANY(result, i)
        C[i,*] = result[*]

Slaves:
    int Aix[N], B[N,N], C[N]
    RECEIVE(0, Aix, B, i)
    for (j = 1; j <= N; j++)
        C[j] = 0
        for (k = 1; k <= N; k++)
            C[j] += Aix[k] * B[k,j]
    SEND(0, C[*], i)
13
Problem: need larger granularity
  • Each processor now needs O(N^2) communication and
    O(N^2) computation
  • Still inefficient
  • Assumption: N >> P (i.e. we solve a large
    problem)
  • Assign many rows to each processor

14
Parallel Algorithm 3
  • Each processor computes N/P rows of C
  • Need entire B matrix and N/P rows of A as input
  • Each processor now needs O(N^2) communication and
    O(N^3 / P) computation

15
Parallel Algorithm 3
Master (processor 0):
    int result[N / P, N]
    int inc = N / P     /* number of rows per cpu */
    int lb = 1          /* lb = lower bound */
    for (i = 1; i <= P; i++)
        SEND(i, A[lb .. lb+inc-1, *], B[*,*], lb, lb+inc-1)
        lb += inc
    for (x = 1; x <= P; x++)
        RECEIVE_FROM_ANY(result, lb)
        for (i = 1; i <= N / P; i++)
            C[lb+i-1, *] = result[i, *]

Slaves:
    int A[N / P, N], B[N,N], C[N / P, N]
    RECEIVE(0, A, B, lb, ub)
    for (i = lb; i <= ub; i++)
        for (j = 1; j <= N; j++)
            C[i,j] = 0
            for (k = 1; k <= N; k++)
                C[i,j] += A[i,k] * B[k,j]
    SEND(0, C[*,*], lb)
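
The row-block partitioning can be sketched in Python as follows. The sketch assumes that P divides N and simply calls each "slave" in turn instead of running it on a separate processor, so it shows the data decomposition rather than real parallel execution. Indices are 0-based.

```python
def slave(a_rows, B):
    """Multiply a block of rows of A by the whole of B."""
    n = len(B)
    return [[sum(a_row[k] * B[k][j] for k in range(n)) for j in range(n)]
            for a_row in a_rows]

def master(A, B, P):
    """Split A into P blocks of N/P rows, one block per (simulated) slave."""
    n = len(A)
    assert n % P == 0, "sketch assumes P divides N"
    inc = n // P                             # number of rows per CPU
    C = []
    for proc in range(P):
        lb = proc * inc                      # lower bound (0-based here)
        C.extend(slave(A[lb:lb + inc], B))   # SEND + RECEIVE, done in order
    return C

A = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 4]]
B = [[1, 2, 3, 4]] * 4
print(master(A, B, 2))
```

Each simulated slave receives all of B but only N/P rows of A, which is what moves the computation/communication ratio from O(1) to O(N/P).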
16
Comparison
Algorithm  Parallelism (jobs)  Communication per job  Computation per job  Ratio (comp/comm)
1          N^2                 N + N + 1              N                    O(1)
2          N                   N + N^2 + N            N^2                  O(1)
3          P                   N^2/P + N^2 + N^2/P    N^3/P                O(N/P)
  • If N >> P, algorithm 3 will have low
    communication overhead
  • Its grain size is large
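
To make the ratios concrete, the per-job counts from the comparison can be evaluated for specific sizes. The following sketch is illustrative only: the function name costs and the sizes N = 1000, P = 10 are assumptions, and the counts ignore constant factors and real network behaviour.

```python
def costs(algorithm, N, P):
    """Return (communication, computation) per job, counted in matrix
    elements and multiply-add steps, following the comparison table."""
    if algorithm == 1:
        return N + N + 1, N
    if algorithm == 2:
        return N + N * N + N, N * N
    if algorithm == 3:
        return N * N // P + N * N + N * N // P, N ** 3 // P
    raise ValueError("algorithm must be 1, 2, or 3")

N, P = 1000, 10          # example sizes chosen for illustration only
for alg in (1, 2, 3):
    comm, comp = costs(alg, N, P)
    print(alg, comm, comp, round(comp / comm, 2))
```

For these sizes the ratio stays below or near 1 for algorithms 1 and 2 and rises to roughly 83 for algorithm 3, in line with the O(N/P) entry.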

17
Example speedup graph
18
Discussion
  • Matrix multiplication is trivial to parallelize
  • Getting good performance is a problem
  • Need right grain size
  • Need large input problem

19
Successive Overrelaxation (SOR)
  • Iterative method for solving Laplace equations
  • Repeatedly updates elements of a grid
    float G[1:N, 1:M], Gnew[1:N, 1:M]
    for (step = 0; step < NSTEPS; step++)
        for (i = 2; i < N; i++)     /* update grid */
            for (j = 2; j < M; j++)
                Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1])
        G = Gnew
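
A runnable version of this loop might look like the sketch below. The slides leave the update function f unspecified, so the neighbour average used here is an assumption (a true SOR step would also blend in the old value with a relaxation factor). Indices are 0-based and boundary cells are kept fixed.

```python
def f(center, up, down, left, right):
    """Assumed update rule: average of the four neighbours.
    The slides do not define f; `center` is unused in this variant."""
    return (up + down + left + right) / 4.0

def sor(G, nsteps):
    """Apply nsteps grid updates and return the final grid."""
    n, m = len(G), len(G[0])
    for _ in range(nsteps):
        Gnew = [row[:] for row in G]          # boundary values are copied
        for i in range(1, n - 1):             # interior rows
            for j in range(1, m - 1):         # interior columns
                Gnew[i][j] = f(G[i][j], G[i-1][j], G[i+1][j],
                               G[i][j-1], G[i][j+1])
        G = Gnew
    return G

grid = [[1.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 1.0]]
print(sor(grid, 1)[1][1])   # 1.0: the centre takes the boundary average
```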

20
SOR example
21
SOR example
22
Parallelizing SOR
  • Domain decomposition on the grid
  • Each processor owns N/P rows
  • Need communication between neighbors to exchange
    elements at processor boundaries

23
SOR example partitioning
24
SOR example partitioning
25
Communication scheme
  • Each CPU communicates with its left and right
    neighbors (if existing)

26
Parallel SOR
    float G[lb-1:ub+1, 1:M], Gnew[lb-1:ub+1, 1:M]
    for (step = 0; step < NSTEPS; step++)
        SEND(cpuid-1, G[lb])        /* send 1st row left */
        SEND(cpuid+1, G[ub])        /* send last row right */
        RECEIVE(cpuid-1, G[lb-1])   /* receive from left */
        RECEIVE(cpuid+1, G[ub+1])   /* receive from right */
        for (i = lb; i <= ub; i++)  /* update my rows */
            for (j = 2; j < M; j++)
                Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1])
        G = Gnew
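
The boundary-row exchange can be checked with a small single-process simulation. In this sketch each "CPU" is just a block of rows in a Python list, "receiving" a neighbour row is a list lookup, and f is an assumed neighbour average because the slides do not define it. All function names are hypothetical.

```python
def f(center, up, down, left, right):
    """Assumed update rule (not given on the slides): neighbour average."""
    return (up + down + left + right) / 4.0

def sequential_step(G):
    n, m = len(G), len(G[0])
    Gnew = [row[:] for row in G]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            Gnew[i][j] = f(G[i][j], G[i-1][j], G[i+1][j], G[i][j-1], G[i][j+1])
    return Gnew

def parallel_step(blocks):
    """One update over a list of row blocks, one block per simulated CPU."""
    P = len(blocks)
    new_blocks = []
    for cpu, block in enumerate(blocks):
        # Rows "received" from the left and right neighbours, if they exist.
        above = blocks[cpu - 1][-1] if cpu > 0 else None
        below = blocks[cpu + 1][0] if cpu < P - 1 else None
        local = block[:]
        if above is not None:
            local = [above] + local
        if below is not None:
            local = local + [below]
        offset = 1 if above is not None else 0
        m = len(block[0])
        new_block = [row[:] for row in block]
        for r in range(len(block)):
            i = r + offset
            if i == 0 or i == len(local) - 1:
                continue              # global boundary row: left unchanged
            for j in range(1, m - 1):
                new_block[r][j] = f(local[i][j], local[i-1][j], local[i+1][j],
                                    local[i][j-1], local[i][j+1])
        new_blocks.append(new_block)
    return new_blocks

G = [[float(i * 4 + j) ** 2 for j in range(4)] for i in range(6)]
blocks = [G[0:2], G[2:4], G[4:6]]          # 3 simulated CPUs, 2 rows each
stitched = [row for block in parallel_step(blocks) for row in block]
print(stitched == sequential_step(G))      # True
```

Only one row per neighbour crosses a partition boundary in each step, which is why the communication volume is independent of the number of rows a CPU owns.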

27
Performance of SOR
  • Communication and computation during each
    iteration
  • Each processor sends/receives 2 messages with M
    reals
  • Each processor computes N/P * M updates
  • The algorithm will have good performance if
  • Problem size is large: N >> P
  • Message exchanges can be done in parallel

28
All-pairs Shortest Paths (ASP)
  • Given a graph G with a distance table C
  • C[i,j] = length of direct path from node
    i to node j
  • Compute length of shortest path between any two
    nodes in G

29
Floyd's Sequential Algorithm
  • Basic step:
    for (k = 1; k <= N; k++)
        for (i = 1; i <= N; i++)
            for (j = 1; j <= N; j++)
                C[i,j] = MIN(C[i,j], C[i,k] + C[k,j])
  • During iteration k, you can visit only
    intermediate nodes in the set {1 .. k}
  • k=0 => initial problem, no intermediate nodes
  • k=N => final solution
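
For reference, the same triple loop in runnable Python. This is a minimal sketch with 0-based indices; the three-node example graph and the use of float("inf") for "no direct path" are assumptions made for illustration.

```python
INF = float("inf")

def floyd(C):
    """All-pairs shortest paths; C[i][j] is the direct distance (INF if none)."""
    n = len(C)
    D = [row[:] for row in C]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                D[i][j] = min(D[i][j], D[i][k] + D[k][j])
    return D

C = [[0, 4, INF],
     [INF, 0, 1],
     [2, INF, 0]]
print(floyd(C))   # [[0, 4, 5], [3, 0, 1], [2, 6, 0]]
```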

30
Parallelizing ASP
  • Distribute rows of C among the P processors
  • During iteration k, each processor executes
        C[i,j] = MIN(C[i,j], C[i,k] + C[k,j])
    on its own rows i, so it needs these rows and
    row k
  • Before iteration k, the processor owning row k
    sends it to all the others

[Figure: matrix C with row i, pivot row k, and columns j and k marked]
33
Parallel ASP Algorithm
    int lb, ub                  /* lower/upper bound for this CPU */
    int rowK[N], C[lb:ub, N]    /* pivot row; matrix */
    for (k = 1; k <= N; k++)
        if (k >= lb && k <= ub)     /* do I have it? */
            rowK = C[k,*]
            for (proc = 1; proc <= P; proc++)   /* broadcast row */
                if (proc != myprocid) SEND(proc, rowK)
        else
            RECEIVE_FROM_ANY(rowK)      /* receive row */
        for (i = lb; i <= ub; i++)      /* update my rows */
            for (j = 1; j <= N; j++)
                C[i,j] = MIN(C[i,j], C[i,k] + rowK[j])
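
The row distribution and per-iteration broadcast can be simulated in one process. In this sketch each simulated CPU owns a contiguous block of rows, and the "broadcast" is simply a copy of row k that every block reads, so message ordering problems cannot occur here. It assumes P divides N and that there are no negative cycles; the function name parallel_asp is hypothetical.

```python
INF = float("inf")

def parallel_asp(C, P):
    """Row-distributed Floyd: every simulated CPU updates only its own rows."""
    n = len(C)
    assert n % P == 0, "sketch assumes P divides N"
    rows_per_cpu = n // P
    # Each simulated CPU owns a contiguous block of rows of C.
    local = [[row[:] for row in C[p * rows_per_cpu:(p + 1) * rows_per_cpu]]
             for p in range(P)]
    for k in range(n):
        owner = k // rows_per_cpu
        row_k = local[owner][k % rows_per_cpu][:]   # "broadcast" copy of row k
        for p in range(P):
            for row in local[p]:                    # update my rows
                for j in range(n):
                    row[j] = min(row[j], row[k] + row_k[j])
    return [row for block in local for row in block]

C = [[0, 4, INF],
     [INF, 0, 1],
     [2, INF, 0]]
print(parallel_asp(C, 3))   # [[0, 4, 5], [3, 0, 1], [2, 6, 0]]
```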

36
Performance Analysis ASP
  • Per iteration
  • 1 CPU sends P -1 messages with N integers
  • Each CPU does N/P x N comparisons
  • Communication/computation ratio is small if N
    >> P

37
  • ... but, is the Algorithm Correct?

38
Parallel ASP Algorithm
    int lb, ub
    int rowK[N], C[lb:ub, N]
    for (k = 1; k <= N; k++)
        if (k >= lb && k <= ub)
            rowK = C[k,*]
            for (proc = 1; proc <= P; proc++)
                if (proc != myprocid) SEND(proc, rowK)
        else
            RECEIVE_FROM_ANY(rowK)
        for (i = lb; i <= ub; i++)
            for (j = 1; j <= N; j++)
                C[i,j] = MIN(C[i,j], C[i,k] + rowK[j])

39
Non-FIFO Message Ordering
  • Row 2 may be received before row 1

40
FIFO Ordering
  • Row 5 may be received before row 4

46
Correctness
  • Problems:
  • Asynchronous non-FIFO SEND
  • Messages from different senders may overtake
    each other
  • Solutions:
  • Synchronous SEND (less efficient)
  • Barrier at the end of outer loop (extra
    communication)
  • Order incoming messages (requires buffering)
  • Use RECEIVE (cpu, msg) to receive from the owner
    of row k (more complicated)
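
The "order incoming messages" option could be sketched as a small reordering buffer. This assumes every message is tagged with its iteration number k, which the pseudo-code on the slides does not do; the class name OrderedRowBuffer is likewise hypothetical.

```python
class OrderedRowBuffer:
    """Deliver pivot rows in order of k even if they arrive out of order."""

    def __init__(self):
        self.pending = {}      # rows that arrived early, keyed by k
        self.next_k = 1        # next iteration number to deliver

    def receive(self, k, row):
        """Store an incoming (k, row) message; return rows now deliverable."""
        self.pending[k] = row
        ready = []
        while self.next_k in self.pending:
            ready.append((self.next_k, self.pending.pop(self.next_k)))
            self.next_k += 1
        return ready

buf = OrderedRowBuffer()
print(buf.receive(2, "row 2"))   # []: row 2 overtook row 1, so it is held back
print(buf.receive(1, "row 1"))   # [(1, 'row 1'), (2, 'row 2')]
```

The cost named on the slide is visible here: early rows must be kept in memory until all of their predecessors have arrived.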

47
Introduction to Parallel Programming
  • Language notation: message passing
  • Distributed-memory machine (e.g., workstations on
    a network)
  • 5 parallel algorithms of increasing complexity
  • Matrix multiplication
  • Successive overrelaxation
  • All-pairs shortest paths
  • Linear equations
  • Search problem

48
Linear equations
  • Linear equations
        a[1,1]*x[1] + a[1,2]*x[2] + ... + a[1,n]*x[n] = b[1]
        ...
        a[n,1]*x[1] + a[n,2]*x[2] + ... + a[n,n]*x[n] = b[n]
  • Matrix notation: Ax = b
  • Problem: compute x, given A and b
  • Linear equations have many important applications
  • Practical applications need huge sets of
    equations

49
Solving a linear equation
  • Two phases
  • Upper-triangularization -> U x = y
  • Back-substitution -> x
  • Most computation time is in upper-triangularization
  • Upper-triangular matrix
  • U[i,i] = 1
  • U[i,j] = 0 if i > j

1 . . . . . . .
0 1 . . . . . .
0 0 1 . . . . .
0 0 0 1 . . . .
0 0 0 0 1 . . .
0 0 0 0 0 1 . .
0 0 0 0 0 0 1 .
0 0 0 0 0 0 0 1
50
Sequential Gaussian elimination
    for (k = 1; k <= N; k++)
        for (j = k+1; j <= N; j++)      /* normalize row k */
            A[k,j] = A[k,j] / A[k,k]
        y[k] = b[k] / A[k,k]
        A[k,k] = 1
        for (i = k+1; i <= N; i++)      /* subtract from rows below */
            for (j = k+1; j <= N; j++)
                A[i,j] = A[i,j] - A[i,k] * A[k,j]
            b[i] = b[i] - A[i,k] * y[k]
            A[i,k] = 0
  • Converts Ax = b into Ux = y
  • Sequential algorithm uses 2/3 N^3 operations

[Figure: in iteration k, row k of A is normalized and then subtracted from the rows below it; y is updated alongside A]
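
Both phases can be written out as a short Python sketch. Like the pseudo-code, it performs no pivoting and therefore assumes every diagonal element A[k][k] is nonzero when it is used; the back-substitution loop is added here because the slides name that phase without spelling it out. Indices are 0-based.

```python
def gaussian_elimination(A, b):
    """Reduce Ax = b to Ux = y with unit diagonal, then back-substitute.
    No pivoting; assumes every A[k][k] is nonzero when it is used."""
    n = len(A)
    A = [row[:] for row in A]              # work on copies
    b = b[:]
    y = [0.0] * n
    for k in range(n):
        for j in range(k + 1, n):          # normalize row k
            A[k][j] = A[k][j] / A[k][k]
        y[k] = b[k] / A[k][k]
        A[k][k] = 1.0
        for i in range(k + 1, n):          # subtract from the rows below
            for j in range(k + 1, n):
                A[i][j] = A[i][j] - A[i][k] * A[k][j]
            b[i] = b[i] - A[i][k] * y[k]
            A[i][k] = 0.0
    x = [0.0] * n
    for i in range(n - 1, -1, -1):         # back-substitution
        x[i] = y[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))
    return x

# 2*x1 + x2 = 4 and x1 + 3*x2 = 7 have the solution x1 = 1, x2 = 2.
print(gaussian_elimination([[2.0, 1.0], [1.0, 3.0]], [4.0, 7.0]))
```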
51
Parallelizing Gaussian elimination
  • Row-wise partitioning scheme
  • Each CPU gets one row (striping)
  • Execute one (outer-loop) iteration at a time
  • Communication requirement:
  • During iteration k, CPUs P[k+1] .. P[n-1] need part
    of row k
  • This row is stored on CPU P[k]
  • -> need partial broadcast (multicast)

52
Communication
53
Performance problems
  • Communication overhead (multicast)
  • Load imbalance
  • CPUs P[0] .. P[k] are idle during iteration k
  • Bad load balance means bad speedups, as some
    CPUs have too much work and others too little
  • In general, number of CPUs is less than n
  • Choice between block-striped and
    cyclic-striped distribution
  • Block-striped distribution has high
    load-imbalance
  • Cyclic-striped distribution has less
    load-imbalance

54
Block-striped distribution
  • CPU 0 gets first N/2 rows
  • CPU 1 gets last N/2 rows
  • CPU 0 has much less work to do
  • CPU 1 becomes the bottleneck

55
Cyclic-striped distribution
  • CPU 0 gets odd rows (1, 3, ...)
  • CPU 1 gets even rows (2, 4, ...)
  • CPU 0 and 1 have more or less the same amount of
    work
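
The difference between the two distributions can be quantified by counting inner-loop updates per CPU. The sketch below is a simplified model: it counts only the "subtract" updates of Gaussian elimination, ignores the normalization step and all communication, and uses N = 8 and P = 2 as arbitrary example sizes. The function names are hypothetical.

```python
def work_per_cpu(N, P, owner):
    """Count inner-loop updates each CPU performs during elimination,
    where owner(i) gives the CPU holding row i (rows numbered 1..N)."""
    work = [0] * P
    for k in range(1, N + 1):
        for i in range(k + 1, N + 1):      # rows below the pivot are updated
            work[owner(i)] += N - k        # one update per column j > k
    return work

def block_owner(N, P):
    return lambda i: (i - 1) * P // N      # first N/P rows on CPU 0, etc.

def cyclic_owner(P):
    return lambda i: (i - 1) % P           # rows dealt out round-robin

N, P = 8, 2
print(work_per_cpu(N, P, block_owner(N, P)))   # [38, 102]
print(work_per_cpu(N, P, cyclic_owner(P)))     # [62, 78]
```

Under this model the block-striped split leaves CPU 1 with almost three times the work of CPU 0, while the cyclic split narrows the gap considerably.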

56
A Search Problem
  • Given an array A[1..N] and an item x, check if x
    is present in A
    int present = false
    for (i = 1; !present && i <= N; i++)
        if (A[i] == x) present = true
  • Don't know in advance which data we need to access

57
Parallel Search on 2 CPUs
    int lb, ub
    int A[lb:ub]
    for (i = lb; i <= ub; i++)
        if (A[i] == x)
            print("Found item")
            SEND(1-cpuid)       /* send other CPU empty message */
            exit()
        /* check message from other CPU */
        if (NONBLOCKING_RECEIVE(1-cpuid)) exit()

64
Performance Analysis
  • How much faster is the parallel program than the
    sequential program for N=100?
  • 1. if x not present => factor 2
  • 2. if x present in A[1 .. 50] => factor 1
  • 3. if A[51] == x => factor 51
  • 4. if A[75] == x => factor 3
  • In case 2 the parallel program does more work
    than the sequential program => search overhead
  • In cases 3 and 4 the parallel program does less
    work => negative search overhead
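
The four factors can be reproduced by counting comparisons. This sketch assumes each comparison takes one time unit, that the "found" message reaches the other CPU instantly, and that CPU 0 scans A[1 .. N/2] while CPU 1 scans the rest; the function names are chosen for this example.

```python
def sequential_steps(N, pos):
    """Comparisons made by the sequential search; pos is None if x is absent."""
    return N if pos is None else pos

def parallel_steps(N, pos):
    """Comparisons made until the 2-CPU search stops, each CPU scanning
    one half of A. Assumes the 'found' message arrives instantly."""
    half = N // 2
    if pos is None:
        return half
    return pos if pos <= half else pos - half

def speedup(N, pos):
    return sequential_steps(N, pos) / parallel_steps(N, pos)

for pos in (None, 30, 51, 75):
    print(pos, speedup(100, pos))   # factors 2.0, 1.0, 51.0, 3.0
```

The factor of 51 in case 3 exceeds the number of CPUs because the sequential program has to scan 50 elements that the second CPU never looks at.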

65
Discussion
  • Several kinds of performance overhead
  • Communication overhead: communication/computation
    ratio must be low
  • Load imbalance: all processors must do same
    amount of work
  • Search overhead: avoid useless (speculative)
    computations
  • Making algorithms correct is nontrivial
  • Message ordering

66
Designing Parallel Algorithms
  • Source: Designing and Building Parallel Programs
    (Ian Foster, 1995)
  • (available on-line at http://www.mcs.anl.gov/dbpp)
  • Partitioning
  • Communication
  • Agglomeration
  • Mapping

67
Figure 2.1 from Foster's book
68
Partitioning
  • Domain decomposition
  • Partition the data
  • Partition computations on data (owner-computes
    rule)
  • Functional decomposition
  • Divide computations into subtasks
  • E.g. search algorithms

69
Communication
  • Analyze data-dependencies between partitions
  • Use communication to transfer data
  • Many forms of communication, e.g.
  • Local communication with neighbors (SOR)
  • Global communication with all processors (ASP)
  • Synchronous (blocking) communication
  • Asynchronous (non blocking) communication

70
Agglomeration
  • Reduce communication overhead by
  • increasing granularity
  • improving locality

71
Mapping
  • On which processor to execute each subtask?
  • Put concurrent tasks on different CPUs
  • Put frequently communicating tasks on same CPU?
  • Avoid load imbalances

72
Summary
  • Hardware and software models
  • Example applications
  • Matrix multiplication - Trivial parallelism
    (independent tasks)
  • Successive overrelaxation - Neighbor
    communication
  • All-pairs shortest paths - Broadcast
    communication
  • Linear equations - Load balancing problem
  • Search problem - Search overhead
  • Designing parallel algorithms