Chapter 4, CLR Textbook - PowerPoint PPT Presentation

Description:

... and simulation of complex cellular automata (e.g., ... Parallel Programming in C with MPI and OpenMP Author: jbaker Last modified by: jbaker Created Date: – PowerPoint PPT presentation

Slides: 47
Provided by: JBa999
Learn more at: https://www.cs.kent.edu

Transcript and Presenter's Notes


1
Chapter 4, CLR Textbook
  • Algorithms on Rings of Processors

2
Algorithms on Rings of Processors
  • When using message passing, it is common to
    abstract away from the physical network and to
    choose a convenient logical network instead.
  • This chapter presents several algorithms
    intended for the logical ring network studied
    earlier
  • Coverage of how logical networks map onto
    physical networks is deferred to Sections 4.6
    and 4.7.
  • Rings are linear interconnection networks.
  • Ideal for a first look at distributed memory
    algorithms
  • Each processor has a single predecessor and
    successor

3
Matrix-Vector Multiplication
  • The first unidirectional ring algorithm will be
    the multiplication y = Ax of an n×n matrix A by a
    vector x of dimension n.
  • for i = 0 to n-1 do
  •   y[i] ← 0
  •   for j = 0 to n-1 do
  •     y[i] ← y[i] + A[i,j] × x[j]
  • Each iteration of the outer (i.e., i) loop
    computes the scalar product of one row of A by
    vector x.
  • These scalar products can be performed in any
    order.
  • These scalar products will be distributed among
    the processors so these can be done in parallel.
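
For reference, the sequential loop above can be sketched in Python. This is a minimal sketch using plain lists; the function name `matvec` is chosen here for illustration and does not come from the slides.

```python
def matvec(A, x):
    """Sequential y = A x for an n x n matrix A (list of rows) and a vector x."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):          # one scalar product per row of A
        for j in range(n):
            y[i] += A[i][j] * x[j]
    return y


print(matvec([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0]))  # [3.0, 7.0]
```

Each value of i is independent of the others, which is what allows the rows to be distributed among processors.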

4
Matrix-Vector Multiplication (cont.)
  • We assume that n is divisible by p and let
    r = n/p.
  • Each processor must store r contiguous rows of
    matrix A and r scalar products.
  • This is called a block row.
  • The corresponding r components of the vector y
    and x are also stored with each processor.
  • Each processor Pq will then store
  • Rows qr to (q+1)r - 1 of matrix A, a block of
    dimension r×n
  • Components qr to (q+1)r - 1 of vectors x and y.
  • For simplicity, we will ignore the case where n
    is not divisible by p.
  • However, this case can be handled by temporarily
    adding rows of zeros to matrix A and zeros to
    vector x so that the resulting number of rows is
    divisible by p.
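
The block-row layout and the zero-padding trick can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, and adding zero columns to A (so that it stays square and matches the padded x) is an assumption made here, since the slide only mentions rows of zeros.

```python
def pad_to_multiple(A, x, p):
    """Pad A with zero rows/columns and x with zeros so the size is divisible by p."""
    n = len(x)
    m = -(-n // p) * p                      # smallest multiple of p that is >= n
    A_pad = [row + [0.0] * (m - n) for row in A]
    A_pad += [[0.0] * m for _ in range(m - n)]
    x_pad = x + [0.0] * (m - n)
    return A_pad, x_pad


def block_row_range(q, n, p):
    """Global row indices qr .. (q+1)r - 1 stored by processor Pq, with r = n/p."""
    r = n // p
    return range(q * r, (q + 1) * r)
```

The padded entries contribute only zeros to the result, so the first n components of y are unchanged.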

5
Matrix-Vector Multiplication (cont.)
  • Declarations needed:
  • var A: array[0..r-1, 0..n-1] of real
  • var x, y: array[0..r-1] of real
  • Then A[0,0] on P0 corresponds to A0,0, but on P1
    it corresponds to Ar,0.
  • Note that the subscripts are global while the
    array indices are local.
  • Also, note that global index (i,j) corresponds to
    local index (i - ⌊i/r⌋r, j) on processor Pk, where
    k = ⌊i/r⌋.
  • The next figure illustrates how the rows and
    vectors are partitioned among the processors.
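
The global/local index correspondence can be written down directly. A minimal sketch, with hypothetical function names, assuming the block-row distribution with r = n/p rows per processor:

```python
def global_to_local(i, j, r):
    """Map global index (i, j) to (owner k, local index), with k = floor(i / r)."""
    k = i // r
    return k, (i - k * r, j)


def local_to_global(q, i_local, j, r):
    """Map local index (i_local, j) on processor Pq back to the global index."""
    return (q * r + i_local, j)
```

For example, with r = 4, global row 5 lives on P1 as local row 1.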

6
(No Transcript)
7
Matrix-Vector Multiplication (cont.)
  • The partitioning of the data makes it possible to
    solve larger problems in parallel.
  • The parallel algorithm can solve a problem
    roughly p times larger than the sequential
    algorithm.
  • Algorithm 4.1 is given on the next slide.
  • In each pass through the loop in Algorithm 4.1,
    each processor Pq computes the product of an r×r
    submatrix of its block row with a vector of
    size r.
  • This is a partial result.
  • The values of the components of y assigned to Pq
    are obtained by adding all of these partial
    results together.

8
(No Transcript)
9
Matrix-Vector Multiplication (cont.)
  • In the first pass through the loop, the
    x-components in Pq are the ones originally
    assigned.
  • During each pass through the loop, Pq computes
    the product of the appropriate part of the
    qth block row of A with Pq's current components
    of x.
  • Concurrently, each Pq sends its block of x-values
    to Pq+1 (mod p) and receives a new block of
    x-values from Pq-1 (mod p).
  • At the conclusion, each Pq has its original block
    of x-values, but has calculated the correct
    values for its y-components.
  • These steps are illustrated in Figure 4.2.
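
The passes described above can be simulated sequentially. The sketch below is not Algorithm 4.1 itself (which is not reproduced in this transcript); it is a single-process simulation under the stated block-row distribution, with the hypothetical name `ring_matvec`, in which the circular shift of x blocks is modeled by rotating a Python list.

```python
def ring_matvec(A, x, p):
    """Simulate the ring algorithm for y = A x on p processors (n divisible by p)."""
    n = len(x)
    r = n // p
    # Local data: processor q holds block row q of A and block q of x and y.
    A_loc = [A[q * r:(q + 1) * r] for q in range(p)]
    x_loc = [x[q * r:(q + 1) * r] for q in range(p)]
    y_loc = [[0.0] * r for _ in range(p)]

    for step in range(p):
        for q in range(p):
            # The x block now held by Pq started on processor (q - step) mod p.
            src = (q - step) % p
            for i in range(r):
                for j in range(r):
                    y_loc[q][i] += A_loc[q][i][src * r + j] * x_loc[q][j]
        # Circular shift: Pq sends to P(q+1) mod p, receives from P(q-1) mod p.
        x_loc = [x_loc[(q - 1) % p] for q in range(p)]

    y = [v for block in y_loc for v in block]
    return y, x_loc
```

After p shifts every processor holds its original block of x again, which matches the statement on this slide.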

10
(No Transcript)
11
Analysis of Matrix-Vector Multiplication
  • There are p identical steps
  • Each step involves three activities: compute,
    send, and receive.
  • The send and receive times are identical and the
    two operations are concurrent, so the execution
    time is
  • $T(p) = p \cdot \max\{r^2 w,\; L + r b\}$
  • where w is the computation time for multiplying
    a vector component by a matrix component and
    adding two products,
  • b is the inverse of the bandwidth, and L is the
    communications startup cost.
  • Since r = n/p, the computation cost becomes
    asymptotically larger than the communication cost
    as n increases, since
  • $r^2 w = n^2 w / p^2 \gg L + n b / p = L + r b$
    (for n large)

12
Matrix-Vector Multiplication Analysis (cont)
  • Next, we calculate various metrics and their
    complexity
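  • For large n,
  • $T(p) = p\,(r^2 w) = n^2 w / p$, or $O(n^2/p)$, which is $O(n^2)$ if p is
    constant
  • The cost is $(n^2 w / p) \cdot p = n^2 w$, or $O(n^2)$
  • The speedup is $t_s / T(p) = c n^2 \cdot p / (n^2 w) = (c/w)\,p$,
    or $O(p)$, where $t_s = c n^2$ is the sequential time
  • However, if p is constant/small, the speedup is
    only $O(1)$
  • The efficiency is $t_s / \text{cost} = c n^2 / (n^2 w) = c/w$, or
    $O(1)$
  • Note: efficiency $= t_s / (p \cdot t_p) = O(1)$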
  • Note that if vector x were duplicated across all
    processors, then there would be no need for any
    communication and parallel efficiency would be
    O(1) for all values of n.
  • However, there would be an increased memory cost
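
The cost model from the analysis, T(p) = p · max(r²w, L + rb) with r = n/p, is easy to evaluate numerically. A small sketch follows; the function names and the default c = w are assumptions made here for illustration, and any parameter values plugged in are made up rather than measured.

```python
def ring_matvec_time(n, p, w, b, L):
    """Modeled T(p) = p * max(r^2 w, L + r b), with r = n / p."""
    r = n / p
    return p * max(r * r * w, L + r * b)


def speedup_and_efficiency(n, p, w, b, L, c=None):
    """Speedup t_s / T(p) and efficiency speedup / p, with t_s = c n^2 (c defaults to w)."""
    c = w if c is None else c
    t_seq = c * n * n
    t_par = ring_matvec_time(n, p, w, b, L)
    speedup = t_seq / t_par
    return speedup, speedup / p
```

When r²w dominates L + rb the model gives speedup (c/w)·p; when communication dominates, the efficiency falls below c/w.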

13
Matrix-Matrix Multiplication
  • Using matrix-vector multiplication, this is easy.
  • Let C = A×B, where all are n×n matrices
  • The multiplication consists of computing $n^2$
    scalar products
  • for i = 0 to n-1 do
  •   for j = 0 to n-1 do
  •     C[i,j] ← 0
  •     for k = 0 to n-1 do
  •       C[i,j] ← C[i,j] + A[i,k] × B[k,j]
  • We will distribute the matrices over the p
    processors, giving the first processor the
    first r = n/p rows, etc.
  • Declaration:
  • var A, B, C: array[0..r-1, 0..n-1] of real

14
(No Transcript)
15
Matrix-Matrix Multiplication Analysis
  • This algorithm is very similar to the one for
    matrix-vector multiplication
  • Scalar products are replaced by sub-matrix
    multiplication
  • Circular shifting of a vector is replaced by
    circular shifting of matrix rows
  • Analysis
  • Each step lasts as long as the longest of the
    three activities performed during the step:
    compute, send, and receive.
  • $T(p) = p \cdot \max\{n r^2 w,\; L + n r b\}$
  • As before, the asymptotic parallel efficiency is
    1 when n is large.
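
The same ring pattern, with block rows of B shifted instead of blocks of x, can be simulated in one process. This is a hedged sketch, not the algorithm from the (untranscribed) slide; the name `ring_matmul` is hypothetical and n is assumed divisible by p.

```python
def ring_matmul(A, B, p):
    """Simulate ring-based C = A B; each processor holds r = n/p rows of A, B, C."""
    n = len(A)
    r = n // p
    A_loc = [A[q * r:(q + 1) * r] for q in range(p)]
    B_loc = [B[q * r:(q + 1) * r] for q in range(p)]
    C_loc = [[[0.0] * n for _ in range(r)] for _ in range(p)]

    for step in range(p):
        for q in range(p):
            # The block row of B now on Pq started on processor (q - step) mod p.
            src = (q - step) % p
            for i in range(r):
                for k in range(r):
                    a = A_loc[q][i][src * r + k]
                    for j in range(n):
                        C_loc[q][i][j] += a * B_loc[q][k][j]
        # Circular shift of the r x n block rows of B around the ring.
        B_loc = [B_loc[(q - 1) % p] for q in range(p)]

    return [row for block in C_loc for row in block]
```

Each step performs r·r·n multiply-adds per processor and moves an r×n block, which is where the terms nr²w and L + nrb in T(p) come from.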

16
Matrix-Matrix Multiplication Analysis
  • Naïve algorithm: matrix-matrix multiplication
    could be achieved by executing matrix-vector
    multiplication n times
  • Analysis of the naïve algorithm
  • Execution time is just the time for matrix-vector
    multiplication, multiplied by n.
  • $T'(p) = p \cdot \max\{n r^2 w,\; nL + n r b\}$
  • The only difference between T and T' is that the
    term L has become nL
  • In the naïve approach, processors exchange
    vectors of size r, while in the algorithm
    developed in this section, they exchange matrices
    of size r×n
  • This does not change asymptotic efficiency
  • However, sending data in bulk can significantly
    reduce the communications overhead.
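
To see the effect of the nL term, the two models T(p) = p · max(nr²w, L + nrb) and T'(p) = p · max(nr²w, nL + nrb) can be compared numerically. A minimal sketch with hypothetical function names; the parameter values in the usage note are invented for illustration.

```python
def bulk_time(n, p, w, b, L):
    """T(p) for the ring algorithm that ships r x n blocks: one startup per step."""
    r = n / p
    return p * max(n * r * r * w, L + n * r * b)


def naive_time(n, p, w, b, L):
    """T'(p) for n repeated matrix-vector products: n startups per step."""
    r = n / p
    return p * max(n * r * r * w, n * L + n * r * b)
```

With n = 100, p = 50, w = b = 1 and L = 1000, the model gives 60000 time units for the bulk version and 5010000 for the naïve one; the gap comes entirely from startup costs.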

17
Stencil Applications
  • Stencil applications are popular applications
    that operate on a discrete domain that consists
    of cells.
  • Each cell holds some value(s) and has neighbor
    cells.
  • The application applies pre-defined rules to
    update the value(s) of a cell using the values of
    the neighbor cells.
  • The location of the neighbor cells and the
    function used to update cell values constitute a
    stencil that is applied to all cells in the
    domain.
  • These types of applications arise in many areas
    of science and engineering.
  • Examples include image processing, approximate
    solutions to differential equations, and
    simulation of complex cellular automata (e.g.,
    Conway's Game of Life).

18
A Simple Sequential Algorithm
  • We consider a stencil application on a 2D domain
    of size n×n.
  • Each cell has 8 neighbors, as shown below:
  • NW N NE
  • W c E
  • SW S SE
  • The algorithm we consider updates the value of
    cell c based on the already updated values of
    its West and North neighbors.
  • The stencil is shown on the next slide and can be
    formalized as
  • c_new ← UPDATE(c_old, W_new, N_new)

19
(No Transcript)
20
A Simple Sequential Algorithm (cont)
  • This simple stencil is similar to important
    applications:
  • The Gauss-Seidel numerical method
  • The Smith-Waterman biological string comparison
    algorithm
  • This stencil cannot be applied as is to cells in
    the top row or left column, which lack a North or
    West neighbor.
  • These cells are handled by the update function.
  • To indicate that no neighbor exists for a cell
    update, we pass a Nil argument to UPDATE.
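
A sequential version of this stencil can be sketched as below. The sweep order (row by row, left to right) guarantees that the West and North values are already updated. `None` plays the role of Nil, and `example_update` is a made-up update rule used only to exercise the code; the slides do not specify one.

```python
def sequential_stencil(A, update):
    """Apply c_new = update(c_old, W_new, N_new) over an n x n domain, in place."""
    n = len(A)
    for i in range(n):
        for j in range(n):
            west = A[i][j - 1] if j > 0 else None    # already updated, or Nil
            north = A[i - 1][j] if i > 0 else None   # already updated, or Nil
            A[i][j] = update(A[i][j], west, north)
    return A


def example_update(c_old, w_new, n_new):
    """Illustrative rule: add the neighbors that exist to the old value."""
    return c_old + (w_new or 0) + (n_new or 0)


print(sequential_stencil([[1] * 3 for _ in range(3)], example_update))
# [[1, 2, 3], [2, 5, 9], [3, 9, 19]]
```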

21
Greedy Parallel Algorithm for Stencil
  • Consider a ring of p processors, P0, P1, ...,
    Pp-1.
  • We must decide how to allocate cells among
    processors.
  • We need to balance the computational load without
    creating overly expensive communications.
  • Assume initially that p is equal to n.
  • We will allocate row i of domain A to the ith
    processor, Pi.
  • Declaration needed: var A: array[0..n-1] of
    real
  • As soon as Pi has computed a cell value, it sends
    that value to Pi+1 (0 ≤ i < p-1).
  • Initially, only A0,0 can be computed.
  • Once A0,0 is computed, then A1,0 and A0,1 can be
    computed.
  • The computation proceeds in steps. At step k, all
    values on the k-th anti-diagonal are computed.
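
The set of cells that become computable at each step can be enumerated explicitly. A small sketch, assuming steps are numbered from 0 so that step k covers the cells with i + j = k; the function name is hypothetical.

```python
def anti_diagonal(n, k):
    """Cells (i, j) of an n x n domain with i + j == k, for k = 0 .. 2n-2."""
    first = max(0, k - n + 1)
    last = min(k, n - 1)
    return [(i, k - i) for i in range(first, last + 1)]


for k in range(5):
    print(k, anti_diagonal(3, k))
```

For n = 3 this lists 1, 2, 3, 2, 1 cells over steps 0 to 4, so at most n processors are busy at once and the whole sweep takes 2n - 1 steps.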

22
(No Transcript)
23
General Steps of Greedy Algorithm
  • At time i+j, processor Pi performs the following
    operations:
  • It receives A(i-1,j) from Pi-1
  • It computes A(i,j)
  • Then it sends A(i,j) to Pi+1
  • Exceptions:
  • P0 does not need to receive cell values to update
    its cells.
  • Pp-1 does not send its cell values after updating
    its cells.
  • The above exceptions do not influence algorithm
    performance.
  • This algorithm is captured in Algorithm 4.3 on
    the next slide.
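
The receive/compute/send pattern can be simulated in a single process with time steps. This sketch is not Algorithm 4.3 (whose listing is not in the transcript); it assumes p = n, one row per processor, and models messages with hypothetical `inbox`/`outbox` lists so that a value sent at time t is available at time t + 1.

```python
def greedy_stencil(rows, update):
    """Simulate the greedy ring algorithm with p = n; rows[i] is the row held by Pi."""
    n = len(rows)
    inbox = [None] * n                        # inbox[i]: value sent by P(i-1), if any
    for t in range(2 * n - 1):                # time steps 0 .. 2n-2
        outbox = [None] * n
        for i in range(n):
            j = t - i
            if 0 <= j < n:                    # Pi computes A(i, j) at time i + j
                north = inbox[i] if i > 0 else None
                west = rows[i][j - 1] if j > 0 else None
                rows[i][j] = update(rows[i][j], west, north)
                if i < n - 1:
                    outbox[i + 1] = rows[i][j]    # send A(i, j) to P(i+1)
        inbox = outbox
    return rows
```

With the same update rule, the result matches a plain row-by-row sequential sweep, since each cell still sees its updated West and North values.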

24
(No Transcript)
25
Tracing Steps in Preceding Algorithm
  • Re-read pp. 72-73 of CLR on send and receive for
    synchronous rings.
  • See slides 35-40, esp. 37-38, in the slides on
    synchronous networks.
  • Steps 1-3 are performed by all processors.
  • All processors obtain an array A of n reals, their
    ID number, and the number of processors.
  • Steps 4-6 are performed only by P0.
  • In Step 5, P0 updates the cell A0,0 in the NW
    (top-left) corner.
  • In Step 6, P0 sends the contents of A[0], which
    holds cell A0,0, to its successor, P1.
  • Steps 7-8 are executed only by P1, since it is
    the only processor receiving a message. (Note
    that this is not a blocking receive, as that
    would block all Pi for i > 1.)
  • In Step 8, P1 stores the updated value of A0,0
    received from P0 in address v.
  • In Step 9, P1 uses the value in v to update the
    value in A[0] of cell A1,0.

26
Tracing Steps in Algorithm (cont)
  • Steps 12-13 are executed by P0 to update the
    value A[j] of its next cell A0,j in the top row
    and send its value to P1.
  • Steps 14-16 are executed only by Pn-1 on the
    bottom row to update the value A[j] of its next
    cell An-1,j.
  • This value will be used by Pn-1 to update its
    next cell in the next round.
  • Pn-1 does not send a value since its row is the
    last one.
  • Only Pi for 0 < i < n-1 can execute Steps 18-19.
  • In the j-th pass through the loop, the Pi
    executing Steps 18-19 are further restricted to
    those receiving a message (i.e., blocking
    receive).
  • In Step 18, Pi executes the send and receive in
    parallel.
  • In Step 19, Pi uses the received value to update
    the value A[j] of the next cell Ai,j.

27
Algorithm for Fewer Processors
  • Typically, there are far fewer processors than
    rows.
  • WLOG, assume p divides n.
  • If n/p contiguous rows are assigned to each
    processor, then at least n/p steps must occur
    before P0 can send a value to P1.
  • This situation repeats for each Pi and Pi+1,
    severely restricting parallelism.
  • Instead, we assign rows to processors cyclically,
    with row j assigned to Pj mod p.
  • Each processor has the following declaration:
  • var A: array[0..n/p-1, 0..n-1] of real
  • This is a contiguous array of rows, but these
    rows are not contiguous in the domain.
  • Algorithm 4.4 for the stencil application on a
    ring of processors using a cyclic data
    distribution is given next.
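
The cyclic row assignment amounts to a simple index mapping. A minimal sketch, with hypothetical helper names, assuming p divides n:

```python
def cyclic_owner(j, p):
    """Row j of the domain is held by processor P(j mod p), as local row j // p."""
    return j % p, j // p


def rows_of(q, n, p):
    """Global rows held by processor Pq under the cyclic distribution."""
    return list(range(q, n, p))
```

For n = 6 and p = 3, processor P1 holds global rows 1 and 4, stored locally as rows 0 and 1.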

28
(No Transcript)
29
Cyclic Stencil Algorithm Execution Time
  • Let T(n,p) be the execution time for the
    preceding algorithm.
  • We assume that receiving is blocking while
    sending is not.
  • The sending of a message in step k is followed by
    the reception of the message in step k+1.
  • The time needed to perform one algorithm step is
    w + b + L, where
  • The time needed to update a cell is w
  • The time needed to communicate one cell value
    (the inverse of the communication rate) is b
  • The startup cost is L.
  • The computation terminates when Pp-1 finishes
    computing the rightmost cell value of its last
    row of cells.

30
Cyclic Stencil Algorithm Run Time (cont)
  • The number of algorithm steps is $p - 1 + n^2/p$
  • Pp-1 is idle for the first p-1 steps
  • Once Pp-1 starts computing, it computes a cell
    each step until the algorithm computation is
    completed.
  • There are $n^2$ cells, split evenly between the
    processors, so each processor is assigned $n^2/p$
    cells
  • This yields
  • $T(n,p) = (p - 1 + n^2/p)(w + b + L)$
  • Additional problem:
  • The algorithm was designed to minimize the time
    between a cell update computation and its
    reception by the next processor.
  • However, the algorithm performs many
    communications of small data items.
  • L can be orders of magnitude larger than b if
    the cell value is small.

31
Cyclic Stencil Algorithm Run Time (cont)
  • Stencil application characteristics
  • The cell value is often as small as an integer or
    real number.
  • The computations to update the cells may involve
    only a few operations, so w may also be small.
  • For many computations, most of the execution time
    could be due to the L term in the equation for
    T(n,p).
  • Spending a large amount of time in communications
    overhead reduces the parallel efficiency
    considerably.
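  • Note that $E_p(n) = T_{seq}(n) / (p \cdot T_{par}(n)) = n^2 w / (p \cdot T_{par}(n))$
  • $E_p(n)$ reduces to the formula below. Note that
    even as n increases, the efficiency may remain
    well below 1.
  • $E_p(n) = \frac{n^2 w}{(p(p-1) + n^2)(w + b + L)} \to \frac{w}{w + b + L}$ as $n \to \infty$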
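
The run-time model for the cyclic algorithm can be evaluated directly. The sketch below assumes p - 1 + n²/p steps, each costing w + b + L, as described on the preceding slides; the function names are hypothetical and the numbers in the usage note are invented.

```python
def cyclic_stencil_time(n, p, w, b, L):
    """T(n, p) = (p - 1 + n^2 / p) * (w + b + L)."""
    return (p - 1 + n * n / p) * (w + b + L)


def cyclic_stencil_efficiency(n, p, w, b, L):
    """E_p(n) = n^2 w / (p * T(n, p)); tends to w / (w + b + L) for large n."""
    return n * n * w / (p * cyclic_stencil_time(n, p, w, b, L))
```

With w = b = 1 and L = 2, the efficiency can never exceed 1/(1 + 1 + 2) = 0.25, no matter how large n is; a larger L relative to w makes this bound worse.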

32
Augmenting Granularity of Algorithm
  • The communication overhead due to startup
    latencies can be decreased by sending fewer
    messages that are larger.
  • Let each processor compute k contiguous cell
    values in each row during each step, instead of
    just 1 value.
  • To simplify the analysis, we assume k divides n,
    so each row has n/k segments of k contiguous
    cells.
  • To avoid this assumption, let the last incomplete
    segment of a row spill over to the next row. The
    last segment of the last row may then have fewer
    than k elements.
  • With this algorithm, cell values are communicated
    in bulk, k at a time.

33
Augmenting Granularity of Algorithm (cont)
  • Effect of communicating k items in bulk on the
    algorithm:
  • Larger values of k produce less communication
    overhead.
  • However, larger values of k increase the time
    between a cell value update and its reception in
    the next processor.
  • In this algorithm, processors will start
    computing cell values later, leading to more idle
    time for processors.
  • This approach is illustrated in the next diagram.

34
(No Transcript)
35
Block-Cyclic Allocation of Cells
  • A second way to reduce communication costs is to
    decrease the number of cell values that are being
    communicated.
  • This is done by allocating blocks from r
    consecutive rows to processors cyclically.
  • To simplify the analysis, we assume rp divides
    n.
  • This idea of a block-cyclic allocation is very
    useful, and is illustrated below.
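
In code, the block-cyclic allocation is a one-line mapping from a row to its owner. A minimal sketch with a hypothetical function name, assuming rp divides n:

```python
def block_cyclic_owner(i, r, p):
    """Processor holding row i when blocks of r consecutive rows are dealt cyclically."""
    return (i // r) % p


print([block_cyclic_owner(i, 2, 2) for i in range(8)])
# [0, 0, 1, 1, 0, 0, 1, 1]
```

Setting r = 1 gives back the cyclic row distribution, and r = n/p gives a plain block distribution.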

36
Block-Cyclic Allocation of Cells (cont)
  • Each processor computes k contiguous cells in
    each row from a block of r rows.
  • At each step, each processor now computes r×k
    cells
  • Note: blocks are r×k (r rows, k columns) in size
  • Note: only those values on the edges of the block
    have to be sent to other processors.
  • This general approach can dramatically decrease
    the number of cells whose updates have to be sent
    to other processors.
  • The algorithm for this allocation is similar to
    the one shown for the cyclic row assignment
    scheme in Figure 4.6:
  • Simply replace rows by blocks of rows.
  • A processor calculates all cell values in its
    first block of rows in n/k steps of the
    algorithm.

37
Block-Cyclic Allocations (cont)
  • Processor Pp-1 sends its k cell values to P0
    after p algorithm steps.
  • P0 needs these values to compute its second
    block of rows.
  • As a result, we need n ≥ kp in order to keep
    processors busy.
  • If n > kp, then processors must temporarily store
    received cell values while they finish
    computing their block of rows for the previous
    step.
  • Recall processors only have to exchange data at
    the boundaries between blocks.
  • Using r rows per block, the amount of data
    communicated is r times smaller than the previous
    algorithm.

38
Block-Cyclic Allocations (cont)
  • Processor activities in computing a block:
  • Receive k cell values from its predecessor
  • Compute kr cell values
  • Send k cell values to its successor
  • Again, we assume receives are blocking while
    sends are not.
  • The time required to perform one step of the
    algorithm is
  • $k r w + k b + L$
  • The computation finishes when processor Pp-1
    finishes computing its rightmost segment in its
    last block of rows of cells.
  • Pp-1 computes one segment of a block row in each
    step

39
Optimizing Block-Cyclic Allocations
  • There are $n^2/(kr)$ such segments, so p
    processors can compute them in $n^2/(pkr)$ steps
  • It takes p-1 algorithm steps before processor
    Pp-1 can start doing any computation.
  • Afterwards, Pp-1 will compute one segment at
    each step.
  • Overall, the algorithm runs for $p - 1 + n^2/(pkr)$ steps
    with a total computation time of
  • $T(n,p,r,k) = (p - 1 + n^2/(pkr))(k r w + k b + L)$
  • The efficiency of this algorithm is
  • $E(n,p,r,k) = \frac{n^2 w}{p\,(p - 1 + n^2/(pkr))(k r w + k b + L)}$

40
Optimizing Block-Cyclic Allocations (cont)
  • This gives the asymptotic efficiency of
  • $\frac{k r w}{k r w + k b + L} = \frac{w}{w + b/r + L/(rk)}$
  • Note that by increasing r and k, it is possible
    to achieve significantly higher efficiency.
  • However, while increasing r and k reduces
    communications, it also makes processors start
    computing later (more idle time), so there is a
    tradeoff.
  • The text also outlines how to determine optimal
    values for k and r using a fair amount of
    mathematics.
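
The block-cyclic cost model can be explored numerically instead of analytically. The sketch below assumes p - 1 + n²/(pkr) steps, each costing krw + kb + L, as described on the preceding slides; the function names are hypothetical and no optimal (r, k) from the text is reproduced here.

```python
def block_cyclic_time(n, p, r, k, w, b, L):
    """(p - 1 + n^2 / (p k r)) steps, each costing k r w + k b + L."""
    steps = p - 1 + n * n / (p * k * r)
    return steps * (k * r * w + k * b + L)


def block_cyclic_efficiency(n, p, r, k, w, b, L):
    """Sequential time n^2 w divided by p times the parallel time."""
    return n * n * w / (p * block_cyclic_time(n, p, r, k, w, b, L))


def asymptotic_efficiency(r, k, w, b, L):
    """Limit of the efficiency as n grows: w / (w + b / r + L / (r k))."""
    return w / (w + b / r + L / (r * k))
```

With r = k = 1 the model reduces to the cyclic algorithm. For fixed n, scanning a few (r, k) pairs with `block_cyclic_time` shows the tradeoff: larger blocks cut the per-cell communication cost but lengthen the p - 1 start-up steps.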

41
Implementing Logical Topologies
  • Designers of parallel algorithms should choose
    the logical topology.
  • In section 4.5, switching the topology from
    unidirectional ring to a bidirectional ring made
    the program much simpler and lowered the
    communications time.
  • The message passing libraries, such as the ones
    implemented for MPI, allow communications between
    any two processors using the Send and Recv
    functions.
  • Using a logical topology restricts communications
    to only a few paths, which usually makes the
    algorithm design simpler.
  • The logical topology can be implemented by
    creating a set of functions that allows each
    processor to identify its neighbors.
  • A unidirectional ring only needs NextNode(P)
  • A bidirectional ring would also need
    PreviousNode(P)
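
For a ring of p processors numbered 0 to p-1, these neighbor functions are one-liners. A minimal sketch; the Python names mirror NextNode and PreviousNode from the slide, and passing p as an explicit argument is an assumption made here.

```python
def next_node(q, p):
    """Successor of processor q on a ring of p processors."""
    return (q + 1) % p


def previous_node(q, p):
    """Predecessor of processor q on a ring of p processors."""
    return (q - 1) % p
```

An algorithm written only in terms of these two functions never needs to know how the ring is laid out on the physical network.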

42
Logical Topologies (cont)
  • Some systems (e.g., many modern supercomputers)
    provide many physical networks, but sometimes the
    creation of logical topologies is left to the
    user.
  • A difficult task is matching the logical topology
    to the physical topology for the application.
  • The common wisdom is that a logical topology that
    resembles the physical topology should produce
    good performance for the application.
  • Sometimes the reason for using a logical topology
    is to hide the complexity of the physical
    topology.
  • Often extensive benchmarking is required to
    determine the best topology for a given algorithm
    on a given platform.
  • The logical topologies studied in this chapter
    and the next are known to be useful in the
    majority of scenarios.

43
Distributed vs Centralized Implementations
  • In the CLR text, the data is already distributed
    among the processors at the start of the
    execution.
  • One may wonder how the data was distributed to
    the processors and whether that should also be
    part of the algorithm.
  • There are two approaches: distributed and
    centralized.
  • In the centralized approach, one assumes that the
    data resides in a single master location:
  • A single processor
  • A file on a disk, if the data size is large.
  • The CLR book takes the distributed approach. The
    Akl book usually takes the distributed approach,
    but occasionally takes the centralized approach.

44
Distributed vs Centralized (cont)
  • An advantage of the centralized approach is that
    the library routine can choose the data
    distribution scheme to enforce.
  • The best performance requires that the choice for
    each algorithm consider its underlying topology.
  • This cannot be done in advance.
  • Often the library developer will provide multiple
    versions, one for each possible data distribution.
  • The user can then choose the version that best
    fits the underlying platform.
  • This choice may be difficult without extensive
    benchmarking.
  • The main disadvantage of the centralized approach
    arises when the user applies successive
    algorithms using the same data.
  • Data will be repeatedly distributed and
    undistributed.
  • This causes most library developers to opt for
    the distributed option.

45
Summary of Algorithmic Principles (For
Asynchronous Message Passing)
  • Although used here only for the ring topology,
    the principles below are general. Unfortunately,
    they often conflict with each other.
  • Sending data in bulk
  • Reduces communication overhead due to network
    latencies
  • Sending data early
  • Sending data as early as possible allows other
    processors to start computing as early as
    possible.

46
Summary of Algorithmic Principles (For
Asynchronous Message Passing), Continued
  • Overlapping communication and computation
  • If both can be performed at the same time, the
    communication cost is often hidden
  • Block Data Distribution
  • Having processors assigned blocks of contiguous
    data elements reduces the amount of communication
  • Cyclic Data Distribution
  • Having data elements interleaved among processors
    makes it possible to reduce idle time and achieve
    a better load balance