Title: Chapter 4, CLR Textbook
1. Chapter 4, CLR Textbook
- Algorithms on Rings of Processors
2. Algorithms on Rings of Processors
- When using message passing, it is common to abstract away from the physical network and to choose a convenient logical network instead.
- This chapter presents several algorithms intended for the logical ring network studied earlier.
- Coverage of how logical networks map onto physical networks is deferred to Sections 4.6 and 4.7.
- Rings are a linear interconnection network.
- They are ideal for a first look at distributed memory algorithms.
- Each processor has a single predecessor and a single successor.
3. Matrix-Vector Multiplication
- The first unidirectional ring algorithm will be the multiplication y = Ax of an n×n matrix A by a vector x of dimension n.
- 1. for i = 0 to n-1 do
- 2.   yi ← 0
- 3.   for j = 0 to n-1 do
- 4.     yi ← yi + Ai,j × xj
- Each iteration of the outer (i) loop computes the scalar product of one row of A with vector x.
- These scalar products can be performed in any order.
- These scalar products will be distributed among the processors so they can be done in parallel.
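- For reference, a minimal Python sketch of this sequential loop nest (the function and variable names are illustrative, not from the text):

    # Sequential y = A x for an n x n matrix A and a length-n vector x.
    def matvec(A, x):
        n = len(x)
        y = [0.0] * n
        for i in range(n):            # one scalar product per row of A
            for j in range(n):
                y[i] += A[i][j] * x[j]
        return y

    A = [[1.0, 2.0], [3.0, 4.0]]
    x = [5.0, 6.0]
    print(matvec(A, x))               # [17.0, 39.0]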
4. Matrix-Vector Multiplication (cont.)
- We assume that n is divisible by p and let r = n/p.
- Each processor stores r contiguous rows of matrix A and computes r of the scalar products.
- This set of rows is called a block row.
- The corresponding r components of the vectors x and y are also stored on each processor.
- Each processor Pq then stores
  - Rows qr to (q+1)r - 1 of matrix A, a block of dimension r×n
  - Components qr to (q+1)r - 1 of vectors x and y.
- For simplicity, we ignore the case where n is not divisible by p.
- However, this case can be handled by temporarily adding rows of zeros to matrix A and zeros to vector x so that the resulting number of rows is divisible by p.
5. Matrix-Vector Multiplication (cont.)
- Declarations needed:
  - var A: array[0..r-1, 0..n-1] of real
  - var x, y: array[0..r-1] of real
- Local entry A[0,0] on P0 corresponds to global A0,0, but on P1 it corresponds to Ar,0.
- Note that the subscripts are global while the array indices are local.
- Also note that global index (i,j) corresponds to local index (i - r⌊i/r⌋, j) on processor Pk, where k = ⌊i/r⌋.
- The next figure illustrates how the rows of A and the vectors are partitioned among the processors.
6. (No Transcript)
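- As a minimal sketch (Python, illustrative names), the global-to-local index mapping just described can be written as:

    # Block-row distribution: processor P_k owns global rows k*r .. (k+1)*r - 1.
    def global_to_local(i, j, r):
        """Map global index (i, j) to (owning processor, local index)."""
        k = i // r                        # owning processor P_k is floor(i/r)
        return k, (i - k * r, j)          # local row is i mod r

    def local_to_global(k, i_loc, j, r):
        return k * r + i_loc, j

    # With r = 2: global row 0 is local row 0 on P0; global row 2 is local row 0 on P1.
    print(global_to_local(0, 0, 2))       # (0, (0, 0))
    print(global_to_local(2, 0, 2))       # (1, (0, 0))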
7. Matrix-Vector Multiplication (cont.)
- The partitioning of the data makes it possible to solve larger problems in parallel.
- The parallel algorithm can solve a problem roughly p times larger than the sequential algorithm.
- Algorithm 4.1 is given on the next slide.
- In each loop iteration of Algorithm 4.1, each processor Pq computes the product of an r×r sub-block of its rows of A with a vector of size r.
- This is a partial result.
- The values of the components of y assigned to Pq are obtained by adding all of these partial results together.
8. (No Transcript)
9. Matrix-Vector Multiplication (cont.)
- In the first pass through the loop, the x-components held by Pq are the ones originally assigned to it.
- During each pass through the loop, Pq computes the product of the appropriate part of the q-th block of A with the components of x it currently holds.
- Concurrently, each Pq sends its block of x-values to Pq+1 (mod p) and receives a new block of x-values from Pq-1 (mod p).
- At the conclusion, each Pq again holds its original block of x-values and has computed the correct values of its y-components.
- These steps are illustrated in Figure 4.2.
10. (No Transcript)
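- To make the data movement concrete, here is a small serial Python simulation of this scheme (a sketch only: the p "processors" are lists, and the circular shift of the x blocks is modeled by rotating a list rather than by real send/receive calls):

    # Simulated block-row matrix-vector product on a ring of p processors.
    def ring_matvec(A, x, p):
        n = len(x)
        r = n // p                                       # rows and x components per processor
        A_loc = [A[q*r:(q+1)*r] for q in range(p)]       # block rows of A
        x_loc = [x[q*r:(q+1)*r] for q in range(p)]       # block of x currently held by P_q
        y_loc = [[0.0] * r for _ in range(p)]
        for step in range(p):
            for q in range(p):
                src = (q - step) % p                     # which block of x P_q holds now
                for i in range(r):                       # partial products against that block
                    for j in range(r):
                        y_loc[q][i] += A_loc[q][i][src*r + j] * x_loc[q][j]
            # Concurrent circular shift: each P_q sends its block to P_{q+1}.
            x_loc = [x_loc[(q - 1) % p] for q in range(p)]
        return [v for block in y_loc for v in block]

    A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    print(ring_matvec(A, [1, 1, 1, 1], 2))               # [10.0, 26.0, 42.0, 58.0]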
11. Analysis of Matrix-Vector Multiplication
- There are p identical steps.
- Each step involves three activities: compute, send, and receive.
- The times to send and to receive are identical and the communication is concurrent with the computation, so the execution time is
  - T(p) = p × max(r²w, L + rb)
- where w is the computation time for multiplying a vector component by a matrix component and adding the product into the running sum,
- b is the inverse of the bandwidth, and L is the communication startup cost.
- As r = n/p, the computation cost becomes asymptotically larger than the communication cost as n increases, since r²w = n²w/p² grows quadratically in n while L + rb = L + nb/p grows only linearly, so r²w ≥ L + rb (for n large).
12. Matrix-Vector Multiplication Analysis (cont)
- Next, we calculate various metrics and their complexity.
- For large n,
  - T(p) ≈ p(r²w) = n²w/p, which is O(n²/p), and O(n²) if p is constant
  - The cost = (n²w/p) × p = n²w, which is O(n²)
  - The speedup = ts/T(p) = cn² × (p/(n²w)) = (c/w)p, which is O(p)
  - However, if p is constant/small, the speedup is only O(1)
  - The efficiency = ts/cost = cn²/(n²w) = c/w, which is O(1)
  - Note: efficiency = ts/(p × tp), which is O(1)
- Note that if vector x were duplicated across all processors, then there would be no need for any communication and the parallel efficiency would be O(1) for all values of n.
- However, there would be an increased memory cost.
13. Matrix-Matrix Multiplication
- Using the matrix-vector multiplication scheme, this is easy.
- Let C = A×B, where all are n×n matrices.
- The multiplication consists of computing n² scalar products:
  - for i = 0 to n-1 do
  -   for j = 0 to n-1 do
  -     Ci,j ← 0
  -     for k = 0 to n-1 do
  -       Ci,j ← Ci,j + Ai,k × Bk,j
- We distribute the matrices over the p processors, giving the first processor the first r = n/p rows, and so on.
- Declaration:
  - var A, B, C: array[0..r-1, 0..n-1] of real
14. (No Transcript)
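- The same circulation idea drives the matrix-matrix algorithm; a serial Python sketch (illustrative only, with the block rows of B rotated around the ring in place of real messages):

    # Simulated block-row matrix-matrix product C = A x B on a ring of p processors.
    def ring_matmul(A, B, p):
        n = len(A)
        r = n // p
        A_loc = [A[q*r:(q+1)*r] for q in range(p)]        # r x n block rows of A (fixed)
        B_loc = [B[q*r:(q+1)*r] for q in range(p)]        # block rows of B (circulated)
        C_loc = [[[0.0] * n for _ in range(r)] for _ in range(p)]
        for step in range(p):
            for q in range(p):
                src = (q - step) % p                      # which block of B P_q holds now
                for i in range(r):
                    for j in range(n):
                        for k in range(r):                # r x r sub-block of A times r x n block of B
                            C_loc[q][i][j] += A_loc[q][i][src*r + k] * B_loc[q][k][j]
            B_loc = [B_loc[(q - 1) % p] for q in range(p)]  # circular shift of B's block rows
        return [row for block in C_loc for row in block]

    print(ring_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2))   # [[19.0, 22.0], [43.0, 50.0]]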
15. Matrix-Matrix Multiplication Analysis
- This algorithm is very similar to the one for matrix-vector multiplication:
  - Scalar products are replaced by sub-matrix multiplications
  - Circular shifting of a vector is replaced by circular shifting of block rows of the matrix
- Analysis:
  - Each step lasts as long as the longest of the three activities performed during the step: compute, send, and receive.
  - T(p) = p × max(nr²w, L + nrb)
  - As before, the asymptotic parallel efficiency is 1 when n is large.
16. Matrix-Matrix Multiplication Analysis
- Naïve algorithm: matrix-matrix multiplication could also be achieved by executing the matrix-vector multiplication algorithm n times.
- Analysis of the naïve algorithm:
  - The execution time is just the time for matrix-vector multiplication, multiplied by n:
  - T'(p) = p × max(nr²w, nL + nrb)
  - The only difference between T and T' is that the term L has become nL.
- In the naïve approach processors exchange vectors of size r, while in the algorithm developed in this section they exchange blocks of size r×n.
- This does not change the asymptotic efficiency.
- However, sending data in bulk can significantly reduce the communication overhead.
17. Stencil Applications
- Popular applications that operate on a discrete domain consisting of cells.
- Each cell holds some value(s) and has neighbor cells.
- The application applies pre-defined rules to update the value(s) of a cell using the values of its neighbor cells.
- The locations of the neighbor cells and the function used to update cell values constitute a stencil that is applied to all cells in the domain.
- These types of applications arise in many areas of science and engineering.
- Examples include image processing, approximate solutions to differential equations, and simulation of complex cellular automata (e.g., Conway's Game of Life).
18. A Simple Sequential Algorithm
- We consider a stencil application on a 2D domain of size n×n.
- Each cell has 8 neighbors, as shown below:
  - NW N NE
  - W  c  E
  - SW S SE
- The algorithm we consider updates the value of cell c based on the already updated values of its West and North neighbors.
- The stencil is shown on the next slide and can be formalized as
  - c_new ← UPDATE(c_old, W_new, N_new)
19. (No Transcript)
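- A minimal sequential sketch in Python (the UPDATE rule here is a placeholder average; None plays the role of Nil for missing West/North neighbors):

    # Sequential sweep applying c_new = UPDATE(c_old, W_new, N_new) over an n x n domain.
    def update(c_old, w_new, n_new):
        vals = [v for v in (c_old, w_new, n_new) if v is not None]   # ignore Nil neighbors
        return sum(vals) / len(vals)

    def sweep(A):
        n = len(A)
        for i in range(n):
            for j in range(n):
                west = A[i][j-1] if j > 0 else None    # already updated in this sweep
                north = A[i-1][j] if i > 0 else None   # already updated in this sweep
                A[i][j] = update(A[i][j], west, north)
        return A

    print(sweep([[1.0, 2.0], [3.0, 4.0]]))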
20. A Simple Sequential Algorithm (cont)
- This simple stencil is similar to important applications:
  - The Gauss-Seidel numerical method
  - The Smith-Waterman biological string comparison algorithm
- This stencil cannot be applied to cells in the top row or left column.
- These cells are handled by the update function.
- To indicate that no neighbor exists for a cell update, we pass a Nil argument to UPDATE.
21. Greedy Parallel Algorithm for Stencil
- Consider a ring of p processors, P0, P1, ..., Pp-1.
- We must decide how to allocate cells among processors.
- We need to balance the computational load without creating overly expensive communications.
- Assume initially that p is equal to n.
- We allocate row i of domain A to the i-th processor, Pi.
- Declaration needed: var A: array[0..n-1] of real
- As soon as Pi has computed a cell value, it sends that value to Pi+1 (0 ≤ i < p-1).
- Initially, only A0,0 can be computed.
- Once A0,0 is computed, then A1,0 and A0,1 can be computed.
- The computation proceeds in steps. At step k, all values on the k-th anti-diagonal are computed.
22. (No Transcript)
23. General Steps of Greedy Algorithm
- At time i+j, processor Pi performs the following operations:
  - It receives A(i-1, j) from Pi-1
  - It computes A(i, j)
  - It then sends A(i, j) to Pi+1
- Exceptions:
  - P0 does not need to receive cell values to update its cells.
  - Pp-1 does not send its cell values after updating its cells.
- The above exceptions do not influence the algorithm's performance.
- This algorithm is captured in Algorithm 4.3 on the next slide.
24. (No Transcript)
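- A small serial Python simulation of this wavefront (one logical "processor" per row; the receive from Pi-1 is modeled simply by reading the already-updated North value, and the UPDATE rule is a placeholder):

    # Greedy wavefront: at time i + j, P_i "receives" A[i-1][j], computes A[i][j],
    # and "sends" A[i][j] to P_{i+1}; here a whole anti-diagonal is done per step.
    def update(c_old, w_new, n_new):
        vals = [v for v in (c_old, w_new, n_new) if v is not None]
        return sum(vals) / len(vals)

    def greedy_wavefront(A):
        n = len(A)
        for step in range(2 * n - 1):                      # anti-diagonal number k = i + j
            for i in range(n):
                j = step - i
                if 0 <= j < n:
                    west = A[i][j-1] if j > 0 else None
                    north = A[i-1][j] if i > 0 else None   # value received from P_{i-1}
                    A[i][j] = update(A[i][j], west, north)
        return A

    print(greedy_wavefront([[float(i + j) for j in range(4)] for i in range(4)]))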
25. Tracing Steps in Preceding Algorithm
- Re-read pages 72-73 of CLR on send and receive for synchronous rings.
- See slides 35-40, esp. 37-38, in the slides on synchronous networks.
- Steps 1-3 are performed by all processors.
- All processors obtain an array A of n reals, their ID number, and the number of processors.
- Steps 4-6 are performed only by P0.
- In Step 5, P0 updates the cell A0,0 in the NW (top-left) corner.
- In Step 6, P0 sends the contents of A[0], i.e. cell A0,0, to its successor, P1.
- Steps 7-8 are executed only by P1, since it is the only processor receiving a message at this point. (Note this is not a blocking receive, as that would block all Pi for i > 1.)
- In Step 8, P1 stores the updated value of A0,0 received from P0 in variable v.
- In Step 9, P1 uses the value in v to update the value in A[0] of cell A1,0.
26. Tracing Steps in Algorithm (cont)
- Steps 12-13 are executed by P0 to update the value A[j] of its next cell A0,j in the top row and send that value to P1.
- Steps 14-16 are executed only by Pn-1, on the bottom row, to update the value A[j] of its next cell An-1,j.
  - This value will be used by Pn-1 to update its next cell in the next round.
  - Pn-1 does not send a value since its row is the last one.
- Only Pi for 0 < i < n-1 can execute Steps 18-19.
- In Step 18, the processors Pi executing Steps 18-19 on the j-th loop iteration are further restricted to those receiving a message (i.e., a blocking receive).
- In Step 18, Pi executes the send and the receive in parallel.
- In Step 19, Pi uses the received value to update the value A[j] of its next cell Ai,j.
27. Algorithm for Fewer Processors
- Typically, we have far fewer processors than rows.
- WLOG, assume p divides n.
- If n/p contiguous rows were assigned to each processor, then at least n/p steps must occur before P0 can send a value to P1.
- This situation repeats between each Pi and Pi+1, severely restricting parallelism.
- Instead, we assign rows to processors cyclically, with row j assigned to Pj mod p.
- Each processor has the following declaration:
  - var A: array[0..n/p - 1, 0..n-1] of real
- This is a contiguous array of rows, but these rows are not contiguous in the domain.
- Algorithm 4.4, for the stencil application on a ring of processors using a cyclic data distribution, is given next.
28. (No Transcript)
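- The cyclic mapping itself is simple; a short Python sketch (illustrative names) of where each global row lives:

    # Cyclic distribution of n rows over p processors:
    # global row j is stored on P_{j mod p} as local row j // p.
    def owner_and_local_row(j, p):
        return j % p, j // p

    def global_row(proc, local_row, p):
        return local_row * p + proc

    n, p = 8, 3
    for j in range(n):
        proc, local = owner_and_local_row(j, p)
        print(f"row {j} -> P{proc}, local row {local}")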
29. Cyclic Stencil Algorithm Execution Time
- Let T(n,p) be the execution time for the preceding algorithm.
- We assume that receiving is blocking while sending is not.
- The sending of a message in step k is followed by the reception of that message in step k+1.
- The time needed to perform one algorithm step is w + b + L, where
  - The time needed to update a cell is w
  - The time needed to communicate a cell value is b (the inverse of the bandwidth)
  - The startup cost is L.
- The computation terminates when Pp-1 finishes computing the rightmost cell value of its last row of cells.
30. Cyclic Stencil Algorithm Run Time (cont)
- The number of algorithm steps is p - 1 + n²/p:
  - Pp-1 is idle for the first p-1 steps.
  - Once Pp-1 starts computing, it computes a cell at each step until the computation is complete.
  - There are n² cells, split evenly among the processors, so each processor is assigned n²/p cells.
- This yields
  - T(n,p) = (p - 1 + n²/p)(w + b + L)
- Additional problem:
  - The algorithm was designed to minimize the time between a cell update computation and its reception by the next processor.
  - However, the algorithm performs many communications of small data items.
  - L can be orders of magnitude larger than b if the cell values are small.
31. Cyclic Stencil Algorithm Run Time (cont)
- Stencil application characteristics:
  - The cell value is often as small as a single integer or real number.
  - The computation to update a cell may involve only a few operations, so w may also be small.
- For many computations, most of the execution time could be due to the L term in the equation for T(n,p).
- Spending a large amount of time in communication overhead reduces the parallel efficiency considerably.
- Note that Ep(n) = Tseq(n) / (p × Tpar(n)) = n²w / (p × Tpar(n)).
- Ep(n) reduces to n²w / ((p(p-1) + n²)(w + b + L)). Note that even as n increases, this efficiency tends only to w/(w + b + L), which may be well below 1.
32. Augmenting Granularity of Algorithm
- The communication overhead due to startup latencies can be decreased by sending fewer, larger messages.
- Let each processor compute k contiguous cell values in each row during each step, instead of just 1 value.
- To simplify the analysis, we assume k divides n, so each row has n/k segments of k contiguous cells.
- To avoid this assumption, the last incomplete segment could spill over to the next row; the last segment of the last row may then have fewer than k elements.
- With this algorithm, cell values are communicated in bulk, k at a time.
33. Augmenting Granularity of Algorithm (cont)
- Effect of communicating k items in bulk:
  - Larger values of k produce less communication overhead.
  - However, larger values of k increase the time between a cell value's update and its reception by the next processor.
  - Processors therefore start computing cell values later, leading to more idle time.
- This approach is illustrated in the next diagram.
34. (No Transcript)
35. Block-Cyclic Allocation of Cells
- A second way to reduce communication costs is to decrease the number of cell values that are communicated.
- This is done by allocating blocks of r consecutive rows to processors cyclically.
- To simplify the analysis, we assume r × p divides n.
- This idea of a block-cyclic allocation is very useful, and is illustrated below.
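- A short Python sketch of the block-cyclic mapping (blocks of r consecutive rows dealt to processors round-robin; names are illustrative):

    # Block-cyclic distribution: global row i belongs to block i // r, which is
    # owned by P_{(i // r) mod p} and stored as that processor's (i // r) // p-th local block.
    def block_cyclic_owner(i, r, p):
        block = i // r
        return block % p, block // p, i % r            # (processor, local block, row within block)

    n, r, p = 12, 2, 3
    for i in range(n):
        proc, local_block, offset = block_cyclic_owner(i, r, p)
        print(f"row {i:2d} -> P{proc}, local block {local_block}, offset {offset}")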
36. Block-Cyclic Allocation of Cells (cont)
- Each processor computes k contiguous cells in each row of a block of r rows.
- At each step, each processor now computes r × k cells.
- Note: blocks are r × k (r rows, k columns) in size.
- Note: only those values on the edges of a block have to be sent to other processors.
- This general approach can dramatically decrease the number of cells whose updates have to be sent to other processors.
- The algorithm for this allocation is similar to the one shown for the cyclic row assignment scheme in Figure 4.6:
  - Simply replace rows by blocks of rows.
- A processor computes all cell values in its first block of rows in n/k steps of the algorithm.
37. Block-Cyclic Allocations (cont)
- Processor Pp-1 sends its first k cell values to P0 after p algorithm steps.
- P0 needs these values to compute its second block of rows.
- As a result, we need n ≥ kp in order to keep all processors busy.
- If n > kp, then processors must temporarily store received cell values while they finish computing their block of rows from the previous step.
- Recall that processors only have to exchange data at the boundaries between blocks.
- Using r rows per block, the amount of data communicated is r times smaller than in the previous algorithm.
38. Block-Cyclic Allocations (cont)
- Processor activities in computing each block of k × r cells:
  - Receive k cell values from its predecessor
  - Compute k × r cell values
  - Send k cell values to its successor
- Again, we assume receives are blocking while sends are not.
- The time required to perform one step of the algorithm is
  - krw + kb + L
- The computation finishes when processor Pp-1 finishes computing the rightmost segment in its last block of rows of cells.
- Pp-1 computes one segment of a block row in each step.
39. Optimizing Block-Cyclic Allocations
- There are n²/(kr) such segments, so the p processors can compute them in n²/(pkr) steps.
- It takes p-1 algorithm steps before processor Pp-1 can start doing any computation.
- Afterwards, Pp-1 computes one segment at each step.
- Overall, the algorithm runs for p - 1 + n²/(pkr) steps, with a total execution time of
  - T(n,p) = (p - 1 + n²/(pkr))(krw + kb + L)
- The efficiency of this algorithm is
  - e(n,p) = n²w / (p × T(n,p))
40. Optimizing Block-Cyclic Allocations (cont)
- For large n, this gives the asymptotic efficiency
  - krw / (krw + kb + L) = w / (w + b/r + L/(kr))
- Note that by increasing r and k, it is possible to achieve significantly higher efficiency.
- However, increasing r and k also delays the delivery of updated cell values to the next processor, which increases idle time.
- The text also outlines how to determine optimal values for k and r, using a fair amount of mathematics.
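- As a rough illustration of this trade-off, the formulas above can be evaluated numerically; the sketch below (Python, with purely illustrative values of w, b, and L) shows how the efficiency changes with k and r:

    # Evaluate T(n,p) = (p - 1 + n^2/(p*k*r)) * (k*r*w + k*b + L) and e = n^2*w / (p*T).
    def running_time(n, p, k, r, w, b, L):
        steps = (p - 1) + n * n / (p * k * r)
        return steps * (k * r * w + k * b + L)

    def efficiency(n, p, k, r, w, b, L):
        return (n * n * w) / (p * running_time(n, p, k, r, w, b, L))

    n, p, w, b, L = 4096, 16, 1e-9, 1e-8, 1e-5       # illustrative parameters only
    for k, r in [(1, 1), (16, 1), (16, 8), (64, 16)]:
        print(k, r, round(efficiency(n, p, k, r, w, b, L), 3))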
41. Implementing Logical Topologies
- Designers of parallel algorithms should choose the logical topology that suits their algorithm.
- In Section 4.5, switching the topology from a unidirectional ring to a bidirectional ring made the program much simpler and lowered the communication time.
- Message passing libraries, such as implementations of MPI, allow communication between any two processors using the Send and Recv functions.
- Using a logical topology restricts communications to only a few paths, which usually makes the algorithm design simpler.
- The logical topology can be implemented by creating a set of functions that allows each processor to identify its neighbors:
  - A unidirectional ring only needs NextNode(P)
  - A bidirectional ring would also need PreviousNode(P)
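- A minimal sketch of such neighbor functions for a ring of p processors (plain modular arithmetic; the function names follow the slide):

    # Neighbor functions implementing a logical ring on processors 0 .. p-1.
    def NextNode(q, p):
        """Successor of P_q on a unidirectional ring."""
        return (q + 1) % p

    def PreviousNode(q, p):
        """Predecessor of P_q, needed only for a bidirectional ring."""
        return (q - 1) % p

    p = 4
    print([NextNode(q, p) for q in range(p)])       # [1, 2, 3, 0]
    print([PreviousNode(q, p) for q in range(p)])   # [3, 0, 1, 2]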
42. Logical Topologies (cont)
- Some systems (e.g., many modern supercomputers) provide multiple physical networks, but the creation of logical topologies is sometimes left to the user.
- A difficult task is matching the logical topology of the application to the physical topology.
- The common wisdom is that a logical topology that resembles the physical topology should produce good performance.
- Sometimes the reason for using a logical topology is to hide the complexity of the physical topology.
- Often, extensive benchmarking is required to determine the best topology for a given algorithm on a given platform.
- The logical topologies studied in this chapter and the next are known to be useful in the majority of scenarios.
43. Distributed vs Centralized Implementations
- In the CLR text, the data is already distributed among the processors at the start of the execution.
- One may wonder how the data was distributed to the processors, and whether that should also be part of the algorithm.
- There are two approaches: distributed and centralized.
- In the centralized approach, one assumes that the data initially resides in a single master location:
  - A single processor
  - A file on disk, if the data is large.
- The CLR book takes the distributed approach. The Akl book usually takes the distributed approach, but occasionally takes the centralized approach.
44. Distributed vs Centralized (cont)
- An advantage of the centralized approach is that the library routine can choose the data distribution scheme to enforce.
- The best performance requires that this choice, for each algorithm, take the underlying topology into account.
- This cannot be done in advance.
- Often the library developer will provide multiple versions with different data distributions.
- The user can then choose the version that best fits the underlying platform.
- This choice may be difficult without extensive benchmarking.
- The main disadvantage of the centralized approach arises when the user applies successive algorithms to the same data:
  - The data will be repeatedly distributed and undistributed.
  - This causes most library developers to opt for the distributed option.
45. Summary of Algorithmic Principles (For Asynchronous Message Passing)
- Although we used them only for the ring topology, the principles below are general. Unfortunately, they often conflict with each other.
- Sending data in bulk
  - Reduces communication overhead due to network latencies.
- Sending data early
  - Sending data as early as possible allows other processors to start computing as early as possible.
46. Summary of Algorithmic Principles (For Asynchronous Message Passing) -- Continued --
- Overlapping communication and computation
  - If both can be performed at the same time, the communication cost is often hidden.
- Block data distribution
  - Assigning blocks of contiguous data elements to processors reduces the amount of communication.
- Cyclic data distribution
  - Interleaving data elements among processors makes it possible to reduce idle time and achieve better load balance.