Title: A Taste of Parallel Algorithms
1 A Taste of Parallel Algorithms
by Shietung Peng
2 Some Simple Computations
- We examine some of the following five simple
building-block parallel operations on three
simple parallel architectures: linear array,
binary tree, and 2D mesh.
- Semi-group computation
- Parallel prefix computation
- Packet routing
- Broadcasting and multicasting
- Sorting
3 Some Simple Architectures
- A linear array or ring
- A balanced binary tree
- A 2D mesh or torus
4 Algorithms for Linear Array
- A special case of semi-group computation, namely
maximum finding, is shown below. In each step, a
processor sends its max-thus-far value
(initialized to its own data value) to its two
neighbors. Each processor, on receiving values
from its left and right neighbors, sets its
max-thus-far value to the largest of the three
values.
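The maximum-finding steps above can be sketched as a small sequential simulation. This is only an illustration: the function name `linear_array_max` is hypothetical, the array is assumed to have no wraparound link, and p - 1 steps are run because that is the distance between the two end processors.

```python
def linear_array_max(values):
    """Simulate synchronous maximum finding on a linear array (assumed: no ring link)."""
    p = len(values)
    current = list(values)  # max-thus-far value held by each processor
    for _ in range(p - 1):  # p - 1 steps let any value reach either end
        nxt = []
        for i in range(p):
            # End processors have only one neighbor; reuse their own value.
            left = current[i - 1] if i > 0 else current[i]
            right = current[i + 1] if i < p - 1 else current[i]
            nxt.append(max(left, current[i], right))
        current = nxt
    return current
```

After the loop, every simulated processor holds the global maximum.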
5 Algorithms for Linear Array
- The algorithm for parallel prefix computation is
similar to the semi-group computation algorithm.
The processor at the left end becomes active and
sends its data value to the right. On receiving a
value from its left neighbor, a processor becomes
active, sums up the value received from the left
and its own data value, and sends the result to
the right.
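A minimal sketch of this left-to-right wave, simulated sequentially. The name `linear_array_prefix` and the `op` parameter are assumptions made for illustration; the slide itself only describes the sum case.

```python
def linear_array_prefix(values, op=lambda a, b: a + b):
    """Simulate the prefix wave: processor i activates when a value arrives from the left."""
    result = []
    incoming = None  # value travelling on the link into the current processor
    for own in values:
        # The leftmost processor has nothing to receive and starts with its own value.
        partial = own if incoming is None else op(incoming, own)
        result.append(partial)
        incoming = partial  # sent on to the right neighbor
    return result
```

For example, `linear_array_prefix([5, 2, 8, 6, 3])` returns `[5, 7, 15, 21, 24]`.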
6 Algorithms for Linear Array
- Extending the parallel prefix and semi-group
algorithms to the case where each processor holds
several data items is straightforward. Each
processor does a prefix computation on its own
data set of size n/p, then takes part in a
diminished parallel prefix computation (for
processor i, the prefix up to the (i-1)th value),
and finally combines this result with its locally
computed prefixes. In all, 2n/p + p - 2
computation steps and p - 1 communication steps
are required.
- An example of computing prefix sums on a linear
array with two items per processor is shown on
the next page.
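The three stages for several items per processor can be sketched as follows. This is an illustrative simulation for prefix sums only; `blocked_prefix_sum` is a hypothetical name, and `blocks[i]` is assumed to be the list held by processor i.

```python
from itertools import accumulate

def blocked_prefix_sum(blocks):
    """Prefix sums with several items per processor (blocks[i] belongs to processor i)."""
    # Stage 1: each processor computes prefix sums over its own items.
    local = [list(accumulate(block)) for block in blocks]
    # Stage 2: diminished prefix over the block totals; processor i
    # receives the sum of everything held by processors 0 .. i-1.
    offsets = []
    running = 0
    for loc in local:
        offsets.append(running)
        running += loc[-1] if loc else 0
    # Stage 3: each processor adds its offset to its local prefixes.
    return [[off + x for x in loc] for off, loc in zip(offsets, local)]
```

With two items per processor, `blocked_prefix_sum([[5, 2], [8, 6], [3, 7]])` returns `[[5, 7], [15, 21], [24, 31]]`.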
7 Algorithms for Linear Array
8 Algorithms for Linear Array
- We consider two versions of sorting on a linear
array: with and without I/O. Sorting with the
keys input sequentially from the left is depicted
below.
- Each processor, on receiving a key value from
the left, compares the received value with the
local value. The smaller of the two values is
kept and the larger value is passed on to the
right.
- The total sorting time is equal to the I/O time.
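A rough sketch of sorting with sequential input. Note the simplification: on the real array the keys are pipelined, so several keys are in flight at once, while this hypothetical `systolic_insert_sort` follows one key at a time. The keep-smaller, pass-larger rule at each processor is the same.

```python
def systolic_insert_sort(keys):
    """Feed keys in from the left; each cell keeps the smaller value, passes the larger."""
    cells = [None] * len(keys)  # one cell per processor, initially empty
    for key in keys:
        moving = key
        for i in range(len(cells)):
            if cells[i] is None:
                cells[i] = moving  # first empty processor stores the key
                break
            if moving < cells[i]:
                # Keep the smaller value locally, pass the larger one right.
                cells[i], moving = moving, cells[i]
    return cells
```

For example, `systolic_insert_sort([5, 2, 8, 6, 3])` returns `[2, 3, 5, 6, 8]`.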
9 Algorithms for Linear Array
- If the key values are already in place, one per
processor, then an algorithm known as odd-even
transposition can be used for sorting.
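A minimal sequential sketch of odd-even transposition, assuming one key per processor and p alternating steps. The function name is illustrative, and the choice of which pairing goes first is an assumption.

```python
def odd_even_transposition_sort(keys):
    """Sort p keys in p alternating compare-exchange steps."""
    a = list(keys)
    p = len(a)
    for step in range(p):
        # Alternate between pairs (0,1), (2,3), ... and pairs (1,2), (3,4), ...
        start = step % 2
        for i in range(start, p - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]  # compare-exchange
    return a
```

On a real array, all compare-exchanges within one step happen at the same time; the inner loop here only stands in for that parallelism.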
10 Algorithms for Linear Array
- The odd-even transposition algorithm uses p
processors to sort p keys in p compare-exchange
steps. How good is the algorithm? Assume that the
best sequential sorting algorithm takes p lg p
steps. Then we have T(p) = p, W(p) = p²/2,
S(p) = lg p, E(p) = (lg p)/p, R(p) = p/(2 lg p),
U(p) = 1/2, and Q(p) = 2(lg p)³/p².
- In practice, the number n of keys to be sorted is
greater than the number p of processors. In this
case, each processor first sorts its list of size
n/p using any efficient sequential sorting
algorithm. Next, we perform the odd-even
transposition sort as before, except that each
compare-exchange step is replaced by a
merge-split step. For example, if P0 is holding
(1,3,7,8) and P1 has (2,4,5,9), a merge-split
step will turn the lists into (1,2,3,4) and
(5,7,8,9), respectively.
- The total time of this generalized algorithm is
(n/p) lg(n/p) + 2n: the local sort, followed by
p merge-split steps of about 2n/p each.
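The merge-split variant can be sketched as below. The names `merge_split` and `block_odd_even_sort` are hypothetical, and all blocks are assumed to have the same size n/p.

```python
import heapq

def merge_split(low, high):
    """Merge two sorted lists; the left processor keeps the smaller half."""
    merged = list(heapq.merge(low, high))
    return merged[:len(low)], merged[len(low):]

def block_odd_even_sort(blocks):
    """Odd-even transposition with merge-split in place of compare-exchange."""
    blocks = [sorted(b) for b in blocks]  # local sequential sort first
    p = len(blocks)
    for step in range(p):
        for i in range(step % 2, p - 1, 2):
            blocks[i], blocks[i + 1] = merge_split(blocks[i], blocks[i + 1])
    return blocks
```

Using the example from the text, `merge_split([1, 3, 7, 8], [2, 4, 5, 9])` returns `([1, 2, 3, 4], [5, 7, 8, 9])`.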
11 Algorithms for Binary Tree
- In algorithms for a binary tree of processors, we
assume that the data elements are initially held
by the leaf processors only. The non-leaf
processors participate in the computation, but do
not have data elements of their own.
- The binary-tree architecture is ideally suited
for parallel-prefix computation. The algorithm
consists of two phases: an upward phase followed
by a downward phase. The two phases are depicted
by the figure on the next page.
- Given a list of 0s and 1s, the rank of each 1 in
the list can be determined by a prefix sum
computation.
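As a small illustration of ranking the 1s with a prefix sum, here is a sequential sketch. The name `ranks_of_ones` is hypothetical, and marking the 0 positions with `None` is just a convention chosen for this example.

```python
from itertools import accumulate

def ranks_of_ones(bits):
    """Rank each 1 in a 0/1 list (1-based); positions holding 0 get None."""
    prefix = accumulate(bits)  # running count of 1s seen so far
    return [count if bit == 1 else None for bit, count in zip(bits, prefix)]
```

For example, `ranks_of_ones([0, 1, 1, 0, 1])` returns `[None, 1, 2, None, 3]`.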
12 Algorithms for Binary Tree
- In the downward phase, each processor receives
value p from its parent and value l from its left
child. It then passes p to its left child,
combines p and l, and sends the result to its
right child.
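The two phases can be sketched recursively. Several details here are assumptions: the upward phase is taken to combine the two children's values at each inner node, the root is taken to receive the identity element from above, and each leaf combines the value arriving from its parent with its own data. The name `tree_prefix` is hypothetical.

```python
def tree_prefix(leaves, op=lambda a, b: a + b, identity=0):
    """Inclusive prefix results at the leaves of a binary tree of processors."""
    result = [None] * len(leaves)

    def upward(lo, hi):
        # Build a node (subtree_total, left_node, right_node) for leaves[lo:hi].
        if hi - lo == 1:
            return (leaves[lo], None, None)
        mid = (lo + hi) // 2
        left, right = upward(lo, mid), upward(mid, hi)
        return (op(left[0], right[0]), left, right)

    def downward(node, lo, hi, from_parent):
        total, left, right = node
        if left is None:
            # Leaf: combine the value from the parent with the local data.
            result[lo] = op(from_parent, total)
            return
        mid = (lo + hi) // 2
        downward(left, lo, mid, from_parent)                # pass p unchanged
        downward(right, mid, hi, op(from_parent, left[0]))  # pass p combined with l

    if leaves:
        downward(upward(0, len(leaves)), 0, len(leaves), identity)
    return result
```

For example, `tree_prefix([5, 2, 8, 6, 3, 7, 9, 1])` returns `[5, 7, 15, 21, 24, 31, 40, 41]`.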
13 Algorithms for Binary Tree
- For sorting, we can use an algorithm similar to
bubble sort that allows the smaller elements in
the leaves to bubble up to the root processor
first. Then, the root sends the elements to leaf
nodes in the proper order.
- Initially, each leaf has a single data item and
all other nodes are empty. In the upward phase,
each inner node has storage space for two values,
migrating upward from its left and right
sub-trees. There are three cases:
- Contains 2 items: do nothing
- Contains 1 item that came from left (right): get
the smaller item from right (left) child
- Empty: get smaller item from each child
14 Algorithms for Binary Tree
- In the downward phase, each node knows the number
of leaf nodes in its left sub-tree. If the rank
of the element received from above is larger than
the number of leaf nodes to the left, then the
data item is sent to the right; otherwise, it is
sent to the left.
- The figure on the next page shows the upward data
movement (up to the point when the smallest
element is in the root node, ready to begin the
downward movement).
- Because the bisection width of the binary tree
is 1, the above linear-time algorithm cannot be
improved asymptotically: in the worst case, half
of the elements must cross a single link near
the root, one at a time.
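The downward routing rule can be illustrated with a short sketch. It assumes a complete binary tree whose number of leaves is a power of two, 1-based ranks, and that the rank is reduced by the left sub-tree's leaf count whenever the item turns right. That last point is one reading of the rule above, not something the slides state. The name `route_by_rank` is hypothetical.

```python
def route_by_rank(rank, num_leaves):
    """Return the left/right turns ('L'/'R') taken from the root to the target leaf."""
    path = []
    size = num_leaves  # leaves under the current node
    while size > 1:
        left_count = size // 2  # leaves in the left sub-tree (complete tree assumed)
        if rank > left_count:
            path.append("R")
            rank -= left_count  # rank relative to the right sub-tree
            size -= left_count
        else:
            path.append("L")
            size = left_count
    return "".join(path)
```

With 8 leaves, `route_by_rank(5, 8)` returns `"RLL"` and `route_by_rank(4, 8)` returns `"LRR"`.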
15 Algorithms for Binary Tree
16 Algorithms for 2D Mesh
- The linear array algorithms can be used as
building blocks in the 2D mesh algorithms. This
leads to simple algorithms, but not necessarily
the most efficient ones.
- Parallel prefix computation in a 2D mesh can be
done easily in the following three phases,
similar to that in a linear array for the case n > p:
- (1) do a parallel prefix computation on each row,
- (2) do a diminished parallel prefix computation
in the rightmost column, and
- (3) broadcast the results in the rightmost column
to all of the elements in the same rows and
combine with the local prefix values.
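The three phases can be sketched for prefix sums as follows. The sketch assumes the processors are indexed in row-major order and that every row is non-empty; `mesh_prefix_sum` is a hypothetical name.

```python
from itertools import accumulate

def mesh_prefix_sum(grid):
    """Row-major prefix sums on a 2D mesh, following the three phases."""
    # Phase 1: parallel prefix within each row.
    rows = [list(accumulate(row)) for row in grid]
    # Phase 2: diminished prefix down the rightmost column, so row i
    # learns the total of all rows above it.
    offsets = []
    running = 0
    for row in rows:
        offsets.append(running)
        running += row[-1]
    # Phase 3: broadcast each offset along its row and combine locally.
    return [[off + x for x in row] for off, row in zip(offsets, rows)]
```

For example, `mesh_prefix_sum([[1, 2, 3], [4, 5, 6], [7, 8, 9]])` returns `[[1, 3, 6], [10, 15, 21], [28, 36, 45]]`.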
17 Algorithms for 2D Mesh
- The parallel prefix algorithm in a 2D mesh takes
O(√p) unit time on a √p-by-√p mesh, since each
of the three phases takes O(√p) steps.
- Next, we describe, without proof, the simple
version of a sorting algorithm known as shear
sort. The algorithm consists of ⌈lg r⌉ + 1
phases in a 2D mesh with r rows. In each phase
except the last one, all rows are sorted
independently in snakelike order (even-numbered
rows from left to right, odd-numbered rows from
right to left). Then, all columns are sorted
independently from top to bottom. In the final
phase, rows are sorted from left to right.
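A compact sequential sketch of shear sort as described above. It assumes a rectangular, non-empty grid with rows numbered from 0, uses Python's built-in sort as a stand-in for the row and column sorts, and runs ⌈lg r⌉ + 1 phases. The name `shear_sort` is hypothetical.

```python
def shear_sort(grid):
    """Sort a 2D mesh into row-major order using snakelike row sorts and column sorts."""
    g = [list(row) for row in grid]
    r, c = len(g), len(g[0])
    phases = (r - 1).bit_length() + 1  # equals ceil(lg r) + 1 for r >= 1
    for phase in range(phases):
        last = phase == phases - 1
        for i, row in enumerate(g):
            # Snakelike order: odd-numbered rows are sorted right to left,
            # except in the final phase, where every row goes left to right.
            row.sort(reverse=(not last and i % 2 == 1))
        if not last:
            for j in range(c):
                column = sorted(g[i][j] for i in range(r))  # top to bottom
                for i in range(r):
                    g[i][j] = column[i]
    return g
```

For example, `shear_sort([[9, 6, 3], [8, 5, 2], [7, 4, 1]])` returns `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]`.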
18 Algorithms for 2D Mesh
- Using the odd-even transposition algorithm for
row-sort and column-sort in the shear-sort
algorithm, we get that, on a √p-by-√p mesh (with
√p a power of 2), the algorithm needs
√p(lg p + 1) compare-exchange steps for sorting
in row-major order.
- The figure below shows the execution of the
algorithm in a 3-by-3 mesh.
19 Exercise 2
- Given n data items, determine the optimal number
p of processors in a linear array such that, if
the n data items are distributed among the
processors with each holding approximately n/p
elements, the time to perform parallel prefix
computation is minimized.
- Compute the effectiveness measures introduced in
the lecture notes for the parallel prefix
computation algorithm on the linear array, binary
tree, and 2D mesh architectures.
20 Exercise 2
- Shear-sort on a 2D mesh of processors
- Write down the number of compare-exchange steps
required to perform shear-sort in a 2D mesh with
r rows and p/r columns.
- Compute the effectiveness measures for
shear-sort based on the results above.
- Discuss the best row/column ratio that minimizes
the sorting time.
- How would shear-sort work if each processor
initially held more than one key?