Title: Algorithms and Applications
1. Algorithms and Applications
Some Theory Stuff / Sorting Algorithms / Numerical (Matrix) Algorithms / Graph Algorithms / Searching and Optimization
2. Theory reminder
- Big O, Θ, and Ω notation
- Definition of work (cost)
  - work (cost) = parallel time × number of processors
- Cost-optimal algorithm
  - parallel time × number of processors = O(sequential time)
- Optimal parallel time
  - optimal parallel time = sequential time / number of processors
3. Algorithms and Applications
Some Theory Stuff / Sorting Algorithms / Numerical (Matrix) Algorithms / Graph Algorithms / Searching and Optimization
4. Rank Sort using n² processors
Processor i,j:
  if (A[j] < A[i]) res = 1 else res = 0
  Reduce(sendBuf = res, recvBuf = pos, source = Processor i,0,
         group = Processors i,*, operation = +)
Processor i,0:
  B[pos] = A[i]   // send A[i] to Processor pos,0
Time O(log n), Processors n², Cost O(n² log n)
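The scheme above can be sketched sequentially in Python; the double loop stands in for the n² processors, and the inner count plays the role of the Reduce over row i (function and variable names are my own):

```python
def rank_sort(a):
    """Rank sort: the final position of a[i] is the number of elements
    smaller than it (ties broken by index, which keeps the sort stable).
    Processor (i, j) computes one comparison; summing over j is the
    Reduce step, and writing b[pos] is the final send to row pos."""
    n = len(a)
    b = [None] * n
    for i in range(n):
        pos = sum(1 for j in range(n)
                  if a[j] < a[i] or (a[j] == a[i] and j < i))
        b[pos] = a[i]
    return b
```

All n² comparisons are independent, so with n² processors they take one step; the Reduce over each row of n results costs the O(log n) in the time bound.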
5. Complexity of the odd-even sort
- Using n processors
  - n phases, n-1 compare-swap operations in a phase
  - time O(n), cost O(n²)
- Using p processors
  - each processor gets n/p values and sorts them internally in Θ((n/p) log(n/p))
  - after that, p odd-even sorting phases are needed
  - Θ(n/p) operations spent on merging two blocks, and Θ(n/p) for communicating a block
  - p phases: Θ(p × n/p) = Θ(n) merging, Θ(p × n/p) = Θ(n) communication
  - overall: Θ((n/p) log(n/p)) + Θ(n) + Θ(n)
    (local sorting + merging + communication)
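For reference, the odd-even transposition sort analysed above fits in a few lines as a sequential simulation; each inner loop corresponds to one parallel phase:

```python
def odd_even_sort(a):
    """Odd-even transposition sort: n phases over n elements.
    Even phases compare-swap the pairs (0,1), (2,3), ...; odd phases
    the pairs (1,2), (3,4), ...  All swaps inside one phase are
    independent, so with n processors a phase is one parallel step."""
    a = list(a)
    n = len(a)
    for phase in range(n):
        for i in range(phase % 2, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

n phases always suffice, which gives the O(n) time and O(n²) cost with n processors.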
6. 0-1 Sorting Lemma
Lemma: If an oblivious comparison-exchange algorithm sorts all input sets consisting solely of 0s and 1s, then it sorts all input sets with arbitrary values.
Oblivious algorithm: the compare-exchange operations are prespecified, i.e. the comparisons performed do not depend on the outcome of the previous comparisons. Examples: odd-even transposition sort, shearsort.
The lemma allows relatively simple proofs of correctness of oblivious compare-exchange sorting algorithms. The 0-1 Sorting Lemma and the proof of Shearsort complexity are from Leighton's book.
7. 0-1 Sorting Lemma (Proof)
Proof: By contradiction. We show that if an algorithm A fails to sort arbitrary values, then it does not sort all 0-1 sequences. Let the algorithm A fail to sort an input sequence (x_i). Let (a_i) be the correct output sequence and let (b_i) be the output of A. Let k be the smallest index such that b_k differs from a_k, and let l be the position of the value a_k in the output of A. Clearly l > k and a_k < b_k. Now consider the input sequence (x_i) replaced by the sequence (y_i), where y_i = 0 if x_i ≤ a_k, otherwise y_i = 1. Since x_i ≤ x_j implies y_i ≤ y_j for every i and j, the algorithm performs the same compare-exchange operations on the input y as it did on x. In particular, at position k there will be a 1 and at position l there will be a 0. This contradicts the fact that A sorts all 0-1 sequences.

  a_1, a_2, …, a_k, …, a_l, …, a_n    correct output on (x_i)
  b_1, b_2, …, b_k, …, b_l, …, b_n    output of A on (x_i)
  0,   0,   …, 1,   …, 0,   …         output of A on (y_i)
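The lemma can be checked empirically for a small oblivious network: fix the comparator sequence in advance, verify that it sorts every 0-1 input, and observe that it then sorts every permutation as well (the network builder below uses the odd-even transposition pattern from the earlier slides):

```python
from itertools import product, permutations

def oblivious_network(n):
    """Comparator sequence of odd-even transposition sort on n wires.
    The (i, j) pairs are fixed in advance: the algorithm is oblivious."""
    return [(i, i + 1)
            for phase in range(n)
            for i in range(phase % 2, n - 1, 2)]

def apply_network(net, a):
    """Run a prespecified compare-exchange sequence on a list."""
    a = list(a)
    for i, j in net:
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]
    return a

n = 5
net = oblivious_network(n)
# The network sorts every 0-1 input ...
assert all(apply_network(net, bits) == sorted(bits)
           for bits in product([0, 1], repeat=n))
# ... and, as the lemma promises, every arbitrary input as well.
assert all(apply_network(net, p) == sorted(p)
           for p in permutations(range(n)))
```

Checking 2^n 0-1 inputs instead of n! arbitrary ones is exactly the saving the lemma buys in correctness proofs.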
8. Shearsort correctness and complexity
Using the 0-1 Sorting Lemma, we show that Shearsort sorts any 0-1 sequence in log n iterations of the row-and-column sort pair. Consider the situation after the columns have been sorted:
[Figure: a 0-1 matrix whose rows fall into an upper region of all-0 rows, a middle region of dirty rows (mixed 0s and 1s), and a lower region of all-1 rows.]
At the beginning we assume all rows are dirty. We show that after a pair of row and column sorts, the number of dirty rows is at least halved.
9. Shearsort correctness and complexity II
- We show that from every pair of consecutive dirty rows, at least one row becomes clean in the column sort.
- We show this for a specific way to perform the column sort. Since the outcome of the column sort does not depend on the way it is done, the result holds for any column sort algorithm.
- The column sort strategy:
  - compare-exchange odd rows with the consecutive even rows
  - move the resulting clean rows out
  - sort somehow the remaining dirty rows
10. Shearsort correctness and complexity III
[Figure: consecutive pairs of dirty rows after the rows have been sorted in snake order; in each pair one row has more 0s, or more 1s, or the two have equal numbers. After the first step of the column sort (compare-exchange of odd rows with the consecutive even rows), every pair has produced at least one clean row: an all-0 row if the pair had more 0s, an all-1 row if it had more 1s, and one of each if the numbers were equal.]
11. Shearsort correctness and complexity IV
[Example figure: a 0-1 mesh before the first step of the column sort, after the first step, and after moving the newly clean rows out; clean, dirty, and newly cleaned rows are marked.]
Summing together: O(√n log n)   (time to sort a row/column × number of phases)
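The whole procedure can be simulated sequentially in a few lines (a sketch with names of my choosing; the snake order, in which even rows are sorted ascending and odd rows descending, is the order in which Shearsort leaves the mesh sorted):

```python
import math

def shearsort(grid):
    """Shearsort on a k x k mesh: alternate row sorts (even rows
    ascending, odd rows descending, i.e. snake order) with column
    sorts, for ceil(log2 k) + 1 iterations, ending on a row sort."""
    k = len(grid)
    grid = [row[:] for row in grid]
    for _ in range(math.ceil(math.log2(k)) + 1):
        for r in range(k):                      # row phase
            grid[r].sort(reverse=(r % 2 == 1))
        for c in range(k):                      # column phase
            col = sorted(grid[r][c] for r in range(k))
            for r in range(k):
                grid[r][c] = col[r]
    for r in range(k):                          # final row phase
        grid[r].sort(reverse=(r % 2 == 1))
    return grid

def snake(grid):
    """Read the mesh in snake order (Shearsort's sorted order)."""
    return [x for r, row in enumerate(grid)
            for x in (row if r % 2 == 0 else reversed(row))]
```

Each row or column is sorted here with Python's `sort`; on the mesh each such sort costs O(√n) compare-exchange steps, giving the O(√n log n) total above.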
12. Merge sort complexity
- The divide phase
  - at best, the data are already where they are needed
- The merging phase
  - merging two subsequences of size k costs O(k) computation and communication
  - step i of the merging phase merges subsequences of size 2^i, until n is reached
  - total complexity O(2 + 4 + 8 + … + n) = O(n)
- not very parallel: only log n processors can be efficiently utilized
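The merging phase can be pictured as a tree of pairwise merges; merges within one level are independent and could run on distinct processors (my own illustration, not code from the slides):

```python
def merge(left, right):
    """Standard O(k) two-way merge of sorted lists."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def tree_merge_sort(a):
    """Bottom-up merge sort: level i merges runs of size 2^(i-1) in
    pairs.  The last level is a single O(n) merge, which is why only
    about log n processors can be kept busy."""
    runs = [[x] for x in a]
    while len(runs) > 1:
        nxt = [merge(runs[i], runs[i + 1])
               for i in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            nxt.append(runs[-1])
        runs = nxt
    return runs[0] if runs else []
```

The level sizes 2, 4, 8, …, n are exactly the terms of the O(n) sum above.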
13. Sorting Networks
Comparators:
- increasing comparator: (x, y) → (min(x, y), max(x, y))
- decreasing comparator: (x, y) → (max(x, y), min(x, y))
A sorting network consists of input wires, columns of comparators joined by an interconnection network, and output wires.
The rest of the sorting slides follow Kumar's book.
14. Bitonic sort I
- Bitonic sequence
  - A sequence (a_0, a_1, …, a_{n-1}) is bitonic if
    1. there exists i such that (a_0, a_1, …, a_i) is monotonically increasing and (a_i, a_{i+1}, …, a_{n-1}) is monotonically decreasing, or
    2. there exists a cyclic shift of indices such that 1. is satisfied.
- Which of these are bitonic? [examples shown in the slide figure]
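The two-case definition translates directly into a checker (my helper; either the increasing or the decreasing part may be empty):

```python
def is_bitonic(seq):
    """True if some cyclic shift of seq first increases monotonically,
    then decreases monotonically, per the definition above."""
    n = len(seq)
    if n <= 1:
        return True

    def inc_then_dec(s):
        i = 1
        while i < n and s[i - 1] <= s[i]:   # climb the increasing part
            i += 1
        while i < n and s[i - 1] >= s[i]:   # descend the decreasing part
            i += 1
        return i == n                        # consumed the whole sequence?

    return any(inc_then_dec(seq[k:] + seq[:k]) for k in range(n))
```

For example, (1, 2, 4, 7, 6, 0) satisfies case 1 directly, while (8, 9, 2, 1, 0, 4) only becomes increasing-then-decreasing after a cyclic shift.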
15. Bitonic Sort II
- Bitonic split
  - s1 = (min(a_0, a_{n/2}), min(a_1, a_{n/2+1}), …, min(a_{n/2-1}, a_{n-1}))
  - s2 = (max(a_0, a_{n/2}), max(a_1, a_{n/2+1}), …, max(a_{n/2-1}, a_{n-1}))
- Properties of the bitonic split
  - s1 and s2 are bitonic
  - every element of s1 is smaller than every element of s2
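The split and its two properties can be demonstrated on a concrete 16-element bitonic sequence:

```python
def bitonic_split(a):
    """Bitonic split of a sequence of even length n: s1 takes the
    pairwise minima of a[i] and a[i + n/2], s2 the pairwise maxima.
    For bitonic input, s1 and s2 are bitonic and max(s1) <= min(s2)."""
    n = len(a)
    s1 = [min(a[i], a[i + n // 2]) for i in range(n // 2)]
    s2 = [max(a[i], a[i + n // 2]) for i in range(n // 2)]
    return s1, s2

s1, s2 = bitonic_split([3, 5, 8, 9, 10, 12, 14, 20,
                        95, 90, 60, 40, 35, 23, 18, 0])
# Every element of s1 is at most every element of s2:
assert max(s1) <= min(s2)
```

Here s1 = (3, 5, 8, 9, 10, 12, 14, 0) and s2 = (95, 90, 60, 40, 35, 23, 18, 20): both halves are again bitonic, so the split can be applied recursively.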
16. Bitonic Sort III
- Bitonic merge
  - sorts a bitonic sequence in log n steps
[Figure: a bitonic merge of the 16-element bitonic sequence (3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0), shown wire by wire through the log n = 4 recursive split steps down to the sorted output.]
17. Bitonic Sort IV
[Figure: a bitonic sorting network for 16 inputs, built from bitonic merging networks of increasing size: a column of BM2 modules, then BM4, then BM8, and finally a single BM16. The intermediate merges alternate increasing and decreasing order so that each stage feeds bitonic sequences to the next.]
18. Bitonic Sort V
[Example figure: an unsorted 16-element input passing through the BM2, BM4, and BM8 stages of the network, shown wire by wire after each stage until the final sorted output.]
19. Bitonic Sort VI
- Implementation
  - simulate the sorting network
  - a column of comparators can be simulated in parallel
  - how to map processors to comparators so that communication is minimized?
- Complexity (comparisons, ignoring communication)
  - depth(BM2) + depth(BM4) + … + depth(BMn) = sum_{i=1}^{log n} i = O(log² n)
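Putting the pieces together, the full network can be simulated recursively (a sketch assuming a power-of-two input length; function names are mine):

```python
def bitonic_merge(a, ascending=True):
    """Bitonic merge: log n levels of bitonic splits sort a bitonic
    sequence.  The compare-exchanges inside one level are independent,
    so each level is one column of the network (one parallel step)."""
    n = len(a)
    if n == 1:
        return a
    half = n // 2
    for i in range(half):                      # one split level
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return (bitonic_merge(a[:half], ascending)
            + bitonic_merge(a[half:], ascending))

def bitonic_sort(a, ascending=True):
    """Sort by building a bitonic sequence (ascending first half,
    descending second half), then bitonic-merging it.  The network
    depth is sum_{i=1}^{log n} i = O(log^2 n)."""
    n = len(a)
    if n <= 1:
        return list(a)
    half = n // 2
    left = bitonic_sort(a[:half], True)
    right = bitonic_sort(a[half:], False)
    return bitonic_merge(left + right, ascending)
```

Each recursion level of `bitonic_merge` corresponds to one comparator column, which is what makes the network depth, and hence the parallel time, O(log² n).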
20. Bitonic Sort VII
Mapping Bitonic Sort to a Hypercube (n processors)
[Figure: the 16 wires are mapped to hypercube nodes 0000-1111 by their binary labels; the sequence of numbers at each node (e.g. "3, 2, 1" at node 0011, "4, 3, 2, 1" at node 0111) lists the hypercube dimensions along which that node compare-exchanges during the merge steps.]
21. Bitonic Sort VIII
BM16 in a hypercube in detail
[Figure: the four steps of BM16 on a 16-node hypercube; steps 1 through 4 perform compare-exchanges along hypercube dimensions 4, 3, 2, and 1, respectively.]
22. Bitonic Sort VIII
Overall complexity in a hypercube, n processors:
  Tp = O(log² n) + O(log² n)
       (comparisons + communication)
- p processors
  - n/p comparators per process
  - use compare-and-swap operations
  Tp = O((n/p) log(n/p)) + O((n/p) log² p) + O((n/p) log² p)
       (local sort + comparisons + communication)
23. Bitonic Sort IX
BM16 in a mesh in detail
[Figure: the 16 wires 0000-1111 are mapped to a 4×4 mesh in row-major order; steps 1 through 4 of BM16 appear as compare-exchanges between mesh rows and within mesh columns.]
24. Bitonic Sort X
Overall complexity in a mesh, n processors:
  Tp = O(log² n) + O(√n)
       (comparisons + communication)
- p processors
  - n/p comparators per process
  - use compare-and-swap operations
  Tp = O((n/p) log(n/p)) + O((n/p) log² p) + O(n/√p)
       (local sort + comparisons + communication)
25. Parallel Quicksort
- Sequential Quicksort
  - choose a pivot
  - split the sequence into L (< pivot) and R (> pivot)
  - recursively sort L and R
- Naïve Parallel Quicksort
  - parallelise only the last step
- Complexity
  - O(n + n/2 + n/4 + …) = O(n) (average)
  - dominated by the sequential splitting
  - only log n processors can be efficiently used
26. Parallel Quicksort II
- Parallel Quicksort for a shared memory computer
  - every processor gets n/p elements
  - repeat
    - choose a pivot and broadcast it
    - each processor i splits its sequence into L_i (< pivot) and R_i (> pivot)
    - collect all L_i's and R_i's into global L and R
    - split the processors into left and right in the ratio |L| : |R|
    - the left processors recursively sort L, the right processors R
  - until a single processor is left for the whole (reduced) range
  - sort your range sequentially
27. Parallel Quicksort III
[Example figure, first step: 20 elements distributed over processors P0-P4, four per processor. The pivot 7 is selected and broadcast; each processor rearranges its block locally around the pivot, and the blocks are then rearranged globally so that all elements up to 7 precede the rest.]
28. Parallel Quicksort IV
[Example figure, second step: the left processor group recurses on the elements below the first pivot with pivot 5, while the right group recurses on the remaining elements with pivot 17; again each processor rearranges locally, followed by a global rearrangement within each group.]
29. Parallel Quicksort V
[Example figure, third step: the middle range is split once more around pivot 11, with local rearrangement on each processor followed by a global rearrangement.]
30. Parallel Quicksort VI
[Example figure, fourth step: each remaining processor (P2, P3) now owns a single range and sorts it locally. Solution: the concatenation of the ranges of P0 through P4 is the fully sorted sequence 1, 2, …, 20.]
31. Parallel Quicksort VII
- Complexity analysis
  - selecting the pivot: O(1)
  - broadcasting the pivot: O(log p)
  - local rearrangement: O(n/p)
  - global rearrangement: O(log p + n/p)
  - multiply by log p iterations
  - final local sequential sort: O((n/p) log(n/p))
- Overall complexity
  - O((n/p) log(n/p)) + O((n/p) log p) + O(log² p)
    (local sort + local rearrangement and global moving + broadcasting and Scan())
- Global rearrangement via prefix sums:
    lDest = Scan(|L_i|, +, …)
    rDest = Scan(|R_i|, +, …)
    copyElements(A[lDest], L_i, |L_i|)
    copyElements(A[lSize + rDest], R_i, |R_i|)
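The Scan-based addressing can be illustrated concretely: an exclusive prefix sum over the block sizes gives every processor the offset at which to write its piece, so all copies can proceed in parallel without conflicts (`global_rearrange` and the list-of-blocks interface are my own):

```python
from itertools import accumulate

def global_rearrange(L_blocks, R_blocks):
    """Global rearrangement step: exclusive prefix sums (Scan) over the
    |L_i| and |R_i| give each processor the offsets lDest and rDest at
    which it copies its pieces into the shared array A; every R_i piece
    starts at lSize = sum of all |L_i|."""
    lsizes = [len(b) for b in L_blocks]
    rsizes = [len(b) for b in R_blocks]
    ldest = [0] + list(accumulate(lsizes))[:-1]   # exclusive Scan
    rdest = [0] + list(accumulate(rsizes))[:-1]
    lsize = sum(lsizes)
    A = [None] * (lsize + sum(rsizes))
    for i, blk in enumerate(L_blocks):            # copyElements(A[lDest_i], L_i)
        A[ldest[i]:ldest[i] + len(blk)] = blk
    for i, blk in enumerate(R_blocks):            # copyElements(A[lSize + rDest_i], R_i)
        A[lsize + rdest[i]:lsize + rdest[i] + len(blk)] = blk
    return A
```

Because the offsets are disjoint by construction, the per-processor copies are independent, which is what keeps the rearrangement at O(log p) for the Scan plus O(n/p) for the copying.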
32. Parallel Quicksort VIII
- Message passing implementation
  - more complications with explicitly moving the data around
  - the main complication is in the global rearrangement phase
    - each process may need to send its L_i and R_i to several other processes
    - each process may receive its new L_i and R_i from several other processes
    - the destination of the pieces of L_i and R_i (where to send them in the global rearrangement) consists of a destination process and an address within that process
    - all-to-all communication may be necessary
  - asymptotic complexity remains the same
33. Sorting Conclusions
- Parallel rank sort: the only non-compare-exchange algorithm covered
- Odd-even transposition sort: O(n) time with n processors
- Shearsort: 2D mesh, O(√n log n) time with n processors; proved via the 0-1 Sorting Lemma
- Naïve parallel merge sort: O(n) time with n processors
- Sorting networks; Bitonic sort: O(log² n) time with n processors, hypercube and mesh implementations
- Quicksort: O(log² n) time with n processors, shared memory and message-passing implementations