Title: Programming Paradigms and Algorithms
1. Programming Paradigms and Algorithms
- WA 3.1, 3.2, p. 178, 5.1, 5.3.3, Chapter 6, 9.2.8, 10.4.1
- Kumar 12.1.3
- 1. Berman, F., Wolski, R., Figueira, S., Schopf, J. and Shao, G., "Application-Level Scheduling on Distributed Heterogeneous Networks," Proceedings of Supercomputing '96 (http://apples.ucsd.edu)
2. Common Parallel Programming Paradigms
- Embarrassingly parallel programs
- Workqueue
- Master/Slave programs
- Monte Carlo methods
- Regular, Iterative (Stencil) Computations
- Pipelined Computations
- Synchronous Computations
3. Pipelined Computations
- Pipelined program divided into a series of tasks that have to be completed one after the other.
- Each task executed by a separate pipeline stage.
- Data streamed from stage to stage to form the computation.
4. Pipelined Computations
- Computation consists of data streaming through pipeline stages.
- Execution time = time to fill pipeline (P-1) + time to run in steady state (N-P+1) + time to empty pipeline (P-1); see the sketch below.
- P = number of processors, N = number of data items (assume P < N).
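A quick way to check the N+P-1 count is a small discrete-time simulation (an illustrative Python sketch, not from the original slides). Each stage takes one unit of time, and stage s of item i can start only after stage s-1 of the same item and stage s of the previous item have finished:

    def pipeline_steps(N, P):
        """Finish time (in unit steps) of N items flowing through P unit-time stages."""
        finish = [[0] * P for _ in range(N)]
        for i in range(N):            # items in arrival order
            for s in range(P):        # stages in pipeline order
                ready = max(finish[i][s - 1] if s > 0 else 0,
                            finish[i - 1][s] if i > 0 else 0)
                finish[i][s] = ready + 1
        return finish[N - 1][P - 1]

    # fill (P-1) + steady state (N-P+1) + drain (P-1) = N + P - 1
    assert pipeline_steps(N=100, P=8) == 100 + 8 - 1

The simulated finish time matches the closed-form breakdown on the slide.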
5. Pipelined Example: Sieve of Eratosthenes
- Goal is to take a list of integers greater than 1 and produce a list of primes.
- E.g. for input 2 3 4 5 6 7 8 9 10, the output is 2 3 5 7.
- Fran's pipelined approach (a little different than the book):
- Processor Pi divides each input by the ith prime.
- If the input is divisible (and not equal to the divisor), it is marked (with a negative sign) and forwarded.
- If the input is not divisible, it is forwarded unchanged.
- The last processor only forwards unmarked (positive) data: the primes.
6. Sieve of Eratosthenes Pseudo-Code
- Code for the last processor
  - x = recv(data, P_(i-1))
  - If x > 0 then send(x, OUTPUT)
- Code for processor Pi (with prime p_i)
  - x = recv(data, P_(i-1))
  - If (x > 0 and x != p_i) then
    - If (p_i divides x) then send(-x, P_(i+1))
    - If (p_i does not divide x) then send(x, P_(i+1))
  - Else
    - send(x, P_(i+1))
- Example: as the stream 6 5 4 3 2 enters the pipeline and 4 passes P1 (prime 2), it is marked: 6 5 -4 3 2. A runnable sketch of the whole pipeline follows.
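Below is a minimal runnable sketch of the same dataflow, assuming a single-process pipeline of Python generators rather than message-passing processors (function names are illustrative); each generator plays the role of one processor P_i:

    def stage(prime, upstream):
        """Pipeline stage for one prime: mark (negate) multiples, forward everything."""
        for x in upstream:
            if x > 0 and x != prime and x % prime == 0:
                yield -x          # marked: divisible by this stage's prime
            else:
                yield x           # forwarded unchanged

    def last_stage(upstream):
        """Final stage: only forward unmarked (positive) values, i.e. the primes."""
        for x in upstream:
            if x > 0:
                yield x

    def pipelined_sieve(data, primes):
        stream = iter(data)
        for p in primes:
            stream = stage(p, stream)
        return list(last_stage(stream))

    print(pipelined_sieve(range(2, 11), [2, 3, 5]))   # [2, 3, 5, 7]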
7. Programming Issues
- The algorithm will take N+P-1 steps to run, where N is the number of data items and P is the number of processors.
- Can consider just the odds, or do some initial part separately.
- In the given implementation, there must be a processor for every prime that will appear in the sequence.
- Not a scalable approach.
- Can fix this by having each processor do the job of multiple primes, i.e. mapping several logical processors in the pipeline to each physical processor (sketched below).
- What is the impact of this on performance?
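One way to picture the fix is to give each physical stage a block of consecutive primes, as in this sketch (illustrative Python generators, not the lecture's implementation):

    def block_stage(primes, upstream):
        """One physical processor doing the work of several logical pipeline stages:
        it checks divisibility against a whole block of primes before forwarding."""
        for x in upstream:
            for p in primes:
                if x > 0 and x != p and x % p == 0:
                    x = -x               # mark: divisible by one of this block's primes
                    break
            yield x

    def sieve_on_p_processors(data, primes, P):
        """Map ceil(len(primes)/P) consecutive logical stages onto each of P physical stages."""
        per_proc = -(-len(primes) // P)            # ceiling division
        stream = iter(data)
        for k in range(0, len(primes), per_proc):
            stream = block_stage(primes[k:k + per_proc], stream)
        return [x for x in stream if x > 0]        # final filter = the last processor's job

    print(sieve_on_p_processors(range(2, 31), [2, 3, 5], P=2))
    # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

Each physical stage now performs up to ceil(#primes / P) divisions per item, so the steady-state time per pipeline step grows by roughly that factor even though the number of processors no longer depends on the number of primes.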
8. More Programming Issues
- In the pipelined algorithm, the flow of data moves through the processors in lockstep; attempt to balance the work so that there is no bottleneck at any processor.
- In the mid-80s, processors were developed to support this kind of parallel pipelined computation in hardware.
- Two commercial products from Intel: Warp (1D array) and iWarp (components for a 2D array).
- Warp and iWarp were meant to operate synchronously; the Wavefront Array Processor (S.Y. Kung) was meant to operate asynchronously, i.e. the arrival of data would signal that it was time to execute.
9. Systolic Arrays
- Warp and iWarp were examples of systolic arrays.
- Systolic means regular and rhythmic; data was supposed to move through pipelined computational units in a regular and rhythmic fashion.
- Systolic arrays were meant to be special-purpose processors or co-processors and were very fine-grained.
- The processors, usually called cells, implement a limited and very simple computation.
- Communication is very fast; granularity meant to be around 1 (one update costs about one communication)!
10. Systolic Algorithms
- Systolic arrays were built to support systolic algorithms, a hot area of research in the early 80s.
- Systolic algorithms used pipelining through various kinds of arrays to accomplish computational goals.
- Some of the data streaming and applications were very creative and quite complex.
- CMU was a hotbed of systolic algorithm and array research (especially H.T. Kung and his group).
11. Example Systolic Algorithm: Matrix Multiplication
- Problem: multiply two n x n matrices A = (a_ij) and B = (b_ij). The product matrix will be R = (r_ij).
- The systolic solution uses a 2D array with n x n cells, 2 input streams and 2 output streams.
12. Systolic Matrix Multiplication
- Figure: a 4 x 4 array of cells P11 through P44. The rows of A (a_11 ... a_44) stream in from the left, one row per array row; the columns of B (b_11 ... b_44) stream in from the top, one column per array column. Successive rows and columns are skewed by one time step.
13. Operation at each cell
- Each cell (i,j) holds a running result r_ij, initialized to 0, and updates it at each time step: r_ij = r_ij + a * b, where a is the value arriving from the left and b is the value arriving from above; a is then passed to the right and b is passed downward (simulated below).
14-18. Data Flow for Systolic MM
- Figures: successive time steps of the skewed A and B streams moving through the array.
19. Programming Issues
- Performance of systolic algorithms is based on fine granularity (1 update costs about the same as a communication) and regular dataflow.
- They can be done on asynchronous platforms with tagging, but one must ensure that idle time does not dominate the computation.
- Many systolic algorithms may not map well to more general MIMD or distributed platforms.
20. Synchronous Computations
- Synchronous computations have the form
  - (Barrier)
  - Computation
  - Barrier
  - Computation
  - ...
- The frequency of the barrier and the homogeneity of the intervening computations on the processors may vary.
- We've seen several synchronous computations already (Jacobi2D, Parallel Prefix, Systolic MM).
21. Synchronous Computations
- Synchronous computations can be simulated using asynchronous programming models.
- Iterations can be tagged so that the appropriate data is combined.
- Performance of such computations depends on the granularity of the platform, how expensive synchronizations are, and how much time is spent idle waiting for the right data to arrive.
22. Barrier Synchronizations
- Barrier synchronizations can be implemented in many ways:
  - As part of the algorithm
  - As part of the communication library (PVM and MPI have barrier operations)
  - In hardware
- Implementations vary; a thread-level sketch of the basic pattern follows.
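As a minimal illustration of the compute-then-wait structure (a thread-level sketch, not PVM/MPI; names are illustrative), Python's threading.Barrier can play the role of the library barrier:

    import threading

    NTHREADS, STEPS = 4, 3
    barrier = threading.Barrier(NTHREADS)
    partial = [0] * NTHREADS

    def worker(tid):
        for step in range(STEPS):
            partial[tid] += tid + 1      # computation phase (a stand-in for real work)
            barrier.wait()               # nobody starts the next step until all finish this one

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(NTHREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(partial)                       # [3, 6, 9, 12]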
23. Synchronous Computation Example: Bitonic Sort
- Bitonic Sort is an interesting example of a synchronous algorithm.
- The computation proceeds in stages where each stage is a (smaller or larger) shuffle-exchange network.
- There is a barrier synchronization at each stage.
24. Bitonic Sort
- A bitonic sequence is a list of keys a_0, a_1, ..., a_(n-1) such that either
  - 1) for some i, the keys have the ordering a_0 <= a_1 <= ... <= a_i >= a_(i+1) >= ... >= a_(n-1), or
  - 2) the sequence can be shifted cyclically so that 1) holds.
- A small checker below makes the definition concrete.
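The following checker is an illustrative Python sketch (not from the original slides): a sequence is bitonic if it, or some cyclic rotation of it, first rises and then falls.

    def is_bitonic(seq):
        """True if seq (or some cyclic rotation of it) rises then falls."""
        n = len(seq)
        for shift in range(n):
            rotated = seq[shift:] + seq[:shift]
            i = 0
            while i + 1 < n and rotated[i] <= rotated[i + 1]:
                i += 1                   # climb to the peak
            if all(rotated[j] >= rotated[j + 1] for j in range(i, n - 1)):
                return True              # non-increasing after the peak
        return False

    print(is_bitonic([1, 4, 6, 8, 3, 2]))   # True  (rises then falls)
    print(is_bitonic([6, 9, 4, 2, 3, 5]))   # True  (a cyclic shift of 2 3 5 6 9 4)
    print(is_bitonic([1, 3, 2, 4]))         # False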
25. Bitonic Sort Algorithm
- The bitonic sort algorithm recursively calls two procedures:
  - BSORT(i,j,X) takes the bitonic sequence a_i, ..., a_j and produces a non-decreasing (X = +) or non-increasing (X = -) sorted sequence.
  - BITONIC(i,j) takes the unsorted sequence a_i, ..., a_j and produces a bitonic sequence.
- The main algorithm is then
  - BITONIC(0,n-1)
  - BSORT(0,n-1,+)
26. How does it do this?
- We'll show how BSORT and BITONIC work, but first consider an interesting property of bitonic sequences.
- Assume that a_0, a_1, ..., a_(n-1) is bitonic and that n is even. Let d_i = min(a_i, a_(i+n/2)) and e_i = max(a_i, a_(i+n/2)) for 0 <= i < n/2.
- Then (d_0, ..., d_(n/2-1)) and (e_0, ..., e_(n/2-1)) are both bitonic sequences, and d_i <= e_j for all i, j (demonstrated below).
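The property is easy to see on an example (illustrative Python; the helper name is made up):

    def bitonic_split(seq):
        """Split a bitonic sequence of even length into its element-wise
        min half and max half (the property stated above)."""
        half = len(seq) // 2
        mins = [min(seq[i], seq[i + half]) for i in range(half)]
        maxs = [max(seq[i], seq[i + half]) for i in range(half)]
        return mins, maxs

    mins, maxs = bitonic_split([3, 5, 8, 9, 7, 4, 2, 1])
    print(mins, maxs)                      # [3, 4, 2, 1] [7, 5, 8, 9]
    print(max(mins) <= min(maxs))          # True: every min <= every max

Both returned halves are again bitonic, and every key in the min half is at most every key in the max half.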
27-29. Picture Proof of Interesting Property
- Figures: a picture proof of the property above (three successive slides).
30. Back to Bitonic Sort
- Remember:
  - BSORT(i,j,X) takes the bitonic sequence a_i, ..., a_j and produces a non-decreasing (X = +) or non-increasing (X = -) sorted sequence.
  - BITONIC(i,j) takes the unsorted sequence a_i, ..., a_j and produces a bitonic sequence.
- Let's look at BSORT first.
- Figure: the bitonic input is split into a "min" half and a "max" half, each of which is again bitonic.
31. Here's where the shuffle-exchange comes in
- The shuffle-exchange network routes the data correctly for comparison.
- At each shuffle stage, a switch can be used to separate B1 and B2 (the two bitonic halves).
32. Sort bitonic subsequences to get a sorted sequence
- BSORT(i,j,X)
  - If j-i < 2 then return min(a_i, a_(i+1)), max(a_i, a_(i+1)) (in reverse order if X = -)
  - Else
    - Shuffle(i,j,X)
    - Unshuffle(i,j)
    - Pardo
      - BSORT(i, i+(j-i+1)/2 - 1, X)
      - BSORT(i+(j-i+1)/2, j, X)
- Figure: shuffle, compare-exchange, unshuffle; one half then holds the mins and the other the maxs, and each half is sorted recursively.
33. BITONIC takes an unsorted sequence as input and returns a bitonic sequence
- BITONIC(i,j)
  - If j-i < 2 then return a_i, a_(i+1) (note that any 2 keys are already a bitonic sequence)
  - Else
    - Pardo
      - BITONIC(i, i+(j-i+1)/2 - 1); BSORT(i, i+(j-i+1)/2 - 1, +) (sort the first half non-decreasing)
      - BITONIC(i+(j-i+1)/2, j); BSORT(i+(j-i+1)/2, j, -) (sort the second half non-increasing)
- Figure: the unsorted input passes through 2-way, 4-way, and 8-way bitonic stages; the first half is sorted ascending and the second half descending.
34. Putting it all together
- Figure: the complete network built from 2-key compare-exchange boxes (outputs min(a,b) and max(a,b)), taking the unsorted input to an 8-way bitonic sequence that is then sorted. A runnable sketch of the full algorithm follows.
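Here is a runnable sequential sketch of the whole algorithm (illustrative Python; the "pardo" branches are simply run one after the other, and n is assumed to be a power of two):

    def bsort(a, lo, hi, ascending):
        """BSORT: sort the bitonic subsequence a[lo..hi] (inclusive) in the given direction."""
        n = hi - lo + 1
        if n < 2:
            return
        half = n // 2
        for i in range(lo, lo + half):
            # compare-exchange across the two halves (the bitonic split)
            if (a[i] > a[i + half]) == ascending:
                a[i], a[i + half] = a[i + half], a[i]
        bsort(a, lo, lo + half - 1, ascending)   # both halves are bitonic, and every key
        bsort(a, lo + half, hi, ascending)       # in one half is <= every key in the other

    def bitonic(a, lo, hi):
        """BITONIC: turn the unsorted a[lo..hi] into a bitonic sequence."""
        n = hi - lo + 1
        if n < 2:
            return                               # any 2 keys are already bitonic
        half = n // 2
        bitonic(a, lo, lo + half - 1)
        bsort(a, lo, lo + half - 1, True)        # first half non-decreasing
        bitonic(a, lo + half, hi)
        bsort(a, lo + half, hi, False)           # second half non-increasing -> whole range bitonic

    def bitonic_sort(a):
        bitonic(a, 0, len(a) - 1)                # BITONIC(0, n-1)
        bsort(a, 0, len(a) - 1, True)            # BSORT(0, n-1, +)
        return a

    print(bitonic_sort([6, 2, 7, 1, 8, 4, 3, 5]))    # [1, 2, 3, 4, 5, 6, 7, 8]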
35. Complexity of Bitonic Sort
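A reminder of the standard count (an added sketch, for n = 2^k keys): the m-th merge stage consists of m compare-exchange levels, each using n/2 comparators, so

    \text{depth}(n) = \sum_{m=1}^{\log_2 n} m = \frac{\log_2 n\,(\log_2 n + 1)}{2} = O(\log^2 n)
    \text{comparators}(n) = \frac{n}{2} \cdot \text{depth}(n) = O(n \log^2 n)

i.e. O(log^2 n) parallel steps with n/2 comparators per step, and O(n log^2 n) total comparisons.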
36. Programming Issues
- The flow of data is assumed to transfer from stage to stage synchronously; the usual performance issues arise if the algorithm is executed asynchronously.
- Note that the logical interconnect is different for each problem size.
- Bitonic sort must be mapped efficiently to the target platform.
- Unless the granularity of the platform is very fine, multiple comparators will be mapped to each processor.
37. 1-1 Mappings of Bitonic Sort
- Bitonic sort on a hypercube:
- Each shuffle and unshuffle connection compares keys which differ in a single bit.
- These keys can be compared over single hypercube edges.
- Figure: 2-way, 4-way, and 8-way shuffle stages.
38. 1-1 Mappings of Bitonic Sort
- Bitonic sort on a multistage full shuffle:
- Small shuffles do not map 1-1 to larger shuffles!
- Stone used a clever approach to map logical stages into full-sized shuffle stages while preserving O(log^2 n) complexity.
39. Outline of Stone's Method
- Pivot bit = the index being shuffled.
- Stone noticed that for successive stages, the pivot bits follow a regular pattern: 0; 1, 0; 2, 1, 0; ...; k-1, ..., 1, 0.
- If the pivot bit is in place, each subsequent stage can be done using a full-sized shuffle (a_0 done with a single comparator).
- For pivot bit j, k-j full shuffles are needed to position bit j for comparison.
- Complexity of Stone's method: still O(log^2 n), as noted on the previous slide.
40. Many-one Mappings of Bitonic Sort
- For platforms where granularity is coarser, it will be more cost-efficient to map multiple comparators to one processor.
- Several possible conventional mappings.
- Compare-split provides another approach.
41. Compare-Split
- For a block of keys, we may want to use a compare-split operation (rather than compare-exchange) to accommodate multiple keys at a processor.
- The idea is to assume that each processor is assigned a block of keys, rather than 2 keys.
- Blocks are already sorted with a sequential sort.
- To perform compare-split, a processor compares blocks and returns the smaller half of the aggregate keys as the min block and the larger half of the aggregate keys as the max block.
- Figure: Block A and Block B enter a compare-split, which outputs the Min Block and the Max Block (sketched in code below).
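A compare-split is essentially a merge followed by a split down the middle; a minimal sketch (illustrative Python, block contents assumed already sorted):

    def compare_split(block_a, block_b):
        """Compare-split of two locally sorted blocks: return (min block, max block),
        each of the original block size, taken from the merged keys."""
        merged = sorted(block_a + block_b)   # in practice a linear merge, since both inputs are sorted
        half = len(block_a)
        return merged[:half], merged[half:]

    lo, hi = compare_split([2, 5, 8, 11], [1, 4, 9, 13])
    print(lo, hi)   # [1, 2, 4, 5] [8, 9, 11, 13]

Each processor then keeps one of the two halves, matching the direction of the compare-exchange it replaces.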
42. Performance
- Which mapping is best?
- Figure: performance comparison of the compare-split, block, and row mappings.