Title: Programming Paradigms and Algorithms
1. Programming Paradigms and Algorithms
- WA 3.1, 3.2, p. 178, 5.1, 5.3.3, Chapter 6, 9.2.8, 10.4.1
- Kumar 12.1.3
- 1. Berman, F., Wolski, R., Figueira, S., Schopf, J. and Shao, G., "Application-Level Scheduling on Distributed Heterogeneous Networks," Proceedings of Supercomputing '96 (http://apples.ucsd.edu)
2. Common Parallel Programming Paradigms
- Embarrassingly parallel programs
- Workqueue
- Master/Slave programs
- Monte Carlo methods
- Regular, Iterative (Stencil) Computations
- Pipelined Computations
- Synchronous Computations
3. Pipelined Computations
- Pipelined program divided into a series of tasks that have to be completed one after the other.
- Each task executed by a separate pipeline stage.
- Data streamed from stage to stage to form the computation.
4. Pipelined Computations
- Computation consists of data streaming through pipeline stages.
- Execution time = time to fill pipeline (P-1) + time to run in steady state (N-P+1) + time to empty pipeline (P-1); see the sketch below.
- P = number of processors, N = number of data items (assume P < N).
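A quick way to check the N+P-1 count is a small discrete-time simulation (an illustrative Python sketch, not from the original slides). Each stage takes one unit of time, and stage s of item i can start only after stage s-1 of the same item and stage s of the previous item have finished:

    def pipeline_steps(N, P):
        """Finish time (in unit steps) of N items flowing through P unit-time stages."""
        finish = [[0] * P for _ in range(N)]
        for i in range(N):            # items in arrival order
            for s in range(P):        # stages in pipeline order
                ready = max(finish[i][s - 1] if s > 0 else 0,
                            finish[i - 1][s] if i > 0 else 0)
                finish[i][s] = ready + 1
        return finish[N - 1][P - 1]

    # fill (P-1) + steady state (N-P+1) + drain (P-1) = N + P - 1
    assert pipeline_steps(N=100, P=8) == 100 + 8 - 1

The simulated finish time matches the closed-form breakdown on the slide.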
5. Pipelined Example: Sieve of Eratosthenes
- Goal is to take a list of integers greater than 1 and produce a list of primes.
- E.g. for input 2 3 4 5 6 7 8 9 10, the output is 2 3 5 7.
- Fran's pipelined approach (a little different than the book):
- Processor Pi divides each input by the ith prime.
- If the input is divisible (and not equal to the divisor), it is marked (with a negative sign) and forwarded.
- If the input is not divisible, it is forwarded unchanged.
- The last processor only forwards unmarked (positive) data: the primes.
6. Sieve of Eratosthenes Pseudo-Code
- Code for the last processor
  - x = recv(data, P_(i-1))
  - If x > 0 then send(x, OUTPUT)
- Code for processor Pi (with prime p_i)
  - x = recv(data, P_(i-1))
  - If (x > 0 and x != p_i) then
    - If (p_i divides x) then send(-x, P_(i+1))
    - If (p_i does not divide x) then send(x, P_(i+1))
  - Else
    - send(x, P_(i+1))
- Example: as the stream 6 5 4 3 2 enters the pipeline and 4 passes P1 (prime 2), it is marked: 6 5 -4 3 2. A runnable sketch of the whole pipeline follows.
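Below is a minimal runnable sketch of the same dataflow, assuming a single-process pipeline of Python generators rather than message-passing processors (function names are illustrative); each generator plays the role of one processor P_i:

    def stage(prime, upstream):
        """Pipeline stage for one prime: mark (negate) multiples, forward everything."""
        for x in upstream:
            if x > 0 and x != prime and x % prime == 0:
                yield -x          # marked: divisible by this stage's prime
            else:
                yield x           # forwarded unchanged

    def last_stage(upstream):
        """Final stage: only forward unmarked (positive) values, i.e. the primes."""
        for x in upstream:
            if x > 0:
                yield x

    def pipelined_sieve(data, primes):
        stream = iter(data)
        for p in primes:
            stream = stage(p, stream)
        return list(last_stage(stream))

    print(pipelined_sieve(range(2, 11), [2, 3, 5]))   # [2, 3, 5, 7]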
7. Programming Issues
- The algorithm will take N+P-1 steps to run, where N is the number of data items and P is the number of processors.
- Can consider just the odds, or do some initial part separately.
- In the given implementation, there must be a processor for every prime that will appear in the sequence.
- Not a scalable approach.
- Can fix this by having each processor do the job of multiple primes, i.e. mapping several logical processors in the pipeline to each physical processor (sketched below).
- What is the impact of this on performance?
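One way to picture the fix is to give each physical stage a block of consecutive primes, as in this sketch (illustrative Python generators, not the lecture's implementation):

    def block_stage(primes, upstream):
        """One physical processor doing the work of several logical pipeline stages:
        it checks divisibility against a whole block of primes before forwarding."""
        for x in upstream:
            for p in primes:
                if x > 0 and x != p and x % p == 0:
                    x = -x               # mark: divisible by one of this block's primes
                    break
            yield x

    def sieve_on_p_processors(data, primes, P):
        """Map ceil(len(primes)/P) consecutive logical stages onto each of P physical stages."""
        per_proc = -(-len(primes) // P)            # ceiling division
        stream = iter(data)
        for k in range(0, len(primes), per_proc):
            stream = block_stage(primes[k:k + per_proc], stream)
        return [x for x in stream if x > 0]        # final filter = the last processor's job

    print(sieve_on_p_processors(range(2, 31), [2, 3, 5], P=2))
    # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

Each physical stage now performs up to ceil(#primes / P) divisions per item, so the steady-state time per pipeline step grows by roughly that factor even though the number of processors no longer depends on the number of primes.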
8. More Programming Issues
- In the pipelined algorithm, the flow of data moves through the processors in lockstep; attempt to balance the work so that there is no bottleneck at any processor.
- In the mid-80s, processors were developed to support this kind of parallel pipelined computation in hardware.
- Two commercial products from Intel: Warp (1D array) and iWarp (components for a 2D array).
- Warp and iWarp were meant to operate synchronously; the Wavefront Array Processor (S.Y. Kung) was meant to operate asynchronously, i.e. the arrival of data would signal that it was time to execute.
9. Systolic Arrays
- Warp and iWarp were examples of systolic arrays.
- Systolic means regular and rhythmic; data was supposed to move through pipelined computational units in a regular and rhythmic fashion.
- Systolic arrays were meant to be special-purpose processors or co-processors and were very fine-grained.
- The processors, usually called cells, implement a limited and very simple computation.
- Communication is very fast; granularity meant to be around 1 (one update costs about one communication)!
10. Systolic Algorithms
- Systolic arrays were built to support systolic algorithms, a hot area of research in the early 80s.
- Systolic algorithms used pipelining through various kinds of arrays to accomplish computational goals.
- Some of the data streaming and applications were very creative and quite complex.
- CMU was a hotbed of systolic algorithm and array research (especially H.T. Kung and his group).
11. Example Systolic Algorithm: Matrix Multiplication
- Problem: multiply two n x n matrices A = (a_ij) and B = (b_ij). The product matrix will be R = (r_ij).
- The systolic solution uses a 2D array with n x n cells, 2 input streams and 2 output streams.
12. Systolic Matrix Multiplication
- Figure: a 4 x 4 array of cells P11 through P44. The rows of A (a_11 ... a_44) stream in from the left, one row per array row; the columns of B (b_11 ... b_44) stream in from the top, one column per array column. Successive rows and columns are skewed by one time step.
13. Operation at each cell
- Each cell (i,j) holds a running result r_ij, initialized to 0, and updates it at each time step: r_ij = r_ij + a * b, where a is the value arriving from the left and b is the value arriving from above; a is then passed to the right and b is passed downward (simulated below).
14-18. Data Flow for Systolic MM
- Figures: successive time steps of the skewed A and B streams moving through the array.
19. Programming Issues
- Performance of systolic algorithms is based on fine granularity (1 update costs about the same as a communication) and regular dataflow.
- They can be done on asynchronous platforms with tagging, but one must ensure that idle time does not dominate the computation.
- Many systolic algorithms may not map well to more general MIMD or distributed platforms.
20. Synchronous Computations
- Synchronous computations have the form
  - (Barrier)
  - Computation
  - Barrier
  - Computation
  - ...
- The frequency of the barrier and the homogeneity of the intervening computations on the processors may vary.
- We've seen several synchronous computations already (Jacobi2D, Parallel Prefix, Systolic MM).
21. Synchronous Computations
- Synchronous computations can be simulated using asynchronous programming models.
- Iterations can be tagged so that the appropriate data is combined.
- Performance of such computations depends on the granularity of the platform, how expensive synchronizations are, and how much time is spent idle waiting for the right data to arrive.
22. Barrier Synchronizations
- Barrier synchronizations can be implemented in many ways:
  - As part of the algorithm
  - As part of the communication library (PVM and MPI have barrier operations)
  - In hardware
- Implementations vary; a thread-level sketch of the basic pattern follows.
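As a minimal illustration of the compute-then-wait structure (a thread-level sketch, not PVM/MPI; names are illustrative), Python's threading.Barrier can play the role of the library barrier:

    import threading

    NTHREADS, STEPS = 4, 3
    barrier = threading.Barrier(NTHREADS)
    partial = [0] * NTHREADS

    def worker(tid):
        for step in range(STEPS):
            partial[tid] += tid + 1      # computation phase (a stand-in for real work)
            barrier.wait()               # nobody starts the next step until all finish this one

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(NTHREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(partial)                       # [3, 6, 9, 12]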
23. Synchronous Computation Example: Bitonic Sort
- Bitonic Sort is an interesting example of a synchronous algorithm.
- The computation proceeds in stages where each stage is a (smaller or larger) shuffle-exchange network.
- There is a barrier synchronization at each stage.
24. Bitonic Sort
- A bitonic sequence is a list of keys a_0, a_1, ..., a_(n-1) such that either
  - 1) for some i, the keys have the ordering a_0 <= a_1 <= ... <= a_i >= a_(i+1) >= ... >= a_(n-1), or
  - 2) the sequence can be shifted cyclically so that 1) holds.
- A small checker below makes the definition concrete.
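The following checker is an illustrative Python sketch (not from the original slides): a sequence is bitonic if it, or some cyclic rotation of it, first rises and then falls.

    def is_bitonic(seq):
        """True if seq (or some cyclic rotation of it) rises then falls."""
        n = len(seq)
        for shift in range(n):
            rotated = seq[shift:] + seq[:shift]
            i = 0
            while i + 1 < n and rotated[i] <= rotated[i + 1]:
                i += 1                   # climb to the peak
            if all(rotated[j] >= rotated[j + 1] for j in range(i, n - 1)):
                return True              # non-increasing after the peak
        return False

    print(is_bitonic([1, 4, 6, 8, 3, 2]))   # True  (rises then falls)
    print(is_bitonic([6, 9, 4, 2, 3, 5]))   # True  (a cyclic shift of 2 3 5 6 9 4)
    print(is_bitonic([1, 3, 2, 4]))         # False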
25. Bitonic Sort Algorithm
- The bitonic sort algorithm recursively calls two procedures:
  - BSORT(i,j,X) takes the bitonic sequence a_i, ..., a_j and produces a non-decreasing (X = +) or non-increasing (X = -) sorted sequence.
  - BITONIC(i,j) takes the unsorted sequence a_i, ..., a_j and produces a bitonic sequence.
- The main algorithm is then
  - BITONIC(0,n-1)
  - BSORT(0,n-1,+)
26. How does it do this?
- We'll show how BSORT and BITONIC work, but first consider an interesting property of bitonic sequences.
- Assume that a_0, a_1, ..., a_(n-1) is bitonic and that n is even. Let d_i = min(a_i, a_(i+n/2)) and e_i = max(a_i, a_(i+n/2)) for 0 <= i < n/2.
- Then (d_0, ..., d_(n/2-1)) and (e_0, ..., e_(n/2-1)) are both bitonic sequences, and d_i <= e_j for all i, j (demonstrated below).
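The property is easy to see on an example (illustrative Python; the helper name is made up):

    def bitonic_split(seq):
        """Split a bitonic sequence of even length into its element-wise
        min half and max half (the property stated above)."""
        half = len(seq) // 2
        mins = [min(seq[i], seq[i + half]) for i in range(half)]
        maxs = [max(seq[i], seq[i + half]) for i in range(half)]
        return mins, maxs

    mins, maxs = bitonic_split([3, 5, 8, 9, 7, 4, 2, 1])
    print(mins, maxs)                      # [3, 4, 2, 1] [7, 5, 8, 9]
    print(max(mins) <= min(maxs))          # True: every min <= every max

Both returned halves are again bitonic, and every key in the min half is at most every key in the max half.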
27-29. Picture Proof of Interesting Property
- Figures: a picture proof of the property above (three successive slides).
30. Back to Bitonic Sort
- Remember:
  - BSORT(i,j,X) takes the bitonic sequence a_i, ..., a_j and produces a non-decreasing (X = +) or non-increasing (X = -) sorted sequence.
  - BITONIC(i,j) takes the unsorted sequence a_i, ..., a_j and produces a bitonic sequence.
- Let's look at BSORT first.
- Figure: the bitonic input is split into a "min" half and a "max" half, each of which is again bitonic.
31. Here's where the shuffle-exchange comes in
- The shuffle-exchange network routes the data correctly for comparison.
- At each shuffle stage, a switch can be used to separate B1 and B2 (the two bitonic halves).
32. Sort bitonic subsequences to get a sorted sequence
- BSORT(i,j,X)
  - If j-i < 2 then return min(a_i, a_(i+1)), max(a_i, a_(i+1)) (in reverse order if X = -)
  - Else
    - Shuffle(i,j,X)
    - Unshuffle(i,j)
    - Pardo
      - BSORT(i, i+(j-i+1)/2 - 1, X)
      - BSORT(i+(j-i+1)/2, j, X)
- Figure: shuffle, compare-exchange, unshuffle; one half then holds the mins and the other the maxs, and each half is sorted recursively.
33. BITONIC takes an unsorted sequence as input and returns a bitonic sequence
- BITONIC(i,j)
  - If j-i < 2 then return a_i, a_(i+1) (note that any 2 keys are already a bitonic sequence)
  - Else
    - Pardo
      - BITONIC(i, i+(j-i+1)/2 - 1); BSORT(i, i+(j-i+1)/2 - 1, +) (sort the first half non-decreasing)
      - BITONIC(i+(j-i+1)/2, j); BSORT(i+(j-i+1)/2, j, -) (sort the second half non-increasing)
- Figure: the unsorted input passes through 2-way, 4-way, and 8-way bitonic stages; the first half is sorted ascending and the second half descending.
34. Putting it all together
- Figure: the complete network built from 2-key compare-exchange boxes (outputs min(a,b) and max(a,b)), taking the unsorted input to an 8-way bitonic sequence that is then sorted. A runnable sketch of the full algorithm follows.
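Here is a runnable sequential sketch of the whole algorithm (illustrative Python; the "pardo" branches are simply run one after the other, and n is assumed to be a power of two):

    def bsort(a, lo, hi, ascending):
        """BSORT: sort the bitonic subsequence a[lo..hi] (inclusive) in the given direction."""
        n = hi - lo + 1
        if n < 2:
            return
        half = n // 2
        for i in range(lo, lo + half):
            # compare-exchange across the two halves (the bitonic split)
            if (a[i] > a[i + half]) == ascending:
                a[i], a[i + half] = a[i + half], a[i]
        bsort(a, lo, lo + half - 1, ascending)   # both halves are bitonic, and every key
        bsort(a, lo + half, hi, ascending)       # in one half is <= every key in the other

    def bitonic(a, lo, hi):
        """BITONIC: turn the unsorted a[lo..hi] into a bitonic sequence."""
        n = hi - lo + 1
        if n < 2:
            return                               # any 2 keys are already bitonic
        half = n // 2
        bitonic(a, lo, lo + half - 1)
        bsort(a, lo, lo + half - 1, True)        # first half non-decreasing
        bitonic(a, lo + half, hi)
        bsort(a, lo + half, hi, False)           # second half non-increasing -> whole range bitonic

    def bitonic_sort(a):
        bitonic(a, 0, len(a) - 1)                # BITONIC(0, n-1)
        bsort(a, 0, len(a) - 1, True)            # BSORT(0, n-1, +)
        return a

    print(bitonic_sort([6, 2, 7, 1, 8, 4, 3, 5]))    # [1, 2, 3, 4, 5, 6, 7, 8]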
35. Complexity of Bitonic Sort
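A reminder of the standard count (an added sketch, for n = 2^k keys): the m-th merge stage consists of m compare-exchange levels, each using n/2 comparators, so

    \text{depth}(n) = \sum_{m=1}^{\log_2 n} m = \frac{\log_2 n\,(\log_2 n + 1)}{2} = O(\log^2 n)
    \text{comparators}(n) = \frac{n}{2} \cdot \text{depth}(n) = O(n \log^2 n)

i.e. O(log^2 n) parallel steps with n/2 comparators per step, and O(n log^2 n) total comparisons.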
36. Programming Issues
- The flow of data is assumed to transfer from stage to stage synchronously; the usual performance issues arise if the algorithm is executed asynchronously.
- Note that the logical interconnect is different for each problem size.
- Bitonic sort must be mapped efficiently to the target platform.
- Unless the granularity of the platform is very fine, multiple comparators will be mapped to each processor.
37. 1-1 Mappings of Bitonic Sort
- Bitonic sort on a hypercube:
- Each shuffle and unshuffle connection compares keys which differ in a single bit.
- These keys can be compared over single hypercube edges.
- Figure: 2-way, 4-way, and 8-way shuffle stages.
38. 1-1 Mappings of Bitonic Sort
- Bitonic sort on a multistage full shuffle:
- Small shuffles do not map 1-1 to larger shuffles!
- Stone used a clever approach to map logical stages into full-sized shuffle stages while preserving O(log^2 n) complexity.
39. Outline of Stone's Method
- Pivot bit = the index being shuffled.
- Stone noticed that for successive stages, the pivot bits follow a regular pattern: 0; 1, 0; 2, 1, 0; ...; k-1, ..., 1, 0.
- If the pivot bit is in place, each subsequent stage can be done using a full-sized shuffle (a_0 done with a single comparator).
- For pivot bit j, k-j full shuffles are needed to position bit j for comparison.
- Complexity of Stone's method: still O(log^2 n), as noted on the previous slide.
40. Many-one Mappings of Bitonic Sort
- For platforms where granularity is coarser, it will be more cost-efficient to map multiple comparators to one processor.
- Several possible conventional mappings.
- Compare-split provides another approach.
41. Compare-Split
- For a block of keys, we may want to use a compare-split operation (rather than compare-exchange) to accommodate multiple keys at a processor.
- The idea is to assume that each processor is assigned a block of keys, rather than 2 keys.
- Blocks are already sorted with a sequential sort.
- To perform compare-split, a processor compares blocks and returns the smaller half of the aggregate keys as the min block and the larger half of the aggregate keys as the max block.
- Figure: Block A and Block B enter a compare-split, which outputs the Min Block and the Max Block (sketched in code below).
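A compare-split is essentially a merge followed by a split down the middle; a minimal sketch (illustrative Python, block contents assumed already sorted):

    def compare_split(block_a, block_b):
        """Compare-split of two locally sorted blocks: return (min block, max block),
        each of the original block size, taken from the merged keys."""
        merged = sorted(block_a + block_b)   # in practice a linear merge, since both inputs are sorted
        half = len(block_a)
        return merged[:half], merged[half:]

    lo, hi = compare_split([2, 5, 8, 11], [1, 4, 9, 13])
    print(lo, hi)   # [1, 2, 4, 5] [8, 9, 11, 13]

Each processor then keeps one of the two halves, matching the direction of the compare-exchange it replaces.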
42. Performance
- Which mapping is best?
- Figure: performance comparison of the compare-split, block, and row mappings.