Programming Paradigms and Algorithms - PowerPoint PPT Presentation

Slides: 43
Provided by: csewe4
Learn more at: https://cseweb.ucsd.edu
Transcript and Presenter's Notes

1
Programming Paradigms and Algorithms
  • WA 3.1, 3.2, p. 178, 5.1, 5.3.3, Chapter 6,
    9.2.8, 10.4.1,
  • Kumar 12.1.3
  • Berman, F., Wolski, R., Figueira, S.,
    Schopf, J. and Shao, G., "Application-Level
    Scheduling on Distributed Heterogeneous
    Networks," Proceedings of Supercomputing '96
    (http://apples.ucsd.edu)

2
Common Parallel Programming Paradigms
  • Embarrassingly parallel programs
  • Workqueue
  • Master/Slave programs
  • Monte Carlo methods
  • Regular, Iterative (Stencil) Computations
  • Pipelined Computations
  • Synchronous Computations

3
Pipelined Computations
  • Pipelined program divided into a series of tasks
    that have to be completed one after the other.
  • Each task executed by a separate pipeline stage
  • Data streamed from stage to stage to form
    computation

4
Pipelined Computations
  • Computation consists of data streaming through
    pipeline stages
  • Execution Time = Time to fill pipeline (P-1)
    + Time to run in steady state (N-P+1)
    + Time to empty pipeline (P-1)

P = # of processors, N = # of data items (assume P
< N)
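
The timing model above can be sketched in Python, assuming every stage takes exactly one time step per item and P < N. The function names are illustrative and do not come from the slides.

```python
def pipeline_time(num_items, num_stages):
    """Total time steps under the fill / steady-state / empty model."""
    fill = num_stages - 1                    # time to fill the pipeline
    steady = num_items - num_stages + 1      # time in steady state
    empty = num_stages - 1                   # time to empty the pipeline
    return fill + steady + empty


def simulated_finish_time(num_items, num_stages):
    """Time at which the last item leaves the last stage."""
    finish = 0
    for item in range(num_items):
        # Item k enters stage 0 at the start of step k and leaves the
        # last stage num_stages steps later.
        finish = item + num_stages
    return finish


print(pipeline_time(9, 4), simulated_finish_time(9, 4))
```

Both functions give N + P - 1, so the three terms of the model add up to the same total as following the last item through the pipeline.
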
5
Pipelined Example: Sieve of Eratosthenes
  • Goal is to take a list of integers greater than 1
    and produce a list of primes
  • E.g. For input 2 3 4 5 6 7 8 9 10, output is
    2 3 5 7
  • Fran's pipelined approach (a little different
    than book)
  • Processor Pi divides each input by the ith prime
  • If the input is divisible (and not equal to the
    divisor), it is marked (with a negative sign) and
    forwarded
  • If the input is not divisible, it is forwarded
  • Last processor only forwards unmarked (positive)
    data, i.e. the primes

6
Sieve of Eratosthenes Pseudo-Code
  • Code for last processor
  •   x = recv(data, P_(i-1))
  •   If x > 0 then send(x, OUTPUT)
  • Code for processor Pi (and prime p_i)
  •   x = recv(data, P_(i-1))
  •   If (x > 0 and x != p_i) then
  •     If (p_i divides x) then send(-x, P_(i+1))
  •     If (p_i does not divide x) then send(x, P_(i+1))
  •   Else
  •     send(x, P_(i+1))
[Figure: successive snapshots of the data stream moving through the pipeline: 6 5 4 3 2, then 6 5 -4 3 2]
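
A minimal Python sketch of this pipeline, assuming each processor can be modeled as a generator that consumes its upstream neighbor one item at a time. The names stage, last_stage and pipelined_sieve are hypothetical, and the list of primes held by the stages is supplied by the caller.

```python
def stage(prime, upstream):
    """Processor P_i: mark multiples of its prime with a negative sign."""
    for x in upstream:
        if x > 0 and x != prime and x % prime == 0:
            yield -x          # divisible and not the prime itself: mark it
        else:
            yield x           # already marked, or not divisible: forward


def last_stage(upstream):
    """Last processor: forward only unmarked (positive) data."""
    for x in upstream:
        if x > 0:
            yield x


def pipelined_sieve(numbers, primes):
    """Chain one stage per prime, then filter out the marked values."""
    stream = iter(numbers)
    for p in primes:
        stream = stage(p, stream)
    return list(last_stage(stream))


print(pipelined_sieve(range(2, 11), [2, 3]))
```

For the input 2 through 10, stages for the primes 2 and 3 are enough, because every composite up to 10 has a prime factor no larger than 3.
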

7
Programming Issues
  • Algorithm will take N+P-1 time steps to run,
    where N is the number of data items and P is the
    number of processors.
  • Can consider just the odds or do some initial
    part separately
  • In the given implementation, the processors
    together must store all primes which will appear
    in the sequence (one prime per processor)
  • Not a scalable approach
  • Can fix this by having each processor do the job
    of multiple primes, i.e. mapping logical
    processors in the pipeline to each physical
    processor
  • What is the impact of this on performance?

8
More Programming Issues
  • In pipelined algorithm, flow of data moves
    through processors in lockstep, attempt to
    balance work so that there is no bottleneck at
    any processor
  • In mid-80s, processors developed to support in
    hardware this kind of parallel pipelined
    computation
  • Two commercial products from Intel: Warp (1D
    array) and iWarp (components for 2D array)
  • Warp and iWarp were meant to operate
    synchronously; Wavefront Array Processor (S.Y.
    Kung) was meant to operate asynchronously, i.e.
    arrival of data would signal that it was time to
    execute

9
Systolic Arrays
  • Warp and iWarp were examples of systolic arrays
  • Systolic means regular and rhythmic; data was
    supposed to move through pipelined computational
    units in a regular and rhythmic fashion
  • Systolic arrays meant to be special-purpose
    processors or co-processors and were very
    fine-grained
  • Processors, usually called cells, implement a
    limited and very simple computation
  • Communication is very fast; granularity meant to
    be around 1 (1 update costs about the same as 1
    communication)

10
Systolic Algorithms
  • Systolic arrays built to support systolic
    algorithms, a hot area of research in the early
    80s
  • Systolic algorithms used pipelining through
    various kinds of arrays to accomplish
    computational goals
  • Some of the data streaming and applications were
    very creative and quite complex
  • CMU a hotbed of systolic algorithm and array
    research (especially H.T. Kung and his group)

11
Example Systolic Algorithm: Matrix Multiplication
  • Problem: multiply two n x n matrices A = {a_ij} and
    B = {b_ij}. Product matrix will be R = {r_ij}.
  • Systolic solution uses 2D array with n x n cells, 2
    input streams and 2 output streams

12
Systolic Matrix Multiplication
[Figure: 4 x 4 array of cells P11 ... P44. The entries of B (b11 ... b44) and of A (a11 ... a44) stream into the array as two input streams, staggered with -- padding so that successive rows and columns enter one beat later.]
13
Operation at each cell
  • Each cell updates at each time step as
    shown below
    [Figure: update rule for one cell]
  • Each cell's running sum is initialized to 0
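
A minimal Python simulation of the systolic array, assuming the usual cell rule: each cell multiplies the a and b values it currently holds, adds the product to its running sum, then passes a to the right and b downward. The register arrays, the 0-based indexing and the function name are illustrative assumptions, not taken from the slides.

```python
def systolic_matmul(A, B):
    """Multiply two n x n matrices by simulating an n x n systolic array."""
    n = len(A)
    R = [[0] * n for _ in range(n)]          # running sums, initialized to 0
    a_reg = [[None] * n for _ in range(n)]   # a value held by each cell
    b_reg = [[None] * n for _ in range(n)]   # b value held by each cell

    for beat in range(3 * n - 2):
        # Move data one cell: a values go right, b values go down.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Feed the edges: row i of A and column j of B are delayed by
        # i and j beats respectively (the -- padding in the input streams).
        for i in range(n):
            k = beat - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else None
        for j in range(n):
            k = beat - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else None
        # Every cell performs one multiply-accumulate per beat.
        for i in range(n):
            for j in range(n):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    R[i][j] += a_reg[i][j] * b_reg[i][j]
    return R


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

With this indexing, cell (i, j) sees a_ik and b_kj together at beat i + j + k, so the simulation needs 3n - 2 beats in total.
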

14
Data Flow for Systolic MM
  • [Figures: Beat 1 and Beat 2]

15
Data Flow for Systolic MM
  • [Figures: Beat 3 and Beat 4]

16
Data Flow for Systolic MM
  • [Figures: Beat 5 and Beat 6]

17
Data Flow for Systolic MM
  • [Figures: Beat 7 and Beat 8]

18
Data Flow for Systolic MM
  • [Figures: Beat 9, and Beats 10 and 11]

19
Programming Issues
  • Performance of systolic algorithms based on fine
    granularity (1 update about the same as a
    communication) and regular dataflow
  • Can be done on asynchronous platforms with
    tagging but must ensure that idle time does not
    dominate computation
  • Many systolic algorithms may not map well to more
    general MIMD or distributed platforms

20
Synchronous Computations
  • Synchronous computations have the form
  • (Barrier)
  • Computation
  • Barrier
  • Computation
  • Frequency of the barrier and homogeneity of the
    intervening computations on the processors may
    vary
  • We've seen several synchronous computations
    already (Jacobi2D, Parallel Prefix, Systolic MM)

21
Synchronous Computations
  • Synchronous computations can be simulated using
    asynchronous programming models
  • Iterations can be tagged so that the appropriate
    data is combined
  • Performance of such computations depends on the
    granularity of the platform, how expensive
    synchronizations are, and how much time is spent
    idle waiting for the right data to arrive

22
Barrier Synchronizations
  • Barrier synchronizations can be implemented in
    many ways
  • As part of the algorithm
  • As a part of the communication library
  • PVM and MPI have barrier operations
  • In hardware
  • Implementations vary
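
A minimal sketch of the computation/barrier pattern with a library-provided barrier. As an assumption, Python's standard threading.Barrier stands in for the PVM or MPI barrier operations mentioned above, and the ring update (each worker replaces its value with the sum of its two neighbors) is a made-up example computation.

```python
import threading


def run_synchronous(num_workers, num_steps):
    """Each worker repeatedly combines its neighbors' values in lockstep."""
    barrier = threading.Barrier(num_workers)
    values = list(range(num_workers))
    scratch = [0] * num_workers

    def worker(rank):
        for _ in range(num_steps):
            # Computation: read the neighbors' values from the previous step.
            left = values[(rank - 1) % num_workers]
            right = values[(rank + 1) % num_workers]
            scratch[rank] = left + right
            barrier.wait()    # everyone has finished reading old values
            values[rank] = scratch[rank]
            barrier.wait()    # everyone has finished writing new values

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return values


print(run_synchronous(4, 2))
```

Without the first barrier a fast worker could overwrite a value its neighbor has not read yet. Without the second, a worker could start the next step on stale data.
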

23
Synchronous Computation Example: Bitonic Sort
  • Bitonic Sort is an interesting example of a
    synchronous algorithm
  • Computation proceeds in stages where each stage
    is a (smaller or larger) shuffle-exchange network
  • Barrier synchronization at each stage

24
Bitonic Sort
  • A bitonic sequence is a list of keys
    a_0, a_1, ..., a_(n-1)
  • such that
  • 1) For some i, the keys have the ordering
    a_0 <= a_1 <= ... <= a_i >= a_(i+1) >= ... >= a_(n-1)
  • or
  • 2) The sequence can be shifted cyclically so that 1)
    holds

25
Bitonic Sort Algorithm
  • The bitonic sort algorithm recursively calls two
    procedures
  • BSORT(i,j,X) takes bitonic sequence a_i, ..., a_j
    and produces a non-decreasing (X = +) or a
    non-increasing sorted sequence (X = -)
  • BITONIC(i,j) takes an unsorted sequence a_i, ..., a_j
    and produces a bitonic sequence
  • The main algorithm is then
  • BITONIC(0,n-1)
  • BSORT(0,n-1,+)

26
How does it do this?
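  • We'll show how BSORT and BITONIC work but first
    consider an interesting property of bitonic
    sequences
  • Assume that a_0, a_1, ..., a_(n-1) is bitonic
    and that n is even. Let
    B1 = min(a_0, a_(n/2)), min(a_1, a_(n/2+1)), ..., min(a_(n/2-1), a_(n-1))
    B2 = max(a_0, a_(n/2)), max(a_1, a_(n/2+1)), ..., max(a_(n/2-1), a_(n-1))
  • Then B1 and B2 are bitonic sequences and
    x <= y for all x in B1, y in B2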
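
The property can be checked on a small example. The sketch below assumes the standard form of the split (element-wise min and max of the two halves). The helper names are hypothetical, and is_bitonic simply tries every cyclic shift.

```python
def bitonic_split(seq):
    """Split a bitonic sequence of even length into its min and max halves."""
    half = len(seq) // 2
    mins = [min(seq[i], seq[i + half]) for i in range(half)]
    maxs = [max(seq[i], seq[i + half]) for i in range(half)]
    return mins, maxs


def is_bitonic(seq):
    """True if some cyclic shift rises (non-strictly) and then falls."""
    seq = list(seq)
    n = len(seq)
    for shift in range(n):
        r = seq[shift:] + seq[:shift]
        i = 0
        while i + 1 < n and r[i] <= r[i + 1]:
            i += 1
        while i + 1 < n and r[i] >= r[i + 1]:
            i += 1
        if i == n - 1:
            return True
    return n == 0


mins, maxs = bitonic_split([3, 5, 8, 9, 7, 4, 2, 1])
print(mins, maxs, is_bitonic(mins), is_bitonic(maxs), max(mins) <= min(maxs))
```

Here the two halves come out as 3 4 2 1 and 7 5 8 9: both are bitonic, and every key in the first is no larger than every key in the second.
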

27
Picture Proof of Interesting Property
  • [Figure: picture proof of the property, treated in two cases]

28
Picture Proof of Interesting Property
  • [Figure: picture proof, continued]

29
Picture Proof of Interesting Property
  • [Figure: picture proof, continued]

30
Back to Bitonic Sort
  • Remember
  • BSORT(i,j,X) takes bitonic sequence a_i, ..., a_j
    and produces a non-decreasing (X = +) or a
    non-increasing sorted sequence (X = -)
  • BITONIC(i,j) takes an unsorted sequence a_i, ..., a_j
    and produces a bitonic sequence
  • Let's look at BSORT first

[Figure: a bitonic sequence is split into a min bitonic half and a max bitonic half]
31
Here's where the shuffle-exchange comes in
  • Shuffle-exchange network routes the data
    correctly for comparison
  • At each shuffle stage, can use switch to
    separate B1 and B2

[Figure: shuffle-exchange network applied to a bitonic sequence]
32
Sort bitonic subsequences to get a sorted sequence
  • BSORT(i,j,X)
  • If j-i < 2 then return [min(i,i+1), max(i,i+1)]
  • Else
  •   Shuffle(i,j,X)
  •   Unshuffle(i,j)
  •   Pardo
  •     BSORT (i, i+(j-i+1)/2 - 1, X)
  •     BSORT (i+(j-i+1)/2, j, X)

[Figure: shuffle and unshuffle stages applied to a bitonic sequence; label: Sort maxs]
33
BITONIC takes an unsorted sequence as input and
returns a bitonic sequence
  • BITONIC(i,j)
  • If j-i < 2 then return [i, i+1]
  • Else
  •   Pardo
  •     BITONIC(i, i+(j-i+1)/2 - 1); BSORT (i, i+(j-i+1)/2
        - 1, +)
  •     BITONIC(i+(j-i+1)/2, j); BSORT (i+(j-i+1)/2,
        j, -)

(note that any 2 keys are already a bitonic
sequence)
[Figure: unsorted keys become 2-way, 4-way, then 8-way bitonic sequences; labels: Sort first half, Sort second half]
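
A sequential Python sketch of BSORT and BITONIC, assuming the number of keys is a power of two. It works on list slices instead of index ranges and explicit Shuffle/Unshuffle calls, and it runs the two Pardo branches one after the other, so it shows the recursion structure only, not the parallel execution.

```python
def bsort(keys, ascending=True):
    """Sort a bitonic sequence (ascending plays the role of X = +)."""
    n = len(keys)
    if n <= 1:
        return list(keys)
    half = n // 2
    # Compare keys half a sequence apart: mins and maxs are both bitonic.
    mins = [min(keys[i], keys[i + half]) for i in range(half)]
    maxs = [max(keys[i], keys[i + half]) for i in range(half)]
    if ascending:
        return bsort(mins, True) + bsort(maxs, True)
    return bsort(maxs, False) + bsort(mins, False)


def bitonic(keys):
    """Turn an unsorted sequence into a bitonic one."""
    n = len(keys)
    if n <= 2:
        return list(keys)     # any 2 keys are already bitonic
    half = n // 2
    first = bsort(bitonic(keys[:half]), True)     # sort first half up
    second = bsort(bitonic(keys[half:]), False)   # sort second half down
    return first + second


def bitonic_sort(keys):
    """Main algorithm: BITONIC followed by BSORT with X = +."""
    return bsort(bitonic(keys), True)


print(bitonic(([3, 7, 4, 8, 6, 2, 1, 5])))
print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))
```

For the eight keys shown, BITONIC produces 3 4 7 8 6 5 2 1 (rising then falling), which BSORT then sorts into 1 through 8.
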
34
Putting it all together
  • Bitonic sort for 8 keys

[Figure: comparator network for 8 keys, with comparator legend (a, b) and stage labels: unsorted, 8-way bitonic]
35
Complexity of Bitonic Sort
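
A sketch of the standard step count, assuming n = 2^k keys, one compare-exchange step per time unit, and n/2 comparators working in parallel at each step. The symbols T and C are introduced here for illustration.

```latex
% Building a 2^s-way sorted run from a 2^s-way bitonic sequence takes s
% compare-exchange steps, for s = 1, ..., k with k = \log_2 n.
T(n) = \sum_{s=1}^{k} s = \frac{k(k+1)}{2}
     = \frac{\log_2 n \,(\log_2 n + 1)}{2} = O(\log^2 n)

% Each step uses n/2 comparators, so the total number of comparisons is
C(n) = \frac{n}{2} \cdot \frac{\log_2 n \,(\log_2 n + 1)}{2} = O(n \log^2 n)
```

For n = 8 this gives 6 parallel steps and 24 comparisons.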
36
Programming Issues
  • Flow of data is assumed to transfer from stage to
    stage synchronously; the usual performance issues
    arise if the algorithm is executed asynchronously
  • Note that logical interconnect for each problem
    size is different
  • Bitonic sort must be mapped efficiently to target
    platform
  • Unless granularity of platform very fine,
    multiple comparators will be mapped to each
    processor

37
1-1 Mappings of Bitonic Sort
  • Bitonic sort on a hypercube
  • Each shuffle and unshuffle connection compares
    keys whose positions differ in a single bit
  • These keys can be compared over single hypercube
    edges

[Figure: 2-way, 4-way, and 8-way shuffle stages]
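
A minimal sketch of the single-bit pairing, assuming one key per hypercube node and n a power of two. Position i is paired with position i XOR j, where j has exactly one bit set, so each compare-exchange runs along one hypercube edge. The function name is hypothetical and the nodes are simulated sequentially.

```python
def bitonic_sort_hypercube(keys):
    """Bitonic sort where each step pairs positions differing in one bit."""
    n = len(keys)
    assert n > 0 and n & (n - 1) == 0, "number of keys must be a power of two"
    a = list(keys)
    k = 2                      # size of the sorted runs being built
    while k <= n:
        j = k // 2             # bit being compared in this step
        while j >= 1:
            for i in range(n):
                partner = i ^ j            # neighbor across one edge
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


print(bitonic_sort_hypercube([3, 7, 4, 8, 6, 2, 1, 5]))
```

The inner loop over i stands in for all nodes acting at once; on a real hypercube each value of j would be one synchronous communication step.
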
38
1-1 Mappings of Bitonic Sort
  • Bitonic sort on a multistage full shuffle
  • Small shuffles do not map 1-1 to larger shuffles!
  • Stone used a clever approach to map logical
    stages into full-sized shuffle stages while
    preserving O(log^2 n) complexity

39
Outline of Stone's Method
  • Pivot bit = index being shuffled
  • Stone noticed that for successive stages, the
    pivot bits are 0; 1, 0; 2, 1, 0; ...; k-1, k-2, ..., 0
  • If the pivot bit is in place, each subsequent
    stage can be done using a full-sized shuffle (a_0
    done with a single comparator)
  • For pivot bit j, need k-j full shuffles to
    position bit j for comparison
  • Complexity of Stone's method: O(log^2 n)

40
Many-one Mappings of Bitonic Sort
  • For platforms where granularity is coarser, it
    will be more cost-efficient to map multiple
    comparators to one processor
  • Several possible conventional mappings
  • Compare-split provides another approach

41
Compare-Split
  • For a block of keys, may want to use a
    compare-split operation (rather than
    compare-exchange) to accommodate multiple keys at
    a processor
  • Idea is to assume that each processor is assigned
    a block of keys, rather than 2 keys
  • Blocks are already sorted with a sequential sort
  • To perform compare-split, a processor compares
    blocks and returns the smaller half of the
    aggregate keys as the min block and the larger
    half of the aggregate keys as the max block

[Figure: Block A and Block B enter compare-split, which outputs a Min Block and a Max Block]
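
A minimal sketch of compare-split in Python, assuming both blocks are already sorted and have the same number of keys. The function name is hypothetical; heapq.merge from the standard library does the linear-time merge of the two sorted blocks.

```python
from heapq import merge


def compare_split(block_a, block_b):
    """Return (min block, max block) of two sorted, equal-sized blocks."""
    merged = list(merge(block_a, block_b))   # merge two sorted blocks
    half = len(block_a)
    return merged[:half], merged[half:]      # smaller half, larger half


min_block, max_block = compare_split([1, 4, 6, 9], [2, 3, 7, 8])
print(min_block, max_block)
```

Both output blocks stay sorted, so compare-split can replace compare-exchange in a bitonic network without re-sorting the blocks between stages.
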
42
Performance
  • Which mapping is best?
  • Compare Split
  • Block
  • Row