1
Frédéric Gava
The BSP model: Bulk-Synchronous Parallel
2
Background
Parallel programming
3
The BSP model
BSP architecture
  • Characterized by:
  • p: number of processors
  • r: processor speed
  • L: cost of a global synchronization
  • g: cost of a communication phase (at most 1 word
    sent or received by each processor)

4
Model of execution
Beginning of super-step i
Local computation on each processor
Global (collective) communications between
processors
Global synchronization: the exchanged data become
available for the next super-step
Cost(i) = max_{0 ≤ x < p} w_x^(i) + h_i·g + L
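As a quick worked example (the numbers are illustrative, not from the slides): with p = 4 processors, local work w^(i) = (5, 7, 6, 4) and h_i = 3 words, the super-step costs Cost(i) = max(5, 7, 6, 4) + 3·g + L = 7 + 3g + L.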
5
Example of a BSP machine
6
Cost model
  • Cost(program) = sum of the costs of its super-steps
  • BSP computation: scalable, portable, predictable
  • BSP algorithm design: minimise W (computation
    time), H (communication volume) and S (number of
    super-steps)
  • Cost(program) = W + g·H + S·L
  • g and L can be measured by benchmarks, hence
    cost prediction is possible (see the sketch below)
  • Main principles
  • Load balancing minimises W
  • Data locality minimises H
  • Coarse granularity minimises S
  • In general, data locality is good, network locality
    is bad!
  • Typically, problem size n >>> p (slackness)
  • Input/output distribution is even, but otherwise
    arbitrary
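A minimal sketch of how such a prediction can be coded, assuming hypothetical per-super-step measurements w[i] (maximum local work of step i) and h[i] (maximum number of words sent or received in step i); the function and its parameters are illustrative, not from the slides:

    #include <stdio.h>

    /* Predicted BSP cost: W + g*H + S*L, where W and H are
       summed over the S super-steps. */
    double bsp_cost(int S, const double w[], const double h[],
                    double g, double L)
    {
        double W = 0.0, H = 0.0;
        for (int i = 0; i < S; i++) { W += w[i]; H += h[i]; }
        return W + g * H + S * L;
    }

    int main(void)
    {
        double w[] = {7.0, 12.0}, h[] = {3.0, 5.0};
        /* g and L would come from benchmarking the machine */
        printf("predicted cost = %g\n",
               bsp_cost(2, w, h, 4.7, 1500.0));
        return 0;
    }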

7
A libertarian model
  • No master
  • Homogeneous power of the nodes
  • Global (collective) decision procedure instead
  • No god
  • Confluence (no divine intervention)
  • Cost predictable
  • Scalable performance
  • Practiced but confined

8
Advantages and drawbacks
  • Advantages
  • Allows cost prediction and is deadlock-free
  • Structuring the execution allows bulk sending,
    which can be very efficient (sending one file of 1000
    bytes performs better than sending 1000 files of 1
    byte) on many architectures (multi-cores,
    clusters, etc.)
  • Abstract architecture, hence portable
  • Drawbacks
  • Some algorithmic patterns do not fit well in the
    BSP model: pipelines, etc.
  • Some problems are difficult (or impossible) to fit
    into a coarse-grain execution model (fine-grained
    parallelism)
  • The abstract architecture does not take advantage of
    some capabilities of particular architectures
    (clusters of multi-cores, grids) and thus needs other
    libraries or models of execution

9
Example: broadcast
  • Direct broadcast (one super-step)

(Diagram: processor 0 sends its whole value to every other processor)

BSP cost = p·n·g + L
  • Broadcast in 2 super-steps (sketched below)

(Diagram: the value is first scattered in p pieces, then all pieces are exchanged among the processors)

BSP cost = 2·n·g + 2·L
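A hedged sketch of the two-super-step broadcast in C with MPI (MPI and the function below are my choice of illustration; the slides name no library). Processor 0 scatters its n values in p pieces, then an allgather reassembles the whole buffer everywhere, for a cost of roughly 2·(n/p)·(p-1)·g + 2·L ≈ 2·n·g + 2·L:

    #include <mpi.h>
    #include <stdlib.h>

    /* Two-phase broadcast of n doubles from processor 0.
       Super-step 1: scatter the buffer in p pieces.
       Super-step 2: allgather the pieces everywhere.
       Assumes n is divisible by p for simplicity. */
    void two_phase_bcast(double *buf, int n, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);
        int chunk = n / p;
        double *piece = malloc(chunk * sizeof(double));
        MPI_Scatter(buf, chunk, MPI_DOUBLE,
                    piece, chunk, MPI_DOUBLE, 0, comm);
        MPI_Allgather(piece, chunk, MPI_DOUBLE,
                      buf, chunk, MPI_DOUBLE, comm);
        free(piece);
    }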
10
BSP algorithms
  • Matrices: multiplication, inversion,
    decomposition, linear algebra, etc.
  • Sparse matrices: the same.
  • Graphs: shortest paths, decomposition, etc.
  • Geometry: Voronoi diagrams, intersection of
    polygons, etc.
  • FFT: Fast Fourier Transform
  • Pattern matching
  • Etc.

11
Parallel prefixes
  • We suppose an associative operator ⊕:
  • a⊕(b⊕c) = (a⊕b)⊕c, or better,
  • a⊕(b⊕(c⊕d)) = (a⊕b) ⊕ (c⊕d)
  • Example (see below)
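As an illustration (the values are mine; the slide's example was a figure): the parallel prefix of [x0, x1, x2, x3] under ⊕ is [x0, x0⊕x1, x0⊕x1⊕x2, x0⊕x1⊕x2⊕x3]; with ⊕ = + and input [3, 1, 4, 1], the result is [3, 4, 8, 9].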

12
Parallel Prefixes
  • Classical log(p) super-steps method

(Diagram: processors 0 to 3 combine partial results over log(p) rounds)

Cost = log(p) · (Time(op) + Size(d)·g + L)
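A minimal sequential simulation of this log(p) method in C (my sketch, not code from the slides): in round d, each virtual processor x ≥ d combines the value of processor x−d into its own, so after ⌈log2(p)⌉ rounds v[x] holds x0 ⊕ … ⊕ xx:

    #include <stdio.h>

    #define P 8  /* number of (virtual) processors */

    int main(void)
    {
        int v[P] = {3, 1, 4, 1, 5, 9, 2, 6};
        /* log(p) rounds; in round d, processor x "receives"
           the partial result of processor x-d and adds it in.
           Iterating x downwards preserves this round's inputs. */
        for (int d = 1; d < P; d *= 2)
            for (int x = P - 1; x >= d; x--)
                v[x] += v[x - d];
        for (int x = 0; x < P; x++)
            printf("%d ", v[x]);  /* prints: 3 4 8 9 14 23 25 31 */
        printf("\n");
        return 0;
    }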
13
Parallel Prefixes
  • Divide-and-conquer method

(Diagram: divide-and-conquer prefix over processors 0 to 3)
14
Our parallel machine
  • Cluster of PCs
  • Pentium IV 2.8 GHz
  • 512 MB RAM
  • A front-end Pentium IV 2.8 GHz, 512 MB RAM
  • Gigabit Ethernet cards and switch
  • Ubuntu 7.04 as OS

15
Our BSP Parameters g
16
Our BSP Parameters L
17
How to read benchmarks
  • There are many ways to publish benchmarks:
  • Tables
  • Graphics
  • The goal is to say "it is a good parallel
    method, see my benchmarks", but it is often easy to
    arrange the presentation of the graphics to hide
    the problems
  • Using graphics (from the easiest way to hide
    problems to the hardest):
  • Increase the size of the data for some fixed numbers
    of processors
  • Increase the number of processors for a typical size
    of data
  • Acceleration, i.e., Time(seq)/Time(par)
  • Efficiency, i.e., Acceleration/Number of
    processors
  • Increase both the number of processors and the size
    of the data

18
Increase number of processors
19
Acceleration
20
Efficiency
21
Increase data and processors
22
Super-linear acceleration?
  • Better than the theoretical acceleration. Possible if
    the data fit in cache memories rather than in RAM,
    thanks to the small amount of data on each processor.
  • Why do caches have such an impact? Mainly, each
    processor has a small amount of memory called a cache.
    Access to this memory is (almost) twice as fast
    as RAM accesses.
  • Take for example the multiplication of matrices

23
Fast multiplication
  • A straightforward C implementation of
    res = mult(A,B) (of size N×N) can look like this:
  • for (i = 0; i < N; i++)
  •   for (j = 0; j < N; j++)
  •     for (k = 0; k < N; k++)
  •       res[i][j] += a[i][k] * b[k][j];
  • Consider the following equation:
  • res[i][j] = Σ_k a[i][k]·b[k][j] = Σ_k a[i][k]·(bᵀ)[j][k]
  • where bᵀ is the transpose of matrix b

24
Fast multiplication
  • One can implement this equation in C as:
  • double tmp[N][N];
  • for (i = 0; i < N; i++)
  •   for (j = 0; j < N; j++)
  •     tmp[i][j] = b[j][i];
  • for (i = 0; i < N; i++)
  •   for (j = 0; j < N; j++)
  •     for (k = 0; k < N; k++)
  •       res[i][j] += a[i][k] * tmp[j][k];
  • where tmp is the transpose of b
  • This new multiplication is about 2 times faster. With
    other cache optimisations, one can get a 64 times
    faster program without really modifying the
    algorithm.

25
More complicated examples
26
N-body problem
27
Presentation
  • We have a set of bodies:
  • coordinates in 2D or 3D
  • point masses
  • The classic N-body problem is to calculate the
    gravitational energy of the N point masses, that is
    (see the sketch after this list):
  • E = -G · Σ_{1 ≤ i < j ≤ N} m_i·m_j / |r_i - r_j|
  • Quadratic complexity
  • In practice, N is very big and sometimes it is
    impossible to keep the whole set in main memory
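A direct O(N²) sketch of this energy computation in C (my illustration, with G set to 1):

    #include <math.h>
    #include <stdio.h>

    typedef struct { double x, y, z, m; } body;

    /* Gravitational energy of n point masses, G = 1:
       E = - sum over pairs i < j of m_i*m_j / |r_i - r_j| */
    double energy(const body *b, int n)
    {
        double e = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                double dx = b[i].x - b[j].x;
                double dy = b[i].y - b[j].y;
                double dz = b[i].z - b[j].z;
                e -= b[i].m * b[j].m / sqrt(dx*dx + dy*dy + dz*dz);
            }
        return e;
    }

    int main(void)
    {
        body b[2] = { {0, 0, 0, 1.0}, {1, 0, 0, 2.0} };
        printf("E = %g\n", energy(b, 2));  /* prints: E = -2 */
        return 0;
    }

The double loop over all pairs is exactly where the quadratic complexity comes from.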

28
Parallel methods
  • Each processor has a sub-part of the original
    set
  • Parallel method: on each processor,
  • 1) compute the local interactions
  • 2) compute the interactions with the other point masses
  • 3) apply a parallel prefix to the local interactions
  • For step 2), simple parallel methods:
  • using a total exchange of the sub-sets
  • using a systolic loop (sketched after the next slide)

29
Systolic loop
(Diagram: processors 0 to 3 pass their sub-set around a ring, one shift per super-step)
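A hedged C/MPI sketch of such a systolic loop (MPI and the placeholder compute_interactions are my assumptions; the slides give no code): a visiting buffer travels around the ring, one processor per super-step, so after p steps every processor has seen every sub-set:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical helper (placeholder body): accumulate the
       interactions between local and visiting bodies. */
    static void compute_interactions(const double *local,
                                     const double *visit,
                                     int n, double *acc)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                acc[i] += local[i] * visit[j];  /* stand-in term */
    }

    void systolic_nbody(double *local, int n, double *acc,
                        MPI_Comm comm)
    {
        int p, rank;
        MPI_Comm_size(comm, &p);
        MPI_Comm_rank(comm, &rank);
        int left = (rank + p - 1) % p, right = (rank + 1) % p;

        double *visit = malloc(n * sizeof(double));
        memcpy(visit, local, n * sizeof(double));

        for (int step = 0; step < p; step++) {
            compute_interactions(local, visit, n, acc);
            /* shift the visiting buffer one position along the ring */
            MPI_Sendrecv_replace(visit, n, MPI_DOUBLE,
                                 right, 0, left, 0,
                                 comm, MPI_STATUS_IGNORE);
        }
        free(visit);
    }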
30
Benchmarks and BSP predictions
31
Benchmarks and BSP predictions
32
Benchmarks and BSP predictions
33
Parallel methods
  • There exist many better algorithms than this one
  • Especially when computing all the
    interactions is not needed (distant molecules)
  • One classic algorithm is to divide the space
    into sub-spaces, compute the n-body problem
    recursively on each sub-space (and so on for the
    sub-sub-spaces), and only consider interactions
    between these sub-spaces. The recursion stops when
    there are at most two molecules in a sub-space
  • That reduces the work to n·log(n) computations

34
Sieve of Eratosthenes
35
Presentation
  • Classic: find the prime numbers by enumeration
  • Pure functional implementation using lists
  • Complexity: n·log(n)/log(log(n))
  • We used (an imperative analogue is sketched after
    this list):
  • elim: int list → int → int list, which deletes from a
    list all the integers that are multiples of the given
    parameter
  • final elim: int list → int list → int list, which
    iterates elim
  • seq_generate: int → int → int list, which returns the
    list of the integers between 2 bounds
  • select: int → int list → int list, which gives the
    first n prime numbers of a list
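The slides keep the functional version abstract; below is a small imperative C analogue of elim / final elim (my array-based sketch, not the original list-based code), which sieves the primes up to a bound:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Mark the multiples of each surviving integer (the role
       of elim), iterating over the candidates (the role of
       final elim). */
    void sieve(int n, char *is_prime)
    {
        memset(is_prime, 1, n + 1);
        is_prime[0] = is_prime[1] = 0;
        for (int i = 2; (long)i * i <= n; i++)
            if (is_prime[i])
                for (int k = i * i; k <= n; k += i)
                    is_prime[k] = 0;  /* eliminate multiples of i */
    }

    int main(void)
    {
        int n = 30;
        char *p = malloc(n + 1);
        sieve(n, p);
        for (int i = 2; i <= n; i++)
            if (p[i]) printf("%d ", i);  /* 2 3 5 7 ... 29 */
        printf("\n");
        free(p);
        return 0;
    }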

36
Parallel methods
  • Simple parallel methods:
  • using a kind of scan
  • using a direct sieve
  • using a recursive one
  • Different partitions of the data:
  • per block (for the scan)
  • cyclic distribution, e.g. over 3 processors:

11,14,17,20,23
12,15,18,21,24
13,16,19,22,25
37
Scan version
  • Method using a scan:
  • each processor computes a local sieve (processor
    0 thus holds the first prime
    numbers)
  • then our scan is applied, and on
    processor i we eliminate the integers that are
    multiples of the integers held by processors i-1,
    i-2, etc.
  • Cost: as a scan (logarithmic number of super-steps)

38
Direct version
  • Method:
  • each processor computes a local sieve
  • then the integers smaller than √n are
    globally exchanged, and a new sieve is applied to
    this list of integers (thus giving the prime numbers)
  • each processor then eliminates, from its own list,
    the integers that are multiples of these first primes

39
Inductive version
  • Recursive method, by induction over n:
  • we suppose that the inductive step gives the
    primes up to √n
  • we perform a total exchange of them to
    eliminate the non-primes
  • the end of the induction comes from the BSP cost:
    we stop when n is small enough that the
    sequential method is faster than the parallel
    one
  • Cost

40
Benchmarks and BSP predictions
41
Benchmarks and BSP predictions
42
Benchmarks and BSP predictions
43
Benchmarks and BSP predictions
44
Parallel sample sorting
45
Presentation
  • Each processor holds a set of data (array,
    list, etc.)
  • The goal is that:
  • the data on each processor are ordered,
  • the data on processor i are smaller than the data on
    processor i+1,
  • the balancing is good
  • Parallel sorting is not very efficient, due to too
    many communications
  • But it is useful, and more efficient than gathering
    all the data on one processor and then sorting them
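A hedged C sketch of the sampling step on which such sorting algorithms rely (my illustration; the slides give no code): sort locally, pick p-1 regular samples per processor, gather all p·(p-1) samples, and derive the p-1 global splitters that decide where each key is routed:

    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b)
    {
        return (*(const int *)a > *(const int *)b) -
               (*(const int *)a < *(const int *)b);
    }

    /* From a locally sorted array of n keys, pick p-1 regular
       samples; out must have room for p-1 ints. */
    void pick_samples(const int *sorted, int n, int p, int *out)
    {
        for (int s = 1; s < p; s++)
            out[s - 1] = sorted[(long)s * n / p];
    }

    /* Given the p*(p-1) samples gathered from all processors
       (e.g. with a total exchange), sort them and keep every
       (p-1)-th one as a global splitter. */
    void pick_splitters(int *samples, int p, int *splitters)
    {
        qsort(samples, (size_t)p * (p - 1), sizeof(int), cmp_int);
        for (int s = 1; s < p; s++)
            splitters[s - 1] = samples[s * (p - 1)];
    }

Each key is then sent to the processor whose splitter interval contains it, and a final local sort finishes the job.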

46
BSP parallel sorting
47
Tiskin's sampling sort

(Diagram: initial distribution over 3 processors:
processor 0: 1,11,16,7,14,2,20
processor 1: 18,9,13,21,6,12,4
processor 2: 15,5,19,3,17,8,10)
48
Tiskin's sampling sort
49
Benchmarks and BSP predictions
50
Benchmarks and BSP predictions
51
Matrix multiplication
52
Naive parallel algorithm
  • We have two matrices A and B, of size n×n
  • We suppose that:
  • each matrix is distributed by blocks of size
    n/√p × n/√p (see the sketch below)
  • that is, element A(i,j) is on processor
    (⌊i·√p/n⌋, ⌊j·√p/n⌋)
  • Algorithm
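A small C sketch of this block distribution (my illustration of the standard layout; the slides show it as a figure): with p a perfect square, q = √p and q dividing n, element (i,j) lives on processor (i / (n/q), j / (n/q)):

    #include <stdio.h>
    #include <math.h>

    /* Owner of element (i,j) of an n×n matrix distributed in
       (n/q)×(n/q) blocks over a q×q grid of processors. */
    void owner(int i, int j, int n, int p, int *row, int *col)
    {
        int q = (int)sqrt((double)p);  /* assumes p is a perfect square */
        int blk = n / q;               /* assumes q divides n */
        *row = i / blk;
        *col = j / blk;
    }

    int main(void)
    {
        int r, c;
        owner(5, 2, 8, 16, &r, &c);  /* 8×8 matrix, 4×4 processor grid */
        printf("A(5,2) is on processor (%d,%d)\n", r, c);  /* (2,1) */
        return 0;
    }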

53
Benchmarks and BSP predictions
54
Benchmarks and BSP predictions
55
Benchmarks and BSP predictions
56
Benchmarks and BSP predictions