1
Frédéric Gava
The BSP model: Bulk-Synchronous Parallel
2
Background
Parallel programming
3
The BSP model
BSP architecture
  • Characterized by:
  • p: number of processors
  • r: processor speed
  • L: cost of a global synchronization
  • g: cost of a communication phase (at most 1 word
    sent or received by each processor)

4
Model of execution
Beginning of super-step i
Local computation on each processor
Global (collective) communications between
processors
Global synchronization: the exchanged data become
available for the next super-step
Cost(i) = max_{0 ≤ x < p} w_x^(i) + h_i·g + L
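As a quick worked example (the numbers are illustrative, not from the slides): with p = 4 processors, local work w^(i) = (5, 7, 6, 4) and h_i = 3 words, the super-step costs Cost(i) = max(5, 7, 6, 4) + 3·g + L = 7 + 3g + L.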
5
Example of a BSP machine
6
Cost model
  • Cost(program) = sum of the costs of its super-steps
  • BSP computation: scalable, portable, predictable
  • BSP algorithm design: minimise W (computation
    time), H (communication volume) and S (number of
    super-steps)
  • Cost(program) = W + g·H + S·L
  • g and L can be measured by benchmarks, hence
    cost prediction is possible (see the sketch below)
  • Main principles
  • Load balancing minimises W
  • Data locality minimises H
  • Coarse granularity minimises S
  • In general, data locality is good, network locality
    is bad!
  • Typically, problem size n >>> p (slackness)
  • Input/output distribution is even, but otherwise
    arbitrary
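A minimal sketch of how such a prediction can be coded, assuming hypothetical per-super-step measurements w[i] (maximum local work of step i) and h[i] (maximum number of words sent or received in step i); the function and its parameters are illustrative, not from the slides:

    #include <stdio.h>

    /* Predicted BSP cost: W + g*H + S*L, where W and H are
       summed over the S super-steps. */
    double bsp_cost(int S, const double w[], const double h[],
                    double g, double L)
    {
        double W = 0.0, H = 0.0;
        for (int i = 0; i < S; i++) { W += w[i]; H += h[i]; }
        return W + g * H + S * L;
    }

    int main(void)
    {
        double w[] = {7.0, 12.0}, h[] = {3.0, 5.0};
        /* g and L would come from benchmarking the machine */
        printf("predicted cost = %g\n",
               bsp_cost(2, w, h, 4.7, 1500.0));
        return 0;
    }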

7
A libertarian model
  • No master
  • Homogeneous power of the nodes
  • Global (collective) decision procedure instead
  • No god
  • Confluence (no divine intervention)
  • Cost predictable
  • Scalable performance
  • Practiced but confined

8
Advantages and drawbacks
  • Advantages
  • Allows cost prediction and is deadlock-free
  • Structuring the execution allows bulk sending,
    which can be very efficient (sending one file of 1000
    bytes performs better than sending 1000 files of 1
    byte) on many architectures (multi-cores,
    clusters, etc.)
  • Abstract architecture, hence portable
  • Drawbacks
  • Some algorithmic patterns do not fit well in the
    BSP model: pipelines, etc.
  • Some problems are difficult (or impossible) to fit
    into a coarse-grain execution model (fine-grained
    parallelism)
  • The abstract architecture does not take advantage of
    some capabilities of particular architectures
    (clusters of multi-cores, grids) and thus needs other
    libraries or models of execution

9
Example: broadcast
  • Direct broadcast (one super-step)

(Diagram: processor 0 sends its whole value to every other processor)

BSP cost = p·n·g + L
  • Broadcast in 2 super-steps (sketched below)

(Diagram: the value is first scattered in p pieces, then all pieces are exchanged among the processors)

BSP cost = 2·n·g + 2·L
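A hedged sketch of the two-super-step broadcast in C with MPI (MPI and the function below are my choice of illustration; the slides name no library). Processor 0 scatters its n values in p pieces, then an allgather reassembles the whole buffer everywhere, for a cost of roughly 2·(n/p)·(p-1)·g + 2·L ≈ 2·n·g + 2·L:

    #include <mpi.h>
    #include <stdlib.h>

    /* Two-phase broadcast of n doubles from processor 0.
       Super-step 1: scatter the buffer in p pieces.
       Super-step 2: allgather the pieces everywhere.
       Assumes n is divisible by p for simplicity. */
    void two_phase_bcast(double *buf, int n, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);
        int chunk = n / p;
        double *piece = malloc(chunk * sizeof(double));
        MPI_Scatter(buf, chunk, MPI_DOUBLE,
                    piece, chunk, MPI_DOUBLE, 0, comm);
        MPI_Allgather(piece, chunk, MPI_DOUBLE,
                      buf, chunk, MPI_DOUBLE, comm);
        free(piece);
    }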
10
BSP algorithms
  • Matrices: multiplication, inversion,
    decomposition, linear algebra, etc.
  • Sparse matrices: the same.
  • Graphs: shortest paths, decomposition, etc.
  • Geometry: Voronoi diagrams, intersection of
    polygons, etc.
  • FFT: Fast Fourier Transform
  • Pattern matching
  • Etc.

11
Parallel prefixes
  • We suppose an associative operator ⊕:
  • a⊕(b⊕c) = (a⊕b)⊕c, or better,
  • a⊕(b⊕(c⊕d)) = (a⊕b) ⊕ (c⊕d)
  • Example (see below)
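As an illustration (the values are mine; the slide's example was a figure): the parallel prefix of [x0, x1, x2, x3] under ⊕ is [x0, x0⊕x1, x0⊕x1⊕x2, x0⊕x1⊕x2⊕x3]; with ⊕ = + and input [3, 1, 4, 1], the result is [3, 4, 8, 9].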

12
Parallel Prefixes
  • Classical log(p) super-steps method

(Diagram: processors 0 to 3 combine partial results over log(p) rounds)

Cost = log(p) · (Time(op) + Size(d)·g + L)
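A minimal sequential simulation of this log(p) method in C (my sketch, not code from the slides): in round d, each virtual processor x ≥ d combines the value of processor x−d into its own, so after ⌈log2(p)⌉ rounds v[x] holds x0 ⊕ … ⊕ xx:

    #include <stdio.h>

    #define P 8  /* number of (virtual) processors */

    int main(void)
    {
        int v[P] = {3, 1, 4, 1, 5, 9, 2, 6};
        /* log(p) rounds; in round d, processor x "receives"
           the partial result of processor x-d and adds it in.
           Iterating x downwards preserves this round's inputs. */
        for (int d = 1; d < P; d *= 2)
            for (int x = P - 1; x >= d; x--)
                v[x] += v[x - d];
        for (int x = 0; x < P; x++)
            printf("%d ", v[x]);  /* prints: 3 4 8 9 14 23 25 31 */
        printf("\n");
        return 0;
    }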
13
Parallel Prefixes
  • Divide-and-conquer method

(Diagram: divide-and-conquer prefix over processors 0 to 3)
14
Our parallel machine
  • Cluster of PCs
  • Pentium IV 2.8 GHz
  • 512 MB RAM
  • A front-end Pentium IV 2.8 GHz, 512 MB RAM
  • Gigabit Ethernet cards and switch
  • Ubuntu 7.04 as OS

15
Our BSP Parameters g
16
Our BSP Parameters L
17
How to read benchmarks
  • There are many ways to publish benchmarks:
  • Tables
  • Graphics
  • The goal is to say "it is a good parallel
    method, see my benchmarks", but it is often easy to
    arrange the presentation of the graphics to hide
    the problems
  • Using graphics (from the easiest way to hide
    problems to the hardest):
  • Increase the size of the data for some fixed numbers
    of processors
  • Increase the number of processors for a typical size
    of data
  • Acceleration, i.e., Time(seq)/Time(par)
  • Efficiency, i.e., Acceleration/Number of
    processors
  • Increase both the number of processors and the size
    of the data

18
Increase number of processors
19
Acceleration
20
Efficiency
21
Increase data and processors
22
Super-linear acceleration?
  • Better than the theoretical acceleration. Possible if
    the data fit in cache memories rather than in RAM,
    thanks to the small amount of data on each processor.
  • Why do caches have such an impact? Mainly, each
    processor has a small amount of memory called a cache.
    Access to this memory is (almost) twice as fast
    as RAM accesses.
  • Take for example the multiplication of matrices

23
Fast multiplication
  • A straightforward C implementation of
    res = mult(A,B) (of size N×N) can look like this:
  • for (i = 0; i < N; i++)
  •   for (j = 0; j < N; j++)
  •     for (k = 0; k < N; k++)
  •       res[i][j] += a[i][k] * b[k][j];
  • Consider the following equation:
  • res[i][j] = Σ_k a[i][k]·b[k][j] = Σ_k a[i][k]·(bᵀ)[j][k]
  • where bᵀ is the transpose of matrix b

24
Fast multiplication
  • One can implement this equation in C as:
  • double tmp[N][N];
  • for (i = 0; i < N; i++)
  •   for (j = 0; j < N; j++)
  •     tmp[i][j] = b[j][i];
  • for (i = 0; i < N; i++)
  •   for (j = 0; j < N; j++)
  •     for (k = 0; k < N; k++)
  •       res[i][j] += a[i][k] * tmp[j][k];
  • where tmp is the transpose of b
  • This new multiplication is about 2 times faster. With
    other cache optimisations, one can get a 64 times
    faster program without really modifying the
    algorithm.

25
More complicated examples
26
N-body problem
27
Presentation
  • We have a set of bodies:
  • coordinates in 2D or 3D
  • point masses
  • The classic N-body problem is to calculate the
    gravitational energy of the N point masses, that is
    (see the sketch after this list):
  • E = -G · Σ_{1 ≤ i < j ≤ N} m_i·m_j / |r_i - r_j|
  • Quadratic complexity
  • In practice, N is very big and sometimes it is
    impossible to keep the whole set in main memory
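A direct O(N²) sketch of this energy computation in C (my illustration, with G set to 1):

    #include <math.h>
    #include <stdio.h>

    typedef struct { double x, y, z, m; } body;

    /* Gravitational energy of n point masses, G = 1:
       E = - sum over pairs i < j of m_i*m_j / |r_i - r_j| */
    double energy(const body *b, int n)
    {
        double e = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                double dx = b[i].x - b[j].x;
                double dy = b[i].y - b[j].y;
                double dz = b[i].z - b[j].z;
                e -= b[i].m * b[j].m / sqrt(dx*dx + dy*dy + dz*dz);
            }
        return e;
    }

    int main(void)
    {
        body b[2] = { {0, 0, 0, 1.0}, {1, 0, 0, 2.0} };
        printf("E = %g\n", energy(b, 2));  /* prints: E = -2 */
        return 0;
    }

The double loop over all pairs is exactly where the quadratic complexity comes from.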

28
Parallel methods
  • Each processor has a sub-part of the original
    set
  • Parallel method: on each processor,
  • 1) compute the local interactions
  • 2) compute the interactions with the other point masses
  • 3) apply a parallel prefix to the local interactions
  • For step 2), simple parallel methods:
  • using a total exchange of the sub-sets
  • using a systolic loop (sketched after the next slide)

29
Systolic loop
(Diagram: processors 0 to 3 pass their sub-set around a ring, one shift per super-step)
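A hedged C/MPI sketch of such a systolic loop (MPI and the placeholder compute_interactions are my assumptions; the slides give no code): a visiting buffer travels around the ring, one processor per super-step, so after p steps every processor has seen every sub-set:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical helper (placeholder body): accumulate the
       interactions between local and visiting bodies. */
    static void compute_interactions(const double *local,
                                     const double *visit,
                                     int n, double *acc)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                acc[i] += local[i] * visit[j];  /* stand-in term */
    }

    void systolic_nbody(double *local, int n, double *acc,
                        MPI_Comm comm)
    {
        int p, rank;
        MPI_Comm_size(comm, &p);
        MPI_Comm_rank(comm, &rank);
        int left = (rank + p - 1) % p, right = (rank + 1) % p;

        double *visit = malloc(n * sizeof(double));
        memcpy(visit, local, n * sizeof(double));

        for (int step = 0; step < p; step++) {
            compute_interactions(local, visit, n, acc);
            /* shift the visiting buffer one position along the ring */
            MPI_Sendrecv_replace(visit, n, MPI_DOUBLE,
                                 right, 0, left, 0,
                                 comm, MPI_STATUS_IGNORE);
        }
        free(visit);
    }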
30
Benchmarks and BSP predictions
31
Benchmarks and BSP predictions
32
Benchmarks and BSP predictions
33
Parallel methods
  • There exist many better algorithms than this one
  • Especially when computing all the
    interactions is not needed (distant molecules)
  • One classic algorithm is to divide the space
    into sub-spaces, compute the n-body problem
    recursively on each sub-space (and so on for the
    sub-sub-spaces), and only consider interactions
    between these sub-spaces. The recursion stops when
    there are at most two molecules in a sub-space
  • That reduces the work to n·log(n) computations

34
Sieve of Eratosthenes
35
Presentation
  • Classic: find the prime numbers by enumeration
  • Pure functional implementation using lists
  • Complexity: n·log(n)/log(log(n))
  • We used (an imperative analogue is sketched after
    this list):
  • elim: int list → int → int list, which deletes from a
    list all the integers that are multiples of the given
    parameter
  • final elim: int list → int list → int list, which
    iterates elim
  • seq_generate: int → int → int list, which returns the
    list of the integers between 2 bounds
  • select: int → int list → int list, which gives the
    first n prime numbers of a list
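The slides keep the functional version abstract; below is a small imperative C analogue of elim / final elim (my array-based sketch, not the original list-based code), which sieves the primes up to a bound:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Mark the multiples of each surviving integer (the role
       of elim), iterating over the candidates (the role of
       final elim). */
    void sieve(int n, char *is_prime)
    {
        memset(is_prime, 1, n + 1);
        is_prime[0] = is_prime[1] = 0;
        for (int i = 2; (long)i * i <= n; i++)
            if (is_prime[i])
                for (int k = i * i; k <= n; k += i)
                    is_prime[k] = 0;  /* eliminate multiples of i */
    }

    int main(void)
    {
        int n = 30;
        char *p = malloc(n + 1);
        sieve(n, p);
        for (int i = 2; i <= n; i++)
            if (p[i]) printf("%d ", i);  /* 2 3 5 7 ... 29 */
        printf("\n");
        free(p);
        return 0;
    }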

36
Parallel methods
  • Simple parallel methods:
  • using a kind of scan
  • using a direct sieve
  • using a recursive one
  • Different partitions of the data:
  • per block (for the scan)
  • cyclic distribution, e.g. over 3 processors:

11,14,17,20,23
12,15,18,21,24
13,16,19,22,25
37
Scan version
  • Method using a scan:
  • each processor computes a local sieve (processor
    0 thus holds the first prime
    numbers)
  • then our scan is applied, and on
    processor i we eliminate the integers that are
    multiples of the integers held by processors i-1,
    i-2, etc.
  • Cost: as a scan (logarithmic number of super-steps)

38
Direct version
  • Method:
  • each processor computes a local sieve
  • then the integers smaller than √n are
    globally exchanged, and a new sieve is applied to
    this list of integers (thus giving the prime numbers)
  • each processor then eliminates, from its own list,
    the integers that are multiples of these first primes

39
Inductive version
  • Recursive method, by induction over n:
  • we suppose that the inductive step gives the
    primes up to √n
  • we perform a total exchange of them to
    eliminate the non-primes
  • the end of the induction comes from the BSP cost:
    we stop when n is small enough that the
    sequential method is faster than the parallel
    one
  • Cost

40
Benchmarks and BSP predictions
41
Benchmarks and BSP predictions
42
Benchmarks and BSP predictions
43
Benchmarks and BSP predictions
44
Parallel sample sorting
45
Presentation
  • Each processor holds a set of data (array,
    list, etc.)
  • The goal is that:
  • the data on each processor are ordered,
  • the data on processor i are smaller than the data on
    processor i+1,
  • the balancing is good
  • Parallel sorting is not very efficient, due to too
    many communications
  • But it is useful, and more efficient than gathering
    all the data on one processor and then sorting them
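A hedged C sketch of the sampling step on which such sorting algorithms rely (my illustration; the slides give no code): sort locally, pick p-1 regular samples per processor, gather all p·(p-1) samples, and derive the p-1 global splitters that decide where each key is routed:

    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b)
    {
        return (*(const int *)a > *(const int *)b) -
               (*(const int *)a < *(const int *)b);
    }

    /* From a locally sorted array of n keys, pick p-1 regular
       samples; out must have room for p-1 ints. */
    void pick_samples(const int *sorted, int n, int p, int *out)
    {
        for (int s = 1; s < p; s++)
            out[s - 1] = sorted[(long)s * n / p];
    }

    /* Given the p*(p-1) samples gathered from all processors
       (e.g. with a total exchange), sort them and keep every
       (p-1)-th one as a global splitter. */
    void pick_splitters(int *samples, int p, int *splitters)
    {
        qsort(samples, (size_t)p * (p - 1), sizeof(int), cmp_int);
        for (int s = 1; s < p; s++)
            splitters[s - 1] = samples[s * (p - 1)];
    }

Each key is then sent to the processor whose splitter interval contains it, and a final local sort finishes the job.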

46
BSP parallel sorting
47
Tiskin's sampling sort

(Diagram: initial distribution over 3 processors:
processor 0: 1,11,16,7,14,2,20
processor 1: 18,9,13,21,6,12,4
processor 2: 15,5,19,3,17,8,10)
48
Tiskin's sampling sort
49
Benchmarks and BSP predictions
50
Benchmarks and BSP predictions
51
Matrix multiplication
52
Naive parallel algorithm
  • We have two matrices A and B, of size n×n
  • We suppose that:
  • each matrix is distributed by blocks of size
    n/√p × n/√p (see the sketch below)
  • that is, element A(i,j) is on processor
    (⌊i·√p/n⌋, ⌊j·√p/n⌋)
  • Algorithm
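A small C sketch of this block distribution (my illustration of the standard layout; the slides show it as a figure): with p a perfect square, q = √p and q dividing n, element (i,j) lives on processor (i / (n/q), j / (n/q)):

    #include <stdio.h>
    #include <math.h>

    /* Owner of element (i,j) of an n×n matrix distributed in
       (n/q)×(n/q) blocks over a q×q grid of processors. */
    void owner(int i, int j, int n, int p, int *row, int *col)
    {
        int q = (int)sqrt((double)p);  /* assumes p is a perfect square */
        int blk = n / q;               /* assumes q divides n */
        *row = i / blk;
        *col = j / blk;
    }

    int main(void)
    {
        int r, c;
        owner(5, 2, 8, 16, &r, &c);  /* 8×8 matrix, 4×4 processor grid */
        printf("A(5,2) is on processor (%d,%d)\n", r, c);  /* (2,1) */
        return 0;
    }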

53
Benchmarks and BSP predictions
54
Benchmarks and BSP predictions
55
Benchmarks and BSP predictions
56
Benchmarks and BSP predictions