Principles of High Performance Computing ICS 632 - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Principles of High Performance Computing ICS 632


1
Principles of High Performance Computing (ICS 632)
  • Classic Examples of
  • Shared-Memory Programs

2
Domain Decomposition
  • Now that we know how to create and manage
    threads, we need to decide which thread does what
  • This is really the art of parallel computing
  • Fortunately, in shared memory, it is often quite
    simple
  • We'll look at three examples
  • Embarrassingly parallel application
  • load-balancing issue
  • Non-embarrassingly parallel application
  • thread synchronization issue
  • Sharks and Fish simulation
  • load-balancing AND thread synchronization issue

3
Embarrassingly Parallel
  • Embarrassingly parallel applications
  • Consist of a set of elementary computations
  • These computations can be done in any order
  • They are said to be independent
  • Sometimes referred to as pleasantly parallel
  • Trivial example (sketched below): compute all values
    of a function of two variables over a 2-D domain
  • function f(x,y) <requires many flops>
  • domain: (0,10) × (0,10)
  • domain resolution: 0.001
  • number of points: (10 / 0.001)² = 10⁸
  • number of processors and of threads: 4
  • each thread performs 25×10⁶ function evaluations
  • No need for critical sections
  • No shared output

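A minimal OpenMP sketch of this example; the function f below is a hypothetical stand-in for the expensive function, and results go into a shared array with no synchronization needed:

    #include <math.h>
    #include <stdlib.h>

    #define N 10000                      /* 10 / 0.001 points per dimension */

    /* hypothetical stand-in for the expensive function of two variables */
    static double f(double x, double y) { return sin(x) * cos(y); }

    int main(void) {
      double *result = malloc((size_t)N * N * sizeof(double));

      /* all evaluations are independent: each thread writes to its own
         portion of the shared array, so no critical section is required */
      #pragma omp parallel for num_threads(4)
      for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
          result[(size_t)i * N + j] = f(i * 0.001, j * 0.001);

      free(result);
      return 0;
    }

With a static schedule each of the 4 threads gets a quarter of the rows, i.e., 25×10⁶ evaluations, and since f costs the same everywhere the load is naturally balanced.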
4
Mandelbrot Set
  • In many cases, the cost of computing f varies
    with its input
  • Example: the Mandelbrot set
  • For each complex number c
  • Define the series
  • Z₀ = 0
  • Zₙ₊₁ = Zₙ² + c
  • If the series converges, put a black dot at
    point c
  • i.e., if it hasn't diverged after many iterations
    (see the sketch below)
  • If one partitions the domain into 4 squares among 4
    threads, some of the threads will have much more
    work to do than others

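A small sketch of the per-point test; the iteration bound and the |z| > 2 escape test are standard choices, not values given on the slides:

    #include <complex.h>

    /* Returns 1 if the series z_{n+1} = z_n^2 + c has not diverged after
       max_iter iterations (point drawn black), 0 otherwise. */
    int mandelbrot_black(double complex c, int max_iter) {
      double complex z = 0.0;
      for (int n = 0; n < max_iter; n++) {
        z = z * z + c;
        if (cabs(z) > 2.0)
          return 0;     /* diverged after n iterations: a cheap point */
      }
      return 1;         /* still bounded: this point costs the full max_iter */
    }

The early return is exactly why the cost varies with the input: points inside the set always pay max_iter iterations, while points far outside bail out almost immediately, so equal-area tiles carry very unequal work.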
5
Mandelbrot and Load Balancing
  • The problem with partitioning the domain into 4
    identical tiles is that it leads to load
    imbalance
  • i.e., suboptimal use of the hardware resources
  • Solution
  • do not partition the domain in as many tiles as
    threads
  • instead use many more tiles than threads
  • Then have each thread operate as follows
  • compute a tile
  • when done, request another tile
  • until there are no tiles left to compute
  • This is called a master-worker execution
  • confusing terminology that will make more sense
    when we do distributed memory programming

6
Mandelbrot implementation
  • Conceptually very simple, but how do we write
    code to do it?
  • Pthreads
  • Use some shared (protected) counter that keeps
    track of the next tile
  • the keeping track can be easy or difficult
    depending on the shape of the tiles
  • Threads read and update the counter each time
  • When the counter goes over some predefined value,
    terminate
  • OpenMP
  • Could be done in the same way
  • But OpenMP provides tons of convenient ways to do
    parallel loops
  • including dynamic scheduling strategies, which
    do exactly what we need!
  • Just write the code as a loop over the tiles
  • Add the proper pragma
  • And you're done (see the sketch below)

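A hedged Pthreads sketch of the shared-counter approach; the tile count and the compute_tile helper are placeholders for the real Mandelbrot tile computation:

    #include <pthread.h>

    #define NUM_TILES 1024                 /* many more tiles than threads */

    static int next_tile = 0;              /* shared: index of the next tile to compute */
    static pthread_mutex_t tile_lock = PTHREAD_MUTEX_INITIALIZER;

    void compute_tile(int tile);           /* placeholder: renders one tile */

    void *worker(void *arg) {
      (void)arg;
      for (;;) {
        /* read and update the counter inside a critical section */
        pthread_mutex_lock(&tile_lock);
        int tile = next_tile++;
        pthread_mutex_unlock(&tile_lock);

        if (tile >= NUM_TILES)             /* counter went over the limit: terminate */
          break;
        compute_tile(tile);                /* the actual work happens outside the lock */
      }
      return NULL;
    }

In OpenMP the same behavior comes for free from a dynamically scheduled loop over the tiles, e.g. #pragma omp parallel for schedule(dynamic): the runtime hands out iterations on demand, which is exactly what the shared counter does here.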
7
Dependent Computations
  • In many applications, things are not so simple:
    elementary computations may not be independent
  • otherwise parallel computing would be pretty easy
  • A common example
  • Consider a (1-D, 2-D, ...) domain that consists
    of cells
  • Each cell holds some state, for example
  • temperature, pressure, humidity, wind velocity
  • RGB color value
  • The application consists of rule(s) that must be
    applied to update the cell states
  • possibly over-and-over in an iterative fashion
  • CFD, game of life, image processing, etc.
  • Such applications are often termed Stencil
    Applications
  • We have already talked about one example: heat
    transfer

8
Dependent Computations
  • Really simple
  • Cell value: one floating-point number
  • Program written with two arrays
  • f_old
  • f_new
  • One simple loop: f_new[i] = f_old[i] + ...
    (a sketch follows below)
  • In more realistic cases, the domain is 2-D (or
    worse), there are more terms, and the values on
    the right-hand side can be at time step m+1 as
    well
  • Example from http://ocw.mit.edu/NR/rdonlyres/Nuclear-Engineering/22-00JIntroduction-to-Modeling-and-SimulationSpring2002/55114EA2-9B81-4FD8-90D5-5F64F21D23D0/0/lecture_16.pdf

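A minimal 1-D sketch of the two-array pattern; the averaging rule used here is only illustrative, not the heat-transfer update from the earlier lecture:

    #define N 1000
    #define STEPS 100

    static double buf_a[N], buf_b[N];

    void stencil(void) {
      double *f_old = buf_a, *f_new = buf_b;
      for (int step = 0; step < STEPS; step++) {
        /* every new value depends only on old values, so all iterations
           of this loop are independent within one time step */
        for (int i = 1; i < N - 1; i++)
          f_new[i] = 0.5 * f_old[i] + 0.25 * (f_old[i - 1] + f_old[i + 1]);

        /* swap the roles of the two arrays for the next time step */
        double *tmp = f_old; f_old = f_new; f_new = tmp;
      }
    }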
9
Stencil Applications
  • More generally, a stencil application is defined
    by
  • A shape for the stencil, which defines which
    neighboring cells are used to update a cell
  • with modified shapes for the edges of the domain
  • For each neighboring cell in the stencil, the
    iteration (i.e., time step) of the value that
    should be used to compute the cell value at time
    step t+1
  • t+1, t, t-1, t-2, etc.

[Figure: example stencil shapes on a 2-D domain, showing which neighbor values, at time steps t and t+1, are used to update a cell]
10
Stencil Example
[Figure: a sample stencil that combines neighbor values from time steps t and t+1 to compute a cell value at time step t+1]
11 - 15
Stencil applications
[Figure sequence: the stencil applied over the whole 2-D domain; each cell is labeled with the earliest step (0, 1, 2, ...) at which it can be computed, so the cells of each antidiagonal can be computed together and the computation sweeps across the domain one antidiagonal at a time]
  • Called a wavefront computation
16
Wavefront computation
  • How can we parallelize a wavefront computation?
  • We have seen that the computation consists of
    computing 2n-2 antidiagonals, in sequence.
  • Computations within each antidiagonal are
    independent, and can be done in a multithreaded
    fashion
  • One possible algorithm (sketched below)
  • for each antidiagonal
  • use multiple threads to compute its elements
  • one may need to use a variable number of threads
    because some antidiagonals are very small, while
    some can be large
  • can be implemented with a single array

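A hedged OpenMP sketch of this algorithm on an N×N array updated in place; it assumes the first row and column hold given boundary values and uses an illustrative west+north update rule:

    #define N 1024

    static double a[N][N];                 /* single array, updated in place */

    void wavefront(void) {
      /* antidiagonal d holds the cells (i, j) with i + j == d; the
         antidiagonals must be processed in order, but the cells within
         one antidiagonal are independent of each other */
      for (int d = 2; d <= 2 * (N - 1); d++) {
        int i_min = (d - (N - 1) > 1) ? d - (N - 1) : 1;
        int i_max = (d - 1 < N - 1) ? d - 1 : N - 1;

        #pragma omp parallel for
        for (int i = i_min; i <= i_max; i++) {
          int j = d - i;
          a[i][j] = 0.5 * (a[i - 1][j] + a[i][j - 1]);   /* placeholder rule */
        }
      }
    }

Note how the number of cells per antidiagonal grows from 1 up to N-1 and back down, which is why the number of useful threads varies from one antidiagonal to the next.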
17
Wavefront computation
  • What about cache efficiency?
  • After all, reading only one element from an
    antidiagonal at a time is probably not good
  • They are not contiguous in memory!
  • Solution: blocking
  • Just like matrix multiply (see the sketch below)

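A hedged sketch of the blocked version: the wavefront now runs over antidiagonals of B×B blocks, and each block is swept with ordinary row-major loops so its elements are read contiguously (B and the update rule are again placeholders):

    #define N  1024
    #define B  64                          /* block size; assumes B divides N */
    #define NB (N / B)                     /* blocks per dimension */

    static double a[N][N];

    void blocked_wavefront(void) {
      for (int d = 0; d <= 2 * (NB - 1); d++) {        /* antidiagonals of blocks */
        int bi_min = (d - (NB - 1) > 0) ? d - (NB - 1) : 0;
        int bi_max = (d < NB - 1) ? d : NB - 1;

        #pragma omp parallel for
        for (int bi = bi_min; bi <= bi_max; bi++) {
          int bj = d - bi;
          /* sweep the block row by row: contiguous, cache-friendly accesses */
          for (int i = bi * B; i < (bi + 1) * B; i++)
            for (int j = bj * B; j < (bj + 1) * B; j++)
              if (i > 0 && j > 0)
                a[i][j] = 0.5 * (a[i - 1][j] + a[i][j - 1]);
        }
      }
    }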
18 - 21
Wavefront computation
[Figure sequence: the blocked domain, with the blocks of each block antidiagonal assigned alternately to threads T0 and T1 and computed one antidiagonal of blocks at a time]

22
Performance Modeling
  • One thing we'll do often in this class is
    build performance models
  • Given simple assumptions regarding the underlying
    architecture
  • e.g., ignore cache effects
  • Come up with an analytical formula for the
    parallel speed-up
  • Let's try it on this simple application
  • Let N be the (square) matrix size
  • Let p be the number of threads/cores, which is
    fixed

23
Performance Modeling
  • What if we use p² blocks?
  • We assume that p divides N (N > p)
  • Then the computation proceeds in 2p-1 phases
  • each phase lasts as long as the time to compute
    one block (because of concurrency), Tb
  • Therefore
  • Parallel time = (2p-1) Tb
  • Sequential time = p² Tb
  • Parallel speedup = p² / (2p-1)
  • Parallel efficiency = p / (2p-1)
  • Example
  • p = 2: speedup = 4/3, efficiency ≈ 66%
  • p = 4: speedup = 16/7, efficiency ≈ 57%
  • p = 8: speedup = 64/15, efficiency ≈ 53%
  • Asymptotically, efficiency → 50%

24
Performance Modeling
  • What if we use (b×p)² blocks?
  • b is some integer between 1 and N/p
  • We assume that p divides N (N > p)
  • But performance modeling becomes more complicated
  • The computation still proceeds in 2bp-1 phases
  • But a thread can have more than one block to
    compute during a phase!
  • During phase i, there are
  • i blocks to compute for i = 1, ..., bp
  • 2bp-i blocks to compute for i = bp+1, ..., 2bp-1
  • If there are x (> 0) blocks to compute in a phase,
    then the execution time for that phase is
    ⌊(x-1)/p⌋ + 1, i.e., ⌈x/p⌉
  • Assuming Tb = 1
  • Therefore, the parallel execution time is the sum
    of these per-phase times over the 2bp-1 phases
    (see the sketch below)

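A minimal sketch that evaluates this model numerically (with Tb = 1): it sums the per-phase time ⌈x/p⌉ over the 2bp-1 phases and prints the resulting modeled speedup:

    #include <stdio.h>

    /* modeled parallel time for (b*p)^2 blocks on p threads, assuming Tb = 1 */
    static int parallel_time(int b, int p) {
      int bp = b * p, time = 0;
      for (int i = 1; i <= 2 * bp - 1; i++) {
        int x = (i <= bp) ? i : 2 * bp - i;     /* blocks to compute in phase i */
        time += (x - 1) / p + 1;                /* per-phase time: ceil(x / p) */
      }
      return time;
    }

    int main(void) {
      int p = 4;
      for (int b = 1; b <= 8; b++) {
        double sequential = (double)(b * p) * (b * p);   /* b^2 p^2 blocks in total */
        printf("b = %d: speedup = %.3f\n", b, sequential / parallel_time(b, p));
      }
      return 0;
    }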
25
Performance Modeling
26
Performance Modeling
  • Example: speedup for p = 4

27
Performance Modeling
  • When b gets larger, speedup increases and tends
    to p
  • Since b ≤ N/p, best speedup = Np / (N + p - 1)
  • When N is large compared to p, speedup is very
    close to p
  • Therefore, use a block size of 1, meaning no
    blocking!
  • We're back to where we started, because our
    performance model ignores cache effects!
  • Trade-off
  • From a parallel efficiency perspective: small
    block size
  • From a cache efficiency perspective: big block
    size
  • Possible rule of thumb: use the biggest block
    size that fits in the L1 cache (L2 cache?)
  • Lesson: full performance modeling is difficult
  • We could add the cache behavior, but think of a
    dual-core machine with shared L2 cache, etc.
  • In practice: do performance modeling for
    asymptotic behaviors, and then do experiments to
    find out what works best

28 - 32
Another Possible Algorithm
  • The Dojo cleaning algorithm
[Figure sequence: the domain is divided into p bands of blocks, one band per thread; each thread sweeps through its band, starting one step after the previous thread]

33
Performance Modeling
  • What if we use (b×p)² blocks?
  • b is some integer between 1 and N/p
  • We assume that p divides N (N > p)
  • Each processor computes b²p²/p = b²p blocks
  • The last processor starts computing after p-1
    steps
  • So the last processor finishes computing after
    p-1+b²p steps
  • We obtain
  • T(b,p) = b²p + p - 1
  • S(b,p) = b²p² / (b²p + p - 1)
  • E(b,p) = b²p / (b²p + p - 1)
  • The algorithm is therefore asymptotically
    optimal, like the previous one (see the sketch below)
  • And it's much easier to analyze
  • But perhaps not much easier to implement
  • Question: is it better?

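The same kind of quick numerical check for this model, evaluating S(b,p) for p = 4 and a few values of b to see it approach p as b grows:

    #include <stdio.h>

    int main(void) {
      int p = 4;
      for (int b = 1; b <= 8; b++) {
        double work = (double)b * b * p * p;             /* b^2 p^2 blocks in total */
        double time = (double)b * b * p + p - 1;         /* T(b,p) = b^2 p + p - 1 */
        printf("b = %d: S(b,p) = %.3f\n", b, work / time);
      }
      return 0;
    }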
34
Performance Modeling
Speedup for p = 4
Conclusion: the simpler algorithm is likely
better in practice!
35
Sharks and Fish
  • Simulation of a population of prey and predators
  • Each entity follows some behavior
  • Prey move and breed
  • Predators move, hunt, and breed
  • Given initial populations, nature of the entity
    behaviors (e.g., probability of breeding,
    probability of successful hunting), what do
    populations look like after some time?
  • This is something computational ecologists do all
    the time to study ecosystems

36
Sharks and Fish
  • There are several possibilities to implement such
    a simulation
  • A simple one is to do something that looks like
    the game of life
  • A 2-D domain, with N×N cells (each cell can be
    described by many environmental parameters)
  • Each cell in the domain can hold a shark or a
    fish
  • The simulation is iterative
  • There are several rules for movement, breeding,
    preying
  • Why do it in parallel?
  • Many entities
  • Entity interactions may be complex
  • How can one write this in parallel with threads
    and shared memory?

37
Space partitioning
  • One solution is to divide the 2-D domain between
    threads
  • Each thread deals with the entities in its domain

38
Space partitioning
  • One solution is to divide the 2-D domain between
    threads
  • Each thread deals with the entities in its domain

4 threads
39
Move conflict?
  • Threads can make decisions that will lead to
    conflicts!

40
Move conflict?
  • Threads can make decisions that will lead to
    conflicts!

41
Dealing with conflicts
  • Concept of shadow cells

Only entities in the red regions may cause a
conflict
  • One possible implementation (sketched below)
  • Each thread deals with its green region
  • Thread 1 deals with its red region
  • Thread 2 deals with its red region
  • Thread 3 deals with its red region
  • Thread 4 deals with its red region
  • Repeat
  • Will still prevent some types of moves
  • e.g., no swapping of locations
  • The implementer must make choices

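A hedged OpenMP sketch of this phase structure; update_green_region and update_red_region are hypothetical helpers standing in for the real movement, breeding, and hunting rules:

    #include <omp.h>

    #define NUM_THREADS 4
    #define STEPS 1000

    void update_green_region(int tid);   /* hypothetical: cells far from any thread boundary */
    void update_red_region(int tid);     /* hypothetical: shadow cells along the boundaries */

    void simulate(void) {
      #pragma omp parallel num_threads(NUM_THREADS)
      {
        int tid = omp_get_thread_num();
        for (int step = 0; step < STEPS; step++) {
          /* all threads update their green regions concurrently:
             those entities cannot conflict with another thread's moves */
          update_green_region(tid);
          #pragma omp barrier

          /* red regions are handled one thread at a time, so moves
             across region boundaries can never conflict */
          for (int turn = 0; turn < NUM_THREADS; turn++) {
            if (tid == turn)
              update_red_region(tid);
            #pragma omp barrier
          }
        }
      }
    }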
42
Load Balancing
  • What if all the fish end up in the same region?
  • because they move
  • because they breed
  • Then one thread has much more work to do than the
    others
  • Solution: dynamic repartitioning
  • Modify the partitioning so that the load is
    balanced
  • But perhaps one good idea would be to not do
    domain partitioning at all!
  • How about doing entity partitioning?
  • Better load balancing, but more difficult to deal
    with conflicts
  • May use locks, but high overhead

43
Conclusion
  • Main lessons
  • There are many classes of applications, with many
    domain partitioning schemes
  • Performance modeling is fun but inherently
    limited
  • It's all about trade-offs
  • overhead vs. load balancing
  • parallelism vs. cache usage
  • etc.
  • Remember, this is the easy side of parallel
    computing
  • Things will become more complex in distributed
    memory programming