Principles of High Performance Computing ICS 632 - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Principles of High Performance Computing ICS 632


1
Principles of High Performance Computing (ICS 632)
  • Classic Examples of
  • Shared-Memory Programs

2
Domain Decomposition
  • Now that we know how to create and manage
    threads, we need to decide which thread does what
  • This is really the art of parallel computing
  • Fortunately, in shared memory, it is often quite
    simple
  • We'll look at three examples
  • Embarrassingly parallel application
  • load-balancing issue
  • Non-embarrassingly parallel application
  • thread synchronization issue
  • Sharks and Fish simulation
  • load-balancing AND thread synchronization issue

3
Embarrassingly Parallel
  • Embarrassingly parallel applications
  • Consist of a set of elementary computations
  • These computations can be done in any order
  • They are said to be independent
  • Sometimes referred to as pleasantly parallel
  • Trivial example (sketched below): compute all values
    of a function of two variables over a 2-D domain
  • function f(x,y) <requires many flops>
  • domain: (0,10) × (0,10)
  • domain resolution: 0.001
  • number of points: (10 / 0.001)² = 10⁸
  • number of processors and of threads: 4
  • each thread performs 25×10⁶ function evaluations
  • No need for critical sections
  • No shared output

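A minimal OpenMP sketch of this example; the function f below is a hypothetical stand-in for the expensive function, and results go into a shared array with no synchronization needed:

    #include <math.h>
    #include <stdlib.h>

    #define N 10000                      /* 10 / 0.001 points per dimension */

    /* hypothetical stand-in for the expensive function of two variables */
    static double f(double x, double y) { return sin(x) * cos(y); }

    int main(void) {
      double *result = malloc((size_t)N * N * sizeof(double));

      /* all evaluations are independent: each thread writes to its own
         portion of the shared array, so no critical section is required */
      #pragma omp parallel for num_threads(4)
      for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
          result[(size_t)i * N + j] = f(i * 0.001, j * 0.001);

      free(result);
      return 0;
    }

With a static schedule each of the 4 threads gets a quarter of the rows, i.e., 25×10⁶ evaluations, and since f costs the same everywhere the load is naturally balanced.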
4
Mandelbrot Set
  • In many cases, the cost of computing f varies
    with its input
  • Example: the Mandelbrot set
  • For each complex number c
  • Define the series
  • Z₀ = 0
  • Zₙ₊₁ = Zₙ² + c
  • If the series converges, put a black dot at
    point c
  • i.e., if it hasn't diverged after many iterations
    (see the sketch below)
  • If one partitions the domain into 4 squares among 4
    threads, some of the threads will have much more
    work to do than others

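A small sketch of the per-point test; the iteration bound and the |z| > 2 escape test are standard choices, not values given on the slides:

    #include <complex.h>

    /* Returns 1 if the series z_{n+1} = z_n^2 + c has not diverged after
       max_iter iterations (point drawn black), 0 otherwise. */
    int mandelbrot_black(double complex c, int max_iter) {
      double complex z = 0.0;
      for (int n = 0; n < max_iter; n++) {
        z = z * z + c;
        if (cabs(z) > 2.0)
          return 0;     /* diverged after n iterations: a cheap point */
      }
      return 1;         /* still bounded: this point costs the full max_iter */
    }

The early return is exactly why the cost varies with the input: points inside the set always pay max_iter iterations, while points far outside bail out almost immediately, so equal-area tiles carry very unequal work.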
5
Mandelbrot and Load Balancing
  • The problem with partitioning the domain into 4
    identical tiles is that it leads to load
    imbalance
  • i.e., suboptimal use of the hardware resources
  • Solution
  • do not partition the domain in as many tiles as
    threads
  • instead use many more tiles than threads
  • Then have each thread operate as follows
  • compute a tile
  • when done, request another tile
  • until there are no tiles left to compute
  • This is called a master-worker execution
  • confusing terminology that will make more sense
    when we do distributed memory programming

6
Mandelbrot implementation
  • Conceptually very simple, but how do we write
    code to do it?
  • Pthreads
  • Use some shared (protected) counter that keeps
    track of the next tile
  • the keeping track can be easy or difficult
    depending on the shape of the tiles
  • Threads read and update the counter each time
  • When the counter goes over some predefined value,
    terminate
  • OpenMP
  • Could be done in the same way
  • But OpenMP provides tons of convenient ways to do
    parallel loops
  • including dynamic scheduling strategies, which
    do exactly what we need!
  • Just write the code as a loop over the tiles
  • Add the proper pragma
  • And you're done (see the sketch below)

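A hedged Pthreads sketch of the shared-counter approach; the tile count and the compute_tile helper are placeholders for the real Mandelbrot tile computation:

    #include <pthread.h>

    #define NUM_TILES 1024                 /* many more tiles than threads */

    static int next_tile = 0;              /* shared: index of the next tile to compute */
    static pthread_mutex_t tile_lock = PTHREAD_MUTEX_INITIALIZER;

    void compute_tile(int tile);           /* placeholder: renders one tile */

    void *worker(void *arg) {
      (void)arg;
      for (;;) {
        /* read and update the counter inside a critical section */
        pthread_mutex_lock(&tile_lock);
        int tile = next_tile++;
        pthread_mutex_unlock(&tile_lock);

        if (tile >= NUM_TILES)             /* counter went over the limit: terminate */
          break;
        compute_tile(tile);                /* the actual work happens outside the lock */
      }
      return NULL;
    }

In OpenMP the same behavior comes for free from a dynamically scheduled loop over the tiles, e.g. #pragma omp parallel for schedule(dynamic): the runtime hands out iterations on demand, which is exactly what the shared counter does here.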
7
Dependent Computations
  • In many applications, things are not so simple:
    elementary computations may not be independent
  • otherwise parallel computing would be pretty easy
  • A common example
  • Consider a (1-D, 2-D, ...) domain that consists
    of cells
  • Each cell holds some state, for example
  • temperature, pressure, humidity, wind velocity
  • RGB color value
  • The application consists of rule(s) that must be
    applied to update the cell states
  • possibly over-and-over in an iterative fashion
  • CFD, game of life, image processing, etc.
  • Such applications are often termed Stencil
    Applications
  • We have already talked about one example: heat
    transfer

8
Dependent Computations
  • Really simple
  • Cell value: one floating-point number
  • Program written with two arrays
  • f_old
  • f_new
  • One simple loop: f_new[i] = f_old[i] + ...
    (a sketch follows below)
  • In more realistic cases, the domain is 2-D (or
    worse), there are more terms, and the values on
    the right-hand side can be at time step m+1 as
    well
  • Example from http://ocw.mit.edu/NR/rdonlyres/Nuclear-Engineering/22-00JIntroduction-to-Modeling-and-SimulationSpring2002/55114EA2-9B81-4FD8-90D5-5F64F21D23D0/0/lecture_16.pdf

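A minimal 1-D sketch of the two-array pattern; the averaging rule used here is only illustrative, not the heat-transfer update from the earlier lecture:

    #define N 1000
    #define STEPS 100

    static double buf_a[N], buf_b[N];

    void stencil(void) {
      double *f_old = buf_a, *f_new = buf_b;
      for (int step = 0; step < STEPS; step++) {
        /* every new value depends only on old values, so all iterations
           of this loop are independent within one time step */
        for (int i = 1; i < N - 1; i++)
          f_new[i] = 0.5 * f_old[i] + 0.25 * (f_old[i - 1] + f_old[i + 1]);

        /* swap the roles of the two arrays for the next time step */
        double *tmp = f_old; f_old = f_new; f_new = tmp;
      }
    }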
9
Stencil Applications
  • More generally, a stencil application is defined
    by
  • A shape for the stencil, which defines which
    neighboring cells are used to update a cell
  • with modified shapes for the edges of the domain
  • For each neighboring cell in the stencil, the
    iteration (i.e., time step) of the value that
    should be used to compute the cell value at time
    step t+1
  • t+1, t, t-1, t-2, etc.

[Figure: example stencil shapes on a 2-D domain, showing which neighbor values, at time steps t and t+1, are used to update a cell]
10
Stencil Example
[Figure: a sample stencil that combines neighbor values from time steps t and t+1 to compute a cell value at time step t+1]
11 - 15
Stencil applications
[Figure sequence: the stencil applied over the whole 2-D domain; each cell is labeled with the earliest step (0, 1, 2, ...) at which it can be computed, so the cells of each antidiagonal can be computed together and the computation sweeps across the domain one antidiagonal at a time]
  • Called a wavefront computation
16
Wavefront computation
  • How can we parallelize a wavefront computation?
  • We have seen that the computation consists of
    computing 2n-2 antidiagonals, in sequence.
  • Computations within each antidiagonal are
    independent, and can be done in a multithreaded
    fashion
  • One possible algorithm (sketched below)
  • for each antidiagonal
  • use multiple threads to compute its elements
  • one may need to use a variable number of threads
    because some antidiagonals are very small, while
    some can be large
  • can be implemented with a single array

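A hedged OpenMP sketch of this algorithm on an N×N array updated in place; it assumes the first row and column hold given boundary values and uses an illustrative west+north update rule:

    #define N 1024

    static double a[N][N];                 /* single array, updated in place */

    void wavefront(void) {
      /* antidiagonal d holds the cells (i, j) with i + j == d; the
         antidiagonals must be processed in order, but the cells within
         one antidiagonal are independent of each other */
      for (int d = 2; d <= 2 * (N - 1); d++) {
        int i_min = (d - (N - 1) > 1) ? d - (N - 1) : 1;
        int i_max = (d - 1 < N - 1) ? d - 1 : N - 1;

        #pragma omp parallel for
        for (int i = i_min; i <= i_max; i++) {
          int j = d - i;
          a[i][j] = 0.5 * (a[i - 1][j] + a[i][j - 1]);   /* placeholder rule */
        }
      }
    }

Note how the number of cells per antidiagonal grows from 1 up to N-1 and back down, which is why the number of useful threads varies from one antidiagonal to the next.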
17
Wavefront computation
  • What about cache efficiency?
  • After all, reading only one element from an
    antidiagonal at a time is probably not good
  • They are not contiguous in memory!
  • Solution: blocking
  • Just like matrix multiply (see the sketch below)

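A hedged sketch of the blocked version: the wavefront now runs over antidiagonals of B×B blocks, and each block is swept with ordinary row-major loops so its elements are read contiguously (B and the update rule are again placeholders):

    #define N  1024
    #define B  64                          /* block size; assumes B divides N */
    #define NB (N / B)                     /* blocks per dimension */

    static double a[N][N];

    void blocked_wavefront(void) {
      for (int d = 0; d <= 2 * (NB - 1); d++) {        /* antidiagonals of blocks */
        int bi_min = (d - (NB - 1) > 0) ? d - (NB - 1) : 0;
        int bi_max = (d < NB - 1) ? d : NB - 1;

        #pragma omp parallel for
        for (int bi = bi_min; bi <= bi_max; bi++) {
          int bj = d - bi;
          /* sweep the block row by row: contiguous, cache-friendly accesses */
          for (int i = bi * B; i < (bi + 1) * B; i++)
            for (int j = bj * B; j < (bj + 1) * B; j++)
              if (i > 0 && j > 0)
                a[i][j] = 0.5 * (a[i - 1][j] + a[i][j - 1]);
        }
      }
    }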
18 - 21
Wavefront computation
[Figure sequence: the blocked domain, with the blocks of each block antidiagonal assigned alternately to threads T0 and T1 and computed one antidiagonal of blocks at a time]

22
Performance Modeling
  • One thing we'll do often in this class is
    build performance models
  • Given simple assumptions regarding the underlying
    architecture
  • e.g., ignore cache effects
  • Come up with an analytical formula for the
    parallel speed-up
  • Let's try it on this simple application
  • Let N be the (square) matrix size
  • Let p be the number of threads/cores, which is
    fixed

23
Performance Modeling
  • What if we use p² blocks?
  • We assume that p divides N (N > p)
  • Then the computation proceeds in 2p-1 phases
  • each phase lasts as long as the time to compute
    one block (because of concurrency), Tb
  • Therefore
  • Parallel time = (2p-1) Tb
  • Sequential time = p² Tb
  • Parallel speedup = p² / (2p-1)
  • Parallel efficiency = p / (2p-1)
  • Example
  • p = 2: speedup = 4/3, efficiency ≈ 66%
  • p = 4: speedup = 16/7, efficiency ≈ 57%
  • p = 8: speedup = 64/15, efficiency ≈ 53%
  • Asymptotically, efficiency → 50%

24
Performance Modeling
  • What if we use (b×p)² blocks?
  • b is some integer between 1 and N/p
  • We assume that p divides N (N > p)
  • But performance modeling becomes more complicated
  • The computation still proceeds in 2bp-1 phases
  • But a thread can have more than one block to
    compute during a phase!
  • During phase i, there are
  • i blocks to compute for i = 1, ..., bp
  • 2bp-i blocks to compute for i = bp+1, ..., 2bp-1
  • If there are x (> 0) blocks to compute in a phase,
    then the execution time for that phase is
    ⌊(x-1)/p⌋ + 1, i.e., ⌈x/p⌉
  • Assuming Tb = 1
  • Therefore, the parallel execution time is the sum
    of these per-phase times over the 2bp-1 phases
    (see the sketch below)

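A minimal sketch that evaluates this model numerically (with Tb = 1): it sums the per-phase time ⌈x/p⌉ over the 2bp-1 phases and prints the resulting modeled speedup:

    #include <stdio.h>

    /* modeled parallel time for (b*p)^2 blocks on p threads, assuming Tb = 1 */
    static int parallel_time(int b, int p) {
      int bp = b * p, time = 0;
      for (int i = 1; i <= 2 * bp - 1; i++) {
        int x = (i <= bp) ? i : 2 * bp - i;     /* blocks to compute in phase i */
        time += (x - 1) / p + 1;                /* per-phase time: ceil(x / p) */
      }
      return time;
    }

    int main(void) {
      int p = 4;
      for (int b = 1; b <= 8; b++) {
        double sequential = (double)(b * p) * (b * p);   /* b^2 p^2 blocks in total */
        printf("b = %d: speedup = %.3f\n", b, sequential / parallel_time(b, p));
      }
      return 0;
    }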
25
Performance Modeling
26
Performance Modeling
  • Example: speedup for p = 4

27
Performance Modeling
  • When b gets larger, speedup increases and tends
    to p
  • Since b ≤ N/p, best speedup = Np / (N + p - 1)
  • When N is large compared to p, speedup is very
    close to p
  • Therefore, use a block size of 1, meaning no
    blocking!
  • We're back to where we started, because our
    performance model ignores cache effects!
  • Trade-off
  • From a parallel efficiency perspective: small
    block size
  • From a cache efficiency perspective: big block
    size
  • Possible rule of thumb: use the biggest block
    size that fits in the L1 cache (L2 cache?)
  • Lesson: full performance modeling is difficult
  • We could add the cache behavior, but think of a
    dual-core machine with shared L2 cache, etc.
  • In practice: do performance modeling for
    asymptotic behaviors, and then do experiments to
    find out what works best

28 - 32
Another Possible Algorithm
  • The Dojo cleaning algorithm
[Figure sequence: the domain is divided into p bands of blocks, one band per thread; each thread sweeps through its band, starting one step after the previous thread]

33
Performance Modeling
  • What if we use (b×p)² blocks?
  • b is some integer between 1 and N/p
  • We assume that p divides N (N > p)
  • Each processor computes b²p²/p = b²p blocks
  • The last processor starts computing after p-1
    steps
  • So the last processor finishes computing after
    p-1+b²p steps
  • We obtain
  • T(b,p) = b²p + p - 1
  • S(b,p) = b²p² / (b²p + p - 1)
  • E(b,p) = b²p / (b²p + p - 1)
  • The algorithm is therefore asymptotically
    optimal, like the previous one (see the sketch below)
  • And it's much easier to analyze
  • But perhaps not much easier to implement
  • Question: is it better?

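The same kind of quick numerical check for this model, evaluating S(b,p) for p = 4 and a few values of b to see it approach p as b grows:

    #include <stdio.h>

    int main(void) {
      int p = 4;
      for (int b = 1; b <= 8; b++) {
        double work = (double)b * b * p * p;             /* b^2 p^2 blocks in total */
        double time = (double)b * b * p + p - 1;         /* T(b,p) = b^2 p + p - 1 */
        printf("b = %d: S(b,p) = %.3f\n", b, work / time);
      }
      return 0;
    }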
34
Performance Modeling
Speedup for p = 4
Conclusion: the simpler algorithm is likely
better in practice!
35
Sharks and Fish
  • Simulation of a population of prey and predators
  • Each entity follows some behavior
  • Prey move and breed
  • Predators move, hunt, and breed
  • Given initial populations, nature of the entity
    behaviors (e.g., probability of breeding,
    probability of successful hunting), what do
    populations look like after some time?
  • This is something computational ecologists do all
    the time to study ecosystems

36
Sharks and Fish
  • There are several possibilities to implement such
    a simulation
  • A simple one is to do something that looks like
    the game of life
  • A 2-D domain, with N×N cells (each cell can be
    described by many environmental parameters)
  • Each cell in the domain can hold a shark or a
    fish
  • The simulation is iterative
  • There are several rules for movement, breeding,
    preying
  • Why do it in parallel?
  • Many entities
  • Entity interactions may be complex
  • How can one write this in parallel with threads
    and shared memory?

37
Space partitioning
  • One solution is to divide the 2-D domain between
    threads
  • Each thread deals with the entities in its domain

38
Space partitioning
  • One solution is to divide the 2-D domain between
    threads
  • Each thread deals with the entities in its domain

4 threads
39
Move conflict?
  • Threads can make decisions that will lead to
    conflicts!

40
Move conflict?
  • Threads can make decisions that will lead to
    conflicts!

41
Dealing with conflicts
  • Concept of shadow cells

Only entities in the red regions may cause a
conflict
  • One possible implementation (sketched below)
  • Each thread deals with its green region
  • Thread 1 deals with its red region
  • Thread 2 deals with its red region
  • Thread 3 deals with its red region
  • Thread 4 deals with its red region
  • Repeat
  • Will still prevent some types of moves
  • e.g., no swapping of locations
  • The implementer must make choices

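A hedged OpenMP sketch of this phase structure; update_green_region and update_red_region are hypothetical helpers standing in for the real movement, breeding, and hunting rules:

    #include <omp.h>

    #define NUM_THREADS 4
    #define STEPS 1000

    void update_green_region(int tid);   /* hypothetical: cells far from any thread boundary */
    void update_red_region(int tid);     /* hypothetical: shadow cells along the boundaries */

    void simulate(void) {
      #pragma omp parallel num_threads(NUM_THREADS)
      {
        int tid = omp_get_thread_num();
        for (int step = 0; step < STEPS; step++) {
          /* all threads update their green regions concurrently:
             those entities cannot conflict with another thread's moves */
          update_green_region(tid);
          #pragma omp barrier

          /* red regions are handled one thread at a time, so moves
             across region boundaries can never conflict */
          for (int turn = 0; turn < NUM_THREADS; turn++) {
            if (tid == turn)
              update_red_region(tid);
            #pragma omp barrier
          }
        }
      }
    }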
42
Load Balancing
  • What if all the fish end up in the same region?
  • because they move
  • because they breed
  • Then one thread has much more work to do than the
    others
  • Solution: dynamic repartitioning
  • Modify the partitioning so that the load is
    balanced
  • But perhaps one good idea would be to not do
    domain partitioning at all!
  • How about doing entity partitioning?
  • Better load balancing, but more difficult to deal
    with conflicts
  • May use locks, but high overhead

43
Conclusion
  • Main lessons
  • There are many classes of applications, with many
    domain partitioning schemes
  • Performance modeling is fun but inherently
    limited
  • It's all about trade-offs
  • overhead vs. load balancing
  • parallelism vs. cache usage
  • etc.
  • Remember, this is the easy side of parallel
    computing
  • Things will become more complex in distributed
    memory programming