Title: High Performance Parallel Programming
1 High Performance Parallel Programming
- Dirk van der Knijff
- Advanced Research Computing
- Information Division
2 High Performance Parallel Programming
- Lecture 1: Introduction to Parallel Programming
3 Introduction
- Parallel programming covers all occasions where more than one functional element is involved.
- Most simple hardware parallelism is already exploited:
- Bit-parallel adders and multipliers
- Multiple arithmetic units
- Overlapped compute and I/O
- Pipelined instruction execution
- We are concerned with higher levels.
4 Why parallel?
- Because it's faster!
- Even without good scalability (though not always).
- Because it's cheaper!
- Leverages commodity components.
- Allows computers of arbitrary size.
- Because it's natural!
- The real world is parallel.
- Computer languages usually force a serial view.
5 Taxonomy
- Flynn's taxonomy
- Based on hardware, but can be applied to software also.
- Today all hardware is MIMD; most programs are SPMD (based on SIMD).
6 Parallel Architectures
- SIMD - Single Instruction Multiple Data: Data Parallel Machines
- MIMD - Multiple Instruction Multiple Data:
- SMP - Shared Memory Processing
- ccNUMA - cache coherent Non-Uniform Memory Access
- DM - Distributed Memory:
- MPP - Massively Parallel Processors
- NOWs - Networks Of Workstations
7 Programming Models
- Machine architectures and programming models have a natural affinity, both in their descriptions and in their effects on each other. Algorithms have been developed to fit machines, and machines have been built to fit algorithms.
- Purpose-built supercomputers - ASCI
- FPGAs - QCD machines
8 Parallel Programming Models
- Control
- How is parallelism created?
- What orderings exist between operations?
- How do different threads of control synchronize?
- Data
- What data is private vs. shared?
- How is logically shared data accessed or communicated?
- Operations
- What are the atomic operations?
- Cost
- How do we account for the cost of each of the above?
9 Trivial Example
- Example: compute the sum of n function evaluations, s = f(A(1)) + ... + f(A(n)).
- Parallel Decomposition:
- Each evaluation and each partial sum is a task.
- Assign n/p numbers to each of p procs.
- Each computes independent private results and a partial sum.
- One (or all) collects the p partial sums and computes the global sum.
- Two Classes of Data:
- Logically Shared
- The original n numbers, the global sum.
- Logically Private
- The individual function evaluations.
- What about the individual partial sums?
10 Model 1: Shared Address Space
- Program consists of a collection of threads of control.
- Each has a set of private variables, e.g. local variables on the stack.
- Collectively they share a set of shared variables, e.g. static variables, shared common blocks, global heap.
- Threads communicate implicitly by writing and reading shared variables.
- Threads coordinate explicitly by synchronization operations on shared variables -- writing and reading flags, locks or semaphores.
- Like concurrent programming on a uniprocessor.
11 Shared Address Space
Machine model - Shared memory
12 Shared Memory Code for Computing a Sum

    s = 0 initially (shared)

    Thread 1:
      local_s1 = 0
      for i = 0, n/2-1
        local_s1 = local_s1 + f(A[i])
      s = s + local_s1

    Thread 2:
      local_s2 = 0
      for i = n/2, n-1
        local_s2 = local_s2 + f(A[i])
      s = s + local_s2

What could go wrong?
13 Solution via Synchronization
- Pitfall in computing a global sum s = local_s1 + local_s2:

    Thread 1 (initially s = 0):
      load s from mem to reg
      s = s + local_s1     (local_s1 in reg)
      store s from reg to mem

    Thread 2 (initially s = 0):
      load s from mem to reg   (initially 0)
      s = s + local_s2     (local_s2 in reg)
      store s from reg to mem

    (time runs downwards; the two threads' loads and stores may interleave)

- Instructions from different threads can be interleaved arbitrarily.
- What can the final result s stored in memory be?
- Problem: a race condition.
- Possible solution: mutual exclusion with locks.

    Thread 1: lock; load s; s = s + local_s1; store s; unlock
    Thread 2: lock; load s; s = s + local_s2; store s; unlock

- Locks must be atomic (execute completely without interruption).
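Not part of the original slides: a minimal C sketch of the locked global sum, assuming POSIX threads; the array size, the placeholder function f and the two-way split are illustrative.

    /* sum_locked.c - two threads, each forms a private partial sum, then
       updates the shared total s under a mutex (compile with -pthread). */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define N 1000
    static double A[N];
    static double s = 0.0;                         /* logically shared global sum */
    static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

    static double f(double x) { return x * x; }    /* placeholder for the real f */

    static void *partial_sum(void *arg)
    {
        long lo = (long)(intptr_t)arg;             /* start of this thread's half */
        double local_s = 0.0;                      /* logically private partial sum */
        for (long i = lo; i < lo + N / 2; i++)
            local_s += f(A[i]);
        pthread_mutex_lock(&s_lock);               /* lock ...                     */
        s += local_s;                              /* ... update shared s ...      */
        pthread_mutex_unlock(&s_lock);             /* ... unlock                   */
        return NULL;
    }

    int main(void)
    {
        for (long i = 0; i < N; i++) A[i] = (double)i;
        pthread_t t1, t2;
        pthread_create(&t1, NULL, partial_sum, (void *)(intptr_t)0);
        pthread_create(&t2, NULL, partial_sum, (void *)(intptr_t)(N / 2));
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("s = %f\n", s);
        return 0;
    }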
14 Model 2: Message Passing
- Program consists of a collection of named processes.
- Thread of control plus local address space -- NO shared data.
- Local variables, static variables, common blocks, heap.
- Processes communicate by explicit data transfers -- matching send and receive pairs, identified by source and destination processors (these may be broadcasts).
- Coordination is implicit in every communication event.
- Logically shared data is partitioned over local processes.
- Like distributed programming -- program with MPI, PVM.
15 Message Passing
[Diagram: processes P0 ... Pn, each with its own local memory. P0 executes "send P0,X" and Pn executes "recv Pn,Y"; the value of X is transferred across the network into Y.]
Machine model - Distributed Memory
16 Computing s = x(1) + x(2) on each processor
- First possible solution:

    Processor 1:
      send xlocal, proc2       (xlocal = x(1))
      receive xremote, proc2
      s = xlocal + xremote

    Processor 2:
      receive xremote, proc1
      send xlocal, proc1       (xlocal = x(2))
      s = xlocal + xremote

- Second possible solution -- what could go wrong?

    Processor 1:
      send xlocal, proc2       (xlocal = x(1))
      receive xremote, proc2
      s = xlocal + xremote

    Processor 2:
      send xlocal, proc1       (xlocal = x(2))
      receive xremote, proc1
      s = xlocal + xremote

- What if send/receive acts like the telephone system? The post office?
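Not from the original slides: a hedged MPI sketch in C of the first (send/receive ordered) exchange, assuming exactly two processes; the values 1.0 and 2.0 stand in for x(1) and x(2). MPI itself is covered in tomorrow's lecture.

    /* exchange.c - run with: mpirun -np 2 ./exchange */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double xlocal, xremote, s;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        xlocal = (rank == 0) ? 1.0 : 2.0;          /* x(1) on proc 1, x(2) on proc 2 */

        if (rank == 0) {                           /* "processor 1": send first      */
            MPI_Send(&xlocal,  1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&xremote, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {                                   /* "processor 2": receive first   */
            MPI_Recv(&xremote, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&xlocal,  1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }

        s = xlocal + xremote;                      /* both processes end up with s   */
        printf("rank %d: s = %f\n", rank, s);

        MPI_Finalize();
        return 0;
    }

With the second ordering (both processes sending first), a synchronous ("telephone") send deadlocks, while a buffered ("post office") send completes.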
17 Model 3: Data Parallel
- Single sequential thread of control consisting of parallel operations.
- Parallel operations applied to all (or a defined subset) of a data structure.
- Communication is implicit in parallel operators and shifted data structures.
- Elegant and easy to understand and reason about.
- Like marching in a regiment.
- Used by Matlab.
- Drawback: not all problems fit this model.
18 Data Parallel
Machine model - SIMD
19 SIMD Architecture
- A large number of (usually) small processors.
- A single control processor issues each instruction.
- Each processor executes the same instruction.
- Some processors may be turned off on some instructions.
- The machines are no longer popular (e.g. the CM2), but the programming model is.
- Implemented by mapping n-fold parallelism to p processors.
- Programming
- Mostly done in the compilers (HPF, High Performance Fortran).
- Usually modified to Single Program Multiple Data, or SPMD.
20 Data Parallel Programming
- Data parallel programming involves the processing and manipulation of large arrays.
- Processors should (simultaneously) perform similar operations on different array elements.
- Each processor has a local memory, so large arrays must be distributed across many different processors.
- Arrays can be distributed in different ways depending on how they are used.
- We want to choose a distribution which maximises the ratio of local work to communication.
21 Data Parallel Programming (HPF)
- The most successful data-parallel language
- An extension to Fortran 90/95
- A fast language
- Has powerful array constructs
- Has a suitable data-structure implementation
- HPF is based on compiler directives
- The easiest way to write parallel programs
- The compiler does most of the work
22 OpenMP - Shared Memory Systems
- The key feature is a single address space across the whole memory system:
- every processor can read and write all memory locations.
- Caches are kept coherent:
- all processors have the same view of memory.
- Two main types:
- true shared memory
- distributed shared memory
23 Threads and thread teams
- A thread is a (lightweight) process - an instance of a program and its data.
- Each thread can follow its own flow of control through a program.
- Threads can share data with other threads, but also have private data.
- Threads communicate with each other via the shared data.
- A thread team is a set of threads which co-operate on a task.
- The master thread is responsible for co-ordinating the team.
[Diagram: threads 1, 2 and 3, each with its own program counter (PC) and private data, all operating on common shared data.]
24 Directives and sentinels
- A directive is a special line of source code with meaning only to a compiler that understands it.
- Note the difference between directives (must be obeyed) and hints (may be obeyed).
- A directive is distinguished by a sentinel at the start of the line.
- OpenMP sentinels are:
- Fortran: !$OMP (or C$OMP or *$OMP in fixed form)
- C/C++: #pragma omp
25 Parallel region
- The parallel region is the basic parallel construct in OpenMP.
- A parallel region defines a section of a program.
- The program begins execution on a single thread (the master thread).
- When the first parallel region is encountered, the master thread creates a team of threads (fork/join model).
- Every thread executes the statements which are inside the parallel region.
- At the end of the parallel region, the master thread waits for the other threads to finish, and then continues executing the next statements.
26 Parallel region

    program fred
      ...
    !$omp parallel
      ...
    !$omp end parallel
      ...
    !$omp parallel
      ...
    !$omp end parallel
      ...
27 Shared and private data
- Inside a parallel region, variables can be either shared or private.
- All threads see the same copy of shared variables.
- All threads can read or write shared variables.
- Each thread has its own copy of private variables: these are invisible to other threads.
- A private variable can only be read or written by its own thread.
28 Parallel loops
- Loops are the main source of parallelism in many applications.
- If the iterations of a loop are independent (can be done in any order), then we can share out the iterations between different threads.
- e.g. if we have two threads and the loop

    do i = 1, 100
      a(i) = a(i) + b(i)
    end do

  we could do iterations 1-50 on one thread and iterations 51-100 on the other (see the C sketch below).
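A minimal C/OpenMP sketch of the loop above (not in the original slides); with two threads, the runtime typically gives each thread half of the 100 iterations.

    #include <omp.h>

    void add_arrays(double *a, const double *b)
    {
        int i;
        /* iterations are independent, so they can be shared between threads */
        #pragma omp parallel for
        for (i = 0; i < 100; i++)
            a[i] = a[i] + b[i];
    }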
29 Synchronisation
- Need to ensure that actions on shared variables occur in the correct order, e.g.:
- thread 1 must write variable A before thread 2 reads it,
- or
- thread 1 must read variable A before thread 2 writes it.
- Note that updates to shared variables (e.g. a = a + 1) are not atomic!
- If two threads try to do this at the same time, one of the updates may get overwritten.
30 Reductions
- A reduction produces a single value from associative operations such as addition, multiplication, max, min, and, or.
- For example:

    b = 0;
    for (i = 0; i < n; i++)
      b = b + a[i];

- Allowing only one thread at a time to update b would remove all parallelism.
- Instead, each thread can accumulate its own private copy; these copies are then reduced to give the final result (see the sketch below).
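Not in the original slides: a C/OpenMP sketch of the "private copy per thread, combine at the end" idea written out by hand (the REDUCTION clause introduced later does this automatically).

    #include <omp.h>

    double sum(const double *a, int n)
    {
        double b = 0.0;
        #pragma omp parallel shared(a, n, b)
        {
            double local_b = 0.0;          /* private accumulator for this thread */
            #pragma omp for
            for (int i = 0; i < n; i++)
                local_b += a[i];
            #pragma omp critical           /* one protected combine per thread,   */
            b += local_b;                  /* not one per array element           */
        }
        return b;
    }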
31 Parallel region directive
- Code within a parallel region is executed by all threads.
- Syntax:
- Fortran:
    !$OMP PARALLEL
      block
    !$OMP END PARALLEL
- C/C++:
    #pragma omp parallel
      structured block
- e.g.

    call fred
    !$OMP PARALLEL
    call billy
    !$OMP END PARALLEL
    call daisy

[Diagram: fred runs on the master thread only; billy runs on every thread of the team (shown four times); daisy runs on the master thread again after the join.]
32 Useful functions
- It is often useful to find out the number of threads being used:
- Fortran: integer function omp_get_num_threads()
- C/C++:  #include <omp.h>
          int omp_get_num_threads(void);
- It is also useful to find out the number of the executing thread:
- Fortran: integer function omp_get_thread_num()
- C/C++:  #include <omp.h>
          int omp_get_thread_num(void);
- This takes values between 0 and omp_get_num_threads() - 1.
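A short C example (not in the original slides) using both functions inside a parallel region:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        {
            int nthreads = omp_get_num_threads();  /* size of the current team     */
            int myid     = omp_get_thread_num();   /* 0 .. nthreads-1, 0 = master  */
            printf("hello from thread %d of %d\n", myid, nthreads);
        }
        return 0;
    }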
33 Clauses
- Additional information is specified in the parallel region directive through clauses:
- Fortran: !$OMP PARALLEL [clauses]
- C/C++:  #pragma omp parallel [clauses]
- Clauses are comma or space separated in Fortran, and space separated in C/C++.
34 Shared and private variables
- Inside a parallel region, variables can be either shared (all threads see the same copy) or private (each thread has its own copy).
- Defined using shared, private and default clauses:
- Fortran: shared(list)
           private(list)
           default(shared|private|none)
- C/C++:  shared(list)
          private(list)
          default(shared|none)
35 Shared and private (cont)
- Example: each thread initialises its own column of a shared array (a C version follows below):

    !$OMP PARALLEL DEFAULT(NONE), PRIVATE(I,MYID), &
    !$OMP SHARED(A,N)
      myid = omp_get_thread_num() + 1
      do i = 1, n
        a(i,myid) = 1.0
      end do
    !$OMP END PARALLEL
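A hedged C translation of the example above (not in the original slides); each thread initialises its own row of a shared array, since C rows play the role of Fortran columns. The array bounds are illustrative.

    #include <omp.h>

    #define N 1024
    #define MAXTHREADS 64
    double a[MAXTHREADS][N];                       /* shared array */

    void init(int n)
    {
        int i, myid;
        #pragma omp parallel default(none) private(i, myid) shared(a, n)
        {
            myid = omp_get_thread_num();           /* 0-based, unlike the Fortran +1 */
            for (i = 0; i < n; i++)
                a[myid][i] = 1.0;
        }
    }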
36 Shared and private (cont)
- How do we decide which variables should be shared and which private?
- Most variables are shared.
- Loop indices are private.
- Loop temporaries are private.
- Read-only variables - shared.
- Main arrays - shared.
- Write-before-read scalars - usually private.
- Sometimes either is semantically OK, but there may be performance implications in making the choice.
- N.B. you can have private arrays as well as scalars.
37 Initialising private variables
- Private variables are uninitialised at the start of the parallel region.
- If we wish to initialise them, we use the FIRSTPRIVATE clause:
- Fortran: firstprivate(list)
- C/C++:  firstprivate(list)
- e.g.

    b = 23.0
    . . . . .
    !$OMP PARALLEL FIRSTPRIVATE(B), &
    !$OMP PRIVATE(I,MYID)
      myid = omp_get_thread_num() + 1
      do i = 1, n
        b = b + c(i,myid)
      end do
      c(n+1,myid) = b
    !$OMP END PARALLEL
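A minimal C sketch of FIRSTPRIVATE (not in the original slides), following the Fortran example: each thread's private b starts from the master's value 23.0. The array shape is illustrative.

    #include <omp.h>

    #define N 100
    #define MAXTHREADS 64
    double c[MAXTHREADS][N + 1];                   /* shared array */

    void work(int n)
    {
        double b = 23.0;
        int i, myid;
        #pragma omp parallel default(none) firstprivate(b) private(i, myid) shared(c, n)
        {
            myid = omp_get_thread_num();
            for (i = 0; i < n; i++)
                b = b + c[myid][i];                /* private b, initialised to 23.0 */
            c[myid][n] = b;
        }
    }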
38 Reductions
- A reduction produces a single value from associative operations such as addition, multiplication, max, min, and, or.
- We would like each thread to reduce into a private copy, and then reduce all these copies to give the final result.
- Use the REDUCTION clause:
- Fortran: reduction(op:list)
- C/C++:  reduction(op:list)
- N.B. you cannot have reduction arrays, only scalars or array elements!
39 Reduction example

    !$OMP PARALLEL REDUCTION(+:B), &
    !$OMP PRIVATE(I,MYID)
      myid = omp_get_thread_num() + 1
      do i = 1, n
        b = b + c(i,myid)
      end do
    !$OMP END PARALLEL
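The C/C++ counterpart (not in the original slides), a sketch using the reduction clause on a simple sum:

    #include <omp.h>

    double reduce(const double *c, int n)
    {
        double b = 0.0;
        int i;
        #pragma omp parallel for reduction(+:b)    /* each thread gets a private b   */
        for (i = 0; i < n; i++)
            b += c[i];                             /* copies are combined at the end */
        return b;
    }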
40 Parallel do loops
- Loops are the most common source of parallelism in most codes, so parallel loop directives are very important!
- A parallel do loop divides up the iterations of the loop between threads.
- Fortran:
    !$OMP DO [clauses]
      do loop
    !$OMP END DO
- C/C++:
    #pragma omp for [clauses]
      for loop
- Restrictions in C/C++: it has to look like a DO loop - it must be of the form
    for (var = a; var logical-op b; incr-exp)
  where logical-op is one of <, <=, >, >=,
  and incr-exp is var = var +/- incr, or var++, var--, etc.
41 Parallel do loops (cont)
- With no additional clauses, the DO/FOR directive will usually partition the iterations as equally as possible between the threads.
- However, this is implementation dependent, and there is still some ambiguity: e.g. with 7 iterations and 3 threads, it could partition as 3+3+1 or 3+2+2.
- How can you tell if a loop is parallel or not?
- Useful test: if the loop gives the same answers when it is run in reverse order, then it is almost certainly parallel.
- e.g. this loop is not parallel, since each iteration depends on the previous one:

    do i = 2, n
      a(i) = 2 * a(i-1)
    end do
42 Parallel do loops (cont)
- Further examples - are these loops parallel?

    ix = base
    do i = 1, n
      a(ix) = a(ix) + b(i)
      ix = ix + stride
    end do

    do i = 1, n
      b(i) = (a(i) - a(i-1)) * 0.5
    end do
43 Parallel do loops (example)
- Example:

    !$OMP PARALLEL
    !$OMP DO
      do i = 1, n
        b(i) = (a(i) - a(i-1)) * 0.5
      end do
    !$OMP END DO
    !$OMP END PARALLEL
44 SCHEDULE clause
- The SCHEDULE clause gives a variety of options for specifying which loop iterations are executed by which thread.
- Syntax: schedule(kind, chunksize)
- where kind is one of STATIC, DYNAMIC, GUIDED or RUNTIME,
- and chunksize is an integer expression with positive value.
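A small C sketch of the SCHEDULE clause (not in the original slides); the loop and the chunksize are illustrative, and the other kinds described below can be substituted without changing the loop body.

    #include <omp.h>

    void scale(double *a, int n)
    {
        int i;
        /* hand out chunks of 4 iterations to the threads in round-robin order */
        #pragma omp parallel for schedule(static, 4)
        for (i = 0; i < n; i++)
            a[i] = 2.0 * a[i];
        /* alternatives: schedule(dynamic, 4), schedule(guided, 3), schedule(runtime) */
    }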
45 STATIC schedule
- With no chunksize specified, the iteration space is divided into (approximately) equal chunks, and one chunk is assigned to each thread (block schedule).
- If chunksize is specified, the iteration space is divided into chunks, each of chunksize iterations, and the chunks are assigned cyclically to each thread (block cyclic schedule).
[Diagram: schedule(static) gives threads T0-T3 one large contiguous block each; schedule(static,4) deals chunks of 4 iterations to T0, T1, T2, T3, T0, T1, ... in round-robin order.]
46 DYNAMIC schedule
- The DYNAMIC schedule divides the iteration space up into chunks of size chunksize, and assigns them to threads on a first-come-first-served basis.
- i.e. as a thread finishes a chunk, it is assigned the next chunk in the list.
- When no chunksize is specified, it defaults to 1.
- Note - this may be inefficient - you should specify a chunksize that matches the cache-line length to avoid false sharing.
[Diagram: schedule(dynamic,4) - chunks of 4 iterations handed out to whichever thread becomes free next.]
47 GUIDED schedule
- The GUIDED schedule is similar to DYNAMIC, but the chunks start off large and get exponentially smaller.
- The size of the next chunk is (roughly) the number of remaining iterations divided by the number of threads.
- The chunksize specifies the minimum size of the chunks.
- When no chunksize is specified it defaults to 1.
[Diagram: schedule(guided,3) - successively smaller chunks, down to a minimum of 3 iterations each.]
48 RUNTIME schedule
- The RUNTIME schedule defers the choice of schedule to run time, when it is determined by the value of the environment variable OMP_SCHEDULE.
- e.g. export OMP_SCHEDULE="guided,4"
- It is illegal to specify a chunksize with the RUNTIME schedule.
49 Synchronization
- Recall that we need to:
- synchronise actions on shared variables,
- respect dependencies (true and anti),
- protect updates to shared variables (not atomic by default).
50 Critical sections
- A critical section is a block of code which can be executed by only one thread at a time.
- It can be used to protect updates to shared variables.
- The CRITICAL directive allows critical sections to be named.
- If one thread is in a critical section with a given name, no other thread may be in a critical section with the same name, though they can be in critical sections with other names.
- Fortran: !$OMP CRITICAL ( name )
            block
          !$OMP END CRITICAL ( name )
- C/C++:  #pragma omp critical ( name )
            structured block
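A minimal C sketch of a named critical section (not in the original slides); the counter and the name RES are illustrative, and the function is assumed to be called from inside a parallel region.

    #include <omp.h>

    int nfound = 0;                                /* shared counter */

    void tally(int hit)
    {
        if (hit) {
            #pragma omp critical (RES)             /* one thread at a time in RES */
            nfound = nfound + 1;
        }
    }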
51 BARRIER directive
- No thread can proceed past a barrier until all the other threads have arrived.
- Note that there is an implicit barrier at the end of DO/FOR, SECTIONS and SINGLE directives.
- Fortran: !$OMP BARRIER
- C/C++:  #pragma omp barrier
- Either all threads or none must encounter the barrier, otherwise DEADLOCK!!
- e.g.

    !$OMP PARALLEL PRIVATE(I,MYID)
      myid = omp_get_thread_num()
      a(myid) = a(myid) * 3.5
    !$OMP BARRIER
      b(myid) = a(neighb(myid)) + c
    !$OMP END PARALLEL
52 ATOMIC directive
- C/C++: #pragma omp atomic
           statement
- where statement must have one of the forms:
    x binop= expr, x++, ++x, x--, or --x
- and binop is one of +, *, -, /, &, ^, |, <<, or >>
- Note that the evaluation of expr is not atomic.
- May be more efficient than using CRITICAL directives, e.g. if different array elements can be protected separately.
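A sketch in C (not in the original slides) of ATOMIC protecting single array-element updates; idx and v are illustrative, and distinct elements can be updated concurrently.

    void scatter_add(double *x, const int *idx, const double *v, int n)
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < n; i++) {
            #pragma omp atomic                     /* the update itself is atomic... */
            x[idx[i]] += v[i];                     /* ...evaluating v[i] is not      */
        }
    }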
53 ATOMIC directive
- Used to protect a single update to a shared variable.
- Applies only to a single statement.
- Syntax:
- Fortran: !$OMP ATOMIC
             statement
- where statement must have one of these forms:
    x = x op expr, x = expr op x, x = intr(x, expr) or x = intr(expr, x)
- op is one of +, *, -, /, .and., .or., .eqv., or .neqv.
- intr is one of MAX, MIN, IAND, IOR or IEOR
54 Lock routines
- Occasionally we may require more flexibility than is provided by the CRITICAL and ATOMIC directives.
- A lock is a special variable that may be set by a thread. No other thread may set the lock until the thread which set the lock has unset it.
- Setting a lock can either be blocking or non-blocking.
- A lock must be initialised before it is used, and may be destroyed when it is no longer required.
- Lock variables should not be used for any other purpose.
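A minimal C sketch of the OpenMP lock routines (not in the original slides); the shared list is illustrative. omp_set_lock blocks; omp_test_lock is the non-blocking variant.

    #include <omp.h>

    omp_lock_t list_lock;
    int head = -1;                                          /* shared list head       */

    void setup(void)    { omp_init_lock(&list_lock); }      /* initialise before use  */
    void teardown(void) { omp_destroy_lock(&list_lock); }   /* destroy when finished  */

    void push(int item, int *next)                          /* called by many threads */
    {
        omp_set_lock(&list_lock);                           /* blocks until acquired  */
        next[item] = head;
        head = item;
        omp_unset_lock(&list_lock);
    }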
55 Choosing synchronisation
- As a rough guide, use ATOMIC directives if possible, as these allow the most optimisation.
- If this is not possible, use CRITICAL directives. Make sure you use different names wherever possible.
- As a last resort, you may need to use the lock routines, but this should be quite a rare occurrence.
56 Further reading
- OpenMP Specification
- http://www.openmp.org/
- My self-paced course (under development)
- http://www.hpc.unimelb.edu.au/vpic/omp/contents.html
57 High Performance Parallel Programming
- Tomorrow - MPI programming.