Title: Introduction to MPI Programming
1- Introduction to MPI Programming
- (Part II)
- Michael Griffiths, Deniz Savas, Alan Real
- January 2006
-
2Overview
- Review point to point communications
- Data types
- Data packing
- Collective Communication
- Broadcast, Scatter, Gather of data
- Reduction Operations
- Barrier Synchronisation
- Patterns for Parallel Programming
- Exercises
3Blocking operations
- Relate to when the operation has completed
- Only return from the subroutine call when the
operation has completed
4Non-blocking operations
- Return straight away and allow the sub-program to return to perform other work.
- At some time later the sub-program should test or wait for the completion of the non-blocking operation.
- A non-blocking operation immediately followed by a matching wait is equivalent to a blocking operation.
- Non-blocking operations are not the same as sequential subroutine calls as the operation continues after the call has returned.
6Non-blocking communication
- Separate communication into three phases
- Initiate non-blocking communication
- Do some work
- Perhaps involving other communications
- Wait for non-blocking communication to complete.
7Non-blocking send
(Diagram: the sender posts a send request into MPI_COMM_WORLD, carries on working, then waits; the matching receive on the other process completes the transfer)
- Send is initiated and returns straight away.
- Sending process can do other things
- Can test later whether operation has completed.
8Non-blocking receive
(Diagram: the receiver posts a receive request into MPI_COMM_WORLD, carries on working, then waits; the matching send on the other process completes the transfer)
- Receive is initiated and returns straight away.
- Receiving process can do other things
- Can test later whether operation has completed.
9The Request Handle
- Non-blocking calls take the same arguments as the blocking versions, plus an additional request handle
- In C/C++ it is of type MPI_Request / MPI::Request
- In Fortran it is an INTEGER
- The request handle is allocated when a communication is initiated
- It can be queried to test whether the non-blocking operation has completed
10Non-blocking synchronous send
- Fortran
- CALL MPI_ISSEND(buf, count, datatype, dest, tag, comm, request, error)
- CALL MPI_WAIT(request, status, error)
- C
- MPI_Issend(buf, count, datatype, dest, tag, comm, &request)
- MPI_Wait(&request, &status)
- C++
- request = comm.Issend(buf, count, datatype, dest, tag)
- request.Wait()
11Non-blocking synchronous receive
- Fortran
- CALL MPI_IRECV(buf, count, datatype, src, tag, comm, request, error)
- CALL MPI_WAIT(request, status, error)
- C
- MPI_Irecv(buf, count, datatype, src, tag, comm, &request)
- MPI_Wait(&request, &status)
- C++
- request = comm.Irecv(buf, count, datatype, src, tag)
- request.Wait(status)
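The following is a minimal C sketch (not part of the original slides) of the two calls above: rank 0 posts a non-blocking synchronous send, rank 1 posts a non-blocking receive, and each completes its operation with MPI_Wait. The message value, the tag and the assumption of at least two processes are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, data = 0;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                   /* sender */
        data = 42;
        MPI_Issend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
        /* ... do useful work here while the send is in progress ... */
        MPI_Wait(&request, &status);   /* send now complete */
    } else if (rank == 1) {            /* receiver */
        MPI_Irecv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        /* ... do useful work here while the receive is in progress ... */
        MPI_Wait(&request, &status);   /* data is now valid */
        printf("rank 1 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}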
12Blocking v Non-blocking
- Send and receive can be blocking or non-blocking.
- A blocking send can be used with a non-blocking receive, and vice versa.
- Non-blocking sends can use any mode
- Synchronous mode affects completion, not initiation.
- A non-blocking call followed by an explicit wait is identical to the corresponding blocking communication.
13Completion
- Can either wait or test for completion
- Fortran (LOGICAL flag)
- CALL MPI_WAIT(request, status, ierror)
- CALL MPI_TEST(request, flag, status, ierror)
- C (int flag)
- MPI_Wait(&request, &status)
- MPI_Test(&request, &flag, &status)
- C++ (bool flag)
- request.Wait()
- flag = request.Test() (for sends)
- request.Wait(status)
- flag = request.Test(status) (for receives)
14Other related wait and test routines
- If multiple non-blocking calls are issued:
- MPI_TESTANY: tests whether any one of a list of requests (send or receive) has completed.
- MPI_WAITANY: waits until any one of a list of requests has completed.
- MPI_TESTALL: tests whether all the requests in a list have completed.
- MPI_WAITALL: waits until all the requests in a list have completed.
- MPI_PROBE, MPI_IPROBE: allow incoming messages to be checked for without actually receiving them. Note that MPI_PROBE is blocking; it waits until there is something to probe for.
- MPI_CANCEL: cancels pending communication. Last resort, clean-up operation!
- These routines take an array of requests; the ALL variants can return an array of statuses.
- The ANY variants return the index of the completed operation.
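As a sketch of how the list variants are used (again not from the slides), each rank below exchanges its rank number with both of its ring neighbours using four non-blocking calls and a single MPI_Waitall; the neighbour arithmetic and the tag are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, left, right, from_left, from_right;
    MPI_Request req[4];
    MPI_Status  stat[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    left  = (rank - 1 + size) % size;       /* ring neighbours */
    right = (rank + 1) % size;

    /* post all four operations, then wait for the whole list */
    MPI_Irecv(&from_left,  1, MPI_INT, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&from_right, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&rank, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&rank, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &req[3]);

    MPI_Waitall(4, req, stat);   /* returns when all four have completed */

    printf("rank %d: left neighbour %d, right neighbour %d\n",
           rank, from_left, from_right);
    MPI_Finalize();
    return 0;
}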
15Merging send and receive operations into a single unit
- The following is the syntax of the MPI_Sendrecv command
- In C
- int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status)
- In Fortran
- <sendtype> sendbuf(*)
- <recvtype> recvbuf(*)
- INTEGER sendcount, sendtype, dest, sendtag, recvcount, recvtype
- INTEGER source, recvtag, comm, status(MPI_STATUS_SIZE), ierror
- MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status, ierror)
16Important Notes about MPI_Sendrecv
- Beware! A message sent by MPI_Sendrecv is receivable by a regular receive operation if the destination and tag match.
- MPI_PROC_NULL can be specified for the destination or source to allow one-directional working (useful in non-circular communication for the very end-nodes).
- Any communication with MPI_PROC_NULL returns immediately with no effect, but as if the operation had been successful. This can make programming easier.
- The send and receive buffers must not overlap; they must be separate memory locations. This restriction can be avoided by using the MPI_Sendrecv_replace routine.
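A minimal C sketch of the MPI_PROC_NULL idea above: values are shifted one place to the right along a non-circular chain with a single MPI_Sendrecv call that is identical on every process. The variable names and the tag are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, left, right, sendval, recvval = -1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* end nodes talk to MPI_PROC_NULL so the same call works everywhere */
    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    sendval = rank;
    /* send to the right neighbour, receive from the left neighbour;
       calls involving MPI_PROC_NULL return immediately with no effect */
    MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, &status);

    printf("rank %d received %d\n", rank, recvval);
    MPI_Finalize();
    return 0;
}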
17Data Packing
- Up until now we have only seen contiguous data of pre-defined data types being communicated by MPI calls. This can be rather restricting if what we intend to transfer involves structures of data made up of mixtures of primitive data types, such as an integer count followed by a sequence of real numbers.
- One solution to this problem is to use the MPI_PACK and MPI_UNPACK routines. The philosophy is similar to Fortran write/read to/from internal buffers and the sprintf/sscanf functions in C.
- MPI_PACK can be called consecutively to compress the data into a send buffer; the resulting buffer of data can then be sent by using MPI_SEND or equivalent with the data type set to MPI_PACKED.
- At the receiving end it can be received by using MPI_RECV with the data type MPI_PACKED. The received data can then be unpacked by using MPI_UNPACK to recover the original packed data.
- This method of working can also improve communications efficiency by reducing the number of send-receive calls. There are usually fixed overheads associated with setting up communications that would cause inefficiencies if the sent/received messages are too small.
18MPI_Pack
- Fortran
- <type> INBUF(*), OUTBUF(*)
- INTEGER INCOUNT, DATATYPE, OUTSIZE, POSITION, COMM, IERROR
- MPI_PACK(INBUF, INCOUNT, DATATYPE, OUTBUF, OUTSIZE, POSITION, COMM, IERROR)
- C
- int MPI_Pack(void *inbuf, int incount, MPI_Datatype datatype, void *outbuf, int outsize, int *position, MPI_Comm comm)
- Packs the message in inbuf, of type datatype and length incount, and stores it in outbuf. Outsize is specified in bytes: it is the maximum length of outbuf, rather than its actual size.
- On entry, position indicates the starting location in outbuf where data will be written. On exit, position points to the first free position in outbuf following the location occupied by the packed message. This can then be used directly as the position parameter of the next MPI_PACK call.
19MPI_Unpack
- Fortran
- <type> INBUF(*), OUTBUF(*)
- INTEGER INSIZE, POSITION, OUTCOUNT, DATATYPE, COMM, IERROR
- MPI_UNPACK(INBUF, INSIZE, POSITION, OUTBUF, OUTCOUNT, DATATYPE, COMM, IERROR)
- C
- int MPI_Unpack(void *inbuf, int insize, int *position, void *outbuf, int outcount, MPI_Datatype datatype, MPI_Comm comm)
- Unpacks the message in inbuf as data of type datatype and length outcount, and stores it in outbuf.
- On entry, position indicates the starting location in inbuf where data will be read from. On exit, position points to the first position of the next set of data in inbuf. This can then be used directly as the position parameter of the next MPI_UNPACK call.
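The two routines can be sketched together in C as follows (not from the slides): an integer count followed by a block of doubles is packed into one MPI_PACKED message, sent, and unpacked on the receiver. The 1024-byte buffer and the assumption of at least two processes are illustrative.

#include <mpi.h>
#include <stdio.h>

#define BUFSIZE 1024              /* illustrative packing buffer size */

int main(int argc, char *argv[])
{
    int rank, n, position;
    double vals[3] = {1.0, 2.0, 3.0};
    char buffer[BUFSIZE];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        n = 3;
        position = 0;                       /* start packing at byte 0 */
        MPI_Pack(&n,   1, MPI_INT,    buffer, BUFSIZE, &position, MPI_COMM_WORLD);
        MPI_Pack(vals, n, MPI_DOUBLE, buffer, BUFSIZE, &position, MPI_COMM_WORLD);
        /* position now holds the number of bytes actually packed */
        MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buffer, BUFSIZE, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
        position = 0;                       /* start unpacking at byte 0 */
        MPI_Unpack(buffer, BUFSIZE, &position, &n,   1, MPI_INT,    MPI_COMM_WORLD);
        MPI_Unpack(buffer, BUFSIZE, &position, vals, n, MPI_DOUBLE, MPI_COMM_WORLD);
        printf("rank 1 received %d doubles, first = %f\n", n, vals[0]);
    }

    MPI_Finalize();
    return 0;
}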
20Derived Datatypes
- The basic data types provided in MPI allow us to send messages consisting of arrays of these types. We can also pack mixtures of these arrays into a single array of type MPI_PACKED to be sent in one go.
- However, under certain circumstances a better approach is to define a data type of our own choosing, constructed from the existing data types, and then define our messages in units of this newly invented data type. This is a similar approach to defining structs in C and user-defined types in Fortran. It improves efficiency by reducing the number of communication calls needed, as a single message can now be a mixture of basic types.
- A new data type is specified as an ordered list of its constituent components together with the location of each component within the overall structure. This is referred to as the type-map of the new data type. Displacements are measured from the beginning of the structure.
21Derived data types
- Use of derived data types involves the following steps:
- Construct (define) the new data type
- Commit the new data type
- Use the new type in message passing (send/receive) calls
- Optionally, free any data types that are no longer needed
- CONSTRUCTING THE NEW DATA TYPE
- Rather than a single routine, the MPI library provides a set of routines for constructing new data types, each one suitable for a particular form of data.
- These are:
- Contiguous
- Vector
- Indexed
- Structure
22Derived data types
- The following routines help construct new data types:
- MPI_Type_contiguous
- MPI_Type_vector, MPI_Type_hvector
- MPI_Type_indexed, MPI_Type_hindexed
- MPI_Type_create_struct
- MPI_Type_contiguous allows you to refer to a contiguous vector of a primitive type as a new type to be used in communications. A bit like being able to reference a matrix by its name alone.
- MPI_Type_vector allows us to refer to a collection of elements that are separated from each other by a constant stride, for example elements (1), (3), (5), (7), ... of an existing vector, as our unit.
- MPI_Type_indexed allows the strides to vary in a predefined manner, which is not possible with MPI_Type_vector.
- We shall study only MPI_Type_create_struct as an example, as it is the most general and complex of all the types.
23MPI_Type_create_struct
- FORTRAN
- MPI_TYPE_CREATE_STRUCT(COUNT, BLOCK_LENGTHS, DISPLACEMENTS, TYPES, NEWTYPE, IERROR)
- INTEGER COUNT, BLOCK_LENGTHS(*), DISPLACEMENTS(*), TYPES(*), NEWTYPE, IERROR
- C
- int MPI_Type_create_struct(int count, int block_lengths[], MPI_Aint displacements[], MPI_Datatype types[], MPI_Datatype *newtype)
- The data is made up of COUNT blocks. Each block(i) is made up of block_lengths(i) items of data of type types(i). The displacement of each block within the type is given by displacements(i) in BYTES.
- When the type is successfully created, NEWTYPE returns a handle to the new data type that can be used in subsequent send/receive calls. For example, a structure made up of 2 integers followed by 6 reals followed by a character string of 5 characters is seen as a structure of 3 blocks, whose lengths are 2, 6, 5 respectively and whose data types are (MPI_INTEGER, MPI_REAL, MPI_CHARACTER). The displacements are (0, 8, 32) bytes (assuming 4-byte integers and reals and no padding).
- NOTE: The MPI-1 standard defines this function as MPI_TYPE_STRUCT; the name was changed in MPI-2, so the old name is also valid.
24MPI_Type_commit
- Once a type is constructed, it must be committed before use by invoking this function. This allows us to send messages of the new type using all the MPI message communication routines.
- FORTRAN
- MPI_TYPE_COMMIT(DATATYPE, IERROR)
- C
- MPI_Type_commit(MPI_Datatype *datatype)
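A C sketch of the construct-commit-use sequence for the three-block structure of the previous slide (2 integers, 6 reals, 5 characters). The Particle struct name and its contents are invented for the example, and offsetof is used so the displacements match whatever padding the compiler inserts.

#include <mpi.h>
#include <stddef.h>

typedef struct {                 /* illustrative structure */
    int   i[2];
    float r[6];
    char  c[5];
} Particle;

int main(int argc, char *argv[])
{
    int          lengths[3]       = {2, 6, 5};
    MPI_Aint     displacements[3] = {offsetof(Particle, i),
                                     offsetof(Particle, r),
                                     offsetof(Particle, c)};
    MPI_Datatype types[3]         = {MPI_INT, MPI_FLOAT, MPI_CHAR};
    MPI_Datatype particle_type;
    Particle     p = { {1, 2}, {0.0f}, "abcd" };   /* sample contents */

    MPI_Init(&argc, &argv);

    MPI_Type_create_struct(3, lengths, displacements, types, &particle_type);
    MPI_Type_commit(&particle_type);     /* must commit before use */

    /* the new type is used like any predefined one, e.g. a broadcast
       of one Particle from rank 0 */
    MPI_Bcast(&p, 1, particle_type, 0, MPI_COMM_WORLD);

    MPI_Type_free(&particle_type);       /* free when no longer needed */
    MPI_Finalize();
    return 0;
}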
25Timers
- Double precision MPI functions
- Fortran: DOUBLE PRECISION t1
- t1 = MPI_WTIME()
- C: double t1
- t1 = MPI_Wtime()
- C++: double t1
- t1 = MPI::Wtime()
- Time is measured in seconds.
- Time to perform a task is measured by consulting the timer before and after.
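A trivial C sketch of the timing pattern just described:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    double t1, t2;

    MPI_Init(&argc, &argv);

    t1 = MPI_Wtime();            /* consult the timer before the task */
    /* ... work to be timed goes here ... */
    t2 = MPI_Wtime();            /* and again afterwards */

    printf("elapsed time = %f seconds\n", t2 - t1);
    MPI_Finalize();
    return 0;
}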
27Overview
- Introduction and characteristics
- Barrier Synchronisation
- Global reduction operations
- Predefined operations
- Broadcast
- Scatter
- Gather
- Partial sums
- Exercise
28Collective communications
- Are higher-level routines involving several processes at a time.
- Can be built out of point-to-point communications.
- Examples are:
- Barriers
- Broadcast
- Reduction operations
29Collective Communication
- Communications involving a group of processes.
- Called by all processes in a communicator.
- Examples:
- Broadcast, scatter, gather (data distribution)
- Global sum, global maximum, etc. (reduction operations)
- Barrier synchronisation
- Characteristics:
- Collective communication will not interfere with point-to-point communication and vice versa.
- All processes must call the collective routine.
- Synchronisation is not guaranteed (except for barrier).
- No non-blocking collective communication.
- No tags.
- Receive buffers must be exactly the right size.
30Collective Communications (one for all, all for one!!!)
- Collective communication is defined as that which involves all the processes in a group. Collective communication routines can be divided into the following broad categories:
- Barrier synchronisation
- Broadcast from one to all
- Scatter from one to all
- Gather from all to one
- Scatter/Gather from all to all
- Global reduction (distribute elementary operations)
- IMPORTANT NOTE: Collective communication operations and the point-to-point operations we have seen earlier are invisible to each other and hence do not interfere with each other.
- This is important to avoid deadlocks due to interference.
31BARRIER SYNCHRONIZATION
(Diagram: timeline of processes reaching a barrier statement)
Here, there are seven processes running and three of them are waiting idle at the barrier statement for the other four to catch up.
32Graphic Representations of Collective Communication Types
(Diagram: data movement between processes for BROADCAST, SCATTER, GATHER, ALLGATHER and ALLTOALL)
33Barrier Synchronisation
- Each process in the communicator waits at the barrier until all processes encounter the barrier.
- Fortran
- INTEGER comm, error
- CALL MPI_BARRIER(comm, error)
- C
- MPI_Barrier(MPI_Comm comm)
- C++
- Comm.Barrier()
- E.g.
- MPI::COMM_WORLD.Barrier()
34Global reduction operations
- Used to compute a result involving data distributed over a group of processes
- Global sum or product
- Global maximum or minimum
- Global user-defined operation
35Predefined operations
- MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD
- MPI_LAND, MPI_LOR, MPI_LXOR
- MPI_BAND, MPI_BOR, MPI_BXOR
- MPI_MAXLOC, MPI_MINLOC
36MPI_Reduce
- Performs count operations (o) on individual
elements of sendbuf between processes
(Diagram: each rank contributes its elements; the root receives the element-wise results, e.g. A o D o G and B o E o H)
37MPI_Reduce syntax
- Fortran
- INTEGER count, rtype, op, root, comm, error
- CALL MPI_REDUCE(sbuf, rbuf, count, rtype, op, root, comm, error)
- C
- MPI_Reduce(void *sbuf, void *rbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
- C++
- Comm.Reduce(const void* sbuf, void* rbuf, int count, const MPI::Datatype& datatype, const MPI::Op& op, int root)
38MPI_Reduce example
- Integer global sum
- Fortran
- INTEGER x, result, error
- CALL MPI_REDUCE(x, result, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, error)
- C
- int x, result
- MPI_Reduce(&x, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD)
- C++
- int x, result
- MPI::COMM_WORLD.Reduce(&x, &result, 1, MPI::INT, MPI::SUM, 0)
39MPI_Allreduce
- No root process
- All processes get results of reduction operation
(Diagram: every rank ends up with the combined result, e.g. A o D o G)
40MPI_Allreduce syntax
- Fortran
- INTEGER count, rtype, op, comm, error
- CALL MPI_ALLREDUCE(sbuf, rbuf, count, rtype, op, comm, error)
- C
- MPI_Allreduce(void *sbuf, void *rbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
- C++
- Comm.Allreduce(const void* sbuf, void* rbuf, int count, const MPI::Datatype& datatype, const MPI::Op& op)
41Practice Session 3
- Using reduction operations
- This example shows the use of the continued
fraction method of calculating pi and makes each
processor calculate a different portion of the
expansion series.
42Broadcast
- Duplicates data from root process to other
processes in communicator
(Diagram: the root's data A is copied to every rank, 0 to 3)
43Broadcast syntax
- Fortran
- INTEGER count, datatype, root, comm, error
- CALL MPI_BCAST(buffer, count, datatype, root, comm, error)
- C
- MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
- C++
- Comm.Bcast(void* buffer, int count, const MPI::Datatype& datatype, int root)
- E.g. broadcasting 10 integers from rank 0:
- int tenints[10]
- MPI::COMM_WORLD.Bcast(tenints, 10, MPI::INT, 0)
44Scatter
- Distributes data from root process amongst
processors within communicator.
(Diagram: the root's elements A, B, C, D are distributed one per rank, 0 to 3)
45Scatter syntax
- scount (and rcount) is the number of elements each process is sent (i.e. the number each receives), not the total
- Fortran
- INTEGER scount, stype, rcount, rtype, root, comm, error
- CALL MPI_SCATTER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, error)
- C
- MPI_Scatter(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
- C++
- Comm.Scatter(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype, int root)
46Gather
- Collects data distributed amongst processes in communicator onto root process (collection done in rank order).
(Diagram: elements A, B, C, D held by ranks 0 to 3 are collected onto the root)
47Gather syntax
- Takes the same arguments as the Scatter operation
- Fortran
- INTEGER scount, stype, rcount, rtype, root, comm, error
- CALL MPI_GATHER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, error)
- C
- MPI_Gather(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
- C++
- Comm.Gather(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype, int root)
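Scatter and gather are often used together; the C sketch below (not from the slides) has rank 0 scatter one integer to each process, every process modify its piece, and rank 0 gather the results back in rank order. The buffer size of 8 assumes no more than 8 processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, piece, result, i;
    int sendbuf[8], recvbuf[8];        /* assumes size <= 8 */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        for (i = 0; i < size; i++)
            sendbuf[i] = 10 * i;       /* data to distribute */

    /* each process receives exactly one element (scount = rcount = 1) */
    MPI_Scatter(sendbuf, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);

    result = piece + rank;             /* some local computation */

    /* collect one element from every process, in rank order, on rank 0 */
    MPI_Gather(&result, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (i = 0; i < size; i++)
            printf("result from rank %d: %d\n", i, recvbuf[i]);

    MPI_Finalize();
    return 0;
}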
48All Gather
- Collects all data on all processes in communicator
(Diagram: elements A, B, C, D from ranks 0 to 3 are gathered onto every rank)
49All Gather syntax
- As Gather, but no root is defined.
- Fortran
- INTEGER scount, stype, rcount, rtype, comm, error
- CALL MPI_ALLGATHER(sbuf, scount, stype, rbuf, rcount, rtype, comm, error)
- C
- MPI_Allgather(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm)
- C++
- Comm.Allgather(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype)
50MPI_Scan
- Performs a partial reduction
- E.g. partial sum
(Diagram: rank 0 holds A, rank 1 holds A o D, rank 2 holds A o D o G)
51MPI_Scan syntax
- Fortran
- INTEGER count, rtype, op, comm, error
- CALL MPI_SCAN(sbuf, rbuf, count, rtype, op, comm, error)
- C
- MPI_Scan(void *sbuf, void *rbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
- C++
- Comm.Scan(const void* sbuf, void* rbuf, int count, const MPI::Datatype& datatype, const MPI::Op& op)
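A minimal C sketch of the partial sum shown above: each rank contributes its rank number and receives the inclusive running total.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, partial;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* inclusive scan: rank r receives 0 + 1 + ... + r */
    MPI_Scan(&rank, &partial, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: partial sum = %d\n", rank, partial);
    MPI_Finalize();
    return 0;
}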
52Practice Session 4: diffusion example
- Arrange processes to communicate round a ring.
- Each process stores a copy of its rank in an integer variable.
- Each process communicates this value to its right neighbour and receives a value from its left neighbour.
- Each process computes the sum of all the values received.
- Repeat for the number of processes involved and print out the sum stored at each process.
53Generating Cartesian Topologies
- MPI_Cart_create
- Makes a new communicator to which topology information has been attached
- MPI_Cart_coords
- Determines process coords in cartesian topology given rank in group
- MPI_Cart_shift
- Returns the shifted source and destination ranks, given a shift direction and amount
54MPI_Cart_create syntax
- Fortran
- INTEGER comm_old, ndims, dims(*), comm_cart, ierror
- LOGICAL periods(*), reorder
- CALL MPI_CART_CREATE(comm_old, ndims, dims, periods, reorder, comm_cart, ierror)
- C
- MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart)
- C++
- MPI::Intracomm::Create_cart(int ndims, const int dims[], const bool periods[], bool reorder)
55MPI_Cart_coords syntax
- Fortran
- CALL MPI_CART_COORDS(INTEGER COMM, INTEGER RANK, INTEGER MAXDIMS, INTEGER COORDS(*), INTEGER IERROR)
- C
- int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords)
- C++
- void MPI::Cartcomm::Get_coords(int rank, int maxdims, int coords[]) const
56MPI_Cart_shift syntax
- Fortran
- MPI_CART_SHIFT(INTEGER COMM, INTEGER DIRECTION, INTEGER DISP, INTEGER RANK_SOURCE, INTEGER RANK_DEST, INTEGER IERROR)
- C
- int MPI_Cart_shift(MPI_Comm comm, int direction, int disp, int *rank_source, int *rank_dest)
- C++
- void MPI::Cartcomm::Shift(int direction, int disp, int& rank_source, int& rank_dest) const
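The three routines can be combined into the ring pattern of Practice Session 4; the C sketch below (illustrative, not a model answer) builds a one-dimensional periodic Cartesian topology, uses MPI_Cart_shift to find the neighbours, and passes each rank's value around the ring while summing.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int size, rank, left, right, passval, recvval, sum, step;
    int dims[1], periods[1] = {1};     /* one periodic dimension */
    MPI_Comm ring;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    dims[0] = size;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 1, &ring);
    MPI_Comm_rank(ring, &rank);

    /* rank one step down (source) and one step up (destination) */
    MPI_Cart_shift(ring, 0, 1, &left, &right);

    passval = rank;
    sum = 0;
    for (step = 0; step < size; step++) {
        MPI_Sendrecv(&passval, 1, MPI_INT, right, 0,
                     &recvval, 1, MPI_INT, left,  0, ring, &status);
        passval = recvval;             /* pass the received value on */
        sum += recvval;
    }
    printf("rank %d: sum of all ranks = %d\n", rank, sum);

    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}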
57Topologies Examples
- See Diffusion example
- See cartesian example
58Examples for Parallel Programming
- Master slave
- E.g. share work example
- Example: Ising model
- Communicating Sequential Elements Pattern
- Poisson equation
- Highly coupled processes
- Systolic loop algorithm
- E.g. md example
59Poisson Solver Using Jacobi Iteration
- Communicating Sequential Elements Pattern
- Operations in each component depend on partial
results in neighbour components.
(Diagram: a grid of slave threads, each exchanging data with its neighbouring threads)
60Layered Decomposition of 2d Array
- Distribute 2d array across processors
- Processors store all columns
- Rows allocated amongst processors
- Each proc has left proc and right proc
- Each proc has max and min vertex that it stores
- Uij_new = (Ui+1,j + Ui-1,j + Ui,j+1 + Ui,j-1) / 4
- Each proc has a ghost layer
- Used in calculation of update (see above)
- Obtained from neighbouring left and right processors
- Pass top and bottom layers to neighbouring processors
- Become neighbours' ghost layers
- Distribute rows over processors: N/nproc rows per proc
- Every processor stores all N columns
61(Diagram: four processors, each holding a block of rows between its min and max indices; each processor sends its top layer up and its bottom layer down, and receives the neighbouring layers into its ghost rows; columns run from 1 to N+1)
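A C sketch of the exchange drawn above (the grid width N, the per-process row count MYROWS and the use of plain MPI_COMM_WORLD ranks are illustrative assumptions): each process sends its top real row upwards and its bottom real row downwards, filling its neighbours' ghost rows, with MPI_PROC_NULL at the ends.

#include <mpi.h>

#define N      64                 /* illustrative interior grid width */
#define MYROWS 16                 /* illustrative rows per process    */

int main(int argc, char *argv[])
{
    double u[MYROWS + 2][N + 2];  /* rows 0 and MYROWS+1 are ghost rows */
    int rank, size, up, down, i, j;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < MYROWS + 2; i++)          /* initial guess */
        for (j = 0; j < N + 2; j++)
            u[i][j] = 0.0;

    up   = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;
    down = (rank == 0)        ? MPI_PROC_NULL : rank - 1;

    /* send top real row up, receive the bottom ghost row from below */
    MPI_Sendrecv(u[MYROWS], N + 2, MPI_DOUBLE, up,   0,
                 u[0],      N + 2, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, &status);
    /* send bottom real row down, receive the top ghost row from above */
    MPI_Sendrecv(u[1],          N + 2, MPI_DOUBLE, down, 1,
                 u[MYROWS + 1], N + 2, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, &status);

    /* ... Jacobi update of rows 1..MYROWS using the ghost rows ... */

    MPI_Finalize();
    return 0;
}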
62Master Slave
(Diagram: a master thread exchanging data with several slave threads)
- A computation is required where independent computations are performed, perhaps repeatedly, on all elements of some ordered data.
- Example:
- Image processing: perform computation on different sets of pixels within an image
63Highly Coupled Efficient Element Exchange
- Highly Coupled Efficient Element Exchange using Systolic loop techniques
- Extreme example of Communicating Sequential Elements Pattern
64Systolic Loop
- Distribute Elements Over Processors
- Three buffers:
- Local elements
- Travelling elements (local elements at start)
- Send buffer
- Loop over number of processors:
- Transfer travelling elements
- Interleave send/receive to prevent deadlock
- Send contents of send buffer to next proc
- Receive buffer from previous proc into travelling elements
- Point travelling elements to send buffer
- Allow local elements to interact with travelling elements
- Accumulate reduced computations over processors
65Systolic Loop Element Pump
(Diagram: first cycle of three for a four-processor systolic loop; each processor holds its local elements plus the moving elements received from the previous processor)
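A skeleton of the systolic loop in C (the number of local elements and the pairwise "interaction", here just a product accumulated into a sum, are illustrative): the travelling buffer is pumped around the ring once per cycle with MPI_Sendrecv, so the send and receive are interleaved and cannot deadlock.

#include <mpi.h>
#include <stdio.h>

#define NLOCAL 4                  /* illustrative elements per process */

int main(int argc, char *argv[])
{
    double local[NLOCAL], travelling[NLOCAL], sendbuf[NLOCAL];
    double acc = 0.0, total;
    int rank, size, left, right, i, j, cycle;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    left  = (rank - 1 + size) % size;
    right = (rank + 1) % size;

    for (i = 0; i < NLOCAL; i++)              /* travelling elements are */
        local[i] = travelling[i] = rank + i;  /* local elements at start */

    for (cycle = 0; cycle < size; cycle++) {
        /* pump: pass the travelling elements to the next processor */
        for (i = 0; i < NLOCAL; i++)
            sendbuf[i] = travelling[i];
        MPI_Sendrecv(sendbuf,    NLOCAL, MPI_DOUBLE, right, 0,
                     travelling, NLOCAL, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, &status);

        /* let the local elements interact with the travelling ones */
        for (i = 0; i < NLOCAL; i++)
            for (j = 0; j < NLOCAL; j++)
                acc += local[i] * travelling[j];
    }

    /* accumulate the reduced computations over all processors */
    MPI_Reduce(&acc, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total interaction = %f\n", total);

    MPI_Finalize();
    return 0;
}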
66Practice Sessions 5 and 6
- Defining and Using Processor Topologies
- Patterns for parallel computing
67Further Information
- All MPI routines have a UNIX man page
- Use the C-style definition for Fortran/C/C++
- E.g. man MPI_Finalize will give the correct syntax and information for Fortran, C and C++ calls.
- Designing and Building Parallel Programs (Ian Foster)
- http://www-unix.mcs.anl.gov/dbpp/
- Standard documents:
- http://www.mpi-forum.org/
- Many books and information on the web.
- EPCC documents.