Title: Introduction to MPI Programming
1- Introduction to MPI Programming
- (Part II)
- Michael Griffiths, Deniz Savas, Alan Real
- January 2006
-
2Overview
- Review point to point communications
- Data types
- Data packing
- Collective Communication
- Broadcast, Scatter, Gather of data
- Reduction Operations
- Barrier Synchronisation
- Patterns for Parallel Programming
- Exercises
3Blocking operations
- Relate to when the operation has completed
- Only return from the subroutine call when the
operation has completed
4Non-blocking operations
- Return straight away and allow the sub-program to return to perform other work.
- At some time later the sub-program should test or wait for the completion of the non-blocking operation.
- A non-blocking operation immediately followed by a matching wait is equivalent to a blocking operation.
- Non-blocking operations are not the same as sequential subroutine calls as the operation continues after the call has returned.
6Non-blocking communication
- Separate communication into three phases
- Initiate non-blocking communication
- Do some work
- Perhaps involving other communications
- Wait for non-blocking communication to complete.
7Non-blocking send
(Diagram: the sender posts a send request into MPI_COMM_WORLD, carries on working, then waits; the matching receive on the other process completes the transfer)
- Send is initiated and returns straight away.
- Sending process can do other things
- Can test later whether operation has completed.
8Non-blocking receive
(Diagram: the receiver posts a receive request into MPI_COMM_WORLD, carries on working, then waits; the matching send on the other process completes the transfer)
- Receive is initiated and returns straight away.
- Receiving process can do other things
- Can test later whether operation has completed.
9The Request Handle
- Non-blocking calls take the same arguments as the blocking versions, plus an additional request handle
- In C/C++ it is of type MPI_Request / MPI::Request
- In Fortran it is an INTEGER
- The request handle is allocated when a communication is initiated
- It can be queried to test whether the non-blocking operation has completed
10Non-blocking synchronous send
- Fortran
- CALL MPI_ISSEND(buf, count, datatype, dest, tag, comm, request, error)
- CALL MPI_WAIT(request, status, error)
- C
- MPI_Issend(buf, count, datatype, dest, tag, comm, &request)
- MPI_Wait(&request, &status)
- C++
- request = comm.Issend(buf, count, datatype, dest, tag)
- request.Wait()
11Non-blocking synchronous receive
- Fortran
- CALL MPI_IRECV(buf, count, datatype, src, tag, comm, request, error)
- CALL MPI_WAIT(request, status, error)
- C
- MPI_Irecv(buf, count, datatype, src, tag, comm, &request)
- MPI_Wait(&request, &status)
- C++
- request = comm.Irecv(buf, count, datatype, src, tag)
- request.Wait(status)
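The following is a minimal C sketch (not part of the original slides) of the two calls above: rank 0 posts a non-blocking synchronous send, rank 1 posts a non-blocking receive, and each completes its operation with MPI_Wait. The message value, the tag and the assumption of at least two processes are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, data = 0;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                   /* sender */
        data = 42;
        MPI_Issend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
        /* ... do useful work here while the send is in progress ... */
        MPI_Wait(&request, &status);   /* send now complete */
    } else if (rank == 1) {            /* receiver */
        MPI_Irecv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        /* ... do useful work here while the receive is in progress ... */
        MPI_Wait(&request, &status);   /* data is now valid */
        printf("rank 1 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}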
12Blocking v Non-blocking
- Send and receive can be blocking or non-blocking.
- A blocking send can be used with a non-blocking receive, and vice versa.
- Non-blocking sends can use any mode
- Synchronous mode affects completion, not initiation.
- A non-blocking call followed by an explicit wait is identical to the corresponding blocking communication.
13Completion
- Can either wait or test for completion
- Fortran (LOGICAL flag)
- CALL MPI_WAIT(request, status, ierror)
- CALL MPI_TEST(request, flag, status, ierror)
- C (int flag)
- MPI_Wait(&request, &status)
- MPI_Test(&request, &flag, &status)
- C++ (bool flag)
- request.Wait()
- flag = request.Test() (for sends)
- request.Wait(status)
- flag = request.Test(status) (for receives)
14Other related wait and test routines
- If multiple non-blocking calls are issued:
- MPI_TESTANY: tests whether any one of a list of requests (send or receive) has completed.
- MPI_WAITANY: waits until any one of a list of requests has completed.
- MPI_TESTALL: tests whether all the requests in a list have completed.
- MPI_WAITALL: waits until all the requests in a list have completed.
- MPI_PROBE, MPI_IPROBE: allow incoming messages to be checked for without actually receiving them. Note that MPI_PROBE is blocking; it waits until there is something to probe for.
- MPI_CANCEL: cancels pending communication. Last resort, clean-up operation!
- These routines take an array of requests; the ALL variants can return an array of statuses.
- The ANY variants return the index of the completed operation.
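As a sketch of how the list variants are used (again not from the slides), each rank below exchanges its rank number with both of its ring neighbours using four non-blocking calls and a single MPI_Waitall; the neighbour arithmetic and the tag are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, left, right, from_left, from_right;
    MPI_Request req[4];
    MPI_Status  stat[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    left  = (rank - 1 + size) % size;       /* ring neighbours */
    right = (rank + 1) % size;

    /* post all four operations, then wait for the whole list */
    MPI_Irecv(&from_left,  1, MPI_INT, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&from_right, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&rank, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&rank, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &req[3]);

    MPI_Waitall(4, req, stat);   /* returns when all four have completed */

    printf("rank %d: left neighbour %d, right neighbour %d\n",
           rank, from_left, from_right);
    MPI_Finalize();
    return 0;
}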
15Merging send and receive operations into a single unit
- The following is the syntax of the MPI_Sendrecv command
- In C
- int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status)
- In Fortran
- <sendtype> sendbuf(*)
- <recvtype> recvbuf(*)
- INTEGER sendcount, sendtype, dest, sendtag, recvcount, recvtype
- INTEGER source, recvtag, comm, status(MPI_STATUS_SIZE), ierror
- MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status, ierror)
16Important Notes about MPI_Sendrecv
- Beware! A message sent by MPI_Sendrecv is receivable by a regular receive operation if the destination and tag match.
- MPI_PROC_NULL can be specified for the destination or source to allow one-directional working (useful in non-circular communication for the very end-nodes).
- Any communication with MPI_PROC_NULL returns immediately with no effect, but as if the operation had been successful. This can make programming easier.
- The send and receive buffers must not overlap; they must be separate memory locations. This restriction can be avoided by using the MPI_Sendrecv_replace routine.
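A minimal C sketch of the MPI_PROC_NULL idea above: values are shifted one place to the right along a non-circular chain with a single MPI_Sendrecv call that is identical on every process. The variable names and the tag are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, left, right, sendval, recvval = -1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* end nodes talk to MPI_PROC_NULL so the same call works everywhere */
    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    sendval = rank;
    /* send to the right neighbour, receive from the left neighbour;
       calls involving MPI_PROC_NULL return immediately with no effect */
    MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, &status);

    printf("rank %d received %d\n", rank, recvval);
    MPI_Finalize();
    return 0;
}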
17Data Packing
- Up until now we have only seen contiguous data of pre-defined data types being communicated by MPI calls. This can be rather restricting if what we intend to transfer involves structures of data made up of mixtures of primitive data types, such as an integer count followed by a sequence of real numbers.
- One solution to this problem is to use the MPI_PACK and MPI_UNPACK routines. The philosophy is similar to Fortran write/read to/from internal buffers and the sprintf/sscanf functions in C.
- MPI_PACK can be called consecutively to compress the data into a send buffer; the resulting buffer of data can then be sent by using MPI_SEND or equivalent with the data type set to MPI_PACKED.
- At the receiving end it can be received by using MPI_RECV with the data type MPI_PACKED. The received data can then be unpacked by using MPI_UNPACK to recover the original packed data.
- This method of working can also improve communications efficiency by reducing the number of send-receive calls. There are usually fixed overheads associated with setting up communications that would cause inefficiencies if the sent/received messages are too small.
18MPI_Pack
- Fortran
- <type> INBUF(*), OUTBUF(*)
- INTEGER INCOUNT, DATATYPE, OUTSIZE, POSITION, COMM, IERROR
- MPI_PACK(INBUF, INCOUNT, DATATYPE, OUTBUF, OUTSIZE, POSITION, COMM, IERROR)
- C
- int MPI_Pack(void *inbuf, int incount, MPI_Datatype datatype, void *outbuf, int outsize, int *position, MPI_Comm comm)
- Packs the message in inbuf, of type datatype and length incount, and stores it in outbuf. Outsize is specified in bytes: it is the maximum length of outbuf, rather than its actual size.
- On entry, position indicates the starting location in outbuf where data will be written. On exit, position points to the first free position in outbuf following the location occupied by the packed message. This can then be used directly as the position parameter of the next MPI_PACK call.
19MPI_Unpack
- Fortran
- <type> INBUF(*), OUTBUF(*)
- INTEGER INSIZE, POSITION, OUTCOUNT, DATATYPE, COMM, IERROR
- MPI_UNPACK(INBUF, INSIZE, POSITION, OUTBUF, OUTCOUNT, DATATYPE, COMM, IERROR)
- C
- int MPI_Unpack(void *inbuf, int insize, int *position, void *outbuf, int outcount, MPI_Datatype datatype, MPI_Comm comm)
- Unpacks the message in inbuf as data of type datatype and length outcount, and stores it in outbuf.
- On entry, position indicates the starting location in inbuf where data will be read from. On exit, position points to the first position of the next set of data in inbuf. This can then be used directly as the position parameter of the next MPI_UNPACK call.
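The two routines can be sketched together in C as follows (not from the slides): an integer count followed by a block of doubles is packed into one MPI_PACKED message, sent, and unpacked on the receiver. The 1024-byte buffer and the assumption of at least two processes are illustrative.

#include <mpi.h>
#include <stdio.h>

#define BUFSIZE 1024              /* illustrative packing buffer size */

int main(int argc, char *argv[])
{
    int rank, n, position;
    double vals[3] = {1.0, 2.0, 3.0};
    char buffer[BUFSIZE];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        n = 3;
        position = 0;                       /* start packing at byte 0 */
        MPI_Pack(&n,   1, MPI_INT,    buffer, BUFSIZE, &position, MPI_COMM_WORLD);
        MPI_Pack(vals, n, MPI_DOUBLE, buffer, BUFSIZE, &position, MPI_COMM_WORLD);
        /* position now holds the number of bytes actually packed */
        MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buffer, BUFSIZE, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
        position = 0;                       /* start unpacking at byte 0 */
        MPI_Unpack(buffer, BUFSIZE, &position, &n,   1, MPI_INT,    MPI_COMM_WORLD);
        MPI_Unpack(buffer, BUFSIZE, &position, vals, n, MPI_DOUBLE, MPI_COMM_WORLD);
        printf("rank 1 received %d doubles, first = %f\n", n, vals[0]);
    }

    MPI_Finalize();
    return 0;
}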
20Derived Datatypes
- The basic data types provided in MPI allow us to send messages consisting of arrays of these types. We can also pack mixtures of these arrays into a single array of type MPI_PACKED to be sent in one go.
- However, under certain circumstances a better approach is to define a data type of our own choosing, constructed from the existing data types, and then define our messages in units of this newly invented data type. This is a similar approach to defining structs in C and user-defined types in Fortran. It improves efficiency by reducing the number of communication calls needed, as a single message can now be a mixture of basic types.
- A new data type is specified as an ordered list of its constituent components together with the location of each component within the overall structure. This is referred to as the type-map of the new data type. Displacements are measured from the beginning of the structure.
21Derived data types
- Use of derived data types involves the following steps:
- Construct (define) the new data type
- Commit the new data type
- Use the new type in message passing (send/receive) calls
- Optionally, free any data types that are no longer needed
- CONSTRUCTING THE NEW DATA TYPE
- Rather than a single routine, the MPI library provides a set of routines for constructing new data types, each one suitable for a particular form of data.
- These are:
- Contiguous
- Vector
- Indexed
- Structure
22Derived data types
- The following routines help construct new data types:
- MPI_Type_contiguous
- MPI_Type_vector, MPI_Type_hvector
- MPI_Type_indexed, MPI_Type_hindexed
- MPI_Type_create_struct
- MPI_Type_contiguous allows you to refer to a contiguous vector of a primitive type as a new type to be used in communications. A bit like being able to reference a matrix by its name alone.
- MPI_Type_vector allows us to refer to a collection of elements that are separated from each other by a constant stride, for example elements (1), (3), (5), (7), ... of an existing vector, as our unit.
- MPI_Type_indexed allows the strides to vary in a predefined manner, which is not possible with MPI_Type_vector.
- We shall study only MPI_Type_create_struct as an example, as it is the most general and complex of all the types.
23MPI_Type_create_struct
- FORTRAN
- MPI_TYPE_CREATE_STRUCT(COUNT, BLOCK_LENGTHS, DISPLACEMENTS, TYPES, NEWTYPE, IERROR)
- INTEGER COUNT, BLOCK_LENGTHS(*), DISPLACEMENTS(*), TYPES(*), NEWTYPE, IERROR
- C
- int MPI_Type_create_struct(int count, int block_lengths[], MPI_Aint displacements[], MPI_Datatype types[], MPI_Datatype *newtype)
- The data is made up of COUNT blocks. Each block(i) is made up of block_lengths(i) items of data of type types(i). The displacement of each block within the type is given by displacements(i) in BYTES.
- When the type is successfully created, NEWTYPE returns a handle to the new data type that can be used in subsequent send/receive calls. For example, a structure made up of 2 integers followed by 6 reals followed by a character string of 5 characters is seen as a structure of 3 blocks, whose lengths are 2, 6, 5 respectively and whose data types are (MPI_INTEGER, MPI_REAL, MPI_CHARACTER). The displacements are (0, 8, 32) bytes (assuming 4-byte integers and reals and no padding).
- NOTE: The MPI-1 standard defines this function as MPI_TYPE_STRUCT; the name was changed in MPI-2, so the old name is also valid.
24MPI_Type_commit
- Once a type is constructed, it must be committed before use by invoking this function. This allows us to send messages of the new type using all the MPI message communication routines.
- FORTRAN
- MPI_TYPE_COMMIT(DATATYPE, IERROR)
- C
- MPI_Type_commit(MPI_Datatype *datatype)
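A C sketch of the construct-commit-use sequence for the three-block structure of the previous slide (2 integers, 6 reals, 5 characters). The Particle struct name and its contents are invented for the example, and offsetof is used so the displacements match whatever padding the compiler inserts.

#include <mpi.h>
#include <stddef.h>

typedef struct {                 /* illustrative structure */
    int   i[2];
    float r[6];
    char  c[5];
} Particle;

int main(int argc, char *argv[])
{
    int          lengths[3]       = {2, 6, 5};
    MPI_Aint     displacements[3] = {offsetof(Particle, i),
                                     offsetof(Particle, r),
                                     offsetof(Particle, c)};
    MPI_Datatype types[3]         = {MPI_INT, MPI_FLOAT, MPI_CHAR};
    MPI_Datatype particle_type;
    Particle     p = { {1, 2}, {0.0f}, "abcd" };   /* sample contents */

    MPI_Init(&argc, &argv);

    MPI_Type_create_struct(3, lengths, displacements, types, &particle_type);
    MPI_Type_commit(&particle_type);     /* must commit before use */

    /* the new type is used like any predefined one, e.g. a broadcast
       of one Particle from rank 0 */
    MPI_Bcast(&p, 1, particle_type, 0, MPI_COMM_WORLD);

    MPI_Type_free(&particle_type);       /* free when no longer needed */
    MPI_Finalize();
    return 0;
}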
25Timers
- Double precision MPI functions
- Fortran: DOUBLE PRECISION t1
- t1 = MPI_WTIME()
- C: double t1
- t1 = MPI_Wtime()
- C++: double t1
- t1 = MPI::Wtime()
- Time is measured in seconds.
- Time to perform a task is measured by consulting the timer before and after.
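A trivial C sketch of the timing pattern just described:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    double t1, t2;

    MPI_Init(&argc, &argv);

    t1 = MPI_Wtime();            /* consult the timer before the task */
    /* ... work to be timed goes here ... */
    t2 = MPI_Wtime();            /* and again afterwards */

    printf("elapsed time = %f seconds\n", t2 - t1);
    MPI_Finalize();
    return 0;
}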
27Overview
- Introduction and characteristics
- Barrier Synchronisation
- Global reduction operations
- Predefined operations
- Broadcast
- Scatter
- Gather
- Partial sums
- Exercise
28Collective communications
- Are higher-level routines involving several processes at a time.
- Can be built out of point-to-point communications.
- Examples are:
- Barriers
- Broadcast
- Reduction operations
29Collective Communication
- Communications involving a group of processes.
- Called by all processes in a communicator.
- Examples:
- Broadcast, scatter, gather (data distribution)
- Global sum, global maximum, etc. (reduction operations)
- Barrier synchronisation
- Characteristics:
- Collective communication will not interfere with point-to-point communication and vice versa.
- All processes must call the collective routine.
- Synchronisation is not guaranteed (except for barrier).
- No non-blocking collective communication.
- No tags.
- Receive buffers must be exactly the right size.
30Collective Communications (one for all, all for one!!!)
- Collective communication is defined as that which involves all the processes in a group. Collective communication routines can be divided into the following broad categories:
- Barrier synchronisation
- Broadcast from one to all
- Scatter from one to all
- Gather from all to one
- Scatter/Gather from all to all
- Global reduction (distribute elementary operations)
- IMPORTANT NOTE: Collective communication operations and the point-to-point operations we have seen earlier are invisible to each other and hence do not interfere with each other.
- This is important to avoid deadlocks due to interference.
31BARRIER SYNCHRONIZATION
(Diagram: timeline of processes reaching a barrier statement)
Here, there are seven processes running and three of them are waiting idle at the barrier statement for the other four to catch up.
32Graphic Representations of Collective Communication Types
(Diagram: data movement between processes for BROADCAST, SCATTER, GATHER, ALLGATHER and ALLTOALL)
33Barrier Synchronisation
- Each process in the communicator waits at the barrier until all processes encounter the barrier.
- Fortran
- INTEGER comm, error
- CALL MPI_BARRIER(comm, error)
- C
- MPI_Barrier(MPI_Comm comm)
- C++
- Comm.Barrier()
- E.g.
- MPI::COMM_WORLD.Barrier()
34Global reduction operations
- Used to compute a result involving data distributed over a group of processes
- Global sum or product
- Global maximum or minimum
- Global user-defined operation
35Predefined operations
- MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD
- MPI_LAND, MPI_LOR, MPI_LXOR
- MPI_BAND, MPI_BOR, MPI_BXOR
- MPI_MAXLOC, MPI_MINLOC
36MPI_Reduce
- Performs count operations (o) on individual
elements of sendbuf between processes
(Diagram: each rank contributes its elements; the root receives the element-wise results, e.g. A o D o G and B o E o H)
37MPI_Reduce syntax
- Fortran
- INTEGER count, rtype, op, root, comm, error
- CALL MPI_REDUCE(sbuf, rbuf, count, rtype, op, root, comm, error)
- C
- MPI_Reduce(void *sbuf, void *rbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
- C++
- Comm.Reduce(const void* sbuf, void* rbuf, int count, const MPI::Datatype& datatype, const MPI::Op& op, int root)
38MPI_Reduce example
- Integer global sum
- Fortran
- INTEGER x, result, error
- CALL MPI_REDUCE(x, result, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, error)
- C
- int x, result
- MPI_Reduce(&x, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD)
- C++
- int x, result
- MPI::COMM_WORLD.Reduce(&x, &result, 1, MPI::INT, MPI::SUM, 0)
39MPI_Allreduce
- No root process
- All processes get results of reduction operation
(Diagram: every rank ends up with the combined result, e.g. A o D o G)
40MPI_Allreduce syntax
- Fortran
- INTEGER count, rtype, op, comm, error
- CALL MPI_ALLREDUCE(sbuf, rbuf, count, rtype, op, comm, error)
- C
- MPI_Allreduce(void *sbuf, void *rbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
- C++
- Comm.Allreduce(const void* sbuf, void* rbuf, int count, const MPI::Datatype& datatype, const MPI::Op& op)
41Practice Session 3
- Using reduction operations
- This example shows the use of the continued
fraction method of calculating pi and makes each
processor calculate a different portion of the
expansion series.
42Broadcast
- Duplicates data from root process to other
processes in communicator
(Diagram: the root's data A is copied to every rank, 0 to 3)
43Broadcast syntax
- Fortran
- INTEGER count, datatype, root, comm, error
- CALL MPI_BCAST(buffer, count, datatype, root, comm, error)
- C
- MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
- C++
- Comm.Bcast(void* buffer, int count, const MPI::Datatype& datatype, int root)
- E.g. broadcasting 10 integers from rank 0:
- int tenints[10]
- MPI::COMM_WORLD.Bcast(tenints, 10, MPI::INT, 0)
44Scatter
- Distributes data from root process amongst
processors within communicator.
(Diagram: the root's elements A, B, C, D are distributed one per rank, 0 to 3)
45Scatter syntax
- scount (and rcount) is the number of elements each process is sent (i.e. the number each receives), not the total
- Fortran
- INTEGER scount, stype, rcount, rtype, root, comm, error
- CALL MPI_SCATTER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, error)
- C
- MPI_Scatter(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
- C++
- Comm.Scatter(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype, int root)
46Gather
- Collects data distributed amongst processes in communicator onto root process (collection done in rank order).
(Diagram: elements A, B, C, D held by ranks 0 to 3 are collected onto the root)
47Gather syntax
- Takes the same arguments as the Scatter operation
- Fortran
- INTEGER scount, stype, rcount, rtype, root, comm, error
- CALL MPI_GATHER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, error)
- C
- MPI_Gather(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
- C++
- Comm.Gather(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype, int root)
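Scatter and gather are often used together; the C sketch below (not from the slides) has rank 0 scatter one integer to each process, every process modify its piece, and rank 0 gather the results back in rank order. The buffer size of 8 assumes no more than 8 processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, piece, result, i;
    int sendbuf[8], recvbuf[8];        /* assumes size <= 8 */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        for (i = 0; i < size; i++)
            sendbuf[i] = 10 * i;       /* data to distribute */

    /* each process receives exactly one element (scount = rcount = 1) */
    MPI_Scatter(sendbuf, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);

    result = piece + rank;             /* some local computation */

    /* collect one element from every process, in rank order, on rank 0 */
    MPI_Gather(&result, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (i = 0; i < size; i++)
            printf("result from rank %d: %d\n", i, recvbuf[i]);

    MPI_Finalize();
    return 0;
}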
48All Gather
- Collects all data on all processes in communicator
(Diagram: elements A, B, C, D from ranks 0 to 3 are gathered onto every rank)
49All Gather syntax
- As Gather, but no root is defined.
- Fortran
- INTEGER scount, stype, rcount, rtype, comm, error
- CALL MPI_ALLGATHER(sbuf, scount, stype, rbuf, rcount, rtype, comm, error)
- C
- MPI_Allgather(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm)
- C++
- Comm.Allgather(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype)
50MPI_Scan
- Performs a partial reduction
- E.g. partial sum
(Diagram: rank 0 holds A, rank 1 holds A o D, rank 2 holds A o D o G)
51MPI_Scan syntax
- Fortran
- INTEGER count, rtype, op, comm, error
- CALL MPI_SCAN(sbuf, rbuf, count, rtype, op, comm, error)
- C
- MPI_Scan(void *sbuf, void *rbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
- C++
- Comm.Scan(const void* sbuf, void* rbuf, int count, const MPI::Datatype& datatype, const MPI::Op& op)
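A minimal C sketch of the partial sum shown above: each rank contributes its rank number and receives the inclusive running total.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, partial;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* inclusive scan: rank r receives 0 + 1 + ... + r */
    MPI_Scan(&rank, &partial, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: partial sum = %d\n", rank, partial);
    MPI_Finalize();
    return 0;
}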
52Practice Session 4: diffusion example
- Arrange processes to communicate round a ring.
- Each process stores a copy of its rank in an integer variable.
- Each process communicates this value to its right neighbour and receives a value from its left neighbour.
- Each process computes the sum of all the values received.
- Repeat for the number of processes involved and print out the sum stored at each process.
53Generating Cartesian Topologies
- MPI_Cart_create
- Makes a new communicator to which topology information has been attached
- MPI_Cart_coords
- Determines process coords in cartesian topology given rank in group
- MPI_Cart_shift
- Returns the shifted source and destination ranks, given a shift direction and amount
54MPI_Cart_create syntax
- Fortran
- INTEGER comm_old, ndims, dims(*), comm_cart, ierror
- LOGICAL periods(*), reorder
- CALL MPI_CART_CREATE(comm_old, ndims, dims, periods, reorder, comm_cart, ierror)
- C
- MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart)
- C++
- MPI::Intracomm::Create_cart(int ndims, const int dims[], const bool periods[], bool reorder)
55MPI_Cart_coords syntax
- Fortran
- CALL MPI_CART_COORDS(INTEGER COMM, INTEGER RANK, INTEGER MAXDIMS, INTEGER COORDS(*), INTEGER IERROR)
- C
- int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords)
- C++
- void MPI::Cartcomm::Get_coords(int rank, int maxdims, int coords[]) const
56MPI_Cart_shift syntax
- Fortran
- MPI_CART_SHIFT(INTEGER COMM, INTEGER DIRECTION, INTEGER DISP, INTEGER RANK_SOURCE, INTEGER RANK_DEST, INTEGER IERROR)
- C
- int MPI_Cart_shift(MPI_Comm comm, int direction, int disp, int *rank_source, int *rank_dest)
- C++
- void MPI::Cartcomm::Shift(int direction, int disp, int& rank_source, int& rank_dest) const
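The three routines can be combined into the ring pattern of Practice Session 4; the C sketch below (illustrative, not a model answer) builds a one-dimensional periodic Cartesian topology, uses MPI_Cart_shift to find the neighbours, and passes each rank's value around the ring while summing.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int size, rank, left, right, passval, recvval, sum, step;
    int dims[1], periods[1] = {1};     /* one periodic dimension */
    MPI_Comm ring;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    dims[0] = size;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 1, &ring);
    MPI_Comm_rank(ring, &rank);

    /* rank one step down (source) and one step up (destination) */
    MPI_Cart_shift(ring, 0, 1, &left, &right);

    passval = rank;
    sum = 0;
    for (step = 0; step < size; step++) {
        MPI_Sendrecv(&passval, 1, MPI_INT, right, 0,
                     &recvval, 1, MPI_INT, left,  0, ring, &status);
        passval = recvval;             /* pass the received value on */
        sum += recvval;
    }
    printf("rank %d: sum of all ranks = %d\n", rank, sum);

    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}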
57Topologies Examples
- See Diffusion example
- See cartesian example
58Examples for Parallel Programming
- Master slave
- E.g. share work example
- Example: Ising model
- Communicating Sequential Elements Pattern
- Poisson equation
- Highly coupled processes
- Systolic loop algorithm
- E.g. md example
59Poisson Solver Using Jacobi Iteration
- Communicating Sequential Elements Pattern
- Operations in each component depend on partial
results in neighbour components.
(Diagram: a grid of slave threads, each exchanging data with its neighbouring threads)
60Layered Decomposition of 2d Array
- Distribute 2d array across processors
- Processors store all columns
- Rows allocated amongst processors
- Each proc has left proc and right proc
- Each proc has max and min vertex that it stores
- Uij_new = (Ui+1,j + Ui-1,j + Ui,j+1 + Ui,j-1) / 4
- Each proc has a ghost layer
- Used in calculation of update (see above)
- Obtained from neighbouring left and right processors
- Pass top and bottom layers to neighbouring processors
- Become neighbours' ghost layers
- Distribute rows over processors: N/nproc rows per proc
- Every processor stores all N columns
61(Diagram: four processors, each holding a block of rows between its min and max indices; each processor sends its top layer up and its bottom layer down, and receives the neighbouring layers into its ghost rows; columns run from 1 to N+1)
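A C sketch of the exchange drawn above (the grid width N, the per-process row count MYROWS and the use of plain MPI_COMM_WORLD ranks are illustrative assumptions): each process sends its top real row upwards and its bottom real row downwards, filling its neighbours' ghost rows, with MPI_PROC_NULL at the ends.

#include <mpi.h>

#define N      64                 /* illustrative interior grid width */
#define MYROWS 16                 /* illustrative rows per process    */

int main(int argc, char *argv[])
{
    double u[MYROWS + 2][N + 2];  /* rows 0 and MYROWS+1 are ghost rows */
    int rank, size, up, down, i, j;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < MYROWS + 2; i++)          /* initial guess */
        for (j = 0; j < N + 2; j++)
            u[i][j] = 0.0;

    up   = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;
    down = (rank == 0)        ? MPI_PROC_NULL : rank - 1;

    /* send top real row up, receive the bottom ghost row from below */
    MPI_Sendrecv(u[MYROWS], N + 2, MPI_DOUBLE, up,   0,
                 u[0],      N + 2, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, &status);
    /* send bottom real row down, receive the top ghost row from above */
    MPI_Sendrecv(u[1],          N + 2, MPI_DOUBLE, down, 1,
                 u[MYROWS + 1], N + 2, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, &status);

    /* ... Jacobi update of rows 1..MYROWS using the ghost rows ... */

    MPI_Finalize();
    return 0;
}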
62Master Slave
(Diagram: a master thread exchanging data with several slave threads)
- A computation is required where independent computations are performed, perhaps repeatedly, on all elements of some ordered data.
- Example:
- Image processing: perform computation on different sets of pixels within an image
63Highly Coupled Efficient Element Exchange
- Highly Coupled Efficient Element Exchange using Systolic loop techniques
- Extreme example of Communicating Sequential Elements Pattern
64Systolic Loop
- Distribute Elements Over Processors
- Three buffers:
- Local elements
- Travelling elements (local elements at start)
- Send buffer
- Loop over number of processors:
- Transfer travelling elements
- Interleave send/receive to prevent deadlock
- Send contents of send buffer to next proc
- Receive buffer from previous proc into travelling elements
- Point travelling elements to send buffer
- Allow local elements to interact with travelling elements
- Accumulate reduced computations over processors
65Systolic Loop Element Pump
(Diagram: first cycle of three for a four-processor systolic loop; each processor holds its local elements plus the moving elements received from the previous processor)
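A skeleton of the systolic loop in C (the number of local elements and the pairwise "interaction", here just a product accumulated into a sum, are illustrative): the travelling buffer is pumped around the ring once per cycle with MPI_Sendrecv, so the send and receive are interleaved and cannot deadlock.

#include <mpi.h>
#include <stdio.h>

#define NLOCAL 4                  /* illustrative elements per process */

int main(int argc, char *argv[])
{
    double local[NLOCAL], travelling[NLOCAL], sendbuf[NLOCAL];
    double acc = 0.0, total;
    int rank, size, left, right, i, j, cycle;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    left  = (rank - 1 + size) % size;
    right = (rank + 1) % size;

    for (i = 0; i < NLOCAL; i++)              /* travelling elements are */
        local[i] = travelling[i] = rank + i;  /* local elements at start */

    for (cycle = 0; cycle < size; cycle++) {
        /* pump: pass the travelling elements to the next processor */
        for (i = 0; i < NLOCAL; i++)
            sendbuf[i] = travelling[i];
        MPI_Sendrecv(sendbuf,    NLOCAL, MPI_DOUBLE, right, 0,
                     travelling, NLOCAL, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, &status);

        /* let the local elements interact with the travelling ones */
        for (i = 0; i < NLOCAL; i++)
            for (j = 0; j < NLOCAL; j++)
                acc += local[i] * travelling[j];
    }

    /* accumulate the reduced computations over all processors */
    MPI_Reduce(&acc, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total interaction = %f\n", total);

    MPI_Finalize();
    return 0;
}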
66Practice Sessions 5 and 6
- Defining and Using Processor Topologies
- Patterns for parallel computing
67Further Information
- All MPI routines have a UNIX man page
- Use the C-style definition for Fortran/C/C++
- E.g. man MPI_Finalize will give the correct syntax and information for Fortran, C and C++ calls.
- Designing and Building Parallel Programs (Ian Foster)
- http://www-unix.mcs.anl.gov/dbpp/
- Standard documents:
- http://www.mpi-forum.org/
- Many books and information on the web.
- EPCC documents.