Title: An Introduction to Parallel Programming with MPI
1. An Introduction to Parallel Programming with MPI
- March 22, 24, 29, 31
- 2005
- David Adams (daadams3@vt.edu)
- http://research.cs.vt.edu/lasca/schedule
2. MPI and Classical References
- MPI
  - M. Snir and W. Gropp, MPI: The Complete Reference (2-volume set), MIT Press, Cambridge, MA (1998).
- Parallel Computing
  - D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation, Prentice-Hall, Englewood Cliffs, NJ (1989).
  - M. J. Quinn, Designing Efficient Algorithms for Parallel Computers, McGraw-Hill, New York, NY (1987).
3. Outline
- Disclaimers
- Overview of basic parallel programming on a cluster with the goals of MPI
- Batch system interaction
- Startup procedures
- Quick review
- Blocking message passing
- Non-blocking message passing
- Collective communications
4. Review
- Messages are the only way processors can pass information.
- MPI hides the low-level details of message transport, leaving the user to specify only the message logic.
- Parallel algorithms are built by identifying the concurrency opportunities in the problem itself, not in the serial algorithm.
- Communication is slow.
- Partitioning and pipelining are two primary methods for exploiting concurrency.
- To make good use of the hardware we want to balance the computational load across all processors and keep each process compute bound rather than communication bound.
5. More Review
- MPI messages specify a starting point, a length, and data type information.
- MPI messages are read from contiguous memory.
- These functions will generally appear in all MPI programs:
  - MPI_INIT, MPI_FINALIZE
  - MPI_COMM_SIZE, MPI_COMM_RANK
- MPI_COMM_WORLD is the global communicator available at the start of all MPI runs.
6. Hello World (Fortran 90)

    PROGRAM Hello_World
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER :: ierr_p, rank_p, size_p
      INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status_p

      CALL MPI_INIT(ierr_p)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank_p, ierr_p)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size_p, ierr_p)
      IF (rank_p == 0) THEN
        WRITE(*,*) 'Hello world! I am process 0 and I am special!'
      ELSE
        WRITE(*,*) 'Hello world! I am process', rank_p
      END IF
      CALL MPI_FINALIZE(ierr_p)
    END PROGRAM Hello_World
7. Hello World (C, case sensitive)

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank_p, size_p;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank_p);
        MPI_Comm_size(MPI_COMM_WORLD, &size_p);
        if (rank_p == 0) {
            printf("%d: Hello World! I am special!\n", rank_p);
        }
        else {
            printf("%d: Hello World!\n", rank_p);
        }
        MPI_Finalize();
        return 0;
    }
8. MPI Messages
- Messages are non-overtaking.
- All MPI messages are completed in two parts:
- Send
  - Can be blocking or non-blocking.
  - Identifies the destination, data type and length, and a message type identifier (tag).
  - Identifies to MPI a space in memory specifically reserved for the sending of this message.
- Receive
  - Can be blocking or non-blocking.
  - Identifies the source, data type and length, and a message type identifier (tag).
  - Identifies to MPI a space in memory specifically reserved for the completion of this message.
9. Message Semantics (Modes)
- Standard
  - The completion of the send does not necessarily mean that the matching receive has started, and no assumption should be made in the application program about whether the outgoing data is buffered.
  - All buffering is done at the discretion of your MPI implementation.
  - Completion of an operation simply means that the message buffer space can safely be modified again.
- Buffered
- Synchronous
- Ready
10. Message Semantics (Modes)
- Standard
- Buffered (not recommended)
  - The user can guarantee that a certain amount of buffer space is available.
  - The catch is that the space must be explicitly provided by the application program.
  - Making sure the buffer space does not become full is completely the user's responsibility.
- Synchronous
- Ready
11. Message Semantics (Modes)
- Standard
- Buffered (not recommended)
- Synchronous
  - A rendezvous semantic between sender and receiver is used.
  - Completion of a send signals that the receive has at least started.
- Ready
12. Message Semantics (Modes)
- Standard
- Buffered (not recommended)
- Synchronous
- Ready (not recommended)
  - Allows the user to exploit extra knowledge to simplify the protocol and potentially achieve higher performance.
  - In a ready-mode send, the user asserts that the matching receive has already been posted.
13. Blocking Message Passing (SEND)

    MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, DEST, TAG, COMM
      OUT INTEGER IERROR

- Performs a standard-mode, blocking send.
- Blocking means that the code cannot continue until the send has completed.
- Completion of the send means that the data has been buffered, either locally or non-locally, and that the message buffer is now free to modify.
- Completion implies nothing about the matching receive. (A short C sketch follows.)
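- A minimal C sketch (not from the original slides) of a standard-mode blocking send; the buffer name, message size, and tag value are illustrative:

    #include <mpi.h>

    #define TAG 42                       /* illustrative tag value */

    void send_to_rank_one(int rank)
    {
        int data[10];                    /* contiguous buffer, as MPI requires */
        int i;
        if (rank == 0) {
            for (i = 0; i < 10; i++) data[i] = i;
            /* BUF=data, COUNT=10, DATATYPE=MPI_INT, DEST=1, TAG, COMM */
            MPI_Send(data, 10, MPI_INT, 1, TAG, MPI_COMM_WORLD);
            /* returning means data may safely be modified again;
               it says nothing about the matching receive */
        }
    }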
14. Buffer

    MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, DEST, TAG, COMM
      OUT INTEGER IERROR

- BUF is an array. It can be an array of one object, but it must be an array.
- The declaration
    INTEGER X
  DOES NOT EQUAL
    INTEGER X(1)
15. Buffer

    MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, DEST, TAG, COMM
      OUT INTEGER IERROR

- BUF is the parameter from which MPI determines the starting point of the memory space allocated to this message.
- Recall that this memory space must be contiguous, and allocatable arrays in Fortran 90 are not necessarily contiguous. Array sections, in general, are certainly not contiguous.
16. Buffer

    MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, DEST, TAG, COMM
      OUT INTEGER IERROR

- Until the send is complete, the data inside BUF is undefined.
- Any attempt to change the data in BUF before the send completes is also an undefined operation (though possible).
- Once a send operation begins, it is the user's job to see that no modifications to BUF are made.
- Completion of the send assures the user that it is safe to modify the contents of BUF again.
17. DATATYPE

    MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, DEST, TAG, COMM
      OUT INTEGER IERROR

- DATATYPE is an MPI-specific data type corresponding to the type of data stored in BUF.
  - An array of integers would be sent using the MPI_INTEGER data type.
  - An array of logical variables would be sent using the MPI_LOGICAL data type.
  - etc.
18. MPI Types in Fortran 77
- MPI_INTEGER           INTEGER
- MPI_REAL              REAL
- MPI_DOUBLE_PRECISION  DOUBLE PRECISION
- MPI_COMPLEX           COMPLEX
- MPI_LOGICAL           LOGICAL
- MPI_CHARACTER         CHARACTER(1)
- MPI_BYTE
- MPI_PACKED
19. MPI Types in C
- MPI_CHAR            signed char
- MPI_SHORT           signed short int
- MPI_INT             signed int
- MPI_LONG            signed long int
- MPI_UNSIGNED_CHAR   unsigned char
- MPI_UNSIGNED_SHORT  unsigned short int
- MPI_UNSIGNED        unsigned int
- MPI_UNSIGNED_LONG   unsigned long int
- MPI_FLOAT           float
- MPI_DOUBLE          double
- MPI_LONG_DOUBLE     long double
- MPI_BYTE
- MPI_PACKED
20. COUNT

    MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, DEST, TAG, COMM
      OUT INTEGER IERROR

- COUNT specifies the number of entries of type DATATYPE in the buffer BUF.
- From the combined information of COUNT, DATATYPE, and BUF, MPI can determine the starting point in memory for the message and the number of bytes to move.
21. Communicator

    MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, DEST, TAG, COMM
      OUT INTEGER IERROR

- COMM provides MPI with the reference point for the communication domain applied to this send.
- For most MPI programs, MPI_COMM_WORLD will be sufficient as the argument for this parameter.
22. DESTINATION

    MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, DEST, TAG, COMM
      OUT INTEGER IERROR

- DEST is an integer representing the rank of the process I am trying to send a message to.
- The rank value is with respect to the communicator in the COMM parameter.
- For MPI_COMM_WORLD, the value in DEST is the absolute rank of the processor you are trying to reach.
23. TAG

    MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, DEST, TAG, COMM
      OUT INTEGER IERROR

- The TAG parameter is an integer between 0 and some upper bound, where the upper bound is machine dependent. The value of the upper bound is found in the attribute MPI_TAG_UB (see the sketch below for querying it).
- This integer value can be used to distinguish messages, since send-receive pairs will only match if their TAG values also match.
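- A small C sketch of querying the tag upper bound; it assumes the MPI-2 attribute call MPI_Comm_get_attr (MPI-1 programs used MPI_Attr_get instead):

    #include <mpi.h>
    #include <stdio.h>

    void print_tag_upper_bound(void)
    {
        int *tag_ub;  /* MPI returns the attribute as a pointer to int */
        int flag;
        MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);
        if (flag)
            printf("Largest legal tag value: %d\n", *tag_ub);
    }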
24. IERROR

    MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, DEST, TAG, COMM
      OUT INTEGER IERROR

- Assuming everything is working as planned, the value of IERROR on exit will be MPI_SUCCESS.
- Values not equal to MPI_SUCCESS indicate some error, but these values are implementation specific.
25. Send Modes
- Standard
  - MPI_SEND
- Buffered (not recommended)
  - MPI_BSEND
- Synchronous
  - MPI_SSEND
- Ready (not recommended)
  - MPI_RSEND
- The syntax of all four modes is sketched below.
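- For syntax only, a hedged C sketch showing the same message under each mode; a real program would pick one, and the buffered variant must attach user buffer space first:

    #include <mpi.h>
    #include <stdlib.h>

    void demo_send_modes(int dest, int tag)
    {
        double x[4] = {0.0, 1.0, 2.0, 3.0};
        int size;
        void *buf;

        MPI_Send (x, 4, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);  /* standard */
        MPI_Ssend(x, 4, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);  /* synchronous */
        MPI_Rsend(x, 4, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);  /* ready: receive must already be posted */

        /* buffered mode: the user must supply the buffer space */
        MPI_Pack_size(4, MPI_DOUBLE, MPI_COMM_WORLD, &size);
        size += MPI_BSEND_OVERHEAD;
        buf = malloc(size);
        MPI_Buffer_attach(buf, size);
        MPI_Bsend(x, 4, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
        MPI_Buffer_detach(&buf, &size);
        free(buf);
    }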
26. Blocking Message Passing (RECEIVE)

    MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM
      OUT INTEGER IERROR, STATUS(MPI_STATUS_SIZE)

- Performs a standard-mode, blocking receive.
- Blocking means that the code cannot continue until the receive has completed.
- Completion of the receive means that the data has been placed into the message buffer locally and that the message buffer is now safe to modify or use.
- Completion implies nothing about the completion of the matching send (except that the send has started). (A matching C sketch follows.)
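- The matching C receive for the earlier send sketch; names are again illustrative:

    #include <mpi.h>

    #define TAG 42

    void recv_from_rank_zero(int rank)
    {
        int data[10];
        MPI_Status status;
        if (rank == 1) {
            /* blocks until the message has landed in data */
            MPI_Recv(data, 10, MPI_INT, 0, TAG, MPI_COMM_WORLD, &status);
        }
    }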
27. BUFFER, DATATYPE, COMM, and IERROR

    MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM
      OUT INTEGER IERROR, STATUS(MPI_STATUS_SIZE)

- The parameters BUF, DATATYPE, and IERROR follow the same rules as those of the send.
- Send-receive pairs will only match if their SOURCE/DEST, TAG, and COMM information match.
28. COUNT

    MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM
      OUT INTEGER IERROR, STATUS(MPI_STATUS_SIZE)

- As in the send operation, the COUNT parameter indicates the number of entries of type DATATYPE in BUF.
- The COUNT values of a send-receive pair, however, do not need to match.
- It is the user's responsibility to see that the buffer on the receiving end is big enough to store the incoming message. An overflow error is returned in IERROR when BUF is too small.
29. SOURCE

    MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM
      OUT INTEGER IERROR, STATUS(MPI_STATUS_SIZE)

- SOURCE is an integer representing the rank of the process I am willing to receive a message from.
- The rank value is with respect to the communicator in the COMM parameter.
- For MPI_COMM_WORLD, the value in SOURCE is the absolute rank of the processor you are willing to receive from.
- The receiver can specify a wildcard value for SOURCE (MPI_ANY_SOURCE), indicating that any source is acceptable as long as the TAG and COMM parameters match.
30. TAG

    MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM
      OUT INTEGER IERROR, STATUS(MPI_STATUS_SIZE)

- The TAG value is an integer that must be matched with the TAG value of the corresponding send.
- The receiver can specify a wildcard value for TAG (MPI_ANY_TAG), indicating that it is willing to receive any tag value as long as the SOURCE and COMM values match.
31. STATUS

    MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM
      OUT INTEGER IERROR, STATUS(MPI_STATUS_SIZE)

- The STATUS parameter is a returned parameter that contains information about the completion of the message.
- When using wildcards you may need to find out who it was that sent you a message, what it was about, and how long the message was before continuing to process. This is the type of information found in STATUS.
32. STATUS (continued)

    MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
      IN  <type>  BUF(*)
      IN  INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM
      OUT INTEGER IERROR, STATUS(MPI_STATUS_SIZE)

- In FORTRAN 77, STATUS is an array of integers of size MPI_STATUS_SIZE.
- The three constants MPI_SOURCE, MPI_TAG, and MPI_ERROR are the indices of the entries that store the source, tag, and error fields, respectively.
- In C, STATUS is a structure of type MPI_Status that contains three fields named MPI_SOURCE, MPI_TAG, and MPI_ERROR.
- Notice that the length of the message doesn't appear to be included. (A sketch of reading these fields follows.)
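- A C sketch (the buffer name and size are assumptions) of receiving with both wildcards and then reading the STATUS fields:

    #include <mpi.h>
    #include <stdio.h>

    void recv_from_anyone(void)
    {
        double buf[100];
        MPI_Status status;

        MPI_Recv(buf, 100, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        /* the struct fields identify the actual sender and tag */
        printf("message from rank %d, tag %d\n",
               status.MPI_SOURCE, status.MPI_TAG);
    }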
33. Questions/Answers
- Question: What is the purpose of having the error returned in the STATUS data structure? It seems redundant.
- Answer: It is possible for a single function such as MPI_WAITALL() to complete multiple messages in a single call. In these cases each individual message may produce its own error code, and that code is what is returned in the STATUS data structure. (A minimal sketch follows.)
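- A minimal C sketch of the situation the answer describes; non-blocking requests are covered later, so assume reqs already holds two outstanding sends or receives:

    #include <mpi.h>

    void finish_two_messages(MPI_Request reqs[2])
    {
        MPI_Status stats[2];
        /* MPI_Waitall completes both messages in one call; if any fails
           it returns MPI_ERR_IN_STATUS and each stats[i].MPI_ERROR holds
           that message's own error code */
        if (MPI_Waitall(2, reqs, stats) == MPI_ERR_IN_STATUS) {
            /* inspect stats[0].MPI_ERROR and stats[1].MPI_ERROR here */
        }
    }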
34. MPI_GET_COUNT

    MPI_GET_COUNT(STATUS, DATATYPE, COUNT, IERROR)
      IN  INTEGER STATUS(MPI_STATUS_SIZE), DATATYPE
      OUT INTEGER COUNT, IERROR

- MPI_GET_COUNT will allow you to determine the number of entities of type DATATYPE that were received in the message. (See the sketch below.)
- For advanced users, see also MPI_GET_ELEMENTS.
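- A C sketch of recovering a message's length from its STATUS; the buffer bound of 1000 is an assumption:

    #include <mpi.h>
    #include <stdio.h>

    void recv_unknown_length(void)
    {
        int buf[1000];               /* assumed upper bound on the message */
        MPI_Status status;
        int n;

        MPI_Recv(buf, 1000, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &n);   /* how many MPI_INTs arrived */
        printf("received %d integers\n", n);
    }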
35. Six Powerful Functions
- MPI_INIT
- MPI_FINALIZE
- MPI_COMM_RANK
- MPI_COMM_SIZE
- MPI_SEND
- MPI_RECV
36. Deadlock
- MPI does not enforce a safe programming style.
- It is the user's responsibility to ensure that it is impossible for the program to fall into a deadlock condition.
- Deadlock occurs when a process blocks to wait for an event that, given the current state of the system, can never happen.
37. Deadlock Examples

    CALL MPI_COMM_RANK(comm, rank, ierr)
    IF (rank .EQ. 0) THEN
      CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
      CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
    ELSE IF (rank .EQ. 1) THEN
      CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
      CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
    END IF

- This program will always deadlock: both processes block in MPI_RECV, each waiting for a message that the other has not yet sent.
38. Deadlock Examples

    CALL MPI_COMM_RANK(comm, rank, ierr)
    IF (rank .EQ. 0) THEN
      CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
      CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
    ELSE IF (rank .EQ. 1) THEN
      CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
      CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
    END IF

- This program is unsafe. Why?
39. Safe Way

    CALL MPI_COMM_RANK(comm, rank, ierr)
    IF (rank .EQ. 0) THEN
      CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
      CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
    ELSE IF (rank .EQ. 1) THEN
      CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
      CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
    END IF

- This is a silly example; no one would ever try to do it the other ways, right? (An alternative is sketched below.)
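- An alternative the slides do not show: MPI_SENDRECV performs the send and the receive in one call and lets the implementation order them safely. A C sketch, under the assumption of exactly two ranks:

    #include <mpi.h>

    void safe_exchange(int rank, int tag)
    {
        float sendbuf[100], recvbuf[100];
        MPI_Status status;
        int partner = 1 - rank;      /* assumes only ranks 0 and 1 */

        MPI_Sendrecv(sendbuf, 100, MPI_FLOAT, partner, tag,
                     recvbuf, 100, MPI_FLOAT, partner, tag,
                     MPI_COMM_WORLD, &status);
    }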
40-50. Motivating Example for Deadlock
[A sequence of figure slides animating the exchange over Timesteps 1 through 10; the diagrams are not preserved in this transcript.]
51. Super Idea!

    CALL MPI_COMM_RANK(comm, rank, ierr)
    IF (rank .EQ. 0) THEN
      CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
      CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
    ELSE IF (rank .EQ. 1) THEN
      CALL MPI_SEND(sendbuf, count, MPI_REAL, 2, tag, comm, ierr)
      CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
    ELSE IF (rank .EQ. 2) THEN
      ...

- I'll cleverly order my sends so that they all happen at the same time and all the communication will be completed in one time step!
52. WRONG!
- The code will be unsafe.
- "It worked perfectly for me; why doesn't it work on this machine?"
- "It ran fine on Wednesday and now it doesn't work. I haven't changed anything!"
- "My code works if I send smaller messages. Maybe your machine can't handle my optimized code."
- Why? Every process posts a standard-mode send first, and whether those sends are buffered is entirely up to the implementation; if the messages are too large to buffer, every send blocks and no receive is ever posted.
- http://research.cs.vt.edu/lasca/schedule
- Please send any additional questions to lasca@cs.vt.edu