Title: CS 267 Applications of Parallel Computers, Lecture 7: Message Passing Programming (MPI)
1 CS 267 Applications of Parallel Computers
Lecture 7: Message Passing Programming (MPI)
- Jonathan Carter
- jtcarter_at_lbl.gov
2 Overview
- Message Passing Programming Review
- What is MPI ?
- Parallel Programming with MPI
- Basic Send and Receive
- Buffering and message delivery
- Non-blocking communication
- Collective Communication
- Notes and Further Information
3 Message Passing Programming - Review
- Model
- Set of processes that each have local data and are able to communicate with each other by sending and receiving messages
- Advantages
- Useful and complete model to express parallel algorithms
- Potentially fast
- What is used in practice
4 What is MPI?
- A coherent effort to produce a standard for message passing
- Before MPI: proprietary libraries (Cray shmem, IBM MPL) and research-community libraries (PVM, p4)
- A message-passing library specification
- For Fortran, C, and C++
5 MPI History
- MPI forum: government, academia, and industry
- November 1992 committee formed
- May 1994 MPI 1.0 published
- June 1995 MPI 1.1 published (clarifications)
- April 1995 MPI 2.0 committee formed
- July 1997 MPI 2.0 published
- July 1997 MPI 1.2 published (clarifications)
6 Current Status
- MPI 1.2
- MPICH from ANL/MSU
- LAM from Indiana University (Bloomington)
- IBM, Cray, HP, SGI, NEC, Fujitsu
- MPI 2.0
- Fujitsu (all), IBM, Cray, NEC (most), MPICH, LAM,
HP (some)
7 Parallel Programming With MPI
- Communication
- Basic send/receive (blocking)
- Collective
- Non-blocking
- One-sided (MPI 2)
- Synchronization
- Implicit in point-to-point communication
- Global synchronization via collective communication
- Parallel I/O (MPI 2)
8 Creating Parallelism
- Single Program Multiple Data (SPMD)
- Each MPI process runs a copy of the same program on different data
- Each copy runs at its own rate and is not explicitly synchronized
- May take different paths through the program
- Control through rank and number of tasks
9 Creating Parallelism
- Multiple Program Multiple Data
- Each MPI process can be a separate program
- With OpenMP, pthreads
- Each MPI process can be explicitly multi-threaded, or threaded via some directive set such as OpenMP
10 Hello World
  #include "mpi.h"
  #include <stdio.h>

  int main( int argc, char **argv )
  {
      int rank, size;
      MPI_Init( &argc, &argv );
      MPI_Comm_rank( MPI_COMM_WORLD, &rank );
      MPI_Comm_size( MPI_COMM_WORLD, &size );
      printf( "I am %d of %d\n", rank, size );
      MPI_Finalize();
      return 0;
  }
11 MPI Basic Send/Receive
[Diagram: Process 0 calls send, Process 1 calls recv]
- Need to describe
- How to identify process
- How to identify message
- How to identify data
12 Identifying Processes
- MPI Communicator
- Defines a group (set of ordered processes) and a context (a virtual network)
- Rank
- Process number within the group
- MPI_ANY_SOURCE will receive from any process
- Default communicator
- MPI_COMM_WORLD: the whole group
13 Identifying Messages
- An MPI Communicator defines a virtual network; send/recv pairs must use the same communicator
- send/recv routines have a tag (integer variable) argument that can be used to identify a message, or screen for a particular message
- MPI_ANY_TAG will receive a message with any tag
14 Identifying Data
- Data is described by a triple (address, type, count)
- For send, this defines the message
- For recv, this defines the size of the receive buffer
- Amount of data received, source, and tag available via the status data structure (see the sketch below)
- Useful if using MPI_ANY_SOURCE, MPI_ANY_TAG, or unsure of message size (must be smaller than buffer)
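A minimal sketch of how the (address, type, count) triple and the status structure fit together; the 100-int buffer size is an arbitrary choice for this illustration, and the fragment assumes MPI is already initialized:

  /* Fragment (assumes mpi.h and stdio.h included, MPI initialized):
     receive up to 100 ints from any source and any tag, then ask the
     status what actually arrived. */
  int buf[100];
  int count;
  MPI_Status status;

  MPI_Recv( buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
            MPI_COMM_WORLD, &status );

  MPI_Get_count( &status, MPI_INT, &count );   /* number of ints received */
  printf( "received %d ints from rank %d with tag %d\n",
          count, status.MPI_SOURCE, status.MPI_TAG );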
15 MPI Types
- Type may be recursively defined as
- An MPI predefined type
- A contiguous array of types
- An array of equally spaced blocks
- An array of arbitrarily spaced blocks
- Arbitrary structure
- Each user-defined type is constructed via an MPI routine, e.g. MPI_TYPE_VECTOR
16 MPI Predefined Types
- C Fortran
- MPI_INT MPI_INTEGER
- MPI_FLOAT MPI_REAL
- MPI_DOUBLE MPI_DOUBLE_PRECISION
- MPI_CHAR MPI_CHARACTER
- MPI_UNSIGNED MPI_LOGICAL
- MPI_LONG MPI_COMPLEX
- Language Independent
- MPI_BYTE
17 MPI Types
- Explicit data description is useful
- Simplifies programming, e.g. send a row/column of a matrix with a single call (as sketched below)
- Heterogeneous machines
- May improve performance
- Reduce memory-to-memory copies
- Allow use of scatter/gather hardware
- May hurt performance
- User packing of data likely faster
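A minimal sketch of building such a derived type with MPI_Type_vector, assuming a row-major N x N matrix of doubles; the names a, N, j, and dest are placeholders for this illustration:

  /* Fragment: derived type describing one column of a row-major
     double a[N][N]; N, j, and dest are assumed to be defined. */
  MPI_Datatype column;

  MPI_Type_vector( N,            /* count: N blocks                    */
                   1,            /* blocklength: 1 element per block   */
                   N,            /* stride: N elements between blocks  */
                   MPI_DOUBLE, &column );
  MPI_Type_commit( &column );

  /* send column j of a with a single call */
  MPI_Send( &a[0][j], 1, column, dest, 0, MPI_COMM_WORLD );

  MPI_Type_free( &column );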
18 Point-to-point Example
Process 0:
  #define TAG 999
  float a[10];
  int dest = 1;
  MPI_Send(a, 10, MPI_FLOAT, dest, TAG, MPI_COMM_WORLD);

Process 1:
  #define TAG 999
  MPI_Status status;
  int count;
  float b[20];
  int sender = 0;
  MPI_Recv(b, 20, MPI_FLOAT, sender, TAG, MPI_COMM_WORLD, &status);
  MPI_Get_count(&status, MPI_FLOAT, &count);
19 MPI_Send
- MPI_Send(address, count, type, dest, tag, comm)
- address: pointer to data
- count: number of elements to be sent
- type: data type
- dest: destination process
- tag: identifying tag
- comm: communicator
- When MPI_Send returns, the message is sent and the data can be reused. The message has not necessarily been received.
20 MPI_Recv
- MPI_Recv(address, count, type, source, tag, comm, status)
- address: pointer to receive buffer
- count: maximum number of elements to receive (size of buffer)
- type: data type
- source: source process
- tag: identifying tag
- comm: communicator
- status: sender, tag, and message size
- When MPI_Recv returns, the message has been received and the data can be used.
21 MPI Status Data Structure
- In C:
    MPI_Status status;
    int recvd_tag, recvd_from, recvd_count;
    recvd_tag  = status.MPI_TAG;
    recvd_from = status.MPI_SOURCE;
    MPI_Get_count( &status, MPI_INT, &recvd_count );
- In Fortran:
    integer status(MPI_STATUS_SIZE)
    recvd_tag  = status(MPI_TAG)
    recvd_from = status(MPI_SOURCE)
    call MPI_GET_COUNT( status, MPI_INTEGER, recvd_count, ierr )
22 MPI Programming with Six Routines
- Some programs can be written with only six routines (a minimal sketch follows this list)
- MPI_Init
- MPI_Finalize
- MPI_Comm_size
- MPI_Comm_rank
- MPI_Send
- MPI_Recv
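A minimal sketch of a complete program that uses only these six routines; the integer payload and tag value are arbitrary choices for the illustration:

  #include "mpi.h"
  #include <stdio.h>

  int main( int argc, char **argv )
  {
      int rank, size, value = 42;   /* 42 and tag 0 are arbitrary */
      MPI_Status status;

      MPI_Init( &argc, &argv );
      MPI_Comm_rank( MPI_COMM_WORLD, &rank );
      MPI_Comm_size( MPI_COMM_WORLD, &size );

      if (rank == 0 && size > 1)
          MPI_Send( &value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD );
      else if (rank == 1)
      {
          MPI_Recv( &value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status );
          printf( "rank 1 received %d from rank 0\n", value );
      }

      MPI_Finalize();
      return 0;
  }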
23 Data Exchange
Process 0:  MPI_Recv(...,1,...)  MPI_Send(...,1,...)
Process 1:  MPI_Recv(...,0,...)  MPI_Send(...,0,...)
Deadlock. MPI_Recv will not return until the matching send is posted.
24 Data Exchange
Process 0:  MPI_Send(...,1,...)  MPI_Recv(...,1,...)
Process 1:  MPI_Send(...,0,...)  MPI_Recv(...,0,...)
May deadlock, depending on the implementation. If the messages can be buffered, the program will run. Called 'unsafe' in the MPI standard.
25 Buffering in MPI
- Implementation may buffer on the sending process, the receiving process, both, or neither
- In practice, implementations tend to buffer "small" messages on the receiving process
- MPI has a buffered send mode (see the sketch below)
- MPI_Buffer_attach
- MPI_Buffer_detach
- MPI_Bsend
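A sketch of buffered-mode sending; the names data, dest, and tag and the 1000-double message size are assumptions for the illustration, and the fragment assumes stdlib.h for malloc:

  /* Fragment: attach a user buffer, send in buffered mode, detach. */
  int    size;
  char  *buf;
  double data[1000];

  /* room for one message of 1000 doubles plus MPI's bookkeeping */
  MPI_Pack_size( 1000, MPI_DOUBLE, MPI_COMM_WORLD, &size );
  size += MPI_BSEND_OVERHEAD;
  buf = (char *) malloc( size );

  MPI_Buffer_attach( buf, size );
  MPI_Bsend( data, 1000, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD );
  /* ... */
  MPI_Buffer_detach( &buf, &size );   /* completes any pending buffered sends */
  free( buf );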
26 Message Delivery
- Eager: send data immediately; store in remote buffer
- No synchronization
- Only one message sent
- Data is copied
- Uses memory for buffering (less for application)
- Rendezvous: send message header; wait for recv to be posted; send data
- No data copy
- More memory for application
- More messages required
- Synchronization (send blocks until recv posted)
27 Message Delivery
- Many MPI implementations use both the eager and rendezvous methods of message delivery
- Switch between the two methods according to message size
- Often the cutover point is controllable via an environment variable, e.g. MP_EAGER_LIMIT and MP_USE_FLOW_CONTROL on the IBM SP
28 Message Delivery
- Non-overtaking messages
- Messages sent from the same process will arrive in the order sent
- No fairness
- On a wildcard receive, it is possible to receive from only one source despite other messages being sent
- Progress
- For a pair of matched sends and receives, at least one will complete, independent of other messages
29 Performance Comparison
30 Data Exchange
- Ways around 'unsafe' data exchange
- Match send/recv pairs (hard in the general case)
- Use MPI_Sendrecv (see the sketch below)
- Use non-blocking communication
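For example, the exchange of slide 24 can be written safely with MPI_Sendrecv; this sketch assumes exactly two processes and a 10-element float message:

  /* Fragment: exchange 10 floats with the partner rank without deadlock. */
  float sendbuf[10], recvbuf[10];
  MPI_Status status;
  int other = 1 - rank;            /* assumes exactly 2 processes */

  MPI_Sendrecv( sendbuf, 10, MPI_FLOAT, other, 0,
                recvbuf, 10, MPI_FLOAT, other, 0,
                MPI_COMM_WORLD, &status );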
31 Non-blocking Communication
- Communication split into two parts (see the sketch below)
- MPI_Isend or MPI_Irecv starts communication and returns a request data structure
- MPI_Wait (also MPI_Waitall, MPI_Waitany) takes the request as an argument and blocks until communication is complete
- MPI_Test takes the request as an argument and checks for completion
- Advantages
- No deadlocks
- Overlap communication with computation
- Exploit bi-directional communication
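A sketch of the same two-process exchange written with MPI_Isend, MPI_Irecv, and MPI_Waitall; the partner rank and message size are assumed as in the earlier examples:

  /* Fragment: post both non-blocking calls, compute, then wait on both. */
  float sendbuf[10], recvbuf[10];
  MPI_Request req[2];
  MPI_Status  stat[2];
  int other = 1 - rank;            /* assumes exactly 2 processes */

  MPI_Isend( sendbuf, 10, MPI_FLOAT, other, 0, MPI_COMM_WORLD, &req[0] );
  MPI_Irecv( recvbuf, 10, MPI_FLOAT, other, 0, MPI_COMM_WORLD, &req[1] );

  /* ... computation can overlap the communication here ... */

  MPI_Waitall( 2, req, stat );     /* both buffers are now safe to use */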
32 Data Exchange with Isend/Recv
Process 0:  MPI_Isend(...,1,...,&request)  MPI_Recv(...,1,...)  MPI_Wait(&request, &status)
Process 1:  MPI_Isend(...,0,...,&request)  MPI_Recv(...,0,...)  MPI_Wait(&request, &status)
33 Non-blocking send/recv buffers
- May not modify or read the message buffer between the MPI_Irecv and MPI_Wait calls
- May not modify or read the message buffer between the MPI_Isend and MPI_Wait calls
- May not have two MPI_Irecv calls pending on the same buffer
- May not have two MPI_Isend calls pending on the same buffer
- Restrictions provide flexibility for implementers
34 Performance Comparison
35 Collective Communication
- Optimized algorithms, scaling as log(n)
- Differences from point-to-point
- Amount of data sent must match amount of data specified by receivers
- No tags
- Blocking only
- MPI_Barrier(comm)
- All processes in the communicator are synchronized. The only collective call where synchronization is guaranteed.
36 Collective Move Functions
- MPI_Bcast(data, count, type, src, comm)
- Broadcasts data from src to all processes in the communicator
- MPI_Gather(in, count, type, out, count, type, dest, comm)
- Gathers data from all nodes to the dest node
- MPI_Scatter(in, count, type, out, count, type, src, comm)
- Scatters data from the src node to all nodes (the three calls are combined in the sketch below)
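A sketch that combines the three calls, with root rank 0; the 100-element chunk per process and the assumed maximum of 64 processes are illustrative sizes only:

  /* Fragment: root 0 broadcasts a parameter, scatters work, gathers results. */
  int    nprocs, rank;
  double param, chunk[100], all[100 * 64], result, results[64];

  MPI_Comm_size( MPI_COMM_WORLD, &nprocs );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  MPI_Bcast( &param, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD );

  /* each process receives 100 doubles from the root's 'all' array */
  MPI_Scatter( all, 100, MPI_DOUBLE, chunk, 100, MPI_DOUBLE,
               0, MPI_COMM_WORLD );

  /* ... compute 'result' from 'chunk' ... */

  /* root collects one double from every process */
  MPI_Gather( &result, 1, MPI_DOUBLE, results, 1, MPI_DOUBLE,
              0, MPI_COMM_WORLD );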
37 Collective Move Functions
[Diagram: broadcast, scatter, and gather data movement across processes]
38 Collective Move Functions
- Additional functions
- MPI_Allgather, MPI_Gatherv, MPI_Scatterv,
MPI_Allgatherv, MPI_Alltoall
39 Collective Reduce Functions
- MPI_Reduce(send, recv, count, type, op, root, comm)
- Global reduction operation, op, on the send buffer. The result is placed in the recv buffer at process root. op may be a user-defined or MPI predefined operation.
- MPI_Allreduce(send, recv, count, type, op, comm)
- As above, except the result is broadcast to all processes (see the sketch below).
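A minimal sketch of a global sum with MPI_Allreduce; the variable names are placeholders for this illustration:

  /* Fragment: every process contributes a partial sum; all get the total. */
  double local_sum  = 0.0;   /* partial result computed on this process */
  double global_sum;

  MPI_Allreduce( &local_sum, &global_sum, 1, MPI_DOUBLE,
                 MPI_SUM, MPI_COMM_WORLD );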
40 Collective Reduce Functions
[Diagram: allreduce combines data from all processes and returns the result to every process]
41 Collective Reduce Functions
- Additional functions
- MPI_Reduce_scatter, MPI_Scan
- Predefined operations
- Sum, product, min, max, ...
- User-defined operations
- MPI_Op_create (see the sketch below)
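A sketch of a user-defined operation: an element-wise reduction that keeps the value of largest magnitude. The function name absmax and the variables are assumptions for this illustration; the function signature is the one MPI_Op_create expects:

  #include <math.h>   /* for fabs */

  /* Fragment: user-defined reduction over MPI_DOUBLE data. */
  void absmax( void *in, void *inout, int *len, MPI_Datatype *type )
  {
      double *a = (double *) in;
      double *b = (double *) inout;
      int i;
      for (i = 0; i < *len; i++)
          if (fabs(a[i]) > fabs(b[i]))
              b[i] = a[i];      /* keep the larger-magnitude value */
  }

  /* ... later, in the program ... */
  MPI_Op  op;
  double  local, global;

  MPI_Op_create( absmax, 1 /* commutative */, &op );
  MPI_Reduce( &local, &global, 1, MPI_DOUBLE, op, 0, MPI_COMM_WORLD );
  MPI_Op_free( &op );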
42 Notes on C, Fortran, C++
- In C
- #include "mpi.h"
- MPI functions return an error code or MPI_SUCCESS
- In Fortran
- include 'mpif.h'
- use mpi (MPI 2)
- All MPI calls are subroutines; the return code is the final argument
- In C++
- size = MPI::COMM_WORLD.Get_size() (MPI 2)
43 Other Features
- Other send modes
- Synchronous mode can be used to check whether a program is safe, since it forces a rendezvous protocol
- Ready mode is difficult to use and doesn't boost performance on any implementation; the user must ensure the recv is already posted, which requires more explicit synchronization
- Persistent communication: pre-specify a message envelope and data
- Create new communicators
- Libraries, logically partitioning tasks
- Topologies
- Cartesian and graph topologies can map physical hardware to processes (see the sketch below)
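A sketch of the Cartesian topology routines: build a 2-D periodic process grid and look up neighbor ranks; the grid shape is left for MPI to choose and the variable names are placeholders:

  /* Fragment: 2-D periodic process grid; find neighbors along dimension 0. */
  MPI_Comm cart;
  int dims[2]    = {0, 0};     /* 0 = let MPI_Dims_create choose */
  int periods[2] = {1, 1};     /* periodic in both dimensions */
  int nprocs, left, right;

  MPI_Comm_size( MPI_COMM_WORLD, &nprocs );
  MPI_Dims_create( nprocs, 2, dims );
  MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods, 1 /* reorder */, &cart );

  /* ranks of the neighbors one step away along dimension 0 */
  MPI_Cart_shift( cart, 0, 1, &left, &right );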
44 Other Features
- Probe and cancel
- Check for characteristics of an incoming message, possibly cancel (see the sketch below)
- I/O (MPI 2)
- Individual, shared, or explicit file pointers
- Collective or individual (by process) file access
- Blocking and non-blocking access
- One-sided communication (MPI 2)
- Put, get, and accumulate
- Loose synchronization model
- Remote lock/unlock of memory
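A sketch of probing for a message of unknown size before allocating the receive buffer; the fragment assumes stdlib.h for malloc and MPI_DOUBLE data:

  /* Fragment: probe, size the buffer exactly, then receive. */
  MPI_Status status;
  int count;
  double *buf;

  MPI_Probe( MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status );
  MPI_Get_count( &status, MPI_DOUBLE, &count );

  buf = (double *) malloc( count * sizeof(double) );
  MPI_Recv( buf, count, MPI_DOUBLE, status.MPI_SOURCE, status.MPI_TAG,
            MPI_COMM_WORLD, &status );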
45 Free MPI Implementations
- MPICH from Argonne National Lab and Mississippi State Univ.
- http://www-unix.mcs.anl.gov/mpi/mpich/
- Runs on
- Network of workstations
- SMP using shared memory
- MPP systems (support limited)
- Windows
46 Free MPI Implementations
- LAM from Ohio Supercomputer Center, University of Notre Dame, Indiana University
- http://www.lam-mpi.org/
- Many MPI 2 features
- Runs on
- Network of workstations
47 IBM MPI Implementation
- MPI Programming Guide and MPI Subroutine Reference
- http://www1.ibm.com/servers/eserver/pseries/library/sp_books/pe.html
- Compatible with Pthreads and OpenMP
- All of MPI 2 except for process spawning
48 Further Information
- MPI Standards
- http://www.mpi-forum.org/
- Books
- Using MPI: Portable Parallel Programming with the Message-Passing Interface (second edition), by Gropp, Lusk, and Skjellum
- Using MPI-2: Advanced Features of the Message-Passing Interface, by Gropp, Lusk, and Thakur
- MPI: The Complete Reference, Volume 1, by Snir, Otto, Huss-Lederman, Walker, and Dongarra
- MPI: The Complete Reference, Volume 2, by Gropp, Huss-Lederman, Lumsdaine, Lusk, Nitzberg, Saphir, and Snir
49 Example
- Calculate the energy of a system of particles interacting via a Coulomb potential.

      real coord(3,n), charge(n)

      energy = 0.0
      do i = 1, n
        do j = 1, i-1
          rdist = 1.0/sqrt( (coord(1,i)-coord(1,j))**2   &
                          + (coord(2,i)-coord(2,j))**2   &
                          + (coord(3,i)-coord(3,j))**2 )
          energy = energy + charge(i)*charge(j)*rdist
        end do
      end do
50 MPI Example 1
- Functional decomposition
- each process will compute roughly the same number of interactions
- accomplish this by dividing up the outer loop
- replicate data to make communication simple
- this approach will not scale
51 MPI - Example 1
      include 'mpif.h'
      parameter (n=50000)
      dimension coord(3,n), charge(n)

      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, mype, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, npes, ierr)

      call initdata(n, coord, charge, mype)
      e = energy(mype, npes, n, coord, charge)

      etotal = 0.0
      call mpi_reduce(e, etotal, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (mype.eq.0) write(*,*) etotal
      call mpi_finalize(ierr)
52 MPI - Example 1
      subroutine initdata(n, coord, charge, mype)
      include 'mpif.h'
      dimension coord(3,n), charge(n)

      if (mype.eq.0) then
        ! GENERATE coords, charge
      end if
      ! broadcast data to slaves
      call mpi_bcast(coord, 3*n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      call mpi_bcast(charge, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      return
      end
53 MPI - Example 1
      real function energy(mype, npes, n, coord, charge)
      dimension coord(3,n), charge(n)

      inter   = n*(n-1)/npes
      nstart  = nint(sqrt(real(mype*inter))) + 1
      nfinish = nint(sqrt(real((mype+1)*inter)))
      if (mype.eq.npes-1) nfinish = n

      total = 0.0
      do i = nstart, nfinish
        do j = 1, i-1
          ....
          total = total + charge(i)*charge(j)*rdist
        end do
      end do
      energy = total
      return
      end
54 MPI - Example 2
- Domain decomposition
- each task takes a chunk of particles
- in turn, receives particle data from another process and computes all interactions between its own data and the received data
- repeat until all interactions are done
55 MPI - Example 2
[Diagram: 100 particles split into chunks of 20 across 5 processes; at each step, every process pairs its own chunk with the next chunk passed around the ring]

             Proc 0    Proc 1    Proc 2    Proc 3    Proc 4
  Own:       1-20      21-40     41-60     61-80     81-100
  Step 1:    21-40     41-60     61-80     81-100    1-20
  Step 2:    41-60     61-80     81-100    1-20      21-40
  Step 3:    61-80     81-100    1-20      21-40     41-60
56 MPI - Example 2
      subroutine initdata(n, coord, charge, mype, npes, npepmax, nmax, nmin)
      include 'mpif.h'
      dimension coord(3,n), charge(n)
      integer status(MPI_STATUS_SIZE)

      itag    = 0
      isender = 0
      if (mype.eq.0) then
        do ipe = 1, npes-1
          ! GENERATE coord, charge for PE ipe
          call mpi_send(coord, nj*3, MPI_REAL, ipe, itag, MPI_COMM_WORLD, ierror)
          call mpi_send(charge, nj, MPI_REAL, ipe, itag, MPI_COMM_WORLD, ierror)
        end do
        ! GENERATE coord, charge for self
      else
        ! receive particles
        call mpi_recv(coord, 3*n, MPI_REAL, isender, itag, MPI_COMM_WORLD, status, ierror)
        call mpi_recv(charge, n, MPI_REAL, isender, itag, MPI_COMM_WORLD, status, ierror)
      endif
      return
      end
57 MPI - Example 2
      niter = npes/2
      do iter = 1, niter
        ! PE to send to and receive from
        if (ipsend.eq.npes-1) then
          ipsend = 0
        else
          ipsend = ipsend + 1
        end if
        if (iprecv.eq.0) then
          iprecv = npes-1
        else
          iprecv = iprecv - 1
        end if
        ! send and receive particles
        call mpi_sendrecv(coordi, 3*n, MPI_REAL, ipsend, itag,     &
                          coordj, 3*n, MPI_REAL, iprecv, itag,     &
                          MPI_COMM_WORLD, status, ierror)
        call mpi_sendrecv(chargei, n, MPI_REAL, ipsend, itag,      &
                          chargej, n, MPI_REAL, iprecv, itag,      &
                          MPI_COMM_WORLD, status, ierror)
        ! accumulate energy
        e = e + energy2(n, coordi, chargei, n, coordj, chargej)
      end do