Title: CS 267 Applications of Parallel Computers, Lecture 7: Message Passing Programming (MPI)
1 CS 267 Applications of Parallel Computers
Lecture 7: Message Passing Programming (MPI)
- Jonathan Carter
- jtcarter_at_lbl.gov
2 Overview
- Message Passing Programming Review
- What is MPI ?
- Parallel Programming with MPI
- Basic Send and Receive
- Buffering and message delivery
- Non-blocking communication
- Collective Communication
- Notes and Further Information
3 Message Passing Programming - Review
- Model
- Set of processes that each have local data and are able to communicate with each other by sending and receiving messages
- Advantages
- Useful and complete model to express parallel algorithms
- Potentially fast
- What is used in practice
4 What is MPI?
- A coherent effort to produce a standard for message passing
- Before MPI: proprietary libraries (Cray shmem, IBM MPL) and research-community libraries (PVM, p4)
- A message-passing library specification
- For Fortran, C, and C++
5 MPI History
- MPI forum: government, academia, and industry
- November 1992 committee formed
- May 1994 MPI 1.0 published
- June 1995 MPI 1.1 published (clarifications)
- April 1995 MPI 2.0 committee formed
- July 1997 MPI 2.0 published
- July 1997 MPI 1.2 published (clarifications)
6 Current Status
- MPI 1.2
- MPICH from ANL/MSU
- LAM from Indiana University (Bloomington)
- IBM, Cray, HP, SGI, NEC, Fujitsu
- MPI 2.0
- Fujitsu (all), IBM, Cray, NEC (most), MPICH, LAM,
HP (some)
7 Parallel Programming With MPI
- Communication
- Basic send/receive (blocking)
- Collective
- Non-blocking
- One-sided (MPI 2)
- Synchronization
- Implicit in point-to-point communication
- Global synchronization via collective communication
- Parallel I/O (MPI 2)
8 Creating Parallelism
- Single Program Multiple Data (SPMD)
- Each MPI process runs a copy of the same program on different data
- Each copy runs at its own rate and is not explicitly synchronized
- May take different paths through the program
- Control through rank and number of tasks
9 Creating Parallelism
- Multiple Program Multiple Data
- Each MPI process can be a separate program
- With OpenMP, pthreads
- Each MPI process can be explicitly multi-threaded, or threaded via some directive set such as OpenMP
10 Hello World
  #include "mpi.h"
  #include <stdio.h>

  int main( int argc, char **argv )
  {
      int rank, size;
      MPI_Init( &argc, &argv );
      MPI_Comm_rank( MPI_COMM_WORLD, &rank );
      MPI_Comm_size( MPI_COMM_WORLD, &size );
      printf( "I am %d of %d\n", rank, size );
      MPI_Finalize();
      return 0;
  }
11 MPI Basic Send/Receive
[Diagram: Process 0 calls send, Process 1 calls recv]
- Need to describe
- How to identify process
- How to identify message
- How to identify data
12 Identifying Processes
- MPI Communicator
- Defines a group (set of ordered processes) and a context (a virtual network)
- Rank
- Process number within the group
- MPI_ANY_SOURCE will receive from any process
- Default communicator
- MPI_COMM_WORLD: the whole group
13 Identifying Messages
- An MPI Communicator defines a virtual network; send/recv pairs must use the same communicator
- send/recv routines have a tag (integer variable) argument that can be used to identify a message, or screen for a particular message
- MPI_ANY_TAG will receive a message with any tag
14 Identifying Data
- Data is described by a triple (address, type, count)
- For send, this defines the message
- For recv, this defines the size of the receive buffer
- Amount of data received, source, and tag available via the status data structure (see the sketch below)
- Useful if using MPI_ANY_SOURCE, MPI_ANY_TAG, or unsure of message size (must be smaller than buffer)
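A minimal sketch of how the (address, type, count) triple and the status structure fit together; the 100-int buffer size is an arbitrary choice for this illustration, and the fragment assumes MPI is already initialized:

  /* Fragment (assumes mpi.h and stdio.h included, MPI initialized):
     receive up to 100 ints from any source and any tag, then ask the
     status what actually arrived. */
  int buf[100];
  int count;
  MPI_Status status;

  MPI_Recv( buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
            MPI_COMM_WORLD, &status );

  MPI_Get_count( &status, MPI_INT, &count );   /* number of ints received */
  printf( "received %d ints from rank %d with tag %d\n",
          count, status.MPI_SOURCE, status.MPI_TAG );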
15 MPI Types
- Type may be recursively defined as
- An MPI predefined type
- A contiguous array of types
- An array of equally spaced blocks
- An array of arbitrarily spaced blocks
- Arbitrary structure
- Each user-defined type is constructed via an MPI routine, e.g. MPI_TYPE_VECTOR
16 MPI Predefined Types
- C Fortran
- MPI_INT MPI_INTEGER
- MPI_FLOAT MPI_REAL
- MPI_DOUBLE MPI_DOUBLE_PRECISION
- MPI_CHAR MPI_CHARACTER
- MPI_UNSIGNED MPI_LOGICAL
- MPI_LONG MPI_COMPLEX
- Language Independent
- MPI_BYTE
17 MPI Types
- Explicit data description is useful
- Simplifies programming, e.g. send a row/column of a matrix with a single call (as sketched below)
- Heterogeneous machines
- May improve performance
- Reduce memory-to-memory copies
- Allow use of scatter/gather hardware
- May hurt performance
- User packing of data likely faster
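A minimal sketch of building such a derived type with MPI_Type_vector, assuming a row-major N x N matrix of doubles; the names a, N, j, and dest are placeholders for this illustration:

  /* Fragment: derived type describing one column of a row-major
     double a[N][N]; N, j, and dest are assumed to be defined. */
  MPI_Datatype column;

  MPI_Type_vector( N,            /* count: N blocks                    */
                   1,            /* blocklength: 1 element per block   */
                   N,            /* stride: N elements between blocks  */
                   MPI_DOUBLE, &column );
  MPI_Type_commit( &column );

  /* send column j of a with a single call */
  MPI_Send( &a[0][j], 1, column, dest, 0, MPI_COMM_WORLD );

  MPI_Type_free( &column );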
18 Point-to-point Example
Process 0:
  #define TAG 999
  float a[10];
  int dest = 1;
  MPI_Send(a, 10, MPI_FLOAT, dest, TAG, MPI_COMM_WORLD);

Process 1:
  #define TAG 999
  MPI_Status status;
  int count;
  float b[20];
  int sender = 0;
  MPI_Recv(b, 20, MPI_FLOAT, sender, TAG, MPI_COMM_WORLD, &status);
  MPI_Get_count(&status, MPI_FLOAT, &count);
19 MPI_Send
- MPI_Send(address, count, type, dest, tag, comm)
- address: pointer to data
- count: number of elements to be sent
- type: data type
- dest: destination process
- tag: identifying tag
- comm: communicator
- When MPI_Send returns, the message is sent and the data can be reused. The message has not necessarily been received.
20 MPI_Recv
- MPI_Recv(address, count, type, source, tag, comm, status)
- address: pointer to receive buffer
- count: maximum number of elements to receive (size of buffer)
- type: data type
- source: source process
- tag: identifying tag
- comm: communicator
- status: sender, tag, and message size
- When MPI_Recv returns, the message has been received and the data can be used.
21 MPI Status Data Structure
- In C:
    MPI_Status status;
    int recvd_tag, recvd_from, recvd_count;
    recvd_tag  = status.MPI_TAG;
    recvd_from = status.MPI_SOURCE;
    MPI_Get_count( &status, MPI_INT, &recvd_count );
- In Fortran:
    integer status(MPI_STATUS_SIZE)
    recvd_tag  = status(MPI_TAG)
    recvd_from = status(MPI_SOURCE)
    call MPI_GET_COUNT( status, MPI_INTEGER, recvd_count, ierr )
22 MPI Programming with Six Routines
- Some programs can be written with only six routines (a minimal sketch follows this list)
- MPI_Init
- MPI_Finalize
- MPI_Comm_size
- MPI_Comm_rank
- MPI_Send
- MPI_Recv
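A minimal sketch of a complete program that uses only these six routines; the integer payload and tag value are arbitrary choices for the illustration:

  #include "mpi.h"
  #include <stdio.h>

  int main( int argc, char **argv )
  {
      int rank, size, value = 42;   /* 42 and tag 0 are arbitrary */
      MPI_Status status;

      MPI_Init( &argc, &argv );
      MPI_Comm_rank( MPI_COMM_WORLD, &rank );
      MPI_Comm_size( MPI_COMM_WORLD, &size );

      if (rank == 0 && size > 1)
          MPI_Send( &value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD );
      else if (rank == 1)
      {
          MPI_Recv( &value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status );
          printf( "rank 1 received %d from rank 0\n", value );
      }

      MPI_Finalize();
      return 0;
  }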
23 Data Exchange
Process 0:  MPI_Recv(...,1,...)  MPI_Send(...,1,...)
Process 1:  MPI_Recv(...,0,...)  MPI_Send(...,0,...)
Deadlock. MPI_Recv will not return until the matching send is posted.
24 Data Exchange
Process 0:  MPI_Send(...,1,...)  MPI_Recv(...,1,...)
Process 1:  MPI_Send(...,0,...)  MPI_Recv(...,0,...)
May deadlock, depending on the implementation. If the messages can be buffered, the program will run. Called 'unsafe' in the MPI standard.
25 Buffering in MPI
- Implementation may buffer on the sending process, the receiving process, both, or neither
- In practice, implementations tend to buffer "small" messages on the receiving process
- MPI has a buffered send mode (see the sketch below)
- MPI_Buffer_attach
- MPI_Buffer_detach
- MPI_Bsend
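A sketch of buffered-mode sending; the names data, dest, and tag and the 1000-double message size are assumptions for the illustration, and the fragment assumes stdlib.h for malloc:

  /* Fragment: attach a user buffer, send in buffered mode, detach. */
  int    size;
  char  *buf;
  double data[1000];

  /* room for one message of 1000 doubles plus MPI's bookkeeping */
  MPI_Pack_size( 1000, MPI_DOUBLE, MPI_COMM_WORLD, &size );
  size += MPI_BSEND_OVERHEAD;
  buf = (char *) malloc( size );

  MPI_Buffer_attach( buf, size );
  MPI_Bsend( data, 1000, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD );
  /* ... */
  MPI_Buffer_detach( &buf, &size );   /* completes any pending buffered sends */
  free( buf );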
26 Message Delivery
- Eager: send data immediately; store in remote buffer
- No synchronization
- Only one message sent
- Data is copied
- Uses memory for buffering (less for application)
- Rendezvous: send message header; wait for recv to be posted; send data
- No data copy
- More memory for application
- More messages required
- Synchronization (send blocks until recv posted)
27 Message Delivery
- Many MPI implementations use both the eager and rendezvous methods of message delivery
- Switch between the two methods according to message size
- Often the cutover point is controllable via an environment variable, e.g. MP_EAGER_LIMIT and MP_USE_FLOW_CONTROL on the IBM SP
28 Message Delivery
- Non-overtaking messages
- Messages sent from the same process will arrive in the order sent
- No fairness
- On a wildcard receive, it is possible to receive from only one source despite other messages being sent
- Progress
- For a pair of matched sends and receives, at least one will complete, independent of other messages
29 Performance Comparison
30 Data Exchange
- Ways around 'unsafe' data exchange
- Match send/recv pairs (hard in the general case)
- Use MPI_Sendrecv (see the sketch below)
- Use non-blocking communication
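For example, the exchange of slide 24 can be written safely with MPI_Sendrecv; this sketch assumes exactly two processes and a 10-element float message:

  /* Fragment: exchange 10 floats with the partner rank without deadlock. */
  float sendbuf[10], recvbuf[10];
  MPI_Status status;
  int other = 1 - rank;            /* assumes exactly 2 processes */

  MPI_Sendrecv( sendbuf, 10, MPI_FLOAT, other, 0,
                recvbuf, 10, MPI_FLOAT, other, 0,
                MPI_COMM_WORLD, &status );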
31 Non-blocking Communication
- Communication split into two parts (see the sketch below)
- MPI_Isend or MPI_Irecv starts communication and returns a request data structure
- MPI_Wait (also MPI_Waitall, MPI_Waitany) takes the request as an argument and blocks until communication is complete
- MPI_Test takes the request as an argument and checks for completion
- Advantages
- No deadlocks
- Overlap communication with computation
- Exploit bi-directional communication
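A sketch of the same two-process exchange written with MPI_Isend, MPI_Irecv, and MPI_Waitall; the partner rank and message size are assumed as in the earlier examples:

  /* Fragment: post both non-blocking calls, compute, then wait on both. */
  float sendbuf[10], recvbuf[10];
  MPI_Request req[2];
  MPI_Status  stat[2];
  int other = 1 - rank;            /* assumes exactly 2 processes */

  MPI_Isend( sendbuf, 10, MPI_FLOAT, other, 0, MPI_COMM_WORLD, &req[0] );
  MPI_Irecv( recvbuf, 10, MPI_FLOAT, other, 0, MPI_COMM_WORLD, &req[1] );

  /* ... computation can overlap the communication here ... */

  MPI_Waitall( 2, req, stat );     /* both buffers are now safe to use */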
32 Data Exchange with Isend/Recv
Process 0:  MPI_Isend(...,1,...,&request)  MPI_Recv(...,1,...)  MPI_Wait(&request, &status)
Process 1:  MPI_Isend(...,0,...,&request)  MPI_Recv(...,0,...)  MPI_Wait(&request, &status)
33 Non-blocking send/recv buffers
- May not modify or read the message buffer between the MPI_Irecv and MPI_Wait calls
- May not modify or read the message buffer between the MPI_Isend and MPI_Wait calls
- May not have two MPI_Irecv calls pending on the same buffer
- May not have two MPI_Isend calls pending on the same buffer
- Restrictions provide flexibility for implementers
34 Performance Comparison
35 Collective Communication
- Optimized algorithms, scaling as log(n)
- Differences from point-to-point
- Amount of data sent must match amount of data specified by receivers
- No tags
- Blocking only
- MPI_Barrier(comm)
- All processes in the communicator are synchronized. The only collective call where synchronization is guaranteed.
36 Collective Move Functions
- MPI_Bcast(data, count, type, src, comm)
- Broadcasts data from src to all processes in the communicator
- MPI_Gather(in, count, type, out, count, type, dest, comm)
- Gathers data from all nodes to the dest node
- MPI_Scatter(in, count, type, out, count, type, src, comm)
- Scatters data from the src node to all nodes (the three calls are combined in the sketch below)
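A sketch that combines the three calls, with root rank 0; the 100-element chunk per process and the assumed maximum of 64 processes are illustrative sizes only:

  /* Fragment: root 0 broadcasts a parameter, scatters work, gathers results. */
  int    nprocs, rank;
  double param, chunk[100], all[100 * 64], result, results[64];

  MPI_Comm_size( MPI_COMM_WORLD, &nprocs );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  MPI_Bcast( &param, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD );

  /* each process receives 100 doubles from the root's 'all' array */
  MPI_Scatter( all, 100, MPI_DOUBLE, chunk, 100, MPI_DOUBLE,
               0, MPI_COMM_WORLD );

  /* ... compute 'result' from 'chunk' ... */

  /* root collects one double from every process */
  MPI_Gather( &result, 1, MPI_DOUBLE, results, 1, MPI_DOUBLE,
              0, MPI_COMM_WORLD );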
37 Collective Move Functions
[Diagram: broadcast, scatter, and gather data movement across processes]
38 Collective Move Functions
- Additional functions
- MPI_Allgather, MPI_Gatherv, MPI_Scatterv,
MPI_Allgatherv, MPI_Alltoall
39 Collective Reduce Functions
- MPI_Reduce(send, recv, count, type, op, root, comm)
- Global reduction operation, op, on the send buffer. The result is placed in the recv buffer at process root. op may be a user-defined or MPI predefined operation.
- MPI_Allreduce(send, recv, count, type, op, comm)
- As above, except the result is broadcast to all processes (see the sketch below).
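A minimal sketch of a global sum with MPI_Allreduce; the variable names are placeholders for this illustration:

  /* Fragment: every process contributes a partial sum; all get the total. */
  double local_sum  = 0.0;   /* partial result computed on this process */
  double global_sum;

  MPI_Allreduce( &local_sum, &global_sum, 1, MPI_DOUBLE,
                 MPI_SUM, MPI_COMM_WORLD );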
40 Collective Reduce Functions
[Diagram: allreduce combines data from all processes and returns the result to every process]
41 Collective Reduce Functions
- Additional functions
- MPI_Reduce_scatter, MPI_Scan
- Predefined operations
- Sum, product, min, max, ...
- User-defined operations
- MPI_Op_create (see the sketch below)
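A sketch of a user-defined operation: an element-wise reduction that keeps the value of largest magnitude. The function name absmax and the variables are assumptions for this illustration; the function signature is the one MPI_Op_create expects:

  #include <math.h>   /* for fabs */

  /* Fragment: user-defined reduction over MPI_DOUBLE data. */
  void absmax( void *in, void *inout, int *len, MPI_Datatype *type )
  {
      double *a = (double *) in;
      double *b = (double *) inout;
      int i;
      for (i = 0; i < *len; i++)
          if (fabs(a[i]) > fabs(b[i]))
              b[i] = a[i];      /* keep the larger-magnitude value */
  }

  /* ... later, in the program ... */
  MPI_Op  op;
  double  local, global;

  MPI_Op_create( absmax, 1 /* commutative */, &op );
  MPI_Reduce( &local, &global, 1, MPI_DOUBLE, op, 0, MPI_COMM_WORLD );
  MPI_Op_free( &op );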
42 Notes on C, Fortran, C++
- In C
- #include "mpi.h"
- MPI functions return an error code or MPI_SUCCESS
- In Fortran
- include 'mpif.h'
- use mpi (MPI 2)
- All MPI calls are subroutines; the return code is the final argument
- In C++
- size = MPI::COMM_WORLD.Get_size() (MPI 2)
43 Other Features
- Other send modes
- Synchronous mode can be used to check whether a program is safe, since it forces a rendezvous protocol
- Ready mode is difficult to use and doesn't boost performance on any implementation; the user must ensure the recv is already posted, which requires more explicit synchronization
- Persistent communication: pre-specify a message envelope and data
- Create new communicators
- Libraries, logically partitioning tasks
- Topologies
- Cartesian and graph topologies can map physical hardware to processes (see the sketch below)
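A sketch of the Cartesian topology routines: build a 2-D periodic process grid and look up neighbor ranks; the grid shape is left for MPI to choose and the variable names are placeholders:

  /* Fragment: 2-D periodic process grid; find neighbors along dimension 0. */
  MPI_Comm cart;
  int dims[2]    = {0, 0};     /* 0 = let MPI_Dims_create choose */
  int periods[2] = {1, 1};     /* periodic in both dimensions */
  int nprocs, left, right;

  MPI_Comm_size( MPI_COMM_WORLD, &nprocs );
  MPI_Dims_create( nprocs, 2, dims );
  MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods, 1 /* reorder */, &cart );

  /* ranks of the neighbors one step away along dimension 0 */
  MPI_Cart_shift( cart, 0, 1, &left, &right );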
44 Other Features
- Probe and cancel
- Check for characteristics of an incoming message, possibly cancel (see the sketch below)
- I/O (MPI 2)
- Individual, shared, or explicit file pointers
- Collective or individual (by process) file access
- Blocking and non-blocking access
- One-sided communication (MPI 2)
- Put, get, and accumulate
- Loose synchronization model
- Remote lock/unlock of memory
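A sketch of probing for a message of unknown size before allocating the receive buffer; the fragment assumes stdlib.h for malloc and MPI_DOUBLE data:

  /* Fragment: probe, size the buffer exactly, then receive. */
  MPI_Status status;
  int count;
  double *buf;

  MPI_Probe( MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status );
  MPI_Get_count( &status, MPI_DOUBLE, &count );

  buf = (double *) malloc( count * sizeof(double) );
  MPI_Recv( buf, count, MPI_DOUBLE, status.MPI_SOURCE, status.MPI_TAG,
            MPI_COMM_WORLD, &status );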
45 Free MPI Implementations
- MPICH from Argonne National Lab and Mississippi State Univ.
- http://www-unix.mcs.anl.gov/mpi/mpich/
- Runs on
- Network of workstations
- SMP using shared memory
- MPP systems (support limited)
- Windows
46 Free MPI Implementations
- LAM from Ohio Supercomputer Center, University of Notre Dame, Indiana University
- http://www.lam-mpi.org/
- Many MPI 2 features
- Runs on
- Network of workstations
47 IBM MPI Implementation
- MPI Programming Guide and MPI Subroutine Reference
- http://www1.ibm.com/servers/eserver/pseries/library/sp_books/pe.html
- Compatible with Pthreads and OpenMP
- All of MPI 2 except for process spawning
48 Further Information
- MPI Standards
- http://www.mpi-forum.org/
- Books
- Using MPI: Portable Parallel Programming with the Message-Passing Interface (second edition), by Gropp, Lusk, and Skjellum
- Using MPI-2: Advanced Features of the Message-Passing Interface, by Gropp, Lusk, and Thakur
- MPI: The Complete Reference, Volume 1, by Snir, Otto, Huss-Lederman, Walker, and Dongarra
- MPI: The Complete Reference, Volume 2, by Gropp, Huss-Lederman, Lumsdaine, Lusk, Nitzberg, Saphir, and Snir
49 Example
- Calculate the energy of a system of particles interacting via a Coulomb potential.

      real coord(3,n), charge(n)

      energy = 0.0
      do i = 1, n
        do j = 1, i-1
          rdist = 1.0/sqrt( (coord(1,i)-coord(1,j))**2   &
                          + (coord(2,i)-coord(2,j))**2   &
                          + (coord(3,i)-coord(3,j))**2 )
          energy = energy + charge(i)*charge(j)*rdist
        end do
      end do
50 MPI Example 1
- Functional decomposition
- each process will compute roughly the same number of interactions
- accomplish this by dividing up the outer loop
- replicate data to make communication simple
- this approach will not scale
51 MPI - Example 1
      include 'mpif.h'
      parameter (n=50000)
      dimension coord(3,n), charge(n)

      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, mype, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, npes, ierr)

      call initdata(n, coord, charge, mype)
      e = energy(mype, npes, n, coord, charge)

      etotal = 0.0
      call mpi_reduce(e, etotal, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (mype.eq.0) write(*,*) etotal
      call mpi_finalize(ierr)
52 MPI - Example 1
      subroutine initdata(n, coord, charge, mype)
      include 'mpif.h'
      dimension coord(3,n), charge(n)

      if (mype.eq.0) then
        ! GENERATE coords, charge
      end if
      ! broadcast data to slaves
      call mpi_bcast(coord, 3*n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      call mpi_bcast(charge, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      return
      end
53 MPI - Example 1
      real function energy(mype, npes, n, coord, charge)
      dimension coord(3,n), charge(n)

      inter   = n*(n-1)/npes
      nstart  = nint(sqrt(real(mype*inter))) + 1
      nfinish = nint(sqrt(real((mype+1)*inter)))
      if (mype.eq.npes-1) nfinish = n

      total = 0.0
      do i = nstart, nfinish
        do j = 1, i-1
          ....
          total = total + charge(i)*charge(j)*rdist
        end do
      end do
      energy = total
      return
      end
54 MPI - Example 2
- Domain decomposition
- each task takes a chunk of particles
- in turn, receives particle data from another process and computes all interactions between its own data and the received data
- repeat until all interactions are done
55 MPI - Example 2
[Diagram: 100 particles split into chunks of 20 across 5 processes; at each step, every process pairs its own chunk with the next chunk passed around the ring]

             Proc 0    Proc 1    Proc 2    Proc 3    Proc 4
  Own:       1-20      21-40     41-60     61-80     81-100
  Step 1:    21-40     41-60     61-80     81-100    1-20
  Step 2:    41-60     61-80     81-100    1-20      21-40
  Step 3:    61-80     81-100    1-20      21-40     41-60
56 MPI - Example 2
      subroutine initdata(n, coord, charge, mype, npes, npepmax, nmax, nmin)
      include 'mpif.h'
      dimension coord(3,n), charge(n)
      integer status(MPI_STATUS_SIZE)

      itag    = 0
      isender = 0
      if (mype.eq.0) then
        do ipe = 1, npes-1
          ! GENERATE coord, charge for PE ipe
          call mpi_send(coord, nj*3, MPI_REAL, ipe, itag, MPI_COMM_WORLD, ierror)
          call mpi_send(charge, nj, MPI_REAL, ipe, itag, MPI_COMM_WORLD, ierror)
        end do
        ! GENERATE coord, charge for self
      else
        ! receive particles
        call mpi_recv(coord, 3*n, MPI_REAL, isender, itag, MPI_COMM_WORLD, status, ierror)
        call mpi_recv(charge, n, MPI_REAL, isender, itag, MPI_COMM_WORLD, status, ierror)
      endif
      return
      end
57 MPI - Example 2
      niter = npes/2
      do iter = 1, niter
        ! PE to send to and receive from
        if (ipsend.eq.npes-1) then
          ipsend = 0
        else
          ipsend = ipsend + 1
        end if
        if (iprecv.eq.0) then
          iprecv = npes-1
        else
          iprecv = iprecv - 1
        end if
        ! send and receive particles
        call mpi_sendrecv(coordi, 3*n, MPI_REAL, ipsend, itag,     &
                          coordj, 3*n, MPI_REAL, iprecv, itag,     &
                          MPI_COMM_WORLD, status, ierror)
        call mpi_sendrecv(chargei, n, MPI_REAL, ipsend, itag,      &
                          chargej, n, MPI_REAL, iprecv, itag,      &
                          MPI_COMM_WORLD, status, ierror)
        ! accumulate energy
        e = e + energy2(n, coordi, chargei, n, coordj, chargej)
      end do