Title: Advanced MPI

1. Advanced MPI
- Daniel S. Katz
- Assistant Director for Cyberinfrastructure Development (CyD), CCT
- Associate Research Professor, ECE
- dsk_at_cct.lsu.edu
- Credits: TACC HPC Course, Introduction to Parallel I/O tutorial (Thakur), MPI Standards
2. Introduction Outline
- Status of MPI-2
- Groups and communication management
- Derived Datatypes
- Point-to-Point blocking/non-blocking communication
- Advanced Collective Communication
- Persistent communication
- Parallel I/O
- One-sided communication
- Parallel I/O examples (from Thakur)
3. MPI-2 Status Assessment
- All vendors have complete MPI-1, and have for 5-10 years
- Free implementations (MPICH, LAM) support heterogeneous workstation networks
- MPI-2 implementations are being undertaken by all vendors
- Fujitsu and NEC have complete MPI-2 implementations
- Other vendors generally have all but dynamic process management
- MPICH-2 is complete
- Open MPI (new MPI from LAM and other MPIs) is becoming complete
4. Communicators and Groups I
- A group is an ordered set of process identifiers (processes)
- Processes are implementation-dependent objects
- Each process in a group is associated with an integer rank
- Ranks are contiguous and start from zero
- Communicators were created to allow groups of processes to act together
- MPI_COMM_WORLD is the communicator that contains all processes; other communicators are subsets
- MPI_GROUP_EMPTY is the predefined group that contains no processes
5. Communicators and Groups II
- Can create new communicators from existing communicators or existing groups
- Both calls below are made by all processes in comm and can return MPI_COMM_NULL
- MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
- Partitions the group associated with comm into disjoint subgroups, one for each value of color (color can be MPI_UNDEFINED)
- Each subgroup contains all processes of the same color
- Within each subgroup, the processes are ranked in the order defined by the value of the argument key, with ties broken according to their rank in the old group
- A new communication domain is created for each subgroup and a handle to the representative communicator is returned in newcomm
- MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)
- Creates a new intracommunicator newcomm with communication group defined by group
6. MPI_Comm_split example
- MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
- [Figure: processes arranged in a 4-row by 7-column grid (rows 0-3, columns 0-6); each row becomes its own communicator]
- Split MPI_COMM_WORLD into one communicator per row (a complete C sketch follows below):

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    row = rank / ncol;
    MPI_Comm_split(MPI_COMM_WORLD, row, rank, &row_comm);
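- A minimal, self-contained C sketch of the row split above; the grid shape (and thus the ncol value) is an illustrative assumption, not part of the slide:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, row, row_rank;
        const int ncol = 7;                 /* assumed number of columns in the grid */
        MPI_Comm row_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        row = rank / ncol;                  /* color: all processes in one row share it */
        MPI_Comm_split(MPI_COMM_WORLD, row, rank, &row_comm);

        MPI_Comm_rank(row_comm, &row_rank); /* rank within the row communicator */
        printf("world rank %d -> row %d, row rank %d\n", rank, row, row_rank);

        MPI_Comm_free(&row_comm);
        MPI_Finalize();
        return 0;
    }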
7. MPI_Comm_create example
- MPI_Comm_group(MPI_Comm comm, MPI_Group *group)
- Returns a handle to the group associated with a communicator
- MPI_Group_union(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup)
- MPI_Group_intersection(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup)
- MPI_Group_difference(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup)
- The 3 above are self-explanatory
- MPI_Group_incl(MPI_Group group, int n, int ranks[], MPI_Group *newgroup)
- New group containing the n elements of group listed in ranks
- MPI_Group_excl(MPI_Group group, int n, int ranks[], MPI_Group *newgroup)
- New group containing all but the n listed elements of group
- More complicated constructors also exist (a usage sketch follows below)
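- A hedged sketch of building a communicator from a subgroup with MPI_Group_incl and MPI_Comm_create; the choice of "the first half of the ranks" is purely illustrative:

    #include <mpi.h>
    #include <stdlib.h>

    /* Create a communicator containing the first half of MPI_COMM_WORLD. */
    void make_half_comm(MPI_Comm *half_comm)
    {
        int size, i;
        int *ranks;
        MPI_Group world_group, half_group;

        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_group(MPI_COMM_WORLD, &world_group);

        ranks = (int *) malloc((size/2) * sizeof(int));
        for (i = 0; i < size/2; i++)
            ranks[i] = i;                       /* ranks to include */

        MPI_Group_incl(world_group, size/2, ranks, &half_group);
        MPI_Comm_create(MPI_COMM_WORLD, half_group, half_comm);
        /* processes not in half_group get MPI_COMM_NULL in *half_comm */

        MPI_Group_free(&half_group);
        MPI_Group_free(&world_group);
        free(ranks);
    }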
8. Freeing Communicators and Groups
- Can destroy communicators and groups, too
- MPI_Comm_free(MPI_Comm *comm)
- MPI_Group_free(MPI_Group *group)
9. Intra- and Intercommunicators, and Dynamic Processes
- So far, just intracommunicators
- MPI also has intercommunicators
- Utility of this was dubious in MPI-1
- In MPI-2, with dynamic processes, this is more useful
- A process may start new processes with MPI_COMM_SPAWN and MPI_COMM_SPAWN_MULTIPLE
- Returns an intercommunicator
- Does not change MPI_COMM_WORLD
- Two independently started MPI applications can establish communications
- MPI_Open_port, MPI_Comm_connect, and MPI_Comm_accept allow two running MPI programs to connect and communicate
- Not intended for client/server applications
- Designed to support HPC applications
- MPI_Comm_join allows the use of a TCP socket to connect two applications
- Details are beyond the scope of this class
- See http://www.mpi-forum.org/docs/mpi-20-html/node88.htm
10. Outline
- Status of MPI-2
- Groups and communication management
- Derived Datatypes
- Point-to-Point blocking/non-blocking communication
- Advanced Collective Communication
- Persistent communication
- Parallel I/O
- One-sided communication
- Parallel I/O examples (from Thakur)
11. Basic Datatypes
12. Derived Datatypes I
- Basic communication calls so far have involved only contiguous buffers containing a sequence of elements of a single type
- Many applications need more
- Could use multiple communication calls
- Could manually pack data into buffers and communicate those
- Could use derived datatypes
- Lets the library optimize how the communication is done
- User tells the library what is desired; the library does it
- Derived datatypes can be created at runtime
- Derived datatypes can be recursive
13. Derived Datatypes II
- Contiguous
- Allows replication of a datatype into contiguous locations
- int MPI_Type_contiguous(in int count, in MPI_Datatype oldtype, out MPI_Datatype *newtype)
- Vector
- Allows replication of a datatype into locations that consist of equally spaced blocks
- Each block is obtained by concatenating the same number of copies of the old datatype
- int MPI_Type_vector(in int count, in int blocklength, in int stride, in MPI_Datatype oldtype, out MPI_Datatype *newtype)
- Hvector
- Identical to vector, but stride is in bytes rather than in number of oldtype elements (a vector sketch follows below)
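- As an illustration of the vector constructor, this hedged sketch sends one column of a row-major N x N matrix as a single message; N is an assumed compile-time size:

    #include <mpi.h>

    #define N 8   /* assumed matrix dimension */

    /* Send column `col` of a row-major N x N matrix of doubles to `dest`. */
    void send_column(double a[N][N], int col, int dest, MPI_Comm comm)
    {
        MPI_Datatype column_t;

        /* N blocks of 1 double, spaced N doubles apart: one matrix column */
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column_t);
        MPI_Type_commit(&column_t);

        MPI_Send(&a[0][col], 1, column_t, dest, 0, comm);

        MPI_Type_free(&column_t);
    }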
14. Derived Datatypes III
- Indexed
- Allows replication of an old datatype into a sequence of blocks (each block is a concatenation of the old datatype), where each block can contain a different number of copies and have a different displacement (a small sketch follows below)
- int MPI_Type_indexed(in int count, in int array_of_blocklengths[], in int array_of_displacements[], in MPI_Datatype oldtype, out MPI_Datatype *newtype)
- Hindexed
- Identical to indexed, but displacements are in bytes rather than in number of oldtype extents
- int MPI_Type_hindexed(in int count, in int array_of_blocklengths[], in MPI_Aint array_of_displacements[], in MPI_Datatype oldtype, out MPI_Datatype *newtype)
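- A small hedged sketch of the indexed constructor: a datatype covering the upper triangle of a row-major matrix; the 4 x 4 size is an illustrative assumption:

    #include <mpi.h>

    /* Datatype covering the upper triangle (including the diagonal)
       of a row-major 4 x 4 matrix of doubles. */
    void make_upper_triangle_type(MPI_Datatype *upper)
    {
        int blocklens[4], displs[4], i;

        for (i = 0; i < 4; i++) {
            blocklens[i] = 4 - i;      /* row i keeps 4-i elements */
            displs[i]    = i * 4 + i;  /* starting at the diagonal element */
        }
        MPI_Type_indexed(4, blocklens, displs, MPI_DOUBLE, upper);
        MPI_Type_commit(upper);
    }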
15. Derived Datatypes IV
- Struct
- Most general
- Generalizes hindexed with an array of types
- int MPI_Type_struct(in int count, in int array_of_blocklengths[], in MPI_Aint array_of_displacements[], in MPI_Datatype array_of_types[], out MPI_Datatype *newtype)
- Struct example: count = 2, array_of_blocklengths = {3, 1}, array_of_displacements = {0, 12} (in bytes), array_of_types = {MPI_FLOAT, MPI_INT} (a full sketch of this example follows below)
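- The struct example above (3 floats at displacement 0 and 1 int at displacement 12) written out as a hedged sketch; MPI_Type_create_struct is the MPI-2 spelling of the constructor the slide calls MPI_Type_struct, and the 12-byte displacement assumes a platform where sizeof(float) == 4:

    #include <mpi.h>

    /* Matches the slide's example: blocklengths {3,1}, displacements {0,12}
       bytes, types {MPI_FLOAT, MPI_INT} -- e.g. struct { float f[3]; int n; } */
    void make_struct_type(MPI_Datatype *newtype)
    {
        int          blocklens[2] = {3, 1};
        MPI_Aint     displs[2]    = {0, 12};
        MPI_Datatype types[2]     = {MPI_FLOAT, MPI_INT};

        MPI_Type_create_struct(2, blocklens, displs, types, newtype);
        MPI_Type_commit(newtype);
    }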
16. Outline
- Status of MPI-2
- Groups and communication management
- Derived Datatypes
- Point-to-Point blocking/non-blocking communication
- Advanced Collective Communication
- Persistent communication
- Parallel I/O
- One-sided communication
- Parallel I/O examples (from Thakur)
17. Point-to-point Communications
- MPI_Send and MPI_Recv are blocking calls
- MPI_Send does not return until the message data and envelope have been safely stored away, so that the sender is free to access and overwrite the send buffer
- The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer
18. Buffering and Communication Modes
- Message buffering decouples the send and receive operations
- A blocking send can complete as soon as the message is buffered, even if no matching receive has been executed by the receiver
- On the other hand, message buffering can be expensive, as it entails additional memory-to-memory copying, and it requires the allocation of memory for buffering
- There are 4 communication modes in MPI
- Standard (no prefix), buffered (B prefix), synchronous (S prefix), ready (R prefix)
- MPI_Send(), MPI_Bsend(), MPI_Ssend(), MPI_Rsend()
19. Standard Communication Mode
- MPI offers the choice of several communication modes that allow control of the communication protocol choice
- MPI_Send uses the standard communication mode
- It is up to MPI to decide whether outgoing messages will be buffered
- MPI may buffer outgoing messages
- The send call may then complete before a matching receive is invoked
- On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages for performance reasons
- In that case, the send call will not complete until a matching receive has been posted and the data has been moved to the receiver
- Thus, a send in standard mode can be started whether or not a matching receive has been posted
- It may complete before a matching receive is posted
- The standard mode send is non-local: successful completion of the send operation may depend on the occurrence of a matching receive
20. Buffered Communication Mode
- A buffered mode send operation can be started whether or not a matching receive has been posted
- It may complete before a matching receive is posted
- However, unlike the standard send, this operation is local: its completion does not depend on the occurrence of a matching receive
- Thus, if a send is executed and no matching receive is posted, MPI must buffer the outgoing message so as to allow the send call to complete
- An error will occur if there is insufficient buffer space
- The amount of available buffer space is controlled by the user
- Buffer allocation by the user may be required for the buffered mode to be effective (a sketch follows below)
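- A hedged sketch of attaching a user buffer before using MPI_Bsend; the message size and destination rank are illustrative assumptions:

    #include <mpi.h>
    #include <stdlib.h>

    void buffered_send_example(int dest, MPI_Comm comm)
    {
        double work[1000];                       /* data to send (contents assumed) */
        int    bufsize;
        char  *mpibuf;

        /* Space for one message plus MPI's per-message overhead */
        MPI_Pack_size(1000, MPI_DOUBLE, comm, &bufsize);
        bufsize += MPI_BSEND_OVERHEAD;
        mpibuf = (char *) malloc(bufsize);

        MPI_Buffer_attach(mpibuf, bufsize);      /* make the buffer available to MPI */
        MPI_Bsend(work, 1000, MPI_DOUBLE, dest, 0, comm);  /* completes locally */
        MPI_Buffer_detach(&mpibuf, &bufsize);    /* blocks until buffered data is sent */
        free(mpibuf);
    }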
21. Synchronous Communication Mode
- A send that uses the synchronous mode can be started whether or not a matching receive was posted
- However, the send will complete successfully only if a matching receive is posted and the receive operation has started to receive the message sent by the synchronous send
- Thus, the completion of a synchronous send not only indicates that the send buffer can be reused, but also indicates that the receiver has reached a certain point in its execution, namely that it has started executing the matching receive
- If both sends and receives are blocking operations, then the use of the synchronous mode provides synchronous communication semantics: a communication does not complete at either end before both processes rendezvous at the communication
- A send executed in this mode is non-local
22. Ready Communication Mode
- A send that uses the ready communication mode may be started only if the matching receive is already posted
- Otherwise, the operation is erroneous and its outcome is undefined
- On some systems, this allows the removal of a hand-shake operation that is otherwise required, and results in improved performance
- The completion of the send operation does not depend on the status of a matching receive, and merely indicates that the send buffer can be reused
- A send operation that uses the ready mode has the same semantics as a standard send operation or a synchronous send operation; it is merely that the sender provides additional information to the system (namely that a matching receive is already posted) that can save some overhead
- In a correct program, therefore, a ready send could be replaced by a standard send with no effect on the behavior of the program other than performance
23. Blocking Receive
- There is only one receive operation, which can match any of the send modes
- The receive operation just described is blocking: it returns only after the receive buffer contains the newly received message
- A receive can complete before the matching send has completed (of course, it can complete only after the matching send has started)
24. Communication Modes Summary
- Ready mode has the least total overhead, but requires the receive to be posted before the send
- Can post receive, synchronize, then post send
- Synchronous mode is most portable, but can be slow
- Does not depend on order (ready) or buffers (buffered)
- Buffered mode doesn't depend on order (ready) and has no synchronization delays (synchronous), but buffering can add overhead and the user may need to control buffers
- Standard mode is implementation dependent
- Often, small messages are buffered and large messages are sent synchronously
25. Non-blocking Communication I
- Allows overlap of computation and communication
- In theory, at least, and sometimes in practice
- A non-blocking send start call initiates the send operation, but does not complete it
- The call returns before the message is copied out of the send buffer
- A separate send complete call is needed to complete the communication, i.e., to verify that the data has been copied out of the send buffer
- With suitable hardware, the transfer of data out of the sender's memory may proceed concurrently with computations done at the sender after the send was initiated and before it completed
26. Non-blocking Communication II
- Similarly, a non-blocking receive start call initiates the receive operation, but does not complete it
- The call returns before the message is stored in the receive buffer
- A separate receive complete call is needed to complete the receive operation and verify that the data has been received into the receive buffer
- With suitable hardware, the transfer of data into the receiver's memory may proceed concurrently with computations done after the receive was initiated and before it completed
- The use of non-blocking receives may also avoid system buffering and memory-to-memory copying, as information is provided early on the location of the receive buffer
27. Non-blocking Communication III
- Non-blocking send start calls can use the same four modes as blocking sends: standard, buffered, synchronous, and ready
- These carry the same meaning
- Sends of all modes, ready excepted, can be started whether a matching receive has been posted or not; a non-blocking ready send can be started only if a matching receive is posted
- In all cases, the send start call is local: it returns immediately, irrespective of the status of other processes
- The send-complete call returns when data has been copied out of the send buffer
- Non-blocking sends can be matched with blocking receives, and vice versa
28. Non-blocking Communication IV
- Syntax: add an I, get back a request handle
- int MPI_Send / MPI_Bsend / MPI_Ssend / MPI_Rsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
- int MPI_Isend / MPI_Ibsend / MPI_Issend / MPI_Irsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
- int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
- int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
- Blocking communication completion
- int MPI_Wait(MPI_Request *request, MPI_Status *status)
- Non-blocking communication completion
- int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status) (flag indicates whether the request has completed)
- Request freeing (don't care when it completes)
- int MPI_Request_free(MPI_Request *request)
- There are also calls for multiple completion/free (a C sketch follows below)
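- A minimal C sketch of the non-blocking pattern, exchanging one double with a neighbor whose rank is an assumed parameter:

    #include <mpi.h>

    /* Exchange one double with `neighbor` while leaving room to compute
       between posting and completing the communication. */
    void exchange(double *sendval, double *recvval, int neighbor, MPI_Comm comm)
    {
        MPI_Request reqs[2];
        MPI_Status  stats[2];

        MPI_Irecv(recvval, 1, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
        MPI_Isend(sendval, 1, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

        /* ... computation that does not touch sendval/recvval ... */

        MPI_Waitall(2, reqs, stats);   /* complete both operations */
    }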
29. Non-blocking Communication Examples

Exchange loop, freeing the send requests with MPI_REQUEST_FREE (ranks 0 and 1):

    IF (rank.EQ.0) THEN
        DO i=1, n
            CALL MPI_ISEND(outval, 1, MPI_REAL, 1, 0, comm, req, ierr)
            CALL MPI_REQUEST_FREE(req, ierr)
            CALL MPI_IRECV(inval, 1, MPI_REAL, 1, 0, comm, req, ierr)
            CALL MPI_WAIT(req, status, ierr)
        END DO
    ELSE    ! rank.EQ.1
        CALL MPI_IRECV(inval, 1, MPI_REAL, 0, 0, comm, req, ierr)
        CALL MPI_WAIT(req, status, ierr)
        DO i=1, n-1
            CALL MPI_ISEND(outval, 1, MPI_REAL, 0, 0, comm, req, ierr)
            CALL MPI_REQUEST_FREE(req, ierr)
            CALL MPI_IRECV(inval, 1, MPI_REAL, 0, 0, comm, req, ierr)
            CALL MPI_WAIT(req, status, ierr)
        END DO
        CALL MPI_ISEND(outval, 1, MPI_REAL, 0, 0, comm, req, ierr)
        CALL MPI_WAIT(req, status, ierr)
    END IF

Overlapping communication with computation:

    IF (rank.EQ.0) THEN
        CALL MPI_ISEND(a(1), 10, MPI_REAL, 1, tag, comm, request, ierr)
        ! ... compute ...
        CALL MPI_WAIT(request, status, ierr)
    ELSE    ! rank.EQ.1
        CALL MPI_IRECV(a(1), 10, MPI_REAL, 0, tag, comm, request, ierr)
        ! ... compute ...
        CALL MPI_WAIT(request, status, ierr)
    END IF
30. Non-blocking Communication Examples

    IF (rank.EQ.0) THEN
        CALL MPI_SEND(i, 1, MPI_INTEGER, 2, 0, comm, ierr)
    ELSE IF (rank.EQ.1) THEN
        CALL MPI_SEND(x, 1, MPI_REAL, 2, 0, comm, ierr)
    ELSE    ! rank.EQ.2
        DO i=1, 2
            CALL MPI_PROBE(MPI_ANY_SOURCE, 0, comm, status, ierr)
            IF (status(MPI_SOURCE) .EQ. 0) THEN
                ! unsafe: CALL MPI_RECV(i, 1, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, status2, ierr)
                CALL MPI_RECV(i, 1, MPI_INTEGER, status(MPI_SOURCE), status(MPI_TAG), comm, status2, ierr)
            ELSE
                ! unsafe: CALL MPI_RECV(x, 1, MPI_REAL, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, status2, ierr)
                CALL MPI_RECV(x, 1, MPI_REAL, status(MPI_SOURCE), status(MPI_TAG), comm, status2, ierr)
            END IF
        END DO
    END IF

(Receiving with MPI_ANY_SOURCE after the probe is unsafe, since a message from the other sender could be matched instead; using the source and tag returned in status guarantees that the probed message is the one received.)
31. Outline
- Status of MPI-2
- Groups and communication management
- Derived Datatypes
- Point-to-Point blocking/non-blocking communication
- Advanced Collective Communication
- Persistent communication
- Parallel I/O
- One-sided communication
- Parallel I/O examples (from Thakur)
32. Advanced Collective Communication I
- [Figure: MPI_Alltoall among processes P0-P3. Before: each process holds elements A, B, C, D destined for processes 0-3. After: process i holds element i from every process, i.e., the data is transposed across processes.]
33. Advanced Collective Communication II
- What if the amount of data to be gathered from each process to the root is not the same?
- Use Gatherv instead of Gather (or Scatterv/Scatter, Allgatherv/Allgather)
- v is for varying
- MPI_Gather
- int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
- All n processes (including root) effectively call
- MPI_Send(sendbuf, sendcount, sendtype, root, ...)
- Root effectively calls (for i = 0 to n-1)
- MPI_Recv(recvbuf + i*recvcount*extent(recvtype), recvcount, recvtype, i, ...)
- MPI_Gatherv
- int MPI_Gatherv(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int *recvcounts, int *displs, MPI_Datatype recvtype, int root, MPI_Comm comm)
- All n processes (including root) effectively call
- MPI_Send(sendbuf, sendcount, sendtype, root, ...)
- Root effectively calls (for i = 0 to n-1)
- MPI_Recv(recvbuf + displs[i]*extent(recvtype), recvcounts[i], recvtype, i, ...)
- (A Gatherv sketch follows below)
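- A hedged MPI_Gatherv sketch in which process i contributes i+1 integers; the contribution sizes are illustrative assumptions:

    #include <mpi.h>
    #include <stdlib.h>

    void gather_varying(MPI_Comm comm)
    {
        int rank, size, i, root = 0;
        int sendcount, *sendbuf, *recvbuf = NULL;
        int *recvcounts = NULL, *displs = NULL;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        sendcount = rank + 1;                      /* process i sends i+1 ints */
        sendbuf = (int *) malloc(sendcount * sizeof(int));
        for (i = 0; i < sendcount; i++)
            sendbuf[i] = rank;

        if (rank == root) {
            int total = size * (size + 1) / 2;
            recvcounts = (int *) malloc(size * sizeof(int));
            displs     = (int *) malloc(size * sizeof(int));
            for (i = 0; i < size; i++) {
                recvcounts[i] = i + 1;
                displs[i]     = (i == 0) ? 0 : displs[i-1] + recvcounts[i-1];
            }
            recvbuf = (int *) malloc(total * sizeof(int));
        }

        MPI_Gatherv(sendbuf, sendcount, MPI_INT,
                    recvbuf, recvcounts, displs, MPI_INT, root, comm);

        free(sendbuf);
        if (rank == root) { free(recvbuf); free(recvcounts); free(displs); }
    }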
34. Outline
- Status of MPI-2
- Groups and communication management
- Derived Datatypes
- Point-to-Point blocking/non-blocking communication
- Advanced Collective Communication
- Persistent communication
- Parallel I/O
- One-sided communication
- Parallel I/O examples (from Thakur)
35. Persistent Communication I
- Often, a communication with the same argument list is repeatedly executed within the inner loop of a parallel computation
- In such a situation, it may be possible to optimize the communication by binding the list of communication arguments to a persistent communication request once and then repeatedly using the request to initiate and complete messages
- The persistent request thus created can be thought of as a communication port or a half-channel
- It does not provide the full functionality of a conventional channel, since there is no binding of the send port to the receive port
- This construct allows reduction of the overhead for communication between the process and the communication controller, but not of the overhead for communication between one communication controller and another
- It is not necessary that messages sent with a persistent request be received by a receive operation using a persistent request, or vice versa
36. Persistent Communication II
- Create a persistent communication request before the loop starts
- MPI_Send_init / MPI_Bsend_init / MPI_Ssend_init / MPI_Rsend_init(buffer, ..., &request)
- MPI_Recv_init(buffer, ..., &request)
- Start the request(s) inside each iteration of the loop
- MPI_Start(&request)
- MPI_Startall(count, array_of_requests)
- Complete the request(s) inside each iteration of the loop
- MPI_Wait(), MPI_Waitall(), MPI_Waitany(), MPI_Waitsome(), MPI_Test(), MPI_Testany(), MPI_Testsome()
- Free the request(s) after the loop completes
- MPI_Request_free(&request)
- (A sketch of this pattern follows below)
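- A hedged C sketch of the persistent pattern described above; the neighbor rank and iteration count are illustrative assumptions:

    #include <mpi.h>

    void persistent_loop(double *sendbuf, double *recvbuf, int n,
                         int neighbor, int niters, MPI_Comm comm)
    {
        MPI_Request reqs[2];
        MPI_Status  stats[2];
        int iter;

        /* Bind the argument lists once, before the loop */
        MPI_Send_init(sendbuf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
        MPI_Recv_init(recvbuf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

        for (iter = 0; iter < niters; iter++) {
            MPI_Startall(2, reqs);           /* initiate send and receive */
            /* ... computation that does not touch the buffers ... */
            MPI_Waitall(2, reqs, stats);     /* complete both */
        }

        /* Free the persistent requests after the loop */
        MPI_Request_free(&reqs[0]);
        MPI_Request_free(&reqs[1]);
    }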
37. Outline
- Status of MPI-2
- Groups and communication management
- Derived Datatypes
- Point-to-Point blocking/non-blocking communication
- Advanced Collective Communication
- Persistent communication
- Parallel I/O
- One-sided communication
- Parallel I/O examples (from Thakur)
38. I/O in Parallel Applications
- Old ways to do I/O
- Process 0 does all the I/O to a single file and broadcasts/scatters/gathers to/from the other processes
- All processes do their own I/O to separate files
- All tasks read from the same file
- All tasks write to the same file, using seeks to get to the right place
- One task at a time appends to a single file, using barriers to prevent overlapping writes
39. I/O in Parallel Applications
- New way is to use a parallel I/O library, such as MPI I/O
- Multiple tasks can simultaneously read or write a single file (possibly on a parallel file system) using the MPI I/O API
- A parallel file system usually looks like a single file system, but has multiple I/O servers to permit high bandwidth from multiple processes
- MPI I/O is part of MPI-2
- Allows single or collective operations to/from contiguous or non-contiguous regions/data using MPI datatypes, including derived datatypes, blocking or non-blocking
- Sound familiar? Writing is like sending a message, and reading is like receiving one
40. Levels of Parallel I/O
- Example problem: distributed array access
- [Figure: a large array distributed among 16 processes (P0-P15); each square represents a subarray in the memory of a single process. The resulting access pattern in the file interleaves the rows belonging to different processes.]
41. Level-0 Access
- Each process makes one independent read request for each row in the local array (as in Unix)

    MPI_File_open(..., file, ..., &fh);
    for (i = 0; i < n_local_rows; i++) {
        MPI_File_seek(fh, ...);
        MPI_File_read(fh, &(A[i][0]), ...);
    }
    MPI_File_close(&fh);
42. Level-1 Access
- Similar to level 0, but each process uses collective I/O functions

    MPI_File_open(MPI_COMM_WORLD, file, ..., &fh);
    for (i = 0; i < n_local_rows; i++) {
        MPI_File_seek(fh, ...);
        MPI_File_read_all(fh, &(A[i][0]), ...);
    }
    MPI_File_close(&fh);
43. Level-2 Access
- Each process creates a derived datatype to describe the noncontiguous access pattern, defines a file view, and calls independent I/O functions

    MPI_Type_create_subarray(..., &subarray, ...);
    MPI_Type_commit(&subarray);
    MPI_File_open(..., file, ..., &fh);
    MPI_File_set_view(fh, ..., subarray, ...);
    MPI_File_read(fh, A, ...);
    MPI_File_close(&fh);
44. Level-3 Access
- Similar to level 2, except that each process uses collective I/O functions

    MPI_Type_create_subarray(..., &subarray, ...);
    MPI_Type_commit(&subarray);
    MPI_File_open(MPI_COMM_WORLD, file, ..., &fh);
    MPI_File_set_view(fh, ..., subarray, ...);
    MPI_File_read_all(fh, A, ...);
    MPI_File_close(&fh);
45. The Four Access Levels
- [Figure: comparison of the four access levels, level 0 through level 3.]
46. Why Higher-level Access is Good
- Given complete access information, an implementation can perform optimizations such as
- Data sieving: read large chunks and extract what is really needed
- Collective I/O: merge requests of different processes into larger requests
- Improved prefetching and caching
47. Some Quick Details
- Collective operation on a filename
- MPI_File_open()
- Single-process operation on a filename
- MPI_File_delete()
- Collective operations on a file handle
- MPI_File_open(), MPI_File_close(), MPI_File_set_size(), MPI_File_preallocate()
- Single-process operations on a file handle
- MPI_File_get_size(), MPI_File_get_amode(), MPI_File_get_group() (Note: returns a duplicate of the communicator's group)
48. Hints
- Allow a user to provide info (e.g., file access patterns, file system specifics) for optimization
- Optional - may be ignored by the implementation
- Can be provided for
- MPI_File_open(), MPI_File_delete(), MPI_File_set_info(), MPI_File_get_info()
- Example hints
- access_style: read_once, write_once, read_mostly, write_mostly, sequential, reverse_sequential, random
- collective_buffering: true, false
- num_io_nodes (integer - number of I/O nodes)
- striping_factor (integer - number of devices)
- striping_unit (integer - number of bytes)
- Context sensitive, implementation-dependent (a sketch of passing hints follows below)
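- A hedged sketch of passing hints through an MPI_Info object at open time; the particular hint values are illustrative and may simply be ignored by an implementation:

    #include <mpi.h>

    void open_with_hints(const char *filename, MPI_File *fh)
    {
        MPI_Info info;

        MPI_Info_create(&info);
        /* hint keys from the slide; values are assumptions for illustration */
        MPI_Info_set(info, "access_style", "write_mostly");
        MPI_Info_set(info, "collective_buffering", "true");
        MPI_Info_set(info, "striping_factor", "4");

        MPI_File_open(MPI_COMM_WORLD, (char *) filename,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);

        MPI_Info_free(&info);   /* the file keeps its own copy of the hints */
    }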
49. File Views
- MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char *datarep, MPI_Info info)
- Changes the process's view of the data in the file
- disp: start of the view; etype: type of data; filetype: distribution of data to processes; datarep: representation of data in the file
- native: data stored as in memory; internal: data stored in the library's choice of format; external32: data stored in a portable format
- Resets the individual file pointers and the shared file pointer to zero
- Collective
- datarep and the extent of etype in datarep must be identical on all processes in the group
- disp, filetype, and info may vary
- Datatypes passed in etype and filetype must be committed
- Should be called immediately after open, and perhaps at other times, too
50. Data Access I
- Three orthogonal aspects to data access
- Positioning
- Explicit offset vs. implicit file pointer
- Synchronism
- Blocking vs. non-blocking and split collective
- Coordination
- Noncollective vs. collective
- Two types of file pointers
- Individual, shared
51. Data Access Routines
52. Positioning Routines
53. Parallel I/O notes
- Non-collective, non-blocking calls return an MPI_Request
- Must wait on or otherwise free the request at some point
- Only one collective, split-phase call per file handle at a time
- Therefore, no request is used/needed
- POSIX file consistency
- When a write returns, the data is immediately visible to other processes
- Atomicity: if two writes occur simultaneously on overlapping areas in a file, the data stored will be from one or the other, not a combination
- MPI I/O file consistency
- Default semantics are weaker than POSIX, for optimization
- Can get close to POSIX semantics by setting atomicity to TRUE
- Otherwise, to read data written by another process, you need to call MPI_File_sync or close and reopen the file (a sketch follows below)
- See examples at the end of this section from Thakur's tutorial
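- A hedged sketch of the two consistency options mentioned above: either set atomicity to true, or bracket the write/read pair with MPI_File_sync and a barrier; the ranks involved are illustrative assumptions:

    #include <mpi.h>

    /* Rank 0 writes, rank 1 then reads the same region of an already-open file. */
    void write_then_read(MPI_File fh, int rank, int *buf, int count)
    {
        MPI_Status status;

        /* Option 1: request POSIX-like semantics (may cost performance) */
        /* MPI_File_set_atomicity(fh, 1); */

        if (rank == 0)
            MPI_File_write_at(fh, 0, buf, count, MPI_INT, &status);

        /* Option 2 (default, weaker semantics): sync - barrier - sync */
        MPI_File_sync(fh);                  /* push rank 0's data to the file   */
        MPI_Barrier(MPI_COMM_WORLD);        /* order the write before the read  */
        MPI_File_sync(fh);                  /* make the data visible to readers */

        if (rank == 1)
            MPI_File_read_at(fh, 0, buf, count, MPI_INT, &status);
    }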
54. Outline
- Status of MPI-2
- Groups and communication management
- Derived Datatypes
- Point-to-Point blocking/non-blocking communication
- Advanced Collective Communication
- Persistent communication
- Parallel I/O
- One-sided communication
- Parallel I/O examples (from Thakur)
55. One-sided Operations: Issues
- Balancing efficiency and portability across a wide class of architectures
- shared-memory multiprocessors
- NUMA architectures
- distributed-memory MPPs, clusters
- workstation networks
- Retaining the look and feel of MPI-1
- Dealing with subtle memory behavior issues: cache coherence, sequential consistency
- Synchronization is separate from data movement
56. Remote Memory Access Windows and Window Objects
- [Figure: four processes (0-3), each exposing a window within its address space; together the windows form a window object.]
57. One-sided Communication Calls
- MPI_Put - stores into remote memory
- MPI_Get - reads from remote memory
- MPI_Accumulate - updates remote memory
- All are non-blocking: the data transfer is described, maybe even initiated, but may continue after the call returns
- Subsequent synchronization on the window object is needed to ensure the operations are complete, e.g., MPI_Win_fence (a sketch follows below)
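- A hedged C sketch of the put/fence pattern; the window size and target rank are illustrative assumptions:

    #include <mpi.h>

    #define WIN_LEN 100   /* assumed number of doubles exposed in the window */

    void put_into_neighbor(double *local, int target, MPI_Comm comm)
    {
        double  winbuf[WIN_LEN];         /* memory exposed to other processes */
        MPI_Win win;

        MPI_Win_create(winbuf, WIN_LEN * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, comm, &win);

        MPI_Win_fence(0, win);           /* open the access epoch */
        MPI_Put(local, WIN_LEN, MPI_DOUBLE,       /* origin buffer            */
                target, 0, WIN_LEN, MPI_DOUBLE,   /* target rank, disp, count */
                win);
        MPI_Win_fence(0, win);           /* close the epoch: puts are complete */

        MPI_Win_free(&win);
    }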
58. Outline
- Status of MPI-2
- Groups and communication management
- Derived Datatypes
- Point-to-Point blocking/non-blocking communication
- Advanced Collective Communication
- Persistent communication
- Parallel I/O
- One-sided communication
- Parallel I/O examples (from Thakur)
59. Using MPI for Simple I/O
- Each process needs to read a chunk of data from a common file
60. Using Individual File Pointers

    MPI_File fh;
    MPI_Status status;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    bufsize = FILESIZE/nprocs;
    nints = bufsize/sizeof(int);

    MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_seek(fh, rank * bufsize, MPI_SEEK_SET);
    MPI_File_read(fh, buf, nints, MPI_INT, &status);
    MPI_File_close(&fh);
61. Using Explicit Offsets

    include 'mpif.h'
    integer status(MPI_STATUS_SIZE)
    integer (kind=MPI_OFFSET_KIND) offset
C   in F77, see implementation notes (might be integer*8)

    call MPI_FILE_OPEN(MPI_COMM_WORLD, '/pfs/datafile',
                       MPI_MODE_RDONLY, MPI_INFO_NULL, fh, ierr)
    nints = FILESIZE / (nprocs*INTSIZE)
    offset = rank * nints * INTSIZE
    call MPI_FILE_READ_AT(fh, offset, buf, nints, MPI_INTEGER,
                          status, ierr)
    call MPI_GET_COUNT(status, MPI_INTEGER, count, ierr)
    print *, 'process ', rank, 'read ', count, 'integers'
    call MPI_FILE_CLOSE(fh, ierr)
62. Writing to a File
- Use MPI_File_write or MPI_File_write_at
- Use MPI_MODE_WRONLY or MPI_MODE_RDWR as the flags to MPI_File_open
- If the file doesn't already exist, the flag MPI_MODE_CREATE must also be passed to MPI_File_open
- We can pass multiple flags by using bitwise-or (|) in C, or addition (+) in Fortran
63. Using File Views
- Processes write to a shared file
- MPI_File_set_view assigns regions of the file to separate processes
64. File Views
- Specified by a triplet (displacement, etype, and filetype) passed to MPI_File_set_view
- displacement: number of bytes to be skipped from the start of the file
- etype: basic unit of data access (can be any basic or derived datatype)
- filetype: specifies which portion of the file is visible to the process
65. File View Example

    MPI_File file;
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &file);
    MPI_File_set_view(file, myrank * BUFSIZE * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_write(file, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&file);
66. Other Ways to Write to a Shared File
- MPI_File_seek: like Unix seek
- MPI_File_read_at / MPI_File_write_at: combine seek and I/O for thread safety
- MPI_File_read_shared / MPI_File_write_shared: use the shared file pointer
- Collective operations
67. Noncontiguous Accesses
- Common in parallel applications
- Example: distributed arrays stored in files
- A big advantage of MPI I/O over Unix I/O is the ability to specify noncontiguous accesses in memory and file within a single function call by using derived datatypes
- Allows the implementation to optimize the access
- Collective I/O combined with noncontiguous accesses yields the highest performance
68. Example: Distributed Array Access
- [Figure: a 2D array distributed among four processes (P0-P3), and the file containing the global array in row-major order.]
69. A Simple File View Example
- [Figure: etype = MPI_INT; a displacement is skipped from the head of the file, then the filetype pattern repeats along the file, and so on.]
70. File View Code

    MPI_Aint lb, extent;
    MPI_Datatype etype, filetype, contig;
    MPI_Offset disp;

    MPI_Type_contiguous(2, MPI_INT, &contig);
    lb = 0;
    extent = 6 * sizeof(int);
    MPI_Type_create_resized(contig, lb, extent, &filetype);
    MPI_Type_commit(&filetype);
    disp = 5 * sizeof(int);
    etype = MPI_INT;

    MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, disp, etype, filetype, "native",
                      MPI_INFO_NULL);
    MPI_File_write(fh, buf, 1000, MPI_INT, MPI_STATUS_IGNORE);
71. Collective I/O in MPI
- A critical optimization in parallel I/O
- Allows communication of the big picture to the file system
- Framework for two-phase I/O, in which communication precedes I/O (can use MPI machinery)
- Basic idea: build large blocks, so that reads/writes in the I/O system will be large
- [Figure: many small individual requests are merged into one large collective access.]
72. Collective I/O
- MPI_File_read_all, MPI_File_read_at_all, etc.
- _all indicates that all processes in the group specified by the communicator passed to MPI_File_open will call this function
- Each process specifies only its own access information -- the argument list is the same as for the non-collective functions
73. Collective I/O
- By calling the collective I/O functions, the user allows an implementation to optimize the request based on the combined request of all processes
- The implementation can merge the requests of different processes and service the merged request efficiently
- Particularly effective when the accesses of different processes are noncontiguous and interleaved
74. Accessing Arrays Stored in Files
75. Using the Distributed Array (Darray) Datatype

    int gsizes[2], distribs[2], dargs[2], psizes[2];

    gsizes[0] = m;  /* no. of rows in global array */
    gsizes[1] = n;  /* no. of columns in global array */
    distribs[0] = MPI_DISTRIBUTE_BLOCK;
    distribs[1] = MPI_DISTRIBUTE_BLOCK;
    dargs[0] = MPI_DISTRIBUTE_DFLT_DARG;
    dargs[1] = MPI_DISTRIBUTE_DFLT_DARG;
    psizes[0] = 2;  /* no. of processes in vertical dimension of process grid */
    psizes[1] = 3;  /* no. of processes in horizontal dimension of process grid */
76. Darray Continued

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Type_create_darray(6, rank, 2, gsizes, distribs, dargs,
                           psizes, MPI_ORDER_C, MPI_FLOAT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native",
                      MPI_INFO_NULL);

    local_array_size = num_local_rows * num_local_cols;
    MPI_File_write_all(fh, local_array, local_array_size,
                       MPI_FLOAT, &status);
    MPI_File_close(&fh);
77. A Word of Warning about Darray
- The darray datatype assumes a very specific definition of data distribution -- the exact definition as in HPF
- For example, if the array size is not divisible by the number of processes, darray calculates the block size using a ceiling division (e.g., ceiling(20/6) = 4)
- darray assumes a row-major ordering of processes in the logical grid, as assumed by cartesian process topologies in MPI-1
- If your application uses a different definition for data distribution or logical grid ordering, you cannot use darray. Use subarray instead.
78. Using the Subarray Datatype

    gsizes[0] = m;  /* rows in global array */
    gsizes[1] = n;  /* columns in global array */
    psizes[0] = 2;  /* procs. in vertical dimension */
    psizes[1] = 3;  /* procs. in horizontal dimension */
    lsizes[0] = m/psizes[0];  /* rows in local array */
    lsizes[1] = n/psizes[1];  /* columns in local array */
    dims[0] = 2;  dims[1] = 3;
    periods[0] = periods[1] = 1;

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm);
    MPI_Comm_rank(comm, &rank);
    MPI_Cart_coords(comm, rank, 2, coords);
79. Subarray Datatype II

    /* global indices of the first element of the local array */
    start_indices[0] = coords[0] * lsizes[0];
    start_indices[1] = coords[1] * lsizes[1];

    MPI_Type_create_subarray(2, gsizes, lsizes, start_indices,
                             MPI_ORDER_C, MPI_FLOAT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native",
                      MPI_INFO_NULL);

    local_array_size = lsizes[0] * lsizes[1];
    MPI_File_write_all(fh, local_array, local_array_size,
                       MPI_FLOAT, &status);
80. Local Array with Ghost Area in Memory
- Use a subarray datatype to describe the noncontiguous layout in memory
- Pass this datatype as an argument to MPI_File_write_all
81. Local Array with Ghost Area

    memsizes[0] = lsizes[0] + 8;  /* no. of rows in allocated array */
    memsizes[1] = lsizes[1] + 8;  /* no. of columns in allocated array */
    start_indices[0] = start_indices[1] = 4;
        /* indices of the first element of the local array
           in the allocated array */

    MPI_Type_create_subarray(2, memsizes, lsizes, start_indices,
                             MPI_ORDER_C, MPI_FLOAT, &memtype);
    MPI_Type_commit(&memtype);

    /* create filetype and set the file view exactly as in the
       subarray example */

    MPI_File_write_all(fh, local_array, 1, memtype, &status);
82. Accessing Irregularly Distributed Arrays
- Process 0's map array: 0, 14, 13, 7
- Process 1's map array: 4, 2, 11, 8
- Process 2's map array: 3, 10, 5, 1
- The map array describes the location of each element of the data array in the common file
83. Accessing Irregularly Distributed Arrays

    integer (kind=MPI_OFFSET_KIND) disp

    call MPI_FILE_OPEN(MPI_COMM_WORLD, '/pfs/datafile',
                       MPI_MODE_CREATE + MPI_MODE_RDWR,
                       MPI_INFO_NULL, fh, ierr)
    call MPI_TYPE_CREATE_INDEXED_BLOCK(bufsize, 1, map,
                       MPI_DOUBLE_PRECISION, filetype, ierr)
    call MPI_TYPE_COMMIT(filetype, ierr)
    disp = 0
    call MPI_FILE_SET_VIEW(fh, disp, MPI_DOUBLE_PRECISION,
                       filetype, 'native', MPI_INFO_NULL, ierr)
    call MPI_FILE_WRITE_ALL(fh, buf, bufsize,
                       MPI_DOUBLE_PRECISION, status, ierr)
    call MPI_FILE_CLOSE(fh, ierr)
84. Non-blocking I/O

    MPI_Request request;
    MPI_Status status;

    MPI_File_iwrite_at(fh, offset, buf, count, datatype, &request);

    for (i = 0; i < 1000; i++) {
        /* perform computation */
    }

    MPI_Wait(&request, &status);
85. Split Collective I/O
- A restricted form of non-blocking collective I/O
- Only one active non-blocking collective operation is allowed at a time on a file handle
- Therefore, no request object is necessary

    MPI_File_write_all_begin(fh, buf, count, datatype);

    for (i = 0; i < 1000; i++) {
        /* perform computation */
    }

    MPI_File_write_all_end(fh, buf, &status);