Title: Advanced MPI

1. Advanced MPI
- Daniel S. Katz
- Assistant Director for Cyberinfrastructure Development (CyD), CCT
- Associate Research Professor, ECE
- dsk_at_cct.lsu.edu
- Credits: TACC HPC Course, Introduction to Parallel I/O tutorial (Thakur), MPI Standards
2. Introduction Outline
- Status of MPI-2
- Groups and communication management
- Derived Datatypes
- Point-to-Point blocking/non-blocking communication
- Advanced Collective Communication
- Persistent communication
- Parallel I/O
- One-sided communication
- Parallel I/O examples (from Thakur)
3. MPI-2 Status Assessment
- All vendors have complete MPI-1, and have for 5-10 years
- Free implementations (MPICH, LAM) support heterogeneous workstation networks
- MPI-2 implementations are being undertaken by all vendors
- Fujitsu and NEC have complete MPI-2 implementations
- Other vendors generally have all but dynamic process management
- MPICH-2 is complete
- Open MPI (new MPI from LAM and other MPIs) is becoming complete
4. Communicators and Groups I
- A group is an ordered set of process identifiers (processes)
- Processes are implementation-dependent objects
- Each process in a group is associated with an integer rank
- Ranks are contiguous and start from zero
- Communicators were created to allow groups of processes to act together
- MPI_COMM_WORLD is the communicator that contains all processes; other communicators are subsets
- MPI_GROUP_EMPTY is the predefined group that contains no processes
5. Communicators and Groups II
- Can create new communicators from existing communicators or existing groups
- Both calls below are made by all processes in comm and can return MPI_COMM_NULL
- MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
- Partitions the group associated with comm into disjoint subgroups, one for each value of color (color can be MPI_UNDEFINED)
- Each subgroup contains all processes of the same color
- Within each subgroup, the processes are ranked in the order defined by the value of the argument key, with ties broken according to their rank in the old group
- A new communication domain is created for each subgroup and a handle to the representative communicator is returned in newcomm
- MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)
- Creates a new intracommunicator newcomm with communication group defined by group
6. MPI_Comm_split example
- MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
- [Figure: processes arranged in a 4-row by 7-column grid (rows 0-3, columns 0-6); each row becomes its own communicator]
- Split MPI_COMM_WORLD into one communicator per row (a complete C sketch follows below):

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    row = rank / ncol;
    MPI_Comm_split(MPI_COMM_WORLD, row, rank, &row_comm);
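- A minimal, self-contained C sketch of the row split above; the grid shape (and thus the ncol value) is an illustrative assumption, not part of the slide:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, row, row_rank;
        const int ncol = 7;                 /* assumed number of columns in the grid */
        MPI_Comm row_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        row = rank / ncol;                  /* color: all processes in one row share it */
        MPI_Comm_split(MPI_COMM_WORLD, row, rank, &row_comm);

        MPI_Comm_rank(row_comm, &row_rank); /* rank within the row communicator */
        printf("world rank %d -> row %d, row rank %d\n", rank, row, row_rank);

        MPI_Comm_free(&row_comm);
        MPI_Finalize();
        return 0;
    }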
7. MPI_Comm_create example
- MPI_Comm_group(MPI_Comm comm, MPI_Group *group)
- Returns a handle to the group associated with a communicator
- MPI_Group_union(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup)
- MPI_Group_intersection(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup)
- MPI_Group_difference(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup)
- The 3 above are self-explanatory
- MPI_Group_incl(MPI_Group group, int n, int ranks[], MPI_Group *newgroup)
- New group containing the n elements of group listed in ranks
- MPI_Group_excl(MPI_Group group, int n, int ranks[], MPI_Group *newgroup)
- New group containing all but the n listed elements of group
- More complicated constructors also exist (a usage sketch follows below)
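- A hedged sketch of building a communicator from a subgroup with MPI_Group_incl and MPI_Comm_create; the choice of "the first half of the ranks" is purely illustrative:

    #include <mpi.h>
    #include <stdlib.h>

    /* Create a communicator containing the first half of MPI_COMM_WORLD. */
    void make_half_comm(MPI_Comm *half_comm)
    {
        int size, i;
        int *ranks;
        MPI_Group world_group, half_group;

        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_group(MPI_COMM_WORLD, &world_group);

        ranks = (int *) malloc((size/2) * sizeof(int));
        for (i = 0; i < size/2; i++)
            ranks[i] = i;                       /* ranks to include */

        MPI_Group_incl(world_group, size/2, ranks, &half_group);
        MPI_Comm_create(MPI_COMM_WORLD, half_group, half_comm);
        /* processes not in half_group get MPI_COMM_NULL in *half_comm */

        MPI_Group_free(&half_group);
        MPI_Group_free(&world_group);
        free(ranks);
    }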
8. Freeing Communicators and Groups
- Can destroy communicators and groups, too
- MPI_Comm_free(MPI_Comm *comm)
- MPI_Group_free(MPI_Group *group)
9. Intra- and Intercommunicators, and Dynamic Processes
- So far, just intracommunicators
- MPI also has intercommunicators
- Utility of this was dubious in MPI-1
- In MPI-2, with dynamic processes, this is more useful
- A process may start new processes with MPI_COMM_SPAWN and MPI_COMM_SPAWN_MULTIPLE
- Returns an intercommunicator
- Does not change MPI_COMM_WORLD
- Two independently started MPI applications can establish communications
- MPI_Open_port, MPI_Comm_connect, and MPI_Comm_accept allow two running MPI programs to connect and communicate
- Not intended for client/server applications
- Designed to support HPC applications
- MPI_Comm_join allows the use of a TCP socket to connect two applications
- Details are beyond the scope of this class
- See http://www.mpi-forum.org/docs/mpi-20-html/node88.htm
10. Outline
- Status of MPI-2
- Groups and communication management
- Derived Datatypes
- Point-to-Point blocking/non-blocking communication
- Advanced Collective Communication
- Persistent communication
- Parallel I/O
- One-sided communication
- Parallel I/O examples (from Thakur)
11. Basic Datatypes
12. Derived Datatypes I
- Basic communication calls so far have involved only contiguous buffers containing a sequence of elements of a single type
- Many applications need more
- Could use multiple communication calls
- Could manually pack data into buffers and communicate those
- Could use derived datatypes
- Lets the library optimize how the communication is done
- User tells the library what is desired; the library does it
- Derived datatypes can be created at runtime
- Derived datatypes can be recursive
13. Derived Datatypes II
- Contiguous
- Allows replication of a datatype into contiguous locations
- int MPI_Type_contiguous(in int count, in MPI_Datatype oldtype, out MPI_Datatype *newtype)
- Vector
- Allows replication of a datatype into locations that consist of equally spaced blocks
- Each block is obtained by concatenating the same number of copies of the old datatype
- int MPI_Type_vector(in int count, in int blocklength, in int stride, in MPI_Datatype oldtype, out MPI_Datatype *newtype)
- Hvector
- Identical to vector, but stride is in bytes rather than in number of oldtype elements (a vector sketch follows below)
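- As an illustration of the vector constructor, this hedged sketch sends one column of a row-major N x N matrix as a single message; N is an assumed compile-time size:

    #include <mpi.h>

    #define N 8   /* assumed matrix dimension */

    /* Send column `col` of a row-major N x N matrix of doubles to `dest`. */
    void send_column(double a[N][N], int col, int dest, MPI_Comm comm)
    {
        MPI_Datatype column_t;

        /* N blocks of 1 double, spaced N doubles apart: one matrix column */
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column_t);
        MPI_Type_commit(&column_t);

        MPI_Send(&a[0][col], 1, column_t, dest, 0, comm);

        MPI_Type_free(&column_t);
    }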
14. Derived Datatypes III
- Indexed
- Allows replication of an old datatype into a sequence of blocks (each block is a concatenation of the old datatype), where each block can contain a different number of copies and have a different displacement (a small sketch follows below)
- int MPI_Type_indexed(in int count, in int array_of_blocklengths[], in int array_of_displacements[], in MPI_Datatype oldtype, out MPI_Datatype *newtype)
- Hindexed
- Identical to indexed, but displacements are in bytes rather than in number of oldtype extents
- int MPI_Type_hindexed(in int count, in int array_of_blocklengths[], in MPI_Aint array_of_displacements[], in MPI_Datatype oldtype, out MPI_Datatype *newtype)
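- A small hedged sketch of the indexed constructor: a datatype covering the upper triangle of a row-major matrix; the 4 x 4 size is an illustrative assumption:

    #include <mpi.h>

    /* Datatype covering the upper triangle (including the diagonal)
       of a row-major 4 x 4 matrix of doubles. */
    void make_upper_triangle_type(MPI_Datatype *upper)
    {
        int blocklens[4], displs[4], i;

        for (i = 0; i < 4; i++) {
            blocklens[i] = 4 - i;      /* row i keeps 4-i elements */
            displs[i]    = i * 4 + i;  /* starting at the diagonal element */
        }
        MPI_Type_indexed(4, blocklens, displs, MPI_DOUBLE, upper);
        MPI_Type_commit(upper);
    }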
15. Derived Datatypes IV
- Struct
- Most general
- Generalizes hindexed with an array of types
- int MPI_Type_struct(in int count, in int array_of_blocklengths[], in MPI_Aint array_of_displacements[], in MPI_Datatype array_of_types[], out MPI_Datatype *newtype)
- Struct example: count = 2, array_of_blocklengths = {3, 1}, array_of_displacements = {0, 12} (in bytes), array_of_types = {MPI_FLOAT, MPI_INT} (a full sketch of this example follows below)
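- The struct example above (3 floats at displacement 0 and 1 int at displacement 12) written out as a hedged sketch; MPI_Type_create_struct is the MPI-2 spelling of the constructor the slide calls MPI_Type_struct, and the 12-byte displacement assumes a platform where sizeof(float) == 4:

    #include <mpi.h>

    /* Matches the slide's example: blocklengths {3,1}, displacements {0,12}
       bytes, types {MPI_FLOAT, MPI_INT} -- e.g. struct { float f[3]; int n; } */
    void make_struct_type(MPI_Datatype *newtype)
    {
        int          blocklens[2] = {3, 1};
        MPI_Aint     displs[2]    = {0, 12};
        MPI_Datatype types[2]     = {MPI_FLOAT, MPI_INT};

        MPI_Type_create_struct(2, blocklens, displs, types, newtype);
        MPI_Type_commit(newtype);
    }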
16. Outline
- Status of MPI-2
- Groups and communication management
- Derived Datatypes
- Point-to-Point blocking/non-blocking communication
- Advanced Collective Communication
- Persistent communication
- Parallel I/O
- One-sided communication
- Parallel I/O examples (from Thakur)
17. Point-to-point Communications
- MPI_Send and MPI_Recv are blocking calls
- MPI_Send does not return until the message data and envelope have been safely stored away, so that the sender is free to access and overwrite the send buffer
- The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer
18. Buffering and Communication Modes
- Message buffering decouples the send and receive operations
- A blocking send can complete as soon as the message is buffered, even if no matching receive has been executed by the receiver
- On the other hand, message buffering can be expensive, as it entails additional memory-to-memory copying, and it requires the allocation of memory for buffering
- There are 4 communication modes in MPI
- Standard (no prefix), buffered (B prefix), synchronous (S prefix), ready (R prefix)
- MPI_Send(), MPI_Bsend(), MPI_Ssend(), MPI_Rsend()
19. Standard Communication Mode
- MPI offers the choice of several communication modes that allow control of the communication protocol choice
- MPI_Send uses the standard communication mode
- It is up to MPI to decide whether outgoing messages will be buffered
- MPI may buffer outgoing messages
- The send call may then complete before a matching receive is invoked
- On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages for performance reasons
- In that case, the send call will not complete until a matching receive has been posted and the data has been moved to the receiver
- Thus, a send in standard mode can be started whether or not a matching receive has been posted
- It may complete before a matching receive is posted
- The standard mode send is non-local: successful completion of the send operation may depend on the occurrence of a matching receive
20. Buffered Communication Mode
- A buffered mode send operation can be started whether or not a matching receive has been posted
- It may complete before a matching receive is posted
- However, unlike the standard send, this operation is local: its completion does not depend on the occurrence of a matching receive
- Thus, if a send is executed and no matching receive is posted, MPI must buffer the outgoing message so as to allow the send call to complete
- An error will occur if there is insufficient buffer space
- The amount of available buffer space is controlled by the user
- Buffer allocation by the user may be required for the buffered mode to be effective (a sketch follows below)
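- A hedged sketch of attaching a user buffer before using MPI_Bsend; the message size and destination rank are illustrative assumptions:

    #include <mpi.h>
    #include <stdlib.h>

    void buffered_send_example(int dest, MPI_Comm comm)
    {
        double work[1000];                       /* data to send (contents assumed) */
        int    bufsize;
        char  *mpibuf;

        /* Space for one message plus MPI's per-message overhead */
        MPI_Pack_size(1000, MPI_DOUBLE, comm, &bufsize);
        bufsize += MPI_BSEND_OVERHEAD;
        mpibuf = (char *) malloc(bufsize);

        MPI_Buffer_attach(mpibuf, bufsize);      /* make the buffer available to MPI */
        MPI_Bsend(work, 1000, MPI_DOUBLE, dest, 0, comm);  /* completes locally */
        MPI_Buffer_detach(&mpibuf, &bufsize);    /* blocks until buffered data is sent */
        free(mpibuf);
    }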
21. Synchronous Communication Mode
- A send that uses the synchronous mode can be started whether or not a matching receive was posted
- However, the send will complete successfully only if a matching receive is posted and the receive operation has started to receive the message sent by the synchronous send
- Thus, the completion of a synchronous send not only indicates that the send buffer can be reused, but also indicates that the receiver has reached a certain point in its execution, namely that it has started executing the matching receive
- If both sends and receives are blocking operations, then the use of the synchronous mode provides synchronous communication semantics: a communication does not complete at either end before both processes rendezvous at the communication
- A send executed in this mode is non-local
22. Ready Communication Mode
- A send that uses the ready communication mode may be started only if the matching receive is already posted
- Otherwise, the operation is erroneous and its outcome is undefined
- On some systems, this allows the removal of a hand-shake operation that is otherwise required, and results in improved performance
- The completion of the send operation does not depend on the status of a matching receive, and merely indicates that the send buffer can be reused
- A send operation that uses the ready mode has the same semantics as a standard send operation or a synchronous send operation; it is merely that the sender provides additional information to the system (namely that a matching receive is already posted) that can save some overhead
- In a correct program, therefore, a ready send could be replaced by a standard send with no effect on the behavior of the program other than performance
23. Blocking Receive
- There is only one receive operation, which can match any of the send modes
- The receive operation just described is blocking: it returns only after the receive buffer contains the newly received message
- A receive can complete before the matching send has completed (of course, it can complete only after the matching send has started)
24. Communication Modes Summary
- Ready mode has the least total overhead, but requires the receive to be posted before the send
- Can post receive, synchronize, then post send
- Synchronous mode is most portable, but can be slow
- Does not depend on order (ready) or buffers (buffered)
- Buffered mode doesn't depend on order (ready) and has no synchronization delays (synchronous), but buffering can add overhead and the user may need to control buffers
- Standard mode is implementation dependent
- Often, small messages are buffered and large messages are sent synchronously
25. Non-blocking Communication I
- Allows overlap of computation and communication
- In theory, at least, and sometimes in practice
- A non-blocking send start call initiates the send operation, but does not complete it
- The call returns before the message is copied out of the send buffer
- A separate send complete call is needed to complete the communication, i.e., to verify that the data has been copied out of the send buffer
- With suitable hardware, the transfer of data out of the sender's memory may proceed concurrently with computations done at the sender after the send was initiated and before it completed
26. Non-blocking Communication II
- Similarly, a non-blocking receive start call initiates the receive operation, but does not complete it
- The call returns before the message is stored in the receive buffer
- A separate receive complete call is needed to complete the receive operation and verify that the data has been received into the receive buffer
- With suitable hardware, the transfer of data into the receiver's memory may proceed concurrently with computations done after the receive was initiated and before it completed
- The use of non-blocking receives may also avoid system buffering and memory-to-memory copying, as information is provided early on the location of the receive buffer
27. Non-blocking Communication III
- Non-blocking send start calls can use the same four modes as blocking sends: standard, buffered, synchronous, and ready
- These carry the same meaning
- Sends of all modes, ready excepted, can be started whether a matching receive has been posted or not; a non-blocking ready send can be started only if a matching receive is posted
- In all cases, the send start call is local: it returns immediately, irrespective of the status of other processes
- The send-complete call returns when data has been copied out of the send buffer
- Non-blocking sends can be matched with blocking receives, and vice versa
28. Non-blocking Communication IV
- Syntax: add an I, get back a request handle
- int MPI_Send / MPI_Bsend / MPI_Ssend / MPI_Rsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
- int MPI_Isend / MPI_Ibsend / MPI_Issend / MPI_Irsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
- int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
- int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
- Blocking communication completion
- int MPI_Wait(MPI_Request *request, MPI_Status *status)
- Non-blocking communication completion
- int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status) (flag indicates whether the request has completed)
- Request freeing (don't care when it completes)
- int MPI_Request_free(MPI_Request *request)
- There are also calls for multiple completion/free (a C sketch follows below)
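- A minimal C sketch of the non-blocking pattern, exchanging one double with a neighbor whose rank is an assumed parameter:

    #include <mpi.h>

    /* Exchange one double with `neighbor` while leaving room to compute
       between posting and completing the communication. */
    void exchange(double *sendval, double *recvval, int neighbor, MPI_Comm comm)
    {
        MPI_Request reqs[2];
        MPI_Status  stats[2];

        MPI_Irecv(recvval, 1, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
        MPI_Isend(sendval, 1, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

        /* ... computation that does not touch sendval/recvval ... */

        MPI_Waitall(2, reqs, stats);   /* complete both operations */
    }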
29. Non-blocking Communication Examples

Exchange loop, freeing the send requests with MPI_REQUEST_FREE (ranks 0 and 1):

    IF (rank.EQ.0) THEN
        DO i=1, n
            CALL MPI_ISEND(outval, 1, MPI_REAL, 1, 0, comm, req, ierr)
            CALL MPI_REQUEST_FREE(req, ierr)
            CALL MPI_IRECV(inval, 1, MPI_REAL, 1, 0, comm, req, ierr)
            CALL MPI_WAIT(req, status, ierr)
        END DO
    ELSE    ! rank.EQ.1
        CALL MPI_IRECV(inval, 1, MPI_REAL, 0, 0, comm, req, ierr)
        CALL MPI_WAIT(req, status, ierr)
        DO i=1, n-1
            CALL MPI_ISEND(outval, 1, MPI_REAL, 0, 0, comm, req, ierr)
            CALL MPI_REQUEST_FREE(req, ierr)
            CALL MPI_IRECV(inval, 1, MPI_REAL, 0, 0, comm, req, ierr)
            CALL MPI_WAIT(req, status, ierr)
        END DO
        CALL MPI_ISEND(outval, 1, MPI_REAL, 0, 0, comm, req, ierr)
        CALL MPI_WAIT(req, status, ierr)
    END IF

Overlapping communication with computation:

    IF (rank.EQ.0) THEN
        CALL MPI_ISEND(a(1), 10, MPI_REAL, 1, tag, comm, request, ierr)
        ! ... compute ...
        CALL MPI_WAIT(request, status, ierr)
    ELSE    ! rank.EQ.1
        CALL MPI_IRECV(a(1), 10, MPI_REAL, 0, tag, comm, request, ierr)
        ! ... compute ...
        CALL MPI_WAIT(request, status, ierr)
    END IF
30. Non-blocking Communication Examples

    IF (rank.EQ.0) THEN
        CALL MPI_SEND(i, 1, MPI_INTEGER, 2, 0, comm, ierr)
    ELSE IF (rank.EQ.1) THEN
        CALL MPI_SEND(x, 1, MPI_REAL, 2, 0, comm, ierr)
    ELSE    ! rank.EQ.2
        DO i=1, 2
            CALL MPI_PROBE(MPI_ANY_SOURCE, 0, comm, status, ierr)
            IF (status(MPI_SOURCE) .EQ. 0) THEN
                ! unsafe: CALL MPI_RECV(i, 1, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, status2, ierr)
                CALL MPI_RECV(i, 1, MPI_INTEGER, status(MPI_SOURCE), status(MPI_TAG), comm, status2, ierr)
            ELSE
                ! unsafe: CALL MPI_RECV(x, 1, MPI_REAL, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, status2, ierr)
                CALL MPI_RECV(x, 1, MPI_REAL, status(MPI_SOURCE), status(MPI_TAG), comm, status2, ierr)
            END IF
        END DO
    END IF

(Receiving with MPI_ANY_SOURCE after the probe is unsafe, since a message from the other sender could be matched instead; using the source and tag returned in status guarantees that the probed message is the one received.)
31. Outline
- Status of MPI-2
- Groups and communication management
- Derived Datatypes
- Point-to-Point blocking/non-blocking communication
- Advanced Collective Communication
- Persistent communication
- Parallel I/O
- One-sided communication
- Parallel I/O examples (from Thakur)
32. Advanced Collective Communication I
- [Figure: MPI_Alltoall among processes P0-P3. Before: each process holds elements A, B, C, D destined for processes 0-3. After: process i holds element i from every process, i.e., the data is transposed across processes.]
33. Advanced Collective Communication II
- What if the amount of data to be gathered from each process to the root is not the same?
- Use Gatherv instead of Gather (or Scatterv/Scatter, Allgatherv/Allgather)
- v is for varying
- MPI_Gather
- int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
- All n processes (including root) effectively call
- MPI_Send(sendbuf, sendcount, sendtype, root, ...)
- Root effectively calls (for i = 0 to n-1)
- MPI_Recv(recvbuf + i*recvcount*extent(recvtype), recvcount, recvtype, i, ...)
- MPI_Gatherv
- int MPI_Gatherv(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int *recvcounts, int *displs, MPI_Datatype recvtype, int root, MPI_Comm comm)
- All n processes (including root) effectively call
- MPI_Send(sendbuf, sendcount, sendtype, root, ...)
- Root effectively calls (for i = 0 to n-1)
- MPI_Recv(recvbuf + displs[i]*extent(recvtype), recvcounts[i], recvtype, i, ...)
- (A Gatherv sketch follows below)
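- A hedged MPI_Gatherv sketch in which process i contributes i+1 integers; the contribution sizes are illustrative assumptions:

    #include <mpi.h>
    #include <stdlib.h>

    void gather_varying(MPI_Comm comm)
    {
        int rank, size, i, root = 0;
        int sendcount, *sendbuf, *recvbuf = NULL;
        int *recvcounts = NULL, *displs = NULL;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        sendcount = rank + 1;                      /* process i sends i+1 ints */
        sendbuf = (int *) malloc(sendcount * sizeof(int));
        for (i = 0; i < sendcount; i++)
            sendbuf[i] = rank;

        if (rank == root) {
            int total = size * (size + 1) / 2;
            recvcounts = (int *) malloc(size * sizeof(int));
            displs     = (int *) malloc(size * sizeof(int));
            for (i = 0; i < size; i++) {
                recvcounts[i] = i + 1;
                displs[i]     = (i == 0) ? 0 : displs[i-1] + recvcounts[i-1];
            }
            recvbuf = (int *) malloc(total * sizeof(int));
        }

        MPI_Gatherv(sendbuf, sendcount, MPI_INT,
                    recvbuf, recvcounts, displs, MPI_INT, root, comm);

        free(sendbuf);
        if (rank == root) { free(recvbuf); free(recvcounts); free(displs); }
    }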
34. Outline
- Status of MPI-2
- Groups and communication management
- Derived Datatypes
- Point-to-Point blocking/non-blocking communication
- Advanced Collective Communication
- Persistent communication
- Parallel I/O
- One-sided communication
- Parallel I/O examples (from Thakur)
35. Persistent Communication I
- Often, a communication with the same argument list is repeatedly executed within the inner loop of a parallel computation
- In such a situation, it may be possible to optimize the communication by binding the list of communication arguments to a persistent communication request once and then repeatedly using the request to initiate and complete messages
- The persistent request thus created can be thought of as a communication port or a half-channel
- It does not provide the full functionality of a conventional channel, since there is no binding of the send port to the receive port
- This construct allows reduction of the overhead for communication between the process and the communication controller, but not of the overhead for communication between one communication controller and another
- It is not necessary that messages sent with a persistent request be received by a receive operation using a persistent request, or vice versa
36. Persistent Communication II
- Create a persistent communication request before the loop starts
- MPI_Send_init / MPI_Bsend_init / MPI_Ssend_init / MPI_Rsend_init(buffer, ..., &request)
- MPI_Recv_init(buffer, ..., &request)
- Start the request(s) inside each iteration of the loop
- MPI_Start(&request)
- MPI_Startall(count, array_of_requests)
- Complete the request(s) inside each iteration of the loop
- MPI_Wait(), MPI_Waitall(), MPI_Waitany(), MPI_Waitsome(), MPI_Test(), MPI_Testany(), MPI_Testsome()
- Free the request(s) after the loop completes
- MPI_Request_free(&request)
- (A sketch of this pattern follows below)
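- A hedged C sketch of the persistent pattern described above; the neighbor rank and iteration count are illustrative assumptions:

    #include <mpi.h>

    void persistent_loop(double *sendbuf, double *recvbuf, int n,
                         int neighbor, int niters, MPI_Comm comm)
    {
        MPI_Request reqs[2];
        MPI_Status  stats[2];
        int iter;

        /* Bind the argument lists once, before the loop */
        MPI_Send_init(sendbuf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
        MPI_Recv_init(recvbuf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

        for (iter = 0; iter < niters; iter++) {
            MPI_Startall(2, reqs);           /* initiate send and receive */
            /* ... computation that does not touch the buffers ... */
            MPI_Waitall(2, reqs, stats);     /* complete both */
        }

        /* Free the persistent requests after the loop */
        MPI_Request_free(&reqs[0]);
        MPI_Request_free(&reqs[1]);
    }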
37. Outline
- Status of MPI-2
- Groups and communication management
- Derived Datatypes
- Point-to-Point blocking/non-blocking communication
- Advanced Collective Communication
- Persistent communication
- Parallel I/O
- One-sided communication
- Parallel I/O examples (from Thakur)
38. I/O in Parallel Applications
- Old ways to do I/O
- Process 0 does all the I/O to a single file and broadcasts/scatters/gathers to/from the other processes
- All processes do their own I/O to separate files
- All tasks read from the same file
- All tasks write to the same file, using seeks to get to the right place
- One task at a time appends to a single file, using barriers to prevent overlapping writes
39. I/O in Parallel Applications
- New way is to use a parallel I/O library, such as MPI I/O
- Multiple tasks can simultaneously read or write a single file (possibly on a parallel file system) using the MPI I/O API
- A parallel file system usually looks like a single file system, but has multiple I/O servers to permit high bandwidth from multiple processes
- MPI I/O is part of MPI-2
- Allows single or collective operations to/from contiguous or non-contiguous regions/data using MPI datatypes, including derived datatypes, blocking or non-blocking
- Sound familiar? Writing is like sending a message, and reading is like receiving one
40. Levels of Parallel I/O
- Example problem: distributed array access
- [Figure: a large array distributed among 16 processes (P0-P15); each square represents a subarray in the memory of a single process. The resulting access pattern in the file interleaves the rows belonging to different processes.]
41. Level-0 Access
- Each process makes one independent read request for each row in the local array (as in Unix)

    MPI_File_open(..., file, ..., &fh);
    for (i = 0; i < n_local_rows; i++) {
        MPI_File_seek(fh, ...);
        MPI_File_read(fh, &(A[i][0]), ...);
    }
    MPI_File_close(&fh);
42. Level-1 Access
- Similar to level 0, but each process uses collective I/O functions

    MPI_File_open(MPI_COMM_WORLD, file, ..., &fh);
    for (i = 0; i < n_local_rows; i++) {
        MPI_File_seek(fh, ...);
        MPI_File_read_all(fh, &(A[i][0]), ...);
    }
    MPI_File_close(&fh);
43. Level-2 Access
- Each process creates a derived datatype to describe the noncontiguous access pattern, defines a file view, and calls independent I/O functions

    MPI_Type_create_subarray(..., &subarray, ...);
    MPI_Type_commit(&subarray);
    MPI_File_open(..., file, ..., &fh);
    MPI_File_set_view(fh, ..., subarray, ...);
    MPI_File_read(fh, A, ...);
    MPI_File_close(&fh);
44. Level-3 Access
- Similar to level 2, except that each process uses collective I/O functions

    MPI_Type_create_subarray(..., &subarray, ...);
    MPI_Type_commit(&subarray);
    MPI_File_open(MPI_COMM_WORLD, file, ..., &fh);
    MPI_File_set_view(fh, ..., subarray, ...);
    MPI_File_read_all(fh, A, ...);
    MPI_File_close(&fh);
45. The Four Access Levels
- [Figure: comparison of the four access levels, level 0 through level 3.]
46. Why Higher-level Access is Good
- Given complete access information, an implementation can perform optimizations such as
- Data sieving: read large chunks and extract what is really needed
- Collective I/O: merge requests of different processes into larger requests
- Improved prefetching and caching
47. Some Quick Details
- Collective operation on a filename
- MPI_File_open()
- Single-process operation on a filename
- MPI_File_delete()
- Collective operations on a file handle
- MPI_File_open(), MPI_File_close(), MPI_File_set_size(), MPI_File_preallocate()
- Single-process operations on a file handle
- MPI_File_get_size(), MPI_File_get_amode(), MPI_File_get_group() (Note: returns a duplicate of the communicator's group)
48. Hints
- Allow a user to provide info (e.g., file access patterns, file system specifics) for optimization
- Optional - may be ignored by the implementation
- Can be provided for
- MPI_File_open(), MPI_File_delete(), MPI_File_set_info(), MPI_File_get_info()
- Example hints
- access_style: read_once, write_once, read_mostly, write_mostly, sequential, reverse_sequential, random
- collective_buffering: true, false
- num_io_nodes (integer - number of I/O nodes)
- striping_factor (integer - number of devices)
- striping_unit (integer - number of bytes)
- Context sensitive, implementation-dependent (a sketch of passing hints follows below)
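- A hedged sketch of passing hints through an MPI_Info object at open time; the particular hint values are illustrative and may simply be ignored by an implementation:

    #include <mpi.h>

    void open_with_hints(const char *filename, MPI_File *fh)
    {
        MPI_Info info;

        MPI_Info_create(&info);
        /* hint keys from the slide; values are assumptions for illustration */
        MPI_Info_set(info, "access_style", "write_mostly");
        MPI_Info_set(info, "collective_buffering", "true");
        MPI_Info_set(info, "striping_factor", "4");

        MPI_File_open(MPI_COMM_WORLD, (char *) filename,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);

        MPI_Info_free(&info);   /* the file keeps its own copy of the hints */
    }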
49. File Views
- MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char *datarep, MPI_Info info)
- Changes the process's view of the data in the file
- disp: start of the view; etype: type of data; filetype: distribution of data to processes; datarep: representation of data in the file
- native: data stored as in memory; internal: data stored in the library's choice of format; external32: data stored in a portable format
- Resets the individual file pointers and the shared file pointer to zero
- Collective
- datarep and the extent of etype in datarep must be identical on all processes in the group
- disp, filetype, and info may vary
- Datatypes passed in etype and filetype must be committed
- Should be called immediately after open, and perhaps at other times, too
50. Data Access I
- Three orthogonal aspects to data access
- Positioning
- Explicit offset vs. implicit file pointer
- Synchronism
- Blocking vs. non-blocking and split collective
- Coordination
- Noncollective vs. collective
- Two types of file pointers
- Individual, shared
51. Data Access Routines
52. Positioning Routines
53. Parallel I/O notes
- Non-collective, non-blocking calls return an MPI_Request
- Must wait on or otherwise free the request at some point
- Only one collective, split-phase call per file handle at a time
- Therefore, no request is used/needed
- POSIX file consistency
- When a write returns, the data is immediately visible to other processes
- Atomicity: if two writes occur simultaneously on overlapping areas in a file, the data stored will be from one or the other, not a combination
- MPI I/O file consistency
- Default semantics are weaker than POSIX, for optimization
- Can get close to POSIX semantics by setting atomicity to TRUE
- Otherwise, to read data written by another process, you need to call MPI_File_sync or close and reopen the file (a sketch follows below)
- See examples at the end of this section from Thakur's tutorial
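- A hedged sketch of the two consistency options mentioned above: either set atomicity to true, or bracket the write/read pair with MPI_File_sync and a barrier; the ranks involved are illustrative assumptions:

    #include <mpi.h>

    /* Rank 0 writes, rank 1 then reads the same region of an already-open file. */
    void write_then_read(MPI_File fh, int rank, int *buf, int count)
    {
        MPI_Status status;

        /* Option 1: request POSIX-like semantics (may cost performance) */
        /* MPI_File_set_atomicity(fh, 1); */

        if (rank == 0)
            MPI_File_write_at(fh, 0, buf, count, MPI_INT, &status);

        /* Option 2 (default, weaker semantics): sync - barrier - sync */
        MPI_File_sync(fh);                  /* push rank 0's data to the file   */
        MPI_Barrier(MPI_COMM_WORLD);        /* order the write before the read  */
        MPI_File_sync(fh);                  /* make the data visible to readers */

        if (rank == 1)
            MPI_File_read_at(fh, 0, buf, count, MPI_INT, &status);
    }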
54. Outline
- Status of MPI-2
- Groups and communication management
- Derived Datatypes
- Point-to-Point blocking/non-blocking communication
- Advanced Collective Communication
- Persistent communication
- Parallel I/O
- One-sided communication
- Parallel I/O examples (from Thakur)
55. One-sided Operations: Issues
- Balancing efficiency and portability across a wide class of architectures
- shared-memory multiprocessors
- NUMA architectures
- distributed-memory MPPs, clusters
- workstation networks
- Retaining the look and feel of MPI-1
- Dealing with subtle memory behavior issues: cache coherence, sequential consistency
- Synchronization is separate from data movement
56. Remote Memory Access Windows and Window Objects
- [Figure: four processes (0-3), each exposing a window within its address space; together the windows form a window object.]
57. One-sided Communication Calls
- MPI_Put - stores into remote memory
- MPI_Get - reads from remote memory
- MPI_Accumulate - updates remote memory
- All are non-blocking: the data transfer is described, maybe even initiated, but may continue after the call returns
- Subsequent synchronization on the window object is needed to ensure the operations are complete, e.g., MPI_Win_fence (a sketch follows below)
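- A hedged C sketch of the put/fence pattern; the window size and target rank are illustrative assumptions:

    #include <mpi.h>

    #define WIN_LEN 100   /* assumed number of doubles exposed in the window */

    void put_into_neighbor(double *local, int target, MPI_Comm comm)
    {
        double  winbuf[WIN_LEN];         /* memory exposed to other processes */
        MPI_Win win;

        MPI_Win_create(winbuf, WIN_LEN * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, comm, &win);

        MPI_Win_fence(0, win);           /* open the access epoch */
        MPI_Put(local, WIN_LEN, MPI_DOUBLE,       /* origin buffer            */
                target, 0, WIN_LEN, MPI_DOUBLE,   /* target rank, disp, count */
                win);
        MPI_Win_fence(0, win);           /* close the epoch: puts are complete */

        MPI_Win_free(&win);
    }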
58. Outline
- Status of MPI-2
- Groups and communication management
- Derived Datatypes
- Point-to-Point blocking/non-blocking communication
- Advanced Collective Communication
- Persistent communication
- Parallel I/O
- One-sided communication
- Parallel I/O examples (from Thakur)
59. Using MPI for Simple I/O
- Each process needs to read a chunk of data from a common file
60. Using Individual File Pointers

    MPI_File fh;
    MPI_Status status;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    bufsize = FILESIZE/nprocs;
    nints = bufsize/sizeof(int);

    MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_seek(fh, rank * bufsize, MPI_SEEK_SET);
    MPI_File_read(fh, buf, nints, MPI_INT, &status);
    MPI_File_close(&fh);
61. Using Explicit Offsets

    include 'mpif.h'
    integer status(MPI_STATUS_SIZE)
    integer (kind=MPI_OFFSET_KIND) offset
C   in F77, see implementation notes (might be integer*8)

    call MPI_FILE_OPEN(MPI_COMM_WORLD, '/pfs/datafile',
                       MPI_MODE_RDONLY, MPI_INFO_NULL, fh, ierr)
    nints = FILESIZE / (nprocs*INTSIZE)
    offset = rank * nints * INTSIZE
    call MPI_FILE_READ_AT(fh, offset, buf, nints, MPI_INTEGER,
                          status, ierr)
    call MPI_GET_COUNT(status, MPI_INTEGER, count, ierr)
    print *, 'process ', rank, 'read ', count, 'integers'
    call MPI_FILE_CLOSE(fh, ierr)
62. Writing to a File
- Use MPI_File_write or MPI_File_write_at
- Use MPI_MODE_WRONLY or MPI_MODE_RDWR as the flags to MPI_File_open
- If the file doesn't already exist, the flag MPI_MODE_CREATE must also be passed to MPI_File_open
- We can pass multiple flags by using bitwise-or (|) in C, or addition (+) in Fortran
63. Using File Views
- Processes write to a shared file
- MPI_File_set_view assigns regions of the file to separate processes
64. File Views
- Specified by a triplet (displacement, etype, and filetype) passed to MPI_File_set_view
- displacement: number of bytes to be skipped from the start of the file
- etype: basic unit of data access (can be any basic or derived datatype)
- filetype: specifies which portion of the file is visible to the process
65. File View Example

    MPI_File file;
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &file);
    MPI_File_set_view(file, myrank * BUFSIZE * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_write(file, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&file);
66. Other Ways to Write to a Shared File
- MPI_File_seek: like Unix seek
- MPI_File_read_at / MPI_File_write_at: combine seek and I/O for thread safety
- MPI_File_read_shared / MPI_File_write_shared: use the shared file pointer
- Collective operations
67. Noncontiguous Accesses
- Common in parallel applications
- Example: distributed arrays stored in files
- A big advantage of MPI I/O over Unix I/O is the ability to specify noncontiguous accesses in memory and file within a single function call by using derived datatypes
- Allows the implementation to optimize the access
- Collective I/O combined with noncontiguous accesses yields the highest performance
68. Example: Distributed Array Access
- [Figure: a 2D array distributed among four processes (P0-P3), and the file containing the global array in row-major order.]
69. A Simple File View Example
- [Figure: etype = MPI_INT; a displacement is skipped from the head of the file, then the filetype pattern repeats along the file, and so on.]
70. File View Code

    MPI_Aint lb, extent;
    MPI_Datatype etype, filetype, contig;
    MPI_Offset disp;

    MPI_Type_contiguous(2, MPI_INT, &contig);
    lb = 0;
    extent = 6 * sizeof(int);
    MPI_Type_create_resized(contig, lb, extent, &filetype);
    MPI_Type_commit(&filetype);
    disp = 5 * sizeof(int);
    etype = MPI_INT;

    MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, disp, etype, filetype, "native",
                      MPI_INFO_NULL);
    MPI_File_write(fh, buf, 1000, MPI_INT, MPI_STATUS_IGNORE);
71. Collective I/O in MPI
- A critical optimization in parallel I/O
- Allows communication of the big picture to the file system
- Framework for two-phase I/O, in which communication precedes I/O (can use MPI machinery)
- Basic idea: build large blocks, so that reads/writes in the I/O system will be large
- [Figure: many small individual requests are merged into one large collective access.]
72. Collective I/O
- MPI_File_read_all, MPI_File_read_at_all, etc.
- _all indicates that all processes in the group specified by the communicator passed to MPI_File_open will call this function
- Each process specifies only its own access information -- the argument list is the same as for the non-collective functions
73. Collective I/O
- By calling the collective I/O functions, the user allows an implementation to optimize the request based on the combined request of all processes
- The implementation can merge the requests of different processes and service the merged request efficiently
- Particularly effective when the accesses of different processes are noncontiguous and interleaved
74. Accessing Arrays Stored in Files
75. Using the Distributed Array (Darray) Datatype

    int gsizes[2], distribs[2], dargs[2], psizes[2];

    gsizes[0] = m;  /* no. of rows in global array */
    gsizes[1] = n;  /* no. of columns in global array */
    distribs[0] = MPI_DISTRIBUTE_BLOCK;
    distribs[1] = MPI_DISTRIBUTE_BLOCK;
    dargs[0] = MPI_DISTRIBUTE_DFLT_DARG;
    dargs[1] = MPI_DISTRIBUTE_DFLT_DARG;
    psizes[0] = 2;  /* no. of processes in vertical dimension of process grid */
    psizes[1] = 3;  /* no. of processes in horizontal dimension of process grid */
76. Darray Continued

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Type_create_darray(6, rank, 2, gsizes, distribs, dargs,
                           psizes, MPI_ORDER_C, MPI_FLOAT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native",
                      MPI_INFO_NULL);

    local_array_size = num_local_rows * num_local_cols;
    MPI_File_write_all(fh, local_array, local_array_size,
                       MPI_FLOAT, &status);
    MPI_File_close(&fh);
77. A Word of Warning about Darray
- The darray datatype assumes a very specific definition of data distribution -- the exact definition as in HPF
- For example, if the array size is not divisible by the number of processes, darray calculates the block size using a ceiling division (e.g., ceiling(20/6) = 4)
- darray assumes a row-major ordering of processes in the logical grid, as assumed by cartesian process topologies in MPI-1
- If your application uses a different definition for data distribution or logical grid ordering, you cannot use darray. Use subarray instead.
78. Using the Subarray Datatype

    gsizes[0] = m;  /* rows in global array */
    gsizes[1] = n;  /* columns in global array */
    psizes[0] = 2;  /* procs. in vertical dimension */
    psizes[1] = 3;  /* procs. in horizontal dimension */
    lsizes[0] = m/psizes[0];  /* rows in local array */
    lsizes[1] = n/psizes[1];  /* columns in local array */
    dims[0] = 2;  dims[1] = 3;
    periods[0] = periods[1] = 1;

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm);
    MPI_Comm_rank(comm, &rank);
    MPI_Cart_coords(comm, rank, 2, coords);
79. Subarray Datatype II

    /* global indices of the first element of the local array */
    start_indices[0] = coords[0] * lsizes[0];
    start_indices[1] = coords[1] * lsizes[1];

    MPI_Type_create_subarray(2, gsizes, lsizes, start_indices,
                             MPI_ORDER_C, MPI_FLOAT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native",
                      MPI_INFO_NULL);

    local_array_size = lsizes[0] * lsizes[1];
    MPI_File_write_all(fh, local_array, local_array_size,
                       MPI_FLOAT, &status);
80. Local Array with Ghost Area in Memory
- Use a subarray datatype to describe the noncontiguous layout in memory
- Pass this datatype as an argument to MPI_File_write_all
81. Local Array with Ghost Area

    memsizes[0] = lsizes[0] + 8;  /* no. of rows in allocated array */
    memsizes[1] = lsizes[1] + 8;  /* no. of columns in allocated array */
    start_indices[0] = start_indices[1] = 4;
        /* indices of the first element of the local array
           in the allocated array */

    MPI_Type_create_subarray(2, memsizes, lsizes, start_indices,
                             MPI_ORDER_C, MPI_FLOAT, &memtype);
    MPI_Type_commit(&memtype);

    /* create filetype and set the file view exactly as in the
       subarray example */

    MPI_File_write_all(fh, local_array, 1, memtype, &status);
82. Accessing Irregularly Distributed Arrays
- Process 0's map array: 0, 14, 13, 7
- Process 1's map array: 4, 2, 11, 8
- Process 2's map array: 3, 10, 5, 1
- The map array describes the location of each element of the data array in the common file
83. Accessing Irregularly Distributed Arrays

    integer (kind=MPI_OFFSET_KIND) disp

    call MPI_FILE_OPEN(MPI_COMM_WORLD, '/pfs/datafile',
                       MPI_MODE_CREATE + MPI_MODE_RDWR,
                       MPI_INFO_NULL, fh, ierr)
    call MPI_TYPE_CREATE_INDEXED_BLOCK(bufsize, 1, map,
                       MPI_DOUBLE_PRECISION, filetype, ierr)
    call MPI_TYPE_COMMIT(filetype, ierr)
    disp = 0
    call MPI_FILE_SET_VIEW(fh, disp, MPI_DOUBLE_PRECISION,
                       filetype, 'native', MPI_INFO_NULL, ierr)
    call MPI_FILE_WRITE_ALL(fh, buf, bufsize,
                       MPI_DOUBLE_PRECISION, status, ierr)
    call MPI_FILE_CLOSE(fh, ierr)
84. Non-blocking I/O

    MPI_Request request;
    MPI_Status status;

    MPI_File_iwrite_at(fh, offset, buf, count, datatype, &request);

    for (i = 0; i < 1000; i++) {
        /* perform computation */
    }

    MPI_Wait(&request, &status);
85. Split Collective I/O
- A restricted form of non-blocking collective I/O
- Only one active non-blocking collective operation is allowed at a time on a file handle
- Therefore, no request object is necessary

    MPI_File_write_all_begin(fh, buf, count, datatype);

    for (i = 0; i < 1000; i++) {
        /* perform computation */
    }

    MPI_File_write_all_end(fh, buf, &status);