Transcript and Presenter's Notes

Title: Advanced MPI


1
Advanced MPI
  • Daniel S. Katz
  • Assistant Director for Cyberinfrastructure
    Development (CyD), CCT
  • Associate Research Professor, ECE
  • dsk@cct.lsu.edu
  • Credits: TACC HPC Course, Introduction to
    Parallel I/O tutorial (Thakur), MPI Standards

2
Introduction Outline
  • Status of MPI-2
  • Groups and communication management
  • Derived Datatypes
  • Point-to-Point blocking/non-blocking
    communication
  • Advanced Collective Communication
  • Persistent communication
  • Parallel I/O
  • One-sided communication
  • Parallel I/O examples (from Thakur)

3
MPI-2 Status Assessment
  • All vendors have had complete MPI-1
    implementations for 5-10 years
  • Free implementations (MPICH, LAM) support
    heterogeneous workstation networks
  • MPI-2 implementations are being undertaken by all
    vendors
  • Fujitsu, NEC have complete MPI-2 implementations
  • Other vendors generally have all but dynamic
    process management
  • MPICH-2 is complete
  • Open MPI (a new implementation derived from LAM
    and other MPI projects) is becoming complete

4
Communicators and Groups I
  • A group is an ordered set of process identifiers
    (processes)
  • Processes are implementation-dependent objects
  • Each process in a group is associated with an
    integer rank
  • Ranks are contiguous and start from zero
  • Communicators were created to allow groups of
    processes to act together
  • MPI_COMM_WORLD is the communicator that contains
    all processes; other communicators contain subsets
  • MPI_COMM_NULL is the null (invalid) communicator;
    the predefined group MPI_GROUP_EMPTY contains no
    processes

5
Communicators and Groups II
  • Can create new communicators from existing
    communicators or existing groups
  • Both constructors below are called by all
    processes in comm, and can return MPI_COMM_NULL
  • MPI_Comm_split(MPI_Comm comm, int color, int key,
    MPI_Comm *newcomm)
  • Partitions the group associated with comm into
    disjoint subgroups, one for each value of color
    (color can be MPI_UNDEFINED)
  • Each subgroup contains all processes of the same
    color
  • Within each subgroup, the processes are ranked in
    the order defined by the value of the argument
    key, with ties broken according to their rank in
    the old group
  • A new communication domain is created for each
    subgroup and a handle to the representative
    communicator is returned in newcomm
  • MPI_Comm_create(MPI_Comm comm, MPI_Group group,
    MPI_Comm *newcomm)
  • creates a new intracommunicator newcomm with
    communication group defined by group

6
MPI_Comm_split example
  • MPI_Comm_split(MPI_Comm comm, int color, int key,
    MPI_Comm *newcomm)

(Figure: the processes of MPI_COMM_WORLD arranged in a
2-D grid, with row indices 0-3 and column indices shown
across the top; the split below puts each row into its
own communicator.)

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
row = rank / ncol;      /* integer division */
MPI_Comm_split(MPI_COMM_WORLD, row, rank, &row_comm);
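
The following is a minimal, self-contained sketch of
this kind of split (not from the original slides); the
number of columns ncol is an arbitrary illustrative
value, and each process reports its rank within its new
row communicator.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, row, row_rank, ncol = 4;   /* ncol chosen for illustration */
    MPI_Comm row_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    row = rank / ncol;   /* color: all ranks in the same row share it */
    MPI_Comm_split(MPI_COMM_WORLD, row, rank, &row_comm);

    MPI_Comm_rank(row_comm, &row_rank);
    printf("world rank %d -> row %d, row rank %d\n", rank, row, row_rank);

    MPI_Comm_free(&row_comm);
    MPI_Finalize();
    return 0;
}
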
7
MPI_Comm_create example
  • MPI_Comm_group(MPI_Comm comm, MPI_Group *group)
  • Returns a handle to the group associated with a
    communicator
  • MPI_Group_union(MPI_Group group1, MPI_Group
    group2, MPI_Group *newgroup)
  • MPI_Group_intersection(MPI_Group group1,
    MPI_Group group2, MPI_Group *newgroup)
  • MPI_Group_difference(MPI_Group group1, MPI_Group
    group2, MPI_Group *newgroup)
  • The 3 above are self-explanatory
  • MPI_Group_incl(MPI_Group group, int n, int
    *ranks, MPI_Group *newgroup)
  • New group containing the n listed elements of group
  • MPI_Group_excl(MPI_Group group, int n, int
    *ranks, MPI_Group *newgroup)
  • New group containing all but the n listed elements
    of group
  • More complicated constructors also exist (see the
    sketch below for a typical incl/create sequence)
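
As a hedged illustration of the constructors above (not
from the original slides), the sketch below builds a
communicator containing only the even-numbered ranks of
MPI_COMM_WORLD; the choice of even ranks is arbitrary.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int size, rank, n, i, *even_ranks;
    MPI_Group world_group, even_group;
    MPI_Comm even_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    n = (size + 1) / 2;                              /* number of even ranks */
    even_ranks = (int *) malloc(n * sizeof(int));
    for (i = 0; i < n; i++) even_ranks[i] = 2 * i;

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);    /* group of all processes */
    MPI_Group_incl(world_group, n, even_ranks, &even_group);
    MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);
    /* even_comm is MPI_COMM_NULL on the odd-ranked processes */

    if (even_comm != MPI_COMM_NULL) MPI_Comm_free(&even_comm);
    MPI_Group_free(&even_group);
    MPI_Group_free(&world_group);
    free(even_ranks);
    MPI_Finalize();
    return 0;
}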

8
Freeing Communicators and Groups
  • Can destroy communicators and groups, too
  • MPI_Comm_free(MPI_Comm *comm)
  • MPI_Group_free(MPI_Group *group)

9
Intra- and Intercommunicators, and Dynamic Processes
  • So far, just intracommunicators
  • MPI also has intercommunicators
  • Utility of this was dubious in MPI-1
  • In MPI-2, with dynamic processes, this is more
    useful
  • A process may start new processes with
    MPI_COMM_SPAWN and MPI_COMM_SPAWN_MULTIPLE
  • Returns an intercommunicator
  • Does not change MPI_COMM_WORLD
  • Two independently started MPI applications can
    establish communications
  • MPI_Open_port, MPI_Comm_connect, MPI_Comm_accept
    allow two running MPI programs to connect and
    communicate
  • Not intended for client/server applications
  • Designed to support HPC applications
  • MPI_Join allows the use of a TCP socket to
    connect two applications
  • Details are beyond the scope of this class
  • see http://www.mpi-forum.org/docs/mpi-20-html/node88.htm

10
Outline
  • Status of MPI-2
  • Groups and communication management
  • Derived Datatypes
  • Point-to-Point blocking/non-blocking
    communication
  • Advanced Collective Communication
  • Persistent communication
  • Parallel I/O
  • One-sided communication
  • Parallel I/O examples (from Thakur)

11
Basic Datatypes
12
Derived Datatypes I
  • Basic communication calls so far have involved
    only contiguous buffers with a sequence of
    elements of a single type
  • Many applications need more
  • Could use multiple communication calls
  • Could manually pack data into buffers and
    communicate those
  • Could use derived datatypes
  • Lets the library optimize how the communication
    is done
  • User tells library what is desired, library does
    it
  • Derived datatypes can be created at runtime
  • Derived datatypes can be recursive

13
Derived Datatypes II
  • Contiguous
  • Allows replication of a datatype into contiguous
    locations
  • int MPI_Type_contiguous(in int count,
    in MPI_Datatype oldtype, out MPI_Datatype
    *newtype)
  • Vector
  • Allows replication of a datatype into locations
    that consist of equally spaced blocks
  • Each block is obtained by concatenating the same
    number of copies of the old datatype
  • int MPI_Type_vector(in int count, in int
    blocklength, in int stride, in MPI_Datatype
    oldtype, out MPI_Datatype *newtype)
  • Hvector
  • Identical to vector, but stride is in bytes,
    rather than in number of oldtype elements
    (a vector sketch follows below)
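
A small MPI_Type_vector sketch (illustrative, not from
the slides): the derived type below picks out one
column of a row-major 4x6 matrix of doubles, and the
process sends that column to itself with MPI_Sendrecv
just to show the type in use; the matrix shape and the
self-send are arbitrary choices.

#include <mpi.h>

int main(int argc, char *argv[])
{
    double a[4][6], col[4];
    int rank, i, j;
    MPI_Datatype column_t;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 4; i++)
        for (j = 0; j < 6; j++)
            a[i][j] = 10.0 * i + j;

    /* 4 blocks of 1 double, stride 6 doubles = one matrix column */
    MPI_Type_vector(4, 1, 6, MPI_DOUBLE, &column_t);
    MPI_Type_commit(&column_t);

    /* send column 2 to ourselves; receive it as 4 contiguous doubles */
    MPI_Sendrecv(&a[0][2], 1, column_t, rank, 0,
                 col, 4, MPI_DOUBLE, rank, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column_t);
    MPI_Finalize();
    return 0;
}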

14
Derived Datatypes III
  • Indexed
  • Allows replication of an old datatype into a
    sequence of blocks (each block is a concatenation
    of the old datatype), where each block can
    contain a different number of copies and have a
    different displacement
  • int MPI_Type_indexed(in int count,
    in int *array_of_blocklengths,
    in int *array_of_displacements,
    in MPI_Datatype oldtype, out MPI_Datatype
    *newtype)
  • Hindexed
  • Identical to indexed, but displacements are in
    bytes, rather than in number of oldtype extents
  • int MPI_Type_hindexed(in int count,
    in int *array_of_blocklengths,
    in MPI_Aint *array_of_displacements,
    in MPI_Datatype oldtype, out MPI_Datatype
    *newtype)

15
Derived Datatypes IV
  • Struct
  • Most general
  • Generalizes hindexed with array of types
  • int MPI_Type_struct(in int count,
    in int *array_of_blocklengths,
    in MPI_Aint *array_of_displacements,
    in MPI_Datatype *array_of_types,
    out MPI_Datatype *newtype)


Struct example: count = 2, array_of_blocklengths = {3, 1},
array_of_displacements = {0, 12} (in bytes),
array_of_types = {MPI_FLOAT, MPI_INT}
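
The struct example above, expressed as a hedged code
sketch: the C struct, its name, and the use of offsetof
are illustrative additions, and the code uses
MPI_Type_create_struct (the MPI-2 equivalent of
MPI_Type_struct, with the same argument list). On
common systems with 4-byte floats the displacements
come out to {0, 12}, as on the slide.

#include <mpi.h>
#include <stddef.h>

typedef struct { float f[3]; int i; } particle_t;   /* illustrative layout */

int main(int argc, char *argv[])
{
    int          blocklens[2] = {3, 1};
    MPI_Aint     displs[2]    = {offsetof(particle_t, f),
                                 offsetof(particle_t, i)};
    MPI_Datatype types[2]     = {MPI_FLOAT, MPI_INT};
    MPI_Datatype particle_type;
    particle_t   p = {{1.0f, 2.0f, 3.0f}, 4};

    MPI_Init(&argc, &argv);

    MPI_Type_create_struct(2, blocklens, displs, types, &particle_type);
    MPI_Type_commit(&particle_type);

    /* broadcast one particle_t from rank 0 using the derived type */
    MPI_Bcast(&p, 1, particle_type, 0, MPI_COMM_WORLD);

    MPI_Type_free(&particle_type);
    MPI_Finalize();
    return 0;
}
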
16
Outline
  • Status of MPI-2
  • Groups and communication management
  • Derived Datatypes
  • Point-to-Point blocking/non-blocking
    communication
  • Advanced Collective Communication
  • Persistent communication
  • Parallel I/O
  • One-sided communication
  • Parallel I/O examples (from Thakur)

17
Point-to-point Communications
  • MPI_Send and MPI_Recv are blocking calls
  • MPI_Send does not return until the message data
    and envelope have been safely stored away so that
    the sender is free to access and overwrite the
    send buffer
  • The message might be copied directly into the
    matching receive buffer, or it might be copied
    into a temporary system buffer

18
Buffering Communication Modes
  • Message buffering decouples the send and receive
    operations
  • A blocking send can complete as soon as the
    message was buffered, even if no matching receive
    has been executed by the receiver
  • On the other hand, message buffering can be
    expensive, as it entails additional
    memory-to-memory copying, and it requires the
    allocation of memory for buffering
  • There are 4 communication modes in MPI
  • Standard (no prefix), buffered (B prefix),
    synchronous (S prefix), ready (R prefix)
  • MPI_Send(), MPI_Bsend(), MPI_Ssend(), MPI_Rsend()

19
Standard Communication Mode
  • MPI offers the choice of several communication
    modes that allow control of the communication
    protocol choice
  • MPI_Send uses the standard communication mode
  • It is up to MPI to decide if outgoing messages
    will be buffered
  • MPI may buffer outgoing messages
  • The send call may complete before a matching
    receive is invoked
  • On the other hand, buffer space may be
    unavailable, or MPI may not buffer outgoing
    messages, for performance reasons
  • The send call will not complete until a matching
    receive has been posted, and the data has been
    moved to the receiver
  • Thus, a send in standard mode can be started
    whether or not a matching receive has been posted
  • It may complete before a matching receive is
    posted
  • The standard mode send is non-local: successful
    completion of the send operation may depend on
    the occurrence of a matching receive

20
Buffered Communication Mode
  • A buffered mode send operation can be started
    whether or not a matching receive has been posted
  • It may complete before a matching receive is
    posted
  • However, unlike the standard send, this operation
    is local, and its completion does not depend on
    the occurrence of a matching receive
  • Thus, if a send is executed and no matching
    receive is posted, then MPI must buffer the
    outgoing message, so as to allow the send call to
    complete
  • An error will occur if there is insufficient
    buffer space
  • The amount of available buffer space is
    controlled by the user
  • Buffer allocation by the user (via
    MPI_Buffer_attach) may be required for the
    buffered mode to be effective (see the sketch
    below)
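
A minimal buffered-mode sketch (illustrative, not from
the slides): the user attaches a buffer sized for the
message plus MPI_BSEND_OVERHEAD, sends with MPI_Bsend,
and detaches the buffer when done.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int   rank, size, data = 42, bufsize;
    void *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
    buf = malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);

    if (rank == 0 && size > 1)
        MPI_Bsend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* completes locally */
    else if (rank == 1)
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* detach waits until buffered messages have been transmitted */
    MPI_Buffer_detach(&buf, &bufsize);
    free(buf);
    MPI_Finalize();
    return 0;
}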

21
Synchronous Communication Mode
  • A send that uses the synchronous mode can be
    started whether or not a matching receive was
    posted
  • However, the send will complete successfully only
    if a matching receive is posted, and the receive
    operation has started to receive the message sent
    by the synchronous send
  • Thus, the completion of a synchronous send not
    only indicates that the send buffer can be
    reused, but also indicates that the receiver has
    reached a certain point in its execution, namely
    that it has started executing the matching
    receive
  • If both sends and receives are blocking
    operations, then the use of the synchronous mode
    provides synchronous communication semantics: a
    communication does not complete at either end
    before both processes rendezvous at the
    communication
  • A send executed in this mode is non-local

22
Ready Communication Mode
  • A send that uses the ready communication mode may
    be started only if the matching receive is
    already posted
  • Otherwise, the operation is erroneous and its
    outcome is undefined
  • On some systems, this allows the removal of a
    hand-shake operation that is otherwise required
    and results in improved performance
  • The completion of the send operation does not
    depend on the status of a matching receive, and
    merely indicates that the send buffer can be
    reused
  • A send operation that uses the ready mode has the
    same semantics as a standard send operation, or a
    synchronous send operation; it is merely that the
    sender provides additional information to the
    system (namely, that a matching receive is already
    posted), which can save some overhead
  • In a correct program, therefore, a ready send
    could be replaced by a standard send with no
    effect on the behavior of the program other than
    performance

23
Blocking Receive
  • There is only one receive operation, which can
    match any of the send modes
  • The receive operation just described is blocking:
    it returns only after the receive buffer contains
    the newly received message
  • A receive can complete before the matching send
    has completed (of course, it can complete only
    after the matching send has started).

24
Communication Modes Summary
  • Ready mode has least total overhead, but requires
    receive to be posted before send
  • Can post receive, synchronize, then post send
  • Synchronous mode is most portable, but can be
    slow
  • Does not depend on order (ready) or buffers
    (buffered)
  • Buffered mode doesn't depend on order (ready) and
    has no synchronization delays (synchronous), but
    buffering can add overhead and the user may need
    to control buffers
  • Standard mode is implementation dependent
  • Often, small messages are buffered and large
    messages are sent synchronously

25
Non-blocking Communication I
  • Allows overlap of computation and communication
  • In theory, at least, and sometimes, in practice
  • A non-blocking send start call initiates the send
    operation, but does not complete it
  • Call returns before message is copied out of send
    buffer
  • A separate send complete call is needed to
    complete the communication, i.e., to verify that
    the data has been copied out of the send buffer
  • With suitable hardware, the transfer of data out
    of the sender memory may proceed concurrently
    with computations done at the sender after the
    send was initiated and before it completed

26
Non-blocking Communication II
  • Similarly, a non-blocking receive start call
    initiates the receive operation, but does not
    complete it
  • Call returns before message is stored in receive
    buffer
  • A separate receive complete call is needed to
    complete the receive operation and verify that
    the data has been received into the receive
    buffer
  • With suitable hardware, the transfer of data into
    the receiver memory may proceed concurrently with
    computations done after the receive was initiated
    and before it completed
  • The use of non-blocking receives may also avoid
    system buffering and memory-to-memory copying, as
    information is provided early on the location of
    the receive buffer

27
Non-blocking Communication III
  • Non-blocking send start calls can use the same
    four modes as blocking sends: standard, buffered,
    synchronous, and ready
  • These carry the same meaning
  • Sends of all modes, ready excepted, can be
    started whether a matching receive has been
    posted or not; a non-blocking ready send can be
    started only if a matching receive is posted
  • In all cases, the send start call is local: it
    returns immediately, irrespective of the status
    of other processes
  • The send-complete call returns when data has been
    copied out of the send buffer
  • non-blocking sends can be matched with blocking
    receives, and vice-versa

28
Non-blocking Communication IV
  • Syntax - add an I, get back a request handle
  • int MPI_Send / MPI_{B,S,R}send(void *buf, int
    count, MPI_Datatype datatype, int dest, int tag,
    MPI_Comm comm)
  • int MPI_Isend / MPI_I{b,s,r}send(void *buf, int
    count, MPI_Datatype datatype, int dest, int tag,
    MPI_Comm comm, MPI_Request *request)
  • int MPI_Recv(void *buf, int count, MPI_Datatype
    datatype, int source, int tag, MPI_Comm comm,
    MPI_Status *status)
  • int MPI_Irecv(void *buf, int count, MPI_Datatype
    datatype, int source, int tag, MPI_Comm comm,
    MPI_Request *request)
  • Blocking communication completion
  • int MPI_Wait(MPI_Request *request, MPI_Status
    *status)
  • Non-blocking communication completion
  • int MPI_Test(MPI_Request *request, int *flag,
    MPI_Status *status) (flag indicates whether the
    request has completed)
  • Request freeing (don't care when it completes)
  • int MPI_Request_free(MPI_Request *request)
  • There are also calls for multiple completion/free

29
Non-blocking Communication Examples
Example 1: a loop in which the sender frees each send
request with MPI_REQUEST_FREE

IF (rank.EQ.0) THEN
    DO i=1, n
      CALL MPI_ISEND(outval, 1, MPI_REAL, 1, 0, comm, req, ierr)
      CALL MPI_REQUEST_FREE(req, ierr)
      CALL MPI_IRECV(inval, 1, MPI_REAL, 1, 0, comm, req, ierr)
      CALL MPI_WAIT(req, status, ierr)
    END DO
ELSE    ! rank.EQ.1
    CALL MPI_IRECV(inval, 1, MPI_REAL, 0, 0, comm, req, ierr)
    CALL MPI_WAIT(req, status, ierr)
    DO i=1, n-1
      CALL MPI_ISEND(outval, 1, MPI_REAL, 0, 0, comm, req, ierr)
      CALL MPI_REQUEST_FREE(req, ierr)
      CALL MPI_IRECV(inval, 1, MPI_REAL, 0, 0, comm, req, ierr)
      CALL MPI_WAIT(req, status, ierr)
    END DO
    CALL MPI_ISEND(outval, 1, MPI_REAL, 0, 0, comm, req, ierr)
    CALL MPI_WAIT(req, status, ierr)
END IF

Example 2: overlapping communication with computation

IF (rank.EQ.0) THEN
    CALL MPI_ISEND(a(1), 10, MPI_REAL, 1, tag, comm, request, ierr)
    ! ... compute ...
    CALL MPI_WAIT(request, status, ierr)
ELSE    ! rank.EQ.1
    CALL MPI_IRECV(a(1), 10, MPI_REAL, 0, tag, comm, request, ierr)
    ! ... compute ...
    CALL MPI_WAIT(request, status, ierr)
END IF

30
Non-blocking Communication Examples
IF (rank.EQ.0) THEN
    CALL MPI_SEND(i, 1, MPI_INTEGER, 2, 0, comm, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_SEND(x, 1, MPI_REAL, 2, 0, comm, ierr)
ELSE    ! rank.EQ.2
    DO i=1, 2
      CALL MPI_PROBE(MPI_ANY_SOURCE, 0, comm, status, ierr)
      IF (status(MPI_SOURCE) .EQ. 0) THEN
        ! unsafe variant: a receive with MPI_ANY_SOURCE, MPI_ANY_TAG
        ! may not match the message that was just probed
        ! CALL MPI_RECV(i, 1, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, status2, ierr)
        CALL MPI_RECV(i, 1, MPI_INTEGER, status(MPI_SOURCE), status(MPI_TAG), comm, status2, ierr)
      ELSE
        ! CALL MPI_RECV(x, 1, MPI_REAL, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, status2, ierr)
        CALL MPI_RECV(x, 1, MPI_REAL, status(MPI_SOURCE), status(MPI_TAG), comm, status2, ierr)
      END IF
    END DO
END IF
31
Outline
  • Status of MPI-2
  • Groups and communication management
  • Derived Datatypes
  • Point-to-Point blocking/non-blocking
    communication
  • Advanced Collective Communication
  • Persistent communication
  • Parallel I/O
  • One-sided communication
  • Parallel I/O examples (from Thakur)

32
Advanced Collective Communication I
(Figure: collective data movement among processes
P0-P3, including a rooted collective and MPI_Alltoall:
before the Alltoall each process holds blocks A, B, C,
D; afterwards process j holds the j-th block from every
process.)
33
Advanced Collective Communication II
  • What if the amount of data to be gathered from
    each process to the root is not the same?
  • Use Gatherv instead of Gather (or Scatterv/Scatter,
    Allgatherv/Allgather) - see the sketch below
  • v is for varying
  • MPI_Gather
  • int MPI_Gather(void *sendbuf, int sendcount,
    MPI_Datatype sendtype, void *recvbuf, int
    recvcount, MPI_Datatype recvtype, int root,
    MPI_Comm comm)
  • All n processes (including root) call
  • MPI_Send(sendbuf, sendcount, sendtype, root, ...)
  • Root calls (for i = 0 to n-1)
  • MPI_Recv(recvbuf + i*recvcount*extent(recvtype),
    recvcount, recvtype, i, ...)
  • MPI_Gatherv
  • int MPI_Gatherv(void *sendbuf, int sendcount,
    MPI_Datatype sendtype, void *recvbuf, int
    *recvcounts, int *displs, MPI_Datatype recvtype,
    int root, MPI_Comm comm)
  • All n processes (including root) call
  • MPI_Send(sendbuf, sendcount, sendtype, root, ...)
  • Root calls (for i = 0 to n-1)
  • MPI_Recv(recvbuf + displs[i]*extent(recvtype),
    recvcounts[i], recvtype, i, ...)
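
A hedged MPI_Gatherv sketch (not from the slides):
process i contributes i+1 integers, and rank 0 builds
the matching recvcounts and displs arrays; the data
values are arbitrary.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, nprocs, i, sendcount, total = 0;
    int *sendbuf, *recvbuf = NULL, *recvcounts = NULL, *displs = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    sendcount = rank + 1;                      /* varying contribution */
    sendbuf = (int *) malloc(sendcount * sizeof(int));
    for (i = 0; i < sendcount; i++) sendbuf[i] = rank;

    if (rank == 0) {
        recvcounts = (int *) malloc(nprocs * sizeof(int));
        displs     = (int *) malloc(nprocs * sizeof(int));
        for (i = 0; i < nprocs; i++) {
            recvcounts[i] = i + 1;
            displs[i]     = total;             /* running sum of counts */
            total        += recvcounts[i];
        }
        recvbuf = (int *) malloc(total * sizeof(int));
    }

    MPI_Gatherv(sendbuf, sendcount, MPI_INT,
                recvbuf, recvcounts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    free(sendbuf);
    if (rank == 0) { free(recvbuf); free(recvcounts); free(displs); }
    MPI_Finalize();
    return 0;
}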

34
Outline
  • Status of MPI-2
  • Groups and communication management
  • Derived Datatypes
  • Point-to-Point blocking/non-blocking
    communication
  • Advanced Collective Communication
  • Persistent communication
  • Parallel I/O
  • One-sided communication
  • Parallel I/O examples (from Thakur)

35
Persistent Communication I
  • Often, a communication with the same argument
    list is repeatedly executed within the inner loop
    of a parallel computation
  • In such a situation, it may be possible to
    optimize the communication by binding the list of
    communication arguments to a persistent
    communication request once and, then, repeatedly
    using the request to initiate and complete
    messages
  • The persistent request thus created can be
    thought of as a communication port or a
    half-channel
  • It does not provide the full functionality of a
    conventional channel, since there is no binding
    of the send port to the receive port
  • This construct allows reduction of the overhead
    for communication between the process and
    communication controller, but not of the overhead
    for communication between one communication
    controller and another
  • It is not necessary that messages sent with a
    persistent request be received by a receive
    operation using a persistent request, or vice
    versa

36
Persistent Communication II
  • Create a persistent communication request before
    the loop starts
  • MPI_{S,Bs,Ss,Rs}end_init(buffer, ..., &request)
  • MPI_Recv_init(buffer, ..., &request)
  • Start the request(s) inside each iteration of the
    loop
  • MPI_Start(&request)
  • MPI_Startall(count, array_of_requests)
  • Complete the request(s) inside each iteration of
    the loop
  • MPI_Wait(), MPI_Waitall(), MPI_Waitany(),
    MPI_Waitsome(), MPI_Test(), MPI_Testany(),
    MPI_Testsome()
  • Free the request(s) after the loop completes
  • MPI_Request_free(&request) (see the sketch below)
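
A sketch of this pattern (illustrative, not from the
slides): the ring neighbors, buffer size, and iteration
count are arbitrary choices.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int    rank, nprocs, left, right, iter;
    double sendbuf[100], recvbuf[100];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    right = (rank + 1) % nprocs;
    left  = (rank - 1 + nprocs) % nprocs;

    /* bind the argument lists once, before the loop */
    MPI_Send_init(sendbuf, 100, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(recvbuf, 100, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[1]);

    for (iter = 0; iter < 10; iter++) {
        /* ... fill sendbuf ... */
        MPI_Startall(2, reqs);                       /* start both requests */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* complete them */
        /* ... use recvbuf ... */
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}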

37
Outline
  • Status of MPI-2
  • Groups and communication management
  • Derived Datatypes
  • Point-to-Point blocking/non-blocking
    communication
  • Advanced Collective Communication
  • Persistent communication
  • Parallel I/O
  • One-sided communication
  • Parallel I/O examples (from Thakur)

38
I/O in Parallel Applications
  • Old ways to do I/O
  • Process 0 does all the I/O to a single file and
    broadcasts/scatters/gathers to/from other
    processes
  • All processes do their own I/O to separate files
  • All tasks read from same file
  • All tasks write to same file, using seeks to get
    to right place
  • One task at a time appends to a single file,
    using barriers to prevent overlapping writes.

39
I/O in Parallel Applications
  • New way is to use parallel I/O library, such as
    MPI I/O
  • Multiple tasks can simultaneously read or write
    to a single file (possibly on a parallel file
    system) using the MPI I/O API
  • A parallel file system usually looks like a
    single file system, but has multiple I/O servers
    to permit high bandwidth from multiple processes
  • MPI I/O is part of MPI-2
  • Allows single or collective operations on
    contiguous or non-contiguous regions/data using
    MPI datatypes, including derived datatypes,
    blocking or non-blocking
  • Sound familiar? Writing is like sending a message,
    reading is like receiving one

40
Levels of Parallel I/O
  • Example problem: Distributed Array Access

(Figure: a large array distributed among 16 processes,
P0-P15; each square represents the subarray in the
memory of a single process. The access pattern in the
file interleaves blocks belonging to different
processes, so each process's data is noncontiguous in
the file.)
41
Level-0 Access
  • Each process makes one independent read request
    for each row in the local array (as in Unix)
MPI_File_open(..., file, ..., &fh)
for (i=0; i<n_local_rows; i++) {
    MPI_File_seek(fh, ...)
    MPI_File_read(fh, &(A[i][0]), ...)
}
MPI_File_close(&fh)

42
Level-1 Access
  • Similar to level 0, but each process uses
    collective I/O functions
MPI_File_open(MPI_COMM_WORLD, file, ..., &fh)
for (i=0; i<n_local_rows; i++) {
    MPI_File_seek(fh, ...)
    MPI_File_read_all(fh, &(A[i][0]), ...)
}
MPI_File_close(&fh)

43
Level-2 Access
  • Each process creates a derived datatype to
    describe the noncontiguous access pattern,
    defines a file view, and calls independent I/O
    functions
MPI_Type_create_subarray(..., &subarray, ...)
MPI_Type_commit(&subarray)
MPI_File_open(..., file, ..., &fh)
MPI_File_set_view(fh, ..., subarray, ...)
MPI_File_read(fh, A, ...)
MPI_File_close(&fh)

44
Level-3 Access
  • Similar to level-2, except that each process uses
    collective I/O functions
MPI_Type_create_subarray(..., &subarray, ...)
MPI_Type_commit(&subarray)
MPI_File_open(MPI_COMM_WORLD, file, ..., &fh)
MPI_File_set_view(fh, ..., subarray, ...)
MPI_File_read_all(fh, A, ...)
MPI_File_close(&fh)

45
The Four Access Levels
(Figure: the four access levels, Level 0 through
Level 3; higher levels pass more information, such as
noncontiguous file views and collective calls, to the
I/O library in fewer, larger requests.)
46
Why Higher-level Access is Good
  • Given complete access information, an
    implementation can perform optimizations such as
  • Data sieving: read large chunks and extract what
    is really needed
  • Collective I/O: merge requests of different
    processes into larger requests
  • Improved prefetching and caching

47
Some Quick Details
  • Collective operation on a filename
  • MPI_File_open()
  • Single processor operation on a filename
  • MPI_File_delete()
  • Collective operations on a file handle
  • MPI_File_open(), MPI_File_close(),
    MPI_File_set_size(), MPI_File_preallocate()
  • Single processor operation on a file handle
  • MPI_File_get_size(), MPI_File_get_amode(),
    MPI_File_get_group() (note: returns a duplicate of
    the communicator's group, which the user must free)

48
Hints
  • Allow a user to provide info (e.g., file access
    patterns, file system specifics) for optimization
  • Optional - may be ignored by implementation
  • Can be provided for
  • MPI_File_open(), MPI_File_delete(),
    MPI_File_set_info(), MPI_File_get_info()
  • Example hints
  • access_style: read_once, write_once, read_mostly,
    write_mostly, sequential, reverse_sequential,
    random
  • collective_buffering: true, false
  • num_io_nodes (integer - number of I/O nodes)
  • striping_factor (integer - number of devices)
  • striping_unit (integer - number of bytes)
  • Context sensitive, implementation-dependent
    (see the MPI_Info sketch below)
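
A sketch of passing hints at open time through an
MPI_Info object (illustrative; the hint values are
arbitrary, and an implementation may ignore any of
them).

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Info info;
    MPI_File fh;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "access_style", "read_once");
    MPI_Info_set(info, "collective_buffering", "true");
    MPI_Info_set(info, "striping_factor", "4");

    MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
                  MPI_MODE_RDONLY, info, &fh);
    MPI_Info_free(&info);        /* the file keeps its own copy of the hints */

    /* ... reads ... */

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}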

49
File Views
  • MPI_File_set_view(MPI_File fh, MPI_Offset disp,
    MPI_Datatype etype, MPI_Datatype filetype, char
    *datarep, MPI_Info info)
  • Changes the process's view of the data in the
    file
  • disp - start of the view; etype - type of data;
    filetype - distribution of data to processes;
    datarep - representation of data in the file
  • native - data stored as in memory; internal -
    data stored in the library's choice of format;
    external32 - data stored in a portable format
  • Resets individual file pointers and shared file
    pointer to zero
  • Collective
  • datarep and extent of etype in datarep must be
    identical on all processes in the group
  • disp, filetype, and info may vary
  • datatypes passed in etype and filetype must be
    committed
  • Should call immediately after open, and perhaps
    other times, too

50
Data Access I
  • Three orthogonal aspects to data access
  • Positioning
  • Explicit offset vs. implicit file pointer
  • Synchronism
  • blocking vs. non-blocking and split collective
  • Coordination
  • noncollective vs. collective
  • Two types of file pointers
  • Individual, shared

51
Data Access Routines
52
Positioning Routines
53
Parallel I/O notes
  • Non-collective, non-blocking calls return an
    MPI_Request
  • Must wait or otherwise free request at some point
  • Only one collective, split-phase call per file
    handle at a time
  • Therefore, no request is used/needed
  • POSIX file consistency
  • When a write returns, data is immediately visible
    to other processes
  • Atomicity: if two writes occur simultaneously on
    overlapping areas in a file, the data stored will
    be from one or the other, not a combination
  • MPI I/O file consistency
  • Default semantics are weaker than POSIX, for
    optimization
  • Can get close to POSIX semantics by setting
    atomicity to TRUE
  • Otherwise, to read data written by another
    process, you need to call MPI_File_sync or close
    and reopen the file
  • See examples at the end of this section from
    Thakur's tutorial, and the sync/barrier/sync
    sketch below
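
A sketch of the read-after-write pattern under the
default (non-atomic) semantics (illustrative; the file
name, offsets, and choice of ranks are arbitrary): the
writer syncs, all processes synchronize, then every
process syncs again before the reader reads.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int      rank, val = 0;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    if (rank == 0) {
        val = 7;
        MPI_File_write_at(fh, 0, &val, 1, MPI_INT, MPI_STATUS_IGNORE);
    }

    MPI_File_sync(fh);               /* push written data to the file */
    MPI_Barrier(MPI_COMM_WORLD);     /* order the write before the read */
    MPI_File_sync(fh);               /* make it visible to other processes */

    if (rank == 1)
        MPI_File_read_at(fh, 0, &val, 1, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}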

54
Outline
  • Status of MPI-2
  • Groups and communication management
  • Derived Datatypes
  • Point-to-Point blocking/non-blocking
    communication
  • Advanced Collective Communication
  • Persistent communication
  • Parallel I/O
  • One-sided communication
  • Parallel I/O examples (from Thakur)

55
One-sided Operations Issues
  • Balancing efficiency and portability across a
    wide class of architectures
  • shared-memory multiprocessors
  • NUMA architectures
  • distributed-memory MPPs, clusters
  • Workstation networks
  • Retaining look and feel of MPI-1
  • Dealing with subtle memory behavior issues:
    cache coherence, sequential consistency
  • Synchronization is separate from data movement

56
Remote Memory Access Windows and Window Objects
(Figure: remote memory access windows. Each of
processes 0-3 exposes a region of its address space as
a window; the collection of these windows across the
processes forms the window object.)
57
One-sided Communication Calls
  • MPI_Put - stores into remote memory
  • MPI_Get - reads from remote memory
  • MPI_Accumulate - updates remote memory
  • All are non-blocking: data transfer is
    described, maybe even initiated, but may
    continue after the call returns
  • Subsequent synchronization on the window object is
    needed to ensure operations are complete, e.g.,
    MPI_Win_fence (see the sketch below)
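
A minimal one-sided sketch (illustrative, not from the
slides): each process exposes one int in a window, and
rank 0 stores a value into rank 1's window inside a
fence epoch.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int     rank, nprocs, local = -1, value = 99;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* each process exposes one int, with a displacement unit of sizeof(int) */
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open the access epoch */
    if (rank == 0 && nprocs > 1)
        MPI_Put(&value, 1, MPI_INT, 1 /* target rank */,
                0 /* target displacement */, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                 /* complete all RMA operations */

    /* on rank 1, local is now 99 */
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}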

58
Outline
  • Status of MPI-2
  • Groups and communication management
  • Derived Datatypes
  • Point-to-Point blocking/non-blocking
    communication
  • Advanced Collective Communication
  • Persistent communication
  • Parallel I/O
  • One-sided communication
  • Parallel I/O examples (from Thakur)

59
Using MPI for Simple I/O
Each process needs to read a chunk of data from a
common file
60
Using Individual File Pointers
MPI_File fh;
MPI_Status status;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
bufsize = FILESIZE/nprocs;
nints = bufsize/sizeof(int);

MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
              MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
MPI_File_seek(fh, rank * bufsize, MPI_SEEK_SET);
MPI_File_read(fh, buf, nints, MPI_INT, &status);
MPI_File_close(&fh);
61
Using Explicit Offsets
include 'mpif.h'
integer status(MPI_STATUS_SIZE)
integer (kind=MPI_OFFSET_KIND) offset
C in F77, see implementation notes (might be integer*8)

call MPI_FILE_OPEN(MPI_COMM_WORLD, '/pfs/datafile',
     MPI_MODE_RDONLY, MPI_INFO_NULL, fh, ierr)
nints = FILESIZE / (nprocs*INTSIZE)
offset = rank * nints * INTSIZE
call MPI_FILE_READ_AT(fh, offset, buf, nints, MPI_INTEGER,
     status, ierr)
call MPI_GET_COUNT(status, MPI_INTEGER, count, ierr)
print *, 'process ', rank, 'read ', count, 'integers'
call MPI_FILE_CLOSE(fh, ierr)
62
Writing to a File
  • Use MPI_File_write or MPI_File_write_at
  • Use MPI_MODE_WRONLY or MPI_MODE_RDWR as the flags
    to MPI_File_open
  • If the file doesn't exist previously, the flag
    MPI_MODE_CREATE must also be passed to
    MPI_File_open
  • We can pass multiple flags by using bitwise-or
    in C, or addition in Fortran (see the sketch below)
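
A small write sketch (illustrative, not from the
slides): each process writes its own block of N
integers at an explicit offset into a shared file
created with MPI_MODE_CREATE | MPI_MODE_WRONLY; the
file name and N are arbitrary.

#include <mpi.h>

#define N 100

int main(int argc, char *argv[])
{
    int      rank, i, buf[N];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < N; i++) buf[i] = rank * N + i;

    MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* explicit-offset write: no seek needed */
    MPI_File_write_at(fh, (MPI_Offset) rank * N * sizeof(int),
                      buf, N, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}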

63
Using File Views
  • Processes write to shared file

MPI_File_set_view assigns regions of the file to
separate processes
64
File Views
  • Specified by a triplet (displacement, etype, and
    filetype) passed to MPI_File_set_view
  • displacement: number of bytes to be skipped from
    the start of the file
  • etype: basic unit of data access (can be any
    basic or derived datatype)
  • filetype: specifies which portion of the file is
    visible to the process

65
File View Example
MPI_File file;

for (i=0; i<BUFSIZE; i++)
    buf[i] = myrank * BUFSIZE + i;

MPI_File_open(MPI_COMM_WORLD, "testfile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY,
              MPI_INFO_NULL, &file);
MPI_File_set_view(file, myrank * BUFSIZE * sizeof(int),
                  MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
MPI_File_write(file, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_close(&file);
66
Other Ways to Write to a Shared File
  • MPI_File_seek - like Unix seek
  • MPI_File_read_at / MPI_File_write_at - combine
    seek and I/O for thread safety
  • MPI_File_read_shared / MPI_File_write_shared -
    use the shared file pointer
  • Collective operations

67
Noncontiguous Accesses
  • Common in parallel applications
  • Example: distributed arrays stored in files
  • A big advantage of MPI I/O over Unix I/O is the
    ability to specify noncontiguous accesses in
    memory and file within a single function call by
    using derived datatypes
  • Allows implementation to optimize the access
  • Collective I/O combined with noncontiguous
    accesses yields the highest performance

68
Example Distributed Array Access
(Figure: a 2D array distributed among four processes
P0-P3, and the file containing the global array in
row-major order.)
69
A Simple File View Example
(Figure: a simple file view with etype = MPI_INT.
Starting from the head of the file, displacement bytes
are skipped, then the filetype pattern repeats:
filetype, filetype, and so on.)
70
File View Code
MPI_Aint lb, extent;
MPI_Datatype etype, filetype, contig;
MPI_Offset disp;

MPI_Type_contiguous(2, MPI_INT, &contig);
lb = 0;
extent = 6 * sizeof(int);
MPI_Type_create_resized(contig, lb, extent, &filetype);
MPI_Type_commit(&filetype);
disp = 5 * sizeof(int);
etype = MPI_INT;

MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
              MPI_MODE_CREATE | MPI_MODE_RDWR,
              MPI_INFO_NULL, &fh);
MPI_File_set_view(fh, disp, etype, filetype, "native",
                  MPI_INFO_NULL);
MPI_File_write(fh, buf, 1000, MPI_INT, MPI_STATUS_IGNORE);
71
Collective I/O in MPI
  • A critical optimization in parallel I/O
  • Allows communication of big picture to file
    system
  • Framework for 2-phase I/O, in which communication
    precedes I/O (can use MPI machinery)
  • Basic idea: build large blocks, so that
    reads/writes in the I/O system will be large

(Figure: many small individual requests vs. one large
collective access.)
72
Collective I/O
  • MPI_File_read_all, MPI_File_read_at_all, etc
  • _all indicates that all processes in the group
    specified by the communicator passed to
    MPI_File_open will call this function
  • Each process specifies only its own access
    information -- the argument list is the same as
    for the non-collective functions

73
Collective I/O
  • By calling the collective I/O functions, the user
    allows an implementation to optimize the request
    based on the combined request of all processes
  • The implementation can merge the requests of
    different processes and service the merged
    request efficiently
  • Particularly effective when the accesses of
    different processes are noncontiguous and
    interleaved

74
Accessing Arrays Stored in Files
75
Using the Distributed Array (Darray) Datatype
int gsizes[2], distribs[2], dargs[2], psizes[2];

gsizes[0] = m;   /* no. of rows in global array */
gsizes[1] = n;   /* no. of columns in global array */
distribs[0] = MPI_DISTRIBUTE_BLOCK;
distribs[1] = MPI_DISTRIBUTE_BLOCK;
dargs[0] = MPI_DISTRIBUTE_DFLT_DARG;
dargs[1] = MPI_DISTRIBUTE_DFLT_DARG;
psizes[0] = 2;   /* no. of processes in vertical dimension
                    of process grid */
psizes[1] = 3;   /* no. of processes in horizontal dimension
                    of process grid */
76
Darray Continued
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Type_create_darray(6, rank, 2, gsizes, distribs, dargs,
                       psizes, MPI_ORDER_C, MPI_FLOAT,
                       &filetype);
MPI_Type_commit(&filetype);

MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY,
              MPI_INFO_NULL, &fh);
MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native",
                  MPI_INFO_NULL);

local_array_size = num_local_rows * num_local_cols;
MPI_File_write_all(fh, local_array, local_array_size,
                   MPI_FLOAT, &status);
MPI_File_close(&fh);
77
A Word of Warning about Darray
  • The darray datatype assumes a very specific
    definition of data distribution -- the exact
    definition as in HPF
  • For example, if the array size is not divisible
    by the number of processes, darray calculates the
    block size using a ceiling division (e.g.,
    ceiling(20/6) = 4)
  • darray assumes a row-major ordering of processes
    in the logical grid, as assumed by cartesian
    process topologies in MPI-1
  • If your application uses a different definition
    for data distribution or logical grid ordering,
    you cannot use darray. Use subarray instead.

78
Using the Subarray Datatype
gsizes[0] = m;   /* rows in global array */
gsizes[1] = n;   /* columns in global array */
psizes[0] = 2;   /* procs. in vertical dimension */
psizes[1] = 3;   /* procs. in horizontal dimension */
lsizes[0] = m/psizes[0];   /* rows in local array */
lsizes[1] = n/psizes[1];   /* columns in local array */
dims[0] = 2;  dims[1] = 3;
periods[0] = periods[1] = 1;

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm);
MPI_Comm_rank(comm, &rank);
MPI_Cart_coords(comm, rank, 2, coords);
79
Subarray Datatype II
/* global indices of first element of local array */
start_indices[0] = coords[0] * lsizes[0];
start_indices[1] = coords[1] * lsizes[1];

MPI_Type_create_subarray(2, gsizes, lsizes, start_indices,
                         MPI_ORDER_C, MPI_FLOAT, &filetype);
MPI_Type_commit(&filetype);

MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY,
              MPI_INFO_NULL, &fh);
MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native",
                  MPI_INFO_NULL);
local_array_size = lsizes[0] * lsizes[1];
MPI_File_write_all(fh, local_array, local_array_size,
                   MPI_FLOAT, &status);
80
Local Array with Ghost Area in Memory
  • Use a subarray datatype to describe the
    noncontiguous layout in memory
  • Pass this datatype as argument to
    MPI_File_write_all

81
Local Array with Ghost Area
memsizes[0] = lsizes[0] + 8;   /* no. of rows in allocated array */
memsizes[1] = lsizes[1] + 8;   /* no. of columns in allocated array */
start_indices[0] = start_indices[1] = 4;
    /* indices of the first element of the local
       array in the allocated array */

MPI_Type_create_subarray(2, memsizes, lsizes, start_indices,
                         MPI_ORDER_C, MPI_FLOAT, &memtype);
MPI_Type_commit(&memtype);

/* create filetype and set file view exactly as in the
   subarray example */

MPI_File_write_all(fh, local_array, 1, memtype, &status);
82
Accessing Irregularly Distributed Arrays
(Figure: process 0's, process 1's, and process 2's map
arrays, each holding four entries; the values shown are
0, 14, 13, 7, 4, 2, 11, 8, 3, 10, 5, and 1.)
The map array describes the location of each
element of the data array in the common file
83
Accessing Irregularly Distributed Arrays
integer (kind=MPI_OFFSET_KIND) disp

call MPI_FILE_OPEN(MPI_COMM_WORLD, '/pfs/datafile',
     MPI_MODE_CREATE + MPI_MODE_RDWR,
     MPI_INFO_NULL, fh, ierr)
call MPI_TYPE_CREATE_INDEXED_BLOCK(bufsize, 1, map,
     MPI_DOUBLE_PRECISION, filetype, ierr)
call MPI_TYPE_COMMIT(filetype, ierr)
disp = 0
call MPI_FILE_SET_VIEW(fh, disp, MPI_DOUBLE_PRECISION,
     filetype, 'native', MPI_INFO_NULL, ierr)
call MPI_FILE_WRITE_ALL(fh, buf, bufsize,
     MPI_DOUBLE_PRECISION, status, ierr)
call MPI_FILE_CLOSE(fh, ierr)
84
non-blocking I/O
MPI_Request request;
MPI_Status status;

MPI_File_iwrite_at(fh, offset, buf, count, datatype,
                   &request);

for (i=0; i<1000; i++) {
    /* perform computation */
}

MPI_Wait(&request, &status);
85
Split Collective I/O
  • A restricted form of non-blocking collective I/O
  • Only one active non-blocking collective operation
    allowed at a time on a file handle
  • Therefore, no request object necessary

MPI_File_write_all_begin(fh, buf, count, datatype);

for (i=0; i<1000; i++) {
    /* perform computation */
}

MPI_File_write_all_end(fh, buf, &status);