Title: CSCI-4320/6360: Parallel Programming
1. CSCI-4320/6360 Parallel Programming & Computing
Tues./Fri. 12-1:20 p.m.
MPI File I/O
- Prof. Chris Carothers
- Computer Science Department
- MRC 309a
- chrisc_at_cs.rpi.edu
- www.cs.rpi.edu/chrisc/COURSES/PARALLEL/SPRING-2010
- Adapted from people.cs.uchicago.edu/asiegel/courses/cspp51085/.../mpi-io.ppt
2. Common Ways of Doing I/O in Parallel Programs
- Sequential I/O
  - All processes send data to rank 0, and rank 0 writes it to the file
3. Pros and Cons of Sequential I/O
- Pros
  - parallel machine may support I/O from only one process (e.g., no common file system)
  - some I/O libraries (e.g., HDF-4, NetCDF, PMPIO) are not parallel
  - resulting single file is handy for ftp, mv
  - big blocks improve performance
  - short distance from original, serial code
- Cons
  - lack of parallelism limits scalability and performance (single-node bottleneck)
4. Another Way
- Each process writes to a separate file
- Pros
  - parallelism, high performance
- Cons
  - lots of small files to manage
  - LOTS OF METADATA stresses the parallel filesystem
  - difficult to read the data back from a different number of processes
5. What is Parallel I/O?
- Multiple processes of a parallel program accessing data (reading or writing) from a common file
[Figure: processes P0, P1, P2, ..., P(n-1) all accessing one shared FILE]
6. Why Parallel I/O?
- Non-parallel I/O is simple but
  - poor performance (single process writes to one file), or
  - awkward and not interoperable with other tools (each process writes a separate file)
- Parallel I/O
  - provides high performance
  - can provide a single file that can be used with other tools (such as visualization programs)
7. Why is MPI a Good Setting for Parallel I/O?
- Writing is like sending a message and reading is like receiving.
- Any parallel I/O system will need a mechanism to
  - define collective operations (MPI communicators)
  - define noncontiguous data layouts in memory and file (MPI datatypes)
  - test completion of nonblocking operations (MPI request objects)
- i.e., lots of MPI-like machinery
8. MPI-IO Background
- Marc Snir et al. (IBM Watson) paper exploring MPI as a context for parallel I/O (1994)
- MPI-IO email discussion group led by J.-P. Prost (IBM) and Bill Nitzberg (NASA), 1994
- MPI-IO group joins MPI Forum in June 1996
- MPI-2 standard released in July 1997
- MPI-IO is Chapter 9 of MPI-2
9. Using MPI for Simple I/O
Each process needs to read a chunk of data from a common file.
10. Using Individual File Pointers

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#define FILESIZE 1000

int main(int argc, char *argv[])
{
    int rank, nprocs;
    MPI_File fh;
    MPI_Status status;
    int bufsize, nints;
    int buf[FILESIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    bufsize = FILESIZE/nprocs;
    nints = bufsize/sizeof(int);
    MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_seek(fh, rank * bufsize, MPI_SEEK_SET);
    MPI_File_read(fh, buf, nints, MPI_INT, &status);
    MPI_File_close(&fh);
}
11. Using Explicit Offsets

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#define FILESIZE 1000

int main(int argc, char *argv[])
{
    int rank, nprocs;
    MPI_File fh;
    MPI_Status status;
    int bufsize, nints;
    int buf[FILESIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    bufsize = FILESIZE/nprocs;
    nints = bufsize/sizeof(int);
    MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_read_at(fh, rank * bufsize, buf, nints, MPI_INT, &status);
    MPI_File_close(&fh);
}
12. Function Details

MPI_File_open(MPI_Comm comm, char *filename, int amode,
              MPI_Info info, MPI_File *fh)
  (note: amode is one or more of MPI_MODE_RDONLY, MPI_MODE_RDWR,
   MPI_MODE_WRONLY, MPI_MODE_CREATE, MPI_MODE_EXCL,
   MPI_MODE_DELETE_ON_CLOSE, MPI_MODE_UNIQUE_OPEN,
   MPI_MODE_SEQUENTIAL, MPI_MODE_APPEND)

MPI_File_close(MPI_File *fh)

MPI_File_read(MPI_File fh, void *buf, int count,
              MPI_Datatype datatype, MPI_Status *status)

MPI_File_read_at(MPI_File fh, MPI_Offset offset, void *buf, int count,
                 MPI_Datatype datatype, MPI_Status *status)

MPI_File_seek(MPI_File fh, MPI_Offset offset, int whence)
  (note: whence is MPI_SEEK_SET, MPI_SEEK_CUR, or MPI_SEEK_END)

MPI_File_write(MPI_File fh, void *buf, int count,
               MPI_Datatype datatype, MPI_Status *status)

MPI_File_write_at(... same as read_at ...)

(Note: many other functions get/set properties; see Gropp et al.)
13. Writing to a File
- Use MPI_File_write or MPI_File_write_at
- Use MPI_MODE_WRONLY or MPI_MODE_RDWR as the flags to MPI_File_open
- If the file doesn't exist previously, the flag MPI_MODE_CREATE must also be passed to MPI_File_open
- We can pass multiple flags by using bitwise-or in C, or addition in Fortran
14. MPI Datatype Interlude
- Datatypes in MPI
  - Elementary: MPI_INT, MPI_DOUBLE, etc.
    - everything we've used to this point
  - Contiguous
    - next easiest: sequences of elementary types
  - Vector
    - sequences separated by a constant stride
15. MPI Datatypes, cont.
- Indexed: more general
  - does not assume a constant stride
- Struct
  - general mixed types (like C structs)
16. Creating Simple Datatypes
- Let's just look at the simplest types: contiguous and vector datatypes.
- Contiguous example
  - Let's create a new datatype which is two ints side by side. The calling sequence is
    MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)
  - MPI_Datatype newtype;
    MPI_Type_contiguous(2, MPI_INT, &newtype);
    MPI_Type_commit(&newtype);   /* required */
17. Using File Views
- Processes write to a shared file
- MPI_File_set_view assigns regions of the file to separate processes
18. File Views
- Specified by a triplet (displacement, etype, and filetype) passed to MPI_File_set_view
  - displacement: number of bytes to be skipped from the start of the file
  - etype: basic unit of data access (can be any basic or derived datatype)
  - filetype: specifies which portion of the file is visible to the process
- This is a collective operation, so all processes/ranks in the group determined when the file was opened must use the same data representation and etypes.
19. File Interoperability
- Users can optionally create files with a portable binary data representation
- datarep parameter to MPI_File_set_view
  - "native": default, same as in memory, not portable
  - "internal": implementation-defined representation providing an implementation-defined level of portability
  - "external32": a specific representation defined in MPI (basically 32-bit big-endian IEEE format), portable across machines and MPI implementations
20File View Example
MPI_File thefile for (i0 iltBUFSIZE i)
bufi myrank BUFSIZE i MPI_File_open(MPI_
COMM_WORLD, "testfile", MPI_MODE_CREATE
MPI_MODE_WRONLY, MPI_INFO_NULL,
thefile) MPI_File_set_view(thefile, myrank
BUFSIZE, MPI_INT, MPI_INT, "native",
MPI_INFO_NULL) MPI_File_write(thefi
le, buf, BUFSIZE, MPI_INT,
MPI_STATUS_IGNORE) MPI_File_close(thefile)
21. Ways to Write to a Shared File
- MPI_File_seek: like Unix seek
- MPI_File_read_at / MPI_File_write_at: combine seek and I/O for thread safety
- MPI_File_read_shared / MPI_File_write_shared: use the shared file pointer; good when order doesn't matter
- Collective operations
22. Collective I/O in MPI
- A critical optimization in parallel I/O
- Allows communication of the "big picture" to the file system
- Framework for two-phase I/O, in which communication precedes I/O (can use MPI machinery)
- Basic idea: build large blocks, so that reads/writes in the I/O system will be large
[Figure: many small individual requests are merged into one large collective access]
23. Collective I/O
- MPI_File_read_all, MPI_File_read_at_all, etc.
- _all indicates that all processes in the group specified by the communicator passed to MPI_File_open will call this function
- Each process specifies only its own access information -- the argument list is the same as for the non-collective functions
24. Collective I/O
- By calling the collective I/O functions, the user allows an implementation to optimize the request based on the combined request of all processes
- The implementation can merge the requests of different processes and service the merged request efficiently
- Particularly effective when the accesses of different processes are noncontiguous and interleaved
25Collective non-contiguousMPI-IO examples
define mpi.h define FILESIZE 1048576 define
INTS_PER_BLK 16 int main(int argc, char
argv) int buf, rank, nprocs, nints,
bufsize MPI_File fh MPI_Datatype
filetype MPI_Init(argc, argv)
MPI_Comm_rank(MPI_COMM_WORLD, rank)
MPI_Comm_size(MPI_COMM_WORLD, nprocs)
bufsize FILESIZE/nprocs buf (int )
malloc(bufsize) nints bufsize/sizeof(int)
MPI_File_open(MPI_COMM_WORLD, filename,
MPI_MODE_RD_ONLY, MPI_INFO_NULL, fh)
MPI_Type_vector(nints/INTS_PER_BLK, INTS_PER_BLK,
INTS_PER_BLKnprocs, MPI_INT, filetype)
MPI_Type_commit(filetype) MPI_File_set_view(fh
, INTS_PER_BLKsizeof(int)rank, MPI_INT,
filetype, native, MPI_INFO_NULL)
MPI_File_read_all(fh, buf, nints, MPI_INT,
MPI_STATUS_IGNORE) MPI_Type_free(filetype)
free(buf) MPI_Finalize() return(0)
26. More on MPI_File_read_all
- Note that the _all version has the same argument list
- Difference: all processes involved in MPI_File_open must call the read
- Contrast with the non-_all version, where any subset may or may not call it
- Allows for many optimizations
27. Split Collective I/O
- A restricted form of nonblocking collective I/O
- Only one active nonblocking collective operation allowed at a time on a file handle
- Therefore, no request object necessary

MPI_File_write_all_begin(fh, buf, count, datatype);
/* available on Blue Gene/L, but may not improve performance */
for (i = 0; i < 1000; i++) {
    /* perform computation */
}
MPI_File_write_all_end(fh, buf, &status);
28Passing Hints to the Implementation
MPI_Info info MPI_Info_create(info) / no.
of I/O devices to be used for file striping
/ MPI_Info_set(info, "striping_factor",
"4") / the striping unit in bytes
/ MPI_Info_set(info, "striping_unit",
"65536") MPI_File_open(MPI_COMM_WORLD,
"/pfs/datafile", MPI_MODE_CREATE
MPI_MODE_RDWR, info, fh) MPI_Info_free(info)
29. Examples of Hints (used in ROMIO)
- MPI-2 predefined hints
  - striping_unit
  - striping_factor
  - cb_buffer_size
  - cb_nodes
- New algorithm parameters
  - ind_rd_buffer_size
  - ind_wr_buffer_size
- Platform-specific hints
  - start_iodevice
  - pfs_svr_buf
  - direct_read
  - direct_write
30. I/O Consistency Semantics
- The consistency semantics specify the results when multiple processes access a common file and one or more processes write to the file
- MPI guarantees stronger consistency semantics if the communicator used to open the file accurately specifies all the processes that are accessing the file, and weaker semantics if not
- The user can take steps to ensure consistency when MPI does not automatically do so
31. Example 1
- File opened with MPI_COMM_WORLD. Each process writes to a separate region of the file and reads back only what it wrote.
- MPI guarantees that the data will be read correctly
32. Example 2
- Same as Example 1, except that each process wants to read what the other process wrote (overlapping accesses)
- In this case, MPI does not guarantee that the data will automatically be read correctly

Process 0:
/* incorrect program */
MPI_File_open(MPI_COMM_WORLD, ...)
MPI_File_write_at(off=0, cnt=100)
MPI_Barrier
MPI_File_read_at(off=100, cnt=100)

Process 1:
/* incorrect program */
MPI_File_open(MPI_COMM_WORLD, ...)
MPI_File_write_at(off=100, cnt=100)
MPI_Barrier
MPI_File_read_at(off=0, cnt=100)

- In the above program, the read on each process is not guaranteed to get the data written by the other process!
33. Example 2, contd.
- The user must take extra steps to ensure correctness
- There are three choices:
  - set atomicity to true
  - close the file and reopen it
  - ensure that no write sequence on any process is concurrent with any sequence (read or write) on another process/MPI rank
- Can hurt performance.
34. Example 2, Option 1: Set atomicity to true
35. Example 2, Option 2: Close and reopen file
Process 0:
MPI_File_open(MPI_COMM_WORLD, ...)
MPI_File_write_at(off=0, cnt=100)
MPI_File_close
MPI_Barrier
MPI_File_open(MPI_COMM_WORLD, ...)
MPI_File_read_at(off=100, cnt=100)

Process 1:
MPI_File_open(MPI_COMM_WORLD, ...)
MPI_File_write_at(off=100, cnt=100)
MPI_File_close
MPI_Barrier
MPI_File_open(MPI_COMM_WORLD, ...)
MPI_File_read_at(off=0, cnt=100)
36. Example 2, Option 3
- Ensure that no write sequence on any process is concurrent with any sequence (read or write) on another process
  - a sequence is a set of operations between any pair of open, close, or file_sync functions
  - a write sequence is a sequence in which any of the functions is a write operation
38. General Guidelines for Achieving High I/O Performance
- Buy sufficient I/O hardware for the machine
- Use fast file systems, not NFS-mounted home directories
- Do not perform I/O from one process only
- Make large requests wherever possible
- For noncontiguous requests, use derived datatypes and a single collective I/O call
39. Optimizations
- Given complete access information, an implementation can perform optimizations such as:
  - Data sieving: read large chunks and extract what is really needed
  - Collective I/O: merge requests of different processes into larger requests
  - Improved prefetching and caching
40. Summary
- MPI-IO has many features that can help users achieve high performance
- The most important of these features are the ability to specify noncontiguous accesses, the collective I/O functions, and the ability to pass hints to the implementation
- Users must use the above features!
- In particular, when accesses are noncontiguous, users must create derived datatypes, define file views, and use the collective I/O functions