Title: Advanced MPI
1. Advanced MPI
2. Audience Background
- This class assumes you have some background in parallel programming
- You are familiar with basic MPI
- You are familiar with basic OpenMP
- And, of course, you are familiar with a programming language and the UNIX/Linux operating environment
- You have run parallel jobs through a queueing system in a shared environment such as Ranger before
- You can use SSH to log into remote systems and transfer files
- If this is not the case, please see the TACC introductory courses
- Slides from the Lisboa class this week: http://taccspringschool.ist.utl.pt/SpringSchool/Agenda.html
- Ranger Virtual Workshop: https://www.cac.cornell.edu/ranger/
3. Outline
- Review of MPI Advanced Topics
- Derived Datatypes
- Communicator manipulations
- One-sided communication
- I/O
- What is Parallel I/O? Do I need it?
- Cluster Filesystem Options
- MPI I/O and ROMIO
- Example striping schemes
4. User Defined Datatypes
- Methods for creating data types
- MPI_Type_contiguous()
- MPI_Type_vector()
- MPI_Type_indexed()
- MPI_Type_struct()
- MPI_Pack()
- MPI_Unpack()
- MPI allows datatypes to be defined in much the same way as in modern programming languages (C, C++, F90)
- This allows your communication and I/O operations to operate on the same datatypes as the rest of your program
- Makes expressing the partitioning of datasets easier
5. Contiguous Array
- Creates a type describing count contiguous elements
- MPI_Type_contiguous(int count,
                      MPI_Datatype oldtype,
                      MPI_Datatype *newtype)
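- A minimal sketch of how this might be used (the name rowtype, the count of 100, and the two-rank exchange are illustrative, not from the slides): 100 consecutive doubles are committed as a single type and moved as one element.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      double row[100] = {0};
      MPI_Datatype rowtype;
      int rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* 100 consecutive MPI_DOUBLEs become one element of rowtype */
      MPI_Type_contiguous(100, MPI_DOUBLE, &rowtype);
      MPI_Type_commit(&rowtype);          /* commit before use (slide 11) */

      if (rank == 0)
          MPI_Send(row, 1, rowtype, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1)
          MPI_Recv(row, 1, rowtype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      MPI_Type_free(&rowtype);
      MPI_Finalize();
      return 0;
  }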
6. Strided Vector
- Constructs a regularly strided set of element blocks
- MPI_Type_vector(int count,
                  int blocklength,
                  int stride,
                  MPI_Datatype oldtype,
                  MPI_Datatype *newtype)
- Stride is specified in number of elements
- Stride can instead be specified in bytes with MPI_Type_hvector()
- Stride counts from the start of each block
7. Subarrays
- Perhaps the most useful MPI datatype, the subarray type lets you divide a multi-dimensional array into smaller blocks
- int MPI_Type_create_subarray(ndims, array_of_sizes, array_of_subsizes, array_of_starts, order, oldtype, newtype)
- int ndims
- int array_of_sizes[]
- int array_of_subsizes[]
- int array_of_starts[]
- int order
- MPI_Datatype oldtype
- MPI_Datatype *newtype
- A subarray example is coming after we add some communicator and I/O operations
8. SubArray Illustration
(Figure: a size[0] x size[1] array containing a subsize[0] x subsize[1] subarray that begins at offset (start[0], start[1]).)
9. Indexed Vector
- Allows an irregular pattern of elements
- MPI_Type_indexed(int count,
                   int array_of_blocklengths[],
                   int array_of_displacements[],
                   MPI_Datatype oldtype,
                   MPI_Datatype *newtype)
- Displacements are specified in number of elements
- Displacements can instead be specified in bytes with MPI_Type_hindexed()
- MPI_Type_create_indexed_block() is a shortcut if all blocks are the same length
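- A minimal sketch (the block lengths, displacements, and two-rank exchange are illustrative): three irregular blocks of a 16-element array are described once and moved with a single send.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      double a[16] = {0};
      int blocklens[3] = {1, 2, 3};       /* irregular block lengths       */
      int displs[3]    = {0, 4, 9};       /* displacements in elements     */
      MPI_Datatype itype;
      int rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Type_indexed(3, blocklens, displs, MPI_DOUBLE, &itype);
      MPI_Type_commit(&itype);

      /* rank 0 sends only the 6 selected elements; rank 1 receives them
         into the same scattered positions of its own array               */
      if (rank == 0)
          MPI_Send(a, 1, itype, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1)
          MPI_Recv(a, 1, itype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      MPI_Type_free(&itype);
      MPI_Finalize();
      return 0;
  }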
10. Structured Records
- Allows different types to be combined
- MPI_Type_struct(int count,
                  int array_of_blocklengths[],
                  MPI_Aint array_of_displacements[],
                  MPI_Datatype array_of_types[],
                  MPI_Datatype *newtype)
- Blocklengths are specified in number of elements
- Displacements are specified in bytes
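- A minimal sketch using MPI_Type_create_struct, the MPI-2 spelling of MPI_Type_struct (the particle_t layout and two-rank exchange are assumed for illustration); offsetof() supplies the byte displacements.

  #include <stddef.h>
  #include <mpi.h>

  typedef struct { int id; double x[3]; } particle_t;

  int main(int argc, char **argv)
  {
      particle_t   p = {0, {0.0, 0.0, 0.0}};
      int          blocklens[2] = {1, 3};
      MPI_Aint     displs[2]    = {offsetof(particle_t, id),
                                   offsetof(particle_t, x)};
      MPI_Datatype types[2]     = {MPI_INT, MPI_DOUBLE};
      MPI_Datatype ptype;
      int rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Type_create_struct(2, blocklens, displs, types, &ptype);
      MPI_Type_commit(&ptype);

      if (rank == 0)
          MPI_Send(&p, 1, ptype, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1)
          MPI_Recv(&p, 1, ptype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      MPI_Type_free(&ptype);
      MPI_Finalize();
      return 0;
  }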
11. Committing Types
- In order for a user-defined derived datatype to be used as an argument to other MPI calls, the type must be committed
- MPI_Type_commit(type)
- MPI_Type_free(type)
- Call commit after calling the type constructor, but before using the type anywhere else
- Call free after the type is no longer in use (no one actually does this, but it makes computer scientists happy...)
12. Vector Example
- The MPI_TYPE_VECTOR function allows creating non-contiguous vectors with constant stride

  MPI_TYPE_VECTOR(count, blocklen, stride, oldtype, vtype, ierr)
  MPI_TYPE_COMMIT(vtype, ierr)

      1    6   11   16
      2    7   12   17
      3    8   13   18
      4    9   14   19
      5   10   15   20
  (A 5 x 4 Fortran array A, stored column-major: nrows = 5, ncols = 4)

  call MPI_Type_vector(ncols, 1, nrows, MPI_DOUBLE_PRECISION, vtype, ierr)
  call MPI_Type_commit(vtype, ierr)
  call MPI_Send(A(nrows,1), 1, vtype, ...)
13. Dealing with Communicators
- Many MPI operations involve all the processes in a communicator
- MPI_COMM_WORLD by default contains every task in your MPI job
- Other communicators can be defined for more complex operations: for different parts of the task, to add topology, or to segregate different kinds of messaging
14. Communicators and Groups I
- All MPI communication is relative to a communicator, which contains a context and a group. The group is just a set of processes.
(Figure: five processes with ranks 0-4 in MPI_COMM_WORLD; the same processes are also grouped into two smaller communicators, COMM1 with local ranks 0-2 and COMM2 with local ranks 0-1.)
15. Communicators and Groups II
- To subdivide a communicator into multiple non-overlapping communicators, Approach I:
- e.g. to form groups of rows of PEs

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  myrow = (int)(rank / ncol);
16. MPI_Comm_split
- Argument 1: the communicator to split
- Argument 2: color; all processes passing the same color go into the same new communicator
- Argument 3: key; determines the rank ordering within the result communicator
- Argument 4: the result communicator

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  myrow = (int)(rank / ncol);
  MPI_Comm_split(MPI_COMM_WORLD, myrow, rank, &row_comm);
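- Putting the fragment above into a complete sketch (ncol = 4 and the row-wise sum are illustrative assumptions): split MPI_COMM_WORLD into row communicators, then reduce within each row.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      const int ncol = 4;                 /* assumed: 4 PEs per row         */
      int rank, myrow, rowrank;
      double val, rowsum;
      MPI_Comm row_comm;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      myrow = rank / ncol;                /* color: same value -> same comm */
      MPI_Comm_split(MPI_COMM_WORLD, myrow, rank, &row_comm);
      MPI_Comm_rank(row_comm, &rowrank);  /* rank within this row           */

      val = (double)rank;
      MPI_Allreduce(&val, &rowsum, 1, MPI_DOUBLE, MPI_SUM, row_comm);

      MPI_Comm_free(&row_comm);
      MPI_Finalize();
      return 0;
  }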
17. Topologies and Communicators
- MPI allows processes to be grouped into logical topologies
- Topologies can aid the programmer
- Convenient naming methods for processes in a group
- Naming can match communication patterns
- A standard mechanism for representing common algorithmic concepts (i.e. 2D grids)
- Topologies can aid the runtime environment
- Better mappings of MPI tasks to hardware nodes
- (Not really widely used in most implementations yet)
18. Topology Mechanics
- Topologies have the scope of a single (intra)communicator
- Topologies are an optional attribute given to a communicator
- Two topologies are supported
- Cartesian coordinates (grid)
- Graph
- nodes are tasks
- edges are named communication pathways
19. Cartesian Topologies
- int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart)
- MPI::Cartcomm MPI::Intracomm::Create_cart(...)
- MPI_CART_CREATE(...)
- comm_old - input communicator
- ndims - number of dimensions in the cartesian grid
- dims - integer array of size ndims specifying the number of processes in each dimension
- periods - true/false specifying whether each dimension is periodic
- reorder - whether ranks may be reordered or not
- comm_cart - new communicator containing the new topology
20. MPI_DIMS_CREATE
- A helper function for suggesting a likely dimension decomposition
- int MPI_Dims_create(int nnodes, int ndims, int *dims)
- MPI_DIMS_CREATE(NNODES, NDIMS, DIMS, IERROR)
- void MPI::Compute_dims(int nnodes, int ndims, int dims[])
- nnodes - total nodes in the grid
- ndims - number of dimensions
- dims - array returned with the dimensions
- Examples
- MPI_Dims_create(6,2,dims) will return (3,2) in dims
- MPI_Dims_create(6,3,dims) will return (3,2,1) in dims
- No rounding or ceiling function is provided
21. Cartesian Inquiry Functions
- MPI_Cartdim_get returns the number of dimensions in a cartesian structure
- int MPI_Cartdim_get(MPI_Comm comm, int *ndims)
- MPI_Cart_get provides information on an existing topology
- Arguments roughly mirror the create call
- int MPI_Cart_get(MPI_Comm comm, int maxdims, int *dims, int *periods, int *coords)
- maxdims keeps a larger-than-expected communicator from overflowing your argument arrays
22. Cartesian Translator Functions
- Task IDs in a cartesian coordinate system correspond to ranks in a "normal" communicator
- Point-to-point communication routines (send/receive) rely on ranks
- int MPI_Cart_rank(MPI_Comm comm, int *coords, int *rank)
- int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords)
- coords - cartesian coordinates
- rank - rank in the communicator
23. Cartesian Shift Function
- int MPI_Cart_shift(MPI_Comm comm, int direction, int disp, int *rank_source, int *rank_dest)
- direction - coordinate dimension of the shift
- disp - displacement (can be positive or negative)
- rank_source and rank_dest are return values
- Use that source and dest to call MPI_Sendrecv, as in the sketch below
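- A minimal sketch tying slides 19-23 together (the 2D periodic grid and the exchanged double are illustrative): each task exchanges a value with its neighbours along dimension 0.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int nprocs, rank, src, dest;
      int dims[2] = {0, 0}, periods[2] = {1, 1};
      double sendval, recvval;
      MPI_Comm cart;

      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Dims_create(nprocs, 2, dims);             /* pick a 2D shape      */
      MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

      /* neighbours one step away along dimension 0 */
      MPI_Cart_shift(cart, 0, 1, &src, &dest);

      sendval = (double)rank;
      MPI_Sendrecv(&sendval, 1, MPI_DOUBLE, dest, 0,
                   &recvval, 1, MPI_DOUBLE, src,  0,
                   cart, MPI_STATUS_IGNORE);

      MPI_Comm_free(&cart);
      MPI_Finalize();
      return 0;
  }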
24. Remote Memory Access Windows and Window Objects
- MPI-2 provides a facility for remote memory access in specified regions
- "One-sided communication" - no need for matching send and receive
- From what I hear, this may not work well yet, but...
- MPI_Win_create(base, size, disp_unit, info, comm, win)
- Exposes the size bytes of memory starting at base to remote memory access, addressed in units of disp_unit
25. One-Sided Communication Calls
- MPI_Put() - stores into remote memory
- MPI_Get() - reads from remote memory
- MPI_Accumulate() - updates remote memory; has an op argument like MPI_Reduce
- All are non-blocking: the data transfer is described, and maybe even initiated, but may continue after the call returns
- Subsequent synchronization, e.g. MPI_Win_fence(), is needed to make sure operations on the window object are complete
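- A minimal sketch of the window/fence pattern (the one-double window and the value 42.0 are illustrative): rank 0 puts a value into every other rank's exposed memory between two fences.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, nprocs, i;
      double local = 0.0, value;
      MPI_Win win;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      /* every rank exposes one double to remote access */
      MPI_Win_create(&local, sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);

      MPI_Win_fence(0, win);                 /* open an access epoch        */
      if (rank == 0) {
          value = 42.0;
          /* store into the window of every other rank at displacement 0   */
          for (i = 1; i < nprocs; i++)
              MPI_Put(&value, 1, MPI_DOUBLE, i, 0, 1, MPI_DOUBLE, win);
      }
      MPI_Win_fence(0, win);                 /* puts are complete here      */

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }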
26. I/O (Parallel and Otherwise) on Large-Scale Systems
Dan Stanzione, Arizona State University
27. Parallel I/O in Data Parallel Programs
- Each task reads a distinct partition of the input data and writes a distinct partition of the output data
- Each task reads its partition in parallel
- Data is distributed to the slave nodes
- Each task computes output data from input data
- Each task writes its partition in parallel
28. What Are All These Names?
- MPI - Message Passing Interface standard
- Also known as MPI-1
- MPI-2 - extensions to the MPI standard
- I/O, RDMA, dynamic processes
- MPI-IO - the I/O part of the MPI-2 extensions
- ROMIO - an implementation of MPI-IO
- Handles mapping MPI-IO calls into communication (MPI) and file I/O
29. Filesystems
- Since each node in a cluster has its own disk, making the same files available on each node can be problematic
- Three filesystem options
- Local
- Remote (e.g. NFS)
- Parallel (e.g. PVFS, LUSTRE)
30. Filesystems (cont.)
- Local - use storage on each node's disk
- Relatively high performance
- Each node has a different filesystem
- Shared datafiles must be copied to each node
- No synchronization
- Most useful for temporary/scratch files accessed only by the copy of the program running on a single node
- RANGER DOESN'T HAVE LOCAL DISKS
- This trend may continue with other large-scale systems, for reliability reasons
- Very, very small RAMdisk (300MB)
31. Accessing Local File Systems
- I/O system calls on compute nodes are executed on the compute node
- File systems on the slave can be made available to tasks running there and accessed as on any Linux system
- The recommended programming model does not assume that a task will run on a specific node
- Best used for temporary storage
- Access permissions may be a problem
- Very small on newer systems like Ranger
32. Filesystems (cont.)
- Remote - share a single disk among all nodes
- Every node sees the same filesystem
- Synchronization mechanisms manage changes
- The "traditional" UNIX approach
- Relatively low performance
- Doesn't scale well: the server becomes a bottleneck in large systems
- Simplest solution for small clusters and for reading/writing small files
33. Accessing Network File Systems
- Network file systems such as NFS and AFS can be mounted by slave nodes
- Provide a shared storage space for home directories, parameter files, and smaller data files
- Performance problems can be severe for a very large number of nodes (>100)
- Otherwise, they work like local file systems
34. Filesystems (cont.)
- Parallel - stripe files across multiple disks on multiple nodes
- Relatively high performance
- Each node sees the same filesystem
- Works best for I/O-intensive applications
- Not a good solution for small files
- Certain slave nodes are designated I/O nodes; their local disks are used to store pieces of the filesystem
35. Accessing Parallel File Systems
- Distribute file data among many I/O nodes (servers), potentially every node in the system
- Typically not so good for small files, but very good for large data files
- Should provide good performance even for a very large degree of sharing
- Critical for scalability in applications with large I/O demands
- Particularly good for the data parallel model
36. Using File Systems
- Local File Systems
- EXT3, /tmp
- Network File Systems
- NFS, AFS
- Parallel File Systems
- PVFS, LUSTRE, IBRIX, Panasas
- I/O Libraries
- HDF, NetCDF, Panda
37. Example Application for Parallel I/O
(Figure: data flow Input -> Read -> Process -> Write -> Output.)
38. Issues in Parallel I/O
- The physical distribution of data to I/O nodes interacts with the logical distribution of the I/O requests to affect performance
- Logical record sizes should be considered in the physical distribution
- I/O buffer sizes depend on the physical distribution and the number of tasks
- Performance is best with rather large requests
- Buffering should be used to get requests of 1MB or more, depending on the size of the system
39. I/O Libraries
- May make I/O simpler for certain applications
- Multidimensional data sets
- Special data formats
- Consistent access to shared data
- "Out-of-core" computation
- May hide some details of parallel file systems
- Partitioning
- May provide access to special features
- Caching, buffering, asynchronous I/O, performance
40. MPI-IO
- Common file operations
- MPI_File_open()
- MPI_File_close()
- MPI_File_read()
- MPI_File_write()
- MPI_File_read_at()
- MPI_File_write_at()
- MPI_File_read_shared()
- MPI_File_write_shared()
- Open and close are collective. The rest have collective counterparts; add _all to the name
41. MPI_File_open
- MPI_File_open(MPI_Comm comm,
                char *filename,
                int amode,
                MPI_Info info,
                MPI_File *fh)
- Collective operation on comm
- amode is similar to a UNIX file mode, with a few extra MPI possibilities
42. MPI_File_close
- MPI_File_close(MPI_File *fh)
43. File Views
- File views are supported
- MPI_File_set_view()
- Essentially, a file view changes your program's treatment of a file from a simple stream of bytes to a set of MPI datatypes and displacements
- The arguments to set_view are similar to the arguments for creating derived datatypes
44. MPI_File_read
- MPI_File_read(MPI_File fh,
               void *buf,
               int count,
               MPI_Datatype datatype,
               MPI_Status *status)
45. MPI_File_read_at
- MPI_File_read_at(MPI_File fh,
                   MPI_Offset offset,
                   void *buf,
                   int count,
                   MPI_Datatype datatype,
                   MPI_Status *status)
- MPI_File_read_at_all() is the collective version
46. Non-Blocking I/O
- MPI_File_iread()
- MPI_File_iwrite()
- MPI_File_iread_at()
- MPI_File_iwrite_at()
- MPI_File_iread_shared()
- MPI_File_iwrite_shared()
47. MPI_File_iread
- MPI_File_iread(MPI_File fh,
                 void *buf,
                 int count,
                 MPI_Datatype datatype,
                 MPI_Request *request)
- The request can be queried (e.g. with MPI_Test) or waited on (MPI_Wait) to determine when the operation is complete
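- A minimal sketch (the file name "datafile" and the 1024-double buffer are placeholders): start the read, overlap other work, then wait on the request.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      double buf[1024];
      MPI_File fh;
      MPI_Request req;
      MPI_Status status;

      MPI_Init(&argc, &argv);

      MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY,
                    MPI_INFO_NULL, &fh);

      MPI_File_iread(fh, buf, 1024, MPI_DOUBLE, &req);

      /* ... overlap computation with the read here ... */

      MPI_Wait(&req, &status);   /* or poll with MPI_Test(&req, &flag, &status) */

      MPI_File_close(&fh);
      MPI_Finalize();
      return 0;
  }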
48. Collective Access
- The _shared routines use a shared file pointer
- Collective routines are also provided to allow each task to read/write a specific chunk of the file
- MPI_File_read_ordered(MPI_File fh, void *buf, int count, MPI_Datatype type, MPI_Status *st)
- MPI_File_write_ordered()
- MPI_File_seek_shared()
- MPI_File_read_all()
- MPI_File_write_all()
49. File Functions
- MPI_File_delete()
- MPI_File_set_size()
- MPI_File_preallocate()
- MPI_File_get_size()
- MPI_File_get_group()
- MPI_File_get_amode()
- MPI_File_set_info()
- MPI_File_get_info()
50. ROMIO MPI-IO Implementation
- Implementation of the MPI-2 I/O specification
- Operates on a wide variety of platforms
- The Abstract Device Interface for I/O (ADIO) aids in porting to new file systems
- Fortran and C bindings
- Successes
- Adopted by industry (e.g. Compaq, HP, SGI)
- Used at ASCI sites (e.g. LANL Blue Mountain)
51. Data Staging for Tiled Display
- Commodity components
- Projectors, PCs
- Provide very high resolution visualization
- A staging application splits frames into a tile stream for each visualization node
- Uses MPI-IO to access data from the PVFS file system
- Streams of tiles are merged into movie files on the visualization node
52. Splitting Movie Frames into Tiles
- Hundreds of frames make up a single movie
- Each frame is stored in its own file in PVFS
- Frame size is 2532x1408 pixels
- 3x2 display
- Tile size is 1024x768 pixels (overlapped)
53. Obtaining the Highest Performance
- To make the best use of PVFS
- Use MPI-IO (ROMIO) for data access
- Use file views and datatypes
- Take advantage of collectives
- Use hints to optimize for your platform
- Simple, right? :-)
54. Trivial MPI-IO Example
- Reading contiguous pieces with MPI-IO calls
- Simplest, least powerful way to use MPI-IO
- Easy to port from POSIX calls
- Lots of I/O operations to get the desired data

  MPI_File_open(comm, fname, MPI_MODE_RDONLY, MPI_INFO_NULL, &handle);

  /* read tile data from one frame */
  for (row = 0; row < 768; row++) {
      offset = row * row_size + tile_offset + header_size;
      MPI_File_read_at(handle, offset, buffer, 1024*3, MPI_BYTE, &status);
  }

  MPI_File_close(&handle);
55. Avoiding the VFS Layer
- UNIX calls go through the VFS layer
- MPI-IO calls use the file system library directly
- Significant performance gain
56. Why Use File Views?
- Concisely describe noncontiguous regions in a file
- Create a datatype describing the region
- Assign the view to an open file handle
- Separate the description of the region from the I/O operation
- The datatype can be reused on subsequent calls
- Access these regions with a single operation
- A single MPI read call requests all the data
- Provides an opportunity for optimization of access in the MPI-IO implementation
57. Setting a File View
- Use MPI_Type_create_subarray() to define a datatype describing the data in the file
- Example for tile access (24-bit data)

  MPI_Type_contiguous(3, MPI_BYTE, &rgbtype);

  frame_size[1] = 2532;   /* frame width  */
  frame_size[0] = 1408;   /* frame height */
  tile_size[1]  = 1024;   /* tile width   */
  tile_size[0]  = 768;    /* tile height  */

  /* create datatype describing tile */
  MPI_Type_create_subarray(2, frame_size, tile_size, tile_offset,
                           MPI_ORDER_C, rgbtype, &tiletype);
  MPI_Type_commit(&tiletype);

  MPI_File_set_view(handle, header_size, rgbtype, tiletype,
                    "native", MPI_INFO_NULL);
  MPI_File_read(handle, buffer, buffer_size, rgbtype, &status);
58. Noncontiguous Access in ROMIO
- ROMIO performs data sieving to cut down the number of I/O operations
- Uses large reads which grab multiple noncontiguous pieces
- Example: reading tile 1
59. Data Sieving Performance
- Reduces I/O operations from 4600 to 6
- 87% effective throughput improvement
- Reads 3 times as much data as necessary
60. Collective I/O
- MPI-IO supports collective I/O calls (_all suffix)
- All processes call the same function at once
- May vary parameters (to access different regions)
- More fully describe the access pattern as a whole
- Explicitly define the relationship between accesses
- Allow use of ROMIO aggregation optimizations
- Flexibility in which processes interact with the I/O servers
- Fewer, larger I/O requests
61. Collective I/O Example

  /* create datatype describing tile */
  MPI_Type_create_subarray(2, frame_size, tile_size, tile_offset,
                           MPI_ORDER_C, rgbtype, &tiletype);
  MPI_Type_commit(&tiletype);
  MPI_File_set_view(handle, header_size, rgbtype, tiletype,
                    "native", MPI_INFO_NULL);

  #if 0
  MPI_File_read(handle, buffer, buffer_size, rgbtype, &status);
  #endif

  /* collective read */
  MPI_File_read_all(handle, buffer, buffer_size, rgbtype, &status);
62. Two-Phase Access
- ROMIO implements two-phase collective I/O
- Data is read by clients in contiguous pieces (phase 1)
- Data is redistributed to the correct client (phase 2)
- ROMIO applies two-phase when collective accesses overlap between processes
- More efficient I/O access than data sieving alone
63. Two-Phase Performance
64. Hints
- Controlling PVFS
- striping_factor - number of I/O servers to stripe across
- striping_unit - size of the stripes on the I/O servers
- start_iodevice - which I/O server to start with
- Controlling aggregation
- cb_config_list - list of aggregators
- cb_nodes - number of aggregators (upper bound)
- Tuning ROMIO optimizations
- romio_cb_read, romio_cb_write - aggregation on/off
- romio_ds_read, romio_ds_write - data sieving on/off
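- Hints are passed as string key/value pairs in an MPI_Info object at open time (or later with MPI_File_set_info()); a minimal sketch with illustrative values:

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_File fh;
      MPI_Info info;

      MPI_Init(&argc, &argv);

      MPI_Info_create(&info);
      /* values are strings; these particular numbers are only illustrative */
      MPI_Info_set(info, "striping_factor", "8");       /* 8 I/O servers     */
      MPI_Info_set(info, "striping_unit", "1048576");   /* 1 MB stripes      */
      MPI_Info_set(info, "romio_cb_write", "enable");   /* force aggregation */

      MPI_File_open(MPI_COMM_WORLD, "datafile",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

      MPI_Info_free(&info);
      MPI_File_close(&fh);
      MPI_Finalize();
      return 0;
  }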
65. The Proof Is in the Performance
- Final performance is almost 3 times that of VFS access!
- Hints allowed us to turn off two-phase and modify the striping of the data
66. A More Sophisticated I/O Example
- Dividing a 2D matrix with ghost rows
67. File on Disk vs. in Memory
(Figure: the full dataset as laid out on disk, versus one processor's partition in memory with ghost rows on its borders.)
68.

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int gsizes[2], psizes[2], lsizes[2], memsizes[2];
      int dims[2], periods[2], coords[2], start_indices[2];
      MPI_Comm comm;
      MPI_Datatype filetype, memtype;
      MPI_File fh;
      MPI_Status status;
      float local_array[12][12];
      int rank, m, n;

      m = 20;  n = 30;
      gsizes[0] = m;  gsizes[1] = n;
      psizes[0] = 2;  psizes[1] = 3;
      lsizes[0] = m / psizes[0];   /* Rows in local array */
      lsizes[1] = n / psizes[1];   /* Cols in local array */
      dims[0] = 2;  dims[1] = 3;
      periods[0] = periods[1] = 1;

      MPI_Init(&argc, &argv);
      MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm);
      MPI_Comm_rank(comm, &rank);

69. (continued)

      MPI_Cart_coords(comm, rank, 2, coords);
      start_indices[0] = coords[0] * lsizes[0];
      start_indices[1] = coords[1] * lsizes[1];

      /* filetype: this process's block within the global array on disk */
      MPI_Type_create_subarray(2, gsizes, lsizes, start_indices,
                               MPI_ORDER_C, MPI_FLOAT, &filetype);
      MPI_Type_commit(&filetype);

      MPI_File_open(MPI_COMM_WORLD, "datafile",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY,
                    MPI_INFO_NULL, &fh);
      MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL);

      /* memtype: the interior of the local array, skipping ghost cells */
      memsizes[0] = lsizes[0] + 2;
      memsizes[1] = lsizes[1] + 2;
      start_indices[0] = start_indices[1] = 1;

      MPI_Type_create_subarray(2, memsizes, lsizes, start_indices,
                               MPI_ORDER_C, MPI_FLOAT, &memtype);
      MPI_Type_commit(&memtype);

      MPI_File_write_all(fh, local_array, 1, memtype, &status);
      MPI_File_close(&fh);

      MPI_Finalize();
      return 0;
  }
70. Summary: Why Use MPI-IO?
- Better concurrent access model than the POSIX one
- Explicit list of processes accessing concurrently
- More lax (but still very usable) consistency model
- More descriptive power in the interface
- Derived datatypes for concise, noncontiguous file and/or memory regions
- Collective I/O functions
- Optimizations built into the MPI-IO implementation
- Noncontiguous access
- Collective I/O (aggregation)
- Performance portability
71. Optional, Really Advanced Stuff
- Dynamic Process Management
- Intercommunicator communication
- MPI external connections
72. Creating Communicators
- int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
- MPI::Intracomm MPI::Intracomm::Dup() const
- MPI::Intercomm MPI::Intercomm::Dup() const
- MPI::Cartcomm MPI::Cartcomm::Dup() const
- MPI::Graphcomm MPI::Graphcomm::Dup() const
- Creates an exact copy of the communicator
73. Creating Communicators
- int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)
- MPI::Intracomm MPI::Intracomm::Create(...) const
- MPI::Intercomm MPI::Intercomm::Create(...) const
- Creates a new communicator containing the processes in group
- group must be a subset of the group of comm
- int MPI_Comm_split(comm, color, key, newcomm)
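- A minimal sketch of the group-based path (the choice of ranks 0 and 1 is illustrative, and at least two ranks are assumed): extract the group of MPI_COMM_WORLD, keep a subset, and build a communicator from it.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int ranks[2] = {0, 1};            /* subset: the first two ranks      */
      MPI_Group world_group, sub_group;
      MPI_Comm sub_comm;

      MPI_Init(&argc, &argv);

      MPI_Comm_group(MPI_COMM_WORLD, &world_group);
      MPI_Group_incl(world_group, 2, ranks, &sub_group);

      /* collective over MPI_COMM_WORLD; ranks outside the group get
         MPI_COMM_NULL back                                                 */
      MPI_Comm_create(MPI_COMM_WORLD, sub_group, &sub_comm);

      if (sub_comm != MPI_COMM_NULL)
          MPI_Comm_free(&sub_comm);
      MPI_Group_free(&sub_group);
      MPI_Group_free(&world_group);
      MPI_Finalize();
      return 0;
  }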
74. Destroying Communicators
- int MPI_Comm_free(MPI_Comm *comm)
- void MPI::Comm::Free()
- Destroys the named communicator
75. Dynamic Process Management
- Create new processes from running programs (as opposed to with mpirun)
- MPI_Comm_spawn
- (for SPMD-style programs)
- MPI_Comm_spawn_multiple
- (for MPMD-style programs)
- Connecting two (or more) applications together
- MPI_Comm_accept and MPI_Comm_connect
- Useful in assembling complex distributed applications
76. Dynamic Process Management
- Issues
- Maintaining simplicity, flexibility, and correctness
- Interaction with the OS, resource manager, and process manager
- Connecting independently started processes
- Spawning new processes is collective, returning an intercommunicator
- The local group is the group of spawning processes
- The remote group is the group of new processes
- New processes have their own MPI_COMM_WORLD
- MPI_Comm_get_parent lets new processes find the parent communicator
77. Intercommunicators
- Contain a local group and a remote group
- Point-to-point communication is between a process in one group and a process in the other
- Can be merged into a normal communicator
- Created by MPI_Intercomm_create()
78. Spawning Processes
- int MPI_Comm_spawn(command, argv, numprocs, info, root, comm, intercomm, errcodes)
- Tries to start numprocs processes running command, passing them command-line arguments argv
- The operation is collective over comm
- Spawnees are in the remote group of intercomm
- Errors are reported on a per-process basis in errcodes
- info can optionally specify hostname, archname, etc.
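- A minimal sketch (the executable name "./worker" and the count of 4 are placeholders): the parent spawns four workers and gets back an intercommunicator whose remote group is the workers.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Comm children;
      int errcodes[4];

      MPI_Init(&argc, &argv);

      /* "./worker" is a placeholder; it must be an MPI program */
      MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                     0, MPI_COMM_WORLD, &children, errcodes);

      /* inside the worker, MPI_Comm_get_parent() returns the matching
         intercommunicator                                               */

      MPI_Comm_disconnect(&children);
      MPI_Finalize();
      return 0;
  }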
79. Spawning Multiple Executables
- MPI_Comm_spawn_multiple(...)
- The command, argv, numprocs, and info arguments all become arrays
- Still collective
80. Establishing Connections
- MPI-2 makes it possible for two MPI jobs started separately to establish communication
- e.g. a visualizer connecting to a simulation
- The connection results in an intercommunicator
- Client/server architecture
- Similar to TCP sockets programming
81. Establishing Connections
- Server
- MPI_Open_port(info, port_name)
- MPI_Comm_accept(port_name, info, root, comm, intercomm)
- Client
- MPI_Comm_connect(port_name, info, root, comm, intercomm)
- Optional name service (like normal UNIX)
- MPI_Publish_name( ... )
- MPI_Lookup_name( ... )
- (not sure if name service is implemented)
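- A minimal sketch of the pattern (role selection via argv and the out-of-band hand-off of the port string are assumptions for illustration; two separately started MPI jobs run this same program):

  #include <string.h>
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      char port_name[MPI_MAX_PORT_NAME];
      MPI_Comm inter = MPI_COMM_NULL;

      MPI_Init(&argc, &argv);

      if (argc > 1 && strcmp(argv[1], "server") == 0) {
          MPI_Open_port(MPI_INFO_NULL, port_name);
          printf("port: %s\n", port_name);   /* hand this string to the client */
          MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
          MPI_Close_port(port_name);
      } else if (argc > 2) {                 /* run as: ./prog client <port>   */
          strncpy(port_name, argv[2], MPI_MAX_PORT_NAME - 1);
          port_name[MPI_MAX_PORT_NAME - 1] = '\0';
          MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
      }

      /* "inter" is an intercommunicator linking the two jobs */
      if (inter != MPI_COMM_NULL)
          MPI_Comm_disconnect(&inter);
      MPI_Finalize();
      return 0;
  }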