Title: Ace104 Lecture 8
1 Ace104 Lecture 8
- Tightly Coupled Components
- MPI (Message Passing Interface)
2 Motivation
- To this point we have focused on highly granular, loosely coupled components via web services (i.e. using SOAP/XML/HTTP)
- Some components need to couple more tightly, e.g. because of
- Rate and volume of data exchange
- Granularity of interfaces
- These components are normally controlled in a unified back-end environment, so inter-component security is a less prominent issue
3 Multi-grained services
- Tight coupling implies fine granularity, but not necessarily an RPC architectural style
- Real-world architectures are built of multi-grained components
- Low-granularity, loosely coupled components communicating via web services
- These components are themselves made up of high-granularity (sub)components communicating via some more efficient mechanism
- Java RMI
- Raw sockets
- MPI, etc.
4 Role of MPI -- HPC is not all
- One good example of this is speeding up numerical operations by parallelization
- Risk management, option pricing, data mining, flow simulation, etc.
- These faster components can then be coupled via web services (e.g. this is the common architectural model of Grid Computing)
- However, tight coupling is more general than parallel computing
- It can be used for any sub-service where performance matters, and has gained popularity recently in this area.
5 Standardization
- The parallel computing community has resorted to community-based standards
- HPF
- MPI
- OpenMP?
- Some commercial products are becoming de facto standards, but only because they are portable
- TotalView parallel debugger, PBS batch scheduler
6 Risks of Standardization
- Failure to involve all stakeholders can result in the standard being ignored
- application programmers
- researchers
- vendors
- Premature standardization can limit production of
new ideas by shutting off support for further
research projects in the area
7 Models for Parallel Computation
- Shared memory (load, store, lock, unlock)
- Message Passing (send, receive, broadcast, ...)
- Transparent (compiler works magic)
- Directive-based (compiler needs help)
- Others (BSP, OpenMP, ...)
- Task farming (scientific term for large
transaction processing)
8 The Message-Passing Model
- A process is (traditionally) a program counter and address space
- Processes may have multiple threads (program counters and associated stacks) sharing a single address space
- Message passing is for communication among processes, which have separate address spaces
- Interprocess communication consists of
- synchronization
- movement of data from one process's address space to another's
9 What is MPI?
- A message-passing library specification
- extended message-passing model
- not a language or compiler specification
- not a specific implementation or product
- For parallel computers, clusters, and heterogeneous networks
- Full-featured
- Designed to provide access to advanced parallel
hardware for end users, library writers, and tool
developers
10 Where Did MPI Come From?
- Early vendor systems (Intel's NX, IBM's EUI, TMC's CMMD) were not portable (or very capable)
- Early portable systems (PVM, p4, TCGMSG, Chameleon) were mainly research efforts
- Did not address the full spectrum of issues
- Lacked vendor support
- Were not implemented at the most efficient level
- The MPI Forum organized in 1992 with broad participation by
- vendors: IBM, Intel, TMC, SGI, Convex, Meiko
- portability library writers: PVM, p4
- users: application scientists and library writers
- Finished in 18 months
11 Novel Features of MPI
- Communicators encapsulate communication spaces for library safety
- Datatypes reduce copying costs and permit heterogeneity
- Multiple communication modes allow precise buffer management
- Extensive collective operations for scalable global communication
- Process topologies permit efficient process placement and user views of process layout
- Profiling interface encourages portable tools
12 MPI References
- The Standard itself
- at http://www.mpi-forum.org
- All MPI official releases, in both PostScript and HTML
- Books
- Using MPI: Portable Parallel Programming with the Message-Passing Interface, 2nd Edition, by Gropp, Lusk, and Skjellum, MIT Press, 1999. Also Using MPI-2, with R. Thakur
- MPI: The Complete Reference, 2 vols., MIT Press, 1999.
- Designing and Building Parallel Programs, by Ian Foster, Addison-Wesley, 1995.
- Parallel Programming with MPI, by Peter Pacheco, Morgan Kaufmann, 1997.
- Other information on the Web
- at http://www.mcs.anl.gov/mpi
- pointers to lots of stuff, including other talks and tutorials, a FAQ, and other MPI pages
13 send/recv
- Basic MPI functionality
- MPI_Send(void *buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm)
- MPI_Recv(void *buf, int count, MPI_Datatype type, int src, int tag, MPI_Comm comm, MPI_Status *stat)
- stat is a C struct returned with at least the following fields
- stat.MPI_SOURCE
- stat.MPI_TAG
- stat.MPI_ERROR
14 Blocking vs. non-blocking
- The send/recv functions on the previous slide are referred to as blocking point-to-point communication
- MPI also has non-blocking send/recv functions that will be studied next class: MPI_Isend, MPI_Irecv
- The semantics of the two are very different; one must understand the rules carefully to write safe programs
15 Blocking recv
- Semantics of blocking recv
- A blocking receive can be started whether or not a matching send has been posted
- A blocking receive returns only after its receive buffer contains the newly received message
- A blocking receive can complete before the matching send has completed (but only after it has started)
16 Blocking send
- Semantics of blocking send
- Can start whether or not a matching recv has been posted
- Returns only after the message in the data envelope is safe to be overwritten
- This can mean that the data was either buffered or that it was sent directly to the receiving process
- Which happens is up to the implementation
- Very strong implications for writing safe programs
17 Examples

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(sendbuf, count, MPI_DOUBLE, 1, tag, comm);
        MPI_Recv(recvbuf, count, MPI_DOUBLE, 1, tag, comm, &stat);
    } else if (rank == 1) {
        MPI_Recv(recvbuf, count, MPI_DOUBLE, 0, tag, comm, &stat);
        MPI_Send(sendbuf, count, MPI_DOUBLE, 0, tag, comm);
    }

Is this program safe? Why or why not?
Yes, this is safe even if no buffer space is available!
18 Examples

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Recv(recvbuf, count, MPI_DOUBLE, 1, tag, comm, &stat);
        MPI_Send(sendbuf, count, MPI_DOUBLE, 1, tag, comm);
    } else if (rank == 1) {
        MPI_Recv(recvbuf, count, MPI_DOUBLE, 0, tag, comm, &stat);
        MPI_Send(sendbuf, count, MPI_DOUBLE, 0, tag, comm);
    }

Is this program safe? Why or why not?
No, this will always deadlock: both ranks block in MPI_Recv before either has posted its send.
19 Examples

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(sendbuf, count, MPI_DOUBLE, 1, tag, comm);
        MPI_Recv(recvbuf, count, MPI_DOUBLE, 1, tag, comm, &stat);
    } else if (rank == 1) {
        MPI_Send(sendbuf, count, MPI_DOUBLE, 0, tag, comm);
        MPI_Recv(recvbuf, count, MPI_DOUBLE, 0, tag, comm, &stat);
    }

Is this program safe? Why or why not?
Often, but not always! It depends on buffer space: if the implementation buffers the sends, both complete; if not, both ranks block in MPI_Send and the program deadlocks.
20 Message order
- Messages in MPI are said to be non-overtaking.
- That is, messages sent from one process to another process are guaranteed to arrive in the order in which they were sent.
- However, nothing is guaranteed about the relative order of messages sent from different processes, regardless of when each send was initiated.
21 Illustration of message ordering
(Diagram: P0 (send), P2 (send), and P1 (recv); two senders, one receiver.)
22 Another example

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(buf1, count, MPI_FLOAT, 2, tag, MPI_COMM_WORLD);
        MPI_Send(buf2, count, MPI_FLOAT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf2, count, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &stat);
        MPI_Send(buf2, count, MPI_FLOAT, 2, tag, MPI_COMM_WORLD);
    } else if (rank == 2) {
        MPI_Recv(buf1, count, MPI_FLOAT, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &stat);
        MPI_Recv(buf2, count, MPI_FLOAT, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &stat);
    }
23 Illustration of previous code
(Diagram: rank 0 sends twice, rank 1 receives then sends, rank 2 receives twice.)
Which message will arrive first?
Impossible to say!
24 Progress
- Progress:
- If a pair of matching send/recv operations has been initiated, at least one of the two operations will complete, regardless of any other actions in the system
- the send will complete, unless the recv is satisfied by another message
- the recv will complete, unless the message sent is consumed by another matching recv
25 Fairness
- MPI makes no guarantee of fairness
- If MPI_ANY_SOURCE is used, a sent message may repeatedly be overtaken by other messages (from different processes) that match the same receive.
26 Send Modes
- To this point, we have studied blocking send routines using standard mode.
- In standard mode, the implementation determines whether buffering occurs.
- This has major implications for writing safe programs.
27 Other send modes
- MPI includes three other send modes that give the user explicit control over buffering.
- These are buffered, synchronous, and ready modes.
- Corresponding MPI functions
- MPI_Bsend
- MPI_Ssend
- MPI_Rsend
28 MPI_Bsend
- Buffered send: allows the user to explicitly create buffer space and attach the buffer to send operations
- MPI_Bsend(void *buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm)
- Note: these are the same arguments as a standard send
- MPI_Buffer_attach(void *buf, int size)
- Creates buffer space to be used with Bsend
- MPI_Buffer_detach(void *buf, int *size)
- Note: in the detach case the void* argument is really a pointer to the buffer's address, so that the address of the buffer can be returned
- Note: the detach call blocks until the buffered messages have been safely sent
- Note: it is up to the user to properly manage the buffer and ensure space is available for any Bsend call
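A minimal sketch of this sequence, hedged: the buffer sizing with MPI_Pack_size plus MPI_BSEND_OVERHEAD follows the standard's recipe, while the array name and message size are illustrative.

    /* Sketch: attach a user buffer, send with MPI_Bsend, then detach.
       Assumes <stdlib.h> for malloc/free; "work" and its size are illustrative. */
    double work[100];
    int    bufsize;
    char  *bsend_buf;

    MPI_Pack_size(100, MPI_DOUBLE, MPI_COMM_WORLD, &bufsize);
    bufsize += MPI_BSEND_OVERHEAD;              /* required overhead per buffered message */
    bsend_buf = (char *) malloc(bufsize);
    MPI_Buffer_attach(bsend_buf, bufsize);

    MPI_Bsend(work, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);  /* copies data into the attached buffer */

    MPI_Buffer_detach(&bsend_buf, &bufsize);    /* blocks until buffered messages are delivered */
    free(bsend_buf);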
29 MPI_Ssend
- Synchronous send
- Ensures that no buffering is used
- Couples the send and receive operations: the send cannot complete until the matching receive has been posted and the message has begun to be received by the remote process
- Very good for testing the buffer safety of a program
30 MPI_Rsend
- Ready send
- The matching receive must be posted before the send, otherwise the program is incorrect
- Can be implemented to avoid handshake overhead when the program is known to meet this condition
- Not very typical; dangerous
31 Implementation observations
- MPI_Send could be implemented as MPI_Ssend, but this would be weird and undesirable
- MPI_Rsend could be implemented as MPI_Ssend, but this would eliminate any performance enhancement
- Standard mode (MPI_Send) is the most likely to be efficiently implemented
32 MPI's Non-blocking Operations
- Non-blocking operations return (immediately) request handles that can be tested and waited on.
- MPI_Isend(start, count, datatype, dest, tag, comm, request)
- MPI_Irecv(start, count, datatype, source, tag, comm, request)
- MPI_Wait(request, status)
- One can also test without waiting
- MPI_Test(request, flag, status)
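A minimal sketch of the usual pattern, hedged: the buffer names, count, tag, and partner rank are illustrative. Posting the receive and the send before waiting avoids the deadlock shown in the earlier blocking example.

    /* Sketch: non-blocking exchange with a partner rank, completed with MPI_Wait. */
    MPI_Request send_req, recv_req;
    MPI_Status  status;

    MPI_Irecv(recvbuf, count, MPI_DOUBLE, partner, tag, MPI_COMM_WORLD, &recv_req);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, partner, tag, MPI_COMM_WORLD, &send_req);

    /* ... computation can overlap with communication here ... */

    MPI_Wait(&recv_req, &status);   /* receive buffer is now full   */
    MPI_Wait(&send_req, &status);   /* send buffer may now be reused */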
33 Multiple Completions
- It is sometimes desirable to wait on multiple requests
- MPI_Waitall(count, array_of_requests, array_of_statuses)
- MPI_Waitany(count, array_of_requests, index, status)
- MPI_Waitsome(count, array_of_requests, array_of_indices, array_of_statuses)
- There are corresponding versions of test for each of these.
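A small hedged sketch of MPI_Waitall: the number of requests, buffer sizes, and source ranks are illustrative (it assumes at least five processes).

    /* Sketch: post several receives, then complete them all at once. */
    MPI_Request reqs[4];
    MPI_Status  stats[4];
    double      bufs[4][100];
    int         i;

    for (i = 0; i < 4; i++)
        MPI_Irecv(bufs[i], 100, MPI_DOUBLE, i + 1, 0, MPI_COMM_WORLD, &reqs[i]);

    MPI_Waitall(4, reqs, stats);    /* returns when all four receives have completed */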
34 Embarrassingly parallel examples
- Mandelbrot set
- Monte Carlo Methods
- Image manipulation
35 Embarrassingly Parallel
- Also referred to as naturally parallel
- Each processor works on its own sub-chunk of data independently
- Little or no communication required
36 Mandelbrot Set
- Creates pretty and interesting fractal images with a simple recursive algorithm
- z(k+1) = z(k) * z(k) + c
- Both z and c are complex numbers
- For each point c we compute this formula until either
- a specified number of iterations has occurred, or
- the magnitude of z surpasses 2
- In the former case the point is in the Mandelbrot set
- In the latter case it is not in the Mandelbrot set
37 Parallelizing Mandelbrot Set
- What are the major defining features of the problem?
- Each point is computed completely independently of every other point
- Load balancing issues: how to keep procs busy
- Strategies for parallelization?
38 Mandelbrot Set Simple Example
- See mandelbrot.c and mandelbrot_par.c for simple serial and parallel implementations
- Think about how load balancing could be handled better (a rough sketch of one static decomposition follows)
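The course files themselves are not reproduced here; the following is only a hedged sketch of one possible static decomposition, with the image size, iteration limit, and pixel-to-plane mapping chosen for illustration.

    /* Sketch: interleaved-row decomposition of the Mandelbrot computation.
       Each rank computes every numprocs-th row; interleaving rows gives a
       crude form of load balancing. Not the course's mandelbrot_par.c. */
    #include <mpi.h>

    #define WIDTH  800
    #define HEIGHT 800

    static int mandel_iters(double cre, double cim, int max_iter) {
        double zre = 0.0, zim = 0.0;
        int k = 0;
        while (k < max_iter && zre*zre + zim*zim <= 4.0) {   /* |z| <= 2 */
            double t = zre*zre - zim*zim + cre;              /* z = z*z + c */
            zim = 2.0*zre*zim + cim;
            zre = t;
            k++;
        }
        return k;
    }

    int main(int argc, char *argv[]) {
        int rank, numprocs, row, col;
        int counts[WIDTH];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

        for (row = rank; row < HEIGHT; row += numprocs) {
            for (col = 0; col < WIDTH; col++) {
                double cre = -2.0 + 3.0 * col / WIDTH;       /* map pixel to complex plane */
                double cim = -1.5 + 3.0 * row / HEIGHT;
                counts[col] = mandel_iters(cre, cim, 256);
            }
            /* rank 0 could collect each finished row here, e.g. with MPI_Send/MPI_Recv */
        }

        MPI_Finalize();
        return 0;
    }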
39 Monte Carlo Methods
- Generic description of a class of methods that uses random sampling to estimate values of integrals, etc.
- A simple example is to estimate the value of pi
40 Using Monte Carlo to Estimate pi
- The fraction of randomly selected points that lie in the circle approaches the ratio of the areas, hence pi/4.
- The ratio of the area of the circle to the area of the square is pi/4, so pi is approximately 4 times the fraction of points that land in the circle.
- What is the value of pi?
41 Parallelizing Monte Carlo
- What are the general features of the algorithm?
- Each sample is independent of the others
- Memory is not an issue: master-slave architecture?
- Getting independent random numbers in parallel is an issue. How can this be done? (A sketch follows below.)
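A hedged sketch of a parallel estimate of pi: each rank draws its own samples and the hit counts are combined with MPI_Reduce. Seeding the C library rand() with the rank is only a crude stand-in for a real parallel random number generator; the sample count and seed are illustrative.

    /* Sketch: Monte Carlo estimate of pi with MPI. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
        long n_per_rank = 1000000, i, local_hits = 0, total_hits = 0;
        int  rank, numprocs;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

        srand(12345 + rank);                        /* crude per-rank seeding */
        for (i = 0; i < n_per_rank; i++) {
            double x = (double) rand() / RAND_MAX;
            double y = (double) rand() / RAND_MAX;
            if (x*x + y*y <= 1.0)
                local_hits++;                       /* point falls inside the quarter circle */
        }

        MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi is approximately %f\n",
                   4.0 * total_hits / ((double) n_per_rank * numprocs));

        MPI_Finalize();
        return 0;
    }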
42 MPI Datatypes
- The data in a message to send or receive is described by a triple (address, count, datatype), where an MPI datatype is recursively defined as
- predefined, corresponding to a data type from the language (e.g., MPI_INT, MPI_DOUBLE)
- a contiguous array of MPI datatypes
- a strided block of datatypes
- an indexed array of blocks of datatypes
- an arbitrary structure of datatypes
- There are MPI functions to construct custom datatypes, in particular ones for subarrays (a small sketch follows)
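A hedged sketch of one such constructor, MPI_Type_vector, which describes a strided block: here a single column of a small row-major matrix whose dimensions are illustrative.

    /* Sketch: derived datatype for one column of a ROWS x COLS row-major matrix. */
    #define ROWS 4
    #define COLS 6

    double       a[ROWS][COLS];
    MPI_Datatype column_t;

    MPI_Type_vector(ROWS,          /* number of blocks       */
                    1,             /* elements per block     */
                    COLS,          /* stride between blocks  */
                    MPI_DOUBLE, &column_t);
    MPI_Type_commit(&column_t);

    /* send column 2 of the matrix to rank 1 with tag 0 */
    MPI_Send(&a[0][2], 1, column_t, 1, 0, MPI_COMM_WORLD);

    MPI_Type_free(&column_t);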
43 MPI Tags
- Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message
- Messages can be screened at the receiving end by specifying a specific tag, or not screened by specifying MPI_ANY_TAG as the tag in a receive
- Some non-MPI message-passing systems have called tags "message types". MPI calls them tags to avoid confusion with datatypes
44 MPI is Simple
- Many MPI programs can be written using just these six functions, only two of which are non-trivial (a complete example follows the list)
- MPI_INIT
- MPI_FINALIZE
- MPI_COMM_SIZE
- MPI_COMM_RANK
- MPI_SEND
- MPI_RECV
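A hedged, complete sketch using only these six functions: rank 0 collects one integer from every other rank; the value sent is just illustrative.

    /* Sketch: a complete program built from the six basic MPI functions. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size, src, value;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            for (src = 1; src < size; src++) {
                MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &status);
                printf("rank 0 received %d from rank %d\n", value, status.MPI_SOURCE);
            }
        } else {
            value = rank * rank;                 /* something rank-specific to send */
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }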
45 Collective Operations in MPI
- Collective operations are called by all processes in a communicator
- MPI_BCAST distributes data from one process (the root) to all others in a communicator
- MPI_REDUCE combines data from all processes in a communicator and returns it to one process
- In many numerical algorithms, SEND/RECEIVE can be replaced by BCAST/REDUCE, improving both simplicity and efficiency
46 Example: PI in C - 1

    #include "mpi.h"
    #include <math.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int    done = 0, n, myid, numprocs, i, rc;
        double PI25DT = 3.141592653589793238462643;
        double mypi, pi, h, sum, x, a;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        while (!done) {
            if (myid == 0) {
                printf("Enter the number of intervals: (0 quits) ");
                scanf("%d", &n);
            }
            MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
            if (n == 0) break;
47 Example: PI in C - 2

            h   = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double)i - 0.5);
                sum += 4.0 / (1.0 + x*x);
            }
            mypi = h * sum;

            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if (myid == 0)
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
        }
        MPI_Finalize();
        return 0;
    }
48 Alternative Set of 6 Functions
- Using collectives
- MPI_INIT
- MPI_FINALIZE
- MPI_COMM_SIZE
- MPI_COMM_RANK
- MPI_BCAST
- MPI_REDUCE
49 Buffers
- When you send data, where does it go? One possibility is shown below.
(Diagram: the message may be copied from the sender's user data into a local system buffer, across the network into a system buffer on the receiving side, and finally into the receiver's user data.)
50 Avoiding Buffering
- It is better to avoid copies:
(Diagram: Process 0's user data travels across the network directly into Process 1's user data, with no intermediate buffers.)
This requires that MPI_Send wait on delivery, or that MPI_Send return before the transfer is complete, and we wait later.
51 Blocking and Non-blocking Communication
- So far we have been using blocking communication
- MPI_Recv does not complete until the buffer is full (available for use).
- MPI_Send does not complete until the buffer is empty (available for use).
- Completion depends on the size of the message and the amount of system buffering.
52 Sources of Deadlocks
- Send a large message from process 0 to process 1
- If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive)
- What happens with code in which each process sends first and then receives (as in the earlier example)?
- This is called "unsafe" because it depends on the availability of system buffers
53 Some Solutions to the "unsafe" Problem
- Order the operations more carefully
- Supply the receive buffer at the same time as the send (sketches of both fixes follow)
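As a hedged sketch of these two fixes, assuming two ranks exchanging count doubles with the usual rank, tag, and status variables:

    /* Fix 1: order the operations more carefully; rank 0 sends first, rank 1 receives first. */
    if (rank == 0) {
        MPI_Send(sendbuf, count, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, count, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD, &status);
    } else if (rank == 1) {
        MPI_Recv(recvbuf, count, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);
        MPI_Send(sendbuf, count, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
    }

    /* Fix 2: supply the receive buffer at the same time as the send with MPI_Sendrecv. */
    int partner = (rank == 0) ? 1 : 0;
    MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, partner, tag,
                 recvbuf, count, MPI_DOUBLE, partner, tag,
                 MPI_COMM_WORLD, &status);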
54 More Solutions to the "unsafe" Problem
- Supply own space as buffer for the send
- Use non-blocking operations (a sketch follows)
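A hedged sketch of the non-blocking variant, with the same illustrative buffer, count, tag, and partner names as above; the buffered-send variant would instead use MPI_Buffer_attach and MPI_Bsend as on the MPI_Bsend slide.

    /* Fix 3: post non-blocking operations on both ranks, then wait.
       Neither rank blocks until both of its operations have been started. */
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Isend(sendbuf, count, MPI_DOUBLE, partner, tag, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, partner, tag, MPI_COMM_WORLD, &reqs[1]);

    MPI_Waitall(2, reqs, stats);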
55 Collective Operations in MPI
- Collective operations must be called by all processes in a communicator.
- MPI_BCAST distributes data from one process (the root) to all others in a communicator.
- MPI_REDUCE combines data from all processes in a communicator and returns it to one process.
- In many numerical algorithms, SEND/RECEIVE can be replaced by BCAST/REDUCE, improving both simplicity and efficiency.
56 MPI Collective Communication
- Communication and computation are coordinated among a group of processes in a communicator.
- Groups and communicators can be constructed by hand or using topology routines.
- Tags are not used; different communicators deliver similar functionality.
- No non-blocking collective operations.
- Three classes of operations: synchronization, data movement, collective computation.
57 Synchronization
- MPI_Barrier( comm )
- Blocks until all processes in the group of the communicator comm call it.
58 Synchronization (Fortran)
- MPI_Barrier( comm, ierr )
- Blocks until all processes in the group of the communicator comm call it.
59 Collective Data Movement
(Diagrams of the Broadcast, Scatter, and Gather data movement patterns.)
60 More Collective Data Movement
(Diagrams of the Allgather and Alltoall data movement patterns.)
61 Collective Computation
62 MPI Collective Routines
- Many routines: Allgather, Allgatherv, Allreduce, Alltoall, Alltoallv, Bcast, Gather, Gatherv, Reduce, Reduce_scatter, Scan, Scatter, Scatterv
- "All" versions deliver results to all participating processes.
- "V" versions allow the chunks to have different sizes.
- Allreduce, Reduce, Reduce_scatter, and Scan take both built-in and user-defined combiner functions.
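A small hedged sketch of one of these routines, MPI_Allreduce, computing a global sum that every rank receives; the variable names are illustrative.

    /* Sketch: every rank contributes local_sum; afterwards every rank holds the total. */
    double local_sum = 0.0;     /* per-rank partial result, computed elsewhere */
    double global_sum;

    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    /* unlike MPI_Reduce, every rank (not just a root) now has global_sum */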
63 MPI Built-in Collective Computation Operations
- MPI_MAX: maximum
- MPI_MIN: minimum
- MPI_PROD: product
- MPI_SUM: sum
- MPI_LAND: logical and
- MPI_LOR: logical or
- MPI_LXOR: logical exclusive or
- MPI_BAND: bitwise and
- MPI_BOR: bitwise or
- MPI_BXOR: bitwise exclusive or
- MPI_MAXLOC: maximum and location
- MPI_MINLOC: minimum and location
64 How Deterministic are Collective Computations?
- In exact arithmetic, you always get the same results
- but roundoff error and truncation can happen
- MPI does not require that the same input give the same output
- Implementations are encouraged but not required to provide exactly the same output given the same input
- Round-off error may cause slight differences
- Allreduce does guarantee that the same value is received by all processes for each call
- Why didn't MPI mandate determinism?
- Not all applications need it
- Implementations can use "deferred synchronization" ideas to provide better performance
65 Defining your own Collective Operations
- Create your own collective computations with
- MPI_Op_create( user_fcn, commutes, op )
- MPI_Op_free( op )
- user_fcn( invec, inoutvec, len, datatype )
- The user function should perform inoutvec[i] = invec[i] op inoutvec[i] for i from 0 to len-1.
- The user function can be non-commutative.
66 Defining your own Collective Operations (Fortran)
- Create your own collective computations with
- call MPI_Op_create( user_fcn, commutes, op, ierr )
- call MPI_Op_free( op, ierr )
- subroutine user_fcn( invec, inoutvec, len, datatype )
- The user function should perform inoutvec(i) = invec(i) op inoutvec(i) for i from 1 to len.
- The user function can be non-commutative.
67 MPICH Goals
- Complete MPI implementation
- Portable to all platforms supporting the message-passing model
- High performance on high-performance hardware
- As a research project
- exploring the tradeoff between portability and performance
- removal of the performance gap between the user level (MPI) and hardware capabilities
- As a software project
- a useful free implementation for most machines
- a starting point for vendor proprietary implementations
68 MPICH Architecture
- Most code is completely portable
- An "Abstract Device" defines the communication layer
- The abstract device can have widely varying instantiations, using
- sockets
- shared memory
- other special interfaces
- e.g. Myrinet, Quadrics, InfiniBand, Grid protocols
69 Getting MPICH for your cluster
- http://www.mcs.anl.gov/mpi/mpich
- Either MPICH-1 or MPICH-2
70 Performance Visualization with Jumpshot
- For detailed analysis of parallel program behavior, timestamped events are collected into a log file during the run.
- A separate display program (Jumpshot) aids the user in conducting a post mortem analysis of program behavior.
71 High-Level Programming With MPI
- MPI was designed from the beginning to support libraries
- Many libraries exist, both open source and commercial
- Sophisticated numerical programs can be built using libraries
- Solve a PDE (e.g., PETSc)
- Scalable I/O of data to a community standard file format