Title: MPI Workshop I
1. MPI Workshop - I
- Introduction to Point-to-Point Communications
- HPC_at_UNM Research Staff
- Dr. Andrew C. Pineda, Dr. Paul M. Alsing
- Week 1 of 2
2. Table of Contents
- Introduction to Parallelism and MPI
- MPI Standard
- MPI Course Map
- MPI Routines/Exercises
- point-to-point communications basics
- communicators
- blocking versus non-blocking calls
- collective communications basics
- how to run MPI routines at HPC_at_UNM
- References
3. Parallelism and MPI
- Parallelism is accomplished by
- Breaking up the task into smaller tasks
- Assigning the smaller tasks to multiple workers to work on simultaneously
- Coordinating the workers
- Not breaking up the task so small that it takes longer to tell the worker what to do than it does to do it
- Buzzwords: latency, bandwidth
4. Parallelism and MPI
- Message Passing Model
- Multiple processors operate independently, but each has its own private memory (distributed processors and memory)
- Data is shared across a communications network using cooperative operations via library calls
- User is responsible for synchronization using message passing
- Advantages
- Memory is scalable to the number of processors: as you increase the number of processors, the total memory size and bandwidth increase.
- Each processor can rapidly access its own memory without interference from others.
5. Parallelism and MPI
- Another advantage is flexibility of programming schemes
- Functional parallelism - different tasks done at the same time by different nodes
- Master-Slave (Client-Server) parallelism - one process assigns subtasks to other processes
- Data parallelism - data can be distributed
- SPMD parallelism - Single Program, Multiple Data - the same code is replicated to each process but operates on different data
6. Parallelism and MPI
- Disadvantages
- Sometimes difficult to map existing data structures to this memory organization
- User is responsible for sending and receiving data among processors
- To minimize overhead and latency, data should be blocked up in large chunks and shipped before the receiving node needs it
7. Parallelism and MPI
- Message Passing Interface - MPI
- A standard, portable message-passing library definition developed in 1993 by a group of parallel computer vendors, computer scientists, and applications developers.
- Available to both Fortran and C programs (and through these, Fortran 90 and C++).
- Available on a wide variety of parallel machines.
- Target platform is a distributed memory system such as the Los Lobos Linux cluster.
- All inter-task communication is by message passing.
- All parallelism is explicit: the programmer is responsible for identifying the parallelism in the program and implementing it with MPI constructs.
8. MPI Standardization Effort
- The MPI Forum was initiated at an April 1992 Workshop on Message Passing Standards.
- Initially about 60 people from 40 organizations participated.
- Defines an interface that can be implemented on many vendors' platforms with no significant changes in the underlying communication and system software.
- Allows for implementations that can be used in a heterogeneous environment.
- Semantics of the interface should be language independent.
- Currently over 125 people from 52 organizations have contributed to this effort.
9. MPI Standard Releases
- May 1994: MPI Standard version 1.0
- June 1995: MPI Standard version 1.1
- includes minor revisions of 1.0
- July 1997: MPI Standard versions 1.2 and 2.0
- with extended functionality
- 2.0 - supports real-time operations, spawning of processes, more collective operations
- 2.0 - explicit C++ and F90 bindings
- Complete PostScript and HTML documentation can be found at http://www.mpi-forum.org/docs/docs.html
- Currently available at HPC_at_UNM
10. MPI Implementations
- Message Passing Libraries
- MPI - Message Passing Interface
- PVM - Parallel Virtual Machine
- Public Domain MPI Implementations
- MPICH (ANL and MSU) (v. 1.2.5, 6 Jan 2003)
- www-unix.mcs.anl.gov/mpi/mpich/
- MPICH2 (ANL and MSU) (v. 0.94, 22 Aug 2003)
- LAM (v. 7.0, MPI 1.2 much of 2.0, 2 Jul 2003)
- www.lam-mpi.org
- Vendor MPI Implementations
- IBM-MPI, SGI (based on MPICH), others
- Available on HPC_at_UNM platforms.
11. Course Roadmap
12. Program Examples/MPI Calls
- Hello - basic MPI code with no communications.
- Illustrates some basic aspects of parallel programming
- MPI_INIT - starts MPI communications
- MPI_COMM_RANK - get processor id
- MPI_COMM_SIZE - get number of processors
- MPI_FINALIZE - end MPI communications
- Swap - basic MPI point-to-point messages
- MPI_SEND - blocking send
- MPI_RECV - blocking receive
- MPI_IRECV, MPI_WAIT - non-blocking receive
13. Program Examples/MPI Calls
- Client-server - basic MPI code illustrating functional parallelism
- MPI_INIT - starts MPI communications
- MPI_COMM_RANK - get processor id
- MPI_COMM_SIZE - get number of processors
- MPI_ATTR_GET - get attributes (in this case, the maximum allowed tag value)
- MPI_BARRIER - synchronization
- MPI_BCAST - collective call - broadcast
- MPI_PROBE - peek at the message queue to see what the next message is / wait for a message
- MPI_SEND/MPI_RECV
- MPI_FINALIZE - end MPI communications
14. MPI 1.x Language Bindings
- Fortran 77
- include mpif.h
- call MPI_ABCDEF(list of arguments, IERROR)
- Fortran 90 via the Fortran 77 library
- F90 strong type checking of arguments can cause difficulties - cannot handle more than one object type
- include mpif90.h in F90
- ANSI C
- include mpi.h
- IERROR = MPI_Abcdef(list of arguments)
- C++ via the C library
- via extern "C" declaration, include mpi.h
15. 1st Example
- Hello, World in MPI
- hello.f - Fortran 90 version
- hello.c - C version
- Can be found in the mpi1 subdirectory of your guest account.
- Goals
- Demonstrate basic elements of an MPI program
- Describe available compiler/network combinations
- Basic introduction to the job scheduler (PBS) used on HPC_at_UNM platforms
16. 1st Example
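- A minimal hello.c sketch consistent with the calls listed on slides 12 and 15 (the workshop's actual hello.c may differ in details such as its output format):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int rank, size;

      MPI_Init(&argc, &argv);                 /* start MPI communications */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* get this process's id */
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* get number of processes */

      printf("Hello, World from process %d of %d\n", rank, size);

      MPI_Finalize();                         /* end MPI communications */
      return 0;
  }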
17. Communicators in MPI
- Communicators provide a handle to a group of processes/processors.
- Within each communicator, a process is assigned a rank.
- MPI provides a standard communicator, MPI_COMM_WORLD, that represents all of the processes/processors.
- MPI_Comm_rank and MPI_Comm_size return the rank and the number of processors.
- The mapping of processors to ranks is implementation dependent.
- Communicators appear in virtually every MPI call.
- Communicators can be created (see the sketch below)
- Can create subgroups of processors
- Can be used to map the topology of processors onto the topology of data using the MPI Cartesian topology functions
- Useful if your data can be mapped onto a grid.
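- The workshop examples use only MPI_COMM_WORLD, but as an illustration of creating subgroup communicators, here is a minimal sketch using MPI_Comm_split; the variable names and the even split into two halves are our own choices, not part of the course code:

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int world_rank, world_size, sub_rank, color;
      MPI_Comm half_comm;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
      MPI_Comm_size(MPI_COMM_WORLD, &world_size);

      /* color 0 = lower half of the ranks, color 1 = upper half;
         each process gets a new rank within its subgroup */
      color = (world_rank < world_size / 2) ? 0 : 1;
      MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &half_comm);
      MPI_Comm_rank(half_comm, &sub_rank);

      printf("world rank %d -> group %d, sub rank %d\n",
             world_rank, color, sub_rank);

      MPI_Comm_free(&half_comm);
      MPI_Finalize();
      return 0;
  }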
18. 1st Example
19. Compiling MPI Codes
- You invoke your compiler via scripts that tack on the appropriate MPI include and library files
- mpif77 -o <progname> <filename>.f
- mpif77 -c <filename>.f
- mpif77 -o <progname> <filename>.o
- mpif90 -o <progname> <filename>.f90
- mpicc -o <progname> <filename>.c
- mpiCC -o <progname> <filename>.cc (not supported)
- These scripts save you from having to know where the libraries are located. Use them!!!
- The underlying compiler (NAG, PGI, etc.) is determined by how MPIHOME and PATH are set up in your environment.
20. Compiling MPI Codes (cont'd.)
- MPICH
- Two choices of communications networks
- eth - Fast Ethernet (100 Mb/sec)
- gm - Myrinet (1.2 Gb/sec, Los Lobos)
- Gigabit Ethernet (1.0 Gb/sec, Azul)
- Many compilers
- NAG F95 - f95
- PGI - pgf77, pgcc, pgCC, pgf90
- GNU Compiler Suite - gcc, g77
- The combination is determined by your environment.
21. Compiling MPI Codes (cont'd.)
- Your MPIHOME setting determines your choice of the underlying compilers and communications network for your compilation.
- Compilers
- PGI - Portland Group (pgcc, pgf77, pgf90)
- GCC - GNU Compiler Suite (gcc, g77, g++)
- NAG - Numerical Algorithms Group (f95)
- Networks
- Fast Ethernet (ch_p4)
- Myrinet (ch_gm)
22. Supported MPIHOME Values
Stealth PGI!!!
23. Portable Batch Scheduler (PBS)
- To submit a job, use
- qsub file.pbs
- file.pbs is a shell script that invokes mpirun
- qsub -q R1234 file.pbs
- submit to a reservation queue R1234
- qsub -I -l nodes=1
- interactive session
- To check status
- qstat
- qstat -a (shows status of everyone's jobs)
- qstat -n jobid (shows nodes assigned to a job)
- To cancel a job
- qdel job_id
24. PBS Command File (file.pbs)
- Just a variation on a shell script
- #PBS -l nodes=4:ppn=2,walltime=4:00:00
- any setup you need to do, e.g. staging data
- mpirun -np 8 -machinefile $PBS_NODEFILE <executable or script>
- cleanup or save auxiliary files
See man qsub for other -l options.
The script runs on the head node. Use ssh or dsh to run commands on others. In most cases, you can rely on PBS to clean up after you.
25. Lab Exercise 1
- Download, compile, and run hello.f or hello.c.
- Run several times with different numbers of processors.
- Do the results always come out the same?
- If not, can you explain why?
- Copy the files from the mpi1 subdirectory of your guest account.
26. 2nd Example Code
- Swap
- swap.f - F90 implementation
- swap.c - C implementation
- Goals
- Illustrate a basic message exchange among a few processors
- Introduce the basic flavors of send and receive operations
- Illustrate potential pitfalls such as deadlock situations
- Can be found in the mpi1 subdirectory of your guest account.
27. Basic Send/Receive Operations
- Send
- MPI_Send - standard-mode blocking send
- blocking: does not return until the message buffer can be reused
- semantics are blocking, but it may not block in practice
- vendor is free to implement it in the most efficient manner
- in most cases you should use this
- MPI_Isend - immediate (non-blocking) send
- lets MPI know we are sending data and where it is, but we return immediately with the promise that we will not touch the data until we have verified that the message buffer can be safely reused
- pair with MPI_Wait or MPI_Test to complete/verify completion
- allows overlap of communication and computation by letting the processor do work while we wait on completion (see the sketch below)
- Many other flavors: MPI_Bsend (Buffered Send), MPI_Ssend (Synchronous Send), MPI_Rsend (Ready Send), to name a few.
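- A minimal sketch of the MPI_Isend/MPI_Wait pairing and the communication/computation overlap mentioned above; the buffer name, message size, and busy-work loop are our own illustration, not the workshop code. Run with 2 processes:

  #include <stdio.h>
  #include <mpi.h>

  /* Rank 0 starts a non-blocking send, does unrelated work, then waits;
     rank 1 posts a matching blocking receive. */
  int main(int argc, char *argv[])
  {
      int rank, i;
      double buf[1000], work = 0.0;
      MPI_Request request;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          for (i = 0; i < 1000; i++) buf[i] = (double) i;
          MPI_Isend(buf, 1000, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &request);

          /* Computation that does not touch buf may overlap the send. */
          for (i = 0; i < 1000000; i++) work += 1.0e-6;

          MPI_Wait(&request, &status);   /* buf may now be safely reused */
          printf("rank 0: send complete, work = %f\n", work);
      } else if (rank == 1) {
          MPI_Recv(buf, 1000, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
          printf("rank 1: received, buf[999] = %f\n", buf[999]);
      }

      MPI_Finalize();
      return 0;
  }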
28. Basic Send/Receive Operations
- Receive
- MPI_Recv - standard-mode blocking receive
- does not return until the data from the matching send is in the receive buffer
- MPI_Irecv - immediate (non-blocking) receive
- lets MPI know we are expecting data and where to put it
- pair with MPI_Wait or MPI_Test to complete
- Unlike sends, these are the only two flavors of receive.
- Completing non-blocking calls
- MPI_Wait - blocking
- MPI_Test - non-blocking
- Getting information about an incoming message
- MPI_Probe - blocking
- MPI_Iprobe - non-blocking
29. Send/Receive Structure
- Basic structure of a standard send/receive call (see the annotated sketch below)
- Fortran
- MPI_Recv(data components, message envelope, status, ierror)
- C
- ierror = MPI_Recv(data components, message envelope, status)
- The data components consist of 3 parts
- data buffer (holds the data you are sending)
- size of buffer - in units of the data type (e.g. 5 integers, or 5 floats)
- type descriptor - corresponds to standard language data types
- The message envelope consists of 4 parts, 3 of which are specified
- source/destination - integer
- tag - an integer label between 0 and an implementation-dependent maximum value (> 32K-1)
- communicator
- Status (does not appear in send operations) and ierror
- In Fortran, status is an array used to return information about the received message, e.g. source, tag. Example: status(MPI_SOURCE)
- In C, status is a C structure. Example: status.MPI_SOURCE
- ierror returns the error status.
- Other types of sends and receives require additional arguments; see the supplementary materials.
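- A short C sketch (ours, with a hypothetical buffer name and tag) annotating which arguments form the data components, the envelope, and the status; in C the integer return value plays the role of ierror. Run with 2 processes:

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int rank, i;
      double data[5];
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          for (i = 0; i < 5; i++) data[i] = i + 0.5;
          MPI_Send(data, 5, MPI_DOUBLE,      /* data components: buffer, count, type */
                   1, 99, MPI_COMM_WORLD);   /* envelope: destination, tag, communicator */
      } else if (rank == 1) {
          MPI_Recv(data, 5, MPI_DOUBLE,      /* data components */
                   0, 99, MPI_COMM_WORLD,    /* envelope: source, tag, communicator */
                   &status);                 /* status: source and tag of the received message */
          printf("got tag %d from rank %d\n", status.MPI_TAG, status.MPI_SOURCE);
      }

      MPI_Finalize();
      return 0;
  }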
30. MPI Type Descriptors
- C types
- MPI_INT
- MPI_CHAR
- MPI_FLOAT
- MPI_DOUBLE
- MPI_LONG
- MPI_UNSIGNED (unsigned int)
- Many others corresponding to other C types
- MPI_BYTE
- MPI_PACKED
- Fortran 77/Fortran 90 types
- MPI_INTEGER
- MPI_CHARACTER
- MPI_REAL
- MPI_DOUBLE_PRECISION
- MPI_COMPLEX
- MPI_LOGICAL
- MPI_BYTE
- MPI_PACKED
31. Matching Sends to Receives
- Message envelope - consists of the source, destination, tag, and communicator values.
- A message can only be received if the specified envelope agrees with the message envelope.
- The source and tag portions can be wildcarded using MPI_ANY_SOURCE and MPI_ANY_TAG. (Useful for writing client-server applications.)
- Source = destination is allowed, except for blocking operations.
- Variable types of the messages must match.
- In heterogeneous systems, MPI automatically handles data conversions, e.g. big-endian to little-endian.
- Messages (with the same envelope) are non-overtaking.
32. 2nd Example
33. 2nd Example
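- These two slides walked through swap.f/swap.c. As a stand-in, here is a minimal C sketch of a two-process exchange with blocking calls, ordered so that one rank sends first while the other receives first (one safe ordering; the workshop's swap code may differ in its data and ordering):

  #include <stdio.h>
  #include <mpi.h>

  /* Two ranks exchange one integer. Rank 0 sends then receives; rank 1
     receives then sends, so the blocking calls cannot deadlock. */
  int main(int argc, char *argv[])
  {
      int rank, mine, theirs = 0, partner, tag = 0;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      mine = 100 + rank;
      partner = 1 - rank;            /* 0 <-> 1 */

      if (rank == 0) {
          MPI_Send(&mine, 1, MPI_INT, partner, tag, MPI_COMM_WORLD);
          MPI_Recv(&theirs, 1, MPI_INT, partner, tag, MPI_COMM_WORLD, &status);
      } else if (rank == 1) {
          MPI_Recv(&theirs, 1, MPI_INT, partner, tag, MPI_COMM_WORLD, &status);
          MPI_Send(&mine, 1, MPI_INT, partner, tag, MPI_COMM_WORLD);
      }

      if (rank < 2)
          printf("rank %d had %d, received %d\n", rank, mine, theirs);

      MPI_Finalize();
      return 0;
  }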
34. Lab Exercise 2
- swap.f, swap.c
- Compile and run either the Fortran or C code with two processes.
- Try running with the send and receive operations in the two sections in the sequences shown below (in addition to that in the code). What happens in each case?
- Can be found in the mpi1 subdirectory of your guest account.
35. Non-blocking Communications
- Here's what the MPI standard says about how your experiment should have worked out.
The last case fails because both processors are blocked in the receive operation and never execute their sends. Case 1 works if the send is buffered, which allows the sends to complete.
36. 2nd Example Revisited: Non-blocking Calls
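- A minimal sketch of the same exchange using a non-blocking receive, following the MPI_IRECV/MPI_WAIT syntax given in Lab Exercise 3 (our C version; the workshop's revised swap may differ). Run with 2 processes:

  #include <stdio.h>
  #include <mpi.h>

  /* Both ranks post MPI_Irecv first, then send, then wait; this works
     regardless of ordering because neither rank blocks in the receive. */
  int main(int argc, char *argv[])
  {
      int rank, mine, theirs = 0, partner, tag = 0;
      MPI_Request request;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank < 2) {
          mine = 100 + rank;
          partner = 1 - rank;

          MPI_Irecv(&theirs, 1, MPI_INT, partner, tag, MPI_COMM_WORLD, &request);
          MPI_Send(&mine, 1, MPI_INT, partner, tag, MPI_COMM_WORLD);
          MPI_Wait(&request, &status);   /* receive is now complete */

          printf("rank %d had %d, received %d\n", rank, mine, theirs);
      }

      MPI_Finalize();
      return 0;
  }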
37. Non-blocking Calls
- Can use MPI_TEST in place of MPI_WAIT to periodically check on a message rather than blocking and waiting.
- Client-server applications can use MPI_WAITANY or MPI_TESTANY.
- Can peek ahead at messages with MPI_PROBE and MPI_IPROBE. MPI_PROBE is used in our client-server application.
38. Lab Exercise 3
- Change your broken copy of swap to use MPI_IRECV and MPI_WAIT instead of MPI_RECV and try running again. Below is the syntax of these calls.
- C language
- int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
- int MPI_Wait(MPI_Request *request, MPI_Status *status)
- Fortran
- <type> buf(*)
- integer count, datatype, source, tag, comm, request, ierror
- integer request, status(MPI_STATUS_SIZE), ierror
- call MPI_IRECV(buf, count, datatype, source, tag, comm, request, ierror)
- call MPI_WAIT(request, status, ierror)
39. 3rd Example Code
- Client-Server Application
- Illustrates the use of point-to-point communications calls and message tags.
- Illustrates one of the basic paradigms of parallel computing - task decomposition - and provides a general framework for solving a wide range of problems.
- Easy-to-understand example - multiplication of a vector by a matrix.
- In next week's workshop, we'll re-implement this code entirely using collective communications calls.
- This example uses 2 collective calls: MPI_Barrier and MPI_Bcast (the latter for clarity).
40. Collective Communications
MPI_Barrier(communicator, ierror) - used to synchronize processes within a communicator.
MPI_Bcast(data components, source, communicator, ierror) - broadcasts a copy of a piece of data to all processes. Equivalent to N-1 sends from the processor with the data to the remaining processes (see the sketch below).
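- A small C sketch of the broadcast pattern used in the 3rd example to distribute the matrix dimensions; treating rank 0 as the server and the values of m and n are our assumptions for illustration:

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int rank, m = 0, n = 0, server = 0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == server) {    /* only the server knows the dimensions */
          m = 100;
          n = 50;
      }

      /* Every process calls MPI_Bcast; after it returns, all have m and n. */
      MPI_Bcast(&m, 1, MPI_INT, server, MPI_COMM_WORLD);
      MPI_Bcast(&n, 1, MPI_INT, server, MPI_COMM_WORLD);

      printf("rank %d: m = %d, n = %d\n", rank, m, n);

      /* MPI_Barrier synchronizes: no process leaves until all have arrived. */
      MPI_Barrier(MPI_COMM_WORLD);

      MPI_Finalize();
      return 0;
  }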
41. 3rd Example: Matrix-Vector Multiplication
- Goal: compute r = Ax, i.e. r_i = Σ_j a_ij x_j
- Can be decomposed into either a row- or column-oriented algorithm.
- Important because different programming languages have different conventions for storing 2-D arrays.
- C language - arrays stored in row-major order - multiply using the traditional dot-product interpretation of matrix multiplication.
- Fortran language - arrays stored in column-major order - multiplication is a linear combination of columns (see the sketch below).
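- To make the row- versus column-oriented decomposition concrete, a short serial C sketch of both loop orderings for r = Ax (our illustration; the MPI version follows on the next slides):

  /* Serial sketch: r = A*x for an m-by-n matrix A stored row-major in C. */
  void matvec_row_oriented(int m, int n, const double A[], const double x[], double r[])
  {
      /* Row-oriented: r[i] is the dot product of row i with x. */
      for (int i = 0; i < m; i++) {
          r[i] = 0.0;
          for (int j = 0; j < n; j++)
              r[i] += A[i*n + j] * x[j];
      }
  }

  void matvec_column_oriented(int m, int n, const double A[], const double x[], double r[])
  {
      /* Column-oriented: r is a linear combination of the columns of A,
         the view used by the Fortran client-server example. */
      for (int i = 0; i < m; i++)
          r[i] = 0.0;
      for (int j = 0; j < n; j++)
          for (int i = 0; i < m; i++)
              r[i] += A[i*n + j] * x[j];
  }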
42. 3rd Example: Abridged client-server.f Code
- call MPI_INIT(ierror)
- call MPI_COMM_SIZE(MPI_COMM_WORLD, num_processes, ierror)
- call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
- call MPI_ATTR_GET(MPI_COMM_WORLD, MPI_TAG_UB, NO_DATA, flag, ierror)
- ! Set NO_DATA to the largest allowed value for a tag - about 2^27 under the MPICH implementation.
- if (rank .eq. server) then
-   ! read dimensions of A(m,n)
-   call MPI_BCAST(m, 1, MPI_INTEGER, server, MPI_COMM_WORLD, ierror)
-   call MPI_BCAST(n, 1, MPI_INTEGER, server, MPI_COMM_WORLD, ierror)
-   ! Allocate memory for arrays; only the server keeps the full A.
-   ! Allocate memory for arrays, initialize to zero if necessary.
-   buffer(1:m) = 0.0d0
-   ! Server - read and broadcast vector x(1),...,x(n)
-   read(f_ptr,*) (x(j), j=1,n)
-   call MPI_BCAST(x, n, MPI_REAL, server, MPI_COMM_WORLD, ierror)
-   ! read A(m,n)
43. 3rd Example: Priming the Pump
- ! Here we give the compute processes their 1st batch of data.
- ! Send out up to num_processes-1 rows to clients.
- i = 1
- do while ((i.le.n).and.(i.lt.num_processes))
-   ! Distribute a(m,n) by columns
-   alocal(1:m) = a(1:m,i)
-   call MPI_SEND(alocal, m, MPI_REAL, i, i, MPI_COMM_WORLD, ierror)
-   active_client_count = active_client_count + 1
-   i = i + 1
- end do
- ! Note: at the end of the above loop i = min(n, num_processes).
- ! Handle the case where there are more processes than rows.
- ! Use tag = NO_DATA as a message to go to the waiting area pending finish.
- do j = i, num_processes-1, 1
-   call MPI_SEND(alocal, m, MPI_REAL, j, NO_DATA, MPI_COMM_WORLD, ierror)
- end do
- ! Note that the message is the TAG = NO_DATA, not alocal.
44. 3rd Example: Server Loop
- do while ((active_client_count.gt.0) .or. (i.le.n))
-   call MPI_RECV(buffer, m, MPI_REAL, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierror)
-   if (status(MPI_TAG).ne.NO_DATA) then
-     result(1:m) = result(1:m) + buffer(1:m)   ! Accumulate result, F90 array syntax
-     if (i.le.n) then
-       alocal(1:m) = a(1:m,i)
-       call MPI_SEND(alocal, m, MPI_REAL, status(MPI_SOURCE), i, MPI_COMM_WORLD, ierror)
-       i = i + 1
-     else
-       ! No more data
-       active_client_count = active_client_count - 1
-       call MPI_SEND(alocal, m, MPI_REAL, status(MPI_SOURCE), NO_DATA, MPI_COMM_WORLD, ierror)
-     endif
-   else
-     ! Handle node errors
-   endif
45. 3rd Example: Client Loop
-   do i = 1, m
-     print *, result(i)
-   end do
- else
-   ! Client side - receive broadcasts of the row and column dimensions
-   call MPI_BCAST(m, 1, MPI_INTEGER, server, MPI_COMM_WORLD, ierror)
-   call MPI_BCAST(n, 1, MPI_INTEGER, server, MPI_COMM_WORLD, ierror)
-   ! Allocate memory, then get the x values from the broadcast.
-   call MPI_BCAST(x, n, MPI_REAL, server, MPI_COMM_WORLD, ierror)
-   ! Listen for the 1st message from the server - decide if data or loop terminator.
-   call MPI_PROBE(server, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierror)
-   do while (status(MPI_TAG).ne.NO_DATA)   ! Loop until a NO_DATA message is rec'd.
-     call MPI_RECV(alocal, m, MPI_REAL, server, status(MPI_TAG), MPI_COMM_WORLD, status, ierror)
-     buffer(1:m) = alocal(1:m)*x(status(MPI_TAG))   ! multiply array by const.
-     call MPI_SEND(buffer, m, MPI_REAL, server, status(MPI_TAG), MPI_COMM_WORLD, ierror)   ! return results
-     ! Listen for the next message
-     call MPI_PROBE(server, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierror)
46. Program Termination
- endif
- ! everyone waits here until all are done. Waiting area.
- call MPI_Barrier(MPI_COMM_WORLD, ierror)
- call MPI_Finalize(ierror)
- end program
47. Exercises
- Pick an example code in a language you are familiar with and rewrite one of the broadcast operations using the equivalent sends and receives.
- Located in ~/mpi1.
- server_client.f - column-oriented Fortran 90
- server_client_row2.c - row-oriented C
- server_client_col2.c - column-oriented C
- How would you rewrite one of the column-oriented example codes to do a full matrix-matrix multiplication? (Hint: look back at the pictures and think a bit about how the arrays have to be distributed.) What issues would you need to resolve?
- Can you rewrite this without the MPI_Probe call?
48. References - MPI Tutorials
- PACS online course
- http://webct.ncsa.uiuc.edu:8900/
- Edinburgh Parallel Computing Centre
- http://www.epcc.ed.ac.uk/epic/mpi/notes/mpi-course-epic.book_1.html
- Argonne National Laboratory (MPICH)
- http://www-unix.mcs.anl.gov/mpi/
- MPI Forum
- http://www.mpi-forum.org/docs/docs.html
- MPI: The Complete Reference (vols. 1, 2)
- Vol. 1 at http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
- IBM (MPI on the RS/6000 (IBM SP))
- http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts
49. References - Some Useful Books
- MPI: The Complete Reference
- Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra, MIT Press
- examples/mpidocs/mpi_complete_reference.ps.Z
- Parallel Programming with MPI
- Peter S. Pacheco, Morgan Kaufmann Publishers, Inc.
- Using MPI: Portable Parallel Programming with the Message-Passing Interface
- William Gropp, E. Lusk, and A. Skjellum, MIT Press