1
Crash Course in Parallel Programming Using MPI
  • Adam Jacobs
  • HCS Research Lab
  • 01/10/07

2
Outline: PCA Preparation
  • Parallel Computing
  • Distributed Memory Architectures
  • Programming Models
  • Flynn's Taxonomy
  • Parallel Decomposition
  • Speedups

3
Parallel Computing
  • Motivated by high computational complexity and
    memory requirements of large applications
  • Two Approaches
  • Shared Memory
  • Distributed Memory
  • The majority of modern systems are clusters
    (distributed memory architecture)
  • Many simple machines connected with a powerful
    interconnect
  • ASCI Red, ASCI White, etc.
  • A hybrid approach can also be used
  • IBM Blue Gene

4
Shared Memory Systems
  • Memory resources are shared among processors
  • Relatively easy to program for since there is a
    single unified memory space
  • Scales poorly with system size due to the need
    for cache coherency
  • Example
  • Symmetric Multiprocessors (SMP)
  • Each processor has equal access to RAM
  • 4-way motherboards MUCH more expensive than 2-way

5
Distributed Memory Systems
  • Individual nodes consist of a CPU, RAM, and a
    network interface
  • A hard disk is not necessary; mass storage can be
    supplied using NFS
  • Information is passed between nodes using the
    network
  • No need for special cache coherency hardware
  • More difficult to write programs for distributed
    memory systems since the programmer must keep
    track of memory usage

6
Programming Models
  • Multiprogramming
  • Multiple programs running simultaneously
  • Shared Address
  • Global address space available to all processors
  • Shared data is written to this global space
  • Message Passing
  • Data is sent directly to processors using
    messages
  • Data Parallel

7
Flynn's Taxonomy
  • SISD: Single Instruction, Single Data
  • Normal instructions
  • SIMD: Single Instruction, Multiple Data
  • Vector operations, MMX, SSE, AltiVec
  • MISD: Multiple Instructions, Single Data
  • MIMD: Multiple Instructions, Multiple Data
  • SPMD: Single Program, Multiple Data

8
Parallel Decomposition
  • Data Parallelism
  • Parallelism within a dataset such that a portion
    of the data can be computed independently from
    the rest
  • Usually results in coarse-grained parallelism
    (compute farms)
  • Allows for automatic load balancing strategies
  • Functional Parallelism
  • Parallelism between distinct functional blocks
    such that each block can be performed
    independently
  • Especially useful for pipeline structures

9
Speedup
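For reference, the standard definitions (assumed here rather than taken from the slide): speedup on p processors is the serial runtime divided by the parallel runtime, and parallel efficiency is the speedup per processor.

    S(p) = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}(p)},
    \qquad E(p) = \frac{S(p)}{p},
    \qquad \text{linear speedup: } S(p) = p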
10
Super-linear Speedup
  • Linear speedup is the best that can be achieved
  • Or is it?
  • Super-linear speedup occurs when parallelizing an
    algorithm results in a more efficient use of
    hardware resources
  • A 1 MB task doesn't fit in a single processor's cache
  • Two 512 KB tasks do fit, resulting in lower effective
    memory access times

11
MPI: Message Passing Interface
  • Adam Jacobs
  • HCS Research Lab
  • 01/10/07

Slides created by Raj Subramaniyan
12
Outline: MPI Usage
  • Introduction
  • MPI Standard
  • MPI Implementations
  • MPICH Introduction
  • MPI calls
  • Present Emphasis

13
Parallel Computing
  • Motivated by high computational complexity and
    memory requirements of large applications
  • Cooperation with other processes
  • Cooperative and one-sided operations
  • Processes interact with each other by exchanging
    information
  • Models
  • SIMD
  • SPMD
  • MIMD

14
Cooperative Operations
  • Cooperative: all parties agree to transfer data
  • Message-passing is an approach that makes the
    exchange of data cooperative
  • Data must both be explicitly sent and received
  • Any change in the receiver's memory is made with
    the receiver's participation

15
MPI: Message Passing Interface
  • MPI: a message-passing library specification
  • A message passing model and not a specific
    product
  • Designed for parallel computers, clusters and
    heterogeneous networks
  • Standardization began in 1992 and the final draft
    was made available in 1994
  • Broad participation of vendors, library writers,
    application specialists and scientists

Message Passing Interface Forum, accessible at
http://www.mpi-forum.org/
16
Features of MPI
  • Point-to-point communication
  • Collective operations
  • Process groups
  • Communication contexts
  • Process topologies
  • Bindings for Fortran 77 and C
  • Environmental management and inquiry
  • Profiling interface

17
Features NOT included in MPI
  • Explicit shared-memory operations
  • Operations that require more operating system
    support than the standard assumes, for example
    interrupt-driven receives, remote execution, or
    active messages
  • Program construction tools
  • Explicit support for threads
  • Support for task management
  • I/O functions

18
MPI Implementations
  • Listed below are MPI implementations available
    for free
  • Appleseed (UCLA)
  • CRI/EPCC (Edinburgh Parallel Computing Centre)
  • LAM/MPI (Indiana University)
  • MPI for UNICOS Systems (SGI)
  • MPI-FM (University of Illinois) for Myrinet
  • MPICH (ANL)
  • MVAPICH (Infiniband)
  • SGI Message Passing Toolkit
  • OpenMPI

A detailed list of MPI implementations with
features can be found at
http://www.lam-mpi.org/mpi/implementations/
19
MPICH
  • MPICH: a portable implementation of MPI developed
    at the Argonne National Laboratory (ANL) and
    Mississippi State University (MSU)
  • Very widely used
  • Supports all the specs of the MPI-1 standard
  • Features that are part of the MPI-2 standard are
    under development (by ANL alone)

http://www-unix.mcs.anl.gov/mpi/mpich/
20
Writing MPI Programs
Part of all programs
  • include "mpi.h" // Gives basic MPI types,
    definitions
  • include ltstdio.hgt
  • int main( argc, argv )
  • int argc
  • char argv
  • MPI_Init( argc, argv ) // Starts MPI
  • Actual code including normal C calls and MPI
    calls
  • MPI_Finalize() // Ends MPI
  • return 0

21
Initialize and Finalize
  • MPI_Init
  • Initializes all necessary MPI variables
  • Forms the MPI_COMM_WORLD communicator
  • A communicator is a list of all the connections
    between nodes
  • Opens necessary TCP connections
  • MPI_Finalize
  • Waits for all processes to reach the function
  • Closes TCP connections
  • Cleans up

22
Rank and Size
  • Environment details
  • How many processes are there? (MPI_Comm_size)
  • Who am I? (MPI_Comm_rank)
  • MPI_Comm_size( MPI_COMM_WORLD, &size );
  • MPI_Comm_rank( MPI_COMM_WORLD, &rank );
  • The rank is a number between 0 and size-1

23
Sample Hello World Program
    #include <stdio.h>
    #include <string.h>
    #include "mpi.h"

    int main(int argc, char* argv[])
    {
        int my_rank, p;        // process rank and number of processes
        int source, dest;      // rank of sending and receiving process
        int tag = 0;           // tag for messages
        char mesg[100];        // storage for message
        MPI_Status status;     // stores status for MPI_Recv statements

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        if (my_rank != 0) {
            sprintf(mesg, "Greetings from %d!", my_rank);  // stores into character array
            dest = 0;                                      // destination for MPI_Send is process 0
            MPI_Send(mesg, strlen(mesg)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);  // sends string to process 0
        } else {
            for (source = 1; source < p; source++) {
                MPI_Recv(mesg, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);  // recv from each process
                printf("%s\n", mesg);                      // prints out greeting to screen
            }
        }
        MPI_Finalize();  // shuts down MPI
        return 0;
    }

24
Compiling MPI Programs
  • Two methods
  • Compilation commands
  • Using Makefile
  • Compilation commands
  • mpicc -o hello_world hello-world.c
  • mpif77 -o hello_world hello-world.f
  • Likewise, mpiCC and mpif90 are available for C++
    and Fortran 90, respectively
  • Makefile.in is a template Makefile
  • mpireconfig translates Makefile.in to a Makefile
    for a particular system

25
Running MPI Programs
  • To run hello_world on two machines
  • mpirun -np 2 hello_world
  • Must specify full path of executable
  • To see the commands executed by mpirun
  • mpirun -t
  • To get all the mpirun options
  • mpirun -help

26
MPI Communications
  • Typical blocking send:
  • send (dest, type, address, length)
  • dest: integer representing the process to
    receive the message
  • type: data type being sent (often overloaded)
  • (address, length): contiguous area in memory
    being sent
  • MPI_Send/MPI_Recv provide point-to-point
    communication
  • Typical global operation:
  • broadcast (type, address, length)
  • Six basic MPI calls (MPI_Init, MPI_Finalize,
    MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv)

27
MPI Basic Send/Recv
  • int MPI_Send( void* buf, int count, MPI_Datatype
    datatype, int dest, int tag, MPI_Comm comm )
  • buf: initial address of send buffer; dest: rank
    of destination (integer)
  • tag: message tag (integer); comm: communicator
    (handle)
  • count: number of elements in send buffer
    (nonnegative integer)
  • datatype: datatype of each send buffer element
    (handle)
  • int MPI_Recv( void* buf, int count, MPI_Datatype
    datatype, int source, int tag, MPI_Comm comm,
    MPI_Status* status )
  • status: status object (Status); source: rank of
    source (integer)
  • status is mainly useful when messages are
    received with MPI_ANY_TAG and/or MPI_ANY_SOURCE

28
Information about a Message
  • The count argument in recv indicates the maximum
    length of a message
  • The actual length of the message can be obtained
    with MPI_Get_count, as in the sketch below
  • MPI_Status status;
  • MPI_Recv( ..., &status );
  • ... status.MPI_TAG
  • ... status.MPI_SOURCE
  • MPI_Get_count( &status, datatype, &count );
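As a small sketch of these calls (assumed to run inside an already initialized MPI program; the buffer size of 64 is arbitrary): the receiver accepts a message from any source with any tag, then queries the status object for the sender, the tag, and the actual element count.

    int buf[64];                       // receive buffer; 64 is an arbitrary maximum
    int actual_count;
    MPI_Status status;

    // accept a message from any sender with any tag
    MPI_Recv(buf, 64, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

    // inspect who sent it, which tag was used, and how many ints actually arrived
    MPI_Get_count(&status, MPI_INT, &actual_count);
    printf("got %d ints from rank %d (tag %d)\n",
           actual_count, status.MPI_SOURCE, status.MPI_TAG);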

29
Example Matrix Multiplication Program
    /* send matrix data to the worker tasks */
    averow = NRA/numworkers;
    extra  = NRA%numworkers;
    offset = 0;
    mtype  = FROM_MASTER;
    for (dest = 1; dest <= numworkers; dest++) {
        rows = (dest <= extra) ? averow+1 : averow;          // if rows are not evenly divisible
        printf("sending %d rows to task %d\n", rows, dest);  // among workers, some workers
                                                             // get an additional row
        MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);  // starting row being sent
        MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);    // number of rows sent
        count = rows*NCA;                                    // total elements of A being sent
        MPI_Send(&a[offset][0], count, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
        count = NCA*NCB;                                     // equivalent to NRB*NCB elements in B
        MPI_Send(&b, count, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
        offset = offset + rows;                              // increment offset for the next worker
    }
MASTER SIDE
30
Example Matrix Multiplication Program (contd.)
    /* wait for results from all worker tasks */
    mtype = FROM_WORKER;
    for (i = 1; i <= numworkers; i++) {        // get results from each worker
        source = i;
        MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
        count = rows*NCB;                      // elements in the result from the worker
        MPI_Recv(&c[offset][0], count, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
    }
    /* print results */
    /* end of master section */
MASTER SIDE
31
Example Matrix Multiplication Program (contd.)
    if (taskid > MASTER) {                     // implies a worker node
        mtype  = FROM_MASTER;
        source = MASTER;
        printf("Master %d, mtype %d\n", source, mtype);
        // receive the offset and number of rows
        MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
        printf("offset %d\n", offset);
        MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
        printf("rows %d\n", rows);
        count = rows*NCA;                      // elements to receive for matrix A
        MPI_Recv(&a, count, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
        printf("a[0][0] %e\n", a[0][0]);
        count = NCA*NCB;                       // elements to receive for matrix B
        MPI_Recv(&b, count, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
WORKER SIDE
32
Example Matrix Multiplication Program (contd.)
        for (k = 0; k < NCB; k++)
            for (i = 0; i < rows; i++) {
                c[i][k] = 0.0;                 // do the matrix multiplication for
                for (j = 0; j < NCA; j++)      // the rows you are assigned to
                    c[i][k] = c[i][k] + a[i][j] * b[j][k];
            }
        mtype = FROM_WORKER;
        printf("after computing\n");
        MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
        MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
        MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);  // sending the actual result
        printf("after send\n");
    }   /* end of worker */
WORKER SIDE
33
Asynchronous Send/Receive
  • MPI_Isend() and MPI_Irecv() are non-blocking;
    control returns to the program after the call is
    made (see the sketch below)
  • int MPI_Isend( void* buf, int count, MPI_Datatype
    datatype, int dest, int tag, MPI_Comm comm,
    MPI_Request* request )
  • int MPI_Irecv( void* buf, int count, MPI_Datatype
    datatype, int source, int tag, MPI_Comm comm,
    MPI_Request* request )
  • request: communication request (handle); output
    parameter
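A minimal sketch of a non-blocking exchange between two ranks (my_rank comes from the earlier MPI_Comm_rank call; the tag value 99 is arbitrary, and only ranks 0 and 1 are assumed to participate): each side starts both transfers, may do other work, and then waits for completion.

    int sendval = my_rank, recvval;
    int partner = (my_rank == 0) ? 1 : 0;      // assumes exactly two ranks participate
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Isend(&sendval, 1, MPI_INT, partner, 99, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recvval, 1, MPI_INT, partner, 99, MPI_COMM_WORLD, &reqs[1]);

    /* ... useful computation can overlap with communication here ... */

    MPI_Wait(&reqs[0], &stats[0]);             // send buffer is safe to reuse after this
    MPI_Wait(&reqs[1], &stats[1]);             // recvval is valid after this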

34
Detecting Completions
  • Non-blocking operations return (immediately)
    request handles that can be waited on and
    queried
  • MPI_Wait waits for an MPI send or receive to
    complete
  • int MPI_Wait ( MPI_Request* request, MPI_Status*
    status )
  • request: matches the request of an Isend or Irecv
  • status: returns a status equivalent to the status
    from MPI_Recv when complete
  • for a send, blocks until the message is buffered or
    sent, so the message variable is free for reuse
  • for a receive, blocks until the message has been
    received and is ready

35
Detecting Completions (contd.)
  • MPI_Test tests for the completion of a send or
    receive
  • int MPI_Test ( MPI_Request* request, int* flag,
    MPI_Status* status )
  • request, status: as for MPI_Wait
  • does not block
  • flag indicates whether the operation is complete or
    not
  • enables code that can repeatedly check for
    communication completion, as in the sketch below
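A sketch of such a polling loop; do_some_other_work() is a hypothetical placeholder for useful local computation, and the buffer size of 64 is arbitrary.

    int buf[64];
    int flag = 0;
    MPI_Request request;
    MPI_Status status;

    MPI_Irecv(buf, 64, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &request);
    while (!flag) {
        do_some_other_work();                  // hypothetical: overlap work with communication
        MPI_Test(&request, &flag, &status);    // returns immediately; flag != 0 once complete
    }
    /* the data in buf is valid only after flag becomes nonzero */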

36
Multiple Completions
  • Often desirable to wait on multiple requests, e.g.,
    in a master/slave program (see the sketch below)
  • int MPI_Waitall( int count, MPI_Request*
    array_of_requests, MPI_Status*
    array_of_statuses )
  • int MPI_Waitany( int count, MPI_Request*
    array_of_requests, int* index, MPI_Status*
    status )
  • int MPI_Waitsome( int incount, MPI_Request*
    array_of_requests, int* outcount, int*
    array_of_indices, MPI_Status* array_of_statuses )
  • There are corresponding versions of test for each
    of these
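A sketch of the master side of such a master/slave program, assuming ranks 1 through p-1 are slaves and MAX_SLAVES is a sufficient compile-time bound (both names are illustrative): one receive is posted per slave, and MPI_Waitany services whichever completes first.

    int nslaves = p - 1;                       // assumes ranks 1..p-1 are slaves
    double      results[MAX_SLAVES];
    MPI_Request reqs[MAX_SLAVES];

    // post one receive per slave, then handle completions in any order
    for (int i = 0; i < nslaves; i++)
        MPI_Irecv(&results[i], 1, MPI_DOUBLE, i + 1, 0, MPI_COMM_WORLD, &reqs[i]);

    for (int done = 0; done < nslaves; done++) {
        int which;
        MPI_Status status;
        MPI_Waitany(nslaves, reqs, &which, &status);   // index of whichever request finished
        printf("result %f from rank %d\n", results[which], status.MPI_SOURCE);
    }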

37
Communication Modes
  • Synchronous mode (MPI_Ssend): the send does not
    complete until a matching receive has begun
  • Buffered mode (MPI_Bsend): the user supplies the
    buffer to the system
  • Ready mode (MPI_Rsend): the user guarantees that a
    matching receive has been posted
  • Non-blocking versions are MPI_Issend, MPI_Irsend,
    MPI_Ibsend

38
Miscellaneous Point-to-Point Commands
  • MPI_Sendrecv
  • MPI_Sendrecv_replace
  • MPI_Cancel
  • Used for buffered modes
  • MPI_Buffer_attach
  • MPI_Buffer_detach

39
Collective Communication
  • One to Many (Broadcast, Scatter)
  • Many to One (Reduce, Gather)
  • Many to Many (Allreduce, Allgather)

40
Broadcast and Barrier
  • Any type of message can be sent; the size of the
    message should be known to all processes
  • int MPI_Bcast ( void* buffer, int count,
    MPI_Datatype datatype, int root, MPI_Comm comm )
  • buffer: pointer to message buffer; count: number
    of items sent
  • datatype: type of item sent; root: sending
    processor
  • comm: communicator within which the broadcast takes
    place
  • Note: count and type should be the same on all
    processors (see the example below)
  • Barrier synchronization (a broadcast without a
    message?)
  • int MPI_Barrier ( MPI_Comm comm )
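A minimal sketch, assuming my_rank was obtained as on the earlier slides and nsteps is an illustrative parameter name: rank 0 sets the value, broadcasts it to everyone, and all ranks then synchronize at a barrier.

    int nsteps = 0;
    if (my_rank == 0)
        nsteps = 1000;                         // only the root has the value initially

    MPI_Bcast(&nsteps, 1, MPI_INT, 0, MPI_COMM_WORLD);  // now every rank has nsteps == 1000
    MPI_Barrier(MPI_COMM_WORLD);               // all ranks reach this point before continuing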

41
Reduce
  • Reverse of broadcast: all processors send to a
    single processor (see the example below)
  • Several combining functions are available:
  • MAX, MIN, SUM, PROD, LAND, BAND, LOR, BOR, LXOR,
    BXOR, MAXLOC, MINLOC
  • int MPI_Reduce ( void* sendbuf, void* result, int
    count, MPI_Datatype datatype, MPI_Op op, int
    root, MPI_Comm comm )
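A minimal sketch, assuming compute_partial_sum() is a hypothetical local computation: every rank contributes a partial sum and rank 0 receives the global total.

    double partial = compute_partial_sum(my_rank);   // hypothetical local computation
    double total   = 0.0;

    // sum the partial results from all ranks into 'total' on rank 0 only
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (my_rank == 0)
        printf("global sum = %f\n", total);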

42
Scatter and Gather
  • MPI_Scatter: the source (array) on the sending
    processor is spread across all processors
  • MPI_Gather: the opposite of scatter; array locations
    at the receiver correspond to the ranks of the
    senders (see the sketch below)
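A hedged sketch, assuming p ranks, my_rank from the earlier slides, and MAX_PROCS as an illustrative compile-time bound: the root scatters one int to each rank, each rank updates its piece, and the root gathers the pieces back in rank order.

    int sendbuf[MAX_PROCS];                    // filled on the root; MAX_PROCS >= p assumed
    int piece, recvbuf[MAX_PROCS];

    if (my_rank == 0)
        for (int i = 0; i < p; i++) sendbuf[i] = i * 10;

    MPI_Scatter(sendbuf, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);
    piece += my_rank;                          // each rank works on its own element
    MPI_Gather(&piece, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);
    // on rank 0, recvbuf[i] now holds the value returned by rank i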

43
Many-to-many Communication
  • MPI_Allreduce
  • Syntax like reduce, except there is no root parameter
  • All nodes get the result (see the sketch below)
  • MPI_Allgather
  • Syntax like gather, except there is no root parameter
  • All nodes get the resulting array
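A minimal sketch mirroring the MPI_Reduce example above (compute_partial_sum() is again a hypothetical local computation), except that every rank ends up with the global sum.

    double partial = compute_partial_sum(my_rank);   // hypothetical local computation
    double total;

    // every rank receives the sum of all partial values; no root argument
    MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);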

44
Evaluating Parallel Programs
  • MPI provides tools to evaluate performance of
    parallel programs
  • Timer
  • Profiling Interface
  • MPI_Wtime gives the wall-clock time (timing sketch
    below)
  • MPI_WTIME_IS_GLOBAL can be used to check whether the
    clocks of all the processes are synchronized
  • PMPI_... is an entry point for every routine and can
    be used for profiling
  • The -mpilog option at compile time can be used to
    generate logfiles
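A minimal sketch of timing a code region with MPI_Wtime; the region being measured is a placeholder.

    double t_start = MPI_Wtime();

    /* ... code region being measured ... */

    double elapsed = MPI_Wtime() - t_start;    // wall-clock seconds on this process
    printf("rank %d: %f s\n", my_rank, elapsed);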

45
Recent Developments
  • MPI-2
  • Dynamic process management
  • One-sided communication
  • Parallel file-IO
  • Extended collective operations
  • MPI for Grids ex., MPICH-G, MPICH-G2
  • Fault-tolerant MPI ex., Starfish, Cocheck

46
One-sided Operations
  • One-sided: one worker performs the transfer of data
  • Remote memory reads and writes
  • Data can be accessed without waiting for other
    processes (see the sketch below)
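A hedged sketch of one-sided communication using the MPI-2 fence synchronization model (the window size and the value 42 are illustrative): each rank exposes a window over a local integer, and rank 0 writes into rank 1's window without rank 1 posting a receive.

    int local = my_rank;
    MPI_Win win;

    // expose 'local' to remote access by all ranks in the communicator
    MPI_Win_create(&local, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                     // open the access epoch
    if (my_rank == 0) {
        int value = 42;
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);   // write into rank 1's window
    }
    MPI_Win_fence(0, win);                     // close the epoch; rank 1's 'local' is now 42

    MPI_Win_free(&win);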

47
File Handling
  • Similar to general programming languages
  • Sample function calls (see the example below)
  • MPI_File_open
  • MPI_File_read
  • MPI_File_seek
  • MPI_File_write
  • MPI_File_set_size
  • Non-blocking reads and writes are also possible
  • MPI_File_iread
  • MPI_File_iwrite
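A minimal sketch of the blocking calls, assuming each rank writes its own rank number to a distinct offset in a shared file; the filename "out.dat" is illustrative.

    MPI_File fh;
    MPI_Status status;
    int value = my_rank;

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_seek(fh, my_rank * sizeof(int), MPI_SEEK_SET);   // each rank writes to its own slot
    MPI_File_write(fh, &value, 1, MPI_INT, &status);
    MPI_File_close(&fh);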

48
C Datatypes
  • MPI_CHAR: char
  • MPI_BYTE: see the standard (similar to unsigned char)
  • MPI_SHORT: short
  • MPI_INT: int
  • MPI_LONG: long
  • MPI_FLOAT: float
  • MPI_DOUBLE: double
  • MPI_UNSIGNED_CHAR: unsigned char
  • MPI_UNSIGNED_SHORT: unsigned short
  • MPI_UNSIGNED: unsigned int
  • MPI_UNSIGNED_LONG: unsigned long
  • MPI_LONG_DOUBLE: long double

49
mpiP
  • A lightweight profiling library for MPI
    applications
  • To use it in an application, simply add the
    -lmpiP flag to the compile script
  • Determines how much time a program spends in MPI
    calls versus the rest of the application
  • Shows which MPI calls are used most frequently

50
Jumpshot
  • Graphical profiling tool for MPI
  • Java-Based
  • Useful for determining communication patterns in
    an application
  • Color-coded bars represent time spent in an MPI
    function
  • Arrows denote message passing
  • Single line denotes actual processing time

51
Summary
  • The parallel computing community has cooperated
    to develop a full-featured standard
    message-passing library interface
  • Several implementations are available
  • Many applications are being developed or ported
    presently
  • MPI-2 process beginning
  • Lots of MPI material available
  • Very good facilities available at the HCS Lab for
    MPI-based projects
  • Zeta Cluster will be available for class projects

52
References
  • [1] The Message Passing Interface (MPI)
    Standard, http://www-unix.mcs.anl.gov/mpi/
  • [2] LAM/MPI Parallel Computing,
    http://www.lam-mpi.org
  • [3] W. Gropp, "Tutorial on MPI: The
    Message-Passing Interface,"
    http://www-unix.mcs.anl.gov/mpi/tutorial/gropp/talk.html
  • [4] D. Culler and J. Singh, "Parallel Computer
    Architecture: A Hardware/Software Approach"

53
Fault-Tolerant Embedded MPI
54
Motivations
  • MPI functionality required for HPC space
    applications
  • De facto standard parallel programming model in
    HPC
  • Fault-tolerant extensions for HPEC space systems
  • MPI is inherently fault-intolerant, an original
    design choice
  • Existing HPC tools for MPI and fault-tolerant MPI
  • Good basis for ideas, API standards, etc.
  • Not readily amenable to HPEC platforms
  • Focus on a lightweight fault-tolerant MPI for HPEC
    (FEMPI: Fault-tolerant Embedded Message Passing
    Interface)
  • Leverage prior work throughout HPC community
  • Leverage prior work at UF on HPC with MPI

55
Primary Source of Failures in MPI
  • Nature of failures
  • Individual processes of MPI job crash (Process
    failure)
  • Communication failure between two MPI processes
    (Network failure)
  • Behavior on failure
  • When a receiver node fails, the sender encounters a
    timeout on a blocking send call (no matching
    receive is found) and returns an error
  • The whole communicator context crashes, and hence
    the entire MPI job
  • With N×N open TCP connections in many MPI
    implementations, in such cases the whole job
    crashes immediately on failure of any node
  • Applies to collective communication calls as well
  • Avoid failure/crash of entire application
  • Health status of nodes provided by failure
    detection service (via SR)
  • Check node status before communication with
    another node to avoid establishing communication
    with a dead process
  • If receiver dies after status check and before
    communication, then timeout-based recovery will
    be used

56
FEMPI Software Architecture
  • Low-level communication is provided through FEMPI
    using Self-Reliant's DMS
  • Heartbeating via SR and a process notification
    extension to the SRP enable FEMPI fault
    detection
  • Application and FEMPI checkpointing make use of
    existing checkpointing libraries; checkpoint
    communication uses DMS
  • The MPI Restore process on the System Controller is
    responsible for recovery decisions based on
    application policies

57
Fault Tolerance Actions
  • Fault tolerance is provided through three stages
  • Detection of a fault
  • Notification
  • Recovery
  • Self-Reliant services used to provide detection
    and notification capabilities
  • Heartbeats and other functionality are already
    provided in API
  • Notification service built as an extension to FTM
    of JMS
  • FEMPI will provide features to enable recovery of
    an application
  • Employs reliable communications to reduce faults
    due to communication failure
  • Low-level communications provided through
    Self-Reliant services (DMS) instead of directly
    over TCP/IP