1
An Introduction to Parallel Programming with MPI
  • March 22, 24, 29, 31
  • 2005
  • David Adams

2
Outline
  • Disclaimers
  • Overview of basic parallel programming on a
    cluster with the goals of MPI
  • Batch system interaction
  • Startup procedures
  • Blocking message passing
  • Non-blocking message passing
  • Collective Communications

3
Disclaimers
  • I do not have all the answers.
  • Completion of this short course will give you
    enough tools to begin making use of MPI. It will
    not automagically allow your code to run on a
    parallel machine simply by logging in.
  • Some codes are easier to parallelize than others.

4
The Goals of MPI
  • Design an application programming interface.
  • Allow efficient communication.
  • Allow for implementations that can be used in a
    heterogeneous environment.
  • Allow convenient C and Fortran 77 bindings.
  • Provide a reliable communication interface.
  • Portable.
  • Thread safe.

5
Message Passing Paradigm
  • (Figure: processors P1-P6 communicating through a
    network.)
6
Message Passing Paradigm
  • (Figure continued.)
7
Message Passing Paradigm
  • Conceptually, all processors communicate through
    messages (even though some may share memory
    space).
  • Low level details of message transport are
    handled by MPI and are invisible to the user.
  • Every processor is running the same program but
    will take different logical paths determined by
    self processor identification (Who am I?).
  • Programs are written, in general, for an
    arbitrary number of processors though they may be
    more efficient on specific numbers (powers of
    2?).

8
Distributed Memory and I/O Systems
  • The cluster machines available at Virginia Tech
    are distributed memory distributed I/O systems.
  • Each node (processor pair) has its own memory and
    local hard disk.
  • Allows asynchronous execution of multiple
    instruction streams.
  • Heavy disk I/O should be delegated to the local
    disk instead of across the network, and minimized
    as much as possible (a per-rank file sketch
    follows this list).
  • While getting your program running, another goal
    to keep in mind is to see that it makes good use
    of the hardware available to you.
  • What does good use mean?
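  • A minimal sketch of per-rank local output, assuming
    MPI is already initialized (the scratch path and
    file naming here are placeholders, not a site
    convention):

    /* Each rank writes to its own file on node-local disk,
       so output never crosses the network. */
    #include <stdio.h>
    #include <mpi.h>

    void write_local_log(const char *msg)
    {
        int rank;
        char path[256];
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* hypothetical node-local scratch directory */
        snprintf(path, sizeof(path), "/tmp/scratch/rank_%d.log", rank);
        FILE *fp = fopen(path, "a");
        if (fp != NULL) {
            fprintf(fp, "%s\n", msg);
            fclose(fp);
        }
    }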

9
Speedup
  • The speedup achieved by a parallel algorithm
    running on p processors is the ratio between the
    time taken by that parallel computer executing
    the fastest serial algorithm and the time taken
    by the same parallel computer executing the
    parallel algorithm using p processors.
  • -Designing Efficient Algorithms for Parallel
    Computers, Michael J. Quinn

10
Speedup
  • Sometimes a fastest serial version of the code
    is unavailable.
  • The speedup of a parallel algorithm can also be
    measured against the parallel algorithm itself run
    serially, but this gives an unfair advantage to
    the parallel code, since the inefficiencies
    introduced by parallelization also appear in the
    serial version.

11
Speedup Example
  • Our really_big_code01 executes on a single
    processor in 100 hours.
  • The same code on 10 processors takes 10 hours.
  • 100 hrs. / 10 hrs. = a speedup of 10.
  • When speedup = p it is called ideal (or perfect)
    speedup.
  • Speedup by itself is not very meaningful. A
    speedup of 10 may sound good (We are solving the
    problem 10 times as fast!) but what if we were
    using 1000 processors to get that number?

12
Efficiency
  • The efficiency of a parallel algorithm running on
    p processors is the speedup divided by p.
  • -Designing Efficient Algorithms for Parallel
    Computers, Michael J. Quinn
  • From our last example,
  • when p = 10 the efficiency is 10/10 = 1 (great!),
  • when p = 1000 the efficiency is 10/1000 = 0.01
    (bad!).
  • Speedup and efficiency give us an idea of how
    well our parallel code is making use of the
    available resources (a short worked sketch
    follows this slide).
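  • A worked sketch of the two formulas above, using the
    numbers from the previous slide (assumed timings, C):

    /* speedup = T_serial / T_parallel, efficiency = speedup / p */
    #include <stdio.h>

    int main(void)
    {
        double t_serial   = 100.0;  /* hours on 1 processor  */
        double t_parallel = 10.0;   /* hours on p processors */
        int    p          = 10;

        double speedup    = t_serial / t_parallel;  /* 10.0 */
        double efficiency = speedup / p;            /* 1.0  */

        printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
        return 0;
    }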

13
Concurrency
  • The first step in parallelizing any code is to
    identify the types of concurrency found in the
    problem itself (not necessarily the serial
    algorithm).
  • Many parallel algorithms show few resemblances to
    the (fastest known) serial version they are
    compared to and sometimes require an unusual
    perspective on the problem.

14
Concurrency
  • Consider the problem of finding the sum of n
    integer values.
  • A sequential algorithm may look something like
    this:
  • BEGIN
  • sum = A[0]
  • FOR i = 1 TO n - 1 DO
  • sum = sum + A[i]
  • ENDFOR
  • END

15
Concurrency
  • Suppose n = 4. Then the additions would be done
    in a precise order as follows:
  • (A[0] + A[1]) + A[2] + A[3]
  • Without any insight into the problem itself we
    might assume that the process is completely
    sequential and cannot be parallelized.
  • Of course, we know that addition is associative
    (mostly). The same expression could be written
    as:
  • (A[0] + A[1]) + (A[2] + A[3])
  • By using our insight into the problem of addition
    we can exploit the inherent concurrency of the
    problem and not the algorithm (see the
    pairwise-summation sketch below).
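  • A small serial sketch of that pairwise regrouping
    (illustrative only; tree_sum is a made-up name and
    n is assumed to be a power of 2):

    #include <stdio.h>

    /* Combine elements in pairs, then pairs of pairs, and so on:
       the (A[0] + A[1]) + (A[2] + A[3]) grouping from this slide. */
    int tree_sum(int *a, int n)
    {
        for (int stride = 1; stride < n; stride *= 2)       /* log2(n) rounds    */
            for (int i = 0; i + stride < n; i += 2 * stride)
                a[i] += a[i + stride];                       /* independent pairs */
        return a[0];
    }

    int main(void)
    {
        int a[4] = {1, 2, 3, 4};
        printf("%d\n", tree_sum(a, 4));  /* prints 10 */
        return 0;
    }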

16
Communication is Slow
  • Continuing our example of adding n integers we
    may want to parallelize the process to exploit as
    much concurrency as possible. We call on the
    services of Clovus the Parallel Guru.
  • Let n = 128.
  • Clovus divides the integers into pairs and
    distributes them to 64 processors, maximizing the
    concurrency inherent in the problem.
  • The solutions to the 64 sub-problems are
    distributed to 32 processors, those 32 to 16, etc.

17
Communication Overhead
  • Suppose it takes t units of time to perform a
    floating-point addition.
  • Suppose it takes 100t units of time to pass a
    floating-point number from one processor to
    another.
  • The entire calculation on a single processor
    would take 127t time units.
  • Using the maximum number of processors possible
    (64), Clovus finds the sums of the first set of
    pairs in 101t time units (one 100t transfer plus
    one 1t addition per step). Further steps on 32,
    16, 8, 4, and 2 processors follow to obtain the
    final solution.
  • 101t + 101t + 101t + 101t + 101t + 101t = 606t
    total time units (steps with 64, 32, 16, 8, 4,
    and 2 processors); see the MPI_Reduce sketch
    below.
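  • In MPI the whole tree combine is packaged as a
    collective operation (covered near the end of the
    course). A hedged sketch, assuming each of the 64
    ranks already holds its pair of integers:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, my_pair[2] = {1, 1};   /* stand-in data: two ints per rank */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int partial = my_pair[0] + my_pair[1];   /* the local 1t addition */
        int total   = 0;
        /* Combine the partial sums onto rank 0; the MPI library
           chooses the communication pattern (typically a tree). */
        MPI_Reduce(&partial, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %d\n", total);
        MPI_Finalize();
        return 0;
    }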

18
Parallelism and Pipelining to Achieve Concurrency
  • There are two primary ways to achieve concurrency
    in an algorithm.
  • Parallelism
  • The use of multiple resources to increase
    concurrency.
  • Partitioning.
  • Example: Our summation problem.
  • Pipelining
  • Dividing the computation into a number of steps
    that are repeated throughout the algorithm.
  • An ordered set of segments in which the output of
    each segment is the input of its successor.
  • Example: Automobile assembly line.

19
Examples (Jacobi style update)
  • Imagine we have a cellular automaton that we want
    to parallelize.

20
Examples
  • We try to distribute the rows evenly between two
    processors.

21
Examples
  • Columns seem to work better for this problem.

22
Examples
  • Minimizing communication: with the column split,
    only boundary columns need to be exchanged between
    neighboring processors (a hedged sketch follows
    this slide).

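  • A hedged sketch of that boundary exchange using
    MPI_Sendrecv (introduced with blocking message
    passing later in the course); NROWS and the buffer
    names are illustrative only:

    #include <mpi.h>

    #define NROWS 8   /* assumed number of rows in the grid */

    /* Trade one boundary column with the right-hand neighbor:
       send our rightmost column, receive the neighbor's leftmost. */
    void exchange_right_boundary(double *my_right_edge, double *neighbor_edge,
                                 int right_neighbor, MPI_Comm comm)
    {
        MPI_Status status;
        MPI_Sendrecv(my_right_edge, NROWS, MPI_DOUBLE, right_neighbor, 0,
                     neighbor_edge, NROWS, MPI_DOUBLE, right_neighbor, 0,
                     comm, &status);
    }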
23
Examples (Gauss-Seidel style update)
  • Emulating a serial Gauss-Seidel update style with
    a pipe.
  • (Slides 23-38 repeat this figure as animation
    frames, stepping the pipelined update across the
    grid one frame per slide; a hedged code sketch of
    such a pipe follows.)
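  • A hedged sketch of such a pipe using blocking sends
    and receives (covered later); the neighbor ranks,
    NROWS, and the update formula are placeholders:

    #include <mpi.h>

    #define NROWS 8   /* assumed number of rows in each column block */

    /* Wait for the updated boundary from the left neighbor, update the
       local columns, then pass the new boundary to the right neighbor
       so it can start. Pass MPI_PROC_NULL for a missing neighbor and
       that send or receive becomes a no-op. */
    void pipelined_sweep(double left_halo[NROWS], double my_cols[NROWS],
                         int left, int right, MPI_Comm comm)
    {
        MPI_Status status;

        MPI_Recv(left_halo, NROWS, MPI_DOUBLE, left, 0, comm, &status);

        for (int i = 0; i < NROWS; i++)               /* placeholder update */
            my_cols[i] = 0.5 * (left_halo[i] + my_cols[i]);

        MPI_Send(my_cols, NROWS, MPI_DOUBLE, right, 0, comm);
    }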
39
Batch System Interaction
  • Both Anantham (400 processors) and System X
    (2200 processors) will normally operate in batch
    mode.
  • Jobs are not interactive.
  • Multi-user etiquette is enforced by a job
    scheduler and queuing system.
  • Users will submit jobs using a script file built
    by the administrator and modified by the user.

40
PBS (Portable Batch System) Submission Script
  • #!/bin/bash
  • #!
  • #! Example of job file to submit parallel MPI
    applications.
  • #! Lines starting with #PBS are options for the
    qsub command.
  • #! Lines starting with #! are comments
  • #! Set queue (production queue --- the only one
    right now) and
  • #! the number of nodes.
  • #! In this case we require 10 nodes from the
    entire set ("all").
  • #PBS -q prod_q
  • #PBS -l nodes=10:all

41
PBS Submission Script
  • #! Set time limit.
  • #! The default is 30 minutes of cpu time.
  • #! Here we ask for up to 1 hour.
  • #! (Note that this is total cpu time, e.g., 10
    minutes on
  • #! each of 4 processors is 40 minutes)
  • #! Hours:minutes:seconds
  • #PBS -l cput=01:00:00
  • #! Name of output files for std output and error
  • #! Defaults are <job-name>.o<job-number> and
    <job-name>.e<job-number>
  • #!PBS -e ZCA.err
  • #!PBS -o ZCA.log

42
PBS Submission Script
  • #! Mail to user when job terminates or aborts
  • #! PBS -m ae
  • #! Change the working directory (default is home
    directory)
  • cd $PBS_O_WORKDIR
  • #! Run the parallel MPI executable (change the
    default a.out)
  • #! (Note: omit "-kill" if you are running a 1
    node job)
  • /usr/local/bin/mpiexec -kill a.out

43
Common Scheduler Commands
  • qsub <script file name>
  • Submits your script file for scheduling. It is
    immediately checked for validity and, if it passes
    the check, you will get a message that your job
    has been added to the queue.
  • qstat
  • Displays information on jobs waiting in the queue
    and jobs that are running: how much time they
    have left and how many processors they are using.
  • Each job acquires a unique job_id that can be used
    to communicate with a job that is already running
    (perhaps to kill it).
  • qdel <job_id>
  • If for some reason you have a job that you need
    to remove from the queue, this command will do
    it. It will also kill a job in progress.
  • You, of course, only have access to delete your
    own jobs.

44
MPI Data Types
  • MPI thinks of every message as a starting point
    in memory and some measure of length, along with a
    possible interpretation of the data (exactly the
    buffer, count, and datatype arguments in the
    sketch after this slide).
  • The direct measure of length (number of bytes) is
    hidden from the user through the use of MPI data
    types.
  • Each language binding (C and Fortran 77) has its
    own list of MPI types that are intended to
    increase portability as the length of these types
    can change from machine to machine.
  • Interpretations of data can change from machine
    to machine in heterogeneous clusters (Macs and
    PCs in the same cluster for example).
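  • A hedged illustration of that (address, count, MPI
    type) description in a point-to-point call
    (MPI_Send and MPI_Recv are covered with blocking
    message passing); the values and helper name are
    made up:

    #include <mpi.h>

    /* Rank 0 sends 10 ints to rank 1; the message is described by a
       starting address, a count, and an MPI datatype, not a byte count. */
    void send_ten_ints(int *values, int rank)
    {
        int tag = 99;          /* arbitrary message tag */
        MPI_Status status;

        if (rank == 0)
            MPI_Send(values, 10, MPI_INT, 1, tag, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(values, 10, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
    }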

45
MPI types in C
  • MPI_CHAR signed char
  • MPI_SHORT signed short int
  • MPI_INT signed int
  • MPI_LONG signed long int
  • MPI_UNSIGNED_CHAR unsigned char
  • MPI_UNSIGNED_SHORT unsigned short int
  • MPI_UNSIGNED unsigned int
  • MPI_UNSIGNED_LONG unsigned long int
  • MPI_FLOAT float
  • MPI_DOUBLE double
  • MPI_LONG_DOUBLE long double
  • MPI_BYTE
  • MPI_PACKED

46
MPI Types in Fortran 77
  • MPI_INTEGER INTEGER
  • MPI_REAL REAL
  • MPI_DOUBLE_PRECISION DOUBLE PRECISION
  • MPI_COMPLEX COMPLEX
  • MPI_LOGICAL LOGICAL
  • MPI_CHARACTER CHARACTER(1)
  • MPI_BYTE
  • MPI_PACKED
  • Caution: Fortran 90 does not always store arrays
    contiguously.

47
Functions Appearing in all MPI Programs (Fortran
77)
  • MPI_INIT(IERROR)
  • INTEGER IERROR
  • Must be called before any other MPI routine.
  • Can be visualized as the point in the code where
    every processor obtains its own copy of the
    program and continues to execute, though this may
    happen earlier.

48
Functions Appearing in all MPI Programs (Fortran
77)
  • MPI_FINALIZE (IERROR)
  • INTEGER IERROR
  • This routine cleans up all MPI state.
  • Once this routine is called no MPI routine may be
    called.
  • It is the user's responsibility to ensure that ALL
    pending communications involving a process
    complete before the process calls MPI_FINALIZE.

49
Typical Startup Functions
  • MPI_COMM_SIZE(COMM, SIZE, IERROR)
  • IN INTEGER COMM
  • OUT INTEGER SIZE, IERROR
  • Returns the size of the group associated with the
    communicator COMM.
  • What's a communicator?

50
Communicators
  • A communicator is a handle (an integer in Fortran)
    that tells MPI what communication domain it is in.
  • There is a special communicator that exists in
    every MPI program called MPI_COMM_WORLD.
  • MPI_COMM_WORLD can be thought of as the superset
    of all communication domains. Every processor
    requested by your initial script is a member of
    MPI_COMM_WORLD. (Smaller domains can be derived
    from it; see the sketch after this slide.)
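  • A hedged sketch of deriving a smaller domain from
    MPI_COMM_WORLD with MPI_Comm_split (not otherwise
    covered in this course); the even/odd split is just
    an example:

    #include <mpi.h>

    /* Ranks with the same color end up together in a new communicator:
       here, even world ranks in one domain and odd world ranks in another. */
    MPI_Comm split_even_odd(void)
    {
        int world_rank;
        MPI_Comm half_comm;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &half_comm);
        return half_comm;
    }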

51
Typical Startup Functions
  • MPI_COMM_SIZE(COMM, SIZE, IERROR)
  • IN INTEGER COMM
  • OUT INTEGER SIZE, IERROR
  • Returns the size of the group associated with the
    communicator COMM.
  • A typical program contains the following command
    as one of the very first MPI calls to provide the
    code with the number of processors it has
    available for this execution. (Step one of self
    identification).
  • CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size_p, ierr_p)

52
Typical Startup Functions
  • MPI_COMM_RANK(COMM, RANK, IERROR)
  • IN INTEGER COMM
  • OUT INTEGER RANK, IERROR
  • Indicates the rank of the process that calls it
    in the range from 0..size-1, where size is the
    return value of MPI_COMM_SIZE.
  • This rank is relative to the communication domain
    specified by the communicator COMM.
  • For MPI_COMM_WORLD, this function will return the
    absolute rank of the process, a unique
    identifier. (Step 2 of self identification).
  • CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank_p, ierr_p)

53
Startup Variables
  • SIZE
  • INTEGER size_p
  • RANK
  • INTEGER rank_p
  • STATUS (more on this guy later)
  • INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status_p
  • IERROR (Fortran 77)
  • INTEGER ierr_p

54
Hello World (Fortran 90)
  • PROGRAM Hello_World
  • IMPLICIT NONE
  • INCLUDE 'mpif.h'
  • INTEGER ierr_p, rank_p, size_p
  • INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status_p
  • CALL MPI_INIT(ierr_p)
  • CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank_p, ierr_p)
  • CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size_p, ierr_p)
  • IF (rank_p == 0) THEN
  • WRITE(*,*) 'Hello world! I am process 0 and I am special!'
  • ELSE
  • WRITE(*,*) 'Hello world! I am process ', rank_p
  • END IF
  • CALL MPI_FINALIZE(ierr_p)
  • END PROGRAM Hello_World

55
Hello World (C)
  • #include <stdio.h>
  • #include <mpi.h>
  • int main(int argc, char **argv)
  • {
  • int node;
  • MPI_Init(&argc, &argv);
  • MPI_Comm_rank(MPI_COMM_WORLD, &node);
  • if (node == 0)
  • printf("Hello world! I am C process 0 and I am special!\n");
  • else
  • printf("Hello world! I am C process %d\n", node);
  • MPI_Finalize();
  • return 0;
  • }
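  • In practice both programs are compiled with the MPI
    wrapper compilers provided by the installation
    (commonly mpicc for C and mpif90 for Fortran,
    though the exact names vary) and launched through
    the mpiexec line in the PBS script shown earlier.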