Programming with MPI: A Detailed Example

1
Programming with MPI: A Detailed Example
EE524 Lecture: How to Build a Beowulf, Chapter 9

By Llewellyn Yap
Wednesday, 2-23-2000

2
Overview of MPI

MPI: a distributed space model
Decomposition can be static or dynamic
Static: fixed once and for all
Dynamic: changing in response to the simulation
When partitioning, consider minimizing communication

3
Implementing High Performance, Parallel Applications for MPI

Choose an algorithm with sufficient parallelism
Optimize a sequential version of the algorithm
Use the simplest possible MPI operations: usually blocking, standard mode procedures
Profiling and analysis: find what operations take the most time
Attack the most time-consuming components

4
Steps to Good Parallel Implementation

Understand strengths and weaknesses of sequential solutions
Choose a good sequential algorithm
Design a strategy for parallelization
Develop a rough semi-analytic model of how the parallel algorithm should perform

5
Steps to Good Parallel Implementation (contd)

Implement with MPI for interprocessor communication
Carry out measurements to verify performance
Identify bottlenecks, sources of overhead, etc., and minimize their impact
Iterate to improve performance if possible

6
Investigate Sorting

Sorting is a multi-faceted, irregular, non-grid-based problem
Used widely in database servers
Issues of sorting (size of elements, form of result, storage, etc.)
Two approaches will be discussed:
1. A fairly restricted domain
2. A more general approach

7
Sorting (Simple Approach): Assumptions

Elements are positive integers
Secondary storage (disk, tape) not used
Auxiliary primary storage is available
Input data uniformly distributed over the range of integers

8
Sorting (Simple Approach): The Approach

Most problems for parallel computation have already been solved for the traditional sequential architectures
Sequential solutions that exist as libraries, system calls or language constructs will be used as building blocks for a parallel solution

9
Sorting (Simple Approach): The Approach (contd)

Advantage of this approach: it leverages the design, debugging and optimization already performed for the sequential case.
Assume that we have at our disposal an optimized, debugged, sequential function isort that sorts arrays of integers in the memory of a single processor.
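
The slides treat isort as given. As an illustration only, here is a minimal sketch of what such a building block could look like if it simply wrapped the C library's qsort; the wrapper body and the comparator name cmp_int are assumptions, not the lecture's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparator for qsort: orders two ints without risking overflow
 * (a plain subtraction could overflow for large values). */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Hypothetical stand-in for the slides' isort: sorts n ints in place. */
void isort(int *a, size_t n)
{
    if (n == 0)
        return;                 /* nothing to sort */
    assert(a != NULL);
    qsort(a, n, sizeof a[0], cmp_int);
}
```

Each processor would call such a routine on its own local array once the elements have been redistributed.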

10
Sorting (Simple Approach): The Algorithm

Partially pre-sort the elements so that all elements in processor p are less than all those in higher-numbered processors
Recall that on Beowulf systems, the high latency of network communication favors transmission of large messages over small ones
Determine the range of values assigned to each processor
Processor p holds values between p*(INT_MAX/commsize) and (p+1)*(INT_MAX/commsize)-1, inclusive
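
The range rule can be turned into a one-line lookup. The sketch below assumes non-negative int values and reads the rule as "processor p owns p*(INT_MAX/commsize) through (p+1)*(INT_MAX/commsize)-1"; the function name dest_rank and the clamping of the topmost values are assumptions added for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Rank that owns `value` when processor p holds the range
 * [p*(INT_MAX/commsize), (p+1)*(INT_MAX/commsize) - 1]. */
int dest_rank(int value, int commsize)
{
    assert(value >= 0 && commsize > 0);
    int width = INT_MAX / commsize;   /* size of each processor's range */
    int p = value / width;
    /* Integer division leaves a few values at the very top of the range
     * outside every interval; this sketch gives them to the last rank. */
    return p < commsize ? p : commsize - 1;
}
```

Because the input is assumed uniformly distributed, equal-width ranges give each processor roughly the same number of elements.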

11
Sorting (Simple Approach): The Algorithm (contd)

Each processor scans its own list and labels each element that belongs elsewhere with its destination processor
Elements are placed in a buffer specific to that processor
Communicate data between processors

12
Sorting (Simple Approach): The Algorithm (contd)

MPI provides communication tools
MPI_Alltoallv
Requires each processor to know how much data is incoming from every partner, and exactly where it should go
First distribute lengths with MPI_Alltoall
Allocate contiguous space for all outgoing elements (temporarily)
Pacreate: initialize the returned structure
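
The packing step that precedes the exchange can be sketched without any MPI calls. The function below groups outgoing elements contiguously by destination and fills in per-rank counts and offsets, which is the layout MPI_Alltoallv takes as its send counts and send displacements. The name pack_by_dest and its parameter list are hypothetical; dest[i] is assumed to have been computed in the labelling pass.

```c
#include <assert.h>
#include <stddef.h>

/* Packs `in` into `out`, grouped by destination rank.
 * dest[i] is the rank that should receive in[i].
 * On return, sendcounts[p] is the number of elements bound for rank p
 * and sdispls[p] is the offset in `out` where that group starts. */
void pack_by_dest(const int *in, const int *dest, size_t n, int commsize,
                  int *out, int *sendcounts, int *sdispls)
{
    assert(commsize > 0);

    /* First pass: count how many elements go to each rank. */
    for (int p = 0; p < commsize; p++)
        sendcounts[p] = 0;
    for (size_t i = 0; i < n; i++) {
        assert(dest[i] >= 0 && dest[i] < commsize);
        sendcounts[dest[i]]++;
    }

    /* Exclusive prefix sum turns counts into starting offsets. */
    sdispls[0] = 0;
    for (int p = 1; p < commsize; p++)
        sdispls[p] = sdispls[p - 1] + sendcounts[p - 1];

    /* Second pass: copy each element into the next free slot of its
     * group. sendcounts is reset and reused as a per-rank cursor, so it
     * ends up holding the same counts as after the first pass. */
    for (int p = 0; p < commsize; p++)
        sendcounts[p] = 0;
    for (size_t i = 0; i < n; i++) {
        int p = dest[i];
        out[sdispls[p] + sendcounts[p]++] = in[i];
    }
}
```

In the full program the send counts would first be exchanged with MPI_Alltoall so that every processor learns its receive counts, as the slide describes, before MPI_Alltoallv moves the data.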

13
Analysis of Integer Sort

Quality of a parallel implementation is assessed by measuring:
speedup: s(P) = T(1)/T(P)
efficiency: e(P) = T(1)/(P*T(P)) = s(P)/P
overhead: n(P) = (P*T(P) - T(1))/T(1) = (1-e)/e
where T(1) is the time of the best available implementation on a single processor
Overhead is useful because it is additive
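
The three measures are simple to compute from two timings. A small sketch follows; the function names are illustrative, and t1 and tp stand for measured values of T(1) and T(P).

```c
#include <assert.h>

/* s(P) = T(1) / T(P) */
double speedup(double t1, double tp)
{
    assert(tp > 0.0);
    return t1 / tp;
}

/* e(P) = T(1) / (P * T(P)) = s(P) / P */
double efficiency(double t1, double tp, int p)
{
    assert(tp > 0.0 && p > 0);
    return t1 / (p * tp);
}

/* n(P) = (P * T(P) - T(1)) / T(1) = (1 - e) / e */
double overhead(double t1, double tp, int p)
{
    assert(t1 > 0.0 && p > 0);
    return (p * tp - t1) / t1;
}
```

For instance, with hypothetical timings T(1) = 8 s and T(4) = 4 s, the speedup is 2, the efficiency is 0.5, and the overhead is 1, which matches (1 - 0.5)/0.5.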

14
Sources of Overhead

Communication
Redundancy
Extra Work
Load Imbalance
Waiting

15
Sources of Overhead: Communication

Time spent communicating in parallel code (absent from the sequential implementation)
Easy to estimate; largest contribution to overhead in the sort example
Examine the MPI_Alltoall and MPI_Alltoallv calls
For MPICH, these calls are implemented as loops over point-to-point calls, hence
Tcomm = 2*P*tlatency + sizeof(local arrays)/bandwidth
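
A rough sketch of this estimate as a function, reading the slide's expression as a latency term plus a bandwidth term (the plus sign is an assumption, since the operator is missing from the transcript). The function and parameter names are hypothetical, and any inputs would have to be measured on the actual cluster.

```c
#include <assert.h>

/* Estimated communication time in seconds:
 *   2 * P * t_latency + bytes / bandwidth
 * The factor 2 accounts for the two collective calls (MPI_Alltoall and
 * MPI_Alltoallv), each treated as a loop over P point-to-point messages. */
double comm_time(int nprocs, double latency_s, double local_bytes,
                 double bandwidth_bytes_per_s)
{
    assert(nprocs > 0 && bandwidth_bytes_per_s > 0.0);
    return 2.0 * nprocs * latency_s + local_bytes / bandwidth_bytes_per_s;
}
```

The latency term grows with the number of processors while the bandwidth term depends only on the size of the local arrays, so latency dominates when each processor holds little data.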

16
Sources of Overhead: Redundancy

Performs the same computation on many processors
P-1 processors not carrying out useful work
Negligible for sort1
Some O(1) operations (calling malloc to obtain temporary space) do not impact performance

17
Sources of Overhead: Extra Work

Parallel computation that does not take place in a sequential implementation
e.g. for sort1: computing the destination processor for every input element

18
Sources of Overhead: Load Imbalance

Measures extra time spent by the slowest processor, in excess of the mean over all processors
With uniformly distributed input, the number of elements per processor should follow a Gaussian distribution with mean N/P and standard deviation (N/P)*sqrt((P-1)/N)
imbal = (n_largest - n_mean)/n_mean = O(1)*sqrt((P-1)/N)
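
The imbalance can be measured directly from the per-processor element counts once they are known. A minimal sketch, with the function name load_imbalance chosen for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* (n_largest - n_mean) / n_mean for per-processor element counts. */
double load_imbalance(const long *counts, size_t nprocs)
{
    assert(nprocs > 0);
    long largest = counts[0];
    long total = 0;
    for (size_t p = 0; p < nprocs; p++) {
        if (counts[p] > largest)
            largest = counts[p];
        total += counts[p];
    }
    assert(total > 0);
    double mean = (double)total / (double)nprocs;
    return ((double)largest - mean) / mean;
}
```

A perfectly balanced distribution gives 0; counts of 2, 2, 2 and 6 give a mean of 3 and an imbalance of 1, meaning the slowest processor has twice the average load.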

19
Sources of Overhead: Waiting

Fine-grained imbalance even though the overall load may be balanced
e.g. frequent synchronization between short computations
For sort, synchronization occurs during calls to MPI_Alltoall and MPI_Alltoallv
Occurs immediately after the initial decomposition
Overhead negligible

20
Measurement of Integer Sort

upshot (MPICH)
Graphical tool to render a visual representation of parallel program behavior
Logs the time spent in different phases
Goal: improve the performance

21
More General Sorting: Performance Improvement

Faster sequential sort routines
D.E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Addison-Wesley, 1973
Relax restrictions on input data
Sort more general objects
May no longer use an integer key
Use of a compar function
Choosing fenceposts

22
More General Sorting: Approach

Solution to choosing fenceposts: oversample
Select nfence objects
A larger value of nfence gives better load balance but results in more work
The right nfence value is difficult to determine a priori
MPI_Allgather and bsearch are used to divide data into more bins than processors
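
Once every processor holds the same sorted list of fenceposts, placing an element in a bin is a binary search. The sketch below uses a hand-written search with a qsort-style compar function rather than bsearch itself, because the standard bsearch only reports exact matches; how the lecture's code adapts bsearch is not shown in the slides. The names find_bin and cmp_int_key are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the bin index of `key` given `nfence` fenceposts sorted under
 * `compar`. Bin b holds keys k with fence[b-1] <= k < fence[b], so there
 * are nfence + 1 bins in total. */
size_t find_bin(const void *key, const void *fence, size_t nfence,
                size_t size, int (*compar)(const void *, const void *))
{
    assert(size > 0 && compar != NULL);
    const char *base = fence;
    size_t lo = 0, hi = nfence;          /* answer lies in [lo, hi] */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (compar(key, base + mid * size) < 0)
            hi = mid;                    /* key is below fence[mid] */
        else
            lo = mid + 1;                /* key is at or above fence[mid] */
    }
    return lo;
}

/* Example comparator for int keys (an illustrative choice). */
int cmp_int_key(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}
```

Because the search only relies on compar, the same routine works for general objects, not just integer keys.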

23
More General Sorting: Approach (contd)

Look at the population of each bin and assign bins to processors, achieving load balance
MPI_Allreduce computes the sum over all processors of the count in each bucket
Finally, MPI_Alltoallv delivers elements to the correct destinations and a call to qsort completes the local sort in each processor
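
One simple way to map bins to processors from the global bin populations is a greedy pass that keeps bins in order and moves to the next rank once the running total passes that rank's share. This is a sketch of that idea under stated assumptions, not necessarily the rule used in the lecture's code; assign_bins is a hypothetical name, and bincount is assumed to hold the sums produced by the MPI_Allreduce step.

```c
#include <assert.h>
#include <stddef.h>

/* Assigns each bin to a processor, keeping bins in order so that sorted
 * order across processors is preserved. bincount[b] is the global
 * population of bin b; owner[b] receives the rank that will hold bin b. */
void assign_bins(const long long *bincount, size_t nbins, int nprocs,
                 int *owner)
{
    assert(nprocs > 0);
    long long total = 0;
    for (size_t b = 0; b < nbins; b++)
        total += bincount[b];

    long long seen = 0;   /* elements in bins already assigned */
    int p = 0;
    for (size_t b = 0; b < nbins; b++) {
        /* Advance to the next rank once the elements assigned so far
         * reach (p + 1) * total / nprocs, the end of rank p's share. */
        while (p < nprocs - 1 && seen * nprocs >= (long long)(p + 1) * total)
            p++;
        owner[b] = p;
        seen += bincount[b];
    }
}
```

Having more bins than processors is what makes this step useful: the finer the bins, the closer each processor's total can get to N/P, at the cost of the extra sampling work.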

24
Analysis of General Sort

More complicated program, as it invokes more MPI routines
Tradeoff between the cost of selecting more fenceposts and the improved load balance from using more samples
Author chooses an intermediate case: P = 12, N = 1M, object size = 32 bytes

25
Summary