Title: Programming with MPI: A Detailed Example
EE524 LECTURE: How to Build a Beowulf, Chapter
- By Llewellyn Yap, Wednesday, 2-23-2000
2 Overview of MPI
- MPI: a distributed space model
- Decomposition can be static or dynamic
- Static: fixed once and for all
- Dynamic: changing in response to the simulation
- When partitioning, consider minimizing
3 Implementing High Performance, Parallel Applications for MPI
- Choose algorithm with sufficient parallelism
- Optimize a sequential version of the algorithm
- Use simplest possible MPI operations
- usually blocking, standard mode procedures
- Profiling and analysis
- find what operations take the most time
- Attack the most time-consuming components
4 Steps to Good Parallel Implementation
- Understand strengths and weaknesses of sequential
- Choose a good sequential algorithm
- Design a strategy for parallelization
- Develop a rough semi-analytic model of how the
parallel algorithm should perform
5 Steps to Good Parallel Implementation (contd)
- Implement MPI for interprocessor communication
- Carry out measurements for verifying performance
- Identify bottlenecks, sources of overhead, etc.,
and minimize their impact
- Iterate to improve performance if possible
6 Investigate Sorting
- Sorting is a multi-faceted, irregular,
non-grid-based problem
- Used widely in database servers
- Issues of sorting (size of elements, form of
result, storage etc.)
- Two approaches will be discussed:
1. A fairly restricted domain
2. More general approach
7 Sorting (Simple Approach): Assumptions
- Elements are positive integers
- Secondary storage (disk, tape) not used
- Auxiliary primary storage is available
- Input data uniformly distributed over range of
8 Sorting (Simple Approach): The Approach
- Most problems for parallel computation have
already been solved for the traditional
sequential architectures
- Sequential solutions that exist as libraries, system
calls or language constructs will be used as
building blocks for a parallel solution
9 Sorting (Simple Approach): The Approach (contd)
- Advantage of this approach: it leverages the design,
debugging and optimization that have been
performed for the sequential case.
- Assume that we have at our disposal an optimized,
debugged, sequential function isort that sorts
arrays of integers in the memory of a single
10 Sorting (Simple Approach): The Algorithm
- Partially pre-sort the elements so that all
elements in processor p are less than all those
in higher-numbered processors
- Recall that on Beowulf systems, the high latency
of network communication favors transmission of
large messages over small ones
- Determine the range of values assigned to each processor
- Processor p gets values between p*(INT_MAX/commsize) and
(p+1)*(INT_MAX/commsize)-1, inclusive
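The range rule above can be sketched as a small C helper. This is a minimal illustration, not code from the lecture: the name dest_rank is hypothetical, and clamping the result to the last rank is an assumption made here, because integer division leaves a few values just below INT_MAX outside every range.

```c
#include <assert.h>
#include <limits.h>

/* Map a non-negative integer to the rank that owns its value range.
 * Rank p owns [p*width, (p+1)*width - 1] with width = INT_MAX/commsize.
 * dest_rank is a hypothetical helper name, not taken from the lecture. */
int dest_rank(int value, int commsize)
{
    assert(value >= 0 && commsize > 0);
    int width = INT_MAX / commsize;   /* size of each rank's range */
    int p = value / width;
    /* Integer division leaves a few values at the very top (up to
     * INT_MAX) beyond the last range; assign them to the last rank. */
    return (p < commsize) ? p : commsize - 1;
}
```

With commsize = 4, for example, INT_MAX itself would compute p = 4, which is why the clamp is needed.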
11 Sorting (Simple Approach): The Algorithm (contd)
- Each processor scans its own list and labels each
element with its destination processor
- Elements placed in buffer specific to that
- Communicate data between processors
12 Sorting (Simple Approach): The Algorithm (contd)
- MPI provides communication tools
- MPI_Alltoallv
- Requires each processor to specify how much data is
incoming from every partner, and exactly where it
should go
- First distribute lengths with MPI_Alltoall
- Allocate contiguous space for all outgoing
elements (temporarily)
- Pacreate - Initialize the returned structure
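As a rough sketch of the bookkeeping that precedes the exchange, the following C function groups outgoing elements into one contiguous buffer and fills a count and an offset per destination. The arrays are named sendcounts and sdispls after the corresponding MPI_Alltoallv arguments. The function names, the two-pass layout, and the inline range rule are assumptions made for illustration, and no MPI call is made here.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Destination rank for a non-negative value (same range rule as the
 * slides, clamped to the last rank; helper name is hypothetical). */
static int owner(int value, int commsize)
{
    int p = value / (INT_MAX / commsize);
    return p < commsize ? p : commsize - 1;
}

/* Group n elements by destination into one contiguous buffer.
 * On return, sendcounts[p] is the number of elements bound for rank p
 * and sdispls[p] is the offset of rank p's block in the returned buffer.
 * The caller frees the buffer. Returns NULL if allocation fails. */
int *pack_by_dest(const int *data, int n, int commsize,
                  int *sendcounts, int *sdispls)
{
    assert(n >= 0 && commsize > 0);
    int *out = malloc((n > 0 ? (size_t)n : 1) * sizeof *out);
    int *next = malloc((size_t)commsize * sizeof *next);
    if (!out || !next) { free(out); free(next); return NULL; }

    /* Pass 1: count elements per destination. */
    for (int p = 0; p < commsize; p++) sendcounts[p] = 0;
    for (int i = 0; i < n; i++) sendcounts[owner(data[i], commsize)]++;

    /* An exclusive prefix sum turns counts into block offsets. */
    int offset = 0;
    for (int p = 0; p < commsize; p++) {
        sdispls[p] = offset;
        next[p] = offset;
        offset += sendcounts[p];
    }

    /* Pass 2: copy each element into its destination's block. */
    for (int i = 0; i < n; i++)
        out[next[owner(data[i], commsize)]++] = data[i];

    free(next);
    return out;
}
```

In an MPI program, the sendcounts array would be the natural thing to exchange with MPI_Alltoall so that every processor learns how much data to expect from each partner.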
13 Analysis of Integer Sort
- Quality of a parallel implementation is assessed
by measuring
- speedup: $s(P) = T(1)/T(P)$
- efficiency: $e(P) = T(1)/(P\,T(P)) = s(P)/P$
- overhead: $n(P) = (P\,T(P) - T(1))/T(1) = (1-e)/e$
- where $T(1)$ is the time of the best available
implementation on a single processor
- Overhead is useful because it is additive
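The three measures can be computed directly from two timings. The sketch below is illustrative only: the function and parameter names are chosen here, and the timings in the usage note are made-up numbers, not measurements from the lecture.

```c
#include <assert.h>

/* t1: time of the best single-processor implementation, T(1).
 * tp: time of the parallel run on p processors, T(P). */
double speedup(double t1, double tp)
{
    assert(tp > 0.0);
    return t1 / tp;
}

double efficiency(double t1, double tp, int p)
{
    assert(tp > 0.0 && p > 0);
    return t1 / (p * tp);
}

/* Extra processor-time spent by the parallel run, relative to T(1). */
double overhead(double t1, double tp, int p)
{
    assert(t1 > 0.0);
    return (p * tp - t1) / t1;
}
```

For a hypothetical run with T(1) = 100 s and T(4) = 30 s, this gives a speedup of about 3.33, an efficiency of about 0.83 and an overhead of 0.2, which agrees with (1-e)/e.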
14 Sources of Overhead
- Communication
- Redundancy
- Extra Work
- Load Imbalance
- Waiting
15 Sources of Overhead: Communication
- Time spent communicating in parallel code
(exclude sequential implementation)
- Easy to estimate; the largest contribution to
overhead (in sort example)
- examine MPI_Alltoall and MPI_Alltoallv calls
- for MPICH, calls are implemented as loops over
point-to-point calls, hence
- Tcomm 2 P tlatency sizeof(local
16 Sources of Overhead: Redundancy
- Performs same computation on many processors
- P-1 processors not carrying out useful work
- Negligible for sort1
- Some O(1) operations (calling malloc to obtain
temporary space) do not impact performance
17 Sources of Overhead: Extra Work
- Parallel computation that does not take place in
a sequential implementation
- e.g. for sort1: computing the destination processor
for every input element
18 Sources of Overhead: Load Imbalance
- Measures extra time spent by the slowest
processor, in excess of the mean over all
- The load per processor should approximately follow a Gaussian
distribution with mean $N/P$ and standard deviation $(N/P)\sqrt{(P-1)/N}$
- $\mathrm{imbal} = (n_{\mathrm{largest}} - n_{\mathrm{mean}})/n_{\mathrm{mean}} = O(1)\,\sqrt{(P-1)/N}$
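The imbalance measure can be evaluated from the per-processor element counts. This is a minimal sketch: the function name and the use of element counts as a stand-in for time are assumptions made here, and the counts in the usage note are invented.

```c
#include <assert.h>

/* Relative load imbalance: (largest count - mean count) / mean count.
 * counts[p] is the number of elements held by processor p. */
double load_imbalance(const long *counts, int nproc)
{
    assert(counts != 0 && nproc > 0);
    long total = 0, largest = counts[0];
    for (int p = 0; p < nproc; p++) {
        total += counts[p];
        if (counts[p] > largest) largest = counts[p];
    }
    assert(total > 0);
    double mean = (double)total / nproc;
    return ((double)largest - mean) / mean;
}
```

With hypothetical counts of 10, 12, 8 and 10 elements on four processors, the mean is 10 and the largest is 12, so the imbalance is 0.2.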
19 Sources of Overhead: Waiting
- Fine-grained imbalance even though the overall
load may be balanced
- e.g. frequent synchronization between short
- For sort, synchronization occurs during calls to
MPI_Alltoall and MPI_Alltoallv
- occurs immediately after the initial decomposition,
so the overhead is negligible
20 Measurement of Integer Sort
- upshot (MPICH)
- Graphical tool to render a visual representation
of parallel program behavior
- Logs the time spent in different phases
- Goal: to improve the performance
21 More General Sorting: Performance Improvement
- Faster sequential sort routines
- D.E. Knuth, The Art of Computer Programming
volume 3 Sorting and Searching, Addison Wesley,
- Relax restrictions on input data
- sort more general objects
- May no longer use an integer key
- use of compar function
- choosing fenceposts
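To illustrate sorting more general objects through a comparison function, here is a small C sketch using the standard qsort. The record layout, its field names and the helper names are hypothetical, chosen only to show a key that is not an integer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record type: ordered by a floating-point key. */
struct record {
    double key;
    int id;
};

/* Comparison function in the form qsort expects: negative, zero or
 * positive when the first argument sorts before, equal to, or after
 * the second. Written without subtraction to avoid rounding issues. */
static int compar_record(const void *a, const void *b)
{
    const struct record *x = a, *y = b;
    return (x->key > y->key) - (x->key < y->key);
}

/* Sort n records in place by ascending key. */
void sort_records(struct record *r, size_t n)
{
    assert(r != NULL || n == 0);
    qsort(r, n, sizeof *r, compar_record);
}
```

Because the ordering lives entirely in the comparison function, the same sorting code can handle any object type for which such a function can be written.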
22 More General Sorting: Approach
- Solution to choosing fenceposts: oversample
- Select nfence objects
- larger value of nfence, better load balance
- but results in more work
- nfence value difficult to determine a priori
- MPI_Allgather and bsearch to divide data into
more bins than processors
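One way to place an element into a bin, given sorted fenceposts, is a binary search driven by the same kind of comparison function. The sketch below is an assumption about how such a step could look, not the lecture's code: the standard C bsearch only returns an exact match or NULL, so this version is hand-written to return the bin index instead, and the names find_bin and compar_int are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return the bin index of key: the number of fenceposts that compare
 * less than or equal to key. With nfence sorted fenceposts there are
 * nfence + 1 bins, numbered 0..nfence. */
size_t find_bin(const void *key, const void *fences, size_t nfence,
                size_t size, int (*compar)(const void *, const void *))
{
    assert(key != NULL && compar != NULL);
    size_t lo = 0, hi = nfence;          /* answer lies in [lo, hi] */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const char *f = (const char *)fences + mid * size;
        if (compar(f, key) <= 0)
            lo = mid + 1;                /* fence <= key: later bin */
        else
            hi = mid;                    /* fence > key: this bin or earlier */
    }
    return lo;
}

/* Example comparison function for plain ints. */
static int compar_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}
```

With fenceposts 10, 20 and 30, a key of 5 lands in bin 0, a key of 25 in bin 2, and a key of 35 in bin 3.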
23 More General Sorting: Approach (contd)
- Look at the population of each bin and assign bins to
processors, achieving load balance
- MPI_Allreduce computes sum over all processors of
the count in each bucket
- Finally, MPI_Alltoallv delivers elements to the
correct destinations and a call to qsort
completes the local sort in each processor
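The bin-to-processor assignment can be sketched as follows, assuming every processor already holds the global count for each bin (for instance after a sum reduction). The heuristic used here, which sends each bin to the processor whose share of the total contains the bin's starting position, is one simple choice made for illustration and is not claimed to be the lecture's method. The name assign_bins is hypothetical.

```c
#include <assert.h>

/* Assign contiguous runs of bins to processors so that each processor
 * receives roughly total/nproc elements.
 * count[b]: global number of elements in bin b.
 * owner[b]: output, the processor that bin b is sent to.
 * Owners are non-decreasing in b, so the global order is preserved. */
void assign_bins(const long *count, int nbins, int nproc, int *owner)
{
    assert(count != 0 && owner != 0 && nbins > 0 && nproc > 0);
    long total = 0;
    for (int b = 0; b < nbins; b++) total += count[b];

    long before = 0;                     /* elements in bins 0..b-1 */
    for (int b = 0; b < nbins; b++) {
        int p = 0;
        if (total > 0)
            p = (int)(((long long)before * nproc) / total);
        if (p >= nproc) p = nproc - 1;   /* trailing empty bins */
        owner[b] = p;
        before += count[b];
    }
}
```

With eight bins of five elements each and four processors, each processor receives two consecutive bins. A single very large bin still goes to one processor, which is why using more bins than processors helps.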
24 Analysis of General Sort
- More complicated program, as it invokes more MPI
- Tradeoff between the cost of selecting more fenceposts
and improving load balance by using more samples
- Author chooses an intermediate case: P=12, N=1M,
object size=32 bytes
- Trust no one
- A performance model
- Instrumentation and graphs
- Graphical tools
- Superlinear speed up
- Enough is enough