Programming with MPI: A Detailed Example

1
Programming with MPI: A Detailed Example
EE524 Lecture: How to Build a Beowulf, Chapter 9
  • By Llewellyn Yap, Wednesday, 2-23-2000

2
Overview of MPI
  • MPI: a distributed address-space model
  • Decomposition can be static or dynamic
  • Static: fixed once and for all
  • Dynamic: changing in response to the simulation
  • When partitioning, consider minimizing
    communication

3
Implementing High Performance, Parallel
Applications for MPI
  • Choose algorithm with sufficient parallelism
  • Optimize a sequential version of the algorithm
  • Use the simplest possible MPI operations
  • usually blocking, standard-mode procedures (a short
    sketch follows after this list)
  • Profiling and analysis
  • find what operations take the most time
  • Attack the most time-consuming components
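
For instance, a minimal sketch in this style, using only blocking, standard-mode MPI_Send/MPI_Recv; the ring-token pattern and the values are illustrative, not from the book:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pass a token around a ring with blocking, standard-mode calls
       (assumes at least two processes). */
    if (size > 1) {
        if (rank == 0) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD, &status);
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        }
    }
    printf("rank %d of %d saw token %d\n", rank, size, token);

    MPI_Finalize();
    return 0;
}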

4
Steps to Good Parallel Implementation
  • Understand strengths and weaknesses of sequential
    solutions
  • Choose a good sequential algorithm
  • Design a strategy for parallelization
  • Develop a rough semi-analytic model of how the
    parallel algorithm should perform

5
Steps to Good Parallel Implementation (contd)
  • Implement MPI for interprocessor communication
  • Carry out measurements for verifying performance
  • Identify bottlenecks, sources of overhead, etc., and
    minimize their impact
  • Iterate to improve performance if possible

6
Investigate Sorting
  • Sorting is a multi-faceted, irregular,
    non-grid-based problem
  • Used widely in database servers
  • Issues of sorting (size of elements, form of
    result, storage etc.)
  • Two approaches will be discussed: 1. a fairly
    restricted domain, 2. a more general approach

7
Sorting (Simple Approach): Assumptions
  • Elements are positive integers
  • Secondary storage (disk, tape) not used
  • Auxiliary primary storage is available
  • Input data uniformly distributed over range of
    integers

8
Sorting (Simple Approach): The Approach
  • Most problems for parallel computation have
    already been solved for the traditional
    sequential architectures
  • Sequential solutions exist as libraries, system
    calls or language constructs, and will be used as
    building blocks for a parallel solution

9
Sorting (Simple Approach): The Approach (contd)
  • Advantage of this approach: it leverages the design,
    debugging and optimization that have been
    performed for the sequential case.
  • Assume that we have at our disposal an optimized,
    debugged, sequential function isort that sorts
    arrays of integers in the memory of a single
    processor (a stand-in sketch follows).
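
The slides assume such an isort already exists; purely as a stand-in for illustration (not the chapter's code), it could wrap the C library's qsort:

#include <stdlib.h>

/* Ascending comparison of two ints for qsort. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Stand-in for the optimized sequential routine the slides call isort. */
void isort(int *a, int n)
{
    qsort(a, (size_t)n, sizeof(int), cmp_int);
}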

10
Sorting (Simple Approach): The Algorithm
  • Partially pre-sort the elements so that all
    elements in processor p are less than all those
    in higher-numbered processors
  • Recall that on Beowulf systems, the high latency
    of network communication favors transmission of
    large messages over small ones
  • Assign a range of values to each processor
  • Values between p*(INT_MAX/commsize) and
    (p+1)*(INT_MAX/commsize) - 1, inclusive (see the
    sketch below)
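
A minimal sketch of that range calculation (the helper name dest_rank and the variable commsize are illustrative):

#include <limits.h>

/* Destination rank for a value when [0, INT_MAX] is split into
   commsize equal slices; the clamp keeps INT_MAX on the last rank. */
static int dest_rank(int value, int commsize)
{
    int slice = INT_MAX / commsize;
    int dest = value / slice;
    return (dest >= commsize) ? commsize - 1 : dest;
}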

11
Sorting (Simple Approach): The Algorithm (contd)
  • Each processor scans its own list; elements that
    belong elsewhere are labeled with their destination
    processor
  • Elements are placed in a buffer specific to that
    destination (see the sketch after this list)
  • Communicate data between processors
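
A sketch of the scan-and-bucket step, reusing the dest_rank helper above; the two-pass count-then-copy structure is one common way to build the per-destination buffers, not necessarily the chapter's:

#include <stdlib.h>
#include <string.h>

/* Count elements bound for each rank, then pack them so that
   sendbuf[sdispls[p] .. sdispls[p]+sendcounts[p]-1] holds the
   elements destined for rank p. */
void bucket_elements(const int *a, int n, int commsize,
                     int *sendcounts, int *sdispls, int *sendbuf)
{
    int *fill = malloc(commsize * sizeof(int));
    int i, p;

    for (p = 0; p < commsize; p++)
        sendcounts[p] = 0;
    for (i = 0; i < n; i++)
        sendcounts[dest_rank(a[i], commsize)]++;

    sdispls[0] = 0;
    for (p = 1; p < commsize; p++)
        sdispls[p] = sdispls[p - 1] + sendcounts[p - 1];

    memcpy(fill, sdispls, commsize * sizeof(int));
    for (i = 0; i < n; i++) {
        p = dest_rank(a[i], commsize);
        sendbuf[fill[p]++] = a[i];
    }
    free(fill);
}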

12
Sorting (Simple Approach): The Algorithm (contd)
  • MPI provides communication tools
  • MPI_Alltoallv
  • Requires each processor to specify how much data is
    incoming from every partner, and exactly where it
    should go
  • First distribute lengths with MPI_Alltoall
  • Allocate contiguous space for all outgoing
    elements (temporarily)
  • Pacreate: initialize the returned structure
    (a sketch of the exchange follows)
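
A sketch of the exchange itself, assuming the counts, displacements and packed send buffer built above: the per-partner lengths travel first with MPI_Alltoall, then the elements with MPI_Alltoallv.

#include <mpi.h>
#include <stdlib.h>

/* Exchange counts, size the receive side, then move the data.
   Returns the received elements; *recv_n gets their total count. */
int *exchange_elements(const int *sendbuf, const int *sendcounts,
                       const int *sdispls, int commsize, int *recv_n)
{
    int *recvcounts = malloc(commsize * sizeof(int));
    int *rdispls    = malloc(commsize * sizeof(int));
    int *recvbuf;
    int p, total = 0;

    /* Tell every partner how much is coming; learn how much to expect. */
    MPI_Alltoall((void *)sendcounts, 1, MPI_INT,
                 recvcounts, 1, MPI_INT, MPI_COMM_WORLD);

    for (p = 0; p < commsize; p++) {
        rdispls[p] = total;
        total += recvcounts[p];
    }
    recvbuf = malloc(total * sizeof(int));

    /* Deliver every element to its destination processor. */
    MPI_Alltoallv((void *)sendbuf, (int *)sendcounts, (int *)sdispls, MPI_INT,
                  recvbuf, recvcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    *recv_n = total;
    free(recvcounts);
    free(rdispls);
    return recvbuf;
}

After the exchange, each processor can run the sequential isort on the elements it received to complete the sort.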

13
Analysis of Integer Sort
  • Quality of a parallel implementation is assessed
    by measuring
  • speedup: s(P) = T(1)/T(P)
  • efficiency: e(P) = T(1)/(P*T(P)) = s(P)/P
  • overhead: n(P) = (P*T(P) - T(1))/T(1) = (1-e)/e
  • where T(1) is the time of the best available
    implementation on a single processor
  • Overhead is useful because it is additive
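
For example (illustrative numbers): if T(1) = 10 s and T(8) = 2 s, then s(8) = 5, e(8) = 5/8 = 0.625, and n(8) = (8*2 - 10)/10 = 0.6, consistent with (1-e)/e; the overhead contributions from the separate sources listed next simply add up to this figure.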

14
Sources of Overhead
  • Communication
  • Redundancy
  • Extra Work
  • Load Imbalance
  • Waiting

15
Sources of Overhead: Communication
  • Time spent communicating in parallel code
    (exclude sequential implementation)
  • Easy to estimate; the largest contribution to
    overhead in the sort example
  • examine the MPI_Alltoall and MPI_Alltoallv calls
  • for MPICH, the calls are implemented as loops over
    point-to-point calls, hence
  • Tcomm ≈ 2*P*tlatency + sizeof(local
    arrays)/bandwidth
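
For example (hypothetical numbers only): with P = 12, tlatency = 100 microseconds and a local array of roughly 333 KB (1M four-byte integers split twelve ways), the latency term is about 2.4 ms while moving the array at 10 MB/s takes about 33 ms, so the bandwidth term dominates.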

16
Sources of Overhead: Redundancy
  • Performs same computation on many processors
  • P-1 processors not carrying out useful work
  • Negligible for sort1
  • Some O(1) operations (calling malloc to obtain
    temporary space) do not impact performance

17
Sources of Overhead: Extra Work
  • Parallel computation that does not take place in
    a sequential implementation
  • e.g. for sort1 computing processor destination
    for every input element

18
Sources of Overhead: Load Imbalance
  • Measures extra time spent by the slowest
    processor, in excess of the mean over all
    processors
  • With uniformly distributed keys, the per-processor
    counts follow approximately a Gaussian distribution
    N(N/P, (N/P)*sqrt((P-1)/N))
  • imbal = (nlargest - nmean)/nmean = O(1)*sqrt((P-1)/N)
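
For example, with N = 1M elements and P = 12 processors, sqrt((P-1)/N) = sqrt(11/10^6) is about 0.0033, so the statistical imbalance from uniformly distributed keys is only around 0.3% of the mean load.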

19
Sources of Overhead: Waiting
  • Fine-grained imbalance even though the overall
    load may be balanced
  • e.g. frequent synchronization between short
    computations
  • For sort, synchronization occurs during calls to
    MPI_Alltoall and MPI_Alltoallv
  • occurs immediately after the initial decomposition,
    so the overhead is negligible

20
Measurement of Integer Sort
  • upshot (distributed with MPICH)
  • Graphical tool to render a visual representation
    of parallel program behavior
  • Logs the time spent in different phases
  • Goal: to improve the performance

21
More General Sorting: Performance Improvement
  • Faster sequential sort routines
  • D.E. Knuth, The Art of Computer Programming,
    Volume 3: Sorting and Searching, Addison-Wesley,
    1973
  • Relax restrictions on input data
  • sort more general objects
  • May no longer use an integer key
  • use of a compar function (see the sketch after
    this list)
  • choosing fenceposts
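
As an illustration of the compar idea, a qsort-style comparison for a hypothetical 32-byte record keyed by a string (the record layout is invented; the real key and layout depend on the application):

#include <stdlib.h>
#include <string.h>

/* Hypothetical 32-byte record: a string key plus opaque payload. */
typedef struct {
    char key[16];
    char payload[16];
} record_t;

/* qsort/bsearch-style comparison on the string key. */
static int compar(const void *a, const void *b)
{
    return strcmp(((const record_t *)a)->key,
                  ((const record_t *)b)->key);
}

/* Local sort of n records on one processor. */
void local_sort(record_t *recs, size_t n)
{
    qsort(recs, n, sizeof(record_t), compar);
}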

22
More General Sorting: Approach
  • Solution to choosing fenceposts: oversample
  • Select nfence objects
  • a larger value of nfence gives better load balance
  • but results in more work
  • the right nfence value is difficult to determine a
    priori
  • MPI_Allgather and bsearch to divide the data into
    more bins than processors (see the sketch after
    this list)
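
A sketch of the oversampling step, reusing the hypothetical record_t and compar above: every rank contributes sample keys via MPI_Allgather, the pooled samples are sorted to yield fenceposts, and a binary search places each element in a bin (the slides name bsearch; the explicit loop below is a simplification for keys that fall between fenceposts):

#include <mpi.h>
#include <stdlib.h>

/* Pool nfence local sample records from every rank so that all
   processors see the same sorted candidate fenceposts. */
void gather_samples(const record_t *local_samples, int nfence,
                    record_t *all_samples, int commsize)
{
    MPI_Allgather((void *)local_samples, nfence * (int)sizeof(record_t),
                  MPI_BYTE, all_samples, nfence * (int)sizeof(record_t),
                  MPI_BYTE, MPI_COMM_WORLD);
    /* Sort the pool; evenly spaced elements then serve as fenceposts. */
    qsort(all_samples, (size_t)nfence * commsize, sizeof(record_t), compar);
}

/* Binary search for the bin an element belongs to; fence[i] is the
   lower boundary of bin i+1, so bin 0 has no lower bound. */
int find_bin(const record_t *x, const record_t *fence, int nbins)
{
    int lo = 0, hi = nbins - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (compar(x, &fence[mid - 1]) >= 0)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}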

23
More General Sorting: Approach (contd)
  • Look at the population of each bin and assign bins
    to processors, achieving load balance (a sketch
    follows after this list)
  • MPI_Allreduce computes sum over all processors of
    the count in each bucket
  • Finally, MPI_Alltoallv delivers elements to the
    correct destinations and a call to qsort
    completes the local sort in each processor
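
A sketch of the global count and a simple greedy assignment of bins to processors (only the MPI_Allreduce call is named by the slides; the assignment heuristic and names are illustrative):

#include <mpi.h>
#include <stdlib.h>

/* Sum per-bin counts over all processors, then walk the bins in order,
   cutting over to the next processor once it holds roughly its share. */
void assign_bins(const int *local_counts, int *owner, int nbins,
                 int commsize, long total_elements)
{
    int *global_counts = malloc(nbins * sizeof(int));
    long target = (total_elements + commsize - 1) / commsize;
    long held = 0;
    int b, p = 0;

    MPI_Allreduce((void *)local_counts, global_counts, nbins,
                  MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    for (b = 0; b < nbins; b++) {
        owner[b] = p;                  /* all of bin b goes to processor p */
        held += global_counts[b];
        if (held >= target && p < commsize - 1) {
            p++;
            held = 0;
        }
    }
    free(global_counts);
}

With the owners known, each element's destination is owner[find_bin(...)], and, as the slide says, MPI_Alltoallv plus a local qsort with compar completes the sort.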

24
Analysis of General Sort
  • More complicated program, as it invokes more MPI
    routines
  • Tradeoff between the cost of selecting more
    fenceposts and improving load balance by using more
    samples
  • The author chooses an intermediate case: P = 12,
    N = 1M, object size = 32 bytes

25
Summary
  • Trust no one
  • A performance model
  • Instrumentation and graphs
  • Graphical tools
  • Superlinear speedup
  • Enough is enough