1
Crash Course in Parallel Programming Using MPI
  • Adam Jacobs
  • HCS Research Lab
  • 01/10/07

2
Outline: PCA Preparation
  • Parallel Computing
  • Distributed Memory Architectures
  • Programming Models
  • Flynn's Taxonomy
  • Parallel Decomposition
  • Speedups

3
Parallel Computing
  • Motivated by high computational complexity and
    memory requirements of large applications
  • Two Approaches
  • Shared Memory
  • Distributed Memory
  • The majority of modern systems are clusters
    (distributed memory architecture)
  • Many simple machines connected with a powerful
    interconnect
  • ASCI Red, ASCI White, etc.
  • A hybrid approach can also be used
  • IBM Blue Gene

4
Shared Memory Systems
  • Memory resources are shared among processors
  • Relatively easy to program for since there is a
    single unified memory space
  • Scales poorly with system size due to the need
    for cache coherency
  • Example
  • Symmetric Multiprocessors (SMP)
  • Each processor has equal access to RAM
  • 4-way motherboards MUCH more expensive than 2-way

5
Distributed Memory Systems
  • Individual nodes consist of a CPU, RAM, and a
    network interface
  • A hard disk is not necessary; mass storage can be
    supplied using NFS
  • Information is passed between nodes using the
    network
  • No need for special cache coherency hardware
  • More difficult to write programs for distributed
    memory systems since the programmer must keep
    track of memory usage

6
Programming Models
  • Multiprogramming
  • Multiple programs running simultaneously
  • Shared Address
  • Global address space available to all processors
  • Shared data is written to this global space
  • Message Passing
  • Data is sent directly to processors using
    messages
  • Data Parallel

7
Flynn's Taxonomy
  • SISD: Single Instruction, Single Data
  • Normal instructions
  • SIMD: Single Instruction, Multiple Data
  • Vector operations, MMX, SSE, AltiVec
  • MISD: Multiple Instructions, Single Data
  • MIMD: Multiple Instructions, Multiple Data
  • SPMD: Single Program, Multiple Data

8
Parallel Decomposition
  • Data Parallelism
  • Parallelism within a dataset such that a portion
    of the data can be computed independently from
    the rest
  • Usually results in coarse-grained parallelism
    (compute farms)
  • Allows for automatic load balancing strategies
  • Functional Parallelism
  • Parallelism between distinct functional blocks
    such that each block can be performed
    independently
  • Especially useful for pipeline structures

9
Speedup
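For reference, the standard definitions (assumed here rather than taken from the slide): speedup on p processors is the serial runtime divided by the parallel runtime, and parallel efficiency is the speedup per processor.

    S(p) = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}(p)},
    \qquad E(p) = \frac{S(p)}{p},
    \qquad \text{linear speedup: } S(p) = p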
10
Super-linear Speedup
  • Linear speedup is the best that can be achieved
  • Or is it?
  • Super-linear speedup occurs when parallelizing an
    algorithm results in a more efficient use of
    hardware resources
  • A 1 MB task doesn't fit in a single processor's cache
  • Two 512 KB tasks do fit, resulting in lower effective
    memory access times

11
MPI: Message Passing Interface
  • Adam Jacobs
  • HCS Research Lab
  • 01/10/07

Slides created by Raj Subramaniyan
12
Outline: MPI Usage
  • Introduction
  • MPI Standard
  • MPI Implementations
  • MPICH Introduction
  • MPI calls
  • Present Emphasis

13
Parallel Computing
  • Motivated by high computational complexity and
    memory requirements of large applications
  • Cooperation with other processes
  • Cooperative and one-sided operations
  • Processes interact with each other by exchanging
    information
  • Models
  • SIMD
  • SPMD
  • MIMD

14
Cooperative Operations
  • Cooperative: all parties agree to transfer data
  • Message-passing is an approach that makes the
    exchange of data cooperative
  • Data must both be explicitly sent and received
  • Any change in the receiver's memory is made with
    the receiver's participation

15
MPI: Message Passing Interface
  • MPI: a message-passing library specification
  • A message passing model and not a specific
    product
  • Designed for parallel computers, clusters and
    heterogeneous networks
  • Standardization began in 1992 and the final draft
    was made available in 1994
  • Broad participation of vendors, library writers,
    application specialists and scientists

Message Passing Interface Forum, accessible at
http://www.mpi-forum.org/
16
Features of MPI
  • Point-to-point communication
  • Collective operations
  • Process groups
  • Communication contexts
  • Process topologies
  • Bindings for Fortran 77 and C
  • Environmental management and inquiry
  • Profiling interface

17
Features NOT included in MPI
  • Explicit shared-memory operations
  • Operations that require more operating system
    support than the standard assumes, for example
    interrupt-driven receives, remote execution, or
    active messages
  • Program construction tools
  • Explicit support for threads
  • Support for task management
  • I/O functions

18
MPI Implementations
  • Listed below are MPI implementations available
    for free
  • Appleseed (UCLA)
  • CRI/EPCC (Edinburgh Parallel Computing Centre)
  • LAM/MPI (Indiana University)
  • MPI for UNICOS Systems (SGI)
  • MPI-FM (University of Illinois) for Myrinet
  • MPICH (ANL)
  • MVAPICH (Infiniband)
  • SGI Message Passing Toolkit
  • OpenMPI

A detailed list of MPI implementations with
features can be found at
http://www.lam-mpi.org/mpi/implementations/
19
MPICH
  • MPICH: a portable implementation of MPI developed
    at the Argonne National Laboratory (ANL) and
    Mississippi State University (MSU)
  • Very widely used
  • Supports all the specs of the MPI-1 standard
  • Features that are part of the MPI-2 standard are
    under development (by ANL alone)

http://www-unix.mcs.anl.gov/mpi/mpich/
20
Writing MPI Programs
Part of all programs
  • include "mpi.h" // Gives basic MPI types,
    definitions
  • include ltstdio.hgt
  • int main( argc, argv )
  • int argc
  • char argv
  • MPI_Init( argc, argv ) // Starts MPI
  • Actual code including normal C calls and MPI
    calls
  • MPI_Finalize() // Ends MPI
  • return 0

21
Initialize and Finalize
  • MPI_Init
  • Initializes all necessary MPI variables
  • Forms the MPI_COMM_WORLD communicator
  • A communicator is a list of all the connections
    between nodes
  • Opens necessary TCP connections
  • MPI_Finalize
  • Waits for all processes to reach the function
  • Closes TCP connections
  • Cleans up

22
Rank and Size
  • Environment details
  • How many processes are there? (MPI_Comm_size)
  • Who am I? (MPI_Comm_rank)
  • MPI_Comm_size( MPI_COMM_WORLD, &size );
  • MPI_Comm_rank( MPI_COMM_WORLD, &rank );
  • The rank is a number between 0 and size-1

23
Sample Hello World Program
    #include <stdio.h>
    #include <string.h>
    #include "mpi.h"

    int main(int argc, char* argv[])
    {
        int my_rank, p;        // process rank and number of processes
        int source, dest;      // rank of sending and receiving process
        int tag = 0;           // tag for messages
        char mesg[100];        // storage for message
        MPI_Status status;     // stores status for MPI_Recv statements

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        if (my_rank != 0) {
            sprintf(mesg, "Greetings from %d!", my_rank);  // stores into character array
            dest = 0;                                      // destination for MPI_Send is process 0
            MPI_Send(mesg, strlen(mesg)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);  // sends string to process 0
        } else {
            for (source = 1; source < p; source++) {
                MPI_Recv(mesg, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);  // recv from each process
                printf("%s\n", mesg);                      // prints out greeting to screen
            }
        }
        MPI_Finalize();  // shuts down MPI
        return 0;
    }

24
Compiling MPI Programs
  • Two methods
  • Compilation commands
  • Using Makefile
  • Compilation commands
  • mpicc -o hello_world hello-world.c
  • mpif77 -o hello_world hello-world.f
  • Likewise, mpiCC and mpif90 are available for C++
    and Fortran 90, respectively
  • Makefile.in is a template Makefile
  • mpireconfig translates Makefile.in to a Makefile
    for a particular system

25
Running MPI Programs
  • To run hello_world on two machines
  • mpirun -np 2 hello_world
  • Must specify full path of executable
  • To see the commands executed by mpirun
  • mpirun -t
  • To get all the mpirun options
  • mpirun -help

26
MPI Communications
  • Typical blocking send:
  • send (dest, type, address, length)
  • dest: integer representing the process to
    receive the message
  • type: data type being sent (often overloaded)
  • (address, length): contiguous area in memory
    being sent
  • MPI_Send/MPI_Recv provide point-to-point
    communication
  • Typical global operation:
  • broadcast (type, address, length)
  • Six basic MPI calls (MPI_Init, MPI_Finalize,
    MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv)

27
MPI Basic Send/Recv
  • int MPI_Send( void* buf, int count, MPI_Datatype
    datatype, int dest, int tag, MPI_Comm comm )
  • buf: initial address of send buffer; dest: rank
    of destination (integer)
  • tag: message tag (integer); comm: communicator
    (handle)
  • count: number of elements in send buffer
    (nonnegative integer)
  • datatype: datatype of each send buffer element
    (handle)
  • int MPI_Recv( void* buf, int count, MPI_Datatype
    datatype, int source, int tag, MPI_Comm comm,
    MPI_Status* status )
  • status: status object (Status); source: rank of
    source (integer)
  • status is mainly useful when messages are
    received with MPI_ANY_TAG and/or MPI_ANY_SOURCE

28
Information about a Message
  • The count argument in recv indicates the maximum
    length of a message
  • The actual length of the message can be obtained
    with MPI_Get_count, as in the sketch below
  • MPI_Status status;
  • MPI_Recv( ..., &status );
  • ... status.MPI_TAG
  • ... status.MPI_SOURCE
  • MPI_Get_count( &status, datatype, &count );
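As a small sketch of these calls (assumed to run inside an already initialized MPI program; the buffer size of 64 is arbitrary): the receiver accepts a message from any source with any tag, then queries the status object for the sender, the tag, and the actual element count.

    int buf[64];                       // receive buffer; 64 is an arbitrary maximum
    int actual_count;
    MPI_Status status;

    // accept a message from any sender with any tag
    MPI_Recv(buf, 64, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

    // inspect who sent it, which tag was used, and how many ints actually arrived
    MPI_Get_count(&status, MPI_INT, &actual_count);
    printf("got %d ints from rank %d (tag %d)\n",
           actual_count, status.MPI_SOURCE, status.MPI_TAG);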

29
Example Matrix Multiplication Program
    /* send matrix data to the worker tasks */
    averow = NRA/numworkers;
    extra  = NRA%numworkers;
    offset = 0;
    mtype  = FROM_MASTER;
    for (dest = 1; dest <= numworkers; dest++) {
        rows = (dest <= extra) ? averow+1 : averow;          // if rows are not evenly divisible
        printf("sending %d rows to task %d\n", rows, dest);  // among workers, some workers
                                                             // get an additional row
        MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);  // starting row being sent
        MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);    // number of rows sent
        count = rows*NCA;                                    // total elements of A being sent
        MPI_Send(&a[offset][0], count, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
        count = NCA*NCB;                                     // equivalent to NRB*NCB elements in B
        MPI_Send(&b, count, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
        offset = offset + rows;                              // increment offset for the next worker
    }
MASTER SIDE
30
Example Matrix Multiplication Program (contd.)
    /* wait for results from all worker tasks */
    mtype = FROM_WORKER;
    for (i = 1; i <= numworkers; i++) {        // get results from each worker
        source = i;
        MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
        count = rows*NCB;                      // elements in the result from the worker
        MPI_Recv(&c[offset][0], count, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
    }
    /* print results */
    /* end of master section */
MASTER SIDE
31
Example Matrix Multiplication Program (contd.)
    if (taskid > MASTER) {                     // implies a worker node
        mtype  = FROM_MASTER;
        source = MASTER;
        printf("Master %d, mtype %d\n", source, mtype);
        // receive the offset and number of rows
        MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
        printf("offset %d\n", offset);
        MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
        printf("rows %d\n", rows);
        count = rows*NCA;                      // elements to receive for matrix A
        MPI_Recv(&a, count, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
        printf("a[0][0] %e\n", a[0][0]);
        count = NCA*NCB;                       // elements to receive for matrix B
        MPI_Recv(&b, count, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
WORKER SIDE
32
Example Matrix Multiplication Program (contd.)
        for (k = 0; k < NCB; k++)
            for (i = 0; i < rows; i++) {
                c[i][k] = 0.0;                 // do the matrix multiplication for
                for (j = 0; j < NCA; j++)      // the rows you are assigned to
                    c[i][k] = c[i][k] + a[i][j] * b[j][k];
            }
        mtype = FROM_WORKER;
        printf("after computing\n");
        MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
        MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
        MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);  // sending the actual result
        printf("after send\n");
    }   /* end of worker */
WORKER SIDE
33
Asynchronous Send/Receive
  • MPI_Isend() and MPI_Irecv() are non-blocking;
    control returns to the program after the call is
    made (see the sketch below)
  • int MPI_Isend( void* buf, int count, MPI_Datatype
    datatype, int dest, int tag, MPI_Comm comm,
    MPI_Request* request )
  • int MPI_Irecv( void* buf, int count, MPI_Datatype
    datatype, int source, int tag, MPI_Comm comm,
    MPI_Request* request )
  • request: communication request (handle); output
    parameter
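A minimal sketch of a non-blocking exchange between two ranks (my_rank comes from the earlier MPI_Comm_rank call; the tag value 99 is arbitrary, and only ranks 0 and 1 are assumed to participate): each side starts both transfers, may do other work, and then waits for completion.

    int sendval = my_rank, recvval;
    int partner = (my_rank == 0) ? 1 : 0;      // assumes exactly two ranks participate
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Isend(&sendval, 1, MPI_INT, partner, 99, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recvval, 1, MPI_INT, partner, 99, MPI_COMM_WORLD, &reqs[1]);

    /* ... useful computation can overlap with communication here ... */

    MPI_Wait(&reqs[0], &stats[0]);             // send buffer is safe to reuse after this
    MPI_Wait(&reqs[1], &stats[1]);             // recvval is valid after this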

34
Detecting Completions
  • Non-blocking operations return (immediately)
    request handles that can be waited on and
    queried
  • MPI_Wait waits for an MPI send or receive to
    complete
  • int MPI_Wait ( MPI_Request* request, MPI_Status*
    status )
  • request: matches the request of an Isend or Irecv
  • status: returns a status equivalent to the status
    from MPI_Recv when complete
  • for a send, blocks until the message is buffered or
    sent, so the message variable is free for reuse
  • for a receive, blocks until the message has been
    received and is ready

35
Detecting Completions (contd.)
  • MPI_Test tests for the completion of a send or
    receive
  • int MPI_Test ( MPI_Request* request, int* flag,
    MPI_Status* status )
  • request, status: as for MPI_Wait
  • does not block
  • flag indicates whether the operation is complete or
    not
  • enables code that can repeatedly check for
    communication completion, as in the sketch below
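A sketch of such a polling loop; do_some_other_work() is a hypothetical placeholder for useful local computation, and the buffer size of 64 is arbitrary.

    int buf[64];
    int flag = 0;
    MPI_Request request;
    MPI_Status status;

    MPI_Irecv(buf, 64, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &request);
    while (!flag) {
        do_some_other_work();                  // hypothetical: overlap work with communication
        MPI_Test(&request, &flag, &status);    // returns immediately; flag != 0 once complete
    }
    /* the data in buf is valid only after flag becomes nonzero */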

36
Multiple Completions
  • Often desirable to wait on multiple requests, e.g.,
    in a master/slave program (see the sketch below)
  • int MPI_Waitall( int count, MPI_Request*
    array_of_requests, MPI_Status*
    array_of_statuses )
  • int MPI_Waitany( int count, MPI_Request*
    array_of_requests, int* index, MPI_Status*
    status )
  • int MPI_Waitsome( int incount, MPI_Request*
    array_of_requests, int* outcount, int*
    array_of_indices, MPI_Status* array_of_statuses )
  • There are corresponding versions of test for each
    of these
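A sketch of the master side of such a master/slave program, assuming ranks 1 through p-1 are slaves and MAX_SLAVES is a sufficient compile-time bound (both names are illustrative): one receive is posted per slave, and MPI_Waitany services whichever completes first.

    int nslaves = p - 1;                       // assumes ranks 1..p-1 are slaves
    double      results[MAX_SLAVES];
    MPI_Request reqs[MAX_SLAVES];

    // post one receive per slave, then handle completions in any order
    for (int i = 0; i < nslaves; i++)
        MPI_Irecv(&results[i], 1, MPI_DOUBLE, i + 1, 0, MPI_COMM_WORLD, &reqs[i]);

    for (int done = 0; done < nslaves; done++) {
        int which;
        MPI_Status status;
        MPI_Waitany(nslaves, reqs, &which, &status);   // index of whichever request finished
        printf("result %f from rank %d\n", results[which], status.MPI_SOURCE);
    }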

37
Communication Modes
  • Synchronous mode (MPI_Ssend): the send does not
    complete until a matching receive has begun
  • Buffered mode (MPI_Bsend): the user supplies the
    buffer to the system
  • Ready mode (MPI_Rsend): the user guarantees that a
    matching receive has been posted
  • Non-blocking versions are MPI_Issend, MPI_Irsend,
    MPI_Ibsend

38
Miscellaneous Point-to-Point Commands
  • MPI_Sendrecv
  • MPI_Sendrecv_replace
  • MPI_Cancel
  • Used for buffered modes
  • MPI_Buffer_attach
  • MPI_Buffer_detach

39
Collective Communication
  • One to Many (Broadcast, Scatter)
  • Many to One (Reduce, Gather)
  • Many to Many (Allreduce, Allgather)

40
Broadcast and Barrier
  • Any type of message can be sent; the size of the
    message should be known to all processes
  • int MPI_Bcast ( void* buffer, int count,
    MPI_Datatype datatype, int root, MPI_Comm comm )
  • buffer: pointer to message buffer; count: number
    of items sent
  • datatype: type of item sent; root: sending
    processor
  • comm: communicator within which the broadcast takes
    place
  • Note: count and type should be the same on all
    processors (see the example below)
  • Barrier synchronization (a broadcast without a
    message?)
  • int MPI_Barrier ( MPI_Comm comm )
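A minimal sketch, assuming my_rank was obtained as on the earlier slides and nsteps is an illustrative parameter name: rank 0 sets the value, broadcasts it to everyone, and all ranks then synchronize at a barrier.

    int nsteps = 0;
    if (my_rank == 0)
        nsteps = 1000;                         // only the root has the value initially

    MPI_Bcast(&nsteps, 1, MPI_INT, 0, MPI_COMM_WORLD);  // now every rank has nsteps == 1000
    MPI_Barrier(MPI_COMM_WORLD);               // all ranks reach this point before continuing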

41
Reduce
  • Reverse of broadcast: all processors send to a
    single processor (see the example below)
  • Several combining functions are available:
  • MAX, MIN, SUM, PROD, LAND, BAND, LOR, BOR, LXOR,
    BXOR, MAXLOC, MINLOC
  • int MPI_Reduce ( void* sendbuf, void* result, int
    count, MPI_Datatype datatype, MPI_Op op, int
    root, MPI_Comm comm )
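A minimal sketch, assuming compute_partial_sum() is a hypothetical local computation: every rank contributes a partial sum and rank 0 receives the global total.

    double partial = compute_partial_sum(my_rank);   // hypothetical local computation
    double total   = 0.0;

    // sum the partial results from all ranks into 'total' on rank 0 only
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (my_rank == 0)
        printf("global sum = %f\n", total);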

42
Scatter and Gather
  • MPI_Scatter: the source (array) on the sending
    processor is spread across all processors
  • MPI_Gather: the opposite of scatter; array locations
    at the receiver correspond to the ranks of the
    senders (see the sketch below)
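A hedged sketch, assuming p ranks, my_rank from the earlier slides, and MAX_PROCS as an illustrative compile-time bound: the root scatters one int to each rank, each rank updates its piece, and the root gathers the pieces back in rank order.

    int sendbuf[MAX_PROCS];                    // filled on the root; MAX_PROCS >= p assumed
    int piece, recvbuf[MAX_PROCS];

    if (my_rank == 0)
        for (int i = 0; i < p; i++) sendbuf[i] = i * 10;

    MPI_Scatter(sendbuf, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);
    piece += my_rank;                          // each rank works on its own element
    MPI_Gather(&piece, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);
    // on rank 0, recvbuf[i] now holds the value returned by rank i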

43
Many-to-many Communication
  • MPI_Allreduce
  • Syntax like reduce, except there is no root parameter
  • All nodes get the result (see the sketch below)
  • MPI_Allgather
  • Syntax like gather, except there is no root parameter
  • All nodes get the resulting array
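A minimal sketch mirroring the MPI_Reduce example above (compute_partial_sum() is again a hypothetical local computation), except that every rank ends up with the global sum.

    double partial = compute_partial_sum(my_rank);   // hypothetical local computation
    double total;

    // every rank receives the sum of all partial values; no root argument
    MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);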

44
Evaluating Parallel Programs
  • MPI provides tools to evaluate performance of
    parallel programs
  • Timer
  • Profiling Interface
  • MPI_Wtime gives the wall-clock time (timing sketch
    below)
  • MPI_WTIME_IS_GLOBAL can be used to check whether the
    clocks of all the processes are synchronized
  • PMPI_... is an entry point for every routine and can
    be used for profiling
  • The -mpilog option at compile time can be used to
    generate logfiles
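A minimal sketch of timing a code region with MPI_Wtime; the region being measured is a placeholder.

    double t_start = MPI_Wtime();

    /* ... code region being measured ... */

    double elapsed = MPI_Wtime() - t_start;    // wall-clock seconds on this process
    printf("rank %d: %f s\n", my_rank, elapsed);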

45
Recent Developments
  • MPI-2
  • Dynamic process management
  • One-sided communication
  • Parallel file-IO
  • Extended collective operations
  • MPI for Grids ex., MPICH-G, MPICH-G2
  • Fault-tolerant MPI ex., Starfish, Cocheck

46
One-sided Operations
  • One-sided: one worker performs the transfer of data
  • Remote memory reads and writes
  • Data can be accessed without waiting for other
    processes (see the sketch below)
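A hedged sketch of one-sided communication using the MPI-2 fence synchronization model (the window size and the value 42 are illustrative): each rank exposes a window over a local integer, and rank 0 writes into rank 1's window without rank 1 posting a receive.

    int local = my_rank;
    MPI_Win win;

    // expose 'local' to remote access by all ranks in the communicator
    MPI_Win_create(&local, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                     // open the access epoch
    if (my_rank == 0) {
        int value = 42;
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);   // write into rank 1's window
    }
    MPI_Win_fence(0, win);                     // close the epoch; rank 1's 'local' is now 42

    MPI_Win_free(&win);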

47
File Handling
  • Similar to general programming languages
  • Sample function calls (see the example below)
  • MPI_File_open
  • MPI_File_read
  • MPI_File_seek
  • MPI_File_write
  • MPI_File_set_size
  • Non-blocking reads and writes are also possible
  • MPI_File_iread
  • MPI_File_iwrite
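A minimal sketch of the blocking calls, assuming each rank writes its own rank number to a distinct offset in a shared file; the filename "out.dat" is illustrative.

    MPI_File fh;
    MPI_Status status;
    int value = my_rank;

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_seek(fh, my_rank * sizeof(int), MPI_SEEK_SET);   // each rank writes to its own slot
    MPI_File_write(fh, &value, 1, MPI_INT, &status);
    MPI_File_close(&fh);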

48
C Datatypes
  • MPI_CHAR: char
  • MPI_BYTE: see the standard (similar to unsigned char)
  • MPI_SHORT: short
  • MPI_INT: int
  • MPI_LONG: long
  • MPI_FLOAT: float
  • MPI_DOUBLE: double
  • MPI_UNSIGNED_CHAR: unsigned char
  • MPI_UNSIGNED_SHORT: unsigned short
  • MPI_UNSIGNED: unsigned int
  • MPI_UNSIGNED_LONG: unsigned long
  • MPI_LONG_DOUBLE: long double

49
mpiP
  • A lightweight profiling library for MPI
    applications
  • To use it in an application, simply add the
    -lmpiP flag to the compile script
  • Determines how much time a program spends in MPI
    calls versus the rest of the application
  • Shows which MPI calls are used most frequently

50
Jumpshot
  • Graphical profiling tool for MPI
  • Java-Based
  • Useful for determining communication patterns in
    an application
  • Color-coded bars represent time spent in an MPI
    function
  • Arrows denote message passing
  • Single line denotes actual processing time

51
Summary
  • The parallel computing community has cooperated
    to develop a full-featured standard
    message-passing library interface
  • Several implementations are available
  • Many applications are being developed or ported
    presently
  • MPI-2 process beginning
  • Lots of MPI material available
  • Very good facilities available at the HCS Lab for
    MPI-based projects
  • Zeta Cluster will be available for class projects

52
References
  • [1] The Message Passing Interface (MPI)
    Standard, http://www-unix.mcs.anl.gov/mpi/
  • [2] LAM/MPI Parallel Computing,
    http://www.lam-mpi.org
  • [3] W. Gropp, "Tutorial on MPI: The
    Message-Passing Interface,"
    http://www-unix.mcs.anl.gov/mpi/tutorial/gropp/talk.html
  • [4] D. Culler and J. Singh, "Parallel Computer
    Architecture: A Hardware/Software Approach"

53
Fault-Tolerant Embedded MPI
54
Motivations
  • MPI functionality required for HPC space
    applications
  • De facto standard parallel programming model in
    HPC
  • Fault-tolerant extensions for HPEC space systems
  • MPI is inherently fault-intolerant, an original
    design choice
  • Existing HPC tools for MPI and fault-tolerant MPI
  • Good basis for ideas, API standards, etc.
  • Not readily amenable to HPEC platforms
  • Focus on a lightweight fault-tolerant MPI for HPEC
    (FEMPI: Fault-tolerant Embedded Message Passing
    Interface)
  • Leverage prior work throughout HPC community
  • Leverage prior work at UF on HPC with MPI

55
Primary Source of Failures in MPI
  • Nature of failures
  • Individual processes of MPI job crash (Process
    failure)
  • Communication failure between two MPI processes
    (Network failure)
  • Behavior on failure
  • When a receiver node fails, the sender encounters a
    timeout on a blocking send call (no matching
    receive is found) and returns an error
  • The whole communicator context crashes, and hence
    the entire MPI job
  • With N×N open TCP connections in many MPI
    implementations, in such cases the whole job
    crashes immediately on failure of any node
  • Applies to collective communication calls as well
  • Avoid failure/crash of entire application
  • Health status of nodes provided by failure
    detection service (via SR)
  • Check node status before communication with
    another node to avoid establishing communication
    with a dead process
  • If receiver dies after status check and before
    communication, then timeout-based recovery will
    be used

56
FEMPI Software Architecture
  • Low-level communication is provided through FEMPI
    using Self-Reliant's DMS
  • Heartbeating via SR and a process notification
    extension to the SRP enable FEMPI fault
    detection
  • Application and FEMPI checkpointing make use of
    existing checkpointing libraries; checkpoint
    communication uses DMS
  • The MPI Restore process on the System Controller is
    responsible for recovery decisions based on
    application policies

57
Fault Tolerance Actions
  • Fault tolerance is provided through three stages
  • Detection of a fault
  • Notification
  • Recovery
  • Self-Reliant services used to provide detection
    and notification capabilities
  • Heartbeats and other functionality are already
    provided in API
  • Notification service built as an extension to FTM
    of JMS
  • FEMPI will provide features to enable recovery of
    an application
  • Employs reliable communications to reduce faults
    due to communication failure
  • Low-level communications provided through
    Self-Reliant services (DMS) instead of directly
    over TCP/IP