Transcript and Presenter's Notes

Title: PARALLEL COMPUTING WITH MPI


1
PARALLEL COMPUTING WITH MPI
  • Anne Weill-Zrahia
  • With acknowledgments to Cornell Theory Center

2
Introduction to Parallel Computing
  • Parallel computer: a set of processors that
    work cooperatively to solve a computational
    problem.
  • Distributed computing: a number of processors
    communicating over a network
  • Metacomputing: use of several parallel computers

3
Why parallel computing
  • Single-processor performance is limited by physics
  • Multiple processors break the problem down into
    simple tasks or domains
  • Plus: obtain the same results as in the sequential
    program, faster
  • Minus: need to rewrite code

4
Parallel classification
  • Parallel architectures:
    Shared Memory / Distributed Memory
  • Programming paradigms:
    Data parallel / Message passing

5
Shared memory
[Diagram: several processors (P) all connected to a single shared Memory.]
6
Shared Memory
  • Each processor can access any part of the memory
  • Access times are uniform (in principle)
  • Easier to program (no explicit message passing)
  • Bottleneck when several tasks access same
    location

7
Data-parallel programming
  • Single program defining operations
  • Single memory
  • Loosely synchronous (completion of loop)
  • Parallel operations on array elements

8
Distributed Memory
  • Processor can only access local memory
  • Access times depend on location
  • Processors must communicate via explicit message
    passing

9
Distributed Memory
[Diagram: several Processor + Memory pairs connected by an interconnection network.]
10
Message Passing Programming
  • Separate program on each processor
  • Local Memory
  • Control over distribution and transfer of data
  • Additional complexity of debugging due to
    communications

11
Performance issues
  • Concurrency: the ability to perform actions
    simultaneously
  • Scalability: performance is not impaired by
    increasing the number of processors
  • Locality: a high ratio of local memory accesses
    to remote memory accesses (or low communication)

12
SP2 Benchmark
  • Goal: checking the performance of real-world
    applications on the SP2
  • Execution time (seconds): CPU time for the
    applications
  • Speedup =
    (execution time for 1 processor) /
    (execution time for p processors)

14
WHAT is MPI?
  • A message-passing library specification
  • Extended message-passing model
  • Not specific to an implementation or computer

15
BASICS of MPI PROGRAMMING
  • MPI is a message-passing library
  • Assumes a distributed memory architecture
  • Includes routines for performing communication
    (exchange of data and synchronization) among the
    processors.

16
Message Passing
  • Data transfer and synchronization
  • Synchronization: the act of bringing one or more
    processes to known points in their execution
  • Distributed memory: memory split up into
    segments, each of which may be accessed by only
    one process.

17
Message Passing
[Diagram: handshake between two processes, "May I send?" / "yes" / "Send data".]
18
MPI STANDARD
  • Standard by consensus, designed in an open forum
  • Introduced by the MPI Forum in May 1994, updated
    in June 1995
  • MPI-2 (1998) provides extensions to the MPI
    standard

19
Why use MPI ?
  • Standardization
  • Portability
  • Performance
  • Richness
  • Designed to enable libraries

20
Writing an MPI Program
  • If there is a serial version, make sure it is
    debugged
  • If not, try to write a serial version first
  • When debugging in parallel, start with a few
    nodes first.

21
Format of MPI routines
22
Six useful MPI functions
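The table from this slide boils down to six routines; as a reference, a minimal C skeleton that uses all of them might look like the following sketch (the printf is illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int rank, size;

  MPI_Init(&argc, &argv);                 /* initialize MPI         */
  MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes?    */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which one am I?        */

  printf("process %d of %d\n", rank, size);
  /* MPI_Send / MPI_Recv calls would go here */

  MPI_Finalize();                         /* shut MPI down          */
  return 0;
}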
23
Communication routines
24
End MPI part of program
25
      program hello
      include 'mpif.h'
      integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
      character*12 message

      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      tag = 100
      if (rank .eq. 0) then
         message = 'Hello, world'
         do i = 1, size-1
            call MPI_SEND(message, 12, MPI_CHARACTER, i, tag,
     &                    MPI_COMM_WORLD, ierror)
         enddo
      else
         call MPI_RECV(message, 12, MPI_CHARACTER, 0, tag,
     &                 MPI_COMM_WORLD, status, ierror)
      endif
      print*, 'node', rank, ':', message
      call MPI_FINALIZE(ierror)
      end

26
#include <string.h>
#include <stdio.h>
#include <mpi.h>

int main( int argc, char *argv[] )
{
  int tag = 100;
  int rank, size, i;
  MPI_Status status;
  char message[12];

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  strcpy(message, "Hello,world");
  if (rank == 0) {
    for (i = 1; i < size; i++)
      MPI_Send(message, 12, MPI_CHAR, i, tag, MPI_COMM_WORLD);
  } else {
    MPI_Recv(message, 12, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
  }
  printf("node %d %s\n", rank, message);
  MPI_Finalize();
  return 0;
}
27
MPI Messages
  • DATA: the data to be sent
  • ENVELOPE: information to route the data

28
Description of MPI_Send (MPI_Recv)
29
Description of MPI_Send (MPI_Recv)
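For reference, the C bindings defined by the MPI standard are:

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);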
30
Some useful remarks
  • Source = MPI_ANY_SOURCE means that any source is
    acceptable
  • Tags specified by sender and receiver must match;
    with MPI_ANY_TAG, any tag is acceptable
  • The communicator must be the same for send and
    receive. Usually MPI_COMM_WORLD

31
POINT-TO-POINT COMMUNICATION
  • Transmission of a message between one pair of
    processes
  • Order of messages within same communicator will
    be respected.
  • Programmer can choose mode of transmission

32
Point to point dilemma
  • if (rank == 0) then
        MPI_Send(sbuffer, 1, ...)
    else
        MPI_Recv(rbuffer, 0, ...)
  • What if P1 is not ready? Either
  • P0 stops and waits till P1 is ready, or
  • P0 copies the data to some buffer and returns, or
  • the send fails
33
MODE of TRANSMISSION
  • Can be chosen by programmer
  • or let the system decide
  • Synchronous mode
  • Ready mode
  • Buffered mode
  • Standard mode

34
BLOCKING /NON-BLOCKING COMMUNICATIONS
35
BLOCKING STANDARD SEND
[Diagram, message size > threshold: the sender's MPI_SEND blocks and the task waits; the data transfer begins only once MPI_RECV has been posted on the receiver; the receiving task continues when the transfer into its buffer is complete.]
36
NON BLOCKING STANDARD SEND
[Diagram, message size > threshold: the sender calls MPI_ISEND and continues, calling MPI_WAIT when it needs the buffer again; the transfer begins once MPI_IRECV has been posted; neither task is interrupted if its MPI_WAIT is issued late enough.]
37
BLOCKING STANDARD SEND
[Diagram, message size < threshold: the data is copied to a buffer on the receiver, so MPI_SEND completes as soon as the transfer from the source is complete; the receiving task continues when the transfer into the user's buffer is complete after MPI_RECV.]
38
NON BLOCKING STANDARD SEND
[Diagram, message size < threshold: MPI_ISEND returns with no delay even though the message is not yet in the buffer on the receiver; the intermediate buffer can be avoided entirely if MPI_IRECV is posted early enough; there is no delay at MPI_WAIT if it is issued late enough.]
39
BLOCKING COMMUNICATION
40
NON-BLOCKING
41
NON-BLOCKING(C)
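The code table from this slide is summarized by the following minimal C sketch of the non-blocking calls (buffer size, tag, and datatype are illustrative; it assumes at least two processes):

#include <mpi.h>

int main(int argc, char *argv[])
{
  int i, rank, tag = 1;
  double buf[100];
  MPI_Request request;
  MPI_Status  status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  for (i = 0; i < 100; i++) buf[i] = rank;

  if (rank == 0) {
    MPI_Isend(buf, 100, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD, &request);
    /* useful work can overlap the transfer here */
    MPI_Wait(&request, &status);   /* buf may be re-used only after this */
  } else if (rank == 1) {
    MPI_Irecv(buf, 100, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &request);
    /* useful work can overlap the transfer here */
    MPI_Wait(&request, &status);   /* buf now contains the message */
  }

  MPI_Finalize();
  return 0;
}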
43
Deadlock program (cont)
      if ( irank .EQ. 0 ) then
         idest     = 1
         isrc      = 1
         isend_tag = ITAG_A
         irecv_tag = ITAG_B
      else if ( irank .EQ. 1 ) then
         idest     = 0
         isrc      = 0
         isend_tag = ITAG_B
         irecv_tag = ITAG_A
      end if
C     ----------------------------------------------------------------
C     send and receive messages
C     ----------------------------------------------------------------
      print *, " Task ", irank, " has sent the message"
      call MPI_Send ( rmessage1, MSGLEN, MPI_REAL, idest, isend_tag,
     .                MPI_COMM_WORLD, ierr )
      call MPI_Recv ( rmessage2, MSGLEN, MPI_REAL, isrc, irecv_tag,
     .                MPI_COMM_WORLD, istatus, ierr )
      print *, " Task ", irank, " has received the message"
      call MPI_Finalize (ierr)
      end
44
/*
   Code:   deadlock.c
   Author: Roslyn Leibensperger
*/
#include <stdio.h>
#include "mpi.h"
#define MSGLEN 120000                /* length of message in elements */
#define TAG_A  100
#define TAG_B  200

int main( int argc, char *argv[] )
{
  float message1[MSGLEN],            /* message buffers               */
        message2[MSGLEN];
  int rank,                          /* rank of task in communicator  */
      dest, source,                  /* rank in communicator of
                                        destination and source tasks  */
      send_tag, recv_tag,            /* message tags                  */
      i;
45
  MPI_Status status;                 /* status of communication       */

  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );
  printf ( " Task %d initialized\n", rank );

  /* initialize message buffers */
  for ( i = 0; i < MSGLEN; i++ ) {
    message1[i] = 100;  message2[i] = 100;
  }
46
  /* each task sets its message tags for the send and receive,
     plus the destination for the send, and the source for the
     receive */
  if ( rank == 0 ) {
    dest = 1;  source = 1;
    send_tag = TAG_A;  recv_tag = TAG_B;
  }
  else if ( rank == 1 ) {
    dest = 0;  source = 0;
    send_tag = TAG_B;  recv_tag = TAG_A;
  }

  printf ( " Task %d has sent the message\n", rank );
  MPI_Send ( message1, MSGLEN, MPI_FLOAT, dest, send_tag,
             MPI_COMM_WORLD );
  MPI_Recv ( message2, MSGLEN, MPI_FLOAT, source, recv_tag,
             MPI_COMM_WORLD, &status );
  printf ( " Task %d has received the message\n", rank );
  MPI_Finalize();
  return 0;
}

47
DEADLOCK example
[Diagram: tasks A and B each call MPI_SEND first and MPI_RECV second, so both block waiting for a matching receive.]
48
Deadlock example
  • O2000 implementation: no receive has been posted
    yet, so both processes block
  • Solutions:
  • Different ordering
  • Non-blocking calls
  • MPI_Sendrecv (see the sketch below)

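As a sketch of the MPI_Sendrecv solution, reusing the names from the deadlock.c example above, the blocking send/receive pair could be replaced by a single call:

  int other = (rank == 0) ? 1 : 0;   /* the partner task */

  /* MPI orders the send and the receive internally,
     so neither task blocks forever */
  MPI_Sendrecv ( message1, MSGLEN, MPI_FLOAT, other, send_tag,
                 message2, MSGLEN, MPI_FLOAT, other, recv_tag,
                 MPI_COMM_WORLD, &status );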
49
Determining Information about Messages
  • Wait
  • Test
  • Probe

50
MPI_WAIT
  • Useful for both sender and receiver of
    non-blocking communications
  • Receiving process blocks until message is
    received, under programmer control
  • Sending process blocks until send operation
    completes, at which time the message buffer is
    available for re-use

51
MPI_WAIT
[Diagram: sender and receiver overlap computation with the message transmission; each calls MPI_WAIT when it needs the result.]
52
MPI_TEST
[Diagram: the sender calls MPI_Isend, continues computing, and calls MPI_TEST periodically while the message is transmitted to the receiver.]
53
MPI_TEST
  • Used for both sender and receiver of non-blocking
    communication
  • Non-blocking call
  • Receiver checks to see if a specific sender has
    sent a message that is waiting to be delivered
    ... messages from all other senders are ignored

54
MPI_TEST (cont.)
  • The sender can find out if the message buffer can
    be re-used ... it has to wait until the operation
    is complete before doing so

55
MPI_PROBE
  • Receiver is notified when messages from
    potentially any sender arrive and are ready to be
    processed.
  • Blocking call

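One common use of MPI_Probe is to size the receive buffer before the actual receive. A minimal C sketch (the helper name and the MPI_INT payload are illustrative assumptions):

#include <mpi.h>
#include <stdlib.h>

void receive_any(void)     /* hypothetical helper, for illustration only */
{
  MPI_Status status;
  int count;
  int *buf;

  /* block until a message from any source, with any tag, has arrived */
  MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
  MPI_Get_count(&status, MPI_INT, &count);   /* how many ints is it? */

  buf = (int *) malloc(count * sizeof(int));
  MPI_Recv(buf, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
           MPI_COMM_WORLD, &status);
  /* ... use buf ... */
  free(buf);
}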
56
Programming recommendations
  • Blocking calls are needed when
  • Tasks must synchronize
  • MPI_Wait immediately follows communication call

57
Collective Communication
  • Establish a communication pattern within a group
    of nodes.
  • All processes in the group call the communication
    routine, with matching arguments.
  • Collective routine calls can return when their
    participation in the collective communication is
    complete.

58
Properties of collective calls
  • On completion, the caller is free to access
    locations in the communication buffer.
  • Completion does NOT indicate that other processors
    in the group have completed
  • Only MPI_BARRIER will synchronize all processes

59
Properties
  • MPI guarantees that a message generated by
    collective communication calls will not be
    confused with a message generated by
    point-to-point communication
  • Communicator is the group identifier.

60
Barrier
  • Synchronization primitive. A node calling it will
    block until all the nodes within the group have
    called it.
  • Syntax
  • MPI_Barrier(Comm, Ierr)

61
Broadcast
  • Send data on one node to all other nodes in
    communicator.
  • MPI_Bcast(buffer, count, datatype, root, comm, ierr)

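In C the call takes no ierr argument; a minimal sketch (the broadcast value is illustrative):

  int value = 0;
  if (rank == 0) value = 42;              /* only the root holds the data */
  MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
  /* after the call every rank in MPI_COMM_WORLD sees value == 42 */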
62
Broadcast
[Diagram: before the call only P0 holds A0; after the broadcast P0, P1, P2, and P3 all hold A0.]
63
Gather and Scatter
[Diagram: scatter distributes blocks A0, A1, A2, A3 from P0 so that Pj receives Aj; gather is the reverse operation, collecting the blocks back onto P0.]
64
Allgather effect
[Diagram: each process starts with one block (A0 on P0, B0 on P1, C0 on P2, D0 on P3); after the allgather every process holds all four blocks A0, B0, C0, D0.]
65
Syntax for Scatter Gather
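For reference, the C bindings defined by the MPI standard are:

int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                void *recvbuf, int recvcount, MPI_Datatype recvtype,
                int root, MPI_Comm comm);

int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype,
               void *recvbuf, int recvcount, MPI_Datatype recvtype,
               int root, MPI_Comm comm);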
66
Scatter and Gather
  • Gather Collect data from every member of the
    group (including the root) on the root node in
    linear order by the rank of the node.
  • Scatter Distribute data from the root to every
    member of the group in linear order by node.

67
ALLGATHER
  • All processes, not just the root, receive the
    result. The jth block of the receive buffer is
    the block of data sent from the jth process
  • Syntax
  • MPI_Allgather(sndbuf, scount, datatype, recvbuf,
    rcount, rdatatype, comm, ierr)

68
Gather example
      DIMENSION A(25,100), b(100), cpart(25), ctotal(100)
      INTEGER root
      DATA root/0/
      DO I = 1, 25
         cpart(I) = 0.
         DO K = 1, 100
            cpart(I) = cpart(I) + A(I,K)*b(K)
         END DO
      END DO
      call MPI_GATHER(cpart, 25, MPI_REAL, ctotal, 25, MPI_REAL,
     &                root, MPI_COMM_WORLD, ierr)

69
AllGather example
      DIMENSION A(25,100), b(100), cpart(25), ctotal(100)
      INTEGER root
      DO I = 1, 25
         cpart(I) = 0.
         DO K = 1, 100
            cpart(I) = cpart(I) + A(I,K)*b(K)
         END DO
      END DO
      call MPI_ALLGATHER(cpart, 25, MPI_REAL, ctotal, 25, MPI_REAL,
     &                   MPI_COMM_WORLD, ierr)

70
Parallel matrix-vector multiplication
[Diagram: parallel matrix-vector multiplication c = A*b; each of P1, P2, P3, P4 computes 25 rows of the result.]
71
Global Computations
  • Reduction
  • Scan

72
Reduction
  • The partial result in each process in the group
    is combined in one specified process

73
Reduction
Dj = D(0,j) + D(1,j) + ... + D(n-1,j)
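For illustration, a minimal MPI_Reduce call in C that sums one integer per process onto rank 0 (the contributed values are illustrative):

  int local = rank;     /* each process contributes its own rank       */
  int total = 0;        /* meaningful only on the root after the call  */

  MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) printf("sum of ranks = %d\n", total);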
74
Scan operation
  • The scan (prefix-reduction) operation performs
    partial reductions on distributed data, as in the
    sketch below
  • D(k,j) = D(0,j) + D(1,j) + ... + D(k,j),
    for k = 0, 1, ..., n-1

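The corresponding MPI routine is MPI_Scan; a minimal C sketch (the contributed values are illustrative):

  int local = rank + 1;   /* process k contributes k+1                        */
  int prefix = 0;         /* after the call, process k holds 1 + 2 + ... + (k+1) */

  MPI_Scan(&local, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);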
75
Varying size gather and scatter
  • Both the size and the memory location of the
    messages can vary
  • More flexibility in writing code
  • Less need to copy data into temporary buffers
  • More compact final code
  • The vendor implementation may be optimal

76
Scatterv syntax
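For reference, the C binding defined by the MPI standard is:

int MPI_Scatterv(void *sendbuf, int *sendcounts, int *displs,
                 MPI_Datatype sendtype, void *recvbuf, int recvcount,
                 MPI_Datatype recvtype, int root, MPI_Comm comm);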
77
SCATTER
[Diagram: MPI_Scatter sends equal-sized blocks from P0 to P0, P1, P2, P3.]
78
SCATTERV

[Diagram: MPI_Scatterv sends blocks of varying sizes, at varying displacements, from P0 to P0, P1, P2, P3.]
79
Advanced Datatypes
  • Predefined basic datatypes -- contiguous data
    of the same type.
  • We sometimes need
  • non-contiguous data of single type
  • contiguous data of mixed types

80
Solutions
  • multiple MPI calls to send and receive each data
    element
  • copy the data to a buffer before sending it
    (MPI_PACK)
  • use MPI_BYTE to get around the datatype-matching
    rules

81
Drawback
  • Slow, clumsy, and wasteful of memory
  • Using MPI_BYTE or MPI_PACKED can hamper
    portability

82
General Datatypes and Typemaps
  • a sequence of basic datatypes
  • a sequence of integer (byte) displacements

83
Typemaps
  • typemap = {(type0, disp0), (type1, disp1), ...,
    (typen, dispn)}
  • Displacements are relative to the start of the
    buffer
  • Example:
  • Typemap(MPI_INT) = {(int, 0)}

84
Extent of a Derived Datatype
85
MPI_TYPE_EXTENT
  • MPI_TYPE_EXTENT(datatype, extent, ierr)
  • Describes the distance (in bytes) from the start
    of the datatype to the start of the next datatype.

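In C the MPI-1 binding is MPI_Type_extent (later MPI versions replace it with MPI_Type_get_extent); for example:

  MPI_Aint extent;
  MPI_Type_extent(MPI_INT, &extent);   /* distance, in bytes, to the start of the next MPI_INT */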
86
How and When Do I Use Derived Datatypes?
  • MPI derived datatypes are created at run-time
    through calls to MPI library routines.

87
How to use
  • Construct the datatype
  • Allocate the datatype.
  • Use the datatype
  • Deallocate the datatype

88
EXAMPLE
      integer oldtype, newtype, count, blocklength, stride
      integer ierr, n
      real buffer(n,n)

c     construct and allocate the datatype
      call MPI_TYPE_VECTOR(count, blocklength, stride, oldtype,
     &                     newtype, ierr)
      call MPI_TYPE_COMMIT(newtype, ierr)
c     use it in a communication operation
      call MPI_SEND(buffer, 1, newtype, dest, tag, comm, ierr)
c     deallocate it
      call MPI_TYPE_FREE(newtype, ierr)

89
Example on MPI_TYPE_VECTOR
[Diagram: newtype is built from copies of oldtype arranged as equally spaced blocks (BLOCK ... BLOCK).]
90
Summary
  • Derived datatypes are datatypes that are built
    from the basic MPI datatypes
  • Derived datatypes provide a portable and elegant
    way of communicating non-contiguous or mixed
    types in a message.
  • Efficiency may depend on the implementation (see
    how it compares to MPI_BYTE)

91
Several datatypes
92
Several datatypes
93
GROUP
94
Group (cont.)
95
Group (cont.)
      if (rank .eq. 1) then
         print*, 'sum of group1', (rbuf(i), i=1, count)
c        print*, 'sum of group1', (sbuf(i), i=1, count)
      endif
      count2 = size
      do i = 1, count2
         sbuf2(i) = rank
      enddo
      CALL MPI_REDUCE(SBUF2, RBUF2, COUNT2, MPI_INTEGER,
     &                MPI_SUM, 0, WCOMM, IERR)
      if (rank .eq. 0) then
         print*, 'sum of wgroup', (rbuf2(i), i=1, count2)
      else
         CALL MPI_COMM_FREE(SUBCOMM, IERR)
      endif
      CALL MPI_GROUP_FREE(GROUP1, IERR)
      CALL MPI_FINALIZE(IERR)
      stop
      end
96
PERFORMANCE ISSUES
  • Hidden communication takes place
  • Performance depends on implementation of MPI
  • Because of forced synchronization, it is not
    always best to use collective communication

97
Example simple broadcast
[Diagram: simple (linear) broadcast from node 1 to nodes 2 through 8; data moved = B(P-1), steps = P-1.]
98
Example better broadcast
[Diagram: tree-based broadcast among 8 nodes, each node that already has the data forwarding it to another node at every step; data moved = B(P-1), steps = log P.]
99
Example simple scatter
[Diagram: simple (linear) scatter from node 1 to nodes 2 through 8; data moved = B(P-1), steps = P-1.]
100
Example better scatter
[Diagram: tree-based scatter among 8 nodes; node 1 passes half of its remaining data at each level (4B, then 2B, then B); data moved = Bp log P, steps = log P.]
101
Timing for sending a message
  • Time is composed of the startup time (the time to
    send a zero-length message) and the transfer time
    (the time to transfer one byte of data).

Tcomm = Tstartup + B * Ttransfer
It may be worthwhile to group several sends together.
102
Performance evaluation
  • Fortran:
  • Real*8 t1
  • t1 = MPI_Wtime()    ! returns elapsed time
  • C:
  • double t1;
  • t1 = MPI_Wtime();

103
More on timing programs
  • for (i = 0; i < 1; i++) {
  •   MPI_Barrier(MPI_COMM_WORLD);
  •   t1 = MPI_Wtime();
  •   <do work>
  •   MPI_Barrier(MPI_COMM_WORLD);
  •   total_time = MPI_Wtime() - t1;
  • }
  • Does it take all effects (cache, etc.) into
    account?
  • Better add

104
MPI References
  • The MPI Standard:
    www-unix.mcs.anl.gov/mpi/index.html
  • Parallel Programming with MPI, Peter S. Pacheco,
    Morgan Kaufmann, 1997.
  • Using MPI, W. Gropp, Ewing Lusk, Anthony Skjellum,
    The MIT Press, 1999.