Title: MPICH2 : MPI_Init, Send, Recv
1. MPICH2: MPI_Init, Send, Recv
- Oct 7, 2005
- Sogang University
- Distributed Computing Communication Laboratory
- Eunseok, Kim
2. Outline
- Internal routine naming
- MPI_Init ()
- MPI_Send()
- MPI_Recv()
- Conclusion
3. Internal routine naming
- MPI: the MPI implementation
- MPIR: routines used only within the MPI implementation, outside of the ADI (Abstract Device Interface)
- MPID: routines either defined in the ADI or used within the ADI
- MPIU: routines defined in the util directory that may be used by either the ADI or the implementation of the MPI routines
4. Concept of Initialization
- Lazy initialization
- Each module initializes itself
- Reduces executable size and link time
- Enabled by the weak symbol option
- Process info comes from the PM (Process Manager)
- Via PMI (Process Management Interface)
- Most info is set through environment variables
- Process managers: MPD, Gforker, Smpd
- By default, MPD is used to launch MPI jobs
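The lazy-initialization idea on this slide can be sketched in plain C. This is only an illustration under stated assumptions, not MPICH2 code: the module name `timer_module`, its state variables, and the placeholder resolution value are all hypothetical. The sketch is not thread-safe.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical module state; none of these names come from MPICH2. */
static bool timer_initialized = false;
static int  timer_init_calls  = 0;   /* how often the setup really ran */
static long timer_resolution  = 0;

/* One-time setup owned by the module itself. */
static void timer_module_init(void)
{
    timer_resolution = 1000;         /* placeholder value for illustration */
    timer_init_calls++;
    timer_initialized = true;
}

/* Public entry point: the module initializes itself on first use only. */
long timer_module_resolution(void)
{
    if (!timer_initialized) {
        timer_module_init();
    }
    assert(timer_initialized);
    return timer_resolution;
}

/* Exposed so the sketch can show that setup ran exactly once. */
int timer_module_init_calls(void)
{
    return timer_init_calls;
}
```

Calling `timer_module_resolution()` any number of times runs the setup once; a program that never calls it never pays for the setup.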
5. Starting MPI jobs using MPD
[Figure: MPI_EXEC, MPD, PMI, the machine file, and processes P0 ... Pn]
1. Read machine file
2. Generate XML
3. Pass the XML doc
4. PMI_Init
6. MPI_Init()
Call sequence:
MPI_Init(int *argc, char ***argv)
  MPIR_Init_thread()
    MPIR_Err_init()
    MPIR_Wtime_init()
    MPIR_Datatype_init()
    MPIR_Nest_init()
    MPID_Init()
      MPIDI_CH3_Init()
End of MPI_Init
7. Global Objects
- MPIR_Process
- MPIDI_Process
- MPIDI_CH3I_Process
- etc.
- Each module has its own global variable
- Initialized and finalized within the module
8. MPIR_Process
- Comm
- Comm_world
- Comm_self
- Comm_parent (spawn)
- Group
- Remote / local
- Comm_kind
- Intra / inter
- VCRT (Virtual Connection Reference Table)
- Thread
- Condition variables
- Attribute
- Attribute variable
- Attribute function
9. MPIDI_Process
- Receive queue
- Posted
- Unexpected
- Process group id and rank
- Condition variable
10. MPIDI_CH3I_Process
- Parent port number
- Accept queue
- Condition variable
11. MPI_Init()
- Declare the nest level value, e.g., MPID_MPI_INIT_STATE_DECL(MPID_STATE_MPI_INIT)
- Initialize the critical section (thread mutex creation)
- Call MPIR_Init_thread()
12. MPIR_Init_thread()
- Address alignment issues, e.g., HAVE_WINDOWS_H, _WIN64
- Set up the MPIR_Process object: attrs, init comm (null)
- Call all the init-type functions, e.g., MPIR_Err_init(), MPIR_Datatype_init()
- Determine several policies, e.g., allowing the device to select an alternative implementation for some functions
- Call MPID_Init() (CH3)
13. MPID_Init()
- Init receive queues
- Set the processor name and other configuration from system calls and environment variables
- Call MPIDI_CH3_Init()
- Assign the process group id and rank to MPIDI_Process (cf. the rank in comm_world)
- Init MPI_COMM_(WORLD / SELF) object
- Creating VCRT
- Get parent or parent port
- Determine who my parent is
- Others or no parent
- Synchronization through MPIR_Bcast
14. MPIDI_CH3_Init()
- PMI_Init
- Port, host name, fd (socket)
- e.g., readline(fd, ...), writeline(fd, ...), or from environment variables
- PMI_Get_(rank/id)
- Pg_rank, Pg_id
- MPIDI_CH3I_Progress_init
- Establish a non-blocking listener
- From environment variables
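Reading start-up information from environment variables, as this slide describes, might look like the sketch below. The helper names `parse_int_or` and `env_int_or` are hypothetical, and the variable names `PMI_RANK` and `PMI_SIZE` used in the note afterwards are assumptions; the slides do not name the variables.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parses a decimal integer. Returns `fallback` when the text is NULL,
 * empty, malformed, or outside the range of int. */
int parse_int_or(const char *text, int fallback)
{
    char *end = NULL;
    long value;

    if (text == NULL || *text == '\0') {
        return fallback;
    }
    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value < INT_MIN || value > INT_MAX) {
        return fallback;
    }
    return (int)value;
}

/* Looks up an environment variable and parses it as an integer. */
int env_int_or(const char *name, int fallback)
{
    assert(name != NULL);
    return parse_int_or(getenv(name), fallback);
}
```

A process started by the process manager could then call, for example, `env_int_or("PMI_RANK", -1)` and `env_int_or("PMI_SIZE", 1)`, falling back to a singleton start when the variables are absent.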
15. MPIDI_CH3_Init() (continued)
- Set up the process group
- vct
- KVS (Key-Value Space)
- Contains information about processes
- Each process publishes and submits its own business card (contact info)
- PMI_Barrier()
- Synchronization commands barrier_in and barrier_out, through blocking I/O
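The business-card exchange on this slide can be illustrated with an in-memory stand-in for the KVS. Everything here is an assumption made for illustration: the `struct kvs` type, the `kvs_put` / `kvs_get` helpers, the `P<rank>-businesscard` key format, and the `host:port` card layout are hypothetical and are not the PMI API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define KVS_MAX_ENTRIES 16
#define KVS_MAX_LEN     64

/* In-memory stand-in for the process manager's key-value space. */
struct kvs {
    char keys[KVS_MAX_ENTRIES][KVS_MAX_LEN];
    char values[KVS_MAX_ENTRIES][KVS_MAX_LEN];
    int  count;
};

/* Stores a pair, overwriting an existing key. Returns 0, or -1 on failure. */
int kvs_put(struct kvs *kvs, const char *key, const char *value)
{
    int i;

    assert(kvs != NULL && key != NULL && value != NULL);
    if (strlen(key) >= KVS_MAX_LEN || strlen(value) >= KVS_MAX_LEN) {
        return -1;
    }
    for (i = 0; i < kvs->count; i++) {
        if (strcmp(kvs->keys[i], key) == 0) {
            strcpy(kvs->values[i], value);
            return 0;
        }
    }
    if (kvs->count == KVS_MAX_ENTRIES) {
        return -1;
    }
    strcpy(kvs->keys[kvs->count], key);
    strcpy(kvs->values[kvs->count], value);
    kvs->count++;
    return 0;
}

/* Returns the stored value, or NULL when the key is unknown. */
const char *kvs_get(const struct kvs *kvs, const char *key)
{
    int i;

    for (i = 0; i < kvs->count; i++) {
        if (strcmp(kvs->keys[i], key) == 0) {
            return kvs->values[i];
        }
    }
    return NULL;
}

/* Publishes "host:port" as the business card of process `rank`. */
int publish_business_card(struct kvs *kvs, int rank, const char *host, int port)
{
    char key[KVS_MAX_LEN];
    char card[KVS_MAX_LEN];
    int n;

    snprintf(key, sizeof key, "P%d-businesscard", rank);
    n = snprintf(card, sizeof card, "%s:%d", host, port);
    if (n < 0 || n >= (int)sizeof card) {
        return -1;
    }
    return kvs_put(kvs, key, card);
}
```

In the flow described above, every process would publish its card first and look up its peers only after the barrier, so that all cards are visible before anyone reads them.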
16. An Example CH3 Implementation over TCP
- Dashed lines separate four communication types
17. MPI_Send()
Flow:
MPI_Send(...)
MPID_Send(...)
- My_rank == dest_rank?
  - Y: MPIDI_Isend_self()
  - N: Pkt_size > PKT_MAX_LEN?
    - N: Contiguous?
      - Y: MPIDI_CH3_iStartMsgv()
      - N: MPID_Segment_Init(), MPIDI_CH3_iSendv()
    - Y: Contiguous?
      - Y: MPIDI_CH3_iStartRndvMsg(), MPIDI_CH3_iStartMsgv()
      - N: MPID_Segment_Init(), MPIDI_CH3_iStartRndvMsg(), MPIDI_CH3_iStartMsgv()
MPID_Request_Release()
End of MPI_Send
18. MPI_Send()
- Validate handle params
- MPIR_ERRTEST_COMM()
- MPID_Comm_valid_ptr(), etc.
- Call MPID_Send()
- If all data were sent, it returns NULL
- Otherwise, it returns a pointer to the request
- This function blocks until the request is complete
19. MPID_Send()
- If the packet can be sent directly (eager)
- If the data is contiguous, a single iov is used by MPIDI_CH3_iStartMsgv()
- Otherwise the data is segmented: MPID_Segment_init(), MPIDI_CH3U_Request_load_send_iov(), MPIDI_CH3_iSendv()
- Packets that cannot be sent immediately are queued in the VC (virtual connection)
20. MPID_Send() (continued)
- Otherwise, the packet is sent by rendezvous
- If the data is contiguous, a single iov is used
- Otherwise the data is segmented: MPID_Segment_init(), MPIDI_CH3U_Request_load_send_iov()
- The rendezvous message is sent first: MPIDI_CH3_iStartRndvMsg(), MPIDI_CH3_iStartMsgv()
- After the send finishes, the request object is released
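The decisions MPID_Send makes (self-send, direct versus rendezvous, contiguous versus segmented) can be condensed into one small function. This is a sketch under assumptions: the enum, the function name `choose_send_path`, and the `eager_max` parameter (standing in for PKT_MAX_LEN) are hypothetical and do not appear in MPICH2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the branches described on the slides. */
enum send_path {
    SEND_SELF,             /* destination is the sending process itself */
    SEND_EAGER_CONTIG,     /* sent directly, single iov */
    SEND_EAGER_SEGMENTED,  /* sent directly, iov built from segments */
    SEND_RNDV_CONTIG,      /* rendezvous, single iov */
    SEND_RNDV_SEGMENTED    /* rendezvous, iov built from segments */
};

/* Picks a branch; `eager_max` plays the role of PKT_MAX_LEN. */
enum send_path choose_send_path(int my_rank, int dest_rank,
                                size_t msg_size, size_t eager_max,
                                bool contiguous)
{
    assert(my_rank >= 0 && dest_rank >= 0);

    if (my_rank == dest_rank) {
        return SEND_SELF;
    }
    if (msg_size <= eager_max) {
        return contiguous ? SEND_EAGER_CONTIG : SEND_EAGER_SEGMENTED;
    }
    return contiguous ? SEND_RNDV_CONTIG : SEND_RNDV_SEGMENTED;
}
```

Keeping the decision separate from the actual I/O makes each branch easy to test in isolation.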
21. MPIDI_CH3_iStartMsg()
- Attempt to send the msg immediately
- If successful, return NULL
- Otherwise, create a Request and return a pointer to it
- The unsent msg is queued in the VC and sent later
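The "send now, or queue the remainder" behaviour can be sketched with a simulated connection. All names here (`struct vc`, `struct request`, `vc_write`, `start_msg`) are hypothetical, and the socket is replaced by a byte budget so the example stays self-contained. The sketch also assumes that a message must not overtake earlier queued messages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical request: remembers how much of one message is still unsent. */
struct request {
    const char     *buf;
    size_t          len;
    size_t          sent;
    struct request *next;
};

/* Hypothetical virtual connection with a simulated socket. */
struct vc {
    size_t          writable;    /* bytes the simulated socket accepts now */
    struct request *sendq_head;
    struct request *sendq_tail;
};

/* Simulated non-blocking write: accepts at most vc->writable bytes. */
size_t vc_write(struct vc *vc, const char *buf, size_t len)
{
    size_t n = len < vc->writable ? len : vc->writable;

    (void)buf;                   /* the data itself is not needed here */
    vc->writable -= n;
    return n;
}

/* Tries to send at once. On return *req_out is NULL when everything was
 * written, or points to a queued request holding the unsent remainder.
 * Returns 0 on success and -1 when the request cannot be allocated. */
int start_msg(struct vc *vc, const char *buf, size_t len,
              struct request **req_out)
{
    size_t sent = 0;
    struct request *req;

    assert(vc != NULL && req_out != NULL);
    *req_out = NULL;

    /* Write directly only when nothing is queued, to keep message order. */
    if (vc->sendq_head == NULL) {
        sent = vc_write(vc, buf, len);
        if (sent == len) {
            return 0;
        }
    }

    req = malloc(sizeof *req);
    if (req == NULL) {
        return -1;
    }
    req->buf = buf;
    req->len = len;
    req->sent = sent;
    req->next = NULL;

    if (vc->sendq_tail != NULL) {
        vc->sendq_tail->next = req;
    } else {
        vc->sendq_head = req;
    }
    vc->sendq_tail = req;
    *req_out = req;
    return 0;
}
```

A caller that receives a non-NULL request would wait on it (or poll a progress engine) until the queued bytes are drained, which is what makes the blocking send block.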
22. MPI_Recv()
Flow:
MPI_Recv(...)
MPID_Recv(...)
- Msg type?
  - MPIDI_REQUEST_SELF_MSG: just copying the buffer
  - MPIDI_REQUEST_EAGER_MSG: MPIDI_CH3_iStartMsgv(), unpacking
  - MPIDI_REQUEST_RNDV_MSG: MPIDI_CH3U_Post_data_receive(), MPIDI_CH3_iStartRndvTransfer()
MPID_Request_Release()
End of MPI_Recv
23. MPI_Recv()
- Validate handle params
- MPIR_ERRTEST_COMM()
- MPID_Comm_valid_ptr(), etc.
- Call MPID_Recv()
- Release the request when finished
24. MPID_Recv()
- Check unexpected / posted queue
- MPIDI_CH3U_Recvq_FDU_or_AEP()
- If a pending request exists, dequeue and process it
- Get the msg type of the request
- MPIDI_REQUEST_EAGER_MSG
- MPIDI_REQUEST_RNDV_MSG
- MPIDI_REQUEST_SELF_MSG
- Release the request when finished
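The queue check performed by MPID_Recv can be sketched as "find and dequeue from the unexpected queue, or allocate and enqueue on the posted queue". The types and function names below are hypothetical, and the matching is simplified to an exact (source, tag) pair; wildcards and communicator context are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical receive request, matched on (source, tag) only. */
struct recv_req {
    int              source;
    int              tag;
    struct recv_req *next;
};

/* Simple singly linked FIFO queue. */
struct recvq {
    struct recv_req *head;
    struct recv_req *tail;
};

/* Appends a request to the end of a queue. */
void recvq_append(struct recvq *q, struct recv_req *req)
{
    req->next = NULL;
    if (q->tail != NULL) {
        q->tail->next = req;
    } else {
        q->head = req;
    }
    q->tail = req;
}

/* Finds and dequeues a matching unexpected request, or allocates a new
 * request and enqueues it as posted. *found reports which case occurred.
 * Returns NULL only when allocation fails. */
struct recv_req *recvq_fdu_or_aep(struct recvq *unexpected,
                                  struct recvq *posted,
                                  int source, int tag, bool *found)
{
    struct recv_req *prev = NULL;
    struct recv_req *cur;

    assert(unexpected != NULL && posted != NULL && found != NULL);

    for (cur = unexpected->head; cur != NULL; prev = cur, cur = cur->next) {
        if (cur->source == source && cur->tag == tag) {
            if (prev != NULL) {
                prev->next = cur->next;
            } else {
                unexpected->head = cur->next;
            }
            if (unexpected->tail == cur) {
                unexpected->tail = prev;
            }
            cur->next = NULL;
            *found = true;
            return cur;
        }
    }

    *found = false;
    cur = malloc(sizeof *cur);
    if (cur == NULL) {
        return NULL;
    }
    cur->source = source;
    cur->tag = tag;
    recvq_append(posted, cur);
    return cur;
}
```

When `*found` is true the message arrived before the receive was posted and can be processed at once; otherwise the request waits on the posted queue until a matching message arrives.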
25. Message type
- MPIDI_REQUEST_EAGER_MSG
- Send back eager sync ack
- MPIDI_CH3_iStartMsg()
- Get pending count in the request
- Unpack data and free buffer
- MPIDI_REQUEST_RNDV_MSG (RTS)
- Check posted queue and obtain req obj
- MPIDI_CH3U_Post_data_receive()
- If Rendezvous channel defined
- Call MPIDI_CH3_iStartRndvTransfer()
- RDMA read is performed.
- MPIDI_REQUEST_SELF_MSG
- Just copying the buffer
26. Summary
- Modular architecture
- Provides independence
- Local init / finalization
- A developer can implement an algorithm within a specific module, obey the interfaces, and link
- However
- Too complicated
- Reference chains are too deep
- Source is scattered
- A more precise analysis will be possible after analyzing coll (collective), ptp (point-to-point), etc.