Title: MPI point-to-point protocols and our improvements
Slide 1: MPI point-to-point protocols and our improvements
Slide 2: MPI point-to-point communication
- MPI supports many modes of point-to-point communication: blocking, non-blocking, buffered, immediate, etc.
- The sender specifies the memory to be sent.
- The receiver specifies the memory to store the message.
- The goal is to move data from user space on the sender to user space on the receiver.
Slide 3: MPI point-to-point communication
MPI_Send(send_buf, ...)
MPI_Recv(recv_buf, ...)
- Reliability and performance
  - Requires 100% reliability; messages cannot be dropped.
  - The sender and receiver may arrive at the operation at different times.
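For reference, a minimal two-rank program using the two calls above might look like the following. This is illustrative code, not from the slides: rank 0 names the memory to be sent, rank 1 names the memory that stores the message.

    /* Minimal MPI point-to-point example: rank 0 sends a buffer to rank 1.
     * Build with mpicc and run with two ranks, e.g. mpirun -np 2 ./a.out */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double buf[1024];                       /* user-space buffer on both sides */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Sender specifies the memory to be sent. */
            MPI_Send(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receiver specifies the memory that stores the incoming message. */
            MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }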
Slide 4: Current protocol for small messages
- Eager protocol (sketched after the figure below)
  - The sender copies the message into a system buffer, issues the send command (targeting a designated system buffer on the receiver), and completes; send_buf can be reused immediately.
  - The receiver copies the message out of the designated system buffer when it arrives.
  - Ideal for small messages: the copy overhead is negligible and the memory overhead is not excessive.
- Drawbacks
  - Copy overhead.
  - Per-process buffer requirement (O(P) memory).
  - Cannot be applied to large messages.
[Figure: Eager protocol]
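The eager path can be sketched roughly as follows. This is illustrative C only, not the actual library implementation: the sysbuf_t layout and the function names are invented, and the 12 KB threshold is the value used later in the experiments.

    #include <stddef.h>
    #include <string.h>

    #define EAGER_THRESHOLD (12 * 1024)          /* matches the threshold used later */

    /* A per-peer system buffer slot (illustrative layout). */
    typedef struct {
        int    tag;
        size_t len;
        char   data[EAGER_THRESHOLD];
    } sysbuf_t;

    /* Sender side: copy into the system buffer that the transport will deliver
     * to the receiver's designated slot; send_buf is reusable as soon as we return. */
    int eager_pack(sysbuf_t *slot, const void *send_buf, size_t len, int tag)
    {
        if (len > EAGER_THRESHOLD)
            return -1;                           /* fall back to a large-message protocol */
        memcpy(slot->data, send_buf, len);       /* first copy (the eager overhead) */
        slot->tag = tag;
        slot->len = len;
        return 0;                                /* transport ships the slot asynchronously */
    }

    /* Receiver side: when a matching slot has arrived, copy into the user buffer. */
    void eager_unpack(const sysbuf_t *slot, void *recv_buf)
    {
        memcpy(recv_buf, slot->data, slot->len); /* second copy */
    }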
Slide 5: Current protocol for large messages
- Rendezvous protocol (outlined after the figure below)
  - The sender and receiver handshake before data is transferred.
- Drawbacks
  - Unnecessary synchronization.
  - Communication progress issues.
  - A receiver that arrives early does not help.
[Figure: Rendezvous protocol]
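A rough outline of the classic sender-initiated handshake, using the message names from the later timeline figures. The helper functions are hypothetical stand-ins for the MPI library's internal transport; this is a sketch of the control flow, not real library code.

    #include <stdint.h>

    extern void send_sender_ready(int peer, uint32_t len, int tag);
    extern int  receiver_ready_arrived(int peer);   /* progressed inside MPI calls */
    extern void transfer_payload(int peer, const void *buf, uint32_t len);

    void rndv_send(int peer, const void *send_buf, uint32_t len, int tag)
    {
        /* MPI_Isend: only announce the message. */
        send_sender_ready(peer, len, tag);

        /* MPI_Wait: block until the receiver has posted a matching receive and
         * answered with Receiver_Ready -- the unnecessary synchronization.
         * A sender that arrived early idles here. */
        while (!receiver_ready_arrived(peer))
            ;

        /* Only now does the payload move; MPI_Wait returns afterwards. */
        transfer_payload(peer, send_buf, len);
    }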
Slide 6
- The eager and rendezvous protocols have been around for 20 years.
  - They were designed when communication and message processing were expensive.
  - They minimize the number of messages needed.
  - They are not optimized for many situations.
- On newer systems, communication and message processing are not as expensive.
  - More complex protocols can be used to improve performance.
- Our recent work (with Matthew Small): improve MPI point-to-point communication with new protocols on RDMA-enabled systems.
Slide 7
- RDMA: Remote Direct Memory Access
  - RDMA devices allow direct access to memory on other nodes (machines) without the remote/local CPU's involvement (a Verbs-level example follows the figure below).
  - Supported by almost all contemporary networking systems (InfiniBand, Myrinet, Ethernet (iWARP)).
[Figure: RDMA_Put moves data from the local user space to the remote user space; RDMA_Get moves data from the remote user space to the local user space.]
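For concreteness, posting an RDMA_Get with the InfiniBand Verbs API looks roughly like this. The function name is ours; queue-pair setup, memory registration, and the exchange of the remote address and rkey are assumed to have been done elsewhere.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* Post a one-sided RDMA read on an already-connected queue pair.
     * local_mr must cover local_addr and be registered with local write access. */
    int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *local_mr,
                       void *local_addr, size_t len,
                       uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge;
        struct ibv_send_wr wr, *bad_wr = NULL;

        memset(&sge, 0, sizeof(sge));
        sge.addr   = (uintptr_t)local_addr;          /* where the data lands locally */
        sge.length = (uint32_t)len;
        sge.lkey   = local_mr->lkey;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = 1;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.opcode              = IBV_WR_RDMA_READ;   /* RDMA_Get in the figure       */
        wr.send_flags          = IBV_SEND_SIGNALED;  /* completion goes to the CQ    */
        wr.wr.rdma.remote_addr = remote_addr;        /* remote user-space buffer     */
        wr.wr.rdma.rkey        = rkey;               /* its registration key         */

        return ibv_post_send(qp, &wr, &bad_wr);      /* remote CPU is not involved   */
    }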
Slide 8
- More detailed problems with the current protocols
  - The eager protocol is near optimal; keep it.
  - Rendezvous protocol
    - Unnecessary synchronization: the sender may wait for the receiver.
    - Communication progress issue: the MPI call of an early-arriving receiver is wasted.
[Timeline: the sender posts MPI_Isend (sending Sender_Ready) and blocks in MPI_Wait; only when the receiver posts MPI_Irecv and replies Receiver_Ready does the data move, after which each side's MPI_Wait returns.]
Slide 9
- More detailed issues with the current protocols
  - Rendezvous protocol
    - Communication progress issue: the MPI call of an early-arriving receiver is wasted.
[Timeline: the receiver posts MPI_Irecv before Sender_Ready has arrived, so the call is wasted; the Sender_Ready / Receiver_Ready handshake only takes place later, the receiver sits idle waiting for the data, and both MPI_Wait calls return only after the transfer.]
Slide 10
- More detailed issues with the current protocols
  - Rendezvous protocol
    - Communication progress issue: the MPI call of an early-arriving receiver is wasted.
    - Rendezvous can be pretty bad depending on how users write the program (an example of such a pattern follows the figure below).
[Timeline: the receiver posts MPI_Irecv early but makes no further MPI calls until MPI_Wait, so Receiver_Ready is only sent then; the sender idles in its MPI_Wait the whole time and the data transfer is pushed to the very end, delaying both MPI_Wait returns.]
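The user-code pattern behind this worst case looks something like the following (illustrative). The receive is posted early, but no MPI call is made during the computation, so under the classic rendezvous protocol the handshake cannot make progress until MPI_Wait.

    #include <mpi.h>

    void early_irecv_no_overlap(double *buf, int count, int src, int tag,
                                void (*compute)(void))
    {
        MPI_Request req;

        /* Receiver arrives early and posts the receive... */
        MPI_Irecv(buf, count, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &req);

        /* ...then computes without making any MPI calls.  The rendezvous
         * handshake is only driven from inside the MPI library, so no data
         * is transferred during this phase. */
        compute();

        /* The handshake and the whole payload transfer are squeezed in here,
         * which is the "MPI_Wait returns here" gap in the figures above, and
         * there is no communication/computation overlap. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }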
Slide 11: Our idea 1 - use a hybrid protocol for medium-sized messages
- Hybrid protocol (sketched below): make one copy at the sender side, and use an RDMA read to load the data.
- Why? There is no longer any unnecessary synchronization between the sender and the receiver.
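A hedged sketch of what the sender side of such a hybrid path could look like. The slides only specify "one copy at the sender plus an RDMA read"; the control-message layout and the helper names here are invented for illustration.

    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint64_t addr;   /* sender-side registered buffer that holds the copy */
        uint32_t rkey;   /* its RDMA key                                      */
        uint32_t len;
    } hybrid_rts_t;      /* small "ready to send" control message             */

    /* Sender: one copy into a pre-registered buffer, then return --
     * no waiting for the receiver. */
    void hybrid_isend(void *reg_buf, uint32_t reg_rkey,
                      const void *send_buf, uint32_t len, hybrid_rts_t *rts)
    {
        memcpy(reg_buf, send_buf, len);      /* the single copy, at the sender */
        rts->addr = (uintptr_t)reg_buf;
        rts->rkey = reg_rkey;
        rts->len  = len;
        /* rts is sent to the receiver as a small control message; send_buf can
         * be reused immediately.  On arrival, the receiver issues an RDMA read
         * (RDMA_Get, as in the Verbs sketch earlier) to pull the data, then
         * notifies the sender so the registered buffer can be recycled. */
    }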
Slide 12: Our idea 2 - whoever arrives early starts the communication
- Use a sender-initiated protocol when the sender arrives early and a receiver-initiated protocol when the receiver arrives early (the receiver-initiated path is sketched below).
- The receiver-initiated protocol is much cleaner than the sender-initiated protocol.
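One plausible realization of the receiver-initiated path, again with invented helper names and message layout: the early receiver advertises its registered receive buffer, and the sender later pushes the data straight into it with an RDMA_Put followed by a completion notification.

    #include <stdint.h>

    typedef struct {
        uint64_t addr;   /* receiver's user buffer, registered for RDMA */
        uint32_t rkey;
        uint32_t len;
        int      tag;
    } recv_ready_t;

    extern void send_recv_ready(int peer, const recv_ready_t *m);   /* control msg     */
    extern void rdma_put(int peer, const void *src, uint64_t dst_addr,
                         uint32_t rkey, uint32_t len);               /* one-sided write */
    extern void send_fin(int peer);                                  /* completion      */

    /* Receiver arrives first: advertise the destination and return; its MPI_Wait
     * later only has to wait for the FIN notification. */
    void ri_irecv(int peer, void *recv_buf, uint32_t rkey, uint32_t len, int tag)
    {
        recv_ready_t m = { (uintptr_t)recv_buf, rkey, len, tag };
        send_recv_ready(peer, &m);                 /* Receiver_Ready in the figures */
    }

    /* Sender arrives later: the advertisement is already there, so the data goes
     * straight into the receiver's user buffer with no handshake round-trip in
     * front of the transfer. */
    void ri_send_on_ready(int peer, const void *send_buf, const recv_ready_t *m)
    {
        rdma_put(peer, send_buf, m->addr, m->rkey, m->len);
        send_fin(peer);                            /* lets the receiver's MPI_Wait return */
    }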
Slide 13: What if both arrive at the same time?
- With the receiver-initiated protocol, one extra useless SENDER_READY message is a small price to pay.
- Compared to the original sender-initiated protocol, SENDER_READY is taken out of the critical path of the communication.
Slide 14: The integrated protocols
- The protocol is selected based on message size and arrival time (one plausible selection rule is sketched below).
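One plausible reading of the selection rule, using the thresholds reported on the next slide; the enum and function names are illustrative.

    #include <stddef.h>

    #define EAGER_THRESHOLD  (12 * 1024)
    #define HYBRID_THRESHOLD (40 * 1024)

    typedef enum { PROTO_EAGER, PROTO_HYBRID, PROTO_SENDER_INIT, PROTO_RECV_INIT } proto_t;

    proto_t select_protocol(size_t msg_len, int receiver_arrived_first)
    {
        if (msg_len <= EAGER_THRESHOLD)
            return PROTO_EAGER;          /* small messages: keep the eager path   */
        if (msg_len <= HYBRID_THRESHOLD)
            return PROTO_HYBRID;         /* medium messages: one copy + RDMA read */
        /* large messages: whoever arrives early starts the communication */
        return receiver_arrived_first ? PROTO_RECV_INIT : PROTO_SENDER_INIT;
    }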
Slide 15: Some performance results
- Our prototype library is built on top of the InfiniBand Verbs API.
- It supports the commonly used MPI point-to-point routines.
- Experiments were done on draco.cs.fsu.edu:
  - Dell PowerEdge 1950, dual 2.33 GHz quad-core Xeon E5345, 8 GB memory
  - InfiniBand DDR (20 Gbps)
  - MVAPICH2 1.2rc1
  - EAGER_THRESHOLD = 12 KB, HYBRID_THRESHOLD = 40 KB
Slide 16: Ping-pong performance
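The slides do not include the benchmark source; a standard ping-pong microbenchmark of the kind typically behind such plots is sketched below.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define ITERS 1000

    int main(int argc, char **argv)
    {
        int rank;
        int len = (argc > 1) ? atoi(argv[1]) : 4096;    /* message size in bytes */
        char *buf = malloc(len);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)   /* report the average one-way latency */
            printf("%d bytes: %.2f us one-way\n", len, (t1 - t0) * 1e6 / (2 * ITERS));

        free(buf);
        MPI_Finalize();
        return 0;
    }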
Slide 17: Progress benchmark
Slide 18: Applications
Slide 19: Conclusion
- By using customized rendezvous protocols for different situations, our combined protocols:
  - reduce unnecessary synchronization,
  - decrease the number of control messages in the critical path of communication, and
  - have better communication-computation overlap capability.
- The result is, in general, a more efficient point-to-point communication system on RDMA-enabled clusters.