Title: Minimizing Communication Latency to Maximize Network Communication Throughput over InfiniBand
1. Minimizing Communication Latency to Maximize Network Communication Throughput over InfiniBand

Papers presented:
- Design and Implementation of MPICH-2 over InfiniBand with RDMA Support -- Liu, Jiang, Wyckoff, Panda, Ashton, Buntinas, Gropp, Toonen
- Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand -- Tipparaju, Santhanaraman, Nieplocha, Panda

Presented by Nikola Vouk; advisor: Dr. Frank Mueller
2. Background

General buffer manipulation in communication protocols
3. InfiniBand

- 7.6 microsecond latency
- 857 MB/s peak bandwidth
- Send/receive queue with a work-completion interface
- Asynchronous calls
- Remote Direct Memory Access (RDMA)
- Sits between a shared-memory architecture and MPI: not exactly NUMA, but close
- Provides a channel interface (read/write) for communication
- For security, each side registers exactly which memory is freely accessible to other hosts.
4. (No transcript)
5. Common Problems

- Link-layer/network-protocol inefficiencies (unnecessary messages sent)
- User-space to system-buffer copy overhead (copy time)
- Synchronous sending/receiving and computing (the application has to stop in order to handle requests)
6. Problem 1: Message-Passing Protocol

The basic InfiniBand protocol requires three matching writes.

- RDMA channel interface, put operation:
  - Copy the user buffer into the pre-registered buffer
  - RDMA-write the buffer to the receiver
  - Adjust the local head pointer
  - RDMA-write the new head pointer to the receiver
  - Return the number of bytes written
7. Solutions: Piggybacking and Pipelining

- Piggybacking: send the pointer update along with the data packets
- Pipelining: chop buffers into packet-sized pieces and send them out as the message comes in

An improvement, but still less than 870 MB/s.
8. Problem 2: Internal Buffer-Copy Overhead. Solution: Zero-Copy Buffers

- Internal overhead: the user must copy data to the system (into a registered memory slot)
- Zero copy allows the system to read directly from the user buffer
9. Zero-Copy Protocol at Different Levels of the MPICH Hierarchy

- If the packet is large enough:
  - Register the user buffer
  - Notify the end host of the request
  - The end host issues an RDMA read
  - It reads directly from the user buffer space
10. Comparing Interfaces: CH3 Interface vs. RDMA Channel Interface

- Implemented directly on top of the CH3 interface
- More flexible due to access to the complete ADI-3 interface
- Always uses RDMA write
11. (No transcript)
12. CH3 Implementation Performance

- A function of raw underlying performance
13.

- Pipelining always performed the worst
- The RDMA channel performed within 1% of CH3
14. Problem 3: Too Much Overhead, Not Enough Execution

- Unanswered problems:
  - Registration overhead remains, even in the cached version
  - Data transfer still requires significant cooperation from both sides (taking away from computation)
  - Non-contiguous data not addressed
- Solutions:
  - Provide a custom API that allocates out of large pre-registered memory chunks
  - Overlap communication with computation as much as possible
  - Apply zero-copy techniques using scatter/gather RDMA calls
15. Host-Assisted Zero-Copy Protocol

- The host sends the receiver a request for a gather
- The receiver posts a descriptor and continues working
- Can be implemented as a helper thread on the receiving host
- Same idea as the previous zero-copy protocol, but extended to non-contiguous data
16. NAS MG

- Again, the pipelined method performs similarly to the zero-copy method
17. SUMMA Matrix Multiplication

- Significant benefit from host-assisted zero-copy
18. Conclusions

- Minimizing internal memory copying removes the primary memory-performance obstacle
- InfiniBand allows DMA that offloads work from the CPU; coordinating registered memory can further minimize CPU involvement
- With proper coding, existing MPI programs can achieve almost wire speed over InfiniBand
- Could be implemented on other architectures (Gigabit Ethernet, Myrinet)
19. Thesis Implications

- Buddy MPICH is also a latency-hiding implementation of MPICH.
- Separation happens at the ADI layer: the buddy thread listens for connections and accepts work from the worker thread via send/receive queues.