Title: Minimizing Communication Latency to Maximize Network Communication Throughput over InfiniBand
1. Minimizing Communication Latency to Maximize Network Communication Throughput over InfiniBand

Papers presented:
- Design and Implementation of MPICH-2 over InfiniBand with RDMA Support -- Liu, Jiang, Wyckoff, Panda, Ashton, Buntinas, Gropp, Toonen
- Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand -- Tipparaju, Santhanaraman, Nieplocha, Panda

Presented by Nikola Vouk; advisor: Dr. Frank Mueller
2. Background

General buffer manipulation in communication protocols
3. InfiniBand

- 7.6 microsecond latency
- 857 MB/s peak bandwidth
- Send/receive queue with a work-completion interface
- Asynchronous calls
- Remote Direct Memory Access (RDMA)
- Sits between a shared-memory architecture and MPI: not exactly NUMA, but close
- Provides a channel interface (read/write) for communication
- For security, each side registers exactly which memory is freely accessible to other hosts.
4. (No transcript)
5. Common Problems

- Link-layer/network-protocol inefficiencies (unnecessary messages sent)
- User-space to system-buffer copy overhead (copy time)
- Synchronous sending/receiving and computing (the application has to stop in order to handle requests)
6. Problem 1: Message-Passing Protocol

The basic InfiniBand protocol requires three matching writes.

- RDMA channel interface, put operation:
  - Copy the user buffer into the pre-registered buffer
  - RDMA-write the buffer to the receiver
  - Adjust the local head pointer
  - RDMA-write the new head pointer to the receiver
  - Return the number of bytes written
7. Solutions: Piggybacking and Pipelining

- Piggybacking: send the pointer update along with the data packets
- Pipelining: chop buffers into packet-sized pieces and send them out as the message comes in

An improvement, but still less than 870 MB/s.
8. Problem 2: Internal Buffer-Copy Overhead. Solution: Zero-Copy Buffers

- Internal overhead: the user must copy data to the system (into a registered memory slot)
- Zero copy allows the system to read directly from the user buffer
9. Zero-Copy Protocol at Different Levels of the MPICH Hierarchy

- If the packet is large enough:
  - Register the user buffer
  - Notify the end host of the request
  - The end host issues an RDMA read
  - It reads directly from the user buffer space
10. Comparing Interfaces: CH3 Interface vs. RDMA Channel Interface

- Implemented directly on top of the CH3 interface
- More flexible due to access to the complete ADI-3 interface
- Always uses RDMA write
11. (No transcript)
12. CH3 Implementation Performance

- A function of raw underlying performance
13.

- Pipelining always performed the worst
- The RDMA channel performed within 1% of CH3
14. Problem 3: Too Much Overhead, Not Enough Execution

- Unanswered problems:
  - Registration overhead remains, even in the cached version
  - Data transfer still requires significant cooperation from both sides (taking away from computation)
  - Non-contiguous data not addressed
- Solutions:
  - Provide a custom API that allocates out of large pre-registered memory chunks
  - Overlap communication with computation as much as possible
  - Apply zero-copy techniques using scatter/gather RDMA calls
15. Host-Assisted Zero-Copy Protocol

- The host sends the receiver a request for a gather
- The receiver posts a descriptor and continues working
- Can be implemented as a helper thread on the receiving host
- Same idea as the previous zero-copy protocol, but extended to non-contiguous data
16. NAS MG

- Again, the pipelined method performs similarly to the zero-copy method
17. SUMMA Matrix Multiplication

- Significant benefit from host-assisted zero-copy
18. Conclusions

- Minimizing internal memory copying removes the primary memory-performance obstacle
- InfiniBand allows DMA that offloads work from the CPU; coordinating registered memory can further minimize CPU involvement
- With proper coding, existing MPI programs can achieve almost wire speed over InfiniBand
- Could be implemented on other architectures (Gigabit Ethernet, Myrinet)
19. Thesis Implications

- Buddy MPICH is also a latency-hiding implementation of MPICH.
- Separation happens at the ADI layer: the buddy thread listens for connections and accepts work from the worker thread via send/receive queues.