1
Minimizing Communication Latency to Maximize Network Communication Throughput over InfiniBand
Papers covered:
  • Design and Implementation of MPICH-2 over InfiniBand with RDMA Support (Liu, Jiang, Wyckoff, Panda, Ashton, Buntinas, Gropp, Toonen)
  • Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand (Tipparaju, Santhanaraman, Nieplocha, Panda)
Presented by Nikola Vouk; Advisor: Dr. Frank Mueller
2
Background
General Buffer Manipulation in Communication Protocols
3
InfiniBand
  • 7.6 microsecond latency
  • 857 MB/s peak bandwidth
  • Send/receive work queues with a work-completion
    interface
  • Asynchronous calls
  • Remote Direct Memory Access (RDMA)
  • Sits between a shared-memory architecture and MPI;
    not exactly NUMA, but close
  • Provides a channel interface (read/write) for
    communication
  • For security, each side must explicitly register the
    memory it exposes to other hosts (see the verbs
    sketch below)
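A minimal verbs sketch (not the papers' code) of registration plus a one-sided RDMA write, assuming the protection domain pd, queue pair qp, and the peer's remote_addr/rkey were already exchanged during connection setup (omitted here):

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int rdma_write_once(struct ibv_pd *pd, struct ibv_qp *qp,
                           void *buf, size_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    /* Register (pin) the local buffer so the HCA can DMA from it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr;
    memset(&wr, 0, sizeof wr);
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided operation */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* must be registered remotely */
    wr.wr.rdma.rkey        = rkey;

    /* Asynchronous post; the completion is reaped later from the CQ. */
    return ibv_post_send(qp, &wr, &bad_wr);
}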

4
(No Transcript)
5
Common Problems
  • Link-layer/network protocol inefficiencies
    (unnecessary messages sent)
  • User-space to system-buffer copy overhead (copy
    time)
  • Synchronous sending/receiving versus computing
    (the application has to stop in order to handle
    requests)

6
Problem 1: Message-Passing Protocol
The basic InfiniBand protocol requires three matching
writes. The RDMA channel interface put operation (a
sketch follows this list):
  • Copy the user buffer to a pre-registered buffer
  • RDMA-write the buffer to the receiver
  • Adjust the local head pointer
  • RDMA-write the new head pointer to the receiver
  • Return bytes written
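A compact sketch of that put path, assuming a hypothetical rdma_put(local, len, remote_addr, rkey) wrapper (e.g. built on the ibv_post_send sketch above) and a staging ring buffer whose layout is mirrored on the receiver; flow control and wrap-around are omitted:

#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Assumed asynchronous RDMA-write wrapper (illustrative, not MPICH's API). */
extern int rdma_put(const void *local, size_t len,
                    uint64_t remote_addr, uint32_t rkey);

typedef struct {
    char    *base;        /* local pre-registered staging buffer */
    size_t   head;        /* next free offset, mirrored by the receiver */
    uint64_t remote_buf;  /* receiver's registered ring buffer */
    uint64_t remote_head; /* where the receiver's head-pointer copy lives */
    uint32_t rkey;
} ring_t;

static ssize_t channel_put(ring_t *r, const void *user_buf, size_t len)
{
    /* 1. Copy the user buffer into the pre-registered buffer. */
    memcpy(r->base + r->head, user_buf, len);
    /* 2. RDMA-write the data into the receiver's ring. */
    rdma_put(r->base + r->head, len, r->remote_buf + r->head, r->rkey);
    /* 3. Adjust the local head pointer (in real code this word also
       lives in registered memory). */
    r->head += len;
    /* 4. RDMA-write the new head pointer so the receiver sees the data. */
    rdma_put(&r->head, sizeof r->head, r->remote_head, r->rkey);
    /* 5. Return bytes written. */
    return (ssize_t)len;
}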

7
Solutions: Piggybacking and Pipelining
Piggybacking: send the pointer update along with the
data packets.
Pipelining: chop the buffer into packet-sized pieces and
send them out as the message comes in (sketched below).
An improvement, but still less than 870 MB/s.
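A sketch of the pipelined/piggybacked variant, reusing the illustrative ring_t and rdma_put() from the previous sketch; the packet header carrying the head-pointer update is an assumption about the general technique, not the papers' exact wire format:

#include <stdint.h>
#include <string.h>

#define CHUNK 8192                /* illustrative packet size */

typedef struct {
    uint32_t len;                 /* payload bytes in this packet */
    uint32_t head;                /* piggybacked head-pointer update */
} pkt_hdr_t;

static void channel_put_pipelined(ring_t *r, const char *user_buf,
                                  size_t len)
{
    for (size_t off = 0; off < len; off += CHUNK) {
        size_t n    = (len - off < CHUNK) ? len - off : CHUNK;
        char  *slot = r->base + r->head;

        /* Stage header + payload; since the posts are asynchronous,
           copying chunk i+1 overlaps the wire transfer of chunk i. */
        pkt_hdr_t *hdr = (pkt_hdr_t *)slot;
        hdr->len  = (uint32_t)n;
        hdr->head = (uint32_t)(r->head + sizeof *hdr + n);
        memcpy(slot + sizeof *hdr, user_buf + off, n);

        /* One RDMA write per chunk: data and pointer travel together,
           eliminating the separate head-pointer write. */
        rdma_put(slot, sizeof *hdr + n, r->remote_buf + r->head, r->rkey);
        r->head += sizeof *hdr + n;
    }
}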
8
Problem 2: Internal Buffer-Copying Overhead. Solution: Zero-Copy Buffers
  • Internal overhead: the user must copy data into a
    system buffer (a registered memory slot)
  • Zero-copy allows the system to read directly from the
    user buffer

9
Zero-Copy Protocol at Different Levels of the MPICH
Hierarchy
  • If the packet is large enough (sender side sketched
    below):
  • Register the user buffer
  • Notify the end host of the request
  • The end host sends an RDMA read
  • Data is read directly from the user buffer space
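A sender-side sketch of this rendezvous, assuming an illustrative send_ctrl() control message for exchanging the buffer's address and rkey; the receiver would answer with an IBV_WR_RDMA_READ against that address:

#include <infiniband/verbs.h>
#include <stdint.h>

#define EAGER_LIMIT (16 * 1024)   /* illustrative threshold */

/* Assumed control-message helper; the real MPICH plumbing differs. */
extern void send_ctrl(uint64_t addr, uint32_t rkey, size_t len);

void send_large(struct ibv_pd *pd, void *user_buf, size_t len)
{
    if (len <= EAGER_LIMIT)
        return;                   /* small messages keep the copy path */

    /* Register the user buffer in place: no staging copy at all. */
    struct ibv_mr *mr = ibv_reg_mr(pd, user_buf, len,
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr)
        return;

    /* Tell the peer where to RDMA-read from. */
    send_ctrl((uintptr_t)user_buf, mr->rkey, len);
    /* ... wait for the peer's completion notice, then ibv_dereg_mr(mr). */
}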

10
Comparing Interfaces: CH3 Interface vs. RDMA Channel
Interface
  • The CH3 version is implemented directly off the CH3
    interface
  • More flexible, due to access to the complete ADI-3
    interface
  • Always uses RDMA write

11
(No Transcript)
12
CH3 Implementation Performance
  • Performance is largely a function of the raw
    underlying InfiniBand performance

13
  • Pipelining always performed the worst
  • The RDMA channel performs within 1% of CH3

14
Problem 3: Too Much Overhead, Not Enough Execution
Unanswered problems:
  • Registration overhead is still there, even in the
    cached version
  • Data transfer still requires significant
    cooperation from both sides (taking away from
    computation)
  • Non-contiguous data is not addressed
Solutions:
  • Provide a custom API that allocates out of large
    pre-registered memory chunks (a sketch follows
    this list)
  • Overlap communication with computation as much as
    possible
  • Apply zero-copy techniques using scatter/gather
    RDMA calls
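A minimal sketch of the registration-amortizing allocator idea: pin one large region up front and carve user buffers out of it, so the per-message ibv_reg_mr() cost disappears. pool_init/pool_alloc and the bump pointer are illustrative stand-ins, not the papers' API:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>

#define POOL_SIZE (64UL << 20)    /* illustrative 64 MB pinned pool */

static struct ibv_mr *pool_mr;
static char          *pool_base;
static size_t         pool_used;

int pool_init(struct ibv_pd *pd)
{
    pool_base = malloc(POOL_SIZE);
    if (!pool_base)
        return -1;
    /* One registration covers the whole pool. */
    pool_mr = ibv_reg_mr(pd, pool_base, POOL_SIZE,
                         IBV_ACCESS_LOCAL_WRITE |
                         IBV_ACCESS_REMOTE_READ |
                         IBV_ACCESS_REMOTE_WRITE);
    return pool_mr ? 0 : -1;
}

void *pool_alloc(size_t len, uint32_t *lkey)
{
    if (pool_used + len > POOL_SIZE)
        return NULL;              /* a real allocator would recycle */
    void *p = pool_base + pool_used;
    pool_used += len;
    *lkey = pool_mr->lkey;        /* every allocation is pre-registered */
    return p;
}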

15
Host-Assisted Zero-Copy Protocol
  • The host sends a gather request to the receiver
  • The receiver posts a descriptor and continues working
  • Can be implemented as a helper thread on the
    receiving host
  • Same as the previous zero-copy idea, but extended to
    non-contiguous data (see the scatter/gather sketch
    below)
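A sketch of the scatter/gather RDMA call underpinning this: one RDMA read pulls a contiguous remote region and scatters it into non-contiguous local strips via multiple SGEs. The strided layout and the 16-entry cap are assumptions; real code would respect the QP's max_sge:

#include <infiniband/verbs.h>
#include <stdint.h>

int rdma_read_strided(struct ibv_qp *qp, uint32_t lkey,
                      char *local_base, size_t stride, size_t strip,
                      int nstrips, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge[16];               /* capped for the sketch */
    int n = nstrips < 16 ? nstrips : 16;

    /* Local side: one SGE per non-contiguous strip (registered memory). */
    for (int i = 0; i < n; i++) {
        sge[i].addr   = (uintptr_t)(local_base + i * stride);
        sge[i].length = (uint32_t)strip;
        sge[i].lkey   = lkey;
    }

    struct ibv_send_wr wr = {0}, *bad_wr;
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = sge;
    wr.num_sge             = n;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr; /* contiguous on the remote side */
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr);
}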

16
NAS MG
  • Again, the pipelined method performs similarly to
    the zero-copy method

17
SUMMA Matrix Multiplication
  • Host-Assisted Zero-Copy shows a significant benefit

18
Conclusions
  • Minimizing internal memory copying removes the
    primary memory-performance obstacle
  • InfiniBand allows DMA that offloads work from the
    CPU; coordinating registered memory minimizes CPU
    involvement
  • With proper coding, existing MPI programs can
    achieve almost wire speed over InfiniBand
  • The techniques could be implemented on other
    interconnects (Gig-E, Myrinet)

19
Thesis Implications
  • Buddy MPICH is also a latency-hiding implementation
    of MPICH
  • It separates at the ADI layer: a buddy thread listens
    for connections and accepts work from the worker
    thread via send/receive queues (sketched below)
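A minimal pthreads sketch of the buddy-thread pattern: the worker enqueues communication requests and keeps computing while the buddy thread drains them. The structure and names are illustrative, not Buddy MPICH's actual code:

#include <pthread.h>
#include <stddef.h>

typedef struct work {
    void (*fn)(void *);           /* communication operation to perform */
    void *arg;
    struct work *next;
} work_t;

static work_t         *q_head;    /* simple LIFO list; real code would
                                     use a FIFO queue */
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;

void submit(work_t *w)            /* called by the worker thread */
{
    pthread_mutex_lock(&q_lock);
    w->next = q_head;
    q_head = w;
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_lock);
}

void *buddy_main(void *unused)    /* the latency-hiding thread */
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (!q_head)
            pthread_cond_wait(&q_cond, &q_lock);
        work_t *w = q_head;
        q_head = w->next;
        pthread_mutex_unlock(&q_lock);
        w->fn(w->arg);            /* progress communication while the
                                     worker keeps computing */
    }
    return NULL;
}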