1
High Performance RDMA Based All-to-all Broadcast
for InfiniBand Clusters
  • S. Sur, U.K.R. Bondhugula, A. Mamidala, H.-W. Jin
    and D. K. Panda
  • Network Based Computing Laboratory
  • The Ohio State University

2
Presentation Layout
  • Introduction
  • Overview of Existing All-to-all Broadcast
    Algorithms
  • Can RDMA benefit collective operations?
  • Proposed RDMA based designs
  • Experimental Evaluation
  • Conclusions and Future Work

3
Introduction
  • Cluster-based computing is growing in popularity
  • The current TOP500 list has 360/500 (72%)
    cluster-based systems
  • MPI is the de-facto standard for scientific
    applications on cluster-based systems
  • All-to-all broadcast (MPI_Allgather) is used
    widely
  • Matrix multiplication, LU factorization,
    differential equations, etc.
  • InfiniBand is an emerging high performance
    interconnect with powerful features like RDMA
  • Can the All-to-all broadcast be optimized further
    for InfiniBand based clusters?

4
Overview of InfiniBand
  • Proposed as an Industry standard
  • Switched fabric for connecting compute and I/O
    nodes
  • Host Channel Adapters (HCAs) are used to connect
    hosts to fabric
  • InfiniBand capabilities are exposed to applications
    through the Verbs interface
  • Mellanox Verbs API (VAPI), OpenIB Gen2 Verbs
    (IBVerbs), etc.
  • High performance
  • Low latency (1-3 µs)
  • Over 1400 MB/s unidirectional and 2700 MB/s
    bidirectional bandwidth
  • Provides one-sided operations (RDMA, RDMA
    Gather/Scatter, Remote Atomics)
  • Requires all communication buffers to be
    registered
  • Pages are marked unswappable
  • HCA has access tables with physical addresses of
    communication buffer
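
To make the registration requirement concrete, here is a minimal sketch of
registering a communication buffer through the Gen2 Verbs (IBVerbs)
interface; the helper name and buffer handling are illustrative, not
MVAPICH code.

    #include <infiniband/verbs.h>
    #include <stdlib.h>

    /* Every buffer used for RDMA must be registered first: the pages are
     * pinned and the HCA gets an address-translation entry plus keys. */
    struct ibv_mr *register_rdma_buffer(struct ibv_pd *pd, size_t len,
                                        void **buf_out)
    {
        void *buf = malloc(len);              /* communication buffer */
        if (!buf) return NULL;

        /* allow local writes and incoming RDMA writes */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) { free(buf); return NULL; }

        *buf_out = buf;
        return mr;    /* mr->lkey / mr->rkey are used when posting RDMA */
    }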

5
MVAPICH/MVAPICH2 Software Distribution
  • Focusing both on MPI-1 (MVAPICH) and MPI-2
    (MVAPICH2)
  • Open Source (BSD licensing)
  • Have been directly downloaded by more than 285
    organizations worldwide (in 30 countries)
  • Empowering many large-scale clusters in the TOP500
    list (including the 5th-ranked 4000-node
    Thunderbird cluster at Sandia)
  • Multiple Implementations on different low-level
    APIs
  • VAPI
  • OpenIB Gen2 stack
  • uDAPL
  • To achieve portability across different
    interconnects through uDAPL
  • Available for MPI-2 (MVAPICH2 0.9.0)
  • Tested with uDAPL-Ammasso/GigE,
    uDAPL-Solaris/IBA, uDAPL-OpenIBGen2/IBA
  • Available with MVAPICH 0.9.6
  • Available and Optimized for
  • Platforms: IA-32, IA-64, Opteron, EM64T, and
    Apple G5
  • Operating Systems: Linux, Solaris, and Mac OS X
  • Compilers: gcc, intel, pathscale, and pgi
  • InfiniBand Adapters
  • PCI-X and PCI-Express (SDR and DDR with
    mem-full/mem-free cards)
  • More details at

6
Presentation Progress
  • Introduction
  • Overview of Existing All-to-all Broadcast
    Algorithms
  • Can RDMA benefit collective operations?
  • Proposed RDMA based designs
  • Experimental Evaluation
  • Conclusions and Future Work

7
All-to-all Broadcast Algorithms
  • MPI_Allgather is used to distribute data from the
    jth process into the jth receive buffer of each
    process
  • Depending on system/message size, some algorithms
    can outperform others
  • Two prominent algorithms
  • Recursive Doubling Algorithm
  • Ring Algorithm
  • MPICH-1.2.7 uses a combination of these algorithms
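
The MPI_Allgather semantics described above can be illustrated with a short
call (buffer names are illustrative):

    #include <mpi.h>

    /* Each process contributes m ints; after the call, every process holds
     * the contribution of process j at offset j*m of recvbuf. */
    void allgather_example(const int *sendbuf, int *recvbuf, int m)
    {
        MPI_Allgather(sendbuf, m, MPI_INT,
                      recvbuf, m, MPI_INT, MPI_COMM_WORLD);
    }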

8
Recursive Doubling Algorithm
  • Algorithm details
  • Pairs of processes exchange their buffer contents
  • Each iteration carries the data accumulated in
    previous iterations
  • Number of steps: log(p)
  • Size of message doubles at each step
  • Implementation
  • Buffer required at intermediate stages
  • Typically used up to medium message sizes
  • Cluster should have constant bisection bandwidth
    to avoid contention

p = number of nodes
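
A point-to-point sketch of this exchange pattern (assuming p is a power of
two and each process's own block already sits at offset rank*m of recvbuf;
this is an illustration, not the MVAPICH implementation):

    #include <mpi.h>

    void recursive_doubling_allgather(int *recvbuf, int m, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        int curr_cnt = m;                      /* ints currently held */
        for (int mask = 1; mask < p; mask <<= 1) {
            int partner = rank ^ mask;
            /* each side holds a contiguous block starting at its subtree root */
            int my_off   = (rank    & ~(mask - 1)) * m;
            int peer_off = (partner & ~(mask - 1)) * m;

            /* exchange buffer contents; message size doubles every step */
            MPI_Sendrecv(recvbuf + my_off,   curr_cnt, MPI_INT, partner, 0,
                         recvbuf + peer_off, curr_cnt, MPI_INT, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            curr_cnt *= 2;
        }
    }
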
9
Ring Algorithm
  • Algorithm Details
  • In every iteration, each process passes a message
    to its neighbor
  • Number of steps: (p - 1)
  • Size of message at each step is constant (m)
  • Implementation
  • Buffers may not be required at intermediate
    stages
  • May not involve copies
  • Typically used for larger message sizes

p = number of nodes, m = message size
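
A point-to-point sketch of the ring exchange pattern (again assuming each
process's own block starts at offset rank*m of recvbuf; illustrative only):

    #include <mpi.h>

    void ring_allgather(int *recvbuf, int m, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        int right = (rank + 1) % p;
        int left  = (rank - 1 + p) % p;

        /* p-1 steps: forward the block received in the previous step to the
         * right neighbor while receiving a new block from the left one */
        for (int step = 0; step < p - 1; step++) {
            int send_block = (rank - step + p) % p;
            int recv_block = (rank - step - 1 + p) % p;
            MPI_Sendrecv(recvbuf + send_block * m, m, MPI_INT, right, 0,
                         recvbuf + recv_block * m, m, MPI_INT, left,  0,
                         comm, MPI_STATUS_IGNORE);
        }
    }
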
10
Presentation Progress
  • Introduction
  • Overview of Existing All-to-all Broadcast
    Algorithms
  • Can RDMA benefit collective operations?
  • Proposed RDMA based designs
  • Experimental Evaluation
  • Conclusions and Future Work

11
Can RDMA benefit Collective Operations?
  • RDMA is defined for point-to-point operations
  • How can we leverage benefits of RDMA for
    collectives?
  • RDMA based scheme has the following advantages
  • Bypass intermediate software layers
  • Reduce number of copies
  • Reduce protocol handshakes
  • Reduce cost of memory registration

12
RDMA Benefits
[Software stack diagram: MPI collectives are normally layered over
point-to-point, the ADI, and the channel interface (shmem / ch_p4) on top
of InfiniBand (VAPI); the RDMA-based collective bypasses these
intermediate software layers.]
13
RDMA Benefits Continued
[Protocol diagram: with the point-to-point based ADI, small messages are
copied into and out of the all-to-all broadcast buffer, and large messages
need an RTS/CTS/DATA/FIN rendezvous handshake between sender and receiver;
the direct RDMA collective delivers data into the buffer with zero copies.]
14
RDMA Benefits Continued
  • Each registration operation has a high startup
    overhead
  • The RDMA mechanism can treat the entire collective
    buffer as one message instead of several messages
  • The number of registrations can be greatly reduced
    (see the sketch below)

[Figure: registration behavior of the point-to-point based design vs. the
RDMA collective]
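
A hedged sketch of registering the whole collective buffer as one region
(the function name is illustrative; it reuses the Gen2 Verbs call shown
earlier):

    #include <infiniband/verbs.h>
    #include <stddef.h>

    /* Register the entire p*m-byte allgather buffer as one region, once per
     * communicator, instead of registering each of the p-1 blocks separately
     * as a per-message rendezvous protocol might. */
    struct ibv_mr *register_allgather_buffer(struct ibv_pd *pd, void *recvbuf,
                                             size_t m, int p)
    {
        return ibv_reg_mr(pd, recvbuf, m * (size_t)p,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    }
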
15
Presentation Progress
  • Introduction
  • Overview of Existing All-to-all Broadcast
    Algorithms
  • Can RDMA benefit collective operations?
  • Proposed RDMA based designs
  • Experimental Evaluation
  • Conclusions and Future Work

16
Proposed RDMA Based Design
  • Design issue: copy based or zero copy?
  • Copy cost is proportional to message size
  • On-demand registration is costly, but efficient for
    large messages
  • RDMA Based Recursive Doubling
  • Uses a dynamic threshold that switches between the
    copy based and zero copy approaches
  • RDMA Ring Algorithm
  • Used only for large messages, so only zero copy is
    used

17
RDMA Based Recursive Doubling
  • Maintain a pre-registered Collective buffer per
    communicator for small messages
  • Message size increases from m (in the first
    iteration) to mp/2 (in the log(p)-th iteration)
  • Iterations: 1 ≤ k ≤ log(p)
  • Message size at iteration k: 2^(k-1) × m
  • When the message size at an iteration exceeds a
    threshold MT, zero copy is used (see the sketch
    below)
  • MT empirically determined to be 4 KB
  • Total memory requirement: 2 × MT = 8 KB per
    communicator

p = number of nodes, m = message size
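
A minimal sketch of the per-iteration switch, assuming MT = 4 KB as on this
slide (the function name is illustrative):

    #include <stddef.h>

    #define MT 4096   /* copy/zero-copy threshold (4 KB) */

    /* Returns 1 if iteration k (1 <= k <= log2(p)) of the RDMA recursive
     * doubling should take the zero-copy path, i.e. when the message size
     * 2^(k-1) * m exceeds MT; otherwise the data is copied through the
     * pre-registered per-communicator collective buffer (2 x MT = 8 KB). */
    int use_zero_copy(int k, size_t m)
    {
        size_t msg_size = ((size_t)1 << (k - 1)) * m;
        return msg_size > MT;
    }
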
18
RDMA Ring Algorithm
  • Used only for large messages
  • Complete zero copy approach used
  • Single buffer registration and address exchange
  • In each iteration k (1 ≤ k ≤ p - 1), an RDMA
    operation of size m is performed
  • Used for messages larger than 1 MB and p > 32

p = number of nodes, m = message size
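
A sketch of the one-time registration and address exchange behind the
zero-copy ring (the structure, field names, and the use of MPI_Allgather
for the exchange are illustrative assumptions, not the MVAPICH
implementation):

    #include <infiniband/verbs.h>
    #include <mpi.h>
    #include <stdint.h>

    struct rdma_addr { uint64_t addr; uint32_t rkey; };

    /* Register the whole p*m-byte buffer once and distribute (addr, rkey)
     * pairs so every process can later RDMA-write m-byte blocks directly
     * into its neighbor's buffer at each ring step. */
    struct ibv_mr *ring_setup(struct ibv_pd *pd, void *recvbuf, size_t m,
                              int p, MPI_Comm comm, struct rdma_addr *all)
    {
        struct ibv_mr *mr = ibv_reg_mr(pd, recvbuf, m * (size_t)p,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) return NULL;

        struct rdma_addr mine = { (uint64_t)(uintptr_t)recvbuf, mr->rkey };
        MPI_Allgather(&mine, sizeof(mine), MPI_BYTE,
                      all,   sizeof(mine), MPI_BYTE, comm);
        return mr;
    }
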
19
Presentation Progress
  • Introduction
  • Overview of Existing All-to-all Broadcast
    Algorithms
  • Can RDMA benefit collective operations?
  • Proposed RDMA based designs
  • Experimental Evaluation
  • Conclusions and Future Work

20
Experimental Setup
  • Cluster A
  • 32 Dual Intel Xeon 2.66GHz nodes with 512KB cache
  • 2 GB main memory
  • MT23108 HCA using PCI-X 133 MHz I/O bus
  • Mellanox 144-port switch (MTS 14400)
  • Cluster B
  • 16 Dual Intel Xeon 3.6GHz nodes with 1MB cache
  • 4 GB main memory
  • MHES18-XT HCA using PCI-Express (x8) I/O bus
  • Cluster C
  • 8 Dual Intel Xeon 3.0GHz nodes with 512KB cache
  • 2 GB main memory
  • MT23108 HCA using PCI-X 133 MHz I/O bus

21
Experimental Software
  • The RDMA based performance numbers are labeled
    MVAPICH-RDMA
  • Point-to-point based numbers are labeled
    MVAPICH-P2P
  • The design is integrated and available in
    MVAPICH-0.9.6

22
Latency of MPI_Allgather
  • All processes are synchronized and MPI_Allgather
    is timed 1000 times (see the timing sketch below)
  • Average values (across all the processes) are
    reported
  • Small messages: latency is reduced by 17% for
    4-byte message size
  • Medium messages: latency is reduced by 30% for
    32 KB message size
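
The timing methodology can be sketched as follows (names are illustrative):

    #include <mpi.h>

    /* Time MPI_Allgather over `iters` synchronized calls and return the
     * per-call latency averaged across all processes, in microseconds. */
    double time_allgather(void *sbuf, void *rbuf, int count, MPI_Datatype dt,
                          MPI_Comm comm, int iters)
    {
        MPI_Barrier(comm);                    /* synchronize all processes */
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Allgather(sbuf, count, dt, rbuf, count, dt, comm);
        double local = (MPI_Wtime() - t0) / iters * 1e6;

        double sum;
        int p;
        MPI_Comm_size(comm, &p);
        MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);
        return sum / p;                       /* average across processes */
    }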

23
Scalability of MPI_Allgather
  • All processes are synchronized and MPI_Allgather
    is timed 1000 times
  • Average values (across all the processes) are
    reported
  • Message size of 32 KB is used
  • For 32 processes, scalability is improved by 30%

24
MPI_Allgather with no Buffer Reuse
  • All processes are synchronized and MPI_Allgather
    is timed 1000 times
  • In each iteration a new buffer is used to
    eliminate cache effects
  • 32 processes are used
  • For 32 KB message size, the improvement is a
    factor of 4.75

25
Matrix Multiplication Application Kernel
  • Distributed memory matrix multiplication
    algorithm
  • Optimized BLAS library from Intel MKL was used
  • 8 processes are used on Cluster C
  • For 256x256 matrix size, the improvement is 37%
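
The role of MPI_Allgather in such a kernel can be sketched as follows
(row-distributed A and B, and a plain triple loop instead of the MKL BLAS
call from the slide; names are illustrative):

    #include <mpi.h>
    #include <stdlib.h>

    /* C = A * B with A, B, C distributed by rows over p processes.
     * Each process gathers the full B via MPI_Allgather, then multiplies
     * its local rows of A. n must be divisible by p. */
    void dist_matmul(const double *A_loc, double *B_loc, double *C_loc,
                     int n, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);
        int rows = n / p;

        double *B = malloc((size_t)n * n * sizeof(double));
        MPI_Allgather(B_loc, rows * n, MPI_DOUBLE,
                      B,     rows * n, MPI_DOUBLE, comm);

        for (int i = 0; i < rows; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += A_loc[i * n + k] * B[k * n + j];
                C_loc[i * n + j] = s;
            }
        free(B);
    }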

26
Presentation Progress
  • Introduction
  • Overview of Existing All-to-all Broadcast
    Algorithms
  • Can RDMA benefit collective operations?
  • Proposed RDMA based designs
  • Experimental Evaluation
  • Conclusions and Future Work

27
Conclusions
  • New RDMA based design reduces
  • Software overheads
  • Message copy costs
  • Protocol handshake overhead
  • Unnecessary registration of buffers
  • Latency of MPI_Allgather is reduced by 30% for 32
    processes and 32 KB message size
  • Latency is improved by a factor of 4.75 under no
    buffer reuse conditions for 32 processes and 32 KB
    message size
  • The Matrix Multiplication application kernel
    performs 37% better for 256x256 matrix size

28
Future Work
  • Investigate the impact on other applications on
    large-scale clusters
  • RDMA based all-to-all broadcast is a good
    building block for other collectives, e.g.,
    all-to-all personalized exchange
  • Investigate other algorithms for clusters without
    constant bisection bandwidth

29
Acknowledgements
  • Our research is supported by the following
    organizations
  • Current Funding support by
  • Current Equipment support by

30
Web Pointers
http://www.cse.ohio-state.edu/panda/
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page:
http://nowlab.cse.ohio-state.edu/projects/mpi-iba/
E-mail: panda@cse.ohio-state.edu