1
High Performance RDMA Based All-to-all Broadcast
for InfiniBand Clusters
  • S. Sur, U.K.R. Bondhugula, A. Mamidala, H.-W. Jin
    and D. K. Panda
  • Network Based Computing Laboratory
  • The Ohio State University

2
Presentation Layout
  • Introduction
  • Overview of Existing All-to-all Broadcast
    Algorithms
  • Can RDMA benefit collective operations?
  • Proposed RDMA based designs
  • Experimental Evaluation
  • Conclusions and Future Work

3
Introduction
  • Cluster-based computing is growing in popularity
  • The current TOP500 list has 360/500 (72%)
    cluster-based systems
  • MPI is the de-facto standard for scientific
    applications on cluster-based systems
  • All-to-all broadcast (MPI_Allgather) is used
    widely
  • Matrix multiplication, LU factorization,
    differential equations, etc.
  • InfiniBand is an emerging high performance
    interconnect with powerful features like RDMA
  • Can the All-to-all broadcast be optimized further
    for InfiniBand based clusters?

4
Overview of InfiniBand
  • Proposed as an Industry standard
  • Switched fabric for connecting compute and I/O
    nodes
  • Host Channel Adapters (HCAs) are used to connect
    hosts to fabric
  • InfiniBand capabilities are exposed to applications
    through the Verbs interface
  • Mellanox Verbs API (VAPI), OpenIB Gen2 Verbs
    (IBVerbs), etc.
  • High performance
  • Low latency (1-3 µs)
  • Over 1400 MB/s unidirectional and 2700 MB/s
    bidirectional bandwidth
  • Provides one-sided operations (RDMA, RDMA
    Gather/Scatter, Remote Atomics)
  • Requires all communication buffers to be
    registered
  • Pages are marked unswappable
  • HCA has access tables with physical addresses of
    communication buffer
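
To make the registration requirement concrete, here is a minimal sketch of
registering a communication buffer through the Gen2 Verbs (IBVerbs)
interface; the helper name and buffer handling are illustrative, not
MVAPICH code.

    #include <infiniband/verbs.h>
    #include <stdlib.h>

    /* Every buffer used for RDMA must be registered first: the pages are
     * pinned and the HCA gets an address-translation entry plus keys. */
    struct ibv_mr *register_rdma_buffer(struct ibv_pd *pd, size_t len,
                                        void **buf_out)
    {
        void *buf = malloc(len);              /* communication buffer */
        if (!buf) return NULL;

        /* allow local writes and incoming RDMA writes */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) { free(buf); return NULL; }

        *buf_out = buf;
        return mr;    /* mr->lkey / mr->rkey are used when posting RDMA */
    }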

5
MVAPICH/MVAPICH2 Software Distribution
  • Focusing both on MPI-1 (MVAPICH) and MPI-2
    (MVAPICH2)
  • Open Source (BSD licensing)
  • Have been directly downloaded by more than 285
    organizations worldwide (in 30 countries)
  • Empowering many large-scale clusters in the TOP500
    list (including the 5th-ranked 4000-node
    Thunderbird cluster at Sandia)
  • Multiple Implementations on different low-level
    APIs
  • VAPI
  • OpenIB Gen2 stack
  • uDAPL
  • To achieve portability across different
    interconnects through uDAPL
  • Available for MPI-2 (MVAPICH2 0.9.0)
  • Tested with uDAPL-Ammasso/GigE,
    uDAPL-Solaris/IBA, uDAPL-OpenIBGen2/IBA
  • Available with MVAPICH 0.9.6
  • Available and Optimized for
  • Platforms: IA-32, IA-64, Opteron, EM64T, and
    Apple G5
  • Operating Systems: Linux, Solaris, and Mac OS X
  • Compilers: gcc, intel, pathscale, and pgi
  • InfiniBand Adapters
  • PCI-X and PCI-Express (SDR and DDR with
    mem-full/mem-free cards)
  • More details at

6
Presentation Progress
  • Introduction
  • Overview of Existing All-to-all Broadcast
    Algorithms
  • Can RDMA benefit collective operations?
  • Proposed RDMA based designs
  • Experimental Evaluation
  • Conclusions and Future Work

7
All-to-all Broadcast Algorithms
  • MPI_Allgather is used to distribute data from the
    jth process into the jth receive buffer of each
    process
  • Depending on system/message size, some algorithms
    can outperform others
  • Two prominent algorithms
  • Recursive Doubling Algorithm
  • Ring Algorithm
  • MPICH-1.2.7 uses a combination of these algorithms
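
The MPI_Allgather semantics described above can be illustrated with a short
call (buffer names are illustrative):

    #include <mpi.h>

    /* Each process contributes m ints; after the call, every process holds
     * the contribution of process j at offset j*m of recvbuf. */
    void allgather_example(const int *sendbuf, int *recvbuf, int m)
    {
        MPI_Allgather(sendbuf, m, MPI_INT,
                      recvbuf, m, MPI_INT, MPI_COMM_WORLD);
    }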

8
Recursive Doubling Algorithm
  • Algorithm details
  • Pairs of processes exchange their buffer contents
  • Each iteration carries the data accumulated in
    previous iterations
  • Number of steps: log(p)
  • Size of message doubles at each step
  • Implementation
  • Buffer required at intermediate stages
  • Typically used up to medium message sizes
  • Cluster should have constant bisection bandwidth
    to avoid contention

p = number of nodes
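
A point-to-point sketch of this exchange pattern (assuming p is a power of
two and each process's own block already sits at offset rank*m of recvbuf;
this is an illustration, not the MVAPICH implementation):

    #include <mpi.h>

    void recursive_doubling_allgather(int *recvbuf, int m, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        int curr_cnt = m;                      /* ints currently held */
        for (int mask = 1; mask < p; mask <<= 1) {
            int partner = rank ^ mask;
            /* each side holds a contiguous block starting at its subtree root */
            int my_off   = (rank    & ~(mask - 1)) * m;
            int peer_off = (partner & ~(mask - 1)) * m;

            /* exchange buffer contents; message size doubles every step */
            MPI_Sendrecv(recvbuf + my_off,   curr_cnt, MPI_INT, partner, 0,
                         recvbuf + peer_off, curr_cnt, MPI_INT, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            curr_cnt *= 2;
        }
    }
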
9
Ring Algorithm
  • Algorithm Details
  • In every iteration, each process passes a message
    to its neighbor
  • Number of steps: (p - 1)
  • Size of message at each step is constant (m)
  • Implementation
  • Buffers may not be required at intermediate
    stages
  • May not involve copies
  • Typically used for larger message sizes

p = number of nodes, m = message size
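
A point-to-point sketch of the ring exchange pattern (again assuming each
process's own block starts at offset rank*m of recvbuf; illustrative only):

    #include <mpi.h>

    void ring_allgather(int *recvbuf, int m, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        int right = (rank + 1) % p;
        int left  = (rank - 1 + p) % p;

        /* p-1 steps: forward the block received in the previous step to the
         * right neighbor while receiving a new block from the left one */
        for (int step = 0; step < p - 1; step++) {
            int send_block = (rank - step + p) % p;
            int recv_block = (rank - step - 1 + p) % p;
            MPI_Sendrecv(recvbuf + send_block * m, m, MPI_INT, right, 0,
                         recvbuf + recv_block * m, m, MPI_INT, left,  0,
                         comm, MPI_STATUS_IGNORE);
        }
    }
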
10
Presentation Progress
  • Introduction
  • Overview of Existing All-to-all Broadcast
    Algorithms
  • Can RDMA benefit collective operations?
  • Proposed RDMA based designs
  • Experimental Evaluation
  • Conclusions and Future Work

11
Can RDMA benefit Collective Operations?
  • RDMA is defined for point-to-point operations
  • How can we leverage benefits of RDMA for
    collectives?
  • RDMA based scheme has the following advantages
  • Bypass intermediate software layers
  • Reduce number of copies
  • Reduce protocol handshakes
  • Reduce cost of memory registration

12
RDMA Benefits
[Software stack diagram: MPI collectives are normally layered over
point-to-point, the ADI, and the channel interface (shmem / ch_p4) on top
of InfiniBand (VAPI); the RDMA-based collective bypasses these
intermediate software layers.]
13
RDMA Benefits Continued
[Protocol diagram: with the point-to-point based ADI, small messages are
copied into and out of the all-to-all broadcast buffer, and large messages
need an RTS/CTS/DATA/FIN rendezvous handshake between sender and receiver;
the direct RDMA collective delivers data into the buffer with zero copies.]
14
RDMA Benefits Continued
  • Each registration operation has a high startup
    overhead
  • The RDMA mechanism can treat the entire collective
    buffer as one message instead of several messages
  • The number of registrations can be greatly reduced
    (see the sketch below)

[Figure: registration behavior of the point-to-point based design vs. the
RDMA collective]
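
A hedged sketch of registering the whole collective buffer as one region
(the function name is illustrative; it reuses the Gen2 Verbs call shown
earlier):

    #include <infiniband/verbs.h>
    #include <stddef.h>

    /* Register the entire p*m-byte allgather buffer as one region, once per
     * communicator, instead of registering each of the p-1 blocks separately
     * as a per-message rendezvous protocol might. */
    struct ibv_mr *register_allgather_buffer(struct ibv_pd *pd, void *recvbuf,
                                             size_t m, int p)
    {
        return ibv_reg_mr(pd, recvbuf, m * (size_t)p,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    }
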
15
Presentation Progress
  • Introduction
  • Overview of Existing All-to-all Broadcast
    Algorithms
  • Can RDMA benefit collective operations?
  • Proposed RDMA based designs
  • Experimental Evaluation
  • Conclusions and Future Work

16
Proposed RDMA Based Design
  • Design issue: copy based or zero copy?
  • Copy cost is proportional to message size
  • On-demand registration is costly, but efficient for
    large messages
  • RDMA Based Recursive Doubling
  • Uses a dynamic threshold that switches between the
    copy based and zero copy approaches
  • RDMA Ring Algorithm
  • Used only for large messages, so only zero copy is
    used

17
RDMA Based Recursive Doubling
  • Maintain a pre-registered Collective buffer per
    communicator for small messages
  • Message size increases from m (in the first
    iteration) to mp/2 (in the log(p)-th iteration)
  • Iterations: 1 ≤ k ≤ log(p)
  • Message size at iteration k: 2^(k-1) × m
  • When the message size at an iteration exceeds a
    threshold MT, zero copy is used (see the sketch
    below)
  • MT empirically determined to be 4 KB
  • Total memory requirement: 2 × MT = 8 KB per
    communicator

p = number of nodes, m = message size
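
A minimal sketch of the per-iteration switch, assuming MT = 4 KB as on this
slide (the function name is illustrative):

    #include <stddef.h>

    #define MT 4096   /* copy/zero-copy threshold (4 KB) */

    /* Returns 1 if iteration k (1 <= k <= log2(p)) of the RDMA recursive
     * doubling should take the zero-copy path, i.e. when the message size
     * 2^(k-1) * m exceeds MT; otherwise the data is copied through the
     * pre-registered per-communicator collective buffer (2 x MT = 8 KB). */
    int use_zero_copy(int k, size_t m)
    {
        size_t msg_size = ((size_t)1 << (k - 1)) * m;
        return msg_size > MT;
    }
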
18
RDMA Ring Algorithm
  • Used only for large messages
  • Complete zero copy approach used
  • Single buffer registration and address exchange
  • In each iteration k (1 ≤ k ≤ p - 1), an RDMA
    operation of size m is performed
  • Used for messages larger than 1 MB and p > 32

p = number of nodes, m = message size
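
A sketch of the one-time registration and address exchange behind the
zero-copy ring (the structure, field names, and the use of MPI_Allgather
for the exchange are illustrative assumptions, not the MVAPICH
implementation):

    #include <infiniband/verbs.h>
    #include <mpi.h>
    #include <stdint.h>

    struct rdma_addr { uint64_t addr; uint32_t rkey; };

    /* Register the whole p*m-byte buffer once and distribute (addr, rkey)
     * pairs so every process can later RDMA-write m-byte blocks directly
     * into its neighbor's buffer at each ring step. */
    struct ibv_mr *ring_setup(struct ibv_pd *pd, void *recvbuf, size_t m,
                              int p, MPI_Comm comm, struct rdma_addr *all)
    {
        struct ibv_mr *mr = ibv_reg_mr(pd, recvbuf, m * (size_t)p,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) return NULL;

        struct rdma_addr mine = { (uint64_t)(uintptr_t)recvbuf, mr->rkey };
        MPI_Allgather(&mine, sizeof(mine), MPI_BYTE,
                      all,   sizeof(mine), MPI_BYTE, comm);
        return mr;
    }
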
19
Presentation Progress
  • Introduction
  • Overview of Existing All-to-all Broadcast
    Algorithms
  • Can RDMA benefit collective operations?
  • Proposed RDMA based designs
  • Experimental Evaluation
  • Conclusions and Future Work

20
Experimental Setup
  • Cluster A
  • 32 Dual Intel Xeon 2.66GHz nodes with 512KB cache
  • 2 GB main memory
  • MT23108 HCA using PCI-X 133 MHz I/O bus
  • Mellanox 144-port switch (MTS 14400)
  • Cluster B
  • 16 Dual Intel Xeon 3.6GHz nodes with 1MB cache
  • 4 GB main memory
  • MHES18-XT HCA using PCI-Express (x8) I/O bus
  • Cluster C
  • 8 Dual Intel Xeon 3.0GHz nodes with 512KB cache
  • 2 GB main memory
  • MT23108 HCA using PCI-X 133 MHz I/O bus

21
Experimental Software
  • The RDMA based performance numbers are labeled
    MVAPICH-RDMA
  • Point-to-point based numbers are labeled
    MVAPICH-P2P
  • The design is integrated and available in
    MVAPICH-0.9.6

22
Latency of MPI_Allgather
  • All processes are synchronized and MPI_Allgather
    is timed 1000 times (see the timing sketch below)
  • Average values (across all the processes) are
    reported
  • Small messages: latency is reduced by 17% for
    4-byte message size
  • Medium messages: latency is reduced by 30% for
    32 KB message size
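
The timing methodology can be sketched as follows (names are illustrative):

    #include <mpi.h>

    /* Time MPI_Allgather over `iters` synchronized calls and return the
     * per-call latency averaged across all processes, in microseconds. */
    double time_allgather(void *sbuf, void *rbuf, int count, MPI_Datatype dt,
                          MPI_Comm comm, int iters)
    {
        MPI_Barrier(comm);                    /* synchronize all processes */
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Allgather(sbuf, count, dt, rbuf, count, dt, comm);
        double local = (MPI_Wtime() - t0) / iters * 1e6;

        double sum;
        int p;
        MPI_Comm_size(comm, &p);
        MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);
        return sum / p;                       /* average across processes */
    }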

23
Scalability of MPI_Allgather
  • All processes are synchronized and MPI_Allgather
    is timed 1000 times
  • Average values (across all the processes) are
    reported
  • Message size of 32 KB is used
  • For 32 processes, scalability is improved by 30%

24
MPI_Allgather with no Buffer Reuse
  • All processes are synchronized and MPI_Allgather
    is timed 1000 times
  • In each iteration a new buffer is used to
    eliminate cache effects
  • 32 processes are used
  • For 32 KB message size, the improvement is a
    factor of 4.75

25
Matrix Multiplication Application Kernel
  • Distributed memory matrix multiplication
    algorithm
  • Optimized BLAS library from Intel MKL was used
  • 8 processes are used on Cluster C
  • For 256x256 matrix size, the improvement is 37%
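
The role of MPI_Allgather in such a kernel can be sketched as follows
(row-distributed A and B, and a plain triple loop instead of the MKL BLAS
call from the slide; names are illustrative):

    #include <mpi.h>
    #include <stdlib.h>

    /* C = A * B with A, B, C distributed by rows over p processes.
     * Each process gathers the full B via MPI_Allgather, then multiplies
     * its local rows of A. n must be divisible by p. */
    void dist_matmul(const double *A_loc, double *B_loc, double *C_loc,
                     int n, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);
        int rows = n / p;

        double *B = malloc((size_t)n * n * sizeof(double));
        MPI_Allgather(B_loc, rows * n, MPI_DOUBLE,
                      B,     rows * n, MPI_DOUBLE, comm);

        for (int i = 0; i < rows; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += A_loc[i * n + k] * B[k * n + j];
                C_loc[i * n + j] = s;
            }
        free(B);
    }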

26
Presentation Progress
  • Introduction
  • Overview of Existing All-to-all Broadcast
    Algorithms
  • Can RDMA benefit collective operations?
  • Proposed RDMA based designs
  • Experimental Evaluation
  • Conclusions and Future Work

27
Conclusions
  • New RDMA based design reduces
  • Software overheads
  • Message copy costs
  • Protocol handshake overhead
  • Unnecessary registration of buffers
  • Latency of MPI_Allgather is reduced by 30% for 32
    processes and 32 KB message size
  • Latency is improved by a factor of 4.75 under no
    buffer reuse conditions for 32 processes and 32 KB
    message size
  • The Matrix Multiplication application kernel
    performs 37% better for 256x256 matrix size

28
Future Work
  • Investigate the impact on other applications on
    large-scale clusters
  • RDMA based all-to-all broadcast is a good
    building block for other collectives, e.g.,
    all-to-all personalized exchange
  • Investigate other algorithms for clusters without
    constant bisection bandwidth

29
Acknowledgements
  • Our research is supported by the following
    organizations
  • Current Funding support by
  • Current Equipment support by

30
Web Pointers
http://www.cse.ohio-state.edu/panda/
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page:
http://nowlab.cse.ohio-state.edu/projects/mpi-iba/
E-mail: panda@cse.ohio-state.edu