Title: Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters

1. Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters
- Lei Chai, Albert Hartono, Dhabaleswar K. Panda
- Department of Computer Science and Engineering
- The Ohio State University
2. Outline
- Introduction and Motivation
- Background
- Design Description
- Performance Evaluation
- Conclusions and Future Work
3. SMP-Based Cluster
[Diagram: an SMP-based cluster of dual-core NUMA nodes. Inter-node communication crosses the network; SMP intra-node communication goes through the node's memory; CMP intra-node communication takes place between cores on the same dual-core chip.]
4. Motivation
- Advances in processor and memory architecture
  - NUMA systems
  - Multi-core systems
  - Good scalability
- Large SMP systems available
  - E.g., Sun's Niagara 2 has 8 cores on the same chip and can run 64 threads simultaneously
- MPI intra-node communication becomes more critical!
- Goals
  - To improve MPI intra-node communication performance
  - To reduce memory usage
5. Outline
- Introduction and Motivation
- Background
- Design Description
- Performance Evaluation
- Conclusions and Future Work
6. MPI Intra-node Communication
- Existing approaches
  - NIC-based loopback
  - Kernel-assisted memory mapping
  - User-space memory copy
- Advantages of user-space memory copy
  - Good performance
  - Portability
- User-space memory copy is employed by many MPI implementations
  - MVAPICH
  - MPICH-MX
  - Nemesis
7. MVAPICH
- MVAPICH: a high-performance MPI implementation for InfiniBand clusters, developed at OSU
- Based on MPICH
- MVAPICH and MVAPICH2 are currently being used by more than 405 organizations worldwide
- Latest releases: MVAPICH 0.9.8 and MVAPICH2 0.9.5
- http://nowlab.cse.ohio-state.edu/projects/mpi-iba/index.html
8. Intra-node Communication Design in MVAPICH
[Diagram: processes 0-3 connected pairwise; each sender copies from its user buffer into the receive buffer it shares with the receiver (RB01 through RB32). A minimal sketch of this scheme follows below.]
- RBxy: Receive Buffers
  - Buffers shared between processes
  - x: sender, y: receiver
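As a rough illustration of this user-space copy scheme, here is a minimal C sketch (the names, sizes, and omitted synchronization are assumptions, not MVAPICH's actual code): every ordered pair (x, y) owns a dedicated shared-memory receive buffer, so a node with p processes allocates p(p-1) of them.

#include <string.h>

#define NPROC    4
#define RB_BYTES (1 << 20)   /* per-connection receive buffer (assumed size) */

/* rb[x][y] lives in a shared-memory segment mapped by all processes;
 * x is the sender and y the receiver, matching the RBxy naming above. */
extern char rb[NPROC][NPROC][RB_BYTES];

/* Sender x copies its user buffer into the buffer it shares with y.
 * Real code would also advance a head pointer and set a flag so the
 * receiver can detect the new message. */
void send_to(int x, int y, const void *user_buf, size_t len)
{
    memcpy(rb[x][y], user_buf, len);
}

/* Receiver y polls its incoming buffers rb[*][y] and copies matched
 * data out, walking through the receive buffer as noted below. */
void recv_from(int x, int y, void *user_buf, size_t len)
{
    memcpy(user_buf, rb[x][y], len);
}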
9. Analysis of the Current Design
- Advantages
  - Lock-free
  - In-order message delivery
- Flaws
  - Large memory usage
    - Not scalable: with p processes and a receive buffer of size B per connection, a node allocates p(p-1) buffers, so shared memory grows quadratically with p
  - Inefficient cache utilization
    - Need to walk through the receive buffer
  - Performance is not optimized
10. Outline
- Introduction and Motivation
- Background
- Design Description
- Performance Evaluation
- Conclusions and Future Work
11. Data Structures
[Diagram: each of the processes 0-3 owns a shared buffer pool (SBP0-SBP3), a small receive buffer RBxy for each incoming connection, and a send queue SQxy for each outgoing connection; all send queues start out empty (NULL). A C sketch of these structures follows below.]
- SBPx: Shared Buffer Pool (one per process)
- SQxy: Send Queue (x: sender, y: receiver)
- RBxy: Receive Buffer (x: sender, y: receiver)
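A minimal C sketch of these structures, with assumed names and sizes (the real MVAPICH code differs): each process owns one shared buffer pool, and each connection owns a small receive buffer plus a send queue that is NULL when empty.

#include <stddef.h>

#define RB_SIZE    (8 * 1024)   /* small dedicated receive buffer (assumed) */
#define CELL_SIZE  (32 * 1024)  /* one buffer ("cell") in the pool (assumed) */
#define POOL_CELLS 128          /* cells per pool (assumed) */

/* One buffer from a sender's shared pool; buffers are chained into
 * per-destination send queues when they carry a large message. */
typedef struct cell {
    struct cell *next;   /* link in the free list or in a send queue */
    size_t       len;    /* bytes of payload currently in this cell */
    char         data[CELL_SIZE];
} cell_t;

/* SBPx: one pool per process, shared among all of its connections. */
typedef struct {
    cell_t *free_list;   /* cells available for new large messages */
    cell_t  cells[POOL_CELLS];
} shared_buf_pool_t;

/* One direction of one connection: RBxy for small messages and
 * control information, SQxy as head/tail of in-flight pool cells. */
typedef struct {
    char    recv_buf[RB_SIZE];
    cell_t *sq_head;     /* NULL when the send queue is empty */
    cell_t *sq_tail;
} connection_t;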
12. Small Message Transfer
[Diagram: a small message travels directly through the dedicated receive buffer: the sender copies it from its user buffer into RBxy, and the receiver copies it out into its own user buffer. The shared buffer pools and send queues are not involved; a sketch of this path follows below.]
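Building on the structures sketched above, the small-message path reduces to two copies through the dedicated receive buffer. This hypothetical sketch omits the head/tail bookkeeping and the flag that makes a message visible to the receiver.

#include <string.h>

/* Sender: one copy from the user buffer into the shared receive buffer. */
void send_small(connection_t *c, const void *user_buf, size_t len)
{
    memcpy(c->recv_buf, &len, sizeof len);            /* message length */
    memcpy(c->recv_buf + sizeof len, user_buf, len);  /* payload */
}

/* Receiver: one copy out of the receive buffer into the user buffer.
 * Because RBxy is small, it is likely to still be resident in cache. */
void recv_small(connection_t *c, void *user_buf)
{
    size_t len;
    memcpy(&len, c->recv_buf, sizeof len);
    memcpy(user_buf, c->recv_buf + sizeof len, len);
}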
13. Large Message Transfer
[Diagram: for a large message, the sender takes buffers from its shared buffer pool SBPx, copies the data into them, and links them onto the send queue SQxy; a control message through RBxy lets the receiver dequeue the buffers, copy the data into its user buffer, and return the buffers to the pool. A sketch of this path follows below.]
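A corresponding hypothetical sketch of the large-message path: the sender fills cells taken from its shared buffer pool and links them onto the send queue; the receiver drains the queue and hands the cells back. The control message through RBxy that announces the queued cells, and the synchronization around the free list, are omitted.

#include <string.h>

/* Sender: chunk the message into pool cells and append them to SQxy.
 * For simplicity this assumes the pool is never exhausted. */
void send_large(shared_buf_pool_t *pool, connection_t *c,
                const char *user_buf, size_t len)
{
    while (len > 0) {
        cell_t *cl = pool->free_list;          /* take a cell from the pool */
        pool->free_list = cl->next;
        cl->len = len < CELL_SIZE ? len : CELL_SIZE;
        memcpy(cl->data, user_buf, cl->len);
        user_buf += cl->len;
        len      -= cl->len;
        cl->next = NULL;                       /* append to the send queue */
        if (c->sq_tail) c->sq_tail->next = cl;
        else            c->sq_head = cl;
        c->sq_tail = cl;
    }
}

/* Receiver: drain SQxy into the user buffer and recycle the cells,
 * which is what lets one pool serve all of a sender's connections. */
void recv_large(shared_buf_pool_t *pool, connection_t *c, char *user_buf)
{
    cell_t *cl;
    while ((cl = c->sq_head) != NULL) {
        memcpy(user_buf, cl->data, cl->len);
        user_buf += cl->len;
        c->sq_head = cl->next;
        cl->next = pool->free_list;            /* return cell to the pool */
        pool->free_list = cl;
    }
    c->sq_tail = NULL;
}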
14. Analysis of the New Design
- Lock-free
- In-order message delivery
  - Control messages go through the receive buffers
- Efficient cache utilization
  - Small messages: small receive buffers, likely to stay in cache
  - Large messages: improved chances of buffer reuse
- Efficient memory usage
  - Receive buffers become smaller
  - Large message buffers are shared among all the connections
15. Outline
- Introduction and Motivation
- Background
- Design Description
- Performance Evaluation
- Conclusions and Future Work
16. Experimental System Setup
- NUMA cluster
  - Two nodes connected by InfiniBand
  - Each node has four AMD Opteron processors, 2.0 GHz
  - 1 MB L2 cache per processor
  - Linux 2.6.16
- Multi-core cluster
  - Two nodes connected by InfiniBand
  - Each node has two dual-core AMD Opteron chips (four cores in total), 2.0 GHz
  - Two cores per chip, two chips in total
  - Each core has a 1 MB L2 cache
  - Linux 2.6.16
17. Latency on NUMA Cluster
- Latency for small and medium messages is improved by up to 15%
- Latency for large messages is improved by up to 35%
18. Bandwidth on NUMA Cluster
- Bandwidth is improved by up to 50%
19. L2 Cache Miss Rate
- Tool: Valgrind
- The improvement in latency and bandwidth comes from better L2 cache utilization
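For reference, a cache profile like this can be collected by running each MPI process under Valgrind's cachegrind tool, e.g. mpirun -np 2 valgrind --tool=cachegrind ./osu_latency (the benchmark name osu_latency is an assumption here; any MPI program can be profiled this way).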
20. Collectives on NUMA Cluster
- MPI_Barrier latency is improved by up to 19%
- MPI_Alltoall latency is improved by 10%
21. Latency on Multi-core Cluster
- CMP latency is lower than SMP latency for small messages, but higher for large messages
  - Cache transactions vs. memory contention
- The new design improves SMP latency for all message sizes
- The new design improves CMP latency for small messages
22. Bandwidth on Multi-core Cluster
- The new design improves SMP bandwidth significantly
- The new design also improves CMP bandwidth for small and medium messages
23. Collectives on Multi-core Cluster
- The new design improves collective performance on the multi-core cluster
24. Outline
- Introduction and Motivation
- Background
- Design Description
- Performance Evaluation
- Conclusions and Future Work
25. Conclusions
- Designed and implemented high-performance and scalable MPI intra-node communication support
  - Lock-free
  - Efficient cache utilization
  - Efficient memory usage
- Evaluated on NUMA and multi-core systems
  - Both point-to-point and collective performance has been improved significantly
26. Future Work
- Application-level study
- Evaluation on larger systems
- Further optimizations on multi-core systems
27. Acknowledgements
- Our research is supported by the following organizations
  - Current funding support by [sponsor logos on slide]
  - Current equipment support by [sponsor logos on slide]
28. Thank You
- {chail, hartonoa, panda}@cse.ohio-state.edu
- Network-Based Computing Laboratory
- http://nowlab.cse.ohio-state.edu/