Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters

1
Designing High Performance and Scalable MPI
Intra-node Communication Support for Clusters
  • Lei Chai, Albert Hartono, Dhabaleswar K. Panda
  • Computer Science and Engineering Department
  • The Ohio State University

2
Outline
  • Introduction and Motivation
  • Background
  • Design Description
  • Performance Evaluation
  • Conclusions and Future Work

3
SMP Based Cluster
(Figure: an SMP-based cluster. Nodes communicate over the
network (inter-node communication); within a dual-core NUMA
node, chips communicate through shared memory (SMP intra-node
communication), and the two cores on a dual-core chip
communicate on-chip (CMP intra-node communication).)
4
Motivation
  • Advances in processor and memory architecture
  • NUMA systems
  • Multi-core systems
  • Good scalability
  • Large SMP systems available
  • E.g., Sun's Niagara 2 system has 8 cores on the
    same chip and can run 64 threads simultaneously
  • MPI intra-node communication becomes more critical!
  • Goals
  • To improve MPI intra-node communication
    performance
  • To reduce memory usage

5
Outline
  • Introduction and Motivation
  • Background
  • Design Description
  • Performance Evaluation
  • Conclusions and Future Work

6
MPI Intra-node Communication
  • Existing approaches
  • NIC-based loopback
  • Kernel-assisted memory mapping
  • User space memory copy
  • Advantages of user space memory copy
  • Good performance
  • Portability
  • User space memory copy is deployed by many MPI
    implementations (a minimal sketch follows this list)
  • MVAPICH
  • MPICH-MX
  • Nemesis
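To make the user space memory copy approach concrete, here is a minimal
sketch of two co-located processes exchanging data through a POSIX shared
memory segment. The segment name, region size, and function names are
illustrative assumptions, not code from MVAPICH, MPICH-MX, or Nemesis.

    /* Illustrative sketch only: two co-located MPI processes exchange data
     * through a POSIX shared memory segment ("user space memory copy").
     * The segment name and size are hypothetical; synchronization, error
     * checking, and flow control are omitted. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REGION_SIZE 4096
    #define REGION_NAME "/intranode_demo"    /* hypothetical segment name */

    static void *map_region(int create)
    {
        int fd = shm_open(REGION_NAME,
                          create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
        if (create)
            ftruncate(fd, REGION_SIZE);      /* size the segment once */
        void *buf = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        close(fd);
        return buf;
    }

    /* Sender side: first copy, user buffer -> shared region. */
    static void shm_send(void *region, const void *user_buf, size_t len)
    {
        memcpy(region, user_buf, len);
    }

    /* Receiver side: second copy, shared region -> user buffer. */
    static void shm_recv(void *region, void *user_buf, size_t len)
    {
        memcpy(user_buf, region, len);
    }

Both copies happen entirely in user space, which is what gives this
approach its portability: it needs no special NIC or kernel support.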

7
MVAPICH
  • MVAPICH: High-performance MPI on InfiniBand
    clusters, developed by OSU
  • Based on MPICH
  • MVAPICH and MVAPICH2 are currently being used by
    more than 405 organizations worldwide
  • Latest releases: MVAPICH 0.9.8 and MVAPICH2 0.9.5
  • http://nowlab.cse.ohio-state.edu/projects/mpi-iba/index.html

8
Intra-node Communication Design in MVAPICH
(Figure: four processes, 0 through 3. Every ordered pair (x, y)
has a dedicated receive buffer RBxy in shared memory, and each
process also has its own user buffers.)
  • RBxy: Receive Buffers
  • Buffers shared between processes
  • x: sender, y: receiver
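A hedged sketch of how such pairwise receive buffers could be laid out in
shared memory follows; the struct names and sizes are illustrative
assumptions, not the actual MVAPICH data structures.

    /* Illustrative layout only: one receive buffer per (sender, receiver)
     * pair, so the footprint grows quadratically with the number of
     * processes. Sizes are hypothetical. */
    #include <stddef.h>

    #define NUM_PROCS   4
    #define RB_CAPACITY (32 * 1024)    /* bytes per pairwise receive buffer */

    typedef struct {
        volatile size_t head;          /* advanced by the sender   */
        volatile size_t tail;          /* advanced by the receiver */
        char            data[RB_CAPACITY];
    } recv_buffer_t;

    /* RB[x][y] is written only by process x and read only by process y:
     * a single writer and a single reader per buffer keeps the design
     * lock-free and in-order, but the total size is ~N*N*RB_CAPACITY. */
    typedef struct {
        recv_buffer_t RB[NUM_PROCS][NUM_PROCS];
    } shared_region_t;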

9
Analysis of the Current Design
  • Advantages
  • Lock-free
  • Messages in-order
  • Flaws
  • Large memory usage
  • Not scalable
  • Inefficient in cache utilization
  • Need to walk through the receive buffer
  • Performance is not optimized

10
Outline
  • Introduction and Motivation
  • Background
  • Design Description
  • Performance Evaluation
  • Conclusions and Future Work

11
Data Structures
  • SBP: Shared Buffer Pool
  • SQxy: Send Queue (x: sender, y: receiver)
  • RBxy: Receive Buffer (x: sender, y: receiver)
(Figure: processes 0 through 3. Each process x owns a shared
buffer pool SBPx; for every ordered pair (x, y) there is a send
queue SQxy, shown empty (NULL), and a small receive buffer RBxy.)
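A sketch of how these three data structures could be declared in C follows,
assuming each process owns one shared buffer pool plus a send queue and a
small receive buffer per peer; all field names and sizes are illustrative
assumptions.

    /* Illustrative declarations only; names and sizes are hypothetical. */
    #include <stddef.h>

    #define SMALL_RB_SIZE (8 * 1024)      /* small per-pair receive buffer    */
    #define POOL_CELLS    128             /* cells in each shared buffer pool */
    #define CELL_SIZE     (32 * 1024)     /* payload bytes per pool cell      */

    /* One cell of a sender's shared buffer pool (SBP). Cells carrying a
     * message are linked into the send queue (SQ) of the destination. */
    typedef struct pool_cell {
        volatile struct pool_cell *next;  /* next cell in a send queue        */
        size_t len;                       /* payload bytes in this cell       */
        char   data[CELL_SIZE];
    } pool_cell_t;

    typedef struct {
        pool_cell_t cells[POOL_CELLS];    /* shared among all destinations    */
    } shared_buffer_pool_t;               /* SBPx: one pool per sender x      */

    typedef struct {
        volatile pool_cell_t *head;       /* consumed by the receiver         */
        volatile pool_cell_t *tail;       /* appended to by the sender        */
    } send_queue_t;                       /* SQxy: x = sender, y = receiver   */

    typedef struct {
        volatile size_t head, tail;       /* single writer, single reader     */
        char data[SMALL_RB_SIZE];
    } small_recv_buffer_t;                /* RBxy: x = sender, y = receiver   */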
12
Small Message Transfer
(Figure: small message transfer between the processes and data
structures shown above. The message is copied from the sender's
user buffer into the small receive buffer RBxy, and from there
into the receiver's user buffer.)
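A hedged sketch of this small-message path, reusing the small_recv_buffer_t
type sketched under Data Structures; wrap-around handling, flow control,
and memory barriers are omitted, and the details are assumptions rather
than the actual implementation.

    /* Illustrative small-message path; the payload goes straight into the
     * small, cache-friendly receive buffer RBxy. */
    #include <string.h>

    /* Sender x -> receiver y: copy the payload into RB[x][y]. */
    static void send_small(small_recv_buffer_t *rb, const void *user_buf,
                           size_t len)
    {
        memcpy(rb->data + rb->head, user_buf, len);  /* user -> receive buffer */
        rb->head += len;                             /* publish the new bytes  */
    }

    /* Receiver y: copy the message out of RB[x][y] into its user buffer. */
    static void recv_small(small_recv_buffer_t *rb, void *user_buf, size_t len)
    {
        memcpy(user_buf, rb->data + rb->tail, len);  /* receive buffer -> user */
        rb->tail += len;                             /* release the space      */
    }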
13
Large Message Transfer
(Figure: large message transfer. The message is staged in
buffers taken from the sender's shared buffer pool SBPx, passed
to the receiver through the send queue SQxy, and then copied
into the receiver's user buffer.)
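A corresponding hedged sketch of the large-message path, again building on
the structures sketched earlier; get_free_cell and release_cell are
hypothetical helpers, and the chunking and lock-free queue details of the
real design are omitted.

    /* Illustrative large-message path: data is staged in a cell taken from
     * the sender's shared buffer pool and handed over via the send queue. */
    #include <string.h>

    pool_cell_t *get_free_cell(shared_buffer_pool_t *sbp);   /* hypothetical */
    void release_cell(shared_buffer_pool_t *sbp, pool_cell_t *cell);

    /* Sender x -> receiver y: stage data in an SBP cell, enqueue on SQxy. */
    static void send_large(shared_buffer_pool_t *sbp, send_queue_t *sq,
                           const void *user_buf, size_t len)
    {
        pool_cell_t *cell = get_free_cell(sbp);      /* cell from shared pool */
        memcpy(cell->data, user_buf, len);           /* user -> pool cell     */
        cell->len  = len;
        cell->next = NULL;
        if (sq->tail)
            sq->tail->next = cell;                   /* append to send queue  */
        else
            sq->head = cell;                         /* queue was empty       */
        sq->tail = cell;
    }

    /* Receiver y: dequeue from SQxy (assumes a message is present), copy it
     * into the user buffer, and return the cell to the pool for reuse. */
    static void recv_large(send_queue_t *sq, shared_buffer_pool_t *sbp,
                           void *user_buf)
    {
        pool_cell_t *cell = (pool_cell_t *)sq->head;
        memcpy(user_buf, cell->data, cell->len);     /* pool cell -> user     */
        sq->head = cell->next;
        if (sq->head == NULL)
            sq->tail = NULL;
        release_cell(sbp, cell);                     /* cell becomes reusable */
    }

Because pool cells are shared among all of a sender's connections, large
message buffers are needed only for messages actually in flight, which is
how the shared pool reduces memory usage compared with per-pair buffers.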
14
Analysis of the New Design
  • Lock free
  • Messages in-order
  • Control messages go through the receive buffers
  • Efficient in cache utilization
  • Small messages: small receive buffer, likely in
    the cache
  • Large messages: chances of buffer reuse improved
  • Efficient memory usage
  • Receive buffers become smaller
  • Large message buffers are shared among all the
    connections

15
Outline
  • Introduction and Motivation
  • Background
  • Design Description
  • Performance Evaluation
  • Conclusions and Future Work

16
Experimental System Setup
  • NUMA Cluster
  • Two nodes connected by InfiniBand
  • Each node has four AMD Opteron processors, 2.0 GHz
  • 1MB L2 cache
  • Linux 2.6.16
  • Multi-core Cluster
  • Two nodes connected by InfiniBand
  • Each node has four cores: dual-core AMD Opteron
    processors, 2.0 GHz
  • Two cores per chip, two chips in total
  • Each core has 1MB L2 cache
  • Linux 2.6.16

17
Latency on NUMA Cluster
  • Latency for small and medium messages is improved
    by up to 15%
  • Latency for large messages is improved by up to
    35%

18
Bandwidth on NUMA Cluster
  • Bandwidth is improved by up to 50%

19
L2 Cache Miss Rate
  • Tool: Valgrind
  • The improvement in latency and bandwidth comes
    from better L2 cache utilization

20
Collectives on NUMA Cluster
  • MPI_Barrier latency is improved by up to 19%
  • MPI_Alltoall latency is improved by 10%

21
Latency on Multi-core Cluster
  • CMP latency is lower than SMP latency for small
    messages, but higher for large messages
  • Cache transaction vs. memory contention
  • The new design improves SMP latency for all
    message sizes
  • The new design improves CMP latency for small
    messages

22
Bandwidth on Multi-core Cluster
  • The new design improves SMP bandwidth
    significantly
  • The new design also improves CMP bandwidth for
    small and medium messages

23
Collectives on Multi-core Cluster
  • The new design improves collective performance on
    multi-core cluster

24
Outline
  • Introduction and Motivation
  • Background
  • Design Description
  • Performance Evaluation
  • Conclusions and Future Work

25
Conclusions
  • Designed and implemented high-performance and
    scalable MPI intra-node communication support
  • Lock free
  • Efficient cache utilization
  • Efficient memory usage
  • Evaluated on NUMA and multi-core systems
  • Both point-to-point and collective performance
    have been improved significantly

26
Future Work
  • Application level study
  • Evaluation on larger systems
  • Further optimizations on multi-core systems

27
Acknowledgements
  • Our research is supported by the following
    organizations
  • Current Funding support by
  • Current Equipment support by

28
Thank you
  • {chail, hartonoa, panda}@cse.ohio-state.edu
  • Network-Based Computing Laboratory
  • http://nowlab.cse.ohio-state.edu/