Title: Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters

1. Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters
- Lei Chai, Albert Hartono, Dhabaleswar K. Panda
- Department of Computer Science and Engineering
- The Ohio State University
2. Outline
- Introduction and Motivation
- Background
- Design Description
- Performance Evaluation
- Conclusions and Future Work
3. SMP-Based Cluster
[Diagram: an SMP-based cluster of dual-core NUMA nodes. Inter-node communication crosses the network; SMP intra-node communication goes through the node's memory; CMP intra-node communication takes place between cores on the same dual-core chip.]
4. Motivation
- Advances in processor and memory architecture
  - NUMA systems
  - Multi-core systems
  - Good scalability
- Large SMP systems available
  - E.g., Sun's Niagara 2 has 8 cores on the same chip and can run 64 threads simultaneously
- MPI intra-node communication becomes more critical!
- Goals
  - To improve MPI intra-node communication performance
  - To reduce memory usage
5. Outline
- Introduction and Motivation
- Background
- Design Description
- Performance Evaluation
- Conclusions and Future Work
6. MPI Intra-node Communication
- Existing approaches
  - NIC-based loopback
  - Kernel-assisted memory mapping
  - User-space memory copy
- Advantages of user-space memory copy
  - Good performance
  - Portability
- User-space memory copy is employed by many MPI implementations
  - MVAPICH
  - MPICH-MX
  - Nemesis
7. MVAPICH
- MVAPICH: a high-performance MPI implementation for InfiniBand clusters, developed at OSU
- Based on MPICH
- MVAPICH and MVAPICH2 are currently being used by more than 405 organizations worldwide
- Latest releases: MVAPICH 0.9.8 and MVAPICH2 0.9.5
- http://nowlab.cse.ohio-state.edu/projects/mpi-iba/index.html
8. Intra-node Communication Design in MVAPICH
[Diagram: processes 0-3 connected pairwise; each sender copies from its user buffer into the receive buffer it shares with the receiver (RB01 through RB32). A minimal sketch of this scheme follows below.]
- RBxy: Receive Buffers
  - Buffers shared between processes
  - x: sender, y: receiver
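As a rough illustration of this user-space copy scheme, here is a minimal C sketch (the names, sizes, and omitted synchronization are assumptions, not MVAPICH's actual code): every ordered pair (x, y) owns a dedicated shared-memory receive buffer, so a node with p processes allocates p(p-1) of them.

#include <string.h>

#define NPROC    4
#define RB_BYTES (1 << 20)   /* per-connection receive buffer (assumed size) */

/* rb[x][y] lives in a shared-memory segment mapped by all processes;
 * x is the sender and y the receiver, matching the RBxy naming above. */
extern char rb[NPROC][NPROC][RB_BYTES];

/* Sender x copies its user buffer into the buffer it shares with y.
 * Real code would also advance a head pointer and set a flag so the
 * receiver can detect the new message. */
void send_to(int x, int y, const void *user_buf, size_t len)
{
    memcpy(rb[x][y], user_buf, len);
}

/* Receiver y polls its incoming buffers rb[*][y] and copies matched
 * data out, walking through the receive buffer as noted below. */
void recv_from(int x, int y, void *user_buf, size_t len)
{
    memcpy(user_buf, rb[x][y], len);
}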
9. Analysis of the Current Design
- Advantages
  - Lock-free
  - In-order message delivery
- Flaws
  - Large memory usage
    - Not scalable: with p processes and a receive buffer of size B per connection, a node allocates p(p-1) buffers, so shared memory grows quadratically with p
  - Inefficient cache utilization
    - Need to walk through the receive buffer
  - Performance is not optimized
10. Outline
- Introduction and Motivation
- Background
- Design Description
- Performance Evaluation
- Conclusions and Future Work
11. Data Structures
[Diagram: each of the processes 0-3 owns a shared buffer pool (SBP0-SBP3), a small receive buffer RBxy for each incoming connection, and a send queue SQxy for each outgoing connection; all send queues start out empty (NULL). A C sketch of these structures follows below.]
- SBPx: Shared Buffer Pool (one per process)
- SQxy: Send Queue (x: sender, y: receiver)
- RBxy: Receive Buffer (x: sender, y: receiver)
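A minimal C sketch of these structures, with assumed names and sizes (the real MVAPICH code differs): each process owns one shared buffer pool, and each connection owns a small receive buffer plus a send queue that is NULL when empty.

#include <stddef.h>

#define RB_SIZE    (8 * 1024)   /* small dedicated receive buffer (assumed) */
#define CELL_SIZE  (32 * 1024)  /* one buffer ("cell") in the pool (assumed) */
#define POOL_CELLS 128          /* cells per pool (assumed) */

/* One buffer from a sender's shared pool; buffers are chained into
 * per-destination send queues when they carry a large message. */
typedef struct cell {
    struct cell *next;   /* link in the free list or in a send queue */
    size_t       len;    /* bytes of payload currently in this cell */
    char         data[CELL_SIZE];
} cell_t;

/* SBPx: one pool per process, shared among all of its connections. */
typedef struct {
    cell_t *free_list;   /* cells available for new large messages */
    cell_t  cells[POOL_CELLS];
} shared_buf_pool_t;

/* One direction of one connection: RBxy for small messages and
 * control information, SQxy as head/tail of in-flight pool cells. */
typedef struct {
    char    recv_buf[RB_SIZE];
    cell_t *sq_head;     /* NULL when the send queue is empty */
    cell_t *sq_tail;
} connection_t;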
12. Small Message Transfer
[Diagram: a small message travels directly through the dedicated receive buffer: the sender copies it from its user buffer into RBxy, and the receiver copies it out into its own user buffer. The shared buffer pools and send queues are not involved; a sketch of this path follows below.]
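Building on the structures sketched above, the small-message path reduces to two copies through the dedicated receive buffer. This hypothetical sketch omits the head/tail bookkeeping and the flag that makes a message visible to the receiver.

#include <string.h>

/* Sender: one copy from the user buffer into the shared receive buffer. */
void send_small(connection_t *c, const void *user_buf, size_t len)
{
    memcpy(c->recv_buf, &len, sizeof len);            /* message length */
    memcpy(c->recv_buf + sizeof len, user_buf, len);  /* payload */
}

/* Receiver: one copy out of the receive buffer into the user buffer.
 * Because RBxy is small, it is likely to still be resident in cache. */
void recv_small(connection_t *c, void *user_buf)
{
    size_t len;
    memcpy(&len, c->recv_buf, sizeof len);
    memcpy(user_buf, c->recv_buf + sizeof len, len);
}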
13. Large Message Transfer
[Diagram: for a large message, the sender takes buffers from its shared buffer pool SBPx, copies the data into them, and links them onto the send queue SQxy; a control message through RBxy lets the receiver dequeue the buffers, copy the data into its user buffer, and return the buffers to the pool. A sketch of this path follows below.]
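A corresponding hypothetical sketch of the large-message path: the sender fills cells taken from its shared buffer pool and links them onto the send queue; the receiver drains the queue and hands the cells back. The control message through RBxy that announces the queued cells, and the synchronization around the free list, are omitted.

#include <string.h>

/* Sender: chunk the message into pool cells and append them to SQxy.
 * For simplicity this assumes the pool is never exhausted. */
void send_large(shared_buf_pool_t *pool, connection_t *c,
                const char *user_buf, size_t len)
{
    while (len > 0) {
        cell_t *cl = pool->free_list;          /* take a cell from the pool */
        pool->free_list = cl->next;
        cl->len = len < CELL_SIZE ? len : CELL_SIZE;
        memcpy(cl->data, user_buf, cl->len);
        user_buf += cl->len;
        len      -= cl->len;
        cl->next = NULL;                       /* append to the send queue */
        if (c->sq_tail) c->sq_tail->next = cl;
        else            c->sq_head = cl;
        c->sq_tail = cl;
    }
}

/* Receiver: drain SQxy into the user buffer and recycle the cells,
 * which is what lets one pool serve all of a sender's connections. */
void recv_large(shared_buf_pool_t *pool, connection_t *c, char *user_buf)
{
    cell_t *cl;
    while ((cl = c->sq_head) != NULL) {
        memcpy(user_buf, cl->data, cl->len);
        user_buf += cl->len;
        c->sq_head = cl->next;
        cl->next = pool->free_list;            /* return cell to the pool */
        pool->free_list = cl;
    }
    c->sq_tail = NULL;
}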
14. Analysis of the New Design
- Lock-free
- In-order message delivery
  - Control messages go through the receive buffers
- Efficient cache utilization
  - Small messages: small receive buffers, likely to stay in cache
  - Large messages: improved chances of buffer reuse
- Efficient memory usage
  - Receive buffers become smaller
  - Large message buffers are shared among all the connections
15. Outline
- Introduction and Motivation
- Background
- Design Description
- Performance Evaluation
- Conclusions and Future Work
16. Experimental System Setup
- NUMA cluster
  - Two nodes connected by InfiniBand
  - Each node has four AMD Opteron processors, 2.0 GHz
  - 1 MB L2 cache per processor
  - Linux 2.6.16
- Multi-core cluster
  - Two nodes connected by InfiniBand
  - Each node has two dual-core AMD Opteron chips (four cores in total), 2.0 GHz
  - Two cores per chip, two chips in total
  - Each core has a 1 MB L2 cache
  - Linux 2.6.16
17. Latency on NUMA Cluster
- Latency for small and medium messages is improved by up to 15%
- Latency for large messages is improved by up to 35%
18. Bandwidth on NUMA Cluster
- Bandwidth is improved by up to 50%
19. L2 Cache Miss Rate
- Tool: Valgrind
- The improvement in latency and bandwidth comes from better L2 cache utilization
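For reference, a cache profile like this can be collected by running each MPI process under Valgrind's cachegrind tool, e.g. mpirun -np 2 valgrind --tool=cachegrind ./osu_latency (the benchmark name osu_latency is an assumption here; any MPI program can be profiled this way).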
20. Collectives on NUMA Cluster
- MPI_Barrier latency is improved by up to 19%
- MPI_Alltoall latency is improved by 10%
21. Latency on Multi-core Cluster
- CMP latency is lower than SMP latency for small messages, but higher for large messages
  - Cache transactions vs. memory contention
- The new design improves SMP latency for all message sizes
- The new design improves CMP latency for small messages
22. Bandwidth on Multi-core Cluster
- The new design improves SMP bandwidth significantly
- The new design also improves CMP bandwidth for small and medium messages
23. Collectives on Multi-core Cluster
- The new design improves collective performance on the multi-core cluster
24. Outline
- Introduction and Motivation
- Background
- Design Description
- Performance Evaluation
- Conclusions and Future Work
25. Conclusions
- Designed and implemented high-performance and scalable MPI intra-node communication support
  - Lock-free
  - Efficient cache utilization
  - Efficient memory usage
- Evaluated on NUMA and multi-core systems
  - Both point-to-point and collective performance has been improved significantly
26. Future Work
- Application-level study
- Evaluation on larger systems
- Further optimizations on multi-core systems
27. Acknowledgements
- Our research is supported by the following organizations
  - Current funding support by [sponsor logos on slide]
  - Current equipment support by [sponsor logos on slide]
28. Thank You
- {chail, hartonoa, panda}@cse.ohio-state.edu
- Network-Based Computing Laboratory
- http://nowlab.cse.ohio-state.edu/