Title: Virtualizing Modern High-Speed Interconnection Networks with Performance and Scalability
1. Virtualizing Modern High-Speed Interconnection Networks with Performance and Scalability
Bo Li, Zhigang Huo, Panyong Zhang, Dan Meng (leo, zghuo, zhangpanyong, md_at_ncic.ac.cn)
Presenter: Xiang Zhang (zhangxiang_at_ncic.ac.cn)
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2. Introduction
- Virtualization is now one of the enabling technologies of Cloud Computing
- Many HPC providers now use their systems as platforms for cloud/utility computing; these HPC-on-Demand offerings include:
  - Penguin's POD
  - IBM's Computing on Demand service
  - R Systems' dedicated hosting service
  - Amazon's EC2
3. Introduction: Virtualizing HPC Clouds?
- Pros
  - good manageability
  - proactive fault tolerance
  - performance isolation
  - online system maintenance
- Cons
  - Performance gap
  - Lack of low-latency interconnects, which are important to tightly-coupled MPI applications
- VMM-bypass I/O has been proposed to address these concerns
4. Introduction: VMM-bypass I/O Virtualization
- The Xen split device driver model is used only to set up the necessary user access points
- Data communication on the critical path bypasses both the guest OS and the VMM
[Figure: VMM-Bypass I/O (courtesy of [7])]
5. Introduction: InfiniBand Overview
- InfiniBand is a popular high-speed interconnect
  - OS-bypass/RDMA
  - Latency: 1us
  - Bandwidth: 3300MB/s
- 41.4% of Top500 systems now use InfiniBand as the primary interconnect
[Chart: Interconnect Family / Systems, June 2010. Source: http://www.top500.org]
6. Introduction: InfiniBand Scalability Problem
- Reliable Connection (RC)
  - Queue Pair (QP): each QP consists of a Send Queue (SQ) and a Receive Queue (RQ)
  - QPs require memory
- Shared Receive Queue (SRQ) (see the sketch below)
- eXtensible Reliable Connection (XRC)
  - XRC domain, SRQ-based addressing
[Figure: with N = node count and C = cores per node, RC needs (N-1)*C connections per process, each with its own RQ, while XRC needs only (N-1) connections per process, with receives shared through an SRQ]
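For readers unfamiliar with SRQ, the hedged verbs sketch below (not from the original slides) shows how an RC QP can be created on top of a shared receive queue instead of a private RQ; pd, cq, and shared_srq are assumed to have been set up elsewhere.

    #include <infiniband/verbs.h>

    /* With an SRQ, all RC QPs of a process share one receive queue instead of
     * each keeping a private RQ, which removes the per-connection receive
     * buffers mentioned above (illustrative sketch only). */
    static struct ibv_qp *create_rc_qp_with_srq(struct ibv_pd *pd,
                                                struct ibv_cq *cq,
                                                struct ibv_srq *shared_srq)
    {
        struct ibv_qp_init_attr attr = {
            .qp_type = IBV_QPT_RC,
            .send_cq = cq,
            .recv_cq = cq,
            .srq     = shared_srq,   /* shared receive queue, no private RQ */
            .cap     = { .max_send_wr = 64, .max_send_sge = 1 },
        };
        return ibv_create_qp(pd, &attr);
    }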
7. Problem Statement
- Does a scalability gap exist between native and virtualized environments? (CV = cores per VM)
- A scalability gap exists!

  Transport  | QPs per Process | QPs per Node
  Native RC  | (N-1)*C         | (N-1)*C^2
  Native XRC | (N-1)           | (N-1)*C
  VM RC      | (N-1)*C         | (N-1)*C^2
  VM XRC     | (N-1)*(C/CV)    | (N-1)*(C^2/CV)
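To make the table concrete, the small illustrative C program below (not part of the original slides) plugs one example configuration into these formulas; N, C, and CV are the symbols defined on the slide.

    #include <stdio.h>

    /* Evaluate the "QPs per Process" and "QPs per Node" formulas above
     * for one example configuration (values chosen for illustration). */
    int main(void)
    {
        long N = 64, C = 16, CV = 4;   /* nodes, cores/node, cores/VM */

        printf("Native RC / VM RC: %ld per process, %ld per node\n",
               (N - 1) * C, (N - 1) * C * C);
        printf("Native XRC       : %ld per process, %ld per node\n",
               N - 1, (N - 1) * C);
        printf("VM XRC (CV=%ld)   : %ld per process, %ld per node\n",
               CV, (N - 1) * (C / CV), (N - 1) * (C * C / CV));
        return 0;
    }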
8. Presentation Outline
- Introduction
- Problem Statement
- Proposed Design
- Evaluation
- Conclusions and Future Work
9. Proposed Design: VM-proof XRC Design
- Design goal: eliminate the scalability gap
- Connections per process: (N-1)*(C/CV) -> (N-1)
10. Proposed Design: Design Challenges
- VM-proof sharing of the XRC domain
  - A single XRC domain must be shared among different VMs within a physical node
- VM-proof connection management
  - With a single XRC connection, P1 is able to send data to all the processes on another physical node (P5-P8), no matter which VMs those processes reside in (see the sketch below)
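The following minimal sketch (not from the slides) illustrates this addressing idea with today's rdma-core XRC verbs: the sender keeps one XRC connection per remote node and selects the destination process per message by its SRQ number. The paper's OFED 1.4.2 stack exposes the same concept through an older XRC API; the function and parameter names here are illustrative.

    #include <string.h>
    #include <infiniband/verbs.h>

    /* One XRC send QP per remote NODE; the target PROCESS (e.g. P5..P8) is
     * chosen per message by its SRQ number (rdma-core XRC API, sketch only). */
    static int xrc_send_to_process(struct ibv_qp *xrc_send_qp, /* one per remote node */
                                   uint32_t remote_srqn,       /* SRQ of the target process */
                                   struct ibv_sge *sge)
    {
        struct ibv_send_wr wr, *bad_wr;

        memset(&wr, 0, sizeof(wr));
        wr.opcode                  = IBV_WR_SEND;
        wr.sg_list                 = sge;
        wr.num_sge                 = 1;
        wr.send_flags              = IBV_SEND_SIGNALED;
        wr.qp_type.xrc.remote_srqn = remote_srqn;   /* picks the destination process */

        return ibv_post_send(xrc_send_qp, &wr, &bad_wr);
    }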
11. Proposed Design: Implementation
- VM-proof sharing of the XRCD
  - The XRCD is shared by opening the same XRCD file
  - Guest domains and the IDD have dedicated, non-shared filesystems
  - Pseudo XRCD file (guest side) and real XRCD file (IDD side)
- VM-proof CM
  - Traditionally an IP address/hostname was used to identify a node
  - The LID of the HCA is used instead (see the sketch below)
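The file-based XRCD sharing can be pictured with the short sketch below. It uses the current rdma-core API (ibv_open_xrcd); the slides' OFED 1.4.2 stack exposes the same file-based sharing through the older ibv_open_xrc_domain() call, and the path argument here is a made-up example.

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <infiniband/verbs.h>

    /* Processes that open the SAME file obtain the SAME XRC domain; in the
     * VM-proof design the guest-side (pseudo) path is redirected to the real
     * XRCD file kept by the IDD, so VMs on one node share a single XRCD. */
    static struct ibv_xrcd *open_shared_xrcd(struct ibv_context *ctx,
                                             const char *xrcd_path /* hypothetical path */)
    {
        struct ibv_xrcd_init_attr attr = {0};
        int fd = open(xrcd_path, O_RDONLY | O_CREAT, 0600);
        if (fd < 0)
            return NULL;

        attr.comp_mask = IBV_XRCD_INIT_ATTR_FD | IBV_XRCD_INIT_ATTR_OFLAGS;
        attr.fd        = fd;
        attr.oflags    = O_CREAT;   /* create the domain if it does not exist yet */

        return ibv_open_xrcd(ctx, &attr);
    }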
12. Proposed Design: Discussions
- Safe XRCD sharing
  - Unauthorized applications from other VMs may try to share the XRCD
  - Isolation of XRCD sharing can be guaranteed by the IDD
- Isolation between VMs running different MPI jobs
  - By using different XRCD files, different jobs (or VMs) share different XRCDs and run without interfering with each other
- XRC migration
  - Main challenge: an XRC connection is a process-to-node communication channel
  - Future work
13. Presentation Outline
- Introduction
- Problem Statement
- Proposed Design
- Evaluation
- Conclusions and Future Work
14. Evaluation: Platform
- Cluster configuration
  - 128-core InfiniBand cluster
  - Quad-socket, quad-core Barcelona 1.9GHz nodes
  - Mellanox DDR ConnectX HCAs, 24-port MT47396 InfiniScale-III switch
- Implementation
  - Xen 3.4 with Linux 2.6.18.8
  - OpenFabrics Enterprise Distribution (OFED) 1.4.2
  - MVAPICH-1.1.0
15. Evaluation: Microbenchmarks
Explanation: memory-copy operations in the virtualized case involve interactions between the guest domain and the IDD.
- The bandwidth results are nearly the same
- Virtualized IB performs 0.1us worse when using the BlueFlame mechanism
  - due to the memory copy of the sending data to the HCA's BlueFlame page
[Figures: IB verbs latency using doorbell; IB verbs latency using BlueFlame; MPI latency using BlueFlame]
16. Evaluation: VM-proof XRC Evaluation
- Configurations
  - Native-XRC: native environment running XRC-based MVAPICH
  - VM-XRC (CV=n): VM-based environment running unmodified XRC-based MVAPICH; the parameter CV denotes the number of cores per VM
  - VM-proof XRC: VM-based environment running MVAPICH with our VM-proof XRC design
17. Evaluation: Memory Usage
- 16 cores/node cluster, fully connected
- The X-axis denotes the process count
- 12KB of memory for each QP
- 16x less memory usage (see the worked check below)
  - 64K processes consume 13GB/node with the VM-XRC (CV=1) configuration
  - The VM-proof XRC design reduces the memory usage to only 800MB/node (lower is better)
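As a sanity check (not from the slides), the 13GB and 800MB figures follow from the QP-count formulas of slide 7 combined with the 12KB-per-QP cost; the short C program below evaluates that arithmetic for 64K processes at 16 cores/node.

    #include <stdio.h>

    int main(void)
    {
        const long cores_per_node = 16;
        const long processes      = 64 * 1024;                  /* 64K processes  */
        const long nodes          = processes / cores_per_node; /* N = 4096 nodes */
        const long kb_per_qp      = 12;                         /* 12KB per QP    */

        /* VM-XRC (CV=1): (N-1)*C^2 QPs per node; VM-proof XRC: (N-1)*C */
        long qps_vm_xrc   = (nodes - 1) * cores_per_node * cores_per_node;
        long qps_vm_proof = (nodes - 1) * cores_per_node;

        printf("VM-XRC (CV=1): %ld QPs/node, %.1f GB/node\n",
               qps_vm_xrc, qps_vm_xrc * kb_per_qp * 1024.0 / 1e9);     /* ~12.9 GB */
        printf("VM-proof XRC:  %ld QPs/node, %.0f MB/node\n",
               qps_vm_proof, qps_vm_proof * kb_per_qp * 1024.0 / 1e6); /* ~805 MB  */
        return 0;
    }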
18. Evaluation: MPI Alltoall
- A total of 32 processes
- 10-25% improvement for messages < 256B with VM-proof XRC (lower is better)
19. Evaluation: Application Benchmarks
- VM-proof XRC performs nearly the same as Native-XRC
  - except for BT and EP
- Both are better than VM-XRC
- Little variation across different CV values
  - CV=8 is an exception: memory allocation is not guaranteed to be NUMA-aware
20. Evaluation: Application Benchmarks (Cont'd)

  Benchmark | Configuration | Comm. Peers | Avg. QPs/Process | Max QPs/Process | Avg. QPs/Node
  FT        | VM-XRC (CV=1) | 127         | 127              | 127             | 2032
  FT        | VM-XRC (CV=2) | 127         | 63.4             | 65              | 1014
  FT        | VM-XRC (CV=4) | 127         | 31.1             | 32              | 498
  FT        | VM-XRC (CV=8) | 127         | 15.1             | 16              | 242
  FT        | VM-proof XRC  | 127         | 8                | 8               | 128
  FT        | Native-XRC    | 127         | 7                | 7               | 112
  IS        | VM-XRC (CV=1) | 127         | 127              | 127             | 2032
  IS        | VM-XRC (CV=2) | 127         | 63.7             | 65              | 1019
  IS        | VM-XRC (CV=4) | 127         | 31.7             | 33              | 507
  IS        | VM-XRC (CV=8) | 127         | 15.8             | 18              | 253
  IS        | VM-proof XRC  | 127         | 8.6              | 12              | 138
  IS        | Native-XRC    | 127         | 7.6              | 11              | 122
- FT: VM-proof XRC uses 15.9x fewer connections per node than VM-XRC (CV=1)
- IS: VM-proof XRC uses 14.7x fewer connections per node than VM-XRC (CV=1)
21. Conclusion and Future Work
- The VM-proof XRC design converges two technologies
  - VMM-bypass I/O virtualization
  - eXtensible Reliable Connection (XRC) in modern high-speed interconnection networks (InfiniBand)
- With our VM-proof XRC design, VMs get the same raw performance and scalability as the native, non-virtualized environment
  - A 16x scalability improvement is seen on 16-core/node clusters
- Future work
  - Evaluations on different platforms at larger scale
  - Add VM migration support to our VM-proof XRC design
  - Extend our work to the new SR-IOV-enabled ConnectX-2 HCAs
22. Contact: leo, zghuo, zhangpanyong, md_at_ncic.ac.cn
23. Backup Slides
24. OS-bypass of InfiniBand
[Figure: OpenIB Gen2 stack]