GridMPI: Grid Enabled MPI
- Yutaka Ishikawa
- University of Tokyo and AIST
Motivation
- MPI has been widely used to program parallel applications
- Users want to run such applications over the Grid environment without any modification of the program
- However, the performance of existing MPI implementations does not scale up in the Grid environment
[Figure: a single (monolithic) MPI application running over the Grid environment, spanning computing resource sites A and B across a wide-area network]
Motivation
- Focus on a metropolitan-area, high-bandwidth environment: ≥ 10 Gbps, ≤ 500 miles (less than 10 ms one-way latency)
- Internet bandwidth in the Grid ≥ interconnect bandwidth in a cluster
  - 10 Gbps vs. 1 Gbps
  - 100 Gbps vs. 10 Gbps
Motivation
- Focus on a metropolitan-area, high-bandwidth environment: ≥ 10 Gbps, ≤ 500 miles (less than 10 ms one-way latency)
- We have already demonstrated, using an emulated WAN environment, that the NAS Parallel Benchmark programs scale up when the one-way latency is smaller than 10 ms
  - Motohiko Matsuda, Yutaka Ishikawa, and Tomohiro Kudoh, "Evaluation of MPI Implementations on Grid-connected Clusters using an Emulated WAN Environment," CCGRID 2003, 2003
Issues
- High-performance communication facilities for MPI on long and fat networks
  - TCP vs. MPI communication patterns
  - Network topology
  - Latency and bandwidth
- Interoperability
  - There are many MPI library implementations, and most of them use their own network protocol
- Fault tolerance and migration
  - To survive a site failure
- Security
GridMPI Features
- MPI-2 implementation
- YAMPII, developed at the University of Tokyo, is used as the core implementation
  - Intra-cluster communication by YAMPII (TCP/IP or SCore)
  - Inter-cluster communication by the IMPI (Interoperable MPI) protocol, with extensions for the Grid (MPI-2, new collective protocols); a sketch of this intra/inter routing idea follows this list
- Integration of vendor MPIs
  - IBM Regatta MPI, MPICH2, Solaris MPI, Fujitsu MPI, (NEC SX MPI)
- Incremental checkpointing
- High-performance TCP/IP implementation
- LAC: Latency-Aware Collectives
  - bcast/allreduce algorithms have been developed (to appear at the Cluster 2006 conference)
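The intra/inter split above can be pictured as a per-destination dispatch inside the library: messages between ranks on the same site take the local YAMPII transport, while messages that cross sites go through IMPI. The following is a minimal, self-contained sketch of that routing decision only; cluster_of(), yampii_send(), impi_send(), and the 8-ranks-per-site layout are hypothetical stand-ins (the layout is borrowed from the experiments later in the talk), not GridMPI's actual internals.

```c
/* Minimal sketch of intra-cluster vs. inter-cluster message routing. */
#include <stdio.h>

#define RANKS_PER_CLUSTER 8   /* assumption: 8 ranks per site, as in the experiments */

static int cluster_of(int rank) { return rank / RANKS_PER_CLUSTER; }

/* Hypothetical transport back ends, stubbed out for illustration. */
static void yampii_send(int src, int dst) { printf("%d -> %d via intra-cluster YAMPII\n", src, dst); }
static void impi_send(int src, int dst)   { printf("%d -> %d via inter-cluster IMPI\n", src, dst); }

static void grid_send(int src, int dst)
{
    if (cluster_of(src) == cluster_of(dst))
        yampii_send(src, dst);   /* same site: low-latency local transport      */
    else
        impi_send(src, dst);     /* different sites: interoperable WAN protocol */
}

int main(void)
{
    grid_send(0, 3);    /* both ranks in cluster 0 */
    grid_send(0, 12);   /* crosses the wide-area network */
    return 0;
}
```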
High-Performance Communication Mechanisms for Long and Fat Networks
- Modifications of TCP behavior
  - M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "TCP Adaptation for MPI on Long-and-Fat Networks," IEEE Cluster 2005, 2005
- Precise software pacing
  - R. Takano, T. Kudoh, Y. Kodama, M. Matsuda, H. Tezuka, and Y. Ishikawa, "Design and Evaluation of Precise Software Pacing Mechanisms for Fast Long-Distance Networks," PFLDnet 2005, 2005
- Collective communication algorithms that take network latency and bandwidth into account (a sketch of the hierarchical broadcast idea appears after this list)
  - M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "Efficient MPI Collective Operations for Clusters in Long-and-Fast Networks," to appear at IEEE Cluster 2006
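The hierarchical idea behind latency-aware collectives can be illustrated with a plain MPI program: split the world communicator into per-cluster communicators plus a communicator of cluster leaders, cross the wide-area link only once per cluster, and fan the data out on the local interconnect. This is only a minimal sketch under the assumptions of 8 ranks per cluster and the broadcast root at global rank 0; the actual LAC bcast/allreduce algorithms in the Cluster 2006 paper may differ in detail.

```c
/* Minimal sketch of a latency-aware (hierarchical) broadcast. */
#include <mpi.h>
#include <string.h>

#define RANKS_PER_CLUSTER 8   /* assumption: 8 ranks per site */

/* Assumes the broadcast root is global rank 0. */
void latency_aware_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    int cluster  = rank / RANKS_PER_CLUSTER;   /* which site this rank is on  */
    int local_id = rank % RANKS_PER_CLUSTER;   /* rank within its own cluster */

    MPI_Comm intra, leaders;
    MPI_Comm_split(comm, cluster, local_id, &intra);              /* one comm per cluster */
    MPI_Comm_split(comm, local_id == 0 ? 0 : MPI_UNDEFINED,
                   cluster, &leaders);                            /* cluster leaders only */

    if (leaders != MPI_COMM_NULL)                  /* step 1: one WAN transfer per cluster */
        MPI_Bcast(buf, count, type, 0, leaders);

    MPI_Bcast(buf, count, type, 0, intra);         /* step 2: fan out on the local network */

    if (leaders != MPI_COMM_NULL)
        MPI_Comm_free(&leaders);
    MPI_Comm_free(&intra);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    char msg[64] = "";
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        strcpy(msg, "hello over the WAN");
    latency_aware_bcast(msg, sizeof msg, MPI_CHAR, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```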
Evaluation
- It is almost impossible to reproduce the communication behavior of a real wide-area network for repeatable experiments
- A WAN emulator, GtrcNET-1, is therefore used to examine implementations, protocols, communication algorithms, etc., in a controlled manner
- GtrcNET-1 was developed at AIST
  - injection of delay, jitter, and errors
  - traffic monitoring and frame capture
  - http://www.gtrc.aist.go.jp/gnet/
Experimental Environment
[Figure: two clusters of 8 PCs each (up to Node7 and Node15) connected through the WAN emulator]
- Bandwidth: 1 Gbps
- Delay: 0 ms to 10 ms
- CPU: Pentium 4 / 2.4 GHz; Memory: DDR400, 512 MB
- NIC: Intel PRO/1000 (82547EI)
- OS: Linux 2.6.9-1.6 (Fedora Core 2)
- Socket buffer size: 20 MB (the sketch after this list shows why a buffer of this order is needed)
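The large socket buffer can be motivated by the bandwidth-delay product of the emulated link: at 1 Gbps with a 10 ms one-way delay (20 ms round trip), TCP must keep about 2.5 MB in flight to fill the pipe, so 20 MB leaves ample headroom. The sketch below computes that product and shows the standard SO_SNDBUF/SO_RCVBUF calls; it illustrates the tuning idea and is not GridMPI code. (On Linux, net.core.rmem_max and net.core.wmem_max must also permit such a size.)

```c
/* Minimal sketch: bandwidth-delay product and socket buffer sizing. */
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    const double bandwidth_bps = 1e9;     /* 1 Gbps link                 */
    const double rtt_s = 0.020;           /* 10 ms one-way = 20 ms RTT   */
    const double bdp_bytes = bandwidth_bps / 8.0 * rtt_s;

    printf("bandwidth-delay product: %.1f MB\n", bdp_bytes / 1e6);   /* = 2.5 MB */

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    int buf_size = 20 * 1024 * 1024;      /* 20 MB, as in the experiments */
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &buf_size, sizeof buf_size) < 0 ||
        setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &buf_size, sizeof buf_size) < 0)
        perror("setsockopt");
    return 0;
}
```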
GridMPI vs. MPICH-G2 (1/4)
[Figure: relative performance of FT (Class B), NAS Parallel Benchmarks 3.2, on 8 x 8 processes, plotted against one-way delay (msec)]
GridMPI vs. MPICH-G2 (2/4)
[Figure: relative performance of IS (Class B), NAS Parallel Benchmarks 3.2, on 8 x 8 processes, plotted against one-way delay (msec)]
GridMPI vs. MPICH-G2 (3/4)
[Figure: relative performance of LU (Class B), NAS Parallel Benchmarks 3.2, on 8 x 8 processes, plotted against one-way delay (msec)]
GridMPI vs. MPICH-G2 (4/4)
[Figure: relative performance of NAS Parallel Benchmarks 3.2, Class B, on 8 x 8 processes, plotted against one-way delay (msec)]
- No parameters were tuned in GridMPI
GridMPI on an Actual Network
- NAS Parallel Benchmarks run on an 8-node (2.4 GHz) cluster at Tsukuba and an 8-node (2.8 GHz) cluster at Akihabara (16 nodes in total)
- The performance is compared with
  - the result on 16 nodes (2.4 GHz)
  - the result on 16 nodes (2.8 GHz)
GridMPI Now and Future
- GridMPI version 1.0 has been released
- Conformance tests
  - MPICH Test Suite: 0/142 (failures/tests)
  - Intel Test Suite: 0/493 (failures/tests)
- GridMPI is integrated into the NaReGI package
- Extension of the IMPI specification
  - Refine the current extensions
  - The collective communication and checkpoint algorithms cannot be fixed once and for all in the specification; the current idea is to specify the mechanisms for
    - dynamic algorithm selection
    - dynamic algorithm shipment and loading
    - a virtual machine on which the algorithms are implemented
Dynamic Algorithm Shipment
- A collective communication algorithm is implemented on the virtual machine
- The code is shipped to all MPI processes
- The MPI runtime library interprets the algorithm to perform the inter-cluster collective communication (a minimal sketch of such an interpreter follows)
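To make the shipment idea concrete, the sketch below shows one way a runtime could interpret a shipped algorithm: the algorithm is a small list of instructions that every process walks through, issuing sends and receives. The opcode set, instr_t, and the stubbed vm_send/vm_recv are hypothetical; the talk does not specify GridMPI's actual virtual machine design.

```c
/* Minimal sketch of interpreting a shipped collective algorithm. */
#include <stdio.h>

typedef enum { OP_SEND, OP_RECV, OP_DONE } opcode_t;

typedef struct {
    opcode_t op;
    int      peer;   /* rank to exchange data with */
} instr_t;

/* Stub transports standing in for the real inter-cluster communication. */
static void vm_send(int me, int peer) { printf("rank %d: send to %d\n", me, peer); }
static void vm_recv(int me, int peer) { printf("rank %d: recv from %d\n", me, peer); }

/* Each MPI process interprets the same shipped program. */
static void interpret(int me, const instr_t *prog)
{
    for (; prog->op != OP_DONE; prog++) {
        if (prog->op == OP_SEND) vm_send(me, prog->peer);
        else                     vm_recv(me, prog->peer);
    }
}

int main(void)
{
    /* A shipped two-step exchange between cluster leaders 0 and 8. */
    const instr_t prog_rank0[] = { { OP_SEND, 8 }, { OP_RECV, 8 }, { OP_DONE, 0 } };
    interpret(0, prog_rank0);
    return 0;
}
```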
Concluding Remarks
- Our main concern is the metropolitan-area network
  - high-bandwidth environment: ≥ 10 Gbps, ≤ 500 miles (less than 10 ms one-way latency)
- Overseas (≈ 100 ms one-way latency)
  - Applications must be aware of the communication latency
  - Data movement using MPI-IO?
- Collaborations
  - We would like to invite people who are interested in this work to collaborate with us