Title: Communication Performance Measurement and Analysis on Commodity Clusters
1. Communication Performance Measurement and Analysis on Commodity Clusters
Research Proposal
Name: Nor Asilah Wati Abdul Hamid
Supervisors: Dr. Paul Coddington, Dr. Francis Vaughan
2. Table of Contents
- Introduction
- Message-Passing Multicomputers
- Previous Research to Improve Communication Over Ethernet
- Communication Performance Measurement
- Previous Benchmark Software
- Performance Analysis for MPIBench
- Motivation
- Methodology
- Value of the Research
3. Introduction
- The proposed research is on parallel computing, with a focus on message-passing parallel computers.
- It will study communication benchmark software and performance measurement and analysis for message-passing parallel computers.
- It aims to provide a clearer understanding of communication performance problems and how they can be addressed, particularly for commodity clusters using Linux PCs and Ethernet networks.
4. Message-Passing Parallel Computers
- There are various types of message-passing parallel computers, from the high end to the low end.
- Beowulf clusters are high-performance computers built from off-the-shelf commodity components: PCs running Linux connected by a Fast Ethernet network.
- However, some clusters use high-end Unix workstations (such as Compaq Alpha or Sun UltraSPARC machines) and/or high-end gigabit networks (such as Myrinet or QsNet).
Hydra
APAC NF
5. Message-Passing Parallel Computers
- A low-end commodity cluster consists of PCs running Linux connected by a Fast Ethernet network, e.g. Perseus.
- Such clusters use MPI message-passing libraries, e.g. MPICH or LAM/MPI.
  - MPI: a standard library specification for message passing.
  - MPICH: a freely available implementation of MPI.
- The proposed research is mainly focused on low-end commodity clusters.
Perseus
6. Message-Passing Parallel Computers
- Beowulf clusters have become very popular over the past few years, due to rapid improvements in the performance of commodity processors and networking infrastructure, and the development of Linux for PCs.
- For most applications, Beowulf clusters offer much better price/performance than standard supercomputers.
- Beowulf clusters commonly use an Ethernet network with TCP/IP for communication and MPICH as the MPI library.
- Ethernet is much cheaper than high-speed networks.
- However, communication over Ethernet has several inadequacies that stem from TCP/IP and the MPI implementation.
7. Network Cost Comparison (Clustervision.com)
Interconnect         Bandwidth (Mbytes/s)   Latency (µs)   Cost/port (Euro)
QsNet (Quadrics)     360                    5              4770
Myrinet (Myricom)    245                    10             2050
Gigabit Ethernet     90                     100            200
Megabit Ethernet     12                     100            28
Infiniband           560 - 610              13 - 17        2000
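To make the table above easier to interpret, the sketch below estimates one-way message times with a simple linear model, time = latency + size / bandwidth. The model is an assumption added here for illustration and is not part of the original slides; the bandwidth and latency figures are copied from the table, and Infiniband is left out because the table gives ranges rather than single values.

```python
# Linear cost model (an assumption, not from the slides):
#   time = latency + message size / bandwidth
# Values are copied from the table: (bandwidth in Mbytes/s, latency in microseconds).
NETWORKS = {
    "QsNet (Quadrics)": (360.0, 5.0),
    "Myrinet (Myricom)": (245.0, 10.0),
    "Gigabit Ethernet": (90.0, 100.0),
    "Megabit Ethernet": (12.0, 100.0),
}


def transfer_time_us(size_bytes, bandwidth_mbytes_s, latency_us):
    """Estimated one-way time in microseconds for a message of size_bytes."""
    # 1 Mbyte/s equals 1 byte per microsecond (taking 1 Mbyte = 10**6 bytes).
    return latency_us + size_bytes / bandwidth_mbytes_s


if __name__ == "__main__":
    for size in (64, 1024, 65536):
        for name, (bandwidth, latency) in NETWORKS.items():
            estimate = transfer_time_us(size, bandwidth, latency)
            print(f"{size:>6} bytes  {name:<18} {estimate:10.1f} us")
```

Under this model, small messages are dominated by latency, so both Ethernet rows cost roughly the same for short messages, while large messages are dominated by bandwidth.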
8. Ethernet Problems
- TCP/IP was designed for Internet use, so several of its mechanisms cause problems when it is used for parallel computing.
  - Examples: mechanisms for packet loss and congestion control, timeouts, etc.
- Problems in MPI implementations occur because:
  - TCP/IP detects errors and data loss, and retransmits until the data is received correctly,
  - BUT
  - MPI implementations assume a network with reliable data transfer.
- Much research has tried to improve the performance of TCP/IP, but it has mostly focused on optimizing performance for the Internet and local-area networks.
9. Previous Research to Improve Communication Over Ethernet
- Active Messages: aims to reduce communication overhead and allow communication and computation to overlap.
- GAMMA: an extension to the communication layer for Linux on clusters of PCs.
- BIP (Basic Interface for Parallelism): an interface for network communication for message-passing parallel computing.
- VIA: a standard communication infrastructure for System Area Networks (SANs) that provides protected, zero-copy, user-space inter-process communication.
- MVICH: an MPICH-based implementation of MPI for the Virtual Interface Architecture (VIA).
10. Protocol Comparison (Ping-Pong Application)
Protocol   Network            Latency (µs)   Bandwidth (Mbytes/s)
BIP        Myrinet            5.0            108.0
TCP        Myrinet            103.0          42.0
GAMMA      Gigabit Ethernet   9.6            90.0
TCP        Gigabit Ethernet   103.0          62.0
GAMMA      Fast Ethernet      12.7           12.2
VIA        Fast Ethernet      27.0           -
TCP        Fast Ethernet      105.0          10.0
11. Previous Research to Improve Communication Over Ethernet
- Previous research has focused mainly on designing new protocols to replace TCP/IP.
- However, a new protocol requires new software (e.g. drivers) for all Ethernet hardware.
- MPI implementations must also be ported to the new protocol, e.g. MVICH.
- TCP/IP and MPICH are widely used in existing Beowulf clusters, so a more flexible TCP/IP and an improved MPICH would be more practical than a new protocol.
- The work of Pope et al. is an example of research that aims to design a more flexible TCP/IP using a compliant systems approach.
- They argue for the separation of policy and mechanism, and examine which policies are suitable for TCP/IP stacks depending on the type of communication in use.
12. Communication Performance Measurement
- Communication performance measurement is important, for example:
  - To improve the performance of the machine and the MPI implementation.
  - As input to performance modeling tools for parallel programs.
  - To compare the performance of machines, in order to find the fastest one.
- Benchmark software includes SKaMPI, MPBench, Mpptest, the Pallas MPI Benchmark, and the recently developed MPIBench.
13. Previous Benchmark Software
- SKaMPI, MPBench, Pallas MPI Benchmark, Mpptest.
- Existing benchmark software has several weaknesses that can make time measurements inaccurate:
  - They use relatively coarse-grained clocks for timing, which forces a benchmark to average results over a large number of test repetitions.
  - They rely on MPI_Wtime for timing and use a ping-pong test to measure the total round-trip time, not the time of a single communication.
  - None of the communication patterns used in existing benchmarks takes clusters of SMP nodes into account.
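The weakness of averaged ping-pong timing can be illustrated with a small simulation. This is only a sketch on synthetic data: the typical time, the stall probability and the stall length are hypothetical values chosen for illustration, and the function names are not taken from any of the benchmarks listed above.

```python
import random
import statistics


def simulate_one_way_times(n, seed=0):
    """Synthetic one-way message times in microseconds (hypothetical values)."""
    rng = random.Random(seed)
    times = []
    for _ in range(n):
        t = rng.uniform(95.0, 105.0)   # assumed typical one-way time
        if rng.random() < 0.01:        # assumed 1% chance of a long stall
            t += 200_000.0             # assumed stall of 200 ms
        times.append(t)
    return times


def averaged_pingpong_estimate(one_way_times):
    """Single number a repeated ping-pong reports:
    total round-trip time divided by (2 * repetitions)."""
    repetitions = len(one_way_times) // 2
    total = sum(one_way_times[: 2 * repetitions])
    return total / (2 * repetitions)


if __name__ == "__main__":
    times = simulate_one_way_times(10_000)
    print(f"averaged ping-pong estimate: {averaged_pingpong_estimate(times):.1f} us")
    print(f"median of individual times:  {statistics.median(times):.1f} us")
    print(f"slowest individual time:     {max(times):.1f} us")
```

The averaged estimate is a single number that mixes typical messages with rare slow ones, so it describes neither; timing each communication individually keeps the two apart.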
14. MPIBench
- MPIBench was developed by Duncan Grove as part of his PhD research.
- The extra functionality in MPIBench:
  - Topology-aware: specifically designed to ensure meaningful results on clusters of SMP nodes.
  - Uses an accurate, globally synchronized clock to measure the performance of all the processes involved.
  - Can measure the times of single communications, not just averages.
  - Can generate histograms (distributions) of communication times.
- The proposed research will use MPIBench for performance measurement and will also improve MPIBench.
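As a rough illustration of what a histogram of individual communication times looks like, the sketch below bins a list of times by a fixed width. It does not reproduce MPIBench's actual output format; the function name, bin width and sample times are hypothetical.

```python
from collections import Counter


def histogram(times_us, bin_width_us):
    """Count samples per bin; each key is the lower edge of a bin in microseconds."""
    counts = Counter()
    for t in times_us:
        counts[int(t // bin_width_us) * bin_width_us] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Hypothetical sample: most times are near 100 us, one is far slower.
    sample = [101, 104, 109, 112, 118, 103, 107, 250]
    for lower_edge, count in histogram(sample, 10).items():
        print(f"{lower_edge:>5} us  {'#' * count}")
```

A distribution like this shows the long tail directly, which a single average cannot.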
15. Performance Analysis with MPIBench
- Comparison of the communication performance of different networks:
  - Beowulf-type clusters of PCs connected by Fast Ethernet (Perseus and Bunyip).
  - Perseus vs Bunyip, to analyse the effects of different communication topologies.
  - Sun Technical Compute Farm connected with Myrinet (Orion).
  - Compaq AlphaServer SC connected with QsNet (APAC NF).
16. Performance Analysis with MPIBench
- The performance analysis with MPIBench revealed several inadequacies, for example:
  - Problems caused by TCP/IP timeouts and congestion control.
  - Problems with MPI implementations.
  - Problems caused by network congestion.
- Distributions with long tails, including outliers with very long communication times, due to:
  - Spurious interference from unrelated operating system services.
  - Cluster management system daemons.
- Outlier: an extreme point that is much longer than the average value of the distribution.
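The outlier definition above can be turned into a simple check. The sketch below flags any time that exceeds a multiple of the mean; the factor of 10 is a hypothetical threshold chosen for illustration, since the slides do not say how much longer than the average a time must be to count as an outlier.

```python
def find_outliers(times_us, factor=10.0):
    """Return (fraction of samples flagged, list of flagged times).

    A sample is flagged when it is longer than factor * mean.
    The default factor is an assumption, not a value from the slides.
    """
    mean = sum(times_us) / len(times_us)
    flagged = [t for t in times_us if t > factor * mean]
    return len(flagged) / len(times_us), flagged


if __name__ == "__main__":
    # Hypothetical sample: 99 typical times and one very long one.
    sample = [100.0] * 99 + [200_000.0]
    fraction, flagged = find_outliers(sample)
    print(f"{fraction:.1%} of samples flagged: {flagged}")
```

Because the mean is itself pulled upwards by the outliers, a threshold based on the median may be more robust in practice; the mean is used here only because the definition refers to the average.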
17. Perseus: Average time for MPI_Bcast
18. Perseus: Percentage of processes experiencing outliers during MPI_Bcast
19. Distribution of times for MPI_Bcast
20. Perseus: Average times for MPI_Alltoall
21. Perseus: Percentage of processes experiencing outliers during MPI_Alltoall
22. Motivation 1
- MPIBench is new communication benchmark software with capabilities that existing benchmark software lacks.
- HOWEVER, there has been no detailed comparison of MPIBench with the existing MPI benchmarks. Such a comparison is also needed to identify any inadequacies in MPIBench so that it can be improved.
- Research Aims
  - 1. To compare MPIBench with other existing benchmark software, testing the scalability, functionality and usability of MPIBench relative to the existing software.
  - 2. To improve and modify MPIBench based on the comparison results.
23. Methodology
- Comparison of different benchmark software for message-passing parallel computers.
  - The comparison is divided into a theoretical part and an experimental part.
  - The theoretical part will involve a study based on conference and journal papers and the documentation of the benchmark software.
  - The experimental part will involve installing the benchmark software on the Hydra cluster and testing its functionality.
  - A standard set of test parameters, such as data size, MPI routine and number of iterations, will then be identified to standardize the experiments. All data obtained from the experiments will be recorded and compared.
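One way to keep the experiments standardized is to write the test parameters down once and generate every combination from them. The sketch below is illustrative only: the class name, the message sizes and the iteration count are hypothetical, while the benchmark and MPI routine names are ones mentioned elsewhere in this proposal.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BenchmarkRun:
    """One standardized experiment (field names are hypothetical)."""
    benchmark: str
    routine: str
    message_bytes: int
    iterations: int


def build_test_matrix(benchmarks, routines, sizes, iterations):
    """Every combination of benchmark, MPI routine and data size, with a fixed iteration count."""
    return [
        BenchmarkRun(benchmark, routine, size, iterations)
        for benchmark, routine, size in product(benchmarks, routines, sizes)
    ]


if __name__ == "__main__":
    runs = build_test_matrix(
        benchmarks=["MPIBench", "SKaMPI", "MPBench", "Mpptest"],
        routines=["MPI_Bcast", "MPI_Alltoall"],
        sizes=[1024, 65536],       # hypothetical data sizes in bytes
        iterations=1000,           # hypothetical repetition count
    )
    for run in runs:
        print(run)
```

Using the same generated list for every benchmark makes it easier to check that the recorded results are directly comparable.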
24. Methodology
- Improvement to MPIBench
  - This second part will require a detailed understanding of the MPIBench code.
  - The required changes to the code will then be identified and made.
  - Testing MPIBench after the changes is crucial; the tests should be the same as those used in the first part of the methodology, to ensure the correctness of the program.
25. Motivation 2
- Previously, Grove used MPIBench to compare two clusters, Perseus and Bunyip, which have similar commodity components but different topologies.
- HOWEVER, no experimental work has been done with MPIBench on a machine with the same components and topology where only the network type differs.
- Research Aims
  - 3. To analyze the performance of Myrinet and Ethernet networks on a large Linux PC cluster (Hydra). The results will be analyzed and may provide ideas on how to improve communication performance for Ethernet networks in Beowulf clusters.
26. Methodology
- Performance Analysis and Investigation of Communication Performance on Different Networks.
  - Design a method to select between the Ethernet and Myrinet networks when running the program.
  - A set of procedures and parameters is required to standardize the experiments, for example number of iterations, MPI routine, number of processors and data size.
  - The performance analysis results will be recorded and analysed.
  - The results will then be used to investigate the problems in the Ethernet network.
  - The investigation will involve study, analysis and discussion of the comparison results on communication performance for the Myrinet and Ethernet networks.
  - The expected outcome of this stage is insight into the problems that occur in the Ethernet network, particularly in TCP/IP and the MPI implementation.
27. Motivation 3
- Previous research has tried to overcome the communication performance problems of Ethernet networks in Beowulf clusters.
- However, it has focused mainly on designing new protocols. A new protocol requires new software (e.g. drivers) for all Ethernet hardware, and MPI implementations must be ported to it.
- It would be more valuable if the problems in TCP/IP and MPICH themselves could be fixed.
- Research Aims
  - 4. To propose or develop solutions to communication problems in Beowulf clusters using Ethernet networks, particularly in TCP/IP and the MPI implementation.
28. Methodology
- 4. Propose or Develop Solutions for the Ethernet Network Problems in Beowulf Cluster Computers.
  - This will involve study, analysis, comparison of results and experiments.
  - Based on the study done so far, several problems are expected in TCP/IP, for example packet loss and congestion.
  - Possible suggestions for TCP/IP: decrease the timeout or improve the algorithm for the resend mechanism.
  - Problems that occur in MPICH include poor performance and an unusual distribution of times for MPI_Alltoall.
  - Suggest or develop optimised code for some MPI routines that is suited to TCP/IP and Ethernet networks.
  - Re-run the experiments to test changes to the MPICH code or TCP, in order to check for performance improvement.
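To see why decreasing the timeout might help, the sketch below uses a deliberately simple model of retransmission cost. Its assumptions are stated plainly: each transmission attempt is lost independently with a fixed probability, every loss costs exactly one fixed timeout before the resend, and there is no exponential backoff or fast retransmit. This is a rough model for illustration, not a description of how any particular TCP stack behaves, and the numbers are hypothetical.

```python
def expected_time_us(base_us, loss_prob, timeout_us):
    """Expected delivery time under the simplified loss model.

    With independent loss probability p per attempt, the expected number of
    failed attempts before success is p / (1 - p); each one adds timeout_us.
    """
    if not 0.0 <= loss_prob < 1.0:
        raise ValueError("loss_prob must be in [0, 1)")
    expected_retransmissions = loss_prob / (1.0 - loss_prob)
    return base_us + expected_retransmissions * timeout_us


if __name__ == "__main__":
    base = 100.0   # hypothetical loss-free message time in microseconds
    loss = 0.001   # hypothetical loss probability per attempt
    for timeout in (200_000.0, 20_000.0, 2_000.0):   # hypothetical timeouts
        print(f"timeout {timeout:>9.0f} us -> expected {expected_time_us(base, loss, timeout):8.1f} us")
```

Even with a small loss probability, a timeout that is thousands of times longer than the message time dominates the expected cost in this model, which is one reason the timeout is a candidate for tuning.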
29. Motivation 4
- Previously, Grove used MPIBench to benchmark several machines; his analysis recorded outlier results showing very long communication times.
- The main causes of the outliers are:
  - Spurious interference from unrelated operating system services.
  - Cluster management system daemons.
- However, there has been no further work to investigate solutions to these problems.
- Research Aims
  - 5. To find solutions for the loss of performance in Beowulf clusters with Linux PCs.
  - 6. Possibly to develop a customized installation of Linux.
30. Methodology
- 5. Investigation of the Outliers Problem.
  - Repeat the experiments previously run with MPIBench on Perseus.
  - Based on the expected main causes of the outliers, the experiments will involve:
    - Removing operating system and cluster management system processes.
    - Reducing the frequency of interference from process execution.
  - Try to identify the cause of the outliers and propose solutions.
31. Value of the Research
- The proposed research will provide:
  - An improved MPIBench that can be used to analyze communication networks and MPI implementations.
  - Results that can be used in future studies of PEVPM, a new performance modelling technique.
  - An improvement in communication performance for Beowulf clusters using Ethernet networks, which can provide a solution for cheap high-performance computing.
32. END