Transcript and Presenter's Notes

Title: Protocol-Dependent Message-Passing Performance on Linux Clusters


1
Protocol-Dependent Message-Passing Performance on
Linux Clusters
  • Dave Turner, Xuehua Chen, Adam Oline
  • This work is funded by the DOE MICS office.
  • http://www.scl.ameslab.gov/

2
Inefficiencies in the communication system
[Diagram: the path from the application through MPI, the native layer,
the internal buses (PCI, memory), the driver, the NIC, and the switch
fabric, where inefficiencies can cost as much as 50% of the bandwidth
and inflate latency by 2-3x.]
  • Topological bottlenecks
  • Poor MPI usage / no mapping
  • Hardware limits / driver tuning
  • OS bypass / TCP tuning
4
The NetPIPE utility
  • NetPIPE does a series of ping-pong tests
    between two nodes.
  • Message sizes are chosen at regular intervals,
    and with slight perturbations, to fully test the
    communication system for idiosyncrasies.
  • Latencies reported represent half the ping-pong
    time for messages smaller than 64 Bytes.

Some typical uses
  • Measuring the overhead of message-passing
    protocols.
  • Helping tune the optimization parameters of
    message-passing libraries.
  • Identifying dropouts in networking hardware.
  • Optimizing driver and OS parameters (socket
    buffer sizes, etc.).

What is not measured
  • NetPIPE can measure the load on the CPU using
    getrusage, but this was not done here.
  • The effects of the different methods for
    maintaining message progress.
  • Scalability with system size.
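
For reference, the ping-pong pattern that NetPIPE is
built around can be sketched in a few lines of MPI.
The code below is only an illustration of that
pattern, not NetPIPE itself: the single 64 kByte
message size and the repeat count are arbitrary
choices, whereas NetPIPE sweeps the message size
(with slight perturbations) and reports latency as
half the round-trip time of small messages, as
described above.

/* Minimal MPI ping-pong sketch (illustrative only, not NetPIPE itself).
 * The message size and repeat count are arbitrary choices for this sketch. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes  = 65536;   /* one fixed message size; NetPIPE sweeps many */
    const int nrepeat = 1000;    /* repetitions to average out timer noise */
    int rank, other, i;
    char *buf = malloc(nbytes);
    double t0, t1, rtt;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;            /* assumes exactly two ranks, as in NetPIPE */

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < nrepeat; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, other, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, other, 0, MPI_COMM_WORLD, &status);
        } else {
            MPI_Recv(buf, nbytes, MPI_CHAR, other, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, nbytes, MPI_CHAR, other, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        rtt = (t1 - t0) / nrepeat;                     /* one round trip */
        printf("latency   %8.2f us\n", 0.5 * rtt * 1.0e6);
        printf("bandwidth %8.2f Mbps\n", 8.0 * nbytes / (0.5 * rtt) / 1.0e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}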

5
A NetPIPE example: Performance on a Cray T3E
  • Raw SHMEM delivers 2600 Mbps with a 2-3 us latency.
  • Cray MPI originally delivered 1300 Mbps with a 20 us latency.
  • MP_Lite delivers 2600 Mbps with a 9-10 us latency.
  • The new Cray MPI delivers 2400 Mbps with a 20 us latency.

The tops of the spikes occur where the message size
is divisible by 8 Bytes.
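
For context, "raw SHMEM" above is the one-sided
put/get layer that the T3E exposes directly. The
sketch below shows such a put in modern OpenSHMEM
style; it is illustrative only, since the T3E-era
Cray SHMEM API used slightly different calls
(start_pes() and shmalloc() rather than shmem_init()
and shmem_malloc()), and the buffer size is an
arbitrary choice.

/* Minimal one-sided put sketch in OpenSHMEM style (illustrative only;
 * the T3E-era Cray SHMEM API differed in initialization and allocation). */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    const int n = 1024;                       /* element count: an assumption */

    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric buffers: allocated at the same address on every PE. */
    double *src = shmem_malloc(n * sizeof(double));
    double *dst = shmem_malloc(n * sizeof(double));

    for (int i = 0; i < n; i++) src[i] = (double)me;

    shmem_barrier_all();
    /* PE 0 writes directly into PE 1's memory; PE 1 posts no receive. */
    if (me == 0 && npes > 1)
        shmem_double_put(dst, src, n, 1);
    shmem_quiet();                            /* wait for the put to complete */
    shmem_barrier_all();

    if (me == 1) printf("dst[0] = %g\n", dst[0]);

    shmem_free(src);
    shmem_free(dst);
    shmem_finalize();
    return 0;
}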
6
The network hardware and computer test-beds
  • Linux PC test-bed
  • Two 1.8 GHz P4 computers
  • 768 MB PC133 memory
  • 32-bit 33 MHz PCI bus
  • RedHat 7.2 Linux 2.4.7-10
  • Alpha Linux test-bed
  • Two 500 MHz dual-processor Compaq DS20s
  • 1.5 GB memory
  • 32/64-bit 33 MHz PCI bus
  • RedHat 7.1 Linux 2.4.17
  • PC SMP test-bed
  • 1.7 GHz dual-processor Xeon
  • 1.0 GB memory
  • RedHat 7.3 Linux 2.4.18-3smp

All measurements were done back-to-back except
for the Giganet hardware, which went through an
8-port switch.
7
MPICH
  • MPICH 1.2.3 release
  • Uses the p4 device for TCP.
  • P4_SOCKBUFSIZE must be increased to 256
    kBytes.
  • Rendezvous threshold can be changed in the
    source code.
  • MPICH-2.0 will be out soon!

Developed by Argonne National Laboratory and
Mississippi State University.
8
LAM/MPI
  • LAM 6.5.6-4 release from the RedHat 7.2
    distribution.
  • Must lamboot the daemons.
  • -lamd directs messages through the daemons.
  • -O avoids data conversion for homogeneous
    systems.
  • No socket buffer size tuning.
  • No threshold adjustments.

Currently developed at Indiana University.
http://www.lam-mpi.org/
9
MPI/Pro
  • MPI/Pro 1.6.3-1 release
  • Easy to install RPM
  • Requires rsh, not ssh
  • Setting -tcp_long to 128 kBytes or larger gets rid
    of most of the dip at the rendezvous threshold.
  • Other parameters didn't help.

Thanks to MPI Software Technology for supplying
the MPI/Pro software for testing.
http://www.mpi-softtech.com/
10
The MP_Lite message-passing library
  • A light-weight MPI implementation
  • Highly efficient for the architectures supported
  • Designed to be very user-friendly
  • Ideal for performing message-passing research
  • http://www.scl.ameslab.gov/Projects/MP_Lite/

11
PVM
  • PVM 3.4.3 release from the RedHat 7.2
    distribution.
  • Uses XDR encoding and the pvmd daemons by
    default.
  • pvm_setopt(PvmRoute, PvmRouteDirect) bypasses
    the pvmd daemons.
  • pvm_initsend(PvmDataInPlace) avoids XDR
    encoding for homogeneous systems.

Developed at Oak Ridge National Laboratory.
http://www.csm.ornl.gov/pvm/
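
A minimal sketch of how the two calls above fit
together is shown below; the parent/child setup, the
message tag, and the array size are assumptions made
for illustration only.

/* Minimal PVM sketch showing direct task-to-task routing and in-place
 * packing (illustrative; the parent/child setup, tag, and array size
 * are assumptions). */
#include <pvm3.h>

#define N 1024

int main(void)
{
    double data[N];
    int i, parent;

    for (i = 0; i < N; i++) data[i] = (double)i;

    pvm_mytid();                     /* enroll this process in PVM */
    parent = pvm_parent();           /* assume we were spawned by a parent task */

    /* Bypass the pvmd daemons: route messages directly between tasks. */
    pvm_setopt(PvmRoute, PvmRouteDirect);

    /* Skip XDR encoding (and the extra pack copy) on homogeneous systems. */
    pvm_initsend(PvmDataInPlace);
    pvm_pkdouble(data, N, 1);
    pvm_send(parent, 1);             /* msgtag 1: arbitrary for this sketch */

    pvm_exit();
    return 0;
}

With PvmDataInPlace no pack copy is made, so the data
must not be modified between pvm_pkdouble() and
pvm_send().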
12
Performance on Netgear GA620 Fiber Gigabit
Ethernet cards between two PCs
All libraries do reasonably well on this mature
card and driver. MPICH and PVM suffer from an
extra memory copy. LAM/MPI, MPI/Pro, and MPICH
have dips at the rendezvous threshold due to the
large 180 us latency. Tunable thresholds would
easily eliminate this minor drop in performance.
Netgear GA620 fiber GigE, 32/64-bit 33/66 MHz, AceNIC driver
13
Performance on TrendNet and Netgear GA622T
Gigabit Ethernet cards between two Linux PCs
Both cards are very sensitive to the socket
buffer sizes. MPICH and MP_Lite do well because
they adjust the socket buffer sizes. Increasing
the default socket buffer size in the other
libraries, or making it an adjustable parameter,
would fix this problem, as would more tuning of
the ns83820 driver.
TrendNet TEG-PCITX copper GigE, 32-bit 33/66 MHz, ns83820 driver
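
At the operating-system level, "adjusting the socket
buffer sizes" comes down to a pair of setsockopt()
calls made before the connection is established. The
sketch below is a generic illustration, not code from
any of the libraries tested; the 256 kByte value
echoes the P4_SOCKBUFSIZE setting mentioned for MPICH
and is otherwise arbitrary.

/* Minimal sketch of enlarging the TCP socket buffers, in the spirit of
 * what MPICH (via P4_SOCKBUFSIZE) and MP_Lite do; the size is an assumption. */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <stdio.h>

int make_tuned_socket(void)
{
    int sockbuf = 256 * 1024;   /* 256 kBytes */
    int sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0) { perror("socket"); return -1; }

    /* Set both buffers before connect()/listen() so TCP can use a large window. */
    if (setsockopt(sd, SOL_SOCKET, SO_SNDBUF, &sockbuf, sizeof(sockbuf)) < 0)
        perror("SO_SNDBUF");
    if (setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &sockbuf, sizeof(sockbuf)) < 0)
        perror("SO_RCVBUF");

    return sd;   /* caller then connect()s or bind()/listen()s as usual */
}

int main(void)
{
    int sd = make_tuned_socket();
    /* A real message-passing layer would now connect() or bind()/listen(). */
    return sd < 0;
}

On Linux the kernel caps these requests at the
net.core.rmem_max and net.core.wmem_max sysctl
limits, so those may also need to be raised.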
14
Performance on SysKonnect Gigabit Ethernet cards
between Compaq DS20s running Linux
The SysKonnect cards using a 9000-Byte MTU
provide a more challenging environment. MP_Lite
delivers nearly all of the 900 Mbps
performance. LAM/MPI again suffers due to the
smaller socket buffer sizes. MPICH suffers from
the extra memory copy. PVM suffers from both.
SysKonnect SK-9843-SX fiber GigE, 32/64-bit 33/66 MHz, sk98lin driver
15
Performance on Myrinet cards between two Linux PCs
MPICH-GM and MPI/Pro-GM both pass almost all the
performance of GM through to the
application. SCore claims to provide better
performance, but is not quite ready for prime
time yet. IP-GM provides little benefit over TCP
on Gigabit Ethernet, and at a much greater cost.
Myrinet PCI64A-2 SAN card, 66 MHz RISC with 2 MB memory
16
Performance on VIA Giganet hardware and on
SysKonnect GigE cards using M-VIA between two
Linux PCs
MPI/Pro, MVICH, and MP_Lite all provide 800
Mbps bandwidth on the Giganet hardware, but
MPI/Pro has a longer latency of 42 us compared
with 10 us for the others. The M-VIA 1.2b2
performance is roughly at the same level as raw
TCP. The M-VIA 1.2b3 release has not
been tested, nor has using jumbo frames.
Giganet CL1000 cards through an 8-port CL5000 switch.
http://www.nersc.gov/research/ftg/via,mvich/
17
SMP message-passing performance on a
dual-processor Compaq DS20 running Alpha Linux
With the data starting in main memory.
18
SMP message-passing performance on a
dual-processor Compaq DS20 running Alpha Linux
With the data starting in cache.
19
SMP message-passing performance on a
dual-processor Xeon running Linux
With the data starting in main memory.
20
SMP message-passing performance on a
dual-processor Xeon running Linux
With the data starting in cache.
21
One-sided Puts between two Linux PCs
  • MP_Lite is SIGIO based, so MPI_Put() and
    MPI_Get() finish without a fence.
  • LAM/MPI has no message progress, so a fence is
    required.
  • ARMCI uses a polling method, and therefore does
    not require a fence.
  • An MPI-2 implementation of MPICH is under
    development.
  • An MPI-2 implementation of MPI/Pro is under
    development.

Netgear GA620 fiber GigE, 32/64-bit 33/66 MHz, AceNIC driver
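
The fence referred to above is the MPI-2
MPI_Win_fence() call that opens and closes an access
epoch on a window. The sketch below is a generic
illustration of a fenced MPI_Put() between two ranks
(the window size and contents are arbitrary);
libraries with independent message progress can
complete the put itself without the target having to
reach the fence, which is what the comparison above
probes.

/* Minimal MPI-2 one-sided put sketch with fence synchronization
 * (illustrative; window size and values are assumptions).
 * Run with at least two ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int n = 1024;
    double local[1024], target[1024];
    int rank;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < n; i++) { local[i] = rank; target[i] = -1.0; }

    /* Every rank exposes 'target' as a window that other ranks may write into. */
    MPI_Win_create(target, n * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open the access epoch */
    if (rank == 0)
        MPI_Put(local, n, MPI_DOUBLE, 1, 0, n, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                 /* close the epoch: the put is complete */

    if (rank == 1) printf("target[0] = %g\n", target[0]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}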
22
Conclusions
Most message-passing libraries do reasonably well
if properly tuned. All need to have the socket
buffer sizes and thresholds user-tunable. Optimizing
the network drivers would also correct some of
the problems. There is still much room for
improvement for SMP and 1-sided communications.
Future Work
All network cards should be tested on a 64-bit
66 MHz PCI bus to put more strain on the
message-passing libraries. Testing within real
applications is vital to verify NetPIPE results,
test scalability of the implementation methods,
investigate loading of the CPU, and study the
effects of the various approaches to maintaining
message progress. SCore should be compared to
GM. VIA and InfiniBand modules are needed for
NetPIPE.
23
Protocol-Dependent Message-Passing Performance on
Linux Clusters
Dave Turner, Xuehua Chen, Adam Oline
turner@ameslab.gov
http://www.scl.ameslab.gov/