1
Performance Evaluation of Gigabit Ethernet-Based
Interconnects for HPC Clusters
  • Pawel Pisarczyk pawel.pisarczyk@atm.com.pl
  • Jaroslaw Weglinski jaroslaw.weglinski@atm.com.pl

Cracow, 16 October 2006
2
Agenda
  • Introduction
  • HPC cluster interconnects
  • Message propagation model
  • Experimental setup
  • Results
  • Conclusions

3
Who we are
  • joint stock company
  • founded in 1994, earlier (since 1991) as a
    department within PP ATM
  • IPO in September 2004 (Warsaw Stock Exchange)
  • majority of shares owned by the founders (Polish citizens)
  • no state capital involved
  • financial data
  • stock capital of about 6 million
  • 2005 sales of 29.7 million
  • about 230 employees

4
Mission
  • building business value through innovative
    information and communication technology
    initiatives that create new markets in Poland and
    abroad
  • ATM's competitive advantage is based on combining
    three key competences:
  • integration of comprehensive IT systems
  • telecommunication services
  • consulting and software development

5
Achievements
  • 1991 Poland's first company connected to the
    Internet
  • 1993 Poland's first commercial ISP
  • 1994 Poland's first LAN with an ATM backbone
  • 1994 Poland's first supercomputer on Dongarra's
    Top 500 list
  • 1995 Poland's first MAN in ATM technology
  • 1996 Poland's first corporate network with voice
    and data integration
  • 2000 Poland's first prototype interactive TV
    system over a public network
  • 2002 Poland's first validated MES system for a
    pharmaceutical factory
  • 2003 Poland's first commercial, public wireless
    LAN
  • 2004 Poland's first public IP content billing
    system

6
Client base
(based on 2005 sales revenues)
7
HPC clusters developed by ATM
  • 2004 - Poznan Supercomputing and Networking
    Center
  • 238 Itanium2 CPUs, 119 x HP rx2600 nodes with
    Gigabit Ethernet interconnect
  • 2005 - University of Podlasie
  • 34 Itanium2 CPUs, 17 x HP rx2600 nodes with Gigabit
    Ethernet interconnect and the Lustre 1.2 filesystem
  • 2005 - Poznan Supercomputing and Networking
    Center
  • 86 dual-core Opteron CPUs, 42 x Sun SunFire v20z and
    1 x Sun SunFire v40z with Gigabit Ethernet
    interconnect
  • 2006 - Military University of Technology, Faculty
    of Engineering, Chemistry and Applied Physics
  • 32 Itanium2 CPUs, 16 x HP rx1620 with Gigabit
    Ethernet interconnect
  • 2006 - Gdansk University of Technology, Department
    of Pharmaceutical Technology and Chemistry
  • 22 Itanium2 CPUs (11 x HP rx1620) with Gigabit
    Ethernet interconnect

8
Selected software projects related to distributed
systems
  • Distributed Multimedia Archive in Interactive
    Television (iTVP) Project
  • scalable storage for the iTVP platform with the
    ability to process the stored content
  • ATM Objects
  • scalable storage for multimedia content
    distribution platform
  • system for the Cinem@n company (founded by ATM and
    Monolith)
  • Cinem@n will introduce high-quality digital content
    distribution services for movies, news and
    entertainment
  • Spread Screens Manager
  • platform for POS TV
  • the system is currently used by Zabka (a retail
    chain) and Neckermann (a travel service)
  • about 300 terminals presenting multimedia
    content, located in many Polish cities

9
Selected current projects
  • ATMFS
  • distributed filesystem for petabyte scale storage
    based on COTS
  • based on variable-sized chunks
  • advanced replication and enhanced error detection
  • dependability evaluation based on software fault
    injection technique
  • FastGig
  • RDMA stack for Gigabit Ethernet-based
    interconnects
  • reduces message passing latency
  • increases application performance

10
Uses of computer networks in HPC clusters
  • Exchange of messages between cluster nodes to
    coordinate distributed computation
  • requires both high peak throughput and low
    latency
  • inefficiency appears when the time spent in a
    single computation step is comparable to the
    message passing time (a rough formula follows
    this list)
  • Access to shared data through a network or cluster
    file system
  • requires high bandwidth when transferring data in
    blocks of a defined size
  • filesystem and storage drivers try to reduce the
    number of I/O operations issued (by buffering
    data and aggregating transfers)
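A rough way to quantify the first requirement (my own framing, not a formula from the slides): if t_comp denotes the time of one computation step and t_msg the message passing time, the per-step efficiency is approximately

\[ \eta \approx \frac{t_{\mathrm{comp}}}{t_{\mathrm{comp}} + t_{\mathrm{msg}}} \]

so efficiency drops toward 50% once the message passing time becomes comparable to the computation time, which is the inefficiency described above.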

11
Comparison of characteristics of interconnect
technologies
(source: Brett M. Bode, Jason J. Hill, and Troy R. Benjegerdes,
"Cluster Interconnect Overview", Scalable Computing Laboratory,
Ames Laboratory)
12
Gigabit Ethernet interconnect characteristics
  • Popular technology for low-cost cluster
    interconnects
  • Satisfactory throughput for long frames (1000 bytes
    and longer)
  • High latency and low throughput for small frames
  • These drawbacks are mostly caused by the design of
    existing network interfaces
  • What is the influence of the network stack
    implementation on communication latency?

13
Message propagation model
Latency components along the transmission path:
  • latency between transferring the message to/from the
    MPI library and transferring data to/from the network
    stack
  • time difference between the sendto()/recvfrom()
    functions and the driver start_xmit/interrupt functions
  • execution time of the driver functions
  • processing time of the network interface
  • propagation latency and latency introduced by
    active network elements
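Written out compactly (my notation; the original slide presumably labels these components T1-T5, since T3 and T5 are referred to in the conclusions), the one-way message propagation time is roughly the sum of the listed components:

\[ T_{\mathrm{msg}} \approx T_{\mathrm{MPI}} + T_{\mathrm{stack}} + T_{\mathrm{driver}} + T_{\mathrm{NIC}} + T_{\mathrm{wire}} \]

The measurements in the following slides attribute portions of the observed latency to these terms.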
14
Experimental setup
  • Two HP rx2600 servers
  • 2 x Intel Itanium2 1.3 GHz, 3 MB cache
  • Debian GNU/Linux Sarge 3.1 operating system
    (kernel 2.6.8-2-mckinley-smp)
  • Gigabit Ethernet interfaces
  • Broadcom BCM5701 chipset connected via the PCI-X
    bus
  • To eliminate additional delays that could be
    introduced by external active network devices, the
    servers were connected using crossover cables
  • Two NIC drivers were tested: tg3 (a NAPI polling
    driver) and bcm5700 (an interrupt-driven driver)

15
Tools used for measurements
  • NetPIPE package for measuring throughput and
    latency for TCP and several MPI implementations
  • For low-level testing, test programs working
    directly on Ethernet frames were developed (a
    sketch of this approach follows the list)
  • Test programs and NIC drivers were modified to
    allow measuring, inserting, and transferring
    timestamps
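A minimal sketch of a user-space program working directly on Ethernet frames, assuming the usual Linux AF_PACKET raw-socket approach (the interface name, destination MAC, and EtherType below are placeholders, not values from the experiment):

/* send_raw.c - send one raw Ethernet frame (requires root / CAP_NET_RAW) */
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <net/ethernet.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Raw packet socket: frames are handed to the driver as-is. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_ll addr = { 0 };
    addr.sll_family   = AF_PACKET;
    addr.sll_ifindex  = if_nametoindex("eth0");      /* example interface  */
    addr.sll_protocol = htons(0x88b5);               /* experimental EtherType */
    addr.sll_halen    = ETH_ALEN;
    memcpy(addr.sll_addr, "\x00\x11\x22\x33\x44\x55", ETH_ALEN); /* peer MAC */

    /* Minimum-size frame: dst MAC, src MAC, EtherType, zero payload. */
    unsigned char frame[ETH_ZLEN] = { 0 };
    memcpy(frame, addr.sll_addr, ETH_ALEN);          /* bytes 0-5: destination MAC */
    /* bytes 6-11: source MAC, left zero in this sketch */
    frame[12] = 0x88;                                /* EtherType 0x88b5 */
    frame[13] = 0xb5;

    if (sendto(fd, frame, sizeof(frame), 0,
               (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("sendto");

    close(fd);
    return 0;
}

The receiving side would use the same socket type with recvfrom(), and a ping-pong latency test simply alternates send and receive on both nodes.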

16
Throughput characteristic for tg3 driver
17
Latency characteristic for tg3 driver
18
Results for the tg3 driver
  • The overhead introduced by the MPI library is
    relatively low
  • There is a large difference between transmission
    latencies in the ping-pong and streaming modes
  • The latency introduced for small frames is similar
    to the latency introduced by a 115 kbps UART
    transmitting a single byte (the arithmetic follows
    this list)
  • We can deduce that some mechanism in the
    transmission path delays the transmission of
    single packets
  • What is the difference between a NAPI driver and an
    interrupt-driven driver?
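To make the UART comparison concrete (my arithmetic, not a figure from the slides): one byte framed as 10 bits (start bit, 8 data bits, stop bit) at 115.2 kbps takes

\[ t_{\mathrm{byte}} = \frac{10\ \text{bit}}{115200\ \text{bit/s}} \approx 87\ \mu\text{s} \]

which is the order of magnitude being compared against the measured small-frame Gigabit Ethernet latency.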

19
Interrupt driven driver vs NAPI driver
(throughput characteristic)
20
Interrupt driven driver vs NAPI driver (latency
characteristic)
21
Interrupt driven driver vs NAPI driver (latency
characteristic) - details
22
Comparison of the bcm5700 and tg3 drivers
  • With the default configuration, the bcm5700 driver
    has worse characteristics than tg3
  • The interrupt-driven driver (default configuration)
    cannot achieve more than 650 Mb/s of throughput
    for frames of any size
  • After disabling interrupt coalescing, the
    performance of the bcm5700 driver exceeded the
    results obtained with the tg3 driver
  • Disabling polling can improve the characteristics
    of the network driver, but NAPI is not the major
    cause of the transmission delay

23
Tools for message processing time measurement
  • Timestamps were inserted into the message at each
    processing stage
  • Processing stages on the transmitter side:
  • sendto() function
  • bcm5700_start_xmit()
  • interrupt notifying frame transmission
  • Processing stages on the receiver side:
  • interrupt notifying frame reception
  • netif_rx()
  • recvfrom() function
  • The CPU clock cycle counter was used as a
    high-precision timer (resolution of 0.77 ns =
    1/1.3 GHz); a sketch of reading it follows the list
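A minimal sketch of reading the cycle counter for timestamping (my illustration, not the authors' instrumentation code; on the Itanium2 test nodes the ar.itc register provides the counter, and an x86 TSC variant is included only for comparison):

/* cycles.c - timestamp a code path with the CPU cycle counter */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t read_cycles(void)
{
#if defined(__ia64__)
    uint64_t ticks;
    __asm__ __volatile__ ("mov %0=ar.itc" : "=r" (ticks)); /* Itanium interval time counter */
    return ticks;
#elif defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi)); /* x86 time stamp counter */
    return ((uint64_t)hi << 32) | lo;
#else
#error "no cycle counter read implemented for this architecture"
#endif
}

int main(void)
{
    uint64_t t0 = read_cycles();
    /* ... code path being measured, e.g. a sendto() call ... */
    uint64_t t1 = read_cycles();

    /* One tick at the 1.3 GHz clock of the test nodes is ~0.77 ns. */
    printf("elapsed: %.1f ns\n", (double)(t1 - t0) / 1.3);
    return 0;
}

In the measurements described above, such timestamp values were inserted into the message at each processing stage and transferred with it.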

24
Transmitter latency in streaming mode
(timing diagram labels: Send 17 µs, Answer 17 µs)
25
Distribution of delays in transmission path
between cluster nodes
26
Conclusions
  • We estimate that RDMA-based communication can
    reduce the MPI message propagation time from 43 µs
    to 23 µs, roughly doubling the performance for
    short messages (a quick check follows this list)
  • There is also a possibility of reducing the T3 and
    T5 latencies by changing the configuration of the
    network interface (transmit and receive
    thresholds)
  • In the conducted research we did not consider
    differences between network interfaces (the T3 and
    T5 delays may be longer or shorter than measured)
  • The latency introduced by a switch was also omitted
  • The FastGig project includes not only a
    communication library, but also a measurement and
    communication profiling framework
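A quick sanity check of the headline estimate (my arithmetic, not an additional result):

\[ \frac{43\ \mu\text{s}}{23\ \mu\text{s}} \approx 1.87 \]

i.e. close to a twofold improvement in short-message propagation time, consistent with the "doubling" claim.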