1
Performance Evaluation of Gigabit Ethernet-Based
Interconnects for HPC Clusters
  • Pawel Pisarczyk pawel.pisarczyk@atm.com.pl
  • Jaroslaw Weglinski jaroslaw.weglinski@atm.com.pl

Cracow, 16 October 2006
2
Agenda
  • Introduction
  • HPC cluster interconnects
  • Message propagation model
  • Experimental setup
  • Results
  • Conclusions

3
Who we are
  • joint stock company
  • founded in 1994, earlier (since 1991) as a
    department within PP ATM
  • IPO in September 2004 (Warsaw Stock Exchange)
  • majority of shares owned by the founders (Polish citizens)
  • no state capital involved
  • financial data
  • stock capital of about 6 million
  • 2005 sales of 29.7 million
  • about 230 employees

4
Mission
  • building business value through innovative
    information and communication technology
    initiatives that create new markets in Poland and
    abroad
  • ATM's competitive advantage is based on combining
    three key competences:
  • integration of comprehensive IT systems
  • telecommunication services
  • consulting and software development

5
Achievements
  • 1991 Poland's first company connected to the
    Internet
  • 1993 Poland's first commercial ISP
  • 1994 Poland's first LAN with an ATM backbone
  • 1994 Poland's first supercomputer on Dongarra's
    Top 500 list
  • 1995 Poland's first MAN in ATM technology
  • 1996 Poland's first corporate network with voice
    and data integration
  • 2000 Poland's first prototype interactive TV
    system over a public network
  • 2002 Poland's first validated MES system for a
    pharmaceutical factory
  • 2003 Poland's first commercial, public wireless
    LAN
  • 2004 Poland's first public IP content billing
    system

6
Client base
(based on 2005 sales revenues)
7
HPC clusters developed by ATM
  • 2004 - Poznan Supercomputing and Networking
    Center
  • 238 Itanium2 CPUs, 119 x HP rx2600 nodes with
    Gigabit Ethernet interconnect
  • 2005 - University of Podlasie
  • 34 Itanium2 CPUs, 17 x HP rx2600 nodes with Gigabit
    Ethernet interconnect and the Lustre 1.2 filesystem
  • 2005 - Poznan Supercomputing and Networking
    Center
  • 86 dual-core Opteron CPUs, 42 x Sun SunFire v20z and
    1 x Sun SunFire v40z with Gigabit Ethernet
    interconnect
  • 2006 - Military University of Technology, Faculty
    of Engineering, Chemistry and Applied Physics
  • 32 Itanium2 CPUs, 16 x HP rx1620 with Gigabit
    Ethernet interconnect
  • 2006 - Gdansk University of Technology, Department
    of Pharmaceutical Technology and Chemistry
  • 22 Itanium2 CPUs (11 x HP rx1620) with Gigabit
    Ethernet interconnect

8
Selected software projects related to distributed
systems
  • Distributed Multimedia Archive in Interactive
    Television (iTVP) Project
  • scalable storage for the iTVP platform with the
    ability to process the stored content
  • ATM Objects
  • scalable storage for multimedia content
    distribution platform
  • system for the Cinem@n company (founded by ATM and
    Monolith)
  • Cinem@n will introduce high-quality digital content
    distribution services for movies, news and
    entertainment
  • Spread Screens Manager
  • platform for POS TV
  • the system is currently used by Zabka (a retail
    chain) and Neckermann (a travel service)
  • about 300 terminals presenting multimedia
    content, located in many Polish cities

9
Selected current projects
  • ATMFS
  • distributed filesystem for petabyte scale storage
    based on COTS
  • based on variable-sized chunks
  • advanced replication and enhanced error detection
  • dependability evaluation based on software fault
    injection technique
  • FastGig
  • RDMA stack for Gigabit Ethernet-based
    interconnects
  • reduces message passing latency
  • increases application performance

10
Uses of computer networks in HPC clusters
  • Exchange of messages between cluster nodes to
    coordinate distributed computation
  • requires both high peak throughput and low
    latency
  • inefficiency appears when the time spent in a
    single computation step is comparable to the
    message passing time (a rough formula follows
    this list)
  • Access to shared data through a network or cluster
    file system
  • requires high bandwidth when transferring data in
    blocks of a defined size
  • filesystem and storage drivers try to reduce the
    number of I/O operations issued (by buffering
    data and aggregating transfers)
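A rough way to quantify the first requirement (my own framing, not a formula from the slides): if t_comp denotes the time of one computation step and t_msg the message passing time, the per-step efficiency is approximately

\[ \eta \approx \frac{t_{\mathrm{comp}}}{t_{\mathrm{comp}} + t_{\mathrm{msg}}} \]

so efficiency drops toward 50% once the message passing time becomes comparable to the computation time, which is the inefficiency described above.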

11
Comparison of characteristics of interconnect
technologies
(source: Brett M. Bode, Jason J. Hill, and Troy R. Benjegerdes,
"Cluster Interconnect Overview", Scalable Computing Laboratory,
Ames Laboratory)
12
Gigabit Ethernet interconnect characteristics
  • Popular technology for low-cost cluster
    interconnects
  • Satisfactory throughput for long frames (1000 bytes
    and longer)
  • High latency and low throughput for small frames
  • These drawbacks are mostly caused by the design of
    existing network interfaces
  • What is the influence of the network stack
    implementation on communication latency?

13
Message propagation model
Latency components along the transmission path:
  • latency between transferring the message to/from the
    MPI library and transferring data to/from the network
    stack
  • time difference between the sendto()/recvfrom()
    functions and the driver start_xmit/interrupt functions
  • execution time of the driver functions
  • processing time of the network interface
  • propagation latency and latency introduced by
    active network elements
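Written out compactly (my notation; the original slide presumably labels these components T1-T5, since T3 and T5 are referred to in the conclusions), the one-way message propagation time is roughly the sum of the listed components:

\[ T_{\mathrm{msg}} \approx T_{\mathrm{MPI}} + T_{\mathrm{stack}} + T_{\mathrm{driver}} + T_{\mathrm{NIC}} + T_{\mathrm{wire}} \]

The measurements in the following slides attribute portions of the observed latency to these terms.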
14
Experimental setup
  • Two HP rx2600 servers
  • 2 x Intel Itanium2 1.3 GHz, 3 MB cache
  • Debian GNU/Linux Sarge 3.1 operating system
    (kernel 2.6.8-2-mckinley-smp)
  • Gigabit Ethernet interfaces
  • Broadcom BCM5701 chipset connected via the PCI-X
    bus
  • To eliminate additional delays that could be
    introduced by external active network devices, the
    servers were connected using crossover cables
  • Two NIC drivers were tested: tg3 (a NAPI polling
    driver) and bcm5700 (an interrupt-driven driver)

15
Tools used for measurements
  • NetPIPE package for measuring throughput and
    latency for TCP and several MPI implementations
  • For low-level testing, test programs working
    directly on Ethernet frames were developed (a
    sketch of this approach follows the list)
  • Test programs and NIC drivers were modified to
    allow measuring, inserting, and transferring
    timestamps
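A minimal sketch of a user-space program working directly on Ethernet frames, assuming the usual Linux AF_PACKET raw-socket approach (the interface name, destination MAC, and EtherType below are placeholders, not values from the experiment):

/* send_raw.c - send one raw Ethernet frame (requires root / CAP_NET_RAW) */
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <net/ethernet.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Raw packet socket: frames are handed to the driver as-is. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_ll addr = { 0 };
    addr.sll_family   = AF_PACKET;
    addr.sll_ifindex  = if_nametoindex("eth0");      /* example interface  */
    addr.sll_protocol = htons(0x88b5);               /* experimental EtherType */
    addr.sll_halen    = ETH_ALEN;
    memcpy(addr.sll_addr, "\x00\x11\x22\x33\x44\x55", ETH_ALEN); /* peer MAC */

    /* Minimum-size frame: dst MAC, src MAC, EtherType, zero payload. */
    unsigned char frame[ETH_ZLEN] = { 0 };
    memcpy(frame, addr.sll_addr, ETH_ALEN);          /* bytes 0-5: destination MAC */
    /* bytes 6-11: source MAC, left zero in this sketch */
    frame[12] = 0x88;                                /* EtherType 0x88b5 */
    frame[13] = 0xb5;

    if (sendto(fd, frame, sizeof(frame), 0,
               (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("sendto");

    close(fd);
    return 0;
}

The receiving side would use the same socket type with recvfrom(), and a ping-pong latency test simply alternates send and receive on both nodes.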

16
Throughput characteristic for tg3 driver
17
Latency characteristic for tg3 driver
18
Results for the tg3 driver
  • The overhead introduced by the MPI library is
    relatively low
  • There is a large difference between transmission
    latencies in the ping-pong and streaming modes
  • The latency introduced for small frames is similar
    to the latency introduced by a 115 kbps UART
    transmitting a single byte (the arithmetic follows
    this list)
  • We can deduce that some mechanism in the
    transmission path delays the transmission of
    single packets
  • What is the difference between a NAPI driver and an
    interrupt-driven driver?
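To make the UART comparison concrete (my arithmetic, not a figure from the slides): one byte framed as 10 bits (start bit, 8 data bits, stop bit) at 115.2 kbps takes

\[ t_{\mathrm{byte}} = \frac{10\ \text{bit}}{115200\ \text{bit/s}} \approx 87\ \mu\text{s} \]

which is the order of magnitude being compared against the measured small-frame Gigabit Ethernet latency.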

19
Interrupt driven driver vs NAPI driver
(throughput characteristic)
20
Interrupt driven driver vs NAPI driver (latency
characteristic)
21
Interrupt driven driver vs NAPI driver (latency
characteristic) - details
22
Comparison of the bcm5700 and tg3 drivers
  • With the default configuration, the bcm5700 driver
    has worse characteristics than tg3
  • The interrupt-driven driver (default configuration)
    cannot achieve more than 650 Mb/s of throughput
    for frames of any size
  • After disabling interrupt coalescing, the
    performance of the bcm5700 driver exceeded the
    results obtained with the tg3 driver
  • Disabling polling can improve the characteristics
    of the network driver, but NAPI is not the major
    cause of the transmission delay

23
Tools for message processing time measurement
  • Timestamps were inserted into the message at each
    processing stage
  • Processing stages on the transmitter side:
  • sendto() function
  • bcm5700_start_xmit()
  • interrupt notifying frame transmission
  • Processing stages on the receiver side:
  • interrupt notifying frame reception
  • netif_rx()
  • recvfrom() function
  • The CPU clock cycle counter was used as a
    high-precision timer (resolution of 0.77 ns =
    1/1.3 GHz); a sketch of reading it follows the list
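A minimal sketch of reading the cycle counter for timestamping (my illustration, not the authors' instrumentation code; on the Itanium2 test nodes the ar.itc register provides the counter, and an x86 TSC variant is included only for comparison):

/* cycles.c - timestamp a code path with the CPU cycle counter */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t read_cycles(void)
{
#if defined(__ia64__)
    uint64_t ticks;
    __asm__ __volatile__ ("mov %0=ar.itc" : "=r" (ticks)); /* Itanium interval time counter */
    return ticks;
#elif defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi)); /* x86 time stamp counter */
    return ((uint64_t)hi << 32) | lo;
#else
#error "no cycle counter read implemented for this architecture"
#endif
}

int main(void)
{
    uint64_t t0 = read_cycles();
    /* ... code path being measured, e.g. a sendto() call ... */
    uint64_t t1 = read_cycles();

    /* One tick at the 1.3 GHz clock of the test nodes is ~0.77 ns. */
    printf("elapsed: %.1f ns\n", (double)(t1 - t0) / 1.3);
    return 0;
}

In the measurements described above, such timestamp values were inserted into the message at each processing stage and transferred with it.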

24
Transmitter latency in streaming mode
(timing diagram labels: Send 17 µs, Answer 17 µs)
25
Distribution of delays in transmission path
between cluster nodes
26
Conclusions
  • We estimate that RDMA-based communication can
    reduce the MPI message propagation time from 43 µs
    to 23 µs, roughly doubling the performance for
    short messages (a quick check follows this list)
  • There is also a possibility of reducing the T3 and
    T5 latencies by changing the configuration of the
    network interface (transmit and receive
    thresholds)
  • In the conducted research we did not consider
    differences between network interfaces (the T3 and
    T5 delays may be longer or shorter than measured)
  • The latency introduced by a switch was also omitted
  • The FastGig project includes not only a
    communication library, but also a measurement and
    communication profiling framework
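A quick sanity check of the headline estimate (my arithmetic, not an additional result):

\[ \frac{43\ \mu\text{s}}{23\ \mu\text{s}} \approx 1.87 \]

i.e. close to a twofold improvement in short-message propagation time, consistent with the "doubling" claim.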