Transcript and Presenter's Notes

Title: Summer Institute on Advanced Computation


1
High performance cluster technology: the HPVM
experience
  • Mario Lauria
  • Dept of Computer and Information Science
  • The Ohio State University

2
Thank You!
  • My thanks to the organizers of SAIC 2000 for the
    invitation
  • It is an honor and privilege to be here today

3
Acknowledgements
  • HPVM is a project of the Concurrent Systems
    Architecture Group - CSAG (formerly UIUC Dept. of
    Computer Science, now UCSD Dept. of Computer Sci.
    Eng.)
  • Andrew Chien (Faculty)
  • Phil Papadopoulos (Research faculty)
  • Greg Bruno, Mason Katz, Caroline Papadopoulos
    (Research Staff)
  • Scott Pakin, Louis Giannini, Kay Connelly, Matt
    Buchanan, Sudha Krishnamurthy, Geetanjali
    Sampemane, Luis Rivera, Oolan Zimmer, Xin Liu,
    Ju Wang (Graduate Students)
  • NT Supercluster collaboration with NCSA Leading
    Edge Site
  • Robert Pennington (Technical Program Manager)
  • Mike Showerman, Qian Liu (Systems Programmers)
  • Qian Liu, Avneesh Pant (Systems Engineers)

4
Outline
  • The software/hardware interface (FM 1.1)
  • The layer-to-layer interface (MPI-FM and FM 2.0)
  • A production-grade cluster (NT Supercluster)
  • Current status and projects (Storage Server)

5
Motivation for cluster technology
Gigabit/sec networks: Myrinet, SCI, FC-AL,
Giganet, Gigabit Ethernet, ATM
  • Killer micros: low-cost Gigaflop processors are
    here, at a few thousand dollars per processor
  • Killer networks: Gigabit network hardware, high
    performance software (e.g. Fast Messages), soon
    at a few hundred dollars per connection
  • Leverage HW, commodity SW (Windows NT), build key
    technologies
  • high performance computing in a RICH and
    ESTABLISHED software environment

6
Ideal Model: HPVMs
  • HPVM = High Performance Virtual Machine
  • Provides a simple uniform programming model,
    abstracts and encapsulates underlying resource
    complexity
  • Simplifies use of complex resources

7
HPVM Cluster Supercomputers
[Software stack diagram: PGI HPF, MPI, Put/Get, and Global Arrays
layered over Fast Messages, on Myrinet and Sockets.
HPVM 1.0 released Aug 19, 1997]
  • High Performance Cluster Machine (HPVM)
  • Standard APIs hiding network topology,
    non-standard communication sw
  • Turnkey Supercomputing Clusters
  • high performance communication, convenient use,
    coordinated resource management
  • Windows NT and Linux, provides front-end Queueing
    Mgmt (LSF integrated)

8
Motivation for new communication software
1 Gbit/s network (Ethernet, Myrinet): ~125 µs software
overhead, N1/2 ≈ 15 KB
  • Killer networks have arrived ...
  • Gigabit links, moderate cost (dropping fast), low
    latency routers
  • however, network software only delivers the
    network's performance for large messages.

9
Motivation (cont.)
  • Problem: most messages are small
  • Message size studies:
  • < 576 bytes [Gusella90]
  • 86-99% < 200 B [Kay & Pasquale]
  • 300-400 B average size (U. Buffalo monitors)
  • ⇒ Most messages/applications see little
    performance improvement. Overhead is the key
    (LogP; Culler et al. studies)
  • Communication is an enabling technology: how to
    fulfill its promise?

10
Fast Messages Project Goals
  • Explore network architecture issues to enable
    delivery of underlying hardware performance
    (bandwidth, latency)
  • Delivering performance means
  • considering realistic packet size distributions
  • measuring performance at the application level
  • Approach
  • minimize communication overhead
  • Hardware/software, multilayer integrated approach

11
Getting performance is hard!
  • Slow Myrinet NIC processor (5 MIPS)
  • Early I/O bus (Sun's SBus) not optimized for
    small transfers
  • 24 MB/s bandwidth with PIO, 45 MB/s with DMA

12
Simple Buffering and Flow Control
  • Dramatically simplified buffering scheme, still
    performance critical
  • Basic buffering and flow control can be
    implemented at acceptable cost (a generic sketch
    follows this list)
  • Integration between NIC and host critical to
    provide services efficiently
  • critical issues: division of labor, bus
    management, NIC-host interaction

13
FM 1.x Performance (6/95)
  • Latency 14 µs, peak BW 21.4 MB/s [Pakin, Lauria et
    al., Supercomputing '95]
  • Hardware limits PIO performance, but N1/2 = 54
    bytes (see the note on N1/2 below)
  • Delivers 17.5 MB/s @ 128-byte messages (140 Mbit/s,
    greater than OC-3 ATM can deliver)
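Note on N1/2: it is the message size at which half of the peak bandwidth is delivered. Under a standard linear cost model (an assumption used here only for illustration; the slides quote measured values):

    T(n) = T_0 + n / B_{max}
    B(n) = n / T(n)
    N_{1/2} = T_0 \cdot B_{max}

With the ~125 µs overhead and 125 MB/s peak of the 1 Gbit/s case on slide 8 this gives N1/2 ≈ 15 KB, while FM's N1/2 of 54 bytes corresponds to a per-message cost of only a few microseconds.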

14
Illinois Fast Messages 1.x
Sender:   FM_send(NodeID, Handler, Buffer, size)  // handlers are remote procedures
Receiver: FM_extract()
  • API similar to Berkeley Active Messages (a usage
    sketch follows this list)
  • Key distinctions: guarantees (reliable, in-order
    delivery, flow control), network-processor
    decoupling (DMA region)
  • Focus on short-packet performance
  • Programmed I/O (PIO) instead of DMA
  • Simple buffering and flow control
  • user-space communication
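A minimal C sketch of how the two calls above might be used. The slide names only FM_send and FM_extract; the extern declarations and the handler prototype below are assumptions for illustration.

/* Assumed prototypes for the FM 1.x calls named on this slide; the real
 * header and exact types may differ. */
typedef void (*FM_handler_t)(void *buf, int size);
extern void FM_send(int node, FM_handler_t handler, void *buf, int size);
extern void FM_extract(void);

/* Handler: a small "remote procedure" run on the receiving host when the
 * message arrives (prototype assumed). */
static void hello_handler(void *buf, int size)
{
    (void)buf; (void)size;   /* consume the payload here */
}

/* Sender side: push a short message toward node dest. */
static void send_hello(int dest)
{
    char msg[] = "hello";
    FM_send(dest, hello_handler, msg, (int)sizeof msg);
}

/* Receiver side: periodically drain the network, which runs the handlers. */
static void poll_network(void)
{
    FM_extract();
}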

15
The FM layering efficiency issue
  • How good is the FM 1.1 API?
  • Test: build a user-level library on top of it and
    measure the available performance
  • MPI chosen as representative user-level library
  • porting of MPICH (ANL/MSU) to FM
  • Purpose: to study what services are important in
    layering communication libraries
  • integration issues: what kinds of inefficiencies
    arise at the interface, and what is needed to
    reduce them [Lauria & Chien, JPDC 1997]

16
MPI on FM 1.x
  • First implementation of MPI on FM was ready in
    Fall 1995
  • Disappointing performance: only a fraction of the
    FM bandwidth was available to MPI applications

17
MPI-FM Efficiency
  • Result: FM is fast, but its interface is not efficient

18
MPI-FM layering inefficiencies
  • Too many copies, due to header attachment/removal
    and lack of coordination between the transport and
    application layers

19
The new FM 2.x API
  • Sending:
  • FM_begin_message(NodeID, Handler, size),
    FM_end_message()
  • FM_send_piece(stream, buffer, size)  // gather
  • Receiving:
  • FM_receive(buffer, size)  // scatter
  • FM_extract(total_bytes)  // receiver flow control
  • Implementation based on the use of a lightweight
    thread for each message received (a usage sketch
    follows this list)
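A C sketch of how a layered library such as MPI-FM could use the gather/scatter calls above to attach a header without extra copies (the point slide 20 makes). The opaque stream type, the return value of FM_begin_message, the handler prototype, and the header struct are assumptions for illustration.

/* Assumed declarations for the FM 2.x calls listed above; actual types and
 * the way the stream and handler are obtained may differ. */
typedef struct FM_stream FM_stream;                  /* opaque stream (assumed) */
extern FM_stream *FM_begin_message(int node, void *handler, int size);
extern void FM_send_piece(FM_stream *s, void *buf, int size);  /* gather  */
extern void FM_end_message(FM_stream *s);
extern void FM_receive(void *buf, int size);                   /* scatter */
extern int  FM_extract(int total_bytes);             /* receiver flow control */

typedef struct { int tag; int len; } msg_header;     /* hypothetical header */

/* Sender: gather the header and the user payload straight from their own
 * buffers; no intermediate staging copy is needed. */
void send_with_header(int dest, void *handler, void *payload, int len)
{
    msg_header hdr = { 7 /* tag */, len };
    FM_stream *s = FM_begin_message(dest, handler, (int)sizeof hdr + len);
    FM_send_piece(s, &hdr, (int)sizeof hdr);
    FM_send_piece(s, payload, len);
    FM_end_message(s);
}

/* Receiver handler (prototype assumed): scatter the header into a small
 * local struct and the payload directly into the application's buffer. */
void recv_with_header(void *app_buf)
{
    msg_header hdr;
    FM_receive(&hdr, (int)sizeof hdr);
    FM_receive(app_buf, hdr.len);
}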

20
MPI-FM 2.x improved layering
[Diagram: header and source buffer gathered on the send side;
header and destination buffer scattered on the receive side]
  • The gather-scatter interface and handler
    multithreading enable efficient layering and data
    manipulation without copies

21
MPI on FM 2.x
  • MPI-FM: 91 MB/s, 13 µs latency, 4 µs overhead
    (see the measurement sketch below)
  • Much better short-message performance than the IBM
    SP2; bandwidth is PCI-limited
  • Latency comparable to the SGI O2K
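For context, latency and bandwidth figures like those above are typically obtained with a simple ping-pong test at the MPI level. Below is a minimal C sketch (not the NCSA benchmark code); the message size and iteration count are arbitrary choices.

/* Minimal MPI ping-pong sketch: estimates one-way latency and bandwidth
 * between ranks 0 and 1. Run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000, size = 65536;   /* 64 KB payload (hypothetical) */
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;
    if (rank == 0) {
        double one_way = dt / (2.0 * iters);          /* seconds per one-way trip */
        printf("latency %.2f us, bandwidth %.1f MB/s\n",
               one_way * 1e6, size / one_way / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}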

22
MPI-FM 2.x Efficiency
  • High transfer efficiency, approaches 100%
    [Lauria, Pakin et al., HPDC-7 '98]
  • Other systems are much lower even at 1 KB
    (100 Mbit: ~40%, 1 Gbit: ~5%)

23
MPI-FM at work: the NCSA NT Supercluster
  • 192 Pentium II, April 1998, 77 Gflops
  • 3-level fat tree (large switches), scalable
    bandwidth, modular extensibility
  • 256 Pentium II and III, June 1999, 110 Gflops
    (UIUC), w/ NCSA
  • 512 × Merced, early 2001, teraflop performance (at
    NCSA)

24
The NT Supercluster at NCSA
  • Andrew Chien, CS UIUC → UCSD
  • Rob Pennington, NCSA
  • Myrinet Network, HPVM, Fast Messages
  • Microsoft NT OS, MPI API, etc.

192 Hewlett Packard, 300 MHz
64 Compaq, 333 MHz
25
HPVM III
26
MPI applications on the NT Supercluster
  • Zeus-MP (192P, Mike Norman)
  • ISIS (192P, Robert Clay)
  • ASPCG (128P, Danesh Tafti)
  • Cactus (128P, Paul Walker/John Shalf/Ed Seidel)
  • QMC (128P, Lubos Mitas)
  • Boeing CFD Test Codes (128P, David Levine)
  • Others (no graphs)
  • SPRNG (Ashok Srinivasan), Gamess, MOPAC (John
    McKelvey), freeHEP (Doug Toussaint), AIPS (Dick
    Crutcher), Amber (Balaji Veeraraghavan),
    Delphi/Delco Codes, Parallel Sorting
  • ⇒ No code retuning required (generally) after
    recompiling with MPI-FM

27
Solving 2D Navier-Stokes Kernel - Performance
of Scalable Systems
Preconditioned Conjugate Gradient Method With
Multi-level Additive Schwarz Richardson
Pre-conditioner (2D 1024x1024)
Danesh Tafti, Rob Pennington (NCSA), Andrew Chien
(UIUC, UCSD)
28
NCSA NT Supercluster Solving Navier-Stokes
Kernel
Single processor performance: MIPS R10k 117 MFLOPS;
Intel Pentium II 80 MFLOPS
Preconditioned Conjugate Gradient Method With
Multi-level Additive Schwarz Richardson
Pre-conditioner (2D 1024x1024)
Danesh Tafti, Rob Pennington, Andrew Chien NCSA
29
Solving 2D Navier-Stokes Kernel (cont.)
Preconditioned Conjugate Gradient Method With
Multi-level Additive Schwarz Richardson
Pre-conditioner (2D 4094x4094)
  • Excellent scaling to 128P; single precision 25%
    faster

Danesh Tafti, Rob Pennington (NCSA), Andrew Chien
(UIUC, UCSD)
30
Near Perfect Scaling of Cactus - 3D Dynamic
Solver for the Einstein GR Equations
Cactus was developed by Paul Walker
(Max Planck Institute Potsdam, UIUC, NCSA)
Ratio of GFLOPS: Origin ≈ 2.5x NT SC
Paul Walker, John Shalf, Rob Pennington, Andrew
Chien NCSA
31
Quantum Monte Carlo: Origin and HPVM Cluster
Origin is about 1.7x Faster than NT SC
T. Torelli (UIUC CS), L. Mitas (NCSA, Alliance
Nanomaterials Team)
32
Supercomputer Performance Characteristics
                        Mflops/Proc   Flops/Byte   Flops/NetworkRT
Cray T3E                    1200          2              2,500
SGI Origin2000               500          0.5            1,000
HPVM NT Supercluster         300          3.2            6,000
Berkeley NOW II              100          3.2            2,000
IBM SP2                      550          3.7           38,000
Beowulf (100Mbit)            300         25             500,000
  • Compute/communicate and compute/latency ratios
    (flops per byte of network bandwidth, and flops
    per network round-trip time)
  • Clusters can provide programmable characteristics
    at a dramatically lower system cost

33
HPVM today: HPVM 1.9
  • Added support for:
  • Shared memory
  • VIA interconnect
  • New API
  • BSP

34
Show me the numbers!
  • Basics
  • Myrinet
  • FM: 100 MB/s, 8.6 µs latency
  • MPI: 91 MB/s @ 64 KB messages, 9.6 µs latency
  • Approximately 10% overhead
  • Giganet
  • FM: 81 MB/s, 14.7 µs latency
  • MPI: 77 MB/s, 18.6 µs latency
  • 5% BW overhead, 26% latency overhead!
  • Shared Memory Transport
  • FM: 195 MB/s, 3.13 µs latency
  • MPI: 85 MB/s, 5.75 µs latency

35
Bandwidth Graphs
  • FM bandwidth usually a good indicator of
    deliverable bandwidth
  • High BW attained for small messages
  • N1/2 ≈ 512 bytes

36
Other HPVM related projects
  • Approximately three hundred groups have downloaded
    HPVM 1.2 at last count
  • Some interesting research projects:
  • Low-level support for collective communication,
    OSU
  • FM with multicast (FM-MC), Vrije Universiteit,
    Amsterdam
  • Video server on demand, Univ. of Naples
  • Together with AM, U-Net, and VMMC, FM has been an
    inspiration for the VIA industry standard from
    Intel, Compaq, and Microsoft
  • The latest release of HPVM is available from
    http://www-csag.ucsd.edu

37
Current project: an HPVM-based Terabyte Storage
Server
  • High performance parallel architectures
    increasingly associated with data-intensive
    applications
  • NPACI large-dataset applications requiring 100s
    of GB
  • Digital Sky Survey, brain wave analysis
  • digital data repositories, web indexing,
    multimedia servers
  • Microsoft TerraServer, Altavista,
    RealPlayer/Windows Media servers (Audionet, CNN),
    streamed audio/video
  • genomic and proteomic research
  • large centralized data banks (GenBank, SwissProt,
    PDB, ...)
  • Commercial terabyte systems (StorageTek, EMC)
    have price tags in the million-dollar range

38
The HPVM approach to a Terabyte Storage Server
  • Exploit commodity PC technologies to build a
    large (2 TB) and smart (50 Gflops) storage server
  • benefits: inexpensive PC disks, modern I/O bus
  • The cluster advantage:
  • 10 µs communication latency vs. 10 ms disk access
    latency provides an opportunity for data
    declustering, redistribution, and aggregation of
    I/O bandwidth (a declustering sketch follows this
    list)
  • distributed buffering, data processing capability
  • scalable architecture
  • Integration issues
  • efficient data declustering, I/O bus bandwidth
    allocation, remote/local programming interface,
    external connectivity
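To make the declustering idea concrete, here is a hypothetical round-robin placement rule in C; it is only an illustration, not the server's actual policy. Striping a file's blocks over nodes and disks is what lets many disks serve one file in parallel and exposes the aggregate I/O bandwidth.

/* Hypothetical round-robin declustering of a file's blocks across the
 * cluster (illustrative only). */
typedef struct {
    int  node;     /* cluster node holding the block */
    int  disk;     /* local disk on that node */
    long offset;   /* block index within that disk's segment of the file */
} block_location;

block_location locate_block(long block_index, int num_nodes, int disks_per_node)
{
    block_location loc;
    loc.node   = (int)(block_index % num_nodes);
    loc.disk   = (int)((block_index / num_nodes) % disks_per_node);
    loc.offset = block_index / ((long)num_nodes * disks_per_node);
    return loc;
}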

39
Global Picture
1 GB/s link
  • 1 GB/s link between the two sites
  • 8 parallel Gigabit Ethernet connections
    (8 × 1 Gbit/s ≈ 1 GB/s)
  • Ethernet cards installed in some of the nodes on
    each machine

40
The Hardware Highlights
  • Main features
  • 1.6 TB = 64 × 25 GB UltraATA disks, ~$30K
  • ~1 GB/s of aggregate I/O bandwidth (64 disks × 15
    MB/s)
  • 45 GB RAM, 48 Gflop/s
  • 2.4 Gb/s Myrinet network
  • Challenges
  • make the aggregate I/O bandwidth available to
    applications
  • balance I/O load across nodes/disks
  • transport of TBs of data into and out of the
    cluster

41
The Software Components
Storage Resource Broker (SRB): used for
interoperability with existing NPACI applications
at SDSC
Parallel I/O library (e.g. Panda, MPI-IO): provides
high-performance I/O to code running on the
cluster (a minimal MPI-IO sketch follows the
diagram below)
The HPVM suite: provides support for fast
communication and standard APIs on the NT cluster
[Software stack diagram: SRB and Panda over MPI, Put/Get, and
Global Arrays, layered on Fast Messages over Myrinet]
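A minimal MPI-IO sketch of the parallel I/O style named above (MPI-IO is one of the interfaces listed; this is not the Panda or SRB API). The file name and slice size are hypothetical.

/* Each rank writes its own contiguous slice of a shared file. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 1 << 20;               /* 1M ints per rank (hypothetical) */
    int rank;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *buf = malloc(count * sizeof(int));
    for (int i = 0; i < count; i++) buf[i] = rank;

    MPI_File_open(MPI_COMM_WORLD, "dataset.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* Disjoint offsets: rank * slice size in bytes. */
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(int);
    MPI_File_write_at(fh, offset, buf, count, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Finalize();
    return 0;
}
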
42
Related Work
  • User-level Fast Networking
  • VIA precursors: AM (Fast Sockets) [Culler92,
    Rodrigues97], U-Net (U-Net/MM) [Eicken95,
    Welsh97], VMMC-2 [Li97]
  • RWCP PM [Tezuka96], BIP [Prylli97]
  • High-performance Cluster-based Storage
  • UC Berkeley Tertiary Disks [Talagala98]
  • CMU Network-Attached Devices [Gibson97], UCSB
    Active Disks [Acharya98]
  • UCLA Randomized I/O (RIO) server [Fabbrocino98]
  • UC Berkeley River system (Arpaci-Dusseau, unpub.)
  • ANL ROMIO and RIO projects (Foster, Gropp)

43
Conclusions
  • HPVM provides all the necessary tools to
    transform a PC cluster into a production
    supercomputer
  • Projects like HPVM:
  • demonstrate the level of maturity achieved so far
    by cluster technology with respect to conventional
    HPC utilization
  • serve as a springboard for further research on new
    uses of the technology
  • Efficient component integration at several levels
    is key to performance
  • tight coupling of the host and NIC is crucial to
    minimizing communication overhead
  • software layering on top of FM has exposed the
    need for a client-conscious design at the
    interface between layers

44
Future Work
  • Moving toward a more dynamic model of
    computation
  • dynamic process creation, interaction between
    computations
  • communication group management
  • long-term targets are dynamic communication and
    support for adaptive applications
  • Wide-area computing
  • integration within computational grid
    infrastructure
  • LAN/WAN bridges, remote cluster connectivity
  • Cluster applications
  • enhanced-functionality storage, scalable
    multimedia servers
  • Semi-regular network topologies