Title: Summer Institute on Advanced Computation
1. High performance cluster technology: the HPVM experience
- Mario Lauria
- Dept of Computer and Information Science
- The Ohio State University
2. Thank You!
- My thanks to the organizers of SAIC 2000 for the invitation
- It is an honor and a privilege to be here today
3. Acknowledgements
- HPVM is a project of the Concurrent Systems Architecture Group - CSAG (formerly UIUC Dept. of Computer Science, now UCSD Dept. of Computer Science and Engineering)
  - Andrew Chien (Faculty)
  - Phil Papadopoulos (Research Faculty)
  - Greg Bruno, Mason Katz, Caroline Papadopoulos (Research Staff)
  - Scott Pakin, Louis Giannini, Kay Connelly, Matt Buchanan, Sudha Krishnamurthy, Geetanjali Sampemane, Luis Rivera, Oolan Zimmer, Xin Liu, Ju Wang (Graduate Students)
- NT Supercluster collaboration with the NCSA Leading Edge Site
  - Robert Pennington (Technical Program Manager)
  - Mike Showerman, Qian Liu (Systems Programmers)
  - Qian Liu, Avneesh Pant (Systems Engineers)
4. Outline
- The software/hardware interface (FM 1.1)
- The layer-to-layer interface (MPI-FM and FM 2.0)
- A production-grade cluster (NT Supercluster)
- Current status and projects (Storage Server)
5. Motivation for cluster technology
- Gigabit/sec networks: Myrinet, SCI, FC-AL, Giganet, Gigabit Ethernet, ATM
- Killer micros: low-cost Gigaflop processors are here, at a few $K per processor
- Killer networks: Gigabit network hardware, high performance software (e.g. Fast Messages), soon at $100s per connection
- Leverage the HW and commodity SW (Windows NT), build the key technologies
  - high performance computing in a RICH and ESTABLISHED software environment
6. Ideal Model: HPVMs
- HPVM = High Performance Virtual Machine
- Provides a simple uniform programming model, abstracts and encapsulates underlying resource complexity
- Simplifies use of complex resources
7. HPVM Cluster Supercomputers
(Software stack diagram: PGI HPF, MPI, Put/Get, and Global Arrays layered over Fast Messages, over Myrinet and Sockets - HPVM 1.0 released Aug 19, 1997)
- High Performance Cluster Machine (HPVM)
  - standard APIs hiding network topology, non-standard communication sw
- Turnkey Supercomputing Clusters
  - high performance communication, convenient use, coordinated resource management
  - Windows NT and Linux, provides front-end Queueing Mgmt (LSF integrated)
8. Motivation for a new communication software
(Graph: bandwidth vs. message size for a 1 Gbit/s network (Ethernet, Myrinet), 125 MB/s peak, with per-message overhead pushing N1/2 to ~15 KB)
- Killer networks have arrived ...
  - Gigabit links, moderate cost (dropping fast), low latency routers
- however, network software only delivers network performance for large messages
9. Motivation (cont.)
- Problem: most messages are small
- Message size studies
  - < 576 bytes [Gusella '90]
  - 86-99% < 200 B [Kay & Pasquale]
  - 300-400 B avg. size [U. Buffalo monitors]
- => Most messages/applications see little performance improvement. Overhead is the key (LogP, Culler et al. studies); a back-of-the-envelope illustration follows below.
- Communication is an enabling technology: how to fulfill its promise?
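A minimal sketch of my own (not from the slides) of why overhead dominates at the message sizes above, using a LogP-style cost model; the 50 µs software overhead, 5 µs wire latency, and 125 MB/s link rate are assumed round numbers:

/* Back-of-the-envelope model: per-message time T(n) = o + L + n/BW.
 * For small messages the fixed overhead term swamps the transfer term,
 * so raw link bandwidth barely matters. All constants are assumptions. */
#include <stdio.h>

int main(void) {
    const double o_us = 50.0;       /* assumed send+recv software overhead (us) */
    const double L_us = 5.0;        /* assumed wire/switch latency (us) */
    const double bw_MBps = 125.0;   /* 1 Gbit/s link = 125 MB/s */
    const int sizes[] = {200, 4096, 65536};

    for (int i = 0; i < 3; i++) {
        double xfer_us = sizes[i] / bw_MBps;        /* bytes / (MB/s) = us */
        double total_us = o_us + L_us + xfer_us;
        printf("%6d bytes: total %7.1f us, overhead share %4.1f%%\n",
               sizes[i], total_us, 100.0 * (o_us + L_us) / total_us);
    }
    return 0;
}

Under these assumptions a 200-byte message spends over 95% of its time in fixed overhead, which is why the FM work targets overhead rather than peak bandwidth.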
10. Fast Messages Project Goals
- Explore network architecture issues to enable delivery of the underlying hardware performance (bandwidth, latency)
- Delivering performance means
  - considering realistic packet size distributions
  - measuring performance at the application level
- Approach
  - minimize communication overhead
  - hardware/software, multilayer integrated approach
11. Getting performance is hard!
- Slow Myrinet NIC processor (5 MIPS)
- Early I/O bus (Sun's Sbus) not optimized for small transfers
- 24 MB/s bandwidth with PIO, 45 MB/s with DMA
12. Simple Buffering and Flow Control
- Dramatically simplified buffering scheme, still performance critical
- Basic buffering and flow control can be implemented at acceptable cost (a credit-based sketch follows below)
- Integration between the NIC and the host is critical to provide services efficiently
  - critical issues: division of labor, bus management, NIC-host interaction
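To make the idea concrete, here is an illustrative sketch (not the actual FM implementation) of the kind of credit-based flow control a sender can use to keep a fixed pool of receive buffers from overflowing; the names, slot count, and push_to_nic callback are invented for the example:

/* Illustrative credit scheme: the sender consumes one credit per packet and
 * gets credits back when the receiver drains its buffer slots. */
#include <stdint.h>

#define RECV_SLOTS 64                  /* assumed per-peer receive buffer pool */

typedef struct {
    volatile int32_t credits;          /* slots the sender may still consume */
} peer_state_t;

void init_peer(peer_state_t *peer) { peer->credits = RECV_SLOTS; }

/* Send one packet only if a buffer slot is known to be free at the receiver. */
int try_send_packet(peer_state_t *peer, const void *pkt, int len,
                    void (*push_to_nic)(const void *, int)) {
    if (peer->credits == 0)
        return 0;                      /* back-pressure: caller retries later */
    peer->credits--;                   /* consume one receive slot */
    push_to_nic(pkt, len);             /* PIO copy of the packet to the NIC */
    return 1;
}

/* Called when an acknowledgement (e.g. piggybacked on reverse traffic)
 * reports that the receiver has freed 'n' buffer slots. */
void return_credits(peer_state_t *peer, int n) {
    peer->credits += n;
}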
13. FM 1.x Performance (6/95)
- Latency 14 µs, peak BW 21.4 MB/s [Pakin, Lauria et al., Supercomputing '95]
- Hardware limits PIO performance, but N1/2 = 54 bytes (see the note below)
- Delivers 17.5 MB/s @ 128-byte messages (140 Mbit/s, greater than OC-3 ATM can deliver)
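For reference, a short derivation of where N1/2 comes from; the linear per-message cost model is my assumption, while the 21.4 MB/s and 54 bytes are from the bullets above:

T(n) = T_0 + \frac{n}{BW_{peak}}, \qquad BW(n) = \frac{n}{T(n)}
BW(N_{1/2}) = \tfrac{1}{2}\,BW_{peak} \;\Rightarrow\; N_{1/2} = T_0 \cdot BW_{peak}
BW_{peak} = 21.4\ \mathrm{MB/s},\ N_{1/2} = 54\ \mathrm{bytes} \;\Rightarrow\; T_0 \approx 2.5\ \mu\mathrm{s}

In words: N1/2 is the message size at which half of the peak bandwidth is delivered, so a small N1/2 means the fixed per-message cost is tiny.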
14. Illinois Fast Messages 1.x
Sender: FM_send(NodeID, Handler, Buffer, size) // handlers are remote procedures
Receiver: FM_extract()
- API: Berkeley Active Messages
- Key distinctions: guarantees (reliable, in-order, flow control), network-processor decoupling (DMA region)
- Focus on short-packet performance
- Programmed I/O (PIO) instead of DMA
- Simple buffering and flow control
- User-space communication
(A usage sketch of this API follows below.)
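A minimal sketch of the FM 1.x usage pattern shown above. Only FM_send() and FM_extract() appear on the slide; the fm.h header, the FM_NODE type, and the handler signature are assumptions made for the example:

/* Sketch only: one side sends, the other polls and runs handlers. */
#include <stdio.h>
#include "fm.h"                            /* hypothetical FM header */

/* A handler is a remote procedure run on the receiver for each message
 * (signature assumed here). */
static void print_handler(void *buf, int size) {
    printf("received %d bytes: %.*s\n", size, size, (char *)buf);
}

void sender(FM_NODE dest) {
    char msg[128] = "hello";
    /* Enqueue the message; FM guarantees reliable, in-order delivery with
       flow control, so no retransmission logic is needed at this level. */
    FM_send(dest, print_handler, msg, sizeof(msg));
}

void receiver(void) {
    for (;;)
        FM_extract();   /* poll the network and run handlers for arrived messages */
}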
15. The FM layering efficiency issue
- How good is the FM 1.1 API?
- Test: build a user-level library on top of it and measure the available performance
- MPI chosen as a representative user-level library
  - porting of MPICH (ANL/MSU) to FM
- Purpose: to study what services are important in layering communication libraries
  - integration issues: what kind of inefficiencies arise at the interface, and what is needed to reduce them [Lauria & Chien, JPDC 1997]
16. MPI on FM 1.x
- First implementation of MPI on FM was ready in Fall 1995
- Disappointing performance: only a fraction of the FM bandwidth available to MPI applications
17. MPI-FM Efficiency
- Result: FM is fast, but its interface is not efficient
18. MPI-FM layering inefficiencies
- Too many copies due to header attachment/removal,
lack of coordination between transport and
application layers
19. The new FM 2.x API
- Sending
  - FM_begin_message(NodeID, Handler, size), FM_end_message()
  - FM_send_piece(stream, buffer, size) // gather
- Receiving
  - FM_receive(buffer, size) // scatter
  - FM_extract(total_bytes) // receiver flow control
- Implementation based on the use of a lightweight thread for each message received
(A usage sketch of the streaming calls follows below.)
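A sketch of how the streaming calls above fit together. Only the call names and argument lists come from the slide; the fm.h header, the FM_stream/FM_NODE/FM_HANDLER types, the handler signature, and FM_end_message taking the stream are assumptions for the example:

/* Sender: a long buffer is pushed as several gather pieces of one message. */
#include "fm.h"                               /* hypothetical FM 2.x header */

enum { CHUNK = 4096, POLL_BYTES = 65536 };

void send_large(FM_NODE dest, FM_HANDLER h, char *buf, int len) {
    FM_stream *s = FM_begin_message(dest, h, len);
    for (int off = 0; off < len; off += CHUNK) {
        int n = (len - off < CHUNK) ? (len - off) : CHUNK;
        FM_send_piece(s, buf + off, n);       /* gather: no staging copy */
    }
    FM_end_message(s);    /* slide shows FM_end_message(); passing s is assumed */
}

/* Handler: runs on a lightweight thread per incoming message and pulls the
 * data out piecewise (scatter) into wherever the application wants it. */
void large_handler(FM_stream *s, int total) {
    static char sink[1 << 20];                /* assumes total <= 1 MB here */
    int got = 0;
    while (got < total) {
        int n = (total - got < CHUNK) ? (total - got) : CHUNK;
        FM_receive(sink + got, n);
        got += n;
    }
}

/* Receiver main loop: FM_extract() bounds how many bytes of handlers run
 * per call, giving the receiver control over its own flow. */
void progress_loop(void) {
    for (;;)
        FM_extract(POLL_BYTES);
}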
20. MPI-FM 2.x improved layering
(Diagram: the header and the source buffer are gathered on the send side and scattered into the header and destination buffers on the receive side)
- Gather-scatter interface + handler multithreading enables efficient layering, data manipulation without copies (see the layering sketch below)
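A minimal sketch (not MPICH code) of how an MPI-style library can layer on the gather/scatter API without extra copies: the envelope travels as its own piece, and the payload is scattered straight into the matched receive buffer. The envelope fields and match_posted_recv() are invented for the example:

#include "fm.h"                                    /* hypothetical header */

typedef struct { int src, tag, len; } envelope_t;  /* assumed MPI-ish envelope */

void *match_posted_recv(int src, int tag, int len);    /* assumed queue lookup */

void mpi_like_send(FM_NODE dest, FM_HANDLER h,
                   int myrank, int tag, void *buf, int len) {
    envelope_t env = { myrank, tag, len };
    FM_stream *s = FM_begin_message(dest, h, sizeof(env) + len);
    FM_send_piece(s, &env, sizeof(env));   /* header: no prepend-copy needed */
    FM_send_piece(s, buf, len);            /* user data sent in place */
    FM_end_message(s);
}

void mpi_like_handler(FM_stream *s, int total) {
    envelope_t env;
    FM_receive(&env, sizeof(env));                         /* read header first */
    void *dst = match_posted_recv(env.src, env.tag, env.len);
    FM_receive(dst, env.len);             /* payload lands in the user's buffer */
}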
21. MPI on FM 2.x
(Graph: MPI-FM bandwidth vs. message size)
- MPI-FM: 91 MB/s, 13 µs latency, 4 µs overhead
- Short messages much better than the IBM SP2; PCI limited
- Latency comparable to the SGI O2K
22. MPI-FM 2.x Efficiency
(Graph: transfer efficiency vs. message size)
- High transfer efficiency, approaches 100% [Lauria, Pakin et al., HPDC-7 '98]
- Other systems much lower even at 1 KB (100 Mbit: ~40%, 1 Gbit: ~5%)
23. MPI-FM at work: the NCSA NT Supercluster
- 192 Pentium II, April 1998, 77 Gflops
- 3-level fat tree (large switches), scalable bandwidth, modular extensibility
- 256 Pentium II and III, June 1999, 110 Gflops (UIUC), w/ NCSA
- 512 x Merced, early 2001, Teraflop performance (@ NCSA)
24. The NT Supercluster at NCSA
- Andrew Chien, CS UIUC -> UCSD
- Rob Pennington, NCSA
- Myrinet network, HPVM, Fast Messages
- Microsoft NT OS, MPI API, etc.
- 192 Hewlett-Packard nodes, 300 MHz
- 64 Compaq nodes, 333 MHz
25. HPVM III
26. MPI applications on the NT Supercluster
- Zeus-MP (192P, Mike Norman)
- ISIS (192P, Robert Clay)
- ASPCG (128P, Danesh Tafti)
- Cactus (128P, Paul Walker/John Shalf/Ed Seidel)
- QMC (128P, Lubos Mitas)
- Boeing CFD test codes (128P, David Levine)
- Others (no graphs)
  - SPRNG (Ashok Srinivasan), Gamess, MOPAC (John McKelvey), freeHEP (Doug Toussaint), AIPS (Dick Crutcher), Amber (Balaji Veeraraghavan), Delphi/Delco codes, parallel sorting
- => No code retuning required (generally) after recompiling with MPI-FM
27. Solving 2D Navier-Stokes Kernel - Performance of Scalable Systems
Preconditioned Conjugate Gradient method with multi-level additive Schwarz Richardson pre-conditioner (2D, 1024x1024)
Danesh Tafti, Rob Pennington (NCSA), Andrew Chien (UIUC, UCSD)
28. NCSA NT Supercluster Solving the Navier-Stokes Kernel
Single-processor performance: MIPS R10k 117 MFLOPS, Intel Pentium II 80 MFLOPS
Preconditioned Conjugate Gradient method with multi-level additive Schwarz Richardson pre-conditioner (2D, 1024x1024)
Danesh Tafti, Rob Pennington, Andrew Chien (NCSA)
29. Solving 2D Navier-Stokes Kernel (cont.)
Preconditioned Conjugate Gradient method with multi-level additive Schwarz Richardson pre-conditioner (2D, 4094x4094)
- Excellent scaling to 128P; single precision 25% faster
Danesh Tafti, Rob Pennington (NCSA), Andrew Chien (UIUC, UCSD)
30. Near-Perfect Scaling of Cactus - 3D Dynamic Solver for the Einstein GR Equations
Cactus was developed by Paul Walker, MPI-Potsdam, UIUC, NCSA
Ratio of GFLOPS: Origin 2.5x NT SC
Paul Walker, John Shalf, Rob Pennington, Andrew Chien (NCSA)
31. Quantum Monte Carlo: Origin and HPVM Cluster
- Origin is about 1.7x faster than the NT SC
T. Torelli (UIUC CS), L. Mitas (NCSA, Alliance Nanomaterials Team)
32. Supercomputer Performance Characteristics

System                  Mflops/Proc   Flops/Byte   Flops/Network RT
Cray T3E                1200          2            2,500
SGI Origin2000          500           0.5          1,000
HPVM NT Supercluster    300           3.2          6,000
Berkeley NOW II         100           3.2          2,000
IBM SP2                 550           3.7          38,000
Beowulf (100 Mbit)      300           25           500,000

- Compute/communicate and compute/latency ratios (a sample calculation for the HPVM row follows below)
- Clusters can provide programmable characteristics at a dramatically lower system cost
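One way to read the HPVM row, as a back-of-the-envelope check of my own: per-processor peak divided by delivered MPI bandwidth gives flops/byte, and peak times round-trip latency gives flops per network round trip. The exact inputs below (≈94 MB/s, ≈20 µs round trip) are assumptions drawn from the MPI-FM numbers on earlier slides:

#include <stdio.h>

int main(void) {
    double peak_mflops = 300.0;     /* Pentium II node, Mflops */
    double mpi_bw_MBps = 94.0;      /* approx. MPI-FM bandwidth, MB/s (assumed) */
    double rt_latency_us = 20.0;    /* approx. 2 x 9.6 us one-way MPI latency */

    printf("flops/byte       = %.1f\n", peak_mflops / mpi_bw_MBps);
    printf("flops/network RT = %.0f\n", peak_mflops * rt_latency_us);
    return 0;
}

With these inputs the program prints roughly 3.2 flops/byte and 6,000 flops per round trip, matching the table entries.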
33. HPVM today: HPVM 1.9
- Added support for
  - shared memory
  - VIA interconnect
- New API
  - BSP
34. Show me the numbers!
- Basics
  - Myrinet
    - FM: 100 MB/sec, 8.6 µsec latency
    - MPI: 91 MB/sec @ 64K, 9.6 µsec latency
    - approximately 10% overhead
  - Giganet
    - FM: 81 MB/sec, 14.7 µsec latency
    - MPI: 77 MB/sec, 18.6 µsec latency
    - 5% BW overhead, 26% latency!
  - Shared memory transport
    - FM: 195 MB/sec, 3.13 µsec latency
    - MPI: 85 MB/sec, 5.75 µsec latency
(A generic ping-pong sketch of how numbers like these are measured follows below.)
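For context, a generic MPI ping-pong micro-benchmark of the kind typically used to produce latency and bandwidth figures like those above; this is my own sketch, not the benchmark actually used for the slides:

/* Run with 2 ranks; half the average round-trip time is the latency. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const int iters = 1000;
    const int size = (argc > 1) ? atoi(argv[1]) : 8;   /* message size, bytes */
    char *buf = calloc(size, 1);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt_us = (MPI_Wtime() - t0) / iters * 1e6;

    if (rank == 0)
        printf("%d bytes: half-RTT %.2f us, bandwidth %.1f MB/s\n",
               size, rtt_us / 2, size / (rtt_us / 2));
    MPI_Finalize();
    free(buf);
    return 0;
}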
35. Bandwidth Graphs
- FM bandwidth usually a good indicator of deliverable bandwidth
- High BW attained for small messages
36. Other HPVM-related projects
- About three hundred groups have downloaded HPVM 1.2 at the last count
- Some interesting research projects
  - low-level support for collective communication, OSU
  - FM with multicast (FM-MC), Vrije Universiteit, Amsterdam
  - video server on demand, Univ. of Naples
- Together with AM, U-Net and VMMC, FM has been the inspiration for the VIA industrial standard by Intel, Compaq, and Microsoft
- The latest release of HPVM is available from http://www-csag.ucsd.edu
37. Current project: an HPVM-based Terabyte Storage Server
- High performance parallel architectures increasingly associated with data-intensive applications
- NPACI large-dataset applications requiring 100s of GB
  - Digital Sky Survey, brain waves analysis
- digital data repositories, web indexing, multimedia servers
  - Microsoft TerraServer, Altavista, RealPlayer/Windows Media servers (Audionet, CNN), streamed audio/video
- genomic and proteomic research
  - large centralized data banks (GenBank, SwissProt, PDB, ...)
- Commercial terabyte systems (Storagetek, EMC) have price tags in the $M range
38. The HPVM approach to a Terabyte Storage Server
- Exploit commodity PC technologies to build a large (2 TB) and smart (50 Gflops) storage server
  - benefits: inexpensive PC disks, modern I/O bus
- The cluster advantage
  - 10 µs communication latency vs 10 ms disk access latency provides the opportunity for data declustering, redistribution, aggregation of I/O bandwidth (a declustering sketch follows below)
  - distributed buffering, data processing capability
  - scalable architecture
- Integration issues
  - efficient data declustering, I/O bus bandwidth allocation, remote/local programming interface, external connectivity
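As an illustration only (not the HPVM storage server code), here is a round-robin declustering sketch: fixed-size blocks are striped across nodes and then across each node's disks, so a large sequential read aggregates the bandwidth of many disks. The node/disk counts and stripe unit are assumptions:

#include <stdio.h>

#define NODES           8
#define DISKS_PER_NODE  8
#define BLOCK_SIZE      (256 * 1024)   /* 256 KB stripe unit (assumed) */

typedef struct { int node; int disk; long local_block; } block_loc_t;

/* Map a logical file offset to the node/disk/local block that stores it. */
block_loc_t locate(long file_offset) {
    long blk = file_offset / BLOCK_SIZE;
    block_loc_t loc;
    loc.node = (int)(blk % NODES);                      /* stripe across nodes first */
    loc.disk = (int)((blk / NODES) % DISKS_PER_NODE);   /* then across local disks   */
    loc.local_block = blk / (NODES * DISKS_PER_NODE);   /* position on that disk     */
    return loc;
}

int main(void) {
    for (long off = 0; off < 8L * BLOCK_SIZE; off += BLOCK_SIZE) {
        block_loc_t l = locate(off);
        printf("offset %8ld -> node %d, disk %d, block %ld\n",
               off, l.node, l.disk, l.local_block);
    }
    return 0;
}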
39. Global Picture
- 1GB/s link between the two sites
- 8 parallel Gigabit Ethernet connections
- Ethernet cards installed in some of the nodes on
each machine
40. The Hardware Highlights
- Main features
  - 1.6 TB: 64 x 25 GB UltraATA disks, ~$30K
  - 1 GB/s of aggregate I/O BW (64 disks x 15 MB/s)
  - 45 GB RAM, 48 Gflop/s
  - 2.4 Gb/s Myrinet network
- Challenges
  - make the aggregate I/O bandwidth available to applications
  - balance the I/O load across nodes/disks
  - transport TBs of data in and out of the cluster
41. The Software Components
- Storage Resource Broker (SRB): used for interoperability with existing NPACI applications at SDSC
- Parallel I/O library (e.g. Panda, MPI-IO): provides high performance I/O to code running on the cluster (see the MPI-IO sketch below)
- The HPVM suite: provides support for fast communication and standard APIs on the NT cluster
(Stack diagram: SRB and Panda over MPI, Put/Get, and Global Arrays, over Fast Messages, over Myrinet)
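A minimal MPI-IO sketch showing one of the parallel I/O options named above; this is not code from the project, and the file name and per-rank data size are made up. Each rank writes its own contiguous slice of a shared file, letting the I/O library spread the traffic across the cluster's disks:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const int count = 1 << 20;                  /* 1M doubles per rank (assumed) */
    int rank;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *chunk = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) chunk[i] = rank;    /* fill with sample data */

    MPI_File_open(MPI_COMM_WORLD, "dataset.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_write_at(fh, offset, chunk, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    free(chunk);
    return 0;
}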
42. Related Work
- User-level fast networking
  - VIA list: AM (Fast Sockets) [Culler '92, Rodrigues '97], U-Net (U-Net/MM) [Eicken '95, Welsh '97], VMMC-2 [Li '97]
  - RWCP PM [Tezuka '96], BIP [Prylli '97]
- High-performance cluster-based storage
  - UC Berkeley Tertiary Disks [Talagala '98]
  - CMU Network-Attached Devices [Gibson '97], UCSB Active Disks [Acharya '98]
  - UCLA Randomized I/O (RIO) server [Fabbrocino '98]
  - UC Berkeley River system (Arpaci-Dusseau, unpublished)
  - ANL ROMIO and RIO projects (Foster, Gropp)
43. Conclusions
- HPVM provides all the necessary tools to transform a PC cluster into a production supercomputer
- Projects like HPVM demonstrate
  - the level of maturity achieved so far by cluster technology with respect to conventional HPC utilization
  - a springboard for further research on new uses of the technology
- Efficient component integration at several levels is key to performance
  - tight coupling of the host and the NIC is crucial to minimize communication overhead
  - software layering on top of FM has exposed the need for a client-conscious design at the interface between layers
44. Future Work
- Moving toward a more dynamic model of computation
  - dynamic process creation, interaction between computations
  - communication group management
  - long-term targets are dynamic communication, support for adaptive applications
- Wide-area computing
  - integration within computational grid infrastructure
  - LAN/WAN bridges, remote cluster connectivity
- Cluster applications
  - enhanced-functionality storage, scalable multimedia servers
- Semi-regular network topologies