Title: Summer Institute on Advanced Computation
1. High performance cluster technology: the HPVM experience
- Mario Lauria
- Dept of Computer and Information Science
- The Ohio State University
2. Thank You!
- My thanks to the organizers of SAIC 2000 for the invitation
- It is an honor and a privilege to be here today
3. Acknowledgements
- HPVM is a project of the Concurrent Systems Architecture Group - CSAG (formerly UIUC Dept. of Computer Science, now UCSD Dept. of Computer Science and Engineering)
  - Andrew Chien (Faculty)
  - Phil Papadopoulos (Research Faculty)
  - Greg Bruno, Mason Katz, Caroline Papadopoulos (Research Staff)
  - Scott Pakin, Louis Giannini, Kay Connelly, Matt Buchanan, Sudha Krishnamurthy, Geetanjali Sampemane, Luis Rivera, Oolan Zimmer, Xin Liu, Ju Wang (Graduate Students)
- NT Supercluster collaboration with the NCSA Leading Edge Site
  - Robert Pennington (Technical Program Manager)
  - Mike Showerman, Qian Liu (Systems Programmers)
  - Qian Liu, Avneesh Pant (Systems Engineers)
4. Outline
- The software/hardware interface (FM 1.1)
- The layer-to-layer interface (MPI-FM and FM 2.0)
- A production-grade cluster (NT Supercluster)
- Current status and projects (Storage Server)
5. Motivation for cluster technology
- Gigabit/sec networks: Myrinet, SCI, FC-AL, Giganet, Gigabit Ethernet, ATM
- Killer micros: low-cost Gigaflop processors are here, at a few $K per processor
- Killer networks: Gigabit network hardware, high performance software (e.g. Fast Messages), soon at $100s per connection
- Leverage the HW and commodity SW (Windows NT), build the key technologies
  - high performance computing in a RICH and ESTABLISHED software environment
6. Ideal Model: HPVMs
- HPVM = High Performance Virtual Machine
- Provides a simple uniform programming model, abstracts and encapsulates underlying resource complexity
- Simplifies use of complex resources
7. HPVM Cluster Supercomputers
(Software stack diagram: PGI HPF, MPI, Put/Get, and Global Arrays layered over Fast Messages, over Myrinet and Sockets - HPVM 1.0 released Aug 19, 1997)
- High Performance Cluster Machine (HPVM)
  - standard APIs hiding network topology, non-standard communication sw
- Turnkey Supercomputing Clusters
  - high performance communication, convenient use, coordinated resource management
  - Windows NT and Linux, provides front-end Queueing Mgmt (LSF integrated)
8. Motivation for a new communication software
(Graph: bandwidth vs. message size for a 1 Gbit/s network (Ethernet, Myrinet), 125 MB/s peak, with per-message overhead pushing N1/2 to ~15 KB)
- Killer networks have arrived ...
  - Gigabit links, moderate cost (dropping fast), low latency routers
- however, network software only delivers network performance for large messages
9. Motivation (cont.)
- Problem: most messages are small
- Message size studies
  - < 576 bytes [Gusella '90]
  - 86-99% < 200 B [Kay & Pasquale]
  - 300-400 B avg. size [U. Buffalo monitors]
- => Most messages/applications see little performance improvement. Overhead is the key (LogP, Culler et al. studies); a back-of-the-envelope illustration follows below.
- Communication is an enabling technology: how to fulfill its promise?
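A minimal sketch of my own (not from the slides) of why overhead dominates at the message sizes above, using a LogP-style cost model; the 50 µs software overhead, 5 µs wire latency, and 125 MB/s link rate are assumed round numbers:

/* Back-of-the-envelope model: per-message time T(n) = o + L + n/BW.
 * For small messages the fixed overhead term swamps the transfer term,
 * so raw link bandwidth barely matters. All constants are assumptions. */
#include <stdio.h>

int main(void) {
    const double o_us = 50.0;       /* assumed send+recv software overhead (us) */
    const double L_us = 5.0;        /* assumed wire/switch latency (us) */
    const double bw_MBps = 125.0;   /* 1 Gbit/s link = 125 MB/s */
    const int sizes[] = {200, 4096, 65536};

    for (int i = 0; i < 3; i++) {
        double xfer_us = sizes[i] / bw_MBps;        /* bytes / (MB/s) = us */
        double total_us = o_us + L_us + xfer_us;
        printf("%6d bytes: total %7.1f us, overhead share %4.1f%%\n",
               sizes[i], total_us, 100.0 * (o_us + L_us) / total_us);
    }
    return 0;
}

Under these assumptions a 200-byte message spends over 95% of its time in fixed overhead, which is why the FM work targets overhead rather than peak bandwidth.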
10. Fast Messages Project Goals
- Explore network architecture issues to enable delivery of the underlying hardware performance (bandwidth, latency)
- Delivering performance means
  - considering realistic packet size distributions
  - measuring performance at the application level
- Approach
  - minimize communication overhead
  - hardware/software, multilayer integrated approach
11. Getting performance is hard!
- Slow Myrinet NIC processor (5 MIPS)
- Early I/O bus (Sun's Sbus) not optimized for small transfers
- 24 MB/s bandwidth with PIO, 45 MB/s with DMA
12. Simple Buffering and Flow Control
- Dramatically simplified buffering scheme, still performance critical
- Basic buffering and flow control can be implemented at acceptable cost (a credit-based sketch follows below)
- Integration between the NIC and the host is critical to provide services efficiently
  - critical issues: division of labor, bus management, NIC-host interaction
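To make the idea concrete, here is an illustrative sketch (not the actual FM implementation) of the kind of credit-based flow control a sender can use to keep a fixed pool of receive buffers from overflowing; the names, slot count, and push_to_nic callback are invented for the example:

/* Illustrative credit scheme: the sender consumes one credit per packet and
 * gets credits back when the receiver drains its buffer slots. */
#include <stdint.h>

#define RECV_SLOTS 64                  /* assumed per-peer receive buffer pool */

typedef struct {
    volatile int32_t credits;          /* slots the sender may still consume */
} peer_state_t;

void init_peer(peer_state_t *peer) { peer->credits = RECV_SLOTS; }

/* Send one packet only if a buffer slot is known to be free at the receiver. */
int try_send_packet(peer_state_t *peer, const void *pkt, int len,
                    void (*push_to_nic)(const void *, int)) {
    if (peer->credits == 0)
        return 0;                      /* back-pressure: caller retries later */
    peer->credits--;                   /* consume one receive slot */
    push_to_nic(pkt, len);             /* PIO copy of the packet to the NIC */
    return 1;
}

/* Called when an acknowledgement (e.g. piggybacked on reverse traffic)
 * reports that the receiver has freed 'n' buffer slots. */
void return_credits(peer_state_t *peer, int n) {
    peer->credits += n;
}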
13. FM 1.x Performance (6/95)
- Latency 14 µs, peak BW 21.4 MB/s [Pakin, Lauria et al., Supercomputing '95]
- Hardware limits PIO performance, but N1/2 = 54 bytes (see the note below)
- Delivers 17.5 MB/s @ 128-byte messages (140 Mbit/s, greater than OC-3 ATM can deliver)
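For reference, a short derivation of where N1/2 comes from; the linear per-message cost model is my assumption, while the 21.4 MB/s and 54 bytes are from the bullets above:

T(n) = T_0 + \frac{n}{BW_{peak}}, \qquad BW(n) = \frac{n}{T(n)}
BW(N_{1/2}) = \tfrac{1}{2}\,BW_{peak} \;\Rightarrow\; N_{1/2} = T_0 \cdot BW_{peak}
BW_{peak} = 21.4\ \mathrm{MB/s},\ N_{1/2} = 54\ \mathrm{bytes} \;\Rightarrow\; T_0 \approx 2.5\ \mu\mathrm{s}

In words: N1/2 is the message size at which half of the peak bandwidth is delivered, so a small N1/2 means the fixed per-message cost is tiny.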
14. Illinois Fast Messages 1.x
Sender: FM_send(NodeID, Handler, Buffer, size) // handlers are remote procedures
Receiver: FM_extract()
- API: Berkeley Active Messages
- Key distinctions: guarantees (reliable, in-order, flow control), network-processor decoupling (DMA region)
- Focus on short-packet performance
- Programmed I/O (PIO) instead of DMA
- Simple buffering and flow control
- User-space communication
(A usage sketch of this API follows below.)
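A minimal sketch of the FM 1.x usage pattern shown above. Only FM_send() and FM_extract() appear on the slide; the fm.h header, the FM_NODE type, and the handler signature are assumptions made for the example:

/* Sketch only: one side sends, the other polls and runs handlers. */
#include <stdio.h>
#include "fm.h"                            /* hypothetical FM header */

/* A handler is a remote procedure run on the receiver for each message
 * (signature assumed here). */
static void print_handler(void *buf, int size) {
    printf("received %d bytes: %.*s\n", size, size, (char *)buf);
}

void sender(FM_NODE dest) {
    char msg[128] = "hello";
    /* Enqueue the message; FM guarantees reliable, in-order delivery with
       flow control, so no retransmission logic is needed at this level. */
    FM_send(dest, print_handler, msg, sizeof(msg));
}

void receiver(void) {
    for (;;)
        FM_extract();   /* poll the network and run handlers for arrived messages */
}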
15. The FM layering efficiency issue
- How good is the FM 1.1 API?
- Test: build a user-level library on top of it and measure the available performance
- MPI chosen as a representative user-level library
  - porting of MPICH (ANL/MSU) to FM
- Purpose: to study what services are important in layering communication libraries
  - integration issues: what kind of inefficiencies arise at the interface, and what is needed to reduce them [Lauria & Chien, JPDC 1997]
16. MPI on FM 1.x
- First implementation of MPI on FM was ready in Fall 1995
- Disappointing performance: only a fraction of the FM bandwidth available to MPI applications
17. MPI-FM Efficiency
- Result: FM is fast, but its interface is not efficient
18. MPI-FM layering inefficiencies
- Too many copies due to header attachment/removal,
lack of coordination between transport and
application layers
19. The new FM 2.x API
- Sending
  - FM_begin_message(NodeID, Handler, size), FM_end_message()
  - FM_send_piece(stream, buffer, size) // gather
- Receiving
  - FM_receive(buffer, size) // scatter
  - FM_extract(total_bytes) // receiver flow control
- Implementation based on the use of a lightweight thread for each message received
(A usage sketch of the streaming calls follows below.)
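A sketch of how the streaming calls above fit together. Only the call names and argument lists come from the slide; the fm.h header, the FM_stream/FM_NODE/FM_HANDLER types, the handler signature, and FM_end_message taking the stream are assumptions for the example:

/* Sender: a long buffer is pushed as several gather pieces of one message. */
#include "fm.h"                               /* hypothetical FM 2.x header */

enum { CHUNK = 4096, POLL_BYTES = 65536 };

void send_large(FM_NODE dest, FM_HANDLER h, char *buf, int len) {
    FM_stream *s = FM_begin_message(dest, h, len);
    for (int off = 0; off < len; off += CHUNK) {
        int n = (len - off < CHUNK) ? (len - off) : CHUNK;
        FM_send_piece(s, buf + off, n);       /* gather: no staging copy */
    }
    FM_end_message(s);    /* slide shows FM_end_message(); passing s is assumed */
}

/* Handler: runs on a lightweight thread per incoming message and pulls the
 * data out piecewise (scatter) into wherever the application wants it. */
void large_handler(FM_stream *s, int total) {
    static char sink[1 << 20];                /* assumes total <= 1 MB here */
    int got = 0;
    while (got < total) {
        int n = (total - got < CHUNK) ? (total - got) : CHUNK;
        FM_receive(sink + got, n);
        got += n;
    }
}

/* Receiver main loop: FM_extract() bounds how many bytes of handlers run
 * per call, giving the receiver control over its own flow. */
void progress_loop(void) {
    for (;;)
        FM_extract(POLL_BYTES);
}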
20. MPI-FM 2.x improved layering
(Diagram: the header and the source buffer are gathered on the send side and scattered into the header and destination buffers on the receive side)
- Gather-scatter interface + handler multithreading enables efficient layering, data manipulation without copies (see the layering sketch below)
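A minimal sketch (not MPICH code) of how an MPI-style library can layer on the gather/scatter API without extra copies: the envelope travels as its own piece, and the payload is scattered straight into the matched receive buffer. The envelope fields and match_posted_recv() are invented for the example:

#include "fm.h"                                    /* hypothetical header */

typedef struct { int src, tag, len; } envelope_t;  /* assumed MPI-ish envelope */

void *match_posted_recv(int src, int tag, int len);    /* assumed queue lookup */

void mpi_like_send(FM_NODE dest, FM_HANDLER h,
                   int myrank, int tag, void *buf, int len) {
    envelope_t env = { myrank, tag, len };
    FM_stream *s = FM_begin_message(dest, h, sizeof(env) + len);
    FM_send_piece(s, &env, sizeof(env));   /* header: no prepend-copy needed */
    FM_send_piece(s, buf, len);            /* user data sent in place */
    FM_end_message(s);
}

void mpi_like_handler(FM_stream *s, int total) {
    envelope_t env;
    FM_receive(&env, sizeof(env));                         /* read header first */
    void *dst = match_posted_recv(env.src, env.tag, env.len);
    FM_receive(dst, env.len);             /* payload lands in the user's buffer */
}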
21. MPI on FM 2.x
(Graph: MPI-FM bandwidth vs. message size)
- MPI-FM: 91 MB/s, 13 µs latency, 4 µs overhead
- Short messages much better than the IBM SP2; PCI limited
- Latency comparable to the SGI O2K
22. MPI-FM 2.x Efficiency
(Graph: transfer efficiency vs. message size)
- High transfer efficiency, approaches 100% [Lauria, Pakin et al., HPDC-7 '98]
- Other systems much lower even at 1 KB (100 Mbit: ~40%, 1 Gbit: ~5%)
23. MPI-FM at work: the NCSA NT Supercluster
- 192 Pentium II, April 1998, 77 Gflops
- 3-level fat tree (large switches), scalable bandwidth, modular extensibility
- 256 Pentium II and III, June 1999, 110 Gflops (UIUC), w/ NCSA
- 512 x Merced, early 2001, Teraflop performance (@ NCSA)
24. The NT Supercluster at NCSA
- Andrew Chien, CS UIUC -> UCSD
- Rob Pennington, NCSA
- Myrinet network, HPVM, Fast Messages
- Microsoft NT OS, MPI API, etc.
- 192 Hewlett-Packard nodes, 300 MHz
- 64 Compaq nodes, 333 MHz
25. HPVM III
26. MPI applications on the NT Supercluster
- Zeus-MP (192P, Mike Norman)
- ISIS (192P, Robert Clay)
- ASPCG (128P, Danesh Tafti)
- Cactus (128P, Paul Walker/John Shalf/Ed Seidel)
- QMC (128P, Lubos Mitas)
- Boeing CFD test codes (128P, David Levine)
- Others (no graphs)
  - SPRNG (Ashok Srinivasan), Gamess, MOPAC (John McKelvey), freeHEP (Doug Toussaint), AIPS (Dick Crutcher), Amber (Balaji Veeraraghavan), Delphi/Delco codes, parallel sorting
- => No code retuning required (generally) after recompiling with MPI-FM
27. Solving 2D Navier-Stokes Kernel - Performance of Scalable Systems
Preconditioned Conjugate Gradient method with multi-level additive Schwarz Richardson pre-conditioner (2D, 1024x1024)
Danesh Tafti, Rob Pennington (NCSA), Andrew Chien (UIUC, UCSD)
28. NCSA NT Supercluster Solving the Navier-Stokes Kernel
Single-processor performance: MIPS R10k 117 MFLOPS, Intel Pentium II 80 MFLOPS
Preconditioned Conjugate Gradient method with multi-level additive Schwarz Richardson pre-conditioner (2D, 1024x1024)
Danesh Tafti, Rob Pennington, Andrew Chien (NCSA)
29. Solving 2D Navier-Stokes Kernel (cont.)
Preconditioned Conjugate Gradient method with multi-level additive Schwarz Richardson pre-conditioner (2D, 4094x4094)
- Excellent scaling to 128P; single precision 25% faster
Danesh Tafti, Rob Pennington (NCSA), Andrew Chien (UIUC, UCSD)
30. Near-Perfect Scaling of Cactus - 3D Dynamic Solver for the Einstein GR Equations
Cactus was developed by Paul Walker, MPI-Potsdam, UIUC, NCSA
Ratio of GFLOPS: Origin 2.5x NT SC
Paul Walker, John Shalf, Rob Pennington, Andrew Chien (NCSA)
31. Quantum Monte Carlo: Origin and HPVM Cluster
- Origin is about 1.7x faster than the NT SC
T. Torelli (UIUC CS), L. Mitas (NCSA, Alliance Nanomaterials Team)
32. Supercomputer Performance Characteristics

System                  Mflops/Proc   Flops/Byte   Flops/Network RT
Cray T3E                1200          2            2,500
SGI Origin2000          500           0.5          1,000
HPVM NT Supercluster    300           3.2          6,000
Berkeley NOW II         100           3.2          2,000
IBM SP2                 550           3.7          38,000
Beowulf (100 Mbit)      300           25           500,000

- Compute/communicate and compute/latency ratios (a sample calculation for the HPVM row follows below)
- Clusters can provide programmable characteristics at a dramatically lower system cost
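One way to read the HPVM row, as a back-of-the-envelope check of my own: per-processor peak divided by delivered MPI bandwidth gives flops/byte, and peak times round-trip latency gives flops per network round trip. The exact inputs below (≈94 MB/s, ≈20 µs round trip) are assumptions drawn from the MPI-FM numbers on earlier slides:

#include <stdio.h>

int main(void) {
    double peak_mflops = 300.0;     /* Pentium II node, Mflops */
    double mpi_bw_MBps = 94.0;      /* approx. MPI-FM bandwidth, MB/s (assumed) */
    double rt_latency_us = 20.0;    /* approx. 2 x 9.6 us one-way MPI latency */

    printf("flops/byte       = %.1f\n", peak_mflops / mpi_bw_MBps);
    printf("flops/network RT = %.0f\n", peak_mflops * rt_latency_us);
    return 0;
}

With these inputs the program prints roughly 3.2 flops/byte and 6,000 flops per round trip, matching the table entries.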
33. HPVM today: HPVM 1.9
- Added support for
  - shared memory
  - VIA interconnect
- New API
  - BSP
34. Show me the numbers!
- Basics
  - Myrinet
    - FM: 100 MB/sec, 8.6 µsec latency
    - MPI: 91 MB/sec @ 64K, 9.6 µsec latency
    - approximately 10% overhead
  - Giganet
    - FM: 81 MB/sec, 14.7 µsec latency
    - MPI: 77 MB/sec, 18.6 µsec latency
    - 5% BW overhead, 26% latency!
  - Shared memory transport
    - FM: 195 MB/sec, 3.13 µsec latency
    - MPI: 85 MB/sec, 5.75 µsec latency
(A generic ping-pong sketch of how numbers like these are measured follows below.)
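For context, a generic MPI ping-pong micro-benchmark of the kind typically used to produce latency and bandwidth figures like those above; this is my own sketch, not the benchmark actually used for the slides:

/* Run with 2 ranks; half the average round-trip time is the latency. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const int iters = 1000;
    const int size = (argc > 1) ? atoi(argv[1]) : 8;   /* message size, bytes */
    char *buf = calloc(size, 1);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt_us = (MPI_Wtime() - t0) / iters * 1e6;

    if (rank == 0)
        printf("%d bytes: half-RTT %.2f us, bandwidth %.1f MB/s\n",
               size, rtt_us / 2, size / (rtt_us / 2));
    MPI_Finalize();
    free(buf);
    return 0;
}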
35. Bandwidth Graphs
- FM bandwidth usually a good indicator of deliverable bandwidth
- High BW attained for small messages
36. Other HPVM-related projects
- About three hundred groups have downloaded HPVM 1.2 at the last count
- Some interesting research projects
  - low-level support for collective communication, OSU
  - FM with multicast (FM-MC), Vrije Universiteit, Amsterdam
  - video server on demand, Univ. of Naples
- Together with AM, U-Net and VMMC, FM has been the inspiration for the VIA industrial standard by Intel, Compaq, and Microsoft
- The latest release of HPVM is available from http://www-csag.ucsd.edu
37. Current project: an HPVM-based Terabyte Storage Server
- High performance parallel architectures increasingly associated with data-intensive applications
- NPACI large-dataset applications requiring 100s of GB
  - Digital Sky Survey, brain waves analysis
- digital data repositories, web indexing, multimedia servers
  - Microsoft TerraServer, Altavista, RealPlayer/Windows Media servers (Audionet, CNN), streamed audio/video
- genomic and proteomic research
  - large centralized data banks (GenBank, SwissProt, PDB, ...)
- Commercial terabyte systems (Storagetek, EMC) have price tags in the $M range
38. The HPVM approach to a Terabyte Storage Server
- Exploit commodity PC technologies to build a large (2 TB) and smart (50 Gflops) storage server
  - benefits: inexpensive PC disks, modern I/O bus
- The cluster advantage
  - 10 µs communication latency vs 10 ms disk access latency provides the opportunity for data declustering, redistribution, aggregation of I/O bandwidth (a declustering sketch follows below)
  - distributed buffering, data processing capability
  - scalable architecture
- Integration issues
  - efficient data declustering, I/O bus bandwidth allocation, remote/local programming interface, external connectivity
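As an illustration only (not the HPVM storage server code), here is a round-robin declustering sketch: fixed-size blocks are striped across nodes and then across each node's disks, so a large sequential read aggregates the bandwidth of many disks. The node/disk counts and stripe unit are assumptions:

#include <stdio.h>

#define NODES           8
#define DISKS_PER_NODE  8
#define BLOCK_SIZE      (256 * 1024)   /* 256 KB stripe unit (assumed) */

typedef struct { int node; int disk; long local_block; } block_loc_t;

/* Map a logical file offset to the node/disk/local block that stores it. */
block_loc_t locate(long file_offset) {
    long blk = file_offset / BLOCK_SIZE;
    block_loc_t loc;
    loc.node = (int)(blk % NODES);                      /* stripe across nodes first */
    loc.disk = (int)((blk / NODES) % DISKS_PER_NODE);   /* then across local disks   */
    loc.local_block = blk / (NODES * DISKS_PER_NODE);   /* position on that disk     */
    return loc;
}

int main(void) {
    for (long off = 0; off < 8L * BLOCK_SIZE; off += BLOCK_SIZE) {
        block_loc_t l = locate(off);
        printf("offset %8ld -> node %d, disk %d, block %ld\n",
               off, l.node, l.disk, l.local_block);
    }
    return 0;
}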
39. Global Picture
- 1GB/s link between the two sites
- 8 parallel Gigabit Ethernet connections
- Ethernet cards installed in some of the nodes on
each machine
40. The Hardware Highlights
- Main features
  - 1.6 TB: 64 x 25 GB UltraATA disks, ~$30K
  - 1 GB/s of aggregate I/O BW (64 disks x 15 MB/s)
  - 45 GB RAM, 48 Gflop/s
  - 2.4 Gb/s Myrinet network
- Challenges
  - make the aggregate I/O bandwidth available to applications
  - balance the I/O load across nodes/disks
  - transport TBs of data in and out of the cluster
41. The Software Components
- Storage Resource Broker (SRB): used for interoperability with existing NPACI applications at SDSC
- Parallel I/O library (e.g. Panda, MPI-IO): provides high performance I/O to code running on the cluster (see the MPI-IO sketch below)
- The HPVM suite: provides support for fast communication and standard APIs on the NT cluster
(Stack diagram: SRB and Panda over MPI, Put/Get, and Global Arrays, over Fast Messages, over Myrinet)
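A minimal MPI-IO sketch showing one of the parallel I/O options named above; this is not code from the project, and the file name and per-rank data size are made up. Each rank writes its own contiguous slice of a shared file, letting the I/O library spread the traffic across the cluster's disks:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const int count = 1 << 20;                  /* 1M doubles per rank (assumed) */
    int rank;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *chunk = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) chunk[i] = rank;    /* fill with sample data */

    MPI_File_open(MPI_COMM_WORLD, "dataset.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_write_at(fh, offset, chunk, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    free(chunk);
    return 0;
}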
42. Related Work
- User-level fast networking
  - VIA list: AM (Fast Sockets) [Culler '92, Rodrigues '97], U-Net (U-Net/MM) [Eicken '95, Welsh '97], VMMC-2 [Li '97]
  - RWCP PM [Tezuka '96], BIP [Prylli '97]
- High-performance cluster-based storage
  - UC Berkeley Tertiary Disks [Talagala '98]
  - CMU Network-Attached Devices [Gibson '97], UCSB Active Disks [Acharya '98]
  - UCLA Randomized I/O (RIO) server [Fabbrocino '98]
  - UC Berkeley River system (Arpaci-Dusseau, unpublished)
  - ANL ROMIO and RIO projects (Foster, Gropp)
43. Conclusions
- HPVM provides all the necessary tools to transform a PC cluster into a production supercomputer
- Projects like HPVM demonstrate
  - the level of maturity achieved so far by cluster technology with respect to conventional HPC utilization
  - a springboard for further research on new uses of the technology
- Efficient component integration at several levels is key to performance
  - tight coupling of the host and the NIC is crucial to minimize communication overhead
  - software layering on top of FM has exposed the need for a client-conscious design at the interface between layers
44. Future Work
- Moving toward a more dynamic model of computation
  - dynamic process creation, interaction between computations
  - communication group management
  - long-term targets are dynamic communication, support for adaptive applications
- Wide-area computing
  - integration within computational grid infrastructure
  - LAN/WAN bridges, remote cluster connectivity
- Cluster applications
  - enhanced-functionality storage, scalable multimedia servers
- Semi-regular network topologies