Title: Supercomputing on Windows Clusters: Experience and Future Directions
1. Supercomputing on Windows Clusters: Experience and Future Directions
- Andrew A. Chien
- CTO, Entropia, Inc.
- SAIC Chair Professor
- Computer Science and Engineering, UCSD
- National Computational Science Alliance
- Invited Talk, USENIX Windows, August 4, 2000
2. Overview
- Critical Enabling Technologies
- The Alliance's Windows Supercluster
- Design and Performance
- Other Windows Cluster Efforts
- Future
- Terascale Clusters
- Entropia
3. External Technology Factors
4. Microprocessor Performance
[Chart: microprocessor performance by year introduced; DEC Alpha and x86/Alpha curves]
- Micros: 10 MF → 100 MF → 1 GF → 3 GF → 6 GF (2001?)
- → Memory system performance catching up (2.6 GB/s 21264 memory BW)
Adapted from Baskett, SGI and CSC Vanguard
5. Killer Networks
[Chart: delivered bandwidths: Ethernet 1 MB/s, FastE 12 MB/s, UW SCSI 40 MB/s, GigSAN/GigE 110 MB/s]
- LAN: 10 Mb/s → 100 Mb/s → ?
- SAN: 12 MB/s → 110 MB/s (Gbps) → 1100 MB/s → ?
- Myricom, Compaq, Giganet, Intel, ...
- Network bandwidths limited by system internal memory bandwidths
- Cheap and very fast communication hardware
6. Rich Desktop Operating System Environments
[Timeline 1981 → 1999: basic device access; HD storage, networks; graphical interfaces, audio/graphics; multiprocess protection, SMP support; clustering, performance, mass store, HP networking, management, availability, etc.]
- Desktop (PC) operating systems now provide
- richest OS functionality
- best program development tools
- broadest peripheral/driver support
- broadest application software/ISV support
7. Critical Enabling Technologies
8. Critical Enabling Technologies
- Cluster management and resource integration (use like one system)
- Delivered communication performance
- IP protocols inappropriate
- Balanced systems
- Memory bandwidth
- I/O capability
9. The HPVM System
- Goals
- Enable tightly coupled and distributed clusters with high efficiency and low effort (integrated solution)
- Provide usable access thru convenient standard parallel interfaces
- Deliver highest possible performance and simple programming model
10. Delivered Communication Performance
- Early 1990s, Gigabit testbeds
- 500 Mbits (60 MB/s) @ 1 MegaByte packets
- IP protocols not for Gigabit SANs
- Cluster Objective: high performance communication for small and large messages
- Performance Balance Shift: networks faster than I/O, memory, processor
11. Fast Messages Design Elements
- User-level network access
- Lightweight protocols
- flow control, reliable delivery
- tightly-coupled link, buffer, and I/O bus management
- Poll-based notification
- Streaming API for efficient composition (see the sketch below)
- Many generations, 1994-1999
- IEEE Concurrency, 6/97
- Supercomputing 95, 12/95
- Related efforts: UCB AM, Cornell U-Net, RWCP PM, Princeton VMMC/Shrimp, Lyon BIP → VIA standard
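The streaming API and poll-based notification are easier to see in code. Below is a minimal C sketch of how a sender composes a message in pieces and a receiver polls for it with an FM-style interface; the fm_* names, handler shape, and signatures are hypothetical stand-ins modeled loosely on the published FM 2.x design, not the actual HPVM headers.

    /* Sketch only: the fm_* declarations are hypothetical stand-ins for an
     * FM 2.x-style streaming interface, not the real HPVM library. */
    #include <stddef.h>
    #include <stdio.h>

    typedef struct fm_stream fm_stream;              /* opaque per-message stream */
    typedef void (*fm_handler)(fm_stream *s, unsigned src, size_t len);

    /* Assumed library entry points (declarations only, for illustration). */
    fm_stream *fm_begin_message(unsigned dest, size_t len, fm_handler h);
    void       fm_send_piece(fm_stream *s, const void *buf, size_t len); /* gather  */
    void       fm_end_message(fm_stream *s);
    void       fm_receive(void *buf, fm_stream *s, size_t len);          /* scatter */
    void       fm_extract(size_t max_bytes);         /* poll: run pending handlers */

    /* Receiver-side handler: pulls header and payload straight out of the
     * stream, so no intermediate buffering or copying is needed. */
    static void on_message(fm_stream *s, unsigned src, size_t len) {
        int    header;
        double payload[64];
        fm_receive(&header, s, sizeof header);
        fm_receive(payload, s, len - sizeof header);
        printf("message from node %u, tag %d, %zu bytes total\n", src, header, len);
    }

    void send_example(unsigned dest) {
        int    header = 42;
        double payload[64] = {0};
        /* Compose the message in pieces; the layer streams each piece onto
         * the wire, overlapping packetization with the sender's work. */
        fm_stream *s = fm_begin_message(dest, sizeof header + sizeof payload, on_message);
        fm_send_piece(s, &header, sizeof header);
        fm_send_piece(s, payload, sizeof payload);
        fm_end_message(s);
    }

    void progress_loop(void) {
        /* Poll-based notification: the host decides when to extract
         * messages instead of taking interrupts. */
        for (;;)
            fm_extract(4096);
    }

The point of the sketch is that the application, not an interrupt handler, drives message extraction, and that gather on the send side and scatter on the receive side avoid intermediate copies.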
12. Improved Bandwidth
- 20 MB/s → 200 MB/s (10x)
- Much of the advance is software structure: APIs and implementation
- Delivers all of the underlying hardware performance
13. Improved Latency
- 100 µs to 2 µs overhead (50x)
- Careful design to minimize overhead while maintaining throughput
- Efficient event handling, fine-grained resource management, and interlayer coordination
- Delivers all of the underlying hardware performance
14. HPVM Cluster Supercomputers
[Diagram: HPVM software architecture; standard APIs (MPI, Put/Get, Global Arrays, BSP), Scheduling and Mgmt (LSF), and Performance Tools layered over Fast Messages, which runs over multiple transports (Myrinet, ServerNet, Giganet VIA, SMP, WAN)]
- HPVM 1.0 (8/1997); HPVM 1.2 (2/1999): multi, dynamic, install; HPVM 1.9 (8/1999): Giganet, SMP
- Turnkey Cluster Computing with Standard APIs (a stock MPI code runs unchanged; see the sketch below)
- Network hardware and APIs increase leverage for users, achieve critical mass for the system
- Each involved new research challenges and provided deeper insights into the research issues
- Drove continually better solutions (e.g. multi-transport integration, robust flow control and queue management)
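Because the top of the stack is standard MPI (the MPI box in the diagram), a user's program is ordinary MPI with no HPVM-specific calls; the Myrinet/VIA/SMP transports are selected underneath. A minimal, generic example of the kind of code that runs unchanged on such a cluster:

    /* Plain, standard MPI: nothing here is HPVM-specific. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process contributes one value; sum them on every rank. */
        double local = (double)rank, total = 0.0;
        MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("%d processes, sum of ranks = %.0f\n", size, total);

        MPI_Finalize();
        return 0;
    }

Such a program is simply built against the cluster's MPI library and launched through the site's job scheduler (LSF on the clusters described in this talk).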
15. HPVM Communication Performance
- Delivers underlying performance for small messages; endpoints are the limits
- 100 MB/s at 1K vs 60 MB/s at 1000K
- >1500x improvement
16. HPVM/FM on VIA
- FM protocol/techniques portable to Giganet VIA
- Slightly lower performance, comparable N1/2
- Commercial version: WSDI (stay tuned)
17. Unified Transfer and Notification (all transports)
[Diagram: receive region of fixed-size frames at increasing addresses, shared by processes and networks; each frame holds variable-size data followed by a fixed-size trailer containing a length/flag word]
- Solution: uniform notify and poll (single queue representation)
- Scalability: n into k (hash); arbitrary SMP size or number of NIC cards
- Key: integrate variable-sized messages, achieve single DMA transfer
- No pointer-based memory management, no special synchronization primitives, no complex computation
- Memory format provides atomic notification in a single contiguous memory transfer (bcopy or DMA); see the sketch below
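To make the frame/trailer idea concrete, here is a small, self-contained C sketch of a receive queue of fixed-size frames: the sender writes the variable-size payload and then a single trailer word holding the length and a full flag, so one contiguous copy (or DMA) both delivers the data and performs the notification, and the receiver only polls the trailer of the next expected frame. Field layout, sizes, and names are illustrative assumptions, not the HPVM implementation.

    /* Illustrative fixed-size-frame receive queue with trailer-based
     * notification; layout, sizes, and names are assumptions. */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define FRAME_SIZE   256                 /* fixed frame size            */
    #define NUM_FRAMES   64
    #define PAYLOAD_MAX  (FRAME_SIZE - sizeof(uint32_t))
    #define FLAG_FULL    0x80000000u         /* high bit = frame is ready   */

    typedef struct {
        uint8_t  payload[PAYLOAD_MAX];       /* variable-size data ...      */
        volatile uint32_t trailer;           /* ... then length + full flag */
    } frame_t;

    static frame_t queue[NUM_FRAMES];        /* frames at increasing addresses */

    /* Sender side: contiguous write of payload followed by the trailer.
     * Writing the trailer word last is what makes the notification atomic. */
    static void deposit(unsigned slot, const void *data, uint32_t len) {
        frame_t *f = &queue[slot % NUM_FRAMES];
        memcpy(f->payload, data, len);
        f->trailer = FLAG_FULL | len;        /* one word flips the frame to 'full' */
    }

    /* Receiver side: poll only the trailer of the next expected frame.
     * No pointer chasing, no locks, no per-transport special cases. */
    static uint32_t poll_next(unsigned slot, void *out) {
        frame_t *f = &queue[slot % NUM_FRAMES];
        uint32_t t = f->trailer;
        if (!(t & FLAG_FULL))
            return 0;                        /* nothing has arrived yet */
        uint32_t len = t & ~FLAG_FULL;
        memcpy(out, f->payload, len);
        f->trailer = 0;                      /* recycle the frame */
        return len;
    }

    int main(void) {
        char buf[PAYLOAD_MAX];
        deposit(0, "hello, frame 0", 15);    /* stand-in for a NIC DMA or SMP bcopy */
        uint32_t n = poll_next(0, buf);
        printf("received %u bytes: %s\n", n, buf);
        return 0;
    }

The same memory format works whether the frame is filled by a NIC DMA or by a bcopy from another process on an SMP, which is what lets one polling loop cover all transports.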
18. Integrated Notification Results

                            Single Transport   Integrated
  Myrinet (latency)         8.3 µs             8.4 µs
  Myrinet (BW)              101 MB/s           101 MB/s
  Shared Memory (latency)   3.4 µs             3.5 µs
  Shared Memory (BW)        200 MB/s           200 MB/s

- No polling or discontiguous-access performance penalties
- Uniform high performance which is stable over changes of configuration or the addition of new transports
- No custom tuning for configuration required
- Framework is scalable to large numbers of SMP processors and network interfaces
19. Supercomputer Performance Characteristics (11/99)

                            MF/Proc      Flops/Byte   Flops/NetworkRT
  Cray T3E                  1200         2            2,500
  SGI Origin2000            500          0.5          1,000
  HPVM NT Supercluster      600          8            12,000
  IBM SP2 (4- or 8-way)     2.6-5.2 GF   12-25        150-300K
  Beowulf (100 Mbit)        600          50           200,000

(A rough derivation of the HPVM row's ratios follows below.)
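As a back-of-the-envelope check on how the table's ratios relate to machine parameters, take the HPVM row at 600 MF per processor and assume roughly 75 MB/s of delivered bandwidth per processor and a network round trip around 20 µs; both communication figures are inferred from the table itself rather than quoted, so treat this only as a consistency check.

    /* Back-of-the-envelope: how Flops/Byte and Flops/NetworkRT follow from
     * per-processor compute rate, delivered bandwidth, and round-trip time.
     * The 75 MB/s and 20 us figures are inferred assumptions, not quoted. */
    #include <stdio.h>

    int main(void) {
        double mflops_per_proc = 600.0;        /* HPVM NT Supercluster row    */
        double mbytes_per_sec  = 75.0;         /* assumed delivered BW/proc   */
        double rt_microsec     = 20.0;         /* assumed network round trip  */

        double flops_per_byte = mflops_per_proc / mbytes_per_sec;
        double flops_per_rt   = mflops_per_proc * rt_microsec;  /* MF/s * us = flops */

        printf("Flops/Byte      ~ %.0f\n", flops_per_byte);     /* ~8       */
        printf("Flops/NetworkRT ~ %.0f\n", flops_per_rt);       /* ~12,000  */
        return 0;
    }

Higher Flops/Byte and Flops/NetworkRT mean more computation is needed per unit of communication to keep the machine busy, which is why the cluster sits between the tightly coupled T3E/Origin and a 100 Mbit Beowulf.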
20. The NT Supercluster
21. Windows Clusters
- Early prototypes in CSAG
- 1/1997, 30P, 6GF
- 12/1997, 64P, 20GF
- The Alliance's Supercluster
- 4/1998, 256P, 77GF
- 6/1999, 256P, 109GF
22. NCSA's Windows Supercluster
Engineering Fluid Flow Problem
#207 in Top 500 Supercomputing Sites
D. Tafti, NCSA
Rob Pennington (NCSA), Andrew Chien (UCSD)
23. Windows Cluster System
[Diagram: cluster layout]
- Front-end systems (apps development, job submission), connected to the Internet
- File servers and LSF master: 128 GB home, 200 GB scratch; FTP to mass storage, daily backups; LSF batch job scheduler
- Fast Ethernet connecting front end, file servers, and compute nodes
- 128 compute nodes, 256 CPUs: 128 dual 550 MHz systems running Windows NT, Myrinet and HPVM
- Infrastructure and development testbeds: Windows 2K and NT; 8 4-processor 550 MHz, 32 2-processor 300 MHz, 8 2-processor 333 MHz
(courtesy Rob Pennington, NCSA)
24. Example Application Results
- MILC QCD
- Navier-Stokes Kernel
- Zeus-MP Astrophysics CFD
- Large-scale Science and Engineering codes
- Comparisons to SGI O2K and Linux clusters
25. MILC Performance
Source: D. Toussaint and K. Orginos, Arizona
26. Zeus-MP (Astrophysics CFD)
27. 2D Navier-Stokes Kernel
Source: Danesh Tafti, NCSA
28. Applications with High Performance on Windows Supercluster
- Zeus-MP (256P, Mike Norman)
- ISIS (192P, Robert Clay)
- ASPCG (256P, Danesh Tafti)
- Cactus (256P, Paul Walker/John Shalf/Ed Seidel)
- MILC QCD (256P, Lubos Mitas)
- QMC Nanomaterials (128P, Lubos Mitas)
- Boeing CFD Test Codes, CFD Overflow (128P, David Levine)
- freeHEP (256P, Doug Toussaint)
- ARPI3D (256P, weather code, Dan Weber)
- GMIN (L. Munro, in K. Jordan's group)
- DSMC-MEMS (Ravaioli)
- FUN3D with PETSc (Kaushik)
- SPRNG (Srinivasan)
- MOPAC (McKelvey)
- Astrophysical N body codes (Bode)
- → Little code retuning, and quickly running ...
- Parallel Sorting (Rivera, CSAG)
29. MinuteSort
- Sort the maximum data disk-to-disk in 1 minute
- Indy sort
- fixed-size keys, special sorter, and file format
- HPVM/Windows Cluster winner for 1999 (10.3 GB) and 2000 (18.3 GB)
- Adaptation of Berkeley NOWSort code (Arpaci and Dusseau)
- Commodity configuration (not a metric)
- PCs, IDE disks, Windows
- HPVM and 1Gbps Myrinet
30. MinuteSort Architecture
[Diagram: HPVM over 1 Gbps Myrinet connecting 32 HP Kayaks (3Ware controllers, 4 x 20 GB IDE disks each) and 32 HP Netservers (2 x 16 GB SCSI disks each)]
(Luis Rivera UIUC, Xianan Zhang UCSD)
31. Sort Scaling
- Concurrent read/bucket-sort/communicate is the bottleneck; faster I/O infrastructure required (busses and memory, not disks); see the sketch below
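A minimal sketch of the read/bucket/communicate step that dominates: each record is routed to a per-node bucket by the leading byte of its key in a single pass over the input. This is a simplified, single-process illustration in the spirit of a NOWSort-style partition, not the MinuteSort code itself; the record format, bucket sizing, and key-to-node mapping are assumptions.

    /* Simplified one-pass bucket partition: route each record to a per-node
     * bucket by the high bits of its key.  Illustration only. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>

    #define NODES      32                     /* e.g. 32 sorting nodes          */
    #define REC_SIZE   100                    /* typical sort-benchmark record  */
    #define KEY_BYTES  10

    typedef struct { uint8_t bytes[REC_SIZE]; } record_t;

    typedef struct {
        record_t *recs;
        size_t    count, cap;
    } bucket_t;

    /* Destination node = top bits of the key (keys assumed uniform). */
    static unsigned dest_node(const record_t *r) {
        return (unsigned)r->bytes[0] * NODES / 256;
    }

    static void bucket_push(bucket_t *b, const record_t *r) {
        if (b->count == b->cap) {
            b->cap = b->cap ? b->cap * 2 : 1024;
            record_t *tmp = realloc(b->recs, b->cap * sizeof *b->recs);
            if (!tmp) { perror("realloc"); exit(1); }
            b->recs = tmp;
        }
        b->recs[b->count++] = *r;
    }

    int main(void) {
        static bucket_t buckets[NODES];       /* one outgoing bucket per node */
        record_t r;
        memset(&r, 0, sizeof r);

        /* Stand-in for the concurrent disk read: generate random records. */
        for (int i = 0; i < 100000; i++) {
            for (int k = 0; k < KEY_BYTES; k++) r.bytes[k] = rand() & 0xff;
            bucket_push(&buckets[dest_node(&r)], &r);
            /* In the real pipeline, full buckets would be shipped over
             * HPVM/Myrinet here, overlapping I/O, partitioning, and sends. */
        }

        for (int n = 0; n < NODES; n++)
            printf("node %2d: %zu records\n", n, buckets[n].count);
        return 0;
    }

Because the reads, the partitioning, and the network sends all run concurrently, bus and memory bandwidth, rather than the disks, become the limit, as noted above.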
32. MinuteSort Execution Time
33. Reliability
- Gossip: Windows platforms are not reliable
- Larger systems → intolerably low MTBF
- Our experience: nodes don't crash
- Application runs of 1000s of hours
- Node failure means an application failure; effectively not a problem
- Hardware
- Short term: infant mortality (1-month burn-in)
- Long term
- 1 hardware problem/100 machines/month
- Disks, network interfaces, memory
- No processor or motherboard problems.
34. Windows Cluster Usage
- Lots of large jobs
- Runs up to 14,000 CPU-hours (64 procs x 9 days)
35. Other Large Windows Clusters
- Sandia's Kudzu Cluster (144 procs, 550 disks, 10/98)
- Cornell's AC3 Velocity Cluster (256 procs, 8/99)
- Others (sampled from vendors)
- GE Research Labs (16, Scientific)
- Boeing (32, Scientific)
- PNNL (96, Scientific)
- Sandia (32, Scientific)
- NCSA (32, Scientific)
- Rice University (16, Scientific)
- U. of Houston (16, Scientific)
- U. of Minnesota (16, Scientific)
- Oil & Gas (8, Scientific)
- Merrill Lynch (16, Ecommerce)
- UIT (16, ASP/Ecommerce)
36. The AC3 Velocity
- 64 Dell PowerEdge 6350 Servers
- Quad Pentium III 500 MHz/2 MB Cache Processors (SMP)
- 4 GB RAM/Node
- 50 GB Disk (RAID 0)/Node
- GigaNet Full Interconnect
- 100 MB/sec Bandwidth between any 2 Nodes
- Very Low Latency
- 2 Terabytes Dell PowerVault 200S Storage
- 2 Dell PowerEdge 6350 Dual Processor File Servers
- 4 PowerVault 200S Units/File Server
- 8 x 36 GB Disk Drives/PowerVault 200S
- Quad Channel SCSI RAID Adapter
- 180 MB/sec Sustained Throughput/Server
- 2 Terabyte PowerVault 130T Tape Library
- 4 DLT 7000 Tape Drives
- 28-Tape Capacity
#381 in Top 500 Supercomputing Sites
(courtesy David A. Lifka, Cornell TC)
37. Recent AC3 Additions
- 8 Dell PowerEdge 2450 Servers (Serial Nodes)
- Pentium III 600 MHz/512 KB Cache
- 1 GB RAM/Node
- 50 GB Disk (RAID 0)/Node
- 7 Dell PowerEdge 2450 Servers (First All-NT-Based AFS Cell)
- Dual Processor Pentium III 600 MHz/512 KB Cache
- 1 GB RAM/Node Fileservers, 512 MB RAM/Node Database Servers
- 1 TB SCSI-based RAID 5 Storage
- Cross-platform filesystem support
- 64 Dell PowerEdge 2450 Servers (Protein Folding, Fracture Analysis)
- Dual Processor Pentium III 733 MHz/256 KB Cache
- 2 GB RAM/Node
- 27 GB Disk (RAID 0)/Node
- Full Giganet Interconnect
- 3 Intel ES6000 + 1 ES1000 Gigabit switches
- Upgrading our Server Backbone network to Gigabit Ethernet
(courtesy David A. Lifka, Cornell TC)
38. AC3 Goals
- Only commercially supported technology
- Rapid spinup and spinout
- Package technologies for vendors to sell as integrated systems
- → All of the commercial packages were moved from the SP2 to Windows; all users are back, and more!
- Users: "I don't do Windows" → "I'm agnostic about operating systems, and just focus on getting my work done."
39. Protein Folding
Reaction path study of ligand diffusion in leghemoglobin. The ligand is CO (white) and it is moving from the binding site, the heme pocket, to the protein exterior. A study by Wieslaw Nowak and Ron Elber.
The cooperative motion of ion and water through the gramicidin ion channel. The effective quasi-particle that permeates through the channel includes eight water molecules and the ion. Work of Ron Elber with Bob Eisenberg, Danuta Rojewska and Duan Pin.
http://www.tc.cornell.edu/reports/NIH/resource/CompBiologyTools/
(courtesy David A. Lifka, Cornell TC)
40. Protein Folding Per-Processor Performance
Results on different computers for α protein structures
Results on different computers for α/β or β proteins
(courtesy David A. Lifka, Cornell TC)
41. AC3 Corporate Members
- Air Products and Chemicals
- Candle Corporation
- Compaq Computer Corporation
- Conceptual Reality Presentations
- Dell Computer Corporation
- Etnus, Inc.
- Fluent, Inc.
- Giganet, Inc.
- IBM Corporation
- ILOG, Inc.
- Intel Corporation
- KLA-Tencor Corporation
- Kuck & Associates, Inc.
- Lexis-Nexis
- MathWorks, Inc.
- Microsoft Corporation
- MPI Software Technologies, Inc.
- Numerical Algorithms Group
- Portland Group, Inc.
- Reed Elsevier, Inc.
- Reliable Network Solutions, Inc.
- SAS Institute, Inc.
- Seattle Lab, Inc.
- Visual Numerics, Inc.
- Wolfram Research, Inc.
(courtesy David A. Lifka, Cornell TC)
42. Windows Cluster Summary
- Good performance
- Lots of Applications
- Good reliability
- Reasonable Management complexity (TCO)
- Future is bright: uses are proliferating!
43. Windows Cluster Resources
- NT Supercluster, NCSA
- http://www.ncsa.uiuc.edu/General/CC/ntcluster/
- http://www-csag.ucsd.edu/projects/hpvm.html
- AC3 Cluster, TC
- http://www.tc.cornell.edu/UserDoc/Cluster/
- University of Southampton
- http://www.windowsclusters.org/
- → application and hardware/software evaluation
- → many of these folks will work with you on deployment
44. Tools and Technologies for Building Windows Clusters
- Communication Hardware
- Myrinet, http://www.myri.com/
- Giganet, http://www.giganet.com/
- Servernet II, http://www.compaq.com/
- Cluster Management and Communication Software
- LSF, http://www.platform.com/
- Codine, http://www.gridware.net/
- Cluster CoNTroller, MPI, http://www.mpi-softtech.com/
- Maui Scheduler, http://www.cs.byu.edu/
- MPICH, http://www-unix.mcs.anl.gov/mpi/mpich/
- PVM, http://www.epm.ornl.gov/pvm/
- Microsoft Cluster Info
- Win2000, http://www.microsoft.com/windows2000/
- MSCS, http://www.microsoft.com/ntserver/ntserverenterprise/exec/overview/clustering.asp
45. Future Directions
- Terascale Clusters
- Entropia
46. A Terascale Cluster
10 Teraflops in 2000?
- NSF currently running a $36M Terascale competition
- Budget could buy (see the rough arithmetic below)
- an Itanium cluster (3000 processors)
- 3 TB of main memory
- > 1.5 Gbps high-speed network interconnect
#1 in Top 500 Supercomputing Sites?
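A rough sizing sketch of why roughly 3000 processors lands near 10 TF, assuming Itanium-class parts deliver a few GF apiece (consistent with the 3-6 GF trajectory on the microprocessor slide earlier); the per-processor figure is an assumption for illustration, not a quoted spec.

    /* Rough sizing: aggregate peak for ~3000 processors at a few GF each.
     * The 3.3 GF/processor figure is an assumption for illustration. */
    #include <stdio.h>

    int main(void) {
        double procs          = 3000.0;
        double gflops_per_cpu = 3.3;           /* assumed Itanium-class peak */
        double total_tflops   = procs * gflops_per_cpu / 1000.0;
        printf("%.0f procs x %.1f GF = %.1f TF peak\n",
               procs, gflops_per_cpu, total_tflops);   /* ~9.9 TF */
        return 0;
    }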
47. Entropia: Beyond Clusters
- COTS, SHV enable larger, cheaper, faster systems
- Supercomputers (MPPs) to
- Commodity Clusters (NT Supercluster) to
- Entropia
48. Internet Computing
- Idea: Assemble large numbers of idle PCs in people's homes and offices into a massive computational resource
- Enabled by broadband connections, fast microprocessors, huge PC volumes
49. Unprecedented Power
- Entropia network: 30,000 machines (and growing fast!)
- 100,000 x 1 GHz → 100 TeraOp system
- 1,000,000 x 1 GHz → 1,000 TeraOp system (1 PetaOp)
- IBM ASCI White (12 TeraOp, 8K processors, $110 million system)
50. Why Participate? Cause Computing!
51. People Will Contribute
- Millions have demonstrated willingness to donate their idle cycles
- Great Cause Computing
- Current: find ET, large primes, crack DES
- Next: find cures for cancer, muscular dystrophy, air and water pollution, ...
- understand the human genome, ecology, fundamental properties of matter, the economy
- Participate in science, medical research, promoting causes that you care about!
52. Technical Challenges
- Heterogeneity (machine, configuration, network)
- Scalability (thousands to millions)
- Reliability (turn off, disconnect, fail)
- Security (integrity, confidentiality)
- Performance
- Programming
- . . .
- Entropia: harnessing the computational power of the Internet
53. Entropia is . . .
- Power: a network with unprecedented power and scale
- Empower: ordinary people to participate in solving the great social challenges and mysteries of our time
- Solve: a team solving fascinating technical problems
54. Summary
- Windows clusters are powerful, successful high performance platforms
- Cost effective and excellent performance
- Poised for rapid proliferation
- Beyond clusters are Internet computing systems
- Radical technical challenges, vast and profound opportunities
- For more information see
- HPVM: http://www-csag.ucsd.edu/
- Entropia: http://www.entropia.com/
55. Credits
- NT Cluster Team Members
- CSAG (UIUC and UCSD Computer Science), my research group
- NCSA Leading Edge Site -- Robert Pennington's team
- Talk materials
- NCSA (Rob Pennington, numerous application groups)
- Cornell TC (David Lifka)
- Boeing (David Levine)
- MPISoft (Tony Skjellum)
- Giganet (David Wells)
- Microsoft (Jim Gray)