Title: Supercomputing on Windows Clusters: Experience and Future Directions
1. Supercomputing on Windows Clusters: Experience and Future Directions
- Andrew A. Chien
- CTO, Entropia, Inc.
- SAIC Chair Professor
- Computer Science and Engineering, UCSD
- National Computational Science Alliance
- Invited Talk, USENIX Windows, August 4, 2000
2. Overview
- Critical Enabling Technologies
- The Alliance's Windows Supercluster
- Design and Performance
- Other Windows Cluster Efforts
- Future
- Terascale Clusters
- Entropia
3. External Technology Factors
4. Microprocessor Performance
[Chart: microprocessor performance by year introduced; DEC Alpha and x86/Alpha curves]
- Micros: 10 MF → 100 MF → 1 GF → 3 GF → 6 GF (2001?)
- → Memory system performance catching up (2.6 GB/s 21264 memory BW)
Adapted from Baskett, SGI and CSC Vanguard
5. Killer Networks
[Chart: delivered bandwidths: Ethernet 1 MB/s, FastE 12 MB/s, UW SCSI 40 MB/s, GigSAN/GigE 110 MB/s]
- LAN: 10 Mb/s → 100 Mb/s → ?
- SAN: 12 MB/s → 110 MB/s (Gbps) → 1100 MB/s → ?
- Myricom, Compaq, Giganet, Intel, ...
- Network bandwidths limited by system internal memory bandwidths
- Cheap and very fast communication hardware
6. Rich Desktop Operating System Environments
[Timeline 1981 → 1999: basic device access; HD storage, networks; graphical interfaces, audio/graphics; multiprocess protection, SMP support; clustering, performance, mass store, HP networking, management, availability, etc.]
- Desktop (PC) operating systems now provide
- richest OS functionality
- best program development tools
- broadest peripheral/driver support
- broadest application software/ISV support
7. Critical Enabling Technologies
8. Critical Enabling Technologies
- Cluster management and resource integration (use like one system)
- Delivered communication performance
- IP protocols inappropriate
- Balanced systems
- Memory bandwidth
- I/O capability
9. The HPVM System
- Goals
- Enable tightly coupled and distributed clusters with high efficiency and low effort (integrated solution)
- Provide usable access thru convenient standard parallel interfaces
- Deliver highest possible performance and simple programming model
10. Delivered Communication Performance
- Early 1990s, Gigabit testbeds
- 500 Mbits (60 MB/s) @ 1 MegaByte packets
- IP protocols not for Gigabit SANs
- Cluster Objective: high performance communication for small and large messages
- Performance Balance Shift: networks faster than I/O, memory, processor
11. Fast Messages Design Elements
- User-level network access
- Lightweight protocols
- flow control, reliable delivery
- tightly-coupled link, buffer, and I/O bus management
- Poll-based notification
- Streaming API for efficient composition (see the sketch below)
- Many generations, 1994-1999
- IEEE Concurrency, 6/97
- Supercomputing 95, 12/95
- Related efforts: UCB AM, Cornell U-Net, RWCP PM, Princeton VMMC/Shrimp, Lyon BIP → VIA standard
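The streaming API and poll-based notification are easier to see in code. Below is a minimal C sketch of how a sender composes a message in pieces and a receiver polls for it with an FM-style interface; the fm_* names, handler shape, and signatures are hypothetical stand-ins modeled loosely on the published FM 2.x design, not the actual HPVM headers.

    /* Sketch only: the fm_* declarations are hypothetical stand-ins for an
     * FM 2.x-style streaming interface, not the real HPVM library. */
    #include <stddef.h>
    #include <stdio.h>

    typedef struct fm_stream fm_stream;              /* opaque per-message stream */
    typedef void (*fm_handler)(fm_stream *s, unsigned src, size_t len);

    /* Assumed library entry points (declarations only, for illustration). */
    fm_stream *fm_begin_message(unsigned dest, size_t len, fm_handler h);
    void       fm_send_piece(fm_stream *s, const void *buf, size_t len); /* gather  */
    void       fm_end_message(fm_stream *s);
    void       fm_receive(void *buf, fm_stream *s, size_t len);          /* scatter */
    void       fm_extract(size_t max_bytes);         /* poll: run pending handlers */

    /* Receiver-side handler: pulls header and payload straight out of the
     * stream, so no intermediate buffering or copying is needed. */
    static void on_message(fm_stream *s, unsigned src, size_t len) {
        int    header;
        double payload[64];
        fm_receive(&header, s, sizeof header);
        fm_receive(payload, s, len - sizeof header);
        printf("message from node %u, tag %d, %zu bytes total\n", src, header, len);
    }

    void send_example(unsigned dest) {
        int    header = 42;
        double payload[64] = {0};
        /* Compose the message in pieces; the layer streams each piece onto
         * the wire, overlapping packetization with the sender's work. */
        fm_stream *s = fm_begin_message(dest, sizeof header + sizeof payload, on_message);
        fm_send_piece(s, &header, sizeof header);
        fm_send_piece(s, payload, sizeof payload);
        fm_end_message(s);
    }

    void progress_loop(void) {
        /* Poll-based notification: the host decides when to extract
         * messages instead of taking interrupts. */
        for (;;)
            fm_extract(4096);
    }

The point of the sketch is that the application, not an interrupt handler, drives message extraction, and that gather on the send side and scatter on the receive side avoid intermediate copies.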
12. Improved Bandwidth
- 20 MB/s → 200 MB/s (10x)
- Much of the advance is software structure: APIs and implementation
- Delivers all of the underlying hardware performance
13. Improved Latency
- 100 µs to 2 µs overhead (50x)
- Careful design to minimize overhead while maintaining throughput
- Efficient event handling, fine-grained resource management, and interlayer coordination
- Delivers all of the underlying hardware performance
14. HPVM Cluster Supercomputers
[Diagram: HPVM software architecture; standard APIs (MPI, Put/Get, Global Arrays, BSP), Scheduling and Mgmt (LSF), and Performance Tools layered over Fast Messages, which runs over multiple transports (Myrinet, ServerNet, Giganet VIA, SMP, WAN)]
- HPVM 1.0 (8/1997); HPVM 1.2 (2/1999): multi, dynamic, install; HPVM 1.9 (8/1999): Giganet, SMP
- Turnkey Cluster Computing with Standard APIs (a stock MPI code runs unchanged; see the sketch below)
- Network hardware and APIs increase leverage for users, achieve critical mass for the system
- Each involved new research challenges and provided deeper insights into the research issues
- Drove continually better solutions (e.g. multi-transport integration, robust flow control and queue management)
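Because the top of the stack is standard MPI (the MPI box in the diagram), a user's program is ordinary MPI with no HPVM-specific calls; the Myrinet/VIA/SMP transports are selected underneath. A minimal, generic example of the kind of code that runs unchanged on such a cluster:

    /* Plain, standard MPI: nothing here is HPVM-specific. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process contributes one value; sum them on every rank. */
        double local = (double)rank, total = 0.0;
        MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("%d processes, sum of ranks = %.0f\n", size, total);

        MPI_Finalize();
        return 0;
    }

Such a program is simply built against the cluster's MPI library and launched through the site's job scheduler (LSF on the clusters described in this talk).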
15. HPVM Communication Performance
- Delivers underlying performance for small messages; endpoints are the limits
- 100 MB/s at 1K vs 60 MB/s at 1000K
- >1500x improvement
16. HPVM/FM on VIA
- FM protocol/techniques portable to Giganet VIA
- Slightly lower performance, comparable N1/2
- Commercial version: WSDI (stay tuned)
17. Unified Transfer and Notification (all transports)
[Diagram: receive region of fixed-size frames at increasing addresses, shared by processes and networks; each frame holds variable-size data followed by a fixed-size trailer containing a length/flag word]
- Solution: uniform notify and poll (single queue representation)
- Scalability: n into k (hash); arbitrary SMP size or number of NIC cards
- Key: integrate variable-sized messages, achieve single DMA transfer
- No pointer-based memory management, no special synchronization primitives, no complex computation
- Memory format provides atomic notification in a single contiguous memory transfer (bcopy or DMA); see the sketch below
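To make the frame/trailer idea concrete, here is a small, self-contained C sketch of a receive queue of fixed-size frames: the sender writes the variable-size payload and then a single trailer word holding the length and a full flag, so one contiguous copy (or DMA) both delivers the data and performs the notification, and the receiver only polls the trailer of the next expected frame. Field layout, sizes, and names are illustrative assumptions, not the HPVM implementation.

    /* Illustrative fixed-size-frame receive queue with trailer-based
     * notification; layout, sizes, and names are assumptions. */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define FRAME_SIZE   256                 /* fixed frame size            */
    #define NUM_FRAMES   64
    #define PAYLOAD_MAX  (FRAME_SIZE - sizeof(uint32_t))
    #define FLAG_FULL    0x80000000u         /* high bit = frame is ready   */

    typedef struct {
        uint8_t  payload[PAYLOAD_MAX];       /* variable-size data ...      */
        volatile uint32_t trailer;           /* ... then length + full flag */
    } frame_t;

    static frame_t queue[NUM_FRAMES];        /* frames at increasing addresses */

    /* Sender side: contiguous write of payload followed by the trailer.
     * Writing the trailer word last is what makes the notification atomic. */
    static void deposit(unsigned slot, const void *data, uint32_t len) {
        frame_t *f = &queue[slot % NUM_FRAMES];
        memcpy(f->payload, data, len);
        f->trailer = FLAG_FULL | len;        /* one word flips the frame to 'full' */
    }

    /* Receiver side: poll only the trailer of the next expected frame.
     * No pointer chasing, no locks, no per-transport special cases. */
    static uint32_t poll_next(unsigned slot, void *out) {
        frame_t *f = &queue[slot % NUM_FRAMES];
        uint32_t t = f->trailer;
        if (!(t & FLAG_FULL))
            return 0;                        /* nothing has arrived yet */
        uint32_t len = t & ~FLAG_FULL;
        memcpy(out, f->payload, len);
        f->trailer = 0;                      /* recycle the frame */
        return len;
    }

    int main(void) {
        char buf[PAYLOAD_MAX];
        deposit(0, "hello, frame 0", 15);    /* stand-in for a NIC DMA or SMP bcopy */
        uint32_t n = poll_next(0, buf);
        printf("received %u bytes: %s\n", n, buf);
        return 0;
    }

The same memory format works whether the frame is filled by a NIC DMA or by a bcopy from another process on an SMP, which is what lets one polling loop cover all transports.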
18. Integrated Notification Results

                            Single Transport   Integrated
  Myrinet (latency)         8.3 µs             8.4 µs
  Myrinet (BW)              101 MB/s           101 MB/s
  Shared Memory (latency)   3.4 µs             3.5 µs
  Shared Memory (BW)        200 MB/s           200 MB/s

- No polling or discontiguous-access performance penalties
- Uniform high performance which is stable over changes of configuration or the addition of new transports
- No custom tuning for configuration required
- Framework is scalable to large numbers of SMP processors and network interfaces
19. Supercomputer Performance Characteristics (11/99)

                            MF/Proc      Flops/Byte   Flops/NetworkRT
  Cray T3E                  1200         2            2,500
  SGI Origin2000            500          0.5          1,000
  HPVM NT Supercluster      600          8            12,000
  IBM SP2 (4- or 8-way)     2.6-5.2 GF   12-25        150-300K
  Beowulf (100 Mbit)        600          50           200,000

(A rough derivation of the HPVM row's ratios follows below.)
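As a back-of-the-envelope check on how the table's ratios relate to machine parameters, take the HPVM row at 600 MF per processor and assume roughly 75 MB/s of delivered bandwidth per processor and a network round trip around 20 µs; both communication figures are inferred from the table itself rather than quoted, so treat this only as a consistency check.

    /* Back-of-the-envelope: how Flops/Byte and Flops/NetworkRT follow from
     * per-processor compute rate, delivered bandwidth, and round-trip time.
     * The 75 MB/s and 20 us figures are inferred assumptions, not quoted. */
    #include <stdio.h>

    int main(void) {
        double mflops_per_proc = 600.0;        /* HPVM NT Supercluster row    */
        double mbytes_per_sec  = 75.0;         /* assumed delivered BW/proc   */
        double rt_microsec     = 20.0;         /* assumed network round trip  */

        double flops_per_byte = mflops_per_proc / mbytes_per_sec;
        double flops_per_rt   = mflops_per_proc * rt_microsec;  /* MF/s * us = flops */

        printf("Flops/Byte      ~ %.0f\n", flops_per_byte);     /* ~8       */
        printf("Flops/NetworkRT ~ %.0f\n", flops_per_rt);       /* ~12,000  */
        return 0;
    }

Higher Flops/Byte and Flops/NetworkRT mean more computation is needed per unit of communication to keep the machine busy, which is why the cluster sits between the tightly coupled T3E/Origin and a 100 Mbit Beowulf.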
20. The NT Supercluster
21. Windows Clusters
- Early prototypes in CSAG
- 1/1997, 30P, 6GF
- 12/1997, 64P, 20GF
- The Alliance's Supercluster
- 4/1998, 256P, 77GF
- 6/1999, 256P, 109GF
22. NCSA's Windows Supercluster
Engineering Fluid Flow Problem
#207 in Top 500 Supercomputing Sites
D. Tafti, NCSA
Rob Pennington (NCSA), Andrew Chien (UCSD)
23. Windows Cluster System
[Diagram: cluster layout]
- Front-end systems (apps development, job submission), connected to the Internet
- File servers and LSF master: 128 GB home, 200 GB scratch; FTP to mass storage, daily backups; LSF batch job scheduler
- Fast Ethernet connecting front end, file servers, and compute nodes
- 128 compute nodes, 256 CPUs: 128 dual 550 MHz systems running Windows NT, Myrinet and HPVM
- Infrastructure and development testbeds: Windows 2K and NT; 8 4-processor 550 MHz, 32 2-processor 300 MHz, 8 2-processor 333 MHz
(courtesy Rob Pennington, NCSA)
24. Example Application Results
- MILC QCD
- Navier-Stokes Kernel
- Zeus-MP Astrophysics CFD
- Large-scale Science and Engineering codes
- Comparisons to SGI O2K and Linux clusters
25. MILC Performance
Source: D. Toussaint and K. Orginos, Arizona
26. Zeus-MP (Astrophysics CFD)
27. 2D Navier-Stokes Kernel
Source: Danesh Tafti, NCSA
28. Applications with High Performance on Windows Supercluster
- Zeus-MP (256P, Mike Norman)
- ISIS (192P, Robert Clay)
- ASPCG (256P, Danesh Tafti)
- Cactus (256P, Paul Walker/John Shalf/Ed Seidel)
- MILC QCD (256P, Lubos Mitas)
- QMC Nanomaterials (128P, Lubos Mitas)
- Boeing CFD Test Codes, CFD Overflow (128P, David Levine)
- freeHEP (256P, Doug Toussaint)
- ARPI3D (256P, weather code, Dan Weber)
- GMIN (L. Munro, in K. Jordan's group)
- DSMC-MEMS (Ravaioli)
- FUN3D with PETSc (Kaushik)
- SPRNG (Srinivasan)
- MOPAC (McKelvey)
- Astrophysical N body codes (Bode)
- → Little code retuning, and quickly running ...
- Parallel Sorting (Rivera, CSAG)
29. MinuteSort
- Sort the maximum data disk-to-disk in 1 minute
- Indy sort
- fixed-size keys, special sorter, and file format
- HPVM/Windows Cluster winner for 1999 (10.3 GB) and 2000 (18.3 GB)
- Adaptation of Berkeley NOWSort code (Arpaci and Dusseau)
- Commodity configuration (not a metric)
- PCs, IDE disks, Windows
- HPVM and 1Gbps Myrinet
30. MinuteSort Architecture
[Diagram: HPVM over 1 Gbps Myrinet connecting 32 HP Kayaks (3Ware controllers, 4 x 20 GB IDE disks each) and 32 HP Netservers (2 x 16 GB SCSI disks each)]
(Luis Rivera UIUC, Xianan Zhang UCSD)
31. Sort Scaling
- Concurrent read/bucket-sort/communicate is the bottleneck; faster I/O infrastructure required (busses and memory, not disks); see the sketch below
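A minimal sketch of the read/bucket/communicate step that dominates: each record is routed to a per-node bucket by the leading byte of its key in a single pass over the input. This is a simplified, single-process illustration in the spirit of a NOWSort-style partition, not the MinuteSort code itself; the record format, bucket sizing, and key-to-node mapping are assumptions.

    /* Simplified one-pass bucket partition: route each record to a per-node
     * bucket by the high bits of its key.  Illustration only. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>

    #define NODES      32                     /* e.g. 32 sorting nodes          */
    #define REC_SIZE   100                    /* typical sort-benchmark record  */
    #define KEY_BYTES  10

    typedef struct { uint8_t bytes[REC_SIZE]; } record_t;

    typedef struct {
        record_t *recs;
        size_t    count, cap;
    } bucket_t;

    /* Destination node = top bits of the key (keys assumed uniform). */
    static unsigned dest_node(const record_t *r) {
        return (unsigned)r->bytes[0] * NODES / 256;
    }

    static void bucket_push(bucket_t *b, const record_t *r) {
        if (b->count == b->cap) {
            b->cap = b->cap ? b->cap * 2 : 1024;
            record_t *tmp = realloc(b->recs, b->cap * sizeof *b->recs);
            if (!tmp) { perror("realloc"); exit(1); }
            b->recs = tmp;
        }
        b->recs[b->count++] = *r;
    }

    int main(void) {
        static bucket_t buckets[NODES];       /* one outgoing bucket per node */
        record_t r;
        memset(&r, 0, sizeof r);

        /* Stand-in for the concurrent disk read: generate random records. */
        for (int i = 0; i < 100000; i++) {
            for (int k = 0; k < KEY_BYTES; k++) r.bytes[k] = rand() & 0xff;
            bucket_push(&buckets[dest_node(&r)], &r);
            /* In the real pipeline, full buckets would be shipped over
             * HPVM/Myrinet here, overlapping I/O, partitioning, and sends. */
        }

        for (int n = 0; n < NODES; n++)
            printf("node %2d: %zu records\n", n, buckets[n].count);
        return 0;
    }

Because the reads, the partitioning, and the network sends all run concurrently, bus and memory bandwidth, rather than the disks, become the limit, as noted above.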
32. MinuteSort Execution Time
33. Reliability
- Gossip: Windows platforms are not reliable
- Larger systems → intolerably low MTBF
- Our experience: nodes don't crash
- Application runs of 1000s of hours
- Node failure means an application failure; effectively not a problem
- Hardware
- Short term: infant mortality (1-month burn-in)
- Long term
- 1 hardware problem/100 machines/month
- Disks, network interfaces, memory
- No processor or motherboard problems.
34. Windows Cluster Usage
- Lots of large jobs
- Runs up to 14,000 CPU-hours (64 procs x 9 days)
35. Other Large Windows Clusters
- Sandia's Kudzu Cluster (144 procs, 550 disks, 10/98)
- Cornell's AC3 Velocity Cluster (256 procs, 8/99)
- Others (sampled from vendors)
- GE Research Labs (16, Scientific)
- Boeing (32, Scientific)
- PNNL (96, Scientific)
- Sandia (32, Scientific)
- NCSA (32, Scientific)
- Rice University (16, Scientific)
- U. of Houston (16, Scientific)
- U. of Minnesota (16, Scientific)
- Oil & Gas (8, Scientific)
- Merrill Lynch (16, Ecommerce)
- UIT (16, ASP/Ecommerce)
36. The AC3 Velocity
- 64 Dell PowerEdge 6350 Servers
- Quad Pentium III 500 MHz/2 MB Cache Processors (SMP)
- 4 GB RAM/Node
- 50 GB Disk (RAID 0)/Node
- GigaNet Full Interconnect
- 100 MB/sec Bandwidth between any 2 Nodes
- Very Low Latency
- 2 Terabytes Dell PowerVault 200S Storage
- 2 Dell PowerEdge 6350 Dual Processor File Servers
- 4 PowerVault 200S Units/File Server
- 8 x 36 GB Disk Drives/PowerVault 200S
- Quad Channel SCSI RAID Adapter
- 180 MB/sec Sustained Throughput/Server
- 2 Terabyte PowerVault 130T Tape Library
- 4 DLT 7000 Tape Drives
- 28-Tape Capacity
#381 in Top 500 Supercomputing Sites
(courtesy David A. Lifka, Cornell TC)
37. Recent AC3 Additions
- 8 Dell PowerEdge 2450 Servers (Serial Nodes)
- Pentium III 600 MHz/512 KB Cache
- 1 GB RAM/Node
- 50 GB Disk (RAID 0)/Node
- 7 Dell PowerEdge 2450 Servers (First All-NT-Based AFS Cell)
- Dual Processor Pentium III 600 MHz/512 KB Cache
- 1 GB RAM/Node Fileservers, 512 MB RAM/Node Database Servers
- 1 TB SCSI-based RAID 5 Storage
- Cross-platform filesystem support
- 64 Dell PowerEdge 2450 Servers (Protein Folding, Fracture Analysis)
- Dual Processor Pentium III 733 MHz/256 KB Cache
- 2 GB RAM/Node
- 27 GB Disk (RAID 0)/Node
- Full Giganet Interconnect
- 3 Intel ES6000 + 1 ES1000 Gigabit switches
- Upgrading our Server Backbone network to Gigabit Ethernet
(courtesy David A. Lifka, Cornell TC)
38. AC3 Goals
- Only commercially supported technology
- Rapid spinup and spinout
- Package technologies for vendors to sell as integrated systems
- → All of the commercial packages were moved from the SP2 to Windows; all users are back, and more!
- Users: "I don't do Windows" → "I'm agnostic about operating systems, and just focus on getting my work done."
39. Protein Folding
Reaction path study of ligand diffusion in leghemoglobin. The ligand is CO (white) and it is moving from the binding site, the heme pocket, to the protein exterior. A study by Wieslaw Nowak and Ron Elber.
The cooperative motion of ion and water through the gramicidin ion channel. The effective quasi-particle that permeates through the channel includes eight water molecules and the ion. Work of Ron Elber with Bob Eisenberg, Danuta Rojewska and Duan Pin.
http://www.tc.cornell.edu/reports/NIH/resource/CompBiologyTools/
(courtesy David A. Lifka, Cornell TC)
40. Protein Folding Per-Processor Performance
Results on different computers for α protein structures
Results on different computers for α/β or β proteins
(courtesy David A. Lifka, Cornell TC)
41. AC3 Corporate Members
- Air Products and Chemicals
- Candle Corporation
- Compaq Computer Corporation
- Conceptual Reality Presentations
- Dell Computer Corporation
- Etnus, Inc.
- Fluent, Inc.
- Giganet, Inc.
- IBM Corporation
- ILOG, Inc.
- Intel Corporation
- KLA-Tencor Corporation
- Kuck & Associates, Inc.
- Lexis-Nexis
- MathWorks, Inc.
- Microsoft Corporation
- MPI Software Technologies, Inc.
- Numerical Algorithms Group
- Portland Group, Inc.
- Reed Elsevier, Inc.
- Reliable Network Solutions, Inc.
- SAS Institute, Inc.
- Seattle Lab, Inc.
- Visual Numerics, Inc.
- Wolfram Research, Inc.
(courtesy David A. Lifka, Cornell TC)
42. Windows Cluster Summary
- Good performance
- Lots of Applications
- Good reliability
- Reasonable Management complexity (TCO)
- Future is bright: uses are proliferating!
43. Windows Cluster Resources
- NT Supercluster, NCSA
- http://www.ncsa.uiuc.edu/General/CC/ntcluster/
- http://www-csag.ucsd.edu/projects/hpvm.html
- AC3 Cluster, TC
- http://www.tc.cornell.edu/UserDoc/Cluster/
- University of Southampton
- http://www.windowsclusters.org/
- → application and hardware/software evaluation
- → many of these folks will work with you on deployment
44. Tools and Technologies for Building Windows Clusters
- Communication Hardware
- Myrinet, http://www.myri.com/
- Giganet, http://www.giganet.com/
- Servernet II, http://www.compaq.com/
- Cluster Management and Communication Software
- LSF, http://www.platform.com/
- Codine, http://www.gridware.net/
- Cluster CoNTroller, MPI, http://www.mpi-softtech.com/
- Maui Scheduler, http://www.cs.byu.edu/
- MPICH, http://www-unix.mcs.anl.gov/mpi/mpich/
- PVM, http://www.epm.ornl.gov/pvm/
- Microsoft Cluster Info
- Win2000, http://www.microsoft.com/windows2000/
- MSCS, http://www.microsoft.com/ntserver/ntserverenterprise/exec/overview/clustering.asp
45. Future Directions
- Terascale Clusters
- Entropia
46. A Terascale Cluster
10 Teraflops in 2000?
- NSF currently running a $36M Terascale competition
- Budget could buy (see the rough arithmetic below)
- an Itanium cluster (3000 processors)
- 3 TB of main memory
- > 1.5 Gbps high-speed network interconnect
#1 in Top 500 Supercomputing Sites?
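A rough sizing sketch of why roughly 3000 processors lands near 10 TF, assuming Itanium-class parts deliver a few GF apiece (consistent with the 3-6 GF trajectory on the microprocessor slide earlier); the per-processor figure is an assumption for illustration, not a quoted spec.

    /* Rough sizing: aggregate peak for ~3000 processors at a few GF each.
     * The 3.3 GF/processor figure is an assumption for illustration. */
    #include <stdio.h>

    int main(void) {
        double procs          = 3000.0;
        double gflops_per_cpu = 3.3;           /* assumed Itanium-class peak */
        double total_tflops   = procs * gflops_per_cpu / 1000.0;
        printf("%.0f procs x %.1f GF = %.1f TF peak\n",
               procs, gflops_per_cpu, total_tflops);   /* ~9.9 TF */
        return 0;
    }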
47. Entropia: Beyond Clusters
- COTS, SHV enable larger, cheaper, faster systems
- Supercomputers (MPPs) to
- Commodity Clusters (NT Supercluster) to
- Entropia
48. Internet Computing
- Idea: Assemble large numbers of idle PCs in people's homes and offices into a massive computational resource
- Enabled by broadband connections, fast microprocessors, huge PC volumes
49. Unprecedented Power
- Entropia network: 30,000 machines (and growing fast!)
- 100,000 x 1 GHz → 100 TeraOp system
- 1,000,000 x 1 GHz → 1,000 TeraOp system (1 PetaOp)
- IBM ASCI White (12 TeraOp, 8K processors, $110 million system)
50. Why Participate? Cause Computing!
51. People Will Contribute
- Millions have demonstrated willingness to donate their idle cycles
- Great Cause Computing
- Current: find ET, large primes, crack DES
- Next: find cures for cancer, muscular dystrophy, air and water pollution, ...
- understand the human genome, ecology, fundamental properties of matter, the economy
- Participate in science, medical research, promoting causes that you care about!
52. Technical Challenges
- Heterogeneity (machine, configuration, network)
- Scalability (thousands to millions)
- Reliability (turn off, disconnect, fail)
- Security (integrity, confidentiality)
- Performance
- Programming
- . . .
- Entropia: harnessing the computational power of the Internet
53. Entropia is . . .
- Power: a network with unprecedented power and scale
- Empower: ordinary people to participate in solving the great social challenges and mysteries of our time
- Solve: a team solving fascinating technical problems
54. Summary
- Windows clusters are powerful, successful high performance platforms
- Cost effective and excellent performance
- Poised for rapid proliferation
- Beyond clusters are Internet computing systems
- Radical technical challenges, vast and profound opportunities
- For more information see
- HPVM: http://www-csag.ucsd.edu/
- Entropia: http://www.entropia.com/
55. Credits
- NT Cluster Team Members
- CSAG (UIUC and UCSD Computer Science), my research group
- NCSA Leading Edge Site -- Robert Pennington's team
- Talk materials
- NCSA (Rob Pennington, numerous application groups)
- Cornell TC (David Lifka)
- Boeing (David Levine)
- MPISoft (Tony Skjellum)
- Giganet (David Wells)
- Microsoft (Jim Gray)