NCSA Terascale Clusters

Transcript and Presenter's Notes
1
NCSA Terascale Clusters
  • Dan Reed
  • Director, NCSA and the Alliance
  • Chief Architect, NSF ETF TeraGrid
  • Principal Investigator, NSF NEESgrid
  • William and Jane Marr Gutgsell Professor
  • University of Illinois
  • reed@ncsa.uiuc.edu

2
A Blast From the Past
  • Everybody who has analyzed the logical theory of
    computers has come to the conclusion that the
    possibilities of computers are very interesting
    if they could be made to be more complicated by
    several orders of magnitude.
  • Richard Feynman, December 29, 1959

Feynman would be proud!
3
NCSA Terascale Linux Clusters
  • 1 TF IA-32 Pentium III cluster (Platinum)
    • 512 dual-processor 1 GHz nodes
    • Myrinet 2000 interconnect
    • 5 TB of RAID storage
    • 594 GF (Linpack), production July 2001
  • 1 TF IA-64 Itanium cluster (Titan)
    • 164 dual-processor 800 MHz nodes
    • Myrinet 2000 interconnect
    • 678 GF (Linpack), production March 2002 (see the peak vs. Linpack sketch below)
  • Large-scale calculations on both
    • molecular dynamics (Schulten): first nanosecond/day calculations
    • gas dynamics (Woodward)
    • others underway via NRAC allocations
  • Software packaging for communities
    • NMI GRIDS Center, Alliance In a Box
  • Lessons for TeraGrid

(photo: NCSA machine room)
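As a rough sanity check on the Linpack figures above against the quoted 1 TF peaks, the arithmetic works out as follows. This is a back-of-the-envelope sketch; the flops-per-cycle values (1 for the Pentium III, 4 for the dual-FMA Itanium) are my assumptions, not numbers from the slides:

    /* Peak vs. Linpack efficiency for Platinum and Titan (illustrative). */
    #include <stdio.h>

    static void report(const char *name, int nodes, int cpus_per_node,
                       double ghz, double flops_per_cycle, double linpack_gf)
    {
        /* GHz * flops/cycle gives GF per processor. */
        double peak_gf = nodes * cpus_per_node * ghz * flops_per_cycle;
        printf("%-8s peak %6.0f GF, Linpack %5.0f GF, efficiency %4.1f%%\n",
               name, peak_gf, linpack_gf, 100.0 * linpack_gf / peak_gf);
    }

    int main(void)
    {
        report("Platinum", 512, 2, 1.0, 1.0, 594.0);  /* ~1024 GF peak, ~58% */
        report("Titan",    164, 2, 0.8, 4.0, 678.0);  /* ~1050 GF peak, ~65% */
        return 0;
    }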
4
Platinum Software Configuration
  • Linux
    • RedHat 6.2 with the Linux 2.2.19 SMP kernel
  • OpenPBS
    • resource management and job control
  • Maui Scheduler
    • advanced scheduling
  • Argonne MPICH
    • parallel programming API (see the MPI sketch below)
  • NCSA VMI
    • communication middleware between MPICH and Myrinet
  • Myricom GM
    • Myrinet communication layer
  • NCSA cluster monitor
  • IBM GPFS
    • parallel file system
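To make the stack concrete, here is a minimal MPI program of the kind this configuration ran: built against MPICH (e.g. with mpicc) and launched through the OpenPBS/Maui batch system, with VMI and GM carrying messages over Myrinet underneath. This is an illustrative sketch, not code from the presentation:

    /* Minimal MPI job: every rank reports in and rank 0 prints a summary. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Sum the rank ids at rank 0 as a quick check that all
           processes in the batch job joined the communicator. */
        MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%d ranks up, rank-id sum %d\n", size, sum);

        MPI_Finalize();
        return 0;
    }

In practice a job like this would be wrapped in a PBS script, submitted with qsub, and scheduled by Maui; the exact mpirun invocation depended on the local MPICH/VMI build, so treat the launch details as site-specific.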

5
Session Questions
  • Cluster performance and expectations
    • generally met, though with the usual hiccups
  • MTBI and failure modes
    • node and disk loss (stay tuned for my next talk)
    • copper Myrinet (fiber much more reliable)
    • avoid open house demonstrations
  • System utilization
    • heavily oversubscribed (see queue delays below)
  • Primary complaints
    • long batch queue delays
    • capacity vs. capability balance
    • ISV code availability
    • software tools (debuggers and performance tools)
    • I/O and parallel file system performance

6
NCSA IA-32 Cluster Timeline
(timeline, January-July 2001)
  • Order placed with IBM for the 512 compute node cluster
  • 2/23: first four racks of IBM hardware arrive
  • 3/1: head nodes operational
  • 3/10: first 126 processor Myrinet test jobs
  • 3/13: final IBM hardware shipment
  • 3/22: first application on the compute nodes (CMS/Koranda/Litvin)
  • 3/26: initial Globus installation; final Myrinet hardware arrives; first 512 processor MILC and NAMD runs
  • 4/5: Myrinet static mapping in place
  • 4/7: CMS runs successfully
  • 4/11: 400 processor HPL runs completing
  • 4/12: Myricom engineering assistance
  • 5/8: 1000 processor MP Linpack runs
  • 5/11: 1008 processor Top500 run at 594 GF
  • 5/14: 2.4 kernel testing
  • 5/28: RedHat 7.1 testing
  • 6/1: friendly user period begins
  • 7/2001: production service
7
NCSA Resource Usage
8
Alliance HPC Usage
(chart: clusters in production; source: PACI usage database)
9
Hero Cluster Jobs
  • Capability vs. capacity
(chart: CPU hours of large jobs on Platinum and Titan)
10
Storm Scale Prediction
  • Sample four hour forecast
    • Center for Analysis and Prediction of Storms
    • Advanced Regional Prediction System
      • full-physics mesoscale prediction system
  • Execution environment
    • NCSA Itanium Linux cluster
    • 240 processors, 4 hours per night for 46 days
  • Fort Worth forecast
    • four hour prediction, 3 km grid
    • initial state includes assimilation of
      • WSR-88D reflectivity and radial velocity data
      • surface and upper air data, satellite, and wind
  • On-demand computing required (see the tally below)

(figure panels: Radar; 2 hr Forecast w/Radar. Source: Kelvin Droegemeier)
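A small tally of the schedule quoted above; the reading that the binding constraint is the fixed nightly window rather than the total cycle count is mine, not the presenter's:

    /* Nightly ARPS forecast runs: the total processor-hours are modest,
       but the fixed 4-hour evening slot is what requires on-demand access. */
    #include <stdio.h>

    int main(void)
    {
        int procs = 240, hours_per_night = 4, nights = 46;
        printf("total: %d processor-hours over %d nights\n",
               procs * hours_per_night * nights, nights);   /* 44160 */
        return 0;
    }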
11
NCSA Multiphase Strategy
  • Multiple user classes
    • ISV software, hero calculations
    • distributed resource sharing, parameter studies
  • Four hardware approaches
    • shared memory multiprocessors
      • 12 32-way IBM p690 systems (2 TF peak)
      • large memory and ISV support
    • TeraGrid IPF clusters
      • 64-bit Itanium2/Madison (10 TF peak)
      • coupling with SDSC, ANL, Caltech, and PSC
    • Xeon clusters
      • 32-bit systems for hero calculations
      • dedicated sub-clusters (2-3 TF each), allocated for weeks
    • Condor resource pools
      • parameter studies and load sharing

12
Extensible TeraGrid Facility (ETF)
(diagram: five ETF sites joined by an extensible backplane network through Los Angeles and Chicago hubs, with 30-40 Gb/s links)
  • ANL (visualization): 1.25 TF IA-64, 96 visualization nodes, 20 TB storage
  • Caltech (data collection analysis): 0.4 TF IA-64 plus IA-32 Datawulf, 80 TB storage
  • NCSA (compute intensive): 10 TF IA-64, 128 large memory nodes, 230 TB disk storage, 3 PB tape storage, GPFS and data mining
  • PSC (compute intensive): 6 TF EV68, 71 TB storage; 0.3 TF EV7 shared memory, 150 TB storage server
  • SDSC (data intensive): 4 TF IA-64, DB2 and Oracle servers, 500 TB disk storage, 6 PB tape storage, 1.1 TF Power4
13
NCSA TeraGrid 10 TF IPF and 230 TB
(diagram: cluster architecture, being installed at the time of the talk)
  • Compute nodes
    • 2 TF Itanium2: 256 nodes, 2p 1 GHz, 4 or 12 GB memory, 73 GB scratch per node
    • 700 Madison nodes: 2p Madison, 4 GB memory, 73 GB scratch per node
  • Interactive/spare nodes: 10 2p Itanium2 nodes and 10 2p Madison nodes (login, FTP)
  • Interconnect: Myrinet fabric, plus a GbE fabric to the TeraGrid network
  • Storage: 230 TB behind Brocade 12000 switches with 256 2x FC links; storage I/O over Myrinet and/or GbE