Title: Crystal Ball Panel
Slide 1: Crystal Ball Panel
ORNL Heterogeneous Distributed Computing Research
Al Geist, ORNL, March 6, 2003
Slide 2: Look into the Future
Federated tera-clusters
Petascale systems
Adaptable software
HPC Linux
Fault tolerance
High-performance I/O
"Reply hazy, try again" (Magic 8-Ball graphic on the slide)
Slide 3: Scalable Systems Software for Terascale Centers
Vendors: IBM, Cray, Intel, Unlimited Scale
Labs: ORNL, ANL, LBNL, PNNL, SNL, LANL, Ames
NSF centers: NCSA, PSC, SDSC
Collectively (with labs, NSF centers, and industry) define standard
interfaces between system components for interoperability.
Goal: create scalable, standardized management tools for efficiently
running our large computing centers.
Part of the DOE SciDAC effort: www.scidac.org/ScalableSystems
Slide 4: Progress So Far on the Integrated Suite
Working components and interfaces (shown in bold on the original slide)
Grid interfaces: Meta Scheduler, Meta Monitor, Meta Manager, Meta Services
Core components: Accounting, Scheduler, System/Job Monitor, Node State
Manager, Service Directory, Event Manager, Node Configuration / Build
Manager, Allocation Management, Usage Reports, Validation and Testing,
Process Manager, Job Queue Manager, Hardware Infrastructure Manager,
Checkpoint / Restart
Important: standard XML interfaces tie the components together, over a
common authentication and communication infrastructure.
Components can be written in any mixture of C, C++, Java, Perl, and Python.
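Because every component speaks XML over a common communication layer, components in different languages can interoperate. As a rough sketch of the idea (the element names and registration flow below are invented for illustration, not the actual Scalable Systems Software wire protocol), a component might announce itself to the Service Directory with a message like:

```python
# Illustrative only: hypothetical XML registration message for the
# Service Directory. Element/attribute names are assumptions, not the
# real SSS protocol.
import xml.etree.ElementTree as ET

def make_register_message(component: str, host: str, port: int) -> str:
    """Build an XML message announcing a component's location."""
    msg = ET.Element("register")
    ET.SubElement(msg, "component", name=component)
    ET.SubElement(msg, "location", host=host, port=str(port))
    return ET.tostring(msg, encoding="unicode")

print(make_register_message("event-manager", "node01", 5150))
```

The point of the design is that any language with an XML library and a socket can join the suite, which is what allows the mixture of C, C++, Java, Perl, and Python.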
Slide 5: Underneath It All
Rogue OS activity and daemons are cited as a problem by existing
computing centers.
Ingredients: single system image, adaptive OS, asymmetric kernels, a
scalable file system.
A scalable high-performance OS: what will it be? Linux? A lightweight
kernel (as on Red Storm or BG/L)? The Scyld approach? Something else?
See the Fast-OS effort.
Slide 6: Scale Up and Fall Down
Fault tolerance is a serious issue when scaling to 100 TF and beyond;
RAS (reliability, availability, serviceability) is critical.
Checkpointing eventually becomes ineffective: as node counts grow, the
system's mean time between failures shrinks toward the time needed to
write a checkpoint.
We need a fault-tolerance overhaul: an adaptive runtime, fault-tolerant
MPI, and new FT paradigms.
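One way to see why checkpointing stops scaling is Young's classic approximation for the optimal checkpoint interval, tau = sqrt(2 * delta * M), where delta is the time to write one checkpoint and M is the system MTBF. A minimal sketch (the per-node MTBF and checkpoint cost below are assumed round numbers, not measurements of any real machine):

```python
import math

def young_interval(ckpt_seconds: float, mtbf_seconds: float) -> float:
    """Young's approximation for the optimal checkpoint interval (s)."""
    return math.sqrt(2.0 * ckpt_seconds * mtbf_seconds)

NODE_MTBF_H = 10_000.0   # assumed per-node MTBF, hours
CKPT_MIN = 10.0          # assumed time to write one checkpoint, minutes

# With independent failures, system MTBF falls roughly as 1/N.
for nodes in (1_000, 10_000, 100_000):
    mtbf_s = NODE_MTBF_H * 3600.0 / nodes
    tau_s = young_interval(CKPT_MIN * 60.0, mtbf_s)
    frac = (CKPT_MIN * 60.0) / tau_s   # rough fraction of time spent checkpointing
    print(f"{nodes:>7} nodes: interval {tau_s/60:6.1f} min, "
          f"~{frac:.0%} of time writing checkpoints")
```

Under these assumptions, at 100,000 nodes the optimal interval approaches the checkpoint write time itself, so the machine spends most of its time saving state rather than computing, which is the slide's point that checkpointing alone cannot carry petascale fault tolerance.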
Slide 7: General Purpose vs. Simple and Custom
Software: a minimal OS with high performance but limited application
support, versus a full OS; tuned to the hardware, adapting on the fly;
autonomic algorithms.
Hardware: customized clusters for each group, versus a centralized
general-purpose machine; "Internet in a box," or out of the box.
Slide 8: Big Science
The final word: don't lose track of why we justify petascale systems.
Science will ultimately be driven by computation, simulation, and
modeling. Science drivers are key to success in HPC, and vice versa.