Title: Reliability, Availability, and Serviceability (RAS) for High-Performance Computing
1. Reliability, Availability, and Serviceability (RAS) for High-Performance Computing
- Stephen L. Scott and Christian Engelmann
- Computer Science Research Group, Computer Science and Mathematics Division
2. Research and development goals
- Provide high-level RAS capabilities for current terascale and next-generation petascale high-performance computing (HPC) systems
- Eliminate many of the numerous single points of failure and control in today's HPC systems
- Develop techniques to enable HPC systems to run computational jobs 24/7
- Develop proof-of-concept prototypes and production-type RAS solutions
3. MOLAR: Adaptive runtime support for high-end computing operating and runtime systems
- Addresses the challenges for operating and runtime systems to run large applications efficiently on future ultrascale high-end computers
- Part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)
- MOLAR is a collaborative research effort (www.fastos.org/molar)
4. Symmetric active/active redundancy
- Many active head nodes
- Workload distribution
- Symmetric replication between head nodes
- Continuous service
- Always up to date
- No fail-over necessary
- No restore-over necessary
- Virtual synchrony model (see the sketch below)
- Complex algorithms
- Prototypes for Torque and the Parallel Virtual File System metadata server
(Diagram: multiple active/active head nodes serving a set of compute nodes)
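The consistency argument behind active/active redundancy can be shown with a small state-machine replication sketch. The Python snippet below is a minimal illustration, not the actual MOLAR or PVFS prototype code; the MetadataReplica class and request format are hypothetical. It assumes the group communication layer has already delivered client requests in total order (the virtual synchrony guarantee), so every head node applies the same sequence to a deterministic state machine, all replicas stay identical, and no fail-over or restore-over is needed.

```python
# Minimal sketch of state-machine replication under virtual synchrony.
# Class name and request format are hypothetical; the real prototypes
# replicate the Torque and PVFS metadata services.

class MetadataReplica:
    """One head node's copy of the replicated service state."""
    def __init__(self):
        self.metadata = {}

    def apply(self, request):
        # Deterministic update: the same totally ordered input produces
        # the same state on every replica.
        op, path, attrs = request
        if op == "create":
            self.metadata[path] = attrs
        elif op == "remove":
            self.metadata.pop(path, None)

# In the real system a group communication layer delivers requests to all
# head nodes in the same total order; here we replay one agreed-upon log.
ordered_requests = [
    ("create", "/scratch/job42", {"stripes": 4}),
    ("create", "/scratch/job43", {"stripes": 2}),
    ("remove", "/scratch/job42", None),
]

head_nodes = [MetadataReplica() for _ in range(3)]  # three active head nodes
for request in ordered_requests:
    for replica in head_nodes:
        replica.apply(request)

# Every replica is always up to date, so any head node can serve requests
# and the loss of one node requires no fail-over or state restore.
assert all(r.metadata == head_nodes[0].metadata for r in head_nodes)
```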
5. Symmetric active/active Parallel Virtual File System metadata server
(Plots: writing and reading throughput in requests/sec versus number of clients, 1 to 32, for plain PVFS and for 1, 2, and 4 active/active metadata servers)
6. Reactive fault tolerance for HPC with LAM/MPI+BLCR job-pause mechanism
- Operational nodes: pause
- BLCR reuses existing processes
- LAM/MPI reuses existing connections
- Restore partial process state from checkpoint
- Failed nodes: migrate
- Restart process on new node from checkpoint
- Reconnect with paused processes
- Scalable MPI membership management for low overhead
- Efficient, transparent, and automatic failure recovery (see the sketch below)
(Diagram: a failed MPI process is migrated from its failed node to a spare node via a checkpoint on shared storage; paused MPI processes on live nodes keep their existing connections and establish new connections to the migrated process)
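A compressed view of the job-pause control flow is sketched below. This is a hedged illustration only: the helpers pause_and_rollback, restart_from_checkpoint, and reconnect are hypothetical names, since the real mechanism is implemented inside LAM/MPI and BLCR rather than in user-level code. The sketch captures the key decision: ranks on live nodes are paused and rolled back in place, reusing their processes and connections, while ranks from failed nodes are restarted from the checkpoint on spare nodes and only those new connections are rebuilt.

```python
# Hypothetical sketch of the job-pause/migrate decision (not LAM/MPI or BLCR code).

def recover(rank_to_node, failed_nodes, spare_nodes, checkpoint):
    """Pause live ranks in place; migrate ranks from failed nodes to spares."""
    migrated = {}
    spares = list(spare_nodes)
    for rank, node in rank_to_node.items():
        if node in failed_nodes:
            target = spares.pop()                              # pick a spare node
            restart_from_checkpoint(rank, checkpoint, target)  # cold restart there
            migrated[rank] = target
        else:
            pause_and_rollback(rank, checkpoint)  # reuse existing process and links
    reconnect(migrated)  # only connections to migrated ranks must be rebuilt
    return migrated

# Stubs so the sketch runs stand-alone; they only log the intended action.
def pause_and_rollback(rank, ckpt):
    print(f"pause rank {rank}, restore partial state from {ckpt}")

def restart_from_checkpoint(rank, ckpt, node):
    print(f"restart rank {rank} from {ckpt} on {node}")

def reconnect(migrated):
    print(f"reconnect paused ranks with migrated ranks: {migrated}")

recover({0: "node0", 1: "node1", 2: "node2"},
        failed_nodes={"node1"}, spare_nodes=["spare0"], checkpoint="ckpt.step42")
```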
7. LAM/MPI+BLCR job pause performance
(Plot: time in seconds for job pause and migrate, LAM reboot, and job restart across the NAS Parallel Benchmarks BT, CG, EP, FT, LU, MG, and SP)
- 3.4% overhead over job restart, but
- No LAM reboot overhead
- Transparent continuation of execution
- No requeue penalty
- Less staging overhead
8. Proactive fault tolerance for HPC using Xen virtualization
- Standby Xen host (spare node without guest VM)
- Deteriorating health
- Migrate guest VM to spare node (see the sketch below)
- New host generates unsolicited ARP reply
- Indicates that guest VM has moved
- ARP tells peers to resend to new host
- Novel fault-tolerance scheme that acts before a failure impacts a system
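A minimal monitoring loop in this spirit is sketched below, assuming Xen's "xm migrate --live" command and a Linux thermal-zone reading as the health indicator; the domain name, spare host, threshold, and sensor path are hypothetical placeholders, not values from the actual prototype.

```python
# Hedged sketch of a proactive fault-tolerance monitor (assumed names and paths).
import subprocess
import time

GUEST_VM = "compute-guest-07"   # hypothetical Xen guest domain running MPI ranks
SPARE_HOST = "spare-node-01"    # hypothetical standby Xen host without a guest VM
TEMP_LIMIT_C = 75.0             # hypothetical "deteriorating health" threshold

def cpu_temperature_c():
    # One possible health metric; a real monitor could also use IPMI sensors,
    # fan speeds, or disk SMART data.
    with open("/sys/class/thermal/thermal_zone0/temp") as f:
        return int(f.read().strip()) / 1000.0

while True:
    if cpu_temperature_c() > TEMP_LIMIT_C:
        # Live-migrate the guest VM off the deteriorating host before it fails.
        # After the move, the new host sends an unsolicited ARP reply so that
        # peers redirect their traffic and the job continues uninterrupted.
        subprocess.run(["xm", "migrate", "--live", GUEST_VM, SPARE_HOST], check=True)
        break
    time.sleep(10)
```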
9. VM migration performance impact
(Plots: wall clock time in seconds for the NAS Parallel Benchmarks BT, CG, EP, LU, and SP; single node failure compares no migration against one migration, double node failure compares no migration against one and two migrations)
- Single node failure: 0.5-5% additional cost over total wall clock time
- Double node failure: 2-8% additional cost over total wall clock time
10. HPC reliability analysis and modeling
- Programming paradigm and system scale impact reliability
- Reliability analysis (see the sketch below)
- Estimate mean time to failure (MTTF)
- Obtain failure distribution: exponential, Weibull, gamma, etc.
- Feedback into fault-tolerance schemes for adaptation
(Plot: cumulative probability versus time between failures (TBF); candidate distributions are compared by negative log-likelihood value)
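The analysis step can be sketched with standard Python tooling. The snippet below is an illustration, not the project's actual analysis code: it assumes a plain text file of observed times between failures (file name and units are hypothetical), estimates the MTTF as the sample mean, fits exponential and Weibull candidates, and compares them by negative log-likelihood, the fit-quality measure referenced in the plot above.

```python
# Illustrative reliability analysis (file name and time units are assumptions).
import numpy as np
from scipy import stats

tbf = np.loadtxt("time_between_failures_hours.txt")  # observed TBF samples

mttf = tbf.mean()  # empirical mean time to failure
print(f"MTTF estimate: {mttf:.1f} hours")

# Fit candidate failure distributions (location fixed at zero).
wb_shape, wb_loc, wb_scale = stats.weibull_min.fit(tbf, floc=0)
ex_loc, ex_scale = stats.expon.fit(tbf, floc=0)

# Compare fits by negative log-likelihood (lower is better); the winning
# distribution can be fed back into the fault-tolerance schemes, e.g. to
# tune checkpoint intervals or trigger proactive migration.
nll_weibull = -np.sum(stats.weibull_min.logpdf(tbf, wb_shape, wb_loc, wb_scale))
nll_expon = -np.sum(stats.expon.logpdf(tbf, ex_loc, ex_scale))
print(f"Weibull NLL: {nll_weibull:.1f}, exponential NLL: {nll_expon:.1f}")
```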
11. Contacts
Stephen L. Scott
Computer Science Research Group, Computer Science and Mathematics Division
(865) 574-3144
scottsl_at_ornl.gov

Christian Engelmann
Computer Science Research Group, Computer Science and Mathematics Division
(865) 574-3132
engelmannc_at_ornl.gov