Title: Reliability, Availability, and Serviceability (RAS) for High-Performance Computing
1. Reliability, Availability, and Serviceability (RAS) for High-Performance Computing
- Stephen L. Scott and Christian Engelmann
- Computer Science Research Group, Computer Science and Mathematics Division
2. Research and development goals
- Provide high-level RAS capabilities for current terascale and next-generation petascale high-performance computing (HPC) systems
- Eliminate many of the numerous single points of failure and control in today's HPC systems
- Develop techniques to enable HPC systems to run computational jobs 24/7
- Develop proof-of-concept prototypes and production-type RAS solutions
3. MOLAR: Adaptive runtime support for high-end computing operating and runtime systems
- Addresses the challenges for operating and runtime systems to run large applications efficiently on future ultrascale high-end computers
- Part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)
- MOLAR is a collaborative research effort (www.fastos.org/molar)
4. Symmetric active/active redundancy
- Many active head nodes
- Workload distribution
- Symmetric replication between head nodes
- Continuous service
- Always up to date
- No fail-over necessary
- No restore-over necessary
- Virtual synchrony model (see the sketch below)
- Complex algorithms
- Prototypes for Torque and the Parallel Virtual File System metadata server
(Diagram: multiple active/active head nodes serving a set of compute nodes)
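The consistency argument behind active/active redundancy can be shown with a small state-machine replication sketch. The Python snippet below is a minimal illustration, not the actual MOLAR or PVFS prototype code; the MetadataReplica class and request format are hypothetical. It assumes the group communication layer has already delivered client requests in total order (the virtual synchrony guarantee), so every head node applies the same sequence to a deterministic state machine, all replicas stay identical, and no fail-over or restore-over is needed.

```python
# Minimal sketch of state-machine replication under virtual synchrony.
# Class name and request format are hypothetical; the real prototypes
# replicate the Torque and PVFS metadata services.

class MetadataReplica:
    """One head node's copy of the replicated service state."""
    def __init__(self):
        self.metadata = {}

    def apply(self, request):
        # Deterministic update: the same totally ordered input produces
        # the same state on every replica.
        op, path, attrs = request
        if op == "create":
            self.metadata[path] = attrs
        elif op == "remove":
            self.metadata.pop(path, None)

# In the real system a group communication layer delivers requests to all
# head nodes in the same total order; here we replay one agreed-upon log.
ordered_requests = [
    ("create", "/scratch/job42", {"stripes": 4}),
    ("create", "/scratch/job43", {"stripes": 2}),
    ("remove", "/scratch/job42", None),
]

head_nodes = [MetadataReplica() for _ in range(3)]  # three active head nodes
for request in ordered_requests:
    for replica in head_nodes:
        replica.apply(request)

# Every replica is always up to date, so any head node can serve requests
# and the loss of one node requires no fail-over or state restore.
assert all(r.metadata == head_nodes[0].metadata for r in head_nodes)
```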
5. Symmetric active/active Parallel Virtual File System metadata server
(Plots: writing and reading throughput in requests/sec versus number of clients, 1 to 32, for plain PVFS and for 1, 2, and 4 active/active metadata servers)
6. Reactive fault tolerance for HPC with LAM/MPI+BLCR job-pause mechanism
- Operational nodes: pause
- BLCR reuses existing processes
- LAM/MPI reuses existing connections
- Restore partial process state from checkpoint
- Failed nodes: migrate
- Restart process on new node from checkpoint
- Reconnect with paused processes
- Scalable MPI membership management for low overhead
- Efficient, transparent, and automatic failure recovery (see the sketch below)
(Diagram: a failed MPI process is migrated from its failed node to a spare node via a checkpoint on shared storage; paused MPI processes on live nodes keep their existing connections and establish new connections to the migrated process)
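A compressed view of the job-pause control flow is sketched below. This is a hedged illustration only: the helpers pause_and_rollback, restart_from_checkpoint, and reconnect are hypothetical names, since the real mechanism is implemented inside LAM/MPI and BLCR rather than in user-level code. The sketch captures the key decision: ranks on live nodes are paused and rolled back in place, reusing their processes and connections, while ranks from failed nodes are restarted from the checkpoint on spare nodes and only those new connections are rebuilt.

```python
# Hypothetical sketch of the job-pause/migrate decision (not LAM/MPI or BLCR code).

def recover(rank_to_node, failed_nodes, spare_nodes, checkpoint):
    """Pause live ranks in place; migrate ranks from failed nodes to spares."""
    migrated = {}
    spares = list(spare_nodes)
    for rank, node in rank_to_node.items():
        if node in failed_nodes:
            target = spares.pop()                              # pick a spare node
            restart_from_checkpoint(rank, checkpoint, target)  # cold restart there
            migrated[rank] = target
        else:
            pause_and_rollback(rank, checkpoint)  # reuse existing process and links
    reconnect(migrated)  # only connections to migrated ranks must be rebuilt
    return migrated

# Stubs so the sketch runs stand-alone; they only log the intended action.
def pause_and_rollback(rank, ckpt):
    print(f"pause rank {rank}, restore partial state from {ckpt}")

def restart_from_checkpoint(rank, ckpt, node):
    print(f"restart rank {rank} from {ckpt} on {node}")

def reconnect(migrated):
    print(f"reconnect paused ranks with migrated ranks: {migrated}")

recover({0: "node0", 1: "node1", 2: "node2"},
        failed_nodes={"node1"}, spare_nodes=["spare0"], checkpoint="ckpt.step42")
```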
7. LAM/MPI+BLCR job pause performance
(Plot: time in seconds for job pause and migrate, LAM reboot, and job restart across the NAS Parallel Benchmarks BT, CG, EP, FT, LU, MG, and SP)
- 3.4% overhead over job restart, but
- No LAM reboot overhead
- Transparent continuation of execution
- No requeue penalty
- Less staging overhead
8. Proactive fault tolerance for HPC using Xen virtualization
- Standby Xen host (spare node without guest VM)
- Deteriorating health
- Migrate guest VM to spare node (see the sketch below)
- New host generates unsolicited ARP reply
- Indicates that guest VM has moved
- ARP tells peers to resend to new host
- Novel fault-tolerance scheme that acts before a failure impacts a system
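A minimal monitoring loop in this spirit is sketched below, assuming Xen's "xm migrate --live" command and a Linux thermal-zone reading as the health indicator; the domain name, spare host, threshold, and sensor path are hypothetical placeholders, not values from the actual prototype.

```python
# Hedged sketch of a proactive fault-tolerance monitor (assumed names and paths).
import subprocess
import time

GUEST_VM = "compute-guest-07"   # hypothetical Xen guest domain running MPI ranks
SPARE_HOST = "spare-node-01"    # hypothetical standby Xen host without a guest VM
TEMP_LIMIT_C = 75.0             # hypothetical "deteriorating health" threshold

def cpu_temperature_c():
    # One possible health metric; a real monitor could also use IPMI sensors,
    # fan speeds, or disk SMART data.
    with open("/sys/class/thermal/thermal_zone0/temp") as f:
        return int(f.read().strip()) / 1000.0

while True:
    if cpu_temperature_c() > TEMP_LIMIT_C:
        # Live-migrate the guest VM off the deteriorating host before it fails.
        # After the move, the new host sends an unsolicited ARP reply so that
        # peers redirect their traffic and the job continues uninterrupted.
        subprocess.run(["xm", "migrate", "--live", GUEST_VM, SPARE_HOST], check=True)
        break
    time.sleep(10)
```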
9. VM migration performance impact
(Plots: wall clock time in seconds for the NAS Parallel Benchmarks BT, CG, EP, LU, and SP; single node failure compares no migration against one migration, double node failure compares no migration against one and two migrations)
- Single node failure: 0.5-5% additional cost over total wall clock time
- Double node failure: 2-8% additional cost over total wall clock time
10. HPC reliability analysis and modeling
- Programming paradigm and system scale impact reliability
- Reliability analysis (see the sketch below)
- Estimate mean time to failure (MTTF)
- Obtain failure distribution: exponential, Weibull, gamma, etc.
- Feedback into fault-tolerance schemes for adaptation
(Plot: cumulative probability versus time between failures (TBF); candidate distributions are compared by negative log-likelihood value)
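The analysis step can be sketched with standard Python tooling. The snippet below is an illustration, not the project's actual analysis code: it assumes a plain text file of observed times between failures (file name and units are hypothetical), estimates the MTTF as the sample mean, fits exponential and Weibull candidates, and compares them by negative log-likelihood, the fit-quality measure referenced in the plot above.

```python
# Illustrative reliability analysis (file name and time units are assumptions).
import numpy as np
from scipy import stats

tbf = np.loadtxt("time_between_failures_hours.txt")  # observed TBF samples

mttf = tbf.mean()  # empirical mean time to failure
print(f"MTTF estimate: {mttf:.1f} hours")

# Fit candidate failure distributions (location fixed at zero).
wb_shape, wb_loc, wb_scale = stats.weibull_min.fit(tbf, floc=0)
ex_loc, ex_scale = stats.expon.fit(tbf, floc=0)

# Compare fits by negative log-likelihood (lower is better); the winning
# distribution can be fed back into the fault-tolerance schemes, e.g. to
# tune checkpoint intervals or trigger proactive migration.
nll_weibull = -np.sum(stats.weibull_min.logpdf(tbf, wb_shape, wb_loc, wb_scale))
nll_expon = -np.sum(stats.expon.logpdf(tbf, ex_loc, ex_scale))
print(f"Weibull NLL: {nll_weibull:.1f}, exponential NLL: {nll_expon:.1f}")
```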
11. Contacts
Stephen L. Scott
Computer Science Research Group, Computer Science and Mathematics Division
(865) 574-3144
scottsl_at_ornl.gov

Christian Engelmann
Computer Science Research Group, Computer Science and Mathematics Division
(865) 574-3132
engelmannc_at_ornl.gov