Transcript and Presenter's Notes

Title: Reliability, Availability, and Serviceability (RAS) for High-Performance Computing


1
Reliability, Availability, and Serviceability
(RAS) for High-Performance Computing
  • Stephen L. Scott and Christian Engelmann
  • Computer Science Research Group, Computer Science
    and Mathematics Division

2
Research and development goals
  • Provide high-level RAS capabilities for current
    terascale and next-generation petascale
    high-performance computing (HPC) systems
  • Eliminate the many single points of failure
    and control in today's HPC systems
  • Develop techniques to enable HPC systems to run
    computational jobs 24/7
  • Develop proof-of-concept prototypes and
    production-type RAS solutions

3
MOLAR: Adaptive runtime support for high-end
computing operating and runtime systems
  • Addresses the challenges for operating and
    runtime systems to run large applications
    efficiently on future ultrascale high-end
    computers
  • Part of the Forum to Address Scalable Technology
    for Runtime and Operating Systems (FAST-OS)
  • MOLAR is a collaborative research effort
    (www.fastos.org/molar)

4
Symmetric active/active redundancy
  • Many active head nodes
  • Workload distribution
  • Symmetric replication between head nodes
  • Continuous service
  • Always up to date
  • No fail-over necessary
  • No restore-over necessary
  • Virtual synchrony model
  • Complex algorithms
  • Prototypes for the Torque resource manager and the
    Parallel Virtual File System (PVFS) metadata server
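A minimal sketch of this replication model follows, assuming a group-communication layer (not shown) that delivers every client request to all head nodes in the same total order; the HeadNodeReplica class and its job-queue state are illustrative stand-ins, not the Torque or PVFS prototype code.

```python
# Illustrative sketch (not the actual Torque/PVFS prototype code).
# Every head node runs the same deterministic state machine; an assumed
# group-communication layer delivers all client requests to all replicas
# in the same total order (virtual synchrony), so every replica stays up
# to date and no fail-over or restore-over is ever needed.

class HeadNodeReplica:
    """One active head node holding a replicated job-queue state."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.jobs = {}      # replicated state: job id -> job description
        self.applied = 0    # count of totally ordered requests applied

    def deliver(self, seq, request):
        """Apply a request delivered by the assumed total-order broadcast."""
        assert seq == self.applied + 1, "total order must have no gaps"
        op, job_id, payload = request
        if op == "submit":
            self.jobs[job_id] = payload
        elif op == "cancel":
            self.jobs.pop(job_id, None)
        self.applied = seq


# Toy run: three active head nodes receive the same ordered request stream.
replicas = [HeadNodeReplica(i) for i in range(3)]
stream = [(1, ("submit", "job-1", {"nodes": 64})),
          (2, ("submit", "job-2", {"nodes": 128})),
          (3, ("cancel", "job-1", None))]
for seq, req in stream:
    for replica in replicas:   # the group-communication layer would do this
        replica.deliver(seq, req)
assert all(r.jobs == replicas[0].jobs for r in replicas)  # states identical
```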

(Figure: multiple active/active head nodes in front of the compute nodes)
5
Symmetric active/active Parallel Virtual File
System metadata server
(Figure: writing and reading throughput, in requests per second, versus number of clients (1 to 32) for plain PVFS and for 1, 2, and 4 active/active metadata servers)
6
Reactive fault tolerance for HPC with a
LAM/MPI+BLCR job-pause mechanism
  • Operational nodes: pause
  • BLCR reuses existing processes
  • LAM/MPI reuses existing connections
  • Restore partial process state from checkpoint
  • Failed nodes: migrate
  • Restart process on new node from checkpoint
  • Reconnect with paused processes
  • Scalable MPI membership management for low
    overhead
  • Efficient, transparent, and automatic failure
    recovery (a schematic sketch follows the figure below)

(Figure: the failed MPI process is migrated from the failed node to a spare node using the checkpoint on shared storage; paused MPI processes on live nodes keep their existing connections and open new connections to the migrated process)
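A schematic walk-through of this pause-and-migrate sequence; the MPIProcess class and job_pause_recover function below are hypothetical stand-ins that only mimic the bookkeeping, not the LAM/MPI or BLCR interfaces.

```python
# Schematic simulation of the job-pause sequence (illustrative only; the
# class and function below are stand-ins, not the LAM/MPI or BLCR API).

class MPIProcess:
    def __init__(self, rank, node):
        self.rank = rank
        self.node = node
        self.state = "running"
        self.peers = set()       # ranks this process is connected to

def job_pause_recover(processes, failed_nodes, spare_nodes, checkpoint):
    """Recover an MPI job after node failures without a full restart."""
    spares = iter(spare_nodes)
    failed = [p for p in processes if p.node in failed_nodes]
    live = [p for p in processes if p.node not in failed_nodes]

    # 1. Pause: live processes are kept alive (existing processes and
    #    connections are reused) and roll back to the checkpointed state.
    for p in live:
        p.state = "paused"
        p.restored_from = checkpoint

    # 2. Migrate: each failed process is restarted on a spare node from
    #    the checkpoint kept on shared storage.
    for p in failed:
        p.node = next(spares)
        p.restored_from = checkpoint

    # 3. Reconnect: paused processes keep their existing connections and
    #    add new connections to the migrated processes, then all resume.
    for p in live:
        p.peers.update(m.rank for m in failed)
    for p in failed:
        p.peers.update(l.rank for l in live)
    for p in processes:
        p.state = "running"
    return processes

# Example: ranks 0-3 on nodes n0-n3; n2 fails and its process moves to n4.
procs = [MPIProcess(r, "n%d" % r) for r in range(4)]
job_pause_recover(procs, failed_nodes={"n2"}, spare_nodes=["n4"],
                  checkpoint="ckpt/step42")
print([(p.rank, p.node, p.state) for p in procs])
```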
7
LAM/MPI+BLCR job-pause performance
(Figure: job pause and migrate, LAM reboot, and job restart times in seconds for the benchmarks BT, CG, EP, FT, LU, MG, and SP)
  • 3.4% overhead over job restart, but
  • No LAM reboot overhead
  • Transparent continuation of execution
  • No requeue penalty
  • Less staging overhead

8
Proactive fault tolerance for HPC using Xen
virtualization
  • Standby Xen host (spare node without guest VM)
  • Deteriorating health
  • Migrate guest VM to spare node
  • New host generates unsolicited ARP reply
  • Indicates that guest VM has moved
  • ARP tells peers to resend to new host
  • Novel fault-tolerance scheme that acts before a
    failure impacts a system
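A minimal sketch of such a proactive monitor, assuming a hypothetical health reading (read_node_health) and the Xen 3.x command "xm migrate --live"; the threshold, sensor source, and names are illustrative.

```python
# Sketch of a proactive fault-tolerance monitor (illustrative). It assumes
# a hypothetical temperature reading and relies on the Xen 3.x command
# "xm migrate --live"; threshold, sensor source, and names are made up.
import subprocess
import time

TEMP_LIMIT_C = 75.0            # illustrative health threshold

def read_node_health():
    """Return the host temperature in Celsius (hypothetical sensor read)."""
    with open("/tmp/node_temp") as sensor:   # stand-in for IPMI/lm_sensors
        return float(sensor.read())

def migrate_guest(domain, spare_host):
    """Live-migrate the guest VM to the standby Xen host.

    After the move, the new host generates an unsolicited ARP reply so
    that network peers resend in-flight traffic to the VM's new location.
    """
    subprocess.check_call(["xm", "migrate", "--live", domain, spare_host])

def monitor(domain, spare_host, interval=10):
    """Act on deteriorating health before it becomes a failure."""
    while True:
        if read_node_health() > TEMP_LIMIT_C:
            migrate_guest(domain, spare_host)
            break                # this host is now free of the guest VM
        time.sleep(interval)

# Example (hypothetical names): monitor("compute-vm01", "spare-node-07")
```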

9
VM migration performance impact
(Figure: wall clock time in seconds for benchmarks BT, CG, EP, LU, and SP under single node failure (without migration, one migration) and double node failure (without migration, one migration, two migrations))
  • Single node failure: 0.5-5% additional cost over
    total wall clock time
  • Double node failure: 2-8% additional cost over
    total wall clock time

10
HPC reliability analysis and modeling
  • Programming paradigm and system scale impact
    reliability
  • Reliability analysis
  • Estimate mean time to failure (MTTF)
  • Obtain the failure distribution (exponential,
    Weibull, gamma, etc.)
  • Feed results back into fault-tolerance schemes
    for adaptation (see the fitting sketch after the
    figure below)

(Figure: cumulative probability and negative likelihood value versus time between failures (TBF), 0 to 300, for the fitted distributions)
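A short sketch of this type of analysis using synthetic time-between-failure samples and SciPy's distribution fitting; the data, the candidate set, and the use of scipy.stats are illustrative, not the project's actual tooling.

```python
# Illustrative reliability analysis: fit candidate failure distributions to
# time-between-failure (TBF) samples and estimate the MTTF. The data below
# is synthetic and scipy.stats is just one convenient way to do the fit.
import numpy as np
from scipy import stats

# Synthetic TBF observations in hours (stand-in for real system logs).
tbf = np.array([12.0, 55.0, 7.5, 30.0, 90.0, 18.0, 42.0, 3.0, 66.0, 25.0])

candidates = {
    "exponential": stats.expon,
    "Weibull": stats.weibull_min,
    "gamma": stats.gamma,
}

best_name, best_nll, best_fit = None, np.inf, None
for name, dist in candidates.items():
    params = dist.fit(tbf, floc=0)             # fix the location at zero
    nll = -np.sum(dist.logpdf(tbf, *params))   # negative log-likelihood
    if nll < best_nll:
        best_name, best_nll, best_fit = name, nll, (dist, params)

dist, params = best_fit
mttf = dist.mean(*params)        # MTTF = mean of the fitted distribution
print(f"best fit: {best_name}, negative log-likelihood: {best_nll:.2f}")
print(f"estimated MTTF: {mttf:.1f} hours")
# The fitted distribution and MTTF can feed back into the fault-tolerance
# schemes, e.g. to adapt checkpoint intervals or migration thresholds.
```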
11
Contacts
Stephen L. Scott
Computer Science Research Group
Computer Science and Mathematics Division
(865) 574-3144
scottsl@ornl.gov

Christian Engelmann
Computer Science Research Group
Computer Science and Mathematics Division
(865) 574-3132
engelmannc@ornl.gov