Title: Case Studies
1HUMBOLDT-UNIVERSITÄT ZU BERLIN, INSTITUT FÜR INFORMATIK
DEPENDABLE SYSTEMS, Lecture 10: CASE STUDIES
Winter semester 99/00, Instructor: Prof. Dr. Miroslaw Malek
www.informatik.hu-berlin.de/rok/ftc
2CASE STUDIES
- OBJECTIVES
- TO SHOW EXAMPLES OF EXISTING SYSTEMS WHICH ARE DESIGNED TO ASSURE HIGH RELIABILITY
- TO RELATE GENERAL RELIABILITY METHODOLOGIES DESCRIBED EARLIER TO PRACTICAL IMPLEMENTATIONS OF THOSE IDEAS
- TO SURVEY THE GENERAL EXISTING RELIABILITY CONCEPTS WITH EXEMPLARY CASES
- CONTENTS
- COMMERCIAL SYSTEMS FROM AT&T, SEQUOIA, STRATUS AND TANDEM
- FTMP - FAULT-TOLERANT MULTIPROCESSOR
- SIFT - SOFTWARE IMPLEMENTED FAULT TOLERANCE
- COMMUNICATION CONTROLLER
- FAULT-TOLERANT BUILDING BLOCK ARCHITECTURE
3AT&T's ELECTRONIC SWITCHING SYSTEMS ESS1A - ESS 5 AND 3B20 (1)
- REQUIREMENTS
- Downtime for the entire system not to exceed 2 hours over a 40-year life
- Calls handled incorrectly < 0.02%
- System outage 3 min/year
- 100% availability 24 hours a day from the user's perspective
- Two minutes of downtime are contributed by
- 24 sec - hardware faults (20%)
- 18 sec - software deficiencies (15%)
- 36 sec - procedural errors (30%)
- 42 sec - recovery deficiencies (35%)
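The downtime figures above can be checked with a few lines of arithmetic. This is a minimal sketch; the names `TOTAL_DOWNTIME_S`, `SHARES`, `allocate` and `minutes_per_year` are illustrative and not taken from the lecture.

```python
# Downtime budget arithmetic for the requirements listed on the slide.
TOTAL_DOWNTIME_S = 120  # "two minutes of downtime", in seconds

# Percentage share of the downtime attributed to each cause (from the slide).
SHARES = {
    "hardware faults": 20,
    "software deficiencies": 15,
    "procedural errors": 30,
    "recovery deficiencies": 35,
}


def allocate(total_s, shares):
    """Split a downtime budget (seconds) according to percentage shares."""
    return {cause: total_s * pct / 100 for cause, pct in shares.items()}


def minutes_per_year(hours, years):
    """Average yearly downtime in minutes for a lifetime downtime limit."""
    return hours * 60 / years


budget = allocate(TOTAL_DOWNTIME_S, SHARES)
for cause, seconds in budget.items():
    print(f"{cause}: {seconds:.0f} s")

# 2 hours over 40 years corresponds to the 3 min/year outage figure.
print(minutes_per_year(2, 40), "min/year")
```

The shares sum to 100%, and the per-cause seconds reproduce the 24/18/36/42 split on the slide.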
4AT&T's ELECTRONIC SWITCHING SYSTEMS ESS1A - ESS 5 AND 3B20 (2)
- OTHER FEATURES
- 95% of hardware and software faults detected and diagnosed automatically
- 90% of hardware faults diagnosed within a field replaceable unit (FRU)
- Repair time less than 2 hours on ESS
- 1 minute on 3B20
- REDUNDANCY
- FULL DUPLICATION (of critical modules)
- CPU, memory, I/O, disks, bus systems
- STANDBY SPARES
- call store
- ERROR DETECTION (at both hardware and software levels)
- replication checks
- timing checks
- coding checks
- internal checks (self-checking)
5AT&T's ELECTRONIC SWITCHING SYSTEMS ESS1A - ESS 5 AND 3B20 (3)
- replication checks
- duplex system with comparison on every cycle
- timing checks
- used in all hardware components; also several timer resets driven by software interrupts
- coding
- m-out-of-n (4-out-of-8) codes, parity and cyclic codes
- internal checks
- address limits
- multiple comparators help software to locate faults faster
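The coding checks named above can be illustrated with a small checker. This is a sketch only, assuming 8-bit words; the function names are hypothetical and say nothing about how the ESS hardware implements the check.

```python
def is_m_out_of_n(word, m=4, n=8):
    """True if `word` is a valid m-out-of-n codeword (exactly m one-bits in n bits)."""
    if not 0 <= word < (1 << n):
        return False
    return bin(word).count("1") == m


def even_parity_ok(word):
    """True if the word (data plus parity bit) has an even number of one-bits."""
    return bin(word).count("1") % 2 == 0


codeword = 0b00110101                       # four ones: valid 4-out-of-8 word
corrupted = codeword ^ 0b00000100           # a single bit flip changes the weight
print(is_m_out_of_n(codeword))              # valid
print(is_m_out_of_n(corrupted))             # flagged as an error
print(sum(is_m_out_of_n(w) for w in range(256)))  # number of valid codewords
```

Any single-bit error, and any error that flips bits in one direction only, changes the number of ones and is therefore detected.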
6SYSTEM VIEW (3B20)
7FAULT TREATMENT
- Detection of an error generates an interrupt and the fault treatment and recovery programs (FT/RP) are invoked
- Three priority categories
- immediate interrupt (maintenance interrupt)
- if the fault is severe enough to affect the execution of the currently executing program
- deferred interrupt
- if too many calls are potentially affected by the interrupt, then wait until the completion of the currently executing program
- polite interrupt
- waits until the periodic routine diagnostic is executed
- FT/RP identify and isolate the faulty unit and reconfigure the system to use one fault-free CPU
- If storage has no duplication, another memory area will be assigned
8RELIABLE SOFTWARE GOALS
- OPERATE CONTINUOUSLY FOR MONTHS OR YEARS
- TECHNIQUES USED FOR HIGH SOFTWARE RELIABILITY
- PROCESSES HAVE INDIVIDUAL FAULT RECOVERY AND ROLLBACK MECHANISMS WHICH RECOVER FROM HARDWARE FAILURES OR TRANSIENT SOFTWARE FAILURES
- SYSTEM INTEGRITY SOFTWARE MONITORS CORRECT OPERATION OF THE ENTIRE HARDWARE AND SOFTWARE SYSTEM
- AUDITS VALIDATE DATA CONSISTENCY AND RECLAIM LOST RESOURCES USING ROBUST DATA STRUCTURES
- OVERLOAD CONTROLS ENSURE THE AVAILABILITY OF RESOURCES AND PREVENT CATASTROPHIC FAILURES
- EXCEPTION HANDLING TECHNIQUES
- NONCRITICAL PROGRAMS USUALLY TERMINATE AND RESTART
- CRITICAL PROGRAMS WILL ROLL BACK AND RETRY
9PROGRESSIVE RECOVERY EFFORT
- LEVEL: ACTION
- LOCAL: LOCAL RECOVERY
- 1: OPERATING SYSTEM AND I/O DRIVER ROLLBACK
- 2: QUICK BOOTSTRAP
- 3: COMPLETE BOOTSTRAP, RELOAD CONFIGURATION DATABASE
- 4: MANUAL: CLEAR ALL OF MEMORY, DO 3 ABOVE
- ALTHOUGH DOWNTIME DOES NOT INCREASE SIGNIFICANTLY AS RECOVERY ACTIONS ESCALATE, DISRUPTIONS TO USERS OF APPLICATIONS DO INCREASE SIGNIFICANTLY
- ABORTED TRANSACTIONS
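The escalation through recovery levels can be sketched as a loop that tries the least disruptive action first. This is an illustration under assumptions: the `attempt` callback and the `recover` function are hypothetical, and only the level names come from the slide.

```python
# Recovery levels in escalation order, as listed on the slide.
RECOVERY_LEVELS = [
    ("local", "local recovery"),
    ("1", "operating system and I/O driver rollback"),
    ("2", "quick bootstrap"),
    ("3", "complete bootstrap, reload configuration database"),
    ("4", "manual: clear all of memory, then do level 3"),
]


def recover(attempt):
    """Try each level in order until one succeeds.

    `attempt(level)` is a caller-supplied function returning True when the
    recovery action at that level restored service. Returns the list of
    levels tried and whether recovery succeeded.
    """
    tried = []
    for level, _action in RECOVERY_LEVELS:
        tried.append(level)
        if attempt(level):
            return tried, True
    return tried, False


# Example: a fault that only a quick bootstrap clears.
levels_tried, ok = recover(lambda level: level == "2")
print(levels_tried, ok)
```

Starting at the lowest level keeps user disruption small in the common case; each escalation step aborts more in-flight work.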
10SYSTEM ENHANCEMENT GOALS
- INSTALL NEW HARDWARE AND SOFTWARE
- WITHOUT TAKING DOWN THE SYSTEM
- METHODS TO ADD UPDATES
- CHANGE HARDWARE AND SOFTWARE WITH NO DISRUPTION IN SERVICE
- INSTALL NEW HARDWARE, FIRMWARE, OR SOFTWARE WITH MINIMAL DISRUPTION IN SERVICE
- OFF-LINE SOFTWARE REPLACEMENT SYSTEM
- COMPILE THE NEW SOURCE CODE
- COMPARE NEW OBJECT CODE TO OLD OBJECT CODE
- DETERMINE KINDS OF REPLACEMENTS NEEDED
- GENERATE THE REPLACEMENT FILES
- METHODS TO REMOVE FAULTY UPDATES
- BACK OUT ANY UPDATES WHICH WERE FOUND TO CONTAIN FAULTS
- AUTOMATICALLY BACK OUT OF ANY UPDATE SUSPECTED OF CAUSING A FAILURE
11OPERATOR INTERFACE GOALS
- HELP EFFECT A QUICK REPAIR
- PROVIDE IMMEDIATE FEEDBACK ON STATUS OF SYSTEM
- HELP OPERATOR MAKE QUICK, ACCURATE DECISIONS
- PREVENT DANGEROUS OPERATOR MISTAKES
- PROVIDE POSITIVE CONTROL OF ALL PARTS OF SYSTEM
12FAULT INJECTION AND REPAIR SIMULATION
- OVER 10,000 SINGLE HARDWARE FAULTS WERE INJECTED AT RANDOM AND AUTOMATIC SYSTEM RECOVERY WORKED IN OVER 99.8% OF CASES
- IN 133 SIMULATED REPAIR CASES THE TROUBLE LOCATION PROCEDURE (TLP) FAILED TO LOCATE THE FAULTY MODULE IN 5 CASES, AND IN 94% OF THE LISTS OF SUSPECTED FAULTY COMPONENTS THE FAULT WAS LOCATED WITHIN THE FIRST FIVE MODULES
13AVAILABILITY ASSURANCE
- MODEL AVAILABILITY
- THROUGH ENTIRE LIFECYCLE
- TEST FOR AVAILABILITY
- TO MEET SPECIFIED AVAILABILITY
- TRACK ON-SITE EXPERIENCE
- TO ENSURE AVAILABILITY OBJECTIVES ARE MET
14SEQUOIA (Marlboro, MA 01752 ph. 617-480-0800)
- TIGHTLY-COUPLED MULTIPROCESSOR capable of trading performance for dependability and vice versa
- MC68020 PROCESSORS (20MHz clock)
- up to 64 PEs
- up to 128 MEs (16 M bytes with ECC)
- up to 96 IOEs
- two 40-bit 10MHz buses
- FAULT DETECTION
- error-detecting codes (e.g., half odd-half even parity)
- comparison of duplicated operations (duplex microprocessors)
- protocol monitoring
- PE faults are located by polling
- RECONFIGURATION
- reassignment to fault-free processors
15STRATUS (also IBM's System/88) (Natick, MA 01760 ph. 617-653-1466)
- TWO PAIRS OF DUPLEXED PEs (PAIR AND SPARE PAIR)
- UP TO 32 PEs ON RING-TYPE LOCAL AREA NETWORK
- RED-LIGHT NOTIFICATION ABOUT FAULTY BOARD
- ABILITY TO EXCHANGE BOARDS ON LINE
- ECC ON MEMORIES (Up to 32M bytes per PE)
- PERFORMANCE/FAULT TOLERANCE OPTIONS
16STRATUS XA/R SERIES 300: PAIR AND SPARE CONCEPT
(Figure: Stratus XA/R Series 300 module)
17TANDEM (Cupertino, CA 95014 ph. 408-725-6000)
- CONFIGURATIONS
- SINGLE SYSTEM 2-16 PEs
- FIBER OPTIC CABLE-CONNECTED SYSTEM UP TO 224 PEs (14X16)
- WORLD-WIDE NETWORK UP TO 4,080 PEs
- THE FAULT-TOLERANT COMPUTER OF THE EIGHTIES: FEATURES
- NONSTOP II OR NONSTOP TXP PROCESSOR WITH 64KB CACHE
- DUAL DYNABUS (26 Mbytes/sec)
- 2-8 Mbytes Memories
- Dual Disk (MTBF for a single disk is 3-5 years; with dual disk, the MTBF increases to 1500 years)
- FAULT DETECTION - 100% by duplication or by timeout mechanism (absence of "I'm alive" message)
- FAULT-TOLERANT WITH RESPECT TO ANY SINGLE HARDWARE FAULT
- RECOVERY by rollback to the latest checkpoint in memory
- LATEST SYSTEM INTEGRITY S2 USES TMR OF MIPS PROCESSORS ("SELECTIVE" TMR)
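The jump from a 3-5 year single-disk MTBF to 1500 years for a dual disk can be made plausible with the usual approximation for a mirrored, repairable pair: pair MTBF ≈ MTBF² / (2 · MTTR). The slide states neither the model nor a repair time, so both are assumptions here, and the function names are illustrative.

```python
HOURS_PER_YEAR = 8766.0  # 365.25 days


def mirrored_pair_mtbf(disk_mtbf_years, repair_years):
    """Approximate MTBF of a mirrored pair (assumes independent failures
    and a repair time much shorter than the single-disk MTBF)."""
    return disk_mtbf_years ** 2 / (2.0 * repair_years)


def implied_repair_years(disk_mtbf_years, pair_mtbf_years):
    """Repair time that the same approximation implies for a given pair MTBF."""
    return disk_mtbf_years ** 2 / (2.0 * pair_mtbf_years)


for disk_mtbf in (3.0, 5.0):
    repair = implied_repair_years(disk_mtbf, 1500.0)
    print(f"disk MTBF {disk_mtbf} y -> implied repair time "
          f"{repair * HOURS_PER_YEAR:.0f} h")
```

Under this model the 1500-year figure corresponds to a failed disk being replaced and resynchronized within roughly one to three days.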
18NONSTOP CYCLONE (TANDEM COMPUTERS Inc.)
- CYCLONE TOLERATES A SINGLE HARDWARE OR SOFTWARE FAULT
- IT USES A FAULT-TOLERANT LOAD BALANCING OPERATING SYSTEM CALLED GUARDIAN 90
- GUARDIAN 90 MAINTAINS BACKUP OF USER PROCESSES ON SEPARATE PROCESSORS AND KEEPS CONSISTENCY BY PERIODIC CHECKPOINTING
- 16 AND 64 PROCESSOR CONFIGURATIONS WITH UP TO 2 GB MEMORY, 64 I/O CHANNELS (WITH FOX NETWORK UP TO 255 PROCESSORS CAN WORK TOGETHER)
19NONSTOP CYCLONE (TANDEM COMPUTERS Inc.)
(Figure: Tandem NonStop Cyclone system)
20CYCLONE SYSTEM ARCHITECTURE
- Superscalar proprietary CISC processors
- A section is a quad of processors which are connected by duplexed DYNABUS (a proprietary, fault-tolerant bus, 40 MB/sec)
- Sections are also redundantly (duplexed both ways) interconnected by DYNABUS, also a proprietary, up to 50 m long, fault-tolerant bus which uses fiber optics
- BASIC PRINCIPLE: FAIL FAST
- (concurrent error detection or "I'm alive" messages, combined with immediate termination of operation upon detection to minimize error propagation)
- Replacement of components on line
- SEC-DED on memories
- Mirrored disks
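The fail-fast principle combined with "I'm alive" messages can be sketched as follows. This is a toy model, not Tandem's implementation: the classes `HeartbeatMonitor` and `FailFastProcessor`, the timeout value and the explicit `now` timestamps are all assumptions made for illustration.

```python
class HeartbeatMonitor:
    """Suspects any processor whose last "I'm alive" message is too old."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def heartbeat(self, name, now):
        self.last_seen[name] = now

    def suspected(self, now):
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_s)


class FailFastProcessor:
    """Stops immediately when its own error detection fires."""

    def __init__(self, name):
        self.name = name
        self.halted = False

    def step(self, self_check_ok, monitor, now):
        if self.halted:
            return
        if not self_check_ok:
            self.halted = True  # terminate at once; no further output
            return
        monitor.heartbeat(self.name, now)


monitor = HeartbeatMonitor(timeout_s=2.0)
p0, p1 = FailFastProcessor("p0"), FailFastProcessor("p1")
for t in (0.0, 1.0, 2.0, 3.0, 4.0):
    p0.step(True, monitor, t)
    p1.step(t < 2.0, monitor, t)  # p1 detects an internal error at t = 2
print(monitor.suspected(now=4.5))
```

Because a halted processor produces no further messages, its silence is the only signal the rest of the system needs; a corrupted result never leaves the failed unit.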
(Figure: Four separate sections connected by DYNABUS)
21HIMALAYA K10000 (TANDEM COMPUTERS Inc.)
22HIMALAYA K10000's INTERSECTION NETWORK
(Figure labels: Section, Dual Fiber Optic Rings, Node)
23FTMP - FAULT-TOLERANT MULTIPROCESSOR (DRAPER LABS)
- THREE TRIADS IN TMR CONFIGURATION (NINE PROCESSOR SYSTEM)
- TMR ON COMMUNICATION LINES
- FAULT-TOLERANT TMR CLOCK
- FAULT-TOLERANT WITH RESPECT TO ANY SINGLE FAULT
- DESIGN GOALS
- 10^-9 FAILURES/HOUR
- 10 HOUR MISSION TIME
- 300 HOUR MAINTENANCE INTERVALS
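The voting step at the heart of a TMR triad can be sketched as a bitwise majority function. This is a generic illustration of TMR voting, not FTMP's actual voter circuitry; the function names are hypothetical.

```python
def tmr_vote(a, b, c):
    """Bitwise 2-out-of-3 majority of three integer words."""
    return (a & b) | (a & c) | (b & c)


def disagreeing_units(a, b, c):
    """Indices (0, 1, 2) of the units whose word differs from the voted result."""
    voted = tmr_vote(a, b, c)
    return [i for i, word in enumerate((a, b, c)) if word != voted]


good = 0b1010
faulty = 0b0110          # third unit delivers a corrupted word
print(bin(tmr_vote(good, good, faulty)))
print(disagreeing_units(good, good, faulty))
```

The voted output masks any single faulty unit, and the list of disagreeing units gives reconfiguration logic a candidate to replace.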
24FAULT-TOLERANT PARALLEL PROCESSOR (FTPP FROM Draper Labs)
- Byzantine resilience
(Figure: A four-triplex group cluster; an ensemble of 16 triplex groups)
25SIFT - SOFTWARE IMPLEMENTED FAULT TOLERANCE
- NINE PROCESSOR SYSTEM WITH CAPABILITY TO SCHEDULE TASKS TO RUN ON 1, 3, 5, 7 OR 9 PROCESSORS DEPENDING ON TASK CRITICALITY
- LOCAL EXECUTIVE FOR EACH TASK
- error handler/detector
- scheduler
- software voter
- repeated communication
- GLOBAL EXECUTIVE
- runs in TMR mode
- allocates resources
- diagnoses reports from local error handlers
- SYSTEM SHOULD HAVE FAILURE RATE << 10^-9 OVER 10 HOUR MISSION TIME
- FLEXIBLE TRADING OF PERFORMANCE AND RELIABILITY
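A software voter over an odd number of task replicas can be sketched as below. This is a generic majority voter written for illustration; it is not the SIFT code, and the names `majority_vote` and `tolerated_faults` are assumptions.

```python
from collections import Counter


def majority_vote(results):
    """Return the value produced by a strict majority of replicas.

    Raises ValueError when no value is held by more than half of them.
    """
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(results):
        raise ValueError("no majority among replica results")
    return value


def tolerated_faults(replicas):
    """Number of faulty replicas a majority vote over `replicas` copies can mask."""
    return (replicas - 1) // 2


print(majority_vote([42, 42, 17]))                   # one faulty replica masked
print([tolerated_faults(n) for n in (1, 3, 5, 7, 9)])
```

Running a task on more processors raises the number of faults that can be masked at the price of throughput, which is the performance/reliability trade mentioned on the slide.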
26COMMUNICATION CONTROLLER
- EXAMPLE OF A SELF-TESTING MICROPROCESSOR-BASED SYSTEM: A COMMUNICATION CONTROLLER FROM E-SYSTEMS, INC.
- THE CPU OF A SELF-TESTING SYSTEM
- SELF TEST PROGRAM IS STORED IN THE 1K TEST ROM.
- SELF TEST PROGRAM IS EXECUTED IN BACKGROUND MODE (INVOKED BY A LOW PRIORITY INTERRUPT).
- DETECTION OF FAULT CAUSES AN INDICATION LIGHT TO BE TURNED ON IN AN LED PANEL.
- THE ACTIVE MICROPROCESSOR MUST ACCESS AND RESET A TIMER AT REGULAR INTERVALS. FAILURE TO DO SO CAUSES A TIME-OUT CIRCUIT TO TRANSFER CONTROL TO THE BACK-UP MICROPROCESSOR AND TURN ON THE CPU FAULT LIGHT.
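The timer-reset mechanism described above is a watchdog timer. The sketch below models it in software under assumptions: the class `Watchdog`, the tick-based timeout and the attribute names are hypothetical and do not describe the E-Systems hardware.

```python
class Watchdog:
    """Time-out circuit: switches to the back-up CPU if not reset in time."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.counter = 0
        self.active = "primary"
        self.cpu_fault_light = False

    def reset(self):
        """Called by the active microprocessor at regular intervals."""
        self.counter = 0

    def tick(self):
        """Advance time by one tick; fire the time-out if the limit is reached."""
        self.counter += 1
        if self.counter >= self.timeout_ticks and self.active == "primary":
            self.active = "backup"        # transfer control
            self.cpu_fault_light = True   # signal the fault on the panel
            self.counter = 0


wd = Watchdog(timeout_ticks=3)
for _ in range(5):        # healthy primary: resets the timer every tick
    wd.tick()
    wd.reset()
healthy_state = wd.active
for _ in range(3):        # primary hangs: no more resets
    wd.tick()
print(healthy_state, wd.active, wd.cpu_fault_light)
```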
27THE CPU OF A SELF-TESTING SYSTEM
- ROMs ARE TESTED BY CHECK SUMMING
- RAM IS TESTED BY CHECKERBOARD PATTERNS WITH BUFFERING A CURRENT WORD UNDER TEST IN THE CPU REGISTER
- I/O TESTS ARE PERFORMED USING THE LOOP-BACK PROCEDURE, I.E., OUTPUTS ARE CONNECTED TO INPUTS UNDER THE CPU CONTROL.
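The ROM and RAM tests can be sketched as follows. This is a simplified illustration: the 8-bit additive checksum, the per-word 0x55/0xAA patterns (a full checkerboard test also alternates patterns between neighbouring cells) and the `StuckAtZeroRam` fault model are assumptions, not details from the slide.

```python
def rom_checksum_ok(rom, expected):
    """Compare an 8-bit additive checksum of the ROM image with a stored value."""
    return sum(rom) & 0xFF == expected


def ram_checkerboard_test(ram):
    """Write and read back alternating bit patterns in every word.

    The word under test is buffered first and restored afterwards, so the
    test can run in the background. Returns the failing addresses.
    """
    bad = []
    for addr in range(len(ram)):
        saved = ram[addr]                 # buffer the current word
        for pattern in (0x55, 0xAA):
            ram[addr] = pattern
            if ram[addr] != pattern:
                bad.append(addr)
                break
        ram[addr] = saved                 # restore the original content
    return bad


class StuckAtZeroRam(bytearray):
    """Hypothetical fault model: bit 0 of address 2 is stuck at zero."""

    def __setitem__(self, addr, value):
        if addr == 2:
            value &= 0xFE
        super().__setitem__(addr, value)


healthy = bytearray(b"\x01\x02\x03\x04")
print(ram_checkerboard_test(healthy), bytes(healthy))
print(ram_checkerboard_test(StuckAtZeroRam(4)))
print(rom_checksum_ok(bytes([1, 2, 3]), expected=6))
```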
28SPACE SHUTTLE SYSTEM
- The Data Processing System (DPS) of the Space Shuttle
- A FAULT-TOLERANT BUILDING BLOCK ARCHITECTURE
- Five General-Purpose Computers (GPCs)
- Time-shared Data Bus
- Two Magnetic Tape Mass Storage Units
- Specialized hardware components with redundancy level 2 to 5
29A FAULT-TOLERANT BUILDING BLOCK ARCHITECTURE (1)
- SELF-CHECKING AND FAULT TOLERANCE ARE PROVIDED AT THE PROCESSOR, MEMORY, I/O AND BUS.
- SELF-CHECKING COMPUTER MODULE (SCCM) CONTAINS FOUR TYPES OF BUILDING BLOCK CIRCUITS WHICH INTERFACE MEMORIES, PROCESSORS, I/O AND EXTERNAL BUSES TO AN INTERNAL SCCM BUS.
- THE BUILDING BLOCKS PROVIDE CONCURRENT FAULT DETECTION WITHIN THEMSELVES AND IN THEIR ASSOCIATED CIRCUITRY.
30A FAULT-TOLERANT BUILDING BLOCK
31SELF-CHECKING COMPUTER MODULES
- THE MEMORY INTERFACE BUILDING BLOCK (MIBB)
- THE MIBB SUPPORTS SINGLE ERROR CORRECTION OR DOUBLE ERROR DETECTION
- THE MIBB CAN BE COMMANDED TO REPLACE ANY TWO SPECIFIED BITS (IN ALL WORDS) WITH THE TWO SPARE BITS (PERMANENT CORRECTION)
- THE CORE BUILDING BLOCK (CBB)
- DUAL PROCESSOR SYSTEM CONTINUOUSLY COMPARES PROCESSORS' OUTPUTS AND SIGNALS A FAULT IF IT DETECTS A DISAGREEMENT
- THE CBB ALSO SERVES AS A BUS ARBITER AND COLLECTS ALL FAULT INDICATIONS FROM OTHER BUILDING BLOCKS AND ITS OWN INTERNAL CIRCUITRY
- IF A FAULT IS DETECTED, THE CBB ATTEMPTS EITHER A PROGRAM ROLLBACK OR RESTART
- IF THE FAULT RECURS, THE CBB DISABLES ITS HOST COMPUTER BY HALTING THE PROCESSORS AND DISABLING THE SCCM OUTPUTS
- ANOTHER OPTION IS TO CONTINUE OPERATION USING ONE FAULT-FREE PROCESSOR AND DEFER THE MAINTENANCE
- THE CBB USES INTERNAL DUPLICATION AND SELF-CHECKING LOGIC
32BUS INTERFACE BUILDING BLOCKS (BIBBS)
- THE BIBBS PROVIDE COMMUNICATIONS THROUGH REDUNDANT BUSES WITH OTHER COMPUTERS IN THE NETWORK
- STATUS MESSAGES AND CODING VERIFY PROPER TRANSMISSION AND REDUNDANT BUSES PROVIDE BACKUP TRANSMISSION PATHS
- OVERHEAD ANALYSIS
- NONREDUNDANT SYSTEM REQUIRES 35 LSI CHIPS
- ADDING SCCMs INCREASES THE CHIP COUNT TO 43 (23% INCREASE)
- MEMORY OVERHEAD (IF ALL OPTIONS ARE INCLUDED) MAY BE AS HIGH AS 60%
33SIFT CLOCK SYNCHRONIZATION ALGORITHM
- "READ" CLOCK VALUES C1, C2, ..., CN FROM OTHER CLOCKS
- COMPUTE
- (ELIMINATES EFFECTS OF GROSSLY DIFFERENT OR
FAILED CLOCKS)
- COMPUTE NEW CLOCK VALUE
- CLOCKS SYNCHRONIZED TO 50 µs
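The formulas on this slide did not survive, so the sketch below follows the interactive convergence scheme as it is commonly described for SIFT: differences to the other clocks that exceed a threshold are replaced by zero, and the average of the remaining differences becomes the correction. Treat the function name, the threshold value and the exact averaging as assumptions rather than the lecture's own formula.

```python
def clock_correction(own, readings, threshold):
    """Correction to apply to the local clock.

    `readings` are the clock values read from the other processors. A
    difference larger than `threshold` is treated as coming from a grossly
    different or failed clock and replaced by zero. The local clock
    contributes a difference of zero to the average.
    """
    deltas = [0.0]
    for value in readings:
        delta = value - own
        deltas.append(delta if abs(delta) <= threshold else 0.0)
    return sum(deltas) / len(deltas)


own_clock = 100.0
others = [104.0, 102.0, 1000.0]   # the last clock has failed
correction = clock_correction(own_clock, others, threshold=10.0)
new_clock = own_clock + correction
print(correction, new_clock)
```

Discarding out-of-range differences bounds the influence any single faulty clock can have on the average.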
34CONCLUSIONS
- USE COMBINED METHODS OF
- CODING
- RECONFIGURATION
- REPLICATION
- TIMERS
- WATCHDOG PROCESSOR
- RECOVERY POINTS
- ROLL BACK OR ROLL FORWARD
- REMEMBER THE CONCEPT OF VERTICAL MIGRATION