1
HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR
INFORMATIK
DEPENDABLE SYSTEMS Lecture 10 CASE STUDIES
Winter semester 99/00 Lecturer: Prof. Dr.
Miroslaw Malek www.informatik.hu-berlin.de/rok/ftc
2
CASE STUDIES
  • OBJECTIVES
  • TO SHOW EXAMPLES OF EXISTING SYSTEMS WHICH ARE
    DESIGNED TO ASSURE HIGH RELIABILITY
  • TO RELATE GENERAL RELIABILITY METHODOLOGIES
    DESCRIBED EARLIER TO PRACTICAL IMPLEMENTATIONS OF
    THOSE IDEAS
  • TO SURVEY THE GENERAL EXISTING RELIABILITY
    CONCEPTS WITH EXEMPLARY CASES
  • CONTENTS
  • COMMERCIAL SYSTEMS FROM AT&T, SEQUOIA, STRATUS
    AND TANDEM
  • FTMP - FAULT-TOLERANT MULTIPROCESSOR
  • SIFT - SOFTWARE IMPLEMENTED FAULT TOLERANCE
  • COMMUNICATION CONTROLLER
  • FAULT-TOLERANT BUILDING BLOCK ARCHITECTURE

3
AT&T's ELECTRONIC SWITCHING SYSTEMS ESS 1A - ESS
5 AND 3B20 (1)
  • REQUIREMENTS
  • Downtime for the entire system not to exceed 2
    hours over a 40-year life
  • Fraction of calls handled incorrectly < 0.02%
  • System outage: 3 min/year
  • 100% availability, 24 hours a day, from the
    user's perspective
  • Two minutes of downtime are contributed by
  • 24 sec - hardware faults (20%)
  • 18 sec - software deficiencies (15%)
  • 36 sec - procedural errors (30%)
  • 42 sec - recovery deficiencies (35%)
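The per-cause seconds above follow directly from the percentage shares of the two-minute (120 s) annual budget. A quick arithmetic check:

```python
# Per-cause downtime: each share of the ~2-minute (120 s) annual budget.
budget_s = 120
shares = {
    "hardware faults": 0.20,
    "software deficiencies": 0.15,
    "procedural errors": 0.30,
    "recovery deficiencies": 0.35,
}
seconds = {cause: budget_s * frac for cause, frac in shares.items()}
print(seconds)  # hardware 24 s, software 18 s, procedural 36 s, recovery 42 s
assert round(sum(seconds.values())) == budget_s
```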

4
AT&T's ELECTRONIC SWITCHING SYSTEMS ESS 1A - ESS
5 AND 3B20 (2)
  • OTHER FEATURES
  • 95% of hardware and software faults detected and
    diagnosed automatically
  • 90% of hardware faults diagnosed within the field
    replaceable unit (FRU)
  • Repair time less than 2 hours on ESS,
  • 1 minute on 3B20
  • REDUNDANCY
  • FULL DUPLICATION (of critical modules)
  • CPU, memory, I/O, disks, bus systems
  • STANDBY SPARES
  • call store
  • ERROR DETECTION (at both hardware and software
    levels)
  • replication checks
  • timing checks
  • coding checks
  • internal checks (self-checking)

5
AT&T's ELECTRONIC SWITCHING SYSTEMS ESS 1A - ESS
5 AND 3B20 (3)
  • replication checks
  • duplex system with comparison on every cycle
  • timing checks
    used in all hardware components; also several
    timer resets driven by software interrupts
  • coding
  • m-out-of-n (4-out-of-8) codes, parity and cyclic
    codes
  • internal checks
  • address limits
  • multiple comparators help software to locate
    faults faster
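The m-out-of-n coding check above is easy to see concretely: a legal 4-out-of-8 code word has exactly four 1-bits, so any single stuck or flipped bit changes the weight and is detected. A minimal sketch:

```python
def valid_4_of_8(word: int) -> bool:
    """A legal 4-out-of-8 code word has exactly four 1-bits set."""
    return bin(word & 0xFF).count("1") == 4

w = 0b11001010                            # weight 4: a legal code word
assert valid_4_of_8(w)
assert not valid_4_of_8(w ^ 0b00000001)   # a flipped bit changes the weight
assert not valid_4_of_8(w | 0b00000100)   # so does a stuck-at-1 bit
```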

6
SYSTEM VIEW (3B20)
7
FAULT TREATMENT
  • Detection of an error generates an interrupt and
    the fault treatment and recovery programs (FT/RP)
    are invoked
  • Three priority categories
  • immediate interrupt (maintenance interrupt)
  • if the fault is severe enough to affect the
    execution of the currently executing program
  • deferred interrupt
  • if too many calls are potentially affected by
    interrupt, then wait until the completion of the
    currently executing program
  • polite interrupt
  • waits until periodic routine diagnostic is
    executed
  • FT/RP identify and isolate the faulty unit and
    reconfigure the system to use one fault-free CPU
  • If storage has no duplication, other memory area
    will be assigned

8
RELIABLE SOFTWARE GOALS
  • OPERATE CONTINUOUSLY FOR MONTHS OR YEARS
  • TECHNIQUES USED FOR HIGH SOFTWARE RELIABILITY
  • PROCESSES HAVE INDIVIDUAL FAULT RECOVERY AND
    ROLLBACK MECHANISMS WHICH RECOVER FROM HARDWARE
    FAILURES OR TRANSIENT SOFTWARE FAILURES
  • SYSTEM INTEGRITY SOFTWARE MONITORS CORRECT
    OPERATION OF THE ENTIRE HARDWARE AND SOFTWARE
    SYSTEM
  • AUDITS VALIDATE DATA CONSISTENCY AND RECLAIM LOST
    RESOURCES USING ROBUST DATA STRUCTURES
  • OVERLOAD CONTROLS ENSURE THE AVAILABILITY OF
    RESOURCES AND PREVENT CATASTROPHIC FAILURES
  • EXCEPTION HANDLING TECHNIQUES
  • NONCRITICAL PROGRAMS USUALLY TERMINATE AND
    RESTART
  • CRITICAL PROGRAMS WILL ROLLBACK AND RETRY
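The two exception-handling policies above can be sketched in a few lines (a hypothetical illustration of the idea, not AT&T's actual code): noncritical work simply terminates and restarts, while critical work saves state first so it can roll back and retry.

```python
def run_noncritical(task):
    """On failure, terminate and restart the task from scratch."""
    try:
        return task()
    except Exception:
        return task()

def run_critical(task, checkpoint, restore):
    """Save state first; on failure, roll back to it and retry."""
    state = checkpoint()
    try:
        return task()
    except Exception:
        restore(state)      # roll back ...
        return task()       # ... and retry
```

A task hit by a transient fault fails once and succeeds on the retry under either policy; the difference is that the critical path retries from a known-good state.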

9
PROGRESSIVE RECOVERY EFFORT
  • LEVEL   ACTION
  • LOCAL   LOCAL RECOVERY
  • 1       OPERATING SYSTEM AND I/O DRIVER ROLLBACK
  • 2       QUICK BOOTSTRAP
  • 3       COMPLETE BOOTSTRAP, RELOAD CONFIGURATION
    DATABASE
  • 4       MANUAL: CLEAR ALL OF MEMORY, DO 3 ABOVE
  • ALTHOUGH DOWNTIME DOES NOT INCREASE
    SIGNIFICANTLY AS RECOVERY ACTIONS ESCALATE,
    DISRUPTIONS TO USERS OF APPLICATIONS DO INCREASE
    SIGNIFICANTLY 
  • ABORTED TRANSACTIONS
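The escalation ladder above amounts to trying each level in order until one succeeds. A hedged sketch (level names paraphrased from the slide):

```python
# Progressive recovery: escalate through increasingly drastic actions.
LEVELS = [
    "local recovery",
    "OS and I/O driver rollback",             # level 1
    "quick bootstrap",                        # level 2
    "complete bootstrap + reload config DB",  # level 3
    "manual: clear all memory, then level 3", # level 4
]

def recover(action_succeeds):
    """Try each level in order; return the first one that clears the fault."""
    for level, action in enumerate(LEVELS):
        if action_succeeds(action):
            return level, action
    raise RuntimeError("all recovery levels exhausted")

# Example: a fault that only a quick bootstrap clears.
print(recover(lambda a: a == "quick bootstrap"))  # (2, 'quick bootstrap')
```

The escalation explains the slide's point: each extra level adds little downtime but aborts more in-flight transactions.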

10
SYSTEM ENHANCEMENT GOALS
  • INSTALL NEW HARDWARE AND SOFTWARE
  • WITHOUT TAKING DOWN THE SYSTEM
  • METHODS TO ADD UPDATES
  • CHANGE HARDWARE AND SOFTWARE WITH NO DISRUPTION
    IN SERVICE
  • INSTALL NEW HARDWARE, FIRMWARE, OR SOFTWARE WITH
    MINIMAL DISRUPTION IN SERVICE  
  • OFF-LINE SOFTWARE REPLACEMENT SYSTEM
  • COMPILE THE NEW SOURCE CODE
  • COMPARE NEW OBJECT CODE TO OLD OBJECT CODE
  • DETERMINE KINDS OF REPLACEMENTS NEEDED
  • GENERATE THE REPLACEMENT FILES
  • METHODS TO REMOVE FAULTY UPDATES
  • BACK OUT ANY UPDATES WHICH WERE FOUND TO CONTAIN
    FAULTS
  • AUTOMATICALLY BACK OUT OF ANY UPDATE SUSPECTED OF
    CAUSING A FAILURE

11
OPERATOR INTERFACE GOALS
  • HELP EFFECT A QUICK REPAIR
  • PROVIDE IMMEDIATE FEEDBACK ON STATUS OF SYSTEM
  • HELP OPERATOR MAKE QUICK, ACCURATE DECISIONS
  • PREVENT DANGEROUS OPERATOR MISTAKES
  • PROVIDE POSITIVE CONTROL OF ALL PARTS OF SYSTEM

12
FAULT INJECTION AND REPAIR SIMULATION
  1. OVER 10,000 SINGLE HARDWARE FAULTS WERE INJECTED
    AT RANDOM AND AUTOMATIC SYSTEM RECOVERY WORKED IN
    OVER 99.8% OF CASES
  2. IN 133 SIMULATED REPAIR CASES THE TROUBLE LOCATION
    PROCEDURE (TLP) FAILED TO LOCATE THE FAULTY MODULE IN
    5 CASES, AND IN 94% OF THE LISTS OF SUSPECTED
    FAULTY COMPONENTS THE FAULT WAS LOCATED WITHIN
    THE FIRST FIVE MODULES

13
AVAILABILITY ASSURANCE
  • MODEL AVAILABILITY
  • THROUGH ENTIRE LIFECYCLE
  • TEST FOR AVAILABILITY
  • TO MEET SPECIFIED AVAILABILITY
  • TRACK ON-SITE EXPERIENCE
  • TO ENSURE AVAILABILITY OBJECTIVES ARE MET

14
SEQUOIA (Marlboro, MA 01752, ph. 617-480-0800)
  • TIGHTLY-COUPLED MULTIPROCESSOR capable of trading
    performance for dependability and vice versa
  • MC68020 PROCESSORS (20MHz clock)
  • up to 64 PEs
  • up to 128 MEs (16 M bytes with ECC)
  • up to 96 IOEs
  • two 40-bit 10MHz buses
  • FAULT DETECTION
  • error-detecting codes (e.g., half odd-half even
    parity)
  • comparison of duplicated operations (duplex
    microprocessors)
  • protocol monitoring
  • PE faults are located by polling
  • RECONFIGURATION
  • reassignment to fault-free processors

15
STRATUS (also IBM's System/88) (Natick, MA
01760, ph. 617-653-1466)
  • TWO PAIRS OF DUPLEXED PEs (PAIR AND SPARE PAIR)
  • UP TO 32 PEs ON A RING-TYPE LOCAL AREA NETWORK
  • RED-LIGHT NOTIFICATION ABOUT FAULTY BOARD
  • ABILITY TO EXCHANGE BOARDS ON LINE
  • ECC ON MEMORIES (Up to 32M bytes per PE)
  • PERFORMANCE/FAULT TOLERANCE OPTIONS

16
STRATUS XA/R SERIES 300: PAIR AND SPARE CONCEPT
STRATUS XA/R SERIES 300 MODULE
17
TANDEM (Cupertino, CA 95014, ph. 408-725-6000)
  • CONFIGURATIONS
  • SINGLE SYSTEM 2-16 PEs
  • FIBER OPTIC CABLE-CONNECTED SYSTEM UP TO 224 PEs
    (14X16)
  • WORLD-WIDE NETWORK UP TO 4,080 PEs
  • THE FAULT-TOLERANT COMPUTER OF THE EIGHTIES
    FEATURES
  • NONSTOP II OR NONSTOP TXP PROCESSOR WITH 64KB
    CACHE
  • DUAL DYNABUS (26 Mbytes/sec)
  • 2-8 Mbytes Memories
  • Dual Disk (MTBF for a single disk is 3-5 years;
    with dual disks, the MTBF increases to 1500 years)
  • FAULT DETECTION - 100% by duplication or by
    timeout mechanism (absence of "I'm alive"
    message)
  • FAULT-TOLERANT WITH RESPECT TO ANY SINGLE
    HARDWARE FAULT
  • RECOVERY by rollback to the latest checkpoint in
    memory 
  • LATEST SYSTEM INTEGRITY S2 USES TMR OF MIPS
    PROCESSORS ("SELECTIVE" TMR)
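The jump from a 3-5 year single-disk MTBF to roughly 1500 years for the mirrored pair is consistent with the standard repairable-duplex approximation MTBF_pair ≈ MTBF² / (2 · MTTR). The two-day repair time below is an assumption for illustration, not a figure from the slide:

```python
# Rough check of the mirrored-disk figure using the repairable-duplex
# approximation MTBF_pair ≈ MTBF^2 / (2 * MTTR).
mtbf_single_h = 4 * 8760          # ~4 years per disk, in hours
mttr_h = 48                       # assumed repair time: two days
mtbf_pair_h = mtbf_single_h**2 / (2 * mttr_h)
print(mtbf_pair_h / 8760)         # ~1460 years: close to the quoted 1500
```

The result is extremely sensitive to MTTR, which is exactly why on-line board exchange and fast repair matter in these systems.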

18
NONSTOP CYCLONE (TANDEM COMPUTERS Inc.)
  • CYCLONE TOLERATES SINGLE HARDWARE OR SOFTWARE
    FAULT
  • IT USES A FAULT-TOLERANT LOAD BALANCING OPERATING
    SYSTEM CALLED GUARDIAN 90
  • GUARDIAN 90 MAINTAINS BACKUP OF USER PROCESSES ON
    SEPARATE PROCESSORS AND KEEPS CONSISTENCY BY
    PERIODIC CHECKPOINTING
  • 16 AND 64 PROCESSOR CONFIGURATIONS WITH UP TO 2
    GB MEMORY AND 64 I/O CHANNELS (WITH THE FOX NETWORK
    UP TO 255 PROCESSORS CAN WORK TOGETHER)
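The process-pair scheme above can be sketched as follows (a simplified, hypothetical illustration of the idea, not Guardian 90's actual API): the primary periodically checkpoints its state to a backup process on a separate processor, which resumes from the last checkpoint if the primary fails.

```python
class ProcessPair:
    """Primary/backup process pair kept consistent by checkpointing."""

    def __init__(self):
        self.backup_state = None

    def checkpoint(self, state):
        self.backup_state = dict(state)   # copy primary state to the backup

    def takeover(self):
        return self.backup_state          # backup resumes from last checkpoint

pair = ProcessPair()
pair.checkpoint({"txn": 41})
# ... primary's processor fails here ...
assert pair.takeover() == {"txn": 41}     # work since the checkpoint is redone
```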

19
NONSTOP CYCLONE (TANDEM COMPUTERS Inc.)
TANDEM NONSTOP CYCLONE SYSTEM
20
CYCLONE SYSTEM ARCHITECTURE
  • Superscalar proprietary CISC Processors
  • A section is a quad of processors which are
    connected by duplexed DYNABUS (a proprietary,
    fault-tolerant bus, 40 MB/sec)
  • Sections are also redundantly (duplexed both
    ways) interconnected by DYNABUS, also a
    proprietary, fault-tolerant bus, up to 50 m long,
    which uses fiber optics
  • BASIC PRINCIPLE: FAIL FAST
  • (concurrent error detection or "I'm alive"
    messages, combined with immediate termination of
    operation upon detection to minimize error
    propagation)
  • Replacement of components on line
  • SEC-DED on memories
  • Mirrored disks
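The "I'm alive" timeout behind the fail-fast principle can be sketched like this (a hedged illustration; interval and processor names are invented): each processor checks in periodically, and a monitor declares a processor down as soon as its message is overdue.

```python
# Fail-fast liveness monitor: a PE is declared faulty the moment its
# "I'm alive" message is overdue, to limit error propagation.
TIMEOUT = 2.0   # assumed check-in interval, in seconds

last_seen = {}

def im_alive(pe, now):
    last_seen[pe] = now           # record the PE's latest check-in time

def faulty_pes(now):
    return [pe for pe, t in last_seen.items() if now - t > TIMEOUT]

im_alive("PE0", now=10.0)
im_alive("PE1", now=11.5)
# At t = 13.0, PE0's last message is 3.0 s old and overdue.
print(faulty_pes(now=13.0))       # ['PE0']
```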

Four separate sections connected by DYNABUS
21
HIMALAYA K10000 (TANDEM COMPUTERS Inc.)
22
HIMALAYA K10000's INTERCONNECTION NETWORK
Sections and nodes connected by dual fiber optic rings
23
FTMP - FAULT-TOLERANT MULTIPROCESSOR (DRAPER LABS)
  • THREE TRIADS IN TMR CONFIGURATION (NINE PROCESSOR
    SYSTEM)
  • TMR ON COMMUNICATION LINES
  • FAULT-TOLERANT TMR CLOCK
  • FAULT-TOLERANT WITH RESPECT TO ANY SINGLE FAULT
  • DESIGN GOALS
  • 10^-9 FAILURES/HOUR
  • 10 HOUR MISSION TIME
  • 300 HOUR MAINTENANCE INTERVALS
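For reference, the reliability of one TMR triad with a perfect voter is R_TMR = 3R² - 2R³ (the triad survives iff at least two of its three modules survive). A quick numerical check, with an assumed per-module failure rate for illustration only:

```python
import math

def r_tmr(r):
    """Triad survives iff at least 2 of 3 modules survive (perfect voter)."""
    return 3 * r**2 - 2 * r**3

lam = 1e-4                 # assumed per-module failure rate, per hour
t = 10                     # the 10-hour mission from the design goals
r = math.exp(-lam * t)     # single-module mission reliability
print(1 - r)               # single-module failure probability, ~1e-3
print(1 - r_tmr(r))        # triad failure probability, ~3e-6
```

Masking single faults turns a per-module failure probability q into roughly 3q² for the triad, which is how the architecture buys orders of magnitude in the 10^-9/hour direction.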

24
FAULT-TOLERANT PARALLEL PROCESSOR (FTPP, FROM
Draper Labs)
Byzantine resilience
A four-triplex group cluster
An ensemble of 16 triplex groups
25
SIFT - SOFTWARE IMPLEMENTED FAULT TOLERANCE
  • NINE PROCESSOR SYSTEM WITH CAPABILITY TO SCHEDULE
    TASKS TO RUN ON 1, 3, 5, 7 OR 9 PROCESSORS
    DEPENDING ON TASK CRITICALITY
  • LOCAL EXECUTIVE FOR EACH TASK
  • error handler/detector
  • scheduler
  • software voter
  • repeated communication
  • GLOBAL EXECUTIVE
  • runs in TMR mode
  • allocates resources
  • diagnoses reports from local error handlers
  • SYSTEM SHOULD HAVE FAILURE RATE λ < 10^-9 OVER A
    10 HOUR MISSION TIME
  • FLEXIBLE TRADING OF PERFORMANCE AND RELIABILITY

26
COMMUNICATION CONTROLLER
  • EXAMPLE OF A SELF-TESTING MICROPROCESSOR-BASED
    SYSTEM: A COMMUNICATION CONTROLLER FROM E-SYSTEMS,
    INC.
  • THE CPU OF A SELF-TESTING SYSTEM
  • SELF TEST PROGRAM IS STORED IN THE 1K TEST ROM.
  • SELF TEST PROGRAM IS EXECUTED IN BACKGROUND MODE
    (INVOKED BY A LOW PRIORITY INTERRUPT).
  • DETECTION OF FAULT CAUSES AN INDICATION LIGHT TO
    BE TURNED ON IN AN LED PANEL.
  • THE ACTIVE MICROPROCESSOR MUST ACCESS AND RESET A
    TIMER AT REGULAR INTERVALS. FAILURE TO DO SO
    CAUSES A TIME-OUT CIRCUIT TO TRANSFER CONTROL TO
    THE BACK-UP MICROPROCESSOR AND TURN ON THE CPU
    FAULT LIGHT.

27
THE CPU OF A SELF-TESTING SYSTEM
  • ROMs ARE TESTED BY CHECK SUMMING
  • RAM IS TESTED BY CHECKERBOARD PATTERNS WITH
    BUFFERING A CURRENT WORD UNDER TEST IN THE CPU
    REGISTER
  • I/O TESTS ARE PERFORMED USING THE LOOP-BACK
    PROCEDURE. I.E., OUTPUTS ARE CONNECTED TO INPUTS
    UNDER THE CPU CONTROL.
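The checkerboard RAM test can be sketched as follows (a simplified software model of the procedure, with the word under test buffered before the destructive write, as the slide describes):

```python
def checkerboard_test(ram):
    """Write alternating bit patterns to each cell, read back, compare."""
    for pattern in (0xAA, 0x55):         # checkerboard and its complement
        for addr in range(len(ram)):
            saved = ram[addr]            # buffer the word under test
            ram[addr] = pattern
            if ram[addr] != pattern:     # stuck or coupled bit: fault
                return False
            ram[addr] = saved            # restore original contents
    return True

assert checkerboard_test(bytearray(64))  # healthy memory passes
```

Alternating 0xAA/0x55 exercises each bit in both states and catches simple stuck-at faults; production memory tests add address-decoder patterns on top of this.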

28
SPACE SHUTTLE SYSTEM
  • The Data Processing System (DPS) of the Space
    Shuttle
  • A FAULT-TOLERANT BUILDING BLOCK ARCHITECTURE
  • Five General-Purpose Computers (GPCs)
  • Time-shared Data Bus
  • Two Magnetic Tape Mass Storage Units
  • Specialized hardware components with redundancy
    level 2 to 5

29
A FAULT-TOLERANT BUILDING BLOCKARCHITECTURE (1)
  • SELF-CHECKING AND FAULT TOLERANCE ARE PROVIDED AT
    THE PROCESSOR, MEMORY, I/O AND BUS.
  • SELF-CHECKING COMPUTER MODULE (SCCM) CONTAINS
    FOUR TYPES OF BUILDING BLOCK CIRCUITS WHICH
    INTERFACE MEMORIES, PROCESSORS, I/O AND EXTERNAL
    buses TO AN INTERNAL SCCM BUS.
  • THE BUILDING BLOCKS PROVIDE CONCURRENT FAULT
    DETECTION WITHIN THEMSELVES AND IN THEIR
    ASSOCIATED CIRCUITRY.

30
A FAULT-TOLERANT BUILDING BLOCK
31
SELF-CHECKING COMPUTER MODULES
  • THE MEMORY INTERFACE BUILDING BLOCK (MIBB)
  • THE MIBB SUPPORTS SINGLE ERROR CORRECTION OR
    DOUBLE ERROR DETECTION
  • THE MIBB CAN BE COMMANDED TO REPLACE ANY TWO
    SPECIFIED BITS (IN ALL WORDS) WITH THE TWO SPARE
    BITS (PERMANENT CORRECTION)
  • THE CORE BUILDING BLOCK (CBB)
  • DUAL PROCESSOR SYSTEM CONTINUOUSLY COMPARES THE
    PROCESSORS' OUTPUTS AND SIGNALS A FAULT IF IT
    DETECTS A DISAGREEMENT
  • THE CBB ALSO SERVES AS A BUS ARBITER AND COLLECTS
    ALL FAULT INDICATIONS FROM OTHER BUILDING BLOCKS
    AND ITS OWN INTERNAL CIRCUITRY
  • IF A FAULT IS DETECTED, THE CBB ATTEMPTS EITHER A
    PROGRAM ROLLBACK OR RESTART
  • IF THE FAULT RECURS, THE CBB DISABLES ITS HOST
    COMPUTER BY HALTING THE PROCESSORS AND DISABLING
    THE SCCM OUTPUTS
  • ANOTHER OPTION IS TO CONTINUE OPERATION USING ONE
    FAULT-FREE PROCESSOR AND DEFER THE MAINTENANCE
  • THE CBB USES INTERNAL DUPLICATION AND
    SELF-CHECKING LOGIC
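The CBB's escalation policy reduces to a tiny state machine: agreement proceeds normally, a first disagreement triggers a rollback or restart, and a recurrence halts the SCCM. A hypothetical sketch (the clearing of the fault count on agreement is an assumption, not stated on the slide):

```python
def cbb_step(out_a, out_b, state):
    """Compare duplicated processor outputs; escalate on repeated faults."""
    if out_a == out_b:
        state["faults"] = 0       # assumption: agreement clears the count
        return "ok"
    state["faults"] += 1
    return "rollback" if state["faults"] == 1 else "halt"

s = {"faults": 0}
assert cbb_step(1, 1, s) == "ok"        # outputs agree: continue
assert cbb_step(1, 2, s) == "rollback"  # first disagreement: retry
assert cbb_step(1, 2, s) == "halt"      # fault recurs: disable the SCCM
```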

32
BUS INTERFACE BUILDING BLOCKS (BIBBS)
  • THE BIBBS PROVIDE COMMUNICATIONS THROUGH
    REDUNDANT BUSES WITH OTHER COMPUTERS IN THE
    NETWORK
  • STATUS MESSAGES AND CODING VERIFY PROPER
    TRANSMISSION AND REDUNDANT BUSES PROVIDE BACKING
    TRANSMISSION PATHS
  • OVERHEAD ANALYSIS
  • NONREDUNDANT SYSTEM REQUIRES 35 LSI CHIPS
  • ADDING SCCMs INCREASES THE CHIP COUNT TO 43 (23%
    INCREASE)
  • MEMORY OVERHEAD (IF ALL OPTIONS ARE INCLUDED) MAY
    BE AS HIGH AS 60%

33
SIFT CLOCK SYNCHRONIZATION ALGORITHM
  • "READ" CLOCK VALUES C1, C2, ..., CN FROM OTHER
    CLOCKS
  • COMPUTE CORRECTED VALUES
    (ELIMINATES EFFECTS OF GROSSLY DIFFERENT OR
    FAILED CLOCKS)
  • COMPUTE NEW CLOCK VALUE
  • CLOCKS SYNCHRONIZED TO 50 µs
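The eliminate-then-average step works like SIFT's interactive convergence: readings that differ from a clock's own value by more than a fixed window are replaced by its own value before averaging. A hedged sketch (window and clock values are illustrative):

```python
def converge(own, readings, delta):
    """Replace grossly different readings with the local value, then average."""
    usable = [c if abs(c - own) <= delta else own for c in readings]
    return sum(usable) / len(usable)

own = 100.0
readings = [100.0, 101.0, 99.0, 5000.0]    # the last clock has failed
print(converge(own, readings, delta=10))   # 100.0: failed clock is masked
```

Repeating this round keeps the nonfaulty clocks within a bounded skew (the slide's 50 µs) despite a minority of arbitrarily wrong clocks.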

34
CONCLUSIONS
  • USE COMBINED METHODS OF
  • CODING
  • RECONFIGURATION
  • REPLICATION
  • TIMERS
  • WATCHDOG PROCESSOR
  • RECOVERY POINTS
  • ROLL BACK OR ROLL FORWARD
  • REMEMBER THE CONCEPT OF VERTICAL MIGRATION