1
Last class review
  • Information redundancy
  • Error-Detecting and Error-Correcting Codes
  • Applications
  • Hamming codes (SEC, SEC-DED)
  • CRC
  • Reed-Solomon

2
Summary of FTC techniques
3
Barriers in HW/SW
Barriers are constructed by design techniques for
fault avoidance, masking, and tolerance.
A better name for fault tolerance is error tolerance.
4
Hardware Redundancy (Spatial)
  • Passive (static) redundancy techniques
  • fault masking
  • Dynamic redundancy techniques
  • detection, localization, containment, recovery
  • Hybrid redundancy techniques
  • static + dynamic
  • fault masking + reconfiguration

5
Passive Hardware Redundancy
  • Triple modular redundancy (TMR)

110 → 1, 011 → 1, 101 → 1, 001 → 0, 010 → 0, 100 → 0
3 active components; fault masking is performed by the
voter. Problem: the voter is a single point of failure.
6
Simple Majority voting
  • Hardware realization of 1 bit majority voting

F = ab + ac + bc (2 gate delays)
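As a hedged illustration (Python is used only for readability; the slide describes a two-level AND-OR gate circuit), the 1-bit majority function can be written as:

```python
def majority(a: int, b: int, c: int) -> int:
    # 1-bit majority vote: F = ab + ac + bc
    # (two gate delays in hardware: one AND level, then one OR level)
    return (a & b) | (a & c) | (b & c)
```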
7
Triple Modular Redundancy
R_i denotes the reliability of module i.
R_TMR = P(all three modules are functioning) + P(exactly two modules are functioning)
If all modules are the same, R_1 = R_2 = R_3 = R:
R_TMR = R^3 + 3R^2(1 - R) = 3R^2 - 2R^3
If the lifetimes of the modules are exponential, R = e^(-λt):
R_TMR = 3e^(-2λt) - 2e^(-3λt)
and MTTF_TMR = ∫_0^∞ R_TMR dt = 5/(6λ)
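A small numeric sketch of these formulas (the failure rate λ below is an assumed value, purely for illustration):

```python
import math

LAMBDA = 1e-4   # assumed failure rate per hour, for illustration only

def r_tmr(t: float) -> float:
    # R_TMR = 3R^2 - 2R^3 with R = e^(-lambda*t)
    r = math.exp(-LAMBDA * t)
    return 3 * r**2 - 2 * r**3

mttf_single = 1 / LAMBDA          # 10,000 h
mttf_tmr = 5 / (6 * LAMBDA)       # about 8,333 h, from integrating R_TMR
print(r_tmr(1000.0), mttf_single, mttf_tmr)
```

Note that MTTF_TMR is smaller than the single-module MTTF, yet R_TMR(t) is higher than R(t) for short mission times, which is exactly what TMR is used for.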
8
N-Modular Redundancy (NMR)
  • Generalization of TMR (more than 3 modules, e.g.
    5MR)
  • Tolerates up to ⌊(N-1)/2⌋ faulty elements
  • In general, voting can be done on digital or
    analog data
  • Application: temperature measurement (no exact
    majority on analog values)
  • Method: take 3 measurements and use the median
    value (see the sketch after this list)
  • Example
  • Sensor 1: 99 °C
  • Sensor 2: 100 °C
  • Sensor 3: 45,217 °C <- discard the outlier!

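A minimal median-voting sketch (Python, illustrative only; the sensor values are the ones from the example above):

```python
def median_vote(readings):
    # Median of an odd number of sensor readings: a single wild outlier
    # cannot drag the result, the analog counterpart of majority voting.
    ordered = sorted(readings)
    return ordered[len(ordered) // 2]

print(median_vote([99.0, 100.0, 45217.0]))   # -> 100.0, the outlier is discarded
```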
9
NMR
  • TMR is a 2-of-3 system. In general, the reliability
    of an m-of-n system (n identical modules of
    reliability R, of which at least m must work) is

R_m-of-n = Σ_{i=m..n} C(n, i) R^i (1 - R)^(n-i)

Considering a single voter with reliability R_v, the
system reliability becomes R_v · R_m-of-n
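A sketch of this formula in Python (the function name and the example parameters are illustrative assumptions):

```python
from math import comb

def r_m_of_n(m: int, n: int, r: float, r_voter: float = 1.0) -> float:
    # Reliability of an m-of-n system of identical modules of reliability r,
    # in series with a single voter of reliability r_voter.
    r_mn = sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(m, n + 1))
    return r_voter * r_mn

print(r_m_of_n(2, 3, 0.9))        # 2-of-3 (TMR, perfect voter) = 0.972 = 3(0.9)^2 - 2(0.9)^3
print(r_m_of_n(3, 5, 0.9, 0.99))  # 3-of-5 behind an imperfect voter
```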
10
TMR with Triplicated Voters
To avoid a single point of failure, we can use
multiple voters
11
TMR with Triplicated Voters
TMR (or NMR) can be applied at all levels: gates,
sensors, processors, memory boards. However, applying
redundancy at a low level produces overhead, and the
cost can be high.
Voting on the processors' outputs
12
TMR with Triplicated Voters
TMR handles processor faults, voter faults, memory
faults, and bus faults. The system has no single point
of failure. This approach was implemented in the
Tandem Integrity system.
Variants (M_i = memory module i, P_i = processor i):
voting at read from memory, voting at read/write
from/to memory, voting at write to memory
13
Cascading TMR modules
  • The idea is to isolate failures from either
    components or voters so that they don't propagate
  • Examples
  • JPL STAR (Self-Testing And Repairing computer)
  • FAA WAAS (Wide Area Augmentation System)

14
Fault Detection
  • Duplication with comparison

Two identical modules perform the same operations and
a comparator checks their outputs: fault detection,
but not fault tolerance
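A minimal sketch of duplication with comparison (the module interfaces are assumptions for illustration, not from the slides):

```python
def run_duplicated(inputs, module_a, module_b):
    # Two identical modules compute the same operation and a comparator
    # checks the results. A mismatch detects a fault but cannot tell
    # which module is wrong: detection, not tolerance.
    out_a = module_a(inputs)
    out_b = module_b(inputs)
    if out_a != out_b:
        raise RuntimeError("fault detected: module outputs disagree")
    return out_a
```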
15
Fault Detection in Multicomputers
  • Multicomputers with shared memory

Fault detection: the comparison is performed in
software
16
Dynamic Redundancy: Standby Systems
  • Types
  • Hot
  • Cold
  • Warm

17
Reliability
  • Parallel system (2 modules)
  • Standby (2 modules)
  • R_s > R_p, but the coupling mechanism can have a
    significant effect on the results (see the sketch
    after this list)
  • In the case of standby, the coupler is more complex:
    it detects the failure, powers up the standby, and
    switches the outputs

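As a hedged illustration of why R_s > R_p, assume exponential module lifetimes with rate λ, a perfect coupler, and an ideal cold standby; the standard closed forms are R_p(t) = 2e^(-λt) - e^(-2λt) for two modules in parallel and R_s(t) = e^(-λt)(1 + λt) for the standby pair (the λ and t values below are arbitrary):

```python
import math

def r_parallel(lmbda: float, t: float) -> float:
    # Two active modules in parallel: works while at least one module works.
    r = math.exp(-lmbda * t)
    return 2 * r - r**2

def r_standby(lmbda: float, t: float) -> float:
    # Ideal two-module cold standby with a perfect, instantaneous switch.
    return math.exp(-lmbda * t) * (1 + lmbda * t)

lam, t = 1e-3, 1000.0
print(r_parallel(lam, t), r_standby(lam, t))   # ~0.600 vs ~0.736: standby wins
```

An imperfect coupler lowers R_s (roughly, the takeover term is multiplied by the switch reliability), which is why the coupling mechanism can erase the advantage.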
18
Hot Standby Systems
  • Updated simultaneously with primary system
  • Advantages
  • Very short outage time
  • Does not require recovery of application
  • Drawbacks
  • High failure rate
  • High power consumption

19
Warm Standby Systems
  • Secondary (i.e., backup) system runs in the
    background of the primary system.
  • Data is mirrored to the secondary server at
    regular intervals (so at times the two servers do
    not contain exactly the same data)
  • Advantages
  • Does not require simultaneous updating with the
    primary
  • Drawbacks
  • Requires recovery time
  • High failure rate
  • High power consumption

20
Cold Standby Systems
  • Secondary (i.e., backup) system is only called
    upon when the primary system fails. Secondary is
    not updated as frequently as in warm standby
  • Advantages
  • Low failure rate
  • Low power consumption
  • Drawbacks
  • Very long outage time
  • Needs to boot the kernel/OS and recover the
    application state

21
Hybrid Redundancy
  • Self-purging redundancy
  • Adaptive voting
  • PE_i: processing element i
  • ES_i: elementary switch i

Each ES compares its PE's output with the voter's
output and disconnects (purges) a PE whose output
disagrees with the voting result.
22
Hybrid Redundancy
N-modular redundancy with spares
The voter's output is used to identify faulty
modules, which are replaced with spares
23
Re-execution
  • Replicate a module's actions either on the same
    module (temporal redundancy) or on spare modules
    (temporal + spatial redundancy)
  • Good for detecting and/or correcting transient
    faults
  • Transient error will only affect one execution
  • Can implement this at many different levels
  • ALU
  • Thread context
  • Processor
  • System

24
Re-execution with Shifted Operands (RESO)
  • Re-execute the same arithmetic operations, but
    shifting the operands
  • Goal detect errors in ALU
  • Example shift left by 2
  • 1 0 1 0 1 0 X X
  • 1 0 0 1 0 1 X X
  • 0 0 1 0 1 1 X X
  • By comparing output bit 0 of the first execution
    with output bit 2 of the shifted re-execution, we
    can detect an error in the ALU, since they should
    be equal (see the sketch after this list)

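A toy RESO sketch under assumptions (a 16-bit ALU word, a shift of 2, and comparison of only the overlapping result bits; a real implementation also accounts for the shifted-out bits and carries):

```python
MASK = 0xFFFF   # assumed 16-bit ALU word, for illustration
SHIFT = 2

def alu_add(a: int, b: int) -> int:
    # The ALU operation under test.
    return (a + b) & MASK

def reso_add(a: int, b: int) -> int:
    # Re-execute with both operands shifted left, then realign and compare:
    # a faulty bit slice affects different result bit positions in the two runs.
    first = alu_add(a, b)
    second = alu_add((a << SHIFT) & MASK, (b << SHIFT) & MASK)
    if (second >> SHIFT) != (first & (MASK >> SHIFT)):
        raise RuntimeError("ALU error detected")
    return first
```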
25
Re-execution with Processes
  • Use redundant processes to detect errors
  • Problem: serialization, slowdown factor of 2
  • In a multiprocessor, we can execute copies of the
    same process simultaneously on 2 processors and
    have them periodically compare their results
  • Almost no slowdown, except for comparisons
  • Disadvantage: the opportunity cost of not using
    the other processor to perform non-redundant work

[Figure: redundant processes checked for errors on a
single CPU versus on two CPUs]
26
Re-execution of microinstructions
Superscalar Processor Microarchitecture
Drawback: tests only the functional units (FUs), not
the whole pipeline
  • Processors also use built-in self-test (BIST)
  • Generation of test vectors within the chip

27
Re-execution with Threads
  • Use redundant threads to detect errors
  • Many current superscalar microprocessors are
    multithreaded (e.g., the Pentium 4 is hyperthreaded)
  • Each processor can run multiple processes or
    multiple threads of the same process (i.e., it
    has multiple thread contexts)
  • Can re-execute a program on multiple thread
    contexts, just like with multiple processors
  • Better performance than re-execution with
    multiple processors, since the comparison can be
    performed on-chip
  • Lower opportunity cost to use an extra thread
    context than an extra processor

28
SMT - Flow of Instructions
29
Re-execution with Simultaneous Multithreading (SMT)
  • Motivation (Rotenberg 99)
  • Increasingly high clock rates and chip density
    may cause transient errors in high performance
    microprocessors
  • High cost of multiprocessor
  • Active stream/redundant stream Simultaneous
    Multithreading (SMT)
  • Low overhead, broad coverage of transient faults
    and some permanent faults
  • In AR-SMT, two explicit copies of the program run
    concurrently on the same processor resources

30
Re-execution with Simultaneous Multithreading (SMT)
  • The A-stream is executed on the SMT processor and
    its results are committed to the delay buffer
  • The R-stream executes on the SMT processor, delayed
    from the A-stream by no more than the size of the
    delay buffer
  • R-stream results are compared to A-stream results
    in the delay buffer; a fault is detected if the
    results differ (see the sketch after this list)
  • SMT pipeline
  • time-shared: in any given cycle, a pipeline stage
    is consumed entirely by one thread
  • space-shared: every cycle, a fraction of the
    bandwidth is allocated to each thread

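A highly simplified software sketch of the delay-buffer comparison (the class and method names and the buffer size are illustrative assumptions, not the AR-SMT design):

```python
from collections import deque

class DelayBufferChecker:
    # The A-stream commits results into a bounded delay buffer;
    # the trailing R-stream pops and compares them.
    def __init__(self, size: int = 4):
        self.size = size
        self.buffer = deque()

    def commit_a(self, result) -> bool:
        if len(self.buffer) == self.size:
            return False              # buffer full: the A-stream must stall
        self.buffer.append(result)
        return True

    def commit_r(self, result) -> None:
        # Assumes the A-stream has already committed the matching result.
        expected = self.buffer.popleft()
        if expected != result:
            raise RuntimeError("fault detected: A-stream and R-stream disagree")
```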
31
I/O in Computer Systems
  • Disks are often considered the stable storage
    on which we save critical data
  • E.g., databases write their important data to
    disks
  • Critical disk systems are backed up with tape
  • Periodically (e.g., nightly, weekly) log diffs to
    tape
  • Disks are generally protected with
  • Information redundancy (EDC/ECC)
  • Physical redundancy

32
Redundant Array of Inexpensive Disks (RAID)
  • Motivation (Patterson 88): Amdahl's law

S = speedup, f = fraction of the program that is
improved, k = speedup of the improved part (the CPU);
the non-improved part is the I/O.

S = 1 / ((1 - f) + f/k)

f = 0.90, k = 10:  S = 1/((1 - 0.90) + 0.90/10) = 1/0.19 ≈ 5.3
f = 0.90, k = 100: S = 1/((1 - 0.90) + 0.90/100) = 1/0.109 ≈ 9.2

A 10x larger CPU speedup (k = 10 -> 100) gives less
than 2x more overall speedup, and S_lim (k -> ∞) =
1/(1 - f) = 10: with k = 100, about 90% of the CPU
speedup is wasted on the non-improved I/O part.
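A quick Amdahl's-law check of these numbers in Python:

```python
def amdahl_speedup(f: float, k: float) -> float:
    # Overall speedup when a fraction f of the work is sped up by k
    # and the rest (e.g., I/O) is unchanged.
    return 1.0 / ((1.0 - f) + f / k)

print(amdahl_speedup(0.90, 10))    # ~5.3
print(amdahl_speedup(0.90, 100))   # ~9.2
print(1.0 / (1.0 - 0.90))          # limit as k -> infinity: 10
```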
33
RAID
Reliability for an array of disks:
MTTF_array = MTTF_single_disk / (number of disks)
MTTF of 10 disks  = 300,000 / 10  = 30,000 h
MTTF of 100 disks = 300,000 / 100 = 3,000 h
The more disks, the more likely one of them crashes:
the array has the worst MTTF.
34
RAID
  • Techniques to organize data across multiple disks
    that exploit
  • Redundancy
  • Parallelism
  • Goals of RAID - improve
  • Reliability
  • Performance
  • Levels
  • RAID-1: mirrored disks
  • RAID-2: Hamming code for ECC
  • RAID-3: single check disk per group
  • RAID-4: independent reads/writes
  • RAID-5: no single check disk

35
RAID-0: Striping
  • Virtual disk sectors split into strips of k
    sectors
  • Strips placed on disks cyclically
  • No overhead
  • Maximal parallelism
  • No fault tolerance (worse than single disk).
  • Not a real RAID level: no redundancy

36
RAID-1 Mirroring
  • Every disk has a copy (a mirror)
  • Simple
  • A write does 2 I/Os; a read can use either copy
  • 100% storage overhead
  • Excellent fault tolerance

37
RAID-2 ECC across disks
Hamming check bits
  • Use an Error-Correcting Code (ECC) (Hamming SEC-DED)
  • Spread each word's bits across the disks
  • Lower overhead than RAID-1 (depends on ECC)
  • Drives need to be synchronized
  • Complicated, expensive controller

38
RAID-3 Parity Disk
Example (even parity): data bits 1 0 1 0 0 0 1, parity bit 1
  • Single parity disk, stores XOR of other disks.
  • Provides 1-disk crash tolerance
  • the crashed disk's position is known to the disk
    controller
  • to reconstruct the missing data, add P mod 2 (XOR)
    to the other disks' data (see the sketch below)
  • Disk drives need to be synchronized.
  • Low overhead (1/N for N disks to store parity)

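A small sketch of the parity computation and reconstruction (the block layout and function names are assumptions for illustration):

```python
from functools import reduce

def parity_block(blocks):
    # RAID-3 style parity: bytewise XOR over the corresponding bytes of all blocks.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

def rebuild(surviving_blocks, parity):
    # XORing the parity with every surviving block recovers the lost block
    # (the controller knows which disk crashed, hence what it is rebuilding).
    return parity_block(surviving_blocks + [parity])

data = [b"\x01\x02", b"\x0f\x00", b"\xaa\x55"]    # three data disks
p = parity_block(data)
assert rebuild([data[0], data[2]], p) == data[1]  # disk 1 crashed and is rebuilt
```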
39
RAID-4: Parity Striping
  • Like RAID 3 but with strip-for-strip parity
  • No drive synchronization.
  • Each write causes 2 reads (data, parity) and 2
    writes (new data, new parity)
  • Parity drive can become a bottleneck

40
RAID-5: Distributed Parity
  • Like RAID-4 but the parity strips are spread over
    all disks
  • More complicated controller and crash-recovery
    process
  • The most widely used RAID level
  • Other RAID systems: RAID-6, RAID-10

41
RAID-3 Parity Disk Reliability
  • A simplified model of reliability for RAID-3 (2
    HD data, 1 parity)

k = number of HDs that must operate out of n (here a
2-of-3 system); RAID-4 and RAID-5 have similar models
42
Other issues at Disk
The I/O bus can still be a potential point of
failure. A possible solution is to use redundant
buses.
43
Space Shuttle
Architecture of the Space Shuttle computer system: 5
CPUs, replicated buses and components
44
Space Shuttle Computer System
  • Configuration of the computers and most fault
    detection are performed in software
  • The 4 computers get similar input data and
    compute the same functions.
  • Additionally, each processor
  • Has extensive self-test facilities.
  • If an error is detected, it is reported to the
    crew, which then can switch off the faulty unit.
  • Compares its results with those produced by its
    neighbors.
  • If a processor detects a disagreement, it signals
    this, and voting is used in order to remove the
    offending computer.
  • Has a watchdog timer, in order to detect CPU
    crashes.

45
Space Shuttle Computer System
  • If one processor is switched off
  • the system becomes a triple modular redundancy
    (TMR) arrangement
  • If a second processor is switched off
  • the system is switched into duplex mode, where
    the two computers compare their results in order
    to detect any further failure.
  • In case of a third failure
  • the system reports the inconsistencies to the
    crew and uses fault detection techniques in order
    to identify the offending unit.
  • This therefore provides protection against the
    failure of two units, plus fault detection and
    limited fault tolerance against the failure of a
    third unit

46
Space Shuttle Computer System
  • In an emergency, the fifth computer can take over
    critical functions
  • The 5th computer allows protection against
    systematic faults in the software.
  • If one or two computers fail, the crew or the
    controllers on Earth might decide to abort the
    mission