Title: Last class review
1. Last class review
- Information redundancy
- Error detecting and correcting codes
- Applications
  - Hamming codes (SEC, SEC-DED)
  - CRC
  - Reed-Solomon
2. Summary of FTC techniques
3. Barriers in HW/SW
Barriers are constructed by design techniques for fault avoidance, masking, and tolerance.
A better name for fault tolerance is error tolerance.
4. Hardware Redundancy (Spatial)
- Passive (static) redundancy techniques
  - fault masking
- Dynamic redundancy techniques
  - detection, localization, containment, recovery
- Hybrid redundancy techniques
  - static + dynamic
  - fault masking + reconfiguration
5. Passive Hardware Redundancy
- Triple modular redundancy (TMR)
Voter truth table: 110 -> 1, 011 -> 1, 101 -> 1, 001 -> 0, 010 -> 0, 100 -> 0
3 active components; fault masking is performed by a voter. Problem: the voter is a single point of failure.
6. Simple Majority Voting
- Hardware realization of 1-bit majority voting
F = ab + ac + bc (2 gate delays: one AND level, one OR level), sketched below
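As a sanity check, here is a minimal Python sketch of this two-level majority function, verified exhaustively against the voter truth table above (function names are ours):

```python
def majority(a: int, b: int, c: int) -> int:
    """1-bit majority vote: F = ab + ac + bc (AND level, then OR level)."""
    return (a & b) | (a & c) | (b & c)

# Exhaustive check against the truth table on the slide.
for bits in range(8):
    a, b, c = (bits >> 2) & 1, (bits >> 1) & 1, bits & 1
    assert majority(a, b, c) == (a + b + c >= 2)
```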
7. Triple Modular Redundancy
Let R_i denote the reliability of module i.
R_TMR = P(all 3 modules functioning) + P(exactly 2 modules functioning)
If all modules are the same, R_1 = R_2 = R_3 = R, then
R_TMR = 3R^2 - 2R^3
If the lifetimes of the modules are exponential, R = e^(-λt), then
R_TMR = 3e^(-2λt) - 2e^(-3λt) and
MTTF_TMR = ∫_0^∞ R_TMR dt = 3/(2λ) - 2/(3λ) = 5/(6λ)
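A short Python sketch that evaluates these formulas numerically; the failure rate lam is an assumed, hypothetical value. Note that MTTF_TMR = 5/(6λ) is lower than the simplex MTTF = 1/λ, even though TMR is more reliable over short missions:

```python
import math

lam = 1e-4                      # assumed failure rate per hour (hypothetical)

def r_simplex(t):               # one module, exponential lifetime
    return math.exp(-lam * t)

def r_tmr(t):                   # R_TMR = 3R^2 - 2R^3
    r = r_simplex(t)
    return 3 * r**2 - 2 * r**3

print(5 / (6 * lam))                   # MTTF_TMR = 5/(6*lam) ~ 8333 h
print(1 / lam)                         # simplex MTTF = 1/lam = 10000 h
print(r_tmr(1000) > r_simplex(1000))   # True: TMR wins for short missions
```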
8. N-Modular Redundancy (NMR)
- Generalization of TMR (more than 3 modules, e.g. 5MR)
- Tolerates up to ⌊(N-1)/2⌋ faulty elements
- In general, voting can be done on digital or analog data
- Application: temperature measurement (no exact majority)
- Method: take 3 measurements, compute the median value (see the sketch after this list)
- Example
  - Sensor 1: 99 °C
  - Sensor 2: 100 °C
  - Sensor 3: 45,217 °C <- discard the outlier!
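A minimal sketch of the median (mid-value select) method, using the slide's readings (45,217 °C taken here as 45217.0):

```python
def median_vote(readings):
    """Mid-value selection: the median of three readings automatically
    discards a single wild outlier (no bit-exact majority needed)."""
    return sorted(readings)[len(readings) // 2]

print(median_vote([99.0, 100.0, 45217.0]))   # -> 100.0, outlier discarded
```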
9. NMR
- TMR is a 2-of-3 system. In general, the reliability of an m-of-n system is
  R_(m-of-n) = Σ_{i=m}^{n} C(n,i) R^i (1-R)^(n-i)
- Considering a single voter with reliability R_v in series: R_TMR = R_v (3R^2 - 2R^3)
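A small Python helper that evaluates the m-of-n sum; the module reliability r = 0.9 and voter reliability Rv = 0.99 are assumed values for illustration:

```python
from math import comb

def r_m_of_n(m: int, n: int, r: float) -> float:
    """Reliability of an m-of-n system of i.i.d. modules of reliability r."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(m, n + 1))

r = 0.9                                # assumed module reliability
print(r_m_of_n(2, 3, r))               # 2-of-3 (TMR): 3r^2 - 2r^3 = 0.972
print(0.99 * r_m_of_n(2, 3, r))        # in series with a voter of Rv = 0.99
```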
10. TMR with Triplicated Voters
To avoid a single point of failure, we can use multiple voters.
11. TMR with Triplicated Voters
TMR (or NMR) can be applied at all levels: gates, sensors, processors, memory boards. However, applying redundancy at a low level produces overhead, and the cost can be high.
Voting on processor outputs.
12. TMR with Triplicated Voters
TMR handles processor faults, voter faults, memory faults, and bus faults: the system has no single point of failure. This approach was implemented in the Tandem Integrity system.
Variants (Mi = memory, Pi = processor):
- Voting at read from memory
- Voting at read/write from/to memory
- Voting at write to memory
13. Cascading TMR Modules
- Idea is to isolate failures from either components or voters so that they don't propagate
- Examples
  - JPL STAR (Self-Testing And Repairing computer)
  - FAA WAAS (Wide Area Augmentation System)
14. Fault Detection
- Duplication with comparison
Two identical modules perform the same operations. Fault detection, but not tolerance.
15. Fault Detection in Multicomputers
- Multicomputers with shared memory
Fault detection; the comparison is performed in software.
16. Dynamic Redundancy: Standby Systems
17. Reliability
- Parallel system (2 modules): R_p = 2R - R^2
- Standby (2 modules), assuming a perfect coupler and exponential lifetimes: R_s = e^(-λt)(1 + λt)
- R_s > R_p, but the coupling mechanism can have a significant effect on the results (see the sketch below)
- In the standby case, the coupler is more complex
  - it detects failure, powers the standby, and switches the outputs
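A quick numeric comparison of the two formulas above, assuming exponential lifetimes and an ideal coupler; lam and t are hypothetical values:

```python
import math

lam, t = 1e-3, 1000.0                 # assumed failure rate and mission time

r = math.exp(-lam * t)                # single-module reliability at time t
r_parallel = 2 * r - r**2             # both modules powered from the start
r_standby = r * (1 + lam * t)         # ideal coupler: perfect detection/switch

print(f"Rp = {r_parallel:.3f}, Rs = {r_standby:.3f}")   # Rs (0.736) > Rp (0.600)
```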
18. Hot Standby Systems
- Updated simultaneously with the primary system
- Advantages
  - Very short outage time
  - Does not require application recovery
- Drawbacks
  - High failure rate
  - High power consumption
19. Warm Standby Systems
- Secondary (i.e., backup) system runs in the background of the primary system
- Data is mirrored to the secondary server at regular intervals (at times the two servers do not contain exactly the same data)
- Advantages
  - Does not require simultaneous updating with the primary
- Drawbacks
  - Requires recovery time
  - High failure rate
  - High power consumption
20. Cold Standby Systems
- Secondary (i.e., backup) system is only called upon when the primary system fails. The secondary is not updated as frequently as in warm standby
- Advantages
  - Low failure rate
  - Low power consumption
- Drawbacks
  - Very long outage time
  - Needs to boot the kernel/OS and recover application state
21. Hybrid Redundancy
- Self-purging redundancy
  - Adaptive voting
  - PEi: processing element
  - ESi: elementary switch
Each ES compares the output of its PE with the voter's output and disconnects (purges) a PE whose output disagrees with the voting result (see the sketch below).
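A minimal sketch of one adaptive-voting round with purging; the dict-based PE/ES representation is ours, not a hardware model:

```python
def vote_and_purge(outputs: dict) -> int:
    """One round of adaptive voting: threshold-vote over the currently
    active PEs, then let the ESs disconnect any PE whose 1-bit output
    disagrees with the voting result."""
    result = int(sum(outputs.values()) * 2 > len(outputs))   # majority vote
    for pe in [p for p, out in outputs.items() if out != result]:
        del outputs[pe]                                      # ES purges the PE
    return result

active = {1: 1, 2: 1, 3: 0, 4: 1, 5: 1}   # PE 3 is faulty
print(vote_and_purge(active), active)      # -> 1 {1: 1, 2: 1, 4: 1, 5: 1}
```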
22. Hybrid Redundancy
N-modular redundancy with spares: the voter's output is used to identify faulty modules, which are replaced with spares.
23. Re-execution
- Replicate the actions of a module either on the same module (temporal redundancy) or on spare modules (temporal + spatial redundancy)
- Good for detecting and/or correcting transient faults
  - a transient error will only affect one execution
- Can be implemented at many different levels
  - ALU
  - Thread context
  - Processor
  - System
24. Re-execution with Shifted Operands (RESO)
- Re-execute the same arithmetic operation, but with the operands shifted
- Goal: detect errors in the ALU
- Example: shift left by 2
  1 0 1 0 1 0 X X
  1 0 0 1 0 1 X X
  0 0 1 0 1 1 X X
- By comparing output bit 0 of the first execution with output bit 2 of the shifted re-execution, we detect an error in the ALU, since they should be equal (see the sketch below)
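A toy Python sketch of the RESO idea: the same (hypothetically faulty) addition is run once normally and once with both operands shifted left by 2, and the realigned results are compared:

```python
SHIFT = 2   # re-execute with operands shifted left by 2, as on the slide

def faulty_add(a: int, b: int) -> int:
    """Hypothetical ALU with output bit 0 stuck at 0 (a permanent fault)."""
    return (a + b) & ~1

def reso_add(a: int, b: int):
    first = faulty_add(a, b)
    second = faulty_add(a << SHIFT, b << SHIFT) >> SHIFT
    # Bit i of the first run must equal bit i+SHIFT of the re-execution
    # (output bit 0 vs. output bit 2 on the slide); a mismatch flags an error.
    return first, first != second

result, error = reso_add(0b101010, 0b100101)
print(result, error)    # error == True: the stuck bit is detected
```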
25. Re-execution with Processes
- Use redundant processes to detect errors
- Problem: serialization, slowdown factor of 2
- In a multiprocessor, we can execute copies of the same process simultaneously on 2 processors and have them periodically compare their results
  - almost no slowdown, except for the comparisons
- Disadvantage: the opportunity cost of not using the other processor to perform non-redundant work (see the sketch below)
[Diagram: two redundant processes with error checking, on a single CPU vs. on two CPUs]
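A minimal sketch of duplicated processes on two processors with a software comparison, using Python's multiprocessing (the workload is arbitrary):

```python
from multiprocessing import Pool

def work(n: int) -> int:
    """The replicated computation; both copies run identical code."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(2) as pool:                      # one copy per processor
        a, b = pool.map(work, [100_000, 100_000])
    print("error detected!" if a != b else "results agree")
```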
26. Re-execution of Microinstructions
Superscalar processor microarchitecture.
Drawback: this tests only the FUs, not the whole pipeline.
- Processors also use built-in self-test (BIST)
  - generation of test vectors within the chip
27. Re-execution with Threads
- Use redundant threads to detect errors
- Many current superscalar microprocessors are multithreaded (e.g., the Pentium 4 is hyper-threaded)
- Each processor can run multiple processes or multiple threads of the same process (i.e., it has multiple thread contexts)
- Can re-execute a program on multiple thread contexts, just like with multiple processors
- Better performance than re-execution with multiple processors, since the comparison can be performed on-chip
- Less opportunity cost to use an extra thread context than an extra processor
28. SMT - Flow of Instructions
29. Re-execution with Simultaneous Multithreading (SMT)
- Motivation (Rotenberg 99)
  - increasingly high clock rates and chip density may cause transient errors in high-performance microprocessors
  - high cost of multiprocessors
- Active-stream/Redundant-stream Simultaneous Multithreading (AR-SMT)
  - low overhead, broad coverage of transient faults and some permanent faults
  - in AR-SMT, two explicit copies of the program run concurrently on the same processor resources
30. Re-execution with Simultaneous Multithreading (SMT)
- The A-stream executes on the SMT processor and its results are committed to the delay buffer
- The R-stream executes on the SMT processor, delayed behind the A-stream by no more than the size of the delay buffer
- R-stream results are compared to the A-stream results in the delay buffer; a fault is detected if the results differ (see the sketch below)
- SMT pipeline
  - time-shared: in any given cycle, a pipeline stage is consumed entirely by one thread
  - space-shared: every cycle, a fraction of the bandwidth is allocated to both threads
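A toy model of the delay-buffer handshake between the two streams; the buffer size is an assumed value:

```python
from collections import deque

DELAY = 4                # assumed delay-buffer capacity, in results
buffer = deque()

def a_commit(result):
    """A-stream commits each result to the delay buffer."""
    assert len(buffer) < DELAY      # A-stream leads by at most the buffer size
    buffer.append(result)

def r_commit(result):
    """R-stream, running behind, compares against the oldest A-stream result."""
    if result != buffer.popleft():
        raise RuntimeError("fault detected: streams disagree")

for x in (1, 2, 3):
    a_commit(x * x)
for x in (1, 2, 3):
    r_commit(x * x)                 # all comparisons pass: no fault
```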
31. I/O in Computer Systems
- Disks are often considered the stable storage on which we save critical data
  - e.g., databases write their important data to disks
- Critical disk systems are backed up with tape
  - periodically (e.g., nightly, weekly), log diffs to tape
- Disks are generally protected with
  - Information redundancy (EDC/ECC)
  - Physical redundancy
32. Redundant Array of Inexpensive Disks (RAID)
- Motivation (Patterson 88): Amdahl's law
  S = speedup, f = fraction of the program that is improved (the non-improved part is due to I/O), k = CPU speedup
  S = 1 / ((1 - f) + f/k)
  f = 0.90, k = 10:  S = 1/((1 - 0.90) + 0.90/10)  = 1/0.19  ≈ 5.2
  f = 0.90, k = 100: S = 1/((1 - 0.90) + 0.90/100) = 1/0.109 ≈ 9.2
  10 times the CPU speedup buys less than 2 times the overall speedup
  S_lim (k -> ∞) = 1/(1 - f) = 10 (90% of the CPU speedup is wasted)
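The same arithmetic as a runnable check:

```python
def amdahl(f: float, k: float) -> float:
    """Overall speedup when a fraction f of the work is sped up k times."""
    return 1.0 / ((1.0 - f) + f / k)

print(amdahl(0.90, 10))      # 5.26 (the slide's 5.2)
print(amdahl(0.90, 100))     # 9.17 (the slide's 9.2)
print(1.0 / (1.0 - 0.90))    # limit as k -> infinity: 10
```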
33. RAID
Reliability for an array of disks: MTTF_array = MTTF_single_disk / N_disks
MTTF_10  = 300,000/10  = 30,000 h
MTTF_100 = 300,000/100 = 3,000 h
The more disks, the more likely that one of them crashes: an array has worse MTTF than a single disk.
34. RAID
- Techniques to organize data across multiple disks that exploit
  - Redundancy
  - Parallelism
- Goals of RAID: improve
  - Reliability
  - Performance
- Levels
  - RAID-1: mirrored disks
  - RAID-2: Hamming code for ECC
  - RAID-3: single check disk per group
  - RAID-4: independent reads/writes
  - RAID-5: no single check disk
35. RAID-0: Striping
- Virtual disk sectors are split into strips of k sectors
- Strips are placed on disks cyclically (see the sketch below)
- No overhead
- Maximal parallelism
- No fault tolerance (worse than a single disk)
- Not real RAID: no redundancy
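A minimal sketch of the cyclic strip placement as an address mapping; strip size and disk count are assumed values:

```python
SECTORS_PER_STRIP = 4     # k, assumed strip size in sectors
NUM_DISKS = 4             # assumed array width

def locate(virtual_sector: int):
    """Map a virtual-disk sector to (disk, physical sector) under
    cyclic strip placement."""
    strip, offset = divmod(virtual_sector, SECTORS_PER_STRIP)
    disk = strip % NUM_DISKS
    physical = (strip // NUM_DISKS) * SECTORS_PER_STRIP + offset
    return disk, physical

print(locate(0), locate(4), locate(17))   # (0, 0) (1, 0) (0, 5)
```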
36. RAID-1: Mirroring
- Every disk has a copy (a mirror)
- Simple
- A write does 2 I/Os; a read can be served by either copy
- 100% overhead
- Excellent fault tolerance
37. RAID-2: ECC Across Disks
Hamming check bits
- Use an error-correcting code (ECC) (Hamming SEC)
- Spread each word's bits across the disks
- Lower overhead than RAID-1 (depends on the ECC)
- Drives need to be synchronized
- Complicated, expensive controller
38. RAID-3: Parity Disk
Example (even parity): 1 0 1 0 0 0 1 1
- A single parity disk stores the XOR of the other disks
- Provides 1-disk crash tolerance
  - the position of the crashed disk is known by the disk controller
  - to find the missing information, add the parity disk to the other disks' data mod 2 (XOR; see the sketch below)
- Disk drives need to be synchronized
- Low overhead (1/N for N disks to store parity)
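A small sketch of parity-based recovery, with one hypothetical byte per disk:

```python
from functools import reduce

data = [0b10100011, 0b01100101, 0b11110000]   # one byte per data disk
parity = reduce(lambda x, y: x ^ y, data)      # parity disk: XOR of all data

# Disk 1 crashes; its position is known to the controller, so its byte
# is recovered as the XOR (mod-2 sum) of parity and the surviving disks.
recovered = parity ^ data[0] ^ data[2]
assert recovered == data[1]
print(bin(recovered))
```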
39. RAID-4: Parity Striping
- Like RAID-3, but with strip-for-strip parity
- No drive synchronization
- Each write causes 2 reads (old data, old parity) and 2 writes (new data, new parity), as sketched below
- The parity drive can become a bottleneck
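The 2-reads/2-writes cost comes from the read-modify-write parity update, P_new = P_old XOR D_old XOR D_new; a minimal sketch:

```python
def small_write(d_old: int, d_new: int, p_old: int) -> int:
    """RAID-4/5 small write: 2 reads (old data, old parity) and
    2 writes (new data, new parity), with P_new = P_old ^ D_old ^ D_new."""
    return p_old ^ d_old ^ d_new   # the new parity to write

# Sanity check on a 3-disk group: update disk 0 and verify that parity
# still equals the XOR of the data disks.
d = [0b1010, 0b0110, 0b0011]
p = d[0] ^ d[1] ^ d[2]
p = small_write(d[0], 0b1111, p)
d[0] = 0b1111
assert p == d[0] ^ d[1] ^ d[2]
```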
40. RAID-5: Distributed Parity
- Like RAID-4, but the parity strips are spread over all disks
- More complicated controller and crash-recovery process
- The most widely used RAID system
- Other RAID systems: RAID-6, RAID-10
41. RAID-3 Parity Disk: Reliability
- A simplified reliability model for RAID-3 (2 data disks + 1 parity disk) is a k-of-n system, where k = the number of disks that must operate out of n (here k = 2, n = 3):
  R_array = Σ_{i=k}^{n} C(n,i) R^i (1-R)^(n-i)
- RAID-4 and RAID-5 have similar models (see the sketch below)
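A worked numeric instance of the k-of-n model; the mission time and per-disk MTTF are assumed values echoing slide 33:

```python
from math import comb, exp

def r_k_of_n(k: int, n: int, r: float) -> float:
    """Probability that at least k of n i.i.d. disks of reliability r work."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

# Assumed numbers: 10,000 h mission, 300,000 h per-disk MTTF (exponential).
r_disk = exp(-10_000 / 300_000)
print(r_disk)                  # single disk: ~0.967
print(r_k_of_n(2, 3, r_disk))  # RAID-3 (2 data + 1 parity, k=2, n=3): ~0.997
```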
42. Other Issues at the Disk Level
The I/O bus can still be a potential point of failure. A possible solution is to use redundant buses.
43. Space Shuttle
Architecture of the Space Shuttle computer system: 5 CPUs, with replicated buses and components.
44. Space Shuttle Computer System
- Configuration of the computers and most fault detection is performed in software
- The 4 computers get similar input data and compute the same functions
- Additionally, each processor
  - has extensive self-test facilities
    - if an error is detected, it is reported to the crew, which can then switch off the faulty unit
  - compares its results with those produced by its neighbors
    - if a processor detects a disagreement, it signals this, and voting is used in order to remove the offending computer
  - has a watchdog timer, in order to detect CPU crashes
45. Space Shuttle Computer System
- If one processor is switched off
  - the system becomes a triple modular redundancy (TMR) arrangement
- If a second processor is switched off
  - the system is switched into duplex mode, where the two computers compare their results in order to detect any further failure
- In case of a third failure
  - the system reports the inconsistencies to the crew and uses fault detection techniques in order to identify the offending unit
- This therefore provides protection against the failure of two units, plus fault detection and limited fault tolerance against the failure of a third unit
46. Space Shuttle Computer System
- In an emergency, the fifth computer can take over critical functions
- The 5th computer provides protection against systematic faults in the software
- If one or two computers fail, the crew or the controllers on Earth may decide to abort the mission