Title: Last class review
1. Last class review
- Information redundancy
- Error detecting and correcting codes
- Applications
  - Hamming codes (SEC, SEC-DED)
  - CRC
  - Reed-Solomon
2. Summary of FTC techniques
3. Barriers in HW/SW
Barriers are constructed by design techniques for fault avoidance, masking, and tolerance.
A better name for fault tolerance is error tolerance.
4. Hardware Redundancy (Spatial)
- Passive (static) redundancy techniques
  - fault masking
- Dynamic redundancy techniques
  - detection, localization, containment, recovery
- Hybrid redundancy techniques
  - static + dynamic
  - fault masking + reconfiguration
5. Passive Hardware Redundancy
- Triple modular redundancy (TMR)
Voter truth table: 110 -> 1, 011 -> 1, 101 -> 1, 001 -> 0, 010 -> 0, 100 -> 0
3 active components; fault masking is performed by a voter. Problem: the voter is a single point of failure.
6. Simple Majority Voting
- Hardware realization of 1-bit majority voting
F = ab + ac + bc (2 gate delays: one AND level, one OR level), sketched below
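As a sanity check, here is a minimal Python sketch of this two-level majority function, verified exhaustively against the voter truth table above (function names are ours):

```python
def majority(a: int, b: int, c: int) -> int:
    """1-bit majority vote: F = ab + ac + bc (AND level, then OR level)."""
    return (a & b) | (a & c) | (b & c)

# Exhaustive check against the truth table on the slide.
for bits in range(8):
    a, b, c = (bits >> 2) & 1, (bits >> 1) & 1, bits & 1
    assert majority(a, b, c) == (a + b + c >= 2)
```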
7. Triple Modular Redundancy
Let R_i denote the reliability of module i.
R_TMR = P(all 3 modules functioning) + P(exactly 2 modules functioning)
If all modules are the same, R_1 = R_2 = R_3 = R, then
R_TMR = 3R^2 - 2R^3
If the lifetimes of the modules are exponential, R = e^(-λt), then
R_TMR = 3e^(-2λt) - 2e^(-3λt) and
MTTF_TMR = ∫_0^∞ R_TMR dt = 3/(2λ) - 2/(3λ) = 5/(6λ)
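A short Python sketch that evaluates these formulas numerically; the failure rate lam is an assumed, hypothetical value. Note that MTTF_TMR = 5/(6λ) is lower than the simplex MTTF = 1/λ, even though TMR is more reliable over short missions:

```python
import math

lam = 1e-4                      # assumed failure rate per hour (hypothetical)

def r_simplex(t):               # one module, exponential lifetime
    return math.exp(-lam * t)

def r_tmr(t):                   # R_TMR = 3R^2 - 2R^3
    r = r_simplex(t)
    return 3 * r**2 - 2 * r**3

print(5 / (6 * lam))                   # MTTF_TMR = 5/(6*lam) ~ 8333 h
print(1 / lam)                         # simplex MTTF = 1/lam = 10000 h
print(r_tmr(1000) > r_simplex(1000))   # True: TMR wins for short missions
```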
8. N-Modular Redundancy (NMR)
- Generalization of TMR (more than 3 modules, e.g. 5MR)
- Tolerates up to ⌊(N-1)/2⌋ faulty elements
- In general, voting can be done on digital or analog data
- Application: temperature measurement (no exact majority)
- Method: take 3 measurements, compute the median value (see the sketch after this list)
- Example
  - Sensor 1: 99 °C
  - Sensor 2: 100 °C
  - Sensor 3: 45,217 °C <- discard the outlier!
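A minimal sketch of the median (mid-value select) method, using the slide's readings (45,217 °C taken here as 45217.0):

```python
def median_vote(readings):
    """Mid-value selection: the median of three readings automatically
    discards a single wild outlier (no bit-exact majority needed)."""
    return sorted(readings)[len(readings) // 2]

print(median_vote([99.0, 100.0, 45217.0]))   # -> 100.0, outlier discarded
```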
9. NMR
- TMR is a 2-of-3 system. In general, the reliability of an m-of-n system is
  R_(m-of-n) = Σ_{i=m}^{n} C(n,i) R^i (1-R)^(n-i)
- Considering a single voter with reliability R_v in series: R_TMR = R_v (3R^2 - 2R^3)
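A small Python helper that evaluates the m-of-n sum; the module reliability r = 0.9 and voter reliability Rv = 0.99 are assumed values for illustration:

```python
from math import comb

def r_m_of_n(m: int, n: int, r: float) -> float:
    """Reliability of an m-of-n system of i.i.d. modules of reliability r."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(m, n + 1))

r = 0.9                                # assumed module reliability
print(r_m_of_n(2, 3, r))               # 2-of-3 (TMR): 3r^2 - 2r^3 = 0.972
print(0.99 * r_m_of_n(2, 3, r))        # in series with a voter of Rv = 0.99
```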
10. TMR with Triplicated Voters
To avoid a single point of failure, we can use multiple voters.
11. TMR with Triplicated Voters
TMR (or NMR) can be applied at all levels: gates, sensors, processors, memory boards. However, applying redundancy at a low level produces overhead, and the cost can be high.
Voting on processor outputs.
12. TMR with Triplicated Voters
TMR handles processor faults, voter faults, memory faults, and bus faults: the system has no single point of failure. This approach was implemented in the Tandem Integrity system.
Variants (Mi = memory, Pi = processor):
- Voting at read from memory
- Voting at read/write from/to memory
- Voting at write to memory
13. Cascading TMR Modules
- Idea is to isolate failures from either components or voters so that they don't propagate
- Examples
  - JPL STAR (Self-Testing And Repairing computer)
  - FAA WAAS (Wide Area Augmentation System)
14. Fault Detection
- Duplication with comparison
Two identical modules perform the same operations. Fault detection, but not tolerance.
15. Fault Detection in Multicomputers
- Multicomputers with shared memory
Fault detection; the comparison is performed in software.
16. Dynamic Redundancy: Standby Systems
17. Reliability
- Parallel system (2 modules): R_p = 2R - R^2
- Standby (2 modules), assuming a perfect coupler and exponential lifetimes: R_s = e^(-λt)(1 + λt)
- R_s > R_p, but the coupling mechanism can have a significant effect on the results (see the sketch below)
- In the standby case, the coupler is more complex
  - it detects failure, powers the standby, and switches the outputs
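A quick numeric comparison of the two formulas above, assuming exponential lifetimes and an ideal coupler; lam and t are hypothetical values:

```python
import math

lam, t = 1e-3, 1000.0                 # assumed failure rate and mission time

r = math.exp(-lam * t)                # single-module reliability at time t
r_parallel = 2 * r - r**2             # both modules powered from the start
r_standby = r * (1 + lam * t)         # ideal coupler: perfect detection/switch

print(f"Rp = {r_parallel:.3f}, Rs = {r_standby:.3f}")   # Rs (0.736) > Rp (0.600)
```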
18. Hot Standby Systems
- Updated simultaneously with the primary system
- Advantages
  - Very short outage time
  - Does not require application recovery
- Drawbacks
  - High failure rate
  - High power consumption
19. Warm Standby Systems
- Secondary (i.e., backup) system runs in the background of the primary system
- Data is mirrored to the secondary server at regular intervals (at times the two servers do not contain exactly the same data)
- Advantages
  - Does not require simultaneous updating with the primary
- Drawbacks
  - Requires recovery time
  - High failure rate
  - High power consumption
20. Cold Standby Systems
- Secondary (i.e., backup) system is only called upon when the primary system fails. The secondary is not updated as frequently as in warm standby
- Advantages
  - Low failure rate
  - Low power consumption
- Drawbacks
  - Very long outage time
  - Needs to boot the kernel/OS and recover application state
21. Hybrid Redundancy
- Self-purging redundancy
  - Adaptive voting
  - PEi: processing element
  - ESi: elementary switch
Each ES compares the output of its PE with the voter's output and disconnects (purges) a PE whose output disagrees with the voting result (see the sketch below).
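A minimal sketch of one adaptive-voting round with purging; the dict-based PE/ES representation is ours, not a hardware model:

```python
def vote_and_purge(outputs: dict) -> int:
    """One round of adaptive voting: threshold-vote over the currently
    active PEs, then let the ESs disconnect any PE whose 1-bit output
    disagrees with the voting result."""
    result = int(sum(outputs.values()) * 2 > len(outputs))   # majority vote
    for pe in [p for p, out in outputs.items() if out != result]:
        del outputs[pe]                                      # ES purges the PE
    return result

active = {1: 1, 2: 1, 3: 0, 4: 1, 5: 1}   # PE 3 is faulty
print(vote_and_purge(active), active)      # -> 1 {1: 1, 2: 1, 4: 1, 5: 1}
```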
22. Hybrid Redundancy
N-modular redundancy with spares: the voter's output is used to identify faulty modules, which are replaced with spares.
23. Re-execution
- Replicate the actions of a module either on the same module (temporal redundancy) or on spare modules (temporal + spatial redundancy)
- Good for detecting and/or correcting transient faults
  - a transient error will only affect one execution
- Can be implemented at many different levels
  - ALU
  - Thread context
  - Processor
  - System
24. Re-execution with Shifted Operands (RESO)
- Re-execute the same arithmetic operation, but with the operands shifted
- Goal: detect errors in the ALU
- Example: shift left by 2
  1 0 1 0 1 0 X X
  1 0 0 1 0 1 X X
  0 0 1 0 1 1 X X
- By comparing output bit 0 of the first execution with output bit 2 of the shifted re-execution, we detect an error in the ALU, since they should be equal (see the sketch below)
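A toy Python sketch of the RESO idea: the same (hypothetically faulty) addition is run once normally and once with both operands shifted left by 2, and the realigned results are compared:

```python
SHIFT = 2   # re-execute with operands shifted left by 2, as on the slide

def faulty_add(a: int, b: int) -> int:
    """Hypothetical ALU with output bit 0 stuck at 0 (a permanent fault)."""
    return (a + b) & ~1

def reso_add(a: int, b: int):
    first = faulty_add(a, b)
    second = faulty_add(a << SHIFT, b << SHIFT) >> SHIFT
    # Bit i of the first run must equal bit i+SHIFT of the re-execution
    # (output bit 0 vs. output bit 2 on the slide); a mismatch flags an error.
    return first, first != second

result, error = reso_add(0b101010, 0b100101)
print(result, error)    # error == True: the stuck bit is detected
```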
25. Re-execution with Processes
- Use redundant processes to detect errors
- Problem: serialization, slowdown factor of 2
- In a multiprocessor, we can execute copies of the same process simultaneously on 2 processors and have them periodically compare their results
  - almost no slowdown, except for the comparisons
- Disadvantage: the opportunity cost of not using the other processor to perform non-redundant work (see the sketch below)
[Diagram: two redundant processes with error checking, on a single CPU vs. on two CPUs]
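A minimal sketch of duplicated processes on two processors with a software comparison, using Python's multiprocessing (the workload is arbitrary):

```python
from multiprocessing import Pool

def work(n: int) -> int:
    """The replicated computation; both copies run identical code."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(2) as pool:                      # one copy per processor
        a, b = pool.map(work, [100_000, 100_000])
    print("error detected!" if a != b else "results agree")
```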
26. Re-execution of Microinstructions
Superscalar processor microarchitecture.
Drawback: this tests only the FUs, not the whole pipeline.
- Processors also use built-in self-test (BIST)
  - generation of test vectors within the chip
27. Re-execution with Threads
- Use redundant threads to detect errors
- Many current superscalar microprocessors are multithreaded (e.g., the Pentium 4 is hyper-threaded)
- Each processor can run multiple processes or multiple threads of the same process (i.e., it has multiple thread contexts)
- Can re-execute a program on multiple thread contexts, just like with multiple processors
- Better performance than re-execution with multiple processors, since the comparison can be performed on-chip
- Less opportunity cost to use an extra thread context than an extra processor
28. SMT - Flow of Instructions
29. Re-execution with Simultaneous Multithreading (SMT)
- Motivation (Rotenberg 99)
  - increasingly high clock rates and chip density may cause transient errors in high-performance microprocessors
  - high cost of multiprocessors
- Active-stream/Redundant-stream Simultaneous Multithreading (AR-SMT)
  - low overhead, broad coverage of transient faults and some permanent faults
  - in AR-SMT, two explicit copies of the program run concurrently on the same processor resources
30. Re-execution with Simultaneous Multithreading (SMT)
- The A-stream executes on the SMT processor and its results are committed to the delay buffer
- The R-stream executes on the SMT processor, delayed behind the A-stream by no more than the size of the delay buffer
- R-stream results are compared to the A-stream results in the delay buffer; a fault is detected if the results differ (see the sketch below)
- SMT pipeline
  - time-shared: in any given cycle, a pipeline stage is consumed entirely by one thread
  - space-shared: every cycle, a fraction of the bandwidth is allocated to both threads
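A toy model of the delay-buffer handshake between the two streams; the buffer size is an assumed value:

```python
from collections import deque

DELAY = 4                # assumed delay-buffer capacity, in results
buffer = deque()

def a_commit(result):
    """A-stream commits each result to the delay buffer."""
    assert len(buffer) < DELAY      # A-stream leads by at most the buffer size
    buffer.append(result)

def r_commit(result):
    """R-stream, running behind, compares against the oldest A-stream result."""
    if result != buffer.popleft():
        raise RuntimeError("fault detected: streams disagree")

for x in (1, 2, 3):
    a_commit(x * x)
for x in (1, 2, 3):
    r_commit(x * x)                 # all comparisons pass: no fault
```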
31. I/O in Computer Systems
- Disks are often considered the stable storage on which we save critical data
  - e.g., databases write their important data to disks
- Critical disk systems are backed up with tape
  - periodically (e.g., nightly, weekly), log diffs to tape
- Disks are generally protected with
  - Information redundancy (EDC/ECC)
  - Physical redundancy
32. Redundant Array of Inexpensive Disks (RAID)
- Motivation (Patterson 88): Amdahl's law
  S = speedup, f = fraction of the program that is improved (the non-improved part is due to I/O), k = CPU speedup
  S = 1 / ((1 - f) + f/k)
  f = 0.90, k = 10:  S = 1/((1 - 0.90) + 0.90/10)  = 1/0.19  ≈ 5.2
  f = 0.90, k = 100: S = 1/((1 - 0.90) + 0.90/100) = 1/0.109 ≈ 9.2
  10 times the CPU speedup buys less than 2 times the overall speedup
  S_lim (k -> ∞) = 1/(1 - f) = 10 (90% of the CPU speedup is wasted)
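The same arithmetic as a runnable check:

```python
def amdahl(f: float, k: float) -> float:
    """Overall speedup when a fraction f of the work is sped up k times."""
    return 1.0 / ((1.0 - f) + f / k)

print(amdahl(0.90, 10))      # 5.26 (the slide's 5.2)
print(amdahl(0.90, 100))     # 9.17 (the slide's 9.2)
print(1.0 / (1.0 - 0.90))    # limit as k -> infinity: 10
```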
33. RAID
Reliability for an array of disks: MTTF_array = MTTF_single_disk / N_disks
MTTF_10  = 300,000/10  = 30,000 h
MTTF_100 = 300,000/100 = 3,000 h
The more disks, the more likely that one of them crashes: an array has worse MTTF than a single disk.
34. RAID
- Techniques to organize data across multiple disks that exploit
  - Redundancy
  - Parallelism
- Goals of RAID: improve
  - Reliability
  - Performance
- Levels
  - RAID-1: mirrored disks
  - RAID-2: Hamming code for ECC
  - RAID-3: single check disk per group
  - RAID-4: independent reads/writes
  - RAID-5: no single check disk
35. RAID-0: Striping
- Virtual disk sectors are split into strips of k sectors
- Strips are placed on disks cyclically (see the sketch below)
- No overhead
- Maximal parallelism
- No fault tolerance (worse than a single disk)
- Not real RAID: no redundancy
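A minimal sketch of the cyclic strip placement as an address mapping; strip size and disk count are assumed values:

```python
SECTORS_PER_STRIP = 4     # k, assumed strip size in sectors
NUM_DISKS = 4             # assumed array width

def locate(virtual_sector: int):
    """Map a virtual-disk sector to (disk, physical sector) under
    cyclic strip placement."""
    strip, offset = divmod(virtual_sector, SECTORS_PER_STRIP)
    disk = strip % NUM_DISKS
    physical = (strip // NUM_DISKS) * SECTORS_PER_STRIP + offset
    return disk, physical

print(locate(0), locate(4), locate(17))   # (0, 0) (1, 0) (0, 5)
```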
36. RAID-1: Mirroring
- Every disk has a copy (a mirror)
- Simple
- A write does 2 I/Os; a read can be served by either copy
- 100% overhead
- Excellent fault tolerance
37. RAID-2: ECC Across Disks
Hamming check bits
- Use an error-correcting code (ECC) (Hamming SEC)
- Spread each word's bits across the disks
- Lower overhead than RAID-1 (depends on the ECC)
- Drives need to be synchronized
- Complicated, expensive controller
38. RAID-3: Parity Disk
Example (even parity): 1 0 1 0 0 0 1 1
- A single parity disk stores the XOR of the other disks
- Provides 1-disk crash tolerance
  - the position of the crashed disk is known by the disk controller
  - to find the missing information, add the parity disk to the other disks' data mod 2 (XOR; see the sketch below)
- Disk drives need to be synchronized
- Low overhead (1/N for N disks to store parity)
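A small sketch of parity-based recovery, with one hypothetical byte per disk:

```python
from functools import reduce

data = [0b10100011, 0b01100101, 0b11110000]   # one byte per data disk
parity = reduce(lambda x, y: x ^ y, data)      # parity disk: XOR of all data

# Disk 1 crashes; its position is known to the controller, so its byte
# is recovered as the XOR (mod-2 sum) of parity and the surviving disks.
recovered = parity ^ data[0] ^ data[2]
assert recovered == data[1]
print(bin(recovered))
```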
39. RAID-4: Parity Striping
- Like RAID-3, but with strip-for-strip parity
- No drive synchronization
- Each write causes 2 reads (old data, old parity) and 2 writes (new data, new parity), as sketched below
- The parity drive can become a bottleneck
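The 2-reads/2-writes cost comes from the read-modify-write parity update, P_new = P_old XOR D_old XOR D_new; a minimal sketch:

```python
def small_write(d_old: int, d_new: int, p_old: int) -> int:
    """RAID-4/5 small write: 2 reads (old data, old parity) and
    2 writes (new data, new parity), with P_new = P_old ^ D_old ^ D_new."""
    return p_old ^ d_old ^ d_new   # the new parity to write

# Sanity check on a 3-disk group: update disk 0 and verify that parity
# still equals the XOR of the data disks.
d = [0b1010, 0b0110, 0b0011]
p = d[0] ^ d[1] ^ d[2]
p = small_write(d[0], 0b1111, p)
d[0] = 0b1111
assert p == d[0] ^ d[1] ^ d[2]
```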
40. RAID-5: Distributed Parity
- Like RAID-4, but the parity strips are spread over all disks
- More complicated controller and crash-recovery process
- The most widely used RAID system
- Other RAID systems: RAID-6, RAID-10
41. RAID-3 Parity Disk: Reliability
- A simplified reliability model for RAID-3 (2 data disks + 1 parity disk) is a k-of-n system, where k = the number of disks that must operate out of n (here k = 2, n = 3):
  R_array = Σ_{i=k}^{n} C(n,i) R^i (1-R)^(n-i)
- RAID-4 and RAID-5 have similar models (see the sketch below)
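A worked numeric instance of the k-of-n model; the mission time and per-disk MTTF are assumed values echoing slide 33:

```python
from math import comb, exp

def r_k_of_n(k: int, n: int, r: float) -> float:
    """Probability that at least k of n i.i.d. disks of reliability r work."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

# Assumed numbers: 10,000 h mission, 300,000 h per-disk MTTF (exponential).
r_disk = exp(-10_000 / 300_000)
print(r_disk)                  # single disk: ~0.967
print(r_k_of_n(2, 3, r_disk))  # RAID-3 (2 data + 1 parity, k=2, n=3): ~0.997
```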
42. Other Issues at the Disk Level
The I/O bus can still be a potential point of failure. A possible solution is to use redundant buses.
43. Space Shuttle
Architecture of the Space Shuttle computer system: 5 CPUs, with replicated buses and components.
44. Space Shuttle Computer System
- Configuration of the computers and most fault detection is performed in software
- The 4 computers get similar input data and compute the same functions
- Additionally, each processor
  - has extensive self-test facilities
    - if an error is detected, it is reported to the crew, which can then switch off the faulty unit
  - compares its results with those produced by its neighbors
    - if a processor detects a disagreement, it signals this, and voting is used in order to remove the offending computer
  - has a watchdog timer, in order to detect CPU crashes
45. Space Shuttle Computer System
- If one processor is switched off
  - the system becomes a triple modular redundancy (TMR) arrangement
- If a second processor is switched off
  - the system is switched into duplex mode, where the two computers compare their results in order to detect any further failure
- In case of a third failure
  - the system reports the inconsistencies to the crew and uses fault detection techniques in order to identify the offending unit
- This therefore provides protection against the failure of two units, plus fault detection and limited fault tolerance against the failure of a third unit
46. Space Shuttle Computer System
- In an emergency, the fifth computer can take over critical functions
- The 5th computer provides protection against systematic faults in the software
- If one or two computers fail, the crew or the controllers on Earth may decide to abort the mission