EE382C Lecture 14

EE382C Lecture 14: Reliability and Error Control, 5/17/11 (EE 382C - S11 - Lecture 14)

Provided by: Willi775
Learn more at: http://cva.stanford.edu

1
EE382C Lecture 14
  • Reliability and Error Control
  • 5/17/11

2
Announcements
  • Don't forget to iterate with us on your
    checkpoint 1 report
  • Send time slot preferences for checkpoint 2
  • Project presentations next week
  • Let us know if you are OK with presenting on
    Tuesday, May 24th

3
Question of the day
  • Consider a symmetric multiprocessing (SMP)
    network that does not allow packet loss and needs
    an availability of 0.99999
  • Link BER is 10^-15
  • Router components have a failure rate of 1000 FITs
  • How best can you achieve this reliability
    requirement?

4
Reliability and Availability
  • Reliability R(t)
    • Probability that the system is working at time t,
      given that it was working at time t0 and has
      had no failures in between
  • Availability A(t)
    • Probability that the system is working when
      needed, at a given point in time t
    • Often affected by the repair process:
      A = MTBF / (MTBF + MTTR)
  • MTBF: mean time between failures
  • FIT: failures in time (failures per 10^9 hours).
    Inverse of MTBF with zero repair time
  • MTTR: mean time to recovery
  • RAS requirements: reliability, availability, and
    serviceability
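
The relationship between FITs, MTBF, and availability can be checked numerically. The following is a minimal sketch; the function names and the 24-hour repair time are illustrative assumptions, not part of the lecture.

```python
HOURS_PER_BILLION = 1e9  # one FIT is one failure per 10^9 device-hours


def fit_to_mtbf_hours(fit: float) -> float:
    """Convert a failure rate in FITs to a mean time between failures in hours."""
    return HOURS_PER_BILLION / fit


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


mtbf = fit_to_mtbf_hours(1000)      # 1000 FITs corresponds to 10^6 hours
print(mtbf)
print(availability(mtbf, 24.0))     # assumed 24-hour repair time
```

With these assumed numbers the availability falls just short of 0.99999, which illustrates how strongly the repair time drives the result.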

5
Examples of RAS Requirements
  • Enterprise server
    • A = 0.99999
    • System-level requirement
    • Can translate into a network-level requirement, or the
      system can detect and recover from network failures
    • In general, every packet must be correctly
      received or the system will fail
  • Internet router
    • A = 0.99999
    • But OK to drop packets (at a rate of 10^-15)
    • Turn failures into packet drops

6
RAS Requirements in Those Systems
  • Dropping (reliability)
    • Allowed or not
    • Rate allowed (e.g., 10^-15)
  • Availability (A)
    • 0.999 to 0.99999
  • Serviceability (MTTR)

7
MTTF and MTTR
8
Failure Modes and Fault Models
Failure Mode                      Model       Rate     Units
Gaussian noise on channel         Transient   10^-20   BER
Alpha particle strike on memory   Soft        10^-9    SER
Alpha particle strike on logic    Transient   10^-10   BER
Electromigration                  Stuck-at    1        MTBF
Connector corrosion               Stuck-at    10       MTBF
Operator removes module           Fail-stop   10^5     MTBF
Software failure                  Fail-stop   10^4     MTBF
9
An Analogy
10
The Bathtub Curve
11
Detection, Containment, and Recovery
  • Three-step program for dealing with errors
  • Detection: discover the error
    • CRC codes on channels
    • Parity or ECC codes on memories
    • Self-checking logic
  • Containment: prevent the error from propagating
    further
    • Mask it
    • Drop the packet (and retry)
    • Fail-stop
  • Recovery: resume normal service
    • Return to a known state
    • Resume sending traffic
    • Possibly resend the faulted packet
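
The three steps can be sketched for a single packet protected by a CRC. This is an illustrative sketch only: the framing (a 32-bit CRC from Python's `zlib.crc32` appended to the payload) and the helper names `send`, `receive`, and `flip_bit` are assumptions, not the scheme used by any particular network.

```python
import zlib


def send(payload: bytes) -> bytes:
    """Append a 32-bit CRC to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def receive(frame: bytes):
    """Return the payload if the CRC matches, else None (packet dropped)."""
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == crc else None


def flip_bit(frame: bytes, bit: int) -> bytes:
    """Model a transient error by flipping one bit of the frame."""
    corrupted = bytearray(frame)
    corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)


frame = send(b"flit payload")
first_try = receive(flip_bit(frame, 5))   # detection and containment: dropped
second_try = receive(frame)               # recovery: clean retransmission
print(first_try, second_try)
```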

12
Example Link Level Error Control
  • Detection: CRC on channel
  • Containment: drop packet with error
  • Recovery: request retransmission and resume
    normal sequence
  • How can this fail? How to fix it?

13
Link-Level Error Control (2)
Flit 2 was in error, so flits 2-6 are
retransmitted. Why would you want to retransmit
flits 3-6?
  • Pointers
    • Ack: next flit to be acknowledged
    • Tx: next flit to be transmitted
    • Tail: next free slot
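
The pointer scheme can be illustrated with a small go-back-N style simulation. This is a sketch under stated assumptions: the whole window is in flight before any reply returns, and the receiver discards every flit that follows an error so that it never has to reorder flits (one common reason for resending flits 3-6). The class and function names are hypothetical.

```python
class RetransmitBuffer:
    """Sender-side flit buffer with the three pointers named on the slide."""

    def __init__(self):
        self.flits = {}   # sequence number -> flit, kept until acknowledged
        self.ack = 1      # next flit to be acknowledged
        self.tx = 1       # next flit to be transmitted
        self.tail = 1     # next free slot

    def enqueue(self, flit):
        self.flits[self.tail] = flit
        self.tail += 1

    def transmit(self):
        seq = self.tx
        self.tx += 1
        return seq, self.flits[seq]

    def on_ack(self, seq):
        """Flits up to and including seq arrived intact; free their slots."""
        for s in range(self.ack, seq + 1):
            del self.flits[s]
        self.ack = seq + 1

    def on_nack(self, seq):
        """Flit seq was in error; rewind so it and all later flits are resent."""
        self.tx = seq


def run(num_flits=6, bad_seq=2):
    buf = RetransmitBuffer()
    for i in range(1, num_flits + 1):
        buf.enqueue(f"flit{i}")

    sent_log, delivered = [], []
    corrupt_once = {bad_seq}              # flits hit by a transient error
    while buf.ack < buf.tail:
        in_flight = []
        while buf.tx < buf.tail:          # send everything not yet transmitted
            in_flight.append(buf.transmit())
        sent_log.extend(seq for seq, _ in in_flight)

        nack = None
        for seq, flit in in_flight:
            if nack is not None:
                continue                  # receiver discards flits after an error
            if seq in corrupt_once:
                corrupt_once.discard(seq)
                nack = seq                # error detected (e.g., CRC mismatch)
            else:
                delivered.append(flit)

        last_good = len(delivered)        # flits 1..last_good arrived in order
        if last_good >= buf.ack:
            buf.on_ack(last_good)
        if nack is not None:
            buf.on_nack(nack)
    return sent_log, delivered


sent_log, delivered = run()
print(sent_log)
print(delivered)
```

In this run the transmit log shows flits 1-6 followed by flits 2-6 again, matching the situation described on the slide.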

14
Channel Configuration
  • Reconfigure channels with frequent errors
  • Swap in spare bits
  • Reduce width of channel
  • Reduce bit rate
  • If malfunctions continue, decommission channel
  • Assumes routing algorithm will adapt

15
Cray BlackWidow Example
  • Each channel is 3 bits wide at 6.25 Gb/s per bit
    (b = 18.75 Gb/s)
  • Each 24-bit flit is serialized onto the 3 bits
  • Link-level retry rates are monitored
  • Each retry is attributed to one bit of the channel
  • If the retry rate exceeds a threshold, the bad bit is
    switched off
  • Channel degrades to two bits, then one bit, then
    is switched off
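
The degradation policy can be sketched as a per-bit retry counter. This is not the BlackWidow implementation: it simplifies "retry rate" to a plain count, and the threshold value and the class name `DegradableChannel` are hypothetical.

```python
BIT_RATE_GBPS = 6.25   # per-bit signalling rate from the slide


class DegradableChannel:
    def __init__(self, width=3, retry_threshold=4):
        self.active = [True] * width             # one flag per bit of the channel
        self.retries = [0] * width               # retries attributed to each bit
        self.retry_threshold = retry_threshold   # hypothetical threshold

    def record_retry(self, bit):
        """Attribute one retry to a bit; switch the bit off past the threshold."""
        if not self.active[bit]:
            return
        self.retries[bit] += 1
        if self.retries[bit] > self.retry_threshold:
            self.active[bit] = False

    @property
    def width(self):
        return sum(self.active)

    @property
    def bandwidth_gbps(self):
        return self.width * BIT_RATE_GBPS


ch = DegradableChannel()
print(ch.width, ch.bandwidth_gbps)   # full width: 3 bits
for _ in range(5):
    ch.record_retry(1)               # bit 1 keeps causing retries
print(ch.width, ch.bandwidth_gbps)   # degraded to 2 bits
```

A channel whose width reaches zero would be decommissioned, at which point the routing algorithm has to avoid it.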

16
Router Error Control
  • What would happen if:
    • A header bit in an input buffer flips
    • The credit count is corrupted
    • The router picks the wrong output
    • The selected output flips mid-packet
  • Numerous failure modes inside the router
    • Many lead to catastrophic failure
      • Perhaps hundreds of cycles after the error
        occurred
    • Many others lead to insidious performance
      problems
      • E.g., losing credits

17
Router Error Control (2)
  • Same steps of detect, confine, recover apply
  • Detect
    • Parity or CRC on all storage and communication
    • Quick consistency checks (e.g., on allocators and
      credits)
    • Two copies of all other logic (in space or time)
  • Confine
    • Stop propagating faulty packets
    • Operate via confinement regions (e.g., channel)
  • Recover
    • Reset to a known good state (sometimes via reset)
    • Resend faulted packets (if available)
    • Disable part of the router (fault-containment
      regions)
    • Replace part of the router (hot swapping)
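
As one example of a quick consistency check, an allocator's output can be validated against its input. The sketch below assumes a request matrix and a grant matrix with one row per input and one column per output; the function name and matrix layout are illustrative assumptions.

```python
def grants_are_consistent(requests, grants):
    """Check an allocation: every grant matches a request, and no input
    or output receives more than one grant."""
    for req_row, grant_row in zip(requests, grants):
        if sum(grant_row) > 1:
            return False                      # input granted two outputs
        if any(g and not r for r, g in zip(req_row, grant_row)):
            return False                      # grant without a request
    for column in zip(*grants):
        if sum(column) > 1:
            return False                      # output granted to two inputs
    return True


requests = [[1, 1, 0],
            [0, 1, 0],
            [0, 0, 1]]
good = [[1, 0, 0],
        [0, 1, 0],
        [0, 0, 1]]
bad = [[0, 1, 0],
       [0, 1, 0],     # output 1 granted to both input 0 and input 1
       [0, 0, 1]]
print(grants_are_consistent(requests, good))
print(grants_are_consistent(requests, bad))
```

A failed check would then trigger the confine and recover steps, for example by discarding the allocation and retrying it in the next cycle.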

18
Network-Level Error Control
  • Model faulty routers and links as fail-stop
    components
  • Use adaptive routing to avoid them
    • Table-based: recompute tables periodically
    • Local adaptive: pick another minimal (or
      non-minimal) link
  • Need to avoid dead ends and deadlocks
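
Table-based recomputation can be sketched as a shortest-path search that ignores failed links. This is a minimal illustration on an assumed 4-node ring; the function name `compute_next_hops` is hypothetical, and the sketch does not address deadlock avoidance.

```python
from collections import deque


def compute_next_hops(links, failed, source):
    """Return {destination: next hop from source} using only working links."""
    adjacency = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed:
            continue                      # fail-stop link: treat as absent
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    next_hop = {}
    visited = {source}
    queue = deque()
    for neighbor in adjacency.get(source, []):
        visited.add(neighbor)
        next_hop[neighbor] = neighbor
        queue.append(neighbor)
    while queue:                          # breadth-first search from the source
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                next_hop[neighbor] = next_hop[node]
                queue.append(neighbor)
    return next_hop


ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
before = compute_next_hops(ring, failed=set(), source=0)
after = compute_next_hops(ring, failed={(0, 1)}, source=0)
print(before[1], after[1])   # node 1 is reached directly, then via node 3
```

Destinations missing from the returned table are unreachable, which is one way a dead end would show up.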

19
End-To-End Error Control
  • Keep a copy of each packet at the source until
    acknowledged or timeout
    • (This buffer can get large)
  • If an error is detected
    • Drop the packet
    • (Optionally) send a negative acknowledgement
  • When a packet is correctly received
    • Send a positive acknowledgement
  • When an acknowledgement is received
    • Discard the stored copy of the packet
  • When a negative acknowledgement is received (or
    on timeout)
    • Resend the packet
  • May transmit the same packet multiple times
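
The protocol above can be sketched with a source that keeps copies and a destination that filters duplicates. The class names, the integer clock, and the timeout of 3 time units are assumptions made for illustration; duplicate filtering at the destination is one way to handle the repeated transmissions noted in the last bullet.

```python
class Source:
    def __init__(self, timeout=3):
        self.pending = {}          # packet id -> (payload, time last sent)
        self.timeout = timeout     # hypothetical timeout in time units

    def send(self, pid, payload, now):
        self.pending[pid] = (payload, now)   # keep a copy until acknowledged
        return pid, payload

    def on_ack(self, pid):
        self.pending.pop(pid, None)          # delivered: discard the copy

    def on_nack(self, pid, now):
        payload, _ = self.pending[pid]
        return self.send(pid, payload, now)  # resend from the stored copy

    def expired(self, now):
        return [pid for pid, (_, sent) in self.pending.items()
                if now - sent >= self.timeout]


class Destination:
    def __init__(self):
        self.seen = set()
        self.delivered = []

    def receive(self, pid, payload, ok=True):
        if not ok:
            return ("nack", pid)             # error detected: drop the packet
        if pid not in self.seen:             # duplicates are acked, not redelivered
            self.seen.add(pid)
            self.delivered.append(payload)
        return ("ack", pid)


src, dst = Source(), Destination()

pkt = src.send(1, "data", now=0)
reply = dst.receive(*pkt, ok=False)          # corrupted in the network
pkt = src.on_nack(reply[1], now=1)           # negative acknowledgement: resend
src.on_ack(dst.receive(*pkt)[1])             # arrives intact and is acknowledged

pkt = src.send(2, "more", now=2)
dst.receive(*pkt)                            # delivered, but assume the ack is lost
for pid in src.expired(now=5):               # timeout fires at the source
    payload, _ = src.pending[pid]
    reply = dst.receive(*src.send(pid, payload, now=5))   # duplicate copy
    src.on_ack(reply[1])

print(dst.delivered, src.pending)
```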

20
Question of the day
  • Consider a symmetric multiprocessing (SMP)
    network that does not allow packet loss and needs
    an availability of 0.99999
  • Link BER is 10^-15
  • Router components have a failure rate of 1000 FITs
  • How best can you achieve this reliability
    requirement?
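
Some back-of-the-envelope numbers help frame the question. The sketch below is not an answer to it: it treats a single 1000-FIT component in isolation and assumes a link rate of 18.75 Gb/s purely for illustration. The function names are hypothetical.

```python
def max_mttr_hours(fit: float, target_availability: float) -> float:
    """Largest MTTR that still meets A = MTBF / (MTBF + MTTR)."""
    mtbf_hours = 1e9 / fit                   # one FIT is one failure per 10^9 hours
    return mtbf_hours * (1.0 - target_availability) / target_availability


def seconds_between_bit_errors(ber: float, bits_per_second: float) -> float:
    """Mean time between bit errors on one link at a given bit rate."""
    return 1.0 / (ber * bits_per_second)


mttr_budget = max_mttr_hours(1000, 0.99999)
error_interval = seconds_between_bit_errors(1e-15, 18.75e9)  # assumed link rate
print(mttr_budget)
print(error_interval)
```

Under these assumptions a single component must be repaired within roughly 10 hours, and a single link sees a bit error about every 53,000 seconds (close to 15 hours). Both figures tighten as the number of components and links grows, and because packet loss is not allowed, each of those bit errors has to be detected and recovered rather than dropped.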

21
Summary
  • Specification sets reliability requirements
    • Drop rate
    • Availability
  • Failures are abstracted with fault models
    • Bit errors, soft errors, stuck-at, fail-stop
  • Detection, containment, and recovery
  • Link level
    • Ack and retransmit
    • Reconfigure
  • Router level
    • Detect all failures
    • Mask, retry, or reset
  • Network level
    • Route around faulty components
  • End-to-end
    • Retransmit on nack or timeout