EE382C Lecture 14 - PowerPoint PPT Presentation

About This Presentation

Title:

EE382C Lecture 14

Description:

EE382C Lecture 14: Reliability and Error Control, 5/17/11 (EE 382C, S11, Lecture 14)

Slides: 22

Provided by: Willi775

Learn more at: http://cva.stanford.edu

Transcript and Presenter's Notes

1
EE382C Lecture 14

Reliability and Error Control
5/17/11

2
Announcements

Don't forget to iterate with us on your checkpoint 1 report
Send time slot preferences for checkpoint 2
Project presentations next week
Let us know if you are OK with presenting on Tuesday, May 24th

3
Question of the day

Consider a symmetric multiprocessing (SMP) network that does not allow packet loss and needs an availability of 0.99999
Link BER is 10^-15
Router components have a failure rate of 1000 FITs
How best can you achieve this reliability requirement?

4
Reliability and Availability

Reliability R(t): probability that the system is working at time t, given that it was working at time t0 and has had no failures in between
Availability A(t): probability that the system is working when needed, at a given point in time t
Often affected by the repair process: A = MTBF / (MTBF + MTTR)
MTBF: mean time between failures
FIT: failures in time (failures per 10^9 hours). Inverse of MTBF with zero repair time
MTTR: mean time to recovery
RAS requirements: reliability, availability, and serviceability
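
The relation A = MTBF / (MTBF + MTTR) can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this example, and it assumes the usual convention that one FIT is one failure per 10^9 hours and that a single component dominates the failure rate.

```python
HOURS_PER_FIT_WINDOW = 1e9  # assumption: 1 FIT = 1 failure per 10^9 hours


def mtbf_hours_from_fits(fits: float) -> float:
    """Convert a failure rate in FITs to an MTBF in hours."""
    return HOURS_PER_FIT_WINDOW / fits


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr_hours(mtbf_hours: float, target_availability: float) -> float:
    """Largest MTTR that still meets the target (A solved for MTTR)."""
    return mtbf_hours * (1.0 - target_availability) / target_availability


if __name__ == "__main__":
    mtbf = mtbf_hours_from_fits(1000)       # a 1000-FIT component
    print(mtbf)                             # MTBF in hours
    print(availability(mtbf, 10.0))         # availability with a 10-hour repair
    print(max_mttr_hours(mtbf, 0.99999))    # repair budget for five nines
```

Under these assumptions a 1000-FIT component has an MTBF of 10^6 hours, so five-nines availability leaves roughly ten hours of repair time per failure for that one component. A system built from many such components has a proportionally smaller budget.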

5
Examples of RAS Requirements

Enterprise Server
A = 0.99999
System-level requirement
Can be reflected to a network-level requirement, or the system can detect and recover from network failures
In general, every packet must be correctly received or the system will fail
Internet Router
A = 0.99999
But OK to drop packets (at a rate of 10^-15)
Turn failures into packet drops

6
RAS Requirements in Those Systems

Dropping (reliability)
Allowed or not
Rate allowed (e.g., 10^-15)
Availability (A)
0.999 to 0.99999
Serviceability (MTTR)

7
MTTF and MTTR

8
Failure Modes and Fault Models

Failure Mode                        Fault Model   Rate     Units
Gaussian Noise on Channel           Transient     10^-20   BER
Alpha Particle Strike on Memory     Soft          10^-9    SER
Alpha Particle Strike on Logic      Transient     10^-10   BER
Electromigration                    Stuck-at      1        MTBF
Connector Corrosion                 Stuck-at      10       MTBF
Operator Removes Module             Fail-Stop     10^5     MTBF
Software Failure                    Fail-Stop     10^4     MTBF

9
An Analogy

10
The Bathtub Curve

11
Detection, Containment, and Recovery

Three-step program for dealing with errors
Detection: discover the error
CRC codes on channels
Parity or ECC codes on memories
Self-checking logic
Containment: prevent the error from propagating further
Mask it
Drop the packet (and retry)
Fail-stop
Recovery: resume normal service
Return to a known state
Resume sending traffic
Possibly resend faulted packet

12
Example Link Level Error Control

Detection: CRC on channel
Containment: drop packet with error
Recovery: request retransmission and resume normal sequence
How can this fail? How to fix it?
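
The detect, contain, recover sequence on a single link can be sketched in a few lines. This is only an illustration: it uses Python's `zlib.crc32` as a stand-in for whatever CRC a real link would use, and `flip_first_transmission` is a hypothetical faulty channel invented for the example.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Sender side: append a 32-bit CRC to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(framed: bytes):
    """Receiver side: return the payload if the CRC matches, else None (drop)."""
    payload, crc = framed[:-4], int.from_bytes(framed[-4:], "big")
    return payload if zlib.crc32(payload) == crc else None


def send_with_retry(payload: bytes, channel, max_tries: int = 4):
    """Detect (CRC), contain (drop), recover (retransmit)."""
    for attempt in range(1, max_tries + 1):
        received = check(channel(frame(payload)))
        if received is not None:
            return received, attempt
    raise RuntimeError("channel still failing after retries")


def flip_first_transmission():
    """Hypothetical faulty channel: corrupts one bit of the first frame only."""
    state = {"sent": 0}

    def channel(framed: bytes) -> bytes:
        state["sent"] += 1
        if state["sent"] == 1:
            corrupted = bytearray(framed)
            corrupted[0] ^= 0x01  # single-bit error
            return bytes(corrupted)
        return framed

    return channel


if __name__ == "__main__":
    data, attempts = send_with_retry(b"flit-2", flip_first_transmission())
    print(data, attempts)
```

The sketch deliberately ignores the cases the slide asks about, such as an error pattern the CRC misses or a corrupted retransmission request.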

13
Link-Level Error Control (2)
Flit 2 was in error. Flits 2-6 are retransmitted. Why would you want to retransmit flits 3-6?

Pointers
Ack: next flit to be acknowledged
Tx: next flit to be transmitted
Tail: next free slot
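
The three pointers can be modeled as a small sender-side buffer. The sketch below is a guess at the mechanism rather than the lecture's actual design: the class and method names are invented, the buffer is unbounded for simplicity, flits are numbered from 0, and acknowledgements are assumed to arrive in order. Rewinding Tx to the flit in error resembles go-back-N retransmission.

```python
class RetransmitBuffer:
    """Sender-side flit buffer with Ack, Tx, and Tail pointers (illustrative)."""

    def __init__(self):
        self.flits = []  # flit i is stored at index i
        self.ack = 0     # Ack: next flit to be acknowledged
        self.tx = 0      # Tx: next flit to be transmitted

    @property
    def tail(self):
        """Tail: next free slot."""
        return len(self.flits)

    def enqueue(self, flit):
        self.flits.append(flit)

    def transmit(self):
        """Send the flit at Tx, or return None if nothing is pending."""
        if self.tx == self.tail:
            return None
        seq = self.tx
        self.tx += 1
        return seq, self.flits[seq]

    def on_ack(self, seq):
        """The receiver accepted flit seq, so its copy is no longer needed."""
        if seq != self.ack:
            raise ValueError("acknowledgements must arrive in order")
        self.ack += 1

    def on_error(self, seq):
        """The receiver reported an error on flit seq: rewind Tx to it."""
        if seq != self.ack:
            raise ValueError("error report must refer to the oldest unacked flit")
        self.tx = self.ack


def replay_slide_example():
    """Send flits 0-6, acknowledge 0 and 1, report an error on flit 2."""
    buf = RetransmitBuffer()
    for i in range(7):
        buf.enqueue(f"flit{i}")
    while buf.transmit() is not None:
        pass
    buf.on_ack(0)
    buf.on_ack(1)
    buf.on_error(2)
    resent = []
    sent = buf.transmit()
    while sent is not None:
        resent.append(sent[0])
        sent = buf.transmit()
    return resent


if __name__ == "__main__":
    print(replay_slide_example())
```

Resending everything from the flit in error onward lets the receiver accept flits strictly in order, which is one plausible reason for retransmitting flits 3-6 as well.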

14
Channel Configuration

Reconfigure channels with frequent errors
Swap in spare bits
Reduce width of channel
Reduce bit rate
If malfunctions continue, decommission channel
Assumes routing algorithm will adapt

15
Cray BlackWidow Example

Each channel is 3 bits wide at 6.25 Gb/s per bit (b = 18.75 Gb/s)
3 bits serialized from a 24-bit flit
Link-level retry rates are monitored
Each retry is attributed to one bit of the channel
If the retry rate exceeds a threshold, the bad bit is switched off
Channel degrades to two bits, then one bit, then is switched off
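
The degradation policy described above can be sketched as a per-lane retry counter. Everything beyond the 3 lanes and 6.25 Gb/s per lane is an assumption made for illustration: the class name, the threshold value, and the idea of counting retries within a monitoring window are not taken from the BlackWidow design.

```python
class DegradableChannel:
    """Channel that switches off lanes with too many retries (illustrative)."""

    def __init__(self, lanes=3, gbps_per_lane=6.25, retry_threshold=3):
        self.active = list(range(lanes))
        self.gbps_per_lane = gbps_per_lane
        self.retry_threshold = retry_threshold  # hypothetical value
        self.retries = {lane: 0 for lane in self.active}

    @property
    def bandwidth_gbps(self):
        """Remaining bandwidth; 0.0 means the channel is switched off."""
        return len(self.active) * self.gbps_per_lane

    def record_retry(self, lane):
        """Attribute one link-level retry to a lane; disable it if over threshold."""
        if lane not in self.active:
            return
        self.retries[lane] += 1
        if self.retries[lane] > self.retry_threshold:
            self.active.remove(lane)

    def end_window(self):
        """Start a new monitoring window so that counts behave like a rate."""
        for lane in self.retries:
            self.retries[lane] = 0


if __name__ == "__main__":
    ch = DegradableChannel()
    print(ch.bandwidth_gbps)
    for _ in range(4):
        ch.record_retry(1)
    print(ch.active, ch.bandwidth_gbps)
```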

16
Router Error Control

What would happen if:
A header bit in the input buffer flips
The credit count is corrupted
The router picks the wrong output
The selected output flips mid-packet
Numerous failure modes inside the router
Many lead to catastrophic failure
Perhaps hundreds of cycles after the error occurred
Many others lead to insidious performance problems
E.g., losing credits

17
Router Error Control (2)

Same steps of Detect, Confine, Recover apply
Detect
Parity or CRC on all storage and communication
Quick consistency checks (e.g., on allocators and
credits)
Two copies of all other logic (in space or time)
Confine
Stop propagating faulty packets
Operate via confinement regions (e.g., channel)
Recover
Reset to known good state (sometimes via reset)
Resend faulted packets (if available)
Disable part of the router (fault-containment regions)
Replace part of the router (hot swapping)
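
Two of the detection ideas above, parity on storage and a quick consistency check on credits, can be sketched as follows. The function names are hypothetical, and the credit check assumes the usual credit-based flow control invariant: credits held upstream, flits in flight, flits buffered downstream, and credits in flight back upstream together account for every buffer slot.

```python
def parity(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") & 1


def store_header(header: int):
    """Write a header into a buffer entry together with its parity bit."""
    return header, parity(header)


def load_header(entry):
    """Read a header back, raising if a single-bit flip is detected."""
    header, stored_parity = entry
    if parity(header) != stored_parity:
        raise ValueError("parity error in buffered header")
    return header


def credits_consistent(credits_upstream, flits_in_flight, flits_buffered,
                       credits_in_flight, buffer_depth):
    """True if every downstream buffer slot is accounted for exactly once."""
    total = (credits_upstream + flits_in_flight
             + flits_buffered + credits_in_flight)
    return total == buffer_depth


if __name__ == "__main__":
    entry = store_header(0b1011)
    print(load_header(entry))
    print(credits_consistent(3, 1, 4, 0, buffer_depth=8))  # all slots accounted for
    print(credits_consistent(2, 1, 4, 0, buffer_depth=8))  # one credit lost
```

A lost credit shows up as a total below the buffer depth, which turns a silent performance problem into a detectable error.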

18
Network-Level Error Control

Model faulty routers and links as fail-stop components
Use adaptive routing to avoid them
Table-based: recompute tables periodically
Local adaptive: pick another minimal (or non-minimal) link
Need to avoid dead ends and deadlocks
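
The table-based option can be sketched as recomputing next-hop tables over the links that still work. This is a minimal illustration, not the lecture's algorithm: it uses a breadth-first search from the destination, the function name and the four-node ring are made up, and it does nothing about deadlock avoidance.

```python
from collections import deque


def build_next_hop_table(links, failed, dest):
    """BFS from dest over working links; returns {node: next hop toward dest}."""
    adjacency = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed:
            continue  # fail-stop model: a faulty link simply disappears
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    next_hop = {dest: dest}
    queue = deque([dest])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in next_hop:
                next_hop[neighbor] = node  # first visit gives a shortest path
                queue.append(neighbor)
    return next_hop


if __name__ == "__main__":
    ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
    print(build_next_hop_table(ring, failed=set(), dest=1))
    print(build_next_hop_table(ring, failed={(0, 1)}, dest=1))
```

Nodes that can no longer reach the destination are simply absent from the table, which is one way to notice a dead end before injecting a packet.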

19
End-To-End Error Control

Keep a copy of each packet at source until
acknowledged or timeout
(This buffer can get large)
If error detected
Drop packet
(Optionally) send a negative acknowledgement
When packet correctly received
Send positive acknowledgement
When acknowledgement received
Discard packet
When negative acknowledgement received (or
timeout)
Resend packet
May transmit the same packet multiple times
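
The end-to-end protocol above can be sketched with a source that keeps copies until they are acknowledged and a destination that acknowledges what it receives. The class names, the integer clock, and the timeout value are assumptions for illustration. Filtering duplicates by sequence number at the destination is also an addition of this sketch, included because the same packet may be transmitted more than once.

```python
class Source:
    """Keeps a copy of each packet until it is acknowledged (illustrative)."""

    def __init__(self, timeout=5):
        self.timeout = timeout
        self.pending = {}   # seq -> (packet, time last sent)
        self.next_seq = 0

    def send(self, packet, now):
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = (packet, now)
        return seq, packet

    def on_ack(self, seq):
        """Positive acknowledgement: discard the stored copy."""
        self.pending.pop(seq, None)

    def on_nack(self, seq, now):
        """Negative acknowledgement: resend if the copy is still held."""
        return self._resend(seq, now) if seq in self.pending else None

    def tick(self, now):
        """Resend every packet whose acknowledgement has timed out."""
        expired = [seq for seq, (_, sent) in self.pending.items()
                   if now - sent >= self.timeout]
        return [self._resend(seq, now) for seq in expired]

    def _resend(self, seq, now):
        packet, _ = self.pending[seq]
        self.pending[seq] = (packet, now)
        return seq, packet


class Destination:
    """Acknowledges good packets and delivers each sequence number once."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def on_packet(self, seq, packet, ok=True):
        if not ok:
            return "nack"          # error detected: drop the packet
        if seq not in self.seen:   # duplicates are acknowledged, not redelivered
            self.seen.add(seq)
            self.delivered.append(packet)
        return "ack"


if __name__ == "__main__":
    src, dst = Source(timeout=5), Destination()
    seq, pkt = src.send("hello", now=0)
    dst.on_packet(seq, pkt)              # delivered, but assume the ack is lost
    for rseq, rpkt in src.tick(now=5):   # timeout triggers a resend
        if dst.on_packet(rseq, rpkt) == "ack":
            src.on_ack(rseq)
    print(dst.delivered, src.pending)
```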

20
Question of the day

Consider a symmetric multiprocessing (SMP) network that does not allow packet loss and needs an availability of 0.99999
Link BER is 10^-15
Router components have a failure rate of 1000 FITs
How best can you achieve this reliability requirement?

21
Summary