Title: EE382C Lecture 14
1 EE382C Lecture 14
- Reliability and Error Control
- 5/17/11
2 Announcements
- Don't forget to iterate with us on your checkpoint 1 report
- Send time slot preferences for checkpoint 2
- Project presentations next week
- Let us know if you are OK with presenting on Tuesday, May 24th
3 Question of the day
- Consider a symmetric multiprocessing (SMP) network that does not allow packet loss and needs an availability of 0.99999
  - Link BER is 10^-15
  - Router components have a failure rate of 1000 FITs
- How best can you achieve this reliability requirement?
4 Reliability and Availability
- Reliability R(t)
  - Probability that the system is working at time t, given that it was working at time t0 and has had no failures in between
- Availability A(t)
  - Probability that the system is working when needed, at a given point in time t
  - Often affected by the repair process: A = MTBF / (MTBF + MTTR)
- MTBF: mean time between failures
- FIT: failures in time (failures per 10^9 hours). Inverse of MTBF with zero repair time
- MTTR: mean time to recovery
- RAS requirements: reliability, availability, and serviceability
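The availability formula and the FIT definition above can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, the 10-hour MTTR is an assumed value rather than one from the lecture, and one FIT is taken as one failure per 10^9 hours.

```python
FIT_HOURS = 1e9  # assumption: one FIT = one failure per 10^9 hours


def mtbf_hours_from_fits(fits):
    """Convert a failure rate in FITs to an MTBF in hours (zero repair time)."""
    return FIT_HOURS / fits


def availability(mtbf_hours, mttr_hours):
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr_hours(mtbf_hours, target_availability):
    """Largest MTTR that still meets the target, from solving A for MTTR."""
    return mtbf_hours * (1.0 - target_availability) / target_availability


# A single component rated at 1000 FITs, repaired in an assumed 10 hours.
mtbf = mtbf_hours_from_fits(1000)
print(f"MTBF = {mtbf:.0f} hours")
print(f"A with 10 h MTTR = {availability(mtbf, 10.0):.7f}")
print(f"MTTR budget for A = 0.99999: {max_mttr_hours(mtbf, 0.99999):.2f} hours")
```

For one such component, five-nines availability leaves a repair budget of roughly ten hours. A system built from many components has a proportionally shorter MTBF and therefore a much tighter budget.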
5 Examples of RAS Requirements
- Enterprise Server
  - A = 0.99999
  - System-level requirement
  - Can reflect to a network-level requirement, or detect and recover from network failures
  - In general, every packet must be correctly received or the system will fail
- Internet Router
  - A = 0.99999
  - But OK to drop packets (at a rate of 10^-15)
  - Turn failures into packet drops
6 RAS Requirements in Those Systems
- Dropping (reliability)
  - Allowed or not
  - Rate allowed (e.g., 10^-15)
- Availability (A)
  - 0.999 to 0.99999
- Serviceability (MTTR)
7 MTTF and MTTR
8 Failure Modes and Fault Models
Failure Mode                       Model       Rate     Units
Gaussian Noise on Channel          Transient   10^-20   BER
Alpha Particle Strike on Memory    Soft        10^-9    SER
Alpha Particle Strike on Logic     Transient   10^-10   BER
Electromigration                   Stuck-at    1        MTBF
Connector corrosion                Stuck-at    10       MTBF
Operator Removes Module            Fail-Stop   10^5     MTBF
Software Failure                   Fail-Stop   10^4     MTBF
9 An Analogy
10 The Bathtub Curve
11 Detection, Containment, and Recovery
- Three-step program for dealing with errors
- Detection: discover the error
  - CRC codes on channels
  - Parity or ECC codes on memories
  - Self-checking logic
- Contain: prevent the error from propagating further
  - Mask it
  - Drop the packet (and retry)
  - Fail stop
- Recover: resume normal service
  - Return to a known state
  - Resume sending traffic
  - Possibly resend faulted packet
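To make "ECC codes on memories" and "mask it" concrete, here is a small sketch of a single-error-correcting code. The lecture does not name a particular code; Hamming(7,4) is chosen here only because it is the smallest textbook example, and the function names are illustrative.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7).

    Parity bits sit at positions 1, 2 and 4; data bits at 3, 5, 6 and 7.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data bits, corrected position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1  # mask the error by correcting it in place
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # simulate a soft error at position 5
recovered, position = hamming74_decode(stored)
print(recovered, position)
```

A single flipped bit is detected and masked in one step, so the consumer of the data never sees the fault. Two flipped bits would be miscorrected by this code, which is why memories typically add one more parity bit to detect double errors.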
12 Example: Link-Level Error Control
- Detection: CRC on channel
- Containment: drop packet with error
- Recovery: request retransmission and resume normal sequence
- How can this fail? How to fix it?
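The detect, drop, and retransmit sequence above can be sketched in Python. This is an illustrative model, not the lecture's implementation: it uses the standard-library `zlib.crc32` as the CRC, and `faulty_once_channel`, `send_with_retry`, and the three-attempt limit are hypothetical names and values.

```python
import zlib


def add_crc(payload: bytes) -> bytes:
    """Append a 32-bit CRC to the payload before it goes on the channel."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_crc(frame: bytes):
    """Return the payload if the CRC matches, otherwise None (drop it)."""
    payload, received_crc = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == received_crc:
        return payload
    return None


def flip_bit(frame: bytes, bit_index: int) -> bytes:
    """Model a transient channel error by inverting one bit."""
    corrupted = bytearray(frame)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def faulty_once_channel():
    """Hypothetical channel that corrupts only the first transmission."""
    sent = {"count": 0}

    def channel(frame: bytes) -> bytes:
        sent["count"] += 1
        return flip_bit(frame, 5) if sent["count"] == 1 else frame

    return channel


def send_with_retry(payload: bytes, channel, max_attempts: int = 3):
    """Retransmit until the receiver's CRC check passes."""
    frame = add_crc(payload)
    for attempt in range(1, max_attempts + 1):
        received = check_crc(channel(frame))
        if received is not None:
            return received, attempt
    raise RuntimeError("link failed: retry limit reached")


result = send_with_retry(b"flit payload", faulty_once_channel())
print(result)
```

One way this scheme can fail is if the retransmission request itself is corrupted or lost, which is one reason practical links also protect and sequence their control messages.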
13 Link-Level Error Control (2)
Flit 2 was in error. Flits 2-6 are retransmitted. Why would you want to retransmit flits 3-6?
- Pointers
  - Ack: next flit to be ACKed
  - Tx: next flit to be transmitted
  - Tail: next free slot
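The three pointers can be modeled as a small sender-side buffer. The sketch below assumes a go-back-N style retry in which an error rewinds Tx to Ack; the class and method names are hypothetical, and the pointers are kept as ever-increasing counters that are reduced modulo the capacity when indexing.

```python
class RetransmitBuffer:
    """Sender-side flit buffer with the Ack, Tx and Tail pointers."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.ack = 0   # next flit to be ACKed (oldest unacknowledged flit)
        self.tx = 0    # next flit to be transmitted
        self.tail = 0  # next free slot

    def enqueue(self, flit):
        """Store a new flit; refuse it if every slot still awaits an ACK."""
        if self.tail - self.ack == self.capacity:
            return False
        self.slots[self.tail % self.capacity] = flit
        self.tail += 1
        return True

    def transmit(self):
        """Send the flit at Tx, or return None when nothing is waiting."""
        if self.tx == self.tail:
            return None
        flit = self.slots[self.tx % self.capacity]
        self.tx += 1
        return flit

    def on_ack(self):
        """The oldest outstanding flit arrived intact; its slot is free."""
        if self.ack < self.tx:
            self.ack += 1

    def on_error(self):
        """The flit at Ack was bad: rewind Tx so it and its successors resend."""
        self.tx = self.ack


buf = RetransmitBuffer(capacity=8)
for n in range(1, 7):
    buf.enqueue(f"flit {n}")
first_pass = [buf.transmit() for _ in range(6)]
buf.on_ack()    # flit 1 was received correctly
buf.on_error()  # flit 2 failed its check
resent = []
flit = buf.transmit()
while flit is not None:
    resent.append(flit)
    flit = buf.transmit()
print(resent)
```

Rewinding Tx resends flits 3-6 as well as flit 2. Under the assumption made here, the receiver discards everything after the bad flit, so resending the whole tail keeps flits in order without a reorder buffer at the receiver.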
14 Channel Configuration
- Reconfigure channels with frequent errors
  - Swap in spare bits
  - Reduce width of channel
  - Reduce bit rate
- If malfunctions continue, decommission channel
  - Assumes routing algorithm will adapt
15 Cray BlackWidow Example
- Each channel is 3 bits wide at 6.25 Gb/s per bit (b = 18.75 Gb/s)
  - 3 bits serialized from 24-bit flit
- Link-level retry rates monitored
  - Each retry attributed to one bit of the channel
  - If retry rate exceeds a threshold, the bad bit is switched off
- Channel degrades to two bits, then one bit, then is switched off
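The degradation policy described above can be sketched as a per-lane retry counter. This is a simplified model, not the BlackWidow hardware: the threshold of 4 is an assumed value, a plain count stands in for a rate measured over a time window, and `DegradableChannel` is a hypothetical name.

```python
class DegradableChannel:
    """Channel that switches off lanes (bits) blamed for too many retries."""

    def __init__(self, lanes=3, gbps_per_lane=6.25, retry_threshold=4):
        self.active = [True] * lanes
        self.retries = [0] * lanes
        self.gbps_per_lane = gbps_per_lane
        self.retry_threshold = retry_threshold  # assumed value, not from the slides

    def record_retry(self, lane):
        """Attribute one link-level retry to a lane; disable it past threshold."""
        if not self.active[lane]:
            return
        self.retries[lane] += 1
        if self.retries[lane] > self.retry_threshold:
            self.active[lane] = False

    @property
    def width(self):
        """Number of lanes still in service."""
        return sum(self.active)

    def bandwidth_gbps(self):
        return self.width * self.gbps_per_lane

    def cycles_per_flit(self, flit_bits=24):
        """Bit times needed to serialize one flit over the remaining lanes."""
        if self.width == 0:
            raise ValueError("channel is switched off")
        return -(-flit_bits // self.width)  # ceiling division


channel = DegradableChannel()
for _ in range(5):
    channel.record_retry(lane=1)
print(channel.width, channel.bandwidth_gbps(), channel.cycles_per_flit())
```

With one lane disabled, the same 24-bit flit takes 12 bit times instead of 8, so the channel keeps working at two thirds of its original bandwidth.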
16 Router Error Control
- What would happen if
  - Header bit in input buffer flips
  - Credit count is corrupted
  - Router picks wrong output
  - Selected output flips mid-packet
  - ...
- Numerous failure modes inside the router
- Many lead to catastrophic failure
  - Perhaps hundreds of cycles after the error occurred
- Many others lead to insidious performance problems
  - E.g., losing credits
17 Router Error Control (2)
- Same steps of Detect, Confine, Recover apply
- Detect
  - Parity or CRC on all storage and communication
  - Quick consistency checks (e.g., on allocators and credits)
  - Two copies of all other logic (in space or time)
- Confine
  - Stop propagating faulty packets
  - Operate via confinement regions (e.g., channel)
- Recover
  - Reset to known good state (sometimes via reset)
  - Resend faulted packets (if available)
  - Disable part of the router (fault-containment regions)
  - Replace part of the router (hot swapping)
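Two of the detection mechanisms listed above are easy to illustrate: parity on a stored header, and a consistency check on credits. The sketch below is an assumption-laden model rather than a router design. The names are hypothetical, and the credit check relies on the usual invariant of credit-based flow control that credits held upstream, flits in flight, flits buffered downstream, and credits in flight add up to the buffer depth.

```python
class ParityError(Exception):
    """Raised when a stored word no longer matches its parity bit."""


def parity(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") & 1


def write_entry(word: int):
    """Store a word together with its parity bit."""
    return (word, parity(word))


def read_entry(entry):
    """Return the stored word, or raise if a single-bit flip is detected."""
    word, stored_parity = entry
    if parity(word) != stored_parity:
        raise ParityError(f"parity mismatch on word {word:#x}")
    return word


def credits_consistent(upstream_credits, flits_in_flight, flits_buffered,
                       credits_in_flight, buffer_depth):
    """Every buffer slot must be accounted for exactly once."""
    total = (upstream_credits + flits_in_flight
             + flits_buffered + credits_in_flight)
    return total == buffer_depth


header = write_entry(0b10110010)
flipped = (header[0] ^ 0b1000, header[1])  # one header bit flips in the buffer
try:
    read_entry(flipped)
    header_fault_detected = False
except ParityError:
    header_fault_detected = True

healthy = credits_consistent(3, 1, 3, 1, buffer_depth=8)
after_lost_credit = credits_consistent(2, 1, 3, 1, buffer_depth=8)
print(header_fault_detected, healthy, after_lost_credit)
```

The credit check catches the "insidious" case from the previous slide: a lost credit does not crash anything, it just permanently shrinks the usable buffer until something notices the sum is off.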
18 Network-Level Error Control
- Model faulty routers and links as fail-stop components
- Use adaptive routing to avoid them
  - Table-based: recompute tables periodically
  - Local adaptive: pick another minimal link (or non-minimal)
- Need to avoid dead ends and deadlocks
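The table-based option can be sketched as a periodic recomputation of next hops that simply leaves failed components out of the graph. This is a minimal illustration using breadth-first search on a hypothetical four-router ring; the function name and topology are assumptions, and the sketch does not address deadlock avoidance.

```python
from collections import deque


def next_hop_table(links, source, failed_links=(), failed_routers=()):
    """Map each reachable destination to the first hop from source.

    Failed links and routers are treated as fail-stop: they are removed
    from the graph before shortest paths are computed.
    """
    failed_links = {frozenset(link) for link in failed_links}
    failed_routers = set(failed_routers)
    adjacency = {}
    for a, b in links:
        if frozenset((a, b)) in failed_links:
            continue
        if a in failed_routers or b in failed_routers:
            continue
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    table = {}
    visited = {source}
    queue = deque()
    for neighbor in adjacency.get(source, []):
        visited.add(neighbor)
        table[neighbor] = neighbor
        queue.append(neighbor)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                table[neighbor] = table[node]  # inherit the first hop
                queue.append(neighbor)
    return table


ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
healthy_table = next_hop_table(ring, source=0)
rerouted_table = next_hop_table(ring, source=0, failed_links=[(0, 1)])
print(healthy_table)
print(rerouted_table)
```

A destination missing from the returned table is unreachable from the source, which is the dead-end case the slide warns about. Whether the recomputed routes remain deadlock-free depends on the routing restrictions of the network and has to be checked separately.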
19 End-To-End Error Control
- Keep a copy of each packet at source until acknowledged or timeout
  - (This buffer can get large)
- If error detected
  - Drop packet
  - (Optionally) send a negative acknowledgement
- When packet correctly received
  - Send positive acknowledgement
- When acknowledgement received
  - Discard packet
- When negative acknowledgement received (or timeout)
  - Resend packet
  - May transmit the same packet multiple times
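The end-to-end rules above map onto a small source and destination pair. The sketch below is illustrative only: the class names, the sequence-number scheme, and the timeout of 10 time units are assumptions. Because the same packet may arrive more than once, the destination here ignores duplicates by sequence number.

```python
class Source:
    """Keeps a copy of every packet until it is acknowledged."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.pending = {}  # sequence number -> (packet, time last sent)
        self.next_seq = 0

    def send(self, packet, now):
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = (packet, now)
        return seq, packet

    def on_ack(self, seq):
        """Positive acknowledgement: the stored copy can be discarded."""
        self.pending.pop(seq, None)

    def on_nack(self, seq, now):
        """Negative acknowledgement: resend if a copy is still held."""
        if seq not in self.pending:
            return None
        return self._resend(seq, now)

    def poll_timeouts(self, now):
        """Resend every packet that has waited at least `timeout`."""
        expired = [seq for seq, (_, sent) in self.pending.items()
                   if now - sent >= self.timeout]
        return [self._resend(seq, now) for seq in expired]

    def _resend(self, seq, now):
        packet, _ = self.pending[seq]
        self.pending[seq] = (packet, now)
        return seq, packet


class Destination:
    """Drops corrupted packets and ignores duplicates."""

    def __init__(self):
        self.delivered = {}

    def receive(self, seq, packet, corrupted=False):
        if corrupted:
            return ("nack", seq)  # drop and (optionally) report
        self.delivered.setdefault(seq, packet)  # a duplicate changes nothing
        return ("ack", seq)


src, dst = Source(timeout=10), Destination()
seq, pkt = src.send(b"A", now=0)
first_reply = dst.receive(seq, pkt, corrupted=True)
seq, pkt = src.on_nack(seq, now=3)
second_reply = dst.receive(seq, pkt)
src.on_ack(seq)

lost_seq, _ = src.send(b"B", now=5)  # this packet is lost in the network
too_early = src.poll_timeouts(now=14)
resent = src.poll_timeouts(now=15)
print(first_reply, second_reply, too_early, resent)
```

The `pending` dictionary is the buffer the slide warns can get large: it must hold every packet sent within one round trip plus the timeout.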
20 Question of the day
- Consider a symmetric multiprocessing (SMP) network that does not allow packet loss and needs an availability of 0.99999
  - Link BER is 10^-15
  - Router components have a failure rate of 1000 FITs
- How best can you achieve this reliability requirement?
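A rough calculation helps size the problem in this question. The sketch below is not the lecture's answer. It assumes a hypothetical system of 1000 links and 1000 router components, assumes the 1000 FITs figure applies to each component, borrows the 18.75 Gb/s channel bandwidth from the BlackWidow example, and takes one FIT as one failure per 10^9 hours.

```python
FIT_HOURS = 1e9  # assumption: one FIT = one failure per 10^9 hours


def seconds_between_bit_errors(ber, bits_per_second, num_links=1):
    """Mean time between bit errors across num_links identical links."""
    return 1.0 / (ber * bits_per_second * num_links)


def system_mtbf_hours(fits_per_component, num_components):
    """Components in series: failure rates add, so MTBF shrinks with count."""
    return FIT_HOURS / (fits_per_component * num_components)


def max_mttr_seconds(mtbf_hours, target_availability):
    """Solve A = MTBF / (MTBF + MTTR) for MTTR, in seconds."""
    return 3600.0 * mtbf_hours * (1.0 - target_availability) / target_availability


# Assumed system size (hypothetical): 1000 links, 1000 router components.
link_error_gap_s = seconds_between_bit_errors(1e-15, 18.75e9, num_links=1000)
mtbf_h = system_mtbf_hours(1000, 1000)
mttr_budget_s = max_mttr_seconds(mtbf_h, 0.99999)
print(f"a bit error somewhere about every {link_error_gap_s:.0f} s")
print(f"system MTBF about {mtbf_h:.0f} h, MTTR budget about {mttr_budget_s:.0f} s")
```

Under these assumptions a bit error occurs somewhere in the network roughly once a minute, so a no-loss requirement points toward link-level or end-to-end retry. A component failure roughly every 1000 hours with a repair budget of about half a minute points toward automatic recovery, such as routing around the fault, rather than manual repair.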
21 Summary
- Specification sets reliability requirements
  - Drop rate
  - Availability
- Failures are abstracted with fault models
  - Bit errors, soft errors, stuck-at, fail stop
- Detection, Containment, and Recovery
  - Link-level
    - Ack and retransmit
    - Reconfigure
  - Router level
    - Detect all failures
    - Mask, retry, or reset
  - Network level
    - Route around faulty components
  - End-to-End
    - Retransmit on nack or timeout