Title: Logic Soft Errors: Design and CAD Challenges
1. Logic Soft Errors: Design and CAD Challenges
- Subhasish Mitra
- Tanay Karnik
- Norbert Seifert
- Ming Zhang
- Intel Corporation
- Acknowledgment: Kee Sup Kim, Jose Maiz, TM Mak, Quan Shi
2. Outline
- Introduction
- Logic soft error modeling challenges
- Logic soft error protection
- Conclusion
3. What Are Soft Errors?
- Transient errors
- Single Event Upsets, SEU, SER
- Logic value changed (0 → 1, 1 → 0)
- Memory, flip-flops, combinational logic
- Causes
- Alpha particles from packaging
- Neutrons from cosmic rays
4. Logic Soft Errors
- Soft errors affecting
- Flip-flops
- Latches
- Combinational logic
- Memory soft errors
- Well understood
- ECC
[Figure: soft error rate contributions from unprotected memory, flip-flops, and static combinational logic]
5. System Effects of Logic Soft Errors
- Benign
- e.g., errors in dead instructions
- e.g., soft errors in network packets
- Protocol-assisted correction: retransmit
- Silent data corruption
- Undetected incorrect results
- e.g., 20,000 interpreted as 3,616
- Detected but uncorrected: recovery required
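The silent data corruption example above is consistent with a single flipped bit, since 20,000 and 3,616 differ by exactly 16,384 = 2^14. A minimal Python sketch, assuming the value is held as an unsigned integer and that exactly one bit (bit 14) is upset; the helper name flip_bit is hypothetical and not taken from the slides:

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit position inverted (a modeled soft error)."""
    return value ^ (1 << bit)


original = 20_000
corrupted = flip_bit(original, 14)  # upset bit 14, which has weight 16,384

# Show both values as 16-bit patterns, then the corrupted value in decimal.
print(f"{original:016b} -> {corrupted:016b}")
print(corrupted)  # 3616
```

Nothing in the corrupted value marks it as wrong, which is why such an error stays silent unless some form of redundancy is checked.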
6. Logic Soft Errors and Moore's Law
- Soft error rate per unprotected chip
- Roughly doubles every generation
- Reason: transistor count doubles
[Figure: undetected soft error rate per chip versus technology generation, with the logic curve plotted against the enterprise undetected soft error rate goal; annotation: FLIP-FLOP PROTECTION REQUIRED]
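The doubling trend on this slide can be turned into a rough projection. The sketch below assumes an exact factor of 2 per generation and uses arbitrary, hypothetical rate units; the function names and the numbers in the example are illustrative and do not come from the slides:

```python
def relative_ser(generations: int) -> float:
    """Unprotected chip-level soft error rate relative to generation 0,
    assuming it doubles with every technology generation."""
    return 2.0 ** generations


def generations_until_goal_exceeded(current_ser: float, goal_ser: float) -> int:
    """Smallest number of generations after which the projected rate exceeds the goal."""
    if current_ser <= 0:
        raise ValueError("current_ser must be positive")
    n = 0
    while current_ser * relative_ser(n) <= goal_ser:
        n += 1
    return n


# Hypothetical numbers: today's rate is 1 unit, the goal is 10 units.
print(generations_until_goal_exceeded(1.0, 10.0))  # 4
```

With these made-up inputs the goal is crossed at the fourth generation (16 units against a goal of 10), which illustrates why a fixed goal is eventually exceeded without added protection.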
7. Outline
- Introduction
- Logic soft error modeling challenges
- Latches, flip-flops: well understood
- Combinational logic: OPEN
- System-level effects: OPEN
- Logic soft error protection
- Conclusion
8. Soft Errors in Combinational Logic
[Figure: combinational logic example with inputs A = 1, S = 1, B = 1, internal nodes at 1, and OUT = 1]
9. Soft Errors in Combinational Logic
[Figure: the same example with A = 0 and its internal node at 0; S = 1, B = 1, and OUT = 1]
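In the two slides above, A changes from 1 to 0 while OUT stays at 1, which is the behavior usually called logical masking. The gate-level circuit is not recoverable from the slide text, so the sketch below assumes a hypothetical 2-to-1 multiplexer purely for illustration; mux and is_masked are made-up names:

```python
def mux(a: int, b: int, s: int) -> int:
    """Hypothetical 2-to-1 multiplexer: selects b when s is 1, otherwise a."""
    return b if s else a


def is_masked(circuit, inputs: dict, flipped: str) -> bool:
    """True if inverting one input leaves the circuit output unchanged."""
    faulty = dict(inputs)
    faulty[flipped] ^= 1  # model the upset as a single inverted input
    return circuit(**inputs) == circuit(**faulty)


inputs = {"a": 1, "b": 1, "s": 1}
for name in inputs:
    status = "masked" if is_masked(mux, inputs, name) else "propagates"
    print(name, status)
```

Under this assumed circuit, an upset on a is masked because s selects b, whereas an upset on b reaches the output.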
10. Soft Errors in Combinational Logic
[Figure: the example circuit (A = 0, S = 1, B = 1) feeding the D input of a flip-flop (D, Q, CK), with the setup time marked]
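The setup time marked on this slide suggests a timing condition: a transient at the D input matters only if it is present when the flip-flop samples. A simplified sketch, assuming the transient is captured whenever it overlaps a window around the clock edge; the hold_time parameter and all numbers are assumptions added for illustration, and real capture also depends on pulse width and amplitude:

```python
def is_latched(pulse_start: float, pulse_end: float, clock_edge: float,
               setup_time: float, hold_time: float = 0.0) -> bool:
    """True if the transient [pulse_start, pulse_end] overlaps the sampling
    window [clock_edge - setup_time, clock_edge + hold_time]."""
    window_start = clock_edge - setup_time
    window_end = clock_edge + hold_time
    return pulse_start <= window_end and pulse_end >= window_start


# Hypothetical timing in picoseconds: clock edge at 1000, setup time 50.
print(is_latched(900, 940, clock_edge=1000, setup_time=50))  # False
print(is_latched(930, 980, clock_edge=1000, setup_time=50))  # True
```

The first pulse dies out before the window opens at 950 ps and is not captured; the second overlaps the window and can be stored as a wrong value.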
11. Architectural Vulnerability Factor (AVF), aka Logic Derating
- Probability that a given soft error event is BENIGN
- No impact on architectural state
- Data registers
- Average active variable occupancy
- Pipeline flip-flops
- Average committed instruction occupancy
MAJOR CHALLENGE: AUTOMATED AVF ESTIMATION
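The occupancy idea on this slide can be sketched for a single data register. The code assumes that an upset is benign exactly when it strikes during a cycle in which the register holds no live value, and that upsets are uniformly distributed in time; the interval format and function names are hypothetical:

```python
def occupancy(live_intervals, total_cycles: int) -> float:
    """Fraction of cycles in which the register holds a value that is still needed.

    live_intervals: non-overlapping (start_cycle, end_cycle) pairs, end exclusive.
    """
    live_cycles = sum(end - start for start, end in live_intervals)
    return live_cycles / total_cycles


def benign_probability(live_intervals, total_cycles: int) -> float:
    """Probability that a uniformly timed upset hits a dead (unneeded) value."""
    return 1.0 - occupancy(live_intervals, total_cycles)


# Hypothetical trace: the register is live for 350 of 1000 cycles.
intervals = [(0, 100), (400, 650)]
print(occupancy(intervals, 1000))  # 0.35
```

Extracting the live intervals for every register and pipeline flip-flop in a real design is the step that this sketch takes as given.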
12. Outline
- Introduction
- Logic soft error modeling challenges
- Logic soft error protection
- Conclusion
13. Error Protection: Low-Hanging Fruit
- Selective node engineering
- Increased node capacitance
- Forward body bias
- Soft error rate reduction: 40%
- Costs
- Power: none to 3%
- Performance: none
- Area: 0.8% to 3%
[Figure: selective node engineering data]
14. Major Error Protection Techniques
- Circuit hardening: circuit design literature
- Redundancy: fault-tolerance literature
- Duplication, parity codes
- Multi-threading
- Software techniques (SIHFT)
- Built-In Soft Error Resilience: new technique
- Soft error rate reduction: at least 20X
- Costs: discussed later
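Of the redundancy techniques listed above, parity is the simplest to illustrate. A minimal sketch, assuming one even-parity bit stored alongside each word; the function names are hypothetical, and the last case shows the known limitation that an even number of flips goes undetected:

```python
def parity(word: int) -> int:
    """Even-parity bit: 1 if the word contains an odd number of ones."""
    return bin(word).count("1") % 2


def check(word: int, stored_parity: int) -> bool:
    """True if the word still agrees with the parity bit stored with it."""
    return parity(word) == stored_parity


word = 0b10110010
stored = parity(word)                 # computed when the word is written

single_flip = word ^ (1 << 3)         # one upset bit
double_flip = word ^ 0b11             # two upset bits

print(check(word, stored))            # True: no error
print(check(single_flip, stored))     # False: error detected
print(check(double_flip, stored))     # True: error escapes detection
```

Parity only detects the error; correcting it needs a recovery mechanism or a stronger code.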
15. Built-In Soft Error Resilience (BISER)
- Soft error resilience
- Detect flip-flop soft error
- Error trapping: not covered today
- Correct flip-flop soft error
- Error blocking: today's focus
- Hardware reuse paradigm
- Existing on-chip resources reused
- e.g., Scan Testability structures
Ref: IEEE Computer, Feb. 2005
16. Scan Flip-flop Design (ITJ 2004)
[Figure: scan flip-flop schematic showing the scan portion and the system flip-flop. Labeled signals: Scan Data, Scan Clock A, Scan Clock B, Transfer, Update, Scan Output, System Data, System Clock, System Output.]
17. Error Resilient Mode
[Figure: flip-flop schematic in error resilient mode, showing the scan / checking flip-flop and the system flip-flop. Labeled signals: Scan Data, Scan Clock A, Scan Clock B, Capture = 1, Update, Scan / Duplex Output, System Data, System Clock, System Output.]
18. C-element
- Extensive use in asynchronous circuit design
[Figure: C-element transistor schematic with inputs A and B and output OUT, between Vdd and Gnd]
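The behavior of a C-element can be summarized in a few lines. This sketch models the textbook non-inverting Muller C-element, in which the output follows the inputs when they agree and holds its previous value otherwise; the transistor-level cell on the slide may be an inverting variant, which is not modeled here:

```python
def c_element(a: int, b: int, previous_out: int) -> int:
    """Muller C-element: follow the inputs when they agree, otherwise hold."""
    return a if a == b else previous_out


# Print the full behavior table: inputs, previous output, new output.
for previous_out in (0, 1):
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, previous_out, "->", c_element(a, b, previous_out))
```

The hold behavior on disagreeing inputs is the property that makes the element useful for comparing two redundant copies of a value.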
19. Error Blocking Design
[Figure: error blocking schematic showing the scan / checking flip-flop, the system flip-flop, a C-element, and a keeper. Labeled signals: Scan Data, Scan Clock A, Scan Clock B, Capture = 1, Update, Scan Output, System Data, System Clock, System Output.]
20. Error Blocking Operation
[Figure: error blocking operation, showing example logic values on the scan / checking portion, the system portion, the C-element, and the keeper at System Output; annotation: Error Blocked]
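Error blocking can be sketched behaviorally: two redundant storage copies feed a C-element whose output is held by a keeper while the copies disagree. The class below is a simplified model under that assumption, with signal polarity and clocking details abstracted away; the class and method names are hypothetical:

```python
class ErrorBlockingFlipFlop:
    """Behavioral model: system and checking copies drive a C-element with a keeper."""

    def __init__(self) -> None:
        self.system_copy = 0
        self.checking_copy = 0
        self.output = 0  # value held by the keeper

    def clock(self, data: int) -> None:
        """Both copies capture the same data on a clock edge."""
        self.system_copy = data
        self.checking_copy = data
        self._update_output()

    def upset(self, which: str) -> None:
        """Inject a soft error into one copy ('system' or 'checking')."""
        if which == "system":
            self.system_copy ^= 1
        elif which == "checking":
            self.checking_copy ^= 1
        else:
            raise ValueError("which must be 'system' or 'checking'")
        self._update_output()

    def _update_output(self) -> None:
        # The C-element passes the value only when both copies agree;
        # otherwise the keeper holds the previous output.
        if self.system_copy == self.checking_copy:
            self.output = self.system_copy


ff = ErrorBlockingFlipFlop()
ff.clock(1)
ff.upset("system")   # single upset: the copies now disagree
print(ff.output)     # 1, the error is blocked
ff.clock(0)          # the next clock edge overwrites both copies
print(ff.output)     # 0
```

In this model an upset that hits both copies before the next clock edge would not be blocked, because the copies agree again on the wrong value.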
21. Economy Mode: Core Reuse Enabled
[Figure: flip-flop schematic in economy mode, showing the scan / checking flip-flop and the system flip-flop. Labeled signals: Scan Data, Scan Clock A, Scan Clock B = 1, Capture = 0, Update, Scan / Duplex Output, System Data, System Clock, System Output.]
22. Error Blocking Characterization Results
- Flip-flop soft error rate reduction: at least 20X
- Chip-level analysis: 25% of flip-flops protected
- Selected by fault injection
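The effect of protecting only selected flip-flops can be estimated with a simple model. The sketch assumes that fault injection has produced a per-flip-flop contribution to the chip-level error rate, that the largest contributors are protected first, and that each protected flip-flop improves by a fixed factor; the function name, the fraction, and the contribution values are all hypothetical:

```python
def chip_ser_after_protection(contributions, protected_fraction: float,
                              reduction: float = 20.0) -> float:
    """Residual error rate when the top-contributing flip-flops are protected.

    contributions: per-flip-flop error rate contributions (arbitrary units).
    protected_fraction: share of flip-flops that receive protection.
    reduction: improvement factor for each protected flip-flop.
    """
    ranked = sorted(contributions, reverse=True)
    n_protected = round(len(ranked) * protected_fraction)
    protected, unprotected = ranked[:n_protected], ranked[n_protected:]
    return sum(protected) / reduction + sum(unprotected)


# Hypothetical contributions for eight flip-flops, summing to 100 units.
contributions = [40, 30, 10, 10, 5, 3, 1, 1]
print(chip_ser_after_protection(contributions, protected_fraction=0.25))  # 33.5
```

With this made-up distribution, protecting the two largest contributors cuts the total from 100 to 33.5 units, which shows why the selection step matters when contributions are skewed.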
23. Comparison with Classical Techniques
24. Outline
- Introduction
- Logic soft error modeling challenges
- Logic soft error protection
- Conclusion
25. Conclusion
- Logic soft errors: major challenge
- Built-In Soft Error Resilience (BISER)
- Effective and practical
- Future challenge: automation
- Combinational logic soft error models
- System-level impact estimation
- Selective BISER insertion
- Maximize protection, Minimize power
26. Thank You!
27. Comparison with Classical Techniques
28. Backup
29. Error Trapping Design
[Figure: error trapping schematic showing the scan / checking flip-flop, the system flip-flop, and XOR gates. Labeled signals: Scan Data, Scan Clock A, Scan Clock B, Capture, Update, Scan Output, System Data, System Clock, System Output.]
30. Trapped Error Signal Observation
- Shift out using existing slow scan path
- At fixed intervals
- e.g., recovery checkpoint
- e.g., transaction commit point
- No additional global interconnect
- Cost: error detection latency
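Because trapped errors are observed only at fixed intervals, detection is delayed until the next observation point. A minimal sketch of that latency, assuming observation points fall on exact multiples of the interval and ignoring the time needed to shift the scan chain out; the function name and the cycle counts are hypothetical:

```python
def detection_latency(error_cycle: int, interval: int) -> int:
    """Cycles between an error being trapped and the next scheduled observation."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    # Round the error cycle up to the next multiple of the interval.
    next_observation = -(-error_cycle // interval) * interval
    return next_observation - error_cycle


# Hypothetical numbers: observation every 100 cycles.
print(detection_latency(130, 100))  # 70
print(detection_latency(101, 100))  # 99
```

In this model the worst case is one cycle short of the full interval, so a shorter interval lowers the latency at the price of more frequent shift-out.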