Title: Dealing with Multiple Simultaneous Faults in Future Technologies
1Dealing with Multiple Simultaneous Faults in
Future Technologies
- PhD candidate: Carlos Arthur Lang Lisbôa
- Advisor: Luigi Carro
2Why Multiple Simultaneous Faults ?
- Future technologies (2010 and beyond)
- very small transistors and fewer electrons to
form the channel (→ SETs)
- transient pulses due to radiation attack will
last longer than the propagation delays of gates
and cycle times
- devices will be more sensitive to the effects of
electromagnetic noise, neutrons and alpha
particles
3Single Event Upset Origin
[Figure: bit patterns illustrating the origin of a single event upset]
4Why Should One Study Multiple Faults ?
- Changes in paradigm
- Gates will behave statistically, producing
correct outputs only a fraction of the time
- Faster devices → cycle times shorter than
duration of transient pulses
6How to Deal with Multiple Faults ?
- New paradigm: multiple simultaneous faults
- new fault tolerance techniques will be required
(TMR will no longer provide enough protection)
- How to deal with this problem ?
- new materials and manufacturing technologies must
be developed
- OR
- new design approaches must be taken
7How to Deal with Multiple Faults ?
- New paradigm: multiple simultaneous faults
- new fault tolerance techniques will be required
(TMR will no longer provide enough protection)
- How to deal with this problem ?
- new design approaches must be taken (our bet !)
8Research Evolution - Overview
- 2004 / 2005
- Stochastic Operators: IOLTS 04
- Bit Stream Operators: DFT 04, WDES 04
- TMR and Analog Voter: ETS 05, SBCCI 05
- Research Report: SRC 2005 TechCon
- 2006 / 2007
- Research Report: DATE 06 PhD Forum
- MemProc: LATW 06, ETS 06
- Majority Logic: DFT 06
- Online Hardening: DFT 06
- Low cost redundancy
- Statistical Computation: VTS 07 (submitted)
9Published Papers
- Lisbôa, C. and Carro, L., Arithmetic Operators
Robust to Multiple Simultaneous Upsets, 10th
IEEE International Online Test Symposium - IOLTS
2004, IEEE Computer Society, Funchal, Madeira
Island, Portugal, July 2004.
- Lisbôa, C. and Carro, L., Highly Reliable
Arithmetic Multipliers for Future Technologies,
in Proceedings of the International Workshop on
Dependable Embedded Systems - WDES 2004 - in
conjunction with the 23rd International Symposium
on Reliable Distributed Systems - SRDS 2004, pp.
13-18. Edited by Becker, L. B. and Kaiser, J.,
Florianópolis, October 17, 2004.
- Lisbôa, C. and Carro, L., Arithmetic Operators
Robust to Multiple Simultaneous Upsets, in
Proceedings of the 19th IEEE International
Symposium on Defect and Fault Tolerance in VLSI
Systems - DFT 2004, pp. 289-297,
ISBN 0-7695-2241-6. IEEE Computer Society, New
York, October 2004.
10Published Papers
- Lisbôa, C. A. L., Carro, L. and Cota, E., RobOps
- Arithmetic Operators for Future Technologies,
10th European Test Symposium - ETS 2005, Tallinn,
Estonia, May 2005.
- Lisbôa, C. A. L., Schüler, E. and Carro, L.,
Going Beyond TMR for Protection Against Multiple
Faults, in Proceedings of the 18th Symposium on
Integrated Circuits and Systems Design - SBCCI
2005, September 2005.
- Rhod, E., Lisbôa, C. A. L. and Carro, L., Using
Memory to Cope with Simultaneous Transient
Faults, in Proceedings of the 7th Latin-American
Test Workshop - LATW 2006, pp. 151-156, IEEE
Computer Society, New York, March 2006.
11Published Papers
- Rhod, E., Lisbôa, C. A. L., Michels, Á. and
Carro, L., Fault Tolerance Against Multiple SEUs
using Memory-Based Circuits to Improve the
Architectural Vulnerability Factor, in Informal
Digest of Papers of the 11th IEEE European Test
Symposium - ETS 2006, pp. 229-234, IEEE Computer
Society, New York, May 2006.
- Michels, Á., Petroli, L., Lisbôa, C. A. L.,
Kastensmidt, F. and Carro, L., SET Fault Tolerant
Combinational Circuits Based on Majority Logic,
in Proceedings of the 21st IEEE International
Symposium on Defect and Fault Tolerance in VLSI
Systems - DFT 2006, pp. 345-352, IEEE Computer
Society, Los Alamitos, CA, October 2006.
- Lisbôa, C. A. L., Carro, L., Sonza Reorda, M.,
and Violante, M., Online Hardening of Programs
against SEUs and SETs, in Proceedings of the
21st IEEE International Symposium on Defect and
Fault Tolerance in VLSI Systems - DFT 2006, pp.
280-288, IEEE Computer Society, Los Alamitos, CA,
October 2006.
12Research Approaches - 2004 / 2005
- Use of stochastic operators
- Use of bit stream operators
- Ensuring voter reliability to use n-MR while
dealing with multiple simultaneous faults
14Research Evolution - 2004 / 2005
IOLTS 2004
Stochastic Operators
OK for some DSP Applications
16Research Evolution - 2004 / 2005
DFT 2004 WDES 2004
Bit Stream Operators
Small footprint and fast
Looking for more speed
Stochastic Operators
17Research Evolution - 2004 / 2005
Bit Stream Operators
Looking for more speed
Looking for tolerant converter
Stochastic Operators
Analog Voter
ETS 2005 SBCCI 2005
18Research Evolution - 2004 / 2005
Bit Stream Operators
Tolerant to multiple faults in n-MR solutions
Looking for more speed
Looking for tolerant converter
Stochastic Operators
TMR and Analog Voter
ETS 2005 SBCCI 2005
19Research Evolution - 2004 / 2005
SRC 2005 TechCon
Bit Stream Operators
Research Report
Looking for more speed
Looking for tolerant converter
Stochastic Operators
TMR and Analog Voter
21Research approach - 2006 / 2007
- cooperation with peers
- use of memory for computation
- analog voter majority logic
- use of an I-IP to harden instructions
- low cost redundancy using statistical parallel
computation
27Research Evolution - 2006 / 2007
- Research Report: DATE 06 PhD Forum
- MemProc: LATW 06, ETS 06
- Majority Logic: DFT 06
- Online Hardening: DFT 06
- Low cost redundancy
- Statistical Computation: VTS 07 (submitted)
29Current research - motivation
- future technologies
- faster devices
- transient pulse duration scaling not
proportional to speed scaling
- transient pulses will last longer than one
cycle
- techniques relying on time redundancy will fail
31Current research - motivation
- alternative approach
- space redundancy
- current solutions: area overhead ≥ 100%
- small granularity does not provide low overhead
- (what can one do with 50% of a MOSFET ?)
- proposed solution
- fingerprinting
- parallel processing on subset of possible
inputs
- small transient fault probability (desired: 0)
32Current research - focus
- use of low cost redundancy and statistical
computation to cope with transient faults
33Sample application
- Freivalds: matrix multiplication correctness
- given matrices $A$ and $B$, $n \times n$
- given one algorithm that calculates $C = A \times B$
- goal: check if the algorithm performs correctly
by executing thousands of multiplications and
comparing the results
- naive solution: calculate again and compare → $O(n^3)$
34Sample application
- Freivalds technique
- 1. generate a random vector $r$, with values from
$\{0,1\}$
- 2. compute vector $Cr = C \times r$ → $O(n^2)$
- 3. compute vector $ABr = A \times (B \times r)$ → $O(n^2)$
- 4. if $C \neq A \times B$, then $\Pr[ABr = Cr] \leq 1/2$
- After $k$ independent repetitions of steps 1, 2 and
3: $\Pr[ABr = Cr] \leq 1/2^k$
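The steps of Freivalds technique can be sketched in a few lines of Python. This is a minimal illustration, not the checker implemented in this work: the function names (`mat_vec`, `freivalds_round`, `freivalds_check`) and the 2 x 2 example matrices are assumptions chosen for the example.

```python
import random

def mat_vec(m, v):
    """Multiply an n x n matrix (list of rows) by a vector: O(n^2)."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def freivalds_round(a, b, c, r):
    """One round: compare A x (B x r) with C x r for a given 0/1 vector r."""
    return mat_vec(a, mat_vec(b, r)) == mat_vec(c, r)

def freivalds_check(a, b, c, k, rng):
    """Run k independent rounds; False means C != A x B for certain."""
    n = len(a)
    for _ in range(k):
        r = [rng.randint(0, 1) for _ in range(n)]
        if not freivalds_round(a, b, c, r):
            return False
    return True  # C == A x B, or an error escaped all k rounds

# Hypothetical example data: c_good is A x B, c_bad has one wrong element.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_good = [[19, 22], [43, 50]]
c_bad = [[19, 22], [43, 51]]

print(freivalds_check(a, b, c_good, 20, random.Random(1)))
print(freivalds_round(a, b, c_bad, [0, 1]))  # this r exposes the error
print(freivalds_round(a, b, c_bad, [1, 0]))  # this r misses the error
```

The last two calls show why a single round is not enough: whether the wrong element is noticed depends on the random vector, which is what the repetition over k rounds addresses.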
35Sample application
- Our extension of Freivalds technique
- 1. generate a random vector $r$, with values from
$\{0,1\}$
- 2. generate a vector $r_c$ with $r_{c,i} = \mathrm{not}(r_i)$ for
$i = 1, \dots, n$
- 3. compute $Cr = C \times r$ and $Cr_c = C \times r_c$
- 4. compute $ABr = A \times (B \times r)$ and
$ABr_c = A \times (B \times r_c)$
- 5. if $ABr \neq Cr$ OR $ABr_c \neq Cr_c$, then
- $\Pr[ABr \neq Cr] = 1$
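A minimal sketch of the check with the complemented vector follows, under the same assumptions as any toy example: `mat_vec`, `complement_check` and the 2 x 2 matrices are hypothetical names and data, not taken from the implementation presented here. Because $r + r_c$ is the all-ones vector, an error in a single element of $C$ is seen by at least one of the two comparisons, whatever $r$ is. Errors whose contributions cancel within a row of $A \times B - C$ could still escape; that caveat is an observation about this sketch, not a claim from the slides.

```python
from itertools import product

def mat_vec(m, v):
    """Multiply an n x n matrix (list of rows) by a vector: O(n^2)."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def complement_check(a, b, c, r):
    """Compare A x (B x r) with C x r for both r and its complement rc."""
    rc = [1 - x for x in r]  # rc_i = not(r_i)
    ok_r = mat_vec(a, mat_vec(b, r)) == mat_vec(c, r)
    ok_rc = mat_vec(a, mat_vec(b, rc)) == mat_vec(c, rc)
    return ok_r and ok_rc  # False as soon as either comparison fails

# Hypothetical example data: c_bad differs from A x B in one element.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_good = [[19, 22], [43, 50]]
c_bad = [[19, 22], [43, 51]]

# Try every possible 0/1 vector r of length 2.
all_r = [list(r) for r in product((0, 1), repeat=2)]
detected = [not complement_check(a, b, c_bad, r) for r in all_r]
accepted = [complement_check(a, b, c_good, r) for r in all_r]
print(detected)
print(accepted)
```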
36Sample Implementation
- matrix multiplier with checker
- application of Freivalds technique
37Sample Implementation
Area overhead (% of gates)
38Sample implementation
Time overhead (% of instructions)
39Sample implementation
Fault injection results
40PhD program requirements
- 36 credits ?
- qualifying examination ?
- 2 foreign languages proficiency exam ?
- academic week seminar ?
- Thesis proposal: February 2007
- Thesis presentation: December 2007
41Questions ?
43Using Stochastic Operators
- SEU induced transient errors are of random nature
- Stochastic operators rely on randomness to
produce approximate results
- The injection of random faults in the input
signals processed by stochastic operators did not
impact the precision of the results
- Several application areas (DSP) can deal with
approximate values and still produce acceptable
results (outputs)
44Using Stochastic Operators
- Benefit: reduced area of the operators
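As an illustration of why a few bit flips barely disturb a stochastic operator, the sketch below uses the textbook stochastic multiplier, where a value in [0, 1] is encoded as the fraction of 1s in a random bit stream and a bitwise AND of two streams multiplies the encoded values. This is an assumption made for the example: the slides do not specify the operator design, and the stream length, operand values and number of injected faults are arbitrary.

```python
import random

def to_stream(p, length, rng):
    """Encode a value p in [0, 1] as a random stream with P(bit = 1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def from_stream(bits):
    """Decode a stream: the value is the fraction of 1s."""
    return sum(bits) / len(bits)

def stochastic_multiply(a_bits, b_bits):
    """Bitwise AND of two independent streams multiplies their values."""
    return [x & y for x, y in zip(a_bits, b_bits)]

def flip_bits(bits, count, rng):
    """Inject faults: flip `count` distinct, randomly chosen bits."""
    out = list(bits)
    for i in rng.sample(range(len(out)), count):
        out[i] ^= 1
    return out

rng = random.Random(2004)
N = 4096  # stream length (arbitrary choice)
a = to_stream(0.5, N, rng)
b = to_stream(0.25, N, rng)

clean = from_stream(stochastic_multiply(a, b))
faulty = from_stream(stochastic_multiply(flip_bits(a, 8, rng), b))
print(clean, faulty)  # both approximate 0.5 * 0.25
```

Each flipped input bit can change at most one output bit, so 8 simultaneous upsets move the decoded result by at most 8/4096, which is below the approximation error already inherent in the stochastic encoding.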
46Using Bit Stream Operators
- Computation principles similar to those of the
stochastic adder and multiplier
- Operators can produce bit streams which represent
the exact results of the operation
- Redundancy is added to the bit streams in order
to withstand multiple bit flips
Adding robustness to the bit stream through
redundancy
48Using Bit Stream Operators
- Computation principles similar to those of the
stochastic adder and multiplier
- Operators can produce bit streams which represent
the exact results of the operation
- Redundancy is added to the bit streams in order
to withstand multiple bit flips
- Conversion of bit streams to binary coded values
is delayed as much as possible, and conversion
circuits must use TMR or n-MR for protection
against faults
- Issues to be further investigated: size of bit
streams and area of the conversion circuits
49What is Wrong with TMR ?
- TMR protects only against single faults in one of
the modules
V O T E R
Module 1
correct output
Module 2
correct output
correct output
Module 3
correct output
50What is Wrong with TMR ?
- TMR protects only against single faults in one of
the modules
V O T E R
Module 1
correct output
correct output
Module 3
correct output
51What is Wrong with TMR ?
- TMR does not protect against double faults in
different modules
V O T E R
Module 1
wrong output
wrong output
Module 3
wrong output
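The single-fault and double-fault cases can be sketched with a bitwise majority function. The example assumes a fault-free voter and independent module faults with probability q per module; both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def majority(a, b, c):
    """Digital TMR voter for one bit: output the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)

def tmr_failure_probability(q):
    """Probability that two or three modules are faulty at the same time.

    Assumes independent faults with probability q per module and a
    fault-free voter (assumptions of this sketch).
    """
    return 3 * q**2 * (1 - q) + q**3

correct = 1
single_fault = majority(correct, correct ^ 1, correct)      # one module upset
double_fault = majority(correct ^ 1, correct, correct ^ 1)  # two modules upset

print(single_fault == correct)  # the single fault is masked
print(double_fault == correct)  # the double fault is not masked
print(tmr_failure_probability(0.1))
```

With one-bit outputs, two faulty modules agree on the same wrong value, so the voter propagates it. As the per-module fault probability grows, the term in q squared dominates and TMR alone stops being sufficient.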
52What is Wrong with TMR ?
- When a single fault occurs in the voter circuit,
the voter output may be wrong
V O T E R
Module 1
correct output
Module 2
correct output
correct output
Module 3
correct output
53What is Wrong with TMR ?
- When a single fault occurs in the voter circuit,
the voter output may be wrong
V O T E R
Module 1
correct output
Module 2
correct output ?
correct output
Module 3
correct output
57Making TMR (n-MR) more reliable
- Known solutions imply
- area, performance and / or power penalties
- deadlock: how to protect the output generator ?
- Proposed solution
- use TMR to cope with single faults in the modules
- replace the digital voter by an analog voter that
- uses a comparator to generate the output
- can tolerate some noise while still producing
the correct result
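A purely behavioural sketch of the idea follows. It assumes the voter averages the module output voltages and a comparator checks the average against half the supply voltage; the averaging network, the VDD/2 reference and the 3.3 V supply are assumptions made for illustration and are not the circuit shown in these slides.

```python
VDD = 3.3  # assumed supply voltage, in volts

def analog_vote(voltages, vdd=VDD):
    """Average the module outputs and compare the result with VDD / 2."""
    average = sum(voltages) / len(voltages)
    return 1 if average > vdd / 2 else 0

# TMR, all modules intended to output logic 1:
print(analog_vote([3.3, 3.3, 0.0]))  # one module upset
print(analog_vote([3.0, 3.5, 0.4]))  # one module upset, noisy levels
print(analog_vote([3.3, 0.0, 0.0]))  # two modules upset

# 5-MR with two simultaneous faults:
print(analog_vote([3.3, 3.3, 3.3, 0.0, 0.0]))
```

In this model, noise on the individual inputs only matters if it moves the average across the comparator threshold, and going from 3 to 5 inputs lets the majority survive two simultaneous faults.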
58The Analog Voter
59Minimum Area Comparator
Injection of faults in the comparator (*)
(*) using CMOS 0.35 µm
60Electrical Simulation: Multiple Faults (SPICE and
CMOS 0.35 µm)
62Dealing with Multiple Simultaneous Faults n-MR
The Analog Voter with 5 Inputs (for 5-MR)
Simulations with injection of 2 simultaneous
faults also succeeded
63The Analog Voter ... Oops !