Title: Compensation of Transient Faults and Self Repair
Problems, Methods and Limitations
Heinrich T. Vierhaus, BTU Cottbus, Computer Engineering Group
Outline
1. Introduction: Nano Structure Problems
2. Transient Fault Compensation
3. Repair for Memory and FPGAs
4. Fine-Granular Repair
5. Gate-Level Repair Architectures
6. A Lot of Things to Do ...
1. Introduction
A bunch of new problems from nano structures ...
Nanoelectronic Problems
Lithography: The wavelength used to map structural information from masks to wafers is larger (up to 4 times or more) than the minimum structural features (193 nm versus 90 / 65 / 45 nm). This requires adaptation of layouts for correction of mapping faults.
Parameter variations: The number of atoms in MOS transistor channels becomes so small that statistical variations of doping densities have an impact on device parameters such as threshold voltages.
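The gap between wavelength and feature size can be made concrete with a little arithmetic. The following is a minimal sketch using only the numbers quoted above; the function and variable names are illustrative.

```python
WAVELENGTH_NM = 193               # exposure wavelength from the slide
FEATURE_SIZES_NM = [90, 65, 45]   # minimum feature sizes from the slide

def wavelength_to_feature_ratio(feature_nm, wavelength_nm=WAVELENGTH_NM):
    """How many times larger the wavelength is than the feature."""
    return wavelength_nm / feature_nm

ratios = {f: round(wavelength_to_feature_ratio(f), 2) for f in FEATURE_SIZES_NM}
for feature, ratio in ratios.items():
    print(f"{feature} nm feature: wavelength is {ratio}x larger")
```

With these numbers the factor of 4 is only exceeded at 45 nm; at 90 nm the wavelength is roughly twice the feature size.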
New Problems with Nano-Technologies
[Figure: lithography setup with light source (wavelength 193 nm), mask (reticle), resist, exposed resist and wafer; feature size down to 45 nm]
Layout Correction
Modified layout for compensation of mapping faults
Compensation is critical and non-ideal
Faults are not random but correlated!
Requires fast fault diagnosis
Doping Fluctuations in MOS Transistors
Density and distribution of doping atoms cause shifts in transistor threshold voltages!
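Why a small number of doping atoms matters can be sketched with a simple statistical model. The example below assumes that the dopant count in a channel follows Poisson statistics (a common first-order model, not something stated on the slides), so the relative fluctuation is 1/sqrt(N). The dopant counts are hypothetical.

```python
import math

def relative_dopant_fluctuation(mean_dopants):
    """Relative standard deviation of the dopant count under a Poisson model."""
    return 1.0 / math.sqrt(mean_dopants)

# Hypothetical mean dopant counts for shrinking channel volumes.
for n in (10_000, 1_000, 100):
    print(f"{n:>6} dopants: {100 * relative_dopant_fluctuation(n):.1f} % fluctuation")
```

Under this assumption, shrinking the mean count by a factor of 100 makes the relative fluctuation ten times larger.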
Nanostructure Problems
Individual device characteristics such as Vth are more dependent on statistical variations of underlying physical features such as doping profiles.
A significant share of basic devices will be out of spec and will need replacement by backup elements for yield improvement after production.
As smaller features mean higher stress (field strength, current density), early failures in the field are also more likely and must be compensated.
Recognizing and compensating transient errors in time is becoming a must, e.g. because charged particles can discharge circuit nodes.
Fault Tolerant Computing
Handling of a fault event at three levels:
- Software-based fault detection and compensation (specific): works only for transient faults!
- HW logic / RT-level detection and compensation (universal): typically works for transient and permanent faults!
- Transistor- and switch-level compensation (very specific): typically works for specific types of transient faults only!
2. Transient Fault Compensation
Transient Faults and Single Event Upsets (SEUs)
Discharge of memory cells and circuit nodes
- Sources for transient faults
- Radiation (charged particles)
- EM coupling
- Vdd- and GND-noise
Storage Nodes and Particles
[Figure: storage node charge Q / fC (logarithmic axis, 1 to 100 fC) versus technology (0.35, 0.25, 0.18, 0.09), with the alpha particle charge level marked]
A 1 MeV alpha particle generates 42 fC of charge!
Contribution to Soft-Error Rates
Static combinational logic: 11 %
Sequential elements (FFs, latches): 49 %
Unprotected SRAM: 40 %
Source: S. Mitra, N. Seifert, M. Zhang, Q. Shi, K. S. Kim, "Robust System Design with Built-In Soft Error Resilience", IEEE Computer, Vol. 38, No. 2, Feb. 2005, pp. 43-52
Spikes and Clock Rates in Logic
[Figure: a source pulse of 100 ps shown against the clock (with slew time / jitter) over time t in two cases: charge / status restoration is possible, and charge / status restoration is impossible]
Fault probability in digital logic is roughly proportional to clock frequency!
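The proportionality to clock frequency can be illustrated with a minimal model. It assumes that a transient pulse arrives at a uniformly random time and is captured whenever it overlaps a short latching window around the clock edge. The 100 ps pulse width is taken from the slide; the 20 ps window is a hypothetical value.

```python
def capture_probability(f_clk_hz, pulse_s=100e-12, window_s=20e-12):
    """Probability that a randomly timed pulse overlaps the latching window.

    The pulse is captured if it starts within (pulse + window) of a clock
    edge, so the probability is that interval divided by the clock period.
    """
    return min(1.0, (pulse_s + window_s) * f_clk_hz)

for f_ghz in (0.5, 1.0, 2.0, 4.0):
    print(f"{f_ghz} GHz: {capture_probability(f_ghz * 1e9):.2f}")
```

As long as the clock period is longer than pulse plus window, doubling the clock frequency doubles the capture probability in this model.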
Logic Structures and Fault Events
[Figure: particle radiation hitting combinational logic between input FFs and output FFs]
Flip-flops need fault tolerance / fault hardening in the first place; logic close to the outputs comes next.
Muller-C-Element
Fault Handling
Muller-C-Element:
If both inputs are equal: out = outl1 = outl2
If the inputs are not equal: out = previous (outl1, outl2)
Under local fault conditions on the latch outputs (one of the 2 latches false), the C-element preserves the output condition from the charge phase of the latch.
Essentially 3 latches!
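The behaviour described above can be captured in a small behavioural model. This is a sketch of a non-inverting C-element as described on the slide, not a circuit-level implementation; the class name is illustrative.

```python
class MullerC:
    """Behavioural model of a non-inverting Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial

    def step(self, outl1, outl2):
        # Equal inputs are transmitted; unequal inputs keep the stored value.
        if outl1 == outl2:
            self.out = outl1
        return self.out

c = MullerC(initial=0)
print(c.step(1, 1))  # both latches agree: output follows
print(c.step(1, 0))  # one latch upset: output is held
print(c.step(0, 0))  # both latches agree again: output follows
```

Because the element itself stores a value, the two latches plus the C-element behave like three storage elements, which matches the remark on the slide.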
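Fault Compensation
[Figure: input in drives Latch 1 (output outl1) and Latch 2 (output outl2); both feed a Muller C-element with output out and load CL. Timing diagram v(t) of in1, in2 and clock over t, with the C-element alternating between keep and transmit phases]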
Intel's Scan Path Element
Intel's Scan Path Element plus Fault Compensation
Fault Compensation in Combinational Logic
[Figure: combinational logic fed by input FFs, with MC and D elements at each of three outputs]
Fault Compensation in Combinational Logic
[Figure: V(t) over t for the fault-free signal, the signal with glitch, and the delayed signal with glitch, marked with the latch close time and the time left to capture; cases: MC capture, and MC no capture / hold]
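The glitch filtering shown in the waveforms can be sketched as a discrete-time simulation. It assumes that D denotes a delay element and MC a Muller C-element fed by the signal and its delayed copy; this reading of the figure labels, and the time-step model itself, are assumptions.

```python
def filter_glitches(signal, delay, initial=0):
    """Feed a signal and its delayed copy into a C-element model.

    The output only changes when the signal and the delayed copy agree,
    so a glitch no longer than `delay` steps never reaches the output.
    """
    out = []
    state = initial
    for t, value in enumerate(signal):
        delayed = signal[t - delay] if t >= delay else initial
        if value == delayed:      # C-element transmits
            state = value
        out.append(state)         # otherwise it holds
    return out

glitchy = [0] * 5 + [1] * 2 + [0] * 5    # 2-step glitch
print(filter_glitches(glitchy, delay=3))
step = [0] * 3 + [1] * 6                 # genuine transition
print(filter_glitches(step, delay=2))
```

The price of the filtering is visible in the second call: a genuine transition reaches the output only after the delay has elapsed, which eats into the time left to capture.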
3. Repair for Memory and FPGAs
Compensation of transient faults is not enough. Some technologies for transient compensation can handle permanent faults, too, but not in the long run and not in the presence of additional transient faults!
Memory Test and Repair
[Figure: memory array with lines, line address, read / write lines, columns and a spare column]
Memory Test and Repair (2)
[Figure: the same memory array with a memory BIST controller added]
... is already state-of-the-art!
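The interplay of BIST and a spare column can be sketched in a few lines. The model below is a simplified illustration under stated assumptions: one spare column, a single hypothetical stuck-at-0 column defect, and a plain write/read test with all-0 and all-1 patterns. The class and method names are invented for this example.

```python
class RepairableMemory:
    """Toy memory with one spare column and a simple BIST-style repair."""

    def __init__(self, rows, cols, stuck_col=None):
        self.rows, self.cols = rows, cols
        # One extra physical column at index `cols` serves as the spare.
        self.cells = [[0] * (cols + 1) for _ in range(rows)]
        self.stuck_col = stuck_col   # hypothetical defect: column stuck at 0
        self.remap = {}              # logical column -> physical column

    def write(self, row, col, bit):
        phys = self.remap.get(col, col)
        self.cells[row][phys] = 0 if phys == self.stuck_col else bit

    def read(self, row, col):
        return self.cells[row][self.remap.get(col, col)]

    def bist_and_repair(self):
        """Write/read test patterns, then switch in the spare if possible."""
        faulty = set()
        for pattern in (0, 1):
            for r in range(self.rows):
                for c in range(self.cols):
                    self.write(r, c, pattern)
                    if self.read(r, c) != pattern:
                        faulty.add(c)
        if len(faulty) == 1:
            self.remap[faulty.pop()] = self.cols
            return True
        return not faulty            # True if fault-free, False if unrepairable

mem = RepairableMemory(rows=4, cols=8, stuck_col=3)
print(mem.bist_and_repair())         # repair succeeds with one faulty column
```

A single spare column covers only one faulty column; with two faulty columns this sketch reports the memory as unrepairable.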
FPGA-based Self Repair
In-System FPGA Repair
Repair Mechanism: Row / Line Shift
Little overhead for the re-configuration process
Loss of many good CLBs for every fault
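The trade-off of the row shift can be sketched as follows, assuming a CLB array with one spare row at the end and a repair that skips the entire row containing the fault. The array dimensions and function names are hypothetical.

```python
def row_shift_map(n_logical_rows, faulty_row):
    """Map logical rows to physical rows, skipping the faulty physical row."""
    return [r if r < faulty_row else r + 1 for r in range(n_logical_rows)]

def good_clbs_lost(n_cols, faults_in_row=1):
    """Fault-free CLBs discarded when a whole row is skipped."""
    return n_cols - faults_in_row

print(row_shift_map(n_logical_rows=4, faulty_row=1))
print(good_clbs_lost(n_cols=16))
```

The mapping is trivial to compute, which reflects the low re-configuration overhead, but every single fault costs all other CLBs in that row.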
Distributed Backup CLBs
Minimum loss of functional CLBs
High effort for re-wiring requires massive embedded computing power (32-bit CPU, 500 MHz)
FPGAs as a Solution?
The granularity of re-configurable logic blocks (CLBs) in most FPGAs is on the order of several thousand gates.
Replacement strategies must work at a granularity of blocks of about 100-500 transistors for fault densities between 0.01 % and 0.1 %.
An efficient FPGA repair mechanism requires in-system EDA (re-placement and routing) with a massive demand for computing power.
Example: 500 MHz Power 4 processor, run-time up to minutes, memory about 1 KByte
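The granularity argument can be checked with a short calculation. The sketch assumes that the fault density is a per-transistor fault probability given in percent (0.01 % to 0.1 %), that faults are independent, and that a gate consists of about 4 transistors; all three are assumptions made for illustration.

```python
def block_fault_free_probability(transistors, fault_density):
    """Probability that every transistor in a block is fault-free."""
    return (1.0 - fault_density) ** transistors

TRANSISTORS_PER_GATE = 4   # hypothetical average

for density in (0.0001, 0.001):                      # 0.01 % and 0.1 %
    for size in (100, 500, 3000 * TRANSISTORS_PER_GATE):
        p = block_fault_free_probability(size, density)
        print(f"density {density:.4%}, {size:>6} transistors: {p:.6f}")
```

Under these assumptions, a block of a few hundred transistors is still fault-free most of the time, whereas a block of several thousand gates almost never is at the higher fault density, so a backup block of that size would itself very likely be faulty.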
4. Fine-Granular Repair
[Figure: repair procedure overhead and functioning elements lost, plotted against the size of replaced blocks (granularity)]
Granularity of Replacement
Levels of Repair
Replacement in Regular Structures (e.g. for DSP)
Parallel Backup Transistors
[Figure: basic gate (inputs in1, in2, output out, between VDD and GND) next to the same gate with redundant transistors in parallel]
Configuration and Fault Isolation
[Figure: gate (in1, in2, out) with backup transistors and config. switches towards VDD and GND, controlled by Ap and An; a stuck-on fault is marked]
The Gate-Short Problem
[Figure: driver feeding Load 1 and Load 2, with a gate short in one load]
GND shorts of input gates affect the whole fan-in network and make redundancy obsolete!
Gate Turn-off
Schematic Layout with VDD / GND Switches
[Figure panels: gate with parallel redundancy and fault isolation; gate with parallel redundancy]
Transistor-Level Overhead
Redundancy                         parallel transistors   VDD / GND switches   separate gate poly lines
Overhead (cells only, estimates)   30-40 %                60-80 %              100-150 %
stuck-off coverage                 yes                    yes                  yes
stuck-on coverage                  no                     yes                  yes
gate shorts cov.                   no                     no                   yes
control lines                      none                   one wire             mult. wires
Duplicate Standard Cells
[Figure: Gate 1 and Gate 2 (each with in1, in2, out, GND), supplied via VDD1 and VDD2 through controlled VDD switches]
Again: Fault Isolation
[Figure: the same duplicated gates with VDD switches, marked with a gate input short and an output VDD / GND short]
Administrated Duplicate Cells
[Figure: Gate 1 and Gate 2 with shared gate in and gate out, VDD power switches (VDD1, VDD2) and GND switches (GND1, GND2) controlled by Act 1 and Act 2; a gate short is marked, and the switch control values are annotated for each configuration]
Cell Duplication and Power Switch
Possible for all types of cells (also flip-flops).
Granularity of partitioning for replacements (single gates, blocks) can be selected on demand.
Can be combined favorably with dynamic circuit optimization.
Good coverage potential for transistor faults.
Significant overhead (above 100 %), but most likely below Triple Modular Redundancy (TMR).
Redundancy may become exhausted and then requires a further level of redundancy!
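The administration of a duplicated cell, including the point at which redundancy is exhausted, can be sketched as a small state model. The class, its method and the idea of a fault report coming from some diagnosis step are illustrative assumptions; the slides do not specify a control interface.

```python
class DuplicatedCell:
    """Two copies of a gate; only the active one is connected to power."""

    def __init__(self):
        self.active = 1        # gate 1 powered, gate 2 isolated
        self.faulty = set()

    def report_fault(self, gate):
        """Record a diagnosed fault and re-select the powered copy.

        Returns the active gate number, or None if both copies are faulty.
        """
        self.faulty.add(gate)
        healthy = [g for g in (1, 2) if g not in self.faulty]
        self.active = healthy[0] if healthy else None
        return self.active

cell = DuplicatedCell()
print(cell.report_fault(1))   # switches over to gate 2
print(cell.report_fault(2))   # no healthy copy left
```

A return value of None marks the case where a further level of redundancy would have to take over.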
5. Gate Level Repair
[Figure: row of std cells (gates) with a gate fault and a backup cell; insertion of the replacement cell]
Does not work for irregular wiring schemes!
Configurable Backup Cell
Problem: Fault isolation in case of gate-input shorts!
Block-Based Repair
Technology Mapping
Columns of switching elements
Switching Concept
4 logic states, registered in 2 memory cells
Overhead
6. Bus Structures and Networks on Chip (NoCs)
Technology forecasts predict that nano-wires may become the most vulnerable and unreliable circuit elements ...
Faults on Irregular Interconnects
[Figure: routing tree from signal source S to sinks C, with a single fault (line break)]
Redundant Wiring
[Figure: routing tree with loops from signal source S to sinks C, with an extra wire and a single fault (line break)]
... plus double vias!
Problem: Classic delay calculation works well on trees only!
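The benefit of an extra wire can be illustrated with a connectivity check on a small routing graph. The topology below (source S, junctions J1 and J2, sinks C1 to C3) is hypothetical and only mimics the kind of tree shown in the figures.

```python
from collections import deque

def reachable(edges, source, broken=None):
    """Return all nodes reachable from `source`, ignoring one broken wire."""
    adj = {}
    for a, b in edges:
        if broken in ((a, b), (b, a)):
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical routing tree: S feeds C1 directly and C2, C3 via J2.
tree = [("S", "J1"), ("J1", "C1"), ("J1", "J2"), ("J2", "C2"), ("J2", "C3")]
line_break = ("J1", "J2")

print(sorted(reachable(tree, "S", line_break)))                    # C2, C3 are cut off
print(sorted(reachable(tree + [("C1", "C3")], "S", line_break)))   # loop keeps them connected
```

The loop restores connectivity after a single line break, but the net is no longer a tree, which is exactly where tree-based delay calculation stops being applicable.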
Buses versus NoCs
[Figure: regular network structure (NoC) built from NoC nodes, next to an irregular bus structure (SoC) connecting bus masters]
Faults on Bus Structures
[Figure: bus connecting bus masters BM 1 to BM 6]
Local defect affecting the total network
Bus Segmentation
[Figure: bus connecting BM 1 to BM 6, divided by segment couplers (SC)]
Structure the bus into segments that can be repaired individually!
The Switching Problem
[Figure: switching of n lines, with backup lines, parameters k and p]
n     k    p    switches   contr. states
8     1    1    16         9
16    1    1    32         33
32    2    2    128        65
Faults and Repair Actions
1. Line break: a section of a line is interrupted
   use spare wire!
2. Line short to GND: a section of a line is connected to GND
   use spare wire!
3. Dynamic coupling between adjacent lines
   a. Re-allocate lines in bundle
   b. Insert grounded line for decoupling
4. Bridge between lines
   a. Feed both lines with same signal
   b. Make one line floating
Reconfiguration for De-Coupling
2-way switches may be used!
[Figure: lines i and k exchanged by 2-way switches]
... can help to minimize dynamic coupling faults!
Selection of Permutations
All single faults must be repairable by selecting from a minimum set of permutations.
Those lines that can act as replacement for most of the others are selected as backup lines.
By permutation, non-faulty functional lines are re-arranged as well.
No permutation used for repair may map a functional line to a faulty line.
Permutations for 8-Wire-Bundles
Permutation types: pair-wise symmetrical (PW1, PW2, PW3) and new-neighborhood (NNP1, NNP2, NNP3)
Line mappings (line - target), one row per line:
0 - 2   0 - 3   0 - 5   0 - 1   0 - 6   0 - 4
1 - 6   1 - 5   1 - 7   1 - 0   1 - 7   1 - 3
2 - 0   2 - 7   2 - 3   2 - 4   2 - 5   2 - 4
3 - 5   3 - 0   3 - 2   3 - 1   3 - 6   3 - 6
4 - 7   4 - 6   4 - 5   4 - 2   4 - 0   4 - 2
5 - 3   5 - 1   5 - 4   5 - 0   5 - 7   5 - 2
6 - 1   6 - 4   6 - 7   6 - 0   6 - 3   6 - 3
7 - 4   7 - 2   7 - 6   7 - 5   7 - 1   7 - 1
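The rule that no repair permutation may map a functional line to a faulty line can be checked mechanically. In the sketch below, PW1 and PW2 follow the first two mappings listed for each line in the table above. PW3 is an assumed adjacent-pair swap used only for illustration, and the model of "signal i is routed onto wire perm[i]" is likewise an assumption about how the permutations are applied.

```python
from itertools import combinations

# PW1, PW2: first two mappings per line in the table. PW3: assumed pair swap.
PW1 = [2, 6, 0, 5, 7, 3, 1, 4]
PW2 = [3, 5, 7, 0, 6, 1, 4, 2]
PW3 = [1, 0, 3, 2, 5, 4, 7, 6]
PERMS = {"PW1": PW1, "PW2": PW2, "PW3": PW3}

def repairs(perm, backups, faulty):
    """True if no functional line (any line not in `backups`) lands on `faulty`."""
    return all(perm[line] != faulty
               for line in range(len(perm)) if line not in backups)

def repairable_faults(backups):
    """For each functional wire, list the permutations that repair its fault."""
    functional = [w for w in range(8) if w not in backups]
    return {f: [name for name, p in PERMS.items() if repairs(p, backups, f)]
            for f in functional}

def full_coverage_backup_sets(k):
    """All choices of k backup lines that make every single fault repairable."""
    return [set(b) for b in combinations(range(8), k)
            if all(repairable_faults(set(b)).values())]

print(repairable_faults({0, 4}))
print(full_coverage_backup_sets(2))
```

With these three permutations, choosing lines 0 and 4 as backups covers every single fault on the six functional lines, while lines 0 and 1 do not. This shows why the choice of backup lines has to be made together with the permutation set.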
8 Wires: Permutations and Replacement
[Figure: permutations with the selected backup wires]
2 lines selected for backup!
8 Wires: Permutations and Replacement
[Figure: permutations]
4 lines selected for backup!
Overhead / Coverage for 8-Line-Bundle
Spare lines (out of 8) / switches: 0 / 16, 1 / 48, 2 / 32, 3 / 32, 4 / 32
Faults:
Single line fault: -
Dyn. coupl. fault:
Double line faults: - - 20 30 100
Note: The number of switches is reduced by a factor of 2 if full 2-way switches with 2 inputs / 2 outputs are used!
Administration Scheme
[Figure: two segment couplers (SC), each with in / out switches on lines 0 to 7 (sides A and B), a decoder driven by config bits (C1, C2) and config logic; matching logic links the two couplers]
7. A Lot of Work to Do
Logic fault diagnosis in the field
Efficient logic self repair
Redundancy supervision and management
Resource management under fault conditions
Repair functions for interconnects
Overall system-level fault management