Compensation of Transient Faults and Self Repair

1
Compensation of Transient Faults and Self Repair
  • Problems, Methods and Limitations

Heinrich T. Vierhaus BTU Cottbus Computer
Engineering Group
2
Outline
1. Introduction Nano Structure Problems
2. Transient Fault Compensation
3. Repair for Memory and FPGAs
4. Fine-Granular Repair
5. Gate-Level Repair Architectures
6. A Lot of Things to do ....
3
1. Introduction
A bunch of new problems from nano structures ...
4
Nanoelectronic Problems
Lithography
The wavelength used to map structural
information from masks to wafers is larger (4
times or more) than the minimum structural
features (193 nm versus 90 / 65 / 45 nm).
Layouts must be adapted to correct mapping
faults.
Parameter variations
The number of atoms in MOS transistor channels
becomes so small that statistical variations of
doping densities have an impact on device
parameters such as threshold voltages.
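The effect of small atom counts can be illustrated with a minimal sketch. It assumes that the number of dopant atoms in a channel follows Poisson statistics, which the slides do not state; the atom counts and the function name `relative_dopant_fluctuation` are purely illustrative.

```python
import math


def relative_dopant_fluctuation(mean_dopant_count: float) -> float:
    """Relative standard deviation (sigma / mean) of a Poisson-distributed count.

    Assumption: dopant atoms are placed independently, so the count is
    Poisson distributed and sigma = sqrt(mean).
    """
    if mean_dopant_count <= 0:
        raise ValueError("mean_dopant_count must be positive")
    return 1.0 / math.sqrt(mean_dopant_count)


# Illustrative counts only: fewer atoms per channel mean larger relative spread.
for count in (10_000, 1_000, 100):
    spread = relative_dopant_fluctuation(count)
    print(f"{count:>6} dopant atoms: relative fluctuation {spread:.3f}")
```

Under this assumption the relative spread grows as the inverse square root of the atom count, which is why shrinking channels make threshold voltages less predictable.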
5
New Problems with Nano-Technologies
[Figure: lithography setup. Light source (wavelength 193 nm), mask (reticle), resist, exposed resist, wafer. Feature size down to 45 nm.]
6
Layout Correction
Modified layout for compensation of mapping faults
Compensation is critical and non-ideal
Faults are not random but correlated!
Requires fast fault diagnosis
7
Doping Fluctuations in MOS Transistors
Density and distribution of doping atoms cause
shifts in transistor threshold voltages!
8
Nanostructure Problems
Individual device characteristics such as Vth are
more dependent on statistical variations of
underlying physical features such as doping
profiles.
A significant share of basic devices will be out
of spec and need replacement by backup
elements for yield improvement after production.
As smaller features mean higher stress (field
strength, current density), early failures
in the field are also more likely and must be
compensated.
Recognizing and compensating transient errors in
time is becoming a must, e.g. due to charged
particles that can discharge circuit nodes.
9
Fault Tolerant Computing
Fault event
  • Software-based fault detection and compensation (specific): works only for transient faults!
  • HW logic / RT-level detection and compensation (universal): typically works for transient and permanent faults!
  • Transistor- and switch-level compensation (very specific): typically works for specific types of transient faults only!
10
2. Transient Fault Compensation
11
Transient Faults and Single Event Upsets (SEUs)
Discharge of memory cells and circuit nodes
Sources for transient faults:
  • Radiation (charged particles)
  • EM coupling
  • Vdd and GND noise
12
Storage Nodes and Particles
[Figure: storage node charge Q in fC (log scale, 1 to 100) versus technology (0.35, 0.25, 0.18, 0.09), compared with the charge of an alpha particle.]
A 1 MeV alpha particle generates 42 fC of charge!
13
Contribution to Soft-Error Rates
Static combinational logic: 11 %
Sequential elements (FFs, latches): 49 %
Unprotected SRAM: 40 %
Source: S. Mitra, N. Seifert, M. Zhang, Q. Shi,
K. S. Kim, "Robust System Design with Built-In
Soft Error Resilience", IEEE Computer, Vol. 38,
No. 2, Feb. 2005, pp. 43-52
14
Spikes and Clock Rates in Logic
[Figure: two timing diagrams of a 100 ps source pulse against the clock (with slew time / jitter). In one case charge / status restoration is possible, in the other it is impossible.]
Fault probability in digital logic is roughly
proportional to clock frequency!
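The proportionality can be made plausible with a simple latching-window model. This is a sketch under assumptions that are not in the slides: a transient pulse arrives at a uniformly random time within one clock period and is captured whenever it overlaps a short sampling window around the clock edge. The 50 ps window and the function name `capture_probability` are hypothetical; the 100 ps pulse width is the value from the slide.

```python
def capture_probability(pulse_width_s: float, window_s: float, clock_hz: float) -> float:
    """Probability that a transient pulse overlaps the sampling window.

    Assumed model: one sampling window per clock period, pulse arrival time
    uniformly distributed over the period.
    """
    period_s = 1.0 / clock_hz
    return min(1.0, (pulse_width_s + window_s) / period_s)


PULSE_WIDTH_S = 100e-12  # 100 ps source pulse, as on the slide
WINDOW_S = 50e-12        # hypothetical sampling window of a flip-flop

for clock_hz in (0.5e9, 1e9, 2e9, 4e9):
    p = capture_probability(PULSE_WIDTH_S, WINDOW_S, clock_hz)
    print(f"{clock_hz / 1e9:.1f} GHz: capture probability {p:.3f}")
```

As long as the clock period is much longer than the pulse, doubling the clock frequency doubles the capture probability in this model.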
15
Logic Structures and Fault Events
[Figure: particle radiation hitting combinational logic between input FFs and output FFs.]
Flip-flops need fault tolerance / fault
hardening first; logic close to the
outputs comes next.
16
Muller-C-Element
17
Fault Handling
Muller-C-Element
If both inputs are equal: out = outl1 = outl2
If the inputs are not equal: out = previous value
(of outl1, outl2)
Under local fault conditions on the latch outputs
(one of 2 latches false), the C-element
preserves the output condition from the charge
phase of the latch.
Essentially 3 latches!
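The behaviour described above can be sketched as a small behavioural model. This is not the circuit from the slides, only an illustration of the stated rule (follow the inputs when they agree, hold otherwise); the class name `MullerCElement` is an assumption, and a non-inverting element is assumed.

```python
class MullerCElement:
    """Behavioural model of a non-inverting Muller C-element."""

    def __init__(self, initial: int = 0) -> None:
        self.out = initial  # stored output state

    def update(self, outl1: int, outl2: int) -> int:
        """Follow the inputs when they agree, otherwise keep the old output."""
        if outl1 == outl2:
            self.out = outl1
        return self.out


c_element = MullerCElement(initial=0)
print(c_element.update(1, 1))  # both latches hold 1: output follows
print(c_element.update(1, 0))  # latch 2 upset: output keeps the previous value
print(c_element.update(0, 0))  # both latches hold 0: output follows
```

Because the element stores its last output, it acts as the third storage element next to the two latches.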
18
Fault Compensation
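[Figure: input in feeds Latch 1 and Latch 2 in parallel; their outputs outl1 and outl2 drive a Muller C-element with output out and load CL. Timing diagram v(t) of in1, in2 and clock, with the C-element alternating between keep and transmit.]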
19
Intel's Scan Path Element
20
Intel's Scan Path Element plus Fault Compensation
21
Fault Compensation in Combinational Logic
[Figure: combinational logic behind the input FFs, protected by three pairs of MC and D elements.]
22
Fault Compensation in Combinational Logic
[Figure: timing diagrams V(t) of the fault-free signal, the signal with glitch, and the delayed signal with glitch, relative to latch close ("Time left to capture!"). MC states along the time axis: no capture / hold, capture, capture.]
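The idea of comparing a signal with its delayed copy can be sketched in discrete time. The sketch assumes that the Muller C-element (MC) sees the direct signal and a copy delayed by a fixed number of samples, and that it only updates when both agree; the function name `filter_glitches` and the sample values are hypothetical.

```python
def filter_glitches(samples: list[int], delay: int, initial: int = 0) -> list[int]:
    """Feed a signal and its delayed copy into a C-element model.

    The output changes only when the direct and the delayed sample agree,
    so pulses no longer than `delay` samples never reach the output.
    """
    out = initial
    filtered = []
    for t, direct in enumerate(samples):
        delayed = samples[t - delay] if t >= delay else initial
        if direct == delayed:  # capture
            out = direct
        # otherwise: no capture / hold
        filtered.append(out)
    return filtered


glitchy = [0, 0, 0, 1, 0, 0, 0, 0]  # one-sample glitch
step = [0, 0, 1, 1, 1, 1]           # genuine transition
print(filter_glitches(glitchy, delay=2))
print(filter_glitches(step, delay=2))
```

The glitch is suppressed, while the genuine transition passes with a latency equal to the delay. That latency is the price of the scheme and eats into the time left before the latch closes.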
23
3. Repair for Memory and FPGAs
Compensation of transient faults is not
enough. Some technologies for transient
compensation can handle permanent faults, too,
but not in the long run and not in combination
with additional transient faults!
24
Memory Test and Repair
[Figure: memory array with read / write lines, lines selected by line address, columns and a spare column.]
25
Memory Test and Repair (2)
[Figure: memory array with read / write lines, lines selected by line address, columns, a spare column and a memory BIST controller.]
... is already state of the art!
26
FPGA-based Self Repair
27
In-System FPGA Repair
28
Repair Mechanism: Row / Line Shift
Little overhead for the re-configuration
process
Loss of many good CLBs for every fault
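A row shift can be sketched as a simple index remapping. The sketch assumes a CLB array with one spare row at the bottom and a single faulty row; the function names and the array dimensions are hypothetical and not taken from the slides.

```python
def row_shift_map(num_logical_rows: int, faulty_row: int) -> list[int]:
    """Map each logical row to a physical row, skipping the faulty one.

    Assumption: the array has num_logical_rows + 1 physical rows, the last
    one being the spare.
    """
    if not 0 <= faulty_row <= num_logical_rows:
        raise ValueError("faulty_row is outside the physical array")
    return [row if row < faulty_row else row + 1 for row in range(num_logical_rows)]


def good_clbs_lost(num_columns: int) -> int:
    """A single faulty CLB takes its whole row out of service."""
    return num_columns - 1


print(row_shift_map(num_logical_rows=4, faulty_row=1))
print(good_clbs_lost(num_columns=32))
```

The remapping itself is trivial to compute, which matches the low re-configuration overhead, while every fault costs all remaining good CLBs in its row.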
29
Distributed Backup CLBs
Minimum loss of functional CLBs
High effort for re-wiring requires massive
embedded computing power (32-bit CPU, 500 MHz)
30
FPGAs as a Solution ??
The granularity of re-configurable logic blocks
(CLBs) in most FPGAs is in the order of several
thousand gates.
Replacement strategies must operate at a
granularity of blocks of about 100-500
transistors for fault densities between 0.01 %
and 0.1 %.
An efficient FPGA repair mechanism requires
in-system EDA (re-placement and routing) with a
massive demand for computing power.
Example: 500 MHz Power 4 processor, run-time up
to minutes, memory about 1 KByte
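Why granularity matters can be checked with a back-of-the-envelope calculation. The sketch assumes that the quoted fault density is an independent per-transistor fault probability of 1e-4 to 1e-3, which is a simplification (an earlier slide stresses that faults are correlated). The 10,000-transistor block standing in for a coarse CLB is a hypothetical size.

```python
def block_fault_probability(transistors: int, fault_density: float) -> float:
    """Probability that a block contains at least one faulty transistor.

    Assumption: transistor faults are independent with probability
    fault_density each.
    """
    return 1.0 - (1.0 - fault_density) ** transistors


for fault_density in (1e-4, 1e-3):
    for transistors in (100, 500, 10_000):
        p = block_fault_probability(transistors, fault_density)
        print(f"density {fault_density:.0e}, {transistors:>6} transistors: "
              f"block faulty with probability {p:.3f}")
```

Under these assumptions a block of a few hundred transistors is still mostly fault-free, while a block of many thousand transistors is almost certainly faulty at the higher density, so its spare would be faulty as well.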
31
4. Fine-Granular Repair
[Figure: repair procedure overhead and functioning elements lost, plotted against the size of replaced blocks (granularity).]
32
Granularity of Replacement
33
Levels of Repair
34
Replacement in Regular Structures (e.g. for DSP)
35
Parallel Backup Transistors
[Figure: basic gate (inputs in1, in2, output out, between VDD and GND) next to the same gate with redundant parallel transistors.]
36
Configuration and Fault Isolation
[Figure: gate with backup transistors (inputs in1, in2, output out) and config. switches toward VDD (control Ap) and GND (control An); example of a stuck-on fault.]
37
The Gate-Short-Problem
[Figure: a driver feeding Load 1 and Load 2, with a gate short in one load.]
GND shorts of input gates affect the whole
fan-in network and make redundancy obsolete!
38
Gate Turn-off
39
Schematic Layout with VDD / GND Switches
Gate with parallel redundancy and fault isolation
Gate with parallel redundancy
40
Transistor-Level Overhead
Redundancy | parallel transistors | VDD / GND switches | separate gate poly lines
Overhead (cells only, estimates) | 30-40 % | 60-80 % | 100-150 %
stuck-off coverage | yes | yes | yes
stuck-on coverage | no | yes | yes
gate shorts cov. | no | no | yes
control lines | none | one wire | mult. wires
41
Duplicate Standard Cells
[Figure: duplicate standard cells Gate 1 and Gate 2 with inputs in1, in2 and output out, supplied from VDD through a controlled VDD switch (VDD1, VDD2), connected to GND.]
42
Again Fault Isolation
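[Figure: duplicate standard cells Gate 1 and Gate 2 with inputs in1, in2 and output out, supplied from VDD through a controlled VDD switch (VDD1, VDD2), connected to GND. Fault cases shown: gate input short, output VDD / GND short.]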
43
Administrated Duplicate Cells
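[Figure: Gate 1 and Gate 2 with separate supplies VDD1 / VDD2 and GND1 / GND2, connected through power switches (VDD) and GND switches, and with switched gate in / gate out connections. Control signals Act 1 and Act 2 select the active gate; example: gate short in one gate.]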
44
Cell Duplication and Power Switch
Possible for all types of cells (also flip-flops).
Granularity of partitioning for replacements
(single gates, blocks) can be selected on
demand.
Combines well with dynamic circuit
optimization.
Good coverage potential for transistor faults.
Significant overhead (above 100 %), but most
likely below Triple Modular Redundancy (TMR).
Redundancy may become exhausted and then requires
a further level of redundancy!
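The administration of a duplicated cell can be sketched as a tiny state model. This is only an illustration of the bookkeeping, assuming one active gate at a time and power enables derived from it; the class `DuplicateCell` and its methods are hypothetical, not the control logic from the slides.

```python
class DuplicateCell:
    """Two copies of a cell, only one of which is powered at a time."""

    def __init__(self) -> None:
        self.healthy = [True, True]
        self.active = 0  # index of the powered copy, None when exhausted

    def power_enables(self) -> tuple[bool, bool]:
        """Enable signals for the supply switches of copy 0 and copy 1."""
        return (self.active == 0, self.active == 1)

    def report_fault(self) -> int:
        """Mark the active copy as faulty and switch to a healthy one."""
        if self.active is not None:
            self.healthy[self.active] = False
        for index, ok in enumerate(self.healthy):
            if ok:
                self.active = index
                return index
        self.active = None
        raise RuntimeError("redundancy exhausted, a further level is needed")


cell = DuplicateCell()
print(cell.power_enables())
cell.report_fault()
print(cell.power_enables())
```

A second fault report raises an error in this model, which corresponds to the point where the local redundancy is used up.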
45
5. Gate Level Repair
[Figure: row of standard cells (gates) with a gate fault and a backup cell; insertion of the replacement cell.]
Does not work for irregular wiring schemes!
46
Configurable Backup Cell
Problem: fault isolation in case of
gate-input shorts!
47
Block-Based Repair
Technology Mapping
Columns of switching elements
48
Switching Concept
4 logic states, registered in 2 memory cells
49
Overhead
50
6. Bus Structures and Networks on Chip (NoCs)
Technology forecasts predict that nano-wires may
become the most vulnerable and unreliable
circuit elements ...
51
Faults on Irregular Interconnects
[Figure: routing tree from signal source S to sinks C, with a single fault (line break).]
52
Redundant Wiring
[Figure: routing tree with loops from signal source S to sinks C, with an extra wire and a single fault (line break).]
... plus double vias!
Problem: Classic delay calculation works well on
trees only!
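Whether a net survives a single line break can be checked with a plain graph search. The sketch below models wires as undirected edges and tests every single-edge removal; the node names and the two example nets are hypothetical and only loosely follow the figures.

```python
from collections import deque


def reachable(edges: list[tuple[str, str]], source: str) -> set[str]:
    """Return all nodes connected to source (breadth-first search)."""
    adjacency: dict[str, set[str]] = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def survives_any_single_break(edges, source, sinks) -> bool:
    """True if every sink stays connected after removing any one wire."""
    for i in range(len(edges)):
        remaining = edges[:i] + edges[i + 1:]
        if not set(sinks) <= reachable(remaining, source):
            return False
    return True


tree = [("S", "J"), ("J", "C1"), ("J", "C2")]   # plain routing tree
looped = tree + [("C1", "C2"), ("S", "C1")]     # tree plus extra wires

print(survives_any_single_break(tree, "S", ["C1", "C2"]))
print(survives_any_single_break(looped, "S", ["C1", "C2"]))
```

The check says nothing about timing: once loops exist, the net is no longer a tree, which is exactly where tree-based delay calculation stops being applicable.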
53
Buses versus NoCs
[Figure: regular network structure (NoC) built from NoC nodes, next to an irregular bus structure (SoC) connecting bus masters.]
54
Faults on Bus Structures
[Figure: bus connecting bus masters BM 1 to BM 6.]
Local defect affecting the total network
55
Bus Segmentation
[Figure: bus masters BM 1 to BM 6 connected to a bus that is divided by segment couplers (SC).]
Structure the bus into segments that can be
repaired individually!
56
The Switching Problem
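[Figure: n lines switched onto nk lines including backup lines and back onto n lines; further labels: p, 1.]
Table columns: n, k, p, switches, contr. states
Values in extraction order: 16, 9, 8, 1, 1, 16, 1, 1, 32, 33, 2, 2, 128, 65, 32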
57
Faults and Repair Actions
1. Line break: a section of a line is interrupted
   use spare wire!
2. Line short to GND: a section of a line is
   connected to GND
   use spare wire!
3. Dynamic coupling between adjacent lines
   a. Re-allocate lines in bundle
   b. Insert grounded line for decoupling
4. Bridge between lines
   a. Feed both lines with same signal
   b. Make one line floating
58
Reconfiguration for De-Coupling
2-way switches may be used!
[Figure: lines i and k exchanged by 2-way switches.]
... can help to minimize dynamic coupling faults!
59
Selection of Permutations
All single faults must be repairable by
selecting a minimum set of permutations.
Those lines that can act as replacement for most
of the others are selected for backup lines.
By permutation, also non-faulty functional lines
are re-arranged.
No permutation used for repair may map
a functional line to a faulty line.
60
Permutations for 8-Wire-Bundles
Pair-wise symmetrical: PW1, PW2, PW3. New-neighborhood: NNP1, NNP2, NNP3.
PW1 | PW2 | PW3 | NNP1 | NNP2 | NNP3
0 - 2 | 0 - 3 | 0 - 5 | 0 - 1 | 0 - 6 | 0 - 4
1 - 6 | 1 - 5 | 1 - 7 | 1 - 0 | 1 - 7 | 1 - 3
2 - 0 | 2 - 7 | 2 - 3 | 2 - 4 | 2 - 5 | 2 - 4
3 - 5 | 3 - 0 | 3 - 2 | 3 - 1 | 3 - 6 | 3 - 6
4 - 7 | 4 - 6 | 4 - 5 | 4 - 2 | 4 - 0 | 4 - 2
5 - 3 | 5 - 1 | 5 - 4 | 5 - 0 | 5 - 7 | 5 - 2
6 - 1 | 6 - 4 | 6 - 7 | 6 - 0 | 6 - 3 | 6 - 3
7 - 4 | 7 - 2 | 7 - 6 | 7 - 5 | 7 - 1 | 7 - 1
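The selection rule for permutations can be sketched as a small coverage check. The model is an assumption: a permutation maps each line index to the physical wire it is routed over, a fault is repairable if some permutation routes no functional line over the faulty wire, and the identity stands for "no reconfiguration". PW1 and PW2 are the first two entries of each row in this transcript's 8-wire table (0 - 2 read as "line 0 to wire 2"); the choices of backup lines are hypothetical.

```python
IDENTITY = list(range(8))
PW1 = [2, 6, 0, 5, 7, 3, 1, 4]  # first entry of each table row, as extracted
PW2 = [3, 5, 7, 0, 6, 1, 4, 2]  # second entry of each table row, as extracted


def find_repair(fault: int, backups: set[int], permutations: list[list[int]]):
    """Index of the first permutation that keeps all functional lines off the faulty wire."""
    functional = [line for line in range(8) if line not in backups]
    for index, perm in enumerate(permutations):
        if all(perm[line] != fault for line in functional):
            return index
    return None


def unrepairable_faults(backups: set[int], permutations: list[list[int]]) -> list[int]:
    """Single wire faults for which no permutation in the set works."""
    return [fault for fault in range(8)
            if find_repair(fault, backups, permutations) is None]


print(unrepairable_faults({0, 1, 3, 4}, [IDENTITY, PW1]))
print(unrepairable_faults({0, 1}, [IDENTITY, PW1, PW2]))
```

In this model each permutation can cover at most as many faulty wires as there are backup lines, so fewer backup lines require more permutations, and with them more switches and control states.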
61
8 Wires Permutations and Replacement
[Figure: permutations with the selected backup wires marked.]
2 lines selected for backup!
62
8 Wires Permutations and Replacement
[Figure: permutations with the selected backup wires marked.]
4 lines selected for backup!
63
Overhead / Coverage for 8-Line-Bundle
Faults | spare lines (out of 8) / switches: 0 / 16 | 1 / 48 | 2 / 32 | 3 / 32 | 4 / 32
Single line fault | - | (lost) | (lost) | (lost) | (lost)
Dyn. coupl. fault | (lost) | (lost) | (lost) | (lost) | (lost)
Double line faults | - | - | 20 % | 30 % | 100 %
Note: The number of switches is reduced by a
factor of 2 if full 2-way switches with 2 inputs
/ 2 outputs are used!
64
Administration Scheme
[Figure: two segment couplers (SC) with switches on the in / out lines 0 to 7 (sides A and B), decoders for the config bits (C1, C2), and config logic with matching between the two couplers.]
65
7. A Lot of Work to Do
Logic fault diagnosis in the field
Efficient logic self repair
Redundancy supervision and management
Resource management under fault conditions
Repair functions for interconnects
Overall system-level fault management