Fault Tolerance in Field Programmable Gate Arrays

1
Fault Tolerance in Field Programmable Gate Arrays

Nehir Sönmez
12/5/2005

2
FPGA
Tracks
Logic Element

Each logic element outputs one data bit.
Interconnect programmable between elements.
Interconnect tracks grouped into channels.

3
FPGA (Xilinx XC4000)
4
FPGA Architecture

Basic Building Blocks
Programmable (Configurable) Logic Blocks (PLBs)
Input/Output Blocks (IOBs)
Programmable Switch Matrices (PSMs)
Routing Resources: 50% of die area
Wires
Configuration Interconnect Points (CIPs)

5
Basic Logic Block Architecture
6
Reconfigurable Hardware
[Figure: logic element with inputs A, B, C, D and output Out, with a truth table over A, B, C, D]

Each logic element operates on four one-bit inputs.
Output is one data bit.
Can perform any Boolean function of four inputs:
2^(2^4) = 2^16 = 64K functions!
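
A minimal sketch of why a four-input logic element gives 64K functions: the element can be modelled as a 16-entry truth table packed into one integer. The names `lut4`, `NUM_FUNCTIONS` and `AND4` are illustrative assumptions, not taken from the slides or from any vendor tool.

```python
from itertools import product

def lut4(config: int, a: int, b: int, c: int, d: int) -> int:
    """Evaluate a 4-input LUT whose 16-bit truth table is packed into `config`."""
    index = (a << 3) | (b << 2) | (c << 1) | d  # truth-table row selected by the inputs
    return (config >> index) & 1

# One function per distinct 16-bit truth table.
NUM_FUNCTIONS = 2 ** (2 ** 4)

# Example: build the configuration word for a 4-input AND.
AND4 = 0
for a, b, c, d in product((0, 1), repeat=4):
    if a & b & c & d:
        AND4 |= 1 << ((a << 3) | (b << 2) | (c << 1) | d)
```

Reconfiguring the element then amounts to loading a different 16-bit word.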
7
Reconfiguration

Reconfiguration methodology
Static
Partially static (partial reconfiguration)
Dynamic

8
RTR: CPU + FPGA

CPU re-programs FPGA
Useful for
Co-processing
Reducing CPU size and speed
Increasing system performance
Lower system power
Reducing system cost
Enabling Defect / Fault Tolerance
Normal system operation involves reconfiguring
FPGAs for different functions, controlled by a
module external to the FPGAs,
i.e. an embedded microprocessor with memory to
store configurations.
Extend the tasks of this processor to also control
the test, diagnosis, and fault-tolerance functions.

9
Why Fault Tolerance in FPGAs?

Decreasing trace sizes increase faults
Increasing speed increases faults
Inherently reconfigurable, should be perfect for
FT
Graceful degradation would be nice

10
Early Efforts at Reconfiguration

Grew out of yield enhancement
PROM technology
Reconfiguration techniques used spare CLBs to
improve yield 50%
Did not address fault detection, location, or any
runtime faults
Toward Runtime Reconfiguring: Emmert and Bhatia
Used replacement CLB chains to reassign CLBs to
avoid faulty CLBs
Broke problem into a series of incremental (and
easier) reroutings
Did not address interconnect faults
Not runtime yet, allowed use of damaged components

11
First Runtime Reconfiguring

Hanchek and Dutt
Created first runtime solution
Changed interconnect rules to simplify
Guaranteed rerouting in linear time
Only supported one fault per row
Lach, Mangione-Smith, Potkonjak
Primary problem: computing power
Use tiling to pre-compute configurations
Configurations each avoid different CLBs
At runtime choose one, based on the fault
location
Like TMR
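
The tiling idea above can be sketched as a table lookup: each precomputed configuration of a tile leaves a different CLB unused, and at run time the one whose unused CLB coincides with the fault is selected. The tile size, the configuration names and the `choose_configuration` helper are assumptions made for illustration only.

```python
from typing import Dict, Optional

# Hypothetical tile with four CLBs; each precomputed configuration avoids one of them.
PRECOMPUTED: Dict[str, Dict[str, int]] = {
    "cfg_spare_0": {"unused": 0},
    "cfg_spare_1": {"unused": 1},
    "cfg_spare_2": {"unused": 2},
    "cfg_spare_3": {"unused": 3},
}

def choose_configuration(faulty_clb: int,
                         configs: Dict[str, Dict[str, int]]) -> Optional[str]:
    """Return the name of a precomputed configuration that avoids `faulty_clb`."""
    for name, cfg in configs.items():
        if cfg["unused"] == faulty_clb:
            return name
    return None  # no precomputed configuration avoids this location
```

No placement or routing is computed at run time, which is the point of the approach; the cost is that only the fault patterns anticipated at design time are covered.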

12
Tile Reconfiguration Issues

Solutions probably have different timing
Finding worst case is combinatorial
Must assume worst case during design
Adjusting tile size changes redundancy
Applicable to interconnect faults in tiles
Does not address interconnect faults at tile
boundaries, does not detect, locate

13
Runtime Reconfiguring, Take III

Mahapatra and Dutt
Runtime router based on static covering chains
created during design
Guaranteed solution if enough working CLBs exist
Still no solution to detection/location

14
What about Fault Detection?

Grew out of yield enhancement
Originally BIST used to verify FPGAs
Products added BIST to POST
Products use BIST to self-diagnose and provide
more information to users

15
Runtime Fault Detection

Shnidman, Mangione-Smith, Potkonjak
Proposed running BIST simultaneously with primary
operation
Created a specification for an FPGA that would
allow Fault Scanning
Reserves 2 CLB columns

16
Fault Scanning
17
Fault Scanning Issues

Hardware redundancy: 2/N_columns
Much lower than TMR/NMR
Test shuttles back-and-forth
Errors not detected immediately
Detection latency is linear in N_columns
Doubled size of state for CLBs and interconnect
matrices
Only works on bus-based interconnect

18
What would runtime FT look like?

Use inherent reconfigurable nature
Lower overhead than TMR/NMR
Involve runtime relocation and rerouting
Latency based on scanning time
Graceful degradation
Work on existing hardware
CLB and interconnect faults
Handle SEUs as well as permanent faults

19
Runtime FT in FPGAs

Abramovici, Emmert, Hamilton, Skaggs, Stroud,
Verma, Wijesuriya
Meets all above criteria
Implemented example on existing HW
Added extensive use of controller
First time that FT is programmable
Algorithm (not design) for HW FT!
Approach
1) Isolate defects
2) Program device around defects

20
Roving Self-Test Areas

Reserve 2 CLBs in each direction
Hardware redundancy (overhead, OV):
OV = 1 - (N-2)^2 / N^2
The overhead decreases with N.
For example, for N = 20, OV is 19%, but for N = 40,
OV is only about 10%.
STARs are invisible to working CLBs
Through interconnect unused by STAR
Tricky process for moving functionality
Must stop the clock during CLB move
Filled with BISTERs
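
The overhead figures on this slide can be checked with a few lines of arithmetic. This is a sketch assuming an N x N array of CLBs with two rows and two columns reserved for the STARs; `star_overhead` is a hypothetical helper name.

```python
def star_overhead(n: int) -> float:
    """Fraction of an n x n CLB array reserved for the roving STARs."""
    if n <= 2:
        raise ValueError("array must be larger than the reserved area")
    working = (n - 2) ** 2   # CLBs left for the application
    return 1 - working / n ** 2

for n in (20, 40):
    print(f"N = {n}: overhead = {star_overhead(n):.2%}")
```

For N = 20 this gives 19%, and for N = 40 it gives 9.75%, which the slide rounds to 10%.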

21
Test and Reconfiguration Controller (TREC)

TREC accesses an FPGA using its boundary-scan
mechanism
(access is transparent to the normal function of
the chip)
Accesses BISTERs, uses TCK (test clock)
TREC uses run-time reconfiguration to rove the STARs
across the chip and to reconfigure them for different
test and diagnosis operations. TREC initiates the
BISTERs and analyzes their test results. TREC has
the capability to stop the system operation for
short intervals for safe relocation.
When faults are detected, TREC starts the diagnosis
process. TREC also keeps track of the status of
FPGA resources, so that defective hardware is
avoided when part of the working area is
relocated into the previous STAR position.
If the area contains a Partially Usable Block (PUB),
TREC determines if the function of the PLB (to be
moved into the PUB location) matches the functions
that the PUB can correctly perform.
If a PLB or a wire may not be used, then TREC
determines configuration changes to the working
area so that the defective resources are bypassed
and replaced by fault-free ones.

22
What's a BISTER?

6 CLBs that test each other
Tiled into STARs
Allows simultaneous testing of CLBs

23
BISTER

A STAR consists of several disjoint tiles; the PLBs
in each tile form a BISTER.
A BISTER contains a Test Pattern Generator (TPG)
applying test patterns to two identically
configured blocks under test (BUTs),
whose outputs are compared by an Output Response
Analyser (ORA). The ORA latches and reports
mismatches as test failures.
Start/Reset and Pass/Fail: two interface signals
accessed via the boundary-scan mechanism.
Start/Reset is used to initiate the BIST sequence
and to reset the TPG and ORA. The test result
Pass/Fail is captured in a FF which is part of a
scan register.
Two more inputs: TCK and a control input for
scanning out the Pass/Fail results from each
BISTER.

Initial BISTER configuration checks the proper
operation of the scan register, inducing
mismatches by comparing BUTs with different
configurations.
This protects against the case of all ORA FFs
being stuck at the Pass value.
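
The TPG / BUT / ORA arrangement can be illustrated with a small behavioural model. This is only a sketch: the BUTs are modelled as LUTs indexed by the counter value, and the stuck-at-0 truth-table row is an assumed fault chosen for illustration. All function names are hypothetical.

```python
from typing import Callable

def make_lut(config: int) -> Callable[[int], int]:
    """Fault-free LUT: bit `pattern` of `config` is the output."""
    return lambda pattern: (config >> pattern) & 1

def make_stuck_at_0_lut(config: int, stuck_row: int) -> Callable[[int], int]:
    """Same LUT, but one truth-table row always reads 0 (illustrative fault)."""
    return lambda pattern: 0 if pattern == stuck_row else (config >> pattern) & 1

def run_bister(but_a: Callable[[int], int],
               but_b: Callable[[int], int],
               n_inputs: int) -> bool:
    """Return True for Pass, False for Fail."""
    fail_latch = False                     # ORA flip-flop, cleared by Start/Reset
    for pattern in range(2 ** n_inputs):   # TPG: n-bit counter, exhaustive patterns
        if but_a(pattern) != but_b(pattern):
            fail_latch = True              # ORA latches any mismatch
    return not fail_latch

XOR4 = 0x6996  # truth table of a 4-input XOR
good = make_lut(XOR4)
faulty = make_stuck_at_0_lut(XOR4, stuck_row=1)
```

Here `run_bister(good, good, 4)` passes and `run_bister(good, faulty, 4)` fails. Comparing two BUTs loaded with different configurations also fails, which is the kind of deliberately induced mismatch the initial configuration uses to check that the ORA flip-flops are not stuck at Pass.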

24
The PLB

Typical structure of a PLB: a memory
module, a register, and a combinational output
module.
The memory block can implement combinational
look-up tables (LUTs) or RAM.
PLBs also contain special-purpose logic for
arithmetic functions (counters, adders,
multipliers, etc.). The register can be configured
as flip-flops or latches with programmable
clock-enable, preset/clear, and data selector
functions. The BUTs are repeatedly configured to
be exhaustively tested in every mode of operation.

The configuration of the TPG also changes when a
new BUT configuration requires different
patterns.

25
Functionality and Size of ORA/TPG

To test combinational functions implemented by a
LUT with n inputs, the TPG is a counter that
generates all possible 2^n vectors.
To test the RAM, the TPG is a state machine
generating standard RAM test sequences, which are
known to be exhaustive for the fault models
specific to RAMs, such as pattern-sensitive
faults.
The ORA needs to be reconfigured when the new mode
of operation of the BUT involves a different number
of outputs.
Since a BISTER provides complete testing only for
its BUTs, we have to reconfigure every BISTER
several times so that every PLB will be a BUT in
at least one configuration.
Test time for a BISTER (total roving time) is
mostly reconfiguration time (much larger than the
BIST time).
To reduce the latency of the procedure, try to
minimize the number of PLBs in a BISTER.
The number of PLBs for an ORA and for a TPG
depends on the target PLB architecture. Usually
the number of outputs in a PLB is smaller than
its number of inputs. Since a TPG must provide
exhaustive patterns for BUTs, more than one PLB
is needed to construct the TPG.
In this example, a TPG needs three PLBs and an ORA
only one.

26
BIST: built-in self test

Six floor plans of a BISTER tile
(T = TPG cell, B = BUT cell, O = ORA cell).
A BISTER tile may contain additional spare PLBs
to help reduce the use of global horizontal
routing resources by the internal BISTER
connections.
We arrange the number of spare PLBs so that a
tile will have an even number of PLBs,
symmetrically distributed between the two columns
of the V-STAR.
The goal: systematically rotate the functions of
the PLBs, so that eventually every PLB in the
tile is completely tested twice, each time being
compared with a different BUT.
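
One way to see that such a rotation exists is to enumerate a cyclic schedule for a six-PLB tile (three TPG cells, two BUTs, one ORA, no spares). The schedule below is an assumption chosen for simplicity; it ignores the physical floor-plan constraints and is not claimed to match the six floor plans on the slide.

```python
from typing import Dict, List, Set

TILE_SIZE = 6  # assumed: 3 TPG + 2 BUT + 1 ORA, no spare PLBs

def rotation_schedule(tile_size: int = TILE_SIZE) -> List[List[str]]:
    """Configuration k puts the BUTs at k and k+1, the ORA at k+2, TPG cells elsewhere."""
    schedule = []
    for k in range(tile_size):
        roles = {}
        roles[k % tile_size] = "B"
        roles[(k + 1) % tile_size] = "B"
        roles[(k + 2) % tile_size] = "O"
        for j in range(3, tile_size):
            roles[(k + j) % tile_size] = "T"
        schedule.append([roles[p] for p in range(tile_size)])
    return schedule

def but_partners(schedule: List[List[str]]) -> Dict[int, Set[int]]:
    """For each PLB, collect the PLBs it was compared against while acting as a BUT."""
    partners: Dict[int, Set[int]] = {p: set() for p in range(len(schedule[0]))}
    for roles in schedule:
        a, b = [p for p, role in enumerate(roles) if role == "B"]
        partners[a].add(b)
        partners[b].add(a)
    return partners
```

With this schedule every PLB is a BUT in exactly two of the six configurations and is compared with a different partner each time, which is the property the slide asks for.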

27
Claims

Claim 1: Any single faulty PLB is guaranteed to
be detected in at least two BISTER
configurations.
Proof: The faulty PLB is a BUT in two BISTER
configurations, where its exhaustive input
patterns are produced by a fault-free TPG, and
its outputs are compared with a fault-free BUT by
a fault-free ORA. Hence no fault (single or
multiple) in the BUT can escape
detection in these two configurations.
Claim 2: Except for a few pathological cases, any
pair of faulty PLBs is guaranteed to be detected
in at least one BISTER configuration.
Claim 3: Any combination of faulty PLBs is
detected in at least one BISTER configuration.
If the number of PLBs in a STAR is not a multiple of
the number of PLBs in a tile, then the leftover
PLBs that could not make up a BISTER will be
grouped with some of the PLBs already tested in
the adjacent tile, so that every PLB in the STAR
will eventually be part of a BISTER.

28
PLB Diagnosis

When failures are detected in the STAR,
additional configurations are downloaded to
locate the faulty PLB(s). Separate Pass/Fail
results from independent BISTERs provide initial
sets of suspects.
1) Consider all PLBs in a failing BISTER tile as
suspects.
2) Create new BISTERs that divide the suspected PLBs
into subsets that are separately tested.
3) Repeat the procedure until all faulty PLBs are
identified.
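
The divide-and-retest procedure can be sketched as a recursive split. Here `bister_fails` is a stand-in for downloading a new BISTER configuration around a subset of suspects and reading its Pass/Fail bit; both it and `locate_faulty` are hypothetical names. Splitting in halves is an assumption, and the sketch ignores placement constraints such as suspects that share a row.

```python
from typing import Callable, List, Sequence

def locate_faulty(suspects: Sequence[str],
                  bister_fails: Callable[[Sequence[str]], bool]) -> List[str]:
    """Return the faulty PLBs among `suspects` by repeated subdivision."""
    if not suspects or not bister_fails(suspects):
        return []                       # this subset passed: all its PLBs are cleared
    if len(suspects) == 1:
        return list(suspects)           # a failing subset of one PLB is the culprit
    mid = len(suspects) // 2
    return (locate_faulty(suspects[:mid], bister_fails)
            + locate_faulty(suspects[mid:], bister_fails))

# Illustrative oracle: pretend PLB "C" is the only faulty one.
truly_faulty = {"C"}
oracle = lambda subset: any(p in truly_faulty for p in subset)
```

Calling `locate_faulty(list("ABCDEF"), oracle)` narrows six suspects down to `["C"]`. Each oracle call corresponds to at least one reconfiguration, so the number of calls is what drives diagnosis time.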

29
PLB Diagnosis

Ex.: The V-STAR has a failing BISTER with six
suspected PLBs.
Split them into two sets, {A, B} and {C, D, E, F},
included in separate BISTERs whose other PLBs were
already shown to be fault-free.
Next, assume the upper BISTER passes all its
tests and the lower one fails.
Split the remaining suspects into two sets,
{C, D} and {E, F}, included in separate BISTERs.
Next, assume E and F are found to be
fault-free.
Now the remaining suspects, C and D, reside in
the same row and may not be included in separate
BISTERs. To complete the diagnosis, we
reconfigure to bring the H-STAR over the row with C
and D, which may now be tested in separate
horizontal BISTERs.

Eventually all faulty PLBs in the V-STAR are
precisely located, along with their failing modes of
operation.
TREC uses these data to assess whether the faulty
PLBs may be used as PUBs in subsequent system
operation.

30
Xilinx XC4000 Routing
31
BISTER for the Interconnect

The programmable interconnect network consists of
wire segments of different length that can be
connected via configuration interconnect points
(CIPs).
A similar approach is used to detect any possible
short in the fault model:
the two BUTs are replaced by two groups of wires
under test (WUTs). A WUT may be composed of
several wire segments connected by closed CIPs.
To check local interconnect, WUTs may also go
through PLBs configured as identity functions
(passing their inputs directly to their outputs).
The TPG applies all possible 2^n test patterns to
every group of n WUTs.
Generating exhaustive patterns with an n-bit
counter requires fewer PLBs than generating the
two n-bit walking patterns (contained in the
exhaustive set).
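
The remark that the walking patterns are contained in the exhaustive set is easy to verify. A sketch, assuming the "two walking patterns" are the walking-one and walking-zero sequences over n wires; the helper names are illustrative.

```python
from typing import List, Tuple

def exhaustive_patterns(n: int) -> List[int]:
    """All 2^n patterns, as produced by an n-bit counter."""
    return list(range(2 ** n))

def walking_patterns(n: int) -> Tuple[List[int], List[int]]:
    """Walking-one and walking-zero sequences over n wires."""
    mask = (1 << n) - 1
    walking_one = [1 << i for i in range(n)]            # a single 1 moving across the wires
    walking_zero = [mask ^ (1 << i) for i in range(n)]  # a single 0 moving across the wires
    return walking_one, walking_zero

ones, zeros = walking_patterns(4)
covered = set(ones + zeros) <= set(exhaustive_patterns(4))
```

The counter applies 2^n patterns instead of 2n, but as a circuit it is a single regular structure, which is consistent with the slide's point about PLB count.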

32
The Roving Mechanism

Once a STAR has been tested, the STAR roving
process continues by relocating the working area
adjacent to the STAR over the current STAR
position, and by reconfiguring the just-released
working area as a new STAR.
Initial state: A, D, E, and F are working PLBs; B
and C are part of the V-STAR.
Assume all test activity has stopped without
detecting any faults, while the normal operation
continues. TREC performs the following
operations:
1) Configure B and C with the functions of D and E
2) Stop the system clock
3) If necessary, copy the state of D and E into B
and C
4) Reconfigure to maintain the correct
interconnections among working PLBs
5) Restart the system clock
6) Configure the BISTER tiles in the new STAR and
restart testing.
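
The sequence of TREC operations can be sketched as a function over a toy model of the array. Everything here is an assumption for illustration: the `PLB` record, the `rove_step` name, and the log strings that stand in for clock control and rerouting, which in a real device would be boundary-scan reconfiguration operations.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class PLB:
    function: Optional[str] = None  # name of the configured function
    state: int = 0                  # flip-flop contents, reduced to one integer

def rove_step(plbs: Dict[str, PLB], star: List[str],
              neighbours: List[str], log: List[str]) -> List[str]:
    """Move the working PLBs in `neighbours` onto the STAR cells in `star`."""
    for dst, src in zip(star, neighbours):
        plbs[dst].function = plbs[src].function  # configure STAR cells with working functions
    log.append("clock stopped")                  # stop the system clock
    for dst, src in zip(star, neighbours):
        plbs[dst].state = plbs[src].state        # copy state while the clock is stopped
    log.append("interconnect rerouted")          # placeholder for the routing update
    log.append("clock restarted")                # restart the system clock
    for src in neighbours:
        plbs[src].function = "BISTER"            # released cells become the new STAR
        plbs[src].state = 0
    return list(neighbours)                      # position of the new STAR
```

Only the state copy and rerouting sit between the clock stop and restart, which reflects the slide's point that the system is halted for short intervals only.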

33
Graceful Degradation

In most conventional FT methods, faults are
detected within the working part of the system,
and they must be located and bypassed as fast as
possible to restart the normal operation as soon
as possible.
With roving STARs, there are no severe real-time
constraints on fault diagnosis and reconfiguration.
Since the normal system operation is continuing
in the working area, we can allow much more time
for accurate diagnosis or for computing
fault-bypassing reconfigurations, compared with
an approach where normal operation is interrupted
for diagnosis and reconfiguration.
The spare resources are always present in the
neighborhood of the located fault. Thus fault
reconfiguration paths may be much shorter than in
other methods, and we can tolerate fault clusters
that the static FT methods cannot.

34
PUB: Partially Usable Block

Another novel FT concept is the PUB, which allows
faulty PLBs to be safely used as spares
whenever possible, thus increasing the effective
spare capacity and enabling an extended mission
life-span.
This idea may be extended to even a lower level,
such as reusing only a subset of the FFs or a
subset of the outputs of a faulty PLB.
The testing strategy guarantees that the PUB
works correctly in every non-failing mode of
operation, because it has passed an exhaustive
test in every such mode.
Ex. a PLB with faulty flip-flops may still be
safely used for combinational logic, or a PLB
with a faulty multiplier may still be safely used
as an adder.
The reuse of defective hardware resources gives a
more graceful degradation, compared to throwing away
the defective blocks.
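
The PUB decision reduces to a set comparison: a faulty PLB may host a function if none of the modes that function needs are among the PLB's failing modes. A minimal sketch; the mode names and the `can_host` helper are hypothetical.

```python
from typing import Iterable

def can_host(required_modes: Iterable[str], failing_modes: Iterable[str]) -> bool:
    """True if a PUB with the given failing modes can safely implement the function."""
    return not (set(required_modes) & set(failing_modes))

# A PLB whose flip-flops failed diagnosis, but whose LUT passed every test.
pub_failing = {"flip_flop"}
ok_for_combinational = can_host({"lut"}, pub_failing)
ok_for_registered = can_host({"lut", "flip_flop"}, pub_failing)
```

Here `ok_for_combinational` is true and `ok_for_registered` is false. The check is only as trustworthy as the diagnosis data behind `failing_modes`, which is why the exhaustive per-mode testing matters.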

35
Tolerating an Unusable Faulty PLB

Assume C is a faulty PLB.
If the LUT in C is fault-free, then the
combinational function of E may still be
transferred to C (the PUB C is as good as a
fault-free PLB for the function of E).
Now assume C has a faulty LUT, so it may not replace
E. Although the V-STAR will be placed in the columns
of D and E, we will use D as a working PLB
replacement for E and exclude it from the STAR.
After values are transferred from D to B, D may
be configured with the function of E.
Although E is now part of the STAR, it cannot be
part of any BISTER tile. Because of the faulty
PLB, we lost one of the spare cells in this row,
and from now on the PLBs in this row can no
longer be tested by the V-STAR.
But here we can bring the H-STAR to test these PLBs.

36
Conclusions

An on-line FPGA testing, diagnosis, and
fault-tolerance approach, applicable to any FPGA
supporting run-time reconfiguration.
Self-testing goes on without disturbing the
normal system activity in the rest of the chip.
The roving of the STARs periodically brings every
section of the device under test. It guarantees
complete testing of both PLBs and interconnect,
and does not require any part of the chip to be
fault-free.
A STAR consists of several independent BISTERs,
which concurrently test disjoint tiles of the
STAR.
A BISTER is repeatedly configured using a
rotating strategy so that every PLB in its tile
is completely tested in every mode of operation,
and practically any combination of faulty PLBs is
guaranteed to be detected.

37
Conclusions

All test-related activities are done without
disturbing the normal system operation.
A dynamic FT method, where both spare
interconnect and spare PLB resources needed to
bypass a fault are allocated only after the fault
has been detected and diagnosed.
The spare resources are always present in the
neighborhood of the located fault.
In addition to reporting a faulty PLB, the method
also identifies its failing internal module or its
failing mode of operation. This allows the defective
PLB to be used in the system logic, provided that
its intended operation is not affected by the
identified faults.
This results in a more graceful degradation and
longer mission life.