Title: A Combinatorial Group Testing Method for FPGA Fault Location
1. A Combinatorial Group Testing Method for FPGA Fault Location
Ronald F. DeMara, Carthik A. Sharma
University of Central Florida
2. Introduction
- Field Programmable Gate Arrays
  - Gate-array-based reconfigurable architecture
  - Matrix of Logic Cells (Look-Up Tables) surrounded by peripheral I/O cells
- Capabilities
  - Runtime reconfiguration
  - On-chip processor core
  - Millions of gate-equivalent logic elements
- Millions of FPGA devices produced annually, most SRAM-based
- Used in mission-critical applications
  - Remote systems: hazardous environments
  - Space applications: satellites, probes, and shuttles
3. Group Testing Algorithms
- Origin: World War II blood testing
  - Problem: test samples from millions of new recruits
  - Solution: test blocks of samples before testing individual samples
- Problem Definition
  - Identify subset Q of defectives from set P
  - Minimize number of tests
  - Test v-subsets of P
  - Form suitable blocks
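The block-testing idea above can be sketched as an adaptive halving search. This is a minimal illustration, not the method of the presentation: it assumes exactly one defective item, and the names `find_single_defective` and `block_is_defective` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def find_single_defective(
    items: Sequence[int],
    block_is_defective: Callable[[List[int]], bool],
) -> Tuple[int, int]:
    """Locate the one defective item by testing blocks instead of individuals.

    Assumes exactly one defective item exists in `items`.
    Returns (defective item, number of block tests performed).
    """
    suspects = list(items)
    tests = 0
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]  # block under test
        tests += 1
        if block_is_defective(half):
            suspects = half                    # defective is inside the block
        else:
            suspects = suspects[len(half):]    # block exonerated
    return suspects[0], tests


# Hypothetical population of 1024 samples with sample 700 defective.
defective = 700
found, tests = find_single_defective(
    list(range(1024)), lambda block: defective in block
)
print(found, tests)
```

With one defective among 1024 samples, ten block tests replace up to 1024 individual tests.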
4. Previous Work
- Pre-compiled Column-Based Dual FPGA architecture [Mitra04]
  - Autonomous detection, repair by shifting pre-compiled columns
  - Isolation using distributed CED-checkers and blind reconfiguration attempts
- Overview of Combinatorial Group Testing and Applications [Du00]
  - Provides taxonomy and general algorithms for applying CGT
  - Examples of CGT applications: DNA clone library filtering, vaccine screening, computer fault diagnosis, etc.
- CGT Enhanced Circuit Diagnosis [Kahng04]
  - Presents doubling, halving, etc. for circuit fault diagnosis using BIST, CGT
  - Requires ability to test resources individually
- Chinese Remainder Sieve technique [Eppstein05]
  - Efficient non-adaptive and two-stage CGT based on prime number driven test formation
  - Improved algorithms for practical problem sizes (n < 10^80) with small number of defectives (d < 4)
5. Fault-Handling Techniques
[Figure: taxonomy of fault-handling techniques]
- Device Failure Characteristics
  - Duration: Transient (SEU); Permanent (SEL, Oxide Breakdown, Electron Migration, LPD)
  - Target: Device Configuration; Processing Datapath
- Approaches: BIST, CGT-Based, Repetitive Readback, TMR, STARS, CED, Dueling
- Method stages: Detection, Isolation, Diagnosis, Recovery
- Entries in the figure: Duplex Output Comparison; Supplementary Testbench; Cartesian Intersection; Bitwise Comparison; Majority Vote; Repetitive Intersections; Fast Run-time Location; Worst-case Clock Period Dilation; unnecessary; Evolutionary Algorithm using Intrinsic Fitness Evaluation; Replicate in Spare Resource; Select Spare Resource; Invert Bit Value; Ignore Discrepancy
6. Isolation Problem Outline
- Objectives
  - Locate faulty logic and/or interconnect resource: a single stuck-at fault model is assumed
  - Online Fault Isolation: device not entirely removed from service
- Features
  - Runtime Reconfiguration: FPGA resources configured dynamically
  - Utilize Runtime Inputs: avoid special test-vectors, improve availability
- Constraints
  - Use pre-designed configurations defined by target application
  - Subsets under test have constant resource utilization range for a given isolation problem
  - Resource grouping influences fault articulation: resource-mapping and input vector might mask hardware faults
  - Do not use specialized block designs
  - Runtime reconfiguration limited to column-swapping
  - Non-reasonable algorithm: tests may be repeated without gaining new isolation information
7. Fault Location Using Dueling
- The set of all competing configurations is represented by S.
- Set C_k represents the resources utilized by configuration k.
- Each competing configuration k, 1 ≤ k ≤ |S|, has a unique binary Usage Matrix U_k, 1 ≤ k ≤ p.
  - Elements U_k(i,j), 1 ≤ i ≤ m, 1 ≤ j ≤ n, where m and n represent the rows and columns in the device layout respectively.
  - Elements U_k(i,j) = 1 denote the usage of resource (i, j) by C_k.
- The History Matrix H, with elements H(i,j), 1 ≤ i ≤ m, 1 ≤ j ≤ n, is an integer matrix used to represent the relative fitness of individual resources.
  - H(i,j) provides instantaneous relative fitness values of resources.
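As a rough sketch of these definitions, a usage matrix can be built directly from a resource set. The function name `usage_matrix` and the representation of C_k as a set of 1-based (row, column) pairs are assumptions made for illustration only.

```python
from typing import Iterable, List, Tuple


def usage_matrix(
    resources: Iterable[Tuple[int, int]], m: int, n: int
) -> List[List[int]]:
    """Build the binary m-by-n usage matrix U_k from the resource set C_k.

    `resources` holds 1-based (row, column) pairs, following the
    1 <= i <= m, 1 <= j <= n indexing of the slide (assumed representation).
    """
    U = [[0] * n for _ in range(m)]
    for i, j in resources:
        if not (1 <= i <= m and 1 <= j <= n):
            raise ValueError(f"resource ({i}, {j}) lies outside the {m}x{n} device")
        U[i - 1][j - 1] = 1  # element (i, j) = 1 means C_k uses resource (i, j)
    return U


# Hypothetical configuration using three resources of a 3x4 device.
C_k = {(1, 2), (2, 2), (3, 4)}
U_k = usage_matrix(C_k, m=3, n=4)
for row in U_k:
    print(row)
```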
8. Dueling Example
H(i,j) at t = 0:
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

U2:
0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1 0 0
0 0 0 1 0 0 0 0 0 0
0 0 1 0 0 1 1 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

U1:
0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 1 1 0 0 0
0 0 1 1 0 0 1 0 0 0
0 0 1 0 1 0 0 0 0 0
0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

H(i,j) at t = 2:
0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 0 0 0
0 0 2 1 0 0 1 0 0 0
0 0 1 0 1 1 0 1 0 0
0 0 1 1 0 1 0 0 0 0
0 0 1 0 0 1 1 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

- H(i,j) changes after C1 and C2 are loaded
- U1 and U2 are the corresponding Usage Matrices
- (3,3) is identified as the faulty resource
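The numbers in this example are consistent with adding each loaded configuration's usage matrix to the history matrix and then suspecting the resource with the highest count. The sketch below reproduces that arithmetic. The additive update rule is inferred from the matrices shown rather than stated on the slide, and the helper name `add_usage` is hypothetical.

```python
ZERO_ROW = [0] * 10

# Usage matrices transcribed from the example (10x10 device layout).
U1 = [
    ZERO_ROW,
    [0, 0, 0, 1, 0, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    ZERO_ROW, ZERO_ROW, ZERO_ROW, ZERO_ROW, ZERO_ROW,
]
U2 = [
    ZERO_ROW,
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    ZERO_ROW, ZERO_ROW,
]


def add_usage(H, U):
    """Return a new history matrix with usage matrix U added element-wise."""
    return [[h + u for h, u in zip(h_row, u_row)] for h_row, u_row in zip(H, U)]


H = [[0] * 10 for _ in range(10)]  # H at t = 0
for U in (U1, U2):                 # C1 and C2 are loaded in turn
    H = add_usage(H, U)

peak = max(max(row) for row in H)
suspects = [
    (i + 1, j + 1)                 # report 1-based (row, column) positions
    for i, row in enumerate(H)
    for j, value in enumerate(row)
    if value == peak
]
print(suspects)  # [(3, 3)]
```

Resource (3,3) is the only one used by both configurations, so it alone reaches a count of 2.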
9. Modified Halving
- Initially all H(i,j) = 0
- Selection process can be adaptive
- Fitness augmentation can be non-linear
- Columns can be swapped with any other columns
10. FPGA Arrangement for Dueling
- Configurations in Population
  - C = CL ∪ CR
  - CL: subset of left-half configurations
  - CR: subset of right-half configurations
  - |CL| = |CR| = |C|/2
11. Isolation Progress without Halving
- Without Halving
  - Initially S = 20,000
  - Resource Utilization = 40%
  - Number of suspected faulty elements constant at 36 after 23 iterations
  - No subsequent improvement due to lack of differentiating information between competing configurations
12. Dueling with Modified Halving
- Dueling with Halving
  - Halving works by swapping half the used columns with unused ones
  - Halving progressively reduces the size of the set of suspected faulty elements
  - Isolation proceeds until a single faulty element is isolated
  - Fault isolated after 19 iterations
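One way to picture column-swap halving is the simplified sketch below. It is an interpretation rather than the presented algorithm: it assumes a single faulty column among the used columns, fault-free spare columns, a fault that always articulates when its column is in use, and a trial configuration derived from the same baseline on every iteration. The name `halving_isolation` is hypothetical.

```python
from typing import Iterable, Tuple


def halving_isolation(
    num_columns: int, used: Iterable[int], faulty: int
) -> Tuple[int, int]:
    """Isolate the faulty column by swapping half the suspect columns for spares.

    Returns (isolated column, number of iterations).
    """
    used_set = set(used)
    spares = sorted(set(range(num_columns)) - used_set)  # unused columns
    suspects = sorted(used_set)  # a discrepancy implicates every used column
    iterations = 0
    while len(suspects) > 1:
        iterations += 1
        swapped_out = suspects[: len(suspects) // 2]
        if len(spares) < len(swapped_out):
            raise ValueError("not enough unused columns to swap in")
        # Trial configuration: half of the suspects replaced by spare columns.
        trial = (used_set - set(swapped_out)) | set(spares[: len(swapped_out)])
        if faulty in trial:      # discrepancy persists after the swap
            suspects = suspects[len(swapped_out):]
        else:                    # discrepancy vanished with the swapped columns
            suspects = swapped_out
    return suspects[0], iterations


# Hypothetical device: 32 columns, the first 16 in use, column 11 faulty.
column, iterations = halving_isolation(32, range(16), faulty=11)
print(column, iterations)
```

Each swap halves the suspect set, which matches the slide's remark that halving progressively shrinks the set of suspected elements.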
13. Effect of Total Number of Elements
- Increased Problem Size
  - Number of Elements = Number of Rows × Number of Columns
  - As the size of the array containing the fault increases, the increase in the required number of iterations is minimal
  - For 1 million elements, only 27.4 iterations required
14. Effect of Population Size
- Population Size
  - Single fault in S is assumed
  - As population size increases, isolation is expected to be faster
  - Increased population size implies more initial designs
  - A population size of 30 seems to be an ideal tradeoff between ease of isolation and the difficulty of generating an increased number of individuals
- Increased population size provides minimal added benefit
15. Effect of Resource Utilization
- Moderate resource utilization is ideal for isolation
  - Rate of isolation progress is low with extreme utilization characteristics
  - Isolation takes longer when less than 20% or greater than 80% of the available resources are utilized
16. Future Work
- Conducting tests using benchmark circuits
  - ISCAS89: s38584 with 11448 gates, sequential logic
  - ISCAS85: circuits with max 3513 gates, combinational logic
  - Compression/signal processing algorithms, such as the Lempel-Ziv (LZ) compression scheme [Mitra04]
- Development of an architecture to enable column-swapping
  - Multi-layer Runtime Reconfigurable Architecture (MRRA) being prototyped
17. Backup Slides
18. Online Dueling Evaluation
- Objective
  - Isolate faults by successive intersection between sets of FPGA resources used by configurations
  - Analyze complexity of isolation process
- Variables
  - Total resources available
    - Measured in number of LUTs
  - Number of Competing Configurations
    - Number of initial seed designs in CRR process
  - Degree of Articulation
    - Some inputs may not manifest faults, even if the faulty resource is used by the individual
  - Resource Utilization Factor
    - Percentage of FPGA resources required by target application/design
  - Number of Iterations for Isolation
    - Measure of complexity and time involved in isolating fault
19. Discrepancy Mirror Circuit
Fault Coverage

Component         | Fault Scenario 1 | Fault Scenario 2 | Fault Scenario 3    | Fault Scenario 4    | Fault-Free
Function Output A | Fault            | Correct          | Correct             | Correct             | Correct
Function Output B | Correct          | Fault            | Correct             | Correct             | Correct
XNOR_A            | Disagree (0)     | Disagree (0)     | Fault: Disagree (0) | Agree (1)           | Agree (1)
XNOR_B            | Disagree (0)     | Disagree (0)     | Agree (1)           | Fault: Disagree (0) | Agree (1)
Buffer_A          | 0                | 0                | High-Z              | 0                   | 1
Buffer_B          | 0                | 0                | 0                   | High-Z              | 1
Match Output      | 0                | 0                | 0                   | 0                   | 1
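The Match Output row of the table can be reproduced with a small behavioural model. This sketch abstracts the tri-state buffers and pull-down as a logical AND of the two comparator results, which is an assumption consistent with the Match Output row but not a description of the actual circuit. The `xnor_a_stuck` and `xnor_b_stuck` parameters are hypothetical fault-injection hooks.

```python
from typing import Optional


def xnor(a: int, b: int) -> int:
    """Return 1 when the two bits agree, 0 when they disagree."""
    return 1 if a == b else 0


def match_output(
    out_a: int,
    out_b: int,
    xnor_a_stuck: Optional[int] = None,
    xnor_b_stuck: Optional[int] = None,
) -> int:
    """Behavioural model: Match is 1 only if both comparators report agreement.

    A stuck value overrides the corresponding comparator to model a fault in it.
    """
    xa = xnor(out_a, out_b) if xnor_a_stuck is None else xnor_a_stuck
    xb = xnor(out_a, out_b) if xnor_b_stuck is None else xnor_b_stuck
    return xa & xb  # assumed abstraction of buffers plus pull-down


# Correct function output is taken to be 1 in every scenario (illustrative).
scenarios = {
    "Function Output A faulty": match_output(0, 1),
    "Function Output B faulty": match_output(1, 0),
    "XNOR_A faulty": match_output(1, 1, xnor_a_stuck=0),
    "XNOR_B faulty": match_output(1, 1, xnor_b_stuck=0),
    "Fault-free": match_output(1, 1),
}
for name, value in scenarios.items():
    print(f"{name}: Match Output = {value}")
```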
20. Influence of LUT Utilization
[Figure panels: Perpetually Articulating Inputs with Equiprobable Distribution; Intermittently Articulating Inputs with Equiprobable Distribution]
- Expected number of pairings grows sub-linearly in number of resources
- Utilization below 20% or above 80% implicates (or exonerates) a smaller subset of resources
- At 50% utilization, the expected numbers of pairings for 1,000, 10,000, and 100,000 resources are 11.1, 14.9, and 17.6
- At 90% utilization, a mean value of 258 pairings is required to isolate the faulty resource
21. Accommodating Multi-bit Word Widths
- Proof of concept
  - The present circuit works efficiently
  - Demonstrates important Dueling-enabled isolation method
- Strategies
  - Use an array of detectors
    - Attempt to minimize points of failure as word-width increases
    - Number of logic resources used is acceptable for smaller circuits
  - Create new circuit or scheme, combining fault-tolerant coding-based methods with single-fault secure circuit
    - Current research focused on improving detector by investigating codes and fault-secure circuits
22. Pull-down Resistor Considerations
- Proof of concept
  - The present circuit works in a verifiably correct manner
    - Can utilize synthesized (digital) pull-down resistors, which simulate the behavior of analog resistors
  - Demonstrates Dueling-enabled isolation method
    - Can be utilized without implementation problems for custom VLSI designs
- Alternative Approach
  - Alternate detector circuits for FPGA implementation are under investigation
    - Avoid using tri-state buffers and pull-down resistors; use native digital components available on FPGAs
23. Competitive Runtime Reconfiguration (CRR)
Evolutionary Computation strategies are effective for more than just the repair phase: continually detect, rank, and isolate faults entirely within the underlying data throughput flow
- diverse alternatives working a priori
- fault detection by robust consensus over time
- no test vectors
- device remains online during repair
- fault isolation is model-free and self-calibrating
- completely-repaired criteria can be ignored
- graceful degradation via ranking of alternatives
- no reconfiguration when fault-free
- performance readily adjustable
- failures in population memory covered
- checking logic is part of the individual, hence also competes for correctness
24. State Transitions during Lifetime of ith Half-Configuration
[Figure: Configuration Health States]
- Discrepancy Operator
  - The baseline Discrepancy Operator is a dyadic operator with binary output
  - Z(Ci) is the FPGA data throughput output of configuration Ci
  - WTA (Equivalence)
25. Procedural Flow under Consensus-Based Evaluation
- Initialization
  - Partition P into sub-populations of size |P|/2 to designate physical FPGA left-half or right-half resource utilization
- Regeneration
  - Genetic Operators recover based on Reintroduction Rate
  - Operators only applied once, then offspring returned to service without concern about increasing fitness
26. GA Parameters and Experiments
- GA operators: External-Module-Crossover, Internal-Module-Crossover, Internal-Module-Mutation
- GA parameters: population size 20 individuals; crossover rate 5%; mutation rate up to 80% per bit
- Speciation
  - Two-point crossover between individuals from same sub-group
  - Crossover points chosen to prevent intra-CLB crossover
  - Breeding occurs exclusively among members of sub-populations
  - Maintains non-interfering resource use among L, R
- Experiments demonstrate
  - Fault Isolation Characteristics
  - Regenerative Experiments
- Objective fitness function replaced by the Consensus-based Evaluation Approach and Relative Fitness
  - Elimination of additional test vectors
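The speciation rule about crossover points can be illustrated with a two-point crossover whose cut points fall only on CLB boundaries. This is a sketch under assumptions: configurations are modelled as bit strings, `clb_bits` is a hypothetical number of bits per CLB, and both parents are taken to come from the same L or R sub-population.

```python
import random
from typing import Tuple


def two_point_crossover(
    a: str, b: str, clb_bits: int, rng: random.Random
) -> Tuple[str, str]:
    """Two-point crossover with cut points restricted to CLB boundaries.

    Restricting the cuts to multiples of `clb_bits` means no CLB is ever
    split between the two parents (no intra-CLB crossover).
    """
    if len(a) != len(b) or len(a) % clb_bits != 0:
        raise ValueError("parents must have equal length, a multiple of clb_bits")
    n_clbs = len(a) // clb_bits
    # Choose two distinct boundaries, counted in whole CLBs.
    first, second = sorted(rng.sample(range(n_clbs + 1), 2))
    lo, hi = first * clb_bits, second * clb_bits
    child_a = a[:lo] + b[lo:hi] + a[hi:]
    child_b = b[:lo] + a[lo:hi] + b[hi:]
    return child_a, child_b


rng = random.Random(0)   # seeded only to make the illustration repeatable
parent_a = "0" * 32      # hypothetical 4-CLB individual, 8 bits per CLB
parent_b = "1" * 32
child_a, child_b = two_point_crossover(parent_a, parent_b, clb_bits=8, rng=rng)
print(child_a)
print(child_b)
```

Every 8-bit group in each child comes intact from one parent, so CLB contents are exchanged but never mixed.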
27. Impact of Fault on Viable Individuals
- Existence of Positive Test Vector
  - Input Ip comprises a positive test vector iff the discrepancy operator applied to Cv(Ip) and Cf(Ip) yields 1, where Cv denotes a viable configuration and Cf denotes a faulty configuration
  - So if a discrepancy is visible, then some Ip exists which manifests the fault
- Minimal Case when Ip is Unique
  - Ip is unique if the fault is observable under exactly one test vector
- Probability Mass Function for Encountering Ip in Minimal Case
  - Consider Ew = 600, yielding 99.5% coverage for a module with input space W = 64
  - The number of input occurrences, 0 ≤ i ≤ 600, that randomly encounter Ip to identify the fault is governed by the probability mass function p.m.f.(i), where D is the length of Ew
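The formula for p.m.f.(i) is not reproduced on this slide. One natural model, assumed here rather than taken from the source, is that each of the D inputs is drawn independently and uniformly from the W possible input vectors, so the number of encounters with the unique Ip is binomial with success probability 1/W. The function name `encounter_pmf` is hypothetical.

```python
from math import comb


def encounter_pmf(i: int, D: int = 600, W: int = 64) -> float:
    """Probability that exactly i of D inputs hit the single positive test vector.

    Assumes independent, uniformly distributed inputs (binomial model).
    """
    p = 1.0 / W
    return comb(D, i) * p**i * (1.0 - p) ** (D - i)


D, W = 600, 64
total = sum(encounter_pmf(i, D, W) for i in range(D + 1))
mean = sum(i * encounter_pmf(i, D, W) for i in range(D + 1))
print(f"sum of p.m.f. = {total:.6f}, expected encounters = {mean:.3f}")
```

Under this model the expected number of encounters is D/W = 600/64 = 9.375, which agrees with the expected discrepancy value quoted for the 1-out-of-64 case.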
28. Isolation of a Single Faulty Individual with 1-out-of-64 Impact
- Outliers are identified after Ew iterations have elapsed
  - Expected DV = (1/64) × 600 = 9.375 from the individual impacted by the fault
- Isolated individual's DV differs from the average DV by 3σ after 1 or more observation intervals of length Ew
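The outlier criterion can be sketched as a simple three-sigma test over the population's discrepancy values. The sample values are made up for illustration, and computing the mean and standard deviation over the whole population, including the suspect, is an assumption. The name `find_outliers` is hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def find_outliers(dvs: Sequence[float], k: float = 3.0) -> List[int]:
    """Return indices of individuals whose DV deviates from the mean by > k sigma."""
    mu = mean(dvs)
    sigma = pstdev(dvs)  # population standard deviation
    if sigma == 0:
        return []        # all DVs identical: nothing stands out
    return [i for i, dv in enumerate(dvs) if abs(dv - mu) > k * sigma]


# Hypothetical DVs for 20 individuals after one window of Ew = 600 inputs;
# individual 7 uses the faulty resource and accumulates about 9 discrepancies.
dvs = [0, 1, 0, 0, 1, 0, 0, 9, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
print(find_outliers(dvs))  # [7]
```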
29. Isolation of a Single Faulty L Individual with 10-out-of-64 Impact
- Compare with 1-out-of-64 fault impact
  - Expected DV of (10/64) × 600 = 93.75 for the faulty configuration
  - One isolation will be complete approx. once in every 93.75/5 ≈ 19 Sliding Windows
- Fault isolation achieved is 100%
30. Isolation of 8 Faulty Individuals (L4R4) with 1-out-of-64 Impact
- Expected isolations do not occur approx. 40% of the time
  - Average discrepancy value of the population is higher
  - Outlier isolation difficult
  - Multiple faulty individuals, discrepancies scattered
31. Regeneration Performance
Parameters
- Difference (vs. Hamming Distance)
- Evaluation Window, Ew = 600
- Suspect Threshold DV_S = 1 - 6/600 = 99%
- Repair Threshold DV_R = 1 - 4/600 = 99.3%
- Re-introduction rate = 0.1
Repairs evolved in-situ, in real-time, without additional test vectors, while allowing device to remain partially online.
32. Multilayer Runtime Reconfiguration Architecture (MRRA)
- Develop MRRA: fast reconfiguration paradigm for the CRR approach
- Validate with real hardware platform along with detailed performance analysis
- First general-purpose framework for a wide variety of applications requiring dynamic reconfiguration
- Extend existing theories on reconfiguration
33. Loosely Coupled Solution
The Virtex-II Pro is mounted on a development board, which can then be interfaced with a workstation running Xilinx EDK and ISE.
The entire system operates on a 32-bit basis.
34. For further info
EH Website: http://cal.ucf.edu