
Title: A Combinatorial Group Testing Method for FPGA Fault Location


1
A Combinatorial Group Testing Method for FPGA Fault Location
Ronald F. DeMara, Carthik A. Sharma
University of Central Florida
2
Introduction
  • Field Programmable Gate Arrays
  • Gate-array-based reconfigurable architecture
  • Matrix of Logic Cells (Look-Up Tables) surrounded
    by peripheral I/O cells
  • Capabilities
  • Runtime reconfiguration
  • On-chip processor core; millions of
    gate-equivalent logic elements
  • Millions of FPGA devices produced annually, most
    SRAM-based
  • Used in mission-critical applications
  • Remote systems: hazardous environments
  • Space applications: satellites, probes, and
    shuttles

3
Group Testing Algorithms
  • Origin: World War II blood testing
  • Problem: Test samples from millions of new
    recruits
  • Solution: Test blocks of samples before testing
    individual samples
  • Problem Definition
  • Identify subset Q of defectives from set P
  • Minimize number of tests
  • Test v-subsets of P
  • Form suitable blocks
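
The problem definition above can be illustrated with a minimal sketch of adaptive group testing for the special case of exactly one defective item. The names `find_defective` and `is_contaminated` are illustrative and do not come from the presentation, and the block-forming strategy shown is plain binary splitting, chosen only because it is the simplest one to read.

```python
def find_defective(items, is_contaminated):
    """Locate the single defective item by testing half of the candidates at a time.

    `is_contaminated(block)` is a hypothetical pooled test: it returns True
    when the block contains the defective item.
    """
    candidates = list(items)
    tests = 0
    while len(candidates) > 1:
        block = candidates[: len(candidates) // 2]  # pool the first half
        tests += 1
        if is_contaminated(block):
            candidates = block                      # defective is in the pool
        else:
            candidates = candidates[len(block):]    # defective is in the rest
    return candidates[0], tests


# Example: 1,000 samples, with sample 637 defective (an arbitrary choice).
defective = 637
found, tests_used = find_defective(range(1000), lambda block: defective in block)
print(found, tests_used)
```

With 1,000 samples this strategy never needs more than 10 pooled tests, compared with up to 1,000 individual tests.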

4
Previous Work
  • Pre-compiled Column-Based Dual FPGA architecture
    [Mitra04]
  • Autonomous detection, repair by shifting
    pre-compiled columns
  • Isolation using distributed CED-checkers and
    blind reconfiguration attempts
  • Overview of Combinatorial Group Testing and
    Applications [Du00]
  • Provides taxonomy and general algorithms for
    applying CGT
  • Examples of CGT applications: DNA clone library
    filtering, vaccine screening, computer fault
    diagnosis, etc.
  • CGT Enhanced Circuit Diagnosis [Kahng04]
  • Presents doubling, halving, etc. for circuit fault
    diagnosis using BIST and CGT
  • Requires ability to test resources individually
  • Chinese Remainder Sieve technique [Eppstein05]
  • Efficient non-adaptive and two-stage CGT based on
    prime number driven test formation
  • Improved algorithms for practical problem sizes
    ($n < 10^{80}$) with small number of defectives
    ($d < 4$)

5
Fault-Handling Techniques
[Figure: taxonomy of fault-handling techniques for device failure, organized by failure characteristics, approach, and methods]
  • Duration: Transient (SEU); Permanent (SEL, Oxide
    Breakdown, Electron Migration, LPD)
  • Target: Device Configuration; Processing Datapath
  • Approaches: BIST, CGT-Based, Repetitive Readback,
    TMR, STARS, CED, Dueling
  • Method stages: Detection, Isolation, Diagnosis,
    Recovery
  • Detection, isolation, and diagnosis entries (in
    slide order): Duplex Output Comparison,
    Supplementary Testbench, Cartesian Intersection,
    Bitwise Comparison, Majority Vote, Repetitive
    Intersections, Fast Run-time Location, Worst-case
    Clock Period Dilation, unnecessary, Evolutionary
    Algorithm using Intrinsic Fitness Evaluation
  • Recovery entries: Replicate in Spare Resource,
    Select Spare Resource, Invert Bit Value, Ignore
    Discrepancy
6
Isolation Problem Outline
  • Objectives
  • Locate faulty logic and/or interconnect resource;
    a single stuck-at fault model is assumed
  • Online Fault Isolation: device not entirely
    removed from service
  • Features
  • Runtime Reconfiguration: FPGA resources
    configured dynamically
  • Utilize Runtime Inputs: avoid special
    test vectors, improve availability
  • Constraints
  • Use pre-designed configurations defined by
    target application
  • Subsets under test have constant resource
    utilization range for a given isolation problem
  • Resource grouping influences fault articulation:
    resource mapping and input vector might mask
    hardware faults
  • Do not use specialized block designs
  • Runtime reconfiguration limited to
    column-swapping
  • Non-reasonable algorithm: tests may be
    repeated without gaining new isolation
    information

7
Fault Location Using Dueling
  • The set of all competing configurations is
    represented by $S$.
  • Set $C_k$ represents the resources utilized by
    configuration $k$.
  • Each competing configuration $k$, $1 \le k \le |S|$,
    has a unique binary Usage Matrix $U_k$,
    $1 \le k \le p$.
  • Elements $(U_k)_{i,j}$, $1 \le i \le m$,
    $1 \le j \le n$, where $m$ and $n$ represent the
    rows and columns in the device layout respectively.
  • Elements $(U_k)_{i,j} = 1$ denote the usage of
    resource $(i, j)$ by $C_k$.
  • The History Matrix $H$, with elements $H_{i,j}$,
    $1 \le i \le m$, $1 \le j \le n$, is an integer
    matrix used to represent the relative fitness of
    individual resources.
  • $H_{i,j}$ provides instantaneous relative fitness
    values of resources.

8
Dueling Example
$H_{i,j}$ at t = 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

U2
0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1 0 0
0 0 0 1 0 0 0 0 0 0
0 0 1 0 0 1 1 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

U1
0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 1 1 0 0 0
0 0 1 1 0 0 1 0 0 0
0 0 1 0 1 0 0 0 0 0
0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

$H_{i,j}$ at t = 2
0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 0 0 0
0 0 2 1 0 0 1 0 0 0
0 0 1 0 1 1 0 1 0 0
0 0 1 1 0 1 0 0 0 0
0 0 1 0 0 1 1 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
  • $H_{i,j}$ changes after C1 and C2 are loaded
  • U1 and U2 are the corresponding Usage Matrices
  • (3,3) is identified as the faulty resource
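
The example can be reproduced with a short sketch. It assumes the update rule is plain element-wise addition of each loaded configuration's usage matrix to H, which is consistent with the matrices shown but is not stated on the slide. The helper names are illustrative, and the assignment of the two usage matrices to U1 and U2 follows the label order in the transcript (the sum does not depend on it).

```python
def parse(rows):
    """Turn rows such as '0010011000' into lists of integers."""
    return [[int(ch) for ch in row] for row in rows]

ZERO = "0000000000"
U1 = parse([ZERO, "0001011000", "0011001000", "0010100000", "0010010000",
            ZERO, ZERO, ZERO, ZERO, ZERO])
U2 = parse([ZERO, "0000100000", "0010000000", "0000010100", "0001000000",
            "0010011000", "0000100000", "0010000100", ZERO, ZERO])

def update_history(history, usage):
    """Assumed update rule: add the usage matrix to the history matrix."""
    return [[h + u for h, u in zip(h_row, u_row)]
            for h_row, u_row in zip(history, usage)]

def most_suspect(history):
    """Return the 1-based (row, column) holding the largest history value."""
    value, i, j = max((v, i, j)
                      for i, row in enumerate(history)
                      for j, v in enumerate(row))
    return (i + 1, j + 1), value

H = parse([ZERO] * 10)          # H at t = 0
for usage in (U1, U2):          # C1 and C2 are loaded in turn
    H = update_history(H, usage)

suspect, count = most_suspect(H)
print(suspect, count)           # resource used by both configurations
```

The only resource used by both configurations is (3,3), so it carries the single history value of 2 and is the one singled out.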
9
Modified Halving
Initially all $H_{i,j} = 0$
Selection Process can be Adaptive
Fitness Augmentation can be non-linear
Columns can be swapped with any other Columns
10
FPGA Arrangement for Dueling
  • Configurations in Population
  • $C = C_L \cup C_R$
  • $C_L$: subset of left-half configurations
  • $C_R$: subset of right-half configurations
  • $|C_L| = |C_R| = |C|/2$

11
Isolation Progress without Halving
  • Without Halving
  • Initially $|S| = 20{,}000$
  • Resource Utilization: 40%
  • Number of suspected faulty elements constant at
    36 after 23 iterations
  • No subsequent improvement due to lack of
    differentiating information between competing
    configurations

12
Dueling with Modified Halving
  • Dueling with Halving
  • Halving works by swapping half the used columns
    with unused ones
  • Halving progressively reduces the size of the
    set of suspected faulty elements
  • Isolation proceeds till a single faulty element
    is isolated
  • Fault isolated after 19 iterations
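
The halving step can be sketched at column granularity. This is a simplified model rather than the authors' algorithm: it assumes a single faulty column among the used columns, fault-free spare columns, and a fault that is always articulated. The function name and arguments are hypothetical.

```python
def isolate_by_halving(used_columns, spare_columns, faulty_column):
    """Halve the suspect set by swapping half the suspect columns with spares."""
    suspects = list(used_columns)
    spares = list(spare_columns)
    iterations = 0
    while len(suspects) > 1:
        iterations += 1
        half = len(suspects) // 2
        swapped_out, kept = suspects[:half], suspects[half:]
        if len(spares) < half:
            raise ValueError("not enough spare columns to swap half the suspects")
        # After the swap the design runs on `kept` plus `half` spare columns.
        # Under the assumptions above, a discrepancy is observed exactly when
        # the faulty column is still in use.
        discrepancy = faulty_column in kept
        suspects = kept if discrepancy else swapped_out
    return suspects[0], iterations


# Example: 16 used columns, 16 spares, column 11 faulty (arbitrary values).
print(isolate_by_halving(range(16), range(16, 32), 11))
```

In this idealized setting the number of iterations grows with the base-2 logarithm of the number of suspect columns. The slides report more iterations because real inputs do not always articulate the fault.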

13
Effect of Total Number of Elements
  • Increased Problem Size
  • Number of elements = number of rows × number of
    columns
  • As the size of the array containing the fault
    increases, the increase in the required number of
    iterations is minimal
  • For 1 million elements, only 27.4 iterations are
    required.

14
Effect of Population Size
  • Population Size
  • Single fault in S is assumed
  • As pop. size increases, isolation expected to be
    faster
  • Increased pop. size implies more initial designs
  • A population size of 30 seems to be an ideal
    tradeoff between ease of isolation, and the
    difficulty of generating increased number of
    individuals.

Increased population size provides minimal added
benefit
15
Effect of Resource Utilization
  • Moderate resource utilization ideal for
    isolation
  • Rate of isolation progress low with extreme
    utilization characteristics
  • Isolation takes longer when less than 20% or
    greater than 80% of the available resources are
    utilized.

16
Future Work
  • Conducting Tests using Benchmark Circuits
  • ISCAS89 s38584 with 11448 gates (sequential logic)
  • ISCAS85 circuits with max 3513 gates
    (combinational logic)
  • Compression / Signal Processing algorithms, such
    as the Lempel-Ziv (LZ) compression scheme
    [Mitra04]
  • Development of an architecture to enable
    column-swapping
  • Multi-layer Runtime Reconfigurable Architecture
    (MRRA) being prototyped

17
Backup Slides
  • On following pages

18
Online Dueling Evaluation
  • Objective
  • Isolate faults by successive intersection between
    sets of FPGA resources used by configurations
  • Analyze complexity of Isolation process
  • Variables
  • Total resources available
  • Measured in number of LUTs
  • Number of Competing Configurations
  • Number of initial Seed designs in CRR process
  • Degree of Articulation
  • Some inputs may not manifest faults, even if
    faulty resource used by individual
  • Resource Utilization Factor
  • Percentage of FPGA resources required by target
    application/design
  • Number of Iterations for Isolation
  • Measure of complexity and time involved in
    isolating fault

19
Discrepancy Mirror Circuit
Fault Coverage
Component           Fault scenario 1  Fault scenario 2  Fault scenario 3     Fault scenario 4     Fault-free
Function Output A   Fault             Correct           Correct              Correct              Correct
Function Output B   Correct           Fault             Correct              Correct              Correct
XNOR A              Disagree (0)      Disagree (0)      Fault: Disagree (0)  Agree (1)            Agree (1)
XNOR B              Disagree (0)      Disagree (0)      Agree (1)            Fault: Disagree (0)  Agree (1)
Buffer A            0                 0                 High-Z               0                    1
Buffer B            0                 0                 0                    High-Z               1
Match Output        0                 0                 0                    0                    1
20
  • Influence of LUT utilization

[Plots: Perpetually Articulating Inputs with Equiprobable Distribution; Intermittently Articulating Inputs with Equiprobable Distribution]
  • Expected number of pairings grows sub-linearly
    in number of resources
  • Utilization below 20% or above 80% implicates
    (or exonerates) a smaller sub-set of resources
  • At 50% utilization, the expected number of pairings
    for 1,000, 10,000, and 100,000 resources is
    11.1, 14.9, and 17.6, respectively
  • At 90% utilization, a mean of 258 pairings
    is required to isolate the faulty resource.
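
The dependence of pairing counts on utilization can be explored with a small Monte Carlo sketch. It is a simplified model, not the simulation behind the slide: it assumes a single fault, perpetually articulating inputs (a configuration that uses the faulty resource always shows a discrepancy), and that each resource is used independently with probability equal to the utilization. All function names are hypothetical.

```python
import random

def pairings_to_isolate(n_resources, utilization, rng, max_pairings=100_000):
    """Count pairings until a single suspect remains."""
    fault = rng.randrange(n_resources)
    suspects = set(range(n_resources))
    pairings = 0
    while len(suspects) > 1 and pairings < max_pairings:
        pairings += 1
        # Only the usage of still-suspect resources affects the update.
        used = {r for r in suspects if rng.random() < utilization}
        if fault in used:
            suspects = used       # discrepancy: implicate the resources in use
        else:
            suspects -= used      # agreement: exonerate the resources in use
    return pairings, fault, suspects

def mean_pairings(n_resources, utilization, trials=200, seed=0):
    """Average the pairing count over several random trials."""
    rng = random.Random(seed)
    counts = [pairings_to_isolate(n_resources, utilization, rng)[0]
              for _ in range(trials)]
    return sum(counts) / len(counts)

for utilization in (0.2, 0.5, 0.9):
    print(utilization, mean_pairings(1000, utilization))
```

Each pairing either implicates the used suspects or exonerates them, so the suspect set shrinks fastest when the two outcomes split it most evenly. That is why extreme utilization slows isolation in this model as well.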

21
Accommodating Multi-bit Word Widths
  • Proof of concept
  • The present circuit works efficiently
  • Demonstrates important Dueling-enabled isolation
    method
  • Strategies
  • Use an array of detectors
  • attempt to minimize points of failure as
    word-width increases
  • Number of logic resources used is acceptable for
    smaller circuits
  • Create new circuit or scheme, combining fault
    tolerant coding-based methods with single-fault
    secure circuit
  • Current research focused on improving detector by
    investigating codes, and fault-secure circuits

22
Pull-down Resistor Considerations
  • Proof of concept
  • The present circuit works in a verifiably correct
    manner
  • Can utilize synthesized (digital) pull-down
    resistors, which simulate the behavior of analog
    resistors
  • Demonstrates Dueling-enabled isolation method
  • Can be utilized without implementation problems
    for Custom-VLSI designs
  • Alternative Approach
  • Alternate detector circuits for FPGA
    implementation are under investigation
  • Avoid using Tri-state buffers, pull-down
    resistors and use native digital components
    available on FPGAs

23
Competitive Runtime Reconfiguration (CRR)
Evolutionary Computation strategies are effective for
more than just the repair phase: they continually
detect, rank, and isolate faults entirely within
the underlying data throughput flow
  • diverse alternatives working a priori
  • fault detection by robust consensus over time
  • no test vectors
  • device remains online during repair
  • fault isolation is model-free and
    self-calibrating
  • completely-repaired criteria can be ignored
  • graceful degradation via ranking of
    alternatives
  • no reconfiguration when fault-free
  • performance readily adjustable
  • failures in population memory covered
  • checking logic is part of the individual, hence
    also competes for correctness
24
State Transitions during Lifetime of i-th
Half-Configuration
Configuration Health States
  • Discrepancy Operator
  • The baseline Discrepancy Operator is a dyadic
    operator with binary output
  • $Z(C_i)$ is the FPGA data throughput output of
    configuration $C_i$

WTA
(Equivalence)
25
Procedural Flow under Consensus-Based Evaluation
Initialization: Partition population P into
sub-populations of size |P|/2 to designate physical
FPGA left-half or right-half resource utilization
  • Regeneration
  • Genetic Operators recover based on Reintroduction
    Rate
  • Operators only applied once, then offspring
    returned to service without concern about
    increasing fitness

26
GA Parameters and Experiments
GA operators: External-Module-Crossover,
Internal-Module-Crossover, Internal-Module-Mutation
GA parameters: population size 20 individuals,
crossover rate 5%, mutation rate up to 80% per bit
  • Speciation
  • Two-point crossover between individuals from same
    sub-group
  • Crossover points chosen to prevent intra-CLB
    crossover
  • Breeding occurs exclusively among members of
    sub-populations
  • Maintains non-interfering resource use among L, R
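
A minimal sketch of the speciation rule follows: two-point crossover between two individuals of the same sub-population, with cut points restricted to CLB boundaries so that no CLB is split. The bitstring encoding, the `clb_bits` parameter, and the function name are assumptions made for illustration; the presentation does not give the genome layout.

```python
import random

def two_point_crossover(parent_a, parent_b, clb_bits, rng):
    """Swap a run of whole CLBs between two same-length bit lists."""
    if len(parent_a) != len(parent_b) or len(parent_a) % clb_bits != 0:
        raise ValueError("parents must have equal length, a multiple of clb_bits")
    n_clbs = len(parent_a) // clb_bits
    # Choose two distinct CLB boundaries, so cuts never fall inside a CLB.
    first, second = sorted(rng.sample(range(n_clbs + 1), 2))
    lo, hi = first * clb_bits, second * clb_bits
    child_a = parent_a[:lo] + parent_b[lo:hi] + parent_a[hi:]
    child_b = parent_b[:lo] + parent_a[lo:hi] + parent_b[hi:]
    return child_a, child_b


# Example: two left-half individuals of 4 CLBs, 8 bits each (arbitrary sizes).
rng = random.Random(42)
left_a = [0] * 32
left_b = [1] * 32
child_a, child_b = two_point_crossover(left_a, left_b, clb_bits=8, rng=rng)
print(child_a)
print(child_b)
```

Breeding only within the left or the right sub-population, as the slide describes, would be enforced by the caller when choosing the two parents.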

Demonstrate
  • Fault Isolation Characteristics
  • Regenerative Experiments

Experiments
  • Objective fitness function replaced by the
    Consensus-based Evaluation Approach and Relative
    Fitness
  • Elimination of additional test vectors

27
Impact of Fault on Viable Individuals
  • Existence of Positive Test Vector
  • Input $I_p$ comprises a positive test vector iff
    the discrepancy operator applied to $C_v(I_p)$ and
    $C_f(I_p)$ yields 1, where $C_v$ denotes a viable
    configuration and $C_f$ denotes a faulty
    configuration
  • So if a discrepancy is visible then some $I_p$
    exists which manifests the fault
  • Minimal Case when $I_p$ is Unique
  • $I_p$ is unique if fault is observable under exactly
    one test vector
  • Probability Mass Function for Encountering $I_p$ in
    Minimal Case
  • Consider $E_w = 600$ yielding 99.5% coverage for a
    module with input space $W = 64$
  • The number of input occurrences, $0 \le i \le 600$,
    that randomly encounter $I_p$ to identify the fault
    is governed by the probability mass function
  • p.m.f.(i), where D is the length of $E_w$
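
The formula itself did not survive in the transcript. As a hedged illustration only, the sketch below assumes inputs are drawn uniformly and independently from the W = 64 possible vectors, in which case the number of encounters with $I_p$ in a window of D = 600 inputs follows a binomial distribution. This is an assumption, not necessarily the expression on the original slide, and the function name is hypothetical.

```python
from math import comb

def pmf_encounters(i, window=600, input_space=64, articulating=1):
    """Binomial probability of exactly i articulating inputs in the window."""
    p = articulating / input_space
    return comb(window, i) * p**i * (1 - p)**(window - i)

# Sanity checks on the assumed distribution.
total = sum(pmf_encounters(i) for i in range(601))
expected = sum(i * pmf_encounters(i) for i in range(601))
print(total)      # probabilities over 0..600 should sum to about 1
print(expected)   # mean number of encounters, D / W under this model
```

Under this model the mean is D/W = 600/64 = 9.375, which agrees with the expected discrepancy value quoted for the 1-out-of-64 case.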

28
Isolation of a single faulty individual with
1-out-of-64 impact
  • Outliers are identified after $E_w$ iterations have
    elapsed
  • Expected DV = (1/64)(600) = 9.375 from
    individual impacted by fault
  • Isolated individual's DV differs from the average
    DV by $3\sigma$ after 1 or more observation
    intervals of length $E_w$
29
Isolation of a single faulty L individual with
10-out-of-64 impact
  • Compare with 1-out-of-64 fault impact
  • Expected DV of (10/64)(600) = 93.75 for faulty
    configuration
  • One isolation will be complete approx. once in
    every 93.75/5 ≈ 19 Sliding Windows
  • Fault Isolation achieved is 100%

30
Isolation of 8 faulty individuals L4R4 with
1-out-of-64 impact
  • Expected isolations do not occur approx. 40% of
    the time
  • Average discrepancy value of the population is
    higher
  • Outlier isolation difficult
  • Multiple faulty individuals: discrepancies
    scattered

31
Regeneration Performance

Parameters
  • Difference (vs. Hamming Distance)
  • Evaluation Window: $E_w = 600$
  • Suspect Threshold: $DV_S = 1 - 6/600 = 99\%$
  • Repair Threshold: $DV_R = 1 - 4/600 \approx 99.3\%$
  • Re-introduction rate: 0.1
Repairs evolved in-situ, in real-time, without
additional test vectors, while allowing device to
remain partially online.
32
Multilayer Runtime Reconfiguration Architecture
(MRRA)
  • Develop MRRA fast reconfiguration paradigm for
    the CRR approach
  • Validate with real hardware platform along with
    detailed performance analysis
  • First general-purpose framework for a wide
    variety of applications requiring dynamic
    reconfiguration
  • Extend existing theories on reconfiguration

33
Loosely Coupled Solution
The Virtex-II Pro is mounted on a development
board which can then be interfaced with a
WorkStation running Xilinx EDK and ISE.
The entire system operates on a 32-bit basis
34
For further info, see the EH website: http://cal.ucf.edu