Using Software Rules To Enhance FPGA Reliability

Learn more at: http://klabs.org

1
Using Software Rules To Enhance FPGA Reliability
  • Chandru Mirchandani
  • Lockheed-Martin
  • September 7-9, 2005

MIRCHANDANI
P226-W/MAPLD2005
1
2
FPGA Fault Tolerance
  • Historically realized through triple redundancy,
    error-correcting codes and replicated elements
  • The fault tolerance process is only as good as
    the tests run to validate its performance, e.g.
  • Invalid data is not ignored because of an
    inherent fault in the lookup-and-compare sequence
  • The testing was not rigorous enough
  • The testing was not complete
  • Lack of real estate and logic on the device
    precludes the ideal solution
  • Make educated judgment calls on how much is
    acceptable and for how long

3
Reconfiguring FPGAs
  • Replicated circuitry or triple redundancy can be
    achieved on different devices or on the same
    device
  • Replicating a complete circuit on the same device
    will not fit within the limited real estate and
    will decrease performance due to routing
  • Replication could be used to one's advantage if
    only sub-sets of the circuit were replicated
  • Yu and McCluskey: reconfigure the chip so that a
    damaged configurable logic block (CLB) or
    routing resource is not used by a design
4
Types of Errors
  • Yu and McCluskey: when concurrent error
    detection (CED) mechanisms detect an error for
    the first time, it is treated as a transient
    error; otherwise, it is treated as a permanent
    error
  • Transient error - the system recovers from
    corrupt data and resumes normal operation
  • Permanent fault - fault diagnosis is initiated to
    determine the location of the damaged resource,
    and a suitable configuration is chosen according
    to the available area
  • For both types of errors, the design in VHDL,
    i.e. the FPGA software, is the key to success
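
The first-time/repeat rule above can be sketched in a few lines of Python. This is a minimal illustration, not the mechanism from Yu and McCluskey: the ErrorClassifier class and the idea of tracking errors per resource identifier are assumptions made here, since the slide does not say at what granularity a repeat is recognized.

```python
from enum import Enum


class ErrorKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


class ErrorClassifier:
    """Hypothetical helper: classifies CED error reports per resource."""

    def __init__(self):
        # Resources that have already reported one error.
        self._seen = set()

    def report(self, resource_id):
        # First detection is treated as transient (recover and resume);
        # any later detection on the same resource is treated as
        # permanent (diagnose and reconfigure).
        if resource_id in self._seen:
            return ErrorKind.PERMANENT
        self._seen.add(resource_id)
        return ErrorKind.TRANSIENT


classifier = ErrorClassifier()
first = classifier.report("clb_3")   # "clb_3" is an illustrative name
second = classifier.report("clb_3")
```

A real design would also need a policy for clearing the history, for example after a successful reconfiguration; that is left out of this sketch.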

5
Software Reliability
  • Develop Criteria for Design Objective Acceptance
  • Prioritize tasks or functions in order of
    criticality
  • Develop metrics to measure performance of tasks
    with respect to constraints
  • Evaluate design options based on measured
    reliability metrics

6
Typical Software Options
  • Critical software functions are distributed as
    redundant instances on multiple processors, thus
    minimizing the loss of service due to a processor
    failure.

7
Redundant Instances of Software
  • First, detect, contain and recover from faults
    as soon as possible
  • If this is not possible, allow control to be
    passed on to the redundant instance within the
    reliability and availability requirements levied
    on the system
  • Finally, include language-defined mechanisms to
    detect and prevent the propagation of errors

8
Methodology
  • Estimate the reliability based on instruction set
    and operational usage
  • Re-design critical elements to decrease risk
  • Re-evaluate the risk of failure after a change
    in critical task design, based on performance
    and requirements
  • Re-evaluate the reliability based on failure rate
  • Factor in the Uncertainty in Evaluation

9
Task Times
Task              Class  Steps    Step Time (s_task)  Task Time     Total Tasks Time     (t_task)
Reading           r      Σx_ri    s_r                 s_r.Σx_ri     (s_r.Σx_ri).n_r      t_r
Parsing           p      Σx_pi    s_p                 s_p.Σx_pi     (s_p.Σx_pi).n_p      t_p
Pre-processing    p1     Σx_p1i   s_p1                s_p1.Σx_p1i   (s_p1.Σx_p1i).n_p1   t_p1
Monitoring        M      Σx_Mi    s_M                 s_M.Σx_Mi     (s_M.Σx_Mi).n_M      t_M
Sorting           s      Σx_si    s_s                 s_s.Σx_si     (s_s.Σx_si).n_s      t_s
Processing        P      Σx_Pi    s_P                 s_P.Σx_Pi     (s_P.Σx_Pi).n_P      t_P
Post-processing   p2     Σx_p2i   s_p2                s_p2.Σx_p2i   (s_p2.Σx_p2i).n_p2   t_p2
Status-gathering  S      Σx_Si    s_S                 s_S.Σx_Si     (s_S.Σx_Si).n_S      t_S
Writing           w      Σx_wi    s_w                 s_w.Σx_wi     (s_w.Σx_wi).n_w      t_w
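
As a worked illustration of one row of the table, the sketch below computes a total tasks time, assuming the Steps column denotes the sum of per-step counts x_i, so that the total is (step time . sum of steps) . number of tasks. The function name and all numbers are hypothetical; the presentation gives no values for this table.

```python
def total_tasks_time(step_counts, step_time, n_tasks):
    """Total time for one task class: (s_task * sum(x_i)) * n_task."""
    task_time = step_time * sum(step_counts)  # time for a single task
    return task_time * n_tasks                # time for all tasks of the class


# Hypothetical Reading class: three step counts, 0.5 time units per step,
# and four Reading tasks in the mission.
t_reading = total_tasks_time([3, 5, 2], step_time=0.5, n_tasks=4)
```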
10
FPGA System - Conceptual
  • Consider an FPGA-based system comprising the
    Reading, Parsing and Pre-Processing tasks; each
    task is a subsystem
11
Task Reliability Block Diagram
OR block (redundant instances):
  1 - [1 - exp(-(1-β_h).λ_shwi.t).exp(-(1-β_s).λ_sswi.t)]^2
AND
Common cause block:
  exp(-β_h.u_h.λ_hwi.t).exp(-β_s.u_s.λ_swi.t)
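
The block diagram can be read as two redundant copies in parallel (OR), in series (AND) with a common cause block. The sketch below evaluates that structure for one task. It is an interpretation under stated assumptions: the function name and argument names are invented here, the rates passed in are taken to be the already-derived independent and common cause failure intensities, and the exponent 2 is read as the number of redundant copies.

```python
import math


def task_reliability(ind_hw, ind_sw, cc_hw, cc_sw, t, copies=2):
    """Reliability of one task from its failure intensities.

    ind_hw, ind_sw: independent hardware/software failure intensities
    cc_hw, cc_sw:   common cause hardware/software failure intensities
    t:              time over which the task is evaluated
    copies:         number of redundant instances (assumed 2 here)
    """
    # One copy survives only if neither its hardware nor software fails.
    one_copy = math.exp(-ind_hw * t) * math.exp(-ind_sw * t)
    # OR block: the task survives if at least one copy survives.
    parallel = 1.0 - (1.0 - one_copy) ** copies
    # AND block: a common cause failure takes out every copy at once.
    common = math.exp(-cc_hw * t) * math.exp(-cc_sw * t)
    return parallel * common
```

With all common cause rates at zero the result reduces to plain parallel redundancy; with all independent rates at zero, redundancy gives no benefit and only the common cause term remains.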
12
Definitions
Calendar Time    t    Mission time used to calculate the reliability
Execution        e_i  Percentage of mission time used by the task (or subsystem)
Execution Time   t    e_i . t
Usage for SW          Percentage of the total software used by the task
Usage for HW          Percentage of the area of the active portion of the device used by the task
λ_shwi                Failure intensity of Task i hardware with respect to execution time
λ_sswi                Failure intensity of Task i software with respect to execution time
β_hi                  Fraction of Task i hardware failures that are common cause failures
β_si                  Fraction of Task i software failures that are common cause failures
13
Parameters Derivations
  • Failure Intensity: λ_shwi = λ_hwi.u_h.(1-β_h)
  • Failure Intensity: λ_sswi = λ_swi.u_s.(1-β_s)
  • Common Cause: λ_hwi.u_h.β_h and λ_swi.u_s.β_s
  • Execution Time: e_i . t
  • RSS_i: Subsystem Reliability
  • System Reliability: RS = RSS_1 . RSS_2 . RSS_3

                   Reading  Parsing  Pre-Processing
Usage SW - u_s     0.3      0.3      0.4
Usage HW - u_h     0.3      0.4      0.3
λ_hwi              0.3      0.4      0.3
λ_swi              0.3      0.4      0.3
Execution - e_i    0.2      0.1      0.7
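
The derivations and the parameter table can be combined into a small calculation. The sketch below is one possible reading, not the author's implementation: it applies the (1 - common cause fraction) factor once, as in the derivations above, assumes a duplex (two-copy) design, and uses a mission time of 1.0, which the slides do not state. It is therefore not expected to reproduce the reliability figures reported in the presentation; the common cause fractions passed in the example are the "Same Code, Same Device" values from the System Configuration Options slide.

```python
import math

# Per-task parameters copied from the table above.
TASKS = {
    "Reading":        {"u_s": 0.3, "u_h": 0.3, "lam_hw": 0.3, "lam_sw": 0.3, "e": 0.2},
    "Parsing":        {"u_s": 0.3, "u_h": 0.4, "lam_hw": 0.4, "lam_sw": 0.4, "e": 0.1},
    "Pre-Processing": {"u_s": 0.4, "u_h": 0.3, "lam_hw": 0.3, "lam_sw": 0.3, "e": 0.7},
}


def subsystem_reliability(p, beta_h, beta_s, mission_time):
    """Reliability of one task (subsystem) in an assumed duplex design."""
    exec_time = p["e"] * mission_time                 # execution time e_i . t
    ind_hw = p["lam_hw"] * p["u_h"] * (1.0 - beta_h)  # independent HW intensity
    ind_sw = p["lam_sw"] * p["u_s"] * (1.0 - beta_s)  # independent SW intensity
    cc_hw = p["lam_hw"] * p["u_h"] * beta_h           # common cause HW intensity
    cc_sw = p["lam_sw"] * p["u_s"] * beta_s           # common cause SW intensity
    one_copy = math.exp(-(ind_hw + ind_sw) * exec_time)
    parallel = 1.0 - (1.0 - one_copy) ** 2            # two redundant copies
    common = math.exp(-(cc_hw + cc_sw) * exec_time)
    return parallel * common


def system_reliability(beta_h, beta_s, mission_time=1.0):
    """System reliability as the product of the subsystem reliabilities."""
    result = 1.0
    for params in TASKS.values():
        result *= subsystem_reliability(params, beta_h, beta_s, mission_time)
    return result


# Example: same code on the same device (beta_h = 0.01, beta_s = 1).
same_code_same_device = system_reliability(0.01, 1.0)
```

Under these assumptions, lowering either common cause fraction raises the system reliability, which matches the direction of the trend the presentation reports.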
14
Extending the Rules
  • The programmed design, be it the original duplex
    design (duplicated or diverse) or the option for
    re-configuration, will optimize whichever option
    is used to enhance fault tolerance
  • For example, for the Reading Task it is shown
    that the area usage and operational profile have
    an effect on the predicted overall reliability of
    the FPGA-based design
  • Yu and McCluskey state that the designs of the
    CED techniques are area dependent: the more
    conservative a design is in terms of area, the
    less efficiently the error detection algorithm
    will perform, but the more efficiently or
    optimally the re-configured design will perform
    in the event of a permanent failure.

15
Further Extension
  • Greater area usage has a higher propensity for
    multiple faults; likewise, if the operational
    profile exercises a part of the code more often,
    the design and its associated code have a
    greater propensity for failures
  • The common cause fractions used in the paper are
    relative numbers to illustrate the model
  • With a redundancy of one, the fraction attributed
    to hardware common cause failure is 1%. This
    implies that there is an equal chance for a
    common defect running in the hardware, in this
    case the FPGA, to manifest itself anywhere in the
    active area.

16
Assertions
  • The common cause fractions used in the paper are
    relative numbers to illustrate the model
  • With a redundancy of one, the fraction attributed
    to hardware common cause failure is 1%. This
    implies that there is an equal chance for a
    common defect running in the hardware, in this
    case the FPGA, to manifest itself anywhere in the
    active area.
  • When implemented on different devices, this
    fraction drops to ¼% because the common physical
    defects are now almost negligible, and the only
    common effects are environmental, i.e.
    temperature, power and external stresses.

17
More Assertions
  • The software common cause fraction is high in
    both cases, since we assume nearly all software
    failures are common cause. There is very little
    change from same device to different devices,
    since the design implemented is the same; but
    because the devices are different, there is a
    slight chance that certain timing conditions may
    vary, hence the ¼% variation
  • In the diverse design paradigm, the hardware
    dependence remains in the same relative ratio,
    but the software fractions vary drastically. On
    the same device, the common cause fraction is
    50%, and it drops to 10% in the case of diverse
    designs on different devices

18
System Configuration Options
Configuration            HW Common Cause Fraction (β_h)  SW Common Cause Fraction (β_s)
Same Code, Same Device   0.01                            1
Same Code, Diff Devices  0.0025                          0.9975
Diff Code, Same Device   0.01                            0.5
Diff Code, Diff Devices  0.0025                          0.1
19
Results
Option  Configuration            FPGA-based System Reliability
1       Same Code, Same Devices  0.895726564
2       Same Code, Diff Devices  0.895973815
3       Diff Code, Same Devices  0.944752579
4       Diff Code, Diff Devices  0.98356125
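
One way to compare the reported options is through unreliability (1 - R) relative to Option 1. The sketch below only does arithmetic on the four reliability values listed above; the function name and the choice of Option 1 as the baseline are illustrative assumptions, not part of the presentation.

```python
# Reliability values as reported in the Results table.
REPORTED = {
    "Same Code, Same Devices": 0.895726564,
    "Same Code, Diff Devices": 0.895973815,
    "Diff Code, Same Devices": 0.944752579,
    "Diff Code, Diff Devices": 0.98356125,
}


def unreliability_reduction(results, baseline):
    """Factor by which each option reduces unreliability vs. the baseline."""
    base_unreliability = 1.0 - results[baseline]
    return {
        name: base_unreliability / (1.0 - reliability)
        for name, reliability in results.items()
    }


factors = unreliability_reduction(REPORTED, "Same Code, Same Devices")
for name, factor in factors.items():
    print(f"{name}: unreliability reduced by a factor of {factor:.2f}")
```

Read this way, moving the same code to different devices changes very little, while design diversity accounts for most of the reported improvement.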
20
Conclusions
  • Cost and Schedule Slips
  • Development Delays and Costs
  • Adaptive Model
  • Optimization and Design Constraints
  • Contact Address chandru.j.mirchandani_at_lmco.com
