Assessing SEU Vulnerability via Circuit-Level Timing Analysis

Slides: 17
Provided by: cccpEec
Transcript and Presenter's Notes


1
Assessing SEU Vulnerability via Circuit-Level
Timing Analysis
  • Kypros Constantinides, Stephen Plaza, Jason
    Blome, Bin Zhang
  • Valeria Bertacco, Scott Mahlke, Todd Austin,
    Michael Orshansky
  • Advanced Computer Architecture Lab, University
    of Michigan
  • Department of Electrical and Computer
    Engineering, University of Texas at Austin

2
Introduction
  • There is growing concern about transient
    faults in combinational logic
  • Numerous techniques already exist to deal with
    the effects of transient faults:
      • Error Correction Codes (ECC)
      • DIVA
      • Simultaneous and Redundantly Threaded (SRT)
        processors
      • and many others
  • However, these techniques come at a cost in
    performance, power, die size and design time

3
Introduction
  • Designers have to trade off the reliability
    provided against the implementation cost
      • Inadequate soft-error protection: the design
        may be useless due to poor reliability
      • Excessive soft-error protection: the design
        may be uncompetitive in cost and/or performance
  • To balance this trade-off, system designers
    need accurate soft-error rates (SERs) for their
    designs
  • The device community provides raw SERs for
    devices of current technologies and projections
    for devices of future technologies
  • However, architecture-level and circuit-level
    phenomena derate the raw SER
  • Accurately assessing a design's SER requires a
    detailed circuit-level analysis infrastructure

4
In This Work
  • We introduce a high-fidelity, high-performance
    simulation infrastructure for estimating
    soft-error rates that:
      • asynchronously injects voltage pulses of
        various durations at the gate level
      • accurately gauges detailed circuit phenomena
        to model fault introduction, fault
        propagation and possible fault masking
      • simulates fast enough to permit the
        examination of entire workloads on complex
        designs (thousands of gates)

5
Soft Error Masking
  • Fortunately, not all transient faults cause an
    error
  • Circuit and architectural phenomena can prevent
    a fault from propagating to the design's output
    and causing an error:
      • Logic masking
      • Timing masking
      • Electrical masking
      • Microarchitectural masking
      • Software masking

6
Soft Error Masking
  • Logic Masking: the fault is blocked by a
    following gate whose output is completely
    determined by its other inputs
  • Timing Masking: the fault reaches the input of a
    latch only during the period in which the latch
    is not sensitive to its input
  • Electrical Masking: the fault's pulse is
    attenuated by subsequent logic gates due to
    their electrical properties, and does not affect
    any latch's input
  • Microarchitectural Masking: the fault alters the
    value of at least one flip-flop, but the
    incorrect values are overwritten without being
    used in any computation that affects the
    design's output
  • Software Masking: the fault propagates to the
    design's output but is subsequently masked by
    software without affecting the application's
    correct execution
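
The first two definitions above can be illustrated with a minimal sketch. This is not the authors' tool; the function names, the two-input AND gate and the sampling-window representation are assumptions chosen only to make the definitions concrete.

```python
def and_gate(a, b):
    """Two-input AND gate on 0/1 values."""
    return a & b


def logically_masked(gate, inputs, faulty_index):
    """Return True if flipping one gate input leaves the gate output unchanged.

    This is logic masking: the other inputs fully determine the output.
    """
    good = gate(*inputs)
    flipped = list(inputs)
    flipped[faulty_index] ^= 1  # the transient fault inverts this input
    return gate(*flipped) == good


def timing_masked(pulse_start, pulse_width, window_start, window_end):
    """Return True if the pulse misses the latch's sensitive window.

    Times are fractions of the clock period (an assumed convention).
    The latch is assumed sensitive only in [window_start, window_end).
    """
    pulse_end = pulse_start + pulse_width
    return pulse_end <= window_start or pulse_start >= window_end
```

For instance, with the other AND input held at 0, `logically_masked(and_gate, (0, 1), 1)` is true, whereas with the other input at 1 the same glitch propagates.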

7
Simulation Infrastructure
  • Design Under Test: gate-level description of the
    design (netlist)
      • Fault-Exposed Model: subjected to fault
        injection
      • Golden Model: no fault injected
  • Fault Generator: injects voltage pulses of
    various durations at any gate in the design and
    flips the value of any flip-flop in the design
      • faults are uniformly distributed in time,
        location and duration
  • Fault Analyzer: monitors manifested errors and
    tracks all the possible ways a fault can be
    masked
  • Model Stimuli: workload traces that exercise the
    design under test

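The golden versus fault-exposed comparison can be sketched as follows. This is a toy illustration under loud assumptions: the 4-bit `step` function stands in for a real gate-level netlist, only flip-flop bit flips are modelled (no voltage pulses), and all names are hypothetical rather than taken from the authors' infrastructure.

```python
import random


def step(state, stimulus):
    """Toy 4-bit design (hypothetical): one clock cycle of state update."""
    next_state = (state + stimulus) & 0xF
    output = next_state >> 3  # only the top state bit is visible at the output
    return next_state, output


def run(trace, fault=None):
    """Simulate a stimulus trace; fault=(cycle, bit) flips one state bit
    just before that cycle. fault=None gives the golden model."""
    state, outputs = 0, []
    for cycle, stimulus in enumerate(trace):
        if fault is not None and fault[0] == cycle:
            state ^= 1 << fault[1]
        state, out = step(state, stimulus)
        outputs.append(out)
    return outputs


def campaign(trace, n_faults, seed=0):
    """Inject n_faults single bit flips, uniform in time and location,
    and count how many make the outputs differ from the golden run."""
    rng = random.Random(seed)
    golden = run(trace)
    errors = 0
    for _ in range(n_faults):
        fault = (rng.randrange(len(trace)), rng.randrange(4))
        if run(trace, fault) != golden:
            errors += 1
    return errors
```

A fault whose run matches the golden outputs counts as masked; a real analyzer would additionally record which masking mechanism applied.
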
8
Statistical Model for Transient Faults
  • Pulse-based model for transient faults caused by
    energetic particle strikes
  • Faults injected into combinational logic are
    classified by their duration:
      • 20%, 40%, 60%, 80% and 100% of the design's
        clock period
  • Faults injected into sequential elements flip
    the stored value
  • The arrival rate of each type of fault is modeled
    by a separate random variable
  • The mean inter-arrival time for each fault type
    is derived from previously published data and
    detailed SPICE simulations
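
One way to picture the per-type arrival model is the sketch below. The slides only say that each fault type has its own random variable and mean inter-arrival time; the exponential distribution used here is an assumption, and the fault-type names and mean values a caller passes in are placeholders, not data from the presentation.

```python
import random

# Pulse durations from the slide, as a percentage of the clock period.
PULSE_PERCENTS = (20, 40, 60, 80, 100)


def fault_schedule(mean_interarrival, horizon, seed=0):
    """Build a time-ordered list of (time, fault_type) events.

    mean_interarrival: dict mapping a fault-type name to its mean time
    between faults (placeholder values supplied by the caller).
    Inter-arrival times are drawn from an exponential distribution,
    which is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    events = []
    for fault_type, mean in mean_interarrival.items():
        t = rng.expovariate(1.0 / mean)
        while t < horizon:
            events.append((t, fault_type))
            t += rng.expovariate(1.0 / mean)
    return sorted(events)


def default_fault_types():
    """Names for the five pulse classes plus the flip-flop bit flip."""
    return [f"pulse_{p}" for p in PULSE_PERCENTS] + ["flip"]
```

Keeping one independent stream per fault type mirrors the "separate random variable" wording and lets each mean be tuned on its own.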

9
Design Under Test CMP Switch
  • As the design under test we chose a single-chip
    multiprocessor (CMP) interconnection switch
    (baseline provided by Li-Shiuan Peh)
  • Much less complex than a microprocessor, yet not
    too simplistic (it includes finite state
    machines, buffers, control logic, and buses)
  • Wormhole switch, pipelined at the flit level
  • Specified in Verilog and synthesized to a
    gate-level netlist
      • 9K logic gates
      • 1700 sequential elements
  • Realistic workload
      • Communication traces derived from the TRIPS
        architecture

10
Characterization per Fault Type
  • High microarchitectural masking
      • 95% of the faults that flip a flip-flop's
        value are masked
  • Timing masking is significant only for faults
    with small pulse durations
  • Logic masking increases as the fault's pulse
    duration decreases

11
Derating Factor
  • Derating factor = 1 / error rate
  • For example, a derating factor of 30 means that
    one of every 30 injected faults causes an error
    (an error rate of 3.3%)
  • The average derating factor for realistic
    workloads is 31 (error rate 3.2%)
  • A synthetic high-utilization workload leads to a
    derating factor of 12 (error rate 8.3%)

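The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical, and `derated_ser` assumes that the effective soft-error rate is simply the raw SER divided by the derating factor, which is an interpretation of "derating" rather than a formula stated in the slides.

```python
def derating_factor(injected_faults, observed_errors):
    """Derating factor = injected faults per manifested error."""
    return injected_faults / observed_errors


def error_rate_percent(derating):
    """Error rate, in percent, corresponding to a derating factor."""
    return 100.0 / derating


def derated_ser(raw_ser, derating):
    """Effective SER under the assumption effective = raw / derating."""
    return raw_ser / derating
```

With these definitions, derating factors of 30, 31 and 12 give error rates of about 3.3%, 3.2% and 8.3% respectively.
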
12
Failure Rate Projections
  • Taking into account ITRS projections and raw
    SER estimates for future process technologies, we
    project failure rates that include the
    transient-fault derating effects
  • The design architecture is kept intact across
    future process technologies
  • Two different designs:
      • one clocked at the projected clock
        frequencies for microprocessors
      • one clocked at the projected clock
        frequencies for interconnection networks

13
Transient-fault Vulnerability per Component
  • We observed that each switch component exhibits
    a different vulnerability to transient faults
  • Derating effects depend greatly on the
    component's characteristics
  • Most vulnerable component: Switch Arbiter
      • 12.8% error rate
      • 6% of the switch's area
  • Input Controllers
      • dominate the switch design
      • 86% of the switch's area
  • The switch's overall vulnerability matches that
    of the input controllers

14
Effects of Multi-fault Strikes
  • A single strike can cause multiple faults on
    neighbouring gates or flip-flops
  • There is a lack of data about the frequency of
    such events, and of models for multi-fault
    strikes on logic gates and flip-flops
  • We assume that every strike causes multiple
    faults, which is extremely pessimistic
  • Even in this severe environment the failure
    rates are relatively low

15
Conclusions and Directions for Future Work
  • Conclusions
      • For complex designs there is significant fault
        masking, with derating factors as high as 30
      • Soft-error derating effects depend heavily on
        the design's characteristics and utilization
      • Our observations suggest that the soft-error
        reliability threat might have been overstated
        by the computer architecture community
      • Designers need to evaluate their designs'
        soft-error tolerance with detailed analysis
        tools that consider circuit-level derating
        effects, and to better trade off the
        protection provided against the
        implementation cost
  • Future Work
      • Study the soft-error derating effects for
        several designs with different levels of
        complexity and different characteristics
      • Enhance our simulation infrastructure so that
        it can simulate large, high-complexity systems
        (millions of gates) with short simulation runs

16
Questions?