Title: Assessing SEU Vulnerability via Circuit-Level Timing Analysis
1 Assessing SEU Vulnerability via Circuit-Level Timing Analysis
- Kypros Constantinides, Stephen Plaza, Jason Blome, Bin Zhang, Valeria Bertacco, Scott Mahlke, Todd Austin, Michael Orshansky
- Advanced Computer Architecture Lab, University of Michigan
- Department of Electrical and Computer Engineering, University of Texas at Austin
2 Introduction
- Recently there has been growing concern about transient faults in combinational logic
- Numerous techniques already exist that deal with the effects of transient faults
  - Error Correction Codes (ECC)
  - DIVA
  - Simultaneous and Redundantly Threaded (SRT) processors
  - and many others
- However, these techniques come at a cost in performance, power, die size and design time.
3 Introduction
- Designers have to trade off the reliability provided against the implementation cost
  - Inadequate soft-error protection: the design may be useless due to poor reliability
  - Excessive soft-error protection: the design is uncompetitive in cost and/or performance
- To balance this trade-off, system designers need accurate soft-error rates (SERs) for their designs
- The device community provides raw SERs for devices of current technologies and projections for devices of future technologies
- However, architecture-level and circuit-level phenomena derate the raw SER
- Accurately assessing a design's SER requires a detailed circuit-level analysis infrastructure
4 In This Work
- We introduce a high-fidelity, high-performance simulation infrastructure for estimating soft-error rates
  - asynchronously injects voltage pulses of various durations at the gate level
  - accurately gauges detailed circuit phenomena to model
    - fault introduction
    - fault propagation
    - and possible fault masking
  - simulates with sufficient speed to permit the examination of entire workloads on complex designs (thousands of gates)
5 Soft Error Masking
- Fortunately, not all transient faults cause an error
- Circuit and architectural phenomena prevent the fault from propagating to the design's output and causing an error
  - Logic masking
  - Timing masking
  - Electrical masking
  - Microarchitectural masking
  - Software masking
6 Soft Error Masking
- Logic masking: the fault is blocked by a following gate whose output is completely determined by its other inputs
- Timing masking: the fault affects the input of a latch only during the period in which the latch is not sensitive to its input
- Electrical masking: the fault's pulse is attenuated by subsequent logic gates due to their electrical properties, and does not affect any latch's input
- Microarchitectural masking: the fault alters the value of at least one flip-flop, but the incorrect values are overwritten without being used in any computation affecting the design's output
- Software masking: the fault propagates to the design's output but is subsequently masked by software without affecting the application's correct execution
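The first two masking effects can be illustrated with a small Python sketch. This is only an illustration under simplifying assumptions: the function names, the two-input AND gate, and the representation of the latch's sensitive window as fractions of the clock period are all hypothetical and are not taken from the infrastructure described in this talk.

```python
def and_gate(a, b):
    """Two-input AND gate on 0/1 values."""
    return a & b


def logic_masked(gate, inputs, faulty_index):
    """Return True if flipping one input leaves the gate output unchanged.

    This is logic masking: the other inputs fully determine the output.
    """
    flipped = list(inputs)
    flipped[faulty_index] ^= 1  # the transient fault inverts this input
    return gate(*inputs) == gate(*flipped)


def timing_masked(pulse_start, pulse_width, window_start, window_end):
    """Return True if the pulse never overlaps the latch's sensitive window.

    All times are fractions of the clock period (an assumption of this sketch).
    """
    pulse_end = pulse_start + pulse_width
    return pulse_end <= window_start or pulse_start >= window_end
```

For example, a fault on one AND input is logically masked whenever the other input is 0, and a pulse that ends before the sensitive window opens is timing masked.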
7 Simulation Infrastructure
- Design Under Test: gate-level description of the design (netlist)
  - Fault-Exposed Model: subjected to fault injection
  - Golden Model: no fault injected
- Fault Generator: injects voltage pulses of various durations at any gate in the design and flips the value of any flip-flop in the design
  - faults are uniformly distributed in time, location and duration
- Fault Analyzer: monitors manifested errors and tracks all the possible ways a fault can be masked
- Model Stimuli: workload traces that exercise the design under test
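The golden-model versus fault-exposed-model comparison can be sketched in Python as follows. This is a toy illustration, not the actual infrastructure: the 4-bit `step` design, the function names, and the restriction to flip-flop flips (no gate-level voltage pulses, which would need timing simulation) are all assumptions made for the example.

```python
import random


def step(state, stimulus):
    """Hypothetical 4-bit design: returns (next_state, output)."""
    next_state = (state + stimulus) & 0xF
    return next_state, next_state & 0x3  # only the low two bits reach the output


def run(trace, flip_cycle=None, flip_bit=0):
    """Simulate the design over a stimulus trace, optionally flipping one flip-flop."""
    state, outputs = 0, []
    for cycle, stimulus in enumerate(trace):
        if cycle == flip_cycle:
            state ^= 1 << flip_bit  # fault generator: invert one state bit
        state, out = step(state, stimulus)
        outputs.append(out)
    return outputs


def classify(trace, flip_cycle, flip_bit):
    """Fault analyzer: compare the fault-exposed run against the golden run."""
    golden = run(trace)
    exposed = run(trace, flip_cycle, flip_bit)
    return "error" if golden != exposed else "masked"


def sample_fault(rng, n_cycles, n_flops=4):
    """Draw a fault uniformly in time (cycle) and location (flip-flop index)."""
    return rng.randrange(n_cycles), rng.randrange(n_flops)
```

In this toy design a flip of bit 3 never reaches the outputs and is classified as masked, while a flip of bit 0 is classified as an error.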
8 Statistical Model for Transient Faults
- Pulse-based model for transient faults caused by energetic particle strikes
- Faults injected into combinational logic are classified based on their duration
  - 20%, 40%, 60%, 80% and 100% of the design's clock period
- Faults injected into sequential elements flip their value
- The arrival rate of each type of fault is modeled by a separate random variable
- The mean inter-arrival times for each fault type are derived from previously published data and detailed SPICE simulations
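A per-fault-type arrival process of this kind could be sketched in Python as below. The slides do not state which distribution is used, so the exponential inter-arrival times (a Poisson process) are an assumption of this sketch, as are the function name, the fault-type labels, and any mean values a caller passes in.

```python
import random


def fault_schedule(mean_interarrival, horizon, rng):
    """Return a time-sorted list of (time, fault_type) events before `horizon`.

    mean_interarrival maps each fault type to its mean inter-arrival time.
    Each type is driven by its own independent random variable; exponential
    inter-arrival times are assumed here, not taken from the talk.
    """
    events = []
    for fault_type, mean in mean_interarrival.items():
        t = rng.expovariate(1.0 / mean)  # expovariate takes the rate, 1 / mean
        while t < horizon:
            events.append((t, fault_type))
            t += rng.expovariate(1.0 / mean)
    events.sort()
    return events
```

A caller would supply one entry per fault type, for example one for each of the five pulse durations plus one for flip-flop flips, with mean values chosen for the technology being studied.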
9 Design Under Test: CMP Switch
- We chose a single-chip multiprocessor (CMP) interconnection switch as the design under test (baseline provided by Li-Shiuan Peh)
- Much less complex than a microprocessor, yet not too simplistic (it includes finite state machines, buffers, control logic, and buses)
- Wormhole switch
  - pipelined at the flit level
- Specified in Verilog and synthesized to a gate-level netlist
  - 9K logic gates and
  - 1700 sequential elements
- Realistic workload
  - Communication traces derived from the TRIPS architecture
10 Characterization per Fault Type
- High microarchitectural masking
  - 95% of the faults that flip a flip-flop's value are masked
- Timing masking is significant only for faults with small pulse durations
- Logic masking increases as the fault's pulse duration decreases
11 Derating Factor
- Derating factor = (error rate)^-1
  - i.e., a derating factor of 30 means that one of every 30 injected faults causes an error (an error rate of 3.3%)
- Average derating factor for realistic workloads is 31 (error rate 3.2%)
- A synthetic high-utilization workload leads to a derating factor of 12 (error rate 8.3%)
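The relationship between derating factor and error rate can be checked with a few lines of Python. The function names are illustrative only; the numbers 30, 31 and 12 are the derating factors quoted on this slide.

```python
def derating_factor(injected_faults, observed_errors):
    """Derating factor: injected faults per manifested error."""
    return injected_faults / observed_errors


def error_rate_percent(factor):
    """Error rate in percent, the reciprocal of the derating factor."""
    return 100.0 / factor


for factor in (30, 31, 12):
    print(f"derating factor {factor}: error rate {error_rate_percent(factor):.1f}%")
```

Rounded to one decimal place, the three factors give 3.3%, 3.2% and 8.3% respectively.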
12 Failure Rate Projections
- Taking into account projections from the ITRS and raw SER estimates for future process technologies, we make failure rate projections that consider the transient-fault derating effects
- The design architecture is kept intact for future process technologies
- Two different designs
  - one clocked at the projected clock frequencies for microprocessors
  - and one clocked at the projected clock frequencies for interconnection networks
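As a rough sketch of how a derated failure rate might be computed from a raw SER, consider the Python function below. It assumes a single derating factor applies uniformly to every element, which is a simplification of the per-fault-type analysis in this talk, and the function and parameter names are hypothetical. No raw SER value is given on the slides, so any value passed in is a placeholder.

```python
def derated_failure_rate(raw_ser_per_element, element_count, derating_factor):
    """Scale the raw fault rate of the whole design down by the derating factor.

    raw_ser_per_element: raw soft-error rate of one element, in any rate unit
    element_count:       number of elements exposed to strikes
    derating_factor:     injected faults per manifested error
    The result is in the same unit as raw_ser_per_element.
    """
    raw_rate = raw_ser_per_element * element_count
    return raw_rate / derating_factor
```

With the design's 9K gates and 1700 sequential elements and a derating factor of 31, the derated rate is roughly 1/31 of the raw rate summed over 10,700 elements.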
13 Transient-fault Vulnerability per Component
- We observed that each switch component exhibited a different vulnerability to transient faults
- Derating effects depend greatly on the component's characteristics
- Most vulnerable component: Switch Arbiter (12.8% error rate)
  - 6% of the switch's area
- Input Controllers
  - dominate the switch design
  - 86% of the switch's area
- The switch's vulnerability matches that of the input controllers
14 Effects of Multi-fault Strikes
- A single strike can cause multiple faults on neighbouring gates or flip-flops
- There is a lack of data about the frequency of such events, and of models for multi-fault strikes on logic gates and flip-flops
- We assume that each strike causes multiple faults
  - extremely pessimistic
  - even under this severe environment the failure rates are relatively low
15 Conclusions and Directions for Future Work
- Conclusions
  - For complex designs there is significant fault masking, with derating factors as high as 30
  - Soft-error derating effects depend heavily on the design's characteristics and utilization
  - Our observations suggest that the soft-error reliability threat might have been overstated by the computer architecture community
  - Designers need to evaluate their designs' soft-error tolerance with detailed analysis tools that consider circuit-level derating effects, and better trade off the protection provided against the implementation cost
- Future Work
  - Study the soft-error derating effects for several designs with different levels of complexity and different characteristics
  - Enhance our simulation infrastructure so that it can simulate large, high-complexity systems (millions of gates) with short simulation runs
16 Questions?