Title: Self-calibrating Online Wearout Detection
1Self-calibrating Online Wearout Detection
Authors: Jason Blome, Shuguang Feng, Shantanu Gupta, Scott Mahlke
MICRO-40, December 3, 2007
2Motivation
- "Designing Reliable Systems from Unreliable Components" - Shekhar Borkar (Intel)
Failures will be wearout induced
More failures to come
3Current Approaches
- Traditional
- Design margins
- Burn-in
- Detection based on replication of computation
- TMR (Tandem/HP NonStop servers)
- DIVA (Bower, MICRO05)
- Prediction utilizes precise analytical models and/or sensors
- Canary circuits (Sentinel Silicon, Ridgetop)
- RAMP (Srinivasan, UIUC/IBM)
Impractical
Static
Costly
4Wearout Mechanisms
- Many failure mechanisms have been shown to be
progressive
- Hot carrier injection (HCI)
- Negative Bias Temperature Instability (NBTI)
5Objective
- Propose a failure prediction technique that exploits the progressive nature of wearout
- Monitor impact on path delays
- Prediction
- Monitors evolution of wearout
- Proactive
- enables failure avoidance/mitigation
- Continuous feedback
- False negatives and positives
- Detection
- Identifies existing fault
- Reactive
- enables failure recovery
- End-of-life feedback
- False negatives
6Oxide Breakdown (OBD)
- Accumulation of defects leads to a conductive
path
Percolation Model Stathis, JAP06
7OBD HSPICE Model
- Post-breakdown leakage modeling
BSIM4.6.0, 06
8Characterization Testbench
- 90nm standard cell library
[Figure: characterization testbench; measured delays labeled t_circuit and t_cell]
9Impact on Propagation Delay
10Delay Profiling Unit (DPU)
[Figure: DPU block diagram; labels: input signal, Latency Sampling, uArch Module]
11TRIX Analysis
Magnitude of divergence between TRIXglobal and
TRIXlocal reflects amount of degradation
12TRIX Analysis Details
- Exponential Moving Average (EMA)
- Triple-smoothed Exponential Moving Average
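The two averages named on this slide can be sketched in a few lines. This is only an illustration, assuming the standard EMA recurrence y[t] = alpha * x[t] + (1 - alpha) * y[t-1]. The function names `ema` and `trix`, the choice to seed the average with the first sample, and any value of `alpha` are assumptions of this sketch, not taken from the slides.

```python
from typing import Iterable, List


def ema(samples: Iterable[float], alpha: float) -> List[float]:
    """Exponential moving average: y[t] = alpha * x[t] + (1 - alpha) * y[t-1]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[float] = []
    for x in samples:
        # Seed the average with the first sample (an assumption of this sketch).
        prev = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1.0 - alpha) * prev)
    return smoothed


def trix(samples: Iterable[float], alpha: float) -> List[float]:
    """Triple-smoothed EMA: an EMA of an EMA of an EMA of the samples."""
    return ema(ema(ema(samples, alpha), alpha), alpha)


# Made-up latency samples, for illustration only.
latencies = [100.0, 104.0, 97.0, 101.0, 110.0, 99.0]
single = ema(latencies, 0.5)
triple = trix(latencies, 0.5)
```

A larger `alpha` follows recent samples more closely, while a smaller `alpha` gives a slower-moving average. Each extra smoothing pass suppresses more sample-to-sample noise at the cost of a slower response.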
13Noisy Latency Profile
[Figure: noisy latency profile; y-axis: Percent Nominal Delay (%), x-axis: Increasing Age]
14DPU with TRIX Hardware
[Figure: DPU with TRIX hardware; blocks: input signal, Latency Sampling, TRIXl Calculation, TRIXg Calculation, Prediction]
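One way to picture the sampling-to-prediction flow is a streaming model: each latency sample updates a fast-moving local TRIX and a slow-moving global TRIX, and a prediction is raised when the local value diverges from the global one. This is a software sketch only, not the hardware described in the talk. The class names `TripleEma` and `WearoutDetector`, both smoothing factors, the relative-divergence measure, and the 5% threshold are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TripleEma:
    """Three cascaded EMA registers, updated one sample at a time."""
    alpha: float
    stages: List[float] = field(default_factory=list)

    def update(self, sample: float) -> float:
        if not self.stages:
            # Seed all three stages with the first sample (sketch assumption).
            self.stages = [sample, sample, sample]
            return sample
        value = sample
        for i in range(3):
            # Each stage smooths the output of the previous stage.
            self.stages[i] = self.alpha * value + (1.0 - self.alpha) * self.stages[i]
            value = self.stages[i]
        return value


@dataclass
class WearoutDetector:
    """Flags wearout when the local trend rises too far above the global one."""
    alpha_local: float = 0.5    # hypothetical: follows recent samples quickly
    alpha_global: float = 0.05  # hypothetical: follows the long-term baseline
    threshold: float = 0.05     # hypothetical: 5% relative divergence

    def __post_init__(self) -> None:
        self.local = TripleEma(self.alpha_local)
        self.global_ = TripleEma(self.alpha_global)

    def update(self, latency: float) -> bool:
        """Consume one latency sample (assumed positive); return the prediction."""
        trix_l = self.local.update(latency)
        trix_g = self.global_.update(latency)
        divergence = (trix_l - trix_g) / trix_g
        return divergence > self.threshold


detector = WearoutDetector()
healthy = [detector.update(100.0) for _ in range(50)]  # stable latency
worn = [detector.update(120.0) for _ in range(20)]     # sustained slowdown
```

Because the global average is learned from the same signal it monitors, the baseline adapts to each unit rather than relying on a fixed design-time delay value, which is the self-calibrating aspect in this sketch.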
15Wearout Detection Unit (WDU)
[Figure: WDU block diagram; blocks: Latency Sampling, TRIXl Calculation, TRIXg Calculation, Prediction]
16Evaluation Framework
Gate-level Processor Simulator
OR1200 Verilog
Synthesis and Place and Route
90nm Library
Timing, Power, and Temperature Simulations
MediaBench Suite
Workload Simulator
OBD Wearout Model
HSPICE Simulations
Wearout Simulator
17WDU Accuracy
18WDU Overhead
19WDU Overhead
20Long-term Vision
- Introspective Reliability Management (IRM)
- Intelligent reliability management directed by on-chip sensor feedback
- Prospective sensors
- Delay (WDU)
- Leakage/Vt
- Temperature
21Introspective Reliability Management
22Conclusions
- Many progressive wearout phenomena impact device-level performance.
- It's possible to characterize this impact and anticipate failures
- WDU performance
- Failure predicted within 20% of end of life (tunable)
- Area overhead < 3% (hybrid)
- Low-level sensors can be used to enable
intelligent reliability management
23Questions?