Title: Single Event Upset
1Single Event Upset: An Embedded Tutorial
Fan Wang, Vishwani D. Agrawal
Department of Electrical and Computer Engineering
Auburn University, AL 36849 USA
21st International Conf. on VLSI Design, Hyderabad, India, January 4-8, 2008
2Motivation for This Work
- With the continuous downscaling of CMOS technologies, device reliability has become a major bottleneck.
- The sensitivity of electronic systems to radiation can potentially become a major cause of soft (non-permanent) failures.
- Both circuit designers and test engineers need a basic knowledge of the radiation mechanisms that cause soft errors and of soft error mitigation techniques.
3Outline
- Introduction to Soft Errors
- What is Soft Error?
- Historical notes
- Basic radiation mechanisms in silicon
- Soft error resilience techniques
- A case study
- Conclusion
4Introduction to SEU
- Certain behaviors in state-of-the-art electronic circuits are caused by random factors.
- A single event upset (SEU) is a non-permanent, non-functional error.
- Definition from the NASA Thesaurus:
  - Single Event Upset (SEU): Radiation-induced errors in microelectronic circuits caused when charged particles (usually from the radiation belts or from cosmic rays) lose energy by ionizing the medium through which they pass, leaving behind a wake of electron-hole pairs.
5What is Soft Error
- A fault is the cause of errors.
- A non-permanent fault is a non-destructive fault and falls into one of two categories:
  - Transient faults, caused by environmental conditions such as temperature, humidity, pressure, voltage, power supply, vibrations, fluctuations, electromagnetic interference, ground loops, cosmic rays, and alpha particles.
  - Intermittent faults, caused by non-environmental conditions such as loose connections, aging components, critical timing, resistive or capacitive variations, and noise in the system.
- With advances in manufacturing, soft errors caused by cosmic rays and alpha particles are the dominant cause of failures in electronic systems.
6Historical Notes
- From 1954 through 1957, failures in digital electronics were reported during the above-ground nuclear bomb tests.
- In 1962, Wallmark and Marcus predicted that cosmic rays would start upsetting microcircuits through heavy ionized particle strikes once feature sizes became small enough.
- In the 1970s and early 1980s, the effects of radiation received attention and more researchers examined the physics of these phenomena. Fault-tolerant computing theory received similar attention in the same period.
- In 1978, May and Woods of Intel Corporation determined that such errors were caused by the alpha particles emitted in the radioactive decay of uranium and thorium, present at levels of just a few parts per million in package materials.
- In 1979, Guenzer and Wolicki reported that the error-causing particles came not only from uranium and thorium but also from nuclear reactions involving high-energy neutrons and protons. The term SEU has been in use since this paper.
- In 1979, Ziegler and Lanford of IBM predicted that cosmic rays could cause the same upset phenomenon in electronics (not only memories), even at sea level.
7Soft Error Rate of Specific Applications
- Figures of merit
  1. Failures In Time (FIT): the number of failures per $10^9$ device hours.
  2. MTTF (Mean Time To Failure)
  - 1 year MTTF = $10^9/(24 \times 365)$ FIT = 114,155 FIT
- SER of contemporary commercial chips is controlled to within 100-1000 FITs.
- Most hard failure mechanisms produce error rates on the order of 1-100 FIT.
- Programmable logic SER is almost 100 times larger than that of combinational logic.
- Soft Error Rate for SRAM-Based FPGAs
  - Smaller design rules and lower supply voltages
  - A radiation chamber was used to calculate SEU frequency at an altitude of 10 km at 60°N (Sweden)

FPGA          XC4010E                   XC4010XL
Process       0.60 um                   0.35 um
Vcc           5 V                       3.3 V
1 SEU every   $1 \times 10^6$ hours     $2.8 \times 10^5$ hours

Projecting this for 3 design rule shrinks and 2 voltage reductions, we get 1 SEU every 28.2 hours.
M. Ohlsson, P. Dyreklev, K. Johansson, and P. Alfke, "Neutron Single Event Upsets in SRAM-Based FPGAs," Proc. 1998 IEEE Nuclear and Space Radiation Effects Conference.
Chuck Stroud, "FPGA Architectures and Operation for Tolerating SEUs," Electrical Engineering VLSI Design and Test Seminar, Spring 2007, Auburn University.
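FIT and MTTF are reciprocal views of the same failure rate, so converting between them is a one-line calculation. The Python sketch below uses illustrative function names that are not from the tutorial. It reproduces the 1-year MTTF conversion and expresses the SEU intervals quoted for the two FPGAs as FIT rates, under the assumption that each "1 SEU every N hours" figure can be treated as an MTTF.

```python
HOURS_PER_FIT_WINDOW = 1e9  # FIT counts failures per 10^9 device-hours


def mttf_hours_to_fit(mttf_hours: float) -> float:
    """Convert a mean time to failure in hours into a FIT rate."""
    return HOURS_PER_FIT_WINDOW / mttf_hours


def fit_to_mttf_hours(fit: float) -> float:
    """Convert a FIT rate back into a mean time to failure in hours."""
    return HOURS_PER_FIT_WINDOW / fit


if __name__ == "__main__":
    # 1-year MTTF, using 24 * 365 hours per year as on the slide.
    print("1 year MTTF:", round(mttf_hours_to_fit(24 * 365)), "FIT")

    # SEU intervals for the FPGAs, read here as 1 x 10^6 and 2.8 x 10^5 hours,
    # plus the projected 28.2 hours after further scaling.
    for label, hours in [("XC4010E", 1e6), ("XC4010XL", 2.8e5), ("projected", 28.2)]:
        print(label, "->", round(mttf_hours_to_fit(hours)), "FIT")
```

Seen this way, the projected interval of 28.2 hours corresponds to a rate several orders of magnitude above the 100-1000 FIT range quoted for commercial chips.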
8Example SRAM-Based FPGA System
Table cont.
Notes:
1. Example (1) is tested at Denver, using SpaceRad 4.5 (a radiation effects prediction software program). Source: Actel.
2. All systems are without any protection.
9Radiation Mechanisms for Silicon (1)
- Alpha particles are emitted when the nucleus of an unstable isotope decays to a lower energy state.
  - (The dominant soft error cause for DRAM in the 1970s.)
- Uranium and thorium have the highest activity among naturally occurring radioactive materials.
- In the terrestrial environment, the major sources of radioactive impurities are lead-based isotopes in the solder bumps of flip-chip technology, gold used for bond wires and lid plating, aluminum in ceramic packages, lead-frame alloys, and interconnect metallization.
With carefully selected materials, the effect of this mechanism can be greatly reduced.
10Radiation Mechanisms for Silicon (2)
- High-energy (> 1 MeV) neutrons from cosmic radiation induce soft errors in semiconductor devices via secondary ions produced by the neutron reaction with silicon nuclei.
- Cosmic rays of galactic origin react with the Earth's atmosphere to produce complex cascades of secondary particles.
- Neutrons are the cosmic radiation source most likely to cause SEU in deep-submicron semiconductors at terrestrial altitudes. The neutron flux depends on the altitude above sea level: its density increases with altitude.
MeV: million electron volts
Nowadays, neutrons are the major cause among all failure mechanisms.
11Radiation Mechanisms for Silicon (3)
- The secondary radiation induced by the interaction of cosmic ray neutrons with boron is the third significant source of ionizing particles in electronic systems.
- The mechanism is the interaction of low-energy cosmic neutrons with the isotope boron-10 (10B). 10B is commonly used as a p-type dopant for junction formation in ICs.
Baumann et al., IEEE Trans. Device and Materials Reliability, vol. 1, no. 1, pp. 17-22, 2001.
This mechanism can be greatly reduced or eliminated by removing the source of 10B.
12Single Event Transient (SET)
- An SET is caused by the generation of charge due to a high-energy particle passing through a sensitive node.
- Each SET has unique characteristics (polarity, waveform, amplitude, duration, etc.) that depend on the particle impact location, particle energy, device technology, device supply voltage, and output load.
- Off transistors struck in the junction area by a heavy ion with high enough LET are the most sensitive to SEU.
  - Specifically, the channel region of the off-NMOS transistor and the drain region of the off-PMOS transistor.
Linear Energy Transfer (LET) is a measure of the energy transferred to the device per unit length as an ionizing particle travels through a material.
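To make the LET definition concrete, the Python sketch below converts an LET value into the charge deposited per micrometer of track in silicon. The constants are assumptions of this sketch and are not stated in the tutorial: the usual mass-normalized LET unit of MeV·cm²/mg, a silicon density of about 2.33 g/cm³, and about 3.6 eV per electron-hole pair. The function name is illustrative.

```python
ELECTRON_CHARGE_C = 1.602176634e-19   # elementary charge in coulombs
SI_DENSITY_MG_PER_CM3 = 2330.0        # assumed silicon density (~2.33 g/cm^3)
EV_PER_PAIR_SI = 3.6                  # assumed energy per electron-hole pair in Si


def let_to_fc_per_um(let_mev_cm2_per_mg: float) -> float:
    """Charge deposited per micrometer of track in silicon, in femtocoulombs."""
    ev_per_cm = let_mev_cm2_per_mg * 1e6 * SI_DENSITY_MG_PER_CM3  # energy loss per cm
    ev_per_um = ev_per_cm / 1e4                                   # 1 cm = 10^4 um
    pairs_per_um = ev_per_um / EV_PER_PAIR_SI                     # electron-hole pairs
    return pairs_per_um * ELECTRON_CHARGE_C * 1e15                # coulombs -> fC


if __name__ == "__main__":
    for let in (1.0, 10.0, 97.0):
        print(f"LET {let:5.1f} MeV*cm^2/mg -> {let_to_fc_per_um(let):8.1f} fC/um")
```

Under these assumptions, an LET of 1 MeV·cm²/mg deposits roughly 10 fC per micrometer, and an LET near 97 MeV·cm²/mg deposits about 1 pC per micrometer.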
13More Details of SET Generation
- (a) Along the path it traverses, the particle produces a dense radial distribution of electron-hole pairs.
- (b) Outside the depletion region, the non-equilibrium charge distribution induces a temporary funnel-shaped potential distortion along the trajectory of the event (drift component).
- (c) The funnel collapses, and the diffusion component then dominates the collection process until all excess carriers have been collected, recombined, or diffused away from the junction area.
- (d) Current vs. time, illustrating the charge collection and SET generation.
14Analytical Model of SET
- The time constants depend strongly on the type of ion, its initial energy, and the properties of the specific technology.
- An approximate analytical model for ion-track charge collection is a double-exponential form. It gives an induced current with a rapid rise time but a more gradual fall time.
Typical values are approximately $1.64 \times 10^{-10}$ s for the collection time constant $\tau_\alpha$ and $5.10 \times 10^{-11}$ s for the ion-track establishment time constant $\tau_\beta$.
Experimental Results from NASA JPL
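A commonly cited form of the double-exponential model is written out below as a hedged reconstruction. The symbols $Q$ (total collected charge), $\tau_\alpha$ (collection time constant), and $\tau_\beta$ (ion-track establishment time constant) are assumptions of this sketch rather than notation confirmed by the slide.

```latex
I(t) = \frac{Q}{\tau_\alpha - \tau_\beta}
       \left( e^{-t/\tau_\alpha} - e^{-t/\tau_\beta} \right), \qquad t \ge 0

\int_0^{\infty} I(t)\, dt
  = \frac{Q}{\tau_\alpha - \tau_\beta}\,(\tau_\alpha - \tau_\beta) = Q
```

With $\tau_\alpha > \tau_\beta$, the $\tau_\beta$ term sets the fast rise and the $\tau_\alpha$ term sets the slower fall. The prefactor normalizes the pulse so that its area equals the collected charge.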
15SET in CMOS Inverter
For example, in ami12 technology, when the output load capacitance is 100 fF and the cumulative collected charge is 0.65 pC, the amplitude of the voltage pulse is $Q/C = 0.65\,\text{pC} / 100\,\text{fF} = (0.65 \times 10^{-12}\,\text{C}) / (100 \times 10^{-15}\,\text{F}) = 6.5\,\text{V}$.
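The charge-to-voltage estimate can be tied back to a double-exponential current pulse with a short numerical check. The Python sketch below assumes the commonly used form I(t) = Q/(tau_a - tau_b) * (exp(-t/tau_a) - exp(-t/tau_b)), uses the typical time constants quoted earlier, and takes the load capacitance and collected charge from the inverter example. The function names are illustrative, and the amplitude is the idealized Q/C value with no restoring current from the driving transistor.

```python
import math

TAU_ALPHA = 1.64e-10  # s, assumed collection time constant
TAU_BETA = 5.10e-11   # s, assumed ion-track establishment time constant


def set_current(t: float, q_total: float,
                tau_a: float = TAU_ALPHA, tau_b: float = TAU_BETA) -> float:
    """Double-exponential SET current in amperes at time t (seconds)."""
    if t < 0:
        return 0.0
    return q_total / (tau_a - tau_b) * (math.exp(-t / tau_a) - math.exp(-t / tau_b))


def collected_charge(q_total: float, t_end: float, steps: int = 20000) -> float:
    """Trapezoidal integral of the current from 0 to t_end, in coulombs."""
    dt = t_end / steps
    total = 0.0
    for i in range(steps):
        t0 = i * dt
        total += 0.5 * (set_current(t0, q_total) + set_current(t0 + dt, q_total)) * dt
    return total


def peak_time(tau_a: float = TAU_ALPHA, tau_b: float = TAU_BETA) -> float:
    """Time at which the current peaks (where dI/dt = 0)."""
    return tau_a * tau_b / (tau_a - tau_b) * math.log(tau_a / tau_b)


def pulse_amplitude(q_coulombs: float, c_farads: float) -> float:
    """Idealized voltage step from dumping charge Q onto capacitance C."""
    return q_coulombs / c_farads


if __name__ == "__main__":
    q, c_load = 0.65e-12, 100e-15  # 0.65 pC collected, 100 fF load
    print("integrated charge (C):", collected_charge(q, 20 * TAU_ALPHA))
    print("peak time (s):", peak_time())
    print("peak current (A):", set_current(peak_time(), q))
    print("idealized amplitude (V):", pulse_amplitude(q, c_load))
```

Integrating over about twenty collection time constants recovers essentially all of the charge, which is why the simple Q/C ratio is a reasonable first estimate of the pulse amplitude.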
16Soft Error Mitigation Techniques
- Soft error tolerance techniques can be classified into two types: recovery and prevention.
- Recovery: recover from an error after it occurs.
  - Includes on-line recovery mechanisms, fault-tolerant computing, ECC/parity checks, redundancy, etc.
- Prevention: methods that protect microchips from soft errors before they occur.
- The need for a recovery mechanism stems from the fact that prevention techniques may not be enough for contemporary microchips.
- Soft errors are not the only reason why computer systems need to resort to a recovery procedure. Random errors due to noise, unreliable components, and coupling effects may also require a recovery mechanism.
17Some Mitigation Techniques
- Prevention Techniques
- Purify the fabrication material
  - Uranium and thorium impurities have been reduced below one hundred parts per trillion for high reliability.
  - To eliminate 10B, alternative insulators that do not contain boron are used.
- Radiation-hardened process technologies
  - SER performance can be greatly improved by adapting the process technology either to reduce the collected charge or to increase the critical charge.
  - Specific methods use additional well isolation or replace bulk silicon with SOI.
A 10x reduction in SER over conventional bulk devices is achieved when a fully depleted SOI substrate is used. However, SOI is more expensive, and parasitic bipolar action limits further reduction of SER.
18Picked Mitigation Techniques
- Recovery Techniques
- Redundancy
  - Gains higher system reliability by sacrificing minimality in time, space, or both.
  - Classic design: Triple Modular Redundancy (TMR) with a majority voter.
  - New design: time redundancy based on a C-element gate that compares two samples of the combinational primary outputs, taken at t0 and t0 + d.
- Error Detection and Correction Code (EDAC)
  - Simple solution for memory: add a parity bit to each memory word.
  - In most situations, it must be combined with a system-level approach for error recovery.
S. Mitra, Z. Ming, S. Waqas, N. Seifert, B. Gill, and K. S. Kim, "Combinational Logic Soft Error Correction," in Proc. International Test Conference, 2006, pp. 1-9.
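The two redundancy ideas on this slide can be illustrated with small behavioral models. The Python sketch below shows a bitwise majority voter for TMR and a Muller C-element used for time redundancy. It is a simplified software illustration, not the circuit from Mitra et al., and the class and function names are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant module outputs (TMR voter)."""
    return (a & b) | (a & c) | (b & c)


class CElement:
    """Behavioral Muller C-element for one bit.

    The output follows the inputs when they agree and holds its previous
    value when they disagree.
    """

    def __init__(self, initial: int = 0) -> None:
        self.out = initial

    def update(self, sample_now: int, sample_delayed: int) -> int:
        # The two inputs model an output sampled at t0 and at t0 + d.
        if sample_now == sample_delayed:
            self.out = sample_now
        return self.out


if __name__ == "__main__":
    # One of three module copies has a flipped bit; the voter masks it.
    print(bin(majority_vote(0b1011, 0b1011, 0b0011)))

    # A transient shorter than d corrupts only one sample; the C-element holds.
    gate = CElement(initial=0)
    print(gate.update(1, 1))  # samples agree, output follows them
    print(gate.update(0, 1))  # samples disagree, output is held
```

In both cases a single upset is masked: by spatial replication in TMR, and by requiring the value to be stable across the delay d in the C-element scheme.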
19A Case Study IBM eServer z990 System
- z990 configuration
  - The z990 contains 4 pluggable nodes connected through a planar board.
  - Each node contains up to 64 GB of physical memory and 32 MB of L2 cache, for a system capacity of 256 GB of memory and 128 MB of L2 cache.
- Error tolerance techniques used
  - Extensive use of ECC and parity with retry on data and controls
  - Full SRAM ECC and parity protection
  - Microprocessor mirroring
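The general idea of parity with retry can be shown with a toy model. The Python sketch below is a generic illustration and does not describe the z990 implementation. It stores an even-parity bit with each word, checks it on every read, and repeats the read when the check fails. All names are hypothetical, and the retry only helps if the upset hit the data in flight rather than the stored copy.

```python
from typing import Callable, Tuple

StoredWord = Tuple[int, int]  # (data word, parity bit)


def even_parity(word: int) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def store_with_parity(word: int) -> StoredWord:
    """Attach an even-parity bit to a data word."""
    return word, even_parity(word)


def parity_ok(stored: StoredWord) -> bool:
    """True when the stored parity bit matches the data word."""
    word, parity = stored
    return even_parity(word) == parity


def read_with_retry(read_fn: Callable[[], StoredWord], max_retries: int = 3) -> int:
    """Read a word, retrying when the parity check detects an error."""
    for _ in range(max_retries + 1):
        stored = read_fn()
        if parity_ok(stored):
            return stored[0]
    raise RuntimeError("parity error persisted after retries")


if __name__ == "__main__":
    good = store_with_parity(0b1011)
    flipped = (good[0] ^ 0b0100, good[1])  # a single-bit upset
    print(parity_ok(good), parity_ok(flipped))
```

A single parity bit detects any odd number of flipped bits but cannot correct them and misses double-bit flips, which is one reason parity is paired with retry or replaced by ECC for stored data.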
20Conclusion
- SER in logic and memory chips will continue to increase as devices become more sensitive to soft errors at sea level.
- Open soft error issues
  - How should EDA tools handle soft error hardening?
  - Analysis of radiation mechanisms (too complex to be comprehensive)
  - Soft error rate analysis for logic
  - Error mitigation methods
21Useful References and Further Readings
- Single Event Phenomena (Messenger and Ash, 1993)
- Ionizing Radiation Effects in MOS Devices and Circuits (Ma and Dressendorfer, 1989)
- Handbook of Radiation Effects (A. Holmes-Siedle and L. Adams, 1993)
- Fault-Tolerance Techniques for SRAM-Based FPGAs (F. L. Kastensmidt, L. Carro, and R. Reis, 2006)
- Test methods and standards: JEDEC JESD89, JESD89A, JESD89-2
- Journals: IEEE Trans. on Nuclear Science, IEEE Trans. on Reliability
- NASA Goddard's test group
  - http://radhome.gsfc.nasa.gov/radhome/papers/seeca5.htm
- NASA Space Environment and Effects Program
  - http://see.msfc.nasa.gov/
22Thank You . . .