Title: MAPLD 2004 SINGLE EVENT EFFECT (SEE) ANALYSIS, TEST, MITIGATION
1. MAPLD 2004: SINGLE EVENT EFFECT (SEE) ANALYSIS,
TEST, MITIGATION IMPLEMENTATION OF THE XILINX
VIRTEX-II INPUT OUTPUT BLOCK (IOB)
Mathew Napier(1), Jason Moore(2), Kurt Lanes(1), Sana Rezgui(2), Gary Swift(3)
(1) Sandia National Laboratories, Albuquerque, NM, USA
(2) Xilinx, San Jose, CA, USA
(3) JPL/Caltech, Pasadena, CA, USA
"This work was carried out in part by the Jet
Propulsion Laboratory, California Institute of
Technology, under contract with the National
Aeronautics and Space Administration."
"Reference herein to any specific commercial
product, process, or service by trade name,
trademark, manufacturer, or otherwise, does not
constitute or imply its endorsement by the United
States Government or the Jet Propulsion
Laboratory, California Institute of Technology."
2. Purpose and Outline
- Analyze and evaluate the different types of TMR IOB mitigation structures. Discuss the trade-offs (SEE, electrical/timing, and resources) and how these trade-offs affect the operation and MTBF of a system.
- OUTLINE
- IOB
- SEE IOB Mitigation
- Triple Module Redundant IOB
- JPL Dual-MR
- SEE Trade offs
- Cross Section
- Signal Integrity and Timing
- System Implementation
- TMR, EDAC, I/O Count
- High-speed Interfaces
3. SEU Hazards for Xilinx Technology
- Configuration Memory
- Configuration memory controls logic function and routing
- Configuration Memory Upsets Cause
- Changes logic function
- Changes routing
- Changes IO Configuration
- Transient and Static Bit Errors
- Changes data and control states
- Single Event Functional Interrupt (SEFI)
- Power On State Machine Upsets (POR Upset)
- Causes power on reset to occur
- Select Map and JTAG
- Disables part configuration/scrub
- Effective mitigation techniques exist for each of
these error modes
[Figure callouts: SRAM configuration memory controls logic function (look-up tables); internal registers store state data; SRAM configuration memory controls routing (switch matrix)]
4. Input Output Buffer (IOB)
- IOBs are used to interconnect the Xilinx FPGA fabric with external devices.
- Support a wide range of I/O operating standards.
- Differential: LVDS, ECL
- Single-ended: LVCMOS, HSTL
- Silicon features greatly increase system performance.
- Flip-flops in the IOB
- Double Data Rate Flip Flops
- Digital Impedance control
- An IOB consists of the following parts
- Input path
- Two DDR registers
- Output path
- Two DDR registers
- Two 3-state DDR registers
- Separate clocks for input and output
- Set and reset signals are shared
- Separated sync/async
- Separated Set/Reset attribute per register
5. IOB Details
[Figure: IOB detailed view (FPGA Editor), with callouts for the 3-state control registers, I/O standard options (LVDS, etc.), output registers, and input registers]
6. Xilinx Triple Module Redundancy (XTMR) Inputs
- SEU immunity requires the use of triple redundant input pins for every input signal.
- Not triplicating global input signals (clk, rst, etc.) can seriously compromise SEU resistance.
- Triplication of input data paths can be traded for EDAC.
- Reduces I/O count
- SEU resistance is sometimes traded off for resource utilization.
- Xilinx input capacitance is 10 pF per I/O, so the user needs to verify that interfacing parts can drive 30 pF at speed.
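The 30 pF figure is simply three 10 pF input pins in parallel on one external driver. A minimal Python sketch of that arithmetic follows; the function name and the optional trace capacitance term are illustrative assumptions, not values from the slides.

```python
def xtmr_input_load_pf(pin_capacitance_pf=10.0, redundancy=3, trace_capacitance_pf=0.0):
    """Capacitive load seen by a driver feeding `redundancy` FPGA input pins.

    Parallel capacitances add. trace_capacitance_pf is a hypothetical extra
    term for board routing; the slide counts only the pins themselves.
    """
    return redundancy * pin_capacitance_pf + trace_capacitance_pf


# Three 10 pF pins tied to the same external driver
print(xtmr_input_load_pf())
```

Any board-level estimate would add the real trace and connector capacitance, which is why the extra term is exposed as a parameter.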
7. XTMR Triplicated Outputs with Minority Voters
- Outputs can be triplicated, using three pins for each output signal.
- Minority voters monitor each of the triplicated design modules.
- If one module is different from the others, its output pin is driven to high-Z.
- Voters are triplicated.
[Figure: three minority voters, one per domain (TR0, TR1, TR2), each controlling one output pin (P). The convergence point is outside the FPGA, at the trace.]
8. XTMR Triplicated Output Operation - Datapath SEU
- If a datapath SEU occurs, the minority voter places its pin in high-Z.
- Remaining valid outputs drive the output to the correct value.
- If an SEU occurs on the minority voter, the worst it can do is disable a valid output.
- To pass an incorrect output, two upsets would have to occur on the same path.
- Active scrubbing of the part will eliminate the accumulation of double SEUs in configuration logic.
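The triplicated-output behavior described here can be sketched as a small logic-level model in Python. This is an idealized illustration only: the function names are hypothetical, the voters themselves are assumed fault-free, and electrical effects at the trace are not modeled.

```python
def minority(own, other_a, other_b):
    """True when this domain disagrees with both other domains."""
    return own != other_a and own != other_b


def pin_drive(own, other_a, other_b):
    """Value driven on this domain's pin, or None for high-Z."""
    return None if minority(own, other_a, other_b) else own


def resolve_trace(drives):
    """Resolve the shared trace: high-Z pins are ignored, active pins must agree."""
    active = {d for d in drives if d is not None}
    if len(active) != 1:
        raise ValueError("contention or floating trace")
    return active.pop()


def xtmr_output(tr0, tr1, tr2):
    """Trace value for three domain outputs tied together outside the FPGA."""
    drives = [
        pin_drive(tr0, tr1, tr2),
        pin_drive(tr1, tr0, tr2),
        pin_drive(tr2, tr0, tr1),
    ]
    return resolve_trace(drives)


# Single datapath upset in domain 2: its pin goes high-Z, the others drive 1
print(xtmr_output(1, 1, 0))
```

Because the signals are binary, a domain that disagrees with both others is always the odd one out, so the two remaining pins necessarily agree.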
9. XTMR Duplicated Outputs with Minority Voters (JPL)
- In this scheme (by Gary Swift at JPL), triplicated design domains are driven onto two pins.
- Two minority voters monitor each of the triplicated design modules.
- If a module is different from the others, its output pin is driven to high-Z.
- Voters are duplicated.
- If an SEU occurs on the datapath without a pin, the outputs continue operating as normal.
10. XTMR Duplicated Output Operation - Datapath SEU (2)
- If an SEU occurs on the datapath with a pin, that pin is driven to high-Z.
- The main advantage of this technique is that it uses two rather than three pins, reducing pin count while maintaining SEU immunity.
- If an SEU occurs on the minority voter, the worst it can do is disable a valid output, the same as XTMR.
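A logic-level sketch of the two-pin scheme, in Python, may help show why a single datapath upset is still masked. It assumes domains 0 and 1 own the pins and domain 2 is only observed by the voters; the names are hypothetical and the voters are assumed fault-free. Electrical failures of the IOB itself, after the voter, are outside this model.

```python
def dual_output(tr0, tr1, tr2):
    """Return (trace_value, active_pin_count) for the two-pin scheme.

    Domains 0 and 1 each drive a pin; domain 2 has no pin and is used
    only as a reference by the two minority voters.
    """
    pin0 = None if (tr0 != tr1 and tr0 != tr2) else tr0  # None models high-Z
    pin1 = None if (tr1 != tr0 and tr1 != tr2) else tr1
    active = [p for p in (pin0, pin1) if p is not None]
    if not active or len(set(active)) != 1:
        raise ValueError("floating trace or contention")
    return active[0], len(active)


print(dual_output(1, 1, 0))  # upset in the pinless domain
print(dual_output(0, 1, 1))  # upset in a domain that owns a pin
```

In this model an upset in the pinless domain leaves both pins driving, while an upset in a pinned domain leaves a single pin driving the trace.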
11. XTMR Single Output Pin
- If a design is pin-limited, you can elect not to triplicate some outputs.
- A single majority voter can be placed in series with a single output.
- This will cause additional output delay and leave the output path susceptible to SEU.
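For reference, the majority vote placed in front of a single output pin can be sketched in Python as below. The function name is illustrative; the sketch shows only the voting logic, and the voter and output pin themselves remain unprotected, as the slide notes.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote; works on single bits or whole bus words."""
    return (a & b) | (a & c) | (b & c)


# One domain disagrees on bit 3 of a 4-bit word; the vote restores it
print(bin(majority(0b1010, 0b1010, 0b0010)))
```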
12. XTMR Output Analysis
- How many configuration bits are in the TMR I/O after the minority voter?
- Errors in these bits will change the IOB function and NOT be caught by the voter.
- How many one-bit upsets will really change the function?
- Does a stuck-at-high, stuck-at-low, or inverted IOB failure in an XTMR structure still function correctly? Can two I/Os overdrive the failed one?
- Voltage output high
- Voltage output low
- Timing rise/fall
- How does this change for different I/O types and switching speeds?
- How to design a system that balances
- SEE sensitivity
- System performance and speed
- Resource Utilization
13. Schematic Analysis
- Determine the number of Configuration Memory Cells (CMC) needed to configure unprotected and TMR I/O configurations by analyzing Xilinx schematics.
- Guidelines/Assumptions
- Not all SEUs will be catastrophic; therefore there are two types of SEUs (hard and soft failures).
- Hard failure: 100% certainty that, when it occurs, it will cause a system failure
- Causing the output to become inverted
- Causing the output to be stuck either high or low
- Changing the signaling standard to something completely different (e.g. LVCMOS to HSTL)
- Causing the output to be tri-stated
- Soft failure: uncertain as to the effect
- Changing the signaling standard to something similar (LVCMOS to LVTTL)
- Changing the drive strength or slew rate
- Changing the termination
14. Schematic Analysis Results
[Figure: signal path from the CLB LUT through routing to the IOB]
- Schematic analysis of this path: 109 bits (but only 92 essential)
- 26 hard failures
- 66 soft failures
15. TMR Output Results
[Figure: CLB and routing, IOB]
- Schematic analysis of this configuration: 173 bits
- 27 hard failures
- 122 soft failures
- TMR has a larger cross section than unprotected. AC analysis will determine which type is more robust.
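The bit counts quoted for the unprotected and TMR output paths can be tallied with a few lines of Python. The numbers are taken from the slides; the dictionary layout and the "classified" label (hard plus soft bits) are illustrative choices, not terms from the source.

```python
configs = {
    "unprotected": {"total_bits": 109, "hard": 26, "soft": 66},
    "tmr": {"total_bits": 173, "hard": 27, "soft": 122},
}


def tally(cfg):
    """Summarize how many bits were classified as hard or soft failures."""
    classified = cfg["hard"] + cfg["soft"]
    return {
        "classified": classified,
        "unclassified": cfg["total_bits"] - classified,
        "hard_fraction": cfg["hard"] / classified,
    }


for name, cfg in configs.items():
    print(name, tally(cfg))
```

For the unprotected path the hard and soft counts sum to the 92 essential bits quoted on the slide. The tally only counts bits; it says nothing about which upsets actually break the output, which is what the AC analysis addresses.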
16. SEE Mitigated IOB Signal Integrity and Timing
- MEMEC Insight MB-2000 board used as the test platform to test electrical and timing characteristics of XTMR.
- Tied three I/Os together and ran through four different cases
- Normal, stuck at high, stuck at low, inverted
- For each case the following measurements were taken.
- Voh, Vol, Tr, Tf
- 4 GHz scope pictures
- I/O types evaluated included
- 1.8V/2.5V/3.3V LVCMOS, LVTTL, LVDCI (impedance control), LVDS.
- Fast and slow slew rate.
- HyperLynx simulations were performed on all of the above cases to verify correlation between measured and simulated data.
- JPL's dual-redundant minority voter mitigation scheme will fail all of the above operating conditions if one of the I/Os fails.
17. SEE Mitigated IOB Signal Integrity and Timing
- XTMR 1.8V LVCMOS
- One output inverted
- Voh down to 1.4V from 1.8V
- Vol up to 0.4V from 0V
- Noise due to lack of termination
[Scope pictures: Normal, Inverted]
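A first-order way to see why two good outputs can overdrive one failed output is to treat each driver as an ideal source behind a resistance and solve for the shared node. The Python sketch below does this; the equal 25 ohm output resistance is a purely hypothetical value, and real drivers are nonlinear, which is why the slides rely on measurements and IBIS-based simulation instead.

```python
def node_voltage(drivers):
    """Voltage at a node shared by several drivers (Millman's theorem).

    drivers: list of (source_voltage, output_resistance_ohms) tuples.
    The linear resistance per driver is a simplifying assumption.
    """
    numerator = sum(v / r for v, r in drivers)
    denominator = sum(1.0 / r for v, r in drivers)
    return numerator / denominator


VCC = 1.8
R = 25.0  # hypothetical driver output resistance

# Two drivers high, one inverted driver low
print(node_voltage([(VCC, R), (VCC, R), (0.0, R)]))
# Two drivers low, one inverted driver high
print(node_voltage([(0.0, R), (0.0, R), (VCC, R)]))
```

With equal resistances the node settles at two thirds or one third of the rail. The measured 1.4V and 0.4V levels differ from that, which is consistent with pull-up and pull-down strengths not being equal or linear in the real part.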
18. SEE Mitigated IOB Signal Integrity and Timing
Stuck at High
- LVCMOS 1.8V
- Measured: Voh 1.72V, Vol 0.40V, Tr 0.58ns, Tf 0.51ns
- Simulated: Voh 1.79V, Vol 0.54V, Tr 0.80ns, Tf 0.60ns
[Figures: HyperLynx IBIS model; stuck-at-high simulation]
Stuck at Low
- LVCMOS 1.8V
- Measured: Voh 1.44V, Vol -0.04V, Tr 0.62ns, Tf 0.52ns
- Simulated: Voh 1.26V, Vol -0.06V, Tr 0.60ns, Tf 0.70ns
[Figure: stuck-at-low simulation]
Simulation data correlates with measured data
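To make the measured-versus-simulated comparison concrete, the slide values can be differenced directly. The Python sketch below uses only the numbers quoted on this slide; the dictionary keys and the `deltas` helper are illustrative names.

```python
measured = {
    "stuck_high": {"Voh": 1.72, "Vol": 0.40, "Tr": 0.58, "Tf": 0.51},
    "stuck_low": {"Voh": 1.44, "Vol": -0.04, "Tr": 0.62, "Tf": 0.52},
}
simulated = {
    "stuck_high": {"Voh": 1.79, "Vol": 0.54, "Tr": 0.80, "Tf": 0.60},
    "stuck_low": {"Voh": 1.26, "Vol": -0.06, "Tr": 0.60, "Tf": 0.70},
}


def deltas(case):
    """Simulated minus measured, per parameter (volts for Voh/Vol, ns for Tr/Tf)."""
    return {
        key: round(simulated[case][key] - measured[case][key], 2)
        for key in measured[case]
    }


for case in measured:
    print(case, deltas(case))
```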
19. SEE Mitigated IOB Signal Integrity and Timing
- Measured data spreadsheet
[Spreadsheet image; column labels: Normal, Stuck At Low, Stuck At Low, INV]
SAH (stuck-at-high) failure limits the Vol margin or violates the level
20. CMC Failure Comparison
- How does naked I/O compare to TMR in a dynamic beam test and in fault injection?
- Test will show CMC sensitivity due to switching failures large enough to break the output switching state.
- TMR displayed zero failures at 3.3V and 1.8V
- Naked I/O has a much larger CMC failure cross section than the TMR setup.
- I/O test design is only running at 30 MHz. TMR failures may show up at higher speeds.
[Figure label: Inverted]
21. System Goals and Implementation
- GOALS
- Xilinx FPGA technology is a mission-enabling technology
- SEU goal: develop a design that produces SEU performance comparable to that of a fully hardened design while exploiting the capabilities of state-of-the-art CMOS process technologies
- SEU result: system upset rate is superior to that which could be achieved with unmitigated SEU hard logic
- IMPLEMENTATION
- Command and control logic is implemented in SEU hard logic
- Processor memory includes parity protection
- Fail over to boot code
- SEU detection and recovery for SEU soft devices is automatic and occurs without ground intervention
- SEU-induced outages that do not require ground intervention are booked against mission availability
- Although not a specific requirement, good SEU performance under nominal solar flare conditions is desired
22. SEU Mitigation and Error Control
- Mitigate IO Upsets
- TMR of IO for clocks and address signals
- EDAC for data path signals
- Mitigate Configuration Memory Upsets
- TMR internal logic
- Configuration memory scrubbing to prevent error accumulation
- Design approach does not include POR upset mitigation
- Use of shadow devices is effective against POR errors
- POR error rate is very low
- The flight system makes extensive use of several techniques to exploit the advantages of nanometer CMOS technology while maintaining excellent SEU performance
- Multiple-bit Reed-Solomon forward error correction codes
- Single-bit error correcting codes
- Simple parity error detection
- Cyclic Redundancy Check for burst error correction
- Triple Modular Redundancy
- Error scrubbing
- Mitigation technique is selected based upon error rate, vulnerability, system impact, and implementation complexity
- Mitigation techniques provide coverage for dynamic SEU errors
Error Correction Techniques Implemented for SEU
Mitigation Improve the Overall Design Robustness
and Reliability
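As an illustration of the single-bit error correcting codes listed above, here is a textbook Hamming(7,4) encoder and decoder in Python. The slides do not say which code the flight system uses, so this is a generic sketch under that assumption, not the flight implementation; the function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as the 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit; return (data_bits, error_position).

    error_position is 1-based, or 0 when no error was detected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome points at the flipped bit
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], position


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # inject a single-bit upset at codeword position 5
print(hamming74_decode(word))
```

A code of this kind corrects one upset per codeword, which is why it pairs naturally with scrubbing: errors are repaired before a second upset can land in the same word.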
23. Mitigation Overview: Sensor Data Processor (SDP)
- Processes 8 Gbps of data.
- Outputs 340 Mbits/s of processed data.
- Architecture
- Fiber receiver and SERDES link, 4 channels at a maximum of 160 Mpix each.
- Four quadrant processors for data processing. Contains 640 Mbytes of SDRAM for data storage
- 320-bit 85 MHz SDRAM, 1.8V
- Can generate up to 340 Mbits/s of source packet data
- One central Virtex for data networking
- De-mux data from SERDES chip outputs to 4 processing channels/quadrant Xilinx
- Controls frame summation rates and reference frame generation rates.
- Transfer source packets to downlink modules at up to 340 Mbits/s max
- USES compresses source packets.
24. Mitigation Overview: Sensor Data Processor (SDP)
[Block diagram of the SDP. Labels: Fiber Input, SERDES, Osc, PIX/Packet, 320, RS-ECC, ECC, ECC/CRC, ECC/TMR, TMR, XC2V3000 Interface Control, JTAG, I2C, TIME, System CLK, Packets to DLM/DLC, SDP, PXS, CTM]
25. SDP - SDRAM
- SDRAM interface, 1 per quadrant Virtex
- 20 1.8V Micron Mobile SDRAMs
- 1.8V LVTTL I/O
- 320-bit data bus: 240 pixel data, 80 ECC
- Data is Reed-Solomon encoded
- TMR'd outputs from Virtex: address, control, and clock
- Address and control signals are AC terminated.
- TMR'd input to Virtex: clock feedback used to de-skew the SDRAM clock
- Currently running at 85 MHz; designed to operate at 100 MHz
- Test
- Measured TMR SDRAM Addr, RAS, and CAS signals for the following cases.
- Inverted, stuck high, stuck low
- Measured Voh, Vol, Tr, and Tf.
- Count the number of Reed-Solomon errors, if any.
[Figure: SDRAM address and control]
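The bus figures on this slide imply a few derived quantities that are easy to check. The Python sketch below computes them from the slide's numbers; it assumes one bus transfer per clock cycle and ignores refresh and command overhead, so the bandwidth values are peak estimates rather than measured throughput.

```python
BUS_BITS = 320
PIXEL_BITS = 240
ECC_BITS = 80
DEVICES = 20


def sdram_bus_summary(clock_hz):
    """Derived bus figures; assumes one transfer per clock (peak, no overhead)."""
    return {
        "bits_per_device": BUS_BITS // DEVICES,
        "ecc_fraction_of_bus": ECC_BITS / BUS_BITS,
        "raw_bits_per_s": BUS_BITS * clock_hz,
        "pixel_bits_per_s": PIXEL_BITS * clock_hz,
    }


print(sdram_bus_summary(85_000_000))   # current operating clock
print(sdram_bus_summary(100_000_000))  # design target clock
```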
26. SDP - SDRAM (2)
[Scope pictures: SDRAM address, normal; SDRAM address, one I/O inverted]
27. SDP - SDRAM (3)
No SDRAM errors for all three failure cases
28. Upset Rates for Various SEU Mitigated IO Configurations
29. Lessons Learned
- Triple redundant outputs for >2.5V LVCMOS or LVTTL achieve correct Vol and Voh levels for all failure cases.
- For low-voltage I/O (<1.8V), thresholds are very close to margins for failure conditions and may violate other parts' specs.
- For the SDRAM interface, 1.8V I/O tolerated all three failure cases at room temperature.
- Double redundant outputs will not meet the correct Vol and Voh levels under I/O failure.
- Rise and/or fall times are lengthened due to I/O failure. This may cause more failures at higher speeds.
- Recommendation
- If resources permit, XTMR output for all control signals is recommended regardless of I/O type.
- High-speed, jitter-sensitive, or duty-cycle-sensitive device outputs need special consideration.
- EDAC on data buses is ideal for IOB failure protection.
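The point about thin margins at low I/O voltages can be checked numerically against a receiver's input thresholds. The Python sketch below assumes thresholds of 0.65 x Vcc (VIH) and 0.35 x Vcc (VIL), ratios typical of 1.8V CMOS inputs; these are assumptions for illustration and should be replaced with the receiving part's datasheet values. The output levels are the 1.8V LVCMOS stuck-at values quoted earlier in the slides.

```python
def noise_margins(voh, vol, vcc=1.8, vih_ratio=0.65, vil_ratio=0.35):
    """Margins between driven levels and assumed receiver thresholds (volts).

    vih_ratio and vil_ratio are assumed values, not taken from the slides.
    A negative margin means the level violates the assumed threshold.
    """
    vih = vih_ratio * vcc
    vil = vil_ratio * vcc
    return {"high_margin": voh - vih, "low_margin": vil - vol}


# Worst measured levels: Voh from the stuck-at-low case, Vol from stuck-at-high
print(noise_margins(voh=1.44, vol=0.40))
# Same check using the simulated levels
print(noise_margins(voh=1.26, vol=0.54))
```

Under these assumed thresholds both margins stay positive but shrink to roughly a tenth of a volt for the simulated levels, which illustrates why lower-voltage standards leave little room for an I/O failure.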