Title: Single Event Upset SEU Mitigating Techniques in a Space Radiation Environment for the FPGA based Ite
1Single Event Upset (SEU) Mitigating Techniques in
a Space Radiation Environment for the FPGA based
Iterative Repair Processor
- Group Presentation (11/30/2007)
- Jeffrey M. Carver
2Outline
- Introduction
- Background
- Fault Tolerant Techniques
- Configuration Frames
- DMRH and Fan-out design
- Iterative Repair Processor Fault Protected
- SEU Simulator
- Current Results
- Conclusions and Program of Study
- Publications
4Space Applications
- FPGAs are being used in space applications because of
- Low cost over ASICs
- Reconfigurable ability
- Can be optimized for a specific application
- Problems that occur in space
- Single Event Upsets (SEUs) occur when a memory cell changes value because of the radiation in the environment.
- Radiation also affects combinational logic by causing a temporary glitch that has been measured lasting from 0.3 ns to 1.3 ns.
- For FPGAs this means that fault tolerant techniques need to be applied to protect the storage memory, configuration memory, and combinational logic.
5Research Goal
- To find and apply fault tolerant techniques for a system designed for space applications (Iterative Repair Processor).
- Once the fault tolerant techniques to apply have been identified, an SEU Simulator for testing the robustness of each technique will be developed and used. The techniques will then be applied and tested.
7Triple Modular Redundancy (TMR)
- Triplication of the module, with a voting circuit to vote on the correct output of the device.
- Variants of this concept are used.
- Analog component to use for voting circuit
- Using 2-3 voting circuits with tri-state buffer.
- TMR in time
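The basic TMR idea above can be sketched in software as a bitwise 2-of-3 majority vote. This is only an illustration: the `add_one` module and the `upset_copy`/`upset_mask` parameters are hypothetical names used to simulate an upset in one copy, not part of the presented design.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three redundant module outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(module, x, upset_copy=None, upset_mask=0):
    """Run three copies of `module`; optionally flip bits in one copy's output."""
    outputs = [module(x) for _ in range(3)]
    if upset_copy is not None:
        outputs[upset_copy] ^= upset_mask  # simulated SEU in a single copy
    return majority_vote(*outputs)


def add_one(x):
    """Hypothetical 8-bit module used only for this example."""
    return (x + 1) & 0xFF


clean = tmr(add_one, 41)
upset = tmr(add_one, 41, upset_copy=1, upset_mask=0b100)
```

Because the vote is taken per bit, an upset in any single copy is masked as long as the other two copies agree.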
8Hamming Codes
- A Hamming code inserts check bits throughout the word.
- Improved Hamming Code can require an extra check bit, but it appends check bits onto the end of the word.
- Both can correct a single error in a word.
- Hamming relationship for the check bits required: 2^k >= m + k + 1 (m data bits, k check bits)
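As a small illustration of check bits inserted throughout the word, here is a sketch of the classic Hamming(7,4) code, which places check bits at positions 1, 2, and 4 and corrects any single-bit error. The function names are made up for this example, and the word size is far smaller than anything the processor would use.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(code):
    """Return (data bits, syndrome); the syndrome is the 1-based error position or 0."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
codeword = hamming74_encode(word)
corrupted = list(codeword)
corrupted[4] ^= 1  # simulated single-bit upset at position 5
recovered, position = hamming74_correct(corrupted)
```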
9TMR vs. Hamming
- TMR
- Requires at least a 200 percent increase in area.
- It is good for small memories and state machines.
- Hamming Codes
- Good for large memories.
- Requires check bits, a Hamming encoder, and a Hamming decoder.
- Seen to increase timing delay over TMR.
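A rough way to see why Hamming codes suit large memories is to compare storage overhead only. The sketch below uses the standard single-error-correcting relationship 2^k >= m + k + 1 for m data bits and k check bits. It ignores the encoder, decoder, and voter logic, so the numbers are indicative only.

```python
def hamming_check_bits(m):
    """Smallest k with 2**k >= m + k + 1 for an m-bit data word."""
    k = 0
    while 2 ** k < m + k + 1:
        k += 1
    return k


def storage_overhead_percent(m):
    """Return (TMR overhead, Hamming overhead) in percent for an m-bit word."""
    tmr = 200.0  # two extra copies of every bit
    hamming = 100.0 * hamming_check_bits(m) / m
    return tmr, hamming


for width in (4, 8, 16, 32, 64):
    tmr_pct, ham_pct = storage_overhead_percent(width)
    print(f"{width:2d}-bit word: TMR {tmr_pct:.0f}%, "
          f"Hamming {ham_pct:.1f}% ({hamming_check_bits(width)} check bits)")
```

The Hamming storage overhead shrinks as the word widens, while the TMR overhead stays fixed.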
10DWC-CED
- Duplication with Comparison combined with Concurrent Error Detection (DWC-CED)
- Two modules perform the same operation and their outputs are compared. (saves area over TMR)
- If the outputs do not match, it takes one more clock cycle to run the concurrent error detection method that finds which module is correct.
- Problem is finding a test that detects all possible errors that can occur in a module.
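The DWC-CED flow can be sketched as follows. The adder module and the inverse-operation check are assumptions chosen to keep the example small. A real CED test depends on the module, and finding a complete one is the difficulty noted above.

```python
def dwc_ced(op, check, a, b, fault_in=None, fault_mask=0):
    """Return (result, cycles) for a duplicated module with a CED fallback."""
    out0 = op(a, b)
    out1 = op(a, b)
    if fault_in == 0:
        out0 ^= fault_mask  # simulated upset in copy 0
    elif fault_in == 1:
        out1 ^= fault_mask  # simulated upset in copy 1
    if out0 == out1:
        return out0, 1
    # Mismatch: spend one more cycle on the CED test to pick the correct copy.
    if check(a, b, out0):
        return out0, 2
    if check(a, b, out1):
        return out1, 2
    raise RuntimeError("neither copy passed the concurrent error detection test")


def add8(a, b):
    """Hypothetical 8-bit adder module."""
    return (a + b) & 0xFF


def add8_check(a, b, result):
    """Hypothetical CED test: undo the addition and compare with the input."""
    return ((result - b) & 0xFF) == a


no_fault = dwc_ced(add8, add8_check, 20, 22)
fault_copy0 = dwc_ced(add8, add8_check, 20, 22, fault_in=0, fault_mask=0x10)
```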
11Other Techniques
- Other techniques for SEUs and even Multiple Event Upsets (MEUs) in memory.
- Cross Parity
- Reed-Muller
- Reed-Solomon
- Reed-Solomon with Hamming Codes
- Problem is the resource requirement to implement these techniques.
13Configuration Frames
- 1 bit wide
- Span an HCLK Row
- 16 CLBs in Height
- Size is 41 32-bit words
- Block Types
- CLBs/CLKs/DSPs/IOBs
- BRAM Interconnect
- BRAM Contents
- Multiple minor frames per major column
14Major Frames Numbering
- Numbering starts from 0 on the left and increases to the right
- SX35 Example
- CLBs/CLKs/DSPs/IOBs
- CLBs: 1-6, 8-15, 17-22, 25-30, 32-39, 41-46
- CLKs: 24
- DSPs: 7, 16, 31, 40
- IOBs: 0, 23, 47
- BRAM Interconnect: 0-7
- BRAM Content: 0-7
15Minor Frames per Major Frame
- There are multiple minor frames per major frame. The number of minor frames depends on the type of major frame being written to.
- Information for total minor frames per column type is from file xhwicap_i.h.
- CLBs: 22 total minor frames
- DSPs: 21 total minor frames
- IOBs: 30 total minor frames
- CLKs: 3 total minor frames
- BRAM Interconnect: 20 total minor frames
- BRAM Content: 64 total minor frames
- Numbering is from 0 to totalMinorFrames-1
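The counts above, together with the 41-word frame size, give the number of configuration bits behind one major column in one HCLK row. The sketch below simply tabulates the slide's numbers. The dictionary keys and function names are illustrative and are not identifiers from xhwicap_i.h.

```python
FRAME_WORDS = 41
FRAME_BITS = FRAME_WORDS * 32  # 1312 bits per frame

# Minor frames per major column type, as listed on the slide.
MINOR_FRAMES = {
    "CLB": 22,
    "DSP": 21,
    "IOB": 30,
    "CLK": 3,
    "BRAM_INTERCONNECT": 20,
    "BRAM_CONTENT": 64,
}


def minor_frame_range(column_type):
    """Valid minor frame numbers: 0 to totalMinorFrames-1."""
    return range(MINOR_FRAMES[column_type])


def bits_per_column(column_type):
    """Configuration bits in one major column within one HCLK row."""
    return MINOR_FRAMES[column_type] * FRAME_BITS


for name in MINOR_FRAMES:
    print(f"{name}: minors 0..{MINOR_FRAMES[name] - 1}, {bits_per_column(name)} bits")
```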
16Frame Layout
- Size is 41 32-bit words (1312 bits total)
- Frames in the bottom half are mirror images of the frames in the top half, with the exception of the vertical HCLK rows that contain the global and regional clocks. (ug071.pdf Xilinx)
- Top Half: 1311 to 0 (word 40 to word 0)
- Bottom Half: 0 to 1311 (word 0 to 40)
17Fault Correction Techniques
- Techniques for repairing faults in the configuration frames of the FPGA
- Scrubbing: just reload the configuration data from a device like an SEU-immune EEPROM.
- Error Checking and Correcting (ECC) frames
- Embed Hamming Codes inside the configuration frame
- Available in the Virtex-4 devices
- In order for these to be used, the design must not use resources that use the configuration frames for memory (e.g., shift registers).
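Scrubbing can be sketched in software with configuration memory modeled as a list of frames and the SEU-immune copy as a second list. This variant compares before rewriting, which is an assumption made here to show which frames were repaired. A blind scrubber would rewrite every frame without reading it back.

```python
def scrub(config_memory, golden):
    """Rewrite every frame that differs from the golden copy; return repaired indices."""
    repaired = []
    for index, good_frame in enumerate(golden):
        if config_memory[index] != good_frame:
            config_memory[index] = list(good_frame)
            repaired.append(index)
    return repaired


# Four toy frames of 41 words each; the word value is arbitrary.
golden = [[0xDEADBEEF] * 41 for _ in range(4)]
live = [list(frame) for frame in golden]
live[2][7] ^= 1 << 5  # simulated SEU in frame 2, word 7, bit 5
repaired = scrub(live, golden)
```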
19DMRH
- Double Modular Redundancy with Hold
- On disagreement, a signal is sent to the ICAP Controller, which scans and fixes up errors in the areas of the modules.
- The disagreement signal is also sent to the controller to pause at the current iteration.
- If the error is transient, it will disappear in 1 clock cycle
- Best for combinational logic and parallel designs
- Problem is the time delay to fix up the frame(s)
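A minimal behavioral sketch of the hold logic follows. It assumes the comparator holds for one cycle on the first disagreement, to let a transient clear, and only requests a repair if the disagreement persists. The slides do not spell out this policy, and the action names are illustrative.

```python
def dmrh_step(out_a, out_b, held_cycles):
    """Return (action, held_cycles) for one clock cycle of a DMRH comparator."""
    if out_a == out_b:
        return "run", 0
    if held_cycles == 0:
        return "hold", 1  # first disagreement: pause, it may be a transient
    return "repair", held_cycles + 1  # still disagreeing: frames need fixing up


def simulate(output_pairs):
    """Apply dmrh_step to a sequence of (module A, module B) outputs."""
    actions, held = [], 0
    for out_a, out_b in output_pairs:
        action, held = dmrh_step(out_a, out_b, held)
        actions.append(action)
    return actions


transient = simulate([(1, 1), (1, 0), (1, 1)])
persistent = simulate([(1, 1), (1, 0), (1, 0), (1, 0), (1, 1)])
```

In the persistent case the iteration stays paused until the two outputs agree again, which is the repair delay listed as the drawback.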
20Fan-out design
- Used in some of the multiplexers in the design.
- Can tolerate an SEU in the LUTs or in 1 of the lines after it is fanned out to the slices.
- The words being selected are Hamming Code protected.
- Reduces the need for redundancy
- Problem is an upset that occurs before the line is fanned out to the different slices.
22Iterative Repair (IR) Processor Design
- Testing circuitry measures change of behavior.
- Timer provides a timeout if the circuit can no longer complete due to an SEU.
- HWICAP is used to read and write configuration frames.
23Copy Processor
BEFORE
AFTER
24Alter Processor
BEFORE
AFTER
25Evaluate Processor
- Is comprised of three sub-processors
- Dependency Graph Violation
- Total Schedule Length
- Resource Over-utilization
26Dependency Graph Violation Sub-Processor
BEFORE
AFTER
27Total Schedule Length Sub-Processor
BEFORE
AFTER
28Resource Over-utilization Sub-Processor
- Longest stage, so only TMR is used to avoid the added delay seen in DMRH; might change in a future design.
- Measured 18 us to write a frame
- Measured 30 us to read and write a frame
- Max latency of an IR Processor iteration is 235 clock cycles, or 2.35 us with a 10 ns clock period.
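To put the measured frame times next to the iteration latency, the arithmetic below uses only the numbers on this slide and expresses one frame fix-up as a number of worst-case iterations.

```python
CLOCK_PERIOD_NS = 10
ITERATION_CYCLES = 235
FRAME_WRITE_US = 18
FRAME_READ_WRITE_US = 30

# Worst-case iteration latency in microseconds.
iteration_us = ITERATION_CYCLES * CLOCK_PERIOD_NS / 1000

# How many worst-case iterations fit in one frame operation.
iterations_per_write = FRAME_WRITE_US / iteration_us
iterations_per_read_write = FRAME_READ_WRITE_US / iteration_us

print(f"iteration: {iteration_us:.2f} us")
print(f"frame write: {iterations_per_write:.1f} iterations")
print(f"frame read and write: {iterations_per_read_write:.1f} iterations")
```

A single read and write of a frame therefore costs more than ten worst-case iterations.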
29Accept Processor
BEFORE
AFTER
30Adjust Temperature Processor
BEFORE
AFTER
32BYU SEU Simulator
- Requires 3 Virtex 1000 FPGAs
- Does not directly corrupt flip-flops
- Corrupts bits in bitstream
Sensitive Bits
FPGA Editor
33Xilinx SEU Simulator (xapp714)
- Requires 1 Virtex-4 FPGA
- Does not directly corrupt flip-flops
- Cannot see what frame address and configuration bit is being corrupted. (Is stated to start from the first bit in configuration memory)
- Clunky interface to use for simulating SEUs
- Uses embedded ECC frames
- Corrupts every configuration frame on the board. Unknown how/if it actually corrupts BRAM Interconnect and Content frames.
34USU SEU Simulator (Tool Flow)
- Requires Xilinx tools ISE, EDK, PlanAhead, and TMRTool.
- TMRTool removes the shift registers in the design.
- PlanAhead is used to map the design to be tested into separate configuration frames from the simulator circuit.
35USU SEU Simulator
- Uses 1 FPGA (tester circuit and design to test on the same device)
- Corrupts all bits in the configuration frames in the design-to-test area.
- Tests corrupting FFs
- 3 Techniques
- GCAPTURE/GRESTORE
- Intermediate Corruption
- Stuck-At Tests
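The exhaustive configuration-bit test can be sketched as a loop that flips one bit, checks behavior, and restores the bit before moving on. Frames are modeled as lists of 32-bit words, and `toy_design_ok` is a stand-in for the testing circuitry. On hardware the flips would be frame writes through the configuration port.

```python
def run_campaign(frames, behaves_correctly):
    """Flip each bit of each frame in turn; return the (frame, bit) pairs that change behavior."""
    sensitive = []
    for f, frame in enumerate(frames):
        for bit in range(len(frame) * 32):
            word, offset = divmod(bit, 32)
            frame[word] ^= 1 << offset  # inject the upset
            if not behaves_correctly(frames):
                sensitive.append((f, bit))
            frame[word] ^= 1 << offset  # repair before the next injection
    return sensitive


def toy_design_ok(frames):
    """Hypothetical design whose behavior depends only on bits 0 and 33 of frame 0."""
    return (frames[0][0] & 1) == 0 and ((frames[0][1] >> 1) & 1) == 0


frames = [[0] * 41 for _ in range(2)]
sensitive = run_campaign(frames, toy_design_ok)
```

The ratio of `len(sensitive)` to the total number of bits tested is the kind of sensitivity figure reported in the results.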
36Flip-Flop Architecture
- FFs share all lines except the D (Data) input and the XQ/YQ output
- SRINV mux controls the reset line given to the FFs
- SRMODE configuration bit determines what the FF is set to on reset.
- INIT bit is the value of the FF when the bitstream is first loaded onto the FPGA
37GCAPTURE/GRESTORE Method
- GCAPTURE loads the INIT bits of all FFs and Input/Output Buffer (IOB) registers with the current value of the register
- GRESTORE sets all registers to their INIT bit values.
- Put the device into a paused state (where FFs are not changing, SR input to FFs low, and clock signal still active).
- Then do a GCAPTURE and change the INIT bit in the desired FF. Follow with a GRESTORE.
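The sequence above can be modeled with a toy register file in which each flip-flop has a live value and an INIT bit. This is a behavioral sketch only. The class and function names are made up, and on hardware the INIT edit is a configuration frame write.

```python
class Registers:
    """Toy model: each flip-flop has a live value and an INIT configuration bit."""

    def __init__(self, values):
        self.value = list(values)
        self.init = list(values)

    def gcapture(self):
        """INIT bits take the current register values."""
        self.init = list(self.value)

    def grestore(self):
        """Registers take their INIT bit values."""
        self.value = list(self.init)


def inject_ff_upset(regs, ff_index):
    """Flip one flip-flop while the device is paused."""
    regs.gcapture()
    regs.init[ff_index] ^= 1  # edit the INIT bit of the chosen FF
    regs.grestore()


regs = Registers([0, 1, 1, 0])
regs.value = [1, 1, 0, 0]  # state after the design has run for a while
inject_ff_upset(regs, 2)
```

Only the chosen flip-flop changes, because GCAPTURE first copies every live value into its INIT bit.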
38Intermediate Corruption Method
- Put device into a paused state.
- Issue a GCAPTURE command
- Based on the INIT bits, set the SRMODE of the 2 FFs in the slice.
- Set the FF to be changed so that it resets to the opposite of its current value.
- Set the other FF to reset to its current value
- Change the SRINV multiplexer to select the other value. (This causes a reset of the FFs)
- Fix up the SRINV multiplexer and SRMODE bits.
- Device can then be resumed.
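The steps can be walked through on a simplified slice model with two FFs sharing one SR line. The model is an assumption for illustration: the reset is treated as taking effect immediately, and the captured values stand in for the GCAPTURE readback. It is not a description of the actual Virtex-4 slice.

```python
class Slice:
    """Toy slice: two FFs share one SR line; SRMODE picks each FF's reset value."""

    def __init__(self, values):
        self.value = list(values)  # live FF values
        self.srmode = [0, 0]       # value each FF takes on reset
        self.srinv = 0             # 1 selects the inverted SR line
        self.sr_input = 0          # held low while the device is paused

    def settle(self):
        if self.sr_input ^ self.srinv:  # effective reset asserted
            self.value = list(self.srmode)


def intermediate_corruption(s, target):
    """Flip FF `target` (0 or 1) and leave the other FF unchanged."""
    saved_srmode, saved_srinv = list(s.srmode), s.srinv
    captured = list(s.value)                 # stands in for the GCAPTURE readback
    other = 1 - target
    s.srmode[target] = captured[target] ^ 1  # reset to the opposite value
    s.srmode[other] = captured[other]        # reset to its current value
    s.srinv ^= 1                             # select the other line: reset fires
    s.settle()
    s.srmode, s.srinv = saved_srmode, saved_srinv  # fix up the configuration
    s.settle()


s = Slice([1, 0])
intermediate_corruption(s, 0)
```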
39Stuck-At Method
- Device can be in a paused state.
- In this method FFs are configured to be stuck at a desired value during operation of the device.
- Configure SRMODE bits to the desired value to be stuck at. Possible combos: 00, 01, 10, 11
- Change SRINV mux to select the opposite line.
- After the device run, fix up the changes made.
- Best if the device never resets FFs during operation.
- Helps reveal SEU sensitivity of specific FFs on any clock cycle.
41Design Mapped from PlanAhead
Placement from PlanAhead
Resources Mapped
42Bit Markup of Sensitive Resources
Placement from PlanAhead
- Does not specify what resources have SEU
sensitivity. It just gives a general idea.
Bit Markup
43Map of Sensitive Resources
Placement from PlanAhead
Map of Sensitive Resources in Slice
Key of Resources
44CLBs Tested
- From testing every configuration bit in the frames that made up the CLBs, we found
- 108395 bits out of 2193664 (4.9%) caused a change of behavior in the IR Processor
- When flying a satellite around the Earth, some have observed around 1000 SEUs a day.
- This means around 42 SEUs an hour
- Of which 2 SEUs on average are problems
- So if timing can be delayed on average every 30 minutes, it can be beneficial to use DMRH to reduce power and area requirements.
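The estimate above can be reproduced directly from the measured counts. The calculation assumes, as the slide implicitly does, that every observed SEU lands uniformly at random among the tested CLB configuration bits.

```python
SENSITIVE_BITS = 108395
TESTED_BITS = 2193664
SEUS_PER_DAY = 1000

# Fraction of tested bits whose corruption changed behavior.
sensitive_fraction = SENSITIVE_BITS / TESTED_BITS

# Upset rate per hour, and the share of those upsets that matter.
seus_per_hour = SEUS_PER_DAY / 24
harmful_per_hour = seus_per_hour * sensitive_fraction
minutes_between_harmful = 60 / harmful_per_hour

print(f"sensitive fraction: {sensitive_fraction:.1%}")
print(f"SEUs per hour: {seus_per_hour:.0f}")
print(f"harmful SEUs per hour: {harmful_per_hour:.1f}")
print(f"minutes between harmful SEUs: {minutes_between_harmful:.0f}")
```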
45DSPs, BRAMs Tested
- Images show the Bit Markup
- CLBs: Green
- DSPs: Purple
- BRAM Interconnect: Blue
- BRAM Content: Red (Intermediate Corruption Testing)
- 127668 bits out of 4067200 (3.1%) caused a change of behavior in the IR Processor
- When trying to change BRAM content, the changes will not be accepted if writing a 1 to these bit offsets (ordering is word 0 to 40)
- Top: 136, 456, 808, 1128
- Bottom: 184, 504, 856, 1176
47Conclusions
- Simulator Tool status
- Simulates SEUs in CLBs, FFs, DSPs, BRAM interconnects, and BRAM content.
- Needs a method to reload the entire device when a permanent change in pattern is detected.
- Need to test full TMR design
- Need to test proposed fault tolerant design
- Have fault tolerant techniques automatically applied when the IR Processor is being generated
- Thesis defense in August?
49Publications
- Journal Articles under review
- IET Transactions on Computers and Digital Techniques
- Phillips, J., Sudarsanam, A., Kallam, R., Carver, J., and Dasu, A., Methodology to Derive Polymorphic Soft-IP Cores for FPGAs
50Publications
- Conference Papers under review
- DAC 2008
- Carver, J., Phillips, J., and Dasu, A., Improved
SEU Simulator for Virtex 4 FPGAs
51Publications
- Planned Journal Papers
- IEEE Design & Test of Computers or IEEE Transactions on Reliability
- Carver, J., Phillips, J., and Dasu, A., SEU Mitigating Techniques for a FPGA based Iterative Repair Processor