Title: A Rollback Recovery System for Embedded FPGA Processors
1 A Rollback Recovery System for Embedded FPGA Processors
James Reed Walker
Thesis Advisor: Dr. Christos Papachristou
Case Western Reserve University
Spring 2006
2 Talk Outline
- Background
- Key Contributions
- System Implementation
- I. Processor Rollback
- II. Memory Rollback
- Results
- Conclusions and Future Works
- Demonstrations
- Questions
3 Field Programmable Gate Arrays
- Array of configurable logic cells, configurable routing network
- 2 Types of Memory
- Configuration memories define logic and routing
- User-side memories hold active data
- Configurable I/O Blocks
- Embedded Cores
- Processors
- Memories
Systems-on-a-Chip (SoC) are constructed from IP cores.
4 Soft Errors
- Radiation Effects
- Single-Event Upset (SEU)
- Single-Event Transient (SET)
- Single-Event Functional Interrupt (SEFI)
- Device Packaging
- Alpha-particles
SRAM Cell
Trend: As feature sizes decrease, soft error rates (SER) increase!
5 Error Mitigation
- Configuration Memories
- Scrubbing (Active Reconfiguration)
- Configuration Read-back
- Configuration Rollback
- User-side Memories
- Triple-Modular Redundancy (TMR)
- Duplication with comparison (DWC)
- Error-detecting and correcting codes (EDAC)
- Reset Strategies
Active Data!
6 Related Works
- Rezgui et al. from Xilinx
- Combined TMR and scrubbing to mitigate soft errors in Xilinx FPGAs, tested using radiation
- Wang et al. from Xilinx
- Combined processor reset, EDAC, reconfiguration, and TMR in one scheme for embedded PowerPC in Virtex-II Pro FPGA
- Asadi and Tahoori from Northeastern University
- Read back configuration and computed CRC checksums, performed reconfiguration and rollback of user flip-flops from auxiliary FPGA
- McCluskey et al. from Stanford
- Performed rollback in a self-healing soft processor, with recovery state machines and transparent checkpointing
7 FPGA Processor System
- Processor
- Memories
- Processor Caches
- User Memory
- System Memory
- Busses
- System (Bus-0)
- Local (Bus-1)
- I/O Components
- IP Cores
- External I/O Devices
Memory-mapped I/O
Interrupt Assertion
8 Transient Faults
- SEU causes bit-flips in active user-side memories
- Register File
- Cache-lines, tags, attributes
- User RAM, Flip-flops
- Block RAM
- Sequential Logic
- Etc.
- Leave configuration memory to other techniques
Active Data!
9 Our Approach
- Rollback Recovery: Instruction sequences are re-executed to avoid soft errors
- Exploit architectural capabilities of FPGA that were not available in the past
- Focus on error recovery rather than error-detection
10 Key Contributions
- In the context of an FPGA device
- 1. The use of interrupts and interrupt software routines for checkpoint and rollback of an embedded processor.
- 2. The use of parallel DMA transfers for checkpoint and rollback of active system memory.
- 3. An experimental characterization of the sensitivity of register bit errors in PowerPC 405.
11 Concepts and Implementation
12 Rollback Strategy
- Practical Limitations
- The number of memory elements may be very large.
- Ex: System Memory
- Some memory elements may be inaccessible.
- Ex: Internal Buffers and Queues
- Rollback Methods
- Full: Restore all data within a component
- Partial: Restore data partially within a component
- Implicit: Re-use component without restoring data
13 Rollback Strategy
- Processor registers and active system memory recovered using redundant data
- Caches recovered through invalidation
- Combinatorial and sequential logic recovered through instruction retry
Explicit recovery of registers and system memory only!
14 I. Processor Rollback
15 Interrupt-based Checkpointing
- Interrupts pause processor at single point of execution
- Checkpoints generated from 2 sources
- Software traps inserted into application code
- Hardware timer interrupts generated periodically by an IP core
- Checkpoint routine saves all accessible registers
16 Interrupt-based Rollback
- Error detectors assert rollback interrupts
- Rollback routine
- Restores all accessible registers
- Returns execution to checkpoint address
- Soft errors avoided after rollback
17 Checkpoint-Rollback Scenarios
- Error during program: Single checkpoint required
- Error during checkpoint: Checkpoint may be partially written
- Rollback routine must override checkpointing routine
- Minimum of 2 checkpoints must be available for rollback
Error during program
Error during checkpoint!
18 Checkpoint-Rollback Coordination
- Flags ensure partial rollback never occurs
- Valid: Indicates if checkpoint contains valid data
- Lock: Indicates whether checkpoint is being written
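The flag protocol can be sketched in C as follows. This is a minimal illustration, not the thesis routine (which runs as interrupt service code on the PowerPC). The struct layout, the `NUM_CHKPT` and `NUM_REGS` values, and the function names `checkpoint_save` and `rollback_select` are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_CHKPT 2   /* assumed: the minimum of two checkpoint slots */
#define NUM_REGS  32  /* assumed: general-purpose registers only */

typedef struct {
    uint32_t valid;            /* 1 if the slot holds a complete checkpoint */
    uint32_t lock;             /* 1 while the slot is being written */
    uint32_t regs[NUM_REGS];   /* saved register values */
} checkpoint_t;

/* Save registers into a slot; the slot stays locked for the whole write. */
void checkpoint_save(checkpoint_t *slot, const uint32_t *regs)
{
    slot->lock = 1;
    slot->valid = 0;
    memcpy(slot->regs, regs, sizeof slot->regs);
    slot->valid = 1;
    slot->lock = 0;
}

/* Pick a slot to roll back to, starting at `newest` and walking backwards.
 * Returns the slot index, or -1 if no usable checkpoint exists. */
int rollback_select(const checkpoint_t *slots, int newest)
{
    assert(newest >= 0 && newest < NUM_CHKPT);
    for (int k = 0; k < NUM_CHKPT; k++) {
        int i = (newest - k + NUM_CHKPT) % NUM_CHKPT;
        if (slots[i].valid && !slots[i].lock)
            return i;   /* complete and not being written */
    }
    return -1;
}
```

If a rollback interrupts a checkpoint in progress, the newest slot is still locked, so the selection falls back to the previous slot. This is why at least two checkpoints are needed.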
19 Checkpoint-Rollback Data Space
- Control header stores control data
- Temporary space stores volatile registers
- Checkpoint space stores checkpoint data in wraparound buffer
20 Implementation
- Block RAM
- ISR Memory holds checkpoint and rollback routines
- Checkpoint Memory-0 holds register checkpoints
- System memory holds OS and user applications
- IP Logic Cores
- INT-1 controller generates critical interrupts
- INT-0 controller generates non-critical interrupts
21 Checkpoint Generation
- 2 Interrupt Sources
- Software Traps generate program exceptions
- Checkpoint Timer IP core generates external exceptions
- Exception handlers branch to checkpoint routine
- ba 0x00000000
- Interrupt controller (INT-1) allows for servicing of non-checkpoint interrupts
22 Checkpoint Routine
- Volatile registers saved to temporary space
- Non-checkpoint interrupts branch back to exception handler
- Checkpoint identifier used to compute checkpoint data address
- Load and store instructions used to transfer register data
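One way to compute the checkpoint data address from the checkpoint identifier is sketched below, assuming the checkpoint space is a wraparound buffer of fixed-size slots. The base address, slot count, and slot size are hypothetical values chosen for the example, as are the names `chkpt_addr` and `chkpt_store`. The actual routine is written with PowerPC load and store instructions.

```c
#include <assert.h>
#include <stdint.h>

#define CHKPT_BASE   0x00001000u  /* assumed base of the checkpoint space */
#define CHKPT_SLOTS  4u           /* assumed number of slots in the wraparound buffer */
#define CHKPT_WORDS  64u          /* assumed words reserved per checkpoint */

/* Address of the slot that checkpoint number `id` is written to. */
uint32_t chkpt_addr(uint32_t id)
{
    uint32_t slot = id % CHKPT_SLOTS;               /* wrap around */
    return CHKPT_BASE + slot * CHKPT_WORDS * 4u;    /* 4 bytes per word */
}

/* Store `nregs` register values with word-sized stores.
 * `mem` stands in for the checkpoint memory, indexed in words from address 0. */
void chkpt_store(uint32_t *mem, uint32_t id, const uint32_t *regs, uint32_t nregs)
{
    assert(nregs <= CHKPT_WORDS);
    uint32_t word = chkpt_addr(id) / 4u;
    for (uint32_t r = 0; r < nregs; r++)
        mem[word + r] = regs[r];
}
```

With these values, identifiers 0 and 4 map to the same slot, so the oldest checkpoint is overwritten once the buffer is full.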
23 Rollback Generation
- Interrupt controller (INT-0) generates external interrupts in Simulation mode
- Applications assert errors through OPB interface.
- Exception handlers branch to rollback routine
- ba 0x00000500
24 Rollback Routine
- Rollback identifier used to compute checkpoint data address
- Load and store instructions used to transfer register data
- Lock bits prevent rollback from using partially saved checkpoint
Return to checkpoint address
25 Operation Summary
- Program Execution
- Instructions and data accessed from System Memory
- Register Checkpoint
- Internal traps or timer assert checkpoints periodically
- Checkpoint routine executes from ISR memory
- Registers saved to Checkpoint Memory-0
26 Operation Summary
- Register Rollback
- Software asserts rollback through INT-0 Controller
- Rollback routine executes from ISR memory
- Registers restored from Checkpoint Memory-0
27 II. Memory Rollback
28 Memory Rollback Problem
- Assuming processor rollback
First Pass
Second Pass
- Strategy
- Invalidate Cache Data
- Recover Active System Memory
29 Cache Invalidation
- Cache Arrays
- Cache-lines contain data
- Tags contain physical addresses
- Attribute bits indicate dirty and least-recently-used status
- Operation
- On rollback, invalidate all cache-lines using software instructions
- dccci - invalidates sets of cache-lines in the data cache
- iccci - invalidates the entire instruction cache
- Write-through Mode: All stores update cache and system memory
512 cache-lines (32 bytes each)
No Memory Coherence Problems!
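The data-cache invalidation loop can be sketched in C as below. The 512 lines of 32 bytes come from the slide. The 2-way organization (`NUM_WAYS`) is an assumption, and the `invalidate_fn` callback is a hypothetical stand-in for a wrapper around the `dccci` instruction, so that the loop can be run on a host. `stub_invalidate` only counts calls.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES  32u    /* from the slide: 32 bytes per cache-line */
#define NUM_LINES   512u   /* from the slide: 512 cache-lines */
#define NUM_WAYS    2u     /* assumption: 2-way set-associative cache */

/* Callback that invalidates the set of lines selected by `addr`.
 * On the target this would wrap the dccci instruction. */
typedef void (*invalidate_fn)(uint32_t addr);

/* Host-side stand-in that only counts how often it is called. */
static uint32_t stub_calls;
static void stub_invalidate(uint32_t addr) { (void)addr; stub_calls++; }

/* Walk one address per set so that every cache-line is invalidated.
 * Returns the number of invalidate operations issued. */
uint32_t dcache_invalidate_all(invalidate_fn invalidate)
{
    uint32_t sets = NUM_LINES / NUM_WAYS;
    assert(invalidate != 0);
    for (uint32_t s = 0; s < sets; s++)
        invalidate(s * LINE_BYTES);   /* consecutive lines select consecutive sets */
    return sets;
}
```

Because the cache runs in write-through mode, system memory already holds every stored value, so the lines can be dropped without writing anything back.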
30 DMA Checkpoint and Rollback
- Active data is partitioned into a single contiguous data block
- DMA controller(s) transfer data between active system memory and a redundant checkpoint memory
- Maximum transfer speed: 4 bytes / bus cycle
- Processor is bypassed entirely
Single DMA
Parallel DMA
31 Implementation
- Register Rollback functionality retained
- Block RAM Memories
- Dual-port system memory for simultaneous access
- Checkpoint Memory-1/2 hold copy of active data space
- IP Logic Cores
- Dual DMA Controllers
- Dual OPB Busses
- Memory Controllers
32 Program Memory Map
- Linker script places active data sections into one contiguous block
- Simple algorithm used to partition active data into two consecutive regions
- Compute total transfer size in words
- Compute transfer size for each controller
- Compute transfer addresses
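The three partitioning steps above can be sketched in C as follows. The `dma_xfer_t` type, the function name, and the choice to give the first controller the extra word when the total is odd are assumptions for illustration. The slides do not show the actual routine.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t src;     /* source address in system memory */
    uint32_t dst;     /* destination address in checkpoint memory */
    uint32_t bytes;   /* transfer length in bytes */
} dma_xfer_t;

/* Split [start, end) into two consecutive regions, one per DMA controller.
 * chk0 and chk1 are the base addresses of the two checkpoint memories. */
void partition_active_data(uint32_t start, uint32_t end,
                           uint32_t chk0, uint32_t chk1,
                           dma_xfer_t out[2])
{
    assert(end >= start);
    uint32_t words  = (end - start + 3u) / 4u;   /* total size in words, rounded up */
    uint32_t words0 = (words + 1u) / 2u;         /* first controller takes the extra word */
    uint32_t words1 = words - words0;

    out[0].src   = start;
    out[0].dst   = chk0;
    out[0].bytes = words0 * 4u;
    out[1].src   = start + words0 * 4u;          /* second region follows the first */
    out[1].dst   = chk1;
    out[1].bytes = words1 * 4u;
}
```

For a 25 KB active data block (6,400 words), each controller is assigned 3,200 words, or 12,800 bytes.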
ELF (Executable and Linking Format)
Size of Active Data: 25 KB!
33 DMA Controller Operation
- Processor writes source address (SA), destination address (DA) and transfer length (LENGTH); transfer begins immediately
- DMA Controller buffers data in a 16-word data buffer
- Processor checks interrupt status register (ISR) periodically for completion or errors
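The programming sequence for one transfer might look like the following C sketch. The register block `dma_regs_t`, its field order, and the `DMA_ISR_DONE` / `DMA_ISR_ERROR` bit positions are hypothetical placeholders, not the register map of the Xilinx DMA core. The sketch also assumes that the write to LENGTH is what starts the transfer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register block; field order and bit positions are
 * placeholders, not the layout of the Xilinx DMA core. */
typedef struct {
    volatile uint32_t sa;      /* source address */
    volatile uint32_t da;      /* destination address */
    volatile uint32_t length;  /* transfer length in bytes */
    volatile uint32_t isr;     /* interrupt status register */
} dma_regs_t;

#define DMA_ISR_DONE   0x1u   /* assumed "transfer complete" bit */
#define DMA_ISR_ERROR  0x2u   /* assumed "bus error" bit */

/* Program one transfer. LENGTH is written last on the assumption
 * that this write triggers the transfer. */
void dma_start(dma_regs_t *dma, uint32_t src, uint32_t dst, uint32_t bytes)
{
    assert(dma != 0);
    dma->sa = src;
    dma->da = dst;
    dma->length = bytes;
}

/* Poll once: 1 = completed, -1 = error, 0 = still running. */
int dma_poll(const dma_regs_t *dma)
{
    uint32_t status = dma->isr;
    if (status & DMA_ISR_ERROR) return -1;
    if (status & DMA_ISR_DONE)  return 1;
    return 0;
}
```

With two controllers, a routine would call `dma_start` on both and then poll each until neither returns 0.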
Source Address
Destination address
Length
Completion
Courtesy of Xilinx
34 DMA Controller Operation
- Routines augmented to support simultaneous register and memory transfers
- Memory Checkpoint
- Save active data to checkpoint memories
- Memory Rollback
- Restore active data from checkpoint memories
- Routines wait until transfer completion or DMA bus error
Registers transferred during DMA
35 Operation Summary
36 Operation Summary
37 Platform Choice
- Xilinx University Program Virtex-II Pro Board
- Xilinx Virtex-II Pro FPGA
- 2 Embedded PowerPC Processors
- Block RAM (300 KB)
- Xilinx Embedded Development Kit (EDK)
- Hardware
- Synthesis Tools (XST)
- Mapping (MAP)
- Place and Route (PAR)
- Software
- GNU Tools (gcc, as, ld)
- Xilinx Microkernel Operating System
- Debugging Tools (XMD)
38 Design Flow
Initialization, Shell, Demos
Checkpoint / Rollback Routines
39 Results
40 Logic Utilization
1 CLB = 4 Slices
Additional data path required more logic slices
Max. Slices: 13,696 => Room for 11 more DMA paths!
41 Register Bit Error Sensitivity
PowerPC Register Bit Error Classification
- Experiment
- Invert and rollback single register bits
- Results
- Reserved bits were not writeable
- Unrecoverable bits caused system to hang
- Recoverable bits restored using rollback
42 Bit Coverage
- Bit coverage is the ratio of recoverable bits to total bits in the system or constituent parts of the system.
Recovered approximately 500 K bits
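The definition of bit coverage can be written as a ratio. The symbols $C$, $B_{\text{recoverable}}$ and $B_{\text{total}}$ are introduced here for illustration and do not appear on the slide.

```latex
C = \frac{B_{\text{recoverable}}}{B_{\text{total}}}, \qquad 0 \le C \le 1
```

The same ratio applies to the whole system or to one constituent part, such as the register file, by restricting both counts to that part.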
43 Checkpointing Performance
$T_{useful}$: time for normal execution
$T_{chkpt}$: time for checkpointing
$T_{total}$: total execution time
Processor time-base recorded during checkpoint routine.
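The slide's performance expression did not survive extraction. One common reading of the three quantities, given here as an assumption rather than as the thesis definition, is that total time is the sum of useful and checkpointing time, and performance is the useful fraction:

```latex
T_{total} = T_{useful} + T_{chkpt}, \qquad
\text{Performance} = \frac{T_{useful}}{T_{total}}
```

Under this reading, a checkpointing performance of 95% would mean that 5% of total execution time is spent in the checkpoint routine.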
44 Checkpointing Performance
45 Mean Recovery Time
$T_{rollback}$: time for rollback
$T_{\text{re-execute}}$: time for re-execution
$T_{period}$: checkpoint period
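The slide's own expression was lost in extraction. A plausible reconstruction, offered as an assumption, is that recovery time is the rollback time plus the time to re-execute the work lost since the last checkpoint, which is bounded by the checkpoint period. The symbol $T_{recover}$ is taken from the Observations slide.

```latex
T_{recover} = T_{rollback} + T_{\text{re-execute}}, \qquad
0 \le T_{\text{re-execute}} \le T_{period}
```

If errors were equally likely at any point in the period and rollback always targeted the most recent checkpoint, the expected re-execution time would be $T_{period}/2$. The thesis may use a different model.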
46 Rollback Time
For Comparison
47 Mean Recovery Time
Re-execution time dominates
48 Observations
- When checkpointing performance is 95%
- Register Rollback: $T_{recover}$ of 10 K cycles
- Parallel DMA Rollback: $T_{recover}$ of 267 K cycles
- When checkpointing performance is 99%
- Register Rollback: $T_{recover}$ of 28 K cycles
- Parallel DMA Rollback: $T_{recover}$ of 2.7 M cycles
XMK Operating System Restart: 150 M cycles
49 Scalability
- Logic Utilization
- Checkpointing Time
- Recovery Time
Size of active data determines checkpointing time
Re-execution time dominates recovery time
50 Rollback Errors
Mitigation
- No checkpoints available: Reset
- Checkpoint / control data corrupted: ECC, Data Redundancy
- Interrupt routines corrupted: Memory Checkers
- Interrupt assertion or handling failure: Re-assertion, TMR
- DMA transfer errors: DMA Reconfiguration
- Cache invalidation failure: Cache Reconfiguration
Solutions: 1) Software Reconfiguration 2) State Checking 3) Redundancy Techniques 4) Reset
51 Latent Errors
- Delays exist between fault occurrence and fault detection
Solution: Must know detection latencies
52 Conclusions
- Rollback system successfully recovered operating system and user applications
- Checkpointing overhead minimized using parallel DMA paths
- Rollback demonstrated much shorter recovery times than full system restart
- System considered to be scalable to embedded SoC with reasonable active memory sizes
53 Summary of Contributions
- In the context of an FPGA device
- 1. The use of interrupts and interrupt software routines for checkpoint and rollback of an embedded processor
- 2. The use of DMA transfers for checkpoint and rollback of active system memory
- 3. The use of parallel DMA with dual-port / bus structure for checkpoint and rollback of active system memory
- 4. An experimental characterization of the sensitivity of register bit errors in an embedded PowerPC processor
- 5. A software technique for emulation of a hardware-based recovery cache
54 Future Works
- Extensions
- Error-detection Schemes
- Hardware-based Recovery Cache
- DMA with Larger Bandwidths
- Rigorous Reliability Analysis
- Architectures
- Port to FPGA Soft Processor
- Design combined TMR-Rollback System
- Multi-processor FPGA Rollback Systems
55 Other Aspects of this Work
- Recovery Cache
- Developed a software emulation technique for characterizing hardware-based recovery cache
- I/O Rollback
- Recovered I/O sequences by blocking checkpoints
- Rollback Reliability
- Formulated reliability using exponential failure rate law
- Verification Techniques
- Developed message log for recording checkpoint and rollback events and timing
56 Demonstrations
- State Machine Rollback
- State 0: initialize count, assert checkpoint
- State 1: count-down
- State 2: assert rollback
- State 3: exit
- Thread Rollback
- Assert checkpoint in Thread 0
- Assert rollback in Thread 1
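The state machine demonstration can be imitated on a host with the C sketch below, where copying a struct stands in for the checkpoint and rollback routines. The `app_t` type, the `run_demo` function, and the `rollbacks_left` counter are assumptions. In particular, how the real demonstration avoids rolling back forever is not stated on the slide, so the sketch simply asserts the rollback once.

```c
#include <assert.h>

typedef struct { int state; int count; } app_t;   /* state that is rolled back */

/* Runs the four-state demo once and returns how many times the
 * count-down step in State 1 executed in total. */
int run_demo(int start_count)
{
    app_t app = { 0, 0 };
    app_t saved = app;        /* stands in for checkpoint memory */
    int rollbacks_left = 1;   /* kept outside the checkpoint, so it survives rollback */
    int steps = 0;

    assert(start_count >= 0);
    for (;;) {
        switch (app.state) {
        case 0:                         /* initialize count, assert checkpoint */
            app.count = start_count;
            app.state = 1;
            saved = app;
            break;
        case 1:                         /* count-down */
            if (app.count > 0) { app.count--; steps++; }
            else app.state = 2;
            break;
        case 2:                         /* assert rollback (once) */
            if (rollbacks_left > 0) { rollbacks_left--; app = saved; }
            else app.state = 3;
            break;
        default:                        /* State 3: exit */
            return steps;
        }
    }
}
```

After the rollback, the count-down is re-executed from the checkpointed value, so the step count is twice the initial count.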
57 Questions
58 Extra Slides
59 PowerPC Architecture
60 PowerPC Registers
61 PowerPC Exceptions
62 Reliability
Assuming exponential failure rate
Rollback Reliability
System Reliability
$P_{recovery} = P_d \, P_v \, P_c \, P_r$
$P_d$: probability that error is detected
$P_v$: probability that a valid checkpoint is used
$P_c$: probability that error is covered by rollback
$P_r$: probability of successful rollback execution
63 Verification
64 Recovery Cache Emulation
for (i = 0; i < 100; i++) {
    chkpt;  // assert checkpoint
    for (j = 0; j < 100; j++) {  // fill arrays
        test1[j] = i + j;
        test2[j] = i + j;
        test3[j] = i + j;
    }
}
65 Recovery Cache Emulation
66 I/O Rollback
67 Design Considerations
- I/O Registers (Ports) may not behave like normal memories
- Status Registers
- Read-only / Write-only Registers
- Reading a register can change its value
- I/O sequence: a finite sequence of instructions used to send and receive data from an I/O component
- Assumptions
- I/O sequences perform one independent function
- I/O sequences are repeatable
Bidirectional sequences may contain dependencies!
68 I/O Sequence Preservation
- I/O sequences are preserved by blocking checkpoints during I/O execution
Partial I/O Rollback
Preserved I/O Rollback
69 Implementation
- Register and Memory Rollback functionality retained
- UART Controller
- Communicates with PC UART through dedicated I/O Port
- UART controller ports accessed through OPB interface
- Input port polled to receive characters from HyperTerminal.
70 Checkpoint Blocking
- Traps
- Application asserts traps before and after I/O sequences.
- Checkpoint Timer
- Software flag masks checkpoint interrupts
- Application disables checkpoints before I/O, enables checkpoints after I/O
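The timer-side blocking can be sketched in C as a software flag that the checkpoint handler tests before saving anything. All names here (`chkpt_enabled`, `chkpt_timer_tick`, `io_sequence`, `uart_stub`) are hypothetical. `uart_stub` imitates a timer interrupt arriving in the middle of an I/O sequence.

```c
#include <assert.h>

static volatile int chkpt_enabled = 1;  /* software flag that masks checkpoint interrupts */
static int chkpt_count = 0;             /* number of checkpoints actually saved */

void chkpt_disable(void) { chkpt_enabled = 0; }
void chkpt_enable(void)  { chkpt_enabled = 1; }

/* Stand-in for the checkpoint timer's interrupt handler:
 * returns 1 if a checkpoint was saved, 0 if it was masked. */
int chkpt_timer_tick(void)
{
    if (!chkpt_enabled)
        return 0;
    chkpt_count++;        /* the real routine would save registers and memory here */
    return 1;
}

/* Run an I/O sequence with checkpoints blocked for its whole duration. */
void io_sequence(void (*body)(void))
{
    assert(body != 0);
    chkpt_disable();
    body();
    chkpt_enable();
}

/* Example I/O body: a timer tick fires while the sequence is running. */
static void uart_stub(void) { (void)chkpt_timer_tick(); }
```

Because no checkpoint can land inside the sequence, a later rollback restarts the I/O sequence from its beginning instead of resuming it halfway.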
Checkpoints
No checkpoints saved during UART operation