1
A Rollback Recovery System for Embedded FPGA Processors
James Reed Walker
Thesis Advisor: Dr. Christos Papachristou
Case Western Reserve University
Spring 2006
2
Talk Outline
  • Background
  • Key Contributions
  • System Implementation
  • I. Processor Rollback
  • II. Memory Rollback
  • Results
  • Conclusions and Future Works
  • Demonstrations
  • Questions

3
Field Programmable Gate Arrays
  • Array of configurable logic cells, configurable
    routing network
  • 2 Types of Memory
  • Configuration memories define logic and routing
  • User-side memories hold active data
  • Configurable I/O Blocks
  • Embedded Cores
  • Processors
  • Memories

Systems-on-a-Chip (SoC) are constructed from IP
cores.
4
Soft Errors
  • Radiation Effects
  • Single-Event Upset (SEU)
  • Single-Event Transient (SET)
  • Single-Event Functional Interrupt (SEFI)
  • Device Packaging
  • Alpha-particles

SRAM Cell
Trend: As feature sizes decrease, soft error
rates (SER) increase!
5
Error Mitigation
  • Configuration Memories
  • Scrubbing (Active Reconfiguration)
  • Configuration Read-back
  • Configuration Rollback
  • User-side Memories
  • Triple-Modular Redundancy (TMR)
  • Duplication with comparison (DWC)
  • Error-detecting and correcting codes (EDAC)
  • Reset Strategies

Active Data!
6
Related Works
  • Rezgui et al. from Xilinx
  • Combined TMR and scrubbing to mitigate soft
    errors in Xilinx FPGAs, tested using radiation
  • Wang et al. from Xilinx
  • Combined processor reset, EDAC, reconfiguration,
    and TMR in one scheme for embedded PowerPC in
    Virtex-II Pro FPGA
  • Asadi and Tahoori from Northeastern University
  • Read back configuration and computed CRC
    checksums, performed reconfiguration and rollback
    of user flip-flops from auxiliary FPGA
  • McCluskey et al. from Stanford
  • Performed rollback in a self-healing soft
    processor, with recovery state machines and
    transparent checkpointing

7
FPGA Processor System
  • Processor
  • Memories
  • Processor Caches
  • User Memory
  • System Memory
  • Busses
  • System (Bus-0)
  • Local (Bus-1)
  • I/O Components
  • IP Cores
  • External I/O Devices

Memory-mapped I/O
Interrupt Assertion
8
Transient Faults
  • SEU causes bit-flips in active user-side
    memories
  • Register File
  • Cache-lines, tags, attributes
  • User RAM, Flip-flops
  • Block RAM
  • Sequential Logic
  • Etc.
  • Leave configuration memory to other techniques

Active Data!
9
Our Approach
  • Rollback Recovery: Instruction sequences are
    re-executed to avoid soft errors
  • Exploit architectural capabilities of FPGA that
    were not available in the past
  • Focus on error recovery rather than
    error-detection

10
Key Contributions
  • In the context of an FPGA device
  • 1. The use of interrupts and interrupt software
    routines for checkpoint and rollback of an
    embedded processor.
  • 2. The use of parallel DMA transfers for
    checkpoint and rollback of active system memory.
  • 3. An experimental characterization of the
    sensitivity of register bit errors in PowerPC
    405.

11
Concepts and Implementation
12
Rollback Strategy
  • Practical Limitations
  • The number of memory elements may be very large.
  • Ex: System Memory
  • Some memory elements may be inaccessible
  • Ex: Internal Buffers and Queues
  • Rollback Methods
  • Full: Restore all data within a component
  • Partial: Restore data partially within a
    component
  • Implicit: Re-use component without restoring data

13
Rollback Strategy
  • Processor registers and active system memory
    recovered using redundant data
  • Caches recovered through invalidation
  • Combinatorial and sequential logic recovered
    through instruction retry

Explicit recovery of registers and system memory
only!
14
I. Processor Rollback
15
Interrupt-based Checkpointing
  • Interrupts pause processor at single point of
    execution
  • Checkpoints generated from 2 sources
  • Software traps inserted into application code
  • Hardware timer interrupts generated periodically
    by an IP core
  • Checkpoint routine saves all accessible registers

16
Interrupt-based Rollback
  • Error detectors assert rollback interrupts
  • Rollback routine
  • Restores all accessible registers
  • Returns execution to checkpoint address
  • Soft errors avoided after rollback

17
Checkpoint-Rollback Scenarios
  • Error during program: Single checkpoint required
  • Error during checkpoint: Checkpoint may be
    partially written
  • Rollback routine must override checkpointing
    routine
  • Minimum of 2 checkpoints must be available for
    rollback

Error during program
Error during checkpoint!
18
Checkpoint-Rollback Coordination
  • Flags ensure partial rollback never occurs
  • Valid: Indicates whether checkpoint contains valid
    data
  • Lock: Indicates whether checkpoint is being
    written

19
Checkpoint-Rollback Data Space
  • Control header stores control data
  • Temporary space stores volatile registers
  • Checkpoint space stores checkpoint data in
    wraparound buffer
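
The data space and the Valid/Lock flags from the preceding slides can be sketched in C as follows. This is a host-side model only, not the thesis code: the register count, the slot count, the field layout, and every name (`chkpt_space_t`, `chkpt_save`, `chkpt_select`, and so on) are assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

#define NUM_REGS  32  /* assumed; the talk only says "all accessible registers" */
#define NUM_SLOTS 4   /* assumed wraparound depth; the talk requires at least 2 */

typedef struct {
    uint32_t valid;           /* 1 if the slot holds a complete checkpoint */
    uint32_t lock;            /* 1 while the slot is being written */
    uint32_t return_addr;     /* address to resume from after rollback */
    uint32_t regs[NUM_REGS];  /* saved register values */
} chkpt_slot_t;

typedef struct {
    uint32_t next_id;              /* control header: next checkpoint identifier */
    uint32_t temp[4];              /* temporary space for volatile registers */
    chkpt_slot_t slots[NUM_SLOTS]; /* checkpoint space (wraparound buffer) */
} chkpt_space_t;

/* Lock the slot selected by the current identifier and mark it invalid. */
static chkpt_slot_t *chkpt_begin(chkpt_space_t *cs) {
    chkpt_slot_t *s = &cs->slots[cs->next_id % NUM_SLOTS];
    s->lock = 1;
    s->valid = 0;
    return s;
}

/* Mark the slot complete, release the lock, and advance the identifier. */
static void chkpt_commit(chkpt_space_t *cs, chkpt_slot_t *s) {
    s->valid = 1;
    s->lock = 0;
    cs->next_id++;
}

/* Save one complete checkpoint. */
static void chkpt_save(chkpt_space_t *cs, const uint32_t *regs, uint32_t ret) {
    chkpt_slot_t *s = chkpt_begin(cs);
    memcpy(s->regs, regs, sizeof s->regs);
    s->return_addr = ret;
    chkpt_commit(cs, s);
}

/* Return the newest slot that is valid and not locked, or NULL if none. */
static const chkpt_slot_t *chkpt_select(const chkpt_space_t *cs) {
    for (uint32_t back = 1; back <= NUM_SLOTS && back <= cs->next_id; back++) {
        const chkpt_slot_t *s = &cs->slots[(cs->next_id - back) % NUM_SLOTS];
        if (s->valid && !s->lock)
            return s;
    }
    return NULL;
}
```

If an error arrives between `chkpt_begin` and `chkpt_commit`, the locked slot is skipped and `chkpt_select` falls back to the previous complete checkpoint, which is why at least two slots are needed.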

20
Implementation
  • Block RAM
  • ISR Memory holds checkpoint and rollback routines
  • Checkpoint Memory-0 holds register checkpoints
  • System memory holds OS and user applications
  • IP Logic Cores
  • INT-1 controller generates critical interrupts
  • INT-0 controller generates non-critical interrupts

21
Checkpoint Generation
  • 2 Interrupt Sources
  • Software Traps generate program exceptions
  • Checkpoint Timer IP core generates external
    exceptions
  • Exception handlers branch to checkpoint routine
  • ba 0x00000000
  • Interrupt controller (INT-1) allows for servicing
    of non-checkpoint interrupts

22
Checkpoint Routine
  • Volatile registers saved to temporary space
  • Non-checkpoint interrupts branch back to
    exception handler
  • Checkpoint identifier used to compute checkpoint
    data address
  • Load and store instructions used to transfer
    register data

23
Rollback Generation
  • Interrupt controller (INT-0) generates external
    interrupts in Simulation mode
  • Applications assert errors through OPB
    interface.
  • Exception handlers branch to rollback routine
  • ba 0x00000500

24
Rollback Routine
  • Rollback identifier used to compute checkpoint
    data address
  • Load and store instructions used to transfer
    register data
  • Lock bits prevent rollback from using partially
    saved checkpoint

Return to checkpoint address
25
Operation Summary
  • Program Execution
  • Instructions and data accessed from System Memory
  • Register Checkpoint
  • Internal traps or the timer assert checkpoints
    periodically
  • Checkpoint routine executes from ISR memory
  • Registers saved to Checkpoint Memory-0

26
Operation Summary
  • Register Rollback
  • Software asserts rollback through INT-0
    Controller
  • Rollback routine executes from ISR memory
  • Registers restored from Checkpoint Memory-0

27
II. Memory Rollback
28
Memory Rollback Problem
  • Assuming processor rollback

First Pass
Second Pass
  • Strategy
  • Invalidate Cache Data
  • Recover Active System Memory

29
Cache Invalidation
  • Cache Arrays
  • Cache-lines contain data
  • Tags contain physical addresses
  • Attribute bits mark lines as dirty or least recently used
  • Operation
  • On rollback, invalidate all cache-lines using
    software instructions
  • dccci - invalidates sets of cache-lines in the
    data cache
  • iccci - invalidates the entire instruction cache
  • Write-through Mode: All stores update cache and
    system memory
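
A minimal host-side sketch of the data-cache invalidation loop is shown below. It models the cache as an array of valid bits instead of issuing real `dccci` instructions. The two-way set-associative organization (256 sets of two lines) is an assumption about the PowerPC 405 data cache, and `model_dccci` and the other names are hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 32u                      /* from the slide: 32-byte lines */
#define NUM_LINES  512u                     /* from the slide: 512 cache-lines */
#define NUM_WAYS   2u                       /* assumption: two-way set-associative */
#define NUM_SETS   (NUM_LINES / NUM_WAYS)

static bool line_valid[NUM_SETS][NUM_WAYS]; /* host-side model of the valid bits */

/* Stand-in for dccci: invalidate every line in the set selected by addr. */
static void model_dccci(uint32_t addr) {
    uint32_t set = (addr / LINE_BYTES) % NUM_SETS;
    for (uint32_t way = 0; way < NUM_WAYS; way++)
        line_valid[set][way] = false;
}

/* Mark every line valid, as if the cache were full before rollback. */
static void fill_cache(void) {
    for (uint32_t set = 0; set < NUM_SETS; set++)
        for (uint32_t way = 0; way < NUM_WAYS; way++)
            line_valid[set][way] = true;
}

/* Count the lines that are still valid. */
static uint32_t count_valid_lines(void) {
    uint32_t n = 0;
    for (uint32_t set = 0; set < NUM_SETS; set++)
        for (uint32_t way = 0; way < NUM_WAYS; way++)
            n += line_valid[set][way] ? 1u : 0u;
    return n;
}

/* Touch one address per set; returns the number of invalidate operations. */
static uint32_t invalidate_data_cache(void) {
    uint32_t ops = 0;
    for (uint32_t addr = 0; addr < NUM_SETS * LINE_BYTES; addr += LINE_BYTES) {
        model_dccci(addr);
        ops++;
    }
    return ops;
}
```

Because the cache runs in write-through mode, no dirty line holds the only copy of any data, so discarding every line loses nothing that system memory does not already contain.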

512 cache-lines (32 bytes each)
No Memory Coherence Problems!
30
DMA Checkpoint and Rollback
  • Active data is partitioned into a single contiguous
    data block
  • DMA controller(s) transfer data between active
    system memory and a redundant checkpoint memory
  • Maximum transfer speed: 4 bytes / bus cycle
  • Processor is bypassed entirely
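
The 4 bytes per bus cycle ceiling gives a simple lower bound on transfer time. The sketch below assumes the data is split evenly across controllers that run on independent busses and ignores arbitration and setup overhead; `min_transfer_cycles` is a hypothetical helper name.

```c
#include <stdint.h>

#define BYTES_PER_CYCLE 4u  /* from the slide: maximum transfer speed */

/* Lower bound on bus cycles needed to move `bytes` using `controllers`
 * DMA engines in parallel. Assumes controllers >= 1 and an even split. */
static uint32_t min_transfer_cycles(uint32_t bytes, uint32_t controllers) {
    uint32_t words = (bytes + BYTES_PER_CYCLE - 1u) / BYTES_PER_CYCLE;
    return (words + controllers - 1u) / controllers;
}
```

For the 25 KB active data size quoted later in the talk (taking 1 KB as 1024 bytes), this bound is 6400 cycles with a single controller and 3200 cycles with two.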

Single DMA
Parallel DMA
31
Implementation
  • Register Rollback functionality retained
  • Block RAM Memories
  • Dual-port system memory for simultaneous access
  • Checkpoint Memory-1/2 hold copy of active data
    space
  • IP Logic Cores
  • Dual DMA Controllers
  • Dual OPB Busses
  • Memory Controllers

32
Program Memory Map
  • Linker script places active data sections into
    one contiguous block
  • Simple algorithm used to partition active data
    into two consecutive regions
  • Compute total transfer size in words
  • Compute transfer size for each controller
  • Compute transfer addresses
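
The three partitioning steps listed above can be sketched as follows. The struct, the function name, and the choice to give the first controller the extra word when the count is odd are assumptions, as is the idea that each checkpoint memory receives one half of the block.

```c
#include <stdint.h>

typedef struct {
    uint32_t src;     /* source address in active system memory */
    uint32_t dst;     /* destination address in a checkpoint memory */
    uint32_t length;  /* transfer length in bytes */
} dma_job_t;

/* Split the contiguous active data block [data_start, data_end) into two
 * consecutive regions, one per DMA controller. */
static void partition_active_data(uint32_t data_start, uint32_t data_end,
                                  uint32_t chk1_base, uint32_t chk2_base,
                                  dma_job_t jobs[2]) {
    /* 1. Total transfer size in 32-bit words, rounded up. */
    uint32_t total_words = (data_end - data_start + 3u) / 4u;

    /* 2. Transfer size for each controller. */
    uint32_t words0 = (total_words + 1u) / 2u;
    uint32_t words1 = total_words - words0;

    /* 3. Transfer addresses. */
    jobs[0].src = data_start;
    jobs[0].dst = chk1_base;
    jobs[0].length = words0 * 4u;

    jobs[1].src = data_start + words0 * 4u;
    jobs[1].dst = chk2_base;
    jobs[1].length = words1 * 4u;
}
```

For rollback, the same jobs would be replayed with `src` and `dst` swapped.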

ELF (Executable and Linking Format)
Size of Active Data: 25 KB!
33
DMA Controller Operation
  • Processor writes source address (SA), destination
    address (DA) and transfer length (LENGTH);
    transfer begins immediately
  • DMA Controller stores data intermediately in
    16-word data buffer
  • Processor checks interrupt status register (ISR)
    periodically for completion or errors
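
The program-then-poll sequence can be modeled on a host as shown below. The register names SA, DA, LENGTH and ISR come from the slide; the struct layout, the `ISR_DONE` and `ISR_ERROR` bit values, and the `model_dma_engine` stand-in for the hardware are all hypothetical. On the real device these would be memory-mapped registers on the OPB bus.

```c
#include <stdint.h>
#include <string.h>

#define ISR_DONE  0x1u  /* hypothetical bit position */
#define ISR_ERROR 0x2u  /* hypothetical bit position */

typedef struct {
    const uint8_t *sa;  /* source address (SA) */
    uint8_t *da;        /* destination address (DA) */
    uint32_t length;    /* transfer length in bytes (LENGTH) */
    uint32_t isr;       /* interrupt status register (ISR) */
} dma_regs_t;

/* Program the controller. In this model the transfer is considered
 * started once LENGTH has been written. */
static void dma_start(dma_regs_t *dma, const void *src, void *dst, uint32_t len) {
    dma->isr = 0;
    dma->sa = src;
    dma->da = dst;
    dma->length = len;
}

/* Host-side stand-in for the hardware engine: copy the data, set status. */
static void model_dma_engine(dma_regs_t *dma) {
    if (dma->sa == NULL || dma->da == NULL) {
        dma->isr |= ISR_ERROR;
        return;
    }
    memcpy(dma->da, dma->sa, dma->length);
    dma->length = 0;
    dma->isr |= ISR_DONE;
}

/* Poll ISR until completion or error. Returns 0 on completion,
 * -1 on a DMA error or when max_polls is exhausted. */
static int dma_wait(dma_regs_t *dma, uint32_t max_polls) {
    for (uint32_t i = 0; i < max_polls; i++) {
        if (dma->isr & ISR_ERROR)
            return -1;
        if (dma->isr & ISR_DONE)
            return 0;
        model_dma_engine(dma);  /* in the model, one poll lets the engine run */
    }
    return -1;
}
```

A checkpoint routine built on this would call `dma_start` for each controller, save the registers while the transfers run, and then call `dma_wait` on each controller before returning.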

Source Address
Destination address
Length
Completion
Courtesy of Xilinx
34
DMA Controller Operation
  • Routines augmented to support simultaneous
    register, memory transfers
  • Memory Checkpoint
  • Save active data to checkpoint memories
  • Memory Rollback
  • Restore active data from checkpoint memories
  • Routines wait until transfer completion or DMA
    bus error

Registers transferred during DMA
35
Operation Summary
36
Operation Summary
37
Platform Choice
  • Xilinx University Program Virtex-II Pro Board
  • Xilinx Virtex-II Pro FPGA
  • 2 Embedded PowerPC Processors
  • Block RAM (300 KB)
  • Xilinx Embedded Development Kit (EDK)
  • Hardware
  • Synthesis Tools (XST)
  • Mapping (MAP)
  • Place and Route (PAR)
  • Software
  • GNU Tools (gcc, as, ld)
  • Xilinx Microkernel Operating System
  • Debugging Tools (XMD)

38
Design Flow
Initialization, Shell, Demos
Checkpoint / Rollback Routines
39
Results
40
Logic Utilization
1 CLB = 4 Slices
Additional data path required more logic slices
Max. Slices: 13,696, leaving room for 11 more DMA
paths!
41
Register Bit Error Sensitivity
PowerPC Register Bit Error Classification
  • Experiment
  • Invert and rollback single register bits
  • Results
  • Reserved bits were not writeable
  • Unrecoverable bits caused system to hang
  • Recoverable bits restored using rollback

42
Bit Coverage
  • Bit coverage is the ratio of recoverable bits to
    total bits in the system or constituent parts of
    the system.

Recovered approximately 500 K bits
43
Checkpointing Performance
Tuseful: time for normal execution
Tchkpt: time for checkpointing
Ttotal: total execution time
Processor time-base recorded during checkpoint
routine.
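
The formula relating these three times did not survive in the transcript. A plausible reading, stated here as an assumption, is that checkpointing performance is the fraction of total time spent on useful work, with Ttotal = Tuseful + Tchkpt. Both function names below are hypothetical.

```c
/* Assumed definition: performance = Tuseful / Ttotal,
 * where Ttotal = Tuseful + Tchkpt. */
static double checkpoint_performance(double t_useful, double t_chkpt) {
    double t_total = t_useful + t_chkpt;
    return t_total > 0.0 ? t_useful / t_total : 0.0;
}

/* Useful time needed between checkpoints to reach a target performance p
 * (0 < p < 1), obtained by solving p = Tuseful / (Tuseful + Tchkpt). */
static double useful_time_for(double p, double t_chkpt) {
    return p / (1.0 - p) * t_chkpt;
}
```

Under this reading, raising the target from 95% to 99% stretches the interval between checkpoints from 19 to 99 times the checkpoint cost.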
44
Checkpointing Performance
45
Mean Recovery Time
Trollback: time for rollback
Tre-execute: time for re-execution
Tperiod: checkpoint period
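
The slide's equation is missing from the transcript. One common model, used here purely as an assumption, is that an error is equally likely at any point in the checkpoint period, so the mean re-execution time is half the period and the worst case is the full period. The function names are hypothetical.

```c
/* Assumed model: Trecover = Trollback + Tre-execute,
 * with mean Tre-execute = Tperiod / 2 for uniformly distributed errors. */
static double mean_recovery_time(double t_rollback, double t_period) {
    return t_rollback + t_period / 2.0;
}

/* Worst case under the same model: the error strikes just before the
 * next checkpoint, so the whole period is re-executed. */
static double worst_recovery_time(double t_rollback, double t_period) {
    return t_rollback + t_period;
}
```

Under this model a longer checkpoint period improves checkpointing performance but lengthens recovery, which is the trade-off the following slides quantify.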
46
Rollback Time
For Comparison
47
Mean Recovery Time
Re-execution time dominates
48
Observations
  • When checkpointing performance is 95%
  • Register Rollback: TRecover = 10 K cycles
  • Parallel DMA Rollback: TRecover = 267 K cycles
  • When checkpointing performance is 99%
  • Register Rollback: TRecover = 28 K cycles
  • Parallel DMA Rollback: TRecover = 2.7 M cycles

XMK Operating System Restart: 150 M cycles
49
Scalability
  • Logic Utilization
  • Checkpointing Time
  • Recovery Time

Size of active data determines checkpointing time
Re-execution time dominates recovery time
50
Rollback Errors
Mitigation
  • No checkpoints available: Reset
  • Checkpoint / control data corrupted: ECC, Data
    Redundancy
  • Interrupt routines corrupted: Memory Checkers
  • Interrupt assertion or handling
    failure: Re-assertion, TMR
  • DMA transfer errors: DMA Reconfiguration
  • Cache invalidation failure: Cache
    Reconfiguration

Solutions: 1) Software Reconfiguration 2)
State Checking 3) Redundancy Techniques 4)
Reset
51
Latent Errors
  • Delays exist between fault occurrence and fault
    detection

Solution: Must know detection latencies
52
Conclusions
  • Rollback system successfully recovered operating
    system and user applications
  • Checkpointing overhead minimized using parallel
    DMA paths
  • Rollback demonstrated much shorter recovery times
    than full system restart
  • System considered to be scalable to embedded SoC
    with reasonable active memory sizes

53
Summary of Contributions
  • In the context of an FPGA device
  • 1. The use of interrupts and interrupt software
    routines for checkpoint and rollback of an
    embedded processor
  • 2. The use of DMA transfers for checkpoint and
    rollback of active system memory
  • 3. The use of parallel DMA with dual-port / bus
    structure for checkpoint and rollback of active
    system memory
  • 4. An experimental characterization of the
    sensitivity of register bit errors in an embedded
    PowerPC processor
  • 5. A software technique for emulation of a
    hardware-based recovery cache

54
Future Works
  • Extensions
  • Error-detection Schemes
  • Hardware-based Recovery Cache
  • DMA with Larger Bandwidths
  • Rigorous Reliability Analysis
  • Architectures
  • Port to FPGA Soft Processor
  • Design combined TMR-Rollback System
  • Multi-processor FPGA Rollback Systems

55
Other Aspects of this Work
  • Recovery Cache
  • Developed a software emulation technique for
    characterizing hardware-based recovery cache
  • I/O Rollback
  • Recovered I/O sequences by blocking checkpoints
  • Rollback Reliability
  • Formulated reliability using exponential failure
    rate law
  • Verification Techniques
  • Developed message log for recording checkpoint
    and rollback events and timing

56
Demonstrations
  • State Machine Rollback
  • State 0: initialize count, assert checkpoint
  • State 1: count-down
  • State 2: assert rollback
  • State 3: exit
  • Thread Rollback
  • Assert checkpoint in Thread 0
  • Assert rollback in Thread 1

57
Questions
58
Extra Slides
59
PowerPC Architecture
60
PowerPC Registers
61
PowerPC Exceptions
62
Reliability
Assuming exponential failure rate
Rollback Reliability
System Reliability
$P_{recovery} = P_d \, P_v \, P_c \, P_r$
$P_d$: probability that error is detected
$P_v$: probability that a valid checkpoint is used
$P_c$: probability that error is covered by rollback
$P_r$: probability of successful rollback execution
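
Reading the four factors as independent probabilities that must all hold, the recovery probability is their product. The operators were lost from the transcript, so the product form is an assumption, and `recovery_probability` is a hypothetical helper.

```c
/* Assumed form: Precovery = Pd * Pv * Pc * Pr.
 * Returns -1.0 if any input lies outside [0, 1]. */
static double recovery_probability(double pd, double pv, double pc, double pr) {
    const double p[4] = { pd, pv, pc, pr };
    double product = 1.0;
    for (int i = 0; i < 4; i++) {
        if (p[i] < 0.0 || p[i] > 1.0)
            return -1.0;
        product *= p[i];
    }
    return product;
}
```

Because the factors multiply, the weakest one bounds the result: with perfect checkpoints and rollback, recovery still cannot succeed more often than errors are detected.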
63
Verification
64
Recovery Cache Emulation
for (i = 0; i < 100; i++) {
    chkpt;                         // assert checkpoint
    for (j = 0; j < 100; j++) {    // fill arrays
        test1[j] = i + j;          // operator lost in extraction; "+" assumed
        test2[j] = i + j;
        test3[j] = i + j;
    }
}
65
Recovery Cache Emulation
66
I/O Rollback
67
Design Considerations
  • I/O Registers (Ports) may not behave like normal
    memories
  • Status Registers
  • Read-only / Write-only Registers
  • Reading a register can change its value
  • I/O sequence: a finite sequence of instructions
    used to send data to and receive data from an I/O
    component
  • Assumptions
  • I/O sequences perform one independent function
  • I/O sequences are repeatable

Bidirectional sequences may contain dependencies!
68
I/O Sequence Preservation
  • I/O sequences are preserved by blocking
    checkpoints during I/O execution

Partial I/O Rollback
Preserved I/O Rollback
69
Implementation
  • Register and Memory Rollback functionality
    retained
  • UART Controller
  • Communicates with PC UART through dedicated I/O
    Port
  • UART controller ports accessed through OPB
    interface
  • Input port polled to receive characters from
    HyperTerminal.

70
Checkpoint Blocking
  • Traps
  • Application asserts traps before and after I/O
    sequences.
  • Checkpoint Timer
  • Software flag masks checkpoint interrupts
  • Application disables checkpoints before I/O,
    enables checkpoints after I/O
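
The software-flag variant of checkpoint blocking can be sketched on a host as follows. The timer handler and UART are simple counters here, and the model fires one timer tick per character to show that ticks are ignored while the flag is cleared. All names are hypothetical.

```c
#include <stdbool.h>

static bool chkpt_enabled = true;  /* software flag that masks checkpoints */
static unsigned chkpts_saved = 0;  /* checkpoints actually taken */
static unsigned chars_sent = 0;    /* characters written to the model UART */

/* Checkpoint timer handler: saves a checkpoint only when not masked. */
static void checkpoint_timer_tick(void) {
    if (chkpt_enabled)
        chkpts_saved++;
}

/* Model UART write. A timer tick is forced during every character so the
 * effect of the mask is visible. */
static void model_uart_putc(char c) {
    (void)c;
    chars_sent++;
    checkpoint_timer_tick();
}

/* Send a string as one I/O sequence with checkpoints blocked throughout. */
static void uart_write_blocked(const char *s) {
    chkpt_enabled = false;  /* disable checkpoints before I/O */
    for (; *s != '\0'; s++)
        model_uart_putc(*s);
    chkpt_enabled = true;   /* enable checkpoints after I/O */
}
```

Since no checkpoint can land in the middle of the sequence, a later rollback resumes either before the whole sequence or after it, never part-way through.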

Checkpoints
No checkpoints saved during UART operation