ECE 697F Reconfigurable Computing
Lecture 16: Dynamic Reconfiguration I
1
ECE 697F Reconfigurable Computing
Lecture 16: Dynamic Reconfiguration I
2
Overview
  • Motivation for Dynamic Reconfiguration
  • Limitations of Reconfigurable Approaches
  • Hardware support for reconfiguration
  • Example applications
  • ASCII to Hex conversion
  • Fault recovery with dynamic reconfiguration

3
DPGA
  • Configuration selects operation of computation
    unit
  • Context identifier changes over time to allow
    change in functionality
  • DPGA = Dynamically Programmable Gate Array

4
Computations that Benefit from Reconfiguration
[Figure: non-pipelined example, with input A feeding logic stages F0, F1, F2 to output B]
  • Low throughput tasks
  • Data dependent operations
  • Effective if not all resources active
    simultaneously. Possible to time-multiplex both
    logic and routing resources.

5
Resource Reuse
  • Example circuit: part of the ASCII → Hex design
  • The computation can be broken up by treating it as
    a dataflow graph.
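The computation being mapped here is easy to state in software. A minimal sketch of ASCII-to-hex digit conversion (my own illustration of the function, not the lecture's circuit):

```python
# ASCII hex digit -> 4-bit value, the function the example circuit
# computes. In hardware this is a few levels of LUTs; the two range
# checks correspond to the data-dependent cases in the dataflow graph.
def ascii_to_hex(ch: str) -> int:
    code = ord(ch)
    if 0x30 <= code <= 0x39:        # '0'..'9' -> 0..9
        return code - 0x30
    if 0x41 <= code <= 0x46:        # 'A'..'F' -> 10..15
        return code - 0x37
    raise ValueError(f"not a hex digit: {ch!r}")

print([ascii_to_hex(c) for c in "0FA3"])  # [0, 15, 10, 3]
```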

6
Resource Reuse
  • Resources must be directed to do different things
    at different times through instructions.
  • Different local configurations can be thought of
    as instructions
  • Minimizing the number and size of instructions is
    key to achieving an efficient design.
  • What are the implications for the hardware?

7
Previous Study (DeHon)
[Figure: interconnect mux and logic-reuse structure]
  • A_ctxt ≈ 80 Kλ² (dense encoding)
  • A_base ≈ 800 Kλ²

Each context is not overly costly compared to the base
cost of wire, switches, and I/O circuitry.
Question: how does this effect scale?
8
Exploring the Tradeoffs
  • Assume ideal packing: N_active = N_total / L
  • Reminder: each added context costs A_ctxt, which is
    small relative to A_base
  • Difficult to exactly balance resources and demands
  • Needs for contexts may vary across applications
  • Robust point: number of contexts equals critical
    path length
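The ideal-packing estimate above can be made concrete. A small sketch (the design size and path length below are illustrative numbers, not from the lecture):

```python
# Under ideal packing, a design with N_total LUT evaluations and
# critical path length L needs only N_total / L physical LUTs; the
# rest of the work is time-multiplexed across the L contexts.
def active_luts(n_total: int, path_length: int) -> int:
    return -(-n_total // path_length)  # ceiling division

# Hypothetical design: 1200 LUT evaluations, critical path of 4.
print(active_luts(1200, 4))  # 300 physical LUTs instead of 1200
```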

9
Implementation Choices
  • Both require the same amount of execution time
  • Implementation 1 is more resource-efficient.

10
Scheduling Limitations
  • N_A = size of the largest stage, in active LUTs
  • Precedence: a LUT can only be evaluated after its
    predecessors have been evaluated
  • Need to assign design LUTs to device LUTs at
    specific contexts
  • Consider a formulation for scheduling. What are the
    choices?

11
Scheduling
  • ASAP (as soon as possible)
  • Propagate depth forward from primary inputs
  • Depth = 1 + max(input depth)
  • ALAP (as late as possible)
  • Propagate distance from outputs backwards towards
    inputs
  • Level = 1 + max(output consumption level)
  • Slack
  • Slack = L + 1 - (depth + level), where L is the
    critical path length
  • PI depth = 0, PO level = 0
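The depth/level/slack computation above can be sketched directly on a tiny hypothetical LUT netlist (the node names and graph are my own illustration):

```python
# PI depth = 0, PO level = 0; for a LUT,
#   depth = 1 + max(depth of fanins)      (ASAP)
#   level = 1 + max(level of fanouts)     (ALAP, distance from outputs)
#   slack = L + 1 - (depth + level), L = critical path length in LUTs.
pis, pos = {"i1", "i2", "i3"}, {"o1"}
inputs = {                      # LUT -> its fanins (PIs or LUTs)
    "a": ["i1", "i2"],
    "b": ["i2", "i3"],
    "c": ["a", "b"],
    "d": ["i3"],
}
consumers = {"a": ["c"], "b": ["c"], "c": ["o1"], "d": ["o1"]}

def depth(n):
    return 0 if n in pis else 1 + max(depth(p) for p in inputs[n])

def level(n):
    return 0 if n in pos else 1 + max(level(s) for s in consumers[n])

L = max(depth(n) for n in inputs)                     # here L = 2
slack = {n: L + 1 - (depth(n) + level(n)) for n in inputs}
print(slack)  # {'a': 0, 'b': 0, 'c': 0, 'd': 1}
```

Nodes a, b, c sit on the critical path (slack 0); d can be scheduled in either of two contexts (slack 1).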

12
Slack Example
  • Note connection from C1 to O1
  • Critical path will have 0 slack
  • Admittedly small example

13
Sequentialization
  • Adding time slots allows for potential increase
    in hardware efficiency
  • This comes at the cost of increased latency
  • Adding slack allows a better balance
  • L = 4, N_A = 2 (4 or 3 contexts)

14
Full ASCII → Hex Circuit
  • Logically three levels of dependence
  • Single context
  • 21 LUTs @ 880 Kλ² = 18.5 Mλ²

15
Time-multiplexed version
  • Three contexts: 12 LUTs @ 1040 Kλ² = 12.5 Mλ²
  • Pipelining needed for dependent paths.
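The area comparison on these two slides is simple arithmetic and can be checked directly (λ² units as in the slides; the per-LUT areas include configuration overhead, which is why the multi-context LUT is larger):

```python
# Single-context: 21 LUTs at 880 Kλ² each.
single = 21 * 880e3      # -> 18.48 Mλ², quoted as 18.5 Mλ²
# Three-context: 12 LUTs at 1040 Kλ² each (extra stored configurations).
multi = 12 * 1040e3      # -> 12.48 Mλ², quoted as 12.5 Mλ²
print(single / 1e6, multi / 1e6)  # 18.48 12.48
```

Time-multiplexing trades a higher per-LUT cost for far fewer LUTs, for a net area win.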

16
Context Optimization
  • With enough contexts, only one LUT is needed, but
    this leads to poor latency.
  • Increased LUT area due to additional stored
    configuration information
  • Eventually additional interconnect savings taken
    up by LUT configuration overhead

Ideal: perfect scheduling spread, no retiming overhead
17
General Throughput Mapping
  • Useful if only limited throughput is desired
  • Target: produce a new result every t cycles (e.g. a
    t-LUT path)
  • Spatially pipeline every t stages (cycle = t)
  • Retime to minimize register requirements
  • Multi-context evaluation within a spatial stage;
    retime to minimize resource usage
  • Map for depth (i) and contexts (C)

18
Dharma Architecture (UC Berkeley)
  • Allows for levelized circuit to be executed
  • Design parameters of the DLM:
  • K = number of DLM inputs
  • L = number of levels

19
Example Dharma Circuit
20
Dynamic Reconfiguration for Fault Tolerance
  • Embedded systems require high reliability in the
    presence of transient or permanent faults
  • FPGAs contain substantial redundancy
  • Possible to dynamically configure around
    problem areas
  • Numerous on-line and off-line solutions

21
Column Based Reconfiguration
  • Huang and McCluskey
  • Assume that each FPGA column is equivalent in
    terms of logic and routing
  • Preserve empty columns for future use
  • Somewhat wasteful
  • Precompile and compress differences in bitstreams

22
Column Based Reconfiguration
  • Create multiple copies of the same design with
    different unused columns
  • Only requires different inter-block connections
  • Can lead to unreasonable configuration count
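The selection step implied by this scheme can be sketched in a few lines (an illustration of the idea, not Huang and McCluskey's code; the configuration names are hypothetical):

```python
# Precompile one configuration per spare-column position; on a fault,
# pick any precompiled configuration that leaves the faulty column
# unused, then load it (or the compressed difference) into the FPGA.
def pick_configuration(configs, faulty_col):
    """configs: config id -> set of columns that config leaves empty."""
    for cid, unused in configs.items():
        if faulty_col in unused:
            return cid
    raise RuntimeError("no precompiled configuration avoids this column")

configs = {"cfgA": {0}, "cfgB": {3}, "cfgC": {7}}
print(pick_configuration(configs, 3))  # cfgB leaves column 3 empty
```

The configuration count grows with the number of spare-column positions covered, which is the scalability concern noted on the next slide.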

23
Column Based Reconfiguration
  • Determining differences and compressing the
    results leads to reasonable overhead
  • Scalability and fault diagnosis are issues

24
Alternate Fault Tolerance Approach
  • Fault tolerant client (with Xilinx Virtex FPGA)
  • Periodic fault diagnosis
  • Send fault location to server
  • Reconfiguration server
  • Recompile based on fault locations
  • Send new configuration to client
  • Communication Network
  • TCP socket

25
Fault Tolerant Client
  • Transient fault (bit flip in configuration
    memory)
  • Permanent fault (faulty logic or routing resource)

26
Reconfiguration Server
  • Create TCP socket and collect faults
  • Select recompilation process
  • Generate bitstream and send it via TCP socket
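The server loop above can be sketched with standard sockets (a minimal illustration: the port, message framing, and the `recompile` placeholder are my assumptions, not the authors' protocol):

```python
# Reconfiguration-server sketch: accept a TCP connection, read fault
# locations from the client, run recompilation, send back a bitstream.
import socket

def recompile(fault_locations: bytes) -> bytes:
    # Placeholder for the incremental CAD flow (LUT swapping for a
    # logic fault, timing-driven incremental re-route for a routing
    # fault); returns a dummy bitstream here.
    return b"\x00" * 16

def serve(host="0.0.0.0", port=9000):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((host, port))
        srv.listen(1)
        while True:
            conn, _ = srv.accept()
            with conn:
                faults = conn.recv(4096)         # fault locations
                conn.sendall(recompile(faults))  # new bitstream
```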

[Diagram: From/To Client → Collect Faults → Select Recompilation Process (logic fault → LUT swapping; routing fault → timing-driven incremental re-route) → Generate New Bitstream → Send New Bitstream]
27
System Procedure
  • Server and client initialization
  • Server-client interaction

[Diagram: Server: create TCP socket, wait for a request from the client, recompile, transmit bitstream. Client: create TCP socket, run fault diagnosis (continue if no fault), transmit fault locations, collect bitstream, load new bits.]
28
Custom incremental CAD flow
  • Existing CAD tools for Virtex
  • Xilinx PAR and Bitgen: time consuming, and no fault
    avoidance
  • Xilinx JRoute: limited performance
  • Our custom flow (place, route, bitstream
    generation)
  • Fault avoidance
  • Fast recompilation
  • Preserve performance of original design
  • No use of Xilinx ISE tools

29
Custom Flow
  • VPR for Virtex → a modified version of VPR [1]
  • JBits → a bitstream generation tool based on the
    JBits API [2]

[Flow diagram: EDIF → EDIF2BLIF → BLIF → SIS (logic opt.) → BLIF → FlowMap (tech map) → BLIF → VPR for Virtex → (BLIF, netlist, placement and routing) → JBits → bitstream; fault information feeds VPR and JBits]
30
VPR (pack, place and route)
  • Modifications in VPR
  • Routing graph for Virtex
  • Generate legal placement and routing
  • Timing-driven incremental router
  • Mask faulty resources
  • Re-route and preserve performance of original
    design

31
Fault Avoidance
  • Mask faulty routing resources
  • Check nets affected by faults
  • Re-route affected nets
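The three steps above reduce to set operations on the routing data. A sketch with illustrative data structures (not VPR's internals):

```python
# Mask faulty routing resources by id, find every net whose route
# uses one of them, and queue those nets for re-route; nets that
# avoid the faulty resources keep their original routing.
def nets_to_reroute(net_routes, faulty_resources):
    """net_routes: net name -> set of routing-resource ids used."""
    faulty = set(faulty_resources)
    return [net for net, used in net_routes.items() if used & faulty]

routes = {"clk": {1, 2}, "data0": {3, 4}, "data1": {4, 5}}
print(nets_to_reroute(routes, [4]))  # ['data0', 'data1']
```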

32
Timing-driven Re-route
  • Timing-driven PathFinder Router
  • Incremental Re-route with history
  • History congestion values guide router away from
    congested area
  • Original net criticalities can be used for
    re-route
  • Net rip-up
  • Nets unaffected by faults may be ripped-up to
    improve routability and timing
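A PathFinder-style node cost blending congestion history with net criticality can be sketched as follows (a standard formulation of the idea; the exact constants and field names here are illustrative, not the authors'):

```python
# PathFinder-style cost: congestion = (base + history) scaled by
# present sharing; criticality in [0, 1] shifts the weight between
# delay (critical nets) and congestion avoidance (non-critical nets).
def node_cost(base, history, present_sharing, delay, criticality):
    congestion = (base + history) * (1 + present_sharing)
    return criticality * delay + (1 - criticality) * congestion

# A fully critical net pays only for delay; a non-critical net is
# steered away from nodes with high history cost instead.
print(node_cost(base=1.0, history=0.0, present_sharing=0,
                delay=2.0, criticality=1.0))  # 2.0
```

Reusing the original net criticalities, as the slide notes, lets the incremental re-route preserve the original design's timing priorities.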

33
Bitstream Generation
  • Parse results from VPR
  • Generate bitstream with the Xilinx JBits API
  • Incremental modification
  • XDL output for debugging

[Diagram: results from VPR → Parser → Xilinx JBits API (jbits.set(Row, Column, Resource, Driver)) applied to the original bitstream from the Xilinx flow → modified bitstream → FPGA; XDL output for debugging]
34
System Component
  • Solaris-based SUN Ultra-10 workstation
  • 440 MHz CPU
  • 768 MB memory
  • Transtech DM11 board in a WinNT-based PC
  • Xilinx XCV100-5 FPGA
  • 200 MHz TI DSP
  • 32 MB SDRAM
  • 1 MB SRAM

35
Single Fault Recovery
Recovery time for a single fault (seconds)
  • Three benchmarks
  • FIR16
  • (16-tap 16-bit FIR filter)
  • FMUL
  • (8-bit floating point multiplier)
  • TEA
  • (tiny encryption algorithm)
  • Recovery time < 14 s
  • 12× faster
  • Performance preserved

[Chart: per-benchmark results for FIR16 (465 CLBs), FMUL (227 CLBs), and TEA (109 CLBs); comparison values 77.5, 37.8, 18.2]
36
Single Fault Recovery
  • Time for the different tasks of recovery
  • Incremental CAD consumes the most time

37
Results
  • Similar recovery time for different designs
  • Base cost: time to create the routing graph for the
    device
  • Device size determines total recovery time
  • Recovering multiple faults
  • Re-route time increases by about 2 seconds for 50
    faults
  • Critical path delay increases by about 5% for 50
    faults
  • Modified bits in bitstream
  • Single fault → 24 configuration bits out of a
    100 KB bitstream

38
Summary
  • Multiple contexts can be used to combat wire
    inactivity and logic latency
  • Too many contexts lead to inefficiencies due to
    retiming registers and extra LUTs
  • Architectures such as Dharma address these issues
    through contexts
  • Run-time system needed to handle dynamic
    reconfiguration for fault tolerance