Title: ECE 697F Reconfigurable Computing Lecture 16 Dynamic Reconfiguration I
1. ECE 697F Reconfigurable Computing, Lecture 16: Dynamic Reconfiguration I
2. Overview
- Motivation for Dynamic Reconfiguration
- Limitations of Reconfigurable Approaches
- Hardware support for reconfiguration
- Example applications
  - ASCII to Hex conversion
  - Fault recovery with dynamic reconfiguration
3. DPGA
- Configuration selects the operation of the computation unit
- Context identifier changes over time to allow a change in functionality
- DPGA: Dynamically Programmable Gate Array
4. Computations that Benefit from Reconfiguration
(Diagram: non-pipelined example with function blocks F0, F1, F2 between A and B)
- Low throughput tasks
- Data dependent operations
- Effective if not all resources are active simultaneously; possible to time-multiplex both logic and routing resources.
5. Resource Reuse
- Example circuit: part of an ASCII → Hex design
- The computation can be broken up by treating the circuit as a data flow graph.
6. Resource Reuse
- Resources must be directed to do different things at different times through instructions.
- Different local configurations can be thought of as instructions.
- Minimizing the number and size of instructions is key to achieving an efficient design.
- What are the implications for the hardware?
7. Previous Study (DeHon)
- Interconnect mux, logic reuse
- A_ctxt ≈ 80Kλ² (dense encoding)
- A_base ≈ 800Kλ²
- Each context is not overly costly compared to the base cost of wires, switches, and I/O circuitry
- Question: how does this effect scale?
8. Exploring the Tradeoffs
- Assume ideal packing: N_active = N_total / L
- Reminder: per-LUT area = A_base + c · A_ctxt (see the sketch below)
- Difficult to exactly balance resources and demands
- Needs for contexts may vary across applications
- Robust point: where the critical path length equals the number of contexts
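To make the tradeoff concrete, here is a minimal sketch using the per-LUT costs from slide 7 and the ideal-packing assumption above. Capping reuse by both the context count and the critical path is my reading of the "robust point" remark, and the function itself is illustrative rather than code from the lecture.

```python
# Minimal sketch of the context/area tradeoff, assuming the per-LUT area
# model A(LUT) = A_BASE + c * A_CTXT and ideal packing N_active = N_total / L.
# Constants follow slide 7 (units: Klambda^2); the function is illustrative.
from math import ceil

A_BASE = 800   # base per-LUT cost (wires, switches, I/O circuitry), Klambda^2
A_CTXT = 80    # cost of one additional stored configuration, Klambda^2

def total_area(n_total: int, path_length: int, contexts: int) -> int:
    """Estimated total area (Klambda^2) for a design with n_total LUTs and
    critical path length path_length mapped onto `contexts` contexts."""
    # Precedence limits reuse to the critical path length, and a physical
    # LUT can be reused at most once per context.
    reuse = min(contexts, path_length)
    n_active = ceil(n_total / reuse)           # ideal packing
    return n_active * (A_BASE + contexts * A_CTXT)

# Single-context ASCII -> Hex mapping (slide 14): 21 LUTs x 880K = 18480 Klambda^2
print(total_area(n_total=21, path_length=3, contexts=1))
# With 3 contexts, ideal packing would need ceil(21/3) = 7 active LUTs; the
# actual schedule on slide 15 needs N_A = 12, i.e. 12 x 1040K = 12480 Klambda^2.
print(12 * (A_BASE + 3 * A_CTXT))
```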
9. Implementation Choices
- Both require the same amount of execution time
- Implementation 1 is more resource efficient
10. Scheduling Limitations
- N_A = size of the largest stage in terms of active LUTs
- Precedence → a LUT can only be evaluated after its predecessors have been evaluated
- Need to assign design LUTs to device LUTs at specific contexts
- Consider a formulation for scheduling. What are the choices?
11. Scheduling
- ASAP (as soon as possible)
  - Propagate depth forward from the primary inputs
  - Depth = 1 + max(input depth)
- ALAP (as late as possible)
  - Propagate distance backwards from the outputs towards the inputs
  - Level = 1 + max(output consumption level)
- Slack
  - Slack = L + 1 − (depth + level), where L is the critical path length (see the sketch below)
  - PI depth = 0, PO level = 0
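For concreteness, here is a small sketch of these ASAP/ALAP/slack computations on a toy LUT dependence graph. The graph, node names, and data structures are mine, not the lecture's example; the conventions follow the slide (PI depth = 0, PO level = 0).

```python
# Sketch of ASAP depth, ALAP level, and slack on a toy LUT dependence graph.
# Conventions from the slide: PI depth = 0, PO level = 0,
# depth = 1 + max(input depth), level = 1 + max(output consumption level),
# slack = L + 1 - (depth + level), where L is the critical path length.

# fanin: LUT -> its inputs; names starting with "I" are primary inputs.
fanin = {
    "A": ["I0", "I1"],
    "B": ["I2", "A"],
    "C": ["A", "I3"],
    "D": ["B", "I4"],
}
outputs = {"C", "D"}                     # LUTs driving primary outputs

fanout = {n: [] for n in fanin}
for n, ins in fanin.items():
    for i in ins:
        if i in fanout:
            fanout[i].append(n)

def depth(n):                            # ASAP: forward from primary inputs
    if n not in fanin:                   # primary input
        return 0
    return 1 + max(depth(i) for i in fanin[n])

def level(n):                            # ALAP: backward from primary outputs
    po_level = 1 if n in outputs else 0  # a primary output it drives has level 0
    return max([po_level] + [1 + level(c) for c in fanout[n]])

L = max(depth(n) for n in fanin)         # critical path length in LUT levels
for n in sorted(fanin):
    slack = L + 1 - (depth(n) + level(n))
    print(f"{n}: depth={depth(n)}, level={level(n)}, slack={slack}, "
          f"feasible contexts {depth(n)}..{depth(n) + slack}")
# C has slack 1 here, so it may be scheduled in context 2 or 3;
# the other LUTs are on the critical path (slack 0).
```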
12. Slack Example
- Note the connection from C1 to O1
- The critical path will have 0 slack
- Admittedly a small example
13. Sequentialization
- Adding time slots allows a potential increase in hardware efficiency
- This comes at the cost of increased latency
- Adding slack allows a better balance
- L = 4, N_A = 2 (4 or 3 contexts)
14. Full ASCII → Hex Circuit
- Logically three levels of dependence
- Single context: 21 LUTs @ 880Kλ² ≈ 18.5Mλ²
15. Time-Multiplexed Version
- Three contexts: 12 LUTs @ 1040Kλ² ≈ 12.5Mλ²
- Pipelining needed for dependent paths
16. Context Optimization
- With enough contexts only one LUT is needed, but this leads to poor latency
- Increased LUT area due to additional stored configuration information
- Eventually the additional interconnect savings are taken up by LUT configuration overhead
- Ideal = perfect scheduling spread with no retiming overhead
17. General Throughput Mapping
- Useful if only limited throughput is desired
- Target produces a new result every t cycles (e.g. a t-LUT path)
- Spatially pipeline every t stages (cycle = t)
  - Retime to minimize the register requirement
- Multi-context evaluation within a spatial stage
  - Retime to minimize resource usage
- Map for depth, i, and contexts, C (see the sketch below)
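As a rough sketch of the arithmetic I read into this mapping (the formulas below are my inference, not given on the slide): with LUT depth d and a result needed only every t cycles, the circuit is cut into roughly ceil(d / t) spatial stages, and each stage is evaluated over t contexts.

```python
# Rough sketch (assumptions mine) of throughput-driven mapping: a design of
# LUT depth d that only needs a new result every t cycles is split into
# ceil(d / t) spatial pipeline stages, each evaluated across t contexts.
from math import ceil

def throughput_mapping(depth: int, t: int) -> dict:
    stages = ceil(depth / t)              # spatial pipeline stages
    return {
        "spatial_stages": stages,
        "contexts_per_stage": min(t, depth),
        "cycles_per_result": t,           # initiation interval
        "latency_cycles": stages * t,
    }

# Example: a depth-8 design that only needs a result every 4 cycles
print(throughput_mapping(depth=8, t=4))
# {'spatial_stages': 2, 'contexts_per_stage': 4,
#  'cycles_per_result': 4, 'latency_cycles': 8}
```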
18. Dharma Architecture (UC Berkeley)
- Allows a levelized circuit to be executed
- Design parameters (DLM):
  - K → number of DLM inputs
  - L → number of levels
19. Example Dharma Circuit
20. Dynamic Reconfiguration for Fault Tolerance
- Embedded systems require high reliability in the presence of transient or permanent faults
- FPGAs contain substantial redundancy
  - Possible to dynamically configure around problem areas
- Numerous on-line and off-line solutions
21. Column-Based Reconfiguration
- Huang and McCluskey
- Assume that each FPGA column is equivalent in terms of logic and routing
- Preserve empty columns for future use
  - Somewhat wasteful
- Precompile and compress differences in bitstreams
22. Column-Based Reconfiguration
- Create multiple copies of the same design, each with a different unused column (see the sketch below)
- Only requires different inter-block connections
- Can lead to an unreasonable configuration count
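A minimal sketch of the selection step this implies, under my own simplifying assumptions: one precompiled variant per candidate spare column, with hypothetical file names.

```python
# Sketch of column-based recovery (assumptions mine): the design is
# precompiled once per candidate spare column, and a column fault is handled
# by loading the variant that leaves the faulty column unused.

NUM_COLUMNS = 12   # hypothetical device width

# One precompiled bitstream per candidate spare column; this table is exactly
# what can grow into an "unreasonable configuration count".
precompiled = {col: f"design_spare_col{col}.bit" for col in range(NUM_COLUMNS)}

def select_configuration(faulty_column: int) -> str:
    """Pick the precompiled bitstream whose unused column is the faulty one."""
    return precompiled[faulty_column]

print(select_configuration(4))   # -> design_spare_col4.bit
```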
23. Column-Based Reconfiguration
- Determining the differences and compressing the results leads to reasonable overhead
- Scalability and fault diagnosis are issues
24. Alternate Fault Tolerance Approach
- Fault-tolerant client (with a Xilinx Virtex FPGA)
  - Periodic fault diagnosis
  - Sends fault locations to the server
- Reconfiguration server
  - Recompiles based on the fault locations
  - Sends the new configuration to the client
- Communication network
  - TCP socket
25. Fault Tolerant Client
- Transient fault (bit flip in configuration memory)
- Permanent fault (faulty logic or routing resource)
26. Reconfiguration Server
- Create TCP socket and collect faults
- Select the recompilation process (see the sketch below)
- Generate the bitstream and send it via the TCP socket
(Flow diagram: collect faults from the client → select the recompilation process (LUT swapping for a logic fault, timing-driven incremental re-route for a routing fault) → generate the new bitstream → send the new bitstream to the client)
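The dispatch in this flow can be summarized as below; the function names and fault encoding are placeholders of mine, standing in for the LUT-swapping and incremental re-route steps rather than actual tool interfaces.

```python
# Sketch of the server-side recompilation dispatch: LUT swapping for logic
# faults, timing-driven incremental re-route for routing faults. All function
# bodies are placeholders (assumptions), not the lecture's implementation.

def lut_swap(fault):
    """Move the logic of a faulty LUT to a spare LUT (placeholder)."""

def incremental_reroute(fault):
    """Mask the faulty routing resource and re-route affected nets (placeholder)."""

def generate_bitstream() -> bytes:
    """Produce the modified bitstream (placeholder for the JBits-based step)."""
    return b""

def recompile(faults) -> bytes:
    for fault in faults:
        if fault["type"] == "logic":
            lut_swap(fault)
        elif fault["type"] == "routing":
            incremental_reroute(fault)
    return generate_bitstream()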
27. System Procedure
- Server and client initialization
- Server-client interaction (see the socket sketch below)
(Diagram: the server and client each create a TCP socket; the server waits for a request from the client; the client performs fault diagnosis and, if a fault is found, transmits the fault locations; the server runs recompilation and transmits the bitstream; the client collects the bitstream and loads the new bits; if no fault is found, no action is taken.)
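A minimal socket-level sketch of the client side of this exchange; the host name, port, JSON fault encoding, and length-prefixed framing are all invented for illustration, since the slides do not describe the wire protocol.

```python
# Minimal sketch of the client side of the server-client interaction above.
# Host, port, and message framing are invented for illustration only.
import json
import socket
import struct

SERVER = ("reconfig-server.example.com", 9000)   # hypothetical address

def recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("server closed the connection")
        buf += chunk
    return buf

def request_recovery(fault_locations) -> bytes:
    """Send fault locations to the server and return the new bitstream."""
    with socket.create_connection(SERVER) as sock:
        payload = json.dumps(fault_locations).encode()
        sock.sendall(struct.pack("!I", len(payload)) + payload)   # request
        size = struct.unpack("!I", recv_exact(sock, 4))[0]
        return recv_exact(sock, size)                             # bitstream

# After periodic fault diagnosis finds a fault, the client might call:
# new_bits = request_recovery([{"type": "routing", "row": 3, "col": 17}])
# and then load the new bits into the FPGA's configuration memory.
```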
28. Custom Incremental CAD Flow
- Existing CAD tools for Virtex
  - Xilinx PAR and Bitgen: time consuming and no fault avoidance
  - Xilinx JRoute: limited performance
- Our custom flow (place, route, bitstream generation)
  - Fault avoidance
  - Fast recompilation
  - Preserves the performance of the original design
  - No use of Xilinx ISE tools
29. Custom Flow
- VPR for Virtex → modified version of VPR [1]
- JBits → bitstream generation tool based on the JBits API [2]
(Flow diagram: EDIF → EDIF2BLIF → BLIF → SIS (logic optimization) → BLIF → FlowMap (technology mapping) → BLIF → VPR for Virtex → netlist, placement, and routing → JBits, which also takes the fault information → bitstream)
30. VPR (Pack, Place and Route)
- Modifications in VPR
  - Routing graph for Virtex
  - Generate legal placement and routing
- Timing-driven incremental router
  - Mask faulty resources
  - Re-route while preserving the performance of the original design
31. Fault Avoidance
- Mask faulty routing resources
- Check which nets are affected by the faults
- Re-route the affected nets (see the sketch below)
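A sketch of this loop as I read it; the routing-graph and net objects are placeholders, not VPR's actual data structures.

```python
# Sketch of fault avoidance in the router: mask faulty routing resources,
# find the nets whose current routes use them, and re-route only those nets.
# The routing_graph/net interfaces here are placeholders (assumptions).

def avoid_faults(routing_graph, nets, faulty_resources):
    # 1. Mask faulty resources so the router can no longer use them.
    for r in faulty_resources:
        routing_graph.remove_node(r)

    # 2. Check which nets are affected by the faults.
    affected = [net for net in nets
                if any(r in net.route for r in faulty_resources)]

    # 3. Rip up and re-route only the affected nets (timing-driven,
    #    reusing history costs and original criticalities; next slide).
    for net in affected:
        net.route = timing_driven_reroute(routing_graph, net)
    return affected

def timing_driven_reroute(routing_graph, net):
    """Placeholder for the timing-driven incremental PathFinder re-route
    (sketched on the next slide); returns the new list of routing nodes."""
    return []
```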
32. Timing-Driven Re-route
- Timing-driven PathFinder router (cost function sketched below)
- Incremental re-route with history
  - History congestion values guide the router away from congested areas
  - Original net criticalities can be reused during the re-route
- Net rip-up
  - Nets unaffected by faults may be ripped up to improve routability and timing
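For reference, the usual PathFinder-style node cost blends delay and negotiated congestion by criticality, with the history term providing the "memory" mentioned above. This is the textbook form (as used in VPR), so the lecture's router may weight things differently.

```python
# Textbook PathFinder/VPR-style node cost with timing criticality. The
# history term accumulates past overuse and steers later re-routes away
# from previously congested areas. The update constants are assumptions.

def node_cost(node, crit: float) -> float:
    """Cost of using routing node `node` for a connection with criticality
    `crit` (near 1.0 on the critical path, near 0.0 for high-slack nets)."""
    congestion = (node.base_cost + node.history_cost) * node.present_cost
    return crit * node.delay + (1.0 - crit) * congestion

def update_congestion(nodes, pres_fac: float, hist_fac: float) -> None:
    """Negotiation step after each routing iteration: raise the present and
    history penalties on every overused routing node."""
    for node in nodes:
        overuse = max(0, node.occupancy - node.capacity)
        node.present_cost = 1.0 + overuse * pres_fac
        node.history_cost += overuse * hist_fac
```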
33. Bitstream Generation
- Parse the results from VPR
- Generate the bitstream with the Xilinx JBits API
- Incremental modification
- XDL output for debugging
(Diagram: results from VPR → parser → Xilinx JBits API calls of the form jbits.set(Row, Column, Resource, Driver) → original bitstream modified into the new bitstream loaded onto the FPGA; XDL from the Xilinx flow is used for debugging)
34. System Components
- Solaris-based SUN Ultra-10 workstation
  - 440 MHz CPU
  - 768 MB memory
- Transtech DM11 board in a WinNT-based PC
  - Xilinx XCV100-5 FPGA
  - 200 MHz TI DSP
  - 32 MB SDRAM
  - 1 MB SRAM
35. Single Fault Recovery
- Three benchmarks
  - FIR16 (16-tap, 16-bit FIR filter): 465 CLBs (77.5% utilization)
  - FMUL (8-bit floating point multiplier): 227 CLBs (37.8%)
  - TEA (tiny encryption algorithm): 109 CLBs (18.2%)
- Recovery time for a single fault < 14 s
- 12X faster
- Performance preserved
(Chart: recovery time for a single fault, in seconds, for the three benchmarks)
36. Single Fault Recovery
- Time for the different tasks of recovery
- Incremental CAD consumes most of the time
37. Results
- Similar recovery time for different designs
  - Base cost time → creating the routing graph for the device
  - Device size → determines the total recovery time
- Recovering multiple faults
  - Re-route time increases by 2 seconds for 50 faults
  - Critical path delay increases by 5% for 50 faults
- Modified bits in the bitstream
  - Single fault → 24 configuration bits changed out of a ~100 KB bitstream
38. Summary
- Multiple contexts can be used to combat wire inactivity and logic latency
- Too many contexts lead to inefficiencies due to retiming registers and extra LUTs
- Architectures such as Dharma address these issues through contexts
- A run-time system is needed to handle dynamic reconfiguration for fault tolerance