Title: ECE 697F Reconfigurable Computing Lecture 16 Dynamic Reconfiguration I
1. ECE 697F Reconfigurable Computing, Lecture 16: Dynamic Reconfiguration I
2. Overview
- Motivation for Dynamic Reconfiguration
- Limitations of Reconfigurable Approaches
- Hardware support for reconfiguration
- Example applications
  - ASCII to Hex conversion
  - Fault recovery with dynamic reconfiguration
3. DPGA
- Configuration selects the operation of the computation unit
- Context identifier changes over time to allow a change in functionality
- DPGA: Dynamically Programmable Gate Array
4. Computations that Benefit from Reconfiguration
(Diagram: non-pipelined example with function blocks F0, F1, F2 between A and B)
- Low throughput tasks
- Data dependent operations
- Effective if not all resources are active simultaneously; possible to time-multiplex both logic and routing resources.
5. Resource Reuse
- Example circuit: part of an ASCII → Hex design
- The computation can be broken up by treating the circuit as a data flow graph.
6. Resource Reuse
- Resources must be directed to do different things at different times through instructions.
- Different local configurations can be thought of as instructions.
- Minimizing the number and size of instructions is key to achieving an efficient design.
- What are the implications for the hardware?
7. Previous Study (DeHon)
- Interconnect mux, logic reuse
- A_ctxt ≈ 80Kλ² (dense encoding)
- A_base ≈ 800Kλ²
- Each context is not overly costly compared to the base cost of wires, switches, and I/O circuitry
- Question: how does this effect scale?
8. Exploring the Tradeoffs
- Assume ideal packing: N_active = N_total / L
- Reminder: per-LUT area = A_base + c · A_ctxt (see the sketch below)
- Difficult to exactly balance resources and demands
- Needs for contexts may vary across applications
- Robust point: where the critical path length equals the number of contexts
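To make the tradeoff concrete, here is a minimal sketch using the per-LUT costs from slide 7 and the ideal-packing assumption above. Capping reuse by both the context count and the critical path is my reading of the "robust point" remark, and the function itself is illustrative rather than code from the lecture.

```python
# Minimal sketch of the context/area tradeoff, assuming the per-LUT area
# model A(LUT) = A_BASE + c * A_CTXT and ideal packing N_active = N_total / L.
# Constants follow slide 7 (units: Klambda^2); the function is illustrative.
from math import ceil

A_BASE = 800   # base per-LUT cost (wires, switches, I/O circuitry), Klambda^2
A_CTXT = 80    # cost of one additional stored configuration, Klambda^2

def total_area(n_total: int, path_length: int, contexts: int) -> int:
    """Estimated total area (Klambda^2) for a design with n_total LUTs and
    critical path length path_length mapped onto `contexts` contexts."""
    # Precedence limits reuse to the critical path length, and a physical
    # LUT can be reused at most once per context.
    reuse = min(contexts, path_length)
    n_active = ceil(n_total / reuse)           # ideal packing
    return n_active * (A_BASE + contexts * A_CTXT)

# Single-context ASCII -> Hex mapping (slide 14): 21 LUTs x 880K = 18480 Klambda^2
print(total_area(n_total=21, path_length=3, contexts=1))
# With 3 contexts, ideal packing would need ceil(21/3) = 7 active LUTs; the
# actual schedule on slide 15 needs N_A = 12, i.e. 12 x 1040K = 12480 Klambda^2.
print(12 * (A_BASE + 3 * A_CTXT))
```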
9. Implementation Choices
- Both require the same amount of execution time
- Implementation 1 is more resource efficient
10. Scheduling Limitations
- N_A = size of the largest stage in terms of active LUTs
- Precedence → a LUT can only be evaluated after its predecessors have been evaluated
- Need to assign design LUTs to device LUTs at specific contexts
- Consider a formulation for scheduling. What are the choices?
11. Scheduling
- ASAP (as soon as possible)
  - Propagate depth forward from the primary inputs
  - Depth = 1 + max(input depth)
- ALAP (as late as possible)
  - Propagate distance backwards from the outputs towards the inputs
  - Level = 1 + max(output consumption level)
- Slack
  - Slack = L + 1 − (depth + level), where L is the critical path length (see the sketch below)
  - PI depth = 0, PO level = 0
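For concreteness, here is a small sketch of these ASAP/ALAP/slack computations on a toy LUT dependence graph. The graph, node names, and data structures are mine, not the lecture's example; the conventions follow the slide (PI depth = 0, PO level = 0).

```python
# Sketch of ASAP depth, ALAP level, and slack on a toy LUT dependence graph.
# Conventions from the slide: PI depth = 0, PO level = 0,
# depth = 1 + max(input depth), level = 1 + max(output consumption level),
# slack = L + 1 - (depth + level), where L is the critical path length.

# fanin: LUT -> its inputs; names starting with "I" are primary inputs.
fanin = {
    "A": ["I0", "I1"],
    "B": ["I2", "A"],
    "C": ["A", "I3"],
    "D": ["B", "I4"],
}
outputs = {"C", "D"}                     # LUTs driving primary outputs

fanout = {n: [] for n in fanin}
for n, ins in fanin.items():
    for i in ins:
        if i in fanout:
            fanout[i].append(n)

def depth(n):                            # ASAP: forward from primary inputs
    if n not in fanin:                   # primary input
        return 0
    return 1 + max(depth(i) for i in fanin[n])

def level(n):                            # ALAP: backward from primary outputs
    po_level = 1 if n in outputs else 0  # a primary output it drives has level 0
    return max([po_level] + [1 + level(c) for c in fanout[n]])

L = max(depth(n) for n in fanin)         # critical path length in LUT levels
for n in sorted(fanin):
    slack = L + 1 - (depth(n) + level(n))
    print(f"{n}: depth={depth(n)}, level={level(n)}, slack={slack}, "
          f"feasible contexts {depth(n)}..{depth(n) + slack}")
# C has slack 1 here, so it may be scheduled in context 2 or 3;
# the other LUTs are on the critical path (slack 0).
```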
12. Slack Example
- Note the connection from C1 to O1
- The critical path will have 0 slack
- Admittedly a small example
13. Sequentialization
- Adding time slots allows a potential increase in hardware efficiency
- This comes at the cost of increased latency
- Adding slack allows a better balance
- L = 4, N_A = 2 (4 or 3 contexts)
14. Full ASCII → Hex Circuit
- Logically three levels of dependence
- Single context: 21 LUTs @ 880Kλ² ≈ 18.5Mλ²
15. Time-Multiplexed Version
- Three contexts: 12 LUTs @ 1040Kλ² ≈ 12.5Mλ²
- Pipelining needed for dependent paths
16. Context Optimization
- With enough contexts only one LUT is needed, but this leads to poor latency
- Increased LUT area due to additional stored configuration information
- Eventually the additional interconnect savings are taken up by LUT configuration overhead
- Ideal = perfect scheduling spread with no retiming overhead
17. General Throughput Mapping
- Useful if only limited throughput is desired
- Target produces a new result every t cycles (e.g. a t-LUT path)
- Spatially pipeline every t stages (cycle = t)
  - Retime to minimize the register requirement
- Multi-context evaluation within a spatial stage
  - Retime to minimize resource usage
- Map for depth, i, and contexts, C (see the sketch below)
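As a rough sketch of the arithmetic I read into this mapping (the formulas below are my inference, not given on the slide): with LUT depth d and a result needed only every t cycles, the circuit is cut into roughly ceil(d / t) spatial stages, and each stage is evaluated over t contexts.

```python
# Rough sketch (assumptions mine) of throughput-driven mapping: a design of
# LUT depth d that only needs a new result every t cycles is split into
# ceil(d / t) spatial pipeline stages, each evaluated across t contexts.
from math import ceil

def throughput_mapping(depth: int, t: int) -> dict:
    stages = ceil(depth / t)              # spatial pipeline stages
    return {
        "spatial_stages": stages,
        "contexts_per_stage": min(t, depth),
        "cycles_per_result": t,           # initiation interval
        "latency_cycles": stages * t,
    }

# Example: a depth-8 design that only needs a result every 4 cycles
print(throughput_mapping(depth=8, t=4))
# {'spatial_stages': 2, 'contexts_per_stage': 4,
#  'cycles_per_result': 4, 'latency_cycles': 8}
```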
18. Dharma Architecture (UC Berkeley)
- Allows a levelized circuit to be executed
- Design parameters (DLM):
  - K → number of DLM inputs
  - L → number of levels
19. Example Dharma Circuit
20. Dynamic Reconfiguration for Fault Tolerance
- Embedded systems require high reliability in the presence of transient or permanent faults
- FPGAs contain substantial redundancy
  - Possible to dynamically configure around problem areas
- Numerous on-line and off-line solutions
21. Column-Based Reconfiguration
- Huang and McCluskey
- Assume that each FPGA column is equivalent in terms of logic and routing
- Preserve empty columns for future use
  - Somewhat wasteful
- Precompile and compress differences in bitstreams
22. Column-Based Reconfiguration
- Create multiple copies of the same design, each with a different unused column (see the sketch below)
- Only requires different inter-block connections
- Can lead to an unreasonable configuration count
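A minimal sketch of the selection step this implies, under my own simplifying assumptions: one precompiled variant per candidate spare column, with hypothetical file names.

```python
# Sketch of column-based recovery (assumptions mine): the design is
# precompiled once per candidate spare column, and a column fault is handled
# by loading the variant that leaves the faulty column unused.

NUM_COLUMNS = 12   # hypothetical device width

# One precompiled bitstream per candidate spare column; this table is exactly
# what can grow into an "unreasonable configuration count".
precompiled = {col: f"design_spare_col{col}.bit" for col in range(NUM_COLUMNS)}

def select_configuration(faulty_column: int) -> str:
    """Pick the precompiled bitstream whose unused column is the faulty one."""
    return precompiled[faulty_column]

print(select_configuration(4))   # -> design_spare_col4.bit
```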
23. Column-Based Reconfiguration
- Determining the differences and compressing the results leads to reasonable overhead
- Scalability and fault diagnosis are issues
24. Alternate Fault Tolerance Approach
- Fault-tolerant client (with a Xilinx Virtex FPGA)
  - Periodic fault diagnosis
  - Sends fault locations to the server
- Reconfiguration server
  - Recompiles based on the fault locations
  - Sends the new configuration to the client
- Communication network
  - TCP socket
25. Fault Tolerant Client
- Transient fault (bit flip in configuration memory)
- Permanent fault (faulty logic or routing resource)
26. Reconfiguration Server
- Create TCP socket and collect faults
- Select the recompilation process (see the sketch below)
- Generate the bitstream and send it via the TCP socket
(Flow diagram: collect faults from the client → select the recompilation process (LUT swapping for a logic fault, timing-driven incremental re-route for a routing fault) → generate the new bitstream → send the new bitstream to the client)
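The dispatch in this flow can be summarized as below; the function names and fault encoding are placeholders of mine, standing in for the LUT-swapping and incremental re-route steps rather than actual tool interfaces.

```python
# Sketch of the server-side recompilation dispatch: LUT swapping for logic
# faults, timing-driven incremental re-route for routing faults. All function
# bodies are placeholders (assumptions), not the lecture's implementation.

def lut_swap(fault):
    """Move the logic of a faulty LUT to a spare LUT (placeholder)."""

def incremental_reroute(fault):
    """Mask the faulty routing resource and re-route affected nets (placeholder)."""

def generate_bitstream() -> bytes:
    """Produce the modified bitstream (placeholder for the JBits-based step)."""
    return b""

def recompile(faults) -> bytes:
    for fault in faults:
        if fault["type"] == "logic":
            lut_swap(fault)
        elif fault["type"] == "routing":
            incremental_reroute(fault)
    return generate_bitstream()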
27. System Procedure
- Server and client initialization
- Server-client interaction (see the socket sketch below)
(Diagram: the server and client each create a TCP socket; the server waits for a request from the client; the client performs fault diagnosis and, if a fault is found, transmits the fault locations; the server runs recompilation and transmits the bitstream; the client collects the bitstream and loads the new bits; if no fault is found, no action is taken.)
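A minimal socket-level sketch of the client side of this exchange; the host name, port, JSON fault encoding, and length-prefixed framing are all invented for illustration, since the slides do not describe the wire protocol.

```python
# Minimal sketch of the client side of the server-client interaction above.
# Host, port, and message framing are invented for illustration only.
import json
import socket
import struct

SERVER = ("reconfig-server.example.com", 9000)   # hypothetical address

def recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("server closed the connection")
        buf += chunk
    return buf

def request_recovery(fault_locations) -> bytes:
    """Send fault locations to the server and return the new bitstream."""
    with socket.create_connection(SERVER) as sock:
        payload = json.dumps(fault_locations).encode()
        sock.sendall(struct.pack("!I", len(payload)) + payload)   # request
        size = struct.unpack("!I", recv_exact(sock, 4))[0]
        return recv_exact(sock, size)                             # bitstream

# After periodic fault diagnosis finds a fault, the client might call:
# new_bits = request_recovery([{"type": "routing", "row": 3, "col": 17}])
# and then load the new bits into the FPGA's configuration memory.
```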
28. Custom Incremental CAD Flow
- Existing CAD tools for Virtex
  - Xilinx PAR and Bitgen: time consuming and no fault avoidance
  - Xilinx JRoute: limited performance
- Our custom flow (place, route, bitstream generation)
  - Fault avoidance
  - Fast recompilation
  - Preserves the performance of the original design
  - No use of Xilinx ISE tools
29. Custom Flow
- VPR for Virtex → modified version of VPR [1]
- JBits → bitstream generation tool based on the JBits API [2]
(Flow diagram: EDIF → EDIF2BLIF → BLIF → SIS (logic optimization) → BLIF → FlowMap (technology mapping) → BLIF → VPR for Virtex → netlist, placement, and routing → JBits, which also takes the fault information → bitstream)
30. VPR (Pack, Place and Route)
- Modifications in VPR
  - Routing graph for Virtex
  - Generate legal placement and routing
- Timing-driven incremental router
  - Mask faulty resources
  - Re-route while preserving the performance of the original design
31. Fault Avoidance
- Mask faulty routing resources
- Check which nets are affected by the faults
- Re-route the affected nets (see the sketch below)
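A sketch of this loop as I read it; the routing-graph and net objects are placeholders, not VPR's actual data structures.

```python
# Sketch of fault avoidance in the router: mask faulty routing resources,
# find the nets whose current routes use them, and re-route only those nets.
# The routing_graph/net interfaces here are placeholders (assumptions).

def avoid_faults(routing_graph, nets, faulty_resources):
    # 1. Mask faulty resources so the router can no longer use them.
    for r in faulty_resources:
        routing_graph.remove_node(r)

    # 2. Check which nets are affected by the faults.
    affected = [net for net in nets
                if any(r in net.route for r in faulty_resources)]

    # 3. Rip up and re-route only the affected nets (timing-driven,
    #    reusing history costs and original criticalities; next slide).
    for net in affected:
        net.route = timing_driven_reroute(routing_graph, net)
    return affected

def timing_driven_reroute(routing_graph, net):
    """Placeholder for the timing-driven incremental PathFinder re-route
    (sketched on the next slide); returns the new list of routing nodes."""
    return []
```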
32. Timing-Driven Re-route
- Timing-driven PathFinder router (cost function sketched below)
- Incremental re-route with history
  - History congestion values guide the router away from congested areas
  - Original net criticalities can be reused during the re-route
- Net rip-up
  - Nets unaffected by faults may be ripped up to improve routability and timing
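For reference, the usual PathFinder-style node cost blends delay and negotiated congestion by criticality, with the history term providing the "memory" mentioned above. This is the textbook form (as used in VPR), so the lecture's router may weight things differently.

```python
# Textbook PathFinder/VPR-style node cost with timing criticality. The
# history term accumulates past overuse and steers later re-routes away
# from previously congested areas. The update constants are assumptions.

def node_cost(node, crit: float) -> float:
    """Cost of using routing node `node` for a connection with criticality
    `crit` (near 1.0 on the critical path, near 0.0 for high-slack nets)."""
    congestion = (node.base_cost + node.history_cost) * node.present_cost
    return crit * node.delay + (1.0 - crit) * congestion

def update_congestion(nodes, pres_fac: float, hist_fac: float) -> None:
    """Negotiation step after each routing iteration: raise the present and
    history penalties on every overused routing node."""
    for node in nodes:
        overuse = max(0, node.occupancy - node.capacity)
        node.present_cost = 1.0 + overuse * pres_fac
        node.history_cost += overuse * hist_fac
```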
33. Bitstream Generation
- Parse the results from VPR
- Generate the bitstream with the Xilinx JBits API
- Incremental modification
- XDL output for debugging
(Diagram: results from VPR → parser → Xilinx JBits API calls of the form jbits.set(Row, Column, Resource, Driver) → original bitstream modified into the new bitstream loaded onto the FPGA; XDL from the Xilinx flow is used for debugging)
34. System Components
- Solaris-based SUN Ultra-10 workstation
  - 440 MHz CPU
  - 768 MB memory
- Transtech DM11 board in a WinNT-based PC
  - Xilinx XCV100-5 FPGA
  - 200 MHz TI DSP
  - 32 MB SDRAM
  - 1 MB SRAM
35. Single Fault Recovery
- Three benchmarks
  - FIR16 (16-tap, 16-bit FIR filter): 465 CLBs (77.5% utilization)
  - FMUL (8-bit floating point multiplier): 227 CLBs (37.8%)
  - TEA (tiny encryption algorithm): 109 CLBs (18.2%)
- Recovery time for a single fault < 14 s
- 12X faster
- Performance preserved
(Chart: recovery time for a single fault, in seconds, for the three benchmarks)
36. Single Fault Recovery
- Time for the different tasks of recovery
- Incremental CAD consumes most of the time
37. Results
- Similar recovery time for different designs
  - Base cost time → creating the routing graph for the device
  - Device size → determines the total recovery time
- Recovering multiple faults
  - Re-route time increases by 2 seconds for 50 faults
  - Critical path delay increases by 5% for 50 faults
- Modified bits in the bitstream
  - Single fault → 24 configuration bits changed out of a ~100 KB bitstream
38. Summary
- Multiple contexts can be used to combat wire inactivity and logic latency
- Too many contexts lead to inefficiencies due to retiming registers and extra LUTs
- Architectures such as Dharma address these issues through contexts
- A run-time system is needed to handle dynamic reconfiguration for fault tolerance