Title: Wormhole RTR FPGAs with Distributed Configuration Decompression
1. Wormhole RTR FPGAs with Distributed Configuration Decompression
- CSE-670 Final Project Presentation
- A Joint Project Presentation by
- Ali Mustafa Zaidi
- Mustafa Imran Ali
2. Introduction
- An FPGA configuration architecture supporting distributed control and fast context switching.
- Aim of the Joint Project:
- Explore the potential for dynamically reconfiguring FPGAs by adapting the WRTR approach
- Enable high-speed reconfiguration using:
- Optimized logic blocks, and
- Distributed configuration decompression techniques
- The study focuses on datapath-oriented FPGAs
3. Project Methodology
- In-depth study of issues.
- Definition of basic architecture models.
- Definition of area models (for estimation of relative overhead w.r.t. other RTR schemes).
- Design of reconfigurable systems around the designed architecture models.
- Identification of FPGA resource allocation methods
- For distributing FPGA area between multiple applications/hosts (at runtime, compile time, or system design time).
- Selection of benchmarks for testing and simulation of the various approaches.
- Experimentation with WRTR systems and a corresponding PRTR system without distributed configuration (i.e. single host/application) for comparison of baseline performance.
- Evaluate resource utilization, normalized reconfiguration overhead, etc.
- Experimentation with all systems with distributed configurations (i.e. multiple hosts/applications). The PRTR system's configuration port will be time-multiplexed between the various applications.
- Evaluate resource utilization, normalized reconfiguration overhead, etc.
4. Configuration Issues with Ultra-high Density FPGAs
- FPGA densities are rising dramatically with process technology improvements.
- Configuration time for the serial method is becoming prohibitively large.
- FPGAs are increasingly used as compute engines, implementing data-intensive portions of applications directly in hardware.
- Lack of an efficient method for dynamic reconfiguration of large FPGAs will lead to inefficient utilization of the available resources.
5. Scalability Issues with Multi-context RTR
- Concept: While one plane is operating, configure the other planes serially.
- Latency is hidden by overlapping configuration with computation.
- As FPGA size, and thus configuration time, grows, multi-context becomes less effective in hiding latency.
- Configuring more bits in parallel for each context is only a stop-gap solution.
- Only so many pins can be dedicated to configuration.
- Overheads in the multi-context approach:
- The number of SRAM cells used for configuration grows linearly with the number of contexts.
- Multiplexing circuitry associated with each configurable unit.
- Global, low-skew context-select wires.
6. Scalability Issues with Partial RTR
- Concept: The configuration memory is addressable like standard random access memory.
- Overheads in the PRTR approach:
- Long global cell-select and data busses are required:
- area overhead,
- issues with single-cycle signal transmission as wires grow relative to logic.
- Vertical and horizontal decoding circuitry represent a centralized control resource:
- Can be accessed only sequentially by different user applications (one app at a time).
- Potential for underutilization of hardware as FPGA density increases.
- One solution could be to design the RAM as a multi-ported memory:
- But RAM area increases quadratically with the number of ports.
- Only so many dedicated configuration ports can be provided.
- Not a long-term scalable solution.
7. What is Wormhole RTR?
- WRTR is a method for reconfiguring a configurable device in an entirely distributed fashion.
- Routing and configuration are handled at the local rather than the global level.
8. Advertised Benefits of WRTR
- WRTR is a distributed paradigm:
- Allows different parts of the same resource to be independently configured simultaneously.
- Dramatically increases the configuration bandwidth of a device.
- Lack of a centralized controller means:
- Fewer single points of failure that can lead to total system failure (e.g. a broken configuration pin).
- Increased resilience: routing around faults, improving chip yields?
- Distributed control provides scalability:
- Eliminates the configuration bottleneck.
9. Origins of Wormhole RTR
- Concept developed in the late 90s at Virginia Tech.
- Intended as a method of rapidly creating and modifying custom computational pathways using a distributed control scheme.
- Essence of the WRTR concept:
- Independent, self-steering streams.
- Streams carry both programming information and operand data.
- Streams interact with the architecture to perform computation. (see DIAGRAM)
10. Origins of Wormhole RTR
11. Origins of Wormhole RTR
- Programming information configures both the pathway of the stream through the system and the operations performed by the computational elements along the path.
- Heterogeneity of architectures is supported by these streams.
- Runtime determination of a stream's path is possible, allowing allocation of resources as they become available.
12. Adapting WRTR for Conventional FPGAs
- Our aims:
- To achieve fast, parallel reconfiguration
- With minimum area overhead
- And minimum constraints imposed on the underlying FPGA architecture.
- The configuration architecture is completely decoupled from the FPGA architecture.
- The WRTR model is used as inspiration for developing a new paradigm for dynamic reconfiguration.
- It is not necessary that the WRTR method be followed to the letter.
13. Issues Associated with Using WRTR for Conventional FPGAs
- The original WRTR was intended for coarse-grained dataflow architectures with localized communications.
- Thus operand data was appended to the streams immediately after the programming header.
- In conventional FPGAs, dataflow patterns are unrelated to configuration flow, and there is no restriction to localized communications.
- Therefore wormhole routing is used only for configuration (it cannot be used for data).
14. Issues Associated with Using WRTR for Conventional FPGAs
- The original model was intended to establish linear pipelines through the system.
- This makes run-time direction determination feasible.
- However, for conventional FPGAs, the implemented functions have arbitrary structures.
- The configuration stream cannot change direction arbitrarily (i.e. its path is fixed at compile time).
15. Issues Associated with Using WRTR for Conventional FPGAs
- Due to the need for a large number of configuration ports, I/O ports must be shared/multiplexed:
- thus active circuits may need to be stalled to load configurations.
- This should not be a severe issue for high-performance-computing-oriented tasks.
- Should impose minimum constraints on the underlying FPGA architecture:
- Constraints applicable in an FPGA with WRTR are the same as those for any PRTR architecture.
16. A Possible System Architecture for a WRTR FPGA
- Many configuration/IO ports, divided between multiple host processors. (See Diagram)
- Internally, the FPGA is divided into partitions, usable by each of the hosts.
- Partition boundaries may be determined at system design time, or at runtime, based on the requirements of each host at any given time.
17. The Various WRTR Models Derived
- Our aim was to devise a high-speed, distributed configuration model with all the benefits of the original WRTR concept, but with minimum overhead.
- To this end, 3 models have been devised:
- Basic, with full internal routing.
- Second, with perimeter-only routing.
- Third, packetized (parallel configuration streams), with no internal routing.
18. Basic WRTR Model with Internal Routing
- Each configurable block or tile is accompanied by a simple configuration stream router. (See Diagram)
- Overhead scales linearly with FPGA size.
- Expected issues with this model:
- Complicated router: arbitration overhead and prioritization, potential for deadlock conditions, etc.
- May be restricted to coarser-grained designs.
- Without data routing, do we really need internal routing?
19. Second WRTR Model, with Perimeter-only Routing
- The primary requirement for achieving parallel configuration is multiple input ports.
- Internal routing is not a mandatory requirement.
- So why not restrict routing to the chip boundary? (See Diagram)
- Overhead scaling is improved (similar to the PRTR model).
- Highlights:
- Finer granularity of configuration achievable.
- Significantly lower overheads as FPGA sizes grow (ratio of perimeter to area).
- Issues:
- Longer time required to reach parts of the FPGA as FPGA size grows.
- Reduced configuration parallelism because of potentially greater arbitration delays at boundary routers.
20. Third Model: Packet-based Distribution of Configuration
- One solution to the increased boundary arbitration issues: use packets instead of streams. (See Diagram)
- A single configuration from each application is generated as a stream (a worm), similar to the previous models.
- Before entering the device, configuration packets from different streams are grouped according to their target rows (sketched below).
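A minimal sketch of this grouping step (Python, behavioral only). It assumes a hypothetical packet format in which each configuration packet carries an explicit target-row field; this is for illustration, not the actual hardware interface.

```python
from collections import defaultdict

def group_packets_by_row(streams):
    """Group configuration packets from several application streams by target row.

    `streams` is a list of per-application streams; each stream is a list of
    (target_row, payload) tuples -- a hypothetical packet format used only for
    illustration.  Returns target_row -> list of payloads, i.e. one packet
    group per row, ready to be injected on that row's configuration bus.
    """
    groups = defaultdict(list)
    for stream in streams:
        for target_row, payload in stream:
            groups[target_row].append(payload)
    return dict(groups)

# Example: two application worms whose packets target overlapping rows.
app_a = [(0, "cfg_a0"), (2, "cfg_a2")]
app_b = [(2, "cfg_b2"), (5, "cfg_b5")]
print(group_packets_by_row([app_a, app_b]))
# {0: ['cfg_a0'], 2: ['cfg_a2', 'cfg_b2'], 5: ['cfg_b5']}
```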
21. Third Model: Packet-based Distribution of Configuration
- Benefit: No need at all for routers in the fabric itself.
- Drawbacks:
- Increases overhead on the host system.
- Implies a centralized external controller,
- or a limited crossbar interconnect within the FPGA.
- Parallel reconfiguration is still possible, but with limited multitasking.
- This model may be considered for embedded systems with low configuration parallelism but high resource requirements.
22. Basic Area Model
- A baseline model is required to identify the overheads associated with each RTR model.
- Basic building block (See Diagram):
- A basic array of SRAM cells,
- configured by an arbitrary number of scan chains.
- Assumptions for a fair comparison of overhead:
- Each RTR model studied has exactly the same amount of logic resources to configure (rows and columns).
- Each model can be configured at exactly the same granularity.
- The given array of logic resources (see Diagram) has an area equivalent to a serially configurable FPGA.
- Area of the Basic Model = (A × B) × (x × y)
23. The PRTR FPGA Area Model
- Please See Diagram
- Configuration granularity decided by A and B
- AREA = Area of Basic Model + Overheads
- Overheads:
- Area of one log2(x)-to-x row-select decoder
- Area of one log2(y)-to-y n-bit column de-multiplexer
- Area of one n-bit bus × y
24. The Basic WRTR FPGA Area Model
- Please See Diagram
- AREA = Area of Basic Model + Overheads
- Overheads:
- Area of one n-bit bus × 2x
- Area of one n-bit bus × 2y
- Area of one 4-D router block × (x × y)
25. The Perimeter-Routing WRTR FPGA Area Model
- Please See Diagram
- AREA = Area of Basic Model + Overheads
- Overheads:
- Area of one n-bit bus × 2x
- Area of one n-bit bus × 2y
- Area of one 3-D router block × (2(x + y) − 1)
- This model can also be made one-dimensional to further reduce overheads (other constraints will apply).
26. The Packet-based WRTR FPGA Area Model
- Please See Diagram
- AREA = Area of Basic Model + Overheads
- Overheads:
- Area of one n-bit bus × 2x
- Area of one n-bit bus × 2y
- Additional overheads may appear in the host system.
- This model can also be made one-dimensional to further reduce overheads (other constraints will apply).
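With the parameters above (A × B SRAM cells per block, an x × y grid of blocks, n-bit busses), the relative overheads of the four models can be compared with a small script that mirrors these formulas. The per-unit area constants below are placeholders chosen purely for illustration, not measured values.

```python
# Placeholder per-unit area constants (illustrative values, not measurements).
SRAM_CELL   = 1.0   # one SRAM configuration cell
BUS_PER_BIT = 0.1   # one bit of an n-bit bus spanning one block
ROUTER_4D   = 6.0   # one 4-direction stream router
ROUTER_3D   = 4.5   # one 3-direction (perimeter) stream router
DECODER     = 3.0   # crude stand-in for decoder/demux area per output

def basic_area(A, B, x, y):
    """Basic model: an x-by-y grid of blocks, each an A-by-B array of SRAM cells."""
    return A * B * x * y * SRAM_CELL

def prtr_overhead(x, y, n):
    """Row-select decoder + n-bit column demultiplexer + one n-bit bus spanning y blocks."""
    return DECODER * x + DECODER * n * y + BUS_PER_BIT * n * y

def wrtr_internal_overhead(x, y, n):
    """n-bit busses along both dimensions + one 4-D router per block."""
    return BUS_PER_BIT * n * (2 * x + 2 * y) + ROUTER_4D * x * y

def wrtr_perimeter_overhead(x, y, n):
    """Same busses, but 3-D routers only along the boundary: 2(x + y) - 1 of them."""
    return BUS_PER_BIT * n * (2 * x + 2 * y) + ROUTER_3D * (2 * (x + y) - 1)

def wrtr_packet_overhead(x, y, n):
    """Busses only; the packet-grouping overhead moves out to the host system."""
    return BUS_PER_BIT * n * (2 * x + 2 * y)

if __name__ == "__main__":
    A, B, x, y, n = 32, 32, 16, 16, 8   # example block size, grid size, and bus width
    base = basic_area(A, B, x, y)
    models = {"PRTR": prtr_overhead, "WRTR internal": wrtr_internal_overhead,
              "WRTR perimeter": wrtr_perimeter_overhead, "WRTR packet": wrtr_packet_overhead}
    for name, overhead in models.items():
        print(f"{name:15s} overhead = {100 * overhead(x, y, n) / base:.2f}% of basic area")
```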
27. Parameters Defined and their Impact
- The number of busses (x and y):
- The number of busses varies with reconfiguration granularity.
- For fixed logic capacity, A and B increase with decreasing x and y, i.e. coarser granularity.
- Impact of coarser granularity:
- Reduced overhead?
- Reduced reconfiguration flexibility.
- Increased reconfiguration time per block.
- Thus it is better to have finer granularity.
- The width of the busses (n bits):
- The smaller the width, the smaller the overhead (for a fixed number of busses).
- Longer reconfiguration times.
28. Parameters Defined and their Impact
- It is possible to achieve finer granularity without increasing overhead, at the cost of bus width (and hence reconfiguration time per block).
- Impact of coarse-grained vs. fine-grained configurability: the methods available for handling hazards in the underlying FPGA fabric.
- Coarse-grained configuration places minimum constraints on the FPGA architecture.
- Fine-grained reconfiguration is subject to all the issues associated with Partial RTR systems.
29. Approaches to Router Design
- Active routing mechanism:
- Similar to conventional networks.
- Routing of streams depends on the stream-specified destination, as well as network metrics (e.g. congestion, deadlock).
- Hazards and conflicts may be dealt with at run time.
- Significantly more complicated routing logic required.
- Most likely restricted to very coarse-grained systems.
- Passive routing mechanism:
- Routing of streams depends only on the stream-specified direction.
- Hazards and conflicts are avoided by compile-time optimization.
- We have selected the passive routing mechanism for our WRTR models.
30. Passive Router Details
- Must be able to handle streams from 4 different directions.
- Streams from different directions are stalled only if there is a conflict in the outgoing direction.
- Includes mechanisms for stalling streams in case of conflict, etc.
- Detecting and applying back-pressure.
- The routing circuitry for one port is defined (see Diagram).
- For a 4-D router, this design is replicated 4 times.
- For a 3-D router, this design is replicated 3 times.
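A behavioral sketch of such a passive router (Python, illustration only): the worm's header carries a compile-time-fixed outgoing direction, and the only run-time decision is forward vs. stall on an output conflict. The method and direction names are assumptions, not the actual circuit.

```python
class PassiveRouter:
    """Behavioral sketch of a passive configuration-stream router (not the circuit).

    Each incoming worm's header names a fixed outgoing direction, decided at
    compile time.  The only run-time decision is forward vs. stall: a worm is
    stalled (back-pressure) only when the output it requests is already held
    by another worm.  A 4-D router instantiates this for N/E/S/W; a 3-D
    perimeter router for three directions.
    """
    DIRECTIONS = ("N", "E", "S", "W")

    def __init__(self):
        self.output_owner = {}  # outgoing direction -> input port currently using it

    def step(self, in_port, requested_out):
        """Return 'forward' or 'stall' for the worm arriving on in_port."""
        assert requested_out in self.DIRECTIONS
        owner = self.output_owner.get(requested_out)
        if owner is None or owner == in_port:
            self.output_owner[requested_out] = in_port  # claim (or keep) the output
            return "forward"
        return "stall"  # conflict: another worm holds this output, apply back-pressure

    def release(self, requested_out):
        """Called when a worm's tail leaves the router, freeing the output."""
        self.output_owner.pop(requested_out, None)

# Example: two worms request the same output; the second is stalled until release.
r = PassiveRouter()
print(r.step("N", "E"))   # forward
print(r.step("S", "E"))   # stall
r.release("E")
print(r.step("S", "E"))   # forward
```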
31. Utilizing Variable Length Configuration Mechanisms
- Support Hardware and Logic Block Issues
32. Configuration Overhead
- Configuration data is huge for large FPGAs
- Has to be reduced in order to have fast context
switching
33. Configuration Overhead Minimization
- Initial pointers in this direction
- Variable Length Configurations
- Default Configuration on Power-up
34. Variable Length Configurations
- Ideally:
- Change only the minimum number of bits for a new configuration.
- Utilize the idea of short configurations for frequently used configurations.
- Start from a default configuration and change the minimum number of bits to reconfigure to a new state.
35. Hurdles
- Logic blocks always require full configurations to be specified; configuration sizes cannot be varied.
- Knowing only what to change requires keeping track of what was configured before; a difficult issue when multiple dynamic applications are switching.
- A default power-up configuration can hardly be useful for all application cases.
36. Configuration Overhead Minimization
- How to do it?
- Remove redundancy in the configuration data, or compact the contents of the configuration stream.
- Result?
- This will minimize the information required to be conveyed during configuration or reconfiguration.
- Configuration Compression:
- Applying some form of compression to the configuration data stream.
37. Configuration Decompression Approaches
- Centralized approach:
- Decompress the configuration stream at the boundary of the FPGA.
- Distributed approach (a new paradigm):
- Decompress the stream at the boundary of the logic blocks or logic cluster.
38. Centralized Approach
- Advantages:
- Requires hardware only at the boundary of the device, where the configuration data enters the device.
- Significant reduction in configuration size can be achieved.
- Run-length coding and Lempel-Ziv based compression have been used.
- Examples:
- Atmel 6000 Series.
- Zhiyuan Li and Scott Hauck, "Configuration Compression for Virtex FPGAs," IEEE Symposium on FPGAs for Custom Computing Machines, 2001.
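As an illustration of the centralized style, a minimal run-length encoder over a raw bitstream (just the idea; not the Atmel or Virtex bitstream format):

```python
def run_length_encode(bits):
    """Run-length encode a raw configuration bitstream as (bit, run_length) pairs."""
    encoded = []
    run_bit, run_len = bits[0], 1
    for b in bits[1:]:
        if b == run_bit:
            run_len += 1
        else:
            encoded.append((run_bit, run_len))
            run_bit, run_len = b, 1
    encoded.append((run_bit, run_len))
    return encoded

print(run_length_encode("0000000011110000000000000001"))
# [('0', 8), ('1', 4), ('0', 15), ('1', 1)]
```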
39. Centralized Approach
- Limitations:
- More efficient variable-length coding is not easy to use because of the large number of symbol possibilities.
- It is difficult to quantify symbols in the configuration stream of heterogeneous devices, which can have different types of blocks.
40. Decentralized Approach
- Advantages:
- Decompressing at the logic block boundary enables configurations to be easily symbolized and hence variable-length coding (VLC) to be used.
- In other words, we know exactly what we are coding, so Huffman-like codes can be used, based on the frequency of configuration occurrences.
- Also has advantages specific to Wormhole RTR, discussed next.
41. Decentralized Approach
- Limitations:
- The decompression hardware has to be replicated.
- Optimality issue: over how much programmable logic area should the decompression hardware be amortized?
- In other words, the granularity of the logic area should be determined for an optimal cost/benefit ratio.
42. Suitability of the Decentralized Approach to WRTR
- If worms are decompressed at the device boundary, long internal worms result.
- Greater internal worm lengths lead to greater arbitration issues and worm blockages.
- The decentralized approach thus favors shorter worm lengths and parallel worms that traverse with fewer blockages.
43. Variable Length Configuration
- Overall idea:
- Frequently used configurations of a logic block should have small codes.
- Variable-length coding such as Huffman coding can be adapted.
44. Configuration Frequency Analysis
- How to decide upon the frequencies?
- Hardwired? By the designer, through benchmark analysis?
- Generic? Done by the software generating the configuration stream?
45. Configuration Frequency Analysis (continued)
- Hardwired determination will be inferior: no large benefit is gained, due to variations in applications.
- Software that generates the configurations can optimally identify a given number of frequently used configurations according to the set of applications to be executed.
- Code determination should be done by the software generating the configurations, for optimal codes (see the sketch below).
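A sketch of how that software step could work: count how often each logic-block configuration occurs across the target application set, then build a Huffman code over those counts. This is the standard Huffman construction; the configuration names and counts are hypothetical.

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build Huffman codes from a {configuration: count} map; returns {configuration: bitstring}."""
    # Each heap entry: [total_count, tie_breaker, {symbol: code_built_so_far}]
    heap = [[count, i, {sym: ""}] for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                                   # degenerate single-symbol case
        return {next(iter(freqs)): "0"}
    tie = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)                         # two least-frequent subtrees
        hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}  # prepend a bit on each side
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], tie, merged])
        tie += 1
    return heap[0][2]

# Hypothetical logic-block configurations and their occurrence counts across a benchmark set.
usage = Counter({"ADD": 40, "MUX2": 25, "XOR": 15, "SUB": 12, "AND": 8})
for cfg, code in sorted(huffman_codes(usage).items(), key=lambda kv: len(kv[1])):
    print(f"{cfg:5s} -> {code}")   # the most frequent configurations get the shortest codes
```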
46. Decoding Hardware Approaches
- Huffman coding the configurations:
- A hardwired Huffman decoder.
- An adaptive decoder (the code table can be changed).
- Using a table of frequently used configurations and an address decoder:
- Huffman coding the table addresses:
- Static coding.
- Adaptive coding.
47. Decoding Hardware Features
- Static Huffman decoders:
- Lower compression.
- Coding the configurations directly:
- Requires a very wide decoder.
- Using an address decoder only:
- Reduced hardware, but less compression (fixed-size codes).
- Coding the decoder inputs:
- Requires a relatively smaller Huffman decoder.
48. Some Points to Note
- The decompression approach is decoupled from any specific logic block architecture.
- Though certain logic blocks will favor more compression (discussed later).
- Not every possible configuration will be coded; random-logic portions in particular will require all of their bits to be transmitted.
- A special code will prefix a random-logic configuration to identify it as one to be handled separately (see the sketch below).
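A sketch of the resulting per-block stream format, assuming a hypothetical one-bit prefix: '1' followed by the Huffman code for a frequent configuration, or '0' followed by the full raw bits for an uncoded (e.g. random-logic) block. The prefix convention and table contents are assumptions for illustration.

```python
def encode_block(config_bits, code_table):
    """Emit one block's configuration: a short code if it is a frequent
    configuration, otherwise an escape followed by the raw bits."""
    if config_bits in code_table:
        return "1" + code_table[config_bits]   # coded: escape bit + short Huffman code
    return "0" + config_bits                   # uncoded: escape bit + full raw configuration

# Hypothetical table of frequent 8-bit configurations and their Huffman codes.
table = {"00001111": "0", "11110000": "10", "10101010": "110"}
print(encode_block("11110000", table))   # -> '110'        (coded: 3 bits instead of 8)
print(encode_block("01100111", table))   # -> '001100111'  (raw: 9 bits, random logic)
```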
49. Logic Block Selection
- High-level issues:
- Should be datapath oriented.
- Efficient support for random logic implementation.
- High functionality with minimum configuration bits, to support dense implementation with reduced reconfiguration overhead.
- Well-defined datapath functionality (configuration) to aid in quantifying the frequently-used-configuration idea.
50. Chosen High-Functionality Logic Block
51. A Low-Functionality Logic Block
52. Logic Block Considerations
- High-functionality blocks → good for datapath implementations.
- Low-functionality blocks → less dense datapath implementations.
53. Logic Block Considerations
- Low-functionality blocks have fewer configuration bits per block, and vice versa.
- The cost of the frequently-used-configurations memory depends on the size of the configurations stored.
- So the decoder hardware overhead will be lower for low-functionality blocks.
54. Logic Block Considerations
- What about the configuration time overhead?
- Less dense functionality means more blocks to configure.
- This leads to longer configuration streams.
55. Logic Block Considerations
- Assumption: Random logic does not benefit from one block type or the other.
- Consequence: Datapath-oriented designs will require fewer blocks to configure with high-functionality blocks, and will achieve even higher compression, but with a larger overhead for random-logic implementations than with low-functionality blocks.
56. Logic Block Issue Conclusions
- Proper logic block selection for a particular application affects:
- Decoder hardware size.
- Configuration compression ratio.
- Since random logic is not compressed, using high-functionality blocks for less datapath-oriented applications will result in:
- A high decoder overhead that goes unutilized.
- Less compression and longer configuration streams.
57. Huffman Decoder Hardware
- Basic Huffman decoding hardware is sequential in nature, with a variable input rate and a variable output rate.
58. Huffman Hardware
- Sequential decoding is not suitable for WRTR:
- The worm would have to be stalled.
- This negates the benefit of fast reconfiguration.
- The hardware should be able to process N bits at a time, where N = bus width.
- This requires a constant-input-rate architecture with a variable number of codes processed per cycle.
59. Constant-Input-Rate PLA-based Architecture
- Input rate: K bits/cycle.
- The PLA performs the table lookup process.
- Input bits:
- Determine one unique path along the Huffman tree.
- Next state:
- Fed back to the input of the PLA to indicate the final residing state.
- Indicator:
- The number of symbols decoded in the cycle.
- (Figure: the constant-input-rate PLA-based architecture for the VLC decoder.)
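A behavioral model of what the PLA lookup achieves: every cycle consumes exactly K input bits, and the (current state, K bits) pair determines the symbols completed in that cycle plus the carried-over partial code. The tiny code table and K value below are hypothetical; in hardware the mapping is a precomputed PLA table rather than the on-the-fly walk shown here.

```python
# Hypothetical Huffman code for three logic-block configurations (prefix-free).
CODE = {"0": "A", "10": "B", "11": "C"}

def decode_k_bits_per_cycle(bitstream, K=4):
    """Behavioral model of a constant-input-rate decoder.

    Each 'cycle' consumes exactly K bits; 'state' carries the partial codeword
    left over between cycles.  In hardware, the (state, K input bits) ->
    (next state, symbols, count) mapping is the precomputed PLA table; here it
    is evaluated on the fly for clarity.
    """
    state = ""
    for start in range(0, len(bitstream), K):
        chunk = bitstream[start:start + K]   # constant input rate: K bits per cycle
        symbols = []
        for bit in chunk:
            state += bit
            if state in CODE:                # a complete codeword ends here
                symbols.append(CODE[state])
                state = ""
        # 'symbols' (and its length) is the indicator; 'state' is the next state.
        print(f"cycle in={chunk:<{K}} decoded={symbols} next_state='{state}'")

decode_k_bits_per_cycle("0101101100110", K=4)
```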
60. PLA-based Architecture
- Ref: "VLSI Designs for High-Speed Huffman Decoders," Shih-Fu Chang and David G. Messerschmitt.
- Decoder model: FSM.
- FSM implementation:
- ROM
- PLA
- Lower complexity
- High speed
61. Implementation Results
- The area is a function of the number of inputs and outputs, along with the input rate.
62. Hardware Area Estimates
- The number of inputs and outputs depends on the maximum code length and the symbol sizes.
- Typically, for a 16-entry table:
- Code sizes range from 1 to 8 bits.
- The symbol size will equal the decoder input, i.e. 4 bits.
63. Handling Multiple Symbol Outputs
- More than one code may be decoded per cycle, with a maximum of N codes per cycle.
- To take full advantage of parallel decoding, multiple configuration chains can be employed.
- A counter can be used to cycle between the chains and output one configuration per chain, with a maximum of N chains (see the sketch below).
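A sketch of the chain-selection counter: decoded configurations are distributed round-robin across the N parallel configuration chains (N and the chain representation are illustrative).

```python
def distribute_to_chains(decoded_symbols, n_chains):
    """Round-robin decoded configurations across N parallel configuration chains.

    Models the counter that selects which chain receives each decoded symbol,
    so that up to N configurations decoded in one cycle land on distinct chains.
    """
    chains = [[] for _ in range(n_chains)]
    counter = 0
    for sym in decoded_symbols:
        chains[counter].append(sym)          # shift this configuration into chain 'counter'
        counter = (counter + 1) % n_chains   # the counter cycles through the chains
    return chains

print(distribute_to_chains(["A", "B", "C", "A", "C", "A"], n_chains=3))
# [['A', 'A'], ['B', 'C'], ['C', 'A']]
```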
64. Decoder Hardware
- For parallel decoding of N codes into M-bit decoded symbols:
- Huffman decoder (discussed before)
- N M-bit decoders
- An N-port ROM table
- N parallel configuration chains
65. Concluding Points
- The hardware overhead of the Huffman decoding mechanism discussed has to be incorporated into the area model presented earlier.
- Empirical determination of reconfiguration speed-up versus area overhead for a sampling of benchmarks.