Title: Fatih Kocan and Jason Meyer
1Novel SRAM-based FPGA Architectures and
Supporting CAD Tools
- Fatih Kocan and Jason Meyer
- Computer Science
- Southern Methodist University
- October 10, 2007
2Publications and Patent
- 1. J. Meyer and F. Kocan, "Sharing of SRAM Tables
among NPN-Equivalent LUTs in SRAM-based FPGAs,"
IEEE Transactions on VLSI Systems, vol. 15, no. 2,
pp. 182-195, Feb. 2007.
- 2. J. Meyer and F. Kocan, "Improving Critical
Path Delay and Sharing in Shared SRAM Table based
FPGAs," in preparation.
- 3. J. Meyer and F. Kocan, "Reducing Critical Path
Delay in FPGAs with SRAM Tables Shared by
NPN-Equivalent Functions," International
Conference on Engineering of Reconfigurable
Systems and Algorithms, June 25-28, 2007.
- 4. J. Meyer and F. Kocan, "Sharing FPGA SRAM
Tables among NPN Equivalent LUTs," IEEE Int'l.
Midwest Symposium on Circuits and Systems,
Cincinnati, Ohio, August 7-10, 2005.
- 5. F. Kocan and J. Meyer, "Logic Modules with
Shared SRAM Tables for Field-Programmable Gate
Arrays," in Field Programmable Logic and
Application, 14th International Conference,
Leuven, Belgium, August 30-September 1, 2004,
Proceedings, vol. 3203 of Lecture Notes in
Computer Science, pp. 289-300, Springer, 2004.
- Fatih Kocan, "Sharing a static random-access
memory (SRAM) table between two or more lookup
tables (LUTs) that are equivalent to each other,"
US Patent 20070046324
3Outline
- FPGA Architectures
- CLB Architectures
- CAD for FPGAs
- Synthesis, Placement, Routing
- NPN Equivalence Classes of Functions
- Motivation
- Analysis of Benchmarks
- Proposed LUT
- CLB with Proposed LUTs
- Experimental Results
- Conclusions
- Future Work
4Typical Island Style SRAM-Based FPGA
- Reconfigurable Computing
- Bridges general-purpose and application-specific
computing
- Faster than general purpose
- More flexible than application specific
- FPGAs are building blocks of reconfigurable
computing
5Fundamental Components of FPGAs
[Figure: a 2-input LUT (inputs x and y, output z,
SRAM cells holding 0/1 values); a Basic Logic
Element (BLE) built from a K-input LUT, a D
flip-flop, and the output Out; and a Configurable
Logic Block (CLB).]
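As a rough behavioral illustration of the LUT on this slide: a K-input LUT is a 2^K-entry SRAM table addressed by its inputs. The sketch below is only a software model; the function name `lut_eval`, the example tables, and the choice of the first input as the least-significant address bit are assumptions made for illustration.

```python
def lut_eval(sram_bits, inputs):
    """Return the output of a K-input LUT modeled as a table lookup.

    sram_bits: list of 2**K configuration bits (the SRAM table).
    inputs:    sequence of K input bits; inputs[0] is used as the
               least-significant address bit (an assumed convention).
    """
    address = sum(bit << i for i, bit in enumerate(inputs))
    return sram_bits[address]


# Two hypothetical 2-input configurations (address = x + 2*y).
xor_table = [0, 1, 1, 0]
and_table = [0, 0, 0, 1]

for x in (0, 1):
    for y in (0, 1):
        print(x, y, lut_eval(xor_table, (x, y)), lut_eval(and_table, (x, y)))
```

Changing the function of the LUT only means rewriting the table contents; the lookup hardware stays the same.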
6Past Work on FPGA Architectures
- Triptych architecture
- Logic cells allocated for either logic or routing
- Hybrid FPGAs
- Combine LUT-based FPGAs and PLA-based CPLDs
- Some parts suited for LUTs, others suited for
products
- Vantis FPGA
- Variable granularity
- Configurable building block (CBB)
- Variable-grain block (VGB) consisting of 4 CBBs
- Super variable-grain consisting of 4 VGBs
- Function folding method attempted to reduce
memory sizes based on fractions of functions.
7Past Work on CLB Architectures (1)
- Internal connections
- Initially assumed to be fully connected
- Sparsely populated connections were proposed
- Single Event Upset faults
- New architecture proposed to detect and correct
these faults
- Based on maps and Remaps
8Past Work on CLB Architectures (2)
- Altera Stratix II ALM Architecture
- ALM: 8 inputs divided into 2 functions (with
different inputs)
- LAB (CLB): 8 ALMs
9Lessons Learned from CLB Research
- For a cluster of size N, 2N + 2 inputs are
sufficient
- For a cluster of size N with k-input LUTs,
(k/2)(N + 1) inputs are sufficient
- 4-input functions (LUTs) are sufficient
- Good results are obtained with various cluster
sizes between 4 and 8
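The input-count rule of thumb above is easy to check numerically. The helper below is a small sketch; the name `cluster_inputs` and the rounding up for odd products are assumptions, not part of the slides.

```python
import math


def cluster_inputs(n_luts, k=4):
    """Sufficient number of CLB inputs for n_luts k-input LUTs,
    using the (k/2)(N + 1) rule; rounding up is an assumed choice."""
    return math.ceil(k * (n_luts + 1) / 2)


for n in (4, 8, 16):
    print(n, cluster_inputs(n))
```

For k = 4 the rule reduces to 2N + 2, so a 16-LUT cluster needs 34 inputs, which matches the 16-LUT, 34-input CLB reported in the conclusions.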
10FPGA CAD Flow
[Figure: CAD flow: Design → Synthesis → Pack CLBs
→ Placement → Routing]
- Design: circuit is created using Verilog or VHDL
etc.
- Synthesis: circuit technology mapped to FPGA
(SIS/RASP)
- Packing: LUTs packed into CLBs (e.g., T-VPACK)
- Placement: CLBs are physically placed in FPGA
(VPR)
- Routing: connections are wired to channels (VPR)
11VPR GUI from the University of Toronto
12Synthesis Fundamentals
- Goal: Given a multilevel network of logic gates,
transform it into a network of LUTs, each of no
more than K inputs
- Objectives
- Minimize number of LUTs
- Minimize delay, area, power
- Parts of Synthesis
- Logic Optimization
- Transform gate-level network into smaller
gate-level network (fewer gates)
- Technology Mapping
- Cover the gate-level network with K-LUTs
13Logic Optimization Methods
- Node Decomposition: re-express a single node with
a logically equivalent composition of 2 or more
nodes
- Structural Decomposition
- Symbolic Decomposition
- Boolean Decomposition
- Network Simplification
14Technology Mapping
- Example: Technology Mapping with K = 3
15NPN-equivalence classes of Boolean Functions (1)
- Input Negation (NI) equivalence
- Negate some inputs to g so that g = f
- For example, let f(a,b) = ab and g(a,b) = a'b
(x' denotes the complement of x)
- g made equivalent to f by inverting a
- Extra inverters and conditional negation required
- Permutation (P) equivalence
- Reorder some of the inputs to g so that g = f
- Let f(a,b) = ab' and g(a,b) = a'b
- g made equivalent to f by reordering
- No extra logic required
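[Figure: LUT diagrams for f and g sharing the same
table contents: one with the inputs a and b
reordered (P equivalence), one with the same input
order (NI equivalence).]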
16NPN-equivalence classes of Boolean Functions (2)
- Output Negation (NO) equivalence
- NO equivalent if g = f or g = f'
(x' denotes the complement of x)
- Let f(a,b) = ab and g(a,b) = a' or b'
- g made equivalent to f by inverting output
- Extra inverters and conditional negation required
- NPN equivalence
- g and f are NPN equivalent if any combination
of NI, P, and NO equivalence yields g = f
[Figure: LUT diagrams for f and g illustrating NO
equivalence and combined NPN equivalence on inputs
a and b.]
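To make the three equivalences concrete, the following brute-force sketch computes a canonical representative of a function's NPN class by trying every input permutation, input negation, and output negation, and then counts the classes. It is only an illustration: the names `npn_canonical` and `count_npn_classes`, the integer truth-table encoding (bit x of the integer is the output for input address x), and the choice of the numerically smallest table as representative are all assumptions, not the method used in the talk.

```python
from itertools import permutations


def npn_canonical(tt, n):
    """Smallest truth table (as an int) reachable from tt by input
    negation (N), input permutation (P), and output negation (N)."""
    size = 1 << n
    mask = (1 << size) - 1
    best = None
    for perm in permutations(range(n)):
        for neg in range(size):              # bit i set: negate input i
            out = 0
            for x in range(size):
                y = 0                        # address seen by the original table
                for i in range(n):
                    bit = ((x >> i) & 1) ^ ((neg >> i) & 1)
                    y |= bit << perm[i]
                out |= ((tt >> y) & 1) << x
            for cand in (out, out ^ mask):   # with and without output negation
                if best is None or cand < best:
                    best = cand
    return best


def count_npn_classes(n):
    """Number of NPN classes among all n-input functions (brute force)."""
    return len({npn_canonical(tt, n) for tt in range(1 << (1 << n))})


AND2, OR2, XOR2 = 0b1000, 0b1110, 0b0110
print(npn_canonical(AND2, 2) == npn_canonical(OR2, 2))   # AND and OR share a class
print(count_npn_classes(2), count_npn_classes(3))
```

The brute-force search is practical only for small n; it is meant to show the definition, and enumerating all 4-input functions this way would be slow.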
17Specialization (Bridging)
- Bridging
- Function f1 is said to be bridged over f2 iff
there exists xi such that f1(x1, ..., xn, xi) =
f2(y1, ..., yn).
18Specialization (Constant Assignment)
- Constant Assignment
- f1 is said to be C+ over f2 iff the cofactor of
f1 with respect to x_{n+1} is equivalent to f2
- f1(x1, ..., xn, 1) = f2(y1, ..., yn).
- f1 is said to be C- over f2 iff the cofactor of
f1 with respect to the complement of x_{n+1} is
equivalent to f2
- f1(x1, ..., xn, 0) = f2(y1, ..., yn).
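The two specializations can be tried out on truth tables. The sketch below is a minimal illustration under assumed conventions: a function is an integer truth table whose bit x is the output for input address x, input i is address bit i, and the last input x_{n+1} is the most-significant address bit. The names `cofactor` and `bridge` are hypothetical, and the code checks plain equality rather than equivalence up to NPN transformations.

```python
def cofactor(tt, n_plus_1, value):
    """Cofactor of an (n+1)-input table with its last input fixed to value."""
    half = 1 << (n_plus_1 - 1)               # size of the resulting n-input table
    return (tt >> (half * value)) & ((1 << half) - 1)


def bridge(tt, n_plus_1, i):
    """Tie the last input to input i (0-based), giving an n-input table."""
    n = n_plus_1 - 1
    out = 0
    for x in range(1 << n):
        xi = (x >> i) & 1
        out |= ((tt >> (x | (xi << n))) & 1) << x
    return out


MAJ3 = 0b11101000   # majority of (a, b, c), address = a + 2*b + 4*c

print(bin(cofactor(MAJ3, 3, 1)))   # c = 1 leaves a OR b
print(bin(cofactor(MAJ3, 3, 0)))   # c = 0 leaves a AND b
print(bin(bridge(MAJ3, 3, 0)))     # c tied to a leaves the function a
```

In this example the 3-input majority function is C+ over OR, C- over AND, and bridged over the single-variable function a.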
19Universal Logic Module (ULM) Design
- Universal logic blocks
- Prior to SRAM-based logic modules
- Blocks that supported a majority of functions
- Can implement functions that are
- Negated at primary inputs
- Permuted at the inputs
- Negated at primary outputs
- NPN Equivalence studied in this context
- SRAM Tables already Universal
- Our goal: Why study NPN equivalence?
- Answer: Sharing SRAM Tables among NPN-equivalent
LUTs
20Motivation for Sharing SRAM Tables
- Reducing SRAM Cells implies
- Reduced Area
- Reduced Power
- Reduced Number of Configuration Bits
- Reduced Configuration Time → Reduced test time
- Even if area is not lowered, extra resources can
be used to
- Radiation harden the circuit,
- Increase routing resources
- Buffer I/O
- Potential Adverse Effects of Sharing
- Increased routing resources
- Increased critical path delays
21Practicality of NPN-Equivalence Based Sharing
- We analyzed MCNC, ITC99, and ISCAS85 Benchmarks
with
- Academic Tools (SIS, RASP)
- Industrial Tools (Mentor Precision RTL Synthesis)
- We expect to find an abundance of NPN-equivalent
functions
22Analysis of MCNC Benchmarks
Benchmark descriptions
23NPN-equivalence classes used, Combinational MCNC
Benchmarks
24NPN-equivalence classes used, Sequential
Benchmarks
25ITC 99 and ISCAS 85 Benchmarks
26Mentor Precision RTL Synthesis
27Analysis Results
- Expectations met!
- Synthesis tools are biased towards some classes
of functions
- Assuming tools give near-optimal solutions, maybe
it is not necessary to utilize all equivalence
classes of functions to implement a circuit?
- Another research problem for a PhD student
28Possible Sharing Architectures
[Figure: four candidate sharing architectures,
labeled NP, PN, NPN1, and NPN2.]
29Architectural Changes to Support NPN Sharing
- Changes to LUT structures
Conditional Negation logic (CN)
MUX with CN
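A behavioral view of conditional negation may help here: a second LUT can reuse another LUT's SRAM table if its inputs and output pass through configurable inverters (modeled as XOR with a configuration bit), while permutation is handled by how the inputs are routed. The sketch below is a software model only, not the circuit on the slide; `lut_read`, `shared_lut_read`, `in_neg`, and `out_neg` are hypothetical names.

```python
def lut_read(table, inputs):
    """Plain LUT: inputs[0] is the least-significant address bit."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return table[address]


def shared_lut_read(table, inputs, in_neg, out_neg):
    """Read a shared table through conditional negation (CN).

    in_neg[i] = 1 inverts input i before the lookup;
    out_neg = 1 inverts the value read from the table.
    """
    cn_inputs = [bit ^ neg for bit, neg in zip(inputs, in_neg)]
    return lut_read(table, cn_inputs) ^ out_neg


and_table = [0, 0, 0, 1]          # one physical SRAM table, configured as AND

for a in (0, 1):
    for b in (0, 1):
        first = lut_read(and_table, (a, b))                      # a AND b
        second = shared_lut_read(and_table, (a, b), (1, 1), 1)   # a OR b
        print(a, b, first, second)
```

Here the second LUT implements OR from the AND table by De Morgan's law, at the cost of the extra CN logic and its delay.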
30Power and Delay Measurements
- Plug-in added to VPR to calculate power
- Added power for conditional negation
- Subtracted power for fewer SRAM tables
- Added extra delay for conditional negation
- ORCAD 9.2 and PSpice used to determine power and
delay with the NPN architectures
31Shared CLB
32Updated CAD Flow
[Figure: updated CAD flow: Design → Synthesis →
Equiv. Analysis (with Equiv. Table) → Pack CLBs →
Placement → Routing]
- Add equivalency analysis stage before packing
CLBs.
- Take NPN equivalence into account when packing
CLBs.
33Searching for Optimal CLB Architectures
- Investigated (near) optimal CLB architectures
- Homogeneous larger, mixed CLBs
- 8-16 LUTs per CLB
- About 25% of SRAM tables are shared by 2 LUTs
- Good architectures tested
34Delay Results
35Routing Results
36Power Results
37Area Results
38The Need for Post-Routing Delay Improvement
- Unbalanced delays
- Place and route as if balanced delays
- After routing, do iterative improvements to
critical path delays
39Local Configuration Changes
- Two LUTs can be swapped without requiring global
reconfiguration if
- Both fanouts on the same side of the CLB
- Neither fanout outside of the CLB
- Fanout to different sides of CLB but converge
before fanning out further
40Critical Path Improvement Algorithms
- Greedy algorithm does not work
- Swap1 would not be made in a greedy algorithm
- However, if both Swap1 and Swap2 made, critical
path is reduced
- Developed three post-routing algorithms to
improve the critical path delays
- Genetic Algorithm
- Simulated Annealing
- Branch and Bound
41Genetic Algorithm
- Objective function: critical path delay
- Linear time algorithm, one pass through the
circuit
- Population
- Gene
- 1 bit for every shared SRAM table in FPGA
- 1: first LUT fast, second LUT slow
- 0: first LUT slow, second LUT fast
- Start with 300 individuals, created randomly
- Operations
- Run set number of generations
- Crossover
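The encoding on this slide (one bit per shared SRAM table, a random initial population of 300, crossover over a fixed number of generations) can be sketched as follows. This is a generic illustration, not the authors' implementation: truncation selection, one-point crossover, the generation count, and the stand-in objective `toy_delay` with its `target` assignment are all assumptions. In the real flow the objective would be the critical path delay of the routed circuit.

```python
import random


def evolve(objective, n_bits, pop_size=300, generations=20, seed=0):
    """Minimize objective(bits) with a crossover-only genetic algorithm."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=objective)
        parents = pop[: pop_size // 2]       # keep the better half (assumed policy)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)   # one-point crossover
            children.append(a[:cut] + b[cut:])
        pop = parents + children
    best = min(pop, key=objective)
    return best, objective(best)


# Hypothetical stand-in objective: each bit that differs from an assumed
# best assignment adds 0.5 delay units to a base delay of 4.0.
target = [1, 0, 1, 1, 0, 1]


def toy_delay(bits):
    return 4.0 + 0.5 * sum(b != t for b, t in zip(bits, target))


best_bits, best_delay = evolve(toy_delay, n_bits=len(target))
print(best_bits, best_delay)
```

Because the better half of each generation survives unchanged, the best delay found never gets worse from one generation to the next.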
42Simulated Annealing Algorithm
- Objective function: critical path delay
- Linear time algorithm, one pass through the
circuit
- Gene
- 1 bit for every shared SRAM table in FPGA
- 1: first LUT fast, second LUT slow
- 0: first LUT slow, second LUT fast
- Start with 1 individual, created randomly
- Operation
- Mutation
- Pick a single bit at random and flip
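The annealing loop described above can be sketched on a toy circuit. Everything concrete here is an assumption for illustration: the FAST and SLOW delay values, the six-LUT circuit in `preds`, the shared pairs in `pairs`, and the cooling schedule. The objective is computed in one pass over the LUTs in topological order, as the slide describes.

```python
import math
import random

FAST, SLOW = 1.0, 1.5   # hypothetical delays of the two sides of a shared table


def critical_path_delay(preds, pairs, bits):
    """One pass over LUTs listed in topological order (index order)."""
    delay = [FAST] * len(preds)              # unshared LUTs are fast
    for (first, second), bit in zip(pairs, bits):
        delay[first] = FAST if bit else SLOW
        delay[second] = SLOW if bit else FAST
    arrival = [0.0] * len(preds)
    for node, fanin in enumerate(preds):
        arrival[node] = delay[node] + max((arrival[p] for p in fanin), default=0.0)
    return max(arrival)


def anneal(preds, pairs, steps=2000, t0=1.0, cooling=0.995, seed=0):
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in pairs]
    cost = critical_path_delay(preds, pairs, bits)
    best_bits, best_cost = bits[:], cost
    temp = t0
    for _ in range(steps):
        i = rng.randrange(len(bits))
        bits[i] ^= 1                         # mutation: flip one random bit
        new_cost = critical_path_delay(preds, pairs, bits)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best_bits, best_cost = bits[:], cost
        else:
            bits[i] ^= 1                     # rejected: undo the flip
        temp *= cooling
    return best_bits, best_cost


# Toy circuit: long path 0 -> 2 -> 4 -> 5, short path 1 -> 3 -> 5.
preds = [[], [], [0], [1], [2], [3, 4]]
pairs = [(2, 3), (4, 1)]                     # LUTs sharing an SRAM table

print(anneal(preds, pairs))
```

In this toy circuit the best assignment puts the fast side of both shared tables on the long path, which balances the two paths.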
43Branch and Bound Algorithm
- Enumerate and check all possible swaps
- Representation
- 1 bit for every LUT in FPGA
- 1: fast side of SRAM table
- 0: slow side of SRAM table
- Configuration is consistent
- If one LUT for an SRAM is fast, other must be
slow
- LUTs mapped to unshared SRAM are fast
- Each iteration, swap 2 bits for LUTs on SRAM
table
- If critical path lowers, keep the swap
- If critical path route changes, keep the swap
- Exponential, but
- Only need to swap LUTs on critical path
- Must save off configurations to prevent trying
them again
44Critical Path Improvement Algorithm Results
45Comparing Sharing and Non-sharing when Branch and
Bound Algorithm Utilized
46Conclusions
- Optimal CLB architecture for studied benchmarks
- 16 LUTs/CLB, 34 inputs, 7 shared SRAM tables
- Potential for large savings in SRAM cells
- Pessimistic results
- Reduced # of SRAM tables by 44%!
- Reduced area by 4.4%, power by 2%!
- Reduced configuration bits → reduced
configuration time, test time
- No degradation in routing, wirelength, or
critical path delay
47Future Work
- Develop synthesis and resynthesis algorithm to
increase NPN equivalence in a LUT level circuit
- Possibly increase amount of logic, but offset by
greater SRAM table sharing
- 3 synthesis approaches
- Restrict synthesis to functions that belong to a
specific set of permissible equivalence classes
(e.g., NAND gates).
- Restrict synthesis to specific equivalence
classes locally, but no restrictions globally
- Restrict LUT sizes to 3 inputs instead of 4
inputs.
- Only 14 3-input NPN equivalence classes, but 222
4-input classes
48Example Re-synthesis
- Fig(a) No NPN Equivalency
- Fig(b) All equivalent
49LP-based Synthesis
- Alan Walker is doing a PhD thesis on this topic.
50Complementary Nano-Electro Mechanical Switch
(CNEMS) and CNEM LUTs
51Two CNEMS-based LUTs