Fatih Kocan and Jason Meyer - PowerPoint PPT Presentation

Title: Fatih Kocan and Jason Meyer

1
Novel SRAM-based FPGA Architectures and
Supporting CAD Tools

Fatih Kocan and Jason Meyer
Computer Science
Southern Methodist University
October 10, 2007

2
Publications and Patent

1. J. Meyer and F. Kocan, "Sharing of SRAM Tables
among NPN-Equivalent LUTs in SRAM-based FPGAs",
IEEE Transactions on VLSI Systems, vol. 15, no. 2,
pp. 182-195, Feb. 2007.
2. J. Meyer and F. Kocan, "Improving Critical
Path Delay and Sharing in Shared SRAM Table based
FPGAs", in preparation.
3. J. Meyer and F. Kocan, "Reducing Critical Path
Delay in FPGAs with SRAM Tables Shared by
NPN-Equivalent Functions", International
Conference on Engineering of Reconfigurable
Systems and Algorithms, June 25-28, 2007.
4. J. Meyer and F. Kocan, "Sharing FPGA SRAM
Tables among NPN Equivalent LUTs", IEEE Int'l.
Midwest Symposium on Circuits and Systems,
Cincinnati, Ohio, August 7-10, 2005.
5. F. Kocan and J. Meyer, "Logic Modules with
Shared SRAM Tables for Field-Programmable Gate
Arrays", in Field Programmable Logic and
Application, 14th International Conference,
Leuven, Belgium, August 30-September 1, 2004,
Proceedings, vol. 3203 of Lecture Notes in
Computer Science, pp. 289-300, Springer, 2004.
Patent: Fatih Kocan, "Sharing a static random-access
memory (SRAM) table between two or more lookup
tables (LUTs) that are equivalent to each other",
US Patent 20070046324

3
Outline

FPGA Architectures
CLB Architectures
CAD for FPGAs
Synthesis, Placement, Routing
NPN Equivalence Classes of Functions
Motivation
Analysis of Benchmarks
Proposed LUT
CLB with Proposed LUTs
Experimental Results
Conclusions
Future Work

4
Typical Island Style SRAM-Based FPGA

Reconfigurable Computing
Bridges general purpose and application specific
computing
Faster than general purpose
More flexible than application specific

FPGAs are building blocks of reconfigurable
computing

5
Fundamental Components of FPGAs

[Figure: a 2-input LUT (inputs x, y; output z), a Basic Logic Element (BLE) made of a K-input LUT feeding a D flip-flop, and a Configurable Logic Block (CLB)]

6
Past Work on FPGA Architectures

Triptych architecture
Logic cells allocated for either logic or routing
Hybrid FPGAs
Combine LUT-based FPGAs and PLA-based CPLDs
Some parts suited for LUTs, others suited for
products
Vantis FPGA
Variable granularity
Configurable building block (CBB)
Variable-grain block (VGB) consisting of 4 CBBs
Super variable-grain consisting of 4 VGBs
Function folding method attempted to reduce
memory sizes based on fractions of functions.

7
Past Work on CLB Architectures (1)

Internal connections
Initially assumed to be fully connected
Sparsely populated connections were proposed
Single Event Upset faults
New architecture proposed to detect and correct
these faults
Based on maps and Remaps

8
Past Work on CLB Architectures (2)

Altera Stratix II ALM Architecture
ALM: 8 inputs divided into 2 functions (with
different inputs)
LAB (CLB): 8 ALMs

9
Lessons Learned from CLB Research

For a cluster of size N, 2N + 2 inputs are
sufficient
For a cluster of size N with k-input LUTs,
(k/2)(N + 1) inputs are sufficient
4-input functions (LUTs) are sufficient
Good results are obtained with various cluster
sizes between 4 and 8
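
The input-count rule above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the two bounds read 2N + 2 and (k/2)(N + 1); the function name `cluster_inputs` and the rounding up for odd k are illustrative choices, not something the slides specify.

```python
import math

def cluster_inputs(n_luts, k=4):
    """Inputs per cluster under the (k/2)(N + 1) rule of thumb, rounded up."""
    return math.ceil(k * (n_luts + 1) / 2)

# For 4-input LUTs the general rule collapses to 2N + 2.
for n in (4, 8, 16):
    print(n, cluster_inputs(n), 2 * n + 2)
```

With k = 4 and N = 16 both expressions give 34 inputs, which matches the 34-input, 16-LUT CLB reported in the conclusions.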

10
FPGA CAD Flow

[Flow diagram: Design -> Synthesis -> Pack CLBs -> Placement -> Routing]

Design: circuit is created using Verilog, VHDL,
etc.
Synthesis: circuit is technology mapped to the FPGA
(SIS/RASP)
Packing: LUTs are packed into CLBs (e.g., T-VPACK)
Placement: CLBs are physically placed in the FPGA
(VPR)
Routing: connections are wired to channels (VPR)

11
VPR GUI from the University of Toronto
12
Synthesis Fundamentals

Goal: Given a multilevel network of logic gates,
transform it into a network of LUTs, each of no
more than K inputs
Objectives
Minimize number of LUTs
Minimize delay, area, power
Parts of Synthesis
Logic Optimization
Transform gate-level network into smaller
gate-level network (fewer gates)
Technology Mapping
Cover the gate-level network with K-LUTs
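
As a small illustration of what covering gates with a K-LUT means, the sketch below collapses a three-gate cone into the SRAM contents of a single 3-input LUT. The cone itself (an AND, an OR and an inverter) and the names `lut_contents` and `cone` are made up for this example; real mappers such as those in SIS/RASP choose the cones to cover, which this sketch does not attempt.

```python
from itertools import product

def lut_contents(cone, k):
    """SRAM bits of a K-LUT implementing `cone`, listed in input order 00..0 to 11..1."""
    return [int(cone(*bits)) for bits in product((0, 1), repeat=k)]

def cone(a, b, c):
    """Hypothetical gate-level cone: NOT((a AND b) OR c)."""
    n1 = a & b        # AND gate
    n2 = n1 | c       # OR gate
    return 1 - n2     # inverter

print(lut_contents(cone, 3))
```

The three gates disappear into one table of 2^3 = 8 bits, which is why the number of LUT inputs K, rather than the gate count, bounds what a single LUT can absorb.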

13
Logic Optimization Methods

Node Decomposition: re-express a single node as a
logically equivalent composition of 2 or more
nodes
Structural Decomposition
Symbolic Decomposition
Boolean Decomposition
Network Simplification

14
Technology Mapping

Example: Technology Mapping with K = 3

15
NPN-equivalence classes of Boolean Functions (1)

Input Negation (NI) equivalence
Negate some inputs to g so that g = f
For example, let $f(a,b) = \bar{a}b$ and $g(a,b) = ab$
g made equivalent to f by inverting a
Extra inverters and conditional negation required
Permutation (P) equivalence
Reorder some of the inputs to g so that g = f
Let $f(a,b) = \bar{a}b$ and $g(a,b) = a\bar{b}$
g made equivalent to f by reordering
No extra logic required

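[Figure: 2-input LUTs implementing g and f from the same SRAM contents (0, 1, 0, 0), with inputs (b, a) vs. (a, b) in one panel and (a, b) for both in the other]
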
16
NPN-equivalence classes of Boolean Functions (2)

Output Negation (NO) equivalence
NO equivalent if $g = f$ or $g = \bar{f}$
Let $f(a,b) = \bar{a}b$ and $g(a,b) = a + \bar{b}$
g made equivalent to f by inverting output
Extra inverters and conditional negation required
NPN equivalence
g and f are NPN equivalent
if any combination of NI, P,
and NO equivalence
yields g = f
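
The three equivalences can be checked by brute force on small functions. The sketch below tries every input permutation, input negation and output negation; the example functions assume the garbled slide formulas read f = a'b, with g = ab (input negation), g = ab' (permutation) and g = a + b' (output negation). The helper names are illustrative.

```python
from itertools import permutations, product

def truth_table(func, n):
    """SRAM contents of an n-input LUT as a tuple of 0/1 values."""
    return tuple(int(bool(func(*bits))) for bits in product((0, 1), repeat=n))

def npn_equivalent(f, g, n):
    """True if some input permutation, input negation and optional
    output negation turns g into f (exhaustive search)."""
    target = truth_table(f, n)
    for perm in permutations(range(n)):
        for neg in product((0, 1), repeat=n):
            for out_neg in (0, 1):
                def h(*x):
                    y = [x[perm[i]] ^ neg[i] for i in range(n)]
                    return g(*y) ^ out_neg
                if truth_table(h, n) == target:
                    return True
    return False

f = lambda a, b: (1 - a) & b       # f = a'b
g_ni = lambda a, b: a & b          # matches f after inverting a
g_p = lambda a, b: a & (1 - b)     # matches f after swapping a and b
g_no = lambda a, b: a | (1 - b)    # matches f after inverting the output
xor = lambda a, b: a ^ b           # belongs to a different NPN class

print(truth_table(f, 2))
print([npn_equivalent(f, g, 2) for g in (g_ni, g_p, g_no, xor)])
```

The search visits n! * 2^n * 2 transformations, so it is only practical for the small LUT sizes (K up to about 4 or 5) discussed in these slides.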

[Figure: 2-input LUTs implementing g and f from shared SRAM tables under output negation and under full NPN equivalence]

17
Specialization (Bridging)

Bridging
Function $f_1$ is bridged over $f_2$ iff there
exists $x_i$ such that
$f_1(x_1, \dots, x_n, x_i) = f_2(y_1, \dots, y_n)$.

18
Specialization (Constant Assignment)

Constant Assignment
$f_1$ is said to be $C^+$ over $f_2$ iff the cofactor of
$f_1$ with respect to $x_{n+1}$ is equivalent to $f_2$:
$f_1(x_1, \dots, x_n, 1) = f_2(y_1, \dots, y_n)$.
$f_1$ is said to be $C^-$ over $f_2$ iff the cofactor of
$f_1$ with respect to $\bar{x}_{n+1}$ is equivalent to $f_2$:
$f_1(x_1, \dots, x_n, 0) = f_2(y_1, \dots, y_n)$.
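
Both specializations are easy to try on truth tables. The sketch below uses a 3-input majority function as a made-up example of f1 and treats "equivalent" as plain equality of truth tables, which is a simplifying assumption; the helper names `cofactor` and `bridge` are illustrative.

```python
from itertools import product

def table(func, n):
    """Truth table of an n-input function as a tuple of 0/1 values."""
    return tuple(int(func(*x)) for x in product((0, 1), repeat=n))

def cofactor(f1, value):
    """Constant assignment: fix the last input of f1 to 0 or 1."""
    return lambda *x: f1(*x, value)

def bridge(f1, i):
    """Bridging: tie the last input of f1 to input x_i (0-based)."""
    return lambda *x: f1(*x, x[i])

maj = lambda a, b, c: (a & b) | (a & c) | (b & c)   # example f1, n + 1 = 3 inputs

print(table(cofactor(maj, 1), 2))   # last input tied to 1
print(table(cofactor(maj, 0), 2))   # last input tied to 0
print(table(bridge(maj, 0), 2))     # last input tied to a
```

Under these definitions the majority function yields OR when its last input is set to 1, AND when it is set to 0, and the projection onto a when the last input is bridged to a.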

19
Universal Logic Module (ULM) Design

Universal logic blocks
Prior to SRAM-based logic modules
Blocks that supported a majority of functions
Can implement functions that are
Negated at primary inputs
Permuted at the inputs
Negated at primary outputs
NPN Equivalence studied in this context
SRAM Tables already Universal
Our goal: Why study NPN equivalence?
Answer: Sharing SRAM tables among NPN-equivalent
LUTs

20
Motivation for Sharing SRAM Tables

Reducing SRAM Cells implies
Reduced Area
Reduced Power
Reduced Number of Configuration Bits
Reduced Configuration Time, Reduced Test Time
Even if area is not lowered, extra resources can
be used to
Radiation harden the circuit,
Increase routing resources
Buffer I/O
Potential Adverse Effects of Sharing
Increased routing resources
Increased critical path delays

21
Practicality of NPN-Equivalence Based Sharing

We analyzed MCNC, ITC99, and ISCAS85 Benchmarks
with
Academic Tools (SIS, RASP)
Industrial Tools (Mentor Precision Synthesis RTL)
We expect to find an abundance of NPN-equivalent
functions

22
Analysis of MCNC Benchmarks
Benchmark descriptions
23
NPN-equivalence classes used, Combinational MCNC
Benchmarks
24
NPN-equivalence classes used, Sequential
Benchmarks
25
ITC 99 and ISCAS 85 Benchmarks
26
Mentor Precision RTL Synthesis
27
Analysis Results

Expectations met!!!
Synthesis tools are biased towards some classes
of functions
Assuming tools give near-optimal solutions, maybe
it is not necessary to utilize all equivalence
classes of functions to implement a circuit?
Another research problem for a PhD student

28
Possible Sharing Architectures

[Figure: four candidate sharing architectures: NP, PN, NPN1, NPN2]
29
Architectural Changes to Support NPN Sharing

Changes to LUT structures

Conditional Negation logic (CN)
MUX with CN
30
Power and Delay Measurements

Plug-in added to VPR to calculate power
Added power for conditional negation
Subtracted power for fewer SRAM tables
Added extra delay for conditional negation
ORCAD 9.2 and PSpice used to determine power and
delay with the NPN architectures

31
Shared CLB
32
Updated CAD Flow

[Flow diagram: Design -> Synthesis -> Equiv. Analysis (produces Equiv. Table) -> Pack CLBs -> Placement -> Routing]

Add equivalency analysis stage before packing
CLBs.
Take NPN equivalence into account when packing
CLBs.
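
One way to picture the packing step is as a pairing problem over the equivalence table. The sketch below assumes the analysis stage has already labelled each LUT with an NPN class (the labels and LUT names here are invented) and greedily pairs LUTs of the same class onto one SRAM table. It ignores CLB capacity, input limits and timing, all of which a real packer such as T-VPACK would have to respect.

```python
from collections import defaultdict

def pair_for_sharing(lut_class):
    """Greedy pairing: two LUTs in the same NPN class share one SRAM table.

    lut_class maps a LUT name to its (hypothetical) equivalence-class label.
    Returns (shared_pairs, unshared_luts)."""
    by_class = defaultdict(list)
    for lut, cls in lut_class.items():
        by_class[cls].append(lut)
    shared, unshared = [], []
    for luts in by_class.values():
        luts = sorted(luts)
        while len(luts) >= 2:
            shared.append((luts.pop(), luts.pop()))
        unshared.extend(luts)
    return shared, unshared

equiv_table = {"u1": "and2", "u2": "and2", "u3": "xor2",
               "u4": "and2", "u5": "xor2", "u6": "mux"}
shared, unshared = pair_for_sharing(equiv_table)
print(shared, unshared)
print("SRAM tables needed:", len(shared) + len(unshared), "for", len(equiv_table), "LUTs")
```

In this toy netlist six LUTs need only four SRAM tables, since two pairs can share.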

33
Searching for Optimal CLB Architectures

Investigated (near) optimal CLB architectures
Homogeneous larger, mixed CLBs
8-16 LUTs per CLB
About 25% of SRAM tables are shared by 2 LUTs
Good architectures tested

34
Delay Results
35
Routing Results
36
Power Results
37
Area Results
38
The Need for Post-Routing Delay Improvement

Unbalanced delays
Place and route as if balanced delays
After routing, do iterative improvements to
critical path delays

39
Local Configuration Changes

Two LUTs can be swapped without requiring global
reconfiguration if
Both fanouts on the same side of the CLB
Neither fanout outside of the CLB
Fanout to different sides of CLB but converge
before fanning out further

40
Critical Path Improvement Algorithms

Greedy algorithm does not work
Swap1 would not be made in a greedy algorithm
However, if both Swap1 and Swap2 made, critical
path is reduced

Developed three post-routing algorithms to
improve the critical path delays
Genetic Algorithm
Simulated Annealing
Branch and Bound

41
Genetic Algorithm

Objective function: critical path delay
Linear time algorithm, one pass through the
circuit
Population
Gene
1 bit for every shared SRAM table in FPGA
1: first LUT fast, second LUT slow
0: first LUT slow, second LUT fast
Start with 300 individuals, created randomly
Operations
Run set number of generations
Crossover

Mutation
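
A minimal sketch of this encoding follows. Only the one-bit-per-shared-table gene and the population of 300 come from the slide; the path-based delay model, the single-point crossover, the elitist selection, the mutation rate and the generation count are assumptions made so that the example runs on its own.

```python
import random

def critical_path_delay(genes, paths, slow_penalty=1.0):
    """One pass over all paths. Each path is (base_delay, [(table, position), ...]),
    where position 0 is the first LUT on a shared table and 1 the second."""
    worst = 0.0
    for base, luts in paths:
        delay = base
        for table, position in luts:
            fast_position = 0 if genes[table] == 1 else 1
            if position != fast_position:
                delay += slow_penalty      # this LUT sits on the slow side
        worst = max(worst, delay)
    return worst

def genetic_search(paths, n_tables, pop_size=300, generations=50,
                   mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    fitness = lambda g: critical_path_delay(g, paths)
    pop = [[rng.randint(0, 1) for _ in range(n_tables)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]               # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_tables) if n_tables > 1 else 0
            child = mom[:cut] + dad[cut:]            # single-point crossover
            child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        pop = parents + children
    best = min(pop, key=fitness)
    return best, fitness(best)

# Toy circuit: 4 shared tables, 4 paths (all values invented).
paths = [(5.0, [(0, 0), (1, 0)]),
         (5.0, [(0, 1), (2, 0)]),
         (4.0, [(1, 1), (2, 1), (3, 0)]),
         (3.0, [(3, 1)])]
best, delay = genetic_search(paths, n_tables=4)
print(best, delay)
```

Because the fitter half of each generation is carried over unchanged, the best delay found never gets worse from one generation to the next.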

42
Simulated Annealing Algorithm

Objective function: critical path delay
Linear time algorithm, one pass through the
circuit
Gene
1 bit for every shared SRAM table in FPGA
1: first LUT fast, second LUT slow
0: first LUT slow, second LUT fast
Start with 1 individual, created randomly
Operation
Mutation
Pick a single bit at random and flip
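
The single-individual, single-bit-flip scheme above can be sketched as follows. The geometric cooling schedule, the Metropolis acceptance rule, the step count and the toy objective `toy_delay` are assumptions for illustration; the slides only specify the encoding, the random start and the mutation.

```python
import math
import random

def simulated_annealing(delay_of, n_tables, steps=2000,
                        t_start=2.0, t_end=0.01, seed=0):
    """Minimise delay_of(genes) over bit strings with one bit per shared table."""
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(n_tables)]
    cur_delay = delay_of(current)
    best, best_delay = current[:], cur_delay
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / steps)   # geometric cooling
        i = rng.randrange(n_tables)
        current[i] ^= 1                                     # mutation: flip one bit
        new_delay = delay_of(current)
        accept = (new_delay <= cur_delay or
                  rng.random() < math.exp((cur_delay - new_delay) / t))
        if accept:
            cur_delay = new_delay
            if new_delay < best_delay:
                best, best_delay = current[:], new_delay
        else:
            current[i] ^= 1                                 # rejected: undo the flip
    return best, best_delay

def toy_delay(genes):
    """Invented 3-table objective standing in for the critical path delay."""
    a, b, c = genes
    return max(5 + (1 - a) + (1 - b), 5 + a + (1 - c), 4 + b + c)

best, delay = simulated_annealing(toy_delay, n_tables=3)
print(best, delay)
```

Passing the objective in as a function keeps the search independent of how the delay is computed, so a real one-pass timing analysis could be substituted for `toy_delay`.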

43
Branch and Bound Algorithm

Enumerate and check all possible swaps
Representation
1 bit for every LUT in FPGA
1: fast side of SRAM table
0: slow side of SRAM table
Configuration is consistent
If one LUT for an SRAM is fast, other must be
slow
LUTs mapped to unshared SRAM are fast
Each iteration, swap 2 bits for LUTs on SRAM
table
If critical path lowers, keep the swap
If critical path route changes, keep the swap
Exponential, but
Only need to swap LUTs on critical path
Must save off configurations to prevent trying
them again
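
The keep rules and the visited-configuration bookkeeping listed above can be sketched as a depth-first swap search. This is not a textbook branch and bound with an explicit lower bound: the only pruning is the slide's own rule (keep a swap if the critical path delay drops or its route changes) plus a set of already-tried configurations. The circuit, the delay model and all names are invented for the example.

```python
def critical(fast, paths, slow_penalty=1.0):
    """Return (delay, index) of the slowest path. LUTs missing from `fast`
    sit on unshared SRAM tables and are treated as fast."""
    delays = [base + slow_penalty * sum(1 for lut in luts if not fast.get(lut, 1))
              for base, luts in paths]
    worst = max(delays)
    return worst, delays.index(worst)

def swap_search(tables, paths, start):
    best, (best_delay, _) = dict(start), critical(start, paths)
    seen = {tuple(sorted(start.items()))}       # configurations already tried
    stack = [dict(start)]
    while stack:
        cfg = stack.pop()
        delay, crit = critical(cfg, paths)
        if delay < best_delay:
            best, best_delay = cfg, delay
        crit_luts = set(paths[crit][1])
        for a, b in tables:
            if a not in crit_luts and b not in crit_luts:
                continue                        # only swap LUTs on the critical path
            nxt = dict(cfg)
            nxt[a], nxt[b] = cfg[b], cfg[a]     # exchange fast and slow roles
            key = tuple(sorted(nxt.items()))
            if key in seen:
                continue
            seen.add(key)
            new_delay, new_crit = critical(nxt, paths)
            if new_delay < delay or new_crit != crit:
                stack.append(nxt)               # keep the swap and explore further
    return best, best_delay

tables = [("A", "B"), ("C", "D")]               # each pair shares one SRAM table
paths = [(6, ["A"]), (5, ["B", "C"]), (5, ["D"])]
start = {"A": 0, "B": 1, "C": 0, "D": 1}
best, delay = swap_search(tables, paths, start)
print(best, delay)
```

In this toy circuit neither swap alone lowers the delay of 7, so a greedy pass would stop, but swapping both tables brings the critical path down to 6, which is the situation the Swap1 and Swap2 example on the earlier slide describes.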

44
Critical Path Improvement Algorithm Results
45
Comparing Sharing and Non-sharing when Branch and
Bound Algorithm Utilized
46
Conclusions

Optimal CLB architecture for studied benchmarks:
16 LUTs/CLB, 34 inputs, 7 shared SRAM tables
Potential for large savings in SRAM cells
Pessimistic results
Reduced number of SRAM tables by 44%!
Reduced area by 4.4%, power by 2%!
Reduced configuration bits,
configuration time, test time
No degradation in routing, wirelength, or
critical path delay
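
The SRAM-table figure can be reproduced from the CLB parameters on this slide. The sketch assumes each shared table serves exactly two 4-input LUTs and counts only LUT contents, not the extra configuration bits that conditional negation would need; under those assumptions the 44 on the slide is consistent with a percentage. The function name is illustrative.

```python
def sram_savings(luts_per_clb, shared_tables, k=4):
    """Table count and savings when each shared table replaces two private ones."""
    bits_per_table = 2 ** k
    return {
        "tables": luts_per_clb - shared_tables,
        "table_reduction_pct": 100.0 * shared_tables / luts_per_clb,
        "sram_bits_saved": shared_tables * bits_per_table,
    }

result = sram_savings(luts_per_clb=16, shared_tables=7)
print(result)
```

For 16 LUTs and 7 shared tables this gives 9 tables per CLB, a reduction of 7/16 = 43.75%, and 112 fewer LUT SRAM bits per CLB.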

47
Future Work

Develop synthesis and resynthesis algorithms to
increase NPN equivalence in a LUT-level circuit
Possibly increase amount of logic, but offset by
greater SRAM table sharing
3 synthesis approaches
Restrict synthesis to functions that belong to a
specific set of permissible equivalence classes
(e.g., NAND gates).
Restrict synthesis to specific equivalence
classes locally, but no restrictions globally
Restrict LUT sizes to 3 inputs instead of 4
inputs.
Only 14 3-input NPN equivalence classes, but 222
4-input classes
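
The class counts quoted above can be checked by enumeration. The sketch below maps every truth table to the smallest table in its NPN orbit and counts the distinct results; the function names are illustrative, and the brute-force approach is chosen for clarity rather than speed.

```python
from itertools import permutations

def npn_canonical(tt, n):
    """Smallest truth table (as an integer) reachable from tt by input
    permutation, input negation and output negation."""
    size = 1 << n
    full = (1 << size) - 1
    best = None
    for perm in permutations(range(n)):
        for neg in range(size):
            new = 0
            for idx in range(size):
                src = 0
                for i in range(n):
                    src |= ((idx >> i) & 1) << perm[i]   # permute input bits
                src ^= neg                               # negate selected inputs
                if (tt >> src) & 1:
                    new |= 1 << idx
            for cand in (new, new ^ full):               # with and without output negation
                if best is None or cand < best:
                    best = cand
    return best

def count_npn_classes(n):
    return len({npn_canonical(tt, n) for tt in range(1 << (1 << n))})

print(count_npn_classes(2), count_npn_classes(3))
```

The same enumeration works for n = 4, where the slide's figure of 222 classes applies, but it has to visit 65,536 functions under 768 transformations each and is slow in pure Python.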

48
Example Re-synthesis