Title: NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture
1NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture
- Wei Zhang, Li Shang and Niraj K. Jha
- Dept. of Electrical Engineering, Princeton University
- Dept. of Electrical and Computer Engineering, Queen's University
2Outline
- Temporal Logic Folding
- Background on NRAMs
- Overview of the hybrid NAnoTUbe/CMOS REconfigurable architecture (NATURE) (DAC 2006)
- NanoMap Design Optimization Flow
- Experimental Results
- Conclusions
3Temporal Logic Folding
- Basic idea: use run-time reconfiguration to realize different functions in the same resource every few cycles
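[Figure: temporal logic folding example in which LUT 1, LUT 2, and LUT 3 share a resource over successive cycles, with MEM holding intermediate values; signal labels i, l, and OUT]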
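A minimal sketch of the folding idea, assuming one physical LUT whose truth table is reloaded before each cycle and a small memory that carries intermediate values between cycles. The class name `FoldedLUT`, the helper `make_table`, and the three Boolean functions are illustrative assumptions, not taken from the slide.

```python
from itertools import product


def make_table(func, num_inputs):
    """Build a truth table; the first input is the most significant index bit."""
    return [func(*bits) for bits in product((0, 1), repeat=num_inputs)]


class FoldedLUT:
    """One physical LUT that is reconfigured before every folding cycle."""

    def __init__(self, configs):
        # Each config: (truth_table, input_names, output_name).
        self.configs = configs

    def run(self, primary_inputs):
        mem = dict(primary_inputs)  # stands in for flip-flops / MEM
        for table, input_names, output_name in self.configs:
            index = 0
            for name in input_names:
                index = (index << 1) | mem[name]
            mem[output_name] = table[index]  # result stored for later cycles
        return mem


# Hypothetical three-step computation executed on the same LUT.
configs = [
    (make_table(lambda a, b, c: a & b & c, 3), ("a", "b", "c"), "i"),
    (make_table(lambda i, e, f, h: (i | e | f) & h, 4), ("i", "e", "f", "h"), "l"),
    (make_table(lambda d, g, l: d & g & l, 3), ("d", "g", "l"), "OUT"),
]
lut = FoldedLUT(configs)
result = lut.run({"a": 1, "b": 1, "c": 1, "d": 1, "e": 0, "f": 0, "g": 1, "h": 1})
```

Three functions are computed with one LUT at the cost of three cycles, which is the area-for-time exchange that logic folding exploits.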
4Overview of NATURE
- Distributed non-volatile nanotube RAMs (NRAMs): main storage for reconfiguration bits
- Fine-grain reconfiguration (even cycle-by-cycle) and logic folding
- Area-delay trade-off flexibility
- More than an order of magnitude increase in logic density
- More than an order of magnitude reduction in area-time product
- Comparisons assume NRAMs and CMOS logic implemented in the same technology
- Non-volatility useful in low-power, secure processing
[Figure: NATURE summary diagram. Labels: CMOS fabrication compatible, NRAM-based, run-time reconfiguration, temporal logic folding, logic density, design flexibility]
5Overview of NATURE (Contd.)
- Challenges in nano-circuits/architectures
- Many programmable nanofabrics proposed: nanowire PLA (DeHon, 2004), CMOL (Strukov, 2005), etc.
- Lack of a mature fabrication process
- Fabrication defects and run-time failures (between 1% and 10%)
- Regular, reconfigurable architectures, such as an FPGA, favored
- Facilitates fabrication
- Fault tolerance through reconfiguration
- NATURE fabricatable using a CMOS-compatible fabrication process
6NRAM™ by Nantero
Source: http://www.nantero.com/nram.html
- Non-volatile nanotube random-access memory (NRAM)
- Whether a nanotube is mechanically bent or not determines the bistable on/off states
- Same/opposite voltage applied to change the state
- CMOS-compatible fabrication process
- 10 Gbit NRAMs already fabricated; ready to be commercialized in the near future
7NRAMs
- Properties of NRAMs
- Non-volatile
- Similar speed to SRAM
- Similar density to DRAM
- Chemically and mechanically stable
- NATURE not tied to NRAMs
- Phase change RAM
- Magnetoresistive RAM
- Ferroelectric RAM
8Architecture of NATURE
- Island-style logic blocks (LBs) connected by various levels of interconnects
- An LB contains a super macroblock (SMB) and a local switch matrix
9Architecture of a Super Macroblock (SMB)
- n1 macroblocks (MBs) comprise an SMB; here n1 = 4
10Architecture of a Macroblock (MB)
- n2 logic elements (LEs) comprise an MB; here n2 = 4
11Logic Element (Basic Configuration)
- An LE implements a computation and contains
- An m-input look-up table (LUT)
- l flip-flops
- Input to flip-flop selected between LUT output
and a primary input
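The SMB / MB / LE hierarchy can be summarized as a small parameter record. This is only a sketch: the class name `NatureInstance` and its method names are hypothetical, and the default values (n1 = 4, n2 = 4, a 4-input LUT, 2 flip-flops per LE) are the instance used in the experimental setup of this talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NatureInstance:
    n1: int = 4  # macroblocks (MBs) per super macroblock (SMB)
    n2: int = 4  # logic elements (LEs) per MB
    m: int = 4   # inputs per look-up table (LUT)
    l: int = 2   # flip-flops per LE

    def les_per_smb(self) -> int:
        return self.n1 * self.n2

    def flip_flops_per_smb(self) -> int:
        return self.les_per_smb() * self.l

    def lut_truth_table_bits(self) -> int:
        # An m-input LUT stores one output bit per input combination.
        return 2 ** self.m


arch = NatureInstance()
summary = (arch.les_per_smb(), arch.flip_flops_per_smb(), arch.lut_truth_table_bits())
```

With these defaults one SMB offers 16 LEs, 32 flip-flops, and 16 truth-table bits per LUT for each configuration it holds.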
12Folding Levels
- Logic folding at different levels of granularity, providing flexibility to perform area-delay trade-offs
- Level-p folding: LE reconfiguration after the execution of p LUT computations
- Reconfiguration time: 160 ps
- Larger folding level: typically delay decreases, area increases
[Figure: (a) level-1 folding; (b) level-2 folding]
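A rough sketch of why a larger folding level tends to lower delay and raise area. It assumes a simplified model that the slide does not state: a folding cycle costs p LUT delays plus one 160 ps reconfiguration, LUTs spread evenly over the folding stages, and interconnect delay is ignored. The function name and the 500 ps LUT delay are hypothetical.

```python
import math

RECONFIG_PS = 160  # reconfiguration time quoted on the slide


def folding_tradeoff(depth, total_luts, level, lut_delay_ps):
    """Return (folding stages, total delay in ps, LUTs needed per stage)."""
    stages = math.ceil(depth / level)
    cycle_ps = level * lut_delay_ps + RECONFIG_PS
    delay_ps = stages * cycle_ps
    luts_per_stage = math.ceil(total_luts / stages)  # optimistic: perfect balance
    return stages, delay_ps, luts_per_stage


# Hypothetical circuit: logic depth 8, 64 LUTs, 500 ps per LUT level.
table = {p: folding_tradeoff(8, 64, p, 500) for p in (1, 2, 4)}
```

Under these assumptions, moving from level-1 to level-4 folding cuts the number of reconfigurations from 8 to 2 while quadrupling the LUTs that must be resident at once.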
13Design Optimization Flow NanoMap
- Optimize and implement design on NATURE
- Integrate temporal logic folding
- Choose a proper folding level
- Use force-directed scheduling (FDS) technique to balance resource usage across folding cycles
- Input design specified in register-transfer level (RTL) and/or gate-level VHDL
14Motivational Example
[Figure: planes and folding stages. Labels: level 1 register, logic in plane, folding stage, plane cycle, folding cycle, plane, level 2 register]
- Different planes should have the same number of folding stages to guarantee global synchronization
- Key issue: how to achieve the optimization objective
- Choose an appropriate folding level
- Assign the logic to folding stages
15Motivational Example (Contd.)
[Figure: example RTL circuit. Labels: 8 LUTs, logic depth 4; 38 LUTs, logic depth 7; 50 LUTs, 14 flip-flops; plane depth 9]
- Example optimization objective
- Minimize circuit delay under an area constraint of 32 LEs
- Assume each LE contains one LUT and two flip-flops: 32 LEs provide 32 LUTs and 64 flip-flops
16Iterative Design Flow
- Start with an initial guess for the folding level and iteratively refine it
- Large folding level -> better circuit delay, but large area cost
- Initial folding stages
- Initial folding levels
- Partition RTL modules into a series of connected LUT clusters
- Logic depth of each cluster at most equal to the folding level
- Significantly speeds up the mapping procedure
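The formulas for the initial guess are not reproduced in this transcript. The sketch below assumes one plausible form: enough folding stages to fit all LUTs into the available LEs, then the smallest folding level that covers the plane depth in that many stages. The function name is hypothetical; the inputs (50 LUTs, 32 LEs, plane depth 9) are the numbers from the motivational example.

```python
import math


def initial_guess(total_luts, available_les, plane_depth, luts_per_le=1):
    """Assumed initial estimate of (folding stages, folding level)."""
    stages = math.ceil(total_luts / (available_les * luts_per_le))
    level = math.ceil(plane_depth / stages)
    return stages, level


stages, level = initial_guess(total_luts=50, available_les=32, plane_depth=9)
```

For the example this gives two folding stages and level-5 folding as the starting point for refinement.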
17Iterative Design Flow (Contd.)
- Cluster size should be smaller than the area constraint
- Level-5 folding: 34 LUTs > 32 LUTs
- Level-4 folding
18Solution for the Example
- Three folding stages using level-4 folding
- 32 LEs required for mapping the RTL circuit: area constraint satisfied
- Circuit delay = 3 x folding cycle delay
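The refinement in this example can be sketched as a loop that lowers the folding level until the mapping fits the area constraint. This is an illustration, not the NanoMap implementation: `refine_folding_level` is a hypothetical name, and the LE requirement per level is supplied as a lookup holding only the two values given in the example (34 at level 5, 32 at level 4).

```python
import math


def refine_folding_level(start_level, plane_depth, area_limit, les_needed):
    """Decrease the folding level until the area constraint is met.

    les_needed: callable mapping a folding level to the LEs it requires.
    Returns (level, folding stages), or None if no level fits.
    """
    for level in range(start_level, 0, -1):
        if les_needed(level) <= area_limit:
            return level, math.ceil(plane_depth / level)
    return None


les_by_level = {5: 34, 4: 32}  # values from the example
solution = refine_folding_level(5, 9, 32, lambda p: les_by_level[p])
```

The loop rejects level 5 and stops at level 4 with three folding stages, so the circuit delay is three folding cycles.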
19NanoMap Flow Diagram
[Figure: NanoMap flow diagram with 16 numbered steps. Inputs: input network, optimization objective, user constraint, module library. Logic mapping: circuit parameter search, folding level computation, RTL module partition, decision "Perform logic folding?", schedule each LUT/LUT cluster using FDS. Temporal clustering: map each LUT/LUT cluster to SMBs, decision "Satisfy area constraints?". Temporal placement: fast placement using modified VPR placer, decision "Placement routable?", delay estimation, decisions "Satisfy delay constraints?" and "Refine placement?", final placement using modified VPR placer. Routing: final routing using VPR router. Output: reconfiguration bits]
20Force-Directed Scheduling
- Perform FDS on RTL modules partitioned into LUTs/LUT clusters
- Iteratively schedule each LUT/LUT cluster to minimize overall resource usage
- Model resource usage as a force: F = Kx
- K: distribution graphs (DGs) that describe the probability of resource usage
- Aim of FDS: minimize force, indicating minimum increase in resource usage
- LE usage depends on LUT computations and register storage operations: two DGs needed
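A simplified sketch of the force model, in the spirit of classic force-directed scheduling. It assumes each LUT has a time frame (earliest, latest folding stage) with uniform probability inside it, takes x as the change in that probability when the LUT is pinned to one stage, and uses only the self-force. Predecessor/successor forces, dependency updates, and the second (register storage) DG are omitted. All names and the example frames are hypothetical.

```python
def distribution_graph(frames, num_stages):
    """Sum, per folding stage, the probability that each LUT occupies it."""
    dg = [0.0] * (num_stages + 1)  # index 0 unused; stages are 1-based
    for lo, hi in frames.values():
        prob = 1.0 / (hi - lo + 1)
        for stage in range(lo, hi + 1):
            dg[stage] += prob
    return dg


def self_force(node, stage, frames, dg):
    """Force of pinning `node` to `stage`: sum over its frame of DG * x."""
    lo, hi = frames[node]
    prob = 1.0 / (hi - lo + 1)
    return sum(
        dg[s] * ((1.0 if s == stage else 0.0) - prob) for s in range(lo, hi + 1)
    )


def schedule(frames, num_stages):
    """Repeatedly pin the (LUT, stage) pair with the smallest force."""
    frames = dict(frames)
    while True:
        mobile = sorted(n for n, (lo, hi) in frames.items() if lo != hi)
        if not mobile:
            break
        dg = distribution_graph(frames, num_stages)
        _, node, stage = min(
            (self_force(n, s, frames, dg), n, s)
            for n in mobile
            for s in range(frames[n][0], frames[n][1] + 1)
        )
        frames[node] = (stage, stage)
    return {n: lo for n, (lo, _) in frames.items()}


# Hypothetical frames: A and B are fixed in stage 1, C and D can go in 1 or 2.
example = {"A": (1, 1), "B": (1, 1), "C": (1, 2), "D": (1, 2)}
assignment = schedule(example, num_stages=2)
```

In this example the negative force pulls both mobile LUTs into the emptier second stage, leaving two LUTs per stage.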
21Temporal Clustering
- For each folding stage, a constructive algorithm is used to assign LUTs to LEs and pack LEs into MBs and SMBs
- Unpacked LUT with a maximal number of inputs selected as the initial seed
- New LUTs with high attractions to the seed selected and assigned to the SMB
- Attractions depend on timing criticality and input pin sharing
- Considers attractions across all the folding cycles
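A minimal sketch of the seed-and-attract packing for a single folding stage. The attraction formula here, a weighted sum of a per-LUT criticality score and the number of input nets shared with the cluster, is an assumption, as are the function name, the weight `alpha`, and the example netlist. Attractions across other folding cycles are not modeled.

```python
def cluster_luts(lut_inputs, capacity, criticality=None, alpha=0.5):
    """Greedily pack LUTs into clusters of at most `capacity` LUTs.

    lut_inputs: dict mapping LUT name -> set of input net names.
    criticality: optional dict mapping LUT name -> timing criticality score.
    """
    criticality = criticality or {}
    unpacked = set(lut_inputs)
    clusters = []
    while unpacked:
        # Seed: the unpacked LUT with the most inputs (ties broken by name).
        seed = max(sorted(unpacked), key=lambda n: len(lut_inputs[n]))
        unpacked.remove(seed)
        cluster, nets = [seed], set(lut_inputs[seed])
        while unpacked and len(cluster) < capacity:
            def attraction(name):
                shared = len(nets & lut_inputs[name])
                return alpha * criticality.get(name, 0.0) + (1 - alpha) * shared
            best = max(sorted(unpacked), key=attraction)
            unpacked.remove(best)
            cluster.append(best)
            nets |= lut_inputs[best]
        clusters.append(cluster)
    return clusters


# Hypothetical netlist: u and v share nets a, b; z and w share nets x, y.
netlist = {
    "u": {"a", "b", "c", "d"},
    "v": {"a", "b", "e"},
    "w": {"x", "y"},
    "z": {"x", "y", "q"},
}
clusters = cluster_luts(netlist, capacity=2)
```

LUTs that share input pins end up in the same cluster, which reduces the number of distinct nets that must be routed into it.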
22Placement and Routing
- VPR (U. Toronto) modified to perform placement and support temporal logic folding
- Simulated annealing approach
- Cost function computed across the folding stages
- Routing using VPR router performed hierarchically, considering direct link, length-1, length-4 and global interconnects
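One way to picture a cost computed across folding stages: the SMB locations are shared by all stages, while each stage has its own set of nets. The sketch below sums half-perimeter wirelength, a common placement cost, over every stage. It is an assumption for illustration and not necessarily the cost function of the modified VPR placer; all names and coordinates are hypothetical.

```python
def half_perimeter(net, positions):
    """Half-perimeter of the bounding box around the blocks on one net."""
    xs = [positions[block][0] for block in net]
    ys = [positions[block][1] for block in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def placement_cost(nets_per_stage, positions):
    """Total wirelength estimate summed over all folding stages."""
    return sum(
        half_perimeter(net, positions)
        for stage_nets in nets_per_stage
        for net in stage_nets
    )


# Hypothetical placement of three SMBs and the nets active in two stages.
positions = {"S0": (0, 0), "S1": (1, 0), "S2": (0, 2)}
nets_per_stage = [
    [("S0", "S1")],                # folding stage 1
    [("S0", "S2"), ("S1", "S2")],  # folding stage 2
]
cost = placement_cost(nets_per_stage, positions)
```

A simulated annealing placer would evaluate such a cost after each candidate swap, so a move that helps one folding stage but hurts another is judged on the combined effect.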
23Experimental Setup
- Instance of architecture
- 4 MBs in an SMB
- 4 LEs in an MB
- LEs contain a 4-input LUT and 2 flip-flops
- Impact of fixing k at 16 vs. allowing a high enough k to show design trade-offs
- Results based on 100 nm technology parameters to implement CMOS logic and NRAMs
24Experimental Results (Contd.)
25Experimental Results (Contd.)
Improvement under AT optimization for RTL benchmarks
            Reduction in LEs   Maximum AT improvement   Average AT improvement   Circuit delay increase
k enough    14.8X              16.2X                    11.0X                    31.8%
k = 16      9.2X               9.3X                     7.8X                     19.4%
- LE utilization around 100%
- 50% reduced need for a deep interconnect hierarchy for level-1 folding vs. no folding indicates that trading interconnect area for NRAM area is advantageous
26Experimental Results (Contd.)
- Flexibility in choosing the best folding level and performing area-delay trade-offs
- Mapping results for typical optimizations using the Paulin benchmark as an example
Typical optimizations
          Opt. obj.   Area const. (LEs)   Delay const. (ns)   Folding level
Case 1    AT          No                  No                  1
Case 2    Delay       No                  No                  No folding
Case 3    Area        No                  27                  4
Case 4    Delay       210                 No                  3
27Conclusions
- NATURE: a new high-performance run-time reconfigurable architecture
- NanoMap: an integrated design optimization flow for NATURE
- Introduction of NRAMs into the architecture enables cycle-by-cycle reconfiguration and logic folding, leading to significant logic density and area-time product advantages
- Can be very useful for cost-conscious embedded systems and improvement of future FPGAs
- Non-volatility helpful in secure and low-power processing