NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture

1
NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture

Wei Zhang, Li Shang and Niraj K. Jha
Dept. of Electrical Engineering, Princeton University
Dept. of Electrical and Computer Engineering, Queen's University

2
Outline

Temporal Logic Folding
Background on NRAMs
Overview of the hybrid NAnoTUbe/CMOS REconfigurable architecture (NATURE) (DAC 2006)
NanoMap Design Optimization Flow
Experimental Results
Conclusions

3
Temporal Logic Folding

Basic idea: use run-time reconfiguration to realize different functions in the same resource every few cycles
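As a minimal sketch of this idea, the Python below reuses a single 2-input LUT for two different functions in consecutive cycles, holding the intermediate value in a register. The target function (a AND b) XOR c, the names `lut2` and `folded_eval`, and the two truth tables are illustrative assumptions and do not come from the slides.

```python
# Illustrative only: one physical 2-input LUT, reconfigured between cycles.
AND = [0, 0, 0, 1]  # truth table indexed by (x << 1) | y
XOR = [0, 1, 1, 0]

def lut2(truth_table, x, y):
    """Evaluate a 2-input LUT whose contents are given by truth_table."""
    return truth_table[(x << 1) | y]

def folded_eval(a, b, c):
    """Compute (a AND b) XOR c with one LUT over two folding cycles."""
    configs = [AND, XOR]              # one stored configuration per cycle
    reg = lut2(configs[0], a, b)      # cycle 1: LUT configured as AND
    return lut2(configs[1], reg, c)   # cycle 2: same LUT reconfigured as XOR
```

A non-folded design would need two LUTs for the same function; here the second LUT is traded for a second configuration set and one extra cycle.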

[Figure: temporal logic folding example with LUT 1, LUT 2, LUT 3 and a memory (MEM). Signal labels as extracted, operators lost: i abc; l (Ief)h; OUT dgl]

4
Overview of NATURE

Distributed non-volatile nanotube RAMs (NRAMs): main storage for reconfiguration bits
Fine-grain reconfiguration (even cycle-by-cycle) and logic folding
Area-delay trade-off flexibility
More than an order of magnitude increase in logic density
More than an order of magnitude reduction in area-time product
Comparisons assume NRAMs and CMOS logic implemented in the same technology
Non-volatility useful in low-power and secure processing

[Figure labels: CMOS fabrication compatible; NRAM-based; NATURE; Run-time reconfiguration; Temporal logic folding; Logic density; Design flexibility]

5
Overview of NATURE (Contd.)

Challenges in nano-circuits/architectures:
Many programmable nanofabrics proposed: nanowire PLA (DeHon, 2004), CMOL (Strukov, 2005), etc.
Lack of a mature fabrication process
Fabrication defects and run-time failures (between 1% and 10%)
Regular, reconfigurable architectures, such as an FPGA, favored:
Facilitates fabrication
Fault tolerance through reconfiguration
NATURE fabricatable using a CMOS-compatible fabrication process

6
NRAM™ by Nantero
Source: http://www.nantero.com/nram.html

Non-volatile nanotube random-access memory (NRAM)
Whether a nanotube is mechanically bent or not determines its bistable on/off state
Same/opposite voltage applied to change the state
CMOS-compatible fabrication process
10 Gbit NRAMs already fabricated; ready to be commercialized in the near future

7
NRAMs

Properties of NRAMs:
Non-volatile
Similar speed to SRAM
Similar density to DRAM
Chemically and mechanically stable
NATURE not tied to NRAMs; alternatives:
Phase change RAM
Magnetoresistive RAM
Ferroelectric RAM

8
Architecture of NATURE

Island-style logic blocks (LBs) connected by various levels of interconnects
An LB contains a super macroblock (SMB) and a local switch matrix

9
Architecture of a Super Macroblock (SMB)

n1 macroblocks (MBs) comprise an SMB; here n1 = 4

10
Architecture of a Macroblock (MB)

n2 logic elements (LEs) comprise an MB; here n2 = 4

11
Logic Element (Basic Configuration)

An LE implements a computation and contains:
An m-input look-up table (LUT)
l flip-flops
Input to each flip-flop selected between the LUT output and a primary input
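The LE description above can be sketched as a small behavioral model. This is only an illustration: the class name `LogicElement`, its method names, and the default of two flip-flops are assumptions made here, not the authors' implementation. The hierarchy constants use the n1 = 4 and n2 = 4 values given on the SMB and MB slides.

```python
from dataclasses import dataclass, field

N1_MBS_PER_SMB = 4                            # n1 from the SMB slide
N2_LES_PER_MB = 4                             # n2 from the MB slide
LES_PER_SMB = N1_MBS_PER_SMB * N2_LES_PER_MB  # LEs available in one SMB

@dataclass
class LogicElement:
    """Hypothetical behavioral model: an m-input LUT plus l flip-flops."""
    m: int
    truth_table: list                          # 2**m output bits
    flip_flops: list = field(default_factory=lambda: [0, 0])  # assumes l = 2

    def __post_init__(self):
        if len(self.truth_table) != 2 ** self.m:
            raise ValueError("truth table must have 2**m entries")

    def lut(self, inputs):
        """Look up the output for a list of m input bits (MSB first)."""
        index = 0
        for bit in inputs:
            index = (index << 1) | bit
        return self.truth_table[index]

    def clock(self, ff, inputs, primary_in, use_lut):
        """Load flip-flop ff from the LUT output or from a primary input."""
        self.flip_flops[ff] = self.lut(inputs) if use_lut else primary_in
        return self.flip_flops[ff]
```

For example, `LogicElement(m=2, truth_table=[0, 0, 0, 1])` behaves as a 2-input AND whose result can be latched with `clock(0, [1, 1], primary_in=0, use_lut=True)`.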

12
Folding Levels

Logic folding at different levels of granularity, providing flexibility to perform area-delay trade-offs
Level-p folding: LE reconfiguration after the execution of p LUT computations
Reconfiguration time: 160 ps
Larger folding level: typically delay decreases, area increases
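The trade-off can be illustrated with a rough model, given here as a sketch under stated assumptions rather than the authors' delay model. It assumes that a folding cycle costs p LUT delays plus one 160 ps reconfiguration, that a circuit of a given logic depth needs ceil(depth / p) folding stages, and that LUTs are spread evenly over the stages. The function name `folding_estimate`, the 500 ps LUT delay, and the 48-LUT, depth-8 circuit are hypothetical values chosen for illustration.

```python
import math

RECONFIG_PS = 160  # reconfiguration time from the slide

def folding_estimate(num_luts, depth, level, lut_delay_ps):
    """Return (folding stages, total delay in ps, lower bound on LEs)."""
    stages = math.ceil(depth / level)              # cycles needed at level-p
    cycle_ps = level * lut_delay_ps + RECONFIG_PS  # assumed folding cycle time
    les = math.ceil(num_luts / stages)             # assumes perfect balancing
    return stages, stages * cycle_ps, les

# Hypothetical circuit: 48 LUTs, logic depth 8, 500 ps per LUT (assumed)
for level in (1, 2, 4):
    print(level, folding_estimate(48, 8, level, 500))
```

Under these assumptions the delay falls from 5280 ps (level 1) to 4320 ps (level 4) while the LE lower bound rises from 6 to 24, which matches the direction of the trade-off stated on the slide.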

[Figure: (a) level-1 folding; (b) level-2 folding]

13
Design Optimization Flow: NanoMap

Optimize and implement design on NATURE
Integrate temporal logic folding
Choose a proper folding level
Use force-directed scheduling (FDS) technique to balance resource usage across folding cycles
Input design specified in register-transfer level (RTL) and/or gate-level VHDL

14
Motivational Example

[Figure labels: Level 1 register; Level 2 register; Plane; Logic in plane; Folding stage; Folding cycle; Plane cycle]

Different planes should have the same number of folding stages to guarantee global synchronization
Key issue: how to achieve the optimization objective
Choose an appropriate folding level
Assign the logic to folding stages

15
Motivational Example (Contd.)

[Figure labels: 8 LUTs, logic depth 4; 38 LUTs, logic depth 7; 50 LUTs, 14 flip-flops; plane depth 9]

Example optimization objective:
Minimize circuit delay under an area constraint of 32 LEs
Assume each LE contains one LUT and two flip-flops: 32 LEs provide 32 LUTs and 64 flip-flops

16
Iterative Design Flow

Start with an initial guess for the folding level and iteratively refine it
Larger folding level -> better circuit delay, but larger area cost
Initial folding stages
Initial folding levels
Partition RTL modules into a series of connected LUT clusters
Logic depth of each cluster at most equal to the folding level
Significantly speeds up the mapping procedure
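The formulas for the initial folding stages and initial folding level did not survive in this transcript. The sketch below assumes one plausible reading: the number of stages is the LUT count divided by the available LEs, rounded up, and the folding level is the logic depth divided by that stage count, rounded up. The function name `initial_guess` is hypothetical, flip-flop demand is ignored, and the inputs are the motivational example's figures (50 LUTs, plane depth 9, 32 LEs).

```python
import math

def initial_guess(num_luts, logic_depth, available_les):
    """Assumed initial estimate; not confirmed by the slide text."""
    stages = math.ceil(num_luts / available_les)  # fewest stages that could fit
    level = math.ceil(logic_depth / stages)       # LUT levels per folding stage
    return stages, level

stages, level = initial_guess(num_luts=50, logic_depth=9, available_les=32)
print(stages, level)  # prints: 2 5
```

Under this reading the flow would start from level-5 folding with two folding stages, which is consistent with the level-5 label in the next slide's figure.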

17
Iterative Design Flow (Contd.)

Cluster size should not exceed the area constraint

[Figure labels: Level-5 folding; Level-4 folding; 34 LUTs > 32 LUTs]

18
Solution for the Example

Three folding stages using level-4 folding
32 LEs required for mapping the RTL circuit: area constraint satisfied
Circuit delay = 3 × folding cycle delay
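The refinement that leads to this solution can be sketched as a simple loop that lowers the folding level until the area constraint holds. This is an illustration, not NanoMap's actual search: `refine_folding_level` and `required_les` are hypothetical names, and the lookup table stands in for the real partitioning and scheduling steps. Its entries come from the example (32 LEs at level 4) and from reading the 34-LUT annotation in the preceding figure as belonging to level-5 folding, which is an assumption.

```python
def refine_folding_level(initial_level, required_les, area_limit):
    """Lower the folding level until the mapping fits within area_limit LEs."""
    level = initial_level
    while level >= 1:
        if required_les(level) <= area_limit:
            return level
        level -= 1   # smaller level: more folding stages, fewer LEs per stage
    return None      # no folding level satisfies the constraint

# Stand-in for partitioning plus scheduling; values taken from the example
les_needed = {5: 34, 4: 32}
chosen = refine_folding_level(5, lambda lvl: les_needed[lvl], area_limit=32)
print(chosen)  # prints: 4
```

With plane depth 9, level-4 folding gives ceil(9 / 4) = 3 folding stages, matching the three stages reported on the slide.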

19
NanoMap Flow Diagram
[Figure: NanoMap flow diagram with 16 numbered steps; step order and arrows are not recoverable from the transcript. Inputs: input network, circuit parameter library (wording uncertain), user constraint, optimization objective. Logic mapping: folding level computation; RTL module partition; perform logic folding?; schedule each LUT/LUT cluster using FDS; temporal clustering (map each LUT/LUT cluster to SMBs); satisfy area constraints? Temporal placement: fast placement using modified VPR placer; placement routable?; delay estimation; satisfy delay constraints?; refine placement?; final placement using modified VPR placer. Routing: final routing using VPR router. Output: reconfiguration bits]

20
Force-Directed Scheduling

Perform FDS on RTL modules partitioned into LUTs/LUT clusters
Iteratively schedule each LUT/LUT cluster to minimize overall resource usage
Model resource usage as a force: F = Kx
K: distribution graphs (DGs) that describe the probability of resource usage
Aim of FDS: minimize force, indicating minimum increase in resource usage
LE usage depends on LUT computations and register storage operations: two DGs needed
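A minimal sketch of the force calculation follows, using the classic force-directed scheduling formulation in which x is the change in an operation's scheduling probability. It is simplified on purpose: only the self-force is computed, only one DG (LUT computations) is modeled, and predecessor/successor forces and the register-storage DG are left out. The function names and the three-LUT example with its time frames are assumptions for illustration.

```python
from fractions import Fraction

def distribution_graph(frames, num_steps):
    """Sum each LUT's uniform scheduling probability per folding cycle."""
    dg = [Fraction(0)] * num_steps
    for lo, hi in frames.values():
        p = Fraction(1, hi - lo + 1)      # uniform over the time frame
        for t in range(lo, hi + 1):
            dg[t] += p
    return dg

def self_force(frames, dg, op, step):
    """F = K * x, with K = DG value and x = change in probability."""
    lo, hi = frames[op]
    p = Fraction(1, hi - lo + 1)
    return sum(dg[t] * ((1 if t == step else 0) - p)
               for t in range(lo, hi + 1))

def best_step(frames, num_steps, op):
    """Pick the folding cycle with the lowest self-force for op."""
    dg = distribution_graph(frames, num_steps)
    lo, hi = frames[op]
    return min(range(lo, hi + 1), key=lambda s: self_force(frames, dg, op, s))

# Hypothetical time frames (earliest, latest folding cycle) for three LUTs
frames = {"a": (0, 0), "b": (0, 1), "c": (1, 2)}
print(best_step(frames, 3, "b"))  # prints: 1
```

LUT b is pushed to cycle 1 because cycle 0 is already fully claimed by LUT a, which is the balancing behavior the slide describes.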

21
Temporal Clustering

For each folding stage, a constructive algorithm is used to assign LUTs to LEs and pack LEs into MBs and SMBs
Unpacked LUT with a maximal number of inputs selected as initial seed
New LUTs with high attractions to the seed selected and assigned to the SMB
Attractions depend on timing criticality and input pin sharing
Considers attractions across all the folding cycles
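The seed-and-attract procedure can be sketched as a greedy packer. This is a simplified illustration and not the authors' algorithm: attraction here counts shared input nets only, while timing criticality and attraction across folding cycles are omitted. The function name `cluster_luts`, the capacity of 2, and the example netlist are hypothetical.

```python
def cluster_luts(lut_inputs, capacity):
    """Greedily pack LUTs into clusters of at most `capacity` LUTs.

    lut_inputs maps a LUT name to the set of nets feeding it.
    """
    unpacked = dict(lut_inputs)
    clusters = []
    while unpacked:
        # Seed: the unpacked LUT with the most inputs (ties broken by name)
        seed = max(sorted(unpacked), key=lambda n: len(unpacked[n]))
        cluster = [seed]
        nets = set(unpacked.pop(seed))
        while unpacked and len(cluster) < capacity:
            # Attraction: number of input nets shared with the cluster so far
            best = max(sorted(unpacked), key=lambda n: len(nets & unpacked[n]))
            cluster.append(best)
            nets |= unpacked.pop(best)
        clusters.append(cluster)
    return clusters

# Hypothetical netlist for one folding stage
luts = {"u": {"a", "b", "c", "d"}, "v": {"a", "b"},
        "w": {"x", "y"}, "z": {"x", "y", "q"}}
print(cluster_luts(luts, capacity=2))  # prints: [['u', 'v'], ['z', 'w']]
```

Packing LUTs that share inputs into the same cluster reduces the number of distinct nets that must enter it, which is the motivation for the input-pin-sharing term.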

22
Placement and Routing

VPR (U. Toronto) modified to perform placement and support temporal logic folding
Simulated annealing approach
Cost function computed across the folding stages
Routing using VPR router performed hierarchically, considering direct link, length-1, length-4 and global interconnects
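One way to picture a cost function computed across folding stages is sketched below. It is not VPR's actual cost function: it uses plain half-perimeter wirelength and assumes each block keeps one physical location while its nets change from stage to stage. The names `hpwl` and `folded_cost`, the coordinates, and the nets are hypothetical.

```python
def hpwl(net, placement):
    """Half-perimeter wirelength of one net under a block placement."""
    xs = [placement[block][0] for block in net]
    ys = [placement[block][1] for block in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def folded_cost(nets_per_stage, placement):
    """Sum the wirelength of every net in every folding stage."""
    return sum(hpwl(net, placement)
               for nets in nets_per_stage
               for net in nets)

# Hypothetical placement of three SMBs and their nets in two folding stages
placement = {"A": (0, 0), "B": (2, 1), "C": (0, 3)}
nets_per_stage = [
    [["A", "B"]],              # stage 0
    [["A", "C"], ["B", "C"]],  # stage 1
]
print(folded_cost(nets_per_stage, placement))  # prints: 10
```

A simulated-annealing placer would evaluate such a cost after each proposed swap, so a move that helps one folding stage but hurts another is judged on the combined total.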

23
Experimental Setup

Instance of architecture:
4 MBs in an SMB
4 LEs in an MB
Each LE contains a 4-input LUT and 2 flip-flops
Impact of fixing k at 16 vs. allowing a high enough k to show design trade-offs
Results based on 100 nm technology parameters to implement CMOS logic and NRAMs

24
Experimental Results (Contd.)

[Results table: only isolated digits survive in the transcript; contents not recoverable]

25
Experimental Results (Contd.)

Improvement under AT optimization for RTL benchmarks

              Reduction in LEs   Maximum AT improvement   Average AT improvement   Circuit delay increase
k enough      14.8X              16.2X                    11.0X                    31.8%
k = 16        9.2X               9.3X                     7.8X                     19.4%

LE utilization around 100%
50% reduced need for a deep interconnect hierarchy for level-1 folding vs. no folding indicates that trading interconnect area for NRAM area is advantageous
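As a rough consistency check on the table, the sketch below derives an area-time improvement from the LE reduction and the delay increase. It assumes that area scales with LE count and that the delay-increase figures (31.8 and 19.4) are percentages. Averages over benchmarks do not compose exactly, so only approximate agreement with the reported average AT improvement should be expected. The function name `implied_at_improvement` is hypothetical.

```python
def implied_at_improvement(le_reduction, delay_increase_pct):
    """AT improvement if area shrinks by le_reduction and delay grows by pct."""
    return le_reduction / (1 + delay_increase_pct / 100)

rows = {
    "k enough": (14.8, 31.8, 11.0),  # LE reduction, delay %, reported avg AT
    "k = 16": (9.2, 19.4, 7.8),
}
for name, (le_red, delay_pct, reported) in rows.items():
    implied = implied_at_improvement(le_red, delay_pct)
    print(f"{name}: implied {implied:.1f}X, reported {reported}X")
```

Both rows land within a few percent of the reported averages, which supports reading the last column as a percentage delay increase.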

26
Experimental Results (Contd.)

Flexibility in choosing the best folding level and performing area-delay trade-offs
Mapping results for typical optimizations using the Paulin benchmark as an example

Typical optimizations
         Opt. obj.   Area const. (LEs)   Delay const. (ns)   Folding level
Case 1   AT          No                  No                  1
Case 2   Delay       No                  No                  No folding
Case 3   Area        No                  27                  4
Case 4   Delay       210                 No                  3

27
Conclusions

NATURE: a new high-performance run-time reconfigurable architecture
NanoMap: an integrated design optimization flow for NATURE
Introduction of NRAMs into the architecture enables cycle-by-cycle reconfiguration and logic folding, leading to significant logic density and area-time product advantages
Can be very useful for cost-conscious embedded systems and improvement of future FPGAs
Non-volatility helpful in secure and low-power processing