NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture

1
NanoMap: An Integrated Design Optimization Flow
for a Hybrid Nanotube/CMOS
Dynamically Reconfigurable
Architecture
  • Wei Zhang, Li Shang and Niraj K. Jha
  • Dept. of Electrical Engineering, Princeton
    University
  • Dept. of Electrical and Computer Engineering,
    Queen's University

2
Outline
  • Temporal Logic Folding
  • Background on NRAMs
  • Overview of the hybrid NAnoTUbe/CMOS REconfigurable
    architecture (NATURE) (DAC 2006)
  • NanoMap Design Optimization Flow
  • Experimental Results
  • Conclusions

3
Temporal Logic Folding
  • Basic idea: use run-time reconfiguration to
    realize different functions in the same resource
    every few cycles

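[Figure: temporal logic folding example in which LUT 1, LUT 2 and LUT 3 are realized in turn on a shared resource, with configurations held in memory (MEM); the signal equations are not recoverable from the transcript]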
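The basic idea can be illustrated with a minimal sketch, assuming a single physical LUT that is reloaded with a different function in each folding cycle while flip-flops hold the intermediate values. The three functions and the names `run_folded`, `CONFIGS`, `t1` and `t2` are hypothetical and are not taken from the slide's figure.

```python
from typing import Callable, Dict, List, Tuple

# One configuration: (LUT function, input signal names, output signal name).
Config = Tuple[Callable[..., int], List[str], str]


def run_folded(configs: List[Config], inputs: Dict[str, int]) -> Dict[str, int]:
    """Run one configuration per folding cycle on the same LUT resource."""
    signals = dict(inputs)  # stands in for flip-flops holding inputs and intermediates
    for func, in_names, out_name in configs:  # one reconfiguration per cycle
        signals[out_name] = func(*(signals[name] for name in in_names))
    return signals


# Hypothetical three-level logic chain folded onto one LUT.
CONFIGS: List[Config] = [
    (lambda a, b: a & b, ["a", "b"], "t1"),
    (lambda t1, c: t1 | c, ["t1", "c"], "t2"),
    (lambda t2, d: t2 ^ d, ["t2", "d"], "out"),
]

result = run_folded(CONFIGS, {"a": 1, "b": 1, "c": 0, "d": 1})
print(result["out"])
```

Three LUT computations are carried out with one LUT's worth of area, at the cost of three cycles instead of one combinational pass.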
4
Overview of NATURE
  • Distributed non-volatile nanotube RAMs (NRAMs):
    main storage for reconfiguration bits
  • Fine-grain reconfiguration (even cycle-by-cycle)
    and logic folding
  • Area-delay trade-off flexibility
  • More than an order of magnitude increase in logic
    density
  • More than an order of magnitude reduction in
    area-time product
  • Comparisons assume NRAMs and CMOS logic
    implemented in the same technology
  • Non-volatility useful in low-power and secure
    processing

[Diagram labels: NRAM-based, CMOS fabrication compatible, NATURE, run-time reconfiguration, temporal logic folding, logic density, design flexibility]
5
Overview of NATURE (Contd.)
  • Challenges in nano-circuits/architectures:
  • Many programmable nanofabrics proposed: nanowire
    PLA (DeHon, 2004), CMOL (Strukov, 2005), etc.
  • Lack of a mature fabrication process
  • Fabrication defects and run-time failures
    (between 1% and 10%)
  • Regular, reconfigurable architectures, such as an
    FPGA, favored:
  • Facilitates fabrication
  • Fault tolerance through reconfiguration
  • NATURE fabricatable using CMOS-compatible
    fabrication process

6
NRAM™ by Nantero
Source: http://www.nantero.com/nram.html
  • Non-volatile nanotube random-access memory (NRAM)
  • Whether a nanotube is mechanically bent or not
    determines the bistable on/off state
  • Same/opposite voltage applied to change the state
  • CMOS-compatible fabrication process
  • 10 Gbit NRAMs already fabricated; ready to be
    commercialized in the near future

7
NRAMs
  • Properties of NRAMs:
  • Non-volatile
  • Similar speed to SRAM
  • Similar density to DRAM
  • Chemically and mechanically stable
  • NATURE not tied to NRAMs; alternatives:
  • Phase change RAM
  • Magnetoresistive RAM
  • Ferroelectric RAM

8
Architecture of NATURE
  • Island-style logic blocks (LBs) connected by
    various levels of interconnects
  • An LB contains a super macroblock (SMB) and a
    local switch matrix

9
Architecture of a Super Macroblock (SMB)
  • n1 macroblocks (MBs) comprise an SMB; here n1 = 4

10
Architecture of a Macroblock (MB)
  • n2 logic elements (LEs) comprise an MB; here
    n2 = 4

11
Logic Element (Basic Configuration)
  • An LE implements a computation and contains:
  • An m-input look-up table (LUT)
  • l flip-flops
  • Input to flip-flop selected between LUT output
    and a primary input
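The SMB/MB/LE hierarchy can be captured in a small parameter record. This is a sketch only: the class name `NatureInstance` and its properties are hypothetical, n1 = n2 = 4 come from the slides above, and m = 4 and l = 2 are borrowed from the experimental setup reported later in the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NatureInstance:
    """Hypothetical container for the architecture parameters named in the slides."""
    n1: int = 4  # MBs per SMB
    n2: int = 4  # LEs per MB
    m: int = 4   # LUT inputs (value from the experimental setup)
    l: int = 2   # flip-flops per LE (value from the experimental setup)

    @property
    def les_per_smb(self) -> int:
        return self.n1 * self.n2

    @property
    def flip_flops_per_smb(self) -> int:
        return self.les_per_smb * self.l

    @property
    def lut_truth_table_bits(self) -> int:
        # An m-input LUT stores one output bit per input combination.
        return 2 ** self.m


inst = NatureInstance()
print(inst.les_per_smb, inst.flip_flops_per_smb, inst.lut_truth_table_bits)
```

The truth-table size counts only the LUT contents; bits for flip-flop input selection and interconnect configuration are not modeled here.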

12
Folding Levels
  • Logic folding at different levels of granularity,
    providing flexibility to perform area-delay
    trade-offs
  • Level-p folding: LE reconfiguration after the
    execution of p LUT computations
  • Reconfiguration time: 160 ps
  • Larger folding level: typically delay decreases,
    area increases

(a) level-1 folding
(b) level-2 folding
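The area-delay trend can be sketched with a rough first-order model. Everything except the 160 ps reconfiguration time is an assumption: the LUT delay of 400 ps is a placeholder value, LUTs are assumed to spread evenly over the folding stages, interconnect delay is ignored, and the function name `estimate` is hypothetical.

```python
import math

T_RECONF_PS = 160  # reconfiguration time quoted on the slide


def estimate(depth: int, num_luts: int, p: int, t_lut_ps: int):
    """Return (folding stages, LEs needed, delay in ps) for level-p folding."""
    stages = math.ceil(depth / p)                   # p LUT levels per folding stage
    les = math.ceil(num_luts / stages)              # optimistic: perfectly balanced stages
    delay_ps = stages * (p * t_lut_ps + T_RECONF_PS)
    return stages, les, delay_ps


# Hypothetical circuit: logic depth 8, 64 LUTs, assumed 400 ps per LUT level.
for p in (1, 2, 4):
    print(p, estimate(8, 64, p, 400))
```

Under these assumptions, moving from level-1 to level-4 folding cuts the number of reconfigurations and therefore the delay, while the number of LEs needed at once grows.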
13
Design Optimization Flow: NanoMap
  • Optimize and implement design on NATURE
  • Integrate temporal logic folding
  • Choose a proper folding level
  • Use force-directed scheduling (FDS) technique to
    balance resource usage across folding cycles
  • Input design specified in register-transfer level
    (RTL) and/or gate-level VHDL

14
Motivational Example
[Figure labels: level-1 register, logic in plane, folding stage, plane cycle, folding cycle, plane, level-2 register]
  • Different planes should have the same number of
    folding stages to guarantee global
    synchronization
  • Key issue: how to achieve the optimization
    objective
  • Choose an appropriate folding level
  • Assign the logic to folding stages

15
Motivational Example (Contd.)
[Figure: example RTL circuit. Annotations: 8 LUTs, logic depth 4; 38 LUTs, logic depth 7; overall 50 LUTs, 14 flip-flops, plane depth 9]
  • Example optimization objective:
  • Minimize circuit delay under an area constraint
    of 32 LEs
  • Assume each LE contains one LUT and two
    flip-flops; 32 LEs provide 32 LUTs and 64
    flip-flops

16
Iterative Design Flow
  • Start with an initial guess for the folding level
    and iteratively refine it
  • Larger folding level -> better circuit delay, but
    larger area cost
  • Initial folding stages
  • Initial folding levels
  • Partition RTL modules into a series of connected
    LUT clusters
  • Logic depth of each cluster at most equal to the
    folding level
  • Significantly speeds up the mapping procedure

17
Iterative Design Flow (Contd.)
  • Cluster size should be smaller than the area
    constraint

[Figure: LUT clusters under level-5 folding and level-4 folding; annotation: 34 LUTs > 32 LUTs]
18
Solution for the Example
  • Three folding stages using level-4 folding
  • 32 LEs required for mapping the RTL circuit; area
    constraint satisfied
  • Circuit delay = 3 × folding cycle delay
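The numbers in this example can be reproduced with a short sketch. The formulas for the initial guess did not survive in the transcript, so the two ceiling expressions below are an assumption chosen because they match the slide values (50 LUTs, plane depth 9, 32 LEs). Reading the "34 LUTs" annotation as the level-5 requirement is likewise an interpretation, and the function names are hypothetical.

```python
import math
from typing import Dict, Tuple


def initial_guess(max_luts: int, max_depth: int, available_les: int) -> Tuple[int, int]:
    """Assumed initial guess: enough stages to fit the area, then the matching level."""
    stages = math.ceil(max_luts / available_les)
    level = math.ceil(max_depth / stages)
    return stages, level


def refine_level(level: int, luts_needed: Dict[int, int], available_les: int) -> int:
    """Lower the folding level until the largest LUT cluster fits the area constraint."""
    while level > 1 and luts_needed[level] > available_les:
        level -= 1
    return level


stages, level = initial_guess(max_luts=50, max_depth=9, available_les=32)
# LUTs required per folding level: 34 and 32 are the values shown in the slides.
final_level = refine_level(level, {5: 34, 4: 32}, available_les=32)
final_stages = math.ceil(9 / final_level)
print(stages, level, final_level, final_stages)
```

The initial guess is level-5 folding with two stages; since that needs more than 32 LUTs, the level drops to 4, which gives the three folding stages of the solution.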

19
NanoMap Flow Diagram
[Flow diagram, steps numbered 1 to 16; box labels recovered from the transcript:]
  • Inputs: input network, optimization objective, user
    constraint
  • Logic mapping: circuit parameter search, module
    library, folding level computation, RTL module
    partition, "Perform logic folding?" decision, schedule
    each LUT/LUT cluster using FDS
  • Temporal clustering: map each LUT/LUT cluster to SMBs,
    "Satisfy area constraints?" decision
  • Temporal placement: fast placement using modified VPR
    placer, "Placement routable?" decision, "Refine
    placement?" decision, delay estimation, "Satisfy delay
    constraints?" decision, final placement using modified
    VPR placer
  • Routing: final routing using VPR router
  • Output: reconfiguration bits
20
Force-Directed Scheduling
  • Perform FDS on RTL modules partitioned into
    LUTs/LUT clusters
  • Iteratively schedule each LUT/LUT cluster to
    minimize overall resource usage
  • Model resource usage as a force: F = Kx
  • K: distribution graphs (DGs) that describe the
    probability of resource usage
  • Aim of FDS: minimize force, indicating minimum
    increase in resource usage
  • LE usage depends on LUT computations and register
    storage operations; two DGs needed
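The force model can be sketched in the style of classic force-directed scheduling, where x is taken as the change in an operation's scheduling probability when it is fixed to one cycle. This is a simplified illustration, not NanoMap's implementation: it uses a single DG for LUT computations, omits the register-storage DG and the predecessor/successor forces, and all names and time frames are hypothetical.

```python
from typing import Dict, List, Tuple

Frame = Tuple[int, int]  # inclusive (earliest, latest) folding cycle for one LUT


def probability(frame: Frame, cycle: int) -> float:
    """Uniform probability of being scheduled in `cycle` within the time frame."""
    lo, hi = frame
    return 1.0 / (hi - lo + 1) if lo <= cycle <= hi else 0.0


def distribution_graph(frames: Dict[str, Frame], num_cycles: int) -> List[float]:
    """Expected LUT usage per folding cycle (the K in F = Kx)."""
    return [sum(probability(f, c) for f in frames.values())
            for c in range(1, num_cycles + 1)]


def self_force(frames: Dict[str, Frame], op: str, target: int, num_cycles: int) -> float:
    """Force of fixing `op` to `target`: sum over cycles of DG(c) * x(c)."""
    dg = distribution_graph(frames, num_cycles)
    force = 0.0
    for c in range(1, num_cycles + 1):
        x = (1.0 if c == target else 0.0) - probability(frames[op], c)
        force += dg[c - 1] * x
    return force


frames = {"lut_a": (1, 1), "lut_b": (1, 2), "lut_c": (2, 3)}
dg = distribution_graph(frames, 3)
best = min((1, 2), key=lambda c: self_force(frames, "lut_b", c, 3))
print(dg, best)
```

Here cycle 1 is already the most heavily used, so placing `lut_b` in cycle 2 yields the lower (negative) force and a flatter usage profile.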

21
Temporal Clustering
  • For each folding stage, a constructive algorithm
    is used to assign LUTs to LEs and pack LEs into
    MBs and SMBs
  • Unpacked LUT with a maximal number of inputs
    selected as initial seed
  • New LUTs with high attractions to the seed
    selected and assigned to the SMB
  • Attractions depend on timing criticality and
    input pin sharing
  • Considers attractions across all the folding
    cycles
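The seed-and-attract procedure can be sketched as a greedy packer. The attraction formula, its weight `alpha`, the name-based tie-breaking and the function names are all assumptions made for illustration; the sketch also handles a single folding stage, whereas the slides state that attractions are considered across all folding cycles.

```python
from typing import Dict, List, Set


def attraction(cluster_pins: Set[str], lut_inputs: Set[str],
               criticality: float, alpha: float = 0.5) -> float:
    """Assumed attraction: weighted mix of timing criticality and input pin sharing."""
    shared = len(cluster_pins & lut_inputs)
    return alpha * criticality + (1 - alpha) * shared / max(len(lut_inputs), 1)


def cluster_stage(luts: Dict[str, Set[str]], crit: Dict[str, float],
                  capacity: int) -> List[List[str]]:
    """Greedily pack the LUTs of one folding stage into clusters of `capacity` LUTs."""
    unpacked = dict(luts)
    clusters: List[List[str]] = []
    while unpacked:
        # Seed: unpacked LUT with the most inputs (ties broken by name).
        seed = max(unpacked, key=lambda n: (len(unpacked[n]), n))
        members = [seed]
        pins = set(unpacked.pop(seed))
        while unpacked and len(members) < capacity:
            best = max(unpacked,
                       key=lambda n: (attraction(pins, unpacked[n], crit[n]), n))
            members.append(best)
            pins |= unpacked.pop(best)
        clusters.append(members)
    return clusters


luts = {"u": {"a", "b", "c", "d"}, "v": {"a", "b"}, "w": {"x", "y"}, "z": {"x"}}
crit = {name: 0.0 for name in luts}
clusters = cluster_stage(luts, crit, capacity=2)
print(clusters)
```

With equal criticality, pin sharing alone decides the packing: `v` joins seed `u` because both of its inputs are already in the cluster, and `z` joins `w` for the same reason.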

22
Placement and Routing
  • VPR (U. Toronto) modified to perform placement
    and support temporal logic folding
  • Simulated annealing approach
  • Cost function computed across the folding stages
  • Routing using VPR router performed
    hierarchically, considering direct link,
    length-1, length-4 and global interconnects
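A placement cost computed across folding stages can be sketched as follows, assuming each SMB keeps one physical location while the nets it participates in change from stage to stage. Plain half-perimeter wirelength stands in for the actual cost function of the modified VPR placer, which the slides do not spell out; the annealing loop is omitted and all names are hypothetical.

```python
from typing import Dict, List, Tuple

Pos = Tuple[int, int]
Net = List[str]  # names of the SMBs connected by one net


def hpwl(net: Net, placement: Dict[str, Pos]) -> int:
    """Half-perimeter of the bounding box enclosing the net's SMBs."""
    xs = [placement[smb][0] for smb in net]
    ys = [placement[smb][1] for smb in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def folded_cost(stage_nets: List[List[Net]], placement: Dict[str, Pos]) -> int:
    """Sum the wirelength of every net in every folding stage for one placement."""
    return sum(hpwl(net, placement) for nets in stage_nets for net in nets)


placement = {"smb0": (0, 0), "smb1": (2, 0), "smb2": (0, 3)}
stage_nets = [
    [["smb0", "smb1"]],                      # nets active in folding stage 0
    [["smb0", "smb2"], ["smb1", "smb2"]],    # nets active in folding stage 1
]
cost = folded_cost(stage_nets, placement)
print(cost)
```

A simulated annealing placer would evaluate such a cost after each proposed swap, so that a move is judged by its effect on all folding stages rather than on a single one.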

23
Experimental Setup
  • Instance of architecture:
  • 4 MBs in an SMB
  • 4 LEs in an MB
  • LEs contain a 4-input LUT and 2 flip-flops
  • Impact of fixing k at 16 vs. allowing a high
    enough k to show design trade-offs
  • Results based on 100 nm technology parameters to
    implement CMOS logic and NRAMs

24
Experimental Results (Contd.)
25
Experimental Results (Contd.)
Improvement under AT optimization for RTL benchmarks

            Reduction   Maximum AT    Average AT    Circuit delay
            in LEs      improvement   improvement   increase
k enough    14.8X       16.2X         11.0X         31.8%
k = 16      9.2X        9.3X          7.8X          19.4%
  • LE utilization around 100%
  • 50% reduced need for a deep interconnect
    hierarchy for level-1 folding vs. no folding
    indicates that trading interconnect area for NRAM
    area is advantageous

26
Experimental Results (Contd.)
  • Flexibility in choosing the best folding level
    and performing area-delay trade-offs
  • Mapping results for typical optimizations using
    Paulin benchmark as an example

Typical optimizations

         Optimization   Area constraint   Delay constraint   Folding
         objective      (LEs)             (ns)               level
Case 1   AT             No                No                 1
Case 2   Delay          No                No                 No folding
Case 3   Area           No                27                 4
Case 4   Delay          210               No                 3
27
Conclusions
  • NATURE: a new high-performance run-time
    reconfigurable architecture
  • NanoMap: an integrated design optimization flow
    for NATURE
  • Introduction of NRAMs into the architecture
    enables cycle-by-cycle reconfiguration and logic
    folding, leading to significant logic density and
    area-time product advantages
  • Can be very useful for cost-conscious embedded
    systems and improvement of future FPGAs
  • Non-volatility helpful in secure and low-power
    processing