Title: Power Optimal Dual-Vdd Buffered Tree Considering Buffer Stations and Blockages
1. Power Optimal Dual-Vdd Buffered Tree Considering Buffer Stations and Blockages
- King Ho Tam and Lei He
- Electrical Engineering Department
- University of California, Los Angeles
- Sponsors: NSF CAREER, UC MICRO (Fujitsu, Intel and Mindspeed), and IBM Faculty Partner Award
2. Motivation
- Increasing interconnect power
- 35% of cells are buffers at 65nm technology [Saxena, TCAD 04]
- Previous work
- Power-optimal single Vdd buffer insertion [Lillis, JSSC 96]
- Delay-optimal buffered tree generation [Cong, DAC 00; Alpert, TCAD 02]
- No existing algorithms consider dual Vdd for buffer insertion or buffered tree generation
3. Major Contributions
- First in-depth study of dual Vdd buffer insertion and buffered tree generation
- Large power saving over single Vdd buffering
- Efficient algorithms for power optimality
- 17x faster than [Lillis, JSSC 96] when single Vdd is considered
4. Outline
- Dual Vdd buffer insertion and sizing (DVB)
- Problem formulation
- Sampling for speedup
- Experimental results
- Dual Vdd buffered tree generation (D-Tree)
- Problem formulation
- Improved augmented orthogonal search tree
- Experimental results
5. Delay, Slew and Power Modeling
- Elmore delay
- Wire delay, buffer delay
- Bakoglu's slew metric (ln 9 × Elmore delay)
- Power: energy per switch
- Wire
- Lumped buffer dynamic/short-circuit power
- Can be easily extended to leakage power
- Low Vdd (VL) reduces leakage
- Requires assumptions about clock rate and switching activity
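The models on this slide can be sketched in a few lines of Python. This is only a minimal illustration: the formulas are the standard textbook forms (distributed-RC Elmore delay for a wire, a switch-level RC model for a buffer, 0.5·C·Vdd² as wire energy per switch), which are assumed here rather than taken from the slides. All function and parameter names are hypothetical.

```python
import math


def wire_delay(r, c, length, c_load):
    """Elmore delay of a wire with unit resistance r and unit capacitance c
    driving a lumped load c_load (assumed distributed-RC form)."""
    return r * length * (c * length / 2.0 + c_load)


def buffer_delay(d_intrinsic, r_out, c_load):
    """Buffer delay as intrinsic delay plus output resistance times load
    (assumed switch-level RC model)."""
    return d_intrinsic + r_out * c_load


def slew_from_elmore(elmore_delay):
    """Bakoglu's metric: slew approximated as ln(9) times the Elmore delay."""
    return math.log(9.0) * elmore_delay


def wire_energy(c, length, vdd):
    """Energy per switch of a wire, assumed to be 0.5 * C_wire * Vdd^2."""
    return 0.5 * c * length * vdd ** 2
```

Converting energy per switch into power would additionally need the clock rate and switching activity, which is why the slide lists them as required assumptions.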
6. Introducing Dual Vdd Buffering
- Achieves power saving since power ∝ Vdd²
- Suffers no loss of delay optimality
- VL -> VH (a VL buffer driving a VH buffer) requires a level converter (LC)
- Restores the voltage level and reduces leakage
- Ext-CVS for logic [Srivastava, ISLPED 04]
- LC delay and power overhead amortized
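To make the quadratic dependence of power on Vdd concrete, here is a small worked example in Python. The two supply values are hypothetical placeholders and are not taken from the slides. The calculation covers only the Vdd-squared term, ignoring short-circuit and leakage components.

```python
def dynamic_power_ratio(v_low, v_high):
    """Ratio of switching power at v_low to switching power at v_high,
    assuming power is proportional to Vdd squared and all else is equal."""
    return (v_low / v_high) ** 2


# Hypothetical supply levels, for illustration only.
ratio = dynamic_power_ratio(0.9, 1.2)
saving = 1.0 - ratio
print(f"power ratio = {ratio:.4f}, saving = {saving:.1%}")
```

With these assumed levels, a buffer moved from VH to VL would dissipate a little over half of its original switching power.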
7. Key Observation in Dual Vdd Buffering
- Disallowing VL -> VH will not affect optimality
- Optimality empirically illustrated (at 65nm)
- (a) has an LC and a VH buffer driving Cl; power of (a) > (b)
- Delay of (b) > (a) only if Cl > 0.5pF (about 9mm of wire)
(Figure: configurations (a) and (b), with buffers labeled VH and VL)
8. DVB Formulation
- Dual Vdd Buffer Insertion (DVB)
- Given: interconnect tree
- Find: buffer placement, Vdd assignment for buffers, sizes of buffers
- Within the tree, VL buffers never drive VH buffers (VH driving VL is allowed)
- Level converters at VH sinks driven by VL buffers
- Minimize power subject to
- Required arrival time (RAT) at the source
- Slew rate constraint at buffer inputs and sinks
9. DVB Algorithm
- Based on [Lillis, JSSC 96]
- Dynamic programming with partial solution (option) pruning
- Options must now record downstream Vdd levels for buffering
- This prevents VL -> VH, which removes unnecessary search of the solution space
- Still quite slow for large nets
- Challenge
- Considering power causes super-linear growth in the number of options (w.r.t. tree size)
- Dual Vdd buffers: > 2x options at each node
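The idea of recording the downstream Vdd level in each option can be sketched as follows. This is a minimal illustration, not the authors' implementation. The `Option` and `Buffer` fields, the `VL`/`VH` encoding, and the simple RC buffer delay are all assumptions made for the example.

```python
from dataclasses import dataclass

VL, VH = 0, 1  # hypothetical encoding of the low and high supply levels


@dataclass(frozen=True)
class Option:
    cap: float    # downstream capacitance seen at this node
    power: float  # power of the buffered subtree
    rat: float    # required arrival time at this node
    level: int    # Vdd level of the nearest downstream buffers (VL if none)


@dataclass(frozen=True)
class Buffer:
    c_in: float
    r_out: float
    d_intrinsic: float
    power: float
    vdd: int


def add_buffer(option, buf):
    """Insert buf at the root of option; return None if a VL buffer
    would end up driving a VH buffer."""
    if buf.vdd == VL and option.level == VH:
        return None
    delay = buf.d_intrinsic + buf.r_out * option.cap  # assumed RC model
    return Option(cap=buf.c_in, power=option.power + buf.power,
                  rat=option.rat - delay, level=buf.vdd)


def expand(options, library):
    """Unbuffered options plus every legal buffered variant."""
    result = list(options)
    for opt in options:
        for buf in library:
            new = add_buffer(opt, buf)
            if new is not None:
                result.append(new)
    return result
```

Because illegal combinations are rejected at creation time, they never enter the option set, which is how the Vdd-level bookkeeping trims the search. Dominated options would still need to be pruned afterwards.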
10. Speed-up Technique
- Approximate by power-delay sampling
- Sampling under each distinct cap value
- Uniformly pick options from the entire RAT-power trade-off curve
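One way to picture the sampling step is the sketch below. It is a simplified variant written for illustration and is not the authors' exact procedure: options are grouped by capacitance value, a grid is overlaid on each group's power and RAT ranges, and one option is kept per occupied cell. The tuple layout `(cap, power, rat)`, the tie-breaking rule, and the function names are assumptions.

```python
from collections import defaultdict


def _cell(value, lo, hi, grid):
    """Map value in [lo, hi] to a grid index in 0..grid-1."""
    if hi == lo:
        return 0
    return min(grid - 1, int((value - lo) / (hi - lo) * grid))


def sample_options(options, grid=20):
    """Thin a list of (cap, power, rat) options to at most grid*grid
    options per distinct cap value."""
    by_cap = defaultdict(list)
    for opt in options:
        by_cap[opt[0]].append(opt)

    sampled = []
    for group in by_cap.values():
        p_lo = min(o[1] for o in group)
        p_hi = max(o[1] for o in group)
        q_lo = min(o[2] for o in group)
        q_hi = max(o[2] for o in group)
        cells = {}
        for o in group:
            key = (_cell(o[1], p_lo, p_hi, grid),
                   _cell(o[2], q_lo, q_hi, grid))
            best = cells.get(key)
            # Assumed tie-break: prefer higher RAT, then lower power.
            if best is None or (o[2], -o[1]) > (best[2], -best[1]):
                cells[key] = o
        sampled.extend(cells.values())
    return sampled
```

The grid size bounds the number of options kept per capacitance value, so the option count no longer grows with the length of the trade-off curve.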
11. Experimental Settings for DVB
- Test cases: randomly generated Steiner trees
- 20 to 800 terminals in a 1cm x 1cm routing area
- Buffer sizes: 16x, 32x, 64x
- Sampling grid set to 20x20
- Comparison
- Exact power-optimal algorithm (PB) [Lillis, JSSC 96]
- Our algorithm with single (SVB) and dual (DVB) Vdd buffers
12. Sampling Preserves Optimality
- Sampling has little impact on optimality
- SVB follows PB closely
- Still optimal delay, with 1.7% larger power than PB
13. Dual Vdd Reduces Power
- Dual Vdd shifts power-delay curve to the left
14. Experimental Results for DVB
- DVB saves 23% power over SVB
- More power saving in larger nets
- Power saving becomes larger with delay slack
- e.g. relaxing delay by 5% raises the saving to 26%
15. Runtime
- SVB scales much better for larger test cases
- Achieved 17x speedup over PB [Lillis, JSSC 96]
- DVB takes 2.5x more runtime than SVB
16. Outline
- Dual-Vdd Buffer insertion and sizing (DVB)
- Problem formulation
- Sampling speed-up technique
- Experimental results
- Dual-Vdd buffered tree generation (D-Tree)
- Problem formulation
- Improved augmented orthogonal search tree
- Experimental results
17. D-Tree Formulation
- Dual Vdd Buffered Tree (D-Tree)
- Given: locations of terminals, buffer stations and blockages
- Find: a rectilinear Steiner tree (RST), buffer placement/size/Vdd assignment
- VL buffers never drive VH buffers (VH driving VL is allowed)
- Level converters at VH sinks driven by VL buffers
- Minimize power subject to
- Required arrival time (RAT) at the source
- Slew rate constraint at buffer inputs and sinks
- D-Tree is NP-hard
- Finding minimum RST alone is NP-complete
18. Buffered Tree Construction
- Delay optimization only [Cong, DAC 00], by:
- Building a Hanan graph with buffer insertion nodes according to the locations of buffer stations
- Path search on the grid by option propagation
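As a rough picture of the grid that the path search runs on, the sketch below builds a Hanan-style grid from terminal and buffer-station coordinates. It is an illustration under simplifying assumptions: blockages are ignored, buffer stations are treated as points, and the function name and return format are hypothetical.

```python
def hanan_graph(terminals, stations):
    """Build a Hanan-style grid from the x and y coordinates of all
    terminals and buffer stations.

    Returns (nodes, edges, buffer_nodes), where edges connect neighbouring
    grid points and buffer_nodes marks where a buffer may be inserted.
    Blockages are not modelled in this sketch.
    """
    points = list(terminals) + list(stations)
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})

    nodes = [(x, y) for x in xs for y in ys]
    edges = []
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            if i + 1 < len(xs):  # horizontal neighbour
                edges.append(((x, y), (xs[i + 1], y)))
            if j + 1 < len(ys):  # vertical neighbour
                edges.append(((x, y), (x, ys[j + 1])))

    buffer_nodes = set(stations)
    return nodes, edges, buffer_nodes


nodes, edges, buffer_nodes = hanan_graph([(0, 0), (2, 2)], [(1, 1)])
print(len(nodes), len(edges), buffer_nodes)
```

Option propagation would then walk these edges, adding wire delay along each edge and trying a buffer whenever a node in `buffer_nodes` is reached.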
19. D-Tree Algorithm Overview
- Challenges
- Growth of options is exponential
- An artifact of D-Tree's NP-hardness
- Considering power worsens option growth
- Solution: sampling and an efficient prune tree
20. Prune Tree in [Lillis, JSSC 96]
- Options are inserted in sorted capacitance order
- Never need to clear options out of the tree
- If each new option is checked against the tree
- Redundant options in the tree are avoided automatically
- e.g. new option (c = 20, p = 100, q = 600)
- Not applicable to the D-Tree problem
- Order of new options is not known a priori
21. Our Improvement on Prune Tree
- Indexing with capacitance results in fewer trees
- Fewer distinct capacitance values than power values
- Efficient tree cleaning
- Enables out-of-order option insertion
- Guarantees no redundancy in the tree
22. Tree Cleaning
- To add a new option in O(c log(T)) time
- Check whether the new option is dominated by any option in the data structure
- If not, remove options in the tree dominated by the new option in two downward tree traversals
- e.g. new option (c = 10, p = 70, q = 410, ...)
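The two steps of tree cleaning (reject a dominated newcomer, otherwise evict what it dominates) can be illustrated with the sketch below. It deliberately uses a plain list instead of the search tree described on the slide, so it shows only the logic and does not achieve the stated time bound. The dominance rule (no larger capacitance, no larger power, no smaller RAT) and the class name are assumptions.

```python
def dominates(a, b):
    """a and b are (cap, power, rat) tuples; a dominates b if it is no worse
    in all three (assumed dominance rule)."""
    return a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2]


class OptionSet:
    """List-based stand-in for the prune tree; options may arrive in any order."""

    def __init__(self):
        self.options = []

    def add(self, new):
        # Step 1: discard the new option if an existing one dominates it.
        if any(dominates(old, new) for old in self.options):
            return False
        # Step 2: clean out existing options that the new one dominates.
        self.options = [old for old in self.options
                        if not dominates(new, old)]
        self.options.append(new)
        return True


s = OptionSet()
s.add((20, 100, 600))
s.add((30, 120, 500))   # dominated by the first option, so rejected
s.add((10, 70, 410))    # kept: lower cap and power, but lower RAT
s.add((10, 70, 650))    # dominates both stored options, so they are removed
print(s.options)
```

Running both steps on every insertion keeps the stored options mutually non-dominated, which is what allows options to arrive in any order.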
23. Experimental Settings for D-Tree
- Random test cases
- All based on a random floorplan of 1cm x 1cm
- Blockages 30, buffer stations 1mm apart
- Comparison
- Delay-optimal tree (RMP) [Cong, DAC 00]
- Our algorithm with single (S-Tree) and dual (D-Tree) Vdd buffers
24. Experimental Results for D-Tree
- Significant power saving over RMP
- S-Tree: 7%, D-Tree: 18%
- Larger saving for large test cases (e.g. T4)
- Handles up to 6-sink nets (T5 takes 23 mins)
- Similar capability compared with delay-optimal approaches [Cong, DAC 00; Chen, ASP-DAC 02]
25. Conclusion
- Formulated dual Vdd buffer insertion/tree generation without level converters
- Proposed 2 speedup techniques
- Sampling with negligible loss of optimality
- Improved prune tree for solution pruning
- Applied to single Vdd buffer insertion, 17x faster than existing work
- Large power saving over single Vdd buffering
- 23% in buffer insertion: dual Vdd vs single Vdd
- 18% in buffered tree: dual Vdd vs delay-optimal
26. Future Work
- Speed up tree construction
- Slack allocation for more power reduction
- Path-based buffer insertion [Sze, DAC 05]
- Allocates slack along one interconnect path
- Considers single Vdd buffers only
- Chip-level FPGA dual Vdd assignment [Lin, DAC 05]
- Fixed buffer locations, assigns Vdd levels
- Considers multiple critical paths
- Solved as a linear programming problem