Title: Next Generation VLSI Circuits: Physical Design Issues
1Next Generation VLSI Circuits: Physical Design Issues
- Prof. Patrick H. Madden
- University of Kitakyushu
- pmadden_at_acm.org
2Overview
- Constraints on Circuit Design
- Trouble in the Next Few Years
- Placement Research
- Scalability, wire lengths, routing, timing, mixed size designs
- Focus on Fractional Cut and related papers
- Summary and Future Research Directions
- Two levels to consider
- There's the big picture related to what we do as a research community, and as an industry
- There's the small picture on what we do to any specific problem
3Related Publications
- ICCAD 2003: Fractional Cut: Improved Recursive Bisection Based Placement
- A. Agnihotri, S. Ono, A. Khatkhate, A. Mathur, M. C. Yildiz, P. H. Madden
- ISPD 2004: Mixed Size Placement
- A. Khatkhate, C. Li, A. R. Agnihotri, S. Ono, M. C. Yildiz, C.-K. Koh, P. H. Madden
- ICCAD 2004: White Space Allocation
- C. Li, M. Xu, C.-K. Koh, J. Cong, P. H. Madden
- ASPDAC 2005: Detail Placement by Branch and Price
- P. Ramachandran, A. Agnihotri, S. Ono, P. Damodaran, H. Srihari, P. H. Madden
- ASPDAC 2005: Buffer and Repeater Insertion
- C. Li, C.-K. Koh, P. H. Madden
- ASPDAC 2005: Optimality and Scalability Study
- S. Ono, P. H. Madden
4Lithography
Assorted pictures from Google
5System Level Perspective: Improvements in each area multiply their effect
- Architecture
- CISC, RISC, mass-par, TransMeta
- Synthesis
- VHDL, Verilog, custom datapath
- Floorplan
- Better representations
- Placement
- Annealing, better partitioners, analytic,
- Routing
- Linsker cost functions, multi-commodity flow
- Switch from channel to over-the-cell
- Lithography/Fab
- 2x improvement every 18 months, copper, strained silicon
- Chip Packaging
- Flip-chip, BGA, SIMM,
- System Packaging
- Quieter fans, better batteries
- Other
- Better software, new markets, colored plastic
6nVidia chip designs....
Designs are getting larger -- and harder to
do. (Table from nVidia talk at ISPD01).
7Power Density Will Get Even Worse
- Need to Keep the Junctions Cool
- Performance (Higher Frequency)
- Lower leakage (Exponential)
- Better reliability (Exponential)
Pat Gelsinger, ISSCC 2001
8The most serious problem? Interconnect.
- The next few slides are taken from Desmond
Kirkpatrick and Prashant Saxena (Intel), from the
ISPD04 Chicken panel
9Interconnect Chicken Littles (an Ostrich Viewpoint)
"Interconnects are dominating! Interconnects are dominating!"
- Chicken Little: a story of mass hysteria
- An acorn hits a chick called Chicken Little on the head
- Chicken Little proclaims "the sky is falling, the sky is falling" and runs down the road.
- Each character (Henny Penny, Goosy Loosy) who hears her story runs behind her, propagating the story to the next.
Timeline (95, 98, 01, 04): Rise of the Interconnect Chicken Little Era
- Deep-submicron, Interconnect-dominated, Interconnect-driven
- 1995 IEDM: Bohr, Interconnect Scaling - The Real Limiter to High Performance VLSI
10Interconnect Ostriches (a Chicken Viewpoint)
"Interconnects don't scare me. Interconnects don't scare me."
- Ostrich: a symbol of deep denial
- Despite being an aggressive 40lb bird
- When frightened, an Ostrich is reputed to hide its head in the sand
Timeline (95, 98, 01, 04): Interconnect Ostrich Malaise Era
- 1998 ICCAD: Sylvester/Keutzer, Getting to the Bottom of Deep Submicron - 50k gates: no problem!
- 1999 ICCAD: Ho/Horowitz, Interconnect Scaling Implications for CAD - Sylvester/Keutzer used average wires
- Copper / low-k promises
11Revenge of the Interconnect Chicken Littles?
- Interconnect Buffering debate
- Primarily a paper debate
- Consensus: global interconnect phenomenon
- (e.g. Tau 99: Keutzer/Pileggi agree floorplanning will need more work)
- Recent scary predictions
- Buffers will invade synthesis blocks
- Meshes / fabrics / grids of buffers will replace individual elements / stations
Timeline (95, 98, 01, 04): Revenge of the Chicken Littles
- 2002 ISPD: Saxena, et al, The Scaling Challenge: Can Correct-by-Construction Design Help? - 70% of cells will be interconnect buffers at 32nm
12- Exploding buffer counts will break today's block design paradigms
- All realistic scaling projections encounter this problem
13What is to be done?
- A partial solution: minimize the length of the interconnect.
- Note: I am firmly in the Chicken Little camp.
14Fractional Cut Placement
- By improving placement -- we strike at the root of the interconnect problem
- Method must be scalable: methods that work well on small problems are not relevant
15The Placement Problem
Common approach: make each gate rectangular, and arrange them like bricks. The problem... where do you put each brick? (And how do you run the wires?)
Tens of millions of tiny pieces of metal
Millions of gates
16Leading Placement Methods
- Force-Directed/Linear Programming
- Simulated Annealing
- Recursive Bisection
- Split the logic into two groups; minimize the number of wires between the groups
- Place one group in the top half of the chip, and the other in the bottom half
- Why use bisection? It scales better than the other methods, and still gets good results.
17Bisection Based Placement
Logic elements
Semiconductor chip
18Recursive Bisection Placement
19Non-Traditional Approach
- In bisection, cut lines are placed between rows
- Our idea--ignore row boundaries, and place cut lines where the relative areas suggest
- This is Fractional Cut bisection
- Requires legalization to align cells with rows
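A minimal sketch of the cut-line idea, not the Feng Shui implementation: the cut is placed so that each side's share of the region matches its share of the cell area, with no snapping to rows. The `Region` type, both function names, and the row-snapping comparison are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Region:
    x0: float
    y0: float
    x1: float
    y1: float

def fractional_cut(region, area_a, area_b, horizontal=True):
    """Split a region so each side's share matches its cell-area share.

    The cut coordinate is NOT snapped to a row boundary (hypothetical helper).
    """
    total = area_a + area_b
    if total <= 0:
        raise ValueError("total cell area must be positive")
    frac = area_a / total
    if horizontal:
        cut = region.y0 + frac * (region.y1 - region.y0)
        return (Region(region.x0, region.y0, region.x1, cut),
                Region(region.x0, cut, region.x1, region.y1))
    cut = region.x0 + frac * (region.x1 - region.x0)
    return (Region(region.x0, region.y0, cut, region.y1),
            Region(cut, region.y0, region.x1, region.y1))

def row_aligned_cut(region, area_a, area_b, row_height):
    """For comparison: snap the horizontal cut to the nearest row boundary."""
    lower, _ = fractional_cut(region, area_a, area_b, horizontal=True)
    rows = round((lower.y1 - region.y0) / row_height)
    cut = region.y0 + rows * row_height
    return (Region(region.x0, region.y0, region.x1, cut),
            Region(region.x0, cut, region.x1, region.y1))
```

With a two-row region and cell areas of 30 and 50, the fractional cut gives the first group 37.5% of the region, while the row-aligned cut is forced to give it a whole row (50%).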
20Ignoring Row Boundaries
Row boundaries are in blue. Black outlined rectangles are the regions. Numbers indicate total cell areas (there may be a number of cells in each region).
21Experimental Results
22Standard Cell Observations
- Recursive Bisection
- Fast, and very competitive
- More than 30% better than DAC98 best paper(!)
- 30% in 5 years is small compared to the 8X improvement (or more) from Lithography
- Different Benchmarks change results
- PEKO != IBM?
- More on this later
23This is the start of a solution.
- But large designs are not just standard cells.
24Boulders and Dust
- To speed up the design process, pre-designed
blocks are integrated with standard cell logic.
25Mixed Block Design
Hundreds of large blocks, millions of small cells. Placement must deal with large size differential. Also called the "boulders and dust" problem.
26Recent previous work
- Capo + Parquet - ISPD 02, ICCAD 03
- Shred macros, global placement.
- Form groups of standard cells, run fixed outline floor planner.
- Fix macros, place standard cells.
- mPG-ms - ASPDAC 03.
- Coarsening - cluster macros and standard cells.
- Refinement - large macros are fixed gradually, removing overlaps; carry on refinement on smaller objects.
Objective in these works: only HPWL minimization
27Our approach
- Global placement using Fractional Cut based recursive bisection.
- Greedy legalization.
- Branch & Bound reordering on standard cells.
28Global placement
- Fractional cut approach (ICCAD03)
- Recursive bisection, but cut lines are not restricted to row boundaries. Instead, use legalization after bisection.
- Key insight: bisection can handle both standard cells and macro blocks
- Partition line is located based on total area of each side; block shapes are not considered
- Multilevel clustering based partitioner (hMetis), with multiple random starts
- Large blocks have an opportunity to start on either side of the partition; we are not locked in place
29Example
30Mixed Block Enhancement
- Output of the Global Placer - rough distribution of cells/macros across the core area.
- Area constraints and fractional cut lines ensure that distribution is even.
- There is some overlap.
- Cells and macro blocks are not row-aligned.
31IBM01 before legalization
32Placement legalization
- Legalization is the stage where we remove overlaps, align cells with rows, and make cell positions match the site widths in the circuit.
- First (abandoned) approach -
- Remove macro overlap using a recursive search procedure
- Cell legalization by dynamic programming based method (similar to ICCAD03 paper).
- Complex, many lines of code, and a great deal of work. Also not very good.
- Better method: leverage the uniform area demand to allow a less complex legalizer.
33Greedy legalization
- Sort cells/macros by left-edge locations.
- Initialize right edge of each row.
- For Each Object
- Greedy assignment of an object to a row, for min displacement.
- If no overlap, leave at abstract placement X position; otherwise shift to the right to avoid overlap
- Macro placement must check multiple rows
- Update right edge profiles for the rows across which the object spans.
- This method extends a prior standard cell legalization method by Dwight Hill (US Patent 6,370,673)
- Also used in Kahng/Wang APlace paper
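The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the Feng Shui code: the `Obj` record, the Manhattan displacement cost, and the omission of site-width snapping and of a check against the right edge of the core are all simplifications introduced here.

```python
from dataclasses import dataclass

@dataclass
class Obj:
    name: str
    x: float        # desired left edge from global placement
    y: float        # desired bottom edge from global placement
    width: float
    rows: int = 1   # rows spanned: 1 for a standard cell, more for a macro

def greedy_legalize(objs, num_rows, row_height, left=0.0):
    """Assign each object to the row (or rows) with the least displacement."""
    right_edge = [left] * num_rows            # first free x in every row
    placed = {}
    for o in sorted(objs, key=lambda o: o.x): # process by left edge
        best = None
        for r in range(num_rows - o.rows + 1):
            span = range(r, r + o.rows)
            # Keep the desired x if it is free in every spanned row,
            # otherwise shift right past the current right-edge profile.
            x = max(o.x, max(right_edge[i] for i in span))
            cost = abs(x - o.x) + abs(r * row_height - o.y)
            if best is None or cost < best[0]:
                best = (cost, r, x)
        if best is None:
            raise ValueError(f"{o.name} spans more rows than the core has")
        _, r, x = best
        placed[o.name] = (x, r * row_height)
        for i in range(r, r + o.rows):        # update the profile
            right_edge[i] = x + o.width
    return placed
```

Because objects are visited in left-edge order and only ever pushed to the right of the profile, no two placed objects overlap, and macros need no separate pass.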
34Legalization
35Experimental results
- We tested Feng Shui 2.4 on the 18 IBM mixed block benchmarks on the GSRC Bookshelf web site
- http://www.gigascale.org/bookshelf
- Comparison with
- Capo+Parquet (I, II) ISPD2002
- mPG-MS ASPDAC2003
- Capo+Parquet (III) ICCAD2003
- All publications focus on HPWL minimization; timing and routing are not considered.
36Mixed Block Placement
37Experimental Results
As much as 51% better on some benchmarks. Closest is around 8%, for the design that doesn't have macro blocks.
38Upcoming ICCAD05 mixed block papers
- Only one is able to improve on our results
- APlace paper from Andrew Kahng's group -- and they use our legalization method and detail placer
- Others are from 10% to 30% higher wire lengths
- Our mixed block work is a major step forward, and helps mitigate the growing interconnect problem.
39Wire Length Minimization is good.
- But what about routing wires?
- These slides are from a talk to be presented by Chen Li at ICCAD04
- The paper is a collaboration between Purdue, Binghamton, and UCLA
40Motivation and previous work
- Objective of placement tools: wirelength and routability
- Routability control in global placement
- Incorporating congestion into cost function
- Cell movement
- Routability control in detailed placement
- Region expanding
- White space allocation
41Congestion Estimation
- Congestion estimation
- Routing resource estimation
- based on width and spacing of wires in layers
- Routing demand estimation
- decompose MST of net into two-pin connections
- two-bend L/Z routes for each two-pin connection
- Congestion (overflow)
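A rough sketch of this style of estimate, under stated assumptions: pins are given as (column, row) bin coordinates, each net is decomposed with a Manhattan-distance MST, and every two-pin connection spreads its demand evenly over its L-shaped routes. Z-shaped routes, per-layer capacities, and separate horizontal/vertical demand are left out, and all function names are hypothetical.

```python
from collections import defaultdict

def mst_edges(pins):
    """Prim's algorithm on Manhattan distance; returns two-pin connections."""
    if len(pins) < 2:
        return []
    in_tree, rest, edges = [pins[0]], list(pins[1:]), []
    while rest:
        _, a, b = min((abs(p[0] - q[0]) + abs(p[1] - q[1]), p, q)
                      for p in in_tree for q in rest)
        edges.append((a, b))
        in_tree.append(b)
        rest.remove(b)
    return edges

def l_route_bins(a, b, horizontal_first):
    """Bins crossed by one L-shaped route between bins a and b."""
    (x1, y1), (x2, y2) = a, b
    xs = range(min(x1, x2), max(x1, x2) + 1)
    ys = range(min(y1, y2), max(y1, y2) + 1)
    if horizontal_first:
        return {(x, y1) for x in xs} | {(x2, y) for y in ys}
    return {(x1, y) for y in ys} | {(x, y2) for x in xs}

def estimate_overflow(nets, capacity):
    """Return (demand per bin, overflow per bin) for a uniform bin capacity."""
    demand = defaultdict(float)
    for pins in nets:
        for a, b in mst_edges(pins):
            # A straight connection has a single route; the set removes the duplicate.
            routes = {frozenset(l_route_bins(a, b, h)) for h in (True, False)}
            for route in routes:
                for bin_ in route:
                    demand[bin_] += 1.0 / len(routes)
    overflow = {b: d - capacity for b, d in demand.items() if d > capacity}
    return dict(demand), overflow
```

Overflow here is simply demand above capacity in a bin; a real estimator would derive capacity from the wire width and spacing of each routing layer.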
42WSA: White Space Allocation
- Idea for routing demand-resource matching
- Fractional cut
- Cutline shifting
- Flow of WSA
- Slicing tree Construction
- Congestion estimation on tree
- White space adjustment
- Detailed placer
43Slicing Tree Construction
44White Space Adjustment
Before cutline shifting
45White Space Adjustment
Level 0
46White Space Adjustment
Level 1
47White Space Adjustment
Level 2 WSA finished
48Detailed Placer
- Objective: maintain white space distribution and further reduce HPWL
- For example, DOMINO cannot be applied here
- Greedy legalization
- Remove overlaps
- Sliding window-based local minimization
- White space is considered as pseudo-cells
49Experimental Setup
- IBM v2 easy and hard benchmarks (16 circuits)
- All publicly available placers
- Dragon-fd congestion-driven mode
- CAPO, Feng Shui, mPL
- mPG congestion mode off, QPLACE ECO for legalization
- QPLACE
- All placers are run 5 times, except QPLACE, mPL and our tools, which are run once.
- WRoute (SE5.3) to evaluate routability
50Experimental Results
100% successful routings
8.1% to 24.5% reduction on routed WL
51Experimental Results
- Impacts of various techniques in our flow
Both techniques improve routability. Combined flow works best.
Both techniques reduce routed WL
52Experimental Results
- WSA on placements generated by other tools
Improvement on routability except for QPLACE
1.1% to 8.0% reduction on routed WL compared to original tools
53Improved Routability
- mPL-R
- Routability-driven global placement reduces routing demands through cell-replacement based on accurate congestion estimation
- WSA
- Routability-driven detailed placement allocates routing resources into congested regions
- Successful routings on all easy & hard IBM benchmarks
- shortest routed WL, competitive placement runtime
54Wire Lengths are Reduced
- But how far can we go?
- There is an OPTIMAL solution--is there room left for improvement?
- PEKO benchmarks are constructed with a known optimal solution
- We use the placement tools on these to evaluate if there is further room for gain
55Experimental Results
56Other places to reduce wire lengths?
- Surprisingly large potential at the detail
placement level!
57Traditional Detail Placement
Orderings of a three-cell group: A B C, A C B, B A C, B C A, C A B, C B A
Legalize the placement, then try permutations on groups of cells in order to improve (wire length, delay, congestion, ...)
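A small sketch of window-based permutation on a single row, assuming cells are packed with no gaps and that only horizontal half-perimeter wire length matters. The net format (cell names mixed with fixed x coordinates) and the function names are hypothetical, chosen to keep the example self-contained.

```python
from itertools import permutations

def hpwl_x(nets, pos):
    """Horizontal half-perimeter wire length.

    A net terminal is either a cell name (str) or a fixed x coordinate.
    """
    total = 0.0
    for net in nets:
        xs = [pos[t] if isinstance(t, str) else float(t) for t in net]
        total += max(xs) - min(xs)
    return total

def pack(order, widths, start):
    """Place cells left to right from `start`; return each cell's centre x."""
    pos, x = {}, start
    for c in order:
        pos[c] = x + widths[c] / 2.0
        x += widths[c]
    return pos

def improve_row(row, widths, nets, start=0.0, window=3):
    """Slide a window along the row and keep the best ordering in each window."""
    row = list(row)
    for i in range(len(row) - window + 1):
        best_order, best_cost = None, None
        for perm in permutations(row[i:i + window]):
            cand = row[:i] + list(perm) + row[i + window:]
            cost = hpwl_x(nets, pack(cand, widths, start))
            if best_cost is None or cost < best_cost:   # ties keep the earlier order
                best_order, best_cost = cand, cost
        row = best_order
    return row, hpwl_x(nets, pack(row, widths, start))
```

Each window costs `window!` evaluations, which is why plain enumeration stays limited to small groups.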
58Branch-and-Bound results
Bigger window means better results, but longer
run time. About half of FS2.0 run time is in
detail placement.
59Redefining Local
- Placements are optimal wrt the locations of groups of 6 or so
- But what about bigger groups?
- PEKO and Grid benchmarks show that placements are not optimal
- If we increase the window size, how much can we get?
- How to increase window size w/o getting hammered on run time?
60A Bit of Fun: Global vs. Detail Suboptimality
Improving placement results requires an understanding of what happens during placement.
Method: map each cell in the placement to a pixel from an image. Rearrange the cells according to optimal placement.
What does this mean? While there's suboptimality at the global level, we're losing a lot in detail placement.
61Branch-and-Bound Run Times
- 2x2 window: 4! = 24 combinations
- 3x3 window: 9! = 362880 combinations
- Runs in about 0.7 seconds on my PC
- 4x4 window: 16! = xxxx combinations
- Around 1 year to find optimal
- 5x5 window: 25! = xxxx combinations
- Multiple expansions and contractions of the universe
- With the method presented here
- 10x10 has been solved, and we're expecting to be able to do much larger (with some algorithmic clean-up of the code)
- It's not going to be cheap in terms of run time, but it should be feasible to apply
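The growth in these counts is easy to reproduce. The sketch below computes the number of orderings for each window and a run-time guess that assumes time scales linearly with the number of orderings, anchored to the 0.7 seconds quoted for the 3x3 window; that scaling rule and the function name are assumptions, not measurements.

```python
from math import factorial

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def enumeration_estimate(side, ref_side=3, ref_seconds=0.7):
    """Orderings of a side x side window and a linearly scaled run-time guess."""
    perms = factorial(side * side)
    seconds = ref_seconds * perms / factorial(ref_side * ref_side)
    return perms, seconds

for side in (2, 3, 4, 5):
    perms, seconds = enumeration_estimate(side)
    years = seconds / SECONDS_PER_YEAR
    print(f"{side}x{side}: {perms} orderings, about {years:.3g} years")
```

Under that assumption the 4x4 window comes out a little over one year, in line with the figure on the slide.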
62Better Detail Placement
- We know we're suboptimal
- But by how much? And what portion is from global problems, how much from local?
- Global placement: we have annealing, analytic, recursive bisection. All heuristics.
- Detail placement: we have enumeration/branch-and-bound, and some flow based methods. Optimal, but small windows.
- Strategy
- Apply techniques from OR community
- Solve local optimization problems with bigger windows
63Short Branch-and-Price Overview
- Based on linear programming
- Solve LP problem to find a lower bound; solution is not necessarily integer
- Use column generation to keep the problem size manageable
- Decompose the problem into master and subproblems
- Branch on non-integers
- Try 0 and 1 values, evaluate the lower bounds
- Traverse the decision tree, jumping to the node with the best lower bound
- If we find an integer solution, any node with a higher value (integer or not) can be pruned
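The tree traversal and pruning rule can be illustrated on a toy cell-to-slot assignment. Note that this is plain best-first branch and bound, not branch-and-price: there is no LP, no column generation, and no master/subproblem split. The lower bound used here (every unassigned cell takes its cheapest free slot, ignoring conflicts) is a stand-in chosen for illustration, as are the function names.

```python
import heapq
from itertools import count

def lower_bound(cost, assigned, fixed_cost):
    """Relaxation: each unassigned cell takes its cheapest free slot."""
    n = len(cost)
    free = [s for s in range(n) if s not in assigned]
    return fixed_cost + sum(min(cost[c][s] for s in free)
                            for c in range(len(assigned), n))

def best_first_assign(cost):
    """cost[c][s] is the cost of cell c in slot s; returns (total, slots)."""
    n = len(cost)
    tie = count()   # unique tie-breaker so heap entries never compare further
    heap = [(lower_bound(cost, (), 0.0), next(tie), 0.0, ())]
    best_total, best_slots = float("inf"), None
    while heap:
        # Jump to the open node with the best (lowest) bound.
        bound, _, fixed, assigned = heapq.heappop(heap)
        if bound >= best_total:
            break                   # all remaining nodes are pruned
        if len(assigned) == n:      # complete solution: new incumbent
            best_total, best_slots = fixed, assigned
            continue
        c = len(assigned)           # branch on the next cell
        for s in range(n):
            if s in assigned:
                continue
            child = assigned + (s,)
            child_fixed = fixed + cost[c][s]
            child_bound = lower_bound(cost, child, child_fixed)
            if child_bound < best_total:
                heapq.heappush(heap, (child_bound, next(tie), child_fixed, child))
    return best_total, best_slots
```

The first complete solution popped from the heap is optimal, because every other open node has a bound at least as large.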
64Summary
- The first few slides make things sound very bad.
65But I think it's an accurate picture
- Absolutely essential
- Consider physical constraints in the design process
- Making chips faster and cheaper is no longer easy (it was never easy, but it's very very hard now)
66How to Survive the Future?
- Short term: absolutely minimize circuit interconnect. We've made progress here, but there's a long way to go.
- Long term: If you figure it out, please let me know!
68Design Automation Challenges
- Handle Moore's Curse
- The problem doubles in size every 18 months.
- Late to Market is a disaster
- Come closer to human-design
- Estimates are that automated design leaves about 7 years of technological advances on the table(!)
- Shield the System Designers from the Device Details
- Timing, Power, Signal Integrity, ...
- Designers have enough to worry about now
69Things that Won't Work
- Massive parallel computers
- Neural network paradigm shift
- Much of the current nano and quantum hype
70Traditional Legalization
- Align cuts with cell rows
- After bisection--all cells are within row boundaries
- Sort cells by X position
- Feng Shui 1.5 also packs cells to the left
- For MCNC benchmarks, this works well
- For IBM, PEKO, not quite so good
71After Bisection
72Dynamic Programming Legalization
- Process rows one at a time
- For each row
- Select a subset of cells such that the total horizontal WL of the packed subset, plus the penalty for the non-selected cells, is minimized
- Simple DP formulation obtains good results
73DP Solution
- Suppose we have logic elements A, B, C, D, E, F
- Assume they're all the same width
- We have space for four of the six
- Which ones do we put into the row, to minimize the TOTAL distance things move?
74Example
B
F
A
D
E
C
All the blocks are going to be packed to the
"left." The total distance things travel depends
on which blocks we choose.
75Some Observations
- If A is to the left of B before packing
- It should still be to the left after packing
- The distance that F travelled depends only on the number of blocks to the left of it
- We don't care which blocks to the left are taken--only how many!
76DP Matrix
Filling the blanks in the table is easy. An entry is filled from two candidates:
- Cost of moving E to position 3, plus the lowest cost for filling to location 2 with blocks before E
- Cost of filling to location 3 using blocks before E
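A minimal sketch of a table of this kind, under assumptions: cells are sorted by desired x, all have the same width, the cost of taking a cell is its horizontal move to a packed slot, and every cell left out pays a fixed `penalty`. The function name and the exact cost model are illustrative and may differ from the ICCAD03 formulation.

```python
def dp_select(xs, slots, penalty):
    """Choose which cells to pack into the row.

    xs:    desired x of each cell, sorted left to right
    slots: x of the packed positions, left to right
    Returns (total cost, indices of the selected cells).
    """
    n, k = len(xs), len(slots)
    INF = float("inf")
    # dp[i][j]: best cost using the first i cells with j of them packed
    dp = [[INF] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(0, min(i, k) + 1):
            skip = dp[i - 1][j] + penalty        # cell i-1 is left out
            take = INF
            if j > 0:                            # cell i-1 goes to slot j-1
                take = dp[i - 1][j - 1] + abs(xs[i - 1] - slots[j - 1])
            dp[i][j] = min(skip, take)
    j = min(range(k + 1), key=lambda c: dp[n][c])
    cost, chosen = dp[n][j], []
    for i in range(n, 0, -1):                    # walk back through the table
        if j > 0 and dp[i][j] == dp[i - 1][j - 1] + abs(xs[i - 1] - slots[j - 1]):
            chosen.append(i - 1)
            j -= 1
    return cost, chosen[::-1]
```

The table only tracks how many earlier cells were taken, never which ones, which is the observation that keeps it at `n * k` entries.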
77Legalization
78Legalization
79Standard Cell Placement Tools
- Other methods include
- Capo (recursive bisection, U. Michigan)
- Dragon (simulated annealing, UCLA)
- Kraftwerk (linear programming, TU Munich)
- mPL (multilevel slot-based, UCLA)
- ...
- Objective is to minimize total wire length
- Benchmark circuits derived from IBM designs