Title: Chickens, Ostriches, and the Interconnect Problem
1 Chickens, Ostriches, and the Interconnect Problem
- Prof. Patrick H. Madden
- University of Kitakyushu
- pmadden_at_acm.org
2 nVidia chip designs
Designs are getting larger -- and harder to do.
(Table from Chris Malachowsky's nVidia talk at ISPD'01.)
3 Power Density Will Get Even Worse
- Need to Keep the Junctions Cool
  - Performance (Higher Frequency)
  - Lower leakage (Exponential)
  - Better reliability (Exponential)
Pat Gelsinger, ISSCC 2001
4 System Level Perspective: Improvements in each area multiply their effect
- Architecture
  - CISC, RISC, mass-par, Transmeta
- Synthesis
  - VHDL, Verilog, custom datapath
- Floorplan
  - Better representations
- Placement
  - Annealing, better partitioners, analytic, ...
- Routing
  - Linsker cost functions, multi-commodity flow
  - Switch from channel to over-the-cell
- Lithography/Fab
  - 2x improvement every 18 months, copper, strained silicon
- Chip Packaging
  - Flip-chip, BGA, SIMM, ...
- System Packaging
  - Quieter fans, better batteries
- Other
  - Better software, new markets, colored plastic
5 Overview
- Interconnect Trouble in the Next Few Years
  - Chicken or Ostrich? I'm a Chicken.
- Doing Design and Getting the Wires Right
  - The problem will not go away, so we need good ways to minimize the impact.
- Summary and Future Research Directions
6 Related Publications
- DATE 2003: Crosstalk Aware Detail Routing
  - R. M. Smey, P. H. Madden
- DAC 2003: Amplified Congestion Global Routing
  - R. T. Hadsell, P. H. Madden
- ICCAD 2003: Fractional Cut Improved Recursive Bisection Based Placement
  - A. Agnihotri, S. Ono, A. Khatkhate, A. Mathur, M. C. Yildiz, P. H. Madden
- ISPD 2004: Mixed Size Placement
  - A. Khatkhate, C. Li, A. R. Agnihotri, S. Ono, M. C. Yildiz, C.-K. Koh, P. H. Madden
- MWSCAS 2004: Clustering and Combinatorial Placement
  - S. Ono, P. H. Madden
- SASIMI 2004: Lithography and Manufacturability Interconnect Synthesis
  - S. Pujari, R. M. Smey, Y. Tan, H. H. Madden, P. H. Madden
- ICCAD 2004: White Space Allocation
  - C. Li, M. Xu, C.-K. Koh, J. Cong, P. H. Madden
- ASPDAC 2005: Detail Placement by Branch and Price
  - P. Ramachandran, A. Agnihotri, S. Ono, P. Damodaran, H. Srihari, P. H. Madden
- ASPDAC 2005: Buffer and Repeater Insertion
  - C. Li, C.-K. Koh, P. H. Madden
- ASPDAC 2005: Optimality and Scalability Study
7 The most serious problem? Interconnect.
- The next few slides are taken from Desmond Kirkpatrick and Prashant Saxena (Intel), from the ISPD'04 "Chicken" panel
8 Interconnect Chicken Littles (an Ostrich Viewpoint)
"Interconnects are dominating! Interconnects are dominating!"
- Chicken Little: a story of mass hysteria
  - An acorn hits a chick called Chicken Little on the head
  - Chicken Little proclaims "the sky is falling, the sky is falling" and runs down the road.
  - Each character (Henny Penny, Goosy Loosy) who hears her story runs behind her, propagating the story to the next.
Timeline ('95, '98, '01, '04): Rise of the Interconnect Chicken Little Era
- Timeline labels: Interconnect-dominated, Interconnect-driven, Deep-submicron
- 1995 IEDM: Bohr, "Interconnect Scaling: The Real Limiter to High Performance VLSI"
9 Interconnect Ostriches (a Chicken Viewpoint)
"Interconnects don't scare me. Interconnects don't scare me."
- Ostrich: a symbol of deep denial
  - Despite being an aggressive 40lb bird
  - When frightened, an Ostrich is reputed to hide its head in the sand
Timeline ('95, '98, '01, '04): Interconnect Ostrich Malaise Era
- 1999 ICCAD: Ho/Horowitz, "Interconnect Scaling Implications for CAD" - Sylvester/Keutzer used average wires
- 1998 ICCAD: Sylvester/Keutzer, "Getting to the Bottom of Deep Submicron" - 50k gates, no problem!
- Copper / low-k promises
10 Revenge of the Interconnect Chicken Littles?
- Interconnect Buffering debate
  - Primarily a paper debate
  - Consensus: global interconnect phenomenon
  - (e.g. Tau'99: Keutzer/Pileggi agree floorplanning will need more work)
- Recent scary predictions
  - Buffers will invade synthesis blocks
  - Meshes / fabrics / grids of buffers will replace individual elements / stations
Timeline ('95, '98, '01, '04): Revenge of the Chicken Littles
- 2002 ISPD: Saxena, et al., "The Scaling Challenge: Can Correct-by-Construction Design Help?" - 70% of cells will be interconnect buffers at 32nm
11
- Exploding buffer counts will break today's block design paradigms
- All realistic scaling projections encounter this problem
12 Prashant Saxena is an Optimist
- Exploding buffer counts will break everything, block design or not
- We are all in a lot more trouble than you might expect
  - Don't worry about the ITRS roadmap for 2010. We're not going to get there.
- Stop chasing Moore's Law, and start using Jobs' Law
  - Colored plastic can make people buy stuff
13 What is to be done?
- A partial solution: minimize the length of the interconnect.
- Note: I am firmly in the Chicken Little camp.
14 Circuit Layout
- We have a logic diagram
- Where to place the cells?
  - We know that we're going to have to change gate sizes
  - We know that we're going to have to insert buffers
  - We know that we'll need additional space for routing
- Reserve space to handle this?
  - No -- bad idea! Very bad!
- So if you don't reserve space...
15 Dense Placements
- The placements from Feng Shui contain absolutely no internal space
  - This is intentional
- This leaves no space for buffers, sizing, routing, ...
  - Don't worry, it will all be OK
16 Fractional Cut Placement
- By improving placement -- we strike at the root of the interconnect problem
- Method must be scalable: methods that work well only on small problems are not relevant
17 The Placement Problem
Common approach: make each gate rectangular, and arrange them like bricks. The problem... where do you put each brick? (And how do you run the wires?)
- Millions of gates
- Tens of millions of tiny pieces of metal
18 Leading Placement Methods
- Force-Directed/Linear Programming
- Simulated Annealing
- Recursive Bisection
  - Split the logic into two groups; minimize the number of wires between the groups
  - Place one group in the top half of the chip, and the other in the bottom half
  - Repeat within each half until every group is small enough to place directly
  - Why use bisection? It scales better than the other methods, and still gets good results.
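The recursive bisection bullets can be sketched in a few lines. This is a toy illustration, not Feng Shui: the partitioner below is a brute-force balanced min-cut that only works for a handful of cells (a real placer would use a multilevel partitioner), all cells are assumed to be the same size, and the names `best_bisection` and `place` are hypothetical.

```python
from itertools import combinations


def best_bisection(cells, nets):
    """Exhaustive balanced min-cut split; only viable for a handful of cells."""
    cells = sorted(cells)
    group = set(cells)
    best_cut, best_left = None, None
    for left in combinations(cells, len(cells) // 2):
        left = set(left)
        cut = 0
        for net in nets:
            inside = group & set(net)          # ignore pins outside this region
            if inside & left and inside - left:
                cut += 1                       # net has pins on both sides
        if best_cut is None or cut < best_cut:
            best_cut, best_left = cut, left
    return best_left, group - best_left


def place(cells, nets, x0, y0, x1, y1, horizontal=True, out=None):
    """Recursively bisect `cells` into the rectangle (x0, y0)-(x1, y1)."""
    out = {} if out is None else out
    cells = list(cells)
    if not cells:
        return out
    if len(cells) == 1:
        out[cells[0]] = ((x0 + x1) / 2, (y0 + y1) / 2)
        return out
    a, b = best_bisection(cells, nets)
    share = len(a) / len(cells)                # equal-size cells: area ~ count
    if horizontal:                             # group a on top, b on the bottom
        y_cut = y1 - (y1 - y0) * share
        place(a, nets, x0, y_cut, x1, y1, False, out)
        place(b, nets, x0, y0, x1, y_cut, False, out)
    else:                                      # group a on the left, b on the right
        x_cut = x0 + (x1 - x0) * share
        place(a, nets, x0, y0, x_cut, y1, True, out)
        place(b, nets, x_cut, y0, x1, y1, True, out)
    return out


nets = [("a", "b"), ("c", "d")]
positions = place(["a", "b", "c", "d"], nets, 0, 0, 4, 4)
print(positions)
```

Connected cells end up in the same half at every level, which is where the wire length reduction comes from. The cut direction simply alternates here; that is a simplification for the sketch.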
19 Bisection Based Placement
(Figure labels: Logic elements; Semiconductor chip)
20 Recursive Bisection Placement
21 Non-Traditional Approach
- In traditional bisection, cut lines are placed between rows
- Our idea -- ignore row boundaries, and place cut lines where the relative areas suggest
  - This is Fractional Cut bisection
  - Requires legalization to align cells with rows
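A small sketch of the difference between the two cut-line rules, under the assumption that the cut is horizontal and that each side should receive region area in proportion to its total cell area. The function names and the example numbers are hypothetical.

```python
def fractional_cut(y_bottom, y_top, area_below, area_above):
    """Cut height that splits the region in proportion to the two cell areas."""
    return y_bottom + (y_top - y_bottom) * area_below / (area_below + area_above)


def row_aligned_cut(y_bottom, y_top, area_below, area_above, row_height):
    """Traditional alternative: snap the same cut to the nearest row boundary."""
    ideal = fractional_cut(y_bottom, y_top, area_below, area_above)
    rows = round((ideal - y_bottom) / row_height)
    return y_bottom + rows * row_height


# Region spanning 5 rows of height 2, with 35% of the cell area below the cut.
print(fractional_cut(0, 10, 35, 65))        # 3.5, in the middle of a row
print(row_aligned_cut(0, 10, 35, 65, 2))    # 4, snapped to a row boundary
```

The row-aligned cut gives the lower group 40% of the region for 35% of the cell area; the fractional cut avoids that mismatch and leaves row alignment to a later legalization step.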
22 Ignoring Row Boundaries
Row boundaries are in blue. Black outlined rectangles are the regions. Numbers indicate total cell areas (there may be a number of cells in each region).
23 Experimental Results
24 This is the start of a solution.
- But large designs are not just standard cells.
25 Mixed Block Design
Hundreds of large blocks, millions of small cells. Placement must deal with large size differential. Also called the "boulders and dust" problem.
26 Recent previous work
- Capo+Parquet - ISPD'02, ICCAD'03
  - Shred macros, global placement.
  - Form groups of standard cells, run fixed outline floorplanner.
  - Fix macros, place standard cells.
- mPG-ms - ASPDAC'03
  - Coarsening: cluster macros and standard cells.
  - Refinement: large macros are fixed gradually, removing overlaps; carry on refinement on smaller objects.
Objective in these works: only HPWL (half-perimeter wire length) minimization
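For reference, HPWL is the sum over all nets of the half-perimeter of the bounding box around each net's pins. A minimal sketch, assuming every pin sits at its cell's position (the data layout here is hypothetical):

```python
def hpwl(nets, positions):
    """Half-perimeter wire length: bounding-box width + height, summed over nets."""
    total = 0.0
    for net in nets:
        xs = [positions[cell][0] for cell in net]
        ys = [positions[cell][1] for cell in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


positions = {"a": (0, 0), "b": (3, 4), "c": (1, 1)}
nets = [("a", "b"), ("a", "b", "c")]
total = hpwl(nets, positions)
print(total)  # 14.0: both nets share the same 3 x 4 bounding box
```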
27 Our approach
- Global placement using Fractional Cut based recursive bisection.
- Greedy legalization.
- Branch-and-bound reordering on standard cells.
28 Global placement
- Fractional cut approach (ICCAD'03)
  - Recursive bisection, but cut lines are not restricted to row boundaries. Instead, use legalization after bisection.
- Key insight: bisection can handle both standard cells and macro blocks
  - Partition line is located based on total area of each side; block shapes are not considered
  - Multilevel clustering based partitioner (hMetis), with multiple random starts
  - Large blocks have an opportunity to start on either side of the partition; we are not locked in place
29 Example
30 Mixed Block Enhancement
- Output of the Global Placer: rough distribution of cells/macros across the core area.
- Area constraints and fractional cut lines ensure that distribution is even.
- There is some overlap.
- Cells and macro blocks are not row-aligned.
31 IBM01 before legalization
32 Placement legalization
- Legalization is the stage where we remove overlaps, align cells with rows, and make cell positions match the site widths in the circuit.
- First (abandoned) approach
  - Remove macro overlap using a recursive search procedure
  - Cell legalization by dynamic programming based method (similar to ICCAD'03 paper).
  - Complex, many lines of code, and a great deal of work. Also not very good.
- Better method: leverage the uniform area demand to allow a less complex legalizer.
33 Greedy legalization
- Sort cells/macros by left-edge locations.
- Initialize right edge of each row.
- For each object:
  - Greedy assignment of an object to a row, for min displacement.
  - If no overlap, leave at abstract placement X position; otherwise shift to the right to avoid overlap
  - Macro placement must check multiple rows
  - Update right edge profiles for the rows across which the object spans.
- This method extends a prior standard cell legalization method by Dwight Hill (US Patent 6,370,673)
  - Also used in Kahng/Wang APlace paper
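The steps above can be sketched as follows. This is a simplified reading of the slide, not the Feng Shui code: displacement is measured as Manhattan distance, site-width snapping and the right chip boundary are ignored, and the object layout (`name`, `x`, `y`, `width`, `rows`) is a hypothetical format chosen for the example.

```python
def greedy_legalize(objects, num_rows, row_height):
    """Assign each object to the row (or rows) that minimizes its displacement.

    objects: dicts with name, x, y (global placement position), width,
             and rows (height in rows; 1 for a standard cell, more for a macro).
    """
    right_edge = [0.0] * num_rows              # first free x in each row
    placed = {}
    for obj in sorted(objects, key=lambda o: o["x"]):
        span = obj["rows"]
        best = None
        for r in range(num_rows - span + 1):
            # Stay at the original x if free, else shift right past every
            # object already placed in any of the rows this object covers.
            x = max(obj["x"], max(right_edge[r:r + span]))
            cost = abs(x - obj["x"]) + abs(r * row_height - obj["y"])
            if best is None or cost < best[0]:
                best = (cost, r, x)
        _, r, x = best
        placed[obj["name"]] = (x, r * row_height)
        for k in range(r, r + span):           # update the right-edge profile
            right_edge[k] = x + obj["width"]
    return placed


objects = [
    {"name": "A", "x": 0, "y": 0, "width": 4, "rows": 1},
    {"name": "B", "x": 1, "y": 0, "width": 3, "rows": 1},
    {"name": "M", "x": 2, "y": 0, "width": 5, "rows": 2},   # two-row macro
]
placed = greedy_legalize(objects, num_rows=2, row_height=10)
print(placed)
```

Because objects are processed in left-edge order and each row's right edge only moves rightward, no placed object can overlap an earlier one.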
34 Legalization
35 Experimental results
- We tested Feng Shui 2.4 on the 18 IBM mixed block benchmarks on the GSRC Bookshelf web site
  - http://www.gigascale.org/bookshelf
- Comparison with
  - Capo+Parquet (I, II): ISPD 2002
  - mPG-MS: ASPDAC 2003
  - Capo+Parquet (III): ICCAD 2003
- All publications focus on HPWL minimization; timing and routing are not considered.
36 Mixed Block Placement
37 Experimental Results
As much as 51% better on some benchmarks. Closest is around 8%, for the design that doesn't have macro blocks.
38 Upcoming ICCAD'05 mixed block papers
- Only one is able to improve on our results
  - APlace paper from Andrew Kahng's group -- and they use our legalization method and detail placer
- Others are from 10% to 30% higher wire lengths
- Our mixed block work is a major step forward, and helps mitigate the growing interconnect problem.
39 Wire Length Minimization is good.
- But what about routing wires?
- These slides are from a talk to be presented by Chen Li at ICCAD'04
  - The paper is a collaboration between Purdue, Binghamton, and UCLA
40 Motivation and previous work
- Objective of placement tools: wirelength and routability
- Routability control in global placement
  - Incorporating congestion into cost function
  - Cell movement
- Routability control in detailed placement
  - Region expanding
  - White space allocation
41 Congestion Estimation
- Congestion estimation
  - Routing resource estimation
    - based on width and spacing of wires in layers
  - Routing demand estimation
    - decompose MST (minimum spanning tree) of net into two-pin connections
    - two-bend L/Z routes for each two-pin connection
  - Congestion (overflow)
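A rough sketch of the demand side of this estimate, with several simplifications that are assumptions rather than the method from the paper: pins sit on an integer grid of bins, the MST uses Manhattan distance, each two-pin connection is spread evenly over its two L-shaped routes only (Z-shaped routes are left out), demand is counted per bin rather than per bin edge, and every bin has the same capacity.

```python
def mst_edges(pins):
    """Prim's algorithm on Manhattan distance; returns two-pin connections."""
    in_tree, rest, edges = [pins[0]], list(pins[1:]), []
    while rest:
        a, b = min(((p, q) for p in in_tree for q in rest),
                   key=lambda e: abs(e[0][0] - e[1][0]) + abs(e[0][1] - e[1][1]))
        edges.append((a, b))
        in_tree.append(b)
        rest.remove(b)
    return edges


def l_route(a, b, horizontal_first):
    """Set of bins covered by one L-shaped route between pins a and b."""
    (ax, ay), (bx, by) = a, b
    xs = range(min(ax, bx), max(ax, bx) + 1)
    ys = range(min(ay, by), max(ay, by) + 1)
    if horizontal_first:                       # corner at (bx, ay)
        return {(x, ay) for x in xs} | {(bx, y) for y in ys}
    return {(ax, y) for y in ys} | {(x, by) for x in xs}   # corner at (ax, by)


def demand_map(nets):
    """Accumulate routing demand per bin, half a track per candidate route."""
    demand = {}
    for pins in nets:
        for a, b in mst_edges(pins):
            for horizontal_first in (True, False):
                for bin_ in l_route(a, b, horizontal_first):
                    demand[bin_] = demand.get(bin_, 0.0) + 0.5
    return demand


def total_overflow(demand, capacity):
    """Congestion: demand in excess of the routing resource, summed over bins."""
    return sum(max(0.0, d - capacity) for d in demand.values())


demand = demand_map([[(0, 0), (2, 2)]])
print(demand[(0, 0)], demand[(1, 0)], total_overflow(demand, 0.75))
```

The two pin bins are crossed by both candidate routes and collect a full track of demand; every other bin on the bounding box collects half.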
42 WSA: White Space Allocation
- Idea for routing demand-resource matching
  - Fractional cut
  - Cutline shifting
- Flow of WSA
  - Slicing tree construction
  - Congestion estimation on tree
  - White space adjustment
  - Detailed placer
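Cutline shifting at a single slicing-tree node might look like the sketch below. The slides do not state the allocation rule, so the proportional split here is an assumption made purely for illustration: each child keeps the extent its cells need, and the leftover white space is divided in proportion to the children's estimated congestion. The function name and parameters are hypothetical.

```python
def shift_cutline(lo, hi, need_a, need_b, congestion_a, congestion_b):
    """Return the cut position between child a (low side) and child b (high side).

    need_a, need_b: extent each child needs for its cells alone.
    congestion_a, congestion_b: estimated routing overflow in each child.
    """
    white_space = (hi - lo) - need_a - need_b
    if white_space < 0:
        raise ValueError("cells do not fit in the region")
    total = congestion_a + congestion_b
    share_a = 0.5 if total == 0 else congestion_a / total
    return lo + need_a + white_space * share_a


# Child a is three times as congested, so it receives 3 of the 4 spare units.
print(shift_cutline(0, 10, 3, 3, 3, 1))   # 6.0
print(shift_cutline(0, 10, 3, 3, 0, 0))   # 5.0, no congestion: split evenly
```

Applying this top-down, level by level, mirrors the Level 0 / Level 1 / Level 2 progression in the slides that follow.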
43 Slicing Tree Construction
44 White Space Adjustment
Before cutline shifting
45 White Space Adjustment
Level 0
46 White Space Adjustment
Level 1
47 White Space Adjustment
Level 2: WSA finished
48 Detailed Placer
- Objective: maintain white space distribution and further reduce HPWL
  - For example, DOMINO cannot be applied here
- Greedy legalization
  - Remove overlaps
- Sliding window-based local minimization
  - White space is considered as pseudo-cells
49 Experimental Setup
- IBM v2 easy and hard benchmarks (16 circuits)
- All publicly available placers
  - Dragon-fd: congestion-driven mode
  - CAPO, Feng Shui, mPL
  - mPG: congestion mode off, QPLACE ECO for legalization
  - QPLACE
- All placers are run 5 times, except QPLACE, mPL, and our tools, which are run once.
- WRoute (SE5.3) to evaluate routability
50 Experimental Results
100% successful routings
8.1% to 24.5% reduction on routed WL
51 Experimental Results
- Impacts of various techniques in our flow
Both techniques improve routability. Combined flow works best.
Both techniques reduce routed WL.
52 Experimental Results
- WSA on placements generated by other tools
Improvement on routability except for QPLACE.
1.1% to 8.0% reduction on routed WL compared to original tools.
53 Improved Routability
- mPL-R
  - Routability-driven global placement reduces routing demands through cell-replacement based on accurate congestion estimation
- WSA
  - Routability-driven detailed placement allocates routing resources into congested regions
- Successful routings on all easy and hard IBM benchmarks
  - shortest routed WL, competitive placement runtime
54 Dense Placement?
- While the original placement is dense
  - We can stretch with WSA to get routability
  - We can also stretch to do gate sizing and buffer insertion (upcoming ASPDAC paper)
  - We can stretch for thermal and noise issues
- AND THIS IS STABLE
  - Individual net lengths change very little.
- Conclusion: YOU DO NOT NEED TO RESERVE TONS OF WHITE SPACE!
  - And on top of that -- when you reserve white space, you can increase both the power and delay, making timing closure harder
  - Traditional white space approaches are based on people not knowing how to stretch -- not because white space is a good idea.
55 Circuit Optimization
- Start with a good placement
- Use a fast and accurate delay analysis tool to guide placement. No Elmore delay -- it's not good enough. No AWE or Spice -- too slow. Arvind and Patrika can tell you what you should do....
- Size gates and insert buffers
- Stretch the placement as needed
- Repeat as necessary
  - As the stretch is stable, we converge quickly
  - As the area is minimized, we have lower wire lengths, resulting in less up-sizing and fewer buffers.
- Details on this study in the upcoming ASPDAC paper
56 So We've Reduced Interconnect Lengths.
- But how far can we go?
- There is an OPTIMAL solution -- is there room left for improvement?
  - PEKO benchmarks are constructed with a known optimal solution
  - We use the placement tools on these to evaluate if there is further room for gain
57 Experimental Results
58 A Bit of Fun: Global vs. Detail Suboptimality
Improving placement results requires an understanding of what happens during placement.
Method: map each cell in the placement to a pixel from an image. Rearrange the cells according to optimal placement.
What does this mean? While there's suboptimality at the global level, we're losing a lot in detail placement.
59 Other places to reduce wire lengths?
- Surprisingly large potential at the detail placement level!
60 Traditional Detail Placement
(Figure: the six orderings of a three-cell window: A B C, A C B, B A C, B C A, C A B, C B A)
Legalize the placement, then try permutations on groups of cells in order to improve (wire length, delay, congestion, ...)
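A minimal sketch of the window reordering idea, using plain enumeration rather than branch-and-bound and a one-dimensional wire length model. The slot positions, the fixed neighbors `L` and `R`, and the function name are assumptions for the example.

```python
from itertools import permutations


def best_window_order(cells, slots, nets, fixed):
    """Try every assignment of window cells to slots; keep the shortest wiring.

    cells: movable cells in the window; slots: their legal x positions.
    fixed: x positions of cells outside the window, which do not move.
    """
    best_cost, best_assign = None, None
    for perm in permutations(cells):
        pos = dict(fixed)
        pos.update(zip(perm, slots))
        cost = sum(max(pos[c] for c in net) - min(pos[c] for c in net)
                   for net in nets)
        if best_cost is None or cost < best_cost:
            best_cost, best_assign = cost, dict(zip(perm, slots))
    return best_cost, best_assign


fixed = {"L": 0, "R": 4}                       # neighbors outside the window
nets = [("L", "C"), ("A", "R"), ("B", "C")]
cost, order = best_window_order(["A", "B", "C"], [1, 2, 3], nets, fixed)
print(cost, order)   # 3 {'C': 1, 'B': 2, 'A': 3}
```

The starting order A B C costs 7 under this model; the search finds C B A at cost 3 by looking at all six orderings.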
61 Branch-and-Bound results
Bigger window means better results, but longer run time. About half of Feng Shui 2.0 run time is in detail placement.
62 Redefining Local
- Placements are optimal w.r.t. the locations of groups of 6 or so
- But what about bigger groups?
  - PEKO and Grid benchmarks show that placements are not optimal
  - If we increase the window size, how much can we get?
  - How to increase window size w/o getting hammered on run time?
63 Branch-and-Bound Run Times
- 2x2 window: 4! = 24 combinations
- 3x3 window: 9! = 362,880 combinations
  - Runs in about 0.7 seconds on my PC
- 4x4 window: 16! = xxxx combinations
  - Around 1 year to find optimal
- 5x5 window: 25! = xxxx combinations
  - Multiple expansions and contractions of the universe
- With the method presented here
  - 10x10 has been solved, and we're expecting to be able to do much larger (with some algorithmic clean-up of the code)
  - It's not going to be "cheap" in terms of run time, but it should be feasible to apply
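The growth in window size can be checked directly. The sketch below counts orderings and extrapolates run time from the 0.7 second figure for the 3x3 window, under the crude assumption that run time scales linearly with the number of orderings (branch-and-bound pruning means real run times will not follow this exactly). The function names are hypothetical.

```python
import math


def window_orderings(side):
    """Number of ways to arrange the cells of a side x side window."""
    return math.factorial(side * side)


def naive_years(side, base_side=3, base_seconds=0.7):
    """Extrapolated run time in years, assuming time scales with orderings."""
    ratio = window_orderings(side) / window_orderings(base_side)
    return ratio * base_seconds / (365.25 * 24 * 3600)


for side in (2, 3, 4, 5):
    print(f"{side}x{side}: {window_orderings(side):,} orderings, "
          f"about {naive_years(side):.3g} years")
```

Under that assumption the 4x4 window comes out at a little over one year, consistent with the estimate on the slide.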
64 Better Detail Placement
- We know we're suboptimal
  - But by how much? And what portion is from global problems, how much from local?
  - Global placement: we have annealing, analytic, recursive bisection. All heuristics.
  - Detail placement: we have enumeration/branch-and-bound, and some flow based methods. Optimal, but small windows.
- Strategy
  - Apply techniques from the OR (operations research) community
  - Solve local optimization problems with bigger windows
65 Short Branch-and-Price Overview
- Based on linear programming
  - Solve LP problem to find a lower bound; solution is not necessarily integer
  - Use column generation to keep the problem size manageable
  - Decompose the problem into master and subproblems
- Branch on non-integers
  - Try 0 and 1 values, evaluate the lower bounds
  - Traverse the decision tree, jumping to the node with the best lower bound
  - If we find an integer solution, any node with a higher value (integer or not) can be pruned
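The branching half of this outline can be illustrated on a stand-in problem. The sketch below is plain best-first branch-and-bound on a 0/1 knapsack, not branch-and-price and not the placement formulation: there is no column generation or master/subproblem split, and because knapsack is a maximization problem the LP relaxation gives an upper bound where the slide speaks of a lower bound. Item weights are assumed positive, and all names are hypothetical.

```python
import heapq


def lp_bound(items, capacity, fixed):
    """LP relaxation bound with some variables fixed to 0 or 1.

    items: list of (value, weight). Returns (bound, fractional_index),
    with fractional_index None when the LP solution is already integral,
    or None if the fixed choices exceed the capacity.
    """
    value = sum(items[i][0] for i, choice in fixed.items() if choice == 1)
    room = capacity - sum(items[i][1] for i, choice in fixed.items() if choice == 1)
    if room < 0:
        return None
    free = sorted((i for i in range(len(items)) if i not in fixed),
                  key=lambda i: items[i][0] / items[i][1], reverse=True)
    for i in free:
        v, w = items[i]
        if w <= room:
            value, room = value + v, room - w
        elif room > 0:
            return value + v * room / w, i     # item i is taken fractionally
        else:
            break
    return value, None


def branch_and_bound(items, capacity):
    best = 0.0
    bound, frac = lp_bound(items, capacity, {})
    heap = [(-bound, 0, {}, frac)]             # best bound first
    counter = 1
    while heap:
        neg_bound, _, fixed, frac = heapq.heappop(heap)
        if -neg_bound <= best:
            continue                           # prune: cannot beat the incumbent
        if frac is None:
            best = -neg_bound                  # integral solution found
            continue
        for choice in (0, 1):                  # branch on the fractional variable
            child = dict(fixed)
            child[frac] = choice
            result = lp_bound(items, capacity, child)
            if result is not None and result[0] > best:
                heapq.heappush(heap, (-result[0], counter, child, result[1]))
                counter += 1
    return best


best = branch_and_bound([(60, 10), (100, 20), (120, 30)], 50)
print(best)  # 220.0
```

The heap implements "jumping to the node with the best bound", and the comparison against `best` is the pruning rule from the last bullet.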
66 Summary
- The first few slides make things sound very bad.
67 But I think it's an accurate picture
- Absolutely essential
  - Consider physical constraints in the design process
- Making chips faster and cheaper is no longer easy (it was never easy, but it's very very hard now)
68 How to Survive the Future?
- Short term: absolutely minimize circuit interconnect. We've made progress here, but there's a long way to go.
- Long term: I'm pretty sure that colored plastic will help, but otherwise...
70 Design Automation Challenges
- Handle Moore's Curse
  - The problem doubles in size every 18 months.
  - Late to Market is a disaster
- Come closer to human-design
  - Estimates are that automated design leaves about 7 years of technological advances on the table(!)
- Shield the System Designers from the Device Details
  - Timing, Power, Signal Integrity, ...
  - Designers have enough to worry about now
71 Things that Won't Work
- Massive parallel computers
- Neural network paradigm shift
- Much of the current nano and quantum hype
72 Traditional Legalization
- Align cuts with cell rows
- After bisection -- all cells are within row boundaries
  - Sort cells by X position
  - Feng Shui 1.5 also packs cells to the left
- For MCNC benchmarks, this works well
- For IBM, PEKO, not quite so good
73 After Bisection
74 Dynamic Programming Legalization
- Process rows one at a time
- For each row
  - Select a subset of cells such that the total horizontal WL of the packed subset, plus the penalty for the non-selected cells, is minimized
- Simple DP formulation obtains good results
75 DP Solution
- Suppose we have logic elements A, B, C, D, E, F
  - Assume they're all the same width
  - We have space for four of the six
- Which ones do we put into the row to minimize the TOTAL distance things move?
76 Example
(Figure: blocks labeled B, F, A, D, E, C)
All the blocks are going to be packed to the "left." The total distance things travel depends on which blocks we choose.
77 Some Observations
- If A is to the left of B before packing
  - It should still be to the left after packing
- The distance that F travelled depends only on the number of blocks to the left of it
  - We don't care which blocks to the left are taken -- only how many!
78 DP Matrix
Filling the blanks in the table is easy. Each entry takes the better of two options:
- Cost of moving E to position 3, plus the lowest cost for filling to location 2 with blocks before E
- Cost of filling to location 3 using blocks before E
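The table can be filled with a short dynamic program. This is a sketch of the recurrence as described on these slides, under stated assumptions: cells have equal width, slots are packed from x = 0, the cost of a move is horizontal distance, exactly `slots` cells are selected, and each skipped cell pays a penalty. The function name and the example coordinates are hypothetical.

```python
def dp_legalize(xs, slots, width, penalties):
    """Pick `slots` cells, in left-to-right order, to pack at 0, width, 2*width...

    xs: original x positions, sorted left to right.
    penalties: cost of leaving each cell out of this row.
    cost[i][j] = cheapest way to fill the first j slots using the first i cells.
    """
    n, inf = len(xs), float("inf")
    cost = [[inf] * (slots + 1) for _ in range(n + 1)]
    take = [[False] * (slots + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(min(i, slots) + 1):
            skip = cost[i - 1][j] + penalties[i - 1]        # cell i-1 left out
            use = inf
            if j > 0:                                       # cell i-1 into slot j-1
                use = cost[i - 1][j - 1] + abs(xs[i - 1] - (j - 1) * width)
            if use <= skip:
                cost[i][j], take[i][j] = use, True
            else:
                cost[i][j] = skip
    chosen, j = [], slots
    for i in range(n, 0, -1):                               # trace back the choices
        if take[i][j]:
            chosen.append(i - 1)
            j -= 1
    return cost[n][slots], chosen[::-1]


# Six cells, room for four; three sit near the left, three far to the right.
total, chosen = dp_legalize([0, 1, 2, 9, 10, 11], 4, 1, [5] * 6)
print(total, chosen)   # 16.0 [0, 1, 2, 3]
```

The recurrence relies on the observations from the previous slide: order is preserved, and the cost of placing a cell depends only on how many cells were taken before it, not on which ones.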
79 Legalization
80 Legalization
81 Standard Cell Placement Tools
- Other methods include
  - Capo (recursive bisection, U. Michigan)
  - Dragon (simulated annealing, UCLA)
  - Kraftwerk (linear programming, TU Munich)
  - mPL (multilevel slot-based, UCLA)
  - ...
- Objective is to minimize total wire length
- Benchmark circuits derived from IBM designs