Title: Interconnect Optimizations
1 Interconnect Optimizations
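2 A scaling primer
- Ideal process scaling
- Device geometries shrink by $s$ ($\approx 0.7\times$)
- Device delay shrinks by $s$
- Wire geometries shrink by $s$
- $R$/m: $\rho/(ws \cdot hs) = r/s^2$, where $r = \rho/(wh)$ is the unscaled value
- $C_c$/m: $(hs)\,\varepsilon/(Ss) = C_c$
- $C$/m: similar
- With $s \approx 0.7$: $R$/m doubles, $C$/m and $C_c$/m unchanged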
3 Interconnect role
- Short interconnect
- Used to connect nearby cells
- Minimize wire C, i.e., use short min-width wires
- Medium to long-distance (global) interconnect
- Size wires to trade off area vs. delay
- Increasing width: capacitance increases, resistance decreases
- Need to find an acceptable tradeoff (the wire sizing problem)
- "Fat" wires
- Thicker cross-sections in higher metal layers
- Useful for reducing delays for global wires
- Inductance issues, sharing of limited resource
4 Cross-Section of a Chip
5 Block scaling
- Block area often stays the same
- Number of cells and nets doubles
- Wiring histogram shape invariant
- Global interconnect lengths don't shrink
- Local interconnect lengths shrink by $s$
6 Interconnect delay scaling
- Delay of a wire of length $l$
- $t_{int} = (rl)(cl) = rcl^2$ (first order)
- Local interconnects
- $t_{int} = (r/s^2)(c)(ls)^2 = rcl^2$
- Local interconnect delay unchanged (compare to faster devices)
- Global interconnects
- $t_{int} = (r/s^2)(c)(l)^2 = (rcl^2)/s^2$
- Global interconnect delay doubles: unsustainable!
- Interconnect delay increasingly more dominant
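The scaling argument above can be checked numerically. The following is a minimal sketch, assuming $s = 0.7$ and normalized pre-scaling values $r = c = l = 1$ (illustrative numbers, not taken from the slides); `wire_delay` is a hypothetical helper name.

```python
def wire_delay(r, c, l):
    """First-order wire delay t_int = (r*l)*(c*l) = r*c*l^2."""
    return (r * l) * (c * l)

s = 0.7                   # ideal scaling factor per generation (assumed)
r, c, l = 1.0, 1.0, 1.0   # normalized pre-scaling values (assumed)

before = wire_delay(r, c, l)
# Local wire: resistance per length grows by 1/s^2, length shrinks by s.
local_ratio = wire_delay(r / s**2, c, l * s) / before
# Global wire: resistance per length grows by 1/s^2, length does not shrink.
global_ratio = wire_delay(r / s**2, c, l) / before

print(f"local delay ratio:  {local_ratio:.2f}")
print(f"global delay ratio: {global_ratio:.2f}")
```

The local ratio stays at 1 while the global ratio is $1/s^2$, roughly 2, which is the "doubles" claim on the slide.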
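7 Buffer Insertion For Delay Reduction
8 Analysis of Simple RC Circuit
[Figure: RC circuit with input waveform vT(t), resistor R, capacitor C, current i(t), and capacitor voltage v(t) as the state variable]
9 Analysis of Simple RC Circuit
[Equations: step-input response, obtained by matching the initial state; output response for a step input]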
10 Delays of Simple RC Circuit
- $v(t) = v_0(1 - e^{-t/RC})$: waveform under step input $v_0 u(t)$
- $v(t) = 0.5 v_0 \Rightarrow t = 0.7RC$, i.e., delay = $0.7RC$ (50% delay)
- $v(t) = 0.1 v_0 \Rightarrow t = 0.1RC$
- $v(t) = 0.9 v_0 \Rightarrow t = 2.3RC$, i.e., rise time = $2.2RC$ (if defined as time from 10% to 90% of Vdd)
- Commonly used metric: $T_D = RC$ (= Elmore delay)
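The threshold times above follow from inverting the step response: $v(t) = f \cdot v_0$ gives $t = -RC \ln(1 - f)$. A short sketch, with `threshold_time` as a hypothetical helper and $RC$ normalized to 1:

```python
import math

def threshold_time(fraction, rc=1.0):
    """Time at which v(t) = v0*(1 - exp(-t/RC)) reaches fraction*v0."""
    return -rc * math.log(1.0 - fraction)

t50 = threshold_time(0.5)   # 50% delay, in units of RC
t10 = threshold_time(0.1)
t90 = threshold_time(0.9)
rise = t90 - t10            # 10%-90% rise time, in units of RC

print(f"t50 = {t50:.3f} RC, t10 = {t10:.3f} RC, t90 = {t90:.3f} RC")
print(f"rise time = {rise:.3f} RC")
```

Rounded to one decimal place these match the slide's 0.7RC, 0.1RC, 2.3RC, and 2.2RC.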
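11 Elmore Delay
12 Elmore Delay
- Driver is modeled as R
- Driver intrinsic gate delay t(B)
- Delay = $\sum_{\text{all } R_i} \; \sum_{\text{all } C_j \text{ downstream from } R_i} R_i C_j$
- Elmore delay at n2: $R(B)(C_1 + C_2) + R(w)C_2$
- Elmore delay at n1: $R(B)(C_1 + C_2)$
[Figure: driver B with resistance R(B) driving node n1 (capacitance C1), connected through wire resistance R(w) to node n2 (capacitance C2)]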
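The double sum above can be evaluated on any RC tree by first accumulating downstream capacitance and then summing $R_i \times$ (capacitance downstream of $R_i$) along the path from the root. The sketch below is one possible way to do this; the dictionary-based tree encoding, the function name `elmore_delays`, and the numeric values are assumptions made for illustration.

```python
def elmore_delays(parent, res, cap):
    """Elmore delay at every node of an RC tree.

    parent[n]: parent of node n (None for the root)
    res[n]:    resistance of the edge from parent[n] to n (0 for the root)
    cap[n]:    capacitance attached to node n
    """
    def depth(n):
        d = 0
        while parent[n] is not None:
            n = parent[n]
            d += 1
        return d

    # Downstream capacitance: process deepest nodes first so each child's
    # total is complete before it is added to its parent.
    down = dict(cap)
    for n in sorted(parent, key=depth, reverse=True):
        if parent[n] is not None:
            down[parent[n]] += down[n]

    # Delay at n = delay at parent + R(edge into n) * downstream cap at n.
    delay = {}
    for n in sorted(parent, key=depth):
        upstream = delay[parent[n]] if parent[n] is not None else 0.0
        delay[n] = upstream + res[n] * down[n]
    return delay

# The two-node circuit from the slide, with illustrative values.
R_B, R_w, C1, C2 = 2.0, 3.0, 4.0, 5.0
parent = {"src": None, "n1": "src", "n2": "n1"}
res = {"src": 0.0, "n1": R_B, "n2": R_w}
cap = {"src": 0.0, "n1": C1, "n2": C2}
d = elmore_delays(parent, res, cap)
print(d["n1"], d["n2"])
```

With these values the result agrees with the closed forms $R(B)(C_1 + C_2)$ at n1 and $R(B)(C_1 + C_2) + R(w)C_2$ at n2.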
13 Elmore Delay
- For a uniform wire
- No matter how the wire is lumped, the Elmore delay is the same
[Figure: uniform wire of length x with unit wire capacitance c and unit wire resistance r, driving load C]
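The lumping claim can be illustrated by splitting the wire into $n$ equal segments and computing the Elmore delay for several values of $n$. This sketch assumes a $\pi$-model for each segment (half of the segment capacitance at each end), which is one common lumping scheme; the transcript does not say which lumping the slide shows. `pi_wire_delay` is a hypothetical name.

```python
def pi_wire_delay(r, c, x, load, n):
    """Elmore delay of a wire of length x driving `load`, modeled as n pi-segments."""
    seg_r = r * x / n
    seg_c = c * x / n
    delay = 0.0
    for k in range(1, n + 1):
        # Downstream of segment k's resistor: its far-end half capacitance,
        # the full capacitance of the remaining n-k segments, and the load.
        downstream = seg_c / 2 + (n - k) * seg_c + load
        delay += seg_r * downstream
    return delay

r, c, x, load = 1.0, 1.0, 2.0, 1.0   # illustrative values
closed_form = r * x * (c * x / 2 + load)
delays = [pi_wire_delay(r, c, x, load, n) for n in (1, 2, 5, 50)]
print(closed_form, delays)
```

Under this model every choice of $n$ gives $rx(cx/2 + C)$, the same wire term that appears in the unbuffered delay expression later in the deck.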
14 Delay for Buffer
[Figure: buffer between nodes u and v driving load C, characterized by its driver resistance, input capacitance C(b), and intrinsic buffer delay]
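15 Buffers Reduce Wire Delay
[Figure: unbuffered wire of length x (driver resistance R, load C) versus the same wire split by a buffer into two segments of length x/2; each half has wire resistance rx/2 and capacitance cx/4 lumped at each end]
- $t_{unbuf} = R(cx + C) + rx(cx/2 + C)$
- $t_{buf} = 2R(cx/2 + C) + rx(cx/4 + C) + t_b$
- $t_{buf} - t_{unbuf} = RC + t_b - rcx^2/4$
[Figure: plot of $\Delta t$ versus x]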
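The difference $RC + t_b - rcx^2/4$ changes sign at $x = 2\sqrt{(RC + t_b)/(rc)}$, so a buffer only pays off on sufficiently long wires. A small sketch with illustrative unit values (assumed, not from the slides) that evaluates both expressions directly:

```python
import math

def t_unbuf(R, C, r, c, x):
    """Unbuffered delay: driver R, wire of length x, load C."""
    return R * (c * x + C) + r * x * (c * x / 2 + C)

def t_buf(R, C, r, c, x, tb):
    """Delay with one buffer at the midpoint.

    As in the slide's expression, the buffer is assumed to have output
    resistance R, input capacitance C, and intrinsic delay tb.
    """
    half = x / 2
    stage = R * (c * half + C) + r * half * (c * half / 2 + C)
    return 2 * stage + tb

R = C = r = c = tb = 1.0   # illustrative values
break_even = 2 * math.sqrt((R * C + tb) / (r * c))

short_diff = t_buf(R, C, r, c, 2.0, tb) - t_unbuf(R, C, r, c, 2.0)
long_diff = t_buf(R, C, r, c, 4.0, tb) - t_unbuf(R, C, r, c, 4.0)
print(break_even, short_diff, long_diff)
```

Here the break-even length is about 2.83: at $x = 2$ the buffer makes the path slower, at $x = 4$ it makes it faster.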
16 Combinational Logic Delay
[Figure: register / primary input, combinational logic, register / primary output, with the registers driven by the clock]
- Combinational logic delay < clock period
17 Example of Static Timing Analysis
[Figure: timing graph with edge delays; nodes are labeled arrival time / required arrival time / slack: 7/4/-3, 9/6/-3, 5/3/-2, 20/17/-3, 23/20/-3, 4/7/3, 18/18/0, 8/8/0, 11/11/0]
- Arrival time: input -> output, take max
- Required arrival time: output -> input, take min
- Slack = required arrival time - arrival time
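The three rules above amount to one forward pass and one backward pass over the timing graph in topological order. The sketch below applies them to a small hypothetical graph (it is not the graph from the slide, whose edge structure is not recoverable from the transcript); `sta` and all node names are illustrative.

```python
from collections import defaultdict

def sta(edges, input_at, output_rat):
    """Static timing analysis on a DAG.

    edges:      list of (u, v, delay)
    input_at:   arrival times at primary inputs
    output_rat: required arrival times at primary outputs
    Returns (arrival, required, slack) dictionaries.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v, d in edges:
        succ[u].append((v, d))
        pred[v].append((u, d))
        nodes.update((u, v))

    # Topological order (Kahn's algorithm); the list grows while we scan it.
    indeg = {n: len(pred[n]) for n in nodes}
    order = [n for n in sorted(nodes) if indeg[n] == 0]
    for n in order:
        for v, _ in succ[n]:
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)

    at = dict(input_at)
    for n in order:                      # input -> output, take max
        if pred[n]:
            at[n] = max(at[u] + d for u, d in pred[n])

    rat = dict(output_rat)
    for n in reversed(order):            # output -> input, take min
        if succ[n]:
            rat[n] = min(rat[v] - d for v, d in succ[n])

    slack = {n: rat[n] - at[n] for n in nodes}
    return at, rat, slack

edges = [("a", "c", 2), ("b", "c", 3), ("c", "d", 4), ("b", "d", 1)]
at, rat, slack = sta(edges, {"a": 0, "b": 0}, {"d": 6})
print(at, rat, slack)
```

In this example the critical path b -> c -> d has slack -1, while input a has slack 0.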
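18 Buffers Improve Slack
- RAT = Required Arrival Time; Slack = RAT - Delay
- Without buffer: one sink has RAT = 300, Delay = 350, Slack = -50; the other has RAT = 700, Delay = 600, Slack = 100; slackmin = -50
- With buffer: one sink has RAT = 300, Delay = 250, Slack = 50; the other has RAT = 700, Delay = 400, Slack = 300; slackmin = 50
- The buffer decouples capacitive load from the critical path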
19 ITRS projections
20 Buffered global interconnects: Intuition
- Interconnect delay = $r \cdot c \cdot l^2$
- Now, interconnect delay = $\sum_i r \cdot c \cdot l_i^2 < r \cdot c \cdot l^2$ (where $l = \sum_j l_j$), since $\sum_j (l_j^2) < (\sum_j l_j)^2$
- (Of course, account for buffer delay also)
21 Optimal inter-buffer length
- First order (lumped parasitic, Elmore delay) analysis
- Assume N identical buffers with equal inter-buffer length $l$ ($L = Nl$)
- For minimum delay,
22 Optimal interconnect delay
- Substituting $l_{opt}$ back into the interconnect delay expression
- Delay grows linearly with L (instead of quadratically)
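The slide's own expressions for the optimal length and delay are not present in this transcript. As an illustration only, the sketch below uses the distributed per-stage model from the "Buffers Reduce Wire Delay" slide, $R_b(cl + C_b) + rl(cl/2 + C_b) + t_b$, rather than the lumped model the slide names, so constants may differ from the slide. Under that assumption, setting $dT/dl = 0$ gives $l_{opt} = \sqrt{2(R_b C_b + t_b)/(rc)}$. All names and values are hypothetical.

```python
import math

def buffered_delay(L, n, r, c, Rb, Cb, tb):
    """Total delay of a length-L wire split into n identical buffered stages
    (assumed per-stage model; the driver is treated as one of the buffers)."""
    l = L / n
    return n * (Rb * (c * l + Cb) + r * l * (c * l / 2 + Cb) + tb)

def optimal_segment_length(r, c, Rb, Cb, tb):
    """l that minimizes (L/l) * stage_delay(l) under the assumed model."""
    return math.sqrt(2 * (Rb * Cb + tb) / (r * c))

r = c = Rb = Cb = tb = 1.0                           # illustrative unit values
l_opt = optimal_segment_length(r, c, Rb, Cb, tb)     # 2.0 for these values

d20 = buffered_delay(20.0, 10, r, c, Rb, Cb, tb)     # 10 stages of length 2
d40 = buffered_delay(40.0, 20, r, c, Rb, Cb, tb)     # 20 stages of length 2
unbuffered = buffered_delay(20.0, 1, r, c, Rb, Cb, tb)
print(l_opt, d20, d40, unbuffered)
```

Doubling L doubles the optimally buffered delay (80 to 160 here), whereas the single-stage delay for L = 20 is already 242 and grows quadratically.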
23 Optimized interconnect delay scaling
- Rewriting the optimal interconnect delay expression,
- With optimally sized buffers (using $dT/dh = 0$),
24 Optimized interconnect delay scaling
- After scaling, (instead of )
- Even with optimal (re-)buffering, interconnects scale worse than devices
- For global interconnects, L doesn't shrink. So
25 Buffered nets
26 Total buffer count
- Ever-increasing fractions of total cell count will be buffers
- 70% in 32nm
27 Buffer Insertion
- Timing optimization
- Slew optimization
28 Timing-Driven Buffering: Problem Formulation
- Given
- A Steiner tree
- RAT at each sink
- A buffer type
- RC parameters
- Candidate buffer locations
- Find buffer insertion solution such that the
slack at the driver is maximized
29 Candidate Buffering Solutions
30 Candidate Solution Characteristics
- Each candidate solution is associated with
- $v_i$: a node
- $c_i$: downstream capacitance
- $q_i$: RAT
- If $v_i$ is a sink, $c_i$ is the sink capacitance
[Figure: tree showing a sink $v_i$ and an internal node $v$]
31 Van Ginneken's Algorithm
- Candidate solutions are propagated toward the source
- Dynamic programming
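32 Solution Propagation: Add Wire
[Figure: wire of length x from (v1, c1, q1) up to (v2, c2, q2)]
- $c_2 = c_1 + cx$
- $q_2 = q_1 - rcx^2/2 - rxc_1$
- r: wire resistance per unit length
- c: wire capacitance per unit length
33 Solution Propagation: Insert Buffer
[Figure: buffer inserted at v1, turning (v1, c1, q1) into (v1, c1b, q1b)]
- $c_{1b} = C_b$
- $q_{1b} = q_1 - R_b c_1 - t_b$
- $C_b$: buffer input capacitance
- $R_b$: buffer output resistance
- $t_b$: buffer intrinsic delay
34 Solution Propagation: Merge
[Figure: left candidate (v, cl, ql) and right candidate (v, cr, qr) meeting at node v]
- $c_{merge} = c_l + c_r$
- $q_{merge} = \min(q_l, q_r)$
35 Solution Propagation: Add Driver
[Figure: driver added at v0, turning (v0, c0, q0) into (v0, c0d, q0d)]
- $q_{0d} = q_0 - R_d c_0 = slack_{min}$
- $R_d$: driver resistance
- Pick the solution with max $slack_{min}$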
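36 Example of Solution Propagation
- r = 1, c = 1
- Rb = 1, Cb = 1, tb = 1
- Rd = 1
- Two wire segments of length 2: from v1 to v2, and from v2 to v3
- Sink candidate: (v1, 1, 20)
- Add wire: (v2, 3, 16)
- Insert buffer at v2: (v2, 1, 12)
- Add wire to each candidate: (v3, 5, 8) without the buffer, (v3, 3, 8) with the buffer
- Add driver: slack = 8 - 5 = 3 without the buffer, slack = 8 - 3 = 5 with the buffer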
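The four propagation rules (add wire, insert buffer, merge, add driver) translate directly into small functions on (c, q) pairs. The sketch below is a minimal illustration, with hypothetical function names, that replays the single-path example above using its parameter values.

```python
def add_wire(cand, x, r, c):
    """Propagate (cap, q) across a wire of length x."""
    cap, q = cand
    return (cap + c * x, q - r * c * x * x / 2 - r * x * cap)

def insert_buffer(cand, Rb, Cb, tb):
    """Place a buffer in front of the candidate."""
    cap, q = cand
    return (Cb, q - Rb * cap - tb)

def merge(left, right):
    """Join two branch candidates at the same node."""
    return (left[0] + right[0], min(left[1], right[1]))

def add_driver(cand, Rd):
    """Slack at the driver for this candidate."""
    cap, q = cand
    return q - Rd * cap

r = c = 1.0
Rb = Cb = tb = 1.0
Rd = 1.0

sink = (1.0, 20.0)                            # (c, q) at v1
at_v2 = add_wire(sink, 2.0, r, c)             # unbuffered candidate at v2
at_v2_buf = insert_buffer(at_v2, Rb, Cb, tb)  # buffered candidate at v2
at_v3 = add_wire(at_v2, 2.0, r, c)
at_v3_buf = add_wire(at_v2_buf, 2.0, r, c)
slack_unbuf = add_driver(at_v3, Rd)
slack_buf = add_driver(at_v3_buf, Rd)
print(at_v2, at_v2_buf, at_v3, at_v3_buf, slack_unbuf, slack_buf)
```

The buffered candidate wins with slack 5 versus 3, so the algorithm would report the solution with a buffer at v2. `merge` is not exercised here because the example has no branching.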
37 Example of Merging
[Figure: left candidates and right candidates combined into merged candidates]
38 Solution Pruning
- Two candidate solutions
- (v, c1, q1)
- (v, c2, q2)
- Solution 1 is inferior if
- $c_1 > c_2$: larger load
- and $q_1 < q_2$: tighter timing
39 Pruning When Inserting a Buffer
- All buffered candidates have the same load capacitance $C_b$, so only the one with the maximum q is kept
40 Generating Candidates
From Dr. Charles Alpert
41 Pruning Candidates
42 Candidate Example Continued
43 Candidate Example Continued
After pruning
44 Merging Branches
45 Pruning Merged Branches
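46 Van Ginneken Example
- Candidates are written (c, q)
- Sink: (20, 400)
- Wire (C = 10, d = 150): (20, 400) -> (30, 250)
- Buffer (C = 5, d = 30): (30, 250) -> (5, 220)
- Wire (C = 15): (30, 250) -> (45, 50) with d = 200; (5, 220) -> (20, 100) with d = 120
- Buffer (C = 5): (45, 50) -> (5, 0) with d = 50; (20, 100) -> (5, 70) with d = 30
47 Van Ginneken Example Cont'd
- Candidates before pruning: (45, 50), (5, 0), (20, 100), (5, 70)
- (5, 0) is inferior to (5, 70); (45, 50) is inferior to (20, 100)
- Candidates after pruning: (20, 100), (5, 70)
- Wire (C = 10): (20, 100) -> (30, 10); (5, 70) -> (15, -10)
- Pick the solution with the largest slack, then follow the arrows back to recover the buffer placement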
48 Basic Data Structure
- Sorted list (c1, q1), (c2, q2), (c3, q3) such that
- $c_1 < c_2 < c_3$ (load capacitance gets worse along the list)
- If there are no inferior candidates, $q_1 < q_2 < q_3$ (timing gets better along the list)
49 Prune Solution List
[Figure: flowchart for pruning the list (c1, q1), (c2, q2), (c3, q3), (c4, q4), sorted by increasing c. Each test "q_i < q_j ?" compares the last kept candidate i with the next candidate j: on Y, candidate j is kept and becomes the new reference; on N, candidate j is pruned]
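The pruning pass can be written as a single scan over the list in order of increasing c, keeping a candidate only if its q exceeds that of the last kept candidate. This is a sketch under two assumptions: the function name `prune` is hypothetical, and ties are also pruned (for equal c only the largest q survives), which matches how the worked example treats (5, 0) against (5, 70).

```python
def prune(cands):
    """Remove inferior (c, q) candidates.

    Returns a list sorted by increasing c in which q is strictly increasing.
    """
    kept = []
    # Sort by c ascending; for equal c, larger q comes first so it is kept.
    for c, q in sorted(cands, key=lambda t: (t[0], -t[1])):
        if not kept or q > kept[-1][1]:
            kept.append((c, q))
    return kept

# Candidate set from the Van Ginneken example before pruning.
pruned = prune([(45, 50), (5, 0), (20, 100), (5, 70)])
print(pruned)
```

The scan itself is linear; the sort is only needed if the input is not already ordered by c.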
50 Pruning In Merging
- Left candidates: (cl1, ql1), (cl2, ql2), (cl3, ql3)
- Right candidates: (cr1, qr1), (cr2, qr2)
- Assume $q_{l1} < q_{l2} < q_{r1} < q_{l3} < q_{r2}$
- Merged candidates: (cl1 + cr1, ql1), (cl2 + cr1, ql2), (cl3 + cr1, qr1), (cl3 + cr2, ql3)
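The merged list above can be produced by a two-pointer walk over the two pruned lists: pair the current left and right candidates, then advance whichever side limits q. The sketch below assumes both inputs are already pruned (sorted by increasing c and q); `merge_lists` and the numeric values are illustrative.

```python
def merge_lists(left, right):
    """Merge two pruned (c, q) lists at a branch point in linear time."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        (cl, ql), (cr, qr) = left[i], right[j]
        out.append((cl + cr, min(ql, qr)))
        # Advance the side with the smaller q: moving the other side on
        # could only add capacitance without improving min(ql, qr).
        if ql <= qr:
            i += 1
        if qr <= ql:
            j += 1
    return out

# Values chosen so that ql1 < ql2 < qr1 < ql3 < qr2, as on the slide.
left = [(1, 10), (2, 20), (3, 40)]
right = [(1, 30), (2, 50)]
merged = merge_lists(left, right)
print(merged)
```

The output has at most len(left) + len(right) - 1 entries, which is why merging is additive rather than multiplicative in the number of candidates.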
51 Van Ginneken Complexity
- Generate candidates from sinks to source
- Quadratic runtime
- Adding a wire does not change candidates
- Adding a buffer adds only one new candidate
- Merging branches additive, not multiplicative
- Linear time solution list pruning
- Optimal for Elmore delay model
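52 Multiple Buffer Types
http://vlsitechnology.org/html/cells/vsclib013
- r = 1, c = 1
- Rb = 1, Cb = 1, tb = 1
- Rb2 = 0.5, Cb2 = 2, tb2 = 0.5
- Rd = 1
- Sink candidate (v1, 1, 20); a wire of length 2 gives (v2, 3, 16)
- Inserting buffer type 1 at v2 gives (v2, 1, 12)
- Inserting buffer type 2 at v2 gives (v2, 2, 14)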
53 Handle Polarity
[Figure: candidate solutions marked with negative (-) or positive polarity]
54 Consider Cost/Power
- A solution is also characterized by cost w
- A solution is inferior if it is poor on all of c, q and w
- At the source, a set of solutions with a tradeoff between q and w
- w can be
- total capacitance
- or the number of buffers
55 Cost-Slack Trade-off
56 Data Organization
[Figure: candidate lists indexed by the number of buffers inserted (0 to 4), holding candidates (c1, q1) through (c11, q11); each list is sorted in ascending order of (c, q)]
57 Pruning Considering Cost
- $(c_i, q_i, w_i)$ is inferior to $(c_k, q_k, w_k)$ if $c_i > c_k$, $q_i < q_k$, $w_i > w_k$
- Pruning within a list is the same as before
[Figure: candidate lists for w = 0, 1, 2, each holding three (c, q) candidates, with the prune order indicated along the w axis]
How to prune a solution with wk from a set of
solutions with w ? wk?
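For reference, the three-way inferiority rule can be applied by brute force, comparing every pair of candidates. This is only a quadratic baseline that makes the rule concrete; it is not the list-organized method the slide is asking about. Function names and values are hypothetical.

```python
def is_inferior(b, a):
    """True if b = (c, q, w) is inferior to a: larger c, smaller q, larger w."""
    return b[0] > a[0] and b[1] < a[1] and b[2] > a[2]

def prune_with_cost(cands):
    """Keep candidates that are not inferior to any other (O(n^2) baseline)."""
    return [b for b in cands if not any(is_inferior(b, a) for a in cands)]

cands = [(3, 10, 1), (5, 8, 2), (2, 12, 3), (4, 9, 0)]
survivors = prune_with_cost(cands)
print(survivors)
```

Here (5, 8, 2) is removed because (3, 10, 1) has a smaller load, better timing, and lower cost; the other three each win on at least one dimension.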
58 Blockage Recognition
- Delete insertion points that run over blockages
59 References
- L.P.P.P. van Ginneken, "Buffer placement in distributed RC-tree networks for minimal Elmore delay," ISCAS 1990, pp. 865-868.
- J. Lillis, C.-K. Cheng, and T. T. Lin, "Optimal wire sizing and buffer insertion for low power and generalized delay model," IEEE J. Solid-State Circuits, 31(3), pp. 437-447, 1996.
- W. Shi and Z. Li, "An O(n log n) time algorithm for optimal buffer insertion," Proc. DAC 2003, pp. 580-585.