Title: ECE260B CSE241A Winter 2005 Clocking
1ECE260B CSE241A Winter 2005 Clocking
Website: http://vlsicad.ucsd.edu/courses/ece260b-w05
Slides courtesy of Prof. Andrew B. Kahng
2Outline
- Problem Statement
- Clock Distribution Structures
- Robustness / Signal Integrity Control
- Clock Design
- Skew Scheduling
- Topology Construction
- Embedding
3Why Clocks?
- Clocks provide the means to synchronize
- By allowing events to happen at known timing boundaries, we can sequence these events
- Greatly simplifies building of state machines
- No need to worry about variable delay through combinational logic (CL)
- All signals delayed until clock edge (clock imposes the worst case delay)
[Figure: dataflow and FSM structures built from combinational logic blocks and registers]
Courtesy K. Yang, UCLA
4Clock Distribution Network
- General goal of clock distribution
- Deliver clock to all memory elements with acceptable skew
- Deliver clock edges with acceptable sharpness
- Clocking network design is one of the greatest challenges in the design of a large chip
- Consumes up to 1/3 of chip power
- Accurate signal delay
- Signal integrity
- Subject to uncertainty / variation of different processes / operating conditions
5Clock Design Components
- Oscillator
- Dividers
- Buffers
- Strong drivers
- Reduce delay
- Signal integrity / slew rate
- Interconnects
- Balanced trees, meshes, etc.
- Shielding (e.g., for crosstalk reduction)
- Non-tree links / feedback loops
6Clock Distribution Objective
- Minimum / bounded skew
- performance / hold time requirements
- Guaranteed slew rate / signal integrity
- Small insertion delay
- Robustness under process / operating condition variation
- Minimum cell / routing area
- Minimum power consumption
7Clock Distribution Robustness Subject to
- Radically different loading (flip-flop density)
- Across the die
- ECO (Engineering Change Order)
- Interconnect coupling
- Signal integrity
- Delay variation
- Process variation
- From lot-to-lot
- Across the die
- Buffers
- Metal width
- Supply voltage variation across the die
- Both static IR drop
- Dynamic voltage drop
- Temperature
8Issues in Clock Distribution Network Design
- Skew
- Process, voltage, and temperature
- Data dependence
- Noise coupling
- Load balancing
- Power, $CV^2 f$ (consumes up to 1/3 of total chip power)
- Clock gating
- Flexibility/Tunability
- Compactness: fit into existing layout/design
- Facilitate ECO
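The power term above can be made concrete with a small calculation. This is a minimal sketch, assuming the clock net toggles every cycle so that its dynamic power is C * V^2 * f. The function name, the `gated_fraction` parameter (share of clock capacitance switched off by clock gating), and the numbers are illustrative assumptions, not values from the slides.

```python
def clock_power(c_switched, vdd, freq, gated_fraction=0.0):
    """Dynamic clock power C * V^2 * f, scaled by the share of capacitance still toggling.

    gated_fraction is a hypothetical knob: 0.0 means no clock gating,
    0.5 means half of the clock capacitance is shut off.
    """
    return (1.0 - gated_fraction) * c_switched * vdd ** 2 * freq


# Illustrative numbers only: 2 nF of clock capacitance, 1.2 V supply, 1 GHz clock.
ungated = clock_power(2e-9, 1.2, 1e9)
half_gated = clock_power(2e-9, 1.2, 1e9, gated_fraction=0.5)
print(f"ungated: {ungated:.2f} W, half gated: {half_gated:.2f} W")
```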
9Skew: Clock Delay Varies With Position
10Clock Skew Causes
- Designed (unavoidable) variations: mismatch in buffer load sizes, interconnect lengths
- Process variation: process spread across die yielding different Leff, Tox, etc. values
- Temperature gradients: changes MOSFET performance across die
- IR voltage drop in power supply: changes MOSFET performance across die
- Note: Delay from clock generator to fan-out points (clock latency) is not important by itself
- BUT: increased latency leads to larger skew for same amount of relative variation
Sylvester / Shepard, 2001
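The last point (larger latency gives larger skew for the same relative variation) is simple arithmetic. The sketch below assumes two nominally balanced clock paths whose delays can each deviate by the same relative amount in opposite directions. The function name and the sample numbers are illustrative assumptions.

```python
def skew_from_variation(latency_ns, relative_variation):
    """Worst-case skew between two nominally equal paths.

    One path is assumed slow by +relative_variation and the other fast by
    -relative_variation, so the skew grows linearly with the latency.
    """
    slow = latency_ns * (1 + relative_variation)
    fast = latency_ns * (1 - relative_variation)
    return slow - fast


for latency in (0.5, 1.0, 2.0):
    print(f"latency {latency} ns -> skew {skew_from_variation(latency, 0.10):.2f} ns")
```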
11Outline
- Problem Statement
- Clock Distribution Structures
- Robustness / Signal Integrity Control
- Clock Design
- Skew Scheduling
- Topology Construction
- Embedding
12Clock Distribution Structures
- Grids
- Reliable
- Less data dependency
- Tunable (late in design)
- RC-Tree
- Less capacitance
- More accuracy
- Flexible wiring
- Shown here for final stage drivers driving F/F
loads
13Grids
- Gridded clock distribution common on earlier DEC Alpha microprocessors
- Advantages
- Skew determined by grid density, not too sensitive to load position
- Clock signals available everywhere
- Tolerant to process variations
- Usually yields extremely low skew values
- Disadvantages
- Huge amount of wiring and power
- To minimize such penalties, need to make grid pitch coarser → lose the grid advantage
[Figure: pre-drivers feeding the global grid]
Sylvester / Shepard, 2001
14H-Tree
- H-tree (Bakoglu)
- One large central driver, recursive structure to match wirelengths
- Halve wire width at branching points to reduce reflections
- Disadvantages
- Slew degradation along long RC paths
- Unrealistically large central driver
- Clock drivers can create large temperature gradients (e.g., Alpha 21064: 30 °C)
- Non-uniform load distribution
- Inherently non-scalable (wire R growth)
- Partial solution: intermediate buffers at branching points
courtesy of P. Zarkesh-Ha
Sylvester / Shepard, 2001
15Buffered H-tree
- Advantages
- Ideally zero-skew
- Can be low power (depending on skew requirements)
- Low area (silicon and wiring)
- CAD tool friendly (regular)
- Disadvantages
- Sensitive to process variations
- Devices → want same size buffers at each level of tree
- Wires → want similar segment lengths on each layer in each source-sink path !!!
- Local clocking loads inherently non-uniform
Sylvester / Shepard, 2001
16Tree Balancing
Con: Routing area often more valuable than silicon
Some techniques: a) Introduce dummy loads b) Snaking of wirelength to match delays
Sylvester / Shepard, 2001
17Examples From Processor Chips
- Grids: DEC Alphas
- Serpentines: Intel x86 (Young, ISSCC97)
- H-Tree, Asymmetric RC-Tree (IBM)
18Example Skews From Processor Chips
DEC-Alpha 21064 clock spines
DEC-Alpha 21064 RC delays
DEC-Alpha 21164 RC local delays
DEC-Alpha 21164 RC delays for Global Distribution
(Spine Grid)
19ReShape Clocks Example (High-End ASIC)
- Balanced, shielded H-tree for pre-clock distribution
- Mesh for block level distribution
- All routes 5-6u M6/5, shielded with 1u grounds
- 10 buffers per node
- E.g., ganged BUFx20s
- Output mesh must hit every sub-block
20Block Level Mesh (.18u)
21Problems with Meshes
- Burn more power at low frequencies
- Blocks more routing resources (solution: integrated power distribution with ribs can provide shielding for free)
- Difficult for spare clock domains that will not tolerate regioning
- Post placement (and routing) tuning required
- No beneficial skew possible
- Clock gating only easy at root
- Fighting tools to do analysis
- Clumped buffers a problem in Static Timing Analysis tools
- Large shorted meshes a problem for STA tools
- What does Elmore delay calculation look like for a non-tree?
- → Need full extraction and SPICE-like simulation to determine skew
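For contrast with the non-tree case, here is what the Elmore delay calculation looks like on an RC tree, where it reduces to two linear passes. This is a minimal sketch under assumed conventions: node 0 is the root, `parent[i] < i`, and the element values are arbitrary illustrative units. It does not apply to a shorted mesh, which has loops and no unique root-to-sink path.

```python
def elmore_delays(parent, res, cap):
    """Elmore delay from the root (node 0) to every node of an RC tree.

    parent[i] is the parent of node i (parent[0] is unused); nodes are
    numbered so that parent[i] < i. res[i] is the resistance of the wire
    (or driver) feeding node i, cap[i] the capacitance lumped at node i.
    """
    n = len(parent)
    downstream = list(cap)
    for i in range(n - 1, 0, -1):          # children before parents
        downstream[parent[i]] += downstream[i]
    delay = [0.0] * n
    for i in range(1, n):                  # parents before children
        delay[i] = delay[parent[i]] + res[i] * downstream[i]
    return delay


# Hypothetical two-level tree: node 1 is the branch point, nodes 2 and 3 are sinks.
parent = [0, 0, 1, 1]
res = [0.0, 1.0, 2.0, 2.0]
cap = [0.0, 1.0, 1.0, 1.0]
delays = elmore_delays(parent, res, cap)
skew = abs(delays[2] - delays[3])
print(delays, skew)
```

Each edge contributes its resistance times all capacitance downstream of it, so a symmetric tree gives equal sink delays and making one branch more resistive shows up directly as skew.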
22Benefits of Meshes
- Deterministic since shielded all the way down to rib distribution
- No ECO placement required: all buffers preplaced before block placement
- Low latency since uses shorted (= ganged, parallel) drivers, therefore lower skew
- ECO placements of FFs later do not require rebalancing of tree
- Idealized clocking environment for concurrent dance of RTL design and timing convergence
23Hybrid Structure
- Balanced tree on the top
- Mesh in the middle
- Minimize skew
- Steiner minimum tree at the bottom
- Minimize cost
- Facilitate ECO
24Outline
- Problem Statement
- Clock Distribution Structures
- Robustness / Signal Integrity Control
- Clock Design
- Skew Scheduling
- Topology Construction
- Embedding
25Process Variation
- Intra-die and inter-die variations
- Intra-die variation is increasingly significant since 0.13um technology
- Systematic and random variations
- Systematic variation is due to equipment, process, etc.
- Global lens aberration in lithography causes systematic variation
- Pattern-dependent: optical proximity, chemical mechanical polish (CMP)
- Random variation is due to inherent variation
- Spatial correlation across a chip
- Fast vs. slow corners
26Process Variation
- Metal wires
- Width variation can be estimated by LUT(width, spacing)
- Thickness variation depends on CMP, which depends on local density
- Thickness variation also depends on wire width and spacing
- Could be up to 30-40% in 90nm process
- Transistors
- Channel length variation (delay $\propto L^{1.5}$)
- Thin gate oxide $t_{ox}$ variation → $V_{th}$ variation
- Up to 30% variation in terms of driving capability
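To get a feel for the channel-length sensitivity, the sketch below evaluates the delay change implied by a power-law dependence of delay on L with exponent 1.5. Reading the slide's relation as delay proportional to L^1.5 is an assumption, and the function name and the 10% figure are illustrative.

```python
def relative_delay_change(dl_over_l, exponent=1.5):
    """Fractional delay change for a fractional channel-length change,
    assuming delay is proportional to L**exponent."""
    return (1 + dl_over_l) ** exponent - 1


change = relative_delay_change(0.10)
print(f"+10% channel length -> {100 * change:.1f}% more delay")
```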
27Process Variations SPICE model
- Process variations are reflected into a statistical SPICE model
- Usually only a few parameters have a statistical distribution (e.g. DL, DW, TOX, VTn, VTp) and the others are set to a nominal value
- The nominal SPICE model is obtained by setting the statistical parameters to their nominal value
Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer
of UCB
28Global Variations (Inter-die)
- Process variations → Performance variations
- Critical path delay of a 16-bit adder
All devices have the same set of model parameter values
Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer
of UCB
29Local Variations (Intra-die)
- Each device instance has a slightly different set of model parameter values (aka device mismatch)
- The performance of some analog circuits strongly depends on the degree of matching of device properties
- Digital circuits are in general more immune to mismatch, but clock distribution network is sensitive (clock skew)
Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer
of UCB
30Statistical Design
- Need to account for process variations during
design phase
- Statistical design
- Nominal design
- Yield optimization
- Design centering
Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer
of UCB
31Statistical Design
Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer
of UCB
32Process Variation Tolerance Enhancement
- Rule of thumb: balanced tree
- Identical buffers at identical heights
- Drive identical subtree loads
- Can we do better than this?
- Process variation tolerant clock design
- Bounded-skew DME
- Topology construction
- With process variation tolerance in objective
- Useful skew scheduling
- To the center of permissible ranges
33Signal Integrity
- Crosstalk
- Capacitive, inductive
- Supply voltage drop
- IR, L dI/dt, LC resonance
- Temperature
- Increased resistance with higher temperature
- Substrate coupling
- Parasitic resistance, capacitance in the
substrate layer
34Crosstalk
- Due to the coupling capacitance between
interconnections, a signal switching on a net
(aggressor) may affect the voltage waveform on a
neighboring net (victim)
Noise Propagation
Increased Delay
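Both effects named above (noise on the victim and increased delay) can be estimated with first-order capacitance models. This is a sketch under stated assumptions that do not come from the slides: the noise estimate is the charge-sharing bound for a floating or weakly driven victim, and the delay estimate uses a Miller factor of 0, 1, or 2 on the coupling capacitance depending on whether the aggressor switches with, not at all, or against the victim. Function names and values are illustrative.

```python
def victim_noise_peak(vdd, c_couple, c_ground):
    """Upper bound on the victim glitch: capacitive divider between the
    coupling capacitance and the victim's capacitance to ground."""
    return vdd * c_couple / (c_couple + c_ground)


def effective_victim_cap(c_ground, c_couple, miller_factor):
    """Capacitance seen by the victim driver while the aggressor switches.

    miller_factor: 0 (same direction), 1 (aggressor quiet), 2 (opposite direction).
    """
    return c_ground + miller_factor * c_couple


noise = victim_noise_peak(vdd=1.0, c_couple=1.0, c_ground=3.0)
worst_cap = effective_victim_cap(c_ground=3.0, c_couple=1.0, miller_factor=2)
print(noise, worst_cap)
```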
35Circuit Model for Crosstalk
36Crosstalk Simulation
37Design for Crosstalk
- It can be both capacitive and inductive
- Capacitive is dominant at current switching speeds
- To reduce it
- Use of shielding layer (inter-layer)
- Use of shielding wire (intra-layer)
38Clock Gating
- Reduce power consumption by temporarily shutting down part of the circuit
- Additional cost of enabling circuits
[Figure: gated-clock circuit; labels: FF, FF, combinational logic, D, Q, CLK1, CLK2, CLK, ENABLING]
39Outline
- Problem Statement
- Clock Distribution Structures
- Robustness / Signal Integrity Control
- Clock Design
- Skew Scheduling
- Topology Construction
- Embedding
40Skew Local Constraint
- Timing is correct as long as the clock signals of
sequentially adjacent FFs arrive within a
permissible skew range
W. Dai, UC Santa Cruz
41Useful Skew → Design Robustness
- Design will be more robust if clock signal arrival time is in the middle of permissible skew range, rather than on edge
[Figure: timing example with T = 6 ns; skews 0, 0, 0 are at the verge of violation]
W. Dai, UC Santa Cruz
42Constraints on Skews
- $FF_i$ receives clock signal delayed by $x_i \ge \mathrm{MIN\_DEL}$
- $0 < \alpha \le 1 \le \beta$: if nominal clock delay is $x_i$, then actual clock delay $x$ must fall within interval $\alpha x_i \le x \le \beta x_i$
- For FF to operate correctly when clock edge arrives at time $x$, the correct input data must be present and stable during the time interval $(x - \mathrm{SETUP},\ x + \mathrm{HOLD})$
- For $1 \le i, j \le L$ (FFs), we compute lower and upper bounds $\mathrm{MIN}(i,j)$ and $\mathrm{MAX}(i,j)$ for the time that is required for a signal edge to propagate from $FF_i$ to $FF_j$
- Avoid double-clocking (race condition)
- $\alpha x_i + \mathrm{MIN}(i,j) \ge \beta x_j + \mathrm{HOLD}$
- Avoid zero-clocking
- $\beta x_i + \mathrm{SETUP} + \mathrm{MAX}(i,j) \le \alpha x_j + P$, where $P$ = clock period
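The two conditions on this slide can be checked mechanically for a candidate skew schedule. The sketch below encodes double-clocking as alpha*x_i + MIN(i,j) >= beta*x_j + HOLD and zero-clocking as beta*x_i + SETUP + MAX(i,j) <= alpha*x_j + P. The function name, data layout, and the two-flip-flop example are illustrative assumptions.

```python
def check_schedule(x, paths, alpha, beta, setup, hold, period):
    """Return the list of violated constraints for clock delays x.

    x: dict mapping flip-flop id -> nominal clock delay x_i
    paths: dict mapping (i, j) -> (MIN(i, j), MAX(i, j)) for each FF_i -> FF_j path
    """
    violations = []
    for (i, j), (dmin, dmax) in paths.items():
        # Double-clocking: earliest new data must not beat the hold window at FF_j.
        if alpha * x[i] + dmin < beta * x[j] + hold:
            violations.append(("double-clocking", i, j))
        # Zero-clocking: latest data must arrive a setup time before the next edge at FF_j.
        if beta * x[i] + setup + dmax > alpha * x[j] + period:
            violations.append(("zero-clocking", i, j))
    return violations


# Hypothetical single path FF1 -> FF2 with MIN = 1.0 and MAX = 5.0 (arbitrary time units).
paths = {(1, 2): (1.0, 5.0)}
zero_skew = check_schedule({1: 0.0, 2: 0.0}, paths, 1.0, 1.0, 0.5, 0.5, period=5.0)
useful_skew = check_schedule({1: 0.0, 2: 0.5}, paths, 1.0, 1.0, 0.5, 0.5, period=5.0)
print(zero_skew, useful_skew)
```

In this example a period of 5.0 fails with zero skew but passes once FF2's clock is delayed by 0.5, which is the useful-skew effect in miniature.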
43Optimal Useful Skews by Linear Programming
- LP_SPEED (clock period reduction)
- Minimize $P$ s.t.
- $\alpha x_i - \beta x_j \ge \mathrm{HOLD} - \mathrm{MIN}(i,j)$
- $\alpha x_j - \beta x_i + P \ge \mathrm{SETUP} + \mathrm{MAX}(i,j)$
- $x_i \ge \mathrm{MIN\_DEL}$
- LP_SAFETY (robustness)
- Maximize $M$ s.t.
- $\alpha x_i - \beta x_j - M \ge \mathrm{HOLD} - \mathrm{MIN}(i,j)$
- $\alpha x_j - \beta x_i - M \ge \mathrm{SETUP} + \mathrm{MAX}(i,j) - P$
- $x_i \ge \mathrm{MIN\_DEL}$
- Notes
- J. P. Fishburn, "Clock Skew Optimization", IEEE Trans. Computers 39(7) (1990), pp. 945-951.
- T. G. Szymanski, "Computing Optimal Clock Schedules", Proc. DAC, June 1992, pp. 399-404.
- Useful skew optimization is similar to retiming optimization
- Peak current reductions are a side benefit
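As a rough illustration of LP_SPEED, the sketch below does not call an LP solver. It assumes alpha = beta = 1, in which case every constraint has the form x_v - x_u <= w (a difference constraint), so feasibility for a given period P can be tested with Bellman-Ford negative-cycle detection and the smallest P found by binary search. This is a substituted technique for the special case, not the general LP on the slide. All names and the two-flip-flop example are illustrative assumptions.

```python
def feasible(n, paths, setup, hold, period):
    """Return clock delays satisfying all constraints for this period, or None.

    With alpha = beta = 1 the constraints are
      x_j - x_i <= MIN(i, j) - HOLD            (no double-clocking)
      x_i - x_j <= period - SETUP - MAX(i, j)  (no zero-clocking)
    """
    edges = []
    for (i, j), (dmin, dmax) in paths.items():
        edges.append((i, j, dmin - hold))
        edges.append((j, i, period - setup - dmax))
    dist = [0.0] * n                      # implicit source with 0-weight edge to every FF
    for _ in range(n):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return dist
    return None                           # still relaxing after n passes: negative cycle


def min_period(n, paths, setup, hold, min_del=0.0, tol=1e-6):
    """Binary-search the smallest feasible period; returns (P, schedule) or None."""
    hi = sum(setup + dmax + abs(dmin - hold) for dmin, dmax in paths.values()) + 1.0
    if feasible(n, paths, setup, hold, hi) is None:
        return None                       # hold constraints alone are contradictory
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(n, paths, setup, hold, mid) is None:
            lo = mid
        else:
            hi = mid
    x = feasible(n, paths, setup, hold, hi)
    shift = min_del - min(x)              # constraints only involve differences, so shifting is safe
    return hi, [xi + shift for xi in x]


# Hypothetical loop FF0 -> FF1 -> FF0, values are (MIN, MAX) in arbitrary time units.
paths = {(0, 1): (2.0, 8.0), (1, 0): (2.0, 4.0)}
period, schedule = min_period(2, paths, setup=1.0, hold=1.0)
print(round(period, 3), [round(v, 3) for v in schedule])
```

With zero skew this example needs a period of 9 (the 8-unit path plus setup); delaying FF1's clock by about 1 unit brings it down to about 8, at which point the hold constraint on the FF0 to FF1 path becomes the limit.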
44Outline
- Problem Statement
- Clock Distribution Structures
- Robustness / Signal Integrity Control
- Clock Design
- Skew Scheduling
- Topology Design
- Embedding
- For zero skew (ZST-DME)
- For bounded skew (BST-DME)
45Zero-Skew Tree (ZST) Problem
- Zero Skew Clock Routing Problem (S,G): Given a set S of sink locations and a connection topology G, construct a ZST T(S) with topology G and having minimum cost.
- Skew = maximum value of $|t_d(s_0,s_i) - t_d(s_0,s_j)|$ over all sink pairs $s_i$, $s_j$ in S.
- $t_d$ = signal delay (from source $s_0$)
- Connection topology G = rooted binary tree with nodes of S as leaves
- Edge $e_a$ in G is the edge from $a$ to its parent
- $|e_a|$ is the (assigned) length of edge $e_a$
- Cost = total edge length
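The skew and cost definitions can be evaluated directly on a small tree. This is a minimal sketch assuming the linear delay model (delay equals path length), which is one possible choice for the delay function. The node names, edge lengths, and function name are hypothetical.

```python
def skew_and_cost(parent, length, sinks):
    """Skew and cost of a clock tree under the linear delay model.

    parent[v]: parent of node v (None for the source)
    length[v]: assigned length of the edge from v to its parent
    """
    def delay(v):
        total = 0.0
        while parent[v] is not None:
            total += length[v]
            v = parent[v]
        return total

    delays = [delay(s) for s in sinks]
    # max over sink pairs of |t_i - t_j| equals max delay minus min delay
    return max(delays) - min(delays), sum(length.values())


# Hypothetical topology: source s0 with children a and b, four sinks below them.
parent = {"s0": None, "a": "s0", "b": "s0", "s1": "a", "s2": "a", "s3": "b", "s4": "b"}
length = {"a": 3, "b": 4, "s1": 2, "s2": 2, "s3": 1, "s4": 1}
skew, cost = skew_and_cost(parent, length, ["s1", "s2", "s3", "s4"])
print(skew, cost)
```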
46Zero-Skew Example (555 sinks, 40 obstacles)
47A Zero-Skew Routing Algorithm
- Finds a ZST under linear delay model with minimum cost over all ZSTs with topology G and sink set S
- Terms
- Manhattan arc = line segment with slope +1 or -1
- Tilted Rectangular Region (TRR) = collection of points within a fixed distance of a Manhattan arc
- Core = Manhattan arc
- Radius = distance
- Merging segment = locus of feasible locations for a node $v$ in the topology, consistent with minimum wirelength
- If $v$ is a sink, then $ms(v) = \{v\}$
- If $v$ is an internal node, then $ms(v)$ is the set of all points within distance $|e_a|$ of $ms(a)$, and within distance $|e_b|$ of $ms(b)$
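One convenient way to handle TRRs in code is to rotate coordinates by 45 degrees. With u = x + y and v = x - y, Manhattan distance in (x, y) becomes Chebyshev distance in (u, v), a Manhattan arc becomes an axis-parallel segment, and a TRR becomes an axis-aligned rectangle, so intersection is a few min/max operations. The sketch below uses that representation; it is an implementation choice rather than something stated on the slides, and the names are illustrative.

```python
from typing import NamedTuple, Optional


class TRR(NamedTuple):
    """Axis-aligned rectangle in rotated coordinates u = x + y, v = x - y."""
    ulo: float
    uhi: float
    vlo: float
    vhi: float


def to_uv(x, y):
    return (x + y, x - y)


def to_xy(u, v):
    return ((u + v) / 2, (u - v) / 2)


def trr_around_point(x, y, radius):
    """TRR whose core is the single point (x, y)."""
    u, v = to_uv(x, y)
    return TRR(u - radius, u + radius, v - radius, v + radius)


def intersect(a: TRR, b: TRR) -> Optional[TRR]:
    """Constant-time TRR intersection; None if the regions do not meet."""
    r = TRR(max(a.ulo, b.ulo), min(a.uhi, b.uhi), max(a.vlo, b.vlo), min(a.vhi, b.vhi))
    return r if r.ulo <= r.uhi and r.vlo <= r.vhi else None


# Two hypothetical sinks at (0, 0) and (2, 2): Manhattan distance 4, so radius 2 each.
ms = intersect(trr_around_point(0, 0, 2), trr_around_point(2, 2, 2))
ends = (to_xy(ms.ulo, ms.vlo), to_xy(ms.uhi, ms.vhi))
print(ms, ends)
```

The result is degenerate in u, so it is a Manhattan arc running from (0, 2) to (2, 0); every point on it is at distance 2 from both sinks.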
48Phase 1 Tree of Merging Segments
- Goal: Construct a tree of merging segments corresponding to topology G
- Merging segment of a node depends on merging segment of its children → bottom-up construction
- Let $a$, $b$ be children of $v$. We want placements of $v$ that allow $TS_a$ and $TS_b$ to be merged with minimum added wire while preserving zero skew
- Merging cost = $|e_a| + |e_b|$
- Fact: The intersection of two TRRs is also a TRR and can be found in constant time
- Constant time per each new merging segment → linear time (in size of S) to construct entire tree
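Under the linear delay model, the edge lengths for one merge follow from balancing the two sides. If the subtrees at a and b have sink delays t_a and t_b and their merging segments are distance d apart, then zero skew needs t_a + |e_a| = t_b + |e_b|. When the delay difference is at most d, the two lengths sum to d; otherwise the faster side gets a detour (snaked wire) and the merging cost exceeds d. The sketch below works this out; the function name and sample values are illustrative assumptions.

```python
def merge_lengths(t_a, t_b, d):
    """Edge lengths (e_a, e_b) and merged delay t_v for one zero-skew merge.

    t_a, t_b: delay from a (resp. b) to the sinks in its subtree
    d: Manhattan distance between ms(a) and ms(b)
    """
    e_a = (d + t_b - t_a) / 2
    if e_a < 0:              # a is already too slow: place v on ms(a), detour toward b
        e_a, e_b = 0.0, t_a - t_b
    elif e_a > d:            # b is already too slow: place v on ms(b), detour toward a
        e_a, e_b = t_b - t_a, 0.0
    else:
        e_b = d - e_a
    return e_a, e_b, t_a + e_a


print(merge_lengths(0, 0, 4))   # balanced sinks: split the distance evenly
print(merge_lengths(1, 3, 4))   # b is slower, so v sits closer to b
print(merge_lengths(5, 0, 3))   # difference exceeds d: wire to b is snaked
```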
49Phase 2 Find Node Placements
- Goal: Find exact locations (embeddings) $pl(v)$ of internal nodes $v$ in the ZST topology
- If $v$ is the root node, then any point on $ms(v)$ can be chosen as $pl(v)$
- If $v$ is an internal node other than the root, and $p$ is the parent of $v$, then $v$ can be embedded at any point in $ms(v)$ that is at distance $|e_v|$ or less from $pl(p)$
- Detail: create square TRR $trr_p$ with radius $|e_v|$ and core equal to $pl(p)$; placement of $v$ can be any point in $ms(v) \cap trr_p$
- Each instruction is executed at most once for each node in G, and TRR intersection is O(1) time → Find_Exact_Placements is O(n) → DME is O(n)
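The top-down phase can be sketched as a single recursive pass. The code below assumes merging segments are stored as rectangles (ulo, uhi, vlo, vhi) in rotated coordinates u = x + y, v = x - y, where a TRR is axis-aligned, and that the bottom-up phase has already produced them. The four-sink instance, its hand-computed merging segments and edge lengths, and all names are hypothetical illustration data.

```python
def to_xy(u, v):
    return ((u + v) / 2, (u - v) / 2)


def intersect(a, b):
    return (max(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), min(a[3], b[3]))


def embed(node, parent_uv=None, out=None):
    """Choose pl(v) for every node, visiting each node exactly once.

    node: {"name", "ms": (ulo, uhi, vlo, vhi), "e": edge length to parent, "kids": [...]}
    """
    if out is None:
        out = {}
    region = node["ms"]
    if parent_uv is not None:
        pu, pv = parent_uv
        e = node["e"]
        # ms(v) intersected with the square TRR of radius |e_v| around pl(p)
        region = intersect(region, (pu - e, pu + e, pv - e, pv + e))
    uv = (region[0], region[2])            # any point of the region is valid; take one corner
    out[node["name"]] = to_xy(*uv)
    for kid in node.get("kids", []):
        embed(kid, uv, out)
    return out


def sink(name, x, y, e):
    return {"name": name, "ms": (x + y, x + y, x - y, x - y), "e": e}


# Hand-computed merging segments for sinks (0,0), (4,0), (0,6), (2,6).
tree = {"name": "v", "ms": (3, 5, -1, -1), "kids": [
    {"name": "a", "ms": (2, 2, 2, 2), "e": 3,
     "kids": [sink("s1", 0, 0, 2), sink("s2", 4, 0, 2)]},
    {"name": "b", "ms": (7, 7, -5, -5), "e": 4,
     "kids": [sink("s3", 0, 6, 1), sink("s4", 2, 6, 1)]},
]}
placements = embed(tree)
print(placements)
```

Here the root's merging segment runs from (1, 2) to (2, 3), and the sketch simply picks its first endpoint. Any other point on the segment would give a different embedding with the same total wirelength.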
50Outline
- Problem Statement
- Clock Distribution Structures
- Robustness / Signal Integrity Control
- Clock Design
- Skew Scheduling
- Topology Design
- Embedding
- For zero skew (ZST-DME)
- For bounded skew (BST-DME)
51Non-Zero Skew Bounds
- Given a skew bound, where can internal nodes of
the given topology (e.g., a, b, v) be placed?
[Figure: skew plots (axis ticks 0, 2, 4, 6) for nodes a, b, and v, shown next to the topology with source s0, internal nodes v, a, b, and sinks s1-s4]
52BST-DME Bottom-Up Phase
- Bottom-up: build tree of merging regions corresponding to given topology
[Figure: topology (source s0, internal nodes v, a, b, sinks s1-s4) and its layout for skew bound B = 4, showing merging regions mr(a), mr(b), mr(v)]
53BST-DME Top-Down Phase
[Figure: same topology and layout for skew bound B = 4, showing the placements of a, b, and v]
54Good Luck for the Mid-Term!