Clock Distribution - PowerPoint PPT Presentation

Title: Digital Devices. Author: Bob Reese. Last modified by: reese. Created: 8/18/1999 12:14:36 AM. Presentation format: On-screen Show.

Slides: 66
Provided by: BobR188

Transcript and Presenter's Notes

Title: Clock Distribution


1
Clock Distribution from Past to Present
  • A synchronous system needs a clock to which all
    signals are synchronized
  • The goal of the clock distribution network is to
    deliver a clock signal that arrives at the same
    time at equivalent points on the chip
  • Problem: clock skew
  • Clock skew: mismatches in wire delays cause
    differences in clock arrival times at equivalent
    points. The arrival time of the clock can only be
    predicted as Y +/- skew
  • Clock skew must be accounted for in the timing
    budget when determining whether path delays meet
    setup/hold constraints
  • Clock skew is THE problem; everything else, such
    as the power cost of the clock distribution
    system, comes from trying to solve the clock
    skew problem

2
Clock Distribution
[Figure: clk driven from a 0 ps reference point, arriving at 100 ps at each of four equivalent points on the chip.]
Want clk to arrive at the same time at equivalent
parts of the chip.
3
Clock Distribution (cont)
[Figure: same clock tree (0 ps reference point, 100 ps arrival at each of four points), with Data sent from one register to another.]
If the clock arrival time is known, delay constraints
can be computed accurately when sending data from one
register to another.
4
Clocking Regions
[Figure: clk driven from a 0 ps reference point, nominal arrival 100 ps. Arrival times at the clocked points: 98, 102, 99, 96, 92, 103, 97, 93 ps. Local skew within a region: 6 ps. Skew to neighboring region: 11 ps.]
The chip is divided into regions; the farther a
signal has to travel, the larger the skew budget.
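As a small illustration of how the skew figures relate to arrival times, the sketch below takes skew as the spread (max minus min) of clock arrival times. The eight arrival times are the ones listed on the slide; treating the first four as one region is only a guess about the figure's layout.

```python
# Clock arrival times in ps, relative to the 0 ps reference point (from the slide).
arrivals_ps = [98, 102, 99, 96, 92, 103, 97, 93]

def skew(arrivals):
    """Skew of a set of clocked points: latest arrival minus earliest arrival."""
    return max(arrivals) - min(arrivals)

# ASSUMPTION: the first four points form one clocking region.
local_skew_ps = skew(arrivals_ps[:4])

# Skew across every point shown, i.e. across neighboring regions as well.
overall_skew_ps = skew(arrivals_ps)

print(local_skew_ps, overall_skew_ps)
```

Under that grouping the local spread is 6 ps and the spread across all points is 11 ps, which matches the two numbers annotated on the figure.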
5
Skew Flip-Flops
Skew adds to both the setup and the hold constraint
when calculating timing.
Source: David Harrison
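A minimal sketch of how skew enters both checks, assuming the usual edge-triggered flip-flop timing model. The field names and all numeric values are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass

@dataclass
class Path:
    """Register-to-register path; all times in ps (hypothetical values)."""
    t_clk_to_q: float   # launch flip-flop clock-to-output delay
    t_logic_max: float  # longest combinational delay on the path
    t_logic_min: float  # shortest combinational delay on the path
    t_setup: float      # capture flip-flop setup time
    t_hold: float       # capture flip-flop hold time

def setup_slack(p, period, skew):
    # Worst case for setup: the capture clock arrives early by `skew`.
    return period - skew - (p.t_clk_to_q + p.t_logic_max + p.t_setup)

def hold_slack(p, skew):
    # Worst case for hold: the capture clock arrives late by `skew`.
    return (p.t_clk_to_q + p.t_logic_min) - (p.t_hold + skew)

example = Path(t_clk_to_q=100, t_logic_max=700, t_logic_min=60,
               t_setup=80, t_hold=50)
print(setup_slack(example, period=1000, skew=28))
print(hold_slack(example, skew=28))
```

Both slacks shrink by exactly the skew, which is why the skew has to be budgeted on both sides. A negative slack on either check means the path fails.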
6
Clock Distribution Evolution
  • Alpha 21064 (1993, 750 nm): 150-200 MHz, global
    clock, 1 driver, gridded clock, 180 ps clock skew
  • Alpha 21164 (1995, 500 nm): 300 MHz, global
    clock, 2 drivers, gridded clock, 80 ps clock
    skew
  • Alpha 21264 (1997, 350 nm): 500-600 MHz, clock
    distribution network with global/local/conditional
    clocks, deskew by delay insertion, gridded
    global clock, 65 ps clock skew
  • IA-64 Gen1 (2000, 180 nm): 800 MHz, H-tree global
    clock, active de-skew (distributed PLLs), 28 ps
    skew
  • IA-64 Gen2 (2002, 180 nm): 1 GHz, H-tree global
    clock, regional clocks, NO active deskew, 62 ps
    skew
  • IA-64 Gen3 (2003, 130 nm): 1.5 GHz, H-tree global
    clock, regional clocks, fuse-based deskew, 7 ps
    (scan based) to 24 ps (fuse based) skew

7
Clock Distribution Evolution (cont)
  • IA-64 Dual-core (2004, 90 nm): 2.5 GHz (Montecito),
    H-tree and balanced binary-tree routing, regional
    clocks, active regional deskew, 10 ps skew
  • Xeon Dual-core (2006, 65 nm): 3.4 GHz (Tulsa), two
    different clock systems
  • Core clocks (clocks for the processor cores) use the
    same core clock scheme as the Xeon Single
    Core (2003, 90 nm). This clock scheme was
    designed to scale up to 6 GHz, and used an H-tree
    distributed clock with shorted nodes that
    produced less than 10 ps skew. No active de-skew
    or fuse-based de-skew.
  • Un-core clock (everything outside the core):
    cache, bus logic, etc. Large area prevented use
    of a gridded clock (power restriction), so it used a
    clock tree (9 vertical, 2 horizontal) with
    fuse-based deskew at the root of each vertical spine.
    Achieved less than 11 ps skew.
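To compare generations on an equal footing, the skew can be expressed as a fraction of the clock period. The sketch below uses the frequency and skew figures from the two lists above. Where a frequency range is given it takes the upper end, and for IA-64 Gen3 it takes the fuse-based figure; both choices are assumptions made for illustration. The Xeon entries are omitted because only upper bounds on skew are given.

```python
# (name, clock frequency in MHz, skew in ps), figures from the lists above.
generations = [
    ("Alpha 21064", 200, 180),
    ("Alpha 21164", 300, 80),
    ("Alpha 21264", 600, 65),
    ("IA-64 Gen1", 800, 28),
    ("IA-64 Gen2", 1000, 62),
    ("IA-64 Gen3 (fuse-based)", 1500, 24),
    ("IA-64 Dual-core (Montecito)", 2500, 10),
]

def skew_percent(freq_mhz, skew_ps):
    """Skew as a percentage of one clock period."""
    period_ps = 1e6 / freq_mhz  # 1 MHz corresponds to a 1e6 ps period
    return 100.0 * skew_ps / period_ps

for name, freq_mhz, skew_ps in generations:
    print(f"{name:28s} {skew_percent(freq_mhz, skew_ps):4.1f}% of period")
```

Absolute skew falls by more than an order of magnitude across these designs, but so does the clock period, so the fraction of the cycle lost to skew stays within a few percent throughout.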

8
Definition: Gridded Clock
In early clock distribution systems, large
drivers and a metal clock grid were used for clock
distribution. Subsystems just tapped into the
clock grid for connectivity. Easy to do, but it
takes a lot of power and chews up routing resources.
(Grid density is exaggerated in this picture;
there is a lot more white space than is shown.)
9
Alpha 21064 Die Photo (1993)
Single clock driver; the 2 transistors of the buffer
are visible to the naked eye.
Clocking scheme was 2-phase, single wire. Clock
load was 3.5 nF. Gate length of the final driver was
35 cm (not a misprint; a serpentine layout was used to
get this gate length).
10
21064 Clock Skew Distribution
Max clock skew approx. 180 ps (3.6% of the 5 ns
clock period). 1 gate delay is about 300 ps, so
clock skew is about 50% of a gate delay.
Note the skew is smallest closest to the center of
the chip, where the driver is located.
11
Thermal Image of 21064
76 °C at center of chip
46 °C at edges of chip
30 °C thermal gradient across chip!!
12
21164 Clock Distribution (1995)
Goal of the 21164 clock distribution was to reduce
skew by 30% and reduce the thermal gradient. A
predriver was centered between two main clock
drivers.
Clock skew was reduced by a factor of 2, and the
thermal gradient was reduced.
[Figure labels: Predriver, Main clock drivers]
13
Max clock skew approx. 80 ps (2.4% of the 3.3 ns
clock period). 1 gate delay is about 240 ps, so
clock skew is about 1/3 of a gate delay.
Clock skew is lowest near the two main clock drivers.
14
Aside Why a Gridded Clock?
  • Both the 21064 and the 21164 used a single global
    clock distributed by a metal clock grid.
  • Skew is largely determined by grid interconnect
    density and is insensitive to gate load placement
  • Why? Because capacitance of grid wiring dominates
    the gate loads connected to it.
  • Universal availability of clock signals
  • Design teams can proceed in parallel since clock
    constraints well known
  • Good process-variation tolerance
  • The disadvantage is the extra capacitance of the
    grid
  • Power-performance tradeoff is determined by
    choice of skew target, which establishes the
    needed grid density, which determines the clock
    driver size.
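The power side of this tradeoff can be estimated with the standard dynamic power relation P = C * V^2 * f for a full-swing net that charges and discharges once per cycle. The sketch below applies it to the 3.5 nF clock load and 200 MHz clock quoted for the 21064; the 3.3 V supply is an assumption, not a figure from these slides.

```python
def clock_power_watts(c_farads, vdd_volts, freq_hz):
    """Dynamic power of a full-swing clock net: C * V^2 * f."""
    return c_farads * vdd_volts ** 2 * freq_hz

# 21064 figures from the slides: 3.5 nF clock load, 200 MHz clock.
# ASSUMPTION: 3.3 V supply.
p_21064 = clock_power_watts(3.5e-9, 3.3, 200e6)
print(f"{p_21064:.2f} W")

# A denser grid adds capacitance; power grows in direct proportion.
p_denser = clock_power_watts(2 * 3.5e-9, 3.3, 200e6)
```

Because power is linear in capacitance, every increment of grid density bought to tighten skew is paid for directly in clock power and driver size.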

15
21264 Clock Distribution (1997)
  • 21264 clocking fundamentally different from
    previous Alphas because it supported a hierarchy
    of clocks
  • Still had a GCLK (global Clk) grid, but
    conditional and local clocks had several buffer
    stages after GCLK
  • Conditional clocks used to save power
  • Clocks gated to functional units in design
  • If not executing a floating point instruction,
    then stop the clock to the floating point unit to
    save power!
  • State elements and clocking points were 0 to 8
    gates past GCLK
  • Six major regional clocks, two gain stages past
    GCLK, with grids juxtaposed with GCLK but shielded
    from it.
  • Major clocks drive local clocks and conditional
    clocks
  • Goals were to improve performance, reduce power.

16
Clock Hierarchy of 21264
17
21264 Global Clock Distribution
Window pane arrangement - same skew to all
panes. Note redundant drive to clock nets.
18
Phase-Locked Loops (PLLs)
  • PLLs and Delay-Locked Loops (DLLs) are used to
    perform clock multiplication of an off-chip clock
  • PLLs/DLLs used to align clock edges of original
    clock with multiplied clocks
  • PLLs are analog circuits that use a charge pump
    and a voltage controlled oscillator (VCO) to
    perform phase alignment
  • Alpha 21264 PLL used a separate, regulated 3.3 V
    supply and was located in the corner of the chip
    to minimize noise impact
  • Section 9.5.2 of Rabaey text has a block diagram
    of a PLL
  • All high performance CPUs and most ASICs now
    include a PLL for internal clock generation

19
Global clock grid. Uses 3% of M3/M4 routing
layers (lines in picture are misleadingly thick).
20
All GCLK lines are laterally shielded by Vss/Vdd.
[Figure labels: signal, GCLK, Vdd, Vss]
Lateral shielding via Vss/Vdd prevents clock
noise from coupling into signal lines. Clock
wires and lateral shields were manually placed.
21
Simulated worst case GCLK skew was 72 ps. Skew
on M1, M2 was less than 10 ps.
Measured worst case GCLK skew via e-beam tester
was 65 ps.
22
Major clocks are two inversions past GCLK.
Major clocks saved power over a single global
clock because they service a lighter load and the
distribution area is smaller; both of these
mean smaller drivers are needed. GCLK + major
clocks used 24 W at 2.2 V, 600 MHz. It is
estimated that at least 40 W would have been
required if only global clocks were used.
10%-90% rise/fall times were targeted at < 320 ps.
23
Major clock grids. Densest major clock grids used
up to 6% of M3/M4 routing. White areas are
serviced by local clocks; local clocks are also
present in major clock grids. Major clocks are also
laterally shielded.
24
Local, Conditional Clocks
  • Local clocks could be generated from any clock: GCLK,
    major clocks, or other local clocks
  • Local clocks were neither shielded nor gridded
  • Having local clocks gave freedom to move clock
    edges with respect to data to solve timing
    problems
  • This is another form of time borrowing
  • 60,000 local clock nodes, all were analyzed with
    SPICE using minimum and maximum gate capacitance
    estimates
  • Some local clocks had very high min/max delay
    variation tolerances (up to 280 ps)

25
Power Consumption 21264
  • 72 W total (600 MHz at 2.2 V)
  • Clock distribution power consumption: 46.8 W
  • GCLK: 10.2 W
  • Major clocks: 24 W
  • Local unconditional clocks: 7.6 W
  • Local conditional clocks: 15.6 W
  • Clocking accounted for 65% of the total power in
    the 21264!

26
The IA-64 (Gen 1)
  • IA-64 ISA is successor to the Pentium 4, which
    was the successor to the Pentium 3/2/1.
  • 64-bit architecture, all registers 64-bits wide
  • 128 General Registers, 128 Floating Point
    Registers
  • G0-G31 are global registers, G32-G127 are part
    of the Register Stack where a dynamic number of
    them can be allocated as part of procedure
    call/return and be visible to only that procedure
    (similar to Sparc register windows).
  • Superscalar, maximum issue of 6 instructions per
    clock
  • Supports both speculative branching and
    speculative loading
  • Itanium is the first implementation of the IA-64
    ISA.
  • Executes x86 code (IA-32) with a separate
    execution engine.

27
Technology
  • 0.18 µm CMOS
  • 25.4 million transistors
  • 6 metal layers
  • Flip-chip with 1014 pads

Recall that the Alpha 21264 had 15.2 million
transistors.
28
IA-64 Clock Distribution (Gen 1)
DSK = deskew buffers, RCD = Regional Clock
Distribution Network
29
Reference, Core clocks
  • On-chip PLL generates a 2X reference clock, which is
    then divided by two to form a 50% duty cycle core
    clock
  • External clock (system clock) is input to PLL
  • Both the 2X reference clock and the core clock are
    distributed across the die via an H-tree

Routed in M5/M6. Fully laterally shielded with
Vss/Vdd. Inductive reflections minimized at branch
points by sizing wires to match impedances.
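The impedance-matching idea can be sketched with the standard transmission-line reflection coefficient, (Z_load - Z_source) / (Z_load + Z_source). At an H-tree branch point the trunk sees its two branches in parallel, so the reflection vanishes when each branch has twice the trunk impedance. The 50 ohm trunk value is hypothetical.

```python
def reflection_coefficient(z_source, z_load):
    """Fraction of the incident wave reflected where z_source meets z_load."""
    return (z_load - z_source) / (z_load + z_source)

def branch_load(z_branch, n_branches=2):
    """Impedance seen looking into n identical branches in parallel."""
    return z_branch / n_branches

z_trunk = 50.0  # ohms, hypothetical trunk impedance

# Branches with the same impedance as the trunk: the parallel load is too low.
unmatched = reflection_coefficient(z_trunk, branch_load(50.0))

# Branches sized for twice the trunk impedance: the parallel load matches.
matched = reflection_coefficient(z_trunk, branch_load(100.0))

print(unmatched, matched)
```

Narrower wires have higher characteristic impedance, which is why an H-tree tuned this way tapers its wire widths at each level.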
30
Inductance Affects Delay
Delay in the clock distribution metal H-tree network
is affected by R, C, and L. For GHz-speed clocks
in a metal distribution network, L must be included
in delay calculations.
Inductance adds extra delay in the current return
path. Inductive effects decreased clock buffer
delays due to faster transition rates.
31
Regional Clock Distribution
  • De-skew buffers (DSK), Regional Clock Drivers
    (RCD), and Region Clock Grid (RGD)
  • 30 clock regions serviced by regional clocks
  • Regional Clock Grid implemented in M4, M5
  • Floats over one or more functional units
  • Full lateral shielding via Vss/Vdd

32
Alpha vs IA64 Approach
  • Alpha CPU major clocks correspond to IA-64 regional clocks
  • Alpha did not attempt to deskew Major clocks with
    GCLK
  • Alpha used local clocks generated from major
    clocks and did timing analysis, path delay
    matching between clocks and data to solve timing
    problems
  • This does NOT account for delays due to on die
    process variations
  • At GHz clock speeds, skew due to on-die process
    variations can cause timing failures
  • IA64 used an active distributed deskewing
    approach for GCLK and Regional Clocks
  • Wanted to avoid the detailed delay matching,
    timing analysis required in the Alpha design
    after complete implementation because of impact
    on design schedule
  • Account for delay due to on die process variations

33
Think of reference clock as the golden clock
Feedback clock!!!
Delay circuit used to control edge alignment of
Global clock with Regional Clock. In general,
this is a form of a Digital Delay Locked Loop
(DLL). Any form of PLL/DLL must have feedback for
correction!
34
Decoupling caps
Regional clock can be gated
Shifting a 1 from one end decreases delay,
shifting a 0 from opposite end increases delay
(this is a variable delay line). Delay range was
170 ps in 8.5 ps steps. Phase adjustments made
every 16 clock cycles. Could also be adjusted
manually via test access port (TAP)
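A behavioral sketch of such a deskew buffer is given below: a shift-register-controlled delay line with a 170 ps range in 8.5 ps steps, updated once every 16 clock cycles by a simple early/late decision. The class and function names, the starting setting, and the path delays are hypothetical, and the real phase detector and controller are more involved than this.

```python
STEP_PS = 8.5
RANGE_PS = 170.0
N_STEPS = round(RANGE_PS / STEP_PS)  # 20 delay steps

class DeskewBuffer:
    """Variable delay line controlled by a shift register (behavioral model)."""
    def __init__(self, ones=N_STEPS // 2):
        self.ones = ones  # number of 1s currently held in the register

    @property
    def delay_ps(self):
        # More 1s in the register means less inserted delay.
        return (N_STEPS - self.ones) * STEP_PS

    def shift_in_one(self):   # decreases delay by one step
        self.ones = min(N_STEPS, self.ones + 1)

    def shift_in_zero(self):  # increases delay by one step
        self.ones = max(0, self.ones - 1)

def deskew(fixed_path_ps, reference_ps, cycles=16 * 40):
    """Return the residual phase error (ps) after `cycles` clock cycles."""
    buf = DeskewBuffer()
    for cycle in range(cycles):
        if (cycle + 1) % 16 != 0:
            continue  # phase adjustments are made every 16 cycles
        arrival = fixed_path_ps + buf.delay_ps  # feedback clock arrival
        if arrival > reference_ps:
            buf.shift_in_one()    # regional clock is late: remove delay
        elif arrival < reference_ps:
            buf.shift_in_zero()   # regional clock is early: add delay
    return fixed_path_ps + buf.delay_ps - reference_ps

error_ps = deskew(fixed_path_ps=430.0, reference_ps=500.0)
print(abs(error_ps) <= STEP_PS)
```

With this naive early/late rule the loop settles to within one step of the reference and then dithers between the two nearest settings, so the step size bounds the residual error.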
35
Controller for Deskew Buffer Register
Deskew Register adjusted every 16 clock cycles of
Reference Clock. The Deskew buffer is just a
simple form of a Delay Locked Loop (DLL).
36
Why a Reference Clock?
  • The goal of the DSK was to deskew the global
    (core) clock with respect to the regional clocks
  • Reference clock was 2X core clock
  • Regional clocks were simply a delayed version of
    the global clock
  • Reference clock was not deskewed but smaller
    distribution region and more balanced routing
    gave less skew in reference clock.
  • Not possible to maintain a balanced routing
    network and load matching for the core clock over
    such a large design with multiple design teams,
    since the core clock was driving logic
  • However, it was possible to design balanced
    routing network and have load matching for the
    reference clock since all it drove were the DSKs
    and global clock design team solely responsible
    for reference clock design
  • Feedback clocks from the regional clock
    distribution were then used to deskew regional
    clocks with respect to reference clock.

37
Skew Elements
  • Total skew of design based on residual skew in
    reference clock, uncertainty of phase detector in
    DSK, and mismatches of feedback clocks
  • Reference clock did not have as large a
    distribution region as the core clock, and loads
    were better matched, so had tighter skew than
    would have been possible with global clock
  • Feedback clock routes were kept short with
    respect to DSKs
  • Phase detector uncertainty kept small via
    symmetric layout techniques and by allowing a
    long time for phase comparison
  • Achieved maximum skew was 28 ps (2.8% of a 1 GHz
    clock period).

38
Measured skew via Laser voltage probing
39
Local Clocks
  • Local clocks generated from Regional Clocks and
    provided clocks needed by domino logic
  • Full timing analysis performed on local clocks
  • Local clocks responsibility of functional block
    design teams
  • Global and regional clock responsibility of
    global clock design team

Delay added for time borrowing or to account for
skew in local clock
40
Hold Time Analysis (another look)
Td includes delay through combinational logic
plus hold time on G2. This was called race
analysis in the Alpha notes.
min(Td) > max(skew). If the shortest path from G1
to G2 is less than max skew, then an incorrect value
may get clocked into G2 when the clock edge arrives
at G2.
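The race condition above reduces to a one-line check per path. In this sketch the path names and delays are hypothetical, and min Td is taken to already include the hold time on the receiving element, as on the slide.

```python
def race_violations(paths, max_skew_ps):
    """Return the names of paths whose minimum delay does not exceed max skew.

    `paths` is a list of (name, min_td_ps) pairs.
    """
    return [name for name, min_td_ps in paths if min_td_ps <= max_skew_ps]

# Hypothetical minimum path delays in ps.
paths = [("G1->G2", 45.0), ("G3->G4", 20.0), ("G5->G6", 31.0)]

print(race_violations(paths, max_skew_ps=28.0))
```

Any path reported here needs delay added to its short path, or a tighter skew bound between its two clocks.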
41
Four different cases for Max(skew)
LCB = local clock buffer. Common reference means
in the same DSK cluster.
42
IA-64 Generations 2,3 CPU and CLK
  • This lecture uses two papers that discuss the
    clock and CPU design of the second and third
    generations of the IA-64
  • Anderson, F. E., Wells, J. S., Berta, E. Z., "The
    Core Clock System on the Next Generation Itanium
    Processor", ISSCC 2002, pp. 453-456.
  • Tam, S., Desai, U., Limaye, R., "Clock Generation
    and Distribution for the Third Generation Itanium
    Processor", 2003 Symposium on VLSI Circuits, pp.
    9-12.
  • Stinson, J., Rusu, S., "A 1.5GHz Third Generation
    Itanium Processor", ISSCC 2003, paper 14.4.
  • Naffziger, S. D., Colon-Bonet, G., Fischer, T.,
    Riedlinger, R., Sullivan, T. J., Grutkowski, T.,
    "The implementation of the Itanium 2
    microprocessor", IEEE Journal of Solid-State
    Circuits, Vol. 37, Issue 11, Nov. 2002,
    pp. 1448-1460.
  • All notes in this lecture are from these four
    papers.

43
Clock Comparison of three generations of IA-64
44
Comments
  • Active de-skewing used in 1st generation
    jettisoned in 2nd generation
  • 2nd generation just used a balanced H-tree
  • Difficult to route this type of structure - all
    clock routing was reserved prior to block layout
  • Differential clocks used for 2nd level clock
    distribution reduced jitter
  • Non-active de-skew easier to test, and more
    deterministic behavior
  • Intentional clock skewing for time borrowing
    easier
  • 3rd generation uses programmable fuses for
    skewing
  • allows skew adjustment after fabrication

45
2nd Generation Clock distribution
Gated clocks
differential clocks
46
2nd Generation Clock Shielding
CLK
CLK-
This level reduces inductive effects. Locates gnd
current return close to clock lines.
47
3rd Generation Distribution
Copper interconnect used, extra performance
headroom
48
Fuse-Based De-skewing
69 fuses controlling 23 clock zones. Delay
increments of 30.5 ps over a 220 ps
range. Exhaustive search for the best fuse settings
is not possible; a generic search algorithm with
statistical history is used to help. Done during
production sort.
SLCB = second level clock buffer
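The sketch below shows why exhaustive search is ruled out and what a per-zone adjustment looks like in a toy model. It assumes 3 fuses per zone (69 fuses / 23 zones), giving 8 settings per zone, and it assumes each zone's clock arrival time can be observed directly. The arrival times are hypothetical, and the greedy rule here is a stand-in for illustration, not the search used in production.

```python
STEP_PS = 30.5
ZONES = 23
SETTINGS_PER_ZONE = 8  # ASSUMPTION: 69 fuses / 23 zones = 3 fuses -> 2**3 settings

# Number of distinct fuse configurations an exhaustive search would visit.
search_space = SETTINGS_PER_ZONE ** ZONES

def skew(arrivals):
    return max(arrivals) - min(arrivals)

def choose_settings(base_arrivals_ps):
    """Greedy stand-in: delay each zone toward the latest-arriving zone."""
    target = max(base_arrivals_ps)  # fuses can only add delay
    settings = []
    for base in base_arrivals_ps:
        k = round((target - base) / STEP_PS)
        settings.append(min(SETTINGS_PER_ZONE - 1, max(0, k)))
    return settings

# Hypothetical untuned arrival times for five zones, in ps.
base = [0.0, 12.0, 47.0, 60.0, 33.0]
settings = choose_settings(base)
tuned = [b + k * STEP_PS for b, k in zip(base, settings)]

print(search_space)
print(skew(base), skew(tuned))
```

In this toy model every zone lands within half a step of the target, so the residual skew is bounded by one delay increment. On silicon the arrival times are not directly observable per zone, which is where a guided search with statistical history earns its keep.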
49
Results of Skew Adjustment
Made a big difference here. Skew reduced from
60ps to 24 ps.
50
90 nm IA Microprocessor (2003)
Global clock distribution designed to scale up to 6 GHz.
Used a clock distributed by H-tree, but shorted
clock nodes at about every third level in order
to reduce the skew. No active de-skew or
fuse-based de-skew.
51
90 nm IA Microprocessor (2003) (cont)
Skew attenuation nodes are shorted
52
90 nm, Dual-Core Itanium (2005)
Up to 2.5 GHz, used region-based, active de-skew.
2nd-level clock buffer drives 200 CVDs (clock
vernier devices).
[Figure label: active de-skew]
53
90 nm, Dual-Core Itanium (2005) (cont)
Each 2nd-level clock buffer can dynamically
adjust its delay by up to 128 ps with 1 ps
resolution. Each clock vernier device (CVD) gave
an additional 70 ps of skew adjustment.
[Figure: delay at each clocking level and power
consumption; note the number of end points!]
Post-gater delay matching handled by designers.
54
Xeon Dual-core (2006, 65 nm), 3.4 GHz (Tulsa): two
different clock systems
  • Core clocks (clocks for the processor cores) use the
    same core clock scheme as the Xeon Single
    Core (2003, 90 nm). This clock scheme was
    designed to scale up to 6 GHz, and used an H-tree
    distributed clock with shorted nodes that
    produced less than 10 ps skew. No active de-skew
    or fuse-based de-skew.
  • Un-core clock (everything outside the core):
    cache, bus logic, etc. Large area prevented use
    of a gridded clock (power restriction), so it used a
    clock tree (9 vertical, 2 horizontal) with
    fuse-based deskew at the root of each vertical spine.
    Achieved less than 11 ps skew.

Top measured frequency 3.4 GHz
55
Dual Core die photograph
56
Clock Domains
57
Clock Generator Arch.
58
Clock Distribution
Fuse-based deskew buffers are located at the root of
the vertical MCLK spines.
Zclk is the IO clock.
59
Clock Hierarchy
60
Core to Un-Core deskew
Different VCCs: core 1.25 V, un-core 1.10 V
Un-core clock
Core Clock
Core and un-core clocks are aligned, this just
de-skews the data
61
IO Bus to un-core clock domain
IO-bus and Un-core clock at 8 to N (N is integer
multiple of 200 MHz)
62
Global Skew
Skew < 10 ps
63
Power
64
Papers
  • Gronowski, Paul E., et al., "High Performance
    Microprocessor Design", IEEE Journal of
    Solid-State Circuits, Vol. 33, No. 5, May 1998,
    pp. 676-686.
  • Bailey, Daniel W. and Bradley J. Benschneider,
    "Clocking Design and Analysis for a 600-MHz Alpha
    Microprocessor", IEEE Journal of Solid-State
    Circuits, Vol. 33, No. 11, November 1998, pp.
    1627-1633.
  • Tam, S., et al., "Clock Generation and distribution
    for the First IA-64 microprocessor", IEEE Journal
    of Solid-State Circuits, Vol. 35, Issue 11, Nov.
    2000.
  • Rusu, S. and Singer, G., "The first IA-64
    microprocessor", IEEE Journal of Solid-State
    Circuits, Vol. 35, Issue 11, Nov. 2000.
  • Anderson, F. E., Wells, J. S., Berta, E. Z., "The
    Core Clock System on the Next Generation Itanium
    Processor", ISSCC 2002, pp. 453-456.
  • Tam, S., Desai, U., Limaye, R., "Clock Generation
    and Distribution for the Third Generation Itanium
    Processor", 2003 Symposium on VLSI Circuits, pp.
    9-12.
  • Stinson, J., Rusu, S., "A 1.5GHz Third Generation
    Itanium Processor", ISSCC 2003, paper 14.4.
  • Naffziger, S. D., Colon-Bonet, G., Fischer, T.,
    Riedlinger, R., Sullivan, T. J., Grutkowski, T.,
    "The implementation of the Itanium 2
    microprocessor", IEEE Journal of Solid-State
    Circuits, Vol. 37, Issue 11, Nov. 2002,
    pp. 1448-1460.
  • Fischer, T., Desai, J., Doyle, B., Naffziger, S.,
    Patella, B., "A 90-nm variable frequency clock
    system for a power-managed Itanium architecture
    processor", IEEE Journal of Solid-State Circuits,
    Vol. 41, Issue 1, Jan. 2006, pp. 218-228.
    DOI 10.1109/JSSC.2005.859879
  • Mahoney, P., Fetzer, E., Doyle, B., Naffziger, S.,
    "Clock distribution on a dual-core, multi-threaded
    Itanium-family processor", IEEE International
    Solid-State Circuits Conference (ISSCC) 2005,
    Digest of Technical Papers, 6-10 Feb. 2005,
    pp. 292-599, Vol. 1.
    DOI 10.1109/ISSCC.2005.1493984

65
Papers (cont)
  • Bindal, N., Kelly, T., Velastegui, N., Wong, K. L.,
    "Scalable sub-10ps skew global clock distribution
    for a 90nm multi-GHz IA microprocessor", IEEE
    International Solid-State Circuits Conference
    (ISSCC) 2003, Digest of Technical Papers,
    pp. 346-498, vol. 1.
    DOI 10.1109/ISSCC.2003.1234329
  • Rusu, S., Tam, S., Muljono, H., Ayers, D.,
    Chang, J., Cherkauer, B., Stinson, J., Benoit, J.,
    Varada, R., Leung, J., Limaye, R. D., Vora, S.,
    "A 65-nm Dual-Core Multithreaded Xeon Processor
    With 16-MB L3 Cache", IEEE Journal of Solid-State
    Circuits, Vol. 42, Issue 1, Jan. 2007, pp. 17-25.
    DOI 10.1109/JSSC.2006.885041
  • Tam, S., Leung, J., Limaye, R., Choy, S., Vora, S.,
    Adachi, M., "Clock Generation and Distribution of
    a Dual-Core Xeon Processor with 16MB L3 Cache",
    IEEE International Solid-State Circuits Conference
    2006, Digest of Technical Papers, Feb. 6-9,
    2006, pp. 1512-1521.