About This Presentation
Title:

Energy Recovery Design for Low-Power ASICs

Description:

Application of energy recovery to SOC design. Fine-grain ... Chip Microphotograph. C. H. Ziesler et al., 2003. System Overview. C. H. Ziesler et al., 2003 ...

Provided by: mariospapa

Transcript and Presenter's Notes



1
Energy Recovery Design for Low-Power ASICs
  • Conrad H. Ziesler¹, Joohee Kim¹
  • Suhwan Kim², Marios C. Papaefthymiou¹
  • ¹Advanced Computer Architecture Laboratory,
    University of Michigan, Ann Arbor
  • ²T. J. Watson Research Center,
    IBM Research, Yorktown Heights

2
(No Transcript)
3
Tutorial Outline
  • Introduction to energy recovery
  • Application of energy recovery to SOC design
  • Fine-grain dynamic pipelines
  • Finite state machines
  • Memory arrays
  • Multi-GHz clocking

4
Introduction to Energy Recovery
  • Power dissipation in static CMOS design
  • Energy recovery operation
  • Implementation issues
  • Quick glance at history

5
Static CMOS Power
  • Leakage Power
  • Crowbar Power
  • Active Power

6
Static CMOS Active Power
  • Active power to transfer charge to/from output
    capacitor
  • Energy stored in the capacitor when charged to
    voltage V is $\frac{1}{2}CV^2$
  • Energy dissipated in R1 to charge the capacitor is
    also $\frac{1}{2}CV^2$
  • To discharge the capacitor, all energy stored in it
    is dissipated in R2
  • Note: the supply voltage level is fixed
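The bookkeeping above can be checked in a few lines. This is a minimal sketch for an ideal fixed-supply driver; the function name and the 1 pF / 1.5 V values are hypothetical, chosen only for illustration.

```python
def conventional_switching_energy(c_farads: float, v_volts: float) -> dict:
    """Energy bookkeeping for one charge/discharge cycle of a load capacitor
    driven from a fixed supply through switch resistances R1 (up) and R2 (down)."""
    stored = 0.5 * c_farads * v_volts ** 2        # energy on C when charged to V
    drawn_from_supply = c_farads * v_volts ** 2   # supply delivers Q*V = C*V^2
    lost_in_r1 = drawn_from_supply - stored       # charging loss, independent of R1's value
    lost_in_r2 = stored                           # all stored energy is burned on discharge
    return {
        "supply": drawn_from_supply,
        "stored": stored,
        "R1": lost_in_r1,
        "R2": lost_in_r2,
        "total_dissipated": lost_in_r1 + lost_in_r2,
    }


# Hypothetical example: 1 pF load switched to 1.5 V.
energies = conventional_switching_energy(1e-12, 1.5)
for name, joules in energies.items():
    print(f"{name:>16}: {joules * 1e12:.3f} pJ")
```

Whatever the values of R1 and R2, each full charge/discharge cycle dissipates $CV^2$; this is the baseline that energy recovery tries to beat.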

7
What if we had a time-varying power-supply?
  • Turn on switch when time-varying voltage source
    is at same level as the voltage on output
    capacitor.
  • Voltage on output capacitor slews up or down at
    same rate as voltage source.
  • Most of the energy stored in capacitor is
    returned to time-varying power-supply.

8
Model Simplification
  • Use same switch for charging and discharging
    currents (if it can conduct in both directions).
  • Time-varying voltage source Vpc, called
    power-clock, provides energy and synchronization.
  • Textbooks sometimes draw the ideal energy recovery
    model using a current source instead of a voltage
    source. The principle of operation is the same.

9
Example: Linear Ramp
10
Energy Dissipation
  • Integrate the charging or discharging current through
    resistor R
  • Dissipation for either charging or discharging is
    approximately $(RC/T)\,CV^2$
  • T is the rise time of Vpc (linear ramp)

If $T \gg RC$ and the recovered energy is
efficiently recycled, the power savings can be
substantial.
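A numerical check of the $(RC/T)\,CV^2$ approximation, as a sketch: the code integrates the $I^2R$ loss for a capacitor charged through a resistor by a linear ramp and compares it with the formula. The forward-Euler integrator, the 20 RC settling time after the ramp, and the component values (100 Ω, 1 pF, 1.5 V) are illustrative assumptions, not values from the tutorial.

```python
def ramp_charge_dissipation(r, c, v, t_ramp, steps=200_000):
    """Integrate the I^2*R loss when C is charged through R by a power-clock
    that ramps linearly from 0 to v over t_ramp and then stays at v."""
    tau = r * c
    t_end = t_ramp + 20.0 * tau      # let the output settle after the ramp ends
    dt = t_end / steps
    vc = 0.0                         # capacitor (output) voltage
    energy = 0.0
    for k in range(steps):
        t = k * dt
        vs = v * min(t / t_ramp, 1.0)   # power-clock voltage at time t
        i_r = (vs - vc) / r             # current through the switch resistance
        energy += i_r * i_r * r * dt    # energy dissipated in R during this step
        vc += i_r * dt / c              # forward-Euler update of the output node
    return energy


R, C, V = 100.0, 1e-12, 1.5   # hypothetical switch resistance, load, and swing
for T in (1e-9, 5e-9, 10e-9):
    e_sim = ramp_charge_dissipation(R, C, V, T)
    e_formula = (R * C / T) * C * V ** 2
    e_abrupt = 0.5 * C * V ** 2   # loss for one conventional transition
    print(f"T = {T * 1e9:4.1f} ns: simulated {e_sim * 1e15:6.2f} fJ, "
          f"(RC/T)CV^2 {e_formula * 1e15:6.2f} fJ, 1/2 CV^2 {e_abrupt * 1e15:7.1f} fJ")
```

For $T \gg RC$ the simulated loss lands slightly below the formula (for a linear ramp the exact total is roughly $(RC/T)(1 - RC/T)\,CV^2$), and both are a small fraction of the $\frac{1}{2}CV^2$ lost per conventional transition.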
11
Implementation Issues
  • How is power-clock generated?
  • What capacitances should one try to recover
    energy from?
  • How does the timing work if power and clock are
    intermingled?

12
Power-Clock Generation Challenges
  • Need to drive a large mostly capacitive load with
    a controlled rise/fall time waveform.
  • Capacitive load may change as different gates
    switch on or off.
  • Must do so with much less than $CV^2$ dissipation,
    otherwise the gains are lost.

13
Where to recover energy from?
  • Power-clock is time-varying
  • Synchronize energy recovery circuits with
    power-clock.
  • Energy recovery requires small RC delay with
    respect to rise/fall time T
  • Aim at small RC/T ratio.
  • Energy recovery saves on active switching power
  • Does not necessarily make sense to apply it on
    low-activity loads.

14
Good Candidates
  • Large capacitive loads that frequently switch
  • Clock networks
  • Memory bit lines
  • LCD row/column drivers
  • High activity loads that are synchronous to
    power-clock
  • Dynamic pipelined logic

15
Quick Glance at History
  • Physics (1970s)
  • Logical reversibility of computation
  • Connection to thermodynamics (adiabatic
    computing)
  • No absolute minimum to energy dissipation, if
    computing is arbitrarily slow. (Does not have to
    be slow, however.)
  • Engineering (1990s)
  • Logic circuitry
  • Energy recycling circuitry
  • VLSI prototyping

16
Sample Design Points from the '90s
  • The case for reversible computation
  • P. Solomon and D. Frank, IWLPD'94
  • Asymptotically zero-energy split-level charge
    recovery logic
  • S. G. Younis and T. F. Knight, Jr., IWLPD'94
  • Clock-powered CMOS: a hybrid adiabatic logic
    style for energy-efficient computing
  • N. Tzartzanis and W. C. Athas, ARVLSI'99
  • And many, many others....

17
Split-Level Charge-Recovery Logic
[Figure: SCRL gate with input, internal node, and output; a pass gate controlled by P and /P; split rails φ and /φ.]
  • Initially, φ and /φ are at Vdd/2, P is at Gnd, and
    /P is at Vdd.
  • On a valid input, the pass gate is turned on by
    gradually swinging P and /P.
  • Rails φ and /φ split, gradually swinging to
    Vdd and Gnd.
  • As soon as the output is sampled, the pass gate is
    turned off.
  • The internal node is restored by gradually swinging
    φ and /φ back to Vdd/2.
  • When is the gate output restored?

18
Reversible Pipelines
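[Figure: forward pipeline of split-level charge recovery logic blocks E, F, G, H driven by rails φ1-φ4 and pass-gate controls P1, /P1, P2, /P2, paired with a reverse pipeline of inverse blocks E⁻¹, F⁻¹, G⁻¹, H⁻¹ on additional rails; a block's output node is set by E and restored by F⁻¹.]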
  • Gate outputs are restored using a reverse
    pipeline whose elements perform the inverse
    function of the forward pipeline
  • Multiple phases required
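The reverse-pipeline idea can be mimicked at the logic level. In this sketch the stage functions are arbitrary invertible 8-bit maps standing in for E, F, and G (they are hypothetical, not circuits from the tutorial), and the model ignores timing and clock rails entirely. It only shows that applying each stage's inverse to the next node's value recomputes every intermediate node, which is what allows a reverse pipeline to restore those nodes.

```python
# Hypothetical stage functions: each is a bijection on 8-bit values, so every
# intermediate result can be recomputed from the value that follows it.
STAGES = [
    (lambda x: x ^ 0x5A, lambda y: y ^ 0x5A),                  # E, E^-1
    (lambda x: (x + 37) % 256, lambda y: (y - 37) % 256),      # F, F^-1
    (lambda x: ((x << 3) | (x >> 5)) & 0xFF,                   # G: rotate left by 3
     lambda y: ((y >> 3) | (y << 5)) & 0xFF),                  # G^-1: rotate right by 3
]


def forward(x):
    """Run the forward pipeline, returning the value of every node."""
    nodes = [x]
    for f, _ in STAGES:
        nodes.append(f(nodes[-1]))
    return nodes


def reverse_restore(nodes):
    """Reverse pipeline: recompute each node from the node after it using the
    next stage's inverse (E's output node is recovered by F^-1, and so on)."""
    restored = [None] * len(nodes)
    restored[-1] = nodes[-1]
    for k in range(len(STAGES) - 1, -1, -1):
        _, f_inv = STAGES[k]
        restored[k] = f_inv(restored[k + 1])
    return restored


nodes = forward(0xC3)
print("forward nodes: ", [hex(n) for n in nodes])
print("restored nodes:", [hex(n) for n in reverse_restore(nodes)])
```

As on the slide, the node set by E is recovered by F⁻¹; the final output would in turn have to be restored by whatever stage consumes it.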

19
The Remainder of this Tutorial
  • Application of energy recovery to SOC Modules
  • Low switching activity state-machines or
    datapaths
  • Fine-grain pipelines with high switching activity
  • Multi-gigahertz class design
  • Memory and memory-like arrays
  • Interconnect and I/O
  • Defining Characteristics
  • Low overhead, fast operation, no reversibility
  • Energy recovering chips that work and save power
    over conventional operation!

20
Targeting Energy Recovery to ASICs and SOC Modules
  • Module characterization
  • Throughput requirements
  • How much time is available to compute?
  • Is the application easily pipelined?
  • Expected switching activity
  • Can it be easily reduced?
  • Memory, computation, or control?
  • Clocking requirements
  • Fixed frequency? Range of frequencies?

21
Low Switching Activity Finite-State Machines or Datapaths
22
Low Switching Activity Modules
  • Best place to focus efforts is on clock tree and
    flip-flops
  • Switching activity is low everywhere else
  • Every capacitance connected to the clock switches
    every cycle
  • Consider applying a resonant clocking system

23
Sample Dissipation Breakdown
  • Focus on clock dissipation
  • 20-50% of total power
  • Apply energy recovery
  • Single-phase clock node
  • Target design automation
  • Automate energy recovery

24
Resonant Clock ASIC
  • Compatible with ASIC flow
  • Synthesized by Conrad Ziesler and Joohee Kim
    using an in-house standard-cell library and
    commercial tools
  • Energy recovering clock tree and SRAM word/bit
    lines
  • Low-cost bulk CMOS process
  • TSMC 0.25 µm, 108-pin PGA package, through MOSIS
  • High frequency (300 MHz)
  • Low voltage (1.0-1.5 V)

25
ASIC Statistics
  • Discrete wavelet transform (DWT)
  • 3,897 gates, 413 ffs
  • 15,571 transistors
  • 400 µm × 900 µm
  • 13.6 pF, 21 nH
  • 300 MHz, 1.5 V
  • 0.25 µm logic process

[Figure: dual-mode DWT and clock generator.]
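As a sanity check, reading the 13.6 pF and 21 nH above as the resonant clock capacitance and inductance (our interpretation), an ideal LC tank resonates at $f = 1/(2\pi\sqrt{LC})$. The sketch ignores resistive losses and any other parasitics, so it is only a first-order estimate.

```python
import math


def lc_resonant_frequency(l_henries: float, c_farads: float) -> float:
    """Natural frequency of an ideal LC tank: f = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henries * c_farads))


f = lc_resonant_frequency(21e-9, 13.6e-12)  # values from the ASIC statistics above
print(f"{f / 1e6:.0f} MHz")
```

This gives roughly 298 MHz, consistent with the 300 MHz operating point.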
26
Chip Microphotograph
27
System Overview
28
The Energy Recovering Flip-Flop
[Figure: probe and state element; 14 transistors, 84 µm².]
  • Clock signal: single-phase resonant sinusoid
  • Probe activates state element only if next state
    differs from present state.
  • Low voltage operation at high speeds
  • Delay similar to conventional flip-flops
  • Fully compatible with standard-cell ASIC flow
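A behavioral sketch of the probe's gating rule, assuming the state element is activated exactly when D differs from Q. The per-cycle energies are hypothetical placeholders: the 10:1 active-to-idle ratio only mirrors the order-of-magnitude gap reported on the next slide, not measured values.

```python
def probe_activations(d_stream, q0=0):
    """Count clock cycles in which the probe enables the state element,
    assuming it fires only when the next state (D) differs from Q."""
    q = q0
    active = 0
    for d in d_stream:
        if d != q:        # next state differs: state element is activated
            active += 1
            q = d         # the flip-flop captures the new value
        # otherwise the state element stays idle this cycle
    return active


def relative_ff_energy(d_stream, e_active=1.0, e_idle=0.1, q0=0):
    """Energy in arbitrary units; e_idle = e_active / 10 is a hypothetical ratio."""
    n_active = probe_activations(d_stream, q0)
    n_idle = len(d_stream) - n_active
    return n_active * e_active + n_idle * e_idle


d = [0, 0, 1, 1, 1, 0, 0, 0]   # hypothetical low-activity input
print(probe_activations(d), "active cycles out of", len(d))
print("relative energy:", relative_ff_energy(d))
```

With low switching activity, most cycles fall into the cheap idle case, which is why the flip-flop suits the low-activity modules targeted here.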

29
Flip-Flop Power Characterization
Order of magnitude difference between idle (D, Q
constant) and active (D, Q changing) dissipation.
30
Resonant Clock Generator
  • Resonate entire clock capacitance with small
    inductor
  • Pump resonant system with NMOS switch at
    appropriate times
  • Whenever on, the NMOS switch conducts only the
    current needed to make up incremental losses

[Figure: clock generator with control, pre-driver, driver, and NMOS switch.]
31
Clock Generator Operation
32
Our Contributions
  • Application of energy recovery to ASIC clock
    network
  • Fully synthesized ASIC
  • Resonant LC tank forms power-clock
  • On-chip power-clock generator w/ off-chip
    inductor
  • Fabrication in 0.25 µm standard CMOS process
  • Compare energy recovering clocking system with
    conventional clocking system
  • Direct DC power measurements for complete system
  • Correct operation, 100-300 MHz
  • 20-50% savings in total power consumption
  • 80-90% savings in clock power

33
Recovering vs. Conventional Hardware
  • Synthesized dual-mode ASIC
  • Conventional
  • Energy recovery
  • Dual-mode flip-flop cell
  • Conventional clock tree with conventional
    flip-flop
  • Resonant clock tree with energy-recovering
    flip-flop
  • Direct comparison of dissipation at target
    throughput using identical hardware structures

34
Correct Function
signature output (Verilog simulator)
35
Summary
  • Energy recovery technologies for reducing clock
    dissipation
  • Single-phase sinusoidal clock
  • Efficient, LC resonant clock generator
  • Low power sinusoidally clocked flip-flop
  • Key attributes
  • Compatible with ASIC design flow, low overhead
  • High frequency (100-300 MHz)
  • Low voltage (1-1.5 V)
  • Real, working chips (in 0.25 µm logic process)

36
Break
37
Fine-Grain Pipelines with High Switching Activity
38
High Switching Activity Pipelines
  • Problem Parameters
  • Can't reduce switching activity
  • Need lots of throughput
  • Can easily do fine-grained pipelining
  • Solution
  • Dynamic energy recovering logic
  • True single-phase source-coupled adiabatic logic
    (SCAL)

39
SCAL-D Logic Family
  • Dynamic logic family that works with simple
    sinusoidal LC resonant clock
  • Alternate PMOS and NMOS type gates, like NMOS/PMOS
    "zipper" domino logic
  • Test chip: 200 MHz 8-bit multiplier implemented in
    a 0.5 µm bulk silicon process

40
SCAL-D Topology
[Figure: SCAL-D NMOS NAND gate with power clock, dual-rail inputs at/af and bt/bf, dual-rail outputs ot/of, bias, and Vss. Labeled parts:]
  • Sense amplifier
  • Precharge diodes
  • Current switches
  • Evaluation tree
  • Current tail
  • Power-clock

  • Single-phase sinusoidal power-clock
  • Minimum-size low-swing evaluation tree
  • Built-in state element
  • Dual-rail noise-tolerant design

41
SCAL-D Operation
  • NMOS Precharge Phase
  • rising edge of power-clock
  • charge transferred to load

42
SCAL-D Operation
  • NMOS Evaluate Phase
  • peak of power-clock
  • non-adiabatic evaluation current
  • current purposefully limited

43
SCAL-D Operation
  • NMOS Sense Phase
  • falling edge of power clock
  • sense amplifiers drive load

44
SCAL-D Operation
  • NMOS Hold Phase
  • negative peak of power clock
  • next pipeline stage samples outputs
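The four NMOS phases can be read off the power-clock waveform. This sketch models the power-clock as $\sin(2\pi f t)$ and, as an assumption for illustration, gives each phase a 90° window centred on the feature named above (rising edge, positive peak, falling edge, negative peak). Only NMOS gates are covered; the complementary PMOS gates are not modelled.

```python
# Assumed mapping: each phase occupies a 90-degree window centred on the
# waveform feature named on the slides, for a power-clock sin(2*pi*f*t).
PHASES = [
    (0.0, "precharge (rising edge)"),
    (90.0, "evaluate (positive peak)"),
    (180.0, "sense (falling edge)"),
    (270.0, "hold (negative peak)"),
]


def scal_d_phase(t_seconds: float, f_hz: float) -> str:
    """Return the NMOS-gate operating phase at time t for a sinusoidal power-clock."""
    angle = (360.0 * f_hz * t_seconds) % 360.0
    for centre, name in PHASES:
        # signed distance from the window centre, wrapped into [-180, 180)
        delta = (angle - centre + 180.0) % 360.0 - 180.0
        if -45.0 <= delta < 45.0:
            return name
    raise AssertionError("unreachable: the four windows cover the full cycle")


f_pc = 200e6   # hypothetical 200 MHz power-clock
for frac in (0.0, 0.25, 0.5, 0.75):
    t = frac / f_pc
    print(f"t = {t * 1e9:.2f} ns -> {scal_d_phase(t, f_pc)}")
```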

45
SCAL-D Implicit Pipelining
  • Free pipelining: no flip-flops needed
  • Example: pipelined AND/full-adder cell from an
    array multiplier

Static CMOS: 100 transistors
SCAL-D: 85 transistors
46
Multiplier Chip
  • Suhwan Kim (while still at UM) and Conrad Ziesler
    received First Prize in VLSI Design Contest, DAC
    2001.
  • Minimalist approach
  • Simple tools: Magic and SPICE
  • Low-cost standard CMOS process
  • HP 0.5 µm, 40-pin DIP package, through MOSIS.
  • Operational chip demonstrates practicality of
    energy-recovering circuit design
  • Non-trivial size (8-bit operands, on-chip clock,
    self-test)
  • High throughput
  • Low energy dissipation

47
Chip Microphotograph
48
Test Chip Overview
  • Two multipliers with self-test per chip (minimum
    size die)
  • Integrated power-clock generator
  • Resonant LC oscillator

49
Multiplier and Self-Test
[Figure: input BILBO (self-test), multiplicand buffers, product array, result summation, result buffers, self-test control, and output BILBO (self-test).]
  • 9,048 devices in multiplier array, 2,806 devices
    in self-test circuitry
  • Implemented entirely in energy-recovering
    dynamic logic family.

50
Energy Comparison
[Figure: dissipation per cycle (pJ) versus frequency (MHz) for 2-, 4-, and 8-stage static CMOS designs and the energy recovery design, with the supply voltage annotated at each point and a 4x difference marked.]
51
Single-Phase Power-Clock Generator
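[Figure: schematic with Vdd, Vss, switches S1 and S2, biases Vbp and Vbn, inductor L, and power-clock output PC; sub-blocks of 25, 19, and 10 transistors.]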
  • Zero-voltage switching
  • LC-resonant clock generation
  • External/bondwire inductor L
  • Resistive/capacitive adiabatic load
  • Compact: 170 µm × 115 µm

52
Switch Timings
  • Inductor current builds linearly when switches
    are on.
  • Peak switch current less than peak inductor
    current.
  • Switch S1 turned on at positive voltage peak.
  • Switch S2 turned on at negative voltage peak.
  • Fixed "on-window" controlled by a pulse
    generator.

[Figure: inductor current and output voltage waveforms.]
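A sketch of the switch schedule described above, assuming the power-clock is $\sin(2\pi f t)$, so its positive and negative peaks fall a quarter and three quarters of the way through each period. The on-window width and the voltage across the inductor are hypothetical parameters; the linear current build-up follows from $V = L\,di/dt$ with that voltage held roughly constant during the short window.

```python
def switch_windows(f_hz, on_window_s, n_cycles=2):
    """List (switch, t_on, t_off) for a power-clock modelled as sin(2*pi*f*t):
    S1 turns on at each positive peak, S2 at each negative peak, and each
    stays on for a fixed window set by the pulse generator."""
    period = 1.0 / f_hz
    events = []
    for k in range(n_cycles):
        t_pos = (k + 0.25) * period   # positive voltage peak
        t_neg = (k + 0.75) * period   # negative voltage peak
        events.append(("S1", t_pos, t_pos + on_window_s))
        events.append(("S2", t_neg, t_neg + on_window_s))
    return events


def inductor_current_increment(v_across, l_henries, t_on):
    """With a roughly constant voltage across the inductor while a switch is on,
    the current ramps linearly: delta_I = V * t_on / L."""
    return v_across * t_on / l_henries


# 140 MHz and 10 nH as on the next slide; the 0.5 ns window and the 1 V
# across the inductor are hypothetical.
for name, t_on, t_off in switch_windows(140e6, 0.5e-9):
    print(f"{name}: on {t_on * 1e9:.2f} ns, off {t_off * 1e9:.2f} ns")
print(f"delta_I = {inductor_current_increment(1.0, 10e-9, 0.5e-9) * 1e3:.0f} mA")
```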
53
Power-Clock Waveform
  • Single-phase sinusoidal waveform at 140 MHz
  • 60 pF load, 10 nH external inductor
  • One DC supply (Vdd, Vss), two DC biases (NMOS,
    PMOS)

54
Multi-GHz Clocking
55
Multi-Gigahertz Designs
  • Need speed more than anything
  • Clock distribution and skew biggest problems
  • Retiming desirable for performance
  • Need multiple phases of clock
  • Clock power dominates

56
Rotary Clock™ Principles
  • Consider MultiGig's Rotary Clock network (related
    slides courtesy of John Wood, MultiGig Inc.)
  • Multiple transmission line loops arranged in grid
  • Each loop supports a square-wave oscillation in
    lock step with neighbors
  • Small variations between loops average/cancel out
  • Ultra-high frequency low-skew clocking

57
Multi-Ring Visualization
  • Phase lock at junctions without PLL

58
Numerical Example: Large Chip
  • Process: 0.18 µm CMOS, 1.8 V
  • Size: 15 mm × 15 mm
  • Global clock: 2.5 GHz
  • Flip-flops: 2 million
  • Total capacitance: 10,000 pF
  • Metal: width 40 µm, spacing 40 µm, thickness
    1.5 µm, copper × 2
  • Active area: < 1%
  • Grid pitch: X 1300 µm, Y 1300 µm
  • Power ($CV^2 f$): 78 W
  • Rotary power: 6 W
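The $CV^2 f$ figure can be reproduced from the listed parameters. The straight product gives about 81 W rather than the 78 W quoted; the transcript does not say which inputs or rounding produced 78 W. The 6 W rotary figure is taken from the slide as given.

```python
c_total = 10_000e-12   # total clock capacitance: 10,000 pF
v_dd = 1.8             # supply voltage, V
f_clk = 2.5e9          # global clock, Hz

p_conventional = c_total * v_dd ** 2 * f_clk   # CV^2 f
p_rotary = 6.0                                 # W, as quoted on the slide

print(f"CV^2 f          = {p_conventional:.1f} W")
print(f"rotary / CV^2 f = {p_rotary / p_conventional:.1%}")
```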
59
Chip Microphotograph
60
Test Chip Measurements
Power 75% less than $CV^2 f$ of clock capacitance
61
Later Test Chips
  • Test chip 2
  • Switched-capacitor tuning
  • (100 MHz ±35%, measured)
  • Test chip 3: quad ring
  • 3.5 GHz
  • Tunable (varactor, ±10%)
  • Jitter below measurable levels

62
Benefits of Rotary Clock Architecture
  • Scalable in size and frequency.
  • Reduces dynamic clock power.
  • Guaranteed near-zero skew.
  • Precise skew scheduling possible.
  • Negligible jitter.
  • Inherently low noise
  • Tolerant to process, temperature, and supply
    variation.

63
Memories and Array-Like Structures
64
Memory and Array Structures
  • Many heavily loaded wires all switching
  • Already optimized algorithms to reduce number of
    accesses
  • Already using low-power sense-amplifiers
  • Consider energy recovery on the bit wires

65
Breakdown of CPU Dissipation
  • Memory area: 60% of CPU

StrongARM, JSSC 1996
66
Memory Power
256 × 256 array
Long bit/word lines with large capacitance lead to
high power consumption
67
Energy Recovery Driver
[Figure: ER driver controlled by Q, driving load C from a power source swinging 0 to VDD.]
  • Sinusoidal power-clock
  • Synchronization?
  • Correctness
  • Efficiency
  • $E \approx (RC/T)\,C V_{DD}^2$

68
Outline
  • Energy recovering driver
  • Synchronization
  • Feedback
  • Energy recovering SRAM
  • Operation
  • Simulation of full-custom ERSRAM
  • Voltage scaling behavior

69
One Extreme: Fully Gradual Transition
[Figure: ON control, PC, and driver output waveforms.]
  • Maximum power efficiency
  • Relatively slow operation

70
The Other Extreme: Abrupt Transition
[Figure: ON control, PC, and driver output waveforms.]
  • Low power efficiency
  • Fast operation

71
Partially Gradual Transition
[Figure: ON control, PC, and driver output waveforms.]
  • Energy efficient
  • Fast
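A first-order way to see the trade-off between these transitions, under simplifying assumptions of ours: if the switch closes when the power-clock has already reached a fraction $a$ of $V_{DD}$ (output at 0), the initial mismatch is lost as in conventional charging, about $\frac{1}{2}C(aV_{DD})^2$, and the remaining swing follows the ramp at roughly $(RC/T_r)\,C((1-a)V_{DD})^2$ with $T_r = (1-a)T$. The model treats the two parts as independent and needs $T_r \gg RC$; it is an illustration, not the tutorial's analysis.

```python
def partial_transition_energy(c, v, r, t_ramp, a):
    """First-order loss estimate when the driver switch closes after the
    power-clock has already risen to a*v (0 <= a <= 1), output starting at 0.
    Assumptions: the initial a*v step is lost as in conventional charging,
    and the remaining (1 - a)*v swing tracks the ramp over (1 - a)*t_ramp."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    step_loss = 0.5 * c * (a * v) ** 2
    t_rest = (1.0 - a) * t_ramp      # ramp time left after the switch closes
    ramp_loss = (r * c / t_rest) * c * ((1.0 - a) * v) ** 2 if t_rest > 0 else 0.0
    return step_loss + ramp_loss, t_rest


C, V, R, T = 1e-12, 1.5, 100.0, 10e-9   # hypothetical load, swing, switch, ramp
for a in (0.0, 0.1, 0.25, 0.5, 1.0):
    e, t_rest = partial_transition_energy(C, V, R, T, a)
    print(f"a = {a:4.2f}: loss {e * 1e15:8.2f} fJ, remaining ramp {t_rest * 1e9:5.2f} ns")
```

Here $a = 0$ reproduces the fully gradual case and $a = 1$ the abrupt one. Small nonzero $a$ shortens the transition while keeping the loss far below $\frac{1}{2}CV_{DD}^2$. The abrupt case still needs a few RC to settle, which this model does not capture.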

72
ER Driver Core
[Figure: driver core with pull-up control (ch), pull-down control (dch), power-clock PC, and driver output.]
  • Single-phase power-clock
  • Transmission gate (wide range of operation)
  • Synchronizing circuitry

73
Pull-Up Control
[Figure: pull-up control circuit with input ch, output ch_out, and power-clock PC.]
  • Transistors sized to achieve correct timing and
    pulse width

74
Pull-Down Control
[Figure: pull-down control circuit with input dch, output dch_out, and power-clock PC.]
  • Transistors sized to achieve correct timing and
    pulse width

75
Tolerance to Control Timing Variations
[Figure: dch, ch, PC, and driver output waveforms, marking the minimum required and maximum possible control timing.]
76
Dissipation During Consecutive Charging/Discharging
[Figure: output and PC waveforms for consecutive charging and consecutive discharging.]
77
Complete Structure with Feedback
[Figure: driver with pull-up control (ch), pull-down control (dch), power-clock PC, driver output, and feedback.]
  • Feedback circuitry
  • Prevents redundant dissipation during consecutive
    charging/discharging

78
ERSRAM Operation
[Figure: WL, BLT, and BLF waveforms during write, idle, and read cycles.]
  • WL: explicit discharge after each access for low
    power
  • BL: precharge low (with modified sense amp) for
    single-cycle read and write

79
ERSRAM Architecture
[Figure: two 128 × 256 cell arrays, each with its own WL driver, BL driver, and sense amplifiers.]
  • 2 × 128 × 256
  • Only the drivers and sense amplifiers differ from
    those of a conventional SRAM

80
Simulation
  • TSMC 0.35 µm process.
  • Full-custom 256 × 256 conventional and ER SRAM
  • HSPICE simulation

81
Power Breakdown
82
Wide Operation Range
[Figure: operating range of the conventional SRAM versus the ERSRAM.]
  • Tolerant to variations in operating conditions
  • Memory failure due to mistiming in sense amp
    enable

83
Voltage Scaling
  • Functions correctly down to 0.7V, 1MHz

84
Summary
  • Static RAM with novel energy recovering driver
  • Single-phase power-clock
  • Single-cycle read and write
  • High speed, low complexity
  • Operation range: 0.7 V at 1 MHz to 3.5 V at
    500 MHz in a 0.35 µm process
  • Energy efficiency: 2.6x at 3 V, 300 MHz, with
    alternating read-write accesses

85
Previous Work on SRAMs
  • Multi-phase power-clock
  • Multi-cycle operation
  • Relatively high complexity and low/moderate
    speeds
  • Inefficient during consecutive charging
  • Not voltage scalable
  • Somasekhar, Ye and Roy, ISLPED 1995
  • Tzartzanis and Athas, ISLPED 1996
  • Moon and Jeong, JSSC 1998
  • Avery and Jabri, ISLPED 1998
  • Kwon, Lim and Chae, ISLPED 2000
  • Ng and Lau, JCSC 2000
  • Tzartzanis, Athas and Svensson, ESSCC 2000

86
Interconnect and I/O
  • Highly capacitive parallel buses that take an
    entire cycle for data transfer
  • Can't reduce switching activity or drive voltage
  • Low slew-rates desirable
  • Consider energy recovering driver
  • Techniques similar to memories

87
Tutorial Summary
  • Introduction to energy recovery
  • Application of energy recovery to SOC design
  • Finite state machines
  • Fine-grain dynamic pipelines
  • Multi-GHz clocking
  • Memory arrays and I/O
  • Functional chips in bulk silicon demonstrating
    substantial energy savings and fast operation in
    practice

88
Reference Material
  • Energy recovery group web site at U. Michigan
  • http://www.eecs.umich.edu/acal/energyrecovery
  • Extensive list of references in our tutorial
    paper in SOC03 proceedings
  • The Physics of Computation
  • R. Feynman