Title: Low-Power Design Techniques in Digital Systems
- Prof. Vojin G. Oklobdzija
- University of California
Slide 2: Outline of the Talk
- Power trends in VLSI
- Scaling theory and predictions
- Research efforts in power reduction
- Efficiency measures and design guidelines
- Latches and Flip-Flops for Low-Power
- Dual-Edge FFs
- SOI
- Conclusion: Low-Power perspective
Slide 3: Power trends in VLSI
Slide 4: "CMOS circuits dissipate little power by nature." So believed circuit designers (Kuroda-Sakurai, '95)
[Figure: power (W, log scale 0.01 to 100) versus year ('80 to '95), rising about x4 every 3 years]
By the year 2000, the power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced. (Taken from Sakurai's ISSCC 2001 presentation)
Slide 5: Gloom and Doom predictions
Source: Shekhar Borkar, Intel
Slide 6
Source: Shekhar Borkar, Intel
Slide 7: Power versus year (taken from ISSCC, µP Report, Hot Chips)
High-end growing at 25%/year
RISC @ 12%/yr
X86 @ 15%/yr
Consumer (low-end) at 13%/year
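The growth figures on slides 4 and 7 can be compared by converting them to compound annual rates. A minimal Python sketch, assuming the slide 7 numbers are percentages compounded yearly (the helper names `annual_rate` and `doubling_time` are illustrative):

```python
import math


def annual_rate(factor: float, years: float) -> float:
    """Compound annual growth rate equivalent to `factor` growth over `years`."""
    return factor ** (1.0 / years) - 1.0


def doubling_time(rate: float) -> float:
    """Years to double at a compound annual growth `rate` (0.25 means 25%/year)."""
    return math.log(2.0) / math.log(1.0 + rate)


# Slide 4: x4 every 3 years
print(f"x4 / 3 years -> {annual_rate(4, 3):.0%} per year")

# Slide 7: growth rates read as percent per year
for name, rate in [("High-end", 0.25), ("X86", 0.15),
                   ("Consumer", 0.13), ("RISC", 0.12)]:
    print(f"{name:9s} {rate:.0%}/yr -> doubles every {doubling_time(rate):.1f} years")
```

Under that reading, "x4 / 3 years" is roughly 59% per year, and 25%/year doubles power about every 3.1 years.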
Slide 8: VDD, Power and Current Trend
[Figure: supply voltage (V, 0 to 2.5), power per chip (W, 0 to 200), and VDD current (A, 0 to 500) versus year, 1998 to 2014]
International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA), and Taiwan Semiconductor Industry Association (TSIA). (Taken from Sakurai's ISSCC 2001 presentation)
Slide 9: Power Delivery Problem (not just California)
Your car starter!
Source: Shekhar Borkar, Intel
Slide 10: Trend in L di/dt
- di/dt is roughly proportional to $I \cdot f$, where $I$ is the chip's current and $f$ is the clock frequency, or $I \cdot f = \frac{I \cdot V_{dd} \cdot f}{V_{dd}} = \frac{P \cdot f}{V_{dd}}$, where $P$ is the chip's power.
- The trend is: $P$ increases, $f$ increases, $V_{dd}$ decreases.
- On-chip L and package L decrease only slightly.
- Therefore, the L di/dt fluctuation increases significantly.
(Taken from Norman Chang, HP)
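Using the proportionality above, the change in di/dt between two chips needs only their power, frequency and supply voltage. A rough Python sketch applied to the Alpha 21164 and 21264 figures from slide 18; it assumes the same proportionality constant for both chips and, for the second ratio, an unchanged L (the function name `di_dt_ratio` is hypothetical):

```python
def di_dt_ratio(p_old, p_new, f_old, f_new, vdd_old, vdd_new):
    """Relative change in di/dt using di/dt ∝ I·f = P·f/Vdd (constants cancel)."""
    return (p_new * f_new / vdd_new) / (p_old * f_old / vdd_old)


# Alpha 21164 -> 21264: power (W), clock frequency (Hz), supply voltage (V)
ratio = di_dt_ratio(p_old=50, p_new=72, f_old=300e6, f_new=600e6,
                    vdd_old=3.3, vdd_new=2.0)

# With L held constant, L·di/dt scales by `ratio`; measured against the smaller
# supply voltage, the noise grows by a further factor of 3.3 / 2.0.
relative_noise = ratio * (3.3 / 2.0)

print(f"di/dt grows by about {ratio:.2f}x")
print(f"L·di/dt as a fraction of Vdd grows by about {relative_noise:.2f}x")
```

Under these assumptions di/dt grows by about 4.75x in one generation, and the noise relative to Vdd by about 7.84x.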
Slide 11: Saving Grace!
The energy-delay product is improving by more than 2x per generation
Slide 12: X86 efficiency improving dramatically, 4X / generation
Average improving 3X / generation
High-end processors' efficiency not improving
Slide 13: Scaling theory and predictions
Slide 14: The power dissipation has increased 1000 times over the past 15 years and now exceeds 70 watts
- Scaling principles
  - 1. Constant-field scaling theory (Dennard) assumes that device voltages as well as device dimensions are scaled down by a scaling factor $x$ ($x > 1$), resulting in a constant electric field in the device
    - power density remains constant
    - circuit performance improves in terms of:
      - density: $x^2$
      - speed: $x$
      - power: $1/x^2$
      - power-delay product: $1/x^3$
  - Limitless progress in CMOS is promised by this scaling scenario
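The constant-field factors listed above follow from a few first-order relations. A Python sketch, assuming C ∝ area / oxide thickness, drive current I ∝ (W/L)·C_ox·V² (which gives I ∝ 1/x under constant field), and gate delay t ∝ C·V/I; the function name `constant_field` is illustrative:

```python
def constant_field(x: float) -> dict:
    """First-order Dennard (constant-field) scaling by a factor x > 1."""
    length = 1 / x                       # channel length, width, oxide thickness
    voltage = 1 / x                      # supply and threshold voltages
    capacitance = length ** 2 / length   # C ∝ area / t_ox -> 1/x
    current = 1 / x                      # I ∝ (W/L)·C_ox·V^2 = 1 · x · (1/x^2)
    delay = capacitance * voltage / current      # t ∝ C·V / I -> 1/x
    power = capacitance * voltage ** 2 / delay   # P ∝ C·V^2·f, f = 1/t -> 1/x^2
    area = length ** 2
    return {
        "density": 1 / area,             # x^2
        "speed": 1 / delay,              # x
        "power": power,                  # 1/x^2
        "power_delay": power * delay,    # 1/x^3
        "power_density": power / area,   # constant
    }


for key, value in constant_field(1 / 0.7).items():
    print(f"{key:14s} {value:.3f}")
```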
Slide 15: In practice, neither the supply voltage nor the threshold voltage was scaled until 1990, leading to the theory of:
- Constant-voltage scaling, which keeps the voltage constant. This assumption yields:
  - speed improvement by $x^2$
  - power density increasing rapidly, by $x^3$
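The same first-order bookkeeping reproduces the constant-voltage results, assuming a long-channel square-law drive current I ∝ (W/L)·C_ox·V² with V held fixed. A sketch (the function name `constant_voltage` is illustrative) that also compounds the power-density growth over a few hypothetical generations with x = 1/0.7:

```python
def constant_voltage(x: float) -> dict:
    """First-order constant-voltage scaling: dimensions shrink by x, V stays fixed."""
    length = 1 / x
    voltage = 1.0                          # supply voltage is not scaled
    capacitance = length ** 2 / length     # C ∝ area / t_ox -> 1/x
    current = x * voltage ** 2             # I ∝ (W/L)·C_ox·V^2 with C_ox ∝ x -> x
    delay = capacitance * voltage / current        # -> 1/x^2
    power = capacitance * voltage ** 2 / delay     # -> x
    area = length ** 2
    return {"speed": 1 / delay, "power_density": power / area}  # x^2 and x^3


x = 1 / 0.7   # hypothetical 30% linear shrink per generation
per_generation = constant_voltage(x)["power_density"]
for generation in range(1, 4):
    print(f"after {generation} generation(s): power density x{per_generation ** generation:.1f}")
```

With a 0.7 shrink, power density grows by roughly 2.9x per generation in this model, which is why constant-voltage scaling could not continue.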
Slide 16: Constant-field scaling is not realistic; $x^{0.5}$ scaling is satisfactory. However, even with that, the power dissipation would exceed that of ECL by 2001, so a new philosophy is required!
(Taken from Sakurai and Kuroda, IEICE '95 paper)
Slide 17: High-Performance Viewpoint on Power (taken from Ron Preston, DEC Alpha)
- $P = k \cdot C \cdot V^2 \cdot f$
- Shrinking to the new technology (30% reduction in $\lambda$):
  - $C$ decreases by 30%
  - $f$ increases by a factor of $1/0.7$ (43%)
  - $P_{new} = 0.7 \cdot (1/0.7) \cdot P_{old} = P_{old}$ (No change in power!)
- New design:
  - Double the number of devices
  - $P_{new} = 2 \cdot 0.7 \cdot (1/0.7) \cdot P_{old} = 2 \cdot P_{old}$ (Power doubles!)
  - Scale $V_{dd}$ by 30% in the new design
  - $P_{new} = 2 \cdot 0.7 \cdot (1/0.7) \cdot (0.7)^2 \cdot P_{old} \approx P_{old}$ (Power stays constant!)
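The three cases above are products of scale factors applied to P = k·C·V²·f. A small Python sketch (the helper name `power_ratio` is hypothetical) that reproduces them; the last case comes out at 0.98·P_old, which the slide rounds to "stays constant":

```python
def power_ratio(c: float = 1.0, v: float = 1.0, f: float = 1.0,
                devices: float = 1.0) -> float:
    """P_new / P_old for P = k·C·V^2·f; k cancels, device count scales total C."""
    return devices * c * v ** 2 * f


shrink = power_ratio(c=0.7, f=1 / 0.7)                        # same design, new process
new_design = power_ratio(c=0.7, f=1 / 0.7, devices=2)         # twice as many devices
scaled_vdd = power_ratio(c=0.7, f=1 / 0.7, devices=2, v=0.7)  # plus 30% lower Vdd

print(f"shrink: {shrink:.2f}, new design: {new_design:.2f}, scaled Vdd: {scaled_vdd:.2f}")
```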
Slide 18: High-Performance Viewpoint on Power (taken from Ron Preston, DEC Alpha)
- Reality
  - Paradigm changes: more aggressive circuits, increasing toggle rate, out-of-order and speculative execution
- What to expect
  - Power will be limited by the package and cooling techniques
  - Frequency will be determined by the power: as high as the package can take!
Chip     λ         Vdd     Freq.     Power
21164    0.5 µm    3.3 V   300 MHz   50 W
21264    0.35 µm   2.0 V   600 MHz   72 W
Change   -30%      -39%    +100%     +44%
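The "Change" row can be recomputed from the two chip rows. The sketch below also compares the measured power ratio with what P = k·C·V²·f alone would predict if switched capacitance simply tracked λ; that naive model is an assumption made here for illustration, and the gap is only a rough indication of how much the "Reality" items above (more devices, more aggressive circuits, higher toggle rates) add:

```python
old = {"lambda_um": 0.5, "vdd_v": 3.3, "freq_mhz": 300, "power_w": 50}   # 21164
new = {"lambda_um": 0.35, "vdd_v": 2.0, "freq_mhz": 600, "power_w": 72}  # 21264

# Percent change per column, rounded as in the table
change = {k: round(100 * (new[k] - old[k]) / old[k]) for k in old}

# Naive prediction: C scales with lambda, times the Vdd^2 and f ratios
naive = ((new["lambda_um"] / old["lambda_um"])
         * (new["vdd_v"] / old["vdd_v"]) ** 2
         * (new["freq_mhz"] / old["freq_mhz"]))
measured = new["power_w"] / old["power_w"]

print(change)
print(f"naive P ratio {naive:.2f}, measured {measured:.2f}, "
      f"implied extra switched capacitance x{measured / naive:.1f}")
```

Under this naive model the 21264 would have drawn about half the 21164's power; the measured 1.44x implies roughly 2.8x more switched capacitance (including activity).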
Slide 19: Research Efforts in Low-Power Design
- Technology scaling
  - The highest win
  - Thresholds should scale
  - Leakage starts to bite
  - Dynamic voltage scaling
- Reduce the active load
  - Minimize the circuits
  - Use more efficient design
  - Charge recycling
  - More efficient layout
$P_{sw} = k \cdot C_L \cdot V_{CC}^2 \cdot f_{CLK}$
- Reduce switching activity
  - Conditional clock
  - Conditional precharge
  - Switching off inactive blocks
  - Conditional execution
- Run it slower
  - Use parallelism
  - Fewer pipeline stages
  - Use double-edge flip-flops
Slide 20: Reducing the Power Dissipation
- The power dissipation can be minimized by reducing the:
  - supply voltage
  - load capacitance
  - switching activity
- Reducing the supply voltage brings a quadratic improvement.
- Reducing the load capacitance improves both power dissipation and circuit speed.
Slide 21: Voltage Scaling
- There are three means to maintain the throughput when the supply voltage is reduced:
  - Reduce Vth to improve circuit speed.
  - Introduce parallel and pipelined architectures while using slower devices.
    - (This assumes a limitless number of transistors; in reality, transistor density is only increasing by 60% per year.)
  - Provide multiple supply voltages and, for each cluster of circuits, choose the lowest supply voltage that satisfies the speed requirement.
    - (A good level converter is necessary: one that exhibits small delay, consumes little power, and occupies a small area.)
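The third option can be sketched as a per-cluster choice of supply. The Python below assumes an alpha-power-law delay model t ∝ V/(V − Vth)^α with illustrative values α = 1.3 and Vth = 0.4 V, and two hypothetical supplies (3.3 V and 2.0 V); the cluster names, delays and budgets are made up for the example, and level converters are not modeled:

```python
ALPHA = 1.3   # hypothetical velocity-saturation index (alpha-power law)
VTH = 0.4     # hypothetical threshold voltage (V)


def relative_delay(v: float, v_ref: float) -> float:
    """Gate delay at supply v relative to v_ref, using t ∝ V / (V - Vth)^alpha."""
    def t(u: float) -> float:
        return u / (u - VTH) ** ALPHA
    return t(v) / t(v_ref)


def assign_supplies(clusters: dict, supplies: list) -> dict:
    """Pick, for each cluster, the lowest supply that still meets its timing budget.

    clusters maps name -> (delay at the highest supply in ns, timing budget in ns).
    """
    v_max = max(supplies)
    result = {}
    for name, (delay_ns, budget_ns) in clusters.items():
        for v in sorted(supplies):            # try the lowest voltage first
            if delay_ns * relative_delay(v, v_max) <= budget_ns:
                result[name] = v
                break
        else:
            raise ValueError(f"{name} misses timing even at {v_max} V")
    return result


supplies = [3.3, 2.0]
clusters = {"critical": (9.5, 10.0), "noncritical": (5.0, 10.0)}
print(assign_supplies(clusters, supplies))
```

With these numbers the critical cluster would slow from 9.5 ns to about 12.5 ns at 2.0 V and must stay at 3.3 V, while the non-critical cluster meets its 10 ns budget at 2.0 V, where its switching energy drops to about (2.0/3.3)² ≈ 37% of the 3.3 V value.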
Slide 22: (No transcript)
Slide 23: Is there an optimal design point?
Slide 24: Power Dissipation and Circuit Delay
[Figure: power (W, scale x10^-4, 0 to 1) versus VDD (V, 0 to 4) and Vth (V, -0.4 to 0.8)]
(Taken from T. Sakurai)
Slide 25: Sensitivity to Vth fluctuation
[Figure: recoverable labels are VDD = 1.0 V, VTH = 0.15 V, 0.05 V, and 0.5]
(Taken from T. Sakurai)
Slide 26: Power-Delay Product, Energy-Delay Product
[Figure annotation: lowest voltage, highest threshold, no optimum] (from Sakurai, Kuroda, IEICE '95 paper)
- The power-delay product is a misleading measure: it will always favor a processor that operates at a lower frequency.
- Energy-delay is more adequate, but energy-delay² should be used.
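The effect of the metric choice can be illustrated with a toy model. The sketch assumes switching energy per operation E ∝ V² and an alpha-power-law delay D ∝ V/(V − Vth)^α with illustrative α = 1.3 and Vth = 0.3 V. At maximum speed the power-delay product equals the energy per operation, so minimizing it drives VDD to the lowest value allowed, while E·D and E·D² have interior optima (the names `best_vdd` and `grid` are hypothetical):

```python
ALPHA = 1.3   # hypothetical alpha-power-law index
VTH = 0.3     # hypothetical threshold voltage (V)


def energy(v: float) -> float:
    """Switching energy per operation, E ∝ C·V^2 (C normalized to 1)."""
    return v ** 2


def delay(v: float) -> float:
    """Alpha-power-law gate delay, D ∝ V / (V - Vth)^alpha."""
    return v / (v - VTH) ** ALPHA


def best_vdd(n: int, grid: list) -> float:
    """Supply on `grid` minimizing E·D^n (n=0: PDP/energy, 1: EDP, 2: ED^2)."""
    return min(grid, key=lambda v: energy(v) * delay(v) ** n)


grid = [0.35 + 0.001 * i for i in range(2651)]   # 0.35 V .. 3.0 V
for n, label in [(0, "PDP (energy/op)"), (1, "EDP"), (2, "ED^2")]:
    print(f"{label:16s} optimum VDD = {best_vdd(n, grid):.2f} V")
```

In this model the optimum of E·Dⁿ is V = (2 + n)·Vth / (2 + n − n·α): about 0.53 V for E·D and 0.86 V for E·D², while the power-delay product keeps falling down to the lowest grid voltage. The numbers are artifacts of the assumed parameters, not measurements.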
Slide 27: Power-Delay Product, Energy-Delay Product
Horowitz, Indermaur, and Gonzalez argue against power-delay (SLPE '94)
Slide 28: Energy-Delay²
(courtesy of Prof. T. Sakurai)
Slide 29: Energy-Delay Product vs. Energy-Delay²
Nowka, Hofstee, and Carpenter of IBM argue against energy-delay as a design efficiency measure (private communication)
Slide 30: Energy-Delay Product vs. Energy-Delay²
The same design should have roughly the same efficiency.
Optimal point (due to Vth being fixed?)
Nowka, Hofstee, and Carpenter of IBM argue against energy-delay as a design efficiency measure (private communication)
Slide 31: Example: PowerPC
Slide 32: (No transcript)
Slide 33: Use of Different Circuit Families
Slide 34: Capacitance Reduction
- The load capacitance is the sum of:
  - gate capacitance
  - diffusion capacitance
  - routing capacitance
- Using a small number of transistors, or small transistors, reduces the gate capacitance and the diffusion capacitance.
- Pass-transistor logic may have an advantage because it comprises fewer transistors and exhibits smaller stray capacitance than conventional static CMOS logic.
Slide 35: Pass-Transistor Logic
Slide 36: Pass-Transistor Logic: CVSL, CPL, SRPL, DSL, DPL, DCVSPG
Slide 37: SAPL (Sense-Amplifying Pass-transistor Logic)
All nodes are first discharged and then evaluated by the inputs. Outputs are 100 mV above GND.
Slide 38: Where does the power go?
Slide 39: Power use is different from chip to chip (from Sakurai, Kuroda, IEICE '95 paper)
MPU1 is a low-end microprocessor, MPU2 is a high-end CPU with a large cache, ASSP1 is an MPEG-2 decoder, and ASSP2 is an ATM switch.
Slide 40: Design Example: StrongARM 110
Two power modes: idle and sleep
Power: 0.5 W using a 1.1 V internal power supply, 184 Dhrystone MIPS @ 162 MHz
1.1 W using a 2 V internal power supply, 245 Dhrystone MIPS @ 215 MHz
Power breakdown: I-Cache 27%, D-Cache 16%, I-Unit 18%, Exec-Unit 8%, I-MMU 9%, D-MMU 8%, Clock 10%, Others 4% (PLL < 1%)
(from D. Dobberpuhl)
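The two operating points and the breakdown above can be turned into efficiency figures. A short Python sketch using only the numbers on the slide (the dictionary layout and variable names are illustrative):

```python
# (power W, internal supply V, Dhrystone MIPS, clock MHz) from the slide
modes = {
    "1.1 V": (0.5, 1.1, 184, 162),
    "2 V": (1.1, 2.0, 245, 215),
}
efficiency = {name: mips / p for name, (p, _v, mips, _mhz) in modes.items()}
for name, (p, _v, mips, mhz) in modes.items():
    print(f"{name}: {efficiency[name]:.0f} MIPS/W, {1e3 * p / mhz:.2f} mW/MHz")

# Power breakdown in percent; the entries should account for the whole chip
breakdown = {"I-Cache": 27, "D-Cache": 16, "I-Unit": 18, "Exec-Unit": 8,
             "I-MMU": 9, "D-MMU": 8, "Clock": 10, "Others": 4}
total = sum(breakdown.values())
caches = breakdown["I-Cache"] + breakdown["D-Cache"]
print(f"breakdown total {total}%, caches {caches}%")
```

At the lower supply the chip delivers about 368 MIPS/W versus about 223 MIPS/W at 2 V, and the two caches account for 43% of the power.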
Slide 41: Design Example: StrongARM 110
(from D. Dobberpuhl)
Slide 42: Design Example: StrongARM 110
(from D. Dobberpuhl)
However, leakage currents start to affect stand-by power.
Slide 43: Controlling both VDD and VTH for low power
Slide 44: Controlling VDD and VTH for low power
Low power → Low VDD → Low speed → Low VTH → High leakage → VDD-VTH control
Software-hardware cooperation
Technology-circuit cooperation
- MTCMOS: Multi-Threshold CMOS (multiple: spatial assignment)
- VTCMOS: Variable Threshold CMOS (variable: temporal assignment)
(from Prof. T. Sakurai)
Slide 45: (from Prof. T. Sakurai)
Slide 46: Clustered Voltage Scaling for Multiple VDDs
[Figure: conventional design vs. CVS structure, with flip-flops (FF), level-shifting F/Fs, and the critical path; the lower-VDD portion is shown shaded]
Once VL is applied to a logic gate, VL is applied to all subsequent logic gates up to the F/Fs, to eliminate DC current paths. The F/Fs restore VH.
M. Takahashi et al., "A 60mW MPEG4 Video Codec Using Clustered Voltage Scaling with Variable Supply-Voltage Scheme," ISSCC, pp. 36-37, Feb. 1998.
(from Prof. T. Sakurai)
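The CVS rule above lends itself to a simple netlist check. A hypothetical Python sketch: each gate is tagged VH or VL, flip-flops are assumed to be the level-shifting F/Fs that restore VH, and any edge where a VL gate drives a VH logic gate is reported. This only illustrates the rule; it is not the tool used by Takahashi et al., and the netlist is made up:

```python
from typing import Dict, List, Tuple


def cvs_violations(supply: Dict[str, str], is_ff: Dict[str, bool],
                   edges: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (driver, receiver) edges where a VL gate drives a VH logic gate.

    Under clustered voltage scaling, once a path drops to VL it stays at VL until
    it reaches a level-shifting F/F, which restores VH. A VL output driving a VH
    gate directly could leave that gate's PMOS partly on: a DC current path.
    """
    return [(driver, receiver) for driver, receiver in edges
            if supply[driver] == "VL" and supply[receiver] == "VH"
            and not is_ff[receiver]]


# Hypothetical netlist: ff0 -> g1 -> g2 -> g3 -> ff1, plus a side branch g2 -> g4
supply = {"ff0": "VH", "g1": "VH", "g2": "VL", "g3": "VL", "g4": "VH", "ff1": "VH"}
is_ff = {"ff0": True, "g1": False, "g2": False, "g3": False, "g4": False, "ff1": True}
edges = [("ff0", "g1"), ("g1", "g2"), ("g2", "g3"), ("g3", "ff1"), ("g2", "g4")]

print(cvs_violations(supply, is_ff, edges))
```

Fixing the reported edge means either moving g4 to VL (if its timing allows) or keeping g2 at VH.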
Slide 47: If you don't need to hustle, VDD should be as low as possible
[Figure: normalized power (0 to 1.0) versus normalized workload (0 to 1.0) for variable VDD and fixed VDD]
(from Prof. T. Sakurai)
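The shape of the two curves can be reproduced qualitatively with a toy model. The sketch assumes that a fixed-VDD system runs at full speed and then idles at zero power, that a variable-VDD system lowers f in proportion to the workload and picks the lowest VDD that still reaches that f, and an alpha-power-law speed model with illustrative α = 1.3, Vth = 0.4 V and VDD,max = 3.3 V (all function names are hypothetical):

```python
ALPHA, VTH, VMAX = 1.3, 0.4, 3.3   # hypothetical device and supply parameters


def speed(v: float) -> float:
    """Relative gate speed from the alpha-power law, f ∝ (V - Vth)^alpha / V."""
    return (v - VTH) ** ALPHA / v


def vdd_for_workload(w: float, tol: float = 1e-9) -> float:
    """Lowest VDD whose speed is w times the speed at VMAX (bisection; speed rises with V)."""
    lo, hi = VTH + 1e-6, VMAX
    target = w * speed(VMAX)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if speed(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi


def normalized_power(w: float, variable: bool) -> float:
    """Average power relative to full load at VMAX, with P ∝ V^2·f and f ∝ w."""
    if not variable:
        return w          # fixed VDD: finish at full speed, then idle (idle power ignored)
    return (vdd_for_workload(w) / VMAX) ** 2 * w


for w in (0.25, 0.5, 0.75, 1.0):
    print(f"workload {w:.2f}: fixed {normalized_power(w, False):.3f}, "
          f"variable {normalized_power(w, True):.3f}")
```

With these parameters a 50% workload needs only about 1.16 V, so normalized power falls to roughly 0.06 instead of 0.5.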
Slide 48: Measured voltage waveforms
(from Prof. T. Sakurai)
Slide 49: Measured power characteristics
$P_{total} = 0.8\,\mathrm{W} \times 0.08 + 0.16\,\mathrm{W} \times 0.86 + 0.07\,\mathrm{W} \times 0.06 \approx 0.2\,\mathrm{W}$
[Figure: power P (W) versus supply voltage VDD (V): at VDDmax (f = 200 MHz), 0.8 W for 8% of the time; at VDDmin (f = 100 MHz), 0.16 W (down to 1/5) for 86% of the time; sleep, 0.07 W for 6% of the time]
VDD hopping can cut down power consumption to 1/4.
(from Prof. T. Sakurai)
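The total-power expression is a time-weighted average over the operating states. A short Python check using the slide's numbers (the state labels and variable names are descriptive only):

```python
# (power in W, fraction of time) for each operating state, from the slide
states = {
    "VDDmax, f = 200 MHz": (0.8, 0.08),
    "VDDmin, f = 100 MHz": (0.16, 0.86),
    "sleep": (0.07, 0.06),
}
time_total = sum(frac for _, frac in states.values())   # should be 100% of the time
average = sum(p * frac for p, frac in states.values())  # time-weighted average power
print(f"average power = {average:.3f} W "
      f"({average / 0.8:.2f} of always running at VDDmax)")
```

The average comes to about 0.206 W, roughly a quarter of the 0.8 W drawn when running continuously at VDDmax, consistent with the 1/4 claim.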
Slide 50: Simulation results
[Figure: two panels, MPEG-2 video decoding and VSELP speech encoding, plotted versus transition delay T_TD (ms, 0 to 1.0) for RPC with 2 levels (f, f/2), 3 levels (f, f/2, f/3), 4 levels (f, f/2, f/3, f/4), and infinite levels, plus post-simulation analysis]
(from Prof. T. Sakurai)
Slide 51: Aggressive Voltage Scaling
Taken from Kuroda
If we can dynamically scale both Vdd and Vth, the advantage is obvious.
Slide 52: Example
Slide 53: Transmeta Example
Taken from Doug Laird's presentation, January 19th, 2000
Slide 54: Transmeta Example
Taken from Doug Laird's presentation, January 19th, 2000
Slide 55: Transmeta Example
Taken from Doug Laird's presentation, January 19th, 2000
- Code Morphing is another contributor to power reduction, since it eliminates unnecessary external memory accesses.
Slide 56: Transmeta Example
Slide 57: Latches and Flip-Flops for Low-Power
Slide 58: Simulation Condition and Testbench
- Timing
  - Total FF overhead is setup time + clock-to-output time
  - Circuit optimization towards $t_{D-Q}$
  - Clock skew robustness obtained from observing the D-Q curve
- Power-Delay Product
  - Overall performance parameter at fixed frequency
Slide 59: Flip-Flop Performance Comparison
Test bench
- Total power consumed
  - internal power
  - data power
  - clock power
- Measured for four cases
  - no activity (0000... and 1111...)
  - maximum activity (010101...)
  - average activity (random sequence)
Delay is (minimum D-Q) = Clk-Q + setup time
Slide 60
- OLD TEST BENCH
  - Total Power = Drivers' Power + Test Unit Power
  - PDP-optimized: equal trade-off on power and delay
  - Improper load on drivers
- NEW TEST BENCH
  - Drivers with fixed gain, driving the test unit only
  - Data-to-output delay
  - PD²P-optimized: best for constant-field scaling
[Figure: old test bench and new test bench]
Slide 61: Comparison in terms of speed and EDPtot
Technology: 0.2 µm, Vdd = 2 V, T = 20 °C, measured @ 100 MHz
- Delay below 200 ps
  - SDFF 187 ps
  - HLFF 199 ps
  - K-6 ETL 200 ps
- 200-300 ps
  - PowerPC latch 266 ps
  - 21264 Alpha FF 272 ps
  - Strong Arm FF 275 ps
  - mC2MOS latch 292 ps
- above 500 ps
  - SSTC latch 592 ps
  - DSTC latch 629 ps
  - SSTC latch 898 ps
  - DSTC latch 1060 ps
- PDPtot @ 100 MHz
  - below 30 fJ
    - PowerPC latch 28 fJ
    - HLFF 29 fJ
  - 30-50 fJ
    - SDFF 39 fJ
    - mC2MOS latch 40 fJ
    - 21264 Alpha FF 43 fJ
    - Strong Arm FF 45 fJ
  - 50-70 fJ
    - K-6 ETL 70 fJ
  - above 70 fJ
    - SSTC latch 95 fJ
    - DSTC latch 125 fJ
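The two lists can be combined into metrics that weight delay differently. A sketch under stated assumptions: PDPtot is taken to be total power times D-Q delay (it is given in fJ), so at the fixed 100 MHz test frequency the energy per cycle is proportional to power and E·D ranks structures exactly as PDPtot does, whereas an E·D²-style metric proportional to PDPtot × delay (in the spirit of the energy-delay² argument earlier in the talk) favors faster designs. Only structures that appear in both lists with a single value are included, and only the resulting order is meaningful:

```python
F_CLK_HZ = 100e6   # test frequency from the slide

# (D-Q delay in ps, PDPtot in fJ) from the two lists above
ffs = {
    "SDFF": (187, 39), "HLFF": (199, 29), "K-6 ETL": (200, 70),
    "PowerPC latch": (266, 28), "21264 Alpha FF": (272, 43),
    "Strong Arm FF": (275, 45), "mC2MOS latch": (292, 40),
}

# Assumption: PDPtot = P * D and energy per cycle E = P / f,
# so E*D = PDPtot / f and E*D^2 = PDPtot * D / f (relative values only).
edp = {name: pdp / F_CLK_HZ for name, (_, pdp) in ffs.items()}
ed2 = {name: pdp * d / F_CLK_HZ for name, (d, pdp) in ffs.items()}

by_edp = sorted(edp, key=edp.get)
by_ed2 = sorted(ed2, key=ed2.get)
print("E*D order:  ", by_edp)
print("E*D^2 order:", by_ed2)
```

Under these assumptions the PowerPC latch leads on E·D, while HLFF and SDFF move ahead of it once delay is weighted more heavily.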
Slide 62: Delay comparison
- F-F design brings the fastest structures
Slide 63: Delay comparison
- F-F design brings the fastest structures
Slide 64: Overall ranking @ 100 MHz
- EDPtot accepted as the overall cost function
- The low-power latches proposed by Yuan-Svensson do not show an advantage compared with the other presented structures (the optimization was not properly done; it is yet to be repeated under a different setup)
Slide 65: Overall ranking, zoomed
- Real signals have an activity (α) between 0 and 1.0
- Precharged hybrid structures are the fastest, but their power consumption strongly depends on the probability of ones
- More ones above the ? point
Slide 66: Overall performance
- Real signals have an activity (α) between 0 and 1.0
- Precharged hybrid structures are the fastest, but their power consumption strongly depends on the probability of ones
- More ones above the ? point
Slide 67: Conventional Clk-Q vs. minimum D-Q
- Hidden positive setup time
- Degradation of Clk-Q
Slide 68: Internal Power distribution
- Four sequences characterize the boundaries for internal power consumption:
  - 010101...: maximum
  - random (equal transition probability): average
  - 111111...: precharge activity
  - 000000...: leakage, internal clock processing
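The four characterization sequences are easy to generate and check. A hypothetical Python helper (the names `stimulus` and `activity` are illustrative) that produces each data pattern and measures its data activity, i.e., the fraction of cycles in which D changes:

```python
import random


def stimulus(kind: str, n: int, seed: int = 0) -> list:
    """Data bit sequence for one of the four flip-flop power-characterization cases."""
    if kind == "max":        # 010101...: data toggles every cycle (maximum)
        return [i % 2 for i in range(n)]
    if kind == "random":     # equal probability of 0 and 1 (average)
        rng = random.Random(seed)
        return [rng.randint(0, 1) for _ in range(n)]
    if kind == "ones":       # 111111...: slide label "precharge activity"
        return [1] * n
    if kind == "zeros":      # 000000...: slide label "leakage, internal clock processing"
        return [0] * n
    raise ValueError(f"unknown sequence kind: {kind}")


def activity(bits: list) -> float:
    """Fraction of clock cycles in which the data input changes."""
    return sum(a != b for a, b in zip(bits, bits[1:])) / (len(bits) - 1)


for kind in ("max", "random", "ones", "zeros"):
    print(f"{kind:6s} activity = {activity(stimulus(kind, 10000)):.3f}")
```

The 010101... pattern has activity 1.0, the constant patterns 0, and a long random sequence close to 0.5, matching the "maximum", "no activity" and "average" cases.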
Slide 69: Comparison of clock power consumption
Slide 70: Using Dual-Edge Flip-Flops (run the clock at ½ of the frequency to save on the power consumed in the clock distribution tree)
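The saving comes from the clock network toggling at half the rate for the same data throughput. A first-order Python sketch, assuming the clock tree's switched capacitance is unchanged when dual-edge flip-flops replace single-edge ones (in practice their clock load can differ) and a hypothetical 100 pF clock tree at the 1.8 V and 500 MHz used in the slide 71 comparison:

```python
def clock_tree_power(c_clk_f: float, vdd_v: float, f_clk_hz: float) -> float:
    """Clock network power P = C·V^2·f (the clock switches every cycle, activity 1)."""
    return c_clk_f * vdd_v ** 2 * f_clk_hz


C_CLK = 100e-12      # hypothetical total clock-tree capacitance (F)
VDD = 1.8            # supply voltage (V)
DATA_RATE = 500e6    # data samples per second that both designs must sustain

single_edge = clock_tree_power(C_CLK, VDD, DATA_RATE)     # one sample per clock cycle
dual_edge = clock_tree_power(C_CLK, VDD, DATA_RATE / 2)   # two samples per clock cycle

print(f"single-edge clock tree: {single_edge * 1e3:.0f} mW")
print(f"dual-edge clock tree:   {dual_edge * 1e3:.0f} mW")
```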
Slide 71: Dual-Edge vs. Single-Edge Flip-Flops Comparison
[Figure: delay (ps) and total power (µW) of each flip-flop]
- Fujitsu 0.18 µm process, clock frequency 500 MHz (250 MHz for dual-edge FFs)
- Data activity ratio α = 0.5
- VDD = 1.8 V
- Temp = 25 °C
Slide 72: Dual-Edge vs. Single-Edge Flip-Flops Comparison
[Figure: internal power (µW), clock power (µW), and data power (µW) of each flip-flop]
- Fujitsu 0.18 µm process, clock frequency 500 MHz (250 MHz for dual-edge FFs)
- Data activity ratio α = 0.5
- VDD = 1.8 V
- Temp = 25 °C
Slide 73: Silicon on Insulator (SOI) Technology
Slide 74: SOI Comparison
f = 1 GHz, α = 0.5, Le = 0.08 µm, VDD = 1.3 V, T = 25 °C
Slide 75: In conclusion...
- What can we expect low power to bring us?
Slide 76: Wearable Computer
Slide 77: Wearable Computer
Slide 78: Wearable Computer
Slide 79: Digital Ink
Slide 80: Implantable Computer
Slide 81: Bluetooth
Slide 82: Year 2110
Extrapolation of the trend with some saturation
Many important and interesting applications: home, entertainment, office, translation, health care
Year 2120???: more assembly techniques, 3D
[Figure: a "mosquito" computer for the year 2110, combining bio and semiconductor technology; labels: ultra-small volume, extremely low power, brain with a small number of neuron cells, (artificial) intelligence, real-time image processing, sensors (infrared, humidity, CO₂), 3D flight control, long lifetime by DNA manipulation, bio-computer]