Title: Power Consumption In High Performance Microprocessors
1. Power Consumption In High Performance Microprocessors
- Rajesh Kumar
- Desktop Products Group, Circuit Technology
- Intel, Hillsboro, Oregon
- rajesh.kumar_at_intel.com
- 03/13/2004
2. Outline
- Where does the power go?
- Clock power and frequency
- Optimal pipelining for power/performance: wide vs. fast
- Optimal leakage
- Interconnect power
- Summary
3. Silicon correlation: do we know where the power goes?
IREM image of 130 nm Pentium® 4 silicon running high-power floating point
Transistor-level power model of Pentium® 4: > 90% correlation
4. Where does the power go?
Data for 3.2 GHz Intel® Pentium® 4 in 130 nm
5. Clock Power Components
- Local clock power (distribution, flops and latches) dominates global
- Global clock techniques (optical clock, resonant clock, transmission line clock, globally asynchronous/locally synchronous (GALS), etc.) can only have a moderate impact on total power
Clock power breakdown in Pentium® 4
6. IA32 pipelining/frequency history
[Figure: relative pipelining and relative frequency (iso-process)]
- Increasing pipelining increases frequency (e.g. 2X pipelining from P6 to P4 -> 1.7X frequency) but increases clock power
- What is the optimal pipelining/frequency for power/performance efficiency?
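The 2X pipelining -> 1.7X frequency data point can be read through a simple first-order model, sketched here in Python. The model is an assumption made for illustration and is not taken from the slides: cycle time is the logic delay per stage plus a fixed per-stage overhead (latches, clock skew, jitter). The function names are hypothetical.

```python
def overhead_to_logic_ratio(pipelining_gain: float, frequency_gain: float) -> float:
    """Return o / L implied by an observed frequency gain.

    Assumed model: cycle = L / k + o, where L is the total logic delay of the
    baseline stage, k is the pipelining multiplier, and o is a fixed overhead.
    Solving (L + o) / (L / k + o) = g for o / L gives (1 - g / k) / (g - 1).
    """
    k, g = pipelining_gain, frequency_gain
    if not 1.0 < g < k:
        raise ValueError("model needs 1 < frequency_gain < pipelining_gain")
    return (1.0 - g / k) / (g - 1.0)


def frequency_gain(pipelining_gain: float, overhead_ratio: float) -> float:
    """Frequency multiplier predicted by the same model for a given o / L."""
    return (1.0 + overhead_ratio) / (1.0 / pipelining_gain + overhead_ratio)


ratio = overhead_to_logic_ratio(2.0, 1.7)   # P6 -> P4 data point from the slide
print(f"overhead / logic delay: {ratio:.3f}")
print(f"overhead share of the baseline cycle: {ratio / (1.0 + ratio):.3f}")
print(f"predicted gain for 4X pipelining: {frequency_gain(4.0, ratio):.2f}X")
```

Under this assumed model the fixed overhead is a bit under a fifth of the baseline cycle, and each further doubling of pipeline depth buys less frequency while still adding clocked elements.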
7. Optimal pipelining: past research
- FO4 inverter: universal, process-independent metric of logic depth
- Pentium® 4 in 180 nm, 130 nm: 18 FO4 inverters
- Will we see even faster (pipelined) machines in the future?
8. Communication-dominant structures experience area and power explosion with width -> why we don't build very slow/wide ST machines
[Figure: block diagrams labeled "2x Frequency" and "2x Width"; blocks: Front End, Rename, uop Queues, Schedulers, Register File, ALU, AGU, L1 D-cache. Source: Doug Carmean, Intel]
9. Architectural Power Efficiency
Little's law of queuing theory:
Latency × Bandwidth = Required Parallelism
- Latency is hard, bandwidth is easy(ier)
- Wider is not necessarily better than pipelined if maintaining global dependencies
- Much more power efficient to use easy parallelism rather than extract parallelism in hardware
  - Thread level (SMT, multiple cores), e.g. Hyperthreading gives 20-30% performance for negligible hardware or power increase in Pentium® 4
  - Data level, e.g. media extensions MMX, SSE, SSE2
  - ILP/DLP/TLP in media/graphics engines, e.g. Imagine
- Software usage/programming ease/enabling is the main bottleneck, not hardware design
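Little's law as used on this slide can be made concrete with a few lines of Python. This is a minimal sketch: the function name and the example numbers (a 20-cycle latency and 3 operations per cycle) are hypothetical and chosen only to show the arithmetic.

```python
def required_parallelism(latency_cycles: float, bandwidth_per_cycle: float) -> float:
    """Little's law: independent operations that must be in flight to
    sustain the given bandwidth at the given latency."""
    return latency_cycles * bandwidth_per_cycle


# Hypothetical machine: 20-cycle latency, 3 operations completed per cycle.
print(required_parallelism(20, 3))   # 60 operations in flight

# Doubling the bandwidth (a wider machine) at the same latency doubles the
# independent work the hardware or software has to find.
print(required_parallelism(20, 6))   # 120 operations in flight
```

The sketch shows why easy parallelism matters: the required number of independent operations grows with width, and it is cheaper to get them from threads or data-parallel code than to extract them in hardware.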
10. Source/Drain Leakage: Perception
- VCC reduced each process generation to reduce dynamic power (CV²f) and for reliability (gate oxide field, hot electrons, etc.)
- Decreasing VCC decreases gate overdrive (VCC - Vt) and hence speed. Vt needs to be reduced to gain back the speed
- Decreasing Vt increases S/D leakage exponentially
- (Perception) Leakage power is growing fast and will be the dominant source (40-70%) of total power in future technologies
- Flurry of leakage reduction arch/circuit/process proposals, even at the cost of switching power and chip performance
- Our view: S/D leakage is a design knob, not a process constant
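The two dependencies named on this slide, quadratic dynamic power in VCC and exponential leakage in Vt, can be sketched in Python. The subthreshold swing of 100 mV per decade is an assumed illustrative value, not a number from the slides, and the function names are hypothetical.

```python
def dynamic_power(activity: float, capacitance: float, vcc: float, frequency: float) -> float:
    """Switching power: activity * C * VCC^2 * f."""
    return activity * capacitance * vcc ** 2 * frequency


def leakage_multiplier(delta_vt: float, swing: float = 0.100) -> float:
    """Factor by which subthreshold leakage changes when Vt shifts by delta_vt volts.

    swing is the ASSUMED subthreshold swing in volts per decade of current.
    """
    return 10.0 ** (-delta_vt / swing)


# Scaling VCC from 1.0 V to 0.8 V (same C, f, activity) cuts switching power to 64%.
print(dynamic_power(1.0, 1.0, 0.8, 1.0) / dynamic_power(1.0, 1.0, 1.0, 1.0))

# Lowering Vt by 100 mV at the assumed swing raises leakage about tenfold.
print(leakage_multiplier(-0.100))
```

With these assumptions a Vt reduction of only about 30 mV already doubles leakage, which is why Vt acts as a sensitive design knob.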
11. Optimal Leakage
- Optimize all variables (architectural pipelining, transistor (Vt, Leff, Tox), VCC, etc.) to provide maximum performance at desired cost or power
At the optimal point, the cost/benefit ratio of all variables is the same!
[Figure: three plots with axes Freq and Power, one each for Leff, Vt, and Vcc]
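The equal cost/benefit condition is the standard first-order condition of a constrained optimization. As a sketch, assume frequency $F$ and power $P$ are differentiable functions of the design variables $x_i$ (Vt, Leff, VCC, and so on) and that the optimum is not pinned at a limit of any variable. The symbols $F$, $P$, $x_i$, $P_0$, and $\lambda$ are notation introduced here, not taken from the slides.

```latex
\max_{x_1,\dots,x_n} F(x_1,\dots,x_n)
\quad \text{subject to} \quad P(x_1,\dots,x_n) = P_0

\frac{\partial F}{\partial x_i} = \lambda \, \frac{\partial P}{\partial x_i}
\quad \Longrightarrow \quad
\frac{\partial F / \partial x_i}{\partial P / \partial x_i} = \lambda
\quad \text{for every } i
```

If one variable offered more frequency per unit of power than another, power could be shifted from the second to the first for a net gain, so the point would not be optimal.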
12. Impact Of Leakage On Speed
- VCC >> Vt (practical case): 10-15% leakage change for 1% speed => 2X leakage for 5-7% speed
Any leakage reduction idea must exchange << 5% speed for every 2X reduction in chip leakage (not easy) to be practical
[Figure: change in speed / change in Ioff vs. Vt (V), for VCC = 0.5 V, 0.75 V, 1 V]
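The step from "10-15% leakage per 1% speed" to "2X leakage for 5-7% speed" is a compounding calculation, sketched here in Python. It assumes that each 1% speed step multiplies leakage by the same factor; the function name is hypothetical.

```python
import math


def speed_percent_for_leakage_factor(leakage_factor: float,
                                     leakage_pct_per_speed_pct: float) -> float:
    """Percent speed change that accompanies a given leakage multiplier.

    ASSUMPTION: every 1% speed step multiplies leakage by
    (1 + leakage_pct_per_speed_pct / 100), so the steps compound.
    """
    per_step = 1.0 + leakage_pct_per_speed_pct / 100.0
    steps = math.log(leakage_factor) / math.log(per_step)   # number of 1% speed steps
    return (1.01 ** steps - 1.0) * 100.0                    # compounded speed change


for pct in (10.0, 15.0):
    speed = speed_percent_for_leakage_factor(2.0, pct)
    print(f"{pct:.0f}% leakage per 1% speed -> 2X leakage costs about {speed:.1f}% speed")
```

Both ends of the range land close to the 5-7% quoted on the slide, which is the bar any leakage reduction proposal has to beat.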
13. Optimal Leakage vs. Circuit Type
[Figure: optimal VCC and optimal Vt (VCC, Vt in V) and optimal leakage (leakage power) vs. switching activity factor; register files, datapath, and clock marked along the activity axis]
- Circuits with different activity factors want different transistors
  - Clock wants low VCC, low Vt (high leakage)
  - Register files and caches want high VCC and low leakage
Optimal leakage fraction is almost constant across a 50-100X range in activity factors!
14. Optimal Leakage Scaling
- Leakage consumes 20-30% of chip power at optimal setting
- Optimal leakage is almost constant with respect to
  - Process generation (130 nm, 90 nm, 65 nm) as long as VCC >> Vt
  - Total power budget (1 W or 100 W)
  - Frequency
  - Pipelining
  - Chip area
  - Signal switching probabilities (activity factors)
- Leakage reduction proposals must lose << 5% speed for 2X leakage reduction to be practical
15. Interconnect Power Scaling
% power in interconnect increasing due to increase in metal layers and (possibly) longer wires
Diffusion decreasing due to faster scaling of area component and material changes
SOI capacitance benefit should decrease
16. Interconnect Power
[Figure: % of total wire power in Pentium 4, 90 nm, vs. length (microns)]
- Only modest power in long wires (e.g. > 1000 um in 90 nm) -> modest gains for low power signaling techniques
- Low power signaling may be useful for specialized situations: on-chip networks, multiple core interface, etc.
17. Summary
- Clock is the biggest component of dynamic power. Reducing local clock power is much more important than global clock
- High performance on ST applications requires high power, whether through frequency or width
- Optimal leakage is 20-30% of chip power and remains fairly constant with process generations and chip architectures
- Wire power has increased significantly while diffusion has decreased
- Most wire power is in short wires: not much to gain from advanced, low power signaling techniques