Title: Closing the Power Gap between ASIC and Custom
2. Closing the Power Gap between ASIC and Custom
- David Chinnery, Kurt Keutzer
3. Outline
- Motivation for focusing on reducing ASIC power
- The power gap between ASIC and custom
- Where does the power go?
- What can we do about it?
- Conclusions on automating low power techniques
4. Why power?
- Battery life is limited by power (e.g. laptop, mobile phone)
- Costs for packaging and cooling increase rapidly with power dissipation (e.g. plastic vs. ceramic package, heatsink, fan)
- Higher temperatures degrade performance and reliability
  - Circuits are slower, with more leakage, at higher temperature
  - Less reliable due to increased rate of electromigration
- Increasing integration increases power demand in portable applications (e.g. mp3 player/PDA/mobile phone combined)
- Performance is now limited by power even for high end microprocessors
5. Power of high performance chips has increased
- As device dimensions (W, L, Tox) are scaled down by a factor k, for high performance:
  - If supply voltage Vdd and threshold voltage Vth are fixed, then power/unit area ∝ k^3
  - If Vdd and Vth are scaled down linearly and [...], then power/unit area ∝ k^0.7
- Further voltage scaling may be limited
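As a rough illustration of the two scaling regimes, the sketch below evaluates relative power per unit area for a few scaling factors. It assumes the relation in the bullets above is a proportionality, so power density goes as k^3 with fixed voltages and as k^0.7 with scaled voltages; the function name and the sample values of k are hypothetical.

```python
def power_density_growth(k, exponent):
    """Relative power/unit area after scaling dimensions down by factor k,
    assuming power density is proportional to k ** exponent."""
    return k ** exponent

for k in (1.4, 2.0, 4.0):
    fixed_v = power_density_growth(k, 3.0)    # Vdd and Vth held fixed
    scaled_v = power_density_growth(k, 0.7)   # Vdd and Vth scaled down
    print(f"k={k}: x{fixed_v:.2f} (fixed V) vs x{scaled_v:.2f} (scaled V)")
```

Even with voltage scaling the power density still grows with each generation under this assumption, which is why further limits on voltage scaling matter.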
6. Impact of voltage scaling on power
- Major components of power: Ptotal = Pdynamic + Pleakage
- Dynamic power is due to switching of capacitances
  - Reducing Vdd gives a quadratic reduction in Pdynamic
  - But transistor drive current depends on Vdd
  - Must reduce Vth to maintain drive current
  - But reducing Vth increases subthreshold leakage current, which is the major contributor to Pleakage
- Must look for other ways to reduce power
[Chen, Trans. on Electron Devices 1997]
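The quadratic dependence of dynamic power on supply voltage can be sketched numerically. The snippet assumes the standard switching-power model P = a·C·Vdd²·f (activity a, capacitance C, clock frequency f); the function name and the sample values are illustrative and not taken from the slides.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """Switching power in watts, assuming P = a * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz

# Hypothetical block: 1 nF switched capacitance, activity 0.1, 500 MHz clock.
p_high = dynamic_power(0.1, 1e-9, 1.2, 500e6)
p_low = dynamic_power(0.1, 1e-9, 0.9, 500e6)
ratio = p_high / p_low  # equals (1.2 / 0.9) ** 2 under this model

print(f"{p_high * 1e3:.1f} mW at 1.2 V, {p_low * 1e3:.1f} mW at 0.9 V, ratio x{ratio:.2f}")
```

The model covers only the dynamic term; the leakage increase from the lower Vth needed to keep drive current is not modelled here.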
7. Automate low power techniques
- Custom designers can try to optimize the design at all levels
- Electronic design automation (EDA) tools for ASICs
  - Most of the design optimization is high level
  - Fast time-to-market and lower design cost
  - Increasingly important to reduce design cost for larger chips
- What is the power gap between (automated) ASIC design and custom design?
  - We need to characterize the contributing factors
- Can we close the power gap?
  - Identify custom techniques that can be used in an EDA flow
9. What is our metric for power?
- Power
  - Fixed performance constraint (clock frequency or throughput, e.g. 30 frames/s for MPEG2)
  - Reduce the power and meet the performance constraint
- Energy efficiency
  - No performance constraint
  - Throughput/unit power, 1/(P × T × CPI), e.g. MIPS/mW
  - Cycles per instruction (CPI) accounts for impact of architectural choices (e.g. stalled pipeline stages)
  - Energy/operation is the inverse of throughput/unit power
  - Maximize throughput/unit power or minimize energy/operation
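The two energy-efficiency views can be related with a small sketch. It takes throughput/unit power as 1/(P × T × CPI), with P the power, T the clock period, and CPI the cycles per instruction; the function names, units, and sample numbers are assumptions chosen for illustration.

```python
def mips_per_mw(power_mw, clock_period_ns, cpi):
    """Throughput per unit power, 1 / (P * T * CPI), expressed in MIPS/mW."""
    instructions_per_sec = 1.0 / (clock_period_ns * 1e-9 * cpi)
    mips = instructions_per_sec / 1e6
    return mips / power_mw

def energy_per_op_nj(power_mw, clock_period_ns, cpi):
    """Energy per instruction in nJ: power times time per instruction."""
    seconds_per_instruction = clock_period_ns * 1e-9 * cpi
    return power_mw * 1e-3 * seconds_per_instruction * 1e9

# Hypothetical core: 100 mW, 2 ns clock period (500 MHz), CPI of 1.25.
efficiency = mips_per_mw(100.0, 2.0, 1.25)
energy = energy_per_op_nj(100.0, 2.0, 1.25)
print(efficiency, energy, efficiency * energy)
```

With these units 1 MIPS/mW corresponds to 1 nJ per instruction, so the product of the two values is 1, which is the inverse relationship stated on the slide.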
10. What is the power gap? ARM cores
- ×2 to ×3 gap between custom and hard macro ARMs
- ×1.3 to ×1.4 gap between ARM7TDMI-S and ARM7TDMI
- ×3 to ×4 overall from synthesizable to custom ARMs
11. What is the power gap? DCT/IDCT blocks
- ×4 to ×7 between discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) blocks, after scaling linearly for technology [Fanucci ICECS 2002]
  - We assumed power reduces linearly with technology
- To get 30 frames/s MPEG2 with a general purpose processor would require two ARM9 cores and would consume ×15 power [Fanucci ICECS 2002]
- Application-specific hardware substantially reduces power
13. Breakdown of power by functionality
- Typical breakdown of on-chip power consumption for an embedded microprocessor
  - Clock: 20% to 40%
  - Memory: 20% to 40%
  - Control and datapath: 40% to 60%
  - Input/output to off-chip: 5%
- Most of the power is in datapath, control, clock tree and memory
  - Techniques focus on reducing this power
  - Several companies provide custom memory for ASIC processes, so we won't discuss memory here
14. Summary of factors' effect on active power
- Automated designs are higher power than custom because of:

  Factor (by ASIC design quality)               | typical | excellent
  Microarchitecture (pipelining, parallelism)   | 2.6     | 1.3
  Clock gating and power gating                 | 1.6     | 1.0
  Logic design                                  | 1.2     | 1.0
  High speed logic styles (DCVSL, PTL, domino)  | 1.3     | 1.3
  Technology mapping                            | 1.4     | 1.0
  Cell sizing and wire sizing                   | 1.6     | 1.1
  Voltage scaling, multi-Vth, multi-Vdd         | 4.0     | 1.0
  Floorplanning and placement                   | 1.5     | 1.1
  Process variation and process technology      | 2.6     | 1.2
17. Microarchitecture
leverage for voltage scaling and sizing
- Increase throughput/cycle to allow Vdd reduction
- Pipelining inserts registers, increasing throughput
  - Limited by:
    - Reduction in instructions/cycle (1/CPI) due to branch misprediction, waiting to read or write memory, etc.
    - Power and delay for registers, data forwarding logic, and branch prediction
- Parallelism increases throughput in exchange for increased area
  - Limited by:
    - Routing, multiplexing, control overheads
(Figure label: insert registers)
18. Microarchitecture pipelining model
leverage for voltage scaling and sizing
- Pipeline power model [Hartstein 2003]
  - n stages, 1.1 latch growth vs. n, 0.05 for register power
- Minimum stage delay
  - ASIC: t_pipelining overhead of 10 FO4 (register delay) + 10 FO4 (imbalance)
  - Custom: t_pipelining overhead of 2.6 FO4 total, same t_combinational of 175 FO4
- CPI penalty of 0.025/stage for custom, and 0.05/stage for ASICs
- Add fits for dynamic and leakage power with voltage scaling and sizing
- At 40 FO4 delay constraint (500MHz for Leff = 0.1um), ASIC is ×2.6 worse
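The slide's stage-delay numbers can be explored with a small sketch. It assumes the common model in which the stage delay is t_combinational/n plus a fixed pipelining overhead, that the two ASIC overheads add (10 + 10 FO4), and a rule of thumb of roughly 500 ps of FO4 delay per um of Leff; none of these formulas survive in the transcript, so treat them as assumptions. Function names are hypothetical.

```python
import math

def min_stages(t_comb_fo4, t_overhead_fo4, t_cycle_fo4):
    """Fewest pipeline stages n with t_comb / n + t_overhead <= t_cycle."""
    budget = t_cycle_fo4 - t_overhead_fo4
    if budget <= 0:
        raise ValueError("cycle time does not cover the pipelining overhead")
    return math.ceil(t_comb_fo4 / budget)

def fo4_delay_ps(leff_um):
    """Assumed rule of thumb: FO4 delay is about 500 ps per um of Leff."""
    return 500.0 * leff_um

T_COMB = 175.0   # FO4, same combinational delay for ASIC and custom
T_CYCLE = 40.0   # FO4 delay constraint from the slide

asic_n = min_stages(T_COMB, 10.0 + 10.0, T_CYCLE)   # register + imbalance
custom_n = min_stages(T_COMB, 2.6, T_CYCLE)
clock_mhz = 1e6 / (T_CYCLE * fo4_delay_ps(0.1))      # cycle in ps -> MHz

print(asic_n, custom_n, clock_mhz)
```

Under these assumptions the ASIC needs about 9 stages against 5 for custom to meet 40 FO4, so it pays more register power and a larger CPI penalty, which is the direction of the ×2.6 gap quoted above.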
19. Microarchitecture
leverage for voltage scaling and sizing
- Custom IDCT pipelining to reduce Vdd [Xanthopoulos JSSC'99]
  - With pipeline: Vdd = 1.32V, 20% power overhead
  - Without pipeline: Vdd = 2.2V to meet throughput
- Parallel datapaths [Bhavnagarwala IEEE Trans. VLSI'00]
  - ×2 to ×4 reduction in power from reducing Vdd, enabled by the higher throughput of parallel datapaths
- Microarchitecture speed gap is ×1.8 (typical) to ×1.3 (excellent)
  - At a tight delay constraint, this corresponds to about ×2.6 to ×1.3 worse power due to higher Vdd, lower Vth, and wider gates to compensate
(Figure label: ×2)
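A back-of-the-envelope check of the pipelined IDCT example, assuming power scales with Vdd squared at fixed throughput and reading the pipelining cost as a 20% power overhead. The function name is hypothetical and the model ignores leakage.

```python
def vdd_power_ratio(vdd_before, vdd_after, overhead_fraction=0.0):
    """Power reduction factor from lowering Vdd, assuming P ~ Vdd^2 at
    fixed throughput, less a fractional power overhead for added logic."""
    return (vdd_before / vdd_after) ** 2 / (1.0 + overhead_fraction)

# 2.2 V without the pipeline vs. 1.32 V with it, 20% register overhead.
saving = vdd_power_ratio(2.2, 1.32, overhead_fraction=0.20)
print(f"x{saving:.2f}")
```

The result is a little over ×2, in line with the scale of savings the slide attributes to microarchitecture.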
21. Clock gating
×1.6 to ×1.0
- Clock signal has high activity (2); logic has lower activity (0.1)
- Turn off clocks to inactive modules
  - Some DCT/IDCT registers are active < 3% of the time; clock gating and avoiding computation reduces power by ×10 [August SOC'01]
  - Typical savings are up to ×1.6 power reduction
- Power minimization tools automatically insert gated clocks
- Designer can make microarchitectural/algorithm decisions
  - E.g. reduce precision for DCT/IDCT coefficients
  - Precomputation of control signals reduces power by ×1.4 to ×3.3 [Hsu ISLPED'02]
- ASICs can do this
(Figure label: insert clock gating)
22. Power gating
reduces leakage in standby
- Turn off leakage path in inactive modules
  - May need to preserve the state registers
  - Can reduce standby leakage by 3 orders of magnitude [Mutoh JSSC'95]
- Other approaches
  - Reverse biasing the substrate
  - Setting input vectors to low leakage states gives ×1.4 leakage reduction [Lee DAC'03]
- Just now getting ASIC methodology support
  - Need large sleep transistors to turn off power
  - Sleep transistors reduce available supply voltage
24. High speed logic styles
leverage for voltage scaling and sizing
- Low power designs use mostly static CMOS logic
  - Static CMOS logic is low leakage, robust
  - PMOS pullup series transistors are slow
- Faster custom logic styles speed up critical paths
  - Custom can use slack from higher speed (×1.4) to reduce power by lowering Vdd
- ASIC power is ×1.3 worse than custom at a tight delay constraint due to logic style
26. Technology mapping
×1.4 to ×1.0
- Technology mapping tools don't target low power
  - We found that targeting minimum area for multipliers can result in ×1.3 power; delay is a poor choice
- Technology mapping techniques to reduce active power
- ×1.0: ASICs can do as well as custom, if tools improve
(Figure label: equivalent logic, lower activity)
28. Cell sizing and wire sizing
×1.6 to ×1.1
- ×1.35 power reduction on Xtensa processor at 325MHz by (mostly sizing) power minimization with Design Compiler and 0.13um library [internship at Tensilica]
- Can do better than Design Compiler (DC) with cell sizing via linear program (LP) (global optimization vs. greedy pin-hole optimization), about ×1.1 to ×1.2 power reduction
29. Cell sizing and wire sizing
×1.6 to ×1.1
- Cell libraries lack fine-grained sizes and skewed PN drives
  - [Hurat SNUG'01] Generate new cells: ×1.2 power reduction and ×1.15 faster for bus controller, ×1.4 MHz/mW
- Simultaneous buffer and wire sizing reduced clock tree power by ×2.7 [Gong ISLPED'96]
  - ×1.1 to ×1.2 reduction in total power
  - Not available for ASIC interconnect yet
- Up to ×1.6 gap due to cell sizing and wire sizing; can reduce to ×1.1 using a library with finely-grained sizes, a good sizing tool, and design-specific cells
(Figure label: optimize transistor sizes)
31. Dynamic supply and substrate biasing
×4.0 to ×1.0
- Change Vdd based on processor load
  - ×10 more energy efficient at low performance [Burd ISSCC'00]
  - Adaptive voltage scaling with the ARM11 gives ×1.7 power reduction for voice, SMS, web applications [National Semiconductor, ARM '02]
- Reduce Vdd and bias substrate to lower Vth
  - ×1.7 reduction in power, same speed [Hamada CICC'98]
- Increase Vth in standby to reduce leakage
- These are complicated to automate for ASICs
  - Dynamic voltage requires accurate knowledge of path delays
32. Multiple supply and threshold voltages
×4.0 to ×1.0
- Basic idea: high speed where critical, low power elsewhere
- Dual Vdd reduces power by ×1.7 after substrate biasing/lower Vdd [Usami JSSC'98]
  - ×2 reduction in clock tree power by using low Vdd
- Separate voltage islands: different speeds and Vdd [Lackey ICCAD'02]
  - Turn off Vdd to modules not in use, reduces leakage by ×500
  - ×1.25 to ×3 average power reduction, depending on activities
- Dual Vth can give ×3 to ×6 reduction in leakage [Sirichotiyakul DAC'99]
- ASICs are limited to Vdd and Vth offered by library and foundry
  - Can't change Vth to design-specific optimal point
  - Standard cell libraries characterized at only two or three Vdd
  - Dual Vdd requires level converters and dual Vdd layout
34. Floorplanning and placement
×1.5 to ×1.1
- Poor floorplanning and cell placement, inaccurate wire loads
  - ×1.5 worse power than custom
- We compared partitioning a design into 50K vs. 200K gate modules from 0.25um to 0.13um
  - 42% longer wires for 200K partitions
  - Interconnect is 20% to 40% of total power [Sylvester ICCAD'98]
  - ×1.1 to ×1.2 increase in total power due to wiring, and gates will be upsized to drive the longer wires
(Figure labels: automatic place and route; block partitioned [Hauck Micro. Report '01])
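The ×1.1 to ×1.2 wiring estimate can be roughly reproduced with simple arithmetic. The sketch assumes interconnect power grows linearly with wire length, reads the slide's figures as 42% longer wires and a 20% to 40% interconnect share of total power, and ignores the extra power from upsized gates; the function name is illustrative.

```python
def total_power_factor(interconnect_fraction, wire_length_increase):
    """Total power multiplier when only the interconnect share of power
    grows, in proportion to the fractional increase in wire length."""
    return 1.0 + interconnect_fraction * wire_length_increase

low = total_power_factor(0.20, 0.42)    # interconnect is 20% of power
high = total_power_factor(0.40, 0.42)   # interconnect is 40% of power
print(f"x{low:.2f} to x{high:.2f}")
```

This gives about ×1.08 to ×1.17, consistent with the quoted range before any gate upsizing is counted.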
35. Floorplanning and placement
×1.5 to ×1.1
- Bit slices can reduce wire length by 70% or more vs. automated place-and-route
  - Up to ×1.4 energy reduction, as faster and with lower wiring capacitance [Chang SM Thesis MIT'98]
  - ×1.5 energy reduction from bit slicing and some logic optimization [Stok, Puri, Bhattacharya, Cohn]
- Manual place-and-route achieves 10% shorter wires and is ×1.1 faster, about ×1.1 energy reduction [Chang SM Thesis MIT'98]
- ASICs are still ×1.1 higher power than custom due to layout
(Figure labels: automatic place-and-route; tiled bit-slices; custom)
37. Process variation impact on power
×2.6 to ×1.2
- ASICs are designed to work at the worst case delay and worst case power corners for the process; typical delay and power are less
  - Simulated power was ×1.7 actual power for custom DCT/IDCT
  - Up to a factor of ×1.75 between worst and best (average power of 80 chip samples in 0.3um)
38. Process variation impact on power
×2.6 to ×1.2
- Binning would leave a gap of ×1.4 between low and high bins
- We found a gap of ×1.2 between low speed (high power) and high speed (low power, after derating for Vdd and frequency) bins of 0.18 and 0.13um Intel and AMD PC chips
- ASICs don't speed bin (they scan test, no speed test)
(Figure labels: 1.4; low power bin; higher power bin)
39. Process technology
×2.6 to ×1.2
- Low power libraries are more expensive
  - 5% to 10% transistor width shrinks to reduce capacitances
  - Copper is 40% lower resistivity than aluminum
  - Low-k dielectric reduces wire capacitances; we estimate about a ×1.1 reduction in total power with a low-k dielectric
  - Silicon-on-insulator is ×1.1 to ×1.3 faster, ×1.4 power reduction [Narendra Symp. VLSI 2001]
- We compared cell libraries in UMC 0.13um vs. IBM 0.13um process
  - IBM cells about ×1.05 faster, ×1.6 higher active power; UMC had ×17 leakage
- Overall impact of process variation and technology
  - ×2.6 ASIC power relative to custom for worst case conditions and a cheap process
  - ×1.2 in a low power process, typical conditions, no speed binning
41. Low power design conclusions
- Typical ASIC is ×3 to ×7 less energy efficient than custom
  - We assumed ASIC and custom designs can use the same microarchitectural and logic design techniques. These are the biggest levers for reducing power.
  - Can get 10× or more going from general purpose hardware to application-specific hardware.
  - E.g. fast Fourier transform implementations as discussed in Andrew Chang's paper.
- The largest factor for the power gap is voltage scaling, responsible for up to ×4
- Process and microarchitecture can be large factors, about ×2.6 each
42. Low power design conclusions
- By incorporating custom techniques, ASICs can get within:
  - ×3 at a high performance target
    - Can't use custom logic styles
    - ASIC speed penalty drags down efficiency, as higher Vdd, lower Vth, and upsized gates are needed to meet performance target
  - ×1.5 at a lower performance target (2× slower)
    - Make full use of scaling down Vdd and Vth
43. Low power ASIC design example
- 0.13um DSP example [Stok, Puri, Bhattacharya, Cohn]
  - 240,000 gates implementing Hilbert transform, FIR filter, and fast Fourier transform, with 42KB register array
  - Technology mapping, logic design (carry save adders), bit-slicing, physical synthesis gave ×1.86 increase in efficiency
  - A fine grained standard cell library gave another ×1.16
  - Voltage scaling gave another factor of ×1.46
  - ×3.1 increase in MHz/mW overall
- The third speaker, Ruchir Puri, will discuss some of their recent low power work at IBM.
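The overall figure in the DSP example is roughly the product of the individual gains, if the three steps are assumed to compound independently. A quick check, with labels abbreviated from the slide:

```python
from math import prod

# Efficiency gains (MHz/mW multipliers) quoted for the 0.13um DSP example.
gains = {
    "mapping, logic design, bit-slicing, physical synthesis": 1.86,
    "fine grained standard cell library": 1.16,
    "voltage scaling": 1.46,
}

overall = prod(gains.values())
print(f"x{overall:.2f} overall")
```

The product comes to about ×3.15, matching the slide's ×3.1 to within rounding.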
44. Extra slides
45. Impact of voltage scaling on power
- Ptotal = Pdynamic + Pshort circuit + Pstatic
- Short circuit power when switching is 10% or less of Ptotal
- Dynamic power is due to switching of capacitances
  - Reducing Vdd gives a quadratic reduction in Pdynamic
  - But transistor drive current depends on Vdd
  - Must reduce Vth to maintain drive current
  - But reducing Vth increases subthreshold leakage current, which is the major contributor to Pstatic
- (Clock frequency f; gate switching activity a; capacitance C; transistor length L; transistor gate oxide thickness Tox; temperature T; constants b, t, Io, and m.)
(Figure label: dynamic power)
[Chen, Trans. on Electron Devices 1997]
46. ITRS leakage power trends
- Can't scale down Vth much further due to large subthreshold leakage currents
- Gate tunneling leakage through thin gate oxide Tox is also becoming a significant cause of leakage
- Further Vdd voltage scaling will be limited
- Must also look to other low power techniques
(Figure labels: fast, low Vth; slow, high Vth; leakage increasing. From International Technology Roadmap for Semiconductors data for 2001-2016, assuming activity of 0.1, ignoring interconnect.)
47. Summary of factors affecting (active) power
- Automated designs are higher power than custom because of:

  Factor (by ASIC design quality)               | typical | excellent
  Microarchitecture (pipelining, parallelism)   | 2.6     | 1.3
  Memory                                        | 1.4     | 1.0
  Clock gating and power gating                 | 1.6     | 1.0
  Logic design                                  | 1.2     | 1.0
  High speed logic styles (DCVSL, PTL, domino)  | 1.3     | 1.3
  Technology mapping                            | 1.4     | 1.0
  Cell sizing and wire sizing                   | 1.6     | 1.1
  Voltage scaling, multi-Vth, multi-Vdd         | 4.0     | 1.0
  Floorplanning and placement                   | 1.5     | 1.1
  Process variation and process technology      | 2.6     | 1.2
48. Memory: reduce cache misses
×1.4 to ×1.0
- Larger caches consume more power, but reduce cache misses
  - Pipeline stalls, waiting many cycles for read/write to off-chip memory
- Caches with higher associativity (e.g. 8-way vs. direct mapped) consume more power; associativity also affects the likelihood of a cache miss
- [Duarte ASIC/SOC 2001]
  - Sub-banking: only precharge the needed section of the cache bank, ×1.32 energy savings
  - Software optimizations to reduce cache misses gave on average a ×1.6 reduction in power
- 90% of the StrongARM area was caches; increasing the transistor length in the caches by 12% reduced leakage by ×20 [Montanaro JSSC'96]
- ASICs can do this; custom memory is available for ASICs
50. Logic design
×1.2 to ×1.0
- Logic design refers to the topology and logic structure to implement functional units
  - Logic switching activity of a carry select adder was ×1.8 worse than a 32-bit carry lookahead [Callaway VLSI Signal Proc.'92]
  - 0.13um 64-bit radix-2 compound domino adder was slower and about ×1.3 energy compared to radix-4 [Zlatanovici ESSCIRC'03]
  - We implemented an algorithm to reduce switching activity in multipliers, reduced energy by ×1.1 for 64-bit [Ito ICCD'03]
- Given similar design constraints, ASIC designers can choose the same logic design as custom, ×1.0