Title: VLSI Datapath Choices: Cell-Based Versus Full-Custom
2Explaining The Gap Between ASIC and Custom Power: A Custom Perspective
- Andrew Chang, Cadence Design Systems
- William J. Dally, Computer Systems Laboratory, Stanford University
Work done while the author was at Stanford
5Design Tradeoffs Power vs. Performance
1. Move to a more energy-efficient operating point: more energy efficient w/ custom
2. Trade performance for power: larger range w/ custom
3. Move to a different power vs. performance curve: more architectural choice with custom
[Figure: power (vertical axis) vs. performance (horizontal axis), with the three moves labeled 1, 2, and 3]
6Dynamic Power Dissipation
- $P_{dyn} = \alpha C V_{dd}^2 f = \alpha E_{circuit} f$
- Reduce $V_{dd}$
  - Static, dynamic, voltage islands, power gating
- Reduce $\alpha$ and/or $f$
  - Clock gating, block enables, bus encoding, glitch identification and elimination
- Reduce $E_{circuit}$
  - Engineer interconnects, increase circuit efficiency, subthreshold circuit techniques
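As a rough numerical illustration of the dynamic power relation on this slide, the sketch below evaluates activity factor times C times Vdd squared times f and shows the quadratic payoff of lowering Vdd at a fixed frequency. The function name and the sample values (activity factor, switched capacitance, frequency) are illustrative assumptions, not figures from the talk.

```python
def dynamic_power(alpha, c_farads, vdd, f_hz):
    """Dynamic power in watts: alpha * C * Vdd^2 * f."""
    return alpha * c_farads * vdd ** 2 * f_hz


# Hypothetical operating point: 10% activity, 1 nF switched capacitance, 1 GHz.
p_nominal = dynamic_power(0.1, 1e-9, 1.3, 1e9)
p_half_vdd = dynamic_power(0.1, 1e-9, 0.65, 1e9)

# Ratio of the two powers when only Vdd changes (frequency held fixed).
ratio = p_nominal / p_half_vdd
```

Because Vdd enters squared, halving it at constant frequency cuts dynamic power by about a factor of four; in practice the achievable frequency also drops as Vdd falls.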
7Static Power Dissipation
- $P_{static} = V_{dd} (I_{sub} + I_{ox})$
- $I_{sub} = K_1 W e^{-V_t / (n V_\theta)} (1 - e^{-V_{gs} / V_\theta})$
- $I_{ox} = K_2 W (V_{gs} / t_{ox})^2 e^{-\alpha t_{ox} / V_{gs}}$
- With $K_1$, $K_2$, $n$, and $\alpha$ experimentally determined
- Reduce $V_{dd}$
  - Static, dynamic, voltage islands, power gating
- Increase effective $V_t$
  - Substituting high-threshold devices, transistor stacking, static and active body bias
- Reduce effective $W$
  - Reduce number and size of devices in design
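The leakage model on this slide can be sketched numerically as follows. This is a minimal sketch under stated assumptions: the exponents are taken as negative (the usual form of the subthreshold and gate-oxide expressions), the thermal voltage is set to about 25.9 mV (room temperature), and every fitting constant passed in is a placeholder, since the slide says K1, K2, n, and alpha are determined experimentally.

```python
import math


def subthreshold_current(k1, w, vt, vgs, n=1.5, v_theta=0.0259):
    """I_sub = K1 * W * exp(-Vt / (n * Vtheta)) * (1 - exp(-Vgs / Vtheta))."""
    return k1 * w * math.exp(-vt / (n * v_theta)) * (1.0 - math.exp(-vgs / v_theta))


def gate_oxide_current(k2, w, vgs, tox, alpha):
    """I_ox = K2 * W * (Vgs / tox)^2 * exp(-alpha * tox / Vgs)."""
    return k2 * w * (vgs / tox) ** 2 * math.exp(-alpha * tox / vgs)


def static_power(vdd, i_sub, i_ox):
    """P_static = Vdd * (I_sub + I_ox)."""
    return vdd * (i_sub + i_ox)


# Placeholder constants (K1 = 1, W = 1): compare two threshold voltages
# that differ by 100 mV, with the assumed n = 1.5.
low_vt = subthreshold_current(1.0, 1.0, vt=0.3, vgs=1.0)
high_vt = subthreshold_current(1.0, 1.0, vt=0.4, vgs=1.0)
reduction = low_vt / high_vt
```

With these assumed values, raising the threshold by 100 mV lowers subthreshold leakage by roughly an order of magnitude, which is why the slide lists "increase effective Vt" as a lever.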
10Which Design Is More Efficient?
- 0.7um CMOS 173MHz chip w/ 460K Ts
  - Vdd (typ) 3.3V, Vdd (min) 1.1V
  - Power 845mW
- 0.18um CMOS 10kHz chip w/ 640K Ts
  - Vdd (max) 1.8V, Vdd (min) 0.18V
  - Power 1.6mW
11Talk Outline
- Normalized Metric Ebit
- Effect of Architecture
- ASIC vs. Custom
- Building Blocks
- Achievable Energy Efficiency
- 16b 1024 FFT Example
- Answer to Which Design is More Efficient
13Defining Ebit
- $E_{bit} = C_{bit} V_{dd}^2$
- $C_{bit} = 4 \times 2\,\mathrm{fF}/\mu\mathrm{m} \times W_{min}$
- Energy needed to write a 1-bit SRAM cell
- Approximates minimum useful capacitance
- The ratio of Ebit to the energy for a range of circuits remains largely constant with technology scaling
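A small sketch of how the Ebit normalization could be applied. It assumes the capacitance line reads as 4 times 2 fF/um times Wmin; the function names and the sample Wmin and Vdd values are hypothetical and only illustrate the unit conversion.

```python
C_GATE_F_PER_UM = 2e-15  # 2 fF per um of gate width, as given on the slide


def e_bit(w_min_um, vdd):
    """Energy of one Ebit in joules: Cbit * Vdd^2, with Cbit = 4 * 2 fF/um * Wmin."""
    c_bit = 4 * C_GATE_F_PER_UM * w_min_um
    return c_bit * vdd ** 2


def in_ebits(energy_joules, w_min_um, vdd):
    """Express an absolute energy as a multiple of Ebit for a given process."""
    return energy_joules / e_bit(w_min_um, vdd)


# Hypothetical process: Wmin = 0.28 um at Vdd = 1.8 V.
unit = e_bit(0.28, 1.8)
normalized = in_ebits(1000 * unit, 0.28, 1.8)
```

Dividing a measured energy by the Ebit of its own process and supply is what allows designs in different technologies to be placed on one scale.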
14Technology Scaling for Ebit
- χ is a normalized unit of distance equal to the M1 pitch
15Technology Scaling for Nand2
[Figure: NAND2 schematic and layout with inputs A and B and output YN; cell dimensions 4χ = 2.24um by 8χ = 4.48um]
- χ is a normalized unit of distance equal to the M1 pitch
16Applying Ebit
17Talk Outline
- Normalized Metric Ebit
- Effect of Architecture
- ASIC vs. Custom
- Building Blocks
- Achievable Energy Efficiency
- 16b 1024 FFT Example
- Answer to Which Design is More Efficient
21Effect of Architecture: ASIC Architecture 6x Efficiency
- Intel Pentium-4: Design Style Custom, 2600MHz, 55M Transistors, 60 Watts, 5 GFlops, 5 GB/s
- NVIDIA GeForceFX: Design Style ASIC, 400MHz, 125M Transistors, 20 Watts, 10 GFlops, 13 GB/s
22Custom Circuits 9x (7x) Efficiency
- Intel Pentium-4: Design Style Custom, 2600MHz, 55M Transistors, 60 Watts, 5 GFlops, 5 GB/s, Vdd 1.3V
- NVIDIA GeForceFX: Design Style Custom, 400MHz, 125M Transistors, 3 Watts, 10 GFlops, 13 GB/s, Vdd 0.65V
23Combined Architecture and Circuits: 40x Improvement but 1.5 Years vs. 3 Years
- Intel Pentium-4: Design Style Custom, 2600MHz, 55M Transistors, 60 Watts, 5 GFlops, 5 GB/s, Vdd 1.3V
- NVIDIA GeForceFX: Design Style Custom, 400MHz, 125M Transistors, 3 Watts, 10 GFlops, 13 GB/s, Vdd 0.65V
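The headline factors on these slides can be checked with GFlops per watt, using only the figures quoted above (5 GFlops at 60 W for the Pentium-4, 10 GFlops at 20 W for the ASIC GeForceFX, and 10 GFlops at 3 W for the custom-circuit case). The helper name is an illustrative assumption.

```python
def gflops_per_watt(gflops, watts):
    """Throughput efficiency in GFlops per watt."""
    return gflops / watts


p4_custom = gflops_per_watt(5, 60)    # Pentium-4, custom design style
gfx_asic = gflops_per_watt(10, 20)    # GeForceFX as built (ASIC)
gfx_custom = gflops_per_watt(10, 3)   # GeForceFX with custom circuits (3 W case)

architecture_gain = gfx_asic / p4_custom   # effect of architecture alone
circuit_gain = gfx_custom / gfx_asic       # effect of custom circuits alone
combined_gain = gfx_custom / p4_custom     # architecture and circuits together
```

This arithmetic gives about 6x for architecture and about 40x combined, matching the slide titles. The circuit-only ratio from the power figures is about 6.7x, closest to the parenthesized 7x; the 9x figure is not reproduced by these numbers alone.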
24Talk Outline
- Normalized Metric Ebit
- Effect of Architecture
- ASIC vs. Custom
- Building Blocks
- Achievable Energy Efficiency
- 16b 1024 FFT Example
- Answer to Which Design is More Efficient
26ASIC vs. Custom
- ASIC Methods
  - Provide only coarse-grain control (100K gates), but require much less effort and historically scale with complexity
- Custom Methods
  - Offer fine-grain control (individual transistors and gates), but require large effort and scale poorly with complexity
  - Exploits Design Structure
  - Exploits Circuit Techniques
28Custom Methods Emphasize Fine-Grain Manual Control, Custom Library
Operation and Performance Characterized for the Specific Case
30ASIC Methods Substitute Coarse-Grain Control, Automation, Generic Library
Operation and Performance Characterized for the Typical/Generic Case
31ASIC Focus on 100K Gates: Lost Opportunities to Exploit Structure
- Designs reuse similar basic building blocks
- Building blocks are 1-10K gates, not 100K gates
  - 64-bit adder: 1K gates
  - 64x64 rf: 2K gates
  - 64x64 multiplier: 20K gates
- Opportunities to exploit these structures are lost when the design is viewed in large chunks
32Different Architectures Similar Building Blocks
- 1998 MAP 64b Microprocessor - 5M Ts (MIT/Stanford)
- 2002 Imagine 32b Stream Processor - 22M Ts (Stanford)
[Figure: die photos of both chips with blocks labeled EX, RF, SRAM, XCVRS, Bus]
33Significant Structure Exists Within 100K-gates
- 1998 MAP 64b Microprocessor - 5M Ts (MIT/Stanford)
- 2002 Imagine 32b Stream Processor - 22M Ts (Stanford)
[Figure: die photos of both chips with blocks labeled EX, RF, SRAM, XCVRS, Bus, C, L]
34Energy of 100K-gate Equivalent
- ASIC (N2) 1400K Ebits (typ)
- Custom Logic 424K Ebits
- SRAM (small) 1085K Ebits
- SRAM (med) 155K Ebits
- SRAM (large) 50K Ebits
- Based on data extracted from Intel McKinley
35Exploiting Circuit Techniques
- Custom circuits more efficient
- Reduced parasitics
- 1.7x circuit techniques and flops
- 1.4x libraries
- 1.4x due to engineering interconnects
- Subthreshold Circuits
- Low Performance but ultra-low power
- Requires Architecture, Gates, Memories, CAD Tools
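The three circuit-level factors listed above can be combined in a quick check against the logic energies quoted on the "Energy of 100K-gate Equivalent" slide (1400K Ebits for ASIC, 424K Ebits for custom logic). Treating the factors as independent and multiplicative is an assumption made here for illustration; the variable names are hypothetical.

```python
import math

# Factors as listed on the slide; multiplying them assumes they are independent.
circuit_factors = {
    "circuit techniques and flops": 1.7,
    "libraries": 1.4,
    "engineered interconnects": 1.4,
}
combined_factor = math.prod(circuit_factors.values())

# Ratio of the quoted 100K-gate energies, in thousands of Ebits.
asic_vs_custom_logic = 1400 / 424
```

Both routes land near 3.3x, which is consistent with the roughly 3x custom energy advantage the talk cites.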
36Relating Power to Performance: CV/I, Idsat, tFO4
- $I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$
- $t_{FO4} = K_4 C_{eff} V_{dd} / I_{dsat}$ ($K_4 = 13.5$)
37Relating Power to Performance: Relating Vdd and Vt to tFO4
38Relating Power to Performance: Correlation to Reported Foundry Data
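To see how the Idsat and tFO4 expressions tie supply voltage to speed, the sketch below computes a relative FO4 delay. It assumes Vgs equals Vdd and that Leff, tox, Ceff, K3, and K4 are held constant so they cancel in a ratio; the threshold voltage of 0.3 V is a hypothetical value, not one from the talk.

```python
def relative_fo4_delay(vdd, vt, exponent=1.25):
    """Delay up to a constant factor: Vdd / (Vdd - Vt)^1.25, with Vgs = Vdd."""
    if vdd <= vt:
        raise ValueError("model only applies above threshold (Vdd > Vt)")
    return vdd / (vdd - vt) ** exponent


def slowdown(vdd_new, vdd_ref, vt):
    """How many times slower a gate runs at vdd_new than at vdd_ref."""
    return relative_fo4_delay(vdd_new, vt) / relative_fo4_delay(vdd_ref, vt)


# Hypothetical Vt of 0.3 V: cost in delay of halving Vdd from 1.3 V to 0.65 V.
halved_supply_slowdown = slowdown(0.65, 1.3, 0.3)
```

Under these assumptions, halving Vdd costs somewhat less than a 2x slowdown while dynamic energy per operation drops about 4x, which is the trade the power vs. performance curves describe.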
39Achievable Power Improvement (Assuming 50/50
split of Logic and Memory)
42Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)
- 130nm uP assumes 80% Dynamic and 20% Static
- 90nm uP assumes 50% Dynamic and 50% Static
43Talk Outline
- Normalized Metric Ebit
- Effect of Architecture
- ASIC vs. Custom
- Building Blocks
- Achievable Energy Efficiency
- 16b 1024 FFT Example
- Answer to Which Design is More Efficient
4516b 1024 point FFT
- Generally, k N log N operations (complex multiplies) with pre-computation
- Radix-2, Radix-4 etc. implementations
- Decimation in time and/or decimation in frequency
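As a concrete instance of the k N log N operation count, the sketch below counts twiddle-factor complex multiplies for radix-2 and radix-4 decompositions of a 1024-point FFT. It is an upper bound that also counts trivial twiddles (multiplications by 1 or by plus or minus j); the function names are illustrative.

```python
def stages(n, radix):
    """Number of radix-r stages; n must be an exact power of the radix."""
    count = 0
    while n > 1:
        if n % radix:
            raise ValueError("n must be a power of the radix")
        n //= radix
        count += 1
    return count


def twiddle_multiplies(n, radix):
    """Upper bound: (radix - 1) twiddles per butterfly, n / radix butterflies per stage."""
    return (radix - 1) * (n // radix) * stages(n, radix)


radix2_count = twiddle_multiplies(1024, 2)  # (N/2) * log2(N)
radix4_count = twiddle_multiplies(1024, 4)  # (3N/4) * log4(N)
```

Here k is 1/2 for radix-2 with a base-2 logarithm and 3/4 for radix-4 with a base-4 logarithm, so the higher radix needs fewer multiplies for the same N.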
46Range of Implementations
- MIT FFT (2005)
  - 0.18um CMOS, 628K Ts, 10kHz. Architecture and subthreshold circuits, 180mV operation
- Spiffee (1999)
  - 0.7um CMOS, 460K Ts, 173MHz. Cached FFT architecture and algorithm, 1.1V operation
- SA-1100 (1999)
  - 0.35um CMOS, 2.6M Ts, 74MHz. Commercial embedded processor, custom circuits, 1.5V operation
- Imagine (2003)
  - 0.15um CMOS, 22M Ts, 232MHz. Streaming media processor, tiled standard cells, 1.2V operation
- Stratix IS25F627C8 (2005)
  - 0.13um CMOS, 3.9K logic elements, 123K memory bits, 24 DSP blocks, 272MHz. Commercial FPGA co-processor
- Intel P4 (2003)
  - 0.13um CMOS, 3GHz, SSE. Commercial general-purpose processor, custom circuits, 1.5V operation
- TI C6416 (2003)
  - 0.13um CMOS, 720MHz. Commercial digital signal processor
47Ebit Energy 16b 1024 point FFT
48Ebit Energy 16b 1024 point FFT
50Which Design Is More Efficient? Depends on the Metric!
- 0.7um CMOS 173MHz chip w/ 460K Ts
  - Vdd (typ) 3.3V, Vdd (min) 1.1V
  - Power 845mW
  - EDP 143x better
- 0.18um CMOS 10kHz chip w/ 640K Ts
  - Vdd (max) 1.8V, Vdd (min) 0.18V
  - Power 1.6mW
  - Absolute energy 6x better
51Summary
- Normalized metric Ebit enables meaningful comparisons across designs and technologies
- Custom designers can exploit a wide range of optimizations, enabling architecture with circuits and circuits with architecture
- Custom designs can readily achieve a 3x advantage in energy with the potential for over 10x
- Selective application of custom techniques and automated support for performance characterization at specific instead of generic operating points can enable ASIC designers to begin to bridge this power gap.
52Back-Up Slides
53ASIC Rely on General Optimization Techniques: Focus - Improve the Average Case
- Partitioning hypergraph: min-cut, ratio cut
- Solutions: move-based, geometric, combinatorial forms, clustering
[Figure: a circuit and its hypergraph H(V, E), where E = {e1, e2, ...} are the nets (e1 through e8 shown, with vertices such as V3 and V4); vertex and edge weights are used to encode costs]
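The min-cut objective named on this slide can be illustrated by counting the nets that span both sides of a bipartition. The netlist below is a hypothetical three-net example, not the circuit from the slide's figure, and the function name is an assumption.

```python
def cut_size(nets, part_a):
    """Count nets (hyperedges) that have pins on both sides of a bipartition."""
    part_a = set(part_a)
    cut = 0
    for pins in nets.values():
        pins = set(pins)
        inside = len(pins & part_a)
        if 0 < inside < len(pins):  # some pins in part A, some outside it
            cut += 1
    return cut


# Hypothetical hypergraph: each net lists the vertices (cells) it connects.
nets = {
    "e1": ["V1", "V2"],
    "e2": ["V2", "V3", "V4"],
    "e3": ["V3", "V4"],
}
good_cut = cut_size(nets, {"V1", "V2"})
poor_cut = cut_size(nets, {"V1", "V3"})
```

A min-cut partitioner searches for the split with the smallest such count, subject to balance constraints; a generic objective like this is tuned for the average case and does not by itself see regular datapath structure.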
54Designs with Structure Do Not Exhibit Average Characteristics
[Figure: density and routing maps for a 64b multiplier (half-array)]
Clear Disparity in Resource Usage