VLSI Datapath Choices: CellBased Versus FullCustom

2
Explaining The Gap Between ASIC and Custom Power
A Custom Perspective
  • Andrew Chang
  • Cadence Design Systems
  • William J. Dally
  • Computer Systems Laboratory
  • Stanford University

Work done while Author was at Stanford
3
Design Tradeoffs: Power vs. Performance
1. Move to More Energy Efficient Operating Point:
   More Energy Efficient w/ Custom
[Figure: Power vs. Performance plot with points 1, 2, and 3 marked]
4
Design Tradeoffs: Power vs. Performance
1. Move to More Energy Efficient Operating Point:
   More Energy Efficient w/ Custom
2. Trade Performance for Power:
   Larger Range w/ Custom
[Figure: Power vs. Performance plot with points 1, 2, and 3 marked]
5
Design Tradeoffs: Power vs. Performance
1. Move to More Energy Efficient Operating Point:
   More Energy Efficient w/ Custom
2. Trade Performance for Power:
   Larger Range w/ Custom
3. Move to Different Power vs. Performance Curve:
   More Architectural Choice with Custom
[Figure: Power vs. Performance plot with points 1, 2, and 3 marked]
6
Dynamic Power Dissipation
  • $P_{dyn} = \alpha C V_{dd}^2 f = \alpha E_{circuit} f$
  • Reduce Vdd
      • Static, dynamic, voltage islands, power gating
  • Reduce α and/or f
      • Clock gating, block enables, bus encoding,
        glitch identification and elimination
  • Reduce Ecircuit
      • Engineer interconnects, increase circuit
        efficiency, subthreshold circuit techniques
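
A minimal Python sketch of the dynamic power relation on this slide (activity factor times C times Vdd squared times f). The function name and every numeric value are illustrative assumptions, not figures from the talk. It only shows why reducing Vdd is the strongest lever: power falls with the square of the supply, but only linearly with activity or frequency.

```python
def dynamic_power(alpha, c_farads, vdd_volts, f_hz):
    """Dynamic power in watts: activity factor * C * Vdd^2 * f."""
    return alpha * c_farads * vdd_volts ** 2 * f_hz


# Hypothetical operating point (all values are assumptions for illustration).
baseline = dynamic_power(alpha=0.1, c_farads=1e-9, vdd_volts=1.2, f_hz=500e6)
half_vdd = dynamic_power(alpha=0.1, c_farads=1e-9, vdd_volts=0.6, f_hz=500e6)
half_alpha = dynamic_power(alpha=0.05, c_farads=1e-9, vdd_volts=1.2, f_hz=500e6)

print(f"baseline:      {baseline * 1e3:.1f} mW")
print(f"half Vdd:      {half_vdd * 1e3:.1f} mW")    # quadratic saving
print(f"half activity: {half_alpha * 1e3:.1f} mW")  # linear saving
```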

7
Static Power Dissipation
  • $P_{static} = V_{dd} (I_{sub} + I_{ox})$
  • $I_{sub} = K_1 W e^{-V_t / (n V_\theta)} (1 - e^{-V_{gs}/V_\theta})$
  • $I_{ox} = K_2 W (V_{gs}/t_{ox})^2 e^{-\alpha t_{ox}/V_{gs}}$
  • With $K_1$, $K_2$, $n$, and $\alpha$ experimentally
    determined
  • Reduce Vdd
      • Static, dynamic, voltage islands, power gating
  • Increase effective Vt
      • Substituting high-threshold devices, transistor
        stacking, static and active body bias
  • Reduce effective W
      • Reduce number and size of devices in design
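
The leakage model on this slide can be sketched in Python as below. This is a sketch under stated assumptions: the exponent signs follow the usual form of these expressions (the transcript lost several operators), V_theta is taken to be the thermal voltage (about 26 mV at room temperature), and every constant (K1, K2, n, the tunneling coefficient, W, tox) is a placeholder, since the slide says they are experimentally determined. Only the trends are meaningful, not the absolute values.

```python
import math


def i_sub(k1, w, vt, n, v_theta, vgs):
    """Subthreshold leakage: K1 * W * exp(-Vt/(n*Vtheta)) * (1 - exp(-Vgs/Vtheta))."""
    return k1 * w * math.exp(-vt / (n * v_theta)) * (1.0 - math.exp(-vgs / v_theta))


def i_ox(k2, w, vgs, tox, a):
    """Gate-oxide leakage: K2 * W * (Vgs/tox)^2 * exp(-a*tox/Vgs)."""
    return k2 * w * (vgs / tox) ** 2 * math.exp(-a * tox / vgs)


def static_power(vdd, isub, iox):
    """Static power: Vdd * (Isub + Iox)."""
    return vdd * (isub + iox)


# Placeholder constants in arbitrary units (assumptions, not measured data).
K1, K2, N, A, V_THETA = 1e-6, 1e-9, 1.5, 3.0, 0.026
W, TOX, VDD = 1.0, 2.0, 1.0

low_vt = static_power(VDD, i_sub(K1, W, 0.3, N, V_THETA, VDD), i_ox(K2, W, VDD, TOX, A))
high_vt = static_power(VDD, i_sub(K1, W, 0.5, N, V_THETA, VDD), i_ox(K2, W, VDD, TOX, A))
print(f"static power ratio, high-Vt / low-Vt: {high_vt / low_vt:.4f}")
```

Raising the effective Vt shrinks the exponential subthreshold term, which is why high-threshold devices, stacking, and body bias all appear under the same heading.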

8
Which Design Is More Efficient?
  • 0.7um CMOS 173MHz chip w/ 460K Ts
  • 0.18um CMOS 10kHz chip w/ 640K Ts

9
Which Design Is More Efficient?
  • 0.7um CMOS 173MHz chip w/ 460K Ts
  • Vdd (typ) 3.3V, Vdd (min) 1.1V
  • 0.18um CMOS 10kHz chip w/ 640K Ts
  • Vdd (max) 1.8V, Vdd (min) 0.18V

10
Which Design Is More Efficient?
  • 0.7um CMOS 173MHz chip w/ 460K Ts
  • Vdd (typ) 3.3V, Vdd (min) 1.1V
  • Power 845mW
  • 0.18um CMOS 10kHz chip w/ 640K Ts
  • Vdd (max) 1.8V, Vdd (min) 0.18V
  • Power 1.6mW

11
Talk Outline
  • Normalized Metric Ebit
  • Effect of Architecture
  • ASIC vs. Custom
  • Building Blocks
  • Achievable Energy Efficiency
  • 16b 1024 FFT Example
  • Answer to Which Design is More Efficient

13
Defining Ebit
  • $E_{bit} = C_{bit} V_{dd}^2$
  • $C_{bit} = 4 \times 2\,\mathrm{fF}/\mu\mathrm{m} \times W_{min}$
  • Energy needed to write a 1-bit SRAM cell
  • Approximates minimum useful capacitance
  • The ratio of Ebit to the energy for a range of
    circuits remains largely constant with technology
    scaling
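
A hedged Python sketch of how the Ebit normalization might be applied. It assumes the slide's Cbit line reads as 4 x 2 fF/um x Wmin (the operators were lost in the transcript), and the Wmin, Vdd, and energy values are hypothetical. The helper names are illustrative only.

```python
def c_bit_farads(w_min_um, cap_per_um=2e-15, factor=4):
    """Cbit under the assumed reading: factor * (fF per um) * Wmin."""
    return factor * cap_per_um * w_min_um


def e_bit_joules(w_min_um, vdd):
    """Ebit = Cbit * Vdd^2."""
    return c_bit_farads(w_min_um) * vdd ** 2


def energy_in_ebits(energy_joules, w_min_um, vdd):
    """Express a measured energy as a multiple of Ebit for that process and supply."""
    return energy_joules / e_bit_joules(w_min_um, vdd)


# Hypothetical process: Wmin = 0.5 um, Vdd = 1.8 V (assumed values).
ebit = e_bit_joules(w_min_um=0.5, vdd=1.8)
print(f"Ebit = {ebit:.3e} J")
print(f"1 pJ operation = {energy_in_ebits(1e-12, 0.5, 1.8):.1f} Ebits")
```

Dividing by Ebit removes most of the process and supply dependence, which is what lets circuits from different technologies be compared on one scale.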

14
Technology Scaling for Ebit
  • χ is a normalized unit of distance equal to the
    M1 pitch

15
Technology Scaling for NAND2
[Figure: NAND2 schematic and layout with inputs A, B and output YN; dimension labels 4χ = 2.24 µm and 8χ = 4.48 µm]
  • χ is a normalized unit of distance equal to the
    M1 pitch

16
Applying Ebit
17
Talk Outline
  • Normalized Metric Ebit
  • Effect of Architecture
  • ASIC vs. Custom
  • Building Blocks
  • Achievable Energy Efficiency
  • 16b 1024 FFT Example
  • Answer to Which Design is More Efficient

19
Effect of Architecture
  • Intel Pentium-4
      • Design Style: Custom
      • 2600MHz, 55M Transistors
  • NVIDIA GeForceFX
      • Design Style: ASIC
      • 400MHz, 125M Transistors
20
Effect of Architecture
  • Intel Pentium-4
      • Design Style: Custom
      • 2600MHz, 55M Transistors, 60 Watts
  • NVIDIA GeForceFX
      • Design Style: ASIC
      • 400MHz, 125M Transistors, 20 Watts
21
Effect of Architecture: ASIC Architecture 6x
Efficiency
  • Intel Pentium-4
      • Design Style: Custom
      • 2600MHz, 55M Transistors, 60 Watts, 5 GFlops, 5 GB/s
  • NVIDIA GeForceFX
      • Design Style: ASIC
      • 400MHz, 125M Transistors, 20 Watts, 10 GFlops, 13 GB/s
22
Custom Circuits: 9x (7x) Efficiency
  • Intel Pentium-4
      • Design Style: Custom
      • 2600MHz, 55M Transistors, 60 Watts, 5 GFlops, 5 GB/s,
        Vdd 1.3V
  • NVIDIA GeForceFX
      • Design Style: Custom
      • 400MHz, 125M Transistors, 3 Watts, 10 GFlops, 13 GB/s,
        Vdd 0.65V
23
Combined Architecture and Circuits: 40x
Improvement, but 1.5 Years vs. 3 Years
  • Intel Pentium-4
      • Design Style: Custom
      • 2600MHz, 55M Transistors, 60 Watts, 5 GFlops, 5 GB/s,
        Vdd 1.3V
  • NVIDIA GeForceFX
      • Design Style: Custom
      • 400MHz, 125M Transistors, 3 Watts, 10 GFlops, 13 GB/s,
        Vdd 0.65V
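
The 6x and 40x figures in these slide titles can be reproduced from the listed numbers if efficiency is taken to mean GFlops per watt. That reading of the metric is an assumption; the slides do not name it. A small Python check, using only values from the slides:

```python
def gflops_per_watt(gflops, watts):
    """Throughput efficiency, assumed here to be the metric behind the slide titles."""
    return gflops / watts


pentium4 = gflops_per_watt(5.0, 60.0)        # custom design, 60 W
geforce_asic = gflops_per_watt(10.0, 20.0)   # ASIC design, 20 W
geforce_custom = gflops_per_watt(10.0, 3.0)  # same architecture at 3 W, Vdd 0.65 V

architecture_gain = geforce_asic / pentium4    # ASIC architecture vs. Pentium-4
circuit_gain = geforce_custom / geforce_asic   # 20 W down to 3 W
combined_gain = geforce_custom / pentium4      # architecture and circuits together

print(f"architecture: {architecture_gain:.1f}x")
print(f"circuits:     {circuit_gain:.1f}x")
print(f"combined:     {combined_gain:.1f}x")
```

The circuit-only ratio of 20 W to 3 W comes out near 6.7, which appears to correspond to the parenthesized 7x.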
24
Talk Outline
  • Normalized Metric Ebit
  • Effect of Architecture
  • ASIC vs. Custom
  • Building Blocks
  • Achievable Energy Efficiency
  • 16b 1024 FFT Example
  • Answer to Which Design is More Efficient

26
ASIC vs. Custom
  • ASIC Methods
  • Provide only coarse-grain control (100K gates),
    but require much less effort and historically
    scale with complexity
  • Custom Methods
  • Offer fine-grain control (individual transistors
    and gates), but require large effort and scale
    poorly with complexity
  • Exploit Design Structure
  • Exploit Circuit Techniques

27
Custom Methods Emphasize Fine-Grain Manual
Control, Custom Library
28
Custom Methods Emphasize Fine-Grain Manual
Control, Custom Library
  • Operation and Performance Characterized for the
    Specific Case
29
ASIC Methods Substitute Coarse-Grain Control,
Automation, Generic Library
30
ASIC Methods Substitute Coarse-Grain Control,
Automation, Generic Library
  • Operation and Performance Characterized for the
    Typical/Generic Case
31
ASIC Focus on 100K Gates: Lost Opportunities to
Exploit Structure
  • Designs reuse similar basic building blocks
  • Building blocks are 1-10K gates, not 100K gates
      • 64-bit adder: 1K gates
      • 64x64 RF: 2K gates
      • 64x64 multiplier: 20K gates
  • Opportunities to exploit these structures are lost
    when the design is viewed in large chunks

32
Different Architectures, Similar Building Blocks
  • 1998 MAP 64b Microprocessor - 5M Ts (MIT/Stanford)
  • 2002 Imagine 32b Stream Processor - 22M Ts (Stanford)
[Figure: both chips with labeled blocks: EX, RF, SRAM, XCVRS, Bus]
33
Significant Structure Exists Within 100K Gates
  • 1998 MAP 64b Microprocessor - 5M Ts (MIT/Stanford)
  • 2002 Imagine 32b Stream Processor - 22M Ts (Stanford)
[Figure: both chips with labeled blocks: EX, RF, SRAM, XCVRS, Bus; additional labels: C, L]
34
Energy of 100K-gate Equivalent
  • ASIC (N2): 1400K Ebits (typ)
  • Custom Logic: 424K Ebits
  • SRAM (small): 1085K Ebits
  • SRAM (med): 155K Ebits
  • SRAM (large): 50K Ebits
  • Based on data extracted from Intel McKinley

35
Exploiting Circuit Techniques
  • Custom circuits more efficient
      • Reduced parasitics
      • 1.7x circuit techniques and flops
      • 1.4x libraries
      • 1.4x due to engineering interconnects
  • Subthreshold Circuits
      • Low performance but ultra-low power
      • Requires architecture, gates, memories, CAD tools
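
As a rough consistency check, the three circuit-level factors can be compounded and compared with the ASIC-to-custom energy ratio from the "Energy of 100K-gate Equivalent" slide. Treating the factors as independent and multiplicative is an assumption; the slide does not say how they combine.

```python
# Factors listed on the slide (unitless improvement ratios).
circuit_techniques_and_flops = 1.7
libraries = 1.4
interconnect = 1.4

# Assumption: the factors are independent and simply multiply.
combined_factor = circuit_techniques_and_flops * libraries * interconnect

# Energy of a 100K-gate equivalent, in Ebits, from the preceding slide.
asic_ebits = 1400e3
custom_logic_ebits = 424e3
measured_ratio = asic_ebits / custom_logic_ebits

print(f"compounded circuit factors: {combined_factor:.2f}x")
print(f"ASIC / custom logic energy: {measured_ratio:.2f}x")
```

Both numbers land close to 3.3, in line with the roughly 3x energy advantage the talk attributes to custom design.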

36
Relating Power to Performance: CV/I, Idsat, tFO4
  • $I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$
  • $t_{FO4} = K_4 C_{eff} V_{dd} / I_{dsat}$ ($K_4 = 13.5$)
37
Relating Power to Performance: Relating Vdd and
Vt to tFO4
  • $I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$
  • $t_{FO4} = K_4 C_{eff} V_{dd} / I_{dsat}$ ($K_4 = 13.5$)
38
Relating Power to Performance: Correlation to
Reported Foundry Data
  • $I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$
  • $t_{FO4} = K_4 C_{eff} V_{dd} / I_{dsat}$ ($K_4 = 13.5$)
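
A minimal Python sketch of the delay model on these slides, useful for seeing how much speed is given up when Vdd is lowered to save power. K4 = 13.5 is from the slide; K3, Leff, tox, Ceff, and Vt below are placeholder values in arbitrary units, and Vgs is assumed equal to Vdd. Only the ratio between the two delays is meaningful.

```python
def i_dsat(k3, leff, tox, vgs, vt):
    """Saturation current: K3 * Leff^-0.5 * tox^-0.8 * (Vgs - Vt)^1.25."""
    if vgs <= vt:
        raise ValueError("model applies only above threshold (Vgs > Vt)")
    return k3 * leff ** -0.5 * tox ** -0.8 * (vgs - vt) ** 1.25


def t_fo4(ceff, vdd, idsat, k4=13.5):
    """Fan-out-of-4 delay: K4 * Ceff * Vdd / Idsat."""
    return k4 * ceff * vdd / idsat


# Placeholder device parameters (assumptions, arbitrary units).
K3, LEFF, TOX, CEFF, VT = 1.0, 1.0, 1.0, 1.0, 0.3


def delay_at(vdd):
    """FO4 delay with the gate driven at the full supply (Vgs = Vdd, assumed)."""
    return t_fo4(CEFF, vdd, i_dsat(K3, LEFF, TOX, vdd, VT))


slowdown = delay_at(0.65) / delay_at(1.3)
print(f"halving Vdd from 1.3 V to 0.65 V slows FO4 by about {slowdown:.2f}x")
```

With these assumed parameters the slowdown is under 2x, while dynamic energy per operation falls 4x, which is the kind of trade the power-versus-performance slides describe.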
39
Achievable Power Improvement (Assuming 50/50
split of Logic and Memory)
40
Achievable Power Improvement (Assuming 50/50
Split of Logic and Memory)
41
Achievable Power Improvement (Assuming 50/50
Split of Logic and Memory)
42
Achievable Power Improvement (Assuming 50/50 Split
of Logic and Memory)
  • 130nm uP assumes 80% Dynamic and 20% Static
  • 90nm uP assumes 50% Dynamic and 50% Static

43
Talk Outline
  • Normalized Metric Ebit
  • Effect of Architecture
  • ASIC vs. Custom
  • Building Blocks
  • Achievable Energy Efficiency
  • 16b 1024 FFT Example
  • Answer to Which Design is More Efficient

45
16b 1024 point FFT
  • Generally, k N log N operations (complex
    multiplies) with pre-computation
  • Radix-2, Radix-4, etc. implementations
  • Decimation in time and/or decimation in
    frequency
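
To make the operation count concrete, here is a minimal Python sketch of one of the variants named above: an iterative radix-2, decimation-in-time FFT with precomputed twiddle factors. It is a generic textbook formulation, not the implementation used by any chip in this talk. The counter tallies every butterfly multiply, including trivial multiplies by 1, so for radix-2 the constant k is 1/2.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_radix2_dit(x):
    """Return (spectrum, complex_multiplies) for a power-of-two length input."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    # Pre-computation: twiddle factors W_n^k for k = 0 .. n/2 - 1.
    twiddles = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    # Decimation in time: reorder the input by bit-reversed index.
    a = [0j] * n
    for i in range(n):
        a[bit_reverse(i, bits)] = complex(x[i])
    multiplies = 0
    size = 2
    while size <= n:
        half, step = size // 2, n // size
        for start in range(0, n, size):
            for k in range(half):
                t = twiddles[k * step] * a[start + k + half]
                multiplies += 1
                a[start + k], a[start + k + half] = a[start + k] + t, a[start + k] - t
        size *= 2
    return a, multiplies


# A unit impulse as a simple 1024-point input.
spectrum, multiplies = fft_radix2_dit([1.0] + [0.0] * 1023)
# (N/2) * log2(N) = 512 * 10 = 5120 butterfly multiplies for N = 1024.
print(multiplies)
```

Higher radices and skipping the trivial twiddles lower the multiply count further, which is one reason the implementations compared later differ in energy.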

46
Range of Implementations
  • MIT FFT (2005)
      • 0.18um CMOS, 628K Ts, 10kHz; architecture and
        subthreshold circuits; 180mV operation
  • Spiffee (1999)
      • 0.7um CMOS, 460K Ts, 173MHz; cached FFT
        architecture and algorithm; 1.1V operation
  • SA-1100 (1999)
      • 0.35um CMOS, 2.6M Ts, 74MHz; commercial embedded
        processor; custom circuits; 1.5V operation
  • Imagine (2003)
      • 0.15um CMOS, 22M Ts, 232MHz; streaming media
        processor; tiled standard cells; 1.2V operation
  • Stratix IS25F627C8 (2005)
      • 0.13um CMOS, 3.9K logic elements, 123K memory
        bits, 24 DSP blocks, 272MHz; commercial FPGA
        co-processor
  • Intel P4 (2003)
      • 0.13um CMOS, 3GHz, SSE; commercial general-purpose
        processor; custom circuits; 1.5V operation
  • TI C6416 (2003)
      • 0.13um CMOS, 720MHz; commercial digital signal
        processor

47
Ebit Energy 16b 1024 point FFT
48
Ebit Energy 16b 1024 point FFT
49
Which Design Is More Efficient?
  • 0.7um CMOS 173MHz chip w/ 460K Ts
  • Vdd (typ) 3.3V, Vdd (min) 1.1V
  • Power 845mW
  • 0.18um CMOS 10kHz chip w/ 640K Ts
  • Vdd (max) 1.8V, Vdd (min) 0.18V
  • Power 1.6mW

50
Which Design Is More Efficient? Depends on the
Metric!
  • 0.7um CMOS 173MHz chip w/ 460K Ts
  • Vdd (typ) 3.3V, Vdd (min) 1.1V
  • Power 845mW
  • EDP 143x better
  • 0.18um CMOS 10kHz chip w/ 640K Ts
  • Vdd (max) 1.8V, Vdd (min) 0.18V
  • Power 1.6mW
  • Absolute energy 6x better
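
The point that the ranking depends on the metric can be illustrated with a small Python sketch. The two designs below are purely hypothetical: the transcript does not give the time per FFT for either chip, so these numbers are not the chips' own and do not reproduce the 143x or 6x figures. Energy per operation is power times time; energy-delay product (EDP) multiplies that by time again, which penalizes slow designs.

```python
def energy_per_op(power_w, time_s):
    """Energy per operation in joules: P * t."""
    return power_w * time_s


def energy_delay_product(power_w, time_s):
    """EDP in joule-seconds: (P * t) * t."""
    return energy_per_op(power_w, time_s) * time_s


# Hypothetical designs (assumed numbers, chosen only to show the effect).
fast = {"power_w": 1.0, "time_s": 1e-6}    # high power, short runtime
slow = {"power_w": 1e-3, "time_s": 1e-4}   # low power, long runtime

for name, d in (("fast", fast), ("slow", slow)):
    e = energy_per_op(**d)
    edp = energy_delay_product(**d)
    print(f"{name}: energy = {e:.1e} J, EDP = {edp:.1e} J*s")
```

Here the slow design wins on absolute energy while the fast design wins on EDP, the same kind of split the slide reports for the two FFT chips.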

51
Summary
  • Normalized metric Ebit - enables meaningful
    comparisons across designs and technologies
  • Custom designers can exploit a wide range of
    optimizations, enabling architecture with
    circuits and circuits with architecture
  • Custom designs can readily achieve a 3x advantage
    in energy with the potential for over 10x
  • Selective application of custom techniques and
    automated support for performance
    characterization at specific instead of generic
    operating points can enable ASIC designers to
    begin to bridge this Power Gap.

52
Back-Up Slides
53
ASICs Rely on General Optimization Techniques
Focus: Improve the Average Case
  • Partitioning: hypergraph min-cut, ratio cut
  • Solutions: move-based, geometric, combinatorial
    forms, clustering
  • Vertex and edge weights used to encode costs
[Figure: a circuit and its hypergraph H(V, E), where E = {e1, e2, ...} are nets; vertices V3, V4 and edges e1-e8 are labeled]
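
A minimal Python sketch of the two partitioning objectives named on this slide, for an unweighted hypergraph. The netlist is a made-up example, not the circuit from the figure, and the ratio-cut formula used is the common cut / (|A| * |B|) definition. The slide's vertex and edge weights are omitted for brevity.

```python
def cut_size(nets, part_a):
    """Count nets (hyperedges) that have vertices on both sides of the partition."""
    part_a = set(part_a)
    return sum(1 for pins in nets.values() if pins & part_a and pins - part_a)


def ratio_cut(nets, vertices, part_a):
    """Cut size divided by |A| * |B|, which discourages lopsided partitions."""
    part_a = set(part_a)
    part_b = set(vertices) - part_a
    if not part_a or not part_b:
        raise ValueError("both partitions must be non-empty")
    return cut_size(nets, part_a) / (len(part_a) * len(part_b))


# Hypothetical netlist: each net maps to the set of vertices (cells) it connects.
vertices = {1, 2, 3, 4}
nets = {"e1": {1, 2}, "e2": {2, 3}, "e3": {3, 4}, "e4": {1, 2, 3}}

balanced = {1, 2}
lopsided = {1}
print(cut_size(nets, balanced), ratio_cut(nets, vertices, balanced))
print(cut_size(nets, lopsided), ratio_cut(nets, vertices, lopsided))
```

Both partitions cut two nets, but ratio cut prefers the balanced one. Objectives like these treat every net and cell alike, which is the average-case view the slide contrasts with structured datapaths.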
54
Designs with Structure Do Not Exhibit Average
Characteristics
[Figure: 64b Multiplier (half-array); panels: Density, Routing]
  • Clear Disparity in Resource Usage