Title: Design Productivity Crisis
1Design Techniques
Borivoje Nikolic (bora_at_eecs.berkeley.edu)
BWRC Winter Research Retreat
January 13, 2003
2Outline
- Power-constrained design
- Students: D. Markovic, V. Stojanovic (Stanford)
- B. Brodersen, M. Horowitz
- Update on various designs
- Dual-supply ALU (Y. Shimazaki, R. Zlatanovici)
- Datapaths for maskless lithography (B. Wild, B. Warlick)
- Iterative decoding (E. Yeo, E. Liao)
- Background-calibrated ADC (Y. Chiu, B. Tsang)
- Transistor modeling (J. Garrett)
- Robust design (R. Zlatanovici, S. Vamvakos)
3Power limited operation
[Figure: energy/op vs. delay, showing the unoptimized design point and the bounds Emin, Emax, Dmin, Dmax]
Achieve the highest performance under the power cap
4Power limited operation
[Figure: energy/op vs. delay, showing the unoptimized design point, the design optimization curve for Var1, and the bounds Emin, Emax, Dmin, Dmax]
Achieve the highest performance under the power cap
5Power limited operation
[Figure: energy/op vs. delay, showing the unoptimized design point, the design optimization curves for Var1 and Var2, and the bounds Emin, Emax, Dmin, Dmax]
Achieve the highest performance under the power cap
6Power limited operation
[Figure: energy/op vs. delay, showing the unoptimized design point, the design optimization curves for Var1, Var2, and Var1 + Var2 jointly, and the bounds Emin, Emax, Dmin, Dmax]
How far away are we from the optimal solution?
7Power limited operation
[Figure: energy/op vs. delay, showing the unoptimized design point, the design optimization curves for Var1, Var2, Var1 + Var2 jointly, and the global curve, with the bounds Emin, Emax, Dmin, Dmax]
Global optimum: best performance
8Power limited operation
[Figure: energy/op vs. delay, showing the unoptimized design point and the bounds Emin, Emax, Dmin, Dmax]
Maximize throughput for a given energy, or minimize energy for a given throughput
9Design optimization
- There are many sets of parameters to adjust
- Tuning variables
- Circuit (sizing, supply, threshold)
- Logic style (domino, pass-gate, ...)
- Block topology (adder: CLA, CSA, ...)
- Micro-architecture (parallel, pipelined)
10Design optimization
- There are many sets of parameters to adjust
- Tuning variables
- Circuit (sizing, supply, threshold)
- Logic style (domino, pass-gate, ...)
- Block topology (adder: CLA, CSA, ...)
- Micro-architecture (parallel, pipelined)
- Globally optimal boundary curve: pieces of E-D curves for different topologies
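A minimal sketch of how a global boundary can be assembled from per-topology E-D curves, assuming each curve is available as a list of (delay, energy) points. The topology labels and all numbers below are hypothetical and are not taken from the slides.

```python
def pareto_front(points):
    """Keep points that no other point beats in both delay and energy."""
    front = []
    best_energy = float("inf")
    # Sorting by delay first means a point survives only if it lowers energy.
    for delay, energy, label in sorted(points):
        if energy < best_energy:
            front.append((delay, energy, label))
            best_energy = energy
    return front

# Hypothetical E-D curves for two topologies, as (delay, energy) pairs.
curves = {
    "A": [(1.0, 10.0), (1.5, 6.0), (2.0, 5.0)],
    "B": [(1.2, 7.0), (1.8, 4.0), (2.5, 3.5)],
}
points = [(d, e, name) for name, curve in curves.items() for d, e in curve]
front = pareto_front(points)
for delay, energy, label in front:
    print(f"delay={delay:.1f} energy={energy:.1f} topology={label}")
```

The resulting boundary switches between topologies along the delay axis, which is the "pieces of E-D curves" picture in the bullet above.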
11Energy-delay sensitivity
- $\Delta E = \mathrm{Sens}(A)\cdot(-\Delta D) + \mathrm{Sens}(B)\cdot\Delta D$
At the optimal point, all sensitivities should
be the same
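A toy illustration of the equal-sensitivity condition, taking sensitivity to mean (dE/dx)/(dD/dx) for a tuning variable x; that reading is an assumption here. The models E = c*x and D = k/x and all coefficients are invented for the example and are not the models used in this work.

```python
import math

def sensitivity(c, k, x):
    """(dE/dx) / (dD/dx) for the toy models E = c*x and D = k/x."""
    dE_dx = c
    dD_dx = -k / x ** 2
    return dE_dx / dD_dx

def total_energy(params, xs):
    return sum(c * x for (c, _), x in zip(params, xs))

def total_delay(params, xs):
    return sum(k / x for (_, k), x in zip(params, xs))

def optimum(params, d_target):
    """Closed-form minimum-energy point of the toy model at a fixed total delay."""
    t = sum(math.sqrt(k * c) for c, k in params) / d_target
    return [t * math.sqrt(k / c) for c, k in params]

params = [(1.0, 4.0), (3.0, 2.0)]  # hypothetical (c, k) for variables A and B
d_target = 2.0

x_opt = optimum(params, d_target)
sens_opt = [sensitivity(c, k, x) for (c, k), x in zip(params, x_opt)]

# A different point that meets the same delay target but is not balanced.
x_a = 3.0
x_b = params[1][1] / (d_target - params[0][1] / x_a)
x_unbalanced = [x_a, x_b]
sens_unbalanced = [sensitivity(c, k, x) for (c, k), x in zip(params, x_unbalanced)]

print("optimal sensitivities:   ", sens_opt)
print("unbalanced sensitivities:", sens_unbalanced)
print("energy at optimum:   ", total_energy(params, x_opt))
print("energy at unbalanced:", total_energy(params, x_unbalanced))
```

At the unbalanced point the two sensitivities differ, so delay can be traded from one variable to the other at a net energy gain; at the optimum they coincide and no such trade remains.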
12Alpha-power based delay model
- Fitting parameters
- Von, αd, Kd
13Alpha-power based delay model
- Fitting parameters
- Von, αd, Kd
- Effective fanout, heff
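The delay equation itself did not survive in the slide text. As a rough sketch, one common alpha-power form is delay = Kd * Vdd / (Vdd - Von)^αd * heff; this form and the parameter values below are assumptions for illustration, not the fitted model from the talk.

```python
def alpha_power_delay(vdd, heff, von, alpha_d, kd):
    """Gate delay under an assumed alpha-power form.

    vdd     supply voltage (V)
    heff    effective fanout (dimensionless)
    von     fitted on-voltage (V)
    alpha_d fitted velocity-saturation exponent
    kd      fitted delay coefficient (sets the time unit)
    """
    if vdd <= von:
        raise ValueError("vdd must exceed von for the model to apply")
    return kd * vdd / (vdd - von) ** alpha_d * heff

# Hypothetical fitting parameters, chosen only to exercise the function.
fit = dict(von=0.4, alpha_d=1.3, kd=1.0)

for vdd in (1.2, 1.5, 1.8):
    print(vdd, alpha_power_delay(vdd, heff=4.0, **fit))
```

In this form delay scales linearly with heff and rises as Vdd approaches Von, which is the qualitative behavior the fitting parameters are meant to capture.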
14Energy model
15Sensitivity to sizing and supply
∞ for equal heff (Dmin)
xv = (Von + ΔVth)/Vdd
16Sensitivity to Vth
Low initial leakage ⇒ speedup comes for free
17Optimization setup
- Reference/nominal circuit
- sized for Dmin @ Vddnom, Vthnom
- known average activity
- Set delay constraint
- Minimize energy under delay constraint
- gate sizing
- Vdd, Vth scaling
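A reduced sketch of this setup, using supply scaling only: start from a nominal point, relax the delay constraint by a chosen increment, and search for the lowest Vdd that still meets it. The delay form, the parameter values, and the Vdd^2 energy model (leakage ignored) are assumptions for illustration; gate sizing and Vth scaling are left out.

```python
def delay(vdd, von=0.4, alpha_d=1.3):
    """Normalized alpha-power-style delay; von and alpha_d are made-up values."""
    return vdd / (vdd - von) ** alpha_d

def switching_energy(vdd):
    """Normalized switching energy, proportional to Vdd^2 (leakage ignored)."""
    return vdd ** 2

def min_vdd_for_delay(d_max, vdd_lo, vdd_hi, tol=1e-9):
    """Bisection for the lowest Vdd in [vdd_lo, vdd_hi] with delay <= d_max.

    Relies on delay decreasing monotonically with Vdd over the interval.
    """
    if delay(vdd_hi) > d_max:
        raise ValueError("delay target not reachable at vdd_hi")
    if delay(vdd_lo) <= d_max:
        return vdd_lo
    while vdd_hi - vdd_lo > tol:
        mid = 0.5 * (vdd_lo + vdd_hi)
        if delay(mid) <= d_max:
            vdd_hi = mid
        else:
            vdd_lo = mid
    return vdd_hi

vdd_nom = 1.8          # hypothetical nominal supply
d_inc = 0.10           # allow a 10% delay increase over nominal
d_max = (1 + d_inc) * delay(vdd_nom)

vdd_opt = min_vdd_for_delay(d_max, vdd_lo=0.6, vdd_hi=vdd_nom)
saving = 1 - switching_energy(vdd_opt) / switching_energy(vdd_nom)
print(f"Vdd = {vdd_opt:.3f} V, energy saving = {saving:.1%}")
```

A fuller version would add sizing and threshold as variables and hand the problem to a constrained optimizer; the one-dimensional search is kept here only because it is easy to verify.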
18Circuit Examples
- Inverter chain
- No off-path load or reconvergence
- Memory decoder
- Off-path load without reconvergence
- Adder
- Off-path load with reconvergence
19SRAM Decoder Energy Profile
[Figure: normalized energy profile (0 to 100) with an internal energy peak; series labeled m1, m2, m4, m8]
20W vs. Vdd for Reducing Energy Peak
[Figure: energy profiles of the reference design and the optimized design]
- Vdd less effective than W optimization
- Buffering also reduces energy peak
also B. Amrutur, M. Horowitz, JSSC 10/01
21Kogge-Stone Tree Adder Topology
- Off-path load (gates and wires)
- Reconvergence (inside ?-block)
22Tree Adder Optimization Results
- Reference: all paths are critical
[Figure: E-D curves from the nominal point (D = Dmin); sizing gives E (-54%) and 2Vdd gives E (-27%) at dinc = 10%]
- Internal energy ⇒ W more effective than Vdd
- W: E (-54%), 2Vdd: E (-27%) at dinc = 10%
23Joint optimization sizing and Vdd
[Figure: energy/op vs. delay, starting from the nominal design]
$\Delta E = \mathrm{Sens}(V_{dd})\cdot(-\Delta D) + \mathrm{Sens}(W)\cdot\Delta D$
24Results of joint optimization
Sensitivity table
80% of energy saved without delay penalty
25Reducing the number of dimensions
Threshold and sizing nearly optimal around the
nominal point
26Scope of circuit optimization
Effective region: ±30% around nominal delay
27Power-Limited Design Conclusions
- All design levels need to be optimized jointly
- Equal marginal costs ⇔ energy-efficient design
- Peak performance is VERY power inefficient
- Today's designs are not leaky enough to be truly power-optimal
- Pipelining starts to gain advantage over parallelism
28Power-Limited Design Directions
- Expand the analysis across the pipeline stages
- Better compact models (preferably convex)
- Exploring E-D bounds for various functions
- Design qualification in E-D space
- Design optimization, joint with Kurt Keutzer's and Jan Rabaey's groups
- Robust optimization
- Robustness through adjustments (supply, body bias)
29Compact Delay Models
- Unified transistor model
- Maps into a convex delay model
- J. Garrett, R. Zlatanovici
30Adders in E-D Space
[Figure: total transistor width per bitslice (unit widths, 0 to 2500) vs. delay (FO4, 0 to 150) for Radix 2 Kogge-Stone, Radix 4 Kogge-Stone, Radix 4 2-Sparse, Radix 4 4-Sparse, and Ripple Carry Adder; CLA and RCA points are labeled]
31Robust Optimization
- Example: robust linear programming (LP)
- LP with uncertainty on some parameters
- Uncertain parameters lie in given ellipsoids Ei
- Worst case: require the constraints to be satisfied for all values of the parameters
- Can be formulated as a convex second-order cone program
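A small sketch of the worst-case constraint behind this formulation, under the standard parameterization Ei = { a_bar + P u : ||u||_2 <= 1 }. For a constraint a^T x <= b with a in Ei, the worst case over the ellipsoid is a_bar^T x + ||P^T x||_2, which is what turns the robust LP into a second-order cone program. The numbers and helper names are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def worst_case_lhs(a_bar, P, x):
    """Max of a^T x over a = a_bar + P u, ||u||_2 <= 1: a_bar^T x + ||P^T x||_2."""
    n_u = len(P[0])
    pt_x = [sum(P[i][j] * x[i] for i in range(len(x))) for j in range(n_u)]
    return dot(a_bar, x) + math.sqrt(dot(pt_x, pt_x))

def sampled_lhs(a_bar, P, x, rng):
    """a^T x for one random a on the boundary of the ellipsoid."""
    u = [rng.gauss(0.0, 1.0) for _ in range(len(P[0]))]
    norm = math.sqrt(dot(u, u))
    u = [ui / norm for ui in u]
    a = [a_bar[i] + dot(P[i], u) for i in range(len(a_bar))]
    return dot(a, x)

# Hypothetical data for a single constraint in two variables.
a_bar = [1.0, 2.0]
P = [[0.2, 0.0], [0.1, 0.3]]
x = [1.0, 1.0]

rng = random.Random(0)
worst = worst_case_lhs(a_bar, P, x)
sampled_max = max(sampled_lhs(a_bar, P, x, rng) for _ in range(2000))
print(f"closed-form worst case: {worst:.4f}")
print(f"largest sampled value:  {sampled_max:.4f}")
```

The sampled values never exceed the closed-form bound, so checking a_bar^T x + ||P^T x||_2 <= b is enough to guarantee the constraint for every parameter value in the ellipsoid.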
32Optimization with Random Parameters
- Suppose parameters are random variables with known distribution
- Require that each constraint holds with a probability of at least η
- Can be formulated as a second-order cone program
- R. Zlatanovici
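A sketch of the probabilistic constraint for one specific case: the slide only says the distribution is known, and a Gaussian a ~ N(a_bar, Sigma) is assumed here. Under that assumption, Prob(a^T x <= b) >= p is equivalent to a_bar^T x + Phi^{-1}(p) * sqrt(x^T Sigma x) <= b, which is a second-order cone constraint when p >= 0.5. The numbers and function names are hypothetical.

```python
import math
from statistics import NormalDist

def mean_and_std(a_bar, sigma, x):
    """Mean and standard deviation of a^T x for a ~ N(a_bar, sigma)."""
    mean = sum(ai * xi for ai, xi in zip(a_bar, x))
    var = sum(x[i] * sigma[i][j] * x[j]
              for i in range(len(x)) for j in range(len(x)))
    return mean, math.sqrt(var)

def chance_margin(a_bar, sigma, x, b, p):
    """Slack of the cone constraint; non-negative means Prob(a^T x <= b) >= p."""
    mean, std = mean_and_std(a_bar, sigma, x)
    return b - (mean + NormalDist().inv_cdf(p) * std)

def satisfaction_probability(a_bar, sigma, x, b):
    """Exact Prob(a^T x <= b) under the Gaussian assumption."""
    mean, std = mean_and_std(a_bar, sigma, x)
    return NormalDist(mean, std).cdf(b)

# Hypothetical data: one constraint, two uncertain coefficients.
a_bar = [1.0, 2.0]
sigma = [[0.04, 0.01], [0.01, 0.09]]
x = [1.0, 1.0]
p = 0.95

mean, std = mean_and_std(a_bar, sigma, x)
b_tight = mean + NormalDist().inv_cdf(p) * std  # smallest b that satisfies the constraint
print(f"tightest b: {b_tight:.4f}")
print(f"probability at that b: {satisfaction_probability(a_bar, sigma, x, b_tight):.4f}")
```

Tightening p increases the multiplier on the standard-deviation term, so the required margin over the nominal constraint grows with the confidence level.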
33Dual-Supply ALU Design
Y. Shimazaki, R. Zlatanovici
[Figure: power vs. delay, showing max power, the power limit, frequency scaling, the optimized single-supply design, the dual-supply design, and min delay]
Optimal design achieves the smallest delay under power constraints.
34Dual-Supply Designs
- Dual-supply-voltage technique expands the power-delay optimization space.
- Layout complications and level conversion make it impractical for high-speed datapaths in conventional implementation.
- A shared N-well technique is explored on an ALU: the ALU is a performance-critical path with the highest power density.
- To be presented at ISSCC'03
35Shared-Well Dual-Supply-Voltage
[Figure: conventional vs. shared N-well schematics; each shows a VDDH circuit (i1, o1) and a VDDL circuit (i2, o2) between the VDDH/VDDL supplies and VSS]
36Conventional Dual-Supply Layout
[Figure: conventional dual-supply layouts with N-well isolation between VDDH and VDDL circuits: (a) dedicated row, with alternating VDDH and VDDL rows; (b) dedicated region, with separate VDDH and VDDL regions]
37Shared-Well Dual-Supply Layout
[Figure: (a) floor plan image, showing VDDH and VDDL circuits in a shared N-well with VDDH, VDDL, and VSS rails]
38FO4-INV Delay and Leakage
[Figure: normalized FO4-INV delay (0.0 to 3.0) and normalized PMOS Ioff (0.001 to 10, log scale) vs. VDDL (1.0 to 2.0 V) for shared N-well and conventional designs; 25 °C, VDDH = 1.8 V]
39ALU Block Diagram
[Figure: ALU block diagram. Blocks: clock gen., carry gen., gp gen., partial sum, logical unit, sum sel., 5:1 MUX, 9:1 MUXes, 2:1 MUX, INV1, INV2. Signals: clk, ain, bin, ain0, carry, sum, s0/s1, and sumb (long loop-back bus, 0.5 pF). VDDH and VDDL circuits are marked separately.]
40Sparse Radix-4 Carry Tree
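[Figure: sparse radix-4 carry tree from bit0 to bit63, with cin input, G/P, G4/P4, G16/P16, and G64 levels, SUMSEL stage, carry c63, and sum outputs s0 to s63]
- There are fewer logic gates, because only every fourth carry signal is calculated.
- Partial sum gates can be placed in empty slots.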
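A behavioral sketch of the 4-sparse idea: only the carry into each 4-bit block is computed, and each block forms two partial sums (carry-in 0 and 1) that the block carry selects between. The block carries are computed serially here for clarity, whereas the hardware uses the radix-4 tree; the function names and the partial-sum scheme are illustrative assumptions, not the circuit from the slides.

```python
WIDTH = 64
BLOCK = 4  # sparseness: one carry is computed per 4-bit block

def block_gp(a_blk, b_blk):
    """Group generate and propagate of one 4-bit block."""
    g, p = 0, 1
    for i in range(BLOCK):
        ai, bi = (a_blk >> i) & 1, (b_blk >> i) & 1
        gi, pi = ai & bi, ai ^ bi
        g = gi | (pi & g)   # generate of bits i..0
        p = pi & p          # propagate of bits i..0
    return g, p

def sparse_add(a, b, cin=0):
    """Add two WIDTH-bit values; returns (sum, carry_out, carries_computed)."""
    mask = (1 << BLOCK) - 1
    total, carry, carries_computed = 0, cin, 0
    for blk in range(WIDTH // BLOCK):
        shift = blk * BLOCK
        a_blk, b_blk = (a >> shift) & mask, (b >> shift) & mask
        # Partial sums for both possible carry-ins; the block carry selects one.
        s0 = (a_blk + b_blk) & mask
        s1 = (a_blk + b_blk + 1) & mask
        total |= (s1 if carry else s0) << shift
        g, p = block_gp(a_blk, b_blk)
        carry = g | (p & carry)   # carry into the next block
        carries_computed += 1
    return total, carry, carries_computed

s, cout, n_carries = sparse_add(0xFFFFFFFFFFFFFFFF, 1)
print(hex(s), cout, n_carries)
```

For a 64-bit operand this computes 16 block carries rather than 64 per-bit carries, which is where the reduction in carry-tree gates comes from.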
41Adders Performance Tradeoffs
- Many choices available to the designer
- Tradeoff curves generated using optimization
- Radix-4 4-sparse is best in this case
42Low Swing Bus Level Converter
[Figure: domino level converter (9:1 MUX) between VDDH/VDDL and VSS, with keeper, pc, inputs ain0 and sumb, inverters INV1 and INV2, and output sum]
43Test Chip Micrograph
[Figure: test chip micrograph with labeled dimensions 2.0 mm, 1.5 mm, 760 µm, and 200 µm]
- Technology summary
- 0.18 µm general-purpose bulk
- 5 metal layers (Al), local interconnect
44Measured Results Energy Delay
[Figure: measured energy (pJ, 200 to 1000) vs. TCYCLE (ns, 0.6 to 1.6) at room temperature, with curves for VDDH = 2.0 V, 1.8 V, and 1.6 V; each curve starts at VDDH = VDDL and is traced as VDDL decreases]
45Dual-Supply ALU Summary
- Shared-Well Dual-Supply-Voltage Technique
- Appropriate for datapath design
- 30% less area
- Low Power ALU Design Techniques
- Sparse radix-4 carry tree
- Low swing bus and domino level converter
- Test Chip Measurement
- 1.16 GHz 64-bit ALU in GP 0.18 µm bulk
- 33.3% energy saving with 8.3% delay increase
- 42% leakage current reduction
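A quick arithmetic check of what these measurements imply, taking the summary figures as a 33.3% energy saving and an 8.3% delay increase. The energy-delay product (EDP) is used here as one convenient combined metric; the slides do not report it, so it is an added illustration only.

```python
energy_saving = 0.333    # fraction of energy saved (from the summary)
delay_increase = 0.083   # fractional delay increase (from the summary)
clock_hz = 1.16e9        # reported clock frequency

# Energy-delay product relative to the baseline design.
edp_ratio = (1 - energy_saving) * (1 + delay_increase)

# Cycle time corresponding to the reported clock frequency.
cycle_ns = 1e9 / clock_hz

print(f"EDP relative to baseline: {edp_ratio:.3f}")   # 0.722
print(f"Cycle time at 1.16 GHz: {cycle_ns:.3f} ns")   # 0.862 ns
```

So the energy saving outweighs the delay penalty: the combined metric improves by roughly 28%.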
46Maskless Lithography Datapaths
Maskless writing using micromirrors.
47Maskless Lithography Datapaths
[Figure: decompressor row block diagram. Blocks: compressed-data FIFO, Huffman decoder with lookup tables, Lempel-Ziv decoder, FIFO, CRC check, SRAM writer interface, control, and synch. Labeled signals: CompressedData, DecompressedData, Literal / Offset, Length, RD/WR, Table select, Address; bus widths of 8 and 10 are marked.]
48Maskless Lithography Datapaths
Performance of test chip vs. full-scale chip
49Maskless Lithography Datapaths
[Figure: chip layout with labeled blocks: Huffman decoder lookup memory, SRAM writer interface, Huffman decoder, Lempel-Ziv decoder, FIFO array, single decompression path]
Last SSHAFT 0.18 µm design. Fully functional first time.
B. Warlick is working on the next generation.
50Iterative Decoders
- Turbo codes comprising convolutional codes concatenated through interleaver
- LDPC codes based on finite field geometries
- Cyclic connectivity between nodes
- LDPC codes based on Ramanujan graphs
- Hierarchical connectivity with regular local interconnect
[Figure: turbo encoder with Convolutional Encoder 1 and Convolutional Encoder 2 connected through an interleaver (π)]
E. Yeo, E. Liao
51Background Calibrated ADC
- Pipeline ADC calibrated by a ΣΔ.
- High speed, 10-bit.
Yun Chiu, Bill Tsang, Prof. Gray