Title: 396-ps 32-bit Han-Carlson ALU in 180nm TSMC process
VLSI CAD Lab, University of Wisconsin, Madison
Outline
- Review of Adders
- The Idea of Han-Carlson Adder
- The Implementation of Han-Carlson Adder
- Simulation Result
- Discussion
- Comparison between Ling and H-C Adders
- Future work
- Reference
Review of Adders
Observation
Review of Adders (cont.)
- Hybrid (Parallel) Prefix Adder
- Brent-Kung Adder
- Kogge-Stone Adder
- Han-Carlson Adder
Review of Adders (cont.)
- Brent-Kung Adder
- Cost: $C(k) = C(k/2) + k - 1 = 2k - 2 - \log_2 k$ (number of adder cells)
- Time: $2\log_2 k - 2$ (in terms of adder levels)
Review of Adders (cont.)
- Kogge-Stone Adder
- Cost: $k\log_2 k - (k - 1)$
- Time: $\log_2 k$
The idea of Han-Carlson Adder
- Han-Carlson Adder
- B-K adder: small area, but slow
- K-S adder: large area, but fast
- Speed, B-K to K-S: $2\log_2 k - 2 \to \log_2 k$ (about a 1/2 reduction)
- Cost, B-K to K-S: $2k - 2 - \log_2 k \to k\log_2 k - k + 1$ (about a $(\log_2 k)/2$-fold increase)
- The Han-Carlson adder results from this area-time tradeoff.
The idea of Han-Carlson Adder (cont.)
- Han-Carlson Adder
- Cost: $O\left(\frac{k}{2}\log_2 k\right)$
- Time: $O(\log_2 k + 1)$
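The cost and time expressions quoted for the three prefix adders can be compared numerically. The sketch below is only an illustration: it treats the O() expressions for Han-Carlson as exact counts, assumes k is a power of two, and the function name is hypothetical.

```python
from math import log2

def prefix_adder_figures(k):
    """Cell and level counts from the slide formulas (k must be a power of two)."""
    lg = int(log2(k))
    return {
        # Brent-Kung: 2k - 2 - log2(k) cells, 2*log2(k) - 2 levels
        "Brent-Kung": {"cells": 2 * k - 2 - lg, "levels": 2 * lg - 2},
        # Kogge-Stone: k*log2(k) - (k - 1) cells, log2(k) levels
        "Kogge-Stone": {"cells": k * lg - (k - 1), "levels": lg},
        # Han-Carlson: (k/2)*log2(k) cells, log2(k) + 1 levels (O() taken as exact)
        "Han-Carlson": {"cells": (k // 2) * lg, "levels": lg + 1},
    }

figures = prefix_adder_figures(32)
for name, f in figures.items():
    print(f"{name:12s} cells={f['cells']:4d} levels={f['levels']}")
```

For k = 32 this gives 57 cells and 8 levels for Brent-Kung, 129 cells and 5 levels for Kogge-Stone, and 80 cells and 6 levels for Han-Carlson, which places Han-Carlson between the other two in area at the price of one extra level over Kogge-Stone.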
Review of Adders (cont.)
- Optimized Brent-Kung Adder
- Cost: $C(k) = C(k/2) + k - 1 = 2k - 2 - \log_2 k$
- Time: $\log_2 k$ (in terms of adder levels)
The idea of Han-Carlson Adder (cont.)
- Produce the Generate, Propagate, and Partial Sum bits in the first stage.
- Single-rail circuit, with double-rail only in the last stage to perform the XOR function.
- Sum = Partial_Sum XOR CarryIn
- Improved domino circuit: odd stages are dynamic and even stages are static.
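As a functional illustration of the structure described above, the following is a minimal bit-level sketch of a Han-Carlson prefix network. It models logic only, not the domino circuit or its timing. The function name, the list-based representation, and the bit-indexing convention (odd bits carry the Kogge-Stone levels) are assumptions made for the example and are not taken from the slides.

```python
def han_carlson_add(a, b, cin=0, width=32):
    """Bit-level model of a Han-Carlson adder; returns (sum, carry_out)."""
    # First stage: per-bit generate, propagate and partial sum.
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    psum = p[:]
    # Merge the carry-in into bit 0 so it travels through the prefix network.
    g[0] |= p[0] & cin

    def merge(hi, lo):
        # Prefix operator: (G, P) = (g_hi | p_hi & g_lo, p_hi & p_lo)
        return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])

    gp = list(zip(g, p))
    # First merge level: each odd bit absorbs its even neighbour.
    for i in range(1, width, 2):
        gp[i] = merge(gp[i], gp[i - 1])
    # Kogge-Stone levels restricted to the odd bits (distances 2, 4, 8, ...).
    d = 2
    while d < width:
        prev = gp[:]  # every cell of a level reads the previous level
        for i in range(1, width, 2):
            if i - d >= 0:
                gp[i] = merge(prev[i], prev[i - d])
        d *= 2
    # Last level: each even bit above bit 0 absorbs the odd bit below it.
    for i in range(2, width, 2):
        gp[i] = merge(gp[i], gp[i - 1])
    # The carry into bit i is the group generate of bits i-1..0.
    carries = [cin] + [gp[i][0] for i in range(width - 1)]
    total = 0
    for i in range(width):
        # Sum = Partial_Sum XOR CarryIn, bit by bit.
        total |= (psum[i] ^ carries[i]) << i
    return total, gp[width - 1][0]
```

For 32 bits the loop structure gives one initial merge level, four odd-bit levels, and one final level, i.e. log2(32) + 1 = 6 merge levels. As a usage example, `han_carlson_add(0xFFFFFFFF, 1)` returns `(0, 1)`.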
The implementation of Han-Carlson Adder
- Schematic design in Composer, simulation in Spectre. Both are part of the Cadence design kit.
- The simulation results are from the schematic (pre-layout).
- The best speed is achieved by using the fast mode in the technology file instead of tuning the bulk voltage.
- The clock is generated by a ring oscillator with five inverters in the loop.
- Cadence tutorials for both tools, and for setting up the environment, are provided here.
The implementation of Han-Carlson Adder (cont.)
- Clock generation
- Ring oscillator: five inverters followed by many buffers
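For an ideal ring oscillator with N identical inverters, the oscillation period is 2 x N x (per-stage delay). The sketch below applies that textbook relation to the five-inverter loop. It assumes the buffers do not change the period and that the roughly 396.23 ps clock period reported in the simulation results is the oscillator period; the function names are hypothetical and the resulting stage delay is an estimate, not a figure from the slides.

```python
def ring_oscillator_period(stages, stage_delay_ps):
    """Ideal period of a ring oscillator: 2 * N * t_d."""
    if stages % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of inverters")
    return 2 * stages * stage_delay_ps

def stage_delay_for_period(stages, period_ps):
    """Per-inverter delay implied by a target period (ideal model)."""
    return period_ps / (2 * stages)

# Assumed inputs: 5 inverters, 396.23 ps period taken from the simulation slide.
implied_delay_ps = stage_delay_for_period(5, 396.23)
print(f"implied stage delay: {implied_delay_ps:.3f} ps")
```

Under these assumptions each inverter in the loop would need a delay of about 39.6 ps.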
The implementation of Han-Carlson Adder (cont.)
Single Rail Circuit
Foot-transistor added
Double Rail inside
The implementation of Han-Carlson Adder (cont.)
- ALU PG/Partial Sum Circuit.
The implementation of Han-Carlson Adder (cont.)
- Dynamic and Static Carry Merge Stage
- Even stage: i = 0, 2, ..., 30
- Odd stage: i = 1, 3, ..., 31, or the carry at that bit has already been obtained
The implementation of Han-Carlson Adder (cont.)
- Dynamic and Static Carry Merge Stage (cont.)
- The Carry-In of the LSB should be merged in order to do subtraction.
- The generate and propagate bits of the MSB are passed to the last stage to produce the carry_out of the ALU (for the check bit).
The implementation of Han-Carlson Adder (cont.)
- Even/Odd-bits CSG Sum Generation
Complementary signal generator (CSG) logic
The implementation of Han-Carlson Adder (cont.)
- Even/Odd-bits CSG Sum Generation
- Use a latch to increase noise tolerance
Carry_bar
Carry
Simulation Result
- Try the worst-case pattern to test this design.
- A = 0, B = -2, Carry-In = 1 gives the worst-case delay.
- Why? From the structure of the circuit, the worst case is 3N-2P-2N-2P-2N-2P-3N (for the propagate bit).
Simulation Result (cont.)
- 1st stage: g = 0, p = 0, Psum = 0 (P/G/Psum, 3N)
- 2nd stage: g = 1, p = 1 (static, 2P)
- 3rd stage: g = 0, p = 0 (dynamic, 2N)
- 4th stage: g = 1, p = 1 (static, 2P)
- 5th stage: g = 0, p = 0 (dynamic, 2N)
- 6th stage: g = 1, p = 1 (static, 2P)
- 7th stage: Cin31 = 0 (dynamic, 3N)
- The result should be 2, Correct = 1
Simulation Result (cont.)
- Test whether the error flag is correct.
- 1st test pattern: $A = -2^{31}$, $B = -1$. The answer is $2^{31} - 1$ (1'b0 followed by 31 ones), which is the wrong answer, so the correct bit should be equal to 0 (tests the lower bound).
- Also check that the clock period is about 396.23 ps.
Simulation Result (cont.)
- 2nd test pattern: $A = 2^{31} - 1$, $B = 2$. The answer is $-2^{31} + 1$ (1'b1, 30 zeros, 1'b1), which is the wrong answer, so the correct bit should be equal to 0 (tests the upper bound).
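The two overflow patterns can be reproduced with a small word-level model. The sketch below is only an illustration: it defines the correct bit as 1 when the wrapped 32-bit result equals the true sum and 0 otherwise, which is an assumed definition chosen to match the expected values above; the slides do not spell out how the hardware derives the bit. The function name is hypothetical.

```python
def add_with_correct_bit(a, b, width=32):
    """Two's-complement add; returns (wrapped signed result, correct bit)."""
    mask = (1 << width) - 1
    raw = (a + b) & mask
    # Reinterpret the wrapped bit pattern as a signed value.
    result = raw - (1 << width) if raw >> (width - 1) else raw
    # Assumed flag definition: 1 when the true sum fits in `width` bits.
    correct = 1 if result == a + b else 0
    return result, correct

lower = add_with_correct_bit(-2**31, -1)      # lower-bound test pattern
upper = add_with_correct_bit(2**31 - 1, 2)    # upper-bound test pattern
print(lower, upper)
```

The lower-bound pattern wraps to 2^31 - 1 and the upper-bound pattern wraps to -2^31 + 1, and both report a correct bit of 0, in line with the expected simulation results.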
Simulation Result (cont.)
Discussion: P/G/Psum Block
P circuit G circuit Psum circuit
Psum = A xor B
Mine
Discussion (cont.)
- What might be the problem?
- Longer path to the ground
- During precharge, both the propagate and generate bits are 1.
- What do we need to consider? If p = 0 and g = 0, this circuit may perform well.
- However, what if g goes from 1 to 0 while p = 1?
Discussion (cont.)
- If the longest path is cut, then
Mine
Discussion (cont.)
Comparison between H-C adder and Ling Adder
- Ling Adder
- For an n-bit Ling adder combining r groups, the critical path has $\lceil\log_r n\rceil - 1$ levels.
- An r-to-1 reduction at each level results in $\lceil\log_r n\rceil$ levels.
- The -1 comes from using the CLA expression rather than Ling's expression for the last group, which saves an additional stage.
- The worst-case delay remains the second path from the last block.
- For each block, there are r + 1 transistors connected in series.
- A carry-select block is used to generate the sum bits. Only 2 additional gate delays are needed.
Comparison between H-C adder and Ling Adder (cont.)
Lookahead Network
- $T_d = (\lceil\log_r n\rceil - 1)(r + 1) + 2$
- E.g. $r = 3$, $n = 32$: $\lceil\log_3 32\rceil = 4$, so $T_d = (4 - 1)(3 + 1) + 2 = 14$
Group Generation
CLA expression
Carry-Select structure (MUX)
Comparison between H-C adder and Ling Adder (cont.)
- H-C Adder
- P, G generation: 3
- Carry merge in each stage (including dynamic and static): 2
- CSG Sum: 5
- $T_d = 2\log_2 n + 3$ (P, G generation) $+ 5$ (CSG Sum)
- E.g. $n = 32$: $T_d = 2 \cdot 5 + 3 + 5 = 18$
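The two delay expressions can be evaluated side by side. The sketch below is a hedged illustration: the function names are hypothetical, and rounding the logarithm up to a whole number of levels is an assumption, made because it reproduces the quoted Td = 14 for r = 3, n = 32.

```python
def ceil_log(n, r):
    """Smallest integer L with r**L >= n (integer arithmetic, no float rounding)."""
    levels, span = 0, 1
    while span < n:
        span *= r
        levels += 1
    return levels

def ling_delay(n, r):
    # (levels - 1) lookahead blocks of r + 1 series transistors, plus 2 for carry-select.
    return (ceil_log(n, r) - 1) * (r + 1) + 2

def han_carlson_delay(n):
    # 2 per carry-merge stage, plus 3 for P/G generation and 5 for the CSG sum.
    return 2 * ceil_log(n, 2) + 3 + 5

print(ling_delay(32, 3), han_carlson_delay(32))
```

For n = 32 this evaluates to 14 for the Ling adder with r = 3 and 18 for the Han-Carlson adder, matching the examples on the slides.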
Comparison between H-C adder and Ling Adder (cont.)
- What are the pros and cons?
- Ling Adder
- Advantage: shorter worst-case path, so it might be faster in theory.
- Disadvantages:
- Irregular layout, which wastes area
- Many complex gates, which implies charge-sharing problems
- Many inputs per stage lead to long wires, which is a delay problem at high frequency
- Carry-select logic makes the area bigger
Comparison between H-C adder and Ling Adder (cont.)
- Han-Carlson Adder
- Disadvantage: longer path to the output
- Advantages:
- Regular layout for each stage
- Fewer inputs on each path, which relieves the interconnect problem
- Simpler gates mean fewer charge-sharing problems
Future Work
- Power reduction by inserting sleep transistors
- Speed improvement by inserting discharge transistors in the intermediate stack nodes of the dynamic stages during the precharge phase
- Area reduction in layout
- SOI model test
- Self-resetting to minimize the clock period
Reference
- A 6.5GHz 130nm Single-Ended Dynamic ALU and Instruction Scheduler Loop, ISSCC 2002
- Sub-500-ps 64-b ALUs in 0.18-um SOI/Bulk CMOS: Design and Scaling Trends, JSSC, Nov. 2001
- Fast Area-Efficient VLSI Adders, Proc. 8th Symp. Computer Arithmetic, Sept. 1987
- Computer Arithmetic: Algorithms and Hardware Design, Behrooz Parhami, Oxford University Press
- Advanced Computer Arithmetic Design, Michael J. Flynn, et al., John Wiley & Sons, Inc.
- 5 GHz 32b Integer-Execution Core in 130nm Dual-Vt CMOS, ISSCC 2002
- Implementation of a Self-Resetting CMOS 64-Bit Parallel Adder with Enhanced Testability, JSSC, Aug. 1999