Title: Design of Power-Efficient Floating-Point Adder Blocks
1Design of Power-Efficient Floating-Point Adder
Blocks
- Xiao Yan Yu
- ACSEL Lab
- University of California at Davis
- May 29, 2007
2Presentation Outline
- Motivation
- Research objective
- Background
- Newly developed high-performance floating-point adder: POWER6 floating-point adder
- Newly developed medium-performance floating-point adder
- Directions in future adder design
- Conclusion
3Motivation
- Adders designed using dynamic logic can consume significantly more power in 65 nm technology. The current design trend favors static circuits to save power and uses dynamic circuits only when necessary.
- For high-performance applications, static circuits do not operate as fast as dynamic circuits in the power-performance space. Special techniques are needed.
- For medium-performance applications, sparse tree implementations provide significant power savings. However, current state-of-the-art sparse tree designs do not meet our stringent power and area constraints. A new sparse tree design is needed.
4Research Objective
- This research provides guidelines on how to design power-efficient floating-point adders for use in high-performance and medium-performance multiply-add fused dataflows.
- It targets one type of floating-point addition, the end-around carry addition.
5Background of Floating Point Add
- The multiply-add fused dataflow performs $T = B + A \times C$, where B is the addend, A is the multiplicand, and C is the multiplier.
- The input operands of the adder come from the outputs of the last 3:2 compressor, which compresses the sum and carry from the multiplier tree and the addend.
- The magnitude of the operands is not known prior to addition.
- Floating-point operation is a sign-magnitude operation. The adder needs to produce a magnitude result during 2's complement subtraction.
- Case 1: If operand A > B, $|A - B| = A - B = (A + \overline{B} + 1)$
- Case 2: If operand B > A, $|A - B| = B - A = -(A - B) \Rightarrow -(A + \overline{B}) - 1 = \overline{(A + \overline{B} + 0)}$
- During 2's complement subtraction of A - B, the final carry-out, Cout, is 1 when A > B and 0 when B > A, so Cout determines whether it is case 1 or case 2.
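The two cases above can be sketched in a few lines of Python. This is only a behavioral illustration, not the adder's circuit: the function name `eac_magnitude` and the 8-bit default width are assumptions made for the example, and the carry-out is taken from A + NOT(B) with a carry-in of 0, so it is 1 only when A is strictly greater than B.

```python
def eac_magnitude(a: int, b: int, n: int = 8) -> int:
    """Return |a - b| for n-bit unsigned magnitudes using an end-around carry."""
    mask = (1 << n) - 1
    raw = a + (~b & mask)      # A + NOT(B) with carry-in 0
    cout = raw >> n            # final carry-out (0 or 1)
    if cout:
        # Case 1 (A > B): feed the carry-out back in as the +1
        return (raw + 1) & mask
    # Case 2 (B >= A): the magnitude is the bitwise complement of the sum
    return ~raw & mask


print(eac_magnitude(5, 3), eac_magnitude(3, 5))  # both operand orders give 2
```

Because the carry-out both selects the case and supplies the +1, a single adder pass produces the magnitude without a separate negation step.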
6End-Around Carry Computation cont.
[Figure: abstraction of end-around carry computation during subtraction]
7End-Around Carry Addition
- Assume the adder is divided into four groups. During subtraction, the carry for each group can be expressed as: [group carry equations not preserved in this transcript]
- During addition, P3 is set to 0 and conventional carries are computed.
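Since the group carry equations themselves did not survive, the following Python sketch shows one plausible form of them and checks it exhaustively. It rests on several assumptions that are not in the slides: group 0 is the most significant group and group 3 the least significant (the numbering that makes "P3 is set to 0" cut the wrap-around path), C_i denotes the carry out of group i, and the group width `W` is a hypothetical 2 bits so that every operand pair can be tested.

```python
W, GROUPS = 2, 4                 # hypothetical sizes, small enough to test exhaustively
N = W * GROUPS
MASK = (1 << N) - 1
GMASK = (1 << W) - 1


def group_gp(x, y):
    """Group generate/propagate bits; index 0 is the most significant group."""
    gs, ps = [], []
    for i in range(GROUPS):
        shift = (GROUPS - 1 - i) * W
        xg = (x >> shift) & GMASK
        yg = (y >> shift) & GMASK
        gs.append((xg + yg) >> W)             # group carries out with carry-in 0
        ps.append(int((xg ^ yg) == GMASK))    # every bit in the group propagates
    return gs, ps


def eac_group_carries(gs, ps, subtract=True):
    """Carry out of groups 0..3, each as a flat sum of products."""
    g0, g1, g2, g3 = gs
    p0, p1, p2, p3 = ps
    if not subtract:
        p3 = 0                   # addition: block the end-around path
    c3 = g3 | p3 & g0 | p3 & p0 & g1 | p3 & p0 & p1 & g2
    c2 = g2 | p2 & g3 | p2 & p3 & g0 | p2 & p3 & p0 & g1
    c1 = g1 | p1 & g2 | p1 & p2 & g3 | p1 & p2 & p3 & g0
    c0 = g0 | p0 & g1 | p0 & p1 & g2 | p0 & p1 & p2 & g3
    return [c0, c1, c2, c3]


def reference_carries(x, y, cin):
    """Carry out of each group for x + y + cin, computed by plain arithmetic."""
    out = []
    for i in range(GROUPS):
        hi = (GROUPS - i) * W
        low = (1 << hi) - 1
        out.append(((x & low) + (y & low) + cin) >> hi)
    return out


# Subtraction example: A + NOT(B) with the carry-out fed back as carry-in
a, b = 0x5A, 0x3C
x, y = a, ~b & MASK
print(eac_group_carries(*group_gp(x, y)))
```

In this form every carry depends on all four groups during subtraction, because the carry-out of the most significant group re-enters at the least significant one. With P3 forced to 0, each expression collapses to the ordinary carry-lookahead term with a carry-in of 0.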
8 9High Performance Adder Design Issues
- The adder is required to operate at the higher end of the multi-gigahertz range inside a unit. Hence, the overall performance matters, not the stand-alone performance.
- Currently, only dynamic adders can achieve this performance. However, the power consumed by dynamic adders cannot be tolerated in current high-performance applications, so only static circuits can be used for implementation.
- Conventional static adders cannot achieve such high performance.
10High Performance Adder Design Solution
- Adder partitioning can be used to boost overall performance. The adder can be partitioned to fit a particular floating-point pipeline design.
- Placement optimization is performed on these partitions to ensure the lowest communication overhead.
- Cell stacking can be used to shorten wires on the critical paths.
- An adder tree that is balanced according to its critical path provides higher performance than conventional ones.
11- High Performance Adder Design Example
- 128-bit Binary Floating-Point Adder for POWER6
Processor
12Key Features of the POWER6 BFU adder
- Fabricated in IBM's 65 nm SOI technology
- Realized in a 7-cycle multiply-add pipeline
- Implementation uses all static circuits with nominal-Vt devices
- The adder is physically implemented as part of an O-shaped BFU floorplan
- A non-uniformly sparse adder scheme was used, based on the given wire resources, to optimize performance.
13Organization of POWER6 BFU Adder as Part of the
BFU Dataflow
14Organization of POWER6 BFU Adder cont.
Final Sum Selection
Sum and carries from last 3:2
p,g generation
Floating-Point Addition
End-around carry computation
15POWER6 BFU Adder Block Diagram
16Diagram of the 32-b block
$Carry1_i = Carry0_i \lor P_i$, where $Carry0_i$ is the carry when cin is 0 and $Carry1_i$ is the carry when cin is 1. Therefore, $P_i \Rightarrow Carry1_i$, and $Carry1_i$ can be used instead of $P_i$ on the non-critical paths.
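The substitution of Carry1 for P can be checked exhaustively on a small block. The sketch below is an illustration under assumptions: the helper names and the 4-bit default width are hypothetical, and P is modeled as a block propagate (every bit position propagates), which may differ in detail from the signal named on the slide.

```python
def carry_out(x, y, cin, width):
    """Carry-out of a width-bit block computing x + y + cin."""
    return (x + y + cin) >> width


def propagate(x, y, width):
    """1 when every bit position of the block propagates a carry."""
    return int((x ^ y) == (1 << width) - 1)


def check_identity(width=4):
    """Verify Carry1 = Carry0 OR P, and that Carry1 can stand in for P."""
    for x in range(1 << width):
        for y in range(1 << width):
            carry0 = carry_out(x, y, 0, width)
            carry1 = carry_out(x, y, 1, width)
            p = propagate(x, y, width)
            if carry1 != (carry0 | p):
                return False
            for cin in (0, 1):
                real = carry_out(x, y, cin, width)
                # Conventional select, then the same select with Carry1 replacing P
                if (carry0 | (p & cin)) != real:
                    return False
                if (carry0 | (carry1 & cin)) != real:
                    return False
    return True


print(check_identity())
```

The replacement works because Carry0 OR (Carry1 AND cin) expands to Carry0 OR ((Carry0 OR P) AND cin), which reduces to Carry0 OR (P AND cin).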
17Cell Stacking Technique
18Cell Stacking Technique cont.
19Comparing with other designs
- We compared our design against the Ladner-Fischer adder (LFA) design and a prefix-2 Kogge-Stone adder with sparseness of 8 (Sparse 8).
- All designs use only nominal-Vt transistors.
- The optimization points of each design are obtained by varying the power-performance tradeoff factor using EinsTuner with constrained input size.
- The performance of each point is simulated using a transistor-level static timer, EinsTLT.
- The average power dissipation of a design at each performance point is simulated using a power simulator, CPAM.
- Each output is loaded with an equivalent capacitive load calculated at the unit level of the POWER6 BFU.
20Power-Performance Result: Average Power vs. Performance
21Power-Performance Result: Leakage Power vs. Performance
22POWER6 BFU Adder Layout
Final Sum Selection
Bitwise g, p generation
Complete End-Around Carry
Partial End-Around Carry Conditional Sums
23 24Medium Performance Adder Design Issues
- The adder operates at the lower end of the multi-gigahertz range with stringent power and area constraints. A sparse tree implementation provides a power-efficient solution for this performance region.
- Contemporary sparse tree designs implement sparse trees with sparseness not exceeding 4. In designs with sparseness beyond 4, the ripple-carry chain becomes critical. There is a need to investigate designs with sparseness beyond 4 that provide enough performance in this region.
- To reduce power, high-Vt optimization is traditionally performed on a design that is well tuned in nominal Vt. This does not provide enough power savings in this region. An alternative method is needed.
25Medium Performance Adder Design Solution
- To realize a sparse tree with high sparseness, a new structure can be used that employs local carry look-ahead blocks instead of ripple chains. With this approach, the conditional sum generation does not become critical as the sparseness increases.
- As an alternative to high-Vt optimization for reducing power, a mixture of cell images in nominal Vt can be used. This will be demonstrated in our design example. This approach provides both area and power savings.
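A functional model can make the sparse-tree idea concrete: carries are resolved only at block boundaries, and each block prepares two conditional sums (carry-in 0 and carry-in 1) whose internal carries are written in look-ahead form instead of rippling. The sketch below is a behavioral illustration only. The function names are hypothetical, the defaults of 108 bits and sparseness 9 mirror the design example, the block-to-block carry is stepped sequentially here where a real design would use a parallel prefix tree, and end-around carry is left out.

```python
def bits(value, n):
    """Little-endian list of the n low bits of value."""
    return [(value >> i) & 1 for i in range(n)]


def lookahead_carries(g, p, cin):
    """Carries c[0..k] of one block, each expanded directly from g, p and cin."""
    k = len(g)
    carries = []
    for j in range(k + 1):
        c = cin
        for t in range(j):
            c &= p[t]                  # cin reaches bit j only through propagating bits
        for m in range(j):
            term = g[m]
            for t in range(m + 1, j):
                term &= p[t]           # g[m] must propagate through bits m+1..j-1
            c |= term
        carries.append(c)
    return carries


def sparse_add(a, b, n=108, k=9):
    """Add two n-bit values with conditional sums in k-bit blocks."""
    assert n % k == 0
    total, carry = 0, 0
    for blk in range(n // k):
        x = (a >> (blk * k)) & ((1 << k) - 1)
        y = (b >> (blk * k)) & ((1 << k) - 1)
        g, p = bits(x & y, k), bits(x ^ y, k)
        c0 = lookahead_carries(g, p, 0)        # conditional carries for carry-in 0
        c1 = lookahead_carries(g, p, 1)        # conditional carries for carry-in 1
        sum0 = sum((p[j] ^ c0[j]) << j for j in range(k))
        sum1 = sum((p[j] ^ c1[j]) << j for j in range(k))
        total |= (sum1 if carry else sum0) << (blk * k)
        carry = c1[k] if carry else c0[k]      # carry into the next block
    return total, carry


print(sparse_add(0xFF, 0x01, n=8, k=4))
```

In this model a larger sparseness means fewer block-boundary carries to compute globally, while the look-ahead form keeps the depth of each conditional sum from growing linearly with the block width, as a ripple chain would.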
26- Medium Performance Adder Design Example
- 270 ps, 20 mW 108-bit Floating-Point Adder
27Key Features of our 108-bit BFU adder
- Implemented in IBM's 65 nm SOI technology
- It is part of a multiply-add fused dataflow and uses the end-around carry technique
- It implements sparse trees with sparseness of 9
- A mixture of two different cell images is used in this design
- Implementation uses all static circuits with nominal-Vt devices
28Sparse 9 BFU Adder Block Diagram
29Diagram of 36-bit Lookahead Block
30Cell Images Used in Sparse 9 BFU Adder
XOR cell that spans 9 tracks
XOR cell that spans 18 tracks
31Comparison of different design approaches
- The two-cell-image approach is compared with high-Vt optimization. The adder is first implemented with only 18-track cells. For each optimization point of the adder with only 18-track cells, high-Vt optimization is applied on its non-critical paths. The design with two cell images is created by replacing all the 18-track cells on the non-critical paths with 9-track cells.
- The percentage of high-Vt cells in the high-Vt optimized design ranges from 34% at the highest performance point to 57% at the lowest performance point.
32Comparison of different design approaches
33Comparison of Sparse 9 with other designs
- The Sparse 9 design is compared against Sparse 4 and Sparse 6 designs.
- The organization of these adders is similar to that of our implementation. The difference lies in the schemes used inside the 36-bit CLA blocks and conditional sum blocks of each adder. Both the Sparse 4 and Sparse 6 designs use ripple-carry adders in their conditional sum blocks. Our design uses local CLA adders instead.
- All designs use only nominal-Vt transistors with the two-cell-image approach and without the result latch bank.
- The assignments of cell images used in each block are the same for each adder. However, the number of 9-track and 18-track cells used in each adder is different.
- All critical wires are assumed to have good wire width and spacing.
34Comparison of Sparse 9 with other designs
Final Design
35Power Distributions of the Final Design
36Sparse 9 BFU Adder Floorplan
37Sparse 9 BFU Adder Layout
- All components are manually placed and routed for minimal area.
- All loads at the outputs of the clock pulse generators are well balanced to minimize clock skew.
- Total transistor count: 21,306
- 9-track cell percentage: 70%
- Number of metal layers used: 4
38- Directions in Future Adder Design
39Directions in Future Adder Design
- High-Performance Adders
- The concept of the adder as a module fades away at very high frequency. A well-partitioned adder provides significant improvement in overall system performance. Future adders will continue to follow this trend.
- Floorplan-aware adder optimization is needed to obtain optimal adder partitions. Optimizations at the micro-architecture and floorplan levels are required to achieve this.
40Directions in Future Adder Design
- Medium-Performance Adders
- Sparse tree adders have been shown to have sufficient power efficiency. By adaptively adjusting the sparseness, a design can meet its stringent performance and power constraints.
- We have also observed the effectiveness of designing an adder with two different cell images. Currently, the assignment of cell images has to be done manually. This could be implemented in tools that assign cell images automatically.
41Conclusions
- This research provided new ways to design performance-specific adders.
- For high-performance applications, we have designed a fast 128-bit floating-point adder, which is implemented and fabricated as part of the POWER6 processor in IBM 65 nm SOI technology.
- For medium-performance applications, we have designed a sparse tree adder with sparseness of 9 and local CLA adders inside its conditional sum blocks. A two-cell-image design methodology that uses regular-Vt transistors is used to ensure low power and compact layout.
42Thank you for listening to my talk. Questions?