Design of Power-Efficient Floating-Point Adder Blocks

1
Design of Power-Efficient Floating-Point Adder
Blocks

Xiao Yan Yu
ACSEL Lab
University of California at Davis
May 29, 2007

2
Presentation Outline

Motivation
Research objective
Background
Newly developed high-performance floating-point
adder: the POWER6 floating-point adder
Newly developed medium-performance floating-point
adder
Directions in future adder design
Conclusion

3
Motivation

Adders built with dynamic logic consume
significantly more power in 65nm technology. The
current design trend favors static circuits to
save power and uses dynamic circuits only when
necessary.
For high-performance applications, static
circuits are not as fast as dynamic circuits in
the power-performance space. Special techniques
are needed.
For medium-performance applications, sparse tree
implementations provide significant power savings.
However, current state-of-the-art sparse tree
designs do not meet our stringent power and area
constraints. A new sparse tree design is needed.

4
Research Objective

This research provides guidelines on how to
design power-efficient floating-point adders for
use in high performance and medium performance
multiply-add fused dataflow.
It targets one type of floating-point addition,
the end-around carry addition.

5
Background of Floating Point Add

Multiply-add fused dataflow performs
$T = B + A \times C$, where B is the addend, A is
the multiplicand, and C is the multiplier.
Input operands of the adder come from the outputs
of the last 3:2 compressor, which compresses the
sum and carry from the multiplier tree and the
addend.
The magnitude of the operands is not known prior
to addition.
Floating-point operation is a sign-magnitude
operation. The adder needs to produce a magnitude
result during 2's complement subtraction.
Case 1: if operand $A > B$, $|A - B| = A - B = A + \bar{B} + 1$
Case 2: if operand $B > A$, $|A - B| = B - A = -(A - B) \Rightarrow -(A + \bar{B}) - 1 = \overline{A + \bar{B} + 0}$
During 2's complement subtraction of $A - B$, the
final carry-out, Cout, is 1 when $A > B$ and 0 when
$B > A$, so Cout determines whether it is case 1
or case 2.
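The two cases above can be checked with a small behavioral model. This is only a sketch under stated assumptions: the function name `magnitude_subtract`, the 8-bit default width, and the two-step evaluation (first add with carry-in 0, then feed the carry back) are illustrative choices, not the circuit described in the talk.

```python
def magnitude_subtract(a: int, b: int, width: int = 8) -> int:
    """Return |a - b| for unsigned width-bit operands using only A + ~B.

    Hypothetical helper for illustration; it models the arithmetic of the
    two cases, not the hardware structure.
    """
    mask = (1 << width) - 1
    not_b = ~b & mask
    # Add with carry-in 0. The carry-out of A + ~B is 1 exactly when A > B.
    raw = a + not_b
    cout = raw >> width
    if cout:
        # Case 1 (A > B): wrap the carry-out around as carry-in, A + ~B + 1.
        return (raw + 1) & mask
    # Case 2 (B >= A): the magnitude is the complement of A + ~B + 0.
    return ~raw & mask
```

For example, `magnitude_subtract(200, 55)` and `magnitude_subtract(55, 200)` both evaluate to 145, with the carry-out selecting case 1 in the first call and case 2 in the second.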

6
End-Around Carry Computation cont.
Figure: abstraction of end-around carry
computation during subtraction
7
End-Around Carry Addition

Assume the adder is divided into four groups.
During subtraction, the carry for each group can
be expressed as
During addition, P3 is set to 0 and conventional
carries are computed.
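The group-carry equations themselves did not survive in this transcript. The sketch below is one standard way to write end-around group carries, offered as an assumption rather than as the slide's formulas: the carry into each group is the OR of the generate signals of all other groups (and, after a full wrap, of the group itself), each gated by the propagate signals in between. The function names, the LSB-first group numbering, and the `subtract` flag are hypothetical. Here addition is modeled by dropping the wrapped terms with a flag, not by forcing P3 to 0.

```python
def end_around_group_carries(a, b, width=16, groups=4, subtract=True):
    """Carry into each group (LSB group first) of A + ~B with end-around
    carry, or of A + B with carry-in 0 when subtract is False."""
    mask = (1 << width) - 1
    y = (~b & mask) if subtract else (b & mask)
    size = width // groups
    G, P = [], []
    for k in range(groups):
        g, p = 0, 1
        for i in range(k * size, (k + 1) * size):
            ai, yi = (a >> i) & 1, (y >> i) & 1
            gi, pi = ai & yi, ai | yi
            g = gi | (pi & g)   # group generate
            p &= pi             # group propagate
        G.append(g)
        P.append(p)
    carries = []
    for k in range(groups):
        c, chain = 0, 1
        for step in range(1, groups + 1):
            j = k - step        # source group; negative means wrapped around
            if j < 0 and not subtract:
                break           # plain addition: no end-around terms
            c |= chain & G[j % groups]
            chain &= P[j % groups]
        carries.append(c)
    return carries


def sum_from_group_carries(a, b, width=16, groups=4, subtract=True):
    """Assemble the adder output from per-group sums and the group carries."""
    mask = (1 << width) - 1
    y = (~b & mask) if subtract else (b & mask)
    size = width // groups
    gmask = (1 << size) - 1
    result = 0
    carries = end_around_group_carries(a, b, width, groups, subtract)
    for k, cin in enumerate(carries):
        ga = (a >> (k * size)) & gmask
        gy = (y >> (k * size)) & gmask
        result |= ((ga + gy + cin) & gmask) << (k * size)
    return result
```

In this model the subtraction output equals A - B when A > B, and the bitwise complement of B - A otherwise, so a final inversion controlled by the carry-out recovers the magnitude.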

High Performance Adder

9
High Performance Adder Design Issues

The adder is required to operate at the higher
end of the multi-gigahertz range inside a unit.
Hence, overall performance matters, not
stand-alone performance.
Currently, only dynamic adders can achieve this
performance. The power consumed by dynamic adders
cannot be tolerated in current high-performance
applications, so only static circuits can be used
for implementation.
Conventional static adders cannot achieve such
high performance.

10
High Performance Adder Design Solution

Adder partitioning can be used to boost overall
performance. The adder can be partitioned to fit
a particular floating-point pipeline design.
Placement optimization is performed on these
partitions to ensure the lowest communication
overhead.
Cell stacking can be used to shorten wires on the
critical paths.
An adder tree that is balanced according to its
critical path provides higher performance than
conventional ones.

High Performance Adder Design Example
128-bit Binary Floating-Point Adder for POWER6
Processor

12
Key Features of the POWER6 BFU adder

Fabricated in IBM's 65nm SOI technology
It is realized in a 7-cycle multiply-add pipeline
Implementation uses all static circuits with
nominal Vt devices
Adder is physically implemented as part of an
O-shaped BFU floorplan
A non-uniformly sparse adder scheme was used,
based on the given wire resources, to optimize
performance.

13
Organization of POWER6 BFU Adder as Part of the
BFU Dataflow
14
Organization of POWER6 BFU Adder cont.
Figure labels: final sum selection; sum and
carries from last 3:2 compressor; p, g generation;
floating-point addition; end-around carry
computation
15
POWER6 BFU Adder Block Diagram
16
Diagram of the 32-b block
Since $\mathrm{Carry1}_i = \mathrm{Carry0}_i + P_i$ (logical OR), where
$\mathrm{Carry0}_i$ is the carry when cin is 0 and $\mathrm{Carry1}_i$ is
the carry when cin is 1, $P_i$ implies $\mathrm{Carry1}_i$.
Therefore $\mathrm{Carry1}_i$ can be used instead of $P_i$ on
the non-critical paths.
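The relation between the two conditional carries and the block propagate can be checked exhaustively on a small block. This is a minimal sketch: `carry_signals` and `identity_holds` are hypothetical names, and the block propagate is taken as the AND of the bitwise XOR propagates, which is an assumption (the identity holds for OR-type propagate as well).

```python
def carry_signals(a: int, b: int, width: int):
    """Return (Carry0, Carry1, P) for one width-bit block."""
    mask = (1 << width) - 1
    carry0 = ((a & mask) + (b & mask)) >> width      # carry-out with cin = 0
    carry1 = ((a & mask) + (b & mask) + 1) >> width  # carry-out with cin = 1
    p = int(((a ^ b) & mask) == mask)                # every bit propagates
    return carry0, carry1, p


def identity_holds(width: int = 4) -> bool:
    """Check Carry1 = Carry0 OR P, and that Carry1 can replace P when the
    block carry-out is selected by the incoming carry."""
    for a in range(1 << width):
        for b in range(1 << width):
            c0, c1, p = carry_signals(a, b, width)
            if c1 != (c0 | p):
                return False
            for cin in (0, 1):
                if (c0 | (p & cin)) != (c0 | (c1 & cin)):
                    return False
    return True
```

The second check is the reason the substitution is safe: Carry0 OR (Carry1 AND cin) expands to Carry0 OR (P AND cin), so a path that already has Carry1 available does not need a separate P signal.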
17
Cell Stacking Technique
18
Cell Stacking Technique cont.
19
Comparing with other designs

We have compared our design against the
Ladner-Fischer (LFA) design and a prefix-2
Kogge-Stone adder with sparseness of 8 (Sparse
8).
All designs use only nominal Vt transistors.
The optimization points of each design are
obtained by varying the power-performance tradeoff
factor in EinsTuner with constrained input
size.
The performance of each point is simulated using
a transistor-level static timer, EinsTLT.
The average power dissipation of a design at each
performance point is simulated using a power
simulator, CPAM.
Each output is loaded with equivalent capacitive
load calculated at the unit level of the POWER6
BFU.

20
Power-Performance Result: Average Power vs.
Performance
21
Power-Performance Result: Leakage Power vs.
Performance
22
POWER6 BFU Adder Layout
Layout regions: final sum selection; bitwise g, p
generation; complete end-around carry; partial
end-around carry, conditional sums
23

Medium Performance Adder

24
Medium Performance Adder Design Issues

The adder operates at the lower end of the
multi-gigahertz range with stringent power and
area constraints.
Sparse tree implementation provides a
power-efficient solution for this performance
region.
Contemporary sparse tree designs use a sparseness
not exceeding 4. In designs with sparseness beyond
4, the ripple carry chain becomes critical.
Designs with sparseness beyond 4 that still
provide enough performance in this region need to
be investigated.
To reduce power, high-Vt optimization is
traditionally performed on a design that is well
tuned in nominal Vt. This does not provide enough
power saving in this region. An alternative method
is needed.

25
Medium Performance Adder Design Solution

To realize a sparse tree with high sparseness, a
new structure can be used, which uses local carry
look-ahead blocks instead of ripple chains. With
this approach, the conditional sum generation
does not become critical when we increase the
sparseness.
As an alternative to high-Vt optimization for
reducing power, a mixture of cell images in
nominal Vt can be used. This will be demonstrated
in our design example. This approach provides
both area and power savings.
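A behavioral sketch of the first point: a sparse adder produces a carry only at every k-th bit, and each k-bit block prepares two conditional sums (carry-in 0 and carry-in 1) that the sparse carry selects between. In the sketch the conditional sums use a local lookahead form, where every internal carry is computed directly from the bitwise generate and propagate signals instead of from the previous carry. The function names are hypothetical, the sparse tree is replaced by plain integer arithmetic as a stand-in, and the 36-bit lookahead hierarchy of the actual design is not modeled.

```python
def local_cla_sum(a: int, b: int, cin: int, k: int) -> int:
    """k-bit conditional sum with lookahead-style internal carries."""
    g = [(a >> i) & (b >> i) & 1 for i in range(k)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(k)]
    s = 0
    for i in range(k):
        # c_i = g[i-1] | p[i-1]g[i-2] | ... | p[i-1]...p[0]cin
        c, chain = 0, 1
        for j in range(i - 1, -1, -1):
            c |= chain & g[j]
            chain &= p[j]
        c |= chain & cin
        s |= (p[i] ^ c) << i
    return s


def sparse_add(a: int, b: int, width: int = 108, k: int = 9) -> int:
    """Add two width-bit operands block by block, dropping the final carry."""
    block_mask = (1 << k) - 1
    result, carry = 0, 0
    for lo in range(0, width, k):
        xa, xb = (a >> lo) & block_mask, (b >> lo) & block_mask
        s0 = local_cla_sum(xa, xb, 0, k)   # conditional sum for carry-in 0
        s1 = local_cla_sum(xa, xb, 1, k)   # conditional sum for carry-in 1
        result |= (s1 if carry else s0) << lo
        # Stand-in for the sparse tree: carry into the next k-bit boundary.
        carry = (xa + xb + carry) >> k
    return result
```

A ripple-based conditional sum block computes the same values; the difference is delay, since the ripple chain grows with k while the lookahead form keeps each internal carry a shallow function of the block inputs.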

Medium Performance Adder Design Example
270ps 20mW 108-bit Floating Point Adder

27
Key Features of our 108-bit BFU adder

Implemented in IBM's 65nm SOI technology
It is part of a multiply-add fused dataflow and
uses the end-around carry technique
It implements sparse trees with sparseness of 9
A mixture of two different cell images is used
in this design
Implementation uses all static circuits with
nominal Vt devices

28
Sparse 9 BFU Adder Block Diagram
29
Diagram of 36-bit Lookahead Block
30
Cell Images Used in Sparse 9 BFU Adder
XOR cell that spans 9 tracks
XOR cell that spans 18 tracks
31
Comparison of different design approaches

The two-cell-image approach is compared with
high-Vt optimization. The adder is first
implemented with only 18-track cells. For each
optimization point of the adder with only
18-track cells, high-Vt optimization is applied
to its non-critical paths. The design with two
cell images is created by replacing all the
18-track cells on the non-critical paths with
9-track cells.
The percentage of high-Vt cells in the high-Vt
optimized design ranges from 34% at the highest
performance point to 57% at the lowest
performance point.

32
Comparison of different design approaches
33
Comparison of Sparse 9 with other designs

The sparse 9 design is compared against sparse 4
and sparse 6 designs.
The organization of these adders is similar to
that of our implementation. The difference lies
in the schemes used inside the 36-bit CLA blocks
and conditional sum blocks of each adder. Both
the Sparse 4 and Sparse 6 designs use ripple-carry
adders in their conditional sum blocks. Our design
uses local CLA adders instead.
All designs use only nominal Vt transistors with
the two-cell-image approach and without the result
latch bank.
The assignments of cell images used in each block
are the same for each adder. The numbers of
9-track and 18-track cells used in each adder
differ, however.
All critical wires are assumed to have good wire
width and spacing.

34
Comparison of Sparse 9 with other designs
Final Design
35
Power Distributions of the Final Design
36
Sparse 9 BFU Adder Floorplan
37
Sparse 9 BFU Adder Layout

All components are manually placed and routed for
minimal area.
All loads at the outputs of clock pulse
generators are well balanced to minimize clock
skew.
Total transistor count: 21306
9-track cell percentage: 70%
Number of metal layers used: 4

Directions in Future Adder Design

39
Directions in Future Adder Design

High-Performance Adders
The concept of the adder as a module fades away
at very high frequency. A well-partitioned adder
provides significant improvement in overall
system performance. Future adders will continue
to follow this trend.
Floorplan-aware adder optimization is needed to
obtain optimal adder partitions. This requires
optimizations at the micro-architecture and
floorplan levels.

40
Directions in Future Adder Design

Medium-Performance Adders
Sparse tree adders have been shown to have
sufficient power efficiency. By adaptively
adjusting the sparseness number, a design can
meet its stringent performance and power
constraints.
We have also observed the effectiveness of
designing an adder with two different cell images.
Currently, assignment of cell images has to be
done manually. This can be implemented in tools
that assign cell images automatically.

41
Conclusions

This research provided new ways to design
performance-specific adders.
For high-performance applications, we have
designed a fast 128-bit floating-point adder,
which is implemented and fabricated as part of
the POWER6 processor in IBM 65nm SOI technology.
For medium-performance applications, we have
designed a sparse tree adder with sparseness of 9
and local CLA adders inside its conditional sum
blocks. A two-cell-image design methodology that
uses regular Vt transistors is used to ensure low
power and compact layout.

42
Thank you for listening to my talk. Questions?
