Title: Recent Developments in Theory and Implementation of Parallel Prefix Adders
1. Recent Developments in Theory and Implementation of Parallel Prefix Adders
- Neil Burgess
- Division of Electronics
- Cardiff School of Engineering
- Cardiff University
2. Motivation
- Parallel Prefix Adders (e.g. Kogge-Stone) mostly ignored for deep submicron VLSI:
- large fan-out points
- wide wiring channels
- Recent insights can remove both and do...
- absolute difference
- late increment
- media processing
3. Structure of Presentation
- Parallel Prefix Adder theory
- Kogge-Stone, Ladner-Fischer
- New log-depth prefix trees
- Knowles family of adders
- New applications of prefix adders
- late operations, media adder
4. I. Parallel Prefix Adder theory
5. Prefix adder structure
6. Prefix Equations - 1
- $g(i) = a(i) \land b(i)$: carry generate
- $p(i) = a(i) \oplus b(i)$: carry propagate
- $k(i) = \overline{a(i)} \land \overline{b(i)}$: carry kill
- g(i), p(i), k(i) are mutually exclusive
- Use any two: $\overline{g(i)}$ and $k(i)$ (NAND and NOR)
- p(i) needed as well: $s(i) = p(i) \oplus c(i)$
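The bit-level signals can be sketched in a few lines of Python. This is only an illustration: the function names `bit_signals` and `ripple_sum` are invented here, operands are assumed to be non-negative integers, and the carry is rippled serially rather than computed by a prefix tree.

```python
def bit_signals(a, b, i):
    """Return (g, p, k) for bit i of operands a and b."""
    ai, bi = (a >> i) & 1, (b >> i) & 1
    g = ai & bi                # carry generate
    p = ai ^ bi                # carry propagate
    k = (1 - ai) & (1 - bi)    # carry kill
    return g, p, k


def ripple_sum(a, b, width):
    """Add a and b bit by bit using s(i) = p(i) XOR c(i)."""
    c, total = 0, 0
    for i in range(width):
        g, p, k = bit_signals(a, b, i)
        assert g + p + k == 1      # the three signals are mutually exclusive
        total |= (p ^ c) << i      # sum bit s(i)
        c = g | (p & c)            # carry into bit i + 1
    return total | (c << width)    # include the carry out
```

For example, `bit_signals(0b11, 0b01, 0)` gives `(1, 0, 0)`: both operand bits are 1, so the bit generates a carry.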
7. Prefix Equations - 2
- Generate and Not Kill signals are combined to form Group Signals
- $G^{x}_{z}$, $\overline{K}^{x}_{z}$: interpretation
- 0, 0: $c(x+1) = 0$
- 0, 1: $c(x+1) = c(z)$
- 1, 0: Don't care
- 1, 1: $c(x+1) = 1$
8. Prefix Equations - Interpretation
- Group signals yield carry signals
- Tree outputs: $c(i+1) = G^{i}_{0}$
- Tree inputs: $G^{i}_{i} = g(i)$, $\overline{K}^{i}_{i} = \overline{k(i)}$
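A small Python sketch of how group signals yield carries. The combining operator is not spelled out in the text here, so the form used in `combine` (group generate = G_hi OR (notK_hi AND G_lo), group not-kill = notK_hi AND notK_lo) is an assumption chosen to agree with the interpretation table; the function names are illustrative, and the groups are accumulated serially rather than in a tree.

```python
def combine(hi, lo):
    """Assumed prefix operator on (G, notK) pairs; hi is the more significant group."""
    g_hi, nk_hi = hi
    g_lo, nk_lo = lo
    return (g_hi | (nk_hi & g_lo), nk_hi & nk_lo)


def carries(a, b, width):
    """Return [c(0), ..., c(width)], with c(0) = 0 and c(i+1) = G of group i..0."""
    out = [0]
    group = None
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        leaf = (ai & bi, ai | bi)      # tree inputs: g(i) and not-k(i)
        group = leaf if group is None else combine(leaf, group)
        out.append(group[0])           # tree output: c(i+1) = G
    return out
```

`carries(0b0111, 0b0001, 4)` returns `[0, 1, 1, 1, 0]`: bit 0 generates, bits 1 and 2 propagate, and bit 3 kills the carry.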
9. Prefix Equations - characteristics
- Associative
- sub-terms may be pre-computed in parallel
10. Prefix equations - characteristics
- Idempotent
- sub-terms may be overlapped
[Figure: prefix graph with inputs g(0), k(0); g(1), k(1); g(2), k(2), GK group-signal cells, and carry outputs c(1), c(2), c(3)]
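Both characteristics can be checked by brute force. The sketch below assumes the usual generate/not-kill form of the prefix operator (it is not written out in the text here) and uses invented function names. It confirms that grouping does not matter (associativity) and that two sub-terms sharing a middle group give the same result as combining each group once (overlap made harmless by idempotency).

```python
from itertools import product

# (G, notK) values a real group can take; (1, 0) is the don't-care case.
VALID = [(0, 0), (0, 1), (1, 1)]


def combine(hi, lo):
    """Assumed prefix operator; hi is the more significant group."""
    return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])


def is_associative():
    return all(combine(combine(x, y), z) == combine(x, combine(y, z))
               for x, y, z in product(VALID, repeat=3))


def is_idempotent():
    return all(combine(x, x) == x for x in VALID)


def overlap_is_harmless():
    # Combining groups hi..mid and mid..lo, which share mid, gives the
    # same result as combining hi, mid and lo once each.
    return all(combine(combine(h, m), combine(m, l))
               == combine(h, combine(m, l))
               for h, m, l in product(VALID, repeat=3))
```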
11. 4-bit Ladner-Fischer prefix tree
- 1 sub-term pre-computed
- Logarithmic depth
- Fan-out 2 in 2nd row (laterally)
12. 8-bit Ladner-Fischer prefix tree
- Log depth; lateral fan-out 4 in 3rd row
- No exploitation of idempotency
13. 16-bit Ladner-Fischer prefix tree
- Log depth with large fan-out in final row
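The fan-out growth can be seen by listing the cells of such a tree. The sketch below assumes the common minimum-depth construction in which, at level j, every bit whose j-th index bit is set takes the group ending just below its aligned block; `ladner_fischer` and `lateral_fanouts` are illustrative names.

```python
def ladner_fischer(width):
    """Return one list per level of (bit, source_bit) prefix cells."""
    levels, j = [], 0
    while (1 << j) < width:
        cells = [(i, ((i >> j) << j) - 1)      # source: top bit of the block below
                 for i in range(width) if (i >> j) & 1]
        levels.append(cells)
        j += 1
    return levels


def lateral_fanouts(levels):
    """Largest number of cells driven by a single source, per level."""
    out = []
    for cells in levels:
        counts = {}
        for _, src in cells:
            counts[src] = counts.get(src, 0) + 1
        out.append(max(counts.values()))
    return out
```

For 16 bits this gives 32 cells in total and lateral fan-outs of 1, 2, 4, 8, so the fan-out doubles at every row.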
14. 4-bit Kogge-Stone prefix graph
- Fan-out 1 (laterally)
- 1 extra cell
- parallel wires in 2nd row
15. 8-bit Kogge-Stone prefix graph
- More cells and wiring than Ladner-Fischer
16. 16-bit Kogge-Stone prefix graph
- Low fan-out but wider wiring channels
- No exploitation of idempotency
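A Kogge-Stone graph is easy to simulate: at each level every bit combines with the bit a fixed distance below it, and the distance doubles per level. The sketch below is an illustration under that assumption, using the usual generate/not-kill operator; the function name and the returned cell count are not taken from the slides' figures.

```python
def kogge_stone_carries(a, b, width):
    """Return ([c(0), ..., c(width)], number_of_prefix_cells)."""
    gk = []
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        gk.append((ai & bi, ai | bi))          # (g(i), not-k(i))
    cells, dist = 0, 1
    while dist < width:
        nxt = list(gk)
        for i in range(dist, width):           # every such bit gets its own cell
            g_hi, nk_hi = gk[i]
            g_lo, nk_lo = gk[i - dist]         # each source drives one cell only
            nxt[i] = (g_hi | (nk_hi & g_lo), nk_hi & nk_lo)
            cells += 1
        gk = nxt
        dist *= 2
    return [0] + [g for g, _ in gk], cells
```

For 8 bits this uses 7 + 6 + 4 = 17 cells, and for 16 bits 15 + 14 + 12 + 8 = 49 cells.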
17. Black cells and grey cells
- Carries: $c(i) = G^{i-1}_{0}$, so $K^{i-1}_{0}$ terms not needed
- G-only cells called, and coloured, grey
18. The story so far
- Parallel prefix adders available in VLSI
- Log-depth adders possible
- high fan-outs (1,2,4,8), low cell count
- low fan-outs (1,1,1,1), high cell count
- Problematic in VLSI (buffering, area)
- Idempotency of the prefix operator not exploited
19. II. Knowles Family of Adders
20. Log-depth prefix trees
- In VLSI
- L-F trees require too much buffering → delay
- K-S trees require too much area (wire flux)
- Fan-outs characterised as
- 1,2,4,8: Ladner-Fischer
- 1,1,1,1: Kogge-Stone
21. Knowles insight
- Use other fan-out schemes
- 5 possible 8-bit log-depth prefix trees
- 1,1,1: 17 cells, Kogge-Stone
- 1,1,2: 17 cells, uses idempotency
- 1,1,4: 14 cells, no idempotency
- 1,2,2: 14 cells, no idempotency
- 1,2,4: 12 cells, Ladner-Fischer
22. Knowles 8-bit prefix trees
23. Tree construction rules
- Levels are labelled 0, 1, 2...
- Fan-out at jth level, $2^{k}$, satisfies $2^{k} \le 2^{j}$
- Fan-out at jth level $\le$ fan-out at (j+1)th level
- Lateral wire length at jth level is $2^{j}$
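The fan-out rules can be turned into a short enumeration. The sketch reads them as: the fan-out at level j is a power of two no larger than 2^j, and fan-outs never decrease from one level to the next. The function name `knowles_fanouts` is illustrative, and the width is assumed to be a power of two.

```python
from itertools import product


def knowles_fanouts(width):
    """All fan-out tuples allowed by the rules for a log-depth tree."""
    levels = width.bit_length() - 1                 # log2(width)
    choices = [[1 << k for k in range(j + 1)]       # powers of two up to 2**j
               for j in range(levels)]
    return [f for f in product(*choices)
            if all(f[j] <= f[j + 1] for j in range(levels - 1))]
```

`knowles_fanouts(8)` returns the five schemes (1,1,1), (1,1,2), (1,1,4), (1,2,2), (1,2,4), and `knowles_fanouts(16)` returns fourteen.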
24. Knowles 16-bit trees - I
- 1,1,1,1  49 cells    1,1,1,8  42 cells
- 1,1,1,2  49 cells    1,2,2,2  42 cells
- 1,1,1,4  49 cells    1,1,4,4  40 cells
- 1,1,2,2  49 cells    1,1,4,8  36 cells
- 1,1,2,4  49 cells    1,2,2,8  36 cells
- 1,1,2,8  42 cells    1,2,4,4  36 cells
- 1,2,2,4  42 cells    1,2,4,8  32 cells
25. Knowles 16-bit trees - II
- 1,1,1,1                1,1,1,8
- 1,1,1,2  Idempotent    1,2,2,2
- 1,1,1,4  Idempotent    1,1,4,4
- 1,1,2,2  Idempotent    1,1,4,8
- 1,1,2,4  Idempotent    1,2,2,8
- 1,1,2,8  Idempotent    1,2,4,4
- 1,2,2,4  Idempotent    1,2,4,8
26. Knowles 16-bit trees - III
- 1,1,1,1          1,1,1,8  R
- 1,1,1,2  I       1,2,2,2  R
- 1,1,1,4  I       1,1,4,4  R
- 1,1,2,2  I       1,1,4,8  R
- 1,1,2,4  I       1,2,2,8  R
- 1,1,2,8  R, I    1,2,4,4  R
- 1,2,2,4  R, I    1,2,4,8  R
27. Quick way of spotting R, I
- Define span(l) as distance from start of wire to first cell in lth level
- $\mathrm{span}(l) = 2^{l} - \mathrm{fanout}(l) + 1$
- tree characteristics
- R if $\mathrm{span}(j) \ge \mathrm{span}(k)$ for $j < k$
- I if span(i) span(j) span(k) for i < j < k
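A hedged sketch of how a fan-out scheme could be turned into a working carry network. It assumes span(l) = 2^l - fanout(l) + 1 and one particular wiring rule: at level l, each aligned block of fanout(l) bits shares a single wire that starts span(l) positions below the block. Every bit at or above 2^l gets a cell, so redundant cells are not pruned and the cell counts will generally be higher than the figures quoted for the pruned trees. The operator is the usual generate/not-kill form, and all names are illustrative.

```python
def knowles_carries(a, b, fanouts):
    """Carries [c(0), ..., c(width)] of a 2**len(fanouts)-bit adder."""
    width = 1 << len(fanouts)
    gk = []
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        gk.append((ai & bi, ai | bi))              # (g(i), not-k(i))
    for level, f in enumerate(fanouts):
        span = (1 << level) - f + 1                # span(l) = 2^l - fanout(l) + 1
        nxt = list(gk)
        for i in range(1 << level, width):
            src = (i // f) * f - span              # wire start shared by f cells
            g_hi, nk_hi = gk[i]
            g_lo, nk_lo = gk[src]
            # Groups may overlap here; idempotency keeps the result correct.
            nxt[i] = (g_hi | (nk_hi & g_lo), nk_hi & nk_lo)
        gk = nxt
    return [0] + [g for g, _ in gk]
```

With fan-outs (1,1,1) the rule reduces to a Kogge-Stone graph, and with (1,2,4) every block takes its source from the bit just below it, as in a Ladner-Fischer tree.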
28. Examples of R & I spotting
- fanout(l) → span(l): characteristic
- 1,1,1,1 → 1,2,4,8: neither R nor I
- 1,1,2,2 → 1,2,3,7: I only
- 1,2,2,2 → 1,1,3,7: R only
- 1,2,2,4 → 1,1,3,5: R & I
- Are R & I adders best?
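The span arithmetic in these examples can be reproduced with a short sketch. It assumes span(l) = 2^l - fanout(l) + 1 and reads the R test as "some earlier level has a span at least as large as a later one", a reading that matches the R labels in the examples; the I test is not attempted. Function names are illustrative.

```python
def spans(fanouts):
    """span(l) = 2**l - fanout(l) + 1 for each level l."""
    return [(1 << l) - f + 1 for l, f in enumerate(fanouts)]


def is_r(fanouts):
    """Assumed R test: span(j) >= span(k) for some j < k."""
    s = spans(fanouts)
    return any(s[j] >= s[k]
               for j in range(len(s))
               for k in range(j + 1, len(s)))
```

`spans((1, 2, 2, 4))` gives `[1, 1, 3, 5]`, and `is_r` is true for it because the first two spans are equal.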
29. VLSI design of prefix adders
- Adders laid out as rectangular array of prefix cells (and gaps)
- Assume cells measure 10 µm × 4 µm
- 2 cells per significance → 20 µm / bit
- Key design parameters
- buffering (area & delay)
- wiring channels (area)
30. 16-bit adder example
- Assumptions
- Maximum fan-out without buffering
- 3 cells + 80 µm wire (4 cell widths)
- Maximum fan-out with buffering
- 9 cells + 240 µm wire (12 cell widths)
- Employ 1,2,2,4 architecture
31. 1,2,2,4 prefix adder layout
32. Area vs Time for 32-bit adders
[Plot: Area against Delay for K-S 1,1,1,1,1; 1,1,2,2,2; 1,2,2,4,4 → 1,1,3,5,13; L-F 1,2,4,8,16]
33. 32-bit prefix tree adders
- Exploitable trade-off between adder delay and area
- Kogge-Stone adder 16% faster than Ladner-Fischer but 66% larger
- 1,2,2,4,4 adder 8% faster than Ladner-Fischer but only 3% larger
- buffering also trades off speed for area
34. III. New applications of prefix adders
35. Other addition operations
- Late increment
- Mod $2^{w} - 1$ addition for Reed-Solomon coding
- floating-point rounding
- Late complement
- absolute difference for video motion estimation
- sign-magnitude addition
- Typically use 2 adders and a MUX
36. Increments in prefix trees
- Row of prefix cells = late +1 operation
- Ladner-Fischer comprises many late +1s
- 1 × 8-bit, 2 × 4-bit, 4 × 2-bit, 8 × 1-bit
37. Late increment tree
- Adder returns A + B if inc = 0
- Adder returns A + B + 1 if inc = 1
[Figure: late increment tree with input inc]
38. Late increment logic
- Late Carry lc(i) set high if
- c(i) = 1 or
- inc = 1 and $(a(n), b(n)) \ne (0, 0)\ \forall n,\ 0 \le n < i$
- $c(i) = G^{i-1}_{0}$
- $lc(i) = G^{i-1}_{0} \lor (inc \land \overline{K}^{i-1}_{0})$
- $s(i) = p(i) \oplus lc(i)$
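The late carry rule can be checked with a short sketch. It assumes the rule reads: lc(i) is high if the group generate below bit i is high, or if inc is high and no bit below i is a kill. The group signals are accumulated serially here in place of a prefix tree, and the function name is illustrative. The point of the technique is that G and not-K do not depend on inc, so inc can arrive late.

```python
def late_increment_add(a, b, inc, width):
    """Return (a + b + inc) mod 2**width using late carries."""
    g_grp, nk_grp = 0, 1           # group signals for the empty range below bit 0
    result = 0
    for i in range(width):
        lc = g_grp | (inc & nk_grp)            # late carry into bit i
        ai, bi = (a >> i) & 1, (b >> i) & 1
        result |= ((ai ^ bi) ^ lc) << i        # s(i) = p(i) XOR lc(i)
        g, nk = ai & bi, ai | bi
        # Extend the group to cover bits i..0.
        g_grp, nk_grp = g | (nk & g_grp), nk & nk_grp
    return result
```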
39. Late complement theory
- In 2's-complement, $\overline{N} = -(N+1)$
- $A + \overline{B} = A - B - 1$
- late increment then yields $A - B$
- $\overline{(A + \overline{B})} = -(A - B - 1 + 1) = B - A$
- Absolute difference readily available
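These identities can be checked directly in Python, whose `~` operator on integers behaves as a two's-complement inversion of unbounded width. The function name is illustrative.

```python
def check_late_complement(a, b):
    """Verify the late-complement identities for one pair of integers."""
    assert ~b == -(b + 1)            # inverting N gives -(N + 1)
    assert a + ~b == a - b - 1       # adding the inverted operand
    assert (a + ~b) + 1 == a - b     # late increment yields A - B
    assert ~(a + ~b) == b - a        # late complement yields B - A
    return True
```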
40. Absolute difference logic
- If c(w) = 0, result negative
- if c(w) = 0, invert all the bits
- else always perform late increment with $\overline{K}^{i-1}_{0}$
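A word-level sketch of the absolute difference selection, assuming unsigned w-bit operands and that c(w) is the carry out of A plus the inverted B. The late increment is modelled as a plain +1 rather than with prefix cells, and the function name is illustrative.

```python
def abs_diff(a, b, width):
    """Return |a - b| for unsigned width-bit operands."""
    mask = (1 << width) - 1
    total = a + (~b & mask)        # A + inverted B, width + 1 bits
    carry_out = total >> width     # c(w)
    t = total & mask
    if carry_out == 0:             # A - B - 1 is negative, i.e. A <= B
        return ~t & mask           # invert all the bits: B - A
    return (t + 1) & mask          # late increment: A - B
```

For example, `abs_diff(3, 5, 4)` and `abs_diff(5, 3, 4)` both return 2.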
41. Summary of late ops
- Available on all prefix adders
- Extra delay: 1 gate delay + buffering
- Extra hardware: ≈ w black cells
- This technique used in floating-point units
- late increment for rounding
- late complement for true subtraction
42. Media (packed) arithmetic
- Fundamental strategy
- Use full wordlength hardware for multiple sub-wordlength computations
- Examples
- 32-bit adder → 4 × 8-bit adders
- 32-bit multiplier → 2 × 16-bit multipliers
43. Partitioning an adder
- Criteria
- support carries propagating within sub-adders
- prevent carries propagating between sub-adders
- Solutions
- put AND gates on carry chains → slower adder
- put dummy 0s on operand bits → larger adder
- Use prefix adder!!
44. Packed prefix adder - 1
- Force $\overline{k(n)} = 0$ at partition points
- prevents carries propagating across bit n
- exploits don't care condition $(g, \overline{k}) = (1, 0)$
- Implementation
- change $\overline{k(n)}$ gate to (2,1) OR-AND gate
- delay-neutral modification
45. Packed prefix adder - 2
- Force $c(n) = G^{n-1}_{0} = 0$ at partition points
- prevents c(n) → s(n) errors
- Implementation
- insert AND gates (off critical path) or
- change $G^{n-1}_{0}$ gate to (2,1,1) complex gate
- BUT need $G^{n-1}_{0}$ signal for sub-adder overflows
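The two forcing steps can be combined in a small behavioural sketch. It assumes a partition point n is the lowest bit of an upper sub-adder: not-k(n) is forced to 0 so that no carry crosses bit n, and c(n) is forced to 0 for the sum bit s(n). Group signals are accumulated serially in place of a tree, and `packed_add` and `partitions` are illustrative names.

```python
def packed_add(a, b, width, partitions):
    """Add a and b as independent sub-words split at the given bit indices."""
    g_grp, nk_grp = 0, 1                   # group signals below bit 0
    result = 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        c = 0 if i in partitions else g_grp        # force c(n) = 0
        result |= ((ai ^ bi) ^ c) << i             # s(i) = p(i) XOR c(i)
        g = ai & bi
        nk = 0 if i in partitions else (ai | bi)   # force not-k(n) = 0
        # (g, nk) = (1, 0) can now occur: the don't-care case is put to use.
        g_grp, nk_grp = g | (nk & g_grp), nk & nk_grp
    return result
```

With `partitions={4}` an 8-bit adder behaves as two independent 4-bit adders; with an empty set it is an ordinary 8-bit adder.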
46. Packed prefix adder - 3
- Sub-adder carries complete early
- Extraneous cells automatically do nothing
47. Last Slide
- Recent developments in prefix adders
- new family of log-depth trees
- late operations
- packed arithmetic for media processing
- Future possibilities
- systematic exploitation of idempotency
- trees with reduced buffering
- combine packed arithmetic/late ops
48. ANY QUESTIONS OR COMMENTS?