Title: ECE 699 Digital Signal Processing Hardware Implementations, Lecture 11
1. ECE 699 Digital Signal Processing Hardware Implementations, Lecture 11
- Low-Power Design, Final Review
- 4/22/09
2. Outline
- Low-Power Design
- Final Review
3. Reading
- Low-Power Design
- Parhi, VLSI Digital Signal Processing Systems
- Chapter 17
4. Low-Power Design
5. Design Criteria
Source: Parhi/Owall
6. Low-Power Trends
Source: Parhi/Owall
7. Peak Power and Average Power
Source: Parhi/Owall
8. CMOS Power Consumption
Source: Parhi/Owall
9. CMOS Dynamic Power
Source: Parhi/Owall
10. System Integration
Source: Parhi/Owall
11. Switching Activity
Source: Parhi/Owall
12. Glitching
Source: Parhi/Owall
13. Clock Gating
Source: Parhi/Owall
14. Ripple Carry Glitching
Source: Parhi/Owall
15. Balancing Operations
Source: Parhi/Owall
16. Delay vs. Supply Voltage and Threshold Voltage
Source: Parhi/Owall
17. Dual-Vt Technology
Source: Parhi/Owall
18. High-Vt Standby
Source: Parhi/Owall
19. Final Exam Review
20. Lecture 6 Highlights
- Lecture 6 began with a study of the Discrete-Time Fourier Transform (DTFT) and continued to a sampled version of the DTFT, called the Discrete Fourier Transform (DFT)
- The FFT was introduced as a computationally efficient mechanism to implement the DFT
- Radix-2 FFT (DIF and DIT)
- Radix-4 FFT (DIF and DIT)
- Finally, various implementation issues were discussed, including FFT architectures (serial, parallel, pipeline, etc.) and bit-level issues
21. Fast Fourier Transform
- Can exploit shared twiddle-factor properties (i.e., sub-expression sharing) to reduce the number of multiplications in the DFT
- This class of algorithms is called Fast Fourier Transforms
- An FFT is simply an efficient implementation of the DFT; mathematically, FFT = DFT
- The FFT exploits two properties of the twiddle factors
- Symmetry property
- Periodicity property
- FFTs use a divide-and-conquer approach, breaking an N-point DFT into several smaller DFTs
- N can be factored as N = r1·r2·…·rv, where the ri are prime
- Particular focus on r1 = r2 = … = rv = r, where r is called the radix of the FFT algorithm
- In this case N = r^v and the FFT has a regular pattern
- We will study radix-2 (r = 2) and radix-4 (r = 4) FFTs in this class
22. Decimation-in-Time Radix-2 FFT
- Split x(n) into even and odd samples and perform smaller FFTs
- f1(n) = x(2n)
- f2(n) = x(2n+1)
- n = 0, 1, …, N/2−1
- Derivation performed in class
- Radix-2 decimation-in-time (DIT) algorithm
- In radix-2, the "butterfly" element takes in 2 inputs and produces 2 outputs
- Butterfly implements a 2-point FFT
- Computations
- (N/2)·log2 N complex multiplications
- N·log2 N complex additions
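The even/odd split above can be sketched in Python. This is an illustrative recursive model (not the lecture's hardware mapping), assuming N is a power of 2:

```python
import cmath

def fft_dit_radix2(x):
    """Recursive radix-2 decimation-in-time FFT sketch (len(x) must be a power of 2)."""
    N = len(x)
    if N == 1:
        return list(x)
    # Decimate in time: even and odd samples form two N/2-point DFTs
    F1 = fft_dit_radix2(x[0::2])  # f1(n) = x(2n)
    F2 = fft_dit_radix2(x[1::2])  # f2(n) = x(2n+1)
    X = [0] * N
    for k in range(N // 2):
        # Butterfly with twiddle factor W_N^k = exp(-j*2*pi*k/N):
        # 2 inputs (F1[k], F2[k]) produce 2 outputs (X[k], X[k+N/2])
        t = cmath.exp(-2j * cmath.pi * k / N) * F2[k]
        X[k] = F1[k] + t
        X[k + N // 2] = F1[k] - t
    return X
```

Each recursion level is one stage of butterflies, giving the log2 N stages and the (N/2)·log2 N multiplication count quoted above.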
23. Decimation-in-Time Radix-2 FFT (N = 8)
24. Decimation-in-Frequency Radix-2 FFT
- Decompose X(k) such that it is split into an FFT of points 0 to N/2−1 and points N/2 to N−1
- Then decimate X(k) into even- and odd-numbered samples
- Derivation performed in class
- Radix-2 decimation-in-frequency (DIF) algorithm
- In radix-2, the "butterfly" element takes in 2 inputs and produces 2 outputs
- Butterfly implements a 2-point FFT
- Computations
- (N/2)·log2 N complex multiplications
- N·log2 N complex additions
25. Radix-4 FFT
- In radix-2 you have log2 N stages
- Can also implement radix-4 and now have log4 N stages
- Radix-4 decimation-in-time splits x(n) into four time sequences instead of two
- Derivation performed in class
- Split x(n) into four decimated sample streams
- f1(n) = x(4n)
- f2(n) = x(4n+1)
- f3(n) = x(4n+2)
- f4(n) = x(4n+3)
- n = 0, 1, …, N/4−1
- Radix-4 decimation-in-time (DIT) algorithm
- In radix-4, the "butterfly" element takes in 4 inputs and produces 4 outputs
- Butterfly implements a 4-point FFT
- Computations
- (3N/4)·log4 N = (3N/8)·log2 N complex multiplications → decrease from radix-2 algorithms
- (3N/2)·log2 N complex additions → increase from radix-2 algorithms
- Downside: can only deal with FFT lengths that are powers of 4, such as N = 4, 16, 64, 256, 1024, etc.
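The operation counts above can be tabulated with a small helper (a hypothetical utility, not part of the lecture) to see the radix-2 vs. radix-4 trade-off concretely:

```python
import math

def fft_op_counts(N, radix):
    """Complex multiplication/addition counts as stated in the slides.
    Assumes N is a power of the given radix (2 or 4)."""
    if radix == 2:
        muls = (N / 2) * math.log2(N)
        adds = N * math.log2(N)
    else:  # radix 4
        muls = (3 * N / 4) * math.log(N, 4)   # = (3N/8) log2 N
        adds = (3 * N / 2) * math.log2(N)
    return muls, adds

# For N = 1024: radix-4 needs fewer multiplications (3840 vs. 5120)
# but more additions (15360 vs. 10240), matching the bullets above.
```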
26. Parallel Implementation
- Implement the entire FFT structure in a parallel fashion
- Advantages: control is easy (i.e., no controller), low latency (i.e., 0 cycles in this example), each twiddle factor can be customized as a multiplication by a constant
- Disadvantages: huge area, routing congestion
27. Serial/In-Place FFT Implementation
- Implement a single butterfly; use that butterfly and some memory to compute the entire FFT
- Advantages: small area
- Disadvantages: large latency, complex controller
28. Pipeline FFT
[Figure: pipeline FFT partitioned into Slices 1–4]
- The pipeline FFT is very common for communication systems (OFDM, DMT)
- Implements an entire "slice" of the FFT and reuses hardware to perform the other slices
- Advantages: particularly good for systems in which x(n) comes in serially (i.e., no block assembly required), very fast, more area-efficient than parallel, can be pipelined
- Disadvantages: controller can become complicated, large intermediate memories may be required between stages, latency of N cycles (more if pipelining is introduced)
29. Lecture 8 Highlights
- Lecture 8 covered CORDIC architectures, discussing
- Rotations vs. pseudorotations
- CORDIC in vectoring and rotation modes
- CORDIC hardware architecture
- Extensions of CORDIC
- We also discussed direct digital frequency synthesizers
- Showed basic structures
- Discussed improvements, particularly in ROM compression
- Discussed potential sources of error/spurs in DDFS circuits
30. 22.1 Rotations and Pseudorotations
Key ideas in CORDIC:
The COordinate Rotation DIgital Computer used this method in the 1950s; modern electronic calculators also use it.
If we have a computationally efficient way of rotating a vector, we can evaluate cos, sin, and tan⁻¹ functions. Rotation by an arbitrary angle is difficult, so we:
- Perform pseudorotations that require simpler operations
- Use special angles to synthesize the desired angle z: z = α(1) + α(2) + … + α(m)
Source: Parhami
31. 22.2 Basic CORDIC Iterations
CORDIC iteration: in step i, we pseudorotate by an angle whose tangent is di·2^−i (the angle e(i) is fixed; only the direction di is to be picked):
x(i+1) = x(i) − di·y(i)·2^−i
y(i+1) = y(i) + di·x(i)·2^−i
z(i+1) = z(i) − di·tan⁻¹(2^−i) = z(i) − di·e(i)

Table 22.1: Value of the function e(i) = tan⁻¹(2^−i), in degrees (approximate) and radians (precise), for 0 ≤ i ≤ 9

i | e(i) in degrees | e(i) in radians
0 | 45.0 | 0.785 398 163
1 | 26.6 | 0.463 647 609
2 | 14.0 | 0.244 978 663
3 | 7.1 | 0.124 354 994
4 | 3.6 | 0.062 418 810
5 | 1.8 | 0.031 239 833
6 | 0.9 | 0.015 623 728
7 | 0.4 | 0.007 812 341
8 | 0.2 | 0.003 906 230
9 | 0.1 | 0.001 953 123

Example, a 30° angle: 30.0 ≈ 45.0 − 26.6 + 14.0 − 7.1 + 3.6 + 1.8 − 0.9 + 0.4 − 0.2 + 0.1 = 30.1
Source: Parhami
32. Using CORDIC in Rotation Mode
x(i+1) = x(i) − di·y(i)·2^−i
y(i+1) = y(i) + di·x(i)·2^−i
z(i+1) = z(i) − di·e(i), where e(i) = tan⁻¹(2^−i)

x(m) = K(x cos z − y sin z)
y(m) = K(y cos z + x sin z)
z(m) = 0, where K = 1.646 760 258 121 …

Make z converge to 0 by choosing di = sign(z(i)).
Start with x = 1/K = 0.607 252 935 … and y = 0 to find cos z and sin z.
For k bits of precision in the results, k CORDIC iterations are needed, because tan⁻¹(2^−i) ≈ 2^−i for large i.
Convergence of z to 0 is possible because each of the angles in our list is more than half the previous one or, equivalently, each is less than the sum of all the angles that follow it.
The domain of convergence is −99.7° ≤ z ≤ 99.7°, where 99.7° is the sum of all the angles in our list; the domain contains [−π/2, π/2] radians.
Source: Parhami
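The rotation-mode iterations above can be sketched as a floating-point model; a hardware version would use shifts, adds, and an angle ROM, and the 32-iteration count here is an illustrative choice:

```python
import math

def cordic_rotate(z, iterations=32):
    """CORDIC rotation mode: drive z to 0 via di = sign(z(i)).
    Starting from x = 1/K, y = 0, returns (cos z, sin z)."""
    # Angle "ROM": e(i) = tan^-1(2^-i), and the gain K = prod sqrt(1 + 2^-2i)
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    K = 1.0
    for i in range(iterations):
        K *= math.sqrt(1 + 2.0 ** (-2 * i))
    x, y = 1.0 / K, 0.0          # pre-scale by 1/K so outputs are unscaled
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # di = sign(z(i))
        x, y, z = (x - d * y * 2.0 ** -i,    # x(i+1) = x(i) - di*y(i)*2^-i
                   y + d * x * 2.0 ** -i,    # y(i+1) = y(i) + di*x(i)*2^-i
                   z - d * angles[i])        # z(i+1) = z(i) - di*e(i)
    return x, y
```

The tuple assignment evaluates all three right-hand sides with the old x, y, z, mirroring the parallel update of the three CORDIC registers.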
33. Using CORDIC in Vectoring Mode
x(i+1) = x(i) − di·y(i)·2^−i
y(i+1) = y(i) + di·x(i)·2^−i
z(i+1) = z(i) − di·e(i), where e(i) = tan⁻¹(2^−i)

x(m) = K(x² + y²)^(1/2)
y(m) = 0
z(m) = z + tan⁻¹(y/x), where K = 1.646 760 258 121 …

Make y converge to 0 by choosing di = −sign(x(i)·y(i)).
Start with x = 1 and z = 0 to find tan⁻¹ y.
For k bits of precision in the results, k CORDIC iterations are needed, because tan⁻¹(2^−i) ≈ 2^−i for large i.
Even though the computation above always converges, one can use the relationship tan⁻¹(1/y) = π/2 − tan⁻¹ y to limit the range of fixed-point numbers encountered.
Other trig functions: tan z is obtained from sin z and cos z via division; inverse sine and cosine (sin⁻¹ z and cos⁻¹ z) are discussed later.
Source: Parhami
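A matching floating-point sketch of vectoring mode, with di = −sign(x(i)·y(i)) so that y is driven to 0 (again an illustrative model, not a hardware description):

```python
import math

def cordic_vector(x0, y0, iterations=32):
    """CORDIC vectoring mode starting from z = 0.
    Returns (x(m), y(m), z(m)) where z(m) converges to tan^-1(y0/x0),
    y(m) to 0, and x(m) to K*sqrt(x0^2 + y0^2)."""
    x, y, z = x0, y0, 0.0
    for i in range(iterations):
        d = -1.0 if x * y >= 0 else 1.0      # di = -sign(x(i)*y(i))
        x, y, z = (x - d * y * 2.0 ** -i,    # same datapath as rotation mode,
                   y + d * x * 2.0 ** -i,    # only the direction rule differs
                   z - d * math.atan(2.0 ** -i))
    return x, y, z
```

With x0 = 1 this computes tan⁻¹ y0 in z, exactly as the "start with x = 1 and z = 0" bullet above describes.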
34. 22.3 CORDIC Hardware
x(i+1) = x(i) − di·y(i)·2^−i
y(i+1) = y(i) + di·x(i)·2^−i
z(i+1) = z(i) − di·e(i), where e(i) = tan⁻¹(2^−i)

If very high speed is not needed (as in a calculator), a single adder and one shifter would suffice.
Fig. 22.3: Hardware elements needed for the CORDIC method.
Source: Parhami
35. Overview of Frequency Synthesizers
- A frequency synthesizer is a device which generates many output frequencies from a single input reference frequency, using direct, indirect, or digital synthesis techniques
- Three different types of frequency synthesizers:
- Indirect frequency synthesizer
- Produces an output frequency from a secondary oscillator, usually a voltage-controlled oscillator (VCO) phase-locked to a primary frequency
- Direct frequency synthesizer
- Produces multiple output frequencies from a single frequency standard using a series of mixing, multiplication, division, and filtering stages
- Direct digital frequency synthesizer (DDFS), also called a numerically controlled oscillator (NCO)
- A digital synthesizer, as suggested by its name, utilizes digital circuitry to generate output frequencies
- The direct digital frequency synthesizer was first described in a paper by J. Tierney, C. M. Rader, and B. Gold in 1971
Courtesy D. Wilson
36. Advantages and Disadvantages of a DDFS
- Advantages
- Fine frequency resolution (sub-Hertz)
- Lower power consumption
- Fast switching speed
- Wide tuning bandwidth
- Low phase noise
- Continuous-phase switching response
- Disadvantages
- A DDFS generates a sinc(x)-shaped output frequency spectrum containing the desired output frequency plus harmonics, which must be filtered out
- A DDFS produces spurious frequencies, or spurs, resulting from phase-word truncation and imperfections in the digital-to-analog converter (DAC)
- A DDFS requires a digital-to-analog converter (DAC)
- DACs are the greatest cause of spurs in high-speed, high-resolution (>10 bits, >50 MHz) DDFS applications
- DACs are susceptible to spurs created by clock feedthrough, intermodulation, and glitch energy
Courtesy D. Wilson
37. Basic DDFS Architecture
- The structure of a DDFS is fairly simple
- Major components are a phase accumulator, a phase-to-amplitude converter, a D/A converter, and a low-pass or inverse-sinc filter
Courtesy D. Wilson
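The phase-accumulator and phase-to-amplitude stages can be sketched as follows; the 32-bit accumulator and 10-bit LUT widths are illustrative assumptions, and the address shift models the phase truncation named above as a spur source:

```python
import math

def ddfs_samples(fcw, n_samples, acc_bits=32, lut_bits=10):
    """DDFS sketch: an acc_bits-wide phase accumulator wraps modulo 2^acc_bits;
    its top lut_bits address a sine lookup table (phase-to-amplitude converter).
    Output frequency = (fcw / 2^acc_bits) * f_clk."""
    lut_size = 1 << lut_bits
    # ROM-style phase-to-amplitude converter: one full sine period
    sine_lut = [math.sin(2 * math.pi * k / lut_size) for k in range(lut_size)]
    acc, mask = 0, (1 << acc_bits) - 1
    out = []
    for _ in range(n_samples):
        acc = (acc + fcw) & mask                            # wrap = phase mod 2*pi
        out.append(sine_lut[acc >> (acc_bits - lut_bits)])  # phase truncation
    return out
```

For example, a frequency control word of 2^29 with a 32-bit accumulator steps the phase by 1/8 of a cycle per clock, so the output repeats every 8 samples.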
38. Lecture 9 Highlights
- Lecture 9 covered retiming in detail
- We first discussed timing basics and applications of retiming
- We then discussed cutset retiming and pipelining
- Finally, we discussed an algorithm used for retiming to reduce the clock period of a recursive system (i.e., to have the clock period meet the iteration bound)
39. Retiming Introduction
- Retiming moves around registers which already exist in the system
- Retiming does not alter the latency of the system
- Retiming does not change the input/output characteristics
- Retiming DOES change the critical path of the system and/or the number of registers in the system
- Uses the primary rules
[Figure: retiming rules, showing delays (D) moved between the input and output edges of a node]
40. Retiming Uses
- Retiming is used:
- 1) to decrease the minimum clock period of a circuit (i.e., faster)
- 2) to reduce the number of registers of a circuit (i.e., smaller)
- 3) for logic synthesis (not covered in class)
- 4) for low-power CMOS circuits
41. Cutset Retiming
- Two special cases of retiming exist:
- Cutset retiming
- Pipelining: pipelining can be considered as adding a number of registers at the front of the DFG and then retiming these new registers
- Cutset retiming
- Cutset: a set of edges that can be removed from the graph to create 2 disconnected subgraphs
- Cutset retiming only affects the weights of the edges in the cutset
- If the 2 disconnected subgraphs are G1 and G2, then cutset retiming consists of adding k delays to each edge from G1 to G2 and removing k delays from each edge from G2 to G1
- Cutset retiming is a special case of retiming where each node in the subgraph G1 has the retiming value j and each node in the subgraph G2 has the retiming value j+k (j is arbitrary)
- Remember: a retiming solution is feasible only if wr(e) ≥ 0 for all edges
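The add-k/remove-k rule above can be sketched directly, assuming the DFG is stored as an edge-to-delay-count dictionary (a representation chosen here purely for illustration):

```python
def cutset_retime(edges, g1, k):
    """Cutset retiming sketch. edges maps (u, v) -> delay count w(e);
    g1 is the node set of subgraph G1. Adds k delays to each G1->G2 edge
    and removes k from each G2->G1 edge; feasible only if all weights stay >= 0."""
    g1 = set(g1)
    new_edges = {}
    for (u, v), w in edges.items():
        if u in g1 and v not in g1:
            w += k      # edge crossing G1 -> G2 gains k delays
        elif u not in g1 and v in g1:
            w -= k      # edge crossing G2 -> G1 loses k delays
        new_edges[(u, v)] = w   # edges inside G1 or G2 are untouched
    assert all(w >= 0 for w in new_edges.values()), "infeasible: wr(e) < 0"
    return new_edges
```

On a hypothetical 3-node DFG A→B (0 delays), B→C (0), C→A (2), cutting between {A} and {B, C} with k = 1 moves one delay from C→A onto A→B, while B→C (inside G2) is unchanged.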
42. Algorithm for Retiming for Clock Period Minimization
- Algorithm for retiming for clock period minimization
- First construct W(U,V) and D(U,V):
- 1) Let M = tmax·n, where tmax is the maximum computation time of the nodes in G and n is the number of nodes in G
- 2) Form a new graph G′ which is the same as G except the edge weights are replaced by w′(e) = M·w(e) − t(U) for all edges e from U to V
- 3) Solve the all-pairs shortest-path problem on G′ (using Floyd-Warshall, for example). Let S′UV be the shortest path from U to V
- 4) If U ≠ V, then W(U,V) = ceil(S′UV/M) and D(U,V) = M·W(U,V) − S′UV + t(V). If U = V, then W(U,V) = 0 and D(U,V) = t(U). Ceil() is the ceiling function
- Use W(U,V) and D(U,V) to determine if there is a retiming solution that can achieve a desired clock period c
- Usually this desired clock period is set equal to the iteration bound of the circuit
43. Algorithm for Retiming for Clock Period Minimization (cont'd)
- Given a desired clock period c, there is a feasible retiming solution r such that Φ(Gr) ≤ c if the following constraints hold:
- CONSTRAINT 1 (feasibility): r(U) − r(V) ≤ w(e) for every edge e from U to V in G
- This enforces the number of delays on each edge in the retimed graph to be nonnegative
- CONSTRAINT 2 (critical path): r(U) − r(V) ≤ W(U,V) − 1 for all vertices U, V in G such that D(U,V) > c
- This enforces Φ(Gr) ≤ c
- Thus, to find a solution:
- 1) Pick a value of c (usually equal to the iteration bound)
- 2) Create a series of inequalities based on the feasibility constraint
- 3) Create a series of inequalities based on the critical path constraint
- 4) Combine these (using the most restrictive where overlap exists) and create a constraint graph
- 5) Determine feasibility using a shortest-path algorithm (e.g., Floyd-Warshall) and find the retiming values
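Steps 4) and 5) can be sketched by building the constraint graph and running Floyd-Warshall from a virtual source; the node names and dictionary representation here are illustrative assumptions, not the lecture's notation:

```python
def solve_constraints(nodes, constraints):
    """Solve a system of difference constraints r(u) - r(v) <= c
    by shortest paths on a constraint graph (edge v -> u with weight c),
    as in step 5). Returns retiming values r, or None if a negative
    cycle makes the system infeasible."""
    INF = float("inf")
    verts = list(nodes) + ["__src__"]
    d = {(a, b): (0 if a == b else INF) for a in verts for b in verts}
    for (u, v, c) in constraints:
        d[(v, u)] = min(d[(v, u)], c)   # r(u) - r(v) <= c  =>  edge v -> u
    for n in nodes:
        d[("__src__", n)] = 0           # virtual source reaches every node
    # Floyd-Warshall all-pairs shortest paths
    for k in verts:
        for i in verts:
            for j in verts:
                if d[(i, k)] + d[(k, j)] < d[(i, j)]:
                    d[(i, j)] = d[(i, k)] + d[(k, j)]
    if any(d[(v, v)] < 0 for v in verts):
        return None                     # negative cycle: no feasible retiming
    return {n: d[("__src__", n)] for n in nodes}
```

The shortest distances from the virtual source satisfy every inequality r(u) ≤ r(v) + c by construction, which is exactly the feasibility test the slide describes.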
44. Lecture 10 Highlights
- Lecture 10 began with a discussion of unfolding and its use to reduce the critical path of the circuit, as well as for parallel processing
- Folding was also introduced for area minimization, and an algorithm was presented to achieve a folded structure
45. Unfolding Algorithm
Source: Parhi
46. Applications of Unfolding
Source: Parhi
47. Folding
Source: Parhi
48. Folding Transformation
Source: Parhi