Title: ECE 699 Digital Signal Processing Hardware Implementations, Lecture 11
1. ECE 699 Digital Signal Processing Hardware Implementations, Lecture 11
- Low-Power Design, Final Review
- 4/22/09
2. Outline
- Low-Power Design
- Final Review
3. Reading
- Low-Power Design
- Parhi, VLSI Digital Signal Processing Systems
- Chapter 17
4. Low-Power Design
5. Design Criteria
Source: Parhi/Owall
6. Low-Power Trends
Source: Parhi/Owall
7. Peak Power and Average Power
Source: Parhi/Owall
8. CMOS Power Consumption
Source: Parhi/Owall
9. CMOS Dynamic Power
Source: Parhi/Owall
10. System Integration
Source: Parhi/Owall
11. Switching Activity
Source: Parhi/Owall
12. Glitching
Source: Parhi/Owall
13. Clock Gating
Source: Parhi/Owall
14. Ripple Carry Glitching
Source: Parhi/Owall
15. Balancing Operations
Source: Parhi/Owall
16. Delay vs. Supply Voltage and Threshold Voltage
Source: Parhi/Owall
17. Dual-Vt Technology
Source: Parhi/Owall
18. High-Vt Standby
Source: Parhi/Owall
19. Final Exam Review
20. Lecture 6 Highlights
- Lecture 6 began with a study of the Discrete-Time Fourier Transform (DTFT) and continued to a sampled version of the DTFT, called the Discrete Fourier Transform (DFT)
- The FFT was introduced as a computationally efficient mechanism to implement the DFT
- Radix-2 FFT (DIF and DIT)
- Radix-4 FFT (DIF and DIT)
- Finally, various implementation issues were discussed, including FFT architectures (serial, parallel, pipeline, etc.) and bit-level issues
21. Fast Fourier Transform
- Can exploit shared twiddle-factor properties (i.e., sub-expression sharing) to reduce the number of multiplications in the DFT
- This class of algorithms is called Fast Fourier Transforms
- An FFT is simply an efficient implementation of the DFT; mathematically, FFT = DFT
- The FFT exploits two properties of the twiddle factors
- Symmetry property
- Periodicity property
- FFTs use a divide-and-conquer approach, breaking an N-point DFT into several smaller DFTs
- N can be factored as N = r1·r2·…·rv, where the ri are prime
- Particular focus on r1 = r2 = … = rv = r, where r is called the radix of the FFT algorithm
- In this case N = r^v and the FFT has a regular pattern
- We will study radix-2 (r = 2) and radix-4 (r = 4) FFTs in this class
22. Decimation-in-Time Radix-2 FFT
- Split x(n) into even and odd samples and perform smaller FFTs
- f1(n) = x(2n)
- f2(n) = x(2n+1)
- n = 0, 1, …, N/2−1
- Derivation performed in class
- Radix-2 decimation-in-time (DIT) algorithm
- In radix-2, the "butterfly" element takes in 2 inputs and produces 2 outputs
- Butterfly implements a 2-point FFT
- Computations
- (N/2)·log2 N complex multiplications
- N·log2 N complex additions
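The even/odd split above can be sketched in Python. This is an illustrative recursive model (not the lecture's hardware mapping), assuming N is a power of 2:

```python
import cmath

def fft_dit_radix2(x):
    """Recursive radix-2 decimation-in-time FFT sketch (len(x) must be a power of 2)."""
    N = len(x)
    if N == 1:
        return list(x)
    # Decimate in time: even and odd samples form two N/2-point DFTs
    F1 = fft_dit_radix2(x[0::2])  # f1(n) = x(2n)
    F2 = fft_dit_radix2(x[1::2])  # f2(n) = x(2n+1)
    X = [0] * N
    for k in range(N // 2):
        # Butterfly with twiddle factor W_N^k = exp(-j*2*pi*k/N):
        # 2 inputs (F1[k], F2[k]) produce 2 outputs (X[k], X[k+N/2])
        t = cmath.exp(-2j * cmath.pi * k / N) * F2[k]
        X[k] = F1[k] + t
        X[k + N // 2] = F1[k] - t
    return X
```

Each recursion level is one stage of butterflies, giving the log2 N stages and the (N/2)·log2 N multiplication count quoted above.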
23. Decimation-in-Time Radix-2 FFT (N = 8)
24. Decimation-in-Frequency Radix-2 FFT
- Decompose X(k) such that it is split into an FFT of points 0 to N/2−1 and points N/2 to N−1
- Then decimate X(k) into even- and odd-numbered samples
- Derivation performed in class
- Radix-2 decimation-in-frequency (DIF) algorithm
- In radix-2, the "butterfly" element takes in 2 inputs and produces 2 outputs
- Butterfly implements a 2-point FFT
- Computations
- (N/2)·log2 N complex multiplications
- N·log2 N complex additions
25. Radix-4 FFT
- In radix-2 you have log2 N stages
- Can also implement radix-4 and now have log4 N stages
- Radix-4 decimation-in-time splits x(n) into four time sequences instead of two
- Derivation performed in class
- Split x(n) into four decimated sample streams
- f1(n) = x(4n)
- f2(n) = x(4n+1)
- f3(n) = x(4n+2)
- f4(n) = x(4n+3)
- n = 0, 1, …, N/4−1
- Radix-4 decimation-in-time (DIT) algorithm
- In radix-4, the "butterfly" element takes in 4 inputs and produces 4 outputs
- Butterfly implements a 4-point FFT
- Computations
- (3N/4)·log4 N = (3N/8)·log2 N complex multiplications → decrease from radix-2 algorithms
- (3N/2)·log2 N complex additions → increase from radix-2 algorithms
- Downside: can only deal with FFT lengths that are powers of 4, such as N = 4, 16, 64, 256, 1024, etc.
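The operation counts above can be tabulated with a small helper (a hypothetical utility, not part of the lecture) to see the radix-2 vs. radix-4 trade-off concretely:

```python
import math

def fft_op_counts(N, radix):
    """Complex multiplication/addition counts as stated in the slides.
    Assumes N is a power of the given radix (2 or 4)."""
    if radix == 2:
        muls = (N / 2) * math.log2(N)
        adds = N * math.log2(N)
    else:  # radix 4
        muls = (3 * N / 4) * math.log(N, 4)   # = (3N/8) log2 N
        adds = (3 * N / 2) * math.log2(N)
    return muls, adds

# For N = 1024: radix-4 needs fewer multiplications (3840 vs. 5120)
# but more additions (15360 vs. 10240), matching the bullets above.
```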
26. Parallel Implementation
- Implement the entire FFT structure in a parallel fashion
- Advantages: control is easy (i.e., no controller), low latency (i.e., 0 cycles in this example), each twiddle factor can be customized as a multiplication by a constant
- Disadvantages: huge area, routing congestion
27. Serial/In-Place FFT Implementation
- Implement a single butterfly; use that butterfly and some memory to compute the entire FFT
- Advantages: small area
- Disadvantages: large latency, complex controller
28. Pipeline FFT
[Figure: pipeline FFT partitioned into Slices 1–4]
- The pipeline FFT is very common for communication systems (OFDM, DMT)
- Implements an entire "slice" of the FFT and reuses hardware to perform the other slices
- Advantages: particularly good for systems in which x(n) comes in serially (i.e., no block assembly required), very fast, more area-efficient than parallel, can be pipelined
- Disadvantages: controller can become complicated, large intermediate memories may be required between stages, latency of N cycles (more if pipelining is introduced)
29. Lecture 8 Highlights
- Lecture 8 covered CORDIC architectures, discussing
- Rotations vs. pseudorotations
- CORDIC in vectoring and rotation modes
- CORDIC hardware architecture
- Extensions of CORDIC
- We also discussed direct digital frequency synthesizers
- Showed basic structures
- Discussed improvements, particularly in ROM compression
- Discussed potential sources of error/spurs in DDFS circuits
30. 22.1 Rotations and Pseudorotations
Key ideas in CORDIC:
The COordinate Rotation DIgital Computer used this method in the 1950s; modern electronic calculators also use it.
If we have a computationally efficient way of rotating a vector, we can evaluate cos, sin, and tan⁻¹ functions. Rotation by an arbitrary angle is difficult, so we:
- Perform pseudorotations that require simpler operations
- Use special angles to synthesize the desired angle z: z = α(1) + α(2) + … + α(m)
Source: Parhami
31. 22.2 Basic CORDIC Iterations
CORDIC iteration: in step i, we pseudorotate by an angle whose tangent is di·2^−i (the angle e(i) is fixed; only the direction di is to be picked):
x(i+1) = x(i) − di·y(i)·2^−i
y(i+1) = y(i) + di·x(i)·2^−i
z(i+1) = z(i) − di·tan⁻¹(2^−i) = z(i) − di·e(i)

Table 22.1: Value of the function e(i) = tan⁻¹(2^−i), in degrees (approximate) and radians (precise), for 0 ≤ i ≤ 9

i | e(i) in degrees | e(i) in radians
0 | 45.0 | 0.785 398 163
1 | 26.6 | 0.463 647 609
2 | 14.0 | 0.244 978 663
3 | 7.1 | 0.124 354 994
4 | 3.6 | 0.062 418 810
5 | 1.8 | 0.031 239 833
6 | 0.9 | 0.015 623 728
7 | 0.4 | 0.007 812 341
8 | 0.2 | 0.003 906 230
9 | 0.1 | 0.001 953 123

Example, a 30° angle: 30.0 ≈ 45.0 − 26.6 + 14.0 − 7.1 + 3.6 + 1.8 − 0.9 + 0.4 − 0.2 + 0.1 = 30.1
Source: Parhami
32. Using CORDIC in Rotation Mode
x(i+1) = x(i) − di·y(i)·2^−i
y(i+1) = y(i) + di·x(i)·2^−i
z(i+1) = z(i) − di·e(i), where e(i) = tan⁻¹(2^−i)

x(m) = K(x cos z − y sin z)
y(m) = K(y cos z + x sin z)
z(m) = 0, where K = 1.646 760 258 121 …

Make z converge to 0 by choosing di = sign(z(i)).
Start with x = 1/K = 0.607 252 935 … and y = 0 to find cos z and sin z.
For k bits of precision in the results, k CORDIC iterations are needed, because tan⁻¹(2^−i) ≈ 2^−i for large i.
Convergence of z to 0 is possible because each of the angles in our list is more than half the previous one or, equivalently, each is less than the sum of all the angles that follow it.
The domain of convergence is −99.7° ≤ z ≤ 99.7°, where 99.7° is the sum of all the angles in our list; the domain contains [−π/2, π/2] radians.
Source: Parhami
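The rotation-mode iterations above can be sketched as a floating-point model; a hardware version would use shifts, adds, and an angle ROM, and the 32-iteration count here is an illustrative choice:

```python
import math

def cordic_rotate(z, iterations=32):
    """CORDIC rotation mode: drive z to 0 via di = sign(z(i)).
    Starting from x = 1/K, y = 0, returns (cos z, sin z)."""
    # Angle "ROM": e(i) = tan^-1(2^-i), and the gain K = prod sqrt(1 + 2^-2i)
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    K = 1.0
    for i in range(iterations):
        K *= math.sqrt(1 + 2.0 ** (-2 * i))
    x, y = 1.0 / K, 0.0          # pre-scale by 1/K so outputs are unscaled
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # di = sign(z(i))
        x, y, z = (x - d * y * 2.0 ** -i,    # x(i+1) = x(i) - di*y(i)*2^-i
                   y + d * x * 2.0 ** -i,    # y(i+1) = y(i) + di*x(i)*2^-i
                   z - d * angles[i])        # z(i+1) = z(i) - di*e(i)
    return x, y
```

The tuple assignment evaluates all three right-hand sides with the old x, y, z, mirroring the parallel update of the three CORDIC registers.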
33. Using CORDIC in Vectoring Mode
x(i+1) = x(i) − di·y(i)·2^−i
y(i+1) = y(i) + di·x(i)·2^−i
z(i+1) = z(i) − di·e(i), where e(i) = tan⁻¹(2^−i)

x(m) = K(x² + y²)^(1/2)
y(m) = 0
z(m) = z + tan⁻¹(y/x), where K = 1.646 760 258 121 …

Make y converge to 0 by choosing di = −sign(x(i)·y(i)).
Start with x = 1 and z = 0 to find tan⁻¹ y.
For k bits of precision in the results, k CORDIC iterations are needed, because tan⁻¹(2^−i) ≈ 2^−i for large i.
Even though the computation above always converges, one can use the relationship tan⁻¹(1/y) = π/2 − tan⁻¹ y to limit the range of fixed-point numbers encountered.
Other trig functions: tan z is obtained from sin z and cos z via division; inverse sine and cosine (sin⁻¹ z and cos⁻¹ z) are discussed later.
Source: Parhami
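A matching floating-point sketch of vectoring mode, with di = −sign(x(i)·y(i)) so that y is driven to 0 (again an illustrative model, not a hardware description):

```python
import math

def cordic_vector(x0, y0, iterations=32):
    """CORDIC vectoring mode starting from z = 0.
    Returns (x(m), y(m), z(m)) where z(m) converges to tan^-1(y0/x0),
    y(m) to 0, and x(m) to K*sqrt(x0^2 + y0^2)."""
    x, y, z = x0, y0, 0.0
    for i in range(iterations):
        d = -1.0 if x * y >= 0 else 1.0      # di = -sign(x(i)*y(i))
        x, y, z = (x - d * y * 2.0 ** -i,    # same datapath as rotation mode,
                   y + d * x * 2.0 ** -i,    # only the direction rule differs
                   z - d * math.atan(2.0 ** -i))
    return x, y, z
```

With x0 = 1 this computes tan⁻¹ y0 in z, exactly as the "start with x = 1 and z = 0" bullet above describes.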
34. 22.3 CORDIC Hardware
x(i+1) = x(i) − di·y(i)·2^−i
y(i+1) = y(i) + di·x(i)·2^−i
z(i+1) = z(i) − di·e(i), where e(i) = tan⁻¹(2^−i)

If very high speed is not needed (as in a calculator), a single adder and one shifter would suffice.
Fig. 22.3: Hardware elements needed for the CORDIC method.
Source: Parhami
35. Overview of Frequency Synthesizers
- A frequency synthesizer is a device which generates many output frequencies from a single input reference frequency, using direct, indirect, or digital synthesis techniques
- Three different types of frequency synthesizers:
- Indirect frequency synthesizer
- Produces an output frequency from a secondary oscillator, usually a voltage-controlled oscillator (VCO) phase-locked to a primary frequency
- Direct frequency synthesizer
- Produces multiple output frequencies from a single frequency standard using a series of mixing, multiplication, division, and filtering stages
- Direct digital frequency synthesizer (DDFS), also called a numerically controlled oscillator (NCO)
- A digital synthesizer, as suggested by its name, utilizes digital circuitry to generate output frequencies
- The direct digital frequency synthesizer was first described in a paper by J. Tierney, C. M. Rader, and B. Gold in 1971
Courtesy D. Wilson
36. Advantages and Disadvantages of a DDFS
- Advantages
- Fine frequency resolution (sub-Hertz)
- Lower power consumption
- Fast switching speed
- Wide tuning bandwidth
- Low phase noise
- Continuous-phase switching response
- Disadvantages
- A DDFS generates a sinc(x)-shaped output frequency spectrum containing the desired output frequency plus harmonics, which must be filtered out
- A DDFS produces spurious frequencies, or spurs, resulting from phase-word truncation and imperfections in the digital-to-analog converter (DAC)
- A DDFS requires a digital-to-analog converter (DAC)
- DACs are the greatest cause of spurs in high-speed, high-resolution (>10 bits, >50 MHz) DDFS applications
- DACs are susceptible to spurs created by clock feedthrough, intermodulation, and glitch energy
Courtesy D. Wilson
37. Basic DDFS Architecture
- The structure of a DDFS is fairly simple
- Major components are a phase accumulator, a phase-to-amplitude converter, a D/A converter, and a low-pass or inverse-sinc filter
Courtesy D. Wilson
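The phase-accumulator and phase-to-amplitude stages can be sketched as follows; the 32-bit accumulator and 10-bit LUT widths are illustrative assumptions, and the address shift models the phase truncation named above as a spur source:

```python
import math

def ddfs_samples(fcw, n_samples, acc_bits=32, lut_bits=10):
    """DDFS sketch: an acc_bits-wide phase accumulator wraps modulo 2^acc_bits;
    its top lut_bits address a sine lookup table (phase-to-amplitude converter).
    Output frequency = (fcw / 2^acc_bits) * f_clk."""
    lut_size = 1 << lut_bits
    # ROM-style phase-to-amplitude converter: one full sine period
    sine_lut = [math.sin(2 * math.pi * k / lut_size) for k in range(lut_size)]
    acc, mask = 0, (1 << acc_bits) - 1
    out = []
    for _ in range(n_samples):
        acc = (acc + fcw) & mask                            # wrap = phase mod 2*pi
        out.append(sine_lut[acc >> (acc_bits - lut_bits)])  # phase truncation
    return out
```

For example, a frequency control word of 2^29 with a 32-bit accumulator steps the phase by 1/8 of a cycle per clock, so the output repeats every 8 samples.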
38. Lecture 9 Highlights
- Lecture 9 covered retiming in detail
- We first discussed timing basics and applications of retiming
- We then discussed cutset retiming and pipelining
- Finally, we discussed an algorithm used for retiming to reduce the clock period of a recursive system (i.e., to have the clock period meet the iteration bound)
39. Retiming Introduction
- Retiming moves around registers which already exist in the system
- Retiming does not alter the latency of the system
- Retiming does not change the input/output characteristics
- Retiming DOES change the critical path of the system and/or the number of registers in the system
- Uses the primary rules
[Figure: retiming rules, showing delays (D) moved between the input and output edges of a node]
40. Retiming Uses
- Retiming is used:
- 1) to decrease the minimum clock period of a circuit (i.e., faster)
- 2) to reduce the number of registers of a circuit (i.e., smaller)
- 3) for logic synthesis (not covered in class)
- 4) for low-power CMOS circuits
41. Cutset Retiming
- Two special cases of retiming exist:
- Cutset retiming
- Pipelining: pipelining can be considered as adding a number of registers at the front of the DFG and then retiming these new registers
- Cutset retiming
- Cutset: a set of edges that can be removed from the graph to create 2 disconnected subgraphs
- Cutset retiming only affects the weights of the edges in the cutset
- If the 2 disconnected subgraphs are G1 and G2, then cutset retiming consists of adding k delays to each edge from G1 to G2 and removing k delays from each edge from G2 to G1
- Cutset retiming is a special case of retiming where each node in the subgraph G1 has the retiming value j and each node in the subgraph G2 has the retiming value j+k (j is arbitrary)
- Remember: a retiming solution is feasible only if wr(e) ≥ 0 for all edges
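The add-k/remove-k rule above can be sketched directly, assuming the DFG is stored as an edge-to-delay-count dictionary (a representation chosen here purely for illustration):

```python
def cutset_retime(edges, g1, k):
    """Cutset retiming sketch. edges maps (u, v) -> delay count w(e);
    g1 is the node set of subgraph G1. Adds k delays to each G1->G2 edge
    and removes k from each G2->G1 edge; feasible only if all weights stay >= 0."""
    g1 = set(g1)
    new_edges = {}
    for (u, v), w in edges.items():
        if u in g1 and v not in g1:
            w += k      # edge crossing G1 -> G2 gains k delays
        elif u not in g1 and v in g1:
            w -= k      # edge crossing G2 -> G1 loses k delays
        new_edges[(u, v)] = w   # edges inside G1 or G2 are untouched
    assert all(w >= 0 for w in new_edges.values()), "infeasible: wr(e) < 0"
    return new_edges
```

On a hypothetical 3-node DFG A→B (0 delays), B→C (0), C→A (2), cutting between {A} and {B, C} with k = 1 moves one delay from C→A onto A→B, while B→C (inside G2) is unchanged.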
42. Algorithm for Retiming for Clock Period Minimization
- Algorithm for retiming for clock period minimization
- First construct W(U,V) and D(U,V):
- 1) Let M = tmax·n, where tmax is the maximum computation time of the nodes in G and n is the number of nodes in G
- 2) Form a new graph G′ which is the same as G except the edge weights are replaced by w′(e) = M·w(e) − t(U) for all edges e from U to V
- 3) Solve the all-pairs shortest-path problem on G′ (using Floyd-Warshall, for example). Let S′UV be the shortest path from U to V
- 4) If U ≠ V, then W(U,V) = ceil(S′UV/M) and D(U,V) = M·W(U,V) − S′UV + t(V). If U = V, then W(U,V) = 0 and D(U,V) = t(U). Ceil() is the ceiling function
- Use W(U,V) and D(U,V) to determine if there is a retiming solution that can achieve a desired clock period c
- Usually this desired clock period is set equal to the iteration bound of the circuit
43. Algorithm for Retiming for Clock Period Minimization (cont'd)
- Given a desired clock period c, there is a feasible retiming solution r such that Φ(Gr) ≤ c if the following constraints hold:
- CONSTRAINT 1 (feasibility): r(U) − r(V) ≤ w(e) for every edge e from U to V in G
- This enforces the number of delays on each edge in the retimed graph to be nonnegative
- CONSTRAINT 2 (critical path): r(U) − r(V) ≤ W(U,V) − 1 for all vertices U, V in G such that D(U,V) > c
- This enforces Φ(Gr) ≤ c
- Thus, to find a solution:
- 1) Pick a value of c (usually equal to the iteration bound)
- 2) Create a series of inequalities based on the feasibility constraint
- 3) Create a series of inequalities based on the critical path constraint
- 4) Combine these (using the most restrictive where overlap exists) and create a constraint graph
- 5) Determine feasibility using a shortest-path algorithm (e.g., Floyd-Warshall) and find the retiming values
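Steps 4) and 5) can be sketched by building the constraint graph and running Floyd-Warshall from a virtual source; the node names and dictionary representation here are illustrative assumptions, not the lecture's notation:

```python
def solve_constraints(nodes, constraints):
    """Solve a system of difference constraints r(u) - r(v) <= c
    by shortest paths on a constraint graph (edge v -> u with weight c),
    as in step 5). Returns retiming values r, or None if a negative
    cycle makes the system infeasible."""
    INF = float("inf")
    verts = list(nodes) + ["__src__"]
    d = {(a, b): (0 if a == b else INF) for a in verts for b in verts}
    for (u, v, c) in constraints:
        d[(v, u)] = min(d[(v, u)], c)   # r(u) - r(v) <= c  =>  edge v -> u
    for n in nodes:
        d[("__src__", n)] = 0           # virtual source reaches every node
    # Floyd-Warshall all-pairs shortest paths
    for k in verts:
        for i in verts:
            for j in verts:
                if d[(i, k)] + d[(k, j)] < d[(i, j)]:
                    d[(i, j)] = d[(i, k)] + d[(k, j)]
    if any(d[(v, v)] < 0 for v in verts):
        return None                     # negative cycle: no feasible retiming
    return {n: d[("__src__", n)] for n in nodes}
```

The shortest distances from the virtual source satisfy every inequality r(u) ≤ r(v) + c by construction, which is exactly the feasibility test the slide describes.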
44. Lecture 10 Highlights
- Lecture 10 began with a discussion of unfolding and its use to reduce the critical path of the circuit, as well as for parallel processing
- Folding was also introduced for area minimization, and an algorithm was presented to achieve a folded structure
45. Unfolding Algorithm
Source: Parhi
46. Applications of Unfolding
Source: Parhi
47. Folding
Source: Parhi
48. Folding Transformation
Source: Parhi