ELEC692 VLSI Signal Processing Architecture Lecture 8 - PowerPoint PPT Presentation

Slides: 41

Provided by: EEE5

Transcript and Presenter's Notes


1
ELEC692 VLSI Signal Processing Architecture, Lecture 8

Architecture for Fourier Transform

2
Usage of FFT

Frequency transformation
Applications
OFDM wireless systems
Speech/Multimedia data processing
Satellite wireless transmission
DTV, DAB broadcasting using OFDM
Real-time requirements call for dedicated hardware
E.g. COFDM for DTV
Signal bandwidth: 7.5 MHz
Useful symbol duration: 1 ms
Number of parallel subcarriers: 7.5 MHz x 1 ms = 7500
Needs an 8K-point complex FFT
Compute one 8K complex FFT every 1 ms, i.e. about 8M complex points per second
Not efficient or practical to implement in software; dedicated FFT hardware is needed
Quite a few off-the-shelf FFT processors are available on the market, but it is better to integrate the FFT hardware within your own chip
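
The sizing argument above can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures quoted on the slide; the variable names are illustrative, and rounding up to the next power of two is an assumption about how the 8K FFT length was chosen.

```python
# Figures from the slide: 7.5 MHz signal bandwidth, 1 ms useful symbol duration.
BANDWIDTH_HZ = 7_500_000
SYMBOL_DURATION_US = 1_000

# Subcarrier spacing is 1 / symbol duration, so the subcarrier count is
# bandwidth * symbol duration.
subcarriers = BANDWIDTH_HZ * SYMBOL_DURATION_US // 1_000_000

# Assumption: round up to the next power of two to get a radix-2 friendly length.
fft_size = 1 << (subcarriers - 1).bit_length()

# One FFT per symbol: complex points that must be transformed per second.
points_per_second = fft_size * 1_000_000 // SYMBOL_DURATION_US

print(subcarriers, fft_size, points_per_second)
```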

3
DFT review

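The N-point discrete Fourier transform X(k) of an N-point sequence x(n) (and the inverse DFT) is given by
$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \dots, N-1, \qquad W_N = e^{-j 2\pi / N}$$
$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}, \quad n = 0, 1, \dots, N-1$$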

4
DFT
5
Direct Implementation of DFT

Product of a matrix (W) and a vector (x)

An 8-point FFT example
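
The matrix-vector view of the DFT can be written down directly. The following is a small sketch in pure Python, assuming the usual convention W[k][m] = exp(-2j*pi*k*m/N); the function names are illustrative and not taken from the lecture.

```python
import cmath

def dft_matrix(n):
    """Return the n x n matrix W with W[k][m] = exp(-2j*pi*k*m/n)."""
    return [[cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n)]
            for k in range(n)]

def dft_direct(x):
    """Direct DFT as a matrix-vector product: N*N complex multiplications."""
    w = dft_matrix(len(x))
    return [sum(w_km * x_m for w_km, x_m in zip(row, x)) for row in w]

# 8-point example: the DFT of a unit impulse is flat.
X = dft_direct([1, 0, 0, 0, 0, 0, 0, 0])
print([round(abs(v), 6) for v in X])
```

Every output needs a full row of N products, which is the quadratic cost that the fast algorithms on the following slides reduce.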

6
1D array for DFT for N = 8
7
Complex multiplications
8
Fast DFT

Fast DFT (Discrete Fourier Transform) algorithm
Cooley-Tukey decomposition (1965)
Radix-2 Decimation-in-time (DIT) or
Decimation-in-Frequency (DIF)
Divide the problem size into two interleaved
halves with each recursive stage
Radix-2 DIT decomposition first computes the DFTs of the even-indexed samples x(0), x(2), ..., x(n-2) and of the odd-indexed samples x(1), x(3), ..., x(n-1), and then combines these two results.
The sequence can be decomposed recursively to reduce the overall runtime to O(n log n)
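
The even/odd split described above can be sketched as a short recursive function. This is an illustrative radix-2 decimation-in-time implementation, assuming the input length is a power of 2; it is not the lecture's own code.

```python
import cmath

def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft_dit(x[0::2])   # DFT of x(0), x(2), ..., x(n-2)
    odd = fft_dit(x[1::2])    # DFT of x(1), x(3), ..., x(n-1)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle W_n^k
        out[k] = even[k] + t            # combine step: one butterfly
        out[k + n // 2] = even[k] - t
    return out

spectrum = fft_dit([1, 2, 3, 4, 5, 6, 7, 8])
```

Each level of recursion does n/2 butterflies and there are log2(n) levels, which gives the O(n log n) total.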

9
Radix-2 DIF DFT
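Since W_N^{N/2} = -1 corresponds to a rotation of 180°, the factor of the second sum can be reduced even further. We have
$$X(k) = \sum_{n=0}^{N/2-1} \left[ x(n) + (-1)^{k}\, x(n + N/2) \right] W_N^{nk}$$
The division of k into even and odd values leads to the following
$$X(2k) = \sum_{n=0}^{N/2-1} \left[ x(n) + x(n + N/2) \right] W_{N/2}^{nk}$$
$$X(2k+1) = \sum_{n=0}^{N/2-1} \left[ x(n) - x(n + N/2) \right] W_N^{n}\, W_{N/2}^{nk}$$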
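
A minimal sketch of the radix-2 DIF scheme follows, written as an in-place iterative loop. It assumes a power-of-2 length and leaves the result in bit-reversed order, as DIF flow graphs with natural-order input usually do; the helper names are illustrative.

```python
import cmath

def fft_dif_inplace(x):
    """Radix-2 DIF FFT; returns the spectrum in bit-reversed order."""
    a = [complex(v) for v in x]
    n = len(a)
    span = n // 2
    while span >= 1:
        m = 2 * span                                   # current sub-DFT length
        for start in range(0, n, m):
            for j in range(span):
                w = cmath.exp(-2j * cmath.pi * j / m)  # twiddle W_m^j
                top, bot = a[start + j], a[start + j + span]
                a[start + j] = top + bot               # feeds the even outputs
                a[start + j + span] = (top - bot) * w  # feeds the odd outputs
        span //= 2
    return a

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

y = fft_dif_inplace(range(8))
order = [bit_reverse(i, 3) for i in range(8)]   # y[i] holds X(order[i])
```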
10
Radix-2 decomposition of 8-point FFT
[Signal flow graph: inputs x(0) to x(7) in natural order; outputs in bit-reversed order y(0), y(4), y(2), y(6), y(1), y(5), y(3), y(7); three butterfly stages with -1 branch weights and twiddle factors W0 to W3]
11
Implementation of Radix 2 FFT

Two extreme methods
Reuse single Butterfly
Slower
Smaller area
More complicated control
Fully multi-stage straight implementation
Faster
Larger area
More regular control
Trade-off between the two ends based on
Speed, area, power

12
Comparison of calculation
            MUL                       ADD
DFT         (N-2)^2                   N^2
FFT         (N/2) log2N - (N-1)       N log2N
hardware
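
As a sanity check on the FFT MUL entry, the sketch below counts the radix-2 twiddle multiplications whose factor is not W^0 stage by stage and compares the total with (N/2) log2N - (N-1). The counting rule (only W^0 is treated as trivial) is an assumption about how the table entry was derived.

```python
def count_nontrivial_twiddles(n):
    """Count radix-2 FFT twiddle multiplications whose factor is not W^0."""
    count = 0
    m = n                       # sub-DFT length, halves at every stage
    while m >= 2:
        groups = n // m
        # Each group has m/2 butterflies with twiddles W_m^0 .. W_m^(m/2-1);
        # only the W_m^0 one is skipped here.
        count += groups * (m // 2 - 1)
        m //= 2
    return count

def table_formula(n):
    """(N/2) log2 N - (N - 1) for n a power of 2."""
    log2n = n.bit_length() - 1
    return (n // 2) * log2n - (n - 1)

for n in (8, 64, 1024):
    print(n, count_nontrivial_twiddles(n), table_formula(n))
```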
13
Data transport

One problem with the FFT is its less regular data transport.

If the butterfly PEs are arranged so that PEs with lower exponents of W come first in each stage, the result is a configuration with identical communication networks between stages (perfect shuffle)
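
The perfect shuffle itself is a simple permutation: it interleaves the first and second halves of a sequence. A minimal sketch, with an illustrative function name:

```python
def perfect_shuffle(seq):
    """Interleave the first and second halves of seq (even length)."""
    half = len(seq) // 2
    out = []
    for a, b in zip(seq[:half], seq[half:]):
        out.extend((a, b))
    return out

once = perfect_shuffle(list(range(8)))
print(once)
```

Applying the shuffle log2N times returns the original order, which is why the same interconnect can be reused between every pair of stages.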
14
Conventional single butterfly FFT implementation
Strong speed limitation
Large storage area needed for intermediate results (N complex words)
If the memory is not partitioned, the number of R/W accesses needed to perform the FFT creates a bottleneck
An N-point FFT requires (N/r) log_r N radix-r butterfly computations and 2N log_r N R/W RAM accesses
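
The two counts quoted above are easy to tabulate. This sketch assumes N is an exact power of the radix r; the function names are illustrative.

```python
def int_log(n, r):
    """Return log_r(n) for n an exact power of r."""
    stages = 0
    while n > 1:
        if n % r:
            raise ValueError("n must be a power of r")
        n //= r
        stages += 1
    return stages

def single_butterfly_cost(n, r):
    """Butterfly computations and RAM accesses for one shared radix-r butterfly."""
    stages = int_log(n, r)
    butterflies = (n // r) * stages      # (N/r) log_r N
    ram_accesses = 2 * n * stages        # N reads plus N writes per stage
    return butterflies, ram_accesses

print(single_butterfly_cost(1024, 2))
print(single_butterfly_cost(1024, 4))
```

For N = 1024, going from radix 2 to radix 4 halves the number of memory accesses, which is one motivation for the higher-radix architectures later in the lecture.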
15
Single-stage (1-D) implementation - horizontal projection

Horizontal projection: provide PEs for a single stage
Use only N/2 PEs, i.e. one stage only
Reduces throughput by a factor of log2N compared with a 2-D array.
Need to take care of the complex communication structure

The PEs do not have fixed coefficients; the coefficients must change after each cycle, and the global communication network is a disadvantage
16
Single-stage (1-D) implementation - horizontal projection

Pipelining within the PEs does not allow a direct increase in throughput for this architecture, since the results of the current processing step are required for the next processing step.
However, sequential data blocks of length N can be processed independently of one another, so several data blocks can be processed by interleaving
This requires an increase in the number of registers

17
Single-stage (1-D) implementation - horizontal projection

If N is large, we cannot implement all of the PEs.
Project the N/2 butterfly PEs onto M PEs, where M is also a power of 2 and M < N/2
Special registers for input data, intermediate results and result data are required.
The registers cyclically read and write a particular sequence of 2M complex data

18
Single-stage (1-D) implementation - vertical projection

Vertical projection: 1 PE for each stage (log2N PEs in total)
Circuitry is needed between the PEs to prepare the correct data input
From stage to stage, the length of the sequence to which the FFT is applied is halved.
Given that the previous stage led to a DFT of length 2n, in accordance with the perfect shuffle, the sequence of length 2n must be split in half, and the 1st and (n+1)th values must be fed to the following PE. Then the 2nd and (n+2)th values are fed to it.
Hence the first half of the sequence must be delayed by n clock cycles, in accordance with the position of the midpoint
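
The pairing rule above (value i is fed to the PE together with value i + n) can be mimicked with an n-deep FIFO acting as the delay line. This is a behavioural sketch only, not a model of the actual commutator circuit; the function name is hypothetical.

```python
from collections import deque

def pair_with_delay(stream, n):
    """Pair sample t with sample t+n in each block of 2n using an n-deep FIFO."""
    fifo = deque()
    pairs = []
    for t, value in enumerate(stream):
        if (t // n) % 2 == 0:
            fifo.append(value)                     # first half: delay by n cycles
        else:
            pairs.append((fifo.popleft(), value))  # second half: feed the PE
    return pairs

print(pair_with_delay(range(8), 4))
print(pair_with_delay(range(8), 2))
```

With n = 4 the PE receives (0, 4), (1, 5), (2, 6), (3, 7); halving n for the next stage gives (0, 2), (1, 3), (4, 6), (5, 7).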

19
Data formatting/sorting for Vertical projection

The block u(n-1), ..., u(0) must be delayed by n clock cycles.
When u(n) is available, the values from the stream u must be fed to the new lower stream v. The values of u are input in parallel into the next butterfly stage for n clock cycles.
So the values of v are fed in parallel to the next butterfly PE for n clock cycles; v(n-1), ..., v(0) are delayed by 2n cycles and v(2n-1), ..., v(n) are delayed by n cycles.

20
Data formatting/sorting for Vertical projection

A special circuit is necessary for the data input of the 1st stage.
The incoming data stream of N data is divided into 2 parts of N/2 data each. The clock rate is hence halved. We need a demultiplexer followed by a FIFO register

21
Overall architecture of the linear FFT array based on butterfly PEs and delay commutators
Consists of log2N PEs (one per stage), with delay commutators located between the PEs. Due to the continued halving, the control signals are derived using frequency dividers
22
Higher radix FFT

Radix-4 DIF algorithm

We have
Thus
23
Radix-4 DIF algorithm

Butterfly of Radix-4 Algorithm
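
A radix-4 butterfly is a 4-point DFT in which every factor is 1, -1, j or -j, so it needs no real multiplier. The sketch below is one possible arrangement using 8 complex additions; the function name and argument order are illustrative.

```python
def radix4_butterfly(a, b, c, d):
    """4-point DFT of (a, b, c, d) using additions and a multiply by -j only."""
    s0, s1 = a + c, a - c
    s2, s3 = b + d, (b - d) * -1j     # multiply by -j: swap real/imag parts
    return s0 + s2, s1 + s3, s0 - s2, s1 - s3

y = radix4_butterfly(1, 2, 3, 4)
print(y)
```

In a full radix-4 DIF stage, three of the four outputs would then be multiplied by non-trivial twiddle factors before the next stage.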

24
Radix-4 Signal flow graph
25
Higher radix FFT

Radix-8 algorithm

26
Some pipeline FFT Processor Architecture

Assume input sequence to be in normal order and
output is allowed to be in digit-reversed
(radix-2 or radix-4) order.
Assume DIF type of decomposition
Here we assume the additive butterfly has been separated from the multiplier, to show the hardware requirements of each distinctly

27
Radix-2 Multi-path Delay Commutator (R2MDC)
N = 16
The input sequence is broken into 2 parallel data streams flowing forward, with the correct distance between data elements entering the butterfly scheduled by proper delays
# of multipliers: log2N - 2; # of butterflies: log2N; # of registers: (3/2)N - 2
28
Radix-2 Single-path Delay Feedback (R2SDF)
N = 16
The butterfly outputs are stored in feedback shift registers. A single data stream goes through the multiplier at every stage.
# of multipliers: log2N - 2; # of butterflies: log2N; # of registers: N - 1
29
Radix-4 Single-path Delay Feedback (R4SDF)
N = 256
Uses radix-4 and CORDIC iterations. Utilization of the multipliers is increased to 75% by storing 3 of the radix-4 butterfly outputs. Utilization of the radix-4 butterfly (which is more complicated than the radix-2 butterfly, containing at least 8 complex adders) drops to 25%.
# of multipliers: log4N - 1; # of butterflies: log4N; # of registers: N - 1
30
Radix-4 Multi-path Delay Commutator (R4MDC)
N = 256
Utilization rate: butterflies 25%, multipliers 25%
# of multipliers: 3 log4N; # of butterflies: log4N; # of registers: (5/2)N - 4
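
The resource figures for the four architectures can be compared side by side. The sketch below tabulates them for a given N, reading the slide formulas as "log2N - 2", "log4N - 1" and so on (the minus signs are an assumption, since they appear to have been lost in the transcript); N is assumed to be a power of 4.

```python
def exact_log(n, base):
    """Return log_base(n), assuming n is an exact power of base."""
    stages = 0
    while n > 1:
        n //= base
        stages += 1
    return stages

def pipeline_resources(n):
    """(multipliers, butterflies, registers) per architecture, per the slides."""
    l2, l4 = exact_log(n, 2), exact_log(n, 4)
    return {
        "R2MDC": (l2 - 2, l2, 3 * n // 2 - 2),
        "R2SDF": (l2 - 2, l2, n - 1),
        "R4SDF": (l4 - 1, l4, n - 1),
        "R4MDC": (3 * l4, l4, 5 * n // 2 - 4),
    }

for name, (mults, bfs, regs) in pipeline_resources(256).items():
    print(f"{name}: {mults} multipliers, {bfs} butterflies, {regs} registers")
```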
31
Some observation

Delay-feedback architectures are more efficient than the corresponding delay-commutator architectures in terms of memory utilization, since the stored butterfly outputs can be used directly by the multipliers
Radix-4 based single-path architectures have higher multiplier utilization, but radix-2 algorithms have simpler butterflies, which are better utilized.

32
Comparison
Radix / Speed:              Low <----------------------------------> High
Control scheme:             Simple <-------------------------------> Complex
Processing ability / unit:  Low <----------------------------------> High
Combine the advantages -> further decompose the high-radix PE
33
Radix-2^2 DIF FFT

Optimal hardware
Same number of non-trivial multiplications at the
same positions in the SFG as of radix-4
algorithms
The same butterfly structure as that of radix-2
algorithms.
Radix-2^2 DIF FFT (S. He and M. Torkelson, "A New Approach to Pipeline FFT Processor," in Proceedings of IPPS, 1996, pp. 766-780).

34
Radix-2^2 DIF FFT
Apply a 3-dimensional linear index map
The Common factor algorithm has the form of
Summation Over n1
35
Radix-2^2 DIF FFT

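Proceed with the second step of decomposition on the remaining DFT coefficients, including the twiddle factor, to exploit the exceptional (trivial) values in multiplication before the next butterfly is constructed.

After substitution and simplification, with n = (N/2) n_1 + (N/4) n_2 + n_3 and k = k_1 + 2 k_2 + 4 k_3, we have
$$X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1} \left[ H(k_1, k_2, n_3)\, W_N^{n_3 (k_1 + 2k_2)} \right] W_{N/4}^{n_3 k_3}$$
$$H(k_1, k_2, n_3) = \underbrace{\left[ x(n_3) + (-1)^{k_1} x(n_3 + N/2) \right]}_{\text{BF I}} + (-j)^{(k_1 + 2k_2)} \underbrace{\left[ x(n_3 + N/4) + (-1)^{k_1} x(n_3 + 3N/4) \right]}_{\text{BF I}}$$
where the complete expression for H(k_1, k_2, n_3) is formed by BF II.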
36
Butterfly with decomposed twiddle factors
Full multipliers are required to compute the
product of the decomposed twiddle factor. The
order of the twiddle factors is different from
that of radix-4 algorithm.
37
Complete Radix-2^2 DIF FFT

Apply the CFA recursively to the remaining DFTs
of length N/4.
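
The recursive structure can be sketched in a few lines. This is an illustrative software model, assuming the index map n = (N/2)n1 + (N/4)n2 + n3, k = k1 + 2k2 + 4k3 from the cited He and Torkelson paper and an input length that is a power of 4; it is not a description of the hardware data flow.

```python
import cmath

def fft_radix22(x):
    """Recursive radix-2^2 DIF FFT sketch; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    q = n // 4
    out = [0j] * n
    for k1 in (0, 1):
        for k2 in (0, 1):
            sign = -1 if k1 else 1
            rot = (-1j) ** (k1 + 2 * k2)     # trivial factor: 1, -j, -1 or j
            h = []
            for n3 in range(q):
                bf1_a = x[n3] + sign * x[n3 + 2 * q]        # BF I
                bf1_b = x[n3 + q] + sign * x[n3 + 3 * q]    # BF I
                bf2 = bf1_a + rot * bf1_b                   # BF II
                # Full twiddle W_N^(n3 (k1 + 2 k2)) applied after BF II.
                tw = cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / n)
                h.append(bf2 * tw)
            sub = fft_radix22(h)             # remaining DFT of length N/4
            for k3 in range(q):
                out[k1 + 2 * k2 + 4 * k3] = sub[k3]
    return out

spectrum = fft_radix22(list(range(16)))
```

Only the `tw` product needs a full complex multiplier; the `sign` and `rot` factors amount to sign changes and real/imaginary swaps, which is where the radix-2 butterfly simplicity comes from.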

38
(No Transcript)
39
Radix-2^2 Single-path Delay Feedback (R2^2SDF)
2 types of butterflies: one is identical to that of R2SDF; the other also contains the logic to implement the trivial twiddle factor multiplication

A log2N-bit binary counter serves two purposes
Synchronization controller
Address generation counter for twiddle factor reading in each stage

40
Radix-2^2 Single-path Delay Feedback (R2^2SDF)

Structure of BF2I and BF2II

[Figure: BF2I and BF2II butterfly structures]
Operation scheduling
First N/2 cycles: the 2-to-1 muxes in BF2I switch to position 0 and the butterfly is idle. Input data is directed to the shift registers until they are filled.
Next N/2 cycles: the muxes switch to position 1 and the butterfly computes a 2-point DFT of the incoming data and the data stored in the shift registers
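
The two-phase schedule can be imitated in software with a queue standing in for the feedback shift register. This is a behavioural sketch of a single stage without twiddle factors; routing the sum to the output and the difference back into the register is an assumption based on the usual single-path delay feedback scheme, and the function name is hypothetical.

```python
from collections import deque

def sdf_stage(block):
    """Simulate one SDF butterfly stage on a block of N samples (no twiddles)."""
    half = len(block) // 2
    shift_reg = deque()
    outputs = []
    for t, sample in enumerate(block):
        if t < half:
            # Mux at 0: butterfly idle, input goes into the shift register.
            shift_reg.append(sample)
        else:
            # Mux at 1: 2-point DFT of the stored and the incoming sample.
            stored = shift_reg.popleft()
            outputs.append(stored + sample)     # sum leaves the stage
            shift_reg.append(stored - sample)   # difference is fed back
    return outputs, list(shift_reg)

sums, feedback = sdf_stage(list(range(8)))
print(sums, feedback)
```

The differences left in the register would be streamed out during the first half of the next block, while new input is shifted in behind them.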
