ELEC692 VLSI Signal Processing Architecture Lecture 8

Provided by: EEE5
1
ELEC692 VLSI Signal Processing Architecture Lecture 8
  • Architecture for Fourier Transform

2
Usage of FFT
  • Frequency transformation
  • Applications
  • OFDM wireless systems
  • Speech/Multimedia data processing
  • Satellite wireless transmission
  • DTV, DAB broadcasting using OFDM
  • Real-time requirements call for special hardware
  • E.g. COFDM for DTV
  • Signal bandwidth: 7.5 MHz
  • Useful symbol duration: 1 ms
  • Number of parallel subcarriers: 7.5 MHz × 1 ms = 7500
  • Needs an 8K-point complex FFT
  • Compute an 8K-point complex FFT every 1 ms, i.e.
    about 8M complex points per second
  • Not efficient or practical to implement in
    software; special hardware is needed for the FFT
  • In fact there are quite a few off-the-shelf FFT
    processors available on the market, but it is
    better to integrate the hardware within your chip
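
The sizing argument above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and rounding the subcarrier count up to the next power of two is the usual radix-2 assumption rather than something the slide states explicitly.

```python
# Figures taken from the slide (COFDM for DTV).
bandwidth_hz = 7_500_000      # signal bandwidth: 7.5 MHz
symbols_per_second = 1_000    # useful symbol duration of 1 ms

# One subcarrier per 1/T of bandwidth.
subcarriers = bandwidth_hz // symbols_per_second

# Assumption: the FFT length is the next power of two at or above that count.
fft_size = 1 << (subcarriers - 1).bit_length()

# One FFT per symbol, so this many complex points must be processed per second.
points_per_second = fft_size * symbols_per_second

print(subcarriers, fft_size, points_per_second)
```

With these inputs the FFT length comes out as 8192 (the "8K" of the slide) and the required rate is roughly 8 million complex points per second.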

3
DFT review
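  • The N-point discrete Fourier transform X(k) of an
    N-point sequence x(n) (and the inverse DFT) is
    given by

    X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk},  k = 0, 1, ..., N-1
    x(n) = (1/N) \sum_{k=0}^{N-1} X(k) W_N^{-nk},  n = 0, 1, ..., N-1
    where W_N = e^{-j 2\pi / N}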

4
DFT
5
Direct Implementation of DFT
  • Product of a matrix (W) and a vector (x)
  • An 8-point DFT example
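
A minimal sketch of the matrix-vector view, assuming the usual convention W_N = exp(-j2π/N). The function names `twiddle_matrix` and `direct_dft` are illustrative and do not come from the slides.

```python
import cmath

def twiddle_matrix(n):
    """Build the n x n matrix W with entries W_N^(row*col)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (row * col) for col in range(n)] for row in range(n)]

def direct_dft(x):
    """Compute X = W x directly, using n*n complex multiplications."""
    n = len(x)
    W = twiddle_matrix(n)
    return [sum(W[k][m] * x[m] for m in range(n)) for k in range(n)]

# 8-point example: a constant input has all its energy in bin 0.
X = direct_dft([1] * 8)
```

Every output needs a full row-by-vector product, which is where the quadratic cost of the direct form comes from.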

6
1D array for DFT for N = 8
7
Complex multiplications
8
Fast DFT
  • Fast DFT (Discrete Fourier Transform) algorithm
  • Cooley-Tukey decomposition (1965)
  • Radix-2 Decimation-in-time (DIT) or
    Decimation-in-Frequency (DIF)
  • Divide the problem size into two interleaved
    halves with each recursive stage
  • Radix-2 decomposition first computes the DFT of the
    even-indexed samples x_0, x_2, ..., x_{n-2} and then that of
    the odd-indexed samples x_1, x_3, ..., x_{n-1}, and then
    combines these two results.
  • The sequence can be decomposed recursively to
    reduce the overall runtime to O(n log n)
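
The even/odd split can be sketched as a short recursive routine. This is a textbook radix-2 decimation-in-time form, assuming the input length is a power of two; `fft_dit` is an illustrative name.

```python
import cmath

def fft_dit(x):
    """Recursive radix-2 DIT FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_dit(x[0::2])   # DFT of the even-indexed samples
    odd = fft_dit(x[1::2])    # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Combine the two half-length results with one twiddle multiply.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Each level of recursion does O(n) work and there are log2(n) levels, which gives the O(n log n) total.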

9
Radix-2 DIF DFT
Since W_N^(N/2) = -1 corresponds to a rotation of 180°,
the factor in front of the second sum can be reduced
even further. We have
The division of k into even and odd values leads
to the following
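
The equations for this slide did not survive in the transcript. The standard radix-2 DIF derivation, reconstructed here under the convention W_N = e^{-j2π/N}, reads as follows (l = 0, 1, ..., N/2-1):

```latex
\begin{align}
X(k) &= \sum_{n=0}^{N/2-1} x(n)\,W_N^{nk}
      + W_N^{kN/2} \sum_{n=0}^{N/2-1} x\!\left(n+\tfrac{N}{2}\right) W_N^{nk} \\
     &= \sum_{n=0}^{N/2-1} \left[ x(n) + (-1)^k\, x\!\left(n+\tfrac{N}{2}\right) \right] W_N^{nk} \\
X(2l)   &= \sum_{n=0}^{N/2-1} \left[ x(n) + x\!\left(n+\tfrac{N}{2}\right) \right] W_{N/2}^{nl} \\
X(2l+1) &= \sum_{n=0}^{N/2-1} \left[ x(n) - x\!\left(n+\tfrac{N}{2}\right) \right] W_N^{n}\, W_{N/2}^{nl}
\end{align}
```

The even-indexed and odd-indexed outputs are therefore each an N/2-point DFT of a sum or difference sequence, with the twiddle factor W_N^n applied only on the difference branch.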
10
Radix-2 decomposition of 8-point FFT
[Signal flow graph: inputs x(0) to x(7) in natural order; outputs in bit-reversed order y(0), y(4), y(2), y(6), y(1), y(5), y(3), y(7); butterfly branches weighted by -1 and by the twiddle factors W0, W1, W2, W3]
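
The output ordering of a radix-2 DIF flow graph with natural-order input is the bit-reversed index order. A small sketch, with `bit_reversed_order` as an illustrative helper name:

```python
def bit_reversed_order(n):
    """Return the bit-reversed index sequence for a power-of-two length n."""
    bits = n.bit_length() - 1          # number of index bits, log2(n)
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]

order = bit_reversed_order(8)
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```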
11
Implementation of Radix 2 FFT
  • Two extreme methods
  • Reuse single Butterfly
  • Slower
  • Smaller area
  • More complicated control
  • Fully multi-stage straight implementation
  • Faster
  • Larger area
  • More regular control
  • Trade-off between the two ends based on
  • Speed, area, power

12
Comparison of calculation
Transform   MUL                      ADD
DFT         (N-2)^2                  N^2
FFT         (N/2) log2 N - (N-1)     N log2 N
hardware
13
Data transport
  • One problem for FFT is its less regular data
    transport.

If the butterfly PEs are configured such that PEs
with lower exponents of W come first in each
stage, a configuration results with identical
communication networks between stages, (perfect
shuffle)
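
A perfect shuffle interleaves the two halves of a sequence, like riffling a deck of cards. The sketch below is illustrative only (the name `perfect_shuffle` is not from the slides); it also shows that applying the same shuffle log2(N) times returns to the starting order, which is why an identical network can be reused between every pair of stages.

```python
def perfect_shuffle(seq):
    """Interleave the first and second halves of an even-length sequence."""
    half = len(seq) // 2
    out = []
    for a, b in zip(seq[:half], seq[half:]):
        out += [a, b]
    return out

once = perfect_shuffle(list(range(8)))
print(once)  # [0, 4, 1, 5, 2, 6, 3, 7]

# For N = 8, three shuffles (log2 N) restore the original order.
thrice = perfect_shuffle(perfect_shuffle(once))
```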
14
Conventional single butterfly FFT implementation
  • Strong speed limitation
  • Large storage area needed for intermediate
    results (N complex words)
  • If the memory is not partitioned, the number of R/W
    accesses needed to perform the FFT creates a
    bottleneck
  • An N-point FFT requires (N/r) log_r N
    radix-r butterfly computations and 2N log_r N R/W
    RAM accesses
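
To get a feel for these numbers, the two formulas can be evaluated directly. A hedged sketch: `single_butterfly_cost` is a hypothetical helper, and it assumes N is an exact power of the radix r.

```python
def single_butterfly_cost(n, r):
    """Butterfly and RAM-access counts for an N-point radix-r FFT
    computed on one shared butterfly unit."""
    stages, size = 0, 1
    while size < n:                      # exact integer log_r(n)
        size *= r
        stages += 1
    if size != n:
        raise ValueError("n must be a power of r")
    return {
        "stages": stages,
        "butterflies": (n // r) * stages,   # (N/r) log_r N
        "ram_accesses": 2 * n * stages,     # 2N log_r N (read + write)
    }

radix2 = single_butterfly_cost(1024, 2)
radix4 = single_butterfly_cost(1024, 4)
```

For N = 1024, moving from radix-2 to radix-4 halves the number of RAM accesses (20480 to 10240), which illustrates why higher radices are attractive when memory bandwidth is the bottleneck.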
15
Single-stage (1-D) implementation- horizontal
projection
  • Horizontal projection- provide PE for a single
    stage
  • Use only N/2 PE, i.e. one stage only
  • Reduces throughput by a factor of log2N compared
    with a 2-D array.
  • Need to take care of the complex communication
    structure

The PEs do not have fixed coefficients; the coefficients
must change after each cycle, and the global
communication network is a disadvantage
16
Single-stage (1-D) implementation - horizontal
projection
  • Pipelining of the PEs does not allow a direct
    increase in throughput for this architecture,
    since the results of the current processing step are
    required for the next processing step.
  • However, sequential data blocks of length N can be
    processed independently of one another, so
    several data blocks can be processed by
    interleaving
  • This requires an increase in the number of registers

17
Single-stage (1-D) implementation -horizontal
projection
  • If N is large, we cannot implement all of the PEs.
  • Project the N/2 butterfly PEs onto M PEs, where M is
    also a power of 2 and M < N/2
  • Special registers for input data, intermediate
    results and result data are required.
  • The registers cyclically read and write a particular
    sequence of 2M complex data values

18
Single-stage (1-D) implementation Vertical
projection
  • Vertical projection: one PE for each stage
    (log2 N PEs in total)
  • Circuitry is needed between PEs to prepare the correct
    data input
  • From stage to stage, the length of the sequence
    to which the FFT is applied is halved.
  • Given that the previous stage led to a DFT of length
    2n, in accordance with the perfect shuffle, the
    sequence of length 2n must be halved and the 1st
    and (n+1)th values must be fed to the following
    PE. Then the 2nd and (n+2)th values are fed to
    it.
  • Hence the sequence must be delayed by n clock
    cycles in accordance with the position of the
    midpoint
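
The pairing rule above (value i together with value i+n) can be sketched with a simple delay line. This is a behavioral model only, not the circuit from the slides; `delay_commutate` is a hypothetical name and the stream is assumed to arrive one value per clock cycle.

```python
from collections import deque

def delay_commutate(stream, n):
    """Pair each value of the first half of every 2n-block with the value
    arriving n cycles later, as the following butterfly PE expects."""
    delay_line = deque()
    pairs = []
    for t, value in enumerate(stream):
        if (t // n) % 2 == 0:
            delay_line.append(value)                    # first half: hold for n cycles
        else:
            pairs.append((delay_line.popleft(), value))  # second half: emit a pair
    return pairs

pairs = delay_commutate(list(range(8)), 4)
print(pairs)  # [(0, 4), (1, 5), (2, 6), (3, 7)]
```

The delay line never holds more than n values, which matches the statement that the sequence must be delayed by n clock cycles.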

19
Data formatting/sorting for Vertical projection
  • The block u_{n-1}, ..., u_0 must be delayed by n clock
    cycles.
  • When u_n is available, the values from the stream
    u must be fed to the new lower stream v. The
    values of u are input in parallel into the next
    butterfly stage for n clock cycles.
  • So the values of v are fed in parallel to the
    next butterfly PE for n clock cycles, and
    v_{n-1}, ..., v_0 are delayed by 2n cycles and
    v_{2n-1}, ..., v_n are delayed by n cycles.

20
Data formatting/sorting for Vertical projection
  • A special circuit is necessary for the data input
    of the 1st stage.
  • The incoming data stream of N values is divided into 2
    parts of N/2 values each. The clock rate is hence
    halved. We need a demultiplexer followed by a FIFO
    register

21
Overall architecture of linear FFT array based on
butterfly PEs and delay commutators
Consists of log2 N PEs, with delay commutators
located between the PEs. Because of the repeated
halving, control signals are derived using
frequency dividers
22
Higher radix FFT
  • Radix-4 DIF algorithm

We have
Thus
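
The equations for this slide are missing from the transcript. The standard radix-4 DIF decomposition, reconstructed here under the convention W_N = e^{-j2π/N}, splits the input into four quarters and then splits k as k = 4l + m with m = 0, 1, 2, 3:

```latex
\begin{align}
X(k) &= \sum_{n=0}^{N/4-1} \Bigl[ x(n) + (-j)^k\, x\!\left(n+\tfrac{N}{4}\right)
        + (-1)^k\, x\!\left(n+\tfrac{N}{2}\right)
        + j^k\, x\!\left(n+\tfrac{3N}{4}\right) \Bigr] W_N^{nk} \\
X(4l+m) &= \sum_{n=0}^{N/4-1} \Bigl[ x(n) + (-j)^m\, x\!\left(n+\tfrac{N}{4}\right)
        + (-1)^m\, x\!\left(n+\tfrac{N}{2}\right)
        + j^m\, x\!\left(n+\tfrac{3N}{4}\right) \Bigr] W_N^{nm}\, W_{N/4}^{nl}
\end{align}
```

The bracketed term is the radix-4 butterfly: it needs only additions and multiplications by ±1 and ±j, so the single non-trivial multiplication per output is the twiddle factor W_N^{nm}.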
23
Radix-4 DIF algorithm
  • Butterfly of Radix-4 Algorithm

24
Radix-4 Signal flow graph
25
Higher radix FFT
  • Radix-8 algorithm

26
Some pipeline FFT Processor Architecture
  • Assume input sequence to be in normal order and
    output is allowed to be in digit-reversed
    (radix-2 or radix-4) order.
  • Assume DIF type of decomposition
  • Here we assume additive butterfly has been
    separated from multiplier to show the hardware
    requirement distinctively

27
Radix-2 Multi-path Delay Commutator (R2MDC)
N = 16
The input sequence has been broken into 2 parallel
data streams flowing forward, with the correct
distance between data elements entering the
butterfly scheduled by proper delays
# of multipliers: log2N - 2; # of butterflies:
log2N; # of registers: (3/2)N - 2
28
Radix-2 Single-path Delay Feedback (R2SDF)
N = 16
The butterfly output is stored in feedback shift
registers. A single data stream goes through the
multiplier at every stage.
# of multipliers: log2N - 2; # of butterflies:
log2N; # of registers: N - 1
29
Radix-4 Single-path Delay Feedback (R4SDF)
N = 256
Use radix-4 and CORDIC iterations. Utilization of
multipliers is increased to 75% due to storage of 3
out of 4 radix-4 butterfly outputs. Utilization of
the radix-4 butterfly (which is more complicated
than the radix-2 butterfly, containing at least 8
complex adders) drops to 25%.
# of multipliers: log4N - 1; # of butterflies: log4N;
# of registers: N - 1
30
Radix-4 Multi-path Delay Commutator (R4MDC)
N = 256
Utilization rate: butterflies 25%, multipliers
25%. # of multipliers: 3log4N; # of butterflies:
log4N; # of registers: (5/2)N - 4
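
The resource formulas quoted on the four architecture slides can be tabulated for a given N. This is a sketch under an explicit assumption: the transcript has lost its "#" and "-" symbols, so the formulas below (for example log2N - 2 multipliers for the radix-2 pipelines) are a reading of the garbled text, not verified figures. The helper names are illustrative.

```python
def exact_log(n, base):
    """Return e with base**e == n; raise if n is not a power of base."""
    e, size = 0, 1
    while size < n:
        size *= base
        e += 1
    if size != n:
        raise ValueError(f"{n} is not a power of {base}")
    return e

def pipeline_resources(n):
    """Multiplier, butterfly and register counts as read from the slides."""
    l2, l4 = exact_log(n, 2), exact_log(n, 4)
    return {
        "R2MDC": {"multipliers": l2 - 2, "butterflies": l2, "registers": 3 * n // 2 - 2},
        "R2SDF": {"multipliers": l2 - 2, "butterflies": l2, "registers": n - 1},
        "R4SDF": {"multipliers": l4 - 1, "butterflies": l4, "registers": n - 1},
        "R4MDC": {"multipliers": 3 * l4, "butterflies": l4, "registers": 5 * n // 2 - 4},
    }

table = pipeline_resources(256)
for name, cost in table.items():
    print(name, cost)
```

For N = 256 the delay-feedback variants need 255 registers, against 382 (R2MDC) and 636 (R4MDC) for the delay-commutator variants.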
31
Some observation
  • Delay-feedbacks are more efficient than
    corresponding delay commutator in terms of memory
    utilization since the stored butterfly output can
    be directly used by the multipliers
  • Radix-4 algorithm based single-path architectures
    have higher multiplier utilization, but radix-2
    algorithms have simpler butterflies, which are
    better utilized.

32
Comparison
Radix / Speed: from Low to High
Control Theme: from Simple to Complex
Processing Ability / Unit: from Low to High
Combine the advantages: further decompose the
high-radix PE
33
Radix-2^2 DIF FFT
  • Optimal hardware
  • Same number of non-trivial multiplications at the
    same positions in the SFG as radix-4
    algorithms
  • The same butterfly structure as that of radix-2
    algorithms.
  • Radix-2^2 DIF FFT (S. He and M. Torkelson, "A New
    Approach to Pipeline FFT Processor," in
    Proceedings of IPPS, 1996, pp. 766-780)

34
Radix-2^2 DIF FFT
Apply a 3-dimensional linear index map
The common factor algorithm (CFA) has the form of
Summation over n1
35
Radix-2^2 DIF FFT
  • Proceed with the second step of the decomposition on
    the remaining DFT coefficients, including the
    twiddle factor, to exploit the exceptional
    values in the multiplication before the next
    butterfly is constructed.

After substitution and simplification, we have
BF I
BF I
BF II
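
The equations behind the last two slides are missing from the transcript. The block below reconstructs them following the derivation in the He and Torkelson paper cited earlier, assuming W_N = e^{-j2π/N}; the symbol H for the combined butterfly output is taken from that paper, not from the slide text.

```latex
\begin{align}
n &= \left\langle \tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + n_3 \right\rangle_N, \qquad
k = \left\langle k_1 + 2 k_2 + 4 k_3 \right\rangle_N \\
W_N^{\left(\frac{N}{4} n_2 + n_3\right)(k_1 + 2k_2 + 4k_3)}
  &= (-j)^{n_2 (k_1 + 2 k_2)}\; W_N^{n_3 (k_1 + 2 k_2)}\; W_{N/4}^{n_3 k_3} \\
X(k_1 + 2k_2 + 4k_3)
  &= \sum_{n_3=0}^{N/4-1} \left[ H(k_1, k_2, n_3)\, W_N^{n_3 (k_1 + 2 k_2)} \right] W_{N/4}^{n_3 k_3} \\
H(k_1, k_2, n_3)
  &= \underbrace{\left[ x(n_3) + (-1)^{k_1} x\!\left(n_3 + \tfrac{N}{2}\right) \right]}_{\text{BF I}}
   + (-j)^{(k_1 + 2 k_2)}
     \underbrace{\left[ x\!\left(n_3 + \tfrac{N}{4}\right) + (-1)^{k_1} x\!\left(n_3 + \tfrac{3N}{4}\right) \right]}_{\text{BF I}}
\end{align}
```

The outer combination of the two BF I outputs, including the trivial factor (-j)^{(k_1+2k_2)}, forms BF II. The only full multiplication left is by W_N^{n_3(k_1+2k_2)}, after every second butterfly stage.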
36
Butterfly with decomposed twiddle factors
Full multipliers are required to compute the
product of the decomposed twiddle factor. The
order of the twiddle factors is different from
that of radix-4 algorithm.
37
Complete Radix-2^2 DIF FFT
  • Apply the CFA recursively to the remaining DFTs
    of length N/4.
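
The recursive structure can be sketched in software and checked against a direct DFT. This is an illustrative model of the radix-2^2 decomposition as described in the He and Torkelson paper, not the lecture's hardware; `fft_radix22` and `dft` are hypothetical names, and the input length is assumed to be a power of two.

```python
import cmath

def dft(x):
    """Reference O(N^2) DFT used only for checking."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def fft_radix22(x):
    """Radix-2^2 DIF FFT: two butterfly stages, one twiddle multiply,
    then recursion on four DFTs of length N/4."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    q = n // 4
    out = [0j] * n
    for k1 in (0, 1):
        for k2 in (0, 1):
            sign = -1 if k1 else 1
            rot = (-1j) ** (k1 + 2 * k2)      # trivial factor: 1, -j, -1 or j
            g = []
            for n3 in range(q):
                a = x[n3] + sign * x[n3 + 2 * q]        # first-stage butterfly
                b = x[n3 + q] + sign * x[n3 + 3 * q]    # first-stage butterfly
                h = a + rot * b                         # second-stage butterfly
                g.append(h * cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / n))
            sub = fft_radix22(g)                        # remaining DFT of length N/4
            for k3 in range(q):
                out[k1 + 2 * k2 + 4 * k3] = sub[k3]
    return out

x = [complex(i, -0.5 * i) for i in range(16)]
err = max(abs(a - b) for a, b in zip(fft_radix22(x), dft(x)))
```

The only non-trivial multiplication sits after the second butterfly, which is the property that gives radix-2^2 the multiplier count of radix-4 with radix-2 butterflies.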

39
Radix-2^2 Single-path Delay Feedback (R2^2SDF)
Two types of butterflies: one is identical to that of
R2SDF, the other also contains the logic to implement
the trivial twiddle factor multiplication
  • A log2N-bit binary counter serves two purposes
  • Synchronization controller
  • Address generation counter for twiddle factor
    reading in each stage

40
Radix-2^2 Single-path Delay Feedback (R2^2SDF)
  • Structure for BF2I and BF2II

BF2II
BF2I
Operation scheduling
During the first N/2 cycles, the 2-to-1 mux in BF2I
switches to position 0 and the butterfly is idle.
Input data is directed to the shift registers until
they are filled. During the next N/2 cycles, the mux
switches to position 1 and the butterfly computes a
2-point DFT with the incoming data and the data
stored in the shift registers
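
The two-phase schedule can be modelled cycle by cycle. The sketch below is a behavioral approximation of one delay-feedback stage, under two assumptions that the transcript does not confirm: the sum leaves the stage immediately while the difference is written back to the shift register, and the stored differences are read out during the following idle phase. `sdf_stage` is a hypothetical name.

```python
from collections import deque

def sdf_stage(samples, delay):
    """Simulate one single-path delay-feedback butterfly stage.

    `delay` is the shift-register length (N/2 for the first stage)."""
    fifo = deque([0] * delay)
    out = []
    for t, x in enumerate(samples):
        stored = fifo.popleft()
        if (t // delay) % 2 == 0:
            # Mux position 0: butterfly idle, input fills the shift register
            # while earlier results are shifted out.
            out.append(stored)
            fifo.append(x)
        else:
            # Mux position 1: 2-point DFT of stored and incoming data.
            out.append(stored + x)     # sum leaves the stage
            fifo.append(stored - x)    # difference is fed back for later output
    return out

# One block of 4 samples followed by 2 flush cycles, shift register of length 2.
result = sdf_stage([1, 2, 3, 4, 0, 0], delay=2)
print(result)  # [0, 0, 4, 6, -2, -2]
```

After the initial fill latency, the stage emits the sums x(i) + x(i+2) followed by the differences x(i) - x(i+2), using a single input stream and only `delay` storage elements.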