Title: Is F Better than D
Is F Better than D
- David Hansen and James Michelussi
Introduction
- Discrete Fourier Transform (DFT)
- Fast Fourier Transform (FFT)
- FFT Algorithm Applying the Mathematics
- Implementations of DFT and FFT
- Hardware Benchmarks
- Conclusion
DFT
- Introduced by Jean Baptiste Joseph Fourier in 1807.
- Allows a sampled (discrete) periodic signal to be transformed from the time domain to the frequency domain.
- Computed as a correlation between the time-domain signal and N cosine and N sine waves.
- Symbols: $X(k)$ = DFT frequency signal, $N$ = number of sample points, $x(n)$ = time-domain signal, $W_N$ = twiddle factor.
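For reference, the symbols on this slide fit the textbook definition of the DFT. The form below is that standard definition, written with lowercase $x(n)$ for the time-domain signal and assuming the usual negative-exponent convention for the twiddle factor:

```latex
X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn},
\qquad k = 0, 1, \dots, N-1,
\qquad W_N = e^{-j 2\pi / N}
```

Expanding $W_N^{kn} = \cos(2\pi kn/N) - j\,\sin(2\pi kn/N)$ gives the correlation with cosine and sine waves described in the bullets.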
DFT (Walking Speed)
- Why is this important? Where is this used?
- Allows machines to calculate the frequency-domain representation of a signal.
- Allows the convolution of signals to be carried out by simply multiplying their transforms together.
- Used in digital spectral analysis for speech, imaging, and pattern recognition, as well as signal manipulation using filters.
- But the DFT requires $N^2$ multiplications!
FFT (Jet Speed)
- J. W. Cooley and J. W. Tukey are given credit for bringing the FFT to the world in the 1960s.
- Simply an algorithm for calculating the DFT more efficiently.
- Takes advantage of symmetry and periodicity in the twiddle factors, and uses a divide-and-conquer method.
  - Symmetry: $W_N^{r+N/2} = -W_N^{r}$
  - Periodicity: $W_N^{r+N} = W_N^{r}$
- Requires only $(N/2)\log_2 N$ multiplications!
- Faster computation times
- More precise results due to less round-off error
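To put the two multiplication counts side by side, here is a minimal sketch in C. The function names `dft_mults`, `fft_mults`, and `ilog2` are illustrative and do not come from the original slides, and the FFT count assumes N is a power of two, as Radix-2 requires.

```c
#include <assert.h>

/* Integer log2 of a power-of-two n (hypothetical helper). */
static unsigned ilog2(unsigned long n)
{
    unsigned bits = 0;
    while (n > 1) {
        n >>= 1;
        ++bits;
    }
    return bits;
}

/* Multiplications for a direct DFT: N^2. */
unsigned long dft_mults(unsigned long n)
{
    return n * n;
}

/* Multiplications for a radix-2 FFT: (N/2) * log2(N). */
unsigned long fft_mults(unsigned long n)
{
    assert(n != 0 && (n & (n - 1)) == 0);  /* n must be a power of two */
    return (n / 2) * ilog2(n);
}
```

For N = 1024, `dft_mults` returns 1,048,576 and `fft_mults` returns 5,120, roughly 205 times fewer multiplications.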
FFT Algorithm
- Several different types of FFT algorithms exist (Radix-2, Radix-4, DIT and DIF).
- Focus on Radix-2 using the Decimation in Time (DIT) method.
- Breaks the DFT calculation down into a number of 2-point DFTs.
- Each 2-point DFT uses an operation called the butterfly.
- These groups are then recombined with another group of two, and so on, for $\log_2 N$ stages.
- Using the DIT method, the input time-domain points must be reordered using bit reversal.
Butterfly Operation
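A minimal sketch of one butterfly in C, assuming the standard Radix-2 DIT form in which two complex points a and b and one twiddle factor w produce a + w·b and a − w·b. The `cpx` type and the `butterfly` name are illustrative and are not taken from the original slides.

```c
/* Hypothetical complex type used only for this sketch. */
typedef struct {
    float re;
    float im;
} cpx;

/* Radix-2 DIT butterfly, in place: (a, b) -> (a + w*b, a - w*b). */
void butterfly(cpx *a, cpx *b, cpx w)
{
    /* One complex multiplication: t = w * b */
    cpx t;
    t.re = w.re * b->re - w.im * b->im;
    t.im = w.re * b->im + w.im * b->re;

    /* One subtraction and one addition reuse the same product. */
    b->re = a->re - t.re;
    b->im = a->im - t.im;
    a->re += t.re;
    a->im += t.im;
}
```

Each butterfly costs a single complex multiplication, which is where the (N/2) multiplications per stage come from.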
Bit Reversal
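Bit reversal reorders the input by reversing the binary digits of each index. The sketch below is one simple way to do this in C; the name `reverse_bits` is illustrative and is not necessarily how the `ReverseBits` routine used in the C implementation is written.

```c
/* Reverse the lowest `bits` bits of `index`. */
unsigned reverse_bits(unsigned index, unsigned bits)
{
    unsigned rev = 0;
    for (unsigned i = 0; i < bits; ++i) {
        rev = (rev << 1) | (index & 1u);  /* move the lowest bit of index into rev */
        index >>= 1;
    }
    return rev;
}
```

For an 8-point transform (3 bits), indices 0 through 7 map to 0, 4, 2, 6, 1, 5, 3, 7.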
8-Point Radix-2 FFT Example
8-Point Radix-2 FFT Example (continued)
Implementations of DFT and FFT
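DFT Implementation
    for (r = 0; r < samples / 2; r++) {
        float re = 0.0f, im = 0.0f;
        float part = (float)r * -2.0f * PI / (float)samples;

        /* Correlate the input with one cosine and one sine wave */
        for (k = 0; k < samples; k++) {
            float theta = part * (float)k;
            re += data_in[k] * cos(theta);
            im += data_in[k] * sin(theta);
        }
    }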
- Nested for loop, (N/2) × N iterations: $O(N^2)$
- 63,027.41 cycles per sample (123 cycles per inner-loop iteration)
- Obvious inefficiencies: the cos and sin math.h functions
- Efficient assembly coding could reduce the inner loop to 3 cycles per iteration (1,536 cycles per sample)
C FFT Implementation
    void fft_float(unsigned NumSamples, float *RealIn, float *ImagIn,
                   float *RealOut, float *ImagOut)
    {
        for (i = 0; i < NumSamples; i++) {
            // Iterate over the samples and perform the bit-reversal
            j = ReverseBits(i, NumBits);
        }

        BlockEnd = 1;
        // Following loop iterates Log2(NumSamples)
        for (BlockSize = 2; BlockSize <= NumSamples; BlockSize <<= 1) {
            // Perform Angle Calculations (Using math.h sin/cos)
            // Following 2 loops iterate over NumSamples/2
            for (i = 0; i < NumSamples; i += BlockSize) {
                for (j = i, n = 0; n < BlockEnd; j++, n++) {
                    // Perform butterfly calculations
                }
            }
            BlockEnd = BlockSize;
        }
    }
C FFT Implementation
- Bit-reverse for loop: N iterations
- Nested for loops
  - First outer loop: log2(N) iterations
    - Makes use of the sin/cos math.h functions
  - Second outer loop: N / BlockSize iterations
  - Inner loop: BlockSize / 2 iterations
- O(N + log2(N) * (N / BlockSize) * (BlockSize / 2))
- O(N + N log2(N))
- 193.84 cycles per sample
Assembly FFT Implementation
- Bit-reverse address generation
  - Hides the bit-reverse operation inside the first and second FFT stages
- Sin and cos values stored in a look-up table (LUT)
  - 256 Kbyte LUT added to Data1
  - Needed to grow the Data1 memory space using the LDF file
- Interleaved real and imaginary arrays
  - Quad reads load 2 complex points per cycle
- Supports the real FFT for input signals with no imaginary component
  - 40% algorithm-based savings
Assembly FFT Implementation
- Special butterfly instruction
  - Can perform addition and subtraction in parallel in one compute block
  - Speeds up the innermost loop
- VLIW and SIMD operations
  - Performs simultaneous operations in both compute blocks
- Loop unrolling and instruction scheduling keep the entire processor busy with instructions.
- 11.35 cycles per sample
Assembly FFT Implementation
    _BflyLoop:
        q[j2+=4]=r27:26;   k5=k5+k9;          fr6=r30*r12;    fr16=r6-r7;;
        yr3:0=q[j0+=4];    k3=k5 and k4;      fr15=r23*r4;    fr24=r8+r18, fr26=r8-r18;;
        xr3:0=q[j0+=4];    r5:4=l[k7+k3];     fr7=r31*r13;    fr25=r9+r19, fr27=r9-r19;;
        q[j1+=4]=r25:24;                      fr14=r30*r13;   fr17=r14+r15;;

        q[j2+=4]=r27:26;   k5=k5+k9;          fr6=r2*r4;      fr18=r6-r7;;
        yr11:8=q[j0+=4];   k3=k5 and k4;      fr15=r31*r12;   fr24=r20+r16, fr26=r20-r16;;
        xr11:8=q[j0+=4];   r13:12=l[k7+k3];   fr7=r3*r5;      fr25=r21+r17, fr27=r21-r17;;
        q[j1+=4]=r25:24;                      fr14=r2*r5;     fr19=r14+r15;;

        q[j2+=4]=r27:26;   k5=k5+k9;          fr6=r10*r12;    fr16=r6-r7;;
        yr23:20=q[j0+=4];  k3=k5 and k4;      fr15=r3*r4;     fr24=r28+r18, fr26=r28-r18;;
        xr23:20=q[j0+=4];  r5:4=l[k7+k3];     fr7=r11*r13;    fr25=r29+r19, fr27=r29-r19;;
        q[j1+=4]=r25:24;                      fr14=r10*r13;   fr17=r14+r15;;

        q[j2+=4]=r27:26;   k5=k5+k9;          fr6=r22*r4;     fr18=r6-r7;;
        yr31:28=q[j0+=4];  k3=k5 and k4;      fr15=r11*r12;   fr24=r0+r16, fr26=r0-r16;;
        xr31:28=q[j0+=4];  r13:12=l[k7+k3];   fr7=r23*r5;     fr25=r1+r17, fr27=r1-r17;;
    .align_code 4;
        if NLC0E, jump _BflyLoop;
DC FFT Test
Audio FFT Test
1024-Point DFT / FFT Comparison
Implementation                   Cycles per Sample
DFT implemented in C             63,027.41
DFT implemented in assembly       1,536
FFT implemented in C                193.85
FFT implemented in assembly          11.35
1024-Point Radix-2 FFT Hardware Comparison
Processor Architecture      Cycles per Sample   Processor Frequency   Execution Time
ADSP-21369 (SHARC)           8.98               400 MHz               22.99 µs
TigerSHARC (website)         9.16               600 MHz               15.63 µs
TigerSHARC (our results)    11.35               600 MHz               19.37 µs
TMS320C6000                 14.125              350 MHz               41.33 µs
TMS320DM644x                 7.59               594 MHz               13.08 µs
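The execution times in the table are consistent with multiplying cycles per sample by the 1024 samples and dividing by the clock frequency. A small sketch of that arithmetic in C follows; the function name `exec_time_us` is illustrative.

```c
/* Execution time in microseconds for an n-point transform.
 * A clock of f MHz executes f cycles per microsecond. */
double exec_time_us(double cycles_per_sample, unsigned n, double clock_mhz)
{
    double total_cycles = cycles_per_sample * (double)n;
    return total_cycles / clock_mhz;
}
```

For example, `exec_time_us(11.35, 1024, 600.0)` gives about 19.37 µs, which matches the TigerSHARC (our results) row.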
Conclusion
- The FFT algorithm is very useful when computing the frequency domain on a DSP.
- The FFT is much faster than a regular DFT algorithm.
- The FFT is more precise, with fewer errors introduced by round-off.
- The timed coding examples further support this claim and demonstrate how to code the algorithm.
- The Radix-2 FFT isn't the fastest, but it uses a less complex addressing and twiddle-factor routine.
- In this case (unlike in school), F is better than D.