Processor Architecture Needed to Handle the FFT Algorithm

1
Processor Architecture Needed to Handle the FFT Algorithm

M. Smith

2
FFT algorithms

More FFTs are performed per day than any other algorithm in the world
Therefore, of course, processors include custom parts to handle this situation

3
1 MRI session
http://www.core.org.cn/NR/rdonlyres/Nuclear-Engineering/22-56JFall-2005/EA28B7B3-39E5-4999-B858-5E3B248E5408/0/chp_mri.jpg
4
Magnetic resonance (MR) and the DFT (discrete Fourier transform)

Place a body in a steady magnetic field of 1.5 tesla (or 3 or 7)
All the protons spin (precess) around the magnetic field at a frequency taken here as around 10 MHz (the actual proton frequency at 1.5 tesla is closer to 64 MHz)
If an RF (90 degree) pulse is sent in at 10 MHz, it will be absorbed by the system
Some energy is then emitted by the system as a sinusoid at 10 MHz
Do a DFT of the response: a single peak whose height is proportional to the number of protons in the body (amount of hydrogen in the body, hence amount of water in the body)
Used to measure water content in wheat non-destructively
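
The peak-height idea above can be sketched in a few lines. This is only an illustration under assumed values: the frequency is scaled down to DFT bin 10 of a 64-sample record, the response is noise-free, and the names `dft`, `peak` and `response` are hypothetical.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def peak(signal):
    """Return (bin, magnitude) of the largest peak in the lower half spectrum."""
    half = len(signal) // 2
    mags = [abs(v) for v in dft(signal)[:half]]
    k = max(range(half), key=lambda i: mags[i])
    return k, mags[k]

N = 64        # number of samples (assumed)
F_BIN = 10    # response frequency in DFT bins, standing in for 10 MHz

def response(amplitude):
    # Emitted sinusoid; the amplitude stands in for the number of protons
    return [amplitude * math.cos(2 * math.pi * F_BIN * t / N) for t in range(N)]

k1, h1 = peak(response(1.0))
k2, h2 = peak(response(2.0))
print(k1, k2)      # both peaks sit in the same bin
print(h2 / h1)     # twice the "protons" gives twice the peak height
```

Doubling the amplitude doubles the peak height while the peak stays in the same bin, which is what makes the height usable as a measure of water content.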

5
Magnetic resonance imaging (MRI) and the DFT (discrete Fourier transform)

Place four bodies in a steady magnetic field of 1.5 tesla (or 3 or 7)
Apply 90 degree pulse
Apply DFT on response
One signal at 10 MHz
Place four bodies in a steady magnetic field of 1.5 tesla
Apply 90 degree pulse
Apply a field gradient G in X direction
Apply DFT on response
Four signals at 10 + G.X1, 10 + G.X2, 10 + G.X3, 10 + G.X4, where Xi is the x position of object i
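
A minimal sketch of the gradient idea above, assuming four equal objects, a noise-free response and scaled-down numbers: base frequency at DFT bin 10 and a gradient of 2 bins per unit of x. The positions and the names `response` and `peaks` are hypothetical.

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitudes of a direct O(N^2) DFT."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
            for k in range(n)]

N = 128                     # number of samples (assumed)
F0 = 10                     # base frequency bin, standing in for 10 MHz
G = 2                       # gradient strength in bins per unit of x (assumed)
positions = [1, 3, 6, 10]   # hypothetical X1..X4

def response(gradient):
    # Each object emits at F0 + gradient * x, so position becomes frequency
    return [sum(math.cos(2 * math.pi * (F0 + gradient * x) * t / N) for x in positions)
            for t in range(N)]

def peaks(signal):
    mags = dft_magnitudes(signal)[:N // 2]
    threshold = 0.5 * max(mags)
    return [k for k, m in enumerate(mags) if m > threshold]

print(peaks(response(0)))   # [10]              gradient off: one signal
print(peaks(response(G)))   # [12, 16, 22, 30]  gradient on: one signal per object
```

With the gradient off all four objects pile up in one bin; with it on, each peak position is 10 + G.Xi, so the DFT output reads directly as a map of x positions.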

6
Magnetic resonance imaging (MRI) and the DFT (discrete Fourier transform)

Place four bodies in a steady magnetic field of 1.5 tesla
Apply 90 degree pulse
Apply a field gradient Gx in X direction
Apply DFT on response
Four signals at 10 + Gx.X1, 10 + Gx.X2, 10 + Gx.X3, 10 + Gx.X4, where Xi is the x position of object i
Apply 90 degree pulse
Now add a field gradient in both X and Y directions
Apply DFT on response
Four signals at 10 + Gx.X1 + Gy.Y1, 10 + Gx.X2 + Gy.Y2, 10 + Gx.X3 + Gy.Y3, 10 + Gx.X4 + Gy.Y4, where Xi is the x position of object i and Yi is the y position of object i

7
1 MRI session

Lasts about 20 minutes
Echo planar imaging (EPI) generates 19 2-D slices of the brain in about 60 seconds
Each image set is 256 pixels x 256 pixels x 19 slices
This requires 256 x 256 x 19 DFTs / minute
The DFT is only ONE part of the algorithm
My research uses MRI for stroke diagnosis

http://www.magnet.fsu.edu/education/tutorials/magnetacademy/mri/images/mri-scanner.jpg
8
Tackled already this term

Three types of DSP algorithms
Long loops, multiplication and addition
intensive, regular (simple) memory accesses
e.g. 300 taps in FIR algorithms
Short loops involving multiplications and
additions e.g. 3 stages in IIR algorithms

9
Comparing IIR and FIR filters
Infinite Impulse Response filters: few operations to produce output from input for each IIR stage, 3 to 7 stages
Finite Impulse Response filters: many operations to produce output from input. Long FIFO buffer, which may require as many operations as the FIR calculation itself. Easy to optimize
10
Discrete Fourier Transform

FIR and IIR algorithms directly manipulate the data in the time domain.
FIR -- processing M data points using an N point FIR filter involves
M * (N-1) additions
M * N multiplications
M * N * 2 + M memory accesses
Algorithm takes a time of Order (M N)
Very slow if manipulating a large amount of data
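
The operation counts for a time-domain FIR can be checked with a short sketch. This is a plain direct-form filter written for counting, not a tuned implementation; the function name `fir_filter` and the test data are hypothetical.

```python
def fir_filter(data, coeffs):
    """Direct-form FIR; returns (outputs, multiplies, additions)."""
    n_taps = len(coeffs)
    history = [0.0] * n_taps          # FIFO delay line, newest sample first
    outputs = []
    multiplies = additions = 0
    for sample in data:
        history = [sample] + history[:-1]
        acc = history[0] * coeffs[0]  # first tap: one multiply, no add
        multiplies += 1
        for h, c in zip(history[1:], coeffs[1:]):
            acc += h * c              # remaining taps: one multiply, one add each
            multiplies += 1
            additions += 1
        outputs.append(acc)
    return outputs, multiplies, additions

M, N = 1000, 300                      # M data points, N taps
out, mults, adds = fir_filter([1.0] * M, [1.0 / N] * N)
print(mults, adds)                    # 300000 299000
```

The counts match M * N multiplications and M * (N-1) additions, so the work grows as Order (M N).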

11
Frequency domain analysis

Apply discrete Fourier transform (implemented via FFT)
Transform to frequency domain takes time Order (M log M)
Perform FIR in frequency domain takes time Order (M)
Transform back to time domain takes time Order (M log M)
For long filters, the FFT route (Order (M log M)) can be orders of magnitude faster than the time-domain FIR (Order (M N))
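
The three steps above (transform, multiply, transform back) can be sketched as follows. This is a minimal illustration in plain Python, assuming zero-padding to a power of two so that the circular convolution matches the linear one; the function names are hypothetical and the recursive FFT is chosen for readability, not speed.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; length must be a power of two. No 1/N scaling."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fir_direct(data, coeffs):
    """Time-domain FIR, Order (M N)."""
    return [sum(coeffs[j] * data[i - j] for j in range(len(coeffs)) if i - j >= 0)
            for i in range(len(data))]

def fir_via_fft(data, coeffs):
    """Same filter done in the frequency domain."""
    size = 1
    while size < len(data) + len(coeffs) - 1:
        size *= 2                                  # pad to avoid wrap-around
    X = fft(list(data) + [0.0] * (size - len(data)))
    H = fft(list(coeffs) + [0.0] * (size - len(coeffs)))
    Y = [a * b for a, b in zip(X, H)]              # filtering is one multiply per bin
    y = fft(Y, inverse=True)
    return [v.real / size for v in y[:len(data)]]

data = [float((3 * i) % 7) for i in range(32)]
coeffs = [0.25, 0.5, 0.25]
a = fir_direct(data, coeffs)
b = fir_via_fft(data, coeffs)
print(max(abs(p - q) for p, q in zip(a, b)) < 1e-9)   # True: both routes agree
```

For a 3-tap filter the direct route is cheaper; the frequency-domain route pays off as the number of taps N grows.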

14
4 point DFT to show concepts
15
Simplify using special complex exponential
properties
16
Running FFT on data stored in array
17
8 point FFT with log2 8 (= 3) stages

3 stages with N / 2 butterflies per stage: Order (N log N) in time
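
The stage and butterfly counts can be confirmed with a small radix-2 sketch. It assumes a decimation-in-time form with the input reordered up front; the function name `fft_radix2` and the test vector are hypothetical.

```python
import cmath

def fft_radix2(x):
    """Iterative radix-2 FFT; returns (spectrum, stages, butterflies)."""
    n = len(x)                       # must be a power of two
    bits = n.bit_length() - 1
    # Reorder the input into bit-reversed order
    a = [x[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]
    stages = butterflies = 0
    span = 1
    while span < n:
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))   # twiddle factor
                top = a[start + k]
                bottom = w * a[start + k + span]
                a[start + k] = top + bottom          # butterfly: sum
                a[start + k + span] = top - bottom   # butterfly: difference
                butterflies += 1
        span *= 2
        stages += 1
    return a, stages, butterflies

X, stages, butterflies = fft_radix2([1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0])
print(stages, butterflies)   # 3 12
```

For N = 8 this gives 3 stages of 4 butterflies, that is (N / 2) log2 N butterflies in total.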

18
Architectural characteristics needed to handle
FFT efficiently
19
Add / subtract in one instruction

The following is illegal as a single instruction
F4 = F2 + F3, F5 = F6 - F7 ILLEGAL
Needs bits to describe 6 registers (6 x 4 bits)
FFT butterfly add is a special instruction
F4 = F11 + F12, F5 = F11 - F12
Uses only 4 registers, 2 in, 2 out (4 x 2 bits)
2 bits, how come?
F4 = F12 + F11, F5 = F12 - F11 ILLEGAL
Fx = F11-8 + F15-12, Fy = F11-8 - F15-12
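
The register restriction can be modelled in a few lines. This sketch only mirrors the rule as stated on this slide (first source in F8 to F11, second source in F12 to F15, so each source needs 2 bits); it is not a model of the real instruction encoding, and `butterfly_add_sub` is a hypothetical name.

```python
def butterfly_add_sub(regs, dest_sum, dest_diff, src_a, src_b):
    """Model a dual add/subtract: one instruction, two results.

    Assumption taken from the slide: the first source must be F8-F11 and
    the second F12-F15, so each source is selected with only 2 bits.
    """
    if not 8 <= src_a <= 11:
        raise ValueError("first source must be F8-F11")
    if not 12 <= src_b <= 15:
        raise ValueError("second source must be F12-F15")
    a, b = regs[src_a], regs[src_b]
    regs[dest_sum] = a + b     # Fx = Fa + Fb
    regs[dest_diff] = a - b    # Fy = Fa - Fb
    return regs

regs = [0.0] * 16
regs[11], regs[12] = 5.0, 3.0
butterfly_add_sub(regs, 4, 5, 11, 12)   # F4 = F11 + F12, F5 = F11 - F12
print(regs[4], regs[5])                 # 8.0 2.0
```

Swapping the sources (F12 first, F11 second) raises an error in this model, matching the ILLEGAL case above.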

20
Memory accesses

Stage 1
Fetch X data at location k and k + N/2
Store X data at location k and k + N/2
Stage 2
Fetch X data at location k and k + N/4
Store X data at location k and k + N/4
Stage 3 -- Final stage
Fetch X data at location k and k + N/8
Store X data at the bit-reversed locations of k and k + N/8
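
The access pattern for each stage can be listed with a short sketch, assuming N = 8 and the halving stride described above (N/2, then N/4, then N/8). The name `stage_pairs` is hypothetical.

```python
def stage_pairs(n, stage):
    """Index pairs (k, k + n / 2**stage) touched by one FFT stage (stage = 1, 2, ...)."""
    span = n >> stage                 # distance between the two butterfly inputs
    pairs = []
    for block in range(0, n, 2 * span):
        for k in range(block, block + span):
            pairs.append((k, k + span))
    return pairs

for stage in (1, 2, 3):
    print(stage, stage_pairs(8, stage))
# 1 [(0, 4), (1, 5), (2, 6), (3, 7)]
# 2 [(0, 2), (1, 3), (4, 6), (5, 7)]
# 3 [(0, 1), (2, 3), (4, 5), (6, 7)]
```

Every stage touches each location exactly once, but with a different stride, which is why the address generators matter as much as the arithmetic.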

21
First issue: how do you store complex numbers?

One option
Use 16-bit values
Store real part in top 16-bits
Store imaginary part in bottom 16 bits
Access data on J-bus
Access complex sinusoids on J-bus
Access both components (R and I) in one cycle
TigerSHARC can do 16-bit complex additions and multiplications as specific instructions: INTEGER only (not available on SHARC)
Can use both X and Y compute blocks
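
The packed 16-bit layout can be illustrated in software. This sketch only shows the storage format (real part in the top 16 bits, imaginary part in the bottom 16 bits) and why integer results can overflow; it does not model the TigerSHARC instructions themselves, and the function names are hypothetical.

```python
def pack_complex16(real, imag):
    """Pack two signed 16-bit integers into one 32-bit word (real in top half)."""
    for v in (real, imag):
        if not -32768 <= v <= 32767:
            raise OverflowError("component does not fit in 16 bits")
    return ((real & 0xFFFF) << 16) | (imag & 0xFFFF)

def unpack_complex16(word):
    """Recover the signed (real, imag) pair from a packed 32-bit word."""
    def signed16(v):
        return v - 0x10000 if v & 0x8000 else v
    return signed16((word >> 16) & 0xFFFF), signed16(word & 0xFFFF)

def add_complex16(a, b):
    """Complex addition on packed words; raises if a component overflows."""
    ar, ai = unpack_complex16(a)
    br, bi = unpack_complex16(b)
    return pack_complex16(ar + br, ai + bi)

w = pack_complex16(3, -4)
print(hex(w))                                               # 0x3fffc
print(unpack_complex16(add_complex16(w, pack_complex16(1, 1))))   # (4, -3)
```

One 32-bit fetch delivers both components, but adding two values near 32767 overflows a component, which is the integer pain point raised on the next slide.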

22
Integer operations are a pain (they tend to overflow) -- TigerSHARC syntax

Option 2: floating point
Store the real component in X and the imaginary component in Y
Use R1:0 = Q[J4 += 4]
Store the first complex number in XR0 and YR0
Store the second complex number in XR1 and YR1
FR3 = R1 + R0 performs a complex floating-point addition in a single cycle
L[J5] = R3 stores the complex answer back

23
Integer operations are a pain (they tend to overflow) -- TigerSHARC syntax

Option 3: floating point
Access the real component along the J-bus from data1 and the imaginary component along the K-bus from data2
Use XR3:0 = Q[J4 += 4], YR3:0 = Q[K4 += 4]
Store the first complex number in XR0 and YR0
Store the second, third and fourth complex numbers in XR1, YR1; XR2, YR2; XR3, YR3
Which option is best? It depends. How do you handle bringing in the complex sinusoids?

24
Bit reverse addressing
25
Bit-reverse addressing: check the manual for accurate details before MII

Only possible with I0, I1, I2? and I3? registers (also I8, I9, I10?, I11?)
You must start the array on an N-aligned boundary, otherwise it does not work
I0: address pointer
B0: base register, points to start of array
L0: length of array register
M0: special circular buffer modify register ????
F4 = BR [I0 += 1] // Correct SHARC syntax???
Bit-reverse addressing only works with POST-MODIFY (permits the next address to be calculated in parallel)
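
What the bit-reverse address generator computes can be shown in software. This is a sketch of the index mapping only, for an 8-point array (3 address bits); it says nothing about the register setup, which the manual covers. The name `bit_reverse` is hypothetical.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index (e.g. 3 bits: 001 -> 100)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lowest bit to the other end
        index >>= 1
    return result

order = [bit_reverse(i, 3) for i in range(8)]
print(order)   # [0, 4, 2, 6, 1, 5, 3, 7]
```

Doing this in the address hardware means the reordering costs no extra instructions; in software it is a loop per access.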

26
Issues handling FFT Butterfly
29
Only possible on TigerSHARC
30
Wrong again

This uses the radix-2 form of the algorithm, which breaks the transform down into 2-point DFTs
There is also a radix-4 form of the algorithm, which is faster again
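
One reason the radix-4 form helps is that its basic 4-point DFT needs no general multiplications, since multiplying by -j is only a swap and a sign change. A minimal sketch of that 4-point building block, checked against the direct DFT definition (the function names are hypothetical):

```python
import cmath

def radix4_butterfly(a, b, c, d):
    """4-point DFT of (a, b, c, d) using additions and one multiply by -j."""
    s0, s1 = a + c, a - c
    s2, s3 = b + d, (b - d) * -1j    # -j multiply: swap real/imag, flip one sign
    return s0 + s2, s1 + s3, s0 - s2, s1 - s3

def dft4(x):
    """Direct 4-point DFT from the definition, for comparison."""
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / 4) for t in range(4))
            for k in range(4)]

x = [1 + 2j, 3 - 1j, -2 + 0.5j, 4j]
fast = radix4_butterfly(*x)
slow = dft4(x)
print(max(abs(p - q) for p, q in zip(fast, slow)) < 1e-9)   # True
```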

33
TIGERSHARC
34
DSP Co-processor on SHARC
36
Will discuss FFT accelerator later

Was not available on previous SHARCs
Way to go with future processors
Cheap processors + co-processors
Cheap microcontroller + FPGA component
