Title: Processor Architecture Needed to handle FFT algoarithm
1Processor Architecture Needed to handleFFT
algoarithm
2FFT algorithms
- There are more FFT algorithms performed per day
than any other algorithm in the world - Therefore, of course, custom parts of the
processors to handle this situations
31 MRI session
http//www.core.org.cn/NR/rdonlyres/Nuclear-Engine
ering/22-56JFall-2005/EA28B7B3-39E5-4999-B858-5E3B
248E5408/0/chp_mri.jpg
4Magnetic resonance (MR) and DFT discrete Fourier
transform
- Place a body in a steady magnetic field 1.5
Telsa, (3 or 7) - All the protons spin (precess) around magnetic
field at a frequency of around10 MHz - If sent in an RF (90 degree) pulse at 10 MHz
will be absorbed by system. - Some energy then emitted by system as sinusoid at
10 MHZ - Do DFT single pulse whose height proportional
to number of protons in body (amount of hydrogen
in body amount of water in body) - Used to non-destructively measuring water content
in wheat
5Magnetic resonance imaging (MRI) and DFT
discrete Fourier transform
- Place four bodies in a steady magnetic field
1.5 Telsa, (3 or 7) - Apply 90 pulse
- Apply DFT on response
- One signal at 10 MHz
- Place four bodies in a steady magnetic field of
1.5 Telsa - Apply 90 pulse
- Apply a field gradient in X direction
- Apply DFT on response
- Four signals 10 G.X1, 10 G.X2, 10 G.X3 , 10
G.X4 where Xi is the x position of object
6Magnetic resonance imaging (MRI) and DFT
discrete Fourier transform
- Place four bodies in a steady magnetic field of
1.5 Telsa - Apply 90 pulse
- Apply a field gradient Gx in X direction
- Apply DFT on response
- Four signals 10 Gx.X1, 10 Gx.X2, 10 Gx.X3 ,
10 Gx.X4 where Xi is the x position of object - Apply 90 pulse
- Now add a field gradient in both X and Y
directions - Apply DFT on response
- Four signals 10 Gx.X1 Gy.Y1, 10 Gx.X2
Gy.Y2, 10 Gx.X3 Gy.Y3, 10 Gx.X4 Gy.Y3
where Xi is the x position of object and Yi is
the i position of object
71 MRI session
http//www.core.org.cn/NR/rdonlyres/Nuclear-Engine
ering/22-56JFall-2005/EA28B7B3-39E5-4999-B858-5E3B
248E5408/0/chp_mri.jpg
- Occurs for about 20 minutes
- Echo planar imaging (EPI) Generates 19 2 D slices
of the brain in about 60 seconds - Each image is 256 pixels by 256 pixels x 19
- Each image requires 256 256 19 DFTs / minute
- DFT is only ONE part of the algorithm
- My research is using MRI for stroke diagnosis
http//www.magnet.fsu.edu/education/tutorials/magn
etacademy/mri/images/mri-scanner.jpg
8Tackled already this term
- Three types of DSP algorithms
- Long loops, multiplication and addition
intensive, regular (simple) memory accesses
e.g. 300 taps in FIR algorithms - Short loops involving multiplications and
additions e.g. 3 stages in IIR algorithms
9Comparing IIR and FIR filters
Infinite Impulse Responsefilters few
operations to produce output frominput for each
IIR stage 3 7 stages
Finite Impulse Responsefilters many operations
to produce output frominput. Long FIFO buffer
whichmay require as many operations As FIR
calculation itself. Easy to optimize
10Discrete Fourier Transform
- FIR and IIR algorithms directly manipulate the
data in the time domain. - FIR -- Process M data points using N point FIR
filter involves M (N-1) additions M
N multiplications M N 2 M memory
accesses Algorithm takes a time of Order (M
N) - Very slow if manipulating large amount of data
11Frequency domain analysis
- Apply discrete Fourier transform (implemented via
FFT) - Transform to frequency domain takes time Order (M
log M) - Perform FIR in frequency domain takes time Order
(M) - Transform back to time-domain takes time Order (M
log M) - FFT (Order (M log M) is orders of magnitude
faster that FIR (Order (M log M)
12(No Transcript)
13(No Transcript)
144 point DFT to show concepts
15Simplify using special complex exponential
properties
16Running FFT on data stored in array
178 point FFT with log 8 ( 3) stages
- 3 stages with N / 2 butterflies / stageOrder
(N log N) in time
18Architectural characteristics needed to handle
FFT efficiently
19Add / subtract in one instruction
- The following instruction is illegal as a single
instruction - F4 F2 F3, F5 F6 F7 ILLEGAL
- Needs bits to describe 6 registers (6 4 bits)
- FFT Butterfly add is special instruction
- F4 F11 F12, F5 F11 F12
- Uses only 4 registers, 2 in, 2 out (4 2 bits)
- 2 bits how come?
- F4 F12 F11, F5 F12 F11 ILLEGAL
- Fx F11-8 F15-12, Fy F11-8 - F15-12
20Memory accesses
- Stage 1
- Fetch X data at location k and k N /2
- Store X data at location k and k N /2
- Stage 2
- Fetch X data at location k and k N /4
- Store X data at location k and k N /4
- Stage 3 -- Final stage
- Fetch X data at location k and k N /8
- Store X data at bit-reversed location k and k N
/4
21First issue how do you store complex numbers?
- One option
- Use 16-bit values
- Store real part in top 16-bits
- Store imaginary part in bottom 16 bits
- Access data on J-bus
- Access complex sinusoids on J-bus
- Access both components (R and I) in one cycle
- TigerSHARC has the ability to do 16-bit complex
additions and multiplications as specific
instructions INTEGER only (NOT SHARC) - Can use both X and Y compute blocks
22Integer operations a pain tend to overflow --
TigerSHARC syntax
- Option 2 floating point
- Store Real component in location X and imaginary
component in location Y - Use R10 QJ4 4
- Store first imaginary number in X0 and Y0
- Store second imaginary number in X1 and Y1
- FR3 R1 R0 performs complex floating point
addition in single cycle - LJ5 R3 stores complex answer back
23Integer operations a pain tend to overflow --
TigerSHARC syntax
- Option 3 floating point
- Access Real component along J- bus from data1
and Imaginary component along K-bus from data
2 - Use XR30 QJ4 4 YR30 QK4 4
- Store first imaginary number in X0 and Y0
- Store second, third and fourth imaginary number
in XR1, YR1 XR2, YR2 XR3, YR3 - Which option is best? Depends? How handle bring
in complex sinusoids
24Bit reverse addressing
25Bit reverse addressing Check manual for
accurate details before MII
- Only possible with I0, I1, I2? and I3? registers
(also I8, I9, I10?, I11?) - You must start the array on a N aligned boundary
otherwise it does not work - I0 address pointer
- B0 base register point to start of array
- L0 length of array register
- M0 special circular buffer modify register ????
- F4 BR I0 1 // Correct SHARC syntax???
- Bit-reverse addressing only works on POST-MODIFY
(permits next address to be calculated in
parallel)
26Issues handling FFT Butterfly
27(No Transcript)
28(No Transcript)
29Only possible on TigerSHARC
30Wrong again
- This is using the Radix 2 form of the algorithm
breaks down into 2-pt DFT - There is also a Radix 4 form of the algorithm
which is faster again
31(No Transcript)
32(No Transcript)
33TIGERSHARC
34DSP Co-processor on SHARC
35(No Transcript)
36Will discuss FFT accelerator later
- Was not available on previous SHARCS
- Way to go with future processors
- Cheap processors co-processors
- Cheap microcontroller FPGA component