Fast Fourier Transforms on GPUs - PowerPoint PPT Presentation

1 / 51
About This Presentation
Title:

Fast Fourier Transforms on GPUs

Description:

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL. Fast ... Interpolation ... Can exploit interpolation hardware to eliminate texture accesses for index lookups ... – PowerPoint PPT presentation

Number of Views:160
Avg rating:3.0/5.0
Slides: 52
Provided by: cory154
Category:

less

Transcript and Presenter's Notes

Title: Fast Fourier Transforms on GPUs


1
Fast Fourier Transforms on GPUs
  • Cory Quammen
  • April 11, 2007

2
Overview
  • Applications exploiting DFT
  • Discrete Fourier transform
  • FFT algorithm basics
  • Papers
  • Performance comparison

3
Applications
  • Signal processing
  • Waveform analysis
  • Band-pass filtering
  • Signal detection
  • Convolution
  • Signal filtering
  • Polynomial multiplication
  • Big integer multiplication
  • Interpolation
  • Padding with zeros in frequency domain implies
    increased sampling rate in spatial (time) domain

4
Applications
  • Solving PDEs
  • Finding nth derivative in spatial (time) domain
    equivalent to multiplying by in in frequency
    domain
  • Image compression
  • Analyze parts of an image, discarding
    high-frequency components
  • Medical image reconstruction
  • MRI and ultrasound
  • Volume rendering
  • Fourier slice projection theorem

5
Discrete Fourier Transform
  • Transforms discrete, evenly-spaced samples from
    spatial (or time) domain to frequency domain
  • Expresses input signal as a superposition of
    sinusoids
  • Maps complex vectors to complex vectors
  • All frequencies in original data are fully
    represented in transformed data

6
Discrete Fourier Transform
  • Linear
  • Invertible - original data can be fully recovered
  • Important - discrete Fourier transform (DFT) is a
    mathematical concept, not an algorithm

7
DFT definition
  • Forward transform
  • Inverse transform

Difference is a change of sign and scale factor
8
Multi-dimensional DFT
  • Nested summations over each dimension
  • Computationally, simplifies to row-column
    algorithm
  • Apply set of 1D DFTs in one dimension, then
    subsequent sets of 1D DFTs in remaining dimensions

9
Interpretation
  • Fourier transform often characterized by two
    things
  • Amplitude
  • Phase

10
F(impulse)
11
F(cos wave)
12
F(white noise)
13
F(Gaussian)
14
Naïve DFT computation
  • Implement directly
  • Complexity O(N2)
  • Much too slow!

15
Cooley-Tukey algorithm
  • Developed by Gauss in 1805, but forgotten and
    developed independently by Cooley and Tukey, and
    others before them
  • Divide-and-conquer algorithm reduces complexity
    to O(N log N)

16
Faster DFT computation
  • Apply divide-and-conquer
  • Split input into 2 subarrays
  • Even-index subarray
  • Odd-index subarray
  • Take DFT of subarrays
  • Combine results
  • Yields complexity O(N log N)

17
Danielson-Lanczos Lemma
Separate summation into even-odd parts
Notice
Moreland2003
18
Therefore
Recall
Moreland2003
19
Recursive implementation
  • Recursive formulation of FFT algorithm. Very
    inefficient because
  • of data reordering and replication.
  • function Fx myfft(fx)
  • N length(fx)
  • if (N 1)
  • Fx fx
  • return
  • end
  • Coefficient
  • Wn exp(-i2pi/N)
  • even myfft(fx(12N-1))
  • odd myfft(fx(22N))
  • exponents (0(N/2-1))'
  • firstHalf even (Wn.exponents.odd)
  • secondHalf even - (Wn.exponents.odd)
  • Fx firstHalf secondHalf

20
Recursive implementation caveats
  • Requires stack space
  • No stack on GPU
  • Array reordering at every step
  • Too much memory traffic

21
Iterative implementation
  • Reorder input
  • Details to follow
  • Start from smallest partition size first
  • Fourier transform of length-one vector is
    identity
  • Start with partition size p 2
  • Combine results
  • Double partition size p
  • Repeat

22
Input reordering
Reverse the bits!
23
Iterative computation
Unweighted
Weighted
24
Spitzer 2003
  • Implemented 1D FFT
  • Bit reversal is a lookup operation
  • Indices and weights for each pass are precomputed
    and stored in scramble texture
  • Pad real inputs to complex by setting imaginary
    component to 0

Spitzer2003
25
Implementation and results
  • OpenGL and Cg
  • 1D textures
  • Single 2048 FFT

Spitzer2003
26
Performance limitations
  • Vertex processors not used (5 GFLOPS of unused
    processing power)
  • Many OpenGL calls for ping-ponging
  • Each pass must finish before next one starts
  • 1D textures - not optimized in hardware
  • Additional memory bandwidth for index and weight
    lookups
  • Vector ALUs not exploited

Spitzer2003
27
Performance improvements
  • Packing complex numbers into 1 or more textures
  • RG - first number, BA - second number
  • Two textures RGBA1 - real, RGBA2 - imaginary
  • Batching
  • Fill 2D texture with 1D inputs
  • Keep graphics pipeline filled
  • Cuts ratio of driver calls to computation
    substantially

Spitzer2003
28
Mitchell et al. 2003
  • Discusses FFT in context of image processing and
    DirectX 9
  • Similar arrangement and algorithm to
    Spitzer2003
  • Handles 2D FFTs

Mitchell2003
29
FFT over columns
// Horizontal scramble first SetSurfaceAsTexture(
surfaceA) // Input image SetRenderTarget(
surfaceB) LoadShader( HorizontalScramble)
SetTexture( ButterflyTexturelog2(width))
DrawQuad() // Horizontal butterflies
LoadShader( HorizontalButterfly) SetTexture(
ButterflyTexturelog2(width)) for ( i 0 i lt
log2( width) i) SwapSurfacesAandB()
SetShaderConstant( pass, i/log2(width))
DrawQuad()
Mitchell2003
30
FFT over rows
// Vertical scramble SwapSurfacesAandB()
LoadShader( VerticalScramble) SetTexture(
ButterflyTexturelog2(height)) DrawQuad()
// Vertical butterflies LoadShader(
VerticalButterfly) SetTexture(
ButterflyTexturelog2(height)) for ( i 0 i
lt log2( height) i) SwapSurfacesAandB()
SetShaderConstant( pass, i/log2(height))
DrawQuad()
Mitchell2003
31
Moreland and Angel 2003
  • Goal was to enable a real-time filtering stage
    during image generation
  • Developed an indexing scheme to eliminate input
    reordering
  • Optimized for real input signals

Moreland2003
32
Optimizations for real input
  • Most input images are real
  • Symmetry in Fourier transform of real signals
  • No need to add 0-valued imaginary components
  • Uses half the memory

Moreland2003
33
Optimizations for real input
  • Can pack two reals into a single complex number
    and apply standard complex-gtcomplex FFT
  • Treat one half of 1D input as real, other half as
    imaginary
  • Must untangle results afterwards

Moreland2003
34
Conceptual organization
1D Organization
2D Organization
Real Values
Imaginary Values
  • No data reordering
  • 2D packing
  • Columns 0 and M/2 contain reals
  • Other columns can be paired up as real and
    imaginary components

Real Values
Imaginary Values
Moreland2003
35
Program flow
Moreland2003
36
Implementation
  • OpenGL and Cg
  • NVIDIA GeForce FX 5800 Ultra
  • Ping-ponging

Moreland2003
37
Results
  • Arithmetic-bound
  • 2.5 GFLOPS
  • 3.4 GB/sec memory bandwidth
  • 1024x1024 convolution
  • FFTW on 1.7GHz Xeon took 0.625 seconds
  • Their implementation took 2.7 seconds (with data
    transfer)

Moreland2003
38
Sumanaweera and Liu 2005
  • Can operate on two complex inputs simultaneously
  • Targets 2D images
  • Columns first, then rows
  • Auto-tunes to balance load among
  • Vertex processors
  • Rasterizer
  • Fragment processors

Sumanaweera2005
39
Compute buffers
  • Uses one pbuffer with several draw buffers
  • Use separate buffers for real and imaginary
    complex components
  • NVIDIA Quadro NV4x family has stereo capability
  • Supports up to 8 buffers
  • GL_FRONT_LEFT, GL_BACK_LEFT, GL_FRONT_RIGHT,
    GL_BACK_RIGHT
  • GL_AUX0, GL_AUX1, GL_AUX2, GL_AUX3
  • Allows parallel processing of two 2D images
  • 2 complex images ? 2 components ? (src dst
    buffers)

Sumanaweera2005
40
Incorporating vertex processors and rasterizer
  • In previous implementations, vertex processor is
    basically idle
  • Transforms only 4 vertices for each pass
  • Indices and weights stored in textures
  • Enhancement
  • Fragments can be grouped together into contiguous
    chunks within which indices and angular argument
    of the weight vary linearly

Sumanaweera2005
41
Linear interpolation
  • Note the linear variation in indices used to form
    F0, F1, F2, F3
  • Can exploit interpolation hardware to eliminate
    texture accesses for index lookups
  • Also works for angular arguments of weights

Sumanaweera2005
42
Caveats
  • In early stages, must draw many small rectangles
  • Vertex processors loaded more heavily than
    fragment processors
  • Worse performance than single-quad approach with
    index and weight lookups
  • sincos evaluation in hardware to obtain weights
  • Probably enough arithmetic to hide memory latency

Sumanaweera2005
43
Load balancing
Quads at step
1
2

log N
  • Switch between approaches at some stage decided
    at runtime
  • Auto-tuning code

Index lookups
Interpolated indices
Sumanaweera2005
44
Performance
Sumanaweera2005
45
Govindaraju et al. 2006
  • Emphasis is on cache-efficiency
  • Uses Stockham FFT formulation
  • Like Moreland2003, does not perform initial
    reordering step
  • Simpler indexing functions
  • Supports 1D FFTs
  • Data stored in 2D arrays
  • Better exploitation of GPU cache designed for 2D
    access patterns
  • Makes larger problem sizes possible
  • Divides work up into tiles to improve cache reuse

Govindaraju2006
46
Performance
  • Problem size - 4 million complex numbers
  • On NVIDIA 7800GTX, 64?64 tile size performs best
  • 6.1 GFLOPS on NVIDIA 7900 GTX
  • 32 GB/s memory bandwidth
  • 4-5X faster than two 3.6GHz Xeon processors
    w/Hyperthreading and two dual-core Opteron 280s
    running threaded FFTW
  • Claims to be 3X faster than libgpufft from
    Stanford

Govindaraju2006
47
Other implementations
  • CUFFT
  • CUDAs black-box FFT implementation
  • Mark Harris reports 50 GFLOPS on 8800 GTX for
    more than 4096 elements in 1D
  • GPU FFT
  • Brook implementation
  • T. Jansen, B von Rymon-Lipinski, N. Hanssen, E.
    Keeve, Fourier volume rendering on the GPU using
    a split-stream FFT, Proc. Vision, Modeling, and
    Visualization, pp. 395-403, 2004.
  • 9X faster on ATI Radeon 9800 than 2.6GHz Pentium
    4 for 10242 real image

48
Other implementations
  • T. Schiwietz, T. Chang, P. Speier, and R.
    Westermann, MR image reconstruction using the
    GPU, Proc. SPIE 6142, pp. 1279-1290, 2006.
  • GPU-FFT on ATI X1600 XT vs. 3.0GHz Pentium 4
  • 1.7X faster than FFTW for 10242 complex image
  • 1.9X faster for 5122 complex image
  • O. Fialka and M. Cadik, FFT and convolution
    performance in image filtering on GPU, Proc. of
    Information Visualization, 2006.
  • NVIDIA GeForce 6600GT vs 2.6GHz Intel Celeron D
  • Convolution 3.4X faster for 2562 complex image

49
GPU FFT shootout
  • Tested on NVIDIA GeForce 8800 GTX
  • All tests include bus traffic to GPU
  • No readback
  • Complex inputs
  • GFLOPS derived from estimated FFTW operation
    count
  • 5 N log2(N) / time
  • N is product of dimension sizes

50
GPU FFT shootout
? Dimension not supported
51
References
  • Spitzer, Implementing a GPU-Efficient FFT,
    SIGGRAPH GPGPU Course, 2003.
  • K. Moreland and E. Angel, The FFT on a GPU,
    Graphics Hardware, 2003.
  • J. Mitchell, M. Ansari, E. Hart, Advanced Image
    Processing with DirectX 9 Pixel Shaders, in
    ShaderX2 - Shader Programming Tips and Tricks,
    2003.
  • T. Sumanaweera and D. Liu, Medical Image
    Reconstruction with the FFT, GPU Gems 2, pp.
    765-784, 2005.
  • N. Govindaraju, S. Larsen, J. Gray, and D.
    Manocha, A memory model for scientific
    algorithms on graphics processors, Proc.
    Supercomputing, p. 6, 2006.
Write a Comment
User Comments (0)
About PowerShow.com