1
Efficient FFTs On VIRAM
  • Randi Thomas and Katherine Yelick
  • Computer Science Division
  • University of California, Berkeley
  • IRAM Winter 2000 Retreat
  • {randit, yelick}@cs.berkeley.edu

2
Outline
  • What is the FFT and Why Study it?
  • VIRAM Implementation Assumptions
  • About the FFT
  • The Naïve Algorithm
  • 3 Optimizations to the Naïve Algorithm
  • 32 bit Floating Point Performance Results
  • 16 bit Fixed Point Performance Results
  • Conclusions and Future Work

3
What is the FFT?
  • The Fast Fourier Transform converts a time-domain
    function into a frequency spectrum

4
Why Study The FFT?
  • 1D Fast Fourier Transforms (FFTs) are:
  • Critical for many signal processing problems
  • Used widely for filtering in Multimedia
    Applications
  • Image Processing
  • Speech Recognition
  • Audio & video
  • Graphics
  • Important in many Scientific Applications
  • The building block for 2D/3D FFTs
  • All of these are VIRAM target applications!

6
VIRAM Implementation Assumptions
  • System on the chip
  • Scalar processor: 200 MHz vanilla MIPS core
  • Embedded DRAM: 32 MB, 16 banks, no subbanks
  • Memory crossbar: 25.6 GB/s
  • Vector processor: 200 MHz
  • I/O: 4 x 100 MB/s
7
VIRAM Implementation Assumptions
  • Vector Processor has four 64-bit pipelines (lanes)
  • Each lane has:
  • 2 integer functional units
  • 1 floating point functional unit
  • All functional units have a 1 cycle multiply-add
    operation
  • Each lane can be subdivided into:
  • two 32-bit virtual lanes
  • four 16-bit virtual lanes

8
Peak Performance
  • Peak Performance of This VIRAM Implementation
  • Implemented:
  • A 32 bit Floating point version (8 lanes, 8 FUs)
  • A 16 bit Fixed point version (16 lanes, 32 FUs)

10
Computing the DFT (Discrete FT)
  • Given the N-element vector x, its 1D DFT is
    another N-element vector y, given by the formula
    $y_j = \sum_{k=0}^{N-1} \omega^{jk} x_k$
  • where $\omega^{jk}$ is the $jk$-th power of $\omega$, a
    primitive N-th root of unity
  • N is referred to as the number of points
  • The FFT (Fast FT)
  • Uses algebraic identities to compute the DFT in
    O(N log N) steps
  • The computation is organized into log2 N stages
    for the radix 2 FFT
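
As a minimal sketch of the two ideas on this slide (not the VIRAM code), the Python below compares a direct O(N²) DFT with a recursive radix-2 FFT. The function names `dft` and `fft_radix2` are hypothetical, and the sign convention ω = exp(-2πi/N) is an assumption, since the slide's formula did not survive in the transcript.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT: y[j] = sum over k of w**(j*k) * x[k]."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)  # assumed sign convention
    return [sum(w ** (j * k) * x[k] for k in range(n)) for j in range(n)]

def fft_radix2(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # half-size transform of even-indexed points
    odd = fft_radix2(x[1::2])    # half-size transform of odd-indexed points
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # root of unity times data
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

x = [complex(v) for v in (1, 2, 3, 4, 0, -1, -2, -3)]
# Both routines should agree to within rounding error.
assert all(abs(a - b) < 1e-9 for a, b in zip(dft(x), fft_radix2(x)))
```

Each level of recursion corresponds to one of the log2 N stages: an 8-point transform passes through 3 levels of combine steps.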

11
Computing A Complex FFT
  • Basic computation for a radix 2 FFT
  • The basic computation on VIRAM for Floating Point
    data points
  • 2 multiply-adds + 2 multiplies + 4 adds
    = 8 operations
  • 2 GFLOP/s is the VIRAM Peak Performance for this
    mix of instructions
  • Xi are the data points
  • w is a root of unity
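
The operation count and the 2 GFLOP/s figure can be checked with a short sketch. The butterfly form used here (x0 + w·x1, x0 - w·x1) is an assumption, because the slide's own formula is missing from the transcript; the function name `butterfly` is hypothetical, and the multiply-subtract in the real part is counted as a multiply-add. The lane count and clock rate come from the earlier slides.

```python
def butterfly(x0, x1, w):
    """Radix-2 butterfly on (re, im) tuples: returns (x0 + w*x1, x0 - w*x1)."""
    (a_re, a_im), (b_re, b_im), (w_re, w_im) = x0, x1, w
    # Complex product t = w * x1: 2 multiplies + 2 multiply-adds
    t_re = w_re * b_re            # multiply
    t_re = -w_im * b_im + t_re    # multiply-add (negated operand)
    t_im = w_re * b_im            # multiply
    t_im = w_im * b_re + t_im     # multiply-add
    # 4 adds/subtracts
    return (a_re + t_re, a_im + t_im), (a_re - t_re, a_im - t_im)

LANES = 8              # 32-bit virtual lanes, one floating point unit each
CLOCK_HZ = 200e6       # vector processor clock
INSTRUCTIONS = 8       # 2 multiply-adds + 2 multiplies + 4 adds
FLOPS = 2 * 2 + 2 + 4  # a multiply-add counts as two floating point operations

peak_gflops = LANES * CLOCK_HZ * FLOPS / INSTRUCTIONS / 1e9
print(peak_gflops)  # 2.0
```

Under these assumptions the 8 instructions perform 10 floating point operations, which is where 2 GFLOP/s comes from rather than 1.6 GFLOP/s (no multiply-adds) or 3.2 GFLOP/s (all multiply-adds).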

12
Vector Terminology
  • The Maximum Vector Length (MVL)
  • The maximum number of elements 1 vector register can
    hold
  • Set automatically by the architecture
  • Based on the data width the algorithm is using
  • 64-bit data, MVL = 32 elements/vector register
  • 32-bit data, MVL = 64 elements/vector register
  • 16-bit data, MVL = 128 elements/vector register
  • The Vector Length (VL)
  • The total number of elements to be computed
  • Set by the algorithm: the inner for-loop
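
The relationship between data width, MVL and VL can be sketched as follows. The register size of 2048 bits is inferred from the MVL values listed above; the helper names and the chunking scheme (processing a long loop in pieces of at most MVL elements, usually called strip-mining) are illustrative assumptions, not something the slides describe.

```python
REGISTER_BITS = 64 * 32   # inferred from the slide: 64-bit data gives MVL = 32

def mvl(width_bits):
    """Maximum vector length for a given element width."""
    return REGISTER_BITS // width_bits

def chunk_lengths(n_elements, width_bits):
    """VL used by each vector instruction when n_elements are processed
    in chunks of at most MVL elements (hypothetical strip-mined loop)."""
    limit = mvl(width_bits)
    return [min(limit, n_elements - start)
            for start in range(0, n_elements, limit)]

print([mvl(w) for w in (64, 32, 16)])   # [32, 64, 128]
print(chunk_lengths(150, 32))           # [64, 64, 22]
```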

13
One More (FFT) Term!
  • A butterfly group (BG)
  • A set of elements that can be computed upon in 1
    FFT stage using
  • The same basic computation
  • AND
  • The same root of unity
  • The number of elements in a stage's BG determines
    the Vector Length (VL) for that stage

15
Cooley-Tukey FFT Algorithm
[Figure legend: vr1 and vr2 are vector registers; 1 butterfly group; VL = vector length]
16
Vectorizing the FFT
  • Diagram illustrates naïve vectorization
  • A stage vectorizes well when VL ≥ MVL
  • Poor HW utilization when VL is small (< MVL)
  • Later stages of the FFT have shorter vector
    lengths
  • the number of elements in one butterfly group is
    smaller in the later stages

[Figure: naïve vectorization over time. Stage 1: VL = 8; Stage 2: VL = 4; Stage 3: VL = 2; Stage 4: VL = 1. Each stage operates on vector registers vr1 and vr2.]
17
Naïve Algorithm: What Happens When Vector
Lengths Get Short?
32 bit Floating Point
[Figure marker: VL = 64 = MVL]
  • Performance peaks (1.4-1.8 GFLOP/s) if vector
    lengths are ≥ MVL
  • For all FFT sizes, 94% to 99% of the total time
    is spent doing the last 6 stages, when VL < MVL
    (= 64)
  • For 1024 point FFT, only 60% of the work is done
    in the last 6 stages
  • Performance significantly drops when vector
    lengths < # lanes (8)
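
To make the stage-by-stage picture concrete, the sketch below tabulates VL and the number of butterfly groups per stage for a radix-2 FFT, assuming (as in the 4-stage figure) that VL halves and the number of groups doubles at every stage. The second function is only a crude count of vector code passes, one per butterfly group per MVL-sized chunk; it is not a timing model, and both function names are hypothetical.

```python
def stage_table(n):
    """(stage, VL, butterfly groups) for each stage of an n-point radix-2 FFT."""
    stages = n.bit_length() - 1          # log2(n) when n is a power of two
    return [(s, n >> s, 1 << (s - 1)) for s in range(1, stages + 1)]

def vector_passes(n, mvl=64):
    """(stage, VL, passes): each butterfly group needs ceil(VL / MVL) passes."""
    return [(s, vl, groups * -(-vl // mvl)) for s, vl, groups in stage_table(n)]

table = stage_table(1024)
short = [row for row in table if row[1] < 64]
print(len(short), "of", len(table), "stages have VL < MVL")
# 6 of 10 stages have VL < MVL

passes = vector_passes(1024)
short_share = sum(p for _, vl, p in passes if vl < 64) / sum(p for _, _, p in passes)
print(round(short_share, 3))  # 0.969
```

The last 6 of 10 stages hold 60% of the arithmetic, yet under this count they issue the large majority of the vector code passes, each on a mostly empty register. That is the mismatch the slide describes, although the real time split also depends on pipeline and scalar overheads that this count ignores.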

19
Optimization 1: Add auto-increment
  • Automatically adds an increment to the current
    address in order to obtain the next address
  • Auto-increment helps to reduce the scalar code
    overhead
  • Useful:
  • To jump to the next butterfly group in an FFT
    stage
  • For processing a sub-image of a larger image in
    order to jump to the appropriate pixel in the next row

20
Optimization 1: Add auto-increment
  • Small gain from auto-increment
  • For 1024 point FFT:
  • 202 MFLOP/s w/o AI
  • 225 MFLOP/s with AI
  • Still 94-99% of the time is spent in the last 6
    stages, where the VL < 64
  • Conclusion: Auto-increment helps, but scalar
    overhead is not the main source of the
    inefficiency

32 bit Floating Point
21
Optimization 2: Memory Transposes
  • Reorganize the data layout in memory to maximize
    the vector length in later FFT stages
  • View the 1D vector as a 2D matrix
  • Reorganization is equivalent to a matrix
    transpose
  • Transposing the data in memory only works for N
    ≥ (2 MVL)
  • Transposing in memory adds significant overhead
  • Increased memory traffic
  • Cost too high to make it worthwhile
  • Multiple transposes exacerbate the situation

22
Optimization 3: Register Transposes
  • Rearrange the elements in the vector registers
  • Provides a way to swap elements between 2
    registers
  • What we want to swap (after stage 1, VL = MVL =
    8):

[Figure: register contents for VL = 4 with 2 BGs and for VL = 2 with 4 BGs]
  • This behavior is hard to implement with one
    instruction in hardware

23
Optimization 3: Register Transposes
  • Two instructions were added to the VIRAM
    Instruction Set Architecture (ISA)
  • vhalfup and vhalfdn: both move elements one-way
    between vector registers
  • Vhalfup/dn:
  • Are extensions of already existing ISA support
    for fast in-register reductions
  • Required minimal additional hardware support,
    mostly control lines
  • Much simpler and less costly than a general
    element permutation instruction, which was
    rejected in the early VIRAM design phase
  • An elegant, inexpensive, powerful solution to the
    short vector length problem of the later stages
    of the FFT

24
Optimization 3: Register Transposes
Stage 1
SWAP
  • Three steps to swap elements:
  • Copy vr1 into vr3
  • Move vr2's low half to vr1's high half (vhalfup);
    vr1 is now done
  • Move vr3's high half to vr2's low half (vhalfdn);
    vr2 is now done
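
The three-step swap can be emulated on plain Python lists. The functions `vhalfup` and `vhalfdn` below are hypothetical stand-ins that model only the one-way half-register moves described on this slide; they are not the actual VIRAM instruction semantics or encoding. The example uses MVL = 8 and a 16-point FFT, matching the earlier figure.

```python
def vhalfup(dst, src, half):
    """Emulation: copy the low `half` elements of src into the high half of dst."""
    out = list(dst)
    out[half:2 * half] = src[:half]
    return out

def vhalfdn(dst, src, half):
    """Emulation: copy the high `half` elements of src into the low half of dst."""
    out = list(dst)
    out[:half] = src[half:2 * half]
    return out

# After stage 1, vr1 holds points 0..7 and vr2 holds points 8..15.
vr1 = list(range(0, 8))
vr2 = list(range(8, 16))

vr3 = list(vr1)               # step 1: copy vr1 into vr3
vr1 = vhalfup(vr1, vr2, 4)    # step 2: vr2's low half -> vr1's high half
vr2 = vhalfdn(vr2, vr3, 4)    # step 3: vr3's high half -> vr2's low half

print(vr1)  # [0, 1, 2, 3, 8, 9, 10, 11]
print(vr2)  # [4, 5, 6, 7, 12, 13, 14, 15]
```

After the swap, element i of vr1 and element i of vr2 are exactly 4 points apart, so a full-length (VL = 8) vector operation on the two registers covers both butterfly groups of stage 2 at once.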

25
Optimization 3: Final Algorithm
  • The optimized algorithm has two phases:
  • Naïve algorithm is used for stages whose VL ≥ MVL
  • Vhalfup/dn code is used on stages whose
    VL < MVL: the last log2(MVL) stages
  • Vhalfup/dn:
  • Eliminates short vector length problem
  • Allows all vector computations to have VL equal
    to MVL
  • Multiple butterfly groups done with 1 basic
    operation
  • Eliminates all loads/stores between these stages
  • Optimized vhalf algorithm does:
  • Auto-increment, software pipelining, code
    scheduling
  • The bit reversal rearrangements of the results
  • Single precision, floating point, complex,
    radix-2 FFTs

26
Optimization 3: Register Transposes
32 bit Floating Point
  • Every vector instruction operates with VL = MVL
    for all stages
  • Keeps the vector pipeline fully utilized
  • Time spent in the last 6 stages
    drops to 60% to 80% of the total time

28
Performance Results
32 bit Floating Point
  • Both Naïve versions utilize the auto-increment
    feature
  • One does bit reversal, the other does not
  • Vhalfup/dn performance with and without bit
    reversal is identical
  • Bit reversing the results slows naïve algorithm,
    but not vhalfup/dn
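
For readers unfamiliar with the term, an in-place radix-2 FFT typically leaves its results in bit-reversed index order, and bit reversal is the permutation that restores natural order. A minimal sketch follows; the function name is hypothetical, and the slides do not say how the permutation is implemented on VIRAM.

```python
def bit_reverse_permute(values):
    """Move the element at index i to the index whose binary digits are i's reversed."""
    n = len(values)                      # assumed to be a power of two
    bits = n.bit_length() - 1
    out = [None] * n
    for i, v in enumerate(values):
        rev = int(format(i, "0{}b".format(bits))[::-1], 2)
        out[rev] = v
    return out

print(bit_reverse_permute(list(range(8))))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Applying the permutation twice returns the original order, so the same routine converts in either direction.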

29
Performance Results
32 bit Floating Point
  • The performance gap testifies to:
  • The effectiveness of the vhalfup/dn algorithm
    in fully utilizing the vector unit
  • The importance of the new vhalfup/dn instructions

30
Performance Results
32 bit Floating Point
  • VIRAM is competitive with high-end specialized
    Floating Point DSPs
  • Could match or exceed the performance of these
    DSPs if the VIRAM architecture were implemented
    commercially

32
16 bit Fixed Point Implementation
  • Resources:
  • 16 lanes (each 16 bits wide)
  • Two Integer Functional Units per lane
  • 32 Operations/Cycle
  • MVL = 128 elements
  • Fixed Point Multiply-Add not utilized:
  • 8 bit operands too small
  • 8 bits × 8 bits = 16 bit product
  • 32 bit product too big
  • 16 bits × 16 bits = 32 bit product

33
16 bit Fixed Point Implementation (2)
  • The basic computation takes:
  • 4 multiplies + 4 adds + 2 subtracts = 10
    operations
  • 6.4 GOP/s is Peak Performance for this mix
  • To prevent overflow two bits are shifted right
    and lost for each stage
  • Input
  • Sbbb bbbb bbbb bbbb.
  • Output
  • Sbbb bbbb bbbb bbbb bb.

[Figure labels: the trailing "." marks the decimal point; the two extra low-order bits of the output are shifted out]
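
A minimal sketch of a 16-bit fixed point butterfly with the two-bit shift is shown below. The Q15 format for the root of unity, the point at which the shift is applied, and the function name are all assumptions for illustration. The exact split between adds and subtracts depends on how the roots of unity are stored, so this version does not reproduce the slide's 4 adds + 2 subtracts literally. Two bits are enough because a butterfly output can grow by at most a factor of 1 + √2 relative to the largest input component.

```python
def fixed_butterfly(x0, x1, w, frac_bits=15, shift=2):
    """Butterfly on integer (re, im) tuples; w is a root of unity in Q15."""
    (a_re, a_im), (b_re, b_im), (w_re, w_im) = x0, x1, w
    # 4 multiplies; each 32-bit product is scaled back to the working format
    p1 = (w_re * b_re) >> frac_bits
    p2 = (w_im * b_im) >> frac_bits
    p3 = (w_re * b_im) >> frac_bits
    p4 = (w_im * b_re) >> frac_bits
    t_re, t_im = p1 - p2, p3 + p4
    # 4 more adds/subtracts, then drop `shift` low bits so results fit in 16 bits
    return (((a_re + t_re) >> shift, (a_im + t_im) >> shift),
            ((a_re - t_re) >> shift, (a_im - t_im) >> shift))

# Near-worst-case inputs still land inside the signed 16-bit range.
out = fixed_butterfly((32767, 32767), (32767, -32768), (23170, 23170))
assert all(-32768 <= v <= 32767 for pair in out for v in pair)

LANES, FUS_PER_LANE, CLOCK_HZ = 16, 2, 200e6   # from the slides
peak_ops = LANES * FUS_PER_LANE * CLOCK_HZ
print(peak_ops / 1e9)   # 6.4
```

The peak figure assumes every integer functional unit retires one operation per cycle with no multiply-adds, which matches the 6.4 GOP/s quoted for this instruction mix.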
34
Performance Results
16 bit Fixed Point
  • Fixed Point is Faster than Floating point on
    VIRAM
  • 1024 pt: 28.3 µs versus 37 µs
  • This implementation attains 4 GOP/s for 1024 pt
    FFT and is an unoptimized work in progress!

35
Performance Results
16 bit Fixed Point
  • Again VIRAM is competitive with high-end
    specialized DSPs
  • CRI Scorpio 24 bit complex fixed point FFT DSP
  • 1024 pt: 7 microseconds

37
Conclusions
  • Optimizations to eliminate short vector lengths
    are necessary for doing the FFT
  • VIRAM is capable of performing FFTs at
    performance levels comparable to or exceeding
    those of high-end floating point DSPs. It
    achieves this performance via:
  • A highly tuned algorithm designed specifically
    for VIRAM
  • A set of simple, powerful ISA extensions that
    underlie it
  • Efficient parallelism of vector processing
    embedded in a high-bandwidth on-chip DRAM memory

38
Conclusions (2)
  • Performance of FFTs on VIRAM has the potential to
    improve significantly over the results presented
    here
  • 32-bit fixed point FFTs could run up to 2 times
    faster than floating point versions
  • Compared to 32-bit fixed point FFTs, 16-bit fixed
    point FFTs could run up to:
  • 8x faster (with multiply-add ops)
  • 4x faster (with no multiply-add ops)
  • Adding a second Floating Point Functional Unit
    would make floating point performance comparable
    to the 32-bit Fixed Point performance.
  • 4 GOP/s for Unoptimized Fixed Point
    implementation (6.4 GOP/s is peak!)

39
Conclusions (3)
  • Since VIRAM includes both general-purpose CPU
    capability and DSP muscle, it shares the same
    space in the emerging market of hybrid CPU/DSPs
    as
  • Infineon TriCore
  • Hitachi SuperH-DSP
  • Motorola/Lucent StarCore
  • Motorola PowerPC G4 (7400)
  • VIRAM's vector processor plus embedded DRAM
    design may have further advantages over more
    traditional processors in:
  • Power
  • Area
  • Performance

40
Future Work
  • On the current Fixed Point implementation:
  • Further optimizations and tests
  • Explore the tradeoffs between precision/accuracy
    and performance by implementing:
  • A hybrid of the current implementation which
    alternates the number of bits shifted off each
    stage
  • 2 1 1 1 2 1 1 1...
  • A 32 bit integer version which uses 16 bit data
  • If data occupies the 16 most significant bits of
    the 32 bits, then there are 16 zeros to shift
    off
  • Sbbb bbbb bbbb bbbb b000 0000 0000 0000 0000

41
Backup Slides
43
Why Vectors For IRAM?
  • Low complexity architecture
  • means lower power and area
  • Takes advantage of on-chip memory bandwidth
  • 100x bandwidth of Work Station memory hierarchies
  • High performance for apps w/ fine-grained parallelism
  • Delayed pipeline hides memory latency
  • Therefore no cache is necessary
  • further conserves power and area
  • Greater code density than VLIW designs like:
  • TI's TMS320C6000
  • Motorola/Lucent StarCore
  • AD's TigerSHARC
  • Siemens (Infineon) Carmel