Title: Efficient FFTs On VIRAM
1 Efficient FFTs On VIRAM
- Randi Thomas and Katherine Yelick
- Computer Science Division
- University of California, Berkeley
- IRAM Winter 2000 Retreat
- {randit, yelick}@cs.berkeley.edu
2 Outline
- What is the FFT and Why Study it?
- VIRAM Implementation Assumptions
- About the FFT
- The Naïve Algorithm
- 3 Optimizations to the Naïve Algorithm
- 32 bit Floating Point Performance Results
- 16 bit Fixed Point Performance Results
- Conclusions and Future Work
3 What is the FFT?
- The Fast Fourier Transform converts a time-domain function into a frequency spectrum
4 Why Study The FFT?
- 1D Fast Fourier Transforms (FFTs) are
- Critical for many signal processing problems
- Used widely for filtering in Multimedia Applications
- Image Processing
- Speech Recognition
- Audio/Video
- Graphics
- Important in many Scientific Applications
- The building block for 2D/3D FFTs
- All of these are VIRAM target applications!
5 Outline
- What is the FFT and Why Study it?
- VIRAM Implementation Assumptions
- About the FFT
- The Naïve Algorithm
- 3 Optimizations to the Naïve Algorithm
- 32 bit Floating Point Performance Results
- 16 bit Fixed Point Performance Results
- Conclusions and Future Work
6 VIRAM Implementation Assumptions
- System on a chip
- Scalar processor: 200 MHz vanilla MIPS core
- Embedded DRAM: 32 MB, 16 banks, no subbanks
- Memory crossbar: 25.6 GB/s
- Vector processor: 200 MHz
- I/O: 4 x 100 MB/s
7 VIRAM Implementation Assumptions
- The vector processor has four 64-bit pipelines (lanes)
- Each lane has
- 2 integer functional units
- 1 floating point functional unit
- All functional units have a 1-cycle multiply-add operation
- Each lane can be subdivided into
- two 32-bit virtual lanes
- four 16-bit virtual lanes
8 Peak Performance
- Peak performance of this VIRAM implementation
- Implemented:
- A 32 bit Floating point version (8 virtual lanes, 8 FUs)
- A 16 bit Fixed point version (16 virtual lanes, 32 FUs)
9 Outline
- What is the FFT and Why Study it?
- VIRAM Implementation Assumptions
- About the FFT
- The Naïve Algorithm
- 3 Optimizations to the Naïve Algorithm
- 32 bit Floating Point Performance Results
- 16 bit Fixed Point Performance Results
- Conclusions and Future Work
10 Computing the DFT (Discrete FT)
- Given the N-element vector x, its 1D DFT is another N-element vector y, given by
  $y_j = \sum_{k=0}^{N-1} x_k \,\omega^{jk}$, where $\omega = e^{-2\pi i/N}$ is the Nth root of unity
- N is referred to as the number of points
- The FFT (Fast FT)
- Uses algebraic identities to compute the DFT in O(N log N) steps
- The computation is organized into log2 N stages for the radix-2 FFT
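For reference, here is a minimal C sketch of the definition above (our own illustrative code, not from the talk). It evaluates the sum directly in O(N^2) time, which is exactly the work the FFT's O(N log N) reorganization avoids:

    #include <complex.h>

    /* Direct O(N^2) evaluation of the DFT definition above
       (illustrative sketch only, not code from the talk). */
    void dft(const double complex *x, double complex *y, int n) {
        const double two_pi = 6.28318530717958647692;
        for (int j = 0; j < n; j++) {
            double complex sum = 0;
            for (int k = 0; k < n; k++)
                /* omega^(j*k), with omega = e^(-2*pi*i/N) */
                sum += x[k] * cexp(-(two_pi * (double)j * k / n) * I);
            y[j] = sum;
        }
    }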
11 Computing A Complex FFT
- Basic computation for a radix 2 FFT
- The basic computation on VIRAM for floating point data points:
- 2 multiply-adds + 2 multiplies + 4 adds = 8 operations
- 2 GFLOP/s is the VIRAM peak performance for this mix of instructions
- The Xi are the data points
- w is a root of unity
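In C, the basic computation looks like the sketch below (our own code, in the usual radix-2 form). The complex twiddle multiply w*b plus the complex add/subtract pair account for the 8 operations counted above:

    #include <complex.h>

    /* One radix-2 butterfly: a' = a + w*b, b' = a - w*b (our sketch).
       Expanded into real arithmetic on VIRAM, the complex multiply w*b
       is 2 multiply-adds + 2 multiplies, and the two complex
       add/subtracts are 4 adds: the 8 operations counted above. */
    static inline void butterfly(double complex *a, double complex *b,
                                 double complex w) {
        double complex t = w * (*b);
        *b = *a - t;
        *a = *a + t;
    }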
12 Vector Terminology
- The Maximum Vector Length (MVL)
- The maximum number of elements 1 vector register can hold
- Set automatically by the architecture
- Based on the data width the algorithm is using:
- 64-bit data: MVL = 32 elements/vector register
- 32-bit data: MVL = 64 elements/vector register
- 16-bit data: MVL = 128 elements/vector register
- The Vector Length (VL)
- The total number of elements to be computed
- Set by the algorithm (the inner for-loop)
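To illustrate how VL relates to MVL, here is a strip-mining sketch in plain C (hypothetical, not VIRAM source): the application's vector length is processed in register-sized chunks of at most MVL elements.

    /* Strip-mining sketch (hypothetical plain C, not VIRAM source): a
       vectorizable loop over 'vl' elements is issued in chunks of at
       most MVL elements, one vector instruction per chunk. */
    enum { MVL = 64 };                      /* 32-bit data on this design */

    void vscale(float *dst, const float *src, int vl, float c) {
        for (int i = 0; i < vl; i += MVL) {
            int chunk = (vl - i < MVL) ? vl - i : MVL;
            for (int j = 0; j < chunk; j++) /* one vector op, length = chunk */
                dst[i + j] = c * src[i + j];
        }
    }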
13 One More (FFT) Term!
- A butterfly group (BG)
- A set of elements that can be computed upon in 1 FFT stage using
- The same basic computation
- AND
- The same root of unity
- The number of elements in a stage's BG determines the Vector Length (VL) for that stage
14 Outline
- What is the FFT and Why Study it?
- VIRAM Implementation Assumptions
- About the FFT
- The Naïve Algorithm
- 3 Optimizations to the Naïve Algorithm
- 32 bit Floating Point Performance Results
- 16 bit Fixed Point Performance Results
- Conclusions and Future Work
15 Cooley-Tukey FFT Algorithm
[Figure: Cooley-Tukey butterfly diagram; vr1, vr2 = vector registers; 1 butterfly group; VL = vector length]
16 Vectorizing the FFT
- The diagram illustrates the naïve vectorization
- A stage vectorizes well when VL ≥ MVL
- Poor HW utilization when VL is small (< MVL)
- Later stages of the FFT have shorter vector lengths
- The number of elements in one butterfly group is smaller in the later stages
[Figure: butterfly diagram for a 16-point FFT over time, computed in vector registers vr1 and vr2; Stage 1: VL = 8, Stage 2: VL = 4, Stage 3: VL = 2, Stage 4: VL = 1]
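A scalar C sketch of the naïve loop structure (our reconstruction, not the VIRAM source) makes the problem concrete: the innermost loop, the one that vectorizes, runs over the butterflies sharing one root of unity, and its trip count halves every stage.

    #include <complex.h>

    /* Naive iterative radix-2 FFT, decimation in time (our reconstruction,
       not the VIRAM source). Assumes n is a power of 2 and x[] is already
       in bit-reversed order. The k-loop is the one that vectorizes: all of
       its butterflies share the same root of unity w (one butterfly group),
       and its trip count -- the VL -- is n/m, which halves every stage. */
    void fft_naive(double complex *x, int n) {
        const double pi = 3.14159265358979323846;
        for (int m = 2; m <= n; m *= 2) {            /* log2(n) stages */
            for (int j = 0; j < m / 2; j++) {        /* one BG per twiddle */
                double complex w = cexp(-2.0 * pi * I * j / m);
                for (int k = j; k < n; k += m) {     /* VL = n/m */
                    double complex t = w * x[k + m / 2];
                    x[k + m / 2] = x[k] - t;
                    x[k]         = x[k] + t;
                }
            }
        }
    }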
17 Naïve Algorithm: What Happens When Vector Lengths Get Short?
[Chart: 32 bit Floating Point performance per stage; VL = 64 = MVL]
- Performance peaks (1.4-1.8 GFLOP/s) if vector lengths are ≥ MVL
- For all FFT sizes, 94% to 99% of the total time is spent doing the last 6 stages, when VL < MVL (= 64)
- For a 1024 point FFT, only 60% of the work is done in the last 6 stages
- Performance drops significantly when vector lengths < the number of lanes (8)
18 Outline
- What is the FFT and Why Study it?
- VIRAM Implementation Assumptions
- About the FFT
- The Naïve Algorithm
- 3 Optimizations to the Naïve Algorithm
- 32 bit Floating Point Performance Results
- 16 bit Fixed Point Performance Results
- Conclusions and Future Work
19 Optimization 1: Add auto-increment
- Automatically adds an increment to the current address in order to obtain the next address
- Auto-increment helps to
- Reduce the scalar code overhead
- Useful
- To jump to the next butterfly group in an FFT stage
- For processing a sub-image of a larger image, in order to jump to the appropriate pixel in the next row
20 Optimization 1: Add auto-increment
- Small gain from auto-increment
- For a 1024 point FFT:
- 202 MFLOP/s without AI
- 225 MFLOP/s with AI
- Still 94-99% of the time is spent in the last 6 stages, where VL < 64
- Conclusion: auto-increment helps, but scalar overhead is not the main source of the inefficiency
[Chart: 32 bit Floating Point]
21 Optimization 2: Memory Transposes
- Reorganize the data layout in memory to maximize the vector length in later FFT stages
- View the 1D vector as a 2D matrix
- The reorganization is equivalent to a matrix transpose
- Transposing the data in memory only works for N ≥ 2 x MVL
- Transposing in memory adds significant overhead
- Increased memory traffic
- The cost is too high to make it worthwhile
- Multiple transposes exacerbate the situation
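A minimal sketch of the data reorganization (our own code, with our own naming): treat the length-N data as a rows x cols matrix and transpose it in memory, at the cost of the extra traffic noted above.

    #include <complex.h>

    /* Out-of-place transpose of length rows*cols data viewed as a 2D
       matrix (our sketch). Every element makes one extra trip through
       memory -- the added traffic that makes this optimization a loss. */
    void transpose(double complex *dst, const double complex *src,
                   int rows, int cols) {
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                dst[c * rows + r] = src[r * cols + c];
    }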
22 Optimization 3: Register Transposes
- Rearrange the elements in the vector registers
- Provides a way to swap elements between 2 registers
- What we want to swap (after stage 1, where VL = MVL = 8):
- VL = 4, # of BGs = 2
- VL = 2, # of BGs = 4
- This behavior is hard to implement with one instruction in hardware
23 Optimization 3: Register Transposes
- Two instructions were added to the VIRAM Instruction Set Architecture (ISA)
- vhalfup and vhalfdn both move elements one-way between vector registers
- Vhalfup/dn
- Are extensions of already existing ISA support for fast in-register reductions
- Required minimal additional hardware support (mostly control lines)
- Are much simpler and less costly than a general element permutation instruction, which was rejected in the early VIRAM design phase
- Are an elegant, inexpensive, powerful solution to the short vector length problem of the later stages of the FFT
24 Optimization 3: Register Transposes
[Figure: swapping register halves after Stage 1]
- Three steps to swap elements:
- Copy vr1 into vr3
- Move vr2's low half to vr1's high half (vhalfup); vr1 now done
- Move vr3's high half to vr2's low half (vhalfdn); vr2 now done
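A scalar emulation of the three steps (hypothetical C, with vector registers modeled as arrays) might look like the sketch below; the real vhalfup/vhalfdn each move half a register one way in a single instruction.

    /* Scalar emulation of the three-step swap above, modeling vector
       registers as arrays of MVL elements (hypothetical, not VIRAM code). */
    enum { MVL = 64 };

    void swap_halves(float vr1[MVL], float vr2[MVL]) {
        float vr3[MVL];
        const int h = MVL / 2;
        for (int i = 0; i < MVL; i++)   /* step 1: copy vr1 into vr3 */
            vr3[i] = vr1[i];
        for (int i = 0; i < h; i++)     /* step 2 (vhalfup): vr2 low -> vr1 high */
            vr1[h + i] = vr2[i];
        for (int i = 0; i < h; i++)     /* step 3 (vhalfdn): vr3 high -> vr2 low */
            vr2[i] = vr3[h + i];
    }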
25 Optimization 3: Final Algorithm
- The optimized algorithm has two phases:
- The naïve algorithm is used for stages whose VL ≥ MVL
- The vhalfup/dn code is used on stages whose VL < MVL (the last log2(MVL) stages)
- Vhalfup/dn
- Eliminates the short vector length problem
- Allows all vector computations to have VL equal to MVL
- Multiple butterfly groups are done with 1 basic operation
- Eliminates all loads/stores between these stages
- The optimized vhalf algorithm does
- Auto-increment, software pipelining, and code scheduling
- The bit reversal rearrangement of the results
- Single precision, floating point, complex, radix-2 FFTs
26 Optimization 3: Register Transposes
[Chart: 32 bit Floating Point]
- Every vector instruction operates with VL = MVL
- For all stages
- Keeps the vector pipeline fully utilized
- Time spent in the last 6 stages drops to 60% to 80% of the total time
27 Outline
- What is the FFT and Why Study it?
- VIRAM Implementation Assumptions
- About the FFT
- The Naïve Algorithm
- 3 Optimizations to the Naïve Algorithm
- 32 bit Floating Point Performance Results
- 16 bit Fixed Point Performance Results
- Conclusions and Future Work
28 Performance Results
[Chart: 32 bit Floating Point]
- Both naïve versions utilize the auto-increment feature
- 1 does bit reversal, the other does not
- Vhalfup/dn with and without bit reversal are identical
- Bit reversing the results slows the naïve algorithm, but not vhalfup/dn
29 Performance Results
[Chart: 32 bit Floating Point]
- The performance gap testifies
- To the effectiveness of the vhalfup/dn algorithm in fully utilizing the vector unit
- To the importance of the new vhalfup/dn instructions
30 Performance Results
[Chart: 32 bit Floating Point]
- VIRAM is competitive with high-end specialized floating point DSPs
- It could match or exceed the performance of these DSPs if the VIRAM architecture were implemented commercially
31 Outline
- What is the FFT and Why Study it?
- VIRAM Implementation Assumptions
- About the FFT
- The Naïve Algorithm
- 3 Optimizations to the Naïve Algorithm
- 32 bit Floating Point Performance Results
- 16 bit Fixed Point Performance Results
- Conclusions and Future Work
32 16 bit Fixed Point Implementation
- Resources
- 16 virtual lanes (each 16 bits wide)
- Two integer functional units per lane
- 32 operations/cycle
- MVL = 128 elements
- The fixed point multiply-add is not utilized
- 8 bit operands are too small: 8 bits x 8 bits = 16 bit product
- A 32 bit product is too big: 16 bits x 16 bits = 32 bit product
33 16 bit Fixed Point Implementation (2)
- The basic computation takes
- 4 multiplies + 4 adds + 2 subtracts = 10 operations
- 6.4 GOP/s is the peak performance for this mix
- To prevent overflow, two bits are shifted right and lost for each stage (see the sketch after this list):
- Input: Sbbb bbbb bbbb bbbb.
- Output: Sbbb bbbb bbbb bbbb bb. (18 bits; the low two bits are shifted out to fit back into 16, moving the decimal point)
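A fixed-point butterfly sketch matching the scaling described above. The Q15-style format, helper names, and arithmetic right shift are our assumptions, not the VIRAM source:

    #include <stdint.h>

    /* Fixed-point radix-2 butterfly sketch (our assumptions, not VIRAM
       code). Uses 4 multiplies and 6 adds/subtracts -- the 10 operations
       counted above -- and shifts two bits off each result per stage. */
    static inline int16_t mul_q15(int16_t a, int16_t b) {
        return (int16_t)(((int32_t)a * (int32_t)b) >> 15); /* 16x16 -> 32 */
    }

    static inline void butterfly_fx(int16_t a[2], int16_t b[2],
                                    int16_t wr, int16_t wi) {
        int16_t tr = (int16_t)(mul_q15(wr, b[0]) - mul_q15(wi, b[1]));
        int16_t ti = (int16_t)(mul_q15(wr, b[1]) + mul_q15(wi, b[0]));
        int16_t ar = a[0], ai = a[1];
        a[0] = (int16_t)((ar + tr) >> 2);   /* a' = (a + t) >> 2 */
        a[1] = (int16_t)((ai + ti) >> 2);
        b[0] = (int16_t)((ar - tr) >> 2);   /* b' = (a - t) >> 2 */
        b[1] = (int16_t)((ai - ti) >> 2);
    }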
34 Performance Results
[Chart: 16 bit Fixed Point]
- Fixed point is faster than floating point on VIRAM
- 1024 pt: 28.3 µs versus 37 µs
- This implementation attains 4 GOP/s for a 1024 pt FFT, and it is
- An unoptimized work in progress!
35 Performance Results
[Chart: 16 bit Fixed Point]
- Again, VIRAM is competitive with high-end specialized DSPs
- CRI Scorpio, a 24 bit complex fixed point FFT DSP:
- 1024 pt: 7 µs
36 Outline
- What is the FFT and Why Study it?
- VIRAM Implementation Assumptions
- About the FFT
- The Naïve Algorithm
- 3 Optimizations to the Naïve Algorithm
- 32 bit Floating Point Performance Results
- 16 bit Fixed Point Performance Results
- Conclusions and Future Work
37 Conclusions
- Optimizations to eliminate short vector lengths are necessary for doing the FFT
- VIRAM is capable of performing FFTs at performance levels comparable to or exceeding those of high-end floating point DSPs. It achieves this performance via
- A highly tuned algorithm designed specifically for VIRAM
- A set of simple, powerful ISA extensions that underlie it
- The efficient parallelism of vector processing embedded in a high-bandwidth on-chip DRAM memory
38 Conclusions (2)
- The performance of FFTs on VIRAM has the potential to improve significantly over the results presented here
- 32-bit fixed point FFTs could run up to 2 times faster than the floating point versions
- Compared to 32-bit fixed point FFTs, 16-bit fixed point FFTs could run up to
- 8x faster (with multiply-add ops)
- 4x faster (with no multiply-add ops)
- Adding a second floating point functional unit would make floating point performance comparable to the 32-bit fixed point performance
- 4 GOP/s for the unoptimized fixed point implementation (6.4 GOP/s is peak!)
39 Conclusions (3)
- Since VIRAM includes both general-purpose CPU capability and DSP muscle, it shares the same space in the emerging market of hybrid CPU/DSPs as
- Infineon TriCore
- Hitachi SuperH-DSP
- Motorola/Lucent StarCore
- Motorola PowerPC G4 (7400)
- VIRAM's vector processor plus embedded DRAM design may have further advantages over more traditional processors in
- Power
- Area
- Performance
40 Future Work
- On the current fixed point implementation:
- Further optimizations and tests
- Explore the tradeoffs between precision/accuracy and performance by implementing
- A hybrid of the current implementation which alternates the number of bits shifted off each stage: 2 1 1 1 2 1 1 1 ... (see the sketch after this list)
- A 32 bit integer version which uses 16 bit data
- If the data occupies the 16 most significant bits of the 32 bits, then there are 16 zeros to shift off: Sbbb bbbb bbbb bbbb 0000 0000 0000 0000
41 Backup Slides
43 Why Vectors For IRAM?
- Low complexity architecture
- means lower power and area
- Takes advantage of on-chip memory bandwidth
- 100x the bandwidth of workstation memory hierarchies
- High performance for apps with fine-grained parallelism
- Delayed pipeline hides memory latency
- Therefore no cache is necessary
- further conserves power and area
- Greater code density than VLIW designs like
- TI's TMS320C6000
- Motorola/Lucent StarCore
- AD's TigerSHARC
- Siemens (Infineon) Carmel