Title: Efficient FFTs On VIRAM
1 Efficient FFTs On VIRAM
- Randi Thomas and Katherine Yelick
- Computer Science Division
- University of California, Berkeley
- IRAM Winter 2000 Retreat
- {randit, yelick}@cs.berkeley.edu
2 Outline
- What is the FFT and Why Study it?
- VIRAM Implementation Assumptions
- About the FFT
- The Naïve Algorithm
- 3 Optimizations to the Naïve Algorithm
- 32-bit Floating Point Performance Results
- 16-bit Fixed Point Performance Results
- Conclusions and Future Work
3 What is the FFT?
- The Fast Fourier Transform converts a time-domain function into a frequency spectrum
4 Why Study The FFT?
- 1D Fast Fourier Transforms (FFTs) are
- Critical for many signal processing problems
- Used widely for filtering in Multimedia Applications
- Image Processing
- Speech Recognition
- Audio/Video
- Graphics
- Important in many Scientific Applications
- The building block for 2D/3D FFTs
- All of these are VIRAM target applications!
5 Outline
- What is the FFT and Why Study it?
- VIRAM Implementation Assumptions
- About the FFT
- The Naïve Algorithm
- 3 Optimizations to the Naïve Algorithm
- 32-bit Floating Point Performance Results
- 16-bit Fixed Point Performance Results
- Conclusions and Future Work
6 VIRAM Implementation Assumptions
- System on a chip
- Scalar processor: 200 MHz vanilla MIPS core
- Embedded DRAM: 32 MB, 16 banks, no subbanks
- Memory crossbar: 25.6 GB/s
- Vector processor: 200 MHz
- I/O: 4 x 100 MB/s
7 VIRAM Implementation Assumptions
- Vector processor has four 64-bit pipelines (lanes)
- Each lane has
- 2 integer functional units
- 1 floating point functional unit
- All functional units have a 1-cycle multiply-add operation
- Each lane can be subdivided into
- two 32-bit virtual lanes
- four 16-bit virtual lanes
8 Peak Performance
- Peak performance of this VIRAM implementation
- Implemented
- A 32-bit floating point version (8 virtual lanes, 8 FUs)
- A 16-bit fixed point version (16 virtual lanes, 32 FUs)
9 Outline
- What is the FFT and Why Study it?
- VIRAM Implementation Assumptions
- About the FFT
- The Naïve Algorithm
- 3 Optimizations to the Naïve Algorithm
- 32-bit Floating Point Performance Results
- 16-bit Fixed Point Performance Results
- Conclusions and Future Work
10 Computing the DFT (Discrete FT)
- Given the N-element vector x, its 1D DFT is another N-element vector y, given by $y_j = \sum_{k=0}^{N-1} x_k \,\omega^{jk}$, where $\omega = e^{-2\pi i/N}$ is the Nth root of unity
- N is referred to as the number of points
- The FFT (Fast FT)
- Uses algebraic identities to compute the DFT in O(N log N) steps (sketched below)
- The computation is organized into log2(N) stages for the radix-2 FFT
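The slides state the O(N log N) count without the intermediate step; as a hedged sketch (the standard Cooley-Tukey identity, consistent with the definition above but not reproduced from the slides), splitting the sum by even and odd indices gives:

```latex
% Radix-2 decomposition of the DFT (standard identity; a sketch, not
% taken from the slides).  E_j and O_j are half-size DFTs of the even-
% and odd-indexed elements.
\begin{aligned}
y_j &= \sum_{k=0}^{N-1} x_k\,\omega^{jk}
     = \sum_{m=0}^{N/2-1} x_{2m}\,(\omega^2)^{mj}
       \;+\; \omega^{j} \sum_{m=0}^{N/2-1} x_{2m+1}\,(\omega^2)^{mj}
     = E_j + \omega^{j} O_j
\end{aligned}
```

Applying the identity recursively halves the problem log2(N) times, which is where the log2(N) stages come from.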
11 Computing A Complex FFT
- Basic computation for a radix-2 FFT (illustrated below): $X'_i = X_i + w\,X_j$ and $X'_j = X_i - w\,X_j$
- The Xi are the data points; w is a root of unity
- The basic computation on VIRAM for floating point data points:
- 2 multiply-adds + 2 multiplies + 4 adds = 8 instructions, performing 10 floating point operations
- 2 GFLOP/s is the VIRAM peak performance for this mix of instructions (8 FUs x 200 MHz x 10 flops per 8 instructions)
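As a concrete illustration, here is one butterfly in scalar C (a sketch; the VIRAM code vectorizes this across a whole butterfly group). Expanding the complex arithmetic yields the 4 multiplies and 6 adds/subtracts that map onto the instruction mix above:

```c
#include <complex.h>

/* One radix-2 butterfly (scalar C sketch).  a and b are data points and
 * w is the group's root of unity.  In real arithmetic this is 4 multiplies
 * and 6 adds/subtracts, which VIRAM issues as 2 multiply-adds +
 * 2 multiplies + 4 adds. */
static void butterfly(float complex *a, float complex *b, float complex w)
{
    float complex t = w * (*b);   /* complex multiply: 4 mul, 1 add, 1 sub */
    *b = *a - t;                  /* 2 subtracts (real and imaginary parts) */
    *a += t;                      /* 2 adds */
}
```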
12 Vector Terminology
- The Maximum Vector Length (MVL)
- The maximum number of elements 1 vector register can hold
- Set automatically by the architecture
- Based on the data width the algorithm is using
- 64-bit data: MVL = 32 elements/vector register
- 32-bit data: MVL = 64 elements/vector register
- 16-bit data: MVL = 128 elements/vector register
- The Vector Length (VL)
- The total number of elements to be computed
- Set by the algorithm (the inner for-loop)
13 One More (FFT) Term!
- A butterfly group (BG)
- A set of elements that can be computed upon in 1 FFT stage using
- The same basic computation
- AND
- The same root of unity
- The number of elements in a stage's BG determines the Vector Length (VL) for that stage
14 Outline
- What is the FFT and Why Study it?
- VIRAM Implementation Assumptions
- About the FFT
- The Naïve Algorithm
- 3 Optimizations to the Naïve Algorithm
- 32-bit Floating Point Performance Results
- 16-bit Fixed Point Performance Results
- Conclusions and Future Work
15 Cooley-Tukey FFT Algorithm
[Diagram legend: vr1, vr2 = vector registers; 1 butterfly group; VL = vector length]
16 Vectorizing the FFT
- Diagram illustrates the naïve vectorization (a C sketch follows the diagram)
- A stage vectorizes well when VL ≥ MVL
- Poor HW utilization when VL is small (< MVL)
- Later stages of the FFT have shorter vector lengths
- The number of elements in one butterfly group is smaller in the later stages
[Diagram: dataflow through vr1/vr2 for a 16-point FFT, time flowing across the stages; Stage 1: VL = 8, Stage 2: VL = 4, Stage 3: VL = 2, Stage 4: VL = 1]
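A scalar C sketch of this naïve organization (assumed structure with hypothetical names; the input is taken to be already bit-reversed). The innermost loop walks one butterfly group, so its trip count is the stage's VL = n/m, which halves every stage exactly as in the diagram:

```c
#include <complex.h>
#include <math.h>

/* Naive iterative radix-2 FFT (C sketch; assumes x[] is already in
 * bit-reversed order).  The inner j-loop covers one butterfly group --
 * all butterflies sharing the twiddle w -- so it is the loop that gets
 * vectorized: VL = n/m, i.e. n/2 in stage 1, halving down to 1. */
void fft_naive(float complex *x, int n)
{
    for (int m = 2; m <= n; m *= 2) {              /* log2(n) stages */
        for (int k = 0; k < m / 2; k++) {          /* one twiddle per group */
            float complex w = cexpf(-2.0f * (float)M_PI * I * (float)k / (float)m);
            for (int j = k; j < n; j += m) {       /* vectorized loop: VL = n/m */
                float complex t = w * x[j + m / 2];
                x[j + m / 2] = x[j] - t;
                x[j]        += t;
            }
        }
    }
}
```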
17 Naïve Algorithm: What Happens When Vector Lengths Get Short?
32-bit Floating Point
[Graph: VL = 64 = MVL marked]
- Performance peaks (1.4-1.8 GFLOP/s) if vector lengths are ≥ MVL
- For all FFT sizes, 94% to 99% of the total time is spent doing the last 6 stages, when VL < MVL (= 64)
- For a 1024-point FFT, only 60% of the work is done in the last 6 stages
- Performance drops significantly when vector lengths < the number of lanes (8)
18 Outline
- What is the FFT and Why Study it?
- VIRAM Implementation Assumptions
- About the FFT
- The Naïve Algorithm
- 3 Optimizations to the Naïve Algorithm
- 32-bit Floating Point Performance Results
- 16-bit Fixed Point Performance Results
- Conclusions and Future Work
19 Optimization 1: Add Auto-Increment
- Automatically adds an increment to the current address in order to obtain the next address
- Auto-increment helps to
- Reduce the scalar code overhead (see the sketch after this list)
- Useful
- To jump to the next butterfly group in an FFT stage
- For processing a sub-image of a larger image, in order to jump to the appropriate pixel in the next row
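A hedged C illustration of the overhead being removed (a hypothetical kernel, not from the slides): the pointer bump between butterfly groups is scalar address arithmetic that the auto-increment feature folds into the vector unit's address generation.

```c
#include <complex.h>

/* Hypothetical kernel that issues one vector operation per butterfly
 * group.  Without auto-increment the scalar core must execute `p += m`
 * between consecutive vector operations; with it, the next base address
 * is produced automatically. */
static void scale_groups(float complex *x, int n, int m, float s)
{
    for (float complex *p = x; p < x + n; p += m)  /* p += m: the address bump */
        for (int j = 0; j < m / 2; j++)            /* inner loop = one vector op */
            p[j] *= s;
}
```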
20 Optimization 1: Add Auto-Increment
32-bit Floating Point
- Small gain from auto-increment
- For a 1024-point FFT:
- 202 MFLOP/s without AI
- 225 MFLOP/s with AI
- Still 94-99% of the time is spent in the last 6 stages, where VL < 64
- Conclusion: auto-increment helps, but scalar overhead is not the main source of the inefficiency
21 Optimization 2: Memory Transposes
- Reorganize the data layout in memory to maximize the vector length in later FFT stages
- View the 1D vector as a 2D matrix
- Reorganization is equivalent to a matrix transpose
- Transposing the data in memory only works for N ≥ 2 x MVL
- Transposing in memory adds significant overhead
- Increased memory traffic
- Cost is too high to make it worthwhile
- Multiple transposes exacerbate the situation
22 Optimization 3: Register Transposes
- Rearrange the elements in the vector registers
- Provides a way to swap elements between 2 registers
- What we want to swap (after stage 1, where VL = MVL = 8):
- VL = 4, 2 BGs
- VL = 2, 4 BGs
- This behavior is hard to implement with one instruction in hardware
23 Optimization 3: Register Transposes
- Two instructions were added to the VIRAM Instruction Set Architecture (ISA)
- vhalfup and vhalfdn both move elements one-way between vector registers
- vhalfup/dn
- Are extensions of already-existing ISA support for fast in-register reductions
- Required minimal additional hardware support (mostly control lines)
- Are much simpler and less costly than a general element permutation instruction, which was rejected in the early VIRAM design phase
- Are an elegant, inexpensive, powerful solution to the short vector length problem of the later stages of the FFT
24 Optimization 3: Register Transposes
[Diagram: stage 1 results in vr1/vr2, before and after the SWAP]
- Three steps to swap elements (see the sketch after this list)
- Copy vr1 into vr3
- Move vr2's low half to vr1's high half (vhalfup)
- vr1 now done
- Move vr3's high half to vr2's low half (vhalfdn)
- vr2 now done
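A C model of the three steps, treating each vector register as an MVL-element array (the vhalfup/vhalfdn semantics are paraphrased from the slides; the register-file details are assumptions):

```c
#define MVL 64  /* elements per vector register for 32-bit data (slide 12) */

/* Model of the 3-step swap (C sketch).  The two half-copies stand in for
 * the one-way vhalfup/vhalfdn moves.  Afterwards vr1 holds both low
 * halves and vr2 both high halves, so computation continues at VL = MVL. */
static void swap_halves(float vr1[MVL], float vr2[MVL])
{
    float vr3[MVL];
    for (int i = 0; i < MVL; i++)        /* step 1: copy vr1 into vr3 */
        vr3[i] = vr1[i];
    for (int i = 0; i < MVL / 2; i++)    /* step 2 (vhalfup): vr2 low -> vr1 high */
        vr1[MVL / 2 + i] = vr2[i];
    for (int i = 0; i < MVL / 2; i++)    /* step 3 (vhalfdn): vr3 high -> vr2 low */
        vr2[i] = vr3[MVL / 2 + i];
}
```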
25 Optimization 3: Final Algorithm
- The optimized algorithm has two phases
- The naïve algorithm is used for stages whose VL ≥ MVL
- Vhalfup/dn code is used on stages whose VL < MVL (the last log2(MVL) stages)
- Vhalfup/dn
- Eliminates the short vector length problem
- Allows all vector computations to have VL equal to MVL
- Multiple butterfly groups are done with 1 basic operation
- Eliminates all loads/stores between these stages
- The optimized vhalf algorithm does
- Auto-increment, software pipelining, code scheduling
- The bit-reversal rearrangement of the results
- Single precision, floating point, complex, radix-2 FFTs
26 Optimization 3: Register Transposes
32-bit Floating Point
- Every vector instruction operates with VL = MVL
- For all stages
- Keeps the vector pipeline fully utilized
- Time spent in the last 6 stages
- Drops to 60% to 80% of the total time
27 Outline
- What is the FFT and Why Study it?
- VIRAM Implementation Assumptions
- About the FFT
- The Naïve Algorithm
- 3 Optimizations to the Naïve Algorithm
- 32-bit Floating Point Performance Results
- 16-bit Fixed Point Performance Results
- Conclusions and Future Work
28 Performance Results
32-bit Floating Point
- Both naïve versions utilize the auto-increment feature
- One does bit reversal, the other does not
- Vhalfup/dn with and without bit reversal are identical
- Bit-reversing the results slows the naïve algorithm, but not vhalfup/dn
29 Performance Results
32-bit Floating Point
- The performance gap testifies to
- The effectiveness of the vhalfup/dn algorithm in fully utilizing the vector unit
- The importance of the new vhalfup/dn instructions
30 Performance Results
32-bit Floating Point
- VIRAM is competitive with high-end specialized floating point DSPs
- Could match or exceed the performance of these DSPs if the VIRAM architecture were implemented commercially
31 Outline
- What is the FFT and Why Study it?
- VIRAM Implementation Assumptions
- About the FFT
- The Naïve Algorithm
- 3 Optimizations to the Naïve Algorithm
- 32-bit Floating Point Performance Results
- 16-bit Fixed Point Performance Results
- Conclusions and Future Work
32 16-bit Fixed Point Implementation
- Resources
- 16 virtual lanes (each 16 bits wide)
- Two integer functional units per lane
- 32 operations/cycle
- MVL = 128 elements
- Fixed-point multiply-add not utilized
- 8-bit operands are too small: 8 bits x 8 bits = 16-bit product
- A 32-bit product is too big: 16 bits x 16 bits = 32-bit product
33 16-bit Fixed Point Implementation (2)
- The basic computation takes
- 4 multiplies + 4 adds + 2 subtracts = 10 operations
- 6.4 GOP/s is the peak performance for this mix (32 ops/cycle x 200 MHz)
- To prevent overflow, two bits are shifted right (and lost) in each stage (see the sketch after this list)
- Input:  Sbbb bbbb bbbb bbbb.
- Output: Sbbb bbbb bbbb bbbb.bb (the two bits past the decimal point are shifted out)
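A C sketch of the fixed-point basic computation with the 2-bit anti-overflow shift (the Q15 twiddle format and truncating shifts are assumptions, not stated on the slides):

```c
#include <stdint.h>

typedef struct { int16_t re, im; } cplx16;

/* 16-bit fixed-point radix-2 butterfly (C sketch; twiddle w assumed in
 * Q15 format).  Each 16x16 multiply yields a 32-bit product -- the reason
 * the 1-cycle multiply-add cannot be used -- and every result is shifted
 * right 2 bits so the stage cannot overflow. */
static void butterfly_fx(cplx16 *a, cplx16 *b, cplx16 w)
{
    int32_t tr = ((int32_t)w.re * b->re - (int32_t)w.im * b->im) >> 15;
    int32_t ti = ((int32_t)w.re * b->im + (int32_t)w.im * b->re) >> 15;
    b->re = (int16_t)(((int32_t)a->re - tr) >> 2);  /* the >> 2 is the */
    b->im = (int16_t)(((int32_t)a->im - ti) >> 2);  /* per-stage 2-bit */
    a->re = (int16_t)(((int32_t)a->re + tr) >> 2);  /* anti-overflow   */
    a->im = (int16_t)(((int32_t)a->im + ti) >> 2);  /* shift           */
}
```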
34 Performance Results
16-bit Fixed Point
- Fixed point is faster than floating point on VIRAM
- 1024 pt: 28.3 µs versus 37 µs
- This implementation attains 4 GOP/s for a 1024-pt FFT, and it is an unoptimized work in progress!
35 Performance Results
16-bit Fixed Point
- Again, VIRAM is competitive with high-end specialized DSPs
- CRI Scorpio 24-bit complex fixed point FFT DSP
- 1024 pt: 7 microseconds
36 Outline
- What is the FFT and Why Study it?
- VIRAM Implementation Assumptions
- About the FFT
- The Naïve Algorithm
- 3 Optimizations to the Naïve Algorithm
- 32-bit Floating Point Performance Results
- 16-bit Fixed Point Performance Results
- Conclusions and Future Work
37 Conclusions
- Optimizations that eliminate short vector lengths are necessary for doing the FFT
- VIRAM is capable of performing FFTs at performance levels comparable to or exceeding those of high-end floating point DSPs. It achieves this performance via
- A highly tuned algorithm designed specifically for VIRAM
- A set of simple, powerful ISA extensions that underlie it
- The efficient parallelism of vector processing embedded in a high-bandwidth on-chip DRAM memory
38 Conclusions (2)
- Performance of FFTs on VIRAM has the potential to improve significantly over the results presented here
- 32-bit fixed point FFTs could run up to 2 times faster than the floating point versions
- Compared to 32-bit fixed point FFTs, 16-bit fixed point FFTs could run up to
- 8x faster (with multiply-add ops)
- 4x faster (with no multiply-add ops)
- Adding a second floating point functional unit would make floating point performance comparable to the 32-bit fixed point performance
- 4 GOP/s for the unoptimized fixed point implementation (6.4 GOP/s is peak!)
39 Conclusions (3)
- Since VIRAM includes both general-purpose CPU capability and DSP muscle, it shares the same space in the emerging market of hybrid CPU/DSPs as
- Infineon TriCore
- Hitachi SuperH-DSP
- Motorola/Lucent StarCore
- Motorola PowerPC G4 (7400)
- VIRAM's vector processor plus embedded DRAM design may have further advantages over more traditional processors in
- Power
- Area
- Performance
40 Future Work
- On the current fixed point implementation
- Further optimizations and tests
- Explore the tradeoffs between precision/accuracy and performance by implementing
- A hybrid of the current implementation which alternates the number of bits shifted off each stage: 2 1 1 1 2 1 1 1 ... (see the sketch after this list)
- A 32-bit integer version which uses 16-bit data
- If the data occupies the 16 most significant bits of the 32 bits, then there are 16 zeros to shift off
- Sbbb bbbb bbbb bbbb 0000 0000 0000 0000
41 Backup Slides
43 Why Vectors For IRAM?
- Low complexity architecture
- Means lower power and area
- Takes advantage of on-chip memory bandwidth
- 100x the bandwidth of workstation memory hierarchies
- High performance for apps with fine-grained parallelism
- Delayed pipeline hides memory latency
- Therefore no cache is necessary
- Further conserves power and area
- Greater code density than VLIW designs like
- TI's TMS320C6000
- Motorola/Lucent StarCore
- AD's TigerSHARC
- Siemens (Infineon) Carmel