Title: An Efficient, Model-Based CPU-GPU Heterogeneous FFT Library
1. An Efficient, Model-Based CPU-GPU Heterogeneous FFT Library
- Yasuhiko Ogata, Toshio Endo,
- Naoya Maruyama, Satoshi Matsuoka
Tokyo Institute of Technology; JST, CREST; National Institute of Informatics
2. Outline
- Introduction
- Combined use of CPUs and GPU
- Evaluation (Library Performance)
- Predict Optimal Ratio Using Model
- Evaluation (Model Accuracy)
- Summary
3. Background (GPGPU)
- General-purpose computation on GPUs
- High parallelism -> over 128 shader units
- Good cost-performance ratio vs. CPU
- GPU: 500 GFlops @ $400
- CPU: 70 GFlops @ $300
- Problems of GPGPU
- Very high data transfer cost
- Memory capacity limit
- About 512 MB
- Using only the GPU wastes the CPU's power
4. Combined usage of GPUs and CPUs
- Traditional GPGPU
- -> Shifts computation from the CPU to the GPU
- The real performance ratio between CPU and GPU is smaller than the theoretical performance ratio
- -> Limited by transfer cost
- Combined usage of GPUs and CPUs
- -> Adds the computational capacity of the CPU and the GPU
- Targets heterogeneous environments containing many CPUs and GPUs
- However
- The optimal load distribution ratio must be determined
5. Goal and Proposal
- Goal
- Automatically determine the optimal load distribution ratio
- Proposal
- Find the optimal load distribution ratio for a heterogeneous environment using a performance model
- Build a performance model that predicts execution time
- Target: 2D-FFT
6. Contribution
- A prototype combined CPU/GPU FFT library
- Improves performance by about 50% vs. using only the CPU
- Predicts the best allocation ratio between CPUs and GPU using performance modeling
- Less than 5% error at 4x the data size at which the parameters were obtained
- -> No performance degradation when using the predicted ratio
8. Algorithm of 2D-FFT
- Fast Fourier Transform
- Applications: spectrum analysis, fluid simulation, molecular dynamics, etc.
- 2D-FFT: execute 1D-FFTs along the row and column axes
[Figure: a 2D array with axes 0..n and 0..m; row-wise 1D-FFT, then col-wise 1D-FFT]
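The row-column decomposition above can be sketched in a few lines. This is only an illustration: `dft_1d` is a naive O(n^2) transform standing in for a real 1D-FFT library call, and the names `dft_1d` and `fft_2d` are hypothetical, not part of the library described in these slides.

```python
import cmath

def dft_1d(x):
    """Naive O(n^2) 1D DFT; a stand-in for a real 1D-FFT library call."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]

def fft_2d(a):
    """2D transform of a rectangular list of rows: 1D transforms on rows, then on columns."""
    row_pass = [dft_1d(row) for row in a]
    # zip(*...) transposes, so each column can be handed to the same 1D routine.
    col_pass = [dft_1d(list(col)) for col in zip(*row_pass)]
    # Transpose back to the original row-major layout.
    return [list(row) for row in zip(*col_pass)]

# A 2 x 3 input with a single non-zero sample at the origin.
impulse = [[1, 0, 0], [0, 0, 0]]
spectrum = fft_2d(impulse)
```

Because every row (and then every column) is transformed independently, the 1D transforms within one pass can be handed to different processors.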
9. Partitioning 2D-FFT
- Distribute the 1D-FFTs to the GPU and CPUs according to the allocation ratio.
Ex) GPU : CPU1 : CPU2 = 65 : 25 : 10
[Figure: the 2D array (axes 0..n and 0..m) divided into GPU 65%, CPU1 25%, CPU2 10%]
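One way to turn an allocation ratio such as 65 : 25 : 10 into concrete ranges of 1D-FFTs is sketched below. The function name `partition_rows` and the rounding rule are assumptions made for illustration; the slides do not say how the library rounds the boundaries.

```python
def partition_rows(n_rows, ratios):
    """Split n_rows 1D-FFTs into contiguous [start, end) ranges proportional to ratios."""
    total = sum(ratios)
    bounds = [0]
    cumulative = 0
    for share in ratios:
        cumulative += share
        # Round each cumulative boundary to the nearest row using integer arithmetic,
        # so the ranges always cover every row exactly once.
        bounds.append((2 * n_rows * cumulative + total) // (2 * total))
    return list(zip(bounds[:-1], bounds[1:]))

# GPU : CPU1 : CPU2 = 65 : 25 : 10 over the 8192 rows of an 8192^2 problem.
ranges = partition_rows(8192, [65, 25, 10])
```

Rounding the cumulative boundaries, rather than each share separately, avoids losing or duplicating a row when the shares do not divide the row count evenly.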
10. Implementation
- Uses general-purpose 1D-FFT libraries
- CPU
- FFTW [Frigo et al.]
- A widely used FFT library
- GPU
- GPUFFTW [Govindaraju et al.]: uses NV_fragment_program, an OpenGL extension
- CUFFT: FFT library implemented on NVIDIA CUDA
- CUDA
- Good performance (vs. GPUFFTW)
- Can execute problem sizes that exceed GPU memory capacity
- Processes the 1D-FFTs in successive sets sized to fit GPU memory
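Processing a problem larger than GPU memory in successive sets can be sketched as a simple batching plan. Everything here is hypothetical (the name `plan_batches`, the byte budget, and counting only the input buffer); the slides do not describe the library's actual scheme.

```python
def plan_batches(n_transforms, transform_len, bytes_per_element, gpu_budget_bytes):
    """Return [start, end) ranges of 1D-FFTs such that each batch fits the byte budget.

    Assumption: only the input buffer is counted; a real library would also
    need room for outputs and work areas.
    """
    bytes_per_transform = transform_len * bytes_per_element
    per_batch = gpu_budget_bytes // bytes_per_transform
    if per_batch < 1:
        raise ValueError("a single 1D-FFT does not fit in the GPU memory budget")
    return [
        (start, min(start + per_batch, n_transforms))
        for start in range(0, n_transforms, per_batch)
    ]

# 8192 transforms of length 8192, 8-byte single-precision complex elements,
# and a hypothetical 256 MiB budget.
batches = plan_batches(8192, 8192, 8, 256 * 1024 * 1024)
```

With these numbers the full array is 512 MiB, so the plan yields two batches of 4096 transforms each.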
11. Exec. flow of our library
- The exec. flow differs depending on the data order required by the 1D-FFT library.
13. Evaluation (Library Performance)
- Evaluation items
- Performance vs. problem size
- Performance vs. load distribution ratio
- Evaluation environment
- Core2 Duo E6400 2.13 GHz
- Intel 975X, PCI Express 1.0 x16
- 4 GB main memory
- GeForce 8800 GTX (768 MB memory)
- Linux 2.6.18, GNU GCC 4.1.2
- FFTW 3.1.2
- NVIDIA Linux Display Driver version 79.46
- (CUFFT ver. 79.51, CUFFT 0.81)
14. Performance vs. Problem Size
- 2D-FFT execution time vs. problem size (n^2)
35% cut in execution time (50% performance improvement)
15. Performance vs. Load Distribution Ratio (1)
- Size 8192^2, CPU 1 thread and 2 threads (50:50)
- Uses CUFFT
- 2.2x performance vs. CPU 1 thread
- 1.5x performance vs. CPU 2 threads
- CPU power is wasted by the GPU control thread
18. Performance vs. Load Distribution Ratio (2)
- Size 8192^2, CPU 1 thread and 2 threads (50:50)
- Uses GPUFFTW
- 1.5x performance vs. CPU 1 thread
20. Predict Optimal Load Distribution Ratio
- The GPU's real performance is not determined by the spec sheet
- Transfer cost, initialization cost, etc. must be taken into account
- The cost of each phase is different
- 2D-FFT: O(n^2 log n); GEMM: O(n^3)
- Transfer: O(n^2); compressible or not?
- Temporary memory allocation, etc.
- A static distribution ratio is insufficient
Using a performance model
- The model predicts execution time
- Search for the optimal ratio using parameters obtained from a pre-execution
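The idea of obtaining parameters from a pre-execution and reusing them at another problem size can be sketched as follows. It assumes each step's time is a single coefficient times the complexity term listed above; the function names and the 10 ms and 2 ms measurements are hypothetical, and the actual model in this work may contain more terms.

```python
import math

def fft_cost(n):
    """Complexity term for the 2D-FFT of an n x n problem: n^2 log n."""
    return n * n * math.log2(n)

def transfer_cost(n):
    """Complexity term for transferring an n x n array: n^2."""
    return n * n

def fit_coefficient(measured_time, n_measured, cost):
    """Coefficient k such that k * cost(n_measured) reproduces the measured time."""
    return measured_time / cost(n_measured)

def predict_time(coefficient, n_target, cost):
    """Extrapolate a step's time to another problem size."""
    return coefficient * cost(n_target)

# Hypothetical pre-execution at 512^2: 10 ms of FFT time, 2 ms of transfer time.
k_fft = fit_coefficient(0.010, 512, fft_cost)
k_xfer = fit_coefficient(0.002, 512, transfer_cost)

fft_scale = predict_time(k_fft, 8192, fft_cost) / 0.010
xfer_scale = predict_time(k_xfer, 8192, transfer_cost) / 0.002
```

Under this pure-scaling assumption, going from 512^2 to 8192^2 multiplies the FFT term by 256 x 13/9 (about 370) and the transfer term by 256, which is why the phases have to be modeled separately.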
21. Performance Model
- Divide the execution into several sub-steps
- Predict the execution time of each step using profiling results
[Diagram: predicted overall exec. time, built from the max. predicted compute time for CPU/GPU in the col-wise computation and the max. predicted compute time for CPU/GPU in the row-wise computation]
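A minimal sketch of how such a model could be searched for the best ratio is given below. It assumes the two phases run one after the other with a synchronization between them, so the predicted total is the sum of the per-phase maxima; the linear time functions and their coefficients are made up for illustration and are not measured values.

```python
def predicted_total(r, row_phase, col_phase):
    """Sum over phases of max(GPU time, CPU time) at GPU share r (0 <= r <= 1)."""
    total = 0.0
    for gpu_time, cpu_time in (row_phase, col_phase):
        # Within a phase the CPU and GPU run concurrently, so the slower one decides.
        total += max(gpu_time(r), cpu_time(r))
    return total

def best_ratio(row_phase, col_phase, steps=100):
    """Grid search over r = 0, 1/steps, ..., 1 for the smallest predicted total."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda r: predicted_total(r, row_phase, col_phase))

# Hypothetical per-phase models: the GPU pays a fixed cost plus time proportional
# to its share r; the CPU time is proportional to the remaining share 1 - r.
def gpu_time(r):
    return 0.2 + 1.0 * r

def cpu_time(r):
    return 2.0 * (1.0 - r)

phase = (gpu_time, cpu_time)
r_opt = best_ratio(phase, phase)
```

In this toy setting the search settles at r = 0.6, the point where the GPU and CPU finish each phase at the same time; giving everything to the CPU (r = 0) or to the GPU (r = 1) is slower.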
22. Detail of Performance Model
n: problem size; r: load distribution ratio to GPU (0 ≤ r ≤ 1); plus model parameters
23. Detail of Performance Model
Col-wise Exec. time
Row-wise Exec. time
24. Detail of Performance Model
- Exec. time terms depend on
- Problem size n
- Distribution ratio r
25. Detail of Performance Model
Parameters obtained from the pre-execution
27. Evaluation (Model Accuracy)
- Evaluation items
- Prediction accuracy (target problem size = size at which parameters were obtained)
- Prediction accuracy (target problem size > size at which parameters were obtained)
- Evaluation points
- Difference between real exec. time and predicted time
- Performance degradation when using the predicted ratio
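The two evaluation points can be written as simple metrics. This is one plausible reading, with hypothetical function names and made-up timings: the prediction error is taken relative to the measured time, and the degradation compares the measured time at the predicted ratio with the best measured time over all tried ratios.

```python
def relative_error(predicted, measured):
    """Prediction error as a fraction of the measured execution time."""
    return abs(predicted - measured) / measured

def degradation(measured_times, predicted_ratio):
    """Slowdown from running at the predicted ratio instead of the best measured one."""
    best = min(measured_times.values())
    return (measured_times[predicted_ratio] - best) / best

# Hypothetical measurements: GPU share -> measured execution time in seconds.
measured = {0.60: 1.05, 0.65: 1.00, 0.70: 1.02}

err = relative_error(0.98, 1.00)    # model predicted 0.98 s, run took 1.00 s
loss = degradation(measured, 0.70)  # the model chose r = 0.70
```

A degradation of zero means the predicted ratio coincides with the best measured one, even if the predicted execution time itself is somewhat off.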
28. Performance Model Evaluation (1)
- Problem size 8192^2
- Parameters obtained from 8192^2
Parameters are obtained from the red-circled points
29. Performance Model Evaluation (1)
- Problem size 8192^2
- Parameters obtained from 8192^2, with CPU and GPU not combined
Distribution ratio error is less than 5%
Exec. time error is 2% on average (max. 6%)
30. Performance Model Evaluation (2)
- Problem size 1024^2
- Parameters obtained from 512^2, with GPU and CPU not combined
Distribution ratio error is less than 5%
Exec. time error is 4.5% on average
31. Performance Model Evaluation (2)
- Problem size 8192^2
- Parameters obtained from 512^2, with GPU and CPU not combined
- (Data size x256, exec. time x300)
Distribution ratio error is 5%
Exec. time error is mostly under 20%
32. Detail of Execution Time
- Problem size 8192^2
- Distribution ratio 70%
- Parameters from 512^2
- Left: predicted time
- Right: measured time
- Red line: synchronization
33. Detail of Execution Time
GPU time is predicted almost exactly
34. Detail of Execution Time
Large error at the 1st transpose
Cache? Memory bandwidth? Under investigation
35. Detail of Execution Time
Large error in the FFT on the CPU
The measured time includes planning, etc. Work in progress
37. Related Work (Combined use of CPU and GPU)
- cg-gemm [Ohshima et al. '06]
- GEMM library using CPU and GPU
- CPU is single-core
- GEMM is much easier than FFT
- GEMM's transfer cost is negligible
- No performance model
38. Related Work (Performance model)
- [Govindaraju et al. '06]
- Performance model on GPU
- Memory model
- Finds the optimal blocking size
- Their results show a 2x to 5x speedup
- [Underwood et al. '06]
- Performance model on FPGA
- Predicts exec. time with under 5% error
- FPGA only (does not cover combined use)
39. Summary
- A prototype combined CPU/GPU FFT library
- Improves performance by about 40% vs. using only the CPU
- Predicts the best allocation ratio between CPUs and GPU using performance modeling
- Less than 5% error at double the size at which the parameters were obtained
- -> No performance degradation when using the predicted ratio
40. Future Work
- Extension to
- Many CPUs and GPUs
- Clusters
- 3D-FFT
- Extend the performance model to these environments
- Integrate a power consumption model into our model
- Predict total energy