An Efficient, Model-Based CPU-GPU Heterogeneous FFT Library

1
An Efficient, Model-Based CPU-GPU Heterogeneous
FFT Library

Yasuhiko Ogata, Toshio Endo,
Naoya Maruyama, Satoshi Matsuoka

Tokyo Institute of Technology
JST, CREST
National Institute of Informatics
2
Outline

Introduction
Combined use of CPUs and GPU
Evaluation (Library Performance)
Predicting the Optimal Ratio with a Model
Evaluation (Model Accuracy)
Summary

3
Background (GPGPU)

General-purpose computation on GPUs
High parallelism -> over 128 shader units
Good cost-performance ratio vs. CPU
GPU: 500 GFlops @ $400
CPU: 70 GFlops @ $300
Problems of GPGPU
Very high data transfer cost
Memory capacity limit (about 512 MB)
Using only the GPU wastes the CPU's power

4
Combined usage of GPUs and CPUs

Traditional GPGPU
-> Shifts computation from the CPU to the GPU
The real performance ratio between CPU and GPU is
smaller than the theoretical performance ratio
-> Limited by transfer cost
Combined usage of GPUs and CPUs
-> Adds the computational capacity of the CPU and the GPU
Suits heterogeneous environments containing many
CPUs and GPUs
However
The optimal load distribution ratio must be determined
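The gap between theoretical and real performance ratio can be illustrated with simple arithmetic. The sketch below is not from the presentation: it assumes the GPU's compute time scales with the spec-sheet GFlops ratio quoted earlier in the deck, and the CPU time and transfer time are hypothetical values in arbitrary units.

```python
def effective_speedup(cpu_time, gpu_compute_time, transfer_time):
    """Speedup of GPU over CPU once host<->device transfer is included."""
    return cpu_time / (gpu_compute_time + transfer_time)

peak_ratio = 500 / 70                 # spec-sheet GFlops ratio quoted earlier in the deck
cpu_time = 1.0                        # hypothetical CPU time (arbitrary units)
gpu_compute = cpu_time / peak_ratio   # assumption: GPU compute time scales with peak
transfer = 0.5                        # hypothetical transfer time, same units
real_ratio = effective_speedup(cpu_time, gpu_compute, transfer)
```

With these made-up numbers the effective ratio falls well below the peak ratio, which is the situation in which adding the CPU's capacity instead of discarding it becomes attractive.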

5
Goal and Proposal

Goal
Automatically determine the optimal load distribution ratio.
Proposal
Find the optimal load distribution ratio for a
heterogeneous environment using a performance
model.
Build a performance model that predicts execution
time.
Target: 2D-FFT

6
Contribution

A prototype combined CPU/GPU FFT library
Improves performance by about 50% vs. using only the CPU
Predicts the best allocation ratio between CPUs and GPU
using performance modeling
Less than 5% error at x4 the data size from which the
parameters were obtained
-> no performance loss when using the predicted ratio

7
Outline

Introduction
Combined use of CPUs and GPU
Evaluation (Library Performance)
Predicting the Optimal Ratio with a Model
Evaluation (Model Accuracy)
Summary

8
Algorithm of 2D-FFT

Fast Fourier Transform
Used in spectrum analysis, fluid simulation, molecular
dynamics, etc.
2D-FFT: execute 1D-FFTs along the row and column axes

[Figure: array with axes 0..n and 0..m; row-wise 1D-FFT, then col-wise 1D-FFT]
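The row-then-column decomposition can be checked on a small array. The following is a minimal sketch, not the library's code: a naive O(N^2) DFT stands in for a real 1D-FFT routine, and all function names are made up for illustration.

```python
import cmath

def dft_1d(x):
    """Naive O(N^2) 1D DFT; stands in for a real 1D-FFT library call."""
    size = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * f * k / size)
                for k in range(size))
            for f in range(size)]

def dft_2d_by_rows_then_cols(a):
    """2D DFT computed as row-wise 1D transforms followed by col-wise ones."""
    rows_done = [dft_1d(row) for row in a]                 # row-wise pass
    cols = [dft_1d(list(col)) for col in zip(*rows_done)]  # col-wise pass
    return [list(row) for row in zip(*cols)]               # transpose back

def dft_2d_direct(a):
    """Direct evaluation of the 2D DFT definition, used only as a reference."""
    n, m = len(a), len(a[0])
    return [[sum(a[p][q] * cmath.exp(-2j * cmath.pi * (u * p / n + v * q / m))
                 for p in range(n) for q in range(m))
             for v in range(m)]
            for u in range(n)]

sample = [[1, 2, 3, 4], [0, -1, 2, 5], [3, 3, 0, 1]]
fast = dft_2d_by_rows_then_cols(sample)
ref = dft_2d_direct(sample)
max_err = max(abs(fast[u][v] - ref[u][v]) for u in range(3) for v in range(4))
```

Because every 1D transform within a pass is independent of the others, each pass can be split among several devices.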
9
Partitioning 2D-FFT

Distribute the 1D-FFTs to the GPU and CPUs according
to the allocation ratio.

Ex) GPU : CPU1 : CPU2 = 65 : 25 : 10
[Figure: array with axes 0..n and 0..m, split into GPU 65%, CPU1 25%, CPU2 10%]
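One way to turn an allocation ratio into concrete index ranges is sketched below. It assumes the 1D-FFTs are assigned in contiguous blocks; the function name `partition_rows` and the dictionary layout are illustrative and not part of the library.

```python
def partition_rows(n_rows, ratios):
    """Split n_rows contiguous rows among devices in proportion to ratios.

    Returns {device: (start, stop)} with half-open ranges covering all rows.
    """
    total = sum(ratios.values())
    bounds, start, acc = {}, 0, 0
    for device, share in ratios.items():
        acc += share
        stop = round(n_rows * acc / total)  # cumulative rounding avoids gaps
        bounds[device] = (start, stop)
        start = stop
    return bounds

plan = partition_rows(8192, {"GPU": 65, "CPU1": 25, "CPU2": 10})
```

Rounding the cumulative boundary instead of each share separately guarantees that the ranges cover every row exactly once, even when the ratio does not divide the row count evenly.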
10
Implementation

Uses general-purpose 1D-FFT libraries
CPU
FFTW [Frigo et al.]
A widely used FFT library.
GPU
GPUFFTW [Govindaraju et al.]: uses
NV_fragment_program, an OpenGL extension.
CUFFT: FFT library implemented on NVIDIA CUDA.
Good performance (vs. GPUFFTW)

Can execute sizes that exceed the GPU
memory capacity.
Processes sets of 1D-FFTs in turn, each set sized to fit the GPU memory.
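Processing a problem larger than GPU memory in sets can be sketched as a simple batching calculation. This is an illustration under stated assumptions, not the library's implementation: the 8 bytes per element assume single-precision complex data, and the 256 MiB budget is a hypothetical figure.

```python
def batches_for_gpu(n_ffts, fft_len, gpu_bytes, bytes_per_elem=8):
    """Yield (start, stop) ranges of 1D-FFTs that each fit in gpu_bytes.

    bytes_per_elem=8 assumes single-precision complex data (an assumption).
    """
    per_fft = fft_len * bytes_per_elem
    max_per_batch = gpu_bytes // per_fft
    if max_per_batch == 0:
        raise ValueError("a single 1D-FFT does not fit in the given budget")
    for start in range(0, n_ffts, max_per_batch):
        yield start, min(start + max_per_batch, n_ffts)

# Hypothetical budget: 256 MiB usable for data on the device.
batches = list(batches_for_gpu(n_ffts=8192, fft_len=8192,
                               gpu_bytes=256 * 2**20))
```

Under these assumptions an 8192 x 8192 array needs 512 MiB, so it is processed in two sets of 4096 transforms each.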

11
Exec. flow of our library

The exec. flow differs depending on the data order
required by each 1D-FFT library.

12
Outline

Introduction
Combined use of CPUs and GPU
Evaluation (Library Performance)
Predicting the Optimal Ratio with a Model
Evaluation (Model Accuracy)
Summary

13
Evaluation (Library Performance)

Evaluation items
Performance vs. Problem Size
Performance vs. Load Distribution Ratio
Evaluation environment
Core 2 Duo E6400 2.13 GHz
Intel 975X, PCI Express 1.0 x16
4 GB main memory
GeForce 8800 GTX (768 MB memory)
Linux 2.6.18, GNU GCC 4.1.2
FFTW 3.1.2
NVIDIA Linux Display Driver version 79.46
(CUFFT ver. 79.51,CUFFT0.81)

14
Performance vs. Problem Size

2D-FFT execution time vs. problem size (n^2)

35% execution time cut (50% performance improvement)
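The two figures on this slide can be related by a short calculation. The sketch assumes that "performance" means the inverse of execution time; that reading is an interpretation, not something the slide states.

```python
time_cut = 0.35                  # fraction of execution time removed (from the slide)
relative_time = 1.0 - time_cut   # combined run takes 65% of the CPU-only time
speedup = 1.0 / relative_time    # assumes performance = 1 / execution time
improvement_pct = (speedup - 1.0) * 100.0
print(f"speedup x{speedup:.2f}, improvement {improvement_pct:.0f}%")
```

Under that assumption a 35% cut corresponds to a speedup of roughly 1.54, in line with the slide's figure of about 50%.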
15
Performance vs. Load Distribution Ratio (1)

Size 8192^2, CPU 1 thread and 2 threads (50:50)
Using CUFFT

16
Performance vs. Load Distribution Ratio (1)

Size 8192^2, CPU 1 thread and 2 threads (50:50)
Using CUFFT

x2.2 performance vs. CPU 1 thread
x1.5 performance vs. CPU 2 threads

17
Performance vs. Load Distribution Ratio (1)

Size 8192^2, CPU 1 thread and 2 threads (50:50)
Using CUFFT

CPU power is wasted by the GPU control thread

18
Performance vs. Load Distribution Ratio (2)

Size 8192^2, CPU 1 thread and 2 threads (50:50)
Using GPUFFTW

x1.5 performance vs. CPU 1 thread

19
Outline

Introduction
Combined use of CPUs and GPU
Evaluation (Library Performance)
Predicting the Optimal Ratio with a Model
Evaluation (Model Accuracy)
Summary

20
Predicting the Optimal Load Distribution Ratio

A GPU's real performance is not determined by its spec
sheet
Transfer cost, initialization cost, etc. must be considered
The cost of each phase is different.
2D-FFT: O(n^2 log n), GEMM: O(n^3)
Transfer: O(n^2); compressible or not?
Temporary memory allocation, etc.
A static distribution ratio is insufficient

Using a performance model

The model predicts execution time
Search for the optimal ratio using parameters obtained
from a pre-execution.
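Searching for the optimal ratio with a predictor can be sketched as a grid search over r. The predictor below is a toy stand-in, not the presentation's model: its rates and transfer cost are hypothetical, and it only assumes that the GPU and CPU run concurrently so that the slower side determines the time.

```python
def best_ratio(predict_time, steps=100):
    """Grid-search r in [0, 1] for the smallest predicted execution time."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=predict_time)

# Hypothetical predictor: the slower device determines the time, and the
# GPU additionally pays a transfer cost proportional to its share r.
def toy_predict(r, gpu_rate=3.0, cpu_rate=1.0, transfer_cost=0.2):
    gpu_time = r / gpu_rate + transfer_cost * r
    cpu_time = (1.0 - r) / cpu_rate
    return max(gpu_time, cpu_time)

r_opt = best_ratio(toy_predict)
```

Although the toy GPU is three times faster than the toy CPU, the transfer term moves the best split to about 0.65 instead of the 0.75 that the raw rates would suggest, which is why a static ratio taken from spec-sheet numbers is not enough.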

21
Performance Model

Divide the execution into several sub-steps
Predict the execution time of each step using
profiling results

[Figure: predicted overall exec. time, built from the max. predicted compute time for CPU/GPU in the col-wise computation and the max. predicted compute time for CPU/GPU in the row-wise computation]
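The structure described on this slide can be sketched in code. This is a simplified reading, not the presenters' model: it assumes the two phases run one after the other so that their times add, and every cost value in `profile` is a hypothetical placeholder for numbers that would come from profiling.

```python
def phase_time(n_ffts, r, gpu_per_fft, cpu_per_fft, gpu_overhead):
    """Predicted time of one phase: GPU and CPU work concurrently."""
    gpu_share = n_ffts * r
    cpu_share = n_ffts * (1.0 - r)
    gpu_time = gpu_share * gpu_per_fft + (gpu_overhead if r > 0 else 0.0)
    cpu_time = cpu_share * cpu_per_fft
    return max(gpu_time, cpu_time)   # the slower device bounds the phase

def predict_total(n, r, profile):
    """Sum of the row-wise and col-wise phase predictions (assumed structure)."""
    return (phase_time(n, r, **profile["row"]) +
            phase_time(n, r, **profile["col"]))

# Hypothetical per-1D-FFT costs in seconds; real values would come from profiling.
profile = {
    "row": {"gpu_per_fft": 1e-5, "cpu_per_fft": 4e-5, "gpu_overhead": 0.01},
    "col": {"gpu_per_fft": 2e-5, "cpu_per_fft": 5e-5, "gpu_overhead": 0.02},
}
t_cpu_only = predict_total(8192, 0.0, profile)
t_mixed = predict_total(8192, 0.7, profile)
```

Keeping the phases separate lets each one have its own costs, since the row-wise and col-wise passes may access memory in different orders.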
22
Detail of Performance Model
n: problem size; r: load distribution ratio to
GPU (0 ≤ r ≤ 1); parameters
23
Detail of Performance Model
Col-wise exec. time
Row-wise exec. time
n: problem size; r: load distribution ratio to
GPU (0 ≤ r ≤ 1); parameters
24
Detail of Performance Model

Terms of the exec. time that depend on
problem size n
distribution ratio r

n: problem size; r: load distribution ratio to
GPU (0 ≤ r ≤ 1); parameters
25
Detail of Performance Model
Parameters obtained from the pre-execution.
n: problem size; r: load distribution ratio to
GPU (0 ≤ r ≤ 1); parameters
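Obtaining a model parameter from a pre-execution can be sketched as a least-squares fit. The transcript does not preserve the model's equations, so this is only an assumed form: a single coefficient k in t = k * n^2 * log2(n). The timings are synthetic values generated from a made-up constant, not measurements from the presentation.

```python
import math

def work(n):
    """Operation-count term for an n x n 2D-FFT: n^2 * log2(n)."""
    return n * n * math.log2(n)

def fit_coefficient(samples):
    """Least-squares fit of t = k * work(n) through the origin."""
    num = sum(work(n) * t for n, t in samples)
    den = sum(work(n) ** 2 for n, _ in samples)
    return num / den

# Synthetic "pre-execution" timings generated from a made-up constant;
# they are placeholders, not measurements from the presentation.
TRUE_K = 2.0e-9
samples = [(n, TRUE_K * work(n)) for n in (512, 1024, 2048)]

k = fit_coefficient(samples)
# Extrapolate to 4096^2, i.e. 4x the element count of the largest sample.
predicted_4096 = k * work(4096)
```

Real timings would contain noise and extra terms such as transfer and initialization cost, so a practical fit would need one coefficient per sub-step of the model.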
26
Outline