An Efficient, Model-Based CPU-GPU Heterogeneous FFT Library

1
An Efficient, Model-Based CPU-GPU Heterogeneous
FFT Library
  • Yasuhiko Ogata, Toshio Endo,
  • Naoya Maruyama, Satoshi Matsuoka

Tokyo Institute of Technology; JST, CREST; National Institute of Informatics
2
Outline
  • Introduction
  • Combined use of CPUs and GPU
  • Evaluation (Library Performance)
  • Predict Optimal Ratio using Model
  • Evaluation (Model Accuracy)
  • Summary

3
Background (GPGPU)
  • General-purpose computation on GPUs
  • High parallelism -> over 128 shader units!
  • Good cost-performance ratio vs. CPU
  • GPU: 500 GFlops @ $400
  • CPU: 70 GFlops @ $300
  • Problems of GPGPU
  • Very high data transfer cost
  • Memory capacity limit
  • About 512 MB
  • Using only the GPU throws away CPU power

4
Combined usage of GPUs and CPUs
  • Traditional GPGPU
  • -> Shifts computational work from the CPU to the GPU
  • The real performance ratio between CPU and GPU is
    smaller than the theoretical performance ratio
  • -> Limited by transfer cost
  • Combined usage of GPUs and CPUs
  • -> Adds up the computational capacity of CPU and GPU
  • Heterogeneous environments containing many
    CPUs and GPUs
  • However
  • Need to decide the optimal load distribution ratio

5
Goal and Proposal
  • Goal
  • Automatically decide the optimal load distribution ratio.
  • Proposal
  • Find the optimal load distribution ratio for a
    heterogeneous environment using a performance
    model.
  • Build a performance model that predicts exec.
    time
  • Target: 2D-FFT

6
Contribution
  • A prototype combined CPU/GPU FFT library
  • Improves performance by about 50% vs. using only the CPU
  • Predicts the best allocation ratio between CPUs and GPU
    using performance modeling
  • Less than 5% error at x4 the data size at which the
    parameters were obtained
  • -> No performance degradation when using the predicted ratio

7
Outline
  • Introduction
  • Combined use of CPUs and GPU
  • Evaluation (Library Performance)
  • Predict Optimal Ratio using Model
  • Evaluation (Model Accuracy)
  • Summary

8
Algorithm of 2D-FFT
  • Fast Fourier Transform
  • Used in spectrum analysis, fluid simulation, molecular
    dynamics, etc.
  • 2D-FFT: execute 1D-FFTs along the row and column axes

[Figure: an n x m array; row-wise 1D-FFT, then col-wise 1D-FFT]
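
The row-then-column decomposition on this slide can be sketched in a few lines. This is only an illustration in pure Python, not the library's code: `fft1d` and `fft2d` are hypothetical names, and the 1D transform is a textbook radix-2 FFT standing in for FFTW or CUFFT (so lengths are assumed to be powers of two).

```python
import cmath

def fft1d(x):
    """Recursive radix-2 FFT; len(x) is assumed to be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft2d(a):
    """2D FFT as row-wise 1D FFTs followed by column-wise 1D FFTs."""
    rows = [fft1d(row) for row in a]                 # row-wise pass
    cols = [fft1d(list(c)) for c in zip(*rows)]      # transpose, then FFT each column
    return [list(r) for r in zip(*cols)]             # transpose back to row-major
```

Because every 1D-FFT in a pass is independent of the others, the rows (or columns) of one pass can be handed to different devices.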
9
Partitioning 2D-FFT
  • Distribute the 1D-FFTs to the GPU and CPUs according
    to the allocation ratio.

Ex) GPU : CPU1 : CPU2 = 65% : 25% : 10%
[Figure: the n x m array partitioned among GPU (65%), CPU1 (25%), and CPU2 (10%)]
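
One way to turn an allocation ratio into concrete work assignments is sketched below. It assumes the 1D-FFTs are split into contiguous bands of rows, which the slide does not state; `partition_rows` is a hypothetical helper, not part of the library.

```python
def partition_rows(num_rows, ratios):
    """Split row indices [0, num_rows) into contiguous bands, one per device.

    ratios maps a device name to its share; shares need not sum to 100.
    Cumulative rounding guarantees every row is assigned exactly once.
    """
    total = sum(ratios.values())
    bands, start, acc = {}, 0, 0
    for name, share in ratios.items():
        acc += share
        end = round(num_rows * acc / total)
        bands[name] = range(start, end)
        start = end
    return bands

# The 65 / 25 / 10 split from the example, applied to 8192 rows.
bands = partition_rows(8192, {"GPU": 65, "CPU1": 25, "CPU2": 10})
```

Rounding the cumulative share, rather than each share separately, avoids losing or duplicating a row when the shares do not divide the row count evenly.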
10
Implementation
  • Uses general-purpose 1D-FFT libraries
  • CPU
  • FFTW [Frigo et al.]
  • A widely used FFT library
  • GPU
  • GPUFFTW [Govindaraju et al.]: uses
    NV_fragment_program, an OpenGL extension
  • CUFFT: FFT library implemented on NVIDIA CUDA
  • CUDA
  • Good performance (vs. GPUFFTW)
  • Can execute sizes that exceed GPU
    memory capacity
  • Processes sets of 1D-FFTs in rotation, each set sized to fit GPU memory
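
The last point, processing sets of 1D-FFTs in rotation so that each set fits in GPU memory, can be sketched as a simple batching loop. The helper name `gpu_batches`, the memory budget, and the 8 bytes per element (single-precision complex) are assumptions for illustration; a real implementation would also reserve workspace for the FFT itself.

```python
def gpu_batches(num_ffts, fft_len, mem_bytes, bytes_per_elem=8):
    """Yield (start, count) batches of 1D FFTs so that each batch fits in mem_bytes."""
    per_fft = fft_len * bytes_per_elem      # bytes needed for one 1D FFT's data
    max_count = mem_bytes // per_fft        # how many FFTs fit at once
    if max_count < 1:
        raise ValueError("a single 1D FFT does not fit in the given memory budget")
    start = 0
    while start < num_ffts:
        count = min(max_count, num_ffts - start)
        yield start, count                  # transfer, transform, and copy back this batch
        start += count

# 8192 FFTs of length 8192 need 512 MiB in total; with a 256 MiB budget
# they are processed in two batches.
batches = list(gpu_batches(8192, 8192, 256 * 1024 * 1024))
```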

11
Exec. flow of our library
  • The execution flow differs depending on the data order
    required by the 1D-FFT library.

12
Outline
  • Introduction
  • Combined use of CPUs and GPU
  • Evaluation (Library Performance)
  • Predict Optimal Ratio using Model
  • Evaluation (Model Accuracy)
  • Summary

13
Evaluation (Library Performance)
  • Evaluation items
  • Performance vs. problem size
  • Performance vs. load distribution ratio
  • Evaluation environment
  • Core 2 Duo E6400, 2.13 GHz
  • Intel 975X, PCI Express 1.0 x16
  • 4 GB main memory
  • GeForce 8800 GTX (768 MB memory)
  • Linux 2.6.18, GNU GCC 4.1.2
  • FFTW 3.1.2
  • NVIDIA Linux Display Driver version 79.46
  • (CUFFT ver. 79.51, CUFFT 0.81)

14
Performance vs. Problem Size
  • 2D-FFT execution time vs. problem size (n^2)

35% cut in execution time (50% performance improvement)
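
The two figures on this slide describe the same result. A short check, assuming "performance" means the inverse of execution time (the symbols T are introduced here only for illustration):

```latex
\[
\frac{T_{\text{CPU only}}}{T_{\text{combined}}}
  = \frac{1}{1 - 0.35} \approx 1.54
\]
```

So cutting execution time by 35% corresponds to roughly 50% higher performance.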
15
Performance vs. Load Distribution Ratio (1)
  • Size 8192^2, CPU 1 thread and 2 threads (50:50)
  • Using CUFFT

16
Performance vs. Load Distribution Ratio (1)
  • Size 8192^2, CPU 1 thread and 2 threads (50:50)
  • Using CUFFT
  • x2.2 performance vs. CPU 1 thread
  • x1.5 performance vs. CPU 2 threads

17
Performance vs. Load Distribution Ratio (1)
  • Size 8192^2, CPU 1 thread and 2 threads (50:50)
  • Using CUFFT
  • CPU power is wasted by the GPU control thread

18
Performance vs. Load Distribution Ratio (2)
  • Size 8192^2, CPU 1 thread and 2 threads (50:50)
  • Using GPUFFTW
  • x1.5 performance vs. CPU 1 thread

19
Outline
  • Introduction
  • Combined use of CPUs and GPU
  • Evaluation (Library Performance)
  • Predict Optimal Ratio using Model
  • Evaluation (Model Accuracy)
  • Summary

20
Predict Optimal Load Distribution Ratio
  • The GPU's real performance is not determined by the spec
    sheet
  • Must account for transfer cost, initialization cost, etc.
  • The cost of each phase is different.
  • 2D-FFT: O(n^2 log n), GEMM: O(n^3)
  • Transfer: O(n^2); compressible or not?
  • Temporary memory allocation, etc.
  • A static distribution ratio is insufficient

Using a performance model
  • The model predicts execution time
  • Search for the optimal ratio using parameters obtained
    from a pre-execution run
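
Once a model can predict execution time for a given ratio, the search itself can be as simple as scanning candidate ratios. The sketch below assumes a plain grid search, which the slides do not specify; `best_ratio` and `toy_model` are hypothetical, and the toy model's numbers are made up purely to exercise the search.

```python
def best_ratio(predict_time, steps=100):
    """Return (r, t) minimizing predict_time(r) over r = 0, 1/steps, ..., 1."""
    candidates = [i / steps for i in range(steps + 1)]
    r = min(candidates, key=predict_time)
    return r, predict_time(r)

def toy_model(r, cpu_total=10.0, gpu_total=5.0, gpu_overhead=1.0):
    """Made-up model: CPU and GPU run concurrently; the GPU pays a fixed overhead."""
    cpu = (1.0 - r) * cpu_total
    gpu = r * gpu_total + (gpu_overhead if r > 0 else 0.0)
    return max(cpu, gpu)

r, t = best_ratio(toy_model)
```

In this toy case the minimum lies where the CPU and GPU finish at the same time, which is the balance point a static ratio would miss if the overhead changed.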

21
Performance Model
  • Divide into several sub-steps
  • Predicts the execution time of each step using
    profiling results

[Figure: predicted overall exec. time, composed of the max. predicted compute time for CPU/GPU in the col-wise computation and the max. predicted compute time for CPU/GPU in the row-wise computation]
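
A minimal sketch of how such a model could be composed follows. It assumes the CPU and GPU synchronize between the two phases, so the phase times add up, and that each device's time scales linearly with the number of 1D-FFTs it receives. `PhaseProfile` and its per-FFT costs are hypothetical stand-ins for the profiled parameters; the slides' actual equations are not reproduced here.

```python
from collections import namedtuple

# Hypothetical profiled costs, in seconds per 1D FFT, for one phase.
PhaseProfile = namedtuple("PhaseProfile", ["cpu_per_fft", "gpu_per_fft", "transfer_per_fft"])

def phase_time(p, n, r):
    """Predicted time of one phase: the slower of CPU and GPU decides."""
    cpu = (1.0 - r) * n * p.cpu_per_fft
    gpu = r * n * (p.gpu_per_fft + p.transfer_per_fft)
    return max(cpu, gpu)

def predicted_total(col, row, n, r):
    """Predicted overall time: col-wise phase plus row-wise phase."""
    return phase_time(col, n, r) + phase_time(row, n, r)

col = PhaseProfile(cpu_per_fft=2e-3, gpu_per_fft=0.5e-3, transfer_per_fft=0.5e-3)
row = PhaseProfile(cpu_per_fft=2e-3, gpu_per_fft=0.5e-3, transfer_per_fft=0.5e-3)
total_half = predicted_total(col, row, n=1000, r=0.5)
```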
22
Detail of Performance Model
n: problem size, r: load distribution ratio to GPU (0 ≤ r ≤ 1), parameters
23
Detail of Performance Model
Col-wise Exec. time
Row-wise Exec. time
n: problem size, r: load distribution ratio to GPU (0 ≤ r ≤ 1), parameters
24
Detail of Performance Model
  • Terms of the exec. time derived from
  • Problem size n
  • Distribution ratio r

n: problem size, r: load distribution ratio to GPU (0 ≤ r ≤ 1), parameters
25
Detail of Performance Model
Parameters obtained from pre-exec.
n: problem size, r: load distribution ratio to GPU (0 ≤ r ≤ 1), parameters
26
Outline
  • Introduction
  • Combined use of CPUs and GPU
  • Evaluation (Library Performance)
  • Predict Optimal Ratio using Model
  • Evaluation (Model Accuracy)
  • Summary

27
Evaluation (Model Accuracy)
  • Evaluation items
  • Prediction accuracy (target problem size =
    size at which parameters were obtained)
  • Prediction accuracy (target problem size >
    size at which parameters were obtained)
  • Evaluation points
  • Difference between real exec. time and predicted
    time
  • Performance degradation when using the predicted ratio
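
The two evaluation points can be written down as simple metrics. The definitions below are one plausible reading, not taken from the slides: error is relative to the measured time, and degradation compares the time at the predicted ratio against the best measured time over all ratios. The function names are hypothetical.

```python
def relative_error(predicted, measured):
    """Prediction error as a fraction of the measured execution time."""
    return abs(predicted - measured) / measured

def degradation(time_at_predicted_ratio, best_measured_time):
    """Extra time paid for using the predicted ratio instead of the true optimum."""
    return time_at_predicted_ratio / best_measured_time - 1.0
```

A degradation of 0 means the predicted ratio performs as well as the best ratio found by measurement.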

28
Performance Model Evaluation (1)
  • Problem size 8192^2
  • Parameters obtained from 8192^2

Parameters obtained from the red-circled points
29
Performance Model Evaluation (1)
  • Problem size 8192^2
  • Parameters obtained from 8192^2, CPU/GPU not
    combined

Distribution ratio error is less than 5%
Exec. time error is 2% on average (max. 6%)
30
Performance Model Evaluation (2)
  • Problem size 1024^2
  • Parameters obtained from 512^2, GPU/CPU not combined

Distribution ratio error is less than 5%
Exec. time error is 4.5% on average
31
Performance Model Evaluation (2)
  • Problem size 8192^2
  • Parameters obtained from 512^2, GPU/CPU not
    combined
  • (Data size x256, exec. time x300)

Distribution ratio error is 5%
Exec. time error is mostly under 20%
32
Detail of Execution Time
  • Problem size: 8192^2
  • Distribution ratio: 70%
  • Parameters from 512^2
  • Left: predicted time
  • Right: measured time
  • Red line: synchronization

33
Detail of Execution Time
  • Problem size: 8192^2
  • Distribution ratio: 70%
  • Parameters from 512^2
  • Left: predicted time
  • Right: measured time
  • Red line: synchronization

GPU time is predicted almost exactly
34
Detail of Execution Time
  • Problem size: 8192^2
  • Distribution ratio: 70%
  • Parameters from 512^2
  • Left: predicted time
  • Right: measured time
  • Red line: synchronization

Large error at the 1st transpose
Cache? Memory bandwidth? Under investigation
35
Detail of Execution Time
  • Problem size: 8192^2
  • Distribution ratio: 70%
  • Parameters from 512^2
  • Left: predicted time
  • Right: measured time
  • Red line: synchronization

Large error in the FFT on the CPU
Includes planning, etc. Work in progress
36
Outline
  • Introduction
  • Combined use of CPUs and GPU
  • Evaluation (Library Performance)
  • Predict Optimal Ratio using Model
  • Evaluation (Model Accuracy)
  • Summary

37
Related Work (Combined Use of CPU and GPU)
  • cg-gemm [Ohshima et al. '06]
  • GEMM library using CPU and GPU
  • CPU is single core
  • GEMM is much easier than FFT
  • GEMM's transfer cost is negligible
  • No performance model

38
Related Work (Performance Model)
  • [Govindaraju et al. '06]
  • Performance model on GPU
  • Memory model
  • Finds the optimal blocking size
  • They show x2 to x5 speedup
  • [Underwood et al. '06]
  • Performance model on FPGA
  • Predicts exec. time with under 5% error
  • FPGA only (does not include combined use)

39
Summary
  • A prototype combined CPU/GPU FFT library
  • Improves performance by about 40% vs. using only the CPU
  • Predicts the best allocation ratio between CPUs and GPU
    using performance modeling
  • Less than 5% error at double the size at which the
    parameters were obtained
  • -> No performance degradation when using the predicted ratio

40
Future Works
  • Extension to
  • Many CPUs and GPUs
  • Clusters
  • 3D-FFT
  • Extend the performance model to those environments
  • Integrate a power consumption model into our
    model
  • Predict total energy