Title: "GPU Programming and Performance" Kevin R. Tubbs
1 "GPU Programming and Performance" Kevin R. Tubbs
IGERT Fellow, Donald W. Clayton Graduate Program
in Engineering Science, Louisiana State
University, 3507 CEBA Building, Baton Rouge, LA
70803-6405, PH (225) 578-4246, FAX (225)
578-6782, E-mail ktubbs2_at_lsu.edu
- IGERT Student Seminar Series
Source: heise.de, tomshardware.com, wikipedia.org
Source: nvidia.com
4 Overview: CUDA Zone for CFD
- Elegant Mathematics
- Iterative Solvers
- Linear System Solvers
- CFD and EM Solvers
- AccelerEyes
- Jacket GPU Engine
- Ports MATLAB code to GPU
- Supports GPU-friendly functions
6 Journal Papers
- Methods
- 3D Euler
- LBM (2D 3D)
- Discontinuous Galerkin (DG)
- Smoothed Particle Hydrodynamics
- 23X Meshless/MeshFree
- Lagrangian
- LES Navier-Stokes Code
- 17X Cartesian
- Red-Black SOR Poisson Solver
- 3D Incompressible Navier-Stokes Code
- 100X (4 GPU)
- Cartesian
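The Red-Black SOR Poisson solver listed above suits GPUs because grid points of one color depend only on points of the other color, so each half-sweep can be updated in parallel. A minimal serial sketch in Python, assuming a 2D grid, fixed boundary values, and a hypothetical function name `red_black_sor` (this is an illustration, not code from the cited papers):

```python
def red_black_sor(u, f, h, omega, sweeps):
    """Solve the 5-point discrete Poisson problem (Laplacian of u = f) in place.

    Boundary entries of u are held fixed. Points are colored by (i + j) % 2;
    within one color no point reads another point of the same color.
    """
    n = len(u)
    for _ in range(sweeps):
        for color in (0, 1):
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    if (i + j) % 2 != color:
                        continue
                    # Gauss-Seidel value from the four neighbours
                    gs = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                 + u[i][j - 1] + u[i][j + 1]
                                 - h * h * f[i][j])
                    # Over-relaxation step
                    u[i][j] += omega * (gs - u[i][j])
    return u


# Usage: Laplace problem whose exact solution is u = x + y.
n = 9
h = 1.0 / (n - 1)
u = [[(i + j) * h if i in (0, n - 1) or j in (0, n - 1) else 0.0
      for j in range(n)] for i in range(n)]
f = [[0.0] * n for _ in range(n)]
red_black_sor(u, f, h, omega=1.5, sweeps=200)
max_err = max(abs(u[i][j] - (i + j) * h) for i in range(n) for j in range(n))
```

On a GPU, each inner double loop over one color would map to one kernel launch with one thread per point; the grid size, relaxation factor, and sweep count here are arbitrary choices for the example.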
8 Applications: AccelerEyes
- AccelerEyes
- Jacket GPU Engine
- Maximum Reported Speedup Using Jacket: 50X
- LBM Speedup: 10X
- Graphics Toolbox
- OpenGL
- Basic Plot open source functions
9 Applications: Elegant Mathematics
- Avg. Performance, 100K unknowns
- 7 GFlop/s on a single NVIDIA 260 (100 MFlop/s on QC Xeon 2.6 GHz)
- Block Versions: 70 GFlop/s (3 GFlop/s)
- EMGPUHMatrix
- Linear System 163,840 unknowns
- Generated and Solved w/ 25X Speedup (QC Xeon)
- EMGPUPartBoltzmann
- Solves Boltzmann Eq. via Particle Method
- Collision 30X (QC Xeon)
- Stream 120X (QC Xeon)
- Problem Size
- 15 M particles on 1 GPU
- 1 B particles on GPU CPU w/ 64 GB of RAM Using
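The throughput figures on this slide imply speedups that can be checked with simple arithmetic. A small sketch, assuming the values in parentheses are the quad-core Xeon baseline for the same solver (the variable and function names are made up for this example):

```python
# Throughput figures quoted on the slide, in GFlop/s.
gpu_scalar = 7.0    # single GPU
cpu_scalar = 0.1    # 100 MFlop/s, assumed to be the Xeon baseline
gpu_block = 70.0    # block versions on the GPU
cpu_block = 3.0     # assumed Xeon baseline for the block versions


def speedup(gpu_gflops, cpu_gflops):
    """Ratio of GPU to CPU sustained throughput."""
    return gpu_gflops / cpu_gflops


print(f"scalar versions: {speedup(gpu_scalar, cpu_scalar):.0f}X")
print(f"block versions:  {speedup(gpu_block, cpu_block):.1f}X")
```

Under that reading, the block versions raise absolute throughput on both platforms, but the relative GPU advantage is smaller than for the scalar versions.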
10 Method Performance: LBM
- Good Locality in Memory Access
- 2D LBM for N-S
- D2Q9
- 10X
- 3D LBM
- D3Q15
- 28X, 100X (4 GPU)
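The memory-locality point can be seen in a D2Q9 update: collision touches only the nine values stored at one node, and streaming copies each value to a fixed neighbour. A minimal sketch in Python, assuming a BGK collision operator, periodic boundaries, and hypothetical function names (none of these choices come from the slides):

```python
# D2Q9 lattice: rest, four axis, and four diagonal velocities with weights.
C = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def equilibrium(rho, ux, uy):
    """Second-order equilibrium distribution for one node."""
    usq = ux * ux + uy * uy
    feq = []
    for (cx, cy), w in zip(C, W):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1 + 3 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq


def collide(f, tau):
    """BGK relaxation in place; each node reads only its own 9 values."""
    for row in f:
        for y, node in enumerate(row):
            rho = sum(node)
            ux = sum(c[0] * fi for c, fi in zip(C, node)) / rho
            uy = sum(c[1] * fi for c, fi in zip(C, node)) / rho
            feq = equilibrium(rho, ux, uy)
            row[y] = [fi - (fi - fe) / tau for fi, fe in zip(node, feq)]


def stream(f):
    """Move each population one lattice step along its velocity (periodic)."""
    nx, ny = len(f), len(f[0])
    out = [[[0.0] * 9 for _ in range(ny)] for _ in range(nx)]
    for x in range(nx):
        for y in range(ny):
            for i, (cx, cy) in enumerate(C):
                out[(x + cx) % nx][(y + cy) % ny][i] = f[x][y][i]
    return out


def total_mass(f):
    return sum(sum(node) for row in f for node in row)


# Usage: small periodic box with a density perturbation, fluid at rest.
nx, ny, tau = 8, 8, 0.8
f = [[equilibrium(1.0 + 0.01 * ((x + y) % 3), 0.0, 0.0) for y in range(ny)]
     for x in range(nx)]
m0 = total_mass(f)
for _ in range(10):
    collide(f, tau)
    f = stream(f)
m1 = total_mass(f)
```

On a GPU, one thread per node would run the body of `collide` and the inner loop of `stream`; total mass is unchanged by both steps, which is a convenient sanity check.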
11 Method Performance: 2D LBM
- 2D Shallow Water Mass Transport
- 10X
- Same Code Cast to GPU
- No Optimization
12 Method Performance: DG
- Maxwell's Eq.
- DG, 3D Unstructured Mesh
- DG
- Good Locality in Memory Access (element local)
- High order requires fewer data points
A. Klöckner, T. Warburton, J. Bridge, J. S.
Hesthaven (2009), Nodal Discontinuous Galerkin
Methods on Graphics Processors
13 Method Performance: DG
- Maxwell's Eq.
- DG, 3D Unstructured Mesh
- Single $400 GTX 280 GPU
- 40X to 60X relative to serial
- 200 GFlop/s
14 Method Performance: 3D N-S Solver
- Lid-Driven Cavity
- Re = 1000, Mesh 32^3, 256^3
- 2D flow physics, 3D computations
- Numerical Method
- Incompressible flow, Cartesian Method
- Projection Method, Press/Vel Staggered Grid
- 1st order in time, 2nd order diff., Jacobi for Poisson
Julien C. Thibault and Inanc Senocak (2009),
CUDA Implementation of a Navier-Stokes Solver on
Multi-GPU Desktop Platforms for Incompressible
Flows, 47th AIAA Aerospace Sciences Meeting,
Orlando, FL, paper no. AIAA-2009-758.
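The Jacobi iteration named in the numerical method is attractive on a GPU because every point is updated from the previous iterate only, so all points can be computed independently. A minimal sketch in Python, assuming a 2D grid for brevity (the slide describes a 3D solver), zero boundary values, and hypothetical function names; it is not the cited authors' code:

```python
def jacobi_poisson(b, h, iterations):
    """Jacobi iteration for the 5-point discrete Poisson problem lap(p) = b.

    p is zero on the boundary. Each sweep reads only the previous iterate,
    so every interior point could be assigned to its own GPU thread.
    """
    n = len(b)
    p = [[0.0] * n for _ in range(n)]
    for _ in range(iterations):
        new = [row[:] for row in p]
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                new[i][j] = 0.25 * (p[i - 1][j] + p[i + 1][j]
                                    + p[i][j - 1] + p[i][j + 1]
                                    - h * h * b[i][j])
        p = new
    return p


def max_residual(p, b, h):
    """Largest absolute value of lap(p) - b over interior points."""
    n = len(p)
    worst = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lap = (p[i - 1][j] + p[i + 1][j] + p[i][j - 1] + p[i][j + 1]
                   - 4.0 * p[i][j]) / (h * h)
            worst = max(worst, abs(lap - b[i][j]))
    return worst


# Usage: uniform source term on a 9 x 9 grid.
n = 9
h = 1.0 / (n - 1)
b = [[1.0] * n for _ in range(n)]
res_few = max_residual(jacobi_poisson(b, h, 5), b, h)
res_many = max_residual(jacobi_poisson(b, h, 500), b, h)
```

Jacobi needs many iterations to converge, but each one is a simple, fully parallel stencil pass, which is the trade-off that makes it a common first choice for GPU pressure solvers.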
15 Method Performance: 3D N-S Solver
- Workstation GPU Speedup
- Intel Core 2 Duo 3.0 GHz (Max 21X)
- AMD Opteron 2.4 GHz (Max 100X)
16 Method Performance: 3D N-S Solver
- Server GPU Speedup
- Scales at 75% Efficiency with Number of GPUs
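Reading the slide's figure as 75% parallel efficiency, the expected multi-GPU gain over a single GPU follows directly from the definition of efficiency. A short sketch in Python, with hypothetical function names and illustrative GPU counts:

```python
def multi_gpu_speedup(n_gpus, efficiency):
    """Speedup over one GPU when n_gpus run at the given parallel efficiency."""
    return n_gpus * efficiency


def parallel_efficiency(speedup_over_one_gpu, n_gpus):
    """Inverse relation: efficiency implied by a measured multi-GPU speedup."""
    return speedup_over_one_gpu / n_gpus


for n in (2, 4):
    print(f"{n} GPUs at 75% efficiency: "
          f"{multi_gpu_speedup(n, 0.75):.1f}X over a single GPU")
```

So four GPUs at 75% efficiency deliver about the throughput of three ideal ones; the shortfall is typically communication and synchronization between devices.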
17 IGERT Hardware Options
- Tesla
- $1300
- 240 Streaming Processor Cores
- Frequency of processor cores: 1.3 GHz
- Single Precision floating point performance (peak): 933 GFlop/s
- Double Precision floating point performance (peak): 78 GFlop/s
- Dedicated Memory: 4 GB GDDR3
- GTX 280
- $350
- 240 Processor Cores
- Processor Clock: 1.3 GHz
- Dedicated Memory: 1 GB GDDR3