Title: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware
1 A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware
Nolan Goodnight, Cliff Woolley, Gregory Lewin, David Luebke, Greg Humphreys
University of Virginia
Graphics Hardware 2003
July 26-27 San Diego, CA
2 General-Purpose GPU Programming
- Why do we port algorithms to the GPU?
- How much faster can we expect it to be, really?
- What is the challenge in porting?
3 Case Study
- Problem: Implement a Boundary Value Problem (BVP) solver using the GPU
- Could benefit an entire class of scientific and engineering applications, e.g.
- Heat transfer
- Fluid flow
4 Related Work
- Krüger and Westermann: Linear Algebra Operators for GPU Implementation of Numerical Algorithms
- Bolz et al.: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
- Very similar to our system
- Developed concurrently
- Complementary approach
5 Driving problem: Fluid mechanics sim
- Problem domain is a warped disc
[Figure: problem domain and regular grid]
6 BVPs: Background
- Boundary value problems are sometimes governed by PDEs of the form $L\phi = f$
- $L$ is some operator
- $\phi$ is the unknown field defined over the problem domain
- $f$ is a forcing function (source term)
- Given $L$ and $f$, solve for $\phi$.
7 BVPs: Example
- Heat Transfer
- Find a steady-state temperature distribution $T$ in a solid of thermal conductivity $k$ with thermal source $S$
- This requires solving a Poisson equation of the form $k\nabla^2 T = -S$
- This is a BVP where $L$ is the Laplacian operator $\nabla^2$
- All our applications require a Poisson solver.
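As a worked illustration that is not on the slides: assuming a uniform 2D grid with spacing $h$ and the standard second-order five-point stencil, the Poisson equation for $T$ becomes one linear equation per interior grid point, which can be rearranged into an update rule for $T_{i,j}$. The indices $i, j$ and the spacing $h$ are notation introduced here.

```latex
k\,\frac{T_{i+1,j} + T_{i-1,j} + T_{i,j+1} + T_{i,j-1} - 4\,T_{i,j}}{h^2} = -S_{i,j}
\quad\Longrightarrow\quad
T_{i,j} = \frac{1}{4}\left(T_{i+1,j} + T_{i-1,j} + T_{i,j+1} + T_{i,j-1} + \frac{h^2}{k}\,S_{i,j}\right)
```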
8 BVPs: Solving
- Most such problems cannot be solved analytically
- Instead, discretize onto a grid to form a set of linear equations, then solve
- Direct elimination
- Gauss-Seidel iteration
- Conjugate-gradient
- Strongly implicit procedures
- Multigrid method
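To make one of the listed options concrete, here is a minimal Python sketch of Gauss-Seidel iteration on a discretized Poisson problem. It is an illustration only, not the solver from the talk: the function names, the square grid with fixed (Dirichlet) boundary values, and the uniform spacing `h` are all assumptions.

```python
def gauss_seidel_sweep(phi, f, h):
    """One in-place Gauss-Seidel sweep for laplacian(phi) = f.

    phi and f are square lists of lists; the outer ring of phi holds
    fixed boundary values and is never modified.
    """
    n = len(phi)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            # Five-point stencil solved for the centre value; neighbours
            # already updated in this sweep are used immediately.
            phi[i][j] = 0.25 * (phi[i - 1][j] + phi[i + 1][j]
                                + phi[i][j - 1] + phi[i][j + 1]
                                - h * h * f[i][j])


def max_residual(phi, f, h):
    """Largest |laplacian(phi) - f| over the interior points."""
    n = len(phi)
    worst = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lap = (phi[i - 1][j] + phi[i + 1][j] + phi[i][j - 1]
                   + phi[i][j + 1] - 4.0 * phi[i][j]) / (h * h)
            worst = max(worst, abs(lap - f[i][j]))
    return worst
```

Repeating `gauss_seidel_sweep` until `max_residual` falls below a tolerance solves the system, but the number of sweeps needed grows with grid size, which is the motivation for multigrid.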
9 Multigrid method
- Iteratively corrects an approximation to the solution
- Operates at multiple grid resolutions
- Low-resolution grids are used to correct higher-resolution grids recursively
- Very fast, especially for large grids: O(n)
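A short supporting calculation for the O(n) figure, under assumptions not stated on the slide: a 2D grid with $n$ points at the finest level, coarsening by a factor of two in each dimension (so each level has a quarter of the points of the one above it), and a constant cost $c$ per grid point per level. The work in one multigrid cycle is then bounded by a geometric series:

```latex
W_{\text{cycle}} = c\,n\left(1 + \tfrac{1}{4} + \tfrac{1}{16} + \cdots\right) < \tfrac{4}{3}\,c\,n = O(n)
```

The overall cost stays O(n) when the number of cycles needed to converge does not grow with grid size.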
10 Multigrid method
- Use coarser grid levels to recursively correct an approximation to the solution
- Algorithm
- smooth
- residual
- restrict
- recurse
- interpolate
Residual: $L\phi_i - f$
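The five steps can be sketched as a recursive V-cycle. The sketch below is a 1D illustration in Python rather than the 2D GPU implementation from the talk. The slides name only the steps, so the weighted-Jacobi smoother, full-weighting restriction, linear interpolation, zero boundary values, and all function names are assumptions made for the example.

```python
def apply_L(phi, h):
    """Discrete 1D Laplacian; the two boundary entries are left at zero."""
    out = [0.0] * len(phi)
    for i in range(1, len(phi) - 1):
        out[i] = (phi[i - 1] - 2.0 * phi[i] + phi[i + 1]) / (h * h)
    return out


def smooth(phi, f, h, sweeps=2):
    """Weighted Jacobi relaxation (weight 2/3), updating phi in place."""
    w = 2.0 / 3.0
    for _ in range(sweeps):
        old = phi[:]
        for i in range(1, len(phi) - 1):
            jacobi = 0.5 * (old[i - 1] + old[i + 1] - h * h * f[i])
            phi[i] = (1.0 - w) * old[i] + w * jacobi


def residual(phi, f, h):
    """Residual L(phi) - f at interior points, zero on the boundary."""
    Lphi = apply_L(phi, h)
    return [0.0] + [Lphi[i] - f[i] for i in range(1, len(phi) - 1)] + [0.0]


def restrict(fine):
    """Full-weighting restriction onto a grid with half as many intervals."""
    nc = (len(fine) - 1) // 2 + 1
    coarse = [0.0] * nc
    for i in range(1, nc - 1):
        coarse[i] = 0.25 * fine[2 * i - 1] + 0.5 * fine[2 * i] + 0.25 * fine[2 * i + 1]
    return coarse


def interpolate(coarse):
    """Linear interpolation onto a grid with twice as many intervals."""
    fine = [0.0] * (2 * (len(coarse) - 1) + 1)
    for i, value in enumerate(coarse):
        fine[2 * i] = value
    for i in range(1, len(fine), 2):
        fine[i] = 0.5 * (fine[i - 1] + fine[i + 1])
    return fine


def v_cycle(phi, f, h):
    """One V-cycle for L(phi) = f on 2**k + 1 points with zero boundaries."""
    if len(phi) <= 3:
        phi[1] = -0.5 * h * h * f[1]           # one unknown: solve exactly
        return phi
    smooth(phi, f, h)                           # smooth
    res = residual(phi, f, h)                   # residual
    coarse_rhs = [-r for r in restrict(res)]    # restrict; error obeys L e = -res
    e = v_cycle([0.0] * len(coarse_rhs), coarse_rhs, 2.0 * h)   # recurse
    correction = interpolate(e)                 # interpolate
    for i in range(1, len(phi) - 1):
        phi[i] += correction[i]
    smooth(phi, f, h)
    return phi
```

Calling `v_cycle` repeatedly on the finest grid until the residual is small enough gives the full solver.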
11 Implementation
- For each step of the algorithm
- Bind as texture maps the buffers that contain the necessary data
- Set the target buffer for rendering
- Activate a fragment program that performs the necessary kernel computation
- Render a grid-sized quad with multitexturing
[Figure: source buffer textures, fragment program, render target buffer]
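The render-pass pattern above can be mimicked on the CPU to show the data flow. This Python sketch is an analogy, not graphics API code: `render_pass` stands in for drawing a grid-sized quad, the `sources` tuple stands in for the bound textures, and `jacobi_kernel` plays the role of a fragment program. All of these names are hypothetical, and `f` is assumed to be pre-scaled by the squared grid spacing.

```python
def render_pass(kernel, sources, width, height):
    """Run `kernel` once per pixel and write the results to a new target.

    The source buffers are only read and the target is only written,
    mirroring the separation of bound textures and render target.
    """
    target = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            target[y][x] = kernel(sources, x, y)
    return target


def jacobi_kernel(sources, x, y):
    """Per-pixel smoothing step; f is assumed pre-multiplied by h*h."""
    phi, f = sources
    height, width = len(phi), len(phi[0])
    if x == 0 or y == 0 or x == width - 1 or y == height - 1:
        return phi[y][x]                     # boundary values pass through
    return 0.25 * (phi[y][x - 1] + phi[y][x + 1]
                   + phi[y - 1][x] + phi[y + 1][x] - f[y][x])
```

A solver step then becomes `phi = render_pass(jacobi_kernel, (phi, f), width, height)`, with the output of one pass bound as a source for the next.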
12 Optimizing the Solver
- Detect steady-state natively on GPU
- Minimize shader length
- Special-case whenever possible
- Avoid context-switching
13 Optimizing the Solver: Steady-state
- How to detect convergence?
- L1 norm: average error
- L2 norm: RMS error (common in visual sim)
- L∞ norm: max error (common in sci/eng apps)
- Can use occlusion query!
[Plot: seconds to steady state vs. grid size]
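The three error measures can be written out in a few lines of Python. The last function shows one way an occlusion-style test could detect convergence in the max-error sense: count the pixels whose error exceeds a tolerance and stop when the count reaches zero. The slide does not spell out how the occlusion query is used, so that function and all names here are assumptions.

```python
import math


def error_norms(error):
    """Return (average, RMS, max) of |error| over a 2D buffer."""
    values = [abs(v) for row in error for v in row]
    average = sum(values) / len(values)                        # L1-style
    rms = math.sqrt(sum(v * v for v in values) / len(values))  # L2-style
    maximum = max(values)                                      # L-infinity
    return average, rms, maximum


def count_unconverged_pixels(error, tolerance):
    """Emulate a pixel-count query: how many pixels still exceed tolerance."""
    return sum(1 for row in error for v in row if abs(v) > tolerance)
```

With this test, "the count is zero" is equivalent to "the max error is at most the tolerance", and only a single integer has to be read back from the GPU.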
14 Optimizing the Solver: Shader length
- Minimize number of registers used
- Vectorize as much as possible
- Use the rasterizer to perform computations of linearly-varying values
- Pre-compute invariants on CPU

shader        original fp   fastpath fp   fastpath vp
smooth        79-6-1        20-4-1        12-2
residual      45-7-0        16-4-0        11-1
restrict      66-6-1        21-3-0        11-1
interpolate   93-6-1        25-3-0        13-2
15 Optimizing the Solver: Special-case
- Fast-path vs. slow-path
- write several variants of each fragment program to handle boundary cases
- eliminates conditionals in the fragment program
- equivalent to avoiding CPU inner-loop branching
[Figure: fast path, no boundaries; slow path with boundaries]
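The CPU analogy mentioned in the last bullet can be sketched directly. In this Python illustration the interior of the grid is processed by a branch-free kernel, and only the one-pixel border uses a variant that handles boundaries. The clamp-to-edge boundary rule and all function names are assumptions chosen for the example, not the boundary handling used in the talk.

```python
def fast_path(phi, x, y):
    """Interior kernel: all four neighbours exist, so no checks are needed."""
    return 0.25 * (phi[y][x - 1] + phi[y][x + 1] + phi[y - 1][x] + phi[y + 1][x])


def slow_path(phi, x, y):
    """Boundary kernel: clamps neighbour lookups to the edge of the grid."""
    height, width = len(phi), len(phi[0])

    def fetch(xx, yy):
        return phi[min(max(yy, 0), height - 1)][min(max(xx, 0), width - 1)]

    return 0.25 * (fetch(x - 1, y) + fetch(x + 1, y) + fetch(x, y - 1) + fetch(x, y + 1))


def smooth_pass(phi):
    """Apply the fast path to the interior and the slow path to the border."""
    height, width = len(phi), len(phi[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):            # interior region, no branching
        for x in range(1, width - 1):
            out[y][x] = fast_path(phi, x, y)
    for x in range(width):                    # top and bottom border rows
        out[0][x] = slow_path(phi, x, 0)
        out[height - 1][x] = slow_path(phi, x, height - 1)
    for y in range(1, height - 1):            # left and right border columns
        out[y][0] = slow_path(phi, 0, y)
        out[y][width - 1] = slow_path(phi, width - 1, y)
    return out
```

The result matches running the slow path everywhere, but the per-pixel boundary logic is paid only on the border, which is a small fraction of a large grid.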
16 Optimizing the Solver: Special-case
[Plot: seconds per V-cycle vs. grid size]
17 Optimizing the Solver: Context-switching
- Find the best packing of data from multiple grid levels into the pbuffer surfaces
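One possible packing, shown only to make the idea concrete: put the finest level on the left of a single surface and stack the coarser levels in a column to its right, so that every level lives in one surface and no switch is needed between levels. This layout is an assumption and may differ from the layouts tried in the talk. The sketch also assumes power-of-two level sizes that halve at each level.

```python
def pack_levels(n, coarsest=1):
    """Return one (x, y, size) square per level, finest first.

    All squares fit inside a surface of width n + n // 2 and height n.
    """
    rects = [(0, 0, n)]        # finest level occupies the left n-by-n square
    y = 0
    size = n // 2
    while size >= coarsest and size > 0:
        rects.append((n, y, size))   # coarser levels stack in a right-hand column
        y += size
        size //= 2
    return rects
```

For `n = 8` this yields squares of size 8, 4, 2 and 1. The stacked column has total height 4 + 2 + 1 = 7, so it fits next to the finest level.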
20 Optimizing the Solver: Context-switching
- Remove context switching
- Can introduce operations with undefined results: reading/writing the same surface
- Why do we need to do this?
- Can we get away with it?
- What about superbuffers?
21 Data Layout
[Plot: seconds to steady state vs. grid size]
22 Data Layout
- Possible additional vectorization
- Compute 4 values at a time
- Requires source, residual, solution values to be in different buffers
- Complicates boundary calculations
- Adds setup and teardown overhead
[Figure: stacked domain]
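A hedged reading of the "stacked domain" idea: split the grid into four strips and store them in the four components of each texel, so one kernel invocation updates four grid values. The slide does not describe the layout, so the horizontal-strip split, the tuple representation of a texel, and the function names in this Python sketch are all assumptions.

```python
def stack_domain(grid):
    """Split the rows into 4 equal strips and store them as 4-component texels."""
    rows, width = len(grid), len(grid[0])
    if rows % 4 != 0:
        raise ValueError("row count must be a multiple of 4")
    q = rows // 4
    return [[tuple(grid[y + c * q][x] for c in range(4)) for x in range(width)]
            for y in range(q)]


def unstack_domain(stacked):
    """Inverse of stack_domain: rebuild the full grid from the texels."""
    q, width = len(stacked), len(stacked[0])
    grid = [[0.0] * width for _ in range(4 * q)]
    for y in range(q):
        for x in range(width):
            for c in range(4):
                grid[y + c * q][x] = stacked[y][x][c]
    return grid


def add_texels(a, b):
    """Component-wise add of two stacked buffers: 4 grid values per texel."""
    return [[tuple(p + r for p, r in zip(ta, tb)) for ta, tb in zip(ra, rb)]
            for ra, rb in zip(a, b)]
```

In this layout, the neighbour above the first row of one strip is the last row of the previous strip, which lives in a different component. That is one way the boundary calculations become more complicated, and the stack and unstack steps correspond to the setup and teardown overhead.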
23 Results: CPU vs. GPU
[Plot: seconds to steady state vs. grid size]
24 Conclusions
- What we need going forward
- Superbuffers
- or: Universal support for multiple-surface pbuffers
- or: Cheap context switching
- Developer tools
- Debugging tools
- Documentation
- Global accumulator
- Ever-increasing amounts of precision, memory
- Textures bigger than 2048 on a side
25 Acknowledgements
- Hardware
- David Kirk
- Matt Papakipos
- Driver Support
- Nick Triantos
- Pat Brown
- Stephen Ehmann
- Fragment Programming
- James Percy
- Matt Pharr
- General-purpose GPU
- Mark Harris
- Aaron Lefohn
- Ian Buck
- Funding
- NSF Award 0092793