A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware
Nolan Goodnight, Cliff Woolley, Gregory Lewin, David Luebke, Greg Humphreys
University of Virginia
Graphics Hardware 2003, July 26-27, San Diego, CA
augmented by Klaus Mueller, Stony Brook University
General-Purpose GPU Programming
Why do we port algorithms to the GPU?
How much faster can we expect it to be, really?
What is the challenge in porting?
Case Study
Problem: Implement a Boundary Value Problem (BVP) solver using the GPU
Could benefit an entire class of scientific and engineering applications, e.g.:
Heat transfer
Fluid flow
Related Work
Krüger and Westermann: Linear Algebra Operators for GPU Implementation of Numerical Algorithms
Bolz et al.: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
Very similar to our system
Developed concurrently
Complementary approach
Driving problem: Fluid mechanics sim
Problem domain is a warped disc:
[Figure: warped disc domain, panels labeled "regular grid"]
BVPs: Background
Boundary value problems are sometimes governed by PDEs of the form:
$L\phi = f$
$L$ is some operator
$\phi$ is the unknown field over the problem domain
$f$ is a forcing function (source term)
Given $L$ and $f$, solve for $\phi$.
BVPs: Example
Heat Transfer
Find a steady-state temperature distribution $T$ in a solid of thermal conductivity $k$ with thermal source $S$
This requires solving a Poisson equation of the form:
$k\nabla^2 T = -S$
This is a BVP where the operator $L$ is the Laplacian $\nabla^2$
All our applications require a Poisson solver.
BVPs: Solving
Most such problems cannot be solved analytically
Instead, discretize onto a grid to form a set of linear equations, then solve:
Direct elimination
Gauss-Seidel iteration
Conjugate-gradient
Strongly implicit procedures
Multigrid method
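As an illustration of the discretization step, here is a minimal worked form. It assumes a uniform grid with spacing $h$ and grid indices $i, j$; both are assumptions made for this sketch, since the warped domain of the talk needs a more general operator. With the standard five-point stencil, the Poisson equation $k\nabla^2 T = -S$ becomes one linear equation per interior grid point:

```latex
k \, \frac{T_{i+1,j} + T_{i-1,j} + T_{i,j+1} + T_{i,j-1} - 4\,T_{i,j}}{h^2} = -S_{i,j}
```

Collecting these equations over all grid points gives the sparse linear system that the solvers listed above operate on.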
Multigrid method
Iteratively corrects an approximation to the solution
Operates at multiple grid resolutions
Low-resolution grids are used to correct higher-resolution grids recursively
Very fast, especially for large grids: O(n)
Multigrid method
Use coarser grid levels to recursively correct an approximation to the solution
may converge slowly on fine grid -> restrict to coarse grid
push out long-wavelength errors quickly (single-grid solvers only smooth out high-frequency errors)
Algorithm:
smooth
residual: $r = L\phi_i - f$
restrict
recurse
interpolate
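The algorithm steps can be sketched as a recursive V-cycle. This is a CPU-side illustration in plain Python, not the GPU implementation from the talk. It assumes a 1D problem $u'' = f$ with zero Dirichlet end values, a grid of $2^m + 1$ points, weighted-Jacobi smoothing, and the residual convention $r = Lu - f$. All function names are hypothetical.

```python
import math

def smooth(u, f, h, sweeps):
    """Weighted-Jacobi sweeps for u'' = f; the two end values stay fixed."""
    w = 2.0 / 3.0
    for _ in range(sweeps):
        old = u[:]
        for i in range(1, len(u) - 1):
            jac = 0.5 * (old[i - 1] + old[i + 1] - h * h * f[i])
            u[i] = (1 - w) * old[i] + w * jac
    return u

def residual(u, f, h):
    """r = L u - f on interior points (assumed sign convention)."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = (u[i - 1] - 2 * u[i] + u[i + 1]) / (h * h) - f[i]
    return r

def restrict(r):
    """Full weighting: average fine values onto every other point."""
    n = (len(r) - 1) // 2
    rc = [0.0] * (n + 1)
    for i in range(1, n):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc

def interpolate(ec, n_fine):
    """Linear interpolation of a coarse correction onto the fine grid."""
    e = [0.0] * n_fine
    for i in range(len(ec)):
        e[2 * i] = ec[i]
    for i in range(1, n_fine, 2):
        e[i] = 0.5 * (e[i - 1] + e[i + 1])
    return e

def v_cycle(u, f, h):
    if len(u) <= 3:
        # Coarsest grid: a single interior point, solved exactly.
        u[1] = 0.5 * (u[0] + u[2] - h * h * f[1])
        return u
    smooth(u, f, h, 2)
    r = residual(u, f, h)
    # Error equation L e = -r, so the coarse source term is -restrict(r).
    fc = [-x for x in restrict(r)]
    ec = v_cycle([0.0] * len(fc), fc, 2 * h)
    e = interpolate(ec, len(u))
    for i in range(1, len(u) - 1):
        u[i] += e[i]
    smooth(u, f, h, 2)
    return u

# Usage: u'' = -pi^2 sin(pi x) on [0, 1], whose exact solution is sin(pi x).
n = 65
h = 1.0 / (n - 1)
xs = [i * h for i in range(n)]
f = [-math.pi ** 2 * math.sin(math.pi * x) for x in xs]
u = [0.0] * n
r_start = max(abs(v) for v in residual(u, f, h))
for _ in range(10):
    v_cycle(u, f, h)
r_end = max(abs(v) for v in residual(u, f, h))
```

Each level does work proportional to its size and the sizes halve, which is where the O(n) cost per cycle comes from.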
Implementation - Overview
For each step of the algorithm:
Bind as texture maps the buffers that contain the necessary data (current solution, residual, source terms, etc.)
Set the target buffer for rendering
Activate a fragment program that performs the necessary kernel computation (smoothing, residual calculation, restriction, interpolation)
Render a grid-sized quad with multitexturing
[Figure: source buffer textures are read by the fragment program, which writes to the render target buffer]
Implementation - Overview
Input buffers
Solution buffer: four-channel floating point pixel buffer (p-buffer)
one channel each for solution, residual, source term, and a debugging term
toggle front and back surfaces used to hold old and new solution
Operator map: contains the discretized operator (e.g., Laplacian)
Red-black map: accelerates odd-even tests (see later)
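The render-pass pattern (bind sources, set a target, run a kernel per fragment) and the front/back toggle can be emulated on the CPU. This is only a sketch in plain Python with no graphics API involved. `run_pass`, `PingPong`, and `average_kernel` are hypothetical names invented for the illustration.

```python
def run_pass(kernel, sources, target):
    """Apply `kernel` at every cell, reading `sources` and writing `target`.

    Stands in for rendering a grid-sized quad with a fragment program bound.
    """
    rows, cols = len(target), len(target[0])
    for y in range(rows):
        for x in range(cols):
            target[y][x] = kernel(sources, x, y)

class PingPong:
    """Two surfaces: `front` holds the old solution, `back` receives the new one."""

    def __init__(self, rows, cols, fill=0.0):
        self.front = [[fill] * cols for _ in range(rows)]
        self.back = [[fill] * cols for _ in range(rows)]

    def swap(self):
        self.front, self.back = self.back, self.front

def average_kernel(sources, x, y):
    """Hypothetical kernel: mean of the four neighbours, clamped at the edges."""
    g = sources["solution"]
    rows, cols = len(g), len(g[0])
    up = g[max(y - 1, 0)][x]
    down = g[min(y + 1, rows - 1)][x]
    left = g[y][max(x - 1, 0)]
    right = g[y][min(x + 1, cols - 1)]
    return 0.25 * (up + down + left + right)

# Usage: one pass reads the front surface, writes the back one, then toggles.
buf = PingPong(3, 3)
buf.front[1][1] = 4.0
run_pass(average_kernel, {"solution": buf.front}, buf.back)
buf.swap()
```

Because the kernel never writes to the surface it reads, every cell sees the old solution, which is the property the front/back toggle provides.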
Smoothing
Jacobi method
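one matrix row: $\sum_j a_{ij} x_j = b_i$
calculate new value for each solution vector element: $x_i^{(n+1)} = \frac{1}{a_{ii}} \big( b_i - \sum_{j \neq i} a_{ij} x_j^{(n)} \big)$
in our application, the $a_{ij}$ are the Laplacian (sparse matrix): e.g., on a uniform grid each row holds only the five-point stencil (center weight $-4$, four neighbor weights $1$, scaled by $1/h^2$)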
Smoothing
Also factor in the source term
Use Red-black map to update only half of the grid cells in each pass
converges faster in practice
known as red-black iteration
requires two passes per iteration
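A red-black iteration can be sketched as follows. The sketch assumes a uniform grid with spacing `h`, the five-point Laplacian, and fixed boundary cells. The test `(x + y) % 2` stands in for the precomputed red-black map. The function name and parameters are hypothetical.

```python
def red_black_sweep(u, s, h, k=1.0):
    """One red-black iteration for k * laplacian(u) = -s on interior cells.

    Boundary cells of `u` are left untouched (assumed Dirichlet values).
    """
    rows, cols = len(u), len(u[0])
    for colour in (0, 1):  # pass 1 updates one colour, pass 2 the other
        for y in range(1, rows - 1):
            for x in range(1, cols - 1):
                if (x + y) % 2 != colour:
                    continue
                nb = u[y - 1][x] + u[y + 1][x] + u[y][x - 1] + u[y][x + 1]
                # Solve k * (nb - 4 u) / h^2 = -s for the centre value.
                u[y][x] = 0.25 * (nb + h * h * s[y][x] / k)
    return u

# Usage: boundary held at 1, no source; the interior relaxes towards 1.
grid = [[1.0] * 5 for _ in range(5)]
for y in range(1, 4):
    for x in range(1, 4):
        grid[y][x] = 0.0
source = [[0.0] * 5 for _ in range(5)]
for _ in range(50):
    red_black_sweep(grid, source, h=1.0)
```

Cells of one colour depend only on cells of the other colour, so each half-pass can update all of its cells in parallel while still using the newest neighbour values.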
Calculate residual
Apply operator (Laplacian) and source term to the current solution
residual = $k\nabla^2 T + S$
Store result in the target surface
Use occlusion query to determine if all solution fragments are below threshold (|residual| < threshold)
occlusion query = true means all fragments are below threshold
this is an $L_\infty$ norm, which may be too strict
less strict norms ($L_1$, $L_2$) require a reduction or a fragment accumulation register (not available yet); could run on the CPU instead
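The residual and the three norms can be compared with a small CPU sketch. It assumes a uniform grid and the five-point Laplacian. `fragments_above` only emulates the idea of an occlusion query by counting cells whose residual magnitude exceeds the threshold; it is not an actual graphics API call. All names are hypothetical.

```python
import math

def residual_grid(T, S, h, k=1.0):
    """residual = k * laplacian(T) + S on interior cells (zero on the border)."""
    rows, cols = len(T), len(T[0])
    r = [[0.0] * cols for _ in range(rows)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            lap = (T[y - 1][x] + T[y + 1][x] + T[y][x - 1] + T[y][x + 1]
                   - 4 * T[y][x]) / (h * h)
            r[y][x] = k * lap + S[y][x]
    return r

def fragments_above(r, threshold):
    """Count cells with |residual| > threshold; zero means converged (max norm)."""
    return sum(1 for row in r for v in row if abs(v) > threshold)

def norms(r):
    """Average (L1), RMS (L2), and max (Linf) of the residual magnitudes."""
    flat = [abs(v) for row in r for v in row]
    n = len(flat)
    return {
        "L1": sum(flat) / n,
        "L2": math.sqrt(sum(v * v for v in flat) / n),
        "Linf": max(flat),
    }

# Usage: a single spike in the middle of an otherwise zero field.
T = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
S = [[0.0] * 3 for _ in range(3)]
r = residual_grid(T, S, h=1.0)
result = norms(r)
```

The counting test needs no reduction over the grid, whereas the average and RMS norms need a sum over all cells, which is why they are harder to evaluate on the GPU.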
Multigrid reduction and refinement
Average (restrict) current residual into coarser grid
Iterate/smooth on coarser grid, solving $k\nabla^2 \delta = -S$ for the correction $\delta$
Interpolate correction back into finer grid
or restrict once more -> recursion
use bilinear interpolation
Update grid with this correction
Iterate/smooth on fine grid
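Restriction by averaging and bilinear interpolation can be sketched on the CPU. The sketch assumes cell-centred grids whose side lengths are even and a 2:1 ratio between levels. These layout choices are assumptions for the illustration, and the function names are hypothetical.

```python
def restrict2d(fine):
    """Average each 2x2 block of fine cells into one coarse cell."""
    rows, cols = len(fine) // 2, len(fine[0]) // 2
    return [[0.25 * (fine[2 * y][2 * x] + fine[2 * y][2 * x + 1]
                     + fine[2 * y + 1][2 * x] + fine[2 * y + 1][2 * x + 1])
             for x in range(cols)] for y in range(rows)]

def sample_bilinear(grid, fy, fx):
    """Bilinearly sample `grid` at fractional index (fy, fx), clamped to the edges."""
    rows, cols = len(grid), len(grid[0])
    fy = min(max(fy, 0.0), rows - 1.0)
    fx = min(max(fx, 0.0), cols - 1.0)
    y0, x0 = int(fy), int(fx)
    y1, x1 = min(y0 + 1, rows - 1), min(x0 + 1, cols - 1)
    ty, tx = fy - y0, fx - x0
    top = (1 - tx) * grid[y0][x0] + tx * grid[y0][x1]
    bottom = (1 - tx) * grid[y1][x0] + tx * grid[y1][x1]
    return (1 - ty) * top + ty * bottom

def interpolate2d(coarse):
    """Bilinear interpolation of a coarse grid to twice the resolution."""
    rows, cols = 2 * len(coarse), 2 * len(coarse[0])
    # A fine cell centre sits at (i + 0.5) / 2 in coarse units; subtracting
    # 0.5 converts that position into a coarse cell index.
    return [[sample_bilinear(coarse, (y + 0.5) / 2 - 0.5, (x + 0.5) / 2 - 0.5)
             for x in range(cols)] for y in range(rows)]

def apply_correction(fine, correction):
    """Add the interpolated correction to the fine-grid solution in place."""
    for y in range(len(fine)):
        for x in range(len(fine[0])):
            fine[y][x] += correction[y][x]
    return fine

# Usage: restrict a 2x2 block, then interpolate a small coarse grid.
coarse_value = restrict2d([[1.0, 2.0], [3.0, 4.0]])
fine_rows = interpolate2d([[0.0, 4.0], [0.0, 4.0]])
```

On graphics hardware the interpolation weights would typically come from texture filtering or from the rasterizer; here they are computed explicitly.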
Boundary conditions
Dirichlet (prescribed)
Neumann (prescribed derivative)
Mixed (coupled value and derivative)
$U_k$: value at grid point $k$
$n_k$: normal at grid point $k$
Periodic boundaries result in toroidal mapping
Apply boundary conditions in smoothing pass
Boundary conditions
Only need to compute at boundaries
boundaries need significantly more computations
restrict computations to boundaries
GPUs do not allow branching
or rather, both branches are executed and the invalid result is discarded
which is even more wasteful
decompose domain into boundary and interior areas
use general (boundary) and fastpath (interior) shaders
run these in two separate passes, on respective domains
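The two-pass decomposition can be sketched as follows. For simplicity the sketch assumes a rectangular domain with a one-cell border and a Dirichlet condition. The warped domain of the talk would need a more general boundary region. Kernel and function names are hypothetical.

```python
def interior_kernel(u, x, y):
    """Fast path: plain four-neighbour average with no boundary tests."""
    return 0.25 * (u[y - 1][x] + u[y + 1][x] + u[y][x - 1] + u[y][x + 1])

def boundary_kernel(u, x, y, value):
    """General path: here a Dirichlet condition that pins the prescribed value."""
    return value

def smooth_two_pass(u, boundary_value):
    """Run the fast kernel on the interior and the general kernel on the border."""
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]
    # Pass 1: interior region only, branch-free kernel.
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            new[y][x] = interior_kernel(u, x, y)
    # Pass 2: the one-cell border, general kernel.
    border = [(x, y) for y in range(rows) for x in range(cols)
              if x in (0, cols - 1) or y in (0, rows - 1)]
    for x, y in border:
        new[y][x] = boundary_kernel(u, x, y, boundary_value)
    return new

# Usage: two iterations on a 3x3 grid with the boundary prescribed as 2.
step1 = smooth_two_pass([[0.0] * 3 for _ in range(3)], 2.0)
step2 = smooth_two_pass(step1, 2.0)
```

The decision about which kernel runs is made once per region rather than once per cell, so neither kernel carries the cost of the other's code path.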
Optimizing the Solver
Detect steady-state natively on GPU
Minimize shader length
Use special-casing whenever possible
Limit context switches
Optimizing the Solver: Steady-state
How to detect convergence?
$L_1$ norm: average error
$L_2$ norm: RMS error (common in visual sim)
$L_\infty$ norm: max error (common in sci/eng apps)
Can use occlusion query!
[Plot: seconds to steady state vs. grid size]
Optimizing the Solver: Shader length
Minimize number of registers used
Vectorize as much as possible
Use the rasterizer to perform computations of linearly-varying values
Pre-compute invariants on CPU
Compute texture coordinate offsets in vertex shader
shader       original fp   fastpath fp   fastpath vp
smooth       79-6-1        20-4-1        12-2
residual     45-7-0        16-4-0        11-1
restrict     66-6-1        21-3-0        11-1
interpolate  93-6-1        25-3-0        13-2
Optimizing the Solver: Special-case
Fast-path vs. slow-path
write several variants of each fragment program to handle boundary cases
eliminates conditionals in the fragment program
equivalent to avoiding CPU inner-loop branching
slow path with boundaries
fast path, no boundaries
[Plot: seconds per V-cycle vs. grid size]
Optimizing the Solver: Context-switching
Find the best packing of data from multiple grid levels into the p-buffer surfaces:
many p-buffers
two p-buffers
a single p-buffer
Still one front and one back surface for iterative smoothing
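One way to place all grid levels in a single surface is to lay them side by side. The layout below is purely hypothetical, since the transcript does not preserve the packing actually used. It assumes square levels whose side halves at each step. `pack_levels` is an invented name.

```python
def pack_levels(n, coarsest=1):
    """Place square levels of side n, n/2, ..., coarsest left to right.

    Returns (layout, width) where layout maps level -> (x_offset, y_offset, size)
    and width is the total number of columns used.
    """
    layout = {}
    x, level, size = 0, 0, n
    while size >= coarsest:
        layout[level] = (x, 0, size)
        x += size
        size //= 2
        level += 1
    return layout, x

# Usage: an 8x8 fine grid with levels down to 1x1.
layout, width = pack_levels(8)
```

With this arrangement the total width stays below twice the fine grid's side, and moving between levels only changes the sub-rectangle being addressed rather than the surface itself.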
Optimizing the Solver: Context-switching
Remove context switching
Can introduce operations with undefined results: reading/writing same surface
Why do we need to do this? There is a chance that we write to and read from the same surface at the same time
Can we get away with it? Yes, we can. Just need to be careful to avoid these conflicts
What about RGBA parallelism? It was not used in this implementation; it may give another boost of a factor of 4
Data Layout
Performance:
[Plot: seconds to steady state vs. grid size]
Data Layout
Possible additional vectorization: stacked domain
Compute 4 values at a time
Requires source, residual, solution values to be in different buffers
Complicates boundary calculations
Adds setup and teardown overhead
Results: CPU vs. GPU
Performance:
[Plot: seconds to steady state vs. grid size]
Applications – Flow Simulation
Applications – High Dynamic Range
[Figure: CPU vs. GPU results]
Conclusions
What we need going forward:
Superbuffers
or: Universal support for multiple-surface pbuffers
or: Cheap context switching
Developer tools
Debugging tools
Documentation
Global accumulator
Ever increasing amounts of precision, memory
Textures bigger than 2048 on a side
Acknowledgements
Hardware
David Kirk
Matt Papakipos
Driver Support
Nick Triantos
Pat Brown
Stephen Ehmann
Fragment Programming
James Percy
Matt Pharr
General-purpose GPU
Mark Harris
Aaron Lefohn
Ian Buck
Funding
NSF Award #0092793