a multigrid solver for boundary value problems using programmable graphics hardware

A Multigrid Solver for Boundary Value Problems Using Programmable

Graphics HardwareNolan Goodnight Cliff Woolley Gregory Lewin

David Luebke Greg Humphreys

University of Virginia

Graphics Hardware 2003July 26-27 – San Diego, CA

augmented by Klaus Mueller, Stony Brook University

General-Purpose GPU Programming

Why do we port algorithms to the GPU?

How much faster can we expect it to be, really?

What is the challenge in porting?

Case Study

Problem: Implement a Boundary Value Problem (BVP) solver using the GPU

Could benefit an entire class of scientific and engineering applications, e.g.:

Heat transfer

Fluid flow

Related Work

Krüger and Westermann: Linear Algebra Operators for GPU Implementation of Numerical Algorithms

Bolz et al.: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid

Very similar to our system Developed concurrently

Complementary approach

Driving problem: Fluid mechanics sim

Problem domain is a warped disc:

regular grid

regular grid

BVPs: Background

Boundary value problems are sometimes governedby PDEs of the form:

= f

is some operator

is the problem domain

f is a forcing function (source term)

Given and f, solve for .

BVPs: Example

Heat Transfer Find a steady-state temperature distribution T

in a solid of thermal conductivity k with thermal source S

This requires solving a Poisson equation of the form:

k2T = -S

This is a BVP where is the Laplacian operator 2

All our applications require a Poisson solver.

BVPs: Solving

Most such problems cannot be solved analytically

Instead, discretize onto a grid to form a set of linear equations, then solve:

Direct elimination

Gauss-Seidel iteration

Conjugate-gradient

Strongly implicit procedures

Multigrid method

Multigrid method

Iteratively corrects an approximation to the solution

Operates at multiple grid resolutions

Low-resolution grids are used to correct higher-resolution grids recursively

Very fast, especially for large grids: O(n)

Multigrid method

Use coarser grid levels to recursively correct an approximation to the solution

may converge slowly on fine grid -> restrict to course grid push out long wavelength errors quickly (single

grid solvers only smooth out high frequency errors)

Algorithm:

smooth

residual restrict

recurse

interpolate = i - f

Implementation - Overview

For each step of the algorithm:

Bind as texture maps the buffers that contain the necessary data (current solution, residual, source terms, etc.)

Set the target buffer for rendering

Activate a fragment program that performs the necessary kernel computation (smoothing, residual calculation, restriction, interpolation)

Render a grid-sized quad with multitexturing

fragment program

render target buffer

render target buffer

source buffer texture

source buffer texture

Implementation - Overview

Input buffers

Solution buffer: four-channel floating point pixel buffer (p-buffer)

one channel each for solution, residual, source term, and a debugging term

toggle front and back surfaces used to hold old and new solution

Operator map: contains the discretized operator (e.g., Laplacian)

Red-black map: accelerate odd-even tests (see later)

Smoothing

Jacobi method

one matrix row:

calculate new value for each solution vector element:

in our application, the aij are the Laplacian (sparse matrix):

Smoothing

Also factor in the source term

Use Red-black map to update only half of the grid cells in each pass

converges faster in practice

known as red-black iteration

requires two passes per iteration

Calculate residual

Apply operator (Laplacian) and source term to the current solution

residual = k2T + S Store result in the target surface

Use occlusion query to determine if all solution fragments are below threshold ( < threshold)

occlusion query = true means all fragments are below threshold

this is an L norm, which may be too strict

less strict norms L1, L2, will require reduction or fragment accumulation register (not available yet), could run in CPU instead

Multigrid reduction and refinement

Average (restrict) current residual into coarser grid

Iterate/smooth on coarser grid, solving k2 = -S

Interpolate correction back into finer grid

or restrict once more -> recursion

use bilinear interpolation

Update grid with this correction

Iterate/smooth on fine grid

Boundary conditions

Dirichlet (prescribed)

Neumann (prescribed derivative)

Mixed (coupled value and derivative)

Uk: value at grid point k

nk: normal at grid point k

Periodic boundaries result in toroidal mapping

Apply boundary conditions in smoothing pass

Boundary conditions

Only need to compute at boundaries

boundaries need significantly more computations

restrict computations to boundaries

GPUs do not allow branching

or better, both branches are executed and the invalid fragment is discarded

even more wasteful

decompose domain into boundary and interior areas

use general (boundary) and fastpath (interior) shaders

run these in two separate passes, on respective domains

Optimizing the Solver

Detect steady-state natively on GPU

Minimize shader length

Use special-case whenever possible

Limit context switches

Optimizing the Solver: Steady-state

How to detect convergence?

L1 norm - average error

L2 norm – RMS error (common in visual sim)

L norm – max error (common in sci/eng apps) Can use occlusion query!

secs to steady statevs. grid size

Optimizing the Solver: Shader length

Minimize number of registers used Vectorize as much as possible Use the rasterizer to perform computations of

linearly-varying values Pre-compute invariants on CPU Compute texture coodinate offsets in vertex

shader

shader original fp

fastpath fp

fastpath vp

smooth 79-6-1 20-4-1 12-2

residual 45-7-0 16-4-0 11-1

restrict 66-6-1 21-3-0 11-1

interpolate 93-6-1 25-3-0 13-2

Optimizing the Solver: Special-case

Fast-path vs. slow-path

write several variants of each fragment program to handle boundary cases

eliminates conditionals in the fragment program

equivalent to avoiding CPU inner-loop branching

slow path with boundaries

fast path, no boundaries

Optimizing the Solver: Special-case

Fast-path vs. slow-path

write several variants of each fragment program to handle boundary cases

eliminates conditionals in the fragment program

equivalent to avoiding CPU inner-loop branching

secs per v-cyclevs. grid size

Optimizing the Solver: Context-switching

Find best packing data of multiple grid levelsinto the pbuffer surfaces - many p-buffers


Find best packing data of multiple grid levelsinto the pbuffer surfaces - two p-buffers


Find best packing data of multiple grid levelsinto the pbuffer surfaces - a single p-buffer

Still one front- and one back surface for iterative smoothing


Remove context switching

Can introduce operations with undefined results: reading/writing same surface

Why do we need to do this? there is a chance that we write and read from

the same surface at the same time

Can we get away with it? Yes, we can. Just need to be careful to avoid

these conflicts

What about RGBA parallelism? was not used in this implemtation, may give

another boost of factor 4

Data Layout

Performance:


Data Layout

Compute 4 values at a time

Requires source, residual, solution values to be in different buffers

Complicates boundary calculations

Adds setup and teardown overhead

Stacked domain

Possible additional vectorization:

Results: CPU vs. GPU

Performance:


Applications – Flow Simulation

Applications – High Dynamic Range

CPU GPU

Conclusions

What we need going forward:

Superbuffers or: Universal support for multiple-surface

pbuffers

or: Cheap context switching

Developer tools Debugging tools

Documentation

Global accumulator

Ever increasing amounts of precision, memory Textures bigger than 2048 on a side

Acknowledgements

Hardware

David Kirk

Matt Papakipos

Driver Support

Nick Triantos

Pat Brown

Stephen Ehmann

Fragment Programming

James Percy

Matt Pharr

General-purpose GPU

Mark Harris

Aaron Lefohn

Ian Buck

Funding

NSF Award #0092793

a multigrid solver for boundary value problems using programmable graphics hardware

Documents

course grid

fine grid

single grid solvers

new value

gridsized quad

new solution operator

laplacian operator

thermal source sthis