Iterative solution of linear systems


Victor Eijkhout
Spring 2021
Justification
As an alternative to Gaussian elimination, iterative methods can be an efficient way to solve the linear systems that arise from PDEs. We discuss basic iterative methods and the notion of preconditioning.
Two different approaches

Solve $Ax = b$
Direct methods:
Iterative methods:
• Only approximate
• Convergence not guaranteed
Stationary iteration
Iterative methods
$x_{k+1} = Bx_k + c$

until $\|x_{k+1} - x_k\|_2 < \varepsilon$ or until $\|x_{k+1} - x_k\|_2 / \|x_k\|_2 < \varepsilon$
Example of iterative solution

Example system:
$\begin{pmatrix} 10 & 0 & 1 \\ 1/2 & 7 & 1 \\ 1 & 0 & 6 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix} 21 \\ 9 \\ 8 \end{pmatrix}$
with solution $(2,1,1)$.

Suppose you know (physics) that the solution components are roughly the same size, and observe the dominant size of the diagonal; then solve
$\begin{pmatrix} 10 & & \\ & 7 & \\ & & 6 \end{pmatrix}
\begin{pmatrix} \bar x_1 \\ \bar x_2 \\ \bar x_3 \end{pmatrix} =
\begin{pmatrix} 21 \\ 9 \\ 8 \end{pmatrix}$
Iterative example′

Example system:
$\begin{pmatrix} 10 & 0 & 1 \\ 1/2 & 7 & 1 \\ 1 & 0 & 6 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix} 21 \\ 9 \\ 8 \end{pmatrix}$

Also easy to solve:
$\begin{pmatrix} 10 & & \\ 1/2 & 7 & \\ 1 & 0 & 6 \end{pmatrix}
\begin{pmatrix} \bar x_1 \\ \bar x_2 \\ \bar x_3 \end{pmatrix} =
\begin{pmatrix} 21 \\ 9 \\ 8 \end{pmatrix}$
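For concreteness, here is a small numerical check of this example with numpy (my own sketch, not part of the slides); x_diag and x_tri are the solutions of the two "easy" systems.

import numpy as np

A = np.array([[10., 0., 1.],
              [0.5, 7., 1.],
              [1.,  0., 6.]])
b = np.array([21., 9., 8.])

x_true = np.linalg.solve(A, b)          # exact solution (2, 1, 1)

D = np.diag(np.diag(A))                 # keep only the diagonal
x_diag = np.linalg.solve(D, b)          # (2.1, 1.2857..., 1.3333...)

L = np.tril(A)                          # keep the lower triangle
x_tri = np.linalg.solve(L, b)           # (2.1, 1.1357..., 0.9833...)

print(x_true, x_diag, x_tri)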
Abstract presentation

• To solve $Ax = b$ is too expensive; suppose $K \approx A$ and solving $Kx = b$ is possible.
• Define $Kx_0 = b$, then write $x_0 = x + e_0$ (error correction), and $A(x_0 - e_0) = b$,
• so $Ae_0 = Ax_0 - b = r_0$; this is again unsolvable, so
• $K\tilde e_0 = r_0$ and $x_1 = x_0 - \tilde e_0$.
• In one formula: $x_1 = x_0 - K^{-1}r_0$.
• Now iterate: $e_1 = x_1 - x$, $Ae_1 = Ax_1 - b = r_1$, et cetera.

Iterative scheme: $x_{i+1} = x_i - K^{-1}r_i$ with $r_i = Ax_i - b$.
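A minimal Python sketch of this scheme (my illustration, not part of the slides); solve_K applies $K^{-1}$, here taken as the diagonal of A, which is the Jacobi choice discussed below.

import numpy as np

def basic_iteration(A, b, solve_K, tol=1e-10, maxit=1000):
    x = np.zeros_like(b)
    for it in range(maxit):
        r = A @ x - b                   # residual r_i = A x_i - b
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, it
        x = x - solve_K(r)              # x_{i+1} = x_i - K^{-1} r_i
    return x, maxit

A = np.array([[10., 0., 1.], [0.5, 7., 1.], [1., 0., 6.]])
b = np.array([21., 9., 8.])
print(basic_iteration(A, b, solve_K=lambda r: r / np.diag(A)))   # K = D_A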
Error analysis
• One step:
  $r_1 = Ax_1 - b = A(x_0 - K^{-1}r_0) - b = r_0 - AK^{-1}r_0 = (I - AK^{-1})r_0$
• Inductively: $r_n = (I - AK^{-1})^nr_0$, so $r_n \downarrow 0$ if all eigenvalues satisfy $|\lambda(I - AK^{-1})| < 1$.
  Geometric reduction (or amplification!)
• This is ‘stationary iteration’: no dependence on the iteration number. Simple analysis, limited applicability.
Takeaway
The iteration process does not have a pre-determined number of operations: it depends on the spectral properties of the matrix.
Complexity analysis
• Direct solution is $O(N^3)$; except sparse, then $O(N^{5/2})$ or so.
• Iterative: cost per iteration is $O(N)$, assuming sparsity.
• Number of iterations is a complicated function of spectral properties:
  – Stationary iteration: #it $= O(N^2)$
  – Other methods: #it $= O(N)$ (2nd order problems only, more for higher order)
  – Multigrid and fast solvers: #it $= O(\log N)$ or even $O(1)$
Choice of K
• The diagonal and lower triangular choices were mentioned above: let
  $A = D_A + L_A + U_A$
  be a splitting into diagonal, lower triangular, and upper triangular parts, then
• Jacobi method: $K = D_A$ (diagonal part),
• Gauss-Seidel method: $K = D_A + L_A$ (lower triangle, including diagonal),
• SOR method: $K = \omega^{-1}D_A + L_A$.
Computationally
Write $A = K - N$; then $Ax = b \Rightarrow Kx = Nx + b \Rightarrow Kx_{i+1} = Nx_i + b$
Equivalent to the above, and you don’t actually need to form the residual.
Jacobi
for k = 1, ... until convergence, do:
  for i = 1 ... n:
    // $a_{ii}x_i^{(k+1)} = -\sum_{j\neq i} a_{ij}x_j^{(k)} + b_i \Rightarrow$
    $x_i^{(k+1)} = a_{ii}^{-1}\bigl(-\sum_{j\neq i} a_{ij}x_j^{(k)} + b_i\bigr)$

Implementation:

for k = 1, ... until convergence, do:
  for i = 1 ... n:
    $t_i = a_{ii}^{-1}\bigl(-\sum_{j\neq i} a_{ij}x_j + b_i\bigr)$
  copy $x \leftarrow t$
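A runnable version of the pseudocode above (my own sketch; the stopping test is the relative-change criterion discussed later).

import numpy as np

def jacobi(A, b, tol=1e-10, maxit=1000):
    n = len(b)
    x = np.zeros(n)
    for k in range(maxit):
        t = np.empty(n)
        for i in range(n):
            s = A[i, :] @ x - A[i, i] * x[i]      # sum_{j != i} a_ij x_j
            t[i] = (b[i] - s) / A[i, i]           # t_i = a_ii^{-1} (-sum + b_i)
        if np.linalg.norm(t - x) < tol * np.linalg.norm(t):
            return t, k
        x = t                                     # copy x <- t
    return x, maxit

A = np.array([[10., 0., 1.], [0.5, 7., 1.], [1., 0., 6.]])
b = np.array([21., 9., 8.])
print(jacobi(A, b))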
Jacobi in pictures:
Gauss-Seidel
for k = 1, ... until convergence, do:
  for i = 1 ... n:
    // $a_{ii}x_i^{(k+1)} + \sum_{j<i} a_{ij}x_j^{(k+1)} = -\sum_{j>i} a_{ij}x_j^{(k)} + b_i \Rightarrow$
    $x_i^{(k+1)} = a_{ii}^{-1}\bigl(-\sum_{j<i} a_{ij}x_j^{(k+1)} - \sum_{j>i} a_{ij}x_j^{(k)} + b_i\bigr)$

Implementation (in place):

for k = 1, ... until convergence, do:
  for i = 1 ... n:
    $x_i = a_{ii}^{-1}\bigl(-\sum_{j\neq i} a_{ij}x_j + b_i\bigr)$
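The same for Gauss-Seidel (my sketch): the update is in place, so newly computed components are used immediately.

import numpy as np

def gauss_seidel(A, b, tol=1e-10, maxit=1000):
    n = len(b)
    x = np.zeros(n)
    for k in range(maxit):
        x_old = x.copy()
        for i in range(n):
            s = A[i, :] @ x - A[i, i] * x[i]      # uses updated x_j for j < i
            x[i] = (b[i] - s) / A[i, i]
        if np.linalg.norm(x - x_old) < tol * np.linalg.norm(x):
            return x, k
    return x, maxit

A = np.array([[10., 0., 1.], [0.5, 7., 1.], [1., 0., 6.]])
b = np.array([21., 9., 8.])
print(gauss_seidel(A, b))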
GS in pictures:
Choice of K through incomplete LU

• Inspiration from direct methods: let $K = LU \approx A$

Gauss elimination:

for k, i, j:
  a[i,j] = a[i,j] - a[i,k] * a[k,j] / a[k,k]

Incomplete variant:

for k, i, j:
  if a[i,j] not zero:
    a[i,j] = a[i,j] - a[i,k] * a[k,j] / a[k,k]

⇒ sparsity of $L + U$ the same as of $A$
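A Python sketch of the incomplete variant (my illustration; dense storage is used only for readability, real ILU codes work on sparse data structures).

import numpy as np

def ilu0(A):
    a = A.astype(float).copy()
    n = a.shape[0]
    pattern = (A != 0)                    # fixed sparsity pattern of A
    for k in range(n - 1):
        for i in range(k + 1, n):
            if pattern[i, k]:
                a[i, k] /= a[k, k]                        # multiplier, stored in the L part
                for j in range(k + 1, n):
                    if pattern[i, j] and pattern[k, j]:
                        a[i, j] -= a[i, k] * a[k, j]      # update only existing entries
    return a                              # L (unit diagonal) and U stored in one array

A = np.array([[10., 0., 1.], [0.5, 7., 1.], [1., 0., 6.]])
print(ilu0(A))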
Applicability
Incomplete factorizations mostly work for M-matrices: 2nd order FDM and FEM
Can be a severe headache for higher-order problems.
Stopping tests

When to stop iterating? Can the size of the error be guaranteed?
• Direct tests on the error $e_n = x - x_n$ are impossible; two practical choices:
• relative change in the computed solution small:
  $\|x_{n+1} - x_n\| / \|x_n\| < \varepsilon$
• residual small enough:
  $\|r_n\| = \|Ax_n - b\| < \varepsilon$

Without proof: both imply that the error is less than some other $\varepsilon'$.
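In code the two tests might look as follows (my sketch; x_new and x_old are successive iterates).

import numpy as np

def small_relative_change(x_new, x_old, eps):
    return np.linalg.norm(x_new - x_old) / np.linalg.norm(x_old) < eps

def small_residual(A, b, x_new, eps):
    return np.linalg.norm(A @ x_new - b) < eps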
Polynomial iterative methods

• Recall $x_1 = x_0 - K^{-1}r_0$ with $r_0 = Ax_0 - b$,
  and the conclusion $r_n = (I - AK^{-1})^nr_0$.
• Abstract relation between true solution and approximation:
  $x_{\mathrm{true}} = x_{\mathrm{initial}} + K^{-1}\pi(AK^{-1})r_{\mathrm{initial}}$
• The Cayley-Hamilton theorem implies that such a polynomial $\pi$ exists, since the inverse of a matrix is a polynomial in that matrix.
• This inspires us to the scheme
  $x_{i+1} = x_0 + K^{-1}\pi^{(i)}(AK^{-1})r_0$:
  a sequence of polynomials $\pi^{(i)}$ of increasing degree.
Residuals

Multiply by $A$ and subtract $b$:
$r_{i+1} = r_0 + AK^{-1}\pi^{(i)}(AK^{-1})r_0$

So: $r_i = \pi^{(i)}(AK^{-1})r_0$,
where $\pi^{(i)}$ is a polynomial of degree $i$ with $\pi^{(i)}(0) = 1$.

⇒ convergence theory
Computational form

$x_{i+1} = x_0 + \sum_{j\leq i} K^{-1}r_j\alpha_{ji}$

$r_{i+1} = r_0 + \sum_{j\leq i} AK^{-1}r_j\alpha_{ji}$, or: $r_i = Ax_i - b$
Orthogonality

Idea one:
If you can make all your residuals orthogonal to each other, and the matrix is of dimension $n$, then after $n$ iterations you have to have converged: it is not possible to have an $(n+1)$-st residual that is orthogonal to all the previous ones and nonzero.

Idea two:
The sequence of residuals spans a series of subspaces of increasing dimension, and by orthogonalizing, the initial residual is projected onto these spaces. This means that the errors will have decreasing sizes.
Minimization
Full Orthogonalization Method

Let $r_0$ be given
For $i \geq 0$:
  let $s \leftarrow K^{-1}r_i$
  let $t \leftarrow AK^{-1}r_i$
  for $j \leq i$:
    let $\gamma_j$ be the coefficient so that $t - \gamma_jr_j \perp r_j$
  for $j \leq i$:
    form $s \leftarrow s - \gamma_jx_j$
    and $t \leftarrow t - \gamma_jr_j$
  let $x_{i+1} = (\sum_j\gamma_j)^{-1}s$, $r_{i+1} = (\sum_j\gamma_j)^{-1}t$.
How do you orthogonalize?

• Given $x, y$, can you compute
  $x \leftarrow$ something with $x, y$
  such that $x \perp y$? (What was that called again in your linear algebra class?)
• Gram-Schmidt method
Coupled recurrences form

$x_{i+1} = x_i - \sum_{j\leq i}\alpha_{ji}K^{-1}r_j$ (4)

Split this into an update along a search direction,
$x_{i+1} = x_i - \delta_ip_i$,
where $p_i$ is a combination $\sum_j\beta_{ij}K^{-1}r_j$ of the preconditioned residuals.
Conjugate Gradients

Basic idea: $r_i^tK^{-1}r_j = 0$ if $i \neq j$.

Split recurrences:
$x_{i+1} = x_i - \delta_ip_i$
$r_{i+1} = r_i - \delta_iAp_i$

Residuals and search directions

Symmetric Positive Definite case:
$r_{i+1} = r_i - \delta_iAp_i$
Preconditioned Conjugate Gradients

Compute $r^{(0)} = b - Ax^{(0)}$ for some initial guess $x^{(0)}$
for $i = 1, 2, \ldots$
  solve $Mz^{(i-1)} = r^{(i-1)}$
  $\rho_{i-1} = r^{(i-1)T}z^{(i-1)}$
  if $i = 1$
    $p^{(1)} = z^{(0)}$
  else
    $\beta_{i-1} = \rho_{i-1}/\rho_{i-2}$
    $p^{(i)} = z^{(i-1)} + \beta_{i-1}p^{(i-1)}$
  endif
  $q^{(i)} = Ap^{(i)}$
  $\alpha_i = \rho_{i-1}/(p^{(i)T}q^{(i)})$
  $x^{(i)} = x^{(i-1)} + \alpha_ip^{(i)}$
  $r^{(i)} = r^{(i-1)} - \alpha_iq^{(i)}$
  check convergence; continue if necessary
end
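A runnable sketch of this algorithm (my own; solve_M stands for application of the preconditioner, that is, solving with M).

import numpy as np

def pcg(A, b, solve_M, tol=1e-10, maxit=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    rho_old, p = None, None
    for i in range(1, maxit + 1):
        z = solve_M(r)                        # solve M z = r
        rho = r @ z
        if i == 1:
            p = z.copy()
        else:
            beta = rho / rho_old
            p = z + beta * p
        q = A @ p
        alpha = rho / (p @ q)
        x = x + alpha * p
        r = r - alpha * q
        rho_old = rho
        if np.linalg.norm(r) < tol * np.linalg.norm(b):   # convergence check
            return x, i
    return x, maxit

A = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])  # small SPD example
b = np.array([1., 2., 3.])
print(pcg(A, b, solve_M=lambda r: r / np.diag(A)))        # Jacobi preconditioner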
Takeaway
Three popular iterative methods
• Conjugate gradients: constant storage and inner products; works only for symmetric systems
• GMRES (like FOM): growing storage and number of inner products; needs restarting and numerical cleverness
• BiCGstab and QMR: relax the orthogonality
CG derived from minimization
Special case of SPD matrices:

For which vector $x$ with $\|x\| = 1$ is $f(x) = \frac12x^tAx - b^tx$ minimal? (5)

Taking the derivative: $f'(x) = Ax - b$.
Optimal solution: $f'(x) = 0 \Rightarrow Ax = b$.

Traditional: variational formulation. New: error of a neural net.
Minimization by line search
Assume full minimization $\min_x f(x) = \frac12x^tAx - b^tx$ is too expensive.

Iterative update
$x_{i+1} = x_i + p_i\delta_i$
where $p_i$ is a search direction.

Finding the optimal value $\delta_i$ is 'line search':
$\delta_i = \mathop{\mathrm{argmin}}_\delta f(x_i + p_i\delta) = -\dfrac{r_i^tp_i}{p_i^tAp_i}$
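The simplest instance (my illustration, not part of the slides) is steepest descent, where the search direction is the negative gradient and the step uses exactly this line-search formula.

import numpy as np

def steepest_descent(A, b, tol=1e-10, maxit=10000):
    x = np.zeros_like(b)
    for i in range(maxit):
        r = A @ x - b                         # gradient f'(x) = Ax - b
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, i
        p = -r                                # steepest-descent search direction
        delta = -(r @ p) / (p @ A @ p)        # exact line search for the quadratic f
        x = x + delta * p
    return x, maxit

A = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
b = np.array([1., 2., 3.])
print(steepest_descent(A, b))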
Line search
• General non-linear systems
• Machine learning: stochastic gradient descent; $p_i$ is a 'block vector' over the training set,
  so $p_i^tAp_i$ becomes a (small) matrix and the scalar division in the line search becomes a solve with $(p_i^tAp_i)^{-1}$.
Let’s go parallel
What’s in an iterative method?
From easy to hard
• Inner product
• Matrix-vector product
• Preconditioner solve
Inner products: collectives
Collective operation: data from all processes is combined. (Is a matrix-vector product a collective?)
Examples: sum-reduction, broadcast. These are each other's mirror image, computationally.
Naive realization of collectives

Broadcast:

Single message:
$\alpha$ = message startup $\approx 10^{-6}$ s, $\beta$ = time per word $\approx 10^{-9}$ s
• Time for a message of $n$ words: $\alpha + \beta n$
• Single inner product: $n = 1$
• Time for the collective?
• Can you improve on that?
Better implementation of collectives
• What is the running time now?
• Can you come up with lower bounds on the α,β terms? Are these achieved here?
• How about the case of really long buffers?
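One common improvement is a tree-shaped broadcast; whether that is what the (lost) picture on this slide showed is an assumption on my part. Under the α-β model the comparison with the naive version looks like this.

import math

alpha = 1e-6     # message startup (s)
beta  = 1e-9     # time per word (s)

def naive_broadcast(p, n):
    return (p - 1) * (alpha + beta * n)                    # root sends to each process in turn

def tree_broadcast(p, n):
    return math.ceil(math.log2(p)) * (alpha + beta * n)    # doubling: about log2(p) rounds

for p in (8, 1024):
    for n in (1, 10**6):
        print(p, n, naive_broadcast(p, n), tree_broadcast(p, n))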
Inner products
• Collective, so induces synchronization
• Research in approaches to hiding: overlapping with other operations
What do those inner products serve?

• Orthogonality of residuals
• Basic algorithm: Gram-Schmidt: subtract from $v$ its components along the already orthogonal vectors $u_i$,
  $v' \leftarrow v - \sum_i \dfrac{u_i^tv}{u_i^tu_i}\,u_i$,
  where all inner products $u_i^tv$ use the original $v$.

Modified Gram-Schmidt: handle one $u_i$ at a time,
  $c_i = u_i^tv/u_i^tu_i$ (with the current $v$),
  update $v \leftarrow v - c_iu_i$.
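The difference is easiest to see in code (my sketch): classical Gram-Schmidt computes every coefficient from the original v, modified Gram-Schmidt updates v after each projection.

import numpy as np

def classical_gs(U, v):
    cs = [(u @ v) / (u @ u) for u in U.T]    # all coefficients from the original v
    for c, u in zip(cs, U.T):
        v = v - c * u
    return v

def modified_gs(U, v):
    for u in U.T:
        c = (u @ v) / (u @ u)                # coefficient from the current v
        v = v - c * u
    return v

U = np.array([[1., 0.], [0., 1.], [0., 0.]])   # two orthogonal columns
v = np.array([1., 2., 3.])
print(classical_gs(U, v.copy()), modified_gs(U, v.copy()))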
Full Orthogonalization Method

Let $r_0$ be given
For $i \geq 0$:
  let $s \leftarrow K^{-1}r_i$
  let $t \leftarrow AK^{-1}r_i$
  for $j \leq i$:
    let $\gamma_j$ be the coefficient so that $t - \gamma_jr_j \perp r_j$
  for $j \leq i$:
    form $s \leftarrow s - \gamma_jx_j$
    and $t \leftarrow t - \gamma_jr_j$
  let $x_{i+1} = (\sum_j\gamma_j)^{-1}s$, $r_{i+1} = (\sum_j\gamma_j)^{-1}t$.
Modified Gram-Schmidt version

Let $r_0$ be given
For $i \geq 0$:
  let $s \leftarrow K^{-1}r_i$
  let $t \leftarrow AK^{-1}r_i$
  for $j \leq i$:
    let $\gamma_j$ be the coefficient so that $t - \gamma_jr_j \perp r_j$
    form $s \leftarrow s - \gamma_jx_j$
    and $t \leftarrow t - \gamma_jr_j$
  let $x_{i+1} = (\sum_j\gamma_j)^{-1}s$, $r_{i+1} = (\sum_j\gamma_j)^{-1}t$.
Practical differences: classical Gram-Schmidt computes all inner products independently (a single synchronization point), but can be numerically unstable; modified Gram-Schmidt is more stable, but synchronizes once for every inner iteration.
Matrix-vector product
PDE, 2D case

A difference stencil applied to a two-dimensional square domain, distributed over processors. Each point connects to its neighbours ⇒ each process connects to its neighbours.
Parallelization
Assume each process has the matrix values and vector values in part of the domain.
Processor needs to get values from neighbors.
Halo region The ‘halo’ region of a process, induced by a stencil
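A sketch of a halo exchange for a 1D row partitioning with mpi4py (my illustration; the sizes and variable names are made up).

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nloc, n = 100, 400                        # local rows, global columns (arbitrary)
u = np.zeros((nloc + 2, n))               # rows 1..nloc owned; rows 0 and nloc+1 are the halo

up   = rank - 1 if rank > 0        else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# send my first owned row up, receive the lower neighbour's last row into the bottom halo
comm.Sendrecv(u[1, :],    dest=up,   recvbuf=u[nloc + 1, :], source=down)
# send my last owned row down, receive the upper neighbour's first row into the top halo
comm.Sendrecv(u[nloc, :], dest=down, recvbuf=u[0, :],        source=up)

# after the exchange, a 5-point stencil can be applied to all owned rows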
Matrices in parallel
Matrix-vector product performance
Beware of optimizations that change the math!
Preconditioners
• Some comments to follow.
• There is intrinsic sequential dependence in solvers, hence in preconditioners: parallelism is very tricky.
• Approximate inverses.
Fill-in during LU
Fill-in: an index $(i,j)$ where $a_{ij} = 0$ but $\ell_{ij} \neq 0$ or $u_{ij} \neq 0$.

A 2D BVP on an $n\times n$ grid gives a matrix of size $N = n^2$, with bandwidth $n$.

Matrix storage: $O(N)$
LU storage: $O(N^{3/2})$
Fill-in is a function of ordering
$\begin{pmatrix} * & * & \cdots & * \\ * & * & & \emptyset \\ \vdots & & \ddots & \\ * & \emptyset & & * \end{pmatrix}$

After factorization the matrix is dense. Can this be permuted?
Domain decomposition
DD factorization

$A_{DD} = \begin{pmatrix} A_{11} & & A_{13} \\ & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}
= \begin{pmatrix} I & & \\ & I & \\ A_{31}A_{11}^{-1} & A_{32}A_{22}^{-1} & I \end{pmatrix}
\begin{pmatrix} A_{11} & & A_{13} \\ & A_{22} & A_{23} \\ & & S \end{pmatrix}$
where $S = A_{33} - A_{31}A_{11}^{-1}A_{13} - A_{32}A_{22}^{-1}A_{23}$.
Parallelism. . .
$a_{ij} \leftarrow a_{ij} - a_{ik}a_{kk}^{-1}a_{kj}$
Graph theory of sparse elimination
$a_{ij} \leftarrow a_{ij} - a_{ik}a_{kk}^{-1}a_{kj}$

So, inductively, $S$ is dense.
More about separators
• Separators have better spectral properties
Recursive bisection
After recursive bisection, $A_{DD}$ has the same block-arrow structure recursively inside each diagonal block.
The domain/operator/graph view is more insightful, don’t you think?
How does this look in reality?
Complexity

With $n = \sqrt N$:
• one dense matrix on a separator of size $n$, plus
• two dense matrices on separators of size $n/2$
• $\rightarrow 3/2\,n^2$ space and $5/12\,n^3$ time
• and then four times the above with $n \rightarrow n/2$

space $= 3/2\,n^2 + 4\cdot3/2\,(n/2)^2 + \cdots = N(3/2 + 3/2 + \cdots)$ ($\log n$ terms) $= O(N\log N)$

time $= 5/12\,n^3/3 + 4\cdot5/12\,(n/2)^3/3 + \cdots = 5/12\,N^{3/2}(1 + 1/4 + 1/16 + \cdots) = O(N^{3/2})$
Unfortunately only in 2D.
More direct factorizations
Minimum degree, multifrontal,. . .
Finding good separators and domain decompositions is tough in general.
Sparse operations in parallel: mvp
Mvp $y = Ax$:

for i=1..n
  y[i] = sum over j=1..n a[i,j]*x[j]

In parallel:

for i=myfirstrow..mylastrow
  y[i] = sum over j=1..n a[i,j]*x[j]
How about ILU solve? Consider $Lx = y$:

for i=1..n
  x[i] = ( y[i] - sum over j=1..i-1 ell[i,j]*x[j] ) / a[i,i]

Parallel code:

for i=myfirstrow..mylastrow
  x[i] = ( y[i] - sum over j=1..i-1 ell[i,j]*x[j] ) / a[i,i]

Block method:

for i=myfirstrow..mylastrow
  x[i] = ( y[i] - sum over j=myfirstrow..i-1 ell[i,j]*x[j] ) / a[i,i]
Figure: Sparsity pattern corresponding to a block Jacobi preconditioner
Variable reordering

$A = \begin{pmatrix} a_{11} & a_{12} & & \emptyset \\ a_{21} & a_{22} & a_{23} & \\ & a_{32} & a_{33} & a_{34} \\ \emptyset & & \ddots & \ddots \end{pmatrix}$

Under a red-black (odd/even) reordering the odd-numbered variables $x_1, x_3, x_5, \ldots$ are numbered first, followed by the even-numbered ones; the two diagonal blocks of the permuted matrix are then themselves diagonal.
2D red-black ordering
Multicolour ILU
How do you get a multi-colouring?

Finding the exact chromatic number is NP-complete: don't bother.

For a preconditioner an approximation is good enough: the Luby / Jones-Plassmann algorithm (sketched in code after this list):
• Give every node a random value
• First colour: all nodes with a higher value than all their neighbours
• Second colour: higher value than all neighbours except those in the first colour
• et cetera
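A Python sketch of this colouring (my illustration; the graph is given as an adjacency list of my own making).

import random

def multicolour(adj):
    val = {v: random.random() for v in adj}       # give every node a random value
    colour, remaining = {}, set(adj)
    c = 0
    while remaining:
        # next colour: nodes whose value beats all still-uncoloured neighbours
        chosen = {v for v in remaining
                  if all(val[v] > val[w] for w in adj[v] if w in remaining)}
        for v in chosen:
            colour[v] = c
        remaining -= chosen
        c += 1
    return colour

adj = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5], 3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}   # 2x3 grid
print(multicolour(adj))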
Recurrences

Intuitively: recursion length $n^2$.

However . . . the grid points can be grouped into 'wavefronts' of points that are mutually independent, so there is considerably more parallelism than the recursion suggests.

Conclusion
2. Equivalence of wavefronts and multicolouring
Recursive doubling
Write the recurrence $x_i = b_i - a_{i-1}x_{i-1}$ as

$\begin{pmatrix} 1 & & \emptyset \\ a_{21} & 1 & \\ \emptyset & \ddots & \ddots \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ \vdots \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \end{pmatrix}$

Transform: combine pairs of equations so that the odd-numbered unknowns are eliminated, leaving a recurrence of half the length in the even-numbered unknowns.

• Now recurse . . .
Normalize the ILU solve to factors $(I - L)$ and $(I - U)$.

Approximate $(I - L)x = y$ by $x \approx (I + L + L^2)y$.

Convergence guaranteed for diagonally dominant matrices.
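A sketch of this truncated Neumann-series substitute for the forward solve (my illustration): only matrix-vector products are needed, which parallelize easily.

import numpy as np

def neumann_solve(L, y, k=2):
    # approximate (I - L)^{-1} y  by  (I + L + ... + L^k) y
    x = y.copy()
    term = y.copy()
    for _ in range(k):
        term = L @ term                   # next term L^j y: just an mvp
        x = x + term
    return x

L = np.array([[0., 0., 0.], [0.2, 0., 0.], [0.1, 0.3, 0.]])   # strictly lower triangular
y = np.array([1., 1., 1.])
print(neumann_solve(L, y), np.linalg.solve(np.eye(3) - L, y))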