Iterative Methods and Parallel Algorithms
S. D. Margenov I. D. Lirkov
Institute for Parallel Processing, Bulgarian Academy of Sciences, Sofia, Bulgaria
http://parallel.bas.bg/~margenov/
http://parallel.bas.bg/~ivan/
Parallel Algorithms – p. 1/50
CONTENTS
1. Introduction
2. Parallel inner product
3. Sparse matrix vector multiplication
4. Jacobi method
5. Conjugate gradient method
6. Preconditioned conjugate gradient method
7. Circulant Block Factorization
8. MIC(0) preconditioning
9. Parallel PCG tests
Parallel Performance
To establish the theoretical performance, a simple model for non-overlapped arithmetic and communication times is assumed.
The execution of $M$ arithmetic operations (a.o.) on one processor takes time
$$T_a = M t_a,$$
where $t_a$ is the average unit time to perform one a.o. on one processor.
Parallel Performance
The communication time to transfer $M$ data elements from one processor to another is approximated by
$$T_{com} = \ell\,(t_s + M t_c),$$
where $t_s$ is the start-up time, $t_c$ is the time necessary for each of the $M$ elements to be sent, and $\ell$ is the graph distance between the processors.
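The two cost formulas can be evaluated directly. The following is a minimal sketch, not taken from the slides: the function names and the numeric machine parameters are hypothetical and serve only to show that the start-up term dominates for short messages.

```python
def arithmetic_time(M, t_a):
    """T_a = M * t_a: time for M arithmetic operations on one processor."""
    return M * t_a


def communication_time(M, t_s, t_c, distance=1):
    """T_com = l * (t_s + M * t_c) for M elements sent over graph distance l."""
    return distance * (t_s + M * t_c)


# Hypothetical machine parameters in seconds, for illustration only.
t_a, t_s, t_c = 1e-9, 1e-5, 1e-8

short_msg = communication_time(1, t_s, t_c)       # dominated by start-up t_s
long_msg = communication_time(10**6, t_s, t_c)    # dominated by M * t_c
print(short_msg, long_msg)
```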
Parallel Speedup and Efficiency
The standard expressions for parallel speedup $S(N,p)$ and parallel efficiency $E(N,p)$ are used:
$$S(N,p) = \frac{T(N,1)}{T(N,p)}, \qquad E(N,p) = \frac{S(N,p)}{p}.$$
Here, $T(N,p)$ stands for the parallel time to solve the problem on $p$ processors, and $N$ is the discrete size of the problem.
Parallel Speedup and Efficiency
The following theoretical estimates hold:
$$0 < S(N,p) \le p, \qquad 0 < E(N,p) \le 1.$$
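As a small worked example of these definitions, the sketch below computes speedup and efficiency from two measured times. The helper names are hypothetical; the timings are the $p = 1$ and $p = 2$ entries for $n = 420$ at 168 MHz from the CBF test table in these slides.

```python
def speedup(t_serial, t_parallel):
    """S = T(N,1) / T(N,p)."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, p):
    """E = S / p."""
    return speedup(t_serial, t_parallel) / p


T1, T2 = 3.718, 1.922          # T(1) and T(2) for n = 420, 168 MHz
S2 = speedup(T1, T2)
E2 = efficiency(T1, T2, 2)
print(f"S = {S2:.2f}, E = {E2:.2f}")   # prints "S = 1.93, E = 0.97"
```

The printed values agree with the tabulated $S_p = 1.93$ and $E_p = 0.97$.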
Iterative Methods
Iterative methods are techniques to solve systems of linear equations
$$Ax = b$$
that generate a sequence of approximations to the solution vector $x$ in the form
$$x^0, x^1, \dots, x^k, \dots$$
Iterative Methods
The process is said to be convergent if the magnitude of the vector
$$g^k = b - Ax^k$$
becomes reasonably small. The vector $g^k$ represents the error in the approximation of $x$ and is referred to as the residual after $k$ iterations.
The relative stopping criterion is determined by the inner product of the $k$th residual:
$$\frac{\|g^k\|_2}{\|g^0\|_2} < \varepsilon,$$
where $\varepsilon > 0$ is assumed small, and
$$\|g^i\|_2 = \sqrt{(g^i)^T g^i}.$$
In this general setting, each iteration step includes:
• inner product
• multiplication of the matrix $A$ by a vector
Parallel Inner Product I
The parallel implementation of the inner product is the only step of the considered iterative algorithms which requires global communications.
[Figure: communications in the one-to-all like parallel inner product.]
Parallel Inner Product I
$$T^{IP}_{com} = (t_s + t_c)\log p \quad \text{($d$-cube)},$$
$$T^{IP}_{com} = 2(t_s + t_c)\lceil \sqrt{p}/2 \rceil \quad \text{(2D mesh)}.$$
Parallel Inner Product II
[Figure: communications in the all-to-all like parallel inner product.]
$$T^{IP}_{com} = (t_s + t_c)\log p \quad \text{($d$-cube)},$$
$$T^{IP}_{com} = 2(t_s + t_c)(\sqrt{p} - 1) \quad \text{(2D mesh)}.$$
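The $\log p$ factor in the $d$-cube estimate can be made concrete with a serial simulation of a recursive-doubling exchange, in which every processor sends and receives one number per step. This is a sketch under stated assumptions: it runs in a single process (no MPI), requires $p$ to be a power of two, and the function name and test vectors are hypothetical.

```python
def allreduce_inner_product(local_x, local_y):
    """Simulate an all-to-all inner product on p = 2**d processors.

    local_x[r] and local_y[r] are the vector pieces owned by processor r.
    Returns (values, steps): the result held by every processor and the
    number of communication steps that were needed.
    """
    p = len(local_x)
    assert p > 0 and (p & (p - 1)) == 0, "p must be a power of two"
    # Local work: each processor forms its partial inner product.
    partial = [sum(a * b for a, b in zip(xs, ys))
               for xs, ys in zip(local_x, local_y)]
    steps, stride = 0, 1
    while stride < p:
        # Processor r exchanges one number with its partner r XOR stride.
        partial = [partial[r] + partial[r ^ stride] for r in range(p)]
        stride *= 2
        steps += 1
    return partial, steps


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]   # p = 4 processors
values, steps = allreduce_inner_product(x, x)
print(values, steps)   # every processor holds 204.0 after 2 steps
```

With $p = 4$ the exchange finishes in $\log_2 4 = 2$ steps, each costing $t_s + t_c$ in the model above.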
Sparse Matrices
Example A1
A =
4 −1 −1
4 −1 −1
4 −1 −1
4 −1 −1
4 −1 −1 −1 −1
−1 −1 −1 4
−1 −1 −1 4
−1 −1 −1 4
−1 −1 −1 4
Example A2
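$$A = \begin{pmatrix}
4 & -1 & & -1 & & & & & \\
-1 & 4 & -1 & & -1 & & & & \\
 & -1 & 4 & & & -1 & & & \\
-1 & & & 4 & -1 & & -1 & & \\
 & -1 & & -1 & 4 & -1 & & -1 & \\
 & & -1 & & -1 & 4 & & & -1 \\
 & & & -1 & & & 4 & -1 & \\
 & & & & -1 & & -1 & 4 & -1 \\
 & & & & & -1 & & -1 & 4
\end{pmatrix}$$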
FDM/FEM Sparse Matrices I
Consider the model problem
$$-u_{xx} - u_{yy} = f \quad \text{in } \Omega = [0,1]^2$$
with Dirichlet boundary conditions on $\Gamma = \partial\Omega$. Let us assume that FDM or FEM is used to solve the problem numerically, where $\omega_h$ is a uniform mesh with mesh size $h = 1/(n+1)$.
[Figure: two numberings of the nine mesh nodes, corresponding to Examples (A1) and (A2).]
FDM/FEM Sparse Matrices
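If a column-wise numbering of the nodes (unknowns) is used, then
$$A = \mathrm{blocktridiag}(A_{i,i-1}, A_{i,i}, A_{i,i+1}),$$
$$A = \begin{pmatrix}
A_{1,1} & A_{1,2} & & & & \\
A_{2,1} & A_{2,2} & A_{2,3} & & & \\
 & A_{3,2} & A_{3,3} & A_{3,4} & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & A_{n-1,n-2} & A_{n-1,n-1} & A_{n-1,n} \\
 & & & & A_{n,n-1} & A_{n,n}
\end{pmatrix},$$
$$A_{i,i} = \mathrm{tridiag}(-1, 4, -1), \qquad A_{i,i-1} = A_{i,i+1} = -I,$$
$$A \in \mathbb{R}^{N \times N}, \qquad A_{i,j} \in \mathbb{R}^{n \times n}, \qquad N = n^2.$$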
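The block-tridiagonal structure can be checked by assembling the matrix for a small mesh. The sketch below is illustrative only: it stores the matrix densely, which is reasonable just for tiny $n$, and the function name `model_matrix` is hypothetical. For $n = 3$ it reproduces the matrix of Example A2.

```python
def model_matrix(n):
    """Dense N x N matrix (N = n*n) of the 5-point stencil, column-wise numbering."""
    N = n * n
    A = [[0] * N for _ in range(N)]
    for col in range(n):              # block index i
        for row in range(n):          # position inside the block
            k = col * n + row
            A[k][k] = 4
            if row > 0:
                A[k][k - 1] = -1      # sub-diagonal of A_{i,i}
            if row < n - 1:
                A[k][k + 1] = -1      # super-diagonal of A_{i,i}
            if col > 0:
                A[k][k - n] = -1      # A_{i,i-1} = -I
            if col < n - 1:
                A[k][k + n] = -1      # A_{i,i+1} = -I
    return A


A = model_matrix(3)
for r in A:
    print(" ".join(f"{v:2d}" for v in r))
```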
Matrix-Vector Multiplication I
[Figure: matrix-vector multiplication with a block-striped partitioning of the block-tridiagonal sparse matrix among the processors $P_0$, $P_1$, $P_2$.]
$$T(N,1) \approx 9N t_a, \qquad T_{com} = 2(t_s + n t_c),$$
$$T(N,p) \approx \frac{9N}{p}\, t_a + 2(t_s + \sqrt{N}\, t_c).$$
Jacobi Iterative Method
The $i$th equation of the system $Ax = b$ can be written in the form
$$x_i = \frac{1}{A_{i,i}} \Big( b_i - \sum_{j \ne i} A_{i,j} x_j \Big).$$
The iteration step in the Jacobi method is
$$x_i^{k+1} = \frac{1}{A_{i,i}} \Big( b_i - \sum_{j \ne i} A_{i,j} x_j^k \Big)$$
Jacobi Iterative Method
or equivalently
$$x_i^{k+1} = \frac{g_i^k}{A_{i,i}} + x_i^k.$$
The method always converges in the class of strictly or irreducibly diagonally dominant matrices.
Jacobi Algorithm
procedure JACOBI(A, b, x, ε)
begin
    k := 0;
    Select initial solution vector x^0;
    g^0 := b − A x^0;
    while (‖g^k‖_2 > ε ‖g^0‖_2) do
    begin
        k := k + 1;
        for i := 1 to N do x_i^k := g_i^{k−1} / A_{i,i} + x_i^{k−1};
        g^k := b − A x^k;
    endwhile;
    x := x^k;
end JACOBI

The complexity of one iteration is as follows:
$$\mathcal{N}^{Jac}_{it}(A^{-1}b) \approx \mathcal{N}(Ad) + \mathcal{N}(IP) + 3N,$$
which for the model problem reads as
$$\mathcal{N}^{Jac}_{it}(A^{-1}b) \approx 14N.$$
The related times are derived using the matrix-vector communication estimate:
$$T^{it}(N,1) \approx 14N t_a, \qquad T^{it}_{com} = 2(t_s + n t_c) + T^{IP}_{com},$$
$$T^{it}(N,p) \approx \frac{14N}{p}\, t_a + 2(t_s + \sqrt{N}\, t_c).$$
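The Jacobi procedure translates almost line by line into a short serial program. The following is a minimal sketch with dense list-of-lists storage; the helper names, the iteration cap `max_it`, the zero initial guess and the 3 x 3 test system are assumptions made for illustration and are not part of the slides.

```python
import math


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def norm2(v):
    return math.sqrt(sum(vi * vi for vi in v))


def jacobi(A, b, eps=1e-8, max_it=10_000):
    """Jacobi iteration x_i <- g_i / A_ii + x_i with residual g = b - A x."""
    x = [0.0] * len(b)                                   # initial guess x^0 = 0
    g = [bi - yi for bi, yi in zip(b, matvec(A, x))]     # g^0 = b - A x^0
    g0 = norm2(g)
    k = 0
    while norm2(g) > eps * g0 and k < max_it:
        k += 1
        x = [gi / A[i][i] + xi for i, (gi, xi) in enumerate(zip(g, x))]
        g = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    return x, k


# Small strictly diagonally dominant test system (hypothetical data).
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]                 # exact solution is (1, 1, 1)
x, k = jacobi(A, b)
print(x, k)
```

Each sweep uses one matrix-vector product and one inner product (for the norm), which matches the operation count $\mathcal{N}(Ad) + \mathcal{N}(IP) + 3N$.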
Conjugate Gradient Algorithm
procedure CG(A, b, x, ε)
begin
    k := 0;
    Select initial solution vector x^0;
    g^0 := A x^0 − b;  d^0 := −g^0;
    while (‖g^k‖_2 > ε ‖g^0‖_2) do
    begin
        τ_k := (g^k)^T g^k / (d^k)^T A d^k;
        x^{k+1} := x^k + τ_k d^k;
        g^{k+1} := g^k + τ_k A d^k;
        β_k := (g^{k+1})^T g^{k+1} / (g^k)^T g^k;
        d^{k+1} := −g^{k+1} + β_k d^k;
        k := k + 1;
    endwhile;
    x := x^k;
end CG

The computational complexity of one CG iteration is as follows:
$$\mathcal{N}^{CG}_{it}(A^{-1}b) \approx \mathcal{N}(Ad) + 2\mathcal{N}(IP) + 3\mathcal{N}(LT),$$
$$\mathcal{N}^{CG}_{it}(A^{-1}b) \approx 19N.$$
The related times are derived in a similar way as for the Jacobi method:
$$T^{it}(N,1) \approx 19N t_a, \qquad T^{it}_{com} = 2(t_s + n t_c) + 2T^{IP}_{com},$$
$$T^{it}(N,p) \approx \frac{19N}{p}\, t_a + 2(t_s + \sqrt{N}\, t_c).$$
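A serial version of the CG procedure can be sketched as follows. It keeps the sign convention $g = Ax - b$ used in the pseudocode; the helper names, the iteration cap, the zero initial guess and the test system are hypothetical choices for illustration.

```python
import math


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def cg(A, b, eps=1e-10, max_it=1000):
    """Conjugate gradients for symmetric positive definite A, with g = A x - b."""
    x = [0.0] * len(b)
    g = [yi - bi for yi, bi in zip(matvec(A, x), b)]     # g^0 = A x^0 - b
    d = [-gi for gi in g]                                # d^0 = -g^0
    gg = dot(g, g)
    g0 = math.sqrt(gg)
    k = 0
    while math.sqrt(gg) > eps * g0 and k < max_it:
        Ad = matvec(A, d)                                # one matrix-vector product
        tau = gg / dot(d, Ad)                            # first inner product
        x = [xi + tau * di for xi, di in zip(x, d)]
        g = [gi + tau * adi for gi, adi in zip(g, Ad)]
        gg_new = dot(g, g)                               # second inner product
        beta = gg_new / gg
        d = [-gi + beta * di for gi, di in zip(g, d)]
        gg = gg_new
        k += 1
    return x, k


A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]                 # exact solution is (1, 1, 1)
x, k = cg(A, b)
print(x, k)
```

The loop body contains one product $Ad$, two inner products and three vector updates, in line with $\mathcal{N}(Ad) + 2\mathcal{N}(IP) + 3\mathcal{N}(LT)$.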
Convergence Rate of CG Method
Theorem.
$$p(\varepsilon) \le \frac{1}{2}\sqrt{\kappa(A)}\,\ln(2/\varepsilon) + 1,$$
where $p(\varepsilon)$ stands for the smallest number $k$ such that
$$\|x^k - x\|_A \le \varepsilon\, \|x^0 - x\|_A \quad \forall x^0 \in \mathbb{R}^N.$$
In a very general setting of FEM/FDM sparse matrices, the spectral condition number behaves as $\kappa(A) = O(N)$.
Convergence Rate of CG Method
Therefore CG needs $O(N^{1/2})$ iterations of cost $O(N)$ each, so $\mathcal{N}^{CG}(A^{-1}b) = O(N^{3/2})$. It is important to note that the same complexity holds for the best known direct method, namely $\mathcal{N}^{ND}(A^{-1}b) = O(N^{3/2})$, where ND stands for the Nested Dissection method.
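The bound of the theorem is easy to evaluate for the model problem. The sketch below assumes the standard closed form $\kappa(A) = \cot^2(\pi h/2)$ for the five-point matrix on the unit square, which is not stated in the slides, and a hypothetical tolerance $\varepsilon = 10^{-6}$; the function names are likewise illustrative.

```python
import math


def kappa_model(n):
    """Condition number of the 5-point model matrix with h = 1/(n+1)."""
    h = 1.0 / (n + 1)
    return 1.0 / math.tan(0.5 * math.pi * h) ** 2        # cot^2(pi*h/2)


def cg_iteration_bound(kappa, eps):
    """Right-hand side of p(eps) <= 0.5 * sqrt(kappa) * ln(2/eps) + 1."""
    return 0.5 * math.sqrt(kappa) * math.log(2.0 / eps) + 1.0


for n in (4, 8, 16, 32, 64):
    print(n, math.ceil(cg_iteration_bound(kappa_model(n), 1e-6)))
```

Doubling $n$ roughly doubles the bound, which is the $O(N^{1/2})$ growth of the iteration count with $N = n^2$.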
Numerical Tests
The numbers of iterations for the model problem of the Gauss-Seidel (G-S), Steepest Descent (SD), Conjugate Gradient (CG) and Preconditioned CG (PCG) methods are presented in the table. The implemented PCG algorithm is the subject of the next section.
$$n^{G\text{-}S}_{it} = O(N), \qquad n^{SD}_{it} = O(N), \qquad n^{CG}_{it} = O(N^{1/2}), \qquad n^{PCG}_{it} = O(N^{1/4}).$$
Numerical Tests
 n     G-S      SD     CG    PCG
 4      82     185     26     11
 8     309     698     45     15
16    1151    2592     91     19
32    4242    9541    177     27
64   15529   34818    351     38

Conclusion. The efforts should be directed to the development of scalable parallel algorithms for sufficiently fast iterative solution methods.
Preconditioned CG Method
The idea of the PCG method is to replace the original linear system with a new one which is better conditioned:
$$Ax = b \;\Longrightarrow\; C^{-1/2} A C^{-1/2} y = C^{-1/2} b, \qquad y = C^{1/2} x.$$
The PCG strategy is to construct a preconditioner $C$ such that:
$$\kappa(C^{-1}A) \ll \kappa(A),$$
$$\mathcal{N}(C^{-1}v) \ll \mathcal{N}(A^{-1}v).$$
The preconditioner is called optimal if $\kappa(C^{-1}A) = O(1)$ and $\mathcal{N}(C^{-1}v) = O(N)$.
PCG Algorithm
procedure PCG(A, b, x, ε)
begin
    k := 0;
    Select initial solution vector x^0;
    g^0 := A x^0 − b;  h^0 := C^{−1} g^0;  d^0 := −h^0;
    while (‖g^k‖_{C^{−1}} > ε ‖g^0‖_{C^{−1}}) do
    begin
        τ_k := (g^k)^T h^k / (d^k)^T A d^k;
        x^{k+1} := x^k + τ_k d^k;
        g^{k+1} := g^k + τ_k A d^k;
        h^{k+1} := C^{−1} g^{k+1};
        β_k := (g^{k+1})^T h^{k+1} / (g^k)^T h^k;
        d^{k+1} := −h^{k+1} + β_k d^k;
        k := k + 1;
    endwhile;
    x := x^k;
end PCG

Following the structure of our analysis, we estimate the computational complexity of one PCG iteration in the form:
$$\mathcal{N}^{PCG}_{it}(A^{-1}b) \approx \mathcal{N}(C^{-1}g) + \mathcal{N}(Ad) + 2\mathcal{N}(IP) + 3\mathcal{N}(LT),$$
$$\mathcal{N}^{PCG}_{it}(A^{-1}b) \approx \mathcal{N}(C^{-1}g) + 19N.$$
Then, the related per-iteration PCG times are as follows:
$$T^{it}(N,1) \approx T^{(C^{-1}g)}(N,1) + 19N t_a,$$
$$T^{it}_{com} = 2(t_s + n t_c) + T^{(C^{-1}g)}_{com} + 2T^{IP}_{com}.$$
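The PCG procedure differs from CG only in the extra solve with $C$. The sketch below passes the preconditioner as a function and, purely as a stand-in, uses $C = \mathrm{diag}(A)$; this diagonal preconditioner is an assumption for illustration and is not one of the preconditioners studied in the slides. The initial direction is taken as $d^0 = -h^0$, consistent with the update $d^{k+1} = -h^{k+1} + \beta_k d^k$. All names and the test system are hypothetical.

```python
import math


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def pcg(A, b, apply_Cinv, eps=1e-10, max_it=1000):
    """PCG with g = A x - b, h = C^{-1} g and d^0 = -h^0."""
    x = [0.0] * len(b)
    g = [yi - bi for yi, bi in zip(matvec(A, x), b)]
    h = apply_Cinv(g)
    d = [-hi for hi in h]
    gh = dot(g, h)                    # squared C^{-1}-norm of the residual
    gh0 = gh
    k = 0
    while math.sqrt(gh) > eps * math.sqrt(gh0) and k < max_it:
        Ad = matvec(A, d)
        tau = gh / dot(d, Ad)
        x = [xi + tau * di for xi, di in zip(x, d)]
        g = [gi + tau * adi for gi, adi in zip(g, Ad)]
        h = apply_Cinv(g)             # the only extra work compared with CG
        gh_new = dot(g, h)
        beta = gh_new / gh
        d = [-hi + beta * di for hi, di in zip(h, d)]
        gh = gh_new
        k += 1
    return x, k


A = [[4.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 4.0]]
b = [3.0, 0.0, 3.0]                   # exact solution is (1, 1, 1)


def diag_inv(g):
    """Stand-in preconditioner C = diag(A)."""
    return [gi / A[i][i] for i, gi in enumerate(g)]


x, k = pcg(A, b, diag_inv)
print(x, k)
```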
Convergence Rate of PCG Method
Theorem.
$$p(\varepsilon) \le \frac{1}{2}\sqrt{\kappa(C^{-1}A)}\,\ln(2/\varepsilon) + 1,$$
where $p(\varepsilon)$ stands for the smallest number $k$ such that
$$\|x^k - x\|_A \le \varepsilon\, \|x^0 - x\|_A \quad \forall x^0 \in \mathbb{R}^N.$$
Convergence Rate of PCG Method
Some parallel preconditioning techniques:
• Incomplete Factorization
• Circulant Block Factorization
• Domain Decomposition
• Patched Local Refinement
• Multigrid/Multilevel
• Approximate Inverse
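Circulant Block Factorization
A circulant matrix $C$ has the form
$$C_{k,j} = c_{(j-k) \bmod m},$$
$$C = \begin{pmatrix}
c_0 & c_1 & c_2 & \dots & c_{m-1} \\
c_{m-1} & c_0 & c_1 & \dots & c_{m-2} \\
\vdots & \vdots & \vdots & & \vdots \\
c_1 & c_2 & \dots & c_{m-1} & c_0
\end{pmatrix},$$
$$C = (c_0, c_1, \dots, c_{m-1}) = F \Lambda F^*,$$
$$\mathcal{N}(C^{-1}v) = O(m \log m).$$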
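The $O(m \log m)$ cost comes from applying $F$ and $F^*$ with the FFT and dividing by the eigenvalues in between. The following sketch solves $Cw = v$ for the indexing $C_{k,j} = c_{(j-k) \bmod m}$. It uses a small recursive radix-2 transform so that it depends only on the standard library; it therefore assumes $m$ is a power of two and that $c$ and $v$ are real. The function names and test data are hypothetical.

```python
import cmath


def fft(a, sign):
    """Radix-2 DFT: out[l] = sum_j a[j] * exp(sign*2*pi*i*j*l/m), m a power of 2."""
    m = len(a)
    if m == 1:
        return [complex(a[0])]
    even, odd = fft(a[0::2], sign), fft(a[1::2], sign)
    out = [0j] * m
    for l in range(m // 2):
        t = cmath.exp(sign * 2j * cmath.pi * l / m) * odd[l]
        out[l], out[l + m // 2] = even[l] + t, even[l] - t
    return out


def circulant_solve(c, v):
    """Solve C w = v for real c, v, where C[k][j] = c[(j - k) % m]."""
    m = len(c)
    lam = fft(c, +1)                          # eigenvalues of C
    v_hat = fft(v, -1)                        # coefficients of v in the Fourier basis
    w_hat = [vh / lam_l for vh, lam_l in zip(v_hat, lam)]
    return [z.real / m for z in fft(w_hat, +1)]


c = [4.0, -1.0, 0.0, -1.0]                    # first row of a 4 x 4 circulant
v = [1.0, 2.0, 3.0, 4.0]
w = circulant_solve(c, v)
print(w)
```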
2D model problem
$$-(a(x,y)u_x)_x - (b(x,y)u_y)_y = f(x,y), \quad \forall (x,y) \in \Omega,$$
$$u(x,y) = 0, \quad \forall (x,y) \in \Gamma = \partial\Omega,$$
$$0 < c_{min} \le a(x,y),\, b(x,y) \le c_{max},$$
$$A = \mathrm{tridiag}(-A_{i,i-1}, A_{i,i}, -A_{i,i+1}), \quad i = 1, 2, \dots, n,$$
$$C = \mathrm{tridiag}(-C_{i,i-1}, C_{i,i}, -C_{i,i+1}), \quad i = 1, 2, \dots, n,$$
where $C_{i,j} = \mathrm{Circulant}(A_{i,j})$ is some given circulant approximation of the corresponding block $A_{i,j}$.
Factorization
$$C = D - L - U,$$
$$C = (X - L)(I - X^{-1}U),$$
$$X = D - L X^{-1} U,$$
$$X_1 = C_{1,1},$$
$$X_i = C_{i,i} - C_{i,i-1} X_{i-1}^{-1} C_{i-1,i}, \quad i = 2, \dots, n,$$
$$C_{i,j} = F \Lambda_{i,j} F^*,$$
$$X_i = F D_i^{-1} F^*.$$
Factorization
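$$D_1^{-1} = \Lambda_{1,1},$$
$$D_i^{-1} = \Lambda_{i,i} - \Lambda_{i,i-1} D_{i-1} \Lambda_{i-1,i}.$$
Let us denote $\Lambda = \mathrm{tridiag}(\Lambda_{i,i-1}, \Lambda_{i,i}, \Lambda_{i,i+1})$. Then the following relation holds:
$$Cw = u \iff (I \otimes F)\,\Lambda\,(I \otimes F^*)\,w = u.$$
The solution is computed in three stages:
$$\hat{u} = (I \otimes F^*)\,u,$$
$$\Lambda \hat{w} = \hat{u},$$
$$w = (I \otimes F)\,\hat{w}.$$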
Factorization
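Forward substitution:
$$\hat{v}_1 = D_1 \hat{u}_1,$$
$$\hat{v}_i = D_i(\hat{u}_i - \Lambda_{i,i-1} \hat{v}_{i-1}), \quad i = 2, 3, \dots, n.$$
Backward substitution:
$$\hat{w}_n = \hat{v}_n,$$
$$\hat{w}_i = \hat{v}_i - D_i \Lambda_{i,i+1} \hat{w}_{i+1}, \quad i = n-1, n-2, \dots, 1.$$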
Parallel algorithm
[Figure: distribution of vectors on the processors $P_0$, $P_1$, $P_2$, $P_3$.]
The CBF preconditioning can be split into three stages. If we use the column-wise mapping for the first and third stages, there is no need for communication, because we perform block FFTs on blocks which are stored on one processor. For the second stage we have to reorder the vector entries using a row-wise mapping.
Parallel CBF tests
SUN Ultra-Enterprise Symmetric Multiprocessor

               168 MHz               250 MHz
  n   p    T(p)    S_p    E_p    T(p)    S_p    E_p
128   1    0.086   -      -      0.081   -      -
      2    0.047   1.84   0.92   0.047   1.71   0.86
      4    0.028   3.04   0.76   0.029   2.77   0.69
      8    0.021   4.13   0.52   0.096   0.84   0.10
256   1    0.389   -      -      0.392   -      -
      2    0.207   1.88   0.94   0.208   1.88   0.94
      4    0.109   3.56   0.89   0.127   3.09   0.77
      8    0.065   6.02   0.75   0.138   2.83   0.35
               168 MHz               250 MHz
  n   p    T(p)    S_p    E_p    T(p)    S_p    E_p
384   1    1.460   -      -      1.498   -      -
      2    0.759   1.92   0.96   0.783   1.91   0.96
      3    0.523   2.79   0.93   0.533   2.81   0.94
      4    0.394   3.71   0.93   0.473   3.17   0.79
      6    0.269   5.43   0.90   0.780   1.92   0.32
      8    0.338   4.32   0.54   1.122   1.33   0.17
420   1    3.718   -      -      2.651   -      -
      2    1.922   1.93   0.97   1.378   1.92   0.96
      3    1.313   2.83   0.94   0.937   2.83   0.94
      4    0.990   3.75   0.94   0.714   3.71   0.93
      5    0.817   4.55   0.91   1.005   2.64   0.53
      6    0.679   5.48   0.91   1.233   2.15   0.36
      7    0.595   6.25   0.89   1.314   2.02   0.29
MIC(0) Algorithm
Let us rewrite the real symmetric matrix $A$ in the form $A = D - L - L^T$. Then, the modified incomplete Cholesky factorization is defined as follows:
$$C_{MIC(0)}(A) = (X - L)\,X^{-1}\,(X - L)^T,$$
where $X = \mathrm{diag}(x_1, \dots, x_N)$ provides the equal rowsums condition $C_{MIC(0)}\,e = A e$, $e = (1, \dots, 1)^T$.
MIC(0) Algorithm
Theorem. Let us assume that (a) $L \ge 0$, (b) $Ae \ge 0$, (c) $Ae + L^T e > 0$, where $e = (1, \dots, 1)^T \in \mathbb{R}^N$. Then the relation
$$x_i = a_{ii} - \sum_{k=1}^{i-1} \frac{a_{ik}}{x_k} \sum_{j=k+1}^{N} a_{kj} > 0$$
gives a stable MIC(0) factorization of $A$.
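The recurrence for $x_i$ and the equal rowsums property can be checked numerically on a small case. The sketch below uses dense storage and the $3 \times 3$ mesh model matrix, which satisfies conditions (a) to (c); the function names are hypothetical and the dense product is meant only for verification, not as an efficient implementation.

```python
def mic0_diagonal(A):
    """x_i = a_ii - sum_{k<i} (a_ik / x_k) * sum_{j>k} a_kj."""
    N = len(A)
    x = [0.0] * N
    for i in range(N):
        s = 0.0
        for k in range(i):
            if A[i][k] != 0.0:
                s += A[i][k] / x[k] * sum(A[k][j] for j in range(k + 1, N))
        x[i] = A[i][i] - s
    return x


def mic0_matrix(A, x):
    """Dense C = (X - L) X^{-1} (X - L)^T, where A = D - L - L^T."""
    N = len(A)
    # Lower factor X - L: x_i on the diagonal, a_ik (= -l_ik) below it.
    F = [[x[i] if i == k else (A[i][k] if k < i else 0.0) for k in range(N)]
         for i in range(N)]
    return [[sum(F[i][k] * F[j][k] / x[k] for k in range(N)) for j in range(N)]
            for i in range(N)]


# Model matrix for a 3 x 3 mesh (five-point stencil, column-wise numbering).
n = 3
A = [[0.0] * (n * n) for _ in range(n * n)]
for c in range(n):
    for r in range(n):
        k = c * n + r
        A[k][k] = 4.0
        for dc, dr in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if 0 <= c + dc < n and 0 <= r + dr < n:
                A[k][(c + dc) * n + (r + dr)] = -1.0

x = mic0_diagonal(A)
C = mic0_matrix(A, x)
rowsum_gap = max(abs(sum(C[i]) - sum(A[i])) for i in range(n * n))
print(min(x) > 0, rowsum_gap)
```

All pivots $x_i$ come out positive and the row sums of $C$ and $A$ agree up to rounding, as the theorem and the rowsum condition require.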
Remark. All presented numerical tests are performed using the perturbed MIC(0) algorithm, where the incomplete factorization is applied to the matrix $\tilde{A} = A + \tilde{D}$. The diagonal perturbation $\tilde{D} = \tilde{D}(\xi) = \mathrm{diag}(\tilde{d}_1, \dots, \tilde{d}_N)$ is defined as follows:
$$\tilde{d}_i = \begin{cases} \xi\, a_{ii} & \text{if } a_{ii} \ge 2 w_i, \\ \xi^{1/2} a_{ii} & \text{if } a_{ii} < 2 w_i, \end{cases}$$
where
$$w_i = -\sum_{j > i} a_{ij}.$$
Here $0 < \xi < 1$ is a constant of the same order as the minimal eigenvalue of $A$. The computations for the considered model problems are done with $\xi = h^2$.
MIC(0) Complexity
The MIC(0) computational complexity of one PCG iteration is as follows:
$$\mathcal{N}^{PCG}_{it}(A^{-1}b) \approx \mathcal{N}(C^{-1}g) + 19N, \qquad \mathcal{N}(C^{-1}g) \approx 11N,$$
$$\mathcal{N}^{PCG}_{it}(A^{-1}b) \approx 30N.$$
MIC(0) is a cheap preconditioning algorithm. The cost of $\mathcal{N}(C^{-1}g)$ is almost the same as $\mathcal{N}(Ad)$.
MIC(0) Complexity
MIC(0) is a robust preconditioner with respect to local singularities of the problem, where $\kappa(C^{-1}A) = O(N^{1/2})$, so that $n^{PCG}_{it} = O(N^{1/4})$ and $\mathcal{N}^{PCG}(A^{-1}b) = O(N^{5/4})$.
MIC(0) is an inherently sequential algorithm.
FDM/FEM Sparse Matrices II
The model problem is considered again: $-u_{xx} - u_{yy} = f$ in $\Omega = [0,1]^2$ with Dirichlet boundary conditions on $\Gamma = \partial\Omega$.
[Figure: regular mesh (ReM) and skewed mesh (SkM).]
Since a five-point stencil is used in both cases, the accuracy of the regular mesh (ReM) and the alternative skewed mesh (SkM) FDM/FEM approximations is one and the same.
Block-Structure of the Matrices
[Figure: block structure of the matrices for SkM and ReM.]
The bottleneck of the parallel implementation of the MIC(0) algorithm is the solution of systems with the triangular matrices $(X - L)$ and $(X - L)^T$.
Block-Structure of the Matrices
The key point of our consideration is that, in the case of the skewed mesh, the stiffness matrix has a block structure with diagonal blocks which are diagonal.
Parallel MIC(0) algorithm
[Figure: MIC(0) PCG algorithm with a block-striped partitioning among the processors $P_0$, $P_1$, $P_2$; $N = n^2 + (n-1)^2$.]
$$T^{MIC(0)}_{it}(N,1) \approx 38N t_a, \qquad T_{com} \approx (4t_s + 6t_c)\,n + 2T^{IP}_{com},$$
$$T^{MIC(0)}_{it}(N,p) \approx \frac{38N}{p}\, t_a + (2t_s + 3t_c)\sqrt{2N}.$$
Parallel MIC(0) Tests
• The presented tests are performed on a Beowulf-like cluster of four dual-processor Power Macintosh computers, each with 512 MB RAM and G4 processors at 450 MHz.
• The parallel MIC(0) algorithm is implemented in C++ using the Message Passing Interface (MPI).
• Yellow Dog Linux with LAN MPI is used.
• The size of the problem and the number of processors are varied to examine the parallel scalability of the code.
Parallel Speedup
n         32     64    128    256    512   1024   1500
S(n,2)   1.21   1.68   1.96   1.85   1.92   2.03   2.02
S(n,3)   0.24   0.46   0.97   1.72   2.45   2.97   2.86
S(n,4)   0.22   0.46   1.11   1.97   2.88   3.76   3.95
S(n,5)   0.20   0.40   0.96   1.99   3.25   4.48   4.86
S(n,6)   0.18   0.39   1.03   1.99   3.55   5.23   5.73
S(n,7)   0.19   0.38   0.95   1.78   3.63   6.02   6.31
S(n,8)   0.19   0.39   1.00   2.28   3.97   6.37   6.76
Parallel Efficiency
n         32     64    128    256    512   1024   1500
E(n,2)   0.60   0.84   0.98   0.93   0.96   1.02   1.01
E(n,3)   0.08   0.15   0.32   0.57   0.81   0.99   0.95
E(n,4)   0.06   0.12   0.27   0.49   0.72   0.94   0.98
E(n,5)   0.04   0.08   0.19   0.40   0.65   0.90   0.97
E(n,6)   0.03   0.07   0.17   0.33   0.59   0.87   0.96
E(n,7)   0.03   0.05   0.14   0.25   0.51   0.86   0.90
E(n,8)   0.02   0.05   0.13   0.29   0.50   0.80   0.84