
Iterative Methods and Parallel Algorithms

S. D. Margenov I. D. Lirkov


Institute for Parallel Processing, Bulgarian Academy of Sciences, Sofia, Bulgaria

http://parallel.bas.bg/~margenov/

http://parallel.bas.bg/~ivan/

Parallel Algorithms – p. 1/50


CONTENTS

1. Introduction

2. Parallel inner product

3. Sparse matrix vector multiplication

4. Jacobi method

5. Conjugate gradient method

6. Preconditioned conjugate gradient method

7. Circulant Block Factorization

8. MIC(0) preconditioning

9. Parallel PCG tests

Parallel Performance

To establish the theoretical performance, a simple model with non-overlapped arithmetic and communication times is assumed:

The execution of $M$ arithmetic operations (a.o.) on one processor takes time

$$T_a = M t_a,$$

where $t_a$ is the average unit time to perform one a.o. on one processor.


The communication time to transfer $M$ data elements from one processor to another is approximated by

$$T_{com} = \ell\,(t_s + M t_c),$$

where $t_s$ is the start-up time, $t_c$ is the time necessary for each of the $M$ elements to be sent, and $\ell$ is the graph distance between the processors.


Parallel Speedup and Efficiency

The standard expressions for parallel speedup $S(N, p)$ and parallel efficiency $E(N, p)$ are used:

$$S(N, p) = \frac{T(N, 1)}{T(N, p)}, \qquad E(N, p) = \frac{S(N, p)}{p}.$$

Here, $T(N, p)$ stands for the parallel time to solve the problem on $p$ processors, and $N$ is the discrete size of the problem.

The following theoretical estimates hold:

$$0 < S(N, p) \le p, \qquad 0 < E(N, p) \le 1.$$
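The two definitions translate directly into code. The following is a minimal sketch; the function names and the timings are hypothetical and chosen only for illustration.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Parallel speedup S(N, p) = T(N, 1) / T(N, p)."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, p: int) -> float:
    """Parallel efficiency E(N, p) = S(N, p) / p."""
    return speedup(t_serial, t_parallel) / p


# Hypothetical timings in seconds (not measurements from these slides).
t1, t4 = 12.0, 4.0
s = speedup(t1, t4)           # 3.0
e = efficiency(t1, t4, p=4)   # 0.75
```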


Iterative Methods

Iterative methods are techniques for solving systems of linear equations

$$Ax = b$$

that generate a sequence of approximations to the solution vector $x$ in the form

$$x^0, x^1, \dots, x^k, \dots$$


The process is said to be convergent if the magnitude of the vector

$$g^k = b - A x^k$$

becomes reasonably small. The vector $g^k$ measures the error in the approximation $x^k$ of $x$ and is referred to as the residual after $k$ iterations.

The relative stopping criterion is determined by the inner product of the $k$th residual with itself:

$$\frac{\|g^k\|_2}{\|g^0\|_2} < \varepsilon,$$

where $\varepsilon > 0$ is assumed small, and

$$\|g^i\|_2 = \sqrt{(g^i)^T g^i}.$$


In this general setting, each iteration step includes:

inner product

matrix A vector multiplication
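As a small illustration, the relative stopping test uses exactly these two kernels: one matrix-vector product and one inner product. This is a sketch with NumPy; the 2-by-2 system and the helper name `relative_residual` are assumptions made for the example.

```python
import numpy as np


def relative_residual(A, b, x, g0_norm):
    """Return ||b - A x||_2 / ||g^0||_2."""
    g = b - A @ x                     # matrix-vector multiplication
    return np.sqrt(g @ g) / g0_norm   # inner product


# Hypothetical small system whose exact solution is (1, 1).
A = np.array([[4.0, -1.0], [-1.0, 4.0]])
b = np.array([3.0, 3.0])
x0 = np.zeros(2)
g0_norm = np.linalg.norm(b - A @ x0)

eps = 1e-6
converged = relative_residual(A, b, np.array([1.0, 1.0]), g0_norm) < eps
```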


Parallel Inner Product I

The parallel implementation of the inner product is the only step of the considered iterative algorithms which requires global communications.

[Figure: Communications in the one-to-all like parallel inner product]

$$T^{IP}_{com} = (t_s + t_c)\log p \qquad \text{(d-cube)}$$

$$T^{IP}_{com} = 2(t_s + t_c)\left\lceil \sqrt{p}/2 \right\rceil \qquad \text{(2D mesh)}$$


Parallel Inner Product II

[Figure: Communications in the all-to-all like parallel inner product]

$$T^{IP}_{com} = (t_s + t_c)\log p \qquad \text{(d-cube)}$$

$$T^{IP}_{com} = 2(t_s + t_c)(\sqrt{p} - 1) \qquad \text{(2D mesh)}$$
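The four estimates can be compared numerically. The sketch below assumes that the logarithm is taken to base 2 (the hypercube dimension) and uses hypothetical values for the start-up and per-element times; the function names are illustrative only.

```python
import math


def t_ip_one_to_all(p, ts, tc, topology):
    """Communication time of the one-to-all like parallel inner product."""
    if topology == "d-cube":
        return (ts + tc) * math.log2(p)  # assumption: log means log base 2
    if topology == "2d-mesh":
        return 2 * (ts + tc) * math.ceil(math.sqrt(p) / 2)
    raise ValueError(f"unknown topology: {topology}")


def t_ip_all_to_all(p, ts, tc, topology):
    """Communication time of the all-to-all like parallel inner product."""
    if topology == "d-cube":
        return (ts + tc) * math.log2(p)  # assumption: log means log base 2
    if topology == "2d-mesh":
        return 2 * (ts + tc) * (math.sqrt(p) - 1)
    raise ValueError(f"unknown topology: {topology}")


# Hypothetical machine parameters (arbitrary time units).
ts, tc, p = 10.0, 1.0, 16
mesh_one = t_ip_one_to_all(p, ts, tc, "2d-mesh")   # 2 * 11 * 2 = 44.0
mesh_all = t_ip_all_to_all(p, ts, tc, "2d-mesh")   # 2 * 11 * 3 = 66.0
```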


Sparse Matrices

Example A1

A =

4 −1 −1

4 −1 −1

4 −1 −1

4 −1 −1

4 −1 −1 −1 −1

−1 −1 −1 4

−1 −1 −1 4

−1 −1 −1 4

−1 −1 −1 4


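Example A2

$$A = \begin{pmatrix}
4 & -1 & & -1 & & & & & \\
-1 & 4 & -1 & & -1 & & & & \\
 & -1 & 4 & & & -1 & & & \\
-1 & & & 4 & -1 & & -1 & & \\
 & -1 & & -1 & 4 & -1 & & -1 & \\
 & & -1 & & -1 & 4 & & & -1 \\
 & & & -1 & & & 4 & -1 & \\
 & & & & -1 & & -1 & 4 & -1 \\
 & & & & & -1 & & -1 & 4
\end{pmatrix}$$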


FDM/FEM Sparse Matrices I

Consider the model problem

$$-u_{xx} - u_{yy} = f \quad \text{in } \Omega = [0, 1]^2$$

with Dirichlet boundary conditions on $\Gamma = \partial\Omega$. Let us assume that FDM or FEM is used to solve the problem numerically, where $\omega_h$ is a uniform mesh with mesh-size $h = 1/(n + 1)$.

[Figure: two numberings of the mesh nodes 1 to 9, corresponding to the matrices (A1) and (A2)]


FDM/FEM Sparse Matrices

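If a column-wise numbering of the nodes (unknowns) is used, then

$$A = \mathrm{blocktridiag}(A_{i,i-1}, A_{i,i}, A_{i,i+1}),$$

$$A = \begin{pmatrix}
A_{1,1} & A_{1,2} & & & & \\
A_{2,1} & A_{2,2} & A_{2,3} & & & \\
 & A_{3,2} & A_{3,3} & A_{3,4} & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & A_{n-1,n-2} & A_{n-1,n-1} & A_{n-1,n} \\
 & & & & A_{n,n-1} & A_{n,n}
\end{pmatrix},$$

$$A_{i,i} = \mathrm{tridiag}(-1, 4, -1), \qquad A_{i,i-1} = A_{i,i+1} = -I,$$

$$A \in \mathbb{R}^{N \times N}, \qquad A_{i,j} \in \mathbb{R}^{n \times n}, \qquad N = n^2.$$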
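The block-tridiagonal structure can be assembled with Kronecker products. The following is a minimal dense sketch with NumPy; the function name `model_matrix` is an assumption, and a practical code would use a sparse storage format instead.

```python
import numpy as np


def model_matrix(n):
    """Dense N x N matrix, N = n^2, with diagonal blocks tridiag(-1, 4, -1)
    and off-diagonal blocks -I."""
    I = np.eye(n)
    T = 4 * I - np.eye(n, k=1) - np.eye(n, k=-1)   # A_{i,i}
    S = np.eye(n, k=1) + np.eye(n, k=-1)           # positions of A_{i,i-1}, A_{i,i+1}
    return np.kron(I, T) - np.kron(S, I)


A = model_matrix(3)   # 9 x 9 matrix for the 3 x 3 mesh
```

For `n = 3` the row of the central node has the diagonal entry 4 and four off-diagonal entries equal to -1, as expected for the five-point stencil.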


Matrix-Vector Multiplication I

[Figure: Matrix-vector multiplication with a block-striped partitioning of the block-tridiagonal sparse matrix among the processors P0, P1, P2]

$$T(N, 1) \approx 9N t_a, \qquad T_{com} = 2(t_s + n t_c),$$

$$T(N, p) \approx \frac{9N}{p} t_a + 2(t_s + \sqrt{N} t_c).$$
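To see how the communication term limits the speedup, the estimate can be evaluated for sample parameters. This is only a sketch of the model above; the machine parameters are hypothetical.

```python
import math


def matvec_time(N, p, ta, ts, tc):
    """Model time of one matrix-vector multiplication on p processors."""
    if p == 1:
        return 9 * N * ta
    return 9 * N / p * ta + 2 * (ts + math.sqrt(N) * tc)


# Hypothetical parameters: N = 100^2 unknowns, arbitrary time units.
N, ta, ts, tc = 10_000, 1.0, 100.0, 2.0
t1 = matvec_time(N, 1, ta, ts, tc)   # 90000.0
t4 = matvec_time(N, 4, ta, ts, tc)   # 22500.0 + 600.0 = 23100.0
predicted_speedup = t1 / t4
```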


Jacobi Iterative Method

The $i$th equation of the system $Ax = b$ can be written in the form

$$x_i = \frac{1}{A_{i,i}} \Big( b_i - \sum_{j \ne i} A_{i,j} x_j \Big).$$

The iteration step in the Jacobi method is

$$x_i^{k+1} = \frac{1}{A_{i,i}} \Big( b_i - \sum_{j \ne i} A_{i,j} x_j^k \Big),$$

or equivalently

$$x_i^{k+1} = \frac{g_i^k}{A_{i,i}} + x_i^k.$$

The method always converges for strictly or irreducibly diagonally dominant matrices.


Jacobi Algorithm

    procedure JACOBI(A, b, x, ε)
    begin
        k := 0;
        Select initial solution vector x^0;
        g^0 := b − A x^0;
        while (‖g^k‖_2 > ε ‖g^0‖_2) do
        begin
            k := k + 1;
            for i := 1 to N do x_i^k := g_i^{k−1} / A_{i,i} + x_i^{k−1};
            g^k := b − A x^k;
        endwhile;
        x := x^k;
    end JACOBI
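The procedure above can be sketched in Python with NumPy as follows. This is a sequential illustration, not the authors' implementation; the `max_iter` safeguard and the small test system are assumptions added for the example.

```python
import numpy as np


def jacobi(A, b, eps=1e-8, x0=None, max_iter=10_000):
    """Jacobi iteration with the relative residual stopping criterion."""
    x = np.zeros_like(b, dtype=float) if x0 is None else np.array(x0, dtype=float)
    diag = np.diag(A)
    g = b - A @ x                          # g^0 := b - A x^0
    g0_norm = np.linalg.norm(g)
    k = 0
    while np.linalg.norm(g) > eps * g0_norm and k < max_iter:
        x = x + g / diag                   # x_i^k := g_i^{k-1} / A_ii + x_i^{k-1}
        g = b - A @ x                      # g^k := b - A x^k
        k += 1
    return x, k


# Hypothetical diagonally dominant system with known solution (1, 2, 3).
A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
b = A @ np.array([1.0, 2.0, 3.0])
x, iterations = jacobi(A, b)
```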

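The complexity of one iteration is as follows:

$$\mathcal{N}^{Jac}_{it}(A^{-1}b) \approx \mathcal{N}(Ad) + \mathcal{N}(IP) + 3N,$$

which for the model problem, with $\mathcal{N}(Ad) \approx 9N$ and $\mathcal{N}(IP) \approx 2N$, reads as

$$\mathcal{N}^{Jac}_{it}(A^{-1}b) \approx 14N.$$

The related times are derived using the matrix-vector communication estimate:

$$T^{it}(N, 1) \approx 14N t_a,$$

$$T^{it}_{com} = 2(t_s + n t_c) + T^{IP}_{com},$$

$$T^{it}(N, p) \approx \frac{14N}{p} t_a + 2(t_s + \sqrt{N} t_c).$$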


Conjugate Gradient Algorithm

    procedure CG(A, b, x, ε)
    begin
        k := 0;
        Select initial solution vector x^0;
        g^0 := A x^0 − b, d^0 := −g^0;
        while (‖g^k‖_2 > ε ‖g^0‖_2) do
        begin
            τ_k := (g^k)^T g^k / ((d^k)^T A d^k);
            x^{k+1} := x^k + τ_k d^k;
            g^{k+1} := g^k + τ_k A d^k;
            β_k := (g^{k+1})^T g^{k+1} / ((g^k)^T g^k);
            d^{k+1} := −g^{k+1} + β_k d^k;
            k := k + 1;
        endwhile;
        x := x^k;
    end CG
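A sequential sketch of the CG procedure in Python with NumPy is given below. It keeps the sign convention of the slides, with the gradient $g = Ax - b$; the `max_iter` safeguard and the test system are assumptions added for the example.

```python
import numpy as np


def cg(A, b, eps=1e-8, x0=None, max_iter=None):
    """Conjugate gradient method for a symmetric positive definite matrix A."""
    x = np.zeros_like(b, dtype=float) if x0 is None else np.array(x0, dtype=float)
    max_iter = 10 * len(b) if max_iter is None else max_iter
    g = A @ x - b                  # g^0 := A x^0 - b
    d = -g                         # d^0 := -g^0
    gg = g @ g
    gg0 = gg
    k = 0
    while np.sqrt(gg) > eps * np.sqrt(gg0) and k < max_iter:
        Ad = A @ d                 # the only matrix-vector product per iteration
        tau = gg / (d @ Ad)
        x = x + tau * d
        g = g + tau * Ad
        gg_new = g @ g
        beta = gg_new / gg
        d = -g + beta * d
        gg = gg_new
        k += 1
    return x, k


# Hypothetical test system: tridiag(-1, 2, -1) of order 5.
A = 2 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)
b = np.ones(5)
x, iterations = cg(A, b)
```

Each pass through the loop uses one matrix-vector product, two inner products and three vector updates, which matches the operation count given for one CG iteration.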

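The computational complexity of one CG iteration is as follows:

$$\mathcal{N}^{CG}_{it}(A^{-1}b) \approx \mathcal{N}(Ad) + 2\mathcal{N}(IP) + 3\mathcal{N}(LT),$$

which for the model problem, with $\mathcal{N}(Ad) \approx 9N$ and $\mathcal{N}(IP) \approx \mathcal{N}(LT) \approx 2N$, reads as

$$\mathcal{N}^{CG}_{it}(A^{-1}b) \approx 19N.$$

The related times are derived in a similar way as for the Jacobi method:

$$T^{it}(N, 1) \approx 19N t_a,$$

$$T^{it}_{com} = 2(t_s + n t_c) + 2T^{IP}_{com},$$

$$T^{it}(N, p) \approx \frac{19N}{p} t_a + 2(t_s + \sqrt{N} t_c).$$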


Convergence Rate of CG Method

Theorem.

$$p(\varepsilon) \le \frac{1}{2} \sqrt{\kappa(A)}\, \ln(2/\varepsilon) + 1,$$

where $p(\varepsilon)$ stands for the smallest number $k$ such that

$$\|x^k - x\|_A \le \varepsilon \|x^0 - x\|_A \quad \forall x^0 \in \mathbb{R}^N.$$

In a very general setting of FEM/FDM sparse matrices, the spectral condition number behaves as $\kappa(A) = O(N)$.

Therefore $\mathcal{N}^{CG}(A^{-1}b) = O(N^{3/2})$. It is important to note that the same complexity holds for the best known direct method, namely $\mathcal{N}^{ND}(A^{-1}b) = O(N^{3/2})$, where ND stands for the Nested Dissection Method.
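The bound of the theorem is easy to evaluate. The sketch below shows that quadrupling the condition number doubles the leading term of the bound, which is the source of the $O(N^{1/2})$ growth of the iteration count when $\kappa(A) = O(N)$. The function name and the sample values are assumptions.

```python
import math


def cg_iteration_bound(kappa, eps):
    """Upper bound 0.5 * sqrt(kappa) * ln(2 / eps) + 1 on the number of CG iterations."""
    return 0.5 * math.sqrt(kappa) * math.log(2.0 / eps) + 1.0


# Hypothetical condition numbers and tolerance.
eps = 1e-6
bound_100 = cg_iteration_bound(100.0, eps)
bound_400 = cg_iteration_bound(400.0, eps)
```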


Numerical Tests

The numbers of iterations for the model problem with the Gauss-Seidel (G-S), Steepest Descent (SD), Conjugate Gradient (CG) and Preconditioned CG (PCG) methods are presented in the table. The implemented PCG algorithm is the subject of the next section.

$$n^{G-S}_{it} = O(N), \qquad n^{SD}_{it} = O(N), \qquad n^{CG}_{it} = O(N^{1/2}), \qquad n^{PCG}_{it} = O(N^{1/4})$$

    n      G-S      SD       CG     PCG
    4      82       185      26     11
    8      309      698      45     15
    16     1151     2592     91     19
    32     4242     9541     177    27
    64     15529    34818    351    38

Conclusion. The efforts should be directed to the development of scalable parallel algorithms for sufficiently fast iterative solution methods.


Preconditioned CG Method

The idea of the PCG method is to replace the original linear system with a new one which is better conditioned:

$$Ax = b \;\Longrightarrow\; C^{-1/2} A C^{-1/2} y = C^{-1/2} b, \qquad y = C^{1/2} x.$$

The PCG strategy is to construct a preconditioner $C$ such that:

$$\kappa(C^{-1}A) \ll \kappa(A),$$

$$\mathcal{N}(C^{-1}v) \ll \mathcal{N}(A^{-1}v).$$

The preconditioner is called optimal if $\kappa(C^{-1}A) = O(1)$ and $\mathcal{N}(C^{-1}v) = O(N)$.


PCG Algorithm

    procedure PCG(A, b, x, ε)
    begin
        k := 0;
        Select initial solution vector x^0;
        g^0 := A x^0 − b, h^0 := C^{−1} g^0, d^0 := −h^0;
        while (‖g^k‖_{C^{−1}} > ε ‖g^0‖_{C^{−1}}) do
        begin
            τ_k := (g^k)^T h^k / ((d^k)^T A d^k);
            x^{k+1} := x^k + τ_k d^k;
            g^{k+1} := g^k + τ_k A d^k;
            h^{k+1} := C^{−1} g^{k+1};
            β_k := (g^{k+1})^T h^{k+1} / ((g^k)^T h^k);
            d^{k+1} := −h^{k+1} + β_k d^k;
            k := k + 1;
        endwhile;
        x := x^k;
    end PCG
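The PCG procedure can be sketched in Python as follows. The preconditioner is passed as a function that applies $C^{-1}$ to a vector. The diagonal (Jacobi) preconditioner used in the example is only a stand-in chosen for brevity, not one of the preconditioners studied in these slides; the initial direction is taken as $d^0 = -h^0$, consistent with the update $d^{k+1} = -h^{k+1} + \beta_k d^k$.

```python
import numpy as np


def pcg(A, b, apply_c_inv, eps=1e-8, max_iter=1000):
    """Preconditioned CG; apply_c_inv(g) must return C^{-1} g."""
    x = np.zeros_like(b, dtype=float)
    g = A @ x - b
    h = apply_c_inv(g)
    d = -h
    gh = g @ h                     # squared C^{-1}-norm of the gradient
    gh0 = gh
    k = 0
    while np.sqrt(gh) > eps * np.sqrt(gh0) and k < max_iter:
        Ad = A @ d
        tau = gh / (d @ Ad)
        x = x + tau * d
        g = g + tau * Ad
        h = apply_c_inv(g)
        gh_new = g @ h
        beta = gh_new / gh
        d = -h + beta * d
        gh = gh_new
        k += 1
    return x, k


# Hypothetical SPD system with known solution; C = diag(A) as a stand-in.
A = np.diag([4.0, 5.0, 6.0, 7.0]) - np.eye(4, k=1) - np.eye(4, k=-1)
x_exact = np.array([1.0, -1.0, 2.0, 0.5])
b = A @ x_exact
x, iterations = pcg(A, b, lambda g: g / np.diag(A))
```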

Following the structure of our analysis, we estimate the computational complexity of one PCG iteration in the form:

$$\mathcal{N}^{PCG}_{it}(A^{-1}b) \approx \mathcal{N}(C^{-1}g) + \mathcal{N}(Ad) + 2\mathcal{N}(IP) + 3\mathcal{N}(LT),$$

$$\mathcal{N}^{PCG}_{it}(A^{-1}b) \approx \mathcal{N}(C^{-1}g) + 19N.$$

Then, the related per-iteration PCG times are as follows:

$$T^{it}(N, 1) \approx T^{(C^{-1}g)}(N, 1) + 19N t_a,$$

$$T^{it}_{com} = 2(t_s + n t_c) + T^{(C^{-1}g)}_{com} + 2T^{IP}_{com}.$$


Convergence Rate of PCG Method

Theorem.

$$p(\varepsilon) \le \frac{1}{2} \sqrt{\kappa(C^{-1}A)}\, \ln(2/\varepsilon) + 1,$$

where $p(\varepsilon)$ stands for the smallest number $k$ such that

$$\|x^k - x\|_A \le \varepsilon \|x^0 - x\|_A \quad \forall x^0 \in \mathbb{R}^N.$$

Some parallel preconditioning techniques:
• Incomplete Factorization
• Circulant Block Factorization
• Domain Decomposition
• Patched Local Refinement
• Multigrid/Multilevel
• Approximate Inverse


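Circulant Block Factorization

A circulant matrix $C$ has the form

$$C_{k,j} = c_{(j-k) \bmod m},$$

$$C = \begin{pmatrix}
c_0 & c_1 & c_2 & \dots & c_{m-1} \\
c_{m-1} & c_0 & c_1 & \dots & c_{m-2} \\
\vdots & \vdots & \vdots & & \vdots \\
c_1 & c_2 & \dots & c_{m-1} & c_0
\end{pmatrix},$$

$$C = (c_0, c_1, \dots, c_{m-1}) = F \Lambda F^*,$$

where $F$ is the Fourier matrix and $\Lambda$ is the diagonal matrix of the eigenvalues of $C$. Therefore

$$\mathcal{N}(C^{-1}v) = O(m \log m).$$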
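The $O(m \log m)$ cost comes from the fact that a circulant system is diagonalized by the FFT. The sketch below uses NumPy's FFT routines; the function names and the sample data are assumptions, and the matrix is assumed to be nonsingular.

```python
import numpy as np


def circulant_dense(c):
    """Dense m x m matrix with entries C[k, j] = c[(j - k) mod m]."""
    c = np.asarray(c, dtype=float)
    m = len(c)
    k, j = np.indices((m, m))
    return c[(j - k) % m]


def circulant_solve(c, u):
    """Solve C w = u with three FFTs instead of a dense factorization."""
    c = np.asarray(c, dtype=float)
    m = len(c)
    first_column = c[(-np.arange(m)) % m]    # C[k, 0] = c[(-k) mod m]
    eigenvalues = np.fft.fft(first_column)   # eigenvalues of C
    w = np.fft.ifft(np.fft.fft(u) / eigenvalues)
    return w.real                            # real data give a real solution


# Hypothetical first row (c_0, ..., c_{m-1}) and right-hand side.
c = [5.0, 1.0, 2.0, 0.5]
u = np.array([1.0, 2.0, 3.0, 4.0])
w = circulant_solve(c, u)
```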


2D model problem

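$$-(a(x, y) u_x)_x - (b(x, y) u_y)_y = f(x, y), \quad \forall (x, y) \in \Omega,$$

$$u(x, y) = 0, \quad \forall (x, y) \in \Gamma = \partial\Omega,$$

$$0 < c_{min} \le a(x, y),\, b(x, y) \le c_{max},$$

$$A = \mathrm{tridiag}(-A_{i,i-1}, A_{i,i}, -A_{i,i+1}), \quad i = 1, 2, \dots, n,$$

$$C = \mathrm{tridiag}(-C_{i,i-1}, C_{i,i}, -C_{i,i+1}), \quad i = 1, 2, \dots, n,$$

where $C_{i,j} = \mathrm{Circulant}(A_{i,j})$ is some given circulant approximation of the corresponding block $A_{i,j}$.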


Factorization

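$$C = D - L - U$$

$$C = (X - L)(I - X^{-1}U)$$

$$X = D - L X^{-1} U$$

$$X_1 = C_{1,1}$$

$$X_i = C_{i,i} - C_{i,i-1} X_{i-1}^{-1} C_{i-1,i}, \quad i = 2, \dots, n$$

$$C_{i,j} = F \Lambda_{i,j} F^*$$

$$X_i = F D_i^{-1} F^*$$

$$D_1^{-1} = \Lambda_{1,1}$$

$$D_i^{-1} = \Lambda_{i,i} - \Lambda_{i,i-1} D_{i-1} \Lambda_{i-1,i}.$$

Let us denote $\Lambda = \mathrm{tridiag}(\Lambda_{i,i-1}, \Lambda_{i,i}, \Lambda_{i,i+1})$. Then the following relation holds:

$$Cw = u \iff (I \otimes F) \Lambda (I \otimes F^*) w = u.$$

The system is solved in three steps:

$$\hat{u} = (I \otimes F^*) u$$

$$\Lambda \hat{w} = \hat{u}$$

$$w = (I \otimes F) \hat{w}$$

The system $\Lambda \hat{w} = \hat{u}$ is solved by a forward substitution

$$v_1 = D_1 \hat{u}_1, \qquad v_i = D_i (\hat{u}_i - \Lambda_{i,i-1} v_{i-1}), \quad i = 2, 3, \dots, n,$$

followed by a backward substitution

$$\hat{w}_n = v_n, \qquad \hat{w}_i = v_i - D_i \Lambda_{i,i+1} \hat{w}_{i+1}, \quad i = n-1, n-2, \dots, 1.$$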


Parallel algorithm

[Figure: Distribution of vectors on processors P0, P1, P2, P3.]

The CBF preconditioning can be split into three stages. If we use the column-wise mapping for the first and third stages, there is no need for communication, because we perform block-FFT on blocks which are stored on one processor. For the second stage we have to reorder the vector entries using the row-wise mapping.


Parallel CBF tests

SUN Ultra-Enterprise Symmetric Multiprocessor

                     168 MHz                  250 MHz
    n     p     T(p)     Sp      Ep      T(p)     Sp      Ep
    128   1     0.086    -       -       0.081    -       -
    128   2     0.047    1.84    0.92    0.047    1.71    0.86
    128   4     0.028    3.04    0.76    0.029    2.77    0.69
    128   8     0.021    4.13    0.52    0.096    0.84    0.10
    256   1     0.389    -       -       0.392    -       -
    256   2     0.207    1.88    0.94    0.208    1.88    0.94
    256   4     0.109    3.56    0.89    0.127    3.09    0.77
    256   8     0.065    6.02    0.75    0.138    2.83    0.35
    384   1     1.460    -       -       1.498    -       -
    384   2     0.759    1.92    0.96    0.783    1.91    0.96
    384   3     0.523    2.79    0.93    0.533    2.81    0.94
    384   4     0.394    3.71    0.93    0.473    3.17    0.79
    384   6     0.269    5.43    0.90    0.780    1.92    0.32
    384   8     0.338    4.32    0.54    1.122    1.33    0.17
    420   1     3.718    -       -       2.651    -       -
    420   2     1.922    1.93    0.97    1.378    1.92    0.96
    420   3     1.313    2.83    0.94    0.937    2.83    0.94
    420   4     0.990    3.75    0.94    0.714    3.71    0.93
    420   5     0.817    4.55    0.91    1.005    2.64    0.53
    420   6     0.679    5.48    0.91    1.233    2.15    0.36
    420   7     0.595    6.25    0.89    1.314    2.02    0.29


MIC(0) Algorithm

Let us rewrite the real matrix $A$ in the form $A = D - L - L^T$. Then, the modified incomplete Cholesky factorization is defined as follows:

$$C_{MIC(0)}(A) = (X - L) X^{-1} (X - L)^T,$$

where $X = \mathrm{diag}(x_1, \dots, x_N)$ is determined by the equal row-sums condition.

Theorem. Let us assume that (a) $L \ge 0$, (b) $Ae \ge 0$, (c) $Ae + L^T e > 0$, where $e = (1, \dots, 1)^T \in \mathbb{R}^N$. Then the relation

$$x_i = a_{ii} - \sum_{k=1}^{i-1} \frac{a_{ik}}{x_k} \sum_{j=k+1}^{N} a_{kj} > 0$$

gives a stable MIC(0) factorization of $A$.
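The recurrence for $x_i$ can be checked numerically: with the computed diagonal, the row sums of $C_{MIC(0)}(A)$ and of $A$ coincide. The following dense sketch in Python is for illustration only (it ignores sparsity and is not the authors' code); the function names are assumptions and the test matrix is the model matrix for a 3-by-3 mesh.

```python
import numpy as np


def mic0_diagonal(A):
    """Diagonal x_1, ..., x_N of X computed from the MIC(0) recurrence."""
    N = A.shape[0]
    x = np.zeros(N)
    for i in range(N):
        s = 0.0
        for k in range(i):
            if A[i, k] != 0.0:
                # (a_ik / x_k) * sum_{j > k} a_kj
                s += A[i, k] / x[k] * A[k, k + 1:].sum()
        x[i] = A[i, i] - s
    return x


def mic0_matrix(A):
    """Dense C = (X - L) X^{-1} (X - L)^T with A = D - L - L^T."""
    x = mic0_diagonal(A)
    L = -np.tril(A, k=-1)
    XL = np.diag(x) - L
    return XL @ np.diag(1.0 / x) @ XL.T


# Model matrix for a 3 x 3 mesh (five-point stencil).
n = 3
I = np.eye(n)
T = 4 * I - np.eye(n, k=1) - np.eye(n, k=-1)
A = np.kron(I, T) - np.kron(np.eye(n, k=1) + np.eye(n, k=-1), I)

x = mic0_diagonal(A)
C = mic0_matrix(A)
e = np.ones(n * n)
rowsums_match = np.allclose(C @ e, A @ e)
```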


Remark. All presented numerical tests are performed using the perturbed MIC(0) algorithm, where the incomplete factorization is applied to the matrix $\tilde{A} = A + \tilde{D}$. The diagonal perturbation $\tilde{D} = \tilde{D}(\xi) = \mathrm{diag}(\tilde{d}_1, \dots, \tilde{d}_N)$ is defined as follows:

$$\tilde{d}_i = \begin{cases} \xi\, a_{ii} & \text{if } a_{ii} \ge 2 w_i, \\ \xi^{1/2} a_{ii} & \text{if } a_{ii} < 2 w_i, \end{cases} \qquad \text{where } w_i = -\sum_{j > i} a_{ij}.$$

Here $0 < \xi < 1$ is a constant of the same order as the minimal eigenvalue of $A$. The computations for the considered model problems are done with $\xi = h^2$.


MIC(0) Complexity

The MIC(0) computational complexity of one PCG iteration is as follows:

$$\mathcal{N}^{PCG}_{it}(A^{-1}b) \approx \mathcal{N}(C^{-1}g) + 19N, \qquad \mathcal{N}(C^{-1}g) \approx 11N,$$

$$\mathcal{N}^{PCG}_{it}(A^{-1}b) \approx 30N.$$

MIC(0) is a cheap preconditioning algorithm. The cost of $\mathcal{N}(C^{-1}g)$ is almost the same as $\mathcal{N}(Ad)$.

MIC(0) is a robust preconditioner with respect to local singularities of the problem, where $\kappa(C^{-1}A) = O(N^{1/2})$, and $\mathcal{N}^{PCG}(A^{-1}b) = O(N^{5/4})$.

MIC(0) is an inherently sequential algorithm.


FDM/FEM Sparse Matrices II

The model problem is considered again: $-u_{xx} - u_{yy} = f$ in $\Omega = [0, 1]^2$ with Dirichlet boundary conditions on $\Gamma = \partial\Omega$.

[Figure: regular mesh (ReM) and skewed mesh (SkM)]

Since a five-point stencil is used in both cases, the accuracy of the regular mesh (ReM) and the alternative skewed mesh (SkM) FDM/FEM approximations is the same.


Block-Structure of the Matrices

[Figure: block structure of the matrices for SkM and ReM]

The bottleneck of the parallel implementation of the MIC(0) algorithm is the solution of systems with the triangular matrices $(X - L)$ and $(X - L)^T$.

The key point of our consideration is that, in the case of the skewed mesh, the stiffness matrix has a block structure whose diagonal blocks are diagonal matrices.


Parallel MIC(0) algorithm

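[Figure: MIC(0) PCG algorithm with a block-striped partitioning among the processors P0, P1, P2; $N = n^2 + (n - 1)^2$]

$$T^{MIC(0)}_{it}(N, 1) \approx 38N t_a, \qquad T_{com} \approx (4t_s + 6t_c)\, n + 2T^{IP}_{com},$$

$$T^{MIC(0)}_{it}(N, p) \approx \frac{38N}{p} t_a + (2t_s + 3t_c)\sqrt{2N}.$$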


Parallel MIC(0) Tests

The presented tests are performed on a Beowulf-like cluster of four dual-processor Power Macintosh computers, each with 512 MB RAM and G4 processors at 450 MHz.

The parallel MIC(0) algorithm is implemented in C++ using the Message Passing Interface (MPI).

Yellow Dog Linux with LAN MPI is used.

The size of the problem and the number of processors are varied to examine the parallel scalability of the code.


Parallel Speedup

    n         32      64      128     256     512     1024    1500
    S(n,2)    1.21    1.68    1.96    1.85    1.92    2.03    2.02
    S(n,3)    0.24    0.46    0.97    1.72    2.45    2.97    2.86
    S(n,4)    0.22    0.46    1.11    1.97    2.88    3.76    3.95
    S(n,5)    0.20    0.40    0.96    1.99    3.25    4.48    4.86
    S(n,6)    0.18    0.39    1.03    1.99    3.55    5.23    5.73
    S(n,7)    0.19    0.38    0.95    1.78    3.63    6.02    6.31
    S(n,8)    0.19    0.39    1.00    2.28    3.97    6.37    6.76


Parallel Efficiency

    n         32      64      128     256     512     1024    1500
    E(n,2)    0.60    0.84    0.98    0.93    0.96    1.02    1.01
    E(n,3)    0.08    0.15    0.32    0.57    0.81    0.99    0.95
    E(n,4)    0.06    0.12    0.27    0.49    0.72    0.94    0.98
    E(n,5)    0.04    0.08    0.19    0.40    0.65    0.90    0.97
    E(n,6)    0.03    0.07    0.17    0.33    0.59    0.87    0.96
    E(n,7)    0.03    0.05    0.14    0.25    0.51    0.86    0.90
    E(n,8)    0.02    0.05    0.13    0.29    0.50    0.80    0.84

