
Iterative Methods and Parallel Algorithms

S. D. Margenov I. D. Lirkov

margenov@parallel.bas.bg

Institute for Parallel Processing, Bulgarian Academy of Sciences, Sofia, Bulgaria

http://parallel.bas.bg/~margenov/

http://parallel.bas.bg/~ivan/


CONTENTS

1. Introduction

2. Parallel inner product

3. Sparse matrix-vector multiplication

4. Jacobi method

5. Conjugate gradient method

6. Preconditioned conjugate gradient method

7. Circulant Block Factorization

8. MIC(0) preconditioning

9. Parallel PCG tests


Parallel Performance

To establish the theoretical performance, a simple model of non-overlapped arithmetic and communication times is assumed:

The execution of M arithmetic operations (a.o.) on one processor takes time

$T_a = M t_a$

where $t_a$ is the average unit time to perform one a.o. on one processor.


The communication time to transfer M data elements from one processor to another is approximated by

$T_{com} = \ell \, (t_s + M t_c)$

where $t_s$ is the start-up time, $t_c$ is the time necessary for each of the M elements to be sent, and $\ell$ is the graph distance between the processors.
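For example, between adjacent processors ($\ell = 1$) the model predicts that sending M = 1000 elements in a single message costs $t_s + 1000 t_c$, while sending them as ten messages of 100 elements costs $10 t_s + 1000 t_c$; since typically $t_s \gg t_c$, data should be aggregated into as few messages as possible.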


Parallel Speedup and Efficiency

The standard expressions for the parallel speedup S(N, p) and the parallel efficiency E(N, p) are used:

$$S(N, p) = \frac{T(N, 1)}{T(N, p)}, \qquad E(N, p) = \frac{S(N, p)}{p}$$

Here T(N, p) stands for the parallel time to solve the problem on p processors, and N is the discrete size of the problem.


The following theoretical estimates hold:

$0 < S(N, p) \le p, \qquad 0 < E(N, p) \le 1$
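For example, in the CBF tests reported later, for n = 420 the solution time drops from T(N, 1) = 3.718 s on one processor to T(N, 7) = 0.595 s on seven, giving S = 3.718/0.595 ≈ 6.25 and E = 6.25/7 ≈ 0.89.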


Iterative Methods

Iterative methods are techniques to solve systems of linear equations

Ax = b

that generate a sequence of approximations to the solution vector x of the form

$x^0, x^1, \dots, x^k, \dots$


The process is said to be convergent if the magnitude of the vector

$g^k = b - A x^k$

becomes reasonably small. The vector $g^k$ represents the error in the approximation of x and is referred to as the residual after k iterations.


The relative stopping criterion is determined by the inner product of the kth residual:

$$\frac{\|g^k\|_2}{\|g^0\|_2} < \varepsilon$$

where $\varepsilon > 0$ is assumed small, and $\|g^i\|_2 = \sqrt{g^{iT} g^i}$.


In this general setting, each iteration step includes:

• inner product

• matrix-vector multiplication with A


Parallel Inner Product I

The parallel implementation of the inner product is the only step of the considered iterative algorithms which requires global communications.

(Figure: communications in the one-to-all type parallel inner product.)
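For illustration, a minimal C/MPI sketch of this step (an assumed implementation, not the authors' code): each processor accumulates the partial sum over its local entries, and a single global reduction combines the contributions.

    #include <mpi.h>

    /* Parallel inner product: x and y hold the local_n entries owned by
       this processor; one global reduction combines the partial sums. */
    double parallel_inner_product(const double *x, const double *y,
                                  int local_n, MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < local_n; ++i)   /* local part: ~2*local_n a.o. */
            local += x[i] * y[i];
        /* the only global communication in the considered algorithms */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }

The communication time of this step is estimated below for a hypercube (d-cube) and a 2D mesh topology.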


$T^{IP}_{com} = (t_s + t_c) \log p$   (d-cube)

$T^{IP}_{com} = 2 (t_s + t_c) \lceil \sqrt{p}/2 \rceil$   (2D mesh)


Parallel Inner Product II

(Figure: communications in the all-to-all type parallel inner product.)

$T^{IP}_{com} = (t_s + t_c) \log p$   (d-cube)

$T^{IP}_{com} = 2 (t_s + t_c) (\sqrt{p} - 1)$   (2D mesh)


Sparse Matrices

Example A1

$$A = \begin{pmatrix}
4 & & & & & -1 & -1 & & \\
 & 4 & & & & -1 & & -1 & \\
 & & 4 & & & & -1 & & -1 \\
 & & & 4 & & & & -1 & -1 \\
 & & & & 4 & -1 & -1 & -1 & -1 \\
-1 & -1 & & & -1 & 4 & & & \\
-1 & & -1 & & -1 & & 4 & & \\
 & -1 & & -1 & -1 & & & 4 & \\
 & & -1 & -1 & -1 & & & & 4
\end{pmatrix}$$

(blank entries are zeros)


Example A2

$$A = \begin{pmatrix}
4 & -1 & & -1 & & & & & \\
-1 & 4 & -1 & & -1 & & & & \\
 & -1 & 4 & & & -1 & & & \\
-1 & & & 4 & -1 & & -1 & & \\
 & -1 & & -1 & 4 & -1 & & -1 & \\
 & & -1 & & -1 & 4 & & & -1 \\
 & & & -1 & & & 4 & -1 & \\
 & & & & -1 & & -1 & 4 & -1 \\
 & & & & & -1 & & -1 & 4
\end{pmatrix}$$

(blank entries are zeros)


FDM/FEM Sparse Matrices I

Consider the model problem

$-u_{xx} - u_{yy} = f \quad \text{in } \Omega = [0, 1]^2$

with Dirichlet boundary conditions on $\Gamma = \partial\Omega$. Let us assume that FDM or FEM is used to solve the problem numerically, where $\omega_h$ is a uniform mesh with mesh-size $h = 1/(n + 1)$.

(Figure: two numberings of the nine interior mesh nodes for n = 3, leading to the matrices of Example A1 and Example A2.)


FDM/FEM Sparse Matrices

If a column-wise numbering of the nodes (unknowns) is used, then

$A = \mathrm{blocktridiag}(A_{i,i-1}, A_{i,i}, A_{i,i+1}),$

$$A = \begin{pmatrix}
A_{1,1} & A_{1,2} & & & \\
A_{2,1} & A_{2,2} & A_{2,3} & & \\
 & A_{3,2} & A_{3,3} & A_{3,4} & \\
 & & \ddots & \ddots & \ddots \\
 & & & A_{n,n-1} & A_{n,n}
\end{pmatrix},$$

$A_{i,i} = \mathrm{tridiag}(-1, 4, -1), \qquad A_{i,i-1} = A_{i,i+1} = -I,$

$A \in \mathbb{R}^{N \times N}, \qquad A_{i,j} \in \mathbb{R}^{n \times n}, \qquad N = n^2.$


Matrix-Vector Multiplication I

(Figure: matrix-vector multiplication with a block-striped partitioning of the block-tridiagonal sparse matrix over processors P0, P1, P2.)

$T(N, 1) \approx 9 N t_a, \qquad T_{com} = 2 (t_s + n t_c),$

$T(N, p) \approx \frac{9N}{p} t_a + 2 (t_s + \sqrt{N} t_c)$
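As an illustration of where the $2(t_s + n t_c)$ term comes from, here is a C++/MPI sketch (an assumed implementation, not the authors' code): each processor owns a strip of consecutive mesh columns and exchanges one boundary column of length n with each neighbor before applying the stencil.

    #include <mpi.h>
    #include <vector>

    /* y := A x for the model 5-point matrix, block-striped partitioning:
       this processor owns `rows` consecutive mesh columns of length n,
       stored column by column in x. */
    void matvec_5pt(const double *x, double *y, int n, int rows, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        int lnb = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
        int rnb = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

        /* halo columns; zeros model the homogeneous Dirichlet boundary */
        std::vector<double> halo_l(n, 0.0), halo_r(n, 0.0);
        /* send first owned column left, receive the right halo */
        MPI_Sendrecv(x, n, MPI_DOUBLE, lnb, 0, halo_r.data(), n, MPI_DOUBLE,
                     rnb, 0, comm, MPI_STATUS_IGNORE);
        /* send last owned column right, receive the left halo */
        MPI_Sendrecv(x + (rows - 1) * n, n, MPI_DOUBLE, rnb, 1, halo_l.data(),
                     n, MPI_DOUBLE, lnb, 1, comm, MPI_STATUS_IGNORE);

        for (int j = 0; j < rows; ++j) {
            const double *xl = (j > 0)        ? x + (j - 1) * n : halo_l.data();
            const double *xr = (j < rows - 1) ? x + (j + 1) * n : halo_r.data();
            for (int i = 0; i < n; ++i) {
                double s = 4.0 * x[j * n + i];      /* A_{i,i} = tridiag(-1,4,-1) */
                if (i > 0)     s -= x[j * n + i - 1];
                if (i < n - 1) s -= x[j * n + i + 1];
                s -= xl[i];                          /* A_{i,i-1} = -I */
                s -= xr[i];                          /* A_{i,i+1} = -I */
                y[j * n + i] = s;
            }
        }
    }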


Jacobi Iterative Method

The ith equation of the system Ax = b can be written in the form

$$x_i = \frac{1}{A_{i,i}} \left( b_i - \sum_{j \ne i} A_{i,j} x_j \right).$$

The iteration step in the Jacobi method is

$$x_i^{k+1} = \frac{1}{A_{i,i}} \left( b_i - \sum_{j \ne i} A_{i,j} x_j^k \right)$$


or equivalently

$$x_i^{k+1} = \frac{g_i^k}{A_{i,i}} + x_i^k.$$

The method always converges in the class of diagonally dominant matrices.


Jacobi Algorithm

1.  procedure JACOBI(A, b, x, ε)
2.  begin
3.     k := 0;
4.     Select initial solution vector x^0;
5.     g^0 := b − A x^0;
6.     while (‖g^k‖_2 > ε ‖g^0‖_2) do
7.     begin
8.        k := k + 1;
9.        for i := 1 to N do
10.          x^k_i := g^{k−1}_i / A_{i,i} + x^{k−1}_i;
11.       g^k := b − A x^k;
12.    endwhile;
13.    x := x^k;
14. end JACOBI

The complexity of one iteration is as follows:

$N^{Jac}_{it}(A^{-1}b) \approx N(Ad) + N(IP) + 3N,$

which for the model problem reads

$N^{Jac}_{it}(A^{-1}b) \approx 14N.$

The related times are simply derived using the matrix-vector communication estimate:

$T^{it}(N, 1) \approx 14 N t_a, \qquad T^{it}_{com} = 2 (t_s + n t_c) + T^{IP}_{com},$

$T^{it}(N, p) \approx \frac{14N}{p} t_a + 2 (t_s + \sqrt{N} t_c)$
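A sequential C++ sketch of the algorithm for the model problem (where $A_{i,i} = 4$ and A is applied stencil-wise) might read as follows; it is only an illustration under these assumptions, not the authors' code:

    #include <vector>

    /* g := b - A x for the N = n*n five-point stencil matrix */
    void residual(const std::vector<double>& x, const std::vector<double>& b,
                  std::vector<double>& g, int n)
    {
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i) {
                int k = j * n + i;
                double Ax = 4.0 * x[k];
                if (i > 0)     Ax -= x[k - 1];
                if (i < n - 1) Ax -= x[k + 1];
                if (j > 0)     Ax -= x[k - n];
                if (j < n - 1) Ax -= x[k + n];
                g[k] = b[k] - Ax;
            }
    }

    /* Jacobi iteration in the form x_i += g_i / A_ii; returns #iterations */
    int jacobi(std::vector<double>& x, const std::vector<double>& b,
               int n, double eps)
    {
        int N = n * n, k = 0;
        std::vector<double> g(N);
        residual(x, b, g, n);
        double r0 = 0.0; for (double v : g) r0 += v * v;   /* ||g0||^2 */
        double rk = r0;
        while (rk > eps * eps * r0) {            /* ||gk|| > eps ||g0|| */
            ++k;
            for (int i = 0; i < N; ++i) x[i] += g[i] / 4.0;  /* A_ii = 4 */
            residual(x, b, g, n);
            rk = 0.0; for (double v : g) rk += v * v;
        }
        return k;
    }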


Conjugate Gradient Algorithm

1.  procedure CG(A, b, x, ε)
2.  begin
3.     k := 0;
4.     Select initial solution vector x^0;
5.     g^0 := A x^0 − b;  d^0 := −g^0;
6.     while (‖g^k‖_2 > ε ‖g^0‖_2) do
7.     begin
8.        τ_k := (g^{kT} g^k) / (d^{kT} A d^k);
9.        x^{k+1} := x^k + τ_k d^k;
10.       g^{k+1} := g^k + τ_k A d^k;
11.       β_k := (g^{k+1,T} g^{k+1}) / (g^{kT} g^k);
12.       d^{k+1} := −g^{k+1} + β_k d^k;
13.    endwhile;
14.    x := x^k;
15. end CG

The computational complexity of one CG iteration is as follows:

$N^{CG}_{it}(A^{-1}b) \approx N(Ad) + 2 N(IP) + 3 N(LT),$

$N^{CG}_{it}(A^{-1}b) \approx 19N.$

The related times are derived in a similar way as for the Jacobi method:

$T^{it}(N, 1) \approx 19 N t_a, \qquad T^{it}_{com} = 2 (t_s + n t_c) + 2 T^{IP}_{com},$

$T^{it}(N, p) \approx \frac{19N}{p} t_a + 2 (t_s + \sqrt{N} t_c)$
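A matching matrix-free C++ sketch of the CG algorithm above (again only an illustration for the model problem; apply_A implements the 5-point stencil as in the Jacobi sketch):

    #include <vector>
    #include <cstddef>

    static void apply_A(const std::vector<double>& x, std::vector<double>& y, int n)
    {
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i) {
                int k = j * n + i;
                double s = 4.0 * x[k];
                if (i > 0)     s -= x[k - 1];
                if (i < n - 1) s -= x[k + 1];
                if (j > 0)     s -= x[k - n];
                if (j < n - 1) s -= x[k + n];
                y[k] = s;
            }
    }

    static double dot(const std::vector<double>& a, const std::vector<double>& b)
    {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    int cg(std::vector<double>& x, const std::vector<double>& b, int n, double eps)
    {
        int N = n * n, k = 0;
        std::vector<double> g(N), d(N), Ad(N);
        apply_A(x, g, n);
        for (int i = 0; i < N; ++i) { g[i] -= b[i]; d[i] = -g[i]; } /* step 5 */
        double gg = dot(g, g), gg0 = gg;
        while (gg > eps * eps * gg0) {                              /* step 6 */
            apply_A(d, Ad, n);                      /* one matvec: ~9N a.o. */
            double tau = gg / dot(d, Ad);                           /* step 8 */
            for (int i = 0; i < N; ++i) x[i] += tau * d[i];         /* step 9 */
            for (int i = 0; i < N; ++i) g[i] += tau * Ad[i];        /* step 10 */
            double gg1 = dot(g, g);
            double beta = gg1 / gg;                                 /* step 11 */
            for (int i = 0; i < N; ++i) d[i] = -g[i] + beta * d[i]; /* step 12 */
            gg = gg1; ++k;
        }
        return k;
    }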


Convergence Rate of CG Method

Theorem.

$$p(\varepsilon) \le \frac{1}{2} \sqrt{\kappa(A)} \, \ln(2/\varepsilon) + 1,$$

where p(ε) stands for the smallest number k such that

$\|x^k - x\|_A \le \varepsilon \|x^0 - x\|_A \quad \forall x^0 \in \mathbb{R}^N.$

In a very general setting of FEM/FDM sparse matrices, the spectral condition number behaves as $\kappa(A) = O(N)$.
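Combining the theorem with $\kappa(A) = O(N)$ gives

$p(\varepsilon) = O(N^{1/2} \ln(2/\varepsilon)),$

at a cost of O(N) operations per iteration.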


Therefore $N^{CG}(A^{-1}b) = O(N^{3/2})$. It is important to note that the same complexity holds for the best known direct method, namely $N^{ND}(A^{-1}b) = O(N^{3/2})$, where ND stands for the Nested Dissection method.


Numerical Tests

The numbers of iterations for the model problem for the Gauss-Seidel (G-S), Steepest Descent (SD), Conjugate Gradient (CG), and Preconditioned CG (PCG) methods are presented in the table below. The implemented PCG algorithm is the subject of the next section.

$n^{G\text{-}S}_{it} = O(N), \qquad n^{SD}_{it} = O(N), \qquad n^{CG}_{it} = O(N^{1/2}), \qquad n^{PCG}_{it} = O(N^{1/4})$


  n     G-S      SD     CG    PCG
  4      82     185     26     11
  8     309     698     45     15
 16    1151    2592     91     19
 32    4242    9541    177     27
 64   15529   34818    351     38

Conclusion. The efforts should be directed to the development of scalable parallel algorithms for sufficiently fast iterative solution methods.


Preconditioned CG Method

The idea of the PCG method is to replace the original linear system by a new one which is better conditioned:

$Ax = b \;\Longrightarrow\; C^{-1/2} A C^{-1/2} y = C^{-1/2} b, \qquad y = C^{1/2} x$

The PCG strategy is to construct a preconditioner C such that:

$\kappa(C^{-1}A) \ll \kappa(A), \qquad N(C^{-1}v) \ll N(A^{-1}v)$

The preconditioner is called optimal if $\kappa(C^{-1}A) = O(1)$ and $N(C^{-1}v) = O(N)$.


PCG Algorithm

1.  procedure PCG(A, b, x, ε)
2.  begin
3.     k := 0;
4.     Select initial solution vector x^0;
5.     g^0 := A x^0 − b;  h^0 := C^{−1} g^0;  d^0 := −h^0;
6.     while (‖g^k‖_{C^{−1}} > ε ‖g^0‖_{C^{−1}}) do
7.     begin
8.        τ_k := (g^{kT} h^k) / (d^{kT} A d^k);
9.        x^{k+1} := x^k + τ_k d^k;
10.       g^{k+1} := g^k + τ_k A d^k;
11.       h^{k+1} := C^{−1} g^{k+1};
12.       β_k := (g^{k+1,T} h^{k+1}) / (g^{kT} h^k);
13.       d^{k+1} := −h^{k+1} + β_k d^k;
14.    endwhile;
15.    x := x^k;
16. end PCG

Following the structure of our analysis, we estimate the computational complexity of one PCG iteration in the form:

$N^{PCG}_{it}(A^{-1}b) \approx N(C^{-1}g) + N(Ad) + 2 N(IP) + 3 N(LT),$

$N^{PCG}_{it}(A^{-1}b) \approx N(C^{-1}g) + 19N.$

Then, the related per-iteration PCG times are as follows:

$T^{it}(N, 1) \approx T^{(C^{-1}g)}(N, 1) + 19 N t_a,$

$T^{it}_{com} = 2 (t_s + n t_c) + T^{(C^{-1}g)}_{com} + 2 T^{IP}_{com}$
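In an implementation, the solve $h = C^{-1}g$ is thus the only new kernel, and it can be passed to the PCG routine as a callable. A minimal sketch (the diagonal scaling below is only a placeholder example, not one of the preconditioners studied in these notes):

    #include <vector>
    #include <functional>
    #include <cstddef>

    using Vec  = std::vector<double>;
    using Prec = std::function<void(const Vec& g, Vec& h)>;  /* h := C^{-1} g */

    /* Placeholder preconditioner C = diag(A): h_i = g_i / A_ii. */
    Prec make_diagonal_prec(Vec diag)
    {
        return [diag](const Vec& g, Vec& h) {
            for (std::size_t i = 0; i < g.size(); ++i)
                h[i] = g[i] / diag[i];
        };
    }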


Convergence Rate of PCG Method

Theorem.

$$p(\varepsilon) \le \frac{1}{2} \sqrt{\kappa(C^{-1}A)} \, \ln(2/\varepsilon) + 1,$$

where p(ε) stands for the smallest number k such that

$\|x^k - x\|_A \le \varepsilon \|x^0 - x\|_A \quad \forall x^0 \in \mathbb{R}^N.$


Some parallel preconditioning techniques:
• Incomplete Factorization
• Circulant Block Factorization
• Domain Decomposition
• Patched Local Refinement
• Multigrid/Multilevel
• Approximate Inverse


Circulant Block Factorization

A circulant matrix C has the form

$C_{k,j} = c_{(j-k) \bmod m}$

$$C = \begin{pmatrix}
c_0 & c_1 & c_2 & \dots & c_{m-1} \\
c_{m-1} & c_0 & c_1 & \dots & c_{m-2} \\
\vdots & \vdots & \vdots & & \vdots \\
c_1 & c_2 & \dots & c_{m-1} & c_0
\end{pmatrix}$$

$C = (c_0, c_1, \dots, c_{m-1}) = F \Lambda F^*, \qquad N(C^{-1}v) = O(m \log m)$
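A sketch of such an $O(m \log m)$ solve using the FFTW library (assuming C is represented by its first column c, whose DFT gives the diagonal of Λ; an illustration only, not the authors' code):

    #include <fftw3.h>
    #include <complex>
    #include <vector>

    /* Solve C w = u for a circulant C given by its first column c:
       two forward FFTs, a diagonal solve, and one inverse FFT. */
    std::vector<std::complex<double>>
    circulant_solve(std::vector<std::complex<double>> c,   /* first column */
                    std::vector<std::complex<double>> u)   /* right-hand side */
    {
        const int m = (int)c.size();
        auto fft = [m](std::vector<std::complex<double>>& a, int sign) {
            fftw_plan plan = fftw_plan_dft_1d(
                m, reinterpret_cast<fftw_complex*>(a.data()),
                reinterpret_cast<fftw_complex*>(a.data()), sign, FFTW_ESTIMATE);
            fftw_execute(plan);
            fftw_destroy_plan(plan);
        };
        fft(c, FFTW_FORWARD);                   /* c := eigenvalues of C */
        fft(u, FFTW_FORWARD);                   /* transform the RHS     */
        for (int k = 0; k < m; ++k) u[k] /= c[k];       /* diagonal solve */
        fft(u, FFTW_BACKWARD);                  /* back-transform        */
        for (int k = 0; k < m; ++k) u[k] /= (double)m;  /* FFTW is unnormalized */
        return u;
    }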


2D model problem

$-(a(x, y) u_x)_x - (b(x, y) u_y)_y = f(x, y) \quad \forall (x, y) \in \Omega,$

$u(x, y) = 0 \quad \forall (x, y) \in \Gamma = \partial\Omega,$

$0 < c_{min} \le a(x, y), b(x, y) \le c_{max},$

$A = \mathrm{tridiag}(-A_{i,i-1}, A_{i,i}, -A_{i,i+1}), \quad i = 1, 2, \dots, n,$

$C = \mathrm{tridiag}(-C_{i,i-1}, C_{i,i}, -C_{i,i+1}), \quad i = 1, 2, \dots, n,$

where $C_{i,j} = \mathrm{Circulant}(A_{i,j})$ is some given circulant approximation of the corresponding block $A_{i,j}$.


Factorization

$C = D - L - U$

$C = (X - L)(I - X^{-1} U)$

$X = D - L X^{-1} U$

$X_1 = C_{1,1}, \qquad X_i = C_{i,i} - C_{i,i-1} X_{i-1}^{-1} C_{i-1,i}, \quad i = 2, \dots, n$

$C_{i,j} = F \Lambda_{i,j} F^*, \qquad X_i = F D_i F^*$


$D_1^{-1} = \Lambda_{1,1}, \qquad D_i^{-1} = \Lambda_{i,i} - \Lambda_{i,i-1} D_{i-1} \Lambda_{i-1,i}.$

Let us denote $\Lambda = \mathrm{tridiag}(\Lambda_{i,i-1}, \Lambda_{i,i}, \Lambda_{i,i+1})$. Then the following relation holds:

$Cw = u \iff (I \otimes F) \Lambda (I \otimes F^*) w = u:$

$\hat{u} = (I \otimes F^*) u, \qquad \Lambda \hat{w} = \hat{u}, \qquad w = (I \otimes F) \hat{w}$


$v_1 = D_1 \hat{u}_1, \qquad v_i = D_i (\hat{u}_i - \Lambda_{i,i-1} v_{i-1}), \quad i = 2, 3, \dots, n$

$\hat{w}_n = v_n, \qquad \hat{w}_i = v_i - D_i \Lambda_{i,i+1} \hat{w}_{i+1}, \quad i = n-1, n-2, \dots, 1$
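Since all blocks involved are diagonal in Fourier space, both sweeps reduce to entrywise recurrences on vectors of length m. A sketch (0-based indexing; D, Llo, Lup store the diagonals of $D_i$, $\Lambda_{i,i-1}$, $\Lambda_{i,i+1}$; assumed code, not the authors'):

    #include <complex>
    #include <vector>

    using CVec = std::vector<std::complex<double>>;   /* one diagonal block */

    /* In: u = u-hat blocks (n blocks of length m). Out: u = w-hat blocks. */
    void cbf_sweeps(const std::vector<CVec>& D,    /* D_i,            i = 1..n   */
                    const std::vector<CVec>& Llo,  /* Lambda_{i,i-1}, i = 2..n   */
                    const std::vector<CVec>& Lup,  /* Lambda_{i,i+1}, i = 1..n-1 */
                    std::vector<CVec>& u)
    {
        const int n = (int)D.size(), m = (int)D[0].size();
        std::vector<CVec> v(n, CVec(m));
        for (int k = 0; k < m; ++k)                      /* v_1 = D_1 u_1 */
            v[0][k] = D[0][k] * u[0][k];
        for (int i = 1; i < n; ++i)                      /* forward sweep */
            for (int k = 0; k < m; ++k)
                v[i][k] = D[i][k] * (u[i][k] - Llo[i - 1][k] * v[i - 1][k]);
        u[n - 1] = v[n - 1];                             /* w_n = v_n */
        for (int i = n - 2; i >= 0; --i)                 /* backward sweep */
            for (int k = 0; k < m; ++k)
                u[i][k] = v[i][k] - D[i][k] * Lup[i][k] * u[i + 1][k];
    }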


Parallel algorithm

(Figure: distribution of the vector entries over the processors P0, P1, P2, P3.)



The CBF preconditioning can be split into three stages. If the column-wise mapping is used for the first and third stages, no communication is needed, because the block-FFTs are performed on blocks stored on a single processor. For the second stage the vector entries have to be reordered using a row-wise mapping, as sketched below.
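One possible realization of the reordering stage (a sketch assuming that p divides n and that the send buffer has already been packed so that the elements destined for each processor are contiguous):

    #include <mpi.h>

    /* Switch from the column-wise to the row-wise mapping: each of the p
       processors owns n/p mesh columns of length n before the call and
       n/p mesh rows after it. */
    void remap_columns_to_rows(const double *sendbuf, double *recvbuf,
                               int n, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);
        const int block = (n / p) * (n / p);  /* elements per destination */
        MPI_Alltoall(sendbuf, block, MPI_DOUBLE,
                     recvbuf, block, MPI_DOUBLE, comm);
        /* a local transpose of each received block completes the remapping */
    }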


Parallel CBF tests

SUN Ultra-Enterprise Symmetric Multiprocessor

n = 128:       168 MHz                 250 MHz
  p     T(p)     Sp     Ep      T(p)     Sp     Ep
  1    0.086                   0.081
  2    0.047   1.84   0.92     0.047   1.71   0.86
  4    0.028   3.04   0.76     0.029   2.77   0.69
  8    0.021   4.13   0.52     0.096   0.84   0.10

n = 256:
  1    0.389                   0.392
  2    0.207   1.88   0.94     0.208   1.88   0.94
  4    0.109   3.56   0.89     0.127   3.09   0.77
  8    0.065   6.02   0.75     0.138   2.83   0.35



n = 384:       168 MHz                 250 MHz
  p     T(p)     Sp     Ep      T(p)     Sp     Ep
  1    1.460                   1.498
  2    0.759   1.92   0.96     0.783   1.91   0.96
  3    0.523   2.79   0.93     0.533   2.81   0.94
  4    0.394   3.71   0.93     0.473   3.17   0.79
  6    0.269   5.43   0.90     0.780   1.92   0.32
  8    0.338   4.32   0.54     1.122   1.33   0.17



n = 420:       168 MHz                 250 MHz
  p     T(p)     Sp     Ep      T(p)     Sp     Ep
  1    3.718                   2.651
  2    1.922   1.93   0.97     1.378   1.92   0.96
  3    1.313   2.83   0.94     0.937   2.83   0.94
  4    0.990   3.75   0.94     0.714   3.71   0.93
  5    0.817   4.55   0.91     1.005   2.64   0.53
  6    0.679   5.48   0.91     1.233   2.15   0.36
  7    0.595   6.25   0.89     1.314   2.02   0.29


MIC(0) Algorithm

Let us rewrite the real matrix A in the form $A = D - L - L^T$. Then the modified incomplete Cholesky factorization is defined as follows:

$C_{MIC(0)}(A) = (X - L) X^{-1} (X - L)^T,$

where $X = \mathrm{diag}(x_1, \dots, x_N)$ provides the equal rowsums condition.


Theorem. Let us assume that (a) $L \ge 0$, (b) $Ae \ge 0$, (c) $Ae + L^T e > 0$, where $e = (1, \dots, 1)^T \in \mathbb{R}^N$. Then the relation

$$x_i = a_{ii} - \sum_{k=1}^{i-1} \frac{a_{ik}}{x_k} \sum_{j=k+1}^{N} a_{kj} > 0$$

gives a stable MIC(0) factorization of A.
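For a small, densely stored matrix the relation of the theorem can be evaluated directly; a sketch (a practical MIC(0) code would, of course, exploit the sparsity of A):

    #include <vector>

    /* x_i = a_ii - sum_{k<i} (a_ik / x_k) * sum_{j>k} a_kj */
    std::vector<double> mic0_diagonal(const std::vector<std::vector<double>>& A)
    {
        const int N = (int)A.size();
        std::vector<double> x(N), rowtail(N);
        for (int k = 0; k < N; ++k) {          /* rowtail[k] = sum_{j>k} a_kj */
            rowtail[k] = 0.0;
            for (int j = k + 1; j < N; ++j) rowtail[k] += A[k][j];
        }
        for (int i = 0; i < N; ++i) {
            x[i] = A[i][i];
            for (int k = 0; k < i; ++k)
                x[i] -= (A[i][k] / x[k]) * rowtail[k];
            /* under assumptions (a)-(c) of the theorem, x[i] > 0 holds here */
        }
        return x;
    }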


Remark. All presented numerical tests are performed using the perturbed MIC(0) algorithm, where the incomplete factorization is applied to the matrix $\tilde{A} = A + \tilde{D}$. The diagonal perturbation $\tilde{D} = \tilde{D}(\xi) = \mathrm{diag}(\tilde d_1, \dots, \tilde d_N)$ is defined as follows:

$$\tilde d_i = \begin{cases} \xi a_{ii} & \text{if } a_{ii} \ge 2 w_i, \\ \xi^{1/2} a_{ii} & \text{if } a_{ii} < 2 w_i, \end{cases}$$


where $w_i = -\sum_{j>i} a_{ij}$.

Here $0 < \xi < 1$ is a constant of the same order as the minimal eigenvalue of A. The computations for the considered model problems are done with $\xi = h^2$.
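A corresponding sketch of the perturbation (dense storage again, only for brevity):

    #include <vector>
    #include <cmath>

    /* d_i = xi * a_ii if a_ii >= 2 w_i, and xi^(1/2) * a_ii otherwise,
       with w_i = -sum_{j>i} a_ij and, e.g., xi = h*h. */
    std::vector<double> mic0_perturbation(const std::vector<std::vector<double>>& A,
                                          double xi)
    {
        const int N = (int)A.size();
        std::vector<double> d(N);
        for (int i = 0; i < N; ++i) {
            double w = 0.0;
            for (int j = i + 1; j < N; ++j) w -= A[i][j];
            d[i] = (A[i][i] >= 2.0 * w) ? xi * A[i][i]
                                        : std::sqrt(xi) * A[i][i];
        }
        return d;
    }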


MIC(0) Complexity

The MIC(0) computational complexity of one PCG iteration is as follows:

$N^{PCG}_{it}(A^{-1}b) \approx N(C^{-1}g) + 19N, \qquad N(C^{-1}g) \approx 11N,$

$N^{PCG}_{it}(A^{-1}b) \approx 30N.$

MIC(0) is a cheap preconditioning algorithm: the cost of $N(C^{-1}g)$ is almost the same as $N(Ad)$.


MIC(0) is a robust preconditioner with respect to local singularities of the problem, with $\kappa(C^{-1}A) = O(N^{1/2})$, so that $n^{PCG}_{it} = O(N^{1/4})$ and $N^{PCG}(A^{-1}b) = O(N^{5/4})$.

MIC(0) is an inherently sequential algorithm.


FDM/FEM Sparse Matrices II

The model problem is considered again: $-u_{xx} - u_{yy} = f$ in $\Omega = [0, 1]^2$ with Dirichlet boundary conditions on $\Gamma = \partial\Omega$.

(Figure: the regular mesh (ReM) and the skewed mesh (SkM).)

Since a five-point stencil is used in both cases, the accuracy of the regular mesh (ReM) and of the alternative skewed mesh (SkM) FDM/FEM approximations is one and the same.


Block-Structure of the Matrices

(Figure: block structure of the SkM and ReM matrices.)

The bottleneck of the parallel implementation of the MIC(0) algorithm is the solution of systems with the triangular matrices $(X - L)$ and $(X - L)^T$.


The key point of our considerations is that, in the case of the skewed mesh, the stiffness matrix has a block structure whose diagonal blocks are diagonal.


Parallel MIC(0) algorithm

(Figure: MIC(0) PCG algorithm with a block-striped partitioning over processors P0, P1, P2; $N = n^2 + (n-1)^2$.)

$T^{MIC(0)}_{it}(N, 1) \approx 38 N t_a, \qquad T_{com} \approx (4 t_s + 6 t_c) n + 2 T^{IP}_{com},$

$T^{MIC(0)}_{it}(N, p) \approx \frac{38N}{p} t_a + (2 t_s + 3 t_c) \sqrt{2N}$


Parallel MIC(0) Tests

• The presented tests are performed on a Beowulf-like cluster of four dual-processor Power Macintosh computers, each with 512 MB RAM and 450 MHz G4 processors.

• The parallel MIC(0) algorithm is implemented in C++ using the Message Passing Interface (MPI).

• Yellow Dog Linux with LAM MPI is used.

• The size of the problem and the number of processors are varied to examine the parallel scalability of the code.


Parallel Speedup

  n        32     64    128    256    512   1024   1500
S(n,2)   1.21   1.68   1.96   1.85   1.92   2.03   2.02
S(n,3)   0.24   0.46   0.97   1.72   2.45   2.97   2.86
S(n,4)   0.22   0.46   1.11   1.97   2.88   3.76   3.95
S(n,5)   0.20   0.40   0.96   1.99   3.25   4.48   4.86
S(n,6)   0.18   0.39   1.03   1.99   3.55   5.23   5.73
S(n,7)   0.19   0.38   0.95   1.78   3.63   6.02   6.31
S(n,8)   0.19   0.39   1.00   2.28   3.97   6.37   6.76


Parallel Efficiency

  n        32     64    128    256    512   1024   1500
E(n,2)   0.60   0.84   0.98   0.93   0.96   1.02   1.01
E(n,3)   0.08   0.15   0.32   0.57   0.81   0.99   0.95
E(n,4)   0.06   0.12   0.27   0.49   0.72   0.94   0.98
E(n,5)   0.04   0.08   0.19   0.40   0.65   0.90   0.97
E(n,6)   0.03   0.07   0.17   0.33   0.59   0.87   0.96
E(n,7)   0.03   0.05   0.14   0.25   0.51   0.86   0.90
E(n,8)   0.02   0.05   0.13   0.29   0.50   0.80   0.84
