block lu factorization lecture 24
DESCRIPTION
Block LU Factorization Lecture 24. MA471 Fall 2003. Example Case. Suppose we are faced with the solution of a linear system Ax=b Further suppose: A is large (dim( A )>10,000) A is dense A is full We have a sequence of different b vectors. Problems. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/1.jpg)
Block LU FactorizationLecture 24
MA471 Fall 2003
![Page 2: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/2.jpg)
Example Case
1) Suppose we are faced with the solution of a linear system Ax=b
2) Further suppose:1) A is large (dim(A)>10,000) 2) A is dense3) A is full4) We have a sequence of different b vectors.
![Page 3: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/3.jpg)
Problems• Suppose we are able to compute the
matrix –– It costs N2 doubles to store the matrix– E.g. for N=100,000 we require 76.3 gigabytes
of storage for the matrix alone.– 32 bit processors are limited to 4 gigabytes of
memory– Most desktops (even 64 bit) do not have 76.3
gigabytes
– What to do?
![Page 4: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/4.jpg)
Divide and Conquer
P0 P1 P2 P3
P4 P5 P6 P7
P8 P9 P10 P11
P12 P13 P14 P15
One approach is to assume we have a square number of processors.We then divide the matrix into blocks – storing one block per processor.
![Page 5: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/5.jpg)
Back to the Linear System
• We are now faced with LU factorization of a distributed matrix.
• This calls for a modified LU routine which acts on blocks of the matrix.
• We will demonstrate this algorithm for one level.
• i.e. we need to construct matrices L,U such that A=LU and we only store single blocks of A,L,U on any processor.
![Page 6: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/6.jpg)
Constructing the Block LU Factorization
A00 A01 A02
A10 A11 A12
A20 A21 A22
=
L00 0 0
L10 1 0
L20 0 1
*
U00 U01 U02
0 ?11 ?12
0 ?21 ?22
First we LU factorize A00 and look for the above block factorization. However, we need to figure out what each of the entries are:
A00 = L00*U00 (compute by L00, U00 by LU factorization)
A01 = L00*U01 => U01 = L00\A01A02 = L00*U02 => U02 = L00\A02
A10 = L10*U00 => L10 = A10/U00A20 = L20*U00 => L20 = A20/U00
A11 = L10*U01 + ?11 => ?11 = A11 – L10*U01..
![Page 7: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/7.jpg)
contA00 = L00*U00 (compute by L00, U00 by LU factorization)
A01 = L00*U01 => U01 = L00\A01A02 = L00*U02 => U02 = L00\A02
A10 = L10*U00 => L10 = A10/U00A20 = L20*U00 => L20 = A20/U00
A11 = L10*U01 + ?11 => ?11 = A11 – L10*U01A12 = L10*U02 + ?12 => ?12 = A12 – L10*U02A21 = L20*U01 + ?21 => ?21 = A21 – L20*U01A22 = L20*U02 + ?22 => ?22 = A22 – L20*U02
In the general case:Anm = Ln0*U0m + ?nm => ?nm = Anm – Ln0*U0m
![Page 8: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/8.jpg)
Summary First Stage
A00 A01 A02
A10 A11 A12
A20 A21 A22
=
L00 0 0
L10 1 0
L20 0 1
*
U00 U01 U02
0 ?11 ?12
0 ?21 ?22
First step: LU factorize uppermost block diagonal
Second step: a) compute U0n = L00\A0n n>0 b) compute Ln0 = An0/U00 n>0
Third step: compute ?nm = Anm – Ln0*U0m, (n,m>0)
![Page 9: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/9.jpg)
Now Factorize Lower SE Block
?11 ?12
?21 ?22=
L11 0
L21 1*
U11 U12
0 ??22
We repeat the previous algorithm this time on the two by two SE block.
![Page 10: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/10.jpg)
End Result
A00 A01 A02
A10 A11 A12
A20 A21 A22
=
L00 0 0
L10 L11 0
L20 L21 L22
*
U00 U01 U02
0 U11 U12
0 0 U22
![Page 11: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/11.jpg)
Matlab Version
![Page 12: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/12.jpg)
Parallel AlgorithmP0 P1 P2
P3 P4 P5
P6 P7 P8
P0: A00 = L00*U00 (compute by L00, U00 by LU factorization)
P1: U01 = L00\A01P2: U02 = L00\A02
P3: L10 = A10/U00P6: L20 = A20/U00
P4: A11 <- A11 – L10*U01P5: A12 <- A12 – L10*U02P7: A21 <- A21 – L20*U01P8: A22 <- A22 – L20*U02
In the general case:Anm = Ln0*U0m + ?nm => ?nm = Anm – Ln0*U0m
![Page 13: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/13.jpg)
Parallel Communication
L00U00 U01 U02
L10 A11 A12
L20 A21 A22
P0: L00,U00 =lu(A)
P1: U01 = L00\A01P2: U02 = L00\A02
P3: L10 = A10/U00P6: L20 = A20/U00
P4: A11 <- A11 – L10*U01P5: A12 <- A12 – L10*U02P7: A21 <- A21 – L20*U01P8: A22 <- A22 – L20*U02
In the general case:Anm = Ln0*U0m + ?nm => ?nm = Anm – Ln0*U0m
![Page 14: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/14.jpg)
Communication Summary
P0: L00,U00 =lu(A)
P1: U01 = L00\A01P2: U02 = L00\A02
P3: L10 = A10/U00P6: L20 = A20/U00
P4: A11 <- A11 – L10*U01P5: A12 <- A12 – L10*U02P7: A21 <- A21 – L20*U01P8: A22 <- A22 – L20*U02
P0: sends L00 to P1,P2 sends U00 to P3,P6
P1: sends U01 to P4,P7P2: sends U02 to P5,P8
P3: sends L10 to P4,P5P4: sends L20 to P7,P8
P0 P1 P2
P3 P4 P5
P6 P7 P8
L00U00 U01 U02
L10 A11 A12
L20 A21 A22
![Page 15: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/15.jpg)
Upshot
Notes:1) I added an MPI_Barrier purely to separate the LU factorization and the backsolve.2) In terms of efficiency we can see that quite a bit of time is spent in MPI_Wait
compared to compute time.3) The compute part of this code can be optimized much more – making the parallel
efficiency even worse.
a b
(a) P0: sends L00 to P1,P2 sends U00 to P3,P6
(b) P1: sends U01 to P4,P7(c) P2: sends U02 to P5,P8
(d) P3: sends L10 to P4,P5(e) P4: sends L20 to P7,P8
cde
(f) P4: sends L11 to P5 sends U11 to P7
(g) P1: sends U12 to P8
(h) P3: sends L21 to P8
f
1st stage: 1st stage:
g
h
![Page 16: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/16.jpg)
Block Back Solve
• After factorization we are left with the task of using the distributed L and U to compute the backsolve:
U00L00 U01 U02
L10 U11L11 U12
L20 L21 U22L22
Block distribution of L and U
P0 P1 P2
P3 P4 P5
P6 P7 P8
![Page 17: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/17.jpg)
Recall
• Given an LU factorization of A namely, L,U such that A=LU
• Then we can solve Ax=b by• y=L\b• x=U\y
![Page 18: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/18.jpg)
Distributed Back Solve
L00 0 0
L10 L11 0
L20 L21 L22
=
y0
y1
y2
b0
b1
b2
P0: solve L00*y0 = b0 send: y0 to P3,P6P3: send: L10*y0 to P4P4: solve L11*y1 = b1-L10*y0 send: y1 to P7P6: send: L20*y0 to P8\P7: send: L21*y1 to P8P8: solve L22*y2 = b2-L20*y0-L21*y1Results: y0 on P0, y1 on P4, y2 on P8
P0 P1 P2
P3 P4 P5
P6 P7 P8
![Page 19: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/19.jpg)
Matlab Code
![Page 20: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/20.jpg)
Back Solve
After the factorization we computed a solution to Ax=b
This consists of two distributed block triangular systems to solve
![Page 21: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/21.jpg)
Barrier Between Back Solves
This time I inserted an MPI_Barrier call between the backsolves. This highlights the serial nature of the backsolves..
![Page 22: Block LU Factorization Lecture 24](https://reader036.vdocuments.us/reader036/viewer/2022062310/56815f6c550346895dce7356/html5/thumbnails/22.jpg)
Example Codehttp://www.math.unm.edu/~timwar/MA471F03/blocklu.m
http://www.math.unm.edu/~timwar/MA471F03/parlufact2.c