
Sparse factorization using low rank submatrices

Cleve Ashcraft, LSTC

cleve@lstc.com

2010 MUMPS User Group Meeting, April 15-16, 2010

Toulouse, FRANCE

ftp.lstc.com:outgoing/cleve/MUMPS10 Ashcraft.pdf


LSTC: Livermore Software Technology Corporation

Livermore, California 94550

• Founded by John Hallquist of LLNL in the 1980s

• Public domain versions of DYNA and NIKE codes

• LS-DYNA: implicit/explicit, nonlinear finite element analysis code

• Multiphysics capabilities

– Fluid/structure interaction

– Thermal analysis

– Acoustics

– Electromagnetics


Multifrontal Algorithm

• very large, sparse LDL^T and LDU factorizations

• tree structure organizes factor storage, solve and factor operations

• medium-to-large sparse linear systems located at each leaf node of the tree

• medium-sized dense linear systems located at each interior node of the tree

• dense matrix-matrix operations at each interior node

• sparse matrix-matrix adds between nodes


Multifrontal tree

[Figure: multifrontal tree with nodes numbered 0-71.]

Multifrontal tree – polar representation

[Figure: the same 72-node multifrontal tree drawn in a polar layout.]

Multifrontal Algorithm

• 40M equations, present frontier at LSTC

• serial, SMP, MPP, hybrid, GPU

• large problems, > 1M–10M dof, require out-of-core storage of factor entries, even on distributed memory systems

• I/O cost is largely hidden during the factorization

• I/O cost dominates during the solves

• eigensolver ⇒ several right hand sides

• many applications, e.g., Newton's method, have a single right hand side


Low Rank Approximations

• hierarchical matrices: Hackbusch, Bebendorf, Le Borne, others

• semi-separable matrices, Gu, others

• submatrices are numerically rank deficient

• method of choice for Boundary Elements (BEM)

• now applied to Finite Elements (FEM)

• A ≈ X Y^T, where A is m × n, X is m × r, and Y is n × r

• storage = r(m + n) vs mn

• reduction in ops = r / min(m, n)
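
To make the storage trade-off concrete, here is a minimal numpy sketch (not from the talk) that builds a rank-r factorization A ≈ X Y^T by truncated SVD; the test matrix and tolerance are illustrative assumptions.

```python
import numpy as np

def low_rank_factors(A, tol=1e-14):
    """Return X (m x r) and Y (n x r) with A ~= X @ Y.T, keeping only
    singular values above tol relative to the largest one."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))   # numerical rank
    X = U[:, :r] * s[:r]              # fold the singular values into X
    Y = Vt[:r, :].T
    return X, Y

# Illustrative test: a smooth kernel matrix with a rapidly decaying spectrum.
m, n = 200, 150
A = 1.0 / (1.0 + np.add.outer(np.arange(m), np.arange(n)))
X, Y = low_rank_factors(A)
r = X.shape[1]
print(f"rank {r}: storage r(m+n) = {r*(m+n)} vs mn = {m*n}")
```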

Multifrontal Algorithm + Low Rank Approximations

• At each leaf node in the multifrontal tree: use the standard multifrontal algorithm

• At each interior node in the multifrontal tree: low rank matrix-matrix multiplies and sums

• Between nodes: low rank matrix sums

• Dramatic reduction in storage

• Dramatic reduction in operations

• Excellent approximation properties for finite element operators

• Our experience is with potential equations, and elasticity with solids and shells


Outline

• graph, tree, matrix perspectives

• experiments – 2-D potential equation

• low rank computations

• blocking strategies

• summary


One subgraph

[Figure: one subgraph of the domain, plotted on [-1, 1] × [-1, 1].]

One subtree

[Diagram: subdomains ΩI and ΩJ, separated by the separator S, surrounded by the external boundary ∂S.]

One submatrix

A = [ A_{ΩI,ΩI}                A_{ΩI,S}    A_{ΩI,∂S}
                  A_{ΩJ,ΩJ}    A_{ΩJ,S}    A_{ΩJ,∂S}
      A_{S,ΩI}    A_{S,ΩJ}     A_{S,S}     A_{S,∂S}
      A_{∂S,ΩI}   A_{∂S,ΩJ}    A_{∂S,S}    A_{∂S,∂S} ]


Portion of original matrix

[Spy plot: A, nz = 9052.]

A ordered by domains, separator and external boundary

Portion of factor matrix

[Spy plot: L, nz = 48899.]

L ordered by domains, separator and external boundary

Schur complement matrix

[ A_{S,S}    A_{S,∂S}
  A_{∂S,S}   A_{∂S,∂S} ] =

[Spy plot: Schur complement on the separator and external boundary, nz = 20659.]
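
The matrix in the plot above is the update obtained by eliminating the domain unknowns ΩI and ΩJ. A minimal sketch of forming it densely, assuming a symmetric positive definite A and integer index lists for the interior (ΩI ∪ ΩJ) and remaining (S ∪ ∂S) unknowns:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def schur_complement(A, interior, remaining):
    """Eliminate the interior unknowns and return the update matrix on
    the separator and external boundary:  A_rr - A_ri A_ii^{-1} A_ir.
    A is assumed symmetric positive definite."""
    Aii = A[np.ix_(interior, interior)]
    Air = A[np.ix_(interior, remaining)]
    Arr = A[np.ix_(remaining, remaining)]
    return Arr - Air.T @ cho_solve(cho_factor(Aii), Air)
```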

Outline

• graph, tree, matrix perspectives

• experiments – 2-D potential equation

• low rank computations

• blocking strategies

• summary


Computational experiments

• Compute L_{S,S}, L_{∂S,S} and A_{∂S,∂S}

• Find domain decompositions of S and ∂S

• Form block matrices, e.g., L_{S,S} = Σ_{K≥J} L_{K,J}

• Find singular value decompositions of each L_{K,J}

• Collect all singular values

σ = Σ_{K≥J} Σ_{i=1}^{min(|K|,|J|)} σ_i^{(K,J)}

• Split the matrix L_{S,S} = M_{S,S} + N_{S,S} using the singular values σ

• We want ‖N_{S,S}‖_F ≤ 10^{-14} ‖L_{S,S}‖_F
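
A sketch of this experiment in numpy (an assumed reading of the procedure above): SVD each block, pool the singular values, and discard the globally smallest ones so that the discarded part N satisfies the Frobenius-norm bound.

```python
import numpy as np

def split_by_global_sigmas(blocks, rel_tol=1e-14):
    """blocks: dict {(K, J): dense block L_{K,J}}. Returns per-block
    numerical ranks after discarding the globally smallest singular
    values, keeping ||N||_F <= rel_tol * ||L||_F."""
    svds = {kj: np.linalg.svd(B, full_matrices=False) for kj, B in blocks.items()}
    sigma = np.sort(np.concatenate([s for (_, s, _) in svds.values()]))
    # ||L||_F^2 equals the sum of all squared singular values of the blocks
    budget = rel_tol**2 * np.sum(sigma**2)
    ndrop = int(np.searchsorted(np.cumsum(sigma**2), budget, side='right'))
    thresh = sigma[ndrop - 1] if ndrop > 0 else -1.0
    # drop sigma_i <= thresh in every block (ties may slightly overshoot)
    return {kj: int(np.sum(s > thresh)) for kj, (_, s, _) in svds.items()}
```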


255 × 255 diagonal block factor L_{S,S}

43% dense, relative accuracy 10^{-14}

[Spy plot of L_{S,S}.]

754 × 255 lower block factor L_{∂S,S}

16% dense, relative accuracy 10^{-14}

[Spy plot of L_{∂S,S}.]

754 × 754 update matrix A_{∂S,∂S}

21% dense, relative accuracy 10^{-14}

[Spy plot of A_{∂S,∂S}.]

255 × 255 factor matrix L_{S,S}

storage vs accuracy

[Plot: log10(1 − ‖M‖_F / ‖A‖_F) versus fraction of dense storage, approximating the 255 × 255 separator factor matrix L_{3,3}; block partitions from 2 × 2 to 8 × 8.]

754 × 255 factor matrix L_{∂S,S}

storage vs accuracy

[Plot: log10(1 − ‖M‖_F / ‖A‖_F) versus fraction of dense storage, approximating the 754 × 255 interface-separator factor matrix L_{4,3}; block partitions 3 × 1, 6 × 2, 9 × 3 and 12 × 4.]

754 × 754 update matrix A_{∂S,∂S}

storage vs accuracy

[Plot: log10(1 − ‖M‖_F / ‖A‖_F) versus fraction of dense storage, approximating the 754 × 754 Schur complement matrix Â_{4,4}; block partitions from 2 × 2 to 8 × 8.]

Outline

• graph, tree, matrix perspectives

• experiments – 2-D potential equation

• low rank computations

• blocking strategies

• summary


How to compute low rank submatrices?

• SVD – singular value decomposition

A = UΣV^T

where U and V are orthogonal and Σ is diagonal

• Gold standard, expensive, O(n³) ops

• QR factorization with column pivoting

AP = QR

where Q is orthogonal, R is upper triangular, and P is a permutation matrix

• Silver standard, moderate cost, O(rn²) ops
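
A sketch of the silver standard using scipy's column-pivoted QR; the truncation rule (drop rows of R whose norms fall below a relative tolerance) is an assumption motivated by the next slide.

```python
import numpy as np
from scipy.linalg import qr

def low_rank_qr(A, tol=1e-14):
    """Low rank approximation A ~= X @ Y from AP = QR with column
    pivoting, truncated where the row norms of R become negligible."""
    Q, R, piv = qr(A, mode='economic', pivoting=True)
    row_norms = np.linalg.norm(R, axis=1)
    r = int(np.sum(row_norms > tol * np.linalg.norm(A, 'fro')))
    inv_piv = np.argsort(piv)          # undo the column permutation
    return Q[:, :r], R[:r, :][:, inv_piv]
```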


row norms of R vs singular values of A

754 × 754 update matrix A_{∂S,∂S}

94 × 94 submatrix size; near, mid and far interactions

[Plot: row norms ‖R_{i,:}‖_2 versus singular values σ_i of the 754 × 754 A, using 8 segments, for the near, mid and far interactions.]

SVD vs QR with column pivoting: conclusions

• Column pivoting QR factorization does well.

• Row norms of R track the singular values σ.

• The numerical rank of R is greater than needed, but not that much greater.

• For more accuracy / less storage, two-sided orthogonal factorizations

– AP = ULV, with U and V orthogonal, L triangular

– PAQ = UBV, with U and V orthogonal, B bidiagonal

track the singular values very closely.


Type of approximation of submatrices

• submatrix L_{I,J} of L_{∂S,S}, ‖L_{I,J}‖_F = 2.92 × 10^{-2}

factorization   numerical rank   total entries
none            –                4900
QR              20               2681
ULV             18               2680
SVD             18               2538

• submatrix L_{I,J} of L_{∂S,S}, ‖L_{I,J}‖_F = 2.04 × 10^{-3}

factorization   numerical rank   total entries
none            –                4900
QR              14               1896
ULV             10               1447
SVD             10               1410


Operations with low rank submatrices

A = U_A V_A^T,  B = U_B V_B^T,  C = U_C V_C^T

See Bebendorf, "Hierarchical Matrices", Chapter 1

• Multiplication A = BC, rank(A) ≤ min(rank(B), rank(C))

A = U_A V_A^T = (U_B V_B^T)(U_C V_C^T) = BC
  = U_B (V_B^T U_C) V_C^T
  = U_B ((V_B^T U_C) V_C^T)
  = (U_B (V_B^T U_C)) V_C^T

• Addition A = B + C, rank(A) ≤ rank(B) + rank(C)

A = U_A V_A^T = (U_B V_B^T) + (U_C V_C^T) = B + C
  = [U_B U_C] [V_B V_C]^T
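
A direct transcription of these two identities into numpy (a sketch; the names lr_multiply and lr_add are illustrative, and a factor pair (U, V) represents U V^T). The multiply forms only the small core V_B^T U_C and attaches it to whichever side yields the smaller rank; the add just concatenates factors, deferring recompression.

```python
import numpy as np

def lr_multiply(UB, VB, UC, VC):
    """A = B C with B = UB @ VB.T, C = UC @ VC.T;
    rank(A) <= min(rank(B), rank(C))."""
    core = VB.T @ UC                      # small rB x rC matrix
    if UB.shape[1] <= VC.shape[1]:
        return UB, VC @ core.T            # A = UB @ (VC core^T)^T
    return UB @ core, VC                  # A = (UB core) @ VC^T

def lr_add(UB, VB, UC, VC):
    """A = B + C = [UB UC] [VB VC]^T;
    rank(A) <= rank(B) + rank(C)."""
    return np.hstack([UB, UC]), np.hstack([VB, VC])
```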


Multiplication A = BC

A = U_A V_A^T,  B = U_B V_B^T,  C = U_C V_C^T

A = BC = ((m × h)(h × l)) ((l × k)(k × n))
       = (m × min(h, k)) (min(h, k) × n)
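
Continuing the earlier sketch, a quick dimension check of lr_multiply on random factors, with shapes as on this slide:

```python
import numpy as np
# B = UB VB^T is m x l with inner dimension h; C = UC VC^T is l x n with k.
m, h, l, k, n = 60, 12, 50, 7, 40
UB, VB = np.random.rand(m, h), np.random.rand(l, h)
UC, VC = np.random.rand(l, k), np.random.rand(n, k)
UA, VA = lr_multiply(UB, VB, UC, VC)     # from the sketch above
assert UA.shape[1] == VA.shape[1] == min(h, k)
err = np.linalg.norm(UA @ VA.T - (UB @ VB.T) @ (UC @ VC.T))
print(UA.shape, VA.shape, err)           # rank min(h,k); err at rounding level
```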

Near-near matrix product

[Plot: singular values σ(L_{2,1}) and σ(L_{2,1} L_{2,1}^T) for the near-near interaction.]

mid-near matrix product

[Plot: singular values σ(L_{2,1}), σ(L_{1,1}) and σ(L_{2,1} L_{1,1}^T) for the mid-near interaction.]

far-near matrix product

[Plot: singular values σ(L_{5,1}), σ(L_{6,1}) and σ(L_{5,1} L_{6,1}^T) for the far-near interaction.]

mid-mid matrix product

[Plot: singular values σ(L_{1,1}) and σ(L_{1,1} L_{1,1}^T) for the mid-mid interaction.]

far-mid matrix product

[Plot: singular values σ(L_{6,1}) and those of the product L_{6,1} L_{1,1}^T for the far-mid interaction.]

far-far matrix product

[Plot: singular values σ(L_{6,1}) and σ(L_{6,1} L_{6,1}^T) for the far-far interaction.]

Addition A = B + C

rank(A) ≤ rank(B) + rank(C)

A = U_A V_A^T = (U_B V_B^T) + (U_C V_C^T) = B + C
  = [U_B U_C] [V_B V_C]^T
  = (Q_1 R_1)(Q_2 R_2)^T
  = Q_1 (R_1 R_2^T) Q_2^T
  = Q_1 (Q_3 R_3) Q_2^T
  = (Q_1 Q_3)(R_3 Q_2^T)
  = (Q_1 Q_3)(Q_2 R_3^T)^T

• R_1 R_2^T usually has low numerical rank

• examples follow
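
A sketch of this recompression in numpy; where the slide uses a third QR (Q_3 R_3) of the small core R_1 R_2^T, this version substitutes an SVD of the core so the truncation rank is explicit.

```python
import numpy as np

def lr_recompress(U, V, tol=1e-14):
    """Recompress A = U @ V.T (e.g., after lr_add): U = Q1 R1, V = Q2 R2,
    then truncate the small core R1 @ R2.T, which usually has low
    numerical rank."""
    Q1, R1 = np.linalg.qr(U)
    Q2, R2 = np.linalg.qr(V)
    W, s, Zt = np.linalg.svd(R1 @ R2.T)   # core is only r x r
    k = int(np.sum(s > tol * s[0]))       # numerical rank of the sum
    return Q1 @ (W[:, :k] * s[:k]), Q2 @ Zt[:k, :].T
```

Chained after lr_add from the earlier sketch, this keeps the stored rank near the numerical rank of B + C rather than rank(B) + rank(C).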

update matrix, diagonal block
A_{3,3} = L_{3,1} L_{3,1}^T + L_{3,2} L_{3,2}^T + L_{3,3} L_{3,3}^T

[Plot: log10(σ) for the self interaction: σ(near), σ(mid), σ(far) and σ(sum).]

update matrix, mid-distance off-diagonal block
A_{5,3} = L_{5,1} L_{3,1}^T + L_{5,2} L_{3,2}^T + L_{5,3} L_{3,3}^T

[Plot: log10(σ) for the mid-range interaction: σ(near), σ(mid), σ(far) and σ(sum).]

update matrix, far-distance off-diagonal block
A_{7,3} = L_{7,1} L_{3,1}^T + L_{7,2} L_{3,2}^T + L_{7,3} L_{3,3}^T

[Plot: log10(σ) for the far-range interaction: σ(near), σ(mid), σ(far) and σ(sum).]

Outline

• graph, tree, matrix perspectives

• experiments – 2-D potential equation

• low rank computations

• blocking strategies

• summary


Blocking Strategies

• Active data structures

[ L_{J,J}
  L_{∂J,J}   A_{∂J,∂J} ]

• partition J and ∂J independently

• for 2-d problems, J and ∂J are 1-d manifolds

• for 3-d problems, J and ∂J are 2-d manifolds

• we need mesh partitioning of the separator and the boundary of a region J

• each index set of a partition is a segment


Segment partition

[Figure: segment wirebasket domain decomposition on [-1, 1] × [-1, 1].]

[Spy plot of L_{J,J}.]

L_{J,J} = Σ_{σ,τ} L_{σ,τ},   σ × τ ⊆ J × J

[Spy plot of L_{∂J,J}.]

L_{∂J,J} = Σ_{σ,τ} L_{σ,τ},   σ × τ ⊆ ∂J × J

[Spy plot of A_{∂J,∂J}.]

A_{∂J,∂J} = Σ_{σ,τ} A_{σ,τ},   σ × τ ⊆ ∂J × ∂J
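
A hypothetical container for these segment-blocked matrices (the class and method names are illustrative, not from the talk): blocks are keyed by segment pairs (σ, τ), each kept dense or as a low rank pair, so stored entries can be compared against the dense count.

```python
import numpy as np

class SegmentBlockMatrix:
    """Holds L_{J,J}, L_{dJ,J} or A_{dJ,dJ} as a sum of segment blocks."""
    def __init__(self):
        self.blocks = {}                  # (sigma_id, tau_id) -> block

    def set_block(self, sigma_id, tau_id, value):
        # value: dense ndarray, or a tuple (X, Y) meaning X @ Y.T
        self.blocks[(sigma_id, tau_id)] = value

    def storage(self):
        """Entries actually stored: r(m + n) per low rank block,
        mn per dense block."""
        total = 0
        for v in self.blocks.values():
            total += v[0].size + v[1].size if isinstance(v, tuple) else v.size
        return total
```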

Strategy I

• The partition of ∂J (local to node J) conforms to the partitions of the ancestors K with K ∩ ∂J ≠ ∅

• Advantages:

– Update assembly is simplified:

A^{(p(J))}_{σ2,τ2} = A^{(p(J))}_{σ2,τ2} + A^{(J)}_{σ1,τ1}

where σ1 ⊆ σ2, τ1 ⊆ τ2

– One destination for each A^{(J)}_{σ1,τ1} (see the sketch below)

• Disadvantages:

– The partition of ∂J can be fragmented: more segments, smaller size, less efficient storage
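
A minimal numpy sketch of the conforming scatter-add (a hypothetical helper, dense blocks for clarity): because σ1 ⊆ σ2 and τ1 ⊆ τ2, the child update lands in exactly one parent block.

```python
import numpy as np

def assemble_conforming(parent, sigma2, tau2, child, sigma1, tau1):
    """Strategy I assembly: add the child block A^(J)_{sigma1,tau1} into
    its single destination A^(p(J))_{sigma2,tau2}. Index lists hold sorted
    global indices, with sigma1 a subset of sigma2, tau1 a subset of tau2."""
    rows = np.searchsorted(sigma2, sigma1)   # local positions of sigma1
    cols = np.searchsorted(tau2, tau1)       # local positions of tau1
    parent[np.ix_(rows, cols)] += child
    return parent
```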


Strategy II

• The partition of ∂J (local to node J) need not conform to the partitions of the ancestors

• Advantages:

– The partition of ∂J can be optimized better, since ∂J is small and localized

• Disadvantages:

– Update assembly is more complex:

A^{(p(J))}_{σ1∩σ2,τ1∩τ2} = A^{(p(J))}_{σ1∩σ2,τ1∩τ2} + A^{(J)}_{σ1∩σ2,τ1∩τ2}

where σ1 ∩ σ2 ≠ ∅, τ1 ∩ τ2 ≠ ∅

– Several destinations for a submatrix A^{(J)}_{σ1,τ1} (see the sketch below)
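
And the nonconforming counterpart (again a hypothetical sketch with dense blocks): the child block is intersected with every parent block it overlaps, giving several destinations.

```python
import numpy as np

def assemble_nonconforming(parent_blocks, child, sigma1, tau1):
    """Strategy II assembly: scatter A^(J)_{sigma1,tau1} into each parent
    block whose index sets intersect sigma1 x tau1. parent_blocks is a
    list of (sigma2, tau2, P) with sorted global index arrays."""
    for sigma2, tau2, P in parent_blocks:
        _, c1, c2 = np.intersect1d(sigma1, sigma2, return_indices=True)
        _, d1, d2 = np.intersect1d(tau1, tau2, return_indices=True)
        if c1.size and d1.size:
            P[np.ix_(c2, d2)] += child[np.ix_(c1, d1)]
```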


Outline

• graph, tree, matrix perspectives

• experiments – 2-D potential equation

• low rank computations

• blocking strategies

• summary


Multifrontal tree, Multifrontal Factorization

[Figure: the 72-node multifrontal tree in polar layout.]

• each leaf node is a d-dimensional sparse FEM matrix

• each interior node is a (d − 1)-dimensional dense BEM matrix

• use low rank storage and computation at each interior node

Call to action

• progress to date in low rank factorizations has been driven by iterative methods

• work is needed from the direct methods community

• start from an industrial strength multifrontal code

– serial, SMP, MPP, hybrid, GPU

– pivoting for stability, out-of-core, singular systems, null spaces


Call to action

• Many challenges

– partition of space, partition of interface

– added programming complexity of low rank matrices

– challenges to pivot for stability

– challenges/opportunities to implement in parallel

• Payoff will be huge

– reduction in storage footprint

– reduction in computational work

– take direct methods to the next level of problem size

