Communication-Avoiding Algorithms for Dense Linear Algebra:
LU, QR, and Cholesky Decompositions, and Sparse Cholesky Decompositions

Jim Demmel and Oded Schwartz
CS294, Lecture #5, Fall 2011
www.cs.berkeley.edu/~odedsc/CS294
Based on: Bootcamp 2011, CS267, Summer School 2010, Laura Grigori's slides, many papers
Outline for today
• Recall communication lower bounds
• What a "matching upper bound", i.e. a CA-algorithm, might have to do
• Summary of what is known so far
• Case studies of CA dense algorithms
  • Last time, case study I: matrix multiplication
  • Case study II: LU, QR
  • Case study III: Cholesky decomposition
• Case studies of CA sparse algorithms
  • Sparse Cholesky decomposition
Communication lower bounds
• Applies to algorithms satisfying technical conditions discussed last time ("smells like 3-nested loops")
• For each processor:
  • M = size of fast memory of that processor
  • G = #operations (multiplies) performed by that processor

  #words_moved to/from fast memory = Ω( max( G / M^(1/2), #inputs + #outputs ) )

  #messages to/from fast memory ≥ #words_moved / max_message_size = Ω( G / M^(3/2) )
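To make the two bounds concrete, here is a small Python helper (ours, not from the slides; the function names are made up for illustration) that evaluates them for a given flop count G and fast-memory size M:

```python
# Helpers (illustrative only) for the sequential communication lower bounds
# stated above, for a "3-nested-loops-like" algorithm doing G multiplies
# with a fast memory of M words.

def words_moved_lower_bound(G, M, n_inputs, n_outputs):
    """Omega( max( G / M^(1/2), #inputs + #outputs ) ) words."""
    return max(G / M**0.5, n_inputs + n_outputs)

def messages_lower_bound(G, M):
    """Omega( G / M^(3/2) ) messages (words moved / max message size M)."""
    return G / M**1.5

# Example: n x n matrix multiply, G = n^3 multiplies, M words of cache.
n, M = 1024, 2**15
G = n**3
print(words_moved_lower_bound(G, M, n_inputs=2 * n * n, n_outputs=n * n))
print(messages_lower_bound(G, M))
```

The `max` with #inputs + #outputs captures the trivial bound: every input must be read and every output written at least once.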
Summary of dense sequential algorithms attaining communication lower bounds
• Algorithms shown minimizing #messages use a (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports); only some shown, plus ours
• In the original slides, cache-oblivious algorithms are underlined, ours are in green, and "?" marks unknown/future work

Per algorithm: 2 levels of memory (#words moved and #messages) / multiple levels of memory (#words moved and #messages)
• BLAS-3: usual blocked or recursive algorithms / usual blocked algorithms (nested), or recursive [Gustavson 97]
• Cholesky: LAPACK (with b = M^(1/2)), [Gustavson 97], [Ahmed, Pingali 00], [BDHS09] / (same)
• LU with pivoting: LAPACK (sometimes), [Toledo 97], [GDX08] ([GDX08] is not partial pivoting) / [Toledo 97], [GDX08]?
• QR, rank-revealing: LAPACK (sometimes), [Elmroth, Gustavson 98], [DGHL08], [Frens, Wise 03] (but 3x flops?) / [Elmroth, Gustavson 98], [DGHL08]?, [Frens, Wise 03], [DGHL08]?
• Eig, SVD: not LAPACK; [BDD11] (randomized, but more flops), [BDK11] / [BDD11, BDK11?]
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor exceeding bound:  #words_moved     #messages
Matrix Multiply  [Cannon, 69]                            1                1
Cholesky         ScaLAPACK                               log P            log P
LU               ScaLAPACK                               log P            n log P / P^(1/2)
QR               ScaLAPACK                               log P            n log P / P^(1/2)
Sym Eig, SVD     ScaLAPACK                               log P            n / P^(1/2)
Nonsym Eig       ScaLAPACK                               P^(1/2) log P    n log P
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors (better)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor exceeding bound:  #words_moved  #messages
Matrix Multiply  [Cannon, 69]                            1             1
Cholesky         ScaLAPACK                               log P         log P
LU               [GDX10]                                 log P         log P
QR               [DGHL08]                                log P         log^3 P
Sym Eig, SVD     [BDD11]                                 log P         log^3 P
Nonsym Eig       [BDD11]                                 log P         log^3 P
Can we do even better?
• Assume n x n matrices on P processors
• Why just one copy of the data, i.e. M = O(n^2 / P) per processor?
• Increase M to reduce the lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor exceeding bound:  #words_moved  #messages
Matrix Multiply  [Cannon, 69]                            1             1
Cholesky         ScaLAPACK                               log P         log P
LU               [GDX10]                                 log P         log P
QR               [DGHL08]                                log P         log^3 P
Sym Eig, SVD     [BDD11]                                 log P         log^3 P
Nonsym Eig       [BDD11]                                 log P         log^3 P
Recall Sequential One-sided Factorizations (LU, QR)
• Classical approach:
    for i = 1 to n
      update column i
      update trailing matrix
  #words_moved = O(n^3)
• Blocked approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  #words_moved = O(n^3 / M^(1/3))
• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words_moved = O(n^3 / M^(1/2))
• None of these approaches minimizes #messages or addresses parallelism
• Need another idea
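The recursive approach can be sketched in a few lines of numpy (our illustration, not the slides' code). Pivoting and the blocked data layout are omitted, so this shows the recursion structure only; `recursive_lu` is a made-up name, and the test matrix is made diagonally dominant so that no pivoting is needed:

```python
import numpy as np

def recursive_lu(A):
    """Recursive LU without pivoting: factor the left half, update the
    right half, then factor the (Schur complement of the) right half."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return np.array([[1.0]]), A.copy()
    k = n // 2
    L11, U11 = recursive_lu(A[:k, :k])
    U12 = np.linalg.solve(L11, A[:k, k:])          # U12 = L11^{-1} A12
    L21 = np.linalg.solve(U11.T, A[k:, :k].T).T    # L21 = A21 U11^{-1}
    S = A[k:, k:] - L21 @ U12                      # Schur complement
    L22, U22 = recursive_lu(S)
    L = np.block([[L11, np.zeros((k, n - k))], [L21, L22]])
    U = np.block([[U11, U12], [np.zeros((n - k, k)), U22]])
    return L, U

rng = np.random.default_rng(0)
n = 8
A = rng.random((n, n)) + n * np.eye(n)   # diagonally dominant: safe without pivoting
L, U = recursive_lu(A)
print(np.allclose(L @ U, A))   # True
```

A real communication-avoiding implementation would additionally recurse down to a blocked base case and store the matrix in a block layout.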
Minimizing latency in the sequential case requires new data structures
• To minimize latency, we need to load/store a whole rectangular subblock of the matrix with one "message"
  • Incompatible with conventional columnwise (rowwise) storage
  • Ex: rows (columns) are not in contiguous memory locations
• Blocked storage: store as a matrix of b x b blocks, each block stored contiguously
  • OK for one level of the memory hierarchy, but what if there are more?
• Recursive blocked storage: store each block using subblocks
  • Also known as "space-filling curves" or "Morton ordering"
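For the block indices, Morton (Z-order) storage amounts to interleaving the bits of the block row and column indices. A small sketch (ours, not from the slides; `morton_index` is a made-up name):

```python
# Morton (Z-order) index of block (i, j): interleave the bits of i and j,
# j supplying the even bit positions and i the odd ones. Blocks in the same
# quadrant (at every recursion level) then get contiguous indices.

def morton_index(i, j, bits=16):
    z = 0
    for b in range(bits):
        z |= ((j >> b) & 1) << (2 * b)
        z |= ((i >> b) & 1) << (2 * b + 1)
    return z

# The 4x4 grid of block indices in Morton order:
for i in range(4):
    print([morton_index(i, j) for j in range(4)])
# row 0 is [0, 1, 4, 5]: each 2x2 quadrant is stored contiguously
```

This is exactly the recursive-blocked property: the top-left quadrant occupies indices 0..3, the top-right 4..7, and so on, recursively.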
Different Data Layouts for Parallel GE
1) 1D column blocked layout: bad load balance (P0 idle after the first n/4 steps)
2) 1D column cyclic layout: load balanced, but can't easily use BLAS2 or BLAS3
3) 1D column block cyclic layout: can trade off load balance and BLAS2/3 performance by choosing b, but factorization of a block column is a bottleneck
4) Block skewed layout: complicated addressing; may not want full parallelism in each column and row
5) 2D row and column blocked layout: bad load balance (P0 idle after the first n/2 steps)
6) 2D row and column block cyclic layout: the winner!
Distributed GE with a 2D Block Cyclic Layout
(from CS267 Lecture 9, 02/14/2006)
(figure: matrix multiply of green = green - blue * pink)
Where is the communication bottleneck?
• In both the sequential and parallel case: panel factorization
• Can we retain partial pivoting? No
• "Proof" for the parallel case:
  • Each choice of pivot requires a reduction (to find the maximum entry) per column
  • #messages = Ω(n) or Ω(n log p), versus the goal of O(p^(1/2))
• Solution for QR first, then show LU
TSQR: QR of a Tall, Skinny matrix

W = [ W0; W1; W2; W3 ] = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
                       = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

[ R00; R10 ] = Q01·R01 and [ R20; R30 ] = Q11·R11, so
[ R00; R10; R20; R30 ] = diag(Q01, Q11) · [ R01; R11 ]

[ R01; R11 ] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
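A compact numpy sketch of the R-factor computation (ours, not the slides' code): here the reduction is done in one flat step rather than the binary tree drawn above, and `tsqr_r` is a made-up name. The final R agrees with the R of a direct QR of W up to the signs of its rows:

```python
import numpy as np

def tsqr_r(W, nblocks=4):
    """TSQR, flat one-level reduction: QR each row block independently
    (parallelizable), stack the small R factors, and QR the stack."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(B)[1] for B in blocks]   # local QRs
    return np.linalg.qr(np.vstack(Rs))[1]       # reduce the stacked R's

rng = np.random.default_rng(1)
W = rng.random((64, 5))                         # tall and skinny
R_tsqr = tsqr_r(W)
R_direct = np.linalg.qr(W)[1]
print(np.allclose(np.abs(R_tsqr), np.abs(R_direct)))   # True (equal up to row signs)
```

The binary-tree version pairs up the local R factors level by level instead of stacking them all at once; the flat version above corresponds to a one-level reduction.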
TSQR: An Architecture-Dependent Algorithm

• Parallel (binary tree): QR each of W0..W3 locally to get R00, R10, R20, R30; combine pairs to get R01, R11; combine those to get R02
• Sequential (flat tree): QR W0 to get R00; combine with W1 to get R01; then with W2 to get R02; then with W3 to get R03
• Dual core: a hybrid of the binary and flat trees
• Can choose the reduction tree dynamically
• Multicore / multisocket / multirack / multisite / out-of-core: ?
Back to LU: using a similar idea for TSLU as for TSQR: use a reduction tree to do "tournament pivoting"

W (n x b) = [ W1; W2; W3; W4 ], with Wi = Pi·Li·Ui
  Choose b pivot rows of each Wi, call them Wi'

[ W1'; W2' ] = P12·L12·U12,  choose b pivot rows, call them W12'
[ W3'; W4' ] = P34·L34·U34,  choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234·L1234·U1234,  choose b pivot rows

Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting)
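One round of tournament pivoting can be sketched in plain numpy (our illustration; `tournament_pivots` and `gepp_pivot_rows` are made-up helper names). GEPP on each block nominates b candidate rows; winners of pairs play again until b rows remain, with indices tracked back to the original W:

```python
import numpy as np

def gepp_pivot_rows(block, b):
    """Return the b rows that GEPP would bring to the top of `block`."""
    A = np.array(block, dtype=float)
    m = A.shape[0]
    perm = list(range(m))
    for k in range(b):
        p = k + int(np.argmax(np.abs(A[k:, k])))   # pick the largest pivot
        A[[k, p]] = A[[p, k]]
        perm[k], perm[p] = perm[p], perm[k]
        A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])
    return np.array(perm[:b])

def tournament_pivots(W, b, nblocks=4):
    groups = np.array_split(np.arange(W.shape[0]), nblocks)
    cands = [g[gepp_pivot_rows(W[g], b)] for g in groups]   # leaf round
    while len(cands) > 1:                                   # binary reduction tree
        cands = [np.concatenate([a, c])[gepp_pivot_rows(W[np.concatenate([a, c])], b)]
                 for a, c in zip(cands[::2], cands[1::2])]
    return cands[0]

rng = np.random.default_rng(2)
W = rng.random((32, 3))
rows = tournament_pivots(W, b=3)
print(sorted(rows.tolist()))        # 3 distinct row indices of W
```

The returned rows are then moved to the top of W, after which LU proceeds without further pivoting.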
Minimizing Communication in TSLU

• Parallel (binary tree): LU each of W1..W4 locally; combine the chosen rows pairwise with further LUs; one final LU at the root
• Sequential (flat tree): LU on W1; combine with W2, then W3, then W4, one LU at a time
• Dual core: a hybrid of the binary and flat trees
• Can choose the reduction tree dynamically, as before
• Multicore / multisocket / multirack / multisite / out-of-core: ?
Making TSLU Numerically Stable
• Stability goal: make ||A - PLU|| very small: O(machine_epsilon · ||A||)
• Details matter
  • Going up the tree, we could do LU either on the original rows of W (tournament pivoting) or on the computed rows of U
  • Only tournament pivoting is stable
• Thm: the new scheme is as stable as partial pivoting (PP) in the following sense: it gives the same results as PP applied to a different input matrix whose entries are blocks taken from the input A
• CALU: Communication-Avoiding LU for general A
  – Use TSLU for the panel factorizations; apply the updates to the rest of the matrix
  – Cost: redundant panel factorizations
    – Extra flops for one reduction step: n/b panels, O(n·b^2) flops per panel
    – For the sequential algorithm this is O(n^2·b); with b = M^(1/2), the extra flops are O(n^2·M^(1/2)), which is < n^3 since M < n^2
    – For the parallel 2D algorithms there are log P reduction steps, n/b panels, O(n·b^2 / P^(1/2)) flops per panel, and b = n/(log^2(P)·P^(1/2)), so the extra flops are (n/b)·(n·b^2)·log P / P^(1/2) = n^2·b·log P / P^(1/2) = n^3/(P·log P) < n^3/P
• Benefit:
  – Stable in practice, but not the same pivot choice as GEPP
  – One reduction operation per panel: reduces latency to the minimum
Sketch of TSLU Stability Proof

• Consider A = [ A11 A12; A21 A22; A31 A32 ], with TSLU of the first columns done on [ A11; A21; A31 ] (assume wlog that A21 already contains the desired pivot rows)
• The resulting LU decomposition (first n/2 steps) is, after a permutation mixing the first two block rows,

  [ A'11 A'12; A'21 A'22; A31 A32 ] = [ L'11 0 0; L'21 I 0; L'31 0 I ] · [ U'11 U'12; 0 As22; 0 As32 ]

• Claim: As32 can be obtained by GEPP applied to a larger matrix G formed from blocks of A, and GEPP applied to G pivots no rows
• What needs to be shown is that the trailing block GEPP produces, L31·L21^-1·A21·U'11^-1·U'12 + As32, equals A32. Writing the tree's leaf factorization as A21 = L21·U21 and A31 = L31·U21:

  L31·L21^-1·A21·U'11^-1·U'12 + As32
    = L31·U21·U'11^-1·U'12 + As32      (since A21 = L21·U21)
    = A31·U'11^-1·U'12 + As32          (since A31 = L31·U21)
    = L'31·U'12 + As32                 (since A31 = L'31·U'11)
    = A32                              (from the factorization above)
Stability of LU using TSLU: CALU (1/2)
• Worst-case analysis
• Pivot growth factor from one panel factorization with b columns, for TSLU:
  – Proven: 2^(bH), where H = 1 + the height of the reduction tree
  – Attained: 2^b, same as GEPP
• Pivot growth factor on an m x n matrix, for TSLU:
  – Proven: 2^(nH-1)
  – Attained: 2^(n-1), same as GEPP
• |L| can be as large as 2^((b-2)H-b+1), but CALU is still stable
• There are examples where GEPP is exponentially less stable than CALU, and vice versa, so neither is always better than the other
• Discovered sparse matrices with O(2^n) pivot growth
Stability of LU using TSLU: CALU (2/2)
• Empirical testing
• Both random matrices and "special ones"
• Both binary tree (BCALU) and flat tree (FCALU)
• 3 metrics: ||PA-LU||/||A||, normwise and componentwise backward errors
• See [D., Grigori, Xiang, 2010] for details
Performance vs ScaLAPACK
• TSLU
  – IBM Power 5: up to 4.37x faster (16 procs, 1M x 150)
  – Cray XT4: up to 5.52x faster (8 procs, 1M x 150)
• CALU
  – IBM Power 5: up to 2.29x faster (64 procs, 1000 x 1000)
  – Cray XT4: up to 1.81x faster (64 procs, 1000 x 1000)
• See INRIA Tech Report 6523 (2008); paper at SC08
Exascale predicted speedups for Gaussian Elimination: CA-LU vs ScaLAPACK-LU
(figure: predicted speedup as a function of log2(P) and log2(n^2/P) = log2(memory_per_proc))
Case Study III: Cholesky Decomposition
LU and Cholesky decompositions
• LU decomposition of A: A = P·L·U, where U is upper triangular, L is lower triangular, and P is a permutation matrix
• Cholesky decomposition of A: if A is a real symmetric positive definite matrix, then there exists a real lower triangular L such that A = L·L^T (L is unique if we restrict its diagonal elements to be positive)
• Efficient parallel and sequential algorithms?
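A quick numpy illustration of the definition (ours, not from the slides): numpy's Cholesky returns exactly the unique lower-triangular L with positive diagonal such that A = L·L^T.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.random((5, 5))
A = B @ B.T + 5 * np.eye(5)          # symmetric positive definite by construction
L = np.linalg.cholesky(A)
print(np.allclose(L @ L.T, A))       # True
print(np.all(np.diag(L) > 0))        # True: the normalization making L unique
```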
Application of the generalized form to Cholesky decomposition
(1) Generalized form: for (i,j) in S,
    C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), ..., for k1, k2, ... in S_ij, other arguments )

For Cholesky:
    L(i,i) = ( A(i,i) - Σ_{k=1}^{i-1} L(i,k)^2 )^(1/2)
    L(i,j) = ( A(i,j) - Σ_{k=1}^{j-1} L(i,k)·L(j,k) ) / L(j,j)   for i > j
By the Geometric Embedding theorem [Ballard, Demmel, Holtz, S. 2011a]
(follows [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]):
the communication cost (bandwidth) of any classical algorithm for Cholesky decomposition (for dense matrices) is

  Ω( n^3 / M^(1/2) )  sequentially, and  Ω( n^3 / (P·M^(1/2)) )  in parallel.
Some Sequential Classical Cholesky Algorithms

Per algorithm: bandwidth / latency (for column-major DS; contiguous-blocks DS)
• Lower bound: Ω(n^3/M^(1/2)) / Ω(n^3/M^(3/2))
• Naive: Θ(n^3) / Θ(n^3/M)
• LAPACK (with the right block size!): O(n^3/M^(1/2)) for both DSs / O(n^3/M) column-major; O(n^3/M^(3/2)) contiguous blocks
• Rectangular Recursive [Toledo 97] (cache-oblivious): O(n^3/M^(1/2) + n^2 log n) for both DSs / Ω(n^3/M) column-major; Ω(n^2) contiguous blocks
• Square Recursive [Ahmed, Pingali 00] (cache-oblivious): O(n^3/M^(1/2)) for both DSs / O(n^3/M) column-major; O(n^3/M^(3/2)) contiguous blocks
Some Parallel Classical Cholesky Algorithms

• Lower bound (general 2D layout, M = O(n^2/P)):
  Bandwidth: Ω(n^3/(P·M^(1/2))) = Ω(n^2/P^(1/2))
  Latency:   Ω(n^3/(P·M^(3/2))) = Ω(P^(1/2))
• ScaLAPACK (general; choosing b = n/P^(1/2)):
  Bandwidth: O((n^2/P^(1/2) + nb) log P) = O((n^2/P^(1/2)) log P)
  Latency:   O((n/b) log P) = O(P^(1/2) log P)
Upper bounds – Supporting Data Structures
(figure) Top (column-major): Full, Old Packed, Rectangular Full Packed. Bottom (block-contiguous): Blocked, Recursive Format, Recursive Full Packed [Frigo, Leiserson, Prokop, Ramachandran 99; Ahmed, Pingali 00; Elmroth, Gustavson, Jonsson, Kagstrom 04].
Upper bounds – Supporting Data Structures

Rules of thumb:
• Don't mix column/row-major with blocked: decide which one fits your algorithm
• Converting between the two once in your algorithm is OK for dense instances, e.g., changing the DS at the beginning of your algorithm
The Naïve Algorithms

Naïve left-looking and right-looking:
    L(i,i) = ( A(i,i) - Σ_{k=1}^{i-1} L(i,k)^2 )^(1/2)
    L(i,j) = ( A(i,j) - Σ_{k=1}^{j-1} L(i,k)·L(j,k) ) / L(j,j)   for i > j

Bandwidth: Θ(n^3)
Latency: Θ(n^3/M)
Assuming a column-major DS.
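The two formulas above transcribe directly into code (our sketch; `naive_cholesky` is a made-up name). This is the naive algorithm whose bandwidth is Θ(n^3), since each L(i,j) re-reads a whole row prefix:

```python
import numpy as np

def naive_cholesky(A):
    """Left-looking Cholesky, a direct transcription of the formulas:
    L(j,j) from the diagonal formula, then each L(i,j) below it."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        L[j, j] = np.sqrt(A[j, j] - np.dot(L[j, :j], L[j, :j]))
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - np.dot(L[i, :j], L[j, :j])) / L[j, j]
    return L

rng = np.random.default_rng(4)
B = rng.random((6, 6))
A = B @ B.T + 6 * np.eye(6)          # symmetric positive definite
print(np.allclose(naive_cholesky(A), np.linalg.cholesky(A)))   # True
```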
Rectangular Recursive [Toledo 97]

(figure: an m x n matrix split into column panels of width n/2, [A11 A12; A21 A22; A31 A32], factored to [L11 0; L21 L22; L31 L32])

• Recurse on the left rectangular half
• Update: [A22; A32] := [A22; A32] - [L21; L31]·L21^T
• Recurse on the right rectangular half; L12 := 0
• Base case (a single column): L := A/(A11)^(1/2)
Rectangular Recursive [Toledo 97]

• Toledo's algorithm guarantees stability for LU
• BW(m,n) = 2m if n = 1; otherwise BW(m, n/2) + BW_mm(m - n/2, n/2, n/2) + BW(m - n/2, n/2)
• Bandwidth: O(n^3/M^(1/2) + n^2 log n)
• Optimal bandwidth (for almost all M)
Rectangular Recursive [Toledo 97]

Is it latency optimal?
• If we use a column-major DS, then L = Θ(n^3/M), a factor of M^(1/2) away from the optimum Θ(n^3/M^(3/2)). This is due to the matrix multiply.
• Open problem: can we prove a latency lower bound for all classical matrix-multiply algorithms that use a column-major DS?
• Note: we don't need such a lower bound here, as the matrix-multiply algorithm is known.
Rectangular Recursive [Toledo 97]

Is it latency optimal?
• If we use a contiguous-blocks DS, then L = Θ(n^2), due to the forward column updates
• The optimum is Θ(n^3/M^(3/2)); the extra n^2 term dominates whenever n ≤ M^(3/2), and in that range the algorithm is not latency optimal
Square Recursive algorithm [Ahmed, Pingali 00], first analyzed in [Ballard, Demmel, Holtz, S. 09]

(figure: an n x n matrix split into quadrants A11, A12, A21, A22 of size n/2)

• Recurse on A11 (giving L11)
• L21 = RTRSM(A21, L11^T)    (RTRSM: recursive triangular solver)
• A22 = A22 - L21·L21^T
• Recurse on A22; A12 = 0
Square Recursive algorithm [Ahmed, Pingali 00], first analyzed in [Ballard, Demmel, Holtz, S. 09]

BW = 3n^2 if 3n^2 ≤ M; otherwise 2·BW(n/2) + O(n^3/M^(1/2) + n^2)
Square Recursive algorithm [Ahmed, Pingali 00], first analyzed in [Ballard, Demmel, Holtz, S. 09]

Uses:
• recursive matrix multiplication [Frigo, Leiserson, Prokop, Ramachandran 99]
• recursive RTRSM

Bandwidth: O(n^3/M^(1/2))
Latency: O(n^3/M^(3/2))
Assuming a block-recursive DS.
• Optimal bandwidth
• Optimal latency
• Cache-oblivious
• Not stable for LU
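The recursion above can be sketched in plain numpy (our illustration; `recursive_cholesky` is a made-up name). For brevity the RTRSM step is done with a dense triangular solve and the matrix is stored in the ordinary layout, so this shows the recursion only, not the communication savings, which additionally need block-recursive storage and recursive matmul/RTRSM:

```python
import numpy as np

def recursive_cholesky(A, base=2):
    """Recurse on A11; triangular-solve for L21 (the RTRSM step, done here
    with a dense solve); update A22 symmetrically; recurse on the result."""
    n = A.shape[0]
    if n <= base:
        return np.linalg.cholesky(A)
    k = n // 2
    L11 = recursive_cholesky(A[:k, :k], base)
    L21 = np.linalg.solve(L11, A[k:, :k].T).T     # L21 = A21 * L11^{-T}
    S = A[k:, k:] - L21 @ L21.T                   # A22 := A22 - L21 L21^T
    L22 = recursive_cholesky(S, base)
    return np.block([[L11, np.zeros((k, n - k))], [L21, L22]])

rng = np.random.default_rng(5)
B = rng.random((8, 8))
A = B @ B.T + 8 * np.eye(8)                       # symmetric positive definite
print(np.allclose(recursive_cholesky(A), np.linalg.cholesky(A)))   # True
```

Since the base case returns factors with positive diagonals, uniqueness of the Cholesky factor makes the result agree exactly (up to roundoff) with the library routine.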
Parallel Algorithm

(figure: the matrix distributed 2D block-cyclically over a Pr x Pc processor grid)

1. Find the Cholesky factor locally on a diagonal block; broadcast downwards
2. Parallel triangular solve; broadcast to the right
3. Parallel symmetric rank-b update
Parallel Algorithm

• Lower bound (general 2D layout, M = O(n^2/P)):
  Bandwidth: Ω(n^3/(P·M^(1/2))) = Ω(n^2/P^(1/2))
  Latency:   Ω(n^3/(P·M^(3/2))) = Ω(P^(1/2))
• ScaLAPACK (general; choosing b = n/P^(1/2)):
  Bandwidth: O((n^2/P^(1/2) + nb) log P) = O((n^2/P^(1/2)) log P), since the nb term is low-order as b ≤ n/P^(1/2)
  Latency:   O((n/b) log P) = O(P^(1/2) log P)
• Optimal (up to a log P factor) only for 2D. A new 2.5D algorithm extends to any size M.
Summary of algorithms for dense Cholesky decomposition
• Tight lower bound for Cholesky decomposition
• Sequential: bandwidth Θ(n^3/M^(1/2)), latency Θ(n^3/M^(3/2))
• Parallel: bandwidth and latency lower bounds are tight up to a Θ(log P) factor
Sparse Algorithms: Cholesky Decomposition
Based on [L. Grigori, P.-Y. David, J. Demmel, S. Peyronnet, '00]
Lower bounds for sparse direct methods
• The geometric embedding bound for sparse 3-nested-loops-like algorithms may be loose:
  BW = Ω(#flops / M^(1/2)),  L = Ω(#flops / M^(3/2))
  It may be smaller than the NNZ (number of non-zeros) in the input/output.
• For example, computing the product of two (block) diagonal matrices may involve no communication in parallel
Lower bounds for sparse matrix multiplication
• Compute C = A·B, with A, B, C of size n x n, each having dn nonzeros uniformly distributed; #flops is at most d^2·n
• Lower bound on sequential communication: BW = Ω(dn), L = Ω(dn/M)
• Two types of algorithms, depending on how their communication pattern is designed:
  – Taking into account the nonzero structure of the matrices (example: Buluc, Gilbert): BW = L = Ω(d^2·n)
  – Oblivious to the nonzero structure of the matrices (examples: sparse Cannon; Buluc, Gilbert for the parallel case): BW = Ω(dn/p^(1/2)), #messages = Ω(p^(1/2))
• [L. Grigori, P.-Y. David, J. Demmel, S. Peyronnet]
Lower bounds for sparse Cholesky factorization
• Consider A of size k^s x k^s resulting from a finite-difference operator on a regular grid of dimension s ≥ 2 with k^s nodes
• Its Cholesky factor L contains a dense lower triangular matrix of size k^(s-1) x k^(s-1), so
  BW = Ω(W / M^(1/2)),  L = Ω(W / M^(3/2)),
  where W = k^(3(s-1))/2 for a sequential algorithm and W = k^(3(s-1))/(2p) for a parallel algorithm
• This result applies more generally to a matrix A whose graph G = (V,E) has the following property for some s:
  • every set of vertices W ⊂ V with n/3 ≤ |W| ≤ 2n/3 is adjacent to at least s vertices in V - W
  • then the Cholesky factor of A contains a dense s x s submatrix
Optimal algorithms for sparse Cholesky factorization
• PSPASES with an optimal layout attains the lower bound in parallel for 2D/3D regular grids:
  • Uses nested dissection to reorder the matrix
  • Distributes the matrix using the subtree-to-subcube algorithm
  • The factorization of every dense multifrontal matrix is performed using an optimal parallel dense Cholesky factorization
• The sequential multifrontal algorithm attains the lower bound:
  • The factorization of every dense multifrontal matrix is performed using an optimal dense Cholesky factorization
Open Problems / Work in Progress (1/2)
• Rank-Revealing CAQR (RRQR)
  • AP = QR, with P a permutation chosen to put the "most independent" columns first
  • Does LAPACK geqp3 move O(n^3) words, or fewer?
  • Better: Tournament Pivoting (TP) on blocks of columns
• CALU with more reliable pivoting (Gu, Grigori, Khabou et al)
  • Replace tournament pivoting of the panel (TSLU) with rank-revealing QR of the panel
• LU with complete pivoting (GECP)
  • A = Pr·L·U·Pc^T, where Pr and Pc are permutations putting the "most independent" rows and columns first
  • Idea: TP on column blocks to choose columns, then TSLU on those columns to choose rows
• LU of A = A^T, with symmetric pivoting
  • A = P·L·L^T·P^T, with P a permutation and L lower triangular
  • Bunch-Kaufman, rook pivoting: O(n^3) words moved?
  • CA approach: choose a block with one step of CA-GECP, permute rows and columns to the top left, pick a symmetric subset with local Bunch-Kaufman
• A = P·L·B·L^T·P^T, with symmetric pivoting (Toledo et al)
  • L triangular, P a permutation, B banded
  • B tridiagonal: due to Aasen
  • CA approach: B block tridiagonal, stability still in question
Open Problems / Work in Progress (2/2)
• Heterogeneity (Ballard, Gearhart)
  • How to assign work to attain the lower bounds when flop rates / bandwidths / latencies vary?
• Using extra memory in the distributed-memory case (Solomonik)
  • Using all available memory to attain the improved lower bounds
  • Just matmul and CALU so far
• Using extra memory in the shared-memory case
• Sparse matrices ...
• Combining sparse and dense algorithms for better communication performance, e.g.:
  • The above [Grigori, David, Peleg, Peyronnet '09]
  • Our retuned version of [Yuster, Zwick '04]
• Eliminate the log P factor?
• Prove latency lower bounds for specific DS use, e.g., can we show that any classical matrix-multiply algorithm that uses a column-major DS must be a factor of M^(1/2) away from optimality?