Reliable and Efficient Algorithms for Spectrum-Revealing Low-Rank Data Analysis
by
David Gaylord Anderson
A dissertation submitted in partial satisfaction of the
requirements for the degree of
Doctor of Philosophy
in
Applied Mathematics
in the
Graduate Division
of the
University of California, Berkeley
Committee in charge:
Professor Ming Gu, Co-chair
Professor Per-Olof Persson, Co-chair
Professor Benjamin Recht
Fall 2016
Reliable and Efficient Algorithms for Spectrum-Revealing Low-Rank Data Analysis
Copyright 2016 by
David Gaylord Anderson
Abstract
Reliable and Efficient Algorithms for Spectrum-Revealing Low-Rank Data Analysis
by
David Gaylord Anderson
Doctor of Philosophy in Applied Mathematics
University of California, Berkeley
Professor Ming Gu, Co-chair
Professor Per-Olof Persson, Co-chair
As the amount of data collected in our world increases, reliable compression algorithms are needed when datasets become too large for practical analysis, when significant noise is present in the data, or when the strongest signals in the data are needed. In this work, two data compression algorithms are presented. The main result is a low-rank approximation algorithm (a type of compression algorithm) that uses modern techniques in randomization to repurpose a classic algorithm in the field of linear algebra called the LU decomposition to perform data compression. The resulting algorithm is called Spectrum-Revealing LU (SRLU).
Both rigorous theory and numeric experiments demonstrate the effectiveness of SRLU. The theoretical work presented also develops a framework with which other low-rank approximation algorithms can be analyzed. As the name implies, Spectrum-Revealing LU seeks to capture the entire spectrum of the data (i.e. to capture all signals present in the data).
A second compression algorithm is also introduced, which seeks to compress graphs. Called a sparsification algorithm, this algorithm can accept a weighted or unweighted graph and produce an approximation without changing the weights (or introducing weights in the case of an unweighted graph). Theoretical results provide a bound on the quality of the results, and a numeric example is also explored.
To my parents
with all my love.
Contents
List of Figures
List of Tables

I  Introduction to Low-Rank Approximation and Linear Algebra

1  Low-Rank Approximation
   1.1  Introduction
   1.2  Data Compression
   1.3  Reliable and Efficient Algorithms for Spectrum-Revealing Low-Rank Data Analysis
   1.4  Future Work

2  Linear Algebra Preliminaries
   2.1  Definitions
   2.2  The Singular Value Decomposition (SVD)
   2.3  Approximation Optimality
   2.4  The LU Decomposition
   2.5  The QR Decomposition
   2.6  Interlacing Property of Singular Values
   2.7  Numerical Linear Algebra
   2.8  Communication-Avoiding Algorithms
   2.9  A Note About Matrix-Matrix Multiplication

II  The Spectrum-Revealing LU Decomposition

3  Background on Low-Rank Approximation and the LU Decomposition
   3.1  Introduction
   3.2  Problem Statement
   3.3  Previous Work on Deterministic Low-Rank Approximation Algorithms
   3.4  Previous Work on Randomized Low-Rank Approximation Algorithms
   3.5  Problems Related to Low-Rank Approximation
   3.6  Previous Work on the LU Decomposition

4  Spectrum-Revealing LU
   4.1  Main Contribution: Spectrum-Revealing LU (SRLU)
   4.2  Spectrum-Revealing Pivoting (SRP)
   4.3  LU Updating
   4.4  Choice of Block Size
   4.5  Variations of SRLU

5  The Theory of Spectrum-Revealing Algorithms
   5.1  Theoretical Results for SRLU Factorizations
   5.2  Comparison of SRLU Factorizations with RRLU and RRQR Factorizations
   5.3  Fast SRLU

6  Numerical Experiments with SRLU
   6.1  Speed and Accuracy Tests
   6.2  Efficiency Tests
   6.3  Towards Feature Selection
   6.4  Sparsity Preservation Tests
   6.5  Online Data Processing
   6.6  Pathological Test Matrix
   6.7  Image Compression Example
   6.8  Testing Quality Controls

III  Unweighted Graph Sparsification

7  Unweighted Column Selection
   7.1  Introduction
   7.2  Background
   7.3  An Unweighted Column Selection Algorithm
   7.4  Correctness and Performance of the UCS Algorithm
   7.5  Performance Comparison of UCS and Other Algorithms
   7.6  A Numeric Example: Graph Visualization
   7.7  Relationship to the Kadison-Singer Problem
   7.8  Additional Thoughts

8  Additional Results on Unweighted Graph Sparsification
   8.1  A Running Bound
   8.2  Faster Subset Selection for Matrices and Applications

A  An Algorithm for Sparse PCA
   A.1  Problem Formulation
   A.2  Setup for a Single Column
   A.3  Solving for v1
   A.4  More on λ and the Objective Function
   A.5  Bounds on λ
   A.6  Putting It All Together
   A.7  Error Bound

B  An Efficient Implementation of the Generalized Minimum Residual Method for Stiff PDE Problems
   B.1  Preconditioned GMRES
   B.2  A New Optimization

C  Strassen’s Algorithm

D  A Visualization of SRLU

Bibliography
List of Figures
1.1  A first example of low-rank approximation.
3.1  Visualizations of different LU factorizations.
4.1  Benchmarking TRLUCP with various block sizes on random matrices of different sizes and truncation ranks.
6.1  Accuracy Experiment on random 1000x1000 matrices with different rates of spectral decay.
6.2  Time Experiment on random matrices of varying sizes, and a time experiment on a 1000x1000 matrix with varying truncation ranks.
6.3  Efficiency experiment on random matrices of varying sizes compared to peak hardware performance.
6.4  Image processing example. The original image [81], a rank-50 approximation with SRLU, and a highlight of the rows and columns selected by SRLU.
6.5  Circuit Simulation Data.
6.6  Sparse Data Processing Example with Circuit Simulation Data.
6.7  The cumulative uses of the top five most commonly used words in the Enron email corpus after reordering.
6.8  Singular values of SRLU factorizations of various ranks (red) versus the singular values of the Devil’s Stairs matrix (blue).
6.9  Image compression experiment with various factorizations. From left to right: James Wilkinson, Wallace Givens, George Forsythe, Alston Householder, Peter Henrici, and Friedrich Bauer. (Gatlinburg, Tennessee, 1964.)
7.1  Autonomous System Example: Original Graph.
7.2  Autonomous Systems Graph with Sparsifiers of Various Cardinalities (node coordinates calculated from whole graph).
7.3  Autonomous Systems Graph with Sparsifiers of Various Cardinalities (node coordinates recalculated for each sparsifier).
7.4  Progress During Iteration and Theoretical Singular Value Lower Bound for Sparsifiers of Various Cardinalities.
List of Tables
3.1  Efficiency comparison of LU orderings.
3.2  Definition of parameters (smallest to largest).
5.1  Bounds for Growth Factors of LU Variants [53].
6.1  Mean values of the constants from the theorems presented in this work, for various random matrices. Constants for spectral theorems are averaged over the top 10 singular values. TRLUCP was used, and no swaps were needed, so SRLU results match TRLUCP.
6.2  Average number of swaps needed on a random 1000-by-1000 matrix for various small values of f.
Acknowledgments
A great number of people helped to make this work possible, as well as to aid in my broader mathematical studies. I am, in particular, most thankful for the guidance of Ming and Per over the years. With their advising, support, and friendship my PhD studies were as rewarding as I had initially hoped they would be, and I am very proud of my accomplishments and (some of) my failures during that time.
Part I
Introduction to Low-Rank Approximation and Linear Algebra
Chapter 1
Low-Rank Approximation
1.1 Introduction
As the amount of data collected grows, data compression has become ubiquitous in our lives. Many datasets are too big to work with directly, and data collection may be growing far faster than processing power. Compression and other algorithms are needed to understand all of the information in our world.
There are no perfect compression algorithms for real-world data. Digital photos, for example, are often stored using the JPEG compression algorithm [86]. Improvements are consistently introduced, as well as new algorithms, such as JPEG2000 [100]. Many other algorithms are widely used for photo compression, each with its own advantages and disadvantages. Data compression saves disk space, improves data transfer times, renders the data easier to analyze, and aids in removing noise from the data. Additional uses of compression include cryptography [71] and energy conservation [10]. Data approximation has even appeared in popular culture recently as the innovative technology behind a fictional startup in the sitcom Silicon Valley [90].
The effectiveness of a compression algorithm depends on the type of data used. Photos are an example of structured data: there are no missing data points in general, and each data point is the same type of measurement as each other data point. The effectiveness of compression also depends on the goal: lossless algorithms seek to preserve the original data in its entirety, while lossy algorithms, at the expense of some data loss, may save considerably more space than lossless algorithms and may be effective in more applications. This work is about lossy compression algorithms for structured data.
1.2 Data Compression
Figure 1.1, a reproduction of a case study from [72], shows data drawn from two distributions. A rank-2 approximation finds two vectors that approximate the entire dataset well; almost all of the data is nearly collinear with one of these two vectors.

Figure 1.1: A first example of low-rank approximation.

Although not all of the data is captured in a rank-2 approximation, such a compression saves a vast amount of space at little expense. In essence, two data points can accurately describe the 20,000 data points in Figure 1.1. While few datasets are so obviously well-approximated by a low-rank replication, many datasets are so large compared to the underlying forces determining the data that an accurate low-rank approximation exists. In [72], for example, a rank-2 approximation of gene expression data is shown to be an effective data preprocessing step for identifying three cancer types.
1.3 Reliable and Efficient Algorithms for Spectrum-Revealing Low-Rank Data Analysis
In this work, two algorithms are presented, as well as the theoretical tools needed to analyze these and comparable algorithms. The rest of this dissertation is organized as follows: the remainder of Part I covers background information and previous results necessary to continue the discussion of low-rank approximation algorithms. The most significant contribution of this work is an approximation algorithm called Spectrum-Revealing LU, which is presented and analyzed in Part II. An additional low-rank approximation is introduced in Part III. This algorithm, called Unweighted Graph Sparsification, is a low-rank approximation specifically for graphs, and the quality of the approximation is theoretically analyzed.
1.4 Future Work
One doctoral degree is surely not enough to answer all questions that arise from a research question in mathematics. In Part II, much work remains to develop the SRLU algorithm into state-of-the-art open-source software. Some code, such as fast versions of finding or approximately finding the largest element in the Schur complement, remains to be written. In Part III, one unanswered question is whether the performance parameter T could vary (or be a different constant) to improve the quality of the sparsifier more quickly. Additionally, an unexplored extension of the UCS algorithm is to create higher quality sparsifiers by eliminating nodes in addition to edges. In the Appendix, two research projects are presented with preliminary work. The first, an algorithm for sparse principal component analysis, could be extended by developing a block version. The second, an optimization for some PDE problems using GMRES, has a theoretical framework, but has not yet been numerically tested.
Chapter 2
Linear Algebra Preliminaries
Linear algebra is the area of mathematics that will provide most of the tools needed in this work, and it is essential in many other areas of the mathematical sciences. Linear algebra is not only the study of linear systems, but also the study of efficiently organizing numerical computation. For an introduction to linear algebra, see [47]. Some especially pertinent concepts are covered in this chapter.
2.1 Definitions
The range of a matrix A is defined as
\[ \mathrm{range}(A) = \{ y \in \mathbb{R}^m : y = Ax \text{ for some } x \in \mathbb{R}^n \}. \]
The rank of a matrix A is the dimension of the range of A:
rank (A) = dim (range (A)) .
A vector norm on Rn is a function f : Rn → R that satisfies
1. f(x) ≥ 0, x ∈ Rn,
2. f(x) = 0 iff x = 0,
3. f(x+ y) ≤ f(x) + f(y), x, y ∈ Rn,
4. f(αx) = |α|f(x), α ∈ R, x ∈ Rn.
The norm is denoted ‖x‖ = f(x). A matrix norm is a function f : R^{m×n} → R that satisfies similar properties to those in the definition of a vector norm:
1. f (A) ≥ 0,A ∈ Rm×n,
2. f (A) = 0 iff A = 0,
3. f (A + B) ≤ f (A) + f (B) ,A,B ∈ Rm×n,
4. f (αA) = |α|f (A) , α ∈ R,A ∈ Rm×n.
The p-norm of a vector x, denoted ‖x‖_p, is defined as
\[ \|x\|_p = \left( |x_1|^p + \cdots + |x_n|^p \right)^{1/p}. \]
The p-norm of a matrix A, denoted ‖A‖_p, is defined as
\[ \|A\|_p = \sup_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p}. \]
The Frobenius norm of a matrix A is defined as
\[ \|A\|_F = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2 \right)^{1/2}. \]
The last two definitions on their own do not automatically imply that they are matrix norms, but they can indeed be proven to be matrix norms. Several other important terms: the identity matrix I ∈ R^{m×m} is the matrix with I_{ii} = 1 and I_{ij} = 0 for i ≠ j. The transpose of a matrix A, denoted A^T, is defined by (A^T)_{ij} = A_{ji}. A matrix A ∈ R^{m×m} is orthogonal if A^T A = I. A sparse matrix is a matrix with many zeros (enough that they may be taken advantage of). Another important definition, the singular values of a matrix A, will be defined later. See [47] for additional useful definitions.
2.2 The Singular Value Decomposition (SVD)
Theorem 1. (Singular Value Decomposition [47]) For a matrix A ∈ R^{m×n}, there exist orthogonal matrices U ∈ R^{m×m} and V ∈ R^{n×n} and a diagonal matrix Σ with diag(Σ) = (σ_1, σ_2, ..., σ_p), where p = min(m,n) and σ_1 ≥ σ_2 ≥ ... ≥ σ_p ≥ 0, such that
\[ U \Sigma V^T = A. \]
A linear equation of the form Ax = b for b ∈ R^m can be solved by calculating
\[ x = V \Sigma^{\dagger} U^T b. \]
Also,
\[ \|A\|_2 = \sigma_1 \quad \text{and} \quad \|A\|_F = \sqrt{\sigma_1^2 + \cdots + \sigma_p^2}, \]
because ‖U^T A V‖ = ‖Σ‖ for both the 2-norm and the Frobenius norm. The matrix A can also be expressed as
\[ A = \sum_{i=1}^{\min(m,n)} \sigma_i u_i v_i^T, \]
where u_i and v_i denote columns of U and V respectively. The values σ_i are known as the singular values of A. The rank of a matrix A is equal to the number of nonzero singular values. If the rank of A is r (note we must have 0 ≤ r ≤ min(m,n)), then A can be expressed as
\[ A = \sum_{i=1}^{r} \sigma_i u_i v_i^T. \]
The set of singular values of a matrix A is unique (note there may be repeated values within this set).
2.3 Approximation Optimality
Eckart and Young
A standard benchmark for the quality of an approximation is due to Eckart and Young. Let
\[ A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T \]
be the rank-k truncated SVD of a data matrix A ∈ R^{m×n}, where k is chosen so that 1 ≤ k ≤ min(m,n).

Theorem 2. (Eckart-Young [39])
\[ A_k = \arg\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \arg\min_{\mathrm{rank}(B) \le k} \|A - B\|_F, \]
with
\[ \|A - A_k\|_2 = \sigma_{k+1}, \qquad \|A - A_k\|_F = \sqrt{\sum_{j=k+1}^{\rho} \sigma_j^2}. \]
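As an illustration (not part of the original text), the short NumPy check below verifies Theorem 2 numerically on a small random matrix; the dimensions, rank k, and seed are arbitrary choices.

```python
import numpy as np

# Check of Theorem 2 (Eckart-Young): the rank-k truncated SVD attains
# spectral-norm error sigma_{k+1} and Frobenius-norm error equal to the
# root of the sum of the squared tail singular values.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # rank-k truncated SVD

err2 = np.linalg.norm(A - A_k, 2)
errF = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(err2, s[k]))                        # sigma_{k+1} (s[k] with 0-indexing)
print(np.isclose(errF, np.sqrt(np.sum(s[k:] ** 2)))) # tail of the spectrum
```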
2.4 The LU Decomposition
The SVD is called a matrix factorization because it expresses a matrix as a product of “simpler” matrices (matrices that are easier to work with in some sense). Another matrix factorization critical to this work is the LU decomposition. The LU decomposition factors a matrix A into a lower triangular matrix and an upper triangular matrix:
\[ A = LU. \]
This factorization will be studied in great detail in Part II. The name ‘LU’ reflects that L is a lower triangular matrix and U is an upper triangular matrix. The familiar term ‘Gaussian elimination’ refers to an algorithm that performs the same calculations as the LU decomposition.
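As a small illustration (an addition, assuming SciPy is available), scipy.linalg.lu computes the row-pivoted factorization A = PLU that practical implementations use:

```python
import numpy as np
from scipy.linalg import lu

# scipy returns P, L, U with A = P @ L @ U, L unit lower triangular,
# and U upper triangular.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))

P, L, U = lu(A)
print(np.allclose(A, P @ L @ U))      # reconstruction
print(np.allclose(L, np.tril(L)))     # L is lower triangular (unit diagonal)
print(np.allclose(U, np.triu(U)))     # U is upper triangular
```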
2.5 The QR Decomposition
A matrix A ∈ Rm×n can be factored into the product of an orthogonal matrix Q ∈ Rm×m
and an upper triangular matrix R ∈ Rm×n:
A = QR.
The QR decomposition [47] will not be computed in this work (although it may be applied as a subroutine in the main algorithm discussed later). The existence of the QR decomposition, nevertheless, will be used in theoretical results. The ability to factor a matrix into an orthogonal component and an upper triangular component will be used extensively in the analysis of the main result in Part II. The name ‘QR’ reflects that Q is commonly used to denote an orthogonal matrix and R denotes a right-triangular matrix (another term for upper-triangular matrix).
2.6 Interlacing Property of Singular Values
The singular values of a matrix exhibit many fascinating properties. In Part III, the interlacing property of singular values is especially important. Essentially, if A ∈ R^{m×n} is a matrix and Ā is equal to A but has one additional column, then
\[ \sigma_1(\bar{A}) \ge \sigma_1(A) \ge \sigma_2(\bar{A}) \ge \sigma_2(A) \ge \cdots \ge \sigma_n(\bar{A}) \ge \sigma_n(A) \ge \sigma_{n+1}(\bar{A}). \]
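A quick numerical check of the interlacing property (an illustrative addition; sizes and seed are arbitrary):

```python
import numpy as np

# Appending one column to A yields Abar whose singular values interlace
# those of A: sigma_i(Abar) >= sigma_i(A) >= sigma_{i+1}(Abar).
rng = np.random.default_rng(2)
A = rng.standard_normal((7, 4))
Abar = np.hstack([A, rng.standard_normal((7, 1))])   # A with one extra column

s = np.linalg.svd(A, compute_uv=False)
sbar = np.linalg.svd(Abar, compute_uv=False)
print(all(sbar[i] >= s[i] >= sbar[i + 1] for i in range(len(s))))
```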
2.7 Numerical Linear Algebra
The LU, QR, and SVD are deterministic algorithms and can be computed for any data matrix. The computational complexity of all three is of the same order: O(m · n · min(m,n)). Nevertheless, their operation counts differ by constant factors, and they vary in efficiency due to memory movements. Roughly, the LU decomposition is about 4 times faster to compute than the QR decomposition, and the QR decomposition is about 4 times faster to compute than the SVD. Concerning stability, the SVD can always be reliably computed. The QR decomposition is less stable than the SVD, and the LU decomposition is less stable than the QR decomposition. The speed and stability of the LU decomposition will be explored in greater detail in later sections. Although none of these three factorizations is unique in general, standard implementations of these algorithms guarantee that they produce the same results when repeated on the same data matrix.
The LU decomposition, QR decomposition, and the SVD will all be needed in Part II. Note that these three matrix factorizations are also considered to be among the most important in the study of linear algebra. In particular, these three factorizations are utilized in linear equation solving. When A and b are a known matrix and known vector and x is an unknown vector such that Ax = b, then x can be computed (with some stability assumptions) using all three decompositions:
x = U−1L−1b
x = R−1QT b
x = VΣ−1UT b.
While matrix addition, subtraction, and multiplication are easily defined, matrix division requires careful consideration. For most matrices, an inverse can be calculated, and multiplying by this inverse is conceptually equivalent to division. Computing the inverse of a matrix, nevertheless, is an expensive algorithm (it requires many computations), and so, in practice, shortcuts are used when possible to avoid this calculation. Triangular and orthogonal matrices offer such shortcuts.

For the triangular matrices U, L, and R in the solutions above, the inverse notation should not imply calculating the inverse. Instead, because these matrices have a special form, a faster algorithm can be applied that calculates multiplication by the inverse. The computational complexity of this action is reduced by a factor of n. Furthermore, because Q, V, and U (note U is standard notation in both LU and the SVD) are orthogonal, their inverses are equal to their transposes. See [34] for discussions on numerical linear algebra.
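The sketch below (an illustrative addition using SciPy routines) solves Ax = b with all three factorizations, relying on triangular solves and transposes rather than explicit inverses, as described above:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, qr, solve_triangular, svd

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 6))
b = rng.standard_normal(6)

# LU: forward/back substitution on the triangular factors
lu_fac, piv = lu_factor(A)
x_lu = lu_solve((lu_fac, piv), b)

# QR: x = R^{-1} Q^T b via a triangular solve
Q, R = qr(A)
x_qr = solve_triangular(R, Q.T @ b)

# SVD: x = V Sigma^{-1} U^T b
U, s, Vt = svd(A)
x_svd = Vt.T @ ((U.T @ b) / s)

print(np.allclose(x_lu, x_qr), np.allclose(x_lu, x_svd))
```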
2.8 Communication-Avoiding Algorithms
The computational cost of many data analysis algorithms is dominated by the time spent moving data throughout the memory hierarchy of a computer, as opposed to the time spent on arithmetic and logic. In general, hardware constraints cause memory movements to be more computationally expensive than floating point operations. Writing data to memory is typically the most expensive data operation. Therefore, when an algorithm has flexibility with the order in which the necessary arithmetic and logic are applied, varying the order of these operations may reduce the amount of data movements needed, increasing the speed of the algorithm.
Algorithms 2.6 and 2.7 in [34] compare scalar and block matrix-matrix multiplication, and supporting analysis describes how the block version is far more efficient than the scalar version, despite performing mathematically identical calculations. Furthermore, an efficiency comparison of matrix-matrix multiplication, matrix-vector multiplication, and vector-vector multiplication shows numerically that the data reuse in block calculations leads to much greater processor efficiency, and, therefore, speed. Algorithms arranged in block form to allow for data reuse and to minimize the number of data movements are known as communication-avoiding algorithms. Part II will discuss SRLU, a block algorithm.
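A minimal sketch of a blocked matrix-matrix multiply is given below (an illustrative addition; the function name and block size are arbitrary choices). The arithmetic is identical to the scalar triple loop, but each block is reused many times once it has been brought into fast memory.

```python
import numpy as np

def blocked_matmul(A, B, b=64):
    # Blocked product C = A @ B: loop over b-by-b tiles so that each tile of
    # A and B is reused across an entire block row/column of C.
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i in range(0, m, b):
        for j in range(0, n, b):
            for t in range(0, k, b):
                # one block update: C_ij += A_it * B_tj
                C[i:i+b, j:j+b] += A[i:i+b, t:t+b] @ B[t:t+b, j:j+b]
    return C

A = np.random.rand(200, 300)
B = np.random.rand(300, 150)
print(np.allclose(blocked_matmul(A, B), A @ B))
```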
2.9 A Note About Matrix-Matrix Multiplication
In numerical linear algebra, almost no algorithms are known to truly be the fastest for their purposes. Indeed, for matrix-matrix multiplication, one of the most basic operations in linear algebra, new algorithms with new improvements are being discovered continuously.
One of the most profound discoveries in the field of linear algebra is a recursive matrix-matrix multiplication algorithm due to Strassen [99]. This work computed matrix-matrix multiplication in time O(n^{log_2 7}), a most surprising result because there was no obvious reason to expect a method faster than O(n^3). As matrix-matrix multiplication is the most time-consuming linear algebra operation used in SRLU in Part II, the complexity of SRLU can freely be reduced by using Strassen’s algorithm. No further discussion is provided in Part II; however, the Appendix describes Strassen’s algorithm in greater detail.
Strassen’s algorithm is known to increase the fixed constant concealed in the order of complexity. A general rule of thumb is that Strassen’s algorithm becomes faster than traditional matrix multiplication for matrices roughly of size 512-by-512 or greater. In recent years, new recursive algorithms have been discovered that follow in Strassen’s example and reduce the complexity of matrix-matrix multiplication further. All are above O(n^2), and it is not known if O(n^2) is achievable, which is a lower bound because all data must be used. These newer algorithms, however, are known to have astronomical hidden constants, and are not practical.
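For concreteness, a minimal recursive sketch of Strassen’s algorithm is shown below (an illustrative addition assuming square matrices whose dimension is a power of two; the function name and cutoff are arbitrary). It forms seven recursive products instead of eight, giving the O(n^{log_2 7}) complexity mentioned above.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    # Recursive Strassen multiplication; below the cutoff, fall back to the
    # ordinary product so the recursion overhead does not dominate.
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty((n, n))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
print(np.allclose(strassen(A, B), A @ B))
```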
Part II
The Spectrum-Revealing LU Decomposition
Chapter 3
Background on Low-Rank Approximation and the LU Decomposition
3.1 Introduction
In this chapter, a novel truncated LU factorization called Spectrum-Revealing LU (SRLU, Definition 2) is introduced for effective low-rank matrix approximation, and a fast algorithm to compute an SRLU factorization is developed. Both matrix and singular value approximation error bounds for the SRLU approximation computed by this algorithm are presented. Analysis suggests that SRLU is competitive with the best low-rank matrix approximation methods, deterministic or randomized, in both computational complexity and approximation quality. Numeric experiments illustrate that SRLU preserves sparsity, highlights important data features and variables, can be efficiently updated, and calculates data approximations nearly as accurately as the best possible. This is the first known practical variant of the LU factorization for effective and efficient low-rank matrix approximation. A preliminary version of this work appears in [6].
3.2 Problem Statement
Low-rank approximation is an essential data processing technique for understanding data that is either large or noisy. The Singular Value Decomposition (SVD) is a longstanding standard for low-rank data approximation. A stable algorithm, the truncated SVD is also optimal in the spectral and Frobenius norms [39, 47]:
\[ \|A - A_k\|_\xi = \min_{\mathrm{rank}(B) \le k} \|A - B\|_\xi, \tag{3.1} \]
where ξ = 2, F and
\[ A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T \tag{3.2} \]
is the truncated SVD of A. The far-reaching applications of low-rank approximation and the SVD include data compression, image and pattern recognition, signal processing, compressed sensing, latent semantic indexing, anomaly detection, and recommendation systems. In this chapter, a novel matrix factorization is introduced called Spectrum-Revealing LU (SRLU) that can be efficiently computed and updated. Simultaneously, it preserves sparsity and can be used to identify important data variables and observations. This algorithm works on any data matrix, and achieves an approximation accuracy that only differs from the accuracy of the best approximation possible for any given rank by a constant factor.¹

¹The truncated SVD is known to provide the best low-rank matrix approximation. But it is rarely used for large-scale practical data analysis.
The major innovation in SRLU is the efficient calculation of a truncated LU factorization of the form
\[
\Pi_1 A \Pi_2^T =
\begin{pmatrix} L_{11} & \\ L_{21} & I_{m-k} \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} \\ & S \end{pmatrix}
\approx
\begin{pmatrix} L_{11} \\ L_{21} \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} \end{pmatrix}
\stackrel{\mathrm{def}}{=} \widehat{L} \widehat{U},
\]
where L_{11} and U_{11} are k × k and Π_1 and Π_2 are judiciously chosen permutation matrices. The LU factorization is unstable, and in practice is implemented by pivoting (interchanging) rows during factorization (this entails finding Π_1 only). For the truncated LU factorization to have any significance, nevertheless, complete pivoting (interchanging rows and columns) is necessary to guarantee that the factors L̂ and Û are well-defined and that their product accurately represents the original data. Previously, complete pivoting was impractical because it requires accessing the entire data matrix at every iteration, but SRLU efficiently achieves complete pivoting through randomization and a deterministic procedure to correct for any mistakes that may have arisen from the randomization. The quality of the SRLU factorization is supported by rigorous theory and numeric experiments.
Background on the LU factorization
A rudimentary matrix factorization, the LU decomposition factors a matrix into a lower triangular matrix and an upper triangular matrix, which can then be used to solve a linear system. The stability of the LU decomposition has been extensively studied, with many longstanding results [56, 82, 101].
The vanilla LU method fails if the diagonal contains numerically small elements, and so partial pivoting was introduced to stabilize the algorithm by selecting important rows of the matrix during iteration. LU can be made completely stable through complete pivoting, where at each iteration the largest element in the Schur complement is permuted to the next diagonal entry of the U factor [105]. Finding the largest element in the Schur complement, nevertheless, requires O(n^2) comparisons at each iteration, greatly increasing the complexity of the decomposition. With row pivoting only, the algorithm can in general achieve a stable method for linear equation solving, but provides no insight into important columns of the data matrix. Thus early termination of the LU decomposition with partial pivoting will, in general, produce poor approximations to a data matrix. The LU decomposition, consequently, appears in fewer applications than the SVD.
The LU decomposition, nevertheless, exhibits many advantages over the SVD. The decomposition is faster, partially preserves sparsity, can be updated more easily, and in general is easier to implement. For example, consider a data matrix examined by Berry et al. in [15]:
\[
A =
\begin{pmatrix}
0.5774 & 0 & 0 & 0.4082 & 0 \\
0.5774 & 0 & 1.0000 & 0.4082 & 0.7071 \\
0.5774 & 0 & 0 & 0.4082 & 0 \\
0 & 0 & 0 & 0.4082 & 0 \\
0 & 1.0000 & 0 & 0.4082 & 0.7071 \\
0 & 0 & 0 & 0.4082 & 0
\end{pmatrix}.
\]
Then the rank-3 truncated SVD is
\[
A_3 =
\begin{pmatrix}
0.4971 & -0.0330 & 0.0232 & 0.4867 & -0.0069 \\
0.6003 & 0.0094 & 0.9933 & 0.3858 & 0.7091 \\
0.4971 & -0.0330 & 0.0232 & 0.4867 & -0.0069 \\
0.1801 & 0.0740 & -0.0522 & 0.2320 & 0.0155 \\
-0.0326 & 0.9866 & 0.0094 & 0.4402 & 0.7043 \\
0.1801 & 0.0740 & -0.0522 & 0.2320 & 0.0155
\end{pmatrix}.
\]
Next, consider a rank-3 truncated LU decomposition using partial row pivoting, which gives the approximation
\[
\begin{pmatrix}
0.5774 & 0 & 0 & 0.4082 & 0 \\
0.5774 & 0 & 1.0000 & 0.4082 & 0.7071 \\
0.5774 & 0 & 0 & 0.4082 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 1.0000 & 0 & 0.4082 & 0.7071 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}.
\]
Here the LU approximation retains a sparsity pattern similar to that of the original matrix. Also, many of the entries are numerically exact. The truncated SVD is both dense and has no exact entries. The truncated SVD, however, has no egregious errors in the matrix, while the LU approximation shows significant errors for two entries. The LU decomposition also looks deceptively accurate, while the truncated SVD, ostensibly less accurate in appearance, is optimal in spectral and Frobenius norm errors. Furthermore, if the columns of A were rearranged, then the partial pivoting factorization could lead to a column of all zeros.
Algorithm 1 presents a basic implementation of the LU factorization, where the result is stored in place such that the upper triangular part of A becomes U and the strictly lower triangular part becomes the strictly lower part of L, with the diagonal of L implicitly known to contain all ones. LU with partial pivoting finds the largest entry in the ith column from row i to m and pivots the row with that entry to the ith row. LU with complete pivoting finds the largest entry in the submatrix A(i:m, i:n) and pivots that entry to A(i, i). It is generally known and accepted that partial pivoting is sufficient for general, real-world data matrices in the context of linear equation solving.
Require: Data matrix A ∈ R^{m×n}
Ensure: A overwritten with L and U factors
1: for i = 1, 2, ..., min(m,n) do
2:    Perform row and/or column pivots
3:    A(i+1:m, i) = A(i+1:m, i) / A(i, i)
4:    A(i+1:m, i+1:n) −= A(i+1:m, i) · A(i, i+1:n)
5: end for

Algorithm 1: The LU Decomposition (Alg. 2.4 [34])
Line 4 of Algorithm 1 is known as the Schur update. Given a sparse input, this is the only step of the LU factorization that causes fill. As the algorithm progresses, fill will compound and may become dense, but the LU factorization, and truncated LU in particular, generally preserves some, if not most, of the sparsity of a sparse input. A numeric illustration is presented below.
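A minimal NumPy rendering of Algorithm 1 with partial row pivoting is sketched below (an illustrative addition, not a reference implementation); line 4's rank-1 Schur update appears as the np.outer call.

```python
import numpy as np

def lu_partial_pivot(A):
    # In-place LU with partial row pivoting: after the loop, the strict lower
    # triangle holds L (unit diagonal implied) and the upper triangle holds U.
    A = A.astype(float).copy()
    m, n = A.shape
    piv = np.arange(m)
    for i in range(min(m, n)):
        p = i + np.argmax(np.abs(A[i:, i]))                 # partial pivoting
        A[[i, p]], piv[[i, p]] = A[[p, i]], piv[[p, i]]
        A[i+1:, i] /= A[i, i]                               # column of L
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 Schur update
    return A, piv

rng = np.random.default_rng(4)
M = rng.standard_normal((6, 6))
F, piv = lu_partial_pivot(M)
L = np.tril(F, -1) + np.eye(6)
U = np.triu(F)
print(np.allclose(L @ U, M[piv]))   # P A = L U
```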
LU Analysis and Variations
By convention, the L factor has a unit diagonal. This guarantees that the LU decomposition is mathematically unique for any data matrix A. Nevertheless, there are many variations on calculating the LU factorization. In Algorithm 2 the Crout version of LU is presented in block form. The column pivoting entails selecting the next b columns so that the in-place LU step is performed on a non-singular matrix (provided the remaining entries are not all zero). Note that the matrix multiplication steps are the bottleneck of this algorithm, requiring O(mnb) operations each in general.
Require: Data matrix A ∈ R^{m×n}, block size b
Ensure: A overwritten with L and U factors
1: for j = 0, b, 2b, ..., min(m,n) − b do
2:    Perform column pivots
3:    A(j+1:m, j+1:j+b) −= A(j+1:m, 1:j) · A(1:j, j+1:j+b)
4:    Apply Algorithm 1 on A(j+1:m, j+1:j+b)
5:    Apply the row pivots to other columns of A
6:    A(j+1:j+b, j+b+1:n) −= A(j+1:j+b, 1:j) · A(1:j, j+b+1:n)
7: end for

Algorithm 2: Crout LU (in block form)
Letting x = min(m,n) and y = max(m,n), Algorithm 1 implies an operation count of
\[
\mathrm{flops} = \sum_{i=1}^{x-1} \left[ \sum_{j=i+1}^{m} \big( 2(n-i) + 1 \big) \right]
= x^2 y - \tfrac{1}{3} x^3 - xy - \tfrac{1}{2} x^2 + \tfrac{5}{6} x + xm - m
= x^2 y - \tfrac{1}{3} x^3 + \text{low order}.
\]
Note that the exact operation count for a rectangular matrix would differ from that of its transpose because of the xm − m term, which stems from the scaling in line 3.
We now make several important observations about Algorithm 1. First, pivoting in some form is necessary for general matrices because of the division in step 3. The most common form of pivoting at step 2 is to swap rows j and arg max_{j≤i≤m} |A(i, j)|. With this rule for row swapping, the final factorization will produce an L with |L(i, j)| ≤ 1. An alternative pivoting strategy, complete pivoting, requires (1/3)n^3 + O(n^2) comparisons for a dense matrix [38], a significant amount of work. In practice, complete pivoting is rarely used for this reason.

Threshold pivoting is a variation of pivoting whereby a parameter u < 1 is chosen so that an acceptable pivot is any choice such that |A^{(k)}_{kk}| ≥ u |A^{(k)}_{ik}| for i > k. Threshold pivoting means that there may be many acceptable pivots. Indeed, the motivation for this pivoting strategy is to allow multiple choices of pivots in order to consider other implications in the factorization. Most notably, for sparse factorizations pivots may be chosen to reduce fill-in in the factorization. Multiple potential pivots allow the factorization some flexibility to attempt to minimize the amount of fill-in. One such strategy is Markowitz pivoting [47]. Such strategies could be applied to SRLU below, but will not be discussed further in this work.
Step 4 of Algorithm 1 is important for efficiency not only because of fill-in, but also because the term A(j+1:m, j) · A(j, j+1:n) is calculating an outer product of size (m−j)-by-(n−j). These outer products are inefficient because they do not take advantage of data reuse and because they require accessing a large submatrix at each iteration. Algorithm 1, however, can be reorganized into a mathematically equivalent LU factorization without needing to calculate outer products or to access the entire Schur complement at every iteration. Before reorganizing, first note that Algorithm 1 is written in vector form for brevity and clarity of what subroutines are needed in the calculation. It is mathematically equivalent to Algorithm 3. Algorithm 3 is less efficient than Algorithm 1, but taking a step back will allow us to understand a different approach to LU.
Require: Data matrix A ∈ R^{m×n}
Ensure: Lower triangular with unit diagonal L ∈ R^{m×min(m,n)} and upper triangular U ∈ R^{min(m,n)×n} such that LU = A (mathematically)
1: for k = 1 : min(m,n) − 1 do
2:    for i = k + 1 : m do
3:       for j = k + 1 : n do
4:          A(i, j) = A(i, j) − (A(i, k)/A(k, k)) · A(k, j)
5:       end for
6:    end for
7: end for

Algorithm 3: The Most Basic Form of the LU Decomposition ([34])
Algorithm 3 is for illustrative purposes only. Note that it is inefficient because as k varies it will repeatedly calculate A(j, i)/A(i, i) for the same i and j. This form of LU, nevertheless, highlights the triple nested loop complexity of the LU decomposition. Furthermore, six variations of the LU decomposition are immediate by permuting the order of the loops. Based on the loop parameters, Algorithm 3 is known as the KIJ version of LU [68]. This version is also known as “right-looking” LU, as well as “outer product” LU, as noted above.
The JKI version of LU is known as “left-looking” LU (Algorithm 4). This version is also called “delayed-update”, or “lazy”, LU because the Schur complement is not updated at each iteration; rather, when a column or block column is being updated, all previous updates from all preceding columns are applied to the current column at once (the Schur updates are simultaneously applied). This also means that the final answer only needs to be written once. Thus after a column or block column has been updated, it need not be updated again. As a result, “left-looking” LU will perform early iterations quickly, and slow down as the factorization progresses because columns will require more updating from more preceding columns. Also note that the inner loop is an inner product. Algorithm 4 also illustrates that swaps must be handled carefully to maintain the “lazy” advantage of delayed updating.
The efficiency advantages of “delayed-update” LU would clearly benefit a truncated LU decomposition, as unnecessary updates to the Schur complement would be avoided. Nevertheless, we cannot obtain a high-quality truncated LU factorization without considering both row and column pivoting.
Require: Data matrix A ∈ R^{m×n}
Ensure: A overwritten with L and U factors
1: for j = 1 : n do
2:    for k = 1 : j − 1 do
3:       Apply previous row swaps
4:    end for
5:    for k = 1 : j − 1 do
6:       for i = k + 1 : m do
7:          A(i, j) −= A(i, k) · A(k, j)
8:       end for
9:    end for
      Pick new swaps
10:   for k = 1 : j do
11:      Apply new swaps to previous columns
12:   end for
13:   for i = j + 1 : m do
14:      A(i, j) = A(i, j) / A(j, j)
15:   end for
16: end for

Algorithm 4: The Left-Looking LU Decomposition
Figure 3.1: Visualizations of different LU factorizations: (a) right-looking LU, (b) left-looking LU, (c) Crout LU.
Another form of the LU decomposition, which factors both columns and rows while it iterates, is called the Crout LU decomposition. Unlike the previously discussed versions of LU, Crout LU cannot be obtained by permuting the loops of Algorithm 3; rather, the loops must be broken up so that columns and rows can be updated at each iteration. A visual representation of variations of the LU decomposition is provided in Figure 3.1.
The Crout LU decomposition, Algorithm 2, has also been implemented to quickly calculate incomplete LU decompositions, and to achieve effective dropping strategies [68]. Note that there are other variations of LU as well. The Doolittle variation resembles the left-looking algorithm, but is applied to rows instead of columns. Table 3.1 highlights some efficiency differences of these mathematically equivalent algorithms.
LU Method       Largest Data Read   Largest Data Write   Total Data Writes
Right-looking   (n−1)^2             (n−1)^2              (1/6)(n−1)n(2n−1)
Left-looking    (1/2)n(n+1)         n                    n^2
Doolittle       (1/2)n(n+1)         n                    n^2
Crout           (1/4)n^2            2(n−1)               n^2

Table 3.1: Efficiency comparison of LU orderings
Hardware constraints mean that writing data to memory is generally more expensive than reading data [23]. Carson et al. also present a parallel write-avoiding LU algorithm without pivoting, which is based on the left-looking algorithm.
Notation and Definitions
We use the conventions outlined in Table 3.2 for our discussions and comparisons of low-rank approximation algorithms. In particular, p ≪ ℓ, which will lead to greater efficiency in SRLU than that of existing algorithms.
Variable   Representation                                Additional Notes
b          block size                                    smallest parameter
p          an oversampling parameter relative to b       the minor dimension of our random projection
k          target rank                                   potentially b ≪ k, but k ≪ m, n
ℓ          an oversampling parameter relative to k       the minor dimension of other algorithms’ random projections
m, n       dimensions of data matrix                     b < p < k ≤ ℓ ≪ m, n

Table 3.2: Definition of parameters (smallest to largest)
Finally, a few specialized definitions are useful concerning LU decompositions:
Definition 1. An LU factorization
\[
\Pi_1 A \Pi_2^T =
\begin{pmatrix} L_{11} & \\ L_{21} & I_{m-k} \end{pmatrix}
\cdot
\begin{pmatrix} U_{11} & U_{12} \\ & U_{22} \end{pmatrix}
\tag{3.3}
\]
(with L_{11} and U_{11} of size k × k) is rank-revealing if
\[
\sigma_k(A) \ge \sigma_{\min}(L_{11} U_{11}) \gg \sigma_{\max}(U_{22}) \ge \sigma_{k+1}(A) \approx 0. \tag{3.4}
\]

Definition 2. A rank-k truncated LU factorization is spectrum-revealing if
\[
\big\| A - \widehat{L} \widehat{U} \big\|_2 \le q_1(k,m,n) \, \sigma_{k+1}(A)
\]
and
\[
\sigma_j \big( \widehat{L} \widehat{U} \big) \ge \frac{\sigma_j(A)}{q_2(k,m,n)}
\]
for 1 ≤ j ≤ k, where q_1(k,m,n) and q_2(k,m,n) are bounded by a low degree polynomial in k, m, and n.

Definition 3. A spectral gap ratio of a matrix A ∈ R^{m×n} is a ratio σ_ℓ(A)/σ_j(A) for 1 ≤ j < ℓ ≤ min(m,n).
The difference between Definitions 1 and 2 is one of the primary concerns of this work. These two definitions provide a measurement for evaluating the quality of a low-rank approximation. The former is the standard definition in this area of numeric analysis. This work, nevertheless, argues that the second definition may be more useful for low-rank approximation in many real-world settings. Definition 3 highlights a concept that is also related to the quality of low-rank approximations and will be essential to the theoretical analysis of spectrum-revealing algorithms in later chapters. Ultimately, Theorems 9 and 11 will show that SRLU is spectrum-revealing.
3.3 Previous Work on Deterministic Low-Rank Approximation Algorithms
As previously discussed, the SVD is a natural starting point for low-rank matrix approximation because of the optimality in equation (3.1). Computation of the SVD is stable, but has many difficulties. First, the complexity of calculation is (4/3) · mn · min(m,n) for a full SVD calculation, and approximately (4/3) · kmn for a rank-k truncated SVD, which we will see is considerably slower than other methods. Second, the SVD cannot be updated in general if additional data is acquired. Additionally, the SVD may be difficult to interpret. The low-rank approximation in equation (3.2) is composed of vectors u_i and v_i that are generally dense. Thus while the SVD can choose a low dimensional subspace that accurately approximates the data, the SVD cannot in general accurately determine a subset of the data variables that accurately represent the data. This remains true even if the input matrix is sparse. See [102] for detailed interpretations of the meaning of the SVD.
The URV decomposition introduced by Stewart [97] is a factorization that can be updated quickly when a new sample arrives, unlike the SVD. In this factorization the middle matrix is triangular, reducing the interpretability of this decomposition. Also, few theoretical results have been reported about the URV decomposition.
Sparse PCA approximates a data matrix with not only a low dimensional subspace, but also a sparse set of data variables. Because the principal components and loading vectors returned by an SVD factorization are dense in general, the SVD does not readily report which data variables best describe the data set. Note that thresholding, which simply ranks variables based on the magnitude of the elements in the loading vectors, often poorly predicts the influence of the data variables because highly correlated variables will be underreported due to dilution. Sparse PCA algorithms seek more robust algorithms to find sparse low-rank approximations [110].
Low-rank approximations can also be found using rank-revealing QR (RRQR) factorizations [24, 51, 55]. RRQR algorithms perform column pivots to guarantee that the factorization
\[
A \Pi =
\begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix}
\begin{pmatrix} R_{11} & R_{12} \\ & R_{22} \end{pmatrix}
\]
produces an approximation
\[
\begin{pmatrix} Q_{11} \\ Q_{21} \end{pmatrix}
\begin{pmatrix} R_{11} & R_{12} \end{pmatrix}
\]
that is rank-revealing.
The CUR decomposition approximates a matrix by the factorization A ≈ CUR, where C is a sample of the columns of A, R is a sample of the rows of A, and U = C†AR† [16]. This factorization seeks to choose important rows and columns, which, for a sparse matrix, provides insight into important combinations of the data variables. Thus the motivation for the CUR decomposition is similar to that of Sparse PCA. Note also that the decomposition retains the sparsity of the original matrix and may be used for data compression, as C and R have the same sparsity pattern as submatrices of A, and U is a small matrix. The CUR decomposition is easily adapted to symmetric matrices by letting C and R be the same selection of indices. A CUR decomposition can be formed from any row/column selection algorithm, such as LU with pivoting, QR with pivoting (including RRQR), or other column selection methods [8, 11, 18, 29]. CUR decompositions can be deterministic or randomized, and will be discussed in later sections. The CX decomposition is analogous to the CUR decomposition, where only a subset of columns is selected.
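A minimal sketch of forming a CUR approximation from a chosen row and column subset is given below (an illustrative addition; the rows and columns are drawn uniformly at random purely for simplicity, whereas the selection strategies discussed in this work are far better).

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((100, 10)) @ rng.standard_normal((10, 60))  # rank-10 data
k = 15                                            # number of sampled rows/columns

cols = rng.choice(A.shape[1], size=k, replace=False)
rows = rng.choice(A.shape[0], size=k, replace=False)
C = A[:, cols]                                    # sampled columns (sparse if A is)
R = A[rows, :]                                    # sampled rows
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)     # U = C^+ A R^+

err = np.linalg.norm(A - C @ U @ R, 'fro') / np.linalg.norm(A, 'fro')
print(err)   # near zero here because the samples span the row/column spaces
```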
Additional work on low-rank data approximation includes the Interpolative Decomposition (ID) [25] and other deterministic column selection algorithms, such as [11]. A truncated version of the Cholesky factorization was studied in [41].
Low-rank approximations can also be formulated as optimization problems. For a data matrix A with rank r the nuclear norm is denoted and defined as
\[ \|A\|_* \stackrel{\mathrm{def}}{=} \sum_{i=1}^{r} \sigma_i(A). \]
Thus the nuclear norm is an ℓ1 norm of the spectrum of A. Note that ℓ1 norm minimization is sparsity inducing, and so heuristically nuclear norm minimization similarly induces sparsity in a matrix spectrum. This in turn implies that nuclear norm minimization induces low rank [20, 21, 57, 87].
3.4 Previous Work on Randomized Low-Rank Approximation Algorithms
Randomized algorithms have grown in popularity in recent years because of their ability to efficiently process large data matrices and because they can be supported with rigorous theory. Randomized low-rank approximation algorithms generally fall into one of two categories: sampling algorithms and black box algorithms. Sampling algorithms form data approximations from a random selection of rows and/or columns of the data. Black box algorithms attempt to find an accurate low-rank basis of the data matrix by factoring a random projection. SRLU, the main result presented below, will resemble a deterministic algorithm with elements of a black box algorithm as a subproblem.
Sampling Algorithms
Examples of sampling algorithms include [35, 36, 44, 72]. [37] showed that for a given approximate rank k, with a randomly drawn subset C of c = O(k log(k) ε^{-2} log(1/δ)) columns of the data, a randomly drawn subset R of r = O(c log(c) ε^{-2} log(1/δ)) rows of the data, and U = C†AR†, the matrix approximation error ‖A − CUR‖_F is at most a factor of 1 + ε from the optimal rank-k approximation with probability at least 1 − δ.
In [44], Frieze et al. showed that a randomized algorithm can find an approximation D* to a matrix A ∈ R^{m×n} satisfying
\[
\|A - D^*\|_F^2 \le \min_{D,\, \mathrm{rank}(D) \le k} \|A - D\|_F^2 + \varepsilon \|A\|_F^2,
\]
with probability 1 − δ, and can run in time polynomial in k, 1/ε, and log(1/δ), independent of m and n. This implies a method for testing in constant time whether a large matrix has a good low-rank approximation. Their algorithm runs by randomly sampling rows and columns, and calculating the SVD of the scaled intersection. The top singular vectors can then be filtered to produce a low-rank approximation.
Adaptive sampling methods update the sampling probabilities as the rank of the approximation increases. Deshpande and Vempala [35] and Deshpande et al. [36] exponentially reduce the error of the method in [44] by adaptively sampling at each iteration, updating the sampling probabilities to reflect the squared distance to the span of the previous samples.
In [72] Mahoney and Drineas use statistical leverage scores to randomly select rows and columns of a data matrix to form a CUR decomposition. As described above, this method creates a sparse approximation. Their algorithm satisfies
\[ \|A - CUR\|_F \le (2 + \varepsilon) \|A - A_k\|_F, \]
with probability at least 98%. They apply their algorithm to text document data, genetic microarray data, and a social science dataset. Other work on sampling algorithms includes [107].
Background on Random Projections
Before discussing black box algorithms, the mathematical background of random projections is presented here. Paramount to the analysis of randomized algorithms is a theorem due to Johnson and Lindenstrauss [61]. Their work proved that, given n and 1 > ε > 0, for every set of n points P ⊂ R^d there exists a map f : R^d → R^k, with k = O(ε^{-2} log n), such that
\[ (1 - \varepsilon) \|u - v\|_2^2 \le \|f(u) - f(v)\|_2^2 \le (1 + \varepsilon) \|u - v\|_2^2 \]
for all u, v ∈ P. In other words, a set of points in a high dimensional space can be mapped to a lower dimensional space in such a way that the distances between points are preserved within some bound. Their work also showed that a random rank-k orthogonal projection on ℓ_2^n will satisfy the condition on f with an exponentially decaying probability of failure.

Frankl and Maehara [43] improved the bound to k = ⌈9(ε² − 2ε³/3)^{-1} ln|P|⌉ + 1 for 0 < ε < 1/2 and provided a simpler proof. Indyk and Motwani [59] showed that orthogonality of the random projection is not needed. Dasgupta and Gupta [32] improved the bound on k to k ≥ 4(ε²/2 − ε³/3)^{-1} ln n.

Achlioptas [1] showed that the random projection can be chosen from {−1, +1} and that it can also be made sparse: the random projection can have entries randomly chosen from {−1, 0, +1} with respective probabilities 1/6, 2/3, and 1/6. For ε, β > 0, the condition on f holds for k at least (4 + 2β)/(ε²/2 − ε³/3) · log n with probability at least 1 − n^{−β}. Other notable works include that of Klartag and Mendelson [65] and Indyk and Naor [60]. See also [62].

In [3], Ailon and Chazelle introduced the fast Johnson-Lindenstrauss transform, which preconditions a sparse projection matrix with a randomized Fourier transform and avoids naive multiplication of dense matrices. This method uses the Heisenberg principle to overcome the distortion caused by sparse random projections, and achieves a lower complexity than previous algorithms. Matousek [77] combines the ideas of [1] and [3] to further improve the speed of the projection. Additionally, Matousek argues that there is a limit on the amount of sparsity achievable. Ailon and Liberty [4] further improve the running time to O(d log k) for k = O(d^{1/2−δ}) for an arbitrarily small fixed δ.
Dasgupta, Kumar, and Sarlos [31] report a method that constructs the projection matrix using a hash function instead of independent random variables, with o(1/ε²) nonzero entries in the projection matrix. Their work achieves an O(1/ε) update time per nonzero element, compared to the O(1/ε²) update time per nonzero element of previous works.
In this work, we will be concerned with a version of the Johnson-Lindenstrauss theorem when n = 1. In this case, [104] showed
\[ (1 - \varepsilon) \|u\|_2^2 \le \left\| \tfrac{1}{\sqrt{k}} R u \right\|_2^2 \le (1 + \varepsilon) \|u\|_2^2, \tag{3.5} \]
with probability at least
\[ 1 - 2 e^{-(\varepsilon^2 - \varepsilon^3) k / 4}. \]
Here, R ∈ R^{k×d} is a Gaussian random projection. This theorem can also be applied by determining a desired level of certainty, and using the theorem above to determine the required amount of oversampling k. Note that in the context of SRLU below, the oversampling parameter k will be represented with p to better conform with other current work.
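An empirical check of the single-vector bound (3.5) is sketched below (an illustrative addition; the dimension d, the projection size k, ε, and the number of trials are arbitrary choices). The observed success frequency should sit above the lower bound from (3.5).

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, eps = 1000, 100, 0.25
u = rng.standard_normal(d)

trials, inside = 200, 0
for _ in range(trials):
    R = rng.standard_normal((k, d))                 # Gaussian random projection
    ratio = np.linalg.norm(R @ u / np.sqrt(k))**2 / np.linalg.norm(u)**2
    inside += (1 - eps) <= ratio <= (1 + eps)

print(inside / trials)                              # observed success frequency
print(1 - 2 * np.exp(-(eps**2 - eps**3) * k / 4))   # lower bound from (3.5)
```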
Black Box Algorithms
Black box algorithms typically approximate a data matrix in the form
\[ A \approx Q Q^T A, \]
where Q is an orthonormal basis of the random projection (usually computed using SVD, QR, or ID). The result of [61] provided the theoretical groundwork for these algorithms, which have been extensively studied [30, 50, 52, 69, 74, 84, 89, 108]. For example, in [84], Papadimitriou et al. use a random projection as a preprocessing step for SVD in the context of latent semantic indexing. Note that the projection of an m-by-n data matrix is of size m-by-ℓ, for some oversampling parameter ℓ ≥ k, where k is the target rank. Thus the computational challenge is the orthogonalization of the projection (the random projection can be applied quickly, as described in these works). A previous result on randomized LU factorizations for low-rank approximation was presented in [5], but is uncompetitive in terms of theoretical results and computational performance with the work presented here.
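A minimal sketch of this black box pattern is given below (an illustrative addition; the test matrix, target rank, and oversampling are arbitrary choices): project A onto a few random directions, orthonormalize the sketch, and project A onto that basis.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((500, 40)) @ rng.standard_normal((40, 300))  # rank-40 data
k, l = 40, 50                         # target rank and oversampled sketch size

Omega = rng.standard_normal((A.shape[1], l))   # random projection
Y = A @ Omega                                  # sketch of the range of A
Q, _ = np.linalg.qr(Y)                         # orthonormal basis of the sketch
A_approx = Q @ (Q.T @ A)                       # A ≈ Q Q^T A

print(np.linalg.norm(A - A_approx) / np.linalg.norm(A))   # near zero here
```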
Note that the large random projection requires initializing the factorization with an expensive matrix-matrix multiplication operation, requiring 2ℓmn operations, and imposing a bottleneck on computation just as a preprocessing step. This complication has been overcome by using Fourier transform-like techniques to apply random projections faster than the cost of matrix-matrix multiplication, as described in the previous section. For instance, the Johnson-Lindenstrauss transform is introduced in [89]. Using a Johnson-Lindenstrauss matrix, Sarlos computes a relative Frobenius norm error in time O(Mr + (m+n)r²) with probability 1/2, where M is the number of nonzeros and r = Θ(k/ε + k log k). The probability of success can be increased to 1 − δ with O(log(1/δ)) processors computing approximations on independent copies. Thus this work showed how to use the Johnson-Lindenstrauss transform [3] to significantly reduce the cost of the initial random projection.

Martinsson et al. [74] proposed randomized algorithms for low-rank approximation by combining a random projection with the ID and SVD methods. This work bounds the singular values of the approximations with high probability. Cost savings over deterministic methods are achieved when the data matrix A and its transpose can be applied “rapidly” to random vectors. A subsequent work by Woolfe et al. [108] accomplishes this task by applying a structured random matrix, which is composed of randomly selected rows of the product of a discrete Fourier transform matrix and a random diagonal matrix. The resulting algorithm has complexity O(mn log(k) + ℓ²(m+n)), where the first term is due to the initialization cost of applying the random projection.
Clarkson and Woodruff [30] showed that a relative Frobenius norm rank-k approximation to a matrix A can be computed in input sparsity time O(nnz(A)) + O(nk²ε^{-4} + k³ε^{-5}). This work provides additional results for overconstrained least-squares regression, ℓ_p-regression, and approximating leverage scores.
For both sampling and black box algorithms the tuning parameter ε cannot be arbitrarily small, as the methods become meaningless if the number of rows and columns sampled (in the case of sampling algorithms) or the size of the random projection (in the case of black box algorithms) surpasses the size of the data. A common practice is ε ≈ 1/2.
3.5 Problems Related to Low-Rank Approximation
Robust PCA [22] seeks to factor a matrix A into two components, L₀ that is low-rank and S₀ that is sparse, by solving
$$\min_{L+S=A} \|L\|_* + \lambda\|S\|_1$$
and recovering L₀ and S₀ with high probability.

Subspace clustering [85] seeks to separate a dataset into low-rank clusters. Matrix completion [21] seeks to approximate missing data in a matrix by choosing approximations that minimize the rank of the dataset, the motivation being to choose data that is consistent with the information that is present. In rank-deficient least squares [34], a low-rank approximation may be necessary. In Appendix B, a linear solver called GMRES is discussed. Although GMRES is not a low-rank approximation method, it operates by finding the best approximate solution to a linear system in a low-dimensional subspace called a Krylov subspace. There are many other problems in mathematics and statistics that relate to low-rank approximation.
3.6 Previous Work on the LU Decomposition
The LU factorization has been studied extensively since long before the invention of computers, with notable results from many mathematicians, including Gauss, Turing, and Wilkinson. Current research on LU factorizations includes communication-avoiding implementations, such as tournament pivoting [64], sparse implementations [49], and new computation of preconditioners [27]. A randomized approach to efficiently compute the LU factorization with complete pivoting recently appeared in [78]. These results are all in the context of linear equation solving, either directly or indirectly through an incomplete factorization used to precondition an iterative method. This work repurposes the LU factorization to create a novel, efficient, and effective low-rank approximation algorithm using modern randomization technology.
Chapter 4
Spectrum-Revealing LU
4.1 Main Contribution: Spectrum-Revealing LU (SRLU)
Our algorithm for computing SRLU is composed of two subroutines: partially factoring the data matrix with randomized complete pivoting (TRLUCP) and performing swaps to improve the quality of the approximation (SRP). The first provides an efficient algorithm for computing a truncated LU factorization, whereas the second ensures the resulting approximation is provably reliable.
Truncated Randomized LU with Complete Pivoting (TRLUCP)
Intuitively, TRLUCP performs deterministic LU with partial row pivoting for some initial data with permuted columns. TRLUCP uses a random projection of the Schur complement to cheaply find and move forward columns that are more likely to be representative of the data. To accomplish this, Algorithm 5 performs an iteration of block LU factorization in
Require: Data matrix A ∈ R^{m×n}, target rank k, block size b, oversampling parameter p ≥ b, random Gaussian matrix Ω ∈ R^{p×m}
1: Calculate random projection R = ΩA
2: for j = 0, b, 2b, ..., k − b do
3:   Perform column selection algorithm on R and swap columns of A
4:   Update block column of L
5:   Perform block LU with partial row pivoting and swap rows of A
6:   Update block row of U
7:   Update R
8: end for
Algorithm 5: TRLUCP
a careful order that resembles Crout LU reduction. The ordering is reasoned as follows: LU with partial row pivoting cannot be performed until the needed columns are selected, and so column selection must occur first at each iteration. Once a block column is selected, a Schur update must be performed on that column before proceeding. At this point, an iteration of block LU with partial row pivoting can be performed on the current block. Once the row pivoting is performed, a Schur update of a block row of U can be performed, which completes the factorization up to rank j + b. Finally, the projection matrix R can be cheaply updated to prepare for the next iteration. Note that any column selection method may be used when picking column pivots from R, such as QR with column pivoting, LU with row pivoting, or even a recursive application of this algorithm to the column selection subproblem on R. A visualization of SRLU appears in Appendix D.
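For concreteness, the following simplified sketch (Python/NumPy; all names are illustrative) mirrors the structure of Algorithm 5. Unlike TRLUCP proper, it keeps an explicit Schur complement and re-sketches it at each iteration instead of performing the cheap update of R described below, and it omits row pivoting, so it shows the structure rather than the efficiency of the method:

    import numpy as np

    def trlucp_sketch(A, k, b, p, rng=np.random.default_rng(0)):
        # Illustrative truncated LU with randomized column selection.
        S = A.astype(float).copy()
        m, n = S.shape
        perm = np.arange(n)
        for j in range(0, k, b):
            Omega = rng.standard_normal((p, m - j))
            R = Omega @ S[j:, j:]               # sketch of the Schur complement
            # Greedy stand-in for QRCP on R: pick the b largest sketch columns.
            pick = np.sort(np.argsort(-np.linalg.norm(R, axis=0))[:b])
            for t, c in enumerate(pick):        # move them to the front
                S[:, [j + t, j + c]] = S[:, [j + c, j + t]]
                perm[[j + t, j + c]] = perm[[j + c, j + t]]
            for i in range(j, j + b):           # b elimination steps (no row pivots)
                S[i+1:, i] /= S[i, i]
                S[i+1:, i+1:] -= np.outer(S[i+1:, i], S[i, i+1:])
        L = np.tril(S[:, :k], -1) + np.eye(m, k)  # unit lower trapezoidal factor
        U = np.triu(S[:k, :])
        return L, U, perm                          # L @ U ~ A[:, perm]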
1: DGEMM — p(2m − 1)n flops
2: for j = 0 : b : k − b do
3:   DGETRF, DGEQRF or other, column swaps — O(np²) flops
4:   DGEMM — 2(m − j)jb flops
5:   DGETRF, row swaps — (m − j)b² − (1/3)b³ flops
6:   DGEMM — 2bj(n − j − b) flops
7:   DGEMM, DTRSM, DTRMM — O(pbn) flops
8: end for
Algorithm 6: TRLUCP (a second look at subroutines and flop count)
The flop count of TRLUCP is dominated by the three matrix multiplication (DGEMM) steps. The total number of flops is
$$F_{TRLUCP} = 2pmn + (m+n)k^2 + \text{low order terms}.$$
Note the transparent constants, and, because matrix multiplication is the bottleneck, this algorithm can be implemented efficiently in terms of both computation and memory usage. Because the output of TRLUCP is only written once, the total number of memory writes is (m + n − k)k. Minimizing the number of data writes by only writing data once significantly improves efficiency, because writing data is typically one of the slowest computational operations. Also worth consideration is the simplicity of the LU decomposition, which only involves three types of operations: matrix multiply, scaling, and pivoting. By contrast, state-of-the-art calculation of both the full and truncated SVD requires a more complex process of bidiagonalization. The projection R can be updated efficiently to become a random projection of the Schur complement for the next iteration. This calculation involves the current progress of the LU factorization and the random matrix Ω, and is described below.
Updating R
The goal of TRLUCP is to access the entire matrix once in the initial random projection,and then choose column pivots at each iteration without accessing the Schur complement.
Therefore, a projection of the Schur complement must be obtained at each iteration without accessing the Schur complement, a method that first appeared in [78]. Assume that s iterations of TRLUCP have been performed, and partition the projection matrix by column blocks of widths sb, b, and m − (s + 1)b:
$$\Omega = \begin{pmatrix} \Omega_1 & \Omega_2 & \Omega_3 \end{pmatrix},$$
and the current A by row blocks of heights sb, b, and m − (s + 1)b and column blocks of widths sb, b, and n − (s + 1)b:
$$A^{(s)} = \begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}.$$
Then the current projection of the Schur complement is
$$R^{cur} = \begin{pmatrix} R_1^{cur} & R_2^{cur} \end{pmatrix} = \begin{pmatrix} \Omega_2 & \Omega_3 \end{pmatrix}\begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix},$$
where the right-most matrix is the current Schur complement. The next iteration of TRLUCP will need to choose columns based on a random projection of the Schur complement, which we wish to avoid accessing. We can write:
$$\begin{aligned}
R^{update} &= \Omega_3\left(A_{33} - A_{32}A_{22}^{-1}A_{23}\right)\\
&= \Omega_3 A_{33} + \Omega_2 A_{23} - \Omega_2 A_{23} - \Omega_3 A_{32}A_{22}^{-1}A_{23}\\
&= \Omega_3 A_{33} + \Omega_2 A_{23} - \Omega_2 L_{22}U_{23} - \Omega_3 L_{32}U_{23}\\
&= R_2^{cur} - \left(\Omega_2 L_{22} + \Omega_3 L_{32}\right)U_{23}. \qquad (4.1)
\end{aligned}$$
Here the current L and U at stage s have been blocked in the same way as Ω. Note that equation (4.1) no longer contains the term A₃₃. Furthermore, A₂₂⁻¹ has been replaced by substituting in submatrices of L and U that have already been calculated, which helps eliminate potential instability.
When the block size b = 1 and TRLUCP runs fully (k = min(m, n)), TRLUCP is mathematically equivalent to the Gaussian Elimination with Randomized Complete Pivoting (GERCP) algorithm of [78]. However, TRLUCP differs from GERCP in two very important aspects: TRLUCP is based on the Crout variant of the LU factorization, which allows efficient truncation for low-rank matrix approximation, and TRLUCP has been structured in block form for more efficient implementation.
4.2 Spectrum-Revealing Pivoting (SRP)
TRLUCP produces high-quality data approximations for almost all data matrices, despite the lack of theoretical guarantees, but can miss important rows or columns of the data.
Next, we develop an efficient variant of the existing rank-revealing LU algorithms [51, 79] to rapidly detect and, if necessary, correct any possible matrix approximation failures of TRLUCP.
Intuitively, the quality of the factorization can be tested by searching for the next choice of pivot in the Schur complement, as if the factorization were to continue. Because TRLUCP does not provide an updated Schur complement, the largest element in the Schur complement can be approximated by finding the column of R with largest norm, performing a Schur update of that column, and then picking the largest element in that column. Let α be this element, and, without loss of generality, assume it is the first entry of the Schur complement. Denote:
$$\Pi_1 A \Pi_2^T = \begin{pmatrix} L_{11} & & \\ \ell^T & 1 & \\ L_{31} & & I \end{pmatrix}\begin{pmatrix} U_{11} & u & U_{13} \\ & \alpha & s_{12}^T \\ & s_{21} & S_{22} \end{pmatrix}.$$
Next, we must find the row and column that should be replaced if the row and column containing α are important. Note that the smallest entry of L₁₁U₁₁ may still lie in an important row and column, and so the largest element of the inverse should be examined instead. Thus we propose defining
$$\overline{A}_{11} \stackrel{def}{=} \begin{pmatrix} L_{11} & \\ \ell^T & 1 \end{pmatrix}\begin{pmatrix} U_{11} & u \\ & \alpha \end{pmatrix}$$
and testing
$$\left\|\overline{A}_{11}^{-1}\right\|_{\max} \le \frac{f}{|\alpha|}$$
for a tolerance parameter f > 1 that controls the tradeoff between accuracy and the number of swaps needed. Should the test fail, the row and column containing α are swapped with the row and column containing the largest element in $\overline{A}_{11}^{-1}$. Note that this element may occur in the last row or last column of $\overline{A}_{11}^{-1}$, indicating that only a column swap or row swap, respectively, is needed. When the swaps are performed, the factorization must be updated to maintain truncated LU form. We have developed a variant of the LU updating algorithm of [48] to efficiently update the SRLU factorization.
SRP can be implemented efficiently: each swap requires at most O(k(m + n)) operations, and $\left\|\overline{A}_{11}^{-1}\right\|_{\max}$ can be quickly and reliably estimated using [54]. An argument similar to that used in [79] shows that each swap will increase $\left|\det\left(\overline{A}_{11}\right)\right|$ by a factor of at least f, hence a swap will never repeat.
4.3 LU Updating
Restoring the LU form of a matrix factorization after a modification to the factorization (LU updating) has been explored in [46, 48, 96].
Require: Truncated LU factorization A ≈ LU, tolerance f > 1
1: while $\left\|\overline{A}_{11}^{-1}\right\|_{\max} > \frac{f}{|\alpha|}$ do
2:   Set α to be the largest element in S (or find an approximate α using R)
3:   Swap row and column containing α with row and column of largest element in $\overline{A}_{11}^{-1}$
4:   Update truncated LU factorization
5: end while
Algorithm 7: Spectrum-Revealing Pivoting (SRP)
Restoring the truncated LU format of SRLU requires updating the factorization after row and column swaps. [48] illustrates updating a full LU factorization after row and column swaps are performed, which maintains stability by allowing additional row and column swaps to be performed.
Updating a truncated LU factorization can be achieved by moving the row/column to be removed to the kth row/column, and moving forward the rows/columns in between. Then, the elimination strategy of [48] is used to eliminate the super-diagonal/sub-diagonal entries (while maintaining a unit diagonal on L). Note, nevertheless, that the final super/sub-diagonal entry cannot be eliminated with this strategy because, if a row or column swap is required for stability, then the updating may swap the row/column to be moved out with the row/column to be moved in twice. Instead, after all other super/sub-diagonal entries have been eliminated, swap rows/columns k and k + 1 to simultaneously add the new row/column required and remove the row/column not needed. At this stage, a single step of LU factorization without pivoting restores the truncated LU factorization.
To see that a step of LU is stable without pivoting, note that entry (k − 1, k) of the L factor is 1 and the (k, k − 1) entry of the U factor is an entry that was previously on the diagonal of the U factor. Hence, the leading entry in the Schur complement is not numerically 0, and so LU can proceed without a pivot.
Let α denote the entry with largest magnitude in the Schur complement. Let (i, j) indicate the coordinates of the entry to be moved out of the upper k-by-k principal submatrix of A, and (s, t) be the coordinates of the entry to be moved in. LU updating proceeds as follows:
1. Swap rows s and k + 1, as well as columns t and k + 1. Note that the factorization remains in truncated LU format.

2. Move row i + 1 to row i, row i + 2 to row i + 1, ..., row k to row k − 1, and row i to row k.

3. Use the methodology of [48] to eliminate entries (i, i + 1), (i + 1, i + 2), ..., and (k − 2, k − 1) from the L factor, while maintaining 1s on the diagonal.

4. Move column j + 1 to column j, ..., column k to column k − 1, and column j to column k.

5. Use the methodology of [48] to eliminate entries (j + 1, j), ..., and (k − 2, k − 1) from the U factor.

6. Swap rows k and k + 1 and columns k and k + 1.

7. Perform a single step of LU factorization to eliminate entry (k − 1, k) from the L factor and (k, k − 1) from the U factor.
4.4 Choice of Block Size
A heuristic for choosing a block size for TRLUCP is described here, which differs from standard block size methodologies for the LU decomposition. Note that a key difference of SRLU and TRLUCP from previous works is the size of the random projection: here the size is relative to the block size, not the target rank k (2pmn flops for TRLUCP versus the significantly larger 2kmn for others). This also implies that a change to the block size changes the flop count, and, to our knowledge, this is the first algorithm where the choice of block size affects the flop count. For problems where LAPACK chooses b = 64, our experiments have shown block sizes of 8 to 20 to be optimal for TRLUCP.
Because the ideal block size depends on many parameters, such as the architecture of the computer and the costs of various arithmetic, logic, and memory operations, guidelines are sought instead of an exact determination of the most efficient block size. To simplify calculations, only the matrix multiplication operations are considered, which are the bottleneck of computation. Let M denote the size of cache, f and m the number of flops and memory movements, and t_f and t_m the cost of a floating point operation and the cost of a memory movement. Using standard communication-avoiding analysis, we seek to choose a block size to minimize the total calculation time T, modeled as
$$T = f \cdot t_f + m \cdot t_m.$$
Choosing p = b + c for a small, fixed constant c, and minimizing implies
$$\begin{aligned}
T &= \left[(m+n-k)\left(k^2-kb\right) - \frac{4}{3}k^3 + 2bk^2 - \frac{2}{3}b^2k\right]\cdot t_f\\
&\quad + \left[(m+n-k)\left(\frac{k^2}{b}-k\right) - \frac{4}{3}\frac{k^3}{b} + 2k^2 - \frac{2}{3}bk\right]\cdot\frac{M}{\left(\sqrt{b^2+M}-b\right)^2}\cdot t_m.
\end{aligned}$$
This result is derived as follows: we analyze blocking by allowing different block sizes in each dimension. For matrices Ω ∈ R^{p×m} and A ∈ R^{m×n}, consider blocking the product Ω · A so that Ω is partitioned into blocks with s rows and ℓ columns, and A is partitioned into blocks with ℓ rows and b columns.
Then a current block update requires cache storage of
$$s\ell + \ell b + sb \le M.$$
Thus we will constrain
$$\ell \le \frac{M - sb}{s + b}.$$
The total runtime T is
$$\begin{aligned}
T &= 2pmn\cdot t_f + \left(\frac{p}{s}\right)\left(\frac{m}{\ell}\right)\left(\frac{n}{b}\right)\left(s\ell + \ell b + sb\right)\cdot t_m\\
&= 2pmn\cdot t_f + pmn\left(\frac{s+b}{sb} + \frac{1}{\ell}\right)\cdot t_m\\
&\ge 2pmn\cdot t_f + pmn\left(\frac{s+b}{sb} + \frac{s+b}{M-sb}\right)\cdot t_m\\
&= 2pmn\cdot t_f + pmnM\left(\frac{s+b}{sb\left(M-sb\right)}\right)\cdot t_m\\
&\stackrel{def}{=} 2pmn\cdot t_f + pmnM\,L(s,b,M)\cdot t_m.
\end{aligned}$$
Given Ω and A, changing the block sizes has no effect on the flop count. Optimizing L(s, b, M) over s yields
$$s^2 + 2sb = M.$$
By symmetry,
$$b^2 + 2sb = M.$$
Note, nevertheless, that s ≤ p by definition. Hence
$$s^* = \min\left(\sqrt{\frac{M}{3}},\; p\right),$$
and
$$b^* = \max\left(\sqrt{\frac{M}{3}},\; \sqrt{p^2+M}-p\right).$$
These values assume
$$\ell^* = \frac{M - sb}{s + b} = \max\left(\sqrt{\frac{M}{3}},\; \sqrt{p^2+M}-p\right) = b^*.$$
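These formulas are straightforward to evaluate; the following small helper is a sketch (parameter names are illustrative; M is the fast-memory size and p the sketch height, both measured in matrix entries):

    import numpy as np

    def blocking_params(M, p):
        s = min(np.sqrt(M / 3.0), p)
        b = max(np.sqrt(M / 3.0), np.sqrt(p**2 + M) - p)
        ell = (M - s * b) / (s + b)    # matches b in both regimes
        return s, b, ell

    # e.g. a 32 KB cache holds M = 4096 doubles; with p = 24:
    print(blocking_params(4096, 24))   # s* = 24, b* = ell* ~ 44.3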
This analysis applies to matrix-matrix multiplication where the matrices are fixed and the leading matrix is short and fat or the trailing matrix is tall and skinny. As noted above, nevertheless, the oversampling parameter p is a constant amount larger than the block size used during the LU factorization. The total initialization time is
$$T^{init} = 2pmn\cdot t_f + pmnM\left(\frac{s+b}{sb\left(M-sb\right)}\right)\cdot t_m = 2pmn\cdot t_f + mn\cdot\min\left(\frac{3\sqrt{3}\,p}{\sqrt{M}},\;\frac{M}{\left(\sqrt{p^2+M}-p\right)^2}\right)\cdot t_m.$$
We next choose the parameter b used for blocking the LU factorization, where p = b + O(1). The cumulative matrix multiplication (DGEMM) runtime is
$$\begin{aligned}
T_{DGEMM} &= \sum_{j=b:b:k-b}\Bigl[2jb(m-j) + 2jb(n-j-b)\Bigr]\cdot t_f + 2\Bigl[j(m-j) + j(n-j-b)\Bigr]\frac{M}{\left(\sqrt{b^2+M}-b\right)^2}\cdot t_m\\
&= \left[(m+n-k)\left(k^2-kb\right) - \frac{4}{3}k^3 + 2bk^2 - \frac{2}{3}b^2k\right]\cdot t_f\\
&\quad + \left[(m+n-k)\left(\frac{k^2}{b}-k\right) - \frac{4}{3}\frac{k^3}{b} + 2k^2 - \frac{2}{3}bk\right]\frac{M}{\left(\sqrt{b^2+M}-b\right)^2}\cdot t_m\\
&\stackrel{def}{=} N_f^{DGEMM}\cdot t_f + N_m^{DGEMM}\cdot t_m.
\end{aligned}$$
The methodology for choosing a block size is compared to other choices of block size in Figure 4.1. Note that LAPACK generally chooses a block size of 64 for these matrices, which is suboptimal in all cases, and can be up to twice as slow. In all of the cases tested, the calculated block size is close to or exactly the optimal block size.
4.5 Variations of SRLU
The CUR Decomposition with LU
A natural extension of truncated LU factorizations is a CUR-type decomposition for improved accuracy [72]:
$$\Pi_1 A \Pi_2^T \approx L\left(L^{\dagger}AU^{\dagger}\right)U \stackrel{def}{=} LMU.$$
As with standard CUR, the factors L and U retain (much of) the sparsity of the original data, while M is a small, k-by-k matrix. The CUR decomposition can improve the accuracy
Figure 4.1: Benchmarking TRLUCP with various block sizes on random matrices of different sizes and truncation ranks.
of an SRLU factorization with minimal extra memory. Extra computational time, nevertheless, is needed to calculate M. A more efficient, approximate CUR decomposition can be obtained by replacing A with a high-quality approximation (such as an SRLU factorization of higher rank) in the calculation of M:
$$A \approx L_k M U_k \quad\text{where}\quad M = L_k^{\dagger} L_{k+p} U_{k+p} U_k^{\dagger}. \qquad (4.2)$$
Here, a suitable p ∈ Z₊ is chosen so that L_{k+p}U_{k+p} more accurately approximates A, and M can be calculated after L_k and U_k have been formed by simply continuing the factorization for p additional iterations.
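A direct sketch of both middle factors follows (Python/NumPy; pinv-based for clarity rather than performance, and the function names are illustrative):

    import numpy as np

    def cur_exact(A, Lk, Uk):
        # M = Lk^+ A Uk^+ : the exact middle factor of the LU-based CUR.
        return np.linalg.pinv(Lk) @ A @ np.linalg.pinv(Uk)

    def cur_approx(Lk, Uk, Lkp, Ukp):
        # Approximate middle factor of (4.2), reusing the rank-(k+p) factors
        # in place of A, so no further access to the data matrix is needed.
        return np.linalg.pinv(Lk) @ (Lkp @ Ukp) @ np.linalg.pinv(Uk)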
The Online SRLU Factorization
Given a factored data matrix A ∈ R^{m×n} and new observations BΠ₂^T = (B₁  B₂) ∈ R^{s×n}, where B₁ holds the first k permuted columns and B₂ the remaining n − k, an augmented LU decomposition takes the form
$$\begin{pmatrix} \Pi_1 A\Pi_2^T \\ B\Pi_2^T \end{pmatrix} = \begin{pmatrix} L_{11} & & \\ L_{21} & I & \\ L_{31} & & I \end{pmatrix}\begin{pmatrix} U_{11} & U_{12} \\ & S \\ & S^{new} \end{pmatrix},$$
where L₃₁ = B₁U₁₁⁻¹ and S^{new} = B₂ − B₁U₁₁⁻¹U₁₂. An SRLU factorization can then be obtained by simply performing correcting swaps. For a rank-1 update, at most one swap is expected (although examples can be constructed that require more than one swap), which requires at most O(k(m + n)) flops. By contrast, the URV decomposition of [97] is O(n²), while SVD updating requires O((m + n) min²(m, n)) operations in general, or O((m + n) min(m, n) log₂²(ε)) for a numerical approximation with the fast multipole method. Subspace iteration has no known updating technique.
In many applications, reduced weight is given to old data. In this context, multiplying the matrices U₁₁, U₁₂, and S by some scaling factor less than 1 before applying spectrum-revealing pivoting will reflect the reduced importance of the old data.
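A sketch of the augmentation step follows (Python/NumPy; B is assumed already in A's permuted column order, and the correcting SRP swaps that complete online SRLU are omitted):

    import numpy as np

    def augment_rows(L, U, S, B):
        # Append new observations B (s-by-n) to a truncated LU with
        # L (m-by-k), U (k-by-n) and Schur complement S ((m-k)-by-(n-k)).
        k = U.shape[0]
        B1, B2 = B[:, :k], B[:, k:]
        U11, U12 = U[:, :k], U[:, k:]
        L31 = np.linalg.solve(U11.T, B1.T).T   # B1 U11^{-1}
        S_new = B2 - L31 @ U12                 # Schur complement of new rows
        return np.vstack([L, L31]), U, np.vstack([S, S_new])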
Chapter 5
The Theory of Spectrum-Revealing Algorithms
5.1 Theoretical Results for SRLU Factorizations
Analysis of General Truncated LU Decompositions
Theorem 3. Let (·)ₛ denote the rank-s truncated SVD for s ≤ k ≪ m, n. Then for any truncated LU factorization
$$\left\|\Pi_1 A\Pi_2^T - LU\right\| = \|S\|$$
for any norm ‖·‖. Furthermore,
$$\left\|\Pi_1 A\Pi_2^T - \left(LU\right)_s\right\|_2 \le 2\|S\|_2 + \sigma_{s+1}(A).$$
Proof. The equation simply follows from $\Pi_1 A\Pi_2^T = LU + \begin{pmatrix} 0 & 0 \\ 0 & S\end{pmatrix}$. For the inequality:
$$\begin{aligned}
\left\|\Pi_1 A\Pi_2^T - \left(LU\right)_s\right\|_2 &= \left\|\Pi_1 A\Pi_2^T - LU + LU - \left(LU\right)_s\right\|_2\\
&\le \left\|\Pi_1 A\Pi_2^T - LU\right\|_2 + \left\|LU - \left(LU\right)_s\right\|_2\\
&= \|S\|_2 + \sigma_{s+1}\left(LU\right)\\
&= \|S\|_2 + \sigma_{s+1}\left(\Pi_1 A\Pi_2^T - \begin{pmatrix} 0 & 0 \\ 0 & S\end{pmatrix}\right)\\
&\le \|S\|_2 + \sigma_{s+1}(A) + \|S\|_2.
\end{aligned}$$
Theorem 4. For a general rank-k truncated LU decomposition, we have for all 1 ≤ j ≤ k,
$$\sigma_j(A) \le \sigma_j\left(LU\right)\left(1 + \left(1 + \frac{\|S\|_2}{\sigma_k\left(LU\right)}\right)\frac{\|S\|_2}{\sigma_j(A)}\right).$$
Proof.
$$\begin{aligned}
\sigma_j(A) &\le \sigma_j\left(LU\right)\left(1 + \frac{\|S\|_2}{\sigma_j\left(LU\right)}\right)\\
&= \sigma_j\left(LU\right)\left(1 + \frac{\sigma_j(A)}{\sigma_j\left(LU\right)}\cdot\frac{\|S\|_2}{\sigma_j(A)}\right)\\
&\le \sigma_j\left(LU\right)\left(1 + \frac{\sigma_j\left(LU\right) + \|S\|_2}{\sigma_j\left(LU\right)}\cdot\frac{\|S\|_2}{\sigma_j(A)}\right)\\
&= \sigma_j\left(LU\right)\left(1 + \left(1 + \frac{\|S\|_2}{\sigma_j\left(LU\right)}\right)\frac{\|S\|_2}{\sigma_j(A)}\right)\\
&\le \sigma_j\left(LU\right)\left(1 + \left(1 + \frac{\|S\|_2}{\sigma_k\left(LU\right)}\right)\frac{\|S\|_2}{\sigma_j(A)}\right).
\end{aligned}$$
Note that the relaxation in the final step serves to establish a universal constant across all j, which leads to fewer terms that need bounding when the global SRLU swapping strategy is developed. Although we could succinctly write
$$\sigma_j(A) \le \sigma_j\left(LU\right)\left(1 + \frac{\|S\|_2}{\sigma_j\left(LU\right)}\right) \le \sigma_j\left(LU\right)\left(1 + \frac{\|S\|_2}{\sigma_k\left(LU\right)}\right),$$
performing the relaxation earlier produces a weaker bound when ‖S‖₂ is small. An upper bound is simpler:
$$\sigma_j\left(LU\right) \le \sigma_j(A)\left(1 + \frac{\|S\|_2}{\sigma_j(A)}\right).$$
Theorem 5. (CUR Error Bounds)
$$\left\|\Pi_1 A\Pi_2^T - LMU\right\|_2 \le 2\|S\|_2$$
and
$$\left\|\Pi_1 A\Pi_2^T - LMU\right\|_F \le \|S\|_F.$$
Proof. Write the QR and LQ decompositions L = Q^L R^L and U = L^U Q^U, with Q^L = (Q₁^L  Q₂^L) and Q^U stacked so that Q₁^L and (Q₁^U)^T span the column spaces of L and U^T respectively, and let C = Π₁AΠ₂^T − LU. First,
$$\begin{aligned}
\left\|\Pi_1 A\Pi_2^T - LMU\right\|_2 &= \left\|\begin{pmatrix} 0 & \left(Q_1^L\right)^T C\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T C\left(Q_1^U\right)^T & \left(Q_2^L\right)^T C\left(Q_2^U\right)^T \end{pmatrix}\right\|_2\\
&\le \left\|\left(Q_1^L\right)^T C\left(Q_2^U\right)^T\right\|_2 + \left\|\begin{pmatrix}\left(Q_2^L\right)^T C\left(Q_1^U\right)^T & \left(Q_2^L\right)^T C\left(Q_2^U\right)^T\end{pmatrix}\right\|_2\\
&= \left\|\left(Q_1^L\right)^T C\left(Q_2^U\right)^T\right\|_2 + \left\|\left(Q_2^L\right)^T C\begin{pmatrix}\left(Q_1^U\right)^T & \left(Q_2^U\right)^T\end{pmatrix}\right\|_2\\
&\le 2\|C\|_2 = 2\|S\|_2.
\end{aligned}$$
Also,
$$\begin{aligned}
\left\|\Pi_1 A\Pi_2^T - LMU\right\|_F &= \left\|Q^L\left(Q^L\right)^T A\left(Q^U\right)^T Q^U - Q_1^L\left(Q_1^L\right)^T A\left(Q_1^U\right)^T Q_1^U\right\|_F\\
&= \left\|Q_1^L\left(Q_1^L\right)^T A\left(Q_2^U\right)^T Q_2^U + Q_2^L\left(Q_2^L\right)^T A\left(Q_1^U\right)^T Q_1^U + Q_2^L\left(Q_2^L\right)^T A\left(Q_2^U\right)^T Q_2^U\right\|_F\\
&= \left\|Q_1^L\left(Q_1^L\right)^T C\left(Q_2^U\right)^T Q_2^U + Q_2^L\left(Q_2^L\right)^T C\left(Q_1^U\right)^T Q_1^U + Q_2^L\left(Q_2^L\right)^T C\left(Q_2^U\right)^T Q_2^U\right\|_F\\
&= \left\|\begin{pmatrix} 0 & \left(Q_1^L\right)^T C\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T C\left(Q_1^U\right)^T & \left(Q_2^L\right)^T C\left(Q_2^U\right)^T \end{pmatrix}\right\|_F\\
&\le \left\|\begin{pmatrix} \left(Q_1^L\right)^T C\left(Q_1^U\right)^T & \left(Q_1^L\right)^T C\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T C\left(Q_1^U\right)^T & \left(Q_2^L\right)^T C\left(Q_2^U\right)^T \end{pmatrix}\right\|_F\\
&= \|C\|_F = \|S\|_F.
\end{aligned}$$
Theorem 3 simply concludes that the approximation is accurate if the Schur complement is small, but the singular value bounds of Theorem 4 are needed to guarantee that the approximation retains structural properties of the original data, such as an accurate approximation of the rank and the spectrum. Furthermore, singular value bounds can be significantly stronger than the more familiar norm error bounds that appear in Theorem 3. Theorem 4 provides a general framework for singular value bounds, and bounding the terms in this theorem provided guidance in the design and development of SRLU. Just as in the case of deterministic LU with complete pivoting, the sizes of $\frac{\|S\|_2}{\sigma_k\left(LU\right)}$ and $\frac{\|S\|_2}{\sigma_j\left(LU\right)}$ range from moderate to small for almost all data matrices of practical interest. They, nevertheless, cannot be effectively bounded for a general TRLUCP factorization, implying the need for Algorithm 7 to ensure that these terms are controlled. While the error bounds in Theorem 5 for the CUR decomposition do not improve upon the result in Theorem 3, CUR bounds for SRLU specifically will be considerably stronger.
The factor of 2 in Theorem 5 will also appear in the theorems below. This factor is due to the potential increase in the norm of a matrix when part of it is zeroed out, as seen in the proof. This factor is likely not optimal, but a counterexample shows that it cannot be 1:
$$\frac{\left\|\begin{pmatrix} 0_n & \sqrt{2}I_n \\ \sqrt{2}I_n & I_n \end{pmatrix}\right\|_2}{\left\|\begin{pmatrix} -I_n & \sqrt{2}I_n \\ \sqrt{2}I_n & I_n \end{pmatrix}\right\|_2} = \sqrt{\frac{4}{3}}.$$
We conjecture that $\sqrt{\frac{4}{3}}$ is indeed a sharp bound. In the 2-by-2 case, using an exact formula for the singular values shows that the factor is smaller than $\sqrt{2}$: note that if a₁₂ = 0 or a₂₁ = 0 then
$$\left\|\begin{pmatrix} 0 & a_{12} \\ a_{21} & a_{22}\end{pmatrix}\right\|_2 \le \left\|\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22}\end{pmatrix}\right\|_2$$
because the matrix on the left-hand side is a submatrix of the matrix on the right-hand side. Assuming a₁₂, a₂₁ ≠ 0, a quick estimate shows
$$\frac{\left\|\begin{pmatrix} 0 & a_{12} \\ a_{21} & a_{22}\end{pmatrix}\right\|_2}{\left\|\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22}\end{pmatrix}\right\|_2}
= \frac{\sqrt{\frac{a_{12}^2+a_{21}^2+a_{22}^2}{2} + \sqrt{\left(\frac{a_{12}^2+a_{21}^2+a_{22}^2}{2}\right)^2 - \left(a_{12}a_{21}\right)^2}}}{\sqrt{\frac{a_{11}^2+a_{12}^2+a_{21}^2+a_{22}^2}{2} + \sqrt{\left(\frac{a_{11}^2+a_{12}^2+a_{21}^2+a_{22}^2}{2}\right)^2 - \left(a_{11}a_{22}-a_{12}a_{21}\right)^2}}}
\le \frac{\sqrt{\frac{a_{12}^2+a_{21}^2+a_{22}^2}{2} + \sqrt{\left(\frac{a_{12}^2+a_{21}^2+a_{22}^2}{2}\right)^2 - \left(a_{12}a_{21}\right)^2}}}{\sqrt{\frac{a_{12}^2+a_{21}^2+a_{22}^2}{2}}} < \sqrt{2}.$$
Correctness of the Spectrum-Revealing LU Decomposition
Our immediate goal is to bound $\frac{\|S\|_2}{\sigma_k\left(LU\right)}$ and $\frac{\|S\|_2}{\sigma_j\left(A\right)}$. In the context of LU with complete pivoting, a clear metric to evaluate the size of ‖S‖₂ is to compare ‖S‖max, the next choice of pivot element were the algorithm to continue, with the quality of the current approximation. While large values in A₁₁ indicate that the corresponding rows and columns are important contributions to the approximation, note that small entries need not imply that the corresponding rows and columns are unimportant, as these rows and columns may contain other large entries. Instead, large entries of A₁₁⁻¹ indicate which columns and rows contribute least to the quality of the approximation.
Consider the following notation. Let α be the entry in S with largest magnitude. Without loss of generality, assume that α has been permuted to the first entry of S:
$$S = \begin{pmatrix} \alpha & s_{12} \\ s_{21} & S_{22} \end{pmatrix}.$$
Then:
$$\Pi_1 A\Pi_2^T = \begin{pmatrix} A_{11} & u & A_{13} \\ \ell^T & \alpha & s_{12} \\ A_{31} & s_{21} & S_{22} \end{pmatrix}.$$
Denote
$$\overline{A}_{11} := \begin{pmatrix} A_{11} & u \\ \ell^T & \alpha \end{pmatrix}.$$
For a given tuning parameter f ≥ 1, we propose evaluating the quality of the approximation LU by testing whether
$$\left\|\overline{A}_{11}^{-1}\right\|_{\max} \le \frac{f}{|\alpha|}. \qquad (5.1)$$
Should condition (5.1) be satisfied, then no column and row combination can be deemed insignificant compared to the column and row that contain α. With this test, we will show that the inequality in Theorem 4 is indeed bounded. Because our test involves searching for the largest element in the Schur complement, we cannot expect to do better than a dimension-dependent bound. By contrast, a QR factorization produces a factorization in terms of column norms, which indicates that a comparable analysis may produce a bound with fixed global constants.
Theorem 6. (Existence of an Optimal Solution) SRLU terminates after a finite number of swaps.
Proof. The theorem will follow after proving two properties of the algorithm: first, performing a swap always increases det(Ā₁₁), and so the algorithm cannot repeat itself. Second, a set of rows and columns always exists that satisfies condition (5.1). Because only a finite number of row and column selections are possible, SRLU must always terminate.
Suppose $\left\|\overline{A}_{11}^{-1}\right\|_{\max} > \frac{f}{|\alpha|}$, and let i and j denote the row and column, respectively, of the location of the largest element in $\overline{A}_{11}^{-1}$ (there may be more than one such entry). Let $\widetilde{A}_{11}$ denote $\overline{A}_{11}$ with a single swap replacing row i and column j with row k + 1 and column k + 1. Then from [80]:
$$\frac{\det\left(\widetilde{A}_{11}\right)}{\det\left(\overline{A}_{11}\right)} = \left(U_{11}^{-1}u\right)_j\left(\ell^T L_{11}^{-1}\right)_i + \alpha\left(\overline{A}_{11}^{-1}\right)_{ji} = \alpha\left(\overline{A}_{11}^{-1}\right)_{ji} > f.$$
This implies that a swap always improves the quality of the approximation. Now suppose Π₁ and Π₂^T are chosen so that Ā₁₁ has the largest possible determinant. Then:
$$1 \ge \frac{\det\left(\widetilde{A}_{11}\right)}{\det\left(\overline{A}_{11}\right)} = \alpha\left(\overline{A}_{11}^{-1}\right)_{ji},$$
and so
$$\left\|\overline{A}_{11}^{-1}\right\|_{\max} \le \frac{1}{|\alpha|} \le \frac{f}{|\alpha|}.$$
Note: f is not essential to the proof of Theorem 6. This parameter is essential in practice, nevertheless, as an exponential number of swaps may be necessary to find an optimal solution (i.e., if f = 1). For f > 1, the determinant of Ā₁₁ is improved by at least a factor of f with each swap.
Theorem 7. (Spectrum-Revealing LU Stopping Time) The Spectrum-Revealing LU algorithm terminates in time proportional to the oversampling size (with a dimension-dependent and ε-dependent constant) with success probability at least
$$1 - 2e^{-(\varepsilon^2-\varepsilon^3)p/4}.$$
Here, ε is a tuning parameter that affects the bound on the number of swaps. Typically, ε ≈ 1/2.
Proof. Let S^{(k)} denote the Schur complement after k rows and columns have been factored with SRLU and pivots have been performed. Then $\det\left(\overline{A}_{11}\right) = \prod_{k=1}^{J} S^{(k)}_{1,1}$, because these are the diagonal elements of U₁₁. Before proceeding, note several inequalities:
$$\left\|S^{(k)}(:,1)\right\|_2 \le g_1\sqrt{m-k+1}\,\left|S^{(k)}_{1,1}\right|.$$
Using partial row pivoting, g₁ = 1 is guaranteed to satisfy this inequality. Applying the form of the Johnson-Lindenstrauss theorem in inequality (3.5),
$$\left\|S^{(k)}\right\|_{1,2} \le \left(\frac{1+\varepsilon}{1-\varepsilon}\right) g_2\left\|S^{(k)}(:,1)\right\|_2.$$
Because SRLU is a block algorithm, and because it reuses the original random projection by updating R at each iteration, this result is not automatic; it was shown to hold in [109]. The success probability remains 1 − 2e^{−(ε²−ε³)k/2}. Deterministic QRCP always pivots the column with largest norm to the front, and so this inequality would hold without any constants. Using randomized QRCP to select columns, this inequality is guaranteed to be true for g₂ = 1. Higher values of g₂ will hold for modified pivoting strategies, such as tournament pivoting. Also,
$$\sigma_1\left(S^{(k)}\right) \le \sqrt{n-k+1}\,\left\|S^{(k)}\right\|_{1,2}.$$
To see this last inequality, note that for any vector x of length n − k + 1, ‖x‖₁ ≤ √(n−k+1)‖x‖₂. Then
$$\sigma_1\left(S^{(k)}\right) = \max_{x\ne 0}\left\|S^{(k)}\frac{x}{\|x\|_2}\right\|_2 \le \max_{x\ne 0}\sqrt{n-k+1}\,\frac{\left\|S^{(k)}x\right\|_2}{\|x\|_1} = \sqrt{n-k+1}\,\left\|S^{(k)}\right\|_{1,2}.$$
Finally, note that
$$\sigma_1\left(S^{(k)}\right) \ge \sigma_k(A)$$
from [83].
Continuing the argument:
$$\begin{aligned}
\left|\det\left(\overline{A}_{11}\right)\right| &= \prod_{k=1}^{J}\left|S^{(k)}_{1,1}\right| \ge \prod_{k=1}^{J}\frac{\left\|S^{(k)}(:,1)\right\|_2}{g_1\sqrt{m-k+1}}\\
&\ge \prod_{k=1}^{J}\left(\frac{1-\varepsilon}{1+\varepsilon}\right)\frac{\sigma_1\left(S^{(k)}\right)}{g_1 g_2\sqrt{m-k+1}\sqrt{n-k+1}}\\
&\ge \prod_{k=1}^{J}\left(\frac{1-\varepsilon}{1+\varepsilon}\right)\frac{\sigma_k(A)}{g_1 g_2\sqrt{m-k+1}\sqrt{n-k+1}}\\
&\ge \left[\prod_{k=1}^{J}\sigma_k(A)\right]\cdot\left[\left(\frac{1-\varepsilon}{1+\varepsilon}\right)\frac{1}{g_1 g_2\sqrt{m}\sqrt{n}}\right]^J.
\end{aligned}$$
After performing K swaps:
$$\left|\det\left(\overline{A}_{11}\right)\right| \ge \left[\prod_{k=1}^{J}\sigma_k(A)\right]\cdot\left[\left(\frac{1-\varepsilon}{1+\varepsilon}\right)\frac{1}{g_1 g_2\sqrt{m}\sqrt{n}}\right]^J f^K.$$
Thus
$$\left[\prod_{k=1}^{J}\sigma_k(A)\right] \ge \left|\det\left(\overline{A}_{11}\right)\right| \ge \left[\prod_{k=1}^{J}\sigma_k(A)\right]\cdot\left[\left(\frac{1-\varepsilon}{1+\varepsilon}\right)\frac{1}{g_1 g_2\sqrt{m}\sqrt{n}}\right]^J f^K,$$
implying
$$K \le J\log_f\left(\left(\frac{1+\varepsilon}{1-\varepsilon}\right) g_1 g_2\sqrt{mn}\right).$$
Analysis of the Spectrum-Revealing LU Decomposition
Theorem 8. (SRLU Error Bounds) For j ≤ k and γ ≤ O(fk√mn), a rank-k SRLU factorization satisfies
$$\left\|\Pi_1 A\Pi_2^T - LU\right\|_2 \le \gamma\,\sigma_{k+1}(A),$$
$$\left\|\Pi_1 A\Pi_2^T - \left(LU\right)_j\right\|_2 \le \sigma_{j+1}(A)\left(1 + 2\gamma\frac{\sigma_{k+1}(A)}{\sigma_{j+1}(A)}\right).$$
Note: the notation τ ≤ O(fk√mn) is meant to convey that a bound of O(fk√mn) has been proven here, although a better bound τ may still exist.
Note: although the factor of 2 may seem redundant in the presence of τ, numeric experiments will show that τ and similar constants below are indeed small in most practical cases. Hence it is worth keeping the factor of 2 (here and in later theorems) separate from τ.
Proof. Note that the definition of α implies
$$\|S\|_2 \le \sqrt{(m-k)(n-k)}\,|\alpha|.$$
From [83]:
$$\sigma_{\min}\left(\overline{A}_{11}\right) \le \sigma_{k+1}(A).$$
Then:
$$\sigma_{k+1}^{-1}(A) \le \left\|\overline{A}_{11}^{-1}\right\|_2 \le (k+1)\left\|\overline{A}_{11}^{-1}\right\|_{\max} \le (k+1)\frac{f}{|\alpha|}.$$
Thus
$$|\alpha| \le f(k+1)\,\sigma_{k+1}(A).$$
The theorem follows by using this result with Theorem 3, with
$$\gamma \le \sqrt{mn}\,f(k+1).$$
Theorem 8 is a special case of Theorem 3 for SRLU factorizations. For a data matrix with a rapidly decaying spectrum, the right-hand side of the second inequality is close to σ_{j+1}(A), a substantial improvement over the sharpness of the bounds in [37].
While a CUR variant immediately follows by invoking Theorem 5 instead of Theorem 3,a stronger bound for CUR is developed later.
Theorem 9. (SRLU Spectral Bound) For 1 ≤ j ≤ k and τ ≤ O(mnk²f³), a rank-k SRLU factorization satisfies
$$\frac{\sigma_j(A)}{1 + \tau\frac{\sigma_{k+1}(A)}{\sigma_j(A)}} \le \sigma_j\left(LU\right) \le \sigma_j(A)\left(1 + \tau\frac{\sigma_{k+1}(A)}{\sigma_j(A)}\right).$$
Proof. After running k iterations of rank-revealing LU,
$$\Pi_1 A\Pi_2^T = LU + C,$$
where $C = \begin{pmatrix} 0 & 0 \\ 0 & S\end{pmatrix}$ and S is the Schur complement. Then
$$\sigma_j(A) \le \sigma_j\left(LU\right) + \|C\|_2 = \sigma_j\left(LU\right)\left(1 + \frac{\|C\|_2}{\sigma_j\left(LU\right)}\right). \qquad (5.2)$$
For the upper bound:
$$\sigma_j\left(LU\right) = \sigma_j\left(A - C\right) \le \sigma_j(A) + \|C\|_2 = \sigma_j(A)\left[1 + \frac{\|C\|_2}{\sigma_j(A)}\right] = \sigma_j(A)\left[1 + \frac{\|S\|_2}{\sigma_j(A)}\right].$$
The final form is achieved using the same bound on the constant as in Theorem 8.
While the worst-case upper bound on τ is large, it is dimension-dependent, and j and k may be chosen so that $\frac{\sigma_{k+1}(A)}{\sigma_j(A)}$ is arbitrarily small compared to τ. In particular, if k is the numeric rank of A, then the singular values of the approximation are numerically equal to those of the data.
These bounds are 'problem-specific bounds' because their quality depends on the spectrum of the original data, rather than on universal constants that appear in previous results. The benefit of these problem-specific bounds is that an approximation of data with a rapidly decaying spectrum is guaranteed to be high-quality. Furthermore, if σ_{k+1}(A) is not small compared to σ_j(A), then no high-quality low-rank approximation is possible in the 2 and Frobenius norms. Thus, in this sense, the bounds presented in Theorems 8 and 9 are optimal.
We note that singular value ratios have appeared before. For example, Hwang, Lin, and Yang [58] proved:

Theorem 10. Let $C(n,k) = \frac{n!}{k!(n-k)!}$. For nonsingular A ∈ R^{n×n}, there exist permutation matrices Π₁ and Π₂^T such that
$$\left|\left(U_{22}\right)_{ij}\right| \le \frac{C(n,k)\,\sigma_{k+1}(A)}{1 - C(n,k)\frac{\sigma_{k+1}(A)}{\sigma_k(A)}}.$$
Given a high-quality rank-k truncated LU factorization, Theorem 9 ensures that a low-rank approximation of rank ℓ with ℓ < k of the compressed data is an accurate rank-ℓ approximation of the full data. The proof of this theorem centers on bounding the terms in Theorems 3 and 4. Experiments will show that τ is small in almost all cases.
Stronger results are achieved with the CUR version of SRLU:
Theorem 11.
$$\left\|\Pi_1 A\Pi_2^T - LMU\right\|_2 \le 2\gamma\,\sigma_{k+1}(A)$$
and
$$\left\|\Pi_1 A\Pi_2^T - LMU\right\|_F \le \omega\,\sigma_{k+1}(A),$$
where γ ≤ O(fk√mn) is the same as in Theorem 8, and, similarly, ω ≤ O(fk√mn).
Proof. Note that the definition of α implies
$$\|S\|_F \le \sqrt{(m-k)(n-k)}\,|\alpha|.$$
The rest follows by using Theorem 5 in a manner similar to how Theorem 8 invoked Theorem 3.
Theorem 12. If σ_j²(A) > 2‖S‖₂² then
$$\sigma_j(A) \ge \sigma_j\left(LMU\right) \ge \sigma_j(A)\sqrt{1 - 2\gamma\left(\frac{\sigma_{k+1}(A)}{\sigma_j(A)}\right)^2},$$
where γ ≤ O(mnk²f²), and f is an input parameter controlling a tradeoff of quality versus speed, as before.
Proof. Perform QR and LQ decompositions
$$L = Q^L R^L =: \begin{pmatrix} Q_1^L & Q_2^L\end{pmatrix}\begin{pmatrix} R_{11}^L & R_{12}^L \\ & R_{22}^L\end{pmatrix} \quad\text{and}\quad U = L^U Q^U =: \begin{pmatrix} L_{11}^U & \\ L_{21}^U & L_{22}^U\end{pmatrix}\begin{pmatrix} Q_1^U \\ Q_2^U\end{pmatrix}.$$
Then
$$LMU = Q_1^L\left(Q_1^L\right)^T A\left(Q_1^U\right)^T Q_1^U.$$
Note that
$$\begin{aligned}
A^T Q_2^L &= \left(LU + C\right)^T Q_2^L\\
&= \left(Q_1^L R_{11}^L L_{11}^U Q_1^U + C\right)^T Q_2^L\\
&= \left(Q_1^U\right)^T\left(L_{11}^U\right)^T\left(R_{11}^L\right)^T\left(Q_1^L\right)^T Q_2^L + C^T Q_2^L\\
&= C^T Q_2^L. \qquad (5.3)
\end{aligned}$$
Analogously,
$$A\left(Q_2^U\right)^T = C\left(Q_2^U\right)^T. \qquad (5.4)$$
Then, using (5.3) and (5.4),
$$\sigma_j(A) = \sigma_j\begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T & \left(Q_1^L\right)^T A\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T A\left(Q_1^U\right)^T & \left(Q_2^L\right)^T A\left(Q_2^U\right)^T\end{pmatrix} = \sigma_j\begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T & \left(Q_1^L\right)^T C\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T C\left(Q_1^U\right)^T & \left(Q_2^L\right)^T C\left(Q_2^U\right)^T\end{pmatrix}.$$
For brevity, write the block rows of the latter matrix as
$$B_1 = \begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T & \left(Q_1^L\right)^T C\left(Q_2^U\right)^T\end{pmatrix}, \qquad B_2 = \begin{pmatrix} \left(Q_2^L\right)^T C\left(Q_1^U\right)^T & \left(Q_2^L\right)^T C\left(Q_2^U\right)^T\end{pmatrix},$$
so that $\sigma_j^2(A) = \lambda_j\left(B_1^T B_1 + B_2^T B_2\right)$, where λ_j denotes the j-th largest eigenvalue. Continuing:
$$\begin{aligned}
\sigma_j(A) &\le \left(\lambda_j\left(B_1^T B_1\right) + \left\|B_2\right\|_2^2\right)^{\frac{1}{2}} \le \left(\lambda_j\left(B_1 B_1^T\right) + \|C\|_2^2\right)^{\frac{1}{2}}\\
&= \left(\lambda_j\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right)^T + \left(Q_1^L\right)^T C\left(Q_2^U\right)^T\left(\left(Q_1^L\right)^T C\left(Q_2^U\right)^T\right)^T\right) + \|C\|_2^2\right)^{\frac{1}{2}}\\
&\le \left(\lambda_j\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right)^T\right) + \left\|\left(Q_1^L\right)^T C\left(Q_2^U\right)^T\left(\left(Q_1^L\right)^T C\left(Q_2^U\right)^T\right)^T\right\|_2 + \|C\|_2^2\right)^{\frac{1}{2}}\\
&\le \left(\sigma_j^2\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right) + 2\|C\|_2^2\right)^{\frac{1}{2}} = \left(\sigma_j^2\left(LMU\right) + 2\|C\|_2^2\right)^{\frac{1}{2}}\\
&= \sigma_j\left(LMU\right)\sqrt{1 + 2\left(\frac{\|C\|_2}{\sigma_j\left(LMU\right)}\right)^2} = \sigma_j\left(LMU\right)\sqrt{1 + 2\left(\frac{\|S\|_2}{\sigma_j\left(LMU\right)}\right)^2}.
\end{aligned}$$
Solving for σ_j(LMU) gives the lower bound. The upper bound:
$$\sigma_j(A) = \sigma_j\begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T & \left(Q_1^L\right)^T A\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T A\left(Q_1^U\right)^T & \left(Q_2^L\right)^T A\left(Q_2^U\right)^T\end{pmatrix} \ge \sigma_j\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right) = \sigma_j\left(Q_1^L\left(Q_1^L\right)^T A\left(Q_1^U\right)^T Q_1^U\right) = \sigma_j\left(LMU\right).$$
As before, the constants are small in practice. Observe that for most real data matrices, the singular values decay with increasing j. For such matrices this result is significantly stronger than Theorem 9.
As in Theorem 5, the factor of 2 stems from approximating the spectral norm of a matrix with a zero submatrix. As before, the approximation is sharper under the Frobenius norm:
Theorem 13. (Frobenius Bound)
$$\left\|LMU\right\|_F \ge \frac{\|A\|_F}{\sqrt{1 + \left(\frac{\|S\|_F}{\left\|LMU\right\|_F}\right)^2}}.$$
Note: to see that Theorem 13 improves upon Theorem 12, note that accumulating
$$\sigma_j^2(A) - 2\|S\|_2^2 \le \sigma_j^2\left(LMU\right)$$
gives a weaker lower bound than
$$\|A\|_F^2 - \|S\|_F^2 \le \left\|LMU\right\|_F^2,$$
these two inequalities being equivalent to the results of these theorems.
Proof. With the notation of Theorem 5, let
$$X = \begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T & \left(Q_1^L\right)^T C\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T C\left(Q_1^U\right)^T & \left(Q_2^L\right)^T C\left(Q_2^U\right)^T\end{pmatrix}, \qquad X_0 = \begin{pmatrix} 0 & \left(Q_1^L\right)^T C\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T C\left(Q_1^U\right)^T & \left(Q_2^L\right)^T C\left(Q_2^U\right)^T\end{pmatrix}.$$
Then
$$\|A\|_F^2 = \left\|\left(Q^L\right)^T A\left(Q^U\right)^T\right\|_F^2 = \operatorname{trace}\left(X^T X\right) = \operatorname{trace}\left(\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right)^T\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right) + \operatorname{trace}\left(X_0^T X_0\right).$$
Filling in the zero block of X₀ with $\left(Q_1^L\right)^T C\left(Q_1^U\right)^T$ can only increase the trace, so
$$\|A\|_F^2 \le \left\|\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right\|_F^2 + \operatorname{trace}\left(\left(\left(Q^L\right)^T C\left(Q^U\right)^T\right)^T\left(Q^L\right)^T C\left(Q^U\right)^T\right) = \left\|LMU\right\|_F^2 + \operatorname{trace}\left(C^T C\right) = \left\|LMU\right\|_F^2 + \|C\|_F^2.$$
Note that Theorem 13 is a general result. For SRLU factorizations:
Theorem 14. If σ_j(A) ≥ ‖S‖_F then
$$\left\|LMU\right\|_F \ge \|A\|_F\sqrt{1 - \gamma\left(\frac{\sigma_{k+1}(A)}{\|A\|_F}\right)^2},$$
where γ ≤ O(mnf²k²).
Proof. As in Theorem 12, the result is achieved by solving for ‖LMU‖_F and substituting ‖S‖_F with a bound in terms of m, n, k, f, and A. Note that, as seen above, the definition of α implies ‖S‖_F ≤ √((m−k)(n−k))|α|. The rest of the proof is similar to the proof of Theorem 12: substituting in this bound for ‖S‖_F and the previously established bound for |α| completes the proof.
Although the condition in Theorem 14 does not guarantee that the operand in the squareroot in the statement of the theorem is nonnegative, it does imply the operand is nonnegativebefore the final substitution. Numeric experiments imply that the constant is small, and thusa condition on the constant would be pessimistic. Also, note that for Theorems 12 and 14the condition in the statement is not a significant restriction. In both proofs, the restrictionis applied in the final steps to clean up the form of the bound.
Note: in general there is no nuclear norm bound in the form of Theorem 13. Counterexamples are easy to construct showing that such bounds fail in the nuclear norm.
Theorem 15. (Monotonicity of CUR Approximations) Suppose L_kU_k is a rank-k truncated LU approximation to a matrix A, and suppose the decomposition
$$L_{k+c}U_{k+c} = \begin{pmatrix} L_k & \begin{pmatrix} 0 \\ L_c\end{pmatrix}\end{pmatrix}\begin{pmatrix} U_k \\ \begin{pmatrix} 0 & U_c\end{pmatrix}\end{pmatrix}$$
is a rank-(k + c) truncated LU approximation obtained by continuing the LU factorization from the rank-k approximation. Then
$$\sigma_j\left(L_k\left(L_k^{\dagger}AU_k^{\dagger}\right)U_k\right) \le \sigma_j\left(L_{k+c}\left(L_{k+c}^{\dagger}AU_{k+c}^{\dagger}\right)U_{k+c}\right) \le \sigma_j(A).$$
Proof. Perform QR and LQ decompositions on L and U, similar to the decomposition in Theorem 12, but with extra blocks:
$$A = LU = \begin{pmatrix} Q_1^L & Q_2^L & Q_3^L\end{pmatrix}\begin{pmatrix} R_{11}^L & R_{12}^L & R_{13}^L \\ & R_{22}^L & R_{23}^L \\ & & R_{33}^L\end{pmatrix}\begin{pmatrix} L_{11}^U & & \\ L_{21}^U & L_{22}^U & \\ L_{31}^U & L_{32}^U & L_{33}^U\end{pmatrix}\begin{pmatrix} Q_1^U \\ Q_2^U \\ Q_3^U\end{pmatrix}.$$
Then
$$\begin{aligned}
\sigma_j\left(L_{k+c}\left(L_{k+c}^{\dagger}AU_{k+c}^{\dagger}\right)U_{k+c}\right) &= \sigma_j\left(\begin{pmatrix} Q_1^L & Q_2^L\end{pmatrix}\begin{pmatrix}\left(Q_1^L\right)^T \\ \left(Q_2^L\right)^T\end{pmatrix} A\begin{pmatrix}\left(Q_1^U\right)^T & \left(Q_2^U\right)^T\end{pmatrix}\begin{pmatrix} Q_1^U \\ Q_2^U\end{pmatrix}\right)\\
&= \sigma_j\left(\begin{pmatrix}\left(Q_1^L\right)^T \\ \left(Q_2^L\right)^T\end{pmatrix} A\begin{pmatrix}\left(Q_1^U\right)^T & \left(Q_2^U\right)^T\end{pmatrix}\right)\\
&\ge \sigma_j\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right) = \sigma_j\left(Q_1^L\left(Q_1^L\right)^T A\left(Q_1^U\right)^T Q_1^U\right) = \sigma_j\left(L_k\left(L_k^{\dagger}AU_k^{\dagger}\right)U_k\right).
\end{aligned}$$
The rightmost inequality is a direct application of Theorem 12.
Note: there is no result equivalent to Theorem 15 for a plain LU factorization. Consider
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1.01 & 0 \\ 1 & 1.0001 & 1\end{pmatrix} = \begin{pmatrix} 1 & & \\ 1 & 1 & \\ 1 & .01 & 1\end{pmatrix}\begin{pmatrix} 1 & 1 & 1 \\ & .01 & -1 \\ & & .01\end{pmatrix}.$$
Then
$$\sigma_1\left(\begin{pmatrix} 1 & \\ 1 & 1 \\ 1 & .01\end{pmatrix}\begin{pmatrix} 1 & 1 & 1 \\ & .01 & -1\end{pmatrix}\right) < \sigma_1(A) < \sigma_1\left(\begin{pmatrix} 1 \\ 1 \\ 1\end{pmatrix}\begin{pmatrix} 1 & 1 & 1\end{pmatrix}\right).$$
Caveat: consider the matrices
$$A = \begin{pmatrix} I_n & I_n \\ I_n & -3I_n\end{pmatrix}, \quad\text{and}\quad L = \begin{pmatrix} I_n \\ I_n\end{pmatrix},\; U = \begin{pmatrix} I_n & I_n\end{pmatrix}. \qquad (5.5)$$
Then
$$LL^{\dagger}AU^{\dagger}U = 0!$$
Nevertheless,
$$\left\|A - LL^{\dagger}AU^{\dagger}U\right\|_2 = \|A\|_2 = 1+\sqrt{5} < 4 = \left\|A - LU\right\|_2,$$
and
$$\left\|A - LL^{\dagger}AU^{\dagger}U\right\|_F = \|A\|_F = 2\sqrt{3}\sqrt{n} < 4\sqrt{n} = \left\|A - LU\right\|_F.$$
Theorem 16. (Approximate CUR lower bound) Let $M = L_k^{\dagger}L_{k+c}U_{k+c}U_k^{\dagger}$, as in line (4.2). Let S_k be the Schur complement after a rank-k truncated LU decomposition, and let S_{k+c} be the Schur complement after extending the truncated LU decomposition to rank k + c. Then
$$\sigma_j\left(LMU\right) \ge \frac{\sigma_j(A)}{\sqrt{1 + 2\frac{\left\|S_{k+c}\right\|_2}{\sigma_j\left(LMU\right)} + \frac{2\left\|S_k\right\|_2^2 + 3\left\|S_{k+c}\right\|_2^2}{\sigma_j^2\left(LMU\right)}}}.$$
Proof. Let A have the decomposition as in Theorem 15. Then, rewriting the off-diagonal blocks via identities of the form (5.3) and (5.4) (with S_k and S_{k+c} standing for the corresponding zero-padded error matrices),
$$\sigma_j(A) = \sigma_j\left(\left(Q^L\right)^T A\left(Q^U\right)^T\right) = \sigma_j\begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T & \left(Q_1^L\right)^T S_k\left(Q_2^U\right)^T & \left(Q_1^L\right)^T S_{k+c}\left(Q_3^U\right)^T \\ \left(Q_2^L\right)^T S_k\left(Q_1^U\right)^T & \left(Q_2^L\right)^T S_k\left(Q_2^U\right)^T & \left(Q_2^L\right)^T S_{k+c}\left(Q_3^U\right)^T \\ \left(Q_3^L\right)^T S_{k+c}\left(Q_1^U\right)^T & \left(Q_3^L\right)^T S_{k+c}\left(Q_2^U\right)^T & \left(Q_3^L\right)^T S_{k+c}\left(Q_3^U\right)^T\end{pmatrix}.$$
Removing the third block column perturbs σ_j by at most ‖S_{k+c}‖₂, and removing the third block row by at most another ‖S_{k+c}‖₂:
$$\sigma_j(A) \le \sqrt{\sigma_j^2\begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T & \left(Q_1^L\right)^T S_k\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T S_k\left(Q_1^U\right)^T & \left(Q_2^L\right)^T S_k\left(Q_2^U\right)^T \\ \left(Q_3^L\right)^T S_{k+c}\left(Q_1^U\right)^T & \left(Q_3^L\right)^T S_{k+c}\left(Q_2^U\right)^T\end{pmatrix} + \left\|S_{k+c}\right\|_2^2} \le \sqrt{\sigma_j^2\begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T & \left(Q_1^L\right)^T S_k\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T S_k\left(Q_1^U\right)^T & \left(Q_2^L\right)^T S_k\left(Q_2^U\right)^T\end{pmatrix} + 2\left\|S_{k+c}\right\|_2^2}.$$
Removing the second block column and then the second block row similarly costs at most ‖S_k‖₂ each:
$$\sigma_j(A) \le \sqrt{\sigma_j^2\begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T \\ \left(Q_2^L\right)^T S_k\left(Q_1^U\right)^T\end{pmatrix} + \left\|S_k\right\|_2^2 + 2\left\|S_{k+c}\right\|_2^2} \le \sqrt{\sigma_j^2\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right) + 2\left\|S_k\right\|_2^2 + 2\left\|S_{k+c}\right\|_2^2}.$$
Continuing:
$$\begin{aligned}
\sigma_j(A) &\le \sqrt{\sigma_j^2\left(R_{11}^L L_{11}^U + R_{12}^L L_{21}^U + R_{13}^L L_{31}^U\right) + 2\left\|S_k\right\|_2^2 + 2\left\|S_{k+c}\right\|_2^2}\\
&\le \sqrt{\left(\sigma_j\left(R_{11}^L L_{11}^U + R_{12}^L L_{21}^U\right) + \left\|R_{13}^L L_{31}^U\right\|_2\right)^2 + 2\left\|S_k\right\|_2^2 + 2\left\|S_{k+c}\right\|_2^2}\\
&\le \sqrt{\left(\sigma_j\left(R_{11}^L L_{11}^U + R_{12}^L L_{21}^U\right) + \left\|S_{k+c}\right\|_2\right)^2 + 2\left\|S_k\right\|_2^2 + 2\left\|S_{k+c}\right\|_2^2}\\
&= \sqrt{\sigma_j^2\left(R_{11}^L L_{11}^U + R_{12}^L L_{21}^U\right) + 2\sigma_j\left(R_{11}^L L_{11}^U + R_{12}^L L_{21}^U\right)\left\|S_{k+c}\right\|_2 + 2\left\|S_k\right\|_2^2 + 3\left\|S_{k+c}\right\|_2^2}\\
&= \sqrt{\sigma_j^2\left(LMU\right) + 2\sigma_j\left(LMU\right)\left\|S_{k+c}\right\|_2 + 2\left\|S_k\right\|_2^2 + 3\left\|S_{k+c}\right\|_2^2}.
\end{aligned}$$
The result follows.
Theorem 17. (Approximate CUR upper bound) Under the assumptions of Theorem 16:
$$\sigma_j\left(LMU\right) \le \sigma_j(A)\left(1 + \frac{\left\|S_{k+c}\right\|_2}{\sigma_j(A)}\right).$$
Proof.
$$\sigma_j\left(LMU\right) \le \sigma_j\left(Q_1^L\left(Q_1^L\right)^T L_{k+c}U_{k+c}\left(Q_1^U\right)^T Q_1^U\right) \le \sigma_j\left(L_{k+c}U_{k+c}\right) \le \sigma_j(A)\left(1 + \frac{\left\|S_{k+c}\right\|_2}{\sigma_j(A)}\right),$$
where we have used Theorem 9.
Theorem 18.
$$\left\|LMU\right\|_F \ge \frac{\|A\|_F}{\sqrt{1 + \frac{\left\|S_k\right\|_F^2 + \left\|S_{k+c}\right\|_F^2}{\left\|LMU\right\|_F^2}}}.$$
Proof. Using Theorem 16 in the same way that Theorem 13 is based on Theorem 12 yields
$$\left\|LMU\right\|_F \ge \frac{\|A\|_F}{\sqrt{1 + \frac{\left\|S_k\right\|_F^2 + 2\left\|S_{k+c}\right\|_F^2}{\left\|LMU\right\|_F^2}}}.$$
If the leading submatrix is separated into its components when the components of S_k and S_{k+c} are separated out, then the factor of 2 can be dropped.
Theorem 18 can also be simplified by solving for ‖LMU‖_F, as other theorems above have done. Next, an alternative CUR bound is presented which can be stronger or weaker than the bound in Theorem 12.
Theorem 19. For a general LU factorization,
$$\sigma_j\left(LMU\right) \ge \sigma_j\left(A_{11}\right) - \frac{1}{4}\|S\|_2. \qquad (5.6)$$
Proof. Let $N_L = L_{21}L_{11}^{-1}$ and $N_U = U_{11}^{-1}U_{12}$. Then
$$L\left(L\right)^{\dagger} = \begin{pmatrix} I \\ N_L\end{pmatrix}\left(I + N_L^T N_L\right)^{-1}\begin{pmatrix} I & N_L^T\end{pmatrix} \quad\text{and}\quad \left(U\right)^{\dagger}U = \begin{pmatrix} I \\ N_U^T\end{pmatrix}\left(I + N_U N_U^T\right)^{-1}\begin{pmatrix} I & N_U\end{pmatrix}.$$
Then
$$L\left(L\right)^{\dagger}C\left(U\right)^{\dagger}U = \begin{pmatrix} I \\ N_L\end{pmatrix}\left(I + N_L^T N_L\right)^{-1}N_L^T S N_U^T\left(I + N_U N_U^T\right)^{-1}\begin{pmatrix} I & N_U\end{pmatrix}.$$
Note that
$$\sigma_j\left(LU + \begin{pmatrix} I \\ N_L\end{pmatrix}\left(I + N_L^T N_L\right)^{-1}N_L^T S N_U^T\left(I + N_U N_U^T\right)^{-1}\begin{pmatrix} I & N_U\end{pmatrix}\right) \ge \sigma_j\left(A_{11} + \left(I + N_L^T N_L\right)^{-1}N_L^T S N_U^T\left(I + N_U N_U^T\right)^{-1}\right)$$
because the singular values of a submatrix are bounded by the corresponding singular values of the whole matrix. Let $\sigma_1^{N_L}$ and $\sigma_1^{N_U}$ denote the leading singular values of N_L and N_U, and note that
$$\frac{\sigma_1^{N_L}}{1 + \left(\sigma_1^{N_L}\right)^2},\; \frac{\sigma_1^{N_U}}{1 + \left(\sigma_1^{N_U}\right)^2} \le \frac{1}{2}.$$
Then
$$\begin{aligned}
\sigma_j\left(LMU\right) &= \sigma_j\left(L\left(L\right)^{\dagger}A\left(U\right)^{\dagger}U\right) = \sigma_j\left(L\left(L\right)^{\dagger}\left(LU + C\right)\left(U\right)^{\dagger}U\right) = \sigma_j\left(LU + L\left(L\right)^{\dagger}C\left(U\right)^{\dagger}U\right)\\
&\ge \sigma_j\left(A_{11} + \left(I + N_L^T N_L\right)^{-1}N_L^T S N_U^T\left(I + N_U N_U^T\right)^{-1}\right)\\
&\ge \sigma_j\left(A_{11}\right) - \left\|\left(I + N_L^T N_L\right)^{-1}N_L^T\right\|_2\|S\|_2\left\|N_U^T\left(I + N_U N_U^T\right)^{-1}\right\|_2\\
&= \sigma_j\left(A_{11}\right) - \frac{\sigma_1^{N_L}}{1 + \left(\sigma_1^{N_L}\right)^2}\,\|S\|_2\,\frac{\sigma_1^{N_U}}{1 + \left(\sigma_1^{N_U}\right)^2}\\
&\ge \sigma_j\left(A_{11}\right) - \frac{1}{4}\|S\|_2.
\end{aligned}$$
To see that Theorems 12 and 19 are not redundant, consider the matrices
$$A_1 = \begin{pmatrix} -1.0722 & 1.4367 & -1.2078 \\ 0.9610 & -1.9609 & 2.9080 \\ 0.1240 & -0.1977 & 0.8252\end{pmatrix} \quad\text{and}\quad A_2 = \begin{pmatrix} -0.1303 & 0.8620 & -0.8487 \\ 0.1837 & -1.3617 & -0.3349 \\ -0.4762 & 0.4550 & 0.5528\end{pmatrix}.$$
Perform rank-2 truncated LU factorizations on each without pivoting, and let (·)₁₁ denote the principal 2-by-2 submatrix. Both matrices satisfy the condition σ₁²(A) > 2‖S‖₂² required by Theorem 12. However,
$$\sigma_1\left(\left(A_1\right)_{11}\right) - \frac{1}{4}\left\|S_1\right\|_2 < \sqrt{\sigma_1^2\left(A_1\right) - 2\left\|S_1\right\|_2^2},$$
$$\sigma_1\left(\left(A_2\right)_{11}\right) - \frac{1}{4}\left\|S_2\right\|_2 > \sqrt{\sigma_1^2\left(A_2\right) - 2\left\|S_2\right\|_2^2}.$$
Theorem 19 could be refined for SRLU in two ways: first, an upper bound for ‖S‖ is found in Theorem 8. Observing that the singular values of a submatrix are bounded above by the singular values of the whole matrix, and using the bound on ‖S‖ for SRLU, an SRLU-specific bound becomes
$$\sigma_j\left(LMU\right) \ge \sigma_j\left(A_{11}\right)\left(1 - \frac{\gamma}{4}\frac{\sigma_{k+1}(A)}{\sigma_j(A)}\right),$$
where γ has the same bound as in Theorem 8. Second, for j = 1, a lower bound on σ_j(A₁₁) can be extracted from the proof of Theorem 7 by noting that $\sigma_1\left(A_{11}\right) \ge \left|\det\left(A_{11}\right)\right|^{1/k}$.
Comparison of Singular Value Bounds
To see how the different factorizations explored in the previous sections capture information from a data matrix A, consider the factorization in Theorem 15 and note that
$$\sigma_j(A) = \sigma_j\left(\left(Q^L\right)^T A\left(Q^U\right)^T\right) = \sigma_j\begin{pmatrix} R_{11}^L L_{11}^U + R_{12}^L L_{21}^U + R_{13}^L L_{31}^U & R_{12}^L L_{22}^U + R_{13}^L L_{32}^U & R_{13}^L L_{33}^U \\ R_{22}^L L_{21}^U + R_{23}^L L_{31}^U & R_{22}^L L_{22}^U + R_{23}^L L_{32}^U & R_{23}^L L_{33}^U \\ R_{33}^L L_{31}^U & R_{33}^L L_{32}^U & R_{33}^L L_{33}^U\end{pmatrix}.$$
Then the aforementioned factorizations can be represented as:
$$\begin{aligned}
\sigma_j\left(L_kU_k\right) &= \sigma_j\left(R_{11}^L L_{11}^U\right)\\
\sigma_j\left(L_k\widetilde{M}U_k\right) &= \sigma_j\left(R_{11}^L L_{11}^U + R_{12}^L L_{21}^U\right)\\
\sigma_j\left(L_{k+c}U_{k+c}\right) &= \sigma_j\begin{pmatrix} R_{11}^L L_{11}^U + R_{12}^L L_{21}^U & R_{12}^L L_{22}^U \\ R_{22}^L L_{21}^U & R_{22}^L L_{22}^U\end{pmatrix}\\
\sigma_j\left(L_kMU_k\right) &= \sigma_j\left(R_{11}^L L_{11}^U + R_{12}^L L_{21}^U + R_{13}^L L_{31}^U\right),
\end{aligned}$$
where $\widetilde{M}$ denotes the approximate middle factor of (4.2) and M the exact middle factor $L_k^{\dagger}AU_k^{\dagger}$.
Only the last equality captures the singular value of a submatrix (although not a submatrix of A), and so only the CUR approximation possesses monotonicity: as the rank of the approximation increases, the error cannot increase.
Consider these approximations relative to submatrices of A:
$$\sigma_j\left(A_{11}\right) \le \sigma_j\left(R_{11}^L L_{11}^U\right),$$
$$\sigma_j\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22}\end{pmatrix} \le \sigma_j\begin{pmatrix} R_{11}^L L_{11}^U + R_{12}^L L_{21}^U & R_{12}^L L_{22}^U \\ R_{22}^L L_{21}^U & R_{22}^L L_{22}^U\end{pmatrix}.$$
Note that the expressions here have been simplified by recognizing that Q^L R^L is lower triangular and L^U Q^U is upper triangular.
The LU Growth Factor
To understand the stability of the LU decomposition, we begin with a useful definition.
Definition 4. Let A ∈ R^{n×n}, and let A^{(i)} denote the result of the LU decomposition, performed in place, after i steps. The growth factor of an LU decomposition is defined as
$$\rho_n = \frac{\max_{i,j,k}\left|A^{(k)}_{i,j}\right|}{\max_{i,j}\left|A_{i,j}\right|}.$$
The growth factor can be related to the backwards stability of the LU decomposition through the following bound [34]:
$$\|\delta A\|_{\infty} \le 3\rho_n n^3\varepsilon\|A\|_{\infty},$$
where δA is a matrix defined by A + δA = LU for the numerically calculated factorization LU ≈ A, and ε is machine precision.
    LU Variant                      Bound on Element Growth Factor ρ_n
    No pivoting                     ∞
    Partial pivoting                $2^{n-1}$ [105]
    Threshold pivoting with u < 1   $\left(1 + \frac{1}{u}\right)^{n-1}$ [38]
    Rook pivoting                   $1.5\, n^{\frac{3}{4}\log(n)}$ [42]
    Complete pivoting               $n^{\frac{1}{2}}\left(2\cdot 3^{\frac{1}{2}}\cdots n^{\frac{1}{n-1}}\right)^{\frac{1}{2}} \sim c\, n^{\frac{1}{2}} n^{\frac{1}{4}\log(n)}$ [105]
    SRLU                            $\alpha^{1+\left(1+\frac{1}{2}+\cdots+\frac{1}{n/b-1}\right)}\cdot\left(\frac{n}{b}\right)^{\frac{1}{2}}\left(2^{1}3^{\frac{1}{2}}\cdots\left(\frac{n}{b}\right)^{\frac{1}{n/b-1}}\right)^{\frac{1}{2}} 2^{b-1}$

Table 5.1: Bounds for Growth Factors of LU Variants [53]
Table 5.1 compares the growth factor for several variations of the LU decomposition. [105] also shows that the bound for partial pivoting is attainable, while the bound for complete pivoting is not. Because the backwards error bound for LU with partial pivoting is exponential in n, this form of LU is not stable for all matrices. In practice, nevertheless, this bound is conservative, and LU with partial pivoting is stable for most data matrices [34]. Next, we establish bounds on element growth for a truncated LU factorization.
Theorem 20. (Deterministic LUCP Element Growth Factor Bound) If SRLU is performed on a matrix A ∈ R^{(sb)×(sb)} in blocks of size b, with b columns chosen by RRQR from the full Schur complement and then b rows chosen by LUPP, then the element growth factor is bounded by
$$\rho_{detSRLU} \le \alpha^{1+\left(1+\frac{1}{2}+\cdots+\frac{1}{n/b-1}\right)}\cdot\left(\frac{n}{b}\right)^{\frac{1}{2}}\left(2^{1}3^{\frac{1}{2}}\cdots\left(\frac{n}{b}\right)^{\frac{1}{n/b-1}}\right)^{\frac{1}{2}} 2^{b-1}. \qquad (5.7)$$
Proof. Let A^{(rb)} denote the Schur complement when rb rows and columns remain. Let p_i denote the absolute value of the pivot element chosen with i rows and columns remaining, and let $\bar{p}_r := p_{rb}\cdot p_{rb-1}\cdots p_{(r-1)b+1}$ for 1 ≤ r ≤ s. Note that
$$\left|\det\left(A^{(rb)}\right)\right| = \prod_{i=1}^{r}\bar{p}_i. \qquad (5.8)$$
Using RRQR, let
$$A^{(rb)}\Pi = Q\begin{pmatrix} R_{11}^{rb} & R_{12}^{rb} \\ & R_{22}^{rb}\end{pmatrix}.$$
For 1 ≤ i ≤ b,
$$\alpha\cdot\sigma_i\left(R_{11}^{rb}\right) \ge \sigma_i\left(A^{(rb)}\right).$$
Let $L_{11}^{rb}U_{11}^{rb} = Q\begin{pmatrix} R_{11}^{rb} \\ 0\end{pmatrix}$ be the LU factorization with partial pivoting of the b columns chosen by RRQR. Then
$$\det\left(\left(L_{11}^{rb}\right)^T L_{11}^{rb}\right) \le \left(\frac{\operatorname{trace}\left(\left(L_{11}^{rb}\right)^T L_{11}^{rb}\right)}{b}\right)^{b} \le r^{b},$$
where we have bounded the trace using the property that the entries of $L_{11}^{rb}$ are bounded in magnitude by 1; the first inequality is simply the AM-GM inequality applied to the eigenvalues. Then
$$\prod_{i=1}^{b}\sigma_i\left(Q\begin{pmatrix} R_{11}^{rb} \\ 0\end{pmatrix}\right) = \prod_{i=1}^{b}\sigma_i\left(R_{11}^{rb}\right) = \sqrt{\det\left(\left(R_{11}^{rb}\right)^T R_{11}^{rb}\right)} = \sqrt{\det\left(U_{11}^{rb}\right)\det\left(\left(L_{11}^{rb}\right)^T L_{11}^{rb}\right)\det\left(U_{11}^{rb}\right)} \le \left(\prod_{i=0}^{b-1}p_{rb-i}\right) r^{\frac{b}{2}} = \bar{p}_r\, r^{\frac{b}{2}}.$$
Consequently,
$$\left|\det\left(A^{(rb)}\right)\right| = \prod_{i=1}^{rb}\sigma_i\left(A^{(rb)}\right) \le \left(\prod_{i=1}^{b}\sigma_i\left(A^{(rb)}\right)\right)^{r} \le \left(\alpha^{b}\, r^{\frac{b}{2}}\,\bar{p}_r\right)^{r}. \qquad (5.9)$$
Much of the following argument is modeled after the bound derived by Wilkinson in [105]. Define
$$q_i := \log \bar{p}_i.$$
Taking logarithms of lines (5.9) and (5.8):
$$\sum_{i=1}^{r-1} q_i \le \frac{rb}{2}\log\left(r\alpha^2\right) + (r-1)\,q_r \qquad (5.10)$$
and
$$\sum_{i=1}^{s} q_i = \log\left|\det\left(A^{(sb)}\right)\right|. \qquad (5.11)$$
For r = 2, ..., s − 1, note that
$$\frac{1}{r(r-1)} + \frac{1}{(r+1)r} + \cdots + \frac{1}{(s-1)(s-2)} + \frac{1}{s-1} = \frac{1}{r-1}.$$
Let $h(k) := 1 + \frac{1}{2} + \cdots + \frac{1}{k}$. Summing bound (5.10) divided by r(r − 1) for r = 2, ..., s − 1 and adding (5.11) divided by s − 1 yields
$$\begin{aligned}
q_1 + \frac{q_s}{s-1} &\le \sum_{r=2}^{s-1}\left[b\log\left(\alpha^{\frac{1}{r-1}}\right) + \frac{b}{2}\log\left(r^{\frac{1}{r-1}}\right)\right] + \frac{1}{s-1}\log\left|\det A^{(sb)}\right|\\
&= \frac{b}{2}\log\left(\alpha^{2h(s-2)}\cdot 2^{1}3^{\frac{1}{2}}\cdots(s-1)^{\frac{1}{s-2}}\right) + \frac{1}{s-1}\log\left|\det A^{(sb)}\right|\\
&\le \frac{b}{2}\log\left(\alpha^{2h(s-2)}\cdot 2^{1}3^{\frac{1}{2}}\cdots(s-1)^{\frac{1}{s-2}}\right) + \frac{s}{s-1}\,q_s + \frac{sb}{2(s-1)}\log\left(s\alpha^2\right).
\end{aligned}$$
Let $g(k) := \left(\alpha^{2h(k-1)}\cdot 2^{1}3^{\frac{1}{2}}\cdots k^{\frac{1}{k-1}}\right)^{\frac{b}{2}}$. Then
$$q_1 - q_s \le \log\left(g(s)\right) + \frac{b}{2}\log\left(s\alpha^2\right),$$
and so
$$\frac{\bar{p}_1}{\bar{p}_s} \le s^{\frac{b}{2}}\,\alpha^{b}\,g(s).$$
Note that within a block, the LUPP growth factor bound applies, and so
$$\bar{p}_s \le 2^{\frac{b(b-1)}{2}}\, p_{sb}^{\,b}.$$
Let $p_{\min} = \min_{1\le i\le b} p_i$. Then, because LUPP is performed on the last block:
$$p_1^{b} \le 2^{\frac{b(b-1)}{2}}\, p_{\min}^{b} \le 2^{\frac{b(b-1)}{2}}\,\bar{p}_1.$$
Hence
$$\frac{p_1}{p_n} \le \left(2^{\frac{b(b-1)}{2}+\frac{b(b-1)}{2}}\, s^{\frac{b}{2}}\,\alpha^{b}\, g(s)\right)^{\frac{1}{b}} = 2^{b-1} s^{\frac{1}{2}}\alpha\left(g(s)\right)^{\frac{1}{b}} = \alpha\left(\frac{n}{b}\right)^{\frac{1}{2}}\left(\alpha^{2\left(1+\frac{1}{2}+\cdots+\frac{1}{n/b-1}\right)}\cdot 2^{1}3^{\frac{1}{2}}\cdots\left(\frac{n}{b}\right)^{\frac{1}{n/b-1}}\right)^{\frac{1}{2}} 2^{b-1}.$$
Bound (5.7) is a combination of the growth factor bounds of LUCP and LUPP, because each block iteration of the algorithm performs both row and column swaps, but within each block LUPP is performed. Indeed, the term $\left(\frac{n}{b}\right)^{\frac{1}{2}}\left(2^{1}3^{\frac{1}{2}}\cdots\left(\frac{n}{b}\right)^{\frac{1}{n/b-1}}\right)^{\frac{1}{2}}$ is due to the n/b blocks where both row and column pivoting is performed, and the term $2^{b-1}$ reflects the LUPP growth bound from within each block factorization. The term $\alpha^{1+\left(1+\frac{1}{2}+\cdots+\frac{1}{n/b-1}\right)}$ stems from the quality of the QR factorization over the n/b blocks where QR is performed for column selection. This term will change if a different column selection subroutine is implemented.
Although not explicitly derived, equality is not achievable in Theorem 20. The component of the bound that corresponds to LUCP leads to the lack of sharpness, which is proved in [105].
5.2 Comparison of SRLU Factorizations with RRLU and RRQR Factorizations
In previous work on rank-revealing factorizations, such as [80], a bound on the quality of A₁₁ is established, while in this work bounds on LU are established. Because A₁₁ is a submatrix of A, its singular values are bounded above by the corresponding singular values of A. Thus, guaranteeing that the singular values of this submatrix are close to those of the original matrix implies that the submatrix captures most of the "mass" of the original matrix. Note, however, that there is no clear interpretation of how A₁₁ can approximate the original matrix, as they are different sizes. In this work, which seeks to produce high-quality low-rank approximations, the focus above is on the quality of the full approximation LU, and not on submatrices. Nevertheless, lower bounds on the singular values of the submatrix A₁₁ can be obtained using the analysis above in Theorem 7 by observing that the determinant is the product of the singular values.
5.3 Fast SRLU
Note that the test (5.1) requires knowledge of the largest entry of the Schur complement. The efficiency of SRLU, nevertheless, is largely due to avoiding calculating the Schur complement. A weaker but faster SRLU factorization can be calculated by avoiding computation of the Schur complement during Spectrum-Revealing Pivoting. A faster correction strategy is to find the column of R, the random projection of the Schur complement, with largest norm, and then pick the largest entry in the corresponding column of the Schur complement. Calculating the column norms of R, updating a single column of S, and finding the largest entry in that column are significantly faster calculations than updating the whole Schur complement.
Chapter 6
Numerical Experiments with SRLU
Although it is not possible to test every aspect of SRLU, several of the most important features of SRLU are examined experimentally in this chapter.
6.1 Speed and Accuracy Tests
In Figure 6.1, the accuracy of our method is compared to the accuracy of the truncated SVD. Note that SRLU did not perform any swaps in these experiments. "CUR" is the CUR version of the output of SRLU. Note that both methods exhibit a convergence rate similar to that of the truncated SVD (TSVD), and so only a constant amount of extra work is needed to achieve the same accuracy. When the singular values decay slowly, the CUR decomposition provides a greater accuracy boost. In Figure 6.2, the runtime of SRLU is compared to that of the truncated SVD, as well as Subspace Iteration [50]. Note that for Subspace Iteration, we choose iteration parameter q = 0 and do not measure the time of applying the random projection, in acknowledgement that fast methods exist to apply a random projection to a data matrix. All numeric experiments were run on NERSC's Edison. For timing experiments, the truncated SVD is calculated with PROPACK.
6.2 Efficiency Tests
As a follow-up to the benchmarking tests in the previous section, the performance of SRLU is now benchmarked against hardware parameters. Because the most computationally expensive steps in SRLU are all matrix-matrix multiplication, a finely tuned operation, SRLU has the potential to utilize near-peak hardware performance when implemented carefully (e.g., using communication-avoiding blocking).
The performance of TRLUCP is examined in Figure 6.3 by comparing the approximate number of floating point operations, 2mnp + (m + n)k², against the time of calculation for square matrices of various sizes. Indeed, TRLUCP appears to scale linearly with the number of floating point operations. For a matrix of size 10,000 by 10,000, TRLUCP is 86% efficient
Figure 6.1: Accuracy experiment on random 1000-by-1000 matrices with different rates of spectral decay.
Figure 6.2: Time experiment on random matrices of varying sizes, and a time experiment on a 1000-by-1000 matrix with varying truncation ranks.
on NERSC's Edison [40]. The result of LAPACK's DGEMM is included as well. Both TRLUCP and DGEMM are truncated to rank 300.
6.3 Towards Feature Selection
An image processing example is now presented to illustrate the benefit of highlighting important rows and columns. In Figure 6.4 an image is compressed to a rank-50 approximation using SRLU. Note that the rows and columns chosen overlap with the astronaut and the planet, implying that minimal storage is needed to capture the black background, which composes approximately two thirds of the image. While this result cannot be called feature selection per se, the rows and columns selected highlight where to look for features: rows and/or columns are selected in a higher density around the astronaut, the curvature of the planet, and the storm front on the planet.
Figure 6.3: Efficiency experiment on random matrices of varying sizes compared to peak hardware performance.
6.4 Sparsity Preservation Tests
The SRLU factorization is tested on sparse, unsymmetric matrices from [33] in Figure 6.5. Figure 6.6 shows the sparsity patterns of the factors of an SRLU factorization of a sparse data matrix representing a circuit simulation (oscil dcop), as well as a full LU decomposition of the data. Note that the truncated LU decomposition preserves the sparsity of the data initially, but the full LU decomposition becomes dense.
6.5 Online Data Processing
Online SRLU is tested here on the Enron email corpus [70]. The documents were initially reverse-sorted by the usage of the most common word, and then reverse-sorted by the second most, and this process was repeated for the five most common words (the top five words were used significantly more than any other), so that the most common words occur most at the end of the corpus. The cumulative frequencies of the top 5 words in the Enron email corpus (after reordering) are plotted in Figure 6.7.
The data contains 39,861 documents and 28,102 words, and an initial SRLU factorization of rank 20 was performed on the first 30K documents. The initial factorization contained none
Figure 6.4: Image processing example. The original image [81], a rank-50 approximation with SRLU, and a highlight of the rows and columns selected by SRLU.
(a) Original circuit simulation data matrix. (b) Depiction of corresponding circuit.
Figure 6.5: Circuit Simulation Data.
of the top five words, but, after adding the remaining documents and updating, the top three were included in the approximation. To understand why the last two may have been missed, the correlation matrix of the top 5 words is computed:
                 power   company  energy  market  california
    power        1       0.40     0.81    0.51    0.78
    company      0.40    1        0.42    0.57    0.28
    energy       0.81    0.42     1       0.51    0.78
    market       0.51    0.57     0.51    1       0.48
    california   0.78    0.23     0.78    0.48    1
The fourth and fifth words, 'market' and 'california', have high correlation with at least two of the three top words, and so their inclusion may be redundant in a low-rank approximation.
(a) The L and U matrices of a rank 43 SRLU factorization (with updated Schur complement). The green entries compose a low-rank approximation of the data. Red entries are the additional data needed for an exact factorization.

(b) The L and U matrices of a full LU factorization.

Figure 6.6: Sparse Data Processing Example with Circuit Simulation Data.
6.6 Pathological Test Matrix
Here, we test SRLU on the Devil's Stairs [98], a test matrix with a singular value rate of decay that may be difficult for a low-rank approximation algorithm to capture. The Devil's Stairs matrix used here is a randomly generated 100-by-100 matrix with the first 20 singular values equal to 1, after which each following set of 20 singular values has 1/10th the value of the preceding 20 singular values. The singular values of various rank approximations with SRLU are plotted against the singular values of this test matrix in Figure 6.8. The SRLU factorization does not appear to commit any serious errors for any truncation rank. Some singular values are overestimated and some are underestimated, but all appear to converge correctly as the target rank increases, with the leading singular values converging rapidly, and the smaller singular values with a bounded relative error.
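The construction just described is easy to reproduce; the sketch below generates a matrix with the stated singular value profile, assuming random orthogonal factors (the exact test matrix of [98] may be defined differently).

```python
import numpy as np

def devils_stairs(n=100, step=20, decay=0.1, seed=0):
    """Random n x n matrix whose singular values drop by a factor `decay` every `step` values."""
    rng = np.random.default_rng(seed)
    sigma = decay ** (np.arange(n) // step)              # 1,...,1, 0.1,...,0.1, 0.01,...
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))     # random orthogonal factors
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return U @ np.diag(sigma) @ V.T

A = devils_stairs()
print(np.linalg.svd(A, compute_uv=False)[:25])           # shows the first "stair" and the drop
```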
Figure 6.7: The cumulative uses of the top five most commonly used words in the Enron email corpus after reordering.
6.7 Image Compression Example
Although sophisticated algorithms exist specifically for high quality image compression, an experiment is explored here to demonstrate how oversampling can make up for quality that may be lost, relative to SVD approximations, when using methods such as SRLU. Figure 6.9 shows an experiment where various factorizations are used to compress an image.
The original image [76], of rank 480, is compressed using the truncated SVD to rank 100, a high quality approximation. A rank-100 approximation with TRLUCP exhibits a clear image, although with reduced quality. A rank-200 approximation with TRLUCP appears to match the quality of the previous SVD approximation. For comparison, a rank-150 approximation with QRCP also appears to match the quality of the SVD compression.
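For reference, the truncated-SVD side of this comparison can be reproduced in a few lines; a minimal sketch follows, assuming a grayscale image stored as a 2-D array (the TRLUCP and QRCP approximations would replace the factorization step).

```python
import numpy as np

def svd_compress(img, k):
    """Rank-k truncated SVD approximation of a 2-D image array."""
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# img: hypothetical grayscale image as a float array, e.g. loaded with
# matplotlib.image.imread and averaged down to a single channel.
img = np.random.rand(480, 640)        # placeholder in lieu of the photograph in [76]
approx = svd_compress(img, 100)
rel_err = np.linalg.norm(img - approx) / np.linalg.norm(img)
print(f"relative Frobenius error of the rank-100 SVD: {rel_err:.3f}")
```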
6.8 Testing Quality Controls
Tightness of the Theoretical Bounds
Here, we test the dimension-dependent constants that appear in the theorems in the theoretical analysis of SRLU. Although the data matrix does affect the size of the constant seen in practice, Table 6.1 summarizes the maximum constants seen with 1,000-by-1,000 random matrices with spectra decaying at rates 0.9, 0.8, and 0.7. In the case of spectral bounds, the maximums were taken over the top 10 singular values. Each experiment was performed 10 times and averaged. The truncation rank is k = 100.
The Sensitivity of SRLU to the Tuning Parameter f
Here, we test the average number of swaps needed as the tuning parameter f becomes small. In practice, f ≈ 10 is recommended. In many real-world datasets, no swaps will be needed. Table 6.2 shows the average number of swaps needed for small values of f for a random 1000-by-1000 matrix with a singular value decay rate of 0.9. The truncation rank is k = 100.

(a) Rank-10 approximation. (b) Rank-30 approximation.
(c) Rank-50 approximation. (d) Rank-90 approximation.

Figure 6.8: Singular values of SRLU factorizations of various ranks (red) versus the singular values of the Devil's Stairs matrix (blue).
Table 6.1: Mean values of the constants from the theorems presented in this work, for various random matrices. Constants for spectral theorems are averaged over the top 10 singular values. TRLUCP was used, and no swaps were needed, so SRLU results match TRLUCP.
Constant   Theorem          Mean
γ          8 (1st ineq.)    7.97
γ          8 (2nd ineq.)    0.04
τ          9                0.16
γ          11               1.57
ω          11               0.24
γ          12               0.40
Table 6.2: Average number of swaps needed on a random 1000-by-1000 matrix for various small values of f.
f      Average number of swaps
2.1    0.7
1.8    2.0
1.5    4.4
1.2    8.8
(a) Original image and a rank-100 truncated SVD approximation.
(b) A rank-100 approximation with TRLUCP and a rank-200 approximation with TRLUCP.
(c) A rank-150 approximation with QRCP.
Figure 6.9: Image compression experiment with various factorizations. From left to right: James Wilkinson, Wallace Givens, George Forsythe, Alston Householder, Peter Henrici, and Friedrich Bauer. (Gatlinburg, Tennessee, 1964.)
Part III
Unweighted Graph Sparsification
Chapter 7
Unweighted Column Selection
Part III introduces a new low-rank approximation algorithm. This algorithm, called Unweighted Column Selection, calculates a sparse graph (called a sparsifier) that accurately approximates the original graph in some sense. A previous result, [12], uses an algorithm that is based on a "purely linear algebraic theorem". Similarly, the algorithm presented here is built on linear algebra theory.
7.1 Introduction
Spectral graph sparsification has emerged as a powerful tool in the analysis of large-scale networks by reducing the overall number of edges, while maintaining a comparable graph Laplacian matrix. In this chapter, we present an efficient algorithm for the construction of a new type of spectral sparsifier, the unweighted spectral sparsifier. Given a general undirected and unweighted graph G = (V, E), and an integer ℓ < |E| (the number of edges in E), we compute an unweighted graph H = (V, F) with F ⊂ E and |F| = ℓ such that for every x ∈ R^V

\frac{x^T L_G x}{\kappa} \le x^T L_H x \le x^T L_G x,

where L_G and L_H are the Laplacian matrices for G and H, respectively, and κ ≥ 1 is a slowly-varying function of |V|, |E| and ℓ. This work addresses the open question of the existence of unweighted graph sparsifiers for unweighted graphs [12]. Additionally, our algorithm efficiently computes unweighted graph sparsifiers for weighted graphs, leading to sparsified graphs that retain the weights of the original graphs. A version of this chapter appears in [7].
7.2 Background
Graph sparsification seeks to approximate a graph G with a graph H on the same vertices, but with fewer edges. Called a sparsifier, H requires less storage than G and serves as a proxy for G in computations where G is too large, as evidenced by the effectiveness of sparsifiers in wide-ranging applications of graphs, including social networks, conductance, electrical networks, and similarity [26, 28, 75, 92]. In some applications, graph sparsification improves the quality of the graph, such as in the design of information networks and the hardwiring of processors and memory in parallel computers [13, 66]. Sparsifiers have also been utilized to find approximate solutions of symmetric, diagonally-dominant linear systems in nearly-linear time [13, 92, 93, 94].
Recent work on graph sparsification includes [2, 63, 91, 92, 95]. Batson, Spielman, and Srivastava [12] prove that for every graph there exists a spectral sparsifier where the number of edges is linear in the number of vertices. They further provide a polynomial-time, deterministic algorithm for the sparsification of weighted graphs, which could produce weights that differ greatly from the weights of the original graph. The work of Avron and Boutsidis [9] explores unweighted sparsification in the context of finding low-stretch spanning trees. They provide a greedy edge removal algorithm and a volume sampling algorithm with theoretical guarantees. In comparison, our novel greedy edge selection algorithm has tighter theoretical bounds for both spanning trees and in the more general context of unweighted graph sparsification.
Our work introduces a deterministic, greedy edge selection algorithm to calculate sparsifiers for weighted and unweighted graphs. Our algorithm selects a subset of edges for the sparse approximation H, without assigning or altering weights. While the Dual Set algorithms of [9, 12, 17] reweight all selected edges for computing weighted sparsifiers, our algorithm produces unweighted sparsifiers for an unweighted input graph, and can create a weighted sparsifier for a weighted input graph by assigning the original edge weights to the sparsifier. Hence our concept of unweighted sparsification applies to both unweighted and weighted graphs. To formalize:
Definition 5. Let G = (V, E, w) be a given graph¹. We define an unweighted sparsification of G to be any graph of the form H = (V, F, w ∘ I_F), where

I_F(e) = \begin{cases} 1, & \text{if } e \in F \\ 0, & \text{otherwise} \end{cases}

is the indicator function and ∘ is the Hadamard product, i.e.

(w ∘ I_F)(e) = \begin{cases} w_e, & \text{if } e = (u, v) \in F \\ 0, & \text{otherwise.} \end{cases}
Several definitions have been proposed for the notion in which a sparsifier approximates a dense graph. Benczúr and Karger [14] introduced cut sparsification, where the sum of the weights of the edges of a cut dividing the set of vertices is approximately the same for the dense graph and the sparsifier. Spielman and Teng [95] proposed spectral sparsification, a generalization of cut sparsification, which seeks sparsifiers with a Laplacian matrix close to that of the input graph. We follow the work of [12, 95] and base our work on spectral sparsification, for which we now present a rigorous definition.

¹ Note that any unweighted graph G = (V, E) induces a weighted graph G = (V, E, w) where w_e = 1 if e = (u, v) ∈ E and w_e = 0 otherwise.
Given an undirected graph G = (V, E, w), define the signed edge-vertex incidence matrix B_G ∈ R^{E×V} as

(B_G)_{ej} = \begin{cases} -1, & \text{if } e = (u, v) \in E \text{ and } j = u \in V \\ 1, & \text{if } e = (u, v) \in E \text{ and } j = v \in V \\ 0, & \text{otherwise,} \end{cases}

where all edges are randomly assigned a direction, and e = (u, v) ∈ E is an edge from u to v. Define the diagonal weight matrix W_G ∈ R^{E×E} by

(W_G)_{ef} = \begin{cases} w_e, & \text{if } e = f \in E \\ 0, & \text{otherwise.} \end{cases}
The Laplacian of the graph is L_G = B_G^T W_G B_G. Note that

x^T L_G x = \sum_{(u,v) \in E} w_{(u,v)} (x_u - x_v)^2

for a vector x ∈ R^V. To compare Laplacians of graphs X and Y defined on the same set of nodes we denote

L_X ⪯ L_Y if and only if x^T L_X x ≤ x^T L_Y x, for all x.

Definition 6. The graph H is a κ-approximation of G if

\frac{1}{\kappa} L_G \preceq L_H \preceq L_G.
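To make the definitions above concrete, the short sketch below builds B_G, W_G, and L_G for a small, arbitrary weighted graph and checks the quadratic-form identity x^T L_G x = Σ_{(u,v)∈E} w_{(u,v)}(x_u − x_v)²; the edge list is a made-up example, not data from this chapter.

```python
import numpy as np

# Example graph on 4 vertices with weighted edges (u, v, w_e).
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.5), (0, 3, 0.5)]
n, m = 4, len(edges)

B = np.zeros((m, n))                      # signed edge-vertex incidence matrix
W = np.zeros((m, m))                      # diagonal weight matrix
for e, (u, v, w) in enumerate(edges):
    B[e, u], B[e, v] = -1.0, 1.0          # arbitrary orientation of each edge
    W[e, e] = w

L = B.T @ W @ B                           # graph Laplacian L_G = B^T W_G B

x = np.random.randn(n)
quad = sum(w * (x[u] - x[v]) ** 2 for (u, v, w) in edges)
assert np.isclose(x @ L @ x, quad)        # x^T L x equals the weighted sum of squared differences
```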
Because our unweighted sparsification algorithm does not change the weights of the edges kept in H, it is immediate that L_H ⪯ L_G:

Proposition 1. If H is an unweighted sparsification of G, then L_H ⪯ L_G.

Proof.

x^T L_H x = \sum_{(u,v) \in F} w_{(u,v)} (x_u - x_v)^2 \le \sum_{(u,v) \in F} w_{(u,v)} (x_u - x_v)^2 + \sum_{(u,v) \in E \setminus F} w_{(u,v)} (x_u - x_v)^2 = x^T L_G x

for all x ∈ R^V.
Our algorithm does not operate directly on the Laplacian matrix. Rather, we consider the SVD of W_G^{1/2} B_G:

W_G^{1/2} B_G = U_G^T \Sigma_G V_G, \qquad (7.1)

where Σ_G is a diagonal matrix containing all non-zero singular values of W_G^{1/2} B_G, and where U_G ∈ R^{n×m} is a row-orthonormal matrix, with n = |V| − r, and r being the number of connected components in G. For an unweighted graph, W_G is simply the identity matrix. U_G plays a similar role to that of the matrix V_{n×m} in [12] and the matrix Y in [9]. Our algorithm utilizes the column-orthogonality of U_G^T, highlighting the reason for not working directly with the Laplacian matrix. We note, nevertheless, that this algorithm can be adapted to any orthogonal decomposition of W_G^{1/2} B_G.
7.3 An Unweighted Column Selection Algorithm
Our algorithm selects edges for a sparsifier based on the columns u_i of U_G,

U_G = (u_1 \; u_2 \; \cdots \; u_m) \in R^{n \times m},

where m = |E| is the number of edges, and n = |V| − r, as above. Therefore, the edges of G that are included in the sparsifier are exactly the columns of U_G that our algorithm selects. Denote the number of edges kept as ℓ := |F|. Let Π_t denote the set of selected edges after t iterations.

We propose the following greedy algorithm for column selection on U_G. Initially set A_0 = 0_{n×n} and Π_0 = ∅, and choose a constant T > 0. At step t ≥ 0:
• Solve for the unique λ < λ_min(A_t) such that

\mathrm{tr}(A_t - \lambda I)^{-1} = T. \qquad (7.2)

• Solve for the unique λ̄ ∈ (λ, λ_min(A_t)) such that

(\bar\lambda - \lambda)\left(m - t + \sum_{j=1}^{n} \frac{1-\lambda_j}{\lambda_j - \lambda}\right) = \frac{\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)}}{\sum_{j=1}^{n} \frac{1}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)}}, \qquad (7.3)

where λ_j is the jth largest eigenvalue of the symmetric matrix A_t.

• Find an index i ∉ Π_t such that

\mathrm{tr}\left(A_t - \bar\lambda I + u_i u_i^T\right)^{-1} \le \mathrm{tr}(A_t - \lambda I)^{-1}. \qquad (7.4)

• Update A_t and Π_t.
While equations (7.2) and (7.3) are relatively straightforward to justify and solve, equation (7.4) requires careful consideration, and is the focus of much of Section 7.4. Note that equation (7.2) can be solved in O(n³) operations, equation (7.3) in O(n) operations, and equation (7.4) in O(n²m) operations. This last complexity count follows because testing the inequality scales with O(n²), and potentially all remaining indices must be tested. Thus the total complexity of selecting ℓ columns is O(ℓn²m).
While this procedure will work for any T > 0, we will show that an effective choice is

T = \bar{T}^{*}\left(1 + F(\bar{T}^{*})\right),

where

F(\bar{T}) = \frac{\ell\left(1 - \frac{n}{\bar{T}}\right)}{m - \frac{\ell-1}{2} + \bar{T} - n} - \frac{n}{\bar{T}},

and where \bar{T}^{*} is the maximizer of F(\bar{T}), given as

\bar{T}^{*} = \frac{n\left(m + \frac{\ell+1}{2} - n\right) + \sqrt{n\ell\left(m - \frac{\ell-1}{2}\right)\left(m + \frac{\ell+1}{2} - n\right)}}{\ell - n}.

Our spectral bounds are derived using this choice of T. We summarize this procedure in the Unweighted Column Selection algorithm.
Algorithm: Unweighted Column Selection (UCS)

Inputs: G = (V, E, w), T > 0, ℓ.
Outputs: H_uw = (V, F, w ∘ I_F)

 1: Calculate the column-orthogonal matrix U_G^T
 2: Set A_0 = 0_{n×n}, Π_0 = ∅
 3: for t = 0, ···, ℓ − 1 do
 4:    Solve for λ using equation (7.2)
 5:    Calculate λ̄ using equation (7.3)
 6:    Find i ∉ Π_t such that inequality (7.4) is satisfied
 7:    Update A_{t+1} = A_t + u_i u_i^T
 8:    Update Π_{t+1} = Π_t ∪ {i}
 9: end for
10: Let F = Π_ℓ be the selected edges
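The greedy loop above can be prototyped directly from equations (7.2)-(7.4); the following is a minimal sketch under the stated notation, assuming NumPy/SciPy and a row-orthonormal U supplied by the caller. The function names recommended_T and ucs_select are illustrative, not the original implementation, and the root brackets passed to brentq are crude choices made only for illustration.

```python
import numpy as np
from scipy.optimize import brentq

def recommended_T(n, m, ell):
    """T = Tbar*(1 + F(Tbar*)), with Tbar* the maximizer of F from Lemma 5."""
    F = lambda Tb: ell * (1 - n / Tb) / (m - (ell - 1) / 2 + Tb - n) - n / Tb
    Tb_star = (n * (m + (ell + 1) / 2 - n)
               + np.sqrt(n * ell * (m - (ell - 1) / 2) * (m + (ell + 1) / 2 - n))) / (ell - n)
    return Tb_star * (1 + F(Tb_star))

def ucs_select(U, ell, T):
    """Greedy column selection on U (a sketch of the UCS loop).

    U: (n, m) row-orthonormal matrix whose columns u_i correspond to edges.
    Returns the indices of the ell selected columns.
    """
    n, m = U.shape
    A = np.zeros((n, n))
    chosen = []
    for t in range(ell):
        eigs = np.sort(np.linalg.eigvalsh(A))        # eigenvalues of A_t, ascending
        lam_min = eigs[0]
        # (7.2): the unique lam < lam_min(A_t) with tr(A_t - lam I)^{-1} = T.
        g = lambda lam: np.sum(1.0 / (eigs - lam)) - T
        lam = brentq(g, lam_min - 1e6, lam_min - 1e-12)
        # (7.3): lam_bar in (lam, lam_min(A_t)) is the zero of f from Lemma 1.
        def f(x):
            num = np.sum((1.0 - eigs) / ((eigs - lam) * (eigs - x)))
            den = np.sum(1.0 / ((eigs - lam) * (eigs - x)))
            return (x - lam) * (m - t + np.sum((1.0 - eigs) / (eigs - lam))) - num / den
        lam_bar = brentq(f, lam + 1e-12, lam_min - 1e-12)
        # (7.4): accept a fresh column with tr(A_t - lam_bar I + u u^T)^{-1} <= T.
        for i in range(m):
            if i in chosen:
                continue
            u = U[:, i]
            cand = np.trace(np.linalg.inv(A - lam_bar * np.eye(n) + np.outer(u, u)))
            if cand <= T:
                chosen.append(i)
                A += np.outer(u, u)
                break
    return chosen
```

For an unweighted graph, U here would be obtained from a thin SVD (or any orthogonal factorization) of B_G, and the returned indices give the edge set F.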
Theorem 21 below confirms the correctness of the Unweighted Column Selection Algorithm. This theorem, along with other properties of the UCS algorithm, will be discussed and proved in Section 7.4.

Theorem 21. Let G = (V, E, w) and let n < ℓ < m. Then the sparsified graph H produced by the UCS algorithm satisfies

\frac{1}{\kappa} L_G \preceq L_H \preceq L_G,

where

\frac{1}{\kappa} = \frac{(\ell - n)^2}{\left(\sqrt{n\left(m + \frac{\ell+1}{2} - n\right)} + \sqrt{\ell\left(m - \frac{\ell+1}{2}\right)}\right)^2 + (\ell - n)^2}. \qquad (7.5)
7.4 Correctness and Performance of the UCS Algorithm

The goal of this section is to prove Theorem 21. We first establish that the UCS algorithm is well-defined. We then prove a lower bound for the minimum singular value of the submatrix selected by the UCS algorithm, and provide a good choice for the input parameter T. Finally, the UCS algorithm is shown to be a graph sparsification algorithm.
The Existence of a Solution to Equation (7.4)
The next two lemmas show that equation (7.4) always has a solution.
Lemma 1. At a given iteration t in the UCS algorithm, at step 6 define

f(x) := (x - \lambda)\left[m - t + \sum_{j=1}^{n} \frac{1-\lambda_j}{\lambda_j - \lambda}\right] - \frac{\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-x)}}{\sum_{j=1}^{n} \frac{1}{(\lambda_j-\lambda)(\lambda_j-x)}}.

Then there exists λ̄, with λ < λ̄ < λ_n, such that f(λ̄) = 0. Furthermore,

0 < (\bar\lambda - \lambda)\,\frac{\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)^2}}{\sum_{j=1}^{n} \frac{1}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)}} - (\bar\lambda - \lambda)\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)}. \qquad (7.6)
Proof. Clearly f(λ) < 0. Although f is undefined at λ_n, let λ_n^ε := λ_n − ε, where ε > 0. Note that

\lim_{\varepsilon \to 0^+} \left(\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\lambda_n^{\varepsilon})}\right) \Big/ \left(\sum_{j=1}^{n} \frac{1}{(\lambda_j-\lambda)(\lambda_j-\lambda_n^{\varepsilon})}\right) = 1 - \lambda_n

because the last term in each sum will dominate the rest of the sum. Furthermore,

\lim_{\varepsilon \to 0^+} (\lambda_n^{\varepsilon} - \lambda)\left[m - t + \sum_{j=1}^{n} \frac{1-\lambda_j}{\lambda_j - \lambda}\right] = 1 - \lambda_n + \lim_{\varepsilon \to 0^+} (\lambda_n^{\varepsilon} - \lambda)\left[m - t + \sum_{j=1}^{n-1} \frac{1-\lambda_j}{\lambda_j - \lambda}\right] > 1 - \lambda_n.

Hence for small ε > 0, we have f(λ_n^ε) > 0, and, therefore, λ̄ exists, with λ < λ̄ < λ_n and f(λ̄) = 0, via the Intermediate Value Theorem. Note that if there exists 0 < γ < n such that λ_γ = λ_{γ+1} = ··· = λ_n, then we repeat the same argument replacing the expression 1 − λ_n with \sum_{j=\gamma}^{n} (1 - \lambda_j) = (n - \gamma + 1)(1 - \lambda_n).

Now we prove inequality (7.6). We use the following version of the Cauchy-Schwarz formula: for a_j, b_j ≥ 0, (\sum a_j b_j)^2 \le (\sum a_j^2 b_j)(\sum b_j). Consequently

\left(\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)}\right)^2 \le \left(\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)^2}\right)\left(0 + \sum_{j=1}^{n} \frac{1-\lambda_j}{\lambda_j-\lambda}\right)
< \left(\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)^2}\right)\left(m - t + \sum_{j=1}^{n} \frac{1-\lambda_j}{\lambda_j-\lambda}\right)
= \frac{1}{\bar\lambda - \lambda}\left(\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)^2}\right)\frac{\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)}}{\sum_{j=1}^{n} \frac{1}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)}},

where the last step comes from f(λ̄) = 0. The strict inequality above holds because m − t ≥ m − ℓ + 1 ≥ 1. After some simple algebra,

(\bar\lambda - \lambda)\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)} < \left(\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)^2}\right)\Big/\left(\sum_{j=1}^{n} \frac{1}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)}\right),

which implies our desired inequality because 0 < λ̄ − λ.
Next, we show that our algorithm is well defined in the sense that we can always find a new index i ∉ Π_t at each iteration satisfying line 6 of the UCS algorithm.

Lemma 2. An index i ∉ Π_t can always be found to satisfy line 6 of the UCS algorithm for 0 ≤ t < ℓ.
Proof. Note the two following partial fraction results
λ− λ(λj − λ
)(λj − λ)
=1
λj − λ− 1
λj − λ(7.7)
λ− λ(λj − λ
)(λj − λ)2
+1(
λj − λ)
(λj − λ)=
1(λj − λ
)2 . (7.8)
Using the fact that f(λ) = 0, followed by the inequality of Lemma 1, we have
(λ− λ
)[m− t+
n∑j=1
1− λjλj − λ
]
=
n∑j=1
1− λj(λj − λ
)(λj − λ)
/ n∑j=1
1(λj − λ
)(λj − λ)
+ 0
<
n∑j=1
1− λj(λj − λ
)(λj − λ)
/ n∑j=1
1(λj − λ
)(λj − λ)
+(λ− λ
)∑nj=1
1−λj(λj−λ)(λj−λ)2∑n
j=11
(λj−λ)(λj−λ)
−(λ− λ
) n∑j=1
1− λj(λj − λ
)(λj − λ)
=
(λ− λ) n∑j=1
1− λj(λj − λ
)(λj − λ)2
+n∑j=1
1− λj(λj − λ
)(λj − λ)
/ n∑
j=1
1(λj − λ
)(λj − λ)
− (λ− λ)2n∑j=1
1− λj(λj − λ
)(λj − λ)
=
∑nj=1
1−λj(λj−λ)
2∑nj=1
1
(λj−λ)(λj−λ)
−(λ− λ
)( n∑j=1
1− λjλj − λ
−n∑j=1
1− λjλj − λ
),
where the last line follows from equations (7.7) and (7.8). After some rearranging:(m− t+
n∑j=1
1− λjλj − λ
) n∑j=1
λ− λ(λj − λ
)(λj − λ)
<
n∑j=1
1− λj(λj − λ
)2 .
This inequality can be rewritten using the trace property tr(xyT
)= yTx and the identity
∑i/∈Πt
uiuTi =
m∑i=1
uiuTi −
∑i∈Πt
uiuTi = In − At:
(∑i 6∈Πt
1 + uTi
(At − λI
)−1
ui
)(tr(At − λI
)−1
− tr (At − λI)−1
)
=
(m− t+
∑i 6∈Πt
tr
[(At − λI
)−1
uiuTi
])( n∑j=1
1
λj − λ−
n∑j=1
1
λj − λ
)
=
(m− t+ tr
[(At − λI
)−1
(I − At)]) n∑
j=1
λ− λ(λj − λ
)(λj − λ)
=
(m− t+
n∑j=1
1− λjλj − λ
) n∑j=1
λ− λ(λj − λ
)(λj − λ)
<
n∑j=1
1− λj(λj − λ
)2
= tr
((At − λI
)−2
(I − At))
=∑i 6∈Πt
uTi
(At − λI
)−2
ui.
Moving terms to the right and dividing by
(tr(At − λI
)−1
− tr (At − λI)−1
)> 0 (because
λ > λ) gives
∑i 6∈Πt
uTi
(At − λI
)−2
ui
tr(At − λI
)−1
− tr (At − λI)−1−(
1 + uTi
(At − λI
)−1
ui
) > 0.
For this to be true, there must exist an i 6∈ Πt such that uTi
(At − λI
)−2
ui
tr(At − λI
)−1
− tr (At − λI)−1−(uTi
(At − λI
)−1
ui
) > 1.
This last relation gives
tr (At − λI)−1 > tr(At − λI
)−1
−uTi
(At − λI
)−2
ui
1 + uTi
(At − λI
)−1
ui
= tr(At − λI
)−1
− tr
(At − λI
)−1
uiuTi
(At − λI
)−1
1 + uTi
(At − λI
)−1
ui
= tr
(At − λI + uiu
Ti
)−1
,
where the last line was accomplished with the trace property previously indicated and theSherman-Morrison formula.
Lower Bound on λ_min(A_ℓ)

Lemma 2 ensures that the UCS algorithm can indeed find all ℓ indices. We now estimate an eigenvalue lower bound on A_ℓ. Let λ^{(t)}, λ̄^{(t)} and λ_j^{(t)} represent the values of λ, λ̄ and λ_j, respectively, determined in iteration t. Then note that by the definitions of λ and λ̄ we have

λ^{(0)} < λ̄^{(0)} ≤ λ^{(1)} < λ̄^{(1)} ≤ ··· ≤ λ^{(ℓ-1)} < λ̄^{(ℓ-1)}.
Define the following quantity and functions:

\bar{T} := T\left(1 - \bar\lambda^{(\ell-1)}\right), \qquad (7.9)

g(t) := \frac{\ell\left(1 - \frac{n}{\bar{T}}\right)}{m - t + \bar{T} - n}, \qquad \text{and} \qquad F(\bar{T}) := \frac{\ell\left(1 - \frac{n}{\bar{T}}\right)}{m - \frac{\ell-1}{2} + \bar{T} - n} - \frac{n}{\bar{T}}.

To bound λ_min(A_ℓ), we first establish a recurrence relation on λ̄^{(ℓ-1)}.
Lemma 3. After the last iteration of the UCS algorithm, we have

\bar\lambda^{(\ell-1)} \ge \left(1 - \bar\lambda^{(\ell-1)}\right)\left[\frac{1}{\ell}\sum_{t=0}^{\ell-1} g(t) - \frac{n}{\bar{T}}\right].
Proof. Remember that T = tr(A− λ(t)I
)−1=
n∑j=1
1
λ(t)j − λ(t)
, and note that
1− λ(t)j
λ(t)j − λ(t)
=1− λ(t)
λ(t)j − λ(t)
+λ(t) − λ(t)
j
λ(t)j − λ(t)
=1− λ(t)
λ(t)j − λ(t)
− 1. (7.10)
The equation f(λ(t))
= 0 gives
(λ(t) − λ(t)
)(m− t+
n∑j=1
1− λ(t)j
λ(t)j − λ(t)
)=
∑nj=1
1−λ(t)j(λ(t)j −λ(t)
)(λ(t)j −λ(t)
)∑nj=1
1(λ(t)j −λ(t)
)(λ(t)j −λ(t)
) .
Applying equation (7.10) to both sides:(λ(t) − λ(t)
) (m− t+
(1− λ(t)
)T − n
)= 1− λ(t) −
∑nj=1
1
λ(t)j −λ(t)∑n
j=11(
λ(t)j −λ(t)
)(λ(t)j −λ(t)
)
≥ 1− λ(t) −n
(maxj∗
1
λ(t)j∗−λ
(t)
)(
maxj∗1
λ(t)j∗−λ
(t)
)(∑nj=1
1
λ(t)j −λ(t)
)= 1− λ(t) − n
T.
Since (λ(t−1) − λ(t)
)≤ 0, and
(λ(t) − λ(t)
)≥
1− λ(t) − nT
m− t+ (1− λ(t))T − n,
we have
λ(`−1) ≥ λ(`−1) +`−1∑t=1
≤0︷ ︸︸ ︷(λ(t−1) − λ(t)
)−λ(0) + λ(0)
=`−1∑t=0
(λ(t) − λ(t)
)+ λ(0)
≥`−1∑t=0
1− λ(t) − nT
m− t+ (1− λ(t))T − n− n
T
≥`−1∑t=0
1− λ(`−1) − nT
m− t+(
1− λ(`−1))T − n
− n
T. (7.11)
Inequality (7.11) follows by noting that the terms in the sum are decreasing in λ(t). The finalsubstitution is necessary because solving the preceding recurrence relation is impractical. Tofurther simplify calculations, we define
T := T(
1− λ(`−1)).
Therefore,
λ(`−1) ≥(
1− λ(`−1))[ `−1∑
t=0
1− n
T
m− t+ T − n− n
T
]
=(
1− λ(`−1))[1
`
`−1∑t=0
g(t)− n
T
].
Next, to demonstrate the effectiveness of the algorithm, we derive a lower bound for λ_n after ℓ iterations. This analysis will involve selecting an appropriate T to maximize the lower bound.
Lemma 4. If \bar{T} > n, then

\lambda_{\min}(A_\ell) \ge \frac{F(\bar{T})}{1 + F(\bar{T})}.
Proof. A key observation is that g(t) is strictly convex in t, which is easily verified by showing that the second derivative d²g/dt²(t) is positive under our assumptions that \bar{T} > n and m ≥ ℓ > t. Next, we note that the harmonic mean is strictly less than the arithmetic mean unless all terms are equal, and we can further bound the recurrence relation in Lemma 3:

\bar\lambda^{(\ell-1)} \ge \left(1 - \bar\lambda^{(\ell-1)}\right)\left[\frac{1}{\ell}\sum_{t=0}^{\ell-1} \frac{\ell\left(1 - \frac{n}{\bar{T}}\right)}{m - t + \bar{T} - n} - \frac{n}{\bar{T}}\right]
> \left(1 - \bar\lambda^{(\ell-1)}\right)\left[\frac{\ell\left(1 - \frac{n}{\bar{T}}\right)}{\frac{1}{\ell}\sum_{t=0}^{\ell-1}\left(m - t + \bar{T} - n\right)} - \frac{n}{\bar{T}}\right]
= \left(1 - \bar\lambda^{(\ell-1)}\right)\left[\frac{\ell\left(1 - \frac{n}{\bar{T}}\right)}{m - \frac{\ell-1}{2} + \bar{T} - n} - \frac{n}{\bar{T}}\right]
= \left(1 - \bar\lambda^{(\ell-1)}\right) F(\bar{T}).

Along with λ_n > λ̄ from Lemma 1, this finally leads to

\lambda_{\min} = \lambda_n > \bar\lambda^{(\ell-1)} > \frac{F(\bar{T})}{1 + F(\bar{T})}. \qquad (7.12)
The expression on the right-hand side of (7.12) is monotonically increasing in F. So, maximizing F(\bar{T}) will also maximize the lower bound on λ̄^{(ℓ-1)}.
Lemma 5. The function F(\bar{T}) is maximized at

\bar{T}^{*} = \frac{n\left(m + \frac{\ell+1}{2} - n\right) + \sqrt{n\ell\left(m - \frac{\ell-1}{2}\right)\left(m + \frac{\ell+1}{2} - n\right)}}{\ell - n}.
Proof. Setting the derivative of F(\bar{T}) to zero:

\frac{dF}{d\bar{T}} = \frac{(n-\ell)\bar{T}^2 + 2n\left(m + \frac{\ell+1}{2} - n\right)\bar{T} + n\left(m + \frac{\ell+1}{2} - n\right)\left(m - \frac{\ell-1}{2} - n\right)}{\bar{T}^2\left(m - \frac{\ell-1}{2} - n + \bar{T}\right)^2} = 0.

Solving for the desired root:

\bar{T}^{*} = \frac{n\left(m + \frac{\ell+1}{2} - n\right) + \sqrt{n\ell\left(m - \frac{\ell-1}{2}\right)\left(m + \frac{\ell+1}{2} - n\right)}}{\ell - n}.

We see that \bar{T}^{*} is the global maximum on the region \bar{T} ∈ (n, ∞) via the first derivative test, since dF/d\bar{T} > 0 for n < \bar{T} < \bar{T}^{*} and dF/d\bar{T} < 0 for \bar{T}^{*} < \bar{T}.
We remark that combining (7.9) and (7.12) implies that the UCS algorithm should choose

T = \bar{T}^{*}\left(1 + F(\bar{T}^{*})\right)

for effective column selection. We are now ready to estimate λ_min(A_ℓ).
Theorem 22. If T is chosen according to Lemma 5 in the UCS algorithm, then

\lambda_{\min}(A_\ell) > \frac{1}{\kappa},

where κ is defined in (7.5).
Proof. We wish to apply our choice of \bar{T} to Lemma 4. We satisfy the assumption

\bar{T}^{*} = \frac{n\left(m + \frac{\ell+1}{2} - n\right) + \sqrt{n\ell\left(m - \frac{\ell-1}{2}\right)\left(m + \frac{\ell+1}{2} - n\right)}}{\ell - n} \ge \frac{n(m-n)}{\ell - n} \ge n.

Therefore, plugging \bar{T}^{*} into (7.12) of Lemma 4:

\lambda_{\min}(A_\ell) > \frac{F(\bar{T}^{*})}{1 + F(\bar{T}^{*})} = \frac{(\ell-n)\bar{T}^{*} - n\left(m + \frac{\ell+1}{2} - n\right)}{\bar{T}^{*}\left(m - \frac{\ell-1}{2} - n + \bar{T}^{*}\right) + (\ell-n)\bar{T}^{*} - n\left(m + \frac{\ell+1}{2} - n\right)} = \frac{(\ell-n)^2}{\left(\sqrt{n\left(m + \frac{\ell+1}{2} - n\right)} + \sqrt{\ell\left(m - \frac{\ell+1}{2}\right)}\right)^2 + (\ell-n)^2}.
Correctness of the Unweighted Column Selection Algorithm

We are now in a position to prove Theorem 21. Our arguments are similar to those of the weighted sparsifier algorithm in [12].
Proof of Theorem 21. By Proposition 1, we only need to show \frac{1}{\kappa} L_G \preceq L_H. Consider the SVD of W_G^{1/2} B_G in equation (7.1), and let x be any vector such that y = Σ_G V_G x ≠ 0. Then

L_G = B_G^T W_G B_G = V_G^T \Sigma_G^2 V_G,
L_H = B_G^T W_H B_G = B_G^T W_G^{1/2} \Pi_\ell^T \Pi_\ell W_G^{1/2} B_G = V_G^T \Sigma_G \left(U_G \Pi_\ell^T \Pi_\ell U_G^T\right) \Sigma_G V_G.

It follows that

\frac{x^T L_H x}{x^T L_G x} = \frac{x^T\left(V_G^T \Sigma_G \left(U_G \Pi_\ell^T \Pi_\ell U_G^T\right) \Sigma_G V_G\right)x}{x^T\left(V_G^T \Sigma_G^2 V_G\right)x} = \frac{y^T U_G \Pi_\ell^T \Pi_\ell U_G^T y}{y^T y}. \qquad (7.13)

On the other hand, by construction we have

A_\ell = \sum_{j \in \Pi_\ell} u_j u_j^T = U_G \Pi_\ell^T \Pi_\ell U_G^T.

With equation (7.13), the Courant-Fischer min-max property gives

\frac{x^T L_H x}{x^T L_G x} = \frac{y^T U_G \Pi_\ell^T \Pi_\ell U_G^T y}{y^T y} \ge \lambda_{\min}(A_\ell) > \frac{1}{\kappa},

where the last inequality is due to Theorem 22.
7.5 Performance Comparison of UCS and Other Algorithms

This section compares the bound (7.5) to the bounds of other current methods.
Comparison with Twice-Ramanujan Sparsifiers

Given a weighted graph G = (V, E, w), the algorithm of [12] produces a sparsified graph H = (V, F, w̃), where F is a subset of E and w̃ contains new edge weights, such that

L_G \preceq L_H \preceq \left(\frac{\sqrt{d}+1}{\sqrt{d}-1}\right)^2 L_G, \qquad (7.14)
where the parameter d is defined via the equation ℓ = ⌈d(n − 1)⌉. By choosing d to be a moderate and dimension-independent constant, equation (7.14) asserts that every graph G = (V, E, w) has a weighted spectral sparsifier with a number of edges linear in |V|. This strong result, nevertheless, is obtained by allowing unrestricted changes in the graph weights. Such changes may be undesirable, especially if G is unweighted, and the UCS algorithm may be preferred.
To compare the effectiveness of these two types of sparsifiers, we simplify equation (7.5):

\frac{1}{\kappa} \approx \frac{\left(\sqrt{d} - 1\right)^2}{m/n + d/2 + \left(\sqrt{d} - 1\right)^2}.

It follows that for κ = Θ(1), a dimension-independent constant, we must choose d = Θ(m/n). This is the price one must pay to retain the original weights. For d ≪ m/n, the UCS algorithm computes a sparsified graph with a κ that grows at most linearly with m/n. The algorithm of [12] runs in time O(dn³m), which is equivalent to UCS.
Near-Optimal Column-Based Matrix Reconstruction

The algorithm of [12] has been generalized in [17] to a column selection algorithm for computing CX decompositions. In this work, Boutsidis, Drineas, and Magdon-Ismail prove that, given row-orthonormal matrices

V_1^T = \left(\vec{v}^{\,1}_1 \; \vec{v}^{\,1}_2 \; \cdots \; \vec{v}^{\,1}_m\right) \in R^{n \times m} \quad \text{and} \quad V_2^T = \left(\vec{v}^{\,2}_1 \; \vec{v}^{\,2}_2 \; \cdots \; \vec{v}^{\,2}_m\right) \in R^{(m-n) \times m},

then for a given n < ℓ ≤ m there exist weights s_i ≥ 0, with at most ℓ of them nonzero, such that

\lambda_n\left(\sum_{i=1}^{m} s_i \vec{v}^{\,1}_i \left(\vec{v}^{\,1}_i\right)^T\right) \ge \left(1 - \sqrt{\frac{n}{\ell}}\right)^2 \qquad (7.15)

and

\lambda_1\left(\sum_{i=1}^{m} s_i \vec{v}^{\,2}_i \left(\vec{v}^{\,2}_i\right)^T\right) \le \left(1 + \sqrt{\frac{m-n}{\ell}}\right)^2. \qquad (7.16)

In the context of CX decompositions, \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} = V^T \in R^{m \times m} is understood to be the loadings matrix of a data matrix A, i.e. A = UΣV^T is the SVD of A (although the algorithm could be applied to other matrices for other applications). Their work includes an algorithm for finding the weights, Deterministic Dual Set Spectral Sparsification (DDSSS).
Theorem 23. Let Π_ℓ^{DDSSS} ∈ R^{m×ℓ} denote a matrix that chooses the ℓ columns selected by the DDSSS algorithm. The inequalities (7.15) and (7.16) imply

\sigma^2_{\min}\left(V_1^T \Pi_\ell^{DDSSS}\right) \ge \frac{\left(\sqrt{\ell} - \sqrt{n}\right)^2}{\left(\sqrt{\ell} + \sqrt{m-n}\right)^2 + \left(\sqrt{\ell} - \sqrt{n}\right)^2} =: \frac{1}{\kappa_{DDSSS}}.
Proof. We interpret these inequalities as a bound on λn by first partitioning
V TΠ =
(V1 V ′1V2 V ′2
),
where Π is a permutation matrix that orders the selected columns first. Then, using a CSdecomposition [103], we can write
V1
V2
=
P1
(C 0
)QT
1
P2
−S 00 I0 0
QT2
,
where C and S are diagonal matrices with non-negative entries such that C2 + S2 = I.Furthermore, because P1 and Q are orthogonal, by inspection C contains the singular valuesof V1. Hence
λn ≥ σ2min (V1) = σ2
min(C).
Now let W be a weight matrix, whose diagonal entries are√si, the weights from above.
Define
Qdef= QT
1
(WW T
)Q1
def=
(Q11 Q12
Q21 Q22
).
Then
(V2W ) (V1W )†
= V2W (V1W )T(V1WW TV T
1
)−1
= V2
(WW T
)V T
1
(V1
(WW T
)V T
1
)−1
= V2
(WW T
) (P1
(C 0
)QT
1
)T (P1
(C 0
)QT
1
(WW T
)Q1
(C0
)P T
1
)−1
= P2
−S 00 I0 0
Q
(C0
)P T
1 P1
((C 0
)Q
(C0
))−1
P T1
= P2
−S 00 I0 0
( Q11 Q12
Q21 Q22
)(C0
)(CQ11C
)−1
P T1
= P2
−S 00 I0 0
( Q11C
Q21C
)C−1Q−1
11 C−1P T
1
= P2
−SC−1
Q21Q−111 C
−1
0
P T1 .
Therefore √1− σ2
min(C)
σ2min(C)
= ‖SC−1‖2
≤ ‖(V2W ) (V1W )† ‖2
≤
(1 +
√m− n`
)(1−
√n
`
)−1
.
Rearranging
σ2min(C) ≥
(1−
√n`
)2(1 +
√m−n`
)2
+(1−
√n`
)2
=
(√`−√n)2
(√`+√m− n
)2
+(√
`−√n)2 .
Corollary 1. Let κ_UCS be as defined in equation (7.5). Then

\frac{1}{\kappa_{UCS}} > \frac{1}{\kappa_{DDSSS}}.
Proof.
(`− n)2(√n(m+ `+1
2− n
)+√`(m− `+1
2
))2
+ (`− n)2
=(√`−√n)2(√
n(m+ `+12−n)+
√`(m− `+1
2 )√`+√n
)2
+ (√`−√n)2
≥ (√`−√n)2(√
n(m+ `+12−n)+
√`m
√`+√n
)2
+ (√`−√n)2
≥ (√`−√n)2(√
n(m+`−n)+√`m
√`+√n
)2
+ (√`−√n)2
≥ (√`−√n)2(√
n(m+`−n)+√`m−n`+`2
√`+√n
)2
+ (√`−√n)2
≥ (√`−√n)2(√
nm−n2+√n`+√`m−n`+
√`2√
`+√n
)2
+ (√`−√n)2
=(√`−√n)2(
(√m−n+
√`)(√`+√n)√
`+√n
)2
+ (√`−√n)2
=(√`−√n)2(√
m− n+√`)2
+ (√`−√n)2
.
This suggests the UCS algorithm may find a better subset than the column selection algorithm in [17]. Observe that typically m ≫ ℓ ≥ n. For the purpose of finding a well-conditioned subset of columns in V_1^T ∈ R^{n×m}, requiring the whole matrix V^T ∈ R^{m×m} is computationally expensive. On the other hand, an even better subset can be obtained by applying the UCS algorithm directly to V_1^T, at considerable savings in computational time and memory usage. The algorithm of [17] runs in time O(ℓm(n² + (m − ℓ)²)) ≈ O(ℓm³), far slower than UCS.
Figure 7.1: Autonomous System Example: Original Graph
7.6 A Numeric Example: Graph Visualization

We test the UCS algorithm on the Autonomous systems AS-733 dataset in [67]². The data is undirected, unweighted, and contains 493 nodes and 1189 edges. To visualize the data, nodes are plotted using coordinates determined by the force-directed Fruchterman-Reingold algorithm. This algorithm treats the edges of a graph as forces (similar to springs), and perturbs node coordinates until the graph appears to be near an equilibrium state [45].

We apply the force-directed algorithm with two methodologies. First, the force-directed algorithm is run on the whole graph to determine a fixed set of node coordinates. Using these coordinates, the original graph is plotted with various sparsifiers in Figure 7.2. Second, we run the force-directed algorithm on each sparsifier to determine node coordinates for that sparsifier, and plot both the sparsifier and the original graph on these coordinates (Figure 7.3). While this requires rerunning the force-directed algorithm for each sparsifier, the algorithm converges faster because of the reduced number of edges.

Although the original graph can be considered sparse, visualization of the graph is difficult. In Figure 7.1, a few nodes are seen to have high degree, but little information is readily available about important edges in the graph or about how important nodes are related. Figure 7.2 shows that plotting the sparsifier on the original graph provides incremental benefit. The sparser graphs begin to highlight important nodes and important edges connecting them, but visualization remains difficult. Rerunning the force-directed algorithm on the sparsifiers, nevertheless, evokes an easily interpretable structure, where important nodes, clusters, and important edges connecting clusters are readily visible (Figure 7.3).
2File as19981229
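The two visualization methodologies can be sketched with NetworkX, whose spring_layout implements the Fruchterman-Reingold algorithm [45]; the random graph and the edge subset below are placeholders for the AS-733 snapshot and the UCS output.

```python
import networkx as nx
import matplotlib.pyplot as plt

# G: the original unweighted graph; sparsifier_edges: the edge subset kept by UCS (assumed given).
G = nx.gnm_random_graph(493, 1189, seed=0)            # stand-in for the AS-733 snapshot
sparsifier_edges = list(G.edges())[:738]              # placeholder for the UCS output

H = nx.Graph()
H.add_nodes_from(G.nodes())
H.add_edges_from(sparsifier_edges)

pos_full = nx.spring_layout(G, seed=1)                # coordinates computed from the whole graph
pos_sparse = nx.spring_layout(H, seed=1)              # coordinates recomputed on the sparsifier

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
nx.draw(H, pos_full, ax=axes[0], node_size=10)        # sparsifier on original coordinates
nx.draw(H, pos_sparse, ax=axes[1], node_size=10)      # sparsifier on its own coordinates
plt.show()
```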
(a) 984 Edges (b) 738 Edges (c) 615 Edges (d) Spanning Tree³

Figure 7.2: Autonomous Systems Graph with Sparsifiers of Various Cardinalities (node coordinates calculated from the whole graph)

(a) 984 Edges (b) 738 Edges (c) 615 Edges (d) Spanning Tree³

Figure 7.3: Autonomous Systems Graph with Sparsifiers of Various Cardinalities (node coordinates recalculated for each sparsifier)

(a) 984 Edges (b) 738 Edges (c) 615 Edges (d) 493 Edges

Figure 7.4: Progress During Iteration and Theoretical Singular Value Lower Bound for Sparsifiers of Various Cardinalities
7.7 Relationship to the Kadison-Singer Problem
Let p ≥ 2 be an integer, and let U = (u_1, ···, u_m) ∈ R^{n×m} be a matrix that satisfies

\sum_{k=1}^{m} u_k u_k^T = I, \quad \text{and} \quad \|u_k\|_2 \le \delta, \quad \text{for } k = 1, \cdots, m, \qquad (7.17)

where 0 < δ < 1. Equation (7.17) implies that U is a row-orthonormal matrix and that each column of U is uniformly bounded away from 1 in 2-norm. Marcus et al. [73] show that there exists a partition

P = P_1 \cup \cdots \cup P_p \qquad (7.18)

of the column indices {1, ···, m} such that

\|U(:, P_k)\|_2 \le \frac{1}{\sqrt{p}} + \delta, \quad \text{for } k = 1, \cdots, p.

When the graph G is sufficiently dense, equation (7.18) implies the existence of an unweighted graph sparsifier (see Batson, et al. [12]).

³ Calculated by running the UCS algorithm with ℓ = 493 and omitting the final edge. In general, a spanning tree for a connected graph can be found by selecting ℓ = n + 1 edges and removing an edge from a loop created by the UCS algorithm.
7.8 Additional Thoughts

We have presented an efficient algorithm for the construction of unweighted spectral sparsifiers for general weighted and unweighted graphs, addressing the open question of the existence of such graph sparsifiers for general graphs [12]. Our algorithm is supported by strong theoretical spectral bounds. Through numeric experiments, we have demonstrated that our sparsification algorithm can be an effective tool for graph visualization, and we anticipate that it will prove useful for wide-ranging applications involving large graphs. An important feature of our sparsification algorithm is the deterministic unweighted column selection algorithm on which it is based. An open question is the existence of a larger lower spectral bound, either with the same T or a new one.
Chapter 8
Additional Results on Unweighted Graph Sparsification
8.1 A Running Bound
Note that the bound (7.5) assumes the recommended T and that all ℓ iterations have been performed. This bound does not apply, nevertheless, at an iteration s < ℓ simply by replacing ℓ with s in the bound, because T was determined by ℓ, not s. Here, a new bound is derived for when s < ℓ iterations have been performed.
For a given n, m, and ℓ, we look at the lower bound at step s ≤ ℓ. From inequality (7.11):

\bar\lambda^{(s)} \ge \sum_{t=0}^{s} \frac{1 - \bar\lambda^{(s)} - \frac{n}{T}}{m - t + \left(1 - \bar\lambda^{(s)}\right)T - n} - \frac{n}{T}
\ge s \cdot \frac{1 - \bar\lambda^{(s)} - \frac{n}{T}}{m - \frac{s}{2} + \left(1 - \bar\lambda^{(s)}\right)T - n} - \frac{n}{T}.

Then

\bar\lambda^{(s)}\left(m - \frac{s}{2} + \left(1 - \bar\lambda^{(s)}\right)T - n\right) \ge s\left(1 - \bar\lambda^{(s)} - \frac{n}{T}\right) - \frac{n}{T}\left(m - \frac{s}{2} + \left(1 - \bar\lambda^{(s)}\right)T - n\right),

-T\left(\bar\lambda^{(s)}\right)^2 + \bar\lambda^{(s)}\left(m - \frac{s}{2} + T - n\right) \ge s - s\bar\lambda^{(s)} - \frac{sn}{T} - \frac{nm}{T} + \frac{sn}{2T} - n + n\bar\lambda^{(s)} + \frac{n^2}{T},

-T\left(\bar\lambda^{(s)}\right)^2 + \bar\lambda^{(s)}\left(m + \frac{s}{2} + T - 2n\right) \ge s - \frac{sn}{T} - \frac{nm}{T} + \frac{sn}{2T} - n + \frac{n^2}{T}.

We solve for the smaller root of this quadratic to find a lower bound:

\bar\lambda^{(s)} \ge \frac{2n - T - \frac{s}{2} - m + \sqrt{\left(m + \frac{s}{2} + T - 2n\right)^2 + 4T\left(-s + \frac{sn}{2T} + \frac{nm}{T} + n - \frac{n^2}{T}\right)}}{-2T}.
8.2 Faster Subset Selection for Matrices and Applications

The improved bound in Theorem 3.5 of [9] is

\|X_S^{\dagger}\|_{\xi}^2 \le \left(1 + \sqrt{\frac{m}{\ell}}\right)^2\left(1 - \sqrt{\frac{n}{\ell}}\right)^{-2}\|X^{\dagger}\|_{\xi}^2.

Equivalently,

\frac{\|X_S\|_{\xi}^2}{\|X\|_{\xi}^2} \ge \frac{\left(1 - \sqrt{\frac{n}{\ell}}\right)^2}{\left(1 + \sqrt{\frac{m}{\ell}}\right)^2} = \kappa_{DDSSS2}.
Then
κDDSSS2 =
(√`−√n)2
(√`+√m)2
=(`− n)2(√
`+√n)2 (√
`+√m)2
=(`− n)2(
`+√`m+
√n`+
√nm)2
=(`− n)2(
`− n+ n+√`m+
√n`+
√nm)2
<(`− n)2(
`− n+ n+√`(m− `+1
2
)+√n`+
√nm)2
<(`− n)2(
`− n+ n+√`(m− `+1
2
)+√n (`+m)
)2 .
Continuing:
κDDSSS2 <(`− n)2(
`− n+ n+√`(m− `+1
2
)+√n(m+ `+1
2− n
))2
<(`− n)2(
`− n+√`(m− `+1
2
)+√n(m+ `+1
2− n
))2
<(`− n)2
(`− n)2 +(√
`(m− `+1
2
)+√n(m+ `+1
2− n
))2
= κUCS.
Appendix A
An Algorithm for Sparse PCA
Although the truncated Singular Value Decomposition is a mathematically optimal low-rank approximation algorithm in the sense of Theorem 2, there are many drawbacks to this type of approximation. Part II demonstrated that SRLU simultaneously addresses many of these drawbacks. This section presents an algorithm specifically for the problem of sparsity, known as Sparse PCA [110].

Recall that the SVD is generally dense, regardless of the amount of sparsity in the input data. While the SVD finds the optimal subspace of a chosen rank k, the basis for this subspace will generally be a set of k dense principal components and loading vectors. As a result, the rank-k subspace is a combination of all input variables. Instead, one may ask the natural question of which k variables best explain the data, a different question which the SVD does not address.
A.1 Problem Formulation

The sparse PCA problem has many formulations. Here, we use the following objective function:

\min_{\substack{U^T U = I \\ V(\Pi^{\perp}) = 0}} \|A - UV^T\|_F^2 - \eta\left[\log\det\left(V^T V\right) - \sum_{j=1}^{k}\log V(:, j)^T V(:, j)\right]. \qquad (A.1)

For a target rank k and A ∈ R^{m×n}, U ∈ R^{m×k} is orthonormal and V ∈ R^{n×k}. Because the determinant of V^T V is the product of the squared singular values of V, this objective function seeks to simultaneously maximize all of the singular values of V. The last term in the objective function serves to keep the columns of V bounded. The presence of the logarithmic function will allow us to analyze this function analytically. Furthermore, the logarithmic term in the final expression renders a column-wise penalty similar to the ℓ₁ norm, which is generally known to induce sparsity.
Sparsity will be represented in the following setup. The sparsity pattern of V is (here the jth column can be used instead of the 1st column):

V = \begin{pmatrix} v_1 & V_1 \\ 0 & V_2 \end{pmatrix}. \qquad (A.2)
A.2 Setup for a Single Column

Let

a = \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} := \left(A^T U\right)_j.

Note that

V^T V = \begin{pmatrix} v_1^T v_1 & v_1^T V_1 \\ V_1^T v_1 & V_2^T V_2 + V_1^T V_1 \end{pmatrix},

and define

P := I - V_1\left(V_1^T V_1 + V_2^T V_2\right)^{-1} V_1^T.

Note that

\|A - UV^T\|_F^2 = \mathrm{tr}\left(A^T A\right) - 2\,\mathrm{tr}\left(U^T A V\right) + \mathrm{tr}\left(V^T V\right).

Then the part of the objective function that is updated is

\min_{v_1} \; -2 v_1^T a_1 + v_1^T v_1 - \eta\left[\log v_1^T P v_1 - \log v_1^T v_1\right]. \qquad (A.3)

Taking the derivative:

-a_1 + v_1 - \eta\left[\frac{P v_1}{v_1^T P v_1} - \frac{v_1}{v_1^T v_1}\right] = 0. \qquad (A.4)
A.3 Solving for v1
Rewriting:

a_1 = v_1\left(1 + \frac{\eta}{v_1^T v_1}\right) - \frac{\eta}{v_1^T P v_1} P v_1.

Define

\alpha := 1 + \frac{\eta}{v_1^T v_1}, \qquad \beta := \frac{\eta}{v_1^T P v_1}.

Thus we have

v_1 = (\alpha I - \beta P)^{-1} a_1, \qquad (A.5)

and

\alpha = \lambda\beta = 1 + \frac{\eta}{a_1^T (\alpha I - \beta P)^{-2} a_1}, \qquad (A.6)

\beta = \frac{\eta}{a_1^T (\alpha I - \beta P)^{-1} P (\alpha I - \beta P)^{-1} a_1}. \qquad (A.7)

Additionally, define

\lambda := \frac{\alpha}{\beta} = \frac{\left(1 + \frac{\eta}{v_1^T v_1}\right) v_1^T P v_1}{\eta} = \frac{v_1^T P v_1}{\eta} + \frac{v_1^T P v_1}{v_1^T v_1}. \qquad (A.8)

Extracting a factor of 1/β from the terms in the denominator of equation (A.7), and noting that the terms commute, implies

\beta = \frac{\eta}{\frac{1}{\beta^2} a_1^T (\lambda I - P)^{-2} P a_1} = \frac{\eta\beta^2}{a_1^T (\lambda I - P)^{-2} P a_1} = \frac{1}{\eta} a_1^T (\lambda I - P)^{-2} P a_1. \qquad (A.9)

Similarly, equation (A.6) implies

\alpha = \lambda\beta = \frac{\eta\beta^2}{a_1^T (\lambda I - P)^{-2} a_1} + 1. \qquad (A.10)

Applying equation (A.9) to both sides of equation (A.10) yields

\frac{\lambda}{\eta} a_1^T (\lambda I - P)^{-2} P a_1 = 1 + \frac{\eta}{a_1^T (\lambda I - P)^{-2} a_1} \cdot \frac{\left[a_1^T (\lambda I - P)^{-2} P a_1\right]^2}{\eta^2}

\lambda = \frac{\eta}{a_1^T (\lambda I - P)^{-2} P a_1} + \frac{a_1^T (\lambda I - P)^{-2} P a_1}{a_1^T (\lambda I - P)^{-2} a_1}

\frac{\lambda\, a_1^T (\lambda I - P)^{-2} a_1 - a_1^T (\lambda I - P)^{-2} P a_1}{a_1^T (\lambda I - P)^{-2} a_1} = \frac{\eta}{a_1^T (\lambda I - P)^{-2} P a_1}

\frac{a_1^T (\lambda I - P)^{-2} (\lambda I - P) a_1}{a_1^T (\lambda I - P)^{-2} a_1} = \frac{a_1^T (\lambda I - P)^{-1} a_1}{a_1^T (\lambda I - P)^{-2} a_1} = \frac{\eta}{a_1^T (\lambda I - P)^{-2} P a_1}. \qquad (A.11)
This provides a formula for λ. Where to look for λ is discussed later.
A.4 More on λ and the Objective Function
Using the notation introduced above, we can use equation (A.9) to rewrite our objectivefunction, in equation (A.3), as
Obj. = −2ηaT1 (λI − P )−1 a1
aT1 (λI − P )−2 Pa1
+η2aT1 (λI − P )−2 a1[aT1 (λI − P )−2 Pa1
]2 − η logaT1 (λI − P )−2 Pa1
aT1 (λI − P )−2 Pa1
= − η2aT1 (λI − P )−2 a1[aT1 (λI − P )−2 Pa1
]2 − η logη
aT1 (λI − P )−1 a1
= −η2 1
aT1 (λI − P )−2 a1
(aT1 (λI − P )−1 a1
η
)2
− η logη
aT1 (λI − P )−1 a1
= −[aT1 (λI − P )−1 a1
]2aT1 (λI − P )−2 a1
− η logη
aT1 (λI − P )−1 a1
,
where we have used equation (A.11) in the derivations above. Then
dObj.
dλ= −
[−2[aT1 (λI − P )−1 a1
] [aT1 (λI − P )−2 a1
]2+2[aT1 (λI − P )−1 a1
]2aT1 (λI − P )−3 a1
]/[aT1 (λI − P )−2 a1
]2− d
dλη log
η
aT1 (λI − P )−1 a1
= 2aT1 (λI − P )−1 a1[aT1 (λI − P )−2 a1
]2 ([aT1 (λI − P )−2 a1
]2−aT1 (λI − P )−1 a1 · aT1 (λI − P )−3 a1
)− d
dλη log
η
aT1 (λI − P )−1 a1
.
Let αj denote the jth component of a1, and µj the jth eigenvalue of P . Then we can write[aT1 (λI − P )−2 a1
]2 − aT1 (λI − P )−1 a1 · aT1 (λI − P )−3 a1
=k∑i=1
α2i
(λ− µi)2
k∑j=1
α2j
(λ− µj)2 −k∑i=1
α2i
(λ− µi)1
k∑j=1
α2j
(λ− µj)3
=k∑
i,j=1
α2iα
2j [(λ− µj)− (λ− µi)](λ− µi)2 (λ− µj)3
=k∑
i,j=1
α2iα
2j [µi − µj]
(λ− µi)2 (λ− µj)3 . (A.12)
Suppose that t_1 and t_2 are indices such that μ_{t_1} > μ_{t_2}. Then the terms in the sum in equation (A.12) corresponding to this pair of indices are
α2t1α2t2
[µt1 − µt2 ](λ− µt1)
2 (λ− µt2)3 +
α2t2α2t1
[µt2 − µt1 ](λ− µt2)
2 (λ− µt1)3
=α2t1α2t2
(λ− µt2)3 (λ− µt1)
3 [(µt1 − µt2) (λ− µt1) + (µt2 − µt1) (λ− µt2)]
=α2t1α2t2
(λ− µt2)3 (λ− µt1)
3
[− (µt1 − µt2)
2]≤ 0.
This implies line (A.12) is non-positive. By inspection,
d
dλη log
η
aT1 (λI − P )−1 a1
≥ 0.
Therefore,
dObj.
dλ< 0. (A.13)
A.5 Bounds on λ

Note that equation (A.8) implies

\frac{v_1^T P v_1}{v_1^T v_1} \le \frac{v_1^T P v_1}{\eta} + \frac{v_1^T P v_1}{v_1^T v_1} = \lambda \le \frac{\|P\|_2}{\eta} + \|P\|_2.

Consider that we are picking v_1 to maximize λ, as argued by line (A.13), and note that by construction P ⪰ 0 and P^T = P. Then

\|P\|_2 = \|P^{1/2}\|_2^2 = \max_x \left\|P^{1/2}\frac{x}{\|x\|_2}\right\|_2^2.

Let

\hat{x} = \arg\max_x \left\|P^{1/2}\frac{x}{\|x\|_2}\right\|_2^2.

Then

\|P\|_2 = \|P^{1/2}\|_2^2 = \left\|P^{1/2}\frac{\hat{x}}{\|\hat{x}\|_2}\right\|_2^2 = \frac{\hat{x}^T\left(P^{1/2}\right)^T P^{1/2}\hat{x}}{\hat{x}^T\hat{x}} = \frac{\hat{x}^T P\hat{x}}{\hat{x}^T\hat{x}} \le \frac{v_1^T P v_1}{\eta} + \frac{v_1^T P v_1}{v_1^T v_1} = \lambda.

Hence

\|P\|_2 \le \lambda \le \|P\|_2 + \frac{\|P\|_2}{\eta}. \qquad (A.14)
Equation (A.11) has more than one root, so the bounds in inequality (A.14) provide guidance for the correct root. To guarantee that a root can be found, nevertheless, these bounds may still be insufficient. A stable and fast algorithm such as Brent's method [19] requires bounds with opposing signs when searching for the zero of a function. Define

F(\lambda) := \frac{a_1^T (\lambda I - P)^{-1} a_1}{a_1^T (\lambda I - P)^{-2} a_1} - \frac{\eta}{a_1^T (\lambda I - P)^{-2} P a_1},

and let

P = Q\Lambda Q^T

be the SVD of P, which takes this form because P is symmetric. Denote b = a_1^T Q, and let d_i be the diagonal elements of Λ (in descending order). Then

F(\lambda) = \frac{b[\lambda I - \Lambda]^{-1} b^T}{b[\lambda I - \Lambda]^{-2} b^T} - \frac{\eta}{b[\lambda I - \Lambda]^{-2}\Lambda b^T} = \frac{\sum_i \frac{b_i^2}{\lambda - d_i}}{\sum_i \frac{b_i^2}{(\lambda - d_i)^2}} - \frac{\eta}{\sum_i \frac{d_i b_i^2}{(\lambda - d_i)^2}}.

Note that if λ_1 is the solution to

\sum_i \frac{b_i^2}{\lambda - d_i} = \frac{\eta}{d_1}, \qquad (A.15)

then F(λ_1) ≤ 0. Let t be the largest integer such that d_t > 0. Then the solution λ_2 to

\sum_i \frac{b_i^2}{\lambda - d_i} = \frac{\eta}{d_t} \qquad (A.16)

will lead to F(λ_2) ≥ 0. Therefore, equations (A.15) and (A.16) lead to lower and upper bounds, respectively, on the value of λ that is a zero of F. Although the need to solve new equations may seem circular, note that equations (A.15) and (A.16) are far simpler than F.
A.6 Putting It All Together
A value for λ is found by solving equation (A.11). Then β is found using equation (A.9), and hence α = λβ. Finally, solve for v_1 using equation (A.5). We summarize in Algorithm 9.
Inputs: A, U, V, and a sparsity pattern for V
Outputs: Refined V

1: for j = 1, ···, n do
2:    Calculate a_j = (A^T U)_j, and calculate P corresponding to the jth column
3:    Solve for λ:  \frac{a_j^T (\lambda I - P)^{-1} a_j}{a_j^T (\lambda I - P)^{-2} a_j} = \frac{\eta}{a_j^T (\lambda I - P)^{-2} P a_j}
4:    Solve for β:  \beta = \frac{1}{\eta} a_j^T (\lambda I - P)^{-2} P a_j
5:    Solve for α:  \alpha = \lambda\beta
6:    Solve for V_j:  V_j = (\alpha I - \beta P)^{-1} a_j
7: end for

Algorithm 9: Sparse PCA Refinement Algorithm
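A single pass of Algorithm 9's inner step can be sketched as follows, using Brent's method with the bracketing suggested by equations (A.14)-(A.16); the function name refine_column and the dense linear algebra are illustrative simplifications, not the original implementation.

```python
import numpy as np
from scipy.optimize import brentq

def refine_column(a, P, eta):
    """One column update of the sparse-PCA refinement (a sketch of Algorithm 9's inner step).

    a   : the vector a_j = (A^T U)_j restricted to the rows allowed by the sparsity pattern.
    P   : the matrix I - V1 (V1^T V1 + V2^T V2)^{-1} V1^T for this column.
    eta : the regularization parameter from objective (A.1).
    """
    n = len(a)
    d, Q = np.linalg.eigh(P)                   # P = Q diag(d) Q^T, eigenvalues ascending
    b2 = (Q.T @ a) ** 2                        # squared coefficients b_i^2

    s = lambda lam: np.sum(b2 / (lam - d))     # decreasing for lam > max(d)
    def solve_s_equals(c):                     # root of s(lam) = c on (d_max, inf)
        lo = d[-1] + 1e-12
        hi = d[-1] + np.sum(b2) / c + 1e-12
        return brentq(lambda lam: s(lam) - c, lo, hi)

    d_pos = d[d > 1e-12]
    lam_hi = solve_s_equals(eta / d_pos[-1])   # equation (A.15): F(lam_hi) <= 0
    lam_lo = solve_s_equals(eta / d_pos[0])    # equation (A.16): F(lam_lo) >= 0

    def F(lam):                                # equation (A.11) written as a zero-finding problem
        r = lam - d
        return np.sum(b2 / r) / np.sum(b2 / r**2) - eta / np.sum(d * b2 / r**2)

    lam = brentq(F, lam_lo, lam_hi)
    M2 = np.linalg.matrix_power(lam * np.eye(n) - P, -2)
    beta = (a @ M2 @ (P @ a)) / eta            # equation (A.9)
    alpha = lam * beta
    return np.linalg.solve(alpha * np.eye(n) - beta * P, a)   # equation (A.5)
```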
A.7 Error Bound
Using the results above, we calculate:
‖a1 − v1‖2 = (a1 − v1)T (a1 − v1)
= aT1 a1 + vT1 v1 − 2aT1 v1
= aT1 a1 + aT1 (αI − βP )−2 a1 − 2aT1 (αI − βP )−1 a1
= aT1 a1 +1
β2aT1 (λI − P )−2 a1 −
2
βaT1 (λI − P )−1 a1
= aT1 a1 +η2[
aT1 (λI − P )−2 Pa1
]2aT1 (λI − P )−2 a1
−2ηa1 (λI − P )−1 a1
aT1 (λI − P )−2 Pa1
(A.17)
= aT1 a1 +
[aT1 (λI − P )−1 a1
]2aT1 (λI − P )−2 a1
− 2
[aT1 (λI − P )−1 a1
]2aT1 (λI − P )−2 Pa1
(A.18)
= aT1 a1 −[a1 (λI − P )−1 a1
]2aT1 (λI − P )−2 a1
= aT1 (λI − P )−1 (λI − P ) a1 −[a1 (λI − P )−1 a1
]2aT1 (λI − P )−2 a1
= λaT1 (λI − P )−1 a1 − aT1 (λI − P )−1 Pa1 −[a1 (λI − P )−1 a1
]2aT1 (λI − P )−2 a1
=
(λ− aT1 (λI − P )−1 a1
aT1 (λI − P )−2 a1
)aT1 (λI − P )−1 a1 − aT1 (I − P )−1 Pa1
=aT1 (λI − P )−2 Pa1
aT1 (λI − P )−2 a1
aT1 (λI − P )−1 a1 − aT1 (λI − P )−1 Pa1 (A.19)
= η − aT1 (λI − P )−1 Pa1
< η. (A.20)
Equation (A.17) uses equation (A.9). Equation (A.18) uses equation (A.11). Equation (A.19) uses the identity λ(λI − P)^{-2} − (λI − P)^{-1} = (λI − P)^{-2}P, which follows from (λI − P)^{-2}(λI − P) = (λI − P)^{-1}. Note that the bounds in line (A.14) imply that (λI − P)^{-1} is positive semidefinite, and so is P; hence inequality (A.20) follows.

This error bound provides insight into the tradeoff between accuracy and sparsity. In particular, notice that this bound implies the error disappears as η goes to 0.
Appendix B
An Efficient Implementation of the Generalized Minimum Residual Method for Stiff PDE Problems
B.1 Preconditioned GMRES
In this chapter, a potential optimization to the Generalized Minimum Residual Method (GMRES) is presented. Unlike the algorithms above, GMRES [88] is not a low-rank approximation algorithm; rather, it is a linear solver for large, sparse systems. GMRES, nevertheless, is a Krylov subspace method: an iterative method that builds a subspace at each iteration and seeks the best solution to the overall problem which lies in that subspace. In this sense, the Krylov subspace is a low-rank approximation to the entire system for the purpose of finding the solution to the system. This work is preliminary, and numeric experiments are not yet available. This algorithm was originally motivated by boundary-layer Navier-Stokes simulations using high order finite element methods.
GMRES, which is generally known to be one of the most stable iterative solvers, is often used to solve unsymmetric systems with potentially high condition numbers. As a result, GMRES is often applied to preconditioned systems. A preconditioned system is of the form

M^{-1} A x = M^{-1} b. \qquad (B.1)

The matrix M is known as a preconditioner and is chosen so that M is as close to A as possible, or, more specifically, so that the eigenvalues of M^{-1}A are as close to 1 as possible and not near 0. As a result, the system (B.1) is easier to solve than Ax = b.
Many strategies for preconditioning exist. The incomplete LU decomposition is a common choice of preconditioner for large, sparse, unstable linear systems because it is fast to compute and apply, and because it does not isolate parts of the system (for instance, a block preconditioner isolates the block diagonal components of a linear system and does not account for interaction between those blocks). The incomplete LU factorization takes the form

A ≈ LU,

where the factorization is performed from start to finish using a standard LU decomposition (as opposed to the truncated LU factorization discussed previously in this work), but only populates L and U where there are nonzero entries in A. Thus the factorization time is a function of the number of nonzeros in the matrix A, and is not an O(n³) algorithm. (Note that A is assumed to be square by the nature of the problem.) The rest of this chapter will assume that GMRES is being applied to a system preconditioned with incomplete LU.
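As an illustration of this setup (not of the optimization proposed below), the following sketch runs SciPy's GMRES with an incomplete LU preconditioner; the test system is an arbitrary stand-in for a discretized PDE Jacobian.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import gmres, spilu, LinearOperator

# A stand-in sparse system: a 1-D Laplacian plus a small unsymmetric perturbation.
n = 1000
A = (sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
     + 0.01 * sp.random(n, n, density=0.001, random_state=0)).tocsc()
b = np.ones(n)

ilu = spilu(A, drop_tol=1e-4)                      # incomplete LU factors, M = LU ~ A
M = LinearOperator(A.shape, matvec=ilu.solve)      # applies M^{-1} to a vector

x, info = gmres(A, b, M=M, restart=50)
print("converged" if info == 0 else f"GMRES stopped with info = {info}")
```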
B.2 A New Optimization
The most expensive step in GMRES is the calculation of a matrix-vector product, which is performed at each iteration until the algorithm converges. For stiff PDE problems, the rows and columns of A (and therefore L and U) must be given a careful ordering before factorization, because pivoting during iteration is generally too expensive (in terms of memory management) due to the size of the system. Thus discretization of the system likely implies that points in the domain that are close together (that is, interact through the PDE being simulated) should be as close as possible in the linear representation A. Furthermore, for convection-dominated problems, these elements may benefit from being ordered in the direction of the convection, allowing information to flow through the system A as the dynamics of the PDE would flow through the domain.
Inevitably, some parts of the linear system Ax = b will converge faster than other parts. This may occur in parts of the system that correspond to parts of the domain that are affected least by the dynamics of the PDE. Unstructured meshes, such as boundary layer meshes, provide a method for spending less time and fewer calculations on parts of the system where less change occurs. However, it remains unlikely that all parts of the system will converge simultaneously. Furthermore, unstructured meshes are likely more concerned with balancing accuracy against computation time than with computation time alone. We now present an algorithm that potentially optimizes GMRES for stiff PDE problems by turning off calculations, at the matrix-vector product step of GMRES, for parts of the system that have been determined to have converged before the entire system has converged.
As linear systems in general must converge simultaneously, care must be taken when determining whether parts of the domain have converged before the entire system has converged. For a finite element method, all nodes within an element will typically be ordered together, and each node may be represented with several entries in the system A. The algorithm presented here relies on the following assumption: if the part of the linear system corresponding to an element is contiguous in the linear system and the residual corresponding to that part of the system is below a given tolerance, then that element is considered to have converged. While it is possible for parts of a linear system to appear to have converged during iteration but to have actually achieved a small residual with an incorrect value by chance, it should be unlikely that the part of a system corresponding to an entire element would accidentally exhibit a small residual. Note, however, that the tolerance for the convergence of the part of a system corresponding to an element within the domain should be much smaller than the tolerance for the convergence of the entire system, if not machine precision. Note also that, for restarted GMRES, the residuals of individual elements can only easily be calculated at restarts.
Our goal is to compute the preconditioned matrix-vector product

x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = U^{-1} L^{-1} A b,

where the underlined blocks (x_2 and x_4 in this example) correspond to elements that have been deemed to have converged and hence do not need to be updated. There may be more back-and-forths between elements that have converged and elements that are still converging, but the form of the vector x above should be sufficient to generalize to all possibilities. We will need the following partitions:
L = \begin{pmatrix} L_{11} & & & \\ L_{21} & L_{22} & & \\ L_{31} & L_{32} & L_{33} & \\ L_{41} & L_{42} & L_{43} & L_{44} \end{pmatrix}, \qquad U = \begin{pmatrix} U_{11} & U_{12} & U_{13} & U_{14} \\ & U_{22} & U_{23} & U_{24} \\ & & U_{33} & U_{34} \\ & & & U_{44} \end{pmatrix},

A = \begin{pmatrix} A_1 & A_2 & A_3 & A_4 \end{pmatrix}, \qquad b = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}.
Here, the sizes of the partitions match the sizes of the partition of x. We will first solve

y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = L^{-1} A b,

and then solve the entire system for x. Operations will be performed from right to left because operating on a vector is significantly cheaper than matrix-matrix operations. Denote

L^{-1} = \begin{pmatrix} L_{11}^{-1} & & & \\ X_{21} & L_{22}^{-1} & & \\ X_{31} & X_{32} & L_{33}^{-1} & \\ X_{41} & X_{42} & X_{43} & L_{44}^{-1} \end{pmatrix}, \qquad U^{-1} = \begin{pmatrix} U_{11}^{-1} & Y_{12} & Y_{13} & Y_{14} \\ & U_{22}^{-1} & Y_{23} & Y_{24} \\ & & U_{33}^{-1} & Y_{34} \\ & & & U_{44}^{-1} \end{pmatrix}.
Then

y_1 = L_{11}^{-1} (A b)_1.

As before, the inverse indicates that this sub-system should be solved, and does not require solving explicitly for L_{11}^{-1}. As elements converge, the vector y should be stored in memory as well. Although the vector b may change at each iteration, by assumption the parts of b corresponding to the parts of x that have converged will be constant. Suppose now that y_2 has been deemed fixed and need not be calculated; then the next task is to compute y_3. If y_3 can be calculated, then once it is known, any ordering of fixed and unknown elements can be solved for. Write

\begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix} = A b.

Note that

L_{31} y_1 + L_{32} y_2 + L_{33} y_3 = a_3.

Then, knowing y_1 and y_2, we can solve

y_3 = L_{33}^{-1} (a_3 - L_{31} y_1 - L_{32} y_2).
If Ly is also stored in memory, then the expressions above need not be recalculated. Note that recalculation of y_4, which represents elements that have converged and that appear at the end of the vector y, can be avoided by simply terminating after the final un-converged element has been computed.

The above can be applied to any portions remaining in the vector y until the entire vector is found. To find x, the process is repeated in the reverse direction because U is upper triangular. For example, given that x_4 has been calculated or was previously fixed:

x_3 = U_{33}^{-1} (y_3 - U_{34} x_4).
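The block forward substitution described above can be sketched as follows; the dense sub-blocks and the names block_forward_solve, y_saved, and converged are illustrative assumptions (the backward sweep with U is analogous, proceeding from the last block to the first).

```python
import numpy as np

def block_forward_solve(L_blocks, a_blocks, y_saved, converged):
    """Forward substitution y = L^{-1} a over a block partition, skipping converged blocks.

    L_blocks[i][j] : dense sub-blocks of the lower-triangular factor, j <= i.
    a_blocks[i]    : blocks of the right-hand side a = A b.
    y_saved[i]     : previously computed y_i for blocks flagged as converged.
    converged[i]   : True if block i no longer needs to be recomputed.
    """
    y = [None] * len(a_blocks)
    for i, (a_i, done) in enumerate(zip(a_blocks, converged)):
        if done:
            y[i] = y_saved[i]                 # reuse the stored value; no work performed
            continue
        rhs = a_i.copy()
        for j in range(i):                    # subtract contributions of earlier blocks
            rhs -= L_blocks[i][j] @ y[j]
        y[i] = np.linalg.solve(L_blocks[i][i], rhs)   # solve the diagonal block
    return y
```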
There are many potential optimizations when simulating a PDE, and most may conflict with other possible optimizations. The choice of one preconditioner, for example, precludes the possibility of using a different preconditioner. Note, nevertheless, that the algorithm described here does not conflict with any other known optimizations. If all elements converge at about the same rate, then this algorithm will provide no discernible benefit. This algorithm, nevertheless, will not cause additional harm either, and has the potential for considerable time savings for stiff PDE problems.
Appendix C
Strassen’s Algorithm
As discussed in Section 2.8, there are various ways to compute matrix multiplication. Although a study of matrix multiplication is not necessary in this work, the main results in Part II depend on matrix multiplication. Strassen's Algorithm is presented here for two reasons: demonstration that the computational complexity of SRLU can be improved by using the Strassen form of matrix multiplication in place of DGEMM calls, and because Strassen's Algorithm is one of the most unexpected and incredible results encountered by this author during his PhD studies.
The algorithm is defined for square matrices of size 2^k-by-2^k for some k, but can easily be adjusted for all matrices. The matrix product

\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}

can be calculated by first computing
P1 = (A11 + A22) (B11 + B22) ,
P2 = (A21 + A22) B11,
P3 = A11 (B12 −B22) ,
P4 = A22 (B21 −B11) ,
P5 = (A11 + A12) B22,
P6 = (A21 −A11) (B11 + B12) , and
P7 = (A12 −A22) (B21 + B22) .
Then
C11 = P1 + P4 −P5 + P7
C12 = P3 + P5
C21 = P2 + P4
C22 = P1 −P2 + P3 + P6.
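A direct recursive implementation of the formulas above is short; the sketch below assumes the matrix dimension is a power of two and falls back to ordinary multiplication below a cutoff, which is how Strassen's method is typically used in practice.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's algorithm for square matrices whose size is a power of two."""
    n = A.shape[0]
    if n <= cutoff:                      # fall back to ordinary multiplication on small blocks
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    P1 = strassen(A11 + A22, B11 + B22, cutoff)
    P2 = strassen(A21 + A22, B11, cutoff)
    P3 = strassen(A11, B12 - B22, cutoff)
    P4 = strassen(A22, B21 - B11, cutoff)
    P5 = strassen(A11 + A12, B22, cutoff)
    P6 = strassen(A21 - A11, B11 + B12, cutoff)
    P7 = strassen(A12 - A22, B21 + B22, cutoff)

    C11 = P1 + P4 - P5 + P7
    C12 = P3 + P5
    C21 = P2 + P4
    C22 = P1 - P2 + P3 + P6
    return np.block([[C11, C12], [C21, C22]])

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)    # agrees with the conventional product
```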
Calculation of each P_i requires one matrix-matrix multiplication of size 2^{k-1}-by-2^{k-1}, and so if the equations above are applied recursively then the amount of time spent in multiplication for an n-by-n matrix is

T(n) \approx 7 \cdot T\left(\frac{n}{2}\right) \approx 7^{\log_2(n)} \cdot T(1) \approx O\left(n^{\log_2 7}\right).

Here the approximation is due to the lower-order operations needed as well, but when all operations are considered the order of complexity remains O(n^{\log_2 7}) [47]. The standard three-nested-loop algorithm is an O(n³) algorithm, and the definition and conventional algorithm for matrix multiplication give no indication that this complexity could be improved upon. Newer algorithms have been discovered with even smaller complexity, although these algorithms have no practical implementation. It remains an open conjecture that matrix-matrix multiplication can be arbitrarily close to or achieve O(n²) [106].
Appendix D
A Visualization of SRLU
This appendix steps through TRLUCP, listing for each stage the BLAS/LAPACK routine used and the approximate flop count.

Initialization (DGEMM, O(pmn) flops): form R = ΩA, where Ω is the p-by-m random matrix and A is the m-by-n data matrix.

Iterate j = 0 : b : (k − b):

1. Column selection (DGETRF or RRQR): O((n − j)pb) flops, applied to R.
2. Partial Schur update (DGEMM): 2(m − j)jb flops.
3. LU factorization (DGETRF): (2/3)(m − j)b² flops.
4. Update U (DGEMM): 2bj(n − (j + b)) flops.
5. Update R (DGEMM): O(pb(n − (j + b))) flops.

Multiple low-order operations.

TOTAL FLOPS: 2pmn + (m + n)k² + (low order)
Bibliography
[1] D. Achlioptas. “Database-friendly random projections: Johnson-Lindenstrauss withbinary coins”. In: J. Comput. Syst. Sci. 66.4 (2003), pp. 671–687.
[2] K. J. Ahn, S. Guha, and A. McGregor. “Graph sketches: sparsification, spanners, andsubgraphs.” In: PODS. ACM, 2012, pp. 5–14.
[3] N. Ailon and B. Chazelle. “The Fast Johnson–Lindenstrauss Transform and Approx-imate Nearest Neighbors”. In: SIAM J. Comput. 39.1 (2009), pp. 302–322.
[4] N. Ailon and E. Liberty. “Fast Dimension Reduction Using Rademacher Series onDual BCH Codes.” In: Discrete and Computational Geometry 42.4 (Dec. 17, 2009),pp. 615–630.
[5] Y. Aizenbud, G. Shabat, and A. Averbuch. “Randomized LU Decomposition UsingSparse Projections.” In: CoRR abs/1601.04280 (2016).
[6] D. G. Anderson and M. Gu. "An Efficient, Sparsity-Preserving, Online Algorithm for Data Approximation". In: CoRR (2015). URL: https://arxiv.org/abs/1602.05950.
[7] D. G. Anderson, M. Gu, and C. Melgaard. “An Efficient Algorithm for UnweightedSpectral Graph Sparsification.” In: CoRR abs/1410.4273 (2014).
[8] D. G. Anderson et al. "Spectral Gap Error Bounds for Improving CUR Matrix Decomposition and the Nyström Method." In: AISTATS. Vol. 38. JMLR Proceedings. 2015.
[9] H. Avron and C. Boutsidis. “Faster Subset Selection for Matrices and Applications”.In: CoRR abs/1201.0127 (2012).
[10] K. C. Barr and K. Asanovic. “Energy-aware lossless data compression.” In: ACMTrans. Comput. Syst. 24.3 (Oct. 22, 2008), pp. 250–291.
[11] J. Batson, D. A. Spielman, and N. Srivastava. “Twice-Ramanujan Sparsifiers”. In:SIAM Journal on Computing 41.6 (2012), pp. 1704–1721.
[12] J. D. Batson, D. A. Spielman, and N. Srivastava. “Twice-Ramanujan Sparsifiers.” In:SIAM J. Comput. 41.6 (2012), pp. 1704–1721.
[13] J. D. Batson et al. “Spectral sparsification of graphs: theory and algorithms.” In:Commun. ACM 56.8 (2013), pp. 87–94.
[14] A. A. Benczúr and D. R. Karger. "Approximating s-t Minimum Cuts in Õ(n²) Time." In: STOC. ACM, 1996, pp. 47–55.
[15] M. W. Berry, Z. Drmac, and E. R. Jessup. “Matrices, Vector Spaces, and InformationRetrieval”. In: SIAM Review 41.2 (1999), pp. 335–362.
[16] M. W. Berry, S. A. Pulatova, and G. W. Stewart. “Algorithm 844: Computing sparsereduced-rank approximations to sparse matrices.” In: ACM Trans. Math. Softw. 31.2(2005), pp. 252–269.
[17] C. Boutsidis, P. Drineas, and M. Magdon-Ismail. “Near-Optimal Column-Based Ma-trix Reconstruction”. In: CoRR abs/1103.0995 (2011).
[18] C. Boutsidis, P. Drineas, and M. Magdon-Ismail. “Near-Optimal Column-Based Ma-trix Reconstruction.” In: SIAM J. Comput. 43.2 (2014), pp. 687–717.
[19] R. P. Brent. “An algorithm with guaranteed convergence for finding a zero of a func-tion”. In: The Computer Journal 14.4 (1971), pp. 422–425.
[20] E. J. Candes and T. Tao. “The power of convex relaxation: near-optimal matrixcompletion”. In: IEEE Transactions on Information Theory 56.5 (2010), pp. 2053–2080.
[21] E. J. Cands and B. Recht. “Exact matrix completion via convex optimization.” In:Commun. ACM 55.6 (2012), pp. 111–119.
[22] E. J. Cands et al. “Robust Principal Component Analysis?” In: CoRR abs/0912.3599(2009).
[23] E. Carson et al. Write-Avoiding Algorithms. Tech. rep. UCB/EECS-2015-163. EECSDepartment, University of California, Berkeley, 2015.
[24] T. F. Chan. “Rank revealing QR factorizations”. In: Linear algebra and its applica-tions 88/89 (1987), pp. 67–82.
[25] H. Cheng et al. “On the Compression of Low Rank Matrices.” In: SIAM J. ScientificComputing 26.4 (2005), pp. 1389–1404.
[26] F. Chierichetti, S. Lattanzi, and A. Panconesi. “Rumour Spreading and Graph Con-ductance.” In: SODA. SIAM, 2010, pp. 1657–1663.
[27] E. Chow and A. Patel. “Fine-Grained Parallel Incomplete LU Factorization.” In:SIAM J. Scientific Computing 37.2 (2015).
[28] P. Christiano et al. “Electrical flows, laplacian systems, and faster approximation ofmaximum flow in undirected graphs”. In: STOC. ACM, 2011, pp. 273–282.
[29] A. Civril and M. Magdon-Ismail. “Deterministic Sparse Column Based Matrix Re-construction via Greedy Approximation of SVD”. In: Algorithms and Computation.Vol. 5369. Lecture Notes in Computer Science. 2008, pp. 414–423.
[30] K. L. Clarkson and D. P. Woodruff. “Low Rank Approximation and Regression inInput Sparsity Time”. In: CoRR abs/1207.6365 (2012).
[31] A. Dasgupta, R. Kumar, and T. Sarlos. “A sparse Johnson-Lindenstrauss transform”. In: STOC. ACM, 2010, pp. 341–350.
[32] S. Dasgupta and A. Gupta. “An elementary proof of a theorem of Johnson and Lindenstrauss.” In: Random Struct. Algorithms 22.1 (2003), pp. 60–65.
[33] T. A. Davis and Y. Hu. “The University of Florida Sparse Matrix Collection”. In: ACM Transactions on Mathematical Software 38 (1 2011), pp. 1–25. url: http://www.cise.ufl.edu/research/sparse/matrices.
[34] J. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
[35] A. Deshpande and S. Vempala. “Adaptive Sampling and Fast Low-Rank Matrix Approximation.” In: APPROX-RANDOM. Vol. 4110. Lecture Notes in Computer Science. Springer, 2006, pp. 292–303.
[36] A. Deshpande et al. “Matrix Approximation and Projective Clustering via Volume Sampling.” In: Theory of Computing 2.12 (2006), pp. 225–247.
[37] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. “Relative-Error CUR Matrix Decompositions.” In: SIAM J. Matrix Analysis Applications 30.2 (2008), pp. 844–881.
[38] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct methods for sparse matrices. 2nd ed. Oxford Science Publications. New York: The Clarendon Press, Oxford University Press, 1989, pp. xiv+341. isbn: 0-19-853421-3.
[39] C. Eckart and G. Young. “The approximation of one matrix by another of lower rank”. In: Psychometrika 1.3 (1936), pp. 211–218.
[40] Office of Science, U.S. Department of Energy. Configuration. https://www.nersc.gov/users/computational-systems/edison/configuration/. Accessed on August 22, 2015. Published on March 30, 2015.
[41] S. Fine and K. Scheinberg. “Efficient SVM Training Using Low-Rank Kernel Representations.” In: Journal of Machine Learning Research 2 (2001), pp. 243–264.
[42] L. V. Foster. “The growth factor and efficiency of Gaussian elimination with rook pivoting”. In: J. of Comp. and Appl. Math. 86 (1997), pp. 177–194.
[43] P. Frankl and H. Maehara. “The Johnson-Lindenstrauss lemma and the sphericity ofsome graphs.” In: J. Comb. Theory, Ser. B 44.3 (1988), pp. 355–362.
[44] A. M. Frieze, R. Kannan, and S. Vempala. “Fast monte-carlo algorithms for finding low-rank approximations.” In: J. ACM 51.6 (2004), pp. 1025–1041.
[45] T. Fruchterman and E. Reingold. “Graph Drawing by Force-Directed Placement”. In: Software–Practice & Experience 21.11 (1991), pp. 1129–1164.
[46] P. E. Gill et al. “Maintaining LU Factors of a General Sparse Matrix.” In: Linear Algebra and Its Applications 88 (1987), pp. 239–270.
[47] G. H. Golub and C. F. Van Loan. Matrix Computations. 4th ed. JHU Press, 2013.
[48] J. Gondzio. “Stable algorithm for updating dense LU factorization after row or column exchange and row and column addition or deletion”. In: Optimization: A Journal of Mathematical Programming and Operations Research (2007), pp. 7–26.
[49] L. Grigori, J. Demmel, and X. S. Li. “Parallel Symbolic Factorization for Sparse LU with Static Pivoting.” In: SIAM J. Scientific Computing 29.3 (2007), pp. 1289–1314.
[50] M. Gu. “Subspace Iteration Randomization and Singular Value Problems.” In: SIAM J. Scientific Computing 37.3 (2015).
[51] M. Gu and S. C. Eisenstat. “Efficient algorithms for computing a strong rank-revealing QR factorization”. In: SIAM J. Sci. Comput. 17.4 (1996), pp. 848–869.
[52] N. Halko, P.-G. Martinsson, and J. A. Tropp. “Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions”. In: SIAM Review 53.2 (2011), pp. 217–288.
[53] N. J. Higham. Accuracy and stability of numerical algorithms. 2nd ed. SIAM, 2002, pp. I–XXX, 1–680. isbn: 978-0-89871-521-7.
[54] N. J. Higham and S. D. Relton. “Estimating the Largest Elements of a Matrix.” In: SIAM J. Scientific Computing 38.5 (2016).
[55] Y. P. Hong and C.-T. Pan. “Rank-revealing QR factorizations and the singular value decomposition”. In: Mathematics of Computation 58 (1992), pp. 213–232.
[56] H. Hotelling. “Some New Methods in Matrix Calculation”. In: Ann. Math. Stat. 14 (1943), pp. 1–34.
[57] C.-J. Hsieh and P. A. Olsen. “Nuclear Norm Minimization via Active Subspace Selection.” In: ICML. Vol. 32. JMLR Proceedings. 2014, pp. 575–583.
[58] T.-M. Hwang, W.-W. Lin, and E. K. Yang. “Rank Revealing LU Factorizations”. In: Linear Algebra and Its Applications 175 (1992), pp. 115–141.
[59] P. Indyk and R. Motwani. “Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality”. In: STOC. 1998, pp. 604–613.
[60] P. Indyk and A. Naor. “Nearest-neighbor-preserving embeddings.” In: ACM Transactions on Algorithms 3.3 (2007).
[61] W. B. Johnson and J. Lindenstrauss. “Extensions of Lipschitz Mappings into a Hilbert Space”. In: Contemporary Mathematics 26 (1984), pp. 189–206.
[62] D. M. Kane and J. Nelson. “Sparser Johnson-Lindenstrauss transforms”. In: SODA. SIAM, 2012, pp. 1195–1206.
[63] M. Kapralov and R. Panigrahy. “Spectral sparsification via random spanners.” In:ITCS. ACM, 2012, pp. 393–398.
[64] A. Khabou et al. “LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version.” In: SIAM J. Matrix Analysis Applications 34.3 (2013), pp. 1401–1429.
[65] B. Klartag and S. Mendelson. “Empirical processes and random projections”. In: J. Funct. Anal. 225 (2005), pp. 229–245.
[66] C. Leiserson. “Fat-trees: Universal Networks for Hardware-efficient Supercomputing”.In: IEEE Trans. Comput. 34.10 (1985), pp. 892–901.
[67] J. Leskovec and A. Krevl. SNAP Datasets: Stanford Large Network Dataset Collection.http://snap.stanford.edu/data. Oct. 2014.
[68] N. Li, Y. Saad, and E. Chow. “Crout Versions of ILU for General Sparse Matrices”. In: SIAM J. Sci. Comput. 25.2 (2003), pp. 716–728.
[69] E. Liberty et al. “Randomized algorithms for the low-rank approximation of matrices”. In: Proceedings of the National Academy of Sciences 104.51 (2007), p. 20167.
[70] M. Lichman. UCI Machine Learning Repository. 2013. url: http://archive.ics.uci.edu/ml.
[71] C. Lv and Q. Zhao. “Integration of Data Compression and Cryptography: Another Way to Increase the Information Security.” In: AINA Workshops (2). IEEE Computer Society, 2007, pp. 543–547.
[72] M. W. Mahoney and P. Drineas. “CUR matrix decompositions for improved data analysis”. In: Proceedings of the National Academy of Sciences 106.3 (2009), pp. 697–702.
[73] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Interlacing Families II: Mixed Characteristic Polynomials and the Kadison-Singer Problem”. In: CoRR abs/1306.3969 (2014).
[74] P.-G. Martinsson, V. Rokhlin, and M. Tygert. “A Randomized Algorithm for the Approximation of Matrices.” In: Tech. Rep., Yale University, Department of Computer Science 1361 (2006).
[75] M. Mathioudakis et al. “Sparsification of influence networks.” In: KDD. ACM, 2011, pp. 529–537.
[76] MathWorks. The Gatlinburg and Householder Symposia. https://www.mathworks.com/company/newsletters/articles/the-gatlinburg-and-householder-symposia.html. Accessed on December 10, 2016. Published in 2013.
[77] J. Matousek. “On variants of the Johnson-Lindenstrauss lemma”. In: Random Struct. Algorithms 33.2 (2008), pp. 142–156.
[78] C. Melgaard and M. Gu. “Gaussian Elimination with Randomized Complete Pivoting.” In: CoRR abs/1511.08528 (2015).
[79] L. Miranian and M. Gu. “Strong rank revealing LU factorizations”. In: Linear Algebra and its Applications 367 (2003), pp. 1–16.
[80] L. Miranian and M. Gu. “Strong Rank Revealing LU Factorizations.” In: Linear Algebra and its Applications 367 (2002), pp. 1–16.
[81] NASA. NASA Celebrates 50 Years of Spacewalking. https://www.nasa.gov/image-feature/nasa-celebrates-50-years-of-spacewalking. Accessed on August 22, 2015. Published on June 3, 2015. Original photograph from February 7, 1984.
[82] J. v. Neumann and H. H. Goldstine. “Numerical Inverting of Matrices of High Order”. In: Bull. Amer. Math. Soc. 53 (1947), pp. 1021–1099.
[83] C.-T. Pan. “On the existence and computation of rank-revealing LU factorizations”. In: Linear Algebra and its Applications 316.1 (2000), pp. 199–222.
[84] C. H. Papadimitriou et al. “Latent Semantic Indexing: A Probabilistic Analysis.” In: J. Comput. Syst. Sci. 61.2 (2000), pp. 217–235.
[85] L. Parsons, E. Haque, and H. Liu. “Subspace clustering for high dimensional data: a review”. In: ACM SIGKDD Explorations Newsletter 6.1 (2004), pp. 90–105.
[86] W. B. Pennebaker and J. L. Mitchell. JPEG Still Image Data Compression Standard. Norwell, MA, USA: Kluwer Academic Publishers, 1992.
[87] B. Recht, M. Fazel, and P. A. Parrilo. “Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization.” In: SIAM Review 52.3 (2010), pp. 471–501.
[88] Y. Saad. Iterative Methods for Sparse Linear Systems. 2nd ed. SIAM, 2003.
[89] T. Sarlos. “Improved Approximation Algorithms for Large Matrices via Random Projections.” In: FOCS. IEEE Computer Society, 2006, pp. 143–152.
[90] Silicon Valley. https://www.amazon.com/Silicon-Valley-Season-1/dp/B00M4ZPZPY. Nov. 2016.
[91] D. A. Spielman and N. Srivastava. “Graph Sparsification by Effective Resistances.” In: SIAM J. Comput. 40.6 (2011), pp. 1913–1926.
[92] D. A. Spielman and S.-H. Teng. “Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems”. In: STOC’04. 2004, pp. 81–90.
[93] D. A. Spielman and S.-H. Teng. “Nearly-Linear Time Algorithms for Preconditioning and Solving Symmetric, Diagonally Dominant Linear Systems”. In: CoRR abs/cs/0607105 (2006).
[94] D. A. Spielman and S.-H. Teng. “Solving Sparse, Symmetric, Diagonally-Dominant Linear Systems in Time O(m^1.31)”. In: CoRR cs.DS/0310036 (2003).
[95] D. A. Spielman and S.-H. Teng. “Spectral Sparsification of Graphs”. In: ().
[96] P. Stange, A. Griewank, and M. Bollhofer. “On the Efficient Update of Rectangular LU-Factorizations Subject to Low Rank Modifications.” In: Electronic Transactions on Numerical Analysis 26 (2007), pp. 161–177.
[97] G. W. Stewart. “An updating algorithm for subspace tracking.” In: IEEE Trans. Signal Processing 40.6 (1992), pp. 1535–1541.
[98] G. W. Stewart. “The QLP Approximation to the Singular Value Decomposition.” In: SIAM J. Scientific Computing 20.4 (1999), pp. 1336–1348.
[99] V. Strassen. “Gaussian elimination is not optimal”. In: Numerische Mathematik 13.4 (1969), pp. 354–356.
[100] D. S. Taubman and M. W. Marcellin. JPEG2000: Image Compression Fundamentals, Standards, and Practice. Boston: Kluwer Academic Publishers, 2002.
[101] A. M. Turing. “Rounding-Off Errors in Matrix Processes”. In: Quart. J. Appl. Math. (1948), pp. 287–308.
[102] M. Udell et al. “Generalized Low Rank Models.” In: Foundations and Trends in Machine Learning 9.1 (2016), pp. 1–118.
[103] C. Van Loan. “Computing the CS and the Generalized Singular Value Decompositions”. In: Numerische Mathematik 46.4 (1985), pp. 479–491.
[104] S. S. Vempala. The Random Projection Method. Vol. 65. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. DIMACS/AMS, 2004, pp. 1–103.
[105] J. H. Wilkinson. “Error Analysis of Direct Methods of Matrix Inversion.” In: J. ACM8.3 (1961), pp. 281–330.
[106] V. V. Williams. Breaking the Coppersmith-Winograd barrier. Manuscript. 2012.
[107] D. P. Woodruff. “Sketching as a Tool for Numerical Linear Algebra.” In: Foundations and Trends in Theoretical Computer Science 10.1-2 (2014), pp. 1–157.
[108] F. Woolfe et al. “A fast randomized algorithm for the approximation of matrices”. In: Applied and Computational Harmonic Analysis 25.3 (2008), pp. 335–366.
[109] J. Xiao. On Reliability of Randomized QR Factorization with Column Pivoting. Matrix Computations and Scientific Computing Seminar, Berkeley, California, October 5, 2016.
[110] H. Zou, T. Hastie, and R. Tibshirani. “Sparse principal component analysis”. In: Journal of Computational and Graphical Statistics 15 (2006), pp. 262–286.