Reliable and Efficient Algorithms for Spectrum-Revealing Low-Rank Data Analysis
by
David Gaylord Anderson
A dissertation submitted in partial satisfaction of the
requirements for the degree of
Doctor of Philosophy
in
Applied Mathematics
in the
Graduate Division
of the
University of California, Berkeley
Committee in charge:
Professor Ming Gu, Co-chair
Professor Per-Olof Persson, Co-chair
Professor Benjamin Recht
Fall 2016
Reliable and Efficient Algorithms for Spectrum-Revealing Low-Rank Data Analysis
Copyright 2016 by
David Gaylord Anderson
Abstract
Reliable and Efficient Algorithms for Spectrum-Revealing Low-Rank Data Analysis
by
David Gaylord Anderson
Doctor of Philosophy in Applied Mathematics
University of California, Berkeley
Professor Ming Gu, Co-chair
Professor Per-Olof Persson, Co-chair
As the amount of data collected in our world increases, reliable compression algorithms are needed when datasets become too large for practical analysis, when significant noise is present in the data, or when the strongest signals in the data are needed. In this work, two data compression algorithms are presented. The main result is a low-rank approximation algorithm (a type of compression algorithm) that uses modern techniques in randomization to repurpose a classic algorithm in the field of linear algebra called the LU decomposition to perform data compression. The resulting algorithm is called Spectrum-Revealing LU (SRLU).
Both rigorous theory and numeric experiments demonstrate the effectiveness of SRLU. The theoretical work presented also develops a framework with which other low-rank approximation algorithms can be analyzed. As the name implies, Spectrum-Revealing LU seeks to capture the entire spectrum of the data (i.e. to capture all signals present in the data).
A second compression algorithm is also introduced, which seeks to compress graphs. Called a sparsification algorithm, this algorithm can accept a weighted or unweighted graph and produce an approximation without changing the weights (or introducing weights in the case of an unweighted graph). Theoretical results provide a bound on the quality of the results, and a numeric example is also explored.
To my parents
with all my love.
Contents
List of Figures
List of Tables

I  Introduction to Low-Rank Approximation and Linear Algebra

1  Low-Rank Approximation
   1.1  Introduction
   1.2  Data Compression
   1.3  Reliable and Efficient Algorithms for Spectrum-Revealing Low-Rank Data Analysis
   1.4  Future Work

2  Linear Algebra Preliminaries
   2.1  Definitions
   2.2  The Singular Value Decomposition (SVD)
   2.3  Approximation Optimality
   2.4  The LU Decomposition
   2.5  The QR Decomposition
   2.6  Interlacing Property of Singular Values
   2.7  Numerical Linear Algebra
   2.8  Communication-Avoiding Algorithms
   2.9  A Note About Matrix-Matrix Multiplication

II  The Spectrum-Revealing LU Decomposition

3  Background on Low-Rank Approximation and the LU Decomposition
   3.1  Introduction
   3.2  Problem Statement
   3.3  Previous Work on Deterministic Low-Rank Approximation Algorithms
   3.4  Previous Work on Randomized Low-Rank Approximation Algorithms
   3.5  Problems Related to Low-Rank Approximation
   3.6  Previous Work on the LU Decomposition

4  Spectrum-Revealing LU
   4.1  Main Contribution: Spectrum-Revealing LU (SRLU)
   4.2  Spectrum-Revealing Pivoting (SRP)
   4.3  LU Updating
   4.4  Choice of Block Size
   4.5  Variations of SRLU

5  The Theory of Spectrum-Revealing Algorithms
   5.1  Theoretical Results for SRLU Factorizations
   5.2  Comparison of SRLU Factorizations with RRLU and RRQR Factorizations
   5.3  Fast SRLU

6  Numerical Experiments with SRLU
   6.1  Speed and Accuracy Tests
   6.2  Efficiency Tests
   6.3  Towards Feature Selection
   6.4  Sparsity Preservation Tests
   6.5  Online Data Processing
   6.6  Pathological Test Matrix
   6.7  Image Compression Example
   6.8  Testing Quality Controls

III  Unweighted Graph Sparsification

7  Unweighted Column Selection
   7.1  Introduction
   7.2  Background
   7.3  An Unweighted Column Selection Algorithm
   7.4  Correctness and Performance of the UCS Algorithm
   7.5  Performance Comparison of UCS and Other Algorithms
   7.6  A Numeric Example: Graph Visualization
   7.7  Relationship to the Kadison-Singer Problem
   7.8  Additional Thoughts

8  Additional Results on Unweighted Graph Sparsification
   8.1  A Running Bound
   8.2  Faster Subset Selection for Matrices and Applications

A  An Algorithm for Sparse PCA
   A.1  Problem Formulation
   A.2  Setup for a Single Column
   A.3  Solving for v1
   A.4  More on λ and the Objective Function
   A.5  Bounds on λ
   A.6  Putting It All Together
   A.7  Error Bound

B  An Efficient Implementation of the Generalized Minimum Residual Method for Stiff PDE Problems
   B.1  Preconditioned GMRES
   B.2  A New Optimization

C  Strassen’s Algorithm

D  A Visualization of SRLU

Bibliography
List of Figures
1.1  A first example of low-rank approximation.
3.1  Visualizations of different LU factorizations.
4.1  Benchmarking TRLUCP with various block sizes on random matrices of different sizes and truncation ranks.
6.1  Accuracy Experiment on random 1000x1000 matrices with different rates of spectral decay.
6.2  Time Experiment on random matrices of varying sizes, and a time experiment on a 1000x1000 matrix with varying truncation ranks.
6.3  Efficiency experiment on random matrices of varying sizes compared to peak hardware performance.
6.4  Image processing example. The original image [81], a rank-50 approximation with SRLU, and a highlight of the rows and columns selected by SRLU.
6.5  Circuit Simulation Data.
6.6  Sparse Data Processing Example with Circuit Simulation Data.
6.7  The cumulative uses of the top five most commonly used words in the Enron email corpus after reordering.
6.8  Singular values of SRLU factorizations of various ranks (red) versus the singular values of the Devil’s Stairs matrix (blue).
6.9  Image compression experiment with various factorizations. From left to right: James Wilkinson, Wallace Givens, George Forsythe, Alston Householder, Peter Henrici, and Friedrich Bauer. (Gatlinburg, Tennessee, 1964.)
7.1  Autonomous System Example: Original Graph.
7.2  Autonomous Systems Graph with Sparsifiers of Various Cardinalities (node coordinates calculated from whole graph).
7.3  Autonomous Systems Graph with Sparsifiers of Various Cardinalities (node coordinates recalculated for each sparsifier).
7.4  Progress During Iteration and Theoretical Singular Value Lower Bound for Sparsifiers of Various Cardinalities.
List of Tables
3.1  Efficiency comparison of LU orderings.
3.2  Definition of parameters (smallest to largest).
5.1  Bounds for Growth Factors of LU Variants [53].
6.1  Mean values of the constants from the theorems presented in this work, for various random matrices. Constants for spectral theorems are averaged over the top 10 singular values. TRLUCP was used, and no swaps were needed, so SRLU results match TRLUCP.
6.2  Average number of swaps needed on a random 1000-by-1000 matrix for various small values of f.
Acknowledgments
A great number of people helped to make this work possible, as well as to aid in my broader mathematical studies. I am, in particular, most thankful for the guidance of Ming and Per over the years. With their advising, support, and friendship my PhD studies were as rewarding as I had initially hoped they would be, and I am very proud of my accomplishments and (some of) my failures during that time.
Part I
Introduction to Low-Rank Approximation and Linear Algebra
Chapter 1
Low-Rank Approximation
1.1 Introduction
As the amount of data collected grows, data compression has become ubiquitous in our lives. Many datasets are too big to work with directly, and data collection may be growing far faster than processing power. Compression and other algorithms are needed to understand all of the information in our world.
There are no perfect compression algorithms for real-world data. Digital photos, for example, are often stored using the JPEG compression algorithm [86]. Improvements are consistently introduced, as well as new algorithms, such as JPEG2000 [100]. Many other algorithms are widely used for photo compression, each with its own advantages and disadvantages. Data compression saves disk space, improves data transfer times, renders the data easier to analyze, and aids in removing noise from the data. Additional uses of compression include cryptography [71] and energy conservation [10]. Data approximation has even appeared in popular culture recently as the innovative technology behind a fictional startup in the sitcom Silicon Valley [90].
The effectiveness of a compression algorithm depends on the type of data used. Photos are an example of structured data: there are no missing data points in general, and each data point is the same type of measurement as each other data point. The effectiveness of compression also depends on the goal: lossless algorithms seek to preserve the original data in its entirety, while lossy algorithms, at the expense of some data loss, may save considerably more space than lossless algorithms and may be effective in more applications. This work is about lossy compression algorithms for structured data.
1.2 Data Compression
Figure 1.1, a reproduction of a case study from [72], shows data drawn from two distributions. A rank-2 approximation finds two vectors that approximate the entire dataset well; almost all of the data is nearly collinear with one of these two vectors.

Figure 1.1: A first example of low-rank approximation.

Although not all of the data is captured in a rank-2 approximation, such a compression saves a vast amount of space at little expense. In essence, two data points can accurately describe the 20,000 data points in Figure 1.1. While few datasets are so obviously well-approximated by a low-rank replication, many datasets are so large compared to the underlying forces determining the data that an accurate low-rank approximation exists. In [72], for example, a rank-2 approximation of gene expression data is shown to be an effective data preprocessing step for identifying three cancer types.
1.3 Reliable and Efficient Algorithms for Spectrum-Revealing Low-Rank Data Analysis
In this work, two algorithms are presented, as well as the theoretical tools needed to analyze these and comparable algorithms. The rest of this dissertation is organized as follows: the remainder of Part I covers background information and previous results necessary to continue the discussion of low-rank approximation algorithms. The most significant contribution of this work is an approximation algorithm called Spectrum-Revealing LU, which is presented and analyzed in Part II. An additional low-rank approximation is introduced in Part III. This algorithm, called Unweighted Graph Sparsification, is a low-rank approximation specifically for graphs, and the quality of the approximation is theoretically analyzed.
1.4 Future Work
One doctoral degree is surely not enough to answer all questions that arise from a research question in mathematics. In Part II, much work remains to develop the SRLU algorithm into state-of-the-art open-source software. Some code, such as fast versions of finding or approximately finding the largest element in the Schur complement, remains to be written. In Part III, one unanswered question is whether the performance parameter T could vary (or be a different constant) to improve the quality of the sparsifier more quickly. Additionally, an unexplored extension of the UCS algorithm is to create higher quality sparsifiers by eliminating nodes in addition to edges. In the Appendix, two research projects are presented with preliminary work. The first, an algorithm for sparse principal component analysis, could be extended by developing a block version. The second, an optimization for some PDE problems using GMRES, has a theoretical framework, but has not yet been numerically tested.
Chapter 2
Linear Algebra Preliminaries
Linear algebra is the area of mathematics that will provide most of the tools needed in this work, and it is essential in many other areas of the mathematical sciences. Linear algebra is not only the study of linear systems, but also the study of efficiently organizing numerical computation. For an introduction to linear algebra, see [47]. Some especially pertinent concepts are covered in this chapter.
2.1 Definitions
The range of a matrix A is defined as
\[ \mathrm{range}(A) = \{ y \in \mathbb{R}^m : y = Ax \text{ for some } x \in \mathbb{R}^n \}. \]
The rank of a matrix A is the dimension of the range of A:
rank (A) = dim (range (A)) .
A vector norm on Rn is a function f : Rn → R that satisfies
1. f(x) ≥ 0, x ∈ Rn,
2. f(x) = 0 iff x = 0,
3. f(x+ y) ≤ f(x) + f(y), x, y ∈ Rn,
4. f(αx) = |α|f(x), α ∈ R, x ∈ Rn.
The norm is denoted ‖x‖ = f(x). A matrix norm is a function f : R^{m×n} → R that satisfies similar properties to those in the definition of a vector norm:
1. f (A) ≥ 0,A ∈ Rm×n,
2. f (A) = 0 iff A = 0,
3. f (A + B) ≤ f (A) + f (B) ,A,B ∈ Rm×n,
4. f (αA) = |α|f (A) , α ∈ R,A ∈ Rm×n.
The p-norm of a vector x, denoted ‖x‖_p, is defined as
\[ \|x\|_p = \left( |x_1|^p + \cdots + |x_n|^p \right)^{1/p}. \]
The p-norm of a matrix A, denoted ‖A‖_p, is defined as
\[ \|A\|_p = \sup_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p}. \]
The Frobenius norm of a matrix A is defined as
\[ \|A\|_F = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2 \right)^{1/2}. \]
The last two definitions on their own do not automatically imply that they are matrix norms, but they can indeed be proven to be matrix norms. Several other important terms: the identity matrix I ∈ R^{m×m} is the matrix with I_{ii} = 1 and I_{ij} = 0 for i ≠ j. The transpose of a matrix A, denoted A^T, is defined by (A^T)_{ij} = A_{ji}. A matrix A ∈ R^{m×m} is orthogonal if A^T A = I. A sparse matrix is a matrix with many zeros (enough that they may be taken advantage of). Another important definition, the singular values of a matrix A, will be defined later. See [47] for additional useful definitions.
2.2 The Singular Value Decomposition (SVD)
Theorem 1. (Singular Value Decomposition [47]) For a matrix A ∈ R^{m×n}, there exist orthogonal matrices U ∈ R^{m×m} and V ∈ R^{n×n} and a diagonal matrix Σ with diag(Σ) = (σ_1, σ_2, ..., σ_p), where p = min(m,n) and σ_1 ≥ σ_2 ≥ ... ≥ σ_p ≥ 0, such that
\[ U \Sigma V^T = A. \]
A linear equation of the form Ax = b for b ∈ R^m can be solved by calculating
\[ x = V \Sigma^{\dagger} U^T b. \]
Also,
\[ \|A\|_2 = \sigma_1 \quad \text{and} \quad \|A\|_F = \sqrt{\sigma_1^2 + \cdots + \sigma_p^2}, \]
because ‖U^T A V‖ = ‖Σ‖ for both the 2-norm and the Frobenius norm. The matrix A can also be expressed as
\[ A = \sum_{i=1}^{\min(m,n)} \sigma_i u_i v_i^T, \]
where u_i and v_i denote columns of U and V respectively. The values σ_i are known as the singular values of A. The rank of a matrix A is equal to the number of nonzero singular values. If the rank of A is r (note we must have 0 ≤ r ≤ min(m,n)), then A can be expressed as
\[ A = \sum_{i=1}^{r} \sigma_i u_i v_i^T. \]
The set of singular values of a matrix A is unique (note there may be repeated values within this set).
2.3 Approximation Optimality
Eckart and Young
A standard benchmark for the quality of an approximation is due to Eckart and Young. Let
\[ A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T \]
be the rank-k truncated SVD of a data matrix A ∈ R^{m×n}, where k is chosen so that 1 ≤ k ≤ min(m,n).

Theorem 2. (Eckart-Young [39])
\[ A_k = \arg\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \arg\min_{\mathrm{rank}(B) \le k} \|A - B\|_F, \]
with
\[ \|A - A_k\|_2 = \sigma_{k+1}, \qquad \|A - A_k\|_F = \sqrt{\sum_{j=k+1}^{\rho} \sigma_j^2}. \]
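As an illustration (not part of the original text), the short NumPy check below verifies Theorem 2 numerically on a small random matrix; the dimensions, rank k, and seed are arbitrary choices.

```python
import numpy as np

# Check of Theorem 2 (Eckart-Young): the rank-k truncated SVD attains
# spectral-norm error sigma_{k+1} and Frobenius-norm error equal to the
# root of the sum of the squared tail singular values.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # rank-k truncated SVD

err2 = np.linalg.norm(A - A_k, 2)
errF = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(err2, s[k]))                        # sigma_{k+1} (s[k] with 0-indexing)
print(np.isclose(errF, np.sqrt(np.sum(s[k:] ** 2)))) # tail of the spectrum
```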
2.4 The LU Decomposition
The SVD is called a matrix factorization because it expresses a matrix as a product of “simpler” matrices (matrices that are easier to work with in some sense). Another matrix factorization critical to this work is the LU decomposition. The LU decomposition factors a matrix A into a lower triangular matrix and an upper triangular matrix:
\[ A = LU. \]
This factorization will be studied in great detail in Part II. The name ‘LU’ reflects that L is a lower triangular matrix and U is an upper triangular matrix. The familiar term ‘Gaussian elimination’ refers to an algorithm that performs the same calculations as the LU decomposition.
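As a small illustration (an addition, assuming SciPy is available), scipy.linalg.lu computes the row-pivoted factorization A = PLU that practical implementations use:

```python
import numpy as np
from scipy.linalg import lu

# scipy returns P, L, U with A = P @ L @ U, L unit lower triangular,
# and U upper triangular.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))

P, L, U = lu(A)
print(np.allclose(A, P @ L @ U))      # reconstruction
print(np.allclose(L, np.tril(L)))     # L is lower triangular (unit diagonal)
print(np.allclose(U, np.triu(U)))     # U is upper triangular
```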
2.5 The QR Decomposition
A matrix A ∈ Rm×n can be factored into the product of an orthogonal matrix Q ∈ Rm×m
and an upper triangular matrix R ∈ Rm×n:
A = QR.
The QR decomposition [47] will not be computed in this work (although it may be applied as a subroutine in the main algorithm discussed later). The existence of the QR decomposition, nevertheless, will be used in theoretical results. The ability to factor a matrix into an orthogonal component and an upper triangular component will be used extensively in the analysis of the main result in Part II. The name ‘QR’ reflects that Q is commonly used to denote an orthogonal matrix and R denotes a right-triangular matrix (another term for upper-triangular matrix).
2.6 Interlacing Property of Singular Values
The singular values of a matrix exhibit many fascinating properties. In Part III, the interlacing property of singular values is especially important. Essentially, if A ∈ R^{m×n} is a matrix and Ā is equal to A but has one additional column, then
\[ \sigma_1(\bar{A}) \ge \sigma_1(A) \ge \sigma_2(\bar{A}) \ge \sigma_2(A) \ge \cdots \ge \sigma_n(\bar{A}) \ge \sigma_n(A) \ge \sigma_{n+1}(\bar{A}). \]
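A quick numerical check of the interlacing property (an illustrative addition; sizes and seed are arbitrary):

```python
import numpy as np

# Appending one column to A yields Abar whose singular values interlace
# those of A: sigma_i(Abar) >= sigma_i(A) >= sigma_{i+1}(Abar).
rng = np.random.default_rng(2)
A = rng.standard_normal((7, 4))
Abar = np.hstack([A, rng.standard_normal((7, 1))])   # A with one extra column

s = np.linalg.svd(A, compute_uv=False)
sbar = np.linalg.svd(Abar, compute_uv=False)
print(all(sbar[i] >= s[i] >= sbar[i + 1] for i in range(len(s))))
```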
2.7 Numerical Linear Algebra
The LU, QR, and SVD are deterministic algorithms and can be computed for any data matrix. The computational complexity of all three is of the same order: O(m · n · min(m,n)). Nevertheless, their operation counts differ by constant factors, and they vary in efficiency due to memory movements. Roughly, the LU decomposition is about 4 times faster to compute than the QR decomposition, and the QR decomposition is about 4 times faster to compute than the SVD. Concerning stability, the SVD can always be reliably computed. The QR decomposition is less stable than the SVD, and the LU decomposition is less stable than the QR decomposition. The speed and stability of the LU decomposition will be explored in greater detail in later sections. Although none of these three factorizations is unique in general, standard implementations of these algorithms guarantee that they produce the same results when repeated on the same data matrix.
The LU decomposition, QR decomposition, and the SVD will all be needed in Part II. Note that these three matrix factorizations are also considered to be among the most important in the study of linear algebra. In particular, these three factorizations are utilized in linear equation solving. When A and b are a known matrix and known vector and x is an unknown vector such that Ax = b, then x can be computed (with some stability assumptions) using all three decompositions:
x = U−1L−1b
x = R−1QT b
x = VΣ−1UT b.
While matrix addition, subtraction, and multiplication are easily defined, matrix division requires careful consideration. For most matrices, an inverse can be calculated, and multiplying by this inverse is conceptually equivalent to division. Computing the inverse of a matrix, nevertheless, is an expensive algorithm (it requires many computations), and so, in practice, shortcuts are used when possible to avoid this calculation. Triangular and orthogonal matrices offer such shortcuts.

For the triangular matrices U, L, and R in the solutions above, the inverse notation should not imply calculating the inverse. Instead, because these matrices have a special form, a faster algorithm can be applied that calculates multiplication by the inverse. The computational complexity of this action is reduced by a factor of n. Furthermore, because Q, V, and U (note U is standard notation in both LU and the SVD) are orthogonal, their inverses are equal to their transposes. See [34] for discussions on numerical linear algebra.
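The sketch below (an illustrative addition using SciPy routines) solves Ax = b with all three factorizations, relying on triangular solves and transposes rather than explicit inverses, as described above:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, qr, solve_triangular, svd

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 6))
b = rng.standard_normal(6)

# LU: forward/back substitution on the triangular factors
lu_fac, piv = lu_factor(A)
x_lu = lu_solve((lu_fac, piv), b)

# QR: x = R^{-1} Q^T b via a triangular solve
Q, R = qr(A)
x_qr = solve_triangular(R, Q.T @ b)

# SVD: x = V Sigma^{-1} U^T b
U, s, Vt = svd(A)
x_svd = Vt.T @ ((U.T @ b) / s)

print(np.allclose(x_lu, x_qr), np.allclose(x_lu, x_svd))
```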
2.8 Communication-Avoiding Algorithms
The computational cost of many data analysis algorithms is dominated by the time spent moving data throughout the memory hierarchy of a computer, as opposed to the time spent on arithmetic and logic. In general, hardware constraints cause memory movements to be more computationally expensive than floating point operations. Writing data to memory is typically the most expensive data operation. Therefore, when an algorithm has flexibility with the order in which the necessary arithmetic and logic are applied, varying the order of these operations may reduce the amount of data movements needed, increasing the speed of the algorithm.
Algorithms 2.6 and 2.7 in [34] compare scalar and block matrix-matrix multiplication, and supporting analysis describes how the block version is far more efficient than the scalar version, despite performing mathematically identical calculations. Furthermore, an efficiency comparison of matrix-matrix multiplication, matrix-vector multiplication, and vector-vector multiplication shows numerically that the data reuse in block calculations leads to much greater processor efficiency, and, therefore, speed. Algorithms arranged in block form to allow for data reuse and to minimize the number of data movements are known as communication-avoiding algorithms. Part II will discuss SRLU, a block algorithm.
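A minimal sketch of a blocked matrix-matrix multiply is given below (an illustrative addition; the function name and block size are arbitrary choices). The arithmetic is identical to the scalar triple loop, but each block is reused many times once it has been brought into fast memory.

```python
import numpy as np

def blocked_matmul(A, B, b=64):
    # Blocked product C = A @ B: loop over b-by-b tiles so that each tile of
    # A and B is reused across an entire block row/column of C.
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i in range(0, m, b):
        for j in range(0, n, b):
            for t in range(0, k, b):
                # one block update: C_ij += A_it * B_tj
                C[i:i+b, j:j+b] += A[i:i+b, t:t+b] @ B[t:t+b, j:j+b]
    return C

A = np.random.rand(200, 300)
B = np.random.rand(300, 150)
print(np.allclose(blocked_matmul(A, B), A @ B))
```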
2.9 A Note About Matrix-Matrix Multiplication
In numerical linear algebra, almost no algorithms are known to truly be the fastest for their purposes. Indeed, for matrix-matrix multiplication, one of the most basic operations in linear algebra, new algorithms with new improvements are being discovered continuously.
One of the most profound discoveries in the field of linear algebra is a recursive matrix-matrix multiplication algorithm due to Strassen [99]. This work computed matrix-matrix multiplication in time O(n^{log_2 7}), a most surprising result because there was no obvious reason to expect a method faster than O(n^3). As matrix-matrix multiplication is the most time-consuming linear algebra operation used in SRLU in Part II, the complexity of SRLU can freely be reduced by using Strassen’s algorithm. No further discussion is provided in Part II; however, the Appendix describes Strassen’s algorithm in greater detail.
Strassen’s algorithm is known to increase the fixed constant concealed in the order of complexity. A general rule of thumb is that Strassen’s algorithm becomes faster than traditional matrix multiplication for matrices roughly of size 512-by-512 or greater. In recent years, new recursive algorithms have been discovered that follow in Strassen’s example and reduce the complexity of matrix-matrix multiplication further. All are above O(n^2), and it is not known if O(n^2) is achievable, which is a lower bound because all data must be used. These newer algorithms, however, are known to have astronomical hidden constants, and are not practical.
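For concreteness, a minimal recursive sketch of Strassen’s algorithm is shown below (an illustrative addition assuming square matrices whose dimension is a power of two; the function name and cutoff are arbitrary). It forms seven recursive products instead of eight, giving the O(n^{log_2 7}) complexity mentioned above.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    # Recursive Strassen multiplication; below the cutoff, fall back to the
    # ordinary product so the recursion overhead does not dominate.
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty((n, n))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
print(np.allclose(strassen(A, B), A @ B))
```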
Part II
The Spectrum-Revealing LU Decomposition
Chapter 3
Background on Low-Rank Approximation and the LU Decomposition
3.1 Introduction
In this chapter, a novel truncated LU factorization called Spectrum-Revealing LU (SRLU, Definition 2) is introduced for effective low-rank matrix approximation, and a fast algorithm to compute an SRLU factorization is developed. Both matrix and singular value approximation error bounds for the SRLU approximation computed by this algorithm are presented. Analysis suggests that SRLU is competitive with the best low-rank matrix approximation methods, deterministic or randomized, in both computational complexity and approximation quality. Numeric experiments illustrate that SRLU preserves sparsity, highlights important data features and variables, can be efficiently updated, and calculates data approximations nearly as accurately as the best possible. This is the first known practical variant of the LU factorization for effective and efficient low-rank matrix approximation. A preliminary version of this work appears in [6].
3.2 Problem Statement
Low-rank approximation is an essential data processing technique for understanding data that is either large or noisy. The Singular Value Decomposition (SVD) is a longstanding standard for low-rank data approximation. A stable algorithm, the truncated SVD is also optimal in the spectral and Frobenius norms [39, 47]:
\[ \|A - A_k\|_\xi = \min_{\mathrm{rank}(B) \le k} \|A - B\|_\xi, \tag{3.1} \]
where ξ = 2, F and
\[ A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T \tag{3.2} \]
is the truncated SVD of A. The far-reaching applications of low-rank approximation and the SVD include data compression, image and pattern recognition, signal processing, compressed sensing, latent semantic indexing, anomaly detection, and recommendation systems. In this chapter, a novel matrix factorization is introduced called Spectrum-Revealing LU (SRLU) that can be efficiently computed and updated. Simultaneously, it preserves sparsity and can be used to identify important data variables and observations. This algorithm works on any data matrix, and achieves an approximation accuracy that only differs from the accuracy of the best approximation possible for any given rank by a constant factor.¹

¹The truncated SVD is known to provide the best low-rank matrix approximation. But it is rarely used for large-scale practical data analysis.
The major innovation in SRLU is the efficient calculation of a truncated LU factorization of the form
\[
\Pi_1 A \Pi_2^T =
\begin{pmatrix} L_{11} & \\ L_{21} & I_{m-k} \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} \\ & S \end{pmatrix}
\approx
\begin{pmatrix} L_{11} \\ L_{21} \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} \end{pmatrix}
\stackrel{\mathrm{def}}{=} \widehat{L} \widehat{U},
\]
where L_{11} and U_{11} are k × k and Π_1 and Π_2 are judiciously chosen permutation matrices. The LU factorization is unstable, and in practice is implemented by pivoting (interchanging) rows during factorization (this entails finding Π_1 only). For the truncated LU factorization to have any significance, nevertheless, complete pivoting (interchanging rows and columns) is necessary to guarantee that the factors L̂ and Û are well-defined and that their product accurately represents the original data. Previously, complete pivoting was impractical because it requires accessing the entire data matrix at every iteration, but SRLU efficiently achieves complete pivoting through randomization and a deterministic procedure to correct for any mistakes that may have arisen from the randomization. The quality of the SRLU factorization is supported by rigorous theory and numeric experiments.
Background on the LU factorization
A rudimentary matrix factorization, the LU decomposition factors a matrix into a lower triangular matrix and an upper triangular matrix, which can then be used to solve a linear system. The stability of the LU decomposition has been extensively studied, with many longstanding results [56, 82, 101].
The vanilla LU method fails if the diagonal contains numerically small elements, and so partial pivoting was introduced to stabilize the algorithm by selecting important rows of the matrix during iteration. LU can be made completely stable through complete pivoting, where at each iteration the largest element in the Schur complement is permuted to the next diagonal entry of the U factor [105]. Finding the largest element in the Schur complement, nevertheless, requires O(n^2) comparisons at each iteration, greatly increasing the complexity of the decomposition. With row pivoting only, the algorithm can in general achieve a stable method for linear equation solving, but provides no insight into important columns of the data matrix. Thus early termination of the LU decomposition with partial pivoting will, in general, produce poor approximations to a data matrix. The LU decomposition, consequently, appears in fewer applications than the SVD.
The LU decomposition, nevertheless, exhibits many advantages over the SVD. The decomposition is faster, partially preserves sparsity, can be updated more easily, and in general is easier to implement. For example, consider a data matrix examined by Berry et al. in [15]:
\[
A =
\begin{pmatrix}
0.5774 & 0 & 0 & 0.4082 & 0 \\
0.5774 & 0 & 1.0000 & 0.4082 & 0.7071 \\
0.5774 & 0 & 0 & 0.4082 & 0 \\
0 & 0 & 0 & 0.4082 & 0 \\
0 & 1.0000 & 0 & 0.4082 & 0.7071 \\
0 & 0 & 0 & 0.4082 & 0
\end{pmatrix}.
\]
Then the rank-3 truncated SVD is
\[
A_3 =
\begin{pmatrix}
0.4971 & -0.0330 & 0.0232 & 0.4867 & -0.0069 \\
0.6003 & 0.0094 & 0.9933 & 0.3858 & 0.7091 \\
0.4971 & -0.0330 & 0.0232 & 0.4867 & -0.0069 \\
0.1801 & 0.0740 & -0.0522 & 0.2320 & 0.0155 \\
-0.0326 & 0.9866 & 0.0094 & 0.4402 & 0.7043 \\
0.1801 & 0.0740 & -0.0522 & 0.2320 & 0.0155
\end{pmatrix}.
\]
Next, consider a rank-3 truncated LU decomposition using partial row pivoting, which gives the approximation
\[
\begin{pmatrix}
0.5774 & 0 & 0 & 0.4082 & 0 \\
0.5774 & 0 & 1.0000 & 0.4082 & 0.7071 \\
0.5774 & 0 & 0 & 0.4082 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 1.0000 & 0 & 0.4082 & 0.7071 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}.
\]
Here the LU approximation retains a sparsity pattern similar to that of the original matrix. Also, many of the entries are numerically exact. The truncated SVD is both dense and has no exact entries. The truncated SVD, however, has no egregious errors in the matrix, while the LU approximation shows significant errors for two entries. The LU decomposition also looks deceptively accurate, while the truncated SVD, ostensibly less accurate in appearance, is optimal in spectral and Frobenius norm errors. Furthermore, if the columns of A were rearranged, then the partial pivoting factorization could lead to a column of all zeros.
Algorithm 1 presents a basic implementation of the LU factorization, where the result is stored in place such that the upper triangular part of A becomes U and the strictly lower triangular part becomes the strictly lower part of L, with the diagonal of L implicitly known to contain all ones. LU with partial pivoting finds the largest entry in the ith column from row i to m and pivots the row with that entry to the ith row. LU with complete pivoting finds the largest entry in the submatrix A(i:m, i:n) and pivots that entry to A(i, i). It is generally known and accepted that partial pivoting is sufficient for general, real-world data matrices in the context of linear equation solving.
Require: Data matrix A ∈ R^{m×n}
Ensure: A overwritten with L and U factors
1: for i = 1, 2, ..., min(m,n) do
2:    Perform row and/or column pivots
3:    A(i+1:m, i) = A(i+1:m, i) / A(i, i)
4:    A(i+1:m, i+1:n) −= A(i+1:m, i) · A(i, i+1:n)
5: end for

Algorithm 1: The LU Decomposition (Alg. 2.4 [34])
Line 4 of Algorithm 1 is known as the Schur update. Given a sparse input, this is the only step of the LU factorization that causes fill. As the algorithm progresses, fill will compound and may become dense, but the LU factorization, and truncated LU in particular, generally preserves some, if not most, of the sparsity of a sparse input. A numeric illustration is presented below.
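A minimal NumPy rendering of Algorithm 1 with partial row pivoting is sketched below (an illustrative addition, not a reference implementation); line 4's rank-1 Schur update appears as the np.outer call.

```python
import numpy as np

def lu_partial_pivot(A):
    # In-place LU with partial row pivoting: after the loop, the strict lower
    # triangle holds L (unit diagonal implied) and the upper triangle holds U.
    A = A.astype(float).copy()
    m, n = A.shape
    piv = np.arange(m)
    for i in range(min(m, n)):
        p = i + np.argmax(np.abs(A[i:, i]))                 # partial pivoting
        A[[i, p]], piv[[i, p]] = A[[p, i]], piv[[p, i]]
        A[i+1:, i] /= A[i, i]                               # column of L
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 Schur update
    return A, piv

rng = np.random.default_rng(4)
M = rng.standard_normal((6, 6))
F, piv = lu_partial_pivot(M)
L = np.tril(F, -1) + np.eye(6)
U = np.triu(F)
print(np.allclose(L @ U, M[piv]))   # P A = L U
```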
LU Analysis and Variations
By convention, the L factor has a unit diagonal. This guarantees that the LU decomposition is mathematically unique for any data matrix A. Nevertheless, there are many variations on calculating the LU factorization. In Algorithm 2 the Crout version of LU is presented in block form. The column pivoting entails selecting the next b columns so that the in-place LU step is performed on a non-singular matrix (provided the remaining entries are not all zero). Note that the matrix multiplication steps are the bottleneck of this algorithm, requiring O(mnb) operations each in general.
Require: Data matrix A ∈ R^{m×n}, block size b
Ensure: A overwritten with L and U factors
1: for j = 0, b, 2b, ..., min(m,n) − b do
2:    Perform column pivots
3:    A(j+1:m, j+1:j+b) −= A(j+1:m, 1:j) · A(1:j, j+1:j+b)
4:    Apply Algorithm 1 on A(j+1:m, j+1:j+b)
5:    Apply the row pivots to other columns of A
6:    A(j+1:j+b, j+b+1:n) −= A(j+1:j+b, 1:j) · A(1:j, j+b+1:n)
7: end for

Algorithm 2: Crout LU (in block form)
Letting x = min(m,n) and y = max(m,n), Algorithm 1 implies an operation count of
\[
\mathrm{flops} = \sum_{i=1}^{x-1} \left[ \sum_{j=i+1}^{m} \big( 2(n-i) + 1 \big) \right]
= x^2 y - \tfrac{1}{3} x^3 - xy - \tfrac{1}{2} x^2 + \tfrac{5}{6} x + xm - m
= x^2 y - \tfrac{1}{3} x^3 + \text{low order}.
\]
Note that the exact operation count for a rectangular matrix would differ from that of its transpose because of the xm − m term, which stems from the scaling in line 3.
We now make several important observations about Algorithm 1. First, pivoting in some form is necessary for general matrices because of the division in step 3. The most common form of pivoting at step 2 is to swap rows j and arg max_{j≤i≤m} |A(i, j)|. With this rule for row swapping, the final factorization will produce an L with |L(i, j)| ≤ 1. An alternative pivoting strategy, complete pivoting, requires (1/3)n^3 + O(n^2) comparisons for a dense matrix [38], a significant amount of work. In practice, complete pivoting is rarely used for this reason.

Threshold pivoting is a variation of pivoting whereby a parameter u < 1 is chosen so that an acceptable pivot is any choice such that |A^{(k)}_{kk}| ≥ u |A^{(k)}_{ik}| for i > k. Threshold pivoting means that there may be many acceptable pivots. Indeed, the motivation for this pivoting strategy is to allow multiple choices of pivots in order to consider other implications in the factorization. Most notably, for sparse factorizations pivots may be chosen to reduce fill-in in the factorization. Multiple potential pivots allow the factorization some flexibility to attempt to minimize the amount of fill-in. One such strategy is Markowitz pivoting [47]. Such strategies could be applied to SRLU below, but will not be discussed further in this work.
Step 4 of Algorithm 1 is important for efficiency not only because of fill-in, but also because the term A(j+1:m, j) · A(j, j+1:n) is calculating an outer product of size (m−j)-by-(n−j). These outer products are inefficient because they do not take advantage of data reuse and because they require accessing a large submatrix at each iteration. Algorithm 1, however, can be reorganized into a mathematically equivalent LU factorization without needing to calculate outer products or to access the entire Schur complement at every iteration. Before reorganizing, first note that Algorithm 1 is written in vector form for brevity and clarity of what subroutines are needed in the calculation. It is mathematically equivalent to Algorithm 3. Algorithm 3 is less efficient than Algorithm 1, but taking a step back will allow us to understand a different approach to LU.
Require: Data matrix A ∈ R^{m×n}
Ensure: Lower triangular with unit diagonal L ∈ R^{m×min(m,n)} and upper triangular U ∈ R^{min(m,n)×n} such that LU = A (mathematically)
1: for k = 1 : min(m,n) − 1 do
2:    for i = k + 1 : m do
3:       for j = k + 1 : n do
4:          A(i, j) = A(i, j) − (A(i, k)/A(k, k)) · A(k, j)
5:       end for
6:    end for
7: end for

Algorithm 3: The Most Basic Form of the LU Decomposition ([34])
Algorithm 3 is for illustrative purposes only. Note that it is inefficient because as k varies it will repeatedly calculate A(j, i)/A(i, i) for the same i and j. This form of LU, nevertheless, highlights the triple nested loop complexity of the LU decomposition. Furthermore, six variations of the LU decomposition are immediate by permuting the order of the loops. Based on the loop parameters, Algorithm 3 is known as the KIJ version of LU [68]. This version is also known as “right-looking” LU, as well as “outer product” LU, as noted above.
The JKI version of LU is known as “left-looking” LU (Algorithm 4). This version is also called “delayed-update”, or “lazy”, LU because the Schur complement is not updated at each iteration; rather, when a column or block column is being updated, all previous updates from all preceding columns are applied to the current column at once (the Schur updates are simultaneously applied). This also means that the final answer only needs to be written once. Thus after a column or block column has been updated, it need not be updated again. As a result, “left-looking” LU will perform early iterations quickly, and slow down as the factorization progresses because columns will require more updating from more preceding columns. Also note that the inner loop is an inner product. Algorithm 4 also illustrates that swaps must be handled carefully to maintain the “lazy” advantage of delayed updating.
The efficiency advantages of “delayed-update” LU would clearly benefit a truncated LU decomposition, as unnecessary updates to the Schur complement would be avoided. Nevertheless, we cannot obtain a high-quality truncated LU factorization without considering both row and column pivoting.
Require: Data matrix A ∈ R^{m×n}
Ensure: A overwritten with L and U factors
1: for j = 1 : n do
2:    for k = 1 : j − 1 do
3:       Apply previous row swaps
4:    end for
5:    for k = 1 : j − 1 do
6:       for i = k + 1 : m do
7:          A(i, j) −= A(i, k) · A(k, j)
8:       end for
9:    end for
      Pick new swaps
10:   for k = 1 : j do
11:      Apply new swaps to previous columns
12:   end for
13:   for i = j + 1 : m do
14:      A(i, j) = A(i, j) / A(j, j)
15:   end for
16: end for

Algorithm 4: The Left-Looking LU Decomposition
Figure 3.1: Visualizations of different LU factorizations: (a) right-looking LU, (b) left-looking LU, (c) Crout LU.
Another form of the LU decomposition, which factors both columns and rows while it iterates, is called the Crout LU decomposition. Unlike the previously discussed versions of LU, Crout LU cannot be obtained by permuting the loops of Algorithm 3; rather, the loops must be broken up so that columns and rows can be updated at each iteration. A visual representation of variations of the LU decomposition is provided in Figure 3.1.
The Crout LU decomposition, Algorithm 2, has also been implemented to quickly calculate incomplete LU decompositions, and to achieve effective dropping strategies [68]. Note that there are other variations of LU as well. The Doolittle variation resembles the left-looking algorithm, but is applied to rows instead of columns. Table 3.1 highlights some efficiency differences of these mathematically equivalent algorithms.
LU Method       Largest Data Read   Largest Data Write   Total Data Writes
Right-looking   (n−1)^2             (n−1)^2              (1/6)(n−1)n(2n−1)
Left-looking    (1/2)n(n+1)         n                    n^2
Doolittle       (1/2)n(n+1)         n                    n^2
Crout           (1/4)n^2            2(n−1)               n^2

Table 3.1: Efficiency comparison of LU orderings
Hardware constraints mean that writing data to memory is generally more expensive than reading data [23]. Carson et al. also present a parallel write-avoiding LU algorithm without pivoting, which is based on the left-looking algorithm.
Notation and Definitions
We use the conventions outlined in Table 3.2 for our discussions and comparisons of low-rank approximation algorithms. In particular, p ≪ ℓ, which will lead to greater efficiency in SRLU than that of existing algorithms.
Variable   Representation                                Additional Notes
b          block size                                    smallest parameter
p          an oversampling parameter relative to b       the minor dimension of our random projection
k          target rank                                   potentially b ≪ k, but k ≪ m, n
ℓ          an oversampling parameter relative to k       the minor dimension of other algorithms’ random projections
m, n       dimensions of data matrix                     b < p < k ≤ ℓ ≪ m, n

Table 3.2: Definition of parameters (smallest to largest)
Finally, a few specialized definitions are useful concerning LU decompositions:
Definition 1. An LU factorization
\[
\Pi_1 A \Pi_2^T =
\begin{pmatrix} L_{11} & \\ L_{21} & I_{m-k} \end{pmatrix}
\cdot
\begin{pmatrix} U_{11} & U_{12} \\ & U_{22} \end{pmatrix}
\tag{3.3}
\]
(with L_{11} and U_{11} of size k × k) is rank-revealing if
\[
\sigma_k(A) \ge \sigma_{\min}(L_{11} U_{11}) \gg \sigma_{\max}(U_{22}) \ge \sigma_{k+1}(A) \approx 0. \tag{3.4}
\]

Definition 2. A rank-k truncated LU factorization is spectrum-revealing if
\[
\big\| A - \widehat{L} \widehat{U} \big\|_2 \le q_1(k,m,n) \, \sigma_{k+1}(A)
\]
and
\[
\sigma_j \big( \widehat{L} \widehat{U} \big) \ge \frac{\sigma_j(A)}{q_2(k,m,n)}
\]
for 1 ≤ j ≤ k, where q_1(k,m,n) and q_2(k,m,n) are bounded by a low degree polynomial in k, m, and n.

Definition 3. A spectral gap ratio of a matrix A ∈ R^{m×n} is a ratio σ_ℓ(A)/σ_j(A) for 1 ≤ j < ℓ ≤ min(m,n).
The difference between Definitions 1 and 2 is one of the primary concerns of this work. These two definitions provide a measurement for evaluating the quality of a low-rank approximation. The former is the standard definition in this area of numeric analysis. This work, nevertheless, argues that the second definition may be more useful for low-rank approximation in many real-world settings. Definition 3 highlights a concept that is also related to the quality of low-rank approximations and will be essential to the theoretical analysis of spectrum-revealing algorithms in later chapters. Ultimately, Theorems 9 and 11 will show that SRLU is spectrum-revealing.
3.3 Previous Work on Deterministic Low-Rank Approximation Algorithms
As previously discussed, the SVD is a natural starting point for low-rank matrix approximation because of the optimality in equation (3.1). Computation of the SVD is stable, but has many difficulties. First, the complexity of calculation is (4/3) · mn · min(m,n) for a full SVD calculation, and approximately (4/3) · kmn for a rank-k truncated SVD, which we will see is considerably slower than other methods. Second, the SVD cannot be updated in general if additional data is acquired. Additionally, the SVD may be difficult to interpret. The low-rank approximation in equation (3.2) is composed of vectors u_i and v_i that are generally dense. Thus while the SVD can choose a low dimensional subspace that accurately approximates the data, the SVD cannot in general accurately determine a subset of the data variables that accurately represent the data. This remains true even if the input matrix is sparse. See [102] for detailed interpretations of the meaning of the SVD.
The URV decomposition introduced by Stewart [97] is a factorization that can be updated quickly when a new sample arrives, unlike the SVD. In this factorization the middle matrix is triangular, reducing the interpretability of this decomposition. Also, few theoretical results have been reported about the URV decomposition.
Sparse PCA approximates a data matrix with not only a low dimensional subspace, but also a sparse set of data variables. Because the principal components and loading vectors returned by an SVD factorization are dense in general, the SVD does not readily report which data variables best describe the data set. Note that thresholding, which simply ranks variables based on the magnitude of the elements in the loading vectors, often poorly predicts the influence of the data variables because highly correlated variables will be underreported due to dilution. Sparse PCA algorithms seek more robust algorithms to find sparse low-rank approximations [110].
Low-rank approximations can also be found using rank-revealing QR (RRQR) factorizations [24, 51, 55]. RRQR algorithms perform column pivots to guarantee that the factorization
\[
A \Pi =
\begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix}
\begin{pmatrix} R_{11} & R_{12} \\ & R_{22} \end{pmatrix}
\]
produces an approximation
\[
\begin{pmatrix} Q_{11} \\ Q_{21} \end{pmatrix}
\begin{pmatrix} R_{11} & R_{12} \end{pmatrix}
\]
that is rank-revealing.
The CUR decomposition approximates a matrix by the factorization A ≈ CUR, where C is a sample of the columns of A, R is a sample of the rows of A, and U = C†AR† [16]. This factorization seeks to choose important rows and columns, which, for a sparse matrix, provides insight into important combinations of the data variables. Thus the motivation for the CUR decomposition is similar to that of Sparse PCA. Note also that the decomposition retains the sparsity of the original matrix and may be used for data compression, as C and R have the same sparsity pattern as submatrices of A, and U is a small matrix. The CUR decomposition is easily adapted to symmetric matrices by letting C and R be the same selection of indices. A CUR decomposition can be formed from any row/column selection algorithm, such as LU with pivoting, QR with pivoting (including RRQR), or other column selection methods [8, 11, 18, 29]. CUR decompositions can be deterministic or randomized, and will be discussed in later sections. The CX decomposition is analogous to the CUR decomposition, where only a subset of columns is selected.
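A minimal sketch of forming a CUR approximation from a chosen row and column subset is given below (an illustrative addition; the rows and columns are drawn uniformly at random purely for simplicity, whereas the selection strategies discussed in this work are far better).

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((100, 10)) @ rng.standard_normal((10, 60))  # rank-10 data
k = 15                                            # number of sampled rows/columns

cols = rng.choice(A.shape[1], size=k, replace=False)
rows = rng.choice(A.shape[0], size=k, replace=False)
C = A[:, cols]                                    # sampled columns (sparse if A is)
R = A[rows, :]                                    # sampled rows
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)     # U = C^+ A R^+

err = np.linalg.norm(A - C @ U @ R, 'fro') / np.linalg.norm(A, 'fro')
print(err)   # near zero here because the samples span the row/column spaces
```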
Additional work on low-rank data approximation includes the Interpolative Decomposition (ID) [25] and other deterministic column selection algorithms, such as [11]. A truncated version of the Cholesky factorization was studied in [41].
Low-rank approximations can also be formulated as optimization problems. For a data matrix A with rank r the nuclear norm is denoted and defined as
\[ \|A\|_* \stackrel{\mathrm{def}}{=} \sum_{i=1}^{r} \sigma_i(A). \]
Thus the nuclear norm is an ℓ1 norm of the spectrum of A. Note that ℓ1 norm minimization is sparsity inducing, and so heuristically nuclear norm minimization similarly induces sparsity in a matrix spectrum. This in turn implies that nuclear norm minimization induces low rank [20, 21, 57, 87].
3.4 Previous Work on Randomized Low-Rank Approximation Algorithms
Randomized algorithms have grown in popularity in recent years because of their ability to efficiently process large data matrices and because they can be supported with rigorous theory. Randomized low-rank approximation algorithms generally fall into one of two categories: sampling algorithms and black box algorithms. Sampling algorithms form data approximations from a random selection of rows and/or columns of the data. Black box algorithms attempt to find an accurate low-rank basis of the data matrix by factoring a random projection. SRLU, the main result presented below, will resemble a deterministic algorithm with elements of a black box algorithm as a subproblem.
Sampling Algorithms
Examples of sampling algorithms include [35, 36, 44, 72]. [37] showed that for a given approximate rank k, with a randomly drawn subset C of c = O(k log(k) ε^{-2} log(1/δ)) columns of the data, a randomly drawn subset R of r = O(c log(c) ε^{-2} log(1/δ)) rows of the data, and U = C†AR†, the matrix approximation error ‖A − CUR‖_F is at most a factor of 1 + ε from the optimal rank-k approximation with probability at least 1 − δ.
In [44], Frieze et al. showed that a randomized algorithm can find an approximation D* to a matrix A ∈ R^{m×n} satisfying
\[
\|A - D^*\|_F^2 \le \min_{D,\, \mathrm{rank}(D) \le k} \|A - D\|_F^2 + \varepsilon \|A\|_F^2,
\]
with probability 1 − δ, and can run in time polynomial in k, 1/ε, and log(1/δ), independent of m and n. This implies a method for testing in constant time whether a large matrix has a good low-rank approximation. Their algorithm runs by randomly sampling rows and columns, and calculating the SVD of the scaled intersection. The top singular vectors can then be filtered to produce a low-rank approximation.
Adaptive sampling methods update the sampling probabilities as the rank of the approximation increases. Deshpande and Vempala [35] and Deshpande et al. [36] exponentially reduce the error of the method in [44] by adaptively sampling at each iteration, updating the sampling probabilities to reflect the squared distance to the span of the previous samples.
In [72] Mahoney and Drineas use statistical leverage scores to randomly select rows and columns of a data matrix to form a CUR decomposition. As described above, this method creates a sparse approximation. Their algorithm satisfies
\[ \|A - CUR\|_F \le (2 + \varepsilon) \|A - A_k\|_F, \]
with probability at least 98%. They apply their algorithm to text document data, genetic microarray data, and a social science dataset. Other work on sampling algorithms includes [107].
Background on Random Projections
Before discussing black box algorithms, the mathematical background of random projections is presented here. Paramount to the analysis of randomized algorithms is a theorem due to Johnson and Lindenstrauss [61]. Their work proved that, given n and 1 > ε > 0, for every set of n points P ⊂ R^d there exists a map f : R^d → R^k, with k = O(ε^{-2} log n), such that
\[ (1 - \varepsilon) \|u - v\|_2^2 \le \|f(u) - f(v)\|_2^2 \le (1 + \varepsilon) \|u - v\|_2^2 \]
for all u, v ∈ P. In other words, a set of points in a high dimensional space can be mapped to a lower dimensional space in such a way that the distances between points are preserved within some bound. Their work also showed that a random rank-k orthogonal projection on ℓ_2^n will satisfy the condition on f with an exponentially decaying probability of failure.

Frankl and Maehara [43] improved the bound to k = ⌈9(ε² − 2ε³/3)^{-1} ln|P|⌉ + 1 for 0 < ε < 1/2 and provided a simpler proof. Indyk and Motwani [59] showed that orthogonality of the random projection is not needed. Dasgupta and Gupta [32] improved the bound on k to k ≥ 4(ε²/2 − ε³/3)^{-1} ln n.

Achlioptas [1] showed that the random projection can be chosen from {−1, +1} and that it can also be made sparse: the random projection can have entries randomly chosen from {−1, 0, +1} with respective probabilities 1/6, 2/3, and 1/6. For ε, β > 0, the condition on f holds for k at least (4 + 2β)/(ε²/2 − ε³/3) · log n with probability at least 1 − n^{−β}. Other notable works include that of Klartag and Mendelson [65] and Indyk and Naor [60]. See also [62].

In [3], Ailon and Chazelle introduced the fast Johnson-Lindenstrauss transform, which preconditions a sparse projection matrix with a randomized Fourier transform and avoids naive multiplication of dense matrices. This method uses the Heisenberg principle to overcome the distortion caused by sparse random projections, and achieves a lower complexity than previous algorithms. Matousek [77] combines the ideas of [1] and [3] to further improve the speed of the projection. Additionally, Matousek argues that there is a limit on the amount of sparsity achievable. Ailon and Liberty [4] further improve the running time to O(d log k) for k = O(d^{1/2−δ}) for an arbitrarily small fixed δ.
Dasgupta, Kumar, and Sarlos [31] report a method that constructs the projection matrix using a hash function instead of independent random variables, with o(1/ε²) nonzero entries in the projection matrix. Their work achieves an O(1/ε) update time per nonzero element, compared to the O(1/ε²) update time per nonzero element of previous works.
In this work, we will be concerned with a version of the Johnson-Lindenstrauss theorem when n = 1. In this case, [104] showed
\[ (1 - \varepsilon) \|u\|_2^2 \le \left\| \tfrac{1}{\sqrt{k}} R u \right\|_2^2 \le (1 + \varepsilon) \|u\|_2^2, \tag{3.5} \]
with probability at least
\[ 1 - 2 e^{-(\varepsilon^2 - \varepsilon^3) k / 4}. \]
Here, R ∈ R^{k×d} is a Gaussian random projection. This theorem can also be applied by determining a desired level of certainty, and using the theorem above to determine the required amount of oversampling k. Note that in the context of SRLU below, the oversampling parameter k will be represented with p to better conform with other current work.
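An empirical check of the single-vector bound (3.5) is sketched below (an illustrative addition; the dimension d, the projection size k, ε, and the number of trials are arbitrary choices). The observed success frequency should sit above the lower bound from (3.5).

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, eps = 1000, 100, 0.25
u = rng.standard_normal(d)

trials, inside = 200, 0
for _ in range(trials):
    R = rng.standard_normal((k, d))                 # Gaussian random projection
    ratio = np.linalg.norm(R @ u / np.sqrt(k))**2 / np.linalg.norm(u)**2
    inside += (1 - eps) <= ratio <= (1 + eps)

print(inside / trials)                              # observed success frequency
print(1 - 2 * np.exp(-(eps**2 - eps**3) * k / 4))   # lower bound from (3.5)
```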
Black Box Algorithms
Black box algorithms typically approximate a data matrix in the form
\[ A \approx Q Q^T A, \]
where Q is an orthonormal basis of the random projection (usually computed using SVD, QR, or ID). The result of [61] provided the theoretical groundwork for these algorithms, which have been extensively studied [30, 50, 52, 69, 74, 84, 89, 108]. For example, in [84], Papadimitriou et al. use a random projection as a preprocessing step for SVD in the context of latent semantic indexing. Note that the projection of an m-by-n data matrix is of size m-by-ℓ, for some oversampling parameter ℓ ≥ k, where k is the target rank. Thus the computational challenge is the orthogonalization of the projection (the random projection can be applied quickly, as described in these works). A previous result on randomized LU factorizations for low-rank approximation was presented in [5], but is uncompetitive in terms of theoretical results and computational performance with the work presented here.
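A minimal sketch of this black box pattern is given below (an illustrative addition; the test matrix, target rank, and oversampling are arbitrary choices): project A onto a few random directions, orthonormalize the sketch, and project A onto that basis.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((500, 40)) @ rng.standard_normal((40, 300))  # rank-40 data
k, l = 40, 50                         # target rank and oversampled sketch size

Omega = rng.standard_normal((A.shape[1], l))   # random projection
Y = A @ Omega                                  # sketch of the range of A
Q, _ = np.linalg.qr(Y)                         # orthonormal basis of the sketch
A_approx = Q @ (Q.T @ A)                       # A ≈ Q Q^T A

print(np.linalg.norm(A - A_approx) / np.linalg.norm(A))   # near zero here
```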
Note that the large random projection requires initializing the factorization with an expensive matrix-matrix multiplication operation, requiring 2ℓmn operations, and imposing a bottleneck on computation just as a preprocessing step. This complication has been overcome by using Fourier transform-like techniques to apply random projections faster than the cost of matrix-matrix multiplication, as described in the previous section. For instance, the Johnson-Lindenstrauss transform is introduced in [89]. Using a Johnson-Lindenstrauss matrix, Sarlos computes a relative Frobenius norm error in time O(Mr + (m+n)r²) with probability 1/2, where M is the number of nonzeros and r = Θ(k/ε + k log k). The probability of success can be increased to 1 − δ with O(log(1/δ)) processors computing approximations on independent copies. Thus this work showed how to use the Johnson-Lindenstrauss transform [3] to significantly reduce the cost of the initial random projection.

Martinsson et al. [74] proposed randomized algorithms for low-rank approximation by combining a random projection with the ID and SVD methods. This work bounds the singular values of the approximations with high probability. Cost savings over deterministic methods are achieved when the data matrix A and its transpose can be applied “rapidly” to random vectors. A subsequent work by Woolfe et al. [108] accomplishes this task by applying a structured random matrix, which is composed of randomly selected rows of the product of a discrete Fourier transform matrix and a random diagonal matrix. The resulting algorithm has complexity O(mn log(k) + ℓ²(m+n)), where the first term is due to the initialization cost of applying the random projection.
Clarkson and Woodruff [30] showed that a relative Frobenius norm rank-k approximation to a matrix A can be computed in input sparsity time O(nnz(A)) + O(nk²ε^{-4} + k³ε^{-5}). This work provides additional results for overconstrained least-squares regression, ℓ_p-regression, and approximating leverage scores.
For both sampling and black box algorithms the tuning parameter ε cannot be arbitrarily small, as the methods become meaningless if the number of rows and columns sampled (in the case of sampling algorithms) or the size of the random projection (in the case of black box algorithms) surpasses the size of the data. A common practice is ε ≈ 1/2.
3.5 Problems Related to Low-Rank Approximation
Robust PCA [22] seeks to factor a matrix A into two components, L₀ that is low-rank and S₀ that is sparse, by solving
$$\min_{L+S=A} \|L\|_* + \lambda\|S\|_1$$
and recovering L₀ and S₀ with high probability.

Subspace clustering [85] seeks to separate a dataset into low-rank clusters. Matrix completion [21] seeks to approximate missing data in a matrix by choosing approximations that minimize the rank of the dataset, the motivation being to choose data that is consistent with the information that is present. In rank-deficient least squares [34], a low-rank approximation may be necessary. In Appendix B, a linear solver called GMRES is discussed. Although GMRES is not a low-rank approximation method, it operates by finding the best approximate solution to a linear system in a low-dimensional subspace called a Krylov subspace. There are many other problems in mathematics and statistics that relate to low-rank approximation.
3.6 Previous Work on the LU Decomposition
The LU factorization has been studied extensively since long before the invention of computers, with notable results from many mathematicians, including Gauss, Turing, and Wilkinson. Current research on LU factorizations includes communication-avoiding implementations, such as tournament pivoting [64], sparse implementations [49], and new computation of preconditioners [27]. A randomized approach to efficiently compute the LU factorization with complete pivoting recently appeared in [78]. These results are all in the context of linear equation solving, either directly or indirectly through an incomplete factorization used to precondition an iterative method. This work repurposes the LU factorization to create a novel, efficient, and effective low-rank approximation algorithm using modern randomization technology.
Chapter 4
Spectrum-Revealing LU
4.1 Main Contribution: Spectrum-Revealing LU (SRLU)
Our algorithm for computing SRLU is composed of two subroutines: partially factoring the data matrix with randomized complete pivoting (TRLUCP) and performing swaps to improve the quality of the approximation (SRP). The first provides an efficient algorithm for computing a truncated LU factorization, whereas the second ensures the resulting approximation is provably reliable.
Truncated Randomized LU with Complete Pivoting (TRLUCP)
Intuitively, TRLUCP performs deterministic LU with partial row pivoting for some initial data with permuted columns. TRLUCP uses a random projection of the Schur complement to cheaply find and move forward columns that are more likely to be representative of the data. To accomplish this, Algorithm 5 performs an iteration of block LU factorization in
Require: Data matrix A ∈ R^{m×n}, target rank k, block size b, oversampling parameter p ≥ b, random Gaussian matrix Ω ∈ R^{p×m}
1: Calculate random projection R = ΩA
2: for j = 0, b, 2b, ..., k − b do
3:   Perform column selection algorithm on R and swap columns of A
4:   Update block column of L
5:   Perform block LU with partial row pivoting and swap rows of A
6:   Update block row of U
7:   Update R
8: end for
Algorithm 5: TRLUCP
a careful order that resembles Crout LU reduction. The ordering is reasoned as follows: LU with partial row pivoting cannot be performed until the needed columns are selected, and so column selection must occur first at each iteration. Once a block column is selected, a Schur update must be performed on that column before proceeding. At this point, an iteration of block LU with partial row pivoting can be performed on the current block. Once the row pivoting is performed, a Schur update of a block row of U can be performed, which completes the factorization up to rank j + b. Finally, the projection matrix R can be cheaply updated to prepare for the next iteration. Note that any column selection method may be used when picking column pivots from R, such as QR with column pivoting, LU with row pivoting, or even a recursive application of this algorithm to the column selection subproblem on R. A visualization of SRLU appears in Appendix D.
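For concreteness, the following simplified sketch (Python/NumPy; all names are illustrative) mirrors the structure of Algorithm 5. Unlike TRLUCP proper, it keeps an explicit Schur complement and re-sketches it at each iteration instead of performing the cheap update of R described below, and it omits row pivoting, so it shows the structure rather than the efficiency of the method:

    import numpy as np

    def trlucp_sketch(A, k, b, p, rng=np.random.default_rng(0)):
        # Illustrative truncated LU with randomized column selection.
        S = A.astype(float).copy()
        m, n = S.shape
        perm = np.arange(n)
        for j in range(0, k, b):
            Omega = rng.standard_normal((p, m - j))
            R = Omega @ S[j:, j:]               # sketch of the Schur complement
            # Greedy stand-in for QRCP on R: pick the b largest sketch columns.
            pick = np.sort(np.argsort(-np.linalg.norm(R, axis=0))[:b])
            for t, c in enumerate(pick):        # move them to the front
                S[:, [j + t, j + c]] = S[:, [j + c, j + t]]
                perm[[j + t, j + c]] = perm[[j + c, j + t]]
            for i in range(j, j + b):           # b elimination steps (no row pivots)
                S[i+1:, i] /= S[i, i]
                S[i+1:, i+1:] -= np.outer(S[i+1:, i], S[i, i+1:])
        L = np.tril(S[:, :k], -1) + np.eye(m, k)  # unit lower trapezoidal factor
        U = np.triu(S[:k, :])
        return L, U, perm                          # L @ U ~ A[:, perm]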
1: DGEMM — p(2m − 1)n flops
2: for j = 0 : b : k − b do
3:   DGETRF, DGEQRF or other, column swaps — O(np²) flops
4:   DGEMM — 2(m − j)jb flops
5:   DGETRF, row swaps — (m − j)b² − (1/3)b³ flops
6:   DGEMM — 2bj(n − j − b) flops
7:   DGEMM, DTRSM, DTRMM — O(pbn) flops
8: end for
Algorithm 6: TRLUCP (a second look at subroutines and flop count)
The flop count of TRLUCP is dominated by the three matrix multiplication (DGEMM) steps. The total number of flops is
$$F_{TRLUCP} = 2pmn + (m+n)k^2 + \text{low order terms}.$$
Note the transparent constants, and, because matrix multiplication is the bottleneck, this algorithm can be implemented efficiently in terms of both computation and memory usage. Because the output of TRLUCP is only written once, the total number of memory writes is (m + n − k)k. Minimizing the number of data writes by only writing data once significantly improves efficiency, because writing data is typically one of the slowest computational operations. Also worth consideration is the simplicity of the LU decomposition, which only involves three types of operations: matrix multiply, scaling, and pivoting. By contrast, state-of-the-art calculation of both the full and truncated SVD requires a more complex process of bidiagonalization. The projection R can be updated efficiently to become a random projection of the Schur complement for the next iteration. This calculation involves the current progress of the LU factorization and the random matrix Ω, and is described below.
Updating R
The goal of TRLUCP is to access the entire matrix once in the initial random projection,and then choose column pivots at each iteration without accessing the Schur complement.
Therefore, a projection of the Schur complement must be obtained at each iteration without accessing the Schur complement, a method that first appeared in [78]. Assume that s iterations of TRLUCP have been performed, and partition the projection matrix by column blocks of widths sb, b, and m − (s + 1)b:
$$\Omega = \begin{pmatrix} \Omega_1 & \Omega_2 & \Omega_3 \end{pmatrix},$$
and the current A by row blocks of heights sb, b, and m − (s + 1)b and column blocks of widths sb, b, and n − (s + 1)b:
$$A^{(s)} = \begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}.$$
Then the current projection of the Schur complement is
$$R^{cur} = \begin{pmatrix} R_1^{cur} & R_2^{cur} \end{pmatrix} = \begin{pmatrix} \Omega_2 & \Omega_3 \end{pmatrix}\begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix},$$
where the right-most matrix is the current Schur complement. The next iteration of TRLUCP will need to choose columns based on a random projection of the Schur complement, which we wish to avoid accessing. We can write:
$$\begin{aligned}
R^{update} &= \Omega_3\left(A_{33} - A_{32}A_{22}^{-1}A_{23}\right)\\
&= \Omega_3 A_{33} + \Omega_2 A_{23} - \Omega_2 A_{23} - \Omega_3 A_{32}A_{22}^{-1}A_{23}\\
&= \Omega_3 A_{33} + \Omega_2 A_{23} - \Omega_2 L_{22}U_{23} - \Omega_3 L_{32}U_{23}\\
&= R_2^{cur} - \left(\Omega_2 L_{22} + \Omega_3 L_{32}\right)U_{23}. \qquad (4.1)
\end{aligned}$$
Here the current L and U at stage s have been blocked in the same way as Ω. Note that equation (4.1) no longer contains the term A₃₃. Furthermore, A₂₂⁻¹ has been replaced by substituting in submatrices of L and U that have already been calculated, which helps eliminate potential instability.
When the block size b = 1 and TRLUCP runs fully (k = min(m, n)), TRLUCP is mathematically equivalent to the Gaussian Elimination with Randomized Complete Pivoting (GERCP) algorithm of [78]. However, TRLUCP differs from GERCP in two very important aspects: TRLUCP is based on the Crout variant of the LU factorization, which allows efficient truncation for low-rank matrix approximation, and TRLUCP has been structured in block form for more efficient implementation.
4.2 Spectrum-Revealing Pivoting (SRP)
TRLUCP produces high-quality data approximations for almost all data matrices, despite the lack of theoretical guarantees, but can miss important rows or columns of the data.
Next, we develop an efficient variant of the existing rank-revealing LU algorithms [51, 79] to rapidly detect and, if necessary, correct any possible matrix approximation failures of TRLUCP.
Intuitively, the quality of the factorization can be tested by searching for the next choice of pivot in the Schur complement, as if the factorization were to continue. Because TRLUCP does not provide an updated Schur complement, the largest element in the Schur complement can be approximated by finding the column of R with largest norm, performing a Schur update of that column, and then picking the largest element in that column. Let α be this element, and, without loss of generality, assume it is the first entry of the Schur complement. Denote:
$$\Pi_1 A \Pi_2^T = \begin{pmatrix} L_{11} & & \\ \ell^T & 1 & \\ L_{31} & & I \end{pmatrix}\begin{pmatrix} U_{11} & u & U_{13} \\ & \alpha & s_{12}^T \\ & s_{21} & S_{22} \end{pmatrix}.$$
Next, we must find the row and column that should be replaced if the row and column containing α are important. Note that the smallest entry of L₁₁U₁₁ may still lie in an important row and column, and so the largest element of the inverse should be examined instead. Thus we propose defining
$$\overline{A}_{11} \stackrel{def}{=} \begin{pmatrix} L_{11} & \\ \ell^T & 1 \end{pmatrix}\begin{pmatrix} U_{11} & u \\ & \alpha \end{pmatrix}$$
and testing
$$\left\|\overline{A}_{11}^{-1}\right\|_{\max} \le \frac{f}{|\alpha|}$$
for a tolerance parameter f > 1 that controls the tradeoff between accuracy and the number of swaps needed. Should the test fail, the row and column containing α are swapped with the row and column containing the largest element in $\overline{A}_{11}^{-1}$. Note that this element may occur in the last row or last column of $\overline{A}_{11}^{-1}$, indicating that only a column swap or row swap, respectively, is needed. When the swaps are performed, the factorization must be updated to maintain truncated LU form. We have developed a variant of the LU updating algorithm of [48] to efficiently update the SRLU factorization.
SRP can be implemented efficiently: each swap requires at most O(k(m + n)) operations, and $\left\|\overline{A}_{11}^{-1}\right\|_{\max}$ can be quickly and reliably estimated using [54]. An argument similar to that used in [79] shows that each swap will increase $\left|\det\left(\overline{A}_{11}\right)\right|$ by a factor of at least f, hence a swap will never repeat.
4.3 LU Updating
Restoring the LU form of a matrix factorization after a modification to the factorization (LU updating) has been explored in [46, 48, 96].
Require: Truncated LU factorization A ≈ LU, tolerance f > 1
1: while $\left\|\overline{A}_{11}^{-1}\right\|_{\max} > \frac{f}{|\alpha|}$ do
2:   Set α to be the largest element in S (or find an approximate α using R)
3:   Swap row and column containing α with row and column of largest element in $\overline{A}_{11}^{-1}$
4:   Update truncated LU factorization
5: end while
Algorithm 7: Spectrum-Revealing Pivoting (SRP)
Restoring the truncated LU format of SRLU requires updating the factorization after row and column swaps. [48] illustrates updating a full LU factorization after row and column swaps are performed, which maintains stability by allowing additional row and column swaps to be performed.
Updating a truncated LU factorization can be achieved by moving the row/column to be removed to the kth row/column, and moving forward the rows/columns in between. Then, the elimination strategy of [48] is used to eliminate the super-diagonal/sub-diagonal entries (while maintaining a unit diagonal on L). Note, nevertheless, that the final super/sub-diagonal entry cannot be eliminated with this strategy because, if a row or column swap is required for stability, then the updating may swap the row/column to be moved out with the row/column to be moved in twice. Instead, after all other super/sub-diagonal entries have been eliminated, swap rows/columns k and k + 1 to simultaneously add the new row/column required and remove the row/column not needed. At this stage, a single step of LU factorization without pivoting restores the truncated LU factorization.
To see that a step of LU is stable without pivoting, note that entry (k − 1, k) of the L factor is 1 and the (k, k − 1) entry of the U factor is an entry that was previously on the diagonal of the U factor. Hence, the leading entry in the Schur complement is not numerically 0, and so LU can proceed without a pivot.
Let α denote the entry with largest magnitude in the Schur complement. Let (i, j) indicate the coordinates of the entry to be moved out of the upper k-by-k principal submatrix of A, and (s, t) be the coordinates of the entry to be moved in. LU updating proceeds as follows:
1. Swap rows s and k + 1, as well as columns t and k + 1. Note that the factorization remains in truncated LU format.

2. Move row i + 1 to row i, row i + 2 to row i + 1, ..., row k to row k − 1, and row i to row k.

3. Use the methodology of [48] to eliminate entries (i, i + 1), (i + 1, i + 2), ..., and (k − 2, k − 1) from the L factor, while maintaining 1s on the diagonal.

4. Move column j + 1 to column j, ..., column k to column k − 1, and column j to column k.

5. Use the methodology of [48] to eliminate entries (j + 1, j), ..., and (k − 2, k − 1) from the U factor.

6. Swap rows k and k + 1 and columns k and k + 1.

7. Perform a single step of LU factorization to eliminate entry (k − 1, k) from the L factor and (k, k − 1) from the U factor.
4.4 Choice of Block Size
A heuristic for choosing a block size for TRLUCP is described here, which differs from standard block size methodologies for the LU decomposition. Note that a key difference of SRLU and TRLUCP from previous works is the size of the random projection: here the size is relative to the block size, not the target rank k (2pmn flops for TRLUCP versus the significantly larger 2kmn for others). This also implies that a change to the block size changes the flop count, and, to our knowledge, this is the first algorithm where the choice of block size affects the flop count. For problems where LAPACK chooses b = 64, our experiments have shown block sizes of 8 to 20 to be optimal for TRLUCP.
Because the ideal block size depends on many parameters, such as the architecture of the computer and the costs of various arithmetic, logic, and memory operations, guidelines are sought instead of an exact determination of the most efficient block size. To simplify calculations, only the matrix multiplication operations are considered, which are the bottleneck of computation. Let M denote the size of cache, f and m the number of flops and memory movements, and t_f and t_m the cost of a floating point operation and the cost of a memory movement. Using standard communication-avoiding analysis, we seek to choose a block size to minimize the total calculation time T, modeled as
$$T = f \cdot t_f + m \cdot t_m.$$
Choosing p = b + c for a small, fixed constant c, and minimizing implies
$$\begin{aligned}
T &= \left[(m+n-k)\left(k^2-kb\right) - \frac{4}{3}k^3 + 2bk^2 - \frac{2}{3}b^2k\right]\cdot t_f\\
&\quad + \left[(m+n-k)\left(\frac{k^2}{b}-k\right) - \frac{4}{3}\frac{k^3}{b} + 2k^2 - \frac{2}{3}bk\right]\cdot\frac{M}{\left(\sqrt{b^2+M}-b\right)^2}\cdot t_m.
\end{aligned}$$
This result is derived as follows: we analyze blocking by allowing different block sizes in each dimension. For matrices Ω ∈ R^{p×m} and A ∈ R^{m×n}, consider blocking the product Ω · A so that Ω is partitioned into blocks with s rows and ℓ columns, and A is partitioned into blocks with ℓ rows and b columns.
Then a current block update requires cache storage of
$$s\ell + \ell b + sb \le M.$$
Thus we will constrain
$$\ell \le \frac{M - sb}{s + b}.$$
The total runtime T is
$$\begin{aligned}
T &= 2pmn\cdot t_f + \left(\frac{p}{s}\right)\left(\frac{m}{\ell}\right)\left(\frac{n}{b}\right)\left(s\ell + \ell b + sb\right)\cdot t_m\\
&= 2pmn\cdot t_f + pmn\left(\frac{s+b}{sb} + \frac{1}{\ell}\right)\cdot t_m\\
&\ge 2pmn\cdot t_f + pmn\left(\frac{s+b}{sb} + \frac{s+b}{M-sb}\right)\cdot t_m\\
&= 2pmn\cdot t_f + pmnM\left(\frac{s+b}{sb\left(M-sb\right)}\right)\cdot t_m\\
&\stackrel{def}{=} 2pmn\cdot t_f + pmnM\,L(s,b,M)\cdot t_m.
\end{aligned}$$
Given Ω and A, changing the block sizes has no effect on the flop count. Optimizing L(s, b, M) over s yields
$$s^2 + 2sb = M.$$
By symmetry,
$$b^2 + 2sb = M.$$
Note, nevertheless, that s ≤ p by definition. Hence
$$s^* = \min\left(\sqrt{\frac{M}{3}},\; p\right),$$
and
$$b^* = \max\left(\sqrt{\frac{M}{3}},\; \sqrt{p^2+M}-p\right).$$
These values assume
$$\ell^* = \frac{M - sb}{s + b} = \max\left(\sqrt{\frac{M}{3}},\; \sqrt{p^2+M}-p\right) = b^*.$$
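These formulas are straightforward to evaluate; the following small helper is a sketch (parameter names are illustrative; M is the fast-memory size and p the sketch height, both measured in matrix entries):

    import numpy as np

    def blocking_params(M, p):
        s = min(np.sqrt(M / 3.0), p)
        b = max(np.sqrt(M / 3.0), np.sqrt(p**2 + M) - p)
        ell = (M - s * b) / (s + b)    # matches b in both regimes
        return s, b, ell

    # e.g. a 32 KB cache holds M = 4096 doubles; with p = 24:
    print(blocking_params(4096, 24))   # s* = 24, b* = ell* ~ 44.3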
This analysis applies to matrix-matrix multiplication where the matrices are fixed and the leading matrix is short and fat or the trailing matrix is tall and skinny. As noted above, nevertheless, the oversampling parameter p is a constant amount larger than the block size used during the LU factorization. The total initialization time is
$$T^{init} = 2pmn\cdot t_f + pmnM\left(\frac{s+b}{sb\left(M-sb\right)}\right)\cdot t_m = 2pmn\cdot t_f + mn\cdot\min\left(\frac{3\sqrt{3}\,p}{\sqrt{M}},\;\frac{M}{\left(\sqrt{p^2+M}-p\right)^2}\right)\cdot t_m.$$
We next choose the parameter b used for blocking the LU factorization, where p = b + O(1). The cumulative matrix multiplication (DGEMM) runtime is
$$\begin{aligned}
T_{DGEMM} &= \sum_{j=b:b:k-b}\Bigl[2jb(m-j) + 2jb(n-j-b)\Bigr]\cdot t_f + 2\Bigl[j(m-j) + j(n-j-b)\Bigr]\frac{M}{\left(\sqrt{b^2+M}-b\right)^2}\cdot t_m\\
&= \left[(m+n-k)\left(k^2-kb\right) - \frac{4}{3}k^3 + 2bk^2 - \frac{2}{3}b^2k\right]\cdot t_f\\
&\quad + \left[(m+n-k)\left(\frac{k^2}{b}-k\right) - \frac{4}{3}\frac{k^3}{b} + 2k^2 - \frac{2}{3}bk\right]\frac{M}{\left(\sqrt{b^2+M}-b\right)^2}\cdot t_m\\
&\stackrel{def}{=} N_f^{DGEMM}\cdot t_f + N_m^{DGEMM}\cdot t_m.
\end{aligned}$$
The methodology for choosing a block size is compared to other choices of block size in Figure 4.1. Note that LAPACK generally chooses a block size of 64 for these matrices, which is suboptimal in all cases, and can be up to twice as slow. In all of the cases tested, the calculated block size is close to or exactly the optimal block size.
4.5 Variations of SRLU
The CUR Decomposition with LU
A natural extension of truncated LU factorizations is a CUR-type decomposition for improved accuracy [72]:
$$\Pi_1 A \Pi_2^T \approx L\left(L^{\dagger}AU^{\dagger}\right)U \stackrel{def}{=} LMU.$$
As with standard CUR, the factors L and U retain (much of) the sparsity of the original data, while M is a small, k-by-k matrix. The CUR decomposition can improve the accuracy
Figure 4.1: Benchmarking TRLUCP with various block sizes on random matrices of different sizes and truncation ranks.
of an SRLU factorization with minimal extra memory. Extra computational time, nevertheless, is needed to calculate M. A more efficient, approximate CUR decomposition can be obtained by replacing A with a high-quality approximation (such as an SRLU factorization of higher rank) in the calculation of M:
$$A \approx L_k M U_k \quad\text{where}\quad M = L_k^{\dagger} L_{k+p} U_{k+p} U_k^{\dagger}. \qquad (4.2)$$
Here, a suitable p ∈ Z₊ is chosen so that L_{k+p}U_{k+p} more accurately approximates A, and M can be calculated after L_k and U_k have been formed by simply continuing the factorization for p additional iterations.
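A direct sketch of both middle factors follows (Python/NumPy; pinv-based for clarity rather than performance, and the function names are illustrative):

    import numpy as np

    def cur_exact(A, Lk, Uk):
        # M = Lk^+ A Uk^+ : the exact middle factor of the LU-based CUR.
        return np.linalg.pinv(Lk) @ A @ np.linalg.pinv(Uk)

    def cur_approx(Lk, Uk, Lkp, Ukp):
        # Approximate middle factor of (4.2), reusing the rank-(k+p) factors
        # in place of A, so no further access to the data matrix is needed.
        return np.linalg.pinv(Lk) @ (Lkp @ Ukp) @ np.linalg.pinv(Uk)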
The Online SRLU Factorization
Given a factored data matrix A ∈ R^{m×n} and new observations BΠ₂^T = (B₁  B₂) ∈ R^{s×n}, where B₁ holds the first k permuted columns and B₂ the remaining n − k, an augmented LU decomposition takes the form
$$\begin{pmatrix} \Pi_1 A\Pi_2^T \\ B\Pi_2^T \end{pmatrix} = \begin{pmatrix} L_{11} & & \\ L_{21} & I & \\ L_{31} & & I \end{pmatrix}\begin{pmatrix} U_{11} & U_{12} \\ & S \\ & S^{new} \end{pmatrix},$$
where L₃₁ = B₁U₁₁⁻¹ and S^{new} = B₂ − B₁U₁₁⁻¹U₁₂. An SRLU factorization can then be obtained by simply performing correcting swaps. For a rank-1 update, at most one swap is expected (although examples can be constructed that require more than one swap), which requires at most O(k(m + n)) flops. By contrast, the URV decomposition of [97] is O(n²), while SVD updating requires O((m + n) min²(m, n)) operations in general, or O((m + n) min(m, n) log₂²(ε)) for a numerical approximation with the fast multipole method. Subspace iteration has no known updating technique.
In many applications, reduced weight is given to old data. In this context, multiplying the matrices U₁₁, U₁₂, and S by some scaling factor less than 1 before applying spectrum-revealing pivoting will reflect the reduced importance of the old data.
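A sketch of the augmentation step follows (Python/NumPy; B is assumed already in A's permuted column order, and the correcting SRP swaps that complete online SRLU are omitted):

    import numpy as np

    def augment_rows(L, U, S, B):
        # Append new observations B (s-by-n) to a truncated LU with
        # L (m-by-k), U (k-by-n) and Schur complement S ((m-k)-by-(n-k)).
        k = U.shape[0]
        B1, B2 = B[:, :k], B[:, k:]
        U11, U12 = U[:, :k], U[:, k:]
        L31 = np.linalg.solve(U11.T, B1.T).T   # B1 U11^{-1}
        S_new = B2 - L31 @ U12                 # Schur complement of new rows
        return np.vstack([L, L31]), U, np.vstack([S, S_new])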
Chapter 5
The Theory of Spectrum-Revealing Algorithms
5.1 Theoretical Results for SRLU Factorizations
Analysis of General Truncated LU Decompositions
Theorem 3. Let (·)ₛ denote the rank-s truncated SVD for s ≤ k ≪ m, n. Then for any truncated LU factorization
$$\left\|\Pi_1 A\Pi_2^T - LU\right\| = \|S\|$$
for any norm ‖·‖. Furthermore,
$$\left\|\Pi_1 A\Pi_2^T - \left(LU\right)_s\right\|_2 \le 2\|S\|_2 + \sigma_{s+1}(A).$$
Proof. The equation simply follows from $\Pi_1 A\Pi_2^T = LU + \begin{pmatrix} 0 & 0 \\ 0 & S\end{pmatrix}$. For the inequality:
$$\begin{aligned}
\left\|\Pi_1 A\Pi_2^T - \left(LU\right)_s\right\|_2 &= \left\|\Pi_1 A\Pi_2^T - LU + LU - \left(LU\right)_s\right\|_2\\
&\le \left\|\Pi_1 A\Pi_2^T - LU\right\|_2 + \left\|LU - \left(LU\right)_s\right\|_2\\
&= \|S\|_2 + \sigma_{s+1}\left(LU\right)\\
&= \|S\|_2 + \sigma_{s+1}\left(\Pi_1 A\Pi_2^T - \begin{pmatrix} 0 & 0 \\ 0 & S\end{pmatrix}\right)\\
&\le \|S\|_2 + \sigma_{s+1}(A) + \|S\|_2.
\end{aligned}$$
Theorem 4. For a general rank-k truncated LU decomposition, we have for all 1 ≤ j ≤ k,
$$\sigma_j(A) \le \sigma_j\left(LU\right)\left(1 + \left(1 + \frac{\|S\|_2}{\sigma_k\left(LU\right)}\right)\frac{\|S\|_2}{\sigma_j(A)}\right).$$
Proof.
$$\begin{aligned}
\sigma_j(A) &\le \sigma_j\left(LU\right)\left(1 + \frac{\|S\|_2}{\sigma_j\left(LU\right)}\right)\\
&= \sigma_j\left(LU\right)\left(1 + \frac{\sigma_j(A)}{\sigma_j\left(LU\right)}\cdot\frac{\|S\|_2}{\sigma_j(A)}\right)\\
&\le \sigma_j\left(LU\right)\left(1 + \frac{\sigma_j\left(LU\right) + \|S\|_2}{\sigma_j\left(LU\right)}\cdot\frac{\|S\|_2}{\sigma_j(A)}\right)\\
&= \sigma_j\left(LU\right)\left(1 + \left(1 + \frac{\|S\|_2}{\sigma_j\left(LU\right)}\right)\frac{\|S\|_2}{\sigma_j(A)}\right)\\
&\le \sigma_j\left(LU\right)\left(1 + \left(1 + \frac{\|S\|_2}{\sigma_k\left(LU\right)}\right)\frac{\|S\|_2}{\sigma_j(A)}\right).
\end{aligned}$$
Note that the relaxation in the final step serves to establish a universal constant across all j, which leads to fewer terms that need bounding when the global SRLU swapping strategy is developed. Although we could succinctly write
$$\sigma_j(A) \le \sigma_j\left(LU\right)\left(1 + \frac{\|S\|_2}{\sigma_j\left(LU\right)}\right) \le \sigma_j\left(LU\right)\left(1 + \frac{\|S\|_2}{\sigma_k\left(LU\right)}\right),$$
performing the relaxation earlier produces a weaker bound when ‖S‖₂ is small. An upper bound is simpler:
$$\sigma_j\left(LU\right) \le \sigma_j(A)\left(1 + \frac{\|S\|_2}{\sigma_j(A)}\right).$$
Theorem 5. (CUR Error Bounds)
$$\left\|\Pi_1 A\Pi_2^T - LMU\right\|_2 \le 2\|S\|_2$$
and
$$\left\|\Pi_1 A\Pi_2^T - LMU\right\|_F \le \|S\|_F.$$
Proof. Write the QR and LQ decompositions L = Q^L R^L and U = L^U Q^U, with Q^L = (Q₁^L  Q₂^L) and Q^U stacked so that Q₁^L and (Q₁^U)^T span the column spaces of L and U^T respectively, and let C = Π₁AΠ₂^T − LU. First,
$$\begin{aligned}
\left\|\Pi_1 A\Pi_2^T - LMU\right\|_2 &= \left\|\begin{pmatrix} 0 & \left(Q_1^L\right)^T C\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T C\left(Q_1^U\right)^T & \left(Q_2^L\right)^T C\left(Q_2^U\right)^T \end{pmatrix}\right\|_2\\
&\le \left\|\left(Q_1^L\right)^T C\left(Q_2^U\right)^T\right\|_2 + \left\|\begin{pmatrix}\left(Q_2^L\right)^T C\left(Q_1^U\right)^T & \left(Q_2^L\right)^T C\left(Q_2^U\right)^T\end{pmatrix}\right\|_2\\
&= \left\|\left(Q_1^L\right)^T C\left(Q_2^U\right)^T\right\|_2 + \left\|\left(Q_2^L\right)^T C\begin{pmatrix}\left(Q_1^U\right)^T & \left(Q_2^U\right)^T\end{pmatrix}\right\|_2\\
&\le 2\|C\|_2 = 2\|S\|_2.
\end{aligned}$$
Also,
$$\begin{aligned}
\left\|\Pi_1 A\Pi_2^T - LMU\right\|_F &= \left\|Q^L\left(Q^L\right)^T A\left(Q^U\right)^T Q^U - Q_1^L\left(Q_1^L\right)^T A\left(Q_1^U\right)^T Q_1^U\right\|_F\\
&= \left\|Q_1^L\left(Q_1^L\right)^T A\left(Q_2^U\right)^T Q_2^U + Q_2^L\left(Q_2^L\right)^T A\left(Q_1^U\right)^T Q_1^U + Q_2^L\left(Q_2^L\right)^T A\left(Q_2^U\right)^T Q_2^U\right\|_F\\
&= \left\|Q_1^L\left(Q_1^L\right)^T C\left(Q_2^U\right)^T Q_2^U + Q_2^L\left(Q_2^L\right)^T C\left(Q_1^U\right)^T Q_1^U + Q_2^L\left(Q_2^L\right)^T C\left(Q_2^U\right)^T Q_2^U\right\|_F\\
&= \left\|\begin{pmatrix} 0 & \left(Q_1^L\right)^T C\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T C\left(Q_1^U\right)^T & \left(Q_2^L\right)^T C\left(Q_2^U\right)^T \end{pmatrix}\right\|_F\\
&\le \left\|\begin{pmatrix} \left(Q_1^L\right)^T C\left(Q_1^U\right)^T & \left(Q_1^L\right)^T C\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T C\left(Q_1^U\right)^T & \left(Q_2^L\right)^T C\left(Q_2^U\right)^T \end{pmatrix}\right\|_F\\
&= \|C\|_F = \|S\|_F.
\end{aligned}$$
Theorem 3 simply concludes that the approximation is accurate if the Schur complement is small, but the singular value bounds of Theorem 4 are needed to guarantee that the approximation retains structural properties of the original data, such as an accurate approximation of the rank and the spectrum. Furthermore, singular value bounds can be significantly stronger than the more familiar norm error bounds that appear in Theorem 3. Theorem 4 provides a general framework for singular value bounds, and bounding the terms in this theorem provided guidance in the design and development of SRLU. Just as in the case of deterministic LU with complete pivoting, the sizes of $\frac{\|S\|_2}{\sigma_k\left(LU\right)}$ and $\frac{\|S\|_2}{\sigma_j\left(LU\right)}$ range from moderate to small for almost all data matrices of practical interest. They, nevertheless, cannot be effectively bounded for a general TRLUCP factorization, implying the need for Algorithm 7 to ensure that these terms are controlled. While the error bounds in Theorem 5 for the CUR decomposition do not improve upon the result in Theorem 3, CUR bounds for SRLU specifically will be considerably stronger.
The factor of 2 in Theorem 5 will also appear in the theorems below. This factor is due to the potential increase in the norm of a matrix when part of it is zeroed out, as seen in the proof. This factor is likely not optimal, but a counterexample shows that it cannot be 1:
$$\frac{\left\|\begin{pmatrix} 0_n & \sqrt{2}I_n \\ \sqrt{2}I_n & I_n \end{pmatrix}\right\|_2}{\left\|\begin{pmatrix} -I_n & \sqrt{2}I_n \\ \sqrt{2}I_n & I_n \end{pmatrix}\right\|_2} = \sqrt{\frac{4}{3}}.$$
We conjecture that $\sqrt{\frac{4}{3}}$ is indeed a sharp bound. In the 2-by-2 case, using an exact formula for the singular values shows that the factor is smaller than $\sqrt{2}$: note that if a₁₂ = 0 or a₂₁ = 0 then
$$\left\|\begin{pmatrix} 0 & a_{12} \\ a_{21} & a_{22}\end{pmatrix}\right\|_2 \le \left\|\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22}\end{pmatrix}\right\|_2$$
because the matrix on the left-hand side is a submatrix of the matrix on the right-hand side. Assuming a₁₂, a₂₁ ≠ 0, a quick estimate shows
$$\frac{\left\|\begin{pmatrix} 0 & a_{12} \\ a_{21} & a_{22}\end{pmatrix}\right\|_2}{\left\|\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22}\end{pmatrix}\right\|_2}
= \frac{\sqrt{\frac{a_{12}^2+a_{21}^2+a_{22}^2}{2} + \sqrt{\left(\frac{a_{12}^2+a_{21}^2+a_{22}^2}{2}\right)^2 - \left(a_{12}a_{21}\right)^2}}}{\sqrt{\frac{a_{11}^2+a_{12}^2+a_{21}^2+a_{22}^2}{2} + \sqrt{\left(\frac{a_{11}^2+a_{12}^2+a_{21}^2+a_{22}^2}{2}\right)^2 - \left(a_{11}a_{22}-a_{12}a_{21}\right)^2}}}
\le \frac{\sqrt{\frac{a_{12}^2+a_{21}^2+a_{22}^2}{2} + \sqrt{\left(\frac{a_{12}^2+a_{21}^2+a_{22}^2}{2}\right)^2 - \left(a_{12}a_{21}\right)^2}}}{\sqrt{\frac{a_{12}^2+a_{21}^2+a_{22}^2}{2}}} < \sqrt{2}.$$
Correctness of the Spectrum-Revealing LU Decomposition
Our immediate goal is to bound $\frac{\|S\|_2}{\sigma_k\left(LU\right)}$ and $\frac{\|S\|_2}{\sigma_j\left(A\right)}$. In the context of LU with complete pivoting, a clear metric to evaluate the size of ‖S‖₂ is to compare ‖S‖max, the next choice of pivot element were the algorithm to continue, with the quality of the current approximation. While large values in A₁₁ indicate that the corresponding rows and columns are important contributions to the approximation, note that small entries need not imply that the corresponding rows and columns are unimportant, as these rows and columns may contain other large entries. Instead, large entries of A₁₁⁻¹ indicate which columns and rows contribute least to the quality of the approximation.
Consider the following notation. Let α be the entry in S with largest magnitude. Without loss of generality, assume that α has been permuted to the first entry of S:
$$S = \begin{pmatrix} \alpha & s_{12} \\ s_{21} & S_{22} \end{pmatrix}.$$
Then:
$$\Pi_1 A\Pi_2^T = \begin{pmatrix} A_{11} & u & A_{13} \\ \ell^T & \alpha & s_{12} \\ A_{31} & s_{21} & S_{22} \end{pmatrix}.$$
Denote
$$\overline{A}_{11} := \begin{pmatrix} A_{11} & u \\ \ell^T & \alpha \end{pmatrix}.$$
For a given tuning parameter f ≥ 1, we propose evaluating the quality of the approximation LU by testing whether
$$\left\|\overline{A}_{11}^{-1}\right\|_{\max} \le \frac{f}{|\alpha|}. \qquad (5.1)$$
Should condition (5.1) be satisfied, then no column and row combination can be deemed insignificant compared to the column and row that contain α. With this test, we will show that the inequality in Theorem 4 is indeed bounded. Because our test involves searching for the largest element in the Schur complement, we cannot expect to do better than a dimension-dependent bound. By contrast, a QR factorization produces a factorization in terms of column norms, which indicates that a comparable analysis may produce a bound with fixed global constants.
Theorem 6. (Existence of an Optimal Solution) SRLU terminates after a finite number of swaps.
Proof. The theorem will follow after proving two properties of the algorithm: first, performing a swap always increases det(Ā₁₁), and so the algorithm cannot repeat itself. Second, a set of rows and columns always exists that satisfies condition (5.1). Because only a finite number of row and column selections are possible, SRLU must always terminate.
Suppose $\left\|\overline{A}_{11}^{-1}\right\|_{\max} > \frac{f}{|\alpha|}$, and let i and j denote the row and column, respectively, of the location of the largest element in $\overline{A}_{11}^{-1}$ (there may be more than one such entry). Let $\widetilde{A}_{11}$ denote $\overline{A}_{11}$ with a single swap replacing row i and column j with row k + 1 and column k + 1. Then from [80]:
$$\frac{\det\left(\widetilde{A}_{11}\right)}{\det\left(\overline{A}_{11}\right)} = \left(U_{11}^{-1}u\right)_j\left(\ell^T L_{11}^{-1}\right)_i + \alpha\left(\overline{A}_{11}^{-1}\right)_{ji} = \alpha\left(\overline{A}_{11}^{-1}\right)_{ji} > f.$$
This implies that a swap always improves the quality of the approximation. Now suppose Π₁ and Π₂^T are chosen so that Ā₁₁ has the largest possible determinant. Then:
$$1 \ge \frac{\det\left(\widetilde{A}_{11}\right)}{\det\left(\overline{A}_{11}\right)} = \alpha\left(\overline{A}_{11}^{-1}\right)_{ji},$$
and so
$$\left\|\overline{A}_{11}^{-1}\right\|_{\max} \le \frac{1}{|\alpha|} \le \frac{f}{|\alpha|}.$$
Note: f is not essential to the proof of Theorem 6. This parameter is essential in practice, nevertheless, as an exponential number of swaps may be necessary to find an optimal solution (i.e., if f = 1). For f > 1, the determinant of Ā₁₁ is improved by at least a factor of f with each swap.
Theorem 7. (Spectrum-Revealing LU Stopping Time) The Spectrum-Revealing LU algorithm terminates in time proportional to the oversampling size (with a dimension-dependent and ε-dependent constant) with success probability at least
$$1 - 2e^{-(\varepsilon^2-\varepsilon^3)p/4}.$$
Here, ε is a tuning parameter that affects the bound on the number of swaps. Typically, ε ≈ 1/2.
Proof. Let S^{(k)} denote the Schur complement after k rows and columns have been factored with SRLU and pivots have been performed. Then $\det\left(\overline{A}_{11}\right) = \prod_{k=1}^{J} S^{(k)}_{1,1}$, because these are the diagonal elements of U₁₁. Before proceeding, note several inequalities:
$$\left\|S^{(k)}(:,1)\right\|_2 \le g_1\sqrt{m-k+1}\,\left|S^{(k)}_{1,1}\right|.$$
Using partial row pivoting, g₁ = 1 is guaranteed to satisfy this inequality. Applying the form of the Johnson-Lindenstrauss theorem in inequality (3.5),
$$\left\|S^{(k)}\right\|_{1,2} \le \left(\frac{1+\varepsilon}{1-\varepsilon}\right) g_2\left\|S^{(k)}(:,1)\right\|_2.$$
Because SRLU is a block algorithm, and because it reuses the original random projection by updating R at each iteration, this result is not automatic; it was shown to hold in [109]. The success probability remains 1 − 2e^{−(ε²−ε³)k/2}. Deterministic QRCP always pivots the column with largest norm to the front, and so this inequality would hold without any constants. Using randomized QRCP to select columns, this inequality is guaranteed to be true for g₂ = 1. Higher values of g₂ will hold for modified pivoting strategies, such as tournament pivoting. Also,
$$\sigma_1\left(S^{(k)}\right) \le \sqrt{n-k+1}\,\left\|S^{(k)}\right\|_{1,2}.$$
To see this last inequality, note that for any vector x of length n − k + 1, ‖x‖₁ ≤ √(n−k+1)‖x‖₂. Then
$$\sigma_1\left(S^{(k)}\right) = \max_{x\ne 0}\left\|S^{(k)}\frac{x}{\|x\|_2}\right\|_2 \le \max_{x\ne 0}\sqrt{n-k+1}\,\frac{\left\|S^{(k)}x\right\|_2}{\|x\|_1} = \sqrt{n-k+1}\,\left\|S^{(k)}\right\|_{1,2}.$$
Finally, note that
$$\sigma_1\left(S^{(k)}\right) \ge \sigma_k(A)$$
from [83].
Continuing the argument:
$$\begin{aligned}
\left|\det\left(\overline{A}_{11}\right)\right| &= \prod_{k=1}^{J}\left|S^{(k)}_{1,1}\right| \ge \prod_{k=1}^{J}\frac{\left\|S^{(k)}(:,1)\right\|_2}{g_1\sqrt{m-k+1}}\\
&\ge \prod_{k=1}^{J}\left(\frac{1-\varepsilon}{1+\varepsilon}\right)\frac{\sigma_1\left(S^{(k)}\right)}{g_1 g_2\sqrt{m-k+1}\sqrt{n-k+1}}\\
&\ge \prod_{k=1}^{J}\left(\frac{1-\varepsilon}{1+\varepsilon}\right)\frac{\sigma_k(A)}{g_1 g_2\sqrt{m-k+1}\sqrt{n-k+1}}\\
&\ge \left[\prod_{k=1}^{J}\sigma_k(A)\right]\cdot\left[\left(\frac{1-\varepsilon}{1+\varepsilon}\right)\frac{1}{g_1 g_2\sqrt{m}\sqrt{n}}\right]^J.
\end{aligned}$$
After performing K swaps:
$$\left|\det\left(\overline{A}_{11}\right)\right| \ge \left[\prod_{k=1}^{J}\sigma_k(A)\right]\cdot\left[\left(\frac{1-\varepsilon}{1+\varepsilon}\right)\frac{1}{g_1 g_2\sqrt{m}\sqrt{n}}\right]^J f^K.$$
Thus
$$\left[\prod_{k=1}^{J}\sigma_k(A)\right] \ge \left|\det\left(\overline{A}_{11}\right)\right| \ge \left[\prod_{k=1}^{J}\sigma_k(A)\right]\cdot\left[\left(\frac{1-\varepsilon}{1+\varepsilon}\right)\frac{1}{g_1 g_2\sqrt{m}\sqrt{n}}\right]^J f^K,$$
implying
$$K \le J\log_f\left(\left(\frac{1+\varepsilon}{1-\varepsilon}\right) g_1 g_2\sqrt{mn}\right).$$
Analysis of the Spectrum-Revealing LU Decomposition
Theorem 8. (SRLU Error Bounds) For j ≤ k and γ ≤ O(fk√mn), a rank-k SRLU factorization satisfies
$$\left\|\Pi_1 A\Pi_2^T - LU\right\|_2 \le \gamma\,\sigma_{k+1}(A),$$
$$\left\|\Pi_1 A\Pi_2^T - \left(LU\right)_j\right\|_2 \le \sigma_{j+1}(A)\left(1 + 2\gamma\frac{\sigma_{k+1}(A)}{\sigma_{j+1}(A)}\right).$$
Note: the notation τ ≤ O(fk√mn) is meant to convey that a bound of O(fk√mn) has been proven here, although a better bound τ may still exist.
Note: although the factor of 2 may seem redundant in the presence of τ, numeric experiments will show that τ and similar constants below are indeed small in most practical cases. Hence it is worth keeping the factor of 2 (here and in later theorems) separate from τ.
Proof. Note that the definition of α implies
$$\|S\|_2 \le \sqrt{(m-k)(n-k)}\,|\alpha|.$$
From [83]:
$$\sigma_{\min}\left(\overline{A}_{11}\right) \le \sigma_{k+1}(A).$$
Then:
$$\sigma_{k+1}^{-1}(A) \le \left\|\overline{A}_{11}^{-1}\right\|_2 \le (k+1)\left\|\overline{A}_{11}^{-1}\right\|_{\max} \le (k+1)\frac{f}{|\alpha|}.$$
Thus
$$|\alpha| \le f(k+1)\,\sigma_{k+1}(A).$$
The theorem follows by using this result with Theorem 3, with
$$\gamma \le \sqrt{mn}\,f(k+1).$$
Theorem 8 is a special case of Theorem 3 for SRLU factorizations. For a data matrix with a rapidly decaying spectrum, the right-hand side of the second inequality is close to σ_{j+1}(A), a substantial improvement over the sharpness of the bounds in [37].
While a CUR variant immediately follows by invoking Theorem 5 instead of Theorem 3,a stronger bound for CUR is developed later.
Theorem 9. (SRLU Spectral Bound) For 1 ≤ j ≤ k and τ ≤ O(mnk²f³), a rank-k SRLU factorization satisfies
$$\frac{\sigma_j(A)}{1 + \tau\frac{\sigma_{k+1}(A)}{\sigma_j(A)}} \le \sigma_j\left(LU\right) \le \sigma_j(A)\left(1 + \tau\frac{\sigma_{k+1}(A)}{\sigma_j(A)}\right).$$
Proof. After running k iterations of rank-revealing LU,
$$\Pi_1 A\Pi_2^T = LU + C,$$
where $C = \begin{pmatrix} 0 & 0 \\ 0 & S\end{pmatrix}$ and S is the Schur complement. Then
$$\sigma_j(A) \le \sigma_j\left(LU\right) + \|C\|_2 = \sigma_j\left(LU\right)\left(1 + \frac{\|C\|_2}{\sigma_j\left(LU\right)}\right). \qquad (5.2)$$
For the upper bound:
$$\sigma_j\left(LU\right) = \sigma_j\left(A - C\right) \le \sigma_j(A) + \|C\|_2 = \sigma_j(A)\left[1 + \frac{\|C\|_2}{\sigma_j(A)}\right] = \sigma_j(A)\left[1 + \frac{\|S\|_2}{\sigma_j(A)}\right].$$
The final form is achieved using the same bound on the constant as in Theorem 8.
While the worst-case upper bound on τ is large, it is dimension-dependent, and j and k may be chosen so that $\frac{\sigma_{k+1}(A)}{\sigma_j(A)}$ is arbitrarily small compared to τ. In particular, if k is the numeric rank of A, then the singular values of the approximation are numerically equal to those of the data.
These bounds are 'problem-specific bounds' because their quality depends on the spectrum of the original data, rather than on universal constants that appear in previous results. The benefit of these problem-specific bounds is that an approximation of data with a rapidly decaying spectrum is guaranteed to be high-quality. Furthermore, if σ_{k+1}(A) is not small compared to σ_j(A), then no high-quality low-rank approximation is possible in the 2 and Frobenius norms. Thus, in this sense, the bounds presented in Theorems 8 and 9 are optimal.
We note that singular value ratios have appeared before. For example, Hwang, Lin, and Yang [58] proved:

Theorem 10. Let $C(n,k) = \frac{n!}{k!(n-k)!}$. For nonsingular A ∈ R^{n×n}, there exist permutation matrices Π₁ and Π₂^T such that
$$\left|\left(U_{22}\right)_{ij}\right| \le \frac{C(n,k)\,\sigma_{k+1}(A)}{1 - C(n,k)\frac{\sigma_{k+1}(A)}{\sigma_k(A)}}.$$
Given a high-quality rank-k truncated LU factorization, Theorem 9 ensures that a low-rank approximation of rank ℓ with ℓ < k of the compressed data is an accurate rank-ℓ approximation of the full data. The proof of this theorem centers on bounding the terms in Theorems 3 and 4. Experiments will show that τ is small in almost all cases.
Stronger results are achieved with the CUR version of SRLU:
Theorem 11.
$$\left\|\Pi_1 A\Pi_2^T - LMU\right\|_2 \le 2\gamma\,\sigma_{k+1}(A)$$
and
$$\left\|\Pi_1 A\Pi_2^T - LMU\right\|_F \le \omega\,\sigma_{k+1}(A),$$
where γ ≤ O(fk√mn) is the same as in Theorem 8, and, similarly, ω ≤ O(fk√mn).
Proof. Note that the definition of α implies
$$\|S\|_F \le \sqrt{(m-k)(n-k)}\,|\alpha|.$$
The rest follows by using Theorem 5 in a manner similar to how Theorem 8 invoked Theorem 3.
Theorem 12. If σ_j²(A) > 2‖S‖₂² then
$$\sigma_j(A) \ge \sigma_j\left(LMU\right) \ge \sigma_j(A)\sqrt{1 - 2\gamma\left(\frac{\sigma_{k+1}(A)}{\sigma_j(A)}\right)^2},$$
where γ ≤ O(mnk²f²), and f is an input parameter controlling a tradeoff of quality versus speed, as before.
Proof. Perform QR and LQ decompositions
$$L = Q^L R^L =: \begin{pmatrix} Q_1^L & Q_2^L\end{pmatrix}\begin{pmatrix} R_{11}^L & R_{12}^L \\ & R_{22}^L\end{pmatrix} \quad\text{and}\quad U = L^U Q^U =: \begin{pmatrix} L_{11}^U & \\ L_{21}^U & L_{22}^U\end{pmatrix}\begin{pmatrix} Q_1^U \\ Q_2^U\end{pmatrix}.$$
Then
$$LMU = Q_1^L\left(Q_1^L\right)^T A\left(Q_1^U\right)^T Q_1^U.$$
Note that
$$\begin{aligned}
A^T Q_2^L &= \left(LU + C\right)^T Q_2^L\\
&= \left(Q_1^L R_{11}^L L_{11}^U Q_1^U + C\right)^T Q_2^L\\
&= \left(Q_1^U\right)^T\left(L_{11}^U\right)^T\left(R_{11}^L\right)^T\left(Q_1^L\right)^T Q_2^L + C^T Q_2^L\\
&= C^T Q_2^L. \qquad (5.3)
\end{aligned}$$
Analogously,
$$A\left(Q_2^U\right)^T = C\left(Q_2^U\right)^T. \qquad (5.4)$$
Then, using (5.3) and (5.4),
$$\sigma_j(A) = \sigma_j\begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T & \left(Q_1^L\right)^T A\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T A\left(Q_1^U\right)^T & \left(Q_2^L\right)^T A\left(Q_2^U\right)^T\end{pmatrix} = \sigma_j\begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T & \left(Q_1^L\right)^T C\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T C\left(Q_1^U\right)^T & \left(Q_2^L\right)^T C\left(Q_2^U\right)^T\end{pmatrix}.$$
For brevity, write the block rows of the latter matrix as
$$B_1 = \begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T & \left(Q_1^L\right)^T C\left(Q_2^U\right)^T\end{pmatrix}, \qquad B_2 = \begin{pmatrix} \left(Q_2^L\right)^T C\left(Q_1^U\right)^T & \left(Q_2^L\right)^T C\left(Q_2^U\right)^T\end{pmatrix},$$
so that $\sigma_j^2(A) = \lambda_j\left(B_1^T B_1 + B_2^T B_2\right)$, where λ_j denotes the j-th largest eigenvalue. Continuing:
$$\begin{aligned}
\sigma_j(A) &\le \left(\lambda_j\left(B_1^T B_1\right) + \left\|B_2\right\|_2^2\right)^{\frac{1}{2}} \le \left(\lambda_j\left(B_1 B_1^T\right) + \|C\|_2^2\right)^{\frac{1}{2}}\\
&= \left(\lambda_j\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right)^T + \left(Q_1^L\right)^T C\left(Q_2^U\right)^T\left(\left(Q_1^L\right)^T C\left(Q_2^U\right)^T\right)^T\right) + \|C\|_2^2\right)^{\frac{1}{2}}\\
&\le \left(\lambda_j\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right)^T\right) + \left\|\left(Q_1^L\right)^T C\left(Q_2^U\right)^T\left(\left(Q_1^L\right)^T C\left(Q_2^U\right)^T\right)^T\right\|_2 + \|C\|_2^2\right)^{\frac{1}{2}}\\
&\le \left(\sigma_j^2\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right) + 2\|C\|_2^2\right)^{\frac{1}{2}} = \left(\sigma_j^2\left(LMU\right) + 2\|C\|_2^2\right)^{\frac{1}{2}}\\
&= \sigma_j\left(LMU\right)\sqrt{1 + 2\left(\frac{\|C\|_2}{\sigma_j\left(LMU\right)}\right)^2} = \sigma_j\left(LMU\right)\sqrt{1 + 2\left(\frac{\|S\|_2}{\sigma_j\left(LMU\right)}\right)^2}.
\end{aligned}$$
Solving for σ_j(LMU) gives the lower bound. The upper bound:
$$\sigma_j(A) = \sigma_j\begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T & \left(Q_1^L\right)^T A\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T A\left(Q_1^U\right)^T & \left(Q_2^L\right)^T A\left(Q_2^U\right)^T\end{pmatrix} \ge \sigma_j\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right) = \sigma_j\left(Q_1^L\left(Q_1^L\right)^T A\left(Q_1^U\right)^T Q_1^U\right) = \sigma_j\left(LMU\right).$$
As before, the constants are small in practice. Observe that for most real data matrices, the singular values decay with increasing j. For such matrices this result is significantly stronger than Theorem 9.
As in Theorem 5, the factor of 2 stems from approximating the spectral norm of a matrix with a zero submatrix. As before, the approximation is sharper under the Frobenius norm:
Theorem 13. (Frobenius Bound)
$$\left\|LMU\right\|_F \ge \frac{\|A\|_F}{\sqrt{1 + \left(\frac{\|S\|_F}{\left\|LMU\right\|_F}\right)^2}}.$$
Note: to see that Theorem 13 improves upon Theorem 12, note that accumulating
$$\sigma_j^2(A) - 2\|S\|_2^2 \le \sigma_j^2\left(LMU\right)$$
gives a weaker lower bound than
$$\|A\|_F^2 - \|S\|_F^2 \le \left\|LMU\right\|_F^2,$$
these two inequalities being equivalent to the results of these theorems.
Proof. With the notation of Theorem 5, let
$$X = \begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T & \left(Q_1^L\right)^T C\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T C\left(Q_1^U\right)^T & \left(Q_2^L\right)^T C\left(Q_2^U\right)^T\end{pmatrix}, \qquad X_0 = \begin{pmatrix} 0 & \left(Q_1^L\right)^T C\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T C\left(Q_1^U\right)^T & \left(Q_2^L\right)^T C\left(Q_2^U\right)^T\end{pmatrix}.$$
Then
$$\|A\|_F^2 = \left\|\left(Q^L\right)^T A\left(Q^U\right)^T\right\|_F^2 = \operatorname{trace}\left(X^T X\right) = \operatorname{trace}\left(\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right)^T\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right) + \operatorname{trace}\left(X_0^T X_0\right).$$
Filling in the zero block of X₀ with $\left(Q_1^L\right)^T C\left(Q_1^U\right)^T$ can only increase the trace, so
$$\|A\|_F^2 \le \left\|\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right\|_F^2 + \operatorname{trace}\left(\left(\left(Q^L\right)^T C\left(Q^U\right)^T\right)^T\left(Q^L\right)^T C\left(Q^U\right)^T\right) = \left\|LMU\right\|_F^2 + \operatorname{trace}\left(C^T C\right) = \left\|LMU\right\|_F^2 + \|C\|_F^2.$$
Note that Theorem 13 is a general result. For SRLU factorizations:
Theorem 14. If σ_j(A) ≥ ‖S‖_F then
$$\left\|LMU\right\|_F \ge \|A\|_F\sqrt{1 - \gamma\left(\frac{\sigma_{k+1}(A)}{\|A\|_F}\right)^2},$$
where γ ≤ O(mnf²k²).
Proof. As in Theorem 12, the result is achieved by solving for ‖LMU‖_F and substituting ‖S‖_F with a bound in terms of m, n, k, f, and A. Note that, as seen above, the definition of α implies ‖S‖_F ≤ √((m−k)(n−k))|α|. The rest of the proof is similar to the proof of Theorem 12: substituting in this bound for ‖S‖_F and the previously established bound for |α| completes the proof.
Although the condition in Theorem 14 does not guarantee that the operand in the squareroot in the statement of the theorem is nonnegative, it does imply the operand is nonnegativebefore the final substitution. Numeric experiments imply that the constant is small, and thusa condition on the constant would be pessimistic. Also, note that for Theorems 12 and 14the condition in the statement is not a significant restriction. In both proofs, the restrictionis applied in the final steps to clean up the form of the bound.
Note: in general there is no nuclear norm bound in the form of Theorem 13. Counterexamples are easy to construct showing that such bounds fail in the nuclear norm.
Theorem 15. (Monotonicity of CUR Approximations) Suppose L_kU_k is a rank-k truncated LU approximation to a matrix A, and suppose the decomposition
$$L_{k+c}U_{k+c} = \begin{pmatrix} L_k & \begin{pmatrix} 0 \\ L_c\end{pmatrix}\end{pmatrix}\begin{pmatrix} U_k \\ \begin{pmatrix} 0 & U_c\end{pmatrix}\end{pmatrix}$$
is a rank-(k + c) truncated LU approximation obtained by continuing the LU factorization from the rank-k approximation. Then
$$\sigma_j\left(L_k\left(L_k^{\dagger}AU_k^{\dagger}\right)U_k\right) \le \sigma_j\left(L_{k+c}\left(L_{k+c}^{\dagger}AU_{k+c}^{\dagger}\right)U_{k+c}\right) \le \sigma_j(A).$$
Proof. Perform QR and LQ decompositions on L and U, similar to the decomposition in Theorem 12, but with extra blocks:
$$A = LU = \begin{pmatrix} Q_1^L & Q_2^L & Q_3^L\end{pmatrix}\begin{pmatrix} R_{11}^L & R_{12}^L & R_{13}^L \\ & R_{22}^L & R_{23}^L \\ & & R_{33}^L\end{pmatrix}\begin{pmatrix} L_{11}^U & & \\ L_{21}^U & L_{22}^U & \\ L_{31}^U & L_{32}^U & L_{33}^U\end{pmatrix}\begin{pmatrix} Q_1^U \\ Q_2^U \\ Q_3^U\end{pmatrix}.$$
Then
$$\begin{aligned}
\sigma_j\left(L_{k+c}\left(L_{k+c}^{\dagger}AU_{k+c}^{\dagger}\right)U_{k+c}\right) &= \sigma_j\left(\begin{pmatrix} Q_1^L & Q_2^L\end{pmatrix}\begin{pmatrix}\left(Q_1^L\right)^T \\ \left(Q_2^L\right)^T\end{pmatrix} A\begin{pmatrix}\left(Q_1^U\right)^T & \left(Q_2^U\right)^T\end{pmatrix}\begin{pmatrix} Q_1^U \\ Q_2^U\end{pmatrix}\right)\\
&= \sigma_j\left(\begin{pmatrix}\left(Q_1^L\right)^T \\ \left(Q_2^L\right)^T\end{pmatrix} A\begin{pmatrix}\left(Q_1^U\right)^T & \left(Q_2^U\right)^T\end{pmatrix}\right)\\
&\ge \sigma_j\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right) = \sigma_j\left(Q_1^L\left(Q_1^L\right)^T A\left(Q_1^U\right)^T Q_1^U\right) = \sigma_j\left(L_k\left(L_k^{\dagger}AU_k^{\dagger}\right)U_k\right).
\end{aligned}$$
The rightmost inequality is a direct application of Theorem 12.
Note: there is no result equivalent to Theorem 15 for a plain LU factorization. Consider
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1.01 & 0 \\ 1 & 1.0001 & 1\end{pmatrix} = \begin{pmatrix} 1 & & \\ 1 & 1 & \\ 1 & .01 & 1\end{pmatrix}\begin{pmatrix} 1 & 1 & 1 \\ & .01 & -1 \\ & & .01\end{pmatrix}.$$
Then
$$\sigma_1\left(\begin{pmatrix} 1 & \\ 1 & 1 \\ 1 & .01\end{pmatrix}\begin{pmatrix} 1 & 1 & 1 \\ & .01 & -1\end{pmatrix}\right) < \sigma_1(A) < \sigma_1\left(\begin{pmatrix} 1 \\ 1 \\ 1\end{pmatrix}\begin{pmatrix} 1 & 1 & 1\end{pmatrix}\right).$$
Caveat: consider the matrices
$$A = \begin{pmatrix} I_n & I_n \\ I_n & -3I_n\end{pmatrix}, \quad\text{and}\quad L = \begin{pmatrix} I_n \\ I_n\end{pmatrix},\; U = \begin{pmatrix} I_n & I_n\end{pmatrix}. \qquad (5.5)$$
Then
$$LL^{\dagger}AU^{\dagger}U = 0!$$
Nevertheless,
$$\left\|A - LL^{\dagger}AU^{\dagger}U\right\|_2 = \|A\|_2 = 1+\sqrt{5} < 4 = \left\|A - LU\right\|_2,$$
and
$$\left\|A - LL^{\dagger}AU^{\dagger}U\right\|_F = \|A\|_F = 2\sqrt{3}\sqrt{n} < 4\sqrt{n} = \left\|A - LU\right\|_F.$$
Theorem 16. (Approximate CUR lower bound) Let $M = L_k^{\dagger}L_{k+c}U_{k+c}U_k^{\dagger}$, as in line (4.2). Let S_k be the Schur complement after a rank-k truncated LU decomposition, and let S_{k+c} be the Schur complement after extending the truncated LU decomposition to rank k + c. Then
$$\sigma_j\left(LMU\right) \ge \frac{\sigma_j(A)}{\sqrt{1 + 2\frac{\left\|S_{k+c}\right\|_2}{\sigma_j\left(LMU\right)} + \frac{2\left\|S_k\right\|_2^2 + 3\left\|S_{k+c}\right\|_2^2}{\sigma_j^2\left(LMU\right)}}}.$$
Proof. Let A have the decomposition as in Theorem 15. Then, rewriting the off-diagonal blocks via identities of the form (5.3) and (5.4) (with S_k and S_{k+c} standing for the corresponding zero-padded error matrices),
$$\sigma_j(A) = \sigma_j\left(\left(Q^L\right)^T A\left(Q^U\right)^T\right) = \sigma_j\begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T & \left(Q_1^L\right)^T S_k\left(Q_2^U\right)^T & \left(Q_1^L\right)^T S_{k+c}\left(Q_3^U\right)^T \\ \left(Q_2^L\right)^T S_k\left(Q_1^U\right)^T & \left(Q_2^L\right)^T S_k\left(Q_2^U\right)^T & \left(Q_2^L\right)^T S_{k+c}\left(Q_3^U\right)^T \\ \left(Q_3^L\right)^T S_{k+c}\left(Q_1^U\right)^T & \left(Q_3^L\right)^T S_{k+c}\left(Q_2^U\right)^T & \left(Q_3^L\right)^T S_{k+c}\left(Q_3^U\right)^T\end{pmatrix}.$$
Removing the third block column perturbs σ_j by at most ‖S_{k+c}‖₂, and removing the third block row by at most another ‖S_{k+c}‖₂:
$$\sigma_j(A) \le \sqrt{\sigma_j^2\begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T & \left(Q_1^L\right)^T S_k\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T S_k\left(Q_1^U\right)^T & \left(Q_2^L\right)^T S_k\left(Q_2^U\right)^T \\ \left(Q_3^L\right)^T S_{k+c}\left(Q_1^U\right)^T & \left(Q_3^L\right)^T S_{k+c}\left(Q_2^U\right)^T\end{pmatrix} + \left\|S_{k+c}\right\|_2^2} \le \sqrt{\sigma_j^2\begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T & \left(Q_1^L\right)^T S_k\left(Q_2^U\right)^T \\ \left(Q_2^L\right)^T S_k\left(Q_1^U\right)^T & \left(Q_2^L\right)^T S_k\left(Q_2^U\right)^T\end{pmatrix} + 2\left\|S_{k+c}\right\|_2^2}.$$
Removing the second block column and then the second block row similarly costs at most ‖S_k‖₂ each:
$$\sigma_j(A) \le \sqrt{\sigma_j^2\begin{pmatrix} \left(Q_1^L\right)^T A\left(Q_1^U\right)^T \\ \left(Q_2^L\right)^T S_k\left(Q_1^U\right)^T\end{pmatrix} + \left\|S_k\right\|_2^2 + 2\left\|S_{k+c}\right\|_2^2} \le \sqrt{\sigma_j^2\left(\left(Q_1^L\right)^T A\left(Q_1^U\right)^T\right) + 2\left\|S_k\right\|_2^2 + 2\left\|S_{k+c}\right\|_2^2}.$$
Continuing:
$$\begin{aligned}
\sigma_j(A) &\le \sqrt{\sigma_j^2\left(R_{11}^L L_{11}^U + R_{12}^L L_{21}^U + R_{13}^L L_{31}^U\right) + 2\left\|S_k\right\|_2^2 + 2\left\|S_{k+c}\right\|_2^2}\\
&\le \sqrt{\left(\sigma_j\left(R_{11}^L L_{11}^U + R_{12}^L L_{21}^U\right) + \left\|R_{13}^L L_{31}^U\right\|_2\right)^2 + 2\left\|S_k\right\|_2^2 + 2\left\|S_{k+c}\right\|_2^2}\\
&\le \sqrt{\left(\sigma_j\left(R_{11}^L L_{11}^U + R_{12}^L L_{21}^U\right) + \left\|S_{k+c}\right\|_2\right)^2 + 2\left\|S_k\right\|_2^2 + 2\left\|S_{k+c}\right\|_2^2}\\
&= \sqrt{\sigma_j^2\left(R_{11}^L L_{11}^U + R_{12}^L L_{21}^U\right) + 2\sigma_j\left(R_{11}^L L_{11}^U + R_{12}^L L_{21}^U\right)\left\|S_{k+c}\right\|_2 + 2\left\|S_k\right\|_2^2 + 3\left\|S_{k+c}\right\|_2^2}\\
&= \sqrt{\sigma_j^2\left(LMU\right) + 2\sigma_j\left(LMU\right)\left\|S_{k+c}\right\|_2 + 2\left\|S_k\right\|_2^2 + 3\left\|S_{k+c}\right\|_2^2}.
\end{aligned}$$
The result follows.
Theorem 17. (Approximate CUR upper bound) Under the assumptions of Theorem 16:
$$\sigma_j\left(LMU\right) \le \sigma_j(A)\left(1 + \frac{\left\|S_{k+c}\right\|_2}{\sigma_j(A)}\right).$$
Proof.
$$\sigma_j\left(LMU\right) \le \sigma_j\left(Q_1^L\left(Q_1^L\right)^T L_{k+c}U_{k+c}\left(Q_1^U\right)^T Q_1^U\right) \le \sigma_j\left(L_{k+c}U_{k+c}\right) \le \sigma_j(A)\left(1 + \frac{\left\|S_{k+c}\right\|_2}{\sigma_j(A)}\right),$$
where we have used Theorem 9.
Theorem 18.
$$\left\|LMU\right\|_F \ge \frac{\|A\|_F}{\sqrt{1 + \frac{\left\|S_k\right\|_F^2 + \left\|S_{k+c}\right\|_F^2}{\left\|LMU\right\|_F^2}}}.$$
Proof. Using Theorem 16 in the same way that Theorem 13 is based on Theorem 12 yields
$$\left\|LMU\right\|_F \ge \frac{\|A\|_F}{\sqrt{1 + \frac{\left\|S_k\right\|_F^2 + 2\left\|S_{k+c}\right\|_F^2}{\left\|LMU\right\|_F^2}}}.$$
If the leading submatrix is separated into its components when the components of S_k and S_{k+c} are separated out, then the factor of 2 can be dropped.
Theorem 18 can also be simplified by solving for ‖LMU‖_F, as other theorems above have done. Next, an alternative CUR bound is presented which can be stronger or weaker than the bound in Theorem 12.
Theorem 19. For a general LU factorization,
$$\sigma_j\left(LMU\right) \ge \sigma_j\left(A_{11}\right) - \frac{1}{4}\|S\|_2. \qquad (5.6)$$
Proof. Let $N_L = L_{21}L_{11}^{-1}$ and $N_U = U_{11}^{-1}U_{12}$. Then
$$L\left(L\right)^{\dagger} = \begin{pmatrix} I \\ N_L\end{pmatrix}\left(I + N_L^T N_L\right)^{-1}\begin{pmatrix} I & N_L^T\end{pmatrix} \quad\text{and}\quad \left(U\right)^{\dagger}U = \begin{pmatrix} I \\ N_U^T\end{pmatrix}\left(I + N_U N_U^T\right)^{-1}\begin{pmatrix} I & N_U\end{pmatrix}.$$
Then
$$L\left(L\right)^{\dagger}C\left(U\right)^{\dagger}U = \begin{pmatrix} I \\ N_L\end{pmatrix}\left(I + N_L^T N_L\right)^{-1}N_L^T S N_U^T\left(I + N_U N_U^T\right)^{-1}\begin{pmatrix} I & N_U\end{pmatrix}.$$
Note that
$$\sigma_j\left(LU + \begin{pmatrix} I \\ N_L\end{pmatrix}\left(I + N_L^T N_L\right)^{-1}N_L^T S N_U^T\left(I + N_U N_U^T\right)^{-1}\begin{pmatrix} I & N_U\end{pmatrix}\right) \ge \sigma_j\left(A_{11} + \left(I + N_L^T N_L\right)^{-1}N_L^T S N_U^T\left(I + N_U N_U^T\right)^{-1}\right)$$
because the singular values of a submatrix are bounded by the corresponding singular values of the whole matrix. Let $\sigma_1^{N_L}$ and $\sigma_1^{N_U}$ denote the leading singular values of N_L and N_U, and note that
$$\frac{\sigma_1^{N_L}}{1 + \left(\sigma_1^{N_L}\right)^2},\; \frac{\sigma_1^{N_U}}{1 + \left(\sigma_1^{N_U}\right)^2} \le \frac{1}{2}.$$
Then
$$\begin{aligned}
\sigma_j\left(LMU\right) &= \sigma_j\left(L\left(L\right)^{\dagger}A\left(U\right)^{\dagger}U\right) = \sigma_j\left(L\left(L\right)^{\dagger}\left(LU + C\right)\left(U\right)^{\dagger}U\right) = \sigma_j\left(LU + L\left(L\right)^{\dagger}C\left(U\right)^{\dagger}U\right)\\
&\ge \sigma_j\left(A_{11} + \left(I + N_L^T N_L\right)^{-1}N_L^T S N_U^T\left(I + N_U N_U^T\right)^{-1}\right)\\
&\ge \sigma_j\left(A_{11}\right) - \left\|\left(I + N_L^T N_L\right)^{-1}N_L^T\right\|_2\|S\|_2\left\|N_U^T\left(I + N_U N_U^T\right)^{-1}\right\|_2\\
&= \sigma_j\left(A_{11}\right) - \frac{\sigma_1^{N_L}}{1 + \left(\sigma_1^{N_L}\right)^2}\,\|S\|_2\,\frac{\sigma_1^{N_U}}{1 + \left(\sigma_1^{N_U}\right)^2}\\
&\ge \sigma_j\left(A_{11}\right) - \frac{1}{4}\|S\|_2.
\end{aligned}$$
To see that Theorems 12 and 19 are not redundant, consider the matrices
$$A_1 = \begin{pmatrix} -1.0722 & 1.4367 & -1.2078 \\ 0.9610 & -1.9609 & 2.9080 \\ 0.1240 & -0.1977 & 0.8252\end{pmatrix} \quad\text{and}\quad A_2 = \begin{pmatrix} -0.1303 & 0.8620 & -0.8487 \\ 0.1837 & -1.3617 & -0.3349 \\ -0.4762 & 0.4550 & 0.5528\end{pmatrix}.$$
Perform rank-2 truncated LU factorizations on each without pivoting, and let (·)₁₁ denote the principal 2-by-2 submatrix. Both matrices satisfy the condition σ₁²(A) > 2‖S‖₂² required by Theorem 12. However,
$$\sigma_1\left(\left(A_1\right)_{11}\right) - \frac{1}{4}\left\|S_1\right\|_2 < \sqrt{\sigma_1^2\left(A_1\right) - 2\left\|S_1\right\|_2^2},$$
$$\sigma_1\left(\left(A_2\right)_{11}\right) - \frac{1}{4}\left\|S_2\right\|_2 > \sqrt{\sigma_1^2\left(A_2\right) - 2\left\|S_2\right\|_2^2}.$$
Theorem 19 could be refined for SRLU in two ways: first, an upper bound for ‖S‖ is found in Theorem 8. Observing that the singular values of a submatrix are bounded above by the singular values of the whole matrix, and using the bound on ‖S‖ for SRLU, an SRLU-specific bound becomes
$$\sigma_j\left(LMU\right) \ge \sigma_j\left(A_{11}\right)\left(1 - \frac{\gamma}{4}\frac{\sigma_{k+1}(A)}{\sigma_j(A)}\right),$$
where γ has the same bound as in Theorem 8. Second, for j = 1, a lower bound on σ_j(A₁₁) can be extracted from the proof of Theorem 7 by noting that $\sigma_1\left(A_{11}\right) \ge \left|\det\left(A_{11}\right)\right|^{1/k}$.
Comparison of Singular Value Bounds
To see how the different factorizations explored in the previous sections capture information from a data matrix A, consider the factorization in Theorem 15 and note that
$$\sigma_j(A) = \sigma_j\left(\left(Q^L\right)^T A\left(Q^U\right)^T\right) = \sigma_j\begin{pmatrix} R_{11}^L L_{11}^U + R_{12}^L L_{21}^U + R_{13}^L L_{31}^U & R_{12}^L L_{22}^U + R_{13}^L L_{32}^U & R_{13}^L L_{33}^U \\ R_{22}^L L_{21}^U + R_{23}^L L_{31}^U & R_{22}^L L_{22}^U + R_{23}^L L_{32}^U & R_{23}^L L_{33}^U \\ R_{33}^L L_{31}^U & R_{33}^L L_{32}^U & R_{33}^L L_{33}^U\end{pmatrix}.$$
Then the aforementioned factorizations can be represented as:
$$\begin{aligned}
\sigma_j\left(L_kU_k\right) &= \sigma_j\left(R_{11}^L L_{11}^U\right)\\
\sigma_j\left(L_k\widetilde{M}U_k\right) &= \sigma_j\left(R_{11}^L L_{11}^U + R_{12}^L L_{21}^U\right)\\
\sigma_j\left(L_{k+c}U_{k+c}\right) &= \sigma_j\begin{pmatrix} R_{11}^L L_{11}^U + R_{12}^L L_{21}^U & R_{12}^L L_{22}^U \\ R_{22}^L L_{21}^U & R_{22}^L L_{22}^U\end{pmatrix}\\
\sigma_j\left(L_kMU_k\right) &= \sigma_j\left(R_{11}^L L_{11}^U + R_{12}^L L_{21}^U + R_{13}^L L_{31}^U\right),
\end{aligned}$$
where $\widetilde{M}$ denotes the approximate middle factor of (4.2) and M the exact middle factor $L_k^{\dagger}AU_k^{\dagger}$.
Only the last equality captures the singular value of a submatrix (although not a submatrix of A), and so only the CUR approximation possesses monotonicity: as the rank of the approximation increases, the error cannot increase.
Consider these approximations relative to submatrices of A:
$$\sigma_j\left(A_{11}\right) \le \sigma_j\left(R_{11}^L L_{11}^U\right),$$
$$\sigma_j\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22}\end{pmatrix} \le \sigma_j\begin{pmatrix} R_{11}^L L_{11}^U + R_{12}^L L_{21}^U & R_{12}^L L_{22}^U \\ R_{22}^L L_{21}^U & R_{22}^L L_{22}^U\end{pmatrix}.$$
Note that the expressions here have been simplified by recognizing that Q^L R^L is lower triangular and L^U Q^U is upper triangular.
The LU Growth Factor
To understand the stability of the LU decomposition, we begin with a useful definition.
Definition 4. Let A ∈ R^{n×n}, and let A^{(i)} denote the result of the LU decomposition, performed in place, after i steps. The growth factor of an LU decomposition is defined as
$$\rho_n = \frac{\max_{i,j,k}\left|A^{(k)}_{i,j}\right|}{\max_{i,j}\left|A_{i,j}\right|}.$$
The growth factor can be related to the backwards stability of the LU decomposition through the following bound [34]:
$$\|\delta A\|_{\infty} \le 3\rho_n n^3\varepsilon\|A\|_{\infty},$$
where δA is a matrix defined by A + δA = LU for the numerically calculated factorization LU ≈ A, and ε is machine precision.
    LU Variant                      Bound on Element Growth Factor ρ_n
    No pivoting                     ∞
    Partial pivoting                $2^{n-1}$ [105]
    Threshold pivoting with u < 1   $\left(1 + \frac{1}{u}\right)^{n-1}$ [38]
    Rook pivoting                   $1.5\, n^{\frac{3}{4}\log(n)}$ [42]
    Complete pivoting               $n^{\frac{1}{2}}\left(2\cdot 3^{\frac{1}{2}}\cdots n^{\frac{1}{n-1}}\right)^{\frac{1}{2}} \sim c\, n^{\frac{1}{2}} n^{\frac{1}{4}\log(n)}$ [105]
    SRLU                            $\alpha^{1+\left(1+\frac{1}{2}+\cdots+\frac{1}{n/b-1}\right)}\cdot\left(\frac{n}{b}\right)^{\frac{1}{2}}\left(2^{1}3^{\frac{1}{2}}\cdots\left(\frac{n}{b}\right)^{\frac{1}{n/b-1}}\right)^{\frac{1}{2}} 2^{b-1}$

Table 5.1: Bounds for Growth Factors of LU Variants [53]
Table 5.1 compares the growth factor for several variations of the LU decomposition. [105] also shows that the bound for partial pivoting is attainable, while the bound for complete pivoting is not. Because the backwards error bound for LU with partial pivoting is exponential in n, this form of LU is not stable for all matrices. In practice, nevertheless, this bound is conservative, and LU with partial pivoting is stable for most data matrices [34]. Next, we establish bounds on element growth for a truncated LU factorization.
Theorem 20. (Deterministic LUCP Element Growth Factor Bound) If SRLU is performed on a matrix A ∈ R^{(sb)×(sb)} in blocks of size b, with b columns chosen by RRQR from the full Schur complement and then b rows chosen by LUPP, then the element growth factor is bounded by
$$\rho_{detSRLU} \le \alpha^{1+\left(1+\frac{1}{2}+\cdots+\frac{1}{n/b-1}\right)}\cdot\left(\frac{n}{b}\right)^{\frac{1}{2}}\left(2^{1}3^{\frac{1}{2}}\cdots\left(\frac{n}{b}\right)^{\frac{1}{n/b-1}}\right)^{\frac{1}{2}} 2^{b-1}. \qquad (5.7)$$
Proof. Let A^{(rb)} denote the Schur complement when rb rows and columns remain. Let p_i denote the absolute value of the pivot element chosen with i rows and columns remaining, and let $\bar{p}_r := p_{rb}\cdot p_{rb-1}\cdots p_{(r-1)b+1}$ for 1 ≤ r ≤ s. Note that
$$\left|\det\left(A^{(rb)}\right)\right| = \prod_{i=1}^{r}\bar{p}_i. \qquad (5.8)$$
Using RRQR, let
$$A^{(rb)}\Pi = Q\begin{pmatrix} R_{11}^{rb} & R_{12}^{rb} \\ & R_{22}^{rb}\end{pmatrix}.$$
For 1 ≤ i ≤ b,
$$\alpha\cdot\sigma_i\left(R_{11}^{rb}\right) \ge \sigma_i\left(A^{(rb)}\right).$$
Let $L_{11}^{rb}U_{11}^{rb} = Q\begin{pmatrix} R_{11}^{rb} \\ 0\end{pmatrix}$ be the LU factorization with partial pivoting of the b columns chosen by RRQR. Then
$$\det\left(\left(L_{11}^{rb}\right)^T L_{11}^{rb}\right) \le \left(\frac{\operatorname{trace}\left(\left(L_{11}^{rb}\right)^T L_{11}^{rb}\right)}{b}\right)^{b} \le r^{b},$$
where we have bounded the trace using the property that the entries of $L_{11}^{rb}$ are bounded in magnitude by 1; the first inequality is simply the AM-GM inequality applied to the eigenvalues. Then
$$\prod_{i=1}^{b}\sigma_i\left(Q\begin{pmatrix} R_{11}^{rb} \\ 0\end{pmatrix}\right) = \prod_{i=1}^{b}\sigma_i\left(R_{11}^{rb}\right) = \sqrt{\det\left(\left(R_{11}^{rb}\right)^T R_{11}^{rb}\right)} = \sqrt{\det\left(U_{11}^{rb}\right)\det\left(\left(L_{11}^{rb}\right)^T L_{11}^{rb}\right)\det\left(U_{11}^{rb}\right)} \le \left(\prod_{i=0}^{b-1}p_{rb-i}\right) r^{\frac{b}{2}} = \bar{p}_r\, r^{\frac{b}{2}}.$$
Consequently,
$$\left|\det\left(A^{(rb)}\right)\right| = \prod_{i=1}^{rb}\sigma_i\left(A^{(rb)}\right) \le \left(\prod_{i=1}^{b}\sigma_i\left(A^{(rb)}\right)\right)^{r} \le \left(\alpha^{b}\, r^{\frac{b}{2}}\,\bar{p}_r\right)^{r}. \qquad (5.9)$$
Much of the following argument is modeled after the bound derived by Wilkinson in [105]. Define
$$q_i := \log \bar{p}_i.$$
Taking logarithms of lines (5.9) and (5.8):
$$\sum_{i=1}^{r-1} q_i \le \frac{rb}{2}\log\left(r\alpha^2\right) + (r-1)\,q_r \qquad (5.10)$$
and
$$\sum_{i=1}^{s} q_i = \log\left|\det\left(A^{(sb)}\right)\right|. \qquad (5.11)$$
For r = 2, ..., s − 1, note that
$$\frac{1}{r(r-1)} + \frac{1}{(r+1)r} + \cdots + \frac{1}{(s-1)(s-2)} + \frac{1}{s-1} = \frac{1}{r-1}.$$
Let $h(k) := 1 + \frac{1}{2} + \cdots + \frac{1}{k}$. Summing bound (5.10) divided by r(r − 1) for r = 2, ..., s − 1 and adding (5.11) divided by s − 1 yields
$$\begin{aligned}
q_1 + \frac{q_s}{s-1} &\le \sum_{r=2}^{s-1}\left[b\log\left(\alpha^{\frac{1}{r-1}}\right) + \frac{b}{2}\log\left(r^{\frac{1}{r-1}}\right)\right] + \frac{1}{s-1}\log\left|\det A^{(sb)}\right|\\
&= \frac{b}{2}\log\left(\alpha^{2h(s-2)}\cdot 2^{1}3^{\frac{1}{2}}\cdots(s-1)^{\frac{1}{s-2}}\right) + \frac{1}{s-1}\log\left|\det A^{(sb)}\right|\\
&\le \frac{b}{2}\log\left(\alpha^{2h(s-2)}\cdot 2^{1}3^{\frac{1}{2}}\cdots(s-1)^{\frac{1}{s-2}}\right) + \frac{s}{s-1}\,q_s + \frac{sb}{2(s-1)}\log\left(s\alpha^2\right).
\end{aligned}$$
Let $g(k) := \left(\alpha^{2h(k-1)}\cdot 2^{1}3^{\frac{1}{2}}\cdots k^{\frac{1}{k-1}}\right)^{\frac{b}{2}}$. Then
$$q_1 - q_s \le \log\left(g(s)\right) + \frac{b}{2}\log\left(s\alpha^2\right),$$
and so
$$\frac{\bar{p}_1}{\bar{p}_s} \le s^{\frac{b}{2}}\,\alpha^{b}\,g(s).$$
Note that within a block, the LUPP growth factor bound applies, and so
$$\bar{p}_s \le 2^{\frac{b(b-1)}{2}}\, p_{sb}^{\,b}.$$
Let $p_{\min} = \min_{1\le i\le b} p_i$. Then, because LUPP is performed on the last block:
$$p_1^{b} \le 2^{\frac{b(b-1)}{2}}\, p_{\min}^{b} \le 2^{\frac{b(b-1)}{2}}\,\bar{p}_1.$$
Hence
$$\frac{p_1}{p_n} \le \left(2^{\frac{b(b-1)}{2}+\frac{b(b-1)}{2}}\, s^{\frac{b}{2}}\,\alpha^{b}\, g(s)\right)^{\frac{1}{b}} = 2^{b-1} s^{\frac{1}{2}}\alpha\left(g(s)\right)^{\frac{1}{b}} = \alpha\left(\frac{n}{b}\right)^{\frac{1}{2}}\left(\alpha^{2\left(1+\frac{1}{2}+\cdots+\frac{1}{n/b-1}\right)}\cdot 2^{1}3^{\frac{1}{2}}\cdots\left(\frac{n}{b}\right)^{\frac{1}{n/b-1}}\right)^{\frac{1}{2}} 2^{b-1}.$$
Bound (5.7) is a combination of the growth factor bounds of LUCP and LUPP, because each block iteration of the algorithm performs both row and column swaps, but within each block LUPP is performed. Indeed, the term $\left(\frac{n}{b}\right)^{\frac{1}{2}}\left(2^{1}3^{\frac{1}{2}}\cdots\left(\frac{n}{b}\right)^{\frac{1}{n/b-1}}\right)^{\frac{1}{2}}$ is due to the n/b blocks where both row and column pivoting is performed, and the term $2^{b-1}$ reflects the LUPP growth bound from within each block factorization. The term $\alpha^{1+\left(1+\frac{1}{2}+\cdots+\frac{1}{n/b-1}\right)}$ stems from the quality of the QR factorization over the n/b blocks where QR is performed for column selection. This term will change if a different column selection subroutine is implemented.
Although not explicitly derived, equality is not achievable in Theorem 20. The component of the bound that corresponds to LUCP leads to the lack of sharpness, which is proved in [105].
5.2 Comparison of SRLU Factorizations with RRLU and RRQR Factorizations
In previous work on rank-revealing factorizations, such as [80], a bound on the quality of A₁₁ is established, while in this work bounds on LU are established. Because A₁₁ is a submatrix of A, its singular values are bounded above by the corresponding singular values of A. Thus, guaranteeing that the singular values of this submatrix are close to those of the original matrix implies that the submatrix captures most of the "mass" of the original matrix. Note, however, that there is no clear interpretation of how A₁₁ can approximate the original matrix, as they are different sizes. In this work, which seeks to produce high-quality low-rank approximations, the focus above is on the quality of the full approximation LU, and not on submatrices. Nevertheless, lower bounds on the singular values of the submatrix A₁₁ can be obtained using the analysis above in Theorem 7 by observing that the determinant is the product of the singular values.
5.3 Fast SRLU
Note that the test (5.1) requires knowledge of the largest entry of the Schur complement. The efficiency of SRLU, nevertheless, is largely due to avoiding calculating the Schur complement. A weaker but faster SRLU factorization can be calculated by avoiding computation of the Schur complement during Spectrum-Revealing Pivoting. A faster correction strategy is to find the column of R, the random projection of the Schur complement, with largest norm, and then pick the largest entry in the corresponding column of the Schur complement. Calculating the column norms of R, updating a single column of S, and finding the largest entry in that column are significantly faster calculations than updating the whole Schur complement.
Chapter 6
Numerical Experiments with SRLU
Although it is not possible to test every aspect of SRLU, several of the most important features of SRLU are examined experimentally in this chapter.
6.1 Speed and Accuracy Tests
In Figure 6.1, the accuracy of our method is compared to the accuracy of the truncated SVD. Note that SRLU did not perform any swaps in these experiments. "CUR" is the CUR version of the output of SRLU. Note that both methods exhibit a convergence rate similar to that of the truncated SVD (TSVD), and so only a constant amount of extra work is needed to achieve the same accuracy. When the singular values decay slowly, the CUR decomposition provides a greater accuracy boost. In Figure 6.2, the runtime of SRLU is compared to that of the truncated SVD, as well as Subspace Iteration [50]. Note that for Subspace Iteration, we choose iteration parameter q = 0 and do not measure the time of applying the random projection, in acknowledgement that fast methods exist to apply a random projection to a data matrix. All numeric experiments were run on NERSC's Edison. For timing experiments, the truncated SVD is calculated with PROPACK.
6.2 Efficiency Tests
As a follow-up to the benchmarking tests in the previous section, the performance of SRLU is now benchmarked against hardware parameters. Because the most computationally expensive steps in SRLU are all matrix-matrix multiplication, a finely tuned operation, SRLU has the potential to utilize near-peak hardware performance when implemented carefully (e.g., using communication-avoiding blocking).
The performance of TRLUCP is examined in Figure 6.3 by comparing the approximate number of floating point operations, 2mnp + (m + n)k², against the time of calculation for square matrices of various sizes. Indeed, TRLUCP appears to scale linearly with the number of floating point operations. For a matrix of size 10,000 by 10,000, TRLUCP is 86% efficient
Figure 6.1: Accuracy experiment on random 1000-by-1000 matrices with different rates of spectral decay.
Figure 6.2: Time experiment on random matrices of varying sizes, and a time experiment on a 1000-by-1000 matrix with varying truncation ranks.
on NERSC's Edison [40]. The result of LAPACK's DGEMM is included as well. Both TRLUCP and DGEMM are truncated to rank 300.
6.3 Towards Feature Selection
An image processing example is now presented to illustrate the benefit of highlighting important rows and columns. In Figure 6.4 an image is compressed to a rank-50 approximation using SRLU. Note that the rows and columns chosen overlap with the astronaut and the planet, implying that minimal storage is needed to capture the black background, which composes approximately two thirds of the image. While this result cannot be called feature selection per se, the rows and columns selected highlight where to look for features: rows and/or columns are selected in a higher density around the astronaut, the curvature of the planet, and the storm front on the planet.
Figure 6.3: Efficiency experiment on random matrices of varying sizes compared to peak hardware performance.
6.4 Sparsity Preservation Tests
The SRLU factorization is tested on sparse, unsymmetric matrices from [33] in Figure 6.5. Figure 6.6 shows the sparsity patterns of the factors of an SRLU factorization of a sparse data matrix representing a circuit simulation (oscil dcop), as well as a full LU decomposition of the data. Note that the truncated LU decomposition preserves the sparsity of the data initially, but the full LU decomposition becomes dense.
6.5 Online Data Processing
Online SRLU is tested here on the Enron email corpus [70]. The documents were initially reverse-sorted by the usage of the most common word, and then reverse-sorted by the second most, and this process was repeated for the five most common words (the top five words were used significantly more than any other), so that the most common words occur most at the end of the corpus. The cumulative frequencies of the top 5 words in the Enron email corpus (after reordering) are plotted in Figure 6.7.
The data contains 39,861 documents and 28,102 words, and an initial SRLU factorization of rank 20 was performed on the first 30K documents. The initial factorization contained none
Figure 6.4: Image processing example. The original image [81], a rank-50 approximation with SRLU, and a highlight of the rows and columns selected by SRLU.
(a) Original circuit simulation data matrix. (b) Depiction of corresponding circuit.
Figure 6.5: Circuit Simulation Data.
of the top five words, but, after adding the remaining documents and updating, the top three were included in the approximation. To understand why the last two may have been missed, the correlation matrix of the top 5 words is computed:
                 power   company  energy  market  california
    power        1       0.40     0.81    0.51    0.78
    company      0.40    1        0.42    0.57    0.28
    energy       0.81    0.42     1       0.51    0.78
    market       0.51    0.57     0.51    1       0.48
    california   0.78    0.23     0.78    0.48    1
The fourth and fifth words, 'market' and 'california', have high correlation with at least two of the three top words, and so their inclusion may be redundant in a low-rank approximation.
(a) The L and U matrices of a rank 43 SRLU factorization (with updated Schur complement). The green entries compose a low-rank approximation of the data. Red entries are the additional data needed for an exact factorization.

(b) The L and U matrices of a full LU factorization.

Figure 6.6: Sparse Data Processing Example with Circuit Simulation Data.
6.6 Pathological Test Matrix
Here, we test SRLU on the Devil's Stairs [98], a test matrix with a singular value rate of decay that may be difficult for a low-rank approximation algorithm to capture. The Devil's Stairs matrix used here is a randomly generated 100-by-100 matrix with the first 20 singular values equal to 1, after which each following set of 20 singular values has 1/10th the value of the preceding 20 singular values. The singular values of various rank approximations with SRLU are plotted against the singular values of this test matrix in Figure 6.8. The SRLU factorization does not appear to commit any serious errors for any truncation rank. Some singular values are overestimated and some are underestimated, but all appear to converge correctly as the target rank increases, with the leading singular values converging rapidly, and the smaller singular values with a bounded relative error.
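The construction just described is easy to reproduce; the sketch below generates a matrix with the stated singular value profile, assuming random orthogonal factors (the exact test matrix of [98] may be defined differently).

```python
import numpy as np

def devils_stairs(n=100, step=20, decay=0.1, seed=0):
    """Random n x n matrix whose singular values drop by a factor `decay` every `step` values."""
    rng = np.random.default_rng(seed)
    sigma = decay ** (np.arange(n) // step)              # 1,...,1, 0.1,...,0.1, 0.01,...
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))     # random orthogonal factors
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return U @ np.diag(sigma) @ V.T

A = devils_stairs()
print(np.linalg.svd(A, compute_uv=False)[:25])           # shows the first "stair" and the drop
```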
Figure 6.7: The cumulative uses of the top five most commonly used words in the Enron email corpus after reordering.
6.7 Image Compression Example
Although sophisticated algorithms exist specifically for high quality image compression, an experiment is explored here to demonstrate how oversampling can make up for quality that may be lost, relative to SVD approximations, when using methods such as SRLU. Figure 6.9 shows an experiment where various factorizations are used to compress an image.
The original image [76], of rank 480, is compressed using the truncated SVD to rank 100, a high quality approximation. A rank-100 approximation with TRLUCP exhibits a clear image, although with reduced quality. A rank-200 approximation with TRLUCP appears to match the quality of the previous SVD approximation. For comparison, a rank-150 approximation with QRCP also appears to match the quality of the SVD compression.
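For reference, the truncated-SVD side of this comparison can be reproduced in a few lines; a minimal sketch follows, assuming a grayscale image stored as a 2-D array (the TRLUCP and QRCP approximations would replace the factorization step).

```python
import numpy as np

def svd_compress(img, k):
    """Rank-k truncated SVD approximation of a 2-D image array."""
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# img: hypothetical grayscale image as a float array, e.g. loaded with
# matplotlib.image.imread and averaged down to a single channel.
img = np.random.rand(480, 640)        # placeholder in lieu of the photograph in [76]
approx = svd_compress(img, 100)
rel_err = np.linalg.norm(img - approx) / np.linalg.norm(img)
print(f"relative Frobenius error of the rank-100 SVD: {rel_err:.3f}")
```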
6.8 Testing Quality Controls
Tightness of the Theoretical Bounds
Here, we test the dimension-dependent constants that appear in the theorems in the theoretical analysis of SRLU. Although the data matrix does affect the size of the constant seen in practice, Table 6.1 summarizes the maximum constants seen with 1,000-by-1,000 random matrices with spectra decaying at rates 0.9, 0.8, and 0.7. In the case of spectral bounds, the maximums were taken over the top 10 singular values. Each experiment was performed 10 times and averaged. The truncation rank is k = 100.
The Sensitivity of SRLU to the Tuning Parameter f
Here, we test the average number of swaps needed as the tuning parameter f becomes small. In practice, f ≈ 10 is recommended. In many real-world datasets, no swaps will be needed. Table 6.2 shows the average number of swaps needed for small values of f for a random 1000-by-1000 matrix with a singular value decay rate of 0.9. The truncation rank is k = 100.

(a) Rank-10 approximation. (b) Rank-30 approximation.
(c) Rank-50 approximation. (d) Rank-90 approximation.

Figure 6.8: Singular values of SRLU factorizations of various ranks (red) versus the singular values of the Devil's Stairs matrix (blue).
Table 6.1: Mean values of the constants from the theorems presented in this work, for various random matrices. Constants for spectral theorems are averaged over the top 10 singular values. TRLUCP was used, and no swaps were needed, so SRLU results match TRLUCP.
Constant   Theorem          Mean
γ          8 (1st ineq.)    7.97
γ          8 (2nd ineq.)    0.04
τ          9                0.16
γ          11               1.57
ω          11               0.24
γ          12               0.40
Table 6.2: Average number of swaps needed on a random 1000-by-1000 matrix for various small values of f.
f      Average number of swaps
2.1    0.7
1.8    2.0
1.5    4.4
1.2    8.8
(a) Original image and a rank-100 truncated SVD approximation.
(b) A rank-100 approximation with TRLUCP and a rank-200 approximation with TRLUCP.
(c) A rank-150 approximation with QRCP.
Figure 6.9: Image compression experiment with various factorizations. From left to right: James Wilkinson, Wallace Givens, George Forsythe, Alston Householder, Peter Henrici, and Friedrich Bauer. (Gatlinburg, Tennessee, 1964.)
Part III
Unweighted Graph Sparsification
Chapter 7
Unweighted Column Selection
Part III introduces a new low-rank approximation algorithm. This algorithm, called Unweighted Column Selection, calculates a sparse graph (called a sparsifier) that accurately approximates the original graph in some sense. A previous result, [12], uses an algorithm that is based on a "purely linear algebraic theorem". Similarly, the algorithm presented here is built on linear algebra theory.
7.1 Introduction
Spectral graph sparsification has emerged as a powerful tool in the analysis of large-scale networks by reducing the overall number of edges, while maintaining a comparable graph Laplacian matrix. In this chapter, we present an efficient algorithm for the construction of a new type of spectral sparsifier, the unweighted spectral sparsifier. Given a general undirected and unweighted graph G = (V, E), and an integer ℓ < |E| (the number of edges in E), we compute an unweighted graph H = (V, F) with F ⊂ E and |F| = ℓ such that for every x ∈ R^V

\frac{x^T L_G x}{\kappa} \le x^T L_H x \le x^T L_G x,

where L_G and L_H are the Laplacian matrices for G and H, respectively, and κ ≥ 1 is a slowly-varying function of |V|, |E| and ℓ. This work addresses the open question of the existence of unweighted graph sparsifiers for unweighted graphs [12]. Additionally, our algorithm efficiently computes unweighted graph sparsifiers for weighted graphs, leading to sparsified graphs that retain the weights of the original graphs. A version of this chapter appears in [7].
7.2 Background
Graph sparsification seeks to approximate a graph G with a graph H on the same vertices, but with fewer edges. Called a sparsifier, H requires less storage than G and serves as a proxy for G in computations where G is too large, as evidenced by the effectiveness of sparsifiers in wide-ranging applications of graphs, including social networks, conductance, electrical networks, and similarity [26, 28, 75, 92]. In some applications, graph sparsification improves the quality of the graph, such as in the design of information networks and the hardwiring of processors and memory in parallel computers [13, 66]. Sparsifiers have also been utilized to find approximate solutions of symmetric, diagonally-dominant linear systems in nearly-linear time [13, 92, 93, 94].
Recent work on graph sparsification includes [2, 63, 91, 92, 95]. Batson, Spielman, and Srivastava [12] prove that for every graph there exists a spectral sparsifier where the number of edges is linear in the number of vertices. They further provide a polynomial-time, deterministic algorithm for the sparsification of weighted graphs, which could produce weights that differ greatly from the weights of the original graph. The work of Avron and Boutsidis [9] explores unweighted sparsification in the context of finding low-stretch spanning trees. They provide a greedy edge removal algorithm and a volume sampling algorithm with theoretical guarantees. In comparison, our novel greedy edge selection algorithm has tighter theoretical bounds for both spanning trees and in the more general context of unweighted graph sparsification.
Our work introduces a deterministic, greedy edge selection algorithm to calculate sparsifiers for weighted and unweighted graphs. Our algorithm selects a subset of edges for the sparse approximation H, without assigning or altering weights. While the Dual Set algorithms of [9, 12, 17] reweight all selected edges for computing weighted sparsifiers, our algorithm produces unweighted sparsifiers for an unweighted input graph, and can create a weighted sparsifier for a weighted input graph by assigning the original edge weights to the sparsifier. Hence our concept of unweighted sparsification applies to both unweighted and weighted graphs. To formalize:
Definition 5. Let G = (V, E, w) be a given graph¹. We define an unweighted sparsification of G to be any graph of the form H = (V, F, w ∘ I_F), where

I_F(e) = \begin{cases} 1, & \text{if } e \in F \\ 0, & \text{otherwise} \end{cases}

is the indicator function and ∘ is the Hadamard product, i.e.

(w ∘ I_F)(e) = \begin{cases} w_e, & \text{if } e = (u, v) \in F \\ 0, & \text{otherwise.} \end{cases}
Several definitions have been proposed for the notion in which a sparsifier approximates a dense graph. Benczúr and Karger [14] introduced cut sparsification, where the sum of the weights of the edges of a cut dividing the set of vertices is approximately the same for the dense graph and the sparsifier. Spielman and Teng [95] proposed spectral sparsification, a generalization of cut sparsification, which seeks sparsifiers with a Laplacian matrix close to that of the input graph. We follow the work of [12, 95] and base our work on spectral sparsification, for which we now present a rigorous definition.

¹ Note that any unweighted graph G = (V, E) induces a weighted graph G = (V, E, w) where w_e = 1 if e = (u, v) ∈ E and w_e = 0 otherwise.
Given an undirected graph G = (V, E, w), define the signed edge-vertex incidence matrix B_G ∈ R^{E×V} as

(B_G)_{ej} = \begin{cases} -1, & \text{if } e = (u, v) \in E \text{ and } j = u \in V \\ 1, & \text{if } e = (u, v) \in E \text{ and } j = v \in V \\ 0, & \text{otherwise,} \end{cases}

where all edges are randomly assigned a direction, and e = (u, v) ∈ E is an edge from u to v. Define the diagonal weight matrix W_G ∈ R^{E×E} by

(W_G)_{ef} = \begin{cases} w_e, & \text{if } e = f \in E \\ 0, & \text{otherwise.} \end{cases}
The Laplacian of the graph is L_G = B_G^T W_G B_G. Note that

x^T L_G x = \sum_{(u,v) \in E} w_{(u,v)} (x_u - x_v)^2

for a vector x ∈ R^V. To compare Laplacians of graphs X and Y defined on the same set of nodes we denote

L_X ⪯ L_Y if and only if x^T L_X x ≤ x^T L_Y x, for all x.

Definition 6. The graph H is a κ-approximation of G if

\frac{1}{\kappa} L_G \preceq L_H \preceq L_G.
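To make the definitions above concrete, the short sketch below builds B_G, W_G, and L_G for a small, arbitrary weighted graph and checks the quadratic-form identity x^T L_G x = Σ_{(u,v)∈E} w_{(u,v)}(x_u − x_v)²; the edge list is a made-up example, not data from this chapter.

```python
import numpy as np

# Example graph on 4 vertices with weighted edges (u, v, w_e).
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.5), (0, 3, 0.5)]
n, m = 4, len(edges)

B = np.zeros((m, n))                      # signed edge-vertex incidence matrix
W = np.zeros((m, m))                      # diagonal weight matrix
for e, (u, v, w) in enumerate(edges):
    B[e, u], B[e, v] = -1.0, 1.0          # arbitrary orientation of each edge
    W[e, e] = w

L = B.T @ W @ B                           # graph Laplacian L_G = B^T W_G B

x = np.random.randn(n)
quad = sum(w * (x[u] - x[v]) ** 2 for (u, v, w) in edges)
assert np.isclose(x @ L @ x, quad)        # x^T L x equals the weighted sum of squared differences
```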
Because our unweighted sparsification algorithm does not change the weights of the edges kept in H, it is immediate that L_H ⪯ L_G:

Proposition 1. If H is an unweighted sparsification of G, then L_H ⪯ L_G.

Proof.

x^T L_H x = \sum_{(u,v) \in F} w_{(u,v)} (x_u - x_v)^2 \le \sum_{(u,v) \in F} w_{(u,v)} (x_u - x_v)^2 + \sum_{(u,v) \in E \setminus F} w_{(u,v)} (x_u - x_v)^2 = x^T L_G x

for all x ∈ R^V.
Our algorithm does not operate directly on the Laplacian matrix. Rather, we consider the SVD of W_G^{1/2} B_G:

W_G^{1/2} B_G = U_G^T \Sigma_G V_G, \qquad (7.1)

where Σ_G is a diagonal matrix containing all non-zero singular values of W_G^{1/2} B_G, and where U_G ∈ R^{n×m} is a row-orthonormal matrix, with n = |V| − r, and r being the number of connected components in G. For an unweighted graph, W_G is simply the identity matrix. U_G plays a similar role to that of the matrix V_{n×m} in [12] and the matrix Y in [9]. Our algorithm utilizes the column-orthogonality of U_G^T, highlighting the reason for not working directly with the Laplacian matrix. We note, nevertheless, that this algorithm can be adapted to any orthogonal decomposition of W_G^{1/2} B_G.
7.3 An Unweighted Column Selection Algorithm
Our algorithm selects edges for a sparsifier based on the columns u_i of U_G,

U_G = (u_1 \; u_2 \; \cdots \; u_m) \in R^{n \times m},

where m = |E| is the number of edges, and n = |V| − r, as above. Therefore, the edges of G that are included in the sparsifier are exactly the columns of U_G that our algorithm selects. Denote the number of edges kept as ℓ := |F|. Let Π_t denote the set of selected edges after t iterations.

We propose the following greedy algorithm for column selection on U_G. Initially set A_0 = 0_{n×n} and Π_0 = ∅, and choose a constant T > 0. At step t ≥ 0:
• Solve for the unique λ < λ_min(A_t) such that

\mathrm{tr}(A_t - \lambda I)^{-1} = T. \qquad (7.2)

• Solve for the unique λ̄ ∈ (λ, λ_min(A_t)) such that

(\bar\lambda - \lambda)\left(m - t + \sum_{j=1}^{n} \frac{1-\lambda_j}{\lambda_j - \lambda}\right) = \frac{\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)}}{\sum_{j=1}^{n} \frac{1}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)}}, \qquad (7.3)

where λ_j is the jth largest eigenvalue of the symmetric matrix A_t.

• Find an index i ∉ Π_t such that

\mathrm{tr}\left(A_t - \bar\lambda I + u_i u_i^T\right)^{-1} \le \mathrm{tr}(A_t - \lambda I)^{-1}. \qquad (7.4)

• Update A_t and Π_t.
While equations (7.2) and (7.3) are relatively straightforward to justify and solve, equation (7.4) requires careful consideration, and is the focus of much of Section 7.4. Note that equation (7.2) can be solved in O(n³) operations, equation (7.3) in O(n) operations, and equation (7.4) in O(n²m) operations. This last complexity count follows because testing the inequality scales with O(n²), and potentially all remaining indices must be tested. Thus the total complexity of selecting ℓ columns is O(ℓn²m).
While this procedure will work for any T > 0, we will show that an effective choice is

T = \bar{T}^{*}\left(1 + F(\bar{T}^{*})\right),

where

F(\bar{T}) = \frac{\ell\left(1 - \frac{n}{\bar{T}}\right)}{m - \frac{\ell-1}{2} + \bar{T} - n} - \frac{n}{\bar{T}},

and where \bar{T}^{*} is the maximizer of F(\bar{T}), given as

\bar{T}^{*} = \frac{n\left(m + \frac{\ell+1}{2} - n\right) + \sqrt{n\ell\left(m - \frac{\ell-1}{2}\right)\left(m + \frac{\ell+1}{2} - n\right)}}{\ell - n}.

Our spectral bounds are derived using this choice of T. We summarize this procedure in the Unweighted Column Selection algorithm.
Algorithm: Unweighted Column Selection (UCS)

Inputs: G = (V, E, w), T > 0, ℓ.
Outputs: H_uw = (V, F, w ∘ I_F)

 1: Calculate the column-orthogonal matrix U_G^T
 2: Set A_0 = 0_{n×n}, Π_0 = ∅
 3: for t = 0, ···, ℓ − 1 do
 4:    Solve for λ using equation (7.2)
 5:    Calculate λ̄ using equation (7.3)
 6:    Find i ∉ Π_t such that inequality (7.4) is satisfied
 7:    Update A_{t+1} = A_t + u_i u_i^T
 8:    Update Π_{t+1} = Π_t ∪ {i}
 9: end for
10: Let F = Π_ℓ be the selected edges
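The greedy loop above can be prototyped directly from equations (7.2)-(7.4); the following is a minimal sketch under the stated notation, assuming NumPy/SciPy and a row-orthonormal U supplied by the caller. The function names recommended_T and ucs_select are illustrative, not the original implementation, and the root brackets passed to brentq are crude choices made only for illustration.

```python
import numpy as np
from scipy.optimize import brentq

def recommended_T(n, m, ell):
    """T = Tbar*(1 + F(Tbar*)), with Tbar* the maximizer of F from Lemma 5."""
    F = lambda Tb: ell * (1 - n / Tb) / (m - (ell - 1) / 2 + Tb - n) - n / Tb
    Tb_star = (n * (m + (ell + 1) / 2 - n)
               + np.sqrt(n * ell * (m - (ell - 1) / 2) * (m + (ell + 1) / 2 - n))) / (ell - n)
    return Tb_star * (1 + F(Tb_star))

def ucs_select(U, ell, T):
    """Greedy column selection on U (a sketch of the UCS loop).

    U: (n, m) row-orthonormal matrix whose columns u_i correspond to edges.
    Returns the indices of the ell selected columns.
    """
    n, m = U.shape
    A = np.zeros((n, n))
    chosen = []
    for t in range(ell):
        eigs = np.sort(np.linalg.eigvalsh(A))        # eigenvalues of A_t, ascending
        lam_min = eigs[0]
        # (7.2): the unique lam < lam_min(A_t) with tr(A_t - lam I)^{-1} = T.
        g = lambda lam: np.sum(1.0 / (eigs - lam)) - T
        lam = brentq(g, lam_min - 1e6, lam_min - 1e-12)
        # (7.3): lam_bar in (lam, lam_min(A_t)) is the zero of f from Lemma 1.
        def f(x):
            num = np.sum((1.0 - eigs) / ((eigs - lam) * (eigs - x)))
            den = np.sum(1.0 / ((eigs - lam) * (eigs - x)))
            return (x - lam) * (m - t + np.sum((1.0 - eigs) / (eigs - lam))) - num / den
        lam_bar = brentq(f, lam + 1e-12, lam_min - 1e-12)
        # (7.4): accept a fresh column with tr(A_t - lam_bar I + u u^T)^{-1} <= T.
        for i in range(m):
            if i in chosen:
                continue
            u = U[:, i]
            cand = np.trace(np.linalg.inv(A - lam_bar * np.eye(n) + np.outer(u, u)))
            if cand <= T:
                chosen.append(i)
                A += np.outer(u, u)
                break
    return chosen
```

For an unweighted graph, U here would be obtained from a thin SVD (or any orthogonal factorization) of B_G, and the returned indices give the edge set F.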
Theorem 21 below confirms the correctness of the Unweighted Column Selection Algorithm. This theorem, along with other properties of the UCS algorithm, will be discussed and proved in Section 7.4.

Theorem 21. Let G = (V, E, w) and let n < ℓ < m. Then the sparsified graph H produced by the UCS algorithm satisfies

\frac{1}{\kappa} L_G \preceq L_H \preceq L_G,

where

\frac{1}{\kappa} = \frac{(\ell - n)^2}{\left(\sqrt{n\left(m + \frac{\ell+1}{2} - n\right)} + \sqrt{\ell\left(m - \frac{\ell+1}{2}\right)}\right)^2 + (\ell - n)^2}. \qquad (7.5)
7.4 Correctness and Performance of the UCS Algorithm

The goal of this section is to prove Theorem 21. We first establish that the UCS algorithm is well-defined. We then prove a lower bound for the minimum singular value of the submatrix selected by the UCS algorithm, and provide a good choice for the input parameter T. Finally, the UCS algorithm is shown to be a graph sparsification algorithm.
The Existence of a Solution to Equation (7.4)
The next two lemmas show that equation (7.4) always has a solution.
Lemma 1. At a given iteration t in the UCS algorithm, at step 6 define

f(x) := (x - \lambda)\left[m - t + \sum_{j=1}^{n} \frac{1-\lambda_j}{\lambda_j - \lambda}\right] - \frac{\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-x)}}{\sum_{j=1}^{n} \frac{1}{(\lambda_j-\lambda)(\lambda_j-x)}}.

Then there exists λ̄, with λ < λ̄ < λ_n, such that f(λ̄) = 0. Furthermore,

0 < (\bar\lambda - \lambda)\,\frac{\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)^2}}{\sum_{j=1}^{n} \frac{1}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)}} - (\bar\lambda - \lambda)\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)}. \qquad (7.6)
Proof. Clearly f(λ) < 0. Although f is undefined at λ_n, let λ_n^ε := λ_n − ε, where ε > 0. Note that

\lim_{\varepsilon \to 0^+} \left(\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\lambda_n^{\varepsilon})}\right) \Big/ \left(\sum_{j=1}^{n} \frac{1}{(\lambda_j-\lambda)(\lambda_j-\lambda_n^{\varepsilon})}\right) = 1 - \lambda_n

because the last term in each sum will dominate the rest of the sum. Furthermore,

\lim_{\varepsilon \to 0^+} (\lambda_n^{\varepsilon} - \lambda)\left[m - t + \sum_{j=1}^{n} \frac{1-\lambda_j}{\lambda_j - \lambda}\right] = 1 - \lambda_n + \lim_{\varepsilon \to 0^+} (\lambda_n^{\varepsilon} - \lambda)\left[m - t + \sum_{j=1}^{n-1} \frac{1-\lambda_j}{\lambda_j - \lambda}\right] > 1 - \lambda_n.

Hence for small ε > 0, we have f(λ_n^ε) > 0, and, therefore, λ̄ exists, with λ < λ̄ < λ_n and f(λ̄) = 0, via the Intermediate Value Theorem. Note that if there exists 0 < γ < n such that λ_γ = λ_{γ+1} = ··· = λ_n, then we repeat the same argument replacing the expression 1 − λ_n with \sum_{j=\gamma}^{n} (1 - \lambda_j) = (n - \gamma + 1)(1 - \lambda_n).

Now we prove inequality (7.6). We use the following version of the Cauchy-Schwarz formula: for a_j, b_j ≥ 0, (\sum a_j b_j)^2 \le (\sum a_j^2 b_j)(\sum b_j). Consequently

\left(\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)}\right)^2 \le \left(\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)^2}\right)\left(0 + \sum_{j=1}^{n} \frac{1-\lambda_j}{\lambda_j-\lambda}\right)
< \left(\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)^2}\right)\left(m - t + \sum_{j=1}^{n} \frac{1-\lambda_j}{\lambda_j-\lambda}\right)
= \frac{1}{\bar\lambda - \lambda}\left(\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)^2}\right)\frac{\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)}}{\sum_{j=1}^{n} \frac{1}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)}},

where the last step comes from f(λ̄) = 0. The strict inequality above holds because m − t ≥ m − ℓ + 1 ≥ 1. After some simple algebra,

(\bar\lambda - \lambda)\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)} < \left(\sum_{j=1}^{n} \frac{1-\lambda_j}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)^2}\right)\Big/\left(\sum_{j=1}^{n} \frac{1}{(\lambda_j-\lambda)(\lambda_j-\bar\lambda)}\right),

which implies our desired inequality because 0 < λ̄ − λ.
Next, we show that our algorithm is well defined in the sense that we can always find a new index i ∉ Π_t at each iteration satisfying line 6 of the UCS algorithm.

Lemma 2. An index i ∉ Π_t can always be found to satisfy line 6 of the UCS algorithm for 0 ≤ t < ℓ.
Proof. Note the two following partial fraction results
λ− λ(λj − λ
)(λj − λ)
=1
λj − λ− 1
λj − λ(7.7)
λ− λ(λj − λ
)(λj − λ)2
+1(
λj − λ)
(λj − λ)=
1(λj − λ
)2 . (7.8)
Using the fact that f(λ) = 0, followed by the inequality of Lemma 1, we have
(λ− λ
)[m− t+
n∑j=1
1− λjλj − λ
]
=
n∑j=1
1− λj(λj − λ
)(λj − λ)
/ n∑j=1
1(λj − λ
)(λj − λ)
+ 0
<
n∑j=1
1− λj(λj − λ
)(λj − λ)
/ n∑j=1
1(λj − λ
)(λj − λ)
+(λ− λ
)∑nj=1
1−λj(λj−λ)(λj−λ)2∑n
j=11
(λj−λ)(λj−λ)
−(λ− λ
) n∑j=1
1− λj(λj − λ
)(λj − λ)
=
(λ− λ) n∑j=1
1− λj(λj − λ
)(λj − λ)2
+n∑j=1
1− λj(λj − λ
)(λj − λ)
/ n∑
j=1
1(λj − λ
)(λj − λ)
− (λ− λ)2n∑j=1
1− λj(λj − λ
)(λj − λ)
=
∑nj=1
1−λj(λj−λ)
2∑nj=1
1
(λj−λ)(λj−λ)
−(λ− λ
)( n∑j=1
1− λjλj − λ
−n∑j=1
1− λjλj − λ
),
where the last line follows from equations (7.7) and (7.8). After some rearranging:(m− t+
n∑j=1
1− λjλj − λ
) n∑j=1
λ− λ(λj − λ
)(λj − λ)
<
n∑j=1
1− λj(λj − λ
)2 .
This inequality can be rewritten using the trace property tr(xyT
)= yTx and the identity
∑i/∈Πt
uiuTi =
m∑i=1
uiuTi −
∑i∈Πt
uiuTi = In − At:
(∑i 6∈Πt
1 + uTi
(At − λI
)−1
ui
)(tr(At − λI
)−1
− tr (At − λI)−1
)
=
(m− t+
∑i 6∈Πt
tr
[(At − λI
)−1
uiuTi
])( n∑j=1
1
λj − λ−
n∑j=1
1
λj − λ
)
=
(m− t+ tr
[(At − λI
)−1
(I − At)]) n∑
j=1
λ− λ(λj − λ
)(λj − λ)
=
(m− t+
n∑j=1
1− λjλj − λ
) n∑j=1
λ− λ(λj − λ
)(λj − λ)
<
n∑j=1
1− λj(λj − λ
)2
= tr
((At − λI
)−2
(I − At))
=∑i 6∈Πt
uTi
(At − λI
)−2
ui.
Moving terms to the right and dividing by
(tr(At − λI
)−1
− tr (At − λI)−1
)> 0 (because
λ > λ) gives
∑i 6∈Πt
uTi
(At − λI
)−2
ui
tr(At − λI
)−1
− tr (At − λI)−1−(
1 + uTi
(At − λI
)−1
ui
) > 0.
For this to be true, there must exist an i 6∈ Πt such that uTi
(At − λI
)−2
ui
tr(At − λI
)−1
− tr (At − λI)−1−(uTi
(At − λI
)−1
ui
) > 1.
This last relation gives
tr (At − λI)−1 > tr(At − λI
)−1
−uTi
(At − λI
)−2
ui
1 + uTi
(At − λI
)−1
ui
= tr(At − λI
)−1
− tr
(At − λI
)−1
uiuTi
(At − λI
)−1
1 + uTi
(At − λI
)−1
ui
= tr
(At − λI + uiu
Ti
)−1
,
where the last line was accomplished with the trace property previously indicated and theSherman-Morrison formula.
Lower Bound on λ_min(A_ℓ)

Lemma 2 ensures that the UCS algorithm can indeed find all ℓ indices. We now estimate an eigenvalue lower bound on A_ℓ. Let λ^{(t)}, λ̄^{(t)} and λ_j^{(t)} represent the values of λ, λ̄ and λ_j, respectively, determined in iteration t. Then note that by the definitions of λ and λ̄ we have

λ^{(0)} < λ̄^{(0)} ≤ λ^{(1)} < λ̄^{(1)} ≤ ··· ≤ λ^{(ℓ-1)} < λ̄^{(ℓ-1)}.
Define the following quantity and functions:

\bar{T} := T\left(1 - \bar\lambda^{(\ell-1)}\right), \qquad (7.9)

g(t) := \frac{\ell\left(1 - \frac{n}{\bar{T}}\right)}{m - t + \bar{T} - n}, \qquad \text{and} \qquad F(\bar{T}) := \frac{\ell\left(1 - \frac{n}{\bar{T}}\right)}{m - \frac{\ell-1}{2} + \bar{T} - n} - \frac{n}{\bar{T}}.

To bound λ_min(A_ℓ), we first establish a recurrence relation on λ̄^{(ℓ-1)}.
Lemma 3. After the last iteration of the UCS algorithm, we have

\bar\lambda^{(\ell-1)} \ge \left(1 - \bar\lambda^{(\ell-1)}\right)\left[\frac{1}{\ell}\sum_{t=0}^{\ell-1} g(t) - \frac{n}{\bar{T}}\right].
Proof. Remember that T = tr(A− λ(t)I
)−1=
n∑j=1
1
λ(t)j − λ(t)
, and note that
1− λ(t)j
λ(t)j − λ(t)
=1− λ(t)
λ(t)j − λ(t)
+λ(t) − λ(t)
j
λ(t)j − λ(t)
=1− λ(t)
λ(t)j − λ(t)
− 1. (7.10)
The equation f(λ(t))
= 0 gives
(λ(t) − λ(t)
)(m− t+
n∑j=1
1− λ(t)j
λ(t)j − λ(t)
)=
∑nj=1
1−λ(t)j(λ(t)j −λ(t)
)(λ(t)j −λ(t)
)∑nj=1
1(λ(t)j −λ(t)
)(λ(t)j −λ(t)
) .
Applying equation (7.10) to both sides:(λ(t) − λ(t)
) (m− t+
(1− λ(t)
)T − n
)= 1− λ(t) −
∑nj=1
1
λ(t)j −λ(t)∑n
j=11(
λ(t)j −λ(t)
)(λ(t)j −λ(t)
)
≥ 1− λ(t) −n
(maxj∗
1
λ(t)j∗−λ
(t)
)(
maxj∗1
λ(t)j∗−λ
(t)
)(∑nj=1
1
λ(t)j −λ(t)
)= 1− λ(t) − n
T.
Since (λ(t−1) − λ(t)
)≤ 0, and
(λ(t) − λ(t)
)≥
1− λ(t) − nT
m− t+ (1− λ(t))T − n,
we have
λ(`−1) ≥ λ(`−1) +`−1∑t=1
≤0︷ ︸︸ ︷(λ(t−1) − λ(t)
)−λ(0) + λ(0)
=`−1∑t=0
(λ(t) − λ(t)
)+ λ(0)
≥`−1∑t=0
1− λ(t) − nT
m− t+ (1− λ(t))T − n− n
T
≥`−1∑t=0
1− λ(`−1) − nT
m− t+(
1− λ(`−1))T − n
− n
T. (7.11)
Inequality (7.11) follows by noting that the terms in the sum are decreasing in λ(t). The finalsubstitution is necessary because solving the preceding recurrence relation is impractical. Tofurther simplify calculations, we define
T := T(
1− λ(`−1)).
Therefore,
λ(`−1) ≥(
1− λ(`−1))[ `−1∑
t=0
1− n
T
m− t+ T − n− n
T
]
=(
1− λ(`−1))[1
`
`−1∑t=0
g(t)− n
T
].
Next, to demonstrate the effectiveness of the algorithm, we derive a lower bound for λ_n after ℓ iterations. This analysis will involve selecting an appropriate T to maximize the lower bound.
Lemma 4. If \bar{T} > n, then

\lambda_{\min}(A_\ell) \ge \frac{F(\bar{T})}{1 + F(\bar{T})}.
Proof. A key observation is that g(t) is strictly convex in t, which is easily verified by showing that the second derivative d²g/dt²(t) is positive under our assumptions that \bar{T} > n and m ≥ ℓ > t. Next, we note that the harmonic mean is strictly less than the arithmetic mean unless all terms are equal, and we can further bound the recurrence relation in Lemma 3:

\bar\lambda^{(\ell-1)} \ge \left(1 - \bar\lambda^{(\ell-1)}\right)\left[\frac{1}{\ell}\sum_{t=0}^{\ell-1} \frac{\ell\left(1 - \frac{n}{\bar{T}}\right)}{m - t + \bar{T} - n} - \frac{n}{\bar{T}}\right]
> \left(1 - \bar\lambda^{(\ell-1)}\right)\left[\frac{\ell\left(1 - \frac{n}{\bar{T}}\right)}{\frac{1}{\ell}\sum_{t=0}^{\ell-1}\left(m - t + \bar{T} - n\right)} - \frac{n}{\bar{T}}\right]
= \left(1 - \bar\lambda^{(\ell-1)}\right)\left[\frac{\ell\left(1 - \frac{n}{\bar{T}}\right)}{m - \frac{\ell-1}{2} + \bar{T} - n} - \frac{n}{\bar{T}}\right]
= \left(1 - \bar\lambda^{(\ell-1)}\right) F(\bar{T}).

Along with λ_n > λ̄ from Lemma 1, this finally leads to

\lambda_{\min} = \lambda_n > \bar\lambda^{(\ell-1)} > \frac{F(\bar{T})}{1 + F(\bar{T})}. \qquad (7.12)
The expression on the right-hand side of (7.12) is monotonically increasing in F. So, maximizing F(\bar{T}) will also maximize the lower bound on λ̄^{(ℓ-1)}.
Lemma 5. The function F(\bar{T}) is maximized at

\bar{T}^{*} = \frac{n\left(m + \frac{\ell+1}{2} - n\right) + \sqrt{n\ell\left(m - \frac{\ell-1}{2}\right)\left(m + \frac{\ell+1}{2} - n\right)}}{\ell - n}.
Proof. Setting the derivative of F(\bar{T}) to zero:

\frac{dF}{d\bar{T}} = \frac{(n-\ell)\bar{T}^2 + 2n\left(m + \frac{\ell+1}{2} - n\right)\bar{T} + n\left(m + \frac{\ell+1}{2} - n\right)\left(m - \frac{\ell-1}{2} - n\right)}{\bar{T}^2\left(m - \frac{\ell-1}{2} - n + \bar{T}\right)^2} = 0.

Solving for the desired root:

\bar{T}^{*} = \frac{n\left(m + \frac{\ell+1}{2} - n\right) + \sqrt{n\ell\left(m - \frac{\ell-1}{2}\right)\left(m + \frac{\ell+1}{2} - n\right)}}{\ell - n}.

We see that \bar{T}^{*} is the global maximum on the region \bar{T} ∈ (n, ∞) via the first derivative test, since dF/d\bar{T} > 0 for n < \bar{T} < \bar{T}^{*} and dF/d\bar{T} < 0 for \bar{T}^{*} < \bar{T}.
We remark that combining (7.9) and (7.12) implies that the UCS algorithm should choose

T = \bar{T}^{*}\left(1 + F(\bar{T}^{*})\right)

for effective column selection. We are now ready to estimate λ_min(A_ℓ).
Theorem 22. If T is chosen according to Lemma 5 in the UCS algorithm, then

\lambda_{\min}(A_\ell) > \frac{1}{\kappa},

where κ is defined in (7.5).
Proof. We wish to apply our choice of \bar{T} to Lemma 4. We satisfy the assumption

\bar{T}^{*} = \frac{n\left(m + \frac{\ell+1}{2} - n\right) + \sqrt{n\ell\left(m - \frac{\ell-1}{2}\right)\left(m + \frac{\ell+1}{2} - n\right)}}{\ell - n} \ge \frac{n(m-n)}{\ell - n} \ge n.

Therefore, plugging \bar{T}^{*} into (7.12) of Lemma 4:

\lambda_{\min}(A_\ell) > \frac{F(\bar{T}^{*})}{1 + F(\bar{T}^{*})} = \frac{(\ell-n)\bar{T}^{*} - n\left(m + \frac{\ell+1}{2} - n\right)}{\bar{T}^{*}\left(m - \frac{\ell-1}{2} - n + \bar{T}^{*}\right) + (\ell-n)\bar{T}^{*} - n\left(m + \frac{\ell+1}{2} - n\right)} = \frac{(\ell-n)^2}{\left(\sqrt{n\left(m + \frac{\ell+1}{2} - n\right)} + \sqrt{\ell\left(m - \frac{\ell+1}{2}\right)}\right)^2 + (\ell-n)^2}.
Correctness of the Unweighted Column Selection Algorithm

We are now in a position to prove Theorem 21. Our arguments are similar to those of the weighted sparsifier algorithm in [12].
Proof of Theorem 21. By Proposition 1, we only need to show \frac{1}{\kappa} L_G \preceq L_H. Consider the SVD of W_G^{1/2} B_G in equation (7.1), and let x be any vector such that y = Σ_G V_G x ≠ 0. Then

L_G = B_G^T W_G B_G = V_G^T \Sigma_G^2 V_G,
L_H = B_G^T W_H B_G = B_G^T W_G^{1/2} \Pi_\ell^T \Pi_\ell W_G^{1/2} B_G = V_G^T \Sigma_G \left(U_G \Pi_\ell^T \Pi_\ell U_G^T\right) \Sigma_G V_G.

It follows that

\frac{x^T L_H x}{x^T L_G x} = \frac{x^T\left(V_G^T \Sigma_G \left(U_G \Pi_\ell^T \Pi_\ell U_G^T\right) \Sigma_G V_G\right)x}{x^T\left(V_G^T \Sigma_G^2 V_G\right)x} = \frac{y^T U_G \Pi_\ell^T \Pi_\ell U_G^T y}{y^T y}. \qquad (7.13)

On the other hand, by construction we have

A_\ell = \sum_{j \in \Pi_\ell} u_j u_j^T = U_G \Pi_\ell^T \Pi_\ell U_G^T.

With equation (7.13), the Courant-Fischer min-max property gives

\frac{x^T L_H x}{x^T L_G x} = \frac{y^T U_G \Pi_\ell^T \Pi_\ell U_G^T y}{y^T y} \ge \lambda_{\min}(A_\ell) > \frac{1}{\kappa},

where the last inequality is due to Theorem 22.
7.5 Performance Comparison of UCS and Other Algorithms

This section compares the bound (7.5) to the bounds of other current methods.
Comparison with Twice-Ramanujan Sparsifiers

Given a weighted graph G = (V, E, w), the algorithm of [12] produces a sparsified graph H = (V, F, w̃), where F is a subset of E and w̃ contains new edge weights, such that

L_G \preceq L_H \preceq \left(\frac{\sqrt{d}+1}{\sqrt{d}-1}\right)^2 L_G, \qquad (7.14)
where the parameter d is defined via the equation ℓ = ⌈d(n − 1)⌉. By choosing d to be a moderate and dimension-independent constant, equation (7.14) asserts that every graph G = (V, E, w) has a weighted spectral sparsifier with a number of edges linear in |V|. This strong result, nevertheless, is obtained by allowing unrestricted changes in the graph weights. Such changes may be undesirable, especially if G is unweighted, and the UCS algorithm may be preferred.
To compare the effectiveness of these two types of sparsifiers, we simplify equation (7.5):

\frac{1}{\kappa} \approx \frac{\left(\sqrt{d} - 1\right)^2}{m/n + d/2 + \left(\sqrt{d} - 1\right)^2}.

It follows that for κ = Θ(1), a dimension-independent constant, we must choose d = Θ(m/n). This is the price one must pay to retain the original weights. For d ≪ m/n, the UCS algorithm computes a sparsified graph with a κ that grows at most linearly with m/n. The algorithm of [12] runs in time O(dn³m), which is equivalent to UCS.
Near-Optimal Column-Based Matrix Reconstruction

The algorithm of [12] has been generalized in [17] to a column selection algorithm for computing CX decompositions. In this work, Boutsidis, Drineas, and Magdon-Ismail prove that, given row-orthonormal matrices

V_1^T = \left(\vec{v}^{\,1}_1 \; \vec{v}^{\,1}_2 \; \cdots \; \vec{v}^{\,1}_m\right) \in R^{n \times m} \quad \text{and} \quad V_2^T = \left(\vec{v}^{\,2}_1 \; \vec{v}^{\,2}_2 \; \cdots \; \vec{v}^{\,2}_m\right) \in R^{(m-n) \times m},

then for a given n < ℓ ≤ m there exist weights s_i ≥ 0, with at most ℓ of them nonzero, such that

\lambda_n\left(\sum_{i=1}^{m} s_i \vec{v}^{\,1}_i \left(\vec{v}^{\,1}_i\right)^T\right) \ge \left(1 - \sqrt{\frac{n}{\ell}}\right)^2 \qquad (7.15)

and

\lambda_1\left(\sum_{i=1}^{m} s_i \vec{v}^{\,2}_i \left(\vec{v}^{\,2}_i\right)^T\right) \le \left(1 + \sqrt{\frac{m-n}{\ell}}\right)^2. \qquad (7.16)

In the context of CX decompositions, \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} = V^T \in R^{m \times m} is understood to be the loadings matrix of a data matrix A, i.e. A = UΣV^T is the SVD of A (although the algorithm could be applied to other matrices for other applications). Their work includes an algorithm for finding the weights, Deterministic Dual Set Spectral Sparsification (DDSSS).
Theorem 23. Let Π_ℓ^{DDSSS} ∈ R^{m×ℓ} denote a matrix that chooses the ℓ columns selected by the DDSSS algorithm. The inequalities (7.15) and (7.16) imply

\sigma^2_{\min}\left(V_1^T \Pi_\ell^{DDSSS}\right) \ge \frac{\left(\sqrt{\ell} - \sqrt{n}\right)^2}{\left(\sqrt{\ell} + \sqrt{m-n}\right)^2 + \left(\sqrt{\ell} - \sqrt{n}\right)^2} =: \frac{1}{\kappa_{DDSSS}}.
Proof. We interpret these inequalities as a bound on λn by first partitioning
V TΠ =
(V1 V ′1V2 V ′2
),
where Π is a permutation matrix that orders the selected columns first. Then, using a CSdecomposition [103], we can write
V1
V2
=
P1
(C 0
)QT
1
P2
−S 00 I0 0
QT2
,
where C and S are diagonal matrices with non-negative entries such that C2 + S2 = I.Furthermore, because P1 and Q are orthogonal, by inspection C contains the singular valuesof V1. Hence
λn ≥ σ2min (V1) = σ2
min(C).
Now let W be a weight matrix, whose diagonal entries are√si, the weights from above.
Define
Qdef= QT
1
(WW T
)Q1
def=
(Q11 Q12
Q21 Q22
).
Then
(V2W ) (V1W )†
= V2W (V1W )T(V1WW TV T
1
)−1
= V2
(WW T
)V T
1
(V1
(WW T
)V T
1
)−1
= V2
(WW T
) (P1
(C 0
)QT
1
)T (P1
(C 0
)QT
1
(WW T
)Q1
(C0
)P T
1
)−1
= P2
−S 00 I0 0
Q
(C0
)P T
1 P1
((C 0
)Q
(C0
))−1
P T1
= P2
−S 00 I0 0
( Q11 Q12
Q21 Q22
)(C0
)(CQ11C
)−1
P T1
= P2
−S 00 I0 0
( Q11C
Q21C
)C−1Q−1
11 C−1P T
1
= P2
−SC−1
Q21Q−111 C
−1
0
P T1 .
Therefore √1− σ2
min(C)
σ2min(C)
= ‖SC−1‖2
≤ ‖(V2W ) (V1W )† ‖2
≤
(1 +
√m− n`
)(1−
√n
`
)−1
.
Rearranging
σ2min(C) ≥
(1−
√n`
)2(1 +
√m−n`
)2
+(1−
√n`
)2
=
(√`−√n)2
(√`+√m− n
)2
+(√
`−√n)2 .
Corollary 1. Let κ_UCS be as defined in equation (7.5). Then

\frac{1}{\kappa_{UCS}} > \frac{1}{\kappa_{DDSSS}}.
Proof.
(`− n)2(√n(m+ `+1
2− n
)+√`(m− `+1
2
))2
+ (`− n)2
=(√`−√n)2(√
n(m+ `+12−n)+
√`(m− `+1
2 )√`+√n
)2
+ (√`−√n)2
≥ (√`−√n)2(√
n(m+ `+12−n)+
√`m
√`+√n
)2
+ (√`−√n)2
≥ (√`−√n)2(√
n(m+`−n)+√`m
√`+√n
)2
+ (√`−√n)2
≥ (√`−√n)2(√
n(m+`−n)+√`m−n`+`2
√`+√n
)2
+ (√`−√n)2
≥ (√`−√n)2(√
nm−n2+√n`+√`m−n`+
√`2√
`+√n
)2
+ (√`−√n)2
=(√`−√n)2(
(√m−n+
√`)(√`+√n)√
`+√n
)2
+ (√`−√n)2
=(√`−√n)2(√
m− n+√`)2
+ (√`−√n)2
.
This suggests the UCS algorithm may find a better subset than the column selection algorithm in [17]. Observe that typically m ≫ ℓ ≥ n. For the purpose of finding a well-conditioned subset of columns in V_1^T ∈ R^{n×m}, requiring the whole matrix V^T ∈ R^{m×m} is computationally expensive. On the other hand, an even better subset can be obtained by applying the UCS algorithm directly to V_1^T, at considerable savings in computational time and memory usage. The algorithm of [17] runs in time O(ℓm(n² + (m − ℓ)²)) ≈ O(ℓm³), far slower than UCS.
Figure 7.1: Autonomous System Example: Original Graph
7.6 A Numeric Example: Graph Visualization

We test the UCS algorithm on the Autonomous systems AS-733 dataset in [67]². The data is undirected, unweighted, and contains 493 nodes and 1189 edges. To visualize the data, nodes are plotted using coordinates determined by the force-directed Fruchterman-Reingold algorithm. This algorithm treats the edges of a graph as forces (similar to springs), and perturbs node coordinates until the graph appears to be near an equilibrium state [45].

We apply the force-directed algorithm with two methodologies. First, the force-directed algorithm is run on the whole graph to determine a fixed set of node coordinates. Using these coordinates, the original graph is plotted with various sparsifiers in Figure 7.2. Second, we run the force-directed algorithm on each sparsifier to determine node coordinates for that sparsifier, and plot both the sparsifier and the original graph on these coordinates (Figure 7.3). While this requires rerunning the force-directed algorithm for each sparsifier, the algorithm converges faster because of the reduced number of edges.

Although the original graph can be considered sparse, visualization of the graph is difficult. In Figure 7.1, a few nodes are seen to have high degree, but little information is readily available about important edges in the graph or about how important nodes are related. Figure 7.2 shows that plotting the sparsifier on the original graph provides incremental benefit. The sparser graphs begin to highlight important nodes and important edges connecting them, but visualization remains difficult. Rerunning the force-directed algorithm on the sparsifiers, nevertheless, evokes an easily interpretable structure, where important nodes, clusters, and important edges connecting clusters are readily visible (Figure 7.3).
2File as19981229
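The two visualization methodologies can be sketched with NetworkX, whose spring_layout implements the Fruchterman-Reingold algorithm [45]; the random graph and the edge subset below are placeholders for the AS-733 snapshot and the UCS output.

```python
import networkx as nx
import matplotlib.pyplot as plt

# G: the original unweighted graph; sparsifier_edges: the edge subset kept by UCS (assumed given).
G = nx.gnm_random_graph(493, 1189, seed=0)            # stand-in for the AS-733 snapshot
sparsifier_edges = list(G.edges())[:738]              # placeholder for the UCS output

H = nx.Graph()
H.add_nodes_from(G.nodes())
H.add_edges_from(sparsifier_edges)

pos_full = nx.spring_layout(G, seed=1)                # coordinates computed from the whole graph
pos_sparse = nx.spring_layout(H, seed=1)              # coordinates recomputed on the sparsifier

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
nx.draw(H, pos_full, ax=axes[0], node_size=10)        # sparsifier on original coordinates
nx.draw(H, pos_sparse, ax=axes[1], node_size=10)      # sparsifier on its own coordinates
plt.show()
```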
(a) 984 Edges (b) 738 Edges (c) 615 Edges (d) Spanning Tree³

Figure 7.2: Autonomous Systems Graph with Sparsifiers of Various Cardinalities (node coordinates calculated from the whole graph)

(a) 984 Edges (b) 738 Edges (c) 615 Edges (d) Spanning Tree³

Figure 7.3: Autonomous Systems Graph with Sparsifiers of Various Cardinalities (node coordinates recalculated for each sparsifier)

(a) 984 Edges (b) 738 Edges (c) 615 Edges (d) 493 Edges

Figure 7.4: Progress During Iteration and Theoretical Singular Value Lower Bound for Sparsifiers of Various Cardinalities
7.7 Relationship to the Kadison-Singer Problem
Let p ≥ 2 be an integer, and let U = (u_1, ···, u_m) ∈ R^{n×m} be a matrix that satisfies

\sum_{k=1}^{m} u_k u_k^T = I, \quad \text{and} \quad \|u_k\|_2 \le \delta, \quad \text{for } k = 1, \cdots, m, \qquad (7.17)

where 0 < δ < 1. Equation (7.17) implies that U is a row-orthonormal matrix and that each column of U is uniformly bounded away from 1 in 2-norm. Marcus et al. [73] show that there exists a partition

P = P_1 \cup \cdots \cup P_p \qquad (7.18)

of the column indices {1, ···, m} such that

\|U(:, P_k)\|_2 \le \frac{1}{\sqrt{p}} + \delta, \quad \text{for } k = 1, \cdots, p.

When the graph G is sufficiently dense, equation (7.18) implies the existence of an unweighted graph sparsifier (see Batson, et al. [12]).

³ Calculated by running the UCS algorithm with ℓ = 493 and omitting the final edge. In general, a spanning tree for a connected graph can be found by selecting ℓ = n + 1 edges and removing an edge from a loop created by the UCS algorithm.
7.8 Additional Thoughts

We have presented an efficient algorithm for the construction of unweighted spectral sparsifiers for general weighted and unweighted graphs, addressing the open question of the existence of such graph sparsifiers for general graphs [12]. Our algorithm is supported by strong theoretical spectral bounds. Through numeric experiments, we have demonstrated that our sparsification algorithm can be an effective tool for graph visualization, and we anticipate that it will prove useful for wide-ranging applications involving large graphs. An important feature of our sparsification algorithm is the deterministic unweighted column selection algorithm on which it is based. An open question is the existence of a larger lower spectral bound, either with the same T or a new one.
Chapter 8
Additional Results on Unweighted Graph Sparsification
8.1 A Running Bound
Note that the bound (7.5) assumes the recommended T and that all ℓ iterations have been performed. This bound does not apply, nevertheless, at an iteration s < ℓ simply by replacing ℓ with s in the bound, because T was determined by ℓ, not s. Here, a new bound is derived for when s < ℓ iterations have been performed.
For a given n, m, and ℓ, we look at the lower bound at step s ≤ ℓ. From inequality (7.11):

\bar\lambda^{(s)} \ge \sum_{t=0}^{s} \frac{1 - \bar\lambda^{(s)} - \frac{n}{T}}{m - t + \left(1 - \bar\lambda^{(s)}\right)T - n} - \frac{n}{T}
\ge s \cdot \frac{1 - \bar\lambda^{(s)} - \frac{n}{T}}{m - \frac{s}{2} + \left(1 - \bar\lambda^{(s)}\right)T - n} - \frac{n}{T}.

Then

\bar\lambda^{(s)}\left(m - \frac{s}{2} + \left(1 - \bar\lambda^{(s)}\right)T - n\right) \ge s\left(1 - \bar\lambda^{(s)} - \frac{n}{T}\right) - \frac{n}{T}\left(m - \frac{s}{2} + \left(1 - \bar\lambda^{(s)}\right)T - n\right),

-T\left(\bar\lambda^{(s)}\right)^2 + \bar\lambda^{(s)}\left(m - \frac{s}{2} + T - n\right) \ge s - s\bar\lambda^{(s)} - \frac{sn}{T} - \frac{nm}{T} + \frac{sn}{2T} - n + n\bar\lambda^{(s)} + \frac{n^2}{T},

-T\left(\bar\lambda^{(s)}\right)^2 + \bar\lambda^{(s)}\left(m + \frac{s}{2} + T - 2n\right) \ge s - \frac{sn}{T} - \frac{nm}{T} + \frac{sn}{2T} - n + \frac{n^2}{T}.

We solve for the smaller root of this quadratic to find a lower bound:

\bar\lambda^{(s)} \ge \frac{2n - T - \frac{s}{2} - m + \sqrt{\left(m + \frac{s}{2} + T - 2n\right)^2 + 4T\left(-s + \frac{sn}{2T} + \frac{nm}{T} + n - \frac{n^2}{T}\right)}}{-2T}.
8.2 Faster Subset Selection for Matrices and Applications

The improved bound in Theorem 3.5 of [9] is

\|X_S^{\dagger}\|_{\xi}^2 \le \left(1 + \sqrt{\frac{m}{\ell}}\right)^2\left(1 - \sqrt{\frac{n}{\ell}}\right)^{-2}\|X^{\dagger}\|_{\xi}^2.

Equivalently,

\frac{\|X_S\|_{\xi}^2}{\|X\|_{\xi}^2} \ge \frac{\left(1 - \sqrt{\frac{n}{\ell}}\right)^2}{\left(1 + \sqrt{\frac{m}{\ell}}\right)^2} = \kappa_{DDSSS2}.
Then
κDDSSS2 =
(√`−√n)2
(√`+√m)2
=(`− n)2(√
`+√n)2 (√
`+√m)2
=(`− n)2(
`+√`m+
√n`+
√nm)2
=(`− n)2(
`− n+ n+√`m+
√n`+
√nm)2
<(`− n)2(
`− n+ n+√`(m− `+1
2
)+√n`+
√nm)2
<(`− n)2(
`− n+ n+√`(m− `+1
2
)+√n (`+m)
)2 .
Continuing:
κDDSSS2 <(`− n)2(
`− n+ n+√`(m− `+1
2
)+√n(m+ `+1
2− n
))2
<(`− n)2(
`− n+√`(m− `+1
2
)+√n(m+ `+1
2− n
))2
<(`− n)2
(`− n)2 +(√
`(m− `+1
2
)+√n(m+ `+1
2− n
))2
= κUCS.
Appendix A
An Algorithm for Sparse PCA
Although the truncated Singular Value Decomposition is a mathematically optimal low-rank approximation algorithm in the sense of Theorem 2, there are many drawbacks to this type of approximation. Part II demonstrated that SRLU simultaneously addresses many of these drawbacks. This section presents an algorithm specifically for the problem of sparsity, known as Sparse PCA [110].

Recall that the SVD is generally dense, regardless of the amount of sparsity in the input data. While the SVD finds the optimal subspace of a chosen rank k, the basis for this subspace will generally be a set of k dense principal components and loading vectors. As a result, the rank-k subspace is a combination of all input variables. Instead, one may ask the natural question of which k variables best explain the data, a different question which the SVD does not address.
A.1 Problem Formulation

The sparse PCA problem has many formulations. Here, we use the following objective function:

\min_{\substack{U^T U = I \\ V(\Pi^{\perp}) = 0}} \|A - UV^T\|_F^2 - \eta\left[\log\det\left(V^T V\right) - \sum_{j=1}^{k}\log V(:, j)^T V(:, j)\right]. \qquad (A.1)

For a target rank k and A ∈ R^{m×n}, U ∈ R^{m×k} is orthonormal and V ∈ R^{n×k}. Because the determinant of V^T V is the product of the squared singular values of V, this objective function seeks to simultaneously maximize all of the singular values of V. The last term in the objective function serves to keep the columns of V bounded. The presence of the logarithmic function will allow us to analyze this function analytically. Furthermore, the logarithmic term in the final expression renders a column-wise penalty similar to the ℓ₁ norm, which is generally known to induce sparsity.
Sparsity will be represented in the following setup. The sparsity pattern of V is (here the jth column can be used instead of the 1st column):

V = \begin{pmatrix} v_1 & V_1 \\ 0 & V_2 \end{pmatrix}. \qquad (A.2)
A.2 Setup for a Single Column

Let

a = \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} := \left(A^T U\right)_j.

Note that

V^T V = \begin{pmatrix} v_1^T v_1 & v_1^T V_1 \\ V_1^T v_1 & V_2^T V_2 + V_1^T V_1 \end{pmatrix},

and define

P := I - V_1\left(V_1^T V_1 + V_2^T V_2\right)^{-1} V_1^T.

Note that

\|A - UV^T\|_F^2 = \mathrm{tr}\left(A^T A\right) - 2\,\mathrm{tr}\left(U^T A V\right) + \mathrm{tr}\left(V^T V\right).

Then the part of the objective function that is updated is

\min_{v_1} \; -2 v_1^T a_1 + v_1^T v_1 - \eta\left[\log v_1^T P v_1 - \log v_1^T v_1\right]. \qquad (A.3)

Taking the derivative:

-a_1 + v_1 - \eta\left[\frac{P v_1}{v_1^T P v_1} - \frac{v_1}{v_1^T v_1}\right] = 0. \qquad (A.4)
A.3 Solving for v1
Rewriting:

a_1 = v_1\left(1 + \frac{\eta}{v_1^T v_1}\right) - \frac{\eta}{v_1^T P v_1} P v_1.

Define

\alpha := 1 + \frac{\eta}{v_1^T v_1}, \qquad \beta := \frac{\eta}{v_1^T P v_1}.

Thus we have

v_1 = (\alpha I - \beta P)^{-1} a_1, \qquad (A.5)

and

\alpha = \lambda\beta = 1 + \frac{\eta}{a_1^T (\alpha I - \beta P)^{-2} a_1}, \qquad (A.6)

\beta = \frac{\eta}{a_1^T (\alpha I - \beta P)^{-1} P (\alpha I - \beta P)^{-1} a_1}. \qquad (A.7)

Additionally, define

\lambda := \frac{\alpha}{\beta} = \frac{\left(1 + \frac{\eta}{v_1^T v_1}\right) v_1^T P v_1}{\eta} = \frac{v_1^T P v_1}{\eta} + \frac{v_1^T P v_1}{v_1^T v_1}. \qquad (A.8)

Extracting a factor of 1/β from the terms in the denominator of equation (A.7), and noting that the terms commute, implies

\beta = \frac{\eta}{\frac{1}{\beta^2} a_1^T (\lambda I - P)^{-2} P a_1} = \frac{\eta\beta^2}{a_1^T (\lambda I - P)^{-2} P a_1} = \frac{1}{\eta} a_1^T (\lambda I - P)^{-2} P a_1. \qquad (A.9)

Similarly, equation (A.6) implies

\alpha = \lambda\beta = \frac{\eta\beta^2}{a_1^T (\lambda I - P)^{-2} a_1} + 1. \qquad (A.10)

Applying equation (A.9) to both sides of equation (A.10) yields

\frac{\lambda}{\eta} a_1^T (\lambda I - P)^{-2} P a_1 = 1 + \frac{\eta}{a_1^T (\lambda I - P)^{-2} a_1} \cdot \frac{\left[a_1^T (\lambda I - P)^{-2} P a_1\right]^2}{\eta^2}

\lambda = \frac{\eta}{a_1^T (\lambda I - P)^{-2} P a_1} + \frac{a_1^T (\lambda I - P)^{-2} P a_1}{a_1^T (\lambda I - P)^{-2} a_1}

\frac{\lambda\, a_1^T (\lambda I - P)^{-2} a_1 - a_1^T (\lambda I - P)^{-2} P a_1}{a_1^T (\lambda I - P)^{-2} a_1} = \frac{\eta}{a_1^T (\lambda I - P)^{-2} P a_1}

\frac{a_1^T (\lambda I - P)^{-2} (\lambda I - P) a_1}{a_1^T (\lambda I - P)^{-2} a_1} = \frac{a_1^T (\lambda I - P)^{-1} a_1}{a_1^T (\lambda I - P)^{-2} a_1} = \frac{\eta}{a_1^T (\lambda I - P)^{-2} P a_1}. \qquad (A.11)
This provides a formula for λ. Where to look for λ is discussed later.
A.4 More on λ and the Objective Function
Using the notation introduced above, we can use equation (A.9) to rewrite our objectivefunction, in equation (A.3), as
Obj. = −2ηaT1 (λI − P )−1 a1
aT1 (λI − P )−2 Pa1
+η2aT1 (λI − P )−2 a1[aT1 (λI − P )−2 Pa1
]2 − η logaT1 (λI − P )−2 Pa1
aT1 (λI − P )−2 Pa1
= − η2aT1 (λI − P )−2 a1[aT1 (λI − P )−2 Pa1
]2 − η logη
aT1 (λI − P )−1 a1
= −η2 1
aT1 (λI − P )−2 a1
(aT1 (λI − P )−1 a1
η
)2
− η logη
aT1 (λI − P )−1 a1
= −[aT1 (λI − P )−1 a1
]2aT1 (λI − P )−2 a1
− η logη
aT1 (λI − P )−1 a1
,
where we have used equation (A.11) in the derivations above. Then
dObj.
dλ= −
[−2[aT1 (λI − P )−1 a1
] [aT1 (λI − P )−2 a1
]2+2[aT1 (λI − P )−1 a1
]2aT1 (λI − P )−3 a1
]/[aT1 (λI − P )−2 a1
]2− d
dλη log
η
aT1 (λI − P )−1 a1
= 2aT1 (λI − P )−1 a1[aT1 (λI − P )−2 a1
]2 ([aT1 (λI − P )−2 a1
]2−aT1 (λI − P )−1 a1 · aT1 (λI − P )−3 a1
)− d
dλη log
η
aT1 (λI − P )−1 a1
.
Let αj denote the jth component of a1, and µj the jth eigenvalue of P . Then we can write[aT1 (λI − P )−2 a1
]2 − aT1 (λI − P )−1 a1 · aT1 (λI − P )−3 a1
=k∑i=1
α2i
(λ− µi)2
k∑j=1
α2j
(λ− µj)2 −k∑i=1
α2i
(λ− µi)1
k∑j=1
α2j
(λ− µj)3
=k∑
i,j=1
α2iα
2j [(λ− µj)− (λ− µi)](λ− µi)2 (λ− µj)3
=k∑
i,j=1
α2iα
2j [µi − µj]
(λ− µi)2 (λ− µj)3 . (A.12)
Suppose that t_1 and t_2 are indices such that μ_{t_1} > μ_{t_2}. Then the terms in the sum in equation (A.12) corresponding to this pair of indices are
α2t1α2t2
[µt1 − µt2 ](λ− µt1)
2 (λ− µt2)3 +
α2t2α2t1
[µt2 − µt1 ](λ− µt2)
2 (λ− µt1)3
=α2t1α2t2
(λ− µt2)3 (λ− µt1)
3 [(µt1 − µt2) (λ− µt1) + (µt2 − µt1) (λ− µt2)]
=α2t1α2t2
(λ− µt2)3 (λ− µt1)
3
[− (µt1 − µt2)
2]≤ 0.
This implies line (A.12) is non-positive. By inspection,
d
dλη log
η
aT1 (λI − P )−1 a1
≥ 0.
Therefore,
dObj.
dλ< 0. (A.13)
A.5 Bounds on λ

Note that equation (A.8) implies

\frac{v_1^T P v_1}{v_1^T v_1} \le \frac{v_1^T P v_1}{\eta} + \frac{v_1^T P v_1}{v_1^T v_1} = \lambda \le \frac{\|P\|_2}{\eta} + \|P\|_2.

Consider that we are picking v_1 to maximize λ, as argued by line (A.13), and note that by construction P ⪰ 0 and P^T = P. Then

\|P\|_2 = \|P^{1/2}\|_2^2 = \max_x \left\|P^{1/2}\frac{x}{\|x\|_2}\right\|_2^2.

Let

\hat{x} = \arg\max_x \left\|P^{1/2}\frac{x}{\|x\|_2}\right\|_2^2.

Then

\|P\|_2 = \|P^{1/2}\|_2^2 = \left\|P^{1/2}\frac{\hat{x}}{\|\hat{x}\|_2}\right\|_2^2 = \frac{\hat{x}^T\left(P^{1/2}\right)^T P^{1/2}\hat{x}}{\hat{x}^T\hat{x}} = \frac{\hat{x}^T P\hat{x}}{\hat{x}^T\hat{x}} \le \frac{v_1^T P v_1}{\eta} + \frac{v_1^T P v_1}{v_1^T v_1} = \lambda.

Hence

\|P\|_2 \le \lambda \le \|P\|_2 + \frac{\|P\|_2}{\eta}. \qquad (A.14)
Equation (A.11) has more than one root, so the bounds in inequality (A.14) provide guidance for the correct root. To guarantee that a root can be found, nevertheless, these bounds may still be insufficient. A stable and fast algorithm such as Brent's method [19] requires bounds with opposing signs when searching for the zero of a function. Define

F(\lambda) := \frac{a_1^T (\lambda I - P)^{-1} a_1}{a_1^T (\lambda I - P)^{-2} a_1} - \frac{\eta}{a_1^T (\lambda I - P)^{-2} P a_1},

and let

P = Q\Lambda Q^T

be the SVD of P, which takes this form because P is symmetric. Denote b = a_1^T Q, and let d_i be the diagonal elements of Λ (in descending order). Then

F(\lambda) = \frac{b[\lambda I - \Lambda]^{-1} b^T}{b[\lambda I - \Lambda]^{-2} b^T} - \frac{\eta}{b[\lambda I - \Lambda]^{-2}\Lambda b^T} = \frac{\sum_i \frac{b_i^2}{\lambda - d_i}}{\sum_i \frac{b_i^2}{(\lambda - d_i)^2}} - \frac{\eta}{\sum_i \frac{d_i b_i^2}{(\lambda - d_i)^2}}.

Note that if λ_1 is the solution to

\sum_i \frac{b_i^2}{\lambda - d_i} = \frac{\eta}{d_1}, \qquad (A.15)

then F(λ_1) ≤ 0. Let t be the largest integer such that d_t > 0. Then the solution λ_2 to

\sum_i \frac{b_i^2}{\lambda - d_i} = \frac{\eta}{d_t} \qquad (A.16)

will lead to F(λ_2) ≥ 0. Therefore, equations (A.15) and (A.16) lead to lower and upper bounds, respectively, on the value of λ that is a zero of F. Although the need to solve new equations may seem circular, note that equations (A.15) and (A.16) are far simpler than F.
A.6 Putting It All Together
A value for λ is found by solving equation (A.11). Then β is found using equation (A.9), and hence α = λβ. Finally, solve for v_1 using equation (A.5). We summarize in Algorithm 9.
Inputs: A, U, V, and a sparsity pattern for V
Outputs: Refined V

1: for j = 1, ···, n do
2:    Calculate a_j = (A^T U)_j, and calculate P corresponding to the jth column
3:    Solve for λ:  \frac{a_j^T (\lambda I - P)^{-1} a_j}{a_j^T (\lambda I - P)^{-2} a_j} = \frac{\eta}{a_j^T (\lambda I - P)^{-2} P a_j}
4:    Solve for β:  \beta = \frac{1}{\eta} a_j^T (\lambda I - P)^{-2} P a_j
5:    Solve for α:  \alpha = \lambda\beta
6:    Solve for V_j:  V_j = (\alpha I - \beta P)^{-1} a_j
7: end for

Algorithm 9: Sparse PCA Refinement Algorithm
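A single pass of Algorithm 9's inner step can be sketched as follows, using Brent's method with the bracketing suggested by equations (A.14)-(A.16); the function name refine_column and the dense linear algebra are illustrative simplifications, not the original implementation.

```python
import numpy as np
from scipy.optimize import brentq

def refine_column(a, P, eta):
    """One column update of the sparse-PCA refinement (a sketch of Algorithm 9's inner step).

    a   : the vector a_j = (A^T U)_j restricted to the rows allowed by the sparsity pattern.
    P   : the matrix I - V1 (V1^T V1 + V2^T V2)^{-1} V1^T for this column.
    eta : the regularization parameter from objective (A.1).
    """
    n = len(a)
    d, Q = np.linalg.eigh(P)                   # P = Q diag(d) Q^T, eigenvalues ascending
    b2 = (Q.T @ a) ** 2                        # squared coefficients b_i^2

    s = lambda lam: np.sum(b2 / (lam - d))     # decreasing for lam > max(d)
    def solve_s_equals(c):                     # root of s(lam) = c on (d_max, inf)
        lo = d[-1] + 1e-12
        hi = d[-1] + np.sum(b2) / c + 1e-12
        return brentq(lambda lam: s(lam) - c, lo, hi)

    d_pos = d[d > 1e-12]
    lam_hi = solve_s_equals(eta / d_pos[-1])   # equation (A.15): F(lam_hi) <= 0
    lam_lo = solve_s_equals(eta / d_pos[0])    # equation (A.16): F(lam_lo) >= 0

    def F(lam):                                # equation (A.11) written as a zero-finding problem
        r = lam - d
        return np.sum(b2 / r) / np.sum(b2 / r**2) - eta / np.sum(d * b2 / r**2)

    lam = brentq(F, lam_lo, lam_hi)
    M2 = np.linalg.matrix_power(lam * np.eye(n) - P, -2)
    beta = (a @ M2 @ (P @ a)) / eta            # equation (A.9)
    alpha = lam * beta
    return np.linalg.solve(alpha * np.eye(n) - beta * P, a)   # equation (A.5)
```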
A.7 Error Bound
Using the results above, we calculate:
‖a1 − v1‖2 = (a1 − v1)T (a1 − v1)
= aT1 a1 + vT1 v1 − 2aT1 v1
= aT1 a1 + aT1 (αI − βP )−2 a1 − 2aT1 (αI − βP )−1 a1
= aT1 a1 +1
β2aT1 (λI − P )−2 a1 −
2
βaT1 (λI − P )−1 a1
= aT1 a1 +η2[
aT1 (λI − P )−2 Pa1
]2aT1 (λI − P )−2 a1
−2ηa1 (λI − P )−1 a1
aT1 (λI − P )−2 Pa1
(A.17)
= aT1 a1 +
[aT1 (λI − P )−1 a1
]2aT1 (λI − P )−2 a1
− 2
[aT1 (λI − P )−1 a1
]2aT1 (λI − P )−2 Pa1
(A.18)
= aT1 a1 −[a1 (λI − P )−1 a1
]2aT1 (λI − P )−2 a1
= aT1 (λI − P )−1 (λI − P ) a1 −[a1 (λI − P )−1 a1
]2aT1 (λI − P )−2 a1
= λaT1 (λI − P )−1 a1 − aT1 (λI − P )−1 Pa1 −[a1 (λI − P )−1 a1
]2aT1 (λI − P )−2 a1
=
(λ− aT1 (λI − P )−1 a1
aT1 (λI − P )−2 a1
)aT1 (λI − P )−1 a1 − aT1 (I − P )−1 Pa1
=aT1 (λI − P )−2 Pa1
aT1 (λI − P )−2 a1
aT1 (λI − P )−1 a1 − aT1 (λI − P )−1 Pa1 (A.19)
= η − aT1 (λI − P )−1 Pa1
< η. (A.20)
Equation (A.17) uses equation (A.9). Equation (A.18) uses equation (A.11). Equation (A.19) uses the identity λ(λI − P)^{-2} − (λI − P)^{-1} = (λI − P)^{-2}P, which follows from (λI − P)^{-2}(λI − P) = (λI − P)^{-1}. Note that the bounds in line (A.14) imply that (λI − P)^{-1} is positive semidefinite, and so is P; hence inequality (A.20) follows.

This error bound provides insight into the tradeoff between accuracy and sparsity. In particular, notice that this bound implies the error disappears as η goes to 0.
Appendix B
An Efficient Implementation of the Generalized Minimum Residual Method for Stiff PDE Problems
B.1 Preconditioned GMRES
In this chapter, a potential optimization to the Generalized Minimum Residual Method (GMRES) is presented. Unlike the algorithms above, GMRES [88] is not a low-rank approximation algorithm; rather, it is a linear solver for large, sparse systems. GMRES, nevertheless, is a Krylov subspace method: an iterative method that builds a subspace at each iteration and seeks the best solution to the overall problem which lies in that subspace. In this sense, the Krylov subspace is a low-rank approximation to the entire system for the purpose of finding the solution to the system. This work is preliminary, and numeric experiments are not yet available. This algorithm was originally motivated by boundary-layer Navier-Stokes simulations using high order finite element methods.
GMRES, which is generally known to be one of the most stable iterative solvers, is often used to solve unsymmetric systems with potentially high condition numbers. As a result, GMRES is often applied to preconditioned systems. A preconditioned system is of the form

M^{-1} A x = M^{-1} b. \qquad (B.1)

The matrix M is known as a preconditioner and is chosen so that M is as close to A as possible, or, more specifically, so that the eigenvalues of M^{-1}A are as close to 1 as possible and not near 0. As a result, the system (B.1) is easier to solve than Ax = b.
Many strategies for preconditioning exist. The incomplete LU decomposition is a common choice of preconditioner for large, sparse, unstable linear systems because it is fast to compute and apply, and because it does not isolate parts of the system (for instance, a block preconditioner isolates the block diagonal components of a linear system and does not account for interaction between those blocks). The incomplete LU factorization takes the form

A ≈ LU,

where the factorization is performed from start to finish using a standard LU decomposition (as opposed to the truncated LU factorization discussed previously in this work), but only populates L and U where there are nonzero entries in A. Thus the factorization time is a function of the number of nonzeros in the matrix A, and is not an O(n³) algorithm. (Note that A is assumed to be square by the nature of the problem.) The rest of this chapter will assume that GMRES is being applied to a system preconditioned with incomplete LU.
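As an illustration of this setup (not of the optimization proposed below), the following sketch runs SciPy's GMRES with an incomplete LU preconditioner; the test system is an arbitrary stand-in for a discretized PDE Jacobian.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import gmres, spilu, LinearOperator

# A stand-in sparse system: a 1-D Laplacian plus a small unsymmetric perturbation.
n = 1000
A = (sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
     + 0.01 * sp.random(n, n, density=0.001, random_state=0)).tocsc()
b = np.ones(n)

ilu = spilu(A, drop_tol=1e-4)                      # incomplete LU factors, M = LU ~ A
M = LinearOperator(A.shape, matvec=ilu.solve)      # applies M^{-1} to a vector

x, info = gmres(A, b, M=M, restart=50)
print("converged" if info == 0 else f"GMRES stopped with info = {info}")
```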
B.2 A New Optimization
The most expensive step in GMRES is the calculation of a matrix-vector product, which is performed at each iteration until the algorithm converges. For stiff PDE problems, the rows and columns of A (and therefore L and U) must be given a careful ordering before factorization, because pivoting during iteration is generally too expensive (in terms of memory management) due to the size of the system. Thus discretization of the system likely implies that points in the domain that are close together (that is, interact through the PDE being simulated) should be as close as possible in the linear representation A. Furthermore, for convection-dominated problems, these elements may benefit from being ordered in the direction of the convection, allowing information to flow through the system A as the dynamics of the PDE would flow through the domain.
Inevitably, some parts of the linear system Ax = b will converge faster than other parts. This may occur in parts of the system that correspond to parts of the domain that are affected least by the dynamics of the PDE. Unstructured meshes, such as boundary layer meshes, provide a method for spending less time and fewer calculations on parts of the system where less change occurs. However, it remains unlikely that all parts of the system will converge simultaneously. Furthermore, unstructured meshes are likely more concerned with balancing accuracy against computation time than with computation time alone. We now present an algorithm that potentially optimizes GMRES for stiff PDE problems by turning off calculations, at the matrix-vector product step of GMRES, for parts of the system that have been determined to have converged before the entire system has converged.
As linear systems in general must converge simultaneously, care must be taken when determining whether parts of the domain have converged before the entire system has converged. For a finite element method, all nodes within an element will typically be ordered together, and each node may be represented with several entries in the system A. The algorithm presented here relies on the following assumption: if the part of the linear system corresponding to an element is contiguous in the linear system and the residual corresponding to that part of the system is below a given tolerance, then that element is considered to have converged. While it is possible for parts of a linear system to appear to have converged during iteration but to have actually achieved a small residual with an incorrect value by chance, it should be unlikely that the part of a system corresponding to an entire element would accidentally exhibit a small residual. Note, however, that the tolerance for the convergence of the part of a system corresponding to an element within the domain should be much smaller than the tolerance for the convergence of the entire system, if not machine precision. Note also that, for restarted GMRES, the residuals of individual elements can only easily be calculated at restarts.
Our goal is to compute the preconditioned matrix-vector product

x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = U^{-1} L^{-1} A b,

where the underlined blocks (x_2 and x_4 in this example) correspond to elements that have been deemed to have converged and hence do not need to be updated. There may be more back-and-forths between elements that have converged and elements that are still converging, but the form of the vector x above should be sufficient to generalize to all possibilities. We will need the following partitions:
L = \begin{pmatrix} L_{11} & & & \\ L_{21} & L_{22} & & \\ L_{31} & L_{32} & L_{33} & \\ L_{41} & L_{42} & L_{43} & L_{44} \end{pmatrix}, \qquad U = \begin{pmatrix} U_{11} & U_{12} & U_{13} & U_{14} \\ & U_{22} & U_{23} & U_{24} \\ & & U_{33} & U_{34} \\ & & & U_{44} \end{pmatrix},

A = \begin{pmatrix} A_1 & A_2 & A_3 & A_4 \end{pmatrix}, \qquad b = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}.
Here, the sizes of the partitions match the sizes of the partition of x. We will first solve

y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = L^{-1} A b,

and then solve the entire system for x. Operations will be performed from right to left because operating on a vector is significantly cheaper than matrix-matrix operations. Denote

L^{-1} = \begin{pmatrix} L_{11}^{-1} & & & \\ X_{21} & L_{22}^{-1} & & \\ X_{31} & X_{32} & L_{33}^{-1} & \\ X_{41} & X_{42} & X_{43} & L_{44}^{-1} \end{pmatrix}, \qquad U^{-1} = \begin{pmatrix} U_{11}^{-1} & Y_{12} & Y_{13} & Y_{14} \\ & U_{22}^{-1} & Y_{23} & Y_{24} \\ & & U_{33}^{-1} & Y_{34} \\ & & & U_{44}^{-1} \end{pmatrix}.
Then

y_1 = L_{11}^{-1} (A b)_1.

As before, the inverse indicates that this sub-system should be solved, and does not require solving explicitly for L_{11}^{-1}. As elements converge, the vector y should be stored in memory as well. Although the vector b may change at each iteration, by assumption the parts of b corresponding to the parts of x that have converged will be constant. Suppose now that y_2 has been deemed fixed and need not be calculated; then the next task is to compute y_3. If y_3 can be calculated, then once it is known, any ordering of fixed and unknown elements can be solved for. Write

\begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix} = A b.

Note that

L_{31} y_1 + L_{32} y_2 + L_{33} y_3 = a_3.

Then, knowing y_1 and y_2, we can solve

y_3 = L_{33}^{-1} (a_3 - L_{31} y_1 - L_{32} y_2).
If Ly is also stored in memory, then the expressions above need not be recalculated. Note that recalculation of y_4, which represents elements that have converged and that appear at the end of the vector y, can be avoided by simply terminating after the final un-converged element has been computed.

The above can be applied to any portions remaining in the vector y until the entire vector is found. To find x, the process is repeated in the reverse direction because U is upper triangular. For example, given that x_4 has been calculated or was previously fixed:

x_3 = U_{33}^{-1} (y_3 - U_{34} x_4).
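The block forward substitution described above can be sketched as follows; the dense sub-blocks and the names block_forward_solve, y_saved, and converged are illustrative assumptions (the backward sweep with U is analogous, proceeding from the last block to the first).

```python
import numpy as np

def block_forward_solve(L_blocks, a_blocks, y_saved, converged):
    """Forward substitution y = L^{-1} a over a block partition, skipping converged blocks.

    L_blocks[i][j] : dense sub-blocks of the lower-triangular factor, j <= i.
    a_blocks[i]    : blocks of the right-hand side a = A b.
    y_saved[i]     : previously computed y_i for blocks flagged as converged.
    converged[i]   : True if block i no longer needs to be recomputed.
    """
    y = [None] * len(a_blocks)
    for i, (a_i, done) in enumerate(zip(a_blocks, converged)):
        if done:
            y[i] = y_saved[i]                 # reuse the stored value; no work performed
            continue
        rhs = a_i.copy()
        for j in range(i):                    # subtract contributions of earlier blocks
            rhs -= L_blocks[i][j] @ y[j]
        y[i] = np.linalg.solve(L_blocks[i][i], rhs)   # solve the diagonal block
    return y
```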
There are many potential optimizations when simulating a PDE, and most may conflict with other possible optimizations. The choice of one preconditioner, for example, precludes the possibility of using a different preconditioner. Note, nevertheless, that the algorithm described here does not conflict with any other known optimizations. If all elements converge at about the same rate, then this algorithm will provide no discernible benefit. This algorithm, nevertheless, will not cause additional harm either, and has the potential for considerable time savings for stiff PDE problems.
Appendix C
Strassen’s Algorithm
As discussed in Section 2.8, there are various ways to compute matrix multiplication. Although a study of matrix multiplication is not necessary in this work, the main results in Part II depend on matrix multiplication. Strassen's Algorithm is presented here for two reasons: demonstration that the computational complexity of SRLU can be improved by using the Strassen form of matrix multiplication in place of DGEMM calls, and because Strassen's Algorithm is one of the most unexpected and incredible results encountered by this author during his PhD studies.
The algorithm is defined for square matrices of size 2^k-by-2^k for some k, but can easily be adjusted for all matrices. The matrix product

\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}

can be calculated by first computing
P1 = (A11 + A22) (B11 + B22) ,
P2 = (A21 + A22) B11,
P3 = A11 (B12 −B22) ,
P4 = A22 (B21 −B11) ,
P5 = (A11 + A12) B22,
P6 = (A21 −A11) (B11 + B12) , and
P7 = (A12 −A22) (B21 + B22) .
Then
C11 = P1 + P4 −P5 + P7
C12 = P3 + P5
C21 = P2 + P4
C22 = P1 −P2 + P3 + P6.
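A direct recursive implementation of the formulas above is short; the sketch below assumes the matrix dimension is a power of two and falls back to ordinary multiplication below a cutoff, which is how Strassen's method is typically used in practice.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's algorithm for square matrices whose size is a power of two."""
    n = A.shape[0]
    if n <= cutoff:                      # fall back to ordinary multiplication on small blocks
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    P1 = strassen(A11 + A22, B11 + B22, cutoff)
    P2 = strassen(A21 + A22, B11, cutoff)
    P3 = strassen(A11, B12 - B22, cutoff)
    P4 = strassen(A22, B21 - B11, cutoff)
    P5 = strassen(A11 + A12, B22, cutoff)
    P6 = strassen(A21 - A11, B11 + B12, cutoff)
    P7 = strassen(A12 - A22, B21 + B22, cutoff)

    C11 = P1 + P4 - P5 + P7
    C12 = P3 + P5
    C21 = P2 + P4
    C22 = P1 - P2 + P3 + P6
    return np.block([[C11, C12], [C21, C22]])

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)    # agrees with the conventional product
```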
Calculation of each P_i requires one matrix-matrix multiplication of size 2^{k-1}-by-2^{k-1}, and so if the equations above are applied recursively then the amount of time spent in multiplication for an n-by-n matrix is

T(n) \approx 7 \cdot T\left(\frac{n}{2}\right) \approx 7^{\log_2(n)} \cdot T(1) \approx O\left(n^{\log_2 7}\right).

Here the approximation is due to the lower-order operations needed as well, but when all operations are considered the order of complexity remains O(n^{\log_2 7}) [47]. The standard three-nested-loop algorithm is an O(n³) algorithm, and the definition and conventional algorithm for matrix multiplication give no indication that this complexity could be improved upon. Newer algorithms have been discovered with even smaller complexity, although these algorithms have no practical implementation. It remains an open conjecture that matrix-matrix multiplication can be arbitrarily close to or achieve O(n²) [106].
Appendix D
A Visualization of SRLU
This appendix steps through TRLUCP, listing for each stage the BLAS/LAPACK routine used and the approximate flop count.

Initialization (DGEMM, O(pmn) flops): form R = ΩA, where Ω is the p-by-m random matrix and A is the m-by-n data matrix.

Iterate j = 0 : b : (k − b):

1. Column selection (DGETRF or RRQR): O((n − j)pb) flops, applied to R.
2. Partial Schur update (DGEMM): 2(m − j)jb flops.
3. LU factorization (DGETRF): (2/3)(m − j)b² flops.
4. Update U (DGEMM): 2bj(n − (j + b)) flops.
5. Update R (DGEMM): O(pb(n − (j + b))) flops.

Multiple low-order operations.

TOTAL FLOPS: 2pmn + (m + n)k² + (low order)
Bibliography
[1] D. Achlioptas. “Database-friendly random projections: Johnson-Lindenstrauss withbinary coins”. In: J. Comput. Syst. Sci. 66.4 (2003), pp. 671–687.
[2] K. J. Ahn, S. Guha, and A. McGregor. “Graph sketches: sparsification, spanners, andsubgraphs.” In: PODS. ACM, 2012, pp. 5–14.
[3] N. Ailon and B. Chazelle. “The Fast Johnson–Lindenstrauss Transform and Approx-imate Nearest Neighbors”. In: SIAM J. Comput. 39.1 (2009), pp. 302–322.
[4] N. Ailon and E. Liberty. “Fast Dimension Reduction Using Rademacher Series onDual BCH Codes.” In: Discrete and Computational Geometry 42.4 (Dec. 17, 2009),pp. 615–630.
[5] Y. Aizenbud, G. Shabat, and A. Averbuch. “Randomized LU Decomposition UsingSparse Projections.” In: CoRR abs/1601.04280 (2016).
[6] D. G. Anderson and M. Gu. "An Efficient, Sparsity-Preserving, Online Algorithm for Data Approximation". In: CoRR (2015). URL: https://arxiv.org/abs/1602.05950.
[7] D. G. Anderson, M. Gu, and C. Melgaard. “An Efficient Algorithm for UnweightedSpectral Graph Sparsification.” In: CoRR abs/1410.4273 (2014).
[8] D. G. Anderson et al. "Spectral Gap Error Bounds for Improving CUR Matrix Decomposition and the Nyström Method." In: AISTATS. Vol. 38. JMLR Proceedings. 2015.
[9] H. Avron and C. Boutsidis. “Faster Subset Selection for Matrices and Applications”.In: CoRR abs/1201.0127 (2012).
[10] K. C. Barr and K. Asanovic. “Energy-aware lossless data compression.” In: ACMTrans. Comput. Syst. 24.3 (Oct. 22, 2008), pp. 250–291.
[11] J. Batson, D. A. Spielman, and N. Srivastava. “Twice-Ramanujan Sparsifiers”. In:SIAM Journal on Computing 41.6 (2012), pp. 1704–1721.
[12] J. D. Batson, D. A. Spielman, and N. Srivastava. “Twice-Ramanujan Sparsifiers.” In:SIAM J. Comput. 41.6 (2012), pp. 1704–1721.
[13] J. D. Batson et al. “Spectral sparsification of graphs: theory and algorithms.” In:Commun. ACM 56.8 (2013), pp. 87–94.
[14] A. A. Benczúr and D. R. Karger. "Approximating s-t Minimum Cuts in Õ(n²) Time." In: STOC. ACM, 1996, pp. 47–55.
[15] M. W. Berry, Z. Drmac, and E. R. Jessup. “Matrices, Vector Spaces, and InformationRetrieval”. In: SIAM Review 41.2 (1999), pp. 335–362.
[16] M. W. Berry, S. A. Pulatova, and G. W. Stewart. “Algorithm 844: Computing sparsereduced-rank approximations to sparse matrices.” In: ACM Trans. Math. Softw. 31.2(2005), pp. 252–269.
[17] C. Boutsidis, P. Drineas, and M. Magdon-Ismail. “Near-Optimal Column-Based Ma-trix Reconstruction”. In: CoRR abs/1103.0995 (2011).
[18] C. Boutsidis, P. Drineas, and M. Magdon-Ismail. “Near-Optimal Column-Based Ma-trix Reconstruction.” In: SIAM J. Comput. 43.2 (2014), pp. 687–717.
[19] R. P. Brent. “An algorithm with guaranteed convergence for finding a zero of a func-tion”. In: The Computer Journal 14.4 (1971), pp. 422–425.
[20] E. J. Candes and T. Tao. “The power of convex relaxation: near-optimal matrixcompletion”. In: IEEE Transactions on Information Theory 56.5 (2010), pp. 2053–2080.
[21] E. J. Cands and B. Recht. “Exact matrix completion via convex optimization.” In:Commun. ACM 55.6 (2012), pp. 111–119.
[22] E. J. Cands et al. “Robust Principal Component Analysis?” In: CoRR abs/0912.3599(2009).
[23] E. Carson et al. Write-Avoiding Algorithms. Tech. rep. UCB/EECS-2015-163. EECSDepartment, University of California, Berkeley, 2015.
[24] T. F. Chan. “Rank revealing QR factorizations”. In: Linear algebra and its applica-tions 88/89 (1987), pp. 67–82.
[25] H. Cheng et al. “On the Compression of Low Rank Matrices.” In: SIAM J. ScientificComputing 26.4 (2005), pp. 1389–1404.
[26] F. Chierichetti, S. Lattanzi, and A. Panconesi. “Rumour Spreading and Graph Con-ductance.” In: SODA. SIAM, 2010, pp. 1657–1663.
[27] E. Chow and A. Patel. “Fine-Grained Parallel Incomplete LU Factorization.” In:SIAM J. Scientific Computing 37.2 (2015).
[28] P. Christiano et al. “Electrical flows, laplacian systems, and faster approximation ofmaximum flow in undirected graphs”. In: STOC. ACM, 2011, pp. 273–282.
[29] A. Civril and M. Magdon-Ismail. “Deterministic Sparse Column Based Matrix Re-construction via Greedy Approximation of SVD”. In: Algorithms and Computation.Vol. 5369. Lecture Notes in Computer Science. 2008, pp. 414–423.
[30] K. L. Clarkson and D. P. Woodruff. “Low Rank Approximation and Regression inInput Sparsity Time”. In: CoRR abs/1207.6365 (2012).
[31] A. Dasgupta, R. Kumar, and T. Sarlos. “A sparse Johnson-Lindenstrauss transform”. In: STOC. ACM, 2010, pp. 341–350.
[32] S. Dasgupta and A. Gupta. “An elementary proof of a theorem of Johnson and Lindenstrauss.” In: Random Struct. Algorithms 22.1 (2003), pp. 60–65.
[33] T. A. Davis and Y. Hu. “The University of Florida Sparse Matrix Collection”. In: ACM Transactions on Mathematical Software 38 (1 2011), pp. 1–25. url: http://www.cise.ufl.edu/research/sparse/matrices.
[34] J. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
[35] A. Deshpande and S. Vempala. “Adaptive Sampling and Fast Low-Rank Matrix Approximation.” In: APPROX-RANDOM. Vol. 4110. Lecture Notes in Computer Science. Springer, 2006, pp. 292–303.
[36] A. Deshpande et al. “Matrix Approximation and Projective Clustering via Volume Sampling.” In: Theory of Computing 2.12 (2006), pp. 225–247.
[37] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. “Relative-Error CUR Matrix Decompositions.” In: SIAM J. Matrix Analysis Applications 30.2 (2008), pp. 844–881.
[38] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct methods for sparse matrices. 2nd ed. Oxford Science Publications. New York: The Clarendon Press, Oxford University Press, 1989, pp. xiv+341. isbn: 0-19-853421-3.
[39] C. Eckart and G. Young. “The approximation of one matrix by another of lower rank”. In: Psychometrika 1.3 (1936), pp. 211–218.
[40] Office of Science, U.S. Department of Energy. Configuration. https://www.nersc.gov/users/computational-systems/edison/configuration/. Accessed on August 22, 2015. Published on March 30, 2015.
[41] S. Fine and K. Scheinberg. “Efficient SVM Training Using Low-Rank Kernel Representations.” In: Journal of Machine Learning Research 2 (2001), pp. 243–264.
[42] L. V. Foster. “The growth factor and efficiency of Gaussian elimination with rook pivoting”. In: J. of Comp. and Appl. Math. 86 (1997), pp. 177–194.
[43] P. Frankl and H. Maehara. “The Johnson-Lindenstrauss lemma and the sphericity ofsome graphs.” In: J. Comb. Theory, Ser. B 44.3 (1988), pp. 355–362.
[44] A. M. Frieze, R. Kannan, and S. Vempala. “Fast monte-carlo algorithms for finding low-rank approximations.” In: J. ACM 51.6 (2004), pp. 1025–1041.
[45] T. Fruchterman and E. Reingold. “Graph Drawing by Force-Directed Placement”. In: Software–Practice & Experience 21.11 (1991), pp. 1129–1164.
[46] P. E. Gill et al. “Maintaining LU Factors of a General Sparse Matrix.” In: Linear Algebra and Its Applications 88 (1987), pp. 239–270.
[47] G. H. Golub and C. F. Van Loan. Matrix Computations. 4th ed. JHU Press, 2013.
[48] J. Gondzio. “Stable algorithm for updating dense LU factorization after row or column exchange and row and column addition or deletion”. In: Optimization: A Journal of Mathematical Programming and Operations Research (2007), pp. 7–26.
[49] L. Grigori, J. Demmel, and X. S. Li. “Parallel Symbolic Factorization for Sparse LU with Static Pivoting.” In: SIAM J. Scientific Computing 29.3 (2007), pp. 1289–1314.
[50] M. Gu. “Subspace Iteration Randomization and Singular Value Problems.” In: SIAM J. Scientific Computing 37.3 (2015).
[51] M. Gu and S. C. Eisenstat. “Efficient algorithms for computing a strong rank-revealing QR factorization”. In: SIAM J. Sci. Comput. 17.4 (1996), pp. 848–869.
[52] N. Halko, P.-G. Martinsson, and J. A. Tropp. “Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions”. In: SIAM Review 53.2 (2011), pp. 217–288.
[53] N. J. Higham. Accuracy and stability of numerical algorithms. 2nd ed. SIAM, 2002, pp. I–XXX, 1–680. isbn: 978-0-89871-521-7.
[54] N. J. Higham and S. D. Relton. “Estimating the Largest Elements of a Matrix.” In: SIAM J. Scientific Computing 38.5 (2016).
[55] Y. P. Hong and C.-T. Pan. “Rank-revealing QR factorizations and the singular value decomposition”. In: Mathematics of Computation 58 (1992), pp. 213–232.
[56] H. Hotelling. “Some New Methods in Matrix Calculation”. In: Ann. Math. Stat. 14 (1943), pp. 1–34.
[57] C.-J. Hsieh and P. A. Olsen. “Nuclear Norm Minimization via Active Subspace Selection.” In: ICML. Vol. 32. JMLR Proceedings. 2014, pp. 575–583.
[58] T.-M. Hwang, W.-W. Lin, and E. K. Yang. “Rank Revealing LU Factorizations”. In: Linear Algebra and Its Applications 175 (1992), pp. 115–141.
[59] P. Indyk and R. Motwani. “Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality”. In: STOC. 1998, pp. 604–613.
[60] P. Indyk and A. Naor. “Nearest-neighbor-preserving embeddings.” In: ACM Transactions on Algorithms 3.3 (2007).
[61] W. B. Johnson and J. Lindenstrauss. “Extensions of Lipschitz Mappings into a Hilbert Space”. In: Contemporary Mathematics 26 (1984), pp. 189–206.
[62] D. M. Kane and J. Nelson. “Sparser Johnson-Lindenstrauss transforms”. In: SODA. SIAM, 2012, pp. 1195–1206.
[63] M. Kapralov and R. Panigrahy. “Spectral sparsification via random spanners.” In:ITCS. ACM, 2012, pp. 393–398.
[64] A. Khabou et al. “LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version.” In: SIAM J. Matrix Analysis Applications 34.3 (2013), pp. 1401–1429.
[65] B. Klartag and S. Mendelson. “Empirical processes and random projections”. In: J. Funct. Anal. 225 (2005), pp. 229–245.
[66] C. Leiserson. “Fat-trees: Universal Networks for Hardware-efficient Supercomputing”.In: IEEE Trans. Comput. 34.10 (1985), pp. 892–901.
[67] J. Leskovec and A. Krevl. SNAP Datasets: Stanford Large Network Dataset Collection.http://snap.stanford.edu/data. Oct. 2014.
[68] N. Li, Y. Saad, and E. Chow. “Crout Versions of ILU for General Sparse Matrices”. In: SIAM J. Sci. Comput. 25.2 (2003), pp. 716–728.
[69] E. Liberty et al. “Randomized algorithms for the low-rank approximation of matrices”. In: Proceedings of the National Academy of Sciences 104.51 (2007), p. 20167.
[70] M. Lichman. UCI Machine Learning Repository. 2013. url: http://archive.ics.uci.edu/ml.
[71] C. Lv and Q. Zhao. “Integration of Data Compression and Cryptography: Another Way to Increase the Information Security.” In: AINA Workshops (2). IEEE Computer Society, 2007, pp. 543–547.
[72] M. W. Mahoney and P. Drineas. “CUR matrix decompositions for improved data analysis”. In: Proceedings of the National Academy of Sciences 106.3 (2009), pp. 697–702.
[73] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Interlacing Families II: Mixed Characteristic Polynomials and the Kadison-Singer Problem”. In: CoRR abs/1306.3969 (2014).
[74] P.-G. Martinsson, V. Rokhlin, and M. Tygert. “A Randomized Algorithm for the Approximation of Matrices.” In: Tech. Rep., Yale University, Department of Computer Science 1361 (2006).
[75] M. Mathioudakis et al. “Sparsification of influence networks.” In: KDD. ACM, 2011, pp. 529–537.
[76] MathWorks. The Gatlinburg and Householder Symposia. https://www.mathworks.com/company/newsletters/articles/the-gatlinburg-and-householder-symposia.html. Accessed on December 10, 2016. Published in 2013.
[77] J. Matousek. “On variants of the Johnson-Lindenstrauss lemma”. In: Random Struct. Algorithms 33.2 (2008), pp. 142–156.
[78] C. Melgaard and M. Gu. “Gaussian Elimination with Randomized Complete Pivoting.” In: CoRR abs/1511.08528 (2015).
[79] L. Miranian and M. Gu. “Strong rank revealing LU factorizations”. In: Linear Algebra and its Applications 367 (2003), pp. 1–16.
[80] L. Miranian and M. Gu. “Strong Rank Revealing LU Factorizations.” In: Linear Algebra and its Applications 367 (2002), pp. 1–16.
[81] NASA. NASA Celebrates 50 Years of Spacewalking. https://www.nasa.gov/image-feature/nasa-celebrates-50-years-of-spacewalking. Accessed on August 22, 2015. Published on June 3, 2015. Original photograph from February 7, 1984.
[82] J. v. Neumann and H. H. Goldstine. “Numerical Inverting of Matrices of High Order”. In: Bull. Amer. Math. Soc. 53 (1947), pp. 1021–1099.
[83] C.-T. Pan. “On the existence and computation of rank-revealing LU factorizations”. In: Linear Algebra and its Applications 316.1 (2000), pp. 199–222.
[84] C. H. Papadimitriou et al. “Latent Semantic Indexing: A Probabilistic Analysis.” In: J. Comput. Syst. Sci. 61.2 (2000), pp. 217–235.
[85] L. Parsons, E. Haque, and H. Liu. “Subspace clustering for high dimensional data: a review”. In: ACM SIGKDD Explorations Newsletter 6.1 (2004), pp. 90–105.
[86] W. B. Pennebaker and J. L. Mitchell. JPEG Still Image Data Compression Standard. Norwell, MA, USA: Kluwer Academic Publishers, 1992.
[87] B. Recht, M. Fazel, and P. A. Parrilo. “Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization.” In: SIAM Review 52.3 (2010), pp. 471–501.
[88] Y. Saad. Iterative Methods for Sparse Linear Systems. 2nd ed. SIAM, 2003.
[89] T. Sarlos. “Improved Approximation Algorithms for Large Matrices via Random Projections.” In: FOCS. IEEE Computer Society, 2006, pp. 143–152.
[90] Silicon Valley. https://www.amazon.com/Silicon-Valley-Season-1/dp/B00M4ZPZPY. Nov. 2016.
[91] D. A. Spielman and N. Srivastava. “Graph Sparsification by Effective Resistances.” In: SIAM J. Comput. 40.6 (2011), pp. 1913–1926.
[92] D. A. Spielman and S.-H. Teng. “Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems”. In: STOC’04. 2004, pp. 81–90.
[93] D. A. Spielman and S.-H. Teng. “Nearly-Linear Time Algorithms for Preconditioning and Solving Symmetric, Diagonally Dominant Linear Systems”. In: CoRR abs/cs/0607105 (2006).
[94] D. A. Spielman and S.-H. Teng. “Solving Sparse, Symmetric, Diagonally-Dominant Linear Systems in Time O(m^1.31)”. In: CoRR cs.DS/0310036 (2003).
[95] D. A. Spielman and S.-H. Teng. “Spectral Sparsification of Graphs”. In: ().
[96] P. Stange, A. Griewank, and M. Bollhofer. “On the Efficient Update of Rectangular LU-Factorizations Subject to Low Rank Modifications.” In: Electronic Transactions on Numerical Analysis 26 (2007), pp. 161–177.
[97] G. W. Stewart. “An updating algorithm for subspace tracking.” In: IEEE Trans. Signal Processing 40.6 (1992), pp. 1535–1541.
[98] G. W. Stewart. “The QLP Approximation to the Singular Value Decomposition.” In: SIAM J. Scientific Computing 20.4 (1999), pp. 1336–1348.
[99] V. Strassen. “Gaussian elimination is not optimal”. In: Numerische Mathematik 13.4 (1969), pp. 354–356.
[100] D. S. Taubman and M. W. Marcellin. JPEG2000: Image Compression Fundamentals, Standards, and Practice. Boston: Kluwer Academic Publishers, 2002.
[101] A. M. Turing. “Rounding-Off Errors in Matrix Processes”. In: Quart. J. Appl. Math. (1948), pp. 287–308.
[102] M. Udell et al. “Generalized Low Rank Models.” In: Foundations and Trends in Machine Learning 9.1 (2016), pp. 1–118.
[103] C. Van Loan. “Computing the CS and the Generalized Singular Value Decompositions”. In: Numerische Mathematik 46.4 (1985), pp. 479–491.
[104] S. S. Vempala. The Random Projection Method. Vol. 65. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. DIMACS/AMS, 2004, pp. 1–103.
[105] J. H. Wilkinson. “Error Analysis of Direct Methods of Matrix Inversion.” In: J. ACM8.3 (1961), pp. 281–330.
[106] V. V. Williams. Breaking the Coppersmith-Winograd barrier. Manuscript. 2012.
[107] D. P. Woodruff. “Sketching as a Tool for Numerical Linear Algebra.” In: Foundations and Trends in Theoretical Computer Science 10.1-2 (2014), pp. 1–157.
[108] F. Woolfe et al. “A fast randomized algorithm for the approximation of matrices”. In: Applied and Computational Harmonic Analysis 25.3 (2008), pp. 335–366.
[109] J. Xiao. On Reliability of Randomized QR Factorization with Column Pivoting. Matrix Computations and Scientific Computing Seminar, Berkeley, California, October 5, 2016.
[110] H. Zou, T. Hastie, and R. Tibshirani. “Sparse principal component analysis”. In: Journal of Computational and Graphical Statistics 15 (2006), pp. 262–286.