

The Local Marchenko-Pastur Law

for Sparse Covariance Matrices

Ben Adlam

Department of Mathematics, Harvard University
Cambridge MA 02138, USA
[email protected]

March 25, 2013

We review the resolvent method for proving local laws of random matrix ensembles by covering all the necessary prerequisites and outlining the general strategy. With this machinery we consider $N \times N$ sparse covariance matrices $H := X^*X$, where the moments of the entries of $X = (x_{ij})$ decay slowly at a rate inversely proportional to the sparseness parameter $q$. We calculate the higher moments of the entries of $H$ to their leading order and reproduce a large deviation estimate for quadratic forms involving the random variables $x_{ij}$. After rescaling $H$, so its bulk eigenvalues are typically of order one, we prove the new result that, as long as $q$ grows at least logarithmically with $N$, the density of the ensemble's eigenvalues is given by the Marchenko-Pastur law for spectral windows of length larger than $N^{-1}$ (up to logarithmic corrections). Finally, we give a new formal calculation to characterize the diffusion process undergone by the eigenvalues of a covariance matrix $H := X^*TX$ when the entries of $X$ are undergoing independent Brownian motions, and show that the process is independent of the eigenvectors only when $T = c\mathbb{1}$ for some $c \in \mathbb{R}$.


Contents

1 Introduction

2 Preliminaries
2.1 The Stieltjes transform
2.2 Resolvent formalism
2.3 The Marchenko-Pastur distribution
2.4 The general strategy

3 Definitions and results

4 Moment calculations

5 Large deviation estimates

6 Local Marchenko-Pastur law
6.1 Preliminary lemmas
6.2 Basic estimates on the event $\Omega(z)$
6.3 Stability of the self-consistent equation on $\Omega(z)$
6.4 Initial estimates for large $\eta$
6.5 Continuity argument: conclusion of the proof of Theorem 6.2

7 A Brownian motion model for covariance matrices

8 Derivation of the Marchenko-Pastur equation

9 Acknowledgements


1. Introduction

A random matrix is a random variable that takes values in a space of matrices [4]. The most common example of a random matrix is more concrete than the above definition: construct a random $N \times N$ matrix $H_N = (h_{ij})$ by sampling its entries $h_{ij}$ according to some known probability distributions and enforcing a structural constraint (such as symmetry) to ensure its eigenvalues are real. We refer to the sequence of matrices $(H_N)_{N=1}^{\infty}$ as an ensemble. Simplifying this construction further, we can require that the entries of the matrix are independent of one another. Remarkably, under few assumptions about the distributions of the entries, when the dimension of a random matrix becomes large its spectral statistics can be described very precisely. In particular, if the matrix entries are properly normalized, then as the dimension of the matrix tends to infinity its spectrum (basically, a histogram of the eigenvalues) will fill out a compact region of the real line with positive density.

The location and density of its eigenvalues can be specified at both a global scale, that is, for intervals of order one, and a local scale, that is, for intervals of the same order as the inverse of the dimension (up to logarithmic factors). With a local law, which describes the density of the eigenvalues of a random matrix on a short scale, it is possible to say how close the spectrum of a random matrix is likely to be to its limiting profile. It is this type of result we present in this paper. There are many more spectral statistics: for example, we can make precise statements about the eigenvectors, the spectral spacing, the correlations between eigenvalues, and the distribution of the largest eigenvalue. Interestingly, the asymptotic spectral statistics are often independent of the distribution of the entries and behave in the same way as in the Gaussian case (which is a canonical point of comparison), a property referred to as universality.

Eugene Wigner's seminal work in the 1950s is considered by many to be the beginning of random matrix theory [22]. Wigner introduced random matrices to model the spectra of heavy atoms. He postulated that the gaps between the lines in the spectrum of a heavy atom should resemble the spacings between the eigenvalues of a random matrix; in Section 7 we see very clearly how the spacing between eigenvalues behaves like electrostatic repulsion. The idea of universality can be found in Wigner's original work, in which he postulated that the spacings of the spectrum should not depend on anything but the symmetry class of the underlying evolution. The first ingredient for the modern approach to proving universality for random matrices is a local law, so a natural extension of this research would be to prove universality for this paper's ensemble.

While Wigner's work formulated random matrix theory in more modern terms, it is possible to trace the origins of random matrix theory back to the 1920s. The statistician John Wishart worked in multivariate statistics, and he introduced random matrices to analyze large samples of data [21]. Suppose we have some centered $M$-dimensional data from which we take $N$ independent, identically distributed samples and form the $M \times N$ data matrix

$$X := [\mathbf{x}_1, \ldots, \mathbf{x}_N],$$

where

$$\mathbf{x}_i := (x_{1i}, \ldots, x_{Mi})^* \in \mathbb{R}^M.$$

Suppose we wish to understand correlations between the data, so we define the $M \times M$ population covariance matrix

$$[\Sigma_{ij}] = \Sigma := \mathbb{E}\,\mathbf{x}_1\mathbf{x}_1^*,$$

so that

$$\Sigma_{ij} = \mathbb{E}\,x_{i1}x_{j1},$$

that is, the covariance between the $i$th and $j$th components of the distribution we are drawing our samples from. How well can we estimate $\Sigma$ with the sample covariance matrix

$$H := \frac{1}{N}XX^*?$$

In the classical case, where $M$ is fixed and $N$ is very large, we can use the law of large numbers to conclude that

$$H \to \Sigma,$$

since

$$H_{ij} = \frac{1}{N}\sum_{k=1}^{N} x_{ik}x_{jk} \to \mathbb{E}\,x_{i1}x_{j1} = \Sigma_{ij}.$$

So, we have convergence in each entry. Denote the spectra of the two matrices as follows: let $(\sigma_\alpha)_{\alpha=1}^M$ be the ordered spectrum of the population covariance matrix and $(\lambda_\alpha)_{\alpha=1}^M$ be the ordered spectrum of the sample covariance matrix. Then, a simple corollary is that $\lambda_\alpha$ converges to $\sigma_\alpha$ with at most Gaussian fluctuations of order $N^{-1/2}$ by the central limit theorem. So, not only can we estimate the population covariance matrix with the sample covariance matrix, but we have an idea of how many samples we need for a desired level of accuracy.

Now we move to the high-dimensional case; instead of fixing $M$ we allow it to be a function of $N$ that goes to infinity with $N$ such that

$$\lim_{N\to\infty}\frac{M_N}{N} = \gamma \in (0, \infty).$$

This setting is naturally motivated by applications such as genomic sequencing, wireless communication, and finance, where the data are high-dimensional and the sample sizes are limited. Random matrices have many applications in these areas; see, for example, [2] for a summary.

Often what we are interested in for applications (such as principal component analysis) is not the entries of the matrix $\Sigma$ but its eigenvalues or its inverse; or, even more specifically, its largest eigenvalues. So, we focus on convergence of the spectrum and not of the entries of the matrix. It is thus natural to ask: does the spectrum of the sample covariance matrix approximate the spectrum of the population covariance matrix? The answer is no, not in the high-dimensional case. Even if we deal with the simplest case and assume that each entry $x_{ij}$ has mean zero, variance one, and that they are all independent, which is referred to as the null covariance case, we obtain a negative answer. In this case, it is clear that $\Sigma = \mathbb{1}$, but the spectrum of the sample covariance matrix converges almost surely to a deterministic distribution, called the Marchenko-Pastur distribution, which is supported on a compact interval containing 1. The distribution $\mu_{\mathrm{mp}}$ is presented at length in Subsection 2.3, but for now we note that the density of the Marchenko-Pastur distribution has a closed-form expression and is supported on the compact interval $[\lambda_-, \lambda_+]$, possibly with a point mass at 0: the density is given by

$$f_{\mathrm{mp}}(t) = \frac{\sqrt{(\lambda_+ - t)(t - \lambda_-)}}{2\pi\gamma t}\,\mathbb{1}_{[\lambda_-, \lambda_+]}(t),$$

where

$$\lambda_\pm := (1 \pm \sqrt{\gamma})^2.$$
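To make this concrete, the following minimal numerical sketch (Python with NumPy; an illustration we add here, not part of the thesis, and all names are ours) samples a null sample covariance matrix and compares its eigenvalue histogram with the density $f_{\mathrm{mp}}$ above.

```python
import numpy as np

# Minimal sketch of the Marchenko-Pastur law in the null covariance case.
M, N = 200, 2000            # dimension M and sample size N, so gamma = M/N = 0.1
gamma = M / N
X = np.random.randn(M, N)   # independent entries with mean zero, variance one
H = X @ X.T / N             # sample covariance matrix (M x M)
eigs = np.linalg.eigvalsh(H)

lam_minus, lam_plus = (1 - np.sqrt(gamma))**2, (1 + np.sqrt(gamma))**2

def f_mp(t):
    # Marchenko-Pastur density, supported on [lam_minus, lam_plus]
    return np.sqrt(np.maximum((lam_plus - t) * (t - lam_minus), 0)) / (2 * np.pi * gamma * t)

# Compare the empirical eigenvalue histogram with the limiting density;
# the discrepancy should be small for large M and N.
hist, edges = np.histogram(eigs, bins=30, density=True)
mids = (edges[:-1] + edges[1:]) / 2
print(np.max(np.abs(hist - f_mp(mids))))
```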

We have plotted the function in Figure 1 and, as expected, its shape depends on $\gamma$.

There has been much work on covariance matrices since Wishart. Marchenko and Pastur proved a global law for generalized covariance matrices in [17], and thus the distribution carries their names. With a quite distinct approach, Silverstein and Bai proved a global law for generalized covariance matrices in [19, 20] under milder assumptions than [17].


(Figure 1 here: two panels plotting the Marchenko-Pastur density $f_{\mathrm{mp}}(t)$ against $t$; panel (a) for $\gamma = 0.1$ and panel (b) for $\gamma = 1.5$.)

Figure 1: The figure shows the Marchenko-Pastur density for two different values of $\gamma$. Notice that as $\gamma$ becomes larger the length of the density's support increases. The vertical line at 1 is where we would naively expect the eigenvalues of a null sample covariance matrix to concentrate, but we can see that the eigenvalues spread out on a compact interval containing 1.

Recently, Pillai and Yin have proved universality for null case covariance matrices $X^*X$ in [16], and in doing so they had to prove a local law for the ensemble. However, they assume that the entries of the matrix $X$ have sub-exponential decay. Sub-exponential decay of the matrix entries is a common assumption when proving local laws, but it was weakened in the papers [7, 8], where Erdős, Knowles, Yau, and Yin prove universality (and thus a local law also) for sparse matrices. Sparse matrices, which include as a special case the adjacency matrices of Erdős-Rényi random graphs, have entries whose higher moments decay at a rate proportional to inverse powers of the sparseness parameter $q$ and not $N^{1/2}$. We combine these two research directions in Section 3 of this paper by defining sparse covariance matrices and stating our results for the ensemble.

We can summarize the paper's content briefly as follows: in Section 2 we review some of the basic tools required to prove local laws for random matrices using the resolvent method. In particular, we state and prove the basic properties of the Stieltjes transform and explain why it is particularly useful in random matrix theory. We discuss resolvent formalism by stating the resolvent identities and proving the most commonly used observations about resolvents. Then we discuss the Marchenko-Pastur distribution in more detail and state the properties we need for our main proof. Finally, we summarize the general method for using resolvents.

In Section 3, we state our main result, a weak local law for sparse covariance matrices, which generalizes the weak local law proved in [16] and holds on the optimal scale $\eta \gtrsim M^{-1}$ uniformly up to the edge. We prove this result in Section 6, and the proof follows the strategy outlined in Subsection 2.4. Covariance matrices are unusual, as they have two closely related ensembles: $H := X^*X$ and $\bar{H} := XX^*$, where $X$ is an $M \times N$ sparse matrix with independent entries. As we show in Section 3, the nonzero eigenvalues of $H$ and $\bar{H}$ are identical, despite the difference in their dimension. This implies the following relationship between their Stieltjes transforms, denoted by $m_N$ and $m_M$ respectively:

$$m_N(z) = \left(\frac{M}{N} - 1\right)\frac{1}{z} + \frac{M}{N}\,m_M(z), \qquad (1.1)$$


which tells us that controlling $m_N$ is equivalent to controlling $m_M$. However, in our proof, because of the form of the large deviation estimate, Lemma 5.1, we are forced to control both the entries of $G(z) := (H - z\mathbb{1})^{-1}$ and $\bar{G}(z) := (\bar{H} - z\mathbb{1})^{-1}$ simultaneously. Since no simple formula such as (1.1) is available for the matrix entries of $G$ and $\bar{G}$, controlling both types of entry simultaneously is nontrivial, and to our knowledge this has not been done before.

Our next result is a leading order bound on the moments of the entries of sparse covariance matrices. We calculate this bound in Section 4 using simple combinatorial arguments, and find that the moments of the off-diagonal entries of $H$ decay inversely proportionally to $q^2$ (for $q < M^{1/4}$) and that the moments of the on-diagonal entries of $H$ are of order one. The bound on the off-diagonal entries plateaus when $q \ge M^{1/4}$ to a decay inversely proportional to $M^{1/2}$, which essentially tells us that when $q \ge M^{1/4}$ the matrix $H$ is no longer sparse. Our proof in Section 6 relies on a large deviation estimate, which was first proved in [7], so we state and reprove the large deviation estimate in Section 5 along with a simplified proof for one of the large deviation estimates originally given in [9].

In Section 7 we present a model of Dyson's Brownian motion for generalized covariance matrices $X^*TX$. The section's results are both positive and negative. We present a new formal calculation of the joint distribution of the diffusion process undergone by the eigenvalues of a covariance matrix when the entries of the data matrix $X$ are undergoing standard Brownian motion. Amazingly, the joint distribution of the eigenvalues decouples from their corresponding eigenvectors when $T = c\mathbb{1}$ for some $c \in \mathbb{R}$. This result is originally due to Bru [3]. Our new result in this section is negative: when $T \neq c\mathbb{1}$ for all $c \in \mathbb{R}$, the joint distribution of the eigenvalues does not decouple from their corresponding eigenvectors—the eigenvalues depend on the position of the eigenvectors at time $t$ in a nontrivial way. Finally, in Section 8, we review the derivation of the self-consistent equation for generalized covariance matrices, which was first performed in [19].


2. Preliminaries

The work of this section is to introduce the key tools of the complex-analytic approach to random matrix theory: the Stieltjes transform (in Subsection 2.1), resolvent formalism (in Subsection 2.2), and some basic facts about the global distribution (in our case, the Marchenko-Pastur distribution, in Subsection 2.3). Throughout the section we explain how these fundamental tools work together, and in Subsection 2.4 we outline the general strategy for proving local laws with the complex-analytic approach. This general method is applied to sparse covariance matrices in Section 6, but first a note on notation:

Notation 2.1 (Constants and order notation). Throughout the paper we do not keep track of the values of constants explicitly. We use $c$ and $C$ to denote generic positive constants which may only depend on the constants in assumptions such as (3.1), (3.2), and (3.3); in particular, $c$ and $C$ do not depend on the dimension $N$ or the sparseness parameter $q$ defined in Section 3. Typically, we use $c$ to denote a small constant and $C$ to denote a large constant. The fundamental parameter in our model is $N$ (or, equivalently, $M$), and the notations $\gg$, $\ll$, $o(\cdot)$, and $O(\cdot)$ always refer to the limit $N \to \infty$ (or, equivalently, $M \to \infty$). We write $f(N) \ll g(N)$ when $f(N) = o(g(N))$, and $f(N) \sim g(N)$ when $C^{-1}f(N) \le g(N) \le Cf(N)$ for all $N \ge N_0$ for some $N_0$.

Notation 2.2 (Matrices and vectors). We use uppercase letters for matrices, such as $X$ and $H$, and bold lowercase letters for vectors, such as $\mathbf{v}$ and $\mathbf{x}$. In particular, $\mathbb{1}$ denotes the identity matrix, whose dimension will be clear from the context. Note this is distinct from the indicator function of a set $A \subseteq \mathbb{C}$, which is denoted by $\mathbb{1}_A$, where $\mathbb{1}_z \equiv \mathbb{1}_{\{z\}}$ for $z \in \mathbb{C}$. We use lowercase Greek letters, such as $\alpha$ and $\beta$, for indices that enumerate the spectral parameters of a matrix (such as eigenvalues and eigenvectors), whereas indices such as $i$, $j$, $k$, and $l$ enumerate the rows, columns, and entries of a matrix. For a vector $\mathbf{v}$ we denote the $k$th component by $\mathbf{v}(k)$. So, when we use $\mathbf{x}_j$ to denote the $j$th column of the matrix $X$, we have the equivalent notations $\mathbf{x}_j(k) \equiv x_{kj}$. Note that for a matrix $A$ we use $A^*$ to denote the conjugate transpose. Even when the matrix $A$ has real entries we still use $A^*$ rather than the regular transpose $A^\top$. Also, we use the notation

$$\sum_{i}^{(\mathbb{T})} := \sum_{\substack{i=1 \\ i \notin \mathbb{T}}}^{N}, \qquad (2.1)$$

where $\mathbb{T} \subseteq \{1, \ldots, N\}$. Note that this is distinct from the sum

$$\sum_{i \neq j} := \sum_{1 \le i \neq j \le N} = \sum_{i=1}^{N}\sum_{\substack{j=1 \\ j \neq i}}^{N}. \qquad (2.2)$$

2.1. The Stieltjes transform. In this subsection we discuss the Stieltjes transform, which is a key tool in random matrix theory and is used extensively throughout this paper. We define the transform, state some of its important properties, and discuss how it is used in random matrix theory. We explain that studying convergence of Stieltjes transforms is equivalent to studying convergence of probability distributions, but due to the nice analytic properties of the Stieltjes transform the former is often preferable.

Definition 2.3 (Stieltjes transform). Let $\mu \in \mathcal{M}(\mathbb{R})$ be a Borel measure on $\mathbb{R}$ and let $z := E + i\eta \in \mathbb{C}$. Then we define the Stieltjes transform of $\mu$, denoted by $m_\mu : \mathbb{C}\setminus\operatorname{supp}(\mu) \to \mathbb{C}$, as

$$m_\mu(z) := \int_{\mathbb{R}} \frac{\mu(dt)}{t - z}. \qquad (2.3)$$

In particular, if $\mu \in \mathcal{P}(\mathbb{R})$ is a probability distribution ($\mu(\mathbb{R}) = 1$) and the random variable $X$ is distributed according to $\mu$, then

$$m_\mu(z) = \mathbb{E}\left(\frac{1}{X - z}\right). \qquad (2.4)$$

While $m_\mu(z)$ is defined for all $z \in \mathbb{C}\setminus\operatorname{supp}(\mu)$, if we restrict the function's domain to the upper half-plane $\mathbb{H}$, then we have $m_\mu : \mathbb{H} \to \mathbb{H}$, since

$$\operatorname{Im} m_\mu = \eta\int_{\mathbb{R}} \frac{\mu(dt)}{(t - E)^2 + \eta^2} > 0. \qquad (2.5)$$

Moreover, $m_\mu$ is analytic on $\mathbb{H}$. From now on we only consider the Stieltjes transforms of probability distributions $\mu$, and we restrict the transform's domain to $z := E + i\eta \in \mathbb{H}$, so that $\eta > 0$. There is a very simple bound for the Stieltjes transform, namely

$$|m_\mu(z)| \le \int_{\mathbb{R}} \frac{\mu(dt)}{|t - z|} = \int_{\mathbb{R}} \frac{\mu(dt)}{\sqrt{(t - E)^2 + \eta^2}} \le \int_{\mathbb{R}} \frac{\mu(dt)}{\eta} = \frac{1}{\eta}. \qquad (2.6)$$

If the probability distribution $\mu$ is compactly supported, then we can swap the limit and integration to find

$$\lim_{z\to\infty} -z\, m_\mu(z) = \lim_{z\to\infty}\int_{\mathbb{R}} \frac{z\,\mu(dt)}{z - t} = \int_{\mathbb{R}}\mu(dt) = 1. \qquad (2.7)$$

Even when $\mu$ is not compactly supported, we have

$$\lim_{\eta\to\infty} -i\eta\, m_\mu(i\eta) = \lim_{\eta\to\infty}\int_{\mathbb{R}} \frac{\mu(dt)}{1 - t/(i\eta)} = 1, \qquad (2.8)$$

by the dominated convergence theorem. These fundamental properties make the Stieltjes transform a nice, well-behaved, analytic object that is amenable to the tools of complex analysis, which in turn makes it easier for us to study convergence of probability distributions.

We can also relate the Stieltjes transform $m_\mu$ to the moments of the measure $\mu$. Suppose additionally that $\mu$ is compactly supported, $\operatorname{supp}(\mu) \subseteq [-C, C]$ for some $C \in \mathbb{R}$, and define the $k$th moment of $\mu$ as

$$\mu_k := \int_{\mathbb{R}} t^k\,\mu(dt) = \mathbb{E}X^k \qquad (2.9)$$

for integers $k \ge 0$. Then, trivially, $|\mu_k| \le C^k$, and for each $k$ the Stieltjes transform admits the following asymptotic expansion in a neighborhood of infinity:

$$m_\mu(z) = -\sum_{i=0}^{k} \frac{\mu_i}{z^{i+1}} + o\left(\frac{1}{z^{k+1}}\right). \qquad (2.10)$$

Informally, $m_\mu(\infty) = 0$ and we may think of the $\mu_k$ as Laurent coefficients at $\infty$, so $m_\mu$ is like a moment generating function. This is an important fact for random matrices, as the moments of random matrices are related to the combinatorial structure of labelled planar graphs, which enforces some recursive relations on the moments. These recursive relations lead directly to algebraic equations for the Stieltjes transform, which in turn help to estimate the moments. See [1] for more details on this use of the Stieltjes transform.

Under some mild assumptions we can recover the probability distribution $\mu$ from its Stieltjes transform $m_\mu$. This statement is made precise in the following proposition.


Proposition 2.4 (Inversion of the Stieltjes transform). Let $\mu \in \mathcal{P}(\mathbb{R})$. Then for any $a < b$,

$$\lim_{\eta\to 0}\frac{1}{\pi}\int_a^b \operatorname{Im} m_\mu(z)\,dE = \mu(a, b) + \frac{1}{2}\mu\{a\} + \frac{1}{2}\mu\{b\}. \qquad (2.11)$$

Proof. By definition, we have

$$\frac{1}{\pi}\operatorname{Im} m_\mu(z) = \frac{1}{\pi}\int_{\mathbb{R}} \operatorname{Im}\frac{1}{t - z}\,\mu(dt) = \int_{\mathbb{R}} \frac{1}{\pi}\,\frac{\eta}{(t - E)^2 + \eta^2}\,\mu(dt),$$

which is the density of the convolution $\mu * P_\eta$ at $E$, where

$$P_\eta(dE) := \frac{1}{\pi}\,\frac{\eta}{E^2 + \eta^2}\,dE$$

is the Cauchy distribution. Then, using Fubini's theorem, we see

$$\begin{aligned}
\lim_{\eta\to 0}\frac{1}{\pi}\int_a^b \operatorname{Im} m_\mu(z)\,dE
&= \frac{1}{\pi}\lim_{\eta\to 0}\int_a^b\int_{\mathbb{R}} \frac{\eta}{(t - E)^2 + \eta^2}\,\mu(dt)\,dE \\
&= \frac{1}{\pi}\lim_{\eta\to 0}\int_{\mathbb{R}}\int_a^b \frac{\eta\,dE}{(t - E)^2 + \eta^2}\,\mu(dt) \\
&= \frac{1}{\pi}\lim_{\eta\to 0}\int_{\mathbb{R}}\left[-\arctan\left(\frac{t - E}{\eta}\right)\right]_{E=a}^{E=b}\mu(dt) \\
&= \frac{1}{\pi}\lim_{\eta\to 0}\int_{\mathbb{R}}\left(\arctan\left(\frac{t - a}{\eta}\right) - \arctan\left(\frac{t - b}{\eta}\right)\right)\mu(dt).
\end{aligned}$$

Now, we take the limit of the integrand and use the properties of the arctangent to find

$$\lim_{\eta\to 0}\left(\arctan\left(\frac{t - a}{\eta}\right) - \arctan\left(\frac{t - b}{\eta}\right)\right) = \begin{cases} 0 & \text{if } t < a, \\ \pi/2 & \text{if } t = a, \\ \pi & \text{if } a < t < b, \\ \pi/2 & \text{if } t = b, \\ 0 & \text{if } b < t, \end{cases} \qquad (2.12)$$

and thus

$$\lim_{\eta\to 0}\frac{1}{\pi}\int_a^b \operatorname{Im} m_\mu(z)\,dE = \frac{1}{\pi}\int_{\mathbb{R}}\left(\frac{\pi}{2}\,\mathbb{1}_a(t) + \frac{\pi}{2}\,\mathbb{1}_b(t) + \pi\,\mathbb{1}_{(a,b)}(t)\right)\mu(dt) = \mu(a, b) + \frac{1}{2}\mu\{a\} + \frac{1}{2}\mu\{b\},$$

by the dominated convergence theorem.

It is easy from here to recover the density of $\mu$, if it admits one. From the interpretation as a convolution in the proof of Proposition 2.4, it is possible to think of $\eta$ as a resolution which gives a sharper picture of the original probability distribution $\mu$ as $\eta$ gets smaller. Following the convention in quantum mechanics, we use the letter $E$ for the real part of the spectral parameter due to its interpretation as an energy level.
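This smoothing picture is easy to reproduce numerically. The following minimal Python sketch (an illustration we add here, not part of the thesis) evaluates $\frac{1}{\pi}\operatorname{Im} m_\mu(E + i\eta)$ for decreasing $\eta$ and compares it with the true density, using the Marchenko-Pastur distribution with $\gamma = 0.3$ as in Figure 2 below; the closed form used for $m_{\mathrm{mp}}$ anticipates Lemma 2.18.

```python
import numpy as np

gamma = 0.3
lam_m, lam_p = (1 - np.sqrt(gamma))**2, (1 + np.sqrt(gamma))**2

def m_mp(z):
    # Stieltjes transform of the Marchenko-Pastur law (anticipating Lemma 2.18),
    # with the square-root branch chosen so that Im m > 0 on the upper half-plane.
    s = np.sqrt((z + gamma - 1)**2 - 4 * gamma * z)
    s = np.where(s.imag < 0, -s, s)
    return (1 - gamma - z + s) / (2 * gamma * z)

def density_mp(t):
    return np.sqrt(np.maximum((lam_p - t) * (t - lam_m), 0)) / (2 * np.pi * gamma * t)

E = np.linspace(0.05, 3.0, 400)
for eta in (1.0, 0.3, 0.01, 1e-6):
    recovered = m_mp(E + 1j * eta).imag / np.pi   # smoothed density at resolution eta
    print(eta, np.max(np.abs(recovered - density_mp(E))))  # error should shrink with eta
```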

Next, we present two immediate corollaries of Proposition 2.4.


(Figure 2 here: four panels plotting the recovered density against $E$ for $\eta = 1$, $\eta = 0.3$, $\eta = 0.01$, and $\eta = 10^{-10}$.)

Figure 2: The figures show numerical examples of the densities recovered using Proposition 2.4 as $\eta$ approaches zero, for the Stieltjes transform of the Marchenko-Pastur distribution when $\gamma = 0.3$. In Figure 2a the density is almost completely smoothed out by the Poisson kernel, but in Figure 2d we recover the density very accurately.

Corollary 2.5. Let $z := E + i\eta$. If $\mu$ admits a density $\frac{d\mu}{dE}$, then

$$\frac{1}{\pi}\lim_{\eta\to 0}\operatorname{Im} m_\mu(z) = \frac{d\mu(E)}{dE}. \qquad (2.13)$$

So we have pointwise convergence.

Corollary 2.6. Let $\mu, \nu \in \mathcal{P}(\mathbb{R})$. Then

$$\mu = \nu \quad\text{if and only if}\quad m_\mu = m_\nu. \qquad (2.14)$$

As we shall see, a sequence of Stieltjes transforms can be produced from a random matrix ensemble, and thus the goal of understanding the limiting properties of the ensemble can be achieved through understanding the limiting properties of the sequence of Stieltjes transforms. However, limits of Stieltjes transforms of probability distributions are not always Stieltjes transforms of probability distributions, as the following simple example shows: define

$$m_N(z) = m_{\delta_N}(z) := \int_{\mathbb{R}} \frac{\delta_N(dt)}{t - z} = \frac{1}{N - z}.$$

Thus, $\lim_{N\to\infty} m_N = 0$, which clearly cannot be the Stieltjes transform of any probability distribution. We wish to understand the circumstances under which the limit of the Stieltjes transforms of a sequence of probability distributions is the Stieltjes transform of a probability distribution. The condition is simple and can be understood intuitively as specifying that the sequence of probability distributions should not lose any mass to infinity in the limit. The next proposition formalizes this idea and can be found in [15].

Proposition 2.7. Let $(\mu_N)_{N=1}^{\infty}$ be a sequence contained in $\mathcal{P}(\mathbb{R})$ and denote their Stieltjes transforms by $(m_N(z))_{N=1}^{\infty}$ respectively. Suppose that pointwise for all $z := E + i\eta \in \mathbb{H}$,

$$\lim_{N\to\infty} m_N(z) = m(z) \qquad (2.15)$$

for some function $m(z)$. Then there exists $\mu \in \mathcal{P}(\mathbb{R})$ with Stieltjes transform $m_\mu(z) = m(z)$ if and only if

$$\lim_{\eta\to\infty} -i\eta\, m(i\eta) = 1, \qquad (2.16)$$

in which case $\mu_N \to \mu$ in distribution.

Proof. The forward direction follows from the calculation in (2.8): if $m(z) = m_\mu(z)$ for some $\mu \in \mathcal{P}(\mathbb{R})$, then (2.16) holds, as this is a property of the Stieltjes transform of any probability distribution.

For the reverse direction, Helly's selection theorem tells us that some subsequence $\mu_{N_k}$ converges vaguely to a possibly defective measure $\mu$. Now using the fact that $t \mapsto (t - z)^{-1}$ is continuous and vanishes at infinity, we find

$$m_{N_k}(z) \to m_\mu(z)$$

pointwise for all $z \in \mathbb{H}$. Thus $m_\mu = m$, and so the limits of all subsequences of $(\mu_N)_{N=1}^{\infty}$ have the same Stieltjes transform $m$. Moreover, by condition (2.16) it follows that $\mu$ is not defective and in fact $\mu(\mathbb{R}) = 1$. In more detail,

$$1 = \lim_{\eta\to\infty} -i\eta\, m(i\eta) = \lim_{\eta\to\infty} -i\eta\int_{\mathbb{R}} \frac{\mu(dt)}{t - i\eta} = \lim_{\eta\to\infty}\int_{\mathbb{R}} \frac{\mu(dt)}{1 - t/(i\eta)} = \int_{\mathbb{R}}\mu(dt)$$

by the dominated convergence theorem. Then by the uniqueness of Stieltjes transforms implied by Proposition 2.4, all subsequences of $(\mu_N)_{N=1}^{\infty}$ have the same limit, and thus $\mu_N \to \mu$ in distribution.

Before we explain the Stieltjes transform’s relationship with matrices we need to make a definition.

Definition 2.8 (Empirical distribution of a matrix). Let $A$ be an $N \times N$ Hermitian matrix and denote its (real) eigenvalues by

$$(\lambda_\alpha)_{\alpha=1}^{N}. \qquad (2.17)$$

Then the empirical distribution of $A$ is the (random) probability distribution on $\mathbb{R}$ defined by

$$\mu_A := \frac{1}{N}\sum_{\alpha=1}^{N} \delta_{\lambda_\alpha}, \qquad (2.18)$$

that is, a point mass of size $1/N$ at each eigenvalue of $A$. In particular, the number of eigenvalues in an interval $I$ is given by

$$N\int_I \mu_A(dt). \qquad (2.19)$$

Using Definition 2.8, we find that the Stieltjes transform of the empirical distribution of $A$ is given by

$$m_A(z) := \int_{\mathbb{R}} \frac{\mu_A(dt)}{t - z} = \frac{1}{N}\sum_{\alpha=1}^{N} \frac{1}{\lambda_\alpha - z}. \qquad (2.20)$$

From now on we refer to $m_A(z)$ as the Stieltjes transform of the matrix $A$ and use the slightly abusive notation $m_A(z)$ instead of $m_{\mu_A}(z)$.

2.2. Resolvent formalism. When studying the spectrum of operators on Hilbert spaces (and more general spaces), we can use resolvent formalism to apply concepts from complex analysis. The operators we are interested in are matrices, and the resolvent has an intricate relationship with the Stieltjes transform, which we have already demonstrated as a useful tool for studying the convergence of probability distributions on the real line.

Definition 2.9 (Resolvent). Let $A$ be an $N \times N$ Hermitian matrix. Then denote the resolvent (or Green function) of $A$ by $G : \mathbb{C}\setminus\sigma(A) \to \mathrm{GL}_N(\mathbb{C})$, which is defined as

$$G \equiv G(z) := \frac{1}{A - z\mathbb{1}}. \qquad (2.21)$$

Moreover, we denote the $(i, j)$ matrix entry of $G(z)$ by $G_{ij} \equiv G_{ij}(z)$.

Thus, we can also write the Stieltjes transform of $A$ as the normalized trace of the resolvent of $A$:

$$m_N(z) = \frac{1}{N}\sum_\alpha \frac{1}{\lambda_\alpha - z} = \frac{1}{N}\operatorname{Tr}\frac{1}{A - z\mathbb{1}} = \frac{1}{N}\operatorname{Tr} G(z) = \frac{1}{N}\sum_i G_{ii}(z), \qquad (2.22)$$

since $(A - z\mathbb{1})^{-1}$ is a rational function of $A$ and so its eigenvalues are $((\lambda_\alpha - z)^{-1})_{\alpha=1}^{N}$. This expression is another advantage of the Stieltjes transform for random matrices; it allows us to use the algebraic properties of matrices and all the resolvent identities. To state and prove the resolvent identities we need to define the minor of a matrix.
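Identity (2.22) is simple to confirm numerically; the following minimal Python sketch (an illustration we add here, not part of the thesis) compares the eigenvalue form of $m_A(z)$ with the normalized trace of the resolvent.

```python
import numpy as np

# Check m_A(z) = (1/N) sum_a 1/(lambda_a - z) = (1/N) Tr G(z).
rng = np.random.default_rng(0)
N = 50
A = rng.standard_normal((N, N))
A = (A + A.T) / 2                          # Hermitian (real symmetric) matrix
z = 0.3 + 0.1j
G = np.linalg.inv(A - z * np.eye(N))       # resolvent G(z) = (A - z)^{-1}

m_eigs = np.mean(1 / (np.linalg.eigvalsh(A) - z))
m_trace = np.trace(G) / N
print(abs(m_eigs - m_trace))               # agrees up to floating-point error
```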

Definition 2.10 (Minors). Let $\mathbb{T} \subseteq \{1, \ldots, N\}$ and let $A = (a_{ij})$ be an $N \times N$ matrix. We denote by $A^{(\mathbb{T})}$ the $(N - |\mathbb{T}|) \times (N - |\mathbb{T}|)$ Hermitian matrix obtained by removing all the columns and rows of $A$ indexed by $i \in \mathbb{T}$. Note that we keep the original names of the indices for the entries of the minor; for example, the $(2, 2)$ entry of $A^{(1)}$ is still denoted by $a_{22}$.

Given a minor matrix $A^{(\mathbb{T})}$, we denote its eigenvalues and eigenvectors by

$$\left(\lambda_\alpha^{(\mathbb{T})}\right)_{\alpha=1}^{N-|\mathbb{T}|} \qquad\text{and}\qquad \left(\mathbf{v}_\alpha^{(\mathbb{T})}\right)_{\alpha=1}^{N-|\mathbb{T}|} \qquad (2.23)$$

respectively. Similarly, we denote the resolvent and Stieltjes transform of the minor matrix by

$$G^{(\mathbb{T})} \equiv G^{(\mathbb{T})}(z) := \frac{1}{A^{(\mathbb{T})} - z\mathbb{1}} \qquad\text{and}\qquad m^{(\mathbb{T})} \equiv m^{(\mathbb{T})}(z) := \frac{1}{N}\sum_{\alpha=1}^{N-|\mathbb{T}|} \frac{1}{\lambda_\alpha^{(\mathbb{T})} - z} = \frac{1}{N}\operatorname{Tr} G^{(\mathbb{T})}(z) \qquad (2.24)$$

respectively. Note that we use the normalization $1/N$, instead of the more natural $(N - |\mathbb{T}|)^{-1}$, in (2.24); this is a convenience for later applications. Regardless, the cardinality of $\mathbb{T}$ is often less than some fixed constant, and thus $N \sim N - |\mathbb{T}|$. Furthermore, we abbreviate $(\{i\})$ as $(i)$ and $(\{i\} \cup \mathbb{T})$ as $(i\mathbb{T})$ in our notation for minors.

Now that we have defined the minor of a matrix, we can state the Schur complement formula, which relates the diagonal entries of the inverse of a matrix to the block decomposition of the matrix; it is essential to proving the resolvent identities which follow it.

Lemma 2.11 (Schur complement formula). Let $D$ be an $N \times N$ Hermitian matrix. Then we can decompose $D$ as

$$D := \begin{bmatrix} A & B^* \\ B & C \end{bmatrix} \qquad (2.25)$$

for $A$, $B$, and $C$ being $m \times m$, $(N - m) \times m$, and $(N - m) \times (N - m)$ complex matrices respectively. Suppose that $D$ and $C$ are invertible; then it follows that

$$\left[\left(D^{(\mathbb{T})}\right)^{-1}\right]_{ij} = \left[\left(\left(A - B^*C^{-1}B\right)^{(\mathbb{T})}\right)^{-1}\right]_{ij} \qquad (2.26)$$

for $1 \le i, j \le m$ and $i, j \notin \mathbb{T}$.

Proof. Without loss of generality, we assume $\mathbb{T} = \emptyset$; the proof for nonempty $\mathbb{T}$ is identical. The proof uses block Gaussian elimination, multiplying the matrix $D$ from the right by the block lower-triangular matrix defined by

$$L := \begin{bmatrix} \mathbb{1} & 0 \\ -C^{-1}B & \mathbb{1} \end{bmatrix},$$

where $0$ represents the $m \times (N - m)$ matrix of all zeros and $\mathbb{1}$ is the $m \times m$ identity matrix, and $C^{-1}$ exists by assumption. It is easy to see

$$L^{-1} = \begin{bmatrix} \mathbb{1} & 0 \\ C^{-1}B & \mathbb{1} \end{bmatrix}.$$

So, performing a block matrix multiplication, we find

$$DL = \begin{bmatrix} A & B^* \\ B & C \end{bmatrix}\begin{bmatrix} \mathbb{1} & 0 \\ -C^{-1}B & \mathbb{1} \end{bmatrix} = \begin{bmatrix} A - B^*C^{-1}B & B^* \\ 0 & C \end{bmatrix} = \begin{bmatrix} \mathbb{1} & B^*C^{-1} \\ 0 & \mathbb{1} \end{bmatrix}\begin{bmatrix} A - B^*C^{-1}B & 0 \\ 0 & C \end{bmatrix}.$$

Thus, we may conclude

$$\begin{bmatrix} A & B^* \\ B & C \end{bmatrix} = \begin{bmatrix} \mathbb{1} & B^*C^{-1} \\ 0 & \mathbb{1} \end{bmatrix}\begin{bmatrix} A - B^*C^{-1}B & 0 \\ 0 & C \end{bmatrix}\begin{bmatrix} \mathbb{1} & 0 \\ C^{-1}B & \mathbb{1} \end{bmatrix},$$

and therefore we have the following expression for the inverse of $D$:

$$D^{-1} = \begin{bmatrix} \mathbb{1} & 0 \\ -C^{-1}B & \mathbb{1} \end{bmatrix}\begin{bmatrix} \left(A - B^*C^{-1}B\right)^{-1} & 0 \\ 0 & C^{-1} \end{bmatrix}\begin{bmatrix} \mathbb{1} & -B^*C^{-1} \\ 0 & \mathbb{1} \end{bmatrix}.$$

Finally, reading off the top-left block, we have

$$\left[D^{-1}\right]_{ij} = \left[\left(A - B^*C^{-1}B\right)^{-1}\right]_{ij}$$

for $1 \le i, j \le m$.

Remark 2.12. Lemma 2.11 is easily extended to the case where $A$ is not in the top left-hand corner of the matrix $D$. One defines the resulting decomposition of $D$ in the same way and the proof is identical, but notating the proof is more complicated.
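The formula is also easy to test numerically. The following minimal Python sketch (an illustration we add here under the stated invertibility assumptions, not part of the thesis) checks the top-left block of $D^{-1}$ against the inverse of the Schur complement $A - B^*C^{-1}B$.

```python
import numpy as np

# Numerical check of the Schur complement formula (2.26) with T empty.
rng = np.random.default_rng(1)
N, m = 8, 3
D = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
D = (D + D.conj().T) / 2                   # Hermitian N x N matrix
A, Bstar = D[:m, :m], D[:m, m:]            # block decomposition (2.25)
B, C = D[m:, :m], D[m:, m:]

lhs = np.linalg.inv(D)[:m, :m]             # top-left m x m block of D^{-1}
rhs = np.linalg.inv(A - Bstar @ np.linalg.inv(C) @ B)
print(np.max(np.abs(lhs - rhs)))           # agrees up to floating-point error
```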

The next lemma states the resolvent identities. The form of the resolvent identities depends on the matrix ensemble in question; for example, the identities are slightly different for Wigner matrices and covariance matrices. While the form of the resolvent identities depends on the particular ensemble, here we state the most general form of the resolvent identities to demonstrate their wide applicability. In Lemma 6.5, we state the resolvent identities for the specific case of covariance matrices, which we need for the proof of the local Marchenko-Pastur law in Section 6.

Lemma 2.13 (General resolvent identities). For an $N \times N$ Hermitian matrix $A = (a_{ij})$, let $G^{(\mathbb{T})}_{ij}(z)$ be as defined in Definition 2.10, let $\mathbf{a}^{(\mathbb{T})}_i$ denote the $i$th column of $A$ with the entries indexed by $\mathbb{T}$ removed, and suppose $z \in \mathbb{C}\setminus\sigma(A)$. Then for the diagonal elements of $G(z)$, when $i \notin \mathbb{T}$, we have

$$G^{(\mathbb{T})}_{ii}(z) = \frac{1}{a_{ii} - z - \left(\mathbf{a}^{(i\mathbb{T})}_i\right)^* G^{(i\mathbb{T})}(z)\,\mathbf{a}^{(i\mathbb{T})}_i}. \qquad (2.27)$$

For the off-diagonal elements of $G(z)$, when $i \neq j$ and $i, j \notin \mathbb{T}$, we have

$$G^{(\mathbb{T})}_{ij}(z) = -G^{(\mathbb{T})}_{ii}(z)\,G^{(i\mathbb{T})}_{jj}(z)\left(a_{ij} - \left(\mathbf{a}^{(ij\mathbb{T})}_i\right)^* G^{(ij\mathbb{T})}(z)\,\mathbf{a}^{(ij\mathbb{T})}_j\right). \qquad (2.28)$$

Finally, when $i \neq k$, $j \neq k$, and $i, j, k \notin \mathbb{T}$,

$$G^{(\mathbb{T})}_{ij}(z) = G^{(k\mathbb{T})}_{ij}(z) + \frac{G^{(\mathbb{T})}_{ik}(z)\,G^{(\mathbb{T})}_{kj}(z)}{G^{(\mathbb{T})}_{kk}(z)}. \qquad (2.29)$$

Proof. We only need elementary linear algebra for this proof. Throughout the proof we assume without loss of generality that $\mathbb{T} = \emptyset$, as the proof is identical for nonempty $\mathbb{T}$. First we prove (2.27). For ease of notation we perform the calculation for $i = 1$, but the following holds for general $i$. Then

$$A := \begin{bmatrix} a_{11} & \left(\mathbf{a}^{(1)}_1\right)^* \\ \mathbf{a}^{(1)}_1 & A^{(1)} \end{bmatrix} \qquad (2.30)$$

and so

$$A - z\mathbb{1} = \begin{bmatrix} a_{11} - z & \left(\mathbf{a}^{(1)}_1\right)^* \\ \mathbf{a}^{(1)}_1 & A^{(1)} - z\mathbb{1} \end{bmatrix}.$$

Therefore, using the Schur complement formula stated in Lemma 2.11, we immediately see

$$G_{11} = \frac{1}{a_{11} - z - \left(\mathbf{a}^{(1)}_1\right)^*\left(A^{(1)} - z\mathbb{1}\right)^{-1}\mathbf{a}^{(1)}_1} = \frac{1}{a_{11} - z - \left(\mathbf{a}^{(1)}_1\right)^* G^{(1)}(z)\,\mathbf{a}^{(1)}_1}.$$

Now we show (2.28). For ease of notation let $i = 1$ and $j = 2$. Consider the following block decomposition:

$$A - z\mathbb{1} = \begin{bmatrix} a_{11} - z & a_{12} & \left(\mathbf{a}^{(12)}_1\right)^* \\ a_{21} & a_{22} - z & \left(\mathbf{a}^{(12)}_2\right)^* \\ \mathbf{a}^{(12)}_1 & \mathbf{a}^{(12)}_2 & A^{(12)} - z\mathbb{1} \end{bmatrix}.$$

Using the Schur complement formula, Lemma 2.11, with $m = 2$, we get, for $k, s \in \{1, 2\}$,

$$G_{ks} = \left[\begin{bmatrix} a_{11} - z - \left(\mathbf{a}^{(12)}_1\right)^* G^{(12)}\mathbf{a}^{(12)}_1 & a_{12} - \left(\mathbf{a}^{(12)}_1\right)^* G^{(12)}\mathbf{a}^{(12)}_2 \\ a_{21} - \left(\mathbf{a}^{(12)}_2\right)^* G^{(12)}\mathbf{a}^{(12)}_1 & a_{22} - z - \left(\mathbf{a}^{(12)}_2\right)^* G^{(12)}\mathbf{a}^{(12)}_2 \end{bmatrix}^{-1}\right]_{ks}. \qquad (2.31)$$

Define

$$K_{ks} := a_{ks} - \mathbb{1}_{k=s}\,z - \left(\mathbf{a}^{(12)}_k\right)^* G^{(12)}\mathbf{a}^{(12)}_s \qquad\text{and}\qquad K := K_{11}K_{22} - K_{12}K_{21}.$$

Thus, the expression (2.31) gives us

$$G_{11} = \frac{K_{22}}{K}, \qquad G_{22} = \frac{K_{11}}{K}, \qquad G_{12} = -\frac{K_{12}}{K}, \qquad\text{and}\qquad G_{21} = -\frac{K_{21}}{K}.$$

So, we see

$$G_{12} = -\frac{K_{12}}{K} = -\frac{K_{12}}{K_{22}}\cdot\frac{K_{22}}{K} = -G_{11}\frac{K_{12}}{K_{22}},$$

and by using the first resolvent identity (2.27) with $i = 2$ on $G^{(1)}$, which gives $K_{22}^{-1} = G^{(1)}_{22}$, we find

$$G_{12} = -G_{11}\frac{K_{12}}{K_{22}} = -G_{11}G^{(1)}_{22}K_{12},$$

that is,

$$G_{12} = -G_{11}G^{(1)}_{22}\left(a_{12} - \left(\mathbf{a}^{(12)}_1\right)^* G^{(12)}\mathbf{a}^{(12)}_2\right).$$

Now, we prove (2.29). Again for ease of notation, we prove it for $i = j = 1$ and $k = 2$. Note

$$\frac{K_{11}K_{22}}{K^2} - \frac{K_{11}}{K}\cdot\frac{1}{K_{11}} = \frac{K_{12}K_{21}}{K^2}, \qquad (2.32)$$

and thus, by (2.32), we find

$$G_{11}G_{22} - G_{22}\,\frac{1}{K_{11}} = G_{12}G_{21}.$$

Since, using the first resolvent identity (2.27) with $i = 1$ on $G^{(2)}$, we have $K_{11}^{-1} = G^{(2)}_{11}$, this yields

$$G_{11} - G^{(2)}_{11} = \frac{G_{12}G_{21}}{G_{22}},$$

which completes the proof.
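These identities are straightforward to verify numerically. The following minimal Python sketch (an illustration we add here, not part of the thesis) checks the identity (2.29) relating $G$ to the resolvent $G^{(k)}$ of a minor.

```python
import numpy as np

# Numerical check of resolvent identity (2.29): for i, j != k,
# G_ij = G^{(k)}_ij + G_ik G_kj / G_kk.
rng = np.random.default_rng(2)
N = 6
A = rng.standard_normal((N, N))
A = (A + A.T) / 2
z = 0.2 + 0.5j
G = np.linalg.inv(A - z * np.eye(N))

k = 2                                       # remove row and column k
keep = [i for i in range(N) if i != k]
Ak = A[np.ix_(keep, keep)]                  # the minor A^{(k)}
Gk = np.linalg.inv(Ak - z * np.eye(N - 1))  # its resolvent G^{(k)}

i, j = 0, 4
i_m, j_m = keep.index(i), keep.index(j)     # positions of i, j inside the minor
lhs = G[i, j]
rhs = Gk[i_m, j_m] + G[i, k] * G[k, j] / G[k, k]
print(abs(lhs - rhs))                       # agrees up to floating-point error
```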


Along with the identities in Lemma 2.13, the following is a fundamental property of the resolvent, and we make use of it frequently during the proof in Section 6. The lemma is true for any resolvent, which trivially includes the resolvents of minors.

Lemma 2.14 (Ward identity). Let $z := E + i\eta$ and let $A$ be an $N \times N$ Hermitian matrix. Denote the resolvent of $A$ by $G$. Then

$$\sum_{j=1}^{N} |G_{ij}(z)|^2 = \frac{1}{\eta}\operatorname{Im} G_{ii}(z) \qquad (2.33)$$

and, as a simple corollary,

$$\sum_{i,j=1}^{N} |G_{ij}(z)|^2 = \frac{1}{\eta}\operatorname{Im}\operatorname{Tr} G(z). \qquad (2.34)$$

Proof. Denote the eigenvalues and eigenvectors of $A$ by

$$(\lambda_\alpha)_{\alpha=1}^{N} \qquad\text{and}\qquad (\mathbf{v}_\alpha)_{\alpha=1}^{N}$$

respectively. We use an eigendecomposition of $G$: note that

$$\left(\frac{1}{\lambda_\alpha - z}\right)_{\alpha=1}^{N} \qquad (2.35)$$

are the eigenvalues of $G$ with corresponding eigenvectors $(\mathbf{v}_\alpha)_{\alpha=1}^{N}$; thus the entries of $G$ are given by

$$G_{ij} = \sum_\alpha \frac{\mathbf{v}_\alpha(i)\,\overline{\mathbf{v}_\alpha(j)}}{\lambda_\alpha - z} \qquad\text{and}\qquad \overline{G_{ij}} = \sum_\alpha \frac{\overline{\mathbf{v}_\alpha(i)}\,\mathbf{v}_\alpha(j)}{\overline{\lambda_\alpha - z}}.$$

So, we see

$$\sum_j |G_{ij}|^2 = \sum_j G_{ij}\overline{G_{ij}} = \sum_j\left(\sum_\alpha \frac{\mathbf{v}_\alpha(i)\,\overline{\mathbf{v}_\alpha(j)}}{\lambda_\alpha - z}\right)\left(\sum_\beta \frac{\overline{\mathbf{v}_\beta(i)}\,\mathbf{v}_\beta(j)}{\overline{\lambda_\beta - z}}\right) = \sum_{\alpha,\beta} \frac{\mathbf{v}_\alpha(i)\,\overline{\mathbf{v}_\beta(i)}}{(\lambda_\alpha - z)\overline{(\lambda_\beta - z)}}\sum_j \overline{\mathbf{v}_\alpha(j)}\,\mathbf{v}_\beta(j) = \sum_\alpha \frac{|\mathbf{v}_\alpha(i)|^2}{|\lambda_\alpha - z|^2} = \frac{1}{\eta}\operatorname{Im} G_{ii}, \qquad (2.36)$$

where we used the orthonormality of the eigenvectors and the fact that

$$\operatorname{Im}\frac{1}{\lambda_\alpha - z} = \frac{\eta}{|\lambda_\alpha - z|^2}.$$

Then summing equation (2.36) over $i$ yields

$$\sum_{i,j} |G_{ij}(z)|^2 = \frac{1}{\eta}\operatorname{Im}\sum_i G_{ii} = \frac{1}{\eta}\operatorname{Im}\operatorname{Tr} G,$$

which completes the proof.
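The Ward identity is likewise easy to confirm numerically; this minimal Python sketch (an illustration we add here, not part of the thesis) checks (2.33) for a random Hermitian matrix.

```python
import numpy as np

# Numerical check of the Ward identity (2.33):
# sum_j |G_ij|^2 = Im(G_ii) / eta.
rng = np.random.default_rng(3)
N = 40
A = rng.standard_normal((N, N))
A = (A + A.T) / 2
E, eta = 0.1, 0.05
G = np.linalg.inv(A - (E + 1j * eta) * np.eye(N))

i = 7
lhs = np.sum(np.abs(G[i, :])**2)
rhs = G[i, i].imag / eta
print(abs(lhs - rhs))                       # agrees up to floating-point error
```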


The next lemma is essential, and it provides the impetus to study the minors of matrices: the eigenvalues of a minor matrix $A^{(i)}$ are interlaced with (and thus controlled by) the eigenvalues of the matrix $A$. With this observation we can recursively relate the Stieltjes transforms, resolvents, eigenvalues, etc. of a matrix with those of its minors. The resolvent identities allow us to derive equations which express the resolvent of a matrix in terms of the resolvents of its minors, and putting this together with interlacing allows us to control the error terms in the equations.

Proposition 2.15 (Eigenvalue interlacing). Block decompose an $N \times N$ Hermitian random matrix $A = (a_{ij})$ into its $j$th minor and the remaining elements. Assume the distributions of the entries of $A$ are smooth. Denote the eigenvalues of $A$ and $A^{(j)}$ by

$$(\lambda_\alpha)_{\alpha=1}^{N} \qquad\text{and}\qquad \left(\lambda^{(j)}_\alpha\right)_{\alpha=1}^{N-1} \qquad (2.37)$$

respectively, ordered from smallest to largest. Then the eigenvalues of $A$ and $A^{(j)}$ are strictly interlaced almost surely:

$$\lambda_1 < \lambda^{(j)}_1 < \lambda_2 < \lambda^{(j)}_2 < \cdots < \lambda_{N-1} < \lambda^{(j)}_{N-1} < \lambda_N. \qquad (2.38)$$

Proof. For ease of notation assume $j = 1$, as the proof is identical for general $j$. Suppose that $\lambda$ is one of the eigenvalues of $A$ and let $\mathbf{v} := (v_1, \ldots, v_N)^*$ be the corresponding normalized eigenvector, so that

$$A\mathbf{v} = \lambda\mathbf{v}. \qquad (2.39)$$

From the continuity of the distribution it also follows that $v_1 \neq 0$ almost surely. From the decomposition (2.30) and the equation (2.39), writing $a := a_{11}$ and $\mathbf{a} := \mathbf{a}^{(1)}_1$, we get the pair of equations

$$a v_1 + \mathbf{a}^*\mathbf{v}^{(1)} = \lambda v_1 \qquad\text{and}\qquad \mathbf{a}v_1 + A^{(1)}\mathbf{v}^{(1)} = \lambda\mathbf{v}^{(1)},$$

where $\mathbf{v}^{(1)} := (v_2, \ldots, v_N)^*$. Rearranging the above equations we find

$$\mathbf{a}^*\mathbf{v}^{(1)} = (\lambda - a)\,v_1 \qquad\text{and}\qquad \mathbf{v}^{(1)} = \left(\lambda\mathbb{1} - A^{(1)}\right)^{-1}\mathbf{a}\,v_1,$$

and then substituting the second equation into the first we see

$$(\lambda - a)\,v_1 = \mathbf{a}^*\left(\lambda\mathbb{1} - A^{(1)}\right)^{-1}\mathbf{a}\,v_1. \qquad (2.40)$$

Using a spectral representation of $A^{(1)}$, where the $\mathbf{v}^{(1)}_\alpha$ are the normalized eigenvectors corresponding to the eigenvalues $\lambda^{(1)}_\alpha$, we find

$$(\lambda - a)\,v_1 = \frac{v_1}{N}\sum_{\alpha=1}^{N-1} \frac{\left|\sqrt{N}\,\mathbf{a}\cdot\mathbf{v}^{(1)}_\alpha\right|^2}{\lambda - \lambda^{(1)}_\alpha}.$$

Since $v_1 \neq 0$ almost surely, we find

$$\lambda - a = \frac{1}{N}\sum_{\alpha=1}^{N-1} \frac{\left|\sqrt{N}\,\mathbf{a}\cdot\mathbf{v}^{(1)}_\alpha\right|^2}{\lambda - \lambda^{(1)}_\alpha}. \qquad (2.41)$$

Moreover, $\mathbf{v}^{(1)}_\alpha$ and $\mathbf{a}$ are independent, and $\left|\sqrt{N}\,\mathbf{a}\cdot\mathbf{v}^{(1)}_\alpha\right|^2$ is strictly positive almost surely. So, from equation (2.41) we may conclude that $\lambda \neq \lambda^{(1)}_\alpha$ for all $\alpha$ almost surely. Also note that the function

$$\Phi(\lambda) := \frac{1}{N}\sum_{\alpha=1}^{N-1} \frac{\left|\sqrt{N}\,\mathbf{a}\cdot\mathbf{v}^{(1)}_\alpha\right|^2}{\lambda - \lambda^{(1)}_\alpha} \qquad (2.42)$$

is strictly decreasing from $\infty$ to $-\infty$ in each interval $\left(\lambda^{(1)}_{\alpha-1}, \lambda^{(1)}_\alpha\right)$. Therefore, there is exactly one solution to the equation

$$\lambda - a = \Phi(\lambda)$$

in each interval $\left(\lambda^{(1)}_{\alpha-1}, \lambda^{(1)}_\alpha\right)$, since $\lambda - a$ is increasing in the same interval. By a similar argument we can show there is exactly one solution in each of the intervals $\left(-\infty, \lambda^{(1)}_1\right)$ and $\left(\lambda^{(1)}_{N-1}, \infty\right)$.

The following is a simple consequence of Proposition 2.15 and integration by parts; it states how interlacing affects the Stieltjes transforms of a matrix and its minors.

Corollary 2.16. Let $z := E + i\eta \in \mathbb{H}$. Denote the Stieltjes transforms of $A$ and $A^{(j)}$ by $m_N(z)$ and $m^{(j)}_{N-1}(z)$ respectively. Then, because the eigenvalues of $A$ and $A^{(j)}$ are interlaced, we have the following bound on the difference of their Stieltjes transforms:

$$\left|m_N(z) - m^{(j)}_{N-1}(z)\right| \le \frac{C}{N\eta}. \qquad (2.43)$$

Proof. Let $\mu \equiv \mu_N$ and $\mu^{(j)} \equiv \mu^{(j)}_{N-1}$ denote the empirical distributions of $A$ and $A^{(j)}$ respectively, and define $F(t) := \mu((-\infty, t])$ and $F^{(j)}(t) := \mu^{(j)}((-\infty, t])$. From Proposition 2.15 we know that the eigenvalues of $A$ and $A^{(j)}$ are interlaced; thus we have the following uniform bound:

$$\sup_t \left|N F(t) - N F^{(j)}(t)\right| \le 1. \qquad (2.44)$$

Integrating each integral by parts in the second step, we find

$$\begin{aligned}
\left|N m(z) - N m^{(j)}(z)\right| &= \left|\int_{\mathbb{R}} \frac{N\,dF(t)}{t - z} - \int_{\mathbb{R}} \frac{N\,dF^{(j)}(t)}{t - z}\right| \\
&= \left|\left[\frac{NF(t) - NF^{(j)}(t)}{t - z}\right]_{-\infty}^{\infty} + \int_{\mathbb{R}} \frac{NF(t) - NF^{(j)}(t)}{(t - z)^2}\,dt\right| \\
&= \left|\int_{\mathbb{R}} \frac{NF(t) - NF^{(j)}(t)}{(t - z)^2}\,dt\right| \\
&\le \int_{\mathbb{R}} \frac{\left|NF(t) - NF^{(j)}(t)\right|}{|t - z|^2}\,dt \le \int_{\mathbb{R}} \frac{dt}{|t - z|^2} \le \frac{C}{\eta},
\end{aligned}$$

where the bound (2.44) is used in the fourth line.


2.3. The Marchenko-Pastur distribution. In this section we define the Marchenko-Pastur distribution, which was introduced by Marchenko and Pastur in [17] to describe the global limiting spectrum of generalized covariance matrices, and review some of its important properties. Throughout the entire paper (and this subsection in particular), $\sqrt{\cdot}$ denotes the complex square root with a branch cut on the negative real axis.

Definition 2.17 (Marchenko-Pastur distribution). The distribution is defined with the density function

$$\frac{d\nu_{\mathrm{mp}}(t)}{dt} := \frac{\sqrt{(\lambda_+ - t)(t - \lambda_-)}}{2\pi\gamma t}\,\mathbb{1}_{[\lambda_-, \lambda_+]}(t), \qquad (2.45)$$

where

$$\lambda_\pm := (1 \pm \sqrt{\gamma})^2, \qquad (2.46)$$

so that

$$\mu_{\mathrm{mp}}(A) := \begin{cases} \nu_{\mathrm{mp}}(A) & \text{if } 0 < \gamma \le 1, \\ \left(1 - \gamma^{-1}\right)\delta_0(A) + \nu_{\mathrm{mp}}(A) & \text{if } \gamma > 1, \end{cases} \qquad (2.47)$$

for any measurable set $A \subseteq \mathbb{R}$.

From this definition, we calculate the Stieltjes transform of the Marchenko-Pastur distribution and derive a self-consistent equation for the transform in Lemma 2.18.

Lemma 2.18. Define $m_{\mathrm{mp}} \equiv m_{\mathrm{mp}}(z)$ as the Stieltjes transform of the Marchenko-Pastur distribution $\mu_{\mathrm{mp}}$. Then

$$m_{\mathrm{mp}}(z) = \frac{1 - \gamma - z + i\sqrt{(\lambda_+ - z)(z - \lambda_-)}}{2\gamma z}. \qquad (2.48)$$

Moreover, $m_{\mathrm{mp}}(z)$ satisfies the self-consistent equation

$$m_{\mathrm{mp}}(z) = \frac{1}{1 - \gamma - z\gamma\, m_{\mathrm{mp}}(z) - z}. \qquad (2.49)$$

Proof. The following proof is based on Lemma 3.11 of [2]. For now assume $\gamma < 1$; we deal with the remaining cases at the end of the proof. By definition, we have

$$m_{\mathrm{mp}}(z) = \int_{\lambda_-}^{\lambda_+} \frac{1}{t - z}\,\frac{\sqrt{(\lambda_+ - t)(t - \lambda_-)}}{2\pi t\gamma}\,dt.$$

Using the substitution $t = 1 + \gamma + 2\sqrt{\gamma}\cos s$ and then the substitution $\zeta = \exp(is)$, we see

$$\begin{aligned}
m_{\mathrm{mp}}(z) &= \int_0^\pi \frac{2}{\pi}\,\frac{\sin^2 s}{(1 + \gamma + 2\sqrt{\gamma}\cos s)(1 + \gamma + 2\sqrt{\gamma}\cos s - z)}\,ds \\
&= \frac{1}{\pi}\int_0^{2\pi} \frac{\left(\frac{e^{is} - e^{-is}}{2i}\right)^2}{(1 + \gamma + \sqrt{\gamma}(e^{is} + e^{-is}))(1 + \gamma + \sqrt{\gamma}(e^{is} + e^{-is}) - z)}\,ds \\
&= \frac{i}{4\pi}\oint_{|\zeta|=1} \frac{(\zeta - \zeta^{-1})^2}{\zeta(1 + \gamma + \sqrt{\gamma}(\zeta + \zeta^{-1}))(1 + \gamma + \sqrt{\gamma}(\zeta + \zeta^{-1}) - z)}\,d\zeta \\
&= \frac{i}{4\pi}\oint_{|\zeta|=1} \frac{(\zeta^2 - 1)^2}{\zeta\left((1 + \gamma)\zeta + \sqrt{\gamma}(\zeta^2 + 1)\right)\left((1 + \gamma)\zeta + \sqrt{\gamma}(\zeta^2 + 1) - z\zeta\right)}\,d\zeta,
\end{aligned}$$

where we used the facts that $dt = -2\sqrt{\gamma}\sin s\,ds$ and $ds = -\frac{i}{\zeta}\,d\zeta$. Notice that the integrand has five simple poles, at the points

$$\zeta_0 := 0, \qquad \zeta_1 := \frac{-(1 + \gamma) + (1 - \gamma)}{2\sqrt{\gamma}} = -\sqrt{\gamma}, \qquad \zeta_2 := \frac{-(1 + \gamma) - (1 - \gamma)}{2\sqrt{\gamma}} = -\frac{1}{\sqrt{\gamma}},$$

$$\zeta_3 := \frac{-(1 + \gamma) + z + \sqrt{(1 + \gamma - z)^2 - 4\gamma}}{2\sqrt{\gamma}}, \qquad\text{and}\qquad \zeta_4 := \frac{-(1 + \gamma) + z - \sqrt{(1 + \gamma - z)^2 - 4\gamma}}{2\sqrt{\gamma}}.$$

It is easy to calculate the residues at each of the five poles and find them equal to

$$\frac{1}{\gamma}, \qquad -\frac{1 - \gamma}{\gamma z}, \qquad \frac{1 - \gamma}{\gamma z}, \qquad \frac{\sqrt{(1 + \gamma - z)^2 - 4\gamma}}{\gamma z}, \qquad\text{and}\qquad -\frac{\sqrt{(1 + \gamma - z)^2 - 4\gamma}}{\gamma z} \qquad (2.50)$$

respectively. Now, we have to decide whether the singularities are inside the unit circle and thus which residues to count. Noting that $\zeta_3\zeta_4 = 1$, we see that only one of the residues corresponding to $\zeta_3$ and $\zeta_4$ should be counted. By the definition of the complex square root, the real and imaginary parts of $\sqrt{(1 + \gamma - z)^2 - 4\gamma}$ have the same signs as those of $z - (1 + \gamma)$, hence

$$|\zeta_3| > 1 \qquad\text{and}\qquad |\zeta_4| < 1.$$

Moreover,

$$|\zeta_1| = \left|-\sqrt{\gamma}\right| < 1 \qquad\text{and}\qquad |\zeta_2| = \left|-1/\sqrt{\gamma}\right| > 1.$$

Thus, by Cauchy's residue theorem, we obtain

$$m_{\mathrm{mp}}(z) = -\frac{1}{2}\left(\frac{1}{\gamma} + \frac{\gamma - 1}{\gamma z} - \frac{\sqrt{(1 + \gamma - z)^2 - 4\gamma}}{\gamma z}\right) = \frac{1 - \gamma - z + \sqrt{(z + \gamma - 1)^2 - 4\gamma z}}{2\gamma z} = \frac{1 - \gamma - z + i\sqrt{(\lambda_+ - z)(z - \lambda_-)}}{2\gamma z}, \qquad (2.51)$$

which completes the proof when $\gamma < 1$.

When $\gamma > 1$ the Marchenko-Pastur distribution also has a point mass of size $1 - \gamma^{-1}$ at zero, so the integral has an extra term we must take into account. Moreover, when $\gamma > 1$ we have $|\zeta_1| = \left|-\sqrt{\gamma}\right| > 1$ and $|\zeta_2| = \left|-1/\sqrt{\gamma}\right| < 1$, and thus the residue at $\zeta_2$ should be taken into consideration instead. After some manipulation, we find equation (2.51) is still true. The case $\gamma = 1$ follows by the equation's continuity in $\gamma$.

We can prove equation (2.49) by calculating with equation (2.51):

$$\begin{aligned}
-\frac{1}{m_{\mathrm{mp}}} &= -\frac{2\gamma z}{1 - \gamma - z + i\sqrt{(\lambda_+ - z)(z - \lambda_-)}} \\
&= -\frac{2\gamma z\left(1 - \gamma - z - i\sqrt{(\lambda_+ - z)(z - \lambda_-)}\right)}{\left(1 - \gamma - z + i\sqrt{(\lambda_+ - z)(z - \lambda_-)}\right)\left(1 - \gamma - z - i\sqrt{(\lambda_+ - z)(z - \lambda_-)}\right)} \\
&= -\frac{2\gamma z\left(1 - \gamma - z - i\sqrt{(\lambda_+ - z)(z - \lambda_-)}\right)}{(1 - \gamma - z)^2 + (\lambda_+ - z)(z - \lambda_-)} \\
&= -\frac{2\gamma z\left(1 - \gamma - z - i\sqrt{(\lambda_+ - z)(z - \lambda_-)}\right)}{(1 - \gamma - z)^2 - (1 - \gamma - z)^2 + 4\gamma z} \\
&= \frac{\gamma - 1 + z + i\sqrt{(\lambda_+ - z)(z - \lambda_-)}}{2} \\
&= \frac{2\gamma - 2 + 2z - \gamma + 1 - z + i\sqrt{(\lambda_+ - z)(z - \lambda_-)}}{2} \\
&= \gamma - 1 + z + z\gamma\,\frac{1 - \gamma - z + i\sqrt{(\lambda_+ - z)(z - \lambda_-)}}{2\gamma z} \\
&= \gamma - 1 + z + z\gamma\, m_{\mathrm{mp}},
\end{aligned}$$

which is equivalent to equation (2.49).

As we have just seen, $m_{\mathrm{mp}}(z)$ is a solution of equation (2.49); the next lemma states that it is the unique solution which is a function from the upper half-plane to the upper half-plane. In particular, it is the only solution that is the Stieltjes transform of some distribution, as Stieltjes transforms are functions from the upper half-plane to the upper half-plane. This is a converse to Lemma 2.18.

Lemma 2.19. Suppose that $s : \mathbb{H} \to \mathbb{H}$ is a complex function from the upper half-plane to the upper half-plane which satisfies the equation

$$s(z) + \frac{1}{z + \gamma - 1 + z\gamma\, s(z)} = 0. \qquad (2.52)$$

Then

$$s \equiv m_{\mathrm{mp}}. \qquad (2.53)$$

Proof. The proof is very simple, as (2.52) is a quadratic equation in $s$:

$$z\gamma s^2 + (z + \gamma - 1)s + 1 = 0.$$

Now using the quadratic formula, we get

$$s = \frac{1 - \gamma - z \pm \sqrt{(z + \gamma - 1)^2 - 4z\gamma}}{2z\gamma} = \frac{1 - \gamma - z \pm i\sqrt{(\lambda_+ - z)(z - \lambda_-)}}{2\gamma z},$$

and since $s : \mathbb{H} \to \mathbb{H}$, we must choose the positive square root to get

$$s = \frac{1 - \gamma - z + i\sqrt{(\lambda_+ - z)(z - \lambda_-)}}{2\gamma z}. \qquad (2.54)$$

This completes the proof: $s \equiv m_{\mathrm{mp}}$.
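Because $m_{\mathrm{mp}}$ is the unique solution of (2.49) mapping $\mathbb{H}$ to $\mathbb{H}$, it can also be computed by simple fixed-point iteration of the self-consistent equation. The following minimal Python sketch (an illustration we add here, not part of the thesis) iterates $s \mapsto (1 - \gamma - z - z\gamma s)^{-1}$ and compares the result with the closed form (2.48).

```python
import numpy as np

gamma = 0.5
lam_m, lam_p = (1 - np.sqrt(gamma))**2, (1 + np.sqrt(gamma))**2
z = 1.2 + 1.0j                              # spectral parameter in the upper half-plane

# Fixed-point iteration of the self-consistent equation (2.49).
s = 1j                                      # start in the upper half-plane
for _ in range(500):
    s = 1 / (1 - gamma - z - z * gamma * s)

# Closed form (2.48), with the square-root branch chosen so that Im m > 0.
root = 1j * np.sqrt((lam_p - z) * (z - lam_m))
if root.imag < 0:
    root = -root
m_closed = (1 - gamma - z + root) / (2 * gamma * z)
print(abs(s - m_closed))                    # small once the iteration has converged
```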


Remark 2.20. There is an alternate proof of Lemma 2.18 which uses the observation made in equation (2.10). By looking at the moment generating function for the Marchenko-Pastur distribution, one can derive equation (2.49) using some simple combinatorics of graphs. From there it is possible to solve equation (2.49) using the quadratic formula and the fact that $m(z)$ is a function from $\mathbb{H}$ to $\mathbb{H}$ to select the correct root, thus avoiding any complex analysis. As a reference, see [1], page 20, for example.

In fact, it is possible to see a priori that the solution (2.54) of equation (2.52) is the Stieltjes transform of the Marchenko-Pastur distribution using the inversion formula in Proposition 2.4. In more detail: note that

$$\operatorname{Im}\left(1 - \gamma - z + i\sqrt{(\lambda_+ - z)(z - \lambda_-)}\right) = -\eta + \operatorname{Im}\left(i\sqrt{(\lambda_+ - z)(z - \lambda_-)}\right),$$

$$\operatorname{Re}\left(1 - \gamma - z + i\sqrt{(\lambda_+ - z)(z - \lambda_-)}\right) = 1 - \gamma - E + \operatorname{Re}\left(i\sqrt{(\lambda_+ - z)(z - \lambda_-)}\right),$$

$$\operatorname{Im}\left(\frac{1}{2\gamma z}\right) = -\frac{\eta}{2\gamma|z|^2}, \qquad\text{and}\qquad \operatorname{Re}\left(\frac{1}{2\gamma z}\right) = \frac{E}{2\gamma|z|^2}.$$

Thus

$$\operatorname{Im}\left(\frac{1 - \gamma - z + i\sqrt{(\lambda_+ - z)(z - \lambda_-)}}{2\gamma z}\right) = \frac{E\operatorname{Im}\left(i\sqrt{(\lambda_+ - z)(z - \lambda_-)}\right) - \eta(1 - \gamma) - \eta\operatorname{Re}\left(i\sqrt{(\lambda_+ - z)(z - \lambda_-)}\right)}{2\gamma|z|^2} = \frac{\operatorname{Re}\left(\bar{z}\sqrt{(\lambda_+ - z)(z - \lambda_-)}\right) - \eta(1 - \gamma)}{2\gamma|z|^2},$$

which, when we take the limit $\eta \searrow 0$ and divide by $\pi$, yields the density

$$\frac{1}{\pi}\,\frac{\operatorname{Re}\left(E\sqrt{(\lambda_+ - E)(E - \lambda_-)}\right)}{2\gamma E^2} = \frac{\sqrt{(\lambda_+ - E)(E - \lambda_-)}}{2\pi\gamma E}.$$

The Stieltjes transform of the Marchenko-Pastur distribution has some nice analytic properties on a reasonable subset of the upper half-plane. These properties are essential for controlling various quantities during the proof in Section 6. Define the region

$$\mathcal{R} := \left\{z := E + i\eta \in \mathbb{C} : \mathbb{1}_{\gamma>1}\,(\lambda_-/5) \le E \le 5\lambda_+ \text{ and } M^{-1} \le \eta \le 10(1 + \gamma)\right\} \subseteq \mathbb{H}, \qquad (2.55)$$

which we shall motivate further in Section 3, but for now note that the energy parameter $E$ is restricted to being close to the support of the Marchenko-Pastur distribution and $\eta$ is at most of order 1. For $z := E + i\eta$, we also introduce the edge parameter

$$\kappa \equiv \kappa(z) := \min\{|\lambda_+ - E|, |E - \lambda_-|\}, \qquad (2.56)$$

which records how close the energy of the spectral parameter $z$ is to the edges of the support of the Marchenko-Pastur distribution. Note that when $\eta$ is small, $\kappa$ has similar behavior to the magnitude of

$$(\lambda_+ - z)(z - \lambda_-),$$

which is an important term in the density of the Marchenko-Pastur distribution and its Stieltjes transform. We now collect some of the properties of the Marchenko-Pastur distribution in the region $\mathcal{R}$ in the following lemma.


Lemma 2.21. For $z := E + i\eta \in \mathcal{R}$, we have the following uniform bounds:

$$|m_{\mathrm{mp}}(z)| \sim 1, \qquad (2.57)$$

which tells us $m_{\mathrm{mp}}(z)$ is of order 1 throughout $\mathcal{R}$;

$$\left|1 - m_{\mathrm{mp}}(z)^2\right| \sim \sqrt{\kappa + \eta}; \qquad (2.58)$$

$$\left|1 - \gamma - z - z\gamma\, m_{\mathrm{mp}}\right| \ge c > 0, \qquad (2.59)$$

which tells us that the denominator in (2.49) is not too small throughout $\mathcal{R}$;

$$\operatorname{Im} m_{\mathrm{mp}}(z) \sim \begin{cases} \dfrac{\eta}{\sqrt{\kappa + \eta}} & \text{if } \kappa \ge \eta \text{ and } |E| \notin [\lambda_-, \lambda_+], \\[1ex] \sqrt{\kappa + \eta} & \text{if } \kappa \le \eta \text{ or } |E| \in [\lambda_-, \lambda_+], \end{cases} \qquad (2.60)$$

which tells us how the density of the Marchenko-Pastur distribution behaves near its edges—in particular, when $\eta$ is small, we see square root decay as we approach the edges of the support. Furthermore,

$$\frac{\operatorname{Im} m_{\mathrm{mp}}(z)}{N\eta} \ge O\left(\frac{1}{N}\right) \qquad (2.61)$$

and

$$\frac{\partial}{\partial\eta}\left(\frac{\operatorname{Im} m_{\mathrm{mp}}(z)}{\eta}\right) \le 0. \qquad (2.62)$$

Proof. The proofs are elementary calculations using the expression (2.48) and the self-consistent equation (2.49).

2.4. The general strategy. We can now explain how one puts together the tools in Section 2 to prove a local law for an ensemble of random matrices $H \equiv H_N$, and thus give an overview of the proof in Section 6. To begin with, it is desirable that the ensemble's spectrum is supported on an interval of order 1. This can often be achieved with the following heuristic: use a moment calculation to compute $\operatorname{Tr} H^2$ and normalize the entries of the matrix to make $\operatorname{Tr} H^2$ of order $N$. This heuristic comes from the fact that $\operatorname{Tr} H^2 = \sum_\alpha \lambda_\alpha^2$, where $(\lambda_\alpha)_{\alpha=1}^{N}$ are the eigenvalues of $H$. If the eigenvalues $\lambda_\alpha$ are of order 1, then so are the positive $\lambda_\alpha^2$, which implies $\sum_\alpha \lambda_\alpha^2$ is of order $N$. We perform a calculation of this form for sparse covariance matrices in Section 4.

Note that by the remarks in Subsection 2.1, if we want to say that for intervals of order $1/N$ (up to poly-logarithmic factors) the density of the eigenvalues of the random ensemble is close to some deterministic density, with probability going to 1 and with error term going to 0 as the dimension $N$ goes to infinity, then we can study the same convergence in Stieltjes transforms. Moreover, we wish to do this uniformly for $z$ in a particular region $\mathcal{R}$ where $E$ is close to the limiting spectrum of the ensemble and $\eta$ is at most of order 1 and at least $1/N$ up to poly-logarithmic factors. We defined such a region for the ensemble we shall consider in (2.55).

A self-consistent equation is an equation which uniquely defines a function from $\mathbb{H}$ to $\mathbb{H}$; in particular, it defines the Stieltjes transform of a probability distribution. The equation usually has nice properties such as stability—that is, it can be perturbed, and then the solution of the perturbed self-consistent equation is close to the solution of the original equation. Additionally, the solution to the self-consistent equation normally has a number of nice analytic properties in the region $\mathcal{R}$ mentioned above, such as the properties outlined in Lemma 2.21 for the Stieltjes transform of the Marchenko-Pastur distribution. We prove the self-consistent equation's stability in Lemma 6.12.


Thus, the general strategy is to find a self-consistent equation for the Stieltjes transform of the ensemble using the resolvent identities outlined in Lemma 2.13. The ensemble's self-consistent equation should have some perturbation term depending on $z$ that is controllable, in the sense that it can be bounded by a function $f(N)$ that goes to zero as $N$ becomes large. Typically, the control is optimal for $f(N) = 1/N$, so these produce strong local laws; for $f(N) = 1/\sqrt{N}$, for example, we can derive weak local laws. However, apart from this perturbation term, the self-consistent equation should closely resemble the self-consistent equation of the Stieltjes transform of some probability distribution. We refer to the latter as the global self-consistent equation and to the ensemble's self-consistent equation as the approximate self-consistent equation. For the ensemble we shall consider, the approximate self-consistent equation is derived in Lemma 6.7.

Often the perturbation term in the approximate self-consistent equation is in the denominator of a term in the equation, simply due to the form of the resolvent identity. So, because of the stability of the global self-consistent equation, the approximate self-consistent equation has a solution that is provably close to the solution of the global self-consistent equation when the perturbation term is small. Naturally, we are then forced to control the size of the perturbation term.

Often the perturbation involves some quadratic form of the entries of the matrix ensemble with coefficients based on the entries of the resolvent. Thus, we use bounds on the resolvent, which we can prove hold with high probability (see Lemma 6.10), and large deviation estimates (see Section 5) to control the size of the perturbation term. This produces bounds on the perturbation term that hold with high probability (see Lemma 6.11). Often the bounds are easier to prove for large $\eta$, as for large $\eta$ many of the quantities we deal with are bounded and well behaved. This is a simple consequence of the bound (2.6). Once we have a bound on the perturbation term with high probability, the stability of the self-consistent equation tells us that the approximate self-consistent equation's solution is close to the solution of the global self-consistent equation with high probability.

To show that the bounds on the perturbation term are still valid for small $\eta$, we use a continuity (sometimes called bootstrapping) argument, which we perform in Subsection 6.5. From our initial estimate for large $\eta$, which holds with high probability, we define a sequence of complex numbers whose imaginary parts decrease from large $\eta$ down to the smallest values of $\eta$ ($1/N$ up to logarithmic factors). This sequence must have at most polynomially (in $N$) many points, and the gaps between consecutive points must have size at most of order $N^{-8}$, which is possible since the original value of large $\eta$ is of order 1. Then, using an induction argument, we prove that for all points in the sequence the perturbation term is small with high probability. The induction relies on the Lipschitz continuity of the resolvent (with Lipschitz constant at most $\eta^{-2} \le N^2$ up to constant factors in the region $\mathcal{R}$) and the fact that the points are very close together. Moreover, since there are at most polynomially many points, the perturbation term is small on all of the points concurrently with high probability (see Remark 3.4). A similar argument for a lattice covering the whole region $\mathcal{R}$ works, which proves the bound uniformly on $\mathcal{R}$ by the Lipschitz continuity of the resolvent.


3. Definitions and results

We begin this section by defining the class of $N \times N$ random matrices $H \equiv H_N = X_N^* X_N$ we consider. The parameter $N$ should be thought of as large, and often we shall not explicitly state the $N$-dependence of objects. We consider a general class of random matrices with sparse entries characterized by the parameter $q$, which may be $N$-dependent. The parameter $q$ expresses the singularity of the distributions of the entries of $X$.

Definition 3.1 ($H$). Fix a (possibly $N$-dependent) parameter $\xi \equiv \xi_N$ satisfying

$$1 + a_0 \le \xi \le A_0 \log\log M, \qquad (3.1)$$

where $a_0 > 0$ and $A_0 \ge 10$. This parameter $\xi$ will be used as an exponent in logarithmic corrections as well as probability estimates. We consider $N \times N$ matrices of the form $H = X^*X$, where $X = (x_{ij})$ is an $M \times N$ matrix whose entries are real and independent. We assume that the entries of $X$ satisfy the moment conditions

$$\mathbb{E}x_{ij} = 0, \qquad \mathbb{E}|x_{ij}|^2 = \frac{1}{M}, \qquad\text{and}\qquad \mathbb{E}|x_{ij}|^p \le \frac{C^p}{Mq^{p-2}} \qquad (3.2)$$

for $1 \le i \le M$, $1 \le j \le N$, and $3 \le p \le (\log M)^{A_0\log\log M}$, where $C$ is a positive constant. We assume that the ratio of $N$ and $M$, defined as $\gamma_N := N/M_N$, converges to some positive constant $\gamma$ at the following rate:

$$|\gamma_N - \gamma| \le \frac{C}{M}. \qquad (3.3)$$

Moreover, the (possibly $N$-dependent) parameter $q \equiv q_N$ satisfies

$$(\log M)^{3\xi} \le q \le CM^{1/2}. \qquad (3.4)$$

This differs from the case where the entries are Gaussian primarily because the moments of the entries decay slowly. The variance is of order $M^{-1}$ as usual, but the higher moments decay at a rate proportional to inverse powers of $q$ and not $N^{1/2}$ as usual. Thus, unlike Wishart matrices, the entries of the above sparse matrices satisfying Definition 3.1 do not have a natural scale. For a more precise statement see Remark 2.5 of [7].

Remark 3.2. While the matrix $X$ is not directly related to the adjacency matrix of the directed Erdős-Rényi random graph, since it is not square, this is the motivation for the assumptions we make on the moments of the entries of $X$. We reproduce the motivation given in [7]: suppose (here only) that $X$ is square with $M = N$; then the assumptions on the matrix $X$ are motivated by the adjacency matrix of the directed Erdős-Rényi random graph, which is the random graph where each edge exists independently with probability $0 \le p \le 1$. Thus the adjacency matrix, or the directed Erdős-Rényi matrix, has independent entries which are equal to 1 with probability $p$ and 0 with probability $1 - p$. For convenience and a more concise statement of our results we will replace $p$ with the new parameter $q \equiv q_N := (Np)^{1/2}$. Moreover, as usual, we will rescale the entries of our matrices so that the bulk eigenvalues typically lie in an interval of size of order 1.

Thus we give the following definition for the matrix $Y$. Let $Y = (y_{ij})$ be the $N \times N$ matrix whose entries are independent and identically distributed according to

$$y_{ij} := \frac{\beta}{q}\times\begin{cases} 1 & \text{with probability } \frac{q^2}{N}, \\[0.5ex] 0 & \text{with probability } 1 - \frac{q^2}{N}. \end{cases} \qquad (3.5)$$

Here the scaling $\beta := (1 - q^2/N)^{-1/2}$ has been introduced for convenience. Notice that the parameter $q \le N^{1/2}$ expresses the sparseness of the matrix (or, more generally, the singularity of the distribution of $x_{ij}$ and $y_{ij}$), and it may depend on $N$. Typically, $Y$ has $Nq^2$ nonvanishing entries, so if $q \ll N^{1/2}$ the matrix is called sparse.

Notice that the entries of $Y$ are not mean-zero, so we center them and write

$$y_{ij} = x_{ij} + \frac{\beta q}{N}, \qquad (3.6)$$

so that $X = (x_{ij})$ is an $N \times N$ matrix with mean-zero entries. It is easy to calculate the following moment bounds on the entries of $X$:

$$\mathbb{E}x_{ij}^2 = \frac{1}{N} \qquad\text{and}\qquad \mathbb{E}|x_{ij}|^p \le \frac{1}{Nq^{p-2}} \qquad (3.7)$$

for $p \ge 2$. In Definition 3.1 we define a more general class of random matrices which contains the above motivating example.
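To make the scaling in (3.5)-(3.7) concrete, here is a minimal Python sketch (an illustration we add here, not part of the thesis) that samples the centered sparse entries and checks the variance $1/N$ and the decay of a higher moment against the bound $1/(Nq^{p-2})$.

```python
import numpy as np

# Sample centered sparse entries x_ij = y_ij - beta*q/N as in (3.5)-(3.6)
# and check the moment bounds (3.7) empirically.
rng = np.random.default_rng(5)
N, q = 4000, 10                       # sparseness parameter q << N^{1/2}
beta = (1 - q**2 / N) ** (-0.5)
y = (beta / q) * rng.binomial(1, q**2 / N, size=10**6)
x = y - beta * q / N                  # centered entries

print(np.mean(x**2) * N)              # ~1: the variance is 1/N
p = 4
print(np.mean(np.abs(x)**p) * N * q**(p - 2))  # O(1): matches the bound in (3.7)
```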

Many of the events we deal with, such as various inequalities holding true, have very high probability. In particular, they quickly become increasingly likely as $N$ grows and in the limit hold almost surely. However, for finite $N$ adversarial events can occur rarely, so we cannot rule them out. This forces us to assume such adversarial events do not occur, which is true with very high probability. Many of our results will be stated in terms of very high probability events, which can be characterized by two positive parameters, $\xi$ and $\nu$, where $\xi$ is subject to assumption (3.1).

Definition 3.3 (High probability events). Recall the restrictions on $\xi$ in equation (3.1). Let $\Omega$ be an $M$-dependent (or, equivalently, $N$-dependent) event. We say $\Omega$ holds with $(\xi, \nu)$-high probability if

$$\mathbb{P}(\Omega^c) \le e^{-\nu(\log M)^\xi} \qquad (3.8)$$

for $M \ge M_0(\nu, a_0, A_0)$. We can generalize this definition: let $\Omega_0$ be a given event; then we say that $\Omega$ holds with $(\xi, \nu)$-high probability on $\Omega_0$ if

$$\mathbb{P}(\Omega_0 \cap \Omega^c) \le e^{-\nu(\log M)^\xi} \qquad (3.9)$$

for $M \ge M_0(\nu, a_0, A_0)$.

Remark 3.4. Just as we mentioned in Notation 2.1, we will allow ν to decrease from one line to anotherwithout noting this or introducing a new notation. Since such changes to ν will occur only finitely manytimes in the proof, the constant remains positive. Hence our results will hold for all ν ď ν0 where ν0 dependsonly on the constants in (3.2). An interesting and convenient property of high probability events is that theconjunction of polynomially many high probability events is also a high probability event, possibly with areduction in the constant ν. Let tAi : i P IMu, where |IM | ď CMk for some constant C and k P N, be acollection of events with pξi, νiq-high probability respectively. Define ξ :“ mini ξi and ν :“ mini νi, then,

26

since ν ą 0 and ξ ą 1,

P

˜«

č

iPIAi

ffc¸

“ P

˜

ď

iPIAci

¸

ďÿ

iPIP pAci q

ďÿ

iPIe´νiplogMqξi

ď CNke´νplogMqξ

“ elogC`k logMe´νplogMqξ

ď eνplogMqξ

for large enough M and with a possible reduction in ν.

We now state and prove a lemma discussing the equivalence of the ensemble defined in Definition 3.1 andits cousin XX˚.

Lemma 3.5. Suppose that H satisfies Definition 3.1. Then the nonzero eigenvalues of H “ X˚X are identicalto the nonzero eigenvalues of H :“ XX˚. In terms of their empirical distributions, we have

µH “ p1´ γ´1N qδ0 ` γ

´1N µH . (3.10)

Moreover, this implies that the Stieltjes transforms of X˚X and XX˚, denoted by mN ” mN pzq and mM ”

mM pzq respectively, are related by

mN pzq “ pγ´1N ´ 1q

1

z` γ´1

N mM pzq. (3.11)

Proof. Denote the eigenvalues of H and H in increasing order by

pλαqNα“1 and pλαq

Mα“1

respectively, and denote their corresponding eigenvectors by

pvαqNα“1 and pvαq

Mα“1

respectively. Then by definition

Hvα “ λαvα and H vα “ λα vα.

Suppose that λα ‰ 0, then it is an eigenvalue of H with normalized eigenvector

Xvα‖Xvα‖

,

since

XX˚Xvα‖Xvα‖

“1

‖Xvα‖X pX˚Xqvα “ λα

Xvα‖Xvα‖

.

27

We can show the opposite direction starting with λα in exactly the same way. One can also use the Sylvester’sdeterminant theorem to show that the nonzero spectra of H and H agree. So, for example and without lossof generality, if we assume M ď N , then

σ pHq :“ pλ1, . . . , λN q “

¨

˝0, . . . , 0loomoon

N´M

, λ1, . . . , λM

˛

‚.

Thus,

µH “1

N

ÿ

λPσpHq

δλ “N ´M

N`

1

N

ÿ

λPσpHq

δλ “ p1´ γ´1N qδ0 ` γ

´1N µH . (3.12)

Moving on to the claim in equation (3.11), when we integrate (3.12), we see

mN “γ´1N ´ 1

z` γ´1

N mM , (3.13)

which is what we wanted to show.

Remark 3.6. Examining relation (3.13) we can see precisely the difference between the spectra of X˚X andXX˚: the coefficient of 1z results from the different number of zero eigenvalues, as can be seen from theequation for the Stieltjes transformation of a matrix in (2.22).

Notation 3.7 (Fundamental objects of X˚X and XX˚ and their minors). Lemma 3.5 essentiallytells us that the two ensembles H and H are equivalent and we can use the two interchangeably. We makeuse of this observation several times throughout the paper. To empathize which ensemble we are using atany time, we adopt the following notation hereafter: Let H :“ X˚X be as defined in Definition 3.1, whereH “ phijq and X “ pxijq. We denote the eigenvalues and eigenvectors of H as

pλαqNα“1 and pvαq

Nα“1 (3.14)

respectively. Furthermore, we denote the resolvent and Stieltjes transform of H by G ” Gpzq “ pGijpzqqand mN ” mN pzq respectively.

We use the underline to denote the corresponding objects for the ensemble of MˆM matrices H “ XX˚:H “ phijq, pλαq

Mα“1, pvαq

Mα“1, G ” Gpzq “ pGijpzqq, and mM ” mM pzq.

The minors of covariance matrices have a special expression in terms of the entries of X. Let xi be theith column of X and XpTq be the M ˆ pN ´ |T|q matrix formed by removing all the columns of X indexedby i P T. Then, for i “ 1, we have

X ““

x1 Xp1q‰

(3.15)

and thus

X˚X “

«

x˚1 x1 x˚1Xp1q

`

Xp1q˘˚

x1

`

Xp1q˘˚Xp1q

ff

. (3.16)

So, in the notation of Subsection 2.2, we have

h11 “ x˚1 x1, h1 “

´

Xp1q¯˚

x1, and Hp1q “´

Xp1q¯˚

Xp1q. (3.17)

28

Note that h11 and Hp1q are independent. In general, we form the minor HpTq using XpTq. Following thenotation of Definition 2.10, we denote the eigenvalues, eigenvectors, resolvent, and Stieltjes transform ofHpTq as

´

λpTqα

¯N´|T|

α“1,

´

vpTqα

¯N´|T|

α“1, GpTq ” GpTqpzq “

´

GpTqij pzq

¯

, and mpTqN´|T| ” m

pTqN´|T|pzq (3.18)

respectively. We also define the M ˆM matrix

HpTq :“ XpTq´

XpTq¯˚

. (3.19)

Note that this is not the pTq minor of H. It is merely a rank |T | perturbation of H. As above we define thecorresponding objects

´

λpTqα

¯M

α“1,

´

vpTqα

¯M

α“1, GpTq ” GpTqpzq “

´

GpTqij pzq

¯

, and mpTqM ” m

pTqM pzq. (3.20)

It is also necessary for us to introduce a second type of minor: Let ri be the ith row of X and XrTs bethe pM ´ |U|q ˆN matrix formed by removing all the row of X indexed by i P U. Then, for i “ 1, we have

X “

r1

Xr1s

. (3.21)

and thus

XX˚ “

«

r1r˚1 r1

`

Xr1s˘˚

Xr1sr˚1 Xr1s`

Xr1s˘˚

ff

. (3.22)

So, in the notation of Subsection 2.2, we have

h11 “ r1r˚1 , h1 “ Xr1sr˚1 , and Hp1q “ Xr1s

´

Xr1s¯˚

. (3.23)

Note that h11 and Hr1s are independent. In general, we form the minor HrUs using XrUs. Following thenotation of Definition 2.10, we denote the eigenvalues, eigenvectors, resolvent, and Stieltjes transform ofHrUs as

´

λrUsα

¯M´|U|

α“1,

´

vrUsα

¯M´|U|

α“1, GrUs ” GrUspzq “

´

GrUsij pzq

¯

, and mrUsM´|U| ” m

rUsM´|U|pzq (3.24)

respectively. We also define the N ˆN matrix

HrUs :“´

XrUs¯˚

XrUs. (3.25)

Note that this is not the rUs minor of H. As above we define the corresponding objects

´

λrUsα

¯N

α“1,

´

vrUsα

¯N

α“1, GrUs ” GrUspzq “

´

GrUsij pzq

¯

, and mrUsN ” m

rUsN pzq. (3.26)

Finally, it is sometimes necessary for us to mix these two types of minor. Let XpU,Tq be the pM ´ |U|q ˆpN ´ |T|q matrix formed by removing all the row of X indexed by i P U and all the columns of X indexedby j P T. Then define

HpU,Tq :“´

XpU,Tq¯˚

XpU,Tq and HpU,Tq :“ XpU,Tq´

XpU,Tq¯˚

, (3.27)

29

along with all their related objects.So, in summary, the underline or lack thereof indicated the order of the matrices X and X˚. The

minors are notated with parentheses p¨q if they are formed by removing columns of X, they are notatedwith square brackets r¨s if they are formed by removing rows of X, and they are notated with two argumentparentheses p¨ , ¨q if they are formed by removing both rows (in the first coordinate) and columns (in thesecond coordinate).

We now list our results. We introduce the spectral parameter

z :“ E ` iη, (3.28)

where E P R and η ą 0. ForL ” LN ě 8ξ, (3.29)

where ξ satisfies the assumption (3.1), define the region

RL :“ tz :“ E ` iη P C : 1γą1pλ´5q ď E ď 5λ` and plogMqLM´1 ď η ď 10p1` γqu, (3.30)

where λ˘ are given by the Marchenko-Pastur distribution, see equation (2.46). Extend this definition to theregion

R :“ tz :“ E ` iη P C : 1γą1pλ´5q ď E ď 5λ` and M´1 ď η ď 10p1` γqu. (3.31)

which, for all L, encloses RL. Note that for z P R, Lemma 2.21 tells us mmppzq „ 1. Finally, for z :“ E` iη,we introduce the quantity

κ ” κpzq :“ mint|λ` ´ E| , |E ´ λ´|u, (3.32)

which records how close the energy of the spectral parameter z is to the spectral edges of the Marchenko-Pastur distribution.

Theorem 3.8. Let H be as defined in Definition 3.1. Then there are constants ν ą 0 and C ą 0 such thatthe following statements hold for any ξ satisfying (3.1) and L satisfying (3.29). The events

č

zPRL

"

maxi‰j

|Gijpzq| ďCplogMqξ

q`CplogMq2ξ

pMηq12

*

, (3.33)

č

zPRL

"

maxi

|Giipzq ´mN pzq| ďCplogMqξ

q`CplogMq2ξ

pMηq12

*

, (3.34)

andč

zPRL

"

maxi

|Giipzq ´mmppzq| ďCplogMqξ

?q

`CplogMq2ξ

pMηq14

*

(3.35)

hold with pξ, νq-high probability. Furthermore, we have the weak local semicircle law: the event

č

zPRL

"

|mN pzq ´mmppzq| ďCplogMqξ

?q

`CplogMq2ξ

pMηq14

*

(3.36)

holds with pξ, νq-high probability.

30

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5Λ0.0

0.2

0.4

0.6

0.8

1.0density

(a) γ “ 12 and p “ 0.4 (or equivalently q “ 40).

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5Λ0.0

0.2

0.4

0.6

0.8

1.0density

(b) γ “ 12 and p “ 0.001 (or equivalently q “ 2).

0.0 0.5 1.0 1.5 2.0 2.5 3.0Λ0.0

0.2

0.4

0.6

0.8

1.0density

(c) γ “ 14 and p “ 0.4 (or equivalently q “ 20?

2).

0.0 0.5 1.0 1.5 2.0 2.5 3.0Λ0.0

0.2

0.4

0.6

0.8

1.0density

(d) γ “ 14 and p “ 0.001 (or equivalently q “?

2).

Figure 3: This figure shows the spectra of four sparse covariance matrices. The entries were sampled accordingto a Bernoulli distribution with parameter p and then the matrix was normalized. In each subfigure we plota histogram of the eigenvalues and the profile predicted by the Marchenko-Pastur distribution, to comparethe two. In figure 3a the matrix X has dimension 8, 000 ˆ 4, 000 and q is relatively large, and we see thatthe Marchenko-Pastur distribution agrees closely with the spectrum. In subfigure 3b the dimensions of Xare the same but q is relatively small, and thus the agreement is less accurate. In subfigures 3c and 3d Xhas dimension 8, 000 ˆ 2, 000, and we observe the same phenomenon for a different value of γ. Note thisdependence on q is to be expected given the statement of Theorem 3.8.

31

4. Moment calculations

In this section we calculate the moments of the entries of the matrix H, as defined in Definition 3.1, up totheir leading orders. Before that we prove a pξ, νq-high probability bound on the entries of X and calculateETrH2 as we discussed in Subsection 2.4, which serves as a nice warm up for Lemma 4.3 and the largedeviation estimates of Section 5. Through this section we use the following notation for the moments ofentries of X,

µk :“ E|x11|k. (4.1)

for integers k ě 0. Note that in Definition 3.1, it is not assumed that the random variables pxijq areidentically distribution, only that they satisfy the common moment bound (3.2). This means that we do not

necessarily have µk “ E |xij |k for general i and j, however since we are interested only in asymptotics in thissection, without loss of generality, we ignore this dependence on i and j.

Lemma 4.1. Choose K large enough so that K ě Ce for C depending on (3.2), then we have with pξ, νq-highprobability

|xij | ďK

q. (4.2)

Proof. Choose p :“ νplogMqξ and then apply Markov’s inequality to show

|xij | ąK

q

˙

“ Pˆ

|xij |p ąKp

qp

˙

ďE |xij |p

Kpqp

ďCp

Mqp´2

qp

Kp

ďq2

M

ˆ

C

K

˙p

ď e´νplogMqξ ,

where we used the moment bound (3.2) in the third line and the bound (3.4) on q in the last line.

Lemma 4.2. Let H “ phijq be as defined in Definition 3.1. Then

ETrH2 „ N. (4.3)

Proof. We calculate, using the definition H “ X˚X, to see

hij “Mÿ

k“1

xkixkj

and thus“

H2‰

ii“

Nÿ

j“1

hijhji “Nÿ

j“1

˜

Mÿ

k“1

xkixkj

¸˜

Mÿ

s“1

xsixsj

¸

.

32

Therefore, we have the following expression

ETrH2 “

Nÿ

i,j“1

Mÿ

k,s“1

Exkixkjxsixsj .

The value of the summand is completely dependent on the assignment of indices i, j, k, and l, since therandom variables are centered and independent. So, we find

Exkixkjxsixsj “

$

&

%

µ4 if i “ j and k “ s

µ22 if i “ j and k ‰ s

µ22 if i ‰ j and k “ s

0 if i ‰ j and k ‰ s.

Hence,

ETrH2 “

Nÿ

i“1

Mÿ

k“1

µ4 `

Nÿ

i“1

ÿ

k‰s

µ22 `

ÿ

i‰j

Mÿ

k“1

µ22 “MNµ4 `MNpN ´ 1qµ2

2 `MpM ´ 1qNµ22,

by some simple counting. Therefore, invoking the moment bound (3.2), we find

ETrH2 ď C4N

q2` γN pM `N ´ 2q ď CN,

by the assumption (3.4) on q. Similarly,

ETrH2 ěMNpM `N ´ 2qNµ22 ě γN pM `N ´ 2q ě cN,

which completes the proof.

Lemma 4.3 (Moments of the entries of H). Let H “ phijq be as defined in Definition 3.1. Then wehave the following leading order bounds,

Ehpij ď

$

&

%

C2p

q2p if i ‰ j and q ăM14

C2p

Mp2 if i ‰ j and q ěM14

C2p if i “ j

(4.4)

for all 3 ď p ď plogMqA0 log logM

Proof. We will now calculate the moments of the entries of H “ phijq to the leading order. There are twocases for the moment calculations: i “ j, for the diagonal terms, and i ‰ j, for the off-diagonal terms, whichgive different asymptotics. This is what we would expect and is true of non-sparse covariance matrices too.We will calculate the first two moments precisely and obtain leading order terms for higher moments. By asimple calculation

hij “

#

řMk“1 xkixkj if i ‰ j

řMk“1 x

2ki if i “ j

. (4.5)

33

So for the mean of the entries, we have

Ehij “

#

0 if i ‰ j

1 if i “ j. (4.6)

First, we calculate their variance in the case where i ‰ j. Since the variables are centered and independent,only the terms where a variable occurs twice contribute to the sum. Thus,

Eh2ij “

Mÿ

k1,k2“1

Exk1ixk2ixk1jxk2j “Mÿ

k1,k2“1

Exk1ixk2i ¨ Exk1jxk2j “Mÿ

k“1

Ex2ki ¨ Ex2

kj “M ¨ µ22 “

1

M,

where we used the fact that i ‰ j for the second equality. Second, we calculate their variance in the casewhere i “ j. By similar reasoning to above

Eh2ii “

Mÿ

k1,k2“1

Ex2k1ix

2k2i “

Mÿ

k“1

Ex4ki `

ÿ

k1‰k2

Ex2k1i ¨ Ex

2k2i “Mµ4 `MpM ´ 1qµ2

2 “Mµ4 `M ´ 1

M.

Summarizing, we get

Eh2ij “

#

1M if i ‰ j

Mµ4 `M´1M if i “ j

and using the moment bound from equation (3.2), we see

E|hij |2 ď

#

1M if i ‰ jC4

q2 ` 1 if i “ j. (4.7)

More generally, we can look at the pth moments of the entries:

Ehpij “

#

řMk1,...,kp“1 Exk1i ¨ ¨ ¨xkpi ¨ Exk1j ¨ ¨ ¨xkpj if i ‰ j

řMk1,...,kp“1 Ex2

k1i¨ ¨ ¨x2

kpiif i “ j

. (4.8)

First, we deal with the off-diagonal case, i ‰ j. Since the entries of X are independent, we have

Ehpij “Mÿ

k1,...,kp“1

`

Exk1i ¨ ¨ ¨xkpi˘2. (4.9)

It is clear that each assignment of the indices k1, . . . , kp induces a partition (or equivalence relation) Π onthe set t1, . . . , pu, where two elements, i and j, of the set are in the same block if and only if ki “ kj . Eachpartition Π is characterized by l blocks with sizes r1, . . . , rl such that

r1 ` ¨ ¨ ¨ ` rl “ p.

We reorganize the sum in (4.9) so that we are summing over all partitions Π. Note that distinct assignmentscan yield identical partitions. Moreover, the value of the summand is completely determined by the partitionΠ. So let nΠ denote the number of assignments to the indices k1, . . . , kp that induce the partition Π. Now,since the entries of X are centered, we can conclude that any summand corresponding to a partition Π with

34

a block of size 1 will be zero. So we may restrict to summing over partitions Π such that rk ě 2 for all1 ď k ď l, and thus l ď p2. Then, using the fact that the entries of X are independent, we can rewrite thesum (4.9) as

ÿ

Π

k“1

pExrk11q2ďÿ

Π

k“1

µ2rkďÿ

Π

k“1

C2rk

M2q2rk´4“ C2p

ÿ

Π

nΠ1

M2lq2p´4l(4.10)

where we have used the moment bound from equation (3.2). Now we use the trivial bound

nΠ ďM l. (4.11)

Using (4.11) in equation (4.10) we see

C2pÿ

Π

nΠ1

M2lq2p´4lď C2p

ÿ

Π

1

M lq2p´4l“ C2p

ÿ

Π

1

q2p

ˆ

q4

M

˙l

Either q4 ăM , in which case the leading order term is when l “ 1 and we get the bound

C2p

q2p,

since there is one such partition such that l “ 1. Otherwise q4 ěM , in which case the leading order term iswhen l “ p2 and we get the bound

C2p

Mp2

since there is one such partition such that l “ p2.Second, we turn to the diagonal case, i “ j. As before, we will sum over partitions Π, however we must

now sum over all partition (not only those with blocks of size at least 2, so it is possible that p “ l) as eachrandom variable in squared. Thus we may rewrite the second sum of (4.8) as

ÿ

Π

k“1

Ex2rk11 ď

ÿ

Π

k“1

µ2rk ďÿ

Π

k“1

C2rk

Mq2rk´2“ C2p

ÿ

Π

nΠ1

M lq2p´2lď C2p

ÿ

Π

1

q2p´2l,

where we used the bounds (3.2) and (4.11). Since q grows with M , for M large enough the leading ordercomes from the term when l “ p—that is, each block has size 1. There is exactly one such partition. So, theleading order for the diagonal case is

C2p,

as we claimed.

Remark 4.4. We can interpret Lemma 4.3 as follows: the dominant term for the diagonal entries comesfrom the product of the second moments (l “ p). Whereas, for the off-diagonal entries the dominant termcomes from the highest moment of entries of X (l “ 1) when q ă M14 and the product of the secondmoments (l “ p2) when q ěM14. The transition point q “M14 in the moments of the off diagonal entriesoccurs because the bound

Ehpij ďC2p

q2p

35

is too strong when q ąM14. That is, we would have the bound

Ehpij ăC2p

Mp2,

which is a faster decay than is exhibited by choosing X to have Gaussian entries. Thus, we must settle forthe weaker bound

Ehpij ďC2p

Mp2.

This plateau in the moment bounds is exactly what we should expect and essentially what we are finding isthat covariance matrices are only sparse when q ăM14, rather than q ăM12.

36

5. Large deviation estimates

In this section we state and prove a large deviation estimate for random variables satisfying the momentconditions outlined in Definition 3.1. The argument’s format is typical: in Lemma 5.3 we prove high momentbounds for the sums in question by expanding the sums and using combinatorial arguments to bound them.Then we complete the proof of Lemma 5.1 using Markov’s inequality in the same spirt as the proof of Lemma4.1. This large deviation estimate was first proved in [7] and a simplified proof was presented in [9].

Lemma 5.1. Let paiqMi“1 be centered and independent random variables satisfying

E|ai|p ďCp

Mqp´2(5.1)

for 2 ď p ď plogMqA0 log logM . Then there is a ν ą 0, depending only on C in (5.1), such that for all ξsatisfying (3.4), and for any Ai P C and Bij P C, we have with pξ, νq-high probability

∣∣∣∣∣ Mÿi“1

Aiai

∣∣∣∣∣ ď plogMqξ

»

maxi |Ai|q

`

˜

1

M

Mÿ

i“1

|Ai|2¸

12fi

fl , (5.2)

∣∣∣∣∣ Mÿi“1

Bii`

a2i ´ Ea2

i

˘

∣∣∣∣∣ ď plogMqξ Bdq, (5.3)

and

∣∣∣∣∣ ÿ

1ďi‰jďM

aiBijaj

∣∣∣∣∣ ď plogMq2ξ

»

Boq`

˜

1

M2

ÿ

1ďi‰jďM

|Bij |2¸12

fi

fl , (5.4)

where

Bd :“ maxi

|Bii| and Bo :“ maxi‰j

|Bij | . (5.5)

Furthermore, let paiqMi“1 and pbiq

Mi“1 be independent random variables, each satisfying (5.1). Then there is a

constant ν ą 0, depending only on C in (5.1), such that for all ξ satisfying (3.1) and Bij P C we have withpξ, νq-high probability∣∣∣∣∣ M

ÿ

i,j“1

aiBijbj

∣∣∣∣∣ ď plogMq2ξ

»

Bdq2`Boq`

˜

1

M2

ÿ

1ďi‰jďM

|Bij |2¸12

fi

fl . (5.6)

Remark 5.2. A simple intuition for the bounds in Lemma 5.1 is to think of the two terms on the right-handside of (5.2) as follows: the first term comes from the special case where the Ai are dominated by a very

large maximum term A :“ maxi |Ai|. In this case, it is easy to see |Aai| ď plogMqξAq´1 with pξ, νq-high

probability. The second term comes from the variance ofř

iAiai. Moreover, the result is optimal, up tofactors of logM . However, the powers of q are not optimal, but this has no effect on later applications. Theother bounds have similar interpretations.

Note that we assume here that the coefficients Ai and Bij are deterministic. However, the coefficientscan be random, so long as they are independent of the random variables paiq

Mi“1 and pbiq

Mi“1. In this case we

just take a partial expectation, and it is in this sense that we frequently make use of Lemma 5.1.

37

To prove Lemma 5.1, we need the following lemma, which provides bounds on the higher moments of thesums in question.

Lemma 5.3. Let paiqMi“1 and pbiq

Mi“1 be real, centered, and independent random variables satisfying

E|ai|p ďCp

Mqp´2and E|bi|p ď

Cp

Mqp´2(5.7)

for all 2 ď p ď plogNqA0 log logN . Then for all even p satisfying 2 ď p ď plogMqA0 log logM and all Ai, Bij P Cwe have

E∣∣∣∣ Mÿi“1

Aiai

∣∣∣∣p ď pCpqp

«

maxi|Ai|q

`

ˆ

1

M

Mÿ

i“1

|Ai|2˙12

ffp

, (5.8)

E∣∣∣∣ Mÿi“1

Bii`

a2i ´ Ea2

i

˘

∣∣∣∣p ď pCpqp

«

maxi|Bii|q

`

ˆ

1

Mq2

Mÿ

i“1

|Bii|2˙12

ffp

, (5.9)

E

∣∣∣∣∣ Mÿi“1

aiAibi

∣∣∣∣∣p

ď pCpqp

»

maxi |Ai|q2

`

˜

1

M2

Mÿ

i“1

|Ai|2¸12

fi

fl

p

, (5.10)

and E

∣∣∣∣∣ ÿ

1ďi‰jďM

aiBijaj

∣∣∣∣∣p

ď pCpq2p

«

maxi‰j |Bij |q

`

ˆ

1

M2

ÿ

1ďi‰jďM

|Bij |2˙12

ffp

, (5.11)

for some C depending only on the constant in (5.7).

Proof. It is easier to prove (5.8), (5.9), and (5.10), so we start by showing those. We set p “ 2r andcompute

E∣∣∣∣ Mÿi“1

Aiai

∣∣∣∣2r “ Mÿ

i1,...,i2r“1

Ai1 ¨ ¨ ¨AirAir`1¨ ¨ ¨Ai2r E ai1 ¨ ¨ ¨ airair`1

¨ ¨ ¨ ai2r . (5.12)

It is clear that each assignment of the indices i1, . . . , i2r induces a partition (or equivalence relation) Π onthe set t1, . . . , 2ru, where two elements, i and j, of the set are in the same block if and only if ij “ ik. Eachpartition Π is characterized by l blocks with sizes r1, . . . , rl such that

r1 ` ¨ ¨ ¨ ` rl “ 2r.

We reorganize the sum in (5.12) so that we are summing over all partitions Π. Note that distinct assignmentscan yield identical partitions. Moreover, the value of the summand is completely determined by the partitionΠ. So let nΠ denote the number of assignments to the indices i1, . . . , i2r that induce the partition Π. Now,since the random variables ai are centered, we can conclude that any summand corresponding to a partitionΠ with a block of size one will be zero. So we may restrict to summing over partitions Π such that ri ě 2 forall 1 ď i ď l and l ď r. Then, using the fact that the random variables ai are independent, we can rewritethe sum (5.12) and thus find that the contribution of the partition Π to (5.12) is bounded in absolute valueby

ÿ

i1,...,il

s“1

|Ais |rsE|ais |rs ďlź

s“1

ˆ Mÿ

i“1

|Ai|rsCrs

Mqrs´2

˙

, (5.13)

38

by the bound in equation (5.7). Abbreviating A :“ maxi|Ai|, we find that (5.13) is bounded by

s“1

ˆ

pCAq´1qrsA´2M´1q2Mÿ

i“1

|Ai|2˙

“ pCAq´1q2rˆ

1

A2Mq´2

Mÿ

i“1

|Ai|2˙l

ď pCAq´1q2r max

#

1,

ˆ

1

A2Mq´2

Mÿ

i“1

|Ai|2˙r

+

ď Cr

«

A

q`

ˆ

1

M

Mÿ

i“1

|Bii|2˙12

ff2r

.

Next, it is easy to see that the total number of partitions of 2r elements is bounded by pCrq2r, so that weget

E∣∣∣∣ Mÿi“1

Aiai

∣∣∣∣2r ď pCrq2r

«

A

q`

ˆ

1

M

Mÿ

i“1

|Ai|2˙12

ff2r

“ pCpqp

«

maxi|Ai|q

`

ˆ

1

M

Mÿ

i“1

|Bii|2˙12

ffp

.

This concludes the proof of (5.8).The proof of (5.9) is very similar but has a slight modification. Let bi :“ a2

i ´ Ea2i , then we set p “ 2r

and, as before, compute

E∣∣∣∣ Mÿi“1

Biibi

∣∣∣∣2r “ Mÿ

i1,...,i2r“1

Bi1i1 ¨ ¨ ¨BirirBir`1ir`1¨ ¨ ¨Bi2ri2r E bi1 ¨ ¨ ¨ birbir`1

¨ ¨ ¨ bi2r . (5.14)

We reorganize the sum in (5.14) in the same way as before, so that we are summing over all partitions Π.Again, since the random variables bi are centered, we can conclude that any summand corresponding to apartition Π with a block of size one will be zero. So we may restrict to summing over partitions Π such thatri ě 2 for all 1 ď i ď l and l ď r. Then, using the fact that the random variables bi are independent, wecan rewrite the sum (5.14) and thus find that the contribution of the partition Π to (5.14) is bounded inabsolute value by

ÿ

i1,...,il

s“1

|Bisis |rsE|bis |rs ďlź

s“1

ˆ Mÿ

i“1

|Bii|rsCrs

Mq2rs´2

˙

, (5.15)

since

E |bi|p “ E`∣∣a2i

∣∣` ∣∣Ea2i

∣∣˘p ď 2pE|ai|2p ` 2pE E|ai|2p ďCp

Mq2p´2

by the bound in equation (5.7). Abbreviating B :“ maxi|Bii|, we find that (5.15) is bounded by

s“1

ˆ

pCBq´2qrsB´2M´1q2Mÿ

i“1

|Bii|2˙

“ pCBq´2q2rˆ

1

B2Mq´2

Mÿ

i“1

|Bii|2˙l

ď pCBq´2q2r max

#

1,

ˆ

1

B2Mq´2

Mÿ

i“1

|Bii|2˙r

+

ď Cr

«

B

q2`

ˆ

1

Mq2

Mÿ

i“1

|Bii|2˙12

ff2r

.

39

As before, it is easy to see that the total number of partitions of 2r elements is bounded by pCrq2r, so thatwe get

E∣∣∣∣ Mÿi“1

Biibi

∣∣∣∣2r ď pCrq2r

«

B

q2`

ˆ

1

Mq2

Mÿ

i“1

|Bii|2˙12

ff2r

“ pCpqp

«

maxi|Bii|q

`

ˆ

1

Mq2

Mÿ

i“1

|Bii|2˙12

ffp

.

This concludes the proof of (5.9).The proof of (5.10) is very similar to the above, as paibiq

Mi“1 are independent centered random variables.

However, they satisfy the moment condition

E |aibi|p “ E |ai|p E |bi|p ďC2p

M2q2p´4,

by equation (5.7). Thus, the final bound we get is

E

∣∣∣∣∣ Mÿi“1

Aiaibi

∣∣∣∣∣p

ď pCpqp

»

maxi |Ai|q2

`

˜

1

M2

Mÿ

i“1

|Ai|2¸12

fi

fl .

The proof of (5.11) is harder. The original argument in [7] uses a combinatorial argument on graphs, buthere we present a simpler proof from [9]. If we formulate (5.11) in terms of Lp-norms, we find it is equivalentto ∥∥∥∥∥ÿ

i‰j

aiBijaj

∥∥∥∥∥p

p

ď pCpq2p

«

maxi‰j |Bij |q

`

ˆ

1

M2

ÿ

1ďi‰jďM

|Bij |2˙12

ffp

.

The proof relies on the identity

1 “1

2M´2

ÿ

I\J“NM

1iPI ¨ 1jPJ , (5.16)

for any fixed 1 ď i ‰ j ď N , where the sum ranges over all partitions of NM :“ t1, . . . ,Mu into two disjointsets I and J . Effectively, (5.16) says the number of partitions of NM into two disjoint blocks I and J suchthat i P I and j P J is 2M´2. Moreover,

ÿ

I\J“NM

1 “ 2M ´ 2, (5.17)

where the sum ranges over nonempty subsets I and J . Effectively, (5.17) says the number of partitions ofNM into two disjoint blocks I and J is 2M ´ 2. Note that the ratio between the the second and first numberis bounded by 4.

Thus, using the identity (5.16), we see∥∥∥∥∥ÿi‰j

aiBijaj

∥∥∥∥∥p

∥∥∥∥∥ÿi‰j

ÿ

I\J“NM

aiBijaj1iPI ¨ 1jPJ

2M´2

∥∥∥∥∥p

ď1

2N´2

ÿ

I\J“NM

∥∥∥∥∥ÿiPI

ÿ

jPJ

aiBijaj

∥∥∥∥∥p

. (5.18)

Now, we can use the fact that the families paiqiPI and pajqjPJ are independent and the obvious bounds|I| ďM and |J | ďM . Define

Ai :“ÿ

jPJ

Bijaj

40

and use (5.8) to find

‖Ai‖p ď pCpq

«

maxj |Bij |q

`

ˆ

1

M

piqÿ

j

|Bij |2˙12

ff

. (5.19)

Now, we may rewrite (5.18) as ∥∥∥∥∥ÿi‰j

aiBijaj

∥∥∥∥∥p

ď1

2M´2

ÿ

I\J“NM

∥∥∥∥∥ÿiPI

aiAi

∥∥∥∥∥p

.

and use (5.8) again to get

1

2M´2

ÿ

I\J“NM

∥∥∥∥∥ÿiPI

aiAi

∥∥∥∥∥p

ď1

2M´2

ÿ

I\J“NM

pCpq

«

maxi|Ai|q

`

ˆ

1

M

Mÿ

i“1

|Ai|2˙12

ff

. (5.20)

Combining (5.20) with the bound (5.19), we have∥∥∥∥∥ÿi‰j

aiBijaj

∥∥∥∥∥p

ď1

2M´2

ÿ

I\J“NM

pCpq2

«

maxi‰j |Bij |q

`

ˆ

1

M2

ÿ

i‰j

|Bij |2˙12

ff

.

By equation (5.17), this implies∥∥∥∥∥ÿi‰j

aiBijaj

∥∥∥∥∥p

ď pCpq2

«

maxi‰j |Bij |q

`

ˆ

1

M2

ÿ

i‰j

|Bij |2˙12

ff

,

which is equivalent to

E

∣∣∣∣∣ÿi‰j

aiBijaj

∣∣∣∣∣p

ď pCpq2p

«

maxi‰j |Bij |q

`

ˆ

1

M2

ÿ

i‰j

|Bij |2˙12

ffp

and thus completes the proof.

Proof of Lemma 5.1. The proof is a simple application of Lemma 5.3 and Markov’s inequality. To beginwith, we show (5.2). Let p :“ νplogMqξ and choose ν small enough so that ν ď 1

Ce . The probability that

∣∣∣∣∣ Mÿi“1

Aiai

∣∣∣∣∣ ą plogMqξ

»

maxi |Ai|q

`

˜

1

M

Mÿ

i“1

|Ai|2¸

12fi

fl

is equivalent to the probability that

∣∣∣∣∣ Mÿi“1

Aiai

∣∣∣∣∣p

ą plogMqpξ

»

maxi |Ai|q

`

˜

1

M

Mÿ

i“1

|Ai|2¸

12fi

fl

p

41

as both sides of the inequality are positive. Moreover, using Markov’s inequality, the probability is thusbounded by

E∣∣∣řM

i“1Aiai

∣∣∣pplogMqpξ

maxi|Ai|q `

´

1M

řMi“1 |Ai|

2¯12

p (5.21)

Now we may use the result (5.8) from Lemma 5.3, to bound (5.21) by

pCpqp

plogMqpξ“

ˆ

CνplogMqξ

plogMqξ

˙νplogMqξ

ď e´νplogMqξ (5.22)

This shows that with pξ, νq-high probability∣∣∣∣∣ Mÿi“1

Aiai

∣∣∣∣∣ ď plogMqξ

»

maxi |Ai|q

`

˜

1

M

Mÿ

i“1

|Ai|2¸

12fi

fl . (5.23)

Now, in the same way we show (5.3). Let p :“ νplogMqξ and choose ν small enough so that ν ď 12Ce . We

can bound the first term using Markov’s inequality and the bound in (5.9). In more detail, the probabilitythat ∣∣∣∣∣ M

ÿ

i,j“1

Bij paiaj ´ Eaiajq

∣∣∣∣∣ ą plogMqξBdq

is equivalent to the probability that∣∣∣∣∣ Mÿ

i,j“1

Bij paiaj ´ Eaiajq

∣∣∣∣∣p

ą plogMqξpBpdqp,

as both sides of the inequality are positive, and the probability is thus bounded by

E∣∣∣řM

i,j“1Bij paiaj ´ Eaiajq∣∣∣p

plogMqξpBpdqp

. (5.24)

Now we may use the result (5.9) from Lemma 5.3, to bound (5.24) by

pCpqp

plogMqξpqp

Bp

«

maxi|Bii|q

`

ˆ

1

Mq2

Mÿ

i“1

|Bii|2˙12

ffp

“pCpqp

plogMqξp

«

1`

ˆ

1

M

Mÿ

i“1

|Bii|2

B2d

˙12ffp

ď pCνqp¨ 2p

ď e´νplogMqξ ¨

This shows that with pξ, νq-high probability∣∣∣∣∣ Mÿ

i,j“1

Bij paiaj ´ Eaiajq

∣∣∣∣∣ ď plogMqξBdq. (5.25)

42

Now we prove (5.4). The argument is similar and again uses Markov’s inequality. Let p :“ νplogMqξ andchoose ν small enough so that ν ď 1

Ce . The probability that∣∣∣∣∣ÿi‰j

aiBijaj

∣∣∣∣∣ ą plogMq2ξ

»

Boq`

˜

1

M2

ÿ

i‰j

|Bij |2¸12

fi

fl

is equivalent to the probability that∣∣∣∣∣ÿi‰j

aiBijaj

∣∣∣∣∣p

ą plogMq2ξp

»

Boq`

˜

1

M2

ÿ

i‰j

|Bij |2¸12

fi

fl

p

,

as both sides of the inequality are positive, and the probability is thus bounded by

E∣∣∣ři‰j aiBijaj

∣∣∣pplogMq2ξp

Boq `

´

1M2

ř

i‰j |Bij |2¯12

p . (5.26)

Now we may use the result (5.11) from Lemma 5.3, to bound (5.26) by

pCpq2p

«

Boq `

ˆ

1M2

ř

i‰j |Bij |2˙12

ffp

plogMq2ξp„

Boq `

´

1M2

ř

i‰j |Bij |2¯12

p “

ˆ

C2p2

plogMq2ξ

˙p

ˆ

C2ν2plogMq2ξ

plogMq2ξ

˙νplogMqξ

ď pCνqνplogMqξ

ď e´νplogMqξ .

This shows that with pξ, νq-high probability∣∣∣∣∣ÿi‰j

aiBijaj

∣∣∣∣∣ ď plogMq2ξ

»

Boq`

˜

1

M2

ÿ

i‰j

|Bij |2¸12

fi

fl . (5.27)

Finally, we prove (5.6). Using the triangle inequality, we see∣∣∣∣∣ Mÿ

i,j“1

aiBijbj

∣∣∣∣∣ ď∣∣∣∣∣ Mÿi“1

aiBiibi

∣∣∣∣∣`∣∣∣∣∣Mÿi‰j

aiBijbj

∣∣∣∣∣ .We can now deal with the diagonal and off-diagonal terms separately. Using Markov’s inequality and (5.10)from Lemma 5.3 in the same way as above, we see∣∣∣∣∣ Mÿ

i“1

aiBijbj

∣∣∣∣∣ ď plogMqξ

»

Bdq2`

˜

1

M2

Mÿ

i“1

|Bii|2¸12

fi

fl ď 2plogMqξBdq2

(5.28)

43

with pξ, νq-high probability. For the off-diagonal terms, let Ai :“ř

j‰iBijbj . Then we can use (5.2) fromLemma 5.1, which we have already proved, to see

|Ai| ď plogMqξ

»

Boq`

˜

1

M

ÿ

j‰i

|Bij |2¸12

fi

fl

with pξ, νq-high probability. Since Ai is independent of aj (it is a linear combination, with deterministiccoefficients, of the independent quantities bj) we may use (5.2) again to see∣∣∣∣∣ÿ

i‰j

aiBijbj

∣∣∣∣∣ “∣∣∣∣∣ Mÿi“1

Aiai

∣∣∣∣∣ď plogMqξ

»

maxi |Ai|q

`

˜

1

M

Mÿ

i“1

|Ai|2¸12

fi

fl

ď CplogMqξ

»

Boq`

˜

1

M2

ÿ

i‰j

|Bij |2¸12

fi

fl

(5.29)

with pξ, νq-high probability, since the conjunction of OpMq pξ, νq-high probability events is also a pξ, νq-highprobability event, by Remark 3.4. So putting together the bounds (5.28) and (5.29), we get∣∣∣∣∣ M

ÿ

i,j“1

aiBijbj

∣∣∣∣∣ ď plogMq2ξ

»

Bdq2`Boq`

˜

1

M2

ÿ

i‰j

|Bij |2¸12

fi

fl ,

where we absorbed the constant C into the smaller constant ν.

44

6. Local Marchenko-Pastur law

The work of this section is to prove a weak local law for the ensemble defined in Definition 3.1. We follow themethod outlined in Subsection 2.4. So, our goal is to estimate the following quantities, which we abbreviatefor convenience in the following definition.

Definition 6.1. Let z :“ E ` iη P R by as defined in (3.31). Then define

Λd ” Λdpzq :“ max1ďiďN

|Giipzq ´mmppzq| , (6.1)

Λo ” Λopzq :“ max1ďi‰jďN

|Gijpzq| , (6.2)

and Λ ” Λpzq :“ |mN pzq ´mmppzq| , (6.3)

where the subscripts refer to “diagonal” and “off-diagonal” matrix element of the resolvent. Furthermore,we define similar quantities for the minors:

ΛpTqd ” Λ

pTqd pzq :“ max

1ďiďNiRT

∣∣∣GpTqii pzq ´mmppzq∣∣∣ , (6.4)

ΛpTqo ” ΛpTqo pzq :“ max1ďi‰jďNi,jRT

∣∣∣GpTqij ∣∣∣ , (6.5)

and ΛpTq ” ΛpTqpzq :“∣∣∣mpTqN pzq ´mmppzq

∣∣∣ . (6.6)

The quantities ΛpTqd , Λ

pTqo , and ΛpTq are defined in the natural way for the resolvent GpTq:

ΛpTqd ” Λ

pTqd pzq :“ max

1ďiďM

∣∣∣∣GpTqii pzq ´ ˆ

γmmppzq `γ ´ 1

z

˙∣∣∣∣ , (6.7)

ΛpTqo ” ΛpTqo pzq :“ max1ďi‰jďM

∣∣∣GpTqij ∣∣∣ , (6.8)

and ΛpTq ” ΛpTqpzq :“

∣∣∣∣mpTqM pzq ´

ˆ

γmmppzq `γ ´ 1

z

˙∣∣∣∣ , (6.9)

where we have adjusted for the point mass at 0. We define similar quantities for the minors rUs and pU,Tq.

Recall the definition of RL from (3.30).

Theorem 6.2. Let H be as defined in Definition 3.1. Then there are constants ν ą 0 and C ą 0 such thatthe following statements hold for any ξ satisfying (3.1) and L satisfying (3.29). The events

č

zPRL

#

Λopzq ďCplogMqξ

q`CplogMq2ξ

pMηq12

+

, (6.10)

č

zPRL

#

maxi

|Giipzq ´mN pzq| ďCplogMqξ

q`CplogMq2ξ

pMηq12

+

, (6.11)

andč

zPRL

#

Λdpzq ďCplogMqξ

?q

`CplogMq2ξ

pMηq14

+

(6.12)

45

hold with pξ, νq-high probability. Furthermore, we have the weak local Marchenko-Pastur law: the event

č

zPRL

"

Λpzq ďCplogMqξ

?q

`CplogMq2ξ

pMηq14

*

(6.13)

holds with pξ, νq-high probability.

Roughly, Theorem 6.2 states that

|Gij ´ δijmN | À 1

q`

1?Nη

(6.14)

and

|mN ´mmp| ď maxi

|Gii ´mmp| À1?q`

1

pNηq14. (6.15)

Note that in (6.14) the diagonal term Gii is compared to the random quantity mN and not mmp, as it isin (6.15). The bound in (6.15) is not optimal, as we would expect an error term of the form pNηq´1 ratherthan pNηq´14; this lack of precision is due to instability near the edge. The bound can be improved if werelax the requirement that our bound be uniform up to the edge. If we only wanted a bound in the bulk,then our method could achieve a stronger bound of the form pNηq´12, at the expense of a term that blowsup near the edges. Despite the fact that the bounds are not optimal, and thus the local law is only a weaklocal law, the bounds hold on the optimal scale of η ÁM´1, uniformly up to the edge.

6.1. Preliminary lemmas. During the proof of the local Marchenko-Pastur law we will need some basicproperties of the Marchenko-Pastur distribution, which are summarized in Subsection 2.3, and some otherelementary lemmas, which we state and prove in this section.

Lemma 6.3. Let A be an M ˆN matrix, such that

pA˚A´ z1q and pAA˚ ´ z1q (6.16)

are invertible—that is, z is not in the spectra of A˚A or AA˚. Then we have the following identity

A pA˚A´ z1q´1A˚ “ 1` z pAA˚ ´ z1q

´1. (6.17)

Proof. The proof is a simple manipulation. We have

AA˚ “ A pA˚A´ z1q´1pA˚A´ z1qA˚

“ A pA˚A´ z1q´1pA˚AA˚ ´ zA˚q

“ A pA˚A´ z1q´1A˚ pAA˚ ´ z1q ,

and multiplying on the right by pAA˚ ´ z1q´1

, we get

AA˚ pAA˚ ´ z1q´1“ A pA˚A´ z1q

´1A˚. (6.18)

ThusAA˚ pAA˚ ´ z1q

´1“ pAA˚ ´ z1` z1q pAA˚ ´ z1q

´1“ 1` z pAA˚ ´ z1q

´1,

which with equation (6.18) completes the proof.

46

Lemma 6.4 (Rank one perturbation formula). Suppose that A is an invertible N ˆN matrix and qand r are column vectors of dimension N , such that

1` r˚A´1q ‰ 0. (6.19)

Then we have the following identity

pA` qr˚q´1“ A´1 ´

A´1qr˚A´1

1` r˚A´1q. (6.20)

Proof. Again the proof is a simple calculation. Noteˆ

A´1 ´A´1qr˚A´1

1` r˚A´1q

˙

pA` qr˚q “ 1`A´1qr˚ ´A´1qr˚ `A´1q

`

r˚A´1q˘

1` r˚A´1q

“ 1`A´1qr˚ ´A´1qr˚ `

`

r˚A´1q˘

A´1qr˚

1` r˚A´1q

“ 1`A´1qr˚ ´1` r˚A´1q

1` r˚A´1qA´1qr˚

“ 1,

(6.21)

where we used the fact that`

r˚A´1q˘

is a scalar in the second step. So, multiplying (6.21) on the right bypA` qr˚q´1 yields (6.20).

We shall also need the following resolvent identities for the matrix elements of the resolvents GpTqij and

GpTqij , most of which are a special case of Lemma 2.13.

Lemma 6.5 (Resolvent identities). Let GpTqij and G

pTqij be as defined in (3.7). Then for the diagonal

elements of Gpzq, we have

Giipzq “1

´z ´ zx˚i Gpiqpzqxi

or equivalently x˚i Gpiqpzqxi “

´1

zGiipzq´ 1. (6.22)

For the off-diagonal elements of Gpzq, when i ‰ j, we have

Gijpzq “ zGiipzqGpiqjj pzqx

˚i G

pijqpzqxj . (6.23)

Finally, when i ‰ k and j ‰ k,

Gijpzq “ Gpkqij pzq `

GikpzqGkjpzq

Gkkpzq. (6.24)

For the H ensemble, we have the rank one perturbation formulas,

G “ Gpiq ´Gpiqxix

˚i G

piq

1` x˚i Gpiqxi

“ Gpiq ` zGiipzqGpiqxix

˚i G

piq (6.25)

and

Gpiq “ G`Gxix

˚i G

1´ x˚i Gxi. (6.26)

Moreover, we have the following identity for the trace of the resolvent

TrGpTqpzq “M ´N ` |T|

z` TrGpTqpzq. (6.27)

47

Remark 6.6. Lemma 6.5 remains trivially valid for the minors HpTq and HpTq of H and H, and we frequentlyuse this fact in later proofs. So, stating the results in their most general form: for the diagonal elements ofGpU,Tqpzq, we have

GpU,Tqii pzq “

1

´z ´ z´

xpUqi

¯˚

GpU,iTqpzqxpUqi

or´

xpUqi

¯˚

GpU,iTqpzqxpUqi “

´1

zGpU,Tqii pzq

´ 1. (6.28)

For the off-diagonal elements of GpU,Tqpzq, when i ‰ j, we have

GpU,Tqij pzq “ zG

pU,Tqii pzqG

pU,iTqjj pzq

´

xpUqi

¯˚

GpU,ijTqpzqxpUqj . (6.29)

Finally, when i ‰ k and j ‰ k,

GpU,Tqij pzq “ G

pU,kTqij pzq `

GpU,Tqik pzqG

pU,Tqkj pzq

GpU,Tqkk pzq

. (6.30)

For the HpU,Tq ensemble, we have the rank one perturbation formulas,

GpU,Tq “ GpU,iTq ´GpU,iTqx

pUqi

´

xpUqi

¯˚

GpU,iTq

1`´

xpUqi

¯˚

GpU,iTqxpUqi

“ GpU,iTq ` zGpU,Tqii pzqGpU,iTqx

pUqi

´

xpUqi

¯˚

GpU,iTq (6.31)

and

GpU,iTq “ GpU,Tq `GpU,Tqx

pUqi

´

xpUqi

¯˚

GpU,Tq

1´´

xpUqi

¯˚

GpU,TqxpUqi

. (6.32)

A little thought shows that one can easily prove the corresponding identities in exactly the same way:for the diagonal elements of GpU,Tqpzq, we have

GpU,Tqii pzq “

1

´z ´ zrpTqi GpiU,Tqpzq

´

rpTqi

¯˚ or rpTqi GpiU,Tqpzq

´

rpTqi

¯˚

“´1

zGpU,Tqii pzq

´ 1. (6.33)

For the off-diagonal elements of GpU,Tqpzq, when i ‰ j, we have

GpU,Tqij pzq “ zG

pU,Tqii pzqG

piU,Tqjj pzqr

pTqi GpijU,Tqpzq

´

rpTqj

¯˚

. (6.34)

Finally, when i ‰ k and j ‰ k,

GpU,Tqij pzq “ G

pkU,Tqij pzq `

GpU,Tqik pzqG

pU,Tqkj pzq

GpU,Tqkk pzq

. (6.35)

For the HpU,Tq ensemble, we have the rank one perturbation formulas,

GpU,Tq “ GpiU,Tq ´GpiU,Tq

´

rpTqi

¯˚

rpTqi GpiU,Tq

1` rpTqi GpiU,Tq

´

rpTqi

¯˚ “ GpiU,Tq ` zGpU,Tqii pzqGpiU,Tq

´

rpTqi

¯˚

rpTqi GpiU,Tq (6.36)

48

and

GpiU,Tq “ GpU,Tq `GpU,Tq

´

rpTqi

¯˚

rpTqi GpU,Tq

1´ rpTqi GpU,Tq

´

rpTqi

¯˚ . (6.37)

Moreover, we have the following identity for the trace of the resolvent

TrGpU,Tqpzq “pM ´ |U|q ´ pN ´ |T|q

z` TrGpU,Tqpzq. (6.38)

Proof of Lemma 6.5. We use Lemmas 2.13 and 6.3 for this proof. First we prove (6.22). Fix 1 ď j ď N .Using equation (2.27) from Lemma 2.13, we find

Gii “1

hii ´ z ´´

hpiqi

¯˚

Gpiqhpiqi

, (6.39)

which we can simplify using the definition of the matrix H “ X˚X. For ease of notation we will performthe calculation for j “ 1 but the following is true for general j. Then

X ““

x1 Xp1q‰

and thus

X˚X “

«

x˚1 x1 x˚1Xp1q

`

Xp1q˘˚

x1

`

Xp1q˘˚Xp1q

ff

. (6.40)

Substituting this into equation (6.39), we see

G11 “1

x˚1 x1 ´ z ´ x˚1Xp1q

´

`

Xp1q˘˚Xp1q ´ z1

¯´1`

Xp1q˘˚

x1

.

Then we use the identity in Lemma 6.3 for A “ Xp1q,

Xp1qˆ

´

Xp1q¯˚

Xp1q ´ z1

˙´1´

Xp1q¯˚

“ 1` z

ˆ

Xp1q´

Xp1q¯˚

´ z1

˙´1

,

to see

G11 “1

´z ´ zx˚1Gp1qx1

. (6.41)

Now we prove (6.23). From (2.28), we see

Gij “ ´GiiGpiqjj

ˆ

hij ´´

hpijqi

¯˚

Gpijqhpjiqj

˙

. (6.42)

Note that we can produce a similar equation to (6.40) by using the pijq minor of X. For easy of notation,let i “ 1 and j “ 2, then we have

X ““

x1 x2 Xp12q‰

49

and

X˚X “

»

x˚1 x1 x˚1 x2 x˚1Xp12q

x˚2 x1 x˚2 x2 x˚2Xp12q

`

Xp12q˘˚

x1

`

Xp12q˘˚

x2

`

Xp12q˘˚Xp12q

fi

fl . (6.43)

When used with (6.43), (6.42) yields

G12 “ ´G11Gp1q22

˜

x˚1 x2 ´ x˚1Xp12q

ˆ

´

Xp12q¯˚

Xp12q ´ z1

˙´1´

Xp12q¯˚

x2

¸

So using Lemma 6.3, we get

G12 “ zG11Gp1q22 x1

ˆ

Xp12q´

Xp12q¯˚

´ z1

˙´1

x2 “ zG11Gp1q22 x1G

p12qx2.

The proof is identical for general i and j.Equation (6.24) is identical to (2.29) from Lemma 2.13. Equation (6.25) follows immediately from Lemma

6.4 with A “ Hpiq ´ z1 and q “ r “ xi, since

H “ Hpiq ` xix˚i .

Using (6.22) completes the proof. Similarly, (6.26) follows immediately from Lemma 6.4 with A “ H ´ z1,q “ xi, and r “ ´xi. Equation (6.27) can be shown in exactly the same way as (3.11) of Lemma 3.5.

The following lemma contains the self-consistent resolvent equation, which is essential for our proof.

Lemma 6.7 (Approximate self-consistent equation). We have the identity

Giipzq “1

1´ γ ´ zγmN pzq ´ z ´∆ipzq, (6.44)

where∆i ” ∆ipzq :“ 1´ γ ´ zγmN pzq ` zm

piqM pzq ´ Γipzq (6.45)

andΓi ” Γipzq :“ zm

piqM pzq ´ zx

˚i G

piqpzqxi. (6.46)

Proof. Using equation (6.22) of Lemma 6.5, we have

Gii “1

´z ´ zx˚i Gpiqxi

“1

´z ´ zmpiqM ` Γi

whereΓi :“ zm

piqM ´ zx˚i G

piqxi.

We can rearrange further, to find

Gii “1

1´ γ ´ zγmN ´ z ´∆i,

where∆i :“ 1´ γ ´ zγmN ` zm

piqM ´ Γi,

which completes the proof.

50

Remark 6.8. During the proof, we show that maxi |∆i| is bounded. Summing (6.44) over i and normalizingby N´1 yields

mN “1

N

Nÿ

i“1

1

1´ γ ´ zγmN ´ z ´∆i, (6.47)

which is like the Marchenko-Pastur equation (2.49) but with a perturbation ∆i. However, given that thisperturbation is small, we can use a Taylor expansion, under certain assumptions, to find that mN is closeto mmp with some small, controllable error term.

6.2. Basic estimates on the event Ωpzq.

Definition 6.9. For z :“ E ` iη P RL, we introduce the event

Ω ” Ωpzq :“

Λdpzq ` Λopzq ` Λdpzq ` Λopzq ď CplogMq´ξ(

(6.48)

and the control parameter

Ψ ” Ψpzq :“

d

∆pzq ` Immmppzq

Mη. (6.49)

The event Ω is desirable; if it does not hold, then Λd, Λo, Λd, or Λ0 is too large. Note that the boundson the event Ω are very weak, however they are often sufficient to prove much stronger bounds. So we useΩ as an assumption set and later show the event holds with pξ, νq-high probability, first for η order one andthen for smaller η by bootstrapping.

Note that Ψ is a random variable and that on RL, we have

η ě plogMqL1

Mě plogMq8ξ

1

M

and

|mmp| „ 1

as stated in Lemma 2.21. Thus, on Ω we have

Ψ ď CpMηq´12 ď CplogMq´4ξ, (6.50)

since

∆ “ |mN ´mmp| ď1

N

Nÿ

i“1

|Gii ´mmp| ď Λd ď plogMq´ξ

on Ω. Throughout this section we shall make use of the Ward identity

Nÿ

j“1

|Gij |2 “1

ηImGii,

which is proved in Lemma 2.14 from Subsection 2.2.

51

Lemma 6.10 (Rough resolvent bounds on Ωpzq). Fix T Ď t1, . . . , Nu such that |T| ď 10 (where 10 canbe replaced by any fixed constant). For z P R, there is a constant C depending on |T|, such that the followingestimates

maxiRT

∣∣∣GpTqii pzq ´Giipzq∣∣∣ ď CΛopzq2 ď CplogMq´2ξ, (6.51)

c ď∣∣∣GpTqii pzq∣∣∣ ď C, (6.52)

and ΛpTqo pzq ď CΛopzq ď CplogMq´ξ (6.53)

hold in Ωpzq. Now fix T Ď t1, . . . , Nu such that |T| ď 10 (where 10 can be replaced by any fixed constant).Then for z P RL, there is a constant C depending on |T|, such that the following estimates

maxi,j

∣∣∣GpTqij pzq ´Gijpzq∣∣∣ ď CplogMq´ξ, (6.54)

max1ďiďM

∣∣∣GpTqii pzq∣∣∣ ď C, (6.55)

and ΛpTqo pzq ď CplogMq´ξ (6.56)

hold in Ωpzq with pξ, νq-high probability. Similarly, fix U Ď t1, . . . ,Mu such that |U| ď 10 (where 10 can bereplaced by any fixed constant). For z P R, there is a constant C depending on |U|, such that the followingestimates

maxiRU

∣∣∣GrUsii pzq ´Giipzq∣∣∣ ď CΛopzq2 ď CplogMq´2ξ, (6.57)

c ď∣∣∣GrUsii pzq∣∣∣ ď C, (6.58)

and ΛrUso pzq ď CΛopzq ď CplogMq´ξ (6.59)

hold in Ωpzq. Now fix U Ď t1, . . . ,Mu such that |U| ď 10 (where 10 can be replaced by any fixed constant).Then for z P RL, there is a constant C depending on |U|, such that the following estimates

maxi,j

∣∣∣GrUsij pzq ´Gijpzq∣∣∣ ď CplogMq´ξ, (6.60)

max1ďiďN

∣∣∣GrUsii pzq∣∣∣ ď C, (6.61)

and ΛrUso pzq ď CplogMq´ξ (6.62)

hold in Ωpzq with pξ, νq-high probability.

Proof. For T “ H, (6.57) and (6.59) are vacuously true. For (6.58), we have z P R and so (2.57) of Lemma2.21 applies, thus

|Gii| ď |Gii ´mmp|` |mmp| ď Λd ` |mmp| ď plogMq´ξ ` C ď C,

since we are on the event Ω. Similarly, by the reverse triangle inequality,

|Gii| ě ||mmp|´ |Gii ´mmp|| ě c´ plogMq´ξ ě c

for large enough M .

52

For T ‰ H, we proceed by inducting for only finitely many steps on the cardinality of T. We restrict theinduction in this way because then there are only finitely many changes to the constants c and C. Supposewe have the claims (6.57), (6.58), and (6.59) for T and we wish to show them for kT, where k P t1, . . . , NuzTon Ω. From (6.24) of Lemma 6.5, we find

∣∣∣GpkTqii ´Gii

∣∣∣ ď ∣∣∣GpkTqii ´GpTqii

∣∣∣` ∣∣∣GpTqii ´Gii∣∣∣ ď∣∣∣∣∣GpTqik GpTqkiG

pTqkk

∣∣∣∣∣` CΛ20 ď CΛ2

0,

where the induction assumption (6.57) in the penultimate step, and the induction assumptions (6.58) and(6.59) in the final step. We can now use the fact that we are on the event Ω and take the maximum overi R T to prove the claim (6.57). Similarly, using (6.24) of Lemma 6.5, we prove

∣∣∣GpkTqii

∣∣∣ ď ∣∣∣GpTqii ∣∣∣` ∣∣∣∣∣GpTqki GpTqikGpTqkk

∣∣∣∣∣ ď C ` CΛ20 ď C ` CplogMq´2ξ ď C,

since we are on the event Ω. Again using (6.24) of Lemma 6.5, we see with the reverse triangle inequality

∣∣∣GpkTqii

∣∣∣ ě ∣∣∣∣∣∣∣∣GpTqii ∣∣∣´ ∣∣∣∣∣GpTqki GpTqikGpTqkk

∣∣∣∣∣∣∣∣∣∣ ě c´ CplogMq´2ξ ě c,

for large enough M , where we again used the fact we are on the event Ω. Finally, we show (6.59) by notingthat ∣∣∣GpkTqij

∣∣∣ ď ∣∣∣GpTqij ∣∣∣` ∣∣∣∣∣GpTqik G

pTqkj

GpTqkk

∣∣∣∣∣ ď CΛ0 ` CΛ20 ď CΛ0,

since we are on the event Ω, where we used the induction assumptions (6.58) and (6.59) and the fact thatΛo dominates Λ2

o on Ω. Lastly, taking the max over i ‰ j and i, j R kT yields (6.59).The argument for (6.60), (6.61), and (6.62) is similar to above but more complicated. We use induction

on the cardinality of T and point out the base case for (6.60) is trivial and that the base case for (6.62)follows immediately from the assumption that we are on Ω. For the base case of (6.61), we see

|Gii| ď∣∣∣∣Gii ´ ˆ

γmmp `γ ´ 1

z

˙∣∣∣∣` γ |mmp|`

|γ ´ 1||z|

ď plogMq´ξ ` C ď C,

since we are on Ω. Now assume that we have (6.60), (6.61), and (6.62) for T and we wish to show them forkT, where k P t1, . . . ,MuzT.

Now, we use the rank-one perturbation formula (6.25) of Lemma 6.5 and take the ij entry to see

GpTqij ´G

pkTqij “ zG

pTqii

GpkTqxkx˚kG

pkTqı

ij

“ zGpTqii

Mÿ

l,r“1

xlkGpkTqil G

pkTqrj xrk

“ zGpTqii

˜

Mÿ

l‰r

xlkGpkTqil G

pkTqrj xrk `

Mÿ

l“1

ˆ

x2lk ´

1

M

˙

GpkTqil G

pkTqlj `

1

M

Mÿ

l“1

GpkTqil G

pkTqlj

¸

.

(6.63)

53

Note that by (6.58), GpTqii is order one on Ω, so we can disregard it at the expense of a constant factor, and

similarly, z is bounded by a constant in R. We can control the first term with the large deviation estimate,Lemma 5.1. Thus, with with pξ, νq-high probability on Ω

∣∣∣∣∣Mÿl‰r

xlkGpkTqil G

pkTqrj xrk

∣∣∣∣∣ ď plogMq2ξ

»

maxl‰r

∣∣∣GpkTqil GpkTqrj

∣∣∣q

`

˜

1

M2

ÿ

l‰r

∣∣∣GpkTqil GpkTqrj

∣∣∣2¸12fi

fl

ď plogMq2ξ

»

maxl‰r

∣∣∣GpkTqil GpkTqrj

∣∣∣q

`

˜

ImGpkTqii ImG

pkTqjj

M2η2

¸12fi

fl

ď plogMq´ξ maxl,r

∣∣∣GpkTqlr

∣∣∣2 ` plogMq´6ξ´∣∣∣GpkTqii

∣∣∣ ∣∣∣GpkTqjj

∣∣∣¯12

ď plogMq´ξ maxl,r

∣∣∣GpkTqlr

∣∣∣2 ` plogMq´6ξ maxl,r

∣∣∣GpkTqlr

∣∣∣ ,

(6.64)

where we used the Ward identity, Lemma 2.14, in the second step twice; and assumption (3.4) on q and thefact that z P RL, so η ě plogMqLM in the third step. We also use the large deviation estimate, Lemma5.1, on the second term of (6.63), to see

∣∣∣∣∣Mÿl“1

ˆ

x2lk ´

1

M

˙

GpkTqil G

pkTqlj

∣∣∣∣∣ ď plogMqξmaxl

∣∣∣GpkTqil GpkTqlj

∣∣∣q

ď plogMq´2ξ maxl,r

∣∣∣GpkTqlr

∣∣∣2(6.65)

with with pξ, νq-high probability, where we used (3.4) again. We see for the third term of (6.63) that∣∣∣∣∣ 1

M

Mÿ

l“1

GpkTqil G

pkTqlj

∣∣∣∣∣ ď C

M

Mÿ

l“1

ˆ∣∣∣GpkTqil

∣∣∣2 ` ∣∣∣GpkTqlj

∣∣∣2˙“

C

´

ImGpkTqii ` ImG

pkTqjj

¯

ď CplogMq´8ξ maxl

∣∣∣GpkTqll

∣∣∣ ,(6.66)

by the definition of RL in (3.30) and the bound on L in (3.29).

From equations (6.63), (6.64), (6.65), and (6.66) we see∣∣∣GpTqij ´GpkTqij

∣∣∣ ď CplogMq´ξ maxl,r

∣∣∣GpkTqlr

∣∣∣2 ` CplogMq´6ξ maxl,r

∣∣∣GpkTqlr

∣∣∣` CplogMq´2ξ max

l,r

∣∣∣GpkTqlr

∣∣∣2 ` CplogMq´8ξ maxl

∣∣∣GpkTqll

∣∣∣ď CplogMq´ξ max

l,r

∣∣∣GpkTqlr

∣∣∣2 ` CplogMq´6ξ maxl,r

∣∣∣GpkTqlr

∣∣∣

54

with pξ, νq-high probability on Ω, where we have isolated the dominant terms. Using the simple inequalitypx` yq2 ď Cx2 ` Cy2, we see,

CplogMq´ξ maxl,r

∣∣∣GpkTqlr

∣∣∣2 ď CplogMq´ξ maxl,r

´∣∣∣GpkTqlr ´G

pTqlr

∣∣∣` ∣∣∣GpTqlr ∣∣∣¯2

ď CplogMq´ξ maxl,r

∣∣∣GpkTqlr ´GpTqlr

∣∣∣2 ` CplogMq´ξ maxl,r

∣∣∣GpTqlr ∣∣∣2ď CplogMq´ξ max

l,r

∣∣∣GpkTqlr ´GpTqlr

∣∣∣2 ` CplogMq´ξ,

with pξ, νq-high probability on Ω, where we used the induction hypotheses (6.61) and (6.62). Similarly,

CplogMq´6ξ maxl,r

∣∣∣GpkTqlr

∣∣∣ ď CplogMq´6ξ maxl,r

´∣∣∣GpkTqlr ´G

pTqlr

∣∣∣` ∣∣∣GpTqlr ∣∣∣¯ď CplogMq´6ξ max

l,r

∣∣∣GpkTqlr ´GpTqlr

∣∣∣` CplogMq´6ξ,

by the induction hypotheses (6.61) and (6.62). Putting all this together, we see with pξ, νq-high probabilityon Ω∣∣∣GpkTqij ´G

pTqij

∣∣∣ ď CplogMq´ξ maxl,r

∣∣∣GpkTqlr ´GpTqlr

∣∣∣2 ` CplogMq´6ξ maxl,r

∣∣∣GpkTqlr ´GpTqlr

∣∣∣` CplogMq´ξ,

which immediately implies

maxi‰j

∣∣∣GpkTqij ´GpTqij

∣∣∣ ď CplogMq´ξ maxl,r

∣∣∣GpkTqlr ´GpTqlr

∣∣∣2 ` CplogMq´ξ

with pξ, νq-high probability on Ω, when we take the maximum over all i and j, since CplogMq´6ξ “ op1q.Next we may conclude

maxi‰j

∣∣∣GpkTqij ´GpTqij

∣∣∣ ď CplogMq´ξ or maxi‰j

∣∣∣GpkTqij ´GpTqij

∣∣∣ ď CplogMq´ξ maxl,r

∣∣∣GpkTqlr ´GpTqlr

∣∣∣2 . (6.67)

However, we may rule out the case that

CplogMqξ ď maxl,r

∣∣∣GpkTqlr ´GpTqlr

∣∣∣ (6.68)

by noting that this is obviously false for η order one by the trivial bound

maxl,r

∣∣∣GpkTqlr ´GpTqlr

∣∣∣ ď maxl,r

∣∣∣GpkTqlr

∣∣∣`maxl,r

∣∣∣GpTqlr ∣∣∣ ď 2

η.

Then a simple continuity argument rules out (6.68) for all η. Thus,

maxi‰j

∣∣∣GpkTqij ´GpTqij

∣∣∣ ď CplogMq´ξ.

So, we see with pξ, νq-high probability on Ω

maxi,j

∣∣∣GpkTqij ´GpTqij

∣∣∣ ď CplogMq´ξ. (6.69)

55

Now we easily find

maxi,j

∣∣∣GpkTqij ´Gij

∣∣∣ ď maxi,j

∣∣∣GpkTqij ´GpTqij

∣∣∣`maxi,j

∣∣∣GpTqij ´Gij∣∣∣ ď CplogMq´ξ

with pξ, νq-high probability on Ω, by (6.69) and the induction hypothesis (6.60). Similarly, with pξ, νq-highprobability on Ω

maxi

∣∣∣GpkTqii

∣∣∣ ď maxi

∣∣∣GpTqii ∣∣∣`maxi

∣∣∣GpkTqii ´GpTqii

∣∣∣ ď C ` CplogMq´ξ ď C,

by (6.69) and the induction hypothesis (6.61), and with pξ, νq-high probability on Ω

maxi‰j

∣∣∣GpkTqij

∣∣∣ ď maxi‰j

∣∣∣GpTqij ∣∣∣`maxi‰j

∣∣∣GpkTqij ´GpTqij

∣∣∣ ď CplogMq´ξ,

by (6.69) and the induction hypothesis (6.62). This completes the whole proof.

Lemma 6.11. For fixed z P RL, we have with pξ, νq-high probability

Λopzq ď CplogMqξ1

q` CplogMq2ξΨpzq, (6.70)

maxi

|Γipzq| ď CplogMqξ1

q` CplogMq2ξΨpzq, (6.71)

and maxi

|Giipzq ´mN pzq| ď CplogMqξ1

q` CplogMq2ξΨpzq (6.72)

on Ωpzq. Similarly, we have with pξ, νq-high probability

Λopzq ď CplogMqξ1

q` CplogMq2ξΨpzq, (6.73)

and maxi

|Giipzq ´mM pzq| ď CplogMqξ1

q` CplogMq2ξΨpzq (6.74)

on Ωpzq.

Proof. First, we prove that (6.70) holds with pξ, νq-high probability. Let i ‰ j, then using the resolventidentity (6.23) from Lemma 6.5, we see

Λo “ |z|maxi‰j

∣∣∣GiiGpiqjj x˚i Gpijqxj

∣∣∣ ď C|z|maxi‰j

∣∣∣x˚i Gpijqxj∣∣∣ ,where we used (6.58) of Lemma 6.10 to bound the diagonal terms of the resolvent and the fact that we areon the event Ω. It is important to note that Gpijq is independent of xi and xj . Thus, using (5.6) from thelarge deviation estimate, Lemma 5.1, we obtain that there is a constant ν ą 0 such that for any ξ satisfyingassumption (3.1), the inequality

∣∣∣x˚i Gpijqxj∣∣∣ “∣∣∣∣∣ Mÿ

k,l“1

xkiGpijqkl xlj

∣∣∣∣∣ ď plogMq2ξ

»

maxk

∣∣∣Gpijqkk

∣∣∣q2

`Λpijqo

q`

˜

1

M2

ÿ

k‰l

∣∣∣Gpijqkl

∣∣∣2¸12fi

fl (6.75)

56

holds with pξ, νq-high probability. Note that on Ω we have maxk

∣∣∣Gpijqkk

∣∣∣ ď C and Λpijqo ď CplogMq´ξ by

Lemma 6.10. Thus, we can rewrite (6.75) as

∣∣∣x˚i Gpijqxj∣∣∣ ď CplogMq2ξ

»

1

q2`plogMq´ξ

q`

˜

1

M2

ÿ

k‰l

∣∣∣Gpijqkl

∣∣∣2¸12fi

fl

on Ω with pξ, νq-high probability. Using the Ward identity in Lemma 2.14, we see that

1

M2

ÿ

k‰l

∣∣∣Gpijqkl

∣∣∣2 ď 1

M2ηIm TrGpijq.

Now we use (6.27) from Lemma 6.5, to state the bound in terms Gpijq:

1

ηIm TrGpijq “

1

ηIm TrGpijq `

1

ηIm

N ´ |T|´Mz

“1

ηIm TrGpijq `

N ´ |T|´M|z|2

. (6.76)

Moreover,

1

MTrGpijq “ C

˜

1

N

Nÿ

k“1

Gpijqkk ´Gkk

¸

` C

˜

1

N

Nÿ

k“1

Gkk ´mmp

¸

` Cmmp,

so on Ω1

NIm TrGpijq ď CΛ2

o ` CΛ` C Immmp, (6.77)

by the definition of Λ and the bound (6.57) in Lemma 6.10. Furthermore,∣∣∣∣N ´ |T|´MM2

∣∣∣∣ ď C

M, (6.78)

where C depends on γ. Using the observations in (6.77) and (6.78), we obtain

1

M2

ÿ

k‰l

∣∣∣Gpijqkl

∣∣∣2 ď C Immmp ` CΛ

Mη`CΛ2

o

Mη`

C

|z|2M.

from (6.76). However, since |z| ď C on R, we have

|z|2

M2

ÿ

k‰l

∣∣∣Gpijqkl

∣∣∣2 ď C Immmp ` CΛ

Mη`CΛ2

o

Mη`C

M. (6.79)

Now, taking the maximum over i ‰ j in equation (6.79), using |z| ď C on R again, and absorbing theconstants into the front constant, gives us

Λo ď `CplogMqξ1

q` CplogMq2ξ

«

1

q2`

d

Immmp ` Λ

Mη`

Λ2o

Mη`

1

M

ff

on Ω with pξ, νq-high probability. However, by the trivial inequality?x` y ď

?x `

?y for positive x and

y (which is easily proved by squaring), we see

Λo ďCplogMq2ξ?Mη

Λo ` CplogMqξ1

q` CplogMq2ξ

«

1

q2`

d

Immmp ` Λ

Mη`

1

M

ff

57

on Ω with pξ, νq-high probability. Note that

CplogMq2ξ?Mη

“ op1q,

in RL by assumption (3.29) on L. Moreover, we can use equation (2.61) to see that the term ImmmppMηqdominates the term 1M . Thus, with a possible increase in the constant C, we see

Λo ď op1qΛo ` CplogMqξ1

q` CplogMq2ξ

«

1

q2`

d

Immmp ` Λ

ff

on Ω with pξ, νq-high probability. Therefore, collecting the Λo terms and dividing by their coefficient, we seeon Ω with pξ, νq-high probability that

Λo ď CplogMqξ1

q` CplogMq2ξ

1

q2`Ψ

,

for large enough M by the definition of Ψ. Finally, we may remove the term 1q2, as it is dominated by the1q term, to get

Λo ď CplogMqξ1

q` CplogMq2ξΨ,

This completes the proof of (6.70).The proof of (6.71) is similar and we omit the level of detail we showed for (6.70). Recall the definition

of Γi in (6.46). Expanding Γi and separating the diagonal and off-diagonal terms, we see

Γi “ zmpiqM ´ zx˚i G

piqxi “z

M

Mÿ

k“1

Gpiqkk ´ z

Mÿ

k,l“1

xkiGpiqkl xli “ z

Mÿ

k“1

Gpiqkk

ˆ

1

M´ x2

ki

˙

´ zÿ

k‰l

xkiGpiqkl xli. (6.80)

Thus, we can bound (6.80) by using the large deviation estimate, Lemma 5.1, on each of the terms on theright-hand side. In more detail, we use (5.3) on the first term, to see that there is a ν ą 0 such that for anyξ satisfying (3.1), such that ∣∣∣∣∣ Mÿ

k“1

Gpiqkk

ˆ

1

M´ x2

ki

˙

∣∣∣∣∣ ď plogMqξmaxk

∣∣∣Gpiqkk∣∣∣q

with pξ, νq-high probability. Since maxk

∣∣∣Gpiqkk∣∣∣ ď C, by Lemma 6.10, we may write∣∣∣∣∣ Mÿk“1

Gpiqkk

ˆ

1

M´ x2

ki

˙

∣∣∣∣∣ ď CplogMqξ1

q(6.81)

with pξ, νq-high probability on Ω. For the second term we use (5.4) to see,∣∣∣∣∣ÿk‰l

xkiGpiqkl xli

∣∣∣∣∣ ď C plogMq2ξ

»

CplogMq´ξ

q`

˜

1

M2

ÿ

k‰l

∣∣∣Gpiqkl ∣∣∣2¸12

fi

fl

58

with pξ, νq-high probability on Ω, since Λpiqo ď CplogMq´ξ by Lemma 6.10. We deal with the term

1

M2

ÿ

k‰l

∣∣∣Gpiqkl ∣∣∣2in the same way as before and use the fact that on Ω we have Λo ď plogMq´ξ, to find∣∣∣∣∣ÿ

k‰l

xkiGpiqkl xli

∣∣∣∣∣ ď CplogMqξ1

q` CplogMq2ξ

Λ0?Mη

` CplogMq2ξΨ (6.82)

with pξ, νq-high probability on Ω. Using the fact |z| ď C on R, we can conclude from equations (6.81) and(6.82) that

|Γi| ď CplogMqξ1

q` CplogMq2ξ

Λ0?Mη

` CplogMq2ξΨ (6.83)

with pξ, νq-high probability on Ω. Finally, invoking (6.70) to bound Λo and taking the maximum over i, weget

maxi

|Γi| ď C

plogMqξ

q` plogMq2ξΨ

.

This completes the proof of (6.71).Finally, we turn to the proof of (6.72). Notice that

maxi

|Gii ´mN | “ maxi

∣∣∣∣∣ 1

N

Nÿ

j“1

Gii ´Gjj

∣∣∣∣∣ ď maxi

1

N

Nÿ

i“1

|Gii ´Gjj | ď maxi‰j

|Gii ´Gjj | .

Recall the self-consistent resolvent equation (6.44) and the bound (6.58) of Lemma 6.10, then calculating wesee

|Gii ´Gjj | “ |Gii| |Gjj | |∆j ´∆i| ď C maxi

|∆i|

on Ω. It remains to see maxi |∆i| is small with pξ, νq-high probability. Recall the definition of ∆i in (6.45),then we see

zmpiqM “ zγNm

piqN ` γN ´ 1´

|T|M

by equation (6.27). Hence

|∆i| ď |γN ´ γ|` |z|∣∣∣γNmpiqN ´ γmN

∣∣∣` |Γi| .

The term |γN ´ γ| ď CM by assumption (3.3) on γN . We can again use the bound |z| ď C on R and thennote that

∣∣∣γNmpiqN ´ γmN

∣∣∣ ď |γN ´ γ|∣∣∣mpiqN ∣∣∣` γ ∣∣∣mpiqN ´mN

∣∣∣ ď C

M`γ

N

piqÿ

k

∣∣∣Gpiqkk ´Gkk∣∣∣` γ

N|Gii|

on Ω, by assumption (3.3) on γN and (6.58) from Lemma 6.10. Now, using (6.24) from Lemma 6.5, we see∣∣∣γNmpiqN ´ γmN

∣∣∣ ď C

M` CΛopzq

2 (6.84)

59

on Ω, by equations (6.57) and (6.58) from Lemma 6.10 again. We can then use (6.70) to bound (6.84).Moreover,

|Γi| ď C

plogMqξ

q` plogMq2ξΨ

(6.85)

with pξ, νq-high probability on Ω, by (6.71). Thus, by combining the bounds (6.84) and (6.85), we get

|∆i| ď C

plogMqξ

q` plogMq2ξΨ

with pξ, νq-high probability on Ω.One can show (6.73) and (6.74) in the same way as above.

Note that immediately from (6.72), we see

Λdpzq ď Λpzq ` CplogMqξ1

q` CplogMq2ξΨpzq (6.86)

on Ωpzq with pξ, νq-high probability. Similarly, from (6.74), we see

Λdpzq ď Λpzq ` CplogMqξ1

q` CplogMq2ξΨpzq (6.87)

on Ωpzq with pξ, νq-high probability, by equation (6.27).

6.3. Stability of the self-consistent equation on $\Omega(z)$. We now expand the self-consistent equation. Fix $z := E + i\eta \in \mathbf{R}_L$. It is easy to show that if one has a bound on the perturbation term $\Delta_i$ in the self-consistent equation (6.44), that is,
$$
\max_i |\Delta_i| \le \delta(z) \tag{6.88}
$$
for some function $\delta$ which is decreasing in $\eta$, then a Taylor expansion yields the bound
$$
\left| m_N - \frac{1}{1 - \gamma - z\gamma m_N - z} \right| \le C\delta. \tag{6.89}
$$
Equation (6.89) shows that if the perturbation term is small, then $m_N$ approximately satisfies the self-consistent equation (2.49), with a small error term. The conclusion of Lemma 6.12 is that, in fact, we have the bound
$$
\Lambda = |m_N - m_{mp}| \le \frac{C\delta}{\sqrt{\delta + \eta + \kappa}}, \tag{6.90}
$$
which is typically stronger.

Lemma 6.12. For $z := E + i\eta \in \mathbf{R}_L$, define
$$
\delta \equiv \delta(z) := \max_i |\Delta_i(z)| \tag{6.91}
$$
and assume $\delta(z)$ is decreasing in $\eta$ and that on $\Omega(z)$ we have $\delta(z) \le (\log M)^{-2\xi}$ with $(\xi,\nu)$-high probability. Then on $\Omega(z)$, with $(\xi,\nu)$-high probability,
$$
\Lambda(z) := |m_N(z) - m_{mp}(z)| \le \frac{C\delta(z)}{\sqrt{\delta(z) + \eta + \kappa(z)}}, \tag{6.92}
$$
where
$$
\kappa \equiv \kappa(z) := \min\{ |\lambda_+ - E|,\ |E - \lambda_-| \}. \tag{6.93}
$$


Proof. Recall that for $z \in \mathbf{R}_L$ we have
$$
|1 - \gamma - z - z\gamma m_{mp}| \ge c > 0
$$
from equation (2.59) of Lemma 2.21. Thus, using the reverse triangle inequality, we get
$$
|1 - \gamma - z - z\gamma m_N| \ge \Big| |1 - \gamma - z - z\gamma m_{mp}| - C|z|\,|m_{mp} - m_N| \Big| \ge c, \tag{6.94}
$$
since we are on the event $\Omega$ and $|z| \le C$ on $\mathbf{R}$, so $|m_{mp} - m_N| = \Lambda \le \Lambda_d \le (\log M)^{-\xi}$. Then, expanding
$$
m_N = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{1 - \gamma - z\gamma m_N - z - \Delta_i} = \frac{1}{1 - \gamma - z\gamma m_N - z} \cdot \frac{1}{N}\sum_{i=1}^{N} \frac{1}{1 - \frac{\Delta_i}{1 - \gamma - z\gamma m_N - z}}
$$
to first order, we see
$$
\left| m_N - \frac{1}{1 - \gamma - z\gamma m_N - z} \right| \le O\!\left( \frac{\max_i |\Delta_i|}{|1 - \gamma - z\gamma m_N - z|} \right) \le C\delta(z), \tag{6.95}
$$
by the bound (6.94).

For some fixed $\Delta \equiv \Delta(z) \in \mathbb{R}$, the equation
$$
m(z) - \frac{1}{1 - \gamma - z\gamma m(z) - z} = \Delta(z) \tag{6.96}
$$
has two distinct solutions, which we denote by $m^{\Delta}_{\pm}(z)$. Equation (6.96) can be written equivalently as
$$
z\gamma m(z)^2 + (\gamma - 1 + z - z\gamma\Delta)\,m(z) + 1 + \Delta - \gamma\Delta - z\Delta = 0,
$$
so an explicit calculation using the quadratic formula shows
$$
m^{\Delta}_{\pm}(z) = \frac{1 - \gamma - z \pm i(1 + \gamma\Delta)\sqrt{\big(\lambda^{\Delta}_{+} - z\big)\big(z - \lambda^{\Delta}_{-}\big)}}{2\gamma z} + \frac{\Delta}{2}, \tag{6.97}
$$
where
$$
\lambda^{\Delta}_{\pm} := \left( \frac{\sqrt{1 + \Delta(\gamma - \gamma^2)} \pm \sqrt{\gamma}}{1 + \Delta\gamma} \right)^{\!2}. \tag{6.98}
$$
This follows by noting
$$
\big(1 - \gamma - z + \gamma z \Delta\big)^2 - 4\gamma z\big(1 + (1 - \gamma - z)\Delta\big) = -(1 + \Delta\gamma)^2\big(\lambda^{\Delta}_{+} - z\big)\big(z - \lambda^{\Delta}_{-}\big).
$$
We use the notation $m_{\pm}(z) \equiv m^{0}_{\pm}(z)$ and point out that $m_{+}(z) \equiv m_{mp}(z)$, as equation (6.97) simplifies to equation (2.48) when $\Delta = 0$. Currently, we have the following conclusion: there exists $\Delta \le C\delta$ such that
$$
m_N = m^{\Delta}_{+} \qquad\text{or}\qquad m_N = m^{\Delta}_{-}. \tag{6.99}
$$

We take an intermission from the proof of Lemma 6.12 to prove a short lemma about the solutions of equation (6.96), after which we resume the proof. For a fixed energy $\mathbf{1}_{\gamma>1}\,(\lambda_-/5) \le E \le 5\lambda_+$, we define $\tilde\eta$ as the solution of
$$
\delta(E + i\tilde\eta) = (\log M)^{-\xi}(\kappa + \tilde\eta), \tag{6.100}
$$
which is unique, as the left-hand side is decreasing in $\tilde\eta$ by assumption and the right-hand side is obviously increasing in $\tilde\eta$. Note that $\tilde\eta \ll 1$ for large enough $M$. So for any $\eta \ge \tilde\eta$, we have
$$
\delta(z) \le (\log M)^{-\xi}(\kappa + \eta) \le \kappa + \eta, \tag{6.101}
$$
and conversely, when $\eta \le \tilde\eta$, we have
$$
\delta(z) \ge (\log M)^{-\xi}(\kappa + \eta). \tag{6.102}
$$

Lemma 6.13. Let $m^{\Delta}_{\pm}(z)$ be as defined in (6.97) and let $z := E + i\eta \in \mathbf{R}_L$. For sufficiently small $\Delta$, depending on $\gamma$, we have
$$
\max_{\pm} \big| m^{\Delta}_{\pm}(z) - m_{\pm}(z) \big| \le \frac{C\Delta}{\sqrt{\eta + \kappa(z)}}. \tag{6.103}
$$
Moreover, if $\eta \ge \tilde\eta$, then
$$
\big| m^{\Delta}_{+}(z) - m^{\Delta}_{-}(z) \big| \ge c\sqrt{\kappa + \eta} \gg (\log M)^{-\xi}, \tag{6.104}
$$
and if $\eta \le \tilde\eta$, then
$$
\big| m^{\Delta}_{+}(z) - m^{\Delta}_{-}(z) \big| \le C(\log M)^{\xi}\sqrt{\delta}. \tag{6.105}
$$

Proof of Lemma 6.13. Using the difference of two squares, for $\Delta$ small enough, we have
$$
\begin{aligned}
\tilde\Delta := \max_{\pm} \big| \lambda^{\Delta}_{\pm} - \lambda_{\pm} \big|
&\le \left| \frac{\sqrt{1 + \Delta(\gamma - \gamma^2)} \pm \sqrt{\gamma}}{1 + \Delta\gamma} + \frac{1 \pm \sqrt{\gamma} + \Delta(\gamma \pm \gamma\sqrt{\gamma})}{1 + \Delta\gamma} \right|
\left| \frac{\sqrt{1 + \Delta(\gamma - \gamma^2)} \pm \sqrt{\gamma}}{1 + \Delta\gamma} - \frac{1 \pm \sqrt{\gamma} + \Delta(\gamma \pm \gamma\sqrt{\gamma})}{1 + \Delta\gamma} \right| \\
&\le C \left| \frac{\sqrt{1 + \Delta(\gamma - \gamma^2)} - 1 - \Delta(\gamma \pm \gamma\sqrt{\gamma})}{1 + \Delta\gamma} \right| \\
&\le C \left| \sqrt{1 + \Delta(\gamma - \gamma^2)} - 1 \right| + C\Delta \le C\Delta,
\end{aligned}
\tag{6.106}
$$
where we used a Taylor expansion on the term $\sqrt{1 + \Delta(\gamma - \gamma^2)}$. Similarly, we find
$$
\min_{\pm} \big| \lambda^{\Delta}_{\pm} - \lambda_{\pm} \big| \ge c\Delta. \tag{6.107}
$$
Therefore, by (6.106), we have
$$
\max_{\pm} \big| m^{\Delta}_{\pm} - m_{\pm} \big| \le \frac{\Delta}{2} + \frac{\Big| \sqrt{\big(\lambda^{\Delta}_{+} - z\big)\big(z - \lambda^{\Delta}_{-}\big)} - \sqrt{(\lambda_{+} - z)(z - \lambda_{-})} \Big|}{2\gamma|z|} + \frac{\Delta \Big| \sqrt{\big(\lambda^{\Delta}_{+} - z\big)\big(z - \lambda^{\Delta}_{-}\big)} \Big|}{2|z|}
\le C\Delta + C\Delta\sqrt{\kappa} + C \Big| \sqrt{\big(\lambda^{\Delta}_{+} - z\big)\big(z - \lambda^{\Delta}_{-}\big)} - \sqrt{(\lambda_{+} - z)(z - \lambda_{-})} \Big|.
$$
Now, we use the inequality
$$
\big| \sqrt{x + y} - \sqrt{x} \big| \le \frac{C|y|}{\sqrt{|x| + |y|}},
$$
which holds for all complex numbers $x$ and $y$, where we set $x := (\lambda_{+} - z)(z - \lambda_{-})$ and $y := \big(\lambda^{\Delta}_{+} - z\big)\big(z - \lambda^{\Delta}_{-}\big) - (\lambda_{+} - z)(z - \lambda_{-})$, to see
$$
\Big| \sqrt{\big(\lambda^{\Delta}_{+} - z\big)\big(z - \lambda^{\Delta}_{-}\big)} - \sqrt{(\lambda_{+} - z)(z - \lambda_{-})} \Big|
\le \frac{C \big| \big(\lambda^{\Delta}_{+} - z\big)\big(z - \lambda^{\Delta}_{-}\big) - (\lambda_{+} - z)(z - \lambda_{-}) \big|}{\sqrt{\big| (\lambda_{+} - z)(z - \lambda_{-}) \big| + \big| \big(\lambda^{\Delta}_{+} - z\big)\big(z - \lambda^{\Delta}_{-}\big) - (\lambda_{+} - z)(z - \lambda_{-}) \big|}}
\le \frac{C\Delta}{\sqrt{\kappa + \eta}},
$$
where the last bound follows from the estimates $|y| \le C\tilde\Delta$ (and thus, by (6.106), $|y| \le C\Delta$), $|x| \ge c\kappa$, and $|x| \ge c\eta$.

Now we show (6.104). Note that, for $\eta \ge \tilde\eta$,
$$
\big| m^{\Delta}_{+} - m^{\Delta}_{-} \big| \ge c \big| \big(\lambda^{\Delta}_{+} - z\big)\big(z - \lambda^{\Delta}_{-}\big) \big|^{1/2} \ge c\sqrt{\kappa + \eta}.
$$
Finally, we show (6.105). Note that, for $\eta \le \tilde\eta$,
$$
\big| m^{\Delta}_{+} - m^{\Delta}_{-} \big| \le C \big| \big(\lambda^{\Delta}_{+} - z\big)\big(z - \lambda^{\Delta}_{-}\big) \big|^{1/2} \le C\sqrt{\kappa + \eta + \Delta} \le C(\log M)^{\xi}\sqrt{\delta(z)},
$$
where the last step uses (6.102) and $\Delta \le C\delta$. This completes the proof.

Using Lemma 6.13 and equation (6.99), we may conclude that there is some $\Delta \le C\delta$ such that
$$
\min\{ |m_N - m_{mp}|,\ |m_N - m_-| \} \le \frac{C\Delta}{\sqrt{\eta + \kappa}} \le \frac{C\delta}{\sqrt{\eta + \kappa}}. \tag{6.108}
$$
Now we must distinguish between the two solutions. If we could show that $m_N = m^{\Delta}_{+}$, by arguing that only $m^{\Delta}_{+}$ (and not $m^{\Delta}_{-}$) is a function from $\mathbb{H}$ to $\mathbb{H}$, then we could conclude the proof using Lemma 6.13. However, it is too difficult to see this from the expression (6.97). Despite this difficulty, we can identify $m^{\Delta}_{+}$ as the correct solution for $\eta \sim 1$ using a different argument. In more detail, let $z_0 := E + i\eta_0 \in \mathbf{R}_L$ with $\eta_0 \ge 1$. Using Lemma 6.13, we have the following simple bound:
$$
\big| m^{\Delta}_{+}(z_0) - m^{\Delta}_{-}(z_0) \big| \ge c\sqrt{\kappa + \eta_0} \ge c, \tag{6.109}
$$
so for $\eta \sim 1$ we can easily distinguish between the solutions. Then
$$
|m_N(z_0) - m_{mp}(z_0)| = \Lambda(z_0) \le \Lambda_d(z_0) \le (\log M)^{-\xi}, \tag{6.110}
$$
since we are on the event $\Omega$. Moreover, from Lemma 6.13 and our assumption on $\delta$, we have
$$
\big| m^{\Delta}_{+}(z_0) - m_{+}(z_0) \big| \le \frac{C\delta}{\sqrt{\kappa + \eta_0}} \le C(\log M)^{-2\xi}, \tag{6.111}
$$
since $\eta_0 \sim 1$. Now note that
$$
\big| m^{\Delta}_{+}(z_0) - m_N(z_0) \big| \le \big| m^{\Delta}_{+}(z_0) - m_{+}(z_0) \big| + |m_{+}(z_0) - m_N(z_0)| \le C(\log M)^{-\xi} \tag{6.112}
$$
by equations (6.110) and (6.111). So, using the reverse triangle inequality and equations (6.109) and (6.112),
$$
\big| m^{\Delta}_{-}(z_0) - m_N(z_0) \big| \ge \Big| \big| m^{\Delta}_{-}(z_0) - m^{\Delta}_{+}(z_0) \big| - \big| m^{\Delta}_{+}(z_0) - m_N(z_0) \big| \Big| \ge c.
$$
Thus, $m_N(z_0) \ne m^{\Delta}_{-}(z_0)$ and we must have $m_N(z_0) = m^{\Delta}_{+}(z_0)$. Therefore,
$$
|m_N(z_0) - m_{mp}(z_0)| = \big| m^{\Delta}_{+}(z_0) - m_{+}(z_0) \big| \le \frac{C\delta}{\sqrt{\eta_0 + \kappa}} \le \frac{C\delta}{\sqrt{\delta + \eta_0 + \kappa}} \tag{6.113}
$$
by Lemma 6.13 and the bound (6.101).

Now, using the continuity of the functions $m_N$, $m^{\Delta}_{+}$, and $m^{\Delta}_{-}$ in $\eta$, we may conclude that $m_N = m^{\Delta}_{+}$ for all $\eta \ge \tilde\eta$, by equation (6.104) of Lemma 6.13. Thus, using (6.103) of Lemma 6.13, we have
$$
|m_N(z) - m_{mp}(z)| = \big| m^{\Delta}_{+}(z) - m_{+}(z) \big| \le \frac{C\delta}{\sqrt{\eta + \kappa}} \le \frac{C\delta}{\sqrt{\delta + \eta + \kappa}}
$$
by equation (6.101) again.

Now, assume $\eta \le \tilde\eta$. We show that even if $m_N$ is close to $m_-$, it is also close to $m_+$. First assume $m_N = m^{\Delta}_{+}$; then, trivially, we see
$$
|m_N - m_{mp}| = \big| m^{\Delta}_{+} - m_{+} \big| \le \frac{C\delta}{\sqrt{\kappa + \eta}}.
$$
Now assume $m_N = m^{\Delta}_{-}$. Then, by the triangle inequality and equation (6.105) of Lemma 6.13, we have
$$
|m_N - m_{mp}| = \big| m^{\Delta}_{-} - m_{+} \big| \le \big| m^{\Delta}_{-} - m^{\Delta}_{+} \big| + \big| m^{\Delta}_{+} - m_{+} \big| \le C(\log M)^{\xi}\sqrt{\delta} + \frac{C\delta}{\sqrt{\kappa + \eta}} \le \frac{C\delta}{\sqrt{\delta + \eta + \kappa}},
$$
where the last bound is an interpolation of the two previous terms.
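As a quick numerical sanity check on the algebra above (an illustrative Python sketch of our own, not part of the proof; all names and test values are arbitrary choices), one can verify that the explicit formulas (6.97) and (6.98) really do solve the perturbed self-consistent equation (6.96):

```python
import numpy as np

def lambda_pm(gamma, Delta):
    """Perturbed spectral edges, equation (6.98)."""
    root = np.sqrt(1 + Delta * (gamma - gamma**2))
    lam_p = ((root + np.sqrt(gamma)) / (1 + Delta * gamma))**2
    lam_m = ((root - np.sqrt(gamma)) / (1 + Delta * gamma))**2
    return lam_p, lam_m

def m_pm(z, gamma, Delta):
    """The two candidate solutions of (6.96), equation (6.97)."""
    lam_p, lam_m = lambda_pm(gamma, Delta)
    sq = np.sqrt((lam_p - z) * (z - lam_m) + 0j)
    base = (1 - gamma - z) / (2 * gamma * z) + Delta / 2
    off = 1j * (1 + gamma * Delta) * sq / (2 * gamma * z)
    return base + off, base - off

def residual(m, z, gamma, Delta):
    """Left-hand side of (6.96) minus its right-hand side."""
    return m - 1 / (1 - gamma - z * gamma * m - z) - Delta

z, gamma, Delta = 1.3 + 0.05j, 0.7, 1e-3   # arbitrary test point
for m in m_pm(z, gamma, Delta):
    print(abs(residual(m, z, gamma, Delta)))   # both residuals ~ machine precision
```

With `Delta = 0`, the "+" root reduces to the Marchenko-Pastur Stieltjes transform $m_{mp}$, consistent with the remark after (6.98).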

6.4. Initial estimates for large $\eta$. In order to start the induction in the continuity argument of Subsection 6.5, we need an initial estimate on $\Lambda_d + \Lambda_o$ for large $\eta$; that is, we need to prove that $\Omega(E + i\eta)$ is an event of $(\xi,\nu)$-high probability for $\eta \sim 1$, as from this it follows that the estimates in Lemmas 6.10 and 6.11 all hold with $(\xi,\nu)$-high probability for $\eta \sim 1$.

Lemma 6.14. Let $\eta := 10(1+\gamma)$ and choose $E$ so that $z := E + i\eta \in \mathbf{R}_L$. Then we have
$$
\Lambda_o(z) + \Lambda_d(z) + \tilde\Lambda_o(z) + \tilde\Lambda_d(z) \le C(\log M)^{-\xi} \tag{6.114}
$$
with $(\xi,\nu)$-high probability. In other words, $\Omega(z)$ holds with $(\xi,\nu)$-high probability.

Proof. Fix $z := E + i\eta \in \mathbf{R}_L$, where $\eta := 10(1+\gamma)$. We repeatedly make use of the following trivial estimates:
$$
\big| G^{(T)}_{ij} \big| \le \frac{1}{\eta}, \qquad \big| \tilde G^{(T)}_{ij} \big| \le \frac{1}{\eta}, \qquad \big| m^{(T)}_N \big| \le \frac{1}{\eta}, \qquad \big| m^{(T)}_M \big| \le \frac{1}{\eta}, \qquad\text{and}\qquad |m_{mp}| \le \frac{1}{\eta}, \tag{6.115}
$$
where $T \subseteq \{1, \dots, N\}$ is arbitrary. When $\eta \sim 1$, (6.115) implies that all these quantities are bounded by constants. These estimates are fairly immediate: the operator norms of $G$ and $\tilde G$ are bounded by $1/\eta$, since the spectra of $H$ and $\tilde H$ are real and $z$ is at distance at least $\eta$ from the real line. Thus,
$$
|G_{ij}| = |e_i^* G e_j| \le \|e_i^*\|\,\|G\|\,\|e_j\| \le \frac{1}{\eta}.
$$
Exactly the same argument applies to $G^{(T)}_{ij}$ and $\tilde G^{(T)}_{ij}$. The bounds on $m^{(T)}_N$ and $m^{(T)}_M$ follow directly from the bounds on the resolvent entries, as they are just averages of the diagonal terms. We demonstrated the bound on $m_{mp}$ in Section 2.

First, we estimate the quantity $\Lambda_o$. Let $i \ne j$. Then, following the calculation bounding $\Lambda_o$ in the proof of Lemma 6.11 and using the large deviation estimate, Lemma 5.1, we see that
$$
\big| x_i^* G^{(ij)} x_j \big| = \left| \sum_{k,l=1}^{M} x_{ki}\, G^{(ij)}_{kl}\, x_{lj} \right| \le (\log M)^{2\xi} \left[ \frac{\max_k \big| G^{(ij)}_{kk} \big|}{q^2} + \frac{\Lambda^{(ij)}_o}{q} + \left( \frac{1}{M^2} \sum_{k\ne l} \big| G^{(ij)}_{kl} \big|^2 \right)^{1/2} \right] \tag{6.116}
$$
with $(\xi,\nu)$-high probability. Note that $\Lambda^{(ij)}_o \le C$ and $\max_k |G_{kk}| \le C$ by (6.115). Using (6.115) and the Ward identity from Lemma 2.14, as before, we get
$$
\Lambda_o \le C(\log M)^{2\xi} \left( \frac{1}{q^2} + \frac{1}{q} + \sqrt{\frac{\operatorname{Im} \operatorname{Tr} G^{(ij)}}{M^2 \eta}} \right)
$$
with $(\xi,\nu)$-high probability, by taking the maximum over $i \ne j$ and since $|z| \le C$ on $\mathbf{R}_L$. Moreover, the term $1/q$ dominates the term $1/q^2$. Finally, with $(\xi,\nu)$-high probability,
$$
\Lambda_o \le C(\log M)^{2\xi} \left( \frac{1}{q} + \sqrt{\frac{\operatorname{Im} \operatorname{Tr} G^{(ij)}}{M^2 \eta}} \right) \le C(\log M)^{-\xi} \tag{6.117}
$$
by the bound (3.4) on $q$.

Now, we need to estimate $\Lambda_d$. First, we estimate the error term $|\Delta_i|$ as in the proof of Lemma 6.11, to see
$$
|\Delta_i| \le |\gamma_N - \gamma| + |z|\,\big| \gamma_N m^{(i)}_N - \gamma m_N \big| + |\Gamma_i|,
$$
and, using the large deviation estimate, Lemma 5.1, we see, recalling (6.80),
$$
|\Gamma_i| \le (\log M)^{\xi}\, \frac{\max_k \big| G^{(i)}_{kk} \big|}{q} + C(\log M)^{2\xi} \left[ \frac{\Lambda^{(i)}_o}{q} + \left( \frac{1}{M^2} \sum_{k\ne l} \big| G^{(i)}_{kl} \big|^2 \right)^{1/2} \right] \tag{6.118}
$$
with $(\xi,\nu)$-high probability. Simplifying (6.118) in the same way as before, we see
$$
|\Gamma_i| \le \frac{C(\log M)^{\xi}}{q} + C(\log M)^{2\xi} \left( \frac{1}{q} + \frac{1}{\sqrt{M}} \right) \le C(\log M)^{-\xi}
$$
with $(\xi,\nu)$-high probability. The term $|\gamma_N - \gamma| \le C/M$ by assumption (3.3) on $\gamma_N$. We can again use the bound $|z| \le C$ on $\mathbf{R}$ and then note that
$$
\big| \gamma_N m^{(i)}_N - \gamma m_N \big| \le |\gamma_N - \gamma|\, \big| m^{(i)}_N \big| + \gamma\, \big| m^{(i)}_N - m_N \big| \le \frac{C}{M} + \frac{\gamma}{N} \sum_k^{(i)} \big| G^{(i)}_{kk} - G_{kk} \big| + \frac{\gamma}{N} |G_{ii}| \le \frac{C}{M}
$$
by assumption (3.3) on $\gamma_N$ and (6.115). Therefore, with $(\xi,\nu)$-high probability and $\eta \ge 10(1+\gamma)$, we see
$$
\max_i |\Delta_i| \le C(\log M)^{-\xi}. \tag{6.119}
$$

Finally, from the self-consistent resolvent equation (6.44) and the Marchenko-Pastur self-consistent equation (2.49), we find
$$
G_{ii} - m_{mp} = \frac{\Delta_i}{G_{ii}^{-1} m_{mp}^{-1}}.
$$
Now, $|m_{mp}| \le 1/(10(1+\gamma))$, so $|m_{mp}^{-1}| \ge 10(1+\gamma) \ge 10$, and similarly $|G_{ii}^{-1}| \ge 10$, so that
$$
\big| G_{ii}^{-1} m_{mp}^{-1} \big| \ge 100.
$$
Thus, from (6.119) we get
$$
|G_{ii} - m_{mp}| \le C(\log M)^{-\xi},
$$
which, when we take the maximum over $i$, yields
$$
\Lambda_d \le C(\log M)^{-\xi}. \tag{6.120}
$$

Now, we can reproduce almost exactly the same argument for $\tilde\Lambda_o$ and $\tilde\Lambda_d$. We only outline the proof, because the method is almost identical. Basically, one has to remove rows from $X$, rather than columns, to produce $X^{[U]}$. This gives us the minors $\tilde H^{[U]}$ of $\tilde H$ and thus the minors $\tilde G^{[U]}$ of $\tilde G$, and so we can use the resolvent identities from Remark 6.6. Since for $\eta$ of order one we only invoke trivial bounds on these minors, this approach actually helps here; we avoid the problem of having to control two types of resolvent simultaneously. Once one has these resolvent identities, it is easy to see how to use the large deviation estimate, Lemma 5.1, and the trivial bounds
$$
\big| G^{[U]}_{ij} \big| \le \frac{1}{\eta} \qquad\text{and}\qquad \big| \tilde G^{[U]}_{ij} \big| \le \frac{1}{\eta}
$$
to obtain exactly the same bounds on $\tilde\Lambda_o$ and $\tilde\Lambda_d$ as on $\Lambda_o$ and $\Lambda_d$.

Finally, summing equations (6.117) and (6.120) and the corresponding bounds on $\tilde\Lambda_o$ and $\tilde\Lambda_d$ finishes the proof of (6.114).
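The trivial bound used repeatedly above can be checked numerically in a few lines (an illustrative sketch of our own; the sample matrix is arbitrary): the resolvent of any Hermitian matrix has operator norm, and hence entries, bounded by $1/\eta = 1/\operatorname{Im} z$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
H = (A + A.T) / 2                      # any Hermitian matrix
z = 0.3 + 2.0j                         # eta = Im z = 2
G = np.linalg.inv(H - z * np.eye(50))
print(np.abs(G).max(), 1 / z.imag)     # max |G_ij| stays below 1/eta
```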

6.5. Continuity argument: conclusion of the proof of Theorem 6.2. To complete the proof of Theorem 6.2, we use a continuity argument, or what is sometimes called bootstrapping, in $\eta$, to go from large $\eta = 10(1+\gamma)$ down to the smallest scale $\eta = M^{-1}(\log M)^{L}$. To begin with, we prove (6.13), so we use Lemma 6.14 for the initial estimate.

Choose any decreasing finite sequence $(\eta_k)_{k=1}^{K}$ such that $K \le C N^{8}$, $|\eta_k - \eta_{k+1}| \le N^{-8}$, $\eta_1 := 10(\gamma+1)$, and $\eta_K := M^{-1}(\log M)^{L}$. For now we fix $E$ such that $\mathbf{1}_{\gamma>1}\,(\lambda_-/5) \le E \le 5\lambda_+$ and define $z_k := E + i\eta_k$.

For the initial point $z_1$, Lemma 6.14 implies that $\Omega(z_1)$ holds with $(\xi,\nu)$-high probability, since the imaginary part of $z_1$ is of order one, $\eta_1 = 10(1+\gamma)$. Thus, using (6.71) from Lemma 6.11, we have, with $(\xi,\nu)$-high probability, the bound
$$
\max_i |\Gamma_i(z_1)| \le \frac{C(\log M)^{\xi}}{q} + C(\log M)^{2\xi}\Psi(z_1) \le \frac{C(\log M)^{\xi}}{q} + \frac{C(\log M)^{2\xi}}{(M\eta)^{1/2}},
$$
by (6.50). Since $\max_i |\Gamma_i(z_1)|$ is the dominant term in the perturbation term $\max_i |\Delta_i|$ (which one can see from the calculation at the end of the proof of Lemma 6.11), we can use Lemma 6.12 to find
$$
\Lambda(z_1) = |m_N - m_{mp}| \le \frac{\frac{C(\log M)^{\xi}}{q} + \frac{C(\log M)^{2\xi}}{(M\eta)^{1/2}}}{\sqrt{\kappa + \eta + \frac{C(\log M)^{\xi}}{q} + \frac{C(\log M)^{2\xi}}{(M\eta)^{1/2}}}} \le \frac{C(\log M)^{\xi}}{\sqrt{q}} + \frac{C(\log M)^{2\xi}}{(M\eta)^{1/4}}.
$$
This estimate proves the claim for the initial point $z_1$; the next lemma extends the claim to all $\eta_k$ for $k \le K$.

Lemma 6.15. Define the events
$$
\Omega_k := \Omega(z_k) \cap \left\{ \Lambda(z_k) \le \frac{C(\log M)^{\xi}}{\sqrt{q}} + \frac{C(\log M)^{2\xi}}{(M\eta)^{1/4}} \right\}. \tag{6.121}
$$
Then
$$
\mathbb{P}(\Omega_k^c) \le 2k\, e^{-\nu(\log M)^{\xi}}. \tag{6.122}
$$

Proof. We use induction on $k$. We just demonstrated the base case $k = 1$. So, we take as our induction hypothesis that $\Omega_k$ holds with $(\xi,\nu)$-high probability. Now we prove the same for $k+1$. Note that
$$
\Omega_{k+1}^c = \big( \Omega_{k+1}^c \cap \Omega_k \big) \cup \big( \Omega_{k+1}^c \cap \Omega_k^c \big)
$$
by de Morgan's law. Similarly,
$$
\Omega_{k+1}^c = \big( \Omega_{k+1}^c \cap \Omega_k \cap \Omega(z_{k+1}) \big) \cup \big( \Omega_{k+1}^c \cap \Omega_k \cap \Omega(z_{k+1})^c \big) \cup \big( \Omega_{k+1}^c \cap \Omega_k^c \big),
$$
so that
$$
\mathbb{P}\big( \Omega_{k+1}^c \big) \le \mathbb{P}\big( \Omega_{k+1}^c \cap \Omega_k \cap \Omega(z_{k+1}) \big) + \mathbb{P}\big( \Omega_k \cap \Omega(z_{k+1})^c \big) + \mathbb{P}\big( \Omega_k^c \big).
$$
By the induction hypothesis, we have
$$
\mathbb{P}(\Omega_k^c) \le 2k\, e^{-\nu(\log M)^{\xi}}.
$$
For convenience, define
$$
A := \mathbb{P}\big( \Omega_k \cap \Omega(z_{k+1})^c \big) \qquad\text{and}\qquad B := \mathbb{P}\big( \Omega_k \cap \Omega(z_{k+1}) \cap \Omega_{k+1}^c \big). \tag{6.123}
$$

First, we show that $A$ is small. For any $i$ and $j$, using the Lipschitz continuity of the resolvent, we have
$$
|G_{ij}(z_k) - G_{ij}(z_{k+1})| \le |z_{k+1} - z_k| \sup_{z \in \mathbf{R}_L} \left| \frac{\partial G_{ij}(z)}{\partial z} \right| \le N^{-8} \sup_{z \in \mathbf{R}_L} \frac{1}{(\operatorname{Im} z)^2} \le N^{-6}, \tag{6.124}
$$
and similarly,
$$
|\tilde G_{ij}(z_k) - \tilde G_{ij}(z_{k+1})| \le N^{-6}. \tag{6.125}
$$
Therefore, using (6.124), we see
$$
\Lambda_d(z_{k+1}) + \Lambda_o(z_{k+1}) \le \Lambda_d(z_k) + \Lambda_o(z_k) + 2N^{-6}.
$$
Now we may use the much stronger bounds (6.70) and (6.86), as we are on the event $\Omega_k$ and hence $\Omega(z_k)$. So, with $(\xi,\nu)$-high probability,
$$
\Lambda_d(z_k) + \Lambda_o(z_k) + 2N^{-6} \le \Lambda(z_k) + C\left( \frac{(\log M)^{\xi}}{q} + (\log M)^{2\xi}\Psi(z_k) \right).
$$
We can simplify this bound further, to see that on $\Omega_k$, with $(\xi,\nu)$-high probability,
$$
\Lambda_d(z_{k+1}) + \Lambda_o(z_{k+1}) \le \frac{C(\log M)^{\xi}}{\sqrt{q}} + \frac{C(\log M)^{2\xi}}{(M\eta)^{1/4}} \ll C(\log M)^{-\xi},
$$
by the induction hypothesis, (3.4), and (6.50). One can show in almost exactly the same way that
$$
\tilde\Lambda_d(z_{k+1}) + \tilde\Lambda_o(z_{k+1}) \ll C(\log M)^{-\xi}
$$
with $(\xi,\nu)$-high probability. This proves that
$$
A \le e^{-\nu(\log M)^{\xi}}.
$$
We can bound $B$ similarly using Lemma 6.12, to find, for all $k$,
$$
\mathbb{P}\big( \Omega_{k+1}^c \big) \le 2 e^{-\nu(\log M)^{\xi}} + \mathbb{P}(\Omega_k^c) \le 2(k+1)\, e^{-\nu(\log M)^{\xi}}, \tag{6.126}
$$
from which the claim follows.

To complete the proof of Theorem 6.2, we use a lattice argument which strengthens the result of Lemma 6.15 to a statement uniform in $z \in \mathbf{R}_L$. Again, the result relies on the Lipschitz continuity of the map $z \mapsto G_{ij}(z)$, with a Lipschitz constant bounded by $\eta^{-2} \le N^2$.

Corollary 6.16. There is a constant $C$ such that
$$
\mathbb{P}\left( \bigcup_{z \in \mathbf{R}_L} \Omega(z)^c \right) + \mathbb{P}\left( \bigcup_{z \in \mathbf{R}_L} \left\{ \Lambda(z) > \frac{C(\log M)^{\xi}}{\sqrt{q}} + \frac{C(\log M)^{2\xi}}{(M\eta)^{1/4}} \right\} \right) \le e^{-\nu(\log M)^{\xi}}. \tag{6.127}
$$

Proof. Let $\mathcal{L} \subseteq \mathbf{R}_L$ be a lattice such that $|\mathcal{L}| \le C N^6$ and, for any $z \in \mathbf{R}_L$, there is a $\bar z \in \mathcal{L}$ such that $|z - \bar z| \le N^{-3}$. Using the Lipschitz continuity of $G$, we see
$$
|G_{ij}(z) - G_{ij}(\bar z)| \le |z - \bar z| \sup_{z \in \mathbf{R}_L} \left| \frac{\partial G_{ij}(z)}{\partial z} \right| \le N^{-3} \sup_{z \in \mathbf{R}_L} \frac{1}{(\operatorname{Im} z)^2} \le N^{-1}. \tag{6.128}
$$
The same bound is true of $|m(z) - m(\bar z)|$. So, using Lemma 6.15, we get
$$
\mathbb{P}\left( \bigcap_{\bar z \in \mathcal{L}} \left\{ \Lambda(\bar z) \le \frac{C(\log M)^{\xi}}{\sqrt{q}} + \frac{C(\log M)^{2\xi}}{(M\eta)^{1/4}} \right\} \right) \ge 1 - e^{-\nu(\log M)^{\xi}}
$$
for some $C$ large enough, and then from equation (6.128) we see
$$
\mathbb{P}\left( \bigcup_{z \in \mathbf{R}_L} \left\{ \Lambda(z) > \frac{C(\log M)^{\xi}}{\sqrt{q}} + \frac{C(\log M)^{2\xi}}{(M\eta)^{1/4}} \right\} \right) \le e^{-\nu(\log M)^{\xi}}. \tag{6.129}
$$
The other term can be dealt with in the same way.

This proves (6.13). To demonstrate (6.10), one can use (6.13), (6.70), and Corollary 6.16. Similarly, one uses (6.13), (6.86), and Corollary 6.16 to show (6.11), and (6.13), (6.72), and Corollary 6.16 to show (6.12).
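The local law just established can be illustrated numerically (an illustrative Python sketch of our own, not part of the proof): for a sparse $X$ with sparseness parameter $q$, the empirical Stieltjes transform $m_N(z)$ stays close to $m_{mp}(z)$ down to small spectral scales $\eta$. The Bernoulli entries below are one simple sparse choice consistent with the moment decay described in Section 3, not the thesis' exact definition, and all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
N = M = 1000
gamma = N / M
q = 20.0                               # sparseness parameter
p = q**2 / M                           # each column has ~ q^2 nonzero entries

# Sparse Bernoulli entries with E x = 0, E x^2 = 1/M, E |x|^k ~ 1/(M q^{k-2}).
mask = rng.random((M, N)) < p
X = mask * rng.choice([-1.0, 1.0], size=(M, N)) / np.sqrt(p * M)
eigs = np.linalg.eigvalsh(X.T @ X)

def m_mp(z, g):
    # Stieltjes transform of the Marchenko-Pastur law, as in (2.48)
    lp, lm = (1 + np.sqrt(g))**2, (1 - np.sqrt(g))**2
    return (1 - g - z + 1j * np.sqrt((lp - z) * (z - lm) + 0j)) / (2 * g * z)

for eta in [1.0, 0.1, 0.01]:           # spectral windows well above 1/N
    z = 1.0 + 1j * eta
    print(eta, abs(np.mean(1 / (eigs - z)) - m_mp(z, gamma)))
```

The printed discrepancies remain small at each scale, in line with the $(\log M)^{\xi}/\sqrt{q} + (\log M)^{2\xi}/(M\eta)^{1/4}$ error bound of Corollary 6.16.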


7. A Brownian motion model for covariance matrices

In this section we present a formal calculation to characterize the diffusion process undergone by the eigenvalues of a covariance matrix $H := X^* T X$ when the entries of $X$ are undergoing independent Brownian motions and $T$ is a nonnegative deterministic $M \times M$ diagonal matrix. The derivation is an exercise in Ito calculus. We fix $N$ throughout this section and do not notate the $t$-dependence of the various quantities during the proof.

The original result of this form is due to Dyson and is commonly called Dyson's Brownian motion; that is, the diffusion process induced on the eigenvalues, denoted by $(\lambda_\alpha(t))_{\alpha=1}^{N}$, of a symmetric (or Hermitian) $N \times N$ matrix when its entries are independent real (or complex) Brownian motions. The concept was introduced in 1962 by Freeman Dyson in his seminal paper [5]. Dyson's interest in the model was mostly physical and he thought of it as a new type of Coulomb gas (now referred to as a log gas): that is, "$N$ point charges executing Brownian motions under the influence of their mutual electrostatic repulsions." The original result of Dyson's paper, presented in modern terms, is that the eigenvalue process of a symmetric (or Hermitian) matrix whose entries are independent real (or complex) Brownian motions is the unique solution to a system of coupled stochastic differential equations,
$$
d\lambda_\alpha(t) = \sqrt{\frac{2}{\beta N}}\, db_\alpha(t) + \frac{1}{N} \left( \sum_{\beta}^{(\alpha)} \frac{1}{\lambda_\alpha(t) - \lambda_\beta(t)} \right) dt, \tag{7.1}
$$
with initial conditions to specify the initial positions of the eigenvalues, where $(b_\alpha)_{\alpha=1}^{N}$ are independent standard Brownian motions and $\beta = 1$ (or $\beta = 2$) in the real (or complex) case. What is significant about the system of coupled stochastic differential equations (7.1) is that the joint distribution of the eigenvalues does not depend on their corresponding eigenvectors: the eigenvectors and eigenvalues decouple.
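To make (7.1) concrete, here is a minimal numerical sketch of our own (not from the thesis; step sizes and names are arbitrary, and the normalization differs from (7.1) by an overall scale): we evolve a symmetric matrix by adding independent Gaussian increments to its entries and track the eigenvalue paths, which reproduces the non-intersecting picture of Dyson's Brownian motion.

```python
import numpy as np

rng = np.random.default_rng(0)
N, steps, dt = 10, 1000, 1e-3

# Symmetric matrix Brownian motion: H(t + dt) = H(t) + dB, with dB symmetric.
H = np.zeros((N, N))
paths = np.empty((steps, N))
for s in range(steps):
    B = rng.normal(scale=np.sqrt(dt), size=(N, N))
    H += (B + B.T) / np.sqrt(2)          # keep the increment symmetric
    paths[s] = np.linalg.eigvalsh(H)     # eigenvalues, sorted ascending

# Adjacent eigenvalue paths never cross (almost surely), reflecting the
# repulsion term in (7.1): the smallest gap stays strictly positive.
print(np.min(paths[:, 1:] - paths[:, :-1]))
```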

Here we investigate the corresponding phenomenon for covariance matrices. The results in this section are both negative and positive. The positive result, which was originally proved in [3], is a formal calculation showing that when, for some positive constant $c$, $T = c\mathbf{1}$, and thus $H = cX^*X$, the processes of the eigenvectors and eigenvalues decouple. In particular, we can interpret Theorem 7.2 as characterizing the processes of the eigenvalues as $N$ squared Bessel processes, driven by independent Brownian motions, which interact through the last drift term in equation (7.3). Thus, the processes stay positive and they repel each other. The repulsion term is more complicated and interesting than the corresponding term for Dyson's Brownian motion: the pairwise repulsion is not only inversely proportional to the difference of the eigenvalues, but also proportional to the sum of the eigenvalues. For example, when two eigenvalues are close to zero, the repulsion can be weak even when the eigenvalues are close to each other. All of these phenomena can be seen easily in the simulations provided by Figure 4.

The negative result states that when $T$ is not a multiple of $\mathbf{1}$, the processes of the eigenvectors and eigenvalues do not decouple. In more detail, the stochastic part of the diffusion process for the eigenvalues depends nontrivially on the components of the eigenvectors at time $t$. Accordingly, during the proof we perform the calculation for a general $T$ as defined above, but it is only successful in the case $T = c\mathbf{1}$.

Definition 7.1. We consider $N \times N$ matrices of the form $H(t) = X(t)^* T X(t)$, where $X = (x_{ij}(t))$ is an $M \times N$ matrix whose entries are real and independent standard Brownian motions and $T := \operatorname{diag}(t_1, \dots, t_M)$, whose entries are deterministic nonnegative real numbers. Furthermore, we denote the eigenvalues of $H$ and their corresponding normalized eigenvectors by
$$
(\lambda_\alpha)_{\alpha=1}^{N} \qquad\text{and}\qquad (v_\alpha)_{\alpha=1}^{N}, \tag{7.2}
$$
respectively.

Theorem 7.2. Let $H$ satisfy Definition 7.1. If $T = c\mathbf{1}$ for some positive constant $c$, then at each time $t$, $H(t)$ has $N$ distinct eigenvalues almost surely, and the $N$-dimensional process $(\lambda_1, \dots, \lambda_N)$ satisfies the coupled system of stochastic differential equations
$$
d\lambda_\alpha = 2\sqrt{c}\sqrt{\lambda_\alpha(t)}\, db_\alpha(t) + cM\, dt + \sum_{\beta}^{(\alpha)} \frac{\lambda_\alpha(t) + \lambda_\beta(t)}{\lambda_\alpha(t) - \lambda_\beta(t)}\, dt \tag{7.3}
$$
for $1 \le \alpha \le N$, where $b_1, \dots, b_N$ are independent Brownian motions. Note that the process does not depend on the eigenvectors of $H(t)$. Conversely, if $T$ is not a multiple of $\mathbf{1}$, then the $N$-dimensional process $(\lambda_1, \dots, \lambda_N)$ does not decouple from the eigenvectors of $H(t)$.

Proof sketch. By definition, we have
$$
H v_\alpha = \lambda_\alpha v_\alpha \tag{7.4}
$$
and
$$
v_\beta^* v_\alpha = \delta_{\alpha\beta}. \tag{7.5}
$$
We want to find out how the eigenvalues evolve when the entries of $X$ are undergoing Brownian motion. We use the notation
$$
\dot f = \frac{\partial f}{\partial x_{ij}},
$$
rather than the usual time derivative, so differentiating equation (7.4) with respect to $x_{ij}$ we get
$$
\dot\lambda_\alpha v_\alpha + \lambda_\alpha \dot v_\alpha = \dot H v_\alpha + H \dot v_\alpha. \tag{7.6}
$$
Then, differentiating equation (7.5), we get
$$
\dot v_\alpha^* v_\beta + v_\alpha^* \dot v_\beta = 0 \qquad\text{and}\qquad \dot v_\alpha^* v_\alpha = 0. \tag{7.7}
$$
Multiplying equation (7.6) on the left-hand side by $v_\alpha^*$ yields
$$
\dot\lambda_\alpha v_\alpha^* v_\alpha + \lambda_\alpha v_\alpha^* \dot v_\alpha = v_\alpha^* \dot H v_\alpha + v_\alpha^* H \dot v_\alpha,
$$
or equivalently,
$$
\dot\lambda_\alpha = v_\alpha^* \dot H v_\alpha, \tag{7.8}
$$
since $\lambda_\alpha v_\alpha^* \dot v_\alpha = v_\alpha^* H \dot v_\alpha$. Similarly, multiplying equation (7.6) on the left-hand side by $v_\beta^*$ yields
$$
\dot\lambda_\alpha v_\beta^* v_\alpha + \lambda_\alpha v_\beta^* \dot v_\alpha = v_\beta^* \dot H v_\alpha + v_\beta^* H \dot v_\alpha,
$$
which implies
$$
\lambda_\alpha v_\beta^* \dot v_\alpha = v_\beta^* \dot H v_\alpha + \lambda_\beta v_\beta^* \dot v_\alpha,
$$
or equivalently,
$$
v_\beta^* \dot v_\alpha = \frac{v_\beta^* \dot H v_\alpha}{\lambda_\alpha - \lambda_\beta}, \tag{7.9}
$$
where the second step is by equation (7.5) and $v_\beta^* H = \lambda_\beta v_\beta^*$, and the last step is a simple rearrangement, as $\lambda_\beta \ne \lambda_\alpha$ almost surely.

Figure 4: (a) 10 independent Brownian motions started at 0; (b) Dyson's Brownian motion for a $10 \times 10$ matrix; (c) a simulation of the eigenvalues of $H$ for $N = 10$, where the entries of $X$ are standard Brownian motions started at the origin. In all simulations $T = \mathbf{1}$. In Figure 4a the processes intersect frequently and do not interact, whereas in Figure 4b the processes do not intersect due to their mutual repulsion. As in Figure 4b, the processes in Figure 4c repel each other, but the repulsion is stronger for the larger eigenvalues. (The plots themselves are not reproduced in this extraction.)

Next, we can use an eigendecomposition of $\dot v_\alpha$ and equation (7.9) to see
$$
\dot v_\alpha = \sum_\beta (v_\beta \cdot \dot v_\alpha)\, v_\beta = \sum_\beta^{(\alpha)} (v_\beta \cdot \dot v_\alpha)\, v_\beta = \sum_\beta^{(\alpha)} \frac{v_\beta^* \dot H v_\alpha}{\lambda_\alpha - \lambda_\beta}\, v_\beta, \tag{7.10}
$$
since the $\alpha$ term in the sum, $v_\alpha \cdot \dot v_\alpha = \dot v_\alpha^* v_\alpha$, is zero by equation (7.7).

Now we calculate $\dot H$ explicitly, by looking at its $(k,l)$th entry:
$$
\begin{aligned}
\big[ \dot H \big]_{kl}
&= \left[ \frac{\partial X^*}{\partial x_{ij}}\, T X \right]_{kl} + \left[ X^* T\, \frac{\partial X}{\partial x_{ij}} \right]_{kl} \\
&= \sum_{r=1}^{M} \left( \frac{\partial X^*}{\partial x_{ij}} \right)_{kr} t_r\, x_{rl} + \sum_{r=1}^{M} x_{rk}\, t_r \left( \frac{\partial X}{\partial x_{ij}} \right)_{rl} \\
&= \sum_{r=1}^{M} \delta_{ir}\delta_{jk}\, t_r\, x_{rl} + \sum_{r=1}^{M} x_{rk}\, t_r\, \delta_{ir}\delta_{jl} \\
&= t_i \big( x_{il}\,\delta_{jk} + x_{ik}\,\delta_{jl} \big).
\end{aligned}
\tag{7.11}
$$
Thus, equations (7.8) and (7.11) imply
$$
\begin{aligned}
\dot\lambda_\alpha
&= t_i \sum_{r=1}^{N}\sum_{k=1}^{N} v_\alpha(r)\, \big[ x_{ir}\delta_{jk} + x_{ik}\delta_{jr} \big]\, v_\alpha(k) \\
&= t_i \left( \sum_{r=1}^{N}\sum_{k=1}^{N} v_\alpha(r)\, x_{ir}\,\delta_{jk}\, v_\alpha(k) + \sum_{r=1}^{N}\sum_{k=1}^{N} v_\alpha(r)\, x_{ik}\,\delta_{jr}\, v_\alpha(k) \right) \\
&= t_i \left( \sum_{r=1}^{N} v_\alpha(r)\, x_{ir}\, v_\alpha(j) + \sum_{k=1}^{N} v_\alpha(j)\, x_{ik}\, v_\alpha(k) \right) \\
&= 2 t_i \sum_{r=1}^{N} x_{ir}\, v_\alpha(r)\, v_\alpha(j).
\end{aligned}
\tag{7.12}
$$
Similarly, equations (7.10) and (7.11) imply
$$
\dot v_\alpha = t_i \sum_\beta^{(\alpha)} \frac{\sum_{r=1}^{N} x_{ir} \big[ v_\alpha(r) v_\beta(j) + v_\alpha(j) v_\beta(r) \big]}{\lambda_\alpha - \lambda_\beta}\, v_\beta,
$$
and thus
$$
\dot v_\alpha(k) = t_i \sum_\beta^{(\alpha)} \frac{\sum_{r=1}^{N} x_{ir} \big[ v_\alpha(r) v_\beta(j) + v_\alpha(j) v_\beta(r) \big]}{\lambda_\alpha - \lambda_\beta}\, v_\beta(k). \tag{7.13}
$$

Now, we want to get the second-order partial derivatives of the eigenvalues, but we will not need all of them, only those where we take the derivative with respect to the same entry twice. That is, differentiating equation (7.12) with respect to $x_{ij}$, we get
$$
\begin{aligned}
\ddot\lambda_\alpha := \frac{\partial^2 \lambda_\alpha}{\partial x_{ij}^2}
&= 2 t_i \sum_{r=1}^{N} \left( \frac{\partial x_{ir}}{\partial x_{ij}}\, v_\alpha(r) v_\alpha(j) + x_{ir}\, \frac{\partial v_\alpha(r)}{\partial x_{ij}}\, v_\alpha(j) + x_{ir}\, v_\alpha(r)\, \frac{\partial v_\alpha(j)}{\partial x_{ij}} \right) \\
&= 2 t_i\, v_\alpha(j)^2 + 2 t_i \sum_{r=1}^{N} \big( x_{ir}\, \dot v_\alpha(r)\, v_\alpha(j) + x_{ir}\, v_\alpha(r)\, \dot v_\alpha(j) \big).
\end{aligned}
\tag{7.14}
$$
We can substitute equation (7.13) into equation (7.14) to see
$$
\ddot\lambda_\alpha = 2 t_i\, v_\alpha(j)^2 + 2 t_i \sum_\beta^{(\alpha)} \frac{1}{\lambda_\alpha - \lambda_\beta} \sum_{r,s=1}^{N} x_{ir} x_{is} \big[ v_\alpha(j) v_\alpha(s) v_\beta(j) v_\beta(r) + v_\alpha(j)^2 v_\beta(r) v_\beta(s) + v_\alpha(r) v_\alpha(s) v_\beta(j)^2 + v_\alpha(j) v_\alpha(r) v_\beta(j) v_\beta(s) \big]. \tag{7.15}
$$

The eigenvalues are functions of the entries of $X$, so we now take the Ito derivative of $\lambda_\alpha(x_{11}, \dots, x_{MN})$ using Ito's lemma. Thus,
$$
\begin{aligned}
d\lambda_\alpha
&= \sum_{i=1}^{M}\sum_{j=1}^{N} \frac{\partial \lambda_\alpha}{\partial x_{ij}}\, dx_{ij} + \frac{1}{2} \sum_{i=1}^{M}\sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{l=1}^{N} \frac{\partial^2 \lambda_\alpha}{\partial x_{ij}\,\partial x_{kl}}\, (dx_{ij})(dx_{kl}) \\
&= \sum_{i=1}^{M}\sum_{j=1}^{N} \frac{\partial \lambda_\alpha}{\partial x_{ij}}\, dx_{ij} + \frac{1}{2} \sum_{i=1}^{M}\sum_{j=1}^{N} \frac{\partial^2 \lambda_\alpha}{\partial x_{ij}^2}\, dt.
\end{aligned}
\tag{7.16}
$$

We can substitute equations (7.12) and (7.15) into equation (7.16) to see
$$
\begin{aligned}
d\lambda_\alpha
&= 2 \sum_{i=1}^{M} \sum_{j,r=1}^{N} t_i\, x_{ir}\, v_\alpha(r)\, v_\alpha(j)\, dx_{ij} + \sum_{i=1}^{M}\sum_{j=1}^{N} t_i\, v_\alpha(j)^2\, dt \\
&\quad + \sum_\beta^{(\alpha)} \frac{1}{\lambda_\alpha - \lambda_\beta} \sum_{i=1}^{M} \sum_{j,r,s=1}^{N} t_i\, x_{ir} x_{is} \big[ v_\alpha(j) v_\alpha(s) v_\beta(j) v_\beta(r) + v_\alpha(j)^2 v_\beta(r) v_\beta(s) \\
&\qquad\qquad + v_\alpha(r) v_\alpha(s) v_\beta(j)^2 + v_\alpha(j) v_\alpha(r) v_\beta(j) v_\beta(s) \big]\, dt.
\end{aligned}
\tag{7.17}
$$

Let us take this one term at a time. First, note that
$$
\sum_{r=1}^{N} \sqrt{t_i}\, x_{ir}\, v_\alpha(r) = \big( T^{1/2} X v_\alpha \big)(i),
$$
where $\big(T^{1/2} X v_\alpha\big)(i)$ is the $i$th entry of the vector $T^{1/2} X v_\alpha$. Suppose $\lambda_\alpha \ne 0$ (which is true almost surely), and define
$$
\tilde v_\alpha := \frac{T^{1/2} X v_\alpha}{\| T^{1/2} X v_\alpha \|},
$$
where
$$
\| T^{1/2} X v_\alpha \|^2 = \big\langle T^{1/2} X v_\alpha\,,\, T^{1/2} X v_\alpha \big\rangle = \langle v_\alpha\,,\, X^* T X v_\alpha \rangle = \lambda_\alpha,
$$
and thus
$$
\sum_{r=1}^{N} \sqrt{t_i}\, x_{ir}\, v_\alpha(r) = \big( T^{1/2} X v_\alpha \big)(i) = \sqrt{\lambda_\alpha}\, \tilde v_\alpha(i). \tag{7.18}
$$
Note that $\tilde v_\alpha$ is a normalized eigenvector of $T^{1/2} X X^* T^{1/2}$ with eigenvalue $\lambda_\alpha$, since
$$
T^{1/2} X X^* T^{1/2}\, \tilde v_\alpha = \frac{1}{\| T^{1/2} X v_\alpha \|}\, T^{1/2} X \big( X^* T X \big) v_\alpha = \lambda_\alpha\, \frac{T^{1/2} X v_\alpha}{\| T^{1/2} X v_\alpha \|} = \lambda_\alpha \tilde v_\alpha.
$$

Using the above observation in (7.18), we can see that
$$
2 \sum_{i=1}^{M} \sum_{j,r=1}^{N} t_i\, x_{ir}\, v_\alpha(r)\, v_\alpha(j)\, dx_{ij} = 2\sqrt{\lambda_\alpha} \sum_{i=1}^{M}\sum_{j=1}^{N} \sqrt{t_i}\, \tilde v_\alpha(i)\, v_\alpha(j)\, dx_{ij}.
$$
Define $b_\alpha := \sum_{i=1}^{M}\sum_{j=1}^{N} \sqrt{t_i}\, \tilde v_\alpha(i)\, v_\alpha(j)\, x_{ij} = \sqrt{\lambda_\alpha}$; if we can show that $b_\alpha$ is a standard Brownian motion, we then know that the eigenvalue's process is a squared Bessel process, since $b_\alpha^2 = \lambda_\alpha$. Define $\mathbb{E}_t(A) := \mathbb{E}\big( A \,\big|\, \mathcal{F}(t) \big)$, where $\mathcal{F}(t)$ is the $\sigma$-algebra of events that can be defined in terms of the process up to time $t$. Thus, we must show $\mathbb{E}_t(db_\alpha) = 0$ and $\mathbb{E}_t(db_\alpha\, db_\beta) = c\,\delta_{\alpha\beta}\, dt$. For any unit vector $v$, $v \cdot dv = dv \cdot v = 0$, so we find
$$
\begin{aligned}
db_\alpha
&= \sum_{i=1}^{M}\sum_{j=1}^{N} \big( \sqrt{t_i}\, d\tilde v_\alpha(i)\, v_\alpha(j)\, x_{ij} + \sqrt{t_i}\, \tilde v_\alpha(i)\, dv_\alpha(j)\, x_{ij} + \sqrt{t_i}\, \tilde v_\alpha(i)\, v_\alpha(j)\, dx_{ij} \big) \\
&= \sqrt{\lambda_\alpha}\, d\tilde v_\alpha \cdot \tilde v_\alpha + \sqrt{\lambda_\alpha}\, v_\alpha \cdot dv_\alpha + \sum_{i=1}^{M}\sum_{j=1}^{N} \sqrt{t_i}\, \tilde v_\alpha(i)\, v_\alpha(j)\, dx_{ij} \\
&= \sum_{i=1}^{M}\sum_{j=1}^{N} \sqrt{t_i}\, \tilde v_\alpha(i)\, v_\alpha(j)\, dx_{ij}.
\end{aligned}
\tag{7.19}
$$
So, when we take the expectation and use the facts that $\mathbb{E}_t(dx_{ij}) = 0$ and that $v_\alpha$ and $\tilde v_\alpha$ are measurable with respect to $\mathcal{F}(t)$, we see
$$
\mathbb{E}_t(db_\alpha) = \mathbb{E}_t \sum_{i=1}^{M}\sum_{j=1}^{N} \sqrt{t_i}\, \tilde v_\alpha(i)\, v_\alpha(j)\, dx_{ij} = \sum_{i=1}^{M}\sum_{j=1}^{N} \sqrt{t_i}\, \tilde v_\alpha(i)\, v_\alpha(j)\, \mathbb{E}_t(dx_{ij}) = 0. \tag{7.20}
$$

Next, we get
$$
\begin{aligned}
\mathbb{E}_t(db_\alpha\, db_\beta)
&= \mathbb{E}_t \sum_{i,k=1}^{M} \sum_{j,l=1}^{N} \sqrt{t_i t_k}\, \tilde v_\alpha(i)\, v_\alpha(j)\, \tilde v_\beta(k)\, v_\beta(l)\, (dx_{ij})(dx_{kl}) \\
&= \sum_{i=1}^{M}\sum_{j=1}^{N} t_i\, \tilde v_\alpha(i)\, \tilde v_\beta(i)\, v_\alpha(j)\, v_\beta(j)\, dt \\
&= \delta_{\alpha\beta} \sum_{i=1}^{M} t_i\, \tilde v_\alpha(i)^2\, dt.
\end{aligned}
\tag{7.21}
$$
Note that the quantity $\sum_{i=1}^{M} t_i\, \tilde v_\alpha(i)^2$ is deterministic and independent of the components of the eigenvectors only if $t_1 = \dots = t_M = c$. If $T$ does not satisfy this condition, then the stochastic part of the process $(\lambda_1, \dots, \lambda_N)$ depends in a nontrivial way on the eigenvectors. This completes the proof of the negative claim in Theorem 7.2. However, when $T = c\mathbf{1}$, we get
$$
\mathbb{E}_t(db_\alpha\, db_\beta) = \delta_{\alpha\beta} \sum_{i=1}^{M} t_i\, \tilde v_\alpha(i)^2\, dt = c\,\delta_{\alpha\beta}\, dt. \tag{7.22}
$$
So, in this special case,
$$
2 \sum_{i=1}^{M} \sum_{j,r=1}^{N} t_i\, x_{ir}\, v_\alpha(r)\, v_\alpha(j)\, dx_{ij} = 2\sqrt{c}\sqrt{\lambda_\alpha}\, db_\alpha, \tag{7.23}
$$
where now $b_\alpha$ denotes a standard Brownian motion.

We continue to write our calculations for a general $T$, to demonstrate how the calculations of the other terms work out, but from now on we remain in the special case $T = c\mathbf{1}$. For the second term in (7.17), we see
$$
\sum_{i=1}^{M}\sum_{j=1}^{N} t_i\, v_\alpha(j)^2\, dt = \operatorname{Tr}(T)\, dt = cM\, dt. \tag{7.24}
$$
For the third term in (7.17), we shall see that, for $\beta \ne \alpha$,
$$
\sum_{i=1}^{M} \sum_{j,r,s=1}^{N} t_i\, x_{ir} x_{is} \big[ v_\alpha(j) v_\alpha(s) v_\beta(j) v_\beta(r) + v_\alpha(j)^2 v_\beta(r) v_\beta(s) + v_\alpha(r) v_\alpha(s) v_\beta(j)^2 + v_\alpha(j) v_\alpha(r) v_\beta(j) v_\beta(s) \big] = \lambda_\alpha + \lambda_\beta. \tag{7.25}
$$
Take the first summand in equation (7.25) and use equation (7.18) twice, to see
$$
\begin{aligned}
\sum_{i=1}^{M} \sum_{j,r,s=1}^{N} \sqrt{t_i}\, x_{ir}\, \sqrt{t_i}\, x_{is}\, v_\alpha(j) v_\alpha(s) v_\beta(j) v_\beta(r)
&= \sqrt{\lambda_\alpha} \sum_{i=1}^{M} \sum_{j,r=1}^{N} \sqrt{t_i}\, x_{ir}\, v_\alpha(j)\, \tilde v_\alpha(i)\, v_\beta(j)\, v_\beta(r) \\
&= \sqrt{\lambda_\alpha \lambda_\beta} \sum_{i=1}^{M}\sum_{j=1}^{N} v_\alpha(j)\, \tilde v_\alpha(i)\, v_\beta(j)\, \tilde v_\beta(i) \\
&= \lambda_\alpha\, \delta_{\alpha\beta}.
\end{aligned}
$$
Now, take the second of the terms in equation (7.25) and perform a similar calculation, to find
$$
\begin{aligned}
\sum_{i=1}^{M} \sum_{j,r,s=1}^{N} \sqrt{t_i}\, x_{ir}\, \sqrt{t_i}\, x_{is}\, v_\alpha(j)^2\, v_\beta(r)\, v_\beta(s)
&= \sqrt{\lambda_\beta} \sum_{i=1}^{M} \sum_{j,r=1}^{N} \sqrt{t_i}\, x_{ir}\, v_\alpha(j)^2\, v_\beta(r)\, \tilde v_\beta(i) \\
&= \lambda_\beta \sum_{i=1}^{M}\sum_{j=1}^{N} v_\alpha(j)^2\, \tilde v_\beta(i)^2 \\
&= \lambda_\beta.
\end{aligned}
$$

The third term is similar to the second and the fourth term is similar to the first. Therefore,
$$
\sum_\beta^{(\alpha)} \frac{1}{\lambda_\alpha - \lambda_\beta} \sum_{i=1}^{M} \sum_{j,r,s=1}^{N} t_i\, x_{ir} x_{is} \big[ v_\alpha(j) v_\alpha(s) v_\beta(j) v_\beta(r) + v_\alpha(j)^2 v_\beta(r) v_\beta(s) + v_\alpha(r) v_\alpha(s) v_\beta(j)^2 + v_\alpha(j) v_\alpha(r) v_\beta(j) v_\beta(s) \big]\, dt = \sum_\beta^{(\alpha)} \frac{\lambda_\alpha + \lambda_\beta}{\lambda_\alpha - \lambda_\beta}\, dt. \tag{7.26}
$$
So, putting together equations (7.23), (7.24), and (7.26), we find
$$
d\lambda_\alpha = 2\sqrt{c}\sqrt{\lambda_\alpha}\, db_\alpha + cM\, dt + \sum_\beta^{(\alpha)} \frac{\lambda_\alpha + \lambda_\beta}{\lambda_\alpha - \lambda_\beta}\, dt. \tag{7.27}
$$
Note that this is a formal calculation, not a rigorous proof. We have ignored the issues associated with intersecting eigenvalues, which can be handled, but we have omitted doing so here.
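As a numerical illustration of Theorem 7.2 (an illustrative sketch of our own, not from the original text; the matrix sizes and step sizes are arbitrary), one can evolve the entries of $X$ by Brownian increments and watch the eigenvalues of $H = cX^*X$: the paths stay nonnegative and, after time zero, never cross, and the repulsion is visibly stronger for the larger eigenvalues, as described above.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, c = 30, 10, 1.0
steps, dt = 2000, 1e-3

X = np.zeros((M, N))                      # entries start at 0, as in Figure 4c
paths = np.empty((steps, N))
for s in range(steps):
    X += rng.normal(scale=np.sqrt(dt), size=(M, N))   # independent BM increments
    paths[s] = np.linalg.eigvalsh(c * X.T @ X)        # eigenvalues of H = c X^T X

# Eigenvalues are nonnegative, and away from time 0 the ordered paths keep
# a strictly positive gap, reflecting the repulsion term in (7.3).
print(paths.min(), np.min(paths[steps // 2:, 1:] - paths[steps // 2:, :-1]))
```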


8. Derivation of the Marchenko-Pastur equation

In this section we consider matrices of the form $H := T^{1/2} X^* X T^{1/2}$, where $X := (x_{ij})$ is an $M \times N$ real matrix whose entries $x_{ij}$ are independent and identically distributed with $\mathbb{E} x_{11} = 0$ and $\mathbb{E} x_{11}^2 = 1/M$; and $T$ is a deterministic positive-definite $N \times N$ matrix, so that $T := U^* \operatorname{diag}(t_1, \dots, t_N)\, U$ with $U$ an $N \times N$ unitary matrix, where the $t_i \ge 0$, and the empirical distribution of $T$ converges almost surely in distribution to a probability density function $\mu_T$ which is compactly supported on $[0, h]$ for some $h \in \mathbb{R}$ as $N \to \infty$. Furthermore, as in Definition 3.1, we assume $M = M_N$ and that $N/M = \gamma_N \to \gamma \in (0, \infty)$ as $N \to \infty$. As before, all these quantities are $N$ (or equivalently $M$) dependent, but we suppress this dependence in our notation. We derive the self-consistent equation for the ensemble $H$ by reworking the proof originally given in [19]; the same derivation is also reproduced in [6].

Definition 8.1. We define the generalized Marchenko-Pastur equation as
$$
m(z) = \int_{\mathbb{R}} \frac{\mu_T(dt)}{t\big(1 - \gamma - \gamma z m(z)\big) - z}. \tag{8.1}
$$
The above equation has a unique solution, denoted by $m_{fc}(z)$, when we stipulate that $m_{fc} : \mathbb{H} \to \mathbb{H}$.

Note that, unlike the case $T = \mathbf{1}$, $m_{fc}(z)$ does not in general have an explicit formula. Also note that when $T = \mathbf{1}$, we recover the regular Marchenko-Pastur equation (2.49), as $\mu_T = \delta_1$. The uniqueness of the equation's solution is proved in Section 5 of [19].
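Since $m_{fc}$ has no closed form for a general $T$, it is natural to solve (8.1) numerically. The following is a minimal sketch of our own (not from [19]): a damped fixed-point iteration for $m_{fc}(z)$ at a single spectral parameter, using an illustrative two-atom measure $\mu_T$; the function names, the damping, and the choice of $\mu_T$ are all our own assumptions, and spectral parameters very close to the real axis may need more careful iteration.

```python
import numpy as np

def m_fc(z, ts, weights, gamma, tol=1e-12, max_iter=100_000):
    """Damped fixed-point iteration for the generalized Marchenko-Pastur
    equation (8.1), with mu_T = sum_k weights[k] * delta_{ts[k]}."""
    m = 1j  # start in the upper half-plane
    for _ in range(max_iter):
        m_new = np.sum(weights / (ts * (1 - gamma - gamma * z * m) - z))
        m_new = (m + m_new) / 2          # damping stabilizes the iteration
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    raise RuntimeError("iteration did not converge")

# Example: mu_T = (delta_1 + delta_4) / 2 and gamma = 0.5.
ts, weights = np.array([1.0, 4.0]), np.array([0.5, 0.5])
m = m_fc(2.0 + 0.1j, ts, weights, gamma=0.5)
print(m, m.imag > 0)   # the solution maps the upper half-plane to itself
```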

As was explained for the simpler ensemble in Remark 3.6 and Notation 3.7, there are many simple relationships between the ensembles
$$
H := T^{1/2} X^* X T^{1/2} \qquad\text{and}\qquad \tilde H := X T X^*, \tag{8.2}
$$
so in some sense they are equivalent. The proofs are similar to before and we keep the same broad idea in our notation. We denote the Stieltjes transform of the matrix $H$ by
$$
m_N(z) := \frac{1}{N} \operatorname{Tr}(H - z\mathbf{1})^{-1} = \frac{1}{N} \operatorname{Tr} G(z). \tag{8.3}
$$
Similarly, we denote the Stieltjes transform of the matrix $\tilde H$ by
$$
m_M(z) := \frac{1}{M} \operatorname{Tr}(\tilde H - z\mathbf{1})^{-1} = \frac{1}{M} \operatorname{Tr} \tilde G(z). \tag{8.4}
$$
These two transforms are related by
$$
m_N(z) = \big( \gamma_N^{-1} - 1 \big) \frac{1}{z} + \gamma_N^{-1} m_M(z), \tag{8.5}
$$
or equivalently,
$$
m_M(z) = (\gamma_N - 1) \frac{1}{z} + \gamma_N m_N(z), \tag{8.6}
$$
since the nonzero eigenvalues of $H$ and $\tilde H$ agree, as in Lemma 3.5. Now we derive the self-consistent equation for $H$. Denote the $i$th row of $X$ by $x_i$. Note that these rows are independent of one another. Define the quantities
$$
r_i := x_i T^{1/2} \qquad\text{and}\qquad H^{(i)} := \sum_j^{(i)} r_j^* r_j = H - r_i^* r_i, \tag{8.7}
$$
since
$$
H = T^{1/2} X^* X T^{1/2} = T^{1/2} \left( \sum_{j=1}^{M} x_j^* x_j \right) T^{1/2} = \sum_{j=1}^{M} r_j^* r_j.
$$
Notice that the $r_i$ are independent of one another (since the $x_i$ are independent) and that $r_i$ is independent of $H^{(i)}$ (which is not a minor, but behaves very similarly to a minor). We are forced into this approach, rather than the usual analysis of minors, because the non-diagonal nature of $T$ stops the columns of $T^{1/2} X^* X T^{1/2}$ from being independent. Note that when $T$ is diagonal we can simply look at the columns of $X T^{1/2}$, in which case the analysis is similar to the usual one, but with these definitions we can utilize the independence of the rows of $X$ even when $T$ is non-diagonal.

During the derivation we will utilize the following simple lemma to express the resolvent of $H$ in terms of a sum of the resolvents of the $H^{(i)}$. This result is analogous to the first resolvent identity in Lemma 6.5.

Lemma 8.2. Let $q \in \mathbb{C}^N$ be a row vector and let $A$ be an $N \times N$ matrix such that $A$ and $A + q^* q$ are invertible. Then we have the following equality:
$$
q (A + q^* q)^{-1} = \frac{1}{1 + q A^{-1} q^*}\, q A^{-1}. \tag{8.8}
$$

Proof. Note that
$$
\big( 1 + q A^{-1} q^* \big)\, q = q + q A^{-1} q^* q = q A^{-1} (A + q^* q),
$$
then divide by the scalar $1 + q A^{-1} q^*$ and multiply on the right by $(A + q^* q)^{-1}$, both of which are valid operations by assumption.
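A quick numerical check of Lemma 8.2 (an illustrative sketch of our own; the sample matrix, the shift making $A$ invertible, and all names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
A = rng.normal(size=(N, N)) + N * np.eye(N)   # a generic invertible matrix
q = rng.normal(size=(1, N))                   # row vector

lhs = q @ np.linalg.inv(A + q.T @ q)
rhs = (q @ np.linalg.inv(A)) / (1 + q @ np.linalg.inv(A) @ q.T)
print(np.max(np.abs(lhs - rhs)))              # ~ 1e-16: identity (8.8) holds
```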

Now, using the definition of $H$, we see
$$
(H - z\mathbf{1}) + z\mathbf{1} = H = \sum_{j=1}^{M} r_j^* r_j,
$$
and, multiplying by $(H - z\mathbf{1})^{-1}$ on the right, we obtain
$$
\mathbf{1} + zG = \mathbf{1} + z(H - z\mathbf{1})^{-1} = \sum_{j=1}^{M} r_j^* r_j (H - z\mathbf{1})^{-1} = \sum_{j=1}^{M} r_j^* r_j\, G. \tag{8.9}
$$
Denote
$$
G^{(i)} \equiv G^{(i)}(z) := \big( H^{(i)} - z\mathbf{1} \big)^{-1}.
$$
Now we use Lemma 8.2, with $q = r_i$ and $A = H^{(i)} - z\mathbf{1}$, on each term of the sum above, to see
$$
r_i (H - z\mathbf{1})^{-1} = r_i \big( H^{(i)} - z\mathbf{1} + r_i^* r_i \big)^{-1} = \frac{1}{1 + r_i \big( H^{(i)} - z\mathbf{1} \big)^{-1} r_i^*}\, r_i \big( H^{(i)} - z\mathbf{1} \big)^{-1},
$$
which we can multiply by $r_i^*$ on the left and sum over $i$, to find
$$
\mathbf{1} + zG = \sum_{i=1}^{M} r_i^* r_i (H - z\mathbf{1})^{-1} = \sum_{i=1}^{M} \frac{1}{1 + r_i G^{(i)} r_i^*}\, r_i^* r_i\, G^{(i)}, \tag{8.10}
$$
where we used (8.9). Now, taking the trace of equation (8.10) (using the linearity of the trace) and dividing by $M$, gives, by definition,

$$
\begin{aligned}
\gamma_N + \gamma_N z m_N
&= \frac{1}{M} \operatorname{Tr}(\mathbf{1}) + z\, \frac{N}{M}\, \frac{1}{N} \operatorname{Tr} G = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{1 + r_i G^{(i)} r_i^*}\, \operatorname{Tr}\!\big[ r_i^* r_i\, G^{(i)} \big] \\
&= \frac{1}{M} \sum_{i=1}^{M} \frac{r_i G^{(i)} r_i^*}{1 + r_i G^{(i)} r_i^*} = 1 - \frac{1}{M} \sum_{i=1}^{M} \frac{1}{1 + r_i G^{(i)} r_i^*},
\end{aligned}
\tag{8.11}
$$
where the third equality follows from the cyclic property of the trace and the last equality follows from a simple partial fraction. Subtracting $1$ from both sides of equation (8.11) and dividing by $z$ yields
$$
m_M = (\gamma_N - 1) \frac{1}{z} + \gamma_N m_N = -\frac{1}{z}\, \frac{1}{M} \sum_{i=1}^{M} \frac{1}{1 + r_i G^{(i)} r_i^*} \tag{8.12}
$$
by equation (8.6). So, multiplying by $T$ on the right gives us
$$
-z m_M T = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{1 + r_i G^{(i)} r_i^*}\, T. \tag{8.13}
$$

Next, we prove the following simple resolvent identity.

Lemma 8.3. Fix $z \in \mathbb{C}$. For any $N \times N$ matrices $A$ and $B$ such that $(A - z\mathbf{1})$ and $(B - z\mathbf{1})$ are invertible, we have
$$
(A - z\mathbf{1})^{-1} - (B - z\mathbf{1})^{-1} = -(A - z\mathbf{1})^{-1} (A - B) (B - z\mathbf{1})^{-1}. \tag{8.14}
$$

Proof. The proof is a simple manipulation. Note that
$$
(A - z\mathbf{1})^{-1} (B - z\mathbf{1} + A - B) = (A - z\mathbf{1})^{-1} (A - z\mathbf{1}) = \mathbf{1},
$$
and then multiply on the right by $(B - z\mathbf{1})^{-1}$ to get
$$
(A - z\mathbf{1})^{-1} + (A - z\mathbf{1})^{-1} (A - B) (B - z\mathbf{1})^{-1} = (B - z\mathbf{1})^{-1},
$$
which completes the proof.


Next, we use the identity (8.14) and the definition of $H$ to see
$$
\begin{aligned}
(-z m_M T - z\mathbf{1})^{-1} - G
&= -(-z m_M T - z\mathbf{1})^{-1} \left( (-z m_M T) - \sum_{i=1}^{M} r_i^* r_i \right) G \\
&= -(-z m_M T - z\mathbf{1})^{-1} \left[ (-z m_M T) G - \left( \sum_{i=1}^{M} r_i^* r_i \right) G \right] \\
&= -(-z m_M T - z\mathbf{1})^{-1} \left[ \frac{1}{M} \sum_{i=1}^{M} \frac{1}{1 + r_i G^{(i)} r_i^*}\, T G - \left( \sum_{i=1}^{M} r_i^* r_i\, G \right) \right] \\
&= -(-z m_M T - z\mathbf{1})^{-1}\, \frac{1}{M} \sum_{i=1}^{M} \frac{1}{1 + r_i G^{(i)} r_i^*}\, T G + (-z m_M T - z\mathbf{1})^{-1} \sum_{i=1}^{M} \frac{1}{1 + r_i G^{(i)} r_i^*}\, r_i^* r_i\, G^{(i)} \\
&= \frac{1}{z} \sum_{i=1}^{M} \frac{1}{1 + r_i G^{(i)} r_i^*} \left[ (-m_M T - \mathbf{1})^{-1} r_i^* r_i\, G^{(i)} - \frac{1}{M} (-m_M T - \mathbf{1})^{-1} T G \right],
\end{aligned}
$$
where the second equality is a simple manipulation; the third equality follows from equation (8.13); the fourth follows from equations (8.9) and (8.10); and the last equality follows from factoring the above. Next, we take the trace of the above and divide by $N$, to get
$$
\begin{aligned}
\frac{1}{N} \operatorname{Tr}(-z m_M T - z\mathbf{1})^{-1} - m_N
&= \frac{1}{zN} \sum_{i=1}^{M} \frac{1}{1 + r_i G^{(i)} r_i^*} \left( \operatorname{Tr}\!\big[ (-m_M T - \mathbf{1})^{-1} r_i^* r_i\, G^{(i)} \big] - \frac{1}{M} \operatorname{Tr}\!\big[ (-m_M T - \mathbf{1})^{-1} T G \big] \right) \\
&= \frac{1}{zN} \sum_{i=1}^{M} \frac{1}{1 + r_i G^{(i)} r_i^*} \left( r_i G^{(i)} (-m_M T - \mathbf{1})^{-1} r_i^* - \frac{1}{M} \operatorname{Tr}\!\big[ G (-m_M T - \mathbf{1})^{-1} T \big] \right) \\
&= \frac{1}{z \gamma_N M} \sum_{i=1}^{M} \frac{1}{1 + r_i G^{(i)} r_i^*}\, d_i(z),
\end{aligned}
\tag{8.15}
$$
where
$$
d_i(z) := r_i G^{(i)}(z) \big( -m_M(z) T - \mathbf{1} \big)^{-1} r_i^* - \frac{1}{M} \operatorname{Tr}\!\big[ G(z) \big( -m_M(z) T - \mathbf{1} \big)^{-1} T \big]. \tag{8.16}
$$

Let us examine the first term on the left-hand side of equation (8.15). We see
$$
\frac{1}{N} \operatorname{Tr}(-z m_M T - z\mathbf{1})^{-1} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{-z - t_i z m_M},
$$
as $-m_M(z) T - \mathbf{1}$ is simply a rational transformation of the spectrum of $T$. Substituting for $m_M(z)$ using equation (8.6), we see from equation (8.15) that
$$
\frac{1}{N} \sum_{i=1}^{N} \frac{1}{-z - t_i (\gamma_N - 1 + \gamma_N m_N z)} - m_N = \frac{1}{z \gamma_N M} \sum_{i=1}^{M} \frac{1}{1 + r_i G^{(i)} r_i^*}\, d_i,
$$
and then rearranging yields
$$
m_N(z) - \frac{1}{N} \sum_{i=1}^{N} \frac{1}{t_i \big( 1 - \gamma_N - \gamma_N z m_N(z) \big) - z} = -\frac{1}{z \gamma_N M} \sum_{i=1}^{M} \frac{1}{1 + r_i G^{(i)}(z) r_i^*}\, d_i(z), \tag{8.17}
$$
where the right-hand side is an error term which can be controlled. Equation (8.17) is an approximate version of equation (8.1).
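To illustrate (8.17) numerically (an illustrative sketch of our own): sample $H = T^{1/2} X^* X T^{1/2}$ for a moderate $N$, compute the empirical Stieltjes transform $m_N(z)$, and check that it nearly satisfies the self-consistent equation (8.1) with $\mu_T$ taken to be the empirical distribution of $T$. The residual printed below plays the role of the error term on the right-hand side of (8.17); the diagonal $T$ and all parameter values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 400, 800                      # gamma = N/M = 0.5
gamma = N / M
ts = rng.uniform(0.5, 2.0, size=N)   # spectrum of a diagonal T (illustrative)
X = rng.normal(scale=1 / np.sqrt(M), size=(M, N))
Y = np.sqrt(ts)[None, :] * X         # X T^{1/2}
H = Y.T @ Y                          # T^{1/2} X^* X T^{1/2}

z = 1.0 + 0.1j
eigs = np.linalg.eigvalsh(H)
m_N = np.mean(1 / (eigs - z))        # empirical Stieltjes transform

# Residual of the generalized Marchenko-Pastur equation (8.1),
# with mu_T the empirical spectral distribution of T:
rhs = np.mean(1 / (ts * (1 - gamma - gamma * z * m_N) - z))
print(abs(m_N - rhs))                # small, and it shrinks as N grows
```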


9. Acknowledgements

The author would like to thank Professor Horng-Tzer Yau for his guidance and supervision since last summer, and to acknowledge the constant patience and help afforded by the postdoctoral fellows, Alex Bloemendal and Kevin Schnelli, throughout the project. The author also thanks Antti Knowles for his useful comments and for pointing out the simplified proof of (5.11) in the large deviation estimate. This work was kindly supported by a grant from the Harvard College Research Program during the summer of 2012.


References

[1] Anderson, G., Guionnet, A., Zeitouni, O.: An Introduction to Random Matrices. Cambridge Studies in Advanced Mathematics, 118, Cambridge University Press, 2009.

[2] Bai, Z.D., Silverstein, J.W.: Spectral Analysis of Large Dimensional Random Matrices. Springer Series in Statistics, Springer, 2010.

[3] Bru, M.: Diffusions of perturbed principal component analysis. J. Multivariate Anal., 29, 127-136 (1989).

[4] Diaconis, P.: What is... a random matrix? Notices of the AMS, 52(11), 1348-1349 (2005).

[5] Dyson, F.J.: A Brownian-motion model for the eigenvalues of a random matrix. J. Math. Phys., 3, 1191-1198 (1962).

[6] Erdős, L., Farrell, B.: Local eigenvalue density for general MANOVA matrices. Preprint arXiv:1207.0031.

[7] Erdős, L., Knowles, A., Yau, H.-T., Yin, J.: Spectral statistics of Erdős-Rényi graphs I: local semicircle law. To appear in Ann. Probab. Preprint arXiv:1103.1919.

[8] Erdős, L., Knowles, A., Yau, H.-T., Yin, J.: Spectral statistics of Erdős-Rényi graphs II: eigenvalue spacing and the extreme eigenvalues. To appear in Comm. Math. Phys. Preprint arXiv:0911.3687.

[9] Erdős, L., Knowles, A., Yau, H.-T., Yin, J.: Delocalization and diffusion profile for random band matrices. Preprint arXiv:1205.5669.

[10] Erdős, L., Knowles, A., Yau, H.-T., Yin, J.: The local semicircle law for a general class of random matrices. Preprint arXiv:1212.0164.

[11] Erdős, L., Schlein, B., Yau, H.-T., Yin, J.: The local relaxation flow approach to universality of the local statistics for random matrices. Ann. Inst. H. Poincaré Probab. Statist., 48, 1-46 (2012).

[12] Erdős, L., Yau, H.-T., Yin, J.: Bulk universality for generalized Wigner matrices. Probab. Theory Related Fields, 154(1-2), 341-407 (2012).

[13] Erdős, P., Rényi, A.: On random graphs. I. Publicationes Mathematicae, 6, 290-297 (1959).

[14] Erdős, P., Rényi, A.: The evolution of random graphs. Magyar Tud. Akad. Mat. Kutató Int. Közl., 5, 17-61 (1960).

[15] Geronimo, J., Hill, T.: Necessary and sufficient condition that the limit of Stieltjes transforms is a Stieltjes transform. J. Approx. Theory, 121, 54-60 (2003).

[16] Pillai, N., Yin, J.: Universality of covariance matrices. Preprint arXiv:1110.2501.

[17] Marchenko, V., Pastur, A.: Distribution of eigenvalues for some sets of random matrices. Math. USSR-Sb., 1, 457-483 (1967).

[18] Mehta, M.L.: Random Matrices. Academic Press, New York, 1991.

[19] Silverstein, J.W., Bai, Z.D.: On the empirical distribution of eigenvalues of a class of large dimensional random matrices. J. Multivariate Anal., 55(2), 175-192 (1995).

[20] Silverstein, J.W.: Strong convergence of the empirical distribution of eigenvalues of large dimensional random matrices. J. Multivariate Anal., 55(2), 331-339 (1995).

[21] Wishart, J.: The generalised product moment distribution in samples from a normal multivariate population. Biometrika, 20A(1-2), 32-52 (1928).

[22] Wigner, E.: Characteristic vectors of bordered matrices with infinite dimensions. Ann. of Math., 62, 548-564 (1955).
