
A First Course in Random Matrix Theory

for Physicists, Engineers and Data Scientists

Marc Potters
Jean-Philippe Bouchaud


Contents

Preface

1 Matrices, Eigenvalues and Singular Values
1.1 Matrices
1.2 Eigenvalues and Eigenvectors
1.3 Singular Values
1.4 Some Useful Theorems on Eigenvalues
1.5 Some Useful Matrix Identities

2 Moments of Random Matrices and the Wigner Ensemble
2.1 The Wigner Ensemble
2.2 Resolvent and Stieltjes Transform
2.3 Non-crossing Pair Partitions
2.4 Other Gaussian Ensembles

3 Wishart Ensemble and Marcenko-Pastur Distribution
3.1 Wishart Matrices
3.2 Marcenko-Pastur Using the Cavity Method

4 Joint Distribution of Eigenvalues
4.1 From Matrix Elements to Eigenvalues
4.2 Maximum Likelihood and Large N Limit

5 Dyson Brownian Motion
5.1 Stochastic Calculus
5.2 Stochastic Matrices


5.3 Addition of a Large Wigner Matrix Using DBM
5.4 Dyson Brownian Motion with a Potential

6 Addition of Large Random Matrices
6.1 Low-Rank Harish-Chandra-Itzykson-Zuber Integral
6.2 R-transform

7 Free Probabilities
7.1 Algebraic Probabilities
7.2 Non-Commuting Variables
7.3 Free Product
7.4 Large Random Matrices

8 Addition and Multiplication: Summary and Examples
8.1 Summary
8.2 General Sample Covariance Matrices

9 Extreme Eigenvalues and Outliers
9.1 Largest Eigenvalue
9.2 Outliers

10 Bayesian Estimation
10.1 Bayesian Estimation
10.2 Bayesian Estimation of the True Covariance Matrix
10.3 Linear Ridge Regression and Marcenko-Pastur Law

11 Eigenvector Overlaps and Rotationally Invariant Estimators
11.1 Eigenvector Overlaps
11.2 Rotationally Invariant Estimator
11.3 Conditional Average in Free Probability
11.4 Real Data

12 Applications to Finance
12.1 Portfolio Theory

13 Replica Trick
13.1 Stieltjes Transform
13.2 Resolvent Matrix
13.3 Rank-1 HCIZ
13.4 Annealed vs Quenched


Appendix A Mathematical Tools
A.1 Saddle point method
References
Index


Preface

Physicists have always understood the world through data and models inspired by this data. They build models from data and confront their models with the data generated by new experiments or observations. Real data is by nature noisy, but until recently, classical statistical tools have been successful in dealing with this randomness. The recent emergence of large datasets, together with the computing power to analyze them, has created a situation where not only the number of data points is large but also the number of studied variables. Classical statistical tools are inadequate to tackle this situation. Random matrix theory, and in particular the study of large sample covariance matrices, can help make sense of these big datasets. Random matrix theory is also linked to many modern problems in statistical physics such as the spectral theory of random graphs, interaction matrices of spin glasses, non-intersecting random walks and compressed sensing.

This book is purposely introductory and informal. As an analogy, high school seniors and college freshmen are typically taught both calculus and analysis. In analysis one learns how to make rigorous proofs, and how to define a limit and a derivative. At the same time, in calculus one learns how to compute complicated derivatives and multi-dimensional integrals and how to solve differential equations, relying only on an intuitive definition (with precise rules) of these concepts. This book proposes a “calculus” of random matrices.

Rather than make statements about the most general case, concepts are defined


with strong hypotheses (e.g. Gaussian entries) in order to simplify the computations and favor understanding. Precise notions of norm, topology, convergence and exact domain of application are left out, again to favor intuition over rigor. There are many good rigorous books on the subject, and the hope is that the interested reader will now be able to read them, guided by their newly built intuition.

The book focuses on real symmetric matrices. The extension to Hermitian complex or quaternionic matrices is explained once and later relegated to footnotes.

For a historical perspective, see Guhr et al. [1998] and Francesco et al. [1995].

This book was first conceived as a graduate class on random matrices and their applications for applied mathematics students. I taught this class in my first semester (Fall 2017) of sabbatical at UCLA. I want to thank my students for sitting through and engaging with me in this fairly non-standard class by mathematical standards. I want to thank in particular Fan Yang, who typed up the original hand-written notes. I also want to thank Andrea Bertozzi, Stanley Osher and Terence Tao, who welcomed me for a year at UCLA. During that year I had many fruitful discussions with members and visitors of the UCLA mathematics department and with participants of the IPAM long program in quantitative linear algebra, including Alice Guionnet, Jun Yin and Horng-Tzer Yau, and more particularly with Nicholas Cook, David Jekel, Nikhil Srivastava and Dimitri Shlyakhtenko.

Exercise

0.1 Warm up exercise

Before we learn anything about random matrices, let’s do the following numerical experiments:

• Let M be a random real symmetric orthogonal matrix, that is, an N × N matrix satisfying M = Mᵀ = M⁻¹. Show that all the eigenvalues of M are ±1.

• Let X be a Wigner matrix, i.e. an N × N real symmetric matrix whose diagonal and upper triangular entries are iid Gaussian random numbers with zero mean and variance σ²/N. You can use X = σ/√(2N) (H + Hᵀ), where H is a non-symmetric N × N matrix filled with iid standard Gaussians.

• The matrix E will be E = M + X; E can be thought of as a noisy version of M. The goal of this exercise is to understand numerically how the matrix E is corrupted by the Wigner noise.

• The matrix P+ is defined as P+ = (M + 1)/2, where 1 is the N × N identity matrix. Convince yourself that P+ is the projector onto the eigenspace of M with eigenvalue +1. Explain the effect of the matrix P+ on eigenvectors of M.

• An easy way to generate a random matrix M is to generate a Wigner matrix (independent of X), diagonalize it, replace every eigenvalue by its sign and reconstruct the matrix. The procedure does not depend on the σ used for this Wigner matrix.

• Using the computer language of your choice, for a large value of N (as large as possible while keeping computing times below one minute) and for three interesting values of σ of your choice, do the following numerical analysis.

(a) Plot a histogram of the eigenvalues of E, for a single sample first, and then for many samples (say 100).

(b) From your numerical analysis, in the large N limit, for what values of σ do you expect a non-zero density of eigenvalues near zero?

(c) For every normalized eigenvector vi of E, compute the norm of the vector P+vi. For a single sample, do a scatter plot of |P+vi|² vs λi (its eigenvalue). Turn your scatter plot into an approximate conditional expectation value (using a histogram) including data from many samples.

(d) Build an estimator Ξ(E) of M using only data from E. We want to minimise the error e = (1/N)||Ξ(E) − M||²F, where ||A||²F = Tr AAᵀ. Consider first Ξ1(E) = E and then Ξ0(E) = 0. What is the error e of these two estimators? Try to build an ad hoc estimator Ξ(E) that has a lower error e than these two.


(e) Show numerically that the eigenvalues of E are not iid. For each sample E, rank its eigenvalues λ1 < λ2 < . . . < λN. Consider the eigenvalue spacing sk = λk − λk−1 for eigenvalues in the bulk (0.2N < k < 0.3N and 0.7N < k < 0.8N). Make a histogram of sk including data from 100 samples. Then make 100 pseudo-iid samples: mix the eigenvalues of 100 different samples and randomly choose N out of the 100N possibilities, without choosing the same eigenvalue twice for a given pseudo-iid sample. For each pseudo-iid sample, compute sk in the bulk and make a histogram of the values using data from all 100 pseudo-iid samples. (Bonus) Try to fit an exponential distribution to these two histograms. The pseudo-iid data should be well fitted by the exponential, but not the original (non-iid) data.
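A minimal sketch of steps (a) and (c) in Python (assuming NumPy and Matplotlib; the helper names make_goe and make_random_sym_orthogonal are illustrative choices, not fixed by the text):

```python
import numpy as np
import matplotlib.pyplot as plt

def make_goe(N, sigma, rng):
    """Wigner/GOE matrix X = H + H^T with off-diagonal variance sigma^2/N."""
    H = rng.normal(0.0, sigma / np.sqrt(2 * N), size=(N, N))
    return H + H.T

def make_random_sym_orthogonal(N, rng):
    """Random M = M^T = M^{-1}: diagonalize a Wigner matrix, keep only the signs of its eigenvalues."""
    lam, V = np.linalg.eigh(make_goe(N, 1.0, rng))
    return V @ np.diag(np.sign(lam)) @ V.T

rng = np.random.default_rng(0)
N, sigma = 500, 0.5
M = make_random_sym_orthogonal(N, rng)
X = make_goe(N, sigma, rng)
E = M + X
P_plus = 0.5 * (M + np.eye(N))                   # projector onto the +1 eigenspace of M

lam_E, V_E = np.linalg.eigh(E)
overlap = np.sum((P_plus @ V_E) ** 2, axis=0)    # |P_+ v_i|^2 for each eigenvector of E

plt.hist(lam_E, bins=50, density=True)           # (a) eigenvalue histogram of E
plt.figure()
plt.scatter(lam_E, overlap, s=5)                 # (c) overlap vs eigenvalue
plt.show()
```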


1 Matrices, Eigenvalues and Singular Values

1.1 Matrices

Matrices appear in all corners of science, from mathematics to physics, computer science and biology. In fact, before Schrödinger’s equation, Quantum Mechanics was formulated by Heisenberg in terms of a “matrix mechanics”. Let us give three examples motivating the study of matrices, and the different forms that those can take.

Consider a generic dynamical system describing the time evolution of a certain N-dimensional vector x(t), for example the three-dimensional position of a point in space. Let us write the equation of motion as:

dx/dt = F(x),    (1.1)

where F(x) is an arbitrary vector field. Equilibrium points x∗ are such that F(x∗) = 0. Consider now small deviations from equilibrium, i.e. x = x∗ + εy where ε ≪ 1. To first order in ε, the dynamics becomes linear, and is given by:

dy/dt = Ay    (1.2)

where A is a matrix whose elements are given by Aij = ∂jFi(x∗), where i, j are indices that run from 1 to N. When F can itself be written as the gradient of some potential U, i.e. Fi = −∂iU(x), the matrix A becomes symmetric, i.e. Aij = Aji = −∂ijU. But this is not always the case; in general the linearized dynamics is described by a matrix A without any particular property – except that it is a square N × N array of real numbers.

Another standard setting is the so-called Master equation for the evolution of probabilities. Call i = 1, . . . , N the different possible states of a system and Pi(t) the probability to find the system in state i at time t. When memory effects can be neglected, the dynamics is called Markovian and the evolution of Pi(t) is described by the following discrete time equation:

Pi(t + 1) = ∑_{j=1}^N Aij Pj(t),    (1.3)

meaning that the system has a probability Aij to jump from state j to state i between t and t + 1. Note that all elements of A are positive; furthermore, since all jump possibilities must be exhausted, one must have, for each j: ∑_i Aij = 1. This ensures that ∑_i Pi(t) = 1 at all times, since

∑_{i=1}^N Pi(t + 1) = ∑_{i=1}^N ∑_{j=1}^N Aij Pj(t) = ∑_{j=1}^N ∑_{i=1}^N Aij Pj(t) = ∑_{j=1}^N Pj(t) = 1.    (1.4)

Matrices such that all elements are positive and such that the entries of each column sum to unity are called stochastic matrices.
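As a quick numerical illustration (a sketch assuming NumPy; the column-stochastic convention of Eq. (1.3) is used, and all names are illustrative), one can build a random stochastic matrix and check that it conserves total probability:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
A = rng.random((N, N))              # positive entries
A /= A.sum(axis=0, keepdims=True)   # normalize each column: sum_i A_ij = 1

P = rng.random(N)
P /= P.sum()                        # initial probability vector

for _ in range(10):                 # iterate the Master equation P(t+1) = A P(t)
    P = A @ P
    assert abs(P.sum() - 1.0) < 1e-12   # total probability is conserved
print(P)
```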

As a third important example, let us consider random N-dimensional real vectors X, with some given multivariate distribution P(X). The covariance matrix C of the X’s is defined as

Cij = E[XiXj] − E[Xi]E[Xj]    (1.5)

where E means that we are averaging over the distribution P(X). Clearly, the matrix C is real and symmetric. It is also positive semi-definite, in the sense that for any vector x,

xᵀCx ≥ 0.    (1.6)

If this were not the case, it would be possible to find a linear combination of the components of X with a negative variance, which is obviously impossible.

The three examples above are all such that the corresponding matrices are N × N square matrices. Examples where matrices are rectangular also abound. For example, one could consider two sets of random real vectors: X of dimension N and Y of dimension M. The cross-covariance matrix defined as

Cia = E[XiYa] − E[Xi]E[Ya];   i = 1, . . . , N;  a = 1, . . . , M,    (1.7)

is an N × M matrix that describes the correlations between the two sets of vectors.

1.2 Eigenvalues and Eigenvectors

One learns a great deal about matrices by studying their eigenvalues and eigenvectors. For a square matrix A, a pair of a scalar and a non-zero vector (λ, v) satisfying

Av = λv    (1.8)

is called an eigenvalue-eigenvector pair.

Trivially, if v is an eigenvector then αv is also an eigenvector when α is a non-zero real number. Sometimes multiple non-collinear eigenvectors share the same eigenvalue; we then say that this eigenvalue is degenerate and has multiplicity equal to the dimension of the vector space spanned by its eigenvectors.

If Eq. (1.8) is true, it implies that the equation (A − λ1)v = 0 has non-trivial solutions, which requires that det(A − λ1) = 0. The eigenvalues λ are thus the roots of the so-called characteristic polynomial of the matrix A, obtained by expanding det(A − λ1). Clearly, this polynomial is of order N and therefore has at most N different roots, which correspond to the (possibly complex) eigenvalues of A. Note that the characteristic polynomial of Aᵀ coincides with the characteristic polynomial of A, so the eigenvalues of A and Aᵀ are identical.

Now, let λ1, λ2, . . . , λN be the N eigenvalues of A with v1, v2, . . . , vN the corresponding eigenvectors. We define Λ as the N × N diagonal matrix with λi on the diagonal, and V as the N × N matrix whose jth column is vj, i.e. Vij = (vj)i is the ith component of vj. Then, by definition:

AV = VΛ,


since, once expanded, this reads ∑_k Aik Vkj = Vij λj, or Avj = λjvj. If the eigenvectors are linearly independent (which is not true for all matrices), the matrix inverse V⁻¹ exists and one can therefore write A as:

A = VΛV⁻¹,    (1.9)

which is called the eigenvalue decomposition of the matrix A.
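A quick numerical check of Eq. (1.9), as a sketch assuming NumPy (np.linalg.eig handles generic square matrices; for symmetric ones np.linalg.eigh is preferable):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 4
A = rng.normal(size=(N, N))      # generic square matrix, no special structure

lam, V = np.linalg.eig(A)        # columns of V are the eigenvectors
Lambda = np.diag(lam)

# A = V Lambda V^{-1}, up to floating-point error (eigenvalues may be complex)
assert np.allclose(A, V @ Lambda @ np.linalg.inv(V))
```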

Symmetric matrices (such that A = Aᵀ) have very nice properties regarding their eigenvalues and eigenvectors.

• They have exactly N eigenvalues when counted with their multiplicity.

• All their eigenvalues and eigenvectors are real.

• Their eigenvectors are orthogonal and can be chosen to be orthonormal (i.e. vᵢᵀvj = δij). Here we assume that for degenerate eigenvalues we pick an orthogonal set of corresponding eigenvectors.

If we choose orthonormal eigenvectors, the matrix V has the property VᵀV = 1 (⇒ Vᵀ = V⁻¹). It is thus an orthogonal matrix, V = O, and Eq. (1.9) reads:

A = OΛOᵀ

where Λ is a diagonal matrix containing the eigenvalues associated with the eigenvectors in the columns of O. A symmetric matrix can thus be diagonalized by an orthogonal matrix. Remark that an N × N orthogonal matrix is fully parametrised by N(N − 1)/2 “angles”, whereas Λ contains N diagonal elements. So the total number of parameters of the diagonal decomposition is N(N − 1)/2 + N, which is identical, as it should be, to the number of different elements of a symmetric N × N matrix.

Let us come back to our dynamical system example, Eq. (1.2). One basic question is to know whether the perturbation y will grow with time, or decay with time. The answer to this question is readily given by the eigenvalues of A. For simplicity, we assume F to be a gradient such that A is symmetric. Since the eigenvectors of A are orthonormal, one can decompose y in terms of the vi as:

y(t) = ∑_{i=1}^N ci(t) vi.    (1.10)

Taking the dot product of Eq. 1.2 with vi then shows that the dynamics of the coefficients ci(t) are decoupled and given by:

dci/dt = λi ci    (1.11)

where λi is the eigenvalue associated to vi. Therefore, any component of the initial perturbation y(t = 0) that is along an eigenvector with positive eigenvalue will grow exponentially with time, until the linearized approximation leading to Eq. 1.2 breaks down. Conversely, components along directions with negative eigenvalues decrease exponentially with time. An equilibrium x∗ is called stable provided all eigenvalues are negative, and marginally stable if some eigenvalues are zero while all others are negative.

The important message carried by the example above is that diagonalizing a matrix amounts to finding a way to decouple the different degrees of freedom, and to convert a matrix equation into a set of N scalar equations, as in Eq. (1.11). We will see later that the same idea holds for covariance matrices as well: their diagonalization allows one to find a set of uncorrelated vectors. This is usually called Principal Component Analysis (PCA).
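To make the decoupling concrete, here is a small sketch (assuming NumPy; the matrix A below is an arbitrary illustrative choice of a symmetric, negative-definite matrix) that solves dy/dt = Ay exactly in the eigenbasis and compares it with a crude direct integration:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 6
B = rng.normal(size=(N, N))
A = -(B @ B.T)                      # symmetric, negative definite: a stable equilibrium

lam, O = np.linalg.eigh(A)          # A = O diag(lam) O^T
y0 = rng.normal(size=N)

def y_exact(t):
    """c_i(t) = c_i(0) exp(lam_i t) in the eigenbasis, Eq. (1.11)."""
    c0 = O.T @ y0                   # coefficients c_i(0) = v_i . y(0)
    return O @ (c0 * np.exp(lam * t))

# crude Euler integration of the matrix equation dy/dt = A y
y, dt = y0.copy(), 1e-4
for _ in range(int(1.0 / dt)):
    y += dt * (A @ y)
print(np.max(np.abs(y - y_exact(1.0))))   # small: the two solutions agree, and the perturbation decays
```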

Exercise

1.1 Instability of eigenvalues of non-symmetric matrices:

Consider the N × N square band-diagonal matrix M0 defined by [M0]ij = 2δi,j−1:

M0 =
    ( 0 2 0 ··· 0 )
    ( 0 0 2 ··· 0 )
    ( 0 0 0  ⋱ 0 )
    ( 0 0 0 ··· 2 )
    ( 0 0 0 ··· 0 )

(a) Show that M0ᴺ = 0 and so all the eigenvalues of M0 must be zero. Use a numerical eigenvalue solver for non-symmetric matrices and confirm numerically that this is the case for N = 100.

(b) If O is an orthogonal matrix (OOᵀ = 1), OM0Oᵀ has the same eigenvalues as M0. Following exercise 0.1, generate a random orthogonal matrix O. Numerically find the eigenvalues of OM0Oᵀ. Do you get the same answer as in (a)?

(c) Consider M1, whose elements are all equal to those of M0 except for one element in the lower left corner: [M1]N,1 = (1/2)ᴺ⁻¹. Show that M1ᴺ = 1; more precisely, show that the characteristic polynomial of M1 is given by det(M1 − λ1) = λᴺ − 1, therefore M1 has N distinct eigenvalues equal to the N complex roots of unity λk = e^{2πik/N}.

(d) For N greater than about 60, OM0Oᵀ and OM1Oᵀ are indistinguishable to machine precision. Compare numerically the eigenvalues of these two rotated matrices.
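A possible numerical sketch for this exercise (assuming NumPy; the random orthogonal matrix O is obtained, as in exercise 0.1, from the eigenvectors of a Wigner matrix):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100

M0 = np.diag(2.0 * np.ones(N - 1), k=1)    # 2's on the superdiagonal
M1 = M0.copy()
M1[N - 1, 0] = 0.5 ** (N - 1)              # tiny element in the lower left corner

H = rng.normal(size=(N, N))
_, O = np.linalg.eigh(H + H.T)             # eigenvectors of a Wigner matrix: a random orthogonal O

for name, M in [("M0", M0), ("M1", M1), ("O M0 O^T", O @ M0 @ O.T), ("O M1 O^T", O @ M1 @ O.T)]:
    lam = np.linalg.eigvals(M)
    print(name, "largest |eigenvalue| =", np.max(np.abs(lam)))
# Exact answers: all eigenvalues of M0 are 0, those of M1 lie on the unit circle.
# Compare with what the rotated matrices give numerically.
```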

1.3 Singular Values

A non-symmetric square matrix cannot in general be decomposed as A = OΛOᵀ, where Λ is a diagonal matrix and O an orthogonal matrix. One can however find a very useful alternative decomposition,

A = VSUᵀ    (1.12)

where S is a diagonal matrix, whose elements are called the singular values of A, and U, V are two real, orthogonal matrices. Whenever A is symmetric, one has S = Λ and U = V. We will see that Eq. (1.12) also holds for rectangular N × M matrices; see section 1.3 for a concrete example.

Let us introduce two matrices B and B̃, defined as B ≡ AAᵀ and B̃ ≡ AᵀA. It is plain to see that these matrices are symmetric, since Bᵀ = (AAᵀ)ᵀ = AᵀᵀAᵀ = B (and similarly for B̃).

Claim: B and B̃ have the same non-zero eigenvalues.

In fact, let λ > 0 be an eigenvalue of B and v ≠ 0 the corresponding eigenvector. Then we have, by definition,

AAᵀv = λv.

Let u = Aᵀv; then we get from the above equation that

AᵀAAᵀv = λAᵀv ⇒ B̃u = λu.

Moreover,

‖u‖² = vᵀAAᵀv = vᵀBv = λ‖v‖² ≠ 0 ⇒ u ≠ 0.

Hence λ is also an eigenvalue of B̃. Note that for degenerate eigenvalues λ of B, an orthogonal set of corresponding eigenvectors vℓ gives rise to an orthogonal set Aᵀvℓ of eigenvectors of B̃. Hence the multiplicity of λ in B̃ is at least that in B. Similarly, we can show that any nonzero eigenvalue of B̃ is also an eigenvalue of B. This finishes the proof of the claim.

Note that B has at most N nonzero eigenvalues and B̃ has at most T nonzero eigenvalues. Thus, by the above claim, if T > N, B̃ has at least T − N zero eigenvalues, and if T < N, B has at least N − T zero eigenvalues. We denote the other min(N, T) eigenvalues of B and B̃ by {λk}, 1 ≤ k ≤ min(N, T). Then the SVD of A is expressed as in Eq. (1.12), where V is the N × N orthogonal matrix consisting of the N normalized eigenvectors of B, U is the T × T orthogonal matrix consisting of the T normalized eigenvectors of B̃, and S is an N × T rectangular diagonal matrix with Skk = √λk ≥ 0, 1 ≤ k ≤ min(N, T), and all other entries equal to zero.


For instance if N < T , we have

S =
    ( √λ1   0   ···   0    0 ··· 0 )
    (  0   √λ2  ···   0    0 ··· 0 )
    (  ⋮         ⋱    ⋮    ⋮      ⋮ )
    (  0    0   ···  √λN   0 ··· 0 )

Although (non-degenerate) normalized eigenvectors are unique up to a sign, the choice of the positive sign for the square-root √λk imposes a condition on the combined sign of the left and right singular vectors vk and uk. In other words, simultaneously changing both vk and uk to −vk and −uk leaves the matrix A invariant, but for non-zero singular values one cannot individually change the sign of either vk or uk.

The recipe to find the so-called Singular Value Decomposition, Eq. (1.12), is thus to diagonalize both AAᵀ (to obtain V and S²) and AᵀA (to obtain U and again S²). It is insightful to again count the number of parameters involved in this decomposition. A general N × M matrix has NM different elements, and the SVD decomposition amounts to writing

NM ≡ ½N(N − 1) + ½M(M − 1) + min(N, M) − ½|N − M|(|N − M| − 1),    (1.13)

where the subtracted term counts the rotations that only mix singular directions associated with zero singular values and therefore leave A unchanged.

The interpretation of Eq. (1.12) for N × N matrices is that one can always find an orthonormal basis of vectors uk such that the application of a matrix A amounts to a rotation (or an improper rotation) of the uk into another orthonormal set vk, followed by a dilation of each vk by a positive factor √λk. Normal matrices are such that U = V. In other words, A is normal whenever A commutes with its transpose: AAᵀ = AᵀA. Symmetric, skew-symmetric and orthogonal matrices are normal, but other cases are possible. For example, a 3 × 3 matrix such that each row and each column has exactly two elements equal to 1 and one element equal to 0 is normal.
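A short numerical illustration of Eq. (1.12), as a sketch assuming NumPy. Note that np.linalg.svd returns factors (U, s, Vh) with A = U diag(s) Vh in its own naming convention, which maps onto V, S, Uᵀ in the notation of the text:

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 4, 6
A = rng.normal(size=(N, T))               # rectangular N x T matrix

V_txt, s, Uh = np.linalg.svd(A)           # numpy's U is the text's V; numpy's Vh is the text's U^T
S = np.zeros((N, T))
S[:N, :N] = np.diag(s)                    # rectangular "diagonal" matrix of singular values

assert np.allclose(A, V_txt @ S @ Uh)     # A = V S U^T as in Eq. (1.12)

# singular values squared = nonzero eigenvalues of both A A^T and A^T A
lam_B = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]
lam_Bt = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1][:N]
assert np.allclose(s ** 2, lam_B) and np.allclose(s ** 2, lam_Bt)
```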

1.4 Some Useful Theorems on Eigenvalues

In this section, we state without proof three very useful theorems on eigenvalues.


1.4.1 Gershgorin’s Circle Theorem

Let A be a real matrix, with elements Aij. Define Ri as Ri = ∑_{j≠i} |Aij|, and Di as the disk in the complex plane centred on Aii and of radius Ri. Then every eigenvalue of A lies within at least one disk Di (see Fig. ***). In particular, an eigenvalue whose eigenvector has maximum amplitude on component i lies within the disk Di.

1.4.2 The Perron-Frobenius Theorem

Let A be a real matrix with all its elements positive, Aij > 0. Then the top eigenvalue λmax is unique and real (all other eigenvalues have a smaller real part). The corresponding top eigenvector has all its elements positive:

Av = λmax v;  vi > 0, ∀i.    (1.14)

The top eigenvalue satisfies the following inequalities:

min_i ∑_j Aij ≤ λmax ≤ max_i ∑_j Aij.    (1.15)

Application: Suppose A is a stochastic matrix, such that all its elements are positive and satisfy ∑_i Aij = 1, ∀j. Then clearly the vector with all components equal to 1 is an eigenvector of Aᵀ, with eigenvalue λ = 1. But since Perron-Frobenius can be applied to Aᵀ, the inequalities (1.15) ensure that λ = 1 is the top eigenvalue of Aᵀ, and thus also of A. All the elements of the corresponding eigenvector v of A are positive, and describe the stationary state of the associated Master equation, i.e.

P∗i = ∑_j Aij P∗j  ⟹  P∗i = vi / ∑_j vj.    (1.16)

1.4.3 The Eigenvalue Interlacing Theorem

Let A be an N × N symmetric matrix with eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λN. Consider the (N − 1) × (N − 1) submatrix A\i obtained by removing the ith row and ith column of A. Let its eigenvalues be µ1 ≥ µ2 ≥ · · · ≥ µN−1. Then the following interlacing inequalities hold:

λ1 ≥ µ1 ≥ λ2 ≥ · · · ≥ µN−1 ≥ λN.    (1.17)

1.5 Some Useful Matrix Identities

1.5.1 Sherman-Morrison formula

The Sherman-Morrison formula gives the inverse of a matrix A perturbed by a rank-1 perturbation:

(A + uvᵀ)⁻¹ = A⁻¹ − (A⁻¹uvᵀA⁻¹)/(1 + vᵀA⁻¹u),    (1.18)

valid for any invertible matrix A and vectors u and v such that the denominator does not vanish.

The associated matrix determinant lemma reads:

det(A + vuᵀ) = det A · (1 + uᵀA⁻¹v)    (1.19)

for invertible A.
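Both identities are easy to check numerically; here is a minimal sketch assuming NumPy (the matrix and vectors are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 5
A = rng.normal(size=(N, N)) + N * np.eye(N)    # well-conditioned invertible matrix
u, v = rng.normal(size=N), rng.normal(size=N)

Ainv = np.linalg.inv(A)

# Sherman-Morrison, Eq. (1.18): inverse of a rank-1 update of A
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = Ainv - np.outer(Ainv @ u, v @ Ainv) / (1.0 + v @ Ainv @ u)
assert np.allclose(lhs, rhs)

# matrix determinant lemma, Eq. (1.19)
assert np.isclose(np.linalg.det(A + np.outer(v, u)),
                  np.linalg.det(A) * (1.0 + u @ Ainv @ v))
```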

1.5.2 Schur complement formula

The Schur complement, also called inversion by partitioning, relates the blocks of the inverse of a matrix to the inverses of blocks of the original matrix. Let M be an invertible matrix that we divide into four blocks as

M =
    ( M11  M12 )
    ( M21  M22 )
and M⁻¹ = Q =
    ( Q11  Q12 )
    ( Q21  Q22 ).

Then

(Q11)⁻¹ = M11 − M12(M22)⁻¹M21.    (1.20)

More explicitly, let M be an N × N matrix such that

M =
    ( M11  M12 )
    ( M21  M22 ),

where M11 is n × n, M12 is n × (N − n), M21 is (N − n) × n, M22 is (N − n) × (N − n), and M22 is invertible. Then the upper left n × n block of M⁻¹ is

(M⁻¹)11 = (M11 − M12 M22⁻¹ M21)⁻¹.
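A quick numerical check of the block-inversion formula (a sketch assuming NumPy; the sizes N and n are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(8)
N, n = 7, 3
M = rng.normal(size=(N, N)) + N * np.eye(N)    # generic well-conditioned matrix

M11, M12 = M[:n, :n], M[:n, n:]
M21, M22 = M[n:, :n], M[n:, n:]

Q = np.linalg.inv(M)
schur = M11 - M12 @ np.linalg.inv(M22) @ M21   # Schur complement of M22

# the upper-left block of the inverse is the inverse of the Schur complement
assert np.allclose(Q[:n, :n], np.linalg.inv(schur))
```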


2 Moments of Random Matrices and the Wigner Ensemble

In many circumstances, the matrices that are encountered are large, and with no particular structure. It was Wigner’s insight to postulate that one can often replace a large complex (but deterministic) matrix by a typical element of a certain ensemble of random matrices. This bold proposal was made in the context of the study of large complex atomic nuclei, where the “matrix” is the Hamiltonian of the system, which is a Hermitian matrix (see section 2.4.1) describing all the interactions between the neutrons and protons contained in the nucleus. At the time, these interactions were not well known; but even if they had been, the task of diagonalizing the Hamiltonian to find the energy levels of the nucleus was so formidable that Wigner looked for an alternative. He suggested that we should abandon the idea of finding precisely all energy levels, and rather rephrase the question as a statistical one: what is the probability to find an energy level within a certain interval; what is the probability that the distance between two successive levels is equal to a certain value, etc. The idea of Wigner was that the answer to these questions could be, to some degree, universal, i.e. independent of the specific Hermitian matrix describing the system, provided it was complex enough. If this is the case, why not replace the Hamiltonian of the system by a purely random matrix with the correct symmetry properties?

This idea has been incredibly fruitful and has led to the development of a subfield of mathematical physics called “Random Matrix Theory”. In this book we will study the properties of some ensembles of random matrices. We will mostly focus on symmetric matrices with real entries, as those are the most commonly encountered in data analysis and statistical physics. For example, Wigner’s idea has been transposed to glasses and spin-glasses, where the interaction between pairs of atoms or pairs of spins is often replaced by a real symmetric, random matrix. In other cases, the randomness stems from noisy observations. For example, when one wants to measure the covariance matrix of the returns of a large number of assets using a sample of finite length (for example the 500 stocks of the S&P 500 using 4 years of daily data, i.e. 4 × 250 = 1000 data points per stock), there is inevitably some measurement noise that pollutes the determination of said covariance matrix. We will be confronted with this precise problem in chapter 3.

In the present chapter, we will investigate the simplest of all ensembles of random matrices, which was proposed by Wigner himself in the context recalled above. These are matrices where all elements are Gaussian random variables, with the only constraint that the matrix is symmetric (the Gaussian Orthogonal Ensemble, GOE), complex Hermitian (the Gaussian Unitary Ensemble, GUE) or symplectic (the Gaussian Symplectic Ensemble, GSE).

2.1 The Wigner Ensemble

2.1.1 Normalized Trace and Sample Averages

We first generalize the notion of expectation value and moments from classical probability to large random matrices. We could consider E[Aᵏ], but that object is very large (N × N dimensional) and, most of all, its dimension grows with N. It is not clear how to interpret it as N → ∞. It turns out that the correct analog of the expectation value is the normalized trace operator τ(·), defined as

τ(A) = (1/N) E[Tr A].    (2.1)

The normalization by 1/N is there to make the normalized trace operator finite as N → ∞. For example, for the identity matrix τ(1) = 1 independently of the dimension, so the definition makes sense as N → ∞. When using the notation τ(A) we will only consider the dominant term as N → ∞, implicitly taking the large N limit.

For a polynomial function F(A) of a matrix, or by extension for a function that can be written as a power series, the trace of the function can be computed from the eigenvalues:

(1/N) Tr F(A) = (1/N) ∑_{k=1}^N F(λk) = 〈F(λ)〉    (2.2)

where 〈·〉 denotes the average over the eigenvalues of a single matrix A (sample). For large random matrices, many scalar quantities such as τ(F(A)) do not fluctuate from sample to sample, or more precisely such fluctuations go to zero in the large N limit. Physicists speak of this phenomenon as self-averaging and mathematicians speak of concentration of measure:

τ(F(A)) = (1/N) E[Tr F(A)] ≈ (1/N) Tr F(A)   for a single A.    (2.3)

When the eigenvalues of a random matrix A converge to a well-defined density ρ(λ), we can write

τ(F(A)) = ∫ ρ(λ) F(λ) dλ.    (2.4)

Using F(A) = Aᵏ, we can define the k-th moment of a random matrix by mk ≡ τ(Aᵏ). The first moment m1 is simply the normalized trace of A, while m2 = (1/N) ∑_{ij} A²ij is the normalized sum of the squares of all the elements. The square-root of m2 satisfies the axioms of a norm and is called the Frobenius norm of A: ||A||F := √m2.

2.1.2 Moments of Wigner Matrices

We define a Wigner matrix X as a symmetric matrix (X = Xᵀ) with zero-mean Gaussian entries. In a symmetric matrix there are really two types of elements: diagonal and off-diagonal, which can have different variances. Diagonal elements have variance σ²d and off-diagonal elements have variance σ²od. Note that Xij = Xji, so these are not independent variables.

In fact, the elements of a Wigner matrix do not need to be Gaussian or even to be iid, as there are many weaker (more general) definitions of the Wigner matrix that yield the same final statistical results in the limit of large matrices N → ∞. For the purpose of this introductory book we will stick to the strong Gaussian hypothesis.

The first few moments of our Wigner matrix (X) are given by

τ(X) = (1/N) E[Tr X] = (1/N) Tr E[X] = 0    (2.5)

τ(X²) = (1/N) E[Tr XXᵀ] = (1/N) E[∑_{ij=1}^N X²ij] = (1/N) [N(N − 1)σ²od + Nσ²d]    (2.6)

The term containing σ²od dominates when the two variances are of the same order of magnitude. So for a Wigner matrix we can pick any variance we want on the diagonal (as long as it is small with respect to Nσ²od). We want to normalize our Wigner matrix so that its second moment is independent of the size of the matrix (N). Let us pick

σ²od = σ²/N.    (2.7)

For σ²d the natural choice seems to be σ²d = σ²/N. However, we will rather choose σ²d = 2σ²/N, which is easy to generate numerically and, more importantly, respects rotational invariance for finite N, as we show in the next subsection. The ensemble described here (with the choice σ²d = 2σ²od) is called the Gaussian Orthogonal Ensemble or GOE.¹

To generate a GOE matrix numerically, first generate a non-symmetric random square matrix H of size N where each element is drawn from N(0, σ²/(2N)). Then let the Wigner matrix X be X = H + Hᵀ. The matrix X will then be symmetric with diagonal variance twice the off-diagonal variance. The reason is that off-diagonal terms are sums of two independent Gaussian variables, so the variance is doubled. Diagonal elements, on the other hand, are equal to twice the original variables Hii, and so their variance is multiplied by four.

¹ Some authors define a GOE matrix to have σ² = 1, others σ² = N. For us a GOE matrix can have any variance and is thus synonymous with a Gaussian rotationally invariant Wigner matrix.


With any such choice of σ²d we have

τ(X²) = σ² + O(1/N),    (2.8)

and hence we will call the parameter σ² the variance of the Wigner matrix.

The third moment τ(X³) = 0 by parity of the Gaussian distribution. Later we will show that

τ(X⁴) = 2σ⁴.    (2.9)

For Gaussian variables one has E[x⁴] = 3σ⁴; the fact that τ(X⁴) = 2σ⁴ thus implies that the eigenvalue density of a Wigner matrix is not Gaussian. What is this eigenvalue distribution? We will show many times in these lectures that it is given by a semi-circle:

ρ(λ) = √(4σ² − λ²) / (2πσ²)   for −2σ < λ < 2σ.    (2.10)

2.1.3 Rotational Invariance

Recall: to rotate a vector, write w = Ov, where O is an orthogonal matrix, Oᵀ = O⁻¹ (i.e. OOᵀ = 1). Note that in general O is not symmetric.

To rotate a matrix one writes X̃ = OXOᵀ. The eigenvalues of X̃ are the same as those of X. The eigenvectors are Ov, where v are the eigenvectors of X.

A rotationally invariant random matrix ensemble is such that the matrix OXOᵀ is as probable as the matrix X itself (OXOᵀ is equal in law to X).

Let us show that the construction X = H + Hᵀ with a Gaussian iid matrix H leads to a rotationally invariant ensemble. First, an important property of Gaussian variables is that a Gaussian iid vector v (a white multivariate Gaussian vector) is rotationally invariant. The reason is that w = Ov is again a Gaussian vector (since sums of Gaussians are still Gaussian), with covariance given by

E[wiwj] = ∑_{kℓ} Oik Ojℓ E[vkvℓ] = ∑_{kℓ} Oik Ojℓ δkℓ = (OOᵀ)ij = δij.

Now, write

X = H + Hᵀ    (2.11)

where H is a square matrix filled with iid Gaussian random numbers. Each column of H is rotationally invariant, so OH is equal in law to H; and the matrix OH is row-wise rotationally invariant, so OHOᵀ is equal in law to OH. Thus H is rotationally invariant as a matrix. Now

OXOᵀ = O(H + Hᵀ)Oᵀ  =(in law)  H + Hᵀ = X,    (2.12)

which shows that the Wigner ensemble with σ²d = 2σ²od is rotationally invariant for any matrix size N. More general definitions of the Wigner ensemble (including non-Gaussian ensembles) are asymptotically rotationally invariant.

P({Xij}) = (2πσ²d)^(−N/2) (2πσ²od)^(−N(N−1)/4) exp( −∑_{i=1}^N X²ii/(2σ²d) − ∑_{i<j} X²ij/(2σ²od) )    (2.13)

where only the diagonal and upper triangular elements are independent variables. With the choice σ²od = σ²/N and σ²d = 2σ²/N this becomes

P({Xij}) ∝ exp( −(N/(4σ²)) Tr X² ).    (2.14)

Under the change of variables X → X̃ = OXOᵀ the argument of the exponential is invariant. This transformation is linear, so the Jacobian is constant. A constant Jacobian can only change the normalisation, but since the un-normalised probability distribution has not changed, the normalisation constant must be the same; therefore the Jacobian of this change of variables must be one and X̃ is equal in law to X. By the same argument, any matrix whose joint probability density of elements can be written as P({Mij}) ∝ exp(−N Tr V(M)), where V(·) is an arbitrary function, will be rotationally invariant.


2.2 Resolvent and Stieltjes Transform

2.2.1 Definition and Basic Properties

In this section we introduce the Stieltjes transform of a matrix. It will give us information about all the moments of the random matrix and also about its density of eigenvalues in the large N limit. First we need to define the matrix resolvent.

Given an N × N real symmetric matrix A, its resolvent is given by

GA(z) = (z1 − A)⁻¹,    (2.15)

where z is a complex variable defined away from all the (real) eigenvalues of A and 1 denotes the identity matrix. Then the Stieltjes transform of A is given by²

gᴬN(z) = (1/N) Tr GA(z) = (1/N) ∑_{k=1}^N 1/(z − λk),    (2.16)

where λk are the eigenvalues of A. The subscript N indicates that this is the finite-N Stieltjes transform of a single realization of A. When it is clear from context which matrix we consider, we will drop the A superscript and write gN(z).

The Stieltjes transform gives us information about the density of eigenvalues of A. For a given random matrix A, we can define the empirical spectral distribution (ESD), also called the sample eigenvalue density:

ρN(λ) = (1/N) ∑_{k=1}^N δ(λ − λk).

Then the Stieltjes transform can be written as

gN(z) = ∫_{−∞}^{+∞} ρN(λ)/(z − λ) dλ.

² In the mathematical literature, the Stieltjes transform is commonly defined as sA(z) = −(1/N) Tr GA(z), i.e. with an extra minus sign. Some authors prefer the name Cauchy transform.


Note that gN(z) is well defined for any z ∉ {λk : 1 ≤ k ≤ N}. In particular, it is well behaved at ∞:

gN(z) = ∑_{k=0}^∞ (1/z^{k+1}) (1/N) Tr(Aᵏ),   with (1/N) Tr(A⁰) = 1.    (2.17)

We will consider random matrices A such that for large N, the normalized traces of powers of A converge to their expectation values (deterministic numbers):

lim_{N→∞} (1/N) Tr(Aᵏ) = τ(Aᵏ).

We then expect that for large enough z, the function gN(z) converges to a deterministic limit g(z) defined as

g(z) = lim_{N→∞} E[gN(z)],   whose Taylor series is   g(z) = ∑_{k=0}^∞ τ(Aᵏ)/z^{k+1},    (2.18)

for z away from the real axis.

Thus g(z) is the moment generating function of A. In other words, the knowledge of g(z) near infinity is equivalent to the knowledge of all the moments of A. To the level of rigor of this book, the knowledge of all the moments of A is equivalent to the knowledge of the density of its eigenvalues. For any function F(x) defined over the support of the eigenvalues ([λ−, λ+]) of A we can compute its expectation

τ(F(A)) = ∫_{λ−}^{λ+} ρ(λ) F(λ) dλ,   where ρ(λ) = E[ρN(λ)].

Alternatively, we can approximate the function F(x) arbitrarily well by a polynomial P(x) = a0 + a1x + · · · + aK x^K and have

τ(F(A)) ≈ τ(P(A)) = ∑_{k=0}^K ak τ(Aᵏ).

To recap, we only need to know g(z) in the neighborhood of |z| → ∞ to know all the moments of A, and these moments tell us everything about ρ(λ). In computing the Stieltjes transform in concrete cases, we will make use of that fact and only estimate it for very large values of z.


Recall that in classical probability theory, the moment generating function of a random variable X is given by the Fourier transform of its density, or equivalently by

ϕ(t) = E[e^{itX}],   mk = (−i d/dt)ᵏ ϕ(t)|_{t=0};   ϕ(t) = ∑_{k=0}^∞ mk (it)ᵏ / k!.

The Stieltjes transform also gives the negative moments when they exist. If the eigenvalues of A satisfy min λk > c for some constant c > 0, then the inverse moments of A exist and are given by the expansion of g(z) around z = 0:

g(z) = −∑_{k=0}^∞ zᵏ τ(A^{−k−1}).    (2.19)

In particular, we have

g(0) = −τ(A−1).

Exercise

2.1 Stieltjes transform of shifted and scaled matrices: Let A be a random matrix drawn from a well-behaved ensemble with Stieltjes transform g(z). What is the Stieltjes transform of the random matrices αA and A + β1, where α and β are non-zero real numbers and 1 is the identity matrix?

2.2.2 Stieltjes Transform of the Wigner Ensemble

We are now ready to compute the Stieltjes transform of the Wigner ensemble. The first technique we will use is sometimes called the cavity method or the self-consistent equation. We will relate the Stieltjes transform of a Wigner matrix of size N to that of one of size N − 1. In the large N limit, the two converge to the same limiting Stieltjes transform, which gives us a self-consistent equation that can be solved easily.

We would like to calculate g(z) when X is a Wigner matrix, with Xij ∼ N(0, σ²/N) for i ≠ j and Xii ∼ N(0, 2σ²/N).

We can use the Schur complement formula (1.20) to compute the (1, 1) element of the inverse of M = z1 − X. We have

1/(GX)11 = M11 − ∑_{k,l=2}^N M1k (M22)⁻¹kl Ml1,    (2.20)

where the matrix M22 is the (N − 1) × (N − 1) submatrix of M with the first row and column removed. For large N, we argue that the r.h.s. is dominated by its expectation value, with small (O(1/√N)) fluctuations. We will only compute its expectation value, but getting a handle on its fluctuations is straightforward. First, we note that E[M11] = z. We then note that the entries of M22 are independent of the M1i = −X1i. Thus we can first take the partial expectation over the X1i, and get

E_{X1i}[ M1i (M22)⁻¹ij Mj1 ] = (σ²/N) (M22)⁻¹ii δij,

so we have

E_{X1i}[ ∑_{k,l=2}^N M1k (M22)⁻¹kl Ml1 ] = (σ²/N) Tr((M22)⁻¹).

Another observation is that (1/(N − 1)) Tr((M22)⁻¹) is the Stieltjes transform of a Wigner matrix of size N − 1. In the large N limit, the Stieltjes transform should be independent of the matrix size, and the difference between 1/N and 1/(N − 1) is negligible. So we have

E[ (1/N) Tr((M22)⁻¹) ] → g(z).

We therefore have that 1/(GX)11 equals a deterministic number with negligible fluctuations; hence in the large N limit we have

E[ 1/(GX)11 ] = 1/E[(GX)11].

From the rotational invariance of X, and therefore of GX, all diagonal entries of GX must have the same expectation value:

E[(GX)11] = (1/N) E[Tr GX] = E[gN] → g.

Putting all the pieces together, we find that in the large N limit, Eq. (2.20) becomes

1/g(z) = z − σ²g(z).    (2.21)

Solving (2.21) we obtain that

σ²g² − zg + 1 = 0  ⇒  g = (z ± √(z² − 4σ²)) / (2σ²).

We know that g(z) should be analytic for large complex z, but the square root above can run into branch cuts. It is convenient to pull out a factor of z and express the square root as a function of 1/z, which becomes small for large z:

g(z) = (z ± z√(1 − 4σ²/z²)) / (2σ²).

We can now choose the correct root: the + sign gives an incorrect g(z) ∼ z/σ² for large z, while the − sign gives g(z) ∼ 1/z for any large complex z, as expected, so we have:

g(z) = (z − z√(1 − 4σ²/z²)) / (2σ²).    (2.22)

Note that for numerical applications, it is very important to pick the correct branch of the square root. The function g(z) is analytic for |z| > 2σ; the branch cuts of the square root must therefore be confined to the interval [−2σ, 2σ].
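A small sketch of how one might evaluate Eq. (2.22) numerically (assuming NumPy). Writing the square root as a function of 1/z, as above, keeps the branch with g(z) ∼ 1/z at infinity for all z away from [−2σ, 2σ]:

```python
import numpy as np

def g_wigner(z, sigma=1.0):
    """Limiting Stieltjes transform of the Wigner ensemble, Eq. (2.22).

    z: complex number or array away from the real interval [-2 sigma, 2 sigma].
    """
    z = np.asarray(z, dtype=complex)
    return (z - z * np.sqrt(1.0 - 4.0 * sigma ** 2 / z ** 2)) / (2.0 * sigma ** 2)

print(g_wigner(100.0), 1.0 / 100.0)      # far from the spectrum, g(z) ~ 1/z

# compare with the sample Stieltjes transform of a single large Wigner matrix
rng = np.random.default_rng(10)
N, sigma = 1000, 1.0
H = rng.normal(0.0, sigma / np.sqrt(2 * N), size=(N, N))
lam = np.linalg.eigvalsh(H + H.T)
z = 0.3 - 1j / np.sqrt(N)
print(np.mean(1.0 / (z - lam)), g_wigner(z, sigma))
```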

It might seem strange that g(z) given by Eq. (2.22) has no poles but only branch cuts. For finite N, the sample Stieltjes transform

gN(z) := (1/N) ∑_{k=1}^N 1/(z − λk),    (2.23)

has poles at the eigenvalues of X. As N → ∞, the poles can fuse and

(1/N) ∑_{k=1}^N δ(x − λk) ∼ ρ(x).

Figure 2.1 The branch-cuts of the Wigner Stieltjes transform

The density ρ(x) can have extended support and/or isolated Dirac masses. Then as N → ∞, we have

g(z) = ∫_{supp ρ} ρ(x) dx / (z − x),

which is the Stieltjes transform of the limiting measure ρ(x).

2.2.3 Convergence of Stieltjes near the Real Axis

Recall the limiting and sample Stieltjes transforms for the Wigner matrix in (2.22) and (2.23). It is natural to ask the following questions: how does gN(z) converge to g(z) = ∫ ρ(x) dx/(z − x), and how do we recover ρ(x) from g(z)?

We have argued before that gN(z) converges to g(z) for very large complex z such that the Taylor series around infinity is convergent. The function g(z) is not defined on the real axis for z = x on the support of ρ(x); nevertheless, immediately below (and above) the real axis the random function gN(z) converges to g(z). Let us study the random function gN(z) just below the support of ρ(x).

Figure 2.2 The Cauchy kernel for η = 0.5

We let z = x − iη, with x ∈ supp ρ and η a small positive number. Then

gN(x − iη) := (1/N) ∑_{k=1}^N 1/(x − iη − λk) = (1/N) ∑_{k=1}^N (x − λk + iη) / ((x − λk)² + η²).

We focus on the imaginary part of gN(x − iη). Note that it is a convolution of the ESD ρN(λ) with π times the Cauchy kernel:

πP(x) = η / (x² + η²).

The Cauchy kernel P(x) is highly peaked around zero, with a window width of order η. There are N eigenvalues lying inside the interval [λ−, λ+], assumed to be of order 1, so the typical eigenvalue spacing is of order N⁻¹.

(1) Suppose η ≪ N⁻¹. Then there are typically 0 or 1 eigenvalues within a window of size η around x. Then Im gN will be affected by the fluctuations of single eigenvalues of X, and hence it cannot converge to any deterministic function (see Fig. 2.3).

(2) Suppose N⁻¹ ≪ η ≪ 1 (e.g. η = N^{−1/2}). Then on a small scale η ≪ ∆x ≪ 1, the density ρ is locally constant and there are a great number of eigenvalues inside:

n ∼ Nρ(x)∆x ≫ Nη ≫ 1.

Figure 2.3 Imaginary part of g(x − iη) for the Wigner ensemble. The analytical result for η → 0⁺ is compared with numerical simulations (N = 400). On the left, for η = 1/√N and η = 1; note that for η = 1 the density is quite deformed. On the right (zoom near x = 0), for η = 1/√N and η = 1/N; note that for η = 1/N the density fluctuates wildly, as only a small number of (random) eigenvalues contribute to the Cauchy kernel.

The law of large numbers allows us to replace the sum with an integral; we obtain that

(1/N) ∑_{k: λk∈[x−∆x, x+∆x]} iη/((x − λk)² + η²) → i ∫_{x−∆x}^{x+∆x} ρ(x) η dy/((x − y)² + η²) → iπρ(x),

where the last limit is obtained by writing u = (y − x)/η and noting that as η → 0 we have ∫_{−∞}^{∞} du/(u² + 1) = π.

Exercise

2.2 Finite N approximation and small imaginary part

Im gN(x − iη)/π is a good approximation to ρ(x) for small positive η, where gN(z) is the sample Stieltjes transform gN(z) = (1/N) ∑_k 1/(z − λk). Numerically generate a Wigner matrix of size N with σ² = 1.

(a) For three values of η (1/N, 1/√N, 1), plot Im gN(x − iη)/π and the theoretical ρ(x) on the same plot for x between −3 and 3.

(b) Compute the error as a function of η, where the error is (ρ(x) − Im gN(x − iη)/π)² summed over all values of x between −3 and 3 spaced by intervals of 0.01. Plot this error for η between 1/N and 1. You should see that 1/√N is very close to the minimum of this function.

2.2.4 Stieltjes Inversion Formula

From the above discussions, we observe the following:

(1) The Stieltjes inversion formula, also called the Sokhotski–Plemelj formula:

lim_{η→0⁺} Im g(x − iη) = πρ(x).    (2.24)

(2) When applied to the finite-size Stieltjes transform gN(z), we should take N⁻¹ ≪ η ≪ 1 for gN(x − iη) to converge to g(x − iη) and for (2.24) to hold. Numerically, η = N^{−1/2} works quite well.

We discuss briefly why η = N^{−1/2} works best. First, we want η to be as small as possible, so that the local density ρ(x) is not blurred. On the other hand, we want Nη to be as large as possible, so that we include the statistics of enough eigenvalues to have a well-defined ρ(x). In fact, the error between gN and g is of order 1/(Nη). Thus we want to minimize the total error

ρ′(x)η + 1/(Nη),

where ρ′(x)η is the systematic error and 1/(Nη) the statistical error. Then it is easy to see that the total error is minimized when η is of order N^{−1/2}.


Figure 2.4 Density of eigenvalues of a Wigner Matrix with σ = 1: The semi-circle law.

2.2.5 Density of Eigenvalues of a Wigner Matrix

We go back to study the Stieltjes transform (2.22) of the Wigner matrix. Note that for z = x − iη with η → 0, g(z) can only have an imaginary part if √(x² − 4σ²) is imaginary. Then using (2.24), we get the Wigner semi-circle law

ρ(x) = (1/π) lim_{η→0⁺} Im g(x − iη) = √(4σ² − x²) / (2πσ²),   −2σ ≤ x ≤ 2σ.

Note the following features of the semi-circle law: (1) asymptotically there are no eigenvalues for x > 2σ or x < −2σ; (2) the eigenvalue density has square-root singularities near the edges: ρ(x) ∼ √(x + 2σ) near the left edge and ρ(x) ∼ √(2σ − x) near the right edge.

Exercise

2.3 From the moments to the density

A large random matrix A has moments τ(Aᵏ) = 1/k.

(a) Using Eq. (2.18) write the Taylor series of g(z) around infinity.


(b) Sum the series to get a simple expression for g(z). Hint: look up the Taylor series of log(1 + x).

(c) Where are the singularities of g(z) on the real axis?

(d) Use Eq. (2.24) to find the density of eigenvalues ρ(λ).

(e) Check your result by recomputing the moments and the Stieltjes transform from ρ(λ).

(f) Redo all the above steps for a matrix whose odd moments are zero and whose even moments are τ(A²ᵏ) = 1. Note that in this case the density ρ(λ) has Dirac masses.

2.3 Non-crossing Pair Partitions

2.3.1 Fourth Moment of a Wigner Matrix

We now discuss the calculation of the moments of the Wigner matrix X. We stated earlier that for a Wigner matrix we have τ(X⁴) = 2σ⁴. We shall compute this fourth moment directly and then develop a technique to compute all other moments.

We have

τ(X⁴) = (1/N) E[Tr(X⁴)] = (1/N) ∑_{i,j,k,l} 〈XijXjkXklXli〉.    (2.25)

Recall that (Xij : 1 ≤ i ≤ j ≤ N) are independent Gaussian random variables of mean zero. So for the expectations in the above sum to be nonzero, each X entry needs to be equal to another X entry.³

(1) If Xij = Xjk = Xkl = Xli, then

〈XijXjkXklXli〉 = 3σ⁴/N²,

and there are N² of them. Thus the total contribution from these terms is

(1/N) N² (3σ⁴/N²) = 3σ⁴/N → 0.

³ When we say that Xij = Xkl, we mean that they are the same random variable; given that X is a symmetric matrix, this means either (i = k and j = l) or (i = l and j = k).

Figure 2.5 Graphical representation of the three terms contributing to τ(X⁴). The last one is a crossing partition and has a zero contribution.

(2) Suppose there are two different pairs. Then there are three possibilities (see Fig. 2.5):

(i) Xij = Xjk, Xkl = Xli, and Xij is different from Xli (i.e. j ≠ l). Then

(2.25) = (1/N) ∑_{i,j,l} 〈X²ij X²il〉 = (1/N)(N³ − N²)(σ²/N)² → σ⁴

as N → ∞.

(ii) Xij = Xli, Xjk = Xkl, and Xij is different from Xjk (i.e. i ≠ k). Then

(2.25) = (1/N) ∑_{i,k,j} 〈X²ij X²jk〉 = (1/N)(N³ − N²)(σ²/N)² → σ⁴

as N → ∞.

(iii) Xij = Xkl, Xjk = Xli, and Xij is different from Xjk (i.e. i ≠ k). Then we must have i = l and j = k from Xij = Xkl, and i = j and k = l from Xjk = Xli. This gives a contradiction: there are no such terms.


In sum, we obtain that

τ(X4)→ σ4 + σ4 = 2σ4

as N → ∞, where the two terms come from the two non-crossing partitions.

In the next section, we generalize this calculation to arbitrary moments of X.Odd moments are zero by symmetry. Even moments τ(X2k) can be written as asum over non-crossing diagrams (non-crossing pair partitions of 2k elements),where each such diagram contributes σ2k. So

τ(X2k) = Ckσ2k,

where Ck are Catalan numbers, the number of such non-crossing diagrams. Theysatisfy

Ck =

k∑j=1

C j−1Ck− j =

k−1∑j=0

C jCk− j−1,

with C0 = C1 = 1 and can be written explicitly as

Ck =1

k + 1

(2kk

).

2.3.2 Catalan Numbers: Counting Non-Crossing Pair Partitions

We would like to calculate all moments of X. Note that all the odd momentsτ(X2k+1) vanish (since the odd moments of a Gaussian random variable van-ish). We only need to compute the even moments

τ(X2k) =1NE

[Tr(X2k)

]=

1N

∑i1,...,i2k

E(Xi1i2 Xi2i3 . . .Xi2k i1

). (2.26)

Since we assume that the elements of X are Gaussian, we can expand theabove expectation value using Wick’s theorem using the covariance of theXi j’s. The matrix X is symmetric, so we have to keep track of the fact thatXi j is the same variable as X ji. For this reason, using Wick’s theorem provesquite tedious and we will not follow this route here. From the Taylor seriesat infinity of the Stieltjes transform, we expect every even moment of X to

Page 41: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

2.3 Non-crossing Pair Partitions 35

Figure 2.6 Graphical representation of the 15 terms contributing to τ(X6).Only the 5 on the left are non-crossing and have a non-zero contribution asN → ∞.

converge to an O(1) number as N → ∞. We will therefore drop any O(1/N)or smaller as we proceed. In particular the difference of variance betweendiagonal and off-diagonal elements of X doesn’t matter to first order in 1/N.

In Eq. (2.26), each X entry must be equal to at least one another X entry,otherwise the expectation is zero. On the other hand, it is easy to show thatfor the partitions that contain at least one group of > 2 (actually ≥ 4) X entriesthat are equal to each other, their total contribution will be of order O(1/N)or smaller (e.g. in case (i) above). Thus we only need to consider the caseswhere each X entry is paired to exactly one another X entry, which we alsoreferred to a pair partition.

We need to count the number of types of pairings of 2k elements that con-tributes to τ(X2k) as N → ∞. We associate to each pairing a diagram. Forexample, for k = 3, we have, 5!! = 5 · 3 · 1 = 15 possible pairings (see Fig.2.6).

To compute the contribution of each of these pair partitions, we will computethe contribution of non-crossing pair partitions and argue that pair partitionswith crossings do not contribute in the large N limit. First we need to definewhat is a non-crossing pair partition of 2k elements. A pair partition can bedraw as a diagram where the 2k elements are on points a line and each point isjoined with its pair partner by an arc drawn above that line. If at least two arcscross each other the partition is called crossing and non-crossing otherwise.In figure 2.6 the five partition on the left are non-crossing while the ten othersare crossing.

In a non-crossing partition of size 2k, there is always at least one pairingbetween between consecutive points (the smallest arc). If we remove the firstsuch pairing we get a non-crossing pair partition of 2k − 2 elements. We canproceed this way until we get to a paring of only two elements: the unique

Page 42: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

36 Moments of Random Matrices and the Wigner Ensemble

Figure 2.7 Graphical representation of the only term contributing to τ(X2).Note that the indices of two terms are already equal prior to pairing.

Figure 2.8 Zoom into the smallest arc of a non-crossing partition. The twomiddle matrices are paired while the other two could be paired together or toother matrices to the left and right respectively. After the pairing of Xil+1,il+2

and Xil+2,il+3 , we have il+1 = il+3 and the index il+2 is free.

(non-crossing) pair partition contributing to (Fig. 2.7)

τ(X2) = σ2.

We can use this argument to prove by induction that each non-crossing par-tition contributes a factor σ2k. In Figure 2.8, consecutive elements Xil+1,il+2

and Xil+2,il+3 are paired, we want to evaluate that pair and remove it from thediagram. The variance contributes a factor σ2/N. We can make two choicesfor index matching. First consider il+1 = il+3 and il+2 = il+2. In that case, theindex il+2 is free and its summation contributes a factor of N. The identityil+1 = il+3 means that the previous matrix Xil,il+1 is now linked by matrix mul-tiplication to the following matrix Xil+1,il+4 . In other words we are left with σ2

times a non-crossing partition of size 2k − 2 which contributes σ2k−2 by ourinduction hypothesis. The other choice of index matching il+1 = il+2 = il+3can be viewed as fixing a particular value for il+2 and is included in the sumover il+2 in the previous index matching. So by induction we do have thateach non-crossing pair partition contributes σ2k.

Before we discuss the contribution of crossing pair partitions, let’s analyse interms of powers of N, the computation we just did for the non-crossing case.The computation of each term in τ(X2k) involves 2k matrices that have in total4k indices. The trace and the matrix multiplication forces 2k equalities amongthese indices. The normalisation of the trace and the k variance terms give a

Page 43: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

2.3 Non-crossing Pair Partitions 37

Figure 2.9 In a non-crossing pairing, the paring of site 1 with site 2 j splits thegraph into two disjoint non-crossing parings.

factor of σ2k/Nk+1. To get a result of order 1 we need to be left with k + 1free indices whose summation give a factor of Nk+1. Each of the k pairingimposes 2 matching between pairs of indices. For the first k − 1 choice ofpairing we managed to match one pair of indices that were already equal. Atthe last step we matched to pairs of indices that were already equal. Hencein total we added only k + 1 equality constraints that left use with k + 1 freeindices as needed.

We can now argue that crossing pair partition do not contribute in the largeN limit. For crossing partition it is not possible to choose a matching at everystep that matches a pair of indices that are already equal. If we use the pre-vious algorithm of removing at each step the leftmost smallest arc, at somepoint, the smallest arc will have a crossing and we will be pairing to matri-ces that share no indices, adding 2 equality constraints at this step. The resultwill therefore be down by at least a factor of 1/N with respect to the non-crossing case. This argument is not really a proof but an intuition why thismight be true. A more rigorous proof can be found in Tao [2012], Andersonet al. [2010] and Mingo and Speicher [2017]. In the last reference, the authorscompute the moments of X exactly for every N (when σ2

d = σ2od).

We can now complete our moments computation. Let

Ck := # of non-crossing pairings of 2k elements.

The number Ck are called Catalan numbers. Since every non-crossing pairpartition contributes a factor σ2k. Summing over all non-crossing pairings,we immediately get that

τ(X2k) = Ckσ2k. (2.27)

2.3.3 Recursion Relation for Catalan Numbers

In order to compute the Catalan numbers Ck, we will write a recursion relationfor them. Take a non-crossing pairing, site 1 is linked to some even site 2 j (it

Page 44: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

38 Moments of Random Matrices and the Wigner Ensemble

is easy to see 1 cannot link to odd site in order for the partition to be non-crossing). Then the diagram is split into two smaller non-crossing pairingsof sizes 2( j − 1) and 2(k − j), respectively (see Fig. 2.9). Thus we get theinductive relation

Ck =

k∑j=1

C j−1Ck− j =

k−1∑j=0

C jCk− j−1, (2.28)

where we let C0 = C1 = 1. In fact, one can prove by induction that Ck is givenby the Catalan number:

Ck =1

k + 1

(2kk

). (2.29)

Using the Taylor series for the Stieltjes transform (2.17), we can use the Cata-lan number recursion relation to find an equation for the Stieltjes transformof the Wigner ensemble

g(z) =

∞∑k=0

Ck

z2k+1σ2k. (2.30)

Thus using (2.28), we obtain that

g(z) −1z

=

∞∑k=1

σ2k

z2k+1

k−1∑j=0

C jCk− j−1

=σ2

z

∞∑j=0

C j

z2 j+1σ2 j

∞∑k= j+1

Ck− j−1

z2(k− j−1)+1σ2(k− j−1)

=σ2

z

∞∑j=0

C j

z2 j+1σ2 j

∞∑

l=0

Cl

z2l+1σ2l

=σ2

zg

2(z),

which gives the same self-consistent equation for g as in (2.21) and hence thesame solution:

g(z) =z − z

√1 − 4σ2/z2

2σ2 .

The same result could have been derived by substituting the explicit solu-tion for the Catalan number Eq. (2.29) into (2.30) but this route requires theknowledge of the Taylor series

√1 − x = 1 −

∞∑k=0

2k + 1

(2kk

) ( x4

)k+1.

Page 45: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

2.4 Other Gaussian Ensembles 39

2.4 Other Gaussian Ensembles

In the context of Random Matrices, it was pointed out by Dyson that there existprecisely three division rings that contains the real numbers, namely, the realthemselves, the complex numbers and the quaternions. He showed that this factimplies that there are only three acceptable ensemble of Gaussian random ma-trices: GOE, GUE and GSE. Each is associated with a Dyson index called β (1,2and 4 respectively) and except for this difference in β almost all of the resultsin this book (and many more) apply to the three ensembles. In particular theirmoments and eigenvalue density are the same as N → ∞, while correlationsand deviations from the asymptotic formulas follow families of laws with β asa parameter. In this section we will review the other two ensembles (GUE andGSE)4.

2.4.1 Complex Hermitian Matrices

For matrices with complex entries, the analog of a symmetric matrix is a (com-plex) Hermitian matrix. It satisfies A† = A where the dagger operator is thecombination of matrix transposition and complex conjugaison. There are twoimportant reasons to study complex Hermitian matrices. First they appear inmany applications, especially in quantum mechanics. There, the energy andother observables are mapped into Hermitian operators, or Hermitian matricesfor systems with a finite number of states. The first large N result of RandomMatrix Theory is the Wigner semi-circle law. As recalled in the introductionto chapter 2, it was obtained by Wigner as he modeled the energy levels of acomplicated heavy nuclei as a random Hermitian matrix.

The other reason Hermitian matrices are important is mathematical. In the largeN limit, the three ensembles (real, complex and quaternionic (see below)) be-have the same way. But for finite N, computations and proofs are much simplerin the complex case. The main reason is that the Vandermonde determinant thatwe will introduce in section 4.1.3 is easier to manipulate in the complex case.For this reason, most mathematicians discuss the complex Hermitian case first

4 More recently, it was shown how ensemble with an arbitrary value of β can be constructed.***Ref***

Page 46: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

40 Moments of Random Matrices and the Wigner Ensemble

and treat the real and quaternionic cases as extensions. In this book, as we wantto stay close to applications in data science and statistical physics, we will dis-cuss complex matrices only in this sections. In the rest of the book we will indi-cate in footnotes how to extend the result to complex Hermitian matrices.

A complex Hermitian matrix A has real eigenvalues and it can be diagonalizedwith a suitable unitary matrix U. A unitary matrix satisfies U†U = 1. So Acan be written as A = UΛU†, with Λ the diagonal matrix containing its Neigenvalues.

We want to built the complex Wigner matrix: a Hermitian matrix with iid Gaus-sian entries. We will choose a construction that has unitary invariance for everyN. Let us study the unitary invariance of complex Gaussian vectors. But first weneed to define a complex Gaussian number.

We say that the complex variable z is centered Gaussian with variance σ2 ifz = xr + i xi where xr and xi are centered Gaussian variables of variance σ2/2.We have

E|z|2 = Ex2r + Ex2

i = σ2.

A white complex Gaussian vector x is a vector whose components are iid com-plex centered Gaussians. Consider y = Ux where U is a unitary matrix. Each ofthe components is a linear combination of Gaussian variables so y is Gaussian.It is relatively straightforward to show that each component has the same vari-ance σ2 and that there is no covariance between different components. Hencey is also a white Gaussian vector. The ensemble of white complex Gaussianvector is invariant under unitary transformation.

To define the Hermitian Wigner matrix, we first define a (non-symmetric) squarematrix H whose entries are centered complex Gaussian numbers and let X bethe Hermitian matrix defined by

X = H + H†

If we repeat the arguments of section 2.1.3, we can show that the ensemble ofX is invariant under unitary transformation: UXU† in law

= X.

We did not specify the variance of the elements of H. We would like X to be

Page 47: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

2.4 Other Gaussian Ensembles 41

normalized as τ(X2) = σ2+O(1/N). Choosing the variance of the H as E|Hi j|2 =

1/(2N) achieves precisely that.

The Hermitian matrix X has real diagonal elements with EX2ii = 1/N and off-

diagonal elements that are complex Gaussian with E|Xi j|2 = 1/N. In other

words the real and imaginary parts of the off-diagonal elements of X have vari-ance 1/(2N). We can put all this information together in the joint-law of thematrix elements of the Hermitian matrix H:

P(Xi j) ∝ exp−

N2σ2 Tr X2

.

This law is identical to the real symmetric case (Eq. 2.14) up to a factor of 2.When can then write both the symmetric and Hermitian case as

P(Xi j) ∝ exp−βN4σ2 Tr X2

, (2.31)

where β is 1 or 2 respectively.

The complex Hermitian Wigner ensemble with σ2 = 1 is called the GaussianUnitary Ensemble or GUE.

The other sections of this chapter apply equally to the real symmetric and com-plex Hermitian case. Both the self-consistent equation for the Stieltjes transformand the counting of non-crossing pair partition rely on the independence of theelements of the matrix and on the fact that E|Xi j|

2 = 1/N, true in both cases. Wethen have that the Stieljes transform of the two ensembles is the same and theyhave the same distribution of eigenvalues in the large N limit. The same will betrue for the quaternionic case below (β = 4).

2.4.2 Quaternionic Hermitian Matrices

We will define here the quaternionic Hermitian matrices and the GSE. Thereare many fewer applications of quaternionic matrices than the more commonreal or complex matrices. We include this discussion here for completeness.In the literature the link between symplectic matrices and quaternions can bequite obscure for the novice reader. Except for the existence of an ensemble

Page 48: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

42 Moments of Random Matrices and the Wigner Ensemble

of matrices with β = 4 we will never refer to quaternionic matrices after thissection which can safely be skipped.

Quaternions are a non-commutative extensions of the real and complex num-bers. They are written as a real linear combinations of the real number 1 andthree abstract non-commuting objects (i, j, k) satisfying

i2 = j2 = k2 = ijk = −1 ⇒ ij = −ji = k, jk = −kj = i, ki = −ik = j.

So we can write a quaternion as h = xr + i xi + j xj +k xk. If only xr is non-zerowe say that h is real. We define the quaternionic conjugation as 1∗ = 1, i∗ =−i, j∗ = −j, k∗ = −k so that the norm |h|2 := hh∗ = x2

r + x2i + x2

j + x2k is always

real and non-negative. The abstract objects i, j and k can be represented as2 × 2 complex matrices:

1 =

(1 00 1

)i =

(i 00 −i

)j =

(0 1−1 0

)k =

(0 ii 0

),

where the i in the matrices is now the usual unit imaginary number.

Quaternions share all the algebraic properties of real and complex numbersexcept for commutativity (they form a division ring). Since matrices in gen-eral do not commute, matrices built out of quaternions behave like real orcomplex matrices.

A Hermitian quaternionic matrix is a square matrix A whose elements arequaternions and satisfy A = A†. Here the dagger operator is the combina-tion of matrix transposition and quaternionic conjugation. They are diago-nalizable and their eigenvalues are real. Matrices that diagonalize Hermitianquaternionic matrices are called symplectic. Written in terms of quaternionsthey satisfy SS† = 1.

Given representation of quaternions as 2 × 2 complex matrices, an N × Nquaternionic Hermitian matrice A can be written as a 2N×2N complex matrixQ(A). We choose a representation where

Z := Q(1j) =

(0 1−1 0

).

For a 2N × 2N complex matrix Q to be the representation of a quaternionicHermitian matrix it has to have two properties. First, quaternionic conjuga-tion acts just like Hermitian conjugation so Q† = Q. Second it has to bewritable as a real linear combination of unit quaternions. One can show thatsuch matrices (and only them) satisfy

QR := ZQT Z−1 = Q†,

Page 49: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

2.4 Other Gaussian Ensembles 43

where QR is called the dual of Q. In other words an N × N Hermitian quater-nionic matrix corresponds to a 2N × 2N self-dual Hermitian matrix (i.e.Q = Q† = QR). In this 2N × 2N representation symplectic matrices arecomplex matrices satisfying:

SS† = SSR = 1.

To recap, a 2N × 2N Hermitian self-dual matrix Q can be diagonalized bya symplectic matrix S. Its 2N eigenvalues are real and they occur in pairs asthey are the N eigenvalues of the equivalent Hermitian quaternionic N × Nmatrix.

We can now define the third Gaussian matrix ensemble, namely the Gaus-sian Symplectic Ensemble (GSE) consisting of Hermitian quaternionic ma-trices whose off-diagonal element are quaternions with Gaussian distributionof zero mean and variance E|Xi j|

2 = 1/N. This means that the four compo-nents of each Xi j is a Gaussian number of zero mean and variance 1/(4N).The diagonal element of X are real Gaussian numbers with zero mean andvariance 1/(2N). As usual Xi j = X∗ji so only the upper (or lower) triangularelements are independent. The GSE (like GOE and GUE) is customarily de-fined to have unit variance (i.e. τ(X2) = 1) but we can scale such a matrix byσ and called it a (quaternionic) Wigner matrix of variance σ2, the joint lawof its elements is given by

P(Xi j) ∝ exp−

Nσ2 Tr X2

,

which we identify with Eq. (2.31) with β = 4. This parameter β = 4 is afundamental property of the symplectic group and will consistently appear incontrast with the orthogonal and unitary cases, β = 1 and β = 2 (see section4.1.3).

The parameter β measures the randomness of the norm of the matrix ele-ments. More precisely, we have

|Xi j|2 =

x2

r for real symmetricx2

r + x2i for complex Hermitian

x2r + x2

i + x2j + x2

k for quaternionic Hermitian.

where xr, xi, xj, xk are real Gaussian numbers such that E|Xi j|2 = 1. We see

that the fluctuations of |Xi j|2 decrease with β (precisely Var[|Xi j|

2] = 2/β). Bythe law of large numbers, in the β→ ∞ limit (if such thing existed) we wouldhave |Xi j|

2 = 1 with no fluctuations.

Page 50: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

44 Moments of Random Matrices and the Wigner Ensemble

Bibliographical Notes

Many Books Tao [2012], Anderson et al. [2010], Blower [2009]

Historical Wigner [1951], Dyson [1962b]

Page 51: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

3Wishart Ensemble and Marcenko-Pastur Distribution

In this chapter we will study the statistical properties of large sample covariancematrices of some N-dimensional variables observed T times. More precisely,the empirical set consists of N×T datas xt

i1≤i≤N,1≤t≤T , where we have T obser-vations and each observation contains N variables. Examples abound: we couldconsider the daily returns of N stocks, over a certain time period, or the numberof spikes fired by N neurons during T consecutive time intervals of length ∆t,etc. Throughout this book, we will use the notation q for the ratio N/T . Whenthe number of observations is much larger than the number of variables, one hasq 1. If the number of observations is smaller than the number of variables (acase that can easily happen in practice), then q > 1.

In the case where q → 0, one can faithfully reconstruct the “true” (or popula-tion) covariance matrix C of the N-variables from empirical data. For q = O(1),on the other hand, the empirical (or sample) covariance matrix is a strongly dis-torted version of C, even in the limit of a large number of observations. In thischapter, we will derive the well known Marcenko-Pastur law for the eigenval-ues of the sample covariance matrix for arbitrary values of q, in the “white” casewhere the population covariance matrix C is the identity.

Page 52: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

46 Wishart Ensemble and Marcenko-Pastur Distribution

3.1 Wishart Matrices

3.1.1 Sample Covariance Matrices

We assume that the observed variables xti have zero mean. (Otherwise, we need

to remove the sample mean T−1 ∑t xt

i from xti for each i. For simplicity, we

will not consider this case.) Then the sample covariances of the data are givenby

Ei j =1T

T∑t=1

xti x

tj.

Thus Ei j form an N × N matrix E, called the sample covariance matrix (scm):

E =1T

HHT ,

where H is an N × T data matrix with entries Hit = xti.

The matrix E is symmetric and positive semi-definite (all its eigenvalues aregreater or equal to zero):

E = ET , and vT Ev = (1/T )‖HT v‖2 ≥ 0,

for any v ∈ RN . Thus E is diagonalizable and has all eigenvalues ≥ 0.

We can define another covariance matrix by transposing the data matrix H:

F =1N

HT H,

The matrix F is a T × T matrix, it is also symmetric and positive semi-definite.If the index i (1 < i < N) labels the variables and the index t (1 < t < T )the observations, we can call the matrix F the covariance of the observations(as opposed to E the covariance of the variables). Fts measures how similar theobservations at t are to those at s — in the above example of neurons, it wouldmeasure how similar is the firing pattern at time t and at time s.

As we saw in section 1.3, the matrices NE and TF have the same non-zeroeigenvalues. Also the matrix E has at least N − T zero eigenvalues if N > T(and at least T − N zero eigenvalues for F if T > N).

Page 53: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

3.1 Wishart Matrices 47

Assume for a moment that N ≤ T (i.e. (q ≤ 1), then we know that in additionto the N (zero or non-zero) eigenvalues q−1λE

k , F has T − N zero eigenvalues.This allows us to write an exact relationship between the Stieltjes transforms ofE and F:

gFT (z) =

1T

T∑k=1

1z − λF

k

=1T

N∑k=1

1z − q−1λE

k

+ (T − N)1

z − 0

= q2gE

N(qz) +1 − q

z. (3.1)

A similar argument with T < N leads to the same Eq. (3.1) so it is actuallyvalid for any value of q. The relationship should be true as well in the large Nlimit:

gF(z) = q2gE(qz) +

1 − qz

.

3.1.2 First and Second Moments of a Wishart Matrix

We now study the sample covariance matrix E. Assume that the column vectorsof H are drawn independently from a multivariate Gaussian distribution withmean zero and “true” (or “population”) covariance matrix C, i.e.,

E[HitH js] = Ci jδts.

And again

E =1T

HHT .

Sample covariance matrices of this type were first studied by the Scottish mathe-matician John Wishart (1898-1956) and are now called Wishart matrices.

Recall that if (X1, . . . X2n) is a zero-mean multivariate normal random vector,then by Wick’s theorem,

E[X1X2 · · · X2n] =∑∏

E[XiX j] =∑∏

Cov(Xi, X j).

Page 54: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

48 Wishart Ensemble and Marcenko-Pastur Distribution

where∑∏

means that we sum over all distinct pairings of X1, . . . , X2n andeach summand is the product of the n pairs.

First taking expectation, we obtain that

E(Ei j) =1TE

T∑t=1

HitH jt

=1T

T∑t=1

Ci j = Ci j.

Thus, we have E(E) = C: as it is well known, the sample covariance matrixis an unbiased estimator of the true covariance matrix (at least when E[xt

i] =

0).

For the fluctuations, we need to study the higher order moments of E. The sec-ond moment can be calculated as

τ(E2) :=1

NT 2E[Tr(HHT HHT )

]=

1NT 2

∑i, j,t,s

E(HitH jtH jsHis

).

Then by Wick’s theorem, we have (insert a figure on pairings)

τ(E2) =1

NT 2

∑t,s

∑i, j

C2i j +

1NT 2

∑t,s

∑i, j

CiiC j jδts +1

NT 2

∑t,s

∑i, j

C2i jδts

= τ(C2) +NTτ(C)2 +

1Tτ(C2).

Suppose N,T → ∞ with some fixed ratio N/T = q for some constant q > 0, thelast term on the r.h.s. tends to zero. Then we get

τ(E2) − τ(E)2 → τ(C2) − τ(C)2 + qτ(C)2.

The variance of the scm is greater than that of the true covariance by a termproportional to q. When q → 0 we recover prefect estimation and the twomatrices have the same variance. If C = αI (a multiple of the identity) thenτ(C2) − τ(C)2 = 0 but τ(E2) − τ(E)2 → qα2.

Page 55: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

3.1 Wishart Matrices 49

3.1.3 The Law of Wishart Matrices

Next, we give the joint distribution of elements of E. For each fixed column ofH, the joint distribution of the elements are

P(Hit

Ni=1

)=

1√(2π)N det C

exp

−12

∑i, j

Hit(C)−1i j H jt

.Taking product over 1 ≤ t ≤ T (since the columns are independent), we obtainthat

P (H) =1

(2π)NT2 det CT/2

exp[−

12

Tr(HT C−1H

)]=

1

(2π)NT2 det CT/2

exp[−

T2

Tr(EC−1

)].

Then we make a change variables H→ E. As shown in the technical paragraphbelow, the Jacobian of the transformation is proportional to (det E)

T−N−12 . The ex-

act expression for the law of the matrix elements was obtained by Wishart:

P (E) =1

2NT/2ΓN(T/2)(det E)(T−N−1)/2

(det C)T/2 exp[−

T2

Tr(EC−1

)]. (3.2)

Note that the density is restricted to positive semi-definite matrices E. Using theidentity det E = exp(Tr log E), we can rewrite the above expression as

P (E) =1

2NT/2ΓN(T/2)1

(det C)T/2 exp[−

T2

Tr(EC−1

)+

T − N − 12

Tr log E].

We will denote by W a sample covariance matrix with C = 1 and call sucha matrix a white Wishart. In this case, as N,T → ∞ with N/T ≡ q, we getthat1

P (W) ∝ exp[−

N2

Tr V(W)],

1 For large N, complex and quaternionic Hermitian white Wishart matrices have the same law of theelements up to a factor of β in the exponential:

P (W) ∝ exp[−βN2

Tr V(W)],

with V(W) as above and β equal to 1, 2 or 4.

Page 56: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

50 Wishart Ensemble and Marcenko-Pastur Distribution

where

V(W) := (1 − q−1) log W + q−1W.

Note that the above P(W) is rotationally invariant in the white case. In fact, ifa vector v has Gaussian distributionN(0, 1N×N), then Ov has the same distribu-tionN(0, 1N×N) for any orthogonal matrix O. Hence OH has the same distribu-tion as H, which shows that OEOT has the same distribution as E.

Jacobian of the Transformation H→ E

The aim is to compute the volume ω(E) corresponding to all H’s such thatE = T−1HHT :

ω(E) =

∫dHδ(E − T−1HHT ).

First note that one can always choose E to be diagonal, because one can al-ways rotate the integral over H to an integral over OH, where O is the rotationmatrix that makes E diagonal. Now, introducing the Fourier representation ofthe δ function for all N(N + 1)/2 independent components of E, one has:

ω(E) =

∫dHdA exp

(i Tr(AE − T−1AHHT )

),

where A is the symmetric matrix of the corresponding Fourier variables, towhich we add a small imaginary part proportional to 1 to make all the fol-lowing integrals well-defined. The Gaussian integral over H can now be per-formed explicitly for all ts, leading to:∫

dH exp(−iT−1 Tr(AHHT )

)∝ (det A)−T/2,

leaving us with

ω(E) ∝∫

dA exp (i Tr(AE)) (det A)−T/2.

We can change variables from A to B = E1/2AE1/2. The Jacobian of thistransformation is:∏

i

dAii

∏j > idAi j =

∏i

E−1ii

∏j > i(EiiE j j)−1/2

∏i

dBii

∏j > idBi j

= (det(E))−N+1

2

∏i

dBii

∏j > idBi j.

Page 57: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

3.2 Marcenko-Pastur Using the Cavity Method 51

So finally:

ω(E) ∝[∫

dB exp (i Tr(B)) (det B)−T/2]

(det(E))T−N−1

2 ,

as announced in the main text.

3.2 Marcenko-Pastur Using the Cavity Method

3.2.1 Self-Consistent Equation for the Resolvent

We first derive the asymptotic distribution of the Wishart matrix with C = 1,i.e. the Marcenko-Pastur distribution. We shall use the same method as in thederivation of the Wigner semicircle law in section 2.2. In the case C = 1, theN × T matrix H is filled with iid standard Gaussian random numbers and wehave W = (1/T )HHT .

As in section 2.2, we wish to derive a self-consistent equation satisfied by theStieltjes transform

gW(z) = τ (GW(z)) , GW(z) := (z1 −W)−1.

We fix a large N and first write an equation for the element 11 of GW(z). We willargue later that G11(z) converges to g(z) with negligible fluctuations [we hence-forth drop the subscript W as this entire section deals with the white Wishartcase].

Using again the Schur complement formula (1.20), we have that

1(G(z))11

= M11 −M12(M22)−1M21,

where M := z1 −W, and the submatrices of size, respectively, [M11] = 1 × 1,[M12] = 1 × (N − 1), [M21] = (N − 1) × 1, [M22] = (N − 1) × (N − 1). We canexpand the above expression and write

1(G(z))11

= z −W11 −1

T 2

T∑t,s=1

N∑j,k=2

H1tH jt(M22)−1jk HksH1s. (3.3)

Page 58: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

52 Wishart Ensemble and Marcenko-Pastur Distribution

Note that the three matrices M22, H jt ( j ≥ 2) and Hks (k ≥ 2) are independentof the entries H1t for all t. We can write the last term on the r.h.s. as

1T

N∑t,s=1

H1tΩtsH1s with Ωts :=1T

N∑j,k=2

H jt(M22)−1jk Hks.

Provided K2 := T−1 TrΩ2 converges to a finite limit when T → ∞ 2, one readilyshows that the above sum converges to T−1 TrΩ with fluctuations of the orderof KT−1/2. So we have, for large T

1(G(z))11

= z −W11 −1T

∑2≤ j,k≤N

∑t HktH jt

T(M22)−1

jk + O(T−1/2)

= z −W11 −1T

∑2≤ j,k≤N

Wk j(M22)−1jk + O(T−1/2)

= z − 1 −1T

Tr W2G2(z) + O(T−1/2),

where in the last step we have used the fact that W11 = 1 + O(T−1/2) and notedW2 and G2(z) the sample covariance matrix and resolvent of the (N-1) variablesexcluding (1). We can re-write the trace term

Tr(W2G2(z)) = Tr(W2(z1 −W2)−1

)= −Tr 1 + z Tr

((z1 −W2)−1

)= −Tr 1 + z Tr G2(z).

In the region where Tr G(z)/N converges for large N to the deterministic g(z),Tr G2(z)/N should also converge to the same limit as G2(z) is just an (N − 1) ×(N − 1) version of G(z). So in the region of convergence we have

1(G(z))11

= z − 1 + q − qzg(z) + O(N−1/2),

where we have introduced q = N/T = O(1), such that N−1/2 and T−1/2 are of thesame order of magnitude. This last equation states that 1/G11(z) has negligible

2 It can be self-consistently checked from the solution below that limT→∞ K2 = −qg′W(z)

Page 59: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

3.2 Marcenko-Pastur Using the Cavity Method 53

fluctuations and can safely be replaced by its expectation value, i.e.

1(G(z))11

= E(

1(G(z))11

)+ O(N−1/2)

=1

E ((G(z))11)+ O(N−1/2)

By rotational invariance of W, we have

E(G(z))11 =1NETr(G(z))→ g(z).

In the large N limit we obtain the following self-consistent equation for g(z):

1g(z)

= z − 1 + q − qzg(z). (3.4)

3.2.2 Solution and Density of Eigenvalues

Solving (3.4) we obtain that3

qzg2 − (z − 1 + q)g + 1 = 0⇒ g(z) =z − (1 − q) ± z

√(1 − λ+/z)(1 − λ−/z)2qz

,

where

λ± = (1 ±√

q)2.

Again we want to have a branch of g(z) that behaves like 1/z as |z| → ∞. Thisleads us to choose the branch

g(z) =z − (1 − q) − z

√(1 − λ+/z)(1 − λ−/z)2qz

. (3.5)

Note that for z = x − iη with x , 0 and η→ 0, g(z) can only have an imaginarypart if

√(x − λ+)

√(x − λ−) is imaginary. Then using (2.24), we get the famous

Marcenko-Pastur distribution for the bulk:

ρ(x) =1π

limη→0+

Im g(x − iη) =

√(λ+ − x)(x − λ−)

2πqx, λ− < x < λ+.

Page 60: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

54 Wishart Ensemble and Marcenko-Pastur Distribution

0 1 2 3 4 5 60.0

0.2

0.4

0.6

0.8

1.0

()

q=1/2q=2

Figure 3.1 Marcenko-Pastur distribution: density of eigenvalues for a Wishartmatrix for q = 1/2 and q = 2. Note that for q = 2 there is a Dirac mass at zero( 1

2δ(λ)). Also note that the two bulk densities are the same up to a rescalingand normalization ρ1/q(λ) = q2ρq(qλ).

Moreover by studying the behavior of Eq. (3.5) near z = 0 one sees that thereis a pole at 0 when q > 1. This gives a delta mass as z→ 0:

q − 1q

δ(x),

which corresponds to the N − T trivial zero eigenvalues of E in the N > T case.Combining the above discussions, the full Marcenko-Pastur law can be writtenas

ρMP(x) =

√[(λ+ − x)(x − λ−)]+

2πqx+

q − 1q

δ(x)θ(q − 1),

where we denote a+ := maxa, 0 for any a ∈ R, and

θ(q − 1) :=

0, if q ≤ 1

1 if q > 1.

Note that the Stieltjes transforms (Eq. (3.5)) for q and 1/q are related by Eq.(3.1). As a consequence the bulk densities for q and 1/q are the same when

3 By introducing 1/z in the square-root we make sure that there are no singularities for |z| > λ+ so g(z) isregular towards infinity.

Page 61: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

3.2 Marcenko-Pastur Using the Cavity Method 55

properly rescaled (see Fig. 3.1):

ρ1/q(λ) = q2ρq(qλ).

3.2.3 General (Non-White) Wishart matrices

Recall our definition of a Wishart matrix from section 3.1.2: a Wishart matrix isa matrix EC defined as

EC =1T

HCHTC,

where HC is a N×T rectangular matrix with independent columns. Each columnis a random Gaussian vector with covariance matrix C; EC corresponds to thesample (empirical) covariance matrix of variables characterized by a population(true) covariance matrix C.

To understand the case where the true matrix C is different from the identitywe first discuss how to generate a multivariate Gaussian vector with covariancematrix C. We diagonalize C as

C = ODOT , D =

σ2

1. . .

σ2N

.Then we can define the square root of C as

C1/2 = OD1/2OT , D1/2 =

σ1

. . .

σN

.We now generate N i.i.d. unit Gaussian random variables xi, 1 ≤ i ≤ N, whichform a random column vector x with entries xi. Then we can generate the vectory = C1/2x. We claim that y is a multivariate Gaussian vector with covariancematrix C. In fact, y is linear combination of multivariate Gaussian, so it is itselfmultivariate Gaussian. On the other hand, we have, using E[xxT] = 1,

E(yyT ) = E(C1/2xxT C1/2) = C.

Page 62: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

56 Wishart Ensemble and Marcenko-Pastur Distribution

By repeating the above argument on every columns of HC, t = 1, . . . ,T , we seethat this matrix can be written as HC = C1/2H, with H a rectangular matrix withi.i.d. unit Gaussian entries. The matrix EC is then equivalent to

EC =1T

HCHTC =

1T

C1/2HHT C1/2 = C1/2WC1/2,

where W = 1T HHT is a white Wishart matrix with q = N/T . We will later see

that the above combination of a matrices is called the free product of C and W.Free probability will allow us to compute the resolvent and the spectrum in thegeneral C case.

Exercise

3.1 Properties of the Marcenko-Pastur solution

We saw that the Stieltjes transform of a large Wishart matrix (with q =

N/T ) should be given by

g(z) =z + q − 1 ±

√(z + q − 1)2 − 4qz2qz

(3.6)

where the sign of the square-root should be chosen such that g(z) → 1/zwhen z→ ±∞.

(a) Show that the zeros of the argument of the square-root are given byλ± = (1 ±

√q)2.

(b) The function

g(z) =z + q − 1 −

√z − λ−

√z − λ+

2qz(3.7)

should have the right properties. Show that it behaves as g(z) → 1/zwhen z → ±∞. By expanding in powers of 1/z up to 1/z3 compute thefirst and second moments of the Wishart distribution.

(c) Show that Eq. (3.7) is regular at z = 0 when q < 1. In that case, computethe first inverse moment of the Wishart matrix τ(E−1). What happens

Page 63: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

Exercise 57

when q → 1? Show that Eq. (3.7) has a pole at z = 0 when q > 1 andcompute the value of this pole.

(d) The non-zero eigenvalues should be distributed according to the Marcenko-Pastur distribution

ρq(x) =

√(x − λ−)(λ+ − x)

2πqx. (3.8)

Show that this distribution is correctly normalized when q < 1 but notwhen q > 1. Use what you know about the pole at z = 0 in that case tocorrectly write down ρq(x) when q > 1.

(e) In the case q = 1, Eq. (3.8) has an integrable singularity at x = 0.Write a simpler formula for ρ1(x). Let u be the square of an eigenvaluefrom a Wigner matrix of unit variance, i.e. u = y2 where y is distributedaccording to the semi-circular law ρ(y) =

√4 − y2/(2π). Show that u is

distributed according to ρ1(x). This result is a priori not obvious as aWigner matrix is symmetric while the square matrix H is generally not,nevertheless moments of high dimensional matrices of the form HHᵀ

are the same whether the matrix H is symmetric or not.

(f) Generate three matrices E = HHᵀ/T where the matrix H is a N × Tmatrix of iid Gaussian numbers of variance 1. Choose a large N andthree values of T such that q = N/T equals 1/2, 1, 2. Plot a normalizedhistogram of the eigenvalues in the three cases vs the correspondingMarcenko-Pastur distribution, don’t show the peak at zero. In the caseq = 2, how many zero eigenvalues do you expect? How many do youget?

Bibliographical Notes

Pastur and Scherbina [2010], Couillet and Debbah [2011]

Historical Wishart [1928], Marchenko and Pastur [1967]

Page 64: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

4Joint Distribution of Eigenvalues

In the previous two chapters, we haves studied the moments, the Stieltjes trans-form and the eigenvalue density of two classical ensembles (Wigner and Wishart).These three properties are really single eigenvalue properties of these two en-sembles. By this we mean that they are completely determined by the law ofa single eigenvalue (the density) and that they don’t tell us anything about thecorrelations between different eigenvalues.

In this chapter we will extend these results in two directions. First we willconsider a larger class of ensembles (the orthogonal ensembles) that containsWishart and Wigner. Second we will study the joint law of all eigenvalues. Inthese models, the eigenvalues are strongly correlated and can be thought of asinteracting through pairwise repulsion.

4.1 From Matrix Elements to Eigenvalues

4.1.1 Matrix Potential

Consider symmetric random matrices M whose elements are distributed as theexponential of the trace of a matrix potential:

P(M) = Z−1N exp

N2

Tr V(M), (4.1)

Page 65: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

4.1 From Matrix Elements to Eigenvalues 59

where ZN is a normalization constant. These matrix ensembles are called or-thogonal ensembles for they are rotationally invariant, i.e. invariant under or-thogonal transformation.1 For the Wigner ensemble, we have

V(x) =x2

2σ2 .

For Wishart ensemble, we have

V(x) =x + (q − 1) log x

q. (4.2)

We can also consider other matrix potentials, e.g.

V(x) =x2

2+

gx4

4.

Note that Tr V(M) depends only on the eigenvalues of M. We would like to writedown the joint distribution of these eigenvalues. The key is to find the Jacobianof the change of variables from the entries of M to the eigenvalues λ1, . . . , λN.

4.1.2 Infinitesimal Rotations

Before computing the Jacobian of the transformation from matrix elements toeigenvalues and eigenvectors (or orthogonal matrices), let’s count the numberof variables in both parametrizations. Suppose M can be diagonalized as

M = OΛOT .

The symmetric matrix M has N(N + 1)/2 independent variables, and Λ has Nindependent variables as a diagonal matrix. To find the number of independentvariables in O we first realize that OOT = 1 is an equation between symmetric

1 The results of this chapter extend to Hermitian (β = 2) or quarternion-Hermitian (β = 4) matrices withthe simple introduction of a factor β in the probability distribution:

P(M) ∝ exp−βN2

Tr V(M),

this factor will match the factor of β from the Vandermonde determinant. These two other ensembles arecalled unitary ensembles and symplectic ensembles, respectively. Collectively they are called the betaensembles.

Page 66: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

60 Joint Distribution of Eigenvalues

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0x

2

1

0

1

2

3

4

5

6

V(x)

q=1/2q=2

Figure 4.1 The Wishart matrix potential (Eq. (4.2)) for q = 1/2 and q = 2. Theintegration over positive semi-definite matrices imposes that the eigenvaluesmust be greater or equal to zero. For q < 1 the potential naturally confinesthe eigenvalues to be greater than zero and the constraint will not be explicitlyneeded in the computation. For q ≥ 1, the constraint is needed to obtain asensible result.

matrices and thus imposes N(N + 1)/2 constraints out of N2 potential values forthe elements of O, therefore O has N(N − 1)/2 independent variables.

The change of variables from M to (Λ,O) will introduce a factor | det(∆)|]where

∆ ≡ ∆(M) =

[∂M∂Λ

,∂M∂O

]is the Jacobian matrix of dimension N(N + 1)/2 × N(N + 1)/2.

To find the proper scaling of the Jacobian, we assume that that the matrix ele-ments of M have some dimensions (say centimeters). Using dimensional anal-ysis, we have

[DM] ∼ mN(N+1)/2, [DΛ] ∼ mN , [DO] ∼ 1.

Hence we must have

[| det(∆)|] ∼ mN(N−1)/2,

with has the same dimension as an eigenvalue raised to the power N(N − 1)/2,the number of distinct off-diagonal elements in M.

Page 67: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

4.1 From Matrix Elements to Eigenvalues 61

We first compute | det(∆)| for M that is diagonal. The Jacobian consists of firstderivatives of M, to compute it we will consider small (first order) perturbationsof M by rotations and changes in eigenvalues.

For rotations O near the identity, we have

O = 1 + ε δO,

where ε is a small number and δO is some matrix. From the identity

1 = OOT = 1 + ε(δO + δOT ) + ε2δOδOT ,

we get δO = −δOT by comparing terms of the first order in ε.2 A convenientbasis to write such infinitesimal rotations is

ε δO =∑

1≤k<l≤N

θklA(kl),

where A(kl) are the elementary anti-symmetric matrices with entries

A(kl)i j = δikδ jl − δilδ jk, k < l,

that is A(kl) has only two non zero elements: [A(kl)]kl = 1 and [A(kl)]lk = −1,

A(kl) =

0 . . . . . . 0...

. . . 1...

−1...

. . ....

0 . . . . . . 0

.

In a neighborhood of a diagonal matrix Λ, any matrix M can be parameterizedas

M = Λ + δM ≈

1 +∑k,l

θklA(kl)

(Λ + δΛ)

1 −∑k,l

θklA(kl)

.2 The reader familiar with the analysis of compact Lie group will recognize the statment that

anti-symmetric matrices form the Lie algebra of O(N).

Page 68: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

62 Joint Distribution of Eigenvalues

So to first order in δΛ and θkl,

δM ≈ δΛ +∑k,l

θkl[A(kl)Λ − ΛA(kl)

].

4.1.3 Vandermonde Determinant

Using this local parameterization, we can compute the Jacobian matrix and findits determinant. We have

∂Mi j

∂Λnn= δinδ jn,

i.e. perturbation of an eigenvalue only changes the corresponding diagonal ele-ment with slope 1.

For k < l and i < j,

∂Mi j

∂θkl=

(A(kl)Λ − ΛA(kl)

)i j

=

λl − λk, if i = k, j = l

0, otherwise,

an infinitesimal rotation in the direction kl modifies only one distinct off-diagonalelement (Mkl ≡Mlk) with slope λl − λk. In particular if two eigenvalues are thesame (λk = λl) a rotation of the eigenvectors in that sub-space has no effect onthe matrix M.

Thus the Jacobian has the form

∆(M) =

1. . .

1λ2 − λ1

λ3 − λ1. . .

λN − λN−1

.

Thus we have

| det(∆(M))| =∏k<l

|λk − λl|.

Page 69: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

Exercise 63

We obtained the above result for a matrix M that was diagonal. To compute theJacobian at every possible symmetric matrix M′, we consider the fixed orthog-onal matrix O′ that diagonalizes M′ so M′ = O′M(O′)T where M is diagonal.The change of variable from M to M′ has absolute Jacobian determinant (AJD)equal to 1. To see this, we first realize that the transformation is linear so its AJDis a constant depending only on O′. To show that these real positive constantsare always equal to 1 is a bit more tricky. It would be strange that some rotationswould have AJD greater than one while their inverse would have AJD less thanone. Applying multiple rotation successively yields an AJD which is the productof the individual ADJs (|∆O||∆O′ | = |∆OO′ |), since applying the same rotation alarge number of times we can get arbitrarily close to the identity Ok ≈ 1 forsome k we must have |∆Ok | = |∆O|

k ≈ 1 which implies that |∆O| = 1. Indeedthese positive constants form a 1-d representation of the group O(N) and theonly such representation is the constant 1.

Thus we get3

| det(∆(M′))| =∏k<l

|λk − λl| (4.3)

for any N × N symmetric matrix M′. Note that the above Jacobian has nodependence on the matrix O, thus we can integrate out the rotation parts of(4.1):

P(λi) ∝∏k<l

|λk − λl| exp

−N2

N∑i=1

V(λi)

. (4.4)

A key feature of the above probability density is that the eigenvalues are not in-dependent, and they repel each other due to the

∏k<l |λk−λl| term. In particular,

the probability density vanishes when two eigenvalues tend to each other.

Exercise

4.1 Vandermonde determinant for 2 × 2 matrices.3 A similar result exists for unitary and symplectic matrices, namely | det(∆(M))| =

∏k<l |λk − λl |

β. Whereβ, as usual, equals 1, 2 or 4 in the orthogonal, unitary and symplectic case respectively.

Page 70: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

64 Joint Distribution of Eigenvalues

In this exercise we will explicitly compute the Vandermonde determinantfor 2 × 2 matrices. We define O and Λ as

O =

(cos(θ) sin(θ)− sin(θ) cos(θ)

)and Λ =

(λ1 00 λ2

).

Then any 2 × 2 symmetric matrix can be written as M = OΛOT .

(a) Write explicitly M11, M12 and M22 as a function of λ1, λ2 and θ.

(b) Compute the 3 × 3 matrix ∆ of partial derivatives of M11, M12 and M22

with respect to λ1, λ2 and θ.

(c) In the special cases where θ equals 0, π/4 and π/2 show that | det∆| =|λ1 − λ2|. If you have the courage show that | det∆| = |λ1 − λ2| for all θ.

4.1.4 Coulomb Gas Analogy

As mentioned in the footnotes, the orthogonal ensemble defined above can beextended to complex or quaternion Hermitian matrices by introducing a factorof β (equal to 1, 2 or 4) in both the potential and the Vandermonde determinant.The joint law of the eigenvalues can then be written as

P(λi) = Z−1N exp

−β

2

N∑

i=1

NV(λi) −N∑

i, j=1j,i

log |λi − λ j|

. (4.5)

This joint law is exactly the Boltzmann factor (e−E/T ) for a gas of N one-dimension particles at temperature T = 2/β whose potential energy is givenby NV(x) and that interact with each other via a pairwise repulsive force gen-erated by the potential VR(x, y) = − log(|x − y|). The repulsive term happensto be the the 2-dimensional Coulomb potential for particles that all have thesame sign. The term 1-d and 2-d may be a bit confusing. The particles live in1-d, parameterized by the value λi, while the repulsive force is the 2-d Coulombinteraction ( fi j = 1/(λi − λ j)).

Even though we are interested in one particular value of β (namely β = 1),we can build an intuition by considering this system at various temperature. At

Page 71: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

Exercise 65

2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.00.0

0.5

1.0

1.5

2.0

V(x)N=20

Figure 4.2 Representation of a typical N = 20 GOE matrix as a Coulombgas. The full curve represents the potential V(x) = x2/2 and the 20 crosses,the positions of the eigenvalues of a typical configuration. In this analogy,the eigenvalues feel a potential NV(x) and a repulsive pairwise interactionV(x, y) = − log(|x − y|). They fluctuate according to the Boltzmann weighte−βE/2 with β = 1 in the present case.

very low temperature (β → ∞), the N particles all want to minimize their po-tential energy and sit at the minimum of NV(x) but if they try to do so they willpay a high price in interaction energy as this energy increases as the particles getclose to each other. The particles will have to spread themselves around the min-imum of NV(x) to minimize both the potential and interaction energy to find theequilibrium configuration. At non-zero temperature (finite β) the particles willfluctuate around the equilibrium solution. Since the repulsion energy diverges asany two eigenvalues converge, the particles will always avoid each other. Figure4.1.4 shows a typical configuration of particles/eigenvalues for N = 20 at β = 1in a quadratic potential (GOE matrix).

In the next section, we will study this equilibrium solution which is exact at lowtemperature β→ ∞ and is the maximum likelihood solution at finite β and finiteN.

Page 72: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

66 Joint Distribution of Eigenvalues

4.1.5 Statistics of Eigenvalue Differences

4.2 Maximum Likelihood and Large N Limit

4.2.1 Maximum Likelihood Configuration: Stieltjes Transform

We continue our study of the joint law of eigenvalues for the orthogonal ensem-ble (β = 1) and even more generally for the beta ensembles. In the previoussection, we saw that in the Coulomb gas analogy, β → ∞ corresponds to thezero temperature limit, and that in this limit the eigenvalues freeze to the mini-mum of the energy (potential plus interaction). We will argue that this freezing(or concentration) also happens when N → ∞ for fixed β. Let’s study this min-imum energy configuration. We can rewrite Eq. (4.5), we have

P(Λ) ∝ e−βNL

2 (λi), L(λi) =

N∑i=1

V(λi) −1N

N∑i, j=1

j,i

log |λi − λ j|.

For finite N and finite β, we can still consider the solution that minimizes L(λi).This is the maximum likelihood solution, i.e. the configuration of λi that hasmaximum probability. The minimum of L is determined by the equations

∂L∂λi

= V ′(λi) −2N

N∑j=1j,i

1λi − λ j

= 0. (4.6)

These are N coupled equations of N variables that can get very tedious to solveeven for moderate value of N. In exercise 4.2 we will find the solution for N = 3in the Wigner case. The solution of these equations is the set of equilibriumpositions of all the eigenvalues, i.e. the set of eigenvalues that maximizes thejoint probability. To obtain the statistical behavior of this solution (in particularthe density of eigenvalue), we will compute the Stieltjes transform of the λi

satisfying Eq. (4.6). The trick is to make algebraic manipulations to both sidesof the equation to make the Stieltjes transform explicitly appear. In the first stepwe multiply the equation by 1/(z − λi) where z is a complex variable not equal

Page 73: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

4.2 Maximum Likelihood and Large N Limit 67

to any eigenvalues, and then sum over the index i.

1N

N∑i=1

V ′(λi)z − λi

=2

N2

N∑i, j=1

j,i

1(λi − λ j

)(z − λi)

=1

N2

N∑

i, j=1j,i

1(λi − λ j

)(z − λi)

+

N∑i, j=1

j,i

1(λ j − λi

)(z − λ j)

=

1N2

N∑i, j=1

j,i

1(z − λi) (z − λ j)

= g2N(z) −

1N2

N∑i=1

1(z − λi)2

= g2N(z) +

g′N(z)N

, (4.7)

where g(z) is the Stieltjes transform at finite N:

gN(z) :=1N

N∑i=1

1z − λi

. (4.8)

We still need to handle the left-hand side. First we add and subtract V ′(z) on thenumerator

1N

N∑i=1

V ′(λi)z − λi

= V ′(z)g(z) −1N

N∑i=1

V ′(z) − V ′(λi)z − λi

= V ′(z)g(z) − PN(z),

where we have defined a new function PN(z) as

PN(z) :=1N

N∑i=1

V ′(z) − V ′(λi)z − λi

(4.9)

This does not look very useful as the equation for g(z) will depend on someunknown function PN(z) that depends on the eigenvalues whose statistics weare trying to determine. The key realization is that if V ′(z) is a polynomial ofdegree k then PN(z) is also a polynomial and it has degree k − 1. Indeed, Foreach i in the sum, V ′(z) − V ′(λi) is a degree k polynomial having z = λi as azero, so (V ′(z) − V ′(λi))/(z − λi) is a polynomial of degree k − 1. PN(z) is thesum of such polynomial so it is itself a polynomial of degree k − 1.

Page 74: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

68 Joint Distribution of Eigenvalues

In fact, the argument is easy to generalize to the Laurent polynomials, i.e.xkV ′(x) is a polynomial for some k ∈ N. For example, in the Wishart case wehave a Laurent polynomial

\[
V'(x) = \frac{1}{q}\left(1 + \frac{q-1}{x}\right).
\]

Nevertheless, from now on we make the assumption that V'(x) is a polynomial. We will later discuss how to relax this assumption.

Thus we get from Eq. (4.7) that

\[
V'(z)\, g_N(z) - P_N(z) = g_N^2(z) + \frac{g_N'(z)}{N} \tag{4.10}
\]

for some polynomial PN(z) of degree deg(V'(x)) − 1 that needs to be determined self-consistently using Eq. (4.9). For a given V'(x), the coefficients of PN(z) are related to the moments of the λi, which themselves can be obtained by expanding gN(z) around infinity. Equation (4.10) is not very useful at finite N. On the other hand, its large N limit is quite interesting.

Exercise

4.2 Maximum likelihood for 3 × 3 Wigner matrices.

In this exercise we will write explicitly the three eigenvalues of the maximum likelihood configuration of a 3 × 3 GOE matrix. The potential for this ensemble is V(x) = x²/2.

(a) Let λ1 ≥ λ2 ≥ λ3 be the three maximum likelihood eigenvalues of the 3 × 3 GOE ensemble in decreasing order. By symmetry we expect λ3 = −λ1. What do you expect for λ2?

(b) Consider equation (4.6). Assuming λ3 = −λ1, check that your guess for λ2 is indeed a solution. Now write the equation for λ1 and solve it.

(c) Using your solution and the definition (4.8), show that the Stieltjes



transform of the maximum likelihood configuration is given by

\[
g_3(z) = \frac{z^2 - \tfrac{1}{3}}{z^3 - z}.
\]

(d) In the simple case V(x) = x²/2, the degree-0 polynomial PN(z) is just a constant (independent of N) that can be evaluated from the definition (4.9). What is this constant?

(e) Verify that your g3(z) satisfies Eq. (4.10) with N = 3.
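As a quick numerical sanity check (an illustrative sketch, not part of the original exercise, assuming scipy is available), one can minimize L({λi}) directly for N = 3 and compare with the analytical configuration and with g3(z) above.

# Minimal sketch: minimize L for N = 3, V(x) = x^2/2, and compare with g_3(z).
import numpy as np
from scipy.optimize import minimize

N = 3

def L(lam):
    # L = sum V(lambda_i) - (1/N) sum_{i != j} log|lambda_i - lambda_j|
    diffs = lam[:, None] - lam[None, :]
    off = ~np.eye(N, dtype=bool)
    return np.sum(lam ** 2) / 2 - np.sum(np.log(np.abs(diffs[off]))) / N

res = minimize(L, x0=np.array([-2.0, 0.1, 2.0]), method="Nelder-Mead")
lam_star = np.sort(res.x)
print("maximum likelihood eigenvalues:", lam_star)        # close to [-1, 0, 1]

z = 2.5 + 0.0j
g3_numeric = np.mean(1.0 / (z - lam_star))
g3_formula = (z ** 2 - 1.0 / 3.0) / (z ** 3 - z)
print(g3_numeric, g3_formula)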

4.2.2 Large N Limit

We will now study Eq. (4.10) in the large N limit. In this limit, gN(z) is self-averaging, so computing gN(z) for the most likely configuration is the same as computing the average g(z). As N → ∞, Eq. (4.10) becomes

\[
V'(z)\, g(z) - P(z) = g^2(z). \tag{4.11}
\]

Each value of N gives a different degree-(k − 1) polynomial PN(z). From the definition (4.9), we can show that the coefficients of PN(z) are related to the moments of the maximum likelihood configuration of size N. In the large N limit these moments converge, so the sequence PN(z) converges to a well-defined polynomial P(z) of degree k − 1.

Equation (4.11) is quadratic in g(z); its solution is given by

\[
g(z) = \frac{V'(z) \pm \sqrt{(V'(z))^2 - 4P(z)}}{2}. \tag{4.12}
\]

The eigenvalues of M will be located where g(z) has an imaginary part for z very close to the real axis. The first term V'(z) is a real polynomial and is always real for real z. The expression (V'(z))² − 4P(z) is also a real polynomial, so g(z) cannot be complex on the real axis unless (V'(z))² − 4P(z) < 0. In this case √((V'(z))² − 4P(z)) is purely imaginary. We conclude that

\[
\operatorname{Re}\big(g(x)\big) = \mathrm{P}\!\int \frac{\rho(\lambda)\,d\lambda}{x - \lambda} = \frac{V'(x)}{2} \qquad \text{where } \rho(x) \neq 0, \tag{4.13}
\]



where P denotes the principal part of the integral. Re(g(x)) is also called the Hilbert transform of ρ(λ).

We have shown that, within its support, the Hilbert transform of the density of eigenvalues is equal to half the derivative of the potential. We first realize that the potential outside the support of the eigenvalues has no effect on the distribution of eigenvalues. This is natural in the Coulomb gas analogy: at equilibrium, the particles do not feel the potential away from where they are. One consequence is that we can consider potentials that are not confining at infinity, as long as they have a confining region and all eigenvalues lie in that region. For example, we will consider the potential V(x) = x²/2 + γx⁴/4. For small negative γ the region around x = 0 is convex; if all eigenvalues are contained in that region, we can modify the potential away from it so that V(x) → ∞ for |x| → ∞ and Eq. (4.1) remains normalizable. Suppose now that we had a potential that is not a polynomial. On a finite region we can approximate it arbitrarily well by a polynomial. If we choose the region of approximation such that, for every successive approximation, all eigenvalues lie in that region, we can take the limit of these approximations and find that Eq. (4.13) holds even if V'(x) is not a polynomial.

We can also ask the reverse question. Given a density ρ(λ), does there exist a model from the orthogonal ensemble (or another β-ensemble) that has ρ(λ) as its eigenvalue density? If the Hilbert transform of ρ(λ) is well-defined, then yes, and Eq. (4.13) gives the answer. Note that the potential is only defined up to an additive constant (it can be absorbed in the normalization of Eq. (4.1)), so knowing its derivative is enough to determine V(x). Note also that we only know the value of V(x) on the support of ρ(λ); outside this support we should choose V(x) to be convex and to go to infinity as |x| → ∞.

Exercise

4.3 Matrix potential for the uniform density

In exercise 2.1, we saw that the Stieltjes transform for a uniform density



of eigenvalues between 0 and 1 is given by

\[
g(z) = \log\!\left(\frac{z}{z-1}\right).
\]

(a) By computing Re(g(x)) for x between 0 and 1, find V ′(x) using Eq.(4.13).

(b) Compute the Hilbert transform of the uniform density to recover your answer in (a).

(c) From your answers in (a) and (b), show that the matrix potential is given by

V(x) = 2[(1 − x) log(1 − x) + x log(x)] + C for 0 < x < 1,

where C is an arbitrary constant. Note that for x < 0 and x > 1 the potential should be completed by a convex function that goes to infinity as |x| → ∞.
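A small numerical sketch (an illustration, not from the text) of the consistency implied by Eq. (4.13): we approximate the principal-value integral by Re g(x − iη) for small η and compare with log(x/(1 − x)), so that V'(x) = 2 log(x/(1 − x)) up to the additive constant discussed above.

# Sketch: Hilbert transform of the uniform density on [0, 1] vs log(x / (1 - x)).
import numpy as np

def re_g_uniform(x, eta=1e-4, n=400_001):
    """Re g(x - i*eta) for rho(lambda) = 1 on [0, 1]; small eta approximates the principal value."""
    lam = np.linspace(0.0, 1.0, n)
    return np.trapz((x - lam) / ((x - lam) ** 2 + eta ** 2), lam)

for x in (0.2, 0.5, 0.8):
    print(x, re_g_uniform(x), np.log(x / (1.0 - x)))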

4.2.3 Wigner and Wishart

Now we apply the above theory to the Wigner ensemble. In this simple case, P(z) can be computed from its definition without knowing the eigenvalues; we have

\[
V'(z) = \frac{z}{\sigma^2}, \qquad P(z) = \frac{1}{\sigma^2}.
\]

Then (4.12) gives (by taking the correct branch)

\[
g(z) = \frac{z - \sqrt{z - 2\sigma}\,\sqrt{z + 2\sigma}}{2\sigma^2},
\]

which is equivalent to (2.22).

In the Wishart case, we only consider q < 1; otherwise (q ≥ 1) the potential is not confining and we need to impose the positive semi-definiteness of the matrix to avoid eigenvalues running off to minus infinity. We have



\[
V'(x) = \frac{1}{q}\left(1 + \frac{q-1}{x}\right).
\]

In this case xV ′(x) is of degree 1, so zP(z) is a polynomial of degree zero:

\[
P(z) = \frac{c}{z}
\]

for some constant c. Thus (4.12) gives

\[
g(z) = \frac{z + q - 1 \pm \sqrt{(z + q - 1)^2 - 4cq^2 z}}{2qz}.
\]

As z → +∞, this expression (taking the minus branch) becomes

\[
g(z) = \frac{cq}{z} + O(1/z^2).
\]

Imposing g(z) ∼ z^{-1} gives c = q^{-1}. Choosing the correct branch on both sides, we recover Eq. (3.5).
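The following Python sketch (an illustration; the matrix sizes and the evaluation point z are arbitrary choices) compares this g(z), with c = 1/q, against the empirical Stieltjes transform of a sampled white Wishart matrix.

# Sketch: empirical check of the Wishart Stieltjes transform above (c = 1/q).
import numpy as np

rng = np.random.default_rng(0)
N, T = 400, 1600
q = N / T
H = rng.standard_normal((N, T))
W = H @ H.T / T                       # white Wishart matrix

z = 3.0 + 0.01j                       # a point away from the spectrum
g_emp = np.mean(1.0 / (z - np.linalg.eigvalsh(W)))
g_th = (z + q - 1 - np.sqrt((z + q - 1) ** 2 - 4 * q * z)) / (2 * q * z)
print(g_emp, g_th)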

4.2.4 Convex Potential and One-Cut Assumption

For more general polynomial potentials, finding an explicit solution for the limiting Stieltjes transform is a little more involved. First we need to choose the correct branch of Eq. (4.12):

\[
g(z) = \frac{V'(z)}{2}\left[1 - \sqrt{1 - \frac{4P(z)}{(V'(z))^2}}\right].
\]
For a particular polynomial V'(x), P(z) is a polynomial that depends on the moments of the matrix M. The expansion of g(z) near z → ∞ gives a set of self-consistent equations for the coefficients of P(z).

The problem simplifies greatly if the support of the density of eigenvalues is a single interval, i.e. if the density ρ(λ) is non-zero for all λ between some λ− and λ+. We expect this to be true if the potential V(x) is convex. Indeed, in the Coulomb gas analogy we could place all eigenvalues near the minimum of V(x) and let them find their equilibrium configuration by repelling each other. For a convex potential it is natural to assume that the equilibrium configuration would not have



any gaps. This assumption is equivalent to assuming that the limiting Stieltjes transform has a single branch cut (from λ− to λ+), hence the name one-cut assumption.

So for a convex polynomial potential V(x), we expect that there exists a well-defined equilibrium density ρ(λ) that is non-zero if and only if λ− < λ < λ+, and that g(z) satisfies

\[
g(z) = \int_{\lambda_-}^{\lambda_+} \frac{\rho(\lambda)}{z - \lambda}\, d\lambda.
\]

From this equation we notice three important properties of g(z):

• The function g(z) is potentially singular at λ− and λ+.

• Near the real axis (Im z = 0⁺), g(z) has an imaginary part if z ∈ (λ−, λ+) and is real otherwise.

• The function g(z) is analytic everywhere else.

If we go back to Eq. (4.12), we notice that any non-analytic behavior must come from the square root. First, on the real axis, the only way g(z) can have an imaginary part is if ∆(z) := (V'(z))² − 4P(z) < 0. So ∆(z) (a polynomial of degree 2k) must change sign at some values λ− and λ+; hence these must be zeros of the polynomial. On the real axis, the other potential zeros of ∆(z) can only be of even multiplicity (otherwise ∆(z) would change sign). Elsewhere in the complex plane, zeros should also be of even multiplicity, otherwise √∆(z) would have a branch point at those zeros and g(z) would not be analytic there. In other words, ∆(z) must be of the form

\[
\Delta(z) = (z - \lambda_-)(z - \lambda_+)\, Q^2(z),
\]

for some polynomial Q(z) of degree k − 1, where k is the degree of V'(x). We can therefore write g(z) as

\[
g(z) = \frac{V'(z) - Q(z)\sqrt{z - \lambda_-}\,\sqrt{z - \lambda_+}}{2}, \tag{4.14}
\]

where again Q(z) is a polynomial with real coefficients of degree one less than that of V'(z). The condition that

\[
g(z) \to \frac{1}{z} \qquad \text{when } |z| \to \infty
\]



is now sufficient to compute Q(z) and also λ± for a given potential V(z). Expanding Eq. (4.14) near z → ∞, Q(z) and λ± must be such as to cancel the k + 1 polynomial coefficients of V'(z) and also ensure that the 1/z term has unit coefficient. This gives k + 2 equations to determine the k coefficients of Q(z) and the two edges λ±.

Once we know the polynomial Q(x), we can read off the eigenvalue density

\[
\rho(\lambda) = \frac{Q(\lambda)\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi} \qquad \text{for } \lambda_- \leq \lambda \leq \lambda_+.
\]

We see that generically the eigenvalue density behaves as ρ(λ± ∓ δ) ∝ √δ near both edges of the spectrum. If by chance (or by construction) one of the edges is a zero of Q(z), one then has a δ^{(2n+1)/2} behavior near that edge, where n is the multiplicity of the root of Q(z). A potential with the generic √δ behavior at the edges of the density is called non-critical, and critical otherwise.

4.2.5 M² + M⁴ Potential

One of the original motivations of Brézin, Itzykson, Parisi and Zuber to study the ensemble defined by Eq. (4.1) was to count planar diagrams. To do so they considered the potential

\[
V(x) = \frac{x^2}{2} + \frac{\gamma x^4}{4}. \tag{4.15}
\]

We will not discuss how one can count planar diagrams from such a potential but just compute the Stieltjes transform and the density of eigenvalues. Since the potential is symmetric around zero we expect λ+ = −λ− =: 2a. We introduce this extra factor of 2 so that for γ = 0 we obtain the semi-circle law with a = 1. Since V'(z) = z + γz³ is a degree three polynomial, we write

\[
Q(z) = a_0 + a_1 z + \gamma z^2,
\]

where the coefficient of z² was chosen to cancel the γz³ term at infinity. Expanding Eq. (4.14) near z → ∞ and imposing g(z) = 1/z + O(1/z²), we get

\[
a_1 = 0, \qquad 1 - a_0 + 2\gamma a^2 = 0, \qquad 2a^4\gamma + 2a^2 a_0 = 2.
\]



[Figure 4.3] (left) Density of eigenvalues for the potential V(x) = x²/2 + γx⁴/4 for three values of γ. For γ = 1, even if the minimum of the potential is at λ = 0, the density develops a double hump due to the repulsion of the eigenvalues. γ = 0 corresponds to the Wigner case (semi-circle law). Finally, γ = −1/12 is the critical value of γ; at this point the density is given by Eq. (4.18). For smaller values of γ the density does not exist. (right) Shape of the potential for the same three values of γ. The crosses on the bottom curve indicate the edges of the critical spectrum.

Solving for a0, we find

\[
g(z) = \frac{z + \gamma z^3 - \left(1 + 2\gamma a^2 + \gamma z^2\right)\sqrt{z - 2a}\,\sqrt{z + 2a}}{2}, \tag{4.16}
\]

where a is a solution of

\[
3\gamma a^4 + a^2 - 1 = 0 \quad\Rightarrow\quad a^2 = \frac{\sqrt{1 + 12\gamma} - 1}{6\gamma}.
\]

The density of eigenvalues for the potential (4.15) reads

\[
\rho(\lambda) = \frac{\left(1 + 2\gamma a^2 + \gamma\lambda^2\right)\sqrt{4a^2 - \lambda^2}}{2\pi} \qquad \text{for } \gamma > -\frac{1}{12}, \tag{4.17}
\]

with a defined as above. For positive values of γ, the potential is confining (it is convex and grows faster than a logarithm for z → ±∞). In that case the equation for a always has a solution, so the Stieltjes transform and the density of eigenvalues are well defined. For small negative values of γ, the problem still makes sense: the potential is convex near zero, and the eigenvalues will stay near zero as long as the repulsion does not push them too far into the non-convex region.

There is a critical value γ = γ∗ = −1/12, which corresponds to a = √2.



At this critical point, gc(z) and the density are given by

\[
g_c(z) = \frac{z^3}{24}\left[\left(1 - \frac{8}{z^2}\right)^{3/2} - 1 + \frac{12}{z^2}\right]
\qquad\text{and}\qquad
\rho_c(\lambda) = \frac{\left(8 - \lambda^2\right)^{3/2}}{24\pi}. \tag{4.18}
\]

At this point the density of eigenvalues at the upper edge (λ+ = 2√2) behaves as ρ(λ) ∼ (2√2 − λ)₊^{3/2}, and similarly at the lower edge (λ− = −2√2). For values of γ more negative than γ∗, there is no real solution for a and Eq. (4.16) ceases to make sense. In the Coulomb gas analogy, the eigenvalues push each other up to a critical point after which they run off to infinity. There is no simple argument that gives the location of the critical point (other than doing the above computation). It is given by a delicate balance between the repulsion of the eigenvalues and the confining potential. In particular it is not given by the point V'(2a) = 0, as one might naively expect. At the critical point V''(2a) = −1, so we are outside the convex region.

Bibliographical Notes

On the Vandermonde determinant and the classical theory of the joint eigenvalue distribution, see Mehta [2004]. The historical computation with the quartic potential and the counting of planar diagrams is due to Brézin et al. [1978].


5 Dyson Brownian Motion

In this chapter we would like to start our investigation of the addition of random matrices. We will start by studying how a fixed large matrix (random or not) is modified when a Wigner matrix is added to it. The elements of a Wigner matrix are Gaussian random numbers, and each can be written as a sum of Gaussian numbers with smaller variance. By pushing this reasoning to the limit, we can write the addition of a Wigner matrix as a continuous process of addition of infinitesimal Wigner matrices. This process, viewed through the eigenvalues and eigenvectors of the matrix, is what is called Dyson Brownian motion, after the physicist Freeman Dyson who first studied it.

Before defining Dyson Brownian motion, we will review classical Brownianmotion and the study of functions of Brownian motion: stochastic calculus.

5.1 Stochastic Calculus

5.1.1 Brownian Motion

The starting point of stochastic calculus is the Brownian motion (also called the Wiener process), usually written in differential form as

\[
dX_t = \mu\, dt + \sigma\, dB_t,
\]



[Figure 5.1] An example of Brownian motion.

with dt an infinitesimal time increment and dB_t a Gaussian variable with infinitesimal variance such that E[dB_t] = 0 and E[dB_t²] = dt. Its higher order moments are of order o(dt) as dt → 0. In other words, dB_t is a mean-zero random variable with fluctuations of order √dt.

When integrated the process becomes

X(t) = X0 + B(t),

where B(t) is a Gaussian random variable with mean µt and variance σ²t. The process X(t) is continuous but nowhere differentiable. Note that B(t) and B(t') are not independent, but their increments are: if t < t', then B(t) and B(t') − B(t) are independent.

The process X(t) up to time T can be understood in the following discrete way. We divide [0, T] according to t_k = kT/N, 0 ≤ k ≤ N, and let δt = T/N. Then X(t) can be written as

\[
X(t_k) = X_0 + \sum_{l=0}^{k-1} \mu\,\delta t + \sum_{l=0}^{k-1} \sigma B_l,
\]

where B_l ∼ N(0, δt) for each l. By construction X(t_N) = X(T). In the limit N → ∞, we have δt → dt, B_k → dB, and X(t_k) becomes a continuous time process X(t). Note the convention that X(t_k) is built from past increments B_l for



l < k but does not include B_k. This prescription is called the Ito prescription;¹ its main advantage is that X(t) is independent of the equal-time dB_t.
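The discrete construction above translates directly into a few lines of code; the following Python sketch (an illustration, with arbitrary choices of T, number of steps and parameters) simulates one Ito-discretized Brownian path.

# Sketch: discrete Brownian motion X(t_k) = X_0 + sum(mu*dt) + sum(sigma*B_l), B_l ~ N(0, dt).
import numpy as np

rng = np.random.default_rng(1)
T, n_steps = 1.0, 1000
mu, sigma, X0 = 0.0, 1.0, 0.0
dt = T / n_steps

dB = rng.normal(0.0, np.sqrt(dt), size=n_steps)        # independent increments of variance dt
X = X0 + np.concatenate(([0.0], np.cumsum(mu * dt + sigma * dB)))
print(X[-1], "typical magnitude ~", sigma * np.sqrt(T))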

5.1.2 Ito’s Lemma

We now study the behavior of functions of a Wiener process X(t). Because dB² is of order dt, we have to be careful when evaluating derivatives of functions of X(t).

Given a twice differentiable function F(x), we consider the process F(X(t)). Then we have

\[
F(X(t + \delta t)) = F(X(t)) + \delta X_t\, F'(X(t)) + \frac{(\delta X_t)^2}{2} F''(X(t)) + o(\delta t),
\]

where we have

δXt ≈ µδt + σδBt,

and

\[
(\delta X_t)^2 \approx \mu^2(\delta t)^2 + \sigma^2\delta t + \sigma^2\left[(\delta B_t)^2 - \delta t\right] + 2\mu\sigma\,\delta t\,\delta B_t \;\to\; \sigma^2 dt + o(dt).
\]

The random variable σ²(δB_t)² has mean σ²δt and fluctuates around this mean with standard deviation √2 σ²δt. Although these fluctuations are also of order δt, the fact that they have zero mean prevents them from contributing when integrated (their variance, which is additive, is of order (δt)² and goes to zero even when integrated). Thus, keeping the terms up to first order in δt and letting δt → dt, we get

\[
dF_t = \frac{\partial F}{\partial X}\, dX_t + \frac{\sigma^2}{2}\frac{\partial^2 F}{\partial X^2}\, dt, \tag{5.1}
\]

where, compared to ordinary calculus, we have a correction term depending on the second derivative of F. More generally, we can consider a process where µ and σ depend on X_t and t, i.e.,

dXt = µ(X(t), t)dt + σ(X(t), t)dBt.

¹ In the Stratonovich prescription, half of B_k contributes to X(t_k). In this prescription there is no Ito lemma, i.e. the chain rule applies without any correction term, but the price to pay is a correlation between X(t) and dB(t). We will not use the Stratonovich prescription in this text.



Then for the function F(X, t), we have

\[
dF_t = \frac{\partial F}{\partial X}\, dX_t + \left[\frac{\partial F}{\partial t} + \frac{\sigma^2(X(t), t)}{2}\frac{\partial^2 F}{\partial X^2}\right] dt, \tag{5.2}
\]

which is called Ito’s lemma.

Ito’s lemma can be extended to function of several stochastic variables. If wehave a collection of N independent stochastic variables Xi(t) (written vectori-ally as X(t)) satisfying

\[
dX_t^i = \mu_i(\mathbf{X}(t), t)\, dt + \sigma_i(\mathbf{X}(t), t)\, dB_t^i,
\]

and a function F(X(t), t) of the X_i(t)'s. The vectorial form of Ito's lemma states that F(X(t), t) must satisfy

\[
dF_t = \sum_{i=1}^{N} \frac{\partial F}{\partial X_i}\, dX_t^i + \left[\frac{\partial F}{\partial t} + \sum_{i=1}^{N} \frac{\sigma_i^2(\mathbf{X}(t), t)}{2}\frac{\partial^2 F}{\partial X_i^2}\right] dt. \tag{5.3}
\]

If the X_i(t) contain correlated Brownian motions, the last sum with the second derivative should also contain cross-terms with covariances multiplying mixed derivatives. We will not use such a correlated Ito formula in this book.

5.1.3 Variance as a Function of Time

As an illustration of how to use Ito's formula, let us compute the variance as a function of time for a simple stochastic process. Consider the stochastic process X_t such that

dXt = σdBt, X(0) = 0.

and the function F(x) = x². Applying Eq. (5.1), we get

\[
dF_t = 2X(t)\, dX_t + \sigma^2 dt \quad\Rightarrow\quad F(X_t) = 2\int_0^t \sigma X(s)\, dB_s + \sigma^2 t.
\]

In order to take the expectation value of this equation, we consider E[X(s)dB_s]. The random infinitesimal element dB_s is chosen after X(s) and is therefore independent of it, so E[X(s)dB_s] = 0. We get

\[
\mathbb{E}\left[X^2(t)\right] = \sigma^2 t.
\]

For Brownian motion, the variance from the origin grows linearly with time. The same result can be derived directly from the integrated form X(t) = σB(t), where B(t) is a Gaussian random variable of variance equal to t.

5.1.4 Gaussian Addition

Ito’s lemma can be used to compute a special case of the law of addition ofindependent random variables, namely when one of the variables is Gaussian.Consider the random variable Z = Y + X, where Y is some random variable, andX is an independent Gaussian (X ∼ N(µ, σ2)). The law of Z is uniquely deter-mined by its characteristic function, i.e., the Fourier transform of its probabilitydensity

\[
\varphi_Z(k) = \int e^{ikz} P(z)\, dz = \mathbb{E}\left[e^{ikZ}\right].
\]

We now let Z(t) be a Brownian motion with Z(0) = Y:

dZt = µdt + σdBt, Z(0) = Y.

Note that Z(1) has the same law as Z. We then study the family of functions F(Z_t) = e^{ikZ_t} with Ito's formula (Eq. (5.2)):

\[
dF_t = ik\, e^{ikZ_t}\, dZ_t - \frac{k^2\sigma^2}{2} e^{ikZ_t}\, dt = \left(ik\mu F - \frac{k^2\sigma^2}{2}F\right) dt + ik F\, dB_t.
\]

Taking the expectation value, writing F̄(t) := E[F_t], and noting that the differential d is a linear operator and therefore commutes with the expectation value, we obtain

\[
d\bar{F}_t = \left(ik\mu - \frac{k^2\sigma^2}{2}\right)\bar{F}(t)\, dt
\quad\Rightarrow\quad
\frac{1}{\bar{F}(t)}\frac{d}{dt}\bar{F}(t) = \frac{d}{dt}\log\big(\bar{F}(t)\big) = ik\mu - \frac{k^2\sigma^2}{2}.
\]



From its solution at t = 1, we get

\[
\log\big(\varphi_Z(k)\big) = \log\big(\varphi_Y(k)\big) + ik\mu - \frac{k^2\sigma^2}{2}.
\]

Using the fact that a Gaussian random variable with mean µ and variance σ² has characteristic function
\[
\varphi_X(k) = e^{ik\mu - \frac{k^2\sigma^2}{2}},
\]
we can rewrite our result as

log (ϕZ(k)) = log (ϕY (k)) + log (ϕX(k)) .

We have recovered the fact that the log-characteristic function is additive under the addition of independent random variables. Although the result is true in general, the calculation above using stochastic calculus is only valid if one of the random variables is Gaussian.

After introducing Dyson Brownian motion, we will perform a similar computation for matrices and get our first intuition about the addition of large random matrices.

5.2 Stochastic Matrices

5.2.1 Perturbation Theory in Quantum Mechanics

We would like to generalize the ideas of Brownian motion and stochastic calculus to random matrices. Imagine a matrix that evolves through the addition of small iid Gaussian random numbers to all its elements. We would like to compute the time evolution of its eigenvalues and eigenvectors. In order to understand the stochastic calculus of random matrices, we start by considering small (random) perturbations of a known matrix.

We begin with a perturbed symmetric matrix

H = H0 + εH1,

where H0 is an unperturbed symmetric matrix whose eigenvalues and eigenvectors are assumed to be known, ε is a small parameter, and H1 is a symmetric



matrix that gives the perturbation. In fact, this is the same setting as perturbation theory in quantum mechanics, where H, H0 and H1 are Hermitian matrices giving the Hamiltonians of a quantum system. The equations we will get are the same in the real symmetric and complex Hermitian cases.

Suppose λ_0^i, 1 ≤ i ≤ N, are the eigenvalues of H0 and v_0^i, 1 ≤ i ≤ N, are the corresponding eigenvectors. We assume that the perturbed eigenvalues and eigenvectors are given by the asymptotic series

\[
\lambda_\varepsilon^i = \lambda_0^i + \sum_{k} \varepsilon^k \lambda_k^i, \qquad
v_\varepsilon^i = v_0^i + \sum_{k} \varepsilon^k v_k^i, \tag{5.4}
\]

with the constraint that

\[
\| v_\varepsilon^i \| = \| v_0^i \| = 1, \qquad 1 \leq i \leq N.
\]

The quantity ‖v_ε^i‖ must be constant; in particular, its first order variation with respect to ε must be zero. This constraint gives

\[
v_1^i \perp v_0^i.
\]

We assume that all λ_0^i are different from each other, i.e. we consider non-degenerate perturbation theory. Then plugging (5.4) into

\[
H v_\varepsilon^i = \lambda_\varepsilon^i\, v_\varepsilon^i
\]

and comparing terms of the same order in ε, one obtains

\[
\lambda_\varepsilon^i = \lambda_0^i + \varepsilon\, v_0^{iT} H_1 v_0^i
+ \varepsilon^2 \sum_{\substack{j=1 \\ j\neq i}}^{N} \frac{\left| v_0^{jT} H_1 v_0^i \right|^2}{\lambda_0^i - \lambda_0^j} + O(\varepsilon^3), \tag{5.5}
\]

and

\[
v_\varepsilon^i = v_0^i + \varepsilon \sum_{j \neq i} \frac{v_0^{jT} H_1 v_0^i}{\lambda_0^i - \lambda_0^j}\, v_0^j + O(\varepsilon^2).
\]

Notice that the first order correction to v_ε^i is perpendicular to v_0^i, as it does not have a component in that direction.



5.2.2 Dyson Brownian Motion

Next we use the above formulas to derive the so-called Dyson Brownian motion (DBM), which gives the evolution of the eigenvalues of a random matrix plus a Wigner matrix whose variance grows linearly with time (what we shall call a matrix DBM later). Let M0 be the initial matrix (random or not) and X1 a unit Wigner matrix independent of M0. Then we study the eigenvalues of

M = M0 + σX1

using (5.5).

The derivation of DBM is much simpler if we use the rotational invariance of the Wigner ensemble. The matrix X1 has the same law in any basis; we therefore choose to express it in the diagonal basis of M0. In order to do so we must work with the exactly rotationally invariant Wigner ensemble, whose diagonal variance is twice the off-diagonal variance.

First, for the first order term (in terms of ε), we have

\[
w_i := v_0^{iT} X_1 v_0^i = (X_1)_{ii} \sim \mathcal{N}\!\left(0, \frac{2}{N}\right).
\]

Note that the (X1)ii are independent for different i’s.

Then we study the second order term. We have

\[
w_{ji} := v_0^{jT} X_1 v_0^i = (X_1)_{ji} \sim \mathcal{N}\!\left(0, \frac{1}{N}\right).
\]

So w_{ji}² is a random variable with mean 1/N. Its fluctuations are also of order 1/N, but they have zero mean, so when integrated over time their contribution is of order higher than dt/N. In other words, w_{ji}² can be treated as deterministic.

Now using (5.5) with σ² = dt, we get that

\[
d\lambda_i = \sqrt{\frac{2}{N}}\, dB_i + \frac{1}{N}\sum_{\substack{j=1 \\ j\neq i}}^{N} \frac{dt}{\lambda_i - \lambda_j}, \tag{5.6}
\]

where dBi denotes a Brownian increment which comes from the σwi term. One



[Figure 5.2] A simulation of DBM for an N = 25 matrix starting from a Wigner matrix with σ² = 1/4 and evolving for one unit of time.

can derive a similar process for the eigenvectors, which we give here for completeness:

\[
dv_i = \frac{1}{\sqrt{N}} \sum_{\substack{j=1 \\ j\neq i}}^{N} \frac{dB_{ij}}{\lambda_i - \lambda_j}\, v_j
- \frac{1}{2N} \sum_{\substack{j=1 \\ j\neq i}}^{N} \frac{dt}{(\lambda_i - \lambda_j)^2}\, v_i, \tag{5.7}
\]

where dB_{ij} = dB_{ji} (i ≠ j) is a symmetric collection of Brownian motions, independent of each other and of the dB_i above.

The formulas (5.6) and (5.7) give the Dyson Brownian motion for the stochastic evolution of the eigenvalues and eigenvectors of matrices of the form

M = M0 + X(t), (5.8)

where M0 is some initial matrix and X(t) is an independent Wigner matrix with parameter σ² = t. We shall call the above matrix process a matrix Dyson Brownian motion (MDBM).

In our study of large random matrices, we will be interested in DBM when N is large, but in fact DBM is well defined and correct for any N. As with the Ito lemma, we assumed that the Gaussian process can be divided into infinitesimal increments and that perturbation theory becomes exact at that scale. We made no assumption about the size of N. We did need a rotationally invariant Gaussian process, so the diagonal variance must be twice the off-diagonal one. In the most



extreme example of N = 1, the eigenvalue of a 1 × 1 matrix is just the value of its only element. Under DBM it undergoes a Brownian motion with a variance of 2 per unit time.
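The eigenvalue equation (5.6) can be integrated numerically with an explicit Euler scheme; the following Python sketch (an illustration in the spirit of Figure 5.2, with arbitrary N, time step and initial condition, and with a sort added purely as a numerical safeguard) produces one trajectory per eigenvalue.

# Sketch: direct Euler simulation of the eigenvalue DBM (5.6).
import numpy as np

rng = np.random.default_rng(2)
N, dt, n_steps = 25, 1e-3, 1000

H = rng.standard_normal((N, N))
X0 = (H + H.T) / np.sqrt(8 * N)            # Wigner start with sigma^2 = 1/4
lam = np.linalg.eigvalsh(X0)

traj = [lam.copy()]
for _ in range(n_steps):
    diffs = lam[:, None] - lam[None, :]
    np.fill_diagonal(diffs, np.inf)        # exclude the j = i term
    drift = np.sum(1.0 / diffs, axis=1) * dt / N
    lam = np.sort(lam + np.sqrt(2.0 * dt / N) * rng.standard_normal(N) + drift)
    traj.append(lam.copy())
traj = np.array(traj)                      # shape (n_steps + 1, N): one path per eigenvalue
print(traj[-1].min(), traj[-1].max())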

Exercise

5.1 Variance as a function of time under DBM

Consider the Dyson Brownian motion for a finite N matrix.

\[
d\lambda_i = \sqrt{\frac{2}{N}}\, dB_i + \frac{1}{N}\sum_{\substack{j=1 \\ j\neq i}}^{N} \frac{dt}{\lambda_i - \lambda_j}
\]

and the function F(λi) that computes the second moment

\[
F(\{\lambda_i\}) = \frac{1}{N}\sum_{i=1}^{N} \lambda_i^2.
\]

(a) Write down the stochastic process for F({λi}) using the vectorial Ito formula (5.3). In the case at hand F does not depend explicitly on time and σ_i² = 2/N. You will need to use the following identity:

\[
2\sum_{\substack{i,j=1 \\ j\neq i}}^{N} \frac{\lambda_i}{\lambda_i - \lambda_j}
= \sum_{\substack{i,j=1 \\ j\neq i}}^{N} \frac{\lambda_i - \lambda_j}{\lambda_i - \lambda_j} = N(N-1).
\]

(b) Take the expectation value of your equation and show that F̄(t) ≡ E[F({λi(t)})] satisfies

\[
\bar F(t) = \bar F(0) + \frac{N+1}{N}\, t.
\]

Do not assume that N is large.
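A quick numerical check of part (b) (a sketch, not part of the exercise): since the DBM at time t is equivalent to adding a Wigner matrix of variance t to M0, Eq. (5.8), one can verify the formula by direct sampling of M0 + X(t). The choice of M0, N and t below is arbitrary.

# Sketch: check E[tr M(t)^2]/N = F(0) + (N+1)t/N for M(t) = M0 + GOE of variance t.
import numpy as np

rng = np.random.default_rng(3)
N, t, n_samples = 5, 1.0, 20_000
M0 = np.diag(np.linspace(-1.0, 1.0, N))              # arbitrary starting matrix
F0 = np.trace(M0 @ M0) / N

vals = []
for _ in range(n_samples):
    H = rng.standard_normal((N, N)) * np.sqrt(t / N)
    X = (H + H.T) / np.sqrt(2)                       # GOE: off-diag var t/N, diag var 2t/N
    M = M0 + X
    vals.append(np.trace(M @ M) / N)
print(np.mean(vals), "vs theory", F0 + (N + 1) / N * t)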

5.2.3 DBM as a Consequence of Ito

Another way to derive the Dyson Brownian motion for the eigenvalues is to consider the matrix Brownian motion (5.8) as Brownian motion on the elements



of the matrix X. We have to treat the diagonal and off-diagonal elements separately because we want to use the rotationally invariant Wigner matrix (GOE), with diagonal variance equal to twice the off-diagonal variance. Also, only half of the off-diagonal elements are independent (the matrix X is symmetric). We have

\[
dX_{kk} = \sqrt{\frac{2}{N}}\, dB_{kk} \qquad\text{and}\qquad dX_{kl} = \sqrt{\frac{1}{N}}\, dB_{kl} \quad\text{for } k < l,
\]

where the dB_{kk} and dB_{kl} are, respectively, N and N(N − 1)/2 independent unit Brownian motions.

Each eigenvalue λi is a function of the matrix elements of X. We can use the vectorial form of Ito's lemma (5.3) to write a stochastic differential equation for λi(X):

\[
d\lambda_i = \sum_{k=1}^{N} \frac{\partial \lambda_i}{\partial X_{kk}} \sqrt{\frac{2}{N}}\, dB_{kk}
+ \sum_{k=1}^{N}\sum_{l=k+1}^{N} \frac{\partial \lambda_i}{\partial X_{kl}} \sqrt{\frac{1}{N}}\, dB_{kl}
+ \sum_{k=1}^{N} \frac{\partial^2 \lambda_i}{\partial X_{kk}^2} \frac{dt}{N}
+ \sum_{k=1}^{N}\sum_{l=k+1}^{N} \frac{\partial^2 \lambda_i}{\partial X_{kl}^2} \frac{dt}{2N}. \tag{5.9}
\]

The key is to be able to compute the following partial derivatives

\[
\frac{\partial \lambda_i}{\partial X_{kk}}, \qquad \frac{\partial \lambda_i}{\partial X_{kl}}, \qquad
\frac{\partial^2 \lambda_i}{\partial X_{kk}^2}, \qquad \frac{\partial^2 \lambda_i}{\partial X_{kl}^2}, \qquad k < l.
\]

Since X_t is rotationally invariant, we can make a change of basis with orthogonal matrices such that X_0 is diagonal:

X0 = diag(λ1(0), . . . , λN(0)).

In order to compute the partial derivatives above, we consider a single perturbation of the matrix X: first we modify a diagonal element, and later we will modify an off-diagonal element.

A perturbation of the diagonal entry δX_{kk} affects λ_i with i = k linearly and leaves all other eigenvalues unaffected:

\[
\lambda_i(\delta X_{kk}) = \lambda_i + \delta X_{kk}\, \delta_{ki}.
\]



Thus we have
\[
\frac{\partial \lambda_i}{\partial X_{kk}} = \delta_{ik}, \qquad \frac{\partial^2 \lambda_i}{\partial X_{kk}^2} = 0.
\]

Now we discuss how a perturbation in an off-diagonal entry affects the eigenvalues. Again we work at t = 0 (i.e. X is diagonal). A perturbation of the X_{kl} = X_{lk} entry gives

\[
X = \begin{pmatrix}
\lambda_1 & & & & & & \\
& \ddots & & & & & \\
& & \lambda_k & & \delta X_{kl} & & \\
& & & \ddots & & & \\
& & \delta X_{kl} & & \lambda_l & & \\
& & & & & \ddots & \\
& & & & & & \lambda_N
\end{pmatrix}.
\]

Since this matrix is block diagonal (after a simple permutation), the eigenvalues in the 1 × 1 blocks are not affected by the perturbation, so

\[
\frac{\partial \lambda_i}{\partial X_{kl}} = 0, \qquad \frac{\partial^2 \lambda_i}{\partial X_{kl}^2} = 0, \qquad \forall\, i \neq k, l.
\]

On the other hand, the eigenvalues of the block
\[
\begin{pmatrix} \lambda_k & \delta X_{kl} \\ \delta X_{kl} & \lambda_l \end{pmatrix}
\]
are modified and become

\[
\lambda_\pm = \frac{\lambda_k + \lambda_l}{2} \pm \frac{\lambda_k - \lambda_l}{2}\sqrt{1 + \frac{4(\delta X_{kl})^2}{(\lambda_k - \lambda_l)^2}}.
\]

We can expand to second order in δXkl and find

\[
\lambda_k(\delta X_{kl}) = \lambda_k + \frac{(\delta X_{kl})^2}{\lambda_k - \lambda_l}
\qquad\text{and}\qquad
\lambda_l(\delta X_{kl}) = \lambda_l + \frac{(\delta X_{kl})^2}{\lambda_l - \lambda_k}.
\]

We readily see that the first partial derivative of λi with respect to an off-diagonal



element is always zero.

\[
\frac{\partial \lambda_i}{\partial X_{kl}} = 0 \qquad \text{for } k < l.
\]

For the second derivative we get

\[
\frac{\partial^2 \lambda_i}{\partial X_{kl}^2} = \frac{2\delta_{ik}}{\lambda_i - \lambda_l} + \frac{2\delta_{il}}{\lambda_i - \lambda_k} \qquad \text{for } k < l.
\]

Of the two terms on the right-hand side, the first exists only if l > i, while the second is present only when k < i. So for a given i, only N − 1 terms of the form 2/(λi − λj) are present. Putting everything back into Eq. (5.9), we find

\[
d\lambda_i = \sqrt{\frac{2}{N}}\, dB_i + \frac{1}{N}\sum_{\substack{j=1 \\ j\neq i}}^{N} \frac{dt}{\lambda_i - \lambda_j},
\]

where the dB_i are independent Brownian motions (the old dB_{kk} for k = i). We have recovered Equation (5.6).

5.2.4 Stochastic vs Deterministic DBM

In the rest of this section, we make some remarks on the DBM

\[
d\lambda_i = \sqrt{\frac{2}{N}}\, dB_i + \frac{1}{N}\sum_{\substack{j=1 \\ j\neq i}}^{N} \frac{dt}{\lambda_i - \lambda_j}.
\]

Here the dB_i term is a stochastic term which leads to the Ito term when using Ito calculus, and the second term is a deterministic Dyson term which describes the interactions between different eigenvalues. On average, the stochastic term can be ignored (since its expectation vanishes), and we are left with the deterministic Dyson equation

\[
d\lambda_i = \frac{1}{N}\sum_{\substack{j=1 \\ j\neq i}}^{N} \frac{dt}{\lambda_i - \lambda_j}.
\]



This equation gives deterministic "rigid" positions for the eigenvalues: each eigenvalue settles close to a classical location fixed by the limiting density. Physically, the eigenvalues can be thought of as particles evolving through pairwise repulsion, which is the point of view we used in Section 4.1.

For single-eigenvalue properties, the fluctuations are not important, due to the self-averaging property. Thus each eigenvalue is roughly distributed as ρ(λ), where ρ is the asymptotic distribution of the eigenvalues of the random matrix we are considering. Suppose we order the eigenvalues as λ1 ≤ . . . ≤ λN. For each i we define γi as
\[
\gamma_i = \inf\left\{ x : \int_{-\infty}^{x} \rho(\lambda)\, d\lambda = \frac{i}{N} \right\}.
\]

In the large N limit, the eigenvalues λi are very rigid around the γi. Now consider three cases for the distribution of the λi, 1 ≤ i ≤ N:

• λi are i.i.d. with distribution ρ(λ).

• λi are the random matrix eigenvalues.

• λi = γi for all i.

In the large N limit, all three cases have roughly the same single eigenvalue distribution ρ. But the correlations between eigenvalues and the small-scale fluctuations are very different. For the random matrix statistics, the Dyson force between the eigenvalues is repulsive, such that λN increases and λ1 decreases. The eigenvalues never cross each other: the Brownian motion may bring two eigenvalues close, but the Dyson force repels them as 1/(λ_{i+1} − λ_i).

One can use DBM to study differences between eigenvalues, statistics of extremal eigenvalues, correlations between eigenvalues, and so on.

5.3 Addition of a Large Wigner Matrix Using DBM

Now that we have defined DBM, we will apply it to the problem of the addition of large random matrices. Specifically, we will show how the Stieltjes transform



g(z) of a large matrix M is modified when we add a Wigner matrix. We will obtain Burgers' equation for the time evolution of gN(z). To solve this equation we will introduce the R-transform. The tool of DBM only allows us to consider the addition of a Wigner matrix to a general matrix. In the next two chapters we will see that the R-transform is also the right object to compute the spectrum of the sum of two general random matrices.

5.3.1 DBM for the Stieltjes Transform: Burgers’ Equation

Consider a matrix M(t) that undergoes DBM starting from a matrix M0. At each time t, M(t) can be viewed as the sum of the matrix M0 and a Wigner matrix of variance t:

M(t) = M0 + Xt. (5.10)

In order to understand the spectrum of the matrix M(t), let us compute its Stieltjes transform g(z), which is the expectation value of the large N limit of the function gN(z) defined by

\[
g_N(z, \{\lambda_i\}) := \frac{1}{N}\sum_{i=1}^{N} \frac{1}{z - \lambda_i}.
\]

gN can be seen as a function of the eigenvalues λi that undergo DBM, while the parameter z stays constant. Since the eigenvalues evolve with time, gN is really a function of both z and t. We can use Ito's lemma to write a stochastic differential equation for gN(z, {λi}). First we need to compute the following partial derivatives:

\[
\frac{\partial g_N}{\partial \lambda_i} = \frac{1}{N}\frac{1}{(z - \lambda_i)^2}, \qquad
\frac{\partial^2 g_N}{\partial \lambda_i^2} = \frac{2}{N}\frac{1}{(z - \lambda_i)^3}.
\]

We can now apply Ito (5.3) to DBM (5.6) and find

\[
dg_N = \frac{1}{N}\sqrt{\frac{2}{N}}\sum_{i=1}^{N} \frac{dB_i}{(z - \lambda_i)^2}
+ \frac{1}{N^2}\sum_{\substack{i,j=1 \\ j\neq i}}^{N} \frac{dt}{(z - \lambda_i)^2(\lambda_i - \lambda_j)}
+ \frac{2}{N^2}\sum_{i=1}^{N} \frac{dt}{(z - \lambda_i)^3}.
\]



The second term is singular when i = j (not included in the sum). We can get rid of this apparent singularity by symmetrizing the expression in i and j. To do so, we realize that i and j are dummy indices that are summed over; we can rename i ↔ j and get the same expression. Adding the two versions and dividing by 2, we get that this term is given by

\[
\begin{aligned}
\frac{1}{N^2}\sum_{\substack{i,j=1 \\ j\neq i}}^{N} \frac{dt}{(z - \lambda_i)^2(\lambda_i - \lambda_j)}
&= \frac{1}{2N^2}\sum_{\substack{i,j=1 \\ j\neq i}}^{N} \left[\frac{dt}{(z - \lambda_i)^2(\lambda_i - \lambda_j)} + \frac{dt}{(z - \lambda_j)^2(\lambda_j - \lambda_i)}\right] \\
&= \frac{1}{2N^2}\sum_{\substack{i,j=1 \\ j\neq i}}^{N} \frac{(2z - \lambda_i - \lambda_j)\, dt}{(z - \lambda_i)^2(z - \lambda_j)^2}
= \frac{1}{N^2}\sum_{\substack{i,j=1 \\ j\neq i}}^{N} \frac{dt}{(z - \lambda_i)(z - \lambda_j)^2} \\
&= \frac{1}{N^2}\sum_{i,j=1}^{N} \frac{dt}{(z - \lambda_i)(z - \lambda_j)^2} - \frac{1}{N^2}\sum_{i=1}^{N} \frac{dt}{(z - \lambda_i)^3}
= -g_N \frac{\partial g_N}{\partial z}\, dt - \frac{1}{N^2}\sum_{i=1}^{N} \frac{dt}{(z - \lambda_i)^3}.
\end{aligned}
\]

Thus we have

\[
\begin{aligned}
dg_N &= \frac{1}{N}\sqrt{\frac{2}{N}}\sum_{i=1}^{N} \frac{dB_i}{(z - \lambda_i)^2}
- g_N \frac{\partial g_N}{\partial z}\, dt + \frac{1}{N^2}\sum_{i=1}^{N} \frac{dt}{(z - \lambda_i)^3} \\
&= \frac{1}{N}\sqrt{\frac{2}{N}}\sum_{i=1}^{N} \frac{dB_i}{(z - \lambda_i)^2}
- g_N \frac{\partial g_N}{\partial z}\, dt + \frac{1}{2N}\frac{\partial^2 g_N}{\partial z^2}\, dt.
\end{aligned}
\]

Taking the expectation (such that the dB_i term vanishes), we get

\[
d\,\mathbb{E}[g_N(z)] = -\mathbb{E}\!\left[g_N(z)\frac{\partial g_N(z)}{\partial z}\right] dt + \frac{1}{2N}\mathbb{E}\!\left[\frac{\partial^2 g_N(z)}{\partial z^2}\right] dt.
\]

This equation is exact for any N. We can now take the N → ∞ limit. Using the fact that the Stieltjes transform is self-averaging, we get a PDE for g(z, t):

\[
\frac{\partial g}{\partial t} = -g\, \frac{\partial g}{\partial z}. \tag{5.11}
\]

We have written the derivative with respect to time as a partial derivative because we were considering the evolution of gN for a fixed z. Equation (5.11) is called the inviscid Burgers' equation.



5.3.2 Solution of Burgers’ Equation: R-Transform

Let M(t) = M(0) + X(t) be a large matrix DBM as in (5.8), and recall that its Stieltjes transform g(z) satisfies the Burgers' equation (5.11), with initial condition g(z, 0) = g0(z) := g_{M(0)}(z). Using the method of characteristics, one can show that

g(z, t) = g0(z − tg(z, t)). (5.12)

In fact, one can verify that (5.12) satisfies the equation (5.11).

Example: Suppose M(0) = 0. Then we have g0(z) = z^{-1}. Plugging into (5.12), we obtain

\[
g(z, t) = \frac{1}{z - t\, g(z, t)},
\]

which is the self-consistent equation (2.21) in the Wigner case with σ² = t. Indeed, if we start with the zero matrix, then M(t) = X(t) is just a Wigner matrix with parameter σ² = t.

Back to the general case, we denote

\[
g_t(z) = g_0(z - t\, g_t(z)).
\]

We denote by z_t(g) the inverse function of g_t(z) and by z_0(g) the inverse function of g_0(z). Now fix g = g_t(z) = g_0(z − tg) and z = z_t(g); applying z_0 to g we get

\[
z_0(g) = z - tg = z_t(g) - tg \quad\Rightarrow\quad z_t(g) = z_0(g) + tg. \tag{5.13}
\]

The inverse of the Stieltjes transform of M(t) is thus given by the inverse of that of M(0) plus a simple shift tg. If we know g0(z), we can compute its inverse, easily obtain the inverse z_t(g), and hopefully be able to recover g_t(z).

Example: Suppose M(0) is a Wigner matrix with variance σ². We first want to compute the inverse of g0(z); to do so we use the fact that g0(z) satisfies equation (2.21), and we get

\[
z_0(g) = \sigma^2 g + \frac{1}{g}.
\]



Then by (5.13), we get that

\[
z_t(g) = z_0(g) + tg = \left(\sigma^2 + t\right) g + \frac{1}{g},
\]

which is the z(g) of a Wigner matrix with variance σ² + t. In other words, g_t(z) satisfies the Wigner equation (2.21) with σ² replaced by σ² + t. This result is not surprising: each element of the sum of two Wigner matrices is just the sum of Gaussian random variables, so M(t) is itself a Wigner matrix whose variance is the sum of the variances.

We can now tackle the more general case where the initial matrix is not necessarily Wigner. Let B = A + X, where X is a Wigner matrix that has variance t and is independent of A. Then by (5.13), we get

\[
z_B(g) = z_A(g) + tg = z_A(g) + z_X(g) - \frac{1}{g}.
\]

We define the R-transform as

\[
R(g) := z(g) - \frac{1}{g}. \tag{5.14}
\]

Then we have the nice additive relation

RB(g) = RA(g) + RX(g).

In the next two chapters we will generalize this law of addition to large matrices that are not necessarily Wigner.
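Before moving on, here is a small numerical illustration (a sketch with arbitrary choices of M0, t and z) of the characteristics solution (5.12): the fixed point of g = g0(z − tg) matches the empirical Stieltjes transform of M0 plus a sampled Wigner matrix.

# Sketch: check g_t(z) = g_0(z - t*g_t(z)) for M(t) = M0 + Wigner(t).
import numpy as np

rng = np.random.default_rng(4)
N, t = 1000, 0.25

lam0 = np.linspace(-1.0, 1.0, N)                 # spectrum of M0 (arbitrary choice)
M0 = np.diag(lam0)
H = rng.standard_normal((N, N)) * np.sqrt(t / N)
X = (H + H.T) / np.sqrt(2)                       # GOE with sigma^2 = t
lam = np.linalg.eigvalsh(M0 + X)

def g0(z):                                       # Stieltjes transform of M0
    return np.mean(1.0 / (z - lam0))

z = 2.5 + 0.02j
g = 1.0 / z                                      # fixed-point iteration of g = g0(z - t*g)
for _ in range(200):
    g = g0(z - t * g)

g_emp = np.mean(1.0 / (z - lam))
print(g, g_emp)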

Exercises

5.2 Taylor series for the R-transform

Let g(z) be the Stieltjes transform of a random matrix M.

\[
g(z) = \tau\!\left((z\mathbf{1} - M)^{-1}\right) = \int_{\operatorname{supp}\rho} \frac{\rho(\lambda)\, d\lambda}{z - \lambda}.
\]

We saw that the power series of g(z) around z = ∞ is given by the moments



of M (m_n ≡ τ(M^n)):

\[
g(z) = \sum_{n=0}^{\infty} \frac{m_n}{z^{n+1}} \qquad \text{with } m_0 \equiv 1.
\]

Call z(g) the functional inverse of g(z), which is well-defined in a neighborhood of g = 0, and define R(g) as

R(g) = z(g) − 1/g

(a) By writing the power series of R(g) near zero, show that R(g) is regular at zero and that R(0) = m_1. Therefore the power series of R(g) starts at g⁰:

\[
R(g) = \sum_{n=1}^{\infty} \kappa_n\, g^{n-1}.
\]

(b) Now assume m_1 = κ_1 = 0 and compute κ_2, κ_3 and κ_4 as functions of m_2, m_3 and m_4 in that case.

(c) Using your answer from Exercise 2.1: if A is a random matrix drawn from a well-behaved ensemble with Stieltjes transform g_A(z) and R-transform R_A(g), what is the R-transform of the random matrices αA and A + β1, where α and β are non-zero real numbers?

5.3 Sum of a symmetric orthogonal matrix and a Wigner matrix

Consider, as in Exercise 0.1, a random symmetric orthogonal matrix M and a Wigner matrix X of variance σ². We are interested in the spectrum of their sum E = M + X.

(a) Given that the eigenvalues of M are ±1 and that in the large N limit each eigenvalue appears with weight 1/2, write the limiting Stieltjes transform g_M(z).

(b) E can be thought of as undergoing Dyson Brownian motion starting at E(0) = M and reaching the desired E at t = σ². Use Eq. (5.12) to write an equation for g_E(z). This will be a cubic equation in g.

(c) You can obtain the same equation using the inverse function zM(g) of



your answer in (a). Show that

\[
z_M(g) = \frac{1 + \sqrt{1 + 4g^2}}{2g},
\]

where one has to pick the root that makes z(g) ≈ 1/g near g = 0.

(d) Using Eq. (5.13), write z_E(g) and invert this relation to obtain an equation for g_E(z). You should recover the same equation as in (b).

(e) Eigenvalues of E will be located where your equation admits non-real solutions for real z. First look at z = 0; the equation becomes quadratic after factoring out a trivial root. Find a criterion on σ² such that the equation admits non-real solutions. Compare with your answer in Ex. 0.1(b).

(f) At σ² = 1, the equation is still cubic but somewhat simpler. A real cubic equation of the form ax³ + bx² + cx + d = 0 has non-real solutions iff ∆ < 0, where ∆ = 18abcd − 4b³d + b²c² − 4ac³ − 27a²d². Using this criterion, show that for σ² = 1 the edges of the eigenvalue spectrum are given by λ = ±3√3/2 ≈ ±2.60.

(g) Again at t = 1, the solution near g(0) = 0 can be expanded in fractional powers of z. Show that we have
\[
g(z) = z^{1/3} + O(z), \quad\text{which implies}\quad \rho(x) = \frac{\sqrt{3}}{2\pi}\,\sqrt[3]{|x|},
\]
for x near zero.

for x near zero.

(h) For σ2 = 1/2, 1 and 2, solve numerically the cubic equation for gE(z)for z = x real and plot the density of eigenvalues ρ(x) = | Im(gE(x))|/πfor one of the complex roots if present.

5.3.3 Existence of R-Transform near Zero

To solve Burgers’ equation we invoked z(g) the inverse of the Stieltjes trans-form g(z). For this to make sense we need to show that g(z) is invertible at least



in some region of the complex plane. The limiting Stieltjes transform can be written in terms of the density of eigenvalues:

\[
g(z) = \int_{\lambda_-}^{\lambda_+} \frac{\rho(\lambda)}{z - \lambda}\, d\lambda,
\]

where λ± are the extreme edges of the eigenvalue spectrum and ρ(λ) is a positive function (the density). From this expression we see that g(z) is positive and monotonically decreasing from z = λ+ all the way to infinity, and similarly it is negative and monotonically increasing from minus infinity to λ−. Therefore g(z) is invertible for large enough z.

In fact for large z, we have

\[
g(z) = z^{-1} + O(z^{-2}).
\]

Thus for small enough g, there always exists a large z such that

\[
z = z(g) = \frac{1}{g} + O(1).
\]

Thus while g(z) is analytic at ∞, R(g) is analytic near 0. For large g, R(g) may not exist.²

² For real x > λ+, g(x) is monotonically decreasing and therefore invertible, but it ceases to be so precisely at x = λ+. Similarly for x < λ−. So the function R(g) = z(g) − 1/g is only well defined for g(λ−) < g < g(λ+); note that g(λ−) < 0. The function R(g) defined near zero may be analytic beyond g(λ+), but it is no longer the inverse (minus 1/g) of g(z) beyond that point. For the unit Wigner matrix R(g) = g is analytic everywhere, but it is related to the inverse of g(z) only for −1 < g < 1.

For Wigner matrices, we have seen that

\[
R(g) = \sigma^2 g. \tag{5.15}
\]

For a white Wishart matrix (Marcenko-Pastur law), using (3.4) we get

\[
R(g) = \frac{1}{1 - qg}. \tag{5.16}
\]

Exercise

5.4 Sum of a white Wishart and a Wigner matrix

Let W be a white Wishart



matrix with parameter q = N/T and X a Wigner matrix with off-diagonal variance σ²/N. We saw that their R-transforms are given by

\[
R_W(g) = \frac{1}{1 - qg} \qquad\text{and}\qquad R_X(g) = \sigma^2 g.
\]

(a) Let E = W + X. What is RE(g)?

(b) By inverting the relation between R(g) and g(z), write an equation for g_E(z). It is a cubic equation; don't panic!

(c) For real and large z your equation will have three real solutions (g → ±∞, g → 0, g → 1/q as z → ±∞). For z in a finite interval containing z = 1, the equation will have one real (spurious) root and two complex roots (complex conjugates of each other). The imaginary part of either of these complex solutions (with the correct sign) will give us the density of eigenvalues of E via
\[
\rho(x) = \lim_{\eta\to 0^+} \frac{\operatorname{Im} g_E(x - i\eta)}{\pi}.
\]
Find analytically the density of eigenvalues of E, or alternatively solve numerically for a few values of q and σ². Check your result for q = 0 (pure Wigner) and σ² = 0 (pure Marcenko-Pastur). Show a plot of ρ(x) for q = σ² = 1/2.

(d) Generate numerically the matrix E (q = σ² = 1/2) by summing a Wigner and a Wishart matrix for N = 1000. Compare the spectrum of eigenvalues with your plot of ρ(x).
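A possible numerical treatment of parts (c) and (d) (a sketch, not the book's solution; it assumes the cubic σ²q g³ − (σ² + qz)g² + (z + q − 1)g − 1 = 0, obtained by inverting z(g) = 1/(1 − qg) + σ²g + 1/g, which is what part (b) should produce):

# Sketch for Exercise 5.4(c)-(d): density of E = W + X from the assumed cubic, vs a sample.
import numpy as np
import matplotlib.pyplot as plt

q, sigma2, N = 0.5, 0.5, 1000
T = int(N / q)

def rho_theory(x):
    roots = np.roots([sigma2 * q, -(sigma2 + q * x), x + q - 1.0, -1.0])
    return np.abs(roots.imag).max() / np.pi

rng = np.random.default_rng(5)
H = rng.standard_normal((N, T))
W = H @ H.T / T                                   # white Wishart
G = rng.standard_normal((N, N)) * np.sqrt(sigma2 / N)
X = (G + G.T) / np.sqrt(2)                        # Wigner with off-diagonal variance sigma2/N
eig = np.linalg.eigvalsh(W + X)

x = np.linspace(eig.min() - 0.5, eig.max() + 0.5, 400)
plt.hist(eig, bins=60, density=True, alpha=0.5, label="N = 1000 sample")
plt.plot(x, [rho_theory(xi) for xi in x], label="cubic equation")
plt.xlabel("x"); plt.ylabel("rho(x)"); plt.legend(); plt.show()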

5.4 Dyson Brownian Motion with a Potential

In this section, we will learn about another application of Dyson Brownian motion, namely sampling a random matrix distribution from the orthogonal (or another beta) ensemble. Before we study DBM with a potential, let us first look at a simpler one-dimensional problem: the Langevin equation.



[Figure 5.3] (left) A simulation of the Langevin equation for the Ornstein-Uhlenbeck process (5.17) with 50 steps per unit time. Note that the correlation time is τ_c = 2, so excursions away from zero typically take two time units to mean-revert back to zero. Farther excursions take longer to come back. (right) Histogram of the values of X_t for the same process simulated up to t = 2000 and comparison with the normal distribution. The agreement varies from sample to sample, as a rare far excursion can affect the sample distribution even for t = 2000.

5.4.1 Langevin Equation

We would like to construct a stochastic process for a variable X_t such that in the steady-state regime the values of X_t are drawn from a given probability distribution P(x). To build our stochastic process, let us first consider the simple Brownian motion with unit variance per unit time:

dXt = dBt.

If we think of this equation in discrete time, it means that at each time increment dt we add to X_t an independent Gaussian random variable of variance dt. The variance of X_t grows linearly with time and the process is not stationary. To make it stationary we need a mechanism to reduce the variance of X_t. We cannot ‘subtract’ variance, but we can reduce X_t by scaling. At every step, let us replace X_t by X_t/√(1 + dt). If the variance of X_t is one, it will stay one after scaling and addition of dB_t. This is actually a convergent mechanism: if the variance of X_t is greater than one it will decrease, and it will increase if it started below one. We know that the distribution of X_t under Brownian motion is Gaussian



(if the initial condition is Gaussian or constant). With this extra rescaling, X_t is still Gaussian at every step. Once X_t has reached the steady state (variance one), X_t will always be a Gaussian random number of unit variance: X_t samples the normal distribution. The discrete process we just described can be taken to the continuum limit. In that case X_t/√(1 + dt) → X_t − (1/2)X_t dt. As a stochastic differential equation, we have

\[
dX_t = dB_t - \frac{1}{2}X_t\, dt. \tag{5.17}
\]

This stationary version of the random walk is called the Ornstein-Uhlenbeck process. A physical interpretation of this equation is that of a particle located at X_t moving in a viscous medium, subjected to a random force dB_t/dt and a deterministic force −X_t/2. The viscous medium is such that velocity (and not acceleration) is proportional to force. We would like to generalize the above formalism to any distribution P(x). One way to do so is to change the linear force −X_t/2 to a general non-linear force −V'(x)/2, where we have written the force as the derivative of a potential and introduced a factor of two that will prove convenient. If the potential is convex, the force will drive the particle towards the minimum of the potential while the noise dB_t will drive the particle away. We expect that this system will reach a steady state. Our stochastic equation is now

\[
dX_t = dB_t - \frac{1}{2}V'(X_t)\, dt.
\]

What is the distribution of X_t in the steady state? To find out, let us consider a function F(X_t) of X_t and see how it behaves in the steady state. Using Ito's lemma, Eq. (5.1), we have

\[
dF_t = F'(X_t)\left[dB_t - \frac{1}{2}V'(X_t)\, dt\right] + \frac{1}{2}F''(X_t)\, dt.
\]

Taking the expectation value and demanding that dE[F_t]/dt = 0 in the steady state, we find

\[
\mathbb{E}\left[F'(X_t)\, V'(X_t)\right] = \mathbb{E}\left[F''(X_t)\right]. \tag{5.18}
\]

This must be true for any function F(x). To make sense of it, let us write h(x) = F'(x) and write these expectation values using the steady-state probability



P(x) that we are trying to determine:
\[
\int h(x)\, V'(x)\, P(x)\, dx = \int h'(x)\, P(x)\, dx. \tag{5.19}
\]

Since we want to relate an integral of h to one of h', we integrate the right-hand side by parts:
\[
\int h'(x)\, P(x)\, dx = -\int h(x)\, P'(x)\, dx = -\int h(x)\, \frac{P'(x)}{P(x)}\, P(x)\, dx.
\]

Since Eq. (5.19) is true for any function h(x) we must have

\[
V'(x) = -\frac{P'(x)}{P(x)} \quad\Rightarrow\quad P(x) = Z^{-1}\exp[-V(x)],
\]

where Z is an integration constant that fixes the normalization of the law P(x).

To recapitulate, given a probability density P(x), we can define a potential V(x) = −log P(x) (up to an irrelevant additive constant) and consider the stochastic differential equation

\[
dX_t = dB_t - \frac{1}{2}V'(X_t)\, dt. \tag{5.20}
\]

The stochastic variable X_t will eventually reach a steady state. In that steady state the law of X_t will be given by P(x). Equation (5.20) is called the Langevin equation. The strength of the Langevin equation is that it allows one to replace the average over the probability P(x) by a sample average over time of a stochastic process.³ In the Langevin equation, time is an artificial parameter: any rescaling of time would yield a valid Langevin equation

\[
dX_t = \sigma\, dB_t - \frac{\sigma^2}{2}V'(X_t)\, dt;
\]

in particular, the choice σ² = 2 is very natural and often made.

³ A process for which the time evolution samples the entire set of possible values is called ergodic. A discussion of the conditions for ergodicity is beyond the scope of this book.
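As a small illustration (a sketch, with arbitrary step size and run length), here is a discretized Langevin sampler; with V(x) = x²/2 it reduces to the Ornstein-Uhlenbeck process (5.17), and its time samples should follow the standard normal distribution.

# Sketch: discretized Langevin equation dX = dB - V'(X) dt / 2, here with V(x) = x^2/2.
import numpy as np

rng = np.random.default_rng(6)
dt, n_steps = 0.02, 200_000
Vprime = lambda x: x                    # V(x) = x^2 / 2 (Ornstein-Uhlenbeck)

X = 0.0
samples = np.empty(n_steps)
for i in range(n_steps):
    X += rng.normal(0.0, np.sqrt(dt)) - 0.5 * Vprime(X) * dt
    samples[i] = X

burn = n_steps // 10                    # discard the transient before the steady state
print(samples[burn:].mean(), samples[burn:].std())   # should be close to 0 and 1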

We have learned another useful fact from Eq. (5.18): the random variable V'(X) acts as a derivative with respect to X under the expectation value. In that sense V'(X) can be considered the conjugate variable of X.

It is very straightforward to generalize our 1-d Langevin equation to a set of



N variables X_i that are drawn from the joint law P(x) = Z^{-1} exp[−V(x)]. We get

\[
dX_k = dB_k - \frac{1}{2}\frac{\partial}{\partial X_k}V(\mathbf{X})\, dt, \tag{5.21}
\]

where we have dropped the subscript t for clarity.

Exercise

5.5 Langevin equation for Student’s t-distributions

The family of Student’s t-distributions, parametrize by the tail exponent µ,is given by the probability density

\[
P_\mu(x) = Z_\mu^{-1}\left(1 + \frac{x^2}{\mu}\right)^{-\frac{\mu+1}{2}}
\qquad\text{with}\qquad
Z_\mu^{-1} = \frac{\Gamma\!\left(\frac{\mu+1}{2}\right)}{\sqrt{\mu\pi}\,\Gamma\!\left(\frac{\mu}{2}\right)}.
\]

(a) What is the potential V(x) and its derivative V'(x) for these laws?

(b) Using Eq. (5.18), show that for a t-distributed variable x we have

\[
\mathbb{E}\left[\frac{x^2}{x^2 + \mu}\right] = \frac{1}{1 + \mu}.
\]

(c) Write the Langevin equation for a Student's t-distribution. What is the µ → ∞ limit of this equation?

(d) Simulate your Langevin equation for µ = 3, with 20 time steps per unit time, and run the simulation for 20 000 units of time. Make a normalized histogram of the sampled values of X_t and compare with the law for µ = 3 given above.

(e) Compared to the Gaussian process (Ornstein-Uhlenbeck), the Student t-process has many more short excursions, but the long excursions are much longer than the Gaussian ones. Explain this behavior by comparing the function V'(x) in the two cases. Describe their relative small |x| and large |x| behavior.



[Figure 5.4] (left) A simulation of DBM with a potential for an N = 25 matrix, starting from a Wigner matrix with σ² = 1/10 and evolving within the potential V(x) = x²/2 + x⁴/4 for 10 units of time. Note that the steady state is reached quickly (within one or two units of time). (right) Histogram of the eigenvalues for the same process for N = 400 and 200 discrete steps per unit time. The histogram is over all matrices from time 3 to 10 (560 000 points). The agreement with the theoretical density (Eq. (4.17)) is very good.

5.4.2 DBM for Orthogonal Ensembles

In this section, we would like to extend our 1-d Langevin equation to the case of symmetric matrices drawn from a potential (Eq. (4.1)). One elegant way to do so is to consider matrix Brownian motion, i.e. a stochastic differential equation for matrix elements where at each time step dt one adds a GOE matrix with variance dt. This formalism is quite elegant and resembles very closely the 1-d case. To do it properly, we would need to write the first and second derivatives of functions of symmetric matrices and define Ito's lemma for such objects.

Our method will be more straightforward, albeit less elegant. We will workdirectly on the eigenvalues and write a Langevin equation for them. The jointlaw of the eigenvalues for beta ensemble is given by

P(λi) = Z−1N exp

−β2 N∑

i=1

NV(λi) −∑i, j

log |λi − λ j|

.

Applying the vectorial Langevin equation (5.21) to this potential for the eigen-

Page 110: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

104 Dyson Brownian Motion

values we get

dλk = dBk +∑l,k

βdt2

1λk − λl

−Nβ4

V ′(λk)dt.

As we said earlier, the parametrization of time in a Langevin equation is arbi-trary, we choose to rescale time to recover Eq. (5.6) in the absence of a poten-tial

dλk =

√2N

dBk +

1N

∑l,k

β

λk − λl−β

2V ′(λk)

dt. (5.22)

Dyson Brownian motion with a potential has many applications. Numerically itcan be used to generate matrices for an arbitrary potential at task not obviousa priori from the definition (4.1). Figure 5.4 shows a simulation of the matrixpotential studied in Section 4.2.5. Note that DBM generates the correct densityof eigenvalue, it also generates the proper statistics for the joint distribution ofeigenvalues.

More theoretically, DBM can be used in proofs of local universality. Local uni-versality is the concept that many properties of the joint law of eigenvalues don’tdepend on the specifics of the random matrix in question. Many such proper-ties are a consequence of eigenvalue repulsion and indeed depend only on thesymmetry class (beta) of the model.

Another useful property of DBM is its speed of convergence to the steady state.With time normalized as in Eq. (5.22), global properties (such as the densityof eigenvalues) converge in a time of order 1. Local properties (e.g. eigenvaluespacing) converge much faster, in a time of order 1/N.

Exercise

5.6 Moments under DBM with a potential

Consider the moments of the eigenvalues as a function of time

Fk(t) =1N

N∑i=1

λki (t),

Page 111: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

Exercise 105

for eigenvalues under going DBM under a potential V(x), Eq. (5.22), inthe orthogonal case β = 1. In this exercise you will need to show (and touse) the following identity

2N∑

i, j=1j,i

λki

λi − λ j=

N∑i, j=1

j,i

λki − λ

kj

λi − λ j=

N∑i, j=1

j,i

k∑l=0

λliλ

k−lj

(a) Using Ito calculus, write a stochastic differential equation for F2(t).

(b) By taking the expectation value of your equation, show that

ddtEF2(t) = 1 − E

1N

N∑i=1

λi(t)V ′(λi(t))

+1N

(c) In the Wigner case, V ′(x) = x, find the steady-state value of E[F2(t)]for any finite N.

(d) For for a random matrix X drawn from a generic potential V(x), showthat in the large N limit, we have

τ[V ′(X)X

]= 1,

where τ is the expectation value of the normalized trace defined by (2.1).

(e) Show that this equation is consistent with τ(W) = 1 for a Wishart matrixwhose potential is given by Eq. (4.2).

(f) In the large N limit, find and a general expression for τ[V ′(X)Xk] bywriting the steady-state equation for E[Fk+1(t)], you can neglect the Itoterm. The first two should be given by

τ[V ′(X)X2] = 2τ[X] and τ[V ′(X)X3] = 2τ[X2] + τ[X]2.

(g) In the unit Wigner case V ′(x) = x, show that your relation in (f) is equiv-alent to the Catalan number inductive relation (2.28), with τ(X2m) = Cm

and τ(X2m+1) = 0.

Page 112: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

106 Dyson Brownian Motion

Bibliographical Notes

Dyson Brownian motion is not often discussed in books on Random MatrixTheory. The subject is treated in Baik et al. [2016] and Erdos and Yau [2017],the latter is one of the rare references that discusses DBM with a potential be-yond the Ornstein-Uhlenbeck model.

Historical Dyson [1962a]

Page 113: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

6Addition of Large Random Matrices

6.1 Low rank Harrish-Chandra-Itzykson-Zuber integral

6.1.1 Generalization of the Wigner case

In the previous chapter, we used Dyson Brownian motion to compute the Stielt-jes transform of the sum of a Wigner and any other matrix (random or not).When

E = B + Xσ2 ,

the Stieltjes transform of E is given by

gE(z) = gB(z − σ2gE(z))

where gB(z) is the Stieltjes transform of B and σ2 the variance of the Wignermatrix. From gE(z) we can compute the density of eigenvalues of E using Eq.(2.24). We would like to find a generalization of this result to a larger class ofmatrices than Wigner.

Take two N×N matrices B, with eigenvalues λi1≤i≤N and eigenvectors vi1≤i≤N ,and C, with eigenvalues µi1≤i≤N and eigenvectors ui1≤i≤N . Then the eigen-values of B + C will depend on the overlaps between the eigenvectors of B andthe eigenvectors of C. In the trivial case where vi = ui for all i, we have that theeigenvalues of B + C are given by νi = λi + µi. However, this is neither genericand nor very interesting.

Page 114: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

108 Addition of Large Random Matrices

One important property in the Wigner case is that the eigenvectors of X areHaar distributed, that is, the matrix of eigenvectors is distributed uniformly inthe group O(N) and each eigenvector is uniformly distributed on the unit sphereS N−1. Thus it is very unlikely that they will have significant overlap with theeigenvectors of B. This is the property that we want to keep in our general-ization. We will study what happens for general matrices B and C when theireigenvectors are random with respect to one another. We will define this rela-tive randomness notion more precisely in the next chapter. Here, to ensure therandomness of the eigenvectors, we will apply a random rotation to the matrixC:

E = B + OCOT ,

where O is a Haar distributed random orthogonal matrix. Then it is easy to seethat OCOT is rotational invariant since O′O is also Haar distributed for anyfixed O′ ∈ O(N).

Before we proceed with the addition of large matrices, we remind ourselvesof the law of addition in standard probability theory. For a random variableX, the function HX(t) defined as

HX(t) = logEeitX (6.1)

is additive under the addition of independent random variable. In other wordsif Z = X + Y where X and Y are independent random variables, we have

HZ(t) = logEeit(X+Y) = log(EeitXEeitY

)= HX(t) + HY (t).

6.1.2 Harrish-Chandra-Itzykson-Zuber intergal

To generalize exp(itX) to matrices, we want to multiply the matrix B withanother matrix, but we need to take the exponential of a scalar, so we pro-pose:

I(A,B) :=⟨exp

(N2

Tr AOBOT)⟩

O, (6.2)

for some fixed matrix A. The notation 〈·〉O means the integral over all orthogonalmatrices O normalized such that 〈1〉O = 1, this defines the Haar measure on the

Page 115: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

6.1 Low rank Harrish-Chandra-Itzykson-Zuber integral 109

group of orthogonal matrices. Now if E = B + O1COT1, we have

I(A,E) =

⟨exp

(N2

Tr AO(B + O1COT1)OT

)⟩O

= I(A,B)I(A,C).

Equation (6.2) is called the Harrish-Chandra-Itzykson-Zuber (HCIZ) integral.1

For general A, the HCIZ integral is quite complicated. Fortunately, for our pur-pose we can take A to be rank-1 and in this case the integral can be computed.A symmetric rank-1 matrix can be written as

A = avvT ,

where a is the eigenvalue and v is a unit vector. We will show that the large Nbehavior of I(a,B) is given by

I(a,B) ≈ exp(N

2HB(a)

),

for some function HB(a) that depends on the particular matrix B. More formallywe define

HB(a) = limN→∞

2N

log⟨exp

(N2

Tr AOBOT)⟩

O. (6.3)

If E = B+C where C is rotational invariant with respect to B, the above formulagives that

HE(a) = HB(a) + HC(a).

i.e. H is additive.

The function HB(a) is defined here for a fixed matrix B. If B is a random matrix,we should take the expectation value of HB(a) with respect to B. It turns out thatHB(a) is self-averaging is most interesting cases and this extra expectation valueis not necessary. We discuss this point further in section 6.2.3.

1 The HCIZ can be defined with an integral over orthogonal, unitary or symplectic matrices. In the generalcase it is defined as

Iβ(A,B) :=⟨exp

( Nβ2

Tr AOBO†)⟩

O,

with beta equal to 1, 2 or 4 and O is averaged over the corresponding group. The unitary β = 2 case is themost often studied.

Page 116: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

110 Addition of Large Random Matrices

6.1.3 Computation of rank-1 HCIZ

To get a sensible theory, we need to have a concrete expression for this functionH. Without loss of generality, we can assume B is diagonal (in fact, we candiagonalize B and absorb the eigen-matrix into O). Moreover, for simplicity weassume that a > 0. Then OT AO can be regarded as a random projector

OT AO = ψψT ,

with ‖ψ‖2 = a and ψ/‖ψ‖ uniformly distributed on the unit sphere. Then wemake a change of variable ψ→ ψ/

√N, and calculate

Za(B) =

∫dNψ

(2π)N/2 δ(‖ψ‖2 − Na

)exp

(12ψT Bψ

), (6.4)

where we have added a factors of (2π)−N/2 for later convenience. Because Za(B)is not properly normalized (i.e. Za(0) , 1), we will need to normalize it tocompute I(a,B): ⟨

exp(N

2Tr AOBOT

)⟩O

=Za(B)Za(0)

.

We can express the Dirac delta as an integral

δ(x) =

∫ ∞

−∞

e−izx

2πdz =

∫ i∞

−i∞

e−zx/2

4πdz.

Now let Λ be a parameter such that Λ > λmax(B). We introduce the factor

1 = exp

−Λ(‖ψ‖2 − Na

)2

,since ‖ψ‖2 = Na. Then absorbing Λ into z, we get that

Za(B) =

∫ Λ+i∞

Λ−i∞

dz4π

∫dNψ

(2π)N/2 exp(−

12ψT (z − B)ψ +

Nza2

)

Page 117: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

6.1 Low rank Harrish-Chandra-Itzykson-Zuber integral 111

z

zz(a)max

+ i

i

Figure 6.1 Graphical representation of the integral Eq. (6.5) in the complexplane. The red crosses represent the eigenvalues of B and are singular pointsof the integrant. The integration is from Λ − i∞ to Λ + i∞ where Λ > λmax.The saddle point is at z = z(a) > λmax. Since the integrant is analytic right ofλmax, the integration path can be deformed to go through z(a).

We can now perform the Gaussian integral over the vector ψ.

Za(B) =

∫ Λ+i∞

Λ−i∞

dz4π

det (z − B)−1/2 exp(Nza

2

)=

∫ Λ+i∞

Λ−i∞

dz4π

exp

N2

za −1N

∑k

log(z − λk(B))

, (6.5)

where λk(B), 1 ≤ k ≤ N, are the eigenvalues of B. Then we denote

F(z,B) := za −1N

∑k

log(z − λk(B)).

The integral in (6.5) is oscillatory, and by the stationary phase approximation,it is dominated by the point where

∂zF(z) = 0⇒ a −1N

∑k

1z − λk(B)

= a − gBN(z) = 0.

If gBN(z) can be inverted then z = z(a). For x > λmax, gB

N(x) is monotonicallydecreasing and thus invertible. So for a < gB

N(λmax), a unique z(a) exist andz(a) > λmax. Since F(z) is analytic to the right of z = λmax, we can deform the

Page 118: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

112 Addition of Large Random Matrices

contour to reach this point (see figure 6.1). Using the saddle-point formula (Eq.(A.2)), we have

Za(B) ∼

√4π/(4π)

|N∂2z F(z(a),B)|1/2

exp

N2

z(a)a −1N

∑k

log(z(a) − λk(B))

.

∼1

2√

Nπ|g′B(z(a))|exp

N2

z(a)a −1N

∑k

log(z(a) − λk(B))

For the case B = 0, we have gB(z) = z−1 ⇒ z(a) = a−1, so we get

Za(0) ∼1

2a√

Nπexp

[N2

(1 + log a

)]. (6.6)

In the large N limit, the prefactor in front of the exponential does not contributeto HB(a) and we get

limN→∞

2N

log⟨exp

(N2

Tr AOBOT)⟩

O= z(a)a−1− log a−

1N

∑k

log(z(a)−λk(B))

By the definition (6.3), we then get that

HB(a) = H(z(a), a), H(z, a) := za − 1 − log a −1N

∑k

log(z − λk(B)).

6.2 R-transform

6.2.1 Low rank HCIZ saddle point

We found an expression for HB(a) but in a form that is not easy to work with.Now H(z, a) comes from a saddle point approximation and therefore its partialderivative with respect to z is zero. This allows us to compute a much simplerexpression for the derivative of HB(a):

dHB(a)da

=∂H∂z

(z(a), a)dz(a)

da+∂H∂a

(z(a), a) =∂H∂a

(z(a), a) = z(a) −1a

= R(a),

where we used that∂H∂z

(z(a), a) = 0

Page 119: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

6.2 R-transform 113

by the definition of z(a), and R(a) denotes the R-transform defined in (5.14).Moreover, from the definition, we trivially have HB(0) = 0. Hence we canwrite

HB(a) =

∫ a

0RB(x)dx. (6.7)

We already know that H is additive. Thus its derivative, i.e. the R-transform, isalso additive

RE(a) = RB(a) + RC(a),

if C is rotational invariant with respect to B.

The discussion leading to Eq. (6.7) can be extended to the HCIZ integral (Eq.(6.2)), when the rank of the matrix A is very small compared to N. In this casewe get

I(A,B) ≈ exp(N

2Tr HB(A)

), (6.8)

with the same HB(x) as above. When A has rank-1 we recover that Tr HB(A) =

HB(a) where a is the sole non-zero eigenvalue of A.

The above formalism is based on the assumption that g(z) is invertible, whichis generally only true when a = g(z) is small enough. This corresponds to thecase where z is sufficiently large. Recall that the expansion of g(z) at large zhave coefficients given by the moments of the random matrix by (2.17). On theother hand, the expansion of H(a) around a = 0 will give coefficients called thecumulants of the random matrix, which are important objects in the study offree probability as we will show later.

6.2.2 Large a behavior

There is an apparent paradox in the result of our computation of low-rankHCIZ. For a given matrix B there are two natural bounds to I(a,B)

exp(Naλmin

2

)≤ exp

(12ψT Bψ

)≤ exp

(Naλmax

2

)

Page 120: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

114 Addition of Large Random Matrices

exp(Naλmin

2

)≤ I(a,B) ≤ exp

(Naλmax

2

)(6.9)

where λmin and λmax are the smallest and largest eigenvalues of B respectively.Focusing on the upper bound, we have

HB(a) ≤ aλmax.

On the other hand, the integral of the R-transform gives for a unit Wignermatrix

HW(a) =a2

2. (6.10)

One might think that this quadratic behavior will violate the above boundfor a > 4, but we should remember that Eq. (6.7) is only valid for a < g+,the value at which g(z) ceases to be invertible. In the absence of outliers,g+ = g(λmax). For a unit Wigner this point is g+ = g(2) = 1; the boundis not violated. For a > g+, one can still compute HB(a) but the result willdepend explicitly on λmax. In any case, it is not very useful for computing thespectrum of the sum of large random matrices. For completeness we give herethe result for any positive a when the spectrum of eigenvalues is continuousand λmax is the upper edge of the spectrum (no outliers):

dHB(a)da

=

R(a) for a ≤ g+ ≡ g(λmax)λmax − 1/a for a > g+

(6.11)

6.2.3 Annealed vs quenched average

When B is a (rotationally invariant) random matrix, we claim that HB(a)should be averaged over the ensemble of B matrices. Since HB(a) is a logof an average (over orthogonal matrices), one might be tempted to computeHB(a) by replacing the average over O by an average over B.2 In other words:

E[log

⟨exp

(N2

Tr AOBOT

)⟩O

]B

?≈ log E

[exp

(N2

Tr AB)]

B.

However, they are not quite exactly the same due to the fluctuations of eigen-values of B. Nevertheless, let’s compute the annealed average when B is a

2 In the language of spin glasses, the average of the log is called the quenched average while the log of theaverage is called the annealed average. Often the two are equal in the high temperature phase but differ atlower temperature.

Page 121: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

6.2 R-transform 115

0 1 2 3 4 5a

0

2

4

6

8

10

12

S W(a

)

quenchedannealedupper bound

Figure 6.2 The function HX(a) for a unit Wigner computed with a quenchedaverage (HCIZ integral) and an annealed average. We also show the upperbound given by Eq. (6.9). The annealed and quenched average are identicalup to a ≤ g+ = 1 and differ for larger a. The annealed average violates thebound, which is expected as in this case λmax fluctuates and large values of itdominate the average.

Wigner matrix and A = ae1eT1, where e1 is the unit vector (1, 0, . . . , 0)T . Then⟨

exp(N

2Tr AB

)⟩B

=

⟨exp

(Na2

eT1Be1

)⟩B

=

∫dB11√4πσ2/N

exp(NaB11

2−

N4σ2 B2

11

)= exp

(N2

a2σ2

2

),

so the annealed HAB (a) is given by

HAB (a) =

12σ2a2,

which is the same as Eq. (6.10). There is one important difference betweenthe quenched and annealed average. In the first case, the computation waslimited to a such that a < g+ while in the second case the computation isvalid for all a. Therefore one concludes that for a < g+, the quenched andannealed computation give the same result but they differ beyond a > g+.Note that for large enough a the annealed computation will violate the upperbound (6.9), this is not a problem as in the average over B, the eigenvaluesfluctuate and for large a the average is dominated by configurations whereλmax is much greater than λ+, the edge of the limiting spectrum.

Page 122: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

116 Addition of Large Random Matrices

The same calculation can also be done for Wishart matrices and we find againthat the two averages match for a < g+. We will show in chapter ?? that for alarge class of random matrices that converge to a density of eigenvalues witheigenvalue repulsion, the two averages coincide for a < g+. For ensemblewithout eigenvalue repulsion (e.g. iid eigenvalues) the two averages differeven at small a.

Bibliographical Notes

Historical Harish-Chandra [1957], Itzykson and Zuber [1980]

Parisi formula Marinari et al. [1994], Zinn-Justin [1999], Guionnet and Maıda[2005]

Page 123: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

7Free Probabilities

In the previous two chapters we saw how to compute the spectrum of the sum oftwo large random matrices, first when one of them is a Wigner and later whenone is rotationally invariance with respect to the other. In this chapter, we wouldlike to formalize this notion of relative rotational invariance.

The idea is as follow. In standard probability theory, one can work abstractly bycomputing expectation values (moments) of random variables. The concept ofindependence is then equivalent to the factorization of moments (e.g. E[A3B2] =

E[A3]E[B2] when A and B are independent).

Generally random matrices don’t commute and the concept of factorization ofmoments is not very powerful for non-commuting objects. Following von Neu-mann, Voiculescu extended the concept of independence to non-commuting ob-jects. He called this property freeness. He then showed how to computed thesum and the product of free variables. It was later realized that large rotation-ally invariance matrices are free (asymptotically). In other words, free prob-abilities gave us very powerful tools to compute sums and products of largerandom matrices. We already encountered the free addition, free multiplicationwill allow us to study sample covariance matrices is the presence of true corre-lations.

At first, this chapter may seem too dry and abstract for someone looking forapplications. Bare with us, it is not that complicated and we will try to keep

Page 124: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

118 Free Probabilities

the jargon to a minimum. The reward will be one of the most powerful tools inrandom matrix theory.

7.1 Algebraic Probabilities

7.1.1 General Definitions

The ingredients we will need are as follows:1

• A ring R of random variables, which can be non-commutative with respectto the multiplication.

• A field of scalars, which is usually taken to be C. The scalars commute witheverything.

• An operation ∗, called involution. For instance, ∗ denotes the conjugate forcomplex numbers, the transpose for real matrices, and the conjugate trans-pose for complex matrices,

• A positive linear functional (R → C) τ which satisfies τ(AB) = τ(BA) forA, B ∈ R. By positive we mean τ(AA∗) is real non-negative. We also ask thatτ be faithful, i.e. τ(AA∗) = 0 ⇒ A = 0. For instance, τ can be E for standardprobability theory, and can be 1

N Tr or 1NETr for ring of matrices.

We shall call the elements in R the random variables and denote them by capitalletters. For any A ∈ R and k ∈ N, we call τ(Ak) the k-th moment of A. Inparticular, we call τ(A) the mean of A and τ(A2) − τ(A)2 the variance of A. Weshall say that two elements A and B have the same distribution if they have thesame moments of all orders. The axiom stated above imply that we consideronly variables that have finite moments to all orders.

The ring of variables must have an element called 1 such that A1 = 1A = Afor every A. It satisfies τ(1) = 1. We will call 1 and its multiples α1 constants.Adding a constant simply shifts the mean as

τ(A + α1) = τ(A) + α.

1 In mathematical language, the first three items give a *-algebra, while τ gives a tracial state on thisalgebra.

Page 125: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

7.1 Algebraic Probabilities 119

7.1.2 Addition of Commuting Variables

We first consider the commutative case, i.e.

AB = BA, ∀A, B ∈ R.

We shall say A and B are independent, if

τ(p(A)q(B)) = τ(p(A))τ(q(B))

for any polynomial p, q. This condition is equivalent to the factorization of mo-ments.

Constants are independent of everything, from a scalar α we can build the con-stant α1 and write A +α to mean A +α1. Then this setting recovers the classicalprobability theory of commutative random variables (with finite moments toevery order).

Now we study the moments of the addition of independent random variablesA + B. First we trivially have by linearity

τ(A + B) = τ(A) + τ(B).

From now on we will assume

τ(A) = τ(B) = 0,

i.e. A, B have mean zero. For a non-zero mean variable A, we write A = A−τ(A)so τ(A) = 0. We can recover the formulas for moments and cumulants of A withnon zero mean simply by substituting A − τ(A) in formulas for zero mean A.The procedure is straightforward but leads to rather cumbersome results.

Then we have

τ((A + B)2

)= τ(A2) + τ(B2) + 2τ(AB)

= τ(A2) + τ(B2) + 2τ(A)τ(B) = τ(A2) + τ(B2),

i.e. the variance is also additive. For the third moment, we have

τ((A + B)3

)= τ(A3) + τ(B3) + 3τ(A)τ(B2) + 3τ(B)τ(A2) = τ(A3) + τ(B3),

Page 126: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

120 Free Probabilities

which also additive. However, the fourth and higher moments are not additiveanymore. For example, with similar calculations as above, we get

τ((A + B)4

)= τ(A4) + τ(B4) + 6τ(A2)τ(B2).

7.1.3 Commuting Cumulants

For zero mean variables the first three moments are additive but not higher one.Nevertheless certain combinations of higher moments are additive, we call themcumulants. Note that for a variable with non-zero mean A, the second and thirdcumulant are the second and third moments of A = A − τ(A),

κ1(A) = τ(A)

κ2(A) = τ(A2) = τ(A2) − τ(A)2

κ3(A) = τ(A3) = τ(A3) − 3τ(A2)τ(A) + 2τ(A)3.

Back to zero mean variables, we define the fourth cumulant

κ4(A) := τ(A4) − 3τ(A2)2.

Then we can verify that

κ4 (A + B) = τ((A + B)4

)− 3τ

((A + B)2

)2

= τ(A4) + τ(B4) + 6τ(A2)τ(B2) − 3(τ(A2) + τ(B2)

)2

= τ(A4) − 3τ(A2)2 + τ(B4) − 3τ(B2)2 = κ4(A) + κ4(B),

which is additive again. In general, τ((A + B)n) will be of the form τ(An) +τ(Bn)plus some homogeneous mix of lower order terms. We can then define the n-thcumulant κn iteratively such that

κn(A + B) = κn(A) + κn(B),

where

κn(A) = τ(An) + lower order terms moments. (7.1)

Cumulants are by definition additive for independent variables. We also knowthat the log characteristic function introduced in Eq. (6.1) is additive. We can

Page 127: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

7.1 Algebraic Probabilities 121

use it as an alternative definition of cumulants. We define the characteristic func-tion:2

ϕA(k) = τ(eikA

),

where the exponential function is defined through its power series:

τ(eikA) =

∞∑l=0

(ik)l

l!τ(Al), (7.2)

hence the characteristic function is the moment generating function. We knowfrom the properties of the exponential and the factorization of moments that forindependent A, B,

ϕA+B(k) = ϕA(k)ϕB(k).

Here is a more algebraic demonstration. For each l,

τ((A + B)l) =

l∑i=0

(li

)τ(Ai)τ(Bl−i),

with which we get

ϕA+B(k) =

∞∑l=0

∑i≤l

(ik)l

l!

(li

)τ(Ai)τ(Bl−i) =

∑i≤l

(ik)l−iτ(Bl−i)(l − i)!

(ik)iτ(Ai)i!

=

∑i

(ik)iτ(Ai)i!

j

(ik) jτ(B j)j!

= ϕA(t)ϕB(t).

We now define

HA(k) := logϕA(k).

Then for independent A, B, we have

HA+B(k) = HA(k) + HB(k).

Then we shall expand HA(k) as a power series of k and call the coefficients the

2 The factor i is the definition is not necessary in this setting as the formal power series of the exponentialand the logarithm don’t need to converge. We nevertheless include it by analogy with the Fouriertransform.

Page 128: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

122 Free Probabilities

cumulants, i.e.,

HA(k) = log τ(eikA

)=

∞∑n=1

κn(A)n!

(ik)n. (7.3)

By the additive property of H, the cumulants defined in the above way are auto-matically additive. In fact, using the power series for log(1 + x), we have

HA(k) =

∞∑n=1

(−1)n−1

n

∞∑l=1

(ik)l

l!τ(Al)

n

∞∑k=1

κn(A)n!

(ik)n, (7.4)

by matching powers of (ik) we can obtain an expression for κn for any n. Inparticular for n = 1 on the left hand side we recover Eq. (7.1) indicating thatboth definition of cumulants agree. Note that the above relation gives a morestraightforward definition of the cumulants than the iterative one above. We canwork out by hand the first few cumulants.

From (7.4), we have κ1(A) = τ(A). Now we assume A has mean zero, i.e. τ(A) =

0. Then

τ(eikA

)= 1 +

(ik)2

2τ(A2) +

(ik)3

6τ(A3) +

(ik)4

24τ(A4) + . . . ,

and the first few terms in the expansion of (7.4) are

HA(k) =(ik)2

2τ(A2) +

(ik)3

6τ(A3) +

(τ(A4)

24−τ(A2)2

8

)(ik)4 + . . . ,

from which we recover the first four cumulants defined above:

κ1(A) = 0, κ2(A) = τ(A2), κ3(A) = τ(A3), κ4(A) = τ(A4) − 3τ(A2)2.

When introducing a mean to a variable, A = A + a, where τ(A) = 0, we onlychange the first cumulant:

κ1(A) = a and κn(A) = κn(A) for n ≥ 2.

Working by hand, the expression for κn can soon become very cumbersomefor larger n. Nevertheless, by exponentiating Eq. (7.3) and matching with Eq.

Page 129: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

Exercise 123

(7.2), one can extract the following (commutative) moment-cumulant relation

τ(An) =∑

r1,r2,...,rn≥0r1+2r2+...+nrn=n

n!κr11 κ

r22 · · · κ

rnn

(1!)r1 (2!)r2 · · · (n!)r2 r1!r2! · · · rn!

= κn + products of lower order terms + κn1.

(7.5)

In particular, the scaling properties for the moments and cumulants (see (7.6)below) are consistent in the above relation due to r1 + 2r2 + . . . + nrn = n.

Exercise

7.1 Cumulants of a constant

Show that a constant α1 has κ1 = α and κn = 0 for n ≥ 2. (Hint computeHα1(k) = log

(τ(eikα1

))).

7.1.4 Scaling of Moments and Cumulants

Moments and cumulants have simple scaling under scalar multiplication. Forany scalar α, by commutativity of scalars and linearity of τ we have

τ((αA)k

)= αkτ

(Ak

).

For the cumulant, we first look at the scaling of the log-characteristic func-tion

HαA(k) = log(τ(eikαA

))= HA(αk).

And by (7.3), we have

HαA(k) = HA(αk) =

∞∑n=1

αnκn(A)n!

(ik)n.

Thus we have the scaling property

κn(αA) = αnκn(A). (7.6)

Page 130: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

124 Free Probabilities

7.1.5 Law of Large Mumbers and Central Limit Theorem

Continuing in our study of algebraic probabilities, we would like to recover twovery important theorem in probability theory, namely the law of large numberand the central limit theorem. The first states that the sample average convergesto the mean (a constant) as n → ∞ and the second that a large sum of properlycentered and rescaled random variables converge to a Gaussian.

First we need to define in our context what we mean by a constant and a Gaus-sian. At first reading, we should think of the variables in this section as com-muting (standard random variable). We will later introduce non-commutatingcumulants. The arguments of this section apply in the non-commutative casewith independence replaced by freeness.

We have defined the constant random variable A = α1, which satisfies

κ1(A) = a, κl(A) = 0, ∀l > 1.

Then we define the “Gaussian” random variable as an element A that satis-fies

κ2(A) , 0, κl(A) = 0, ∀l > 2.

Note that this definition (in the commutative case) is equivalent to the standardGaussian random variable with density

Pµ,σ2(x) =1

√2πσ2

exp(−

(x − µ)2

2σ2

),

with κ1 = µ and κ2 = σ2.

We shall call κ1(A) the mean, and κ2(A) the variance. Now we can give a simpleproof for the law of large numbers (LLN) and central limit theorem (CLT) inour algebraic setting. Let

S n :=1n

n∑i=1

Ai,

where Ai are iid copies3 of some element A. Then by (7.6) and the additive

3 iid copies are variables Ai that have exactly the same moments and are all independent.

Page 131: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

7.2 Non-Commuting Variables 125

property of cumulants, we get that

κl(S n) =nnl κl(A)→

κ1(A), if l = 1

0, if l > 1.

In other words, S n converges to a constant in the sense of cumulants. On theother hand,

Tn :=1√

n

n∑i=1

Ai,

where Ai are i.i.d. copies of some element A with κ1(A) = 0. Then it is easy tosee that

κl(Tn) =n

nl/2 κl(A)→

0, if l = 1

κ2(A), if l = 2

0, if l > 2

.

In other words, Tn converges to a Gaussian random variable with varianceκ2(A) = τ(A2) in the sense of cumulants.

In order to quantify the convergence of general random variables, one needsto define some “distance” or more generally “topology” on R. In fact, wehave a natural topology on C, and we can define the convergence throughthe convergence of complex-valued functions ϕA(t) or FA(t). However, thesefunctions can be defined through moments of A only. Hence this induces anatural notation of convergence for a sequence of random variables An, thatis, we shall say An converges to A if and only if τ(Ak

n)→ τ(Ak) for any k.

In our algebraic probability setting we have made the implicit assumptionthat the variables we consider have finite moments of all order. This is a verystrong assumption. In particular it excludes any variable whose probabilitydecays as a power law. If we relax this assumption we would find that somesums of power-law distributed variables converge not to a Gaussian but to aLevy distribution. A similar concept exists in the non-cumulative case. But itis beyond the scope of this book.

7.2 Non-Commuting Variables

We now return to our original goal of developing an extension of standard prob-abilities for non-commuting objects. One of the goals is to generalize the law

Page 132: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

126 Free Probabilities

of addition of independent variables. We consider a variable equal to A + Bwhere A and B are non-commutative objects such as large random matrices. Ifwe compute the first three moments of A + B, no particular problem arise andthe behave as in the commutative case. Things get more interesting at the forthmoment:

τ((A + B)4

)= τ

(A4

)+ τ

(A3B

)+ τ

(A2BA

)+ τ

(ABA2

)+ τ(BA3)

+ τ(A2B2

)+ τ (ABAB) + τ

(BA2B

)+ τ(AB2A) + τ(BABA) + τ(B2A2)

+ τ(B3A

)+ τ

(B2AB

)+ τ

(BAB2

)+ τ(AB3) + τ(B4)

= τ(A4

)+ 4τ

(A3B

)+ 4τ

(A2B2

)+ 2τ (ABAB) + 4τ

(AB3

)+ τ(B4),

where in the second step we used the tracial property of τ: τ(AB) = τ(BA)for any A, B ∈ R. For commutative random variable, we have the statementτ(A2B2) = τ(A2)τ(B2) for independent A, B. In the non-commutative case, wealso need to handle the term τ (ABAB). In general ABAB is not equal to A2B2.“Independence” is not enough to deal with this term, so we need a new concept.A radical solution would be to postulate that this term is zero. Or more preciselythat τ(ABAB) = 0 whenever τ(A) = τ(B) = 0. As we compute higher momentsof A+ B we will encounter more and more complicated similar mixed moments.The concept of freeness deals with all of them at once.

7.2.1 Freeness

Given two random variables A, B. We say they are free if for any polynomialsp1, . . . , pn and q1, . . . , qn such that

τ(pk(A)) = 0, τ(qk(B)) = 0, ∀k,

we have

τ (p1(A)q1(B)p2(A)q2(B) · · · pn(A)qn(B)) = 0.

We will call a polynomial (or a variable) traceless if τ(p(A) = 0). Note that 1(and similarly any constant) is free with respect to any A ∈ R since

τ(p(1)) = 0⇔ p(1) = 0.

Page 133: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

7.2 Non-Commuting Variables 127

Moreover, it is easy to see that if A, B are free, then p(A), q(B) are free for anypolynomials p, q. By extension, F(A) and G(B) are also free for any function Fand G defined by their power series.

The freeness is only interesting in the non-commutative case. For the commu-tative case, it is easy to check that A, B are free if and only if either A or B isconstant.

Assuming A, B are free and τ(A) = τ(B) = 0, we now compute the moments ofA + B:

τ((A + B)2

)= τ

(A2

)+ τ

(B2

)+ 2τ (AB) = τ

(A2

)+ τ

(B2

),

τ((A + B)3

)= τ

(A3

)+ τ

(B3

)+ 3τ

(A2B

)+ 3τ

(AB2

)= τ

(A3

)+ τ

(B3

)+ 3τ

((A2 − τ

(A2

))B)

+ 3τ(A2

)τ (B) + 3τ

(A

(B2 − τ

(B2

)))+ 3τ (A) τ

(B2

)= τ

(A3

)+ τ

(B3

),

τ((A + B)4

)= τ

(A4

)+ 4τ

(A3B

)+ 4τ

(A2B2

)+ 2τ (ABAB) + 4τ

(AB3

)+ τ

(B4

)= τ

(A4

)+ 4τ

((A3 − τ

(A3

))B)

+ 4τ(A3

)τ (B) + 4τ

((A2 − τ

(A2

)) (B2 − τ

(B2

)))+ 4τ

(A2

)τ(B2

)+ 4τ

(A

(B3 − τ

(B3

)))+ 4τ (A) τ

(B3

)+ τ

(B4

)= τ

(A4

)+ τ

(B4

)+ 4τ

(A2

)τ(B2

).

In particular, we have

τ((A + B)4

)− 2τ

((A + B)2

)2= τ

(A4

)+ τ

(B4

)− 2τ

(A2

)2− 2τ

(B2

)2.

7.2.2 Free Cumulants

If we define the cumulants as

κ1 (A) = τ (A) , κ2 (A) = τ(A2

), κ3 (A) = τ

(A3

), κ4 (A) = τ

(A4

)−2τ

(A2

)2,

where A = A− τ(A)1, then they are additive for free random variables. The firstthree are the same than the commutative ones. But for the fourth cumulant, the

Page 134: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

128 Free Probabilities

coefficient before τ(A2)2 is now 2 instead of 3. Higher cumulants all differ fromtheir commutative counterparts.

Remark: The free random variables are in some sense “maximally” non-commuting.For example, for free and mean zero variables A and B, we have τ(ABAB) = 0while τ(A2B2) = τ(A2)τ(B2).

As in the commutative case, we can define the k-th free cumulant iterativelyas

κk(A) = τ(Ak) + homogeneous products of lower order moments,

such that

κk(A + B) = κk(A) + κk(B), ∀k,

whenever A, B are free.

An important example of non-commutative free random variables are two inde-pendent large random matrices where one of them is rotational invariant.4

7.2.3 Additivity of the R-Transform

In the previous two chapters, we saw that the R-transform is additive for largerotationally invariant matrices. We will show here that we can define the R-transform in our abstract algebraic probability setting and that this R-transformis indeed additive for free variables.

First we define the Stieltjes transform as the moment generating function as in(2.17), we can define gA(z) for large z as

gA(z) =

∞∑k=0

1zk+1 τ(Ak). (7.7)

Then we can also define the R-transform as before:

RA(g) = zA(g) −1g

4 Freeness is only exact in the large N limit of random matrices.

Page 135: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

7.2 Non-Commuting Variables 129

for small g.5 We claim that the R-transform is additive for free random variables,i.e.,

RA+B(g) = RA(g) + RB(g),

if A, B are free.

The following derivation is taken from Tao [2012]. We let zA(g) be the inversefunction of

gA(z) = τ[(z − A)−1

],

whose power series is actually given by (7.7). Consider a fixed scalar g. Byconstruction

τ(g1) = τ[(zA(g) − A)−1

]The arguments of τ() on both sides of the equation have the same mean but theyare in general different, let’s define their difference as gXA via

(1 + XA)g = (zA − A)−1, (7.8)

where zA is zA(g). By the definition, we have τ(XA) = 0.

We can invert Eq. (7.8) and find

A − zA = −1g

(1 + XA)−1.

Consider another variable B, free from A. For the same fixed g we can find thescalar zB ≡ zB(g) and define XB with τ(XB) = 0 as for A and find

B − zB = −1g

(1 + XB)−1.

Since XA and XB are functions of A and B, XA and XB are free.

5 Here the inverse function zA(g) is really defined as the formal power series that satisfies gA(zA(g)) = g toall orders.

Page 136: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

130 Free Probabilities

A + B − zA − zB = −1g

(1 + XA)−1 −1g

(1 + XB)−1

= −1g

(1 + XA)−1(2 + XA + XB)(1 + XB)−1

A + B − zA − zB +1g

= −1g

(1 + XA)−1(1 − XAXB)(1 + XB)−1

[A + B −

(zA + zB −

1g

)]−1

= −g(1 + XB)(1 − XAXB)−1(1 + XA).

Using the expansion

(1 − XAXB)−1 =∑n=0

(XAXB)n,

We can develop the expression

τ[(1 + XB)(1 − XAXB)−1(1 + XA)

],

it will contain 1 plus terms of the form τ(XAXBXAXB . . . XB) where the initialand final factor might be either XA or XB but the important point is that XA andXB always alternate. By the freeness and zero mean of XA and XB, all theseterms are zero. Hence we get

τ

[A + B −

(zA + zB −

1g

)]−1 = −g⇒ gA+B(zA + zB − g−1) = g.

Thus we get

zA+B = zA + zB − g−1 ⇒ RA+B = RA + RB.

7.2.4 R-Transform and Cumulants

The R-transform R(g) is actually defined as a power series in g. We claim thatthe coefficients of this power series are exactly the non-commutative cumu-lants defined earlier. In other words, RA(g) is the cumulants generating func-tion:

RA(g) =

∞∑k=1

κk(A)gk−1. (7.9)

Page 137: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

7.2 Non-Commuting Variables 131

To show that these coefficients are indeed the cumulants we first realize that theequality R(g) = z(g) − 1/g is equivalent to

zg(z) − 1 = g(z)R(g(z)). (7.10)

We can compute the power series of the two sides of this equality:

zg(z) − 1 =

∞∑k=1

mk

zk (7.11)

where mk ≡ τ(Ak) denotes the k-th moment, and

g(z)R(g(z)) =

∞∑k=1

κk

1z

+

∞∑l=1

ml

zl+1

k

. (7.12)

Equating the right hand sides of Eqs. (7.11,7.12) and matching power of 1/z weget recursive relations between moments (mk) and cumulants (κk):

m1 = κ1 ⇒m1 = κ1

m2 = κ2 + κ1m1 ⇒m2 = κ2 + κ21

m3 = κ3 + 2κ2m1 + κ1m2 ⇒m3 = κ3 + 3κ2κ1 + κ31

m4 = κ4 + 4κ3m1 + κ2[2m2 + m21] + κ1m3 ⇒m4 = κ4

1 + 6κ2κ21 + 2κ2

2 + 4κ3κ1 + κ4

By looking at the z−k term coming from the [1/z + . . .]k term in Eq. (7.12) werealize that mk = κk + . . . where “. . .” are homogeneous combinations of lowerorder κk and mk.

The coefficient of the power series Eq. (7.9) are additive under addition of freevariables and they have the property

κk(A) = τ(Ak) + homogeneous products of lower order moments,

they are therefore the cumulants defined in Section 7.2.2.

Page 138: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

132 Free Probabilities

Figure 7.1 Generic non-crossing partition of 23 elements. In Eq. (7.13), thisparticular partition appears for m23 and contributes κ5κ

23κ

52κ

21. Note that 23 =

5 + 2 · 3 + 5 · 2 + 5 · 1.

Figure 7.2 List of all non-crossing partitions of 4 elements. In equation (7.13)for m4, the first partition contributes κ1κ1κ1κ1 = κ4

1. The next 6 all contributeκ2κ

21 and so forth. We read m4 = κ4

1 + 6κ2κ21 + 2κ2

2 + 4κ3κ1 + κ4.

7.2.5 Cumulants and Non-Crossing Partitions

We saw that Eq. (7.10) can be used to compute cumulants iteratively. Actuallythat equation can be translated into a systematic relation between moments andcumulants:

mn =∑

π∈NC(n)

κπ1 · · · κπlπ, (7.13)

where π ∈ NC(n) indicates that the sum is over all possible non-crossing par-titions of n elements. For any such a partition π the integers π1, π2, · · · , πlπ

(1 ≤ lπ ≤ n) equal the number elements in each group (see Fig. 7.1). Theysatisfy

n =

lπ∑k=1

πk.

We will show that, if we define cumulants by Eq. (7.13), we recover Eq. (7.10).But before we do so, let’s first show this relation on a simple example. Figure 7.2shows the computation of the fourth moment in terms of the cumulants.

The argument is very similar to the recursion relation obtained for Catalan num-bers where we considered non-crossing pair partitions (see Section 2.3.3). Herethe argument is slightly more complicated as we have partitions of all possible

Page 139: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

7.2 Non-Commuting Variables 133

Figure 7.3 .

size. We consider the moment mn for n ≥ 1. We break down the sum over allnon-crossing partitions of n element by looking at l the size of the set containingthe first element (for example in Fig. 7.1, the first element belongs to a set ofsize l = 5). The size of this first set can be 1 ≤ l ≤ n. This initial set breaksthe partition into l (possibly empty) disjoint smaller partitions. They must bedisjoint, otherwise there would be a crossing. In Figure 7.3 we show how aninitial 5-set breaks the full partition into 5 blocks. In each of these blocks, everynon-crossing partition is possible, the only constraint is that the total size of thepartition must be n. The sum over all possible non-crossing partition of size kof the relevant κ’s is the moment mk. Note that the empty partition contributesa multiplicative factor 1, so we define m0 ≡ 1. Putting everything together weobtain the following recursion relation for mn,

mn =

n∑l=1

κl

∏k1,k2,...,kl≥0

k1+k2+...+kl=n−l

mk1mk2 . . .mkl . (7.14)

The r.h.s. of Eq. (7.14) is exactly the term multiplying z−n in the r.h.s. of Eq.(7.12), which must be equal to the n moment mn. This shows that the relation(7.13) is equivalent to our previous definition of the free cumulants.

It is interesting to contrast the moment-cumulant relation in the standard (com-mutative) case (Eq. (7.5)) and the free (non-commutative) case (Eq. (7.13)).Both can be written as a sum over all partitions on n elements, in the standardcase, all partitions are allowed while in the free case the sum is only over non-crossing partition.

Page 140: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

134 Free Probabilities

7.2.6 Freness as the Vanishing of Mixed-Cumulants

We have defined freeness in Section 7.2.1, as the property of two variables Aand B such that the trace of any mixed combination of traceless polynomialsin A and in B vanishes. There exists another equivalent definition of freeness,namely every mixed cumulant of A and B vanish. To make sense of this defini-tion we first need to introduce cumulants of several variables.They are definedrecursively by

τ(A1A2 · · · An) =∑

π∈NC(n)

κπ(A1A2 · · · An), (7.15)

where Ai’s are not necessarily distinct and NC(n) is the set of all non-crossingpartitions of n elements. Here

κπ(A1A2 · · · An) = κπ1(· · · ) · · · κπn(· · · )

are the products of cumulants in the same partition. (draw a graph of partitions)We also call these generalized cumulants the free cumulants.

When all the variables in Eq. (7.15) are the same (Ai = A) we recover the previ-ous definition of cumulants with a slightly different notation (e.g. κ3(A, A, A) ≡κ3(A)). Cumulants with more than one variables are called mixed cumulants(e.g. κ4(A, A, B, A)). By applying Eq. (7.15) we find for the low generalized cu-mulants of two variables

m1(A) = κ1A

m2(A, B) = κ1(B)κ1(B) + κ2(A, B)

m3(A, A, B) = κ1(A)2κ1(B) + κ2(A, A)κ1(B) + 2κ2(A, B)κ1(A) + κ3(A, A, B).

We can now state more precisely the alternate definition of freeness: a set ofvariables are free iif all their mixed-cumulants vanish. For example, in the lowcumulants listed above, freeness of A and B implies that κ2(A, B) = κ3(A, A, B) =

0.

We remark that vanishing of mixed cumulant implies that free cumulants areadditive. In Speicher’s notation, κk(A, B,C, · · · ) is a multilinear function in each

Page 141: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

7.2 Non-Commuting Variables 135

of its argument, where k gives the number of variables. Thus we have

κk(A + B, A + B, · · · ) = κk(A, A, · · · ) + κk(B, B, · · · ) + mixed cumulants

= κk(A, A, · · · ) + κk(B, B, · · · ),

i.e. κk is additive.

7.2.7 Wigner Ensemble and CLT

We can now go back and re-read Section 7.1.5 and replace every occurrence ofthe word independent with free and cumulant by free cumulant.

The law of large numbers now states that the sum of free identically distributed(fid) variables normalized by 1/n converges to a constant (also called a scalar)with the same mean.

Let’s define the Wigner as the variable with second free cumulant κ2 = σ2 > 0and all other free cumulants equal to zero. The central limit theorem then statesthat the sum of zero-mean fid variables normalized by 1/

√n converges to a

Wigner with the same second cumulant.

For large symmetric random matrices, the Wigner defined here by its cumu-lant is the same Wigner ensemble as defined in chapter 1. We saw that the R-transform of the Wigner is given by R(x) = σ2x, i.e. the cumulant generatingfunction has a single term corresponding to κ2 = σ2.

Alternatively we note that the moments of a Wigner are given by the sum overnon-crossing pair partitions (Eq. (2.27)). Comparing with Eq. (7.13), we realizethat partitions containing anything else than pairs must contribute zero, henceonly the second cumulant of the Wigner is non-zero.

7.2.8 Subordination Relation for Addition of Free Variables

We now introduce the subordination relation for free addition. For free A and B,we have

RA(g) + RB(g) = RA+B(g)⇒ zA(g) + RB(g) = zA+B(g),

Page 142: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

136 Free Probabilities

where

gA+B(zA+B) = g = gA(zA) = gA (zA+B − RB(g)) .

We call zA+B(g) ≡ z, then the above relations give

gA+B(z) = gA (z − RB(gA+B(z))) , (7.16)

which is called the subordination relation.

7.3 Free Product

When we studied sample covariance matrices, we showed that the sample co-variance matrix E of variables with true (population) covariance C is givenby

E = C1/2WqC1/2,

with Wq a white Wishart matrix of parameter q = N/T . To compute propertiesof the eigenvalues of E, we need to study the moments τ((C1/2WqC1/2)k) =

τ((CWq)k). Given that white Wishart are rotationally invariant, we need to un-derstand moments of products of free variables.

We start by noticing that the free product of traceless variables is trivial. If A, Bare free and τ(A) = τ(B) = 0, we have

τ((AB)k) = τ(ABAB · · · AB) = 0.

For large random matrices A, B that are asymptotically free, the above meansthat all the moments of AB vanishes as N → ∞. Even if AB , 0 for finiteN.

7.3.1 Low Moments of Free Products

Now we consider the case where A, B are free and τ(A) , 0, τ(B) , 0. Withoutloss of generality, we can assume that τ(A) = τ(B) = 1 by rescaling A and B.Then

τ(AB) = τ ((A − τ(A))(B − τ(B))) + τ(A)τ(B) = τ(A)τ(B) = 1.

Page 143: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

7.3 Free Product 137

We can also use (7.15) to get

τ(AB) = κ2(AB) + κ1(A)κ1(B) = κ1(A)κ1(B) = 1,

since the mixed cumulants are zero. Similarly, using (7.15) we can get that (in-sert figures on the partitions)

τ((AB)2) = τ(ABAB) = κ1(A)2κ1(B)2 + κ2(A)κ1(B)2 + κ1(A)2κ2(B) = 1 + κ2(A) + κ2(B),

which gives

κ2(AB) = τ((AB)2) − τ(AB)2 = κ2(A) + κ2(B),

and

τ((AB)3) = κ3(A)κ1(B)3 + κ1(A)3κ3(B) + 3κ2(A)κ2(B) + 3κ2(A)κ1(A)κ1(B)3

+ 3κ2(B)κ1(B)κ1(A)3 + κ1(A)3κ1(B)3

= κ3(A) + κ3(B) + 3κ2(A)κ2(B) + 3(κ2(A) + κ2(B)) + 1.

κ3(AB) =τ((AB)3) − 3τ((AB)2)τ(AB) + 2τ(AB)3

=κ3(A) + κ3(B) + 3κ2(A)κ2(B)

Under free multiplication of unit-trace variables, the mean stays one and thevariance is additive. The third cumulant is not additive, it is strictly greater thanthe sum of the third cumulants unless one of the two variables is the identity(unit scalar).

7.3.2 Definition of the S-Transform

We will now show that the above relations can be encoded into the S -transformS M(t) which is multiplicative:

S AB(t) = S A(t)S B(t) (7.17)

Page 144: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

138 Free Probabilities

for A and B free. To define the S -transform, we first introduce the T-transform:

tA(ζ) = τ[(1 − ζ−1A)−1

]− 1 (7.18)

= ζgA(ζ) − 1 (7.19)

=

∞∑k=1

mk

ζk .

The T-transform can also be written as

tA(ζ) = τ[T(ζ)

]where T(ζ) := A(ζ − A)−1. (7.20)

We will call the matrix resolvent T(ζ) the T-matrix. We let ζA(t) be the inversefunction of tA(ζ). When m1 , 0, tA is invertible for large ζ, and hence ζA existsfor small enough t. We then define the S-transform

S A(t) ≡t + 1tζA(t)

, (7.21)

for variables A such that τ(A) , 0.6

Let’s compute the S-transform of the identity S 1(t):

t1(ζ) =1

ζ − 1⇒ ζ1(t) =

t + 1t⇒ S 1(t) = 1,

as expected as the identity is free with respect to any variable. The S-transformscales in a simple way with the matrix A, to find its scaling we first notethat

tαA(ζ) = τ[(1 − (α−1ζ)−1A)−1

]− 1 = tA(ζ/α),

which gives that

ζαA(t) = αζA(t).

Then using (7.21), we get that

S αA(t) = α−1S A(t).6 Most author prefer to define the S-transform in terms of the moment generating function ψ(z) := t(1/z).

The definition S (t) = ψ−1(t)(t + 1)/t is equivalent to ours (ψ−1(t) is the functional inverse of ψ(z)). Weprefer to work with the T-transform as the function t(ζ) has an analytic structure very similar to that ofg(z), i.e. it is analytic near ζ → ∞ and has the same singularities as g(z) where the density of eigenvaluesis non trivial. The function ψ(z) is analytic near zero and singular for large values z corresponding to thereciprocal of the eigenvalues.

Page 145: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

7.3 Free Product 139

The above scaling is slightly counter-intuitive but it is consistent with the factthat S A(0) = 1/τ(A). We will be focusing on unit trace matrices such that S (0) =

1.

The construction of the S-transform relies on the properties of mixed momentsof free variables. In that respect it is closely related to the R-transform. Usingthe relation tA(ζ) = ζgA(ζ) − 1, one can get the following relationships betweenRA and S A:

S A(t) =1

RA(tS A(t)), RA(g) =

1S A(gRA(g))

. (7.22)

7.3.3 Multiplicativity of the S-Transform

We can now show the multiplicative property (7.17). The demonstration issimilar to the additive case and adapted from it. Bizarrely we have never seenit in any textbooks.

We fix t and let ζA and ζB be the inverse T -transforms of tA and tB. Then wedefine EA through

1 + t + EA = (1 − A/ζA)−1,

and similarly for EB. We have τ(EA) = 0, τ(EB) = 0, and EA, EB are free.Then we have

AζA

= 1 − (1 + t + EA)−1,

which gives

ABζAζB

=[1 − (1 + t + EA)−1

] [1 − (1 + t + EB)−1

]= (1 + t + EA)−1 [(t + EA)(t + EB)] (1 + t + EB)−1.

Using

t(EA + EB) =t

1 + t

[(1 + t + EA)(1 + t + EB) − (1 + t)2 − EAEB

],

Page 146: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

140 Free Probabilities

we can rewrite the above expression as

ABζAζB

=t

1 + t+ (1 + t + EA)−1

[−t +

EAEB

1 + t

](1 + t + EB)−1

⇒ 1 −1 + t

tABζAζB

= (1 + t)(1 + t + EA)−1[1 −

EAEB

t(1 + t)

](1 + t + EB)−1

[1 −

1 + tt

ABζAζB

]−1

=1

1 + t(1 + t + EB)

[1 −

EAEB

t(1 + t)

]−1

(1 + t + EA).

Using the expansion [1 −

EAEB

t(1 + t)

]−1

=∑n=0

(EAEB

t(1 + t)

)n

,

one can check that

τ

(1 + t + EB)[1 −

EAEB

t(1 + t)

]−1

(1 + t + EA)

= (1 + t)2,

where we used the freeness condition for EA and EB. Thus we get that

τ

[1 −

1 + tt

ABζAζB

]−1 = 1 + t ⇒ tAB

( tζAζB

1 + t

)= t,

which gives thatS AB(t) = S A(t)S B(t)

by the definition (7.21).

7.3.4 Subordination Relation for the Free Product

We next derive the subordination relation for free product using (7.17) and(7.21):

S AB(t) = S A(t)S B(t) ⇒ ζAB(t) =ζA(t)S B(t)

,

where

tAB(ζAB(t)) = t = tA(ζA(t)) = tA(ζAB(t)S B(t)).

We call ζAB(t) ≡ ζ, then the above relations give

tAB(ζ) = tA (ζS B(tAB(ζ))) , (7.23)

Page 147: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

Exercises 141

which is the subordination relation for free product. In fact, the above is trueeven when S A does not exist, e.g. when τ(A) = 0.

When applying to random matrices, the form AB is not very useful since it is notnecessarily symmetric even if A and B are. But if A 0 (i.e. A is positive semi-definite symmetric) and B is symmetric, then A1/2BA1/2 has the same momentsas AB and is also symmetric. In our applications we will always encounteredA 0 and call A1/2BA1/2 the free product of A and B.

Exercises

7.2 Properties of S-transform

(a) Using Eq. (7.21), show that

R(x) = 1/S(xR(x)).    (7.24)

Hint: define t = xR(x) = zg − 1 and identify x as g.

(b) For a variable such that τ(M) = κ_1 = 1, write S(t) as a power series in t; compute the first few terms of the power series, up to (and including) the t^2 term, using Eq. (7.24) and Eq. (7.9). You should find

S(t) = 1 − κ_2 t + (2κ_2^2 − κ_3) t^2 + O(t^3).

(c) We have shown that, when A and B are mutually free with unit trace:

τ(AB) = 1
τ(ABAB) − 1 = κ_2(A) + κ_2(B)
τ(ABABAB) = κ_3(A) + κ_3(B) + 3κ_2(A)κ_2(B) + 3(κ_2(A) + κ_2(B)) + 1

Show that these relations are compatible with S_AB(t) = S_A(t)S_B(t) and the first few terms of your power series in (b).

(d) Consider M_1 = 1 + σ_1 W_1 and M_2 = 1 + σ_2 W_2 where W_1 and W_2 are two different (free) unit Wigner matrices and both σ's are less than 1/2. M_1 and M_2 have κ_3 = 0 and are positive definite in the large N limit. What is κ_3(M_1 M_2)?

7.3 S-transform of the matrix inverse.

(a) Consider M an invertible symmetric random matrix and M^{-1} its inverse. Using Eq. (7.18), show that

t_M(ζ) + t_{M^{-1}}(1/ζ) + 1 = 0.    (7.25)

(b) Using Eq. (7.25), show that

S_{M^{-1}}(x) = 1/S_M(−x − 1).    (7.26)

Hint: write u(x) = 1/ζ(t) where u(x) is such that x = t_{M^{-1}}(u(x)). Eq. (7.25) is then equivalent to x = −1 − t.

7.4 Large Random Matrices

7.4.1 Random Eigenvectors and Freeness

In this section we will try to understand how two large matrices can behave as if they were free. If this is the case, then we can use the results of this chapter, in particular the additivity of the R-transform and the multiplicativity of the S-transform, to compute the spectrum of the sum or product of certain large matrices.

Recall the definition of freeness. A and B are free if for any set of traceless polynomials p_1, . . . , p_n and q_1, . . . , q_n we have

τ (p1(A)q1(B)p2(A)q2(B) · · · pn(A)qn(B)) = 0. (7.27)

To make the link with large matrices we will consider A and B to be large symmetric matrices and τ(M) ≡ (1/N) Tr(M). The matrix A can be diagonalized as UΛU^T and B as VDV^T. A traceless polynomial p_i(A) can be diagonalized as UΛ_i U^T, where U is the same orthogonal matrix as for A and Λ_i = p_i(Λ) is some traceless diagonal matrix, and similarly for q_i(B). Eq. (7.27) becomes

τ(Λ_1 O D_1 O^T Λ_2 O D_2 O^T · · · Λ_n O D_n O^T) = 0,    (7.28)

where we have introduced O = U^T V, the orthogonal matrix of basis change between the eigenvectors of A and those of B. It turns out that Eq. (7.28) is always true when averaged over the orthogonal matrix O in the large N limit if the matrices Λ_i and D_i are traceless. We also expect that in the large N limit Eq. (7.28) becomes self-averaging, so a single matrix O behaves as the average of all such matrices. We now have an effective criterion to state that the matrices A and B behave as free: they have to be large and the change of basis matrix of their eigenvectors O = U^T V has to be essentially random.

We saw that the Wigner matrix X and the white Wishart W are rotationally invariant, meaning that the matrices of their eigenvectors are random orthogonal matrices. We can conclude that for N large, both X and W are free with respect to any matrix independent from them; in particular they are free from any deterministic matrix.

We come back to the statement that in the large N limit the average over O of Eq. (7.28) is zero for traceless matrices Λ_i and D_i. For matrices Λ_i and D_i that are not traceless we have:

lim_{N→∞} ⟨τ(Λ_1 O D_1 O^T)⟩_O = τ(Λ_1)τ(D_1)

lim_{N→∞} ⟨τ(Λ_1 O D_1 O^T Λ_2 O D_2 O^T)⟩_O = τ(Λ_1Λ_2)τ(D_1)τ(D_2) + τ(Λ_1)τ(Λ_2)τ(D_1D_2) − τ(Λ_1)τ(Λ_2)τ(D_1)τ(D_2)

We see that every term in the first two such averages contains at least one τ(Λ_i) or τ(D_i), and therefore the averages vanish if all such individual traces are zero.
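This criterion is easy to probe numerically. Below is a minimal sketch (an illustration, not from the text): it builds two traceless diagonal matrices, rotates one of them by a Haar-random orthogonal matrix generated with scipy's ortho_group, and checks that the mixed traces of Eq. (7.28) are small for large N. A single O suffices because of the self-averaging mentioned above.

import numpy as np
from scipy.stats import ortho_group

N = 1000
rng = np.random.default_rng(0)
tau = lambda M: np.trace(M) / N

# two traceless diagonal matrices (playing the role of p_1(A) and q_1(B))
L1 = np.diag(rng.uniform(-1, 1, N)); L1 -= tau(L1) * np.eye(N)
D1 = np.diag(rng.uniform(-1, 1, N)); D1 -= tau(D1) * np.eye(N)

# Haar-random change of basis between the eigenvectors of A and those of B
O = ortho_group.rvs(dim=N, random_state=0)
B = O @ D1 @ O.T

print(tau(L1 @ B))                    # ~ 0: mixed trace of traceless factors
print(tau(L1 @ B @ L1 @ B))           # ~ 0 as well (vanishes as N grows)
print(tau(L1 @ L1) * tau(D1 @ D1))    # order 1, for comparison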

7.4.2 Infinite Dimensional Matrices

How does one define the ring of infinite dimensional matrices? Such matrices do not exist per se but are the limit of finite dimensional matrices. The trick is to define a space containing matrices of all sizes. This space will contain matrices of arbitrarily large size and hence infinite dimensional matrices will be limiting points in this space. The mathematical process of completion adds to a space the limit of all converging sequences.7 By this process, the infinite unit Wigner matrix will exist.

Elements of the ring can be added and multiplied together. How does one add or multiply two matrices of different sizes? If the ratio of the sizes of two square matrices A and B is an integer n, one can define a sum and a product by increasing the size of the smaller matrix (say B) by considering its tensor product with the n dimensional identity matrix 1_n, so A + B ≡ A + (B ⊗ 1_n) and similarly for the product. More explicitly, if B is a k × k matrix, by B ⊗ 1_n we mean the nk × nk block diagonal matrix

B ⊗ 1_n =
⎡ B  0  …  0 ⎤
⎢ 0  B  …  0 ⎥
⎢ ⋮  ⋮  ⋱  ⋮ ⎥
⎣ 0  0  …  B ⎦

which can be added to or multiplied by an nk × nk matrix A. Note that the tensor product with the identity does not change the normalized trace. The identity matrix of size k is expanded to the identity matrix of size nk, so identity matrices of all sizes behave as the same object.

To make sure that the ratio of unequal matrix sizes is always an integer, we consider the space of square matrices of size 2^n × 2^n for all n. The ratio of two unequal matrix sizes is then always a power of 2.

To finish the construction we need to define a norm on this space using the normalized trace τ(AA*). Using this norm we can define Cauchy sequences and complete the space to include the limiting points of all converging sequences.

Bibliographical Notes

Book: Mingo and Speicher [2017].

Historical: freeness was introduced in Voiculescu [1985]; the link with RMT in Voiculescu [1991].

The additivity of the R-transform presented here is from Tao [2012].

7 A Cauchy sequence is a sequence such that by looking sufficiently far in the sequence all remaining terms are arbitrarily close (∀ε ∃N | ∀n,m > N, |a_n − a_m| < ε). A complete space is one for which every Cauchy sequence converges. The process of completion adds points to a space (the limits of Cauchy sequences) so that it becomes complete. For example the set of real numbers is the completion of the rationals.


8 Addition and Multiplication: Summary and Examples

In the last few chapters we have built the necessary tools to compute the spectrum of sums and products of random matrices. In this chapter we will review the results previously obtained and show how they work on concrete examples. A reader impatient to learn about applications can jump straight into this chapter and read Chapters 6 and 7 later to understand the origins of these formulas.

8.1 Summary

We introduced the concept of freeness, which can be summarized by the following intuitive statement: two large matrices are free if their eigenvectors are relatively random. In particular a large matrix drawn from a rotationally invariant ensemble is free with respect to any matrix independent of it.1 For example A and OBO^T are free when O is a random rotation matrix. When A and B are free, their R- and S-transforms are respectively additive and multiplicative:

R_{A+B}(x) = R_A(x) + R_B(x),    S_{AB}(t) = S_A(t) S_B(t).

The free multiplication needs some clarification: as AB is in general not a symmetric matrix, the S-transform S_AB(t) relates to the eigenvalues of the matrix √A B √A, which are the same as those of √B A √B when both A and B are positive semi-definite (otherwise the square root is ill-defined).

1 By large, we mean that all normalized moments computed using freeness are correct up to corrections that are O(1/N).

The R and S-transforms are defined by the following relations:

g_A(z) = τ[(z − A)^{-1}]

t_A(ζ) = τ[(1 − ζ^{-1}A)^{-1}] − 1 = ζ g_A(ζ) − 1;

R_A(g) = z_A(g) − 1/g,    S_A(t) = (t + 1)/(t ζ_A(t))  if τ(A) ≠ 0,

where z_A(g) and ζ_A(t) are the inverse functions of g_A(z) and t_A(ζ), respectively.

Under multiplication by a scalar they behave as

R_{αA}(x) = α R_A(αx),    S_{αA}(t) = α^{-1} S_A(t).

The two transforms are related by the following equivalent identities:

S_A(t) = 1/R_A(t S_A(t)),    R_A(x) = 1/S_A(x R_A(x)).

The identity matrix has particularly simple transforms:

g_1(z) = 1/(z − 1),    t_1(ζ) = 1/(ζ − 1),

R_1(x) = 1,    S_1(t) = 1.

8.1.1 Computing the Eigenvalue Density

The R-transform provides a systematic way to obtain the spectrum of the sum of two independent matrices E = B + C, where at least one of them is rotationally invariant. Here is a simple recipe to compute the eigenvalue density of a free sum of matrices.

1 Find gB(z) and gC(z).

2 Invert g_B(z) and g_C(z) to get z_B(g) and z_C(g), and hence R_B(g) and R_C(g).

3 R_E(g) = R_B(g) + R_C(g), which gives z_E(g) = R_E(g) + g^{-1}.

4 Solve zE(g) = z for gE(z).

5 Use Eq. (2.24) to find the density.

In the multiplicative case (E = √C B √C), the recipe is similar:

1 Find tB(ζ) and tC(ζ).

2 Invert tB(ζ) and tC(ζ) to get ζB(t) and ζC(t), and hence S B(t) and S C(t).

3 S E(t) = S B(t)S C(t), which gives ζE(t)S E(t)t = t + 1.

4 Solve ζE(t) = ζ for tE(ζ).

5 Eq. (2.24) for g_E(z) = (t_E(z) + 1)/z is equivalent to

ρ_E(x) = lim_{η→0+} Im t_E(x − iη) / (πx).    (8.1)

In some cases, the equation in step 4 may be exactly solvable. But it is usually a high order polynomial equation, or worse, a transcendental equation. In these cases a numerical solution is still possible. There always exists at least one solution which satisfies

g(z) = z^{-1} + O(z^{-2})

for z → ∞. Since the eigenvalues of B and C are real, their R- and S-transforms are real for real arguments. Hence the equation in step 4 is an equation with real coefficients. To find a non-zero eigenvalue density we need to find solutions with a strictly positive imaginary part. When the equation is quadratic or cubic, there is at most one such solution. Numerically, we may even define ρ(x) as the maximum of the imaginary part of all 2 or 3 solutions (the density will be zero when all solutions are real). For higher order polynomial and transcendental equations, we have to be more careful as there can be spurious complex solutions. Exercises 5.4 and 8.1 show how to do these computations in concrete cases.

8.1.2 A Worked out Example

Suppose we wanted to compute the eigenvalue distribution of a matrix E = F + OFO^T, where F is a diagonal matrix with entries uniformly distributed between −1 and 1 (e.g. [F]_kk = 1 + (1 − 2k)/N) and O a random orthogonal matrix. This is the free sum of two matrices with uniform eigenvalue density.

First we need to compute the Stieltjes transform of F. We have

ρ_F(λ) = 1/2  for −1 < λ < 1.

Then the Stieltjes transform is

g_F(z) = (1/2) ∫_{−1}^{1} dλ/(z − λ) = (1/2) log((z + 1)/(z − 1)).

Note that when −1 < λ < 1 the argument of the log in g_F(z) is negative, so Im g_F(λ − iη) = π/2, consistent with a uniform distribution of eigenvalues. We then compute the R-transform by finding the inverse of g_F(z):

z(g) = (e^{2g} + 1)/(e^{2g} − 1) = coth(g).

And so the R-transform of F is given by

R_F(g) = coth(g) − 1/g.

The R-transform of E is twice that of F. To find the Stieltjes transform of E we need to solve

z = R_E(g) + 1/g = 2 coth(g) − 1/g,    (8.2)

for g(z). This is a transcendental equation and we need to solve it for complex z near the real axis. Before attempting to solve it, it is useful to plot z(g) (Figure 8.1). The region where z = z(g) does not have real solutions is where the eigenvalues are. This region is between a local maximum and a local minimum of z(g).

Figure 8.1 z(g) = R_E(g) + 1/g for the free sum of two flat distributions. Note that there is a region of z near [−1.5, 1.5] where z = z(g) does not have real solutions. This is where the eigenvalues lie. The inset shows a zoom of the region near z = 1.5, indicating more clearly that z(g) has a minimum at g_+ ≈ 1.49, so λ_+ = z(g_+). The exact edges of the spectrum are λ_± ≈ ±1.5429.

We should look for complex solutions of Eq. (8.2) near the real axis for Re(z) between −1.54 and 1.54. We can then put this equation into a complex non-linear solver. The density will be given by Im g(z)/π for Im(z) very small and Re(z) in the desired range. Note that complex solutions come in conjugate pairs, and it is hard to force the solver to find the correct one. This is not a problem: since their imaginary parts have the same absolute value, we can just use

ρ(λ) = |Im g(λ − iη)|/π  for some small η.

We have plotted the resulting density on Figure 8.2.
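As a concrete illustration, here is a minimal Python sketch (not from the text) that solves Eq. (8.2) with a hand-rolled complex Newton iteration, sweeping λ from 0 outwards and warm-starting each solve from the previous root; any complex non-linear solver would do equally well.

import numpy as np

def z_of_g(g):
    # z(g) = R_E(g) + 1/g = 2*coth(g) - 1/g
    return 2.0 / np.tanh(g) - 1.0 / g

def dz_dg(g):
    return -2.0 / np.sinh(g) ** 2 + 1.0 / g ** 2

def solve_g(z, g0, tol=1e-12, maxit=100):
    g = g0
    for _ in range(maxit):
        step = (z_of_g(g) - z) / dz_dg(g)
        g = g - step
        if abs(step) < tol:
            break
    return g

eta = 1e-8
lams = np.linspace(0.0, 1.54, 200)
g = 1.0j                      # rough starting point for the root at lambda = 0
rho = np.empty_like(lams)
for i, lam in enumerate(lams):
    g = solve_g(lam - 1j * eta, g)    # warm-start from the previous solution
    rho[i] = abs(g.imag) / np.pi

Since the density is symmetric, ρ(−λ) = ρ(λ), the array rho (mirrored) approximates the continuous curve of Figure 8.2.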

8.1.3 Wigner and Wishart

The Wigner ensemble is rotationally invariant, therefore a Wigner matrix is free from any matrix from which it is independent. For a Wigner matrix X of variance σ^2,

R_X(x) = σ^2 x.


Figure 8.2 Density of eigenvalues for the free sum of two uniform distributions. The continuous curve was computed using a numerical solution of Eq. (8.2). The histogram is a numerical simulation with N = 5000.

The Wigner matrix is stable under free addition, i.e. the free sum of two Wigners of variance σ_1^2 and σ_2^2 is a Wigner with variance σ_1^2 + σ_2^2.

The Wigner matrix is traceless (τ(X) = 0), so its S-transform does not exist. However, we can shift X by m times the identity, for which R_{X+m}(x) = m + σ^2 x. We can then use Eq. (7.22) and compute the S-transform:

S_{X+m}(t) = (√(m^2 + 4σ^2 t) − m)/(2σ^2 t) = m/(2σ^2 t) [√(1 + 4σ^2 t/m^2) − 1];

it is regular at t = 0 when m ≠ 0 but it is ill-defined otherwise.

For a white Wishart matrix W with parameter q = N/T, we have shown that

R_W(x) = 1/(1 − qx).

To compute its S-transform we first remember that its Stieltjes transform g(z) satisfies Eq. (3.4), which can be written as an equation for t(ζ):

ζt − (1 + qt)(t + 1) = 0   ⇒   S_W(t) = 1/(1 + qt).    (8.3)

In Exercise 8.2, one can show that for an inverse-Wishart matrix of mean one and variance p,

S(t) = 1 − pt.

Figure 8.3 Density of l = log(λ) for a free log-normal (8.4) with α = 100 compared with a Wigner with the same endpoints. For α ≲ 1 the density of l is indistinguishable by eye from a Wigner (not shown), while for α → ∞ the distribution of l tends to a uniform distribution on [−α/2, α/2].

8.1.4 The Free Log-Normal

There exists a free version of the log-normal: Y_α. Its S-transform is given by

S_α(t) = e^{−α(t+1/2)}.    (8.4)

As a family of laws, the free log-normal is stable in the sense that the free product of two free log-normals with parameters α_1 and α_2 is a free log-normal with parameter α_1 + α_2. We have chosen to define the free log-normal not to have unit trace but rather τ(Y_α) = e^{α/2}. With this normalization the free log-normal has the additional property that Y_α and Y_α^{-1} are equal in law. This implies that the eigenvalue distribution of Y_α is invariant under λ → 1/λ and that Y_α has unit determinant. The matrix Y_α can be viewed as the large n limit of the product of n unit determinant free matrices with variance α/n.


Its first three free cumulants can be computed from Eq. (8.4) and read

κ_1 = e^{α/2},    κ_2 = α e^{α},    κ_3 = (3α^2/2) e^{3α/2}.

By looking for the real extrema of

ζ(t) = ((t + 1)/t) e^{α(t+1/2)},

we can find the points t_± where t(ζ) ceases to be invertible, which in turn gives the edges of the spectrum λ_± = ζ(t_±):

t_± = (±√(1 + 4/α) − 1)/2,
λ_+ = 1/λ_− = [√α/2 + √(1 + α/4)]^2 exp(√(α + α^2/4)).

The eigenvalue distribution is symmetric in λ → 1/λ, so the density ρ(l) of l = log(λ) is symmetric about zero. Figure 8.3 shows the density of l for α = 100.

8.1.5 Decaying Exponential Correlations

A matrix that will be useful when we consider temporal correlations is the covariance matrix of a process with decaying exponential correlations. We consider a T × T covariance matrix defined by

K_{ts} = a^{|t−s|}  with 0 < a < 1.

For instance K is the covariance matrix of an AR(1) process in the steady state: if x_t is a random process following

x_t = a x_{t−1} + η_t,

where η_t are iid centered random numbers with variance 1 − a^2 so that x_t has unit variance in the steady state, then E[x_t x_s] = K_{ts}. The parameter a measures the decay of the correlation; we can define a correlation time τ_c := 1/(1 − a). This time is always greater than or equal to 1 (equal when K = 1) and tends to infinity as a → 1. The matrix K is a Toeplitz matrix:

K =
⎡ 1        a        a^2      …  a^{T−2}  a^{T−1} ⎤
⎢ a        1        a        …  a^{T−3}  a^{T−2} ⎥
⎢ a^2      a        1        …  a^{T−4}  a^{T−3} ⎥
⎢ ⋮        ⋮        ⋮        ⋱  ⋮        ⋮       ⎥
⎢ a^{T−2}  a^{T−3}  a^{T−4}  …  1        a       ⎥
⎣ a^{T−1}  a^{T−2}  a^{T−3}  …  a        1       ⎦

The matrix K is translation invariant, i.e. it only depends on |t − s|. In an infinite system, it can be diagonalized by plane waves (Fourier transform). For finite T this diagonalization is only approximate as there are "boundary effects" at the edges of the matrix. If the correlation time τ_c is not too big, these boundary effects should be negligible. One way to make this diagonalization exact is to modify the matrix so as to have the distance |t − s| defined on a circle. We define a new matrix K̃ by

K̃_{ts} = a^{min(|t−s|, |t−s+T|, |t−s−T|)}.

K̃ =
⎡ 1    a    a^2  …  a^2  a   ⎤
⎢ a    1    a    …  a^3  a^2 ⎥
⎢ a^2  a    1    …  a^4  a^3 ⎥
⎢ ⋮    ⋮    ⋮    ⋱  ⋮    ⋮   ⎥
⎢ a^2  a^3  a^4  …  1    a   ⎥
⎣ a    a^2  a^3  …  a    1   ⎦

It may seem that we have greatly modified the matrix K, as we have changed about half of its elements, but if τ_c ≪ T most of the elements we have changed were essentially zero and remained essentially zero. Only a finite number (≈ 2τ_c^2) of elements in the top right and bottom left corners have really changed. Changing a finite number of off-diagonal elements in an asymptotically large matrix should not change its spectrum. The matrix K̃, which we will call K again, is a circulant matrix: it can be diagonalized by Fourier transform. More precisely, its eigenvectors are

[v_k]_l = e^{2πikl/T}  for 0 ≤ k ≤ T/2.

Note that to each v_k correspond two eigenvectors, namely its real and imaginary parts, except for v_0 and v_{T/2} which are real and have multiplicity one. The eigenvalues associated to k = 0 and k = T/2 are respectively the largest (λ_+) and smallest (λ_−) and are given by

λ_+ = 1 + 2 Σ_{k=1}^{T/2−1} a^k + a^{T/2} ≈ (1 + a)/(1 − a)

λ_− = 1 + 2 Σ_{k=1}^{T/2−1} (−a)^k + (−a)^{T/2} ≈ (1 − a)/(1 + a) = 1/λ_+.

In terms of the correlation time: λ_+ = 2τ_c − 1. We label the eigenvalues of K by an index x_k = 2k/T, so 0 ≤ x_k ≤ 1. As T → ∞, x_k becomes a continuous parameter x and the different multiplicity of the first and last eigenvalues does not matter:

λ(x) = (1 − a^2)/(1 + a^2 − 2a cos(πx))  for 0 ≤ x ≤ 1.

The t-transform of K can then be computed as

t_K(z) = ∫_0^1 (1 − a^2)/[z(1 + a^2 − 2a cos(πx)) − (1 − a^2)] dx.

Using

∫_0^1 dx/(c − d cos(πx)) = 1/(√(c − d) √(c + d)),

and after some manipulations we find

t_K(z) = 1/(√(z − λ_−) √(z − λ_+))  with  λ_± = (1 ± a)/(1 ∓ a).

From which we can read the density of eigenvalues (see Fig. 8.4):

ρ_K(λ) = 1/(πλ √((λ − λ_−)(λ_+ − λ)))  for λ_− < λ < λ_+.

This density has integrable singularities at λ = λ_±. It is normalized and its mean is 1. We can also invert t_K(z) with the equation

t^2 Z_K^2 − 2b t^2 Z_K + t^2 − 1 = 0  where  b = (1 + a^2)/(1 − a^2),

and get

Z_K(t) = (b t^2 + √((b^2 − 1)t^4 + t^2))/t^2,


Figure 8.4 Density of eigenvalues for the decaying exponential covariance matrix K for three values of a: 0.25, 0.5 and 0.75.

so the S-transform is given by

S_K(t) = (t + 1)/(√(1 + (b^2 − 1)t^2) + bt)    (8.5)
       = 1 − (b − 1)t + O(t^2),

where the last equality tells us that the matrix K has mean 1 and variance σ_K^2 = b − 1 = 2a^2/(1 − a^2).
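A quick numerical check of these formulas (an illustration, not from the text): build the circulant version of K with scipy.linalg.circulant and compare its spectrum with ρ_K; the choice a = 0.5 and T = 2000 is arbitrary.

import numpy as np
from scipy.linalg import circulant

a, T = 0.5, 2000
d = np.minimum(np.arange(T), T - np.arange(T))   # circular distance |t - s|
K = circulant(a ** d)
evals = np.linalg.eigvalsh(K)

lam_p, lam_m = (1 + a) / (1 - a), (1 - a) / (1 + a)
lam = np.linspace(lam_m * 1.001, lam_p * 0.999, 500)
rho = 1.0 / (np.pi * lam * np.sqrt((lam - lam_m) * (lam_p - lam)))
# a normalized histogram of evals should match rho (cf. Figure 8.4)
print(evals.min(), evals.max(), lam_m, lam_p)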

Exercises

8.1 Free product of two Wisharts

In this exercise, we will compute the eigenvalue distribution of a matrix E = (W_{q_0})^{1/2} W_q (W_{q_0})^{1/2}. As we shall see in section 8.2.1, this matrix would be the sample covariance matrix of data with true covariance given by a Wishart with parameter q_0, observed over T samples such that q = N/T.

(a) Using Eq. (8.3) and the multiplicativity of the S-transform, write the S-transform of E.

(b) Using the definition of the S-transform, write an equation for t_E(z). It is a cubic equation in t. If either q_0 or q goes to zero, it reduces to the standard Marcenko-Pastur quadratic equation.


(c) Use Eq. (8.1) and a numerical root finder to plot the eigenvalue density of E for q_0 = 1/2 and q ∈ {1/4, 1/2, 3/4}. In practice you can work with η = 0; of the three roots of your cubic equation, at most one will have a positive imaginary part. When all three solutions are real, ρ_E(λ) = 0.

(d) Generate numerically two independent Wishart matrices with q = 1/2 (N = 1000 and T = 2000) and compute E = (W_{q_0})^{1/2} W_q (W_{q_0})^{1/2}. Note that the square-root of a matrix is obtained by applying the square-root to its eigenvalues. Diagonalize your E and compare its density with your result in (c).
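A possible sketch for part (d), using a spectral decomposition to take the matrix square-root (function names and the seed are arbitrary):

import numpy as np

N, T = 1000, 2000                    # q = q0 = N/T = 1/2
rng = np.random.default_rng(1)

def white_wishart(N, T, rng):
    H = rng.standard_normal((N, T))
    return H @ H.T / T

W0, W = white_wishart(N, T, rng), white_wishart(N, T, rng)

# matrix square-root of W0 via its eigenvalues
lam0, U = np.linalg.eigh(W0)
W0_half = (U * np.sqrt(lam0)) @ U.T

E = W0_half @ W @ W0_half
evals = np.linalg.eigvalsh(E)
# compare a normalized histogram of evals with the density obtained in (c)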

8.2 The inverse-Wishart matrix.

Recall the Wishart matrix of parameter q = N/T < 1. It has the following R-transform, S-transform and density of eigenvalues:

R(x) = 1/(1 − qx)  and  S(x) = 1/(1 + qx),

ρ(λ) = √((λ − λ_−)(λ_+ − λ))/(2πqλ)  with  λ_± = (1 ± √q)^2.    (8.6)

(a) Let W be a Wishart matrix with q < 1 and M_0 = W^{-1} its inverse. Using Eq. (7.26) show that

S_{M_0}(t) = 1 − q − qt.

(b) Recall that τ(W^{-1}) = τ(M_0) = 1/(1 − q). Define the (normalized) inverse-Wishart as M = (1 − q)M_0 and call p ≡ q/(1 − q). Show that

S_M(t) = 1 − pt,

and that the inverse-Wishart has mean 1 and variance p. Show that an inverse-Wishart has more skewness (κ_3(M)) than a Wishart (κ_3(W)) with the same variance (q_W = p_M); use your result of 1(b) to compute the skewness of the inverse-Wishart and Eq. (7.9) for the skewness of the Wishart.

(c) Compute the Stieltjes transform g_M(z) for the inverse-Wishart: invert Eq. (7.9) to find t(z) and then g(z). You should find

g_M(z) = [(1 + 2p)z − 1 − z√((1 − 1/z)^2 − 4p/z)]/(2pz^2).

Use Eq. (2.24) to compute its density of eigenvalues. Show that you get the same result by doing the following change of variable in Eq. (8.6):

x = (1 − q)/λ  and  p = q/(1 − q).

In particular, you should find that the edges of the inverse-Wishart spectrum are given by

x_± = 2p + 1 ± 2√(p(p + 1)).

(d) Consider the free product of a Wishart W and an independent inverse-Wishart M. Compute the S-transform of WM and find t_{WM}(z) and g_{WM}(z) to finally get the density of eigenvalues ρ_{WM}(λ). This is one of the rare cases of a sum or product of free variables where the resulting equation is quadratic in g or t.

(e) Generate numerically a normalized inverse-Wishart M for p = 1/4 and N = 1000. Check that τ(M) = 1 and τ(M^2) = 1.25. Plot a normalized histogram of the eigenvalues of M and compare with your analytical results in (c).

(f) Generate an independent Wishart W with q = 1/4 and compute E = √M W √M. To compute √M, diagonalize M, take the square-root of its eigenvalues and reconstruct √M. Check that τ(E) = 1 and τ(E^2) = 1.5.

Plot a normalized histogram of the eigenvalues of E and compare with your result from (d).

(g) For every eigenvector v_k of E compute ξ_k ≡ v_k^T M v_k, and make a scatter plot of ξ_k vs λ_k, the eigenvalue of v_k. Your scatter plot should show a noisy straight line. We will see in Chapter 11 that this is related to the fact that linear shrinkage is the optimal estimator of the true covariance from the sample covariance when the true covariance is an inverse-Wishart.
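A possible sketch for parts (e)-(g): it builds the normalized inverse-Wishart from a Wishart with q = p/(1+p), checks the two traces, forms the free product with an independent Wishart, and computes the overlaps ξ_k (sizes and seed are illustrative):

import numpy as np

N, p, q = 1000, 0.25, 0.25
rng = np.random.default_rng(2)

# normalized inverse-Wishart M with variance p
q_inv = p / (1 + p)                       # since p = q_inv/(1 - q_inv)
T_inv = int(N / q_inv)
H = rng.standard_normal((N, T_inv))
M = (1 - q_inv) * np.linalg.inv(H @ H.T / T_inv)
print(np.trace(M) / N, np.trace(M @ M) / N)     # ~ 1 and ~ 1.25

# free product E = sqrt(M) W sqrt(M) with an independent Wishart, q = 1/4
T_w = int(N / q)
G = rng.standard_normal((N, T_w))
W = G @ G.T / T_w
lam_M, U = np.linalg.eigh(M)
M_half = (U * np.sqrt(lam_M)) @ U.T
E = M_half @ W @ M_half
print(np.trace(E) / N, np.trace(E @ E) / N)     # ~ 1 and ~ 1.5

# overlaps xi_k = v_k^T M v_k for part (g)
lam_E, V = np.linalg.eigh(E)
xi = np.sum(V * (M @ V), axis=0)
# a scatter of xi vs lam_E should look like a noisy straight line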


8.3 The exponential moving average sample covariance matrix (EMA-SCM). Instead of measuring the sample covariance matrix using a flat average over a fixed time window T, one can compute the average using an exponentially weighted moving average. Let's compute the spectrum of such a matrix in the null case of IID data. Imagine we have an infinite time series of vectors x_t of size N, for t from minus infinity to now. We define the EMA-SCM (on time-scale τ_c) as

E(t) = (1/τ_c) Σ_{t′=−∞}^{t} (1 − 1/τ_c)^{t−t′} x_{t′} x_{t′}^T    (8.7)

E(t) = (1 − 1/τ_c) E(t − 1) + (1/τ_c) x_t x_t^T.

The second term on the RHS can be thought of as a Wishart matrix with T = 1 (or q = N). Now both E(t) and E(t − 1) are equal in law, so we write

E  =(in law)  (1 − 1/τ_c) E + (1/τ_c) W_{q=N}.    (8.8)

(a) Given that E and W are free, use the properties of the R-transform to get the equation

R_E(x) = (1 − 1/τ_c) R_E((1 − 1/τ_c)x) + (1/τ_c)/(1 − (N/τ_c)x).

(b) Take the limit N → ∞, τ_c → ∞ with q ≡ N/τ_c fixed to get the following differential equation for R_E(x):

R_E(x) = −x (d/dx) R_E(x) + 1/(1 − qx).

(c) The definition of E is properly normalized, τ(E) = 1 [show this using Eq. (8.8)], so we have the initial condition R(0) = 1. Show that

R_E(x) = −log(1 − qx)/(qx)

solves your equation with the correct initial condition. Compute the variance κ_2(E).


(d) To compute the spectrum of eigenvalues of E, one needs to solve a complex transcendental equation. First write z(g). For q = 1/2 plot z as a function of g (for −4 < g < 2). You will see that there are values of z that are never attained by z(g); in other words g(z) has no real solutions for these z. Numerically find complex solutions for g(z) in that range. Plot the density of eigenvalues ρ_E(λ) given by Eq. (2.24). Plot also the density for a Wishart with the same mean and variance.

(e) Construct numerically the matrix E as in Eq. (8.7). Use N = 1000, τ_c = 2000 and use at least 10000 values for t′. Plot the eigenvalue distribution of your numerical E against the distribution found in (d).
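A possible sketch for part (e). Rather than iterating the recursion sample by sample, it builds the truncated exponentially weighted sum of Eq. (8.7) in a single matrix product (the truncation at 20000 samples and the seed are arbitrary choices):

import numpy as np

N, tau_c, n_samples = 1000, 2000, 20000        # q = N/tau_c = 1/2
rng = np.random.default_rng(3)

w = (1 - 1.0 / tau_c) ** np.arange(n_samples)[::-1]   # weights (1-1/tau_c)^(t-t')
X = rng.standard_normal((N, n_samples))
E = (X * w) @ X.T / tau_c                      # truncated version of Eq. (8.7)

evals = np.linalg.eigvalsh(E)
print(evals.mean())        # ~ 1, up to a small truncation correction
# compare a normalized histogram of evals with the density found in (d)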

8.2 General Sample Covariance Matrices

In this section, we will show how to compute the various transforms (S(t), t(z), g(z)) for sample covariance matrices when the data has true correlations. The N variables can have some true correlations (that we want to measure); we call this case spatial correlations. The T samples might not be independent; we can model this using temporal correlations. And of course, the data might have both types of correlations.

Recall that if we store the data in a rectangular N × T matrix H, we define the sample covariance matrix as

E = (1/T) HH^T.

We can also compute the singular values s_k of H; note that these singular values are related to the eigenvalues of E via s_k = √(Tλ_k). Recall as well that once we have computed g(z) we can use Eq. (2.24) to compute the density of eigenvalues ρ(λ).


8.2.1 Spatial Correlations

We saw in section 3.2.3 that a general Wishart matrix E_C (with column covariance C) can be written as

E_C = C^{1/2} W_q C^{1/2}.

We recognize this formula as the free product of the covariance matrix C and a white Wishart W_q. Note that since the white Wishart is rotationally invariant, it is free from any matrix C. From the multiplicativity of the S-transform and the S-transform of the white Wishart (Eq. 8.3), we have

S_E(t) = S_C(t)/(1 + qt).

We can also use the subordination relation of the free product (Eq. 7.23) to write

t_E(z) = t_C(z/(1 + q t_E(z))).

This last expression can be written in terms of the more familiar Stieltjes transform using t(z) = zg(z) − 1:

z g_E(z) = Z g_C(Z)  where  Z = z/(1 − q + qz g_E(z)).
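The last relation is a self-consistent equation for g_E(z) that is easy to solve numerically. Below is a hedged sketch (not from the text): the true covariance C is taken, purely for illustration, to have eigenvalues 1 and 2 with equal weight, and the equation is iterated with damping, which usually converges for z slightly off the real axis.

import numpy as np

def g_C(z):
    # Stieltjes transform of the illustrative C: eigenvalues 1 and 2, weight 1/2 each
    return 0.5 / (z - 1.0) + 0.5 / (z - 2.0)

def g_E(z, q, n_iter=2000, damp=0.5):
    g = 1.0 / z                        # large-z behaviour as starting point
    for _ in range(n_iter):
        Z = z / (1 - q + q * z * g)
        g = damp * (Z * g_C(Z) / z) + (1 - damp) * g
    return g

q, eta = 0.25, 1e-6
lams = np.linspace(0.05, 4.5, 400)
rho = np.array([abs(g_E(lam - 1j * eta, q).imag) / np.pi for lam in lams])
# rho approximates the sample-covariance eigenvalue density for this C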

8.2.2 Temporal Correlations

A common problem in data analysis arises when samples are not independent. Intuitively, correlated samples are somehow redundant and the sample covariance matrix should behave as if we had observed not T samples but an effective number T* < T. Let's analyze more precisely the sample covariance matrix in the presence of correlated samples. We will start with the case where the true spatial correlations are zero. Our data can then be written in a rectangular N × T matrix H_K satisfying

E([H_K]_{it} [H_K]_{js}) = δ_{ij} K_{ts},


where K is the T × T temporal covariance matrix that we assume to be normalized as τ(K) = 1. Following the same arguments as in section 3.2.3, we can write

H_K = H K^{1/2},

where H is a white rectangular matrix. So the sample covariance matrix becomes

E = (1/T) H_K H_K^T = (1/T) H K H^T.

Now this is not quite the free product of the matrix K and a white Wishart, but if we define the (T × T) matrix F as

F = (1/N) H_K^T H_K = (1/N) K^{1/2} H^T H K^{1/2} = K^{1/2} W_{1/q} K^{1/2},

then F is the free product of the matrix K and a white Wishart with parameter 1/q. So

S_F(t) = S_K(t)/(1 + t/q).

To find the S-transform of E, we go back to section 3.1.1 where we obtained Eq. (3.1) relating the Stieltjes transforms of E and F. In terms of the t-transform, the relation is even simpler:

t_F(z) = q t_E(qz)   ⇒   Z_E(t) = q Z_F(qt),

where the functions Z(t) are the inverse t-transforms. Using the definition of the S-transform (Eq. (7.21)), we finally get

S_E(t) = S_K(qt)/(1 + qt),    (8.9)

which can be expressed as a relation between inverse t-transforms:

Z_E(t) = q(1 + t) Z_K(qt).

We can also write a subordination relation between the t-transforms:

q t_E(z) = t_K(z/(q(1 + t_E(z)))).


Figure 8.5 Density of eigenvalues for a sample covariance matrix with exponential temporal correlations for three choices of parameters q and b such that qb = 0.25 (q = 0.01, b = 25; q = 0.17, b = 1.5; q = 0.25, b = 1). All three densities are normalized, have mean 1 and variance σ_E^2 = qb = 0.25. The bottom one is the Marcenko-Pastur density (q = 0.25); the top one is very close to the limiting density for q → 0 with σ^2 = bq fixed.

8.2.3 Exponential Temporal Correlations

The most common form of temporal correlation in experimental data is the decaying exponential that we studied in section 8.1.5. Before considering true spatial correlations we can study the null model where we measure the sample covariance matrix of N independent variables but where each measurement exhibits a time-correlation of the form

E[x_{it} x_{js}] = δ_{ij} a^{|t−s|}.

We want to compute the S-transform of this SCM. From it we will be able to compute the density of eigenvalues of the null hypothesis, and also we will be able to introduce true spatial correlations via the free product (section 8.2.4). Combining Eq. (8.9) with Eq. (8.5), we get

S_E(t) = 1/(√(1 + (b^2 − 1)(qt)^2) + bqt).    (8.10)

From the S-transform, we can write an equation for t_E(z); it is a fourth order equation:

q^2 t^4 + 2q(q − bz) t^3 + (z^2 − 2bqz + q^2 − 1) t^2 − 2t − 1 = 0.
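This quartic is easy to handle numerically: for each z one can compute all four roots, keep the one with a positive imaginary part and use Eq. (8.1). A minimal sketch (the parameter choice q = 0.17, b = 1.5 corresponds to one of the parameter sets of Figure 8.5; the small η and the root-selection rule are implementation choices):

import numpy as np

def rho_E(lam, q, b, eta=1e-9):
    z = lam - 1j * eta
    coeffs = [q**2,
              2 * q * (q - b * z),
              z**2 - 2 * b * q * z + q**2 - 1,
              -2.0,
              -1.0]
    roots = np.roots(coeffs)
    t = roots[np.argmax(roots.imag)]         # at most one root with Im t > 0
    return max(t.imag, 0.0) / (np.pi * lam)  # Eq. (8.1)

q, b = 0.17, 1.5
lams = np.linspace(0.01, 3.0, 300)
rho = [rho_E(lam, q, b) for lam in lams]
# plotting rho against lams should reproduce the corresponding curve of Figure 8.5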


Figure 8.6 Density of eigenvalues for the limiting distribution of the sample covariance matrix with exponential temporal correlations, W_{σ^2}, for three choices of the parameter σ^2: 0.25, 0.5 and 1.

Looking at Eq. (8.10), one notices that when b ≫ 1, the S-transform depends on b and q only through the combination bq. One can define a new limiting distribution for the case q → 0, b → ∞ with qb = σ^2 (the variance of this distribution). Note that when b is large, b = τ_c/2: half the correlation time. Physically this corresponds to a large number of variables N with very long correlation time τ_c measured over even longer times T ≫ τ_c such that Nτ_c/(2T) = σ^2.

A better way to understand this random matrix is to consider N independent Ornstein-Uhlenbeck processes with correlation time τ_c that we record over a long time T_0. We sample the data at interval ∆ and construct the sample covariance matrix (SCM) of the N variables. If ∆ ≫ τ_c, then each sample can be considered independent and the SCM will be a Marcenko-Pastur with q = N∆/T_0. But if we sample at intervals ∆ ≪ τ_c, then the resulting SCM should no longer depend on ∆ but only on τ_c. Concretely, if τ_c = 1s, sampling at 1ms or 1µs should give the same result. The SCM will converge to this new random matrix W_{σ^2} with parameter σ^2 = Nτ_c/(2T_0). The S-transform of W_{σ^2} is given by

S_{W_{σ^2}}(t) = 1/(√(1 + (σ^2 t)^2) + σ^2 t),


and its R-transform is

R_{W_{σ^2}}(g) = 1/√(1 − 2σ^2 g) = 1 + σ^2 g + (3/2)σ^4 g^2 + O(g^3).

The last equation gives its first three cumulants. We notice that it has more skewness, κ_3 = (3/2)σ^4, than a MP with the same variance (q = σ^2), for which κ_3 = σ^4. The equations for the Stieltjes transform g(z) and the t-transform are both cubic equations. The distribution of eigenvalues of W_{σ^2} is shown on Figure 8.6. Note that, unlike the Marcenko-Pastur, there is always a lower edge of the spectrum λ_− > 0 and no Dirac at zero even for σ^2 > 1. Unfortunately, the equation giving λ_± is a fourth order equation and doesn't have a concise solution.

8.2.4 Spatial and Temporal Correlations

Combining the results of the two previous subsections, data with both true spatial correlations C and temporal correlations K leads to

S_E(t) = S_C(t) S_K(qt)/(1 + qt),

Z_E(t) = q t Z_C(t) Z_K(qt).

8.2.5 Fluctuating Variance

Consider finally the case where each sample x_t is independent but has its own (fluctuating) variance σ_t^2. The sample covariance matrix reads

E = (1/T) Σ_{t=1}^T x_t x_t^T = Σ_{t=1}^T P_t,

where P_t = x_t x_t^T/T is a rank-one matrix with non-zero eigenvalue equal to qσ_t^2. Since the vectors x_t are rotationally invariant, the matrix E can be viewed as the free sum of a large number of rank-one matrices:

R_E(g) = Σ_{t=1}^T R_{P_t}(g).


To compute the R-transform of the matrix E we need to compute the R-transform of a rank-one matrix. Note that since there are T terms in the sum, we will need to know R_{P_t}(g) including corrections of order 1/N.

g_{P_t}(z) = (1/N) [(N − 1)/z + 1/(z − qσ_t^2)] = 1/z + (1/N) qσ_t^2/[z(z − qσ_t^2)].

Inverting to first order in 1/N we find

z_{P_t}(g) = 1/g + (1/N) qσ_t^2/(1 − qσ_t^2 g).

Since R(g) = z(g) − 1/g and using that q = N/T we find

R_E(g) = (1/T) Σ_{t=1}^T σ_t^2/(1 − qσ_t^2 g).

The fluctuations of σ_t^2 can be stochastic or deterministic; in the large T limit we can encode them in a probability density P(s) for s = σ^2 and turn the sum into an integral:

R_E(g) = ∫_0^∞ s P(s)/(1 − qsg) ds.

Note that if the variance is always 1 (P(s) = δ(s − 1)), we recover the Wishart R-transform

R_{W_q}(g) = 1/(1 − qg).

In the general case, the R-transform of E is simply related to the t-transform of the distribution of s:

R_E(g) = (1/(qg)) t_s(1/(qg)).

Exercise

8.1 On the futility of oversampling

Consider data consisting of N variables with true correlation C and T independent observations. Instead of computing the sample covariance matrix with these T observations, we repeat each one m times and sum over mT columns. Obviously the redundant columns should not change the sample covariance matrix, hence it should have the same spectrum as the one using only the original T observations.

(a) The redundancy of columns can be modeled as temporal correlations with an mT × mT covariance matrix K that is block diagonal with T blocks of size m, all the values within one block equal to 1 and zero outside the blocks. Show that this matrix has T eigenvalues equal to m and (m − 1)T zero eigenvalues.

(b) Compute tK(z) for this model.

(c) Show that S K(t) = (1 + t)/(1 + mt).

(d) If we include the redundant columns we have a value of q_m = N/(mT), but we need to take temporal correlations into account, so S_E(t) = S_C(t) S_K(q_m t)/(1 + q_m t). Show that in this case S_E(t) = S_C(t)/(1 + qt) with q = N/T, which is the result without the redundant columns.

Bibliographical Notes

Books: Tulino and Verdu [2004], Couillet and Debbah [2011].

Historical: Burda et al. [2005].


9 Extreme Eigenvalues and Outliers

9.1 Largest Eigenvalue

For many ensembles of random matrices, the density of eigenvalues converges to a smooth bounded density ρ(x) as N → ∞.1 The largest eigenvalue λ_1 of such a matrix is typically close to λ_+, the upper edge of the density ρ(x). It is natural to ask what are the fluctuations of λ_1 around λ_+, what is their scaling in N and what is their distribution.

9.1.1 Tracy-Widom Distribution

It is known that the largest eigenvalues of many types of random matrices (including Wigner and Wishart matrices) satisfy the Tracy-Widom law as N → ∞:

lim_{N→∞} P(λ_1 ≤ λ_+ + γ N^{−2/3} µ) = F_β(µ),

where β takes the classical values β = 1, 2 or 4, and F_β is related to a solution of the Painlevé II equation. Here λ_+ is the right edge of the spectrum of the random matrix (i.e. it gives the classical location of the largest eigenvalue), and γ is some scaling parameter that depends on the matrix ensemble. For instance,

1 We won't consider the possibility of a Dirac mass at or above the upper edge of the spectrum, aδ(x − λ_+). In that case λ_1 is always λ_+.


Figure 9.1 Density of the Tracy-Widom distribution for the three values of β. The graphs were generated using python code from Yao-Yuan Mao, University of Pittsburgh.

we have λ_+ = 2 and γ = 1 for Wigner matrices with σ^2 = 1, and λ_+ = (1 + √q)^2 and γ = √q λ_+^{2/3} for Wishart matrices. F_β has negative mean (i.e. for finite N, the mean of λ_1 is below λ_+), and its positive tail and negative tail behave rather differently:

log P(x > µ) ∼ −µ^{3/2},  µ ≫ 1,

and

log P(x < µ) ∼ −|µ|^3,  µ ≪ −1.

Thus it is relatively rarer to see large negative values, since the eigenvalue has to overcome a "repulsion" from the other eigenvalues.

The density of the Tracy-Widom distribution is plotted in Fig. 9.1 for β = 1, 2, 4. Note that the variance of the distribution diminishes with β, in accordance with the intuitive interpretation that higher β's correspond to lower temperatures and hence eigenvalues closer to their equilibrium positions. Note as well that the right tail is clearly fatter than the left: the distribution has positive skewness.


9.1.2 A simple scaling argument

The exact shape of F_β is determined by delicate repulsions between eigenvalues (i.e. the Vandermonde determinant in Eq. (4.4)). But the scaling N^{−2/3} only depends on the shape of the density near the edge and not on the precise joint distribution of eigenvalues. Here we show that independent eigenvalues would have the same scaling. We consider N positive i.i.d. random variables X_i with a distribution that satisfies

P(X ≤ x) ∼ A x^{α+1},  as x → 0.

We can think of X as X = λ_+ − λ for an eigenvalue λ. For the Wigner and Wishart ensembles the exponent α equals 1/2. Then let X_min = min_i X_i; we have

P(X_min < x) = 1 − P(X_min ≥ x) = 1 − (1 − A x^{α+1})^N.

The last equality comes from the fact that for the minimum to exceed a value, all variables must exceed that value, and that the variables X_i are i.i.d. In order to get a probability of order one, we need x to be small: A x^{α+1} ∼ N^{−1}, i.e. x ∼ N^{−1/(α+1)}. Then, for large N we have

P(X_min ≥ x) ≈ exp(−NA x^{α+1}) = exp(−µ^{α+1}),  µ = (NA)^{1/(α+1)} x.

We saw in Section 4.2.4 that a generic model from the orthogonal ensemble (or from any beta-ensemble) has ρ(x) ∼ √(λ_+ − x) for x near the upper edge, hence α = 1/2 and the scaling x ∝ N^{−2/3}µ. Such a model is called non-critical. As mentioned before, one can explicitly construct models with a different exponent and hence a different scaling for the distribution of the largest eigenvalue. In particular we can have ρ(x) ∼ (λ_+ − x)^{k+1/2} for some positive integer k. We see that the i.i.d. model gives the correct scaling, but of course it cannot give the correct limiting distribution. In particular, the tail behavior is wrong. For example, in this i.i.d. model we have

P(λ_1 > λ_+) = 0,

which cannot be correct.


Exercise

9.1 Universality of semi-circle and Tracy-Widom

Consider three different constructions of the Wigner matrix, W = 1/√(2N) (H + H^T), where H is an N × N (non-symmetric) matrix with IID random numbers drawn as

1 Gaussian with mean 0 and variance 1
2 ±1 with equal probability
3 ±d^{−1/3}/√3 where d is uniform in (0, 1) and ± are chosen with equal probability

(a) Show that cases (2) and (3) have mean 0 and unit variance.

(b) Write code to generate all three cases. Check numerically that τ(W) = 0 and τ(W^2) = 1. Compute τ(W^4) in all three cases. What do you expect? For case (3) compute τ(W^4) on a few different samples. Explain your result.

(c) For all three cases: Generate a single N = 1000 matrix. Plot a normalized histogram of its eigenvalues. Do you observe the semi-circle law?

(d) For all three cases: Generate a large number of such matrices (say 1000) of moderate size (e.g. N = 200). For each matrix, compute its largest eigenvalue λ_1 and compute the variable u given by

u = N^{2/3}(λ_1 − 2).

Plot the normalized histogram of u in all three cases separately. Superimpose the Tracy-Widom density for β = 1. Do you see an agreement?

To compute the largest eigenvalue of a symmetric matrix, there exist algorithms that are much faster than those computing all the eigenvalues. For example in python:

import scipy.sparse.linalg as splin
...
eigmax = splin.eigsh(A, k=1, return_eigenvectors=False, which='LA')[0]


eigmax will be equal to the largest eigenvalue of A. Functions to compute the Tracy-Widom probability density are available in public domain packages in python and R and also in Mathematica.
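A possible sketch for the Gaussian case of part (d), built around the eigsh call above (the matrix size, sample count and the quoted Tracy-Widom moments are indicative only; finite-N corrections will shift them somewhat):

import numpy as np
import scipy.sparse.linalg as splin

n_samples, N = 1000, 200
rng = np.random.default_rng(4)
u = np.empty(n_samples)
for i in range(n_samples):
    H = rng.standard_normal((N, N))
    W = (H + H.T) / np.sqrt(2 * N)          # unit Wigner matrix
    lam1 = splin.eigsh(W, k=1, return_eigenvectors=False, which='LA')[0]
    u[i] = N ** (2.0 / 3.0) * (lam1 - 2.0)

print(u.mean(), u.std())
# the beta = 1 Tracy-Widom law has mean ~ -1.21 and standard deviation ~ 1.27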

9.2 Outliers

We saw in the previous section that the largest eigenvalue of a large class of random matrices does not fluctuate very far from the classical edge λ_+. For example the largest eigenvalue of an N = 1000 Wigner matrix is typically within 1000^{−2/3} = 0.01 of λ_+ = 2. In real applications the largest eigenvalue can deviate quite substantially from the classical edge. The source of such a large eigenvalue is usually not an improbably large Tracy-Widom fluctuation but rather a true outlier that should be taken into account separately. We will model these outliers as a low-rank perturbation to a random matrix.

The largest eigenvalue of a random matrix can change radically if one makes a small (low-rank) perturbation to the matrix. There are two typical cases: either the perturbation is small enough and the largest eigenvalue is still at the edge of the spectrum λ_+ (with Tracy-Widom statistics if α = 1/2), or, for a larger perturbation, a finite number of eigenvalues larger than λ_+ emerge. These eigenvalues are called outliers. By definition there are only a finite number of them, so they do not contribute to the density in the large N limit. Nevertheless their presence can have important consequences in practical applications.

9.2.1 Additive Perturbation

We will now study the outliers for an additive perturbation to a large random matrix. Take a large symmetric random matrix M (e.g. Wigner or Wishart) with a well-behaved asymptotic spectrum that has a deterministic right edge λ_+. We would like to know what happens if one adds to M a low rank (deterministic) perturbation of order 1. For simplicity, we only consider the rank-1 perturbation avv^T in this section, with ‖v‖ = 1 and a of order 1. We want to know whether there will be a single eigenvalue of M + avv^T outside the spectrum of M (i.e. the outlier) or not. To answer this question, we calculate the matrix resolvent

G_a(z) = (z − M − avv^T)^{-1}.

The matrix function G_a(z) has a pole at every eigenvalue of M + avv^T. An alternative approach would have been to study the zeros of the function det(z − M − avv^T). But the resolvent G_a(z) will also give us information about the eigenvectors.

Now we apply the Sherman-Morrison formula (1.18); taking A = z − M, we get that

G_a(z) = G(z) + a G(z)vv^T G(z)/(1 − a v^T G(z)v),    (9.1)

where G(z) is the resolvent of the original matrix M. We are looking for a real eigenvalue such that λ_1 > λ_+. Let's take z = λ_1 ∈ R that lies outside the spectrum of M, so that G(λ_1) is real and regular. To have an outlier at λ_1, we need a pole of G_a at λ_1, i.e.

1 − a v^T G(λ_1)v = 0.

Let's assume that M is drawn from a rotationally invariant ensemble or, equivalently, that the vector v is an independent random vector uniformly distributed on the unit sphere. In the language of Chapter 7, we say that the perturbation avv^T is free from the matrix M. Then we have

v^T G(z)v ≈ (1/N) Tr G(z) = g_N(z) → g(z).    (9.2)

Thus we have a pole when

a g(λ_1) = 1   ⇒   g(λ_1) = 1/a.

If z(g), the inverse function of g(z), exists, we arrive at

λ_1 = z(1/a).    (9.3)

The condition for the invertibility of g(z) happens to be the same as the condition to have an outlier (λ_1 > λ_+). Let's try to understand this in more detail. In the large N limit, g_N(z) converges to a function that can be expressed as the Stieltjes transform of the density

g(z) = ∫_{λ_−}^{λ_+} ρ(x)/(z − x) dx,


Figure 9.2 Largest eigenvalue of a Gaussian Wigner matrix with σ^2 = 1 with a rank-1 perturbation of magnitude a. Each dot is the largest eigenvalue of a single random matrix with N = 200. Equation (9.5) is plotted as the solid curve. For a < 1, the fluctuations follow a Tracy-Widom law with N^{−2/3} scaling while for a > 1 the fluctuations are Gaussian with N^{−1/2} scaling. From the graph, we see fluctuations that are smaller, have negative mean and positive skew when a < 1 compared to a > 1.

where λ_± denote the edges of the spectrum. We see that g(z) is a monotonically decreasing function of z for z > λ_+. In particular, g(z) is invertible for z > λ_+, and

g_+ = sup_{z>λ_+} g(z) = g(λ_+).

Thus the inverse function z(g) is monotonically decreasing on 0 < g < g_+, and

z(g) → ∞ as g → 0,    z(g_+) = λ_+.

Then λ_1 = z(1/a) is monotonically increasing in a, and λ_1 = λ_+ when 1/a = g_+. In sum, for a > 1/g_+, there exists a unique outlier that is increasing in a. The smallest value for which we can have an outlier is a = 1/g_+, corresponding to λ_1 = λ_+. For a < 1/g_+ there is no outlier to the right of λ_+.2

We can express the position of the outlier in terms of the R-transform (5.14),

2 Small (or negative) outliers λ < λ_− behave similarly; we just need to consider the matrix −M − avv^T and follow the same logic.


Figure 9.3 Plot of the inverse function z(g) = g + 1/g for the unit Wigner function g(z). The cross indicates the point (g_+, λ_+). The line to the left of this point is the true inverse of g(z): z(g) is defined on [0, g_+) and is monotonically decreasing in g. The line to the right is a spurious solution introduced by the R-transform. Note that the point g = g_+ is a minimum of z(g) = g + 1/g.

λ1 = R(1/a) + a for a > 1/g+. (9.4)

Using the cumulant expansion of the R-transform (7.9), we get an expression for large a:

λ_1 = a + τ(M) + κ_2(M)/a + O(a^{−2}).

For Wigner, we actually have the equality

λ_1 = a + σ^2/a  for a > σ.    (9.5)

By studying the fluctuations of v^T G(λ)v around g(λ), one can show that the fluctuations of the outlier around R(a^{−1}) + a are Gaussian and of order N^{−1/2}. This is to be contrasted with the fluctuations of the largest eigenvalue when there are no outliers (no perturbation or a < 1/g_+), which are Tracy-Widom and of order N^{−2/3}. The transition between the two regimes is called the Baik-Ben Arous-Péché (BBP) transition.
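Equation (9.5) is easy to verify numerically. A minimal sketch in the spirit of Figure 9.2 (the size, seed and the choice v = e_1 are arbitrary; by rotational invariance any unit vector works):

import numpy as np

N = 200
rng = np.random.default_rng(5)
v = np.zeros(N)
v[0] = 1.0

for a in np.linspace(0.25, 4.0, 16):
    H = rng.standard_normal((N, N))
    W = (H + H.T) / np.sqrt(2 * N)          # unit Wigner, sigma^2 = 1
    lam1 = np.linalg.eigvalsh(W + a * np.outer(v, v))[-1]
    print(f"a = {a:4.2f}   lam1 = {lam1:5.3f}   theory = {max(a + 1/a, 2.0):5.3f}")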


We finish this section with a discussion of the minimal a needed to have an outlier, namely a* = 1/g_+. From Eq. (9.4), it is not completely obvious how to find the point a*. The function R(g) is well-defined for g ∈ [0, g_+). However, R(g) typically also makes sense even beyond g_+. In that case, one will have spurious solutions of g(z) = g. Figure 9.3 shows a plot of z(g) = R(g) + 1/g in the unit Wigner case. In this case one sees that z(g) is still well defined for g > g_+ = 1 even if this function is no longer the inverse of g(z). The point g_+ is a minimum of z(g). This property is generic and comes from the monotonicity of g(z) for z > λ_+:

dz(g)/dg |_{g=g_+} = 0,    z(g_+) = λ_+.

For instance, for Wigner matrices, we have z(g) = σ^2 g + g^{−1}, which gives

σ^2 − g_+^{−2} = 0  ⇒  g_+ = σ^{−1},

and λ_+ = z(σ^{−1}) = 2σ, which is indeed the right edge of the semi-circle law.

9.2.2 Associated Eigenvector

The matrix resolvent in Eq. (9.1) can also tell us about the eigenvectors of the perturbed matrix. We expect that for a very large rank-1 perturbation avv^T, the eigenvector u_1 associated with the outlier λ_1 will be very close to the perturbation vector v. On the other hand, for λ_1 ≈ λ_+, the vector v will strongly mix with bulk eigenvectors of M, so the eigenvector u_1 will not contain much information about v.

To understand this phenomenon quantitatively, we will study the (squared) overlap |u_1^T v|^2. With the spectral decomposition of M + avv^T, we can write

G_a(z) = Σ_{k=1}^N u_k u_k^T/(z − λ_k),

where λ_1 denotes the outlier and u_1 its eigenvector. Then we have

lim_{z→λ_1} v^T G_a(z)v · (z − λ_1) = |u_1^T v|^2.


Then by (9.1) and (9.2), we can get that

|u_1^T v|^2 = lim_{z→λ_1} (g(z) + a g(z)^2/(1 − a g(z))) (z − λ_1) = lim_{z→λ_1} g(z) (z − λ_1)/(1 − a g(z)).

We cannot simply evaluate the fraction above at z = λ_1, for at that point g(λ_1) = a^{−1} and we would get 0/0. We can use l'Hôpital's rule3 and find

|u_1^T v|^2 = −g(λ_1)^2/g′(λ_1),    (9.6)

where we have used a^{−1} = g(λ_1). The RHS is always positive since g′(λ_1) < 0, as g is decreasing for λ > λ_+.

We can rewrite Eq. (9.6) in terms of the R-transform and get a more useful formula. To compute g′(z), we take the derivative with respect to z of the implicit equation z = R(g(z)) + g^{−1}(z) and get

1 = R′(g(z)) g′(z) − g′(z)/g^2   ⇒   g′(z) = 1/(R′(g(z)) − g^{−2}(z)).

Hence we have

|u_1^T v|^2 = 1 − g(λ_1)^2 R′(g(λ_1)) = 1 − a^{−2} R′(a^{−1}).    (9.7)

We can now check our intuition about the overlap for large and small perturbations. For a large perturbation a → ∞, Eq. (9.7) gives

|u_1^T v|^2 → 1  if a → ∞,

since R′(0) is always finite.

3 L'Hôpital's rule states that lim_{x→x_0} f(x)/g(x) = f′(x_0)/g′(x_0) when f(x_0) = g(x_0) = 0.

Figure 9.4 Overlap between the largest eigenvector and the perturbation vector of a Gaussian Wigner matrix with σ^2 = 1 with a rank-1 perturbation of magnitude a. Each dot is the overlap for a single random matrix with N = 200. Equation (9.8) is plotted as the solid curve.

The overlap near the transition λ_1 → λ_+ is a little more subtle. The derivative

of g(z) can be written as

g′(z) = −∫_{λ_−}^{λ_+} ρ(x)/(z − x)^2 dx.

For a density that vanishes at the edge as ρ(λ) ∼ (λ_+ − λ)^α with an exponent α between 0 and 1, we have that g(z) is finite at z = λ_+ but g′(z) diverges at the same point. Note that non-critical eigenvalue densities have α = 1/2. From Eq. (9.6), we have in that case4

|u_1^T v|^2 → 0  if λ_1 → λ_+.

If the density tends to a constant at λ+ (a discontinuous edge), then for z ap-proaching λ+ from above g(z) ∼ log(z − λ+) while g′(z) ∼ 1/(λ+ − z) and wehave |uT

1v|2 → 0.

Note that for λ1 > λ+, |uT1v|2 is of order 1. In Chapter 11 we will see that this

is not the case for the overlaps between eigenvectors in the bulk, which havetypical sizes of order N−1 for the squared overlaps.

4 In Chapter 4, we encountered a critical density where ρ(λ) behaves as (λ+ − λ)α with an exponentα = 3

2 > 1. In this case g′(z) does not diverge as z→ λ+ and the squared overlap at the edge of the BBPtransition would not go to zero. For example for the density given by Eq. (4.18) we find |uT

1v|2 = 49 at the

edge λ1 = 2√

2.

Page 184: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

178 Extreme Eigenvalues and Outliers

For Wigner matrices, R(x) = σ2x. We had λ1 = a + σ2/a and the overlap isgiven by

|uT1v|2 = 1 − σ2a−2 for a > σ. (9.8)

As a → σ, λ1 → 2σ and |uT1v|2 → 0, i.e. the eigenvector becomes delocalized

and merges with the bulk.

Exercise

9.2 Additive perturbation of a Wishart Matrix

Define a modified Wishart matrix W1 such that every element W1i j =

Wi j + a/N where W is a standard Wishart matrix and a is a constant oforder 1. W1 is a standard Wishart matrix plus a rank-1 perturbation W1 =

W + avvᵀ.

(a) What is the normalised vector v in this case?

(b) Using Eq. (9.4) find the value of the outlier and the minimal a in theWishart case?

(c) The square-overlap between the vector v and the new eigenvector u1 isgiven by Eq. (9.7). Give an explicit expression in the Wishart case.

(d) Generate large modified Wishart (q = 1/4, N = 1000) for a few ain the range [1, 5]. Compute the largest eigenvalue λ1 and associatedeigenvector u1. Plot λ1 and |uT

1v|2 as a function of a and compare withthe predictions of (b) and (c).

9.2.3 Multiplicative Perturbation

In data-analysis applications, we often need to understand the largest eigenvalueof a sample covariance matrix. A true covariance with a few isolated eigenvaluecan be treated as a matrix C0 with no isolated eigenvalues plus a low rank per-turbation. The passage from the true covariance to the sample covariance isequivalent to the free multiplication of the true covariance with a white Wishart

Page 185: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

Exercise 179

matrix with appropriate aspect ratio q = N/T . To understand such matrices, wewill now study outliers for a multiplicative process.

Consider the free multiplication of the identity matrix with a rank-1 perturbationand another matrix B.

E = B1/2(1 + bvvT )B1/2,

where B is positive semi-definite with τ(B) = 1. We can think of B as a whiteWishart, but the result is more general. The case where the unperturbed matrixis not the identity can be treated as well, but we will study this model for sim-plicity. If B were the identity, the largest eigenvalue would be λ1 = a := 1 + b.In the absence of outlier (b = 0) we would have E = B. Let us compute theexact position of the outlier and the condition for its existence. The eigenvaluesof E are the zeros of its characteristic polynomial, in particular for the largesteigenvalue λ1 we have.

det(λ11 − B − B1/2bvvT B1/2) = 0. (9.9)

We are looking for an eigenvalue outside the spectrum of B, i.e. λ1 > λ+ whereλ+ is the upper edge of the support of ρB(λ). For such a λ1, the matrix λ11 − Bis invertible and we can use the matrix determinant lemma Eq. (1.19)

det(A + wuT ) = det A ·(1 + uT A−1w

),

with A = λ11 − B and w = −u =√

bB1/2v. Equation (9.9) becomes

det(λ11 − B) ·(1 − bvT B1/2GB(λ1)B1/2v

),

where we have introduced the matrix resolvent GB(λ1) = (λ11 − B)−1. As wesaid, the matrix λ11 − B is invertible so its determinant is non-zero. Thus weneed to have

1 − bvT B1/2GB(λ1)B1/2v = 0.

Again we assume that B is a rotationally invariant matrix or v is an independentrandom vector uniformly distributed on the unit sphere. Then we have

vT B1/2GB(λ1)B1/2v ≈ τ(B1/2GB(λ1)B1/2

)= τ

(B(λ1 − B)−1

),

where we recognize on the r.h.s. the definition of the T-transform of the matrix

Page 186: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

180 Extreme Eigenvalues and Outliers

B: t(λ1) from Eq. (7.20). Thus, the position of the outlier λ1 is given by thesolution of

1 − bt(λ1) = 0.

To know if this equation has a solution we need to know if t(ζ) is invertible. Theargument is very similar as the one for g(z) in the additive case. In the large Nlimit, t(ζ) converges to

t(ζ) =

∫ λ+

λ−

ρ(x)xζ − x

dx.

So t(ζ) is monotonically decreasing for ζ > λ+ and is therefore invertible. Wethen have

λ1 = ζ(b−1) if λ1 > λ+,

where we use the notation ζ(t) for the inverse of the T-transform of B. For t < t+,the T-transform is no longer invertible and, in any case, we were looking for anoutlier greater than λ+.

The inverse function ζ(t) can be expressed in terms of the S-transform via Eq.(7.21). We get

λ1 = ζ(b−1) =b + 1

S (b−1)for b >

1t(λ+)

, (9.10)

where b + 1 is the location of the outlier when B = 1.

Applying the theory to a Wishart matrix with

S (x) =1

1 + qx, λ+ = (1 +

√q)2,

we have

λ1 = (b + 1)(1 +

qb

)if b >

√q.

For large b, we roughly have λ1 ≈ b + 1 + q, i.e. a large eigenvalue a = b + 1in the covariance matrix C will appear as a + q in the eigenvalues of samplecovariance matrix.

Page 187: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

Exercises 181

Exercises

9.3 Transpose version of multiplicative perturbation

Consider a positive definite rotationally invariant random matrix B anda normalized vector v. In this exercise, we will show that the matrix Fdefined by

F = (1 + cvvT )B(1 + cvvT ),

with c > 0 sufficiency large, has an an outlier λ1 given by Eq. (9.10) withb + 1 = (c + 1)2.

(a) Show that for two positive definite matrices A and B: B1/2AB1/2 has thesame eigenvalues as A1/2BA1/2.

(b) Show that for a normalized vector v(1 + (a − 1)vvT )1/2

= 1 + (√

a − 1)vvT

(c) Finish the proof of the above statement.

9.4 Multiplicative perturbation of an inverse-Wishart Matrix

Recall from Exercise 8.2 the definition of the inverse Wishart matrix M

M = (1 − q)W−1,

where W is a Wishart matrix with parameter q. The S-transform of M isgiven by

S M(t) = 1 − pt where p =q

1 − q.

Consider the diagonal matrix D with D11 = d and all other diagonal entriesequal to 1.

(a) D can be written as 1 + cvvT , What is the normalized vector v and theconstant c?

(b) Using the result from exercise 9.3, find the value of the largest eigen-value of the matrix DMD as a function of d. Note that your expressionwill only be valid for sufficiently large d.

Page 188: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

182 Extreme Eigenvalues and Outliers

(c) Numerically generate of matrices M with N = 1000 and p = 1/2 (q =

1/3). Find the largest eigenvalue of DMD for various values of d andmake a plot of λ1 vs d. Superimpose your analytical result.

(d) (harder) Find analytically the minimum value of d to have an outlier λ1.

Bibliographical Notes

In Section 9.1, we briefly mentioned the Tracy-Widom distribution for the fluc-tuations of largest eigenvalue around the edge of the spectrum. A proof of thisresult, or even just a heuristic derivation, is well beyond the scope of this book.The topic is now covered in most recent mathematics books on random matri-ces (see e.g. Anderson et al. [2010], Baik et al. [2016], Pastur and Scherbina[2010]).

None of the books on random matrix theory cited in the bibliography treat out-liers arising from low-rank additive or multiplicative perturbation. For a recentreview on the subject, the interested reader should consult Capitaine and Donati-Martin [2016].

Equation (9.5) was first published by physicists Edwards and Jones [1976] andby mathematicians Furedi and Komlos [1981]. Baik et al. [2005] studied thetransition in the statistics of the largest eigenvector in the presence of a low rankperturbation. The transition a > g+ from no outlier to one outlier is now calledthe BBP transition.

The case of additive or multiplicative perturbation to a general matrix (be-yond Wigner and Wishart) was worked out by Benaych-Georges and Nadakuditi[2011].

Page 189: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

10Bayesian estimation

10.1 Bayesian estimation

10.1.1 Motivation

We study consider a random matrix modified by additive or multiplicative noise:

E = C + B, or E = C1/2BC1/2.

Suppose we observe E, then we want to obtain some information on C. We havethe following probabilistic approaches:

noise process: P(E|C); inference: P(C|E).

We have

P(E,C) = P(E|C)P(C) = P(C|E)P(E),

so we have the Bayesian rule

P(C|E) =P(E|C)P(C)

P(E).

E is supposed to be known, so we can assume that P(E) is a constant coefficientand we can write

P(C|E) ∝ P(E|C)P(C).

Page 190: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

184 Bayesian estimation

We can recover the coefficient by imposing the normalization condition. Beforedoing Bayesian theory on random matrices, we first work out a simple exam-ple.

10.1.2 A simple case

We consider the 1D estimation problem:

y = x + η,

where x is some signal to be estimated, η is the noise, and y is the observation.Then P(y|x) is simply P(η) centered at x. Suppose η is a centered Gaussian noisewith variance N (for noise). Then we have

P(y|x) ∝ exp(−

(y − x)2

2N

)∝ exp

(2xy − x2

2N

).

Then we get that

P(x|y) ∝ P(x) exp(2xy − x2

2N

).

Here P(x) is called the prior, which is the probability of x in the absence of anyobservation. It is a property of the signal generator. A major advantage of theBayesian approach is that the prior is often quite arbitrary. If P(x) also takesGaussian form, then P(x|y) will also take a simple form. Such a prior is called“conjugate prior”, which is often used to make the computation tractable.

Example 1: P(x) is centered Gaussian with variance S (for signal). Then

P(x|y) ∝ exp(−

x2

2S+

2xy − x2

2N

)∝ exp

(−

S + N2S N

(x −

SS + N

y)2)

.

Example 2: P(x) is Bernoulli random variable with P(x = 1) = P(x = −1) =

1/2. Then

P(x|y) ∝exp (y/N) δ(x − 1) + exp (−y/N) δ(x + 1)

exp (y/N) + exp (−y/N).

Now suppose we have P(y|x), we want to know what is the “best” estimate for x.The answer will depend on a cost function. We will consider two costs:

Page 191: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

10.1 Bayesian estimation 185

(A) the quadratic cost E(|X − X|2),

(B) and the absolute value cost E(|X − X|).

In case (A), it is easy to show that argmin E(|X − X|2) satisfies E(X − X) = 0,i.e. X = E(X) is the conditional mean of X given our prior and the observationof y.

In case (B), it is easy to show that argmin E(|X−X|) satisfies E(sign(X−X)) = 0,i.e. X = Xmed is the conditional median of X given our prior and the observationof y.

In Example 1, we have

E(X) = Xmed =S

S + Ny,

which gives a linear estimator. If the prior is centered at x0, one can showthat

E(X) = Xmed =S

S + Ny +

NS + N

x0.

We shall call the above a linear shrinkage, with the ratios depend on the signalto noise ratio S/N.

In Example 2, if y > 0, then P(x = 1) > 1/2, and if y < 0, then P(x = −1) > 1/2.So we have

Xmed = sign(y).

It is also easy to calculate that

E(X) = tanh(y/N).

(insert a figure on tanh)

The linear shrinkage is conservative in the sense that it undershoots towardszero (or x0). Now we check the variance:

y = x + η, σ2y = S + N, and σ2

x = S < σ2y .

On the other hand

X =S

S + Ny, σ2

x =S 2

S + N< σ2

x.

Page 192: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

186 Bayesian estimation

Thus we have

σ2x < σ

2x < σ

2y .

The above inequalities are also true for the example 2. One can verify that σ2x =

1, σ2y = 1 + N and σ2

x ≤ 1 (< 1 for X = E(X)).

10.2 Bayesian estimation of the true covariance matrix

In Markowitz theory, we needed E(rrT ) conditioning on what we know up tonow:

Ξ = E(C)|now.

The claim is that the sample covariance matrix is not Ξ in the large dimensional(i.e. large N) case. This is related to the Stein paradox.

We now go back to the Bayesian estimation:

P(C|E) ∝ P(E|C)P0(C),

where by (3.2) we have

P (E|C) ∝ (det C)−T/2 exp[−

T2

Tr(C−1E

)].

Right now, we pick the conjugate prior of the form

P0 (C) ∝ (det C)α exp[βTr

(C−1X

)]for some matrix X. This is in fact the probability density of the elements ofan inverse Wishart matrix. Consider an inverse Wishart matrix C of size N, T ∗

degree of freedom and centered at a (positive definite) matrix X. If T ∗ > N + 1,C has the density

P(C) ∝ (det C)−(T ∗+N+1)/2 exp[−

T ∗ − N − 12

Tr(C−1X

)].

Here T ∗ > N is some parameter that is unrelated to T in our problem, and thenormalization is such that EC = X. As T ∗ → ∞, we have C → X. Then wehave that

P(C|E) ∝ (det C)−(T+T ∗+N+1)/2 exp[−

T2

Tr(C−1A

)], (10.1)

Page 193: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

10.2 Bayesian estimation of the true covariance matrix 187

where we define

A := E +T ∗ − N − 1

TX.

We notice that (10.1) is also a probability density for an inverse Wishart withT = T + T ∗ and centered at the matrix

TAT + T ∗ − N − 1

.

Hence we have

E(C|E) =TA

T + T ∗ − N − 1= αE + (1 − α)X.

i.e.

E(C|E) = αE + (1 − α)X,

where

α =T

T + T ∗ − N − 1.

This again gives a linear shrinkage. We make the following remarks.

• The linear shrinkage works even for small N case, i.e. without the large Nhypothesis.

• In general, if one has no idea of what X should be, one can use the identitymatrix:

Ξ(E) = αE + (1 − α)1.

Moreover, T ∗ is generally unknown. It may be inferred from the data orlearned via a validation phase.

• The linear shrinkage works quite well in practice, showing that inverse Wishartis not a bad prior for the true covariance matrix. (insert a figure on inverseWishart spectrum) The spectrum of the inverse Wishart matrix is more skewedthan that of Wishart.

Page 194: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

188 Bayesian estimation

10.3 Linear Ridge regression and Marcenko-Pastur law

We consider a linear model

y = HT a + ε,

where H is an N × T data matrix, ε is a T dimensional noise vector (which isnot directly observable), a is an N dimensional vector of coefficients we want toestimate, and y is the dependent variable. Then we have the following two typesof tasks.

• In-sample estimate: we observe H and y, and estimate a.

• Out-of-sample prediction: we choose some H2 and predict y with the in-sample estimate.

To estimate a, we try to minimize the forecast mean squared error

MSE(a) :=1TEε

∥∥∥y −HT a∥∥∥2

=1TEε

∥∥∥HT a + ε −HT a∥∥∥2.

Note that if we knew and used the true a, then the error is E(εT ε)/T . MinimizingMSE(a) over a, it is easy to see that

a =(HHT

)−1Hy.

When q := N/T < 1, HHT is in general invertible (e.g. in the case of Wishartmatrices). Then we write

a = E−1b, E :=1T

HHT , b :=1T

Hy,

and this is the best in-sample estimator. However, this is not necessarily the casefor the out-of-sample prediction.

Now we calculate the mean squared error MSE(a) for some vector of the forma = Ξ−1b, where Ξ is some matrix. (If Ξ is taken to be E, then we are reducedto the case of a.) Assuming E(εεT ) = Σ1T and after some calculations, one canshow that

MSE(a) =1T

Σ[T − 2 Tr(Ξ−1E)) + Tr(Ξ−1EΞ−1E)

]+ aT

(E − 2EΞ−1E + EΞ−1EΞ−1E

)a.

Page 195: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

10.3 Linear Ridge regression and Marcenko-Pastur law 189

In the case Ξ = E, we have

MSE(a) =Σ

T(T − N) = (1 − q)Σ,

which is smaller than the true error E(εT ε)/T = Σ. The above result is a little“magic” in the sense that it actually does not depend on either E or a. We seethat as in the case of portfolio optimization, the error → 0 as N → T , and theactual error = 0 when N ≥ T .

Next we calculate the expected out-of-sample error. We draw another matrix H2

with size N × T2 and consider another independent noise vector ε2 of size T2

(where T2 is not related to T and can even be equal to 1). We calculate

MSE2(a) =1

T2EH2,ε2

∥∥∥HT2 a + ε2 −HT

2 a∥∥∥2

=1

T2EH2,ε2

∥∥∥∥∥HT2 a + ε2 −HT

2 Ξ−1Ea −HT2 Ξ−1 1

THε

∥∥∥∥∥2.

We assume that T−12 E(ε2ε2

T ) = C with also Tr C = Σ. When taking Ξ = E, wethen have

MSE2(a) =1

T2EH2,ε2

∥∥∥∥∥ε2 −HT2 Ξ−1 1

THε

∥∥∥∥∥2= Σ +

Σ

TTr(E−1C).

If E is taken as a sample covariance matrix with covariance C, then we have

Tr(E−1C) = Tr(W−1) ≈N

1 − q,

where W denotes a standard Wishart matrix. Thus we have

MSE2(a) = Σ +qΣ

1 − q=

Σ

1 − q.

Thus as expected, we see that

in-sample error ≤ true error ≤ out-of-sample error,

with out-of-sample error→ ∞ as N → T .

For the out-of-sample prediction, a is the best we can use to forecast. We now

Page 196: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

190 Bayesian estimation

focus on the estimation error on a, which is easier than the out-of-sample error.We have

error = Eε‖a − a‖2 = Eε∥∥∥∥∥a − E−1 1

THy

∥∥∥∥∥2= Eε

∥∥∥∥∥∥E−1HεT

∥∥∥∥∥∥2

TTr(E−1).

(10.2)If E is a sample covariance matrix with covariance C, then we get

error =Σqτ(C−1)

1 − q.

Again the estimation error→ ∞ as N → T . In fact, we can improve over a usingthe Bayesian theory. In the case with Gaussian noise and Gaussian unknownvector a, the Bayesian theory gives the best linear estimator that minimizes theout-of-sample variance. Suppose H is given, then

P(a|y) ∝ P(y|a)P(a),

where

P(y|a) ∝ exp−

12Σ‖y −HT a‖2

, P(a) ∝ exp

12A‖a‖2

,

where A is the a-priori variance of a component of a. Then one can verifythat

P(a|y) ∝ exp−

T2Σ

(a − Ξ−1b

)TΞ

(a − Ξ−1b

),

where

Ξ := E + α1N , α :=Σ

T A, b =

1T

Hy.

From the above expression, it is easy to get the conditional mean as the bestestimator for a:

aR = Ξ−1b,

which is called the Ridge estimation. Since Ξ > E, aR underestimates a com-pared with a. Moreover, the trace of conditional covariance is the estimationerror. With

Ccond =Σ

TΞ−1,

Page 197: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

10.3 Linear Ridge regression and Marcenko-Pastur law 191

we get that

Ridge estimation error =Σ

TTr Ξ−1 =

Σ

TTr(E + α)−1 = −qΣgE(−α).

Since gE(0) = −N−1 Tr E−1, we see that the naive estimation error in (10.2) isactually −qΣgE(0). Note that −gE(−x) is monotonically decreasing from x = 0to x→ ∞, so we have

Ridge estimation error < naive estimation error.

In the simple case with C = 1, we have an explicit expression for the estimationerror of Ridge regression. We remark that the Bayesian conditional expectationis the same as minimizing in-sample error with quadratic penalty.

Page 198: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

11Eigenvector Overlaps and Rotationally Invariant

Estimators

11.1 Eigenvector Overlaps

11.2 Rotationally Invariant Estimator

We now start talking about the rotationally invariant estimator (RIE). If Ξ(E) isan estimator of C with the property that

Ξ(OEOT ) = OΞ(E)OT (11.1)

for any orthogonal matrix O, then we shall call Ξ a RIE. We expect that Ξ(E) tobe diagonal when E is diagonal, i.e. Ξ(E) can be diagonalized in the same basisas E. We now consider the Bayesian estimator with rotational invariance priorP0 (C). Then for

P(C|E) ∝ (det C)−T/2 exp[−

T2

Tr(C−1E

)]P0(C),

it is easy to verify that for any orthogonal matrix O,

E(C|OEOT ) =

∫CP(C|OEOT )P0(C)DC

= O[∫

CP(C|E)P0(C)DC]

OT = OE(C|E)OT ,

using the change of variable C = OT CO. Hence Ξ(E) = E(C|E) is a rotationallyinvariant estimator.

Page 199: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

11.2 Rotationally Invariant Estimator 193

We now write a RIE Ξ(E) in the basis of E:

Ξ(E) =∑

k

ξkvkvTk ,

where vk are eigenvectors of E. We now cheat a little bit and assume that wealready know C. Of course, C itself is the best estimator for C, but we want toknow what is the best we can do under the basis of E. We want to minimize thesquare (i.e. the least-square-error)

Tr(Ξ(E) − C)2 =∑

k

vTk (Ξ(E) − C)2vk =

∑k

(ξ2

k − 2ξkvTk Cvk + vT

k C2vk).

Minimizing over ξk, it is easy to get that

ξk = vTk Cvk.

11.2.1 General Problem

In this chapter, we study the RIE for the following two cases:

(A) E = C + B with τ(B) = 0;

(B) E = C1/2BC1/2 with τ(B) = 1 and C positive semi-definite.

Here C is a matrix we want to estimate, and B is a rotationally invariant matrix(noise). The matrix E can be regarded as a noisy version of C. We have dis-cussed a little about the case (B) in last chapter for sample covariance matrix Ewith B being a Wishart matrix with parameter q.

Suppose in our priori knowledge, C is rotationally invariant. Then the best esti-mator Ξ(E) for C should be also rotationally invariant in the sense of (11.1). Inparticular, Ξ(E) should be in the same basis as E:

Ξ(E) =∑

k

ξkvkvTk , with E =

∑k

λkvkvTk .

Suppose we already know C, then we have derived in last section that the least-square-error is achieved by

ξk = vTk Cvk.

Page 200: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

194 Eigenvector Overlaps and Rotationally Invariant Estimators

However, in practice we do not know C. But we still want to ask whether thereis a way to compute vT

k Cvk. As what we have done for outliers, we use theresolvent here:

G(z) =∑

k

vkvTk

z − λk, Tr CG(z) =

∑k

vTk Cvk

z − λk.

Then the residue at the pole z = λk gives the vTk Cvk. In the large N limit, we

expect that

1N〈Tr CG(z)〉 →

∫ λ+

λ−

ρ(λ)ξ(λ)z − λ

for some continuous function ξ. Then we have

limη→0+

Im τ(CG(λ − iη)) = πρ(λ)ξ(λ).

Combining with (2.24), we obtain that

ξ(λ) =limη→0+ Im H(λ − iη)limη→0+ Im g(λ − iη)

, (11.2)

where we denote

H(z) := τ(CG(z)). (11.3)

Now it remains to calculate H(z). Recall the subordination relations (7.16) and(7.23), we have

gE(z) = gC (z − RB(gE(z))) (11.4)

for case (A), and

tE(z) = tC (zS B(tE(z))) (11.5)

for case (B). We recall that

gE(z) =1NETr GE(z), tE(z) = zgE(z) − 1,

with

GE(z) =1

z − E, TE(z) = zGE(z) − IN .

Page 201: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

11.2 Rotationally Invariant Estimator 195

It turns out the above relations (11.4) and (11.5) also hold for the expectationvalue of the resolvent matrices, i.e.

EGE(z) = GC (z − RB(gE(z))) , (11.6)

for case (A), and

ETE(z) = TC (zS B(tE(z))) (11.7)

for case (B).

11.2.2 Additive Case

We now compute H(z) using (11.6):

H(z) = τ(CGC(Z)) = τ[C(Z − C)−1

]= ZgC(Z) − 1.

where we denote Z = z − RB(gE(z)). Then we have

ξ(λ) =limη→0+ Im ZgC(Z)limη→0+ Im gC(Z)

= λ −limη→0+ Im RB(gE(z))gE(z)

limη→0+ Im gE(z), z = λ − iη.

(11.8)If there is no noise, i.e. B = 0, then we have ξ(λ) = λ. If B is small, then

RB(x) = τ(B) + τ(B2)x + · · · ,

where τ(B) = 0 and τ(B2) is small. Hence we have

ξ(λ) = λ + small correction.

We can consider the addition of Wigner noise. Then we have

RB(g) = σ2g, Im gR(g) = Imσ2g2 = 2σ2(Re g)(Im g),

which by (11.8) gives that

ξ(λ) = λ − 2σ2 Re gE(λ).

We consider a special case where C is another Wigner matrix with variance σ2S .

Page 202: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

196 Eigenvector Overlaps and Rotationally Invariant Estimators

Then E is also a Wigner matrices with variance σ2 +σ2S . Using (2.22), it is easy

to get that in the bulk −2√σ2 + σ2

S < λ < 2√σ2 + σ2

S ,

Re gE(λ) =λ

2(σ2 + σ2S ).

Hence we obtain that

ξ(λ) = λ −λσ2

σ2 + σ2S

=λσ2

S

σ2 + σ2S

,

which is the linear shrinkage.

11.2.3 Multiplicative Case

We can now tackle the multiplicative case which includes the estimation of thetrue covariance matrix given the sample covariance. As in the additive case,we want to compute the function H(t) := τ(CGE(z)). In particular we want tofind an expression for H(t) that does not involve the unknown matrix C. To doso we use the subordination relation for the T matrix Eq. (11.7). It is usefulto re-express Eq. (11.2) in terms of TE(z) and tE(z). For the denominator wehave

limη→0+g(λ − iη) =

t(λ − iη).

And the imaginary part of the function H(t) can be written as

limη→0+

Im H(λ − iη) =1λ

limη→0+

Im τ(CTE(z))

=1λ

limη→0+

Im τ(CTC(zS B(t))),

where t = tE(λ − iη). Since TC(z) = C(z1 − C)−1, we have

τ [CTC(zS B(t))] = τ[(C2(zS B(t)1 − C)−1

]= τ

[(C(C − zS B(t)1 + zS B(t)1)(zS B(t)1 − C)−1

]= τ(C) + zS B(t)tC(zS B(t))

= τ(C) + zS B(t)tE(z)

Page 203: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

Exercise 197

When z = λ+ iη, z→ λ and τ(C) is real and does not contribute to the imaginarypart, thus we get that

ξ(λ) = λlimη→0+ Im S B(tE(z))tE(z)

limη→0+ Im tE(z), z = λ − iη. (11.9)

Equation (11.9) is very general. It applies to sample covariance matrices wherethe noise matrix B is a white Wishart, but it also applies to any multiplicativenoise process. In particular, using the results of chapter 8.

We now apply the above theory to sample covariance matrix, which has

S B(t) =1

1 + qt.

We simplify t ≡ tE(z). In the bulk λ− < λ < λ+, t is complex with nonzeroimaginary part. Then we have

ξ(λ) = λlimη→0+ Im t

1+qt

limη→0+ Im t=

λ

|1 + qt(λ)|2. (11.10)

We consider a special case where C is the inverse-Wishart. Then E is a freeproduct of an inverse Wishart matrix with a Wishart matrix. Then (11.10) givesa shrinkage that is different from the linear shrinkage in the Bayesian estimatedue to the factor t(λ). In fact, 0 < t(λ+) < 1, which gives ξ(λ+) < λ+, and−1 < t(λ−) < 0, which gives ξ(λ−) > λ−.

Exercise

11.1 Cleaning a SCM when the true covariance is a Wishart

Assume that the true covariance Matrix C is given by a Wishart matrixwith parameter q0. This Wishart matrix is not the sample covariance matrixof anything but just a tractable model for C for which the computation canbe done semi-analytically (we will get cubic equations!).

We observe a sample covariance matrix E over T = qN time intervals. Eis the free product of C and another Wishart matrix of parameter q.

E = C1/2WC1/2

Page 204: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

198 Eigenvector Overlaps and Rotationally Invariant Estimators

(a) Given that the S-transform of the true covariance is S C(t) = 1/(1 +

q0t) and the S-transform of the Wishart is S W(t) = 1/(1 + qt). Use theproduct of S-transforms for free multiplication and Eq. (7.21) to writeand equation for tE(z). It should be a cubic equation in t.

(b) Using a numerical polynomial solver (e.g. np.roots) solve for t(z) for zreal between 0 and 4, choose q0 = 1/4 and q = 1/2. Choose the rootwith positive imaginary part. Use Eqs. (7.19,2.24) to find the eigen-value density and plot this density. The edge of the spectrum should be(slightly below) 0.05594 and (slightly above) 3.746.

(c) For λ in the range [0.05594,3.746] plot the optimal cleaning function(use the same solution t(z) as in (b)):

ξ(λ) =λ

|1 + qt(λ)|2

(d) For N = 1000 numerically generate C (q0 = 1/4) and two versions ofW1,2 (q = 1/2) and two versions of E1,2 ≡ C1/2W1,2C1/2. E1 will bethe “in-sample” matrix and E2 the “out-of-sample” matrix. Check thatτ(C) = τ(W1,2) = τ(E1,2) = 1 and that τ(C2) = 1.25, τ(W2

1,2) = 1.5 andτ(E2

1,2) = 1.75.

(e) Plot the normalised histogram of the eigenvalues of E1, it should matchyour plot in (b).

(f) For every eigenvalue, eigenvector pair (λk, vk) of E1 compute ξora(λk) ≡vᵀk Cvk. Plot ξora(λk) vs λk and compare with your answer in (c).

We saw in class that the minimum risk portfolio with expected gain Gis given by

π = GC−1g

gᵀC−1g(11.11)

where C is the covariance matrix (or an estimator of it) and g is thevector of expected gains. Compute the matrix Ξ by taking the matrix E1

and replacing its eigenvalues λk by ξ(λk) and keeping the same eigen-vectors. Use your result of (c) for ξ(λ), if some λk are below 0.05594 orabove 3.746 replace them by 0.05594 and 3.746 respectively. This is so

Page 205: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

Exercise 199

that you don’t have to worry about finding the correct solution t(z) for zoutside of the bulk.

(g) Build the three portfolios πC, πE and πΞ by computing Eq. (11.11) forthe three matrices C, E1 and Ξ using G = 1 and g = e1 the vector with 1in the first component and 0 everywhere else. These three portfolios cor-respond to the true optimal, the naıve optimal and the cleaned optimal.The true optimal is in general unobtainable. For these three portfolioscompute the in-sample risk Rin ≡ π

ᵀE1π, the true risk Rtrue ≡ πᵀCπ and

the out-of-sample risk Rout ≡ πᵀE2π.

(h) Comment on these nine values using your notes from class. For πC andπE you should find exact theoretical values. For πΞ, as far as I know,there is no closed form formula for the in or out-of-sample risk, it shouldbetter that πE but worse than πC.

11.2.4 Multiplicative-Additive Case

E =√

C + XW√

C + X (11.12)

where X is some general additive noise and W some general multiplicativenoise. Then

gE(z) = S W(t)gC

(S W(t)z − RX

(gE(z)S W(t)

))where t = zgE(z) − 1

The optimal estimator is then given by

ξ(λ) =limη→0+ Im[Zg − S ]

limη→0+ Im g(11.13)

where g = gE(z), S = S W(zg − 1) and Z = S g − RX(g/S ). For Wigner X andWishart W we have

S =1

1 − q + qgzand Z = S z − σ2g/S

Page 206: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

200 Eigenvector Overlaps and Rotationally Invariant Estimators

11.2.5 RIE for Outliers

So far, we have focused on the bulk eigenvectors. It turns out that our formulasare also valid for outliers of C that appear as outliers of E. One can show thatoutside the bulk, g(z) and t(z) are analytic on the real axis and

Im g(λ − i0+) = −ηg′(λ), Im t(λ − i0+) = −ηt′(λ).

Then (11.8) and (11.9) can be written as

ξ(λ) = λ −ddg

[RB(g(λ))g(λ)]

and

ξ(λ) = λddt

[S B(t(λ))t(λ)] ,

respectively. For large outliers, we have

g(λ) ∼1λ, t(λ) ∼

τ(E)λ

.

Then in Wigner case, for large outliers we have

ξ(λ) ≈ λ −2σ2

λ.

Recall that if E has a large perturbation a, we shall have

λ = a +σ2

a,

which gives

ξ(λ) = a +σ2

a−

2aσ2

a2 + σ2 < a.

11.3 Conditional Average in Free Probability

In this section we give an alternative derivation of the rotationally invariantestimator. This derivation is more elegant, albeit more abstract. In particular,it doesn’t rely on the computation of eigenvector overlap, so by itself it missesthe important link between the RIE and the computation of overlaps.

Page 207: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

11.3 Conditional Average in Free Probability 201

In the context of free probability, we work with abstract objects (E, C, etc.) thatsatisfy the axioms of chapter 7. We can think of them as infinite dimensionalmatrices. We are given the matrix E that was obtained by free operations froman unknown matrix C. For instance it could be given by a combination of freeproduct and free sum as in Eq. (11.12).

The matrix E is generated from the matrix C, in this sense, E depends on C. Wewould like to find the best estimator (in the least-square sense) of C given E. Itis given by the conditional average

Ξ = E[C]|E . (11.14)

In this abstract context, the only object we know is E so Ξ must be a functionof E. Let’s call this function Ξ = ξ(E). The fact that Ξ is a function of E onlyimposes that Ξ commutes with E, i.e. that Ξ is diagonal in the eigenbasis of E.One way to determine the function ξ(E) is to compute all possible moments ofthe form mk = τ[ξ(E)Ek]. They can be combined in the function

H(z) := τ[ξ(E)(z1 − E)−1

],

via its Taylor series at z→ ∞. Using Eq. (11.14), we write

H(z) = τ[E[C]|E (z1 − E)−1

].

But the operator τ[.] contains the expectation value over all variables. By thelaw of total expectation,

H(z) = τ[C(z1 − E)−1

],

which is identical to Eq. (11.3). To recover the function ξ(x) from H(z) we usea spectral decomposition of E

H(z) =

∫ρE(λ)

ξ(λ)z − λ

dλ,

so

limη→0+

Im H(λ + iη) = πρE(λ)ξ(λ).

which is equivalent to

ξ(λ) = limη→0+

Im H(λ + iη)Im gE(λ + iη)

.

Page 208: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

202 Eigenvector Overlaps and Rotationally Invariant Estimators

11.4 Real Data

The good news about the RIE estimator is that it only depends on the transformsof the observable matrix E, gE(z) and tE(z) and the R or S transform of the noiseprocess. One may think that real world applications should be relatively straight-forward. Unfortunately, we need to know the behavior of the limiting transformson the real axis, precisely where the discrete N transforms gN(z) and tN(z) failto converge... There are three potential work-around this problem

11.4.1 Parametric Fit

One can postulate a functional form for ρC(λ) and fit the parameters on the data.This would give analytical formulas for all the relevant transforms so one canextract the exact behavior on the real axis.

11.4.2 Small Imaginary Part

One can work with the discrete transforms, but instead of computing their valueson the real axis (where they are singular), one works with z = x + iη withη ∼ 1/

√N. In practice this works well except for small eigenvalues where

|λ| ∼ η.

11.4.3 Validation Set

Compute on a different data set.

Page 209: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

12Applications to Finance

12.1 Portfolio Theory

12.1.1 Returns and risk

We consider N risky assets with returns given by

rti =

pti − pt−1

i

pt−1i

,

where t denotes the time (day) and pti denotes the price of asset i at time t.

The portfolio πi is the dollar amount invested on asset i. The total capital is C.Naively, we can take C =

∑i πi. But one can borrow C <

∑i πi and pay the risk

free rate, or invest some capital at the risk free rate. Then the total return (interms of dollars) is

Rt =∑

i

πirti + (C −

∑i

πi)r0,

so that the excess return is

Rt −Cr0 =∑

i

πi(rti − r0),

where rti − r0 is the excess return of asset i. From now on, we will denote rt

i − r0

by rti and assume that Ert

i is negligible, i.e. Erti = 0. Then we want to ask what

Page 210: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

204 Applications to Finance

is the risk of a portfolio π:

σ2R := E(R2) =

∑i, j

πiπ jE(rir j) = πT Cπ,

where C is the covariance matrix with entries Ci j = E(rir j).

However, usually one cannot hope to have the exact value of C. Instead we canonly use a noisy estimator of C, and the most natural choice is the sample co-variance matrix E. If π is uncorrelated with the noise in E, then we have

〈σ2E〉 = 〈πT Eπ〉 = πT 〈E〉π = πT Cπ,

i.e. the risk σ2E is also unbiased. In principle, one do not need the random matrix

theory for risk control. Also in practice, π is built based on optimization usingthe historical data, so generally

〈σ2E〉 , π

T Cπ.

12.1.2 Markowitz optimization

Next we talk about the portfolio optimization under the restriction

πT Cπ ≤ σ2R (12.1)

for some fixed σ2R, i.e. we have a bound for the possible risk. Here C need to be

the true variance matrix or at least the unbiased sample covariance matrix basedon what we know. Suppose we have a vector of expected returns g. We want tomaximize πT g subject to (12.1). With Lagrange multiplier, it is easy to see thatwe need to take

π = σ2R

C−1ggT C−1g

.

Equivalently, one can also set a return goal πT g ≥ G and minimize the riskπT Cπ. Again with Lagrange multiplier, we get that

π = GC−1g

gT C−1g.

Page 211: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

12.1 Portfolio Theory 205

12.1.3 In-sample and out-of-sample risk

If we replace C with the sample covariance matrix E in the above argument, weget

π = GE−1g

gT E−1g.

We know that E has larger eigenvalues spectrum than that of C. In particular,the smallest eigenvalue λmin → 0 as q → 1, and E−1 then becomes singular.Even if q < 1, E−1 overweights the smaller eigenvalues. We now try to quantifythis overfitting. (insert a figure on in-sample risk and out-of-sample risk) Wehave

• in-sample risk:

πT Eπ,

• out-of-sample risk:

πT E2π, with 〈E2〉 = C,

• average OOS risk:

πT Cπ,

• the optimal portfolio:

π0 = GC−1g

gT C−1g.

• the optimal risk:

πT0 Cπ0.

Then in practice, we have that

• in-sample risk (ISR):

G2 gT E−1g(gT E−1g

)2 =G2

gT E−1g,

Page 212: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

206 Applications to Finance

• average OOS risk (AOOSR):

G2 gT E−1CE−1g(gT E−1g

)2 ,

• the optimal risk (OR):

G2 gT C−1g(gT C−1g

)2 =G2

gT C−1g.

12.1.4 Error using the sample covariance matrix

Suppose g is a random vector of norm ‖g‖ = g0 and is independent of the noisein E and C, and E is a sample covariance matrix

E = C1/2WqC1/2,

where Wq is a Wishart matrix. Then due to the self-averaging mechanism, wehave

gT E−1g ≈ g20τ

(C−1/2W−1

q C−1/2)

= g20τ

(C−1

)τ(W−1

q

)=

g20τ

(C−1

)1 − q

,

where τ(W−1q ) is calculated used the expansion of the Stieltjes transform around

z = 0 (see (2.19)). Thus we have

ISR = (1 − q)G2τ(C−1)−1

g20

, OR =G2τ(C−1)−1

g20

.

For the AOOSR, we need to compute

gT E−1CE−1g ≈ g20τ

(C−1/2W−2

q C−1/2)

= g20τ

(C−1

)τ(W−2

q

)=

g20τ

(C−1

)(1 − q)3 ,

where τ(W−2q ) is also calculated using (2.19). Thus we get that

AOOSR =G2τ(C−1)−1

(1 − q)g20

.

Page 213: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

12.1 Portfolio Theory 207

Note that the AOOSR diverges when q→ 1. Moreover, one can observe that

in-sample risk ≤ true optimal risk ≤ out-of-sample risk.

When q→ 0, we recovers the perfect estimation.

Page 214: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

13Replica Trick

In this chapter we will review another important tool to perform computations inrandom system and in particular in random matrix theory. Suppose that we wantto compute the free energy of a random system and that we express this freeenergy as the logarithm of some partition function. We expect that free energywill not depend on the particular sample so we can average the free energywith respect to the randomness in the system to get the typical free energy.Unfortunately averaging the logarithm of a partition function is hard. What wecan do is compute the partition function to some power n and later let n → 0using

log Z = limn→0

Zn − 1n

. (13.1)

The partition function Zn is just the partition function of n non-interacting copiesof the same system Z, these copies are called replica, hence the name of the tech-nique. Averaging the logarithm is then equivalent to averaging Zn and taking theabove limit. The averaging procedure will couple the n copies and the resultingsystem might be hard to solve. In most interesting cases, the partition functioncan only be computed as the size of the system (say N) goes to infinity. Natu-rally one is tempted to interchange the limits (n→ 0 and N → ∞) but there is noreal justification for doing so. Another problem is the we can hope to computeEZn for all integers n but is that really sufficient to do a proper n→ 0 limit? Forthese reasons, replica trick computation are not considered rigorous. Neverthe-

Page 215: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

13.1 Stieltjes Transform 209

less, they are a good source of intuition and results that mathematicians wouldcall conjectures.

13.1 Stieltjes Transform

13.1.1 General Case

To use the replica trick in random matrix theory, we first need to express theStieltjes transform as the average logarithm of a random determinant. In thelarge N limit and for z sufficiently far form the real eigenvalues, the discreteStieltjes transform gN(z) converges to a deterministic function g(z). The replicatrick will allow use to compute EgN(z) which also converges to g(z). Using thedefinition Eq. (2.16) and dropping the N subscript, we have

EgA(z) =1NE

N∑k=1

1z − λk

,

while the determinant of z1 − A is given by

det(z1 − A) =

N∏k=1

(z − λk).

We can turn the product in the determinant into a sum by taking the logarithmand turn log(z− λk) into (z− λk)−1 by taking the derivative with respect to z. Wethen get

EgA(z) =1NE

ddz

log det(z1 − A).

To compute the determinant we may use the multivariate Gaussian identity∫dNψ

(2π)N/2 exp(−ψT Mψ

2

)=

1√

det M,

which is exact for any N as long as the matrix M is positive definite. For zlarger than any potential eigenvalue of A, (z1−A) will be positive definite. TheGaussian formula allows us to compute det−1/2, we can absorb the power of

Page 216: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

210 Replica Trick

−1/2 in the logarithm. Applying the replica trick (13.1) we get

EgA(z) =−2N

Eddz

limn→0

Zn − 1n

,

with

Zn =

∫ n∏α=1

dNψα(2π)N/2 exp

− n∑α=1

ψTα(z1 − A)ψα

2

, (13.2)

where we have written Zn as n copies of the same Gaussian integral. This is allfine, except our Zn is only defined for integer n and we need to take n→ 0. Thereplica trick is then to compute the limiting Stieltjes as

gA(z) = −2ddz

limN→∞

limn→0

EZn − 1Nn

.

13.1.2 Wigner Case

As an example of replica trick calculation, we can compute, yet again, the Stielt-jes transform for the Wigner ensemble. We want to take the expectation value ofEq. (13.2) in the case where A = X a symmetric Gaussian rotational invariantmatrix.

EZn =

∫ n∏α=1

dNψα(2π)N/2E

exp

− z2

n∑α=1

N∑i=1

ψ2αi −

N∑i< j

Xi jψαiψα j −12

N∑i

Xiiψαiψαi

,

=

∫ n∏α=1

dNψα(2π)N/2 exp

− z2

n∑α=1

N∑i=1

ψ2αi

N∏i< j

E

exp

−Xi j

n∑α=1

ψαiψα j

×

N∏i

E

exp

−12

Xii

n∑α=1

ψαiψαi

,

where we have isolated the products of expectation of independent terms andseparated the diagonal and off-diagonal terms. We can evaluate the expectationvalues using the following identity, for a centered Gaussian variable x of vari-ance σ2, we have

Eeax = eσ2a2/2.

Page 217: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

13.1 Stieltjes Transform 211

Using the fact the diagonal and off-diagonal elements have variance 2σ2/N andσ2/N, respectively. We get

EZn =

∫ n∏α=1

dNψα(2π)N/2 exp

− z2

n∑α=1

N∑i=1

ψ2αi

N∏i< j

exp

−σ2

2N

n∑α=1

(ψαiψα j)2

×

N∏i

exp

−σ2

4N

n∑α=1

(ψαiψαi)2

We can now regroup all the terms in the exponential and combine the last twosum into a single sum over αi j. We notice that

N∑i, j=1

n∑α=1

(ψαiψα j)2 =

n∑α,β=1

N∑i=1

ψαiψβi

2

We would like to integrate over the variables ψαi but the argument of the ex-ponential contains a forth order term in the ψ’s. To tame this term, we use theHubbard-Stratonovich identity

exp(ax2

2

)=

∫dq√

2πaexp

(−

q2

2a+ xq

). (13.3)

Before we use Hubbard-Stratonovich, we need to regroup diagonal and of diag-onal terms in αβ.

N4σ2

n∑α,β=1

N∑i=1

ψαiψβi

2

=N

2σ2

n∑α<β

N∑i=1

ψαiψβi

2

+Nσ2

n∑α=1

N∑i=1

ψαiψαi

2

2

,

where in the diagonal terms we have pushed the factor 1/4 in the squared quan-tity for later convenience. We can now use Eq. (13.3), introducing diagonal qααand upper triangular qαβ to linearize the squared quantities. Writing the q’s as asymmetric matrix we have

EZn ∝

∫dq

∫ n∏α=1

dNψα(2π)N/2 exp

−N Tr q2

4σ2 −

N∑i=1

n∑α,β=1

(zδαβ − qαβ)ψαiψβi

2

,where dq is the integration over the independent component of the n × n sym-metric matrix q, note that we have dropped z-independent constant factors. The

Page 218: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

212 Replica Trick

integral of ψαi is now a multivariate Gaussian integral, actually N copies of thesame n-dimensional Gaussian integral:∫ n∏

α=1

dψk√

2πexp

− n∑α,β=1

(zδαβ − qαβ)ψkψl

2

= (det(z1 − q))−1/2

Raising this integral to the Nth power and using det M = exp Tr log M wefind

EZn ∝

∫dq exp

[N Tr

(−

q2

4σ2 −12

log(z1 − q))]

=

∫dq exp

(−

N2

F(q)).

We need to evaluate EZn for large N, in this limit the integral over the matrix qcan be done by the saddle point method. More precisely, we could find a mini-mum of F(q) in the n(n + 1)/2 elements of q. Alternatively we can diagonalizeq, introducing the log of a Vandermonde determinant in the exponential (seesection 4.1.3).1 In terms of the eigenvalues qα of q,

F(qα) =

n∑α=1

q2α

2σ2 + log(z − qα) −1N

∑α,β

log |qα − qβ|.

To find a minimum, we take the partial derivatives of Fqα with respect to theqα and equate them to zero

qασ2 −

1z − qα

−1N

∑α,β

2qα − qβ

= 0

The effect of the last term is to push the eigenvalues qα away from each otherby a distance of order 1/N. Since there are only n such eigenvalues, the totalspread (from the largest to smallest) is of order n/N which we will neglect.Hence we can assume that all eigenvalues are identical and equal to q∗(z) whereq∗(z) satisfies

z − q∗ =σ2

q∗

1 A third method is more common in spin glass problems where the integrant has permutation symmetrybut not necessarily rotational symmetry. The n replicas that we introduced are indistinguishable, this isreflected in the fact that F(q) is invariant under permutations of the rows and columns of q. We cantherefore look for a solution that has permutation symmetry which is called in this context replicasymmetry. Note that a function with a certain symmetry can have non-symmetric minimums, hence thereplica symmetric solution can sometimes fail to be the true minimum.

Page 219: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

Exercise 213

We recognize the self-consistent equation for the Stieltjes transform of the Wigner(Eq. (2.21)) where we make the identification q∗(z) = σ2gX(z). For N large andn small we have

EZn = exp(−

Nn2

F1(z, q∗(z)))

with F1(z, q) =q2

2σ2 + log(z − q) (13.4)

so

limN→∞

limn→0

EZn − 1Nn

= −F1(z, q∗(z))

2and g(z) =

ddz

F1(z, q∗(z))

To finish the computation we need to take the derivative of F1(z, q∗(z)) withrespect to z, but since q∗(z) is an extremum of F1 the partial derivative of F1(z, q)with respect to q is zero at q = q∗(z). We have

g(z) =ddz

F1(z, q∗(z)) =∂

∂zF1(z, q)

∣∣∣∣∣q=q∗(z)

=1

(z − q∗)=

q∗(z)σ2

We recover that g(z) = gX(z): the solution of the self-consistent Wigner equa-tion.

Exercise

13.1 Annealed replica trick for Wishart matrices

(a)

13.2 Resolvent Matrix

13.2.1 General Case

We saw that the replica trick can be used to compute the average Stieltjes trans-form of a random matrix. The Stieltjes transform is the normalized trace of theresolvent matrix GA(z) = (z1 − A)−1. In chapter 11 we will need to know theaverage of the elements of the resolvent matrix for free addition and multipli-cation. These averages can be done using the replica trick. An element of an

Page 220: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

214 Replica Trick

inverse matrix can be written as a multivariate Gaussian integral∫dNψψiψ j exp

(−ψT Mψ

2

)=

(2π)N/2[M−1

]i j

√det M

,

which we can rewrite as[M−1

]i j

= limn→−1

Zn∫

dNψ

(2π)N/2ψiψ j exp(−ψT Mψ

2

),

with Zn = (√

det M)n. If we express Zn and n Gaussian integrals and com-bine them with the integral with the ψiψ j term (which we label number 1) weget: [

M−1]i j

= limn→0

∫ n∏α=1

dNψα(2π)N/2ψ1iψ1 j exp

− n∑α=1

ψTαMψα

2

. (13.5)

This equation can then be used to compute averages of elements of the resolventmatrix by using M = z1 − A for the relevant random matrix A.

13.2.2 Free Addition

In this section we will show how to use Eq. (13.5) to compute the average of theresolvent for the free sum of two matrices. Consider two symmetric matricesC and B and the new matrix E = C + OBOT where O is a random orthogonalmatrix. We want to compute

EGE(z) = E[(z1 − C −OBOT )−1

],

where the expectation value is over the orthogonal matrix O. We can alwayschoose B to be diagonal, if B is not diagonal to start with, we just absorbthe orthogonal matrix that diagonalize B in the matrix O. Expressing GE(z)in the eigenbasis of C is equivalent to choosing C to be diagonal. We can try touniquely define the eigenbasis of C by sorting its eigenvalues of C in decreasingorder. We still have an arbitrary sign to choose for each normalized eigenvector.If we flip one of these signs, the diagonalized C doesn’t change but the corre-sponding off-diagonal elements of GE(z) will change signs. We don’t expect theresult to depend on arbitrary choices of signs, so we expect that the off-diagonal

Page 221: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

13.2 Resolvent Matrix 215

elements of EGE(z) will be zero. Note that while the average matrix EGE(z)commutes with C, a particular realization of the random matrix GE(z) will notin general commute with C.

GE(z)i j = limn→0

∫ n∏α=1

dNψα(2π)N/2ψ1iψ1 j exp

− n∑α=1

ψTα(z1 − C)ψα

2

× E

exp

n∑α=1

ψTαOBOTψα

2

. (13.6)

The last term with the expectation value can be re-written as

E[exp

(N2

Tr YOBOT)],

where Y = 1/N∑nα=1 ψαψ

Tα. We recognize the Harrish-Chandra-Itzykson-Zuber

integral discussed in chapter 6. Fortunately the matrix Y has at most rank n N, so we can use the low rank formula Eq. (6.8). Our expectation value be-comes

exp(N

2Tr HB(Y)

)= exp

(N2

Trn HB(Y)),

where Y is an n×n symmetric matrix defined by Yαβ = ψTαψβ/N and Trn denotes

the trace of an n× n matrix. We can now try to perform the integral of ψα in Eq.(13.6). But in order to do so we must treat Tr HB(Y), a non-linear function ofthe ψα. The trick is to make the matrix Y an integration variable that we fix toits definition using a delta function. The delta function is itself represented as anintegral over another (n × n) symmetric matrix Q. In other words we introducethe following factor:∫ i∞

−i∞

Nn(n+1)/2dQ23n/2πn/2

∫dY exp

−N2

Trn QY +12

n∑α=1

Trn QψαψTα

,where the integral over dQ and dY are over symmetric matrices. We have ab-sorbed a factor of N in Q and a factor of 2 on its diagonal, hence the extrafactors of 2 and N in front of dQ. We can now perform the Gaussian integral

Page 222: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

216 Replica Trick

over ψα∫ N∏k=1

n∏α=1

dψαk√

2πψ1iψ1 j exp

−12

N∑k=1

n∑α,β=1

ψαk(zδα,β −Ckδα,β −Qαβ)ψβk

,where we have written the vectors ψα in terms of their components ψαk, andwhere Ck are the eigenvalues of C. We notice that the Gaussian integral is di-agonal in the index k, so we have Nn-dimensional Gaussian integrals differingonly by their value of Ck. The term ψ1iψ1 j make the integral zero if i , j andcontributes only to the integral where k = i = j. The result is then

δi j[((z −Ci)1n −Q)−1

]11

N∏k=1

(det((z −Ck)1n −Q))−1/2,

Returning to our main expression Eq. (13.6) and dropping constants that are 1as n→ 0,

EGE(z)i j = limn→0

∫ i∞

−i∞dQ

∫dYδi j

[((z −Ci)1n −Q)−1

]11

× exp

N2

−Trn QY + Trn HB(Y) −1N

N∑k=1

Trn log((z −Ck)1n −Q)

.

For large N the integral over Y and Q is dominated by the saddle-point, i.e.extremums of the argument of the exponential. The inverse-matrix term in frontof the exponential doesn’t not have a power of N so it does not contribute tothe determination of the saddle point. The extremum is over a function of twon×n symmetric matrices. We can take matrix derivatives by using the followingidentity for symmetric matrices

ddMab

Tr F(M) = [F′(M)]ab. (13.7)

Equating the derivative with respect two both Q and Y to zero, we find thefollowing two matrix equations for the saddle point

Q = RB(Y) and Y = gC(z −Q).

where we have used the fact that RB(x) = d/dxHB(x) (for x below some criticalvalue x∗) and that gC(z) = (1/N)

∑i 1/(z−Ci). If one considers the left equation

in the eigenbasis of Y, we wee that Q is diagonal in that basis. The right equation

Page 223: A First Course in Random Matrix Theory · dynamics is described by a matrix A without any particular property – except that it is a square N N array of real numbers. Another standard

13.2 Resolvent Matrix 217

is also compatible with this fact. If we now consider the equations for each of the $n$ eigenvalues of $Q$ and $Y$, we realize that they all satisfy the same pair of equations. For large $z$ there is a unique solution to these equations, hence $Q$ and $Y$ must be multiples of the identity, $Q=q^*\mathbf{1}_n$ and $Y=y^*\mathbf{1}_n$, with $q^*$ and $y^*$ the solutions of
$$q=R_B(y)\qquad\text{and}\qquad y=g_C(z-q). \tag{13.8}$$
The saddle point for $Q$ is real while the integration contour runs over purely imaginary matrices; however, for large values of $z$ the solutions of Eqs. (13.8) give small values of $q^*$ and $y^*$, and for such small values the contour can be deformed without encountering any singularity. This also confirms that we were allowed to use $R_B(x)=\frac{d}{dx}H_B(x)$, since $y^*$ can be made arbitrarily small by choosing $z$ large enough.

The expectation of the resolvent is then given by

$$\mathbb{E}\,G_E(z)_{ij}\approx\lim_{n\to 0}\frac{\delta_{ij}}{z-C_i-q^*}\exp\left[\frac{nN}{2}\left(-q^*y^*+H_B(y^*)-\frac{1}{N}\sum_{k=1}^{N}\log(z-C_k-q^*)\right)\right].$$

As n→ 0 the exponential tends to 1 and we obtain, in matrix form,

$$\mathbb{E}\,G_E(z)=G_C\big(z-R_B(y^*)\big)\qquad\text{where}\qquad y^*=g_C\big(z-R_B(y^*)\big). \tag{13.9}$$

13.2.3 Resolvent Subordination for Addition and Multiplication

Equation (13.9) relates the average resolvent of $E$ to that of $C$. Note that $y^*$ is given by the normalized trace of the right-hand side (since $g_C(z)=\tau(G_C(z))$). Taking the normalized trace on both sides we find
$$y^*=g_E(z)=g_C\big(z-R_B(y^*)\big),$$
which is the subordination relation for the Stieltjes transform of a free sum that we found in section 7.2.8. We have thus re-derived this result; but what is more, we have obtained a relation for the average of each matrix element of the resolvent, namely
$$\mathbb{E}\,G_E(z)=G_C\big(z-R_B(g_E(z))\big). \tag{13.10}$$


In the free multiplication case, namely $E=\sqrt{C}\,B\,\sqrt{C}$ where $C$ and $B$ are large positive definite matrices whose eigenvectors are mutually random, a similar replica computation gives a subordination relation for the average T-matrix,
$$\mathbb{E}\,T_E(\zeta)=T_C\big[S_B(t(\zeta))\,\zeta\big], \tag{13.11}$$
with $S_B(t)$ the S-transform of the matrix $B$. If we take the normalized trace on both sides, we recover the subordination relation Eq. (7.23). Equation (13.11) can be turned into a subordination relation for the resolvent:
$$\mathbb{E}\,G_E(z)=S(z)\,G_C\big(S(z)z\big)\qquad\text{where}\qquad S(z):=S_B\big(zg_E(z)-1\big). \tag{13.12}$$
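Before moving on, here is a minimal numerical sanity check of the addition subordination relations (13.9)–(13.10). It is only a sketch: it assumes $B$ is a Wigner matrix of variance $\sigma^2$ (so that $R_B(g)=\sigma^2 g$) and takes $C$ diagonal with eigenvalues $\pm 1$; the values of $N$, $z$ and the random seed are arbitrary illustrative choices.

```python
import numpy as np

# Sketch: check E[G_E(z)] = G_C(z - R_B(g_E(z))) for E = C + B with B Wigner.
# Here R_B(g) = sigma^2 * g; C is diagonal with eigenvalues +1 and -1.
rng = np.random.default_rng(seed=1)
N, sigma = 2000, 0.5
z = 3.0 + 0.05j                               # a point away from the spectrum

c = np.concatenate([np.ones(N // 2), -np.ones(N // 2)])   # eigenvalues of C
A = rng.standard_normal((N, N))
B = sigma * (A + A.T) / np.sqrt(2 * N)                    # Wigner, variance sigma^2
E = np.diag(c) + B                                        # work in the eigenbasis of C

# Solve the fixed point y* = g_C(z - R_B(y*)) by simple iteration.
g_C = lambda w: np.mean(1.0 / (w - c))
y = g_C(z)
for _ in range(500):
    y = g_C(z - sigma**2 * y)

# Prediction (13.10): E[G_E(z)] = G_C(z - sigma^2 g_E(z)), diagonal in this basis.
G_pred_diag = 1.0 / (z - sigma**2 * y - c)

# One sample of the resolvent is used; the normalized trace is self-averaging.
G_sample = np.linalg.inv(z * np.eye(N) - E)

print("g_E(z) from sample:", np.trace(G_sample) / N)
print("fixed point y*:    ", y)
print("max deviation of diagonal elements:",
      np.max(np.abs(np.diag(G_sample) - G_pred_diag)))
```

For $z$ well separated from the spectrum the simple iteration converges quickly; closer to the real axis a damped iteration may be needed.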

13.3 Rank-1 HCIZ

In chapter 6 we studied the rank-1 HCIZ integral and defined the function $H_B(a)$ as
$$H_B(a)=\lim_{N\to\infty}\frac{2}{N}\log\left\langle\exp\left(\frac{N}{2}\operatorname{Tr}AOBO^T\right)\right\rangle_O, \tag{13.13}$$

where the averaging is done over the orthogonal group for $O$, $A$ is a rank-1 matrix with eigenvalue $a$ and $B$ is a fixed matrix. If $B$ is a random matrix such as a Wigner matrix, the averaging over $O$ should be done for a fixed $B$, and only later is the function $H_B(a)$ averaged over the randomness of $B$. We call this averaging the quenched average.

We also saw in section 6.2.3 that for small values of $a$, the computation where we average over $B$ before taking the log (the annealed average) gives the same answer, at least in the Wigner case.

In this section we will compute Eq. (13.13) for $B=X$ a Wigner matrix using the replica trick. The limit $n\to 0$ will give us the quenched average, but we can also look at the $n=1$ annealed average and compare the two. We will see a phase transition: for small values of $a$ the two averages give the same result, while for $a$ greater than some critical value $a^*$ the two computations differ.

To keep notation light we will set $\sigma^2=1$. As in Eq. (6.4) we define the partition


function
$$Z_a(X)=\int\frac{d^N\psi}{(2\pi)^{N/2}}\,\delta\big(\|\psi\|^2-Na\big)\exp\left(\frac{1}{2}\psi^TX\psi\right),$$
and seek to compute

$$\mathbb{E}\,H_W(a)=\lim_{N\to\infty}\frac{2}{N}\lim_{n\to 0}\frac{\mathbb{E}\,Z_a^n(X)-1}{n}-1-\log a, \tag{13.14}$$
where $1+\log a$ is the large-$N$ limit of $\frac{2}{N}\log Z_a(0)$, with $Z_a(0)$ given by Eq. (6.6). If we write $Z_a^n(X)$ as multiple copies of the same integral and express the Dirac deltas as Fourier integrals over variables $z_\alpha$, we get

$$Z_a^n(X)=\int_{-i\infty}^{i\infty}\prod_{\alpha=1}^{n}dz_\alpha\int\prod_{\alpha=1}^{n}\frac{d^N\psi_\alpha}{(2\pi)^{N/2}}\exp\left[\frac{1}{2}\sum_{\alpha=1}^{n}\big(Nz_\alpha a-z_\alpha\psi_\alpha^T\psi_\alpha\big)+\frac{1}{2}\sum_{\alpha=1}^{n}\psi_\alpha^TX\psi_\alpha\right].$$
To take the expectation value over the random matrix $X$ we need to separate the diagonal and off-diagonal elements of $X$; we can then take the expectation value and regroup the terms. The steps are the same as those of section 13.1.2:

$$\mathbb{E}\exp\left(\frac{1}{2}\sum_{\alpha=1}^{n}\psi_\alpha^TX\psi_\alpha\right)=\exp\left[\sum_{i,j=1}^{N}\frac{1}{4N}\left(\sum_{\alpha=1}^{n}\psi_{\alpha i}\psi_{\alpha j}\right)^2\right]=\exp\left[\frac{1}{4N}\sum_{\alpha,\beta=1}^{n}\left(\sum_{i=1}^{N}\psi_{\alpha i}\psi_{\beta i}\right)^2\right].$$

The quartic term in the $\psi$'s can be decoupled with a Hubbard–Stratonovich transformation, introducing an $n\times n$ symmetric matrix $q$:
$$\mathbb{E}[\ldots]=\int dq\,C(n)\exp\left[-\frac{N\operatorname{Tr}q^2}{4}+\sum_{i=1}^{N}\sum_{\alpha,\beta=1}^{n}\frac{q_{\alpha\beta}\psi_{\alpha i}\psi_{\beta i}}{2}\right],$$
where $C(n)$ is a normalization constant.
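As a quick check of the decoupling step, the following snippet verifies numerically the scalar version of the Hubbard–Stratonovich identity used above (the prefactor $\sqrt{N/4\pi}$ plays the role of the normalization $C(n)$ for a single component); the values of $N$ and $m$ are arbitrary.

```python
import numpy as np

# Scalar Hubbard-Stratonovich check:
#   exp(m^2/(4N)) = sqrt(N/(4*pi)) * Integral dq exp(-N*q^2/4 + q*m/2)
N, m = 10.0, 3.0
q = np.linspace(-20.0, 20.0, 400001)
integrand = np.exp(-N * q**2 / 4 + q * m / 2)
lhs = np.exp(m**2 / (4 * N))
rhs = np.sqrt(N / (4 * np.pi)) * np.sum(integrand) * (q[1] - q[0])
print(lhs, rhs)   # the two numbers agree to high accuracy
```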


After Gaussian integration over the $\psi_\alpha$ we obtain

$$\mathbb{E}\,Z_a^n(X)=\int_{-i\infty}^{i\infty}\prod_{\alpha=1}^{n}dz_\alpha\int dq\,C(n)\exp\left[\frac{N}{2}\left(a\operatorname{Tr}z-\frac{\operatorname{Tr}q^2}{2}-\operatorname{Tr}\log(z-q)\right)\right], \tag{13.15}$$
which makes sense provided that the real part of $z$ is larger than all the eigenvalues of $q$. We write the exponent as $\frac{N}{2}F_a(q,z)$, with
$$F_a(q,z)=a\operatorname{Tr}z-\frac{\operatorname{Tr}q^2}{2}-\operatorname{Tr}\log(z-q), \tag{13.16}$$
where $z$ is the vector of the $z_\alpha$ treated as a diagonal matrix. As a check, for $n=1$ we have at the saddle point

$$q=a=\frac{1}{z-q}\;\Rightarrow\;z=a+\frac{1}{a},\qquad\frac{2}{N}\log\mathbb{E}\,Z_a^1=a^2+1-\frac{a^2}{2}+\log a,$$
so that
$$\mathbb{E}\,I(a)=\frac{a^2}{2}.$$
We now go back to the general $n$ case. Using Eq. (13.7) we can take matrix derivatives of Eq. (13.16) with respect to $q$ and $z$:

$$q=(z-q)^{-1}\qquad\text{and}\qquad\big[(z-q)^{-1}\big]_{\alpha\alpha}=a. \tag{13.17}$$
The second equation comes from the derivative with respect to $z_\alpha$; remember that $z$ is only a diagonal matrix, so the derivative with respect to $z$ only gives information about the diagonal elements.

From this we argue that $z$ must be a multiple of the identity and $q$ of the form

$$q=\begin{pmatrix}a&b&\cdots&b\\ b&a&\cdots&b\\ \vdots&&\ddots&\vdots\\ b&b&\cdots&a\end{pmatrix}, \tag{13.18}$$

for some $b$. To find equations for $b$ and $z$ we need to express Eqs. (13.17) in those terms. To do so we first write the matrix $q$ as a rank-1 perturbation of a multiple of the identity matrix:
$$q=(a-b)\mathbf{1}+nb\,P_1,$$
where $P_1=v_1v_1^T$ is the projector onto the normalized vector of all ones:
$$v_1=\frac{1}{\sqrt{n}}\begin{pmatrix}1\\1\\\vdots\\1\end{pmatrix}.$$

Note that the eigenvalues of the matrix $q$ are $(a-b)+nb$ (with multiplicity 1) and $(a-b)$ (with multiplicity $n-1$). Since $z$ is a multiple of the identity, the matrix $z-q$ is a rank-1 perturbation of a multiple of the identity and can be inverted using the Sherman–Morrison formula, Eq. (1.18). The first of Eqs. (13.17) becomes
$$(a-b)\mathbf{1}+nb\,P_1=\frac{\mathbf{1}}{z-a+b}+\frac{nb\,P_1}{(z-a+b)^2\big(1-nb(z-a+b)^{-1}\big)}.$$

We can now equate the prefactors of the identity matrix $\mathbf{1}$ and of the projector $P_1$ separately to get two equations for our two unknowns $z$ and $b$. For the identity matrix we get
$$(a-b)=\frac{1}{z-a+b}\;\Rightarrow\;z=(a-b)+\frac{1}{a-b}. \tag{13.19}$$
For the second equation, we first replace $(z-a+b)^{-1}$ by $a-b$ and get
$$nb=\frac{(a-b)^2\,nb}{1-nb(a-b)}.$$
We immediately find one solution: $b=0$. For this solution both $q=q_0\mathbf{1}$ and $z=z_0\mathbf{1}$ are multiples of the identity, with $q_0=a$ and $z_0=a+a^{-1}$. This is the (unique) solution we found in the annealed ($n=1$) case. For general $n$ there are potentially other solutions. Cancelling the factor $nb$, we find a quadratic equation for $b$:
$$1-nb(a-b)=(a-b)^2,$$


whose solutions are
$$b_\pm=\frac{(n-2)a\pm\sqrt{n^2a^2-4(n-1)}}{2(n-1)}.$$

We first need to understand the solutions for generic moderate $n$ ($n>2$); later we will analytically continue them to $n\to 0$. We will eventually reject the solution with the $+$ sign, but let us keep it for the moment. From the two solutions for $b$ we can compute the corresponding values of $z$ using Eq. (13.19). We get a square root in the denominator, which we simplify using $(c\pm\sqrt{d})^{-1}=(c\mp\sqrt{d})/(c^2-d)$; after further simplification we find
$$z_\pm=\frac{n^2a\pm(n-2)\sqrt{n^2a^2-4(n-1)}}{2(n-1)},$$

where the choice of $\pm$ is the same as for $b$; the $+$ choice leads to larger values of both $b$ and $z$. The argument for rejecting the $+$ solution is somewhat weak. Both solutions respect the condition that $\mathrm{Re}(z)$ be greater than $(a-b)+nb$ (the largest eigenvalue of $q$ for positive $b$), but for the $z_+$ solution there is no way to deform the integration path of $z$ to reach the saddle point without crossing the singularity at $z=(a-b)+nb$. Moreover, the $z_+$ solution is a local minimum and not a maximum of Eq. (13.16). For these reasons, and because we know the answer we want to get, we reject the $z_+$ solution.
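The algebra leading to $b_\pm$ and $z_\pm$ is easy to get wrong, so here is a small numerical spot check (the values of $n$ and $a$ are arbitrary; this is purely illustrative):

```python
import numpy as np

# Spot check: b_pm solves 1 - n*b*(a-b) = (a-b)^2, and z = (a-b) + 1/(a-b)
# from Eq. (13.19) reproduces the closed form for z_pm given above.
for n in (3.0, 4.0, 10.0):
    for a in (1.2, 2.0, 3.5):
        disc = np.sqrt(n**2 * a**2 - 4 * (n - 1))
        for sign in (+1.0, -1.0):
            b = ((n - 2) * a + sign * disc) / (2 * (n - 1))
            z_from_b = (a - b) + 1.0 / (a - b)
            z_closed = (n**2 * a + sign * (n - 2) * disc) / (2 * (n - 1))
            assert abs(1 - n * b * (a - b) - (a - b)**2) < 1e-10
            assert abs(z_from_b - z_closed) < 1e-10
print("formulas for b_pm and z_pm check out numerically")
```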

We now need to compare the $z_-$ and $z_0$ solutions. We first notice that the two solutions coincide when $a=1$ (and $n>2$), namely $b=0$ and $z=2$. For other values of $a$ we should compare the values of $F_a(q,z)$ at the two solutions and choose the larger one. There is a trick that saves time and effort: since our solutions cancel the partial derivatives of $F_a(q,z)$, its total derivative with respect to $a$ is simply
$$\frac{d}{da}F_a(q,z)=\frac{\partial}{\partial a}F_a(q,z)\Big|_{q(a),z(a)}=\operatorname{Tr}z(a)=n\,z(a).$$
The solution $z_-(a)$ is always greater than or equal to $z_0(a)$; this can be shown explicitly but is more easily seen on a graph (see Fig. 13.1). Since $F_{a=1}(q,z)$ is the same for both solutions, the larger derivative of the $-$ solution means that $F_a(q_-,z_-)>F_a(q_0,z_0)$ for $a>1$, so $z_-$ is the correct solution. For $a<1$ the argument works in reverse and we have $F_a(q_-,z_-)<F_a(q_0,z_0)$, so $z_0$ is the correct solution in


Figure 13.1 Comparison between the two solutions $z_-(a)$ (shown for $n=4$ and $n=10$) and $z_0(a)$, plotted as $z(a)$ versus $a$. Note that $z_-(a)$ does not exist for $a<a^*(n)<1$.

that case. Putting everything together we find, for $n>2$,
$$\mathbb{E}\,Z_a^n\sim\exp\left(\frac{N}{2}\,nF_n(a)\right),$$
where
$$\frac{d}{da}F_n(a)=\begin{cases}a+\dfrac{1}{a}&\text{for }a\le 1,\\[2mm]\dfrac{n^2a-(n-2)\sqrt{n^2a^2-4(n-1)}}{2(n-1)}&\text{for }a>1,\end{cases}$$
and $F_n(0)=0$. We can now analytically continue this solution to $n\to 0$. The first part, where $a\le 1$, is easy as it does not depend on $n$. For the second part we have to be careful with the sign of the square-root term: it can change sign twice, first at $n=2$ and then at $n=1$. Since $z(a)$ is continuous at $a=1$ for all $n>2$, we use this continuity to guide us in choosing the proper sign. We find that the combination

$$z_-(a)=\frac{n^2a-|n-2|\sqrt{n^2a^2-4(n-1)}}{2(n-1)} \tag{13.20}$$
always gives $z_-(1)=2$. As a function of $n$ it is continuous at $n=1$. These features can be seen in Figure 13.2. The extrapolation of $z_-(a)$ to $n\to 0$ gives the very simple result $z_-=2$ for all $a$.
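These two properties of Eq. (13.20) are easy to confirm numerically (a minimal sketch; the values of $n$ and $a$ below are arbitrary):

```python
import numpy as np

# z_-(a=1) = 2 for every n (except n = 1, where the formula is singular), and
# z_-(a) -> 2 as n -> 0 for any a >= 1: the analytic continuation used in the text.
def z_minus(a, n):
    return (n**2 * a - abs(n - 2) * np.sqrt(n**2 * a**2 - 4 * (n - 1))) / (2 * (n - 1))

print([round(z_minus(1.0, n), 12) for n in (0.5, 1.5, 3.0, 10.0)])  # all equal to 2
print([round(z_minus(a, 1e-9), 6) for a in (1.0, 1.5, 2.0, 4.0)])   # all close to 2
```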

We can now go back to the definition of the function $I(a)$ (Eq. (13.14)). After


Figure 13.2 The solution $z_-(a)$ (Eq. (13.20)) as a function of $n$ for three values of $a$ ($a=1$, $1.5$ and $2$). Note that $z_-(1)=2$ for all values of $n$. For all values of $a>1$, it is continuous as a function of $n$ and can be extrapolated to $n\to 0$, at which point $z_-(a)=2$ for all $a$.

taking the $n\to 0$ and $N\to\infty$ limits we find
$$I'(a)=\begin{cases}a&\text{for }a\le 1,\\[1mm]2-\dfrac{1}{a}&\text{for }a>1,\end{cases}$$
with the condition $I(0)=0$. We recognize the solution already found in Eq. (6.11) for a GOE matrix of unit variance, for which $R(a)=a$ and $\lambda_{\max}=2$.
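To visualize the quenched/annealed transition, one can simply tabulate the derivative found above against the annealed one, which equals $a$ for all $a$ (since the annealed computation gives $\mathbb{E}I(a)=a^2/2$); the grid of values below is arbitrary:

```python
import numpy as np

# Quenched (n -> 0) derivative I'(a) versus the annealed (n = 1) derivative.
# The two coincide for a <= 1 and differ for a > 1.
a = np.linspace(0.25, 3.0, 12)
quenched = np.where(a <= 1, a, 2 - 1 / a)
annealed = a
for ai, q, an in zip(a, quenched, annealed):
    print(f"a = {ai:4.2f}   quenched I'(a) = {q:5.3f}   annealed I'(a) = {an:5.3f}")
```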

Alternatively, one can work directly in the $n\to 0$ limit. As before, $b=0$ is a solution and it gives the annealed result, but there may be other solutions. Eq. (13.18) implies that $q$ has only two potentially distinct eigenvalues; in the $n\to 0$ limit they differ by a quantity of order $n$. Let us call them $\lambda_0=\lambda+n\Delta$ (with multiplicity 1) and $\lambda_1=\lambda$ (with multiplicity $n-1$). To first order in $n$, Eq. (13.16) becomes
$$F(\lambda,z,\Delta)=n\left(-\frac{\lambda^2}{2}-\lambda\Delta-\log(z-\lambda)+\frac{\Delta}{z-\lambda}+za\right).$$
We find two solutions of the saddle point equations:
1. $\lambda=1$, $z=2$, $\Delta=a-1$;
2. $\lambda=a$, $z=a+1/a$, $\Delta=0$.


The saddle point function evaluated at these two solutions is
$$F_1=n\left(2a-\frac{1}{2}\right)\qquad\text{and}\qquad F_2=n\left(\frac{a^2}{2}+1+\log a\right).$$
The two solutions are equal at $a=1$. Because of the usual subtleties of the $n\to 0$ limit, we should take the minimum of $F$ rather than the maximum. Solution (1) is therefore the right solution for $a>1$, while solution (2), which is also the annealed solution, is correct for $a<1$.

13.4 Annealed vs Quenched

The replica trick is quite burdensome, as one has to keep track of $n$ copies of the integration vector $\psi_\alpha$, and these vectors interact through the averaging process. At large $N$ one typically has to perform a saddle point computation over one or two $n\times n$ matrices (e.g. $Q$ and $Y$ in the free addition computation). In all the computations of Stieltjes transforms above, taking $n=1$ instead of $n\to 0$ gives the correct saddle point. In other words, instead of using
$$\mathbb{E}\log Z=\lim_{n\to 0}\frac{\mathbb{E}Z^n-1}{n},$$
the approximation
$$\mathbb{E}\log Z\approx\log\mathbb{E}Z$$
gives the right result. For example, if we go back to Eq. (13.4) we see that taking the logarithm of the $n=1$ result gives the same answer as the correct $n\to 0$ limit. The same is true for the Wishart case (see Exercise 13.1). For free addition and multiplication one can also compute the Stieltjes transform using $n=1$.

This is a general result. Most natural ensembles of random symmetric matrices (such as those from chapter 4 and those arising from free addition and multiplication) feature a strong repulsion of eigenvalues. Because of this repulsion, eigenvalues do not fluctuate much around their classical positions. It is this lack of large fluctuations of the eigenvalues that makes the $n=1$ and $n\to 0$ saddle points equivalent. For the rank-1 HCIZ, $n=1$ gives the right answer in the small-$a$ regime, which is dominated by bulk properties of the eigenvalues. In the large-$a$ regime fluctuations of the largest eigenvalue matter, and the $n=1$ result is no longer correct. In section ?? we will give another


counter-example: when eigenvalues are independent they can fluctuate much more, and the two types of averaging are no longer equivalent.
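A rough numerical illustration of the self-averaging invoked here (this is not a replica computation, just a sketch with arbitrary parameters): the sample Stieltjes transform of a Wigner matrix barely fluctuates from sample to sample, and its fluctuations shrink rapidly with $N$.

```python
import numpy as np

# Self-averaging of g(z) = (1/N) Tr (z - X)^{-1} for Wigner matrices (sigma = 1):
# the standard deviation across samples decreases quickly with N.
rng = np.random.default_rng(3)
z = 2.5                                   # a real point outside the semi-circle [-2, 2]

for N in (100, 300, 900):
    samples = []
    for _ in range(30):
        A = rng.standard_normal((N, N))
        X = (A + A.T) / np.sqrt(2 * N)    # Wigner matrix with sigma^2 = 1
        eig = np.linalg.eigvalsh(X)
        samples.append(np.mean(1.0 / (z - eig)))
    samples = np.array(samples)
    print(f"N = {N:4d}   mean g(z) = {samples.mean():.4f}   std = {samples.std():.2e}")
```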

Bibliographical Notes

The standard reference on the replica method (in the context of spin glasses) is the book by Mezard et al. [1987]. The subject of the replica trick applied to random matrices is almost never discussed in books, with the exception of Livan et al. [2018].

The replica trick was introduced by Brout [1959] and popularized by Edwards and Anderson [1975]; it was first applied to random matrices by Edwards and Jones [1976].

A replica computation for the average resolvent in free multiplication (Eq. (13.12)) can be found in Bun et al. [2017].


Appendix A

Mathematical Tools

A.1 Saddle point method

In this section we recall the saddle point method (sometimes also called the Laplace method, the method of steepest descent, or the stationary phase approximation), which is used for instance to study the probability density of Eq. (4.4). Consider the integral

$$I=\int e^{NF(x)}\,dx. \tag{A.1}$$

The key idea of the saddle point method is that when $N$ is large, $I$ is dominated by the maximum of $F(x)$ plus Gaussian fluctuations around it. Suppose $F$ attains its maximum at $x_0$; then around $x_0$ we have
$$F(x)=F(x_0)+\frac{F''(x_0)}{2}(x-x_0)^2+O\big(|x-x_0|^3\big),$$
where $F''(x_0)<0$. Thus for large $N$ we have

$$I\sim\sqrt{\frac{2\pi}{-F''(x_0)\,N}}\;e^{NF(x_0)}, \tag{A.2}$$
where the symbol $\sim$ means that the ratio of the two sides tends to 1 as $N\to\infty$. Often we are only interested in

$$\lim_{N\to\infty}\frac{1}{N}\log I=F(x_0),$$


in which case we don’t need to compute the pre-factor.
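As a concrete illustration (with an arbitrary choice of $F$), the following snippet compares a brute-force evaluation of Eq. (A.1) with the saddle point estimate Eq. (A.2):

```python
import numpy as np

# Saddle point check for F(x) = x - exp(x): maximum at x0 = 0, F(x0) = -1,
# F''(x0) = -1, so (A.2) predicts I ~ sqrt(2*pi/N) * exp(-N).
def F(x):
    return x - np.exp(x)

N = 50
x = np.linspace(-30.0, 5.0, 2_000_001)            # wide grid around the maximum
I_numeric = np.sum(np.exp(N * F(x))) * (x[1] - x[0])
I_saddle = np.sqrt(2 * np.pi / N) * np.exp(-N)

print(I_numeric, I_saddle, I_numeric / I_saddle)  # ratio close to 1 for large N
```

The relative error of the saddle point estimate decreases like $1/N$, as can be checked by increasing $N$ in this sketch.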

Exercise

A.1 Saddle point method for the factorial function: Stirling's approximation. We are going to estimate the factorial function for large arguments using an integral representation and the saddle point approximation.

$$n!=\Gamma[n+1]=\int_0^\infty x^n e^{-x}\,dx.$$

(a) Write $n!$ in the form of Eq. (A.1) for some function $F(x)$.

(b) Show that $x_0=n$ is the solution of $F'(x)=0$.

(c) Let $I_0(n)=nF(x_0(n))$ be an approximation of $\log(n!)$. Compare this approximation to the exact value for $n=10$ and $100$.

(d) Include the Gaussian corrections to the saddle point: let $I_1(n)=\log(I)$, where $I$ is given by Eq. (A.2) for your function $F(x)$. Show that
$$I_1(n)=n\log(n)-n+\frac{1}{2}\log(2\pi n).$$

(e) Compare $I_1(n)$ and $\log(n!)$ for $n=10$ and $100$.


References

Greg W. Anderson, Alice Guionnet, and Ofer Zeitouni. An Introduction to Random Matrices. Cambridge University Press, Cambridge, 2010.

Jinho Baik, Gerard Ben Arous, and Sandrine Peche. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability, pages 1643–1697, 2005.

Jinho Baik, Percy Deift, and Toufic Suidan. Combinatorics and Random Matrix Theory. American Mathematical Society, Providence, Rhode Island, 2016.

Florent Benaych-Georges and Raj Rao Nadakuditi. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics, 227(1):494–521, 2011.

Gordon Blower. Random Matrices: High Dimensional Phenomena. Cambridge University Press, Cambridge, 2009.

Edouard Brezin, Claude Itzykson, Giorgio Parisi, and Jean-Bernard Zuber. Planar diagrams. Communications in Mathematical Physics, 59(1):35–51, 1978.

R. Brout. Statistical mechanical theory of a random ferromagnetic system. Phys. Rev., 115:824–835, Aug 1959.

Joel Bun, Jean-Philippe Bouchaud, and Marc Potters. Cleaning large correlation matrices: Tools from random matrix theory. Physics Reports, 666:1–109, 2017.

Zdzisław Burda, Jerzy Jurkiewicz, and Bartłomiej Wacław. Spectral moments of correlated Wishart matrices. Phys. Rev. E, 71:026111, Feb 2005.

Mireille Capitaine and Catherine Donati-Martin. Spectrum of deformed random matrices and free probability. arXiv preprint arXiv:1607.05560, 2016.

Romain Couillet and Merouane Debbah. Random Matrix Methods for Wireless Communications. Cambridge University Press, Cambridge, 2011.

Freeman J. Dyson. A Brownian-motion model for the eigenvalues of a random matrix. Journal of Mathematical Physics, 3:1191–1198, 1962a.

Freeman J. Dyson. The threefold way. Algebraic structure of symmetry groups and ensembles in quantum mechanics. Journal of Mathematical Physics, 3(6):1199–1215, 1962b.

S. F. Edwards and P. W. Anderson. Theory of spin glasses. Journal of Physics F: Metal Physics, 5(5):965, 1975.

S. F. Edwards and R. C. Jones. The eigenvalue spectrum of a large symmetric random matrix. Journal of Physics A: Mathematical and General, 9(10):1595, 1976.

Laszlo Erdos and Horng-Tzer Yau. A Dynamical Approach to Random Matrix Theory. American Mathematical Society, Providence, Rhode Island, 2017.

P. Di Francesco, P. Ginsparg, and J. Zinn-Justin. 2D gravity and random matrices. Physics Reports, 254(1):1–133, 1995.

Z. Furedi and J. Komlos. The eigenvalues of random symmetric matrices. Combinatorica, 1(3):233–241, Sep 1981. ISSN 1439-6912.

Thomas Guhr, Axel Muller-Groeling, and Hans A. Weidenmuller. Random-matrix theories in quantum physics: common concepts. Physics Reports, 299(4):189–425, 1998. ISSN 0370-1573.

Alice Guionnet and Mylene Maïda. A Fourier view on the R-transform and related asymptotics of spherical integrals. Journal of Functional Analysis, 222(2):435–490, 2005. ISSN 0022-1236. doi: http://dx.doi.org/10.1016/j.jfa.2004.09.015.

Harish-Chandra. Differential operators on a semisimple Lie algebra. American Journal of Mathematics, pages 87–120, 1957.

Claude Itzykson and Jean-Bernard Zuber. The planar approximation. II. Journal of Mathematical Physics, 21:411–421, 1980.

Giacomo Livan, Marcel Novaes, and Pierpaolo Vivo. Introduction to Random Matrices: Theory and Practice. Springer, New York, 2018.

Vladimir Alexandrovich Marchenko and Leonid Andreevich Pastur. Distribution of eigenvalues for some sets of random matrices. Matematicheskii Sbornik, 114(4):507–536, 1967.

Enzo Marinari, Giorgio Parisi, and Felix Ritort. Replica field theory for deterministic models. II. A non-random spin glass with glassy behaviour. Journal of Physics A: Mathematical and General, 27(23):7647, 1994.

Madan Lal Mehta. Random Matrices. Academic Press, San Diego, 3 edition, 2004.

Marc Mezard, Miguel Angel Virasoro, and Giorgio Parisi. Spin Glass Theory and Beyond. World Scientific, 1987.

James A. Mingo and Roland Speicher. Free Probability and Random Matrices. Springer, New York, 2017.

Leonid Pastur and Mariya Scherbina. Eigenvalue Distribution of Large Random Matrices. American Mathematical Society, Providence, Rhode Island, 2010.

Terence Tao. Topics in Random Matrix Theory. American Mathematical Society, Providence, Rhode Island, 2012.

Antonia M. Tulino and Sergio Verdu. Random Matrix Theory and Wireless Communications. Now Publishers, Hanover, Mass., 2004.

Dan Voiculescu. Symmetries of some reduced free product C*-algebras. Springer, 1985.

Dan Voiculescu. Limit laws for random matrices and free products. Inventiones mathematicae, 104(1):201–220, 1991.

Eugene P. Wigner. On the statistical distribution of the widths and spacings of nuclear resonance levels. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 47, pages 790–798. Cambridge University Press, 1951.

John Wishart. The generalised product moment distribution in samples from a normal multivariate population. Biometrika, pages 32–52, 1928.

Paul Zinn-Justin. Adding and multiplying random matrices: A generalization of Voiculescu's formulas. Phys. Rev. E, 59:4884–4888, May 1999.


Index

annealed average, 114
AR(1) process, 152
BBP transition, 174
beta, 39, 59, 167
beta ensembles, 59
Brownian motion, 77
Burgers' equation, 92
Catalan numbers, 34
Cauchy kernel, 28
Cauchy sequence, 144
Cauchy transform, see Stieltjes transform
central limit theorem, 124, 135
characteristic function, 81, 121
characteristic polynomial, 7, 179
circulant matrix, 153
completion, 144
concentration of measure, 18
conjugate variable, 101
constant, 118, 127
Coulomb potential, 64
cumulant, 120, 127
Dirac delta, 110
Dyson Brownian motion, 84
Dyson index, see beta
eigenvalue, 7
  density, 18
    critical, 74, 177
    edge, 31, 74
    Marcenko-Pastur, 54
    sample, 22
    Wigner, 20, 31
  largest, 167
  multiplicity, 7
  repulsion, 64, 90
eigenvector, 7
  overlap, 175
empirical spectral distribution, 22
ergodicity, 101
exponential correlations, 152
exponential moving average, 158
free log-normal, 151
free product, 56, 136
freeness, 126
  large matrices, 143, 145
Frobenius norm, 18
Gaussian ensemble, 17
  orthogonal (GOE), 19
  symplectic (GSE), 43
  unitary (GUE), 41
Haar measure, 108
HCIZ integral, 109, 215, 218
Hermitian matrix, 39
Hilbert transform, 70
Hubbard-Stratonovich identity, 211
inverse-Wishart, 156
involution, 118
Ito
  lemma, 80
  prescription, 79
l'Hospital's rule, 176
Langevin equation, 99
Laurent polynomial, 68
law of large numbers, 124, 135
matrix determinant lemma, 14, 179
matrix potential, 58, 103
  convex, 72
  non-polynomial, 70
  white Wishart, 50, 59
  Wigner, 21, 59
maximum likelihood, 66
moment, 18, 118
  addition, 119
  generating function, 121
  Wigner matrix, 19
moment-cumulant relation
  commutative, 123
  free, 133
multivariate Gaussian, 47
non-crossing partition, 34, 132
normal matrix, 12
normalized trace, 17
one-cut assumption, 73
orthogonal ensemble, 59
orthogonal matrix, 8, 20
outlier, 171
Perron-Frobenius theorem, 13
planar diagrams, 74
positive semi-definite, 6
quaternion, 42
quenched average, 114
R-transform, 94, 174
  identity matrix, 146
  uniform density, 148
  white Wishart, 97
  Wigner, 97
rank-1 matrix, 109, 171
replica trick, 208
resolvent, 22
rotationally invariant ensemble, 20
S-transform, 138, 180
  identity matrix, 138, 146
  inverse matrix, 142
  white Wishart, 150
  Wigner, 150
saddle point method, 227
sample covariance matrix, 46
scalar, 118
Schur complement, 14, 25, 51
self-averaging, 18
semi-circle law, 20, 31
Sherman-Morrison formula, 14, 172
singular values, 10
Sokhotski–Plemelj formula, 30
*-algebra, 118
stationary phase approximation, 111, 227
Stieltjes inversion formula, 30
Stieltjes transform
  discrete, 22
  identity matrix, 146
  inverse Wishart, 157
  invertibility, 96, 172
  limiting, 23
  Marcenko-Pastur, 53
  Wigner, 26
Stirling's approximation, 228
stochastic calculus, 77
Stratonovich prescription, 79
subordination relation
  addition, 135
  product, 140
  resolvent, 217
symmetric matrix, 8
symplectic matrix, 42
T-matrix, 138
T-transform, 138
  identity matrix, 146
temporal correlations, 159
traceless, 126
tracial state, 118
Tracy-Widom distribution, 167
uniform density, 71, 148
unitary matrix, 40
Vandermonde determinant, 62
variance, 118
  Wigner matrix, 20
Wick's theorem, 47
Wiener process, 77
Wigner ensemble, 17
Wishart matrix, 47
  non-white, 55
  white, 49