Random Matrices in Machine Learning
Romain COUILLET
CentraleSupélec, University of Paris-Saclay, France
GSTATS IDEX DataScience Chair, GIPSA-lab, University Grenoble-Alpes, France
June 21, 2018
1 / 113
Outline
Basics of Random Matrix Theory
▶ Motivation: Large Sample Covariance Matrices
▶ Spiked Models

Applications
▶ Reminder on Spectral Clustering Methods
▶ Kernel Spectral Clustering
▶ Kernel Spectral Clustering: The case f′(τ) = 0
▶ Kernel Spectral Clustering: The case f′(τ) = α/√p
▶ Semi-supervised Learning
▶ Semi-supervised Learning improved
▶ Random Feature Maps, Extreme Learning Machines, and Neural Networks
▶ Community Detection on Graphs

Perspectives
Basics of Random Matrix Theory / Motivation: Large Sample Covariance Matrices

Context

Baseline scenario: y_1, …, y_n ∈ C^p (or R^p) i.i.d. with E[y_1] = 0, E[y_1 y_1^*] = C_p:

▶ If y_1 ∼ N(0, C_p), the ML estimator for C_p is the sample covariance matrix (SCM)

  Ĉ_p = (1/n) Y_p Y_p^* = (1/n) ∑_{i=1}^n y_i y_i^*,  Y_p = [y_1, …, y_n] ∈ C^{p×n}.

▶ If n → ∞, then, by the strong law of large numbers,

  Ĉ_p → C_p almost surely,

or equivalently, in spectral norm, ‖Ĉ_p − C_p‖ → 0 almost surely.

Random Matrix Regime

▶ No longer valid if p, n → ∞ with p/n → c ∈ (0, ∞): ‖Ĉ_p − C_p‖ ̸→ 0.

▶ For practical p, n with p ≃ n, this leads to dramatically wrong conclusions.
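The breakdown of the SCM when p ≃ n is easy to reproduce numerically. The sketch below is not part of the original slides (the helper `scm_error` and the specific sizes are illustrative choices): it draws i.i.d. N(0, I_p) samples and compares the spectral-norm error ‖Ĉ_p − C_p‖ in the classical regime (p fixed, n large) with the random matrix regime (p/n = 1/4), where for C_p = I_p the error concentrates near (1 + √c)² − 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def scm_error(p, n):
    """Spectral-norm error ||C_hat - I_p|| of the sample covariance
    matrix for n i.i.d. N(0, I_p) samples (true covariance C_p = I_p)."""
    Y = rng.standard_normal((p, n))     # Y_p = [y_1, ..., y_n]
    C_hat = Y @ Y.T / n                 # SCM (1/n) Y_p Y_p^*
    return np.linalg.norm(C_hat - np.eye(p), 2)

# Classical regime: p fixed, n -> infinity, the error vanishes
err_classical = scm_error(p=10, n=100_000)

# Random matrix regime: p/n = 1/4, the error stays O(1)
err_rmt = scm_error(p=500, n=2000)
```

With c = 1/4, `err_rmt` sits near (1 + √c)² − 1 = 1.25, no matter how large p and n are, while `err_classical` is close to 0.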
Basics of Random Matrix Theory / Motivation: Large Sample Covariance Matrices

The Marčenko–Pastur law

Figure: Histogram of the eigenvalues of Ĉ_p (empirical eigenvalue distribution) against the Marčenko–Pastur law, for p = 500, n = 2000, C_p = I_p.
Basics of Random Matrix Theory / Motivation: Large Sample Covariance Matrices

The Marčenko–Pastur law

Definition (Empirical Spectral Density). The empirical spectral density (e.s.d.) µ_p of a Hermitian matrix A_p ∈ C^{p×p} is

  µ_p = (1/p) ∑_{i=1}^p δ_{λ_i(A_p)}.

Theorem (Marčenko–Pastur Law [Marčenko, Pastur '67]). Let X_p ∈ C^{p×n} have i.i.d. zero-mean, unit-variance entries. As p, n → ∞ with p/n → c ∈ (0, ∞), the e.s.d. µ_p of (1/n) X_p X_p^* satisfies

  µ_p → µ_c weakly, almost surely,

where

▶ µ_c({0}) = max(0, 1 − c^{-1}),

▶ on (0, ∞), µ_c has continuous density f_c supported on [(1 − √c)², (1 + √c)²],

  f_c(x) = (1 / (2πcx)) √((x − (1 − √c)²)((1 + √c)² − x)).
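The theorem is easy to check numerically. The sketch below is an illustration added to the transcript (the sizes p = 1000, n = 4000 and the tolerance margins are ad-hoc choices): it verifies that the eigenvalues of (1/n) X_p X_p^* fall in the support [(1 − √c)², (1 + √c)²] and that the density f_c above integrates to 1 (there is no atom at 0 since c < 1).

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 1000, 4000
c = p / n                                   # limit ratio p/n -> c = 0.25

# Eigenvalues of (1/n) X_p X_p^* for X_p with i.i.d. N(0, 1) entries
X = rng.standard_normal((p, n))
eigs = np.linalg.eigvalsh(X @ X.T / n)

# Marcenko-Pastur support edges and density f_c on (0, infinity)
lo, hi = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
def f_c(x):
    return np.sqrt(np.maximum((x - lo) * (hi - x), 0)) / (2 * np.pi * c * x)

# Essentially all eigenvalues fall inside the limiting support...
inside = np.mean((eigs > lo - 0.05) & (eigs < hi + 0.05))

# ...and f_c is a probability density (midpoint-rule quadrature)
m = 200_000
mid = lo + (np.arange(m) + 0.5) * (hi - lo) / m
total_mass = f_c(mid).mean() * (hi - lo)
```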
Basics of Random Matrix Theory / Motivation: Large Sample Covariance Matrices

The Marčenko–Pastur law

Figure: Marčenko–Pastur density f_c(x) for different limit ratios c = lim_{p→∞} p/n (c = 0.1, 0.2, 0.5).
Basics of Random Matrix Theory / Spiked Models

Spiked Models

Small-rank perturbation: C_p = I_p + P, P of low rank.

Figure: Eigenvalues {λ_i}_{i=1}^n of (1/n) Y_p Y_p^* against the limiting law µ, for C_p = diag(1, …, 1, 2, 2, 3, 3) (p − 4 ones), p = 500, n = 1500; the isolated eigenvalues concentrate near 1 + ω_m + c (1 + ω_m)/ω_m for the spikes ω_1, ω_2.
Basics of Random Matrix Theory / Spiked Models

Spiked Models

Theorem (Eigenvalues [Baik, Silverstein '06]). Let Y_p = C_p^{1/2} X_p, with

▶ X_p with i.i.d. zero-mean, unit-variance entries, E[|(X_p)_{ij}|⁴] < ∞,

▶ C_p = I_p + P, P = U Ω U^*, where, for K fixed, Ω = diag(ω_1, …, ω_K) ∈ R^{K×K}, with ω_1 ≥ … ≥ ω_K > 0.

Then, as p, n → ∞ with p/n → c ∈ (0, ∞), denoting λ_m = λ_m((1/n) Y_p Y_p^*) (λ_m > λ_{m+1}), almost surely

  λ_m → 1 + ω_m + c (1 + ω_m)/ω_m > (1 + √c)²,  if ω_m > √c,
  λ_m → (1 + √c)²,  if ω_m ∈ (0, √c].
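The phase transition at √c can be observed directly. The sketch below is an added illustration (the choice c = 1/4, a single spike ω = 2 > √c = 0.5, and the tolerances are ad-hoc): the top eigenvalue of (1/n) Y_p Y_p^* detaches from the bulk edge (1 + √c)² and lands near the predicted location 1 + ω + c(1 + ω)/ω.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 800, 3200
c = p / n                                   # c = 0.25, sqrt(c) = 0.5
omega = 2.0                                 # spike strength, above sqrt(c)

# C_p = I_p + omega * u u^* with u = e_1; C_p^{1/2} only differs
# from I_p in the direction u
C_sqrt = np.eye(p)
C_sqrt[0, 0] = np.sqrt(1 + omega)
Y = C_sqrt @ rng.standard_normal((p, n))    # Y_p = C_p^{1/2} X_p

lambda_max = np.linalg.eigvalsh(Y @ Y.T / n)[-1]

bulk_edge = (1 + np.sqrt(c)) ** 2                  # MP right edge: 2.25
spike_pred = 1 + omega + c * (1 + omega) / omega   # asymptotic spike: 3.375
```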
Basics of Random Matrix Theory / Spiked Models

Spiked Models

Theorem (Eigenvectors [Paul '07]). Let Y_p = C_p^{1/2} X_p, with

▶ X_p with i.i.d. zero-mean, unit-variance entries, E[|(X_p)_{ij}|⁴] < ∞,

▶ C_p = I_p + P, P = U Ω U^* = ∑_{i=1}^K ω_i u_i u_i^*, ω_1 > … > ω_K > 0.

Then, as p, n → ∞ with p/n → c ∈ (0, ∞), for a, b ∈ C^p deterministic and û_i the eigenvector of λ_i((1/n) Y_p Y_p^*),

  a^* û_i û_i^* b − ((1 − c ω_i^{-2}) / (1 + c ω_i^{-1})) a^* u_i u_i^* b · 1_{ω_i > √c} → 0  almost surely.

In particular,

  |û_i^* u_i|² → ((1 − c ω_i^{-2}) / (1 + c ω_i^{-1})) · 1_{ω_i > √c}  almost surely.
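The eigenvector alignment can be checked in the same setting. The sketch below is an added illustration (same ad-hoc parameters as before, ω = 2 and c = 1/4, giving a predicted alignment (1 − 1/16)/(1 + 1/8) ≈ 0.833):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 800, 3200
c = p / n
omega = 2.0                                 # omega > sqrt(c): informative regime

u = np.zeros(p)
u[0] = 1.0                                  # population spike eigenvector
C_sqrt = np.eye(p)
C_sqrt[0, 0] = np.sqrt(1 + omega)           # C_p^{1/2} for C_p = I_p + omega u u^*
Y = C_sqrt @ rng.standard_normal((p, n))

_, vecs = np.linalg.eigh(Y @ Y.T / n)
u_hat = vecs[:, -1]                         # top sample eigenvector

alignment = (u @ u_hat) ** 2                # |u_hat^* u|^2
predicted = (1 - c / omega**2) / (1 + c / omega)
```

Below the threshold (ω ≤ √c) the same experiment would give an alignment near 0: the sample eigenvector carries no information about u.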
Basics of Random Matrix Theory / Spiked Models

Spiked Models

Figure: Simulated versus limiting |û_1^* u_1|² for Y_p = C_p^{1/2} X_p, C_p = I_p + ω_1 u_1 u_1^*, p/n = 1/3, p ∈ {100, 200, 400}, varying the population spike ω_1; the limit is (1 − c/ω_1²)/(1 + c/ω_1).
Basics of Random Matrix Theory / Spiked Models

Other Spiked Models

Similar results hold for multiple matrix models:

▶ Y_p = (1/n) (I + P)^{1/2} X_p X_p^* (I + P)^{1/2}
▶ Y_p = (1/n) X_p X_p^* + P
▶ Y_p = (1/n) X_p^* (I + P) X_p
▶ Y_p = (1/n) (X_p + P)^* (X_p + P)
▶ etc.
Applications / Reminder on Spectral Clustering Methods

Reminder on Spectral Clustering Methods

Context: two-step classification of n objects based on a similarity matrix A ∈ R^{n×n}:

Figure: Spectrum of A (a bulk of eigenvalues plus isolated spikes at 0) ⇓ eigenvectors attached to the spikes, Eigenv. 1 and Eigenv. 2 (in practice, shuffled).
Applications / Reminder on Spectral Clustering Methods

Reminder on Spectral Clustering Methods

Figure: From the spike eigenvectors (Eigenv. 1, Eigenv. 2) ⇓ to an ℓ-dimensional representation, one point per object in the plane (Eigenvector 1, Eigenvector 2) (shuffling no longer matters) ⇓ then EM or k-means clustering.
Applications / Kernel Spectral Clustering

Kernel Spectral Clustering

Problem Statement

▶ Dataset x_1, …, x_n ∈ R^p.
▶ Objective: "cluster" the data in k similarity classes C_1, …, C_k.
▶ Kernel spectral clustering is based on the kernel matrix

  K = {κ(x_i, x_j)}_{i,j=1}^n.

▶ Usually, κ(x, y) = f(x^T y) or κ(x, y) = f(‖x − y‖²).
▶ Refinements:
  ▶ instead of K, use D − K, I_n − D^{-1} K, I_n − D^{-1/2} K D^{-1/2}, etc.
  ▶ several-step algorithms: Ng–Jordan–Weiss, Shi–Malik, etc.

Intuition (from small dimensions)

▶ K is essentially low rank, with the class structure in its eigenvectors.
▶ Ng–Jordan–Weiss key remark: D^{-1/2} K D^{-1/2} (D^{1/2} j_a) ≃ D^{1/2} j_a (j_a the canonical vector of class C_a).
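The pipeline above can be sketched end-to-end on a toy mixture. The code below is an added numpy-only illustration (the two-blob dataset, the Gaussian choice f(t) = exp(−t/2), the seeding, and the Lloyd loop are all ad-hoc choices, not the slides' algorithm): build K from f(‖x_i − x_j‖²), normalize as D^{-1/2} K D^{-1/2}, and run k-means on the rows of the top eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy dataset: two well-separated Gaussian classes in R^2
n_half = 100
X = np.vstack([rng.standard_normal((n_half, 2)) + [4, 0],
               rng.standard_normal((n_half, 2)) - [4, 0]])
labels = np.array([0] * n_half + [1] * n_half)

# Kernel matrix K = {f(||x_i - x_j||^2)} with f(t) = exp(-t/2)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2)

# Normalized matrix D^{-1/2} K D^{-1/2}: its leading eigenvectors
# carry the class structure
d = K.sum(axis=1)
M = K / np.sqrt(np.outer(d, d))
_, vecs = np.linalg.eigh(M)
V = vecs[:, -2:]                      # 2-dimensional spectral representation

# A few Lloyd (k-means) iterations on the rows of V
centers = V[[0, -1]]                  # one seed from each end of the dataset
for _ in range(10):
    assign = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    centers = np.array([V[assign == k].mean(axis=0) for k in range(2)])

# Clusters should match the classes up to a label swap
accuracy = max(np.mean(assign == labels), np.mean(assign != labels))
```

In this easy low-dimensional regime the recovered clusters match the classes almost perfectly; the point of the following slides is that the large-p behavior of K is very different.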
Applications / Kernel Spectral Clustering

Kernel Spectral Clustering

Figure: Leading four eigenvectors of D^{-1/2} K D^{-1/2} for MNIST data, RBF kernel (f(t) = exp(−t²/2)).

▶ Important remark: the eigenvectors are informative BUT far from D^{1/2} j_a!
Applications/Kernel Spectral Clustering 22/113
Model and Assumptions
Gaussian mixture model:
I x1, . . . , xn ∈ ℝ^p,
I k classes C1, . . . , Ck,
I x1, . . . , x_{n1} ∈ C1, . . . , x_{n−nk+1}, . . . , xn ∈ Ck,
I xi ∼ N(μ_{gi}, C_{gi}).

Assumption (Growth Rate). As n → ∞,
1. Data scaling: p/n → c0 ∈ (0, ∞), na/n → ca ∈ (0, 1),
2. Mean scaling: with μ ≜ Σ_{a=1}^k (na/n) μa and μa° ≜ μa − μ, then ‖μa°‖ = O(1),
3. Covariance scaling: with C ≜ Σ_{a=1}^k (na/n) Ca and Ca° ≜ Ca − C, then ‖Ca°‖ = O(1), tr Ca° = O(√p), tr Ca°Cb° = O(p).

For 2 classes, this reads:
‖μ1 − μ2‖ = O(1), tr(C1 − C2) = O(√p), ‖Ci‖ = O(1), tr([C1 − C2]²) = O(p).

Remark [Neyman–Pearson optimality]:
I x ∼ N(±μ, Ip) (known μ) is decidable iff ‖μ‖ ≥ O(1);
I x ∼ N(0, (1 ± ε)Ip) (known ε) is decidable iff |ε| ≥ O(p^(−1/2)).
22 / 113
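The growth-rate scalings above can be reproduced on synthetic data. A minimal NumPy sketch for two classes (all sizes and parameter choices below are illustrative, not taken from the slides):

```python
import numpy as np

# Two-class Gaussian mixture matching the growth rates:
# ||mu1 - mu2|| = O(1) and tr(C2 - C1) = O(sqrt(p)).
rng = np.random.default_rng(0)
p, n = 256, 512
n1 = n // 2

mu1 = np.zeros(p)
mu2 = np.zeros(p)
mu2[0] = 2.0                                  # ||mu1 - mu2|| = 2 = O(1)
sigma2 = 1.0 + 2.0 / np.sqrt(p)               # C2 = sigma2 * I_p, so tr(C2 - C1) = 2*sqrt(p)

X1 = mu1[:, None] + rng.standard_normal((p, n1))                        # class 1, C1 = I_p
X2 = mu2[:, None] + np.sqrt(sigma2) * rng.standard_normal((p, n - n1))  # class 2
X = np.hstack([X1, X2])                       # p x n data matrix, sorted by class
```

In this regime neither the means alone nor a single sample reveals the class; the point of the assumption is that classification is non-trivial but not impossible as n, p grow together.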
![Page 48: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/48.jpg)
Applications/Kernel Spectral Clustering 23/113
Model and Assumptions
Kernel Matrix:

I Kernel matrix of interest:

K = {f((1/p)‖xi − xj‖²)}_{i,j=1}^n

for some sufficiently smooth nonnegative f (the inner-product kernel f((1/p) xi^T xj) is simpler).

I We study the normalized Laplacian:

L = n D^(-1/2) (K − (dd^T)/(d^T 1n)) D^(-1/2)

with d = K 1n, D = diag(d). (More stable both theoretically and in practice.)
23 / 113
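These two objects are straightforward to form numerically. A sketch with toy Gaussian data and f(t) = exp(−t/2) as an assumed example kernel:

```python
import numpy as np

# Build K = f(||xi - xj||^2 / p) and the normalized Laplacian
# L = n D^{-1/2} (K - d d^T / (d^T 1_n)) D^{-1/2}, with d = K 1_n, D = diag(d).
rng = np.random.default_rng(1)
p, n = 128, 64
X = rng.standard_normal((p, n))               # columns x1, ..., xn (toy data)

sq = np.sum(X**2, axis=0)
dist2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X   # all ||xi - xj||^2 at once
K = np.exp(-dist2 / (2.0 * p))                # f(t) = exp(-t/2), t = dist2 / p

ones = np.ones(n)
d = K @ ones                                  # degrees
Dm12 = np.diag(d**-0.5)
L = n * Dm12 @ (K - np.outer(d, d) / (d @ ones)) @ Dm12
```

The centering by dd^T/(d^T 1n) is what makes D^(1/2)1n an exact null vector of L, which the test below checks.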
![Page 50: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/50.jpg)
Applications/Kernel Spectral Clustering 24/113
Random Matrix Equivalent
I Key Remark: Under the growth rate assumptions,

max_{1≤i≠j≤n} |(1/p)‖xi − xj‖² − τ| → 0 almost surely, where τ = (2/p) tr C.

⇒ Suggests that (up to the diagonal) K ≃ f(τ) 1n 1n^T!

I In fact, the information is hidden in lower-order fluctuations! From a "matrix-wise" Taylor expansion of K:

K = f(τ) 1n 1n^T [O_{‖·‖}(n)] + √n K1 [low rank, O_{‖·‖}(√n)] + K2 [informative terms, O_{‖·‖}(1)]

Clearly not the expected (small-dimension) behavior.
24 / 113
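Both claims — concentration of the normalized distances and the resulting spectral hierarchy of K — can be observed numerically. A sketch with identity covariance, so the distances concentrate around 2 (sizes are illustrative):

```python
import numpy as np

# For xi ~ N(0, I_p) with n ~ p/2, the normalized distances ||xi - xj||^2 / p
# all concentrate around 2 (max deviation over pairs shrinks as p grows), and
# K is dominated by the rank-one term f(tau) 1n 1n^T of operator norm O(n).
rng = np.random.default_rng(2)

def max_deviation(p, n):
    X = rng.standard_normal((p, n))
    sq = np.sum(X**2, axis=0)
    dist2 = (sq[:, None] + sq[None, :] - 2.0 * X.T @ X) / p
    off = dist2[~np.eye(n, dtype=bool)]       # off-diagonal entries only
    return np.max(np.abs(off - 2.0)), dist2

dev_small, _ = max_deviation(100, 50)
dev_large, d2 = max_deviation(1600, 800)      # deviation shrinks with p

K = np.exp(-d2 / 2.0)                         # f(t) = exp(-t/2)
eigs = np.linalg.eigvalsh(K)                  # top eigenvalue ~ f(2) * n, rest much smaller
```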
![Page 54: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/54.jpg)
Applications/Kernel Spectral Clustering 25/113
Random Matrix Equivalent
Theorem (Random Matrix Equivalent [Couillet, Benaych'2015]). As n, p → ∞, ‖L − L̂‖ → 0 almost surely, where

L = n D^(-1/2) (K − (dd^T)/(d^T 1n)) D^(-1/2), with Kij = f((1/p)‖xi − xj‖²),

L̂ = −2 (f′(τ)/f(τ)) [ (1/p) P W^T W P + (1/p) J B J^T + ∗ ]

and W = [w1, . . . , wn] ∈ ℝ^(p×n) (xi = μa + wi), P = In − (1/n) 1n 1n^T, J = [j1, . . . , jk], ja^T = (0, . . . , 0, 1_{na}^T, 0, . . . , 0),

B = M^T M + (5f′(τ)/(8f(τ)) − f″(τ)/(2f′(τ))) t t^T − (f″(τ)/f′(τ)) T + ∗.

Recall M = [μ1°, . . . , μk°], t = [(1/√p) tr C1°, . . . , (1/√p) tr Ck°]^T, T = {(1/p) tr Ca°Cb°}_{a,b=1}^k.

Fundamental conclusions:

I the kernel f impacts the asymptotics only through f′(τ) and f″(τ), that's all!

I spectral clustering reads M^T M, tt^T and T, that's all!
25 / 113
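The theorem's qualitative message — an isolated eigenvector of L carrying the class structure when M^T M is strong enough — can be illustrated on a toy two-class mixture chosen well above the phase transition (all parameters below are illustrative choices, not from the theorem):

```python
import numpy as np

# Two classes differing only in means, ||mu1 - mu2|| = 6 = O(1), with n >> p so
# the informative spike clearly exits the bulk; a top eigenvector of L then
# correlates strongly with the class labels.
rng = np.random.default_rng(4)
p, n = 128, 512
n1 = n // 2

X = rng.standard_normal((p, n))
X[0, n1:] += 6.0                              # class 2 shifted: mu2 = 6 * e1

sq = np.sum(X**2, axis=0)
K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X.T @ X) / (2.0 * p))
d = K @ np.ones(n)
Dm12 = np.diag(d**-0.5)
L = n * Dm12 @ (K - np.outer(d, d) / d.sum()) @ Dm12

w, V = np.linalg.eigh(L)                      # eigenvalues in ascending order
labels = np.r_[-np.ones(n1), np.ones(n - n1)]
# the informative direction sits among the top eigenvectors
corr = max(abs(np.corrcoef(V[:, -j], labels)[0, 1]) for j in (1, 2))
```

Thresholding the selected eigenvector at 0 recovers the two classes almost perfectly in this regime.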
![Page 59: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/59.jpg)
Applications/Kernel Spectral Clustering 26/113
Isolated eigenvalues: Gaussian inputs
Figure: Eigenvalues of L and L̂, k = 3, p = 2048, n = 512, c1 = c2 = 1/4, c3 = 1/2, [μa]j = 4δ_{aj}, Ca = (1 + 2(a − 1)/√p) Ip, f(x) = exp(−x/2).
26 / 113
![Page 60: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/60.jpg)
Applications/Kernel Spectral Clustering 27/113
Theoretical Findings versus MNIST
Figure: Eigenvalues of L (red) and of the equivalent Gaussian model L̂ (white), MNIST data, p = 784, n = 192.
27 / 113
![Page 62: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/62.jpg)
Applications/Kernel Spectral Clustering 28/113
Theoretical Findings versus MNIST
Figure: Leading four eigenvectors of D^(-1/2) K D^(-1/2) for MNIST data (red) and theoretical findings (blue).
28 / 113
![Page 64: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/64.jpg)
Applications/Kernel Spectral Clustering 29/113
Theoretical Findings versus MNIST
Figure: 2D representation of the eigenvectors of L (Eigenvector 2 vs. Eigenvector 1, and Eigenvector 3 vs. Eigenvector 2) for the MNIST dataset. Theoretical means and 1- and 2-standard deviations in blue. Class 1 in red, Class 2 in black, Class 3 in green.
29 / 113
![Page 66: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/66.jpg)
Applications/Kernel Spectral Clustering 30/113
The surprising f ′(τ) = 0 case
Figure: Classification error as a function of f′(τ), for p = 512, p = 1024, and theory. Polynomial kernel with f(τ) = 4, f″(τ) = 2, xi ∼ N(0, Ca), C1 = Ip, [C2]ij = .4^|i−j|, c0 = 1/4.

I Trivial classification when t = 0, M = 0 and ‖T‖ = O(1).
30 / 113
![Page 70: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/70.jpg)
Applications/Kernel Spectral Clustering: The case f′(τ) = 0 31/113
Outline
Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models

Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f′(τ) = 0
  Kernel Spectral Clustering: The case f′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs
Perspectives
31 / 113
![Page 71: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/71.jpg)
Applications/Kernel Spectral Clustering: The case f′(τ) = 0 32/113
Position of the Problem
Problem: Cluster large data x1, . . . , xn ∈ ℝ^p based on "spanned subspaces".

Method:

I Still assume x1, . . . , xn belong to k classes C1, . . . , Ck.

I Zero-mean Gaussian model for the data: for xi ∈ Ca, xi ∼ N(0, Ca).

I Performance of

L = n D^(-1/2) (K − (1n 1n^T)/(1n^T D 1n)) D^(-1/2), with K = {f(‖x̄i − x̄j‖²)}_{1≤i,j≤n}, x̄ = x/‖x‖,

in the regime n, p → ∞. (Alternatively, we can ask (1/p) tr Ci = 1 for all 1 ≤ i ≤ k.)
32 / 113
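The normalization x̄ = x/‖x‖ is what makes this problem hard: it removes class-wise scale differences, so all pairwise distances ‖x̄i − x̄j‖² concentrate near 2 and only fluctuations around t = 2 can discriminate. A quick numerical illustration (toy covariances, chosen for the example):

```python
import numpy as np

# Two zero-mean classes with very different scales: C1 = I_p and C2 = 4 * I_p.
# Raw distances separate the classes trivially; after projection on the sphere,
# all pairwise distances ||x_bar_i - x_bar_j||^2 sit near 2.
rng = np.random.default_rng(5)
p, n = 512, 256
n1 = n // 2

X = rng.standard_normal((p, n))
X[:, n1:] *= 2.0                              # class 2: C2 = 4 * I_p
Xb = X / np.linalg.norm(X, axis=0)            # x_bar = x / ||x||

G = Xb.T @ Xb
dist2 = 2.0 - 2.0 * G                         # ||x_bar_i - x_bar_j||^2
off = dist2[~np.eye(n, dtype=bool)]
```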
![Page 74: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/74.jpg)
Applications/Kernel Spectral Clustering: The case f′(τ) = 0 33/113
Model and Reminders
Assumption 1 [Classes]. Vectors x1, . . . , xn ∈ ℝ^p drawn independently from a k-class Gaussian mixture, with xi ∈ Ca ⇔ xi ∼ N(0, Ca) (sorted by class for simplicity).

Assumption 2a [Growth Rates]. As n → ∞, for each a ∈ {1, . . . , k},
1. n/p → c0 ∈ (0, ∞),
2. na/n → ca ∈ (0, ∞),
3. (1/p) tr Ca = 1 and tr Ca°Cb° = O(p), with Ca° = Ca − C, C = Σ_{b=1}^k cb Cb.

Theorem (Corollary of the Previous Section). Let f be smooth with f′(2) ≠ 0. Then, under Assumptions 1 and 2a,

L = n D^(-1/2) (K − (1n 1n^T)/(1n^T D 1n)) D^(-1/2), with K = {f(‖x̄i − x̄j‖²)}_{i,j=1}^n (x̄ = x/‖x‖),

exhibits a phase transition phenomenon, i.e., the leading eigenvectors of L asymptotically contain structural information about C1, . . . , Ck if and only if

T = {(1/p) tr Ca°Cb°}_{a,b=1}^k

has sufficiently large eigenvalues (here M = 0, t = 0).
33 / 113
![Page 78: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/78.jpg)
Applications/Kernel Spectral Clustering: The case f′(τ) = 0 34/113
The case f ′(2) = 0
Assumption 2b [Growth Rates]. As n → ∞, for each a ∈ {1, . . . , k},
1. n/p → c0 ∈ (0, ∞),
2. na/n → ca ∈ (0, ∞),
3. (1/p) tr Ca = 1 and tr Ca°Cb° = O(√p), with Ca° = Ca − C, C = Σ_{b=1}^k cb Cb.
(In this regime, the previous kernels clearly fail.)

Remark [Neyman–Pearson optimality]:

I if Ci = Ip ± E with ‖E‖ → 0, detectability iff (1/p) tr(C1 − C2)² ≥ O(p^(−1/2)).

Theorem (Random Equivalent for f′(2) = 0). Let f be smooth with f′(2) = 0, and set

L̂ ≡ (√p f(2))/(2 f″(2)) [L − ((f(0) − f(2))/f(2)) P], P = In − (1/n) 1n 1n^T.

Then, under Assumption 2b,

L̂ = P Φ P + {(1/√p) tr(Ca°Cb°) (1_{na} 1_{nb}^T)/p}_{a,b=1}^k + o_{‖·‖}(1)

where Φij = δ_{i≠j} √p [(x̄i^T x̄j)² − E[(x̄i^T x̄j)²]].
34 / 113
Applications/Kernel Spectral Clustering: The case f′(τ) = 0 35/113

The case f′(2) = 0

[Figure: histogram of the eigenvalues of $\hat L$, with the two isolated eigenvalues $\lambda_1(\hat L)$, $\lambda_2(\hat L)$ marked.]

Figure: Eigenvalues of $\hat L$, $p = 1000$, $n = 2000$, $k = 3$, $c_1 = c_2 = 1/4$, $c_3 = 1/2$, $C_i \propto I_p + (p/8)^{-5/4} W_i W_i^{\mathsf T}$, $W_i \in \mathbb{R}^{p\times(p/8)}$ of i.i.d. $\mathcal N(0,1)$ entries, $f(t) = \exp(-(t-2)^2)$.
⇒ No longer a Marcenko–Pastur-like bulk, but rather a semi-circle bulk!
35 / 113
Applications/Kernel Spectral Clustering: The case f′(τ) = 0 37/113

The case f′(2) = 0

Roadmap. We now need to:
- study the spectrum of Φ
- study the isolated eigenvalues of $\hat L$ (and the phase transition)
- retrieve information from the eigenvectors.

Theorem (Semi-circle law for Φ). Let $\mu_n = \frac1n \sum_{i=1}^{n} \delta_{\lambda_i(\hat L)}$. Then, under Assumption 2b,
$$\mu_n \xrightarrow{\text{a.s.}} \mu$$
with $\mu$ the semi-circle distribution
$$\mu(dt) = \frac{1}{2\pi c_0 \omega^2}\sqrt{(4 c_0 \omega^2 - t^2)^+}\,dt, \qquad \omega = \lim_{p\to\infty} \sqrt 2\,\frac1p\operatorname{tr}(C^2).$$
37 / 113
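The semi-circle density in the theorem above is easy to evaluate numerically. A minimal numpy sketch (the values of `c0` and `omega` are illustrative, not taken from the slides), checking that the density integrates to one:

```python
import numpy as np

# Semi-circle density of the theorem above:
# mu(dt) = sqrt((4*c0*omega^2 - t^2)_+) / (2*pi*c0*omega^2) dt,
# supported on [-2*sqrt(c0)*omega, 2*sqrt(c0)*omega].
def semicircle_pdf(t, c0, omega):
    radius2 = 4.0 * c0 * omega**2
    return np.sqrt(np.maximum(radius2 - t**2, 0.0)) / (2.0 * np.pi * c0 * omega**2)

# Illustrative parameters; any c0, omega > 0 give total mass one.
c0, omega = 0.5, 1.3
t = np.linspace(-3.0, 3.0, 200001)
y = semicircle_pdf(t, c0, omega)
mass = float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(t)))  # trapezoidal rule
```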
Applications/Kernel Spectral Clustering: The case f′(τ) = 0 38/113

The case f′(2) = 0

[Figure: the same eigenvalue histogram, now overlaid with the semi-circle law; isolated eigenvalues $\lambda_1(\hat L)$, $\lambda_2(\hat L)$ marked.]

Figure: Eigenvalues of $\hat L$, $p = 1000$, $n = 2000$, $k = 3$, $c_1 = c_2 = 1/4$, $c_3 = 1/2$, $C_i \propto I_p + (p/8)^{-5/4} W_i W_i^{\mathsf T}$, $W_i \in \mathbb{R}^{p\times(p/8)}$ of i.i.d. $\mathcal N(0,1)$ entries, $f(t) = \exp(-(t-2)^2)$.
38 / 113
Applications/Kernel Spectral Clustering: The case f′(τ) = 0 39/113

The case f′(2) = 0

Denote now
$$T \equiv \lim_{p\to\infty}\left\{\frac{\sqrt{c_a c_b}}{\sqrt p}\operatorname{tr} C_a^\circ C_b^\circ\right\}_{a,b=1}^{k}.$$
39 / 113
Theorem (Isolated Eigenvalues). Let $\nu_1 \ge \dots \ge \nu_k$ be the eigenvalues of $T$. Then, if $\sqrt{c_0}\,|\nu_i| > \omega$, $\hat L$ has an isolated eigenvalue $\lambda_i$ satisfying
$$\lambda_i \xrightarrow{\text{a.s.}} \rho_i \equiv c_0 \nu_i + \frac{\omega^2}{\nu_i}.$$
39 / 113
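The phase transition above reads off directly as code. A small sketch (the eigenvalues `nu` and the parameters `c0`, `omega` below are illustrative):

```python
import numpy as np

# Spike locations predicted by the theorem above: each eigenvalue nu_i of T
# with sqrt(c0)*|nu_i| > omega produces an isolated eigenvalue of \hat L at
# rho_i = c0*nu_i + omega^2/nu_i; the others stay inside the semi-circle bulk.
def spike_locations(nu, c0, omega):
    nu = np.asarray(nu, dtype=float)
    visible = np.sqrt(c0) * np.abs(nu) > omega
    rho = c0 * nu + omega**2 / nu
    return rho, visible

# Illustrative values: nu = -3 passes the phase transition, nu = -0.1 does not.
rho, visible = spike_locations([-3.0, -0.1], c0=0.5, omega=1.0)
bulk_edge = 2.0 * np.sqrt(0.5) * 1.0  # semi-circle support is [-edge, edge]
```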
Applications/Kernel Spectral Clustering: The case f′(τ) = 0 40/113

The case f′(2) = 0

Theorem (Isolated Eigenvectors). For each isolated eigenpair $(\lambda_i, u_i)$ of $\hat L$ corresponding to $(\nu_i, v_i)$ of $T$, write
$$u_i = \sum_{a=1}^{k}\left[\alpha_i^a\,\frac{j_a}{\sqrt{n_a}} + \sigma_i^a\, w_i^a\right]$$
with $j_a = [0_{n_1}^{\mathsf T},\dots,1_{n_a}^{\mathsf T},\dots,0_{n_k}^{\mathsf T}]^{\mathsf T}$, $(w_i^a)^{\mathsf T} j_a = 0$, $\operatorname{supp}(w_i^a) = \operatorname{supp}(j_a)$, $\|w_i^a\| = 1$.
Then, under Assumptions 1–2b,
$$\alpha_i^a \alpha_i^b \xrightarrow{\text{a.s.}} \left(1 - \frac{1}{c_0}\,\frac{\omega^2}{\nu_i^2}\right)[v_i v_i^{\mathsf T}]_{ab}, \qquad (\sigma_i^a)^2 \xrightarrow{\text{a.s.}} \frac{c_a}{c_0}\,\frac{\omega^2}{\nu_i^2}$$
and the fluctuations of $u_i$, $u_j$, $i \ne j$, are asymptotically uncorrelated.
40 / 113
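The limits in the theorem above can be evaluated directly; a useful consistency check is that trace(align) + sum(sigma2) = 1 when $\|v_i\| = 1$ and $\sum_a c_a = 1$, matching $\|u_i\| = 1$. The numeric values below are illustrative:

```python
import numpy as np

# Limiting eigenvector statistics from the theorem above, for one isolated
# pair (nu_i, v_i): the alignments alpha_i^a * alpha_i^b and the noise
# variances (sigma_i^a)^2.
def eigvec_limits(nu_i, v_i, c, c0, omega):
    v_i, c = np.asarray(v_i, float), np.asarray(c, float)
    shrink = 1.0 - omega**2 / (c0 * nu_i**2)     # informative fraction
    align = shrink * np.outer(v_i, v_i)          # limits of alpha_i^a alpha_i^b
    sigma2 = (c / c0) * omega**2 / nu_i**2       # limits of (sigma_i^a)^2
    return align, sigma2

align, sigma2 = eigvec_limits(nu_i=-3.0, v_i=[1.0, 0.0], c=[0.5, 0.5],
                              c0=0.5, omega=1.0)
```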
Applications/Kernel Spectral Clustering: The case f′(τ) = 0 41/113

The case f′(2) = 0

[Figure: entries of the leading two eigenvectors plotted against the index $1,\dots,2000$.]

Figure: Leading two eigenvectors of $\hat L$ (or equivalently of $L$) versus the deterministic approximations $\alpha_i^a \pm \sigma_i^a$.
41 / 113
Applications/Kernel Spectral Clustering: The case f′(τ) = 0 42/113

The case f′(2) = 0

[Figure: scatter plot of (eigenvector 1, eigenvector 2) entries, with $1\sigma$ and $2\sigma$ concentration ellipses; both axes on a $10^{-2}$ scale.]

Figure: Leading two eigenvectors of $\hat L$ (or equivalently of $L$) versus the deterministic approximations $\alpha_i^a \pm \sigma_i^a$.
42 / 113
Applications/Kernel Spectral Clustering: The case f′(τ) = 0 43/113
Application to Massive MIMO UE Clustering
43 / 113
Applications/Kernel Spectral Clustering: The case f′(τ) = 0 44/113

Massive MIMO UE Clustering

Setting. Massive MIMO cell with
- $p$ antenna elements
- $n$ user equipments (UEs) with channels $x_1,\dots,x_n \in \mathbb{R}^p$
- UEs belong to solid-angle groups, i.e., $\mathbb E[x_i] = 0$, $\mathbb E[x_i x_i^{\mathsf T}] = C_a \equiv C(\Theta_a)$
- $T$ independent channel observations $x_i^{(1)},\dots,x_i^{(T)}$ for UE $i$.

Objective. Cluster users of the same solid-angle group (for scheduling reasons, to avoid pilot contamination).

Algorithm.
1. Build the kernel matrix $K$, then $\hat L$, from the $nT$ vectors $x_1^{(1)},\dots,x_n^{(T)}$ (as if there were $nT$ values to cluster).
2. Extract the dominant isolated eigenvectors $u_1,\dots,u_\kappa$.
3. For each $i$, create $\bar u_i = \frac1T (I_n \otimes 1_T^{\mathsf T})\, u_i$, i.e., average the eigenvectors along time.
4. Perform $k$-class clustering on the vectors $\bar u_1,\dots,\bar u_\kappa$.
44 / 113
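The four steps above can be sketched end-to-end in numpy. The choices below are assumptions, not fixed by the slide: the kernel is the $f'(\tau) = 0$ choice $f(t) = \exp(-(t-2)^2)$ used in the figures, the centered kernel $PKP$ stands in for $\hat L$, the helper name `cluster_ues` is hypothetical, and k-means is a plain Lloyd iteration:

```python
import numpy as np

def cluster_ues(X, n, T, k, n_eig, seed=0):
    """Steps 1-4 of the slide. X: (p, n*T) array, the T observations of UE i
    occupying columns i*T .. (i+1)*T - 1. Returns one label per UE."""
    p, nT = X.shape
    # 1. distance kernel f(||x_i - x_j||^2 / p) with f(t) = exp(-(t-2)^2)
    sq = (X**2).sum(axis=0)
    D2 = (sq[:, None] + sq[None, :] - 2.0 * X.T @ X) / p
    K = np.exp(-(D2 - 2.0) ** 2)
    # 2. dominant eigenvectors of the centered kernel (proxy for \hat L)
    P = np.eye(nT) - np.ones((nT, nT)) / nT
    w, v = np.linalg.eigh(P @ K @ P)
    U = v[:, np.argsort(-np.abs(w))[:n_eig]]
    # 3. average each eigenvector over the T observations of every UE
    Ubar = U.reshape(n, T, n_eig).mean(axis=1)
    # 4. plain Lloyd k-means on the averaged rows
    rng = np.random.default_rng(seed)
    cent = Ubar[rng.choice(n, k, replace=False)]
    for _ in range(100):
        lbl = ((Ubar[:, None, :] - cent[None, :, :]) ** 2).sum(-1).argmin(1)
        cent = np.stack([Ubar[lbl == a].mean(0) if (lbl == a).any() else cent[a]
                         for a in range(k)])
    return lbl
```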
Applications/Kernel Spectral Clustering: The case f′(τ) = 0 45/113

Massive MIMO UE Clustering

[Figure: two scatter plots of (eigenvector 1, eigenvector 2) entries, before and after averaging.]

Figure: Leading two eigenvectors before (left figure) and after (right figure) $T$-averaging. Setting: $p = 400$, $n = 40$, $T = 10$, $k = 3$, $c_1 = c_3 = 1/4$, $c_2 = 1/2$, angular spread model with angles $-\pi/30 \pm \pi/20$, $0 \pm \pi/20$, and $\pi/30 \pm \pi/20$. Kernel function $f(t) = \exp(-(t-2)^2)$.
45 / 113
Applications/Kernel Spectral Clustering: The case f′(τ) = 0 46/113

Massive MIMO UE Clustering

[Figure: correct clustering probability versus $T$ ($T = 2,\dots,10$), curves for k-means (oracle), k-means, EM (oracle), EM, with the random-guess baseline.]

Figure: Overlap for different $T$, using k-means or EM started either from the actual centroid solutions (oracle) or randomly.
46 / 113
Applications/Kernel Spectral Clustering: The case f′(τ) = 0 47/113

Massive MIMO UE Clustering

[Figure: correct clustering probability versus $T$, curves for the optimal and Gaussian kernels with k-means and EM, with the random-guess baseline.]

Figure: Overlap for the optimal kernel $f(t)$ (here $f(t) = \exp(-(t-2)^2)$) and the Gaussian kernel $f(t) = \exp(-t^2)$, for different $T$, using k-means or EM.
47 / 113
Applications/Kernel Spectral Clustering: The case f′(τ) = α/√p 48/113

Outline

Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models

Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f′(τ) = 0
  Kernel Spectral Clustering: The case f′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs

Perspectives
48 / 113
Applications/Kernel Spectral Clustering: The case f′(τ) = α/√p 49/113

Optimal growth rates and optimal kernels

Conclusion of previous analyses:
- kernel $f(\frac1p\|x_i - x_j\|^2)$ with $f'(\tau) \ne 0$:
  - optimal in $\|\mu_a\| = O(1)$, $\frac1p\operatorname{tr} C_a^\circ = O(p^{-\frac12})$
  - suboptimal in $\frac1p\operatorname{tr} C_a^\circ C_b^\circ = O(1)$
  → Model type: Marcenko–Pastur + spikes.
- kernel $f(\frac1p\|x_i - x_j\|^2)$ with $f'(\tau) = 0$:
  - suboptimal in $\|\mu_a\| = O(1)$ (kills the means)
  - suboptimal in $\frac1p\operatorname{tr} C_a^\circ C_b^\circ = O(p^{-\frac12})$
  → Model type: smaller-order semi-circle law + spikes.

Jointly optimal solution:
- evenly weighing the Marcenko–Pastur and semi-circle laws
- the “α–β” kernel:
$$f'(\tau) = \frac{\alpha}{\sqrt p}, \qquad \frac12 f''(\tau) = \beta.$$
49 / 113
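A minimal way to build a kernel meeting the two conditions above is a quadratic expansion around τ. This specific $f$ is an illustration only; any smooth $f$ with these derivatives at τ behaves the same at first order:

```python
import numpy as np

# "alpha-beta" kernel: the quadratic f(t) = (alpha/sqrt(p))*(t - tau)
# + beta*(t - tau)^2 satisfies f'(tau) = alpha/sqrt(p) and
# (1/2) f''(tau) = beta exactly.
def alpha_beta_kernel(alpha, beta, tau, p):
    return lambda t: alpha / np.sqrt(p) * (t - tau) + beta * (t - tau) ** 2

p, alpha, beta, tau = 1024, 4.0, 3.0, 0.0
f = alpha_beta_kernel(alpha, beta, tau, p)

# finite-difference check of both derivative conditions (exact for a quadratic)
h = 1e-5
fp = (f(tau + h) - f(tau - h)) / (2 * h)                        # ~ alpha/sqrt(p)
half_fpp = (f(tau + h) - 2 * f(tau) + f(tau - h)) / (2 * h**2)  # ~ beta
```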
Applications/Kernel Spectral Clustering: The case f′(τ) = α/√p 50/113

New assumption setting

- We now consider a fully optimal growth-rate setting.

Assumption (Optimal Growth Rate). As $n\to\infty$,
1. Data scaling: $\frac pn \to c_0 \in (0,\infty)$, $\frac{n_a}{n} \to c_a \in (0,1)$,
2. Mean scaling: with $\mu \triangleq \sum_{a=1}^{k} \frac{n_a}{n}\mu_a$ and $\mu_a^\circ \triangleq \mu_a - \mu$, then $\|\mu_a^\circ\| = O(1)$,
3. Covariance scaling: with $C \triangleq \sum_{a=1}^{k} \frac{n_a}{n}C_a$ and $C_a^\circ \triangleq C_a - C$, then $\|C_a^\circ\| = O(1)$, $\operatorname{tr} C_a^\circ = O(\sqrt p)$, $\operatorname{tr} C_a^\circ C_b^\circ = O(\sqrt p)$.

Kernel:
- For technical simplicity, we consider
$$\tilde K = PKP = P\left\{f\left(\frac1p x_i^{\mathsf T} x_j\right)\right\}_{i,j=1}^{n} P, \qquad P = I_n - \frac1n 1_n 1_n^{\mathsf T},$$
i.e., τ replaced by 0.
50 / 113
Applications/Kernel Spectral Clustering: The case f′(τ) = α/√p 51/113

Main Results

Theorem. As $n\to\infty$,
$$\left\|\sqrt p\left(PKP + \left(f(0) + \tau f'(0)\right)P\right) - \hat K\right\| \xrightarrow{\text{a.s.}} 0$$
with, for $\alpha = \sqrt p\, f'(0) = O(1)$ and $\beta = \frac12 f''(0) = O(1)$,
$$\hat K = \alpha\, P W^{\mathsf T} W P + \beta\, P\Phi P + U A U^{\mathsf T}$$
$$A = \begin{bmatrix} \alpha M^{\mathsf T}M + \beta T & \alpha I_k \\ \alpha I_k & 0 \end{bmatrix}, \qquad U = \left[\frac{J}{\sqrt p},\; P W^{\mathsf T} M\right]$$
$$\frac{\Phi}{\sqrt p} = \left\{(\omega_i^{\mathsf T}\omega_j)^2\,\delta_{i\ne j}\right\}_{i,j=1}^{n} - \left\{\frac{\operatorname{tr}(C_a C_b)}{p^2}\, 1_{n_a}1_{n_b}^{\mathsf T}\right\}_{a,b=1}^{k}.$$

Role of α, β:
- weighs the Marcenko–Pastur part versus the semi-circle part.
51 / 113
Applications/Kernel Spectral Clustering: The case f′(τ) = α/√p 52/113

Limiting eigenvalue distribution

Theorem (Eigenvalues Bulk). As $p\to\infty$,
$$\nu_n \triangleq \frac1n\sum_{i=1}^{n}\delta_{\lambda_i(\hat K)} \xrightarrow{\text{a.s.}} \nu$$
with $\nu$ having Stieltjes transform $m(z)$ solution of
$$\frac{1}{m(z)} = -z + \frac{\alpha}{p}\operatorname{tr}\, C\left(I_p + \frac{\alpha\, m(z)}{c_0}\,C\right)^{-1} - \frac{2\beta^2}{c_0}\,\omega^2 m(z)$$
where $\omega = \lim_{p\to\infty}\frac1p\operatorname{tr}(C^2)$.
52 / 113
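In the simplest case $C = I_p$ the fixed-point equation above closes: $\frac{\alpha}{p}\operatorname{tr} C(I_p + \frac{\alpha m}{c_0}C)^{-1}$ reduces to $\alpha/(1 + \alpha m/c_0)$, and clearing denominators gives a cubic in $m(z)$. A sketch under that assumption (so $\omega = \frac1p\operatorname{tr}(C^2) = 1$; the values of $z$, α, β, $c_0$ are illustrative):

```python
import numpy as np

# Solve 1/m = -z + alpha/(1 + alpha*m/c0) - (2*beta^2/c0)*omega^2*m for
# C = I_p. Multiplying out gives the cubic
#   g*alpha*m^3 + (z*alpha + g*c0)*m^2 + (alpha + z*c0 - alpha*c0)*m + c0 = 0
# with g = 2*beta^2*omega^2/c0; the Stieltjes transform is the root with
# positive imaginary part when Im(z) > 0.
def stieltjes_m(z, alpha, beta, c0, omega2=1.0):
    g = 2.0 * beta**2 * omega2 / c0
    coeffs = [g * alpha, z * alpha + g * c0, alpha + z * c0 - alpha * c0, c0]
    roots = np.roots(coeffs)
    return roots[np.argmax(roots.imag)]

z = -0.5 + 1.0j
alpha, beta, c0 = 1.0, 1.0, 0.5
m = stieltjes_m(z, alpha, beta, c0)
# residual of the original fixed-point form of the equation
res = 1.0 / m - (-z + alpha / (1.0 + alpha * m / c0) - (2.0 * beta**2 / c0) * m)
```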
Applications/Kernel Spectral Clustering: The case f′(τ) = α/√p 53/113

Limiting eigenvalue distribution

[Figure: four eigenvalue histograms of $\hat K$ with the limiting law overlaid, one per (α, β) pair.]

Figure: Eigenvalues of $\hat K$ (up to recentering) versus the limiting law, $p = 2048$, $n = 4096$, $k = 2$, $n_1 = n_2$, $\mu_i = 3\delta_i$, $f(x) = \frac12\beta\left(x + \frac{\alpha}{\sqrt p\,\beta}\right)^2$. (Top left): $\alpha = 8$, $\beta = 1$; (Top right): $\alpha = 4$, $\beta = 3$; (Bottom left): $\alpha = 3$, $\beta = 4$; (Bottom right): $\alpha = 1$, $\beta = 8$.
53 / 113
Applications/Kernel Spectral Clustering: The case f′(τ) = α/√p 54/113

Asymptotic performances: MNIST

- MNIST is “means-dominant”, but not that much!

| Datasets | $\|\mu_1 - \mu_2\|^2$ | $\frac{1}{\sqrt p}\operatorname{tr}(C_1 - C_2)^2$ | $\frac1p\operatorname{tr}(C_1 - C_2)^2$ |
|---|---|---|---|
| MNIST (digits 1, 7) | 612.7 | 71.1 | 2.5 |
| MNIST (digits 3, 6) | 441.3 | 39.9 | 1.4 |
| MNIST (digits 3, 8) | 212.3 | 23.5 | 0.8 |

[Figure: overlap versus $\alpha/\beta \in [-15, 15]$, one curve per digit pair.]

Figure: Spectral clustering of the MNIST database for varying $\alpha/\beta$.
54 / 113
Applications/Kernel Spectral Clustering: The case f′(τ) = α/√p 55/113

Asymptotic performances: EEG data

- EEG data are “variance-dominant”

| Datasets | $\|\mu_1 - \mu_2\|^2$ | $\frac{1}{\sqrt p}\operatorname{tr}(C_1 - C_2)^2$ | $\frac1p\operatorname{tr}(C_1 - C_2)^2$ |
|---|---|---|---|
| EEG (sets A, E) | 2.4 | 10.9 | 1.1 |

[Figure: overlap versus $\alpha/\beta \in [-60, 40]$.]

Figure: Spectral clustering of the EEG database for varying $\alpha/\beta$.
55 / 113
Applications/Semi-supervised Learning 56/113

Outline

Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models

Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f′(τ) = 0
  Kernel Spectral Clustering: The case f′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs

Perspectives
56 / 113
Applications/Semi-supervised Learning 57/113

Problem Statement

Context: similar to clustering:
- Classify $x_1,\dots,x_n \in \mathbb{R}^p$ into $k$ classes, with $n_l$ labelled and $n_u$ unlabelled data.
- Problem statement: assign scores $F_{ia}$ (with $d_i = [K 1_n]_i$)
$$F = \operatorname*{argmin}_{F\in\mathbb{R}^{n\times k}} \sum_{a=1}^{k}\sum_{i,j} K_{ij}\left(F_{ia}\, d_i^{\alpha-1} - F_{ja}\, d_j^{\alpha-1}\right)^2$$
such that $F_{ia} = \delta_{x_i \in \mathcal C_a}$ for all labelled $x_i$.
- Solution: for $F^{(u)} \in \mathbb{R}^{n_u\times k}$, $F^{(l)} \in \mathbb{R}^{n_l\times k}$ the scores of the unlabelled/labelled data,
$$F^{(u)} = \left(I_{n_u} - D_{(u)}^{-\alpha} K_{(u,u)} D_{(u)}^{\alpha-1}\right)^{-1} D_{(u)}^{-\alpha} K_{(u,l)} D_{(l)}^{\alpha-1} F^{(l)}$$
where we naturally decompose
$$K = \begin{bmatrix} K_{(l,l)} & K_{(l,u)} \\ K_{(u,l)} & K_{(u,u)} \end{bmatrix}, \qquad D = \begin{bmatrix} D_{(l)} & 0 \\ 0 & D_{(u)} \end{bmatrix} = \operatorname{diag}(K 1_n).$$
57 / 113
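The closed-form solution above can be evaluated numerically. A minimal sketch, assuming the labelled points come first in the kernel matrix (the helper name `ssl_scores` is hypothetical):

```python
import numpy as np

def ssl_scores(K, labels_l, alpha=0.0):
    """Closed-form SSL scores; labelled points occupy the first rows of K.

    K: (n, n) kernel matrix; labels_l: length-nl integer labels in {0..k-1}.
    Returns F(u) of shape (nu, k) following
    F(u) = (I - D_u^{-a} K_uu D_u^{a-1})^{-1} D_u^{-a} K_ul D_l^{a-1} F(l).
    """
    n = K.shape[0]
    nl = len(labels_l)
    k = int(labels_l.max()) + 1
    d = K @ np.ones(n)                       # degrees d_i = [K 1_n]_i
    Dl, Du = d[:nl], d[nl:]
    Kul, Kuu = K[nl:, :nl], K[nl:, nl:]
    Fl = np.eye(k)[labels_l]                 # F(l)_ia = 1{x_i in C_a}
    A = np.eye(n - nl) - Du[:, None] ** (-alpha) * Kuu * Du[None, :] ** (alpha - 1)
    b = (Du[:, None] ** (-alpha) * Kul * Dl[None, :] ** (alpha - 1)) @ Fl
    return np.linalg.solve(A, b)
```

On a toy two-cluster example, each unlabelled point receives its highest score for the class of the nearby labelled point, as expected.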
![Page 124: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/124.jpg)
Applications/Semi-supervised Learning 58/113
The finite-dimensional intuition: What we expect
Figure: Typical expected performance output
58 / 113
![Page 127: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/127.jpg)
Applications/Semi-supervised Learning 59/113
MNIST Data Example
[plot: entries of [F(u)]·,1 (Zeros), [F(u)]·,2 (Ones), [F(u)]·,3 (Twos) versus index, values around 0.8–1.2]
Figure: Vectors [F (u)]·,a, a = 1, 2, 3, for 3-class MNIST data (zeros, ones, twos), n = 192,p = 784, nl/n = 1/16, Gaussian kernel.
59 / 113
![Page 130: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/130.jpg)
Applications/Semi-supervised Learning 60/113
MNIST Data Example
[plot: entries of [F°(u)]·,1 (Zeros), [F°(u)]·,2 (Ones), [F°(u)]·,3 (Twos) versus index, values within ±4·10⁻²]
Figure: Centered vectors [F°(u)]·,a = [F(u) − (1/k) F(u) 1k 1kᵀ]·,a, 3-class MNIST data (zeros, ones, twos), α = 0, n = 192, p = 784, nl/n = 1/16, Gaussian kernel.
60 / 113
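The centering used in this figure, F(u) − (1/k) F(u) 1k 1kᵀ, simply removes each row's mean across the k class scores. A one-line sketch (the helper name is assumed):

```python
import numpy as np

def center_scores(F):
    """Per-row centering across the k classes: F - (1/k) F 1_k 1_k^T.

    Removes the common bias shared by all class scores of a point,
    which is what makes the class structure visible in the plot.
    """
    return F - F.mean(axis=1, keepdims=True)
```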
![Page 133: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/133.jpg)
Applications/Semi-supervised Learning 61/113
Theoretical Findings
Method: Assume nl/n → cl ∈ (0, 1).

I We aim at characterizing

F(u) = (Inu − D(u)^{−α} K(u,u) D(u)^{α−1})^{−1} D(u)^{−α} K(u,l) D(l)^{α−1} F(l)

I Taylor expansion of K as n, p → ∞:

K(u,u) = f(τ) 1nu 1nuᵀ + O‖·‖(n^{−1/2})
D(u) = n f(τ) Inu + O(n^{1/2})

and similarly for K(u,l), D(l).

I So that

(Inu − D(u)^{−α} K(u,u) D(u)^{α−1})^{−1} = (Inu − (1/n) 1nu 1nuᵀ + O‖·‖(n^{−1/2}))^{−1},

which is easily Taylor expanded.
61 / 113
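The expansion rests on the concentration of pairwise distances: in high dimension, all off-diagonal kernel entries are close to the single value f(τ). A quick numerical check under an assumed standard Gaussian model and kernel f(t) = exp(−t/2):

```python
import numpy as np

# Off-diagonal Gaussian-kernel entries concentrate around f(tau) as p grows,
# which is what justifies K = f(tau) 1 1^T + O(n^{-1/2}) above.
rng = np.random.default_rng(0)
n, p = 100, 2000
X = rng.standard_normal((n, p))                    # n samples in R^p
norms = (X ** 2).sum(axis=1)
sq = (norms[:, None] + norms[None, :] - 2.0 * X @ X.T) / p  # ||xi - xj||^2 / p
K = np.exp(-sq / 2.0)                              # kernel f(t) = exp(-t / 2)
off = K[~np.eye(n, dtype=bool)]                    # off-diagonal entries
```

Here ‖xi − xj‖²/p concentrates around τ = 2, so the entries cluster tightly around f(2) = e⁻¹ with relative fluctuations of order p^{−1/2}.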
![Page 136: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/136.jpg)
Applications/Semi-supervised Learning 62/113
Main Results
Results: Assuming nl/n → cl ∈ (0, 1), by the previous Taylor expansion,

I To first order,

F(u)·,a = C (nl,a/n) [ v + α ta 1nu/√n ] + O(n⁻¹)

with v = O(1), α ta 1nu/√n = O(n^{−1/2}), and the O(n⁻¹) remainder carrying the informative terms, where v is an O(1) random vector (entry-wise) and ta = (1/√p) tr C°a.

I Consequences:
I Random non-informative bias v
I Strong impact of nl,a ⇒ F(u)·,a must be rescaled by nl,a
I Additional per-class bias α ta 1nu ⇒ take α = 0 + β/√p.
62 / 113
![Page 141: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/141.jpg)
Applications/Semi-supervised Learning 63/113
Main Results
As a consequence of the remarks above, we take

α = β/√p

and define

F̂(u)i,a = (np/nl,a) F(u)ia.

Theorem. For xi ∈ Cb unlabelled,

F̂i,· − Gb → 0,   Gb ∼ N(mb, Σb)

where mb ∈ Rᵏ, Σb ∈ Rᵏˣᵏ are given by

(mb)a = −(2f′(τ)/f(τ)) M°ab + (f″(τ)/f(τ)) t°a tb + (2f″(τ)/f(τ)) T°ab − (f′(τ)²/f(τ)²) t°a tb + β (n/nl)(f′(τ)/f(τ)) t°a + Bb

(Σb)a1a2 = (2 tr Cb²/p) (f′(τ)²/f(τ)² − f″(τ)/f(τ))² t°a1 t°a2 + (4f′(τ)²/f(τ)²) ([M°ᵀ Cb M°]a1a2 + δa1a2 (p/nl,a1) T b a1)

with t, T, M as before, X°a = Xa − Σ_{d=1}^k (nl,d/nl) Xd, and Bb a bias independent of a.
63 / 113
![Page 143: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/143.jpg)
Applications/Semi-supervised Learning 64/113
Main Results
Corollary (Asymptotic Classification Error). For k = 2 classes and a ≠ b,

P(Fi,a > Fi,b | xi ∈ Cb) − Q( ((mb)b − (mb)a) / √([1, −1] Σb [1, −1]ᵀ) ) → 0.

Some consequences:

I non-obvious choice of appropriate kernels
I non-obvious choice of the optimal β (it induces a possibly beneficial bias)
I importance of nl versus nu.
64 / 113
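The corollary's error expression is a Gaussian tail probability, Q(x) = ½ erfc(x/√2). A small sketch evaluating it (the inputs m_b, Σ_b would come from the theorem; the values used here are placeholders):

```python
import math
import numpy as np

def q_func(x):
    """Gaussian tail function Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def asymptotic_error(m_b, Sigma_b, b, a):
    """Corollary's asymptotic misclassification probability for k = 2:
    Q( ((m_b)_b - (m_b)_a) / sqrt([1,-1] Sigma_b [1,-1]^T) ), the [1,-1]
    contraction taken over the (b, a) coordinates of Sigma_b."""
    num = m_b[b] - m_b[a]
    den = math.sqrt(Sigma_b[b, b] - 2.0 * Sigma_b[b, a] + Sigma_b[a, a])
    return q_func(num / den)
```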
![Page 145: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/145.jpg)
Applications/Semi-supervised Learning 65/113
MNIST Data Example
[plot: probability of correct classification (0.4–1) versus α (−1 to 1); curves: Simulations, Theory as if Gaussian]
Figure: Performance as a function of α, for 3-class MNIST data (zeros, ones, twos), n = 192,p = 784, nl/n = 1/16, Gaussian kernel.
65 / 113
![Page 147: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/147.jpg)
Applications/Semi-supervised Learning 66/113
MNIST Data Example
[plot: probability of correct classification (0.6–1) versus α (−1 to 1); curves: Simulations, Theory as if Gaussian]
Figure: Performance as a function of α, for 2-class MNIST data (zeros, ones), n = 1568, p = 784,nl/n = 1/16, Gaussian kernel.
66 / 113
![Page 149: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/149.jpg)
Applications/Semi-supervised Learning 67/113
Is semi-supervised learning really semi-supervised?
Reminder: For xi ∈ Cb unlabelled, Fi,· − Gb → 0, Gb ∼ N(mb, Σb) with

(mb)a = −(2f′(τ)/f(τ)) M°ab + (f″(τ)/f(τ)) t°a tb + (2f″(τ)/f(τ)) T°ab − (f′(τ)²/f(τ)²) t°a tb + β (n/nl)(f′(τ)/f(τ)) t°a + Bb

(Σb)a1a2 = (2 tr Cb²/p) (f′(τ)²/f(τ)² − f″(τ)/f(τ))² t°a1 t°a2 + (4f′(τ)²/f(τ)²) ([M°ᵀ Cb M°]a1a2 + δa1a2 (p/nl,a1) T b a1)

with t, T, M as before, X°a = Xa − Σ_{d=1}^k (nl,d/nl) Xd, and Bb a bias independent of a.

The problem with unlabelled data:

I The result does not depend on nu! → increasing nu is asymptotically non-beneficial.
I Even the best Laplacian regularizer reduces SSL to merely supervised learning.
67 / 113
![Page 151: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/151.jpg)
Applications/Semi-supervised Learning 67/113
Is semi-supervised learning really semi-supervised?
Reminder:For xi ∈ Cb unlabelled, Fi,· −Gb → 0, Gb ∼ N (mb,Σb) with
(mb)a = −2f ′(τ)
f(τ)Mab +
f ′′(τ)
f(τ)ta tb +
2f ′′(τ)
f(τ)Tab −
f ′(τ)2
f(τ)2tatb + β
n
nl
f ′(τ)
f(τ)ta +Bb
(Σb)a1a2 =2trC2
b
p
(f ′(τ)2
f(τ)2−f ′′(τ)
f(τ)
)2
ta1 ta2 +4f ′(τ)2
f(τ)2
([MTCbM ]a1a2 +
δa2a1 p
nl,a1
Tba1
)with t, T,M as before, Xa = Xa −
∑kd=1
nl,dnl
Xd and Bb bias independent of a.
The problem with unlabelled data:
I Result does not depend on nu!−→ increasing nu asymptotically non beneficial.
I Even best Laplacian regularizer brings SSL to be merely supervised learning.
67 / 113
![Page 152: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/152.jpg)
Applications/Semi-supervised Learning improved 68/113
Outline
Basics of Random Matrix TheoryMotivation: Large Sample Covariance MatricesSpiked Models
ApplicationsReminder on Spectral Clustering MethodsKernel Spectral ClusteringKernel Spectral Clustering: The case f ′(τ) = 0Kernel Spectral Clustering: The case f ′(τ) = α/√p
Semi-supervised LearningSemi-supervised Learning improvedRandom Feature Maps, Extreme Learning Machines, and Neural NetworksCommunity Detection on Graphs
Perspectives
68 / 113
![Page 153: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/153.jpg)
Applications/Semi-supervised Learning improved 69/113
Resurrecting SSL by centering
Reminder:

F = argmin_{F ∈ R^{n×k}} Σ_{a=1}^k Σ_{i,j} Kij (Fia di^{α−1} − Fja dj^{α−1})²   with F(l)ia = δ_{xi∈Ca}

⇔ F(u) = (Inu − D(u)^{−α} K(u,u) D(u)^{α−1})^{−1} D(u)^{−α} K(u,l) D(l)^{α−1} F(l).

Domination of score flattening:

I The finite-dimensional intuition imposes Kij decreasing with ‖xi − xj‖ ⇒ the solutions Fia tend to “flatten”
I Consequence: D(u)^{−α} K(u,u) D(u)^{α−1} ≃ (1/n) 1nu 1nuᵀ and the clustering information vanishes (not so obvious, but it can be shown).

Solution:

I Forgetting the finite-dimensional intuition: “recenter” K to kill the flattening, i.e., use

K̂ = P K P,   P = In − (1/n) 1n 1nᵀ.
69 / 113
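The centered-kernel scores used on the next slides, fu = (α Inu − K̂uu)^{−1} K̂ul fl with K̂ = PKP, can be sketched directly (labelled-points-first ordering and the helper name are assumptions):

```python
import numpy as np

def centered_ssl_scores(K, f_l, alpha):
    """Centered-kernel SSL scores; labelled points occupy the first rows of K.

    Khat = P K P with P = I - (1/n) 1 1^T removes the flat component,
    then f_u = (alpha I - Khat_uu)^{-1} Khat_ul f_l.
    """
    n = K.shape[0]
    nl = len(f_l)
    P = np.eye(n) - np.ones((n, n)) / n
    Kh = P @ K @ P                                   # centered kernel
    Kuu, Kul = Kh[nl:, nl:], Kh[nl:, :nl]
    return np.linalg.solve(alpha * np.eye(n - nl) - Kuu, Kul @ f_l)
```

On a toy two-cluster example with ±1 labels and α above the top eigenvalue of K̂uu, the unlabelled scores take the sign of the nearby labelled point.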
![Page 157: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/157.jpg)
Applications/Semi-supervised Learning improved 70/113
Theoretical results
Setting

I K = 2 classes, xi ∼ N(±µ, Ip)
I scores fu = (α Inu − K̂uu)^{−1} K̂ul fl.

Theorem (Asymptotic mean and variance). As n → ∞,

(1/nui) j(u)iᵀ fu − mi → 0 a.s.,   (1/nui) (fu − mi 1nu)ᵀ D(u)i (fu − mi 1nu) − σi² → 0 a.s.

where, for i = 1, 2,

mi ≡ −(cl/cu) si (1 − [1 + (cu c1 c2 ‖µ‖²/c0) δ/(1 + δ)]^{−1})

σi² ≡ (si² cl² ci² ‖µ‖² δ² / (c0²(1 + δ)² − cu c0 δ²)) · (1 + (cu c1 c2 ‖µ‖²/c0) δ²/(1 + δ)²) / (1 + (cu c1 c2 ‖µ‖²/c0) δ/(1 + δ))² + si² cl ci (1 − ci) δ² / (c0(1 + δ)² − cu δ²)

with δ defined as

δ ≡ −1/2 + (cu − c0 + sign(α) √((α − α−)(α − α+))) / (2α).
70 / 113
![Page 159: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/159.jpg)
Applications/Semi-supervised Learning improved 71/113
Performance as a function of nu, nl
[plot: correct classification rate (0.6–1) versus cu/cl (blue) and cl/cu (black), over 20–100; curves for ‖µ‖ = 1, 2, 3]
Figure: Correct classification rate, at optimal α, as a function of (i) nu for fixed p/nl = 5 (blue) and (ii) nl for fixed p/nu = 5 (black); c1 = c2 = 1/2; different values of ‖µ‖. Comparison to the optimal Neyman–Pearson performance for known µ (in red).
71 / 113
![Page 160: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/160.jpg)
Applications/Semi-supervised Learning improved 72/113
The spike case or not (1)
Marcenko–Pastur + spike limit

I the limiting eigenvalue distribution is the Marcenko–Pastur law
I an isolated spike is present iff

‖µ‖² > (1/(c1 c2)) √(c0/cu)

I this determines the existence (or not) of an unsupervised spectral clustering solution.

[plot: eigenvalue density (0–0.6) over [0, 3.5], isolated spike marked; legend: Empirical eigenvalues, Marcenko–Pastur law]
Figure: Eigenvalue distribution of K̂uu versus the (scaled) Marcenko–Pastur law with Stieltjes transform δ, for cu = 9/10, c0 = 1/2. The value ‖µ‖ = 2.5 ensures the presence of a leading isolated eigenvalue (spike).
72 / 113
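The phase-transition condition is a one-line check, and the spike phenomenon itself is easy to reproduce numerically. Below, `spike_exists` encodes the slide's condition (with c0, cu, c1, c2 as given there), while the empirical part is a generic BBP-style illustration on a Gaussian mixture's sample covariance, not the slide's exact K̂uu normalization:

```python
import numpy as np

def spike_exists(mu_norm2, c0, cu, c1, c2):
    """Slide's phase-transition condition: an isolated spike is present
    iff ||mu||^2 > (1 / (c1 c2)) * sqrt(c0 / cu)."""
    return mu_norm2 > np.sqrt(c0 / cu) / (c1 * c2)

# Generic BBP-style illustration (assumed setup): classes at +-mu in R^p,
# sample covariance (1/n) X^T X; its top eigenvalue separates from the
# Marcenko-Pastur bulk only when the means are far enough apart.
rng = np.random.default_rng(1)
p, n = 400, 200
c = p / n
edge = (1.0 + np.sqrt(c)) ** 2                  # right edge of the MP bulk
signs = rng.choice([-1.0, 1.0], size=n)
mu = np.zeros(p)
mu[0] = 3.0                                     # ||mu||^2 = 9, well above sqrt(c)
X = signs[:, None] * mu[None, :] + rng.standard_normal((n, p))
top = np.linalg.eigvalsh(X.T @ X / n).max()     # isolated spike, ~ (1 + 9)(1 + c/9)
X0 = rng.standard_normal((n, p))                # no signal: top eigenvalue
top0 = np.linalg.eigvalsh(X0.T @ X0 / n).max()  # sticks to the bulk edge
```

With the slide's parameters (cu = 9/10, c0 = 1/2, c1 = c2 = 1/2), the threshold sits near ‖µ‖² ≈ 2.98, consistent with ‖µ‖ = 1.7 and 1.8 falling below and above the phase transition on the next slide.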
![Page 164: Random Matrices in Machine Learning - AfIABasics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113 The Mar cenko{Pastur law De nition (Empirical Spectral Density)](https://reader035.vdocuments.us/reader035/viewer/2022071500/611eef64cd47e96f17250db8/html5/thumbnails/164.jpg)
Applications/Semi-supervised Learning improved 73/113
The spike case or not (2)
Figure: Asymptotic correct classification probability Φ(m1/σ1) as a function of α, for cu = 9/10, c0 = 1/2, c1 = 1/2, and two values of ‖µ‖: 1.7 (below the phase transition) and 1.8 (above it). The plot annotates δ(α) = −(1 + cu c1 c2 c0⁻¹ ‖µ‖²)⁻¹.
73 / 113
Applications/Semi-supervised Learning improved 74/113
SSL: the road from supervised to unsupervised
Figure: Theory (solid) versus practice (dashed; from right to left: n = 400, 1000, 4000): correct classification probability as a function of α, for cu = 9/10, c0 = 1/2, c1 = 1/2, with left: ‖µ‖ = 1.5 (below the phase transition) and right: ‖µ‖ = 2.5 (above the phase transition). Different values of n.
74 / 113
Applications/Semi-supervised Learning improved 75/113
Experimental evidence: MNIST
| Digits (nu = 100)        | (0,8)    | (2,7)    | (6,9)    |
|--------------------------|----------|----------|----------|
| Centered kernel          | 89.5±3.6 | 89.5±3.4 | 85.3±5.9 |
| Iterated centered kernel | 89.5±3.6 | 89.5±3.4 | 85.3±5.9 |
| Laplacian                | 75.5±5.6 | 74.2±5.8 | 70.0±5.5 |
| Iterated Laplacian       | 87.2±4.7 | 86.0±5.2 | 81.4±6.8 |
| Manifold                 | 88.0±4.7 | 88.4±3.9 | 82.8±6.5 |

| Digits (nu = 1000)       | (0,8)    | (2,7)    | (6,9)    |
|--------------------------|----------|----------|----------|
| Centered kernel          | 92.2±0.9 | 92.5±0.8 | 92.6±1.6 |
| Iterated centered kernel | 92.3±0.9 | 92.5±0.8 | 92.9±1.4 |
| Laplacian                | 65.6±4.1 | 74.4±4.0 | 69.5±3.7 |
| Iterated Laplacian       | 92.2±0.9 | 92.4±0.9 | 92.0±1.6 |
| Manifold                 | 91.1±1.7 | 91.4±1.9 | 91.4±2.0 |

Table: Comparison of classification accuracy (%) on MNIST datasets with nl = 10. Computed over 1000 random iterations for nu = 100 and 100 for nu = 1000.
75 / 113
Applications/Semi-supervised Learning improved 76/113
Experimental evidence: Traffic signs (HOG features)
| Class ID (nu = 100)      | (2,7)     | (9,10)    | (11,18)   |
|--------------------------|-----------|-----------|-----------|
| Centered kernel          | 79.0±10.4 | 77.5±9.2  | 78.5±7.1  |
| Iterated centered kernel | 85.3±5.9  | 89.2±5.6  | 90.1±6.7  |
| Laplacian                | 73.8±9.8  | 77.3±9.5  | 78.6±7.2  |
| Iterated Laplacian       | 83.7±7.2  | 88.0±6.8  | 87.1±8.8  |
| Manifold                 | 77.6±8.9  | 81.4±10.4 | 82.3±10.8 |

| Class ID (nu = 1000)     | (2,7)     | (9,10)    | (11,18)   |
|--------------------------|-----------|-----------|-----------|
| Centered kernel          | 83.6±2.4  | 84.6±2.4  | 88.7±9.4  |
| Iterated centered kernel | 84.8±3.8  | 88.0±5.5  | 96.4±3.0  |
| Laplacian                | 72.7±4.2  | 88.9±5.7  | 95.8±3.2  |
| Iterated Laplacian       | 83.0±5.5  | 88.2±6.0  | 92.7±6.1  |
| Manifold                 | 77.7±5.8  | 85.0±9.0  | 90.6±8.1  |

Table: Comparison of classification accuracy (%) on German Traffic Sign datasets with nl = 10. Computed over 1000 random iterations for nu = 100 and 100 for nu = 1000.
76 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 77/113
Outline
Basics of Random Matrix Theory
    Motivation: Large Sample Covariance Matrices
    Spiked Models

Applications
    Reminder on Spectral Clustering Methods
    Kernel Spectral Clustering
    Kernel Spectral Clustering: The case f′(τ) = 0
    Kernel Spectral Clustering: The case f′(τ) = α/√p
    Semi-supervised Learning
    Semi-supervised Learning improved
    Random Feature Maps, Extreme Learning Machines, and Neural Networks
    Community Detection on Graphs
Perspectives
77 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 78/113
Random Feature Maps and Extreme Learning Machines
Context: Random Feature Map
- (large) input x1, . . . , xT ∈ R^p
- random W = [w1, . . . , wn]ᵀ ∈ R^{n×p}
- non-linear activation function σ.

Neural Network Model (extreme learning machine): ridge-regression learning
- small output y1, . . . , yT ∈ R^d
- ridge-regression output β ∈ R^{n×d}
78 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 79/113
Random Feature Maps and Extreme Learning Machines
Objectives: evaluate the training and testing MSE performance as n, p, T → ∞.

- Training MSE:

  E_train = (1/T) Σ_{i=1}^T ‖yi − βᵀσ(W xi)‖² = (1/T) ‖Y − βᵀΣ‖²_F

  with

  Σ = σ(WX) = {σ(wiᵀ xj)}_{1≤i≤n, 1≤j≤T},    β = (1/T) Σ ((1/T) ΣᵀΣ + γ I_T)⁻¹ Yᵀ.

- Testing MSE: upon a new pair (X̂, Ŷ) of length T̂,

  E_test = (1/T̂) ‖Ŷ − βᵀΣ̂‖²_F,    where Σ̂ = σ(W X̂).
79 / 113
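These two formulas can be checked numerically. Below is a minimal self-contained sketch (toy data, ReLU activation, and dimensions chosen arbitrarily for illustration) that computes the ridge-regression readout β and verifies that the direct training MSE matches its resolvent form (γ²/T) tr YᵀY Q², an algebraic identity since Y − βᵀΣ = γ Y Q.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, T, d = 50, 200, 400, 1                 # toy dimensions (illustrative only)
gamma = 1e-2

X = rng.standard_normal((p, T))              # training inputs x_1, ..., x_T
Y = np.sign(X[:1, :])                        # toy targets, d x T
W = rng.standard_normal((n, p))              # random (fixed) first layer
Sigma = np.maximum(W @ X, 0.0)               # Sigma = sigma(WX), here sigma = ReLU

# ridge-regression readout beta = (1/T) Sigma ((1/T) Sigma^T Sigma + gamma I_T)^{-1} Y^T
Q = np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T))   # the resolvent Q(gamma)
beta = Sigma @ Q @ Y.T / T

E_train = np.linalg.norm(Y - beta.T @ Sigma, 'fro') ** 2 / T
E_train_resolvent = gamma ** 2 / T * np.trace(Y @ Q @ Q @ Y.T)

print(E_train, E_train_resolvent)            # the two expressions agree
```

The identity follows from Q (1/T) ΣᵀΣ = I_T − γQ, which is why the resolvent Q becomes the central object on the next slides.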
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 80/113
Technical Aspects
Preliminary observations:

- Link to the resolvent of (1/T) ΣᵀΣ:

  E_train = (γ²/T) tr YᵀY Q² = −γ² (∂/∂γ) (1/T) tr YᵀY Q

  where Q = Q(γ) is the resolvent

  Q ≡ ((1/T) ΣᵀΣ + γ I_T)⁻¹

  with Σij = σ(wiᵀ xj).

Central object: the resolvent E[Q].
80 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 81/113
Main Technical Result
Theorem [Asymptotic Equivalent for E[Q]]
For Lipschitz σ, bounded ‖X‖, ‖Y‖, and W = f(Z) (entry-wise) with Z standard Gaussian, we have, for all ε > 0,

  ‖E[Q] − Q̄‖ < C n^{ε − 1/2}

for some C > 0, where

  Q̄ = ((n/T) Φ/(1 + δ) + γ I_T)⁻¹,    Φ ≡ E[σ(Xᵀw) σ(wᵀX)]

with w = f(z), z ∼ N(0, I_p), and δ > 0 the unique positive solution to

  δ = (1/T) tr Φ Q̄.

Proof arguments:
- σ(WX) has independent rows but dependent columns
- this breaks the "trace lemma" argument (i.e., (1/p) wᵀ X A Xᵀ w ≃ (1/p) tr X A Xᵀ)
- concentration of measure: P(|(1/p) σ(wᵀX) A σ(Xᵀw) − (1/p) tr ΦA| > t) ≤ C e^{−c n min(t, t²)}
81 / 113
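As a numerical sanity check of the theorem (a minimal sketch with illustrative dimensions; the linear activation σ(t) = t is used because then Φ = XᵀX exactly for w ∼ N(0, I_p)), one can solve the fixed-point equation for δ by simple iteration and compare a Monte Carlo estimate of E[Q] with Q̄:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, T, gamma = 60, 120, 90, 0.5    # illustrative dimensions
X = rng.standard_normal((p, T)) / np.sqrt(p)

Phi = X.T @ X                        # Phi = E[sigma(X^T w) sigma(w^T X)] for sigma(t) = t

# solve delta = (1/T) tr Phi Qbar by fixed-point iteration
delta = 0.0
for _ in range(100):
    Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))
    delta = np.trace(Phi @ Qbar) / T

# Monte Carlo estimate of E[Q] over the random first layer W
reps, EQ = 300, np.zeros((T, T))
for _ in range(reps):
    Sigma = rng.standard_normal((n, p)) @ X          # sigma(WX) with sigma = identity
    EQ += np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T))
EQ /= reps

err = np.linalg.norm(EQ - Qbar, 2)   # operator norm; small compared to ||Qbar|| <= 1/gamma = 2
print(delta, err)
```

At these (small) dimensions the residual err mixes the O(n^{−1/2}) bias of the theorem with Monte Carlo noise; it shrinks further as n, T and the number of repetitions grow.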
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 82/113
Main Technical Result
- Values of Φ(a, b) for w ∼ N(0, I_p), where ∠(a, b) ≡ aᵀb / (‖a‖ ‖b‖):

| σ(t)      | Φ(a, b) |
|-----------|---------|
| max(t, 0) | (1/2π) ‖a‖‖b‖ (∠(a, b) acos(−∠(a, b)) + √(1 − ∠(a, b)²)) |
| \|t\|     | (2/π) ‖a‖‖b‖ (∠(a, b) asin(∠(a, b)) + √(1 − ∠(a, b)²)) |
| erf(t)    | (2/π) asin(2aᵀb / √((1 + 2‖a‖²)(1 + 2‖b‖²))) |
| 1_{t>0}   | 1/2 − (1/2π) acos(∠(a, b)) |
| sign(t)   | 1 − (2/π) acos(∠(a, b)) |
| cos(t)    | exp(−(‖a‖² + ‖b‖²)/2) cosh(aᵀb) |

- Value of Φ(a, b) for wi i.i.d. with E[wi^k] = mk (m1 = 0) and σ(t) = ζ2 t² + ζ1 t + ζ0:

  Φ(a, b) = ζ2² [ m2² (2(aᵀb)² + ‖a‖²‖b‖²) + (m4 − 3m2²) (a²)ᵀ(b²) ]
          + ζ1² m2 aᵀb
          + ζ2 ζ1 m3 [ (a²)ᵀb + aᵀ(b²) ]
          + ζ2 ζ0 m2 [ ‖a‖² + ‖b‖² ]
          + ζ0²

  where (a²) ≡ [a1², . . . , ap²]ᵀ.
82 / 113
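The first row of the table (the arc-cosine-type kernel for σ(t) = max(t, 0)) is easy to verify by Monte Carlo; a standalone sketch with arbitrary test vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5
a, b = rng.standard_normal(p), rng.standard_normal(p)

# closed form: Phi(a,b) = (1/2pi) ||a|| ||b|| ( ang * acos(-ang) + sqrt(1 - ang^2) )
na, nb = np.linalg.norm(a), np.linalg.norm(b)
ang = a @ b / (na * nb)
phi_closed = na * nb / (2 * np.pi) * (ang * np.arccos(-ang) + np.sqrt(1 - ang ** 2))

# Monte Carlo estimate of E[ sigma(a^T w) sigma(w^T b) ] for w ~ N(0, I_p)
W = rng.standard_normal((2_000_000, p))
phi_mc = np.mean(np.maximum(W @ a, 0) * np.maximum(W @ b, 0))

print(phi_closed, phi_mc)            # agreement to Monte Carlo accuracy
```

Setting b = a reduces the closed form to ‖a‖²/2, i.e., E[max(g, 0)²] = 1/2 for a standard Gaussian g, which is a quick consistency check.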
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 83/113
Main Results
Theorem [Asymptotic E_train]
For all ε > 0,

  n^{1/2 − ε} (E_train − Ē_train) → 0

almost surely, where

  E_train = (1/T) ‖Yᵀ − Σᵀβ‖²_F = (γ²/T) tr YᵀY Q²

  Ē_train = (γ²/T) tr YᵀY Q̄ [ ((1/n) tr ΨQ̄²) / (1 − (1/n) tr (ΨQ̄)²) · Ψ + I_T ] Q̄

with Ψ ≡ (n/T) Φ/(1 + δ).
83 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 84/113
Main Results
- Letting X̂ ∈ R^{p×T̂}, Ŷ ∈ R^{d×T̂} satisfy "similar properties" as (X, Y),

Claim [Asymptotic E_test]
For all ε > 0,

  n^{1/2 − ε} (E_test − Ē_test) → 0

almost surely, where

  E_test = (1/T̂) ‖Ŷᵀ − Σ̂ᵀβ‖²_F

  Ē_test = (1/T̂) ‖Ŷᵀ − Ψᵀ_{XX̂} Q̄ Yᵀ‖²_F
         + ((1/n) tr YᵀY Q̄ΨQ̄) / (1 − (1/n) tr (ΨQ̄)²) · [ (1/T̂) tr Ψ_{X̂X̂} − (1/T̂) tr (I_T + γQ̄)(Ψ_{XX̂} Ψ_{X̂X} Q̄) ]

with Ψ_{AB} = (n/T) Φ_{AB}/(1 + δ), Φ_{AB} = E[σ(Aᵀw) σ(wᵀB)].
84 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 85/113
Simulations on MNIST: Lipschitz σ(·)
Figure: Neural network performance for Lipschitz continuous σ(·) (σ(t) = max(t, 0), erf(t), |t|, t), as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784. Curves: E_train and E_test (simulation) versus Ē_train and Ē_test (theory).
85 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 86/113
Simulations on MNIST: non Lipschitz σ(·)
Figure: Neural network performance for σ(·) either discontinuous or non-Lipschitz (σ(t) = sign(t), 1_{t>0}, 1 − t²/2), as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784. Curves: E_train and E_test (simulation) versus Ē_train and Ē_test (theory).
86 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 87/113
Deeper investigation on Φ
Statistical Assumptions on X
- Gaussian mixture model:

  xi ∈ Ca ⇔ xi ∼ N((1/√p) µa, (1/p) Ca).

- Growth rate: ‖µa‖ = O(1), (1/√p) tr Ca = O(1).

Theorem
As p, T → ∞, for all σ(·) given in the next table,

  ‖PΦP − P Φ̄P‖ → 0 almost surely,

with

  Φ̄ ≡ d1 (Ω + M Jᵀ/√p)ᵀ (Ω + M Jᵀ/√p) + d2 U B Uᵀ + d0 I_T

  U ≡ [J/√p, φ],    B ≡ [ ttᵀ + 2T , t ; tᵀ , 1 ]

and d0, d1, d2 given in the next table (φi = ‖wi‖² − E[‖wi‖²] for xi = (1/√p) µa + wi).
87 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 88/113
Deeper investigation on Φ
Figure: Coefficients di in Φ̄ for different σ(·), where ReLU(t) = max(t, 0) and LReLU(t) = ς+ max(t, 0) + ς− max(−t, 0).

| σ(t)              | d0                                     | d1              | d2                    |
|-------------------|----------------------------------------|-----------------|-----------------------|
| t                 | 0                                      | 1               | 0                     |
| ReLU(t)           | (1/4 − 1/(2π)) τ                       | 1/4             | 1/(8πτ)               |
| \|t\|             | (1 − 2/π) τ                            | 0               | 1/(2πτ)               |
| LReLU(t)          | ((π − 2)/(4π)) (ς+ + ς−)² τ            | (1/4)(ς+ − ς−)² | (1/(8τπ)) (ς+ + ς−)²  |
| 1_{t>0}           | 1/4 − 1/(2π)                           | 1/(2πτ)         | 0                     |
| sign(t)           | 1 − 2/π                                | 2/(πτ)          | 0                     |
| ς2t² + ς1t + ς0   | 2τ²ς2²                                 | ς1²             | ς2²                   |
| cos(t)            | 1/2 + e^{−2τ}/2 − e^{−τ}               | 0               | e^{−τ}/4              |
| sin(t)            | 1/2 − e^{−2τ}/2 − τe^{−τ}              | e^{−τ}          | 0                     |
| erf(t)            | (2/π)(arccos(2τ/(2τ+1)) − 2τ/(2τ+1))   | (4/π)/(2τ+1)    | 0                     |
| exp(−t²/2)        | 1/√(2τ+1) − 1/(τ+1)                    | 0               | 1/(4(τ+1)³)           |
88 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 89/113
Deeper investigation on Φ
Three groups of functions σ(·) emerge:
- "means-oriented": d2 = 0
- "covariance-oriented": d1 = 0
- "balanced": d1, d2 ≠ 0

Case of the Leaky ReLU: σ(t) = ς+ max(t, 0) + ς− max(−t, 0).

Figure: Eigenvectors 1 and 2 of PΦP for the four classes N(µ1, C1), N(µ1, C2), N(µ2, C1), N(µ2, C2), for ς+ = ς− = 1 and for ς+ = 1, ς− = 0.
89 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 90/113
Deeper investigation on Φ: Simulation results

Table: Clustering accuracies for different σ(t) on the MNIST dataset (n = 32).

| group         | σ(t)       | T = 32 | T = 64 | T = 128 |
|---------------|------------|--------|--------|---------|
| mean-oriented | t          | 85.31% | 88.94% | 87.30%  |
|               | 1_{t>0}    | 86.00% | 82.94% | 85.56%  |
|               | sign(t)    | 81.94% | 83.34% | 85.22%  |
|               | sin(t)     | 85.31% | 87.81% | 87.50%  |
|               | erf(t)     | 86.50% | 87.28% | 86.59%  |
| cov-oriented  | \|t\|      | 62.81% | 60.41% | 57.81%  |
|               | cos(t)     | 62.50% | 59.56% | 57.72%  |
|               | exp(−t²/2) | 64.00% | 60.44% | 58.67%  |
| balanced      | ReLU(t)    | 82.87% | 85.72% | 82.27%  |
90 / 113
Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 91/113
Deeper investigation on Φ: Simulation results

Table: Clustering accuracies for different σ(t) on the epileptic EEG dataset (n = 32).

| group         | σ(t)       | T = 32 | T = 64 | T = 128 |
|---------------|------------|--------|--------|---------|
| mean-oriented | t          | 71.81% | 70.31% | 69.58%  |
|               | 1_{t>0}    | 65.19% | 65.87% | 63.47%  |
|               | sign(t)    | 67.13% | 64.63% | 63.03%  |
|               | sin(t)     | 71.94% | 70.34% | 68.22%  |
|               | erf(t)     | 69.44% | 70.59% | 67.70%  |
| cov-oriented  | \|t\|      | 99.69% | 99.69% | 99.50%  |
|               | cos(t)     | 99.00% | 99.38% | 99.36%  |
|               | exp(−t²/2) | 99.81% | 99.81% | 99.77%  |
| balanced      | ReLU(t)    | 84.50% | 87.91% | 90.97%  |
91 / 113
Applications/Community Detection on Graphs 92/113
Outline
Basics of Random Matrix Theory
    Motivation: Large Sample Covariance Matrices
    Spiked Models

Applications
    Reminder on Spectral Clustering Methods
    Kernel Spectral Clustering
    Kernel Spectral Clustering: The case f′(τ) = 0
    Kernel Spectral Clustering: The case f′(τ) = α/√p
    Semi-supervised Learning
    Semi-supervised Learning improved
    Random Feature Maps, Extreme Learning Machines, and Neural Networks
    Community Detection on Graphs
Perspectives
92 / 113
Applications/Community Detection on Graphs 93/113
System Setting
Undirected graph with n nodes, m edges:
- "intrinsic" average connectivity q1, . . . , qn ∼ µ i.i.d.
- k classes C1, . . . , Ck, independent of the qi, of (large) sizes n1, . . . , nk, with preferential attachment Cab between Ca and Cb
- edge probability for nodes i ∈ C_{gi}: P(i ∼ j) = qi qj C_{gi gj}
- adjacency matrix A with Aij ∼ Bernoulli(qi qj C_{gi gj})
93 / 113
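This degree-corrected block model can be sampled directly; a minimal sketch (the class-affinity matrix C and the bi-modal law for the qi below are illustrative placeholders, not the slides' exact experiment):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 600, 3
g = rng.integers(0, k, size=n)                          # class label g_i of each node
q = rng.choice([0.1, 0.5], size=n, p=[0.75, 0.25])      # intrinsic connectivity q_i ~ mu (bi-modal)
C = np.ones((k, k)) + 0.08 * np.eye(k)                  # preferential attachment C_ab (illustrative)

P = np.outer(q, q) * C[np.ix_(g, g)]                    # P(i ~ j) = q_i q_j C_{g_i g_j}
A = np.triu((rng.random((n, n)) < P).astype(int), 1)
A = A + A.T                                             # symmetric adjacency, no self-loops

d = A.sum(axis=1)                                       # degree vector
m = A.sum() // 2                                        # number of edges
print(m)
```

Only the upper triangle is sampled and then symmetrized, so each edge is drawn exactly once, matching the undirected-graph setting.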
Applications/Community Detection on Graphs 94/113
Limitations of Classical Methods
- 3 classes with µ bi-modal (µ = (3/4) δ_{0.1} + (1/4) δ_{0.5})

(Modularity: A − ddᵀ/(2m))    (Bethe Hessian: D − rA)
94 / 113
Applications/Community Detection on Graphs 95/113
Proposed Regularized Modularity Approach
Recall: P(i ∼ j) = qi qj C_{gi gj}.

Dense Regime Assumptions: non-trivial regime when, ∀ a, b, as n → ∞,

  Cab = 1 + Mab/√n,    Mab = O(1).

Community information is weak but highly redundant.

Considered Matrix:

  Lα = (2m)^α (1/√n) D^{−α} [A − ddᵀ/(2m)] D^{−α}.
95 / 113
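A minimal sketch of how Lα can be formed from an adjacency matrix (all degrees are assumed positive; the 4-cycle at the end is only a toy usage example):

```python
import numpy as np

def L_alpha(A, alpha):
    """Regularized modularity matrix
    L_alpha = (2m)^alpha n^{-1/2} D^{-alpha} [A - d d^T/(2m)] D^{-alpha}.
    Assumes every node has degree >= 1."""
    n = A.shape[0]
    d = A.sum(axis=1).astype(float)            # degree vector d
    two_m = d.sum()                            # 2m
    scale = two_m ** alpha / np.sqrt(n)
    Dma = d ** (-alpha)                        # diagonal of D^{-alpha}
    B = A - np.outer(d, d) / two_m             # modularity matrix A - d d^T / (2m)
    return scale * (Dma[:, None] * B * Dma[None, :])

# tiny usage example on a 4-cycle
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], float)
L1 = L_alpha(A, 1.0)
```

For α = 0 this reduces to the (scaled) modularity matrix, whose rows sum to zero; the diagonal scaling is applied as row/column weights rather than by forming D^{−α} explicitly.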
Applications/Community Detection on Graphs 96/113

Asymptotic Equivalence

Theorem (Limiting Random Matrix Equivalent)
As n → ∞, ‖L_α − L̄_α‖ → 0 almost surely, where

    L_α = (2m)^α (1/√n) D^{−α} [A − dd^T/(2m)] D^{−α}

    L̄_α = (1/√n) D_q^{−α} X D_q^{−α} + U Λ U^T

with D_q = diag(q_i), X a zero-mean random matrix with variance profile,

    U = [ D_q^{1−α} J/√n,  D_q^{−α} X 1_n ],  of rank k + 1,

    Λ = [ (I_k − 1_k c^T) M (I_k − c 1_k^T)   −1_k
          1_k^T                                0    ]

and J = [j_1, . . . , j_k], j_a = [0, . . . , 0, 1_{n_a}^T, 0, . . . , 0]^T ∈ R^n.

Consequences:
▶ isolated eigenvalues beyond the phase transition ⇔ λ(M) > "spectrum edge"
    ⇒ optimal choice α_opt of α from the study of the limiting spectrum.
▶ eigenvectors correlated to D_q^{1−α} J
    ⇒ necessary regularization by D^{α−1}.

96 / 113
Applications/Community Detection on Graphs 97/113

Eigenvalue Spectrum

[Histogram of the eigenvalues of L_1 (n = 2000) against the limiting law: bulk supported on [−S_α, S_α], with isolated spikes beyond the edge S_α.]

Figure: 3 classes, c_1 = c_2 = 0.3, c_3 = 0.4, µ = (1/2)δ_{0.4} + (1/2)δ_{0.9},
M = [ 3 −1 −1 ; −1 3 −1 ; −1 −1 3 ].

97 / 113
Applications/Community Detection on Graphs 98/113

Phase Transition

Theorem (Phase Transition)
L_α has an isolated eigenvalue λ_i(L_α) if |λ_i(M̄)| > τ_α, where M̄ = (D(c) − cc^T) M and

    τ_α = lim_{x↓S_+^α} −1/g_α(x), the phase transition threshold,

with [S_−^α, S_+^α] the limiting eigenvalue support of L_α and (f_α(x), g_α(x)), for |x| > S_+^α, solution of

    f_α(x) = ∫ q^{1−2α} / ( −x − q^{1−2α} f_α(x) + q^{2−2α} g_α(x) ) µ(dq)
    g_α(x) = ∫ q^{2−2α} / ( −x − q^{1−2α} f_α(x) + q^{2−2α} g_α(x) ) µ(dq).

In this case, λ_i(L_α) → (g_α)^{−1}(−1/λ_i(M̄)) almost surely.

Clustering possible when λ_i(M̄) > min_α τ_α:
▶ "optimal" α_opt ≡ argmin_α τ_α.
▶ from q̂_i ≡ d_i/√(d^T 1_n) → q_i almost surely, µ ≃ µ̂ ≡ (1/n) Σ_{i=1}^n δ_{q̂_i}, and thus: a consistent estimator α̂_opt of α_opt.

98 / 113
Applications/Community Detection on Graphs 99/113

Simulated Performance Results (2 masses of q_i)

[Four panels of clustering results: (Modularity A − dd^T/(2m)), (Bethe Hessian D − rA), (Proposed, α = 1), (Proposed, α̂_opt).]

Figure: 3 classes, µ = (3/4)δ_{0.1} + (1/4)δ_{0.5}, c_1 = c_2 = 1/4, c_3 = 1/2, M = 100 I_3.

99 / 113
Applications/Community Detection on Graphs 100/113

Simulated Performance Results (2 masses for q_i)

[Plot: overlap (0 to 1) versus ∆, for α = 0, α = 1/2, α = 1, α = α̂_opt, and the Bethe Hessian, with a vertical line at the phase transition.]

Figure: Overlap performance for n = 3000, K = 3, c_i = 1/3, µ = (3/4)δ_{q(1)} + (1/4)δ_{q(2)} with q(1) = 0.1 and q(2) = 0.5, M = ∆ I_3, for ∆ ∈ [5, 50]. Here α̂_opt = 0.07.

100 / 113
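The overlap score on the y-axis of these figures is presumably the standard one: best label-permutation accuracy, rescaled so that uniform random guessing scores 0 and perfect recovery scores 1. A sketch:

```python
import itertools
import numpy as np

def overlap(g_true, g_pred, k):
    """Best label-permutation accuracy, rescaled so that uniform
    random guessing gives 0 and perfect recovery gives 1."""
    g_true = np.asarray(g_true)
    g_pred = np.asarray(g_pred)
    best = max(
        np.mean(np.array(perm)[g_pred] == g_true)
        for perm in itertools.permutations(range(k))
    )
    return (best - 1.0 / k) / (1.0 - 1.0 / k)

print(overlap([0, 0, 1, 1], [1, 1, 0, 0], 2))  # 1.0: labels match up to relabeling
```

Brute-forcing the k! permutations is fine for the small k used here; larger k would call for the Hungarian algorithm instead.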
Applications/Community Detection on Graphs 101/113

Simulated Performance Results (2 masses for q_i)

[Plot: overlap (0 to 1) versus q(2) (with q(1) = 0.1), for α = 0, α = 0.5, α = 1, α = α̂_opt, and the Bethe Hessian.]

Figure: Overlap performance for n = 3000, K = 3, µ = (3/4)δ_{q(1)} + (1/4)δ_{q(2)} with q(1) = 0.1 and q(2) ∈ [0.1, 0.9], M = 10(2 I_3 − 1_3 1_3^T), c_i = 1/3.

101 / 113
Applications/Community Detection on Graphs 102/113

Real Graph Example: PolBlogs (n = 1490, two classes)

[Dominant eigenvector entries of L_0 (λ_max ≈ 1.75) versus L_1 (λ_max ≈ 483).]

Algorithm        Overlap   Modularity
α̂_opt (≈ 0)      0.897     0.4246
α = 0.5          0.035     ≈ 0
α = 1            0.040     ≈ 0
BH               0.304     0.2723

102 / 113
Perspectives/ 103/113

Outline

Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models

Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f′(τ) = 0
  Kernel Spectral Clustering: The case f′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs

Perspectives

103 / 113
Perspectives/ 104/113

Summary of Results and Perspectives I

Random Neural Networks.
✓ Extreme learning machines (one-layer random NN)
✓ Linear echo-state networks (ESN)
▷ Logistic regression and classification error in extreme learning machines (ELM)
▷ Further random feature maps characterization
▷ Generalized random NN (multiple layers, multiple activations)
▷ Random convolutional networks for image processing
◦ Non-linear ESN

Deep Neural Networks (DNN).
▷ Backpropagation in NN (σ(WX) for random X, backprop. on W)
◦ Statistical physics-inspired approaches (spin-glass models, Hamiltonian-based models)
◦ Non-linear ESN
◦ DNN performance of physics-realistic models (4th-order Hamiltonian, locality)

104 / 113
Perspectives/ 105/113

Summary of Results and Perspectives II

References.
H. W. Lin, M. Tegmark, "Why does deep and cheap learning work so well?", arXiv:1608.08225v2, 2016.
C. Williams, "Computation with infinite neural networks", Neural Computation, 10(5), 1203-1216, 1998.
H. Jaeger, "Short term memory in echo state networks", GMD-Forschungszentrum Informationstechnik, 2001.
G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, "Extreme learning machine: theory and applications", Neurocomputing, 70(1), 489-501, 2006.
N. El Karoui, "Concentration of measure and spectra of random matrices: applications to correlation matrices, elliptical distributions and beyond", The Annals of Applied Probability, 19(6), 2362-2405, 2009.
C. Louart, Z. Liao, R. Couillet, "A Random Matrix Approach to Neural Networks", (submitted to) Annals of Applied Probability, 2017.
R. Couillet, G. Wainrib, H. Sevi, H. Tiomoko Ali, "The asymptotic performance of linear echo state neural networks", Journal of Machine Learning Research, vol. 17, no. 178, pp. 1-35, 2016.
A. Choromanska et al., "The Loss Surfaces of Multilayer Networks", AISTATS, 2015.
A. Rahimi, B. Recht, "Random Features for Large-Scale Kernel Machines", NIPS, 2007.

105 / 113
Perspectives/ 106/113

Summary of Results and Perspectives I

Kernel methods.
✓ Spectral clustering
✓ Subspace spectral clustering (f′(τ) = 0)
▷ Spectral clustering with outer product kernel f(x^T y)
✓ Semi-supervised learning, kernel approaches
✓ Least square support vector machines (LS-SVM)
▷ Support vector machines (SVM)
◦ Kernel matrices based on Kendall τ, Spearman ρ

Applications.
✓ Massive MIMO user subspace clustering (patent proposed)
◦ Kernel correlation matrices for biostats, heterogeneous datasets
◦ Kernel PCA
◦ Kendall τ in biostats

References.
N. El Karoui, "The spectrum of kernel random matrices", The Annals of Statistics, 38(1), 1-50, 2010.

106 / 113
Perspectives/ 107/113

Summary of Results and Perspectives II

R. Couillet, F. Benaych-Georges, "Kernel Spectral Clustering of Large Dimensional Data", Electronic Journal of Statistics, vol. 10, no. 1, pp. 1393-1454, 2016.
R. Couillet, A. Kammoun, "Random Matrix Improved Subspace Clustering", Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 2016.
Z. Liao, R. Couillet, "A Large Dimensional Analysis of Least Squares Support Vector Machines", (submitted to) Journal of Machine Learning Research, 2017.
X. Mai, R. Couillet, "The counterintuitive mechanism of graph-based semi-supervised learning in the big data regime", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'17), New Orleans, USA, 2017.

107 / 113
Perspectives/ 108/113

Summary of Results and Perspectives I

Community detection.
✓ Heterogeneous dense network clustering
▷ Semi-supervised clustering
◦ Sparse network extensions
◦ Beyond community detection (hub detection)

Applications.
✓ Improved methods for community detection
▷ Applications to distributed optimization (network diffusion, graph signal processing)

References.
H. Tiomoko Ali, R. Couillet, "Spectral community detection in heterogeneous large networks", (submitted to) Journal of Multivariate Analysis, 2016.
F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborova, P. Zhang, "Spectral redemption in clustering sparse networks", Proceedings of the National Academy of Sciences, 110(52), 20935-20940, 2013.
C. Bordenave, M. Lelarge, L. Massoulie, "Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs", Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pp. 1347-1357, 2015.
A. Saade, F. Krzakala, L. Zdeborova, "Spectral clustering of graphs with the Bethe Hessian", Advances in Neural Information Processing Systems, pp. 406-414, 2014.

108 / 113
Perspectives/ 109/113

Summary of Results and Perspectives I

Robust statistics.
✓ Tyler, Maronna (and regularized) estimators
✓ Elliptical data setting, deterministic outlier setting
✓ Central limit theorem extensions
◦ Joint mean and covariance robust estimation
◦ Robust regression (preliminary works exist already, using strikingly different approaches)

Applications.
✓ Statistical finance (portfolio estimation)
✓ Localisation in array processing (robust G-MUSIC)
✓ Detectors in space-time array processing
◦ Correlation matrices in biostatistics, human science datasets, etc.

References.
R. Couillet, F. Pascal, J. W. Silverstein, "Robust Estimates of Covariance Matrices in the Large Dimensional Regime", IEEE Transactions on Information Theory, vol. 60, no. 11, pp. 7269-7278, 2014.

109 / 113
Perspectives/ 110/113

Summary of Results and Perspectives II

R. Couillet, F. Pascal, J. W. Silverstein, "The Random Matrix Regime of Maronna's M-estimator with elliptically distributed samples", Elsevier Journal of Multivariate Analysis, vol. 139, pp. 56-78, 2015.
T. Zhang, X. Cheng, A. Singer, "Marchenko-Pastur Law for Tyler's and Maronna's M-estimators", arXiv:1401.3424, 2014.
R. Couillet, M. McKay, "Large Dimensional Analysis and Optimization of Robust Shrinkage Covariance Matrix Estimators", Elsevier Journal of Multivariate Analysis, vol. 131, pp. 99-120, 2014.
D. Morales-Jimenez, R. Couillet, M. McKay, "Large Dimensional Analysis of Robust M-Estimators of Covariance with Outliers", IEEE Transactions on Signal Processing, vol. 63, no. 21, pp. 5784-5797, 2015.
L. Yang, R. Couillet, M. McKay, "A Robust Statistics Approach to Minimum Variance Portfolio Optimization", IEEE Transactions on Signal Processing, vol. 63, no. 24, pp. 6684-6697, 2015.
R. Couillet, "Robust spiked random matrices and a robust G-MUSIC estimator", Elsevier Journal of Multivariate Analysis, vol. 140, pp. 139-161, 2015.
A. Kammoun, R. Couillet, F. Pascal, M.-S. Alouini, "Optimal Design of the Adaptive Normalized Matched Filter Detector", (submitted to) IEEE Transactions on Information Theory, 2016, arXiv preprint 1504.01252.

110 / 113
Perspectives/ 111/113

Summary of Results and Perspectives III

R. Couillet, A. Kammoun, F. Pascal, "Second order statistics of robust estimators of scatter. Application to GLRT detection for elliptical signals", Elsevier Journal of Multivariate Analysis, vol. 143, pp. 249-274, 2016.
D. Donoho, A. Montanari, "High dimensional robust m-estimation: Asymptotic variance via approximate message passing", Probability Theory and Related Fields, 1-35, 2013.
N. El Karoui, "Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results", arXiv preprint arXiv:1311.2445, 2013.

111 / 113
Perspectives/ 112/113

Summary of Results and Perspectives I

Other works and ideas.
✓ Spike random matrix sparse PCA
▷ Non-linear shrinkage methods
▷ Sparse kernel PCA
▷ Random signal processing on graph methods
▷ Random matrix analysis of diffusion networks performance

Applications.
✓ Spike factor models in portfolio optimization
▷ Non-linear shrinkage in portfolio optimization, biostats

References.
R. Couillet, M. McKay, "Optimal block-sparse PCA for high dimensional correlated samples", (submitted to) Journal of Multivariate Analysis, 2016.
J. Bun, J. P. Bouchaud, M. Potters, "On the overlaps between eigenvectors of correlated random matrices", arXiv preprint arXiv:1603.04364, 2016.
O. Ledoit, M. Wolf, "Nonlinear shrinkage estimation of large-dimensional covariance matrices", 2011.

112 / 113
Perspectives/ 113/113
The End
Thank you.
113 / 113