All you wanted to know about (sparse / high dimensional)
PCA and Covariance Estimation
but didn't dare to ask

Boaz Nadler
Department of Computer Science and Applied Mathematics
The Weizmann Institute of Science
June 2018
Part 0: Introduction
Introduction
In many applications we need to analyze multivariate data,
x_1, . . . , x_n ∈ R^p

Throughout the talk: p = dimension, n = number of samples.
Many different scenarios:
- p small, n large.
- both p and n large.
- p large, n small.
Unsupervised Tasks
- Visualization
- Dimensionality Reduction
- Compression, Denoising
- Exploratory Data Analysis - finding structure in data
- Many hypothesis testing problems.
Supervised Tasks
In some cases we also observe a response y_i for each x_i.
Common tasks:
- Classification (Linear Discriminant Analysis)
- Regression (Multivariate Linear Regression).
Density Estimation and Moments
Typical assumption: the x_i are i.i.d. samples from some random variable X with unknown density f(x).
In principle, f (x) describes everything about X .
So, we can try to estimate it.
Non-parametric (kernel) density estimation achieves

|f̂(x) − f(x)| ∼ n^{−2β/(2β+p)}

where β is a measure of the smoothness of f(x).
Curse of dimensionality: to achieve a small error ε, the number of samples must grow exponentially in p, n = O(exp(Cp)).

Typically not practical if p > 15.
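To see what this rate implies, invert it: a target error ε requires n ∼ ε^{−(2β+p)/(2β)} samples. A minimal Python sketch of this arithmetic (the values β = 2 and ε = 0.1 are illustrative choices, not from the talk):

```python
# Sample size needed for kernel density estimation error eps,
# from n ~ eps^{-(2*beta + p) / (2*beta)} (beta, eps are illustrative).
beta, eps = 2.0, 0.1

for p in [1, 2, 5, 10, 15, 20]:
    n = eps ** (-(2 * beta + p) / (2 * beta))
    print(f"p = {p:2d}: need n ~ {n:.1e} samples")
```

Already at p = 20 this gives n ~ 10^6 samples, consistent with the "not practical for p > 15" rule of thumb.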
First and Second Moments of X
Since f(x) cannot be estimated already at moderate dimensions, we settle for its first and second moments:
Definition: The population mean vector µ ∈ Rp is
µ = E[X] = ∫ x f(x) dx
Definition: The population covariance matrix Σp×p is defined as
Σ = E[(X − µ)(X − µ)^T] = ∫ (x − µ)(x − µ)^T f(x) dx
In many problems, estimates of µ and Σ suffice to solve the data analysis task.
Example I: Quadratic Discriminant Analysis
Binary (2-class) classification problem:
Observe x_1, . . . , x_{n_+} from class +1 and x'_1, . . . , x'_{n_−} from class −1.
Goal: classify the label y ∈ {±1} of a new x.
QDA: Assume each class is multivariate Gaussian,
Likelihood Ratio = C(Σ_+, Σ_−) · exp(−(x − µ_+)^T Σ_+^{−1} (x − µ_+)/2) / exp(−(x − µ_−)^T Σ_−^{−1} (x − µ_−)/2)
To apply QDA, we need to estimate µ_± and Σ_± (actually their inverses).
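A minimal numpy sketch of the resulting plug-in classifier, with class means and covariances replaced by their sample estimates (the data and variable names here are illustrative, not from the talk):

```python
import numpy as np

def qda_fit(X):
    """Sample mean and precision (inverse covariance) of one class."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)       # unbiased, 1/(n-1) scaling
    return mu, np.linalg.inv(Sigma)

def qda_log_ratio(x, mu_p, Om_p, mu_m, Om_m):
    """Log-likelihood ratio; the log-det term is the constant C."""
    dp, dm = x - mu_p, x - mu_m
    log_C = 0.5 * (np.linalg.slogdet(Om_p)[1] - np.linalg.slogdet(Om_m)[1])
    return log_C - 0.5 * dp @ Om_p @ dp + 0.5 * dm @ Om_m @ dm

# Toy data: two Gaussian classes in p = 5 dimensions.
rng = np.random.default_rng(0)
X_plus = rng.normal(loc=+1.0, size=(200, 5))
X_minus = rng.normal(loc=-1.0, size=(200, 5))
mu_p, Om_p = qda_fit(X_plus)
mu_m, Om_m = qda_fit(X_minus)
x_new = rng.normal(loc=+1.0, size=5)
y_hat = 1 if qda_log_ratio(x_new, mu_p, Om_p, mu_m, Om_m) > 0 else -1
print(y_hat)
```

Note that inverting the sample covariance requires n_± > p; this is exactly where high-dimensional covariance estimation becomes the bottleneck.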
Example II: Markowitz Portfolio Optimization
[Harry Markowitz, Nobel Prize '90]
An investor can invest in p stocks, with weight vector w = (w_1, . . . , w_p), s.t. ∑_i w_i = 1.

Stock i has expected return µ_i. The p stocks have joint covariance matrix Σ, where Σ_ii is the variance of the i-th stock.

Goal: construct a portfolio that achieves an average target return µ_R, namely ∑_i w_i µ_i = µ_R, but with minimal risk (variance):

min_w  w^T Σ w

To construct the portfolio, we need to estimate the µ_i and Σ.
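Under the two linear constraints the minimizer has a closed form via Lagrange multipliers: w* = Σ^{−1} A^T (A Σ^{−1} A^T)^{−1} b, where A stacks 1^T and µ^T and b = (1, µ_R)^T. A sketch with toy numbers (the stock parameters below are made up for illustration):

```python
import numpy as np

def min_variance_portfolio(mu, Sigma, mu_R):
    """Minimize w' Sigma w  s.t.  sum(w) = 1 and w' mu = mu_R."""
    p = len(mu)
    A = np.vstack([np.ones(p), mu])        # constraint matrix, 2 x p
    b = np.array([1.0, mu_R])
    Sinv_At = np.linalg.solve(Sigma, A.T)  # Sigma^{-1} A'
    lam = np.linalg.solve(A @ Sinv_At, b)  # (A Sigma^{-1} A')^{-1} b
    return Sinv_At @ lam                   # w* = Sigma^{-1} A' lam

# Toy example with 3 stocks.
mu = np.array([0.05, 0.08, 0.12])
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
w = min_variance_portfolio(mu, Sigma, mu_R=0.08)
print(w, w.sum(), w @ mu)                  # weights, sum = 1, return = 0.08
```

In practice µ and Σ are replaced by their sample estimates, so estimation error propagates directly into the portfolio weights.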
Example III: Dimensionality of Data / Signals in Noise
Analytical Chemistry, Array Signal Processing, ...:
Assume “signal+noise” model
x = ∑_{j=1}^{K} s_j v_j + noise
Given observed data x_1, . . . , x_n, a basic question is: what is K?

The covariance matrix of the data has K spikes: eigenvalues larger than the noise variance σ², with the remaining eigenvalues all equal to σ².

Question: How can K be estimated? Which signal strengths can be detected?
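A quick simulation of this spiked spectrum for a single signal (K = 1; the signal strength, noise level, and dimensions below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, sigma = 25, 1000, 1.0
v = np.zeros(p)
v[0] = 1.0                                   # signal direction (here e_1)
s = rng.normal(scale=2.0, size=(n, 1))       # random signal strengths
X = s * v + sigma * rng.normal(size=(n, p))  # x_i = s_i v + noise
S_n = np.cov(X, rowvar=False)                # sample covariance matrix
eigs = np.sort(np.linalg.eigvalsh(S_n))[::-1]
print(eigs[:5])  # one spike near 2^2 + 1 = 5, the rest near sigma^2 = 1
```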
Detection of Structure / Signals in Noise
Eigenvalues of S_n (log scale):

[Figure: two eigenvalue spectra, p = 25, n = 1000 (left) and p = 25, n = 50 (right).]
Estimation of Signal Direction
Suppose there is a single signal in the data,

x = s v + σ ξ

Goal: estimate the signal direction (the vector v).

Questions: How should it be estimated? How well can it be estimated? What happens when the dimension p is high?
Estimation of Signal Direction
[Figure: left, the two largest eigenvalues λ_1, λ_2 of S_n as a function of the noise level σ; right, the overlap ⟨v_k, e_1⟩ as a function of σ.]
Estimation of Signal Direction
[Figure: left, λ_1, λ_2 as a function of the sample size n (up to 6 × 10^4); right, the overlap ⟨v_PCA, e_1⟩ as a function of n.]
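A simulation in the same spirit as these plots: estimate v by the leading eigenvector of S_n and track the overlap |⟨v_PCA, v⟩| as n grows (a sketch; the signal strength and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
p, sigma = 100, 1.0
v = np.zeros(p)
v[0] = 1.0                                   # true direction e_1

for n in [100, 1000, 10000, 60000]:
    s = rng.normal(size=(n, 1))              # unit-variance signal
    X = s * v + sigma * rng.normal(size=(n, p))
    S_n = np.cov(X, rowvar=False)
    w, V = np.linalg.eigh(S_n)               # eigenvalues in ascending order
    v_pca = V[:, -1]                         # leading eigenvector
    print(n, abs(v_pca @ v))                 # overlap <v_PCA, e_1> -> 1
```

For small n (comparable to p) the overlap is far from 1, which is the high-dimensional phenomenon the next parts of the talk quantify.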
Common Statistical Inference Problems
Given data x1, . . . , xn:
- Estimate the population covariance matrix Σ
- Estimate the precision matrix Ω = Σ^{−1}
- Estimate the largest eigenvalues and eigenvectors of Σ
- Suppose Σ (or Σ^{−1}) has many zeros. Find their support.
- Many hypothesis testing problems: Is Σ = Σ_0?
- Uncorrelated variables: Is Σ diagonal? Is Σ banded?
Inference Problems
Typically, µ and Σ are unknown and must be estimated from data.
Classical Estimates:
Sample Mean:

x̄ = (1/n) ∑_i x_i

Sample Covariance Matrix:

S_n = 1/(n − 1) ∑_{i=1}^n (x_i − x̄)(x_i − x̄)^T
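In code, these classical estimates are one-liners; a minimal numpy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))        # n = 500 samples in p = 10 dimensions

x_bar = X.mean(axis=0)                # sample mean, length p
S_n = np.cov(X, rowvar=False)         # sample covariance, 1/(n-1) scaling
print(x_bar.shape, S_n.shape)         # (10,) (10, 10)
```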
Outline
Q-1: How close are S_n and its eigenvalues/eigenvectors to those of Σ?

Part 1. p small, n → ∞ [classical asymptotic statistics]

Part 2. p, n both large [modern high dimensional statistics]

Q-2: What if we have additional information, e.g. sparsity?

Part 3. Sparse Principal Component Analysis, Sparse Covariance Estimation
The Important Question
Why is this interesting, relevant, and important?
Covariance Estimation will be important to you
References (Partial List)
* Nadler, Finite sample approximation results for PCA, Annals of Statistics, 2008.
* Kritchman and Nadler, Determining the number of components in a factor model from limited noisy data, 2008.
* Birnbaum et al., Minimax bounds for sparse PCA with noisy high dimensional data, Annals of Statistics, 2013.
* Baik and Silverstein, Eigenvalues of large sample covariance matrices..., Journal of Multivariate Analysis, 2006.
* Paul, Asymptotics of sample eigenstructure..., Statistica Sinica, 2008.
* Bickel and Levina, Covariance regularization by thresholding, Annals of Statistics, 2008.
* Cai and Liu, Adaptive thresholding for sparse covariance matrix estimation, Journal of the American Statistical Association, 2011.
* Cai, Zhang, and Zhou, Optimal rates of convergence for covariance matrix estimation, Annals of Statistics, 2010.

and many others...