Statistical Machine Learning
a.k.a. High Dimensional Inference
Larry Wasserman
Carnegie Mellon University
Outline
1 High Dimensional Regression and Classification
2 Graphical Models
3 Nonparametric Methods
4 Clustering
Throughout: Strong assumptions versus weak assumptions
Dimension = p or d. Sample size = n.
References: The Elements of Statistical Learning. Hastie, Tibshirani and Friedman. Statistical Machine Learning. Liu and Wasserman (coming soon).
Introduction
• Machine learning is statistics with a focus on prediction, scalability and high dimensional problems.
• Regression: predict Y ∈ R from X.
• Classification: predict Y ∈ {0, 1} from X.
  – Example: Predict whether an email X is real (Y = 1) or spam (Y = 0).
• Finding structure. Examples:
  – Clustering: find groups.
  – Graphical Models: find conditional independence structure.
Three Main Themes
Convexity
Convex problems can be solved quickly. If necessary, approximate the problem with a convex problem.
Sparsity
Many interesting problems are high dimensional. But often, the relevant information is effectively low dimensional.
Weak Assumptions
Make the weakest possible assumptions.
Preview: Sparse Graphical Models
Preview: Finding relations between stocks in the S&P 500:
Regression
We observe pairs (X1,Y1), . . . , (Xn,Yn).
D = {(X1,Y1), . . . , (Xn,Yn)} is called the training data.

Yi ∈ R is the response. Xi ∈ Rp is the feature (or covariate).

For example, suppose we have n subjects. Yi is the blood pressure of subject i. Xi = (Xi1, . . . , Xip) is a vector of p = 5,000 gene expression levels for subject i.
Remember: Yi ∈ R and Xi ∈ Rp.
Given a new pair (X ,Y ), we want to predict Y from X .
Regression
Let Ŷ be a prediction of Y. The prediction error or risk is

R = E(Y − Ŷ)²

where E is the expected value (mean).

The best predictor is the regression function

m(x) = E(Y | X = x) = ∫ y f(y | x) dy.

However, the true regression function m(x) is not known. We need to estimate m(x).
Regression
Given the training data D = {(X1,Y1), . . . , (Xn,Yn)} we want to construct m̂ to make

prediction risk = R(m̂) = E(Y − m̂(X))²

small. Here, (X,Y) is a new pair.

Key fact: the bias-variance decomposition:

R(m̂) = ∫ bias²(x) p(x) dx + ∫ var(x) p(x) dx + σ²

where

bias(x) = E(m̂(x)) − m(x)
var(x) = Variance(m̂(x))
σ² = E(Y − m(X))²
Bias-Variance Tradeoff
Prediction Risk = Bias² + Variance
Prediction methods with low bias tend to have high variance.
Prediction methods with low variance tend to have high bias.
For example, the predictor m̂(x) ≡ 0 has 0 variance but will be terribly biased.

To predict well, we need to balance the bias and the variance. We begin with linear methods.
Linear Regression
Try to find the best linear predictor, that is, a predictor of the form:
m(x) = β0 + β1x1 + · · ·+ βpxp.
Important: We do not assume that the true regression function is linear.

We can always define x1 = 1. Then the intercept is β1 and we can write
m(x) = β1x1 + · · ·+ βpxp = βT x
where β = (β1, . . . , βp) and x = (x1, . . . , xp).
Low Dimensional Linear Regression
Assume for now that p (= the length of each Xi) is small. To find a good linear predictor we choose β to minimize the training error:

training error = (1/n) Σᵢ₌₁ⁿ (Yi − βᵀXi)²

The minimizer β̂ = (β̂1, . . . , β̂p) is called the least squares estimator.
Low Dimensional Linear Regression
The least squares estimator is:

β̂ = (XᵀX)⁻¹XᵀY

where X is the n × d matrix

X = [ X11 X12 · · · X1d ]
    [ X21 X22 · · · X2d ]
    [  ⋮    ⋮         ⋮  ]
    [ Xn1 Xn2 · · · Xnd ]

and Y = (Y1, . . . , Yn)ᵀ.

In R: lm(y ~ x)
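The slides use R's lm; as a cross-check, here is a minimal numpy sketch of the same estimator on invented toy data:

```python
import numpy as np

# Toy data (invented for illustration): n = 50, d = 3 with an intercept column
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -3.0])
Y = X @ beta_true + 0.1 * rng.normal(size=n)

# Least squares: beta_hat = (X^T X)^{-1} X^T Y.
# lstsq solves the same minimization without forming the inverse explicitly.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

With small noise, beta_hat lands close to beta_true.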
Low Dimensional Linear Regression
Summary: the least squares estimator is m̂(x) = β̂ᵀx = Σⱼ β̂ⱼxⱼ where

β̂ = (XᵀX)⁻¹XᵀY.

When we observe a new X, we predict Y to be

Ŷ = m̂(X) = β̂ᵀX.

Our goals are to improve this by:

(i) dealing with high dimensions
(ii) using something more flexible than linear predictors.
Example
Y = HIV resistance
Xj = amino acid in position j of the virus.
Y = β0 + β1X1 + · · ·+ β100X100 + ε
[Figure: four panels. Top left: β̂. Top right: marginal regression coefficients (one-at-a-time). Bottom left: residuals Yi − Ŷi versus fitted values Ŷi. Bottom right: a sparse regression (coming up soon).]
High Dimensional Linear Regression
Now suppose p is large. We might even have p > n (more covariates than data points).

The least squares estimator is not defined since XᵀX is not invertible. The variance of the least squares prediction is huge.
Recall the bias-variance tradeoff:
Prediction Error = Bias² + Variance
We need to increase the bias so that we can decrease the variance.
Ridge Regression
Recall that the least squares estimator minimizes the training error (1/n) Σᵢ₌₁ⁿ (Yi − βᵀXi)².

Instead, we can minimize the penalized training error:

(1/n) Σᵢ₌₁ⁿ (Yi − βᵀXi)² + λ‖β‖2²

where ‖β‖2 = √(Σⱼ βⱼ²).

The solution is:

β̂ = (XᵀX + λI)⁻¹XᵀY
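A numpy sketch of this solution on invented toy data (λ = 1 is an arbitrary choice): even when p > n and XᵀX is singular, XᵀX + λI is invertible:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 100                       # p > n: ordinary least squares is not defined
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 3.0                  # only 5 relevant covariates
Y = X @ beta_true + rng.normal(size=n)

lam = 1.0
# Ridge solution: beta_hat = (X^T X + lam I)^{-1} X^T Y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
```

Here X.T @ X has rank at most n = 30 < p = 100, yet the ridge system solves without trouble.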
Ridge Regression
The tuning parameter λ controls the bias-variance tradeoff:
λ = 0 ⟹ least squares.
λ = ∞ ⟹ β̂ = 0.

We choose λ to minimize R̂(λ), where R̂(λ) is an estimate of the prediction risk.
Ridge Regression
To estimate the prediction risk, do not use the training error:

R̂training = (1/n) Σᵢ₌₁ⁿ (Yi − Ŷi)²,  Ŷi = Xiᵀβ̂

because it is biased: E(R̂training) < R(β̂).

Instead, we use leave-one-out cross-validation:

1. leave out (Xi, Yi)
2. find β̂
3. predict Yi: Ŷ(−i) = β̂ᵀXi
4. repeat for each i
Leave-one-out cross-validation
R̂(λ) = (1/n) Σᵢ₌₁ⁿ (Yi − Ŷ(−i))² = (1/n) Σᵢ₌₁ⁿ (Yi − Ŷi)² / (1 − Hii)²

≈ R̂training / (1 − k/n)²

≈ R̂training + 2kσ̂²/n

where

H = X(XᵀX + λI)⁻¹Xᵀ
k = trace(H)

k is the effective degrees of freedom.
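For ridge regression (a linear smoother) the leave-one-out identity via the hat-matrix diagonal is exact, which a small numpy check confirms (toy data; λ = 0.5 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 5
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.normal(size=n)
lam = 0.5

# Hat matrix H = X (X^T X + lam I)^{-1} X^T; k = trace(H)
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
Y_hat = H @ Y
k = np.trace(H)                                   # effective degrees of freedom

# Shortcut: leave-one-out residuals are the fitted residuals inflated by 1/(1 - H_ii)
R_shortcut = np.mean(((Y - Y_hat) / (1.0 - np.diag(H))) ** 2)

# Explicit leave-one-out refits, for comparison
errs = []
for i in range(n):
    m = np.arange(n) != i
    b = np.linalg.solve(X[m].T @ X[m] + lam * np.eye(p), X[m].T @ Y[m])
    errs.append((Y[i] - X[i] @ b) ** 2)
R_explicit = np.mean(errs)
```

The two estimates agree to machine precision, so n refits collapse to one fit plus the diagonal of H.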
Example
Y = 3X1 + · · ·+ 3X5 + 0X6 + · · ·+ 0X1000 + ε
n = 100, p = 1,000.
So there are 1000 covariates but only 5 are relevant.
What does ridge regression do in this case?
Ridge Regularization Paths
[Figure: ridge regularization paths — the coefficients β̂ⱼ(λ) plotted against λ, shrinking smoothly toward 0 as λ grows.]
Sparse Linear Regression
Ridge regression does not take advantage of sparsity.
Maybe only a small number of covariates are important predictors. How do we find them?

We could fit many submodels (each with a small number of covariates) and choose the best one. This is called model selection.

Now the inaccuracy is

prediction error = bias² + variance

The bias is the error due to omitting important variables. The variance is the error due to having to estimate many parameters.
The Bias-Variance Tradeoff
[Figure: as the number of variables grows from 0 to 100, the variance increases while the bias decreases.]
The Bias-Variance Tradeoff
This is a Goldilocks problem. Can’t use too few or too many variables.
Have to choose just the right variables.
Have to try all models with one variable, two variables,...
If there are p variables then there are 2p models.
Suppose we have 50,000 genes. We have to search through 2^50,000 models. But 2^50,000 > the number of atoms in the universe.

This problem is NP-hard. This was a major bottleneck in statistics for many years.
You are Here
[Figure: diagram of the complexity classes P, NP, NP-Complete and NP-Hard; variable selection is marked in the NP-Hard region.]
Two Things that Save Us
Two key ideas to make this feasible are sparsity and convex relaxation.

Sparsity: probably only a few genes are needed to predict some disease Y. In other words, of β1, . . . , β50,000, most βj ≈ 0.
But which ones?? (Needle in a haystack.)
Convex Relaxation: Replace model search with something easier.
It is the marriage of these two concepts that makes it all work.
Sparsity
Look at this:

β = (5, 5, 5, 0, 0, 0, . . . , 0).

This vector is high-dimensional but it is sparse.

Here is a less obvious example:

β = (50, 12, 6, 3, 2, 1.4, 1, 0.8, 0.6, 0.5, . . .)

It turns out that, if the βj's die off fairly quickly, then β behaves like a sparse vector.
Sparsity
We measure the (lack of) sparsity of β = (β1, . . . , βp) with the q-norm

‖β‖q = (|β1|^q + · · · + |βp|^q)^(1/q) = (Σⱼ |βj|^q)^(1/q).

Which values of q measure (lack of) sparsity?

sparse:     a = (1, 0, 0, 0, · · · , 0)
not sparse: b = (.001, .001, .001, .001, · · · , .001)

        q = 0   q = 1   q = 2
‖a‖q      1       1       1
‖b‖q      p      √p       1
          ✓       ✓       ✗

Lesson: Need to use q ≤ 1 to measure sparsity. (Actually, q < 2 is OK.)
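A quick numpy check of the table. To make the numbers come out exactly as p, √p, 1, we take b's entries to be 1/√p rather than .001 — an illustrative choice that fixes ‖b‖2 = 1:

```python
import numpy as np

def qnorm(beta, q):
    # q-norm (sum_j |beta_j|^q)^(1/q); for q = 0, count the nonzeros
    if q == 0:
        return np.count_nonzero(beta)
    return np.sum(np.abs(beta) ** q) ** (1.0 / q)

p = 10000
a = np.zeros(p)
a[0] = 1.0                        # sparse: a single nonzero entry
b = np.full(p, 1.0 / np.sqrt(p))  # dense: p tiny, equal entries

# q = 0 and q = 1 separate a from b; q = 2 cannot tell them apart
norms_a = [qnorm(a, q) for q in (0, 1, 2)]   # 1, 1, 1
norms_b = [qnorm(b, q) for q in (0, 1, 2)]   # p, sqrt(p), 1
```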
Sparsity
So we estimate β = (β1, . . . , βp) by minimizing

Σᵢ₌₁ⁿ (Yi − [β0 + β1Xi1 + · · · + βpXip])²

subject to the constraint that β is sparse, i.e. ‖β‖q ≤ small.

Can we do this minimization?

If we use q = 0, this turns out to be the same as searching through all 2^p models. Ouch!

What about other values of q?

What does the set {β : ‖β‖q ≤ small} look like?
The set {β : ‖β‖q ≤ 1} when p = 2

[Figure: the unit balls for q = 1/4, q = 1/2, q = 1 and q = 3/2; for q ≤ 1 the balls have corners on the axes, and for q ≥ 1 they are convex.]
Sparsity Meets Convexity
We need these sets to have a nice shape (convex). If so, the minimization is no longer NP-hard. In fact, it is easy.

Sensitivity to sparsity: q ≤ 1 (actually, q < 2 suffices)
Convexity (niceness): q ≥ 1
This means we should use q = 1.
Where Sparsity and Convexity Meet
Sparsity Meets Convexity: The LASSO
So we estimate β = (β1, . . . , βp) by minimizing

Σᵢ₌₁ⁿ (Yi − [β0 + β1Xi1 + · · · + βpXip])²

subject to the constraint that β is sparse, i.e. ‖β‖1 = Σⱼ |βj| ≤ small.

This is called the LASSO. Invented by Rob Tibshirani in 1996.
Lasso
The result is an estimated vector

β̂1, . . . , β̂p

Most are 0!

Magically, we have done model selection without searching (thanks to sparsity plus convexity).

The next picture explains why some β̂j = 0.
Sparsity: How Corners Create Sparse Estimators
The Lasso: HIV Example Again
• Y is resistance to an HIV drug.
• Xj = amino acid in position j of the virus.
• p = 99, n ≈ 100.
The Lasso: An Example
[Figure: LASSO coefficient paths — standardized coefficients plotted against |β̂|/max|β̂|.]
Selecting λ
We choose the sparsity level by estimating prediction error (ten-fold cross-validation).

[Figure: estimated risk versus the number of variables.]
The Lasso: An Example
[Figure: the lasso fit for the HIV data — estimated coefficients β̂j versus position.]
The Lasso: Another Example (using glmnet)
[Figure: glmnet coefficient paths — coefficients plotted against the L1 norm of β̂; the top axis shows the number of nonzero coefficients.]
The Lasso: Another Example (using glmnet)
[Figure: cross-validated mean-squared error versus log(λ); the top axis shows the number of nonzero coefficients.]
Sparvexity
To summarize: we penalize the sums of squares with

‖β‖q = (Σⱼ |βj|^q)^(1/q).

To get a sparse answer: q < 2.
To get a convex problem: q ≥ 1.
So q = 1 works.

Sparvexity (the marriage of sparsity and convexity) is one of the biggest developments in statistics and machine learning.
The Lasso
• β̂(λ) is called the lasso estimator. Then define

Ŝ(λ) = {j : β̂j(λ) ≠ 0}.

R: lars(y, x) or glmnet(y, x)

• (Optional) After you find Ŝ(λ), you can re-fit the model by doing least squares on the sub-model Ŝ(λ). (Less bias but higher variance.)
The Lasso
Choose λ by risk estimation (for example, 10-fold cross-validation).

If you use the “re-fit” approach, then you can apply leave-one-out cross-validation:

R̂(λ) = (1/n) Σᵢ₌₁ⁿ (Yi − Ŷ(−i))² = (1/n) Σᵢ₌₁ⁿ (Yi − Ŷi)² / (1 − Hii)² ≈ RSS / (n(1 − s/n)²)

where H is the hat matrix and s = #{j : β̂j ≠ 0}.

Choose λ to minimize R̂(λ).
The Lasso
The complete steps are:
1. Find β̂(λ) and Ŝ(λ) for each λ.
2. Choose λ to minimize the estimated risk.
3. Prediction: Ŷ = Xᵀβ̂.
Some Convexity Theory for the Lasso
Consider a simpler model than regression: suppose Y ∼ N(µ, 1). Let µ̂ minimize

A(µ) = (1/2)(Y − µ)² + λ|µ|.

How do we minimize A(µ)?

• Since A is convex, we set the subderivative to 0. Recall that c is a subderivative of f(x) at x0 if

f(x) − f(x0) ≥ c(x − x0).

• The subdifferential ∂f(x0) is the set of subderivatives. Also, x0 minimizes f if and only if 0 ∈ ∂f(x0).
ℓ1 and Soft Thresholding

• If f(µ) = |µ| then

∂f = { −1       if µ < 0
       [−1, 1]  if µ = 0
       +1       if µ > 0 }

• Hence,

∂A = { µ − Y − λ                   if µ < 0
       {−Y + λz : −1 ≤ z ≤ 1}      if µ = 0
       µ − Y + λ                   if µ > 0 }
ℓ1 and Soft Thresholding

• µ̂ minimizes A(µ) if and only if 0 ∈ ∂A(µ̂).

• So

µ̂ = { Y + λ   if Y < −λ
      0       if −λ ≤ Y ≤ λ
      Y − λ   if Y > λ }

• This can be written as

µ̂ = soft(Y, λ) ≡ sign(Y)(|Y| − λ)₊.
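Soft thresholding is one line of numpy; a minimal sketch:

```python
import numpy as np

def soft(y, lam):
    # soft(y, lam) = sign(y) * (|y| - lam)_+
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Values inside [-lam, lam] are set to 0; the rest are pulled toward 0 by lam
x = np.array([-2.0, -0.3, 0.0, 0.4, 2.5])
s = soft(x, 1.0)   # array([-1., -0., 0., 0., 1.5])
```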
ℓ1 and Soft Thresholding

[Figure: the soft-thresholding function soft(x, λ) — zero on [−λ, λ], linear with slope 1 outside.]
The Lasso: Computing β̂

• Minimize Σᵢ (Yi − βᵀXi)² + λ‖β‖1.

  – use lars (least angle regression), or
  – coordinate descent: set β = (0, . . . , 0), then iterate the following:
    • for j = 1, . . . , d:
      • set Ri = Yi − Σ_{s≠j} βs Xsi
      • β̂j = the least squares fit of the Ri's on Xj
      • β̂j ← soft(β̂j,LS, λ/Σᵢ X²ij)

R: glmnet
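A minimal numpy sketch of this coordinate descent (not the glmnet implementation; with the unscaled objective Σᵢ(Yi − βᵀXi)² + λ‖β‖1 the threshold picks up a factor of ½, i.e. λ/(2Σᵢ X²ij); the toy data are invented):

```python
import numpy as np

def soft(z, t):
    # soft(z, t) = sign(z) * (|z| - t)_+
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, Y, lam, n_sweeps=500):
    # Coordinate descent for: sum_i (Y_i - beta^T X_i)^2 + lam * ||beta||_1
    n, d = X.shape
    beta = np.zeros(d)
    col_ss = (X ** 2).sum(axis=0)                  # sum_i X_ij^2, one value per column
    for _ in range(n_sweeps):
        for j in range(d):
            R = Y - X @ beta + X[:, j] * beta[j]   # residual with coordinate j removed
            beta_ls = (X[:, j] @ R) / col_ss[j]    # univariate least squares fit of R on X_j
            beta[j] = soft(beta_ls, lam / (2.0 * col_ss[j]))
    return beta

# Toy data: 3 relevant covariates out of 20
rng = np.random.default_rng(3)
n, d = 100, 20
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = 3.0
Y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = lasso_cd(X, Y, lam=20.0)
# The irrelevant coordinates are thresholded to (or very near) 0;
# the first three come out close to 3, shrunk slightly toward 0.
```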
Variations on the Lasso
• Sparse logistic regression: later.

• Elastic Net: minimize

Σᵢ₌₁ⁿ (Yi − βᵀXi)² + λ1‖β‖1 + λ2‖β‖2²

• Group Lasso: partition the coefficients into groups,

β = (β1, . . . , βk; . . . ; βt, . . . , βp) = (v1, . . . , vm),

and minimize

Σᵢ₌₁ⁿ (Yi − βᵀXi)² + λ Σⱼ₌₁ᵐ ‖vj‖
Summary
• For low dimensional (linear) prediction, we can use least squares.
• For high dimensional linear regression, we face a bias-variance tradeoff: omitting too many variables causes bias while including too many variables causes high variance.

• The key is to select a good subset of variables.

• The lasso (L1-regularized least squares) is a fast way to select variables.

• If there are good, sparse, linear predictors, the lasso will work well.
Weak Assumptions Versus Strong Assumptions
With virtually no assumptions, the following is true. Let β∗ minimize

E(Y − βᵀX)² subject to ‖β‖1 ≤ L.

So β∗ᵀx is the best sparse linear predictor. Let β̂ be the lasso estimator. Then

R(β̂) − R(β∗) = O_P(√((L²/n) log p)).

The lasso is risk consistent as long as p = o(eⁿ). We do not assume that the true model is linear. We make no assumptions on the design matrix.
Strong Assumptions
To recover the true support of β or make inferences about β, many authors assume:

1. Incoherence (covariates are almost independent).
2. True model is linear.
3. Constant variance.
4. Donut (no small effects).

In signal processing, these assumptions are realistic. In data analysis, I can't imagine why we would assume these things.
Inference
Approach I: The HARNESS (joint work with Ryan Tibshirani)
Split the data into two halves: D1 and D2.

Use D1 to choose the variables Ŝ.

Use D2 to get confidence intervals for the βj's, the prediction risk R, and the prediction risk Rj when dropping βj. For example,

R̂(Ŝ) = (1/n) Σ_{i ∈ D2} |Yi − β̂ᵀXi|

estimates

R(Ŝ) = E|Y − β̂ᵀX|.
Inference From D2
Selected Ŝ ⊂ {1, . . . , p} using D1. Using D2, do the usual least-squares linear regression.

Can get confidence intervals for:

1. R = E|Y − βᵀX|.
2. Rj = E|Y − β(−j)ᵀX| − E|Y − βᵀX|.
3. βS = (βj : j ∈ S).

But notice that βS is the projection of m(x) onto (xj : j ∈ S). It depends on which variables are selected.
Example (Wine Data): Confidence Intervals
Selected variables: Alcohol, Volatile-Acidity, Sulphates, Total-Sulfur-Dioxide, and pH.
Left: the Rj ’s. Right: the βj ’s.
58
![Page 59: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/59.jpg)
Sequential Approach
Approach II: Sequential. Test each variable as it is added by the algorithm.
See: Lockhart, Taylor, Tibshirani and Tibshirani (2013) (arxiv.org/abs/1301.7161; to appear in the Annals of Statistics with discussion).
Other approaches: Javanmard and Montanari (2013), Buhlmann and van de Geer (2013).
Warning: these approaches make strong assumptions.
59
![Page 60: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/60.jpg)
Classification
Same as regression except that Yi ∈ {−1,+1}.
Classification risk:

R(h) = P(Y ≠ h(X)).
Best classifier (Bayes’ classifier):

h(x) = +1 if m(x) ≥ 1/2, −1 if m(x) < 1/2,

where m(x) = P(Y = +1 | X = x).
60
![Page 61: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/61.jpg)
Linear Classifiers
h(x) = +1 if β0 + βᵀx ≥ 0, −1 if β0 + βᵀx < 0;

equivalently, h(x) = sign(β0 + βᵀx).
Minimizing the training error

(1/n) ∑_{i=1}^n I(Yi ≠ h(Xi))

over β is not feasible (non-convex).
61
![Page 62: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/62.jpg)
(Sparse) Logistic Regression
m(x) = P(Y = 1|X = x) = e^{xᵀβ} / (1 + e^{xᵀβ})

and, in the high-dimensional case, we minimize

− log L(β) + λ ∑_j |βj|

where the likelihood is

L(β) = ∏_{Yi=+1} m(Xi) ∏_{Yi=−1} (1 − m(Xi)).
R: glmnet
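A minimal numpy sketch of this objective (not glmnet itself): proximal gradient descent, where each step follows the gradient of −log L(β) and then soft-thresholds, which is exactly what the ℓ1 penalty contributes. The synthetic data, step size, and λ are illustrative assumptions.

```python
import numpy as np

def sparse_logistic(X, y, lam, step=0.005, iters=3000):
    """Minimize -log L(beta) + lam * sum_j |beta_j| by proximal
    gradient descent (ISTA). Labels y take values in {-1, +1}."""
    n, d = X.shape
    y01 = (y + 1) / 2                          # recode labels to {0, 1}
    beta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # m(x) = P(Y = 1 | X = x)
        grad = X.T @ (p - y01)                 # gradient of -log L(beta)
        b = beta - step * grad
        # soft-thresholding is the proximal map of the l1 penalty
        beta = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)
    return beta

rng = np.random.default_rng(1)
n, d = 300, 10
X = rng.normal(size=(n, d))
true = np.zeros(d)
true[:2] = [2.0, -2.0]
y = np.where(rng.random(n) < 1 / (1 + np.exp(-X @ true)), 1, -1)

beta = sparse_logistic(X, y, lam=30.0)
print(np.round(beta, 2))   # only the first two coefficients should be large
```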
62
![Page 63: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/63.jpg)
Support Vector Machine (SVM)
Choose (β0, β) to minimize

(1/n) ∑_{i=1}^n [1 − Yi(β0 + βᵀXi)]+ + λ ∑_j βj²

where [a]+ = max{a, 0}.
R: libsvm, e1071, ...
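A rough numpy sketch of this objective (not libsvm): subgradient descent on the hinge loss plus the ridge penalty, run on synthetic two-class Gaussian data. The step size, iteration count, and data are illustrative assumptions.

```python
import numpy as np

def svm_subgradient(X, y, lam=0.1, step=0.01, iters=2000):
    """Minimize (1/n) sum_i [1 - y_i (b0 + beta^T x_i)]_+ + lam * ||beta||^2
    by subgradient descent. Labels y take values in {-1, +1}."""
    n, d = X.shape
    b0, beta = 0.0, np.zeros(d)
    for _ in range(iters):
        margin = y * (b0 + X @ beta)
        active = margin < 1                      # points violating the margin
        g_beta = -(X[active].T @ y[active]) / n + 2 * lam * beta
        g_b0 = -y[active].sum() / n
        beta -= step * g_beta
        b0 -= step * g_b0
    return b0, beta

rng = np.random.default_rng(2)
n = 200
X = np.vstack([rng.normal(-1.5, 1, (n // 2, 2)), rng.normal(1.5, 1, (n // 2, 2))])
y = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])

b0, beta = svm_subgradient(X, y)
acc = np.mean(np.sign(b0 + X @ beta) == y)
print(round(acc, 3))
```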
63
![Page 64: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/64.jpg)
SVM: Geometric View
64
![Page 65: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/65.jpg)
SVM: Geometric View
The optimization problem can be written as:

min_{β0, β} (1/2)‖β‖² + (1/λ′) ∑_{i=1}^n ξi

subject to ξi ≥ 1 − Yi(β0 + Xiᵀβ) and ξi ≥ 0 for all i = 1, . . . , n. Or, in dual form:

α̂ = argmax_{α ∈ Rⁿ} ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{k=1}^n αi αk Yi Yk ⟨Xi, Xk⟩

subject to ∑_{i=1}^n αi Yi = 0 and 0 ≤ α1, . . . , αn ≤ 1/λ′, where

β̂ = ∑_{i=1}^n α̂i Yi Xi.

This is a quadratic program. Changing ⟨Xi, Xk⟩ to K(Xi, Xk) (kernelization) makes the classifier nonlinear.
65
![Page 66: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/66.jpg)
Surrogate Losses
h(x) = sign(m(x)) where m(x) = β0 + βT x .
The empirical classification risk is

R̂ = (1/n) ∑_{i=1}^n I(Yi ≠ h(Xi)) = (1/n) ∑_{i=1}^n I(Yi m(Xi) ≤ 0).
Minimizing R̂ is hard because it is non-convex. The idea is to replace the non-convex function I(u ≤ 0) with a convex surrogate.
surrogate   form              classifier
hinge       [1 − u]+          support vector machine
logistic    log[1 + e^{−u}]   logistic regression
66
![Page 67: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/67.jpg)
Surrogate Losses
(Figure: the surrogate losses L(u) plotted against u = yH(x).)
67
![Page 68: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/68.jpg)
Example
South African Heart Disease. Outcome is Y = 1, 0 (disease/non-disease). There are 9 predictors X1, . . . , X9. I added 100 extra (irrelevant) predictors.
n = 462. d = 109.
Use sparse logistic regression.
68
![Page 69: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/69.jpg)
Example
(Figure: cross-validated misclassification error versus log(λ), and the coefficient paths versus log(λ); the top axes show the number of selected variables.)
69
![Page 70: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/70.jpg)
Regression is Harder Than Classification
h(x) = +1 if m̂(x) > 1/2, −1 otherwise.

The risk of the plug-in classifier satisfies

R(ĥ) − R∗ ≤ √( ∫ (m̂(x) − m∗(x))² dPX(x) ),

where R∗ = inf_h R(h).
70
![Page 71: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/71.jpg)
Regression is Harder Than Classification
(Figure: two panels showing regression functions m(x) crossing 1/2 at x = 0; the functions differ but induce the same classifier.)
71
![Page 72: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/72.jpg)
Example: Political Blogs
Features: word counts and links.
72
![Page 73: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/73.jpg)
Example: Political Blogs
(Figure: classification errors versus log(λ) for Sparse LR, SVM, Sparse LR (+ links), and SVM (+ links).)
73
![Page 74: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/74.jpg)
Undirected Graphs
UNDIRECTED GRAPHS
74
![Page 75: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/75.jpg)
Undirected Graphs
An approach to representing multivariate data.
Unsupervised: no response variables.
Two main types:
1. Correlation Graphs
2. Partial Correlation Graphs
75
![Page 76: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/76.jpg)
Correlation Graphs
X = (X(1), . . . , X(d)). Put an edge between j and k if θjk = Cov(X(j), X(k)) ≠ 0.
Bootstrap method: (Wasserman, Kolar, Rinaldo arXiv:1309.6933).
Step 1: Estimate Σ:
S = (1/n) ∑_{i=1}^n (Xi − X̄)(Xi − X̄)ᵀ.
Step 2: Bootstrap: draw B bootstrap samples, each of size n, giving S∗1, . . . , S∗B. Find the upper quantile zα such that

(1/B) ∑_{j=1}^B I( max_{r,s} √n |S∗j(r, s) − S(r, s)| > zα ) = α.
76
![Page 77: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/77.jpg)
Correlation Graphs
Step 3: Put an edge between (j, k) if

0 ∉ S(j, k) ± zα/√n.

Then

P(Ĝ = G) ≥ 1 − α − C log d / n^{1/8}.
This follows from results in Wasserman, Kolar, Rinaldo (2013)arXiv:1309.6933 and Chernozhukov, Chetverikov and Kato (2012)arXiv:1212.6906.
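A small numpy sketch of the three steps (illustrative throughout: a 5-dimensional example in which only coordinates 0 and 1 are truly correlated, and the choices of B and α are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: coordinates 0 and 1 are correlated; the rest independent.
n, d = 500, 5
Z = rng.normal(size=(n, d))
Z[:, 1] = 0.8 * Z[:, 0] + 0.6 * Z[:, 1]

# Step 1: the sample covariance matrix S.
S = np.cov(Z, rowvar=False, bias=True)

# Step 2: bootstrap the max-deviation statistic to get z_alpha.
B, alpha = 200, 0.05
stats = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)
    Sb = np.cov(Z[idx], rowvar=False, bias=True)
    stats[b] = np.sqrt(n) * np.abs(Sb - S).max()
z = np.quantile(stats, 1 - alpha)

# Step 3: edge (j, k) iff 0 is outside S(j, k) +/- z / sqrt(n).
edges = {(j, k) for j in range(d) for k in range(j + 1, d)
         if abs(S[j, k]) > z / np.sqrt(n)}
print(edges)
```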
77
![Page 78: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/78.jpg)
Example: n = 100 and d = 100.
78
![Page 79: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/79.jpg)
Markov (Conditional Independence) Graphs
Let X = (X(1), . . . , X(d)). A graph G = (V, E) has vertices V and edges E. The independence graph has one vertex for each X(j).
X - Y - Z

means that

X ⫫ Z | Y.

V = {X, Y, Z} and E = {(X, Y), (Y, Z)}.
79
![Page 80: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/80.jpg)
Markov Property
A probability distribution P satisfies the global Markov property withrespect to a graph G if:
for any disjoint vertex subsets A, B, and C such that C separates A and B,

XA ⫫ XB | XC.
80
![Page 81: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/81.jpg)
Example
1 2 3 4 5
6 7 8
C = {3, 7} separates A = {1, 2} and B = {4, 8}. Hence,

{X(1), X(2)} ⫫ {X(4), X(8)} | {X(3), X(7)}.
81
![Page 82: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/82.jpg)
Example: Protein networks (Maslov 2002)
82
![Page 83: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/83.jpg)
Distributions Encoded by a Graph
• I(G) = all independence statements implied by the graph G.
• I(P) = all independence statements implied by P.
• P(G) = {P : I(G) ⊆ I(P)}.
• If P ∈ P(G) we say that P is Markov to G.
• The graph G represents the class of distributions P(G).
• Goal: Given random vectors X1, . . . ,Xn ∼ P estimate G.
83
![Page 84: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/84.jpg)
Gaussian Case
• If X ∼ N(µ, Σ) then there is no edge between Xi and Xj if and only if Ωij = 0, where Ω = Σ−1.
• Given X1, . . . , Xn ∼ N(µ, Σ).
• For n > p, let Ω̂ = Σ̂−1 and test H0 : Ωij = 0 versus H1 : Ωij ≠ 0.
84
![Page 85: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/85.jpg)
Gaussian Case: p > n
Two approaches:
• parallel lasso (Meinshausen and Buhlmann)
• graphical lasso (glasso; Banerjee et al, Hastie et al.)
Parallel Lasso:
1. For each j = 1, . . . , d (in parallel): regress Xj on all other variables using the lasso.
2. Put an edge between Xi and Xj if each appears in the regression of the other.
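A minimal numpy sketch of these two steps (neighborhood selection). The lasso solver here is a basic proximal-gradient (ISTA) loop, and the chain-dependence data and λ are illustrative assumptions.

```python
import numpy as np

def lasso(X, y, lam, iters=500):
    """Lasso via proximal gradient (ISTA):
    minimize 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1 / Lipschitz constant
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        g = X.T @ (X @ b - y)                   # gradient of the quadratic part
        b = b - step * g
        b = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)
    return b

def parallel_lasso_graph(Z, lam):
    """Lasso-regress each variable on the rest;
    keep edge (j, k) if each variable selects the other."""
    d = Z.shape[1]
    nbrs = []
    for j in range(d):
        others = [k for k in range(d) if k != j]
        b = lasso(Z[:, others], Z[:, j], lam)
        nbrs.append({others[i] for i in np.flatnonzero(b)})
    return {(j, k) for j in range(d) for k in range(j + 1, d)
            if k in nbrs[j] and j in nbrs[k]}

# Chain-dependence data 0 - 1 - 2, so (0, 2) should have no edge.
rng = np.random.default_rng(4)
n = 400
x0 = rng.normal(size=n)
x1 = 0.7 * x0 + rng.normal(scale=0.7, size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.7, size=n)
Z = np.column_stack([x0, x1, x2])
Z = Z - Z.mean(axis=0)

E = parallel_lasso_graph(Z, lam=100.0)
print(E)
```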
85
![Page 86: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/86.jpg)
Glasso (Graphical Lasso)
The glasso minimizes

−ℓ(Ω) + λ ∑_{j≠k} |Ωjk|

where

ℓ(Ω) = (1/2)( log |Ω| − tr(ΩS) )

is the Gaussian log-likelihood (maximized over µ).
There is a simple blockwise gradient descent algorithm for minimizing this function. It is very similar to the previous algorithm.
R packages: glasso and huge
86
![Page 87: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/87.jpg)
Graphs on the S&P 500
• Data from Yahoo! Finance (finance.yahoo.com).
• Daily closing prices for 452 stocks in the S&P 500 between 2003 and 2008 (before onset of the “financial crisis”).
• Log returns Xtj = log(St,j / St−1,j).
• Winsorized to trim outliers.
• In the following graphs, each node is a stock, and color indicates GICS industry: Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Industrials, Information Technology, Materials, Telecommunications Services, Utilities.
87
![Page 88: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/88.jpg)
S&P 500: Graphical Lasso
88
![Page 89: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/89.jpg)
S&P 500: Parallel Lasso
89
![Page 90: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/90.jpg)
Choosing λ
Can use:
1. Cross-validation
2. BIC = log-likelihood − (p/2) log n
3. AIC = log-likelihood − p
where p = number of parameters.
90
![Page 91: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/91.jpg)
Remarks
1. Versions exist for discrete random variables.
2. Assumption-free approach (Wasserman, Kolar and Rinaldo 2013), which includes confidence intervals.
3. Very active area, especially in bioinformatics.
4. Nonparametric approaches (coming up).
91
![Page 92: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/92.jpg)
NONPARAMETRIC MACHINE LEARNING
92
![Page 93: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/93.jpg)
Nonparametric Regression
Given (X1,Y1), . . . , (Xn,Yn) predict Y from X .
Assume only that Yi = m(Xi) + εi where m(x) is a smooth function of x.

The most popular methods are kernel methods. However, there are two types of kernels:

1. Smoothing kernels
2. Mercer kernels

Smoothing kernels involve local averaging. Mercer kernels involve regularization.
93
![Page 94: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/94.jpg)
Smoothing Kernels
• Smoothing kernel estimator:

m̂h(x) = ∑_{i=1}^n Yi Kh(Xi, x) / ∑_{i=1}^n Kh(Xi, x) = ∑_i wi(x) Yi

where Kh(x, z) is a kernel such as

Kh(x, z) = exp( −‖x − z‖² / (2h²) )

and h > 0 is called the bandwidth.
• mh(x) is just a local average of the Yi ’s near x .
• The bandwidth h controls the bias-variance tradeoff: small h = large variance, while large h = large bias.
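A minimal numpy version of this local-averaging estimator (the sine-curve data and the bandwidth are illustrative assumptions):

```python
import numpy as np

def kernel_smoother(x0, X, Y, h):
    """Nadaraya-Watson estimate at x0: a weighted local average of the Y_i."""
    w = np.exp(-(X - x0) ** 2 / (2 * h ** 2))   # Gaussian kernel weights
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(5)
n = 300
X = rng.uniform(0, 1, n)
Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.2, size=n)

mhat = kernel_smoother(0.25, X, Y, h=0.05)
print(round(mhat, 2))   # should be near sin(pi/2) = 1
```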
94
![Page 95: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/95.jpg)
Example: Some Data – Plot of Yi versus Xi
95
![Page 96: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/96.jpg)
Example: m(x)
96
![Page 97: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/97.jpg)
m(x) is a local average
97
![Page 98: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/98.jpg)
Effect of the bandwidth h
very small bandwidth
small bandwidth
medium bandwidth
large bandwidth
98
![Page 99: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/99.jpg)
Smoothing Kernels
Risk = E(Y − mh(X ))2 = bias2 + variance + σ2.
bias² ≈ h⁴, variance ≈ 1/(n h^d), where d = dimension of X.
σ2 = E(Y −m(X ))2 is the unavoidable prediction error.
small h: low bias, high variance (undersmoothing).
large h: high bias, low variance (oversmoothing).
99
![Page 100: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/100.jpg)
Risk Versus Bandwidth
(Figure: risk, bias, and variance plotted against the bandwidth h; the optimal h balances the two terms.)
100
![Page 101: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/101.jpg)
Estimating the Risk: Cross-Validation
To choose h we need to estimate the risk R(h). We can estimate the risk by using cross-validation.
1. Omit (Xi, Yi) to get m̂h,(i), then predict: Ŷ(i) = m̂h,(i)(Xi).
2. Repeat this for all observations.
3. The cross-validation estimate of risk is:

R̂(h) = (1/n) ∑_{i=1}^n (Yi − Ŷ(i))².
Shortcut formula:

R̂(h) = (1/n) ∑_{i=1}^n ( (Yi − Ŷi) / (1 − Lii) )²

where Lii = Kh(Xi, Xi) / ∑_t Kh(Xi, Xt).

101
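The shortcut can be checked directly for the smoothing-kernel estimator, since m̂h is linear in Y with smoothing matrix L. A numpy sketch on synthetic data (the data and bandwidth are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
X = rng.uniform(0, 1, n)
Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.2, size=n)

def risk_direct(h):
    # Explicit leave-one-out: refit without (X_i, Y_i), then predict at X_i.
    out = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        w = np.exp(-(X[mask] - X[i]) ** 2 / (2 * h ** 2))
        out[i] = np.sum(w * Y[mask]) / np.sum(w)
    return np.mean((Y - out) ** 2)

def risk_shortcut(h):
    # Shortcut: R(h) = (1/n) sum ((Y_i - Yhat_i) / (1 - L_ii))^2.
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * h ** 2))
    L = K / K.sum(axis=1, keepdims=True)        # smoothing matrix
    Yhat = L @ Y
    return np.mean(((Y - Yhat) / (1 - np.diag(L))) ** 2)

print(np.isclose(risk_direct(0.05), risk_shortcut(0.05)))   # the two agree
```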
![Page 102: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/102.jpg)
Summary so far
1. Compute m̂h for each h.
2. Estimate the risk R̂(h).
3. Choose the bandwidth ĥ to minimize R̂(h).
4. Let m̂(x) = m̂ĥ(x).
102
![Page 103: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/103.jpg)
Example
(Figure: left, the data; middle, estimated risk versus bandwidth h; right, the fitted curve.)
103
![Page 104: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/104.jpg)
Mercer Kernels (RKHS)
Choose m to minimize

∑_i (Yi − m(Xi))² + λ penalty(m)
where penalty(m) is a roughness penalty.
λ is a smoothing parameter that controls the amount of smoothing.
How do we construct a penalty that measures roughness? One approach is: Mercer kernels and RKHS = Reproducing Kernel Hilbert Spaces.
104
![Page 105: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/105.jpg)
What is a Mercer Kernel?
A Mercer kernel K(x, y) is symmetric and positive definite:

∫∫ f(x) f(y) K(x, y) dx dy ≥ 0 for all f.

Example: K(x, y) = e^{−‖x−y‖²/2}.

Think of K(x, y) as the similarity between x and y. We will create a set of basis functions based on K.
Fix z and think of K (z, x) as a function of x . That is,
K (z, x) = Kz(x)
is a function of the second argument, with the first argument fixed.
105
![Page 106: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/106.jpg)
Mercer Kernels
Let

F = { f(·) = ∑_{j=1}^k βj K(zj, ·) }.

Define a norm by ‖f‖²K = ∑_j ∑_k βj βk K(zj, zk). ‖f‖K small means f is smooth.

If f = ∑_r αr K(zr, ·) and g = ∑_s βs K(ws, ·), the inner product is

⟨f, g⟩K = ∑_r ∑_s αr βs K(zr, ws).

F is a reproducing kernel Hilbert space (RKHS) because

⟨f, K(x, ·)⟩K = f(x).
106
![Page 107: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/107.jpg)
Nonparametric Regression: Mercer Kernels
Representer Theorem: Let m̂ minimize

J = ∑_{i=1}^n (Yi − m(Xi))² + λ‖m‖²K.

Then

m̂(x) = ∑_{i=1}^n α̂i K(Xi, x)

for some α̂1, . . . , α̂n.

So, we only need to find the coefficients α̂ = (α̂1, . . . , α̂n).
107
![Page 108: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/108.jpg)
Nonparametric Regression: Mercer Kernels
Plug m(x) = ∑_{i=1}^n αi K(Xi, x) into J:

J = ‖Y − Kα‖² + λ αᵀKα

where Kjk = K(Xj, Xk).

Now we find α to minimize J. We get α̂ = (K + λI)−1Y and m̂(x) = ∑_i α̂i K(Xi, x).
The estimator depends on the amount of regularization λ. Again, there is a bias-variance tradeoff. We choose λ by cross-validation. This is like the bandwidth in smoothing kernel regression.
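A numpy sketch of this estimator with a Gaussian Mercer kernel (the data, bandwidth, and λ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
X = rng.uniform(0, 1, n)
Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.2, size=n)

def mercer_fit(X, Y, lam, h=0.1):
    # Gram matrix K_jk = K(X_j, X_k) for a Gaussian Mercer kernel.
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * h ** 2))
    # alpha-hat = (K + lam I)^{-1} Y
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), Y)
    return alpha

def mercer_predict(x0, X, alpha, h=0.1):
    # m-hat(x) = sum_i alpha_i K(X_i, x)
    return np.sum(alpha * np.exp(-(X - x0) ** 2 / (2 * h ** 2)))

alpha = mercer_fit(X, Y, lam=0.5)
pred = mercer_predict(0.25, X, alpha)
print(round(pred, 2))   # should be near sin(pi/2) = 1
```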
108
![Page 109: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/109.jpg)
Smoothing Kernels Versus Mercer Kernels
Smoothing kernels: the bandwidth h controls the amount of smoothing.
Mercer kernels: norm ‖f‖K controls the amount of smoothing.
In practice these two methods give answers that are very similar.
109
![Page 110: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/110.jpg)
Mercer Kernels: Examples
very small lambda
small lambda
medium lambda
large lambda
110
![Page 111: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/111.jpg)
Multiple Regression
Both methods extend easily to the case where X has dimension d > 1. For example, just use

K(x, y) = e^{−‖x−y‖²/2}.

However, this is hard to interpret and is subject to the curse of dimensionality. This means that the statistical performance and the computational complexity degrade as the dimension d increases.
An alternative is to use something less nonparametric, such as an additive model, where we restrict m(x1, . . . , xd) to be of the form

m(x1, . . . , xd) = β0 + ∑_j mj(xj).
111
![Page 112: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/112.jpg)
Additive Models
Model: m(x) = β0 + ∑_{j=1}^d mj(xj).
We can take β̂0 = Ȳ and we will ignore β0 from now on.
We want to minimize
∑_{i=1}^n ( Yi − (m1(Xi1) + · · · + md(Xid)) )²

subject to each mj being smooth.
112
![Page 113: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/113.jpg)
Additive Models
The backfitting algorithm:
• Set m̂j = 0 for all j.
• Iterate until convergence:
• Iterate over j:
• Ri = Yi − ∑_{k≠j} m̂k(Xik)
• m̂j ←− smooth(Xj, R)

Here, smooth(Xj, R) is any one-dimensional nonparametric regression function.
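A numpy sketch of the backfitting loop, using a Gaussian kernel smoother as the one-dimensional smooth(·) and centering each m̂j for identifiability (the data and bandwidth are illustrative assumptions):

```python
import numpy as np

def smooth(x, r, h=0.1):
    """One-dimensional Gaussian kernel smoother of r on x."""
    W = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * h ** 2))
    return (W @ r) / W.sum(axis=1)

def backfit(X, Y, h=0.1, iters=20):
    n, d = X.shape
    Y = Y - Y.mean()                  # absorb beta_0 = Ybar
    fits = np.zeros((n, d))           # m_j evaluated at the data points
    for _ in range(iters):
        for j in range(d):
            R = Y - fits.sum(axis=1) + fits[:, j]    # partial residuals
            fits[:, j] = smooth(X[:, j], R, h)
            fits[:, j] -= fits[:, j].mean()          # center for identifiability
    return fits

rng = np.random.default_rng(8)
n = 300
X = rng.uniform(-1, 1, (n, 2))
Y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=n)

fits = backfit(X, Y)
resid = Y - Y.mean() - fits.sum(axis=1)
print(round(np.var(resid), 3))   # close to the noise variance
```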
R: gam
But what if d is large?
113
![Page 114: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/114.jpg)
Sparse Additive Models (SpAM)
Ravikumar, Lafferty, Liu and Wasserman (JRSS, 2009).
Minimize

J = ∑_i ( Yi − ∑_j mj(Xij) )² + λ ∑_j ‖mj‖

subject to mj smooth. This is basically a nonparametric version of the lasso.
Solution: backfitting + soft thresholding. Iterate:

• Ri = Yi − ∑_{k≠j} m̂k(Xik)
• m̂j = smooth(R, Xj)
• m̂j ← m̂j [1 − λ/ŝj]+   (soft thresholding)
• where ŝ²j = (1/n) ∑_i m̂j(Xij)²
114
![Page 115: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/115.jpg)
Example: Boston Housing Data
Predict house value Y from 10 covariates.
We added 20 irrelevant (random) covariates to test the method.
Y = house value; n = 506, d = 30.
Y = β0 + m1(crime) + m2(tax) + · · · + m30(X30) + ε.
Note that m11 = · · · = m30 = 0.
We choose λ by minimizing the estimated risk.
SpAM yields 6 nonzero functions. It correctly reports that m11 = · · · = m30 = 0.
115
![Page 116: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/116.jpg)
L2 norms of fitted functions versus 1/λ
(Figure: the L2 norms of the fitted component functions plotted against 1/λ.)
116
![Page 117: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/117.jpg)
Estimated Risk Versus λ
(Figure: the Cp estimate of risk plotted against the regularization level.)
117
![Page 118: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/118.jpg)
Example Fits
(Figure: fitted component function m̂1 versus x1, with ℓ1 norm 177.14.)
118
![Page 119: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/119.jpg)
Example Fits
(Figure: fitted component function m̂8 versus x8, with ℓ1 norm 478.29.)
119
![Page 120: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/120.jpg)
Nonparametric Classification
There are many nonparametric classification techniques. Because they are similar to regression, I will not go into great detail.
Two common methods:
Plug-in:

h(x) = +1 if m(x) ≥ 1/2, and −1 if m(x) < 1/2,

where m(x) is from nonparametric regression.
Kernelization:
Often, replacing inner products 〈x, y〉 with a kernel K(x, y) converts a linear classifier into a nonparametric classifier. For example, the kernelized support vector machine.
120
![Page 121: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/121.jpg)
knn Classification
Given x, find the k-nearest neighbors: (X(1),Y(1)), . . . , (X(k),Y(k)).

Set h(x) = 1 if the majority of the Y(i) are 1’s (0 otherwise).

Choose k by cross-validation.

This IS a plug-in classifier. It corresponds to a kernel regression:

m(x) = Σ_i Y_i I(‖x − X_i‖ ≤ h(x)) / Σ_i I(‖x − X_i‖ ≤ h(x))

where h(x) is the distance to the k-th nearest neighbor.
121
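The knn rule above can be sketched in a few lines (a minimal NumPy illustration; `knn_classify` and the toy data are my own, not from the slides):

```python
import numpy as np

def knn_classify(x, X, Y, k=3):
    """Classify x by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X - x, axis=1)        # distances to all training points
    nearest = np.argsort(dists)[:k]              # indices of the k nearest
    return 1 if Y[nearest].mean() >= 0.5 else 0  # majority vote (ties -> 1)

# toy data: two well-separated clusters
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [2.0, 2.0], [2.1, 2.0], [2.0, 2.1]])
Y = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(np.array([0.05, 0.05]), X, Y))  # -> 0
print(knn_classify(np.array([2.05, 2.05]), X, Y))  # -> 1
```

In practice k would be chosen by cross-validation, as the slide says.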
![Page 122: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/122.jpg)
knn Classification
[Plot: knn classification of two-dimensional data.]
122
![Page 123: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/123.jpg)
knn Classification
[Plot: knn classification of two-dimensional data.]
123
![Page 124: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/124.jpg)
Kernelization
Recall the SVM:

α̂ = argmax_{α∈R^n} Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{k=1}^n α_i α_k Y_i Y_k 〈X_i, X_k〉

subject to Σ_{i=1}^n α_i Y_i = 0 and 0 ≤ α_1, . . . , α_n ≤ 1/λ′, where

β̂ = Σ_{i=1}^n α_i Y_i X_i.

Replacing 〈X_i, X_k〉 with K(X_i, X_k) (kernelization) makes it nonparametric. Why?

K(x, y) = 〈φ(x), φ(y)〉 where φ maps x into a very high dimensional space.

The kernelized (nonlinear) SVM is equivalent to a linear SVM in a high-dimensional space. Like adding interactions in regression.

Can kernelize any procedure that depends on inner products.

124
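The kernel-as-inner-product idea can be checked numerically: for the degree-2 polynomial kernel K(x, y) = 〈x, y〉² on R², the explicit feature map φ is known, so K(x, y) = 〈φ(x), φ(y)〉 can be verified directly (a toy illustration of kernelization, not the SVM itself):

```python
import numpy as np

def poly2_kernel(x, y):
    # degree-2 polynomial kernel
    return np.dot(x, y) ** 2

def phi(x):
    # explicit feature map for the degree-2 kernel on R^2:
    # phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(poly2_kernel(x, y), np.dot(phi(x), phi(y)))  # both equal 1.0
```

A classifier that is linear in φ(x) is quadratic in x, which is the "adding interactions" remark made concrete.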
![Page 125: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/125.jpg)
NONPARAMETRIC GRAPHS
125
![Page 126: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/126.jpg)
Regression vs. Graphical Models
assumptions      regression               graphical models
parametric       lasso                    graphical lasso
nonparametric    sparse additive model    nonparanormal
126
![Page 127: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/127.jpg)
The Nonparanormal (Liu, Lafferty, Wasserman, 2009)
A random vector X = (X1, . . . ,Xp)^T has a nonparanormal distribution

X ∼ NPN(µ, Σ, f)

if

Z ≡ f(X) ∼ N(µ, Σ)

where f(X) = (f1(X1), . . . , fp(Xp)).

Joint density:

p_X(x) = (2π)^{−p/2} |Σ|^{−1/2} exp{ −(1/2) (f(x) − µ)^T Σ^{−1} (f(x) − µ) } ∏_{j=1}^p |f′_j(x_j)|

• Semiparametric Gaussian copula
127
![Page 128: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/128.jpg)
Examples
128
![Page 129: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/129.jpg)
The Nonparanormal
• Define hj(x) = Φ−1(Fj(x)) where Fj(x) = P(Xj ≤ x).
• Let Λ be the covariance matrix of Z = h(X). Then

X_j ⊥⊥ X_k | X_rest

if and only if (Λ^{−1})_{jk} = 0.
• Hence we need to:
1 Estimate hj(x) = Φ−1(Fj(x)).
2 Estimate covariance matrix of Z = h(X ) using the glasso.
129
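Step 1 can be sketched as follows (a minimal version using the Winsorized empirical CDF; the exact form of the truncation level `delta` is an assumption of mine, modeled on the one used in LLW 2009):

```python
import numpy as np
from scipy.stats import norm

def npn_transform(X):
    """Estimate h_j(x) = Phi^{-1}(F_j(x)) columnwise via the
    Winsorized empirical CDF (step 1 of the nonparanormal fit)."""
    n, p = X.shape
    delta = 1.0 / (4 * n**0.25 * np.sqrt(np.pi * np.log(n)))  # truncation level
    Z = np.empty_like(X, dtype=float)
    for j in range(p):
        ranks = np.argsort(np.argsort(X[:, j])) + 1            # ranks 1..n
        F = np.clip(ranks / (n + 1.0), delta, 1 - delta)       # Winsorized ECDF
        Z[:, j] = norm.ppf(F)                                  # Gaussianize each margin
    return Z

# skewed margins become (approximately) standard normal
X = np.column_stack([np.exp(np.linspace(-2, 2, 100)),
                     np.linspace(0, 1, 100) ** 3])
Z = npn_transform(X)
```

Step 2 would estimate the covariance matrix of Z and run the glasso on it.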
![Page 130: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/130.jpg)
Properties
• LLW (2009) show that the resulting procedure has the same theoretical properties as the glasso, even with dimension d increasing with n.

• If the nonparanormal is used when the data are actually Normal, little efficiency is lost.
• Implemented in R: huge
130
![Page 131: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/131.jpg)
Gene-Gene Interactions for Arabidopsis thaliana
source: wikipedia.org
Dataset from Affymetrix microarrays, sample size n = 118, p = 40 genes (isoprenoid pathway).
131
![Page 132: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/132.jpg)
Example Results
[Graphs over the 40 genes: nonparanormal estimate (NPN), glasso estimate, and their difference.]
132
![Page 133: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/133.jpg)
Transformations for 3 Genes
[Plots: estimated marginal transformations for genes x5, x8, and x18.]
• These genes have highly non-Normal marginal distributions.
• The graphs are different at these genes.
133
![Page 134: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/134.jpg)
S&P Data (2003–2008): Graphical Lasso
134
![Page 135: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/135.jpg)
S&P Data: Nonparanormal
135
![Page 136: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/136.jpg)
S&P Data: Nonparanormal vs. Glasso
common edges differences
136
![Page 137: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/137.jpg)
Another approach: Forests
• Nonparanormal: Unrestricted graphs, semiparametric.
• We’ll now trade off structural flexibility for greater nonparametricity.
• Forests are graphs with no cycles.
• Force the graph to be a forest but make no distributional assumptions.
137
![Page 138: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/138.jpg)
Forest Densities (Gupta, Lafferty, Liu, Wasserman, Xu, 2011)
A distribution is supported by a forest F with edge set E(F) if

p(x) = ∏_{(i,j)∈E(F)} [ p(x_i, x_j) / (p(x_i) p(x_j)) ] ∏_{k∈V} p(x_k)

• For known marginal densities p(x_i, x_j), the best tree is obtained by minimum weight spanning tree algorithms.
• In high dimensions, a spanning tree will overfit.
• We prune back to a forest.
138
![Page 139: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/139.jpg)
Step 1: Constructing a Full Tree
• Compute kernel density estimates

f̂_{n1}(x_i, x_j) = (1/n1) Σ_{s∈D1} (1/h2²) K((X_i^{(s)} − x_i)/h2) K((X_j^{(s)} − x_j)/h2)

• Estimate mutual informations

Î_{n1}(X_i, X_j) = (1/m²) Σ_{k=1}^m Σ_{ℓ=1}^m f̂_{n1}(x_{ki}, x_{ℓj}) log[ f̂_{n1}(x_{ki}, x_{ℓj}) / (f̂_{n1}(x_{ki}) f̂_{n1}(x_{ℓj})) ]

• Run Kruskal’s algorithm (Chow–Liu) on these edge weights
139
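The Kruskal step can be sketched with a small union–find (my own minimal version: a greedy maximum-weight forest; the mutual-information matrix `W` is a made-up toy, and stopping after `n_edges` edges corresponds to the pruned forest of Step 2):

```python
import numpy as np

def kruskal_forest(W, n_edges=None):
    """Greedy maximum-weight forest: add edges in decreasing weight
    order, skipping any edge that would create a cycle."""
    p = W.shape[0]
    parent = list(range(p))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i
    edges = sorted(((W[i, j], i, j) for i in range(p) for j in range(i + 1, p)),
                   reverse=True)
    forest = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                        # no cycle: accept the edge
            parent[ri] = rj
            forest.append((i, j))
            if n_edges is not None and len(forest) == n_edges:
                break
    return forest

# toy mutual-information matrix on 4 variables
W = np.array([[0, 3, 1, 1],
              [3, 0, 2, 1],
              [1, 2, 0, 1],
              [1, 1, 1, 0]], float)
print(kruskal_forest(W))          # full spanning tree (3 edges)
print(kruskal_forest(W, n_edges=2))  # pruned forest (2 edges)
```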
![Page 140: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/140.jpg)
Step 2: Pruning the Tree
• Heldout risk

R̂_{n2}(f̂_F) = − Σ_{(i,j)∈E} ∫ f̂_{n2}(x_i, x_j) log[ f̂(x_i, x_j) / (f̂(x_i) f̂(x_j)) ] dx_i dx_j

• Selected forest given by

k̂ = argmin_{k∈{0,...,p−1}} R̂_{n2}(f̂_{T^{(k)}_{n1}})

where T^{(k)}_{n1} is the forest obtained after k steps of Kruskal’s algorithm
140
![Page 141: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/141.jpg)
S&P Data: Forest Graph—Oops! Outliers.
141
![Page 142: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/142.jpg)
S&P Data: Forest Graph
142
![Page 143: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/143.jpg)
S&P Data: Forest vs. Nonparanormal
common edges differences
143
![Page 144: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/144.jpg)
Graph-Valued Regression
• (X1,Y1), . . . , (Xn,Yn) where Yi is high-dimensional
• We’ll discuss one particular version: graph-valued regression(Chen, Lafferty, Liu, Wasserman, 2010)
• Let G(x) be the graph for Y based on p(y |x)
• This defines a partition X1, . . . ,Xk where G(x) is constant over each partition element.

• Three methods to find G(x):

I Parametric

I Kernel graph-valued regression

I GO-CART (Graph-Optimized CART)
144
![Page 145: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/145.jpg)
Graph-Valued Regression
multivariate regression (supervised): µ(x) = E(Y | x), Y ∈ R^p, x ∈ R^q

graphical model (unsupervised): Graph(Y) = (V, E), where (j, k) ∉ E ⇐⇒ Y_j ⊥⊥ Y_k | Y_rest

↓

graph-valued regression: Graph(Y | x)
• Gene associations from phenotype (or vice versa)
• Voting patterns from covariates on bills
• Stock interactions given market conditions, news items
145
![Page 146: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/146.jpg)
Method I: Parametric
• Assume that Z = (X, Y) is jointly multivariate Gaussian with

Σ = [ Σ_X   Σ_XY ]
    [ Σ_YX  Σ_Y  ]

• Estimate Σ̂_X, Σ̂_Y, and Σ̂_XY

• Get Ω̂_X by the glasso.

• Σ̂_{Y|X} = Σ̂_Y − Σ̂_YX Ω̂_X Σ̂_XY.

• But the estimated graph does not vary with different values of X.
146
![Page 147: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/147.jpg)
Method II: Kernel Smoothing
• Y | X = x ∼ N(µ(x), Σ(x)) with

µ̂(x) = Σ_{i=1}^n K(‖x − x_i‖/h) y_i / Σ_{i=1}^n K(‖x − x_i‖/h)

Σ̂(x) = Σ_{i=1}^n K(‖x − x_i‖/h) (y_i − µ̂(x)) (y_i − µ̂(x))^T / Σ_{i=1}^n K(‖x − x_i‖/h)

• Apply glasso to Σ̂(x)

• Easy to do but recovering X1, . . . ,Xk requires difficult post-processing.
147
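The kernel-smoothing step can be sketched directly (my own minimal implementation with a Gaussian kernel; it returns µ̂(x) and Σ̂(x), to which the glasso would then be applied):

```python
import numpy as np

def local_cov(x0, X, Y, h):
    """Kernel-weighted mean and covariance of Y given X = x0."""
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2 * h ** 2))  # kernel weights
    w /= w.sum()
    mu = w @ Y                       # local mean mu_hat(x0)
    R = Y - mu
    Sigma = (w[:, None] * R).T @ R   # local covariance Sigma_hat(x0)
    return mu, Sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
Y = rng.normal(size=(500, 3))
mu, Sigma = local_cov(np.zeros(2), X, Y, h=1.0)
print(Sigma.shape)  # (3, 3)
```

Σ̂(x) is symmetric positive semi-definite by construction, so the glasso can be applied to it at each query point x.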
![Page 148: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/148.jpg)
Method III: Partition Estimator
• Run CART but use Gaussian log-likelihood (on held out data) todetermine the splits
• This yields a partition X1, . . . ,Xk (and a corresponding tree)
• Run the glasso within each partition element
148
![Page 149: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/149.jpg)
Simulated Data
[Figure: the dyadic partition tree (splits on X1 and X2 at 0.5, 0.25, 0.75, . . .), the estimated 20-node graphs within each partition element, and held-out risk versus splitting sequence number.]
149
![Page 150: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/150.jpg)
Climate Data
[Figure: estimated graphs over the climate variables CO2, CH4, CO, H2, WET, CLD, VAP, PRE, FRS, DTR, TMN, TMP, TMX, GLO, ETR, ETRN, DIR, UV; panels (a), (b), (c).]
150
![Page 151: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/151.jpg)
CLUSTERINGAND MIXTURES
151
![Page 152: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/152.jpg)
Clustering (Unsupervised Learning)
1 Parametric Clustering (mixtures):
p(x) = Σ_{j=1}^k π_j φ(x; µ_j, Σ_j)

2 Nonparametric Clustering: Estimate the density p(x) nonparametrically, then either:

-Construct a level set tree (cluster tree)

-or find the modes and their basins of attraction.

Others: k-means, hierarchical, ...
152
![Page 153: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/153.jpg)
Mixtures
X1, . . . ,Xn ∼ p(x; θ) = Σ_{j=1}^k π_j φ(x; µ_j, Σ_j)

Find θ̂ to maximize

L(θ) = ∏_{i=1}^n p(X_i; θ)

usually by the EM algorithm.

The likelihood is very non-convex and EM is not good. There are now many papers on using spectral methods and moment methods to estimate high dimensional mixtures.
153
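A minimal 1-d EM sketch (my own toy implementation; the quantile initialization is an arbitrary choice, and, as noted above, EM can get stuck in local optima, so multiple restarts are used in practice):

```python
import numpy as np

def em_gmm1d(x, k=2, iters=100):
    """EM for a k-component 1-d Gaussian mixture."""
    n = len(x)
    pi = np.full(k, 1.0 / k)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread-out initial means
    var = np.full(k, x.var())
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(cluster j | x_i)
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, variances
        nk = r.sum(axis=0)
        pi, mu = nk / n, (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)])
pi, mu, var = em_gmm1d(x)
print(np.sort(mu))  # means near -3 and 3
```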
![Page 154: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/154.jpg)
Mixtures
Latent cluster variable: Z_i = (0,0,1,0,0,0,0,0) means X_i is from cluster 3.

P(X_i in cluster j) = π̂_j φ(X_i; µ̂_j, Σ̂_j) / Σ_k π̂_k φ(X_i; µ̂_k, Σ̂_k).

k is also a parameter, usually chosen by penalized maximum likelihood; for example, BIC: maximize

log L(θ̂) − (d_k / 2) log n
154
![Page 155: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/155.jpg)
Nonparametric: Level Set Trees
Step 1: Estimate the density:

p̂(x) = (1/n) Σ_{i=1}^n (1/h^d) K(‖x − X_i‖ / h)

where K is a kernel (a smooth symmetric density) and h > 0 is a bandwidth.

Step 2: Find the level sets: L_λ = {x : p̂(x) > λ}.

Step 3: Find the connected components of L_λ, i.e. L_λ = C1 ∪ · · · ∪ Cr.
155
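In one dimension the three steps reduce to a few lines (my own sketch; the connected components of the level set are just maximal runs of grid points above λ, and the data/bandwidth are toy choices):

```python
import numpy as np

def kde(grid, X, h):
    """Step 1: Gaussian kernel density estimate on a grid (d = 1)."""
    return (np.exp(-(grid[:, None] - X) ** 2 / (2 * h ** 2)).mean(axis=1)
            / (h * np.sqrt(2 * np.pi)))

def level_set_components(dens, lam):
    """Steps 2-3: count connected components of {x : p_hat(x) > lam}."""
    above = dens > lam
    starts = above & ~np.concatenate([[False], above[:-1]])  # start of each run
    return int(starts.sum())

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.3, 200), rng.normal(2, 0.3, 200)])
grid = np.linspace(-4, 4, 400)
dens = kde(grid, X, h=0.3)
print(level_set_components(dens, lam=0.1))  # two modes -> 2 components
```

Varying λ and tracking how components split gives the level set (cluster) tree.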
![Page 156: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/156.jpg)
Nonparametric: Level Set Trees
156
![Page 157: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/157.jpg)
Nonparametric: Level Set Trees
As we vary λ, the connected components form a tree.
Software: R: denpro or python: DeBaCl written by Brian Kent:
https://github.com/CoAxLab/DeBaCl
The following examples were done by Brian Kent.
Thanks Brian! (He has done this successfully in d = 11,000 dimensions.)
157
![Page 158: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/158.jpg)
Example
158
![Page 159: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/159.jpg)
Example
159
![Page 160: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/160.jpg)
Example
160
![Page 161: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/161.jpg)
Density Clustering II: Mode Clustering
Step I: Find p̂.

Step II: Find the modes m1, . . . ,mk of p̂. Mean-shift algorithm: repeat

x ← Σ_i X_i K(‖x − X_i‖/h) / Σ_i K(‖x − X_i‖/h)

Step III: Find the basin of attraction C_j (Morse complex) of each mode m_j. Specifically, x ∈ C_j if the gradient ascent path starting at x leads to m_j.
161
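The mean-shift update can be sketched directly (a toy NumPy version with a Gaussian kernel; the data and bandwidth are my own choices):

```python
import numpy as np

def mean_shift(x, X, h, iters=200):
    """Iterate the mean-shift update; x converges to a mode of the KDE."""
    for _ in range(iters):
        w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h ** 2))  # kernel weights
        x = (w[:, None] * X).sum(axis=0) / w.sum()                # weighted mean
    return x

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.2, (100, 2)),
               rng.normal(2, 0.2, (100, 2))])
m = mean_shift(np.array([-1.5, -1.5]), X, h=0.5)
print(m.round(2))  # near the mode at (-2, -2)
```

Running the iteration from every data point and grouping points that reach the same mode gives the mode clustering of Step III.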
![Page 162: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/162.jpg)
Example
162
![Page 163: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/163.jpg)
Example
[Plot: two-dimensional data with two modes, labeled a and b.]
163
![Page 164: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/164.jpg)
Example
164
![Page 165: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/165.jpg)
Example
[Plot: data embedded in the (u1, u2) plane.]
165
![Page 166: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/166.jpg)
Example
166
![Page 167: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/167.jpg)
Dimension Reduction
Recall: linear dimension reduction.
MDS (multidimensional scaling).
Y1, . . . ,Yn ∈ Rd .
Let φ be a linear projection onto R2 and let Zi = φ(Yi).
Minimize

Σ_{i,j} ( ‖Y_i − Y_j‖² − ‖Z_i − Z_j‖² )

over all linear projections.

Solution: project onto the first two principal components u1, u2.

167
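The PCA solution can be computed via the SVD of the centered data matrix (a minimal sketch; the toy data are mine):

```python
import numpy as np

def pca_project(Y, k=2):
    """Project onto the top-k principal components (scores)."""
    Yc = Y - Y.mean(axis=0)                          # center the data
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[:k].T                             # n x k scores

rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 5)) @ np.diag([5.0, 3.0, 1.0, 0.5, 0.1])
Z = pca_project(Y, k=2)
print(Z.shape)  # (50, 2)
```

The rows of Vt are the principal directions u1, u2, . . . in order of decreasing variance explained.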
![Page 168: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/168.jpg)
Nonlinear Dimension Reduction
Many methods such as:
-Isomap
-LLE
-diffusion maps
-kernel PCA
+ many others
I will briefly explain diffusion maps.
(Coifman, Lafon, Lee, 2005)

168
![Page 169: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/169.jpg)
Nonlinear Dimension Reduction
Imagine a Markov chain starting at X_i and jumping to another point X_j with probability

p(i, j) ∝ K_h(X_i, X_j).

The transition matrix is

M = D^{−1} K

where K(i, j) = K_h(X_i, X_j) and D is diagonal with D_ii = Σ_j K_h(X_i, X_j).

The diffusion distance measures how easy it is to get from X_i to X_j. This is essentially Euclidean distance in a new space.
We can visualize this by plotting the eigenvectors of M.
169
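The construction can be sketched as follows (my own minimal version; the embedding uses the leading non-trivial eigenvectors of M, and the toy data and bandwidth are assumptions):

```python
import numpy as np

def diffusion_coords(X, h, k=2):
    """Build M = D^{-1} K for a Gaussian kernel and return the top k
    non-trivial eigenvectors of M as embedding coordinates."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    K = np.exp(-d2 / (2 * h ** 2))                       # kernel matrix
    M = K / K.sum(axis=1, keepdims=True)                 # row-stochastic transition matrix
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)
    return vecs.real[:, order[1:k + 1]]                  # skip the constant eigenvector

# two tight clusters: the first coordinate separates them
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
coords = diffusion_coords(X, h=2.0, k=2)
print(coords.shape)  # (4, 2)
```

Points that are easy to reach from one another (many short paths) land close together in the embedding, which is what the diffusion distance captures.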
![Page 170: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/170.jpg)
Nonlinear Dimension Reduction
170
![Page 171: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/171.jpg)
Nonlinear Dimension Reduction
171
![Page 172: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/172.jpg)
Requests
Causal Inference: let’s talk.
Counterexamples to Bayes: let’s talk (after a few drinks).
But see:
http://normaldeviate.wordpress.com/2012/08/28/robins-and-wasserman-respond-to-a-nobel-prize-winner/
http://normaldeviate.wordpress.com/2012/12/08/flat-priors-in-flatland-stones-paradox/
Also:
Owhadi, Scovel and Sullivan (2013). arXiv:1304.6772

172
![Page 173: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/173.jpg)
What I Didn’t Cover
1 Bayes
2 PCA, sparse PCA
3 Nonlinear dimension reduction
4 Directed graphs
5 Convex optimization
6 Semi-supervised methods
7 Active learning
8 Sequential learning
9 ...
173
![Page 174: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh](https://reader035.vdocuments.us/reader035/viewer/2022071301/60948038f275413b1f58f452/html5/thumbnails/174.jpg)
THE END
174