Ridge regression - Vrije Universiteit Amsterdam (wvanwie/Courses/HighdimensionalData..., 2018-09-10)
TRANSCRIPT
Ridge regression
Wessel van Wieringen ([email protected])
Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University, Amsterdam, The Netherlands
Preliminary
Assumption: The data are zero-centered variate-wise. Hence, the response and the expression data of each gene are centered around zero. That is, Yᵢ is replaced by Yᵢ − Ȳ and each Xᵢⱼ by Xᵢⱼ − X̄ⱼ, where Ȳ and X̄ⱼ denote the corresponding sample means.
Scribed lecture: Van Wieringen, W.N. (2018), Lecture notes on ridge regression, arXiv:1509.09169.
Problem
Collinearity: Two (or more) covariates are highly linearly related.

Consequence: High standard errors of the estimates.
The regression equation is: Y = 0.126 + 0.437 X1 + 1.09 X2 + 0.937 X3

| Predictor | Coef    | SE Coef | T    | P     |
|-----------|---------|---------|------|-------|
| Constant  | 0.1257  | 0.4565  | 0.28 | 0.784 |
| X1        | 0.43731 | 0.05550 | 7.88 | 0.000 |
| X2        | 1.0871  | 0.3399  | 3.20 | 0.003 |
| X3        | 0.9373  | 0.6865  | 1.37 | 0.179 |
[Figure: pairwise scatterplots of X1, X2 and X3]
Problem
Supercollinearity: Two (or more) covariates are fully linearly dependent.

Example: a design matrix whose columns are dependent: C1 = C2 + C3.

Consequence: XᵀX is singular. A square matrix with no inverse is called singular. A matrix A is singular iff det(A) = 0.
Problem
Supercollinearity, example: as det(A) = 0, A is singular and its inverse is undefined. det(A) equals the product of the eigenvalues θⱼ of A: the matrix A is singular if any eigenvalue of A is zero. To see this, consider the spectral decomposition of A:

A = Σⱼ θⱼ vⱼ vⱼᵀ,

where vⱼ is the eigenvector belonging to θⱼ.
Problem
Supercollinearity
The matrix A = [[1, 2], [2, 4]] has eigenvalues 5 and 0. The inverse of A via the spectral decomposition, A⁻¹ = Σⱼ θⱼ⁻¹ vⱼ vⱼᵀ, is then undefined: the term with θⱼ = 0 cannot be formed. Even R says no:

    > A <- matrix(c(1, 2, 2, 4), ncol = 2)
    > Ainv <- solve(A)
    Error in solve.default(A) : Lapack routine dgesv: system is exactly singular
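The same check can be done in numpy (a sketch mirroring the slide's R session; numpy raises `LinAlgError` where R's `solve()` reports an exactly singular system):

```python
import numpy as np

# The 2x2 matrix from the slide: the second column is twice the first.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

# Eigenvalues are 5 and 0; a zero eigenvalue means det(A) = 0, so A is singular.
eigvals = np.sort(np.linalg.eigvalsh(A))   # symmetric matrix -> eigvalsh
det_A = np.linalg.det(A)

# Attempting to invert raises LinAlgError, numpy's analogue of R's solve() error.
try:
    np.linalg.inv(A)
    invertible = True
except np.linalg.LinAlgError:
    invertible = False
```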
Problem
Supercollinearity consequence: XᵀX is singular. So? Recall the ML regression estimator (and its variance):

β̂ = (XᵀX)⁻¹XᵀY,  Var(β̂) = σ²(XᵀX)⁻¹.

These are only defined if (XᵀX)⁻¹ exists. Supercollinearity → the ML regression estimator is undefined. Supercollinearity occurs high-dimensionally, i.e. when the number of covariates exceeds the number of samples (p > n).
Ridge regression
Ridge regression
Problem: In case of a singular XᵀX its inverse is not defined. Consequently, the OLS estimator

β̂ = (XᵀX)⁻¹XᵀY

does not exist. This happens in high-dimensional data.

Solution: An ad hoc solution adds λI_p (with λ > 0) to XᵀX, leading to:

β̂(λ) = (XᵀX + λI_p)⁻¹XᵀY.

This is called the ridge estimator.
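The closed-form ridge estimator above is easy to implement; a minimal numpy sketch on hypothetical toy data (the data and coefficient values below are for illustration only):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator (X^T X + lambda I)^{-1} X^T y (data assumed centered)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical toy data.
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(20)

beta_ols = ridge(X, y, 0.0)      # lambda = 0 recovers OLS (here X^T X is invertible)
beta_ridge = ridge(X, y, 5.0)    # lambda > 0 shrinks the estimate towards zero
```

Note the shrinkage: the norm of the ridge estimate is strictly smaller than that of the OLS estimate for any λ > 0.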
Ridge regression
Example: let X be a design matrix for which XᵀX has eigenvalues equal to 10, 6 and 0; as one eigenvalue is zero, XᵀX is singular. With the "ridge-fix", taking e.g. λ = 1, XᵀX + I has eigenvalues equal to 11, 7 and 1, and is thus invertible.
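The eigenvalue shift is easy to check numerically; a sketch using a diagonal stand-in with the slide's eigenvalues (the slide's actual design matrix is not reproduced here):

```python
import numpy as np

# Illustrative symmetric matrix with eigenvalues 10, 6 and 0 (a diagonal stand-in).
G = np.diag([10.0, 6.0, 0.0])
lam = 1.0

before = np.sort(np.linalg.eigvalsh(G))
after = np.sort(np.linalg.eigvalsh(G + lam * np.eye(3)))
# Adding lambda*I shifts every eigenvalue up by lambda: {0, 6, 10} -> {1, 7, 11}.
```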
Ridge regression
Example (continued): Suppose now that a response vector Y is also given. For every choice of λ, we have a ridge estimate of the coefficients of the regression equation: β̂(λ) = (XᵀX + λI_p)⁻¹XᵀY.

Question: Does the ridge estimate always tend to zero as λ tends to infinity?

[Figure: ridge regularization path]
Ridge regression
Ridge vs. OLS estimator: The columns of the matrix X are orthonormal if the columns are orthogonal and have unit length. Orthonormality of the design matrix implies XᵀX = I_p. Then, there is a simple relation between the ridge estimator and the OLS estimator:

β̂(λ) = (1 + λ)⁻¹ β̂.
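A quick numerical check of this relation, building an orthonormal design via a QR decomposition (random data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
# QR of a random matrix gives X with orthonormal columns, so X^T X = I_p.
X, _ = np.linalg.qr(rng.standard_normal((30, 4)))
y = rng.standard_normal(30)

beta_ols = X.T @ y   # OLS under orthonormality: (X^T X)^{-1} X^T y = X^T y
lam = 2.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
# Under orthonormality, ridge is OLS scaled by 1/(1 + lambda).
```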
Ridge regression
Why does the ad hoc fix work? Study its effect from the perspective of singular values. Use the singular value decomposition of the matrix X:

X = U D Vᵀ,

with U and V orthonormal and D containing the singular values dⱼ, to rewrite:

XᵀX = V DᵀD Vᵀ  and  XᵀX + λI_p = V (DᵀD + λI_p) Vᵀ,

which makes the role of the singular values explicit.
Ridge regression
Why does the ad hoc fix work? Combine the two results and write the estimators in terms of the SVD to obtain:

β̂ = V (DᵀD)⁻¹ Dᵀ Uᵀ Y (OLS)  vs.  β̂(λ) = V (DᵀD + λI_p)⁻¹ Dᵀ Uᵀ Y (ridge).

Component-wise, the OLS estimator involves the factor dⱼ⁻¹, whereas ridge involves dⱼ/(dⱼ² + λ). Thus, the ridge estimator shrinks along the singular values of X, and the ridge factor is well-defined (namely zero) even for dⱼ = 0.

Return to the problem of supercollinearity: XᵀX is singular but XᵀX + λI_p is not. Its inverse is V (DᵀD + λI_p)⁻¹ Vᵀ.
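These SVD identities can be verified numerically on random data (an illustrative sketch with n > p, so the thin SVD's V is square):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((15, 4))
lam = 3.0

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) V^T
V = Vt.T

# Inverse of X^T X + lambda I via the spectral form V (D^2 + lambda I)^{-1} V^T.
inv_spectral = V @ np.diag(1.0 / (d**2 + lam)) @ Vt
inv_direct = np.linalg.inv(X.T @ X + lam * np.eye(4))

# Ridge via the shrunken singular-value factors d_j / (d_j^2 + lambda).
y = rng.standard_normal(15)
beta_svd = V @ np.diag(d / (d**2 + lam)) @ U.T @ y
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
```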
Ridge regression
Contrast with principal component regression (PCR): let Z = X V_k contain the first k principal components. PC regression then fits Y = Zγ + ε; the least squares estimate gives γ̂ = (ZᵀZ)⁻¹ZᵀY. Translated to the linear regression model this gives β̂ = V_k γ̂: components with j ≤ k enter with factor 1/dⱼ, while those with j > k are dropped. I.e., PCR thresholds the singular values whereas ridge shrinks them: thresholding vs. shrinkage.
Moments of the ridge estimator
Moments
[Figures: OLS and ridge estimates; bias of the ridge estimates]

The expectation of the ridge estimator:

E[β̂(λ)] = (XᵀX + λI_p)⁻¹ XᵀX β,

which differs from β for any λ > 0: the ridge estimator is biased.
Moments

Define W_λ = (XᵀX + λI_p)⁻¹XᵀX. Then β̂(λ) = W_λ β̂, and the variance of the ridge estimator becomes:

Var[β̂(λ)] = σ² W_λ (XᵀX)⁻¹ W_λᵀ.

Consequence: Var[β̂] − Var[β̂(λ)] is positive semi-definite; the ridge estimator has the smaller variance. Translated to the level sets of the distribution of both estimators:

[Figure: level sets of the OLS (left) and ridge (right) estimator distributions]
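The positive semi-definiteness of the variance difference can be checked numerically; a sketch on a random design (illustrative values for λ and σ²):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((25, 3))
lam, sigma2 = 2.0, 1.0

XtX = X.T @ X
W = np.linalg.solve(XtX + lam * np.eye(3), XtX)    # W_lambda

var_ols = sigma2 * np.linalg.inv(XtX)
var_ridge = sigma2 * W @ np.linalg.inv(XtX) @ W.T

# Var(OLS) - Var(ridge) should be positive semi-definite: all eigenvalues >= 0.
diff_eigs = np.linalg.eigvalsh(var_ols - var_ridge)
```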
Moments
Question: Prove that the ellipsoidal level sets of the distribution of the ridge estimator are indeed smaller than those of the OLS estimator.

[Figure: level sets for λ = 0 and λ > 0]

Hints:
→ Express the determinant in terms of eigenvalues.
→ Write the ridge variance as a transformation of the OLS variance.
Moments
Ridge vs. OLS estimator: In the orthonormal case:

Var[β̂] = σ² I_p  and  Var[β̂(λ)] = σ² (1 + λ)⁻² I_p.

As the penalty parameter is non-negative, the former exceeds the latter.
Moments
Distribution: The distribution of the ridge estimator is:

β̂(λ) ~ N( W_λ β, σ² W_λ (XᵀX)⁻¹ W_λᵀ ).

Question: Why can we not use this distribution for testing?

Question: Why is the estimator normally distributed?
Mean squared error
Previous motivation for the ridge estimator:
→ an ad hoc solution to (super)collinearity.

An alternative motivation comes from studying the mean squared error (MSE) of the ridge regression estimator. In general, for any estimator θ̂ of a parameter μ:

MSE(θ̂) = E[(θ̂ − μ)²] = Var(θ̂) + [E(θ̂) − μ]².

Hence, the MSE is a measure of the quality of the estimator.
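The bias-variance decomposition of the MSE can be illustrated by Monte Carlo; a sketch with a hypothetical shrinkage estimator of a scalar mean (all numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 2.0, 1.0
n_rep, n = 200000, 10

# A hypothetical shrinkage estimator: shrink the sample mean towards zero.
samples = rng.normal(mu, sigma, size=(n_rep, n))
est = 0.8 * samples.mean(axis=1)          # biased: E[est] = 0.8 * mu = 1.6

mse = np.mean((est - mu) ** 2)
variance = est.var()                       # Monte Carlo variance
bias_sq = (est.mean() - mu) ** 2           # Monte Carlo squared bias
# MSE = variance + bias^2 (an exact algebraic identity for these sample moments).
```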
Mean squared error
Question: So far we have seen that:
→ the bias increases with λ, and
→ the variance decreases with λ.

What happens to the MSE when λ increases?

[Figure: variance and bias of the ridge estimates as functions of λ]
Mean squared error
The mean squared error of the ridge estimator is then:

MSE(β̂(λ)) = σ² tr[ W_λ (XᵀX)⁻¹ W_λᵀ ] + βᵀ (W_λ − I_p)ᵀ (W_λ − I_p) β,

i.e. the sum of the variances of the ridge estimator plus its "squared bias".
Mean squared error
Ridge vs. OLS estimator: In the orthonormal case, i.e. XᵀX = I_p:

MSE(β̂) = p σ²  and  MSE(β̂(λ)) = (1 + λ)⁻² (p σ² + λ² βᵀβ).

The latter achieves its minimum at λ = p σ² / (βᵀβ): the ratio between the error variance and the 'signal'.

Question: What is the practical relevance of this result?
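The orthonormal-case MSE formula and its minimizer are easy to check numerically (β, p and σ² below are illustrative values, not from the slides):

```python
import numpy as np

# Orthonormal case: MSE(lambda) = (p*sigma^2 + lambda^2 * b'b) / (1 + lambda)^2.
p, sigma2 = 5, 2.0
beta = np.array([1.0, -1.0, 0.5, 2.0, 0.0])
btb = beta @ beta

def mse(lam):
    return (p * sigma2 + lam**2 * btb) / (1 + lam)**2

lam_opt = p * sigma2 / btb            # error variance over 'signal'

# Locate the minimum on a fine grid and compare with the analytic minimizer.
grid = np.linspace(0.01, 20, 20000)
lam_grid = grid[np.argmin(mse(grid))]
```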
Mean squared error
For small λ the variance dominates the MSE; for large λ the bias does.
For λ < 0.6, MSE(λ) < MSE(0) and the ridge estimator outperforms the OLS estimator.
Mean squared error

Theorem: There exists λ > 0 such that MSE(λ) < MSE(0).

Problem: The optimal λ depends on the unknown quantities β and σ².

Practice: Choose λ in a data-driven manner by:
→ cross-validation,
→ an information criterion,
→ empirical Bayes.
Constrained estimation
Constrained estimation
The ad hoc ridge estimator minimizes the loss function:

‖Y − Xβ‖₂² + λ‖β‖₂²,

the sum of squares plus the ridge penalty, with penalty parameter λ ≥ 0.

Take the derivative:

∂/∂β [ ‖Y − Xβ‖₂² + λ‖β‖₂² ] = −2Xᵀ(Y − Xβ) + 2λβ.

Equate the derivative to zero and solve:

β̂(λ) = (XᵀX + λI_p)⁻¹XᵀY.
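The derivation can be sanity-checked numerically: at the closed-form solution the gradient vanishes and any perturbation increases the (strictly convex) penalized loss. A sketch on random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((20, 4))
y = rng.standard_normal(20)
lam = 1.5

beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

def loss(b):
    return np.sum((y - X @ b) ** 2) + lam * np.sum(b ** 2)

# Gradient -2 X^T (y - X b) + 2 lam b vanishes at the ridge solution...
grad = -2 * X.T @ (y - X @ beta_hat) + 2 * lam * beta_hat

# ...and random perturbations only increase the strictly convex loss.
perturbed = [loss(beta_hat + 0.1 * rng.standard_normal(4)) for _ in range(20)]
```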
Constrained estimation

Convexity: A set S is convex if for all s₁, s₂ ∈ S and α ∈ [0, 1]: αs₁ + (1 − α)s₂ ∈ S. It is strictly convex if αs₁ + (1 − α)s₂ lies in the interior of S for all s₁ ≠ s₂ and α ∈ (0, 1).

[Figure: examples of convex and non-convex sets]
Constrained estimation
Convexity: A map f : S → ℝ, with S convex, is convex if for all s₁, s₂ ∈ S and α ∈ [0, 1]: f(αs₁ + (1 − α)s₂) ≤ αf(s₁) + (1 − α)f(s₂). A function is convex ↔ the region above the function is convex. Strictly convex: the inequality is strict for all s₁ ≠ s₂ and α ∈ (0, 1).

[Figure: examples of convex and non-convex functions]
Constrained estimation

Convexity: The sum of squares, ‖Y − Xβ‖₂², is convex in β. The penalty, λ‖β‖₂², is strictly convex in β (for λ > 0). Consequently, their sum is strictly convex.

Strict convexity ensures the existence of a unique minimizer of the penalized sum of squares.

[Figure: unpenalized vs. penalized loss; the red line/dot represents the optimum (minimum) of the loss function]
Ridge regression as constrained estimation: The method of Lagrange multipliers enables the reformulation of the penalized least squares problem:

min_β ‖Y − Xβ‖₂² + λ‖β‖₂²

into a constrained estimation problem:

min_β ‖Y − Xβ‖₂²  subject to  ‖β‖₂² ≤ c²(λ).

An explicit expression for the constraint bound c²(λ) is available.
Constrained estimation
Constrained estimation

[Figure: contours of the residual sum of squares over the (β₁, β₂)-plane; the OLS estimate sits at the center of the contours, while the ridge estimate is the point where the lowest attainable contour touches the circular constraint region of squared radius c²(λ)]
Constrained estimation
Question: How does the parameter constraint domain change with λ?
Over-fitting
Simple example: Consider 9 covariates with data drawn from the standard normal distribution. A response links to the covariates by a linear regression model Y = Xβ + ε, where ε denotes the noise. Only ten observations are drawn from the model. Hence, n = 10 and p = 9.
Over-fitting
Simple example: Fit the linear regression model to the data. The obtained estimates:

b = (0.049, 2.386, 5.528, 6.243, 4.819, 0.760, 3.345, 4.748, 2.136)

Large estimate values → an indication of overfitting.

[Figure: estimate vs. fit]

A simple remedy constrains the parameter estimator: another motivation for the ridge estimator!
A Bayesian interpretation
A Bayesian interpretation
Ridge regression is closely related to Bayesian linear regression.
Bayesian linear regression assumes the parameters β and σ² to be random variables. The conjugate priors for the parameters are a zero-centered Gaussian on β and an inverse Gamma distribution on σ².
A Bayesian interpretation

The posterior distribution of β and σ² can then be written in closed form. Then, clearly, the posterior mean of β coincides with the ridge estimator:

E(β | Y) = (XᵀX + λI_p)⁻¹XᵀY,

with λ determined by the precision of the prior.
Hence, the ridge regression estimator can be viewed as a Bayesian estimate of β when imposing a Gaussian prior on it.
A Bayesian interpretation
The penalty parameter relates to the prior:
→ a small λ corresponds to a wide/vague prior,
→ a large λ yields a narrow/informative one.
Efficient computation
Efficient computation
In the high-dimensional setting the number of covariates p is large compared to the number of samples n. In a microarray experiment p = 40000 and n = 100 is not uncommon.
If we wish to perform ridge regression in this context, we need to evaluate the expression:

β̂(λ) = (XᵀX + λI_p)⁻¹XᵀY,

which involves the inverse of a (p × p)-dimensional matrix. For p = 40000 this is infeasible on most computers. However, there is a workaround.
Efficient computation
Revisit the singular value decomposition of X:

X = U D Vᵀ,

and write R = U D. As both U and D are (n × n)-dimensional matrices, so is R. Consequently, X is now decomposed as X = R Vᵀ. The ridge estimator can now be rewritten as:

β̂(λ) = V (RᵀR + λI_n)⁻¹ Rᵀ Y,

which involves the inverse of an (n × n)-dimensional matrix only.
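The equivalence of the direct (p × p) computation and the (n × n) shortcut can be demonstrated on a small high-dimensional example (illustrative dimensions, far smaller than a real microarray):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, lam = 10, 200, 1.0        # high-dimensional: p >> n
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Direct route: invert a (p x p) matrix.
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Shortcut: X = U D V^T with U, D of size (n x n); write R = U D, so X = R V^T.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
R = U * d                        # equivalent to U @ np.diag(d)
beta_svd = Vt.T @ np.linalg.solve(R.T @ R + lam * np.eye(n), R.T @ y)
```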
Efficient computation
Hence, the reformulated ridge estimator involves the inversion of an (n × n)-dimensional matrix. With n = 100, this is feasible on any standard computer.

Tibshirani and Hastie (2004) point out that the number of computation operations reduces from O(p³) to O(pn²). In addition, they point out that this computational short-cut can be used in combination with other loss functions (GLM).
Degrees of freedom
Degrees of freedom
The degrees of freedom of ridge regression are now calculated. Recall from ordinary regression that:

Ŷ = X(XᵀX)⁻¹XᵀY = HY,

where H is the hat matrix. The degrees of freedom of ordinary regression equal tr(H). In particular, if X is of full rank, i.e. rank(X) = p, then tr(H) = p.
Degrees of freedom
By analogy, the ridge version of the hat matrix is:

H(λ) = X(XᵀX + λI_p)⁻¹Xᵀ.

Continuing this analogy, the degrees of freedom of ridge regression are given by the trace of this hat matrix:

df(λ) = tr[H(λ)] = Σⱼ dⱼ² / (dⱼ² + λ).

The d.o.f. are monotone decreasing in λ. In particular, df(λ) → 0 as λ → ∞.
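Both the singular-value formula for the trace and its monotone decrease are easy to verify; a sketch on a random full-rank design:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((30, 5))
d = np.linalg.svd(X, compute_uv=False)      # singular values of X

def df(lam):
    """Degrees of freedom tr[H(lambda)] of ridge regression."""
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(5), X.T)
    return np.trace(H)

# tr H(lambda) = sum_j d_j^2 / (d_j^2 + lambda); equals p at lambda = 0
# and decreases monotonically in lambda.
```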
Simulation I: variance of covariates
Simulation I
Effect of ridge estimation: Consider a set of 50 genes. Their expression levels follow a multivariate normal law with mean zero and a diagonal covariance matrix whose diagonal entries (the variances) differ among the genes. Together they regulate a 51st gene through a linear regression model with equal regression coefficients. Hence, the 50 genes contribute equally.
Simulation I
Effect of ridge estimation: Ridge regularization paths for the coefficients of the 50 genes.
Ridge regression prefers (i.e. shrinks less) coefficient estimates of covariates with larger variance.
Simulation I
Some intuition: Rewrite the ridge regression estimator as

β̂(λ) = [I_p + λ(XᵀX)⁻¹]⁻¹ β̂.

Plug in the employed (diagonal) covariance matrix, using XᵀX ≈ nΣ:

β̂ⱼ(λ) ≈ (1 + λ/(n σⱼ²))⁻¹ β̂ⱼ.

Hence, larger variances mean slower shrinkage.
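This can be seen in a miniature simulation; a sketch of the simulation's setting with p = 2 instead of 50 (the variances, sample size and penalty are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500
# Two covariates with equal true coefficients but very different variances.
X = rng.standard_normal((n, 2)) * np.array([3.0, 0.5])   # sd 3 vs. sd 0.5
beta = np.array([1.0, 1.0])
y = X @ beta + rng.standard_normal(n)

lam = 200.0
b = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
# The high-variance covariate's coefficient is shrunk less towards zero.
```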
Simulation I
Geometrically
Simulation I

Consider the ridge penalty: λ‖β‖₂² = λ Σⱼ βⱼ². Each regression coefficient is penalized in the same way.

Considerations:
→ Some form of standardization seems reasonable, at least to ensure things are penalized comparably.
→ After preprocessing, expression data of genes are often assumed to have a comparable scale.
→ Standardization affects the estimates.
Simulation II: effect of collinearity
Simulation II
Effect of ridge estimation: Consider a set of 50 genes. Their expression levels follow a multivariate normal law with mean zero and a covariance matrix with unit variances and positive correlation among the genes, where ρ denotes the pairwise correlation. Together they regulate a 51st gene through a linear regression model with equal regression coefficients. Hence, the 50 genes contribute equally.
Simulation II
Effect of ridge estimation: Ridge regularization paths for the coefficients of the 50 genes.
Ridge regression prefers (i.e. shrinks less) coefficient estimates of strongly positively correlated covariates.
Simulation II
Some intuition: Let p = 2 and write U = X1 + X2 and V = X1 − X2. Then X1 = (U + V)/2 and X2 = (U − V)/2, so that:

X1β1 + X2β2 = ½U(β1 + β2) + ½V(β1 − β2).

Write γa = β1 + β2 and γb = β1 − β2. For large λ, the ridge estimator of each transformed coefficient shrinks at a rate governed by the variance of the corresponding covariate. Now use Var(U) ≫ Var(V) due to strong positive collinearity: the sum γa is shrunk far less than the difference γb, pulling β̂1(λ) and β̂2(λ) towards a common value.
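A small simulation makes the pulling-together visible; a sketch with two strongly correlated covariates and deliberately unequal true coefficients (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 400
z = rng.standard_normal(n)
# Strongly positively correlated pair: shared component plus small noise.
x1 = z + 0.1 * rng.standard_normal(n)
x2 = z + 0.1 * rng.standard_normal(n)
X = np.column_stack([x1, x2])
y = X @ np.array([2.0, 0.0]) + rng.standard_normal(n)   # unequal true coefficients

def ridge(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

b_small, b_large = ridge(1.0), ridge(2000.0)
gap_small = abs(b_small[0] - b_small[1])
gap_large = abs(b_large[0] - b_large[1])
# Heavy penalization pulls the estimates of the correlated pair together.
```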
Simulation II
Geometrically
Cross-validation
Cross-validation

Methods for choosing the penalty parameter

1. Cross-validation
• Estimation of the performance of a model, reflected in the error (often operationalized as log-likelihood or MSE).
• The data used to construct the model is also used to estimate the error.

2. Information criteria
Cross-validation

Penalty selection: [Diagram: split the data into a train set and a test set; build the model on the train set, evaluate its performance on the test set; repeat over a grid of λ to find the optimal value.]

→ K-fold
→ LOOCV
Cross-validation

• K-fold cross-validation divides the data set randomly into K equal (or almost equal) sized subsets 1, …, K.
• The model is built on training set −k, i.e. all subsets except subset k.
• The model is applied to test set k to estimate the error.
• The average of these error estimates gives the error of the model.
• n-fold or leave-one-out cross-validation (LOOCV) sets K = n, using all but one sample to build each model.
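The steps above can be sketched for ridge penalty selection (a minimal numpy implementation on illustrative random data; the λ grid is a hypothetical choice):

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 60, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_cv_error(lam, K=5):
    """Average squared prediction error over K folds for a given penalty."""
    idx = np.arange(n)
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(idx, test)          # training set -k
        b = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[test] - X[test] @ b) ** 2))
    return np.mean(errs)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_errors = [kfold_cv_error(l) for l in grid]
lam_opt = grid[int(np.argmin(cv_errors))]        # data-driven choice of lambda
```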
LOOCV
Cross-validation
The LOOCV loss can be calculated without resampling:
where diagonal and
Hence, instead of n evaluations of a pxp dimensional inverse only a single one is needed: a considerable gain.
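This shortcut is easily verified numerically. A sketch (ours, on simulated data) compares the hat-matrix formula with brute-force leave-one-out refitting:

```python
import numpy as np

def loocv_shortcut(X, y, lam):
    """LOOCV loss via the hat matrix H = X (X'X + lam*I)^{-1} X'."""
    p = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)

def loocv_brute_force(X, y, lam):
    """LOOCV loss by refitting n times, leaving out one sample each."""
    n, p = X.shape
    errs = []
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        beta = np.linalg.solve(X[keep].T @ X[keep] + lam * np.eye(p),
                               X[keep].T @ y[keep])
        errs.append((y[i] - X[i] @ beta) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 8))
y = X @ rng.standard_normal(8) + rng.standard_normal(40)
```

Both routes give the same loss; only their cost differs.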
Generalized cross-validation
Cross-validation
Diagonal elements of the hat matrix may assume values close or equal to one. Consequently, the LOOCV loss may become unstable.
This is resolved in the generalized cross-validation (GCV) criterion, which replaces each Hᵢᵢ(λ) by the average tr[H(λ)]/n:

GCV(λ) = n⁻¹ Σᵢ [Yᵢ − Xᵢ β̂(λ)]² / {1 − tr[H(λ)]/n}².

The GCV too avoids the re-evaluation of the regression parameter estimate for each training set.
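A minimal numpy sketch of the GCV criterion (ours; data and grid simulated):

```python
import numpy as np

def gcv(X, y, lam):
    """GCV(lam): mean squared residual divided by (1 - tr(H)/n)^2."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y
    return (resid @ resid / n) / (1.0 - np.trace(H) / n) ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 8))
y = X @ rng.standard_normal(8) + rng.standard_normal(40)

grid = np.logspace(-2, 3, 25)
lam_gcv = grid[np.argmin([gcv(X, y, lam) for lam in grid])]
```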
Example
Regulation of mRNA by microRNA
microRNAs
Recently, a new class of RNA was discovered: microRNA (mir). Mirs are non-coding RNAs of approx. 22 nucleotides. Like mRNAs, mirs are encoded in and transcribed from the DNA.
Mirs down-regulate gene expression by either of two post-transcriptional mechanisms: mRNA cleavage or translational repression. Both depend on the degree of complementarity between the mir and the target.
A single mir can bind to and regulate many different mRNA targets and, conversely, several mirs can bind to and cooperatively control a single mRNA target.
Example: microRNA-mRNA regulation
Aim
Model microRNA regulation of mRNA expression levels.
Example: mir-mRNA regulation
Data
→ 90 prostate cancers
→ expression of 735 mirs
→ mRNA expression of the MCM7 gene

Motivation
→ MCM7 is involved in prostate cancer.
→ mRNA levels of MCM7 are reportedly affected by mirs.
Not part of the objective: feature selection ≈ understanding the basis of this prediction by identifying features (mirs) that characterize the mRNA expression.
Analysis
Find: mRNA expr. = f(mir expression)
                 = β0 + β1·mir1 + β2·mir2 + … + βp·mirp + error
However, p > n: ridge regression. Having found the optimal λ, we obtain the ridge estimates of the coefficients: b_j(λ).
With these estimates we calculate the linear predictor and thereby the predicted mRNA expression:
pred. mRNA expr. = b0 + b1(λ)·mir1 + … + bp(λ)·mirp.
Compare observed and predicted mRNA expression.
Example: microRNA-mRNA regulation
Example: microRNA-mRNA regulation
#(β̂ < 0) = 394 (out of 735)
Penalty parameter choice
ρ_sp = 0.629, R² = 0.449
Observed vs. predicted mRNA expression
Distribution of the ridge estimates β̂
Question: explain axes’ scale difference in the RHS plot.
Example: microRNA-mRNA regulation
Biological dogma
MicroRNAs down-regulate mRNA levels.
The dogma suggests that negative regression coefficients prevail.
The penalized package allows for the specification of the sign of the regression parameters. No explicit expression for the ridge estimator then exists: the loss function is optimized numerically.
Re-analysis of the data with negativity constraints on the regression parameters.
Example: microRNA-mRNA regulation
#(β̂ < 0) = 401
#(β̂ = 0) = 334
ρ_sp = 0.679, R² = 0.524
Histograms of ridge estimates.
Observed vs. predicted mRNA expression.
Example: microRNA-mRNA regulation
The parameter constraint implies feature selection. Are the microRNAs identified to down-regulate MCM7 expression levels also reported by prediction tools?
Contingency table:
                    prediction tool
ridge regression    no mir2MCM7    mir2MCM7
β̂ = 0                  323            11
β̂ < 0                  390            11
Chi-square test
Pearson's Chi-squared test with Yates' continuity correction

data: table(nonzeroBetas, nonzeroPred)
X-squared = 0.0478, df = 1, p-value = 0.827
Generalized ridge regression
Generalized ridge regression
A generalized ridge regression estimator minimizes a weighted least squares criterion augmented with a generalized ridge penalty:

(Y − Xβ)ᵀ W (Y − Xβ) + (β − β₀)ᵀ Δ (β − β₀),

with:
→ weight matrix W,
→ penalty parameter (matrix) Δ,
→ non-random target β₀.

Set:
→ W = I_nn,
→ Δ = λ I_pp,
→ β₀ = 0_p,
to obtain the original ridge regression estimator.
The generalized penalty is a quadratic form. It implies a non-zero-centered, ellipsoidal parameter constraint.
Generalized ridge regression
Generalized ridge regression
Differentiate the loss criterion w.r.t. β, equate to zero, and obtain the estimating equation:

Xᵀ W (Y − Xβ) = Δ (β − β₀).

Solving this for β yields the generalized ridge estimator:

β̂ = (Xᵀ W X + Δ)⁻¹ (Xᵀ W Y + Δ β₀).

Clearly, the generalized estimator reduces to the regular ridge regression estimator when simultaneously:
→ W = I_nn,
→ Δ = λ I_pp,
→ β₀ = 0_p.
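The estimator is a one-liner in numpy, and the stated reduction is easily checked (our sketch; names are ours):

```python
import numpy as np

def gen_ridge(X, y, W, Delta, beta0):
    """Generalized ridge: (X'WX + Delta)^{-1} (X'Wy + Delta beta0)."""
    return np.linalg.solve(X.T @ W @ X + Delta, X.T @ W @ y + Delta @ beta0)

rng = np.random.default_rng(0)
n, p, lam = 30, 5, 2.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# W = I, Delta = lam*I, beta0 = 0 recovers the ordinary ridge estimator
b_gen = gen_ridge(X, y, np.eye(n), lam * np.eye(p), np.zeros(p))
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```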
Regular and generalized regularization paths
Generalized ridge regression
Of note:
→ the limits of the 3rd and 4th regression coefficients,
→ more subtly, the regularization paths of the 2nd and 3rd regression coefficients temporarily converge.
The 1st and 2nd order moments of the generalized ridge regression estimator are:
Generalized ridge regression
Clearly, the generalized estimator is biased.
The generalized estimator has limiting behaviour:
Question
What is the effect of on the MSE of the estimator?
Bayes
The generalized ridge estimator too has a Bayesian interpretation. Set and replace the prior on the regression coefficients by: .
Generalized ridge regression
The joint posterior then is:
This implies: .
Example
Consider the linear regression model
Generalized ridge regression
with coefficients β_j, for j = 1, …, 500, and a standard normal error.

Estimate β by minimization of the fused ridge loss function:

‖Y − Xβ‖₂² + λ Σ_{j=2}^{500} (β_j − β_{j−1})²,

where λ > 0 denotes the penalty parameter.
Cf., e.g., Goeman, 2008.
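Writing the fused penalty as λ‖Dβ‖₂², with D the first-order difference matrix, the estimator again has a closed form. A sketch (ours; simulated data, with p smaller than the slide's 500 for brevity):

```python
import numpy as np

def fused_ridge(X, y, lam):
    """Fused ridge: minimize ||y - X b||^2 + lam * sum_j (b_j - b_{j-1})^2."""
    p = X.shape[1]
    D = np.diff(np.eye(p), axis=0)          # (p-1) x p first-difference matrix
    return np.linalg.solve(X.T @ X + lam * D.T @ D, X.T @ y)

rng = np.random.default_rng(0)
n, p = 100, 20
beta = np.repeat([0.0, 1.0], p // 2)        # piecewise-constant coefficients
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

b_small, b_large = fused_ridge(X, y, 0.1), fused_ridge(X, y, 100.0)

def roughness(b):
    """Size of the fused penalty: sum of squared adjacent differences."""
    return np.sum(np.diff(b) ** 2)
```

A larger penalty yields a smoother (more fused) coefficient vector.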
Example
Regular vs. fused ridge estimates
Generalized ridge regression
Example
DNA copy number: the number of gene copies encoded in the DNA.
→ 2: most genes on the autosomal chromosomes,
→ 1: genes on the X or Y chromosome in males,
→ 0: genes on the Y chromosome in females,
→ anything goes in cancer.
Generalized ridge regression
The cis-dogma: more copies, more transcription.
Q: Does a trans-effect exist? Does one gene's high copy number lead to elevated transcription levels of another gene?
Regress a gene's expression on the copy numbers of all genes.
→ trans-effect: large coefficients away from the response gene.
Example
Ridge vs. fused ridge estimates: local copy number effect.
Generalized ridge regression
What is generally referred to as the generalized ridge regression uses:
→ W = I_nn,
→ β₀ = 0_p,
→ Δ = V Λ Vᵀ, with V from the SVD X = U D Vᵀ,
with positive definite diagonal matrix Λ.
Generalized ridge regression
Eigenvalues are thus shrunk individually rather than jointly.
The generalized ridge estimator then is:

β̂ = V (Dᵀ D + Λ)⁻¹ Dᵀ Uᵀ Y.
Hoerl, Kennard, 1970; Hemmerle, 1975.
Question: verify this expression.
Rewrite the linear regression model to simplify notation:
Generalized ridge regression
where and .
The loss function then becomes:
which is optimized by:
In the original notation this results in:
Use the 1st and 2nd moments of this estimator,
Generalized ridge regression
to obtain its MSE:
where and .
The MSE of is minimized when for all j.Both quantities are unknown but may be estimated. Estimates, however, need not yield the desired MSE.
Hemmerle, 1975; Lawless, 1981.
Questions
What is the effect of on the MSE of the estimator:
Generalized ridge regression
Does the MSE of the generalized ridge estimator:
outperform that of the regular ridge estimator:
The mixed model
Mixed model
The mixed or random effect model is:
where Z is an (n×q)-dimensional design matrix, and:
with
The covariance matrix of the random effect, , is parametrized by a low-dimensional parameter .
Reformulated:
Mixed model
Example (growth rate of cells)
→ Longitudinal study.
→ Y_it = (log) cell count in petri dish i at time t.
→ X_i = concentration of growth medium in petri dish i.
→ Mixed model:
Mixed model
Estimation
Parameters are estimated by maximization of the likelihood, with conditional likelihood:

L(Y | γ = g) = (2π σ_ε²)^{−n/2} exp(−½ σ_ε^{−2} ‖Y − Xβ − Z L_θ g‖₂²).

Or, by restricted maximum likelihood (REML):
Mixed model
Link to ridge regression
In the absence of fixed effects the mixed model becomes:
where and .
Temporarily consider the random effect as fixed. Then:
which can be solved explicitly:
A shrinkage estimator that allows for q > n!
Mixed model
Theorem
Assume β ~ N(0_p, σ_β² I_pp) and ε ~ N(0_n, σ_ε² I_nn). The expected generalized cross-validation error of the ridge estimator is minimized for ….

Practical take-away
Handy for penalty parameter selection:
→ use REML to estimate σ_ε² and σ_β²,
→ set λ = σ̂_ε² / σ̂_β².
A familiar ratio, confer the MSE of ridge estimator with orthonormal design.
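The link can be checked numerically: with known variances, the ridge estimator with λ = σ_ε²/σ_β² coincides with the posterior mean of the random coefficients (a consequence of the Woodbury identity). A sketch (ours; variances fixed rather than REML-estimated):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 8
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

s2_eps, s2_beta = 1.5, 0.5      # error and random-effect variances (known here)
lam = s2_eps / s2_beta          # the implied ridge penalty

# ridge estimator with lambda = s2_eps / s2_beta
ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# posterior mean E[beta | y] under beta ~ N(0, s2_beta I), eps ~ N(0, s2_eps I)
blup = s2_beta * X.T @ np.linalg.solve(s2_beta * X @ X.T + s2_eps * np.eye(n), y)
```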
Logistic regression (recap)
Hosmer et al. (2013).
The logistic regression model
Linear regression relates a continuous response to explanatory variables. What if the response were binary?
Insightful?
The logistic regression model
Do not model Y directly, but rather the success probability P(Y = 1):
What would then be an appropriate model?
i)
ii)
iii)
LHS may yield values outside [0,1].
LHS may yield negative values.
Both RHS and LHS cover the real line!
The logistic regression model
The model may be rewritten as:
The function is called the link function. It links the response to the explanatory variables.
The one above is called the logistic link function, or logit for short.
The logistic regression model
The β0 parameter determines where (on the x-axis) the curve attains P(Y = 1) = ½.
The logistic link function for several values of β0.
The logistic regression model
The β1 parameter determines the slope at the point where P(Y = 1) = ½.
The logistic link function for several values of β1.
The odds is the ratio between the probability of an event (success) and the probability that this event will not happen (failure).
In our analysis of the advices:
The odds ratio is the relative increase in the odds as the explanatory variable increases by 1 unit.
Question
What if the confidence interval of the odds ratio contains 1?
The logistic regression model
The logistic regression model
Many other link functions for binary data exist, e.g.:
i) Probit:
ii) Cloglog:
iii) Cauchit:
All these link functions are invertible.
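For concreteness, the corresponding inverse links (mapping the linear predictor η to a probability) can be written with the standard library only; the function names are ours:

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def inv_cloglog(eta):
    return 1.0 - math.exp(-math.exp(eta))

def inv_cauchit(eta):
    # standard Cauchy CDF
    return math.atan(eta) / math.pi + 0.5
```

All four map the real line strictly increasingly onto (0, 1), hence are invertible.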
The logistic regression model
Comparison of link functions for binary data.
Estimation
Consider an experiment with Yᵢ ∈ {0, 1} for i = 1, …, n and with covariates Xᵢ available for each sample.

The likelihood of the experiment is then:

L(Y | X; β) = Πᵢ P(Yᵢ = 1 | Xᵢ)^{Yᵢ} [1 − P(Yᵢ = 1 | Xᵢ)]^{1−Yᵢ}.

After taking the logarithm and some ready algebra, the log-likelihood is found to be:

ℒ(Y | X; β) = Σᵢ { Yᵢ Xᵢ β − log[1 + exp(Xᵢ β)] }.
Estimation
Differentiate the log-likelihood w.r.t. β, equate it to zero, and obtain the estimating equation for β.
This derivative is:

∂ℒ/∂β = Σᵢ Xᵢᵀ [Yᵢ − P(Yᵢ = 1 | Xᵢ)],

the difference between observation and model.
Hence, a curve is fit through the data by minimizing the distance between them: at the ML estimate of β, a weighted average of the deviations is zero.
Estimation
Interpretation of the ML estimation:
ML estimation considers a weighted average of these distances.
Estimation
The Newton-Raphson algorithm iteratively finds the zeros of a function f(·).
Let x₀ denote an initial guess of the zero. Then, approximate f(·) around x₀ by means of a first-order Taylor series:

0 = f(x) ≈ f(x₀) + f′(x₀)(x − x₀).

Solve this for x:

x = x₀ − f(x₀) / f′(x₀).

Let x₁ denote the solution for x, use it as the new guess, and repeat the above until convergence.
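The scalar recipe above translates directly into code (our toy example, finding √2 as the root of f(x) = x² − 2):

```python
def newton_raphson(f, f_prime, x0, tol=1e-12, max_iter=100):
    """Iterate x <- x - f(x)/f'(x) until the step size drops below tol."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

root = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```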
Estimation
When the function f(•) has multiple arguments and is vector-valued, the Taylor approximation becomes:
with
the Jacobian matrix.
Estimation
When applied here to the estimation of β, the Newton-Raphson update is:

β_new = β_old − [∂²ℒ/∂β ∂βᵀ]⁻¹ ∂ℒ/∂β,

where the Hessian of the log-likelihood equals:

∂²ℒ/∂β ∂βᵀ = −Σᵢ Xᵢᵀ Xᵢ P(Yᵢ = 1 | Xᵢ) [1 − P(Yᵢ = 1 | Xᵢ)].

Iterative application of this updating formula converges to the ML estimate of β.
Estimation
The Newton-Raphson algorithm is often reformulated to an iteratively re-weighted least squares algorithm.
First write the gradient and Hessian in matrix notation:

∂ℒ/∂β = Xᵀ [Y − g⁻¹(Xβ)],    ∂²ℒ/∂β ∂βᵀ = −Xᵀ W X,

with g⁻¹(·) = exp(·) / [1 + exp(·)] and W diagonal with:

Wᵢᵢ = g⁻¹(Xᵢ β) [1 − g⁻¹(Xᵢ β)].
Estimation
The updating formula of the estimate then becomes:

β_new = (Xᵀ W X)⁻¹ Xᵀ W z,

where z = X β_old + W⁻¹ [Y − g⁻¹(X β_old)] is the adjusted response.
Estimation
The Newton-Raphson update is thus the solution to the following weighted least squares problem:

β_new = arg min_β (z − X β)ᵀ W (z − X β).

Effectively, at each iteration the adjusted response z is regressed on the covariates that comprise X.
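The full IRLS loop, on simulated data (our sketch; stopping rule and data are ours). At convergence the score Xᵀ(Y − μ) vanishes:

```python
import numpy as np

def irls_logistic(X, y, max_iter=50, tol=1e-10):
    """Fit logistic regression by iteratively re-weighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))          # g^{-1}(eta)
        w = mu * (1.0 - mu)                      # diagonal of W
        z = eta + (y - mu) / w                   # adjusted response
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(3)
n = 200
X = rng.standard_normal((n, 3))
prob = 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, -1.0, 0.25]))))
y = rng.binomial(1, prob)

beta_hat = irls_logistic(X, y)
score = X.T @ (y - 1.0 / (1.0 + np.exp(-(X @ beta_hat))))
```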
Ridge logistic regression
Le Cessie, van Houwelingen (1992)
Ridge logistic regression
Problem
The logistic model parameters cannot be estimated by maximum likelihood from high-dimensional data.

Solution
Augmentation of the log-likelihood with a ridge penalty:

ℒ_pen(Y | X; β, λ) = ℒ(Y | X; β) − ½ λ ‖β‖₂².
This does not change the model, only the estimates!
Ridge logistic regression
Penalized log-likelihood contour + ridge constraint.
Ridge estimation
Ridge ML estimates of the logistic model parameters are found by the maximization of the penalized log-likelihood.
Again, use the Newton-Raphson algorithm for solving the (penalized) estimating equation. The gradient is now:

Xᵀ [Y − g⁻¹(Xβ)] − λ β,

and the Hessian:

−Xᵀ W X − λ I_pp.

The rest stays the same.
Ridge estimation
Again, the Newton-Raphson algorithm is reformulated as an iteratively re-weighted least squares algorithm with the updating step modified accordingly:

β_new = (Xᵀ W X + λ I_pp)⁻¹ Xᵀ W z,

with W and z as before.
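Only the linear system solved at each step changes. A sketch of the ridge-penalized IRLS (ours; at convergence the penalized score Xᵀ(Y − μ) − λβ vanishes):

```python
import numpy as np

def irls_ridge_logistic(X, y, lam, max_iter=100, tol=1e-10):
    """Ridge logistic regression via penalized IRLS."""
    p = X.shape[1]
    beta = np.zeros(p)
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)
        z = eta + (y - mu) / w
        # the only change w.r.t. unpenalized IRLS: + lam * I in the system
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X) + lam * np.eye(p),
                                   X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(4)
n, p, lam = 50, 10, 5.0
X = rng.standard_normal((n, p))
y = rng.binomial(1, 0.5, size=n)

beta_hat = irls_ridge_logistic(X, y, lam)
mu_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
pen_score = X.T @ (y - mu_hat) - lam * beta_hat
```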
Ridge estimation
Effect of ridge penalization
Regression estimate vs. λ; ŷ vs. x for various λ.
Ridge estimation
To illustrate this further, consider the resulting classification. Define the red and green domain through:
Separate design space in red, green domain.
The white bar between them is the domain where samples cannot be classified with high certainty.
Ridge estimation
Moments
From the final Newton-Raphson step, we can approximate the 1st and 2nd order moments:
with:
Moments
As with ridge regression, the variances of the parameter estimates vanish as the penalty parameter increases.
Moments
What about the MSE?
Question: can this be understood (from the moments)?
Bayes
Assume a normal prior on the regression parameter β.
The posterior then is proportional to the product of the logistic likelihood and this prior.
This is not a familiar distribution, but it is asymptotically normal.
Laplace's method approximates the posterior by a Gaussian
→ centered at the posterior mode, and
→ with covariance determined by the curvature at the posterior mode.
The posterior mode coincides with the ridge logistic regression estimator.
Bishop (2006).
For the posterior covariance approximate the logarithm of the posterior by a 2nd order Taylor series around the mode:
Take exponential and match arguments to that of a Gaussian to arrive at the normal approximation of the posterior:
Bayes
Bishop (2006).
Question: where is the 1st order term?
A Gaussian approximation is convenient, but is it any good?
The Bernstein-von Mises theorem warrants that (under smoothness conditions) the difference between posterior and its normal approximation vanishes (in probability).
van der Vaart (2007).
Bayes
Cross-validation
Penalty parameter selection
Again, the LOOCV loss may be evaluated in a computationally efficient manner, using the approximate leave-one-out estimators:
Substitute these approximations in
Meijer, Goeman (2013).
This 'approximate' LOOCV loss often yields an optimal penalty parameter close to that produced by the exact LOOCV loss.
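A sketch of such an approximation: fit the penalized model once on the full data, then replace each leave-one-out refit by a single Newton step, which reduces to a Sherman-Morrison rank-one update. This is in the spirit of Meijer & Goeman (2013); their exact formulas differ in detail, so treat the code as illustrative.

```python
import numpy as np

def approx_loocv_loss(X, y, lam, iters=50):
    """Approximate leave-one-out log-loss for ridge logistic regression:
    one full-data Newton fit, then a rank-one correction per observation."""
    n, p_dim = X.shape
    beta = np.zeros(p_dim)
    for _ in range(iters):                         # full-data penalized Newton fit
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = prob * (1.0 - prob)
        H = X.T @ (X * w[:, None]) + lam * np.eye(p_dim)
        beta += np.linalg.solve(H, X.T @ (y - prob) - lam * beta)
    eta = X @ beta
    prob = 1.0 / (1.0 + np.exp(-eta))
    w = prob * (1.0 - prob)
    H = X.T @ (X * w[:, None]) + lam * np.eye(p_dim)
    a = np.einsum('ij,ji->i', X, np.linalg.solve(H, X.T))   # a_i = x_i' H^{-1} x_i
    eta_loo = eta - a * (y - prob) / (1.0 - w * a)          # Sherman-Morrison step
    p_loo = 1.0 / (1.0 + np.exp(-eta_loo))
    return -np.mean(y * np.log(p_loo) + (1.0 - y) * np.log(1.0 - p_loo))
```

Minimizing this loss over a grid of λ values avoids the n refits of exact LOOCV.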
Aim
Model / predict ovarian cancer survival status.
Application
Data
→ 295 ovarian cancer patients,
→ status (dead/alive) at end of study,
→ expression of 19,990 transcripts (counts) at study onset,
→ counts transformed.
Analysis
→ ridge logistic regression,
→ model: λ chosen by LOOCV,
→ prediction: double CV loop, with both λ selection and prediction by LOOCV.
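The double cross-validation scheme above can be sketched as follows: the outer leave-one-out loop measures prediction, while the inner loop chooses λ on the remaining samples only, so penalty selection never sees the held-out sample. The Newton fitter, penalty grid, and log-loss criterion are illustrative assumptions, not the slides' exact implementation.

```python
import numpy as np

def fit_ridge_logistic(X, y, lam, iters=30):
    # Newton-Raphson for the ridge-penalized logistic log-likelihood
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = prob * (1.0 - prob)
        H = X.T @ (X * w[:, None]) + lam * np.eye(X.shape[1])
        beta += np.linalg.solve(H, X.T @ (y - prob) - lam * beta)
    return beta

def loocv_logloss(X, y, lam):
    # inner loop: exact leave-one-out log-loss for a fixed penalty
    losses = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        b = fit_ridge_logistic(X[mask], y[mask], lam)
        p = 1.0 / (1.0 + np.exp(-(X[i] @ b)))
        losses.append(-(y[i] * np.log(p) + (1 - y[i]) * np.log(1 - p)))
    return np.mean(losses)

def double_cv_predictions(X, y, lam_grid):
    # outer loop: each sample is predicted with a lambda chosen on the
    # remaining samples only, so penalty selection does not leak
    preds = np.empty(len(y))
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        lam = min(lam_grid, key=lambda l: loocv_logloss(X[mask], y[mask], l))
        b = fit_ridge_logistic(X[mask], y[mask], lam)
        preds[i] = 1.0 / (1.0 + np.exp(-(X[i] @ b)))
    return preds
```

At the scale of the application (n = 295, p ≈ 20,000), one would replace the exact inner LOOCV by the approximate LOOCV loss discussed above.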
Application
Evaluation
→ The fitted model gives a reasonable description of the data.
→ It extrapolates poorly to new samples.
References & further reading
Le Cessie, S., & Van Houwelingen, J. C. (1992), “Ridge estimators in logistic regression”, Applied Statistics, 41(1), 191-201.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Goeman, J. J. (2008), “Autocorrelated logistic ridge regression for prediction based on proteomics spectra”, Statistical Applications in Genetics and Molecular Biology, 7(2).
Hemmerle, W. J. (1975), “An explicit solution for generalized ridge regression”, Technometrics, 17(3), 309-314.
Hoerl, A. E., & Kennard, R. W. (1970), “Ridge regression: Biased estimation for nonorthogonal problems”, Technometrics, 12(1), 55-67.
Hosmer Jr, D.W., Lemeshow, S., Sturdivant, R.X. (2013), Applied Logistic Regression. John Wiley & Sons.
Lawless, J. F. (1981), “Mean squared error properties of generalized ridge estimators”, Journal of the American Statistical Association, 76(374), 462-466.
Meijer, R. J., & Goeman, J. J. (2013), “Efficient approximate k-fold and leave-one-out cross-validation for ridge regression”, Biometrical Journal, 55(2), 141-155.
Van der Vaart, A.W. (2007), Asymptotic Statistics, Cambridge University Press.
Van Wieringen, W.N. (2018), Lecture notes on ridge regression, arXiv:1509.09169.
This material is provided under the Creative Commons Attribution/Share-Alike/Non-Commercial License.
See http://www.creativecommons.org for details.