Introduction to Machine Learning (67577), Lecture 11
Shai Shalev-Shwartz
School of CS and Engineering, The Hebrew University of Jerusalem
Dimensionality Reduction
Dimensionality Reduction
Dimensionality reduction = taking data in a high-dimensional space and mapping it into a low-dimensional space
Why?
Reduces training (and testing) time
Reduces estimation error
Interpretability of the data, finding meaningful structure in the data, illustration
Linear dimensionality reduction: x ↦ Wx, where W ∈ R^{n,d} and n < d
Outline
1 Principal Component Analysis (PCA)
2 Random Projections
3 Compressed Sensing
Principal Component Analysis (PCA)
x ↦ Wx
What makes W a good matrix for dimensionality reduction?
Natural criterion: we want to be able to approximately recover x from y = Wx
PCA:
Linear recovery: x̃ = Uy = UWx
Measures "approximate recovery" by the averaged squared norm: given examples x_1, ..., x_m, solve
argmin_{W∈R^{n,d}, U∈R^{d,n}} ∑_{i=1}^m ‖x_i − UWx_i‖²
Solving the PCA Problem
argmin_{W∈R^{n,d}, U∈R^{d,n}} ∑_{i=1}^m ‖x_i − UWx_i‖²
Theorem
Let A = ∑_{i=1}^m x_i x_i^⊤ and let u_1, ..., u_n be the n leading eigenvectors of A. Then, the solution to the PCA problem is to set the columns of U to be u_1, ..., u_n and to set W = U^⊤.
Proof main ideas
UW has rank at most n; therefore its range is an n-dimensional subspace, denoted S
The transformation x ↦ UWx moves x into this subspace
The point in S which is closest to x is VV^⊤x, where the columns of V are an orthonormal basis of S
Therefore, we can assume w.l.o.g. that W = U^⊤ and that the columns of U are orthonormal
Proof main ideas
Observe:
‖x − UU^⊤x‖² = ‖x‖² − 2x^⊤UU^⊤x + x^⊤UU^⊤UU^⊤x
             = ‖x‖² − x^⊤UU^⊤x
             = ‖x‖² − trace(U^⊤ x x^⊤ U)
Therefore, an equivalent PCA problem is
argmax_{U∈R^{d,n} : U^⊤U = I} trace(U^⊤ (∑_{i=1}^m x_i x_i^⊤) U).
The solution is to set U to be the n leading eigenvectors of A = ∑_{i=1}^m x_i x_i^⊤.
Value of the objective
It is easy to see that
min_{W∈R^{n,d}, U∈R^{d,n}} ∑_{i=1}^m ‖x_i − UWx_i‖² = ∑_{i=n+1}^d λ_i(A)
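A minimal NumPy sketch (not from the lecture; the synthetic data and names are illustrative) that checks the theorem and this identity numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n = 500, 10, 3

# Synthetic examples as the rows of X
X = rng.normal(size=(m, d)) @ rng.normal(size=(d, d))

A = X.T @ X                              # A = sum_i x_i x_i^T
eigvals, eigvecs = np.linalg.eigh(A)     # eigenvalues in ascending order
U = eigvecs[:, -n:]                      # n leading eigenvectors as columns
W = U.T                                  # W = U^T, as in the theorem

# sum_i ||x_i - U W x_i||^2, with x_i as rows: (U U^T x_i)^T = x_i^T U U^T
err = np.sum((X - X @ U @ U.T) ** 2)
tail = eigvals[:-n].sum()                # sum of the d - n smallest eigenvalues of A

print(err, tail)                         # the two values agree up to numerical precision
```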
Centering
It is a common practice to "center" the examples before applying PCA, namely:
First calculate µ = (1/m) ∑_{i=1}^m x_i
Then apply PCA to the vectors (x_1 − µ), ..., (x_m − µ)
This is also related to the interpretation of PCA as variance maximization (will be given in the exercise)
Efficient implementation for d ≫ m and kernel PCA
Recall: A = ∑_{i=1}^m x_i x_i^⊤ = X^⊤X, where X ∈ R^{m,d} is the matrix whose i-th row is x_i^⊤
Let B = XX^⊤. That is, B_{i,j} = ⟨x_i, x_j⟩
If Bu = λu then
A(X^⊤u) = X^⊤XX^⊤u = X^⊤Bu = λ(X^⊤u)
So, X^⊤u / ‖X^⊤u‖ is an eigenvector of A with eigenvalue λ
We can therefore calculate the PCA solution by computing the eigenvectors of B instead of A
The complexity is O(m³ + m²d)
And, it can be computed using a kernel function
Pseudo code
PCA
input: a matrix of m examples X ∈ R^{m,d}; number of components n
if (m > d)
    A = X^⊤X
    let u_1, ..., u_n be the eigenvectors of A with largest eigenvalues
else
    B = XX^⊤
    let v_1, ..., v_n be the eigenvectors of B with largest eigenvalues
    for i = 1, ..., n set u_i = X^⊤v_i / ‖X^⊤v_i‖
output: u_1, ..., u_n
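A runnable NumPy version of this pseudo code is sketched below (a plain sketch, not the course's reference code; it assumes the rows of X are the examples and that centering, if desired, was done beforehand):

```python
import numpy as np

def pca(X, n):
    """Return the n leading principal directions u_1, ..., u_n as the columns of U,
    using whichever of X^T X (d x d) or X X^T (m x m) is smaller."""
    m, d = X.shape
    if m > d:
        A = X.T @ X
        _, vecs = np.linalg.eigh(A)          # ascending eigenvalues
        U = vecs[:, -n:][:, ::-1]            # n leading eigenvectors of A
    else:
        B = X @ X.T                          # Gram matrix, B_ij = <x_i, x_j>
        _, vecs = np.linalg.eigh(B)
        V = vecs[:, -n:][:, ::-1]            # n leading eigenvectors of B
        U = X.T @ V                          # u_i is proportional to X^T v_i
        U /= np.linalg.norm(U, axis=0)       # normalize each column
    return U

# Usage: U = pca(X, n); reduced = X @ U; reconstructed = reduced @ U.T
```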
Demonstration
[Figure: scatter of two-dimensional data points, both axes spanning −1.5 to 1.5]
Demonstration
50×50 images from the Yale dataset
Original images (left) and their reconstructions after reduction to 10 dimensions (right)
Demonstration
Before and after
Demonstration
Images after dimensionality reduction to R²
Different marks indicate different individuals
[Figure: 2D scatter of the reduced images; each individual is plotted with a different mark (x, o, *, +)]
What is a successful dimensionality reduction?
In PCA, we measured success as the squared distance between x and a reconstruction of x from y = Wx
In some cases we don't care about reconstruction; all we care about is that y_1, ..., y_m retain certain properties of x_1, ..., x_m
One option: do not distort distances. That is, we'd like that for all i, j, ‖x_i − x_j‖ ≈ ‖y_i − y_j‖
Equivalently, we'd like that for all i, j, ‖Wx_i − Wx_j‖ / ‖x_i − x_j‖ ≈ 1
Equivalently, we'd like that for all x ∈ Q, where Q = {x_i − x_j : i, j ∈ [m]}, we'll have ‖Wx‖ / ‖x‖ ≈ 1
Random Projections do not distort norms
Random projection: the transformation x ↦ Wx, where W is a random matrix
We'll analyze the distortion due to W s.t. W_{i,j} ∼ N(0, 1/n)
Let w_i be the i-th row of W. Then:
E[‖Wx‖²] = ∑_{i=1}^n E[⟨w_i, x⟩²] = ∑_{i=1}^n x^⊤ E[w_i w_i^⊤] x = n x^⊤ ((1/n) I) x = ‖x‖²
In fact, n‖Wx‖² / ‖x‖² has a χ²_n distribution, and using a measure concentration inequality it can be shown that
P[ |‖Wx‖²/‖x‖² − 1| > ε ] ≤ 2 e^{−ε²n/6}
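A small empirical check of this concentration claim (a sketch with illustrative dimensions, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials, eps = 1000, 200, 2000, 0.25

x = rng.normal(size=d)
ratios = np.empty(trials)
for t in range(trials):
    W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, d))   # W_ij ~ N(0, 1/n)
    ratios[t] = np.sum((W @ x) ** 2) / np.sum(x ** 2)    # ||Wx||^2 / ||x||^2

print(ratios.mean())                          # close to 1, matching E[||Wx||^2] = ||x||^2
print(np.mean(np.abs(ratios - 1) > eps))      # empirical probability of a large distortion
print(2 * np.exp(-eps ** 2 * n / 6))          # the bound 2 exp(-eps^2 n / 6)
```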
Random Projections do not distort norms
Applying the union bound over all vectors in Q we obtain:
Lemma (Johnson-Lindenstrauss lemma)
Let Q be a finite set of vectors in R^d. Let δ ∈ (0, 1) and n be an integer such that
ε = √(6 log(2|Q|/δ) / n) ≤ 3.
Then, with probability of at least 1 − δ over the choice of a random matrix W ∈ R^{n,d} with W_{i,j} ∼ N(0, 1/n), we have
max_{x∈Q} |‖Wx‖²/‖x‖² − 1| < ε.
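For a sense of scale, the target dimension n that the lemma asks for can be computed directly from the condition on ε; the numbers below are illustrative (m points give |Q| ≈ m(m−1) difference vectors):

```python
import numpy as np

m, delta, eps = 1000, 0.01, 0.1
Q = m * (m - 1)                               # ordered pairs i != j
n = int(np.ceil(6 * np.log(2 * Q / delta) / eps ** 2))
print(n)                                      # smallest n with sqrt(6 log(2|Q|/delta)/n) <= eps
```

Note that n grows only logarithmically with the number of vectors and does not depend on the original dimension d.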
Compressed Sensing
Prior assumption: x ≈ Uα where U is orthonormal and ‖α‖₀ := |{i : α_i ≠ 0}| ≤ s for some s ≪ d
E.g.: natural images are approximately sparse in a wavelet basis
How to "store" x?
We can find α = U^⊤x and then save the non-zero elements of α
This requires order of s log(d) storage (s values plus s indices, each index taking about log₂ d bits)
Why go to so much effort to acquire all the d coordinates of x when most of what we get will be thrown away? Can't we just directly measure the part that won't end up being thrown away?
Compressed Sensing
Informally, the main premise of compressed sensing is the following three "surprising" results:
1 It is possible to fully reconstruct any sparse signal if it was compressed by x ↦ Wx, where W is a matrix that satisfies a condition called the Restricted Isometry Property (RIP). A matrix that satisfies this property is guaranteed to have a low distortion of the norm of any sparse representable vector.
2 The reconstruction can be calculated in polynomial time by solving a linear program.
3 A random n × d matrix is likely to satisfy the RIP condition provided that n is greater than order of s log(d).
Restricted Isometry Property (RIP)
A matrix W ∈ R^{n,d} is (ε, s)-RIP if for all x ≠ 0 s.t. ‖x‖₀ ≤ s we have
|‖Wx‖₂² / ‖x‖₂² − 1| ≤ ε.
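The sketch below empirically estimates the distortion in this definition for a Gaussian random matrix by sampling random s-sparse vectors; it is only a heuristic check over samples (all sizes are illustrative), not a certificate that W is RIP:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, s, trials = 400, 80, 5, 5000

W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, d))

worst = 0.0
for _ in range(trials):
    x = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)        # random s-sparse support
    x[support] = rng.normal(size=s)
    ratio = np.sum((W @ x) ** 2) / np.sum(x ** 2)
    worst = max(worst, abs(ratio - 1.0))

print(worst)   # empirical lower bound on the smallest eps for which W is (eps, s)-RIP
```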
RIP matrices yield lossless compression for sparse vectors
Theorem
Let ε < 1 and let W be an (ε, 2s)-RIP matrix. Let x be a vector s.t. ‖x‖₀ ≤ s, let y = Wx and let x̃ ∈ argmin_{v: Wv=y} ‖v‖₀. Then, x̃ = x.
Proof.
Assume, by way of contradiction, that x̃ ≠ x.
Since x satisfies the constraints in the optimization problem that defines x̃, we clearly have that ‖x̃‖₀ ≤ ‖x‖₀ ≤ s.
Therefore, ‖x − x̃‖₀ ≤ 2s.
By RIP on x − x̃ we have |‖W(x − x̃)‖² / ‖x − x̃‖² − 1| ≤ ε
But, since W(x − x̃) = 0, we get that |0 − 1| ≤ ε. Contradiction.
Efficient reconstruction
If we further assume that ε < 1/(1 + √2), then
x = argmin_{v: Wv=y} ‖v‖₀ = argmin_{v: Wv=y} ‖v‖₁.
The right-hand side is a linear programming problem
Summary: we can reconstruct all sparse vectors efficiently based on O(s log(d)) measurements
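One standard way to solve the right-hand side is to write v = p − q with p, q ≥ 0, which turns the ℓ₁ minimization into a linear program. A minimal sketch using SciPy's linprog (the matrix sizes and the final check are illustrative, and the sparsity level is assumed small enough for recovery to succeed):

```python
import numpy as np
from scipy.optimize import linprog

def l1_reconstruct(W, y):
    """Solve min ||v||_1 s.t. W v = y as an LP with v = p - q, p >= 0, q >= 0."""
    n, d = W.shape
    c = np.ones(2 * d)                       # objective: sum(p) + sum(q) = ||v||_1
    A_eq = np.hstack([W, -W])                # W p - W q = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:d] - res.x[d:]

rng = np.random.default_rng(0)
d, s, n = 200, 5, 60
W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, d))
x = np.zeros(d)
x[rng.choice(d, size=s, replace=False)] = rng.normal(size=s)

x_hat = l1_reconstruct(W, W @ x)
print(np.max(np.abs(x_hat - x)))             # near 0 when exact recovery succeeds
```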
PCA vs. Random Projections
Random projections guarantee perfect recovery for all O(n / log(d))-sparse vectors
PCA guarantees perfect recovery if all examples lie in an n-dimensional subspace
Different prior knowledge:
If the data is e_1, ..., e_d, random projections will be perfect but PCA will fail
If d is very large and the data lies exactly on an n-dimensional subspace, then PCA will be perfect but random projections might fail
Summary
Linear dimensionality reduction x ↦ Wx
PCA: optimal if reconstruction is linear and error is squared distance
Random projections: preserve distances
Random projections: exact reconstruction for sparse vectors (but with a non-linear reconstruction)
Not covered: non-linear dimensionality reduction