Lars Kasper, December 15th 2010

PATTERN RECOGNITION AND MACHINE LEARNING
CHAPTER 12: CONTINUOUS LATENT VARIABLES
Relation To Other Topics
• Last weeks: Approximate Inference
• Today: Back to
  • Data preprocessing
  • Data representation/feature extraction
  • “Model-free” analysis
  • Dimensionality reduction
  • The matrix
• Link: We also have a (particularly easy) model of the underlying state of the world whose parameters we want to infer from the data
Take-home TLAs (Three-letter acronyms)
Although termed “continuous latent variables”, we mainly deal with
• PCA (Principal Component Analysis)
• ICA (Independent Component Analysis)
• Factor analysis
General motivation/theme: “What is interesting about my data – but hidden (latent)? …And what is just noise?”
Importance Sampling ;-)

Publications concerning fMRI and (PCA or ICA or factor analysis).
Source: ISI Web of Knowledge, Dec 13th, 2010

| Year | Publications | Share |
|------|--------------|-----------|
| 1996 | 2 | 0.1918 % |
| 1997 | 3 | 0.2876 % |
| 1998 | 7 | 0.6711 % |
| 1999 | 17 | 1.6299 % |
| 2000 | 33 | 3.1640 % |
| 2001 | 41 | 3.9310 % |
| 2002 | 54 | 5.1774 % |
| 2003 | 53 | 5.0815 % |
| 2004 | 77 | 7.3826 % |
| 2005 | 85 | 8.1496 % |
| 2006 | 98 | 9.3960 % |
| 2007 | 115 | 11.0259 % |
| 2008 | 139 | 13.3269 % |
| 2009 | 160 | 15.3404 % |
| 2010 | 157 | 15.0527 % |
Importance Sampling: fMRI
MELODIC Tutorial: 2nd principal component (eigenimage) and corresponding time series of a visual block stimulation
• Used for fMRI analysis, e.g. software package FSL: “MELODIC”
Motivation: Low intrinsic dimensionality
• Generating hand-written digit samples by translating and rotating one example 100 times
• High-dimensional data (100 × 100 pixels)
• Few degrees of freedom (1 rotation angle, 2 translations)
Roadmap for today
Standard PCA (heuristic)
• Dimensionality Reduction
• Maximum Variance
• Minimum Error

Probabilistic PCA (Maximum Likelihood)
• Generative Probabilistic Model
• ML-equivalence to Standard PCA

Bayesian PCA
• Automatic determination of latent space dimension

Generalizations
• Relaxing equal data noise amplitude: Factor analysis
• Relaxing Gaussianity: ICA
• Relaxing linearity: Kernel PCA
Heuristic PCA: Projection View
How do we simplify or compress our data (make it low-dimensional) without losing relevant information? Dimensionality reduction by projection onto a linear subspace.

Figure: 2D data, projected onto a 1D line
Heuristic PCA: Dimensionality Reduction
• High-dimensional data: data points $\mathbf{x}_n \in \mathbb{R}^D$
• Projection onto a low-dimensional subspace of dimension $M < D$: projected data points $\tilde{\mathbf{x}}_n$

Advantages:
• Reduced amount of data
• Might make it easier to reveal structure within the data (pattern recognition, data visualization)
Heuristic PCA: Maximum Variance View
• We want to reduce the dimensionality of our data space via a linear projection.
• But we still want to keep the projected samples as different as possible.
• A good measure for this difference is the data covariance, expressed by the matrix
  $$S = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^{T}$$
• Note: This expresses the covariance between different data dimensions, not between data points.
• We now aim to maximize the variance of the projected data in the projection space spanned by the basis vectors $\mathbf{u}_i$.

($\bar{\mathbf{x}}$ – mean of all data points, $N$ – number of data points)
Maximum Variance View: The Maths
• Maximum variance formulation of a 1D projection with projection vector $\mathbf{u}_1$: maximize the projected variance $\mathbf{u}_1^{T} S \mathbf{u}_1$
• Constrained optimization with $\mathbf{u}_1^{T}\mathbf{u}_1 = 1$ (Lagrange multiplier $\lambda_1$):
  $$\mathbf{u}_1^{T} S \mathbf{u}_1 + \lambda_1\left(1 - \mathbf{u}_1^{T}\mathbf{u}_1\right) \;\rightarrow\; \max$$
• Leads to the best projector being an eigenvector of $S$, the data covariance matrix:
  $$S\mathbf{u}_1 = \lambda_1 \mathbf{u}_1$$
• with maximum projected variance equal to the maximum eigenvalue: $\mathbf{u}_1^{T} S \mathbf{u}_1 = \lambda_1$
Heuristic PCA: Conclusion
By induction we obtain the general PCA result for maximizing the variance of the data in the projected dimensions:

The projection vectors shall be the eigenvectors $\mathbf{u}_1, \dots, \mathbf{u}_M$ corresponding to the $M$ largest eigenvalues of the data covariance matrix $S$. These vectors are called the principal components.
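This recipe can be sketched in a few lines of numpy; the 2D data set and mixing matrix below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D data set with correlated dimensions (illustrative only)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])

# Data covariance matrix S = (1/N) sum_n (x_n - x_bar)(x_n - x_bar)^T
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)

# Eigendecomposition; np.linalg.eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(S)

# First principal component = eigenvector with the largest eigenvalue
u1 = eigvecs[:, -1]
projected = (X - x_bar) @ u1   # 1D projection of the data

# The variance of the projected data equals the largest eigenvalue,
# since S u1 = lambda_1 u1 and u1^T u1 = 1
print(np.isclose(projected.var(), eigvals[-1]))   # True
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because $S$ is symmetric, so the eigenvalues are real and come back sorted.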
Heuristic PCA: Minimum error formulation
• By projecting, we want to lose as little information as possible, i.e. keep the projected data points as similar to the raw data as possible.
• Therefore we minimize the mean quadratic error
  $$J = \frac{1}{N}\sum_{n=1}^{N}\left\lVert \mathbf{x}_n - \tilde{\mathbf{x}}_n \right\rVert^{2}$$
  with respect to the projection vectors $\mathbf{u}_i$.
• This leads to the same result as in the maximum variance formulation: the $\mathbf{u}_i$ shall be the eigenvectors corresponding to the largest eigenvalues of the data covariance matrix $S$.
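A well-known consequence is that the minimal error $J$ equals the sum of the discarded eigenvalues; this can be checked numerically (the 5D data set below is invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5)) @ rng.normal(size=(5, 5))  # made-up 5D data

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)
eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order

M = 2                                  # dimension of the projection subspace
U = eigvecs[:, -M:]                    # eigenvectors of the M largest eigenvalues

# Reconstruct each data point from its M-dimensional projection
X_tilde = x_bar + (X - x_bar) @ U @ U.T

# The mean quadratic error equals the sum of the discarded eigenvalues
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
print(np.isclose(J, eigvals[:-M].sum()))   # True
```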
Example: Eigenimages
Eigenimages II
Christopher DeCoro http://www.cs.princeton.edu/cdecoro/eigenfaces/
Dimensionality Reduction
Roadmap for today
Standard PCA (heuristic)
• Dimensionality Reduction
• Maximum Variance
• Minimum Error

Probabilistic PCA (Maximum Likelihood)
• Generative Probabilistic Model
• ML-equivalence to Standard PCA

Bayesian PCA
• Automatic determination of latent space dimension

Generalizations
• Relaxing equal data noise amplitude: Factor analysis
• Relaxing Gaussianity: ICA
• Relaxing linearity: Kernel PCA
Probabilistic PCA: A synthesizer’s view
$$\mathbf{x} = W\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon}$$

• $p(\mathbf{z}) = \mathcal{N}(\mathbf{z} \mid \mathbf{0}, I)$ – standardized normal distribution
  • Independent latent variables with zero mean & unit variance
• $p(\boldsymbol{\epsilon}) = \mathcal{N}(\boldsymbol{\epsilon} \mid \mathbf{0}, \sigma^2 I)$ – a spherical Gaussian
  • i.e. identical independent noise in each of the data dimensions
• Prior predictive or marginal distribution of the data points:
  $$p(\mathbf{x}) = \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu},\; WW^{T} + \sigma^2 I\right)$$
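The synthesizer's view can be simulated directly: sample latents and noise, mix them, and check that the sample covariance approaches the marginal covariance $WW^T + \sigma^2 I$. All numbers below ($D$, $M$, $W$, $\sigma^2$) are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
D, M, N = 4, 2, 200_000
W = rng.normal(size=(D, M))          # hypothetical projection matrix
mu = rng.normal(size=D)
sigma2 = 0.25                        # noise variance

# Generative model: z ~ N(0, I), eps ~ N(0, sigma^2 I), x = W z + mu + eps
Z = rng.normal(size=(N, M))
eps = np.sqrt(sigma2) * rng.normal(size=(N, D))
X = Z @ W.T + mu + eps

# The sample covariance approaches the marginal covariance W W^T + sigma^2 I
C_empirical = np.cov(X, rowvar=False)
C_model = W @ W.T + sigma2 * np.eye(D)
print(np.max(np.abs(C_empirical - C_model)))   # small for large N
```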
Probabilistic PCA: ML-solution
$$W_{\mathrm{ML}} = U_M \left(L_M - \sigma^2 I\right)^{1/2} R, \qquad \sigma^2_{\mathrm{ML}} = \frac{1}{D-M}\sum_{i=M+1}^{D}\lambda_i$$

• Same as in heuristic PCA: $U_M$ – matrix of the first $M$ eigenvectors, $L_M$ – diagonal matrix of the corresponding eigenvalues
• Only specified up to a rotation $R$ in latent space
Recap: The EM-algorithm
• The Expectation-Maximization algorithm determines the Maximum Likelihood solution for our model parameters iteratively
• Advantageous compared to direct eigenvector decomposition if $M \ll D$, i.e. if we have considerably fewer latent variables than data dimensions
• Projection onto a very low-dimensional space, e.g. $M = 2$ or $3$ for data visualization
EM-Algorithm: Expectation Step
• We consider the complete-data log likelihood $\ln p(X, Z \mid \boldsymbol{\mu}, W, \sigma^2)$
• Maximizing the marginal likelihood instead would need an integration over latent space
• E-Step: The posterior distribution of the latent variables $p(Z \mid X, \boldsymbol{\mu}, W, \sigma^2)$ is updated and used to calculate the expected value of the complete-data log likelihood with respect to it
• keeping the current estimates of $W$ and $\sigma^2$ fixed
EM-Algorithm: Maximization Step
• M-Step: The calculated expectation is now maximized with respect to $W$ and $\sigma^2$:
  $$W_{\mathrm{new}},\, \sigma^2_{\mathrm{new}} = \arg\max_{W,\,\sigma^2}\; \mathbb{E}_{Z}\!\left[\ln p(X, Z \mid \boldsymbol{\mu}, W, \sigma^2)\right]$$
• keeping the estimated posterior distribution of $Z$ fixed from the E-Step
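The two steps can be sketched compactly in numpy, following the standard closed-form E/M updates for probabilistic PCA; the function name and the synthetic data set are invented for this sketch, and $\boldsymbol{\mu}$ is simply fixed at the sample mean:

```python
import numpy as np

def ppca_em(X, M, n_iter=200, seed=0):
    """Sketch of EM for probabilistic PCA."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Xc = X - X.mean(axis=0)          # mu_ML is the sample mean
    W = rng.normal(size=(D, M))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables,
        # using M_mat = W^T W + sigma^2 I
        M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Xc @ W @ M_inv                      # E[z_n], stacked as N x M
        Ezz = N * sigma2 * M_inv + Ez.T @ Ez     # sum_n E[z_n z_n^T]
        # M-step: maximize the expected complete-data log likelihood
        W_new = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum((Xc @ W_new) * Ez)
                  + np.trace(Ezz @ W_new.T @ W_new)) / (N * D)
        W = W_new
    return W, sigma2

# Made-up data from a 2-dimensional latent model in 5 data dimensions
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2)) @ rng.normal(size=(2, 5)) \
    + 0.3 * rng.normal(size=(1000, 5))
W_hat, s2_hat = ppca_em(X, M=2)

# At convergence, sigma^2 should match the ML value: the mean of the
# D - M smallest eigenvalues of the sample covariance matrix
lam = np.linalg.eigvalsh(np.cov(X, rowvar=False, bias=True))
print(np.isclose(s2_hat, lam[:3].mean(), rtol=1e-2))
```

Note that no $D \times D$ eigendecomposition is needed; all matrix inverses are $M \times M$, which is the advantage mentioned above when $M \ll D$.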
EM-algorithm for ML-PCA
Spring analogy (figure from PRML):
• Green dots: data points, always fixed
• E-Step: the red rod ($W$) is fixed; the cyan connections of the blue springs move, obeying the spring forces
• M-Step: the cyan connections are fixed; the red rod moves, obeying the spring forces
Roadmap for today
Standard PCA (heuristic)
• Dimensionality Reduction
• Maximum Variance
• Minimum Error

Probabilistic PCA (Maximum Likelihood)
• Generative Probabilistic Model
• ML-equivalence to Standard PCA

Bayesian PCA
• Automatic determination of latent space dimension

Generalizations
• Relaxing equal data noise amplitude: Factor analysis
• Relaxing Gaussianity: ICA
• Relaxing linearity: Kernel PCA
Bayesian PCA – Finding the real dimension
Maximum Likelihood → Bayesian PCA

$$\mathbf{x} = W\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon}$$

• Introducing hyperparameters $\alpha_i$, one per column $\mathbf{w}_i$ of $W$, with priors $p(\mathbf{w}_i \mid \alpha_i) = \mathcal{N}(\mathbf{w}_i \mid \mathbf{0}, \alpha_i^{-1} I)$, and marginalizing over $W$
• Estimating the $\alpha_i$: columns of $W$ corresponding to superfluous latent dimensions are driven to zero

Figure: estimated projection matrix $W$ for an $M$-dimensional latent variable model and synthetic data generated from a latent model of lower intrinsic dimensionality; only the relevant columns of $W$ remain nonzero.
Roadmap for today
Standard PCA (heuristic)
• Dimensionality Reduction
• Maximum Variance
• Minimum Error

Probabilistic PCA (Maximum Likelihood)
• Generative Probabilistic Model
• ML-equivalence to Standard PCA

Bayesian PCA
• Automatic determination of latent space dimension

Generalizations
• Relaxing equal data noise amplitude: Factor analysis
• Relaxing Gaussianity: ICA
• Relaxing linearity: Kernel PCA
Factor Analysis: A non-spherical PCA
$$\mathbf{x} = W\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon}, \quad \text{with } p(\boldsymbol{\epsilon}) = \mathcal{N}(\boldsymbol{\epsilon} \mid \mathbf{0}, \Psi),\; \Psi \text{ diagonal}$$

• Noise is still independent and Gaussian, but may have a different amplitude in each data dimension
• Controversy: Do the factors (dimensions of $\mathbf{z}$) have an interpretable meaning?
• Problem: posterior invariant w.r.t. rotations of $W$ in latent space
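The rotation problem noted above follows from the marginal covariance depending on $W$ only through $WW^T$; a two-line numpy check (loadings matrix and angle are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
W = rng.normal(size=(5, 2))                      # illustrative factor loadings
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # rotation in latent space

# W W^T + Psi depends on W only through W W^T, so W and W R define
# exactly the same data distribution: the factors are not identifiable
print(np.allclose(W @ W.T, (W @ R) @ (W @ R).T))   # True
```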
Independent Component Analysis (ICA)
$$\mathbf{x} = W\mathbf{z}$$
• Still a linear model of independent components
• No data noise component, for dim(latent space) = dim(data space)
• Explicitly non-Gaussian latent distributions
  • Otherwise, no separation of the mixing coefficients in $W$ from the latent variables would be possible
  • Rotational symmetry of the Gaussian
• Maximization of non-Gaussianity/independence
  • Different criteria, e.g. kurtosis, skewness
  • Minimization of mutual information
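A minimal demonstration of kurtosis-based separation, assuming made-up uniform sources and mixing matrix: after whitening, only a rotation remains, and the demixing angle is found by brute-force maximization of total |kurtosis| (real ICA packages use smarter optimizers such as FastICA):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 20_000

# Two independent uniform (sub-Gaussian) sources; Gaussian sources could
# not be separated because of the rotational symmetry noted above
sources = rng.uniform(-1, 1, size=(N, 2))
A = np.array([[1.0, 0.6], [0.4, 1.0]])   # hypothetical mixing matrix
X = sources @ A.T

# Whitening: linearly transform the data so its covariance is the identity
Xc = X - X.mean(axis=0)
d, E = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = Xc @ E / np.sqrt(d)

def excess_kurtosis(y):
    return np.mean(y ** 4) / np.mean(y ** 2) ** 2 - 3.0

# After whitening, only a rotation is undetermined; choose the angle that
# maximizes total non-Gaussianity, measured here by |kurtosis|
def total_kurt(a):
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    Y = Z @ R
    return abs(excess_kurtosis(Y[:, 0])) + abs(excess_kurtosis(Y[:, 1]))

best = max(np.linspace(0.0, np.pi / 2, 181), key=total_kurt)
R = np.array([[np.cos(best), -np.sin(best)], [np.sin(best), np.cos(best)]])
Y = Z @ R   # estimated independent components (up to permutation/sign/scale)
```

Each column of `Y` should correlate almost perfectly with one of the original sources, which is exactly the permutation/sign ambiguity ICA is known for.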
ICA vs PCA
• ICA rewards bi-modality of the projected distribution
• PCA rewards maximum variance of the projection

Figure: 1st principal component (PCA) vs. 1st independent component (ICA)

Unsupervised methods: no class labels!
Summary
Parameter estimation:
• Heuristic quadratic cost function (minimum-error projection)
• Probabilistic (Maximum Likelihood projection matrix)
• Bayesian (hyperparameters of projection vectors)

Generative probabilistic process in latent space:
• Standardized normal distribution (PCA)
• Standardized normal distribution (Factor Analysis)
• Independent probabilistic process for each dimension (ICA)

Noise in data space:
• Spherical Gaussian (PCA)
• Gaussian (Factor Analysis)
• None (ICA)

Feature mapping (latent to data space):
• Linear: PCA, ICA, Factor Analysis
• Nonlinear: Kernel PCA
Relation To Other Topics
• Today: back to
  • Data preprocessing
    • Whitening: transforming the data so that its covariance becomes the identity
  • Data representation/feature extraction
  • “Model-free” analysis
    • Well: NO! We have seen the model assumptions in probabilistic PCA
  • Dimensionality reduction
    • Via projection onto the basis vectors carrying the most variance/leaving the smallest error
    • At least for linear models, not for kernel PCA
  • The matrix
Kernel PCA
$$C = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^{T} \quad\longrightarrow\quad \bar{C} = \frac{1}{N}\sum_{n=1}^{N} \Phi(\mathbf{x}_n)\,\Phi(\mathbf{x}_n)^{T}$$

• Instead of the sample covariance matrix, we now consider a covariance matrix in a feature space
• As always, the kernel trick of not computing in the high-dimensional feature space works, because the covariance matrix only needs scalar products of the feature vectors, i.e. the kernel $k(\mathbf{x}_n, \mathbf{x}_m) = \Phi(\mathbf{x}_n)^{T}\Phi(\mathbf{x}_m)$
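In practice this means eigendecomposing the centred $N \times N$ Gram matrix instead of a feature-space covariance. A numpy sketch with a Gaussian kernel; the two-ring data set and the kernel width $s = 1$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up data: two noisy concentric rings in 2D
t = rng.uniform(0, 2 * np.pi, 200)
r = np.repeat([1.0, 3.0], 100)
X = np.c_[r * np.cos(t), r * np.sin(t)] + 0.05 * rng.normal(size=(200, 2))

# Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 s^2)), here s = 1
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

# Centre the Gram matrix in feature space:
# K~ = K - 1_N K - K 1_N + 1_N K 1_N, with (1_N)_ij = 1/N
N = len(X)
one_N = np.full((N, N), 1.0 / N)
Kc = K - one_N @ K - K @ one_N + one_N @ K @ one_N

# Eigenvectors of the centred Gram matrix yield the kernel principal
# components without ever working in feature space explicitly
eigvals, eigvecs = np.linalg.eigh(Kc)
a1 = eigvecs[:, -1] / np.sqrt(eigvals[-1])   # unit-norm feature-space eigenvector
proj1 = Kc @ a1                              # 1st kernel PCA projection
```

The projection of a point is a weighted sum of kernel evaluations against the training set, which is why no explicit $\Phi(\mathbf{x})$ is ever computed.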
Kernel PCA – Example: Gaussian kernel
• Kernel PCA does not enable dimensionality reduction in data space
• $\Phi(\text{data space})$ is a manifold in feature space, not a linear subspace
• The PCA projects onto linear subspaces in feature space
• These projected elements typically do not lie on the manifold $\Phi(\text{data space})$, so their pre-images will not exist in data space