TRANSCRIPT
1 / 36
Week 4, Lecture 7 - Dimensionality Reduction: PCA and NMF
Aaron Meyer
2 / 36
Outline
- Administrative Issues
- Decomposition methods
  - Factor analysis
  - Principal components analysis
  - Non-negative matrix factorization
3 / 36
Dealing with many variables
- So far we've largely concentrated on cases in which we have relatively large numbers of measurements for a few variables
  - This is frequently referred to as n > p
- Two other extremes are important
  - Many observations and many variables
  - Many variables but few observations (p > n)
4 / 36
Dealing with many variables
- Usually when we're dealing with many variables, we don't have a great understanding of how they relate to each other
  - E.g., if gene X is high, we can't be sure that gene Y will be too
- If we had these relationships, we could reduce the data
  - E.g., if we had variables to tell us it's 3 pm in Los Angeles, we don't need one to say it's daytime
5 / 36
Dimensionality Reduction
Generate a low-dimensional encoding of a high-dimensional space

Purposes:
- Data compression / visualization
- Robustness to noise and uncertainty
- Potentially easier to interpret

Bonus: many of the other methods from the class can be applied after dimensionality reduction with little or no adjustment!
6 / 36
Matrix Factorization
Many (most?) dimensionality reduction methods involve matrix factorization

Basic idea: find two (or more) matrices whose product best approximates the original matrix

Low-rank approximation to the original N × M matrix:

X ≈ WHᵀ

where W is N × R, Hᵀ is R × M, and R ≪ N.
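One way to see this concretely: the truncated SVD gives the best rank-R approximation in the Frobenius norm. A minimal numpy sketch with toy data (the array sizes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))   # toy N x M data matrix

R = 5                            # chosen rank, R << N
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = U[:, :R] * s[:R]             # N x R
Ht = Vt[:R, :]                   # R x M
X_approx = W @ Ht                # best rank-R approximation in Frobenius norm

print(X_approx.shape)            # (100, 20)
```

The leftover error is exactly the energy in the discarded singular values, which is one practical way to pick R.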
7 / 36
Matrix Factorization
Generalization of many methods (e.g., SVD, QR, CUR, truncated SVD, etc.)
8 / 36
Aside - What should R be?
X ≈ WHᵀ

where W is N × R, Hᵀ is R × M, and R ≪ N.
9 / 36
Matrix Factorization
Matrix factorization is also compression
Figure: http://www.aaronschlegel.com/image-compression-principal-component-analysis/
10 / 36
Factor Analysis
Matrix factorization is also compression
Figure: http://www.aaronschlegel.com/image-compression-principal-component-analysis/
11 / 36
Factor Analysis
Matrix factorization is also compression
Figure: http://www.aaronschlegel.com/image-compression-principal-component-analysis/
12 / 36
Examples from bioengineering
Process control
- Large bioreactor runs may be recorded in a database, along with a variety of measurements from those runs
- We may be interested in how those different runs varied, and how each factor relates to the others
- Plotting a compressed version of that data can indicate when an anomalous change is present
13 / 36
Examples from bioengineering
Mutational processes
- Anytime multiple contributory factors give rise to a phenomenon, matrix factorization can separate them out
- Will talk about this in greater detail

Cell heterogeneity
- Enormous interest in understanding how cells are similar or different
- The answer can take millions of different forms
- But cells often follow programs
14 / 36
Principal Components Analysis
Application of matrix factorization
- Each principal component (PC) is a linear combination of the original attributes / features
- The PCs are uncorrelated and ordered in terms of variance
- The kth PC is orthogonal to all previous PCs
- Reduce dimensionality while maintaining maximal variance
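These properties can be checked numerically. A short sketch using the eigendecomposition of the covariance matrix (toy data; numpy only, not the sklearn implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy features

Xc = X - X.mean(axis=0)              # center the data
C = np.cov(Xc, rowvar=False)         # feature covariance matrix
evals, evecs = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1]      # reorder so PC1 has the most variance
evals, evecs = evals[order], evecs[:, order]

scores = Xc @ evecs                  # PC scores (projections onto the PCs)
print(np.round(np.var(scores, axis=0, ddof=1), 3))  # variances match evals, descending
```

The score columns come out uncorrelated with variances equal to the eigenvalues, which is exactly the "ordered, orthogonal" structure on this slide.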
15 / 36
Principal Components Analysis
BOARD
16 / 36
Methods to calculate PCA
- Iterative computation
  - More robust with high numbers of variables
  - Slower to calculate
- NIPALS (non-linear iterative partial least squares)
  - Able to efficiently calculate a few PCs at once
  - Breaks down for high numbers of variables (large p)
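A sketch of the NIPALS idea: extract one PC at a time by alternating score and loading updates, then deflate the matrix and repeat (an illustrative implementation, not the one used by any particular library):

```python
import numpy as np

def nipals_pca(X, n_components, n_iter=500, tol=1e-8):
    # NIPALS sketch: one PC at a time via alternating updates
    X = X - X.mean(axis=0)               # center the data
    scores, loadings = [], []
    for _ in range(n_components):
        t = X[:, [0]]                    # initialize scores from a column
        for _ in range(n_iter):
            p = X.T @ t / (t.T @ t)      # update loadings
            p = p / np.linalg.norm(p)    # normalize loadings
            t_new = X @ p                # update scores
            if np.linalg.norm(t_new - t) < tol:
                t = t_new
                break
            t = t_new
        X = X - t @ p.T                  # deflate: remove this PC's contribution
        scores.append(t)
        loadings.append(p)
    return np.hstack(scores), np.hstack(loadings)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
T, P = nipals_pca(X, 2)
print(T.shape, P.shape)                  # (50, 2) (6, 2)
```

Because it only ever touches one component at a time, this is efficient when you want just a few PCs, matching the bullet above.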
17 / 36
Practical Notes
PCA
- Implemented within sklearn.decomposition.PCA
- PCA.fit_transform(X) fits the model to X, then provides the data in principal component space
- PCA.components_ provides the "loadings matrix", or directions of maximum variance
- PCA.explained_variance_ provides the amount of variance explained by each component
18 / 36
PCA
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

X = iris.data
y = iris.target
target_names = iris.target_names

pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

# Print PC1 loadings (components_ is components x features)
print(pca.components_[0, :])
# ...
19 / 36
PCA
# ...
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

# Print PC1 loadings (components_ is components x features)
print(pca.components_[0, :])

# Print PC1 scores
print(X_r[:, 0])

# Percentage of variance explained for each component
print(pca.explained_variance_ratio_)
# [0.92461621 0.05301557]
20 / 36
PCA
21 / 36
Non-negative matrix factorization
- Like PCA, except the coefficients in the linear combination must be non-negative
- Forcing positive coefficients implies an additive combination of basis parts to reconstruct the whole
- Generally leads to zeros for factors that don't contribute
22 / 36
Non-negative matrix factorization
The answer you get will always depend on the error metric, starting point, and search method

BOARD
23 / 36
What is significant about this?
- The update rule is multiplicative instead of additive
- If the initial values for W and H are non-negative, then W and H can never become negative
  - This guarantees a non-negative factorization
- Will converge to a local minimum of the reconstruction error
  - Therefore the starting point matters
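The multiplicative update can be sketched in a few lines. This assumes the Lee & Seung updates for the Frobenius (squared-error) objective, with toy data and illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((30, 10))          # data must be non-negative

R = 4
W = rng.random((30, R))           # non-negative starting point
H = rng.random((R, 10))

eps = 1e-12                       # guard against division by zero
err0 = np.linalg.norm(X - W @ H)
for _ in range(200):
    # Multiplicative updates: each factor is scaled by a ratio of
    # non-negative terms, so W and H can never become negative
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)

err = np.linalg.norm(X - W @ H)
print(err < err0)                 # True: the error is non-increasing
```

Rerunning with a different random seed generally gives a different (W, H) pair, which is the "starting point matters" bullet in action.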
24 / 36
Non-negative matrix factorization
The answer you get will always depend on the error metric, starting point, and search method
- Another approach is to follow the gradient across all the variables in the matrix
  - Called coordinate descent, and is usually faster
  - Not going to go through the implementation
  - Will also converge to a local minimum
25 / 36
NMF application: Netflix
Suppose Alice rates Inception 4 stars. What led to this rating?
26 / 36
NMF application: Netflix

480,000 users × 17,700 movies
Test data: most recent ratings
What do you think is done with missing values?
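One common answer is to fit only the observed entries. A sketch of masked (weighted) multiplicative updates, where a binary mask zeros out the missing ratings (synthetic data; an illustration, not the actual Netflix Prize solution):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((20, 3)) @ rng.random((3, 15))  # synthetic rank-3 "ratings" matrix
M = rng.random(X.shape) < 0.7                  # True where a rating is observed

R = 3
W = rng.random((20, R))
H = rng.random((R, 15))
eps = 1e-12
Xm = np.where(M, X, 0.0)                       # missing entries contribute nothing

err0 = np.linalg.norm(M * (X - W @ H))
for _ in range(500):
    # Weighted multiplicative updates: only observed entries drive the fit
    H *= (W.T @ Xm) / (W.T @ (M * (W @ H)) + eps)
    W *= (Xm @ H.T) / ((M * (W @ H)) @ H.T + eps)

err = np.linalg.norm(M * (X - W @ H))
print(err < err0)                              # True: observed-entry error drops
```

The fitted W @ H then provides predictions for the entries that were never observed, which is the whole point of the recommendation setting.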
27 / 36
NMF application: Netflix
- More generally, additional baseline predictors include:
  - A factor that allows Alice's rating to (linearly) depend on the (square root of the) number of days since her first rating. (For example, have you ever noticed that you become a harsher critic over time?)
  - A factor that allows Alice's rating to depend on the number of days since the movie's first rating by anyone. (If you're one of the first people to watch it, maybe it's because you're a huge fan and really excited to see it on DVD, so you'll tend to rate it higher.)
  - A factor that allows Alice's rating to depend on the number of people who have rated Inception. (Maybe Alice is a hipster who hates being part of the crowd.)
  - A factor that allows Alice's rating to depend on the movie's overall rating.
  - (Plus a bunch of others.)
28 / 36
NMF application: Netflix
And, in fact, modeling these biases turned out to be fairly important: in their paper describing their final solution to the Netflix Prize, Bell and Koren write that:

Of the numerous new algorithmic contributions, I would like to highlight one – those humble baseline predictors (or biases), which capture main effects in the data. While the literature mostly concentrates on the more sophisticated algorithmic aspects, we have learned that an accurate treatment of main effects is probably at least as significant as coming up with modeling breakthroughs.
29 / 36
NMF application: Netflix
30 / 36
NMF application: Mutational Processes in Cancer
Figure: Helleday et al, Nat Rev Gen, 2014
31 / 36
NMF application: Mutational Processes in Cancer
Figure: Helleday et al, Nat Rev Gen, 2014
32 / 36
NMF application: Mutational Processes in Cancer
Figure: Alexandrov et al, Cell Rep, 2013
33 / 36
NMF application: Mutational Processes in Cancer
Figure: Alexandrov et al, Cell Rep, 2013
34 / 36
Practical Notes - NMF
- Implemented within sklearn.decomposition.NMF
  - n_components: number of components
  - init: how to initialize the search
  - solver: 'cd' for coordinate descent, or 'mu' for multiplicative update
  - l1_ratio: can regularize the fit
- Provides:
  - NMF.components_: components × features matrix
  - Returns transformed data through NMF.fit_transform()
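Putting those options together, a minimal usage sketch (toy non-negative data; the parameter values here are illustrative choices, not recommendations):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((50, 12))            # NMF requires non-negative input

model = NMF(n_components=3, init="nndsvda", solver="cd",
            l1_ratio=0.0, max_iter=500, random_state=0)
W = model.fit_transform(X)          # transformed data: samples x components
H = model.components_               # components x features

print(W.shape, H.shape)             # (50, 3) (3, 12)
```

Fixing random_state pins down the initialization, which matters here because, as discussed, the factorization only converges to a local optimum.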
35 / 36
Summary
PCA
- Preserves the covariation within a dataset
- Therefore mostly preserves axes of maximal variation
- Number of components can vary; in practice more than 2 or 3 are rarely helpful

NMF
- Explains the dataset through factoring into two non-negative matrices
- Much more stable and well-specified reconstruction when the assumptions are appropriate
- Excellent for separating out additive factors
36 / 36
Closing
As always, selection of the appropriate method depends upon the question being asked.