-
Introduction · Penalized Matrix Decomposition · Sparse Hierarchical Clustering

A penalized matrix decomposition, and its applications

Daniela M. Witten
Thesis Defense
Department of Statistics, Stanford University
June 7, 2010
Daniela M. Witten A penalized matrix decomposition
-
Sparsity
- Consider a high-dimensional regression problem: we wish to predict y ∈ R^n using X ∈ R^{n×p}, where p may be quite large.
- We can use an L1 or lasso penalty to fit the model y = Xβ + ε in a way that gives sparsity:

      β̂ = argmin_β { ||y − Xβ||² + λ||β||₁ }

- An active area of research: Lasso (Tibshirani 1996), Basis Pursuit (Chen, Donoho, and Saunders 1998), LARS (Efron, Hastie, Johnstone, and Tibshirani 2004), Adaptive Lasso (Zou 2006), Group Lasso (Yuan and Lin 2006), Dantzig selector (Candes and Tao 2007), Relaxed Lasso (Meinshausen 2008)
- Today: sparsity in the unsupervised setting.
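The lasso criterion above can be minimized by iterating a gradient step and a soft-thresholding step. The following is a minimal numpy sketch of that idea (ISTA-style proximal gradient; the variable names and the simulated data are illustrative, not from the talk):

```python
import numpy as np

def soft_threshold(a, t):
    # S(a, t) = sign(a) * max(|a| - t, 0), applied elementwise
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize ||y - X beta||^2 + lam * ||beta||_1 by proximal gradient."""
    p = X.shape[1]
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(100)

beta_hat = lasso_ista(X, y, lam=10.0)
# most coefficients come out exactly zero; the three true signals survive
```

The soft-thresholding step is what produces exact zeros, which is the point of the L1 penalty.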
-
Matrix Decompositions
Consider an n × p matrix X for which we want a low-rank approximation. For simplicity, assume that the row and column means of X are zero.
-
Matrix Decompositions
We might want this low-rank approximation in order to

1. obtain a lower-dimensional projection of the data that captures most of the variability, or
2. achieve a better understanding and interpretation of the data, or
3. impute missing values, e.g. for movie recommender systems.
-
The singular value decomposition
We decompose the matrix X as

    X = U D V^T,

where U and V have orthonormal columns and D is diagonal, with d₁ ≥ d₂ ≥ ... ≥ d_p ≥ 0.
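These properties are easy to verify concretely. A small numpy check (column-centering only, for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 5))
X -= X.mean(axis=0)  # center the columns, as assumed on the previous slide

U, d, Vt = np.linalg.svd(X, full_matrices=False)
# U and V have orthonormal columns, d is nonincreasing and nonnegative,
# and U @ diag(d) @ V^T reproduces X exactly
```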
-
A sparse matrix decomposition
The SVD has many useful and interesting properties, but in general the columns of U and V are not sparse; that is, no elements of U and V are exactly zero.

We want a matrix decomposition with sparse elements, for conciseness, parsimony, and interpretability.
-
Example of the sparse matrix decomposition: Netflix Data
-
Netflix recommendations
-
Netflix Data
Over 100 million ratings given by 480,000 users to 18,000 movies.
-
Netflix Data
-
Netflix Results
“Lord of the Rings: The Fellowship of the Ring”
“Lord of the Rings: The Two Towers: Extended Edition”
“Lord of the Rings: The Fellowship of the Ring: Extended Edition”
“Lord of the Rings: The Two Towers”
“Lord of the Rings: The Return of the King”
“Lord of the Rings: The Return of the King: Extended Edition”
“Star Wars: Episode V: The Empire Strikes Back”
“Star Wars: Episode VI: Return of the Jedi”
“Star Wars: Episode IV: A New Hope”
“Raiders of the Lost Ark”
-
Applications of the penalized matrix decomposition
Input matrix               Result
Data                       data interpretation; missing value imputation / matrix completion
Variance-covariance        sparse PCA
Cross-products             sparse CCA
Dissimilarity              sparse clustering
Between-class covariance   sparse LDA
-
Criterion for the Singular Value Decomposition
Recall that the first components u, v, and d of the SVD comprise the best rank-one approximation to the matrix X, in the sense of the Frobenius norm:

    minimize_{u,v,d} ||X − d u v^T||²_F  subject to  ||u||₂ = 1, ||v||₂ = 1.
-
Criterion for the Penalized Matrix Decomposition
Suppose we add penalty terms to that criterion:

    minimize_{u,v,d} ||X − d u v^T||²_F  subject to  ||u||₂ = ||v||₂ = 1, P₁(u) ≤ c₁, P₂(v) ≤ c₂,

where P₁ and P₂ are arbitrary penalty functions. We call this the rank-one penalized matrix decomposition (PMD).

For now, let P₁(u) = ||u||₁ and P₂(v) = ||v||₁. This choice encourages sparsity: the sparse matrix decomposition.

This is related to a proposal of Shen and Huang (2008).
-
More on the Rank-One PMD Model
- Note that the u and v that minimize

      ||X − d u v^T||²_F  subject to  ||u||₂ = ||v||₂ = 1

  also maximize

      u^T X v  subject to  ||u||₂ ≤ 1, ||v||₂ ≤ 1.

- This means that we can re-write the rank-one PMD criterion as

      maximize_{u,v}  u^T X v  subject to  ||u||₂ ≤ 1, ||v||₂ ≤ 1, ||u||₁ ≤ c₁, ||v||₁ ≤ c₂.

- With u fixed, the criterion is convex in v, and with v fixed, it is convex in u. This biconvexity leads to a convenient iterative algorithm!
-
Algorithm for Sparse Matrix Decomposition
1. Initialize v to satisfy the constraints ||v||₂ = 1, ||v||₁ ≤ c₂.
2. Iterate until convergence:
   - u ← argmax_u u^T X v  subject to  ||u||₁ ≤ c₁, ||u||₂ ≤ 1.
   - v ← argmax_v u^T X v  subject to  ||v||₁ ≤ c₂, ||v||₂ ≤ 1.

For c₁ and c₂ sufficiently small, the resulting u and v will be sparse. In the absence of L1 penalties, this yields the rank-one SVD.
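A minimal numpy sketch of this alternating algorithm. Each update is a soft-thresholding of Xv (or X^T u) followed by rescaling to unit L2 norm, with the threshold found by binary search when the L1 constraint is active; the function names, initialization at the leading singular vector, and iteration count are my choices, not details from the talk:

```python
import numpy as np

def soft(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def unit_l1_argmax(a, c, tol=1e-8):
    """argmax_u u'a subject to ||u||_2 <= 1, ||u||_1 <= c.
    Soft-threshold a, then rescale to unit L2 norm; binary-search the
    threshold when the unpenalized solution violates the L1 constraint."""
    u = a / np.linalg.norm(a)
    if np.sum(np.abs(u)) <= c:
        return u
    lo, hi = 0.0, np.max(np.abs(a))
    while hi - lo > tol:
        mid = (lo + hi) / 2
        u = soft(a, mid)
        u /= np.linalg.norm(u)
        if np.sum(np.abs(u)) > c:
            lo = mid
        else:
            hi = mid
    return u

def pmd_rank1(X, c1, c2, n_iter=200):
    v = np.linalg.svd(X, full_matrices=False)[2][0]  # start at top right s.v.
    for _ in range(n_iter):
        u = unit_l1_argmax(X @ v, c1)   # update u with v fixed
        v = unit_l1_argmax(X.T @ u, c2) # update v with u fixed
    return u, v, u @ X @ v

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 8))
u, v, d = pmd_rank1(X, c1=1.5, c2=1.5)
```

With c₁ and c₂ large enough that the L1 constraints never bind, the iterations reduce to the power method and recover the rank-one SVD.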
-
Soft-thresholding
To update u with v held fixed, we must solve

    u ← argmax_u u^T X v  subject to  ||u||₁ ≤ c₁, ||u||₂ ≤ 1.

It turns out that the solution simply involves soft-thresholding:

    u = S(Xv, Δ) / ||S(Xv, Δ)||₂,

where S(a, Δ) = sgn(a) max(0, |a| − Δ), applied elementwise, and Δ ≥ 0 is the smallest value for which the L1 constraint holds.
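The operator S itself is two lines of numpy (a sketch; the example vector is made up):

```python
import numpy as np

def soft_threshold(a, delta):
    # S(a, delta) = sgn(a) * max(0, |a| - delta), applied elementwise
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

a = np.array([3.0, -1.5, 0.4, -0.2])
su = soft_threshold(a, 0.5)   # entries below the threshold become exactly zero
u = su / np.linalg.norm(su)   # then rescale to unit L2 norm
```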
-
L1 and L2 penalties
Video of L1 and L2 penalties
-
L1 and L2 penalties
The story in three dimensions
-
Algorithm in action
-
Algorithm in action: Update u
-
Algorithm in action: Update v
-
Extension to Rank-K Decomposition
- To get the rank-K decomposition, we simply subtract out the rank-(K−1) decomposition from the original data matrix X, and apply the rank-1 decomposition to the residuals.
- In the absence of L1 penalties, this gives the rank-K SVD.
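The unpenalized case is easy to check in numpy. In this sketch the plain leading SVD component stands in for the rank-1 fit (since there are no L1 penalties), and deflation recovers the truncated SVD:

```python
import numpy as np

def best_rank1(R):
    # rank-1 PMD with no L1 penalties = leading SVD component
    U, d, Vt = np.linalg.svd(R, full_matrices=False)
    return d[0] * np.outer(U[:, 0], Vt[0])

def rank_k_by_deflation(X, K):
    approx = np.zeros_like(X)
    for _ in range(K):
        approx += best_rank1(X - approx)  # fit rank-1 to the residuals
    return approx

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 5))

U, d, Vt = np.linalg.svd(X, full_matrices=False)
svd_rank2 = U[:, :2] @ np.diag(d[:2]) @ Vt[:2]   # truncated SVD, rank 2

deflate_rank2 = rank_k_by_deflation(X, 2)        # same matrix, via deflation
```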
-
Selection of tuning parameters c1 and c2
- Selection of tuning parameters in unsupervised problems is a very difficult problem.
- We leave out scattered elements of X and choose the tuning parameters such that our low-rank approximation to X optimally estimates the left-out elements.
- Closely related to proposals by Owen and Perry (2009) and Wold (1978).
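One way to picture the scheme, as a simplified numpy sketch: scattered entries are held out and filled in with the grand mean, a plain rank-1 SVD stands in for the PMD fit, and the error on the held-out cells is the quantity one would compare across candidate (c₁, c₂) values. This is an illustration of the element-holdout idea only, not the exact procedure from the talk:

```python
import numpy as np

rng = np.random.default_rng(4)
# a noisy rank-1 matrix, so a rank-1 fit is appropriate
X = np.outer(rng.standard_normal(30), rng.standard_normal(10)) \
    + 0.1 * rng.standard_normal((30, 10))

held_out = rng.random(X.shape) < 0.2      # leave out ~20% scattered elements
X_obs = X.copy()
X_obs[held_out] = X[~held_out].mean()     # crude fill-in for the held-out cells

# low-rank fit to the observed (filled-in) matrix; in the talk this slot
# would hold the PMD at each candidate (c1, c2)
U, d, Vt = np.linalg.svd(X_obs, full_matrices=False)
X_hat = d[0] * np.outer(U[:, 0], Vt[0])

cv_err = np.mean((X_hat[held_out] - X[held_out]) ** 2)
baseline = np.mean((X_obs[held_out] - X[held_out]) ** 2)
# a good fit predicts the held-out entries much better than the mean fill-in
```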
-
Applications of the penalized matrix decomposition
Input matrix               Result
Data                       data interpretation; missing value imputation / matrix completion
Variance-covariance        sparse PCA
Cross-products             sparse CCA
Dissimilarity              sparse clustering
Between-class covariance   sparse LDA
-
Example of Sparse Matrix Decomposition: Netflix Data
-
Netflix Data: Factor 1 - All movies have negative weights
“Lord of the Rings: The Fellowship of the Ring”
“Lord of the Rings: The Two Towers: Extended Edition”
“Lord of the Rings: The Fellowship of the Ring: Extended Edition”
“Lord of the Rings: The Two Towers”
“Lord of the Rings: The Return of the King”
“Lord of the Rings: The Return of the King: Extended Edition”
“Star Wars: Episode V: The Empire Strikes Back”
“Star Wars: Episode VI: Return of the Jedi”
“Star Wars: Episode IV: A New Hope”
“Raiders of the Lost Ark”
-
Netflix Data: Factor 5 - Movies with positive weights
“Austin Powers in Goldmember”
“Austin Powers: International Man of Mystery”
“Austin Powers: The Spy Who Shagged Me”
“The Nutty Professor”
“Big Mommas House”
“Wild Wild West”
“Dodgeball: A True Underdog Story”
“Anchorman: The Legend of Ron Burgundy”
“Mr. Deeds”
“Punch-Drunk Love”
“Anger Management”
“Moulin Rouge”
“Spaceballs”
-
Netflix Data: Factor 5 - Movies with negative weights
“Star Wars: Episode V: The Empire Strikes Back”
“Lord of the Rings: The Two Towers: Extended Edition”
“Lord of the Rings: The Fellowship of the Ring: Extended Edition”
“Lord of the Rings: The Return of the King: Extended Edition”
“Raiders of the Lost Ark”
“The Silence of the Lambs”
“Rain Man”
“We Were Soldiers”
“The Godfather”
“The Shawshank Redemption: Special Edition”
“Saving Private Ryan”
“E.T. the Extra-Terrestrial: The 20th Anniversary (Rerelease)”
“Finding Nemo (Widescreen)”
-
Applications of the penalized matrix decomposition
Input matrix               Result
Data                       data interpretation; missing value imputation / matrix completion
Variance-covariance        sparse PCA
Cross-products             sparse CCA
Dissimilarity              sparse clustering
Between-class covariance   sparse LDA
-
Hierarchical clustering
There has been a resurgence of interest in hierarchical clustering in the field of genomics.
-
Clustering when p ≫ n

Suppose we wish to cluster n observations on p features, where p ≫ n.

- Hierarchical clustering is very subjective: the answer you get depends on what set of features you use. We want a principled way to choose a set of features to use in clustering.
- If the true classes that we wish to identify are defined on only a subset of the features, then the presence of noise features can obscure this signal. We want a way to adaptively choose the signal features to use in clustering.
-
Example
A simple example with 10 observations; 2 classes are defined on 10 important features.
-
Example: 10 important features; 10 features total
-
Example: 10 important features; 500 features total
-
Example: 10 important features; 5000 features total
-
Sparse hierarchical clustering results: 10 important features; 5000 features total
-
Sparse Clustering
We want a method to hierarchically cluster observations based on a small subset of the features; we will call this sparse hierarchical clustering.

We want an automated way to

- find a subset of features to use in the clustering, and
- obtain a more accurate or interesting clustering using that subset of features.

Assumption: the dissimilarity measure used is additive in the features: D_{i,i′} = Σ_{j=1}^{p} d_{i,i′,j}.
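Squared Euclidean distance satisfies this additivity assumption directly; a quick numpy check (the simulated data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((5, 4))   # n = 5 observations, p = 4 features

# d_{i,i',j}: dissimilarity between observations i and i' on feature j alone
d_per_feature = (X[:, None, :] - X[None, :, :]) ** 2   # shape (n, n, p)

# the overall dissimilarity D_{i,i'} is the sum over the p features
D = d_per_feature.sum(axis=2)
```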
-
Dissimilarity matrix for the n observations
-
Dissimilarity matrix is a sum of dissimilarity matrices over the features
-
Hierarchical clustering sums the dissimilarity matrices for the features
-
Weighted sum of the dissimilarity matrices for the features
-
Sparse hierarchical clustering and the PMD
Let D denote the n² × p matrix whose column j is the vectorized feature-wise dissimilarity matrix for feature j.

Then, suppose we apply the PMD to D:

    maximize_{u,w}  u^T D w  subject to  ||u||₂ ≤ 1, ||w||₂ ≤ 1, ||w||₁ ≤ s

- w_j is a weight on the dissimilarity matrix for feature j.
- w_j ≥ 0 occurs naturally (we assume that D_{i,i′} ≥ 0).
- If we re-arrange the elements of Dw into an n × n matrix, then performing hierarchical clustering on this re-weighted dissimilarity matrix gives sparse hierarchical clustering.
- If w₁ = ... = w_p, then this gives standard hierarchical clustering.
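A small numpy sketch of the construction. The w-update shown is one hypothetical step of the alternating PMD algorithm for a fixed u of my choosing, with a one-sided soft-threshold so the weights stay nonnegative; the threshold value 0.5 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 6, 3
X = rng.standard_normal((n, p))

# column j of D is the vectorized n x n dissimilarity matrix for feature j
d_per_feature = (X[:, None, :] - X[None, :, :]) ** 2   # shape (n, n, p)
D = d_per_feature.reshape(n * n, p)

# with equal weights, Dw reshapes to the standard overall dissimilarity matrix
w_equal = np.ones(p)
standard = (D @ w_equal).reshape(n, n)

# one PMD-style update of w for a fixed u: nonnegative soft-threshold of D'u,
# rescaled to unit L2 norm
u = np.ones(n * n) / np.sqrt(n * n)                    # a feasible unit-norm u
w = np.maximum(D.T @ u - 0.5, 0.0)
w /= np.linalg.norm(w)
reweighted = (D @ w).reshape(n, n)   # input for sparse hierarchical clustering
```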
-
Sparse hierarchical clustering in action
A simulated example with 6 classes defined on 200 signal features; 2000 features in total.
[Figure: dendrograms under "Standard Clustering" and "Sparse Clustering", and a plot of the weight vector w (y-axis "W", x-axis "Index", features 0–2000).]
-
An important breast cancer paper
Nature (2000) 406:747-752.
-
Breast cancer data
- 65 breast tumor samples for which gene expression data is available. Some samples are replicates from the same tumor (before and after chemo).
- Clustered based on the full set of 1753 genes first.
- Clustered based on 496 intrinsic genes, for which the variation between different tumors is large relative to the variation within a tumor.
- Based on the intrinsic gene clustering, determined that 62 of 65 tumors fall into one of four classes: normal-breast-like, basal-like, ER+, Erb-B2+.
-
Clustering using intrinsic genes: normal-breast-like, basal-like, ER+, Erb-B2+
[Figure: dendrograms for "All Samples" and for the "62 Samples", clustered using the intrinsic genes.]
-
Sparse clustering
We wonder: if we sparsely cluster the 62 observations using all of the genes, can we identify the four classes successfully?

Three types of clustering:

1. Standard hierarchical clustering using all 1753 genes.
2. Sparse hierarchical clustering of all 1753 genes, with the tuning parameter chosen to yield 496 genes.
3. Standard hierarchical clustering using the 496 genes with highest marginal variance.
-
normal-breast-like, basal-like, ER+, Erb-B2+

[Figure: dendrograms for "Standard Clust: All 1753 Genes", "Sparse Clust: 496 Non-Zero Genes", and "Standard Clust: 496 High-Var. Genes".]
-
Genes with high weights
#   Gene                                                     Weight
1   S100 CALCIUM-BINDING PROTEIN A8 (CALGRANULIN A)          0.223
2   SECRETED FRIZZLED-RELATED PROTEIN 1                      0.2126
3   ESTROGEN RECEPTOR 1                                      0.2076
4   KERATIN 17                                               0.1627
5   HUMAN REARRANGED IMMUNOGLOBULIN LAMBDA                   0.1568
6   CYTOCHROME P450, SUBFAMILY IIA                           0.155
7   APOLIPOPROTEIN D                                         0.1509
8   LACTOTRANSFERRIN                                         0.1471
9   ESTROGEN RECEPTOR 1                                      0.1405
10  134783                                                   0.14
11  HEPATOCYTE NUCLEAR FACTOR 3, ALPHA                       0.1332
12  HUMAN REARRANGED IMMUNOGLOBULIN LAMBDA LIGHT             0.1309
13  FATTY ACID BINDING PROTEIN 4, ADIPOCYTE                  0.1292
14  CERULOPLASMIN (FERROXIDASE)                              0.126
15  HUMAN SECRETORY PROTEIN (P1.B) MRNA                      0.1208
16  NON-SPECIFIC CROSS REACTING ANTIGEN                      0.1199
17  LIPOPROTEIN LIPASE                                       0.1123
18  IMMUNOGLOBULIN LAMBDA LIGHT CHAIN                        0.112
19  CRYSTALLIN, ALPHA B                                      0.1108
20  FATTY ACID BINDING PROTEIN 4, ADIPOCYTE                  0.11
21  PLEIOTROPHIN (HEPARIN BINDING GROWTH FACTOR 8)           0.1099
22  85660                                                    0.1077
23  ESTS, HIGHLY SIMILAR TO PROBABLE ATAXIA-TELANGIECTASIA   0.1071
24  V-FOS FBJ MURINE OSTEOSARCOMA VIRAL ONCOGENE HOMOLOG     0.1056
25  EPIDIDYMIS-SPECIFIC, WHEY-ACIDIC PROTEIN TYPE            0.1013
26  ALDO-KETO REDUCTASE FAMILY 1, MEMBER C1                  0.1007