-
Introduction · Penalized Matrix Decomposition · Sparse Hierarchical Clustering

A penalized matrix decomposition, and its applications

Daniela M. Witten
Thesis Defense
Department of Statistics, Stanford University
June 7, 2010
Daniela M. Witten A penalized matrix decomposition
-
Sparsity
- Consider a high-dimensional regression problem: we wish to predict y ∈ R^n using X ∈ R^{n×p}, where p may be quite large.
- We can use an L1 or lasso penalty to fit the model y = Xβ + ε in a way that gives sparsity:

      β̂ = argmin_β { ||y − Xβ||² + λ||β||₁ }

- An active area of research: Lasso (Tibshirani 1996), Basis Pursuit (Chen, Donoho, and Saunders 1998), LARS (Efron, Hastie, Johnstone, and Tibshirani 2004), Adaptive Lasso (Zou 2006), Group Lasso (Yuan and Lin 2006), Dantzig selector (Candes and Tao 2007), Relaxed Lasso (Meinshausen 2008)
- Today: sparsity in the unsupervised setting.
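The lasso criterion above can be minimized by iterating a gradient step and a soft-thresholding step. The following is a minimal numpy sketch of that idea (ISTA-style proximal gradient; the variable names and the simulated data are illustrative, not from the talk):

```python
import numpy as np

def soft_threshold(a, t):
    # S(a, t) = sign(a) * max(|a| - t, 0), applied elementwise
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize ||y - X beta||^2 + lam * ||beta||_1 by proximal gradient."""
    p = X.shape[1]
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(100)

beta_hat = lasso_ista(X, y, lam=10.0)
# most coefficients come out exactly zero; the three true signals survive
```

The soft-thresholding step is what produces exact zeros, which is the point of the L1 penalty.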
-
Matrix Decompositions
Consider an n × p matrix X for which we want a low-rank approximation. For simplicity, assume that the row and column means of X are zero.
-
Matrix Decompositions
We might want this low-rank approximation in order to

1. obtain a lower-dimensional projection of the data that captures most of the variability, or
2. achieve a better understanding and interpretation of the data, or
3. impute missing values, e.g. for movie recommender systems.
-
The singular value decomposition
We decompose the matrix X as

    X = U D V^T,

where U and V have orthonormal columns and D is diagonal, with d₁ ≥ d₂ ≥ ... ≥ d_p ≥ 0.
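These properties are easy to verify concretely. A small numpy check (column-centering only, for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 5))
X -= X.mean(axis=0)  # center the columns, as assumed on the previous slide

U, d, Vt = np.linalg.svd(X, full_matrices=False)
# U and V have orthonormal columns, d is nonincreasing and nonnegative,
# and U @ diag(d) @ V^T reproduces X exactly
```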
-
A sparse matrix decomposition
The SVD has many useful and interesting properties, but in general the columns of U and V are not sparse; that is, no elements of U and V are exactly zero.

We want a matrix decomposition with sparse elements, for conciseness, parsimony, and interpretability.
-
Example of the sparse matrix decomposition: Netflix Data
-
Netflix recommendations
-
Netflix Data
Over 100 million ratings given by 480,000 users to 18,000 movies.
-
Netflix Data
-
Netflix Results
“Lord of the Rings: The Fellowship of the Ring”
“Lord of the Rings: The Two Towers: Extended Edition”
“Lord of the Rings: The Fellowship of the Ring: Extended Edition”
“Lord of the Rings: The Two Towers”
“Lord of the Rings: The Return of the King”
“Lord of the Rings: The Return of the King: Extended Edition”
“Star Wars: Episode V: The Empire Strikes Back”
“Star Wars: Episode VI: Return of the Jedi”
“Star Wars: Episode IV: A New Hope”
“Raiders of the Lost Ark”
-
Applications of the penalized matrix decomposition
Input matrix               Result
Data                       data interpretation; missing value imputation / matrix completion
Variance-covariance        sparse PCA
Cross-products             sparse CCA
Dissimilarity              sparse clustering
Between-class covariance   sparse LDA
-
Criterion for the Singular Value Decomposition
Recall that the first components u, v, and d of the SVD comprise the best rank-one approximation to the matrix X, in the sense of the Frobenius norm:

    minimize_{u,v,d} ||X − d u v^T||²_F  subject to  ||u||₂ = 1, ||v||₂ = 1.
-
Criterion for the Penalized Matrix Decomposition
Suppose we add penalty terms to that criterion:

    minimize_{u,v,d} ||X − d u v^T||²_F  subject to  ||u||₂ = ||v||₂ = 1, P₁(u) ≤ c₁, P₂(v) ≤ c₂,

where P₁ and P₂ are arbitrary penalty functions. We call this the rank-one penalized matrix decomposition (PMD).

For now, let P₁(u) = ||u||₁ and P₂(v) = ||v||₁. This choice encourages sparsity: the sparse matrix decomposition.

This is related to a proposal of Shen and Huang (2008).
-
More on the Rank-One PMD Model
- Note that the u and v that minimize

      ||X − d u v^T||²_F  subject to  ||u||₂ = ||v||₂ = 1

  also maximize

      u^T X v  subject to  ||u||₂ ≤ 1, ||v||₂ ≤ 1.

- This means that we can re-write the rank-one PMD criterion as

      maximize_{u,v}  u^T X v  subject to  ||u||₂ ≤ 1, ||v||₂ ≤ 1, ||u||₁ ≤ c₁, ||v||₁ ≤ c₂.

- With u fixed, the criterion is convex in v, and with v fixed, it is convex in u. This biconvexity leads to a convenient iterative algorithm!
-
Algorithm for Sparse Matrix Decomposition
1. Initialize v to satisfy the constraints ||v||₂ = 1, ||v||₁ ≤ c₂.
2. Iterate until convergence:
   - u ← argmax_u u^T X v  subject to  ||u||₁ ≤ c₁, ||u||₂ ≤ 1.
   - v ← argmax_v u^T X v  subject to  ||v||₁ ≤ c₂, ||v||₂ ≤ 1.

For c₁ and c₂ sufficiently small, the resulting u and v will be sparse. In the absence of L1 penalties, this yields the rank-one SVD.
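A minimal numpy sketch of this alternating algorithm. Each update is a soft-thresholding of Xv (or X^T u) followed by rescaling to unit L2 norm, with the threshold found by binary search when the L1 constraint is active; the function names, initialization at the leading singular vector, and iteration count are my choices, not details from the talk:

```python
import numpy as np

def soft(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def unit_l1_argmax(a, c, tol=1e-8):
    """argmax_u u'a subject to ||u||_2 <= 1, ||u||_1 <= c.
    Soft-threshold a, then rescale to unit L2 norm; binary-search the
    threshold when the unpenalized solution violates the L1 constraint."""
    u = a / np.linalg.norm(a)
    if np.sum(np.abs(u)) <= c:
        return u
    lo, hi = 0.0, np.max(np.abs(a))
    while hi - lo > tol:
        mid = (lo + hi) / 2
        u = soft(a, mid)
        u /= np.linalg.norm(u)
        if np.sum(np.abs(u)) > c:
            lo = mid
        else:
            hi = mid
    return u

def pmd_rank1(X, c1, c2, n_iter=200):
    v = np.linalg.svd(X, full_matrices=False)[2][0]  # start at top right s.v.
    for _ in range(n_iter):
        u = unit_l1_argmax(X @ v, c1)   # update u with v fixed
        v = unit_l1_argmax(X.T @ u, c2) # update v with u fixed
    return u, v, u @ X @ v

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 8))
u, v, d = pmd_rank1(X, c1=1.5, c2=1.5)
```

With c₁ and c₂ large enough that the L1 constraints never bind, the iterations reduce to the power method and recover the rank-one SVD.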
-
Soft-thresholding
To update u with v held fixed, we must solve

    u ← argmax_u u^T X v  subject to  ||u||₁ ≤ c₁, ||u||₂ ≤ 1.

It turns out that the solution simply involves soft-thresholding:

    u = S(Xv, Δ) / ||S(Xv, Δ)||₂,

where S(a, Δ) = sgn(a) max(0, |a| − Δ), applied elementwise, and Δ ≥ 0 is the smallest value for which the L1 constraint holds.
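The operator S itself is two lines of numpy (a sketch; the example vector is made up):

```python
import numpy as np

def soft_threshold(a, delta):
    # S(a, delta) = sgn(a) * max(0, |a| - delta), applied elementwise
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

a = np.array([3.0, -1.5, 0.4, -0.2])
su = soft_threshold(a, 0.5)   # entries below the threshold become exactly zero
u = su / np.linalg.norm(su)   # then rescale to unit L2 norm
```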
-
L1 and L2 penalties
Video of L1 and L2 penalties
-
L1 and L2 penalties
The story in three dimensions
-
Algorithm in action
-
Algorithm in action: Update u
-
Algorithm in action: Update v
-
Extension to Rank-K Decomposition
- To get the rank-K decomposition, we simply subtract out the rank-(K−1) decomposition from the original data matrix X, and apply the rank-1 decomposition to the residuals.
- In the absence of L1 penalties, this gives the rank-K SVD.
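The unpenalized case is easy to check in numpy. In this sketch the plain leading SVD component stands in for the rank-1 fit (since there are no L1 penalties), and deflation recovers the truncated SVD:

```python
import numpy as np

def best_rank1(R):
    # rank-1 PMD with no L1 penalties = leading SVD component
    U, d, Vt = np.linalg.svd(R, full_matrices=False)
    return d[0] * np.outer(U[:, 0], Vt[0])

def rank_k_by_deflation(X, K):
    approx = np.zeros_like(X)
    for _ in range(K):
        approx += best_rank1(X - approx)  # fit rank-1 to the residuals
    return approx

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 5))

U, d, Vt = np.linalg.svd(X, full_matrices=False)
svd_rank2 = U[:, :2] @ np.diag(d[:2]) @ Vt[:2]   # truncated SVD, rank 2

deflate_rank2 = rank_k_by_deflation(X, 2)        # same matrix, via deflation
```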
-
Selection of tuning parameters c1 and c2
- Selection of tuning parameters in unsupervised problems is a very difficult problem.
- We leave out scattered elements of X and choose the tuning parameters such that our low-rank approximation to X optimally estimates the left-out elements.
- Closely related to proposals by Owen and Perry (2009) and Wold (1978).
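One way to picture the scheme, as a simplified numpy sketch: scattered entries are held out and filled in with the grand mean, a plain rank-1 SVD stands in for the PMD fit, and the error on the held-out cells is the quantity one would compare across candidate (c₁, c₂) values. This is an illustration of the element-holdout idea only, not the exact procedure from the talk:

```python
import numpy as np

rng = np.random.default_rng(4)
# a noisy rank-1 matrix, so a rank-1 fit is appropriate
X = np.outer(rng.standard_normal(30), rng.standard_normal(10)) \
    + 0.1 * rng.standard_normal((30, 10))

held_out = rng.random(X.shape) < 0.2      # leave out ~20% scattered elements
X_obs = X.copy()
X_obs[held_out] = X[~held_out].mean()     # crude fill-in for the held-out cells

# low-rank fit to the observed (filled-in) matrix; in the talk this slot
# would hold the PMD at each candidate (c1, c2)
U, d, Vt = np.linalg.svd(X_obs, full_matrices=False)
X_hat = d[0] * np.outer(U[:, 0], Vt[0])

cv_err = np.mean((X_hat[held_out] - X[held_out]) ** 2)
baseline = np.mean((X_obs[held_out] - X[held_out]) ** 2)
# a good fit predicts the held-out entries much better than the mean fill-in
```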
-
Applications of the penalized matrix decomposition
Input matrix               Result
Data                       data interpretation; missing value imputation / matrix completion
Variance-covariance        sparse PCA
Cross-products             sparse CCA
Dissimilarity              sparse clustering
Between-class covariance   sparse LDA
-
Example of Sparse Matrix Decomposition: Netflix Data
-
Netflix Data: Factor 1 - All movies have negative weights
“Lord of the Rings: The Fellowship of the Ring”
“Lord of the Rings: The Two Towers: Extended Edition”
“Lord of the Rings: The Fellowship of the Ring: Extended Edition”
“Lord of the Rings: The Two Towers”
“Lord of the Rings: The Return of the King”
“Lord of the Rings: The Return of the King: Extended Edition”
“Star Wars: Episode V: The Empire Strikes Back”
“Star Wars: Episode VI: Return of the Jedi”
“Star Wars: Episode IV: A New Hope”
“Raiders of the Lost Ark”
-
Netflix Data: Factor 5 - Movies with positive weights
“Austin Powers in Goldmember”
“Austin Powers: International Man of Mystery”
“Austin Powers: The Spy Who Shagged Me”
“The Nutty Professor”
“Big Mommas House”
“Wild Wild West”
“Dodgeball: A True Underdog Story”
“Anchorman: The Legend of Ron Burgundy”
“Mr. Deeds”
“Punch-Drunk Love”
“Anger Management”
“Moulin Rouge”
“Spaceballs”
-
Netflix Data: Factor 5 - Movies with negative weights
“Star Wars: Episode V: The Empire Strikes Back”
“Lord of the Rings: The Two Towers: Extended Edition”
“Lord of the Rings: The Fellowship of the Ring: Extended Edition”
“Lord of the Rings: The Return of the King: Extended Edition”
“Raiders of the Lost Ark”
“The Silence of the Lambs”
“Rain Man”
“We Were Soldiers”
“The Godfather”
“The Shawshank Redemption: Special Edition”
“Saving Private Ryan”
“E.T. the Extra-Terrestrial: The 20th Anniversary (Rerelease)”
“Finding Nemo (Widescreen)”
-
Applications of the penalized matrix decomposition
Input matrix               Result
Data                       data interpretation; missing value imputation / matrix completion
Variance-covariance        sparse PCA
Cross-products             sparse CCA
Dissimilarity              sparse clustering
Between-class covariance   sparse LDA
-
Hierarchical clustering
There has been a resurgence of interest in hierarchical clustering in the field of genomics.
-
Clustering when p ≫ n

Suppose we wish to cluster n observations on p features, where p ≫ n.

- Hierarchical clustering is very subjective: the answer you get depends on what set of features you use. We want a principled way to choose a set of features to use in clustering.
- If the true classes that we wish to identify are defined on only a subset of the features, then the presence of noise features can obscure this signal. We want a way to adaptively choose the signal features to use in clustering.
-
Example
A simple example with 10 observations; 2 classes are defined on 10 important features.
-
Example: 10 important features; 10 features total
-
Example: 10 important features; 500 features total
-
Example: 10 important features; 5000 features total
-
Sparse hierarchical clustering results: 10 important features; 5000 features total
-
Sparse Clustering
We want a method to hierarchically cluster observations based on a small subset of the features; we will call this sparse hierarchical clustering.

We want an automated way to

- find a subset of features to use in the clustering, and
- obtain a more accurate or interesting clustering using that subset of features.

Assumption: the dissimilarity measure used is additive in the features: D_{i,i′} = Σ_{j=1}^{p} d_{i,i′,j}.
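Squared Euclidean distance satisfies this additivity assumption directly; a quick numpy check (the simulated data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((5, 4))   # n = 5 observations, p = 4 features

# d_{i,i',j}: dissimilarity between observations i and i' on feature j alone
d_per_feature = (X[:, None, :] - X[None, :, :]) ** 2   # shape (n, n, p)

# the overall dissimilarity D_{i,i'} is the sum over the p features
D = d_per_feature.sum(axis=2)
```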
-
Dissimilarity matrix for the n observations
-
Dissimilarity matrix is a sum of dissimilarity matrices over the features
-
Hierarchical clustering sums the dissimilarity matrices for the features
-
Weighted sum of the dissimilarity matrices for the features
-
Sparse hierarchical clustering and the PMD
Let D denote the n² × p matrix whose column j is the vectorized feature-wise dissimilarity matrix for feature j.

Then, suppose we apply the PMD to D:

    maximize_{u,w}  u^T D w  subject to  ||u||₂ ≤ 1, ||w||₂ ≤ 1, ||w||₁ ≤ s

- w_j is a weight on the dissimilarity matrix for feature j.
- w_j ≥ 0 occurs naturally (we assume that D_{i,i′} ≥ 0).
- If we re-arrange the elements of Dw into an n × n matrix, then performing hierarchical clustering on this re-weighted dissimilarity matrix gives sparse hierarchical clustering.
- If w₁ = ... = w_p, then this gives standard hierarchical clustering.
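A small numpy sketch of the construction. The w-update shown is one hypothetical step of the alternating PMD algorithm for a fixed u of my choosing, with a one-sided soft-threshold so the weights stay nonnegative; the threshold value 0.5 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 6, 3
X = rng.standard_normal((n, p))

# column j of D is the vectorized n x n dissimilarity matrix for feature j
d_per_feature = (X[:, None, :] - X[None, :, :]) ** 2   # shape (n, n, p)
D = d_per_feature.reshape(n * n, p)

# with equal weights, Dw reshapes to the standard overall dissimilarity matrix
w_equal = np.ones(p)
standard = (D @ w_equal).reshape(n, n)

# one PMD-style update of w for a fixed u: nonnegative soft-threshold of D'u,
# rescaled to unit L2 norm
u = np.ones(n * n) / np.sqrt(n * n)                    # a feasible unit-norm u
w = np.maximum(D.T @ u - 0.5, 0.0)
w /= np.linalg.norm(w)
reweighted = (D @ w).reshape(n, n)   # input for sparse hierarchical clustering
```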
-
Sparse hierarchical clustering in action
A simulated example with 6 classes defined on 200 signal features; 2000 features in total.
[Figure: dendrograms under "Standard Clustering" and "Sparse Clustering", and a plot of the weight vector w (y-axis "W", x-axis "Index", features 0–2000).]
-
An important breast cancer paper
Nature (2000) 406:747-752.
-
Breast cancer data
- 65 breast tumor samples for which gene expression data is available. Some samples are replicates from the same tumor (before and after chemo).
- Clustered based on the full set of 1753 genes first.
- Clustered based on 496 intrinsic genes, for which the variation between different tumors is large relative to the variation within a tumor.
- Based on the intrinsic gene clustering, determined that 62 of 65 tumors fall into one of four classes: normal-breast-like, basal-like, ER+, Erb-B2+.
-
Clustering using intrinsic genes: normal-breast-like, basal-like, ER+, Erb-B2+
[Figure: dendrograms for "All Samples" and for the "62 Samples", clustered using the intrinsic genes.]
-
Sparse clustering
We wonder: if we sparsely cluster the 62 observations using all of the genes, can we identify the four classes successfully?

Three types of clustering:

1. Standard hierarchical clustering using all 1753 genes.
2. Sparse hierarchical clustering of all 1753 genes, with the tuning parameter chosen to yield 496 genes.
3. Standard hierarchical clustering using the 496 genes with highest marginal variance.
-
normal-breast-like, basal-like, ER+, Erb-B2+

[Figure: dendrograms for "Standard Clust: All 1753 Genes", "Sparse Clust: 496 Non-Zero Genes", and "Standard Clust: 496 High-Var. Genes".]
-
Genes with high weights
#   Gene                                                     Weight
1   S100 CALCIUM-BINDING PROTEIN A8 (CALGRANULIN A)          0.223
2   SECRETED FRIZZLED-RELATED PROTEIN 1                      0.2126
3   ESTROGEN RECEPTOR 1                                      0.2076
4   KERATIN 17                                               0.1627
5   HUMAN REARRANGED IMMUNOGLOBULIN LAMBDA                   0.1568
6   CYTOCHROME P450, SUBFAMILY IIA                           0.155
7   APOLIPOPROTEIN D                                         0.1509
8   LACTOTRANSFERRIN                                         0.1471
9   ESTROGEN RECEPTOR 1                                      0.1405
10  134783                                                   0.14
11  HEPATOCYTE NUCLEAR FACTOR 3, ALPHA                       0.1332
12  HUMAN REARRANGED IMMUNOGLOBULIN LAMBDA LIGHT             0.1309
13  FATTY ACID BINDING PROTEIN 4, ADIPOCYTE                  0.1292
14  CERULOPLASMIN (FERROXIDASE)                              0.126
15  HUMAN SECRETORY PROTEIN (P1.B) MRNA                      0.1208
16  NON-SPECIFIC CROSS REACTING ANTIGEN                      0.1199
17  LIPOPROTEIN LIPASE                                       0.1123
18  IMMUNOGLOBULIN LAMBDA LIGHT CHAIN                        0.112
19  CRYSTALLIN, ALPHA B                                      0.1108
20  FATTY ACID BINDING PROTEIN 4, ADIPOCYTE                  0.11
21  PLEIOTROPHIN (HEPARIN BINDING GROWTH FACTOR 8)           0.1099
22  85660                                                    0.1077
23  ESTS, HIGHLY SIMILAR TO PROBABLE ATAXIA-TELANGIECTASIA   0.1071
24  V-FOS FBJ MURINE OSTEOSARCOMA VIRAL ONCOGENE HOMOLOG     0.1056
25  EPIDIDYMIS-SPECIFIC, WHEY-ACIDIC PROTEIN TYPE            0.1013
26  ALDO-KETO REDUCTASE FAMILY 1, MEMBER C1                  0.1007