MLConf NYC: Animashree Anandkumar
Tensor Decompositions for Guaranteed Learning of Latent Variable Models
Anima Anandkumar
U.C. Irvine
Application 1: Topic Modeling
Document modeling
Observed: words in document corpus.
Hidden: topics.
Goal: carry out document summarization.
Application 2: Understanding Human Communities
Social Networks
Observed: network of social ties, e.g. friendships, co-authorships
Hidden: groups/communities of actors.
Application 3: Recommender Systems
Recommender System
Observed: Ratings of users for various products, e.g. yelp reviews.
Goal: Predict new recommendations.
Modeling: Find groups/communities of users and products.
Application 4: Feature Learning
Feature Engineering
Learn good features/representations for classification tasks, e.g. image and speech recognition.
Sparse representations, low dimensional hidden structures.
Application 5: Computational Biology
Observed: gene expression levels
Goal: discover gene groups
Hidden variables: regulators controlling gene groups
“Unsupervised Learning of Transcriptional Regulatory Networks via Latent Tree Graphical Model” by A. Gitter, F. Huang, R. Valluvan, E. Fraenkel and A. Anandkumar, submitted to BMC Bioinformatics, Jan. 2014.
Statistical Framework
In all applications: discover hidden structure in data: unsupervised learning.
Latent Variable Models
Concise statistical description through graphical modeling.
Conditional independence relationships or hierarchy of variables.
[Figure: graphical model with observed variables x1, ..., x5 and hidden variables h1, h2, h3]
Computational Framework
Challenge: Efficient Learning of Latent Variable Models
Maximum likelihood is NP-hard.
Practice: EM, Variational Bayes have no consistency guarantees.
Efficient computational and sample complexities?
Fast methods such as matrix factorization are not statistical. We cannot learn the latent variable model through such methods.
Tensor-based Estimation
Estimate moment tensors from data: higher-order relationships.
Compute decomposition of the moment tensor.
Iterative updates, e.g. tensor power iterations, alternating minimization.
Non-convex: convergence to a local optimum. No guarantees.
Innovation: Guaranteed convergence to the correct model.
In this talk: tensor decompositions and applications
Outline
1 Introduction
2 Topic Models
3 Efficient Tensor Decomposition
4 Experimental Results
5 Conclusion
Topic Models: Bag of Words
Probabilistic Topic Models
Bag of words: order of words does not matter
Graphical model representation
l words in a document x1, . . . , xl.
h: proportions of topics in a document.
Word xi generated from topic yi.
A(i, j) := P[xm = i | ym = j]: topic-word matrix.
[Figure: topic mixture h generates topics y1, ..., y5; each topic ym generates word xm through A]
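The generative process on this slide can be sketched in a few lines of numpy. This is a minimal illustration, not code from the talk: the dimensions, the seed, and the helper name sample_document are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

k, d, l = 3, 10, 5                       # topics, vocabulary size, words per document
lam = rng.dirichlet(np.ones(k))          # topic proportions: lam[r] = P[h = r]
A = rng.dirichlet(np.ones(d), size=k).T  # d x k topic-word matrix; column j is P[x | y = j]

def sample_document(n_words=l):
    """Single-topic bag of words: draw one topic h, then n_words i.i.d. words from A[:, h]."""
    h = rng.choice(k, p=lam)
    return rng.choice(d, size=n_words, p=A[:, h])

doc = sample_document()                  # array of l word indices in {0, ..., d-1}
```

Because the words are drawn i.i.d. given the topic, their order carries no information, which is exactly the bag-of-words exchangeability assumption above.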
Geometric Picture for Topic Models
Topic proportions vector (h)
[Figure: word generation (x1, x2, . . .): words x1, x2, x3 generated from h through A]
Linear model: E[xi | h] = Ah.
Multi-view model: h is fixed and multiple words (xi) are generated.
Moment Tensors
Consider the single topic model.
E[xi | h] = Ah.  λ := E[h], i.e. λ_r = P[h = r].
Learn the topic-word matrix A and the vector λ.
M2: co-occurrence of two words in a document
M2 := E[x1 x2⊤] = E[E[x1 x2⊤ | h]] = A E[h h⊤] A⊤ = ∑_{r=1}^k λ_r a_r a_r⊤
Tensor M3: co-occurrence of three words
M3 := E[x1 ⊗ x2 ⊗ x3] = ∑_r λ_r a_r ⊗ a_r ⊗ a_r
Matrix and tensor forms (a_r := rth column of A):
M2 = ∑_{r=1}^k λ_r a_r ⊗ a_r.  M3 = ∑_{r=1}^k λ_r a_r ⊗ a_r ⊗ a_r
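As a sanity check on the second-moment formula, here is a hedged numpy sketch. The sizes, seed, and the inverse-CDF sampler are illustrative choices of mine; it estimates M2 from sampled word pairs and compares it to A diag(λ) A⊤ = ∑_r λ_r a_r a_r⊤.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, n_docs = 3, 8, 200_000

lam = rng.dirichlet(np.ones(k))          # lam[r] = P[h = r]
A = rng.dirichlet(np.ones(d), size=k).T  # d x k topic-word matrix

h = rng.choice(k, size=n_docs, p=lam)    # one topic per document

def sample_words(topics):
    """Vectorized inverse-CDF sampling of one word per document from A[:, h]."""
    cdf = np.cumsum(A, axis=0)                                   # d x k column-wise CDFs
    idx = (rng.random(len(topics))[:, None] > cdf[:, topics].T).sum(axis=1)
    return np.minimum(idx, d - 1)                                # guard against rounding

x1, x2 = sample_words(h), sample_words(h)  # two conditionally i.i.d. words per document

# Empirical co-occurrence: M2_hat[i, j] ~ P[x1 = i, x2 = j]
M2_hat = np.zeros((d, d))
np.add.at(M2_hat, (x1, x2), 1.0 / n_docs)

M2_true = (A * lam) @ A.T                  # = sum_r lam[r] * outer(a_r, a_r)
err = np.abs(M2_hat - M2_true).max()       # shrinks as n_docs grows
```

The same recipe with word triples gives an empirical M3; only the moment estimation, not the model parameters, is needed from data.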
Tensor Decomposition Problem
M2 = ∑_{r=1}^k λ_r a_r ⊗ a_r.  M3 = ∑_{r=1}^k λ_r a_r ⊗ a_r ⊗ a_r
Tensor M3 = λ1 a1 ⊗ a1 ⊗ a1 + λ2 a2 ⊗ a2 ⊗ a2 + ...
u ⊗ v ⊗ w is a rank-1 tensor whose (i, j, k)th entry is ui vj wk.
k topics, d words in vocabulary.
M3: O(d × d × d) tensor, rank k.
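A small numpy illustration of the rank-1 definition above and of how M3 is assembled from the columns of A (all dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 6, 3

# Rank-1 tensor: (u ⊗ v ⊗ w)[i, j, k] = u[i] * v[j] * w[k]
u, v, w = rng.random(d), rng.random(d), rng.random(d)
T1 = np.einsum('i,j,k->ijk', u, v, w)
assert np.isclose(T1[1, 2, 3], u[1] * v[2] * w[3])

# M3 = sum_r lam[r] * a_r ⊗ a_r ⊗ a_r: a d x d x d tensor of rank at most k
A = rng.random((d, k))
lam = rng.dirichlet(np.ones(k))
M3 = np.einsum('r,ir,jr,kr->ijk', lam, A, A, A)
```

Note that M3 is symmetric under any permutation of its three indices, since every term a_r ⊗ a_r ⊗ a_r is.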
Learning Topic Models through Tensor Decomposition
Detecting Communities in Networks
Stochastic Block Model
Non-overlapping
Mixed Membership Model
Overlapping
Unifying Assumption
Edges conditionally independent given community memberships
Multi-view Mixture Models
Tensor Forms in Other Models
Independent Component Analysis
Independent sources, unknown mixing.
Blind source separation of speech, image, video, ...
[Figure: sources h1, h2, ..., hk mixed through A into observations x1, x2, ..., xd]
Gaussian Mixtures, Hidden Markov Models / Latent Trees
[Figure: latent tree with hidden nodes h1, h2, h3 and observed leaves x1, ..., x5]
Reduction to similar moment forms
Outline
1 Introduction
2 Topic Models
3 Efficient Tensor Decomposition
4 Experimental Results
5 Conclusion
Tensor Decomposition Problem
M3 = ∑_{r=1}^k λ_r a_r ⊗ a_r ⊗ a_r
Tensor M3 = λ1 a1 ⊗ a1 ⊗ a1 + λ2 a2 ⊗ a2 ⊗ a2 + ...
u ⊗ v ⊗ w is a rank-1 tensor whose (i, j, k)th entry is ui vj wk.
k topics, d words in vocabulary.
M3: O(d × d × d) tensor, rank k.
d: vocabulary size for topic models; n: size of network for community models.
Dimensionality Reduction for Tensor Decomposition
M3 = ∑_{r=1}^k λ_r a_r ⊗ a_r ⊗ a_r
Dimensionality Reduction (Whitening)
Convert M3 of size O(d × d × d) to a tensor T of size k × k × k.
Carry out the decomposition of T.
Dimensionality reduction through multi-linear transforms, computed from data, e.g. pairwise moments.
T = ∑_i ρ_i r_i⊗3 is a symmetric orthogonal tensor: {r_i} are orthonormal.
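One standard way to realize this whitening step is sketched below, under the simplifying assumption that M2 and M3 are exact (in practice they are estimated moments). W is built from the top-k eigenpairs of M2 so that W⊤ M2 W = I; the variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 8, 3

# Exact moments of a synthetic model (illustrative)
A = rng.standard_normal((d, k))
lam = rng.uniform(0.5, 1.5, size=k)
M2 = np.einsum('r,ir,jr->ij', lam, A, A)
M3 = np.einsum('r,ir,jr,kr->ijk', lam, A, A, A)

# Whitening matrix W (d x k) from the rank-k eigendecomposition of M2
evals, evecs = np.linalg.eigh(M2)
evals, evecs = evals[-k:], evecs[:, -k:]       # top-k eigenpairs (M2 has rank k)
W = evecs / np.sqrt(evals)                     # so that W.T @ M2 @ W = I_k

# Multi-linear transform: T = M3(W, W, W) is k x k x k, symmetric and orthogonal
T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)
```

The whitened components r_i = sqrt(λ_i) W⊤ a_i come out orthonormal, and T = ∑_i λ_i^{-1/2} r_i⊗3, which is exactly the orthogonal form the power iterations operate on.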
Orthogonal/Eigen Decomposition
Orthogonal symmetric tensor: T = ∑_{j∈[k]} ρ_j r_j⊗3
T(I, r1, r1) = ∑_{j∈[k]} ρ_j ⟨r1, r_j⟩² r_j = ρ1 r1
Obtaining eigenvectors through power iterations:
u ↦ T(I, u, u) / ‖T(I, u, u)‖
Basic Algorithm
Random initialization, run power iterations, and deflate.
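The basic algorithm can be sketched directly in numpy on an exactly orthogonal synthetic tensor. The restart and iteration counts are arbitrary choices of mine, not settings from the talk.

```python
import numpy as np

rng = np.random.default_rng(4)
k = 4

# Synthetic symmetric orthogonal tensor T = sum_j rho_j r_j ⊗ r_j ⊗ r_j
R, _ = np.linalg.qr(rng.standard_normal((k, k)))   # orthonormal columns r_j
rho = rng.uniform(1.0, 2.0, size=k)
T = np.einsum('j,aj,bj,cj->abc', rho, R, R, R)

def tensor_power_method(T, n_restarts=10, n_iters=50):
    """Recover (rho_j, r_j): power-iterate u -> T(I,u,u)/||T(I,u,u)|| from
    random starts, keep the best fixed point, deflate, and repeat."""
    k = T.shape[0]
    eigvals, eigvecs = [], []
    for _ in range(k):
        best_u, best_val = None, -np.inf
        for _ in range(n_restarts):
            u = rng.standard_normal(k)
            u /= np.linalg.norm(u)
            for _ in range(n_iters):
                u = np.einsum('abc,b,c->a', T, u, u)    # u <- T(I, u, u)
                u /= np.linalg.norm(u)
            val = np.einsum('abc,a,b,c->', T, u, u, u)  # eigenvalue T(u, u, u)
            if val > best_val:
                best_u, best_val = u, val
        eigvals.append(best_val)
        eigvecs.append(best_u)
        T = T - best_val * np.einsum('a,b,c->abc', best_u, best_u, best_u)  # deflate
    return np.array(eigvals), np.array(eigvecs).T

vals, vecs = tensor_power_method(T)
```

For orthogonal tensors the fixed-point identity T(I, r1, r1) = ρ1 r1 above makes each component a stable fixed point of this map, which is why random restarts plus deflation recover all k pairs.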
Practical Considerations
k communities, n nodes, k ≪ n.
Steps
k-SVD of n× n matrix: randomized techniques
Online k × k × k tensor decomposition: No tensor explicitly formed.
Parallelization: Inherently parallelizable, GPU deployment.
Sparse implementation: real-world networks are sparse
Validation Metric: p-value test based “soft-pairing”
Parallel time complexity: O(nsk/c + k³), where s is the max. degree in the graph and c is the number of cores.
Huang, Niranjan, Hakeem and Anandkumar, “Fast Detection of Overlapping Communities via
Online Tensor Methods,” Preprint, Sept. 2013.
Scaling Of The Stochastic Iterations
v_i^{t+1} ← v_i^t − 3θβ^t ∑_{j=1}^k [⟨v_j^t, v_i^t⟩² v_j^t] + β^t ⟨v_i^t, y_A^t⟩⟨v_i^t, y_B^t⟩ y_C^t + ...
Parallelize across eigenvectors.
STGD is iterative: device code reuses buffers for updates.
[Figure: CPU/GPU data flow for v_i^t and samples y_A^t, y_B^t, y_C^t, contrasting the CULA Standard Interface with the Device Interface]
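Ignoring the omitted "+ ..." terms, the update rule above for all k eigenvectors at once can be sketched as follows. This is a hedged illustration: the function name stgd_step, the value of θ, and the layout with v_i as columns of V are my own conventions.

```python
import numpy as np

def stgd_step(V, yA, yB, yC, beta, theta=1.0):
    """One stochastic update of all k eigenvectors (columns of d x k matrix V),
    following the slide's rule; the trailing '+ ...' terms are omitted."""
    G = V.T @ V                          # Gram matrix: G[j, i] = <v_j, v_i>
    penalty = V @ (G ** 2)               # column i: sum_j <v_j, v_i>^2 v_j
    coeff = (V.T @ yA) * (V.T @ yB)      # coeff[i] = <v_i, yA> <v_i, yB>
    signal = np.outer(yC, coeff)         # column i: coeff[i] * yC
    return V - 3 * theta * beta * penalty + beta * signal

rng = np.random.default_rng(5)
V = rng.standard_normal((6, 3))
yA, yB, yC = rng.standard_normal(6), rng.standard_normal(6), rng.standard_normal(6)
V_next = stgd_step(V, yA, yB, yC, beta=0.01)
```

Each column's update needs only inner products with the current V and the sampled triple (yA, yB, yC), so the k columns can be updated independently, which is the "parallelize across eigenvectors" point.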
Scaling Of The Stochastic Iterations
[Figure: log-log plot of running time (secs) vs. number of communities k, comparing MATLAB Tensor Toolbox, CULA Standard Interface, CULA Device Interface, and Eigen Sparse]
Outline
1 Introduction
2 Topic Models
3 Efficient Tensor Decomposition
4 Experimental Results
5 Conclusion
Experimental Results
Datasets:
Facebook: friendship network among users, n ∼ 20,000
Yelp: user/business review graph, n ∼ 40,000
DBLP: co-authorship network, n ∼ 1 million
Error (E) and Recovery ratio (R)

Dataset           k̂    Method       Running Time  E       R
Facebook (k=360)  500   ours         468           0.0175  100%
Facebook (k=360)  500   variational  86,808        0.0308  100%
Yelp (k=159)      100   ours         287           0.046   86%
Yelp (k=159)      100   variational  N.A.
DBLP (k=6000)     100   ours         5,407         0.105   95%
Experimental Results on Yelp
Lowest error business categories & largest weight businesses

Rank  Category        Business                   Stars  Review Counts
1     Latin American  Salvadoreno Restaurant     4.0    36
2     Gluten Free     P.F. Chang’s China Bistro  3.5    55
3     Hobby Shops     Make Meaning               4.5    14
4     Mass Media      KJZZ 91.5FM                4.0    13
5     Yoga            Sutra Midtown              4.5    31
Bridgeness: Distance from vector [1/k̂, . . . , 1/k̂]⊤
Top-5 bridging nodes (businesses)
Business Categories
Four Peaks Brewing Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco Restaurants, Pizza, Phoenix
FEZ Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt’s Big Breakfast Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co Restaurants, Bars, Nightlife, Pubs, Tempe
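The bridgeness score defined above can be computed directly from an estimated membership matrix. A sketch; the function name and the k̂ × n column layout are my own assumptions.

```python
import numpy as np

def bridgeness(Pi):
    """Distance of each node's membership vector (a column of the k_hat x n
    matrix Pi, entries summing to 1) from the uniform vector [1/k_hat, ..., 1/k_hat]."""
    k_hat = Pi.shape[0]
    return np.linalg.norm(Pi - 1.0 / k_hat, axis=0)

# Node 0 belongs to a single community (far from uniform);
# node 1 is spread evenly across all four (distance 0).
Pi = np.array([[1.0, 0.25],
               [0.0, 0.25],
               [0.0, 0.25],
               [0.0, 0.25]])
b = bridgeness(Pi)
```

Under this reading, nodes whose memberships sit close to uniform participate in many communities at once, like the multi-category businesses listed above.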
Outline
1 Introduction
2 Topic Models
3 Efficient Tensor Decomposition
4 Experimental Results
5 Conclusion
Conclusion
Guaranteed Learning of Latent Variable Models
Guaranteed to recover correct model
Efficient sample and computational complexities
Better performance compared to EM, Variational Bayes, etc.
Mixed membership communities, topic models, ICA, Gaussian mixtures, ...
Current and Future Goals
Guaranteed online learning in high dimensions
Large-scale cloud-based implementation of tensor approaches
Code available on website and Github