MLConf NYC: Animashree Anandkumar
Tensor Decompositions for Guaranteed Learning of Latent Variable Models
Anima Anandkumar
U.C. Irvine
Application 1: Topic Modeling
Document modeling
Observed: words in document corpus.
Hidden: topics.
Goal: carry out document summarization.
Application 2: Understanding Human Communities
Social Networks
Observed: network of social ties, e.g. friendships, co-authorships
Hidden: groups/communities of actors.
Application 3: Recommender Systems
Recommender System
Observed: Ratings of users for various products, e.g. yelp reviews.
Goal: Predict new recommendations.
Modeling: Find groups/communities of users and products.
Application 4: Feature Learning
Feature Engineering
Learn good features/representations for classification tasks, e.g. image and speech recognition.
Sparse representations, low dimensional hidden structures.
Application 5: Computational Biology
Observed: gene expression levels
Goal: discover gene groups
Hidden variables: regulators controlling gene groups
“Unsupervised Learning of Transcriptional Regulatory Networks via Latent Tree Graphical Model” by A. Gitter, F. Huang, R. Valluvan, E. Fraenkel and A. Anandkumar, submitted to BMC Bioinformatics, Jan. 2014.
Statistical Framework
In all applications: discover hidden structure in data: unsupervised learning.
Latent Variable Models
Concise statistical description through graphical modeling.
Conditional independence relationships or hierarchy of variables.
[Figure: graphical model with observed variables x1, ..., x5 and hidden variables h1, h2, h3]
Computational Framework
Challenge: Efficient Learning of Latent Variable Models
Maximum likelihood is NP-hard.
Practice: EM, Variational Bayes have no consistency guarantees.
Efficient computational and sample complexities?
Fast methods such as matrix factorization are not statistical. We cannot learn the latent variable model through such methods.
Tensor-based Estimation
Estimate moment tensors from data: higher-order relationships.
Compute decomposition of the moment tensor.
Iterative updates, e.g. tensor power iterations, alternating minimization.
Non-convex: convergence to a local optimum. No guarantees.
Innovation: Guaranteed convergence to the correct model.
In this talk: tensor decompositions and applications
Outline
1 Introduction
2 Topic Models
3 Efficient Tensor Decomposition
4 Experimental Results
5 Conclusion
Topic Models: Bag of Words
Probabilistic Topic Models
Bag of words: order of words does not matter
Graphical model representation
l words in a document x1, . . . , xl.
h: proportions of topics in a document.
Word xi generated from topic yi.
A(i, j) := P[xm = i | ym = j]: topic-word matrix.
[Figure: topic mixture h generates topics y1, ..., y5; each topic ym generates word xm through A]
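The generative process on this slide can be sketched in a few lines of numpy. This is a minimal illustration, not code from the talk: the dimensions, the seed, and the helper name sample_document are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

k, d, l = 3, 10, 5                       # topics, vocabulary size, words per document
lam = rng.dirichlet(np.ones(k))          # topic proportions: lam[r] = P[h = r]
A = rng.dirichlet(np.ones(d), size=k).T  # d x k topic-word matrix; column j is P[x | y = j]

def sample_document(n_words=l):
    """Single-topic bag of words: draw one topic h, then n_words i.i.d. words from A[:, h]."""
    h = rng.choice(k, p=lam)
    return rng.choice(d, size=n_words, p=A[:, h])

doc = sample_document()                  # array of l word indices in {0, ..., d-1}
```

Because the words are drawn i.i.d. given the topic, their order carries no information, which is exactly the bag-of-words exchangeability assumption above.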
Geometric Picture for Topic Models
Topic proportions vector (h)
[Figure: word generation (x1, x2, . . .): words x1, x2, x3 generated from h through A]
Linear model: E[xi | h] = Ah.
Multi-view model: h is fixed and multiple words (xi) are generated.
Moment Tensors
Consider the single topic model.
E[xi | h] = Ah.  λ := E[h], i.e. λ_r = P[h = r].
Learn the topic-word matrix A and the vector λ.
M2: co-occurrence of two words in a document
M2 := E[x1 x2⊤] = E[E[x1 x2⊤ | h]] = A E[h h⊤] A⊤ = ∑_{r=1}^k λ_r a_r a_r⊤
Tensor M3: co-occurrence of three words
M3 := E[x1 ⊗ x2 ⊗ x3] = ∑_r λ_r a_r ⊗ a_r ⊗ a_r
Matrix and tensor forms (a_r := rth column of A):
M2 = ∑_{r=1}^k λ_r a_r ⊗ a_r.  M3 = ∑_{r=1}^k λ_r a_r ⊗ a_r ⊗ a_r
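As a sanity check on the second-moment formula, here is a hedged numpy sketch. The sizes, seed, and the inverse-CDF sampler are illustrative choices of mine; it estimates M2 from sampled word pairs and compares it to A diag(λ) A⊤ = ∑_r λ_r a_r a_r⊤.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, n_docs = 3, 8, 200_000

lam = rng.dirichlet(np.ones(k))          # lam[r] = P[h = r]
A = rng.dirichlet(np.ones(d), size=k).T  # d x k topic-word matrix

h = rng.choice(k, size=n_docs, p=lam)    # one topic per document

def sample_words(topics):
    """Vectorized inverse-CDF sampling of one word per document from A[:, h]."""
    cdf = np.cumsum(A, axis=0)                                   # d x k column-wise CDFs
    idx = (rng.random(len(topics))[:, None] > cdf[:, topics].T).sum(axis=1)
    return np.minimum(idx, d - 1)                                # guard against rounding

x1, x2 = sample_words(h), sample_words(h)  # two conditionally i.i.d. words per document

# Empirical co-occurrence: M2_hat[i, j] ~ P[x1 = i, x2 = j]
M2_hat = np.zeros((d, d))
np.add.at(M2_hat, (x1, x2), 1.0 / n_docs)

M2_true = (A * lam) @ A.T                  # = sum_r lam[r] * outer(a_r, a_r)
err = np.abs(M2_hat - M2_true).max()       # shrinks as n_docs grows
```

The same recipe with word triples gives an empirical M3; only the moment estimation, not the model parameters, is needed from data.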
Tensor Decomposition Problem
M2 = ∑_{r=1}^k λ_r a_r ⊗ a_r.  M3 = ∑_{r=1}^k λ_r a_r ⊗ a_r ⊗ a_r
Tensor M3 = λ1 a1 ⊗ a1 ⊗ a1 + λ2 a2 ⊗ a2 ⊗ a2 + ...
u ⊗ v ⊗ w is a rank-1 tensor whose (i, j, k)th entry is ui vj wk.
k topics, d words in vocabulary.
M3: O(d × d × d) tensor, rank k.
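A small numpy illustration of the rank-1 definition above and of how M3 is assembled from the columns of A (all dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 6, 3

# Rank-1 tensor: (u ⊗ v ⊗ w)[i, j, k] = u[i] * v[j] * w[k]
u, v, w = rng.random(d), rng.random(d), rng.random(d)
T1 = np.einsum('i,j,k->ijk', u, v, w)
assert np.isclose(T1[1, 2, 3], u[1] * v[2] * w[3])

# M3 = sum_r lam[r] * a_r ⊗ a_r ⊗ a_r: a d x d x d tensor of rank at most k
A = rng.random((d, k))
lam = rng.dirichlet(np.ones(k))
M3 = np.einsum('r,ir,jr,kr->ijk', lam, A, A, A)
```

Note that M3 is symmetric under any permutation of its three indices, since every term a_r ⊗ a_r ⊗ a_r is.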
Learning Topic Models through Tensor Decomposition
Detecting Communities in Networks
Stochastic Block Model
Non-overlapping
Mixed Membership Model
Overlapping
Unifying Assumption
Edges conditionally independent given community memberships
Multi-view Mixture Models
Tensor Forms in Other Models
Independent Component Analysis
Independent sources, unknown mixing.
Blind source separation of speech, image, video, ...
[Figure: sources h1, h2, ..., hk mixed through A into observations x1, x2, ..., xd]
Gaussian Mixtures, Hidden Markov Models / Latent Trees
[Figure: latent tree with hidden nodes h1, h2, h3 and observed leaves x1, ..., x5]
Reduction to similar moment forms
Outline
1 Introduction
2 Topic Models
3 Efficient Tensor Decomposition
4 Experimental Results
5 Conclusion
Tensor Decomposition Problem
M3 = ∑_{r=1}^k λ_r a_r ⊗ a_r ⊗ a_r
Tensor M3 = λ1 a1 ⊗ a1 ⊗ a1 + λ2 a2 ⊗ a2 ⊗ a2 + ...
u ⊗ v ⊗ w is a rank-1 tensor whose (i, j, k)th entry is ui vj wk.
k topics, d words in vocabulary.
M3: O(d × d × d) tensor, rank k.
d: vocabulary size for topic models; n: size of network for community models.
Dimensionality Reduction for Tensor Decomposition
M3 = ∑_{r=1}^k λ_r a_r ⊗ a_r ⊗ a_r
Dimensionality Reduction (Whitening)
Convert M3 of size O(d × d × d) to a tensor T of size k × k × k.
Carry out the decomposition of T.
Dimensionality reduction through multi-linear transforms, computed from data, e.g. pairwise moments.
T = ∑_i ρ_i r_i⊗3 is a symmetric orthogonal tensor: {r_i} are orthonormal.
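One standard way to realize this whitening step is sketched below, under the simplifying assumption that M2 and M3 are exact (in practice they are estimated moments). W is built from the top-k eigenpairs of M2 so that W⊤ M2 W = I; the variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 8, 3

# Exact moments of a synthetic model (illustrative)
A = rng.standard_normal((d, k))
lam = rng.uniform(0.5, 1.5, size=k)
M2 = np.einsum('r,ir,jr->ij', lam, A, A)
M3 = np.einsum('r,ir,jr,kr->ijk', lam, A, A, A)

# Whitening matrix W (d x k) from the rank-k eigendecomposition of M2
evals, evecs = np.linalg.eigh(M2)
evals, evecs = evals[-k:], evecs[:, -k:]       # top-k eigenpairs (M2 has rank k)
W = evecs / np.sqrt(evals)                     # so that W.T @ M2 @ W = I_k

# Multi-linear transform: T = M3(W, W, W) is k x k x k, symmetric and orthogonal
T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)
```

The whitened components r_i = sqrt(λ_i) W⊤ a_i come out orthonormal, and T = ∑_i λ_i^{-1/2} r_i⊗3, which is exactly the orthogonal form the power iterations operate on.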
Orthogonal/Eigen Decomposition
Orthogonal symmetric tensor: T = ∑_{j∈[k]} ρ_j r_j⊗3
T(I, r1, r1) = ∑_{j∈[k]} ρ_j ⟨r1, r_j⟩² r_j = ρ1 r1
Obtaining eigenvectors through power iterations:
u ↦ T(I, u, u) / ‖T(I, u, u)‖
Basic Algorithm
Random initialization, run power iterations, and deflate.
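The basic algorithm can be sketched directly in numpy on an exactly orthogonal synthetic tensor. The restart and iteration counts are arbitrary choices of mine, not settings from the talk.

```python
import numpy as np

rng = np.random.default_rng(4)
k = 4

# Synthetic symmetric orthogonal tensor T = sum_j rho_j r_j ⊗ r_j ⊗ r_j
R, _ = np.linalg.qr(rng.standard_normal((k, k)))   # orthonormal columns r_j
rho = rng.uniform(1.0, 2.0, size=k)
T = np.einsum('j,aj,bj,cj->abc', rho, R, R, R)

def tensor_power_method(T, n_restarts=10, n_iters=50):
    """Recover (rho_j, r_j): power-iterate u -> T(I,u,u)/||T(I,u,u)|| from
    random starts, keep the best fixed point, deflate, and repeat."""
    k = T.shape[0]
    eigvals, eigvecs = [], []
    for _ in range(k):
        best_u, best_val = None, -np.inf
        for _ in range(n_restarts):
            u = rng.standard_normal(k)
            u /= np.linalg.norm(u)
            for _ in range(n_iters):
                u = np.einsum('abc,b,c->a', T, u, u)    # u <- T(I, u, u)
                u /= np.linalg.norm(u)
            val = np.einsum('abc,a,b,c->', T, u, u, u)  # eigenvalue T(u, u, u)
            if val > best_val:
                best_u, best_val = u, val
        eigvals.append(best_val)
        eigvecs.append(best_u)
        T = T - best_val * np.einsum('a,b,c->abc', best_u, best_u, best_u)  # deflate
    return np.array(eigvals), np.array(eigvecs).T

vals, vecs = tensor_power_method(T)
```

For orthogonal tensors the fixed-point identity T(I, r1, r1) = ρ1 r1 above makes each component a stable fixed point of this map, which is why random restarts plus deflation recover all k pairs.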
Practical Considerations
k communities, n nodes, k ≪ n.
Steps
k-SVD of n× n matrix: randomized techniques
Online k × k × k tensor decomposition: No tensor explicitly formed.
Parallelization: Inherently parallelizable, GPU deployment.
Sparse implementation: real-world networks are sparse
Validation Metric: p-value test based “soft-pairing”
Parallel time complexity: O(nsk/c + k³), where s is the max. degree in the graph and c is the number of cores.
Huang, Niranjan, Hakeem and Anandkumar, “Fast Detection of Overlapping Communities via
Online Tensor Methods,” Preprint, Sept. 2013.
Scaling Of The Stochastic Iterations
v_i^{t+1} ← v_i^t − 3θβ^t ∑_{j=1}^k [⟨v_j^t, v_i^t⟩² v_j^t] + β^t ⟨v_i^t, y_A^t⟩⟨v_i^t, y_B^t⟩ y_C^t + ...
Parallelize across eigenvectors.
STGD is iterative: device code reuses buffers for updates.
[Figure: CPU/GPU data flow for v_i^t and samples y_A^t, y_B^t, y_C^t, contrasting the CULA Standard Interface with the Device Interface]
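Ignoring the omitted "+ ..." terms, the update rule above for all k eigenvectors at once can be sketched as follows. This is a hedged illustration: the function name stgd_step, the value of θ, and the layout with v_i as columns of V are my own conventions.

```python
import numpy as np

def stgd_step(V, yA, yB, yC, beta, theta=1.0):
    """One stochastic update of all k eigenvectors (columns of d x k matrix V),
    following the slide's rule; the trailing '+ ...' terms are omitted."""
    G = V.T @ V                          # Gram matrix: G[j, i] = <v_j, v_i>
    penalty = V @ (G ** 2)               # column i: sum_j <v_j, v_i>^2 v_j
    coeff = (V.T @ yA) * (V.T @ yB)      # coeff[i] = <v_i, yA> <v_i, yB>
    signal = np.outer(yC, coeff)         # column i: coeff[i] * yC
    return V - 3 * theta * beta * penalty + beta * signal

rng = np.random.default_rng(5)
V = rng.standard_normal((6, 3))
yA, yB, yC = rng.standard_normal(6), rng.standard_normal(6), rng.standard_normal(6)
V_next = stgd_step(V, yA, yB, yC, beta=0.01)
```

Each column's update needs only inner products with the current V and the sampled triple (yA, yB, yC), so the k columns can be updated independently, which is the "parallelize across eigenvectors" point.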
Scaling Of The Stochastic Iterations
[Figure: log-log plot of running time (secs) vs. number of communities k, comparing MATLAB Tensor Toolbox, CULA Standard Interface, CULA Device Interface, and Eigen Sparse]
Outline
1 Introduction
2 Topic Models
3 Efficient Tensor Decomposition
4 Experimental Results
5 Conclusion
Experimental Results
Datasets:
Facebook: friendship network among users, n ∼ 20,000
Yelp: user/business review graph, n ∼ 40,000
DBLP: co-authorship network, n ∼ 1 million
Error (E) and Recovery ratio (R)

Dataset           k̂    Method       Running Time  E       R
Facebook (k=360)  500   ours         468           0.0175  100%
Facebook (k=360)  500   variational  86,808        0.0308  100%
Yelp (k=159)      100   ours         287           0.046   86%
Yelp (k=159)      100   variational  N.A.
DBLP (k=6000)     100   ours         5,407         0.105   95%
Experimental Results on Yelp
Lowest error business categories & largest weight businesses

Rank  Category        Business                   Stars  Review Counts
1     Latin American  Salvadoreno Restaurant     4.0    36
2     Gluten Free     P.F. Chang’s China Bistro  3.5    55
3     Hobby Shops     Make Meaning               4.5    14
4     Mass Media      KJZZ 91.5FM                4.0    13
5     Yoga            Sutra Midtown              4.5    31
Bridgeness: Distance from vector [1/k̂, . . . , 1/k̂]⊤
Top-5 bridging nodes (businesses)
Business Categories
Four Peaks Brewing Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco Restaurants, Pizza, Phoenix
FEZ Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt’s Big Breakfast Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co Restaurants, Bars, Nightlife, Pubs, Tempe
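The bridgeness score defined above can be computed directly from an estimated membership matrix. A sketch; the function name and the k̂ × n column layout are my own assumptions.

```python
import numpy as np

def bridgeness(Pi):
    """Distance of each node's membership vector (a column of the k_hat x n
    matrix Pi, entries summing to 1) from the uniform vector [1/k_hat, ..., 1/k_hat]."""
    k_hat = Pi.shape[0]
    return np.linalg.norm(Pi - 1.0 / k_hat, axis=0)

# Node 0 belongs to a single community (far from uniform);
# node 1 is spread evenly across all four (distance 0).
Pi = np.array([[1.0, 0.25],
               [0.0, 0.25],
               [0.0, 0.25],
               [0.0, 0.25]])
b = bridgeness(Pi)
```

Under this reading, nodes whose memberships sit close to uniform participate in many communities at once, like the multi-category businesses listed above.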
Outline
1 Introduction
2 Topic Models
3 Efficient Tensor Decomposition
4 Experimental Results
5 Conclusion
Conclusion
Guaranteed Learning of Latent Variable Models
Guaranteed to recover correct model
Efficient sample and computational complexities
Better performance compared to EM, Variational Bayes, etc.
Mixed membership communities, topic models, ICA, Gaussian mixtures, ...
Current and Future Goals
Guaranteed online learning in high dimensions
Large-scale cloud-based implementation of tensor approaches
Code available on website and Github