Introduction to Graphical Models for Data Mining
Arindam Banerjee (banerjee@cs.umn.edu)
Dept of Computer Science & Engineering, University of Minnesota, Twin Cities
16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
July 25, 2010
Graphical Models 2
Introduction
• Graphical Models
– Brief Overview
• Part I: Tree Structured Graphical Models
– Exact Inference
• Part II: Mixed Membership Models
– Latent Dirichlet Allocation
– Generalizations, Applications
• Part III: Graphical Models for Matrix Analysis
– Probabilistic Matrix Factorization
– Probabilistic Co-clustering
– Stochastic Block Models
Graphical Models: What and Why
• Statistical Data Analysis
– Build diagnostic/predictive models from data
– Uncertainty quantification based on (minimal) assumptions
• The I.I.D. assumption
– Data is independently and identically distributed
– Example: Words in a doc drawn i.i.d. from the dictionary
• Graphical models
– Assume (graphical) dependencies between (random) variables
– Closer to reality; domain knowledge can be captured
– Learning/inference is much more difficult
Flavors of Graphical Models
• Basic nomenclature
– Node = random variable, may be observed/hidden
– Edge = statistical dependency
• Two popular flavors: 'Directed' and 'Undirected'
• Directed Graphs
– A directed graph between random variables, causal dependencies
– Examples: Bayesian networks, Hidden Markov Models
– Joint distribution is a product of P(child|parents)
• Undirected Graphs
– An undirected graph between random variables
– Examples: Markov/Conditional random fields
– Joint distribution in terms of potential functions
[Figure: example graph over nodes X1, …, X5]
Bayesian Networks
• Joint distribution in terms of P(X|Parents(X))
[Figure: example Bayes net over X1, …, X5]
Example I: Burglary Network
This and several other examples are from the Russell-Norvig AI book
Computing Probabilities of Events
• Probability of any event can be computed:
P(B,E,A,J,M) = P(B) P(E|B) P(A|B,E) P(J|B,E,A) P(M|B,E,A,J)
            = P(B) P(E) P(A|B,E) P(J|A) P(M|A)
• Example: P(b,¬e,a,¬j,m) = P(b) P(¬e) P(a|b,¬e) P(¬j|a) P(m|a)
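The factorization above can be checked numerically. A minimal sketch of the burglary network, assuming the standard Russell-Norvig textbook CPT values (an assumption, since the slides do not reproduce them):

```python
# Joint probability in the burglary network. CPT values are the standard
# Russell-Norvig textbook numbers (an assumption, since the slides do not
# reproduce them).
from itertools import product

P_B = 0.001                       # P(burglary)
P_E = 0.002                       # P(earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(alarm | B, E)
P_J = {True: 0.90, False: 0.05}   # P(john calls | alarm)
P_M = {True: 0.70, False: 0.01}   # P(mary calls | alarm)

def joint(b, e, a, j, m):
    """P(B,E,A,J,M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A)."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# The slide's example: P(b, ¬e, a, ¬j, m)
p_example = joint(True, False, True, False, True)     # ≈ 6.57e-05

# Sanity check: the joint sums to 1 over all 2^5 assignments
total = sum(joint(*bits) for bits in product([True, False], repeat=5))
```

Note how the network structure keeps the table small: five CPTs instead of a full table with 2^5 entries.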
Example II: Rain Network
Example III: “Car Won’t Start” Diagnosis
Inference
• Some variables in the Bayes net are observed
– the evidence/data, e.g., John has not called, Mary has called
• Inference
– How to compute the value/probability of other variables
– Example: What is the probability of Burglary, i.e., P(b|¬j,m)?
Inference Algorithms
• Graphs without loops: Tree-structured Graphs
– Efficient exact inference algorithms are possible
– Sum-product algorithm, and its special cases
• Belief propagation in Bayes nets
• Forward-Backward algorithm in Hidden Markov Models (HMMs)
• Graphs with loops
– Junction tree algorithms
• Convert into a graph without loops
• May lead to an exponentially large graph
– Sum-product/message passing algorithm, 'disregarding loops'
• Active research topic, correct convergence 'not guaranteed'
• Works well in practice
– Approximate inference
Approximate Inference
• Variational Inference
– Deterministic approximation
– Approximate the complex true distribution over latent variables
– Replace with a family of simple/tractable distributions
• Use the best approximation in the family
– Examples: Mean-field, Bethe, Kikuchi, Expectation Propagation
• Stochastic Inference
– Simple sampling approaches
– Markov Chain Monte Carlo (MCMC) methods
• Powerful family of methods
– Gibbs sampling
• Useful special case of MCMC methods
Part I: Tree Structured Graphical Models
• The Inference Problem
• Factor Graphs and the Sum-Product Algorithm
• Example: Hidden Markov Models
• Generalizations
The Inference Problem
Complexity of Naïve Inference
Bayes Nets to Factor Graphs
Factor Graphs: Product of Local Functions
Marginalize Product of Functions (MPF)
• Marginalize product of functions
• Computing marginal functions
• The “not-sum” notation
MPF using Distributive Law
• We focus on two examples: g1(x1) and g3(x3)
• Main Idea: Distributive law
ab + ac = a(b+c)
• For g1(x1), we have
• For g3(x3), we have
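As a concrete illustration of the distributive-law savings, a sketch that computes a marginal two ways on a small chain x1 – x2 – x3; the local functions fA, fB are made up for illustration:

```python
# Brute-force marginalization vs. the distributive law on a chain
# x1 - x2 - x3 with two local functions (fA, fB are made up for
# illustration).
from itertools import product

VALS = range(3)

def fA(x1, x2): return x1 + 2 * x2 + 1    # local function on (x1, x2)
def fB(x2, x3): return x2 * x3 + 1        # local function on (x2, x3)

def g1_naive(x1):
    """Marginal g1(x1): sum the full product over all (x2, x3) pairs."""
    return sum(fA(x1, x2) * fB(x2, x3) for x2, x3 in product(VALS, VALS))

def g1_fast(x1):
    """Same marginal, with sums pushed inside: ab + ac = a(b + c)."""
    return sum(fA(x1, x2) * sum(fB(x2, x3) for x3 in VALS) for x2 in VALS)

assert all(g1_naive(v) == g1_fast(v) for v in VALS)
```

The naive version touches |X|^2 terms per value of x1, while the factored version first collapses x3 into a single message per value of x2; on long chains this is the difference between exponential and linear cost.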
Computing Single Marginals
• Main Idea:
– Target node becomes the root
– Pass messages from leaves up to the root
Compute the product of descendants with f, then do a not-sum over the parent
Message Passing
Compute product of descendants
Example: Computing g1(x1)
Example: Computing g3(x3)
The efficient algorithm is encoded in the structure of the factor graph
Hidden Markov Models (HMMs)
Latent variables: z0, z1, …, zt−1, zt, zt+1, …, zT
Observed variables: x1, …, xt−1, xt, xt+1, …, xT
Inference Problems:
1. Compute p(x1:T)
2. Compute p(zt | x1:T)
3. Find argmax over z1:T of p(z1:T | x1:T)
Similar problem for chain-structured Conditional Random Fields (CRFs)
The Sum-Product Algorithm
• To compute gi(xi), form a tree rooted at xi
• Starting from the leaves, apply the following two rules
– Product Rule: at a variable node, take the product of descendants
– Sum-product Rule: at a factor node, take the product of f with descendants; then perform a not-sum over the parent node
• To compute all marginals
– Can be done one at a time; repeated computations, not efficient
– Simultaneous message passing following the sum-product algorithm
– Examples: Belief Propagation, Forward-Backward algorithm, etc.
Sum-Product Updates
Sum-Product Updates
Example: Step 1
Example: Step 2
Example: Step 3
Example: Step 4
Example: Step 5
Example: Termination
HMMs Revisited
Latent variables: z0, z1, …, zt−1, zt, zt+1, …, zT
Observed variables: x1, …, xt−1, xt, xt+1, …, xT
Inference Problems:
1. Compute p(x1:T)
2. Compute p(zt | x1:T)
The sum-product algorithm is known as the 'forward-backward' algorithm
Smoothing in Kalman Filtering
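The forward-backward recursions on a chain can be sketched directly; the 2-state HMM parameters below are made up for illustration:

```python
# Forward-backward on a chain: alpha/beta messages give p(x_1:T) and the
# smoothed posteriors p(z_t | x_1:T). The 2-state HMM parameters are made
# up for illustration.
pi = [0.6, 0.4]                        # initial distribution p(z_1)
A = [[0.7, 0.3], [0.4, 0.6]]           # A[i][j] = p(z_{t+1}=j | z_t=i)
B = [[0.9, 0.1], [0.2, 0.8]]           # B[i][o] = p(x_t=o | z_t=i)
obs = [0, 0, 1, 0, 1]                  # observed sequence x_1..x_T

# Forward: alpha[t][j] = p(x_1..x_t, z_t=j)
alpha = [[pi[j] * B[j][obs[0]] for j in range(2)]]
for x in obs[1:]:
    prev = alpha[-1]
    alpha.append([B[j][x] * sum(prev[i] * A[i][j] for i in range(2))
                  for j in range(2)])

# Backward: beta[t][i] = p(x_{t+1}..x_T | z_t=i)
beta = [[1.0, 1.0]]
for x in reversed(obs[1:]):
    nxt = beta[0]
    beta.insert(0, [sum(A[i][j] * B[j][x] * nxt[j] for j in range(2))
                    for i in range(2)])

likelihood = sum(alpha[-1])            # p(x_1:T)
posteriors = [[a * b / likelihood for a, b in zip(al, be)]
              for al, be in zip(alpha, beta)]   # p(z_t | x_1:T)
```

The alpha pass alone answers problem 1 (the likelihood); combining alpha and beta answers problem 2 (smoothed posteriors), at O(T·K²) total cost.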
Distributive Law on Semi-Rings
• The idea can be applied to any commutative semi-ring
• Semi-ring 101
– Two operations (+, ×): associative, commutative, identity
– Distributive law: a×b + a×c = a×(b+c)
• Belief Propagation in Bayes nets
• MAP inference in HMMs
• Max-product algorithm
• Alternative to Viterbi Decoding
• Kalman Filtering
• Error Correcting Codes
• Turbo Codes
• …
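Swapping the semi-ring is a one-line change in the recursion. A sketch on a toy 2-state HMM (parameters made up for illustration), where the same forward message computed under (sum, ×) gives the likelihood and under (max, ×) gives the best-path probability:

```python
# Swapping the semi-ring: the same chain recursion with (sum, *) computes
# the likelihood, and with (max, *) the best-path (Viterbi) probability.
# Toy 2-state HMM parameters are made up for illustration.
from itertools import product

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 0, 1, 0, 1]

def chain_message(op):
    """Forward recursion, generic in the semi-ring 'sum' operation op."""
    msg = [pi[j] * B[j][obs[0]] for j in range(2)]
    for x in obs[1:]:
        msg = [B[j][x] * op(msg[i] * A[i][j] for i in range(2))
               for j in range(2)]
    return msg

p_x = sum(chain_message(sum))          # sum-product: p(x_1:T)
p_map = max(chain_message(max))        # max-product: prob of the best path

# Sanity check against brute force over all 2^T state paths
def path_prob(z):
    p = pi[z[0]] * B[z[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[z[t - 1]][z[t]] * B[z[t]][obs[t]]
    return p

paths = list(product(range(2), repeat=len(obs)))
assert abs(p_x - sum(map(path_prob, paths))) < 1e-12
assert abs(p_map - max(map(path_prob, paths))) < 1e-12
```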
Message Passing in General Graphs
• Tree structured graphs
– Message passing is guaranteed to give correct solutions
– Examples: HMMs, Kalman Filters
• General Graphs
– Active research topic
• Progress has been made in the past 10 years
– Message passing
• May not converge
• May converge to a 'local minimum' of the 'Bethe variational free energy'
– New approaches to convergent and correct message passing
• Applications
– TrueSkill: Ranking System for Xbox Live
– Turbo Codes: 3G, 4G phones, satellite comm, WiMAX, Mars orbiter
Part II: Mixed Membership Models
• Mixture Models vs Mixed Membership Models
• Latent Dirichlet Allocation
• Inference
– Mean-Field and Collapsed Variational Inference
– MCMC/Gibbs Sampling
• Applications
• Generalizations
Background: Plate Diagrams
[Figure: plate with 3 copies of node b under parent a, equivalent to the unrolled network a → b1, b2, b3]
Compact representation of large Bayesian networks
[Figure: plate model with parameter θ and observations x_dn, plates over n points and d features]
Model 1: Independent Features (d = 3, n = 1)
[Figure: plate model θ → x_dn with plates over n and d; one density fit per feature]
Model 2: Naïve Bayes (Mixture Models)
[Figure: per-feature histograms, each fit with a mixture of Gaussians]
Naïve Bayes Model
[Figure: plate diagram π → z → x with component parameters θ, plates over n data points, k components, and d features; a single component z generates all features of a data point]
Model 3: Mixed Membership Model
[Figure: model progression from independent features (θ → x) to the mixture model (π → z → x with θ_k) to the mixed membership model, which adds a Dirichlet prior α over the mixing weights π]
[Figure: mixed membership fit; each feature of a data point can be explained by a different mixture component]
Mixed Membership Models
[Figure: plate diagram α → π → z → x_dn with component parameters θ, plates over k components and d data points]
Mixture Model vs Mixed Membership Model
[Figure: single-component membership (mixture model) vs. multi-component mixed membership]
Latent Dirichlet Allocation (LDA)
[Plate diagram: plates over Nd words, D documents, and K topics]
θ(d) ~ Dirichlet(α): distribution over topics for each document
zi ~ Discrete(θ(d)): topic assignment for each word
xi ~ Discrete(φ(zi)): word generated from the assigned topic, where φ(j) is the distribution over words for topic j (with Dirichlet priors)
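The generative process can be sketched as a sampler; the vocabulary, hyperparameters, and corpus sizes below are toy assumptions:

```python
# Sampling documents from the LDA generative process. Vocabulary,
# hyperparameters, and corpus sizes are toy assumptions.
import random

random.seed(0)
VOCAB = ["ball", "game", "team", "stock", "bank", "market"]
K, D, N_WORDS = 2, 3, 8          # topics, documents, words per document

def dirichlet(alpha):
    """Dirichlet sample via normalized Gamma draws."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def discrete(p):
    """Draw an index from a discrete distribution p."""
    r, acc = random.random(), 0.0
    for i, pi in enumerate(p):
        acc += pi
        if r < acc:
            return i
    return len(p) - 1

phi = [dirichlet([0.5] * len(VOCAB)) for _ in range(K)]   # topic-word dists
docs = []
for _ in range(D):
    theta = dirichlet([0.1] * K)      # per-document topic proportions
    doc = []
    for _ in range(N_WORDS):
        z = discrete(theta)           # topic assignment for this word
        doc.append(VOCAB[discrete(phi[z])])
    docs.append(doc)
```

With a small α, each document's θ concentrates on few topics, which is what makes the mixed membership structure visible in the sampled text.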
LDA Generative Model
θ ~ Dirichlet(α), z ~ Discrete(θ)
[Figure: two example documents; each document's θ generates topic assignments z11, z12, … and words x11, x12, …]
Learning: Inference and Estimation
Variational Inference
Variational EM for LDA
E-step: Variational Distribution and Updates
M-step: Parameter Estimation
Results: Topics Inferred
Results: Perplexity Comparison
Aviation Safety Reports (NASA)
Results: NASA Reports I
Arrival/Departure: runway, approach, departure, altitude, turn, tower, air traffic control, heading, taxi way, flight
Passenger: passenger, attendant, flight, seat, medical, captain, attendants, lavatory, told, police
Maintenance: maintenance, engine, mel, zzz, air craft, installed, check, inspection, fuel, work
Results: NASA Reports II
Medical Emergency: medical, passenger, doctor, attendant, oxygen, emergency, paramedics, flight, nurse, aed
Wheel Maintenance: tire, wheel, assembly, nut, spacer, main, axle, bolt, missing, tires
Weather Condition: knots, turbulence, aircraft, degrees, ice, winds, wind, speed, air speed, conditions
Departure: departure, sid, dme, altitude, climbing, mean sea level, heading, procedure, turn, degree
The pilot flies an owner's airplane with the owner as a passenger, and loses contact with the center during the flight.
During a sky-diving operation, a jet approaches at the same altitude, but an accident is avoided.
Red: Flight Crew Blue: Passenger Green: Maintenance
Two-Dimensional Visualization for Reports
The altimeter has a problem, but the pilot overcomes the difficulty during the flight.
During acceleration, a flap retraction issue occurs. The pilot returns to base and lands, and the mechanic identifies the problem.
Red: Flight Crew Blue: Passenger Green: Maintenance
Two-Dimensional Visualization for Reports
The pilot has a landing gear problem. The maintenance crew joins the radio conversation to help.
The captain has a medical emergency.
Red: Flight crew Blue: Passenger Green: Maintenance
Two-Dimensional Visualization for Reports
Red: Flight Crew Blue: Passenger Green: Maintenance
Mixed Membership of Reports
Flight Crew: 0.7039, Passenger: 0.0009, Maintenance: 0.2953
Flight Crew: 0.1405, Passenger: 0.0663, Maintenance: 0.7932
Flight Crew: 0.2563, Passenger: 0.6599, Maintenance: 0.0837
Flight Crew: 0.0013, Passenger: 0.0013, Maintenance: 0.9973
Smoothed Latent Dirichlet Allocation
[Plate diagram: plates over Nd words, D documents, and T topics]
θ(d) ~ Dirichlet(α): distribution over topics for each document
zi ~ Discrete(θ(d)): topic assignment for each word
φ(j) ~ Dirichlet(β): distribution over words for each topic (Dirichlet prior)
xi ~ Discrete(φ(zi)): word generated from the assigned topic
Stochastic Inference using Markov Chains
• Powerful family of approximate inference methods
– Markov Chain Monte Carlo (MCMC) methods, Gibbs Sampling
• The basic idea
– Need to marginalize over a complex latent variable distribution:
p(x|θ) = ∫z p(x,z|θ) = ∫z p(x|θ) p(z|x,θ) = E z~p(z|x,θ) [p(x|θ)]
– Draw 'independent' samples from p(z|x,θ)
– Compute a sample-based average instead of the full integral
• Main Issue: How to draw samples?
– Difficult to directly draw samples from p(z|x,θ)
– Construct a Markov chain whose stationary distribution is p(z|x,θ)
– Run the chain till 'convergence'
– Obtain samples from p(z|x,θ)
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm (Contd)
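A minimal sketch of Metropolis-Hastings with a symmetric random-walk proposal, targeting an unnormalized Gaussian density; the target, step size, and chain length are assumptions, since the slides leave them unspecified:

```python
# Random-walk Metropolis-Hastings targeting an unnormalized Gaussian
# density; the target, step size, and chain length are assumptions.
import math, random

random.seed(1)

def target(z):
    return math.exp(-0.5 * (z - 3.0) ** 2)     # unnormalized N(3, 1)

def metropolis_hastings(n_samples, step=1.0, z0=0.0):
    z, samples = z0, []
    for _ in range(n_samples):
        z_new = z + random.gauss(0.0, step)    # symmetric proposal
        # Accept with prob min(1, p(z')/p(z)); q cancels by symmetry
        if random.random() < min(1.0, target(z_new) / target(z)):
            z = z_new
        samples.append(z)
    return samples

chain = metropolis_hastings(20000)
kept = chain[5000:]                            # discard burn-in
mean = sum(kept) / len(kept)                   # should approach 3.0
```

The sampler only ever evaluates the unnormalized target, which is exactly why MCMC applies when the normalizing constant is intractable.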
The Gibbs Sampler
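A minimal Gibbs sampler sketch for a bivariate Gaussian, where both full conditionals are available in closed form; the model itself is an illustrative assumption:

```python
# Gibbs sampling for a bivariate Gaussian with correlation rho: both full
# conditionals are Gaussian, so the sampler alternates exact draws.
# The model is an illustrative assumption.
import math, random

random.seed(2)
RHO = 0.8
SD = math.sqrt(1 - RHO ** 2)         # conditional std dev of x|y and y|x

def gibbs(n_samples, burn_in=1000):
    x = y = 0.0
    out = []
    for t in range(n_samples + burn_in):
        x = random.gauss(RHO * y, SD)    # draw x | y
        y = random.gauss(RHO * x, SD)    # draw y | x
        if t >= burn_in:
            out.append((x, y))
    return out

samples = gibbs(20000)
corr = sum(x * y for x, y in samples) / len(samples)   # estimates rho
```

Note there is no accept/reject step: Gibbs is the special case of Metropolis-Hastings whose conditional proposals are always accepted.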
Collapsed Gibbs Sampling for LDA
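A sketch of the collapsed sampler: θ and φ are integrated out, and each topic assignment is resampled from the standard closed-form conditional p(z = k | rest) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ). The toy corpus and hyperparameters are assumptions:

```python
# Collapsed Gibbs sampling for LDA: theta and phi integrated out, each z
# resampled from p(z=k | rest) ∝ (n_dk + alpha)(n_kw + beta)/(n_k + V*beta).
# The toy corpus and hyperparameters are assumptions.
import random

random.seed(3)
docs = [[0, 1, 0, 2, 1], [3, 4, 3, 5, 4], [0, 1, 3, 4, 2]]   # word ids
V, K, alpha, beta = 6, 2, 0.5, 0.1

z = [[random.randrange(K) for _ in doc] for doc in docs]  # topic per word
n_dk = [[0] * K for _ in docs]       # document-topic counts
n_kw = [[0] * V for _ in range(K)]   # topic-word counts
n_k = [0] * K                        # words per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1

for sweep in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]              # remove the word's current assignment
            n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
            weights = [(n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                       / (n_k[j] + V * beta) for j in range(K)]
            r = random.uniform(0.0, sum(weights))
            k = 0
            while k < K - 1 and r > weights[k]:   # inverse-CDF draw
                r -= weights[k]
                k += 1
            z[d][i] = k              # add it back under the new topic
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
```

After convergence, point estimates of θ and φ can be read off the count tables by normalizing with the same smoothing terms.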
Collapsed Variational Inference for LDA
Collapsed Variational Inference for LDA
Results: Comparison of Inference Methods
Results: Comparison of Inference Methods
Generalizations
• Generalized Topic Models
– Correlated Topic Models
– Dynamic Topic Models, Topics over Time
– Dynamic Topics with birth/death
• Mixed membership models over non-text data, applications
– Mixed membership naïve-Bayes
– Discriminative models for classification
– Cluster Ensembles
• Nonparametric Priors
– Dirichlet Process priors: infer the number of topics
– Hierarchical Dirichlet Processes: infer hierarchical structures
– Several other priors: Pachinko allocation, Gaussian Processes, IBP, etc.
CTM Results
DTM Results
DTM Results II
Mixed Membership Naïve Bayes
• For each data point:
– Choose π ~ Dirichlet(α)
– For each of the observed features fn:
• Choose a class zn ~ Discrete(π)
• Choose a feature value xn from p(xn | zn, fn, Θ), which could be Gaussian, Poisson, Bernoulli, …
MMNB vs NB: Perplexity Surfaces
• MMNB typically achieves a lower perplexity than NB
• On the test set, NB overfits, while MMNB is stable and robust
[Figure: perplexity surfaces for NB and MMNB]
Discriminative Mixed Membership Models
Results: DLDA for text classification
Fast DLDA generally achieves higher accuracy on most of the datasets
Topics from DLDA
Topic 1: cabin, descent, pressurization, emergency, flight, aircraft, pressure, oxygen, atc, masks
Topic 2: flight, hours, time, crew, day, duty, rest, trip, zzz, minutes
Topic 3: ice, aircraft, flight, wing, captain, icing, engine, anti, time, maintenance
Topic 4: aircraft, gate, ramp, wing, taxi, stop, ground, parking, area, line
Topic 5: flight, smoke, cabin, passenger, aircraft, captain, cockpit, attendant, smell, emergency
Cluster Ensembles
• Combining multiple base clusterings of a dataset
• Robust and stable
• Distributed and scalable
• Knowledge reuse, privacy preserving
[Figure: three base clusterings of the same dataset]
Problem Formulation
• Input & Output
[Figure: data points, base clusterings, and the resulting consensus clustering]
Results: State-of-the-art vs Bayesian Ensembles
Part III: Graphical Models for Matrix Analysis
• Probabilistic Matrix Factorizations
• Probabilistic Co-clustering
• Stochastic Block Structures
Matrix Factorization
• Singular value decomposition
• Problems
– Large matrices, with millions of rows/columns
• SVD can be rather slow
– Sparse matrices, most entries are missing
• Traditional approaches cannot handle missing entries
• Model X ∈ R^(n×m) as X ≈ UVᵀ, where
– U ∈ R^(n×k), V ∈ R^(m×k)
– Alternately optimize U and V
Matrix Factorization: “Funk SVD”
X̂ij = uiᵀvj
error = (Xij − X̂ij)² = (Xij − uiᵀvj)²
• Gradient descent updates
Matrix Factorization (Contd)
X̂ij = uiᵀvj, error = (Xij − X̂ij)²
uik(t+1) = uik(t) + η (Xij − X̂ij) vjk(t)
vjk(t+1) = vjk(t) + η (Xij − X̂ij) uik(t)
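The gradient updates can be sketched as plain SGD over the observed entries only; the toy ratings, rank, and learning rate are assumptions:

```python
# SGD for the low-rank model: observed entries only, rank-k factors updated
# with the gradient rules above. The toy ratings, rank, and learning rate
# are assumptions.
import random

random.seed(4)
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0,
           (1, 2): 1.0, (2, 1): 4.0, (2, 2): 2.0}    # sparse X_ij
n, m, k, eta = 3, 3, 2, 0.02

U = [[random.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n)]
V = [[random.gauss(0.0, 0.1) for _ in range(k)] for _ in range(m)]

def predict(i, j):
    return sum(U[i][f] * V[j][f] for f in range(k))  # u_i^T v_j

for epoch in range(2000):
    for (i, j), x in ratings.items():
        err = x - predict(i, j)                      # X_ij - Xhat_ij
        for f in range(k):                           # update both factors
            U[i][f], V[j][f] = (U[i][f] + eta * err * V[j][f],
                                V[j][f] + eta * err * U[i][f])

train_mse = sum((x - predict(i, j)) ** 2
                for (i, j), x in ratings.items()) / len(ratings)
```

Because the loss is summed over observed entries only, missing entries impose no cost, which is exactly why this approach handles sparse rating matrices that plain SVD cannot.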
Probabilistic Matrix Factorization (PMF)
ui ~ N(0, σu² I), vj ~ N(0, σv² I)
Xij ~ N(uiᵀvj, σ²)
Inference using gradient descent
Bayesian Probabilistic Matrix Factorization
µu ~ N(µ0, Λu), Λu ~ W(ν0, W0)  (Gaussian and Wishart hyperpriors)
µv ~ N(µ0, Λv), Λv ~ W(ν0, W0)
ui ~ N(µu, Λu), vj ~ N(µv, Λv)
Xij ~ N(uiᵀvj, σ²)
Inference using MCMC
Results: PMF on the Netflix Dataset
Results: PMF on the Netflix Dataset
Results: Bayesian PMF on Netflix
Results: Bayesian PMF on Netflix
Results: Bayesian PMF on Netflix
Co-clustering: Gene Expression Analysis
Original Co-clustered
Co-clustering and Matrix Approximation
Probabilistic Co-clustering
Row clusters: …  Column clusters: …
Probabilistic Co-clustering
Generative Process
• Assume a mixed membership for each row and column
• Assume a Gaussian for each co-cluster
1. Pick row/column clusters
2. Generate each entry of the matrix
Reduction to Mixture Models
Bayesian Co-clustering (BCC)
• A Dirichlet distribution over all possible mixed memberships
Learning: Inference and Estimation
• Learning
– Estimate model parameters
– Infer 'mixed memberships' of individual rows and columns
– Expectation Maximization
• Issues
– Posterior probability cannot be obtained in closed form
– Parameter estimation cannot be done directly
• Approach: Approximate inference
– Variational Inference
– Collapsed Gibbs Sampling, Collapsed Variational Inference
Variational EM
• Introduce a variational distribution to approximate the true posterior
• Use Jensen's inequality to get a tractable lower bound
• Maximize the lower bound w.r.t. the variational parameters
– Alternately, minimize the KL divergence between the variational distribution and the true posterior
• Maximize the lower bound w.r.t. the model parameters
Variational Distribution
• One variational distribution for each row and for each column
Collapsed Inference
• The latent distribution can be exactly marginalized over (π1, π2)
– Obtain p(X, z1, z2 | α1, α2, Θ) in closed form
– Analysis assumes discrete/categorical entries
– Can be generalized to exponential family distributions
• Collapsed Gibbs Sampling
– Conditional distribution of (z1uv, z2uv) in closed form: p(z1uv = i, z2uv = j | X, z1−uv, z2−uv, α1, α2, Θ)
– Sample states, run the sampler till convergence
• Collapsed Variational Bayes
– Variational distribution q(z1, z2 | γ) = ∏u,v q(z1uv, z2uv | γuv)
– Gaussian and Taylor approximations to obtain the updates for γuv
Residual Bayesian Co-clustering (RBC)
• (z1, z2) determines the distribution
• Users/movies may have bias
• (m1, m2): row/column means
• (bm1, bm2): row/column bias
Results: Datasets
• Movielens: Movie recommendation data
– 100,000 ratings (1-5) for 1682 movies from 943 users (6.3% dense)
– Binarized: 0 (1-3), 1 (4-5)
– Discrete (original), Bernoulli (binary), Real (z-scored)
• Foodmart: Transaction data
– 164,558 sales records for 7803 customers and 1559 products (1.35% dense)
– Binarized: 0 (below median), 1 (above median)
– Poisson (original), Bernoulli (binary), Real (z-scored)
• Jester: Joke rating data
– 100,000 ratings (−10.00 to +10.00) for 100 jokes from 1000 users (100% dense)
– Binarized: 0 (below 0), 1 (above 0)
– Gaussian (original), Bernoulli (binary), Real (z-scored)
Perplexity Comparison with 10 Clusters

On Binary Data
Training Set:   MMNB     BCC      LDA
Jester          1.7883   1.8186   98.3742
Movielens       1.6994   1.9831   439.6361
Foodmart        1.8691   1.9545   1461.7463
Test Set:       MMNB     BCC      LDA
Jester          4.0237   2.5498   98.9964
Movielens       3.9320   2.8620   1557.0032
Foodmart        6.4751   2.1143   6542.9920

On Original Data
Training Set:   MMNB     BCC
Jester          15.4620  18.2495
Movielens       3.1495   0.8068
Foodmart        4.5901   4.5938
Test Set:       MMNB     BCC
Jester          39.9395  24.8239
Movielens       38.2377  1.0265
Foodmart        4.6681   4.5964
Co-embedding: Users
Co-embedding: Movies
RBC vs. other co-clustering algorithms
Jester
• RBC and RBC-FF perform better than BCC
• RBC and RBC-FF are also the best among the other algorithms
RBC vs. other co-clustering algorithms
Foodmart
Movielens
RBC vs. SVD, NNMF, and CORR
Jester
• RBC and RBC-FF are competitive with the other algorithms
RBC vs. SVD, NNMF and CORR
Movielens
Foodmart
SVD vs. Parallel RBC
Parallel RBC scales well to large matrices
Inference Methods: VB, CVB, Gibbs
Mixed Membership Stochastic Block Models
• Network data analysis
– Relational View: rows and columns are the same entity
– Examples: social networks, biological networks
– Graph View: (binary) adjacency matrix
• Model
MMB Graphical Model
Variational Inference
• Variational lower bound
• Fully factorized variational distribution
• Variational EM
– E-step: update variational parameters (γ, φ)
– M-step: update model parameters (α, B)
Results: Inferring Communities
Original friendship matrix, and friendships inferred from the posterior by thresholding φpᵀBφq and πpᵀBπq, respectively
Results: Protein Interaction Analysis
“Ground truth”: MIPS collection of protein interactions (yellow diamond)
Comparison with other models based on protein interactions and microarray expression analysis
Non-parametric Bayes
Dirichlet Process Mixtures
Chinese Restaurant Processes
Indian Buffet Processes
Pitman-Yor Processes
Gaussian Processes
Hierarchical Dirichlet Processes
Mondrian Processes
References: Graphical Models
• S. Russell & P. Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 2009.
• D. Koller & N. Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009.
• C. Bishop, Pattern Recognition and Machine Learning, Springer, 2007.
• D. Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press, 2010.
• M. I. Jordan (Ed), Learning in Graphical Models, MIT Press, 1998.
• S. L. Lauritzen, Graphical Models, Oxford University Press, 1996.
• J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.
References: Inference
• F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Transactions on Information Theory, vol. 47, no. 2, 498–519, 2001.
• S. M. Aji and R. J. McEliece, “The generalized distributive law,” IEEE Transactions on Information Theory, 46, 325–343, 2000.
• M. J. Wainwright and M. I. Jordan, “Graphical models, exponential families, and variational inference,” Foundations and Trends in Machine Learning, vol. 1, no. 1-2, 1-305, December 2008.
• C. Andrieu, N. De Freitas, A. Doucet, M. I. Jordan, “An Introduction to MCMC for Machine Learning,” Machine Learning, 50, 5-43, 2003.
• J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing free energy approximations and generalized belief propagation algorithms,” IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2282–2312, 2005.
References: Mixed-Membership Models
• S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. “Indexing by latent semantic analysis,” Journal of the Society for Information Science, 41(6):391–407, 1990.
• T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, 42(1):177–196, 2001.
• D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research (JMLR), 3:993–1022, 2003.
• T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the National Academy of Sciences, 101(Suppl 1): 5228–5235, 2004.
• Y. W. Teh, D. Newman, and M. Welling. “A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation, ” Neural Information Processing Systems (NIPS), 2007.
• A. Asuncion, P. Smyth, M. Welling, Y.W. Teh, “On Smoothing and Inference for Topic Models,” Uncertainty in Artificial Intelligence (UAI), 2009.
• H. Shan, A. Banerjee, and N. Oza, "Discriminative Mixed-membership Models," IEEE Conference on Data Mining (ICDM), 2009.
References: Matrix Factorization
• S. Funk, “Netflix update: Try this at home,” http://sifter.org/~simon/journal/20061211.html
• R. Salakhutdinov and A. Mnih. “Probabilistic matrix factorization,” Neural Information Processing Systems (NIPS), 2008.
• R. Salakhutdinov and A. Mnih. “Bayesian probabilistic matrix factorization using Markov chain Monte Carlo,” International Conference on Machine Learning (ICML), 2008.
• I. Porteous, A. Asuncion, and M. Welling, “Bayesian matrix factorization with side information and Dirichlet process mixtures,” Conference on Artificial Intelligence (AAAI), 2010.
• I. Sutskever, R. Salakhutdinov, and J. Tenenbaum, "Modelling relational data using Bayesian clustered tensor factorization," Neural Information Processing Systems (NIPS), 2009.
• A. Singh and G. Gordon, “A Bayesian matrix factorization model for relational data,” Uncertainty in Artificial Intelligence (UAI), 2010.
References: Co-clustering, Block Structures
• A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, D. Modha., “A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation,” Journal of Machine Learning Research (JMLR), 2007.
• M. M. Shafiei and E. E. Milios, “Latent Dirichlet Co-Clustering,” IEEE Conference on Data Mining (ICDM), 2006.
• H. Shan and A. Banerjee, “Bayesian co-clustering,” IEEE International Conference on Data Mining (ICDM), 2008.
• P. Wang, C. Domeniconi, and K. B. Laskey, “Latent Dirichlet Bayesian Co-Clustering,” European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 2009.
• H. Shan and A. Banerjee, “Residual Bayesian Co-clustering for Matrix Approximation,” SIAM International Conference on Data Mining (SDM), 2010.
• T. A. B. Snijders and K. Nowicki, “Estimation and prediction for stochastic blockmodels for graphs with latent block structure,” Journal of Classification, 14:75–100, 1997.
• E.M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing, “Mixed-membership stochastic blockmodels,” Journal of Machine Learning Research (JMLR), 9, 1981-2014, 2008.
Acknowledgements
Hanhuai Shan Amrudin Agovic