Parameter learning in Markov nets
Graphical Models – 10708
Carlos Guestrin
Carnegie Mellon University
November 17th, 2008
Readings: K&F: 19.1, 19.2, 19.3, 19.4
10-708 – Carlos Guestrin 2006-2008
Learning Parameters of a BN
Log likelihood decomposes:
Learn each CPT independently, using counts
[Figure: Bayesian network over Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy]
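Because the log likelihood decomposes, each CPT can be fit from simple counts. A minimal sketch, with toy data and variable values invented for illustration:

```python
from collections import Counter

# Toy samples (parent_value, child_value) for one CPT P(C | P);
# the data and value names are made up for illustration.
samples = [("hi", "a"), ("hi", "a"), ("hi", "b"), ("lo", "a")]

def mle_cpt(samples):
    """Maximum-likelihood CPT: P(c | p) = Count(p, c) / Count(p)."""
    joint = Counter(samples)
    parent = Counter(p for p, _ in samples)
    return {(p, c): joint[(p, c)] / parent[p] for (p, c) in joint}

cpt = mle_cpt(samples)  # P("a" | "hi") = 2/3, P("b" | "hi") = 1/3
```

In a full BN, this runs once per variable, each fit entirely independent of the others.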
Log Likelihood for MN
Log likelihood of the data:
[Figure: Markov network over Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy]
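The slide's equation was handwritten and not captured; in standard notation, the MN log likelihood over clique potentials φc has the form (a sketch, not necessarily the slide's exact notation):

```latex
\ell(\theta; \mathcal{D}) \;=\; \sum_{j=1}^{m} \sum_{c} \log \phi_c\!\left(\mathbf{x}_c^{(j)}\right) \;-\; m \log Z(\theta),
\qquad
Z(\theta) \;=\; \sum_{\mathbf{x}} \prod_{c} \phi_c(\mathbf{x}_c)
```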
Log Likelihood doesn’t decompose for MNs
Log likelihood:
A concave problem: can find the global optimum!!
The log Z term doesn’t decompose!!
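Why log Z couples everything: Z sums over every joint assignment, so it cannot be split into independent per-factor terms. A toy sketch with made-up potential tables over three binary variables:

```python
import itertools
import math

# Two pairwise potentials over binary variables A-B and B-C;
# the tables are invented for illustration.
psi_ab = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0}
psi_bc = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

def log_Z():
    # Z sums the product of ALL potentials over ALL joint assignments,
    # so log Z ties the factors together instead of decomposing.
    Z = sum(psi_ab[(a, b)] * psi_bc[(b, c)]
            for a, b, c in itertools.product([0, 1], repeat=3))
    return math.log(Z)
```

Brute-force enumeration works here only because the model is tiny; in general, computing Z is the inference bottleneck.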
Derivative of Log Likelihood for MNs
Derivative of Log Likelihood for MNs 2
Derivative of Log Likelihood for MNs
Derivative:
Computing the derivative requires inference:
Can optimize using gradient ascent (the common approach): conjugate gradient, Newton’s method, …
Let’s also look at a simpler solution
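The derivative with respect to a log-potential parameter is the empirical feature count minus its expectation under the model, and that expectation is what requires inference. A brute-force sketch (feature, parameters, and data are invented; enumeration stands in for real inference):

```python
import math

# theta maps each joint assignment of a tiny model to its log-potential.
theta = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}

def model_expectation(theta, feature):
    # "Inference" by brute-force enumeration of the joint distribution.
    Z = sum(math.exp(t) for t in theta.values())
    return sum(math.exp(theta[x]) / Z * feature(x) for x in theta)

def gradient(theta, feature, data):
    # empirical count minus expected count under the current model
    empirical = sum(feature(x) for x in data) / len(data)
    return empirical - model_expectation(theta, feature)

f = lambda x: 1.0 if x == (1, 1) else 0.0
g = gradient(theta, f, [(1, 1), (0, 0)])  # 0.5 - 0.25 = 0.25
```

The gradient is zero exactly when the model's expected counts match the data's counts, the moment-matching condition behind both gradient ascent and IPF.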
Iterative Proportional Fitting (IPF)
Setting derivative to zero:
Fixed point equation:
Iterate and converge to optimal parameters. Each iteration, must compute:
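The IPF update multiplies each clique potential by the ratio of the empirical marginal to the current model marginal; computing that model marginal is the inference needed each iteration. A single-clique toy sketch (tables are invented), where IPF converges in one step:

```python
def ipf_update(psi, p_hat):
    """One IPF step: psi_new(c) = psi(c) * p_hat(c) / p_model(c)."""
    Z = sum(psi.values())
    p_model = {x: v / Z for x, v in psi.items()}   # the inference step
    return {x: psi[x] * p_hat[x] / p_model[x] for x in psi}

psi = {(0,): 1.0, (1,): 3.0}      # current clique potential
p_hat = {(0,): 0.5, (1,): 0.5}    # empirical marginal from data
psi = ipf_update(psi, p_hat)
# the model marginal now matches p_hat exactly
```

With many overlapping cliques, updating one potential perturbs the marginals of the others, which is why IPF must cycle through the cliques repeatedly.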
Log-linear Markov network (most common representation)
Feature φ is some function φ[D] for some subset of variables D, e.g., an indicator function
Log-linear model over a Markov network H: a set of features φ1[D1], …, φk[Dk]
each Di is a subset of a clique in H
two φ’s can be over the same variables
a set of weights w1, …, wk
usually learned from data
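A log-linear Markov net defines P(x) ∝ exp(Σi wi φi[Di](x)). A minimal sketch over two binary variables (features and weights are invented for illustration):

```python
import itertools
import math

# Two indicator features over subsets of {X0, X1}, with weights.
features = [
    lambda x: 1.0 if x[0] == x[1] else 0.0,   # phi_1 over {X0, X1}
    lambda x: 1.0 if x[1] == 1 else 0.0,      # phi_2 over {X1}
]
w = [1.0, -0.5]

def unnormalized(x):
    # exp of the weighted feature sum
    return math.exp(sum(wi * phi(x) for wi, phi in zip(w, features)))

def prob(x):
    # normalize over all joint assignments (feasible only for tiny models)
    Z = sum(unnormalized(y) for y in itertools.product([0, 1], repeat=2))
    return unnormalized(x) / Z
```

Note that two features here overlap on X1, which the representation explicitly allows.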
Learning params for log-linear models – Gradient Ascent
Log-likelihood of data:
Compute derivative & optimize, usually with conjugate gradient ascent
Derivative of log-likelihood 1 – log-linear models
Derivative of log-likelihood 2 – log-linear models
Learning log-linear models with gradient ascent
Gradient:
Requires one inference computation per gradient step
Theorem: w is the maximum likelihood solution iff the empirical feature counts equal the expected feature counts under the model:
Usually, must regularize, e.g., L2 regularization on the parameters
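Putting the pieces together: gradient ascent on the log likelihood with an L2 penalty, using brute-force enumeration as the per-step inference. A self-contained sketch; the feature, data, regularization strength, and step size are all invented:

```python
import itertools
import math

# One indicator feature over two binary variables; w is its weight.
phi = lambda x: 1.0 if x == (1, 1) else 0.0
data = [(1, 1), (1, 1), (0, 0), (0, 1)]
lam = 0.1   # L2 regularization strength (arbitrary choice)

def grad(w):
    states = list(itertools.product([0, 1], repeat=2))
    Z = sum(math.exp(w * phi(x)) for x in states)          # inference
    expected = sum(math.exp(w * phi(x)) / Z * phi(x) for x in states)
    empirical = sum(phi(x) for x in data) / len(data)
    return empirical - expected - lam * w   # L2-regularized gradient

w = 0.0
for _ in range(2000):      # plain gradient ascent with fixed step size
    w += 0.5 * grad(w)
```

In practice conjugate gradient or quasi-Newton methods replace the fixed-step loop, but the gradient computation, one inference per step, is the same.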
What you need to know about learning MN parameters
BN parameter learning is easy; MN parameter learning doesn’t decompose!
Learning requires inference!
Apply gradient ascent or IPF iterations to obtain optimal parameters
Conditional Random Fields
Graphical Models – 10708
Carlos Guestrin
Carnegie Mellon University
November 17th, 2008
Readings: K&F: 19.3.2
Generative v. Discriminative classifiers – A review
Want to learn h: X → Y; X – features, Y – target classes
Bayes optimal classifier: P(Y|X)
Generative classifier, e.g., Naïve Bayes:
Assume some functional form for P(X|Y), P(Y)
Estimate parameters of P(X|Y), P(Y) directly from training data
Use Bayes rule to calculate P(Y|X = x)
This is a ‘generative’ model:
Indirect computation of P(Y|X) through Bayes rule
But it can generate a sample of the data: P(X) = Σy P(y) P(X|y)
Discriminative classifiers, e.g., Logistic Regression:
Assume some functional form for P(Y|X)
Estimate parameters of P(Y|X) directly from training data
This is the ‘discriminative’ model:
Directly learn P(Y|X)
But cannot obtain a sample of the data, because P(X) is not available
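The generative route in miniature: estimate P(Y) and P(X|Y) from counts, then apply Bayes rule to get P(Y|X). A sketch with made-up binary training pairs:

```python
from collections import Counter

# (x, y) training pairs, both binary; toy data for illustration.
data = [(0, 0), (0, 0), (1, 0), (1, 1), (1, 1), (0, 1)]

cy = Counter(y for _, y in data)
cxy = Counter(data)

def p_y(y):
    return cy[y] / len(data)

def p_x_given_y(x, y):
    return cxy[(x, y)] / cy[y]

def p_y_given_x(y, x):
    """Bayes rule: P(y | x) = P(x | y) P(y) / sum_y' P(x | y') P(y')."""
    num = p_x_given_y(x, y) * p_y(y)
    den = sum(p_x_given_y(x, yp) * p_y(yp) for yp in cy)
    return num / den
```

A discriminative model would instead parameterize p_y_given_x directly and never estimate p_x_given_y, so it could not sample X.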
Log-linear CRFs (most common representation)
Graph H: only over hidden vars Y1, …, Yn
No assumptions about dependency on observed vars X; you must always observe all of X
Feature φ is some function φ[D] for some subset of variables D, e.g., an indicator function
Log-linear model over a CRF H: a set of features φ1[D1], …, φk[Dk]
each Di is a subset of a clique in H
two φ’s can be over the same variables
a set of weights w1, …, wk
usually learned from data
Learning params for log-linear CRFs – Gradient Ascent
Log-likelihood of data:
Compute derivative & optimize, usually with conjugate gradient ascent
Learning log-linear CRFs with gradient ascent
Gradient:
Requires one inference computation per datapoint per gradient step
Usually, must regularize, e.g., L2 regularization on the parameters
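For a CRF the expectation in the gradient is conditional on each observed x(j), so inference runs once per datapoint per gradient step, the reason CRF training can be slow. A brute-force sketch with an invented feature and toy data:

```python
import math

# P(y | x) proportional to exp(w * phi(x, y)), with y binary (toy setup).
phi = lambda x, y: float(x == y)

def cond_expectation(w, x):
    # per-datapoint "inference": enumerate y for this particular x
    Z = sum(math.exp(w * phi(x, y)) for y in (0, 1))
    return sum(math.exp(w * phi(x, y)) / Z * phi(x, y) for y in (0, 1))

def crf_gradient(w, data):
    # average over datapoints of (observed feature - conditional expectation)
    return sum(phi(x, y) - cond_expectation(w, x) for x, y in data) / len(data)

g0 = crf_gradient(0.0, [(0, 0), (1, 1), (1, 0)])  # (0.5 + 0.5 - 0.5) / 3
```

Contrast with the unconditional MN case, where a single inference computation serves the whole dataset at each step.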
What you need to know about CRFs
Discriminative learning of graphical models
Fewer assumptions about the distribution; often performs better than a “similar” MN
Gradient computation requires inference per datapoint. Can be really slow!!
EM for BNs
Graphical Models – 10708
Carlos Guestrin
Carnegie Mellon University
November 17th, 2008
Readings: 18.1, 18.2
Thus far, fully supervised learning
We have assumed fully supervised learning:
Many real problems have missing data:
The general learning problem with missing data
Marginal likelihood – x is observed, z is missing:
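The slide's equation was not captured; in standard form, the marginal log likelihood sums out the missing z inside the log (a sketch, not necessarily the slide's exact notation):

```latex
\ell(\theta; \mathcal{D}) \;=\; \sum_{j=1}^{m} \log P\!\left(x^{(j)} \mid \theta\right)
\;=\; \sum_{j=1}^{m} \log \sum_{z} P\!\left(x^{(j)}, z \mid \theta\right)
```

The sum inside the log is what breaks the decomposition that made fully observed BN learning easy.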
E-step
x is observed, z is missing
Compute the probability of the missing data given the current choice of θ:
Q(z|x(j)) for each x(j)
e.g., the probability computed during the classification step; corresponds to the “classification step” in K-means
Jensen’s inequality
Theorem: log Σz P(z) f(z) ≥ Σz P(z) log f(z)
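A quick numerical check of the inequality; the distribution P and values f are arbitrary choices for illustration:

```python
import math

P = [0.3, 0.7]     # arbitrary distribution over z
f = [2.0, 5.0]     # arbitrary positive values f(z)

lhs = math.log(sum(p * v for p, v in zip(P, f)))   # log E[f(z)]
rhs = sum(p * math.log(v) for p, v in zip(P, f))   # E[log f(z)]
# lhs >= rhs because log is concave (Jensen's inequality)
```

Equality holds only when f is constant wherever P puts mass, which is what makes the EM lower bound tight at the current parameters.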
Applying Jensen’s inequality
Use: log Σz P(z) f(z) ≥ Σz P(z) log f(z)
The M-step maximizes lower bound on weighted data
Lower bound from Jensen’s:
Corresponds to a weighted dataset:
<x(1), z=1> with weight Q(t+1)(z=1 | x(1))
<x(1), z=2> with weight Q(t+1)(z=2 | x(1))
<x(1), z=3> with weight Q(t+1)(z=3 | x(1))
<x(2), z=1> with weight Q(t+1)(z=1 | x(2))
<x(2), z=2> with weight Q(t+1)(z=2 | x(2))
<x(2), z=3> with weight Q(t+1)(z=3 | x(2))
…
The M-step
Maximization step:
Use expected counts instead of counts: if learning requires Count(x, z), use EQ(t+1)[Count(x, z)]
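With missing data, each datapoint contributes fractionally to every completion z, weighted by the E-step posterior. A sketch for one binary hidden variable over three datapoints; the Q values are invented for illustration:

```python
# Q[j][z] = Q(z | x(j)) from the E-step; toy posteriors.
Q = [
    {0: 0.9, 1: 0.1},
    {0: 0.2, 1: 0.8},
    {0: 0.5, 1: 0.5},
]

def expected_count(z):
    """E_Q[Count(z)] = sum over datapoints j of Q(z | x(j))."""
    return sum(q[z] for q in Q)

# expected_count(0) == 1.6, expected_count(1) == 1.4
```

Wherever the fully observed M-step would divide hard counts, the EM M-step divides these expected counts instead.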
Convergence of EM
Define potential function F(θ, Q):
EM corresponds to coordinate ascent on F. Thus, it maximizes a lower bound on the marginal log likelihood. As seen in the Machine Learning class last semester.
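The slide's definition was not captured; the standard choice of potential (free energy) consistent with the Jensen bound above is:

```latex
F(\theta, Q) \;=\; \sum_{j} \sum_{z} Q\!\left(z \mid x^{(j)}\right)
  \log \frac{P\!\left(z, x^{(j)} \mid \theta\right)}{Q\!\left(z \mid x^{(j)}\right)}
```

The E-step maximizes F over Q with θ fixed (making the bound on the marginal log likelihood tight), and the M-step maximizes F over θ with Q fixed, hence coordinate ascent.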