HMM, MEMM and CRF
40-957 Special Topics in Artificial Intelligence:
Probabilistic Graphical Models
Sharif University of Technology
Soleymani
Spring 2014
Sequence labeling

Take a set of interrelated instances $x_1, \dots, x_T$ and label them jointly.
We get as input a sequence of observations $X = x_{1:T}$ and need to label it with a joint label assignment $Y = y_{1:T}$.
Generalization of mixture models for sequential data

[Figure: chain-structured graphical model with states $Y_1, Y_2, \dots, Y_{T-1}, Y_T$ and observations $X_1, X_2, \dots, X_{T-1}, X_T$; from Jordan]

$Y$: states (latent variables)
$X$: observations
HMM examples

Part-of-speech tagging, e.g. Students/NNS are/VBP expected/VBN to/TO study/VB
Speech recognition
HMM: probabilistic model

Transition probabilities: probabilities of transitions between states
$A_{ij} \equiv P(Y_t = j \mid Y_{t-1} = i)$

Initial state distribution: probabilities of starting in each state
$\pi_i \equiv P(Y_1 = i)$

Observation model: emission probabilities associated with each state
$P(X_t \mid Y_t, \theta)$
HMM: probabilistic model

Transition probabilities:
$P(Y_t \mid Y_{t-1} = i) \sim \mathrm{Mult}(A_{i1}, \dots, A_{iM}) \quad \forall i \in \text{states}$

Initial state distribution:
$P(Y_1) \sim \mathrm{Mult}(\pi_1, \dots, \pi_M)$

Observation model (emission probabilities associated with each state):
Discrete observations: $P(X_t \mid Y_t = i) \sim \mathrm{Mult}(B_{i,1}, \dots, B_{i,K}) \quad \forall i \in \text{states}$
General: $P(X_t \mid Y_t = i) = f(\cdot \mid \theta_i)$

$Y$: states (latent variables)
$X$: observations
Inference problems in sequential data

Decoding: $\mathrm{argmax}_{y_1, \dots, y_T} \; P(y_1, \dots, y_T \mid x_1, \dots, x_T)$
Evaluation:
  Filtering: $P(y_t \mid x_1, \dots, x_t)$
  Smoothing: $P(y_{t'} \mid x_1, \dots, x_t)$ for $t' < t$
  Prediction: $P(y_{t'} \mid x_1, \dots, x_t)$ for $t' > t$
Some questions

$P(y_t \mid x_1, \dots, x_t) = ?$
Forward algorithm

$P(x_1, \dots, x_T) = ?$
Forward algorithm

$P(y_t \mid x_1, \dots, x_T) = ?$
Forward-Backward algorithm

How do we adjust the HMM parameters?
  Complete data: each training sample includes a state sequence and the corresponding observation sequence
  Incomplete data: each training sample includes only an observation sequence
Forward algorithm

$\alpha_t(j) \equiv P(x_1, \dots, x_t, Y_t = j)$

$\alpha_t(j) = \left[ \sum_i \alpha_{t-1}(i) \, P(Y_t = j \mid Y_{t-1} = i) \right] P(x_t \mid Y_t = j)$

Initialization:
$\alpha_1(j) = P(x_1, Y_1 = j) = P(x_1 \mid Y_1 = j) \, P(Y_1 = j)$

Iterations ($t = 2$ to $T$, for $j = 1, \dots, M$):
$\alpha_t(j) = \left[ \sum_i \alpha_{t-1}(i) \, P(Y_t = j \mid Y_{t-1} = i) \right] P(x_t \mid Y_t = j)$

In message-passing terms on the chain $Y_1, \dots, Y_T$: $\alpha_t(j) = m_{t-1 \to t}(j)$.
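As a concrete sketch, the forward recursion can be run in plain Python on a toy two-state HMM; all parameter values below are made up for illustration:

```python
# Forward algorithm for a toy 2-state HMM (example parameters are made up).
pi = [0.6, 0.4]                      # initial distribution pi_i = P(Y_1 = i)
A = [[0.7, 0.3], [0.4, 0.6]]         # A[i][j] = P(Y_t = j | Y_{t-1} = i)
B = [[0.9, 0.1], [0.2, 0.8]]         # B[i][k] = P(X_t = k | Y_t = i)

def forward(obs):
    """Return alpha where alpha[t-1][j] = P(x_1..x_t, Y_t = j)."""
    M = len(pi)
    # Initialization: alpha_1(j) = P(x_1 | Y_1 = j) P(Y_1 = j)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(M)]]
    # Iterations: alpha_t(j) = [sum_i alpha_{t-1}(i) A_ij] P(x_t | Y_t = j)
    for x in obs[1:]:
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(M)) * B[j][x]
                      for j in range(M)])
    return alpha

alpha = forward([0, 1, 0])
likelihood = sum(alpha[-1])          # P(x_1, ..., x_T) = sum_j alpha_T(j)
```

Normalizing $\alpha_t(\cdot)$ yields the filtering distribution $P(y_t \mid x_1, \dots, x_t)$.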
Backward algorithm

$\beta_t(i) = m_{t+1 \to t}(i) = P(x_{t+1}, \dots, x_T \mid Y_t = i)$

$\beta_{t-1}(i) = \sum_j \beta_t(j) \, P(Y_t = j \mid Y_{t-1} = i) \, P(x_t \mid Y_t = j)$

Initialization:
$\beta_T(i) = 1$

Iterations ($t = T$ down to 2, for $i, j \in \text{states}$):
$\beta_{t-1}(i) = \sum_j \beta_t(j) \, P(Y_t = j \mid Y_{t-1} = i) \, P(x_t \mid Y_t = j)$
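The backward recursion admits an equally small sketch; the toy two-state parameters below are made up for illustration, and the sequence likelihood is recovered from the start of the chain as a sanity check:

```python
# Backward algorithm for a toy 2-state HMM (example parameters are made up).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

def backward(obs):
    """Return beta where beta[t-1][i] = P(x_{t+1}..x_T | Y_t = i)."""
    M = len(pi)
    beta = [[1.0] * M]                       # initialization: beta_T(i) = 1
    # Iterations t = T down to 2:
    # beta_{t-1}(i) = sum_j beta_t(j) P(Y_t = j | Y_{t-1} = i) P(x_t | Y_t = j)
    for t in range(len(obs) - 1, 0, -1):
        beta.insert(0, [sum(A[i][j] * B[j][obs[t]] * beta[0][j]
                            for j in range(len(pi)))
                        for i in range(len(pi))])
    return beta

beta = backward([0, 1, 0])
# The likelihood can also be read off the start of the chain:
likelihood = sum(pi[i] * B[i][0] * beta[0][i] for i in range(2))
```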
Forward-backward algorithm

$\alpha_1(j) = P(x_1, Y_1 = j) = P(x_1 \mid Y_1 = j) \, P(Y_1 = j)$
$\alpha_t(j) = \left[ \sum_i \alpha_{t-1}(i) \, P(Y_t = j \mid Y_{t-1} = i) \right] P(x_t \mid Y_t = j)$

$\beta_T(i) = 1$
$\beta_{t-1}(i) = \sum_j \beta_t(j) \, P(Y_t = j \mid Y_{t-1} = i) \, P(x_t \mid Y_t = j)$

$P(x_1, x_2, \dots, x_T) = \sum_j \alpha_T(j) = \sum_j \pi_j \, P(x_1 \mid Y_1 = j) \, \beta_1(j)$

$P(Y_t = i \mid x_1, x_2, \dots, x_T) = \dfrac{\alpha_t(i) \, \beta_t(i)}{\sum_j \alpha_T(j)}$

where
$\alpha_t(i) \equiv P(x_1, x_2, \dots, x_t, Y_t = i)$
$\beta_t(i) \equiv P(x_{t+1}, x_{t+2}, \dots, x_T \mid Y_t = i)$
Forward-backward algorithm

This will be used in expectation maximization to train an HMM:

$P(Y_t = i \mid x_1, \dots, x_T) = \dfrac{P(x_1, \dots, x_T, Y_t = i)}{P(x_1, \dots, x_T)} = \dfrac{\alpha_t(i) \, \beta_t(i)}{\sum_{j=1}^{M} \alpha_T(j)}$

[Figure: the chain $Y_1, \dots, Y_T$ with forward messages $\alpha_1(\cdot), \alpha_2(\cdot), \dots, \alpha_T(\cdot)$ and backward messages $\beta_T(\cdot) = 1, \beta_{T-1}(\cdot), \dots, \beta_1(\cdot)$]
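Putting the two passes together gives the smoothed posteriors. A minimal sketch on a toy two-state HMM (parameter values made up for the example):

```python
# Smoothing P(Y_t = i | x_1..x_T) via forward-backward on a toy 2-state HMM
# (all parameter values are made up for the example).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
M = 2

def forward(obs):
    alpha = [[pi[j] * B[j][obs[0]] for j in range(M)]]
    for x in obs[1:]:
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(M)) * B[j][x]
                      for j in range(M)])
    return alpha

def backward(obs):
    beta = [[1.0] * M]
    for t in range(len(obs) - 1, 0, -1):
        beta.insert(0, [sum(A[i][j] * B[j][obs[t]] * beta[0][j] for j in range(M))
                        for i in range(M)])
    return beta

obs = [0, 1, 0]
alpha, beta = forward(obs), backward(obs)
likelihood = sum(alpha[-1])
# gamma_t(i) = alpha_t(i) * beta_t(i) / sum_j alpha_T(j)
gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(M)]
         for t in range(len(obs))]
```

Each row of `gamma` is a proper distribution over states, which is exactly what the E-step of Baum-Welch consumes.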
Decoding problem

Choose the state sequence that is most probable given the observations:
$\mathrm{argmax}_{y_1, \dots, y_t} \; P(y_1, \dots, y_t \mid x_1, \dots, x_t)$

Viterbi algorithm:
Define the auxiliary variable $\delta$:
$\delta_t(j) = \max_{y_1, \dots, y_{t-1}} P(y_1, y_2, \dots, y_{t-1}, Y_t = j, x_1, x_2, \dots, x_t)$
$\delta_t(j)$: probability of the most probable path ending in state $Y_t = j$

Recursive relation:
$\delta_t(j) = \max_{i = 1, \dots, M} \; \delta_{t-1}(i) \, P(Y_t = j \mid Y_{t-1} = i) \, P(x_t \mid Y_t = j)$
Decoding problem: Viterbi algorithm

Initialization ($j = 1, \dots, M$):
$\delta_1(j) = P(x_1 \mid Y_1 = j) \, P(Y_1 = j)$
$\psi_1(j) = 0$

Iterations ($t = 2, \dots, T$, for $j = 1, \dots, M$):
$\delta_t(j) = \max_i \; \delta_{t-1}(i) \, P(Y_t = j \mid Y_{t-1} = i) \, P(x_t \mid Y_t = j)$
$\psi_t(j) = \mathrm{argmax}_i \; \delta_{t-1}(i) \, P(Y_t = j \mid Y_{t-1} = i)$

Final computation:
$P^* = \max_{j = 1, \dots, M} \delta_T(j)$
$y_T^* = \mathrm{argmax}_{j = 1, \dots, M} \; \delta_T(j)$

Traceback of the state sequence ($t = T - 1$ down to 1):
$y_t^* = \psi_{t+1}(y_{t+1}^*)$
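The full procedure fits in a few lines of Python. A sketch on a toy two-state HMM (parameter values made up for the example):

```python
# Viterbi decoding for a toy 2-state HMM (example parameters are made up).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

def viterbi(obs):
    """Return (most probable state path, its joint probability with obs)."""
    M, T = len(pi), len(obs)
    delta = [[pi[j] * B[j][obs[0]] for j in range(M)]]   # delta_1(j)
    psi = [[0] * M]                                      # psi_1(j) = 0
    for t in range(1, T):
        d, p = [], []
        for j in range(M):
            scores = [delta[t - 1][i] * A[i][j] for i in range(M)]
            best_i = max(range(M), key=lambda i: scores[i])
            d.append(scores[best_i] * B[j][obs[t]])      # delta_t(j)
            p.append(best_i)                             # psi_t(j)
        delta.append(d)
        psi.append(p)
    # Final computation and traceback:
    best = max(range(M), key=lambda j: delta[-1][j])
    path = [best]
    for t in range(T - 1, 0, -1):
        path.insert(0, psi[t][path[0]])
    return path, delta[-1][best]

path, prob = viterbi([0, 1, 0])
```

In practice the recursion is usually done with log-probabilities (sums instead of products) to avoid underflow on long sequences.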
Max-product algorithm

$m_{j \to i}^{\max}(x_i) = \max_{x_j} \; \phi(x_j) \, \psi(x_i, x_j) \prod_{k \in \mathcal{N}(j) \setminus i} m_{k \to j}^{\max}(x_j)$

Viterbi is max-product on the HMM chain:
$\delta_t(j) = m_{t-1 \to t}^{\max}(j) \times P(x_t \mid Y_t = j)$
HMM Learning

Supervised learning: we have a set of data samples, each containing a pair of sequences (an observation sequence and the corresponding state sequence)

Unsupervised learning: we have a set of data samples, each containing only a sequence of observations
HMM supervised learning by MLE

Initial state probability:
$\pi_i = P(Y_1 = i), \quad 1 \le i \le M$

State transition probability:
$A_{ij} = P(Y_{t+1} = j \mid Y_t = i), \quad 1 \le i, j \le M$

Emission probability (discrete observations):
$B_{ik} = P(X_t = k \mid Y_t = i), \quad 1 \le k \le K$
HMM: supervised parameter learning by MLE

$P(D \mid \theta) = \prod_{n=1}^{N} P(y_1^{(n)} \mid \pi) \prod_{t=2}^{T} P(y_t^{(n)} \mid y_{t-1}^{(n)}, A) \prod_{t=1}^{T} P(x_t^{(n)} \mid y_t^{(n)}, B)$

$\hat{\pi}_i = \dfrac{\sum_{n=1}^{N} I(y_1^{(n)} = i)}{N}$

$\hat{A}_{ij} = \dfrac{\sum_{n=1}^{N} \sum_{t=2}^{T} I(y_{t-1}^{(n)} = i, \; y_t^{(n)} = j)}{\sum_{n=1}^{N} \sum_{t=2}^{T} I(y_{t-1}^{(n)} = i)}$

$\hat{B}_{ik} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} I(y_t^{(n)} = i, \; x_t^{(n)} = k)}{\sum_{n=1}^{N} \sum_{t=1}^{T} I(y_t^{(n)} = i)}$

(Discrete observations)
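These maximum-likelihood estimates reduce to normalized counts. A minimal sketch on two made-up labeled sequences (state and observation alphabets of size 2):

```python
from collections import Counter

# Supervised MLE for a discrete HMM from fully labeled sequences.
# Toy data: each sample is (state sequence, observation sequence); made up
# for the example. M states, K observation symbols.
data = [([0, 0, 1], [0, 1, 1]),
        ([1, 0],    [1, 0])]
M, K = 2, 2

pi_count, trans, emit = Counter(), Counter(), Counter()
for states, obs in data:
    pi_count[states[0]] += 1                 # initial-state counts
    for i, j in zip(states, states[1:]):
        trans[(i, j)] += 1                   # transition counts
    for i, k in zip(states, obs):
        emit[(i, k)] += 1                    # emission counts

N = len(data)
pi_hat = [pi_count[i] / N for i in range(M)]
A_hat = [[trans[(i, j)] / sum(trans[(i, jj)] for jj in range(M))
          for j in range(M)] for i in range(M)]
B_hat = [[emit[(i, k)] / sum(emit[(i, kk)] for kk in range(K))
          for k in range(K)] for i in range(M)]
```

A real implementation would add smoothing (e.g. add-one) so that unseen transitions and emissions do not get probability zero.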
Learning

Problem: how do we construct an HMM given only observations?
Find $\theta = (A, B, \pi)$ maximizing the likelihood $P(X \mid \theta)$ of the observed sequences
Incomplete data
EM algorithm
HMM learning by EM (Baum-Welch)

Given current parameters $\theta^{old} = (\pi^{old}, A^{old}, B^{old})$:

E-step:
$\gamma_{i,t}^{n} = P(y_t^{(n)} = i \mid \mathbf{x}^{(n)}; \theta^{old})$
$\xi_{i,j,t}^{n} = P(y_{t-1}^{(n)} = i, \; y_t^{(n)} = j \mid \mathbf{x}^{(n)}; \theta^{old})$

M-step ($i, j = 1, \dots, M$; $k = 1, \dots, K$):
$\pi_i^{new} = \dfrac{\sum_{n=1}^{N} \gamma_{i,1}^{n}}{N}$
$A_{ij}^{new} = \dfrac{\sum_{n=1}^{N} \sum_{t=2}^{T} \xi_{i,j,t}^{n}}{\sum_{n=1}^{N} \sum_{t=1}^{T-1} \gamma_{i,t}^{n}}$
$B_{ik}^{new} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{i,t}^{n} \, I(x_t^{(n)} = k)}{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{i,t}^{n}}$

Baum-Welch algorithm (Baum, 1972); discrete observations.
HMM learning by EM (Baum-Welch)

E-step (as before):
$\gamma_{i,t}^{n} = P(y_t^{(n)} = i \mid \mathbf{x}^{(n)}; \theta^{old})$
$\xi_{i,j,t}^{n} = P(y_{t-1}^{(n)} = i, \; y_t^{(n)} = j \mid \mathbf{x}^{(n)}; \theta^{old})$

M-step: $\pi^{new}$ and $A^{new}$ as before; assuming Gaussian emission probabilities, the emission parameters become:
$\mu_i^{new} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{i,t}^{n} \, \mathbf{x}_t^{(n)}}{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{i,t}^{n}}$
$\Sigma_i^{new} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{i,t}^{n} \, (\mathbf{x}_t^{(n)} - \mu_i^{new})(\mathbf{x}_t^{(n)} - \mu_i^{new})^\top}{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{i,t}^{n}}$

Baum-Welch algorithm (Baum, 1972)
Forward-backward algorithm for the E-step

$P(y_{t-1}, y_t \mid x_1, \dots, x_T) = \dfrac{P(x_1, \dots, x_T \mid y_{t-1}, y_t) \, P(y_{t-1}, y_t)}{P(x_1, \dots, x_T)}$
$= \dfrac{P(x_1, \dots, x_{t-1} \mid y_{t-1}) \, P(x_t \mid y_t) \, P(x_{t+1}, \dots, x_T \mid y_t) \, P(y_t \mid y_{t-1}) \, P(y_{t-1})}{P(x_1, \dots, x_T)}$
$= \dfrac{\alpha_{t-1}(y_{t-1}) \, P(x_t \mid y_t) \, P(y_t \mid y_{t-1}) \, \beta_t(y_t)}{\sum_{j=1}^{M} \alpha_T(j)}$
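This pairwise posterior is the $\xi$ quantity of the E-step. A sketch on the same style of toy two-state HMM (parameter values made up for the example):

```python
# Pairwise posterior P(y_{t-1}, y_t | x_1..x_T) from alpha/beta messages,
# for a toy 2-state HMM (parameter values are made up for the example).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
M = 2

def forward(obs):
    alpha = [[pi[j] * B[j][obs[0]] for j in range(M)]]
    for x in obs[1:]:
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(M)) * B[j][x]
                      for j in range(M)])
    return alpha

def backward(obs):
    beta = [[1.0] * M]
    for t in range(len(obs) - 1, 0, -1):
        beta.insert(0, [sum(A[i][j] * B[j][obs[t]] * beta[0][j] for j in range(M))
                        for i in range(M)])
    return beta

obs = [0, 1, 0]
alpha, beta = forward(obs), backward(obs)
Z = sum(alpha[-1])                       # P(x_1, ..., x_T)

def xi(t):
    """P(Y_{t-1} = i, Y_t = j | x_1..x_T) for 1-indexed t in 2..T."""
    return [[alpha[t - 2][i] * A[i][j] * B[j][obs[t - 1]] * beta[t - 1][j] / Z
             for j in range(M)] for i in range(M)]

xi2 = xi(2)
```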
HMM shortcomings

In modeling the joint distribution $P(Y, X)$, the HMM ignores many dependencies between the observations $X_1, \dots, X_T$ (like most generative models, it must simplify the structure)

In the sequence labeling task, we need to classify an observation sequence using the conditional probability $P(Y \mid X)$
However, the HMM learns the joint distribution $P(Y, X)$ while using only $P(Y \mid X)$ for labeling
Maximum Entropy Markov Model (MEMM)

$P(y_{1:T} \mid x_{1:T}) = P(y_1 \mid x_{1:T}) \prod_{t=2}^{T} P(y_t \mid y_{t-1}, x_{1:T})$

$P(y_t \mid y_{t-1}, x_{1:T}) = \dfrac{\exp\left( \mathbf{w}^\top \mathbf{f}(y_t, y_{t-1}, x_{1:T}) \right)}{Z(y_{t-1}, x_{1:T}, \mathbf{w})}$

$Z(y_{t-1}, x_{1:T}, \mathbf{w}) = \sum_{y_t} \exp\left( \mathbf{w}^\top \mathbf{f}(y_t, y_{t-1}, x_{1:T}) \right)$

Discriminative model:
Models only $P(Y \mid X)$ and does not model $P(X)$ at all
Maximizes the conditional likelihood $P(\mathbf{Y} \mid \mathbf{X}, \theta)$
Feature function

The feature function $\mathbf{f}(y_t, y_{t-1}, x_{1:T})$ can take account of relations between both the data and the label space
However, features are often indicator functions showing the absence or presence of some property
The weight $w_i$ captures how closely $f_i(y_t, y_{t-1}, x_{1:T})$ is related to the label
MEMM disadvantages

Later observations in the sequence have no effect on the posterior probability of the current state:
  the model does not allow for any smoothing
  the model is incapable of going back and revising its predictions for earlier observations

The label bias problem:
  there are cases where a given observation is not useful in predicting the next state of the model
  if a state has a single outgoing transition, the observation is ignored entirely
Label bias problem in MEMM

Label bias problem: states with fewer outgoing arcs are preferred
  In decoding, states with lower-entropy transition distributions are preferred over others
  MEMMs should probably be avoided when many transitions are close to deterministic
  Extreme case: when a state has only one outgoing arc, the observation does not matter at all
The source of the problem: the probabilities of the outgoing arcs are normalized separately for each state
  the transition probabilities out of any state must sum to 1
Solution: do not normalize probabilities locally
From MEMM to CRF

From local probabilities to local potentials:

$P(\boldsymbol{Y} \mid \boldsymbol{X}) = \dfrac{1}{Z(\boldsymbol{X})} \prod_{t=1}^{T} \phi(y_{t-1}, y_t, \boldsymbol{X}) = \dfrac{1}{Z(\boldsymbol{X}, \mathbf{w})} \prod_{t=1}^{T} \exp\left( \mathbf{w}^\top \mathbf{f}(y_t, y_{t-1}, \boldsymbol{X}) \right)$

The CRF is a discriminative model (like the MEMM) that can capture the dependence between each state and the entire observation sequence
It uses a global normalizer $Z(\boldsymbol{X}, \mathbf{w})$, which overcomes the label bias problem of the MEMM
MEMMs use a separate exponential model for each state, while a CRF has a single exponential model for the joint probability of the entire label sequence

$\boldsymbol{X} \equiv x_{1:T}$, $\boldsymbol{Y} \equiv y_{1:T}$
CRF: conditional distribution

$P(\mathbf{y} \mid \mathbf{x}) = \dfrac{1}{Z(\mathbf{x}, \mathbf{w})} \exp\left( \sum_{t=1}^{T} \mathbf{w}^\top \mathbf{f}(y_t, y_{t-1}, \mathbf{x}) \right) = \dfrac{1}{Z(\mathbf{x}, \mathbf{w})} \exp\left( \sum_{t=1}^{T} \sum_{i} w_i f_i(y_t, y_{t-1}, \mathbf{x}) \right)$

$Z(\mathbf{x}, \mathbf{w}) = \sum_{\mathbf{y}} \exp\left( \sum_{t=1}^{T} \mathbf{w}^\top \mathbf{f}(y_t, y_{t-1}, \mathbf{x}) \right)$
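Although $Z(\mathbf{x}, \mathbf{w})$ sums over all $M^T$ label sequences, on a chain it can be computed by a forward recursion in $O(T M^2)$ time. A sketch with made-up scores, where `score[t][i][j]` stands for $\mathbf{w}^\top \mathbf{f}(y_{t+1} = j, y_t = i, \mathbf{x})$, verified against brute-force enumeration:

```python
import math
from itertools import product

# Log-partition function log Z(x, w) of a toy linear-chain CRF with M = 2
# labels and T = 3 positions. All score values are made up for the example;
# score0[j] is the unary score for y_1 = j.
M, T = 2, 3
score0 = [0.5, -0.3]
score = [[[0.2, 1.0], [-0.5, 0.7]],      # score[t][i][j] for the t-th edge
         [[0.9, -0.1], [0.3, 0.4]]]

# Dynamic programming: log-space forward recursion over the chain.
log_alpha = list(score0)
for t in range(T - 1):
    log_alpha = [math.log(sum(math.exp(log_alpha[i] + score[t][i][j])
                              for i in range(M)))
                 for j in range(M)]
log_Z = math.log(sum(math.exp(v) for v in log_alpha))

# Brute force over all M**T label sequences, for verification:
log_Z_brute = math.log(sum(
    math.exp(score0[y[0]] + sum(score[t][y[t]][y[t + 1]] for t in range(T - 1)))
    for y in product(range(M), repeat=T)))
```

The same recursion with `max` in place of the sum gives the Viterbi-style MAP decoding of the next slide.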
CRF: MAP inference

Given the CRF parameters $\mathbf{w}$, find the $\mathbf{y}^*$ that maximizes $P(\mathbf{y} \mid \mathbf{x})$:
$\mathbf{y}^* = \mathrm{argmax}_{\mathbf{y}} \; \exp\left( \sum_{t=1}^{T} \mathbf{w}^\top \mathbf{f}(y_t, y_{t-1}, \mathbf{x}) \right)$

$Z(\mathbf{x})$ is not a function of $\mathbf{y}$, so it can be ignored
The max-product algorithm can be used for this MAP inference problem
It is the same as Viterbi decoding used in HMMs
CRF: inference

Exact inference is tractable for 1-D chain CRFs:
$\phi(y_t, y_{t-1}) = \exp\left( \mathbf{w}^\top \mathbf{f}(y_t, y_{t-1}, \mathbf{x}) \right)$

[Figure: the chain CRF and its clique tree with cliques $(Y_1, Y_2), (Y_2, Y_3), \dots, (Y_{T-1}, Y_T)$ and separators $Y_2, \dots, Y_{T-1}$]
CRF: learning

Maximum conditional likelihood:
$\mathbf{w}^* = \mathrm{argmax}_{\mathbf{w}} \; \prod_{n=1}^{N} P(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, \mathbf{w})$

$\prod_{n=1}^{N} P(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, \mathbf{w}) = \prod_{n=1}^{N} \dfrac{1}{Z(\mathbf{x}^{(n)}, \mathbf{w})} \exp\left( \sum_{t=1}^{T} \mathbf{w}^\top \mathbf{f}(y_t^{(n)}, y_{t-1}^{(n)}, \mathbf{x}^{(n)}) \right)$

$L(\mathbf{w}) = \ln \prod_{n=1}^{N} P(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, \mathbf{w}) = \sum_{n=1}^{N} \left[ \sum_{t=1}^{T} \mathbf{w}^\top \mathbf{f}(y_t^{(n)}, y_{t-1}^{(n)}, \mathbf{x}^{(n)}) - \ln Z(\mathbf{x}^{(n)}, \mathbf{w}) \right]$

$\nabla_{\mathbf{w}} L(\mathbf{w}) = \sum_{n=1}^{N} \left[ \sum_{t=1}^{T} \mathbf{f}(y_t^{(n)}, y_{t-1}^{(n)}, \mathbf{x}^{(n)}) - \nabla_{\mathbf{w}} \ln Z(\mathbf{x}^{(n)}, \mathbf{w}) \right]$

$\nabla_{\mathbf{w}} \ln Z(\mathbf{x}^{(n)}, \mathbf{w}) = \mathbb{E}_{P(\mathbf{y} \mid \mathbf{x}^{(n)}, \mathbf{w})} \left[ \sum_{t=1}^{T} \mathbf{f}(y_t, y_{t-1}, \mathbf{x}^{(n)}) \right]$

The gradient of the log-partition function in an exponential family is the expectation of the sufficient statistics.
CRF: learning

$\nabla_{\mathbf{w}} \ln Z(\mathbf{x}^{(n)}, \mathbf{w}) = \mathbb{E}_{P(\mathbf{y} \mid \mathbf{x}^{(n)}, \mathbf{w})} \left[ \sum_{t=1}^{T} \mathbf{f}(y_t, y_{t-1}, \mathbf{x}^{(n)}) \right]$
$= \sum_{t=1}^{T} \mathbb{E}_{P(\mathbf{y} \mid \mathbf{x}^{(n)}, \mathbf{w})} \left[ \mathbf{f}(y_t, y_{t-1}, \mathbf{x}^{(n)}) \right]$
$= \sum_{t=1}^{T} \sum_{y_t, y_{t-1}} P(y_t, y_{t-1} \mid \mathbf{x}^{(n)}, \mathbf{w}) \, \mathbf{f}(y_t, y_{t-1}, \mathbf{x}^{(n)})$

How do we find these expectations?
$P(y_t, y_{t-1} \mid \mathbf{x}^{(n)}, \mathbf{w})$ must be computed for all $t = 2, \dots, T$
CRF: learning
Inference to find $P(y_t, y_{t-1} \mid \mathbf{x}^{(n)}, \mathbf{w})$

Junction tree algorithm
Initialization of clique potentials:
$\phi(y_t, y_{t-1}) = \exp\left( \mathbf{w}^\top \mathbf{f}(y_t, y_{t-1}, \mathbf{x}^{(n)}) \right)$
$\phi(y_t) = 1$
After calibration, the clique beliefs $\phi^*(y_t, y_{t-1})$ give:
$P(y_t, y_{t-1} \mid \mathbf{x}^{(n)}, \mathbf{w}) = \dfrac{\phi^*(y_t, y_{t-1})}{\sum_{y_t, y_{t-1}} \phi^*(y_t, y_{t-1})}$

[Figure: clique tree with cliques $(Y_1, Y_2), (Y_2, Y_3), \dots, (Y_{T-1}, Y_T)$ and separators $Y_2, \dots, Y_{T-1}$]
CRF learning: gradient ascent

$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \, \nabla_{\mathbf{w}} L(\mathbf{w}^t)$

$\nabla_{\mathbf{w}} L(\mathbf{w}) = \sum_{n=1}^{N} \left[ \sum_{t=1}^{T} \mathbf{f}(y_t^{(n)}, y_{t-1}^{(n)}, \mathbf{x}^{(n)}) - \nabla_{\mathbf{w}} \ln Z(\mathbf{x}^{(n)}, \mathbf{w}) \right]$

In each gradient step, for each data sample we run inference to find $P(y_t, y_{t-1} \mid \mathbf{x}^{(n)}, \mathbf{w})$, which is required for computing the feature expectations:
$\nabla_{\mathbf{w}} \ln Z(\mathbf{x}^{(n)}, \mathbf{w}) = \mathbb{E}_{P(\mathbf{y} \mid \mathbf{x}^{(n)}, \mathbf{w}^t)} \left[ \sum_{t=1}^{T} \mathbf{f}(y_t, y_{t-1}, \mathbf{x}^{(n)}) \right] = \sum_{t=1}^{T} \sum_{y_t, y_{t-1}} P(y_t, y_{t-1} \mid \mathbf{x}^{(n)}, \mathbf{w}) \, \mathbf{f}(y_t, y_{t-1}, \mathbf{x}^{(n)})$
Summary

Discriminative vs. generative:
  When we have many correlated features, discriminative models (MEMM and CRF) are often better
    they avoid the challenge of explicitly modeling the distribution over the features
  but if only limited training data are available, the stronger bias of the generative model may dominate, and generative models may be preferred

Learning:
  HMMs and MEMMs are much more easily learned
  A CRF requires an iterative gradient-based approach, which is considerably more expensive
    inference must be run separately for every training sequence

MEMM vs. CRF (label bias problem of MEMM):
  In many cases CRFs are likely to be the safer choice (particularly when many transitions are close to deterministic), but the computational cost may be prohibitive for large data sets