Machine Learning

Conditional Random Fields

Eric Xing
10-701/15-781, Fall 2011
Lecture 12, October 19, 2011
Reading: Chap. 13 CB
Definition (of HMM)

Observation space
  Alphabetic set:  C = {c_1, c_2, …, c_K}
  Euclidean space: R^d
Index set of hidden states
  I = {1, 2, …, M}
Transition probabilities between any two states
  p(y_t^j = 1 | y_{t-1}^i = 1) = a_{i,j},
  or  p(y_t | y_{t-1}^i = 1) ~ Multinomial(a_{i,1}, a_{i,2}, …, a_{i,M}),  ∀ i ∈ I.
Start probabilities
  p(y_1) ~ Multinomial(π_1, π_2, …, π_M).
Emission probabilities associated with each state
  p(x_t | y_t^i = 1) ~ Multinomial(b_{i,1}, b_{i,2}, …, b_{i,K}),  ∀ i ∈ I,
  or in general:  p(x_t | y_t^i = 1) ~ f(· | θ_i),  ∀ i ∈ I.

[Figure: graphical model (y_1 → y_2 → … → y_T with emissions x_1, x_2, …, x_T) and the corresponding state automaton over states 1, 2, …, K]
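To make this parameterization concrete, here is a minimal NumPy sketch of an HMM in the above notation, instantiated with the two-state casino model used later in the lecture (the uniform start distribution and the array names pi, A, B are my own assumptions, not from the slides):

```python
import numpy as np

# Two hidden states I = {0: FAIR, 1: LOADED}; observations are die faces 1..6 (encoded 0..5).
pi = np.array([0.5, 0.5])                 # start probabilities p(y_1)  (assumed uniform)
A  = np.array([[0.95, 0.05],              # A[i, j] = a_{i,j} = p(y_t = j | y_{t-1} = i)
               [0.05, 0.95]])
B  = np.array([[1/6] * 6,                 # B[i, k] = b_{i,k} = p(x_t = k | y_t = i)
               [1/10] * 5 + [1/2]])

def sample(T, rng=np.random.default_rng(0)):
    """Draw a length-T state path y and observation sequence x from the HMM."""
    y = [rng.choice(2, p=pi)]
    for _ in range(T - 1):
        y.append(rng.choice(2, p=A[y[-1]]))
    x = np.array([rng.choice(6, p=B[s]) for s in y])
    return x, np.array(y)
```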
Viterbi decoding

GIVEN x = x_1, …, x_T, we want to find y = y_1, …, y_T such that P(y|x) is maximized:

  y* = argmax_y P(y|x) = argmax_y P(y, x)

where the joint probability factorizes as

  p(x_1, …, x_t, y_1, …, y_t) = π_{y_1} a_{y_1,y_2} ⋯ a_{y_{t-1},y_t} b_{y_1,x_1} ⋯ b_{y_t,x_t}

Let

  V_t^k = max_{y_1,…,y_{t-1}} P(x_1, …, x_{t-1}, y_1, …, y_{t-1}, x_t, y_t = k)
        = probability of the most likely sequence of states ending at state y_t = k

The recursion:

  V_t^k = p(x_t | y_t^k = 1) max_i a_{i,k} V_{t-1}^i

[Figure: trellis over observations x_1, x_2, x_3, …, x_N and states 1, 2, …, K, showing how V_t^k is built column by column]

Underflows are a significant problem: these numbers become extremely small – underflow.
Solution: take the logs of all values:

  V_t^k = log p(x_t | y_t^k = 1) + max_i ( log a_{i,k} + V_{t-1}^i )
The Viterbi Algorithm – derivation

Define the Viterbi probability:

  V_{t+1}^k = max_{y_1,…,y_t} P(x_1, …, x_t, y_1, …, y_t, x_{t+1}, y_{t+1} = k)
            = max_{y_1,…,y_t} P(x_{t+1}, y_{t+1} = k | x_1, …, x_t, y_1, …, y_t) P(x_1, …, x_t, y_1, …, y_t)
            = max_{y_1,…,y_t} P(x_{t+1}, y_{t+1} = k | y_t) P(x_1, …, x_{t-1}, y_1, …, y_{t-1}, x_t, y_t)
            = max_i P(x_{t+1}, y_{t+1} = k | y_t^i = 1) max_{y_1,…,y_{t-1}} P(x_1, …, x_{t-1}, y_1, …, y_{t-1}, x_t, y_t^i = 1)
            = max_i P(x_{t+1} | y_{t+1}^k = 1) a_{i,k} V_t^i
            = P(x_{t+1} | y_{t+1}^k = 1) max_i a_{i,k} V_t^i
The Viterbi Algorithm

Input: x = x_1, …, x_T

Initialization:
  V_1^k = P(x_1 | y_1^k = 1) π_k

Iteration:
  V_t^k = P(x_t | y_t^k = 1) max_i a_{i,k} V_{t-1}^i
  Ptr(k, t) = argmax_i a_{i,k} V_{t-1}^i

Termination:
  P(x, y*) = max_k V_T^k

TraceBack:
  y_T* = argmax_k V_T^k
  y_{t-1}* = Ptr(y_t*, t)
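A minimal log-space implementation of this recursion, reusing the pi, A, B arrays sketched earlier (my own code, not from the lecture; taking logs implements the underflow fix from the previous slide):

```python
import numpy as np

def viterbi(x, pi, A, B):
    """Return the most likely state path y* = argmax_y p(x, y) and log p(x, y*)."""
    T, K = len(x), len(pi)
    logV = np.zeros((T, K))
    ptr  = np.zeros((T, K), dtype=int)
    logV[0] = np.log(pi) + np.log(B[:, x[0]])        # V_1^k = pi_k * p(x_1 | y_1 = k), in logs
    for t in range(1, T):
        scores = logV[t - 1][:, None] + np.log(A)    # scores[i, k] = V_{t-1}^i + log a_{i,k}
        ptr[t]  = scores.argmax(axis=0)              # Ptr(k, t)
        logV[t] = np.log(B[:, x[t]]) + scores.max(axis=0)
    y = [int(logV[-1].argmax())]                     # TraceBack
    for t in range(T - 1, 0, -1):
        y.append(int(ptr[t, y[-1]]))
    return y[::-1], float(logV[-1].max())
```

For the roll sequence on the next slide, x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4 (faces encoded 0-5), this should recover the all-FAIR path shown in the Viterbi-vs-posterior-decoding table below.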
x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4

[Figure: two-state HMM with states FAIR and LOADED; each state stays with probability 0.95 and switches with probability 0.05]

  P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
  P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10,  P(6|L) = 1/2

Forward:  α_t^k = p(x_t | y_t^k = 1) Σ_i α_{t-1}^i a_{i,k}
Backward: β_t^i = Σ_k a_{i,k} p(x_{t+1} | y_{t+1}^k = 1) β_{t+1}^k
Viterbi:  V_t^k = p(x_t | y_t^k = 1) max_i a_{i,k} V_{t-1}^i

V_t^k (normalized per position), t = 1, …, 10 (columns: FAIR, LOADED):

  0.6250 0.3750
  0.7353 0.2647
  0.8224 0.1776
  0.8853 0.1147
  0.7200 0.2800
  0.8108 0.1892
  0.8772 0.1228
  0.7043 0.2957
  0.7988 0.2012
  0.8687 0.1313
Viterbi vs. Posterior Decoding (individual)

x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4

  t    V_t^k (F, L)      p(y_t^k = 1 | x) (F, L)   Ptr(k, t)   Viterbi   PD
  1    0.6250  0.3750    0.8128  0.1872            N/A         1         1
  2    0.7353  0.2647    0.8238  0.1762            1  2        1         1
  3    0.8224  0.1776    0.8176  0.1824            1  2        1         1
  4    0.8853  0.1147    0.7925  0.2075            1  2        1         1
  5    0.7200  0.2800    0.7415  0.2585            1  2        1         1
  6    0.8108  0.1892    0.7505  0.2495            1  2        1         1
  7    0.8772  0.1228    0.7386  0.2614            1  2        1         1
  8    0.7043  0.2957    0.7027  0.2973            1  2        1         1
  9    0.7988  0.2012    0.7251  0.2749            1  2        1         1
  10   0.8687  0.1313    0.7251  0.2749            1  2        1         1

Here both Viterbi decoding and posterior decoding assign every position to state 1 (FAIR).
Another Example

x = 6, 2, 3, 5, 6, 2, 6, 3, 6, 6

[Figure: same FAIR/LOADED emission model, but with the same (symmetric) transition probabilities for both states: stay with probability 0.6, switch with probability 0.4]

  t    V_t^k (F, L)      p(y_t^k = 1 | x) (F, L)   Ptr(k, t)   Viterbi   PD
  1    0.2500  0.7500    0.2733  0.7267            N/A         2         2
  2    0.5263  0.4737    0.6040  0.3960            2  2        1         1
  3    0.6494  0.3506    0.6538  0.3462            1  2        1         1
  4    0.7143  0.2857    0.6062  0.3938            1  1        1         1
  5    0.3333  0.6667    0.2861  0.7139            1  1        2         2
  6    0.5263  0.4737    0.5342  0.4658            2  2        2         1
  7    0.2703  0.7297    0.2734  0.7266            1  2        2         2
  8    0.5263  0.4737    0.5226  0.4774            2  2        2         1
  9    0.2703  0.7297    0.2252  0.7748            1  2        2         2
  10   0.1818  0.8182    0.2159  0.7841            2  2        2         2

Here the Viterbi path (2 1 1 1 2 2 2 2 2 2) and the posterior-decoded path (2 1 1 1 2 1 2 1 2 2) disagree at positions 6 and 8.
Computational Complexity and implementation details

What is the running time, and space required, for Forward and Backward?

  α_t^k = p(x_t | y_t^k = 1) Σ_i α_{t-1}^i a_{i,k}
  β_t^i = Σ_k a_{i,k} p(x_{t+1} | y_{t+1}^k = 1) β_{t+1}^k
  V_t^k = p(x_t | y_t^k = 1) max_i a_{i,k} V_{t-1}^i

Time: O(K²N); Space: O(KN).

Useful implementation techniques to avoid underflows:
  Viterbi: sum of logs
  Forward/Backward: rescaling at each position by multiplying by a constant (see the sketch below)
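One way to implement the rescaling trick (a sketch using the same pi, A, B conventions as before; the per-position scaling constants c_t also yield log p(x), and the resulting rows of gamma are exactly the posterior-decoding probabilities p(y_t = k | x) used in the tables above):

```python
import numpy as np

def forward_backward(x, pi, A, B):
    """Scaled forward-backward: returns gamma[t, k] = p(y_t = k | x) and log p(x)."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K)); beta = np.ones((T, K)); c = np.zeros(T)
    alpha[0] = pi * B[:, x[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]               # rescale at position 1
    for t in range(1, T):
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)        # alpha_t^k = p(x_t|y_t=k) sum_i alpha_{t-1}^i a_ik
        c[t] = alpha[t].sum(); alpha[t] /= c[t]           # rescale at position t
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, x[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                                  # posterior p(y_t = k | x); rows sum to 1
    return gamma, np.log(c).sum()                         # log p(x) = sum_t log c_t

# Posterior decoding: pd_path = gamma.argmax(axis=1)
```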
Three Main Questions on HMMs

1. Evaluation
   GIVEN an HMM M, and a sequence x,
   FIND Prob(x | M)
   ALGO. Forward

2. Decoding
   GIVEN an HMM M, and a sequence x,
   FIND the sequence y of states that maximizes, e.g., P(y | x, M),
   or the most probable subsequence of states
   ALGO. Viterbi, Forward-backward

3. Learning
   GIVEN an HMM M, with unspecified transition/emission probs., and a sequence x,
   FIND parameters θ = (π_i, a_ij, η_ik) that maximize P(x | θ)
   ALGO. Baum-Welch (EM)
Learning HMM: two scenarios

Supervised learning: estimation when the "right answer" is known
  Examples:
    GIVEN: a genomic region x = x_1…x_1,000,000 where we have good (experimental) annotations of the CpG islands
    GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls

Unsupervised learning: estimation when the "right answer" is unknown
  Examples:
    GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
    GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice

QUESTION: Update the parameters θ of the model to maximize P(x|θ) --- Maximum likelihood (ML) estimation
MLE
Supervised ML estimation

Given x = x_1…x_N for which the true state path y = y_1…y_N is known,

Define:
  A_ij = # times state transition i→j occurs in y
  B_ik = # times state i in y emits k in x

We can show that the maximum likelihood parameters θ are:

  a_ij^ML = #(i→j) / #(i→·) = ( Σ_n Σ_{t=2}^T y_{n,t-1}^i y_{n,t}^j ) / ( Σ_n Σ_{t=2}^T y_{n,t-1}^i ) = A_ij / Σ_{j'} A_{ij'}

  b_ik^ML = #(i→k) / #(i→·) = ( Σ_n Σ_{t=1}^T y_{n,t}^i x_{n,t}^k ) / ( Σ_n Σ_{t=1}^T y_{n,t}^i ) = B_ik / Σ_{k'} B_{ik'}

What if y is continuous? We can treat { (x_{n,t}, y_{n,t}) : t = 1…T, n = 1…N } as N×T observations of, e.g., a Gaussian, and apply the learning rules for a Gaussian …
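These estimators are just normalized counts; a small Python sketch (my own helper, assuming states and symbols are 0-indexed integers):

```python
import numpy as np

def supervised_mle(xs, ys, K, M):
    """ML estimates of (a, b) from fully observed sequences.
    xs, ys: lists of equal-length observation/state sequences; K states, M symbols."""
    A  = np.zeros((K, K))   # A[i, j] = # times transition i -> j occurs in y
    Bc = np.zeros((K, M))   # Bc[i, k] = # times state i emits symbol k in x
    for x, y in zip(xs, ys):
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1
        for t in range(len(y)):
            Bc[y[t], x[t]] += 1
    a = A / A.sum(axis=1, keepdims=True)     # a_ij = A_ij / sum_j' A_ij'
    b = Bc / Bc.sum(axis=1, keepdims=True)   # b_ik = B_ik / sum_k' B_ik'
    return a, b   # note: states never visited give 0/0 rows; the pseudocount fix below addresses this
```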
Supervised ML estimation, ctd.

Intuition:
  When we know the underlying states, the best estimate of θ is the average frequency of transitions & emissions that occur in the training data

Drawback:
  Given little data, there may be overfitting:
    P(x|θ) is maximized, but θ is unreasonable
    0 probabilities – VERY BAD

Example:
  Given 10 casino rolls, we observe
    x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
    y = F, F, F, F, F, F, F, F, F, F
  Then: a_FF = 1; a_FL = 0
        b_F1 = b_F3 = b_F6 = .2; b_F2 = .3; b_F4 = 0; b_F5 = .1
Pseudocounts

Solution for small training sets:
  Add pseudocounts
    A_ij = # times state transition i→j occurs in y + R_ij
    B_ik = # times state i in y emits k in x + S_ik

  R_ij, S_ik are pseudocounts representing our prior belief
  Total pseudocounts: R_i = Σ_j R_ij , S_i = Σ_k S_ik ,
    --- "strength" of prior belief,
    --- total number of imaginary instances in the prior

  Larger total pseudocounts ⇒ stronger prior belief

  Small total pseudocounts: just to avoid 0 probabilities --- smoothing
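In code, the pseudocounts are simply added to the raw counts from the supervised_mle sketch above before normalizing (R and S are the prior matrices from this slide):

```python
import numpy as np

def smoothed_estimates(A, Bc, R, S):
    """Pseudocount-smoothed transition/emission estimates from raw counts A, Bc."""
    a = (A + R) / (A + R).sum(axis=1, keepdims=True)    # a_ij = (A_ij + R_ij) / sum_j' (A_ij' + R_ij')
    b = (Bc + S) / (Bc + S).sum(axis=1, keepdims=True)  # b_ik = (B_ik + S_ik) / sum_k' (B_ik' + S_ik')
    return a, b
```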
Unsupervised ML estimation
Unsupervised ML estimation

Given x = x_1…x_N for which the true state path y = y_1…y_N is unknown,

EXPECTATION MAXIMIZATION

0. Starting with our best guess of a model M, parameters θ:
1. Estimate A_ij , B_ik in the training data
   How?  A_ij = Σ_{n,t} ⟨y_{n,t-1}^i y_{n,t}^j⟩ ,  B_ik = Σ_{n,t} ⟨y_{n,t}^i⟩ x_{n,t}^k
2. Update θ according to A_ij , B_ik
   Now a "supervised learning" problem
3. Repeat 1 & 2, until convergence

This is called the Baum-Welch Algorithm

We can get to a provably more (or equally) likely parameter set θ each iteration
The Baum-Welch algorithm

The complete log likelihood:

  ℓ_c(θ; x, y) = log p(x, y) = log Π_n ( p(y_{n,1}) Π_{t=2}^T p(y_{n,t} | y_{n,t-1}) Π_{t=1}^T p(x_{n,t} | y_{n,t}) )

The expected complete log likelihood:

  ⟨ℓ_c(θ; x, y)⟩ = Σ_n ( ⟨y_{n,1}^i⟩_{p(y_{n,1}|x_n)} log π_i )
                 + Σ_n Σ_{t=2}^T ( ⟨y_{n,t-1}^i y_{n,t}^j⟩_{p(y_{n,t-1}, y_{n,t}|x_n)} log a_{i,j} )
                 + Σ_n Σ_{t=1}^T ( x_{n,t}^k ⟨y_{n,t}^i⟩_{p(y_{n,t}|x_n)} log b_{i,k} )

EM

The E step:
  γ_{n,t}^i = ⟨y_{n,t}^i⟩ = p(y_{n,t}^i = 1 | x_n)
  ξ_{n,t}^{i,j} = ⟨y_{n,t-1}^i y_{n,t}^j⟩ = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 | x_n)

The M step ("symbolically" identical to MLE):

  π_i^ML = Σ_n γ_{n,1}^i / N

  a_ij^ML = ( Σ_n Σ_{t=2}^T ξ_{n,t}^{i,j} ) / ( Σ_n Σ_{t=1}^{T-1} γ_{n,t}^i )

  b_ik^ML = ( Σ_n Σ_{t=1}^T γ_{n,t}^i x_{n,t}^k ) / ( Σ_n Σ_{t=1}^T γ_{n,t}^i )
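A compact single-sequence Baum-Welch sketch that combines the scaled forward-backward pass from earlier with the E/M updates on this slide (my own arrangement; a fixed iteration count stands in for a proper convergence check):

```python
import numpy as np

def baum_welch(x, K, M, n_iter=50, seed=0):
    """EM for one observation sequence x (symbols 0..M-1) with K hidden states."""
    x = np.asarray(x); T = len(x)
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    A = rng.dirichlet(np.ones(K), size=K)      # random row-stochastic initial transitions
    B = rng.dirichlet(np.ones(M), size=K)      # random initial emissions
    for _ in range(n_iter):
        # E step: scaled forward-backward.
        alpha = np.zeros((T, K)); beta = np.ones((T, K)); c = np.zeros(T)
        alpha[0] = pi * B[:, x[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, x[t + 1]] * beta[t + 1])) / c[t + 1]
        gamma = alpha * beta                   # gamma[t, i]  = p(y_t = i | x)
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (B[:, x[1:]].T * beta[1:])[:, None, :]
              / c[1:, None, None])             # xi[t, i, j] = p(y_t = i, y_{t+1} = j | x)
        # M step ("symbolically" the same as supervised MLE).
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.stack([gamma[x == k].sum(axis=0) for k in range(M)], axis=1)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B
```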
The Baum-Welch algorithm -- comments

Time complexity:
  # iterations × O(K²N)

Guaranteed to increase the log likelihood of the model

Not guaranteed to find globally best parameters

Converges to local optimum, depending on initial conditions

Too many parameters / too large a model: overfitting
Summary: the HMM algorithms

Questions:
  Evaluation: What is the probability of the observed sequence? Forward
  Decoding: What is the probability that the state of the 3rd roll is loaded, given the observed sequence? Forward-Backward
  Decoding: What is the most likely die sequence? Viterbi
  Learning: Under what parameterization are the observed sequences most probable? Baum-Welch (EM)
Applications of HMMs

Some early applications of HMMs
  finance, but we never saw them
  speech recognition
  modelling ion channels

In the mid-late 1980s HMMs entered genetics and molecular biology, and they are now firmly entrenched.

Some current applications of HMMs to biology
  mapping chromosomes
  aligning biological sequences
  predicting sequence structure
  inferring evolutionary relationships
  finding genes in DNA sequence
Typical structure of a gene
GENSCAN (Burge & Karlin)

[Figure: GENSCAN state diagram for gene finding, with exon states E0, E1, E2, Ei, Es, intron states I0, I1, I2, 5'UTR, 3'UTR, poly-A, promoter, and an intergenic region, mirrored for the forward (+) and reverse (-) strands, annotating a genomic DNA sequence]
Shortcomings of Hidden Markov Model

[Figure: HMM graphical model, states Y1, Y2, …, Yn with observations X1, X2, …, Xn]

HMM models capture dependences between each state and only its corresponding observation
  NLP example: in a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental stages), but also on the (non-local) features of the whole line, such as line length, indentation, amount of white space, etc.

Mismatch between learning objective function and prediction objective function
  HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X)
Recall: Generative vs. Discriminative Classifiers

Goal: Wish to learn f: X → Y, e.g., P(Y|X)

Generative classifiers (e.g., Naïve Bayes):
  Assume some functional form for P(X|Y), P(Y)
  This is a 'generative' model of the data!
  Estimate parameters of P(X|Y), P(Y) directly from training data
  Use Bayes rule to calculate P(Y|X = x)
  [Figure: Yn → Xn]

Discriminative classifiers (e.g., logistic regression):
  Directly assume some functional form for P(Y|X)
  This is a 'discriminative' model of the data!
  Estimate parameters of P(Y|X) directly from training data
  [Figure: Xn → Yn]
Structured Conditional Models

[Figure: label sequence Y1, Y2, …, Yn conditioned on the whole observation sequence x_{1:n}]

Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x)
  Specify the probability of possible label sequences given an observation sequence

Allow arbitrary, non-independent features on the observation sequence X
  The probability of a transition between labels may depend on past and future observations

Relax strong independence assumptions in generative models
Conditional Distribution

If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the Hammersley-Clifford theorem of random fields, is:

  p_θ(y | x) ∝ exp ( Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} μ_k g_k(v, y|_v, x) )

  ─ x is a data sequence
  ─ y is a label sequence
  ─ v is a vertex from vertex set V = set of label random variables
  ─ e is an edge from edge set E over V
  ─ f_k and g_k are given and fixed. g_k is a Boolean vertex feature; f_k is a Boolean edge feature
  ─ k is the number of features
  ─ θ = (λ_1, λ_2, …, λ_n; μ_1, μ_2, …, μ_n); λ_k and μ_k are parameters to be estimated
  ─ y|_e is the set of components of y defined by edge e
  ─ y|_v is the set of components of y defined by vertex v

[Figure: chain-structured graphical models over labels y_1, …, y_T and observations x_1, …, x_T]
Conditional Random Fields

[Figure: linear-chain CRF, labels Y1, Y2, …, Yn conditioned on x_{1:n}]

CRF is a partially directed model
  Discriminative model
  Usage of global normalizer Z(x)
  Models the dependence between each state and the entire observation sequence
Conditional Random Fields

General parametric form (for a linear chain):

  p_θ(y | x) = (1 / Z(θ, x)) exp ( Σ_t Σ_k λ_k f_k(y_{t-1}, y_t, x) + Σ_t Σ_k μ_k g_k(y_t, x) )

[Figure: linear-chain CRF over Y1, Y2, …, Yn and x_{1:n}]
Conditional Random Fields

  p_θ(y | x) = (1 / Z(θ, x)) exp { Σ_c θ_c f_c(x, y_c) }

  Allow arbitrary dependencies on input
  Clique dependencies on labels
  Use approximate inference for general graphs
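To make the log-linear form concrete, here is a toy linear-chain CRF scorer with a brute-force partition function (illustrative only; the particular feature functions and weights are my own, not from the lecture):

```python
import numpy as np
from itertools import product

# Toy binary-label chain CRF: mu weights vertex features g, lam weights edge features f.
def g(y_t, x, t):                                  # vertex features g_k(y_t, x)
    return np.array([float(y_t == x[t]),           # label agrees with the observation
                     float(y_t == 1)])             # bias feature for label 1

def f(y_prev, y_t, x, t):                          # edge features f_k(y_{t-1}, y_t, x)
    return np.array([float(y_prev == y_t)])        # reward adjacent labels that agree

mu, lam = np.array([2.0, -0.5]), np.array([1.0])

def log_score(y, x):
    """Unnormalized log p(y|x): sum_t mu.g(y_t, x) + sum_t lam.f(y_{t-1}, y_t, x)."""
    return (sum(mu @ g(y[t], x, t) for t in range(len(y)))
            + sum(lam @ f(y[t - 1], y[t], x, t) for t in range(1, len(y))))

def log_Z(x):
    """Brute-force log partition function Z(theta, x) -- only feasible for tiny chains."""
    return np.log(sum(np.exp(log_score(y, x)) for y in product([0, 1], repeat=len(x))))

x = [1, 0, 1, 1]
print(np.exp(log_score([1, 0, 1, 1], x) - log_Z(x)))   # p(y | x) for one labeling
```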
CRFs: Inference

Computing marginals using a message-passing algorithm called "sum-product":

  m_{i→j}(S_{ij}) = Σ_{C_i \ S_{ij}} ψ_{C_i} Π_{k≠j} m_{k→i}(S_{ik})

Initialization:
  [Figure: clique chain (Y1,Y2) (Y2,Y3) … (Yn-2,Yn-1) (Yn-1,Yn) with separators Y2, Y3, …, Yn-1]

After calibration:
  [Figure: calibrated clique chain (Y1,Y2) (Y2,Y3) … (Yn-2,Yn-1) (Yn-1,Yn)]

Also called the forward-backward algorithm
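On a linear chain these messages reduce to a forward-backward sweep over log-potentials; a minimal sketch (my own formulation in terms of per-position node scores and a shared edge score, not the lecture's clique-tree notation):

```python
import numpy as np
from scipy.special import logsumexp

def node_marginals(node_logpot, edge_logpot):
    """Linear-chain sum-product.
    node_logpot[t, k]: sum_k' mu_k' g_k'(y_t = k, x)                 (log vertex potential)
    edge_logpot[i, j]: sum_k' lam_k' f_k'(y_{t-1} = i, y_t = j, x)   (assumed shared across t)
    Returns p(y_t = k | x) for every position t."""
    T, K = node_logpot.shape
    fwd = np.zeros((T, K)); bwd = np.zeros((T, K))
    fwd[0] = node_logpot[0]
    for t in range(1, T):                                   # forward messages
        fwd[t] = node_logpot[t] + logsumexp(fwd[t - 1][:, None] + edge_logpot, axis=0)
    for t in range(T - 2, -1, -1):                          # backward messages
        bwd[t] = logsumexp(edge_logpot + (node_logpot[t + 1] + bwd[t + 1])[None, :], axis=1)
    log_marg = fwd + bwd
    log_marg -= logsumexp(log_marg, axis=1, keepdims=True)  # normalize by log Z(x)
    return np.exp(log_marg)
```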
CRFs: Inference

Given CRF parameters λ and µ, find the y* that maximizes P(y|x)

  Can ignore Z(x) because it is not a function of y

Again run a message-passing algorithm, now called "max-product":

  m_{i→j}(S_{ij}) = max_{C_i \ S_{ij}} ψ_{C_i} Π_{k≠j} m_{k→i}(S_{ik})

[Figure: clique chain (Y1,Y2) (Y2,Y3) … (Yn-2,Yn-1) (Yn-1,Yn) with separators Y2, Y3, …, Yn-1]

Same as Viterbi decoding used in HMMs!
CRF learning

Given {(x_d, y_d)}_{d=1}^N, find λ*, µ* such that

  λ*, µ* = argmax_{λ,µ} L(λ, µ) = argmax_{λ,µ} Σ_d log P(y_d | x_d; λ, µ)

Computing the gradient w.r.t. λ:

  ∂L/∂λ_k = Σ_d [ Σ_t f_k(y_{d,t-1}, y_{d,t}, x_d) − E_{P(y|x_d)} Σ_t f_k(y_{t-1}, y_t, x_d) ]

Gradient of the log-partition function in an exponential family is the expectation of the sufficient statistics.
CRF learning

Computing the model expectations:
  Requires an exponentially large number of summations: Is it intractable?

Tractable!
  Can compute marginals using the sum-product algorithm on the chain
  Expectation of f over the corresponding marginal probability of neighboring nodes!!
CRF learning

Computing feature expectations using calibrated potentials:

  E_{P(y|x_d)} Σ_t f_k(y_{t-1}, y_t, x_d) = Σ_t Σ_{y_{t-1}, y_t} P(y_{t-1}, y_t | x_d) f_k(y_{t-1}, y_t, x_d)

Now we know how to compute ∇_λ L(λ, µ).

Learning can now be done using gradient ascent:

  λ^{(new)} = λ^{(old)} + η ∇_λ L(λ^{(old)}, µ^{(old)})
  µ^{(new)} = µ^{(old)} + η ∇_µ L(λ^{(old)}, µ^{(old)})
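Putting the pieces together, one gradient-ascent step on λ looks like the schematic below (a sketch; empirical_counts and expected_counts are hypothetical helpers, the latter computed from the pairwise marginals obtained with sum-product as on the previous slides):

```python
import numpy as np

def gradient_ascent_step(lam, data, empirical_counts, expected_counts, lr=0.1):
    """One ascent step on L(lam) = sum_d log p(y_d | x_d).
    empirical_counts(x, y) -> sum_t f(y_{t-1}, y_t, x)        (feature vector on observed labels)
    expected_counts(x, lam) -> E_{p(y|x; lam)} sum_t f(...)   (via chain sum-product)"""
    grad = np.zeros_like(lam)
    for x, y in data:
        grad += empirical_counts(x, y) - expected_counts(x, lam)   # dL/dlam
    return lam + lr * grad
```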
CRFs: some empirical results

Comparison of error rates on synthetic data
Data is increasingly higher order in the direction of arrow
CRFs achieve the lowest error rate for higher order data
CRFs: some empirical results

Parts of Speech tagging
  Using same set of features: HMM >=< CRF
  Using additional overlapping features: CRF+ >> HMM
Summary

Conditional Random Fields is a discriminative Structured Input-Output model!

HMM is a generative structured I/O model

Complementary strengths and weaknesses:

[Figure: side-by-side HMM and CRF graphical models over Y1, …, Yn and X1, …, Xn, with a numbered list (1., 2., 3., …) of the complementary points]