


  • 1

    Machine Learning

    Conditional Random Fields

    Eric Xing

    10-701/15-781, Fall 2011

    Lecture 12, October 19, 2011

    Reading: Chap. 13 CB

    1© Eric Xing @ CMU, 2006-2011

    Definition (of HMM)

    (Figures: the state automaton over states 1, 2, …, K, and the graphical model y1 → y2 → … → yT with emissions x1, x2, …, xT.)

    Observation space
    Alphabetic set: $\mathcal{C} = \{c_1, c_2, \ldots, c_K\}$
    Euclidean space: $\mathbb{R}^d$

    Index set of hidden states
    $\mathcal{I} = \{1, 2, \ldots, M\}$

    Transition probabilities between any two states
    $p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$,
    or $p(y_t \mid y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}),\ \forall i \in \mathcal{I}$.

    Start probabilities
    $p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)$.

    Emission probabilities associated with each state
    $p(x_t \mid y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}),\ \forall i \in \mathcal{I}$,
    or in general: $p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i),\ \forall i \in \mathcal{I}$.

    2© Eric Xing @ CMU, 2006-2011
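To make this parameterization concrete, here is a minimal NumPy sketch (not from the lecture) that writes down the start probabilities π, the transition matrix a, and a discrete emission matrix b for the two-state fair/loaded-dice model used later in these slides; the names pi, A, B and the uniform start distribution are my assumptions.

```python
import numpy as np

# Two hidden states: 0 = FAIR, 1 = LOADED (the casino example used later).
pi = np.array([0.5, 0.5])                 # start probabilities pi_i (assumed uniform)
A  = np.array([[0.95, 0.05],              # transition probabilities a_{i,j}
               [0.05, 0.95]])
B  = np.array([[1/6] * 6,                 # emission probabilities b_{i,k}: fair die
               [1/10] * 5 + [1/2]])       # loaded die favors a six

# Joint probability of a short state/observation sequence:
# p(x_1..t, y_1..t) = pi_{y1} * prod_t a_{y_{t-1},y_t} * prod_t b_{y_t,x_t}
y = [0, 0, 1]          # FAIR, FAIR, LOADED
x = [2, 5, 5]          # die faces 3, 6, 6 (0-indexed)
p = pi[y[0]] * B[y[0], x[0]]
for t in range(1, len(y)):
    p *= A[y[t-1], y[t]] * B[y[t], x[t]]
print(p)
```

The last lines evaluate the joint probability that reappears in the Viterbi and learning slides below.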

  • 2

    Viterbi decoding

    GIVEN x = x1, …, xT, we want to find y = y1, …, yT such that P(y|x) is maximized:
    y* = argmax_y P(y|x) = argmax_y P(y,x)

    Let $V_t^k$ = probability of the most likely sequence of states ending at state $y_t = k$:
    $V_t^k = \max_{\{y_1, \ldots, y_{t-1}\}} P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t^k = 1)$

    The recursion:
    $V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$

    (Figure: the Viterbi trellis over states 1, 2, …, K and observations x1, x2, x3, …, xN.)

    Underflows are a significant problem
    $p(x_1, \ldots, x_t, y_1, \ldots, y_t) = \pi_{y_1} a_{y_1,y_2} \cdots a_{y_{t-1},y_t}\, b_{y_1,x_1} \cdots b_{y_t,x_t}$
    These numbers become extremely small – underflow
    Solution: take the logs of all values:
    $V_t^k = \log p(x_t \mid y_t^k = 1) + \max_i \left( \log a_{i,k} + V_{t-1}^i \right)$

    3© Eric Xing @ CMU, 2006-2011

    The Viterbi Algorithm – derivation

    Define the Viterbi probability:
    $V_{t+1}^k = \max_{\{y_1, \ldots, y_t\}} P(x_1, \ldots, x_t, y_1, \ldots, y_t, x_{t+1}, y_{t+1}^k = 1)$
    $\quad = \max_{\{y_1, \ldots, y_t\}} P(x_{t+1}, y_{t+1}^k = 1 \mid x_1, \ldots, x_t, y_1, \ldots, y_t)\, P(x_1, \ldots, x_t, y_1, \ldots, y_t)$
    $\quad = \max_{\{y_1, \ldots, y_t\}} P(x_{t+1}, y_{t+1}^k = 1 \mid y_t)\, P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t)$
    $\quad = \max_i P(x_{t+1}, y_{t+1}^k = 1 \mid y_t^i = 1) \max_{\{y_1, \ldots, y_{t-1}\}} P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t^i = 1)$
    $\quad = \max_i P(x_{t+1} \mid y_{t+1}^k = 1)\, a_{i,k} V_t^i$
    $\quad = P(x_{t+1} \mid y_{t+1}^k = 1) \max_i a_{i,k} V_t^i$

    4© Eric Xing @ CMU, 2006-2011

  • 3

    The Viterbi Algorithm

    Input: x = x1, …, xT

    Initialization:
    $V_1^k = P(x_1 \mid y_1^k = 1)\, \pi_k$

    Iteration:
    $V_t^k = P(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$
    $\mathrm{Ptr}(k, t) = \arg\max_i a_{i,k} V_{t-1}^i$

    Termination:
    $P(\mathbf{x}, \mathbf{y}^*) = \max_k V_T^k$

    TraceBack:
    $y_T^* = \arg\max_k V_T^k$
    $y_{t-1}^* = \mathrm{Ptr}(y_t^*, t)$

    5© Eric Xing @ CMU, 2006-2011
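The initialization/iteration/termination steps above translate directly into code. Below is a minimal sketch of the Viterbi recursion in log space (using the underflow fix from the previous slide); the function name viterbi and the NumPy conventions are mine, with pi, A, B as in the earlier sketch.

```python
import numpy as np

def viterbi(x, pi, A, B):
    """Most likely state path argmax_y p(y, x) for a discrete-emission HMM.

    x : sequence of observation indices; pi, A, B as in the earlier sketch.
    Returns (best_path, best_log_prob)."""
    T, K = len(x), len(pi)
    logV = np.full((T, K), -np.inf)          # logV[t, k] = log V_t^k
    ptr  = np.zeros((T, K), dtype=int)       # back-pointers Ptr(k, t)

    logV[0] = np.log(pi) + np.log(B[:, x[0]])             # initialization
    for t in range(1, T):                                  # iteration
        scores = logV[t-1][:, None] + np.log(A)            # log V_{t-1}^i + log a_{i,k}
        ptr[t]  = scores.argmax(axis=0)
        logV[t] = np.log(B[:, x[t]]) + scores.max(axis=0)

    # termination and trace-back
    path = [int(logV[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(ptr[t, path[-1]]))
    return path[::-1], float(logV[-1].max())
```

With uniform start probabilities and the casino parameters, running this on the roll sequence of the next slide reproduces the all-FAIR Viterbi path shown in the decoding table two slides below.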

    x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4

    Two states, FAIR and LOADED, each staying with probability 0.95 and switching with probability 0.05:
    P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
    P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2

    Forward:  $\alpha_t^k = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$
    Backward: $\beta_t^k = \sum_i a_{k,i}\, P(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$
    Viterbi:  $V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$

    $V_t^k$ (FAIR, LOADED), normalized at each position:
    t = 1:  0.6250  0.3750
    t = 2:  0.7353  0.2647
    t = 3:  0.8224  0.1776
    t = 4:  0.8853  0.1147
    t = 5:  0.7200  0.2800
    t = 6:  0.8108  0.1892
    t = 7:  0.8772  0.1228
    t = 8:  0.7043  0.2957
    t = 9:  0.7988  0.2012
    t = 10: 0.8687  0.1313

    6© Eric Xing @ CMU, 2006-2011

  • 4

    Viterbi vs. Posterior Decoding (individual)

    x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4   (states: 1 = FAIR, 2 = LOADED)

    t     V_t (F, L)        p(y_t | x) (F, L)    Viterbi   PD    ptr(F,t), ptr(L,t)
    1     0.6250  0.3750    0.8128  0.1872       1         1     N/A
    2     0.7353  0.2647    0.8238  0.1762       1         1     1, 2
    3     0.8224  0.1776    0.8176  0.1824       1         1     1, 2
    4     0.8853  0.1147    0.7925  0.2075       1         1     1, 2
    5     0.7200  0.2800    0.7415  0.2585       1         1     1, 2
    6     0.8108  0.1892    0.7505  0.2495       1         1     1, 2
    7     0.8772  0.1228    0.7386  0.2614       1         1     1, 2
    8     0.7043  0.2957    0.7027  0.2973       1         1     1, 2
    9     0.7988  0.2012    0.7251  0.2749       1         1     1, 2
    10    0.8687  0.1313    0.7251  0.2749       1         1     1, 2

    Here Viterbi decoding and posterior decoding agree: both label every roll FAIR.

    7© Eric Xing @ CMU, 2006-2011

    Another Example

    X = 6, 2, 3, 5, 6, 2, 6, 3, 6, 6   (states: 1 = FAIR, 2 = LOADED)

    FAIR and LOADED have the same transition probabilities: each stays with probability 0.6 and switches with probability 0.4; emission probabilities are as before.

    t     V_t (F, L)        p(y_t | x) (F, L)    Viterbi   PD    ptr(F,t), ptr(L,t)
    1     0.2500  0.7500    0.2733  0.7267       2         2     N/A
    2     0.5263  0.4737    0.6040  0.3960       1         1     2, 2
    3     0.6494  0.3506    0.6538  0.3462       1         1     1, 2
    4     0.7143  0.2857    0.6062  0.3938       1         1     1, 1
    5     0.3333  0.6667    0.2861  0.7139       2         2     1, 1
    6     0.5263  0.4737    0.5342  0.4658       2         1     2, 2
    7     0.2703  0.7297    0.2734  0.7266       2         2     1, 2
    8     0.5263  0.4737    0.5226  0.4774       2         1     2, 2
    9     0.2703  0.7297    0.2252  0.7748       2         2     1, 2
    10    0.1818  0.8182    0.2159  0.7841       2         2     2, 2

    Here the most likely state sequence (Viterbi) and the sequence of individually most likely states (posterior decoding) disagree, e.g., at positions 6 and 8.

    8© Eric Xing @ CMU, 2006-2011

  • 5

    Computational Complexity and implementation details

    What is the running time, and space required, for Forward and Backward?
    Time: O(K^2 N);  Space: O(KN).

    $\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$
    $\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$
    $V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$

    Useful implementation techniques to avoid underflow:
    Viterbi: sum of logs
    Forward/Backward: rescaling at each position by multiplying by a constant

    9© Eric Xing @ CMU, 2006-2011
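The rescaling trick mentioned for Forward/Backward can be sketched as follows: normalize the α vector at every position and keep the scaling constants, whose logs sum to log p(x). This is an illustrative implementation, not the lecture's code; the names are mine and the pi, A, B conventions follow the earlier sketches.

```python
import numpy as np

def forward_rescaled(x, pi, A, B):
    """Forward algorithm with per-position rescaling to avoid underflow.

    Returns (alpha_hat, log_likelihood), where alpha_hat[t, k] is proportional
    to alpha_t^k = p(x_1..x_t, y_t = k) but normalized to sum to 1 over k."""
    T, K = len(x), len(pi)
    alpha_hat = np.zeros((T, K))
    log_like = 0.0

    a = pi * B[:, x[0]]                           # alpha_1^k = pi_k p(x_1 | y_1 = k)
    for t in range(T):
        if t > 0:
            a = (alpha_hat[t-1] @ A) * B[:, x[t]] # sum_i alpha_{t-1}^i a_{i,k}, times emission
        c = a.sum()                               # scaling constant at position t
        alpha_hat[t] = a / c
        log_like += np.log(c)                     # log p(x) = sum_t log c_t
    return alpha_hat, log_like
```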

    Three Main Questions on HMMs

    1. Evaluation
    GIVEN an HMM M, and a sequence x,
    FIND Prob(x | M)
    ALGO. Forward

    2. Decoding
    GIVEN an HMM M, and a sequence x,
    FIND the sequence y of states that maximizes, e.g., P(y | x, M),
    or the most probable subsequence of states
    ALGO. Viterbi, Forward-backward

    3. Learning
    GIVEN an HMM M, with unspecified transition/emission probs., and a sequence x,
    FIND parameters θ = (πi, aij, ηik) that maximize P(x | θ)
    ALGO. Baum-Welch (EM)

    10© Eric Xing @ CMU, 2006-2011

  • 6

    Learning HMM: two scenarios

    Supervised learning: estimation when the "right answer" is known
    Examples:
    GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
    GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls

    Unsupervised learning: estimation when the "right answer" is unknown
    Examples:
    GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
    GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice

    QUESTION: Update the parameters θ of the model to maximize P(x|θ) --- maximum likelihood (ML) estimation

    11© Eric Xing @ CMU, 2006-2011

    MLE

    12© Eric Xing @ CMU, 2006-2011

  • 7

    Supervised ML estimation

    Given x = x1…xN for which the true state path y = y1…yN is known,

    Define:
    Aij = # times state transition i→j occurs in y
    Bik = # times state i in y emits k in x

    We can show that the maximum likelihood parameters θ are:

    $a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}$

    $b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=1}^{T} y_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}$

    What if y is continuous? We can treat $\{(x_{n,t}, y_{n,t}) : t = 1{:}T,\ n = 1{:}N\}$ as N×T observations of, e.g., a Gaussian, and apply the learning rules for a Gaussian.

    13© Eric Xing @ CMU, 2006-2011
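These counting formulas are only a few lines of code. Here is a minimal sketch for a single labeled sequence (function and variable names are mine); the optional pseudo argument anticipates the pseudocounts introduced two slides below.

```python
import numpy as np

def supervised_mle(x, y, K, V, pseudo=0.0):
    """ML estimates a_ij = A_ij / sum_j' A_ij' and b_ik = B_ik / sum_k' B_ik'.

    x, y   : observation / state index sequences of equal length
    K, V   : number of states / number of observation symbols
    pseudo : optional pseudocount R_ij = S_ik = pseudo (0 gives the raw MLE)."""
    A_counts = np.full((K, K), pseudo)
    B_counts = np.full((K, V), pseudo)
    for t in range(1, len(y)):
        A_counts[y[t-1], y[t]] += 1          # transition i -> j observed in y
    for t in range(len(y)):
        B_counts[y[t], x[t]] += 1            # state i emits symbol k observed in x
    a = A_counts / A_counts.sum(axis=1, keepdims=True)
    b = B_counts / B_counts.sum(axis=1, keepdims=True)
    return a, b
```

Note that a state that never occurs in y (like LOADED in the example on the next slide) leaves an all-zero row of counts, which is exactly the zero-probability problem that pseudocounts address.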

    Supervised ML estimation, ctd.

    Intuition:
    When we know the underlying states, the best estimate of θ is the average frequency of transitions & emissions that occur in the training data.

    Drawback:
    Given little data, there may be overfitting:
    P(x|θ) is maximized, but θ is unreasonable
    0 probabilities – VERY BAD

    Example:
    Given 10 casino rolls, we observe
    x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
    y = F, F, F, F, F, F, F, F, F, F
    Then: aFF = 1; aFL = 0
    bF1 = bF3 = .2; bF2 = .3; bF4 = 0; bF5 = bF6 = .1

    14© Eric Xing @ CMU, 2006-2011

  • 8

    Pseudocounts

    Solution for small training sets: add pseudocounts
    Aij = # times state transition i→j occurs in y + Rij
    Bik = # times state i in y emits k in x + Sik

    Rij, Sik are pseudocounts representing our prior belief
    Total pseudocounts: Ri = Σj Rij, Si = Σk Sik
    --- "strength" of prior belief,
    --- total number of imaginary instances in the prior

    Larger total pseudocounts ⇒ stronger prior belief

    Small total pseudocounts: just to avoid 0 probabilities --- smoothing

    15© Eric Xing @ CMU, 2006-2011

    Unsupervised ML estimation

    16© Eric Xing @ CMU, 2006-2011

  • 9

    Unsupervised ML estimation

    Given x = x1…xN for which the true state path y = y1…yN is unknown,

    EXPECTATION MAXIMIZATION

    0. Start with our best guess of a model M, parameters θ.
    1. Estimate Aij, Bik in the training data. How?
       $A_{ij} = \sum_{n,t} y_{n,t-1}^i y_{n,t}^j$,  $B_{ik} = \sum_{n,t} y_{n,t}^i x_{n,t}^k$
       Update θ according to Aij, Bik; now a "supervised learning" problem.
    2. Repeat 1 & 2, until convergence.

    This is called the Baum-Welch Algorithm.

    We can get to a provably more (or equally) likely parameter set θ at each iteration.

    17© Eric Xing @ CMU, 2006-2011

    The Baum-Welch algorithm

    The complete log likelihood:
    $\ell_c(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \log \prod_n \left( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \right)$

    The expected complete log likelihood:
    $\langle \ell_c(\theta; \mathbf{x}, \mathbf{y}) \rangle = \sum_n \left( \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid \mathbf{x}_n)} \log \pi_i \right) + \sum_n \sum_{t=2}^{T} \left( \langle y_{n,t-1}^i y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid \mathbf{x}_n)} \log a_{i,j} \right) + \sum_n \sum_{t=1}^{T} \left( x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid \mathbf{x}_n)} \log b_{i,k} \right)$

    EM

    The E step:
    $\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid \mathbf{x}_n)$
    $\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid \mathbf{x}_n)$

    The M step ("symbolically" identical to MLE):
    $\pi_i^{ML} = \frac{\sum_n \gamma_{n,1}^i}{N}$
    $a_{ij}^{ML} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}$
    $b_{ik}^{ML} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}$

    18© Eric Xing @ CMU, 2006-2011
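Below is a minimal sketch of one Baum-Welch iteration for a single observation sequence, combining a rescaled forward pass, a matching backward pass, the E-step posteriors γ and ξ, and the M-step updates above. Function and variable names are mine; for multiple sequences the expected counts would be summed over n, as in the slide's formulas.

```python
import numpy as np

def baum_welch_step(x, pi, A, B):
    """One Baum-Welch (EM) iteration for a discrete-emission HMM on one sequence x.

    E step: gamma[t, i] = p(y_t = i | x),  xi[t, i, j] = p(y_t = i, y_{t+1} = j | x).
    M step: re-estimate pi, A, B from these expected counts."""
    x = np.asarray(x)
    T, K = len(x), len(pi)

    # Rescaled forward pass: alpha[t] proportional to p(x_1..t, y_t = .)
    alpha = np.zeros((T, K)); c = np.zeros(T)
    alpha[0] = pi * B[:, x[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, x[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]

    # Backward pass, rescaled with the same constants
    beta = np.zeros((T, K)); beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, x[t+1]] * beta[t+1])) / c[t+1]

    # E step: node and pairwise posteriors
    gamma = alpha * beta
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, x[1:]].T * beta[1:])[:, None, :]) / c[1:, None, None]

    # M step ("symbolically" identical to the supervised MLE)
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[x == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new, np.log(c).sum()   # last value is the log-likelihood
```

Repeating this step until the returned log-likelihood stops improving is the Baum-Welch loop described on the previous slide.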

  • 10

    The Baum-Welch algorithm -- comments

    Time complexity:
    # iterations × O(K^2 N)

    Guaranteed to increase the log likelihood of the model

    Not guaranteed to find globally best parameters

    Converges to local optimum, depending on initial conditions

    Too many parameters / too large a model: over-fitting

    19© Eric Xing @ CMU, 2006-2011

    Summary: the HMM algorithms

    Questions:
    Evaluation: What is the probability of the observed sequence? Forward
    Decoding: What is the probability that the state of the 3rd roll is loaded, given the observed sequence? Forward-Backward
    Decoding: What is the most likely die sequence? Viterbi
    Learning: Under what parameterization are the observed sequences most probable? Baum-Welch (EM)

    20© Eric Xing @ CMU, 2006-2011

  • 11

    Applications of HMMs

    Some early applications of HMMs:
    finance (but we never saw them)
    speech recognition
    modelling ion channels

    In the mid-late 1980s HMMs entered genetics and molecular biology, and they are now firmly entrenched.

    Some current applications of HMMs to biology:
    mapping chromosomes
    aligning biological sequences
    predicting sequence structure
    inferring evolutionary relationships
    finding genes in DNA sequence

    21© Eric Xing @ CMU, 2006-2011

    Typical structure of a gene

    22© Eric Xing @ CMU, 2006-2011

  • 12

    GENSCAN (Burge & Karlin)

    (Figure: the GENSCAN state diagram, an HMM for gene finding whose hidden states include the exon states E0, E1, E2, Ei, Es, Et, the intron states I0, I1, I2, the 5'UTR, 3'UTR, promoter, poly-A, and intergenic regions, mirrored on the forward (+) and reverse (-) strands; the observations are the raw DNA sequence, emitted with state-specific models p(· | y, θ).)

    23© Eric Xing @ CMU, 2006-2011

    Shortcomings of Hidden Markov Model

    (Graphical model: Y1 → Y2 → … → Yn, with each Yi emitting Xi.)

    HMM models capture dependences between each state and only its corresponding observation.
    NLP example: in a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental stages), but also on the (non-local) features of the whole line, such as line length, indentation, amount of white space, etc.

    Mismatch between learning objective function and prediction objective function:
    HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X).

    24© Eric Xing @ CMU, 2006-2011

  • 13

    Recall Generative vs. Discriminative Classifiers

    Goal: wish to learn f: X → Y, e.g., P(Y|X)

    Generative classifiers (e.g., Naïve Bayes):
    Assume some functional form for P(X|Y), P(Y)
    This is a 'generative' model of the data!
    Estimate parameters of P(X|Y), P(Y) directly from training data
    Use Bayes rule to calculate P(Y|X = x)
    (Graphical model: Yn → Xn)

    Discriminative classifiers (e.g., logistic regression):
    Directly assume some functional form for P(Y|X)
    This is a 'discriminative' model of the data!
    Estimate parameters of P(Y|X) directly from training data
    (Graphical model: Xn → Yn)

    25© Eric Xing @ CMU, 2006-2011

    Structured Conditional Models

    (Graphical model: label sequence Y1, Y2, …, Yn conditioned on the whole observation sequence x1:n.)

    Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x)
    Specify the probability of possible label sequences given an observation sequence

    Allow arbitrary, non-independent features on the observation sequence X
    The probability of a transition between labels may depend on past and future observations

    Relax strong independence assumptions in generative models

    26© Eric Xing @ CMU, 2006-2011

  • 14

    Conditional Distribution

    If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the Hammersley-Clifford theorem of random fields is:

    $p_\theta(\mathbf{y} \mid \mathbf{x}) \propto \exp\left( \sum_{e \in E,\, k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) + \sum_{v \in V,\, k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \right)$

    ─ x is a data sequence
    ─ y is a label sequence
    ─ v is a vertex from vertex set V = set of label random variables
    ─ e is an edge from edge set E over V
    ─ fk and gk are given and fixed. gk is a Boolean vertex feature; fk is a Boolean edge feature
    ─ k is the number of features
    ─ $\theta = (\lambda_1, \lambda_2, \ldots, \lambda_n; \mu_1, \mu_2, \ldots, \mu_n)$; $\lambda_k$ and $\mu_k$ are parameters to be estimated
    ─ y|e is the set of components of y defined by edge e
    ─ y|v is the set of components of y defined by vertex v

    27© Eric Xing @ CMU, 2006-2011
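For a fixed labeling y, the exponent above is just a weighted sum of edge and vertex features. A small illustrative sketch for a linear chain follows; the feature-function interface and all names here are my assumptions, not the lecture's.

```python
import numpy as np

def crf_log_score(y, x, edge_feats, vertex_feats, lam, mu):
    """Unnormalized log p(y | x):
    sum over edges of lam . f(e, y|e, x)  +  sum over vertices of mu . g(v, y|v, x).

    edge_feats(t, y_prev, y_t, x) -> vector of edge feature values f_k
    vertex_feats(t, y_t, x)       -> vector of vertex feature values g_k
    lam, mu                       -> weight vectors of matching length."""
    score = 0.0
    for t in range(len(y)):
        score += float(np.dot(mu, vertex_feats(t, y[t], x)))        # vertex term
        if t > 0:
            score += float(np.dot(lam, edge_feats(t, y[t-1], y[t], x)))  # edge term
    return score
```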

    Conditional Random Fields

    (Graphical model: Y1, Y2, …, Yn conditioned on the entire observation sequence x1:n.)

    CRF is a partially directed model
    Discriminative model
    Usage of a global normalizer Z(x)
    Models the dependence between each state and the entire observation sequence

    28© Eric Xing @ CMU, 2006-2011

  • 15

    Conditional Random Fields

    General parametric form (the chain-structured special case of the distribution on the previous slide, normalized explicitly):

    $p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\theta, \mathbf{x})} \exp\left( \sum_{t} \sum_k \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}) + \sum_t \sum_k \mu_k g_k(y_t, \mathbf{x}) \right)$

    29© Eric Xing @ CMU, 2006-2011

    Conditional Random Fields

    $p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\theta, \mathbf{x})} \exp\left\{ \sum_c \theta_c f_c(\mathbf{x}, \mathbf{y}_c) \right\}$

    Allow arbitrary dependencies on the input
    Clique dependencies on the labels
    Use approximate inference for general graphs

    © Eric Xing @ CMU, 2006-2011
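For a chain-structured CRF, the normalizer Z(θ, x) in this form can be computed exactly by a forward pass over positions, summing (in log space) over label sequences one position at a time. Below is a minimal sketch under assumed per-position score arrays; log_unary and log_pairwise are my names for the vertex and edge clique scores.

```python
import numpy as np

def crf_log_partition(log_unary, log_pairwise):
    """log Z(theta, x) for a linear-chain CRF, by a forward pass over positions.

    log_unary[t, j]       : sum_k mu_k g_k(y_t = j, x)                   shape (T, K)
    log_pairwise[t, i, j] : sum_k lambda_k f_k(y_t = i, y_{t+1} = j, x)  shape (T-1, K, K)"""
    log_alpha = log_unary[0].copy()
    for t in range(1, log_unary.shape[0]):
        # log_alpha_new[j] = logsumexp_i( log_alpha[i] + log_pairwise[t-1, i, j] ) + log_unary[t, j]
        log_alpha = np.logaddexp.reduce(
            log_alpha[:, None] + log_pairwise[t-1], axis=0) + log_unary[t]
    return float(np.logaddexp.reduce(log_alpha))
```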

  • 16

    CRFs: Inference

    Computing marginals using a message-passing algorithm called "sum-product":

    $m_{i \to j}(S_{ij}) = \sum_{C_i \setminus S_{ij}} \psi_{C_i} \prod_{k \neq j} m_{k \to i}(S_{ik})$

    Initialization: the clique chain (Y1,Y2), (Y2,Y3), …, (Yn-1,Yn) with separators Y2, Y3, …, Yn-1, conditioned on x1:n.

    After calibration: the same clique chain, now holding the calibrated clique beliefs.

    Also called the forward-backward algorithm.

    31© Eric Xing @ CMU, 2006-2011
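Continuing the conventions of the previous sketch, the forward and backward messages of sum-product give the per-position marginals p(y_t | x); this is a minimal illustration, not the lecture's code, and the function name is mine.

```python
import numpy as np

def crf_node_marginals(log_unary, log_pairwise):
    """p(y_t = j | x) for each position t, via sum-product (forward-backward) on the chain."""
    T, K = log_unary.shape
    log_alpha = np.zeros((T, K)); log_beta = np.zeros((T, K))
    log_alpha[0] = log_unary[0]
    for t in range(1, T):                                  # forward messages
        log_alpha[t] = log_unary[t] + np.logaddexp.reduce(
            log_alpha[t-1][:, None] + log_pairwise[t-1], axis=0)
    for t in range(T - 2, -1, -1):                         # backward messages
        log_beta[t] = np.logaddexp.reduce(
            log_pairwise[t] + (log_unary[t+1] + log_beta[t+1])[None, :], axis=1)
    log_m = log_alpha + log_beta
    log_m -= np.logaddexp.reduce(log_m, axis=1, keepdims=True)   # normalize each position
    return np.exp(log_m)
```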

    CRFs: Inference

    Given CRF parameters λ and µ, find the y* that maximizes P(y|x)

    Can ignore Z(x) because it is not a function of y

    Again run a message-passing algorithm, this time called "max-product":

    $m_{i \to j}(S_{ij}) = \max_{C_i \setminus S_{ij}} \psi_{C_i} \prod_{k \neq j} m_{k \to i}(S_{ik})$

    On the clique chain (Y1,Y2), (Y2,Y3), …, (Yn-1,Yn), this is the same as the Viterbi decoding used in HMMs!

    32© Eric Xing @ CMU, 2006-2011

  • 17

    CRF learning

    Given training data {(xd, yd)}d=1..N, find λ*, µ* that maximize the conditional (log-)likelihood of the label sequences given the observations.

    Computing the gradient w.r.t. λ:
    The gradient of the log-partition function in an exponential family is the expectation of the sufficient statistics.

    33© Eric Xing @ CMU, 2006-2011

    CRF learning

    Computing the model expectations:
    Requires an exponentially large number of summations: is it intractable?

    Tractable! We can compute the marginals using the sum-product algorithm on the chain.

    Expectation of f over the corresponding marginal probability of neighboring nodes!

    34© Eric Xing @ CMU, 2006-2011

  • 18

    CRF learning

    Computing feature expectations using calibrated potentials: the expected feature counts are read off the calibrated clique marginals produced by sum-product.

    Now we know how to compute ∇λL(λ,µ), and learning can be done using gradient ascent:
    $\lambda \leftarrow \lambda + \eta \nabla_\lambda L(\lambda, \mu)$,  $\mu \leftarrow \mu + \eta \nabla_\mu L(\lambda, \mu)$

    35© Eric Xing @ CMU, 2006-2011
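Putting the pieces together: since the gradient of the log-partition function is the expectation of the sufficient statistics, the gradient of the conditional log-likelihood with respect to each weight is the observed feature count minus its expected count under the model, which the marginal computations above make tractable. Here is a skeletal gradient-ascent step under assumed callbacks; all names and the fixed step size are mine.

```python
import numpy as np

def crf_gradient_step(lam, mu, data, feature_counts, expected_counts, step=0.1):
    """One gradient-ascent update for (lambda, mu) on labeled data [(x, y), ...].

    feature_counts(x, y)        -> (empirical edge feature counts, empirical vertex feature counts)
    expected_counts(x, lam, mu) -> the same quantities averaged under p(y | x; lam, mu),
                                   computed from the pairwise/node marginals."""
    grad_lam = np.zeros_like(lam)
    grad_mu = np.zeros_like(mu)
    for x, y in data:
        emp_edge, emp_vert = feature_counts(x, y)
        exp_edge, exp_vert = expected_counts(x, lam, mu)
        grad_lam += emp_edge - exp_edge        # observed minus expected edge features
        grad_mu += emp_vert - exp_vert         # observed minus expected vertex features
    return lam + step * grad_lam, mu + step * grad_mu
```

In practice this loop would be repeated until convergence, typically with a regularizer and a decaying or line-searched step size rather than the fixed one used here.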

    CRFs: some empirical results

    Comparison of error rates on synthetic data:
    the data is increasingly higher order in the direction of the arrow, and CRFs achieve the lowest error rate for higher-order data.

    36© Eric Xing @ CMU, 2006-2011

  • 19

    CRFs: some empirical results

    Part-of-speech tagging:
    Using the same set of features: HMM >=< CRF
    Using additional overlapping features: CRF+ >> HMM

    37© Eric Xing @ CMU, 2006-2011

    Summary

    Conditional Random Fields are a discriminative structured input-output model!

    The HMM is a generative structured input-output model.

    Complementary strengths and weaknesses:
    (the slide contrasts the two with their graphical models, the generative Yn → Xn for the HMM and the label chain conditioned on the observations for the CRF; the three specific points appear in the slide figure)

    38© Eric Xing @ CMU, 2006-2011