


  • 1

    Machine Learning

    Conditional Random Fields

    Eric Xing

    10-701/15-781, Fall 2011

    Lecture 12, October 19, 2011

    Reading: Chap. 13 CB

    1© Eric Xing @ CMU, 2006-2011

    Definition (of HMM)

    (Figures: the state automaton over states 1, 2, …, K, and the graphical model y1 → y2 → … → yT with emissions x1, x2, …, xT.)

    Observation space
    Alphabetic set: $\mathcal{C} = \{c_1, c_2, \ldots, c_K\}$
    Euclidean space: $\mathbb{R}^d$

    Index set of hidden states
    $\mathcal{I} = \{1, 2, \ldots, M\}$

    Transition probabilities between any two states
    $p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$,
    or $p(y_t \mid y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}),\ \forall i \in \mathcal{I}$.

    Start probabilities
    $p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)$.

    Emission probabilities associated with each state
    $p(x_t \mid y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}),\ \forall i \in \mathcal{I}$,
    or in general: $p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i),\ \forall i \in \mathcal{I}$.

    2© Eric Xing @ CMU, 2006-2011
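To make this parameterization concrete, here is a minimal NumPy sketch (not from the lecture) that writes down the start probabilities π, the transition matrix a, and a discrete emission matrix b for the two-state fair/loaded-dice model used later in these slides; the names pi, A, B and the uniform start distribution are my assumptions.

```python
import numpy as np

# Two hidden states: 0 = FAIR, 1 = LOADED (the casino example used later).
pi = np.array([0.5, 0.5])                 # start probabilities pi_i (assumed uniform)
A  = np.array([[0.95, 0.05],              # transition probabilities a_{i,j}
               [0.05, 0.95]])
B  = np.array([[1/6] * 6,                 # emission probabilities b_{i,k}: fair die
               [1/10] * 5 + [1/2]])       # loaded die favors a six

# Joint probability of a short state/observation sequence:
# p(x_1..t, y_1..t) = pi_{y1} * prod_t a_{y_{t-1},y_t} * prod_t b_{y_t,x_t}
y = [0, 0, 1]          # FAIR, FAIR, LOADED
x = [2, 5, 5]          # die faces 3, 6, 6 (0-indexed)
p = pi[y[0]] * B[y[0], x[0]]
for t in range(1, len(y)):
    p *= A[y[t-1], y[t]] * B[y[t], x[t]]
print(p)
```

The last lines evaluate the joint probability that reappears in the Viterbi and learning slides below.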

  • 2

    Viterbi decoding

    GIVEN x = x1, …, xT, we want to find y = y1, …, yT such that P(y|x) is maximized:
    y* = argmax_y P(y|x) = argmax_y P(y,x)

    Let $V_t^k$ = probability of the most likely sequence of states ending at state $y_t = k$:
    $V_t^k = \max_{\{y_1, \ldots, y_{t-1}\}} P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t^k = 1)$

    The recursion:
    $V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$

    (Figure: the Viterbi trellis over states 1, 2, …, K and observations x1, x2, x3, …, xN.)

    Underflows are a significant problem
    $p(x_1, \ldots, x_t, y_1, \ldots, y_t) = \pi_{y_1} a_{y_1,y_2} \cdots a_{y_{t-1},y_t}\, b_{y_1,x_1} \cdots b_{y_t,x_t}$
    These numbers become extremely small – underflow
    Solution: take the logs of all values:
    $V_t^k = \log p(x_t \mid y_t^k = 1) + \max_i \left( \log a_{i,k} + V_{t-1}^i \right)$

    3© Eric Xing @ CMU, 2006-2011

    The Viterbi Algorithm – derivation

    Define the Viterbi probability:
    $V_{t+1}^k = \max_{\{y_1, \ldots, y_t\}} P(x_1, \ldots, x_t, y_1, \ldots, y_t, x_{t+1}, y_{t+1}^k = 1)$
    $\quad = \max_{\{y_1, \ldots, y_t\}} P(x_{t+1}, y_{t+1}^k = 1 \mid x_1, \ldots, x_t, y_1, \ldots, y_t)\, P(x_1, \ldots, x_t, y_1, \ldots, y_t)$
    $\quad = \max_{\{y_1, \ldots, y_t\}} P(x_{t+1}, y_{t+1}^k = 1 \mid y_t)\, P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t)$
    $\quad = \max_i P(x_{t+1}, y_{t+1}^k = 1 \mid y_t^i = 1) \max_{\{y_1, \ldots, y_{t-1}\}} P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t^i = 1)$
    $\quad = \max_i P(x_{t+1} \mid y_{t+1}^k = 1)\, a_{i,k} V_t^i$
    $\quad = P(x_{t+1} \mid y_{t+1}^k = 1) \max_i a_{i,k} V_t^i$

    4© Eric Xing @ CMU, 2006-2011

  • 3

    The Viterbi Algorithm

    Input: x = x1, …, xT

    Initialization:
    $V_1^k = P(x_1 \mid y_1^k = 1)\, \pi_k$

    Iteration:
    $V_t^k = P(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$
    $\mathrm{Ptr}(k, t) = \arg\max_i a_{i,k} V_{t-1}^i$

    Termination:
    $P(\mathbf{x}, \mathbf{y}^*) = \max_k V_T^k$

    TraceBack:
    $y_T^* = \arg\max_k V_T^k$
    $y_{t-1}^* = \mathrm{Ptr}(y_t^*, t)$

    5© Eric Xing @ CMU, 2006-2011
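The initialization/iteration/termination steps above translate directly into code. Below is a minimal sketch of the Viterbi recursion in log space (using the underflow fix from the previous slide); the function name viterbi and the NumPy conventions are mine, with pi, A, B as in the earlier sketch.

```python
import numpy as np

def viterbi(x, pi, A, B):
    """Most likely state path argmax_y p(y, x) for a discrete-emission HMM.

    x : sequence of observation indices; pi, A, B as in the earlier sketch.
    Returns (best_path, best_log_prob)."""
    T, K = len(x), len(pi)
    logV = np.full((T, K), -np.inf)          # logV[t, k] = log V_t^k
    ptr  = np.zeros((T, K), dtype=int)       # back-pointers Ptr(k, t)

    logV[0] = np.log(pi) + np.log(B[:, x[0]])             # initialization
    for t in range(1, T):                                  # iteration
        scores = logV[t-1][:, None] + np.log(A)            # log V_{t-1}^i + log a_{i,k}
        ptr[t]  = scores.argmax(axis=0)
        logV[t] = np.log(B[:, x[t]]) + scores.max(axis=0)

    # termination and trace-back
    path = [int(logV[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(ptr[t, path[-1]]))
    return path[::-1], float(logV[-1].max())
```

With uniform start probabilities and the casino parameters, running this on the roll sequence of the next slide reproduces the all-FAIR Viterbi path shown in the decoding table two slides below.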

    x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4

    Two states, FAIR and LOADED, each staying with probability 0.95 and switching with probability 0.05:
    P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
    P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2

    Forward:  $\alpha_t^k = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$
    Backward: $\beta_t^k = \sum_i a_{k,i}\, P(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$
    Viterbi:  $V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$

    $V_t^k$ (FAIR, LOADED), normalized at each position:
    t = 1:  0.6250  0.3750
    t = 2:  0.7353  0.2647
    t = 3:  0.8224  0.1776
    t = 4:  0.8853  0.1147
    t = 5:  0.7200  0.2800
    t = 6:  0.8108  0.1892
    t = 7:  0.8772  0.1228
    t = 8:  0.7043  0.2957
    t = 9:  0.7988  0.2012
    t = 10: 0.8687  0.1313

    6© Eric Xing @ CMU, 2006-2011

  • 4

    Viterbi vs. Posterior Decoding (individual)

    x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4   (states: 1 = FAIR, 2 = LOADED)

    t     V_t (F, L)        p(y_t | x) (F, L)    Viterbi   PD    ptr(F,t), ptr(L,t)
    1     0.6250  0.3750    0.8128  0.1872       1         1     N/A
    2     0.7353  0.2647    0.8238  0.1762       1         1     1, 2
    3     0.8224  0.1776    0.8176  0.1824       1         1     1, 2
    4     0.8853  0.1147    0.7925  0.2075       1         1     1, 2
    5     0.7200  0.2800    0.7415  0.2585       1         1     1, 2
    6     0.8108  0.1892    0.7505  0.2495       1         1     1, 2
    7     0.8772  0.1228    0.7386  0.2614       1         1     1, 2
    8     0.7043  0.2957    0.7027  0.2973       1         1     1, 2
    9     0.7988  0.2012    0.7251  0.2749       1         1     1, 2
    10    0.8687  0.1313    0.7251  0.2749       1         1     1, 2

    Here Viterbi decoding and posterior decoding agree: both label every roll FAIR.

    7© Eric Xing @ CMU, 2006-2011

    Another Example

    X = 6, 2, 3, 5, 6, 2, 6, 3, 6, 6   (states: 1 = FAIR, 2 = LOADED)

    FAIR and LOADED have the same transition probabilities: each stays with probability 0.6 and switches with probability 0.4; emission probabilities are as before.

    t     V_t (F, L)        p(y_t | x) (F, L)    Viterbi   PD    ptr(F,t), ptr(L,t)
    1     0.2500  0.7500    0.2733  0.7267       2         2     N/A
    2     0.5263  0.4737    0.6040  0.3960       1         1     2, 2
    3     0.6494  0.3506    0.6538  0.3462       1         1     1, 2
    4     0.7143  0.2857    0.6062  0.3938       1         1     1, 1
    5     0.3333  0.6667    0.2861  0.7139       2         2     1, 1
    6     0.5263  0.4737    0.5342  0.4658       2         1     2, 2
    7     0.2703  0.7297    0.2734  0.7266       2         2     1, 2
    8     0.5263  0.4737    0.5226  0.4774       2         1     2, 2
    9     0.2703  0.7297    0.2252  0.7748       2         2     1, 2
    10    0.1818  0.8182    0.2159  0.7841       2         2     2, 2

    Here the most likely state sequence (Viterbi) and the sequence of individually most likely states (posterior decoding) disagree, e.g., at positions 6 and 8.

    8© Eric Xing @ CMU, 2006-2011

  • 5

    Computational Complexity and implementation details

    What is the running time, and space required, for Forward and Backward?
    Time: O(K^2 N);  Space: O(KN).

    $\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$
    $\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$
    $V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$

    Useful implementation techniques to avoid underflow:
    Viterbi: sum of logs
    Forward/Backward: rescaling at each position by multiplying by a constant

    9© Eric Xing @ CMU, 2006-2011
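The rescaling trick mentioned for Forward/Backward can be sketched as follows: normalize the α vector at every position and keep the scaling constants, whose logs sum to log p(x). This is an illustrative implementation, not the lecture's code; the names are mine and the pi, A, B conventions follow the earlier sketches.

```python
import numpy as np

def forward_rescaled(x, pi, A, B):
    """Forward algorithm with per-position rescaling to avoid underflow.

    Returns (alpha_hat, log_likelihood), where alpha_hat[t, k] is proportional
    to alpha_t^k = p(x_1..x_t, y_t = k) but normalized to sum to 1 over k."""
    T, K = len(x), len(pi)
    alpha_hat = np.zeros((T, K))
    log_like = 0.0

    a = pi * B[:, x[0]]                           # alpha_1^k = pi_k p(x_1 | y_1 = k)
    for t in range(T):
        if t > 0:
            a = (alpha_hat[t-1] @ A) * B[:, x[t]] # sum_i alpha_{t-1}^i a_{i,k}, times emission
        c = a.sum()                               # scaling constant at position t
        alpha_hat[t] = a / c
        log_like += np.log(c)                     # log p(x) = sum_t log c_t
    return alpha_hat, log_like
```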

    Three Main Questions on HMMs

    1. Evaluation
    GIVEN an HMM M, and a sequence x,
    FIND Prob(x | M)
    ALGO. Forward

    2. Decoding
    GIVEN an HMM M, and a sequence x,
    FIND the sequence y of states that maximizes, e.g., P(y | x, M),
    or the most probable subsequence of states
    ALGO. Viterbi, Forward-backward

    3. Learning
    GIVEN an HMM M, with unspecified transition/emission probs., and a sequence x,
    FIND parameters θ = (πi, aij, ηik) that maximize P(x | θ)
    ALGO. Baum-Welch (EM)

    10© Eric Xing @ CMU, 2006-2011

  • 6

    Learning HMM: two scenarios

    Supervised learning: estimation when the "right answer" is known
    Examples:
    GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
    GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls

    Unsupervised learning: estimation when the "right answer" is unknown
    Examples:
    GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
    GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice

    QUESTION: Update the parameters θ of the model to maximize P(x|θ) --- maximum likelihood (ML) estimation

    11© Eric Xing @ CMU, 2006-2011

    MLE

    12© Eric Xing @ CMU, 2006-2011

  • 7

    Supervised ML estimation

    Given x = x1…xN for which the true state path y = y1…yN is known,

    Define:
    Aij = # times state transition i→j occurs in y
    Bik = # times state i in y emits k in x

    We can show that the maximum likelihood parameters θ are:

    $a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}$

    $b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=1}^{T} y_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}$

    What if y is continuous? We can treat $\{(x_{n,t}, y_{n,t}) : t = 1{:}T,\ n = 1{:}N\}$ as N×T observations of, e.g., a Gaussian, and apply the learning rules for a Gaussian.

    13© Eric Xing @ CMU, 2006-2011
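These counting formulas are only a few lines of code. Here is a minimal sketch for a single labeled sequence (function and variable names are mine); the optional pseudo argument anticipates the pseudocounts introduced two slides below.

```python
import numpy as np

def supervised_mle(x, y, K, V, pseudo=0.0):
    """ML estimates a_ij = A_ij / sum_j' A_ij' and b_ik = B_ik / sum_k' B_ik'.

    x, y   : observation / state index sequences of equal length
    K, V   : number of states / number of observation symbols
    pseudo : optional pseudocount R_ij = S_ik = pseudo (0 gives the raw MLE)."""
    A_counts = np.full((K, K), pseudo)
    B_counts = np.full((K, V), pseudo)
    for t in range(1, len(y)):
        A_counts[y[t-1], y[t]] += 1          # transition i -> j observed in y
    for t in range(len(y)):
        B_counts[y[t], x[t]] += 1            # state i emits symbol k observed in x
    a = A_counts / A_counts.sum(axis=1, keepdims=True)
    b = B_counts / B_counts.sum(axis=1, keepdims=True)
    return a, b
```

Note that a state that never occurs in y (like LOADED in the example on the next slide) leaves an all-zero row of counts, which is exactly the zero-probability problem that pseudocounts address.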

    Supervised ML estimation, ctd.

    Intuition:
    When we know the underlying states, the best estimate of θ is the average frequency of transitions & emissions that occur in the training data.

    Drawback:
    Given little data, there may be overfitting:
    P(x|θ) is maximized, but θ is unreasonable
    0 probabilities – VERY BAD

    Example:
    Given 10 casino rolls, we observe
    x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
    y = F, F, F, F, F, F, F, F, F, F
    Then: aFF = 1; aFL = 0
    bF1 = bF3 = .2; bF2 = .3; bF4 = 0; bF5 = bF6 = .1

    14© Eric Xing @ CMU, 2006-2011

  • 8

    Pseudocounts

    Solution for small training sets: add pseudocounts
    Aij = # times state transition i→j occurs in y + Rij
    Bik = # times state i in y emits k in x + Sik

    Rij, Sik are pseudocounts representing our prior belief
    Total pseudocounts: Ri = Σj Rij, Si = Σk Sik
    --- "strength" of prior belief,
    --- total number of imaginary instances in the prior

    Larger total pseudocounts ⇒ stronger prior belief

    Small total pseudocounts: just to avoid 0 probabilities --- smoothing

    15© Eric Xing @ CMU, 2006-2011

    Unsupervised ML estimation

    16© Eric Xing @ CMU, 2006-2011

  • 9

    Unsupervised ML estimation

    Given x = x1…xN for which the true state path y = y1…yN is unknown,

    EXPECTATION MAXIMIZATION

    0. Start with our best guess of a model M, parameters θ.
    1. Estimate Aij, Bik in the training data. How?
       $A_{ij} = \sum_{n,t} y_{n,t-1}^i y_{n,t}^j$,  $B_{ik} = \sum_{n,t} y_{n,t}^i x_{n,t}^k$
       Update θ according to Aij, Bik; now a "supervised learning" problem.
    2. Repeat 1 & 2, until convergence.

    This is called the Baum-Welch Algorithm.

    We can get to a provably more (or equally) likely parameter set θ at each iteration.

    17© Eric Xing @ CMU, 2006-2011

    The Baum-Welch algorithm

    The complete log likelihood:
    $\ell_c(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \log \prod_n \left( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \right)$

    The expected complete log likelihood:
    $\langle \ell_c(\theta; \mathbf{x}, \mathbf{y}) \rangle = \sum_n \left( \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid \mathbf{x}_n)} \log \pi_i \right) + \sum_n \sum_{t=2}^{T} \left( \langle y_{n,t-1}^i y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid \mathbf{x}_n)} \log a_{i,j} \right) + \sum_n \sum_{t=1}^{T} \left( x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid \mathbf{x}_n)} \log b_{i,k} \right)$

    EM

    The E step:
    $\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid \mathbf{x}_n)$
    $\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid \mathbf{x}_n)$

    The M step ("symbolically" identical to MLE):
    $\pi_i^{ML} = \frac{\sum_n \gamma_{n,1}^i}{N}$
    $a_{ij}^{ML} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}$
    $b_{ik}^{ML} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}$

    18© Eric Xing @ CMU, 2006-2011
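Below is a minimal sketch of one Baum-Welch iteration for a single observation sequence, combining a rescaled forward pass, a matching backward pass, the E-step posteriors γ and ξ, and the M-step updates above. Function and variable names are mine; for multiple sequences the expected counts would be summed over n, as in the slide's formulas.

```python
import numpy as np

def baum_welch_step(x, pi, A, B):
    """One Baum-Welch (EM) iteration for a discrete-emission HMM on one sequence x.

    E step: gamma[t, i] = p(y_t = i | x),  xi[t, i, j] = p(y_t = i, y_{t+1} = j | x).
    M step: re-estimate pi, A, B from these expected counts."""
    x = np.asarray(x)
    T, K = len(x), len(pi)

    # Rescaled forward pass: alpha[t] proportional to p(x_1..t, y_t = .)
    alpha = np.zeros((T, K)); c = np.zeros(T)
    alpha[0] = pi * B[:, x[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, x[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]

    # Backward pass, rescaled with the same constants
    beta = np.zeros((T, K)); beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, x[t+1]] * beta[t+1])) / c[t+1]

    # E step: node and pairwise posteriors
    gamma = alpha * beta
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, x[1:]].T * beta[1:])[:, None, :]) / c[1:, None, None]

    # M step ("symbolically" identical to the supervised MLE)
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[x == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new, np.log(c).sum()   # last value is the log-likelihood
```

Repeating this step until the returned log-likelihood stops improving is the Baum-Welch loop described on the previous slide.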

  • 10

    The Baum-Welch algorithm -- comments

    Time complexity:
    # iterations × O(K^2 N)

    Guaranteed to increase the log likelihood of the model

    Not guaranteed to find globally best parameters

    Converges to local optimum, depending on initial conditions

    Too many parameters / too large a model: over-fitting

    19© Eric Xing @ CMU, 2006-2011

    Summary: the HMM algorithms

    Questions:
    Evaluation: What is the probability of the observed sequence? Forward
    Decoding: What is the probability that the state of the 3rd roll is loaded, given the observed sequence? Forward-Backward
    Decoding: What is the most likely die sequence? Viterbi
    Learning: Under what parameterization are the observed sequences most probable? Baum-Welch (EM)

    20© Eric Xing @ CMU, 2006-2011

  • 11

    Applications of HMMs

    Some early applications of HMMs:
    finance (but we never saw them)
    speech recognition
    modelling ion channels

    In the mid-late 1980s HMMs entered genetics and molecular biology, and they are now firmly entrenched.

    Some current applications of HMMs to biology:
    mapping chromosomes
    aligning biological sequences
    predicting sequence structure
    inferring evolutionary relationships
    finding genes in DNA sequence

    21© Eric Xing @ CMU, 2006-2011

    Typical structure of a gene

    22© Eric Xing @ CMU, 2006-2011

  • 12

    GENSCAN (Burge & Karlin)

    (Figure: the GENSCAN state diagram, an HMM for gene finding whose hidden states include the exon states E0, E1, E2, Ei, Es, Et, the intron states I0, I1, I2, the 5'UTR, 3'UTR, promoter, poly-A, and intergenic regions, mirrored on the forward (+) and reverse (-) strands; the observations are the raw DNA sequence, emitted with state-specific models p(· | y, θ).)

    23© Eric Xing @ CMU, 2006-2011

    Shortcomings of Hidden Markov Model

    (Graphical model: Y1 → Y2 → … → Yn, with each Yi emitting Xi.)

    HMM models capture dependences between each state and only its corresponding observation.
    NLP example: in a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental stages), but also on the (non-local) features of the whole line, such as line length, indentation, amount of white space, etc.

    Mismatch between learning objective function and prediction objective function:
    HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X).

    24© Eric Xing @ CMU, 2006-2011

  • 13

    Recall Generative vs. Discriminative Classifiers

    Goal: wish to learn f: X → Y, e.g., P(Y|X)

    Generative classifiers (e.g., Naïve Bayes):
    Assume some functional form for P(X|Y), P(Y)
    This is a 'generative' model of the data!
    Estimate parameters of P(X|Y), P(Y) directly from training data
    Use Bayes rule to calculate P(Y|X = x)
    (Graphical model: Yn → Xn)

    Discriminative classifiers (e.g., logistic regression):
    Directly assume some functional form for P(Y|X)
    This is a 'discriminative' model of the data!
    Estimate parameters of P(Y|X) directly from training data
    (Graphical model: Xn → Yn)

    25© Eric Xing @ CMU, 2006-2011

    Structured Conditional Models

    (Graphical model: label sequence Y1, Y2, …, Yn conditioned on the whole observation sequence x1:n.)

    Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x)
    Specify the probability of possible label sequences given an observation sequence

    Allow arbitrary, non-independent features on the observation sequence X
    The probability of a transition between labels may depend on past and future observations

    Relax strong independence assumptions in generative models

    26© Eric Xing @ CMU, 2006-2011

  • 14

    Conditional Distribution

    If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the Hammersley-Clifford theorem of random fields is:

    $p_\theta(\mathbf{y} \mid \mathbf{x}) \propto \exp\left( \sum_{e \in E,\, k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) + \sum_{v \in V,\, k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \right)$

    ─ x is a data sequence
    ─ y is a label sequence
    ─ v is a vertex from vertex set V = set of label random variables
    ─ e is an edge from edge set E over V
    ─ fk and gk are given and fixed. gk is a Boolean vertex feature; fk is a Boolean edge feature
    ─ k is the number of features
    ─ $\theta = (\lambda_1, \lambda_2, \ldots, \lambda_n; \mu_1, \mu_2, \ldots, \mu_n)$; $\lambda_k$ and $\mu_k$ are parameters to be estimated
    ─ y|e is the set of components of y defined by edge e
    ─ y|v is the set of components of y defined by vertex v

    27© Eric Xing @ CMU, 2006-2011
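For a fixed labeling y, the exponent above is just a weighted sum of edge and vertex features. A small illustrative sketch for a linear chain follows; the feature-function interface and all names here are my assumptions, not the lecture's.

```python
import numpy as np

def crf_log_score(y, x, edge_feats, vertex_feats, lam, mu):
    """Unnormalized log p(y | x):
    sum over edges of lam . f(e, y|e, x)  +  sum over vertices of mu . g(v, y|v, x).

    edge_feats(t, y_prev, y_t, x) -> vector of edge feature values f_k
    vertex_feats(t, y_t, x)       -> vector of vertex feature values g_k
    lam, mu                       -> weight vectors of matching length."""
    score = 0.0
    for t in range(len(y)):
        score += float(np.dot(mu, vertex_feats(t, y[t], x)))        # vertex term
        if t > 0:
            score += float(np.dot(lam, edge_feats(t, y[t-1], y[t], x)))  # edge term
    return score
```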

    Conditional Random Fields

    (Graphical model: Y1, Y2, …, Yn conditioned on the entire observation sequence x1:n.)

    CRF is a partially directed model
    Discriminative model
    Usage of a global normalizer Z(x)
    Models the dependence between each state and the entire observation sequence

    28© Eric Xing @ CMU, 2006-2011

  • 15

    Conditional Random Fields

    General parametric form (the chain-structured special case of the distribution on the previous slide, normalized explicitly):

    $p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\theta, \mathbf{x})} \exp\left( \sum_{t} \sum_k \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}) + \sum_t \sum_k \mu_k g_k(y_t, \mathbf{x}) \right)$

    29© Eric Xing @ CMU, 2006-2011

    Conditional Random Fields

    $p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\theta, \mathbf{x})} \exp\left\{ \sum_c \theta_c f_c(\mathbf{x}, \mathbf{y}_c) \right\}$

    Allow arbitrary dependencies on the input
    Clique dependencies on the labels
    Use approximate inference for general graphs

    © Eric Xing @ CMU, 2006-2011
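For a chain-structured CRF, the normalizer Z(θ, x) in this form can be computed exactly by a forward pass over positions, summing (in log space) over label sequences one position at a time. Below is a minimal sketch under assumed per-position score arrays; log_unary and log_pairwise are my names for the vertex and edge clique scores.

```python
import numpy as np

def crf_log_partition(log_unary, log_pairwise):
    """log Z(theta, x) for a linear-chain CRF, by a forward pass over positions.

    log_unary[t, j]       : sum_k mu_k g_k(y_t = j, x)                   shape (T, K)
    log_pairwise[t, i, j] : sum_k lambda_k f_k(y_t = i, y_{t+1} = j, x)  shape (T-1, K, K)"""
    log_alpha = log_unary[0].copy()
    for t in range(1, log_unary.shape[0]):
        # log_alpha_new[j] = logsumexp_i( log_alpha[i] + log_pairwise[t-1, i, j] ) + log_unary[t, j]
        log_alpha = np.logaddexp.reduce(
            log_alpha[:, None] + log_pairwise[t-1], axis=0) + log_unary[t]
    return float(np.logaddexp.reduce(log_alpha))
```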

  • 16

    CRFs: Inference

    Computing marginals using a message-passing algorithm called "sum-product":

    $m_{i \to j}(S_{ij}) = \sum_{C_i \setminus S_{ij}} \psi_{C_i} \prod_{k \neq j} m_{k \to i}(S_{ik})$

    Initialization: the clique chain (Y1,Y2), (Y2,Y3), …, (Yn-1,Yn) with separators Y2, Y3, …, Yn-1, conditioned on x1:n.

    After calibration: the same clique chain, now holding the calibrated clique beliefs.

    Also called the forward-backward algorithm.

    31© Eric Xing @ CMU, 2006-2011
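Continuing the conventions of the previous sketch, the forward and backward messages of sum-product give the per-position marginals p(y_t | x); this is a minimal illustration, not the lecture's code, and the function name is mine.

```python
import numpy as np

def crf_node_marginals(log_unary, log_pairwise):
    """p(y_t = j | x) for each position t, via sum-product (forward-backward) on the chain."""
    T, K = log_unary.shape
    log_alpha = np.zeros((T, K)); log_beta = np.zeros((T, K))
    log_alpha[0] = log_unary[0]
    for t in range(1, T):                                  # forward messages
        log_alpha[t] = log_unary[t] + np.logaddexp.reduce(
            log_alpha[t-1][:, None] + log_pairwise[t-1], axis=0)
    for t in range(T - 2, -1, -1):                         # backward messages
        log_beta[t] = np.logaddexp.reduce(
            log_pairwise[t] + (log_unary[t+1] + log_beta[t+1])[None, :], axis=1)
    log_m = log_alpha + log_beta
    log_m -= np.logaddexp.reduce(log_m, axis=1, keepdims=True)   # normalize each position
    return np.exp(log_m)
```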

    CRFs: Inference

    Given CRF parameters λ and µ, find the y* that maximizes P(y|x)

    Can ignore Z(x) because it is not a function of y

    Again run a message-passing algorithm, this time called "max-product":

    $m_{i \to j}(S_{ij}) = \max_{C_i \setminus S_{ij}} \psi_{C_i} \prod_{k \neq j} m_{k \to i}(S_{ik})$

    On the clique chain (Y1,Y2), (Y2,Y3), …, (Yn-1,Yn), this is the same as the Viterbi decoding used in HMMs!

    32© Eric Xing @ CMU, 2006-2011

  • 17

    CRF learning

    Given training data {(xd, yd)}d=1..N, find λ*, µ* that maximize the conditional (log-)likelihood of the label sequences given the observations.

    Computing the gradient w.r.t. λ:
    The gradient of the log-partition function in an exponential family is the expectation of the sufficient statistics.

    33© Eric Xing @ CMU, 2006-2011

    CRF learning

    Computing the model expectations:
    Requires an exponentially large number of summations: is it intractable?

    Tractable! We can compute the marginals using the sum-product algorithm on the chain.

    Expectation of f over the corresponding marginal probability of neighboring nodes!

    34© Eric Xing @ CMU, 2006-2011

  • 18

    CRF learning

    Computing feature expectations using calibrated potentials: the expected feature counts are read off the calibrated clique marginals produced by sum-product.

    Now we know how to compute ∇λL(λ,µ), and learning can be done using gradient ascent:
    $\lambda \leftarrow \lambda + \eta \nabla_\lambda L(\lambda, \mu)$,  $\mu \leftarrow \mu + \eta \nabla_\mu L(\lambda, \mu)$

    35© Eric Xing @ CMU, 2006-2011
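Putting the pieces together: since the gradient of the log-partition function is the expectation of the sufficient statistics, the gradient of the conditional log-likelihood with respect to each weight is the observed feature count minus its expected count under the model, which the marginal computations above make tractable. Here is a skeletal gradient-ascent step under assumed callbacks; all names and the fixed step size are mine.

```python
import numpy as np

def crf_gradient_step(lam, mu, data, feature_counts, expected_counts, step=0.1):
    """One gradient-ascent update for (lambda, mu) on labeled data [(x, y), ...].

    feature_counts(x, y)        -> (empirical edge feature counts, empirical vertex feature counts)
    expected_counts(x, lam, mu) -> the same quantities averaged under p(y | x; lam, mu),
                                   computed from the pairwise/node marginals."""
    grad_lam = np.zeros_like(lam)
    grad_mu = np.zeros_like(mu)
    for x, y in data:
        emp_edge, emp_vert = feature_counts(x, y)
        exp_edge, exp_vert = expected_counts(x, lam, mu)
        grad_lam += emp_edge - exp_edge        # observed minus expected edge features
        grad_mu += emp_vert - exp_vert         # observed minus expected vertex features
    return lam + step * grad_lam, mu + step * grad_mu
```

In practice this loop would be repeated until convergence, typically with a regularizer and a decaying or line-searched step size rather than the fixed one used here.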

    CRFs: some empirical results

    Comparison of error rates on synthetic data:
    the data is increasingly higher order in the direction of the arrow, and CRFs achieve the lowest error rate for higher-order data.

    36© Eric Xing @ CMU, 2006-2011

  • 19

    CRFs: some empirical results

    Part-of-speech tagging:
    Using the same set of features: HMM >=< CRF
    Using additional overlapping features: CRF+ >> HMM

    37© Eric Xing @ CMU, 2006-2011

    Summary

    Conditional Random Fields are a discriminative structured input-output model!

    The HMM is a generative structured input-output model.

    Complementary strengths and weaknesses:
    (the slide contrasts the two with their graphical models, the generative Yn → Xn for the HMM and the label chain conditioned on the observations for the CRF; the three specific points appear in the slide figure)

    38© Eric Xing @ CMU, 2006-2011