Graphical models for part of speech tagging
IIT Bombay and IBM India Research Lab
December 2005


  • Graphical models for part of speech tagging

    Indian Institute of Technology, Bombay and
    IBM Research Division, India Research Lab
    December 2005

  • Different Models for POS tagging

    HMM
    Maximum Entropy Markov Models
    Conditional Random Fields

  • POS tagging: A Sequence Labeling Problem

    Input and Output (see the example below)
    – Input sequence x = x1 x2 … xn
    – Output sequence y = y1 y2 … ym
      Labels of the input sequence
      Semantic representation of the input

    Other Applications
    – Automatic speech recognition
    – Text processing, e.g., tagging, named entity recognition, summarization by exploiting the layout structure of text, etc.
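
    To make the task concrete, here is a minimal illustration of an input/output pair for POS tagging (the sentence and tags are invented for the example; tags are Penn Treebank-style):

        # Input sequence x: tokens. Output sequence y: one POS tag per token.
        x = ["The", "cat", "sat", "on", "the", "mat"]
        y = ["DT", "NN", "VBD", "IN", "DT", "NN"]

        # The labeling pairs each position k with a tag.
        for xk, yk in zip(x, y):
            print(f"{xk}/{yk}")   # The/DT cat/NN sat/VBD ...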

  • Hidden Markov Models

    Doubly stochastic models

    Efficient dynamic programming algorithms exist for
    – Finding Pr(S)
    – Finding the highest-probability path P that maximizes Pr(S, P) (Viterbi; sketched below)
    – Training the model (Baum-Welch algorithm)

    [Figure: a four-state HMM (S1-S4) with transition probabilities 0.9, 0.5, 0.5, 0.8, 0.2, 0.1 on its arcs; each state emits symbol A or C under its own distribution (0.6/0.4, 0.3/0.7, 0.5/0.5, 0.9/0.1).]
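
    The Viterbi algorithm mentioned above fits in a few lines. This is a generic sketch, not code from the slides; start_p, trans_p, and emit_p are assumed dictionaries of start, transition, and emission probabilities:

        import math

        def viterbi(obs, states, start_p, trans_p, emit_p):
            """Best state path for obs; assumes all probabilities are nonzero."""
            # delta[s]: best log-probability of any path ending in state s.
            delta = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
                     for s in states}
            backpointers = []
            for o in obs[1:]:
                prev = delta
                delta, back = {}, {}
                for s in states:
                    best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
                    delta[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][o])
                    back[s] = best
                backpointers.append(back)
            # Trace the best path backwards from the best final state.
            last = max(states, key=lambda s: delta[s])
            path = [last]
            for back in reversed(backpointers):
                path.append(back[path[-1]])
            return delta[last], path[::-1]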

  • Hidden Markov Model (HMM): Generative Modeling

    Source Model P(Y), e.g., a 1st-order Markov chain:
      P(y) = ∏i P(yi | yi−1)

    Noisy Channel P(X|Y), y → x:
      P(x|y) = ∏i P(xi | yi)

    Parameter estimation: maximize the joint likelihood of the training examples T:
      max Σ(x,y)∈T log2 P(x, y)
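
    In code, this generative factorization is just a product of transition and emission terms. A minimal sketch (generic, not from the slides; the parameter dictionaries are assumed nonzero, as in the Viterbi sketch above):

        import math

        def joint_log_prob(x, y, start_p, trans_p, emit_p):
            """log2 P(x, y) under a 1st-order HMM."""
            lp = math.log2(start_p[y[0]]) + math.log2(emit_p[y[0]][x[0]])
            for k in range(1, len(x)):
                lp += math.log2(trans_p[y[k - 1]][y[k]])   # P(y_k | y_{k-1})
                lp += math.log2(emit_p[y[k]][x[k]])        # P(x_k | y_k)
            return lp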

  • Dependency (1st order)

    [Figure: the dependency graph of a 1st-order HMM. The states … Yk−2 → Yk−1 → Yk → Yk+1 … form a chain with transition probabilities P(Yk | Yk−1), and each state Yk emits its observation Xk with probability P(Xk | Yk).]

  • Different Models for POS tagging

    HMM
    Maximum Entropy Markov Models
    Conditional Random Fields

  • Disadvantage of HMMs (1)

    No Rich Feature Information
    – Rich features are required
      When xk is complex
      When data for xk is sparse

    Example: POS Tagging
    – How to evaluate P(wk | tk) for unknown words wk?
    – Useful features (sketched below)
      Suffix, e.g., -ed, -tion, -ing, etc.
      Capitalization
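
    A minimal sketch of such word features (the feature names and the particular set are invented for illustration):

        def word_features(w):
            """Spelling features useful for unknown words (illustrative set)."""
            return {
                "suffix=ed":   w.endswith("ed"),
                "suffix=tion": w.endswith("tion"),
                "suffix=ing":  w.endswith("ing"),
                "capitalized": w[:1].isupper(),
            }

        # e.g., word_features("Nation") sets both "capitalized" and "suffix=tion"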

  • Disadvantage of HMMs (2)

    Generative Model
    – Parameter estimation: maximize the joint likelihood of the training examples
      max Σ(x,y)∈T log2 P(X = x, Y = y)

    Better Approach
    – A discriminative model, which models P(y|x) directly
    – Maximize the conditional likelihood of the training examples
      max Σ(x,y)∈T log2 P(Y = y | X = x)

  • Maximum Entropy Markov Model

    Discriminative Sub-Models
    – Unify the two parameters of the generative model into one conditional model
      Two parameters in the generative model: the source-model parameter P(yk | yk−1) and the noisy-channel parameter P(xk | yk)
      Unified conditional model: P(yk | xk, yk−1)
    – Employ the maximum entropy principle

    Maximum Entropy Markov Model:
      P(y|x) = ∏i P(yi | yi−1, xi)
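
    A sketch of this factorization in code (generic; local_model is an assumed callable returning P(yk | yk−1, xk), e.g., a trained maximum entropy classifier):

        import math

        def memm_log_prob(x, y, local_model, start_label="<s>"):
            """log P(y | x) = sum over k of log P(y_k | y_{k-1}, x_k)."""
            lp = 0.0
            prev = start_label
            for xk, yk in zip(x, y):
                lp += math.log(local_model(yk, prev, xk))  # one locally normalized term per position
                prev = yk
            return lp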

  • General Maximum Entropy Model

    Model
    – Model the distribution P(Y|X) with a set of features {f1, f2, …, fl} defined on X and Y

    Idea
    – Collect statistics of the features from the training data
    – Assume nothing about the distribution P(Y|X) other than the collected information
      Maximize the entropy as the criterion

  • Features

    Features
    – 0-1 indicator functions
      1 if (x, y) satisfies a predefined condition
      0 if not

    Example: POS Tagging
      f1(x, y) = 1 if x ends with -tion and y is NN, 0 otherwise
      f2(x, y) = 1 if x starts with a capital letter and y is NNP, 0 otherwise
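
    These translate directly into 0-1 functions; a sketch in Python (assuming x here is the current word):

        def f1(x, y):
            # 1 if x ends with -tion and the tag y is NN, else 0
            return 1 if x.endswith("tion") and y == "NN" else 0

        def f2(x, y):
            # 1 if x starts with a capital letter and the tag y is NNP, else 0
            return 1 if x[:1].isupper() and y == "NNP" else 0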

  • Constraints

    Empirical Information
    – Statistics from the training data T:
      P̂(fi) = (1/|T|) Σ(x,y)∈T fi(x, y)

    Expected Value
    – From the distribution P(Y|X) we want to model:
      P(fi) = (1/|T|) Σ(x,y)∈T Σy′∈D(Y) P(Y = y′ | X = x) fi(x, y′)

    Constraints: P̂(fi) = P(fi)
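
    Both sides of the constraint are simple sums; a sketch (T is a list of (x, y) pairs, f a feature function, and p an assumed callable for the model's P(y|x)):

        def empirical_expectation(f, T):
            """P̂(f) = (1/|T|) * sum of f(x, y) over training pairs (x, y)."""
            return sum(f(x, y) for x, y in T) / len(T)

        def model_expectation(f, T, labels, p):
            """P(f): expectation of f under the model p(y, x) = P(y|x)."""
            return sum(p(y2, x) * f(x, y2) for x, _ in T for y2 in labels) / len(T)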

  • Maximum Entropy: Objective

    Entropy
      I = −(1/|T|) Σ(x,y)∈T Σy′ P(Y = y′ | X = x) log2 P(Y = y′ | X = x)
        = −Σx Σy P̂(x) P(Y = y | X = x) log2 P(Y = y | X = x)

    Maximization Problem
      maxP(Y|X) I   s.t.   P̂(fi) = P(fi)

  • Dual Problem

    Dual Problem
    – Conditional model:
      P(Y = y | X = x) ∝ exp(Σi=1…l λi fi(x, y))
    – Maximum likelihood of the conditional data:
      maxλ1,…,λl Σ(x,y)∈T log2 P(Y = y | X = x)

    Solution
    – Improved iterative scaling (IIS) (Berger et al. 1996)
    – Generalized iterative scaling (GIS) (McCallum et al. 2000)
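
    The dual form is an exponential (log-linear) model; a minimal sketch of evaluating it, reusing the indicator features f1, f2 sketched earlier (the weights are invented for illustration):

        import math

        def maxent_prob(x, y, labels, features, lam):
            """P(y|x) ∝ exp(sum_i lam[i] * f_i(x, y)), normalized over labels."""
            def score(yy):
                return math.exp(sum(l * f(x, yy) for l, f in zip(lam, features)))
            return score(y) / sum(score(yy) for yy in labels)

        # e.g., maxent_prob("nation", "NN", ["NN", "NNP", "VB"], [f1, f2], [1.2, 0.8])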

  • Maximum Entropy Markov Model

    Use the Maximum Entropy Approach to Model
    – 1st order: P(Yk = yk | Xk = xk, Yk−1 = yk−1)

    Features
    – Basic features (like the parameters in an HMM)
      Bigram (1st order) or trigram (2nd order) in the source model
      State-output pair feature (Xk = xk, Yk = yk)
    – Advantage: can incorporate other advanced features on (xk, yk)

  • HMM vs MEMM (1st order)

    [Figure: two graphical models side by side. HMM: Yk−1 → Yk with P(Yk | Yk−1), and Yk → Xk with P(Xk | Yk). Maximum Entropy Markov Model (MEMM): arrows from both Yk−1 and Xk into Yk, with the single conditional P(Yk | Xk, Yk−1).]

  • Performance in POS Tagging

    POS Tagging
    – Data set: WSJ
    – Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)

    Results (Lafferty et al. 2001)
    – 1st order HMM: 94.31% accuracy, 54.01% OOV accuracy
    – 1st order MEMM: 95.19% accuracy, 73.01% OOV accuracy

  • Different Models for POS tagging

    HMM
    Maximum Entropy Markov Models
    Conditional Random Fields

  • Disadvantage of MEMMs (1)

    Complex Algorithm of the Maximum Entropy Solution
    – Both IIS and GIS are difficult to implement
    – Require many tricks in implementation

    Slow in Training
    – Time-consuming when the data set is large, especially for MEMMs

  • Disadvantage of MEMMs (2)

    Maximum Entropy Markov Model
    – Maximum entropy model as a sub-model
    – Optimization of entropy on the sub-models, not on the global model

    Label Bias Problem
    – Conditional models with per-state normalization
    – Effects of observations are weakened for states with fewer outgoing transitions

  • Label Bias Problem

    Training Data (X:Y)
      rib:123
      rib:123
      rib:123
      rob:456
      rob:456

    [Figure: two paths in a finite-state model. Path 1→2→3 reads r, i, b; path 4→5→6 reads r, o, b.]

    Parameters (maximum likelihood, with per-state normalization)
      P(1|r) = 0.6, P(4|r) = 0.4
      P(2|i,1) = P(2|o,1) = 1, P(5|i,4) = P(5|o,4) = 1
      P(3|b,2) = 1, P(6|b,5) = 1

    Model on new input: rob
      P(123|rob) = P(1|r) P(2|o,1) P(3|b,2) = 0.6 × 1 × 1 = 0.6
      P(456|rob) = P(4|r) P(5|o,4) P(6|b,5) = 0.4 × 1 × 1 = 0.4

    The model prefers the labeling 123 for rob, even though rob was always labeled 456 in training: states 1 and 4 each have a single successor, so per-state normalization forces the transition probabilities to 1 regardless of the observation.
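
    A few lines of Python reproduce the arithmetic and make the bias visible (a toy reproduction of the slide's example, not library code):

        # Per-state conditionals learned from {rib:123 x3, rob:456 x2}.
        p_start = {1: 0.6, 4: 0.4}                 # P(state | r)
        # States 1 and 4 each have one successor, so the observation is ignored:
        p_next = {(1, "i"): 1.0, (1, "o"): 1.0, (4, "i"): 1.0, (4, "o"): 1.0}

        p_123 = p_start[1] * p_next[(1, "o")] * 1.0   # P(123 | rob)
        p_456 = p_start[4] * p_next[(4, "o")] * 1.0   # P(456 | rob)
        print(p_123, p_456)   # 0.6 0.4 -> the wrong path wins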

  • Solution

    Global Optimization
    – Optimize the parameters in a global model simultaneously, not in sub-models separately

    Alternatives
    – Conditional random fields
    – Application of the perceptron algorithm

  • Conditional Random Field (CRF) (1)

    Let
    – G = (V, E) be a graph such that Y = (Yv), v ∈ V, is indexed by the vertices

    Then (X, Y) is a conditional random field if
      P(Yv | X, Yw, w ≠ v) = P(Yv | X, Yw, (w, v) ∈ E)
    – i.e., Y is conditioned globally on X

  • Conditional Random Field (CRF) (2)

    Exponential Model
    – G = (V, E): a tree (or, more specifically, a chain) with cliques as edges and vertices
      P(Y = y | X = x) ∝ exp(Σe∈E,k λk fk(e, y|e, x) + Σv∈V,k μk gk(v, y|v, x))
      The edge features fk are determined by state transitions; the vertex features gk are state-determined

    Parameter Estimation
    – Maximize the conditional likelihood of the training examples:
      max Σ(x,y)∈T log2 P(Y = y | X = x)
    – IIS or GIS
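
    A sketch of the chain-CRF computation (a generic illustration, not the paper's code; edge_feats, vertex_feats, and the weights are assumed inputs). The point is the single global normalizer Z(x), which is what distinguishes a CRF from a per-state-normalized MEMM:

        import math
        from itertools import product

        def crf_prob(x, y, labels, edge_feats, vertex_feats, lam, mu):
            """P(y | x) for a linear-chain CRF: one global normalizer Z(x)."""
            def score(ys):
                s = sum(m * g(v, ys[v], x)
                        for v in range(len(x)) for m, g in zip(mu, vertex_feats))
                s += sum(l * f(e, ys[e - 1], ys[e], x)
                         for e in range(1, len(x)) for l, f in zip(lam, edge_feats))
                return math.exp(s)
            # Brute-force Z(x) over all label sequences, for clarity only;
            # a real implementation computes Z(x) with the forward algorithm.
            z = sum(score(ys) for ys in product(labels, repeat=len(x)))
            return score(tuple(y)) / z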

  • MEMM vs CRF

    Similarities
    – Both employ the maximum entropy principle
    – Both incorporate rich feature information

    Differences
    – Conditional random fields are always globally conditioned on X, resulting in a globally optimized model

  • Performance in POS Tagging

    POS Tagging
    – Data set: WSJ
    – Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)

    Results (Lafferty et al. 2001)
    – 1st order MEMM: 95.19% accuracy, 73.01% OOV accuracy
    – Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy

  • Comparison of the three approaches to POS Tagging

    Results (Lafferty et al. 2001)

      Model                       Accuracy   OOV accuracy
      1st order HMM               94.31%     54.01%
      1st order MEMM              95.19%     73.01%
      Conditional random fields   95.73%     76.24%

  • References

    A. Berger, S. Della Pietra, and V. Della Pietra (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), 39-71.

    J. Lafferty, A. McCallum, and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML-2001, 282-289.
