Indian Institute of Technology, Bombay
TRANSCRIPT

Graphical models for part of speech tagging
IIT Bombay and IBM India Research Lab
December 2005
Different Models for POS tagging
– HMM
– Maximum Entropy Markov Models
– Conditional Random Fields
-
POS tagging: A Sequence Labeling Problem
Input and Output
– Input sequence x = x1 x2 … xn
– Output sequence y = y1 y2 … ym
Labels of the input sequence
Semantic representation of the input
Other Applications
– Automatic speech recognition
– Text processing, e.g., tagging, named entity recognition, summarization by exploiting the layout structure of text, etc.
-
Hidden Markov Models
Doubly stochastic models
Efficient dynamic programming algorithms exist for
– Finding Pr(S)
– The highest-probability path P that maximizes Pr(S, P) (Viterbi)
– Training the model (Baum-Welch algorithm)
[Figure: a four-state HMM (S1–S4) with transition probabilities (0.9, 0.8, 0.5, 0.5, 0.2, 0.1) and per-state emission distributions over the symbols A and C: (0.6, 0.4), (0.3, 0.7), (0.5, 0.5), (0.9, 0.1).]
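The Viterbi decoding step mentioned above can be sketched directly; the two-state model and all probabilities below are toy assumptions for illustration, not the figure's exact parameters.

```python
# Viterbi decoding for a small HMM: a minimal sketch, assuming a toy
# two-state model over symbols A/C (all numbers are illustrative).

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for an observation sequence."""
    # best[k][s]: probability of the best path ending in state s at step k
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for k in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[k - 1][r] * trans_p[r][s] * emit_p[s][obs[k]], r)
                for r in states
            )
            best[k][s] = prob
            back[k][s] = prev
    # trace the highest-probability path backwards
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for k in range(len(obs) - 1, 0, -1):
        path.append(back[k][path[-1]])
    return list(reversed(path)), best[-1][last]

states = ["S1", "S2"]
start = {"S1": 0.5, "S2": 0.5}
trans = {"S1": {"S1": 0.9, "S2": 0.1}, "S2": {"S1": 0.5, "S2": 0.5}}
emit = {"S1": {"A": 0.6, "C": 0.4}, "S2": {"A": 0.3, "C": 0.7}}

path, p = viterbi("ACA", states, start, trans, emit)
```

The same dynamic-programming table, with max replaced by sum, computes Pr(S) (the forward algorithm).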
-
Hidden Markov Model (HMM): Generative Modeling
Source model P(Y), e.g., a 1st-order Markov chain:
P(y) = \prod_i P(y_i | y_{i-1})
Noisy channel P(X|Y):
P(x|y) = \prod_i P(x_i | y_i)
Parameter estimation: maximize the joint likelihood of training examples
\sum_{(x,y) \in T} \log_2 P(x, y)
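The two factored products above can be evaluated directly for one (sentence, tag sequence) pair; the tiny transition and emission tables below are assumptions for illustration.

```python
# log2 P(x, y) under a first-order HMM:
# log2 P(y) + log2 P(x|y) = sum log2 P(y_i|y_{i-1}) + sum log2 P(x_i|y_i).
# The probability tables are toy assumptions.
import math

trans = {("<s>", "DT"): 0.7, ("DT", "NN"): 0.9}
emit = {("DT", "the"): 0.6, ("NN", "dog"): 0.2}

def joint_log2_prob(words, tags):
    lp = 0.0
    prev = "<s>"  # assumed start symbol for the first transition
    for w, t in zip(words, tags):
        lp += math.log2(trans[(prev, t)]) + math.log2(emit[(t, w)])
        prev = t
    return lp

lp = joint_log2_prob(["the", "dog"], ["DT", "NN"])
```

Summing this quantity over the training set T gives exactly the joint likelihood being maximized on the slide.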
-
Dependency (1st order)
[Figure: graphical model of a first-order HMM. Hidden states Y_{k-2}, Y_{k-1}, Y_k, Y_{k+1} are chained by transition probabilities P(Y_{k-1} | Y_{k-2}), P(Y_k | Y_{k-1}), P(Y_{k+1} | Y_k); each state Y_k emits its observation X_k with probability P(X_k | Y_k).]
-
Different Models for POS tagging
– HMM
– Maximum Entropy Markov Models
– Conditional Random Fields
-
Disadvantage of HMMs (1)
No Rich Feature Information
– Rich information is required
when xk is complex
when data for xk is sparse
Example: POS Tagging
– How to evaluate P(wk|tk) for unknown words wk?
– Useful features
Suffix, e.g., -ed, -tion, -ing, etc.
Capitalization
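Spelling features of this kind are simple to extract; a minimal sketch, where the feature names are illustrative assumptions:

```python
# Spelling features for unknown words, as suggested above: suffixes
# (-ed, -tion, -ing) and capitalization. Feature names are assumptions.

def word_features(word):
    feats = []
    for suffix in ("ed", "tion", "ing"):
        if word.lower().endswith(suffix):
            feats.append("suffix=" + suffix)
    if word[:1].isupper():
        feats.append("capitalized")
    return feats
```

An HMM has no natural place for such overlapping features of the observation, which is what motivates the discriminative models that follow.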
-
Disadvantage of HMMs (2)
Generative Model
– Parameter estimation: maximize the joint likelihood of training examples
\sum_{(x,y) \in T} \log_2 P(X = x, Y = y)
Better Approach
– Discriminative model, which models P(y|x) directly
– Maximize the conditional likelihood of training examples
\sum_{(x,y) \in T} \log_2 P(Y = y | X = x)
-
Maximum Entropy Markov Model
Discriminative Sub-Models
– Unify the two parameters of the generative model into one conditional model
Two parameters in the generative model: the source-model parameter P(y_k | y_{k-1}) and the noisy-channel parameter P(x_k | y_k)
Unified conditional model: P(y_k | x_k, y_{k-1})
– Employ the maximum entropy principle
Maximum Entropy Markov Model:
P(y|x) = \prod_i P(y_i | y_{i-1}, x_i)
-
General Maximum Entropy Model
Model
– Model the distribution P(Y|X) with a set of features {f1, f2, …, fl} defined on X and Y
Idea
– Collect information about the features from the training data
– Assume nothing about the distribution P(Y|X) other than the collected information
– Maximize the entropy as the selection criterion
-
Features
Features
– 0-1 indicator functions
1 if (x, y) satisfies a predefined condition
0 if not
Example: POS Tagging
f1(x, y) = 1 if x ends with -tion and y is NN, 0 otherwise
f2(x, y) = 1 if x starts with a capital letter and y is NNP, 0 otherwise
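The two indicator features on this slide translate directly to code:

```python
# The two 0-1 indicator features from the POS-tagging example.

def f1(x, y):
    # 1 if x ends with -tion and y is NN, 0 otherwise
    return 1 if x.endswith("tion") and y == "NN" else 0

def f2(x, y):
    # 1 if x starts with a capital letter and y is NNP, 0 otherwise
    return 1 if x[:1].isupper() and y == "NNP" else 0
```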
-
Constraints
Empirical Information
– Statistics from training data T:
\hat{P}(f_i) = \frac{1}{|T|} \sum_{(x,y) \in T} f_i(x, y)
Expected Value
– From the distribution P(Y|X) we want to model:
P(f_i) = \frac{1}{|T|} \sum_{(x,y) \in T} \sum_{y' \in D(Y)} P(Y = y' | X = x) f_i(x, y')
Constraints:
\hat{P}(f_i) = P(f_i)
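The two counts being equated can be computed mechanically; a minimal sketch, assuming a tiny toy training set and a placeholder uniform model for P(Y|X):

```python
# Empirical vs model-expected feature counts, as in the constraint
# P̂(f_i) = P(f_i). Training set, label set, and the uniform stand-in
# for P(Y|X) are all toy assumptions.

def f(x, y):  # one indicator feature
    return 1 if x.endswith("tion") and y == "NN" else 0

T = [("station", "NN"), ("run", "VB"), ("nation", "NN")]
labels = ["NN", "VB"]

def empirical(feature, data):
    # (1/|T|) * sum over training pairs of f(x, y)
    return sum(feature(x, y) for x, y in data) / len(data)

def expected(feature, data, cond_prob):
    # (1/|T|) * sum over x in T of sum over y' of P(y'|x) * f(x, y')
    return sum(
        cond_prob(y, x) * feature(x, y) for x, _ in data for y in labels
    ) / len(data)

uniform = lambda y, x: 1.0 / len(labels)  # placeholder model P(y|x)
```

Training drives the model's expected count toward the empirical one; here the untrained uniform model under-counts the feature.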
-
Maximum Entropy: Objective
Entropy
I = -\frac{1}{|T|} \sum_{(x,y) \in T} \sum_{y'} P(Y = y' | X = x) \log_2 P(Y = y' | X = x)
  = -\sum_x \hat{P}(x) \sum_y P(Y = y | X = x) \log_2 P(Y = y | X = x)
Maximization Problem
\max_{P(Y|X)} I \quad s.t. \quad \hat{P}(f) = P(f)
-
Dual Problem
Dual Problem
– Conditional model:
P(Y = y | X = x) \propto \exp\left( \sum_{i=1}^{l} \lambda_i f_i(x, y) \right)
– Maximum likelihood of conditional data:
\max_{\lambda_1, \ldots, \lambda_l} \sum_{(x,y) \in T} \log_2 P(Y = y | X = x)
Solution
– Improved iterative scaling (IIS) (Berger et al. 1996)
– Generalized iterative scaling (GIS) (McCallum et al. 2000)
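Once the weights are fixed, the dual form is just a normalized exponential over the label set; in this sketch the labels, features, and λ values are toy assumptions (in practice the λ come from IIS/GIS training):

```python
# P(y|x) ∝ exp(sum_i lambda_i * f_i(x, y)), normalized over the label set.
# Labels, features, and weights below are illustrative assumptions.
import math

labels = ["NN", "NNP", "VB"]
features = [
    lambda x, y: 1 if x.endswith("tion") and y == "NN" else 0,
    lambda x, y: 1 if x[:1].isupper() and y == "NNP" else 0,
]
lam = [1.5, 2.0]  # assumed trained weights

def cond_prob(y, x):
    score = lambda yy: math.exp(sum(l * f(x, yy) for l, f in zip(lam, features)))
    z = sum(score(yy) for yy in labels)  # normalizer, computed per x
    return score(y) / z

p = cond_prob("NN", "station")
```

Note the normalizer z is computed separately for every x; in the MEMM this per-state normalization is exactly what causes the label bias problem discussed later.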
-
Maximum Entropy Markov Model
Use the Maximum Entropy Approach to Model (1st order)
P(Y_k = y_k | X_k = x_k, Y_{k-1} = y_{k-1})
Features
– Basic features (like the parameters of an HMM)
Bigram (1st order) or trigram (2nd order) features, as in the source model
State-output pair features (Xk = xk, Yk = yk)
– Advantage: can incorporate other advanced features on (xk, yk)
-
HMM vs MEMM (1st order)
[Figure: HMM — Y_{k-1} → Y_k with P(Y_k | Y_{k-1}), and Y_k → X_k with P(X_k | Y_k). MEMM — Y_k depends on both Y_{k-1} and X_k through the single conditional P(Y_k | X_k, Y_{k-1}).]
-
Performance in POS Tagging
POS Tagging
– Data set: WSJ
– Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
Results (Lafferty et al. 2001)
– 1st order HMM: 94.31% accuracy, 54.01% OOV accuracy
– 1st order MEMM: 95.19% accuracy, 73.01% OOV accuracy
-
Different Models for POS tagging
– HMM
– Maximum Entropy Markov Models
– Conditional Random Fields
-
Disadvantage of MEMMs (1)
Complex Algorithm for the Maximum Entropy Solution
– Both IIS and GIS are difficult to implement
– Both require many tricks in implementation
Slow in Training
– Time-consuming when the data set is large, especially for MEMMs
-
Disadvantage of MEMMs (2)
Maximum Entropy Markov Model
– Uses a maximum entropy model as a sub-model
– Optimizes entropy over the sub-models, not over the global model
Label Bias Problem
– Conditional models with per-state normalization
– The effects of observations are weakened for states with fewer outgoing transitions
-
Label Bias Problem
Training Data (X : Y)
rib : 123
rib : 123
rib : 123
rob : 456
rob : 456
Model: two paths, 1 → 2 → 3 reading r, i, b and 4 → 5 → 6 reading r, o, b
Parameters (maximum likelihood estimates)
P(1|r) = 0.6, P(4|r) = 0.4
P(2|1, i) = P(2|1, o) = 1, P(5|4, i) = P(5|4, o) = 1
P(3|2, b) = P(6|5, b) = 1
New input: rob
P(123|rob) = P(1|r) P(2|1, o) P(3|2, b) = 0.6 × 1 × 1 = 0.6
P(456|rob) = P(4|r) P(5|4, o) P(6|5, b) = 0.4 × 1 × 1 = 0.4
The model prefers path 123 even though the input is rob: since states 1 and 4 each have a single outgoing transition, per-state normalization forces those transitions to probability 1, so the observation o cannot influence the choice of path.
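The arithmetic of this example can be checked mechanically, using the maximum-likelihood counts from the training data above (rib three times, rob twice):

```python
# Label-bias example: per-state normalized path scores on input "rob".

P_start = {"1": 3 / 5, "4": 2 / 5}  # P(1|r), P(4|r) from the counts 3:2

# Each of states 1 and 4 has a single successor, so per-state
# normalization forces those transitions to probability 1 no matter
# whether the middle observation is i or o.
p_123 = P_start["1"] * 1.0 * 1.0  # P(123|rob)
p_456 = P_start["4"] * 1.0 * 1.0  # P(456|rob)
```

The wrong path wins on "rob" purely because of the prior counts, which is the label bias phenomenon.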
-
Solution
Global Optimization
– Optimize the parameters in a global model simultaneously, not in sub-models separately
Alternatives
– Conditional random fields
– Application of the perceptron algorithm
-
Conditional Random Field (CRF) (1)
Let G = (V, E) be a graph such that Y is indexed by the vertices of G:
Y = (Y_v)_{v \in V}
Then (X, Y) is a conditional random field if, conditioned globally on X,
P(Y_v | X, Y_w, w \neq v) = P(Y_v | X, Y_w, (w, v) \in E)
-
Conditional Random Field (CRF) (2)
Exponential Model
– G = (V, E): a tree (or, more specifically, a chain) whose cliques are its edges and vertices
P(Y = y | X = x) \propto \exp\left( \sum_{e \in E, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V, k} \mu_k g_k(v, y|_v, x) \right)
– The f_k are determined by state transitions (edges); the g_k are determined by states (vertices)
Parameter Estimation
– Maximize the conditional likelihood of training examples:
\sum_{(x,y) \in T} \log_2 P(Y = y | X = x)
– IIS or GIS
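Global normalization on a tiny chain can be sketched by brute force; the edge and vertex scores below are toy assumptions standing in for the λ·f and μ·g terms, and exhaustive enumeration stands in for the forward-backward computation (so this only works for toy sizes):

```python
# Globally normalized chain model: score whole label sequences and
# normalize once over all sequences, instead of per state as in an MEMM.
# Edge/vertex scoring functions are illustrative assumptions.
import itertools
import math

labels = ["A", "B"]

def edge_score(y_prev, y, x, k):
    # toy edge feature: reward keeping the same label
    return 0.5 if y_prev == y else 0.0

def vertex_score(y, x, k):
    # toy vertex feature: reward matching the observation
    return 1.0 if y == x[k] else 0.0

def seq_score(ys, x):
    s = sum(vertex_score(ys[k], x, k) for k in range(len(x)))
    s += sum(edge_score(ys[k - 1], ys[k], x, k) for k in range(1, len(x)))
    return math.exp(s)

def cond_prob(ys, x):
    # single global normalizer over every possible label sequence
    z = sum(seq_score(list(c), x)
            for c in itertools.product(labels, repeat=len(x)))
    return seq_score(ys, x) / z

x = ["A", "A", "B"]
```

Because the normalizer is global, evidence from any position can shift probability mass between whole paths, which is how CRFs avoid the label bias problem.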
-
MEMM vs CRF
Similarities
– Both employ the maximum entropy principle
– Both incorporate rich feature information
Differences
– Conditional random fields are always globally conditioned on X, resulting in a globally optimized model
-
Performance in POS Tagging
POS Tagging
– Data set: WSJ
– Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
Results (Lafferty et al. 2001)
– 1st order MEMM: 95.19% accuracy, 73.01% OOV accuracy
– Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy
-
Comparison of the three approaches to POS Tagging
Results (Lafferty et al. 2001)
– 1st order HMM: 94.31% accuracy, 54.01% OOV accuracy
– 1st order MEMM: 95.19% accuracy, 73.01% OOV accuracy
– Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy
-
References
A. Berger, S. Della Pietra, and V. Della Pietra (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), 39-71.
J. Lafferty, A. McCallum, and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML-2001, 282-289.