Indian Institute of Technology, Bombay
TRANSCRIPT

Graphical models for part of speech tagging
IIT Bombay and IBM India Research Lab
December 2005
Different Models for POS tagging
– HMM
– Maximum Entropy Markov Models
– Conditional Random Fields
-
POS tagging: A Sequence Labeling Problem
Input and Output
– Input sequence x = x1 x2 … xn
– Output sequence y = y1 y2 … ym
Labels of the input sequence
Semantic representation of the input
Other Applications
– Automatic speech recognition
– Text processing, e.g., tagging, named entity recognition, summarization by exploiting the layout structure of text, etc.
-
Hidden Markov Models
Doubly stochastic models
Efficient dynamic programming algorithms exist for
– Finding Pr(S)
– The highest-probability path P that maximizes Pr(S, P) (Viterbi)
– Training the model (Baum-Welch algorithm)
[Figure: a four-state HMM (S1–S4) with transition probabilities (0.9, 0.8, 0.5, 0.5, 0.2, 0.1) and per-state emission distributions over the symbols A and C: (0.6, 0.4), (0.3, 0.7), (0.5, 0.5), (0.9, 0.1).]
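The Viterbi decoding step mentioned above can be sketched directly; the two-state model and all probabilities below are toy assumptions for illustration, not the figure's exact parameters.

```python
# Viterbi decoding for a small HMM: a minimal sketch, assuming a toy
# two-state model over symbols A/C (all numbers are illustrative).

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for an observation sequence."""
    # best[k][s]: probability of the best path ending in state s at step k
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for k in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[k - 1][r] * trans_p[r][s] * emit_p[s][obs[k]], r)
                for r in states
            )
            best[k][s] = prob
            back[k][s] = prev
    # trace the highest-probability path backwards
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for k in range(len(obs) - 1, 0, -1):
        path.append(back[k][path[-1]])
    return list(reversed(path)), best[-1][last]

states = ["S1", "S2"]
start = {"S1": 0.5, "S2": 0.5}
trans = {"S1": {"S1": 0.9, "S2": 0.1}, "S2": {"S1": 0.5, "S2": 0.5}}
emit = {"S1": {"A": 0.6, "C": 0.4}, "S2": {"A": 0.3, "C": 0.7}}

path, p = viterbi("ACA", states, start, trans, emit)
```

The same dynamic-programming table, with max replaced by sum, computes Pr(S) (the forward algorithm).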
-
Hidden Markov Model (HMM): Generative Modeling
Source model P(Y), e.g., a 1st-order Markov chain:
P(y) = \prod_i P(y_i | y_{i-1})
Noisy channel P(X|Y):
P(x|y) = \prod_i P(x_i | y_i)
Parameter estimation: maximize the joint likelihood of training examples
\sum_{(x,y) \in T} \log_2 P(x, y)
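The two factored products above can be evaluated directly for one (sentence, tag sequence) pair; the tiny transition and emission tables below are assumptions for illustration.

```python
# log2 P(x, y) under a first-order HMM:
# log2 P(y) + log2 P(x|y) = sum log2 P(y_i|y_{i-1}) + sum log2 P(x_i|y_i).
# The probability tables are toy assumptions.
import math

trans = {("<s>", "DT"): 0.7, ("DT", "NN"): 0.9}
emit = {("DT", "the"): 0.6, ("NN", "dog"): 0.2}

def joint_log2_prob(words, tags):
    lp = 0.0
    prev = "<s>"  # assumed start symbol for the first transition
    for w, t in zip(words, tags):
        lp += math.log2(trans[(prev, t)]) + math.log2(emit[(t, w)])
        prev = t
    return lp

lp = joint_log2_prob(["the", "dog"], ["DT", "NN"])
```

Summing this quantity over the training set T gives exactly the joint likelihood being maximized on the slide.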
-
Dependency (1st order)
[Figure: graphical model of a first-order HMM. Hidden states Y_{k-2}, Y_{k-1}, Y_k, Y_{k+1} are chained by transition probabilities P(Y_{k-1} | Y_{k-2}), P(Y_k | Y_{k-1}), P(Y_{k+1} | Y_k); each state Y_k emits its observation X_k with probability P(X_k | Y_k).]
-
Different Models for POS tagging
– HMM
– Maximum Entropy Markov Models
– Conditional Random Fields
-
Disadvantage of HMMs (1)
No Rich Feature Information
– Rich information is required
when xk is complex
when data for xk is sparse
Example: POS Tagging
– How to evaluate P(wk|tk) for unknown words wk?
– Useful features
Suffix, e.g., -ed, -tion, -ing, etc.
Capitalization
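Spelling features of this kind are simple to extract; a minimal sketch, where the feature names are illustrative assumptions:

```python
# Spelling features for unknown words, as suggested above: suffixes
# (-ed, -tion, -ing) and capitalization. Feature names are assumptions.

def word_features(word):
    feats = []
    for suffix in ("ed", "tion", "ing"):
        if word.lower().endswith(suffix):
            feats.append("suffix=" + suffix)
    if word[:1].isupper():
        feats.append("capitalized")
    return feats
```

An HMM has no natural place for such overlapping features of the observation, which is what motivates the discriminative models that follow.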
-
Disadvantage of HMMs (2)
Generative Model
– Parameter estimation: maximize the joint likelihood of training examples
\sum_{(x,y) \in T} \log_2 P(X = x, Y = y)
Better Approach
– Discriminative model, which models P(y|x) directly
– Maximize the conditional likelihood of training examples
\sum_{(x,y) \in T} \log_2 P(Y = y | X = x)
-
Maximum Entropy Markov Model
Discriminative Sub-Models
– Unify the two parameters of the generative model into one conditional model
Two parameters in the generative model: the source-model parameter P(y_k | y_{k-1}) and the noisy-channel parameter P(x_k | y_k)
Unified conditional model: P(y_k | x_k, y_{k-1})
– Employ the maximum entropy principle
Maximum Entropy Markov Model:
P(y|x) = \prod_i P(y_i | y_{i-1}, x_i)
-
General Maximum Entropy Model
Model
– Model the distribution P(Y|X) with a set of features {f1, f2, …, fl} defined on X and Y
Idea
– Collect information about the features from the training data
– Assume nothing about the distribution P(Y|X) other than the collected information
– Maximize the entropy as the selection criterion
-
Features
Features
– 0-1 indicator functions
1 if (x, y) satisfies a predefined condition
0 if not
Example: POS Tagging
f1(x, y) = 1 if x ends with -tion and y is NN, 0 otherwise
f2(x, y) = 1 if x starts with a capital letter and y is NNP, 0 otherwise
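The two indicator features on this slide translate directly to code:

```python
# The two 0-1 indicator features from the POS-tagging example.

def f1(x, y):
    # 1 if x ends with -tion and y is NN, 0 otherwise
    return 1 if x.endswith("tion") and y == "NN" else 0

def f2(x, y):
    # 1 if x starts with a capital letter and y is NNP, 0 otherwise
    return 1 if x[:1].isupper() and y == "NNP" else 0
```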
-
Constraints
Empirical Information
– Statistics from training data T:
\hat{P}(f_i) = \frac{1}{|T|} \sum_{(x,y) \in T} f_i(x, y)
Expected Value
– From the distribution P(Y|X) we want to model:
P(f_i) = \frac{1}{|T|} \sum_{(x,y) \in T} \sum_{y' \in D(Y)} P(Y = y' | X = x) f_i(x, y')
Constraints:
\hat{P}(f_i) = P(f_i)
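The two counts being equated can be computed mechanically; a minimal sketch, assuming a tiny toy training set and a placeholder uniform model for P(Y|X):

```python
# Empirical vs model-expected feature counts, as in the constraint
# P̂(f_i) = P(f_i). Training set, label set, and the uniform stand-in
# for P(Y|X) are all toy assumptions.

def f(x, y):  # one indicator feature
    return 1 if x.endswith("tion") and y == "NN" else 0

T = [("station", "NN"), ("run", "VB"), ("nation", "NN")]
labels = ["NN", "VB"]

def empirical(feature, data):
    # (1/|T|) * sum over training pairs of f(x, y)
    return sum(feature(x, y) for x, y in data) / len(data)

def expected(feature, data, cond_prob):
    # (1/|T|) * sum over x in T of sum over y' of P(y'|x) * f(x, y')
    return sum(
        cond_prob(y, x) * feature(x, y) for x, _ in data for y in labels
    ) / len(data)

uniform = lambda y, x: 1.0 / len(labels)  # placeholder model P(y|x)
```

Training drives the model's expected count toward the empirical one; here the untrained uniform model under-counts the feature.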
-
Maximum Entropy: Objective
Entropy
I = -\frac{1}{|T|} \sum_{(x,y) \in T} \sum_{y'} P(Y = y' | X = x) \log_2 P(Y = y' | X = x)
  = -\sum_x \hat{P}(x) \sum_y P(Y = y | X = x) \log_2 P(Y = y | X = x)
Maximization Problem
\max_{P(Y|X)} I \quad s.t. \quad \hat{P}(f) = P(f)
-
Dual Problem
Dual Problem
– Conditional model:
P(Y = y | X = x) \propto \exp\left( \sum_{i=1}^{l} \lambda_i f_i(x, y) \right)
– Maximum likelihood of conditional data:
\max_{\lambda_1, \ldots, \lambda_l} \sum_{(x,y) \in T} \log_2 P(Y = y | X = x)
Solution
– Improved iterative scaling (IIS) (Berger et al. 1996)
– Generalized iterative scaling (GIS) (McCallum et al. 2000)
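Once the weights are fixed, the dual form is just a normalized exponential over the label set; in this sketch the labels, features, and λ values are toy assumptions (in practice the λ come from IIS/GIS training):

```python
# P(y|x) ∝ exp(sum_i lambda_i * f_i(x, y)), normalized over the label set.
# Labels, features, and weights below are illustrative assumptions.
import math

labels = ["NN", "NNP", "VB"]
features = [
    lambda x, y: 1 if x.endswith("tion") and y == "NN" else 0,
    lambda x, y: 1 if x[:1].isupper() and y == "NNP" else 0,
]
lam = [1.5, 2.0]  # assumed trained weights

def cond_prob(y, x):
    score = lambda yy: math.exp(sum(l * f(x, yy) for l, f in zip(lam, features)))
    z = sum(score(yy) for yy in labels)  # normalizer, computed per x
    return score(y) / z

p = cond_prob("NN", "station")
```

Note the normalizer z is computed separately for every x; in the MEMM this per-state normalization is exactly what causes the label bias problem discussed later.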
-
Maximum Entropy Markov Model
Use the Maximum Entropy Approach to Model (1st order)
P(Y_k = y_k | X_k = x_k, Y_{k-1} = y_{k-1})
Features
– Basic features (like the parameters of an HMM)
Bigram (1st order) or trigram (2nd order) features, as in the source model
State-output pair features (Xk = xk, Yk = yk)
– Advantage: can incorporate other advanced features on (xk, yk)
-
HMM vs MEMM (1st order)
[Figure: HMM — Y_{k-1} → Y_k with P(Y_k | Y_{k-1}), and Y_k → X_k with P(X_k | Y_k). MEMM — Y_k depends on both Y_{k-1} and X_k through the single conditional P(Y_k | X_k, Y_{k-1}).]
-
Performance in POS Tagging
POS Tagging
– Data set: WSJ
– Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
Results (Lafferty et al. 2001)
– 1st order HMM: 94.31% accuracy, 54.01% OOV accuracy
– 1st order MEMM: 95.19% accuracy, 73.01% OOV accuracy
-
Different Models for POS tagging
– HMM
– Maximum Entropy Markov Models
– Conditional Random Fields
-
Disadvantage of MEMMs (1)
Complex Algorithm for the Maximum Entropy Solution
– Both IIS and GIS are difficult to implement
– Both require many tricks in implementation
Slow in Training
– Time-consuming when the data set is large, especially for MEMMs
-
Disadvantage of MEMMs (2)
Maximum Entropy Markov Model
– Uses a maximum entropy model as a sub-model
– Optimizes entropy over the sub-models, not over the global model
Label Bias Problem
– Conditional models with per-state normalization
– The effects of observations are weakened for states with fewer outgoing transitions
-
Label Bias Problem
Training Data (X : Y)
rib : 123
rib : 123
rib : 123
rob : 456
rob : 456
Model: two paths, 1 → 2 → 3 reading r, i, b and 4 → 5 → 6 reading r, o, b
Parameters (maximum likelihood estimates)
P(1|r) = 0.6, P(4|r) = 0.4
P(2|1, i) = P(2|1, o) = 1, P(5|4, i) = P(5|4, o) = 1
P(3|2, b) = P(6|5, b) = 1
New input: rob
P(123|rob) = P(1|r) P(2|1, o) P(3|2, b) = 0.6 × 1 × 1 = 0.6
P(456|rob) = P(4|r) P(5|4, o) P(6|5, b) = 0.4 × 1 × 1 = 0.4
The model prefers path 123 even though the input is rob: since states 1 and 4 each have a single outgoing transition, per-state normalization forces those transitions to probability 1, so the observation o cannot influence the choice of path.
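The arithmetic of this example can be checked mechanically, using the maximum-likelihood counts from the training data above (rib three times, rob twice):

```python
# Label-bias example: per-state normalized path scores on input "rob".

P_start = {"1": 3 / 5, "4": 2 / 5}  # P(1|r), P(4|r) from the counts 3:2

# Each of states 1 and 4 has a single successor, so per-state
# normalization forces those transitions to probability 1 no matter
# whether the middle observation is i or o.
p_123 = P_start["1"] * 1.0 * 1.0  # P(123|rob)
p_456 = P_start["4"] * 1.0 * 1.0  # P(456|rob)
```

The wrong path wins on "rob" purely because of the prior counts, which is the label bias phenomenon.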
-
Solution
Global Optimization
– Optimize the parameters in a global model simultaneously, not in sub-models separately
Alternatives
– Conditional random fields
– Application of the perceptron algorithm
-
Conditional Random Field (CRF) (1)
Let G = (V, E) be a graph such that Y is indexed by the vertices of G:
Y = (Y_v)_{v \in V}
Then (X, Y) is a conditional random field if, conditioned globally on X,
P(Y_v | X, Y_w, w \neq v) = P(Y_v | X, Y_w, (w, v) \in E)
-
Conditional Random Field (CRF) (2)
Exponential Model
– G = (V, E): a tree (or, more specifically, a chain) whose cliques are its edges and vertices
P(Y = y | X = x) \propto \exp\left( \sum_{e \in E, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V, k} \mu_k g_k(v, y|_v, x) \right)
– The f_k are determined by state transitions (edges); the g_k are determined by states (vertices)
Parameter Estimation
– Maximize the conditional likelihood of training examples:
\sum_{(x,y) \in T} \log_2 P(Y = y | X = x)
– IIS or GIS
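Global normalization on a tiny chain can be sketched by brute force; the edge and vertex scores below are toy assumptions standing in for the λ·f and μ·g terms, and exhaustive enumeration stands in for the forward-backward computation (so this only works for toy sizes):

```python
# Globally normalized chain model: score whole label sequences and
# normalize once over all sequences, instead of per state as in an MEMM.
# Edge/vertex scoring functions are illustrative assumptions.
import itertools
import math

labels = ["A", "B"]

def edge_score(y_prev, y, x, k):
    # toy edge feature: reward keeping the same label
    return 0.5 if y_prev == y else 0.0

def vertex_score(y, x, k):
    # toy vertex feature: reward matching the observation
    return 1.0 if y == x[k] else 0.0

def seq_score(ys, x):
    s = sum(vertex_score(ys[k], x, k) for k in range(len(x)))
    s += sum(edge_score(ys[k - 1], ys[k], x, k) for k in range(1, len(x)))
    return math.exp(s)

def cond_prob(ys, x):
    # single global normalizer over every possible label sequence
    z = sum(seq_score(list(c), x)
            for c in itertools.product(labels, repeat=len(x)))
    return seq_score(ys, x) / z

x = ["A", "A", "B"]
```

Because the normalizer is global, evidence from any position can shift probability mass between whole paths, which is how CRFs avoid the label bias problem.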
-
MEMM vs CRF
Similarities
– Both employ the maximum entropy principle
– Both incorporate rich feature information
Differences
– Conditional random fields are always globally conditioned on X, resulting in a globally optimized model
-
Performance in POS Tagging
POS Tagging
– Data set: WSJ
– Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
Results (Lafferty et al. 2001)
– 1st order MEMM: 95.19% accuracy, 73.01% OOV accuracy
– Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy
-
Comparison of the three approaches to POS Tagging
Results (Lafferty et al. 2001)
– 1st order HMM: 94.31% accuracy, 54.01% OOV accuracy
– 1st order MEMM: 95.19% accuracy, 73.01% OOV accuracy
– Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy
-
References
A. Berger, S. Della Pietra, and V. Della Pietra (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), 39-71.
J. Lafferty, A. McCallum, and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML-2001, 282-289.