CSA3202 Human Language Technology
HMMs for POS Tagging
Acknowledgement
Slides adapted from a lecture by Dan Jurafsky
Hidden Markov Model Tagging
Using an HMM to do POS tagging is a special case of Bayesian inference
Foundational work in computational linguistics:
- Bledsoe 1959: OCR
- Mosteller and Wallace 1964: authorship identification
It is also related to the “noisy channel” model that’s the basis for ASR, OCR and MT
Noisy Channel Model for Machine Translation (Brown et al. 1990)
[Diagram: noisy channel model relating the source sentence and the target sentence]
POS Tagging as Sequence Classification
Probabilistic view: given an observation sequence O = w1…wn, e.g. “Secretariat is expected to race tomorrow”, what is the best sequence of tags T that corresponds to this sequence of observations?
Consider all possible sequences of tags; out of this universe of sequences, choose the tag sequence which maximises P(T|O)
Getting to HMMs
i.e. out of all sequences of n tags t1…tn, we want the single tag sequence such that P(t1…tn|w1…wn) is highest:
t̂1…t̂n = argmax P(t1…tn|w1…wn), the maximum taken over all candidate tag sequences t1…tn
The hat ^ means “our estimate of the best one”; argmaxx f(x) means “the x such that f(x) is maximized”
Getting to HMMs
But how do we make this operational? How do we compute this value?
The intuition of Bayesian classification: use Bayes’ Rule to transform this equation into a set of other probabilities that are easier to compute
Bayes Rule
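In the notation of this tagging problem, Bayes’ Rule rewrites the quantity we want in terms of quantities we can estimate:
P(T|O) = P(O|T) P(T) / P(O)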
Recall this picture ...
[Venn diagram: events A and B as overlapping regions inside the sample space]
Using Bayes Rule
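T̂ = argmax over T of P(T|O) = argmax over T of P(O|T) P(T) / P(O) = argmax over T of P(O|T) P(T)
since P(O) is the same for every candidate tag sequence, it cannot affect the maximisation and can be dropped.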
Two Probabilities Involved:
- the probability of a tag sequence, P(T)
- the probability of a word sequence given a tag sequence, P(O|T)
Probability of tag sequence
We are seeking
p(ti, …, t1) = p(t1|0) * p(t2|t1) * p(t3|t1,t2) * … * p(ti|ti-1, …, t1)
(by the chain rule of probabilities, where 0 is the beginning of the string)
Next we approximate this product …
Probability of tag sequence by N-gram Approximation
By employing the following assumption
p(ti|ti-1, …, t1) ≈ p(ti|ti-1)
i.e. we approximate the entire left context by the previous tag. So the product is simplified to
p(t1|0) * p(t2|t1) * p(t3|t2) * … * p(ti|ti-1)
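For example, for the tag sequence DT JJ NN the bigram approximation gives
p(DT, JJ, NN) ≈ p(DT|0) * p(JJ|DT) * p(NN|JJ)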
Generalizing the Approach
The same trick works for approximating P(w1…wn|t1…tn):
P(w1…wn|t1…tn) ≈ P(w1|t1) * P(w2|t2) * … * P(wn-1|tn-1) * P(wn|tn)
as seen on the next slide
Likelihood and Prior
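Combining the two approximations gives the word likelihoods times the tag-transition prior:
t̂1…t̂n = argmax over t1…tn of the product over i of P(wi|ti) * p(ti|ti-1)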
Two Kinds of Probabilities
1. Tag transition probabilities p(ti|ti-1)
2. Word likelihood probabilities p(wi|ti)
Estimating Probabilities
Tag transition probabilities p(ti|ti-1)
Determiners are likely to precede adjectives and nouns:
That/DT flight/NN
The/DT yellow/JJ hat/NN
So we expect P(NN|DT) and P(JJ|DT) to be high
Compute p(ti|ti-1) by counting in a labeled corpus:
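The maximum likelihood estimate is the tag-bigram count normalised by the count of the preceding tag:
p(ti|ti-1) = C(ti-1, ti) / C(ti-1)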
Example
Two Kinds of Probabilities
Word likelihood probabilities p(wi|ti)
VBZ (3sg present verb) is likely to be “is”
Compute P(is|VBZ) by counting in a labeled corpus:
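Analogously, P(wi|ti) = C(ti, wi) / C(ti). A minimal counting sketch of both estimates in Python, assuming a hypothetical toy corpus of (word, tag) pairs; “<s>” stands in for the beginning-of-string marker 0 used earlier:

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Count-based estimates of transition and emission probabilities.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    "<s>" is the start-of-sentence pseudo-tag (the slides' 0 marker).
    """
    tag_counts, tag_bigrams, emissions = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<s>"
        tag_counts[prev] += 1
        for word, tag in sent:
            tag_bigrams[(prev, tag)] += 1   # C(ti-1, ti)
            emissions[(tag, word)] += 1     # C(ti, wi)
            tag_counts[tag] += 1            # C(ti)
            prev = tag
    trans = {bg: n / tag_counts[bg[0]] for bg, n in tag_bigrams.items()}
    emit = {tw: n / tag_counts[tw[0]] for tw, n in emissions.items()}
    return trans, emit

corpus = [[("the", "DT"), ("flight", "NN")],
          [("the", "DT"), ("yellow", "JJ"), ("hat", "NN")]]
trans, emit = estimate_hmm(corpus)
print(trans[("DT", "NN")])  # C(DT,NN)/C(DT) = 1/2 = 0.5
print(emit[("DT", "the")])  # C(DT,the)/C(DT) = 2/2 = 1.0
```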
Example: The Verb “race”
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How do we pick the right tag?
Disambiguating “race”
Example
P(NN|TO) = .00047
P(VB|TO) = .83
P(race|NN) = .00057
P(race|VB) = .00012
P(NR|VB) = .0027
P(NR|NN) = .0012
P(VB|TO) P(NR|VB) P(race|VB) = .00000027
P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
So we (correctly) choose the verb reading
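A quick arithmetic check of the two products, using the values given on the slide:

```python
p_vb = 0.83 * 0.0027 * 0.00012     # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057  # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"{p_vb:.2e}")  # ~2.69e-07, i.e. .00000027
print(f"{p_nn:.2e}")  # ~3.21e-10, i.e. .00000000032
```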
Hidden Markov Models
What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM)
We approach the concept of an HMM via two simpler notions:
- Weighted Finite-State Automaton
- Markov Chain
WFSA Definition
A Weighted Finite-State Automaton (WFSA) adds probabilities to the arcs
The probabilities on the arcs leaving any state must sum to one
A Markov chain is a special case of a WFSA in which the input sequence uniquely determines which states the automaton will go through
Examples follow
Markov Chain for Pronunciation Variants
Markov Chain for Weather
Probability of a Sequence
A Markov chain assigns probabilities to unambiguous sequences
What is the probability of 4 consecutive rainy days?
The sequence is rainy-rainy-rainy-rainy, i.e. the state sequence is 0-3-3-3-3 (state 0 is the start state)
P(0,3,3,3,3) = a03 a33 a33 a33 = 0.2 × (0.6)³ = 0.0432
Markov Chain: “First-order Observable Markov Model”
A set of states Q = q1, q2, …, qN; the state at time t is qt
Transition probabilities: a set of probabilities A = a01, a02, …, an1, …, ann
Each aij represents the probability of transitioning from state i to state j
The set of these is the transition probability matrix A
The current state depends only on the previous state
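Formally, this is the Markov assumption: P(qi | q1 … qi-1) = P(qi | qi-1)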
But ....
Markov chains can’t represent inherently ambiguous problems
For that we need a Hidden Markov Model
Hidden Markov Model
For Markov chains, the output symbols are the same as the states: if we see hot weather, we know we are in state hot
But in part-of-speech tagging (and other tasks) the output symbols are words while the hidden states are part-of-speech tags
So we need an extension! A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states
This means we don’t know which state we are in just from the output
Hidden Markov Models
States Q = q1, q2, …, qN
Observations O = o1, o2, …, oN; each observation is a symbol drawn from a vocabulary V = {v1, v2, …, vV}
Transition probabilities: the transition probability matrix A = {aij}
Observation likelihoods: the output probability matrix B = {bi(k)}
A special initial probability vector π
The underlying Markov chain is as before, but we need to add ...
Observation Likelihoods
Observation Likelihoods
Observation likelihoods P(O|S) link the actual observations O with the underlying states S
They are necessary whenever there is not a 1:1 correspondence between the two
Examples come from many different domains where there is uncertainty of interpretation:
- OCR
- Speech Recognition
- Spelling Correction
- POS Tagging
Decoding
Recall that we need to compute argmax over tag sequences T of P(O|T) P(T)
We could just enumerate all paths given the input and use the model to assign probabilities to each
But there are many paths, and we only need to store the most probable one
Dynamic programming: the Viterbi Algorithm
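With N possible tags and n words there are Nⁿ candidate tag sequences, so brute-force enumeration is infeasible; the Viterbi dynamic program finds the best sequence in O(N²·n) time.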
Viterbi Algorithm
Inputs:
- HMM: Q (states), A (transition probabilities), B (observation likelihoods)
- O: observed words
Output:
- The most probable state/tag sequence together with its probability
POS Transition Probabilities
POS Observation Likelihoods
Viterbi Summary
Create an array with columns corresponding to inputs and rows corresponding to states
Sweep through the array in one pass, filling the columns left to right using our transition and observation probabilities
The dynamic programming key is that we need only store the MAX-probability path to each cell (not all paths)
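A minimal sketch of this sweep in Python. The transition and emission tables are assumptions for illustration (the tag names and emission values echo the race example; the start transition and the emission of “to” are made up), not estimates from a real corpus:

```python
def viterbi(words, states, trans, emit):
    """Most probable tag sequence for `words` under a bigram HMM.

    trans[(s, t)]: probability of tag t following tag s ("<s>" = start)
    emit[(t, w)]:  probability of word w given tag t
    Missing entries count as probability 0.
    """
    # Column 0: start transition times the first word's emission.
    V = [{s: trans.get(("<s>", s), 0.0) * emit.get((s, words[0]), 0.0)
          for s in states}]
    back = [{}]  # backpointers: best predecessor of each cell
    # Sweep left to right, one column per word.
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            # Keep only the MAX-probability path into this cell.
            prev, score = max(
                ((p, V[i - 1][p] * trans.get((p, s), 0.0)) for p in states),
                key=lambda pair: pair[1])
            V[i][s] = score * emit.get((s, words[i]), 0.0)
            back[i][s] = prev
    # Trace backpointers from the best state in the last column.
    best = max(states, key=lambda s: V[-1][s])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path)), V[-1][best]

states = ["TO", "VB", "NN"]
trans = {("<s>", "TO"): 1.0, ("TO", "VB"): 0.83, ("TO", "NN"): 0.00047}
emit = {("TO", "to"): 1.0, ("VB", "race"): 0.00012, ("NN", "race"): 0.00057}
print(viterbi(["to", "race"], states, trans, emit))
# (['TO', 'VB'], ~9.96e-05) -- the verb reading wins, as in the example above
```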
Viterbi Example
POS-Tagger Evaluation
Different metrics can be used:
- Overall error rate with respect to a gold-standard test set
- Error rates on particular tags: c(gold(tok) = t and tagger(tok) = t) / c(gold(tok) = t) is the per-tag accuracy; its complement is the per-tag error rate
- Error rates on particular words
- Tag confusions ...
Evaluation
The result is compared with a manually coded “Gold Standard”
Typically accuracy reaches 96-97%
This may be compared with the result for a baseline tagger (one that uses no context)
Important: 100% is impossible even for human annotators, because of the inter-annotator agreement problem
Error Analysis
Look at a confusion matrix
See which errors are causing problems:
- Noun (NN) vs Proper Noun (NNP) vs Adjective (JJ)
- Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
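A minimal confusion-matrix sketch in Python; the gold and predicted tag lists are made-up illustrations:

```python
from collections import Counter

def confusion(gold, pred):
    """Count (gold_tag, predicted_tag) pairs; off-diagonal cells are errors."""
    return Counter(zip(gold, pred))

gold = ["NN", "VBD", "JJ", "NN"]
pred = ["NN", "VBN", "JJ", "NNP"]
cm = confusion(gold, pred)
accuracy = sum(n for (g, p), n in cm.items() if g == p) / sum(cm.values())
print(cm[("VBD", "VBN")], accuracy)  # 1 0.5 -- one VBD/VBN confusion
```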
POS Tagging Summary
- Parts of speech
- Tagsets
- Part-of-speech tagging
- Rule-based tagging
- HMM tagging
- Markov chains
- Hidden Markov Models
Next: Transformation-based tagging