Lecture 6 POS Tagging Methods
Topics: Taggers; Rule-Based Taggers; Probabilistic Taggers; Transformation-Based Taggers (Brill); Supervised Learning
Readings: Chapter 5.4-?
February 3, 2011
CSCE 771 Natural Language Processing
Overview
Last Time: Overview of POS Tags
Today: Part-of-Speech Tagging; Parts of Speech; Rule-Based taggers; Stochastic taggers; Transformational taggers
Readings: Chapter 5.4-5.?
History of Tagging
Dionysius Thrax of Alexandria (circa 100 B.C.) wrote a "techne" which summarized the linguistic knowledge of the day.
Terminology that is still used 2000 years later: syntax, diphthong, clitic, analogy.
Also included eight "parts of speech", the basis of subsequent POS descriptions of Greek, Latin, and European languages: Noun, Verb, Pronoun, Preposition, Adverb, Conjunction, Participle, Article.
History of Tagging
100 BC: Dionysius Thrax documents eight parts of speech.
1959: Harris (U Penn) builds the first tagger, as part of the TDAP parser project.
1963: Klein and Simmons, Computational Grammar Coder (CGC): small lexicon (1500 exceptional words), morphological analyzer, and context disambiguator.
1971: Green and Rubin, TAGGIT, expands CGC: more tags (87) and a bigger dictionary; achieved 77% accuracy when applied to the Brown Corpus.
1983: Marshall/Garside, CLAWS tagger: probabilistic algorithm using tag bigram probabilities.
History of Tagging (continued)
1988: Church, PARTS tagger: extended the CLAWS idea; stored P(tag | word) * P(tag | previous n tags) instead of the P(word | tag) * P(tag | previous n tags) used in HMM taggers.
1992: Kupiec, HMM tagger.
1994: Schütze and Singer, variable-length Markov models.
1994: Jelinek/Magerman, decision trees for the probabilities.
1996: Ratnaparkhi, the Maximum Entropy algorithm.
1997: Brill, unsupervised version of the TBL algorithm.
POS Tagging
Words often have more than one POS: back
  The back door = JJ
  On my back = NN
  Win the voters back = RB
  Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.
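As a quick illustration (not part of the original slides), an off-the-shelf tagger such as NLTK's default English tagger resolves this ambiguity from context; a minimal sketch, assuming NLTK and its tagger models are installed:

import nltk

# Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
for sent in ["The back door", "On my back", "Win the voters back", "Promised to back the bill"]:
    tokens = nltk.word_tokenize(sent)
    print(nltk.pos_tag(tokens))   # e.g. [('Promised', 'VBN'), ('to', 'TO'), ('back', 'VB'), ...]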
How Hard is POS Tagging? Measuring Ambiguity
Two Methods for POS Tagging
1. Rule-based tagging (ENGTWOL)
2. Stochastic: probabilistic sequence models
   HMM (Hidden Markov Model) tagging
   MEMMs (Maximum Entropy Markov Models)
Rule-Based Tagging
Start with a dictionary.
Assign all possible tags to words from the dictionary.
Write rules by hand to selectively remove tags, leaving the correct tag for each word.
Start With a Dictionary
she: PRP
promised: VBN, VBD
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB
Etc. for the ~100,000 words of English with more than one tag.
Assign Every Possible Tag
She: PRP
promised: VBN, VBD
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB
Write Rules to Eliminate Tags
Rule: Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP".
She: PRP
promised: VBD (VBN eliminated)
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB
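A minimal sketch (not from the slides) of this rule-based style in Python: the toy lexicon above assigns all possible tags, and a hand-written rule eliminates VBN when VBD is also an option and the word follows a sentence-initial PRP. The lexicon entries and the single rule are just the example from these slides.

# Toy lexicon from the slide: every word maps to all of its possible tags.
LEXICON = {
    "she": ["PRP"], "promised": ["VBN", "VBD"], "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"], "the": ["DT"], "bill": ["NN", "VB"],
}

def assign_all_tags(words):
    """Stage 1: look up every possible tag for each word."""
    return [list(LEXICON.get(w.lower(), ["NN"])) for w in words]  # default NN for unknown words

def eliminate_vbn_after_initial_prp(words, candidates):
    """Hand-written rule: drop VBN if VBD is also an option and the word follows '<start> PRP'."""
    for i in range(1, len(words)):
        prev, cur = candidates[i - 1], candidates[i]
        if i == 1 and prev == ["PRP"] and "VBN" in cur and "VBD" in cur:
            cur.remove("VBN")
    return candidates

words = "She promised to back the bill".split()
tags = eliminate_vbn_after_initial_prp(words, assign_all_tags(words))
print(list(zip(words, tags)))   # 'promised' keeps only VBD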
Stage 1 of ENGTWOL Tagging
First Stage: run words through an FST morphological analyzer to get all parts of speech.
Example: Pavlov had shown that salivation …
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG
Stage 2 of ENGTWOL Tagging
Second Stage: apply NEGATIVE constraints.
Example: the adverbial "that" rule eliminates all readings of "that" except the one in "It isn't that odd."
Given input: "that"
If
  (+1 A/ADV/QUANT)   ; if the next word is an adjective/adverb/quantifier
  (+2 SENT-LIM)      ; following which is end-of-sentence
  (NOT -1 SVOC/A)    ; and the previous word is not a verb like "consider" which
                     ; allows adjective complements, as in "I consider that odd"
Then eliminate non-ADV tags
Else eliminate ADV
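As an illustration (not from the slides), the same constraint could be written as a filter over candidate readings; a sketch under the assumption that each token is a small dict carrying its word, its candidate tag set, and (for verbs) a hypothetical "subcat" feature marking SVOC/A verbs like "consider":

def adverbial_that_rule(tokens, i):
    """Apply the adverbial 'that' constraint at position i.
    tokens[i] is a dict like {'word': 'that', 'tags': {'ADV', 'DET', 'PRON', 'CS'}}."""
    if tokens[i]["word"].lower() != "that":
        return
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    nxt2 = tokens[i + 2] if i + 2 < len(tokens) else None
    prev = tokens[i - 1] if i > 0 else None
    if (nxt and nxt["tags"] & {"JJ", "RB", "QUANT"}                     # +1 adj/adv/quantifier
            and (nxt2 is None or nxt2["word"] in {".", "!", "?"})       # +2 sentence limit
            and not (prev and "SVOC/A" in prev.get("subcat", set()))):  # -1 not a 'consider'-type verb
        tokens[i]["tags"] = {"ADV"}                                     # eliminate non-ADV readings
    else:
        tokens[i]["tags"].discard("ADV")                                # otherwise eliminate ADV

# Example: "It isn't that odd ."
sent = [{"word": w, "tags": set(t)} for w, t in
        [("It", ["PRP"]), ("is", ["VBZ"]), ("n't", ["RB"]),
         ("that", ["ADV", "DET", "PRON", "CS"]), ("odd", ["JJ"]), (".", ["."])]]
adverbial_that_rule(sent, 3)
print(sent[3]["tags"])   # {'ADV'}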
Hidden Markov Model Tagging
Using an HMM to do POS tagging is a special case of Bayesian inference.
Foundational work in computational linguistics: Bledsoe 1959 (OCR); Mosteller and Wallace 1964 (authorship identification).
It is also related to the "noisy channel" model that is the basis for ASR, OCR, and MT.
POS Tagging as Sequence Classification
We are given a sentence (an "observation" or "sequence of observations"): Secretariat is expected to race tomorrow.
What is the best sequence of tags that corresponds to this sequence of observations?
Probabilistic view: consider all possible sequences of tags; out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.
Getting to HMMs
We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn | w1…wn) is highest.
The hat ^ means "our estimate of the best one".
argmax_x f(x) means "the x such that f(x) is maximized".
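The equation behind this slide (a standard reconstruction, since the original formula image is not in the transcript) is:

\hat{t}_1^n = \operatorname{argmax}_{t_1^n} P(t_1^n \mid w_1^n)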
Getting to HMMs
This equation is guaranteed to give us the best tag sequence.
But how do we make it operational? How do we compute this value?
Intuition of Bayesian classification: use Bayes' rule to transform the equation into a set of other probabilities that are easier to compute.
Using Bayes Rule
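The formula this slide shows is the standard Bayes-rule rewrite (reconstructed here, since the equation image is not in the transcript):

\hat{t}_1^n = \operatorname{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)} = \operatorname{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)

The denominator P(w_1^n) is the same for every tag sequence, so it can be dropped.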
Likelihood and Prior
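The two simplifying assumptions this slide introduces (again a standard reconstruction of the missing formulas): each word depends only on its own tag, and each tag depends only on the previous tag (a bigram assumption):

P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})
\hat{t}_1^n \approx \operatorname{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})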
Two Kinds of Probabilities
Tag transition probabilities P(ti | ti-1): determiners are likely to precede adjectives and nouns.
  That/DT flight/NN
  The/DT yellow/JJ hat/NN
So we expect P(NN|DT) and P(JJ|DT) to be high, but P(DT|JJ) to be low.
Compute P(NN|DT) by counting in a labeled corpus:
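The counting formula the slide refers to (a standard reconstruction of the missing equation) is the maximum-likelihood estimate:

P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}, \qquad \text{e.g. } P(NN \mid DT) = \frac{C(DT, NN)}{C(DT)}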
Two Kinds of Probabilities
Word likelihood probabilities P(wi | ti): VBZ (3sg present verb) is likely to be "is".
Compute P(is|VBZ) by counting in a labeled corpus:
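And the corresponding maximum-likelihood estimate (reconstructed in the same way):

P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}, \qquad \text{e.g. } P(\text{is} \mid VBZ) = \frac{C(VBZ, \text{is})}{C(VBZ)}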
Example: The Verb “race”
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How do we pick the right tag?
Disambiguating “race”
Example
P(NN|TO) = .00047
P(VB|TO) = .83
P(race|NN) = .00057
P(race|VB) = .00012
P(NR|VB) = .0027
P(NR|NN) = .0012
P(VB|TO) P(NR|VB) P(race|VB) = .00000027
P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
So we (correctly) choose the verb reading.
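A tiny check of the arithmetic (not from the slides; it just multiplies the probabilities quoted above):

# Probabilities quoted on the slide
p_vb = 0.83 * 0.0027 * 0.00012     # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057  # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"verb reading: {p_vb:.2e}")  # ~2.7e-07
print(f"noun reading: {p_nn:.2e}")  # ~3.2e-10
print("choose:", "VB" if p_vb > p_nn else "NN")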
Hidden Markov Models
What we've described with these two kinds of probabilities is a Hidden Markov Model (HMM).
Definitions
A weighted finite-state automaton adds probabilities to the arcs: the probabilities on the arcs leaving any state must sum to one.
A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through.
Markov chains can't represent inherently ambiguous problems; they are useful for assigning probabilities to unambiguous sequences.
Markov Chain for Weather
Markov Chain for Words
Markov Chain: "First-order observable Markov Model"
A set of states Q = q1, q2, …, qN; the state at time t is qt.
Transition probabilities: a set of probabilities A = a01 a02 … an1 … ann. Each aij represents the probability of transitioning from state i to state j. The set of these is the transition probability matrix A.
The current state depends only on the previous state (the Markov assumption):
P(q_i | q_1 … q_{i-1}) = P(q_i | q_{i-1})
Markov Chain for Weather
What is the probability of 4 consecutive rainy days?
Sequence is rainy-rainy-rainy-rainy, i.e., the state sequence is 3-3-3-3.
P(3,3,3,3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)^3 = 0.0432
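A minimal sketch (not from the slides) of how such a sequence probability is computed from an initial distribution and a transition matrix; the 0.2 start probability and 0.6 self-transition for "rainy" come from the slide, while the other entries are placeholders since the original weather figure is not in the transcript:

# Toy weather Markov chain (placeholder values except where noted).
PI = {"hot": 0.5, "cold": 0.3, "rainy": 0.2}          # initial probabilities (rainy = 0.2 from the slide)
A = {                                                  # A[prev][next] = transition probability
    "hot":   {"hot": 0.6, "cold": 0.2, "rainy": 0.2},
    "cold":  {"hot": 0.3, "cold": 0.4, "rainy": 0.3},
    "rainy": {"hot": 0.2, "cold": 0.2, "rainy": 0.6},  # rainy->rainy = 0.6 from the slide
}

def sequence_probability(states):
    """P(s1,...,sn) = pi(s1) * product of A[s_{i-1}][s_i], for a first-order Markov chain."""
    p = PI[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

print(sequence_probability(["rainy"] * 4))   # 0.2 * 0.6**3 = 0.0432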
HMM for Ice Cream
You are a climatologist in the year 2799, studying global warming.
You can't find any records of the weather in Baltimore, MD for the summer of 2007.
But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer.
Our job: figure out how hot it was.
Hidden Markov Model
For Markov chains, the output symbols are the same as the states: see hot weather, we're in state hot.
But in part-of-speech tagging (and other things), the output symbols are words and the hidden states are part-of-speech tags, so we need an extension!
A Hidden Markov Model is an extension of a Markov chain in which the observed output symbols are not the same as the states.
This means we don't know which state we are in.
Hidden Markov Models
States: Q = q1, q2, …, qN
Observations: O = o1, o2, …, oN; each observation is a symbol drawn from a vocabulary V = {v1, v2, …, vV}
Transition probabilities: transition probability matrix A = {aij}, where aij = P(qt = j | qt-1 = i), 1 ≤ i, j ≤ N
Observation likelihoods: output probability matrix B = {bi(k)}, where bi(k) = P(Xt = ok | qt = i)
Special initial probability vector π, where πi = P(q1 = i), 1 ≤ i ≤ N
Eisner Task
Given: an ice cream observation sequence: 1, 2, 3, 2, 2, 2, 3, …
Produce: a weather sequence: H, C, H, H, H, C, …
HMM for Ice Cream
Transition Probabilities
Observation Likelihoods
Decoding
OK, now we have a complete model that can give us what we need. Recall that we need to get the most probable state sequence given the observations (the argmax above).
We could just enumerate all paths given the input and use the model to assign probabilities to each. Not a good idea.
Luckily, dynamic programming (last seen in Ch. 3 with minimum edit distance) helps us here.
The Viterbi Algorithm
Viterbi Example
Viterbi Summary
Create an array with columns corresponding to inputs and rows corresponding to possible states.
Sweep through the array in one pass, filling the columns left to right using our transition probabilities and observation probabilities.
The dynamic programming key is that we need only store the MAX probability path to each cell (not all paths).
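A minimal sketch of the Viterbi algorithm for the ice cream example (not from the slides; the transition and emission numbers below are illustrative placeholders, since the original HMM figure is not included in the transcript):

# Hidden states and a toy parameterization of the Eisner ice cream HMM.
# NOTE: these numbers are placeholders for illustration, not the figure's actual values.
STATES = ["H", "C"]                         # hot, cold
PI = {"H": 0.8, "C": 0.2}                   # initial probabilities
A = {"H": {"H": 0.7, "C": 0.3},             # transition probabilities A[prev][next]
     "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},         # emission probabilities B[state][num_ice_creams]
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(observations):
    """Return the most probable hidden state sequence and its probability."""
    # viterbi[t][s] = max probability of any path ending in state s after t+1 observations
    viterbi = [{s: PI[s] * B[s][observations[0]] for s in STATES}]
    backpointer = [{}]
    for t in range(1, len(observations)):
        viterbi.append({})
        backpointer.append({})
        for s in STATES:
            # best previous state for reaching s at time t
            prev_best = max(STATES, key=lambda p: viterbi[t - 1][p] * A[p][s])
            viterbi[t][s] = viterbi[t - 1][prev_best] * A[prev_best][s] * B[s][observations[t]]
            backpointer[t][s] = prev_best
    # termination: pick the best final state, then follow backpointers
    last = max(STATES, key=lambda s: viterbi[-1][s])
    best_prob = viterbi[-1][last]
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    return list(reversed(path)), best_prob

print(viterbi([3, 1, 3]))   # e.g. (['H', 'H', 'H'], ...) under these placeholder numbers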
Evaluation
So once you have your POS tagger running, how do you evaluate it?
  Overall error rate with respect to a gold-standard test set
  Error rates on particular tags
  Error rates on particular words
  Tag confusions...
Error Analysis
Look at a confusion matrix.
See what errors are causing problems:
  Noun (NN) vs. Proper Noun (NNP) vs. Adjective (JJ)
  Preterite (VBD) vs. Participle (VBN) vs. Adjective (JJ)
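A minimal sketch (not from the slides) of building such a confusion matrix from gold and predicted tag sequences:

from collections import Counter

def confusion_matrix(gold_tags, predicted_tags):
    """Count (gold, predicted) pairs; off-diagonal entries are the tag confusions."""
    return Counter(zip(gold_tags, predicted_tags))

gold = ["DT", "NN", "VBD", "JJ", "NN"]
pred = ["DT", "NN", "VBN", "JJ", "NNP"]
cm = confusion_matrix(gold, pred)
accuracy = sum(n for (g, p), n in cm.items() if g == p) / sum(cm.values())
print(accuracy)                                                        # 0.6 on this toy example
print([(pair, n) for pair, n in cm.items() if pair[0] != pair[1]])     # the confusions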
Evaluation
The result is compared with a manually coded "Gold Standard".
Typically accuracy reaches 96-97%.
This may be compared with the result for a baseline tagger (one that uses no context).
Important: 100% is impossible even for human annotators.
Summary
Parts of speech
Tagsets
Part-of-speech tagging
HMM Tagging: Markov Chains, Hidden Markov Models