TRANSCRIPT
Lecture 6: POS Tagging Methods
Topics: Taggers — Rule-Based Taggers, Probabilistic Taggers, Transformation-Based Taggers (Brill), Supervised Learning
Readings: Chapter 5.4-?
February 3, 2011
CSCE 771 Natural Language Processing
– 2 – CSCE 771 Spring 2011
Overview
Last Time: Overview of POS Tags
Today: Part of Speech Tagging — Parts of Speech, Rule-Based taggers, Stochastic taggers, Transformational taggers
Readings: Chapter 5.4-5.?
History of Tagging
Dionysius Thrax of Alexandria (circa 100 B.C.) wrote a “techne” which summarized the linguistic knowledge of the day
  Terminology that is still used 2000 years later:
  Syntax, diphthong, clitic, analogy
Also included eight “parts-of-speech”, the basis of subsequent POS descriptions of Greek, Latin and European languages:
  Noun, Verb, Pronoun, Preposition, Adverb, Conjunction, Participle, Article
History of Tagging
100 BC: Dionysius Thrax documents eight parts of speech
1959: Harris (U Penn) builds the first tagger as part of the TDAP parser project
1963: Klein and Simmons, Computational Grammar Coder (CGC): small lexicon (1500 exceptional words), morphological analyzer, and context disambiguator
1971: Green and Rubin, TAGGIT: expanded CGC with more tags (87) and a bigger dictionary; achieved 77% accuracy when applied to the Brown Corpus
1983: Marshall/Garside, CLAWS tagger: probabilistic algorithm using tag bigram probabilities
History of Tagging (continued)
1988: Church, PARTS tagger: extended the CLAWS idea; stored P(tag | word) * P(tag | previous n tags) instead of the P(word | tag) * P(tag | previous n tags) used in HMM taggers
1992: Kupiec, HMM tagger
1994: Schütze and Singer, variable-length Markov models
1994: Jelinek/Magerman, decision trees for the probabilities
1996: Ratnaparkhi, the Maximum Entropy algorithm
1997: Brill, unsupervised version of the TBL algorithm
POS Tagging
Words often have more than one POS: back
  The back door = JJ
  On my back = NN
  Win the voters back = RB
  Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.
How Hard is POS Tagging? Measuring Ambiguity
Two Methods for POS Tagging
1. Rule-based tagging (ENGTWOL)
2. Stochastic: probabilistic sequence models
   HMM (Hidden Markov Model) tagging
   MEMMs (Maximum Entropy Markov Models)
Rule-Based Tagging
Start with a dictionary
Assign all possible tags to words from the dictionary
Write rules by hand to selectively remove tags
Leaving the correct tag for each word.
Start With a Dictionary
• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
• Etc… for the ~100,000 words of English with more than 1 tag
Assign Every Possible Tag
She/PRP  promised/{VBN,VBD}  to/TO  back/{NN,RB,JJ,VB}  the/DT  bill/{NN,VB}
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”
She/PRP  promised/VBD (VBN eliminated)  to/TO  back/{NN,RB,JJ,VB}  the/DT  bill/{NN,VB}
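The dictionary-plus-elimination procedure above can be sketched in code. This is a minimal illustrative sketch, not ENGTWOL itself; the tiny dictionary and the single hand-written rule are the ones from the slides, and the encoding of the rule (position 1 after a sentence-initial PRP) is an assumption.

```python
# Minimal rule-based tagging sketch: assign all dictionary tags,
# then apply a hand-written elimination rule (illustrative only).

DICTIONARY = {
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}

def assign_tags(sentence):
    """Stage 1: look up every possible tag for each word."""
    return [DICTIONARY[w.lower()][:] for w in sentence]

def eliminate(sentence, candidates):
    """Stage 2: eliminate VBN if VBD is an option and the
    VBN|VBD word follows '<start> PRP' (the slide's rule)."""
    for i, tags in enumerate(candidates):
        if i == 1 and candidates[0] == ["PRP"] and "VBN" in tags and "VBD" in tags:
            tags.remove("VBN")
    return candidates

sentence = ["She", "promised", "to", "back", "the", "bill"]
cands = eliminate(sentence, assign_tags(sentence))
print(cands)  # [['PRP'], ['VBD'], ['TO'], ['VB', 'JJ', 'RB', 'NN'], ['DT'], ['NN', 'VB']]
```

Note that after this one rule, “back” and “bill” remain ambiguous; a real rule-based tagger needs many such rules to narrow every word to one tag.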
Stage 1 of ENGTWOL Tagging
First Stage: Run words through an FST morphological analyzer to get all parts of speech.
Example: Pavlov had shown that salivation …
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG
Stage 2 of ENGTWOL Tagging
Second Stage: Apply NEGATIVE constraints.
Example: Adverbial “that” rule
  Eliminates all readings of “that” except the one in “It isn’t that odd”
Given input: “that”
If
  (+1 A/ADV/QUANT)  ; if next word is adj/adv/quantifier
  (+2 SENT-LIM)     ; following which is E-O-S
  (NOT -1 SVOC/A)   ; and the previous word is not a verb like “consider” which
                    ; allows adjective complements, as in “I consider that odd”
Then eliminate non-ADV tags
Else eliminate ADV
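The constraint above can be rendered as a small predicate over candidate readings. This is a hedged sketch using invented data structures (a list of reading-sets per word and an end-of-sentence marker), not the actual ENGTWOL constraint formalism.

```python
# Sketch of the adverbial-"that" negative constraint
# (illustrative data structures, not the real ENGTWOL formalism).

SENT_LIM = "<EOS>"                     # assumed end-of-sentence marker
ADJ_ADV_QUANT = {"A", "ADV", "QUANT"}  # adj / adv / quantifier readings

def apply_that_rule(words, readings, i):
    """Resolve readings of 'that' at position i.
    Keep only ADV if the next word is adj/adv/quantifier followed by
    end-of-sentence, and the previous word is not an SVOC/A verb
    (e.g. 'consider', which allows 'I consider that odd')."""
    nxt = readings[i + 1] if i + 1 < len(words) else set()
    after = words[i + 2] if i + 2 < len(words) else SENT_LIM
    prev = readings[i - 1] if i > 0 else set()
    if nxt & ADJ_ADV_QUANT and after == SENT_LIM and "SVOC/A" not in prev:
        return {"ADV"}                    # eliminate non-ADV tags
    return readings[i] - {"ADV"}          # else eliminate ADV

# "It isn't that odd" -> 'that' is an intensifying adverb
words = ["it", "is", "n't", "that", "odd"]
readings = [{"PRON"}, {"V"}, {"NEG"}, {"ADV", "PRON", "DET", "CS"}, {"A"}]
print(apply_that_rule(words, readings, 3))  # {'ADV'}
```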
Hidden Markov Model Tagging
Using an HMM to do POS tagging is a special case of Bayesian inference
  Foundational work in computational linguistics:
  Bledsoe 1959: OCR
  Mosteller and Wallace 1964: authorship identification
It is also related to the “noisy channel” model that’s the basis for ASR, OCR and MT
POS Tagging as Sequence Classification
We are given a sentence (an “observation” or “sequence of observations”):
  Secretariat is expected to race tomorrow
What is the best sequence of tags that corresponds to this sequence of observations?
Probabilistic view:
  Consider all possible sequences of tags
  Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn
Getting to HMMs
We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn | w1…wn) is highest.
Hat ^ means “our estimate of the best one”
Argmax_x f(x) means “the x such that f(x) is maximized”
Getting to HMMs
This equation is guaranteed to give us the best tag sequence
But how to make it operational? How to compute this value?
Intuition of Bayesian classification:
  Use Bayes’ rule to transform this equation into a set of other probabilities that are easier to compute
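The equations on these slides did not survive the transcript. Following the chapter’s standard derivation, the quantity being maximized decomposes as:

```latex
\hat{t}_1^n
  = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)
  = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}
  = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)
  \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```

The denominator P(w1…wn) can be dropped because it is the same for every candidate tag sequence; the final approximation uses the two standard simplifying assumptions (each word depends only on its own tag, and each tag depends only on the previous tag).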
Two Kinds of Probabilities
Tag transition probabilities P(ti | ti-1)
  Determiners likely to precede adjectives and nouns:
  That/DT flight/NN
  The/DT yellow/JJ hat/NN
  So we expect P(NN|DT) and P(JJ|DT) to be high, but P(DT|JJ) to be low
Compute P(NN|DT) by counting in a labeled corpus:
  P(NN|DT) = C(DT, NN) / C(DT)
Two Kinds of Probabilities
Word likelihood probabilities P(wi | ti)
  VBZ (3sg Pres verb) likely to be “is”
Compute P(is|VBZ) by counting in a labeled corpus:
  P(is|VBZ) = C(VBZ, is) / C(VBZ)
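Both kinds of counts fall out of one pass over a tagged corpus. A minimal sketch, using a toy corpus invented for illustration:

```python
from collections import Counter

# Toy tagged corpus: one list of (word, tag) pairs per sentence
# (invented data, for illustration only).
corpus = [
    [("that", "DT"), ("flight", "NN")],
    [("the", "DT"), ("yellow", "JJ"), ("hat", "NN")],
    [("she", "PRP"), ("is", "VBZ"), ("late", "JJ")],
]

tag_bigrams = Counter()  # C(t_{i-1}, t_i)
tag_counts = Counter()   # C(t)
word_tag = Counter()     # C(t, w)

for sent in corpus:
    tags = [t for _, t in sent]
    for w, t in sent:
        tag_counts[t] += 1
        word_tag[(t, w)] += 1
    for prev, cur in zip(tags, tags[1:]):
        tag_bigrams[(prev, cur)] += 1

# Transition probability: P(NN|DT) = C(DT, NN) / C(DT)
p_nn_given_dt = tag_bigrams[("DT", "NN")] / tag_counts["DT"]
# Word likelihood: P(is|VBZ) = C(VBZ, is) / C(VBZ)
p_is_given_vbz = word_tag[("VBZ", "is")] / tag_counts["VBZ"]
print(p_nn_given_dt, p_is_given_vbz)  # 0.5 1.0
```

On this toy corpus DT is followed by NN once and by JJ once, hence P(NN|DT) = 0.5; a real corpus gives much better estimates.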
Example: The Verb “race”
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How do we pick the right tag?
Example
P(NN|TO) = .00047
P(VB|TO) = .83
P(race|NN) = .00057
P(race|VB) = .00012
P(NR|VB) = .0027
P(NR|NN) = .0012
P(VB|TO) P(NR|VB) P(race|VB) = .00000027
P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
So we (correctly) choose the verb reading.
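The two products can be checked directly with the probabilities listed above (the slide’s printed figures are rounded):

```python
# Multiply out the two candidate readings for "race" after TO.
p_vb_path = 0.83 * 0.0027 * 0.00012     # P(VB|TO) P(NR|VB) P(race|VB)
p_nn_path = 0.00047 * 0.0012 * 0.00057  # P(NN|TO) P(NR|NN) P(race|NN)
print(f"{p_vb_path:.2e}  {p_nn_path:.2e}")  # 2.69e-07  3.21e-10
assert p_vb_path > p_nn_path  # the verb reading wins by ~3 orders of magnitude
```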
Hidden Markov Models
What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM)
Definitions
A weighted finite-state automaton adds probabilities to the arcs
  The probabilities on the arcs leaving any state must sum to one
A Markov chain is a special case of a WFST in which the input sequence uniquely determines which states the automaton will go through
Markov chains can’t represent inherently ambiguous problems
  Useful for assigning probabilities to unambiguous sequences
Markov Chain: “First-order observable Markov Model”
A set of states
  Q = q1, q2, …, qN; the state at time t is qt
Transition probabilities:
  A set of probabilities A = a01 a02 … an1 … ann
  Each aij represents the probability of transitioning from state i to state j
  The set of these is the transition probability matrix A
Current state only depends on the previous state:
  P(qi | q1 … qi-1) = P(qi | qi-1)
Markov Chain for Weather
What is the probability of 4 consecutive rainy days?
Sequence is rainy-rainy-rainy-rainy
I.e., state sequence is 3-3-3-3
P(3,3,3,3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)^3 = 0.0432
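The computation generalizes to any state sequence: multiply the initial probability of the first state by the transition probability of each step. A minimal sketch, assuming the slide’s values π3 = 0.2 and a33 = 0.6:

```python
def sequence_probability(states, pi, A):
    """P(s1..sn) = pi[s1] * product of A[s_{i-1}][s_i] (first-order chain)."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

# Slide's values: state 3 = rainy, pi_3 = 0.2, a_33 = 0.6
pi = {3: 0.2}
A = {3: {3: 0.6}}
print(round(sequence_probability([3, 3, 3, 3], pi, A), 4))  # 0.0432
```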
HMM for Ice Cream
You are a climatologist in the year 2799
Studying global warming
You can’t find any records of the weather in Baltimore, MD for the summer of 2007
But you find Jason Eisner’s diary
Which lists how many ice creams Jason ate every day that summer
Our job: figure out how hot it was
Hidden Markov Model
For Markov chains, the output symbols are the same as the states
  See hot weather: we’re in state hot
But in part-of-speech tagging (and other things):
  The output symbols are words
  But the hidden states are part-of-speech tags
So we need an extension!
A Hidden Markov Model is an extension of a Markov chain in which the input symbols are not the same as the states.
This means we don’t know which state we are in.
Hidden Markov Models
States Q = q1, q2, …, qN
Observations O = o1, o2, …, oT
  Each observation is a symbol from a vocabulary V = {v1, v2, …, vV}
Transition probabilities
  Transition probability matrix A = {aij}, where aij = P(qt = j | qt-1 = i), 1 ≤ i, j ≤ N
Observation likelihoods
  Output probability matrix B = {bi(k)}, where bi(k) = P(Xt = ok | qt = i)
Special initial probability vector π
  πi = P(q1 = i), 1 ≤ i ≤ N
Eisner Task
Given: Ice Cream Observation Sequence: 1,2,3,2,2,2,3…
Produce: Weather Sequence: H,C,H,H,H,C…
Decoding
OK, now we have a complete model that can give us what we need. Recall that we need to get the most probable tag sequence given the observations.
We could just enumerate all paths given the input and use the model to assign probabilities to each.
  Not a good idea.
  Luckily dynamic programming (last seen in Ch. 3 with minimum edit distance) helps us here
Viterbi Summary
Create an array
  With columns corresponding to inputs
  Rows corresponding to possible states
Sweep through the array in one pass, filling the columns left to right using our transition probs and observation probs
Dynamic programming key is that we need only store the MAX prob path to each cell (not all paths).
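The summary above can be sketched directly. The HMM parameters here are invented for illustration (two weather states, ice-cream counts as observations, in the spirit of the Eisner task); they are not the values from the lecture.

```python
def viterbi(obs, states, pi, A, B):
    """One left-to-right pass; each cell keeps only the MAX-probability
    path into it, plus a backpointer for recovering the best path."""
    V = [{s: pi[s] * B[s][obs[0]] for s in states}]  # first column
    back = [{}]
    for o in obs[1:]:
        col, ptrs = {}, {}
        for s in states:
            prev_best = max(states, key=lambda p: V[-1][p] * A[p][s])
            col[s] = V[-1][prev_best] * A[prev_best][s] * B[s][o]
            ptrs[s] = prev_best
        V.append(col)
        back.append(ptrs)
    # Backtrace from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptrs in reversed(back[1:]):
        path.append(ptrs[path[-1]])
    return list(reversed(path))

# Invented parameters: Hot/Cold weather, observations = ice creams eaten.
states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}
print(viterbi([3, 1, 3], states, pi, A, B))  # ['H', 'H', 'H']
```

For tagging, the states would be POS tags and B the word likelihoods; the algorithm is O(N²T) rather than the O(N^T) of path enumeration.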
Evaluation
So once you have your POS tagger running, how do you evaluate it?
  Overall error rate with respect to a gold-standard test set
  Error rates on particular tags
  Error rates on particular words
  Tag confusions...
Error Analysis
Look at a confusion matrix
See what errors are causing problems:
  Noun (NN) vs Proper Noun (NNP) vs Adj (JJ)
  Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
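A confusion matrix is just a count of (gold tag, predicted tag) pairs; off-diagonal cells are the errors worth inspecting. A sketch with made-up tag sequences:

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Count (gold, predicted) tag pairs; off-diagonal cells are errors."""
    return Counter(zip(gold, predicted))

# Invented gold/predicted tags showing the NNP->NN and VBD->VBN confusions.
gold      = ["NN", "NNP", "JJ", "VBD", "VBN", "NN"]
predicted = ["NN", "NN",  "JJ", "VBN", "VBN", "NN"]
cm = confusion_matrix(gold, predicted)
errors = {pair: n for pair, n in cm.items() if pair[0] != pair[1]}
print(errors)  # {('NNP', 'NN'): 1, ('VBD', 'VBN'): 1}
```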
Evaluation
The result is compared with a manually coded “Gold Standard”
  Typically accuracy reaches 96-97%
  This may be compared with the result for a baseline tagger (one that uses no context)
Important: 100% is impossible even for human annotators.