Parts of Speech
Sudeshna Sarkar
7 Aug 2008
Why Do We Care about Parts of Speech?
•Pronunciation: Hand me the lead pipe.
•Predicting what words can be expected next: Personal pronoun (e.g., I, she) ____________
•Stemming: -s means singular for verbs, plural for nouns
•As the basis for syntactic parsing and then meaning extraction: I will lead the group into the lead smelter.
•Machine translation:
  (E) content + N → (F) contenu + N
  (E) content + Adj → (F) content + Adj or satisfait + Adj
What is a Part of Speech?
Is this a semantic distinction? For example, maybe Noun is the class of words for people, places and things. Maybe Adjective is the class of words for properties of nouns.
Consider: green book
book is a Noun
green is an Adjective
Now consider: book worm
This green is very soothing.
Here "book" modifies a noun and "green" acts as a noun itself, so the classes cannot be purely semantic; membership depends on how a word is used.
How Many Parts of Speech Are There?
A first cut at the easy distinctions:
Open classes:
•nouns, verbs, adjectives, adverbs
Closed classes: function words
•conjunctions: and, or, but
•pronouns: I, she, him
•prepositions: with, on
•determiners: the, a, an
Part of speech tagging
8 (ish) traditional parts of speech: noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
Called: parts of speech, lexical categories, word classes, morphological classes, lexical tags, POS
We’ll use POS most frequently
I’ll assume that you all know what these are
POS examples
N noun chair, bandwidth, pacing
V verb study, debate, munch
ADJ adj purple, tall, ridiculous
ADV adverb unfortunately, slowly
P preposition of, by, to
PRO pronoun I, me, mine
DET determiner the, a, that, those
Tagsets
Brown corpus tagset (87 tags):
http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html
Penn Treebank tagset (45 tags):
http://www.cs.colorado.edu/~martin/SLP/Figures/ (8.6)
C7 tagset (146 tags)
http://www.comp.lancs.ac.uk/ucrel/claws7tags.html
POS Tagging: Definition
The process of assigning a part-of-speech or lexical class marker to each word in a corpus:
WORDS: the koala put the keys on the table
TAGS:  DET  N   V   DET  N   P   DET  N
POS Tagging example
WORD    TAG
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N
POS tagging: Choosing a tagset
There are many potential distinctions we can draw between parts of speech.
To do POS tagging, we need to choose a standard set of tags to work with.
We could pick a very coarse tagset: N, V, Adj, Adv.
The more commonly used set is finer grained: the “Penn Treebank tagset”, with 45 tags, e.g.
PRP$, WRB, WP$, VBG
Even more fine-grained tagsets exist
Penn TreeBank POS Tag set
Using the UPenn tagset
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
Prepositions and subordinating conjunctions are marked IN (“although/IN I/PRP…”)
Except the preposition/complementizer “to”, which is simply marked TO.
POS Tagging
Words often have more than one POS: back
The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.
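The lookup problem can be made concrete with a toy lexicon; the entries below are invented for illustration, not taken from a real dictionary:

```python
# Toy lexicon: a dictionary lookup alone returns several candidate tags
# for an ambiguous word like "back", so the tagger must still choose
# one tag per instance. (Invented entries, not a real lexicon.)
LEXICON = {
    "back": ["JJ", "NN", "RB", "VB"],
    "the": ["DT"],
    "door": ["NN"],
}

def candidate_tags(word):
    """All possible tags for a word; empty list if the word is unknown."""
    return LEXICON.get(word.lower(), [])

print(candidate_tags("back"))   # four candidates: lookup cannot decide
print(candidate_tags("door"))   # unambiguous: only NN
```

The empty-list case for unknown words is exactly the gap the later slides address with morphology and probabilities.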
How hard is POS tagging? Measuring ambiguity
Algorithms for POS Tagging
•Ambiguity – In the Brown corpus, 11.5% of the word types are ambiguous (using 87 tags).
Worse, 40% of the tokens are ambiguous.
Algorithms for POS Tagging
Why can’t we just look them up in a dictionary?
•Words that aren’t in the dictionary
http://story.news.yahoo.com/news?tmpl=story&cid=578&ncid=578&e=1&u=/nm/20030922/ts_nm/iraq_usa_dc
•One idea: estimate P(ti | wi) for an unknown word as the probability that a random hapax legomenon in the corpus has tag ti.
Nouns are more likely than verbs, which are more likely than pronouns.
•Another idea: use morphology.
Algorithms for POS Tagging - Knowledge
•Dictionary
•Morphological rules, e.g.,
  _____-tion
  _____-ly
  capitalization
•N-gram frequencies
  to _____
  DET _____ N
  But what about rare words, e.g., smelt (two verb forms, to melt and the past tense of smell, and one noun form, a small fish)?
•Combining these
  V _____-ing: I was gracking vs. Gracking is fun.
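The morphological cues above can be sketched as a tiny tag guesser for unknown words; the suffix rules and the noun fallback are simplifying assumptions, not a trained model:

```python
def guess_tag(word, sentence_initial=False):
    """Guess a POS tag for an unknown word from its shape alone."""
    if word.endswith("tion"):
        return "NN"    # _____-tion: almost always a noun
    if word.endswith("ly"):
        return "RB"    # _____-ly: usually an adverb
    if word.endswith("ing"):
        return "VBG"   # covers "I was gracking" / "Gracking is fun"
    if word[0].isupper() and not sentence_initial:
        return "NNP"   # mid-sentence capitalization: likely a proper noun
    return "NN"        # nouns are the most common fallback

print(guess_tag("gracking"))    # VBG
print(guess_tag("abolition"))   # NN
print(guess_tag("slowly"))      # RB
```

A real tagger would combine such shape features with the n-gram context rather than apply them in isolation.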
POS Tagging - Approaches
Approaches:
•Rule-based tagging (ENGTWOL)
•Stochastic (= probabilistic) tagging: HMM (Hidden Markov Model) tagging
•Transformation-based tagging: Brill tagger

• Do we return one best answer or several answers and let later steps decide?
• How does the requisite knowledge get entered?
3 methods for POS tagging
1. Rule-based tagging
Example: the Karlsson (1995) EngCG tagger, based on the Constraint Grammar architecture and the ENGTWOL lexicon
– Basic idea:
Assign all possible tags to words (using a morphological analyzer)
Remove wrong tags according to a set of constraint rules (typically more than 1000 hand-written constraint rules, but they may also be machine-learned)
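The two steps can be sketched as follows; the three-word lexicon and the single constraint are invented stand-ins for ENGTWOL's lexicon and its 1000+ hand-written rules:

```python
# Step 1: assign every possible tag from a (toy) lexicon.
# Step 2: prune tags that violate a contextual constraint.
LEXICON = {"the": {"DT"}, "race": {"NN", "VB"}, "ended": {"VBD"}}

def constraint_no_verb_after_det(tags, i, all_tags):
    """Invented rule: remove VB if the previous word is unambiguously DT."""
    if i > 0 and all_tags[i - 1] == {"DT"} and len(tags) > 1:
        tags.discard("VB")
    return tags

def tag(words):
    possible = [set(LEXICON[w]) for w in words]   # step 1: all candidates
    for i in range(len(words)):                   # step 2: apply constraints
        possible[i] = constraint_no_verb_after_det(possible[i], i, possible)
    return possible

print(tag(["the", "race", "ended"]))  # "race" reduced to {'NN'}
```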
3 methods for POS tagging
2. Transformation-based tagging
Example: the Brill (1995) tagger, a combination of the rule-based and stochastic (probabilistic) tagging methodologies
– Basic idea:
Start with a tagged corpus + dictionary (with the most frequent tags)
Set the most probable tag for each word as a start value
Change tags according to rules of the type “if word-1 is a determiner and word is a verb then change the tag to noun”, applied in a specific order (like rule-based taggers)
Machine learning is used: the rules are automatically induced from a previously tagged training corpus (like the stochastic approach)
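The Brill-style loop can be sketched in a few lines; the initial-tag table and the single rule below are invented for illustration (a real Brill tagger induces its rules from a training corpus):

```python
# Start from each word's most frequent tag, then apply contextual
# rewrite rules in order. (Toy tables, invented for illustration.)
MOST_FREQUENT = {"the": "DT", "can": "VB", "rusted": "VBD"}

# Rule format: (from_tag, to_tag, required previous tag), mirroring
# "if word-1 is a determiner and word is a verb, change the tag to noun"
RULES = [("VB", "NN", "DT")]

def brill_tag(words):
    tags = [MOST_FREQUENT[w] for w in words]      # start values
    for frm, to, prev in RULES:                   # apply rules in order
        for i in range(1, len(tags)):
            if tags[i] == frm and tags[i - 1] == prev:
                tags[i] = to
    return tags

print(brill_tag(["the", "can", "rusted"]))  # ['DT', 'NN', 'VBD']
```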
3 methods for POS tagging
3. Stochastic (= probabilistic) tagging
Example: HMM (Hidden Markov Model) tagging, where a training corpus is used to compute the probability (frequency) of a given word having a given POS tag in a given context
Hidden Markov Model (HMM) Tagging
Using an HMM to do POS tagging
HMM is a special case of Bayesian inference
It is also related to the “noisy channel” model in ASR (Automatic Speech Recognition)
Hidden Markov Model (HMM) Taggers

Goal: maximize P(word|tag) × P(tag|previous n tags)

P(word|tag): word/lexical likelihood (lexical information)
  the probability that, given this tag, we have this word; NOT the probability that this word has this tag
  modeled through the language model (word–tag matrix)

P(tag|previous n tags): tag sequence likelihood (syntagmatic information)
  the probability that this tag follows these previous tags
  modeled through the language model (tag–tag matrix)
POS tagging as a sequence classification task
We are given a sentence (an “observation” or “sequence of observations”):
Secretariat is expected to race tomorrow
i.e., a sequence of n words w1…wn.
What is the best sequence of tags that corresponds to this sequence of observations?
Probabilistic/Bayesian view:
Consider all possible sequences of tags.
Out of this universe of sequences, choose the tag sequence that is most probable given the observation sequence of n words w1…wn.
Getting to HMM
Let T = t1,t2,…,tn
Let W = w1,w2,…,wn
Goal: out of all sequences of tags t1…tn, find the most probable sequence of POS tags T underlying the observed sequence of words w1,w2,…,wn:

  T̂ = argmaxT P(T|W)

The hat ^ means “our estimate of the best, i.e. most probable, tag sequence”.
argmaxx f(x) means “the x such that f(x) is maximized”; here it picks the tag sequence that maximizes our estimate P(T|W).
Getting to HMM
This equation is guaranteed to give us the best tag sequence.
But how do we make it operational? How do we compute this value?
Intuition of Bayesian classification:
Use Bayes rule to transform it into a set of other probabilities that are easier to compute.
(Thomas Bayes: British mathematician, 1702–1761)
Bayes Rule
Bayes rule breaks down any conditional probability P(x|y) into three other probabilities:

  P(x|y) = P(y|x) P(x) / P(y)

P(x|y): the conditional probability of an event x given that y has occurred
Bayes Rule
Applied to tagging: T̂ = argmaxT P(W|T) P(T) / P(W)
We can drop the denominator P(W): it does not change across tag sequences, since we are looking for the best tag sequence for the same observation, the same fixed sequence of words.
Bayes Rule

This leaves: T̂ = argmaxT P(W|T) P(T)
Likelihood and prior

  T̂ = argmaxT P(W|T) P(T)
       (likelihood × prior)
Likelihood and prior: Further Simplifications

1. The probability of a word appearing depends only on its own POS tag, i.e. it is independent of the other words around it:
   P(W|T) ≈ Πi=1..n P(wi|ti)
2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag:
   P(T) ≈ Πi=1..n P(ti|ti-1)
3. The most probable tag sequence estimated by the bigram tagger:
   T̂ = argmaxT Πi=1..n P(wi|ti) P(ti|ti-1)
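The bigram-tagger objective can be scored directly for one candidate tag sequence; the emission and transition tables below are toy numbers invented for illustration:

```python
# Score a tag sequence as the product of P(wi|ti) * P(ti|ti-1),
# with a distinguished start symbol <s>. (Toy probability tables.)
EMIT = {("the", "DT"): 0.7, ("koala", "NN"): 0.01}
TRANS = {("<s>", "DT"): 0.4, ("DT", "NN"): 0.5}

def score(words, tags):
    p = 1.0
    prev = "<s>"  # distinguished start state
    for w, t in zip(words, tags):
        p *= EMIT.get((w, t), 0.0) * TRANS.get((prev, t), 0.0)
        prev = t
    return p

print(score(["the", "koala"], ["DT", "NN"]))  # 0.7*0.4 * 0.01*0.5
```

A full tagger would compute this score for every candidate tag sequence and return the argmax, which is what Viterbi does efficiently later in the lecture.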
Likelihood and prior: Further Simplifications

1. The probability of a word appearing depends only on its own POS tag, i.e. it is independent of the other words around it:
   P(W|T) ≈ Πi=1..n P(wi|ti)

WORDS: the koala put the keys on the table
TAGS:  DET  N   V   DET  N   P   DET  N
Likelihood and prior Further Simplifications
2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag
Bigrams are groups of two written letters, two syllables, or two words; they are a special case of the N-gram.
Bigrams are used as the basis for simple statistical analysis of text
The bigram assumption is related to the first-order Markov assumption
Likelihood and prior: Further Simplifications

3. The most probable tag sequence estimated by the bigram tagger:
   T̂ = argmaxT Πi=1..n P(wi|ti) P(ti|ti-1)
   (bigram assumption)
Two kinds of probabilities (1)
Tag transition probabilities p(ti|ti-1)
Determiners are likely to precede adjectives and nouns:
– That/DT flight/NN
– The/DT yellow/JJ hat/NN
– So we expect P(NN|DT) and P(JJ|DT) to be high
– But we expect P(DT|JJ) to be low
Two kinds of probabilities (1)
Tag transition probabilities p(ti|ti-1)
Compute P(NN|DT) by counting in a labeled corpus:

  P(NN|DT) = C(DT, NN) / C(DT)

i.e. the number of times DT is followed by NN, divided by the number of times DT occurs.
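The counting estimate can be sketched over a tiny tagged corpus; the five word/tag pairs below are invented for illustration:

```python
from collections import Counter

# MLE of a tag transition probability: P(NN|DT) = C(DT, NN) / C(DT),
# counted over a (toy) labeled corpus.
corpus = [("the", "DT"), ("flight", "NN"), ("the", "DT"),
          ("yellow", "JJ"), ("hat", "NN")]

tags = [t for _, t in corpus]
bigrams = Counter(zip(tags, tags[1:]))
unigrams = Counter(tags[:-1])  # count only positions that have a successor

def p_trans(t2, t1):
    """P(t2 | t1) estimated by relative frequency."""
    return bigrams[(t1, t2)] / unigrams[t1]

print(p_trans("NN", "DT"))  # 1 of the 2 DTs is followed by NN -> 0.5
```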
Two kinds of probabilities (2)
Word likelihood probabilities p(wi|ti)
P(is|VBZ) = the probability of a VBZ (3sg present verb) being “is”
Compute P(is|VBZ) by counting in a labeled corpus:

  P(is|VBZ) = C(VBZ, “is”) / C(VBZ)

If we were expecting a third person singular verb, how likely is it that this verb would be “is”?
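The same counting estimate works for word likelihoods; the tagged pairs below are invented for illustration:

```python
from collections import Counter

# MLE of a word likelihood: P(is|VBZ) = C(VBZ, "is") / C(VBZ),
# counted over a (toy) labeled corpus.
corpus = [("is", "VBZ"), ("walks", "VBZ"), ("is", "VBZ"), ("dog", "NN")]

tag_counts = Counter(t for _, t in corpus)
pair_counts = Counter(corpus)

def p_emit(word, tag):
    """P(word | tag) estimated by relative frequency."""
    return pair_counts[(word, tag)] / tag_counts[tag]

print(p_emit("is", "VBZ"))  # 2 of the 3 VBZ tokens are "is"
```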
An Example: the verb “race”
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How do we pick the right tag?
Disambiguating “race”
Disambiguating “race”
P(NN|TO) = .00047    P(VB|TO) = .83
The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: “How likely are we to expect a noun/verb given the previous tag TO?”

P(race|NN) = .00057    P(race|VB) = .00012
Lexical likelihoods from the Brown corpus for “race”, given the POS tag NN or VB.

P(NR|VB) = .0027    P(NR|NN) = .0012
Tag sequence probabilities for the likelihood of an adverb (NR) occurring given the previous tag verb or noun.

P(VB|TO) P(NR|VB) P(race|VB) = .00000027
P(NN|TO) P(NR|NN) P(race|NN) = .00000000032

Multiply the lexical likelihoods by the tag sequence probabilities: the verb wins.
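The arithmetic above, reproduced in code (probability figures as given on the slide):

```python
# Compare the two hypotheses for "race" after TO by multiplying
# tag-transition, following-tag, and lexical probabilities.
p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)

print(p_vb, p_nn)
print("verb wins" if p_vb > p_nn else "noun wins")  # prints "verb wins"
```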
Hidden Markov Models
What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM)
Let’s just spend a bit of time tying this into the model
In order to define HMM, we will first introduce the Markov Chain, or observable Markov Model.
Definitions
A weighted finite-state automaton adds probabilities to the arcs.
The probabilities on the arcs leaving any state must sum to one.
A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through.
Markov chains can’t represent inherently ambiguous problems; they are useful for assigning probabilities to unambiguous sequences.
Markov chain = “First-order observable Markov Model”

A set of states:
  Q = q1, q2, …, qN; the state at time t is qt

A set of transition probabilities:
  A = a01, a02, …, an1, …, ann
  Each aij represents the probability of transitioning from state i to state j.
  The set of these is the transition probability matrix A:

  aij = P(qt = j | qt-1 = i),  1 ≤ i, j ≤ N
  Σj=1..N aij = 1,  1 ≤ i ≤ N

Distinguished start and end states.

A special initial probability vector π:
  πi is the probability that the MM will start in state i; each πi expresses the probability p(qi|START).
Markov chain = “First-order observable Markov Model”

Markov Chain for weather: Example 1
Three types of weather: sunny, rainy, foggy.
We want to find conditional probabilities of the form:
  P(qn|qn-1, qn-2, …, q1)
i.e., the probability of the unknown weather on day n, given the (known) weather of the preceding days.
We could infer this probability from the relative frequency (the statistics) of past observations of weather sequences.
Problem: the larger n is, the more observations we must collect.
Suppose that n = 6; then we have to collect statistics for 3^(6-1) = 3^5 = 243 past histories.
Markov chain = “First-order observable Markov Model”

Therefore, we make a simplifying assumption, called the (first-order) Markov assumption:
for a sequence of observations q1, …, qn, the current state depends only on the previous state:
  P(qn|qn-1, …, q1) ≈ P(qn|qn-1)
The joint probability of the past and current observations is then:
  P(q1, …, qn) = Πi=1..n P(qi|qi-1)
Markov chain = “First-order observable Markov Model”
Markov chain = “First-order observable Markov Model”

Given that today the weather is sunny, what’s the probability that tomorrow is sunny and the day after is rainy?
Using the Markov assumption and the probabilities in table 1, this translates into:
  P(q2 = sunny, q3 = rainy | q1 = sunny)
    = P(q2 = sunny | q1 = sunny) × P(q3 = rainy | q2 = sunny)
The weather figure: specific example
Markov Chain for weather: Example 2
Markov chain for weather
What is the probability of 4 consecutive rainy days?
The sequence is rainy-rainy-rainy-rainy, i.e., the state sequence is 3-3-3-3:
  P(3, 3, 3, 3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)^3 = 0.0432
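The same computation in code, using the figures from the slide (initial probability 0.2 for the rainy state, self-transition 0.6):

```python
# Probability of starting in the rainy state and staying there
# for three more days: pi_3 * a33 * a33 * a33.
pi_rainy = 0.2
a_rr = 0.6

p = pi_rainy * a_rr ** 3
print(p)  # ~0.0432
```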
Hidden Markov Model
For Markov chains, the output symbols are the same as the states:
if we see sunny weather, we’re in state sunny.
But in part-of-speech tagging (and other tasks):
the output symbols are words,
while the hidden states are part-of-speech tags.
So we need an extension!
A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states.
This means we don’t know which state we are in.
Markov chain for weather
Markov chain for words
Observed events: words
Hidden events: tags
Hidden Markov Models

States Q = q1, q2, …, qN
Observations O = o1, o2, …, oN
  Each observation is a symbol drawn from a vocabulary V = {v1, v2, …, vV}.
Transition probabilities (prior):
  transition probability matrix A = {aij}
Observation likelihoods (likelihood):
  output probability matrix B = {bi(ot)}, a set of observation likelihoods, each expressing the probability of an observation ot being generated from a state i (the emission probabilities)
A special initial probability vector π:
  πi is the probability that the HMM will start in state i; each πi expresses the probability p(qi|START).
Assumptions
Markov assumption: the probability of a particular state depends only on the previous state:
  P(qi | q1 … qi-1) = P(qi | qi-1)
Output-independence assumption: the probability of an output observation depends only on the state that produced that observation:
  P(oi | q1 … qi, o1 … oi-1) = P(oi | qi)
HMM for Ice Cream
You are a climatologist in the year 2799, studying global warming.
You can’t find any records of the weather in Boston, MA for the summer of 2007.
But you find Jason Eisner’s diary, which lists how many ice creams Jason ate every day that summer.
Our job: figure out how hot it was.
The task
Given an ice cream observation sequence: 1, 2, 3, 2, 2, 2, 3, …
(cf. the output symbols)
Produce a weather sequence: C, C, H, C, C, C, H, …
(cf. the hidden, causing states)
HMM for ice cream
Different types of HMM structure
Bakis = left-to-right
Ergodic = fully-connected
HMM Taggers
Two kinds of probabilities:
A: transition probabilities (PRIOR)
B: observation likelihoods (LIKELIHOOD)
HMM taggers choose the tag sequence that maximizes the product of word likelihood and tag sequence probability.
Weighted FSM corresponding to hidden states of HMM, showing A probs
B observation likelihoods for POS HMM
![Page 62: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/62.jpg)
62
The A matrix for the POS HMM
![Page 63: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/63.jpg)
63
The B matrix for the POS HMM
![Page 64: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/64.jpg)
64
HMM Taggers
The probabilities are trained on hand-labeled training corpora (training set)
Combine different N-gram levels
Evaluated by comparing their output from a test set to human labels for that test set (Gold Standard)
![Page 65: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/65.jpg)
65
The Viterbi Algorithm
What is the best tag sequence for "John likes to fish in the sea"?
The Viterbi algorithm efficiently computes the most likely state sequence given a particular output sequence, based on dynamic programming.
![Page 66: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/66.jpg)
66
A smaller example
What is the best sequence of states for the input string "bbba"?
Computing all possible paths and finding the one with the max probability is exponential.
[Figure: an HMM with states q and r between start and end. Reconstructed from the products in the matrix below: start -> q with probability 1.0; transitions q->q 0.3, q->r 0.7, r->q 0.5, r->r 0.5; emissions P(b|q)=0.6, P(a|q)=0.4, P(b|r)=0.8, P(a|r)=0.2.]
![Page 67: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/67.jpg)
67
A smaller example (con’t)
For each state, store the most likely sequence that could lead to it (and its probability)
Path probability matrix: an array of states versus time (tags versus words) that stores the probability of being in each state at each time, in terms of the probabilities for the states at the preceding time.
Best sequence / input sequence over time:

                  ε -> b          b -> b             bb -> b             bbb -> a
leading to q
  coming from q   ε->q 0.6        q->q 0.108         qq->q 0.01944       qrq->q 0.012096
                  (1.0x0.6)       (0.6x0.3x0.6)      (0.108x0.3x0.6)     (0.1008x0.3x0.4)
  coming from r                   r->q 0             qr->q 0.1008        qrr->q 0.02688
                                  (0x0.5x0.6)        (0.336x0.5x0.6)     (0.1344x0.5x0.4)
leading to r
  coming from q   ε->r 0          q->r 0.336         qq->r 0.06048       qrq->r 0.014112
                  (0x0.8)         (0.6x0.7x0.8)      (0.108x0.7x0.8)     (0.1008x0.7x0.2)
  coming from r                   r->r 0             qr->r 0.1344        qrr->r 0.01344
                                  (0x0.5x0.8)        (0.336x0.5x0.8)     (0.1344x0.5x0.2)
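The matrix can be checked mechanically; the sketch below uses the transition and emission values inferred from the products shown (start->q = 1.0; q->q 0.3, q->r 0.7, r->q 0.5, r->r 0.5; P(b|q)=0.6, P(a|q)=0.4, P(b|r)=0.8, P(a|r)=0.2):

```python
# HMM parameters reconstructed from the path-probability matrix above.
start = {"q": 1.0, "r": 0.0}
trans = {("q", "q"): 0.3, ("q", "r"): 0.7, ("r", "q"): 0.5, ("r", "r"): 0.5}
emit = {("q", "b"): 0.6, ("q", "a"): 0.4, ("r", "b"): 0.8, ("r", "a"): 0.2}

def viterbi(obs, states=("q", "r")):
    """Return (probability, state path) of the best path for the input string."""
    v = {s: start[s] * emit[(s, obs[0])] for s in states}
    path = {s: [s] for s in states}
    for o in obs[1:]:
        nv, npath = {}, {}
        for s in states:
            # best predecessor for state s at this time step
            prev = max(states, key=lambda p: v[p] * trans[(p, s)])
            nv[s] = v[prev] * trans[(prev, s)] * emit[(s, o)]
            npath[s] = path[prev] + [s]
        v, path = nv, npath
    best = max(states, key=lambda s: v[s])
    return v[best], path[best]

prob, best_path = viterbi("bbba")
print(best_path, prob)  # the winning path is q,r,r,q with probability ~0.02688
```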
![Page 68: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/68.jpg)
68
Viterbi intuition: we are looking for the best ‘path’
[Figure: a trellis over the sentence "promised to back the bill", columns S1…S5, with the candidate tags for each word (promised: VBD/VBN; to: TO; back: VB/JJ/NN/RB; the: DT; bill: NNP/VB/NN). The Viterbi search picks the best-scoring path through this lattice. Slide from Dekang Lin.]
![Page 69: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/69.jpg)
69
The Viterbi Algorithm
![Page 70: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/70.jpg)
70
Intuition
The value in each cell is computed by taking the MAX over all paths that lead to this cell. An extension of a path from state i at time t-1 is computed by multiplying:
– the previous path probability from the previous cell, viterbi[t-1, i]
– the transition probability a_ij from previous state i to current state j
– the observation likelihood b_j(o_t) that current state j matches observation symbol o_t
![Page 71: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/71.jpg)
71
Viterbi example
![Page 72: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/72.jpg)
72
Smoothing of probabilities
Data sparseness is a problem when estimating probabilities from corpus data.
The "add one" smoothing technique:

P(w_{1,n}) = (C(w_{1,n}) + 1) / (N + B)

C: absolute frequency; N: number of training instances; B: number of distinct types

Linear interpolation methods can compensate for data sparseness in higher-order models. A common method is interpolating trigrams, bigrams and unigrams:

P(t_i | t_{i-1}, t_{i-2}) = λ1 P(t_i) + λ2 P(t_i | t_{i-1}) + λ3 P(t_i | t_{i-1}, t_{i-2}),  with 0 ≤ λ_i ≤ 1 and λ1 + λ2 + λ3 = 1

The lambda values are automatically determined using a variant of the Expectation Maximization algorithm.
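The interpolation formula above can be sketched as follows; the tag stream and lambda weights are hypothetical (in practice the lambdas come from EM on held-out data):

```python
from collections import Counter

# Hypothetical tag stream and fixed lambda weights for illustration.
tags = "DT NN VB DT NN NN VB DT NN".split()
N = len(tags)
uni = Counter(tags)
bi = Counter(zip(tags, tags[1:]))
tri = Counter(zip(tags, tags[1:], tags[2:]))
l1, l2, l3 = 0.2, 0.3, 0.5   # must sum to 1

def p_interp(t, t_prev, t_prev2):
    """lambda1 P(t) + lambda2 P(t | t_prev) + lambda3 P(t | t_prev2, t_prev)."""
    p1 = uni[t] / N
    p2 = bi[(t_prev, t)] / uni[t_prev] if uni[t_prev] else 0.0
    p3 = (tri[(t_prev2, t_prev, t)] / bi[(t_prev2, t_prev)]
          if bi[(t_prev2, t_prev)] else 0.0)
    return l1 * p1 + l2 * p2 + l3 * p3

print(p_interp("NN", "DT", "VB"))
```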
![Page 73: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/73.jpg)
75
Possible improvements

in bigram POS tagging, we condition a tag only on the preceding tag
why not...
– use more context (e.g., a trigram model)
  more precise: "is clearly marked" --> verb, past participle; "he clearly marked" --> verb, past tense
– combine trigram, bigram, unigram models
– condition on words too
but with an n-gram approach, this is too costly (too many parameters to model)
![Page 74: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/74.jpg)
76
Further issues with Markov Model tagging
Unknown words are a problem since we don't have the required probabilities. Possible solutions:
– assign probabilities based on the corpus-wide distribution of POS tags
– use morphological cues (capitalization, suffix) to make a more informed guess
Using higher-order Markov models: a trigram model captures more context, but data sparseness becomes much more of a problem.
![Page 75: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/75.jpg)
77
TnT
Efficient statistical POS tagger developed by Thorsten Brants, ANLP-2000
Underlying model:
Trigram modelling – the probability of a POS tag depends only on the two preceding tags
The probability of a word appearing at a particular position, given that its POS occurs at that position, is independent of everything else.

argmax_{t_1 … t_T} [ Π_{i=1..T} P(t_i | t_{i-1}, t_{i-2}) P(w_i | t_i) ] P(t_{T+1} | t_T)
![Page 76: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/76.jpg)
78
Training
Maximum likelihood estimates:

Unigrams:  P̂(t_3) = c(t_3) / N
Bigrams:   P̂(t_3 | t_2) = c(t_2, t_3) / c(t_2)
Trigrams:  P̂(t_3 | t_1, t_2) = c(t_1, t_2, t_3) / c(t_1, t_2)
Lexical:   P̂(w_3 | t_3) = c(w_3, t_3) / c(t_3)

Smoothing: a context-independent variant of linear interpolation:

P(t_3 | t_1, t_2) = λ1 P̂(t_3) + λ2 P̂(t_3 | t_2) + λ3 P̂(t_3 | t_1, t_2)
![Page 77: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/77.jpg)
79
Smoothing algorithm
Set λ_i = 0
For each trigram t1 t2 t3 with f(t1,t2,t3) > 0,
depending on the max of the following three values:
– case (f(t1,t2,t3) - 1) / (f(t1,t2) - 1):  increment λ3 by f(t1,t2,t3)
– case (f(t2,t3) - 1) / (f(t2) - 1):        increment λ2 by f(t1,t2,t3)
– case (f(t3) - 1) / (N - 1):               increment λ1 by f(t1,t2,t3)
Normalize the λ_i
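The recipe above (deleted interpolation, with the usual -1 in the denominators as in Brants 2000) can be sketched as follows; the tag stream passed in is hypothetical:

```python
from collections import Counter

def estimate_lambdas(tags):
    """TnT-style deleted interpolation for trigram lambda weights."""
    uni = Counter(tags)
    bi = Counter(zip(tags, tags[1:]))
    tri = Counter(zip(tags, tags[1:], tags[2:]))
    N = len(tags)
    l1 = l2 = l3 = 0.0
    for (t1, t2, t3), f in tri.items():
        # the three candidate estimates, each with the current trigram deleted
        c3 = (f - 1) / (bi[(t1, t2)] - 1) if bi[(t1, t2)] > 1 else 0.0
        c2 = (bi[(t2, t3)] - 1) / (uni[t2] - 1) if uni[t2] > 1 else 0.0
        c1 = (uni[t3] - 1) / (N - 1)
        best = max(c1, c2, c3)
        if best == c3:
            l3 += f
        elif best == c2:
            l2 += f
        else:
            l1 += f
    s = l1 + l2 + l3
    return l1 / s, l2 / s, l3 / s

print(estimate_lambdas("DT NN VB DT NN NN VB DT NN DT JJ NN".split()))
```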
![Page 78: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/78.jpg)
80
Evaluation of POS taggers
compared with gold-standard of human performance
metric: accuracy = % of tags that are identical to gold standard
most taggers ~96-97% accuracy
must compare accuracy to:
ceiling (best possible result)
– how do human annotators score compared to each other? (96-97%)
– so systems are not bad at all!
baseline (worst reasonable result)
– what if we take the most-likely tag (unigram model) regardless of previous tags? (90-91%)
– so anything less is really bad
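Both the accuracy metric and the most-likely-tag baseline are easy to sketch; the gold, predicted, and training data below are hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical gold-standard and predicted tags for a tiny test set.
gold = ["DT", "NN", "VB", "DT", "NN", "IN", "DT", "NN", "VB", "RB"]
pred = ["DT", "NN", "VB", "DT", "JJ", "IN", "DT", "NN", "NN", "RB"]
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)  # 8 of 10 tags agree with the gold standard -> 0.8

# Unigram baseline: tag every known word with its most frequent training tag.
train = [("the", "DT"), ("dog", "NN"), ("run", "NN"), ("run", "VB"), ("run", "VB")]
freq = defaultdict(Counter)
for w, t in train:
    freq[w][t] += 1
baseline_tag = {w: c.most_common(1)[0][0] for w, c in freq.items()}
print(baseline_tag["run"])  # VB (seen twice) beats NN (seen once)
```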
![Page 79: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/79.jpg)
81
More on tagger accuracy
is 95% good?
– that's 5 mistakes every 100 words
– if a sentence is 20 words on average, that's 1 mistake per sentence
when comparing tagger accuracy, beware of:
size of training corpus
– the bigger, the better the results
difference between training & testing corpora (genre, domain…)
– the closer, the better the results
size of tag set
– prediction versus classification
unknown words
– the more unknown words (not in the dictionary), the worse the results
![Page 80: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/80.jpg)
82
Error Analysis
Look at a confusion matrix (contingency table)
See which errors are causing problems, e.g. 4.4% of the total errors were caused by mistagging VBD as VBN
Common confusions:
Noun (NN) vs. Proper Noun (NNP) vs. Adj (JJ)
Adverb (RB) vs. Particle (RP) vs. Prep (IN)
Preterite (VBD) vs. Participle (VBN) vs. Adjective (JJ)
ERROR ANALYSIS IS ESSENTIAL!!!
![Page 81: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/81.jpg)
83
Tag indeterminacy
![Page 82: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/82.jpg)
84
Major difficulties in POS tagging
Unknown words (e.g. proper names)
– because we do not know the set of tags they can take
– and knowing this takes you a long way (cf. the baseline POS tagger)
possible solutions:
– assign all possible tags, with a probability distribution identical to the lexicon as a whole
– use morphological cues to infer possible tags
  e.g. words ending in -ed are likely to be past tense verbs or past participles
Frequently confused tag pairs:
preposition vs. particle
  <running> <up> a hill (preposition) / <running up> a bill (particle)
verb, past tense vs. past participle vs. adjective
![Page 83: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/83.jpg)
85
Unknown Words
Most-frequent-tag approach.
What about words that don’t appear in the training set?
Suffix analysis:
The probability distribution for a particular suffix is generated from all words in the training set that share the same suffix.
Suffix estimation – calculate the probability of a tag t given the last i letters of an n-letter word.
Smoothing: successive abstraction through sequences of increasingly more general contexts (i.e., omit more and more characters of the suffix)
Use a morphological analyzer to get the restriction on the possible tags.
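The suffix-analysis idea can be sketched as follows; the training lexicon is hypothetical, and a real tagger would smooth across suffix lengths rather than backing off to the first match:

```python
from collections import Counter

# Hypothetical training lexicon: (word, tag) pairs used to build suffix statistics.
train = [("walked", "VBD"), ("talked", "VBD"), ("wanted", "VBD"),
         ("nation", "NN"), ("station", "NN"), ("red", "JJ"),
         ("running", "VBG"), ("jumping", "VBG")]

def suffix_dist(word, max_len=3):
    """P(tag | suffix) from the longest matching suffix, backing off to shorter ones."""
    for i in range(min(max_len, len(word)), 0, -1):
        counts = Counter(t for w, t in train if w.endswith(word[-i:]))
        if counts:
            total = sum(counts.values())
            return {t: c / total for t, c in counts.items()}
    return {}

print(suffix_dist("hiking"))   # "-ing" matches running/jumping, both VBG
```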
![Page 84: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/84.jpg)
86
Unknown words
![Page 85: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/85.jpg)
87
Alternative graphical models for part of speech tagging
![Page 86: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/86.jpg)
88
Different Models for POS tagging
HMM
Maximum Entropy Markov Models
Conditional Random Fields
![Page 87: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/87.jpg)
89
Hidden Markov Model (HMM) : Generative Modeling
Source model P(Y):       P(y) = Π_i P(y_i | y_{i-1})
Noisy channel P(X | Y):  P(x | y) = Π_i P(x_i | y_i)
![Page 88: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/88.jpg)
90
Dependency (1st order)
[Figure: a chain Y_{k-2} -> Y_{k-1} -> Y_k -> Y_{k+1} with transition probabilities P(Y_k | Y_{k-1}) along the chain, and each state Y_k emitting an observation X_k with probability P(X_k | Y_k).]
![Page 89: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/89.jpg)
91
Disadvantage of HMMs (1)
No rich feature information. Rich information is required:
– when x_k is complex
– when the data for x_k is sparse
Example: POS tagging
– how to estimate P(w_k | t_k) for unknown words w_k?
– useful features: suffix (e.g., -ed, -tion, -ing, etc.), capitalization
Generative model
Parameter estimation: maximize the joint likelihood of the training examples:

Σ_{(x,y) ∈ T} log2 P(X = x, Y = y)
![Page 90: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/90.jpg)
92
Generative Models
Hidden Markov models (HMMs) and stochastic grammarsAssign a joint probability to paired observation and label sequences
The parameters are typically trained to maximize the joint likelihood of training examples
![Page 91: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/91.jpg)
93
Generative Models (cont’d)
Difficulties and disadvantages:
Need to enumerate all possible observation sequences
Not practical to represent multiple interacting features or long-range dependencies of the observations
Very strict independence assumptions on the observations
![Page 92: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/92.jpg)
94
Better approach: a discriminative model, which models P(y | x) directly
Maximize the conditional likelihood of the training examples:

Σ_{(x,y) ∈ T} log2 P(Y = y | X = x)
![Page 93: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/93.jpg)
95
Maximum Entropy modeling
N-gram model : probabilities depend on the previous few tokens.
We may identify a more heterogeneous set of features which contribute in some way to the choice of the current word. (whether it is the first word in a story, whether the next word is to, whether one of the last 5 words is a preposition, etc)
Maxent combines these features in a probabilistic model.
The given features provide a constraint on the model.
We would like to have a probability distribution which, outside of these constraints, is as uniform as possible – has the maximum entropy among all models that satisfy these constraints.
![Page 94: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/94.jpg)
96
Maximum Entropy Markov Model
Discriminative sub-models
Unify the two parameters of the generative model into one conditional model
– the generative model has two parameters: the source-model parameter P(y_k | y_{k-1}) and the noisy-channel parameter P(x_k | y_k)
– the unified conditional model: P(y_k | x_k, y_{k-1})
Employ the maximum entropy principle

P(y | x) = Π_i P(y_i | y_{i-1}, x_i)
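The product decomposition P(y | x) = Π_i P(y_i | y_{i-1}, x_i) can be sketched directly; the local conditional table below is hypothetical (in a real MEMM each local distribution is itself a maximum-entropy model):

```python
# Hypothetical local conditional probabilities P(y_i | y_{i-1}, x_i).
p_local = {
    ("<s>", "the", "DT"): 0.9, ("<s>", "the", "NN"): 0.1,
    ("DT", "dog", "NN"): 0.8, ("DT", "dog", "VB"): 0.2,
}

def memm_prob(words, tags):
    """P(y | x) as a product of per-position conditional models."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= p_local.get((prev, w, t), 0.0)
        prev = t
    return p

print(memm_prob(["the", "dog"], ["DT", "NN"]))  # 0.9 * 0.8
```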
![Page 95: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/95.jpg)
97
General Maximum Entropy Principle
Model
Model the distribution P(Y | X) with a set of features f_1, …, f_l defined on X and Y
Idea: collect information about the features from the training data
Principle:
– model what is known
– assume nothing else
Choose the flattest distribution: the distribution with the maximum entropy
![Page 96: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/96.jpg)
98
Example
(Berger et al., 1996) example: model the translation of the word "in" from English to French
– need to model P(French word | "in")
– constraints:
1: possible translations: dans, en, à, au cours de, pendant
2: "dans" or "en" is used 30% of the time
3: "dans" or "à" is used 50% of the time
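The example can be solved numerically. Under the constraints, P(dans)+P(en)=0.3, P(dans)+P(à)=0.5, and the probabilities sum to 1; by symmetry the maximum-entropy solution splits the remaining mass evenly between "au cours de" and "pendant", leaving one free parameter d = P(dans), which a simple grid search can optimize (a sketch, not the iterative scaling used in practice):

```python
import math

def entropy(ps):
    return -sum(p * math.log(p) for p in ps if p > 0)

best_d, best_h = None, -1.0
for i in range(1, 3000):
    d = i / 10000.0          # 0 < d < 0.3 keeps all probabilities positive
    # [dans, en, à, au cours de, pendant] under the three constraints
    dist = [d, 0.3 - d, 0.5 - d, (0.2 + d) / 2, (0.2 + d) / 2]
    h = entropy(dist)
    if h > best_h:
        best_d, best_h = d, h

print(round(best_d, 3))  # the maximum-entropy solution has d ~ 0.186
```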
![Page 97: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/97.jpg)
99
Features
Features: 0-1 indicator functions
– 1 if (x, y) satisfies a predefined condition
– 0 if not
Example: POS tagging

f1(x, y) = 1 if x ends with -tion and y is NN; 0 otherwise
f2(x, y) = 1 if x starts with a capital letter and y is NNP; 0 otherwise
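The two indicator features can be written out directly (here x is a word and y its candidate tag):

```python
def f1(x, y):
    """1 if x ends with -tion and y is NN, else 0."""
    return 1 if x.endswith("tion") and y == "NN" else 0

def f2(x, y):
    """1 if x starts with a capital letter and y is NNP, else 0."""
    return 1 if x[:1].isupper() and y == "NNP" else 0

print(f1("station", "NN"), f2("Sudeshna", "NNP"), f1("station", "VB"))  # 1 1 0
```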
![Page 98: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/98.jpg)
100
Constraints
Empirical information: statistics from the training data T

P̂(f_i) = (1 / |T|) Σ_{(x,y) ∈ T} f_i(x, y)

Expected value under the distribution P(Y | X) we want to model:

P(f_i) = (1 / |T|) Σ_{(x,y) ∈ T} Σ_{y' ∈ D(Y)} P(Y = y' | X = x) f_i(x, y')

Constraints:  P̂(f_i) = P(f_i)
![Page 99: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/99.jpg)
101
Maximum Entropy: Objective
Entropy:

I = -(1 / |T|) Σ_{(x,y) ∈ T} P(Y = y | X = x) log2 P(Y = y | X = x)
  = -Σ_x P̂(x) Σ_y P(Y = y | X = x) log2 P(Y = y | X = x)

Maximization problem:

max_{P(Y|X)} I   s.t.   P̂(f_i) = P(f_i)
![Page 100: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/100.jpg)
102
Dual Problem
Dual problem: a conditional model

P(Y = y | X = x) ∝ exp( Σ_{i=1..l} λ_i f_i(x, y) )

Maximum likelihood of the conditional data:

max_{λ_1,…,λ_l} Σ_{(x,y) ∈ T} log2 P(Y = y | X = x)

Solution:
– Improved iterative scaling (IIS) (Berger et al., 1996)
– Generalized iterative scaling (GIS) (McCallum et al., 2000)
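The exponential form of the conditional model can be sketched with the two POS features from earlier; the tag set and weights are hypothetical, and Z(x) is the per-input normalizer:

```python
import math

# Hypothetical tag set, weights, and indicator features.
TAGS = ["NN", "NNP", "VB"]
lambdas = [1.5, 2.0]
features = [
    lambda x, y: 1 if x.endswith("tion") and y == "NN" else 0,
    lambda x, y: 1 if x[:1].isupper() and y == "NNP" else 0,
]

def p(y, x):
    """P(y | x) = exp(sum_i lambda_i f_i(x, y)) / Z(x)."""
    def score(t):
        return math.exp(sum(l * f(x, t) for l, f in zip(lambdas, features)))
    return score(y) / sum(score(t) for t in TAGS)

probs = {t: p(t, "station") for t in TAGS}
print(probs["NN"])  # only f1 fires (for y = NN): exp(1.5) / (exp(1.5) + 1 + 1)
```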
![Page 101: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/101.jpg)
103
Maximum Entropy Markov Model
Use a maximum entropy approach to model the 1st-order conditional:

P(Y_k = y_k | X_k = x_k, Y_{k-1} = y_{k-1})

Features:
– basic features (like the parameters in an HMM): bigram (1st order) or trigram (2nd order) in the source model; state-output pair features (X_k = x_k, Y_k = y_k)
– advantage: can incorporate other advanced features of (x_k, y_k)
![Page 102: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/102.jpg)
104
HMM vs MEMM (1st order)
[Figure: both models drawn as chains over Y_{k-1}, Y_k with observation X_k. HMM: arrows Y_{k-1} -> Y_k with P(Y_k | Y_{k-1}) and Y_k -> X_k with P(X_k | Y_k). MEMM: arrows from both Y_{k-1} and X_k into Y_k, modeling P(Y_k | X_k, Y_{k-1}).]
![Page 103: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/103.jpg)
105
Performance in POS Tagging
POS tagging
Data set: WSJ
Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
Results (Lafferty et al., 2001):
– 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy
– 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy
![Page 104: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/104.jpg)
106
ME applications
Part of Speech (POS) tagging (Ratnaparkhi, 1996): P(POS tag | context)
Information sources– Word window (4)– Word features (prefix, suffix, capitalization)– Previous POS tags
![Page 105: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/105.jpg)
107
ME applications
Abbreviation expansion (Pakhomov, 2002). Information sources:
– word window (4)
– document title
Word Sense Disambiguation (WSD) (Chao & Dyer, 2002). Information sources:
– word window (4)
– structurally related words (4)
Sentence Boundary Detection (Reynar & Ratnaparkhi, 1997). Information sources:
– token features (prefix, suffix, capitalization, abbreviation)
– word window (2)
![Page 106: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/106.jpg)
108
Solution
Global optimization: optimize the parameters in a global model simultaneously, not in separate sub-models
Alternatives:
Conditional random fields
Application of perceptron algorithm
![Page 107: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/107.jpg)
109
Why ME?
Advantages: combine multiple knowledge sources
– local
  word prefix, suffix, capitalization (POS – Ratnaparkhi, 1996)
  word POS, POS class, suffix (WSD – Chao & Dyer, 2002)
  token prefix, suffix, capitalization, abbreviation (sentence boundary – Reynar & Ratnaparkhi, 1997)
– global
  N-grams (Rosenfeld, 1997)
  word window
  document title (Pakhomov, 2002)
  structurally related words (Chao & Dyer, 2002)
  sentence length, conventional lexicon (Och & Ney, 2002)
Combine dependent knowledge sources
![Page 108: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/108.jpg)
110
Why ME?
Advantages:
– add additional knowledge sources
– implicit smoothing
Disadvantages:
– computational: expected values must be recomputed at each iteration, plus the normalizing constant
– overfitting: requires feature selection (cutoffs; basic feature selection, Berger et al., 1996)
![Page 109: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/109.jpg)
111
Conditional Models
Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x)
Specify the probability of possible label sequences given an observation sequence
Allow arbitrary, non-independent features on the observation sequence X
The probability of a transition between labels may depend on past and future observations
Relax strong independence assumptions in generative models
![Page 110: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/110.jpg)
112
Discriminative models: Maximum Entropy Markov Models (MEMMs)
Exponential model. Given a training set X with label sequences Y:
– train a model θ that maximizes P(Y | X, θ)
– for a new data sequence x, the predicted label sequence y maximizes P(y | x, θ)
– notice the per-state normalization
![Page 111: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/111.jpg)
113
MEMMs (cont’d)
MEMMs have all the advantages of Conditional Models
Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states (“conservation of score mass”)
Subject to Label Bias Problem
Bias toward states with fewer outgoing transitions
![Page 112: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/112.jpg)
114
Label Bias Problem
• Consider the MEMM in the figure (states 1 and 2; observations r, i, o):
• P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
  P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
• In the training data, label value 2 is the only label value observed after label value 1, so P(2 | 1) = 1, and hence P(2 | 1 and x) = 1 for all x
• Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)
• However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro)
• Per-state normalization does not allow the required expectation
![Page 113: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/113.jpg)
115
Solve the Label Bias Problem
Change the state-transition structure of the model
– not always practical to change the set of states
Start with a fully-connected model and let the training procedure figure out a good structure
– precludes the use of prior structural knowledge, which is very valuable (e.g. in information extraction)
![Page 114: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/114.jpg)
116
Random Field
![Page 115: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/115.jpg)
117
Conditional Random Fields (CRFs)
CRFs have all the advantages of MEMMs without the label bias problem
– an MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
– a CRF has a single exponential model for the joint probability of the entire label sequence given the observation sequence
– undirected graphical model
– allows some transitions to "vote" more strongly than others, depending on the corresponding observations
![Page 116: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/116.jpg)
118
Definition of CRFs
X is a random variable over data sequences to be labeled
Y is a random variable over corresponding label sequences
![Page 117: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/117.jpg)
119
Example of CRFs
![Page 118: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/118.jpg)
120
Graphical comparison among HMMs, MEMMs and CRFs
HMM MEMM CRF
![Page 119: 1 Parts of Speech Sudeshna Sarkar 7 Aug 2008. 2 Why Do We Care about Parts of Speech? Pronunciation Hand me the lead pipe. Predicting what words can be](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d235503460f949f9ad2/html5/thumbnails/119.jpg)
121
Conditional Distribution
λ = (λ_1, λ_2, …, λ_n; μ_1, μ_2, …, μ_k) are the parameters to be estimated
x is a data sequence; y is a label sequence
v is a vertex from the vertex set V (the set of label random variables)
e is an edge from the edge set E over V
f_k and g_k are given and fixed; g_k is a Boolean vertex feature, f_k a Boolean edge feature; k indexes the features
y|e is the set of components of y defined by edge e; y|v is the set of components of y defined by vertex v
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, is, by the fundamental theorem of random fields:

p_θ(y | x) ∝ exp( Σ_{e ∈ E, k} λ_k f_k(e, y|_e, x) + Σ_{v ∈ V, k} μ_k g_k(v, y|_v, x) )
122
Conditional Distribution (cont’d)
• CRFs use the observation-dependent normalization Z(x) for the conditional distributions:

p_θ(y | x) = (1 / Z(x)) exp( Σ_{e∈E,k} λ_k f_k(e, y|_e, x) + Σ_{v∈V,k} μ_k g_k(v, y|_v, x) )

where Z(x) is a normalization over the data sequence x
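A brute-force sketch of this globally normalized distribution, for a toy linear-chain CRF with two labels and assumed weights (the tables g and f stand in for the weighted vertex and edge feature sums):

```python
import math
from itertools import product

# Toy linear-chain CRF with 2 labels and fixed (assumed) scores.
# score(y) = sum of vertex scores g[y_i] plus edge scores f[(y_i, y_{i+1})],
# standing in for the weighted feature sums in the exponent.
LABELS = [0, 1]
g = {0: 0.5, 1: 1.0}                                   # vertex scores
f = {(0, 0): 0.2, (0, 1): 1.5, (1, 0): 0.3, (1, 1): 0.1}  # edge scores

def score(y):
    s = sum(g[v] for v in y)
    s += sum(f[(y[i], y[i + 1])] for i in range(len(y) - 1))
    return s

def p(y, n):
    # Global normalization: Z(x) sums exp(score) over ALL label sequences,
    # not per state as in an MEMM.
    Z = sum(math.exp(score(yy)) for yy in product(LABELS, repeat=n))
    return math.exp(score(y)) / Z

n = 3
probs = [p(y, n) for y in product(LABELS, repeat=n)]
print(sum(probs))  # 1.0 (up to floating point)
```

Enumerating all |labels|^n sequences is exponential; it is only for checking the definition — real implementations compute Z(x) with dynamic programming, as the clique-elimination slide below suggests.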
123
Parameter Estimation for CRFs
The original paper provided iterative scaling algorithms for parameter estimation
These turned out to be very inefficient in practice
Prof. Dietterich's group instead applied gradient descent, which is considerably more efficient
124
Training of CRFs (From Prof. Dietterich)
• First, take the log of the distribution:

log p_θ(y | x) = Σ_{e∈E,k} λ_k f_k(e, y|_e, x) + Σ_{v∈V,k} μ_k g_k(v, y|_v, x) − log Z(x)

• Then, take the derivative with respect to each parameter:

∂/∂λ_k log p_θ(y | x) = Σ_{e∈E} f_k(e, y|_e, x) − ∂/∂λ_k log Z(x)

• For training, the first two terms are easy to compute. For example, for each k, evaluating f_k over the edges yields a sequence of Boolean values, such as 00101110100111; Σ_{e∈E} f_k(e, y|_e, x) is just the total number of 1's in the sequence.
• The hardest part is computing Z(x)
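The derivative above says the gradient for λ_k is the observed feature count minus its expectation under the model (the ∂ log Z(x)/∂λ_k term). A brute-force sketch checking this against a numeric derivative, using one assumed toy edge feature (count of (1,1) transitions) and a single weight:

```python
import math
from itertools import product

LABELS = [0, 1]

def count_11(y):
    # Toy edge feature f_k: number of (1,1) transitions in the label sequence.
    return sum(1 for i in range(len(y) - 1) if y[i] == 1 and y[i + 1] == 1)

def log_p(y, lam, n):
    # log p(y|x) = lam * count - log Z, with Z summed by brute force.
    Z = sum(math.exp(lam * count_11(yy)) for yy in product(LABELS, repeat=n))
    return lam * count_11(y) - math.log(Z)

def grad(y, lam, n):
    # Analytic gradient: observed count minus model-expected count.
    Z = sum(math.exp(lam * count_11(yy)) for yy in product(LABELS, repeat=n))
    expected = sum(count_11(yy) * math.exp(lam * count_11(yy)) / Z
                   for yy in product(LABELS, repeat=n))
    return count_11(y) - expected

y, lam, n, eps = (1, 1, 0), 0.7, 3, 1e-6
numeric = (log_p(y, lam + eps, n) - log_p(y, lam - eps, n)) / (2 * eps)
print(abs(numeric - grad(y, lam, n)) < 1e-5)  # True
```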
125
Training of CRFs (From Prof. Dietterich) (cont’d)
• Maximal cliques of the chain y1 — y2 — y3 — y4: c1 = {y1, y2}, c2 = {y2, y3}, c3 = {y3, y4}

Z(x) = Σ_{y1,y2,y3,y4} c1(y1, y2, x) c2(y2, y3, x) c3(y3, y4, x)
     = Σ_{y1} Σ_{y2} Σ_{y3} Σ_{y4} c1(y1, y2, x) c2(y2, y3, x) c3(y3, y4, x)

with clique potentials:

c1 := exp( g(y1, x) + g(y2, x) + f(y1, y2, x) ) = c1(y1, y2, x)
c2 := exp( g(y3, x) + f(y2, y3, x) ) = c2(y2, y3, x)
c3 := exp( g(y4, x) + f(y3, y4, x) ) = c3(y3, y4, x)
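Because each clique shares only one variable with the next, the sums can be pushed inward and the variables summed out one at a time. A sketch with arbitrary (assumed) positive clique tables, checking the elimination order against brute force:

```python
import math
from itertools import product

# Z(x) via variable elimination over the chain's maximal cliques,
# compared against brute-force enumeration. The clique potential
# tables c1, c2, c3 are arbitrary assumed positive values.
LABELS = [0, 1]
c1 = {(a, b): math.exp(0.3 * a + 0.8 * b) for a in LABELS for b in LABELS}
c2 = {(a, b): math.exp(1.1 * a - 0.2 * b) for a in LABELS for b in LABELS}
c3 = {(a, b): math.exp(0.5 * a + 0.4 * b) for a in LABELS for b in LABELS}

# Brute force: Z = sum over all (y1, y2, y3, y4) of c1 * c2 * c3.
Z_brute = sum(c1[(y1, y2)] * c2[(y2, y3)] * c3[(y3, y4)]
              for y1, y2, y3, y4 in product(LABELS, repeat=4))

# Elimination: sum out y1, then y2, then y3 and y4 (a forward pass).
m1 = {y2: sum(c1[(y1, y2)] for y1 in LABELS) for y2 in LABELS}
m2 = {y3: sum(m1[y2] * c2[(y2, y3)] for y2 in LABELS) for y3 in LABELS}
Z_elim = sum(m2[y3] * c3[(y3, y4)] for y3 in LABELS for y4 in LABELS)

print(abs(Z_brute - Z_elim) < 1e-9)  # True
```

The elimination pass costs O(n · |labels|^2) instead of O(|labels|^n), which is what makes training tractable.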
126
POS tagging Experiments
127
POS tagging Experiments (cont’d)
• Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging
• Each word in a given input sentence must be labeled with one of 45 syntactic tags
• Added a small set of orthographic features: whether a spelling begins with a number or an upper-case letter, whether it contains a hyphen, and whether it ends in one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
• oov = out-of-vocabulary (not observed in the training set)
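The orthographic features listed above can be sketched as a small extractor (the function and feature names are assumptions for illustration, not from the experiments):

```python
# A sketch of the orthographic features described above; feature
# names and the function itself are illustrative assumptions.
SUFFIXES = ("-ing", "-ogy", "-ed", "-s", "-ly", "-ion", "-tion", "-ity", "-ies")

def orthographic_features(word):
    feats = {
        "starts_with_digit": word[:1].isdigit(),
        "starts_with_upper": word[:1].isupper(),
        "contains_hyphen": "-" in word,
    }
    for suf in SUFFIXES:
        feats["suffix" + suf] = word.endswith(suf.lstrip("-"))
    return feats

print(orthographic_features("Pre-processing"))
```

Such features let the tagger back off to word shape for out-of-vocabulary words, which is where the reported oov error rates differ most between models.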
128
Summary
Discriminative models with per-state normalization, such as MEMMs, are prone to the label bias problem
CRFs provide the benefits of discriminative models
CRFs avoid the label bias problem and demonstrate good empirical performance