GENERAL AND COMPUTATIONAL LINGUISTICS
PART-OF-SPEECH DISAMBIGUATION
POS tagging: the problem
• People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
• Problem: assign a tag to race
• Requires: tagged corpus
Ambiguity in POS tagging
The    AT
man    NN VB
still  NN VB RB
saw    NN VBD
her    PPO PP$
How hard is POS tagging?
Number of tags        1      2     3    4   5   6  7
Number of word types  35340  3760  264  61  12  2  1
In the Brown corpus:
• 11.5% of word types are ambiguous
• 40% of word TOKENS are ambiguous
Frequency + Context
• Both the Brill tagger and HMM-based taggers achieve good results by combining
  – FREQUENCY
    • I poured FLOUR/NN into the bowl.
    • Peter should FLOUR/VB the baking tray.
  – Information about CONTEXT
    • I saw the new/JJ PLAY/NN in the theater.
    • The boy will/MD PLAY/VB in the garden.
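The FREQUENCY half of this combination can be sketched as a most-frequent-tag baseline. This is only an illustration: the tagged pairs below are invented toy data, and `most_frequent_tags` is a hypothetical helper, not part of any tagger named on the slide.

```python
from collections import Counter, defaultdict

# Toy tagged data, invented for illustration (not drawn from any corpus).
toy_corpus = [
    ("flour", "NN"), ("flour", "NN"), ("flour", "VB"),
    ("play", "NN"), ("play", "VB"), ("play", "VB"),
]

def most_frequent_tags(tagged_words):
    """Map each word to the tag it carries most often in the data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

baseline = most_frequent_tags(toy_corpus)
print(baseline)
```

On this toy data the baseline always tags "play" as VB, even after "the new", which is the kind of mistake that CONTEXT information is meant to correct.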
The importance of context
• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
• People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
TAGGED CORPORA
Choosing a tagset
• The choice of tagset greatly affects the difficulty of the problem
• Need to strike a balance between
  – Getting better information about context (best: introduce more distinctions)
  – Making it possible for classifiers to do their job (need to minimize distinctions)
Some of the best-known Tagsets
• Brown corpus: 87 tags
• Penn Treebank: 45 tags
• Lancaster UCREL C5 (used to tag the BNC): 61 tags
• Lancaster C7: 145 tags
Important Penn Treebank tags
Verb inflection tags
The entire Penn Treebank tagset
UCREL C5
Tagsets for Italian
SI-TAL (Pisa, Venice, IRST, ....)
PAROLE
TEXTPRO (later)
The SI-TAL tagset
POS tags in the Brown corpus
Television/NN has/HVZ yet/RB to/TO work/VB out/RP a/AT living/VBG arrangement/NN with/IN jazz/NN ,/, which/WDT comes/VBZ to/IN the/AT medium/NN more/QL as/CS an/AT uneasy/JJ guest/NN than/CS as/CS a/AT relaxed/VBN member/NN of/IN the/AT family/NN ./.
SGML-based POS in the BNC
<div1 complete=y org=seq>
<head>
<s n=00040> <w NN2>TROUSERS <w VVB>SUIT
</head>
<caption>
<s n=00041> <w EX0>There <w VBZ>is <w PNI>nothing <w AJ0>masculine <w PRP>about <w DT0>these <w AJ0>new <w NN1>trouser <w NN2-VVZ>suits <w PRP>in <w NN1>summer<w POS>'s <w AJ0>soft <w NN2>pastels<c PUN>.
<s n=00042> <w NP0>Smart <w CJC>and <w AJ0>acceptable <w PRP>for <w NN1>city <w NN1-VVB>wear <w CJC>but <w AJ0>soft <w AV0>enough <w PRP>for <w AJ0>relaxed <w NN2>days
</caption>
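Markup of this kind can be turned into (word, tag) pairs with a small regular expression. The sketch below assumes only the simplified `<w TAG>word` and `<c TAG>punctuation` pattern visible in the sample; it is not a full SGML parser, and `bnc_pairs` is a hypothetical helper name.

```python
import re

# Matches "<w TAG>word" or "<c TAG>punct" as in the sample above.
# Assumption: tokens never contain "<"; other elements such as <s n=...> are skipped.
TOKEN_RE = re.compile(r"<[wc] ([^>]+)>([^<]*)")

def bnc_pairs(sgml):
    """Return (word, tag) pairs found in a BNC-style SGML fragment."""
    return [(word.strip(), tag) for tag, word in TOKEN_RE.findall(sgml)]

sample = ("<s n=00041> <w EX0>There <w VBZ>is <w PNI>nothing "
          "<w NN1>summer<w POS>'s <w NN2>pastels<c PUN>. ")
pairs = bnc_pairs(sample)
print(pairs)
```

Ambiguity tags such as NN2-VVZ come through unchanged as a single tag string, so any further splitting is left to the caller.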
Quick test
DoCoMo and Sony are to develop a chip that would let people pay for goods through their mobiles.
POS TAGGED CORPORA IN NLTK
>>> import nltk
>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
Exploring tagged corpora
• Ch.5, p. 184-189
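One simple exploration is to count how many distinct tags each word type carries, which gives the kind of ambiguity figures quoted for the Brown corpus. The sketch below uses a small invented list of pairs as a stand-in for the output of `nltk.corpus.brown.tagged_words()`, so the numbers it prints say nothing about the real corpus.

```python
from collections import Counter, defaultdict

# Stand-in for nltk.corpus.brown.tagged_words(); these pairs are invented.
tagged_words = [
    ("The", "AT"), ("man", "NN"), ("saw", "VBD"), ("her", "PPO"),
    ("They", "PPSS"), ("man", "VB"), ("the", "AT"), ("saw", "NN"),
]

# Collect the set of tags seen for each (lowercased) word type.
tags_per_type = defaultdict(set)
for word, tag in tagged_words:
    tags_per_type[word.lower()].add(tag)

# Histogram: number of tags -> number of word types with that many tags.
histogram = Counter(len(tags) for tags in tags_per_type.values())
ambiguous_types = sorted(w for w, tags in tags_per_type.items() if len(tags) > 1)

print(dict(histogram))
print(ambiguous_types)
```

Lowercasing before counting is a design choice made here for the toy data; whether to fold case on a real corpus depends on the question being asked.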
OTHER POS-TAGGED CORPORA
• NLTK:
• WAC Corpora:
  – English: UKWAC
  – Italian: ITWAC
POS TAGGING
Markov Model POS tagging
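• Again, the problem is to find an 'explanation' with the highest probability, where each tag t_i is drawn from the tagset T:
\[ \underset{t_1 \ldots t_n}{\operatorname{argmax}}\; P(t_1 \ldots t_n \mid w_1 \ldots w_n) \]
• As in the lecture on text classification, this can be ‘turned around’ using Bayes’ Rule:
\[ \underset{t_1 \ldots t_n}{\operatorname{argmax}}\; \frac{P(w_1 \ldots w_n \mid t_1 \ldots t_n)\, P(t_1 \ldots t_n)}{P(w_1 \ldots w_n)} \]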
Combining frequency and contextual information
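• As in the case of spelling, this equation can be simplified, because the denominator P(w_1 .. w_n) is the same for every candidate tag sequence:
\[ \underset{t_1 \ldots t_n}{\operatorname{argmax}}\; \underbrace{P(w_1 \ldots w_n \mid t_1 \ldots t_n)}_{\text{likelihood}}\; \underbrace{P(t_1 \ldots t_n)}_{\text{prior}} \]
• As we will see, once further simplifications are applied, this equation will encode both FREQUENCY and CONTEXT INFORMATION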
Three further assumptions
• MARKOV assumption: a tag only depends on a FIXED NUMBER of previous tags (here, assume bigrams)
  – Simplify second factor
• INDEPENDENCE assumption: words are independent from each other.
• A word’s identity only depends on its own tag
  – Simplify first factor
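Written out, the two simplifications would take roughly the following form. This is a sketch in the notation of the preceding slides; treating t_0 as a sentence-start marker is an assumption made here so that the first bigram is defined.

```latex
% Second factor (prior), under the bigram Markov assumption:
P(t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})

% First factor (likelihood), under the independence assumptions:
P(w_1 \ldots w_n \mid t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
```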
The final equations
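\[ \underset{t_1 \ldots t_n}{\operatorname{argmax}}\; \prod_{i=1}^{n} \underbrace{P(w_i \mid t_i)}_{\text{FREQUENCY}}\; \underbrace{P(t_i \mid t_{i-1})}_{\text{CONTEXT}} \]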
Estimating the probabilities
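Can be done using Maximum Likelihood Estimation as usual, for BOTH probabilities, where C(.) is a count in the tagged corpus:
\[ P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})} \qquad P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)} \]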
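As a rough illustration, both estimates reduce to counting and dividing. The sketch below assumes a tiny invented set of tagged sentences and a hypothetical `<s>` start marker; it applies no smoothing, so asking about a tag that never occurs raises a division-by-zero error.

```python
from collections import Counter

# Toy tagged sentences, invented for illustration.
sentences = [
    [("the", "DT"), ("race", "NN")],
    [("to", "TO"), ("race", "VB")],
    [("the", "DT"), ("dog", "NN")],
]

tag_counts = Counter()       # C(t)
emission_counts = Counter()  # C(t, w)
bigram_counts = Counter()    # C(t_prev, t)
history_counts = Counter()   # C(t_prev), counted as a bigram history

for sent in sentences:
    prev = "<s>"  # assumed sentence-start marker
    for word, tag in sent:
        tag_counts[tag] += 1
        emission_counts[(tag, word)] += 1
        bigram_counts[(prev, tag)] += 1
        history_counts[prev] += 1
        prev = tag

def p_word_given_tag(word, tag):
    """MLE estimate of P(word | tag); unsmoothed."""
    return emission_counts[(tag, word)] / tag_counts[tag]

def p_tag_given_prev(tag, prev):
    """MLE estimate of P(tag | prev); unsmoothed."""
    return bigram_counts[(prev, tag)] / history_counts[prev]

print(p_word_given_tag("race", "NN"), p_tag_given_prev("NN", "DT"))
```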
An example of tagging with Markov Models
• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
• People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
• Problem: assign a tag to race given the subsequences
  – to/TO race/???
  – the/DT race/???
• Solution: we choose the tag that has the greater of these probabilities:
  – P(VB|TO) P(race|VB)
  – P(NN|TO) P(race|NN)
Tagging with MMs (2)
• Actual estimates from the Switchboard corpus:
• LEXICAL FREQUENCIES:
  – P(race|NN) = .00041
  – P(race|VB) = .00003
• CONTEXT:
  – P(NN|TO) = .021
  – P(VB|TO) = .34
• The probabilities:
  – P(VB|TO) P(race|VB) = .00001
  – P(NN|TO) P(race|NN) = .000009
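The comparison can be checked by multiplying the listed estimates directly. The sketch below uses only the four figures quoted above; the dictionary names are illustrative.

```python
# Estimates quoted on the slide (Switchboard corpus).
p_race_given = {"NN": 0.00041, "VB": 0.00003}   # P(race | tag)
p_tag_after_to = {"NN": 0.021, "VB": 0.34}      # P(tag | TO)

# Score each candidate tag for "race" after "to/TO".
scores = {tag: p_tag_after_to[tag] * p_race_given[tag] for tag in p_race_given}
best = max(scores, key=scores.get)

for tag, score in scores.items():
    print(tag, score)
print("chosen tag:", best)
```

With these inputs the products come to about .0000102 for VB and .0000086 for NN, so VB is chosen for "to race" even though "race" is more often a noun.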
A graphical interpretation of the POS tagging equations
Hidden Markov Models
An example
Computing the most likely sequence of tags
• In general, the problem of computing the most likely sequence t1 .. tn could have exponential complexity: with |T| possible tags per word there are |T|^n candidate sequences for n words
• It can however be solved in polynomial time using DYNAMIC PROGRAMMING: the VITERBI ALGORITHM (Viterbi, 1967)
• (Also called the TRELLIS ALGORITHM)
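A minimal sketch of the Viterbi recursion for a bigram tagger follows. It assumes probabilities stored in plain dictionaries, a hypothetical `<s>` start marker, and a non-empty input. The transition and emission values for "to race" are the Switchboard estimates quoted earlier; the two values of 1.0 are invented to make the toy model complete. Practical implementations usually work with log probabilities to avoid underflow.

```python
def viterbi(words, tags, trans, emit, start="<s>"):
    """Most probable tag sequence under a bigram HMM.

    trans[(prev, tag)] and emit[(tag, word)] hold probabilities;
    missing entries are treated as 0. Assumes words is non-empty.
    """
    # best[i][t]: probability of the best path that ends in tag t at position i
    best = [{t: trans.get((start, t), 0.0) * emit.get((t, words[0]), 0.0)
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Pick the predecessor tag that maximizes the path probability.
            prev_tag, score = max(
                ((p, best[i - 1][p] * trans.get((p, t), 0.0)) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score * emit.get((t, words[i]), 0.0)
            back[i][t] = prev_tag
    # Follow the back-pointers from the best final tag.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["TO", "VB", "NN"]
trans = {("<s>", "TO"): 1.0,                      # invented
         ("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
emit = {("TO", "to"): 1.0,                        # invented
        ("VB", "race"): 0.00003, ("NN", "race"): 0.00041}

best_path = viterbi(["to", "race"], tags, trans, emit)
print(best_path)
```

Each position considers every pair of (previous tag, current tag), so the work grows with n times |T| squared rather than with |T|^n.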
POS TAGGING IN NLTK
DEFAULT POS TAGGER: nltk.pos_tag
>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
TEXTPRO
• The most widely used NLP tool for Italian
• http://textpro.fbk.eu/
• Demo
THE TEXTPRO TAGSET
READINGS
• Bird et al., chapter 5 and chapter 6.1