What is Part of Speech?
A category of words which have similar grammatical properties
Nouns → denote abstract/real things
Verbs → denote actions
Adjectives → modify nouns
Adverbs → qualify actions (manner, time, place, etc.)
Pronouns → shorthand for nouns already referred to
Prepositions → indicate spatial or temporal relations
Conjunctions → join phrases, clauses, and sentences
Interjections → oh, alas, etc.
Open class vs closed class
Open class: New words can be added to this set
Nouns, verbs, adjectives, adverbs are open class
Closed class: New words are rarely added and there are few of these
Conjunctions, prepositions, pronouns, interjections are closed class
How many parts of speech are there?
Depends on who you ask.
First a few questions -
What is the POS for can, not, should, will, etc.?
What is the POS tag for घरासमोरचा (Marathi, roughly “the one in front of the house”) → Noun, Preposition, or both??
Why is it important to know the POS of a word?
Basic building block for Natural Language Understanding
Following tasks require POS as input:
Chunking
Parsing
… and then the whole stack is built on this
POS Tagging in NLP applications
POS tagging is relatively easier to do than full language understanding
Statistical POS taggers require less training data than models for other tasks like parsing
For NLP applications, even POS information is very useful:
Sentiment Analysis
Sarcasm Detection
Named Entity Recognition
Some Information Extraction tasks
Is it not enough to look up a dictionary and mark up POS tags?
Ace can be a noun too, e.g. Nadal served an ace
Batsman modifies Aravinda de Silva, so it can be JJ (adjective)
Called the “Noisy Channel Model”
Y has been converted to X while passing the message through a channel (probably with some errors)
Task is to recover Y given X
A core abstraction in machine learning
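A sketch of the usual decoding objective in this setting (my notation, not spelled out on the slide: X = observed word sequence, Y = hidden tag sequence):

    Y* = argmax_Y P(Y | X) = argmax_Y P(X | Y) · P(Y)

The denominator P(X) is dropped because it does not depend on Y.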
Independence Assumptions to simplify things ...
Markov assumption: current tag depends on previous k tags only
This is the heart of the design of any ML system → the modelling
Lexical independence assumption: a word depends only on its POS tag! - strange!!
Hidden Markov Models
Hidden tags
Observed words
Markov Chain → sequence of POS tags with limited history
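Putting the two assumptions together, the joint probability of a sentence and its tags factorizes as (a sketch; y_0 is a special start tag and L is the sentence length):

    P(X, Y) = ∏_{i=1..L} P(y_i | y_{i-1}) · P(x_i | y_i)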
Training HMM models
Goal: Learn a “model” from this corpus, so new unseen sentences can be tagged
What does learning a model mean?
Learning the values of:
P(y_i | y_{i-1}) → transition probabilities
P(x_i | y_i) → emission probabilities
Input: POS tagged corpus of many sentences
Essentially fill in these two tables
Transition Probability Table
    y_i    y_{i-1}    P(y_i | y_{i-1})
    JJ     NN         ??
    VB     RB         ??
    ...    ...        ...
Emission Probability Table
    x_i        y_i    P(x_i | y_i)
    natural    NN     ??
    is         RB     ??
    ...        ...    ...
These missing values are called model parameters
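A minimal sketch of how these parameters could be estimated by counting over the tagged corpus (the corpus format and the name train_hmm are my assumptions, not from the slides):

    from collections import defaultdict

    def train_hmm(tagged_sentences):
        """Relative-frequency estimates of P(y_i|y_{i-1}) and P(x_i|y_i).
        tagged_sentences: list of sentences, each a list of (word, tag) pairs."""
        transition_counts = defaultdict(lambda: defaultdict(int))
        emission_counts = defaultdict(lambda: defaultdict(int))
        context_counts = defaultdict(int)

        for sentence in tagged_sentences:
            prev = "<s>"                            # start-of-sentence tag
            for word, tag in sentence:
                transition_counts[prev][tag] += 1   # count y_{i-1} -> y_i
                emission_counts[tag][word] += 1     # count y_i -> x_i
                context_counts[prev] += 1
                prev = tag
            transition_counts[prev]["</s>"] += 1    # end-of-sentence tag
            context_counts[prev] += 1

        # Convert counts to probabilities by dividing by the context counts
        transition = {p: {t: c / context_counts[p] for t, c in nxt.items()}
                      for p, nxt in transition_counts.items()}
        emission = {t: {w: c / sum(ws.values()) for w, c in ws.items()}
                    for t, ws in emission_counts.items()}
        return transition, emission

For example, train_hmm([[("natural", "JJ"), ("language", "NN")]]) fills in entries such as P(NN|JJ) = 1.0 and P(natural|JJ) = 1.0.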
Except, not all words occur in the training corpus ...
How do we estimate emission probabilities for these?
We need to do some “smoothing”
What is that ???
Smoothing
Set aside some probability mass for unseen words
For unseen words P_ML = 0, so they need to get a non-zero emission probability from the reserved mass (a sketch of one such formula follows below)
This is the simplest smoothing method … not the best one … there are many other better ones
Not for today though ...
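One common choice is simple interpolation with a uniform distribution (a sketch; λ and the vocabulary size N are my notation, not given on the slide):

    P(x_i | y_i) = λ · P_ML(x_i | y_i) + (1 − λ) · 1/N

With λ close to 1, seen words keep most of their probability, and an unseen word (P_ML = 0) still receives the small leftover mass (1 − λ)/N.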
Why is smoothing not very important for transition probabilities?
Because the tagset is generally small ...
Now, we have trained the model ….
How do we find the POS tags for a new sentence?
This problem is called “decoding”
We use the ‘Viterbi Algorithm’
You will encounter this guy, Andrew Viterbi, everywhere in machine learning
Trying all possible combinations is very costly … you have to evaluate |V|^L paths
where L: number of words
|V|: size of the POS tagset
For a 40-tag tagset and a sentence of 10 words, you have to evaluate 40^10 (more than 10^16) possible sequences
Viterbi Algorithm to the rescue ….
It is just a dynamic programming algorithm that exploits the optimal substructure of the problem
Key Property:
If the best tag sequence at position k contains the tag T at position k-1 …
then the sequence of positions 1 to k-1 is the best sequence ending in T
This means if you have found the best sequence till position k-1 ending in every POS tag, it is efficient to do the same for position k
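In symbols (a sketch; best(k, T) stands for the score of the best partial tag sequence ending in tag T at position k, and x_k is the k-th word):

    best(k, T) = max over T' of [ best(k-1, T') · P(T | T') ] · P(x_k | T)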
Done in two steps:
● Forward step: Find the best value through the graph
● Backward step: Backtrack to find the best path
Backward step
During the forward step, keep track of the best edge at every computation
Now, you just have to backtrack from the final best tag at position (L+1) to find the best POS sequence
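A minimal Viterbi sketch in Python, assuming the transition/emission dictionaries from the hypothetical train_hmm above; the small unk_prob constant stands in for the smoothed emission probability of unseen words:

    def viterbi(words, transition, emission, tags, unk_prob=1e-7):
        """Most probable tag sequence for `words` under the trained HMM.
        Forward step fills best scores and backpointers; backward step backtracks."""
        best = [{}]          # best[k][tag] = score of best path ending in tag at position k
        backpointer = [{}]
        for tag in tags:
            best[0][tag] = (transition.get("<s>", {}).get(tag, 0.0)
                            * emission.get(tag, {}).get(words[0], unk_prob))
            backpointer[0][tag] = "<s>"

        for k in range(1, len(words)):
            best.append({})
            backpointer.append({})
            for tag in tags:
                emit = emission.get(tag, {}).get(words[k], unk_prob)
                score, prev = max(
                    (best[k - 1][p] * transition.get(p, {}).get(tag, 0.0) * emit, p)
                    for p in tags)
                best[k][tag] = score
                backpointer[k][tag] = prev

        # Backward step: pick the best final tag, then follow backpointers
        last = max(tags, key=lambda t: best[-1][t])
        sequence = [last]
        for k in range(len(words) - 1, 0, -1):
            sequence.append(backpointer[k][sequence[-1]])
        return list(reversed(sequence))

With the toy corpus from the training sketch, viterbi("natural language".split(), transition, emission, ["JJ", "NN"]) returns ["JJ", "NN"]. The algorithm runs in O(L · |V|^2) time instead of enumerating |V|^L sequences.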
Acknowledgements
Graham Neubig: http://www.phontron.com/slides/nlp-programming-en-04-hmm.pdf