NLP: POS Tagging and N-gram Models
TRANSCRIPT
Slide 0
Outline
Why part of speech tagging?
Word classes
Tag sets and problem definition
N-Grams
Slide 1
POS
The process of assigning a part-of-speech or other lexical class marker to each word in a corpus (Jurafsky and Martin)
WORDS: the, mother, kissed, the, child, on, the, cheek
TAGS: N, V, P, DET
(each word maps to one of the tags: the/DET mother/N kissed/V the/DET child/N on/P the/DET cheek/N)
Slide 2
An Example
WORD      LEMMA    TAG
the       the      DET
mother    mother   NOUN
kissed    kiss     VPAST
the       the      DET
child     child    NOUN
on        on       PREP
the       the      DET
cheek     cheek    NOUN
Slide 3
Word Classes and Tag Sets
Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, ...
Tag sets
Vary in number of tags: from a dozen to over 200
Size of tag set depends on language, objectives, and purpose
Open vs. closed classes
Open: nouns, verbs, adjectives, adverbs
Closed: determiners, pronouns, prepositions, conjunctions
Slide 4
Open Class Words
Every known human language has nouns and verbs
Nouns: people, places, things
Classes of nouns
proper vs. common
count vs. mass
Verbs: actions and processes
Adjectives: properties, qualities
Adverbs: Unfortunately, John walked home extremely slowly yesterday
Numerals: one, two, three, third, ...
Slide 5
Closed Class Words
Differ more from language to language and over time than
open class words
Examples:
prepositions: on, under, over, ...
particles: up, down, on, off, ...
determiners: a, an, the, ...
pronouns: she, who, I, ...
conjunctions: and, but, or, ...
auxiliary verbs: can, may, should, ...
Slide 6
Prepositions from CELEX
Slide 7
Pronouns in CELEX
Slide 8
Conjunctions
Slide 9
Auxiliaries
Slide 10
Word Classes: Tag set example
PRP (personal pronoun), PRP$ (possessive pronoun)
(excerpt from the Penn Treebank tag set; the full table is not reproduced here)
Slide 11
Example of Penn Treebank Tagging of Brown Corpus Sentence
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
number/NN of/IN other/JJ topics/NNS ./.
Book/VB that/DT flight/NN ./.
Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/?
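As an illustration (not from the original slides), a pretrained statistical tagger such as NLTK's produces Penn Treebank tags like those above; since the tagger is stochastic, its output may differ slightly from the hand-tagged examples.

```python
# Sketch: tagging with NLTK's pretrained Penn Treebank tagger.
# Assumes nltk is installed and the 'punkt' and
# 'averaged_perceptron_tagger' resources have been downloaded.
import nltk

tokens = nltk.word_tokenize("Does that flight serve dinner ?")
print(nltk.pos_tag(tokens))
# Roughly: [('Does', 'VBZ'), ('that', 'DT'), ('flight', 'NN'),
#           ('serve', 'VB'), ('dinner', 'NN'), ('?', '.')]
```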
Slide 12
The Problem
Words often have more than one word class: this
This is a nice day = PRP
This day is nice = DT
You can go this far = RB
Slide 13
Part-of-Speech Tagging
Rule-Based Tagger: ENGTWOL (ENGlish TWO Level analysis)
Stochastic Tagger: HMM-based
Transformation-Based Tagger (Brill)
Slide 14
Word prediction
Guess the next word... I notice three guys standing on the ???
- by simply looking at the preceding words and keeping track of some fairly simple counts.
We can formalize this task using what are called N-gram models.
N-grams are token sequences of length N.
The above example contains the following 2-grams (bigrams): (I notice), (notice three), (three guys), (guys standing), (standing on), (on the)
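A minimal sketch of collecting those bigrams from the token sequence (pairing the sequence with itself shifted by one):

```python
# Collect the 2-grams (bigrams) from the example word sequence.
tokens = "I notice three guys standing on the".split()
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
# [('I', 'notice'), ('notice', 'three'), ('three', 'guys'),
#  ('guys', 'standing'), ('standing', 'on'), ('on', 'the')]
```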
Slide 15
N-Gram Models
More formally, we can use knowledge of the counts of N-grams to assess the conditional probability of candidate words as the next word in a sequence.
Or, we can use them to assess the probability of an entire sequence of words.
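As a sketch of the first use (the tiny corpus here is made up for illustration, not taken from the slides), bigram counts let us rank candidate next words:

```python
from collections import Counter

# Hypothetical toy corpus; in practice the counts come from a large corpus.
corpus = ("I notice three guys standing on the corner . "
          "three guys standing on the porch .").split()
bigram_counts = Counter(zip(corpus, corpus[1:]))

# Rank candidates for the word following "the" by bigram count.
candidates = Counter({w2: c for (w1, w2), c in bigram_counts.items()
                      if w1 == "the"})
print(candidates.most_common())  # e.g. [('corner', 1), ('porch', 1)]
```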
Slide 16
Counting
Simple counting lies at the core of any probabilistic approach. So let's first take a look at what we're counting.
He stepped out into the hall, was delighted to encounter a water brother.
15 tokens (counting punctuation)
Not always that simple:
I do uh main- mainly business data processing
Spoken language poses various challenges. Should we count uh and other fillers as tokens? What about the repetition of mainly?
The answers depend on the application. If we're focusing on something like ASR to support indexing for search, then uh isn't helpful (it's not likely to occur as a query). But filled pauses are very useful in dialog management, so we might want them there.
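As a small sketch of the simple written-text case above, here is a basic regular-expression tokenizer that splits off punctuation; this tokenizer is an illustrative assumption, not the one used in the slides:

```python
import re

# Tokenize, treating punctuation marks as separate tokens.
sentence = ("He stepped out into the hall, was delighted to "
            "encounter a water brother.")
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(len(tokens))  # 15 tokens, counting ',' and '.'
```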
Slide 17
Counting: Types and Tokens
How about:
They picnicked by the pool, then lay back on the grass and looked at the stars.
18 tokens (again counting punctuation)
But we might also note that the is used 3 times, so there are only 16 unique types (as opposed to tokens).
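A minimal sketch of the token vs. type distinction for that sentence (punctuation again counted):

```python
# Count tokens and unique types.
tokens = ("They picnicked by the pool , then lay back on the grass "
          "and looked at the stars .").split()
print(len(tokens))       # 18 tokens
print(len(set(tokens)))  # 16 types ("the" occurs 3 times)
```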
Slide 18
Counting: Corpora
So what happens when we look at large bodies of text
instead of single utterances?
Brown et al. (1992): large corpus of English text
583 million wordform tokens
293,181 wordform types
Google
Crawl of 1,024,908,267,229 English tokens
13,588,391 wordform types
That seems like a lot of types... After all, even large dictionaries of English have only around 500k types. Why so many here?
Numbers
Misspellings
Names
Acronyms
etc.
Slide 19
Language Modeling
Back to word prediction
We can model the word prediction task as the ability to assess the conditional probability of a word given the previous words in the sequence: P(wn | w1, w2, ..., wn-1)
Calculating the conditional probability
One way is to use the definition of conditional probabilities and look for counts. So to get
P(the | its water is so transparent that)
By definition that's
P(its water is so transparent that the) / P(its water is so transparent that)
We can get each of those from counts in a large corpus.
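A hedged sketch of that count-based definition (the tiny corpus string and the substring-matching shortcut are illustrative assumptions, not how counting is done over a real corpus):

```python
def conditional_prob(history, word, corpus_text):
    """Estimate P(word | history) as Count(history word) / Count(history).

    The ratio of counts equals the ratio of probabilities, since the
    corpus-size normalizer cancels.
    """
    history_count = corpus_text.count(history)
    joint_count = corpus_text.count(history + " " + word)
    return joint_count / history_count if history_count else 0.0

# Toy corpus for illustration only.
corpus_text = "its water is so transparent that the fish can be seen"
print(conditional_prob("its water is so transparent that", "the",
                       corpus_text))  # 1.0
```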
Slide 20
Very Easy Estimate
How to estimate P(the | its water is so transparent that)?
P(the | its water is so transparent that) =
Count(its water is so transparent that the) / Count(its water is so transparent that)
According to Google those counts are 5 and 9, so the estimate is 5/9. But 2 of those hits were to these slides... So maybe it's really 3/7.