NLP: POS Tagging and N-gram Models
TRANSCRIPT
Slide 0
Outline
Why part of speech tagging?
Word classes
Tag sets and problem definition
N-Grams
Slide 1
POS
The process of assigning a part-of-speech or other lexical class marker to each word in a corpus (Jurafsky and Martin)
WORDS: the, mother, kissed, the, child, on, the, cheek
TAGS: N, V, P, DET
(each word maps to one of the tags: the/DET mother/N kissed/V the/DET child/N on/P the/DET cheek/N)
Slide 2
An Example
WORD      LEMMA    TAG
the       the      DET
mother    mother   NOUN
kissed    kiss     VPAST
the       the      DET
child     child    NOUN
on        on       PREP
the       the      DET
cheek     cheek    NOUN
Slide 3
Word Classes and Tag Sets
Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, ...
Tag sets
Vary in number of tags: from a dozen to over 200
Size of tag set depends on language, objectives, and purpose
Open vs. closed classes
Open: nouns, verbs, adjectives, adverbs
Closed: determiners, pronouns, prepositions, conjunctions
Slide 4
Open Class Words
Every known human language has nouns and verbs
Nouns: people, places, things
Classes of nouns
proper vs. common
count vs. mass
Verbs: actions and processes
Adjectives: properties, qualities
Adverbs: Unfortunately, John walked home extremely slowly yesterday
Numerals: one, two, three, third, ...
Slide 5
Closed Class Words
Differ more from language to language and over time than
open class words
Examples:
prepositions: on, under, over, ...
particles: up, down, on, off, ...
determiners: a, an, the, ...
pronouns: she, who, I, ...
conjunctions: and, but, or, ...
auxiliary verbs: can, may, should, ...
Slide 6
Prepositions from CELEX
Slide 7
Pronouns in CELEX
Slide 8
Conjunctions
Slide 9
Auxiliaries
Slide 10
Word Classes: Tag set example
PRP (personal pronoun), PRP$ (possessive pronoun)
(excerpt from the Penn Treebank tag set; the full table is not reproduced here)
Slide 11
Example of Penn Treebank Tagging of Brown Corpus Sentence
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
number/NN of/IN other/JJ topics/NNS ./.
Book/VB that/DT flight/NN ./.
Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/?
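As an illustration (not from the original slides), a pretrained statistical tagger such as NLTK's produces Penn Treebank tags like those above; since the tagger is stochastic, its output may differ slightly from the hand-tagged examples.

```python
# Sketch: tagging with NLTK's pretrained Penn Treebank tagger.
# Assumes nltk is installed and the 'punkt' and
# 'averaged_perceptron_tagger' resources have been downloaded.
import nltk

tokens = nltk.word_tokenize("Does that flight serve dinner ?")
print(nltk.pos_tag(tokens))
# Roughly: [('Does', 'VBZ'), ('that', 'DT'), ('flight', 'NN'),
#           ('serve', 'VB'), ('dinner', 'NN'), ('?', '.')]
```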
Slide 12
The Problem
Words often have more than one word class: this
This is a nice day = PRP
This day is nice = DT
You can go this far = RB
Slide 13
Part-of-Speech Tagging
Rule-Based Tagger: ENGTWOL (ENGlish TWO Level analysis)
Stochastic Tagger: HMM-based
Transformation-Based Tagger (Brill)
Slide 14
Word prediction
Guess the next word... I notice three guys standing on the ???
- by simply looking at the preceding words and keeping track of some fairly simple counts.
We can formalize this task using what are called N-gram models.
N-grams are token sequences of length N.
The above example contains the following 2-grams (bigrams): (I notice), (notice three), (three guys), (guys standing), (standing on), (on the)
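A minimal sketch of collecting those bigrams from the token sequence (pairing the sequence with itself shifted by one):

```python
# Collect the 2-grams (bigrams) from the example word sequence.
tokens = "I notice three guys standing on the".split()
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
# [('I', 'notice'), ('notice', 'three'), ('three', 'guys'),
#  ('guys', 'standing'), ('standing', 'on'), ('on', 'the')]
```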
Slide 15
N-Gram Models
More formally, we can use knowledge of the counts of N-grams to assess the conditional probability of candidate words as the next word in a sequence.
Or, we can use them to assess the probability of an entire sequence of words.
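As a sketch of the first use (the tiny corpus here is made up for illustration, not taken from the slides), bigram counts let us rank candidate next words:

```python
from collections import Counter

# Hypothetical toy corpus; in practice the counts come from a large corpus.
corpus = ("I notice three guys standing on the corner . "
          "three guys standing on the porch .").split()
bigram_counts = Counter(zip(corpus, corpus[1:]))

# Rank candidates for the word following "the" by bigram count.
candidates = Counter({w2: c for (w1, w2), c in bigram_counts.items()
                      if w1 == "the"})
print(candidates.most_common())  # e.g. [('corner', 1), ('porch', 1)]
```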
Slide 16
Counting
Simple counting lies at the core of any probabilistic approach. So let's first take a look at what we're counting.
He stepped out into the hall, was delighted to encounter a water brother.
15 tokens (counting punctuation)
Not always that simple:
I do uh main- mainly business data processing
Spoken language poses various challenges. Should we count uh and other fillers as tokens? What about the repetition of mainly?
The answers depend on the application. If we're focusing on something like ASR to support indexing for search, then uh isn't helpful (it's not likely to occur as a query). But filled pauses are very useful in dialog management, so we might want them there.
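As a small sketch of the simple written-text case above, here is a basic regular-expression tokenizer that splits off punctuation; this tokenizer is an illustrative assumption, not the one used in the slides:

```python
import re

# Tokenize, treating punctuation marks as separate tokens.
sentence = ("He stepped out into the hall, was delighted to "
            "encounter a water brother.")
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(len(tokens))  # 15 tokens, counting ',' and '.'
```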
Slide 17
Counting: Types and Tokens
How about:
They picnicked by the pool, then lay back on the grass and looked at the stars.
18 tokens (again counting punctuation)
But we might also note that the is used 3 times, so there are only 16 unique types (as opposed to tokens).
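A minimal sketch of the token vs. type distinction for that sentence (punctuation again counted):

```python
# Count tokens and unique types.
tokens = ("They picnicked by the pool , then lay back on the grass "
          "and looked at the stars .").split()
print(len(tokens))       # 18 tokens
print(len(set(tokens)))  # 16 types ("the" occurs 3 times)
```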
Slide 18
Counting: Corpora
So what happens when we look at large bodies of text
instead of single utterances?
Brown et al. (1992): large corpus of English text
583 million wordform tokens
293,181 wordform types
Google
Crawl of 1,024,908,267,229 English tokens
13,588,391 wordform types
That seems like a lot of types... After all, even large dictionaries of English have only around 500k types. Why so many here?
Numbers
Misspellings
Names
Acronyms
etc.
Slide 19
Language Modeling
Back to word prediction
We can model the word prediction task as the ability to assess the conditional probability of a word given the previous words in the sequence: P(wn | w1, w2, ..., wn-1)
Calculating the conditional probability
One way is to use the definition of conditional probabilities and look for counts. So to get
P(the | its water is so transparent that)
By definition that's
P(its water is so transparent that the) / P(its water is so transparent that)
We can get each of those from counts in a large corpus.
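A hedged sketch of that count-based definition (the tiny corpus string and the substring-matching shortcut are illustrative assumptions, not how counting is done over a real corpus):

```python
def conditional_prob(history, word, corpus_text):
    """Estimate P(word | history) as Count(history word) / Count(history).

    The ratio of counts equals the ratio of probabilities, since the
    corpus-size normalizer cancels.
    """
    history_count = corpus_text.count(history)
    joint_count = corpus_text.count(history + " " + word)
    return joint_count / history_count if history_count else 0.0

# Toy corpus for illustration only.
corpus_text = "its water is so transparent that the fish can be seen"
print(conditional_prob("its water is so transparent that", "the",
                       corpus_text))  # 1.0
```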
Slide 20
Very Easy Estimate
How to estimate P(the | its water is so transparent that)?
P(the | its water is so transparent that) =
Count(its water is so transparent that the) / Count(its water is so transparent that)
According to Google those counts are 5 and 9, so the estimate is 5/9. But 2 of those hits were to these slides... So maybe it's really 3/7.