SIMS 290-2: Applied Natural Language Processing, Marti Hearst, Sept 13, 2004


1

SIMS 290-2: Applied Natural Language Processing

Marti Hearst
Sept 13, 2004

 

2

Today

Purpose of Part-of-Speech Tagging
Training and Testing Collections
Intro to N-grams and Language Modeling
Using NLTK for POS Tagging

3

Class Exercise

I will read off a few words from the beginning of a sentence.
You should write down the very first 2 words that come to mind that should follow these words.
Example:
– I say “One fish”
– You write “two fish”
Don’t second-guess or try to be clever.
Note: there are no correct answers.

4
Modified from Diane Litman's version of Steve Bird's notes

Terminology

Tagging: the process of associating labels with each token in a text

Tags: the labels

Tag set: the collection of tags used for a particular task

5
Modified from Diane Litman's version of Steve Bird's notes

Example

Typically a tagged text is a sequence of white-space separated base/tag tokens:

The/at Pantheon’s/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
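As a rough illustration of this format, the sketch below splits a tagged string into (word, tag) pairs in plain Python. The function name `parse_tagged` and the shortened sample string are assumptions made for this example; they are not part of NLTK or the original slides.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so a token such as ',/,' yields (',', ',').
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


sample = "The/at Pantheon's/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn"
pairs = parse_tagged(sample)
print(pairs[:4])
```

Splitting on the last slash rather than the first is a design choice that keeps words containing a slash intact.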

6
Modified from Diane Litman's version of Steve Bird's notes

What does Tagging do?

1. Collapses Distinctions
• Lexical identity may be discarded
• e.g. all personal pronouns tagged with PRP

2. Introduces Distinctions
• Ambiguities may be removed
• e.g. deal tagged with NN or VB
• e.g. deal tagged with DEAL1 or DEAL2

3. Helps classification and prediction

7
Modified from Diane Litman's version of Steve Bird's notes

Significance of Parts of Speech

A word’s POS tells us a lot about the word and its neighbors:

Limits the range of meanings (deal), pronunciation (OBject vs. obJECT) or both (wind)
Helps in stemming
Limits the range of following words for Speech Recognition
Can help select nouns from a document for IR
Basis for partial parsing (chunked parsing)
Parsers can build trees directly on the POS tags instead of maintaining a lexicon

8
Slide modified from Massimo Poesio's

Choosing a tagset

The choice of tagset greatly affects the difficulty of the problem
Need to strike a balance between
– Getting better information about context (best: introduce more distinctions)
– Making it possible for classifiers to do their job (need to minimize distinctions)

9
Slide modified from Massimo Poesio's

Some of the best-known Tagsets

Brown corpus: 87 tags
Penn Treebank: 45 tags
Lancaster UCREL C5 (used to tag the BNC): 61 tags
Lancaster C7: 145 tags

10
Modified from Diane Litman's version of Steve Bird's notes

The Brown Corpus

The first digital corpus (1961)
– Francis and Kucera, Brown University

Contents: 500 texts, each 2000 words long
– From American books, newspapers, magazines
– Representing genres: science fiction, romance fiction, press reportage, scientific writing, popular lore

11
Modified from Diane Litman's version of Steve Bird's notes

Penn Treebank

First syntactically annotated corpus
1 million words from Wall Street Journal
Part of speech tags and syntax trees

12
Slide modified from Massimo Poesio's

How hard is POS tagging?

Number of tags    Number of word types
1                 35340
2                 3760
3                 264
4                 61
5                 12
6                 2
7                 1

In the Brown corpus:
– 11.5% of word types are ambiguous
– 40% of word TOKENS are ambiguous

13
Slide modified from Massimo Poesio's

Important Penn Treebank tags

14
Slide modified from Massimo Poesio's

Verb inflection tags

15
Slide modified from Massimo Poesio's

The entire Penn Treebank tagset

16
Slide modified from Massimo Poesio's

Quick test

DoCoMo and Sony are to develop a chip that would let people pay for goods through their mobiles.

17

Tagging methods

Hand-coded
Statistical taggers
Brill (transformation-based) tagger

18
Modified from Diane Litman's version of Steve Bird's notes

Reading Tagged Corpora

>>> from nltk.corpus import brown
>>> corpus = brown.read('ca01')
>>> corpus['WORDS'][0:10]
[<The/at>, <Fulton/np-tl>, <County/nn-tl>, <Grand/jj-tl>,
 <Jury/nn-tl>, <said/vbd>, <Friday/nr>, <an/at>, <investigation/nn>, <of/in>]
>>> corpus['WORDS'][2]['TAG']
'nn-tl'
>>> corpus['WORDS'][2]['TEXT']
'County'

19
Modified from Diane Litman's version of Steve Bird's notes

Default Tagger

We need something to use for unseen words
– E.g., guess NNP for a word with an initial capital

How to do this?
– Apply a sequence of regular expression tests
– Assign the word to a suitable tag

If there are no matches…
– Assign to the most frequent unknown tag, NN
– Other common ones are verb, proper noun, adjective

Note the role of closed-class words in English
– Prepositions, auxiliaries, etc.
– New ones do not tend to appear.

20
Modified from Diane Litman's version of Steve Bird's notes

A Default Tagger

>>> from nltk.tokenizer import *
>>> from nltk.tagger import *
>>> text_token = Token(TEXT="John saw 3 polar bears .")
>>> WhitespaceTokenizer().tokenize(text_token)
>>> # The decimal point is escaped so that '.' matches only a literal dot
>>> NN_CD_tagger = RegexpTagger([(r'^[0-9]+(\.[0-9]+)?$', 'cd'), (r'.*', 'nn')])
>>> NN_CD_tagger.tag(text_token)
<[<John/nn>, <saw/nn>, <3/cd>, <polar/nn>, <bears/nn>, <./nn>]>

NN_CD_tagger assigns cd to numbers and nn to everything else.
Poor performance (20-30%) in isolation, but when used with other taggers it can significantly improve performance.
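The same idea can be sketched without the 2004 NLTK classes, using only Python's `re` module. The names `RULES` and `regexp_tag` are hypothetical and exist only for this illustration; the patterns mirror the ones on the slide, with the decimal point escaped.

```python
import re

# Ordered (pattern, tag) rules; the first pattern that matches wins.
RULES = [
    (re.compile(r"^[0-9]+(\.[0-9]+)?$"), "cd"),  # integers and decimals
    (re.compile(r".*"), "nn"),                   # fallback for everything else
]


def regexp_tag(words):
    """Tag each word with the tag of the first matching rule."""
    tagged = []
    for word in words:
        for pattern, tag in RULES:
            if pattern.match(word):
                tagged.append((word, tag))
                break
    return tagged


result = regexp_tag("John saw 3 polar bears .".split())
print(result)
```

Because the last rule matches any string, every word receives some tag, which is what makes this kind of tagger usable as a final fallback.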

21
Modified from Diane Litman's version of Steve Bird's notes

Finding the most frequent tag

>>> from nltk.probability import FreqDist
>>> from nltk.corpus import brown
>>> fd = FreqDist()
>>> corpus = brown.read('ca01')
>>> for token in corpus['WORDS']:
...     fd.inc(token['TAG'])
...
>>> fd.max()
>>> fd.count(fd.max())
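For readers without the 2004 NLTK release, a minimal sketch of the same count using the standard library's `collections.Counter` might look like this. The small `tagged` list is made-up illustrative data, not taken from the Brown corpus file.

```python
from collections import Counter

# Toy (word, tag) data standing in for a tagged corpus.
tagged = [
    ("The", "at"), ("jury", "nn"), ("said", "vbd"), ("an", "at"),
    ("investigation", "nn"), ("of", "in"), ("the", "at"),
    ("election", "nn"), ("produced", "vbd"), ("evidence", "nn"),
]

# Count how often each tag occurs, ignoring the words.
tag_counts = Counter(tag for _, tag in tagged)

# most_common(1) returns a one-element list of (tag, count).
most_frequent_tag, frequency = tag_counts.most_common(1)[0]
print(most_frequent_tag, frequency)
```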

22

Evaluating the Tagger

This gets 2 wrong out of 16, or 12.5% error.
Can also say an accuracy of 87.5%.

23

Training vs. Testing

A fundamental idea in computational linguistics
Start with a collection labeled with the right answers
– Supervised learning
– Usually the labels are done by hand
“Train” or “teach” the algorithm on a subset of the labeled text.
Test the algorithm on a different set of data.
Why?
– If memorization worked, we’d be done.
– Need to generalize so the algorithm works on examples that you haven’t seen yet.
– Thus testing only makes sense on examples you didn’t train on.

NLTK has an excellent interface for doing this easily.
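Independently of any NLTK interface, the split itself can be sketched in a few lines. The helper `split_train_test` and the 90/10 proportion are assumptions chosen for illustration; the slides do not prescribe a particular ratio.

```python
def split_train_test(items, train_percent=90):
    """Return (train, test): the first train_percent% of items, then the rest."""
    # Integer arithmetic avoids floating-point rounding surprises.
    cut = len(items) * train_percent // 100
    return items[:cut], items[cut:]


# Placeholder "sentences"; in practice these would be tagged sentences.
sentences = [f"sentence-{i}" for i in range(10)]
train, test = split_train_test(sentences)
print(len(train), len(test))
```

The two slices never overlap, so nothing the tagger is tested on was seen during training.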

24

Training the Unigram Tagger

25

Creating Separate Training and Testing Sets

26
Modified from Diane Litman's version of Steve Bird's notes

Evaluating a Tagger

Tagged tokens – the original data
Untag (exclude) the data
Tag the data with your own tagger
Compare the original and new tags
– Iterate over the two lists checking for identity and counting
– Accuracy = fraction correct
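The comparison step can be sketched as follows. The function name `accuracy` and the two tag lists are invented for this example; they are not the helper used in the course materials.

```python
def accuracy(gold_tags, predicted_tags):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy of an empty sequence")
    correct = sum(1 for g, p in zip(gold_tags, predicted_tags) if g == p)
    return correct / len(gold_tags)


gold = ["at", "nn", "vbd", "at", "nn", "in", "at", "nn"]
pred = ["at", "nn", "nn", "at", "nn", "in", "at", "vb"]
acc = accuracy(gold, pred)
print(acc)
```

Here two of the eight predictions differ from the gold tags, so the sketch reports an accuracy of 0.75 and an error rate of 0.25.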

27

Assessing the Errors

Why the tuple method? Dictionary keys must be hashable and lists are not, so convert lists to tuples.

exclude returns a new token containing only the properties that are not named in the given list.

28

Assessing the Errors

29

Language Modeling

Another fundamental concept in NLP
Main idea:
– For a given language, some words are more likely than others to follow each other, or
– You can predict (with some degree of accuracy) the probability that a given word will follow another word.
Illustration:
– Distributions of words in class-participation exercise.
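One simple way to estimate how likely a word is to follow another is to count adjacent pairs and normalize. The sketch below does this on a toy word list; `bigram_model` and `follow_prob` are hypothetical names, and the maximum-likelihood estimate used here is only one of several possible estimators.

```python
from collections import Counter, defaultdict


def bigram_model(tokens):
    """Map each word to a Counter of the words observed right after it."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1
    return follows


def follow_prob(model, prev, nxt):
    """Estimate P(nxt | prev) as count(prev, nxt) / count(prev, anything)."""
    counts = model.get(prev, Counter())
    total = sum(counts.values())
    return counts[nxt] / total if total else 0.0


tokens = "one fish two fish red fish blue fish".split()
model = bigram_model(tokens)
print(follow_prob(model, "one", "fish"))
print(follow_prob(model, "fish", "two"))
```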

30

N-Grams

The N stands for how many terms are used
– Unigram: 1 term
– Bigram: 2 terms
– Trigram: 3 terms
– Usually don’t go beyond this

You can use different kinds of terms, e.g.:
– Character-based n-grams
– Word-based n-grams
– POS-based n-grams

Ordering
– Often adjacent, but not required

We use n-grams to help determine the context in which some linguistic phenomenon happens.
– E.g., look at the words before and after the period to see if it is the end of a sentence or not.
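A small sketch of extracting adjacent n-grams from any sequence of terms follows. The function `ngrams` here is written from scratch for illustration and handles only the adjacent case mentioned above.

```python
def ngrams(terms, n):
    """Return all runs of n adjacent terms as tuples, in order."""
    return [tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)]


words = "one fish two fish".split()
word_bigrams = ngrams(words, 2)      # word-based bigrams
word_trigrams = ngrams(words, 3)     # word-based trigrams
char_bigrams = ngrams(list("fish"), 2)  # character-based bigrams
print(word_bigrams)
print(char_bigrams)
```

The same function works for POS-based n-grams if it is given a list of tags instead of words.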

31
Modified from Diane Litman's version of Steve Bird's notes

Features and Contexts

w_n-2    w_n-1    w_n      w_n+1
CONTEXT           FEATURE  CONTEXT
t_n-2    t_n-1    t_n      t_n+1

32
Modified from Diane Litman's version of Steve Bird's notes

Unigram Tagger

Trained using a tagged corpus to determine which tags are most common for each word.
– E.g. in tagged WSJ sample, “deal” is tagged with NN 11 times, with VB 1 time, and with VBP 1 time

Performance is highly dependent on the quality of its training set.
– Can’t be too small
– Can’t be too different from texts we actually want to tag
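A minimal sketch of this idea in plain Python is shown below. The class `SimpleUnigramTagger` is a hypothetical stand-in written for this example, not the NLTK `UnigramTagger`; the training data reproduces the “deal” counts quoted above.

```python
from collections import Counter, defaultdict


class SimpleUnigramTagger:
    """Remember, for each word, how often each tag was seen in training."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, tagged_tokens):
        for word, tag in tagged_tokens:
            self.counts[word][tag] += 1

    def tag_word(self, word):
        # Unseen words get None so that another tagger can take over.
        if word not in self.counts:
            return None
        return self.counts[word].most_common(1)[0][0]


tagger = SimpleUnigramTagger()
tagger.train([("deal", "NN")] * 11 + [("deal", "VB"), ("deal", "VBP")])
print(tagger.tag_word("deal"))
print(tagger.tag_word("Pantheon"))
```

Returning `None` for unseen words makes the dependence on training data explicit: anything absent from the training set simply cannot be tagged by this component.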

33
Modified from Diane Litman's version of Steve Bird's notes

Nth Order Tagging

Order refers to how much context
It’s one less than the N in N-gram here because we use the target word itself as part of the context.
– 0th order = unigram tagger
– 1st order = bigrams
– 2nd order = trigrams

Bigram tagger
For tagging, in addition to considering the token’s type, the context also considers the tags of the n preceding tokens
What is the most likely tag for w_n, given w_n-1 and t_n-1?
The tagger picks the tag which is most likely for that context.
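To make the notion of context concrete, here is a rough first-order sketch in which the context is the pair (previous tag, current word). That choice of context, the `<s>` start marker, and the function names are assumptions for illustration; the NLTK `NthOrderTagger` may define its context differently.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker for "no previous tag"


def train_bigram(tagged_sentences):
    """Count tags for each (previous tag, current word) context."""
    table = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev_tag = START
        for word, tag in sentence:
            table[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return table


def tag_bigram(table, words):
    """Pick the most frequent tag for each context; None if unseen."""
    tags = []
    prev_tag = START
    for word in words:
        context = (prev_tag, word)
        tag = table[context].most_common(1)[0][0] if context in table else None
        tags.append(tag)
        prev_tag = tag
    return tags


table = train_bigram([
    [("to", "TO"), ("race", "VB")],
    [("the", "DT"), ("race", "NN")],
])
print(tag_bigram(table, ["to", "race"]))
print(tag_bigram(table, ["the", "race"]))
print(tag_bigram(table, ["a", "race"]))
```

In the last call the unseen word “a” gets no tag, and the missing tag then leaves “race” untagged as well, which shows why a higher-order tagger needs something to fall back on.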

34

Reading the Bigram table

The current word

The previously seen tag

The predicted POS

35
Modified from Massimo Poesio's lecture

Tagging with lexical frequencies

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
Problem: assign a tag to race given its lexical frequency
Solution: we choose the tag that has the greater of
– P(race|VB)
– P(race|NN)
Actual estimate from the Switchboard corpus:
– P(race|NN) = .00041
– P(race|VB) = .00003

36
Modified from Diane Litman's version of Steve Bird's notes

Combining Taggers

Use more accurate algorithms when we can, backoff to wider coverage when needed.

Try tagging the token with the 1st order tagger. If the 1st order tagger is unable to find a tag for the token, try finding a tag with the 0th order tagger. If the 0th order tagger is also unable to find a tag, use the NN_CD_Tagger to find a tag.

37
Modified from Diane Litman's version of Steve Bird's notes

BackoffTagger class

>>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)

# Construct the taggers
>>> tagger1 = NthOrderTagger(1, SUBTOKENS='WORDS')
>>> tagger2 = UnigramTagger() # 0th order
>>> tagger3 = NN_CD_Tagger()

# Train the taggers
>>> for tok in train_toks:
...     tagger1.train(tok)
...     tagger2.train(tok)

38
Modified from Diane Litman's version of Steve Bird's notes

Backoff (continued)

# Combine the taggers (in order, by specificity)
>>> tagger = BackoffTagger([tagger1, tagger2, tagger3])

# Use the combined tagger
>>> accuracy = tagger_accuracy(tagger, unseen_tokens)
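The backoff logic itself is small enough to sketch in plain Python. In this simplified version each tagger is just a function from a word to a tag or `None`, which ignores context; `backoff_tag`, `lookup_tagger`, `default_tagger`, and the tiny `lexicon` are all hypothetical names and data for this illustration.

```python
import re


def backoff_tag(word, taggers):
    """Try each tagger in order and return the first tag that is not None."""
    for tagger in taggers:
        tag = tagger(word)
        if tag is not None:
            return tag
    return None


# A toy lexicon standing in for a trained unigram tagger.
lexicon = {"the": "at", "race": "nn"}


def lookup_tagger(word):
    return lexicon.get(word)  # None for words not in the lexicon


def default_tagger(word):
    # Numbers get 'cd'; everything else falls back to 'nn'.
    return "cd" if re.fullmatch(r"[0-9]+(\.[0-9]+)?", word) else "nn"


chain = [lookup_tagger, default_tagger]
print([backoff_tag(w, chain) for w in ["the", "42", "Pantheon"]])
```

Ordering the chain from most specific to most general means the default tagger is consulted only when the more accurate component has nothing to say.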

39
Modified from Diane Litman's version of Steve Bird's notes

Rule-Based Tagger

The Linguistic Complaint
– Where is the linguistic knowledge of a tagger?
– Just a massive table of numbers
– Aren’t there any linguistic insights that could emerge from the data?
– Could thus use handcrafted sets of rules to tag input sentences, for example: if a word follows a determiner, tag it as a noun.

40
Slide modified from Massimo Poesio's

The Brill tagger

An example of TRANSFORMATION-BASED LEARNING
Very popular (freely available, works fairly well)
A SUPERVISED method: requires a tagged corpus
Basic idea: do a quick job first (using frequency), then revise it using contextual rules

41

Brill Tagging: In more detail

Start with simple (less accurate) rules… learn better ones from tagged corpus
– Tag each word initially with most likely POS
– Examine set of transformations to see which improves tagging decisions compared to tagged corpus
– Re-tag corpus using best transformation
– Repeat until, e.g., performance doesn’t improve
– Result: tagging procedure (ordered list of transformations) which can be applied to new, untagged text

42
Slide modified from Massimo Poesio's

An example

Examples:
– It is expected to race tomorrow.
– The race for outer space.

Tagging algorithm:
1. Tag all uses of “race” as NN (most likely tag in the Brown corpus)
• It is expected to race/NN tomorrow
• the race/NN for outer space
2. Use a transformation rule to replace the tag NN with VB for all uses of “race” preceded by the tag TO:
• It is expected to race/VB tomorrow
• the race/NN for outer space
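Applying one such rule can be sketched as below. The function `apply_rule` and the initial tags for the words other than “race” are assumptions made for this example; the rule also checks the previous tag in the input rather than in the partially rewritten output, which is a simplification.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the preceding token has prev_tag."""
    result = []
    for i, (word, tag) in enumerate(tagged):
        # Look at the previous tag in the ORIGINAL input sequence.
        if tag == from_tag and i > 0 and tagged[i - 1][1] == prev_tag:
            tag = to_tag
        result.append((word, tag))
    return result


first = [("It", "PRP"), ("is", "VBZ"), ("expected", "VBN"),
         ("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
second = [("the", "DT"), ("race", "NN"), ("for", "IN"),
          ("outer", "JJ"), ("space", "NN")]

first_fixed = apply_rule(first, "NN", "VB", "TO")
second_fixed = apply_rule(second, "NN", "VB", "TO")
print(first_fixed[4], second_fixed[1])
```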

43
Slide modified from Massimo Poesio's

Transformation-based learning in the Brill tagger

1. Tag the corpus with the most likely tag for each word
2. Choose a TRANSFORMATION that deterministically replaces an existing tag with a new one such that the resulting tagged corpus has the lowest error rate
3. Apply that transformation to the training corpus
4. Repeat
5. Return a tagger that
a. first tags using unigrams
b. then applies the learned transformations in order
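The loop in steps 2 to 4 can be sketched with a single rule template, “change tag A to tag B when the previous tag is C”. This is a deliberately tiny approximation: the real Brill tagger uses many templates and a more efficient search, and the names `learn_rules`, `apply_rule`, and `count_errors` are hypothetical.

```python
def apply_rule(tags, rule):
    """Apply (from_tag, to_tag, prev_tag) to a list of tags."""
    from_tag, to_tag, prev_tag = rule
    return [
        to_tag if tag == from_tag and i > 0 and tags[i - 1] == prev_tag else tag
        for i, tag in enumerate(tags)
    ]


def count_errors(tags, gold):
    return sum(1 for t, g in zip(tags, gold) if t != g)


def learn_rules(initial_tags, gold_tags, max_rules=10):
    """Greedily pick the rule that most reduces errors, until none helps."""
    tagset = sorted(set(initial_tags) | set(gold_tags))
    current = list(initial_tags)
    learned = []
    for _ in range(max_rules):
        best_rule, best_errors = None, count_errors(current, gold_tags)
        for from_tag in tagset:
            for to_tag in tagset:
                if from_tag == to_tag:
                    continue
                for prev_tag in tagset:
                    rule = (from_tag, to_tag, prev_tag)
                    errors = count_errors(apply_rule(current, rule), gold_tags)
                    if errors < best_errors:
                        best_rule, best_errors = rule, errors
        if best_rule is None:
            break  # no candidate improves on the current tagging
        current = apply_rule(current, best_rule)
        learned.append(best_rule)
    return learned, current


# Tags for "to race tomorrow the race": initial guess vs. gold standard.
initial = ["TO", "NN", "NN", "DT", "NN"]
gold = ["TO", "VB", "NN", "DT", "NN"]
learned, final = learn_rules(initial, gold)
print(learned)
```

On this toy input the only rule that removes the single error is the one that rewrites NN as VB after TO, so the learned list contains just that transformation.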

44
Slide modified from Massimo Poesio's

Examples of learned transformations

45
Slide modified from Massimo Poesio's

Templates

46
Adapted from Massimo Poesio's

Additional issues

Most of the difference in performance between POS algorithms depends on their treatment of UNKNOWN WORDS

Multiple token words (‘Penn Treebank’)

Class-based N-grams

47

Upcoming

I will email the procedures for turning in the first assignment on Wed Sept 15
– Will be over the web

On Wed I’ll discuss shallow parsing
– Start reading the Chunking (Shallow Parsing) tutorial
– I will assign homework from this on Wed, due in one week on Sept 22.

Next Monday I’ll briefly discuss syntactic parsing
– There is a tutorial on this; feel free to read it
– In the interests of reducing workload, I’m not assigning it however