
Page 1: Sequence Labeling I - IIT Delhi

Mausam

(Slides based on Michael Collins, Dan Klein, Chris Manning, Dan Jurafsky, Heng Ji, Luke Zettlemoyer, Alex Simma, Erik Sudderth, David Fernandez-Baca, Drena Dobbs, Serafim Batzoglou, William Cohen, Andrew McCallum, Dan Weld)

Sequence Labeling I: POS Tagging with Hidden Markov Models

Page 2: Sequence Labeling I - IIT Delhi

Sequence problems

• Many problems in NLP have data which is a sequence of characters, words, phrases, lines, or sentences …

• We can think of our task as one of labeling each item

POS tagging:
VBG NN IN DT NN IN NN
Chasing opportunity in an age of upheaval

Word segmentation:
B B I I B I B I B B
而 相 对 于 这 些 品 牌 的 价

Named entity recognition:
PERS O O O ORG ORG
Murdoch discusses future of News Corp.

Text segmentation:
Q A Q A A A Q A

Page 3: Sequence Labeling I - IIT Delhi

Example: Speech Recognition

• Given an audio waveform, we would like to robustly extract and recognize any spoken words

• Observations: acoustics

• Labels: words

(S. Roweis, 2004)

Page 4: Sequence Labeling I - IIT Delhi

POS Tagging


DT NN IN NN VBD NNS VBD

The average of interbank offered rates plummeted …

DT NNP NN VBD VBN RP NN NNS

The Georgia branch had taken on loan commitments …

Observations: the sentence. Tagging: a POS tag for each word.

Page 5: Sequence Labeling I - IIT Delhi


What is Part-of-Speech (POS)

• Generally speaking, Word Classes (=POS) :

– Verb, Noun, Adjective, Adverb, Article, …

• We can also include inflection:

– Verbs: Tense, number, …

– Nouns: Number, proper/common, …

– Adjectives: comparative, superlative, …

– …

• Lots of debate within linguistics about the number, nature, and universality of these

• We’ll completely ignore this debate.

Page 6: Sequence Labeling I - IIT Delhi


Penn TreeBank POS Tag Set

• Penn Treebank: hand-annotated corpus of Wall Street Journal, 1M words

• 45 tags

• Some particularities:

– to /TO not disambiguated

– Auxiliaries and verbs not distinguished

Page 7: Sequence Labeling I - IIT Delhi


Penn Treebank Tagset

Page 8: Sequence Labeling I - IIT Delhi

Open class (lexical) words:

– Nouns: proper (IBM, Italy) and common (cat/cats, snow)
– Verbs (main): see, registered
– Adjectives: old, older, oldest
– Adverbs: slowly
– Numbers: 122,312, one
– … more

Closed class (functional) words:

– Modals: can, had
– Prepositions: to, with
– Particles: off, up
– Determiners: the, some
– Conjunctions: and, or
– Pronouns: he, its
– Interjections: Ow, Eh
– … more

Page 9: Sequence Labeling I - IIT Delhi

Open vs. Closed classes

• Open vs. Closed classes

– Closed:
• determiners: a, an, the
• pronouns: she, he, I
• prepositions: on, under, over, near, by, …
• Usually function words (short common words which play a role in grammar)
• Why "closed"? New members are rarely added to these classes.

– Open:
• Nouns, Verbs, Adjectives, Adverbs.

Page 10: Sequence Labeling I - IIT Delhi

Open Class Words

• Nouns
– Proper nouns (Boulder, Granby, Eli Manning)
• English capitalizes these.
– Common nouns (the rest).
– Count nouns and mass nouns
• Count: have plurals, get counted: goat/goats, one goat, two goats
• Mass: don't get counted (snow, salt, communism) (*two snows)

• Adverbs: tend to modify verbs
– Unfortunately, John walked home extremely slowly yesterday
– Directional/locative adverbs (here, home, downhill)
– Degree adverbs (extremely, very, somewhat)
– Manner adverbs (slowly, slinkily, delicately)

• Verbs
– In English, have morphological affixes (eat/eats/eaten)

Page 11: Sequence Labeling I - IIT Delhi


Closed Class Words

Examples:

– prepositions: on, under, over, …

– particles: up, down, on, off, …

– determiners: a, an, the, …

– pronouns: she, who, I, …

– conjunctions: and, but, or, …

– auxiliary verbs: has, been, do, …

– numerals: one, two, three, third, …

– modal verbs: can, may, should, …

Page 12: Sequence Labeling I - IIT Delhi


Prepositions from CELEX

Page 13: Sequence Labeling I - IIT Delhi


English Particles

Page 14: Sequence Labeling I - IIT Delhi


Conjunctions

Page 15: Sequence Labeling I - IIT Delhi

POS Tagging Ambiguity

• Words often have more than one POS: back

– The back door = JJ

– On my back = NN

– Win the voters back = RB

– Promised to back the bill = VB

• The POS tagging problem is to determine the POS tag for a particular instance of a word.

Page 16: Sequence Labeling I - IIT Delhi

POS Tagging

• Input: Plays well with others

• Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS

• Output: Plays/VBZ well/RB with/IN others/NNS

• Uses:

– Text-to-speech (how do we pronounce “lead”?)

– Can write regexps like (Det) Adj* N+ over the output for phrases, etc.

– An early step in NLP pipeline: output used later

– If you know the tag, you can back off to it in other tasks

(Tags shown are Penn Treebank POS tags.)

Page 17: Sequence Labeling I - IIT Delhi

Human Upper Bound

• Deciding on the correct part of speech can be difficult even for people

• Mrs/NNP Shaefer/NNP never/RB got/VBD around/?? to/TO joining/VBG

• All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/?? the/DT corner/NN

• Chateau/NNP Petrus/NNP costs/VBZ around/?? 250/CD

Page 18: Sequence Labeling I - IIT Delhi

Human Upper Bound

• Deciding on the correct part of speech can be difficult even for people

• Mrs/NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG

• All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN

• Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD

Page 19: Sequence Labeling I - IIT Delhi


Measuring Ambiguity

Page 20: Sequence Labeling I - IIT Delhi

How hard is POS tagging?

• About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech

• But they tend to be very common words. E.g., that:
– I know that he is honest = IN

– Yes, that play was nice = DT

– You can’t go that far = RB

• 40% of the word tokens are ambiguous

Page 21: Sequence Labeling I - IIT Delhi

POS tagging performance

• How many tags are correct? (Tag accuracy)

– About 97% currently

– But baseline is already 90%

• Baseline is performance of stupidest possible method:
– Tag every word with its most frequent tag

– Tag unknown words as nouns

– Partly easy because

• Many words are unambiguous

• You get points for them (the, a, etc.) and for punctuation marks!
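This baseline is easy to build. A minimal Python sketch (the helper names are mine, not the slides'):

    from collections import Counter, defaultdict

    def train_baseline(tagged_sents):
        """Most-frequent-tag baseline: tag each known word with its most
        common training tag, and every unknown word as NN (noun)."""
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        return lambda words: [best.get(w, "NN") for w in words]

    # toy usage
    tagger = train_baseline([[("the", "DT"), ("back", "NN")],
                             [("the", "DT"), ("back", "JJ"), ("door", "NN")]])
    print(tagger(["the", "back", "door"]))  # ['DT', 'NN', 'NN']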

Page 22: Sequence Labeling I - IIT Delhi

History of POS Tagging

The timeline (1960–2000) runs roughly:

– Brown Corpus created (EN-US, 1 million words)
– Greene and Rubin rule-based tagger: ~70%
– Brown Corpus tagged
– LOB Corpus created (EN-UK, 1 million words)
– HMM tagging (CLAWS): 93%–95%
– LOB Corpus tagged
– DeRose/Church: efficient HMMs, sparse data, 95%+
– POS tagging separated from other NLP
– Penn Treebank corpus (WSJ, 4.5M words)
– Transformation-based tagging (Eric Brill), rule-based: 95%+
– British National Corpus (tagged by CLAWS)
– Tree-based statistics (Helmut Schmid): 96%+
– Neural network taggers: 96%+
– Trigram tagger (Kempe): 96%+
– Combined methods: 98%+

Page 23: Sequence Labeling I - IIT Delhi

Sources of information

• What are the main sources of information for POS tagging?

– Knowledge of neighboring words

• Bill saw that man yesterday
NNP NN DT NN NN
VB VB(D) IN VB NN
(two candidate tags per word)

– Knowledge of word probabilities

• man is rarely used as a verb….

• The latter proves the most useful, but the former also helps

Page 24: Sequence Labeling I - IIT Delhi

Markov Chain

• Set of states

– Initial probabilities

– Transition probabilities

A Markov chain models system dynamics.

Page 25: Sequence Labeling I - IIT Delhi

Markov Chains: Language Models

(Figure: a Markov chain over words, starting from <s>, with transition probabilities such as 0.5, 0.3, 0.2, … on the arcs.)

Page 26: Sequence Labeling I - IIT Delhi

Hidden Markov Model

• Set of states

– Initial probabilities

– Transition probabilities

• Set of potential observations

– Emission/Observation probabilities

An HMM generates an observation sequence w1 w2 w3 w4 w5 …

Page 27: Sequence Labeling I - IIT Delhi

Hidden Markov Models (HMMs)

Two views:

– Finite state machine: a hidden state sequence generates an observation sequence w1 w2 … w8.

– Graphical model: a chain of hidden states … Y_{t-2}, Y_{t-1}, Y_t …, each emitting an observation … X_{t-2}, X_{t-1}, X_t …

Random variable Y_t takes values from {s1, s2, s3, s4}; random variable X_t takes values from {w1, w2, w3, w4, w5, …}.


Page 29: Sequence Labeling I - IIT Delhi

HMM

Graphical model: the hidden states are tags, the observations are words.

Random variable Y_t takes values from {s1, s2, s3, s4} (tags); random variable X_t takes values from {w1, w2, w3, w4, w5, …} (words).

Need parameters:
– Start state probabilities: P(Y_1 = s_k)
– Transition probabilities: P(Y_t = s_i | Y_{t-1} = s_k)
– Observation probabilities: P(X_t = w_j | Y_t = s_k)

Page 30: Sequence Labeling I - IIT Delhi

Hidden Markov Models for Text

• Just another graphical model…

• "Conditioned on the present, the past & future are independent"

• Hidden states: the tags y_1 … y_T, padded with <s> and </s>; observed vars: the words w_1 … w_T

• The transition distribution q and the observation distribution e define the joint probability:

P(y, w) = ∏_{t=1}^{T+1} q(y_t | y_{t-1}) · ∏_{t=1}^{T} e(w_t | y_t)

Page 31: Sequence Labeling I - IIT Delhi

HMM Generative Process

We can easily sample sequence pairs X_{1:n}, Y_{1:n}:

Sample the initial state: y_0 = <s>
For i = 1 … n:
  Sample y_i from the distribution q(y_i | y_{i-1})
  Sample x_i from the distribution e(x_i | y_i)
Finally, sample the stop symbol </s> from q(</s> | y_n)

(sketched in code below)
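A minimal Python sketch of this generative loop, assuming bigram transitions stored as nested dicts; the toy q and e values are taken from the fish/sleep example later in these slides:

    import random

    def sample_hmm(q, e, start="<s>", stop="</s>"):
        """Sample one (words, tags) pair from a bigram HMM.
        q: dict, previous tag -> {next tag or "</s>": prob}
        e: dict, tag -> {word: prob}
        (This representation is an assumption of the sketch.)"""
        def draw(dist):
            r, acc = random.random(), 0.0
            for item, p in dist.items():
                acc += p
                if r <= acc:
                    return item
            return item  # guard against floating-point rounding

        words, tags = [], []
        prev = start
        while True:
            tag = draw(q[prev])            # sample y_i ~ q(. | y_{i-1})
            if tag == stop:                # sampled </s>: stop
                break
            words.append(draw(e[tag]))     # sample x_i ~ e(. | y_i)
            tags.append(tag)
            prev = tag
        return words, tags

    # Toy parameters, taken from the fish/sleep example later in these slides:
    q = {"<s>": {"noun": 0.8, "verb": 0.2},
         "noun": {"verb": 0.8, "noun": 0.1, "</s>": 0.1},
         "verb": {"noun": 0.2, "verb": 0.1, "</s>": 0.7}}
    e = {"noun": {"fish": 0.8, "sleep": 0.2},
         "verb": {"fish": 0.5, "sleep": 0.5}}

    print(sample_hmm(q, e))  # e.g. (['fish', 'sleep'], ['noun', 'verb'])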

Page 32: Sequence Labeling I - IIT Delhi

Example: POS Tagging

Setup:

– states S = {DT, NNP, NN, …} are the POS tags
– observations w ∈ V are words
– transition dist'n q(y_i | y_{i-1}) models the tag sequences (neighboring states)
– observation dist'n e(w_i | y_i) models words given their POS (current word)

Most important task: tagging. Decoding finds the most likely tag sequence for words w:

argmax_{y_1 … y_n} P(y_1, …, y_n | w_1, …, w_n)

Subtlety: a tag does not depend on neighboring words directly; they influence it only through the neighboring tags.

Page 33: Sequence Labeling I - IIT Delhi

Trigram HMMs

• y_0 = y_{-1} = <s>, y_{T+1} = </s>

• Parameters
– q(s | u, v) for s ∈ S ∪ {</s>}, u, v ∈ S ∪ {<s>}
– e(w | s) for w ∈ V and s ∈ S

• Joint probability:

P(y, w) = ∏_{t=1}^{T+1} q(y_t | y_{t-2}, y_{t-1}) · ∏_{t=1}^{T} e(w_t | y_t)

(compare the bigram HMM: P(y, w) = ∏_{t=1}^{T+1} q(y_t | y_{t-1}) · ∏_{t=1}^{T} e(w_t | y_t))

Page 34: Sequence Labeling I - IIT Delhi

Parameter Estimation

Counting & Smoothing

Maximum-likelihood estimates are ratios of counts, smoothed by interpolating trigram, bigram, and unigram estimates:

q(y_t | y_{t-2}, y_{t-1}) = λ_1 · c(y_{t-2}, y_{t-1}, y_t) / c(y_{t-2}, y_{t-1}) + λ_2 · c(y_{t-1}, y_t) / c(y_{t-1}) + λ_3 · c(y_t) / N

e(w_t | y_t) = c(y_t, w_t) / c(y_t)

Unsmoothed emission estimates are a bad idea: zeros! And how do we smooth a really low-frequency word?
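A sketch of these counts in Python; the interpolation weights and helper names are illustrative assumptions, not the slides' exact recipe:

    from collections import Counter

    def estimate_hmm(tagged_sents, lambdas=(0.7, 0.2, 0.1)):
        """Count-based estimates for a trigram HMM tagger.
        tagged_sents: list of sentences, each a list of (word, tag) pairs.
        lambdas: interpolation weights (illustrative; tuned on held-out
        data in practice)."""
        tri, ctx, bi, prev, uni, emit = (Counter() for _ in range(6))
        for sent in tagged_sents:
            tags = ["<s>", "<s>"] + [t for _, t in sent] + ["</s>"]
            for w, t in sent:
                emit[(t, w)] += 1                            # c(y_t, w_t)
            for i in range(2, len(tags)):
                tri[(tags[i-2], tags[i-1], tags[i])] += 1    # c(y_{t-2}, y_{t-1}, y_t)
                ctx[(tags[i-2], tags[i-1])] += 1             # c(y_{t-2}, y_{t-1})
                bi[(tags[i-1], tags[i])] += 1                # c(y_{t-1}, y_t)
                prev[tags[i-1]] += 1                         # c(y_{t-1})
                uni[tags[i]] += 1                            # c(y_t)
        n = sum(uni.values())                                # N

        l1, l2, l3 = lambdas

        def q(t, u, v):
            """Interpolated q(y_t = t | y_{t-2} = u, y_{t-1} = v)."""
            p3 = tri[(u, v, t)] / ctx[(u, v)] if ctx[(u, v)] else 0.0
            p2 = bi[(v, t)] / prev[v] if prev[v] else 0.0
            return l1 * p3 + l2 * p2 + l3 * uni[t] / n

        def e(w, t):
            """e(w | t); still zero for unseen words -- see the next slides."""
            return emit[(t, w)] / uni[t] if uni[t] else 0.0

        return q, e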

Page 35: Sequence Labeling I - IIT Delhi

Low Frequency Words

• Test sentence:

– Astronaut Sujay M. Kulkarni decided not to leave the tricky spot, manning a tough situation by himself.

• Intuition
– manning is likely a verb. Why?
• it ends in "-ing"
– Sujay is likely a noun. Why?
• it is capitalized in the middle of a sentence

Page 36: Sequence Labeling I - IIT Delhi

Low Frequency Words Solution

• Split the vocabulary into two sets:
– frequent (count > k) and infrequent

• Map low-frequency words into a small, finite set, using the word's orthographic features

Page 37: Sequence Labeling I - IIT Delhi

Words Orthographic Features


• (Bikel et al 1999) for NER task

• Features computed in order.

Page 38: Sequence Labeling I - IIT Delhi

Example

• Training data

• Astronaut/NN Sujay/NNP M./NNP Kulkarni/NNP decided/VBD not/RB to/TO leave/VB the/DT tricky/JJ spot/NN ,/, manning/VBG a/DT tough/JJ situation/NN by/IN himself/PRP .

• firstword/NN initCap/NNP capPeriod/NNP initCap/NNP decided/VBD not/RB to/TO leave/VB the/DT tricky/JJ spot/NN ,/, endinING/VBG a/DT tough/JJ situation/NN by/IN himself/PRP .
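A sketch of such a mapping in Python; the class names follow the slides' example (firstword, initCap, capPeriod, endinING), while the rules and regexes are illustrative:

    import re

    def map_rare(word, position, vocab):
        """Replace low-frequency words by orthographic pseudo-words,
        in the spirit of (Bikel et al., 1999): rules are checked in
        order, first match wins."""
        if word in vocab:                    # frequent words pass through
            return word
        if position == 0:
            return "firstword"               # Astronaut -> firstword
        if re.fullmatch(r"[A-Z]\.", word):
            return "capPeriod"               # M. -> capPeriod
        if word[0].isupper():
            return "initCap"                 # Sujay -> initCap
        if word.endswith("ing"):
            return "endinING"                # manning -> endinING
        return "other"

    sent = "Astronaut Sujay M. Kulkarni decided not to leave".split()
    vocab = {"decided", "not", "to", "leave"}   # toy frequent-word set
    print([map_rare(w, i, vocab) for i, w in enumerate(sent)])
    # ['firstword', 'initCap', 'capPeriod', 'initCap', 'decided', 'not', 'to', 'leave']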


Page 39: Sequence Labeling I - IIT Delhi

HMM Inference

• Decoding: most likely sequence of hidden states

– Viterbi algorithm

• Evaluation: prob. of observing an obs. sequence

– Forward Algorithm (very similar to Viterbi)

• Marginal distribution: prob. of a particular state

– Forward-Backward


Page 40: Sequence Labeling I - IIT Delhi

Decoding Problem

Given w = w1 … wT and HMM θ, what is the "best" parse y1 … yT?

Several possible meanings of "solution":
1. States which are individually most likely
2. Single best state sequence

We want the sequence y1 … yT such that P(y|w) is maximized:

y* = argmax_y P(y | w)

(Figure: a trellis with K states per column, one column per observation w1 w2 w3 … wT.)

Page 41: Sequence Labeling I - IIT Delhi

Most Likely Sequence

• Problem: find the most likely (Viterbi) sequence under the model

Fed raises interest rates 0.5 percent .
NNP VBZ NN NNS CD NN .

Given model parameters, we can score any sequence pair:

P(y_{1:T+1}, w_{1:T}) = q(NNP | <s>, <s>) · e(Fed | NNP) · q(VBZ | <s>, NNP) · e(raises | VBZ) · q(NN | NNP, VBZ) · …

Candidate sequences:

NNP VBZ NN NNS CD NN → logP = -23
NNP NNS NN NNS CD NN → logP = -29
NNP VBZ VB NNS CD NN → logP = -27

In principle, we're done – list all possible tag sequences, score each one (2T+1 operations per sequence), and pick the best one (the Viterbi state sequence). But there are |Y|^T tag sequences!

Page 42: Sequence Labeling I - IIT Delhi

Finding the Best Trajectory

• Brute force: too many trajectories (state sequences) to list

• Option 1: Beam Search
– A beam is a set of partial hypotheses
– Start with just the single empty trajectory
– At each derivation step:
• Consider all continuations of previous hypotheses
• Discard most, keep top k

Hypotheses expand like:
<s>,<s> → <s>,Fed:N | <s>,Fed:V | <s>,Fed:J → Fed:N,raises:N | Fed:N,raises:V | Fed:V,raises:N | Fed:V,raises:V

Beam search works OK in practice … but sometimes you want the optimal answer … and there's often a better option than naïve beams. (A code sketch of the beam follows.)
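A sketch of this beam in Python, assuming bigram probability functions q(next, prev) and e(word, tag) (hypothetical helpers for illustration, not the slides' notation):

    import math

    def beam_tag(words, tags, q, e, k=4):
        """Beam search for a bigram HMM tagger: keep only the top-k
        partial hypotheses after each word."""
        beam = [(0.0, ["<s>"])]                       # (log-prob, partial tag seq)
        for w in words:
            cands = []
            for score, seq in beam:                   # all continuations ...
                for t in tags:
                    p = q(t, seq[-1]) * e(w, t)
                    if p > 0:
                        cands.append((score + math.log(p), seq + [t]))
            beam = sorted(cands, reverse=True)[:k]    # ... discard most, keep top k
        # fold in the stop transition and return the best surviving hypothesis
        done = [(s + math.log(q("</s>", seq[-1])), seq)
                for s, seq in beam if q("</s>", seq[-1]) > 0]
        return max(done)[1][1:]                       # drop the <s> padding

Because the beam discards hypotheses early, it can miss the optimal sequence; the dynamic program on the next slides finds it exactly.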

Page 43: Sequence Labeling I - IIT Delhi

State Lattice / Trellis (Bigram HMM)

(Figure: one column of states {N, V, J, D} per word of "<s> Fed raises interest rates", with <s> start and </s> end states; every state in one column connects to every state in the next.)

Page 44: Sequence Labeling I - IIT Delhi

State Lattice / Trellis (Bigram HMM)

(Figure: the same trellis, highlighting that each arc into state N at the "Fed" column is additionally weighted by the emission probability e(Fed|N).)

Page 45: Sequence Labeling I - IIT Delhi

Dynamic Programming (Bigram)

• Decoding: y* = argmax_y P(y | w) = argmax_y P(y, w) = argmax_y ∏_{t=1}^{T+1} q(y_t | y_{t-1}) ∏_{t=1}^{T} e(w_t | y_t)

• First consider how to compute the max (not yet the argmax)

• Define δ_i(y_i) – the probability of the most likely state sequence ending with tag y_i, given observations w_1, …, w_i:

δ_i(y_i) = max_{y_1 … y_{i-1}} P(y_{1..i}, w_{1..i})
         = max_{y_{i-1}} e(w_i | y_i) · q(y_i | y_{i-1}) · max_{y_1 … y_{i-2}} P(y_{1..i-1}, w_{1..i-1})
         = e(w_i | y_i) · max_{y_{i-1}} q(y_i | y_{i-1}) · δ_{i-1}(y_{i-1})

Page 46: Sequence Labeling I - IIT Delhi

Viterbi Algorithm for Bigram HMMs

• Input: w_1, …, w_T; model parameters q() and e()

• Initialize: δ_0(<s>) = 1

• For i = 1 to T do
– For each y' in the tagset:
δ_i(y') = e(w_i | y') · max_y q(y' | y) · δ_{i-1}(y)

• Return max_y q(</s> | y) · δ_T(y)

This returns only the optimal value; keep backpointers to recover the sequence.

Page 47: Sequence Labeling I - IIT Delhi

Viterbi Algorithm for Bigram HMMs

• Input: w_1, …, w_T; model parameters q() and e()

• Initialize: δ_0(<s>) = 1

• For i = 1 to T do
– For each y' in the tagset:
δ_i(y') = e(w_i | y') · max_y q(y' | y) · δ_{i-1}(y)
bp_i(y') = argmax_y q(y' | y) · δ_{i-1}(y)

• Set y_T = argmax_y q(</s> | y) · δ_T(y)

• For i = T-1 down to 1: set y_i = bp_{i+1}(y_{i+1})

• Return y_{1..T} (a Python sketch follows)

Time: O(|Y|² T). Space: O(|Y| T).
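The same algorithm as a runnable Python sketch, reusing the nested-dict q and e from the sampling sketch earlier (probabilities are multiplied directly, with no log-space or smoothing, so this is only for toy inputs):

    def viterbi(words, tags, q, e):
        """Viterbi decoding for a bigram HMM.
        q[prev][next]: transition probs (including "</s>");
        e[tag][word]: emission probs."""
        delta = [{"<s>": 1.0}]                    # delta_0(<s>) = 1
        bp = [{}]
        for i, w in enumerate(words, start=1):
            delta.append({}); bp.append({})
            for y2 in tags:                       # y' in the slides
                best = max(delta[i - 1],
                           key=lambda y: delta[i - 1][y] * q[y].get(y2, 0.0))
                delta[i][y2] = (e[y2].get(w, 0.0) * q[best].get(y2, 0.0)
                                * delta[i - 1][best])
                bp[i][y2] = best                  # remember the argmax
        # termination: fold in the stop transition q(</s> | y)
        y_last = max(tags, key=lambda y: delta[-1][y] * q[y].get("</s>", 0.0))
        seq = [y_last]
        for i in range(len(words), 1, -1):        # backchain
            seq.append(bp[i][seq[-1]])
        return list(reversed(seq))

    # Reusing q and e from the sampling sketch:
    print(viterbi(["fish", "sleep"], ["noun", "verb"], q, e))  # ['noun', 'verb']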

Page 48: Sequence Labeling I - IIT Delhi

Viterbi Algorithm for Bigram HMMs

(Figure: the δ chart has one row per tag and one column per word w_1 … w_T; cell δ_i(s) is filled from column i-1 as max_{y'} δ_{i-1}(y') · P_trans · P_obs.)

Remember: δ_i(y) = probability of the most likely tag sequence ending with y at time i

Page 49: Sequence Labeling I - IIT Delhi

Terminating Viterbi

(Figure: at the final column, choose max_y q(</s> | y) · δ_T(y).)

Page 50: Sequence Labeling I - IIT Delhi

Terminating Viterbi

(Figure: the final cell δ* = max_{y'} δ_{T-1}(y') · P_trans · P_obs; the chart is linear in the length of the sequence. Now backchain through the backpointers to find the final sequence.)

Time: O(|Y|² T). Space: O(|Y| T).

Page 51: Sequence Labeling I - IIT Delhi


Example

Fish sleep.

Page 52: Sequence Labeling I - IIT Delhi

Example: Bigram HMM

Transition probabilities:
– from <s>: noun 0.8, verb 0.2
– from noun: verb 0.8, noun 0.1, </s> 0.1
– from verb: noun 0.2, verb 0.1, </s> 0.7

Page 53: Sequence Labeling I - IIT Delhi

Data

• A two-word language: "fish" and "sleep"

• Suppose in our training corpus,
– "fish" appears 8 times as a noun and 5 times as a verb
– "sleep" appears twice as a noun and 5 times as a verb

• Emission probabilities are relative frequencies, e.g. P(fish | noun) = 8 / (8 + 2) = 0.8:
– Noun: P(fish | noun) = 0.8, P(sleep | noun) = 0.2
– Verb: P(fish | verb) = 0.5, P(sleep | verb) = 0.5

Page 54: Sequence Labeling I - IIT Delhi

Viterbi Probabilities

Chart δ: rows are states (start, verb, noun, end); columns are positions 0–3.

Page 55: Sequence Labeling I - IIT Delhi

Position 0 (before any token):

          0     1     2     3
start     1
verb      0
noun      0
end       0

Page 56: Sequence Labeling I - IIT Delhi

Token 1: fish

          0     1         2     3
start     1     0
verb      0     .2 * .5
noun      0     .8 * .8
end       0     0

Page 57: Sequence Labeling I - IIT Delhi

Token 1: fish

          0     1     2     3
start     1     0
verb      0     .1
noun      0     .64
end       0     0

Page 58: Sequence Labeling I - IIT Delhi

Token 2: sleep (if 'fish' is a verb)

          0     1     2             3
start     1     0     0
verb      0     .1    .1 * .1 * .5
noun      0     .64   .1 * .2 * .2
end       0     0     –

Page 59: Sequence Labeling I - IIT Delhi

Token 2: sleep (if 'fish' is a verb)

          0     1     2      3
start     1     0     0
verb      0     .1    .005
noun      0     .64   .004
end       0     0     –

Page 60: Sequence Labeling I - IIT Delhi

Token 2: sleep (if 'fish' is a noun)

          0     1     2                      3
start     1     0     0
verb      0     .1    .005 | .64 * .8 * .5
noun      0     .64   .004 | .64 * .1 * .2
end       0     0     –

Page 61: Sequence Labeling I - IIT Delhi

Token 2: sleep (if 'fish' is a noun)

          0     1     2               3
start     1     0     0
verb      0     .1    .005 | .256
noun      0     .64   .004 | .0128
end       0     0     –

Page 62: Sequence Labeling I - IIT Delhi

Token 2: sleep – take the maximum, set back pointers

          0     1     2               3
start     1     0     0
verb      0     .1    .005 | .256
noun      0     .64   .004 | .0128
end       0     0     –

Page 63: Sequence Labeling I - IIT Delhi

Token 2: sleep – maximum taken, back pointers set

          0     1     2       3
start     1     0     0
verb      0     .1    .256
noun      0     .64   .0128
end       0     0     –

Page 64: Sequence Labeling I - IIT Delhi

Token 3: end

          0     1     2       3
start     1     0     0       0
verb      0     .1    .256    –
noun      0     .64   .0128   –
end       0     0     –       .256 * .7 | .0128 * .1

Page 65: Sequence Labeling I - IIT Delhi

Token 3: end – take the maximum, set back pointers

          0     1     2       3
start     1     0     0       0
verb      0     .1    .256    –
noun      0     .64   .0128   –
end       0     0     –       .256 * .7 | .0128 * .1

Page 66: Sequence Labeling I - IIT Delhi

Decode: fish = noun, sleep = verb

          0     1     2       3
start     1     0     0       0
verb      0     .1    .256    –
noun      0     .64   .0128   –
end       0     0     –       .256 * .7 = .1792
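Running the viterbi sketch from earlier on exactly these toy parameters reproduces this chart: δ_1 = {noun: .64, verb: .1}, δ_2 = {noun: .0128, verb: .256}; termination compares .0128 × .1 = .00128 against .256 × .7 = .1792, picks verb, and backchaining yields fish = noun, sleep = verb.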

Page 67: Sequence Labeling I - IIT Delhi

State Lattice / Trellis (Trigram HMM)

(Figure: trellis for a trigram HMM over "START Fed raises interest …". Each column holds tag pairs – (^,^), (^,N), (^,V), (N,N), (N,V), (N,D), (D,V), … – plus the stop state $; arcs carry emission weights such as e(Fed|N), e(raises|D), e(interest|V).)

Page 68: Sequence Labeling I - IIT Delhi

Dynamic Programming (Trigram)

• Decoding: y* = argmax_y P(y | w) = argmax_y ∏_{t=1}^{T+1} q(y_t | y_{t-2}, y_{t-1}) ∏_{t=1}^{T} e(w_t | y_t)

• First consider how to compute the max

• Define δ_i(y_{i-1}, y_i) – the probability of the most likely state sequence ending with tags y_{i-1}, y_i, given observations w_1, …, w_i:

δ_i(y_{i-1}, y_i) = max_{y_1 … y_{i-2}} P(y_{1..i}, w_{1..i})
                  = max_{y_{i-2}} e(w_i | y_i) · q(y_i | y_{i-2}, y_{i-1}) · max_{y_1 … y_{i-3}} P(y_{1..i-1}, w_{1..i-1})
                  = e(w_i | y_i) · max_{y_{i-2}} q(y_i | y_{i-2}, y_{i-1}) · δ_{i-1}(y_{i-2}, y_{i-1})

Page 69: Sequence Labeling I - IIT Delhi

Viterbi Algorithm for Trigram HMMs

• Input: w_1, …, w_T; model parameters q() and e()

• Initialize: δ_0(<s>, <s>) = 1

• For i = 1 to T do
– For each pair (y', y'') in the tagset:
δ_i(y', y'') = e(w_i | y'') · max_y q(y'' | y, y') · δ_{i-1}(y, y')

• Return max_{y', y''} q(</s> | y', y'') · δ_T(y', y'')

This returns only the optimal value; keep backpointers.

Page 70: Sequence Labeling I - IIT Delhi

Viterbi Algorithm for Trigram HMMs

• Input: w_1, …, w_T; model parameters q() and e()

• Initialize: δ_0(<s>, <s>) = 1

• For i = 1 to T do
– For each pair (y', y'') in the tagset:
δ_i(y', y'') = e(w_i | y'') · max_y q(y'' | y, y') · δ_{i-1}(y, y')
bp_i(y', y'') = argmax_y q(y'' | y, y') · δ_{i-1}(y, y')

• Set (y_{T-1}, y_T) = argmax_{y', y''} q(</s> | y', y'') · δ_T(y', y'')

• For i = T-2 down to 1: set y_i = bp_{i+2}(y_{i+1}, y_{i+2})

• Return y_{1..T}

Time: O(|Y|³ T). Space: O(|Y|² T).
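A Python sketch of the trigram variant, assuming probability functions q(t, u, v) = q(t | u, v) and e(w, t) (e.g. as returned by the estimate_hmm sketch earlier); chart entries are indexed by the last pair of tags:

    def viterbi_trigram(words, tags, q, e):
        """Viterbi decoding for a trigram HMM; chart entries are indexed
        by the last pair of tags (y', y''). Runs in O(|Y|^3 T) time."""
        if not words:
            return []
        delta = [{("<s>", "<s>"): 1.0}]
        bp = [{}]
        for i, w in enumerate(words, start=1):
            delta.append({}); bp.append({})
            for (y1, y2), p_prev in delta[i - 1].items():
                for t in tags:
                    p = p_prev * q(t, y1, y2) * e(w, t)
                    if p > delta[i].get((y2, t), 0.0):
                        delta[i][(y2, t)] = p       # best score ending in (y2, t)
                        bp[i][(y2, t)] = y1         # remember y_{i-2}
        # termination: fold in q(</s> | y', y'')
        last = max(delta[-1], key=lambda pr: delta[-1][pr] * q("</s>", *pr))
        seq = list(last)
        for i in range(len(words), 2, -1):          # backchain through tag pairs
            seq.insert(0, bp[i][(seq[0], seq[1])])
        return seq[-len(words):]                    # strip <s> padding when T < 2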

Page 71: Sequence Labeling I - IIT Delhi

Overview: Accuracies

• Roadmap of (known / unknown) accuracies:

– Most freq tag: ~90% / ~50%

– Trigram HMM: ~95% / ~55%

• TnT (Brants, 2000):
– A carefully smoothed trigram tagger

– Suffix trees for emissions

– 96.7% on WSJ text

– Upper bound: ~98%

(Most errors are on unknown words.)

Page 72: Sequence Labeling I - IIT Delhi

Common Errors

• Common errors [from Toutanova & Manning 00]

NN/JJ NN

official knowledge

VBD RP/IN DT NN

made up the story

RB VBD/VBN NNS

recently sold shares

Page 73: Sequence Labeling I - IIT Delhi

Issues with HMMs for POS Tagging

• Slow for long sentences

• Only one feature for less frequent words

• No features for frequent words

• Why not try a feature rich classifier?

– MaxEnt?


Page 74: Sequence Labeling I - IIT Delhi

Feature-based tagger

• Can do surprisingly well just looking at a word by itself:

– Word the: the DT

– Lowercased word Importantly: importantly RB

– Prefixes unfathomable: un- JJ

– Suffixes Importantly: -ly RB

– Capitalization Meridian: CAP NNP

– Word shapes 35-year: d-x JJ

• Then build a maxent (or whatever) model to predict tag

– Maxent P(y|w): 93.7% overall / 82.6% unknown
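A sketch of such a per-word feature extractor in Python (the feature names and templates are illustrative); the resulting dict is what a maxent/logistic-regression classifier would consume:

    import itertools

    def word_features(word):
        """Features computable from the word alone, mirroring the list above."""
        # collapse character classes into a shape, e.g. "35-year" -> "d-x"
        classes = ("X" if c.isupper() else "x" if c.islower()
                   else "d" if c.isdigit() else c for c in word)
        shape = "".join(ch for ch, _ in itertools.groupby(classes))
        return {
            "word=" + word: 1.0,                       # the -> DT
            "lower=" + word.lower(): 1.0,              # Importantly -> importantly
            "prefix=" + word[:2]: 1.0,                 # un- (unfathomable) -> JJ
            "suffix=" + word[-2:]: 1.0,                # -ly (Importantly) -> RB
            "capitalized": float(word[:1].isupper()),  # Meridian -> NNP
            "shape=" + shape: 1.0,                     # 35-year: d-x -> JJ
        }

    print(sorted(word_features("35-year")))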

Page 75: Sequence Labeling I - IIT Delhi

Overview: Accuracies

• Roadmap of (known / unknown) accuracies:

– Most freq tag: ~90% / ~50%

– Trigram HMM: ~95% / ~55%

– Maxent P(t|w): 93.7% / 82.6%

– TnT (HMM++): 96.2% / 86.0%

– Upper bound: ~98%

Page 76: Sequence Labeling I - IIT Delhi

How to improve supervised results?

• Build better features!

They left as soon as he arrived .
PRP VBD IN RB IN PRP VBD .
(the first as should be RB, not IN)
– We could fix this with a feature that looked at the next word

Intrinsic flaws remained undetected .
NNP NNS VBD VBN .
(Intrinsic should be JJ, not NNP)
– We could fix this by linking capitalized words to their lowercase versions

Page 77: Sequence Labeling I - IIT Delhi

Tagging Without Sequence Information

Baseline model: predict y_0 from w_0 alone. "3Words" model: predict y_0 from (w_-1, w_0, w_1).

Model      Features   Token accuracy   Unknown
Baseline    56,805       93.69%        82.61%
3Words     239,767       96.57%        86.78%

Using words only in a straight classifier works as well as a basic sequence model!!

Page 78: Sequence Labeling I - IIT Delhi

Overview: Accuracies

• Roadmap of (known / unknown) accuracies:

– Most freq tag: ~90% / ~50%

– Trigram HMM: ~95% / ~55%

– Maxent P(y|w): 93.7% / 82.6%

– TnT (HMM++): 96.2% / 86.0%

– Maxent (local nbrs): 96.8% / 86.8%

– Upper bound: ~98%

Page 79: Sequence Labeling I - IIT Delhi

Discriminative Sequence Taggers

• Maxent P(y|w) is too local

– completely ignores sequence labeling problem

– and predicts independently

• Discriminative Sequence Taggers

– Feature rich

– neighboring labels can guide tagging process

– Example: Max Entropy Markov Models (MEMM), Linear Perceptron


Page 80: Sequence Labeling I - IIT Delhi

Overview: Accuracies

• Roadmap of (known / unknown) accuracies:

– Most freq tag: ~90% / ~50%

– Trigram HMM: ~95% / ~55%

– Maxent P(y|w): 93.7% / 82.6%

– TnT (HMM++): 96.2% / 86.0%

– Maxent (local nbrs): 96.8% / 86.8%

– MEMMs: 96.9% / 86.9%

– Linear Perceptron: 96.7% / ??

– Upper bound: ~98%

Page 81: Sequence Labeling I - IIT Delhi

Cyclic Network

• Train two MEMMs, multiply together to score

• And be very careful
– Tune regularization
– Try lots of different features

• See paper for full details [Toutanova et al 03]

Page 82: Sequence Labeling I - IIT Delhi

Overview: Accuracies

• Roadmap of (known / unknown) accuracies:

– Most freq tag: ~90% / ~50%

– Trigram HMM: ~95% / ~55%

– Maxent P(y|w): 93.7% / 82.6%

– TnT (HMM++): 96.2% / 86.0%

– Maxent (local nbrs): 96.8% / 86.8%

– MEMMs: 96.9% / 86.9%

– Linear Perceptron: 96.7% / ??

– Cyclic tagger: 97.2% / 89.0%

– Upper bound: ~98%

Page 83: Sequence Labeling I - IIT Delhi

Overview: Accuracies

• Roadmap of (known / unknown) accuracies:

– Most freq tag: ~90% / ~50%

– Trigram HMM: ~95% / ~55%

– Maxent P(y|w): 93.7% / 82.6%

– TnT (HMM++): 96.2% / 86.0%

– Maxent (local nbrs): 96.8% / 86.8%

– MEMMs: 96.9% / 86.9%

– Linear Perceptron: 96.7% / ??

– Cyclic tagger: 97.2% / 89.0%

– Maxent+Ext ambig. 97.4% / 91.3%

– Upper bound: ~98%

Page 84: Sequence Labeling I - IIT Delhi

Summary of POS Tagging

• For tagging, the change from generative to discriminative model does not by itself result in great improvement

• One profits from models that allow dependence on overlapping features of the observation, such as spelling, suffix analysis, etc.

• An MEMM allows integration of rich features of the observations, but can suffer strongly from assuming independence from following observations; this effect can be relieved by adding dependence on following words

• This additional power (of the MEMM, CRF, and Perceptron models) has been shown to result in improvements in accuracy

• The higher accuracy of discriminative models comes at the price of much slower training

• Simple MaxEnt models perform close to the state of the art. What does that say about the sequence labeling task?

Page 85: Sequence Labeling I - IIT Delhi

Domain Effects

• Accuracies degrade outside of domain
– Up to triple the error rate
– Usually you make the most errors on the things you care about in the domain (e.g. protein names)

• Open questions
– How to effectively exploit unlabeled data from a new domain (what could we gain?)
– How to best incorporate domain lexica in a principled way (e.g. UMLS specialist lexicon, ontologies)