Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen CALD


Page 1:

Conditional Markov Models: MaxEnt Tagging and MEMMs

William W. Cohen

CALD

Page 2:

Announcements

• Confused about what to write up?
  – Mon 2/9: Ratnaparkhi & Freitag et al.
  – Wed 2/11: Borthwick et al. & Mikheev
  – Mon 2/16: no class (Presidents' Day)
  – Wed 2/18: Sha & Pereira, Lafferty et al.
  – Mon 2/23: Klein & Manning, Toutanova et al.
  – Wed 2/25: no writeup due
  – Mon 3/1: no writeup due
  – Wed 3/3: project proposal due: personnel + 1-2 pages
  – Spring break week, no class

Page 3:

Review of review

• Multinomial HMMs are a sequential version of naïve Bayes.

• One way to drop the independence assumption: use a maxent model instead of NB, and a conditional model instead of a joint one.

Page 4:

From NB to Maxent

Naïve Bayes as a conditional model:

  Pr(y | x) = (1/Z_y) Pr(y) ∏_j Pr(w_j | y),   where w_j is the word at position j of x

Define one binary feature per (word k, position j) combination, indexed by i:

  f_{j,k}(doc) = [word k appears at position j of doc? 1 : 0]

With λ_{y,i} = log Pr(word k at position j | y) for the i-th (j,k) combination, λ_{y,0} = log Pr(y), and f_0(x) ≡ 1, the same model can be written in loglinear form:

  Pr(y | x) = (1/Z) exp( λ_{y,0} f_0(x) + Σ_i λ_{y,i} f_i(x) )

Page 5:

From NB to Maxent

  Pr(y | x) = (1/Z_y) Pr(y) ∏_j Pr(w_j | y) = (1/Z) exp( λ_{y,0} f_0(x) + Σ_i λ_{y,i} f_i(x) ),

  where w_j is the word at position j of x

Page 6:

From NB to Maxent

Likelihood of data x^1, …, x^n:

  Pr(x^1, …, x^n) = ∏_i (1/Z) exp( λ_0 f_0(x^i) + Σ_j λ_j f_j(x^i) )

Learning: set alpha parameters to maximize this: the ML model of the data, given we’re using the same functional form as NB.

Turns out this is the same as maximizing entropy of p(y|x) over all distributions.
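The learning step above can be sketched in code. This is a minimal illustration, not the method the papers use: it fits per-class weights by plain batch gradient ascent on the conditional log-likelihood (the papers use GIS/IIS), and `train_maxent`/`predict` are hypothetical names assuming binary feature vectors.

```python
import math

def train_maxent(examples, n_feats, classes, lr=0.5, iters=200):
    """Fit per-class weights lam[y][i] to maximize the conditional
    log-likelihood of (feature_vector, label) pairs -- a sketch of
    the ML training described above, via batch gradient ascent."""
    lam = {y: [0.0] * n_feats for y in classes}
    for _ in range(iters):
        grad = {y: [0.0] * n_feats for y in classes}
        for feats, label in examples:
            # Pr(y|x) proportional to exp( sum_i lam[y][i] * f_i(x) )
            scores = {y: math.exp(sum(l * f for l, f in zip(lam[y], feats)))
                      for y in classes}
            z = sum(scores.values())
            for y in classes:
                p = scores[y] / z
                for i, f in enumerate(feats):
                    # gradient = observed feature count - expected count
                    grad[y][i] += ((label == y) - p) * f
        for y in classes:
            for i in range(n_feats):
                lam[y][i] += lr * grad[y][i]
    return lam

def predict(lam, feats):
    """Return the class with the highest loglinear score."""
    return max(lam, key=lambda y: sum(l * f for l, f in zip(lam[y], feats)))
```

On separable toy data the gradient pushes each weight toward the class whose examples carry that feature, so predictions match the training labels after a few iterations.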

Page 7:

MaxEnt Comments

• Functional form same as Naïve Bayes (loglinear model)

• Numerical issues & smoothing important

• All methods are iterative

• Classification performance can be competitive with the state of the art

• Optimizes Pr(y|x), not error rate

Page 8:

What is a symbol?

Ideally we would like to use many, arbitrary, overlapping features of words.

[Figure: HMM graphical model with states S_{t-1}, S_t, S_{t+1} emitting observations O_{t-1}, O_t, O_{t+1}]

Features of a word:
• identity of word
• ends in “-ski”
• is capitalized
• is part of a noun phrase
• is in a list of city names
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• …

Example annotations in the figure: “Wisniewski” — is part of a noun phrase, ends in “-ski”
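Features like these can be sketched as a simple extractor. `word_features` is a hypothetical helper for illustration: the city list is supplied by the caller, and the layout features (bold, indented, hyperlink anchor) are omitted since they require document markup rather than the token alone.

```python
def word_features(word, city_names=()):
    """Binary, overlapping features of a single token, following the
    feature list above; returns a dict of feature-name -> 0/1."""
    return {
        "word=" + word.lower(): 1,          # identity of word
        "ends_in_ski": int(word.endswith("ski")),
        "is_capitalized": int(word[:1].isupper()),
        "in_city_list": int(word in city_names),
    }
```

Unlike an HMM's single symbol per observation, a token here fires several features at once, which is the point of the slide.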

Page 9:

What is a symbol?

[Figure: same HMM graphical model and feature list as the previous slide]

Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations:

  Pr(s_t | x_t, …)

Page 10:

What is a symbol?

[Figure: same HMM graphical model and feature list as before]

Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state:

  Pr(s_t | x_t, s_{t-1}, …)

Page 11:

What is a symbol?

[Figure: same HMM graphical model and feature list as before]

Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history:

  Pr(s_t | x_t, s_{t-1}, s_{t-2}, …)

Page 12:

Ratnaparkhi’s MXPOST

• Sequential learning problem: predict POS tags of words.

• Uses MaxEnt model described above.

• Rich feature set.

• To smooth, discard features occurring < 10 times.
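The count cutoff can be sketched with a counter; `prune_rare_features` is a hypothetical name for illustration.

```python
from collections import Counter

def prune_rare_features(feature_lists, min_count=10):
    """Keep only features occurring at least min_count times across
    the training data -- the count-cutoff smoothing MXPOST uses in
    place of explicit regularization."""
    counts = Counter(f for feats in feature_lists for f in feats)
    kept = {f for f, c in counts.items() if c >= min_count}
    return [[f for f in feats if f in kept] for feats in feature_lists]
```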

Page 13:

MXPOST

Page 14:

MXPOST: learning & inference

GIS; feature selection

Page 15:

MXPost inference

Adwait: consider only extensions suggested by a dictionary

Page 16:

MXPost results

• State-of-the-art accuracy (for 1996)

• Same approach used successfully for several other sequential classification steps of a stochastic parser (also state of the art).

• Same approach used for NER by Borthwick, Malouf, Collins, Manning, and others.

Page 17:

Alternative inference

Page 18:

Finding the most probable path: the Viterbi algorithm (for HMMs)

• Define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k (i.e., ending with tag k).

• We want to compute v_N(L), the probability of the most probable path accounting for all of the sequence and ending in the end state.

• v_k(i) can be defined recursively.
• Can use dynamic programming to find v_N(L) efficiently.

Page 19:

Finding the most probable path: the Viterbi algorithm for HMMs

• Initialization:

  v_0(0) = 1
  v_k(0) = 0 for other states k

Page 20:

The Viterbi algorithm for HMMs

• Recursion for emitting states (i = 1…L):

  v_l(i) = b_l(x_i) · max_k [ a_{kl} v_k(i−1) ]

Page 21:

The Viterbi algorithm for HMMs and Maxent Taggers

• Recursion for emitting states (i = 1…L):

  v_l(i) = max_k [ b_l(x_i) a_{kl} v_k(i−1) ]

  v_l(i) = max_k [ Pr(v_l | x_i, v_k) · v_k(i−1) ]

  where k is the previous tag and x_i is the i-th token
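The tagger recursion above can be sketched with the local model supplied as a function. `viterbi` and `local_prob` are hypothetical names; probabilities are multiplied directly rather than in log space, which is fine for short toy sequences but would underflow on long ones.

```python
def viterbi(tokens, tags, local_prob, start="START"):
    """Most probable tag sequence under the recursion
    v_l(i) = max_k Pr(l | x_i, k) * v_k(i-1),
    where local_prob(tag, token, prev_tag) plays the role of the
    maxent model's Pr(l | x_i, k)."""
    # v maps each tag k to (prob of best path ending in k, that path)
    v = {start: (1.0, [])}
    for tok in tokens:
        nv = {}
        for l in tags:
            # pick the best previous tag k for tag l at this position
            k, (p, path) = max(
                v.items(),
                key=lambda kv: kv[1][0] * local_prob(l, tok, kv[0]))
            nv[l] = (p * local_prob(l, tok, k), path + [l])
        v = nv
    return max(v.values(), key=lambda t: t[0])[1]
```

The only change from the HMM version is that b_l(x_i)·a_{kl} is replaced by a single conditional model, which is exactly what the slide's second equation says.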

Page 22:

MEMMs (Freitag & McCallum)

• Basic difference from ME tagging:
  – ME tagging: previous state is a feature of the MaxEnt classifier
  – MEMM: build a separate MaxEnt classifier for each state
• Can build any HMM architecture you want; e.g., parallel nested HMMs, etc.
• Data is fragmented: examples where the previous tag is “proper noun” give no information about learning tags when the previous tag is “noun”
  – Mostly a difference in viewpoint: easier to see parallels to HMMs
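The data fragmentation the slide describes can be made concrete by partitioning training pairs by previous tag; `fragment_by_prev_state` is a hypothetical helper for illustration.

```python
from collections import defaultdict

def fragment_by_prev_state(tagged_sentences, start="START"):
    """Split (token, tag) training data into one pool per previous
    state. In an MEMM each pool would train that state's own MaxEnt
    classifier, so no pool sees another pool's examples -- the
    fragmentation noted on the slide."""
    pools = defaultdict(list)
    for sent in tagged_sentences:
        prev = start
        for token, tag in sent:
            pools[prev].append((token, tag))
            prev = tag
    return dict(pools)
```

In the ME-tagging view, by contrast, the previous tag would just become one more feature and all examples would share a single classifier.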

Page 23:

MEMM task: FAQ parsing

Page 24:

MEMM features

Page 25:

MEMMs

Page 26:

Borthwick et al: MENE system

• Much like MXPost, with some tricks for NER:
  – 4 tags/field: x_start, x_continue, x_end, x_unique
  – Features:
    • Section features
    • Tokens in window
    • Lexical features of tokens in window
    • Dictionary features of tokens (is token a firstName?)
    • External-system features of tokens (is this a NetOwl_company_start? proteus_person_unique?)
  – Smooth by discarding low-count features
  – No history: Viterbi search used to find the best consistent tag sequence (e.g., no continue without a start)
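The 4-tags/field scheme can be sketched as a span encoder. `spans_to_tags` and the (first, last, field) span format are assumptions for illustration, not MENE's actual data format.

```python
def spans_to_tags(n_tokens, spans):
    """Encode entity spans as MENE-style tags: x_start / x_continue /
    x_end for multi-token entities, x_unique for single-token ones,
    and "other" elsewhere. spans is a list of (first, last, field)
    with inclusive token indices."""
    tags = ["other"] * n_tokens
    for first, last, field in spans:
        if first == last:
            tags[first] = field + "_unique"
        else:
            tags[first] = field + "_start"
            tags[last] = field + "_end"
            for i in range(first + 1, last):
                tags[i] = field + "_continue"
    return tags
```

Decoding must invert this consistently, which is why the slide's Viterbi search forbids sequences like a continue with no preceding start.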

Page 27:

Dictionaries in MENE

Page 28:

MENE results (dry run)

Page 29:

MENE learning curves

[Figure: learning curves; accuracies 96.3, 93.3, 92.2]

Page 30:

• Largest U.S. Cable Operator Makes Bid for Walt Disney• By ANDREW ROSS SORKIN

• The Comcast Corporation, the largest cable television operator in the United States, made a $54.1 billion unsolicited takeover bid today for The Walt Disney Company, the storied family entertainment colossus.

• If successful, Comcast's audacious bid would once again reshape the entertainment landscape, creating a new media behemoth that would combine the power of Comcast's powerful distribution channels to some 21 million subscribers in the nation with Disney's vast library of content and production assets. Those include its ABC television network, ESPN and other cable networks, and the Disney and Miramax movie studios.

[Slide annotations: short names vs. longer names]

Page 31:

LTG system

• Another MUC-7 competitor
• Handcoded rules for “easy” cases (amounts, etc.)
• Process of repeated tagging and “matching” for hard cases:
  – Sure-fire (high-precision) rules for names where the type is clear (“Phillip Morris, Inc.”, “The Walt Disney Company”)
  – Partial matches to sure-fire rules are filtered by a maxent classifier (candidate filtering) using contextual information, etc.
  – Higher-recall rules, avoiding conflicts with partial-match output (“Phillip Morris announced today….”, “Disney’s ….”)
  – Final partial-match & filter step on titles, with a different learned filter
• Exploits discourse/context information
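The partial-matching step can be illustrated by generating shorter variants of a sure-fire name for the later, higher-recall passes to match. This is a rough sketch under an assumed stopword list, not LTG's actual rule set.

```python
def partial_match_candidates(full_name, stopwords=("the", "inc", "company")):
    """Shorter variants of a confidently-tagged name, e.g.
    "The Walt Disney Company" -> {"Walt Disney", "Disney", "Walt"}.
    Each candidate is a contiguous sub-span of the name with
    stopword-like tokens removed."""
    words = [w for w in full_name.replace(",", "").split()
             if w.lower() not in stopwords]
    spans = set()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            spans.add(" ".join(words[i:j]))
    spans.discard(full_name)
    return spans
```

In the pipeline above, a candidate filter (the maxent classifier) would then decide, from context, which of these variants really refer to the sure-fire entity.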

Page 32:

LTG Results

Page 33:

[Results chart comparing systems: LTG, Identifinder, MENE+Proteus, Manitoba (NB filtered names), NetOwl (commercial RBS)]