A Survey of Unsupervised Grammar Induction Baskaran Sankaran Senior Supervisor: Dr Anoop Sarkar School of Computing Science Simon Fraser University

Post on 19-Dec-2015


A Survey of Unsupervised Grammar Induction

Baskaran Sankaran

Senior Supervisor: Dr Anoop Sarkar

School of Computing Science
Simon Fraser University


Motivation

Languages have hidden regularities

karuppu naay puunaiyai thurathiyathu
iruttil karuppu uruvam marainthathu
naay thurathiya puunai vekamaaka ootiyathu



FORMAL STRUCTURES


Phrase-Structure

Sometimes the bribed became partners in the company


Phrase-Structure

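Binarize, CNF

• Sparsity issue with words
• Use POS tags

[Figure: binarized parse tree over POS tags: (S (ADVP RB) (@S (NP DT VBN) (VP VBD (@VP (NP NNS) (PP IN (NP DT NN))))))]

S --> ADVP @S
@S --> NP VP
VP --> VBD @VP
@VP --> NP PP
NP --> DT VBN
NP --> DT NN
NP --> NNS
PP --> IN NP
ADVP --> RB
IN --> IN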
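The @-labelled rules on this slide can be obtained mechanically from an n-ary tree. The following is a minimal sketch, assuming right-binarization and trees written as nested tuples; the function names `binarize` and `rules` and the n-ary input tree `flat` are illustrative and not taken from the slides.

```python
def binarize(tree):
    """Right-binarize a tree given as (label, child, ...) tuples; leaves are tag strings."""
    if isinstance(tree, str):
        return tree
    label = tree[0]
    children = [binarize(child) for child in tree[1:]]
    # Intermediate nodes reuse the parent's label with an '@' prefix (assumed convention).
    inter = label if label.startswith("@") else "@" + label
    # Fold from the right: X --> c1 c2 c3 becomes X --> c1 @X and @X --> c2 c3.
    while len(children) > 2:
        children = children[:-2] + [(inter, children[-2], children[-1])]
    return (label, *children)


def rules(tree):
    """List the (lhs, rhs) rules used in a tree, top-down and left to right."""
    if isinstance(tree, str):
        return []
    rhs = tuple(c if isinstance(c, str) else c[0] for c in tree[1:])
    found = [(tree[0], rhs)]
    for child in tree[1:]:
        found.extend(rules(child))
    return found


# Hypothetical n-ary analysis of "Sometimes the bribed became partners in the company".
flat = ("S", ("ADVP", "RB"), ("NP", "DT", "VBN"),
        ("VP", "VBD", ("NP", "NNS"), ("PP", "IN", ("NP", "DT", "NN"))))
binary = binarize(flat)
for lhs, rhs in rules(binary):
    print(lhs, "-->", " ".join(rhs))
```

Every rule in the output has at most two symbols on its right-hand side, which is what chart-based algorithms such as Inside-Outside require.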


Evaluation Metric-1

Unsupervised Induction
◦ Binarized output tree
  Possibly unlabelled

Evaluation
◦ Gold treebank parse
◦ Recall - % of true constituents found
◦ Also precision and F-score

Wall Street Journal (WSJ) dataset

[Figure: unlabelled binarized tree (root S, internal nodes X) over the tags RB DT VBN VBD NNS IN DT NN]
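Unlabelled bracket scoring can be sketched as a set comparison between gold and induced spans. This is an illustration only: the tuple encoding of trees, the helper names, and both example trees are assumptions, and published evaluations differ in details such as whether the whole-sentence span is counted.

```python
def spans(tree, start=0):
    """Return (end, spans); spans holds (i, j) for every node covering 2+ leaves."""
    if isinstance(tree, str):
        return start + 1, set()
    end, found = start, set()
    for child in tree:
        end, sub = spans(child, end)
        found |= sub
    if end - start > 1:
        found.add((start, end))
    return end, found


def bracket_prf(gold_tree, pred_tree):
    """Unlabelled precision, recall and F-score over constituent spans."""
    _, gold = spans(gold_tree)
    _, pred = spans(pred_tree)
    hit = len(gold & pred)
    precision = hit / len(pred) if pred else 0.0
    recall = hit / len(gold) if gold else 0.0
    f_score = 2 * precision * recall / (precision + recall) if hit else 0.0
    return precision, recall, f_score


# Unlabelled trees as nested tuples of POS tags (both are hypothetical examples).
gold_tree = ("RB", (("DT", "VBN"), ("VBD", ("NNS", ("IN", ("DT", "NN"))))))
pred_tree = (("RB", ("DT", "VBN")), ("VBD", (("NNS", "IN"), ("DT", "NN"))))
precision, recall, f_score = bracket_prf(gold_tree, pred_tree)
print(precision, recall, f_score)
```

Here five of the seven gold spans are recovered, so recall is 5/7; the induced tree also has seven spans, so precision is the same.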


Dependency Structure

[Figure: head-annotated tree for "Sometimes the bribed became partners in the company"; starred labels (VBD*, VBN*, NNS*, IN*, NN*) mark the head child at each node, with VBD (became) at the root]


Dependency Structure

[Figure: dependency arcs over the tagged sentence]

RB DT VBN VBD NNS IN DT NN

Sometimes the bribed became partners in the company


Evaluation Metric-2

Unsupervised Induction
◦ Generates directed dependency arcs

Compute (directed) attachment accuracy
◦ Gold dependencies
◦ WSJ10 dataset

[Figure: dependency arcs over the tagged sentence]

RB DT VBN VBD NNS IN DT NN

Sometimes the bribed became partners in the company
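Directed attachment accuracy is the fraction of tokens whose predicted head matches the gold head. A minimal sketch follows; the head-index encoding (-1 for the root) and both head lists are assumptions made for illustration, not the arcs drawn on the slide.

```python
def directed_accuracy(gold_heads, pred_heads):
    """Fraction of tokens whose predicted head index equals the gold head index."""
    if len(gold_heads) != len(pred_heads):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)


# heads[i] is the index of token i's head; -1 marks the root (hypothetical analyses).
words = "Sometimes the bribed became partners in the company".split()
gold_heads = [3, 2, 3, -1, 3, 3, 7, 5]
pred_heads = [3, 2, 3, -1, 3, 4, 7, 5]  # attaches "in" to "partners" instead of "became"
accuracy = directed_accuracy(gold_heads, pred_heads)
print(accuracy)
```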


Unsupervised Grammar Induction

To learn the hidden structure of a language
◦ POS tag sequences as input
◦ Generates phrase-structure/dependencies
◦ No attempt to find the meaning

Overview
◦ Phrase-structure and dependency grammars
◦ Mostly on English (few on Chinese, German etc.)
◦ Learning restricted to shorter sentences
◦ Significantly lags behind the supervised methods


PHRASE-STRUCTURE INDUCTION


Toy Example

Corpus
the dog bites a man
dog sleeps
a dog bites a bone
the man sleeps

Grammar
S --> NP VP      NP --> N       N --> man
VP --> V NP      Det --> a      N --> bone
VP --> V         Det --> the    V --> sleeps
NP --> Det N     N --> dog      V --> bites


EM for PCFG (Baker ’79; Lari and Young ’90)

Inside-Outside
◦ EM instance for probabilistic CFG
  Generalization of Forward-backward for HMMs
◦ Non-terminals are fixed
◦ Estimate maximum likelihood rule probabilities

Initial rules:

S --> NP VP      V --> dog
NP --> Det N     Det --> man
NP --> N         N --> man
VP --> V         V --> man
VP --> V NP      Det --> bone
VP --> NP V      N --> bone
Det --> the      V --> bone
N --> the        Det --> bites
V --> the        N --> bites
Det --> a        V --> bites
N --> a          Det --> sleeps
V --> a          N --> sleeps
Det --> dog      V --> sleeps
N --> dog

Rules with estimated probabilities:

S --> NP VP    1.0         V --> dog
NP --> Det N   0.875       Det --> man
NP --> N       0.125       N --> man      0.375
VP --> V       0.5         V --> man
VP --> V NP    0.5         Det --> bone
VP --> NP V                N --> bone     0.125
Det --> the    0.428571    V --> bone
N --> the                  Det --> bites
V --> the                  N --> bites
Det --> a      0.571429    V --> bites    0.5
N --> a                    Det --> sleeps
V --> a                    N --> sleeps
Det --> dog                V --> sleeps   0.5
N --> dog      0.5


Inside-Outside

[Figure: chart over "Sometimes the bribed became partners in the company", highlighting the rule @S --> NP VP]

P(NP --> the bribed)
P(@S --> NP VP)
P(VP --> became … company)
P(S --> Sometimes @S)
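The inside half of this computation can be sketched as a CKY-style dynamic program. The code below is an illustration under stated assumptions: the grammar is in Chomsky normal form, the toy rules and probabilities are hypothetical, and the outside pass and the EM re-estimation step are omitted.

```python
from collections import defaultdict


def inside(words, binary, lexical):
    """Inside probabilities beta[(i, j, A)] = P(A derives words[i:j]) for a CNF PCFG.

    binary:  {(A, B, C): prob} for rules A --> B C
    lexical: {(A, word): prob} for rules A --> word
    """
    n = len(words)
    beta = defaultdict(float)
    for i, word in enumerate(words):
        for (a, w), p in lexical.items():
            if w == word:
                beta[i, i + 1, a] += p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for (a, b, c), p in binary.items():
                    left = beta.get((i, k, b), 0.0)
                    right = beta.get((k, j, c), 0.0)
                    if left and right:
                        beta[i, j, a] += p * left * right
    return beta


# Hypothetical toy grammar, not the estimates shown on the slides.
binary_rules = {("S", "NP", "VP"): 1.0, ("NP", "Det", "N"): 1.0, ("VP", "V", "NP"): 1.0}
lexical_rules = {("Det", "the"): 0.5, ("Det", "a"): 0.5,
                 ("N", "dog"): 0.5, ("N", "man"): 0.5, ("V", "bites"): 1.0}
sentence = "the dog bites a man".split()
beta = inside(sentence, binary_rules, lexical_rules)
print(beta[0, len(sentence), "S"])
```

The entry for S over the whole sentence is the sentence likelihood, the quantity EM tries to increase when re-estimating rule probabilities.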


Constraining Search (Pereira and Schabes ’92; Schabes et al. ’93)

[Figure: bracketed spans over "Sometimes the bribed became partners in the company"]


Constraining Search (Pereira and Schabes ’92; Schabes et al. ’93; Hwa ’99)

Treebank bracketings
◦ Bracketing boundaries constrain induction

What happens with limited supervision?
◦ More bracketed data exposed iteratively
◦ 0% bracketed data
◦ 100% bracketed data

Right-branching baseline

Recall: 50.0

Recall: 78.0

Recall: 76.0
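One way to picture how bracketing boundaries constrain induction: a chart span is considered only if it does not cross any given bracket. The sketch below assumes half-open (start, end) token spans; the helper names and the two example brackets are hypothetical.

```python
def crosses(span, bracket):
    """True if two half-open spans overlap without either containing the other."""
    (i, j), (a, b) = span, bracket
    return i < a < j < b or a < i < b < j


def allowed_spans(n, brackets):
    """All spans over n tokens that are compatible with every given bracket."""
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n + 1)
            if not any(crosses((i, j), bracket) for bracket in brackets)]


# Hypothetical partial bracketing: "(the bribed)" and "(in the company)".
brackets = [(1, 3), (5, 8)]
chart_cells = allowed_spans(8, brackets)
print(len(chart_cells), "of", 8 * 9 // 2, "spans remain")
```

With no brackets every span is allowed, which corresponds to the fully unsupervised setting.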


Distributional clustering (Adriaans et al. ’00; Clark ’00; van Zaanen ’00)

Cluster the word sequences
◦ Context: adjacent words or boundaries
◦ Relative frequency distribution of contexts
  the black dog bites the man
  the man eats an apple

Identifies constituents
◦ Evaluation on ATIS corpus

Recall: 35.6
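The context statistics behind this approach can be sketched as follows. This is an assumption-laden illustration: the boundary markers `<s>` and `</s>`, the `max_len` cut-off, and the function name are invented here, and the clustering step that groups sequences with similar distributions is omitted.

```python
from collections import Counter, defaultdict


def context_distributions(corpus, max_len=3):
    """Map each word sequence to the relative frequencies of its (left, right) contexts."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        padded = ["<s>"] + sentence + ["</s>"]  # boundaries act as context symbols
        for i in range(1, len(padded) - 1):
            for j in range(i + 1, min(i + max_len, len(padded) - 1) + 1):
                sequence = tuple(padded[i:j])
                counts[sequence][(padded[i - 1], padded[j])] += 1
    return {sequence: {ctx: c / sum(ctxs.values()) for ctx, c in ctxs.items()}
            for sequence, ctxs in counts.items()}


corpus = ["the black dog bites the man".split(),
          "the man eats an apple".split()]
distributions = context_distributions(corpus)
print(distributions[("the", "man")])
```

Sequences such as "the man" and "the black dog" occur in similar contexts, which is the signal used to propose them as constituents of the same type.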


Constituent-Context Model (Klein and Manning ’02)

• Valid constituents in a tree should not cross

[Figure: two alternative unlabelled binary bracketings (root S, internal nodes X) of the tag sequence RB DT VBN VBD NNS IN DT NN]


Constituent-Context Model

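[Figure: "Sometimes the bribed became partners in the company" with the tags DT, VBN, RB and VBD highlighted]

Recall
Right-branch: 70.0
CCM: 81.6

[Figure: unlabelled binary tree (root S, internal nodes X) over the tags RB DT VBN VBD NNS IN DT NN]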
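The right-branching baseline quoted here needs no learning: every constituent ends at the last token. A minimal sketch, assuming half-open spans and a hypothetical gold bracketing of the example sentence (the numbers it prints are not the corpus-level figures on the slide):

```python
def right_branching_spans(n):
    """Spans (i, n) of a fully right-branching tree over n tokens."""
    return {(i, n) for i in range(n - 1)}


def span_recall(gold, predicted):
    """Share of gold spans that also appear in the predicted set."""
    return len(gold & predicted) / len(gold)


# Hypothetical gold spans for the eight-tag example sentence.
gold = {(0, 8), (1, 8), (1, 3), (3, 8), (4, 8), (5, 8), (6, 8)}
baseline = right_branching_spans(8)
print(round(100 * span_recall(gold, baseline), 1))
```

English is predominantly right-branching, which is why this trivial baseline is hard to beat and is reported alongside induced grammars.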


DEPENDENCY INDUCTION


Dependency Model w/ Valence (Klein and Manning ’04)

Simple generative model
◦ Choose head – P(Root)
◦ End – P(End | h, dir, v)
  Attachment dir (right, left)
  Valence (head outward)
◦ Argument – P(a | h, dir)

Dir Accuracy
CCM: 23.8
DMV: 43.2
Joint: 47.5

[Figure: dependency arcs over the tagged sentence]

RB DT VBN VBD NNS IN DT NN

Sometimes the bribed became partners in the company

• Head – P(Root)
• Argument – P(a | h, dir)
• End – P(End | h, dir, v)
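The three distributions combine into a tree probability as sketched below. This is an illustration rather than the authors' implementation: the valence v is assumed to be a boolean that is True while the head has no argument yet in that direction, the dictionary layout is invented here, and the parameter values are hypothetical.

```python
import math


def dmv_log_prob(heads, tags, p_root, p_arg, p_end):
    """Log-probability of one dependency tree under a DMV-style model.

    heads[i] is the index of token i's head, -1 for the root.
    p_root[tag], p_arg[(arg, head, dir)] and p_end[(head, dir, v)] are probabilities.
    """
    logp = 0.0
    for h, tag in enumerate(tags):
        if heads[h] == -1:
            logp += math.log(p_root[tag])
        for direction in ("left", "right"):
            args = [a for a, head in enumerate(heads)
                    if head == h and ((a < h) == (direction == "left"))]
            no_arg_yet = True
            for a in args:
                # Decide not to end, then choose the argument's tag.
                logp += math.log(1.0 - p_end[(tag, direction, no_arg_yet)])
                logp += math.log(p_arg[(tags[a], tag, direction)])
                no_arg_yet = False
            # Finally decide to end generation in this direction.
            logp += math.log(p_end[(tag, direction, no_arg_yet)])
    return logp


# Hypothetical parameters for the three-tag sequence DT NN VBD.
tags = ["DT", "NN", "VBD"]
heads = [1, 2, -1]
p_root = {"VBD": 1.0}
p_arg = {("DT", "NN", "left"): 1.0, ("NN", "VBD", "left"): 1.0}
p_end = {(t, d, v): 0.5 for t in tags for d in ("left", "right") for v in (True, False)}
log_prob = dmv_log_prob(heads, tags, p_root, p_arg, p_end)
print(math.exp(log_prob))
```

Because the product does not depend on the order in which a head's arguments are multiplied in, the sketch does not need to enumerate them head-outward.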


DMV Extensions (Headden et al. ’09; Blunsom and Cohn ’10)

Extended Valence (EVG)
◦ Valence frames for the head
  Allows different distributions over arguments
Dir Acc: 65.0

Lexicalization (L-EVG)
Dir Acc: 68.8

Tree Substitution Grammar
◦ Tree fragments instead of CFG rules
Dir Acc: 67.7

[Figure: dependency arcs over the tagged sentence]

RB DT VBN VBD NNS IN DT NN

Sometimes the bribed became partners in the company


MULTILINGUAL SETTING


Bilingual Alignment & Parsing (Wu ’97)

Inversion Transduction Grammar (ITG)
◦ Allows reordering

[Figure: ITG tree (S over X X) with leaves e2, f4, e1, f3, e4, f2, e3, f1, relating the sentences below]

e1 e2 e3 e4

f1 f2 f3 f4


Bilingual Parsing (Snyder et al. ’09)

Bilingual Parsing
◦ PP Attachment ambiguity
  I saw (the student (from MIT)1 )2
◦ Not ambiguous in Urdu
  میں ((ایم آئی ٹی سے)1 طالب علم)2 کو دیکھا
  I ((MIT of) student) saw


Summary & Overview

[Figure: methods arranged under "Parametric Search Methods" and "Structural Search Methods"]
• EM for PCFG
• Constrain with bracketing
• Distributional Clustering
• CCM
• DMV
• Contrastive Estimation
• EVG & L-EVG
• TSG + DMV
• Data-oriented Parsing
• Prototype

State-of-the-art
• Phrase-structure (CCM + DMV)
  Recall: 88.0
• Dependency (Lexicalized EVG)
  Dir Acc: 68.8


QUESTIONS?

Thanks!



Motivation

Languages have hidden regularities
◦ The guy in China
◦ … new leader in China
◦ That’s what I am asking you …
◦ I am telling you …


Issues with EM (Carroll and Charniak ’92; Pereira and Schabes ’92; de Marcken ’05) (Liang and Klein ’08; Spitkovsky et al. ’10)

Phrase-structure
◦ Finds local maxima instead of global
◦ Multiple ordered adjunctions

Both phrase-structure & dependency
◦ Disconnect between likelihood and optimal grammar


Constituent-Context Model (Klein and Manning ’02)

CCM
◦ Only constituent identity
◦ Valid constituents in a tree should not cross


Bootstrap phrases (Haghighi and Klein ’06)

Bootstrap with seed examples for constituent types
◦ Chosen from most frequent treebank phrases
◦ Induces labels for constituents

Integrate with CCM
◦ CCM generates brackets (constituents)
◦ Proto labels them

Recall: 59.6

Recall: 68.4


Dependency Model w/ Valence (Klein and Manning ’04)

Simple generative model
◦ Choose head; attachment dir (right, left)
◦ Valence (head outward)
  End of generation modelled separately

Dir Acc: 43.2

[Figure: dependency arcs over the tagged sentence]

RB DT VBN VBD NNS IN DT NN

Sometimes the bribed became partners in the company


Learn from how not to speak

Contrastive Estimation (Smith and Eisner ’05)
◦ Log-linear Model of dependency
  Features: f(q, T)
  P(Root); P(a | h, dir); P(End | h, dir, v)
  Conditional likelihood


Learn from how not to speak (Smith and Eisner ’05)

Contrastive Estimation
  Ex. the brown cat vs. cat brown the
◦ Neighborhoods
  Transpose (Trans), delete & transpose (DelOrTrans)

Dir Acc: 48.8
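The neighborhoods can be sketched as sets of perturbed sentences, with the objective comparing the observed sentence against its neighborhood. The following is a minimal sketch under assumptions: single adjacent transpositions and single deletions, the observed sentence included in its own neighborhood, and a placeholder `score` function standing in for the model's unnormalized score of a sentence.

```python
import math


def trans1(sentence):
    """The sentence itself plus every variant with one adjacent pair swapped."""
    out = {tuple(sentence)}
    for i in range(len(sentence) - 1):
        swapped = list(sentence)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        out.add(tuple(swapped))
    return out


def del_or_trans1(sentence):
    """Union of the transposition neighborhood and all single-word deletions."""
    words = list(sentence)
    deletions = {tuple(words[:i] + words[i + 1:]) for i in range(len(words))}
    return trans1(words) | deletions


def contrastive_log_ratio(sentence, neighborhood, score):
    """log of score(x) over the summed scores of all sentences in the neighborhood of x."""
    total = sum(score(y) for y in neighborhood(sentence))
    return math.log(score(tuple(sentence))) - math.log(total)


sentence = ["the", "brown", "cat"]
uniform = lambda words: 1.0  # hypothetical score that cannot tell good order from bad
ratio = contrastive_log_ratio(sentence, del_or_trans1, uniform)
print(sorted(trans1(sentence)), ratio)
```

Training raises this ratio, so probability mass moves from the perturbed sentences to the observed one.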


DMV Extensions-1 (Cohen and Smith ’08, ’09)

Tying parameters
◦ Correlated Topic Model (CTM)
  Correlation between different word types
◦ Two types of tying parameters
  Logistic Normal (LN)
  Shared LN

Dir Acc: 61.3
Dir Acc: 61.3


DMV Extensions-2 (Blunsom and Cohn ’10)

[Figure: head-annotated tree for "Sometimes the bribed became partners in the company", decomposed into lexicalized tree fragments anchored at "became" (VBD) and "in" (IN)]


DMV Extensions-2 (Blunsom and Cohn ’10)

Tree Substitution Grammar (TSG)
◦ Lexicalized trees
◦ Hierarchical prior
  Different levels of backoff

Dir Acc: 67.7

[Figure: lexicalized tree fragments anchored at "became" (VBD) and "in" (IN)]