
Should Neural Network Architecture Reflect Linguistic Structure?

Chris Dyer (CMU → Google/DeepMind)

Joint work with: Wang Ling (Google/DeepMind), Austin Matthews (CMU), Miguel Ballesteros (UPF), and Noah A. Smith (UW)

ICLR 2016, May 2, 2016

Learning language

ARBITRARINESS (de Saussure, 1916)

car − c + b = bar    cat − c + b = bat    (related forms, unrelated meanings)

"car" across languages: Auto, voiture, xe hơi, ọkọ ayọkẹlẹ, koloi, sakyanan

→ Memorize

COMPOSITIONALITY (Frege, 1892)

John dances − John + Mary = Mary dances    dance(John) → dance(Mary)

John sings − John + Mary = Mary sings    sing(John) → sing(Mary)

→ Generalize

Learning language

John gave the book to Mary

Each word is looked up as a one-hot vector and projected into a vector space, e.g. $\mathbf{v}_{\text{John}} = \mathbf{P}\,(0,\ 0,\ \ldots,\ 1,\ \ldots,\ 0)^{\top}$, where $\mathbf{P}$ is a learned projection matrix.

A function $f$ (Recurrent NN, ConvNet, Recursive NN, …) maps the sequence of word vectors to a prediction.

Memorize (the per-word lookup)    Generalize (the composition function $f$)
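To make the lookup step concrete, here is a minimal sketch (assuming NumPy; the vocabulary, dimension, and names are illustrative): multiplying the projection matrix $\mathbf{P}$ by a one-hot vector is exactly a column lookup.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["John", "gave", "the", "book", "to", "Mary"]   # illustrative vocabulary
dim = 4                                                 # embedding dimension
P = rng.normal(size=(dim, len(vocab)))                  # projection matrix: one column per word

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

v_john = P @ one_hot("John")                            # matrix times one-hot vector ...
assert np.allclose(v_john, P[:, vocab.index("John")])   # ... is just selecting a column of P
print(v_john)
```

In practice this multiplication is implemented directly as a table lookup (an embedding layer).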

Learning language

CHALLENGE 1: IDIOMS

John saw the football → see(John, football)    John saw the bucket → see(John, bucket)

John kicked the football → kick(John, football)    John kicked the bucket → die(John)

CHALLENGE 2: MORPHOLOGY

cool | coooool | coooooooool

cat + s = cats    bat + s = bats

Compositional words

cats

Memorize: a one-hot lookup vector, $(0,\ 0,\ \ldots,\ 1,\ \ldots,\ 0)$

Generalize: compose the word from its characters with an LSTM over the sequence START c a t s STOP
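As a rough sketch of the character-composition alternative (assuming PyTorch; the module, layer sizes, and alphabet here are illustrative, not the exact model from the talk), a bidirectional LSTM reads the characters of a word and its final forward and backward states are concatenated into the word's vector:

```python
import torch
import torch.nn as nn

class CharWordEmbedder(nn.Module):
    """Compose a word vector from its characters with a bidirectional LSTM."""
    def __init__(self, n_chars, char_dim=16, hidden_dim=32):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, char_ids):                  # char_ids: (1, word_length)
        h, _ = self.lstm(self.char_emb(char_ids))
        fwd = h[:, -1, :h.size(-1) // 2]          # last state of the forward direction
        bwd = h[:, 0, h.size(-1) // 2:]           # last state of the backward direction
        return torch.cat([fwd, bwd], dim=-1)      # the word's representation

# Toy usage: "cats" as a sequence of character ids (START/STOP handling omitted).
alphabet = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
embedder = CharWordEmbedder(n_chars=len(alphabet))
ids = torch.tensor([[alphabet[c] for c in "cats"]])
print(embedder(ids).shape)                        # torch.Size([1, 64])
```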

Compositional words: Questions

• Does a “compositional” model have the capacity to learn the “arbitrariness” that is required?

• We might think so—RNNs/LSTMs can definitely overfit!

• Will we see larger improvements in languages with more going on in the morphology?

Example: Dependency parsing

I saw her duck

ROOT

[Diagram: a dependency parse of the sentence, rooted at ROOT; each word's input vector comes from a word embedding model.]

Word embedding models: Memorize (one-hot lookup) or Generalize (character composition)

Dependency parsing: CharLSTM > Word Lookup

Language    Word    Chars    Δ
English     91.2    91.5    +0.3
Turkish     71.7    76.3    +4.6
Hungarian   72.8    80.4    +7.6
Basque      77.1    85.2    +8.1
Korean      78.7    88.4    +9.7
Swedish     76.4    79.2    +2.8
Arabic      85.2    86.1    +0.9
Chinese     79.1    79.9    +0.8

In English parsing, the character LSTM is roughly equivalent to the lookup approach (word lookup ≈ chars).

What about languages with richer lexicons?

Turkish: Muvaffakiyetsizleştiricileştiriveremeyebileceklerimizdenmişsinizcesine
Hungarian: Megszentségteleníthetetlenségeskedéseitekért

In agglutinative languages (Turkish, Hungarian, Basque, Korean), chars ≫ word lookup.

In fusional/templatic languages (Swedish, Arabic), chars > word lookup.

In analytic languages (Chinese), the models are roughly equivalent.

Language modeling: Word similarities

Query:                 increased    John       Noahshire          phding
5 nearest neighbors:   reduced      Richard    Nottinghamshire    mixing
                       improved     George     Bucharest          modelling
                       expected     James      Saxony             styling
                       decreased    Robert     Johannesburg       blaming
                       targeted     Edward     Gloucestershire    christening

Character vs. word modeling: Summary

• Lots of exciting work from a variety of places

• Google Brain: language models

• Harvard/NYU (Kim, Rush, Sontag): language models

• NYU/FB: document representation “from scratch”

• CMU (me, Cohen, Salakhutdinov): Twitter, morphologically rich languages, translation

• Now for something a bit more controversial…

Structure-aware words

cats

Memorize: a one-hot lookup vector, $(0,\ 0,\ \ldots,\ 1,\ \ldots,\ 0)$

Generalize: character composition, START c a t s STOP

Would we benefit from a more knowledge-rich decomposition?

cat +PL

Open Vocabulary LMs

• Rather than assuming a fixed vocabulary, model any sequence in $\Sigma^{*}$, where $\Sigma$ is the inventory of characters.
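A minimal sketch of what a distribution over $\Sigma^{*}$ means in practice (assuming PyTorch; the architecture, names, and character inventory are illustrative, and the model is untrained): a character-level RNN LM scores any string by chaining per-character predictions between START and STOP.

```python
import torch
import torch.nn as nn

class CharLM(nn.Module):
    """Assigns log p(string) to any string over a fixed character inventory."""
    def __init__(self, sigma, dim=32):
        super().__init__()
        self.stoi = {c: i for i, c in enumerate(sigma)}
        self.start, self.stop = len(sigma), len(sigma) + 1   # special START/STOP symbols
        self.emb = nn.Embedding(len(sigma) + 2, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, len(sigma) + 2)

    def log_prob(self, s):
        ids = [self.start] + [self.stoi[c] for c in s] + [self.stop]
        x = torch.tensor([ids[:-1]])                 # condition on the history
        y = torch.tensor(ids[1:])                    # predict the next character
        h, _ = self.rnn(self.emb(x))
        logp = torch.log_softmax(self.out(h[0]), dim=-1)
        return logp[torch.arange(len(y)), y].sum()   # sum of per-character log-probs

sigma = "abcdefghijklmnopqrstuvwxyzçğıöşü"           # characters, including Turkish letters
lm = CharLM(sigma)
print(lm.log_prob("sürecini"))                       # any string in Σ* gets a probability
```

The full model in the talk additionally pools in word- and morpheme-level information, which is what the perplexity comparison below is about.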

Open Vocabulary LMs: Turkish morphology

Morphological analyses of “sürecini”: süreç+NOUN+A3SG+P3SG+ACC and süreç+NOUN+A3SG+P2SG+ACC

POOLING (over the alternative analyses)

sürecini → s ü r e c i n i

Input word representation: [diagram]

Output generation process: (marginalized) [diagram]

Open Vocabulary LM: perplexity per word

Characters                     18600
Characters + Morphs             8165
Characters + Words              5021
Characters + Words + Morphs     4116

Character vs. word modeling: Summary

• Model performance is essentially equivalent in morphologically simple languages (e.g., Chinese, English)

• In morphologically rich languages (e.g., Hungarian, Turkish, Finnish), performance improvements are most pronounced

• We need far fewer parameters to represent words as “compositions” of characters

• Word- and morpheme-level information adds additional value

• Where else could we add linguistic structural knowledge?

Modeling syntax: Language is hierarchical

(1) a. The talk I gave did not appeal to anybody.
    b. *The talk I gave appealed to anybody.

Here anybody is a negative polarity item (NPI).

Generalization hypothesis: not must come before anybody.

(2) *The talk I did not give appealed to anybody.

Examples adapted from Everaert et al. (TICS 2015)

Examples adapted from Everaert et al. (TICS 2015)

containing anybody. (This structural configuration is called c(onstituent)-command in thelinguistics literature [31].) When the relationship between not and anybody adheres to thisstructural configuration, the sentence is well-formed.

In sentence (3), by contrast, not sequentially precedes anybody, but the triangle dominating notin Figure 1B fails to also dominate the structure containing anybody. Consequently, the sentenceis not well-formed.

The reader may confirm that the same hierarchical constraint dictates whether the examples in(4–5) are well-formed or not, where we have depicted the hierarchical sentence structure interms of conventional labeled brackets:

(4) [S1 [NP The book [S2 I bought]S2]NP did not [VP appeal to anyone]VP]S1(5) *[S1 [NP The book [S2 I did not buy]S2]NP [VP appealed to anyone]VP]S1

Only in example (4) does the hierarchical structure containing not (corresponding to the sentenceThe book I bought did not appeal to anyone) also immediately dominate the NPI anybody. In (5)not is embedded in at least one phrase that does not also include the NPI. So (4) is well-formedand (5) is not, exactly the predicted result if the hierarchical constraint is correct.

Even more strikingly, the same constraint appears to hold across languages and in many othersyntactic contexts. Note that Japanese-type languages follow this same pattern if we assumethat these languages have hierarchically structured expressions similar to English, but linearizethese structures somewhat differently – verbs come at the end of sentences, and so forth [32].Linear order, then, should not enter into the syntactic–semantic computation [33,34]. This israther independent of possible effects of linearly intervening negation that modulate acceptabilityin NPI contexts [35].

The Syntax of SyntaxObserve an example as in (6):

(6) Guess which politician your interest in clearly appeals to.

The construction in (6) is remarkable because a single wh-phrase is associated bothwith the prepositional object gap of to and with the prepositional object gap of in, as in(7a). We talk about ‘gaps’ because a possible response to (6) might be as in (7b):

(7) a. Guess which politician your interest in GAP clearly appeals to GAP.b. response to (7a): Your interest in Donald Trump clearly appeals to Donald Trump

(A) (B)

X X

X X X X

The book X X X The book X appealed to anybodydid not

that I bought appeal to anybody that I did not buy

Figure 1. Negative Polarity. (A) Negative polarity licensed: negative element c-commands negative polarity item.(B) Negative polarity not licensed. Negative element does not c-command negative polarity item.

734 Trends in Cognitive Sciences, December 2015, Vol. 19, No. 12

Language is hierarchical

[Tree diagrams: in “The talk I gave did not appeal to anybody”, the constituent containing not also dominates anybody; in “*The talk I did not give appealed to anybody”, not is buried inside the relative clause and does not.]




Generalization: not must “structurally precede” anybody

• there are many theories of the details of structure

• the psychological reality of structural sensitivity is not empirically controversial

• much more than NPIs follow such constraints

One theory of hierarchy

• Generate symbols sequentially using an RNN

• Add some “control symbols” to rewrite the history periodically

• Periodically “compress” a sequence into a single “constituent”

• Augment the RNN with an operation to compress recent history into a single vector (“reduce”)

• The RNN predicts the next symbol based on the history of compressed elements and non-compressed terminals (“shift” or “generate”)

• The RNN must also predict the “control symbols” that decide how big constituents are

• We call such models recurrent neural network grammars (RNNGs).

RNNG generation of “The hungry cat meows .”, step by step:

Terminals                | Stack                                    | Action
                         |                                          | NT(S)
                         | (S                                       | NT(NP)
                         | (S (NP                                   | GEN(The)
The                      | (S (NP The                               | GEN(hungry)
The hungry               | (S (NP The hungry                        | GEN(cat)
The hungry cat           | (S (NP The hungry cat                    | REDUCE
The hungry cat           | (S (NP The hungry cat)                   | NT(VP)
The hungry cat           | (S (NP The hungry cat) (VP               | GEN(meows)
The hungry cat meows     | (S (NP The hungry cat) (VP meows         | REDUCE
The hungry cat meows     | (S (NP The hungry cat) (VP meows)        | GEN(.)
The hungry cat meows .   | (S (NP The hungry cat) (VP meows) .      | REDUCE
The hungry cat meows .   | (S (NP The hungry cat) (VP meows) .)     |

Each REDUCE compresses the just-completed span (e.g., “The hungry cat”) into a single composite symbol on the stack.
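The sketch below (plain Python, no neural scoring; names are illustrative) replays exactly this action sequence, showing how REDUCE pops back to the most recent open nonterminal and leaves a single composite symbol on the stack:

```python
def generate(actions):
    """Replay a sequence of RNNG actions, returning the terminals and the bracketing."""
    stack, terminals = [], []
    for act, arg in actions:
        if act == "NT":                        # open a new constituent, e.g. (S
            stack.append(("OPEN", arg))
        elif act == "GEN":                     # generate a terminal word
            stack.append(arg)
            terminals.append(arg)
        elif act == "REDUCE":                  # close the most recent open constituent
            children = []
            while not isinstance(stack[-1], tuple):
                children.append(stack.pop())
            _, label = stack.pop()
            # the closed span becomes a single composite symbol on the stack
            stack.append("(" + label + " " + " ".join(reversed(children)) + ")")
    return " ".join(terminals), stack[-1]

actions = [("NT", "S"), ("NT", "NP"), ("GEN", "The"), ("GEN", "hungry"), ("GEN", "cat"),
           ("REDUCE", None), ("NT", "VP"), ("GEN", "meows"), ("REDUCE", None),
           ("GEN", "."), ("REDUCE", None)]
print(generate(actions))
# ('The hungry cat meows .', '(S (NP The hungry cat) (VP meows) .)')
```

The printed stack matches the final row of the table above.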

Syntactic Composition

Need a representation for: (NP The hungry cat)

What head type? NP

[Diagram: an LSTM reads the NP label and the child vectors for The, hungry, cat (up to the closing bracket) and produces a single vector for the constituent.]

Recursion

Need a representation for: (NP The (ADJP very hungry) cat), given (NP The hungry cat)

[Diagram: the composed vector v for (ADJP very hungry) stands in as a single child of the NP, in place of “hungry”.]

Syntactic Composition

• Inspired by Socher et al. (2011, 2012, …)

• Words and constituents are embedded in the same space

• Composition functions (a sketch follows this list):

• capture the linguistic notion of headedness (the LSTMs know what type of head they are looking for while they traverse the children)

• support any number of children

• are learned via backpropagation through structure
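A rough sketch of one such composition function (assuming PyTorch; the dimensions, the single bidirectional LSTM, and the projection back to the shared embedding space are illustrative simplifications, not the exact parameterization from the talk):

```python
import torch
import torch.nn as nn

class Composer(nn.Module):
    """Compose a nonterminal label and its child vectors into one constituent vector."""
    def __init__(self, dim=32, n_nt=30):
        super().__init__()
        self.nt_emb = nn.Embedding(n_nt, dim)
        self.lstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)       # back to the shared embedding space

    def forward(self, nt_id, children):           # children: list of (dim,) vectors
        seq = torch.stack([self.nt_emb(torch.tensor(nt_id))] + children).unsqueeze(0)
        h, _ = self.lstm(seq)                     # read the label, then the children
        fwd, bwd = h[0, -1, :h.size(-1) // 2], h[0, 0, h.size(-1) // 2:]
        return torch.tanh(self.proj(torch.cat([fwd, bwd])))   # same size as a word vector

# Usage: compose (NP The hungry cat) from already-embedded children (illustrative vectors).
composer = Composer()
children = [torch.randn(32) for _ in ["The", "hungry", "cat"]]
np_vec = composer(nt_id=0, children=children)     # handles any number of children
print(np_vec.shape)                               # torch.Size([32])
```

Because the output lives in the same space as word vectors, a composed constituent (e.g., the ADJP above) can itself appear as a child in a later composition.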

• RNNGs jointly model sequences of words together with a “tree structure”

• Any parse tree can be converted to a sequence of actions (via a depth-first traversal) and vice versa (subject to wellformedness constraints)

• We use trees from the Penn Treebank

• We could treat the non-generation actions as latent variables or learn them with RL, effectively making this a problem of grammar induction. Future work…

Implementing RNNGs: Parameter Estimation

• An RNNG is a joint distribution $p_\theta(x, y)$ over strings ($x$) and parse trees ($y$)

Implementing RNNGs: Inference

• We are interested in two inference questions:

• What is $p(x)$ for a given $x$? [language modeling]

• What is $\max_{y} p(y \mid x)$ for a given $x$? [parsing]

• Unfortunately, the dynamic programming algorithms we often rely on are of no help here

• We can use importance sampling to do both by sampling from a discriminatively trained model $q(y \mid x)$

English PTB (Parsing)

Model                              | Type | F1
Petrov and Klein (2007)            | G    | 90.1
Shindo et al. (2012), single model | G    | 91.1
Shindo et al. (2012), ensemble     | ~G   | 92.4
Vinyals et al. (2015), PTB only    | D    | 90.5
Vinyals et al. (2015), ensemble    | S    | 92.8
Discriminative                     | D    | 89.8
Generative (IS)                    | G    | 92.4

English PTB (LM)

Model            | Perplexity
5-gram IKN       | 169.3
LSTM + Dropout   | 113.4
Generative (IS)  | 102.4

Chinese CTB (LM)

Model            | Perplexity
5-gram IKN       | 255.2
LSTM + Dropout   | 207.3
Generative (IS)  | 171.9

This Talk, In a Nutshell

• Facts about language:

• Arbitrariness and compositionality exist at all levels

• Language is sensitive to hierarchy, not strings

• My work's hypothesis: models that make these structural considerations explicit will outperform models that don't

START t h a n k s ! STOP

questions?

Implementing RNNGs: Stack RNNs

• Augment a sequential RNN with a stack pointer

• Two constant-time operations:

• push: read an input, add it to the top of the stack, and connect it to the current location of the stack pointer

• pop: move the stack pointer to its parent

• A summary of the stack contents is obtained by accessing the output of the RNN at the location of the stack pointer

• Note: push and pop are discrete actions here (cf. Grefenstette et al., 2015); a code sketch follows the diagram below

[Diagram: starting from the empty stack (output y0), a sequence of PUSH and POP operations reads inputs x1, x2, x3 and produces outputs y1, y2, y3. Each PUSH creates a new state as a child of the state under the stack pointer; each POP just moves the pointer back to its parent. Popped states are retained, so the history forms a tree of RNN states with the pointer at the current top.]
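A minimal sketch of such a stack RNN (assuming PyTorch's LSTMCell; sizes and names are illustrative): states are stored in a tree, push runs one RNN step from the pointer's state, and pop simply moves the pointer to its parent.

```python
import torch
import torch.nn as nn

class StackRNN:
    """An RNN whose history can be pushed to and popped from in O(1) time."""
    def __init__(self, input_dim=16, hidden_dim=32):
        self.cell = nn.LSTMCell(input_dim, hidden_dim)
        empty = (torch.zeros(1, hidden_dim), torch.zeros(1, hidden_dim))
        self.states = [(empty, None)]      # list of (state, parent index); index 0 = empty stack
        self.ptr = 0                       # the stack pointer

    def push(self, x):                     # x: (1, input_dim)
        state = self.cell(x, self.states[self.ptr][0])   # one RNN step from the pointer's state
        self.states.append((state, self.ptr))
        self.ptr = len(self.states) - 1

    def pop(self):                         # never recomputes anything
        self.ptr = self.states[self.ptr][1]

    def summary(self):                     # RNN output at the location of the stack pointer
        return self.states[self.ptr][0][0]

stack = StackRNN()
stack.push(torch.randn(1, 16))             # push x1
stack.push(torch.randn(1, 16))             # push x2
stack.pop()                                # back to the state after x1
print(stack.summary().shape)               # torch.Size([1, 32])
```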

Importance Sampling

Assume we've got a conditional distribution $q(y \mid x)$ such that

(i) $p(x, y) > 0 \implies q(y \mid x) > 0$,
(ii) sampling $y \sim q(y \mid x)$ is tractable, and
(iii) evaluating $q(y \mid x)$ is tractable.

Let the importance weights be $w(x, y) = \dfrac{p(x, y)}{q(y \mid x)}$. Then

$$p(x) = \sum_{y \in \mathcal{Y}(x)} p(x, y) = \sum_{y \in \mathcal{Y}(x)} w(x, y)\, q(y \mid x) = \mathbb{E}_{y \sim q(y \mid x)}\, w(x, y).$$

Replace this expectation with its Monte Carlo estimate, drawing $y^{(i)} \sim q(y \mid x)$ for $i \in \{1, 2, \ldots, N\}$:

$$\mathbb{E}_{q(y \mid x)}\, w(x, y) \overset{\text{MC}}{\approx} \frac{1}{N} \sum_{i=1}^{N} w(x, y^{(i)}).$$
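A minimal sketch of this estimator (plain Python; the toy distributions are illustrative), working in log space for numerical stability:

```python
import math, random

def log_p_x(samples, log_p_joint, log_q):
    """Importance-sampling estimate of log p(x) from samples y ~ q(y | x).

    samples:      sampled parses y^(1..N)
    log_p_joint:  function y -> log p(x, y)   (e.g., the generative joint model)
    log_q:        function y -> log q(y | x)  (the proposal distribution)
    """
    # log importance weights: log w(x, y) = log p(x, y) - log q(y | x)
    log_w = [log_p_joint(y) - log_q(y) for y in samples]
    m = max(log_w)                                   # log-sum-exp trick
    return m + math.log(sum(math.exp(lw - m) for lw in log_w)) - math.log(len(samples))

# Toy check on a discrete "parse" space where the exact marginal is known:
# p(x, y) puts mass 0.3, 0.2, 0.1 on three parses, so p(x) = 0.6.
p_joint = {1: 0.3, 2: 0.2, 3: 0.1}
q = {1: 0.5, 2: 0.3, 3: 0.2}
samples = random.choices([1, 2, 3], weights=[q[y] for y in (1, 2, 3)], k=20000)
est = log_p_x(samples, lambda y: math.log(p_joint[y]), lambda y: math.log(q[y]))
print(math.exp(est))                                 # ≈ 0.6
```

In the talk's setting, log_p_joint would come from the generative RNNG and log_q from the discriminatively trained parser; reranking the same samples by log p(x, y) gives the parsing answer.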
