Why Generative Models Underperform Surface Heuristics
UC Berkeley Natural Language Processing
John DeNero, Dan Gillick, James Zhang, and Dan Klein
Overview: Learning Phrases
Sentence-aligned corpus → Directional word alignments → Intersected and grown word alignments → Phrase table (translation model)

Example phrase-table entries:
cat ||| chat ||| 0.9
the cat ||| le chat ||| 0.8
dog ||| chien ||| 0.8
house ||| maison ||| 0.6
my house ||| ma maison ||| 0.9
language ||| langue ||| 0.9
…
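The heuristic path scores each extracted phrase pair by its relative frequency. A minimal sketch of that estimation step (the toy pair list, function name, and conditioning direction are illustrative choices, not from the slides):

```python
from collections import Counter

def relative_frequency_table(extracted_pairs):
    """Heuristic phrase-table estimation: score each extracted
    (english, french) phrase pair by relative frequency,
    phi(f | e) = count(e, f) / count(e)."""
    pair_counts = Counter(extracted_pairs)
    e_counts = Counter(e for e, _ in extracted_pairs)
    return {(e, f): c / e_counts[e] for (e, f), c in pair_counts.items()}

# Toy extraction counts: "cat" aligned to "chat" twice, "minou" once.
table = relative_frequency_table([
    ("cat", "chat"), ("cat", "chat"), ("cat", "minou"),
    ("dog", "chien"),
])
print(table[("cat", "chat")])  # 2/3
```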
Overview: Learning Phrases
Sentence-aligned corpus → Phrase-level generative model → Phrase table (translation model)
• Early successful phrase-based SMT system [Marcu & Wong ‘02]
• Challenging to train
• Underperforms heuristic approach
Outline
I) Generative phrase-based alignment: motivation; model structure and training; performance results
II) Error analysis: properties of the learned phrase table; contributions to increased error rate
III) Proposed improvements
Motivation for Learning Phrases
Translate!
Input sentence: J ’ ai un chat .
Output sentence: I have a spade .
Motivation for Learning Phrases
French: appelle un chat un chat
English: call a spade a spade

Learned phrase pairs:
appelle ||| call
chat un chat ||| spade a spade
Motivation for Learning Phrases
French: appelle un chat un chat
English: call a spade a spade

All extracted sub-phrase pairs:
appelle ||| call
appelle un ||| call a
appelle un chat ||| call a spade
un ||| a (×2)
un chat ||| a spade (×2)
un chat un ||| a spade a
chat ||| spade (×2)
chat un ||| spade a
chat un chat ||| spade a spade

… appelle un chat un chat …
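The enumeration of sub-phrase pairs can be sketched as follows. This is a toy version that assumes a monotone, word-for-word alignment; a real extractor walks arbitrary word alignments:

```python
from collections import Counter

def extract_subphrases(french, english, max_len=3):
    """Enumerate aligned sub-phrase pairs of a word-for-word aligned
    phrase pair, counting repeats (toy monotone-alignment version)."""
    f, e = french.split(), english.split()
    assert len(f) == len(e)  # one-to-one, monotone alignment assumed
    pairs = Counter()
    for i in range(len(f)):
        for j in range(i + 1, min(i + max_len, len(f)) + 1):
            pairs[(" ".join(f[i:j]), " ".join(e[i:j]))] += 1
    return pairs

pairs = extract_subphrases("appelle un chat un chat",
                           "call a spade a spade")
print(pairs[("un chat", "a spade")])  # 2, matching the x2 counts
```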
A Phrase Alignment Model Compatible with Pharaoh
French: les chats aiment le poisson frais .
English: cats like fresh fish .
Training Regimen That Respects Word Alignment
[Figure: two phrase alignments of “les chats aiment le poisson frais .” with “cats like fresh fish .”; the second conflicts with the word alignment and is rejected (marked X).]
Training Regimen That Respects Word Alignment
[Figure: a phrase alignment of “les chats aiment le poisson frais .” with “cats like fresh fish .” that respects the word alignment.]
Only 46% of training sentences contributed to training.
Performance Results

[Chart: BLEU (36–40) vs. EM iterations (0–4) for 25k- and 100k-sentence corpora, compared against heuristically generated parameters.]
Performance Results

[Bar chart: BLEU for Heuristic (100k), Heuristic (50k), Heuristic (25k), and Learned (100k); values shown: 39.0, 38.5, 38.3, 38.8.]

Lost training data is not the whole story: learned parameters trained on 4× the data still underperform the heuristic.
Outline
I) Generative phrase-based alignment: model structure and training; performance results
II) Error analysis: properties of the learned phrase table; contributions to increased error rate
III) Proposed improvements
Example: Maximizing Likelihood with Competing Segmentations

Training corpus:
French: carte sur la table    English: map on the table
French: carte sur la table    English: notice on the chart

Phrase table (relative frequencies over all extracted pairs):
carte ||| map ||| 0.5
carte ||| notice ||| 0.5
carte sur ||| map on ||| 0.5
carte sur ||| notice on ||| 0.5
carte sur la ||| map on the ||| 0.5
carte sur la ||| notice on the ||| 0.5
sur ||| on ||| 1.0
la ||| the ||| 1.0
sur la ||| on the ||| 1.0
sur la table ||| on the table ||| 0.5
sur la table ||| on the chart ||| 0.5
la table ||| the table ||| 0.5
la table ||| the chart ||| 0.5
table ||| table ||| 0.5
table ||| chart ||| 0.5

Likelihood computation for “carte sur la table”: every segmentation contributes 0.25, so each training pair has likelihood 0.25 × 7 / 7 = 0.25.
Example: Maximizing Likelihood with Competing Segmentations

Training corpus:
French: carte sur la table    English: map on the table
French: carte sur la table    English: notice on the chart

A degenerate solution makes every pair deterministic:
carte ||| map ||| 1.0
carte sur ||| notice on ||| 1.0
carte sur la ||| notice on the ||| 1.0
sur ||| on ||| 1.0
sur la ||| on the ||| 1.0
sur la table ||| on the table ||| 1.0
la ||| the ||| 1.0
la table ||| the table ||| 1.0
table ||| chart ||| 1.0

Likelihood of the “notice on the chart” pair: 1.0 × 2 / 7 ≈ 0.28 > 0.25
Likelihood of the “map on the table” pair: 1.0 × 2 / 7 ≈ 0.28 > 0.25
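Both likelihood computations can be checked with a short script. This is a sketch of the joint phrase model with a uniform prior over segmentations (mirroring the 1/7 factor; the phrase length cap of 3 excludes the whole-sentence segmentation), with table contents taken from the two examples above:

```python
from itertools import combinations

def likelihood(french, english, table, max_len=3):
    """Sum the probability of producing `english` over all segmentations
    of `french`, under a uniform prior over segmentations."""
    words = french.split()
    n = len(words)
    segs = []
    for k in range(n):  # k = number of phrase-boundary cuts
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            spans = [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]
            if all(len(s.split()) <= max_len for s in spans):
                segs.append(spans)
    total = 0.0
    for seg in segs:
        def expand(i, out, p):  # sum over English renderings of this seg
            nonlocal total
            if i == len(seg):
                total += p if " ".join(out) == english else 0.0
                return
            for e, pe in table.get(seg[i], {}).items():
                expand(i + 1, out + [e], p * pe)
        expand(0, [], 1.0)
    return total / len(segs)

# Relative-frequency table extracted from the two-sentence corpus.
uniform = {
    "carte": {"map": 0.5, "notice": 0.5},
    "carte sur": {"map on": 0.5, "notice on": 0.5},
    "carte sur la": {"map on the": 0.5, "notice on the": 0.5},
    "sur": {"on": 1.0}, "la": {"the": 1.0}, "sur la": {"on the": 1.0},
    "sur la table": {"on the table": 0.5, "on the chart": 0.5},
    "la table": {"the table": 0.5, "the chart": 0.5},
    "table": {"table": 0.5, "chart": 0.5},
}
# Degenerate deterministic table that EM prefers.
degenerate = {
    "carte": {"map": 1.0}, "carte sur": {"notice on": 1.0},
    "carte sur la": {"notice on the": 1.0},
    "sur": {"on": 1.0}, "sur la": {"on the": 1.0},
    "sur la table": {"on the table": 1.0},
    "la": {"the": 1.0}, "la table": {"the table": 1.0},
    "table": {"chart": 1.0},
}

f = "carte sur la table"
print(likelihood(f, "map on the table", uniform))     # 0.25
print(likelihood(f, "map on the table", degenerate))  # 2/7 ≈ 0.2857
```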
EM Training Significantly Decreases Entropy of the Phrase Table
French phrase entropy:
[Histogram: percent of French phrases in entropy bins 0–.01, .01–.5, .5–1, 1–1.5, 1.5–2, and >2, for the learned vs. heuristic tables.]
10% of French phrases have deterministic distributions.
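The quantity binned in the histogram is the entropy of each French phrase's translation distribution; a determinized phrase lands in the lowest bin. A quick sketch (both example distributions are illustrative, shaped roughly like a flat heuristic entry and an EM-determinized one):

```python
import math

def entropy(dist):
    """Entropy in bits of one French phrase's translation distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Illustrative distributions, not figures from the slides.
heuristic = {"degree": 0.49, "level": 0.38, "extent": 0.02, "amount": 0.02}
learned = {"degree": 0.998, "characterizes": 0.001, "characterized": 0.001}

print(entropy(heuristic))  # ≈ 1.26 bits: several live candidates
print(entropy(learned))    # ≈ 0.023 bits: effectively deterministic
```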
Effect 1: Useful Phrase Pairs Are Lost Due to Critically Small Probabilities
In 10k translated sentences, no phrase pair with weight less than 10^-5 was used by the decoder.

[Chart: effective table size (thousands of phrase pairs, 0–400) for the heuristic vs. learned tables.]
Effect 2: Determinized Phrases Override Better Candidates During Decoding
Input: the situation varies to an enormous degree
Heuristic output: the situation varie d ' une immense degré
Learned output: the situation varie d ' une immense caractérise

Translations of degré:        φH     φEM
amount                        0.02   ~0
extent                        0.02   0.01
level                         0.38   0.26
degree                        0.49   0.64

Translations of caractérise:  φH     φEM
features                      0.05   ~0
characterized                 0.21   0.001
characterizes                 0.49   0.001
degree                        ~0     0.998
Effect 3: Ambiguous Foreign Phrases Become Active During Decoding
Deterministic phrases (probability 1.0) can be used by the decoder at no model cost.

[Figure: translations for the French apostrophe.]
Outline
I) Generative phrase-based alignment: model structure and training; performance results
II) Error analysis: properties of the learned phrase table; contributions to increased error rate
III) Proposed improvements
Motivation for Reintroducing Entropy to the Phrase Table
1. Useful phrase pairs are lost due to critically small probabilities.
2. Determinized phrases override better candidates.
3. Ambiguous foreign phrases become active during decoding.
Reintroducing Lost Phrases
[Chart: BLEU (36.5–39, 25k sentences) for the learned, heuristic, and interpolated phrase tables.]

Interpolation yields up to a 1.0 BLEU improvement.
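Reintroducing lost phrases amounts to mixing the two tables entry by entry. A minimal sketch of linear interpolation (the 0.5 weight and the example entries are illustrative; the tuned weight behind the BLEU gain is not given here):

```python
def interpolate(heuristic, learned, lam=0.5):
    """Mix two phrase tables: lam * heuristic + (1 - lam) * learned.
    Pairs that EM drove to (near) zero get heuristic mass back, so
    they survive pruning and can be used at decoding time."""
    out = {}
    for f in heuristic.keys() | learned.keys():
        h = heuristic.get(f, {})
        l = learned.get(f, {})
        out[f] = {e: lam * h.get(e, 0.0) + (1 - lam) * l.get(e, 0.0)
                  for e in h.keys() | l.keys()}
    return out

mixed = interpolate({"degré": {"degree": 0.49, "level": 0.38}},
                    {"degré": {"degree": 0.64, "level": 0.26}})
print(mixed["degré"]["degree"])  # 0.5 * 0.49 + 0.5 * 0.64 ≈ 0.565
```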
Smoothing Phrase Probabilities
Reserves probability mass for unseen translations, based on the length of the French phrase.

[Chart: BLEU (36.5–39, 25k sentences) for the learned, heuristic, and smoothed phrase tables.]
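A sketch of such length-based smoothing. The linear alpha-per-word schedule below is an assumption for illustration; the slide does not give the exact discount formula:

```python
def smooth(dist, french, alpha=0.1):
    """Discount a French phrase's translation distribution, reserving
    mass for unseen translations. Longer French phrases are observed
    fewer times, so more mass is held out for them. The linear
    schedule alpha * length is an illustrative assumption.
    Returns (discounted distribution, reserved mass)."""
    reserved = min(0.9, alpha * len(french.split()))
    return {e: p * (1 - reserved) for e, p in dist.items()}, reserved

dist, held_out = smooth({"degree": 0.998, "characterizes": 0.002},
                        "d ' une immense")  # 4 tokens -> 0.4 reserved
print(held_out)        # 0.4
print(dist["degree"])  # 0.998 * 0.6 ≈ 0.5988
```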
Conclusion

Generative phrase models determinize the phrase table via the latent segmentation variable.
A determinized phrase table introduces errors at decoding time.
Modest improvement can be realized by reintroducing phrase table entropy.
Questions?