jhu mt class: feature-based models

Feature-Based Models

•Some (not all) key ingredients in Google Translate:

•Phrase-based translation models

•... Learned heuristically from word alignments

•... Coupled with a huge language model

•... And very tight pruning heuristics

•Today: more flexible parameterizations.

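Bayes’ Rule

$$p(\text{English} \mid \text{Chinese}) \;\propto\; \underbrace{p(\text{English})}_{\text{language model}} \times \underbrace{p(\text{Chinese} \mid \text{English})}_{\text{translation model}}$$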

$$\arg\max_{\text{English}} \; \underbrace{p(\text{Chinese} \mid \text{English}) \times p(\text{English})}_{\sim\; p(\text{English} \mid \text{Chinese})}$$

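$$\arg\max_{\text{English}} \; p(\text{Chinese} \mid \text{English})^{1} \times p(\text{English})^{1}$$

$$\arg\max_{\text{English}} \; p(\text{Chinese} \mid \text{English})^{2} \times p(\text{English})^{1}$$

$$\arg\max_{\text{English}} \; p(\text{Chinese} \mid \text{English})^{1/2} \times p(\text{English})^{1}$$

$$\arg\max_{\text{English}} \; p(\text{Chinese} \mid \text{English})^{0} \times p(\text{English})^{1}$$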
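The effect of the four exponent settings can be seen on a toy candidate list. This is only a sketch: the candidate strings and both probabilities are invented for illustration, and `best_translation` is a hypothetical helper name, not something from the lecture.

```python
# Toy candidates: (English string, p(Chinese|English), p(English)).
# All numbers are made up for illustration.
CANDIDATES = [
    ("the north wind", 0.20, 0.010),
    ("north the wind", 0.30, 0.005),
    ("wind of the north", 0.05, 0.028),
]


def best_translation(candidates, tm_exp, lm_exp):
    """Return the English string maximizing p(Chinese|English)^tm_exp * p(English)^lm_exp."""
    best = max(candidates, key=lambda c: (c[1] ** tm_exp) * (c[2] ** lm_exp))
    return best[0]


for tm_exp in (1, 2, 0.5, 0):
    print(tm_exp, "->", best_translation(CANDIDATES, tm_exp, 1))
# 1   -> the north wind
# 2   -> north the wind
# 0.5 -> wind of the north
# 0   -> wind of the north
```

Raising the translation-model exponent makes the search trust p(Chinese|English) more; setting it to 0 leaves the language model alone to pick the output.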


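$$\arg\max_{\text{English}} \; \underbrace{0 \cdot \log p(\text{Chinese} \mid \text{English}) + 1 \cdot \log p(\text{English})}_{\sim\; \log p(\text{English} \mid \text{Chinese})}$$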

log(x) is monotonic for positive x: log(x) > log(y) iff x > y

$$\arg\max_{\text{English}} \; \text{score}(\text{English} \mid \text{Chinese})$$

$$\text{score}(\text{English} \mid \text{Chinese}) = \lambda_1 \log p(\text{Chinese} \mid \text{English}) + \lambda_2 \log p(\text{English})$$
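Because log is monotonic, ranking candidates by the weighted sum of log probabilities gives the same order as ranking by the product of exponentiated probabilities. The sketch below checks this on invented numbers; `product_score`, `log_score`, and `rank` are illustrative names only.

```python
import math

# Toy candidates: (English string, p(Chinese|English), p(English)); numbers are invented.
CANDIDATES = [
    ("the north wind", 0.20, 0.010),
    ("north the wind", 0.30, 0.005),
    ("wind of the north", 0.05, 0.028),
]


def product_score(p_tm, p_lm, lam1, lam2):
    """p(Chinese|English)^lam1 * p(English)^lam2."""
    return (p_tm ** lam1) * (p_lm ** lam2)


def log_score(p_tm, p_lm, lam1, lam2):
    """lam1 * log p(Chinese|English) + lam2 * log p(English)."""
    return lam1 * math.log(p_tm) + lam2 * math.log(p_lm)


def rank(candidates, scorer, lam1, lam2):
    """English strings sorted from best to worst under the given scorer."""
    ordered = sorted(candidates,
                     key=lambda c: scorer(c[1], c[2], lam1, lam2),
                     reverse=True)
    return [c[0] for c in ordered]


for lam1, lam2 in [(1, 1), (2, 1), (0.5, 1)]:
    same = rank(CANDIDATES, product_score, lam1, lam2) == rank(CANDIDATES, log_score, lam1, lam2)
    print(lam1, lam2, same)
```

Working in log space also avoids underflow when many small probabilities are multiplied.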

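$$\text{score}(\text{English} \mid \text{Chinese}) = \exp\big(\lambda_1 \log p(\text{Chinese} \mid \text{English}) + \lambda_2 \log p(\text{English})\big)$$

$$p(\text{English} \mid \text{Chinese}) = \frac{\exp\big(\lambda_1 \log p(\text{Chinese} \mid \text{English}) + \lambda_2 \log p(\text{English})\big)}{\sum_{\text{English}'} \exp\big(\lambda_1 \log p(\text{Chinese} \mid \text{English}') + \lambda_2 \log p(\text{English}')\big)}$$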

log-linear model, maximum entropy model

conditional model, undirected model




$$p(\text{English}) \times p(\text{Chinese} \mid \text{English})$$

Note: the original model is a special case of this model (set λ1 = λ2 = 1).
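The special case can be checked in one line. The following is a short derivation sketch, writing $E$ for English and $C$ for Chinese and assuming the two features are $\log p(C \mid E)$ and $\log p(E)$ with both weights set to 1:

$$\frac{\exp\big(1 \cdot \log p(C \mid E) + 1 \cdot \log p(E)\big)}{\sum_{E'} \exp\big(1 \cdot \log p(C \mid E') + 1 \cdot \log p(E')\big)} = \frac{p(C \mid E)\, p(E)}{\sum_{E'} p(C \mid E')\, p(E')} = \frac{p(C \mid E)\, p(E)}{p(C)} = p(E \mid C)$$

The last step is Bayes' rule, so the weighted model with unit weights recovers the noisy-channel model exactly.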


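$$p(\text{English} \mid \text{Chinese}) = \frac{\exp\Big(\sum_k \lambda_k h_k(\text{English}, \text{Chinese})\Big)}{\sum_{\text{English}'} \exp\Big(\sum_k \lambda_k h_k(\text{English}', \text{Chinese})\Big)}$$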



$$p(\text{English} \mid \text{Chinese}) = \frac{1}{Z} \exp\Big(\sum_k \lambda_k h_k(\text{English}, \text{Chinese})\Big)$$

Z is the normalization term or partition function.

The functions $h_k$ are features or feature functions. They are deterministic (fixed) functions of the input/output pair.

The parameters of the model are the $\lambda_k$ terms.
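A minimal sketch of the model in code, under one loud assumption: Z is computed over a small explicit candidate list, whereas the true Z sums over every possible English sentence. The function name `log_linear_probs` and the toy numbers are illustrative only.

```python
import math


def log_linear_probs(feature_vectors, weights):
    """Return p(candidate | input) = exp(sum_k lambda_k * h_k) / Z for each candidate.

    feature_vectors[i][k] holds h_k for candidate i; weights[k] holds lambda_k.
    Z is the sum over the given candidates only (an approximation).
    """
    scores = [sum(w * h for w, h in zip(weights, feats)) for feats in feature_vectors]
    top = max(scores)  # subtracting the max avoids overflow and cancels in the ratio
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Two features per candidate: log p(Chinese|English) and log p(English) (invented values).
features = [
    [math.log(0.20), math.log(0.010)],
    [math.log(0.30), math.log(0.005)],
    [math.log(0.05), math.log(0.028)],
]
probs = log_linear_probs(features, weights=[1.0, 1.0])
print(probs)
```

With both weights at 1 the probabilities are proportional to the products 0.002, 0.0015, and 0.0014, so the first candidate gets about 0.41 of the mass.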

What’s a Feature?

A feature can be any function in the form:

$$h_k : \text{English} \times \text{Chinese} \rightarrow \mathbb{R}_{+}$$

•Language model: p(English)

•Translation model: p(Chinese|English)

•Reverse translation model: p(English|Chinese)

•The number of words in the English sentence.

•The number of verbs in the English sentence.

•1 if the English sentence has a verb, 0 otherwise.

•A word-based translation model: p(Chinese|English)

•Agreement features in the English sentence.

•Features over part-of-speech sequences in the English sentence.

•How many times the sentence pair includes the English word north and Chinese word 北.

•Do words north and 北 appear in a dictionary?

Learning

$$\arg\max_{\theta} \; \frac{1}{Z} \exp\Big(\sum_k \lambda_k h_k(\text{English}, \text{Chinese})\Big)$$

where: $\theta = \langle \lambda_1, \ldots, \lambda_K \rangle$

Techniques: SGD, L-BFGS

Require computing derivatives (expectations!), iterating.
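The derivative of the log-likelihood with respect to each weight is the observed feature value minus its expectation under the model, which is where the expectations come from. The sketch below runs plain stochastic gradient ascent on one invented training example; it assumes the expectation is taken over a small explicit candidate list, and the names `model_probs`, `gradient`, and `train` are illustrative.

```python
import math


def model_probs(feature_vectors, weights):
    """Log-linear probabilities over an explicit candidate list."""
    scores = [sum(w * h for w, h in zip(weights, f)) for f in feature_vectors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def gradient(feature_vectors, observed, weights):
    """d log p(observed) / d lambda_k = h_k(observed) - E_model[h_k]."""
    probs = model_probs(feature_vectors, weights)
    grad = []
    for k in range(len(weights)):
        expected = sum(p * f[k] for p, f in zip(probs, feature_vectors))
        grad.append(feature_vectors[observed][k] - expected)
    return grad


def train(data, num_features, learning_rate=0.1, epochs=200):
    """Stochastic gradient ascent on the conditional log-likelihood."""
    weights = [0.0] * num_features
    for _ in range(epochs):
        for feature_vectors, observed in data:
            grad = gradient(feature_vectors, observed, weights)
            weights = [w + learning_rate * g for w, g in zip(weights, grad)]
    return weights


# One training example: three candidates with two invented features each;
# candidate 0 is the observed (reference) translation.
DATA = [([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], 0)]
weights = train(DATA, num_features=2)
print(weights, model_probs(DATA[0][0], weights))
```

The same gradient could be handed to a batch optimizer such as L-BFGS instead of being applied example by example.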

Problems

•Inference is intractable!

•Compute over n-best lists of outputs.

•Compute over pruned search graphs.

•Reachability: what if data likelihood is zero?

•Throw away data.

•Pretend sentence with highest BLEU score is observed.
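The last workaround, treating the best reachable candidate as if it were observed, can be sketched on an n-best list. Real BLEU is more involved, so this example swaps in clipped unigram precision as a crude stand-in; the metric choice, the function names, and the sentences are all assumptions made for illustration.

```python
from collections import Counter


def unigram_precision(hypothesis, reference):
    """Clipped unigram precision: a crude stand-in for BLEU, not BLEU itself."""
    hyp_tokens = hypothesis.split()
    if not hyp_tokens:
        return 0.0
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hyp_tokens)
    matched = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    return matched / len(hyp_tokens)


def oracle_index(nbest, reference, metric=unigram_precision):
    """Index of the n-best candidate that scores highest against the reference."""
    return max(range(len(nbest)), key=lambda i: metric(nbest[i], reference))


# The reference itself is not in the list, i.e. it is unreachable.
NBEST = ["the wind from north", "north wind blows cold", "a southern breeze"]
REFERENCE = "the wind from the north"

best = oracle_index(NBEST, REFERENCE)
print(best, NBEST[best])  # 0 the wind from north
```

The selected candidate then plays the role of the observed translation when computing the likelihood gradient.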

Problems

•Why maximize likelihood if we care about BLEU or some other metric?

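$$\text{BLEU}(\text{MT output})$$

$$\text{BLEU}\Big(\arg\max_{\text{English}} \; \text{score}(\text{English} \mid \text{Chinese})\Big)$$

$$\sum_{\text{Chinese} \in \text{Test}} \text{BLEU}\Big(\arg\max_{\text{English}} \; \text{score}(\text{English} \mid \text{Chinese})\Big)$$

•Optimization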
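To make the metric-driven objective concrete, the sketch below picks weights by brute-force grid search so that the decoded outputs score as high as possible. This is a deliberately naive illustration, not a method from the lecture: the per-candidate quality values stand in for BLEU and are invented, as are the feature values, the grid, and the function names.

```python
import itertools


def decode(candidates, weights):
    """Index of the candidate with the highest linear score sum_k lambda_k * h_k."""
    return max(range(len(candidates)),
               key=lambda i: sum(w * h for w, h in zip(weights, candidates[i][0])))


def objective(dev_set, weights):
    """Sum over sentences of the quality of the candidate the model would output."""
    return sum(cands[decode(cands, weights)][1] for cands in dev_set)


def grid_search(dev_set, grid):
    """Try every weight combination on the grid and keep the best-scoring one."""
    num_features = len(dev_set[0][0][0])
    return max(itertools.product(grid, repeat=num_features),
               key=lambda w: objective(dev_set, w))


# Each sentence: list of (feature values, quality score standing in for BLEU).
# Features are log p(Chinese|English) and log p(English); all numbers invented.
DEV = [
    [([-1.6, -4.6], 0.9), ([-1.2, -5.3], 0.4), ([-3.0, -3.6], 0.2)],
    [([-2.0, -3.0], 0.3), ([-0.5, -5.0], 0.8)],
]
GRID = (0.0, 0.5, 1.0, 1.5)

best = grid_search(DEV, GRID)
print(best, objective(DEV, best))
```

The objective is piecewise constant in the weights, because small weight changes do not alter which candidate wins; that is why gradient-based learning does not apply to it directly.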
