A Short Introduction to Neural Machine Translation
TRANSCRIPT
Machine Translation
Marc Dymetman
Centrale-Supélec NLP Course: Lecture 6
26 February 2018
Outline
• Machine Translation: The Problem
• Symbolic MT : Rule-Based MT (RBMT)
• Statistical MT : Phrase-Based MT(PBMT)
• MT Evaluation
• Neural MT (NMT)
• Language Modelling with RNNs and LSTMs
• Seq2Seq Models for NMT
• Attention Models
• Advanced NMT models
• Other uses of Seq2Seq models
• Toolkits and Learning Resources
Machine Translation: a difficult problem
Lexical differences between languages
[Credit: Jurafsky and Martin 2000]
Different specificities
[Credit: Jurafsky and Martin 2000]
The student was thinking
L’étudiant réfléchissait / L’étudiante réfléchissait
In many cases, the source text is not enough. Only access to the situation helps.
QUIZ: In fact, even the second translation is (probably) wrong. Can you spot why?
Translation and mental representations
[Credit: Dymetman 1994]
Human Translation
[Credit: Dymetman 1994]
Translation by human
Mostly easy for us
English → French
leg → jambe / patte
map / plane → plan
they → ils / elles
his / her → son (also sa)
student → étudiant / étudiante
he walked out → il est sorti
she sailed across the Atlantic → Elle a traversé l’Atlantique à la voile
47 miles per gallon → 6 litres aux cents
…
Translation and mental representations
[Credit: Dymetman 1994]
Translation by machine
Sometimes difficult or even impossible for machines
(same examples as above)
MT progress: some problems are now solved
Quiz: Can you guess the English source?
English → French
? → La rose du taux de chômage
? → Vieillissez par le sexe
Answers: “The unemployment rate rose”; “Age by sex”
These errors come from early rule-based MT; [GT (Google Translate) 2018 output shown for comparison]
MT progress: some problems are being solved
[GT 2018 and DeepL 2018 outputs shown; images not reproduced]
MT progress: some problems will probably be solved “soon”
[GT 2018 output shown]
Promise vs. persuade: a well-known syntactic rule (*)
Cultural conventions about units
(*) Pierre Isabelle 2017: Challenge Set Approach to Evaluating Machine Translation
MT progress: some problems will (probably) take a long time to solve
[GT 2018 output shown]
[Winograd Schemas, see: https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WSCollection.html]
Commonsense reasoning
Brief History of Machine Translation
[Timeline figures. Credit: Chris Manning 2016; Credit: Ken Heafield 2017]
Symbolic MT
aka Rule-Based MT (RBMT)
Rule-Based MT: Syntax plays a large role
[Credit: P. Koehn. Statistical Machine Translation Book. 2010]
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Phrase-Structure Grammar
Dependency-Structure Grammar
Vauquois Triangle
RBMT (Rule-Based MT)
Tom misses Paris → Paris manque à Tom
Semantic transfer: miss(Tom,Paris) → manquer_à(Paris,Tom)
Deeper representation: Pred: wish_for; Agent: Tom; Object: Paris
Statistical MT (SMT)
Phrase-Based MT (PBMT)
Phrase-based SMT
[Credit: Koehn 2006]
Bilingual corpora
[Credit: P. Koehn. Statistical Machine Translation Book. 2010]
Learning bi-phrases
[Credit: Koehn]
• Word-alignment phase
• Based on EM (Expectation Maximization)
• Tools: GIZA++, …
Learning bi-phrases
• Bi-phrase extraction based on word-alignments
• Tools: MOSES, …
[Credit: Koehn]
Bi-phrase table
[Credit: P. Koehn. Statistical Machine Translation Book. 2010]
Decoding with bi-phrases
Source: Obama was waiting for election results in Chicago
The translation is built incrementally by covering the source with bi-phrases:
Obama (1a) / attendait (2a) / les résultats des élections (3a) / à Chicago (4a)
Candidate 1 (bi-phrases 1a 2a 3a 4a): “Obama attendait les résultats des élections à Chicago”, score = 5.21
Candidate 2 (bi-phrases 1a 2a 3b 4b 5b): “Obama attendait résultats élection dans Chicago”, score = 2.04
In practice, several candidates are compared in parallel using beam search.
Scoring
Log-linear model:
p_λ(t | s) ∝ exp( Σ_i λ_i h_i(s, t) )
where s is the source sentence, t the target sentence, the λ_i are the parameters, and the h_i(s, t) are the features; the resulting value is the score.
Features:
• h_lm: language model feature
• h_bs1, h_bs2, …: features related to the plausibilities of the bi-phrases applied
• Other features (distortion, word penalty, etc.)
Parameters:
• Optimized towards minimizing the distance (e.g. in terms of BLEU) to reference translations
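To make the scoring concrete, here is a minimal Python sketch of log-linear scoring. The feature names and values are made up for illustration (real values come from the language model and the bi-phrase table); only the unnormalized score matters for ranking the candidates in the beam:

```python
def loglinear_score(features, lambdas):
    """Unnormalized log-linear score: sum_i lambda_i * h_i(s, t)."""
    return sum(lambdas[name] * value for name, value in features.items())

# Hypothetical weights and feature values for two candidate translations
# (illustrative numbers only, not from a real system).
lambdas = {"h_lm": 1.0, "h_biphrase": 0.8, "h_word_penalty": -0.2}
cand_1 = {"h_lm": -4.1, "h_biphrase": -1.0, "h_word_penalty": 9}
cand_2 = {"h_lm": -7.5, "h_biphrase": -2.3, "h_word_penalty": 6}

for name, feats in [("candidate 1", cand_1), ("candidate 2", cand_2)]:
    print(name, round(loglinear_score(feats, lambdas), 2))
# The decoder keeps the highest-scoring hypotheses; normalizing with
# exp() over all possible t is never needed for ranking.
```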
MT Evaluation
MT evaluation: no simple notion of a correct translation
[Credit: Koehn 2010]
Manual Evaluation
[Credit: Bojar 2017]
Manual Evaluation: Adequacy and Fluency
Problems of Manual Evaluation
[Credit: Bojar 2017]
Automatic Evaluation
[Credit: Bojar 2017]
Reference (human) translation:
The US island of Guam is
maintaining a high state of alert
after the Guam airport and its
offices both received an e-mail
from someone calling himself
Osama Bin Laden and threatening a
biological/chemical attack against
the airport.
Machine translation:
The American International airport and its
the office a receives one calls self the sand
Arab rich business and so on electronic
mail, which sends out; The threat will be
able after the maintenance at the airport.
• N-gram precision (score between 0 and 1)
  • What % of MT n-grams (sequences of words) can be found in the reference translation?
• Brevity Penalty
  • Can’t just type out the single word “the” (precision 1.0!)
BLEU
[Credit: A. Way]
BLEU
• Reference Translation: The gunman was shot to death by the police .
• The gunman was shot kill .
• Wounded police jaya of
• The gunman was shot dead by the police .
• The gunman arrested by police kill .
• The gunmen were killed .
• The gunman was shot to death by the police .
• The ringer is killed by the police .
• Police killed the gunman .
• Green = 4-gram match (good!) Red = unmatched word (bad!)
[Credit: A. Way]
BLEU Metrics
• Proposed by IBM’s SMT group (Papineni et al, ACL-2002)
• Widely used in MT evaluations
• BLEU Metric:
  – p_n: modified n-gram precision
  – Geometric mean of p_1, p_2, …, p_N
  – BP: brevity penalty (c = length of MT hypothesis, r = length of reference); BP = min(1, e^(1 − r/c))
  – Usually, N = 4 and w_n = 1/N
BLEU = BP · exp( Σ_{n=1}^{N} w_n log p_n )
[Credit: A. Way]
An Example
• MT Hypothesis: The gunman was shot dead by police .
– Ref 1: The gunman was shot to death by the police .
– Ref 2: The gunman was shot to death by the police .
– Ref 3: Police killed the gunman .
– Ref 4: The gunman was shot dead by the police .
• Precision: p1=1.0 (8/8) p2=0.86 (6/7) p3=0.67 (4/6) p4=0.6 (3/5)
• Brevity Penalty: c=8, r=9, BP=0.8825
• Final Score: BLEU = 0.8825 × exp((log 1.0 + log 0.86 + log 0.67 + log 0.6)/4) ≈ 0.68
[Credit: A. Way]
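As a sanity check, here is a small Python sketch that recomputes the numbers above: modified (clipped) n-gram precision against the four references, then the final score using the slide’s brevity penalty. The example is lowercased here; real BLEU implementations also deal with tokenization details and smoothing:

```python
import math
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def modified_precision(hyp, refs, n):
    """Clipped n-gram precision: each hypothesis n-gram counts at most
    max-over-references of its count in a reference."""
    h = ngrams(hyp, n)
    matched = sum(min(c, max(ngrams(r, n)[g] for r in refs)) for g, c in h.items())
    return matched / max(1, sum(h.values()))

hyp = "the gunman was shot dead by police .".split()
refs = [r.split() for r in [
    "the gunman was shot to death by the police .",
    "the gunman was shot to death by the police .",
    "police killed the gunman .",
    "the gunman was shot dead by the police .",
]]
ps = [modified_precision(hyp, refs, n) for n in (1, 2, 3, 4)]
print([round(p, 2) for p in ps])        # [1.0, 0.86, 0.67, 0.6], as above

# Final score with the slide's brevity penalty (c=8, r=9, so BP = e^(1-9/8)):
bp = math.exp(1 - 9 / 8)                # 0.8825
bleu = bp * math.exp(sum(math.log(p) for p in ps) / 4)
print(round(bleu, 3))                   # ~0.675
```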
Correlation of BLEU with human judgments
[Credit: P. Koehn. Statistical Machine Translation Book. 2010]
Neural MT
[Credit: hlalitech.org]
NMT Techniques
Recurrent Neural Networks: Refresher
A Feedforward Neural Network
[Diagram: input x → hidden state h → output o]
A Recurrent Neural Network
[Diagram: input x_t → hidden state h_t (with a recurrent loop) → output o_t]
h_t = f_W(h_{t−1}, x_t) = tanh(W_hh h_{t−1} + W_xh x_t)
o_t = g_W(h_t) = softmax(W_ho h_t)
softmax(y)_i = e^{y_i} / Σ_j e^{y_j}
W = (W_hh, W_xh, W_ho): shared between all time steps
Note: bias terms omitted
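As a concrete illustration, a minimal NumPy sketch of one RNN time step; the vocabulary size, hidden size, and random initialization are placeholders, and biases are omitted as on the slide:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())            # subtract max for numerical stability
    return e / e.sum()

V, H = 10_000, 128                     # vocabulary size, hidden size (assumed)
rng = np.random.default_rng(0)
W_hh = rng.normal(0, 0.01, (H, H))     # hidden-to-hidden
W_xh = rng.normal(0, 0.01, (H, V))     # input-to-hidden
W_ho = rng.normal(0, 0.01, (V, H))     # hidden-to-output

def rnn_step(h_prev, x):
    """One step: h_t = tanh(W_hh h_{t-1} + W_xh x_t), o_t = softmax(W_ho h_t)."""
    h = np.tanh(W_hh @ h_prev + W_xh @ x)
    o = softmax(W_ho @ h)
    return h, o

h0 = np.zeros(H)
x1 = np.zeros(V); x1[42] = 1.0         # 1-hot encoding of some word id
h1, o1 = rnn_step(h0, x1)              # o1: distribution over the next word
```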
Language modelling with RNNs
Training text (excerpt): “Il était une fois un roi et une reine si fâchés de n’avoir point d’enfants …” (“Once upon a time there were a king and a queen who were so upset at having no children …”)
Let’s first assume that we have already trained some “good” RNN: W = (W_hh, W_xh, W_ho) …
… and see how this RNN can be used for predicting new texts
Decoding with the trained RNN
h_0: hidden-state initialization
o_0 = softmax(W_ho h_0): a distribution over the vocabulary, e.g. (.001 .7 … .002 … .1 .1 … .005)
A word is sampled from o_0: “il”
“il” is fed back as the next input x_1, using a 1-hot encoding: (0 1 … 0 … 0 0 … 0)
h_1 = tanh(W_hh h_0 + W_xh x_1), then o_1 = softmax(W_ho h_1), e.g. (.001 .01 … .002 … .5 .1 … .005): “était”
Iterating: “était” becomes x_2, giving h_2, o_2 → “une”; then h_3, o_3 → “fois”; …
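The whole decoding loop then looks as follows, in a sketch assuming trained matrices W_hh, W_xh, W_ho and a vocab list; it samples from o_t, though one could also take the argmax:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_text(h0, vocab, W_hh, W_xh, W_ho, n_words=20):
    """Generate text: read o_t off the current state, sample a word,
    1-hot encode it, and feed it back as the next input."""
    def softmax(y):
        e = np.exp(y - y.max())
        return e / e.sum()
    V = len(vocab)
    h, words = h0, []
    for _ in range(n_words):
        o = softmax(W_ho @ h)             # distribution over the vocabulary
        w = rng.choice(V, p=o)            # sample the next word, e.g. "il"
        words.append(vocab[w])
        x = np.zeros(V)
        x[w] = 1.0                        # 1-hot encoding of the sampled word
        h = np.tanh(W_hh @ h + W_xh @ x)  # advance the hidden state
    return " ".join(words)
```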
Training a RNN
Training set: “il était une fois un …”
Start from initial values of the parameters W = (W_hh, W_xh, W_ho), with h_0 given:
o_0 = softmax(W_ho h_0), e.g. (.5 .01 … .002 … .1 .1 … .005)
The correct first word is “il”, so the Cross-Entropy Loss at step 0 is −log p(il) = −log .01
“il” is then fed as the input x_1, giving h_1 and o_1, e.g. (.1 .05 … .07 … .3 .1 … .3)
The correct next word is “était”: Cross-Entropy Loss = −log p(était) = −log .3
… and so on over the whole training sequence
Backpropagation through time
The per-step losses over the training sequence (“il était une fois un …”) are summed into a global LOSS.
BPTT (Back-Propagation Through Time): gradients of the LOSS are propagated back through the unrolled network; note that the same W_hh and W_ho (and W_xh) appear at every time step.
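A sketch of the corresponding loss computation (teacher forcing: the true next word, not a sampled one, is fed back as input). The gradient of this summed LOSS with respect to W is exactly what BPTT computes; in practice a framework’s automatic differentiation handles the unrolling:

```python
import numpy as np

def sequence_loss(word_ids, h0, W_hh, W_xh, W_ho):
    """Sum of per-step cross-entropy losses over one training sentence."""
    def softmax(y):
        e = np.exp(y - y.max())
        return e / e.sum()
    V = W_xh.shape[1]
    h, loss = h0, 0.0
    for target in word_ids:               # ids of "il", "était", "une", ...
        o = softmax(W_ho @ h)             # prediction from the current state
        loss += -np.log(o[target])        # cross-entropy: -log p(correct word)
        x = np.zeros(V)
        x[target] = 1.0                   # teacher forcing: feed the true word
        h = np.tanh(W_hh @ h + W_xh @ x)  # advance the hidden state
    return loss
```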
Recap: word-level language modeling
[Credit: Jozefowicz]
Vanilla RNNs have issues with long-term interactions
[Credit: Alex Graves]
LSTMs can maintain long-term memories
Long Short-Term Memory networks [Hochreiter & Schmidhuber 1997]
[Credit: Alex Graves]
RNNs with longer-term memory: LSTMs and GRUs
LSTM
GRU
http://colah.github.io/posts/2015-08-Understanding-LSTMs
These variants of RNNs alleviate the “vanishing gradient problem” and allow the network to model long-distance effects
If time permits: Hinton’s explanation of LSTM
[Slides: Credit: Hinton 2013]
End of Hinton’s explanation of LSTM
Seq2Seq models for NMT
Seq2Seq for Machine Translation
[Credit: Cho’s blog on NMT]
Seq2Seq RNN
[Credit: http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture10.pdf]
[Sutskever et al., 2014. Sequence to Sequence Learning with Neural Networks]
Source/Target Interface
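A bare-bones sketch of this interface (all sizes and parameters are placeholders): the encoder RNN consumes the source, and its final hidden state initializes the decoder RNN, which then emits target words one by one:

```python
import numpy as np

V, H = 10_000, 128                         # assumed vocabulary and hidden sizes
rng = np.random.default_rng(0)
E = rng.normal(0, 0.01, (H, V))            # input word embeddings
W_enc = rng.normal(0, 0.01, (H, H))        # encoder recurrence
W_dec = rng.normal(0, 0.01, (H, H))        # decoder recurrence
W_out = rng.normal(0, 0.01, (V, H))        # decoder output projection
EOS = 0                                    # end-of-sentence token id

def step(W, h, word_id):                   # RNN step: h' = tanh(W h + E x)
    return np.tanh(W @ h + E[:, word_id])

def translate(src_ids, max_len=50):
    """Encoder consumes the source; its final hidden state initializes the
    decoder, which greedily emits target words until EOS."""
    h = np.zeros(H)
    for w in src_ids:                      # encoding
        h = step(W_enc, h, w)
    out, word = [], EOS                    # decoding starts from an EOS/BOS token
    for _ in range(max_len):
        h = step(W_dec, h, word)
        word = int(np.argmax(W_out @ h))   # greedy choice (beam search in practice)
        if word == EOS:
            break
        out.append(word)
    return out
```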
Attention in NMT
Introducing Attention: Vanilla seq2seq and the information bottleneck
[Adapted from http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture10.pdf]
Problem: fixed-dimensional interface between encoder and decoder
Attention mechanism
[Credit: http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture10.pdf]
Solution: a kind of “Random Access Memory” over the source encodings
Simplified version of (Bahdanau et al. 2015); the steps at each decoding position:
• Scoring: compare the current target hidden state with each source hidden state
• Normalization: convert the scores into alignment weights
• Context: build a context vector as the weighted average of the source hidden states
• Compute the next hidden state from the context vector
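In code, one attention step can be sketched as follows; dot-product scoring is used for simplicity, whereas Bahdanau et al. use a small learned network for the comparison:

```python
import numpy as np

def attention_step(h_t, src_states):
    """One attention step: score each source state against the current
    target state, normalize into alignment weights, and average."""
    scores = src_states @ h_t                 # (n_src,) comparison scores
    e = np.exp(scores - scores.max())
    weights = e / e.sum()                     # alignment weights, sum to 1
    context = weights @ src_states            # weighted average of source states
    return context, weights

# Toy usage: 5 source hidden states and one target hidden state of size 8.
rng = np.random.default_rng(0)
src = rng.normal(size=(5, 8))
h_t = rng.normal(size=8)
context, weights = attention_step(h_t, src)
# `context` is then combined with h_t (e.g. through a learned layer)
# to produce the next hidden state and output distribution.
```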
NMT with attention
[Credit: http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp]
The “canonical” NMT architecture: RNN seq2seq with attention and with bidirectional encoding
[Credit: https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/ (Cho’s Blog, 2015)]
Advanced NMT Models
Alternative Architectures
• Convolutional approaches
• “Attention is all you need”
Convolutional models
• Illustration from: [Kalchbrenner et al. 2016, Neural Machine Translation in Linear Time] https://arxiv.org/abs/1610.10099 (ByteNet)
• Shorter “maximum path length” than an RNN
• More parallelizable
• Also: Convolution + Attention: [Gehring et al. 2017, Convolutional Sequence to Sequence Learning] https://arxiv.org/abs/1705.03122 (ConvS2S, fairseq)
From convolution to self-attention
[Diagram: states s_1 … s_8 at one level feeding states at the next level]
Convolution:
• The next level for s_3 gets input from a fixed number of (immediate) neighbors
Self-attention:
• The next level for s_3 gets input from a variable number of (perhaps distant) neighbors, according to attention weights relative to s_3
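A bare-bones sketch of self-attention over a sequence of states; the Transformer adds learned query/key/value projections, scaling, and multiple heads on top of this idea:

```python
import numpy as np

def self_attention(S):
    """Each state s_i is updated from ALL states s_j, weighted by a
    softmax over dot-product similarities with s_i."""
    scores = S @ S.T                          # (n, n) pairwise similarities
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = e / e.sum(axis=1, keepdims=True)      # row i: attention weights of s_i
    return A @ S                              # next-level states

rng = np.random.default_rng(0)
S = rng.normal(size=(8, 16))                  # s_1 ... s_8, dimension 16
S_next = self_attention(S)                    # s_3 can draw on distant neighbors
```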
“Attention is all you need”
• [Vaswani et al. (2017). Attention Is All You Need] http://arxiv.org/abs/1706.03762 (Transformer)
• Attention applied to:
  • Encoding of the source (self-attention)
  • Decoding of the next target word:
    • Attention to source-word encodings
    • Attention to previous target words
Speed
https://research.googleblog.com/2017/06/accelerating-deep-learning-research.html
Coreference resolution (Winograd schemas)
[Figure: self-attention weights. Credit: https://research.googleblog.com/2017/08/transformer-novel-neural-network.html]
Multilingual NMT
[Slide cited by: Marta Costa-Jussa, 2017]
[O. Firat et al., 2016. Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism]
[Melvin Johnson et al., 2017. Google’s Multilingual NMT System: Enabling Zero-Shot Translation] https://arxiv.org/pdf/1611.04558.pdf
Google Multilingual NMT: Enabling zero-shot translation
[Credit: https://www.youtube.com/watch?v=nR74lBO5M3s]
The Generality of Seq2Seq Models
Image Captioning
[Show and Tell: A Neural Image Caption Generator, Vinyals et al., 2014]
Natural Language Generation
[Credit: Agarwal et al., 2017]
MR: Meaning Representation (Dialog Act)
RF: Reference (Test Set)
Pred: Seq2seq prediction
Structure is encoded as a character sequence
Char2Char model with attention
Seq2Seq Applications
• text → text
  • Machine Translation
  • Summarization
  • Dialogue
• → text
  • Language Modeling
• other → text
  • Image Captions
  • Natural Language Generation
  • Speech Recognition
  • Handwriting Recognition
• text → other
  • Semantic Parsing
  • Code Generation
  • Handwriting Generation
  • Speech Synthesis
• other → other
  • Image Generation
  • etc.
Tools and Resources
A few open-source NMT toolkits
Extensive list at: https://github.com/jonsafari/nmt-list

NAME                 Model Type                                  Main Framework   Who                          Comments
tf-seq2seq           RNN                                         TensorFlow       Denny Britz (Google Brain)
Nematus              RNN                                         Theano           Edinburgh U.
Marian-NMT           RNN                                         C++              Poznan U. and Edinburgh U.   Compatible with Nematus
OpenNMT-py           RNN                                         PyTorch          Harvard, Systran             Based on OpenNMT (Torch)
Fairseq              CNN (ConvS2S)                               Torch            Facebook
Tensor2Tensor (T2T)  “Attention is all you need” (Transformer)   TensorFlow       Google Brain
                     + other models
References: some overviews
An Introduction to Machine Translation (1992), Hutchins and Somers, Academic Press. [web]
Statistical Machine Translation (2010), P. Koehn, Cambridge University Press.
Neural Machine Translation and Sequence-to-Sequence Models: A Tutorial (2017), G. Neubig. [pdf]
Neural Machine Translation (chapter draft) (2017), P. Koehn. [pdf]
CS224d, Deep Learning for Natural Language Processing, Lecture 10 (Machine Translation), Manning et al., Stanford University. [web]
Survey of the State of the Art in Natural Language Generation: Core Tasks, Applications and Evaluation (2017), Gatt et al. [pdf]
References: a few papers
Long Short-Term Memory (1997), S. Hochreiter and J. Schmidhuber. [pdf]
Generating Sequences with Recurrent Neural Networks (2013), A. Graves. [pdf]
Sequence to Sequence Learning with Neural Networks (2014), I. Sutskever et al. [pdf]
Neural Machine Translation by Jointly Learning to Align and Translate (2014), D. Bahdanau et al. [pdf]
Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (2016), Y. Wu et al. [pdf]
A Convolutional Encoder Model for Neural Machine Translation (2017), J. Gehring et al. [pdf]
Attention Is All You Need (2017), A. Vaswani et al. [pdf]
Concluding remarks
• Many aspects I did not discuss:
  • Detailed implementation techniques (batching, dropout, ensembling, …)
  • Pros/cons of NMT relative to PBMT (Philipp Koehn)
  • Sub-word units, Byte-Pair Encoding
  • Use of monolingual data
  • Fine-grained linguistic evaluation techniques (Pierre Isabelle’s challenge dataset)
  • Prior linguistic knowledge
Seq2Seq in NMT, text generation, etc.:
An active research field with exciting applications