A Short Introduction to Neural Machine Translation
TRANSCRIPT
Machine Translation
Marc Dymetman
Centrale-Supélec NLP Course: Lecture 6
26 February 2018
Outline
• Machine Translation: The Problem
• Symbolic MT : Rule-Based MT (RBMT)
• Statistical MT : Phrase-Based MT(PBMT)
• MT Evaluation
• Neural MT (NMT)
• Language Modelling with RNNs and LSTMs
• Seq2Seq Models for NMT
• Attention Models
• Advanced NMT models
• Other uses of Seq2Seq models
• Toolkits and Learning Resources
Machine Translation: a difficult problem
Lexical differences between languages
[Credit: Jurafsky and Martin 2000]
Different specificities
[Credit: Jurafsky and Martin 2000]
The student was thinking
L’étudiant réfléchissait / L’étudiante réfléchissait
In many cases, the source text is not enough. Only access to the situation helps.
QUIZ: In fact, even the second translation is (probably) wrong. Can you spot why?
Translation and mental representations
[Credit: Dymetman 1994]
Human Translation
[Credit: Dymetman 1994]
Translation by human
Mostly easy for us
English → French
leg → jambe / patte
map / plane → plan
they → ils / elles
his / her → son (also sa)
student → étudiant / étudiante
he walked out → il est sorti
she sailed across the Atlantic → Elle a traversé l’Atlantique à la voile
47 miles per gallon → 6 litres aux cents
…
Translation and mental representations
[Credit: Dymetman 1994]
Translation by machine
Sometimes difficult or even impossible for machines
(same examples as above)
MT progress: some problems are now solved
Quiz: Can you guess the English source?
English → French
? → La rose du taux de chômage
? → Vieillissez par le sexe
Answers: “The unemployment rate rose”; “Age by sex”
These errors come from early rule-based MT; [GT (Google Translate) 2018 output shown for comparison]
MT progress: some problems are being solved
[GT 2018 and DeepL 2018 outputs shown; images not reproduced]
MT progress: some problems will probably be solved “soon”
[GT 2018 output shown]
Promise vs. persuade: a well-known syntactic rule (*)
Cultural conventions about units
(*) Pierre Isabelle 2017: Challenge Set Approach to Evaluating Machine Translation
MT progress: some problems will (probably) take a long time to solve
[GT 2018 output shown]
[Winograd Schemas, see: https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WSCollection.html]
Commonsense reasoning
Brief History of Machine Translation
[Timeline figures. Credit: Chris Manning 2016; Credit: Ken Heafield 2017]
Symbolic MT
aka Rule-Based MT (RBMT)
Rule-Based MT: Syntax plays a large role
[Credit: P. Koehn. Statistical Machine Translation Book. 2010]
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Phrase-Structure Grammar
Dependency-Structure Grammar
Vauquois Triangle
RBMT (Rule-Based MT)
Tom misses Paris → Paris manque à Tom
Semantic transfer: miss(Tom,Paris) → manquer_à(Paris,Tom)
Deeper representation: Pred: wish_for; Agent: Tom; Object: Paris
Statistical MT (SMT)
Phrase-Based MT (PBMT)
Phrase-based SMT
[Credit: Koehn 2006]
Bilingual corpora
[Credit: P. Koehn. Statistical Machine Translation Book. 2010]
Learning bi-phrases
[Credit: Koehn]
• Word-alignment phase
• Based on EM (Expectation Maximization)
• Tools: GIZA++, …
Learning bi-phrases
• Bi-phrase extraction based on word-alignments
• Tools: MOSES, …
[Credit: Koehn]
Bi-phrase table
[Credit: P. Koehn. Statistical Machine Translation Book. 2010]
Decoding with bi-phrases
Source: Obama was waiting for election results in Chicago
The translation is built incrementally by covering the source with bi-phrases:
Obama (1a) / attendait (2a) / les résultats des élections (3a) / à Chicago (4a)
Candidate 1 (bi-phrases 1a 2a 3a 4a): “Obama attendait les résultats des élections à Chicago”, score = 5.21
Candidate 2 (bi-phrases 1a 2a 3b 4b 5b): “Obama attendait résultats élection dans Chicago”, score = 2.04
In practice, several candidates are compared in parallel using beam search.
Scoring
Log-linear model:
p_λ(t | s) ∝ exp( Σ_i λ_i h_i(s, t) )
where s is the source sentence, t the target sentence, the λ_i are the parameters, and the h_i(s, t) are the features; the resulting value is the score.
Features:
• h_lm: language model feature
• h_bs1, h_bs2, …: features related to the plausibilities of the bi-phrases applied
• Other features (distortion, word penalty, etc.)
Parameters:
• Optimized towards minimizing the distance (e.g. in terms of BLEU) to reference translations
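To make the scoring concrete, here is a minimal Python sketch of log-linear scoring. The feature names and values are made up for illustration (real values come from the language model and the bi-phrase table); only the unnormalized score matters for ranking the candidates in the beam:

```python
def loglinear_score(features, lambdas):
    """Unnormalized log-linear score: sum_i lambda_i * h_i(s, t)."""
    return sum(lambdas[name] * value for name, value in features.items())

# Hypothetical weights and feature values for two candidate translations
# (illustrative numbers only, not from a real system).
lambdas = {"h_lm": 1.0, "h_biphrase": 0.8, "h_word_penalty": -0.2}
cand_1 = {"h_lm": -4.1, "h_biphrase": -1.0, "h_word_penalty": 9}
cand_2 = {"h_lm": -7.5, "h_biphrase": -2.3, "h_word_penalty": 6}

for name, feats in [("candidate 1", cand_1), ("candidate 2", cand_2)]:
    print(name, round(loglinear_score(feats, lambdas), 2))
# The decoder keeps the highest-scoring hypotheses; normalizing with
# exp() over all possible t is never needed for ranking.
```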
MT Evaluation
MT evaluation: no simple notion of a correct translation
[Credit: Koehn 2010]
Manual Evaluation
[Credit: Bojar 2017]
Manual Evaluation: Adequacy and Fluency
Problems of Manual Evaluation
[Credit: Bojar 2017]
Automatic Evaluation
[Credit: Bojar 2017]
Reference (human) translation:
The US island of Guam is
maintaining a high state of alert
after the Guam airport and its
offices both received an e-mail
from someone calling himself
Osama Bin Laden and threatening a
biological/chemical attack against
the airport.
Machine translation:
The American International airport and its
the office a receives one calls self the sand
Arab rich business and so on electronic
mail, which sends out; The threat will be
able after the maintenance at the airport.
• N-gram precision (score between 0 and 1)
  • What % of MT n-grams (sequences of words) can be found in the reference translation?
• Brevity Penalty
  • Can’t just type out the single word “the” (precision 1.0!)
BLEU
[Credit: A. Way]
BLEU
• Reference Translation: The gunman was shot to death by the police .
• The gunman was shot kill .
• Wounded police jaya of
• The gunman was shot dead by the police .
• The gunman arrested by police kill .
• The gunmen were killed .
• The gunman was shot to death by the police .
• The ringer is killed by the police .
• Police killed the gunman .
• Green = 4-gram match (good!) Red = unmatched word (bad!)
[Credit: A. Way]
BLEU Metrics
• Proposed by IBM’s SMT group (Papineni et al, ACL-2002)
• Widely used in MT evaluations
• BLEU Metric:
  – p_n: modified n-gram precision
  – Geometric mean of p_1, p_2, …, p_N
  – BP: brevity penalty (c = length of MT hypothesis, r = length of reference); BP = min(1, e^(1 − r/c))
  – Usually, N = 4 and w_n = 1/N
BLEU = BP · exp( Σ_{n=1}^{N} w_n log p_n )
[Credit: A. Way]
An Example
• MT Hypothesis: The gunman was shot dead by police .
– Ref 1: The gunman was shot to death by the police .
– Ref 2: The gunman was shot to death by the police .
– Ref 3: Police killed the gunman .
– Ref 4: The gunman was shot dead by the police .
• Precision: p1=1.0 (8/8) p2=0.86 (6/7) p3=0.67 (4/6) p4=0.6 (3/5)
• Brevity Penalty: c=8, r=9, BP=0.8825
• Final Score: BLEU = 0.8825 × exp((log 1.0 + log 0.86 + log 0.67 + log 0.6)/4) ≈ 0.68
[Credit: A. Way]
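As a sanity check, here is a small Python sketch that recomputes the numbers above: modified (clipped) n-gram precision against the four references, then the final score using the slide’s brevity penalty. The example is lowercased here; real BLEU implementations also deal with tokenization details and smoothing:

```python
import math
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def modified_precision(hyp, refs, n):
    """Clipped n-gram precision: each hypothesis n-gram counts at most
    max-over-references of its count in a reference."""
    h = ngrams(hyp, n)
    matched = sum(min(c, max(ngrams(r, n)[g] for r in refs)) for g, c in h.items())
    return matched / max(1, sum(h.values()))

hyp = "the gunman was shot dead by police .".split()
refs = [r.split() for r in [
    "the gunman was shot to death by the police .",
    "the gunman was shot to death by the police .",
    "police killed the gunman .",
    "the gunman was shot dead by the police .",
]]
ps = [modified_precision(hyp, refs, n) for n in (1, 2, 3, 4)]
print([round(p, 2) for p in ps])        # [1.0, 0.86, 0.67, 0.6], as above

# Final score with the slide's brevity penalty (c=8, r=9, so BP = e^(1-9/8)):
bp = math.exp(1 - 9 / 8)                # 0.8825
bleu = bp * math.exp(sum(math.log(p) for p in ps) / 4)
print(round(bleu, 3))                   # ~0.675
```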
Correlation of BLEU with human judgments
[Credit: P. Koehn. Statistical Machine Translation Book. 2010]
Neural MT
[Credit: hlalitech.org]
NMT Techniques
Recurrent Neural Networks: Refresher
A Feedforward Neural Network
[Diagram: input x → hidden state h → output o]
A Recurrent Neural Network
[Diagram: input x_t → hidden state h_t (with a recurrent loop) → output o_t]
h_t = f_W(h_{t−1}, x_t) = tanh(W_hh h_{t−1} + W_xh x_t)
o_t = g_W(h_t) = softmax(W_ho h_t)
softmax(y)_i = e^{y_i} / Σ_j e^{y_j}
W = (W_hh, W_xh, W_ho): shared between all time steps
Note: bias terms omitted
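As a concrete illustration, a minimal NumPy sketch of one RNN time step; the vocabulary size, hidden size, and random initialization are placeholders, and biases are omitted as on the slide:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())            # subtract max for numerical stability
    return e / e.sum()

V, H = 10_000, 128                     # vocabulary size, hidden size (assumed)
rng = np.random.default_rng(0)
W_hh = rng.normal(0, 0.01, (H, H))     # hidden-to-hidden
W_xh = rng.normal(0, 0.01, (H, V))     # input-to-hidden
W_ho = rng.normal(0, 0.01, (V, H))     # hidden-to-output

def rnn_step(h_prev, x):
    """One step: h_t = tanh(W_hh h_{t-1} + W_xh x_t), o_t = softmax(W_ho h_t)."""
    h = np.tanh(W_hh @ h_prev + W_xh @ x)
    o = softmax(W_ho @ h)
    return h, o

h0 = np.zeros(H)
x1 = np.zeros(V); x1[42] = 1.0         # 1-hot encoding of some word id
h1, o1 = rnn_step(h0, x1)              # o1: distribution over the next word
```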
Language modelling with RNNs
Training text (excerpt): “Il était une fois un roi et une reine si fâchés de n’avoir point d’enfants …” (“Once upon a time there were a king and a queen who were so upset at having no children …”)
Let’s first assume that we have already trained some “good” RNN: W = (W_hh, W_xh, W_ho) …
… and see how this RNN can be used for predicting new texts
Decoding with the trained RNN
h_0: hidden-state initialization
o_0 = softmax(W_ho h_0): a distribution over the vocabulary, e.g. (.001 .7 … .002 … .1 .1 … .005)
A word is sampled from o_0: “il”
“il” is fed back as the next input x_1, using a 1-hot encoding: (0 1 … 0 … 0 0 … 0)
h_1 = tanh(W_hh h_0 + W_xh x_1), then o_1 = softmax(W_ho h_1), e.g. (.001 .01 … .002 … .5 .1 … .005): “était”
Iterating: “était” becomes x_2, giving h_2, o_2 → “une”; then h_3, o_3 → “fois”; …
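The whole decoding loop then looks as follows, in a sketch assuming trained matrices W_hh, W_xh, W_ho and a vocab list; it samples from o_t, though one could also take the argmax:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_text(h0, vocab, W_hh, W_xh, W_ho, n_words=20):
    """Generate text: read o_t off the current state, sample a word,
    1-hot encode it, and feed it back as the next input."""
    def softmax(y):
        e = np.exp(y - y.max())
        return e / e.sum()
    V = len(vocab)
    h, words = h0, []
    for _ in range(n_words):
        o = softmax(W_ho @ h)             # distribution over the vocabulary
        w = rng.choice(V, p=o)            # sample the next word, e.g. "il"
        words.append(vocab[w])
        x = np.zeros(V)
        x[w] = 1.0                        # 1-hot encoding of the sampled word
        h = np.tanh(W_hh @ h + W_xh @ x)  # advance the hidden state
    return " ".join(words)
```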
Training a RNN
Training set: “il était une fois un …”
Start from initial values of the parameters W = (W_hh, W_xh, W_ho), with h_0 given:
o_0 = softmax(W_ho h_0), e.g. (.5 .01 … .002 … .1 .1 … .005)
The correct first word is “il”, so the Cross-Entropy Loss at step 0 is −log p(il) = −log .01
“il” is then fed as the input x_1, giving h_1 and o_1, e.g. (.1 .05 … .07 … .3 .1 … .3)
The correct next word is “était”: Cross-Entropy Loss = −log p(était) = −log .3
… and so on over the whole training sequence
Backpropagation through time
The per-step losses over the training sequence (“il était une fois un …”) are summed into a global LOSS.
BPTT (Back-Propagation Through Time): gradients of the LOSS are propagated back through the unrolled network; note that the same W_hh and W_ho (and W_xh) appear at every time step.
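A sketch of the corresponding loss computation (teacher forcing: the true next word, not a sampled one, is fed back as input). The gradient of this summed LOSS with respect to W is exactly what BPTT computes; in practice a framework’s automatic differentiation handles the unrolling:

```python
import numpy as np

def sequence_loss(word_ids, h0, W_hh, W_xh, W_ho):
    """Sum of per-step cross-entropy losses over one training sentence."""
    def softmax(y):
        e = np.exp(y - y.max())
        return e / e.sum()
    V = W_xh.shape[1]
    h, loss = h0, 0.0
    for target in word_ids:               # ids of "il", "était", "une", ...
        o = softmax(W_ho @ h)             # prediction from the current state
        loss += -np.log(o[target])        # cross-entropy: -log p(correct word)
        x = np.zeros(V)
        x[target] = 1.0                   # teacher forcing: feed the true word
        h = np.tanh(W_hh @ h + W_xh @ x)  # advance the hidden state
    return loss
```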
Recap: word-level language modeling
[Credit: Jozefowicz]
Vanilla RNNs have issues with long-term interactions
[Credit: Alex Graves]
LSTMs can maintain long-term memories
Long Short-Term Memory networks [Hochreiter & Schmidhuber 1997]
[Credit: Alex Graves]
RNNs with longer-term memory: LSTMs and GRUs
LSTM
GRU
http://colah.github.io/posts/2015-08-Understanding-LSTMs
These variants of RNNs alleviate the “vanishing gradient problem” and allow the network to model long-distance effects
If time permits: Hinton’s explanation of LSTM
[Slides: Credit: Hinton 2013]
End of Hinton’s explanation of LSTM
Seq2Seq models for NMT
Seq2Seq for Machine Translation
[Credit: Cho’s blog on NMT]
Seq2Seq RNN
[Credit: http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture10.pdf]
[Sutskever et al., 2014. Sequence to Sequence Learning with Neural Networks]
Source/Target Interface
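A bare-bones sketch of this interface (all sizes and parameters are placeholders): the encoder RNN consumes the source, and its final hidden state initializes the decoder RNN, which then emits target words one by one:

```python
import numpy as np

V, H = 10_000, 128                         # assumed vocabulary and hidden sizes
rng = np.random.default_rng(0)
E = rng.normal(0, 0.01, (H, V))            # input word embeddings
W_enc = rng.normal(0, 0.01, (H, H))        # encoder recurrence
W_dec = rng.normal(0, 0.01, (H, H))        # decoder recurrence
W_out = rng.normal(0, 0.01, (V, H))        # decoder output projection
EOS = 0                                    # end-of-sentence token id

def step(W, h, word_id):                   # RNN step: h' = tanh(W h + E x)
    return np.tanh(W @ h + E[:, word_id])

def translate(src_ids, max_len=50):
    """Encoder consumes the source; its final hidden state initializes the
    decoder, which greedily emits target words until EOS."""
    h = np.zeros(H)
    for w in src_ids:                      # encoding
        h = step(W_enc, h, w)
    out, word = [], EOS                    # decoding starts from an EOS/BOS token
    for _ in range(max_len):
        h = step(W_dec, h, word)
        word = int(np.argmax(W_out @ h))   # greedy choice (beam search in practice)
        if word == EOS:
            break
        out.append(word)
    return out
```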
Attention in NMT
Introducing Attention: Vanilla seq2seq and the information bottleneck
[Adapted from http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture10.pdf]
Problem: fixed-dimensional interface between encoder and decoder
Attention mechanism
[Credit: http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture10.pdf]
Solution: a kind of “Random Access Memory” over the source encodings
Simplified version of (Bahdanau et al. 2015); the steps at each decoding position:
• Scoring: compare the current target hidden state with each source hidden state
• Normalization: convert the scores into alignment weights
• Context: build a context vector as the weighted average of the source hidden states
• Compute the next hidden state from the context vector
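In code, one attention step can be sketched as follows; dot-product scoring is used for simplicity, whereas Bahdanau et al. use a small learned network for the comparison:

```python
import numpy as np

def attention_step(h_t, src_states):
    """One attention step: score each source state against the current
    target state, normalize into alignment weights, and average."""
    scores = src_states @ h_t                 # (n_src,) comparison scores
    e = np.exp(scores - scores.max())
    weights = e / e.sum()                     # alignment weights, sum to 1
    context = weights @ src_states            # weighted average of source states
    return context, weights

# Toy usage: 5 source hidden states and one target hidden state of size 8.
rng = np.random.default_rng(0)
src = rng.normal(size=(5, 8))
h_t = rng.normal(size=8)
context, weights = attention_step(h_t, src)
# `context` is then combined with h_t (e.g. through a learned layer)
# to produce the next hidden state and output distribution.
```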
NMT with attention
[Credit: http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp]
The “canonical” NMT architecture: RNN seq2seq with attention and with bidirectional encoding
[Credit: https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/ (Cho’s Blog, 2015)]
Advanced NMT Models
Alternative Architectures
• Convolutional approaches
• “Attention is all you need”
Convolutional models
• Illustration from: [Kalchbrenner et al. 2016, Neural Machine Translation in Linear Time] https://arxiv.org/abs/1610.10099 (ByteNet)
• Shorter “maximum path length” than an RNN
• More parallelizable
• Also: Convolution + Attention: [Gehring et al. 2017, Convolutional Sequence to Sequence Learning] https://arxiv.org/abs/1705.03122 (ConvS2S, fairseq)
From convolution to self-attention
[Diagram: states s_1 … s_8 at one level feeding states at the next level]
Convolution:
• The next level for s_3 gets input from a fixed number of (immediate) neighbors
Self-attention:
• The next level for s_3 gets input from a variable number of (perhaps distant) neighbors, according to attention weights relative to s_3
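A bare-bones sketch of self-attention over a sequence of states; the Transformer adds learned query/key/value projections, scaling, and multiple heads on top of this idea:

```python
import numpy as np

def self_attention(S):
    """Each state s_i is updated from ALL states s_j, weighted by a
    softmax over dot-product similarities with s_i."""
    scores = S @ S.T                          # (n, n) pairwise similarities
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = e / e.sum(axis=1, keepdims=True)      # row i: attention weights of s_i
    return A @ S                              # next-level states

rng = np.random.default_rng(0)
S = rng.normal(size=(8, 16))                  # s_1 ... s_8, dimension 16
S_next = self_attention(S)                    # s_3 can draw on distant neighbors
```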
“Attention is all you need”
• [Vaswani et al. (2017). Attention Is All You Need] http://arxiv.org/abs/1706.03762 (Transformer)
• Attention applied to:
  • Encoding of the source (self-attention)
  • Decoding of the next target word:
    • Attention to source-word encodings
    • Attention to previous target words
Speed
https://research.googleblog.com/2017/06/accelerating-deep-learning-research.html
Coreference resolution (Winograd schemas)
[Figure: self-attention weights. Credit: https://research.googleblog.com/2017/08/transformer-novel-neural-network.html]
Multilingual NMT
[Slide cited by: Marta Costa-Jussa, 2017]
[O. Firat et al., 2016. Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism]
[Melvin Johnson et al., 2017. Google’s Multilingual NMT System: Enabling Zero-Shot Translation] https://arxiv.org/pdf/1611.04558.pdf
Google Multilingual NMT: Enabling zero-shot translation
[Credit: https://www.youtube.com/watch?v=nR74lBO5M3s]
The Generality of Seq2Seq Models
Image Captioning
[Show and Tell: A Neural Image Caption Generator, Vinyals et al., 2014]
Natural Language Generation
[Credit: Agarwal et al., 2017]
MR: Meaning Representation (Dialog Act)
RF: Reference (Test Set)
Pred: Seq2seq prediction
Structure is encoded as a character sequence
Char2Char model with attention
Seq2Seq Applications
• text → text
  • Machine Translation
  • Summarization
  • Dialogue
• → text
  • Language Modeling
• other → text
  • Image Captions
  • Natural Language Generation
  • Speech Recognition
  • Handwriting Recognition
• text → other
  • Semantic Parsing
  • Code Generation
  • Handwriting Generation
  • Speech Synthesis
• other → other
  • Image Generation
  • etc.
Tools and Resources
A few open-source NMT toolkits
Extensive list at: https://github.com/jonsafari/nmt-list

NAME                 Model Type                                  Main Framework   Who                          Comments
tf-seq2seq           RNN                                         TensorFlow       Denny Britz (Google Brain)
Nematus              RNN                                         Theano           Edinburgh U.
Marian-NMT           RNN                                         C++              Poznan U. and Edinburgh U.   Compatible with Nematus
OpenNMT-py           RNN                                         PyTorch          Harvard, Systran             Based on OpenNMT (Torch)
Fairseq              CNN (ConvS2S)                               Torch            Facebook
Tensor2Tensor (T2T)  “Attention is all you need” (Transformer)   TensorFlow       Google Brain
                     + other models
References: some overviews
An Introduction to Machine Translation (1992), Hutchins and Somers, Academic Press. [web]
Statistical Machine Translation (2010), P. Koehn, Cambridge University Press.
Neural Machine Translation and Sequence-to-Sequence Models: A Tutorial (2017), G. Neubig. [pdf]
Neural Machine Translation (chapter draft) (2017), P. Koehn. [pdf]
CS224d, Deep Learning for Natural Language Processing, Lecture 10 (Machine Translation), Manning et al., Stanford University. [web]
Survey of the State of the Art in Natural Language Generation: Core Tasks, Applications and Evaluation (2017), Gatt et al. [pdf]
References: a few papers
Long Short-Term Memory (1997), S. Hochreiter and J. Schmidhuber. [pdf]
Generating Sequences with Recurrent Neural Networks (2013), A. Graves. [pdf]
Sequence to Sequence Learning with Neural Networks (2014), I. Sutskever et al. [pdf]
Neural Machine Translation by Jointly Learning to Align and Translate (2014), D. Bahdanau et al. [pdf]
Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (2016), Y. Wu et al. [pdf]
A Convolutional Encoder Model for Neural Machine Translation (2017), J. Gehring et al. [pdf]
Attention Is All You Need (2017), A. Vaswani et al. [pdf]
Concluding remarks
• Many aspects I did not discuss:
  • Detailed implementation techniques (batching, dropout, ensembling, …)
  • Pros/cons of NMT relative to PBMT (Philipp Koehn)
  • Sub-word units, Byte-Pair Encoding
  • Use of monolingual data
  • Fine-grained linguistic evaluation techniques (Pierre Isabelle’s challenge dataset)
  • Prior linguistic knowledge
Seq2Seq in NMT, text generation, etc.:
An active research field with exciting applications