What can Statistical Machine Translation teach Neural Machine Translation about Structured Prediction?
Graham Neubig @ ICLR Workshop on Deep Reinforcement Learning Meets Structured Prediction
5/6/2019
Types of Prediction
• Two classes (binary classification)
  I hate this movie → positive / negative
• Multiple classes (multi-class classification)
  I hate this movie → very good / good / neutral / bad / very bad
• Exponential/infinite labels (structured prediction)
  I hate this movie → PRP VBP DT NN
  I hate this movie → kono eiga ga kirai
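
To make the jump in output-space size concrete, the sketch below counts the candidate outputs for the three settings on the example sentence. The tag-set size (45) and the target vocabulary size (10,000) are illustrative assumptions, not figures from the talk.

```python
# Count candidate outputs for each prediction type on "I hate this movie".
sentence = "I hate this movie".split()

binary_labels = ["positive", "negative"]
multiclass_labels = ["very good", "good", "neutral", "bad", "very bad"]

# Assumption: a part-of-speech tag set with 45 tags (illustrative size).
NUM_TAGS = 45
# One tag per word, so the number of tag sequences is exponential in length.
num_tag_sequences = NUM_TAGS ** len(sentence)

# Assumption: a 10,000-word target vocabulary, outputs of 1 to 4 words.
VOCAB_SIZE = 10_000
num_translations_up_to_4 = sum(VOCAB_SIZE ** n for n in range(1, 5))

print(len(binary_labels))         # 2
print(len(multiclass_labels))     # 5
print(num_tag_sequences)          # 4100625
print(num_translations_up_to_4)   # grows without bound as the length limit is lifted
```

Even with these small assumed sizes, enumerating and scoring every output stops being feasible once the label is a sequence, which is why structured prediction needs search and factorized models.
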
...
Neubig & Watanabe, Computational Linguistics (2016)
Then: Symbolic Translation Models
kono eiga ga kirai → I hate this movie
• First step: learn component models to maximize likelihood
  • Translation model P(y|x) -- e.g. P( movie | eiga )
  • Language model P(Y) -- e.g. P(hate | I)
  • Reordering model -- e.g. P(<swap> | eiga, ga kirai)
  • Length model P(|Y|) -- e.g. word penalty for each word added
• Second step: learn a log-linear combination to maximize translation accuracy [Och 2003]
$$\log P(Y \mid X) = \sum_i \lambda_i \phi_i(X, Y) / Z$$
Minimum Error Rate Training in Statistical Machine Translation (Och 2003)
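
The log-linear combination can be sketched as a weighted sum of component-model features. This is a toy illustration only: the feature values, the weights, and the two candidate translations are made up, and the normalizer Z is taken over the listed candidates (entering as a subtracted log Z), which is one common reading of the formula rather than the talk's exact definition.

```python
import math

# Hypothetical feature values phi_i(X, Y) for two candidate translations of
# "kono eiga ga kirai": component-model log-probabilities and a word count.
candidates = {
    "I hate this movie": {"tm": -2.0, "lm": -3.0, "reorder": -0.5, "length": 4.0},
    "this movie hate":   {"tm": -1.5, "lm": -6.0, "reorder": -0.2, "length": 3.0},
}

# Hypothetical weights lambda_i. In the second step these would be tuned to
# maximize translation accuracy rather than likelihood.
weights = {"tm": 1.0, "lm": 0.8, "reorder": 0.5, "length": 0.1}

def score(features, weights):
    """Weighted feature sum: sum_i lambda_i * phi_i(X, Y)."""
    return sum(weights[name] * value for name, value in features.items())

scores = {y: score(phi, weights) for y, phi in candidates.items()}

# Normalize over the candidate list only (an approximation of the full space).
log_z = math.log(sum(math.exp(s) for s in scores.values()))
log_probs = {y: s - log_z for y, s in scores.items()}

best = max(scores, key=scores.get)
print(best)  # I hate this movie
```

Because the normalizer is the same for every candidate, it does not change which candidate wins; only the weights do, which is what the second training step adjusts.
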
Now: Auto-regressive Neural Networks
[Figure: encoder-decoder network. The encoder reads "kono eiga ga kirai"; the decoder (a chain of "dec" steps) emits "I hate this movie </s>", taking the previous output word as its next input.]
• All parameters trained end-to-end, usually to maximize likelihood (not accuracy!)
Standard MT System Training/Decoding
Decoder Structure
[Figure: the encoder output feeds a chain of "classify" steps that predict "I", "hate", "this", "movie", "</s>" in turn; each predicted word is fed into the next step.]
$$P(E \mid F) = \prod_{t=1}^{T} P(e_t \mid F, e_1, \ldots, e_{t-1})$$
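
As a worked instance of the factorization above, the following sketch multiplies per-step conditional probabilities into a sentence probability. The step probabilities are invented for illustration and do not come from any real model.

```python
import math

# Hypothetical per-step probabilities P(e_t | F, e_1..e_{t-1}) for the output
# "I hate this movie </s>". The values are made up.
step_probs = [
    ("I", 0.6),
    ("hate", 0.5),
    ("this", 0.7),
    ("movie", 0.8),
    ("</s>", 0.9),
]

# Product form: P(E | F) = prod_t P(e_t | F, e_1..e_{t-1})
prob = math.prod(p for _, p in step_probs)

# Summing log-probabilities is the numerically safer equivalent.
log_prob = sum(math.log(p) for _, p in step_probs)

print(prob)
print(math.exp(log_prob))  # same value up to floating-point error
```
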
Maximum Likelihood Training
• Maximize the likelihood of predicting the next word in the reference given the previous words, i.e. minimize the negative log-likelihood:
$$\ell(E \mid F) = -\log P(E \mid F) = -\sum_{t=1}^{T} \log P(e_t \mid F, e_1, \ldots, e_{t-1})$$
• Also called "teacher forcing"
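
A minimal sketch of this loss under teacher forcing, assuming a toy stand-in for the decoder: the hypothetical table `NEXT_WORD_PROBS` conditions only on the previous word, whereas a real decoder would condition on F and the whole prefix.

```python
import math

# Hypothetical toy "model": previous word -> distribution over the next word.
NEXT_WORD_PROBS = {
    "<s>":   {"I": 0.6, "this": 0.4},
    "I":     {"hate": 0.5, "I": 0.5},
    "hate":  {"this": 0.7, "movie": 0.3},
    "this":  {"movie": 0.8, "</s>": 0.2},
    "movie": {"</s>": 0.9, "this": 0.1},
}

def teacher_forced_nll(reference):
    """Negative log-likelihood of the reference, with reference words as context."""
    loss = 0.0
    previous = "<s>"
    for word in reference:
        loss -= math.log(NEXT_WORD_PROBS[previous][word])
        previous = word  # teacher forcing: the context is always the reference word
    return loss

reference = ["I", "hate", "this", "movie", "</s>"]
loss = teacher_forced_nll(reference)
print(loss)
```

Note that the model's own predictions never enter the loop: the context at every step comes from the reference.
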
Problem 1: Exposure Bias
• Teacher forcing assumes feeding correct previous input, but at test time we may make mistakes that propagate
• Exposure bias: The model is not exposed to mistakes during training, and cannot deal with them at test time
• Really important! One main source of commonly witnessed phenomena such as repetition.
[Figure: the decoder's "classify" steps output "I" at every position, with each repeated "I" fed back in as the next input.]
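
The gap between the two regimes can be illustrated with a toy example. It assumes a hypothetical next-word table with a single flaw (after "I" it slightly prefers "I"); this is not a trained model, only a sketch of how one mistake propagates once the model consumes its own output.

```python
# Hypothetical toy "model": previous word -> distribution over the next word.
NEXT_WORD_PROBS = {
    "<s>":   {"I": 0.6, "this": 0.4},
    "I":     {"I": 0.55, "hate": 0.45},  # the single flaw: prefers "I" after "I"
    "hate":  {"this": 0.7, "movie": 0.3},
    "this":  {"movie": 0.8, "</s>": 0.2},
    "movie": {"</s>": 0.9, "this": 0.1},
}

def predict(previous):
    """Greedy choice of the most probable next word."""
    dist = NEXT_WORD_PROBS[previous]
    return max(dist, key=dist.get)

def teacher_forced_predictions(reference):
    """Training regime: every step sees the correct previous word."""
    contexts = ["<s>"] + reference[:-1]
    return [predict(prev) for prev in contexts]

def free_running_predictions(max_len):
    """Test regime: every step sees the model's own previous prediction."""
    output, previous = [], "<s>"
    for _ in range(max_len):
        word = predict(previous)
        output.append(word)
        if word == "</s>":
            break
        previous = word
    return output

reference = ["I", "hate", "this", "movie", "</s>"]
forced = teacher_forced_predictions(reference)
free = free_running_predictions(max_len=5)
print(forced)  # ['I', 'I', 'this', 'movie', '</s>']
print(free)    # ['I', 'I', 'I', 'I', 'I']
```

With reference contexts the flaw costs one word; with its own predictions as context the same flaw never recovers and the output degenerates into repetition.
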
Problem 2: Disregard for Evaluation Metrics
• In the end, we want good translations
• Good translations can be measured with metrics, e.g. BLEU or METEOR
• Really important! Causes systematic problems:
  • Hypothesis-reference length mismatch
  • Dropped/repeated content
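
Length mismatch is straightforward to measure. The sketch below computes the hypothesis-reference length ratio and the brevity penalty that BLEU applies to short hypotheses; the example sentences are made up, and the function names are illustrative.

```python
import math

def length_ratio(hypothesis, reference):
    """Hypothesis length divided by reference length (1.0 means equal length)."""
    return len(hypothesis) / len(reference)

def brevity_penalty(hypothesis, reference):
    """BLEU brevity penalty: 1 if the hypothesis is long enough, else exp(1 - r/c)."""
    c, r = len(hypothesis), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1.0 - r / c)

reference = "I hate this movie".split()
short_hyp = "I hate".split()

print(length_ratio(short_hyp, reference))     # 0.5
print(brevity_penalty(short_hyp, reference))  # exp(-1), a heavy penalty
print(brevity_penalty(reference, reference))  # 1.0
```

A likelihood objective has no such term, so nothing in training pushes the model toward outputs of the right length.
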
A Clear Example
• My (winning) submission to Workshop on Asian Translation 2016 [Neubig 16]
[Figure: two bar charts comparing MLE, MLE+Length, and MinRisk. One panel plots BLEU (axis 23 to 27), the other Length Ratio (axis 80 to 100).]
• Just training for (sentence-level) BLEU largely fixes length problems, and does much better than heuristics
Lexicons and Minimum Risk Training for Neural Machine Translation: NAIST-CMU at WAT2016 (Neubig 16)
Error and Risk
Error
• Generate a translation
$$\hat{E} = \operatorname{argmax}_{\tilde{E}} P(\tilde{E} \mid F)$$
• Calculate its "badness" (e.g. 1-BLEU, 1-METEOR)
$$\mathrm{error}(E, \hat{E}) = 1 - \mathrm{BLEU}(E, \hat{E})$$
![Page 47: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/47.jpg)
Error
• Generate a translation
• Calculate its "badness" (e.g. 1-BLEU, 1-METEOR)
• We would like to minimize error
$\hat{E} = \operatorname{argmax}_{\tilde{E}} P(\tilde{E} \mid F)$
$\mathrm{error}(E, \hat{E}) = 1 - \mathrm{BLEU}(E, \hat{E})$
![Page 48: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/48.jpg)
Error
• Generate a translation
• Calculate its "badness" (e.g. 1-BLEU, 1-METEOR)
• We would like to minimize error
• Problem: argmax is not differentiable, and thus not conducive to gradient-based optimization
$\hat{E} = \operatorname{argmax}_{\tilde{E}} P(\tilde{E} \mid F)$
$\mathrm{error}(E, \hat{E}) = 1 - \mathrm{BLEU}(E, \hat{E})$
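The non-differentiability can be seen in a small Python sketch. Everything here is an illustrative assumption rather than material from the talk: the three candidates, their scores and errors, the single weight `w`, and the helper names. The error of the argmax output is piecewise constant in `w`, so a finite-difference gradient is zero almost everywhere and the error jumps where two candidates tie.

```python
# Toy candidates for one source sentence: (base score, feature value, error).
# All numbers are hypothetical; "error" stands in for something like 1-BLEU.
CANDIDATES = [
    (-1.0, -1.0, 0.6),
    (1.0, 0.0, 0.0),
    (-1.0, 1.0, 1.0),
]


def argmax_error(w, candidates=CANDIDATES):
    """Error of the candidate that scores highest under weight w."""
    best = max(candidates, key=lambda c: c[0] + w * c[1])
    return best[2]


def finite_difference(f, w, eps=1e-4):
    """Central finite-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)


# Away from a tie the argmax does not change, so the estimated gradient is zero.
grad_at_0 = finite_difference(argmax_error, 0.0)
grad_at_1 = finite_difference(argmax_error, 1.0)

# At w = 2 the second and third candidates tie, and the error jumps.
jump_at_2 = argmax_error(2.0 + 1e-4) - argmax_error(2.0 - 1e-4)

print(grad_at_0, grad_at_1, jump_at_2)
```

A gradient of zero gives the optimizer no direction to move in, which is why the following slides look at gradient-free search and at smooth surrogates.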
![Page 49: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/49.jpg)
In Phrase-based MT: Minimum Error Rate Training
![Page 50: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/50.jpg)
In Phrase-based MT: Minimum Error Rate Training
• A clever trick for gradient-free optimization of linear models
![Page 51: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/51.jpg)
In Phrase-based MT: Minimum Error Rate Training
• A clever trick for gradient-free optimization of linear models
• Pick a single direction in feature space
![Page 52: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/52.jpg)
In Phrase-based MT: Minimum Error Rate Training
• A clever trick for gradient-free optimization of linear models
• Pick a single direction in feature space
• Exactly calculate the loss surface in this direction only (over an n-best list for every hypothesis)
![Page 53: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/53.jpg)
In Phrase-based MT: Minimum Error Rate Training
• A clever trick for gradient-free optimization of linear models
• Pick a single direction in feature space
• Exactly calculate the loss surface in this direction only (over an n-best list for every hypothesis)
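Example n-best lists (feature values φ1, φ2, φ3 and error for each candidate):

| F1 | φ1 | φ2 | φ3 | err |
|---|---|---|---|---|
| E1,1 | 1 | 0 | -1 | 0.6 |
| E1,2 | 0 | 1 | 0 | 0 |
| E1,3 | 1 | 0 | 1 | 1 |

| F2 | φ1 | φ2 | φ3 | err |
|---|---|---|---|---|
| E2,1 | 1 | 0 | -2 | 0.8 |
| E2,2 | 3 | 0 | 1 | 0.3 |
| E2,3 | 3 | 1 | 2 | 0 |

[Figure: MERT line search example. Initial weights λ1=-1, λ2=1, λ3=0; direction d1=0, d2=0, d3=1. Panels plot the F1 candidates and F2 candidates along this direction (horizontal axis from -4 to 4), followed by the F1 error, F2 error, and total error. Update α ← 1.25, giving λ1=-1, λ2=1, λ3=1.25.]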
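The line search on this slide can be sketched in a few lines of Python using the slide's two n-best lists, initial weights λ = (-1, 1, 0), and direction d = (0, 0, 1). This is a simplified sketch, not the original MERT implementation: the function names are assumptions, and instead of tracing the upper envelope of the candidate lines it collects every pairwise tie point and probes one α per interval, which is enough for lists this small.

```python
# n-best lists from the slide: ((phi1, phi2, phi3), error) per candidate.
NBEST = {
    "F1": [((1, 0, -1), 0.6), ((0, 1, 0), 0.0), ((1, 0, 1), 1.0)],
    "F2": [((1, 0, -2), 0.8), ((3, 0, 1), 0.3), ((3, 1, 2), 0.0)],
}
LAM = (-1.0, 1.0, 0.0)  # initial weights
D = (0.0, 0.0, 1.0)     # search direction


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def total_error(alpha):
    """Summed error of the 1-best candidates under weights LAM + alpha * D."""
    w = [l + alpha * d for l, d in zip(LAM, D)]
    return sum(max(cands, key=lambda c: dot(w, c[0]))[1]
               for cands in NBEST.values())


def breakpoints():
    """Values of alpha where two candidates for the same sentence tie."""
    points = set()
    for cands in NBEST.values():
        for i in range(len(cands)):
            for j in range(i + 1, len(cands)):
                fi, fj = cands[i][0], cands[j][0]
                slope_diff = dot(D, fi) - dot(D, fj)
                if slope_diff != 0:
                    points.add((dot(LAM, fj) - dot(LAM, fi)) / slope_diff)
    return sorted(points)


def line_search():
    """Return an alpha with minimal total error along direction D."""
    pts = breakpoints()
    # The error is constant between tie points, so one probe per interval suffices.
    probes = ([pts[0] - 1.0]
              + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
              + [pts[-1] + 1.0])
    return min(probes, key=total_error)


best_alpha = line_search()
print(best_alpha, total_error(best_alpha), total_error(1.25))
```

With these numbers the total error is zero for any α between 0.25 and 2, so the α the sketch returns and the slide's α ← 1.25 fall in the same optimal interval.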
![Page 54: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/54.jpg)
A Smooth Approximation: Risk [Smith+ 2006, Shen+ 2015]
Minimum Risk Annealing for Training Log-Linear Models (Smith and Eisner 2006)
Minimum risk training for neural machine translation (Shen et al. 2015)
![Page 55: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/55.jpg)
A Smooth Approximation: Risk [Smith+ 2006, Shen+ 2015]
• Risk is defined as the expected error
![Page 56: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/56.jpg)
A Smooth Approximation: Risk [Smith+ 2006, Shen+ 2015]
• Risk is defined as the expected error
$\mathrm{risk}(F, E, \theta) = \sum_{\tilde{E}} P(\tilde{E} \mid F; \theta)\, \mathrm{error}(E, \tilde{E})$
![Page 57: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/57.jpg)
A Smooth Approximation: Risk [Smith+ 2006, Shen+ 2015]
• Risk is defined as the expected error
$\mathrm{risk}(F, E, \theta) = \sum_{\tilde{E}} P(\tilde{E} \mid F; \theta)\, \mathrm{error}(E, \tilde{E})$
• This includes the probability in the objective function -> differentiable!
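A minimal Python sketch of why risk is differentiable, under assumptions that are not from the talk: a toy model with one parameter `theta`, one feature per candidate, P(Ẽ | F; θ) given by a softmax over `theta * feature`, and made-up error values. Because every candidate's probability changes smoothly with `theta`, the expected error does too, and a finite-difference slope is non-zero.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def risk(theta, feats, errors):
    """Expected error: sum over candidates of P(candidate) * error(candidate)."""
    probs = softmax([theta * f for f in feats])
    return sum(p * e for p, e in zip(probs, errors))


# Hypothetical candidates: one feature value and one error (e.g. 1-BLEU) each.
FEATS = [-1.0, 0.0, 1.0]
ERRORS = [0.6, 0.0, 1.0]

eps = 1e-4
risk_at_0 = risk(0.0, FEATS, ERRORS)
slope_at_0 = (risk(eps, FEATS, ERRORS) - risk(-eps, FEATS, ERRORS)) / (2 * eps)

print(risk_at_0, slope_at_0)
```

At `theta = 0` the three candidates are equally likely, so the risk is the mean error, and the slope tells a gradient-based optimizer which way to move probability mass.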
![Page 58: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/58.jpg)
Sub-sampling
![Page 59: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/59.jpg)
Sub-sampling
• Create a small sample of sentences (5-50), and calculate risk over that
![Page 60: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/60.jpg)
Sub-sampling
• Create a small sample of sentences (5-50), and calculate risk over that
$\mathrm{risk}(F, E, S) = \sum_{\tilde{E} \in S} \frac{P(\tilde{E} \mid F)}{Z}\, \mathrm{error}(E, \tilde{E})$
![Page 61: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/61.jpg)
Sub-sampling
• Create a small sample of sentences (5-50), and calculate risk over that
• Samples can be created using random sampling or n-best search
$\mathrm{risk}(F, E, S) = \sum_{\tilde{E} \in S} \frac{P(\tilde{E} \mid F)}{Z}\, \mathrm{error}(E, \tilde{E})$
![Page 62: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/62.jpg)
Sub-sampling
• Create a small sample of sentences (5-50), and calculate risk over that
• Samples can be created using random sampling or n-best search
• If random sampling, make sure to deduplicate
$\mathrm{risk}(F, E, S) = \sum_{\tilde{E} \in S} \frac{P(\tilde{E} \mid F)}{Z}\, \mathrm{error}(E, \tilde{E})$
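The sub-sampled risk can be sketched as follows. The sketch assumes that Z is the sum of P(Ẽ | F) over the sample S, so the probabilities are renormalized within the sample; the output probabilities, errors, and function names are made up for illustration. Deduplication matters because each candidate is already weighted by its probability in the sum, so a candidate drawn twice would otherwise be counted twice.

```python
import random


def sample_candidates(probs, n, rng):
    """Draw n candidate indices by random sampling, then deduplicate."""
    draws = rng.choices(range(len(probs)), weights=probs, k=n)
    seen, sample = set(), []
    for i in draws:
        if i not in seen:  # keep only the first occurrence of each candidate
            seen.add(i)
            sample.append(i)
    return sample


def subsampled_risk(sample, probs, errors):
    """Risk over the sample S, with probabilities renormalized by Z."""
    z = sum(probs[i] for i in sample)  # assumed: Z = total probability mass in S
    return sum(probs[i] / z * errors[i] for i in sample)


# Hypothetical model probabilities and errors (e.g. 1-BLEU) for four outputs.
PROBS = [0.5, 0.3, 0.15, 0.05]
ERRORS = [0.2, 0.5, 1.0, 0.0]

rng = random.Random(0)
sample = sample_candidates(PROBS, n=20, rng=rng)
print(sample, subsampled_risk(sample, PROBS, ERRORS))
```

If the sample happens to cover every output, Z is 1 and the result equals the full expected error.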
![Page 63: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/63.jpg)
Policy Gradient/REINFORCE
![Page 64: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/64.jpg)
Policy Gradient/REINFORCE
• Alternative way of maximizing expected reward, minimizing risk
![Page 65: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/65.jpg)
Policy Gradient/REINFORCE
• Alternative way of maximizing expected reward, minimizing risk
$\ell_{\mathrm{reinforce}}(X, Y) = -R(\hat{Y}, Y) \log P(\hat{Y} \mid X)$
![Page 66: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/66.jpg)
Policy Gradient/REINFORCE
• Alternative way of maximizing expected reward, minimizing risk
• Outputs that get a bigger reward will get a higher weight
$\ell_{\mathrm{reinforce}}(X, Y) = -R(\hat{Y}, Y) \log P(\hat{Y} \mid X)$
![Page 67: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/67.jpg)
Policy Gradient/REINFORCE
• Alternative way of maximizing expected reward, minimizing risk
• Outputs that get a bigger reward will get a higher weight
• Can show this converges to minimum-risk solution
$\ell_{\mathrm{reinforce}}(X, Y) = -R(\hat{Y}, Y) \log P(\hat{Y} \mid X)$
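A small Python sketch of the REINFORCE loss for a single sampled output. The setup is an assumption for illustration: the model is a softmax over a handful of whole outputs, the reward is a number such as BLEU, and the function names are hypothetical. It shows the "bigger reward, higher weight" behavior: the gradient with respect to the logits is the usual log-likelihood gradient scaled by the reward.

```python
import math


def reinforce_loss(prob_of_sample, reward):
    """Loss for one sampled output: -R * log P(sample | X)."""
    return -reward * math.log(prob_of_sample)


def reinforce_grad_logits(probs, sampled, reward):
    """Gradient of the loss w.r.t. softmax logits: -R * (one_hot(sampled) - probs)."""
    return [-reward * ((1.0 if i == sampled else 0.0) - p)
            for i, p in enumerate(probs)]


# Hypothetical model distribution over two candidate outputs.
PROBS = [0.5, 0.5]

# The same sampled output (index 0) with a high and a low reward.
grad_high = reinforce_grad_logits(PROBS, sampled=0, reward=1.0)
grad_low = reinforce_grad_logits(PROBS, sampled=0, reward=0.2)

print(reinforce_loss(PROBS[0], 1.0), grad_high, grad_low)
```

A gradient-descent step subtracts this gradient, so the logit of the sampled output rises, and it rises five times as much with reward 1.0 as with reward 0.2.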
![Page 68: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/68.jpg)
But Wait, why is Everyone Using MLE for NMT?
![Page 70: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/70.jpg)
When Training goes Bad...
Chances are, this is you 😔
Minimum risk training for neural machine translation (Shen et al. 2015)
![Page 72: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/72.jpg)
It Happens to the Best of Us
• Email from a famous MT researcher: "we also re-implemented MRT, but so far, training has been very unstable, and after improving for a bit, our models develop a bias towards producing ever-shorter translations..."
![Page 73: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/73.jpg)
My Current Recipe for Stabilizing MRT/Reinforcement Learning
![Page 77: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/77.jpg)
Warm-start
• Start training with maximum likelihood, then switch over to REINFORCE
• Works only in the scenarios where we can run MLE (not latent variables or standard RL settings)
• MIXER (Ranzato et al. 2016) gradually transitions from MLE to the full objective
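The two warm-start variants above can be sketched as simple schedules. This is an illustration under stated assumptions: `warm_start_objective` and `mixer_style_split` are hypothetical helper names, and the per-epoch shrink of the MLE prefix is a guess at what a gradual transition could look like, not the exact schedule of Ranzato et al.

```python
def warm_start_objective(step, mle_steps):
    """Hard switch: maximum likelihood first, REINFORCE afterwards."""
    return "mle" if step < mle_steps else "reinforce"

def mixer_style_split(seq_len, epoch, mle_epochs, delta):
    """Gradual switch: number of leading tokens still trained with MLE.

    During the first `mle_epochs` epochs the whole sequence uses MLE.
    After that the MLE prefix shrinks by `delta` tokens per epoch
    (assumed schedule), and the remaining suffix is trained with REINFORCE.
    """
    if epoch < mle_epochs:
        return seq_len
    shrink = (epoch - mle_epochs + 1) * delta
    return max(0, seq_len - shrink)

for epoch in range(6):
    prefix = mixer_style_split(seq_len=10, epoch=epoch, mle_epochs=2, delta=3)
    print(epoch, "MLE tokens:", prefix, "REINFORCE tokens:", 10 - prefix)
```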
![Page 86: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/86.jpg)
Adding a Baseline
• Basic idea: we have expectations about our reward for a particular sentence

| Sentence | Reward (R) | Baseline (B) | R - B |
|---|---|---|---|
| “This is an easy sentence” | 0.8 | 0.95 | -0.15 |
| “Buffalo Buffalo Buffalo” | 0.3 | 0.1 | 0.2 |

• We can instead weight our likelihood by R - B to reflect when we did better or worse than expected

$$\ell_{\text{baseline}}(X) = -(R(\hat{Y}, Y) - B(\hat{Y})) \log P(\hat{Y} \mid X)$$

• (Be careful to not backprop through the baseline)
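A small worked example of the baseline-adjusted loss, using the reward and baseline values from the slide. The log-probability of 0.5 is a made-up placeholder, and the function names are illustrative.

```python
import math

def advantage(reward, baseline):
    """R - B: positive when we did better than expected, negative when worse."""
    return reward - baseline

def baseline_loss(reward, baseline, log_prob):
    """-(R - B) * log P. The baseline is a plain float here, i.e. a constant,
    so no gradient can flow through it."""
    return -advantage(reward, baseline) * log_prob

examples = [
    ("This is an easy sentence", 0.8, 0.95),
    ("Buffalo Buffalo Buffalo", 0.3, 0.1),
]
log_prob = math.log(0.5)  # hypothetical model log-probability of the sampled output
for sentence, reward, baseline in examples:
    print(sentence, round(advantage(reward, baseline), 2),
          baseline_loss(reward, baseline, log_prob))
```

The easy sentence has the higher raw reward but a negative weight, because it fell short of what was expected. In an automatic-differentiation framework a learned baseline would typically need an explicit stop-gradient to get the same effect as the constant used here.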
![Page 90: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/90.jpg)
Increasing Batch Size
• Because each sample will be high variance, we can sample many different examples before performing an update
• We can increase the number of examples (roll-outs) done before an update to stabilize
• We can also save previous roll-outs and re-use them when we update parameters (experience replay, Lin 1993)
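To illustrate why more roll-outs per update stabilize training, the sketch below simulates noisy gradient estimates and measures how much the averaged estimate fluctuates. The Gaussian noise model, the constants, and the function names are assumptions made only for this illustration.

```python
import random
import statistics

def noisy_gradient(rng, true_grad=1.0, noise=2.0):
    """One roll-out's gradient estimate: unbiased but noisy (assumed Gaussian)."""
    return rng.gauss(true_grad, noise)

def batched_estimate(rng, num_rollouts):
    """Average the estimates from several roll-outs before a single update."""
    samples = [noisy_gradient(rng) for _ in range(num_rollouts)]
    return statistics.fmean(samples)

def estimate_spread(num_rollouts, trials=2000, seed=0):
    """Standard deviation of the batched estimate across many simulated updates."""
    rng = random.Random(seed)
    estimates = [batched_estimate(rng, num_rollouts) for _ in range(trials)]
    return statistics.pstdev(estimates)

for n in (1, 4, 16, 64):
    print(n, estimate_spread(n))
```

For independent roll-outs the spread shrinks roughly with the square root of the number of roll-outs, so each update follows the true gradient more closely.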
![Page 94: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/94.jpg)
Adding Temperature
• Temperature adjusts the peakiness of the distribution
• With a small sample, setting temperature > 1 accounts for unsampled hypotheses that should be in the denominator
$$\text{risk}(F, E, \theta, \tau, S) = \sum_{\tilde{E} \in S} \frac{P(\tilde{E} \mid F; \theta)^{1/\tau}}{Z} \, \text{error}(E, \tilde{E})$$
[Figure: four plots of the distribution at τ = 1, τ = 0.5, τ = 0.25, and τ = 0.05; x-axis from -4 to 4, y-axis from 0 to 2]
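A minimal sketch of the temperature-scaled risk over a sampled set of hypotheses, assuming Z is the sum of the sharpened probabilities over the sample. The probabilities and error values are made up for illustration.

```python
def sharpened_weights(probs, tau):
    """Raise each sampled hypothesis probability to 1/tau and renormalize over the sample."""
    powered = [p ** (1.0 / tau) for p in probs]
    z = sum(powered)  # Z: normalizer over the sampled hypotheses only
    return [w / z for w in powered]

def risk(probs, errors, tau):
    """Expected error under the sharpened, renormalized distribution."""
    weights = sharpened_weights(probs, tau)
    return sum(w * e for w, e in zip(weights, errors))

probs = [0.5, 0.3, 0.2]   # hypothetical P(E~ | F) for three sampled hypotheses
errors = [0.2, 0.5, 0.9]  # hypothetical error of each hypothesis against the reference
for tau in (2.0, 1.0, 0.5, 0.25, 0.05):
    print(tau, sharpened_weights(probs, tau), risk(probs, errors, tau))
```

As τ shrinks, the weight concentrates on the most probable hypothesis and the risk approaches that hypothesis's error. With τ above 1 the weights flatten instead.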
![Page 95: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/95.jpg)
Contrasting Phrase-based SMT and NMT
![Page 102: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/102.jpg)
Phrase-based SMT MERT and NMT MinRisk/REINFORCE

| | NMT+MinRisk | PBMT+MERT |
|---|---|---|
| Model | NMT | PBMT |
| Optimized Parameters | Millions | 5-30 log-linear weights (others MLE) |
| Objective | Risk | Error |
| Metric Granularity | Sentence level | Corpus level |
| n-best Lists | Re-generated | Accumulated |
![Page 106: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/106.jpg)
Optimized Parameters
• Can we reduce the number of parameters optimized for NMT?
• Maybe we can optimize only some parts of the model?
Freezing Subnetworks to Analyze Domain Adaptation in NMT. Thompson et al. 2018.
• Maybe we can express models as a linear combination of a few hyper-parameters?
Contextualized Parameter Generation for Universal NMT. Platanios et al. 2018.

$$W = \sum_i \alpha_i W_i$$
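The linear-combination idea can be sketched as follows: the basis matrices stay fixed and only the few mixing coefficients would be optimized. The helper name `combine` and the toy matrices are illustrative assumptions, not the method of the cited papers.

```python
def combine(alphas, bases):
    """Return W = sum_i alpha_i * W_i for equally shaped matrices given as nested lists."""
    rows, cols = len(bases[0]), len(bases[0][0])
    return [
        [sum(a * w[r][c] for a, w in zip(alphas, bases)) for c in range(cols)]
        for r in range(rows)
    ]

# Two fixed 2x2 basis matrices (hypothetical); only the two alphas are "trainable".
bases = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
alphas = [0.5, 2.0]
print(combine(alphas, bases))
```

Here a 2x2 weight matrix with four entries is controlled by two coefficients. With large matrices and a handful of bases the reduction in optimized parameters would be far greater.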
![Page 109: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/109.jpg)
Objective
• Can we move closer to minimizing error, which is what we want to do in the first place?
• Maybe we can gradually anneal the temperature to move towards a peakier distribution?
Minimum risk annealing for training log-linear models. Smith and Eisner 2006.
[Figure: the distribution at τ = 1, τ = 0.5, τ = 0.25, and τ = 0.05 (x-axis from -4 to 4, y-axis from 0 to 2), ordered by training progression]
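One possible annealing schedule, sketched under assumptions: the geometric interpolation, the default endpoints (taken from the τ values in the plots), and the function name are illustrative choices, not the schedule used by Smith and Eisner.

```python
def annealed_temperature(step, total_steps, tau_start=1.0, tau_end=0.05):
    """Geometric interpolation from tau_start (smooth) to tau_end (peaky)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start * (tau_end / tau_start) ** frac

for step in (0, 250, 500, 750, 1000):
    print(step, annealed_temperature(step, total_steps=1000))
```

Early in training the objective stays close to smooth expected risk. Late in training the distribution becomes peaky, so the objective moves toward the error of the single best hypothesis.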
![Page 116: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/116.jpg)
Metric Granularity
• Two ways of measuring metrics
• Sentence-level: Measure sentence-by-sentence, average
• Corpus: Sum sufficient statistics, calculate score
• Regular BLEU is corpus-level, but mini-batch NMT optimization algorithms calculate sentence level
• This causes problems, e.g. in sentence length!
Optimizing for sentence-level BLEU+1 yields short translations. Nakov et al. 2012.
• Maybe we can keep a running average of the sufficient statistics to approximate corpus BLEU?
Online large-margin training of syntactic and structural translation features. Chiang et al. 2008.
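The difference between the two granularities can be sketched with a toy metric: clipped unigram precision times a brevity penalty, which is a deliberate simplification of BLEU. The `RunningStats` class is a hypothetical illustration of scoring a sentence in the context of decayed running statistics. Its decay value and update rule are assumptions, not the exact procedure of Chiang et al.

```python
import math
from collections import Counter

def stats(hyp, ref):
    """Sufficient statistics: (clipped unigram matches, hypothesis length, reference length)."""
    h, r = hyp.split(), ref.split()
    matches = sum((Counter(h) & Counter(r)).values())
    return matches, len(h), len(r)

def score(matches, hyp_len, ref_len):
    """Unigram precision times the brevity penalty."""
    if hyp_len == 0:
        return 0.0
    bp = min(1.0, math.exp(1.0 - ref_len / hyp_len))
    return bp * matches / hyp_len

def sentence_level(pairs):
    """Score each sentence separately, then average the scores."""
    return sum(score(*stats(h, r)) for h, r in pairs) / len(pairs)

def corpus_level(pairs):
    """Sum the sufficient statistics over all sentences, then score once."""
    totals = [sum(col) for col in zip(*(stats(h, r) for h, r in pairs))]
    return score(*totals)

class RunningStats:
    """Exponentially decayed sum of sufficient statistics (decay is an assumption)."""
    def __init__(self, decay=0.9):
        self.decay = decay
        self.totals = [0.0, 0.0, 0.0]

    def score_in_context(self, hyp, ref):
        """Score a new sentence together with the running statistics."""
        new = stats(hyp, ref)
        return score(*(t + n for t, n in zip(self.totals, new)))

    def update(self, hyp, ref):
        new = stats(hyp, ref)
        self.totals = [self.decay * t + n for t, n in zip(self.totals, new)]

pairs = [
    ("the cat", "the cat sat on the mat"),
    ("a dog barked loudly today", "a dog barked loudly today"),
]
print(sentence_level(pairs), corpus_level(pairs))
```

At sentence level the brevity penalty is applied to each sentence on its own. At corpus level it is applied once to the summed lengths, so the two numbers generally differ.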
![Page 120: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/120.jpg)
N-best Lists
• In MERT for PBMT, we would accumulate n-best lists across epochs:
Epoch 1: n-best 1
Epoch 2: n-best 1 + new n-best 2
Epoch 3: n-best 1 + n-best 2 + new n-best 3
• Greatly stabilizes training! Even if model learns horrible parameters, it still has good hypotheses from which to recover.
• Maybe we could do the same for NMT? Analogous to experience replay in RL:
Self-improving reactive agents based on reinforcement learning, planning and teaching. Lin 1992.
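One way the accumulation could look in code is sketched below. The class `NBestPool` and its methods are hypothetical names for illustration, not part of any MERT or NMT toolkit. The pool caches only the metric score of each hypothesis (for example sentence BLEU against the reference), since that does not change as the model is updated; model scores would need to be recomputed every epoch.

```python
from collections import defaultdict


class NBestPool:
    """Hypothetical per-sentence pool of hypotheses accumulated over epochs."""

    def __init__(self):
        # sent_id -> {hypothesis (tuple of tokens): metric score}
        self.pool = defaultdict(dict)

    def add(self, sent_id, nbest):
        """Merge a new n-best list; return how many hypotheses were new."""
        seen = self.pool[sent_id]
        added = 0
        for hyp, metric_score in nbest:
            key = tuple(hyp)
            if key not in seen:  # keep earlier epochs' hypotheses, skip repeats
                seen[key] = metric_score
                added += 1
        return added

    def candidates(self, sent_id):
        """All hypotheses seen so far, best metric score first."""
        return sorted(self.pool[sent_id].items(), key=lambda kv: -kv[1])


pool = NBestPool()
# Epoch 1: the pool holds n-best 1 only.
pool.add(0, [(["a", "b"], 0.5), (["a"], 0.2)])
# Epoch 2: new n-best 2 is merged in; n-best 1 stays available.
pool.add(0, [(["a", "b"], 0.5), (["a", "c"], 0.9)])
```

Because hypotheses from earlier epochs stay in the pool, a bad parameter update cannot remove the good candidates from the training signal, which is the stabilizing effect described above.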
![Page 121: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/121.jpg)
Summary
![Page 127: What can Statistical Machine Translation teach Neural ... · 5/6/2019 · Types of Prediction • Two classes (binary classification) I hate this movie positive negative • Multiple](https://reader036.vdocuments.us/reader036/viewer/2022071116/5ffd688794262f0f3d70b82b/html5/thumbnails/127.jpg)
Summary
• Neural MT has come a long way, and we can optimize for accuracy
• This is important, fixes lots of problems that we'd otherwise use heuristic hacks for
• But no-one does it... Problems of stability and speed.
• Still lots to remember from the past!
Optimization for Statistical Machine Translation, a Survey (Neubig and Watanabe 2016)
Thanks! Questions?