SwitchOut: An Efficient Data Augmentation for Neural Machine Translation
Xinyi Wang∗, Hieu Pham∗, Zihang Dai, Graham Neubig
November 2, 2018
∗: equal contribution
Data Augmentation
Neural models are data hungry, while collecting data is expensive
Prevalent in computer vision
More difficult for natural language
- Discrete vocabulary
- NMT sensitive to arbitrary noise
Existing Strategies
Word replacement
Dictionary [Fadaee et al., 2017]
Word dropout [Sennrich et al., 2016a]
Reward Augmented Maximum Likelihood (RAML) [Norouzi et al., 2016]
→ Can we characterize all of the related approaches together?
Existing Strategies: RAML
RAML [Norouzi et al., 2016]
Motivation: NMT relies on imperfect partial translations at test time, but is trained only on the gold-standard target
Solution: Sample a corrupted target during training
Gold target $y$, corrupted target $\hat{y}$, similarity measure $r_y$
$$q^*(\hat{y} \mid y, \tau) = \frac{\exp\{r_y(\hat{y}, y)/\tau\}}{\sum_{\hat{y}'} \exp\{r_y(\hat{y}', y)/\tau\}}$$
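To make the RAML sampling distribution concrete, here is a minimal sketch that normalizes exp(r_y / τ) over a small, explicit candidate list. It assumes r_y is the negative Hamming distance (the choice named later in the deck) and equal-length token lists; the function names `hamming_reward` and `exponentiated_payoff` and the toy sentences are hypothetical, introduced only for illustration.

```python
import math


def hamming_reward(candidate, gold):
    """Negative Hamming distance between two equal-length token lists."""
    assert len(candidate) == len(gold)
    return -sum(c != g for c, g in zip(candidate, gold))


def exponentiated_payoff(candidates, gold, tau):
    """Normalize exp(r_y(candidate, gold) / tau) over an explicit candidate list."""
    scores = [math.exp(hamming_reward(c, gold) / tau) for c in candidates]
    total = sum(scores)
    return [s / total for s in scores]


gold = ["the", "cat", "sat"]
candidates = [
    ["the", "cat", "sat"],  # Hamming distance 0
    ["the", "dog", "sat"],  # Hamming distance 1
    ["a", "dog", "sat"],    # Hamming distance 2
]
probs = exponentiated_payoff(candidates, gold, tau=1.0)
for cand, p in zip(candidates, probs):
    print(" ".join(cand), round(p, 3))
```

A larger τ flattens these probabilities toward uniform, while a smaller τ concentrates mass on the gold target. In practice the sum runs over all possible targets, which is why an explicit list like this is only a toy.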
Formalize Data Augmentation
Real data distribution: $x, y \sim p(X, Y)$
Observed data distribution: $x, y \sim \hat{p}(X, Y)$
→ Problem: $p(X, Y)$ and $\hat{p}(X, Y)$ might have a large discrepancy
Data augmentation: $\hat{x}, \hat{y} \sim q(\hat{X}, \hat{Y})$
Design a good $q(\hat{X}, \hat{Y})$
q: function of observed $(x, y)$
How should q approximate p?
- Diversity: larger support with all valid data pairs $(\hat{x}, \hat{y})$
  - Entropy $H[q(\hat{x}, \hat{y} \mid x, y)]$ is large
- Smoothness: probabilities of similar data pairs are similar
  - q maximizes the similarity measures $r_x(\hat{x}, x)$, $r_y(\hat{y}, y)$
τ: controls the effect of diversity; q should maximize
$$J(q) = \tau \cdot H[q(\hat{x}, \hat{y} \mid x, y)] + \mathbb{E}_{\hat{x}, \hat{y} \sim q}[r_x(\hat{x}, x) + r_y(\hat{y}, y)]$$
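Mathematically Optimal q
$$J(q) = \tau \cdot H[q(\hat{x}, \hat{y} \mid x, y)] + \mathbb{E}_{\hat{x}, \hat{y} \sim q}[r_x(\hat{x}, x) + r_y(\hat{y}, y)]$$
Solve for the best q
$$q^*(\hat{x}, \hat{y} \mid x, y) = \frac{\exp\{s(\hat{x}, \hat{y}; x, y)/\tau\}}{\sum_{\hat{x}', \hat{y}'} \exp\{s(\hat{x}', \hat{y}'; x, y)/\tau\}}$$
Decompose x and y
$$q^*(\hat{x}, \hat{y} \mid x, y) = \frac{\exp\{r_x(\hat{x}, x)/\tau_x\}}{\sum_{\hat{x}'} \exp\{r_x(\hat{x}', x)/\tau_x\}} \times \frac{\exp\{r_y(\hat{y}, y)/\tau_y\}}{\sum_{\hat{y}'} \exp\{r_y(\hat{y}', y)/\tau_y\}}$$
Formulate existing methods
- Dictionary: jointly on x and y, but deterministic and not diverse
- Word dropout: only the x side, with a null token
- RAML: only the y side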
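The step from J(q) to the softmax form is the standard maximum-entropy argument. The sketch below assumes that s denotes the similarity term inside the expectation (r_x + r_y) and that the augmented pairs range over a finite set. The Lagrange multiplier λ and the shorthand q(·) for q(x̂, ŷ | x, y) are notation introduced here, not taken from the slides.

```latex
\begin{aligned}
\mathcal{L}(q, \lambda)
  &= -\tau \sum_{\hat{x}, \hat{y}} q(\cdot) \log q(\cdot)
     + \sum_{\hat{x}, \hat{y}} q(\cdot)\, s(\hat{x}, \hat{y}; x, y)
     + \lambda \Big( \sum_{\hat{x}, \hat{y}} q(\cdot) - 1 \Big) \\
\frac{\partial \mathcal{L}}{\partial q(\cdot)}
  &= -\tau \big( \log q(\cdot) + 1 \big) + s(\hat{x}, \hat{y}; x, y) + \lambda = 0
  \;\Longrightarrow\;
  q(\cdot) \propto \exp\{ s(\hat{x}, \hat{y}; x, y) / \tau \} \\
q^*(\hat{x}, \hat{y} \mid x, y)
  &= \frac{\exp\{ s(\hat{x}, \hat{y}; x, y)/\tau \}}
          {\sum_{\hat{x}', \hat{y}'} \exp\{ s(\hat{x}', \hat{y}'; x, y)/\tau \}}
\end{aligned}
```

If the exponent splits as s/τ = r_x/τ_x + r_y/τ_y, the exponential becomes a product of an x-only factor and a y-only factor, and the double sum in the normalizer factors the same way, which gives the decomposed form. With a single temperature, τ_x = τ_y = τ.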
Formulate SwitchOut
Augment both x and y!
Sample $\hat{x}$, $\hat{y}$ independently
Define $r_x(\hat{x}, x)$ and $r_y(\hat{y}, y)$
- Negative Hamming distance, following RAML
SwitchOut: Sample efficiently
Given a sentence $s = \{s_1, s_2, \ldots, s_{|s|}\}$
1. How many words to corrupt?
   Assumption: only one token for swapping.
   $$P(n) \propto \exp(-n/\tau)$$
2. What is the corrupted sentence?
   $$P(\text{randomly swap } s_i \text{ by another word}) = \frac{n}{|s|}$$
See Appendix: Efficient batch implementation in PyTorch and TensorFlow
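The two sampling steps above can be sketched in plain Python for a single sentence. This is not the batched PyTorch/TensorFlow implementation the slide refers to; it is an illustrative version under stated assumptions: P(n) is taken as proportional to exp(-n/τ) over n = 0..|s|, each position is swapped independently with probability n/|s|, and the replacement is drawn uniformly from the other vocabulary words. The names `sample_num_swaps` and `switchout` and the toy vocabulary are hypothetical.

```python
import math
import random


def sample_num_swaps(length, tau, rng):
    """Step 1: sample n in {0, ..., length} with P(n) proportional to exp(-n / tau)."""
    weights = [math.exp(-n / tau) for n in range(length + 1)]
    return rng.choices(range(length + 1), weights=weights, k=1)[0]


def switchout(sentence, vocab, tau, rng=None):
    """Step 2: swap each token independently with probability n / |s|."""
    rng = rng or random.Random()
    n = sample_num_swaps(len(sentence), tau, rng)
    swap_prob = n / len(sentence) if sentence else 0.0
    corrupted = []
    for token in sentence:
        if rng.random() < swap_prob:
            # Assumption: the replacement is uniform over the *other* words.
            others = [w for w in vocab if w != token]
            corrupted.append(rng.choice(others) if others else token)
        else:
            corrupted.append(token)
    return corrupted


vocab = ["we", "like", "green", "tea", "black", "coffee"]
sentence = ["we", "like", "green", "tea"]
print(switchout(sentence, vocab, tau=1.0, rng=random.Random(0)))
```

Because positions are swapped independently, the number of changed tokens equals n only in expectation. A very small τ puts almost all mass on n = 0 and leaves the sentence unchanged, while a larger τ corrupts more tokens on average.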
Experiments
Datasets
- en-vi: IWSLT 2015
- de-en: IWSLT 2016
- en-de: WMT 2015
Models
- Transformer model
- Word-based, standard preprocessing
Results: RAML and word dropout
SwitchOut on source > word dropout
SwitchOut on source and target > RAML

| Method (src) | Method (trg) | en-de | de-en | en-vi |
|---|---|---|---|---|
| N/A | N/A | 21.73 | 29.81 | 27.97 |
| WordDropout | N/A | 20.63 | 29.97 | 28.56 |
| SwitchOut | N/A | 22.78† | 29.94 | 28.67† |
| N/A | RAML | 22.83 | 30.66 | 28.88 |
| WordDropout | RAML | 20.69 | 30.79 | 28.86 |
| SwitchOut | RAML | 23.13† | 30.98† | 29.09 |
Where does SwitchOut help?
More gain for sentences more different from training data
[Figure: gain in BLEU (y-axis) against top K sentences (x-axis). Left: IWSLT 16 de-en. Right: IWSLT 15 en-vi.]
Final Thoughts
SwitchOut sampling is efficient and easy to use
Works with any NMT architecture
The formulation of data augmentation encompasses existing work and inspires future directions
Thanks a lot for listening! Questions?
References
Norouzi et al. (2016) Reward Augmented Maximum Likelihood for Neural Structured Prediction. In NIPS.
Sennrich et al. (2016a) Edinburgh Neural Machine Translation Systems for WMT 16. In WMT.
Sennrich et al. (2016b) Improving Neural Machine Translation Models with Monolingual Data. In ACL.
Currey et al. (2017) Copied Monolingual Data Improves Low-Resource Neural Machine Translation. In WMT.
Fadaee et al. (2017) Data Augmentation for Low-Resource Neural Machine Translation. In ACL.
Results: Back Translation (WMT en-de 2015)
SwitchOut > Back Translation
SwitchOut + RAML + back translation wins

| Method | en-de |
|---|---|
| Transformer | 21.73 |
| +SwitchOut | 22.78 |
| +BT | 21.82 |
| +BT +RAML | 21.53 |
| +BT +SwitchOut | 22.93 |
| +BT +RAML +SwitchOut | 23.76 |