SwitchOut: An Efficient Data Augmentation for Neural Machine Translation

Xinyi Wang*, Hieu Pham*, Zihang Dai, Graham Neubig
November 2, 2018
*: equal contribution
Data Augmentation

Neural models are data hungry, while collecting data is expensive.
Data augmentation is prevalent in computer vision.
It is more difficult for natural language:
- Discrete vocabulary
- NMT is sensitive to arbitrary noise
Existing Strategies

Word replacement:
- Dictionary [Fadaee et al., 2017]
- Word dropout [Sennrich et al., 2016a]
- Reward Augmented Maximum Likelihood (RAML) [Norouzi et al., 2016]

→ Can we characterize all of these related approaches together?
Existing Strategies: RAML

RAML [Norouzi et al., 2016]
- Motivation: NMT relies on imperfect partial translations at test time, but is trained only on the gold-standard target.
- Solution: sample a corrupted target during training.
- Gold target y, corrupted target ŷ, similarity measure r_y:

    q*(ŷ | y, τ) = exp{r_y(ŷ, y)/τ} / Σ_{y'} exp{r_y(y', y)/τ}
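The exponentiated-payoff distribution above is just a softmax of the reward at temperature τ. A minimal sketch over a toy candidate set (the candidate list, sentences, and helper names here are illustrative, not from the paper):

```python
import math
import random

def payoff_distribution(candidates, gold, tau, reward):
    """Normalize exp(reward/tau) over a finite candidate set (toy q*)."""
    scores = [math.exp(reward(c, gold) / tau) for c in candidates]
    z = sum(scores)  # partition function over the candidate set
    return [s / z for s in scores]

def neg_hamming(a, b):
    # Negative Hamming distance between equal-length token sequences.
    return -sum(x != y for x, y in zip(a, b))

gold = ["the", "cat", "sat"]
candidates = [gold, ["the", "dog", "sat"], ["a", "dog", "ran"]]
probs = payoff_distribution(candidates, gold, tau=1.0, reward=neg_hamming)
# Candidates closer to the gold target receive higher probability.
sampled = random.choices(candidates, weights=probs, k=1)[0]
```

In practice RAML samples from the full exponential-size space of corrupted targets rather than an explicit candidate list; the toy enumeration above only illustrates the shape of q*.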
Formalize Data Augmentation

- Real data distribution: x, y ~ p(X, Y)
- Observed (empirical) data distribution: x, y ~ p̂(X, Y)
  → Problem: p(X, Y) and p̂(X, Y) might have a large discrepancy
- Data augmentation: x̂, ŷ ~ q(X̂, Ŷ)
Design a Good q(X̂, Ŷ)

q is a function of the observed pair (x, y). How should q approximate p?
- Diversity: larger support covering all valid data pairs (x̂, ŷ)
  - The entropy H[q(x̂, ŷ | x, y)] is large
- Smoothness: similar data pairs get similar probability
  - q maximizes the similarity measures r_x(x̂, x) and r_y(ŷ, y)

τ controls the effect of diversity; q should maximize

    J(q) = τ · H[q(x̂, ŷ | x, y)] + E_{x̂,ŷ~q}[r_x(x̂, x) + r_y(ŷ, y)]
Mathematically Optimal q

Solve for the best q under J(q) = τ · H[q(x̂, ŷ | x, y)] + E_{x̂,ŷ~q}[r_x(x̂, x) + r_y(ŷ, y)]:

    q*(x̂, ŷ | x, y) = exp{s(x̂, ŷ; x, y)/τ} / Σ_{x',y'} exp{s(x', y'; x, y)/τ}

Decompose x and y:

    q*(x̂, ŷ | x, y) = [exp{r_x(x̂, x)/τ_x} / Σ_{x'} exp{r_x(x', x)/τ_x}]
                      × [exp{r_y(ŷ, y)/τ_y} / Σ_{y'} exp{r_y(y', y)/τ_y}]

This formulation covers the existing methods:
- Dictionary: jointly on x and y, but deterministic and not diverse
- Word dropout: only the x side, with a null token
- RAML: only the y side
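Why the maximizer of J(q) takes this exponentiated-payoff form follows a standard maximum-entropy argument; a brief sketch in the deck's notation, with s(x̂, ŷ; x, y) = r_x(x̂, x) + r_y(ŷ, y) (the KL rewriting below is my own filled-in step, not on the slide):

```latex
J(q) = \tau\,H\!\left[q(\hat{x},\hat{y}\mid x,y)\right]
     + \mathbb{E}_{\hat{x},\hat{y}\sim q}\!\left[s(\hat{x},\hat{y};x,y)\right]
     = -\tau\,\mathrm{KL}\!\left(q \,\middle\|\, q^*\right) + \tau \log Z,
\qquad
Z = \sum_{x',y'} \exp\{s(x',y';x,y)/\tau\}.
```

Since τ log Z does not depend on q, maximizing J(q) is equivalent to minimizing KL(q ‖ q*), so the optimum is q = q*, the Gibbs distribution over the payoff s at temperature τ.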
Formulate SwitchOut

- Augment both x and y!
- Sample x̂ and ŷ independently
- Define r_x(x̂, x) and r_y(ŷ, y):
  - Negative Hamming distance, following RAML
SwitchOut: Sample efficiently

Given a sentence s = {s_1, s_2, ..., s_|s|}:

1. How many words to corrupt?
   Assumption: each position is swapped at most once.

       P(n) ∝ exp(−n/τ)

2. What is the corrupted sentence?

       P(randomly swap s_i with another word) = n / |s|

See the appendix for an efficient batched implementation in PyTorch and TensorFlow.
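The two steps above can be sketched in plain Python. This is a simplified, unbatched toy version under the stated assumptions (the paper's appendix gives the efficient batched PyTorch/TensorFlow version); the function and variable names are illustrative:

```python
import math
import random

def switchout(sentence, vocab, tau=1.0):
    """Toy SwitchOut sketch: sample n with P(n) proportional to exp(-n/tau)
    for n = 0..|s|, then swap n distinct positions with random vocab words."""
    length = len(sentence)
    # Step 1: sample the number of positions to corrupt.
    weights = [math.exp(-n / tau) for n in range(length + 1)]
    n = random.choices(range(length + 1), weights=weights, k=1)[0]
    # Step 2: swap n distinct positions, each with a uniformly sampled word.
    corrupted = list(sentence)
    for i in random.sample(range(length), n):
        corrupted[i] = random.choice(vocab)
    return corrupted

vocab = ["the", "a", "cat", "dog", "sat", "ran"]
out = switchout(["the", "cat", "sat"], vocab, tau=1.0)
```

As τ → 0 the weight on n = 0 dominates and the sentence is returned unchanged; larger τ yields more aggressive corruption.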
Experiments

Datasets:
- en-vi: IWSLT 2015
- de-en: IWSLT 2016
- en-de: WMT 2015

Models:
- Transformer model
- Word-based, standard preprocessing
Results: RAML and word dropout

- SwitchOut on source > word dropout
- SwitchOut on source and target > RAML

Method (src)   Method (trg)   en-de    de-en    en-vi
N/A            N/A            21.73    29.81    27.97
WordDropout    N/A            20.63    29.97    28.56
SwitchOut      N/A            22.78†   29.94    28.67†
N/A            RAML           22.83    30.66    28.88
WordDropout    RAML           20.69    30.79    28.86
SwitchOut      RAML           23.13†   30.98†   29.09
Where does SwitchOut help?

More gain for sentences that are more different from the training data.

[Figure: BLEU gain over the top-K sentences most different from the training data. Left: IWSLT 16 de-en. Right: IWSLT 15 en-vi.]
Final Thoughts

- SwitchOut sampling is efficient and easy to use
- Works with any NMT architecture
- Our formulation of data augmentation encompasses existing work and inspires future directions

Thanks a lot for listening! Questions?
References

- Norouzi et al. (2016). Reward Augmented Maximum Likelihood for Neural Structured Prediction. In NIPS.
- Sennrich et al. (2016a). Edinburgh Neural Machine Translation Systems for WMT 16. In WMT.
- Sennrich et al. (2016b). Improving Neural Machine Translation Models with Monolingual Data. In ACL.
- Currey et al. (2017). Copied Monolingual Data Improves Low-Resource Neural Machine Translation. In WMT.
- Fadaee et al. (2017). Data Augmentation for Low-Resource Neural Machine Translation. In ACL.
Results: Back Translation (WMT en-de 2015)

- SwitchOut > back translation
- SwitchOut + RAML + back translation wins

Method                  en-de
Transformer             21.73
+SwitchOut              22.78
+BT                     21.82
+BT +RAML               21.53
+BT +SwitchOut          22.93
+BT +RAML +SwitchOut    23.76