
Proceedings of the First Workshop on Neural Machine Translation, pages 61–68, Vancouver, Canada, August 4, 2017. © 2017 Association for Computational Linguistics

An Empirical Study of Mini-Batch Creation Strategies for Neural Machine Translation

Makoto Morishita1∗, Yusuke Oda2, Graham Neubig3,2, Koichiro Yoshino2,4, Katsuhito Sudoh2, Satoshi Nakamura2

1 NTT Communication Science Laboratories, NTT Corporation
2 Nara Institute of Science and Technology

3 Carnegie Mellon University
4 PRESTO, Japan Science and Technology Agency

[email protected], [email protected], {oda.yusuke.on9, koichiro, sudoh, s-nakamura}@is.naist.jp

Abstract

Training of neural machine translation (NMT) models usually uses mini-batches for efficiency purposes. During the mini-batched training process, it is necessary to pad shorter sentences in a mini-batch to be equal in length to the longest sentence therein for efficient computation. Previous work has noted that sorting the corpus based on the sentence length before making mini-batches reduces the amount of padding and increases the processing speed. However, despite the fact that mini-batch creation is an essential step in NMT training, widely used NMT toolkits implement disparate strategies for doing so, which have not been empirically validated or compared. This work investigates mini-batch creation strategies with experiments over two different datasets. Our results suggest that the choice of a mini-batch creation strategy has a large effect on NMT training and that some length-based sorting strategies do not always work well compared with simple shuffling.

1 Introduction

Mini-batch training is a standard practice in large-scale machine learning. In recent implementations of neural networks, the efficiency of loss and gradient calculation is greatly improved by mini-batching, due to the fact that combining training examples into batches allows for fewer but larger operations that can take advantage of the parallelism allowed by modern computation architectures, particularly GPUs.

∗ This work was done while the author was at Nara Institute of Science and Technology.

In some cases, such as the case of processing images, mini-batching is straightforward, as the inputs in all training examples take the same form. However, in order to perform mini-batching in the training of neural machine translation (NMT) or other sequence-to-sequence models, we need to pad shorter sentences to be the same length as the longest sentences to account for sentences of variable length in each mini-batch.

To help prevent wasted calculation due to this padding, it is common to sort the corpus according to the sentence length before creating mini-batches (Sutskever et al., 2014; Bahdanau et al., 2015), because putting sentences that have similar lengths in the same mini-batch will reduce the amount of padding and increase the per-word computation speed. However, we can also easily imagine that this grouping of sentences together may affect the convergence speed and stability, and the performance of the learned models. Despite this fact, no previous work has explicitly examined how mini-batch creation affects the learning of NMT models. Various NMT toolkits include implementations of different strategies, but they have neither been empirically validated nor compared.

In this work, we attempt to fill this gap by surveying the various mini-batch creation strategies that are in use: sorting by length of the source sentence, target sentence, or both, as well as making mini-batches according to the number of sentences or the number of words. We empirically compare their efficacy on two translation tasks and find that some strategies in wide use are not necessarily optimal for reliably training models.

2 Mini-batches for NMT

First, to clearly demonstrate the problem of mini-batching in NMT models, Figure 1 shows an


Figure 1: An example of mini-batching in an encoder-decoder translation model.

example of mini-batching two sentences of different lengths in an encoder-decoder model.

The first thing that we can notice from the figure is that multiple operations at a particular time step t can be combined into a single operation. For example, both "John" and "I" are embedded in a single step into a matrix that is passed into the encoder LSTM in a single step. On the target side as well, we calculate the loss for the target words at time step t for every sentence in the mini-batch simultaneously.

However, there are problems when sentences are of different length, as only some sentences will have any content at a particular time step. To resolve this problem, we pad short sentences with end-of-sentence tokens to adjust their length to the length of the longest sentence. In Figure 1, the purple-colored "〈/s〉" indicates a padded end-of-sentence token.
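To make this padding step concrete, the following is a minimal Python sketch of padding a mini-batch of token-ID sequences with end-of-sentence tokens. It is our own illustration, not code from any toolkit discussed here, and the EOS_ID constant and the example token IDs are hypothetical.

# Minimal illustration of padding a mini-batch to the length of its
# longest sentence; EOS_ID and the example token IDs are hypothetical.
EOS_ID = 2

def pad_batch(batch, pad_id=EOS_ID):
    """Pad every sentence (a list of token IDs) to the batch's max length."""
    max_len = max(len(sent) for sent in batch)
    return [sent + [pad_id] * (max_len - len(sent)) for sent in batch]

if __name__ == "__main__":
    batch = [[5, 8, 13, EOS_ID],   # a 4-token sentence ending in </s>
             [7, EOS_ID]]          # a 2-token sentence ending in </s>
    print(pad_batch(batch))
    # -> [[5, 8, 13, 2], [7, 2, 2, 2]]  (two extra </s> tokens of padding)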

Padding with these tokens makes it possible to handle variable-length sentences as if they were of the same length. On the other hand, the computational cost for a mini-batch increases in proportion to the longest sentence therein, and excess padding can result in a significant amount of wasted computation. One way to fix this problem is by creating mini-batches that include sentences of similar length (Sutskever et al., 2014) to reduce the amount of padding required.

Algorithm 1 Create mini-batches
 1: C ← training corpus
 2: C ← sort(C) or shuffle(C)          ▷ sort or shuffle the whole corpus
 3: B ← {}                             ▷ mini-batches
 4: i ← 0, j ← 0
 5: while i < C.size() do
 6:     B[j] ← B[j] + C[i]
 7:     if B[j].size() ≥ max mini-batch size then
 8:         B[j] ← padding(B[j])       ▷ pad tokens to the longest sentence in the mini-batch
 9:         j ← j + 1
10:     end if
11:     i ← i + 1
12: end while
13: B ← shuffle(B)                     ▷ shuffle the order of the mini-batches

Many NMT toolkits implement length-based sorting of the training corpus for this purpose. In the following section, we discuss several different mini-batch creation strategies used in existing neural MT toolkits.
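As a concrete reference, here is a short Python sketch of Algorithm 1 for the sentence-count setting. It is our own illustration rather than code from any of the toolkits discussed, and the names create_minibatches, max_batch_size, sort_key, and pad_id are our own choices.

import random

def pad(seqs, pad_id=2):
    """Pad each token-ID sequence to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]

def create_minibatches(corpus, max_batch_size, sort_key=None, pad_id=2):
    """Sketch of Algorithm 1: optionally sort or shuffle the corpus, group
    consecutive (source, target) pairs into mini-batches of at most
    max_batch_size sentences, pad each side of each batch to its longest
    sentence, and shuffle the order of the resulting mini-batches."""
    corpus = list(corpus)
    if sort_key is None:
        random.shuffle(corpus)        # SHUFFLE: no sorting
    else:
        corpus.sort(key=sort_key)     # length-based sorting

    batches, batch = [], []
    for pair in corpus:
        batch.append(pair)
        if len(batch) >= max_batch_size:
            batches.append(batch)
            batch = []
    if batch:                         # keep the last, possibly smaller batch
        batches.append(batch)

    padded = [(pad([s for s, _ in b], pad_id), pad([t for _, t in b], pad_id))
              for b in batches]
    random.shuffle(padded)            # shuffle the order of the mini-batches
    return padded

Passing sort_key=None corresponds to the SHUFFLE strategy described in Section 3.3; the length-based strategies differ only in the key function passed in.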


3 Mini-batch Creation Strategies

Specifically, we examine three aspects of mini-batch creation: mini-batch size, word vs. sentence mini-batches, and sorting strategies. Algorithm 1 shows pseudocode for creating mini-batches.

3.1 Mini-batch Size

The first aspect we consider is the mini-batch size, whose effect is relatively well known compared with the other two aspects we examine here.

When we use larger mini-batches, more sentences participate in the gradient calculation, making the gradients more stable. Larger mini-batches also increase efficiency with parallel computation. However, they decrease the number of parameter updates performed in a certain amount of time, which can slow convergence at the beginning of training. Large mini-batches can also pose problems in practice due to the fact that they increase memory requirements.

3.2 Sentence vs. Word Mini-batching

The second aspect that we examine, which has not been examined in detail previously, is whether to create mini-batches based on the number of sentences or the number of target words.

Most NMT toolkits create mini-batches with a constant number of sentences. In this case, the number of words included in each mini-batch differs greatly due to the variance in sentence lengths. If we use a neural network library that constructs graphs in a dynamic fashion (e.g. DyNet (Neubig et al., 2017), Chainer (Tokui et al., 2015), or PyTorch1), this will lead to a large variance in memory consumption from mini-batch to mini-batch. In addition, because the loss function for the mini-batch is equal to the sum of the losses incurred for each word, the scale of the losses will vary greatly from mini-batch to mini-batch, which could be potentially detrimental to training.

Another choice is to create mini-batches by keeping the number of target words in each mini-batch approximately stable, but varying the number of sentences. We hypothesize that this may lead to more stable convergence, and test this hypothesis in the experiments.
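As an illustration of the difference, the following sketch (our own, not taken from any toolkit) groups (source, target) pairs so that each mini-batch holds roughly a fixed number of target words instead of a fixed number of sentences; the word_budget parameter name is hypothetical.

def create_minibatches_by_words(corpus, word_budget):
    """Group (source, target) token-ID pairs so that the total number of
    target words per mini-batch stays close to word_budget; the number of
    sentences per batch then varies with the sentence lengths."""
    batches, batch, words = [], [], 0
    for src, trg in corpus:
        batch.append((src, trg))
        words += len(trg)
        if words >= word_budget:
            batches.append(batch)
            batch, words = [], 0
    if batch:                        # keep the final partial batch
        batches.append(batch)
    return batches

With a budget on the order of a couple of thousand target words, short sentences form large batches and long sentences form small ones, so the per-batch loss scale and memory footprint stay more uniform.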

3.3 Corpus Sorting Methods

The final aspect that we examine, which similarly is not yet well understood, is the effect of the method that we use to sort the corpus before grouping consecutive sentences into mini-batches.

1 http://pytorch.org

A standard practice in online learning is to shuffle training samples to ensure that bias in the presentation order does not adversely affect the final result. However, as we mentioned in Section 2, NMT studies (Sutskever et al., 2014; Bahdanau et al., 2015) prefer samples of uniform length within a mini-batch, obtained by sorting the training corpus, to reduce the amount of padding and increase the per-word calculation speed. In particular, in the encoder-decoder NMT framework (Sutskever et al., 2014), the computational cost of the softmax layer in the decoder is much heavier than that of the encoder. Some NMT toolkits sort the training corpus based on the target sentence length to avoid unnecessary softmax computations on padded tokens on the target side. Another problem arises in the attentional NMT model (Bahdanau et al., 2015; Luong et al., 2015): attention may give incorrect positive weights to the padded tokens on the source side. These problems also motivate creating mini-batches from sentences of uniform length, with fewer padded tokens.
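A common remedy for the source-side issue, sketched below with NumPy, is to push the attention scores of padded positions to negative infinity before the softmax so that they receive zero weight. This is a standard masking technique in general, not something the compared toolkits are claimed to implement identically, and the variable names are our own.

import numpy as np

def masked_attention_weights(scores, src_lengths):
    """scores: (batch, src_len) unnormalized attention scores.
    src_lengths: true (unpadded) length of each source sentence.
    Returns softmax weights with zero probability on padded positions."""
    batch, src_len = scores.shape
    positions = np.arange(src_len)[None, :]               # (1, src_len)
    mask = positions < np.asarray(src_lengths)[:, None]   # True for real tokens
    masked = np.where(mask, scores, -np.inf)              # hide padded tokens
    masked -= masked.max(axis=1, keepdims=True)           # numerical stability
    exp = np.exp(masked)
    return exp / exp.sum(axis=1, keepdims=True)

# Example: the second sentence has 2 real tokens out of 4 positions.
w = masked_attention_weights(np.zeros((2, 4)), src_lengths=[4, 2])
# w[1] -> [0.5, 0.5, 0.0, 0.0]; padded positions get no attention weight.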

Inspired by sorting methods in use in current open-source implementations, we compare the following sorting methods:

SHUFFLE: Shuffle the corpus randomly before creating mini-batches, with no sorting.

SRC: Sort based on the source sentence length.

TRG: Sort based on the target sentence length.

SRC_TRG: Sort using the source sentence length, breaking ties by the target sentence length.

TRG_SRC: Converse of SRC_TRG.

Of established open-source toolkits, OpenNMT (Klein et al., 2017) uses the SRC sorting method, Nematus2 and KNMT (Cromieres, 2016) use the TRG sorting method, and lamtram3 uses the TRG_SRC sorting method.
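To make the five strategies precise, the following sketch expresses the length-based ones as Python sort keys over (source, target) token-sequence pairs, usable with the create_minibatches sketch from Section 2; SHUFFLE corresponds to passing no key at all. This is our own illustration, not code from the toolkits named above.

# Sort keys for the length-based strategies; ties under SRC_TRG are broken
# by the target length, and vice versa for TRG_SRC.
SORT_KEYS = {
    "SRC":     lambda pair: len(pair[0]),
    "TRG":     lambda pair: len(pair[1]),
    "SRC_TRG": lambda pair: (len(pair[0]), len(pair[1])),
    "TRG_SRC": lambda pair: (len(pair[1]), len(pair[0])),
}

# Usage with the earlier sketch (SHUFFLE = sort_key=None):
# batches = create_minibatches(corpus, max_batch_size=64,
#                              sort_key=SORT_KEYS["TRG_SRC"])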

4 Experiments

We conducted NMT experiments with the strategies presented above to examine their effects on NMT training.

4.1 Experimental Settings

We carried out experiments with two language pairs: English-Japanese using the ASPEC-JE

2 https://github.com/rsennrich/nematus
3 https://github.com/neubig/lamtram


        ASPEC-JE    WMT 2016
train   2,000,000   4,562,102
dev         1,790       2,169
test        1,812       2,999

Table 1: Number of sentences in the corpus

corpus (Nakazawa et al., 2016) and English-German using the WMT 2016 news task with newstest2016 as the test set (Bojar et al., 2016). Table 1 shows the number of sentences contained in the corpora.

The English and German texts were tokenized with tokenizer.perl4, and the Japanese texts were tokenized with KyTea (Neubig et al., 2011).

As a testbed for our experiments, we used the standard global attention model of Luong et al. (2015) with attention feeding and a bidirectional encoder with one LSTM layer of 512 nodes. We used the DyNet-based (Neubig et al., 2017) NMTKit5, with a vocabulary size of 65536 words and dropout of 30% for all vertical connections. We used the same random numbers as initial parameters for each experiment to reduce variance due to initialization. We used Adam (Kingma and Ba, 2015) (α = 0.001) or SGD (η = 0.1) as the learning algorithm. After every 50,000 training sentences, we processed the test set to record negative log likelihoods. During testing, we set the mini-batch size to 1 in order to calculate the negative log likelihood correctly. We calculated the case-insensitive BLEU score (Papineni et al., 2002) with the multi-bleu.perl6 script.

Table 2 shows the mini-batch creation settings compared in this paper, and we tried all sorting methods discussed in Section 3.3 for each setting. In method (e), we set the number of target words per mini-batch to the average contained in 64 sentences: 2055 words for ASPEC-JE and 1742 words for WMT. For all experiments, we shuffled the processing order of the mini-batches.

4.2 Experimental Results and Analysis

Figures 2, 3, 4, and 5 show the transition of the negative log likelihoods and the BLEU scores according to the number of processed sentences on the ASPEC-JE

4 https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl

5 https://github.com/odashi/nmtkit We used commit 566e9c2.

6 https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl

     mini-batch units       learning algorithm
(a)  64 sentences           Adam
(b)  32 sentences           Adam
(c)  16 sentences           Adam
(d)  8 sentences            Adam
(e)  2055 or 1742 words     Adam
(f)  64 sentences           SGD

Table 2: Compared settings

sorting method    average time (hours)
SHUFFLE           8.08
SRC               6.45
TRG               5.21
SRC_TRG           4.35
TRG_SRC           4.30

Table 3: Average time needed to train on the whole ASPEC-JE corpus using method (a). We used a GTX 1080 GPU for this experiment.

and WMT2016 test sets. Table 3 shows the average time needed to process the whole ASPEC-JE corpus.

The learning curves show very similar tendencies for the two language pairs. Below, we discuss the results in detail for each strategy that we investigated.

4.2.1 Effect of Mini-batch Size

We carried out experiments with mini-batch sizes of 8 to 64 sentences.7

From the experimental results of methods (a), (b), (c), and (d), in the case of using Adam, the mini-batch size affects the training speed and also has an impact on the final accuracy of the model. As we mentioned in Section 3.1, the gradients become more stable as the mini-batch size increases, and this seems to have a positive impact on model accuracy. Thus, we can first note that mini-batch size is a very important hyper-parameter for NMT training that should not be ignored. In our case in particular, the largest mini-batch size that could be loaded into memory was the best for NMT training.

4.2.2 Effect of Mini-batch Unit

Looking at the experimental results of methods (a) and (e), we can see that perplexities drop faster if we use SHUFFLE for method (a) and SRC for method (e), but we could not see any large differences in terms of the training speed and the final

7 We tried experiments with a larger mini-batch size, but we could not run them due to GPU memory limitations.


Figure 2: Training curves on the ASPEC-JE test set. The y- and x-axes show the negative log likelihoods and the number of processed sentences, respectively. The scale of the x-axis for method (f) is different from the others.

Figure 3: Training curves on the WMT2016 test set. Axes are the same as in Figure 2.

accuracy of the model. We had hypothesized that the large variance of the loss would affect the final model accuracy, especially when using a learning algorithm with momentum such as Adam. However, these results indicate that these differences do not significantly affect the training results. We leave a comparison of memory consumption for future research.

4.2.3 Effect of Corpus Sorting Method using Adam

From all experimental results of methods (a), (b), (c), (d), and (e), in the case of using SHUFFLE or SRC, perplexities drop faster and tend to converge to lower perplexities than with the other methods for all mini-batch sizes. We believe the main reason for this is the similarity of the sentences contained in each mini-batch. If the sentence lengths are similar, the features of the sentences may also be similar. We carefully examined the corpus and found that at least this is true for the corpus we used (e.g., shorter sentences tend to contain similar words). In this case, if we sort sentences by their length, sentences that have similar features will be gathered into the same mini-batch, making training less stable than if all sentences


Figure 4: BLEU scores on the ASPEC-JE test set. The y- and x-axes show the BLEU scores and the number of processed sentences, respectively. The scale of the x-axis for method (f) is different from the others.

Figure 5: BLEU scores on the WMT2016 test set. Axes are the same as in Figure 4.

in the mini-batch had different features. This is evidenced by the more jagged lines of the TRG method.

In conclusion, the TRG and TRG_SRC sorting methods, which are used by many NMT toolkits, have a higher overall throughput when just measuring the number of words processed, but for convergence speed and final model accuracy, it seems better to use SHUFFLE or SRC.

Some toolkits shuffle the corpus first and then create mini-batches by sorting a few consecutive sentences. We think that this method may be effective because it combines the advantages of SHUFFLE and the other sorting methods, but an empirical comparison is beyond the scope of this work.

4.2.4 Effect of Corpus Sorting Method using SGD

By comparing the experimental results of methods (a) and (f), we found that in the case of using Adam, the learning curves depend greatly on the sorting method, but in the case of using SGD there was little effect. This is likely because SGD makes less bold updates of rare parameters, improving its overall stability. However, we find that only when using the TRG method are the negative log likelihoods and the BLEU scores unstable.



Figure 6: Training curves on the ASPEC-JE test set using the lamtram toolkit ((a) 64 sentences, Adam; SHUFFLE vs. TRG_SRC). Axes are the same as in Figure 2.

It can be conjectured that this is an effect of gathering similar sentences into a mini-batch, as we mentioned in Section 4.2.3. These results indicate that in the case of SGD it is acceptable to use TRG_SRC, which is the fastest method for processing the whole corpus (see Table 3).

Recently, Wu et al. (2016) proposed a new learning paradigm that uses Adam for the initial training and then switches to SGD after several iterations. If we use this learning algorithm, we may be able to train the model more effectively by using the SHUFFLE or SRC sorting method for Adam, and TRG_SRC for SGD.
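A rough sketch of how such a combination might look in a generic training loop is given below, reusing the create_minibatches and SORT_KEYS sketches from earlier sections. The switch point, the number of epochs, and the model_update placeholder are assumptions for illustration; this is not the setup used by Wu et al. (2016) or in our experiments.

def two_phase_training(corpus, model_update, epochs_adam=3, epochs_sgd=3):
    """Sketch: Adam with SHUFFLE batches for the first phase, then SGD with
    TRG_SRC-sorted batches for the faster second phase.
    model_update(batch, optimizer) is a placeholder for one training step."""
    for _ in range(epochs_adam):
        for batch in create_minibatches(corpus, 64, sort_key=None):      # SHUFFLE
            model_update(batch, optimizer="adam")
    for _ in range(epochs_sgd):
        for batch in create_minibatches(corpus, 64,
                                        sort_key=SORT_KEYS["TRG_SRC"]):  # TRG_SRC
            model_update(batch, optimizer="sgd")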

4.3 Experiments with a Different Toolkit

In the previous experiments, we used only one NMT toolkit, so the results may be dependent on the particular implementation provided therein. To ensure that these results generalize to other toolkits with different default parameters, we conducted experiments with another NMT toolkit.

4.3.1 Experimental Settings

In this section, we used lamtram8 as the NMT toolkit. We carried out Japanese-English translation experiments with the ASPEC-JE corpus. We used Adam (Kingma and Ba, 2015) (α = 0.001) as the learning algorithm and tried two sorting methods: SHUFFLE, which was the best sorting method in the previous experiments, and TRG_SRC, which is the default sorting method used by the

8 https://github.com/neubig/lamtram

lamtram toolkit. Normally, lamtram creates mini-batches based on the number of target words contained in each mini-batch, but we changed it to fix the mini-batch size to 64 sentences because we found that a larger mini-batch size seemed to be better in the previous experiments. The other experimental settings are the same as those described in Section 4.1.

4.3.2 Experimental Results

Figure 6 shows the transition of the negative log likelihoods using lamtram. We can see that the tendency of the training curves is similar to that in Figure 2 (a): the combination with SHUFFLE drops the negative log likelihood faster than the one with TRG_SRC.

From these experiments, we could verify that our experimental results in Section 4 do not rely on the toolkit, and we think the observed behavior will generalize to other toolkits and implementations.

5 Related Work

Recently, Britz et al. (2017) released a paper exploring the hyper-parameters of NMT. This work is similar to ours in that it finds better hyper-parameters by running a large number of experiments and deriving empirical conclusions. Notably, however, that paper fixed the mini-batch size to 128 sentences and did not treat the mini-batch creation strategy as one of the hyper-parameters of the model. Based on our experimental results, we argue that mini-batch creation strategies also have an impact on NMT training, and thus solid recommendations for how to adjust this hyper-parameter are also of merit.

6 Conclusion

In this paper, we analyzed how mini-batch creation strategies affect the training of NMT models for two language pairs. The experimental results suggest that the mini-batch creation strategy is an important hyper-parameter of the training process, and that commonly used sorting strategies are not always optimal. We summarize the results as follows:

• Mini-batch size can affect the final accuracy of the model in addition to the training speed, and a larger mini-batch size seems to be better.

• The mini-batch unit does not noticeably affect the training process, so it is possible to use either the number of sentences or the number of target words.


• We should use the SHUFFLE or SRC sorting method for Adam, while it is sufficient to use TRG_SRC for SGD.

In the future, we plan to run experiments with larger mini-batch sizes and to compare the peak memory used when making mini-batches by the number of sentences versus the number of target words. We are also interested in checking the effects of different mini-batch creation strategies with other language pairs, corpora, and optimization algorithms.

Acknowledgments

This work was done as a part of the joint research project between NTT and Nara Institute of Science and Technology. This research has been supported in part by JSPS KAKENHI Grant Number 16H05873. We thank the anonymous reviewers for their insightful comments.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the 1st Conference on Machine Translation (WMT).

Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. 2017. Massive exploration of neural machine translation architectures. arXiv preprint arXiv:1703.03906.

Fabien Cromieres. 2016. Kyoto-NMT: a neural machine translation implementation in Chainer. In Proceedings of the 26th International Conference on Computational Linguistics (COLING).

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016. ASPEC: Asian scientific paper excerpt corpus. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC).

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.

Graham Neubig, Yosuke Nakata, and Shinsuke Mori. 2011. Pointwise prediction for robust, adaptable Japanese morphological analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL).

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPS).

Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. 2015. Chainer: a next-generation open source framework for deep learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) at the 29th Annual Conference on Neural Information Processing Systems (NIPS).

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
