
Faster Neural Network Training with Data Echoing

Dami Choi 1,2   Alexandre Passos 1   Christopher J. Shallue 1   George E. Dahl 1

1 Google Research, Brain Team   2 Vector Institute and University of Toronto. Correspondence to: Dami Choi <[email protected]>.

Abstract

In the twilight of Moore’s law, GPUs and other specialized hardware accelerators have dramatically sped up neural network training. However, earlier stages of the training pipeline, such as disk I/O and data preprocessing, do not run on accelerators. As accelerators continue to improve, these earlier stages will increasingly become the bottleneck. In this paper, we introduce “data echoing,” which reduces the total computation used by earlier pipeline stages and speeds up training whenever computation upstream from accelerators dominates the training time. Data echoing reuses (or “echoes”) intermediate outputs from earlier pipeline stages in order to reclaim idle capacity. We investigate the behavior of different data echoing algorithms on various workloads, for various amounts of echoing, and for various batch sizes. We find that in all settings, at least one data echoing algorithm can match the baseline’s predictive performance using less upstream computation. We measured a factor of 3.25 decrease in wall-clock time for ResNet-50 on ImageNet when reading training data over a network.

1. Introduction

Over the past decade, dramatic increases in neural network training speed have facilitated dramatic improvements in predictive performance by allowing researchers to train bigger models using larger datasets and to explore new ideas more rapidly. As Moore’s law ends, general purpose processors are no longer rapidly becoming faster, but specialized hardware continues to drive significant speedups by optimizing for a narrower set of operations. For example, GPUs and TPUs1 optimize for highly parallelizable matrix operations, which are core components of neural network training algorithms.

However, neural network training requires more than just the operations that run well on accelerators – a training program may need to read and decompress training data, shuffle it, batch it, and even transform or augment it. These steps exercise multiple system components, including CPUs, disks, network bandwidth, and memory bandwidth. It is impractical to design specialized hardware for all these general operations that involve so many different components. Moreover, these operations are not simply executed once at the start of the training program. Since many of today’s datasets are too large2 to fit into an accelerator’s memory or even the host machine’s main memory, most large-scale neural network training systems stream over the training data, incrementally reading it from disk, pre-processing it in main memory, and copying successive batches of training examples to the accelerator, which runs the training algorithm. Therefore, each training step involves a mixture of operations that do and do not run on accelerators.

1 https://www.blog.google/products/google-cloud/google-cloud-offer-tpus-machine-learning/

There are workloads where the code running on accelerators consumes only a small portion of the overall wall time, and this scenario will only become more common if accelerator improvements continue to outpace improvements in CPUs.3

In order to speed up training in these cases, we must either (1) make the non-accelerator work faster, or (2) reduce the amount of non-accelerator work required to achieve the desired performance. Option (1) is appealing but requires substantial engineering labor or problem-specific techniques (e.g. Ying et al., 2018; Kumar et al., 2018; Yang & Cong, 2019), and adding more workers might be too expensive. Instead, we focus on option (2) and explore techniques for reducing the total amount of work spent reading and preparing inputs in the training pipeline, without affecting the final converged model or its predictive performance.

Figure 1 shows the data processing and training pipeline for ResNet-50 (He et al., 2016) on ImageNet, which is representative of many large-scale computer vision programs. First, the training program reads each image from disk, decompresses it into a 3-dimensional array of values, and pushes it into a shuffle buffer. The next stage of the pipeline samples images at random from the shuffle buffer to approximate shuffling the entire dataset, but with a fixed memory budget.

2 For example, after decoding and standard pre-processing, the ImageNet dataset (Russakovsky et al., 2015) is 700 GB and the Common Crawl dataset is 6.7 TB.

3 E.g. https://devblogs.nvidia.com/gpudirect-storage/


Figure 1: The training pipeline for ResNet-50 on ImageNet, which is representative of many large-scale computer vision programs. The stages are: read and decode → shuffle → apply augmentation → batch → apply SGD update.

Figure 2: The overlapping computation time for pipeline stages upstream and downstream of the data echoing insertion point, if stages are executed in parallel and $t_{\text{upstream}} = 2\,t_{\text{downstream}}$. (a) Without data echoing, downstream computational capacity is idle 50% of the time. (b) Data echoing with echoing factor 2 reclaims downstream computational capacity.

The next stage performs pre-processing and data augmentation — each image is randomly cropped and resized to a 224 × 224 × 3 array, then randomly horizontally flipped, and finally has its colors randomly jittered. These random distortions help improve the generalization of vision models, and while these particular operations are specific to images, almost every deep learning pipeline performs some kind of pre-processing on its input data. All of the input and pre-processing steps are typically executed on the CPU.4
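As an illustration, the following is a minimal sketch of this style of per-image augmentation using TensorFlow image ops, assuming images are already decoded to float tensors in [0, 1]; the specific crop size, flip, and color-jitter parameters below are illustrative placeholders, not the exact values used in our experiments.

```python
import tensorflow as tf

def augment(image):
    # Random crop to the model's input resolution (assumes the decoded
    # image is at least 224x224).
    image = tf.image.random_crop(image, size=[224, 224, 3])
    # Random horizontal flip.
    image = tf.image.random_flip_left_right(image)
    # Random color jitter: brightness and saturation.
    image = tf.image.random_brightness(image, max_delta=32.0 / 255.0)
    image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
    return tf.clip_by_value(image, 0.0, 1.0)
```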

Finally, images and labels are gathered into batches and sent to the accelerator to perform a step of minibatch stochastic gradient descent (SGD). For brevity, we will refer to the operation that updates the model’s parameters for a given batch of training examples as the “SGD step” throughout the paper, even though variants of the basic SGD algorithm are also popular. Our technique applies equally well for any training algorithm that works on successive batches of training examples.

To maximize throughput, the training program is often executed as a pipelined process, so that each stage in Figure 1 operates in parallel with the other stages. Each stage might further employ multiple parallel worker threads or machines. If any of the stages upstream from the SGD step cannot process images at the same rate as the SGD step, the accelerator will be partly idle (see Figure 2a). This can happen for many reasons, including slow transfer from disk or cloud storage, time-consuming pre-processing operations, or inadequate tuning of the number of CPU threads dedicated to each stage of the pipeline. While it can be possible to improve training time by dedicating engineering effort to optimizing the input pipeline, such efforts are often time consuming and can distract from the practitioner’s main goal of improving their model’s predictive performance.

4 Sometimes part of the input processing can be run on the accelerator (e.g. https://developer.nvidia.com/DALI), but the I/O operations cannot.

Instead, we propose data echoing as a simple, cheap, and effective method for reclaiming idle accelerator capacity. Rather than waiting for more data to become available, we propose simply reusing data that is already available. We do this by adding a stage to the pipeline that repeats (or “echoes”) data from the previous stage. Once a practitioner identifies the largest bottleneck in the training pipeline, they can insert an echoing stage after it to reclaim idle accelerator capacity (see Figure 2b).

In this paper, we demonstrate that:

1. data echoing reduces the amount of upstream computation needed to reach a competitive out-of-sample error rate on various datasets and model architectures;

2. data echoing can provide a walltime speedup in practice;

3. data echoing can support a wide range of echoing factors;

4. the effectiveness of data echoing depends on the insertion point in the training pipeline;

5. data echoing can benefit from additional shuffling after echoing, but does not require it; and

6. countering expectations, data echoing reaches the same final error rate as well-tuned baselines.

The exact speedup a practitioner will achieve (if any) depends on their specific training pipeline and hardware. Therefore, instead of focusing on specific walltime improvements in this paper, we focus on whether data echoing can reduce total input pipeline computation by reducing the number of training examples read from disk. This metric is reproducible on any hardware, and, as we demonstrate in Section 3.2, it is an excellent proxy for walltime when the upstream computation is the bottleneck.


1.1. Related work

Data echoing shares similarities with experience replay (Mnih et al., 2015), which samples batches from a buffer containing a reinforcement learning agent’s past experiences to prevent the most recent interactions from dominating the updates. Although both data echoing and experience replay reuse previous data, our implementation of data echoing specifies the number of times to repeat each example, whereas most implementations of experience replay do not control this explicitly. In addition, local SGD algorithms (Zinkevich et al., 2010; Zhang et al., 2016), which perform multiple local model updates before communicating globally, can also be viewed as reusing data to save computation. However, local SGD targets communication overhead between workers and thus is orthogonal to data echoing.

We are aware of two previous papers that describe variants of data echoing. Fischetti et al. (2018) describe a special case of data echoing they call “minibatch persistency” that reuses minibatches for multiple consecutive SGD updates. They run experiments on CIFAR-10, but do not tune hyperparameters for the baseline or for their method. Neither their method nor their baseline reaches competitive test set numbers in their experiments, leaving open the question of whether minibatch persistency has an advantage over a well-tuned baseline. Similarly, Hoffer et al. (2019) describe a special case of data echoing they call “batch augmentation” that repeats examples multiple times within a given batch, but with different augmentations. None of their experiments tune optimization hyperparameters, although their baselines use settings taken from the original papers that introduced each model. Both Fischetti et al. (2018) and Hoffer et al. (2019) primarily motivate their work as methods to improve generalization, only tangentially mentioning the possibility of reclaiming idle computational capacity. We would not expect data echoing to improve generalization for a fixed batch size and number of SGD updates, since that would imply that repeated data is more valuable than fresh data. Our experiments in Section 3 only show that data echoing can achieve better out-of-sample error for the same amount of fresh data.

Another related line of work accelerates neural network training by subsampling examples non-uniformly (Katharopoulos & Fleuret, 2018). These methods are superficially similar to data echoing, in that their goal is also faster training and that this is accomplished by modifying the examples provided to the accelerator, but fundamentally these approaches work best in different scenarios. With example subsampling, each training example is processed, on average, less than once per epoch, and faster convergence is achieved by reducing the overall time the accelerator spends processing examples. With data echoing, the opposite is true: each training example is processed, on average, more than once per epoch, and faster convergence is achieved by using more accelerator computation per example instead of less. Data echoing does not change the optimal decision boundary since expectations of the loss function under the data echoing distribution are the same as expectations under the original distribution. However, when using non-uniform subsampling, care must be taken to ensure that one converges to the correct minimizer of the non-sampled loss. This is extremely relevant in probability matching tasks like language modeling, but should not be ignored even in other models. In principle, nothing prevents a particular training pipeline from adopting both methods, though systems considerations might make it difficult to observe an actual speedup.
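To spell out the claim about expectations, here is a short sketch under the simplifying assumption (ours, added for illustration) that each of the $N$ training examples is repeated exactly $e$ times per epoch:

$$\mathbb{E}_{\text{echo}}\left[\ell(\theta; x)\right] = \frac{1}{eN}\sum_{i=1}^{N} e\,\ell(\theta; x_i) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta; x_i) = \mathbb{E}_{\text{data}}\left[\ell(\theta; x)\right],$$

so the objective being minimized, and hence its minimizer, is unchanged; only the correlation between successive SGD updates changes.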

2. Data Echoing

We implement data echoing by inserting a stage in the training pipeline that repeats (echoes) the outputs of the previous stage. In TensorFlow’s (Abadi et al., 2016) tf.data library, an echoing stage is as simple as

dataset.flat_map(lambda t: tf.data.Dataset.from_tensors(t).repeat(e))
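To show where such a stage fits into a full input pipeline, here is a minimal sketch in tf.data, assuming TensorFlow 2.x (tf.data.experimental.AUTOTUNE on older releases); `filenames`, `decode`, `augment`, and the buffer, batch, and echoing sizes are placeholders for illustration, and elements are assumed to be dicts so the `flat_map` lambda receives a single argument.

```python
import tensorflow as tf

def echo(dataset, e):
    # Data echoing stage: repeat each element of `dataset` e times.
    return dataset.flat_map(
        lambda t: tf.data.Dataset.from_tensors(t).repeat(e))

# Example echoing after augmentation, followed by a shuffle buffer so that
# repeated examples are spread across nearby batches.
dataset = tf.data.TFRecordDataset(filenames)               # upstream: read from disk
dataset = dataset.map(decode, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
dataset = echo(dataset, e=2)                               # echoing factor e = 2
dataset = dataset.shuffle(buffer_size=100_000)             # shuffle after echoing
dataset = dataset.batch(1024)
dataset = dataset.prefetch(tf.data.AUTOTUNE)               # overlap with SGD steps

# Moving the echo() call before the augment map would give example echoing
# before augmentation; moving it after batch() would give batch echoing.
```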

where e is the data echoing factor, the number of times each data item is repeated. In some cases, we also shuffle the outputs of the echoing stage, but this can require additional memory. If the overhead of repeating data is negligible and the stages on either side of echoing are executed in parallel (e.g. Chien et al., 2018), then the average time for data echoing to complete one upstream step and e downstream steps is

$\max\left\{t_{\text{upstream}},\; e \cdot t_{\text{downstream}}\right\}$,    (1)

where $t_{\text{upstream}}$ is the time taken by all stages upstream of echoing, $t_{\text{downstream}}$ is the time taken by all stages downstream of echoing, and e is the echoing factor. Non-integral echoing factors can be achieved in expectation by probabilistically repeating data items. We assume that $t_{\text{upstream}} \geq t_{\text{downstream}}$ throughout the paper, since this is the primary motivation for using data echoing. If we denote the ratio of upstream-to-downstream processing time by $R = t_{\text{upstream}}/t_{\text{downstream}}$, then the time to complete one upstream step and e downstream steps is constant for all echoing factors less than or equal to R. In other words, if $e \leq R$, the additional downstream steps per upstream step are “free” because they utilize idle downstream capacity.
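A small sketch of this timing model (a hypothetical helper, not part of our training code) makes the “free” region explicit:

```python
def time_per_upstream_step(t_upstream, t_downstream, e):
    """Average time for one upstream step and e downstream steps,
    assuming the stages on either side of echoing run in parallel (Eq. 1)."""
    return max(t_upstream, e * t_downstream)

# With R = t_upstream / t_downstream = 4, every echoing factor e <= 4 costs
# the same wall time per upstream step as e = 1; only e > R is slower.
for e in (1, 2, 4, 8):
    print(e, time_per_upstream_step(4.0, 1.0, e))  # 4.0, 4.0, 4.0, 8.0
```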

Data echoing aims to decrease training time by reducing the number of upstream steps required to achieve a target predictive performance. When using data echoing, each upstream step is used for e (instead of 1) downstream SGD updates. If the required number of SGD updates with data echoing is the same as without, then training time will decrease by a factor of e.


However, since repeated data might be less valuable than completely fresh data, data echoing might require more downstream SGD updates to reach the desired predictive performance, and so the speedup factor might be less than e. We investigate the effect of data echoing on training time in Section 3.
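To make this bookkeeping explicit (in our own notation, introduced here for illustration): let $N_{\text{baseline}}$ and $N_{\text{echo}}$ denote the number of SGD updates needed to reach the target without and with echoing. When the input pipeline is the bottleneck ($e \le R$), wall time is roughly proportional to the number of upstream steps, so

$$\text{speedup} \approx \frac{N_{\text{baseline}}}{N_{\text{echo}}/e} = e \cdot \frac{N_{\text{baseline}}}{N_{\text{echo}}},$$

which equals e when repeated data is as useful as fresh data ($N_{\text{echo}} = N_{\text{baseline}}$) and is smaller otherwise.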

Given that every operation in the training pipeline takes some time to execute, the amount of idle downstream time that data echoing can exploit is greatest if the echoing stage is applied just before the SGD update. For the ResNet-50 training pipeline in Figure 1, this would result in the same minibatch of training examples being used multiple times per epoch. However, we might prefer to insert data echoing earlier in the pipeline if it provides a more favorable trade-off between the number of upstream steps and downstream steps. In particular, the following factors influence the behavior of data echoing at different insertion points:

Echoing before or after batching: Echoing before batching means data is repeated and shuffled at the example level instead of the batch level. This increases the likelihood that nearby batches will be different, at the expense of potentially duplicating examples within a batch. Whether diversification across batches or within batches is more important is an empirical question that we address in Section 3. We call the class of algorithms that echo before batching example echoing and the class of algorithms that echo after batching batch echoing.
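In tf.data terms, the distinction is simply whether the echoing stage sits before or after `batch`; a sketch, reusing the hypothetical `echo` helper from above, with `dataset` holding individual examples and `batch_size` and `buffer_size` as placeholders:

```python
# Example echoing: repeat individual examples, shuffle, then batch.
# Nearby batches are more likely to differ, but a single batch may
# contain duplicate examples.
example_echoed = echo(dataset, e=2).shuffle(buffer_size).batch(batch_size)

# Batch echoing: batch first, then repeat whole batches. Each repeated
# batch is identical to the one it echoes.
batch_echoed = echo(dataset.batch(batch_size), e=2)
```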

Echoing before or after augmentation: Echoing before data augmentation allows repeated data to be transformed differently, potentially making repeated data more akin to fresh data. Methods like dropout that add noise during the SGD update can similarly make repeated data appear different (Hoffer et al., 2019), even in the absence of augmentation or when echoing after augmentation.

The behavior of data echoing is also influenced by the amount of shuffling (if any) performed after the echoing stage. Adding a shuffle buffer increases the likelihood that nearby SGD updates use different data, which is likely beneficial. The larger the buffer size, the more repeated data are shuffled, and the closer the training procedure approximates a program that loads the entire training set in memory before sampling data at random. However, since we are focused on large-scale workloads, we assume that we can only afford a buffer size that is a relatively small fraction of the (augmented) dataset size.

3. Experiments

We evaluated data echoing on two language modeling tasks, two image classification tasks, and one object detection task. For language modeling, we trained the Transformer model (Vaswani et al., 2017) on the LM1B (Chelba et al., 2014) and Common Crawl5 datasets. For image classification, we trained ResNet-32 (He et al., 2016) on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009), and ResNet-50 on the ImageNet dataset (Russakovsky et al., 2015). For object detection, we trained the Single Shot Detector (SSD, Liu et al., 2016) on the COCO dataset (Lin et al., 2014).

The primary question we investigated was whether data echoing could provide a training speedup. We measured training cost as the number of “fresh” training examples6 required to reach a target out-of-sample metric value. The number of fresh examples is proportional to the number of upstream steps in the training pipeline, and therefore proportional to wall time if the echoing factor is less than or equal to the ratio of upstream-to-downstream processing time, R (see Section 2). We did not assume or measure the value of R in most of our experiments, since R depends on the implementation and sometimes on irregular factors like network congestion. Not all of our tasks satisfied R ≥ 1 in all experiments. Instead, we designed most of our experiments to investigate whether data echoing could reduce the number of fresh examples needed across various tasks, since this measurement is implementation independent. We confirm that the number of fresh examples is a proxy for walltime in Section 3.2.

For each workload, we ran an initial set of experiments without data echoing and tuned the hyperparameters to achieve the best out-of-sample performance within a practical computational budget.7 We selected the target metric value to be slightly worse than the best observed in the initial experiments to ensure it could be reached reliably. We verified that small changes to our targets did not affect our conclusions. Table 1 summarizes the workloads and target metric values we used in our experiments.

We trained the SSD model using SGD with momentum (Polyak, 1964; Rumelhart et al., 1986) and the Transformer and ResNet models using Nesterov momentum (Nesterov, 1983; Sutskever et al., 2013). We used a constant learning rate for Transformer, and we used learning rate schedules for ResNet (linear decay) and SSD (linear warmup for 3.5 epochs followed by piecewise exponential decay). We preprocessed the text datasets identically to Shallue et al. (2018). We augmented the image datasets at training time by resizing each image, taking a random crop, and randomly horizontally reflecting the cropped images. We randomly distorted the image colors for ImageNet and COCO. Unless otherwise specified, we used a batch size of 1024 for Transformer and ResNet-50, 128 for ResNet-32, and 256 for SSD.

5 http://commoncrawl.org/2017/07/june-2017-crawl-archive-now-available/

6 Each time a training example is read from disk, it counts as a fresh example.

7 20k steps for LM1B, 60k for Common Crawl, 110k for ImageNet, 150k for CIFAR-10, and 30k for COCO.


Table 1: Summary of the tasks used in our experiments.

Model        Dataset(s)           Task                  Evaluation metric   Target
Transformer  LM1B, Common Crawl   Language modeling     Cross entropy       3.9
ResNet-32    CIFAR-10             Image classification  Accuracy            91%
ResNet-50    ImageNet             Image classification  Accuracy            75%
SSD          COCO                 Object detection      mAP                 0.24

We used batch normalization (Ioffe & Szegedy, 2015) for ResNet-50 and SSD with virtual batch sizes (Hoffer et al., 2017) of 32 and 128, respectively.

In each experiment, we independently tuned the learning rate, momentum, and, where applicable, the parameters governing the learning rate schedule.8 We manually chose the search spaces based on our initial experiments, and we verified after each experiment that the optimal hyperparameter values were away from the search space boundaries. We used quasi-random search (Bousquet et al., 2017) to tune the hyperparameters with fixed budgets of non-divergent9 trials (100 for Transformer and ResNet-32, and 50 for the more expensive ResNet-50 and SSD models). We then chose the trial that reached the target metric value using the fewest number of fresh examples. We repeated this hyperparameter search 5 times for each search space. All figures in this section show the mean number of fresh examples required over these 5 searches, with the minimum and maximum shown as error bars. Appendix A shows the average training curve over these 5 searches for every experiment in this section.

The baseline training program for all of our workloads is shown in Figure 1. Our experiments evaluated the effects of adding data echoing to various points in the training pipeline. We considered three variants of data echoing: example echoing before augmentation, example echoing after augmentation, and batch echoing. For the example echoing variants, we omitted the baseline’s “shuffle examples” buffer and inserted a shuffle buffer after the echoing stage with the same size as the baseline’s buffer. For batch echoing, we kept the baseline’s shuffle buffer and repeated batches without shuffling after the “batch examples” stage. Therefore, our training pipeline always had one shuffle buffer with the same size in all cases, so all data echoing variants used the same amount of memory as the baseline.

8 For ResNet, we tuned the number of steps over which the learning rate would linearly decay by a factor of 100 (for CIFAR-10) or 1,000 (for ImageNet). For SSD, we tuned the two epoch numbers at which the learning rate decayed by a factor of 10.

9 We discarded trials that diverged, which typically occurred when the learning rate was too high.

We used buffer sizes of 10^6 for LM1B and Common Crawl, 10^4 for CIFAR-10, 10^5 for ImageNet, and 10^4 for COCO. We explored the effects of increasing the buffer sizes in Section 3.5.

3.1. Data echoing can reduce the number of fresh examples required for training

Figure 3 shows the effect of data echoing with echoing factor 2 for all workloads in Table 1. In all but one case, data echoing requires strictly fewer fresh examples than the baseline to reach the target out-of-sample performance. The sole exception (batch echoing on ResNet-50) requires about the same number of fresh examples as the baseline – data echoing provides no benefit, but does not harm training either. The earlier echoing is inserted in the pipeline, the fewer fresh examples are needed: example echoing requires fewer fresh examples than batch echoing, and echoing before data augmentation requires fewer fresh examples than echoing after. We did not observe any negative interaction between data echoing and batch normalization for ResNet-50 or SSD.

3.2. Data echoing can reduce training time

If the training time is dominated by upstream operations like reading and pre-processing input data, data echoing provides a walltime speedup proportional to the reduction in the number of fresh examples needed for training. To confirm this, we tested data echoing in a training pipeline dominated by input latency from streaming training data from cloud storage, which is a realistic scenario for many large-scale production workloads. For ResNet-50 on ImageNet, the ratio of upstream-to-downstream time in this training pipeline was approximately R ≈ 6, although the exact number fluctuated with the network transfer rate.

The purpose of this experiment is to show that the number of fresh examples is indeed a good proxy for walltime in a training pipeline dominated by upstream operations.

We ran experiments with ResNet-50 on ImageNet with echoing factors e between 1 and 5. Since the bottleneck was transferring data from cloud storage, we inserted data echoing early in the training pipeline before data augmentation.


Figure 3: Data echoing with echoing factor 2 either reduces or does not change the number of fresh examples needed to reach the target out-of-sample performance. Dashed lines indicate the expected values if repeated examples were as useful as fresh examples. Panels (vertical axis: fresh examples read): (a) Transformer on LM1B; (b) Transformer on Common Crawl; (c) ResNet-32 on CIFAR-10; (d) ResNet-50 on ImageNet; (e) SSD on COCO. Each panel compares the baseline, batch echoing, example echoing after augmentation (if any), and example echoing before augmentation.

We tuned the hyperparameters separately for each echoing factor, and selected the hyperparameters that reached 75.3% validation accuracy the fastest. We then ran the optimal hyperparameters 5 times for each echoing factor and recorded the number of fresh examples and walltime required to reach 75% accuracy. Figure 4 shows that data echoing does indeed provide a walltime speedup proportional to the number of fresh examples read over the network. The reduction in walltime is slightly lower than the fractional reduction in fresh examples because our implementation of data echoing incurs a slight overhead. Nonetheless, data echoing provides a significant speedup for all echoing factors, up to a speedup factor of 3.25 for echoing factor 5.

3.3. Data echoing can be useful up to a reasonable upper bound on the echoing factor

Figure 5 shows the effect of example echoing with echoing factors up to 16 for Transformer on LM1B. For batch size 1024, the maximum useful echoing factor is somewhere between 4 and 8; beyond this value, the number of fresh examples required is larger than for smaller echoing factors. As the echoing factor increases, the number of fresh examples required must eventually exceed the baseline, but even an echoing factor as large as 16 still requires significantly fewer fresh examples than the baseline. For batch size 4096, the maximum useful echoing factor is even larger than 16, suggesting that larger batch sizes can support larger echoing factors than smaller batch sizes.

3.4. Data echoing as batch size increases

With larger batch sizes, batch echoing performs better, but example echoing sometimes requires more shuffling. Figure 6 shows the effect of data echoing with echoing factor 2 for different batch sizes. As the batch size increases, the performance of batch echoing relative to the baseline either stays the same or improves. This makes sense given that repeated batches should approximate fresh batches as the batch size approaches the training set size, and so, in the limit, batch echoing must reduce the required number of fresh examples by the echoing factor. On the other hand, Figure 6 shows that the performance of example echoing relative to the baseline either stays the same or gets worse as the batch size increases. Since the expected fraction of duplicate examples within each batch increases with the batch size, example echoing with larger batches may behave more like a smaller batch size in practice. A smaller batch size may increase the required number of SGD updates (Shallue et al., 2018), which could explain the example echoing results in Figure 6. Increasing the amount of shuffling for repeated examples (at the cost of additional memory) could improve the performance of example echoing at larger batch sizes by reducing the probability of duplicate examples in each batch.


Figure 4: Example echoing before augmentation can reduce training time for ResNet-50 on ImageNet. Dashed lines indicate the expected values if repeated examples were as useful as fresh examples and there was no overhead from echoing. The two panels compare the baseline with echoing factors e = 2, 3, 4, 5 in terms of fresh examples read (left) and walltime in seconds (right).

Figure 5: Example echoing reduces the number of fresh examples needed for Transformer on LM1B for echoing factors up to (at least) 16. Dashed lines indicate the expected values if repeated examples were as useful as fresh examples. Each panel plots fresh examples read against the data echoing factor (1, 2, 4, 8, 16) for the baseline and example echoing: (a) batch size 1024; (b) batch size 4096.

Figure 6: As the batch size increases, the performance of batch echoing relative to the baseline either stays the same or improves, while for example echoing it either stays the same or gets worse. Dashed lines indicate the expected values if repeated examples were as useful as fresh examples. Panels (fresh examples read vs. batch size): (a) Transformer on LM1B (batch sizes 2^6 to 2^12), comparing the baseline, batch echoing, and example echoing; (b) ResNet-50 on ImageNet (batch sizes 2^8 to 2^14), comparing the baseline, batch echoing, example echoing after augmentation, and example echoing before augmentation.

3.5. Data echoing performs better with more shuffling

Figure 7 shows the effect of increasing the shuffle buffer size (at the cost of additional memory) for data echoing with echoing factor 2. While all batch echoing experiments in the previous sections repeated batches without shuffling, the performance of batch echoing improves if repeated batches are shuffled, with more shuffling giving increasingly better performance.


Figure 7: Data echoing performs better with more shuffling. Dashed lines indicate the expected values if repeated examples were as useful as fresh examples. Panels (vertical axis: fresh examples read): (a) Transformer on LM1B with batch echoing and buffer sizes of 1, 10^1, and 10^2 (buffer sizes are in units of number of batches in the batch echoing case); (b) Transformer on LM1B with example echoing and shuffle buffer sizes of 10^5, 10^6, and 10^7.

Figure 8: Individual trials that achieved the best out-of-sample performance during training. Panels: (a) Transformer on LM1B, validation cross entropy vs. fresh examples read, comparing the baseline, batch echoing, and example echoing; (b) ResNet-50 on ImageNet, validation accuracy vs. fresh examples read, comparing the baseline, batch echoing, example echoing after augmentation, and example echoing before augmentation.

Similarly, the performance of example echoing improves with increasing shuffle buffer size, even though it does not help the baseline. This is because more shuffling reduces the probability of duplicate examples within each batch, as discussed in Section 3.4.

3.6. Data echoing does not harm predictive performance

Although one might be concerned that reusing data could harm final predictive performance, we did not observe any case where data echoing with a reasonable echoing factor failed to reach our target metric value. To further demonstrate that data echoing does not degrade solution quality, we ran experiments with Transformer on LM1B and ResNet-50 on ImageNet to find the best achievable performance within a fixed budget of fresh examples, both with and without data echoing. We picked the fresh-examples budgets so that the baseline models would achieve at least our target metric values from Table 1. We used an echoing factor of 4 for all data echoing experiments. We tuned the hyperparameters for the baseline and for all data echoing variants using 500 trials for Transformer and 100 trials for ResNet-50. Figure 8 shows the trials that reached the best out-of-sample performance at any point during training for each experiment. All data echoing variants achieved at least the same performance as the baseline for both tasks.

4. Conclusion

Data echoing is a simple strategy for increasing hardware utilization when the training pipeline has a bottleneck in one of the upstream stages. Although a priori one might worry that SGD updates with repeated data would be useless or even harmful, for every workload we considered, at least one variant of data echoing reduced the total number of examples we needed to read from disk. This was true even for Transformer on Common Crawl, a dataset so large that we do not even train for a full epoch. In this case, data echoing reached the target predictive performance while seeing only a subset of the examples seen by the baseline. Echoing after augmentation was still effective at reducing the total number of examples read from disk, making it appealing for image datasets that employ expensive data augmentation that runs on the CPU.


If reading input data is the bottleneck, then echoing before augmentation will typically provide the greatest speedup. We measured a factor of 3.25 decrease in walltime for ResNet-50 on ImageNet when reading training data over a network.

Data echoing is an effective alternative to optimizing the training pipeline or adding additional workers to perform upstream data processing, which may not always be possible or desirable. Although the exact speedup depends on the model architecture, dataset, batch size, and how well repeated data are shuffled, setting the echoing factor to the ratio of upstream-to-downstream processing time maximizes the potential speedup and worked well in our experiments, even for large ratios. As improvements in specialized accelerators like GPUs and TPUs continue to outpace general purpose computation, we expect data echoing and similar strategies to become increasingly important parts of the neural network training toolkit.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.

Bousquet, O., Gelly, S., Kurach, K., Teytaud, O., and Vincent, D. Critical hyper-parameters: No random, no cry. arXiv preprint arXiv:1706.03200, 2017.

Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., and Robinson, T. One billion word benchmark for measuring progress in statistical language modeling. In Conference of the International Speech Communication Association, 2014.

Chien, S. W., Markidis, S., Sishtla, C. P., Santos, L., Herman, P., Narasimhamurthy, S., and Laure, E. Characterizing deep-learning I/O workloads in TensorFlow. arXiv preprint arXiv:1810.03035, 2018.

Fischetti, M., Mandatelli, I., and Salvagnin, D. Faster SGD training by minibatch persistency. arXiv preprint arXiv:1806.07353, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pp. 1731–1741, 2017.

Hoffer, E., Ben-Nun, T., Hubara, I., Giladi, N., Hoefler, T., and Soudry, D. Augment your batch: better training with larger batches. arXiv preprint arXiv:1901.09335, 2019.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.

Katharopoulos, A. and Fleuret, F. Not all samples are created equal: Deep learning with importance sampling. arXiv preprint arXiv:1803.00942, 2018.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Kumar, S., Sreedhar, D., Saxena, V., Sabharwal, Y., and Verma, A. Efficient training of convolutional neural nets on large distributed systems. In 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp. 392–401, 2018.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740–755. Springer, 2014.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. SSD: Single shot multibox detector. In European Conference on Computer Vision, pp. 21–37. Springer, 2016.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). In Doklady AN USSR, volume 269, pp. 543–547, 1983.

Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.


Shallue, C. J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., and Dahl, G. E. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600, 2018.

Sutskever, I., Martens, J., Dahl, G. E., and Hinton, G. E. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139–1147, 2013.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Yang, C.-C. and Cong, G. Accelerating data loading in deep neural network training. arXiv preprint arXiv:1910.01196, 2019.

Ying, C., Kumar, S., Chen, D., Wang, T., and Cheng, Y. Image classification at supercomputer scale. arXiv preprint arXiv:1811.06992, 2018.

Zhang, J., De Sa, C., Mitliagkas, I., and Ré, C. Parallel SGD: When does averaging help? In International Conference on Machine Learning Workshop on Optimization in Machine Learning, 2016.

Zinkevich, M., Weimer, M., Li, L., and Smola, A. J. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 2595–2603, 2010.


A. Best training curves for all experiments

Figure 9: The best training curves for experiments in Section 3.1. Solid lines represent the mean over the best trial from 5 independent hyperparameter searches, and shaded regions represent one standard deviation. The dashed line represents the target metric value. Panels: (a) Transformer on LM1B (validation cross entropy); (b) Transformer on Common Crawl (validation cross entropy); (c) ResNet-32 on CIFAR-10 (validation accuracy); (d) ResNet-50 on ImageNet (validation accuracy); (e) SSD on COCO (AP). Each panel compares the baseline, batch echoing, example echoing after augmentation (if any), and example echoing before augmentation as a function of fresh examples read.


Figure 10: The best training curves for experiments in Section 3.2 (ResNet-50 on ImageNet, validation accuracy vs. fresh examples read), comparing the baseline with echoing factors e = 2, 3, 4, 5. Solid lines represent the mean over the best trial from 5 independent hyperparameter searches, and shaded regions represent one standard deviation. The dashed line represents the target metric value.

Figure 11: The best training curves for experiments in Section 3.3 (Transformer on LM1B, validation cross entropy vs. fresh examples read), comparing the baseline with echoing factors e = 2, 4, 8, 16. Panels: (a) batch size 1024; (b) batch size 4096. Solid lines represent the mean over the best trial from 5 independent hyperparameter searches, and shaded regions represent one standard deviation. The dashed line represents the target metric value.


Figure 12: The best training curves for experiments with Transformer on LM1B in Section 3.4 (validation cross entropy vs. fresh examples read), comparing the baseline, batch echoing, and example echoing. Panels: (a) batch size 64; (b) batch size 256; (c) batch size 4096. Solid lines represent the mean over the best trial from 5 independent hyperparameter searches, and shaded regions represent one standard deviation. The dashed line represents the target metric value.


Figure 13: The best training curves for experiments with ResNet-50 on ImageNet in Section 3.4 (validation accuracy vs. fresh examples read), comparing the baseline, batch echoing, example echoing after augmentation (if any), and example echoing before augmentation. Panels: (a) batch size 256; (b) batch size 4096; (c) batch size 16384. Solid lines represent the mean over the best trial from 5 independent hyperparameter searches, and shaded regions represent one standard deviation. The dashed line represents the target metric value.

Figure 14: The best training curves for experiments with batch echoing in Section 3.5 (Transformer on LM1B, validation cross entropy vs. fresh examples read), comparing the baseline with buffer sizes of 1, 10^1, and 10^2 batches. Solid lines represent the mean over the best trial from 5 independent hyperparameter searches, and shaded regions represent one standard deviation. The dashed line represents the target metric value.


Figure 15: The best training curves for experiments with example echoing in Section 3.5 (Transformer on LM1B, validation cross entropy vs. fresh examples read), comparing the baseline and example echoing. Panels: (a) shuffle buffer size 10^5; (b) shuffle buffer size 10^7. Solid lines represent the mean over the best trial from 5 independent hyperparameter searches, and shaded regions represent one standard deviation. The dashed line represents the target metric value.