
ZeRO: Memory Optimization Towards Training A Trillion Parameter Models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
{samyamr, jerasley, olruwase, yuxhe}@microsoft.com

Abstract

Training large DL models with billions and potentially trillions of parameters is challenging. Existing solutions exhibit fundamental limitations in obtaining both memory and scaling (computation/communication) efficiency together. Data parallelism does not help reduce the memory footprint per device: a model with 1.5 billion parameters or more runs out of memory. Model parallelism hardly scales efficiently beyond the multiple devices of a single node due to fine-grained computation and expensive communication.

We develop a novel solution, the Zero Redundancy Optimizer (ZeRO), to optimize memory, achieving both memory efficiency and scaling efficiency. Unlike basic data parallelism, where memory states are replicated across data-parallel processes, ZeRO partitions the model states instead, to scale the model size linearly with the number of devices. Furthermore, it retains scaling efficiency via computation and communication rescheduling and by reducing the model parallelism degree required to run large models. Our analysis of memory requirements and communication volume demonstrates that ZeRO has the potential to scale beyond 1 trillion parameters using today's hardware (e.g., 1024 GPUs, 64 DGX-2 nodes).

To meet near-term scaling goals and serve as a demonstration of ZeRO's capability, we implemented the stage-1 optimizations of ZeRO (out of the 3 stages described in the paper) and tested this ZeRO-OS version. ZeRO-OS reduces memory and boosts model size by 4x compared with the state of the art, scaling up to 100B parameters.

Moving forward, we will work on unlocking the stage-2 optimizations, with up to 8x memory savings per device, and ultimately the stage-3 optimizations, reducing memory linearly with respect to the number of devices and potentially scaling to models of arbitrary size. We are excited to transform very large models from impossible to train to feasible and efficient to train!

1 Extended Introduction

Background & Challenges Deep Learning (DL) models are becoming larger, and the increasein model size offers significant accuracy gain. In the area of Natural Language Processing(NLP), transformers have paved the way for large models like Bert-large (0.3B) [1], and GPT-2(1.5B) [2]. As model size continues to grow, we experience the challenges of training them -they can no longer fit within the memory of a single device, e.g., GPU or TPU, and simplyadding more devices will not help scale training.

Data parallelism and model parallelism are two well-known approaches to parallelize training across devices. Data parallelism does not reduce the model's memory footprint per device, so it cannot address the out-of-memory issues faced when training large models.


To train large models that do not fit into a single device, multiple model parallelism techniques have emerged that split a model into multiple partitions and execute the partitions concurrently across devices. Pipeline parallelism, as in GPipe [3] and PipeDream [4], and tensor slicing [5], as in Megatron [6] and Mesh-TensorFlow [7], are two well-known model parallelism techniques. For example, Megatron, a state-of-the-art model parallelism implementation from NVIDIA, demonstrates that it can scale up to 20B parameter models by leveraging 16-way model parallelism within a DGX-2 node (consisting of 16 GPUs).

Scaling the model parallelism degree beyond a single node is, however, challenging. In the example of Megatron, although one can trivially get to a trillion parameter model by using 1 trillion / 20 billion = 50 nodes and 800-way model parallelism, the overly fine-grained computation, large amount of communication, and limited inter-node bandwidth will simply kill the performance and render the system unusable (as also indicated by the Megatron paper [6]). More specifically, the bisection bandwidth inside a DGX-2 node is about 2.4 TB/s, while across two DGX-2 nodes it drops to a meager 100 GB/s. Due to this drop in bandwidth, scaling a large model past a single node via model parallelism suffers major performance degradation. We tested a 40B parameter model using Megatron across nodes and observed about 5 Tflops per V100 GPU (less than 5% of hardware peak).

So, how can we train these large models more efficiently? To answer this question, we first analyze the memory consumption of existing systems during model training. We find that the majority of the memory is occupied by i) the optimizer states (such as the momentum and variances in Adam), ii) gradients, and iii) parameters, which we refer to as the OGP states of a model (or model states in short). Furthermore, we observe that data parallelism has good compute/communication efficiency but poor memory efficiency, while model parallelism can have poor compute/communication efficiency. More specifically, data parallelism replicates the entire model states across all data-parallel processes, resulting in redundant memory consumption; model parallelism partitions these states to obtain high memory efficiency, but often results in computation that is too fine-grained and communication that is too expensive to scale efficiently. Furthermore, both of these approaches statically maintain all the model states required over the entire training process, even though not all model states are needed at all times during training.

Our Solution We develop ZeRO, the Zero Redundancy Optimizer, to conquer the limitations of data parallelism and model parallelism while achieving the merits of both. ZeRO removes the memory redundancies across data-parallel processes by partitioning the OGP model states across data-parallel processes instead of replicating them, and it retains compute/communication efficiency by retaining the computational granularity and communication volume of data parallelism through a dynamic communication schedule during training. We call this ZeRO-powered data parallelism. It reduces the per-device memory footprint of a model linearly with the increase in data parallelism degree while keeping the communication volume close to that of default data parallelism. ZeRO-powered data parallelism can fit models of arbitrary size, as long as there are a sufficient number of devices to share the model states. For example, our memory analysis shows that ZeRO can fit a trillion parameter model on 1024 GPUs with data parallelism degree Nd = 1024 (with more details in Section 5.5).

Since ZeRO eliminates the memory inefficiency of data parallelism, it is natural to ask: Do we still need model parallelism, and when? How does ZeRO work with model parallelism? With ZeRO, model parallelism becomes a less attractive option for the sole purpose of fitting large models. ZeRO-powered data parallelism is at least as effective at reducing the per-device memory footprint as model parallelism, and sometimes more effective when model parallelism cannot divide the model evenly. It also has comparable or better scaling efficiency.


Furthermore, data parallelism is easy to use and widely applicable across different workloads, while model parallelism approaches today often require work from model developers to revise their model and from system developers to work out distributed operators, and existing work like Megatron only supports a limited set of operators and models. That being said, there are still cases where we want to leverage model parallelism: when the aggregated batch size using data parallelism alone is too big to have good convergence (see Footnote 1). In those cases, one can combine ZeRO with model parallelism to fit the model with an acceptable aggregated batch size.

We show that ZeRO can be combined with any model parallelism approach, which could lead to a maximum theoretical memory reduction of Nd x Nm times on each device with a data parallelism degree of Nd and a model parallelism degree of Nm. This could allow us to fit a trillion parameter model on 1024 GPUs with 16-way model parallelism (within each DGX-2 node) and 64-way data parallelism across nodes, and run it efficiently!

Implementation & Evaluation ZeRO has three main optimization stages: partitioning(1) optimizer states, (2) gradients and (3) parameters which bring memory savings of 4, 8 andNd times respectively for a popular training optimizer like ADAM, when compared using dataparallelism only or combination of data and model parallelism. For instance, state-of-the-artdata + model parallelism implementation, Megatron, supports about 20B parameter modeltraining on a cluster of DGX-2 nodes, while these three stages of optimization will enabletraining 80B, 160B, and arbitrary model sizes (Nd × 20B), respectively.

As the first step of the evaluation, we implemented optimizer state partitioning (Stage 1, Pos) on top of PyTorch and call this version ZeRO-OS. We demonstrate its capability to support 80-100B parameter models, which we hope will address the near-term growth in model size. The experimental results with ZeRO-OS show that:

• Without any model parallelism, ZeRO-OS-powered data parallelism enables training models with up to 6B parameters on V100 GPUs (32GB of device memory), while existing systems (e.g., PyTorch Distributed Data Parallel) run out of memory with 1.5B parameter models. ZeRO-OS gives a 4x memory saving / model size boost. Model developers can run up to 6B parameter models without worrying about model parallelism.

• Combined with model parallelism, ZeRO-OS runs 80B parameter models efficiently, while existing systems like Megatron alone cannot scale efficiently beyond 20B parameters, as shown in Figure 1. ZeRO-OS also fits 100B parameter models without requiring inter-node model parallelism.

• The improved memory efficiency of ZeRO-OS also helps run models more efficiently (e.g., by allowing a smaller model parallelism degree and a higher batch size): we demonstrate up to 6x throughput gains for GPT-like models with sizes spanning 1.5B to 100B.

We will share the code repository for ZeRO-OS soon. Over time, we will release the full implementation with the remaining two stages of ZeRO to support training models with trillions of parameters. With ZeRO, we aim to transform large models from infeasible-to-train to feasible- and efficient-to-train.

2 Background

DL training iteratively processes data samples in steps, and each step consists of a forward propagation, followed by a backward propagation, which is followed by weight updates.

Footnote 1: Prior work [8] shows that a very large batch size can slow down convergence. For a given model and data, there is a critical batch size, beyond which increasing the batch size further slows down convergence. A detailed discussion of this topic is beyond the scope of this paper.


Figure 1: Comparing ZeRO-OS with Megatron on runnable model sizes and throughput. All experiments are conducted using the smallest model parallelism degrees possible. For ZeRO-OS, the model parallelism degree always fits within a single node, while for Megatron, models larger than 20B require model parallelism across nodes. Further details on the experimental setup and configurations are in Section 9.

To scale DL training across GPUs, each step must be parallelized due to the sequential dependency across steps. Data parallelism and model parallelism are the two most prominent ways to parallelize a DL training step across multiple GPUs.

2.1 Data Parallel Training

For a model that fits in GPU memory for training, data parallelism (DP) is used to scale training to multiple GPUs. In data parallelism, model parameters are replicated on each GPU process during initialization. At each step, a mini-batch is divided evenly across all the data parallel processes, such that each process executes the forward and backward propagation on a different subset of data samples.

At the end of the backward propagation, the gradients are averaged across all processes using an all-reduce communication collective, which creates a copy of the averaged gradients on each data parallel process. Each process then uses the averaged gradients to compute and apply the weight updates to its copy of the model parameters.

Note that in data parallelism, the OGP states (weight parameters, averaged gradients, and the optimizer states) are replicated across all the data parallel processes, incurring massive redundancy.
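A minimal sketch of the gradient-averaging step described above, assuming a torch.distributed process group (e.g., NCCL via torchrun) has already been initialized and that `model` is an identical replica on every rank; this is an illustration of standard data parallelism, not ZeRO's code.

import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """All-reduce each gradient and divide by the number of data-parallel processes."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)

# Per training step: loss.backward(); average_gradients(model); optimizer.step()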

2.2 Model Parallel Training

While model parallelism (MP) may serve other purposes, such as reducing the aggregate batch size, it is primarily used to reduce the memory footprint per GPU.


Model parallelism divides a model into multiple partitions and performs the corresponding forward and backward computation across devices, i.e., the computation for each input data sample is partitioned and executed on multiple devices. Model parallelism can work in two ways: i) via vertical partitioning [3, 4], and ii) via horizontal partitioning [5, 6]. Vertical partitioning splits the model by partitioning its layers across multiple GPUs, and uses pipelined micro-batches to avoid GPU stalls. Horizontal partitioning, on the other hand, splits individual DL layers across multiple GPUs by parallelizing the underlying computation, such as matrix multiplication or element-wise operations.

While each of these techniques has its pros and cons in terms of training throughput and batch size implications, the underlying effect on memory consumption is the same: both reduce the memory footprint per GPU by partitioning the activations, model parameters, gradients, and optimizer states across multiple GPUs.

3 Where Did All the Memory Go?

We take a step back to examine the memory consumption of current training systems and make some observations. For example, a 1.5B parameter GPT-2 model has weights (or parameters) that take 3GB of memory in 16-bit training, yet it cannot be trained on a single GPU with 32GB of memory using TensorFlow or PyTorch. One may wonder where all the memory goes.

During model training, most of the memory is consumed by one of the following three: i) activations, ii) OGP states, i.e., the tensors comprising the optimizer states, parameter gradients, and the parameters themselves, and iii) temporary buffers. It is possible to trivially eliminate nearly all the memory required by the activations at the expense of a 33% recomputation overhead [9]. Frameworks such as Megatron already implement this recomputation. Here we look at the memory consumed by the latter two of the three.

3.1 Optimizer States, Gradients and Parameters

With recomputation enabled, the majority of the GPU memory is consumed by OGP tensors during training. Consider, for instance, ADAM, one of the most popular optimizers for DL training. ADAM requires storing two optimizer states, i) the time-averaged momentum and ii) the variance of the gradients, to compute the updates. Therefore, to train a model with ADAM, there has to be enough memory to hold a copy of both the momentum and the variance of the gradients. In addition, there needs to be enough memory to store the gradients and the weights themselves. Of these three types of parameter-related tensors, the optimizer states usually consume the most memory, especially when mixed-precision training is applied.

Mixed-Precision Training The state-of-the-art approach to train large models on the current generation of NVIDIA GPUs is mixed-precision (fp16/32) training, where parameters and activations are stored as fp16, enabling the use of the high-throughput tensor core units on these GPUs. During mixed-precision training, both the forward and backward propagation are performed using fp16 weights and activations. However, to effectively compute and apply the updates at the end of the backward propagation, the mixed-precision optimizer keeps an fp32 copy of the parameters as well as an fp32 copy of all the other optimizer states.

Let's take ADAM as a concrete example. Mixed-precision training of a model with Ψ parameters using ADAM requires enough memory to hold an fp16 copy of the parameters and of the gradients, with memory requirements of 2Ψ and 2Ψ bytes, respectively. In addition, it needs to hold the optimizer states: an fp32 copy of the parameters, momentum, and variance, with memory requirements of 4Ψ, 4Ψ, and 4Ψ bytes, respectively.


Let us use K to denote the memory multiplier of the optimizer states, i.e., the additional memory required to store them is KΨ bytes. Mixed-precision ADAM has K = 12. In total, this results in a memory requirement of 2Ψ + 2Ψ + KΨ = 16Ψ bytes. For a model such as GPT-2 with 1.5 billion parameters, this leads to a memory requirement of at least 24 GB, which is significantly higher than the meager 3 GB of memory required to hold the fp16 parameters alone.
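A small calculator, as a sketch, for the mixed-precision ADAM memory model just described: 2Ψ bytes of fp16 parameters, 2Ψ bytes of fp16 gradients, and KΨ bytes of optimizer states with K = 12 (fp32 parameters, momentum, and variance at 4 bytes each).

def model_state_bytes(num_params: float, k: int = 12) -> float:
    fp16_params = 2 * num_params       # 2 bytes per parameter
    fp16_grads = 2 * num_params        # 2 bytes per gradient
    optimizer_states = k * num_params  # fp32 params + momentum + variance = 12 bytes total
    return fp16_params + fp16_grads + optimizer_states

# GPT-2 with 1.5B parameters: 16 * 1.5e9 bytes = 24 GB, matching the estimate above.
print(model_state_bytes(1.5e9) / 1e9)  # -> 24.0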

3.2 Temporary Buffers

For large models, the temporary buffers used for storing intermediate results consume a non-trivial amount of memory. Operations such as gradient all-reduce or gradient norm computation tend to fuse all the gradients into a single flattened buffer before applying the operation, in an effort to improve throughput. For example, the bandwidth of all-reduce across GPUs improves with larger message sizes. While the gradients themselves are usually stored as fp16 tensors, the fused buffer can be an fp32 tensor depending on the operation. When the model is large, these temporary buffer sizes are non-trivial. For example, for a GPT-2 model with 1.5B parameters, a flattened fp32 buffer would require 6GB of memory.

4 ZeRO: Insights and Overview

Now that we know where all the memory goes, how can we reduce the memory footprint without sacrificing efficiency? Please note that efficiency is key here: without this constraint, trivial solutions like moving all the parameter states to CPU memory, or increasing the model parallelism degree indefinitely, can reduce the memory footprint.

Our solution for reducing the memory footprint without sacrificing efficiency is based on three key insights:

• Data parallelism has better scaling efficiency than model parallelism because model parallelism reduces the granularity of the computation while also increasing the communication overhead. Beyond a certain point, lower computational granularity reduces the efficiency per GPU, while the increased communication overhead hinders the scalability across GPUs, especially when crossing node boundaries. On the contrary, data parallelism has both higher computational granularity and lower communication volume, allowing for much higher efficiency.

• Data parallelism is highly memory inefficient as model states are stored redundantly across all data-parallel processes. On the contrary, model parallelism partitions the model states to obtain memory efficiency.

• Both data and model parallelism keep all the model states needed over the entire training process, but not everything is required all the time. For example, the parameters corresponding to each layer are only needed during the forward and backward propagation of that layer.

Based on these insights, we develop ZeRO, the Zero Redundancy Optimizer, which overcomes the limitations of both data and model parallelism while still retaining their merits. ZeRO removes the memory redundancies across data-parallel processes by partitioning the OGP model states instead of replicating them (Section 5), and it retains compute/communication efficiency by dynamically and smartly moving the model states around during training (Section 6). By doing so, ZeRO ultimately powers data parallelism to reduce the per-device memory footprint of a model linearly with increasing data parallelism degree while maintaining the communication volume close to that of default data parallelism.


Figure 2: Comparing the per-device memory consumption of parameters, gradients, and optimizer states, with various ZeRO memory optimizations turned on. In the memory consumption formulas, Ψ denotes the model size (number of parameters), K denotes the memory multiplier of the optimizer states, and Nd denotes the data parallelism degree. In the example, we assume a model size of Ψ = 7.5B and data parallelism of Nd = 64 with K = 12, based on mixed-precision training with the Adam optimizer.

ZeRO powers data parallelism to reduce memory consumption and boost the runnable model size. Additionally, it can be used together with any model parallelism approach. When a training job uses both model parallelism and ZeRO-powered data parallelism, its memory saving (per device) is the product of the saving from model parallelism and that from ZeRO (Section 7).

5 ZeRO: Memory Optimization

While the existing data-parallel approach replicates the OGP model states on each device and introduces significant memory overhead, ZeRO eliminates this memory redundancy by partitioning them (optimizer states, gradients, and parameters) across data parallel processes. Figure 2 quantifies and visualizes the memory requirement with and without ZeRO. For a model with Ψ parameters and a data parallelism degree of Nd, the figure shows the memory footprint after eliminating (1) optimizer state, (2) gradient, and (3) parameter redundancies, respectively. We refer to these as the three techniques or optimization phases of ZeRO: Pos, Pg, and Pp, which we elaborate on in Sections 5.1, 5.2, and 5.3. Furthermore, ZeRO replaces large buffers proportional to model size with performance-efficient constant-size buffers (Section 5.4) to address the memory overhead of large temporary buffers. These memory optimizations require careful computation and communication rescheduling and mapping across data parallel processes. Next, we go through each of these techniques in detail.


5.1 Pos: Optimizer State Partitioning

For a data parallelism degree of Nd, we group the optimizer states into Nd equal partitions, such that the i-th data parallel process only updates the optimizer states corresponding to the i-th partition. Thus, each data parallel process only needs to store and update 1/Nd of the total optimizer state. Note that here we only eliminate the optimizer state redundancy, not the parameter redundancy. Since each process only has 1/Nd of the overall optimizer state, it can only update 1/Nd of the parameters. We perform an all-gather across the data parallel processes at the end of each training step to obtain the fully updated parameters on all data parallel processes. We refer to this optimization as Pos.

Memory Savings: As shown in Figure 2, the memory consumption after optimizer state partitioning reduces from 4Ψ + KΨ to 4Ψ + KΨ/Nd. In the concrete example depicted in Figure 2, a 7.5B parameter model requires 31.4 GB of memory using Pos with 64-way data parallelism (Nd = 64), while requiring 120 GB with standard data parallelism. Furthermore, when Nd is large, the memory requirement reduces from 4Ψ + 12Ψ = 16Ψ bytes to 4Ψ + 12Ψ/Nd ≈ 4Ψ bytes, leading to a 4x memory reduction.
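A minimal sketch of Pos under stated assumptions: every rank holds the full parameters and gradients, but the Adam optimizer (and thus its momentum/variance state) is constructed only over this rank's 1/Nd slice of the parameters, and the updated slices are all-gathered at the end of each step. The helper names below are illustrative, not ZeRO's actual API.

import torch
import torch.distributed as dist

def build_partitioned_optimizer(params, rank, world_size, lr=1e-3):
    params = list(params)
    # Round-robin partitioning for simplicity; a real implementation balances by element count.
    partitions = [params[i::world_size] for i in range(world_size)]
    # Adam momentum/variance are allocated only for this rank's 1/Nd slice of the parameters.
    optimizer = torch.optim.Adam(partitions[rank], lr=lr)
    return optimizer, partitions

def step_and_sync(optimizer, partitions):
    optimizer.step()  # updates only this rank's parameter partition
    # The all-gather of updated parameters: rank `src` owns partitions[src] and broadcasts
    # each tensor it updated, so every data-parallel replica ends the step in sync.
    for src, partition in enumerate(partitions):
        for p in partition:
            dist.broadcast(p.data, src=src)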

5.2 Pg: Gradient Partitioning

As each data parallel process only updates its corresponding parameter partition, it only needs the reduced gradients for the corresponding parameters. Therefore, as the gradient of each layer becomes available during the backward propagation, we only reduce it on the data parallel process responsible for updating the corresponding parameters. After the reduction, the gradients are no longer needed and their memory can be released. This reduces the memory footprint required to hold the gradients from 2Ψ bytes to 2Ψ/Nd.

Effectively this is a reduce-scatter operation, where gradients corresponding to different parameters are reduced onto different processes. To make this more efficient in practice, we use a bucketization strategy, where we bucketize all the gradients corresponding to a particular partition and perform the reduction on the entire bucket at once. This is similar in spirit to how NVIDIA's AMP [10] optimizer bucketizes the all-reduce gradient computation to overlap communication and computation. However, in this case we perform a reduction instead of an all-reduce at the appropriate partition boundaries, to reduce the memory footprint in addition to overlapping computation and communication. We refer to this optimization as Pg.

Memory Savings: By removing both gradient and optimizer state redundancy, we reduce the memory footprint further, down to 2Ψ + 14Ψ/Nd ≈ 2Ψ. In the concrete example depicted in Figure 2, a 7.5B parameter model requires only 16.6 GB of memory using Pos+g with 64-way data parallelism (Nd = 64), while requiring 120 GB with standard data parallelism. When Nd is large, the memory requirement reduces from 2Ψ + 14Ψ = 16Ψ bytes to 2Ψ + 14Ψ/Nd ≈ 2Ψ bytes, leading to an 8x memory reduction.
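A minimal sketch of the bucketized gradient reduction described above, under stated assumptions: the parameters are already partitioned across ranks (partitions[r] is owned by rank r), each bucket is fused and reduced onto its owner with dist.reduce, and non-owners release their gradient copies afterward. The names are illustrative, not ZeRO's actual API.

import torch
import torch.distributed as dist

def reduce_gradient_buckets(partitions, rank):
    """partitions[r] is the list of parameters whose optimizer state rank r owns."""
    world_size = dist.get_world_size()
    for owner, bucket in enumerate(partitions):
        grads = [p.grad for p in bucket if p.grad is not None]
        if not grads:
            continue
        flat = torch.cat([g.reshape(-1) for g in grads])    # fuse the bucket
        dist.reduce(flat, dst=owner, op=dist.ReduceOp.SUM)  # reduce onto the owner only
        if rank == owner:
            flat.div_(world_size)                           # average on the owning rank
            offset = 0
            for g in grads:                                 # scatter back into grad tensors
                g.copy_(flat[offset:offset + g.numel()].view_as(g))
                offset += g.numel()
        else:
            for p in bucket:
                p.grad = None                               # release the gradient memory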

5.3 Pp: Parameter Partitioning

Just as with the optimizer states and the gradients, each process only stores the parameters corresponding to its partition. When the parameters outside of its partition are required for forward or backward propagation, they are received from the appropriate data parallel process through a broadcast. While this may seem to incur significant communication overhead at first glance, we show that this approach only increases the total communication volume to 1.5x that of a baseline data parallel system, while enabling memory reduction proportional to Nd.
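A minimal sketch of Pp during the forward pass, under stated assumptions: each rank persistently stores only the layers it owns, other ranks receive a layer's parameters by broadcast just before using them and discard them right after, and every rank knows the parameter shapes. The names (owner, shapes) are illustrative, not ZeRO's actual API.

import torch
import torch.distributed as dist

def forward_with_partitioned_params(layers, owner, shapes, rank, x):
    """owner[i] is the rank that stores layer i; shapes[i] lists its parameter shapes.
    Parameters and activations are assumed to share x's dtype and device."""
    for i, layer in enumerate(layers):
        if rank != owner[i]:
            # Materialize empty tensors of the right shape to receive the weights.
            for p, shape in zip(layer.parameters(), shapes[i]):
                p.data = torch.empty(shape, device=x.device, dtype=x.dtype)
        for p in layer.parameters():
            dist.broadcast(p.data, src=owner[i])   # owner sends, everyone else receives
        x = layer(x)                               # forward with the temporary weights
        if rank != owner[i]:
            for p in layer.parameters():
                p.data = torch.empty(0, device=x.device, dtype=x.dtype)  # discard
    return x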


DP    |  7.5B Model (GB)          |  128B Model (GB)          |  1T Model (GB)
      |  Pos    Pos+g   Pos+g+p   |  Pos    Pos+g   Pos+g+p   |  Pos     Pos+g    Pos+g+p
1     |  120    120     120       |  2048   2048    2048      |  16000   16000    16000
4     |  52.5   41.3    30        |  896    704     512       |  7000    5500     4000
16    |  35.6   21.6    7.5       |  608    368     128       |  4750    2875     1000
64    |  31.4   16.6    1.88      |  536    284     32        |  4187    2218     250
256   |  30.4   15.4    0.47      |  518    263     8         |  4046    2054     62.5
1024  |  30.1   15.1    0.12      |  513    257     2         |  4011    2013     15.6

Table 1: Analysis of the per-device memory consumption of the different ZeRO optimizations as a function of the data parallelism degree (DP). Highlighted in bold are the combinations of the minimum data parallelism degree and ZeRO optimization for which the model can fit into a cluster of 32GB V100 GPUs.

Memory Savings: With parameter partitioning, we reduce the memory consumption of a Ψ-parameter model from 16Ψ to 16Ψ/Nd. In the concrete example depicted in Figure 2, a 7.5B parameter model requires 1.9 GB of memory using Pos+g+p with 64-way data parallelism (Nd = 64), while requiring 120 GB with standard data parallelism. This has a profound implication: ZeRO powers data parallelism to fit models of arbitrary size, as long as there are a sufficient number of devices to share the model states.

5.4 CB: Constant Size Buffers

Besides partitioning, ZeRO performs another type of optimization to manage the buffers holding temporary data. During training, the computational efficiency of certain operations can be highly dependent on the input size, with larger inputs achieving higher efficiency. For example, a large all-reduce operation achieves much higher bandwidth than a smaller one. Hence, to get better efficiency, high performance libraries such as NVIDIA Apex or Megatron fuse all the parameters into a single buffer before applying these operations. However, the memory overhead of the fused buffers is proportional to the model size and can become inhibiting. For example, for a 3B parameter model, a 32-bit fused buffer would require 12 GB of memory. To address this issue, we simply use a performance-efficient, constant-size fused buffer when the model becomes too large. By doing so, the buffer size does not depend on the model size, and by keeping the buffer size large enough, we can still achieve good efficiency.
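A minimal sketch of the constant-size buffer idea, under stated assumptions: instead of fusing all gradients into one buffer proportional to the model size, a fixed-size staging buffer is reused and the flattened tensors are all-reduced chunk by chunk. The function name and bucket size are illustrative, not ZeRO's actual implementation.

import torch
import torch.distributed as dist

def chunked_all_reduce(tensors, bucket_elems=200_000_000):
    """Assumes all tensors share a dtype/device and each tensor fits within the buffer."""
    buffer = torch.empty(bucket_elems, dtype=tensors[0].dtype, device=tensors[0].device)
    bucket = []

    def flush(bucket):
        n = sum(t.numel() for t in bucket)
        flat, offset = buffer[:n], 0
        for t in bucket:                      # pack the bucket into the reused buffer
            flat[offset:offset + t.numel()].copy_(t.reshape(-1))
            offset += t.numel()
        dist.all_reduce(flat, op=dist.ReduceOp.SUM)
        offset = 0
        for t in bucket:                      # unpack the reduced values
            t.copy_(flat[offset:offset + t.numel()].view_as(t))
            offset += t.numel()

    for t in tensors:
        if bucket and sum(x.numel() for x in bucket) + t.numel() > bucket_elems:
            flush(bucket)
            bucket = []
        bucket.append(t)
    if bucket:
        flush(bucket)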

5.5 Summary of Memory Optimization

This completes the discussion of the memory optimizations of ZeRO. The three phases of partitioning, Pos, Pos+g, and Pos+g+p, reduce the memory consumption of each data parallel process on model states by up to 4x, 8x, and Nd, respectively. CB allows us to use constant-size buffers instead of growing the buffer size linearly with model size. Table 1 gives more examples of the memory impact of enabling the three stages of partitioning optimizations with respect to varying data parallelism degrees. Without ZeRO, the memory consumption is equal to the first row in the table, regardless of the data parallelism degree. Note that, with Nd = 64, ZeRO can train models with up to 7.5B, 14B, and 128B parameters using Pos, Pos+g, and Pos+g+p, respectively. When Nd = 1024, ZeRO with all of its optimizations enabled (Pos+g+p) could train models with 1 trillion parameters! Or, potentially, models of arbitrary size! Without ZeRO, the largest model that data parallelism alone can run has fewer than 1.5 billion parameters.
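The per-device memory formulas behind Table 1, written out as a sketch under the K = 12 mixed-precision ADAM model: standard data parallelism keeps 16Ψ bytes per device, while Pos, Pos+g, and Pos+g+p keep 4Ψ + 12Ψ/Nd, 2Ψ + 14Ψ/Nd, and 16Ψ/Nd bytes, respectively.

def per_device_gb(psi, nd):
    baseline = 16 * psi
    p_os = 4 * psi + 12 * psi / nd
    p_os_g = 2 * psi + 14 * psi / nd
    p_os_g_p = 16 * psi / nd
    return tuple(round(x / 1e9, 2) for x in (baseline, p_os, p_os_g, p_os_g_p))

# A 7.5B parameter model with Nd = 64 (compare with the corresponding row of Table 1):
print(per_device_gb(7.5e9, 64))  # -> (120.0, 31.41, 16.64, 1.88)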


6 ZeRO: Communication Analysis

As ZeRO boosts model size by removing memory redundancy, it is only natural to ask if we are trading communication volume for memory efficiency. In other words, what is the communication volume of the ZeRO-powered data-parallel approach compared to a baseline data-parallel approach?

The answer is in two parts: i) ZeRO incurs no additional communication when using Pos and Pg, while enabling up to 8x memory reduction, resulting in a total memory footprint of 2Ψ + 14Ψ/Nd; ii) ZeRO incurs at most 1.5x communication when using Pp in addition to Pos and Pg, while further reducing the memory footprint by a factor of Nd, resulting in a total memory footprint of 16Ψ/Nd. Here we present the analysis demonstrating these results. We begin with a brief overview of the communication volume of standard data-parallel training.

6.1 Data Parallel Communication Volume

During data parallel training, gradients across all data parallel processes are averaged at the end of the backward propagation before computing the updates for the next step. The averaging is performed using an all-reduce communication collective. For large model sizes, the all-reduce communication is entirely communication-bandwidth bound, and therefore we limit our analysis to the total communication volume sent to and from each data parallel process.

State-of-the-art implementations of all-reduce use a two-step approach, where the first step is a reduce-scatter operation, which reduces different parts of the data on different processes. The next step is an all-gather operation, where each process gathers the reduced data from all the processes. The result of these two steps is an all-reduce.

Both reduce-scatter and all-gather are implemented using a pipelined approach that results in a total data movement of Ψ elements (for data with Ψ elements) for each of the reduce-scatter and all-gather steps. Therefore, the standard data parallel approach incurs 2Ψ data movement during each training step.
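A sketch of the two-step all-reduce just described, assuming a torch.distributed process group is initialized and the flattened gradient length is divisible by the world size: a reduce-scatter leaves each rank with one reduced shard (about Ψ elements moved), and an all-gather redistributes the shards (another Ψ), for 2Ψ total, equivalent to a single all-reduce.

import torch
import torch.distributed as dist

def all_reduce_via_reduce_scatter(flat_grad: torch.Tensor) -> None:
    world_size = dist.get_world_size()
    shards = list(flat_grad.chunk(world_size))                    # one equal shard per rank
    my_shard = torch.empty_like(shards[0])
    dist.reduce_scatter(my_shard, shards, op=dist.ReduceOp.SUM)   # step 1: ~Ψ moved
    gathered = [torch.empty_like(my_shard) for _ in range(world_size)]
    dist.all_gather(gathered, my_shard)                           # step 2: ~Ψ moved
    flat_grad.copy_(torch.cat(gathered))                          # same result as one all-reduce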

6.2 ZeRO Communication Volume

Communication Volume with Pos+g: With gradient partitioning, each process only stores the portion of the gradients that is required to update its corresponding parameter partition. As such, instead of an all-reduce, ZeRO only requires a reduce-scatter operation on the gradients, with a communication volume of Ψ as described above. After each process updates the partition of the parameters that it is responsible for, an all-gather is performed to collect all the updated parameters from all the data parallel processes. This also incurs a communication volume of Ψ. So the total communication volume per training step is Ψ + Ψ = 2Ψ, exactly the same as the baseline data parallel approach.

Communication Volume with Pos+g+p: After parameter partitioning, each data parallel process only stores the parameters that it updates. Therefore, during the forward propagation it needs to receive the parameters for all the other partitions. However, this can be pipelined to avoid the memory overhead. Before computing the forward propagation on the part of the model corresponding to a particular partition, the data parallel process responsible for that partition can broadcast the weights to all the data parallel processes. Once the forward propagation for that partition is done, the parameters can be discarded. The total communication volume is thus Ψ x Nd / Nd = Ψ. In other words, we reschedule the parameter all-gather by spreading it across the entire forward propagation, and discard the parameters once they have been used. Note, however, that this all-gather needs to happen once again for the backward propagation, in the reverse order.


                  Max Theoretical Model Size                   Measured Model Size
MP    GPUs     Baseline   Pos      Pos+g     Pos+g+p      Baseline   ZeRO-OS (Pos + CB)
1     64       2B         7.6B     14.4B     128B         1.3B       6.2B
2     128      4B         15.2B    28.8B     256B         2.5B       12.5B
4     256      8B         30.4B    57.6B     0.5T         5B         25B
8     512      16B        60.8B    115.2B    1T           10B        50B
16    1024     32B        121.6B   230.4B    2T           20B        100B

Table 2: An upper bound on the maximum model size obtained through memory analysis (left) and the measured model size when running with ZeRO-OS (right). The results of ZeRO-OS align well with those of Pos, demonstrating that our memory analysis provides realistic upper bounds on model memory requirements. See further discussion in Section 9.4.

The total communication volume is therefore the sum of the communication volumes incurred by these all-gathers and the communication volume incurred by the reduce-scatter of the gradients, for a total of 3Ψ, which is 1.5x that of the baseline. Both gradient and parameter partitioning leverage the insight that not all states of gradients and parameters are needed all the time, optimizing memory by communicating the states judiciously.

6.3 Communication Latency

Here we point out that while our optimizations impact communication latency, their impact on overall performance is likely small. Note that Pg implements a reduce-scatter as a sequential series of reduce operations, and Pp implements an all-gather as a sequential series of broadcast operations. Furthermore, CB can partition a large communication collective into multiple smaller ones to avoid memory overhead. Clearly, these sequential operations increase the communication latency. However, for large models with hundreds of billions of parameters, even with a large enough constant buffer size CB, message sizes are large enough that the communication time is bounded by the communication volume and communication bandwidth, not by latency.

7 ZeRO & Model Parallelism

Since ZeRO solves the memory inefficiency problem of data parallelism, one may wonder: Is model parallelism still needed, and when? Does ZeRO work with model parallelism, and how?

With ZeRO, model parallelism becomes less essential for the sole purpose of reducing the memory footprint and fitting large models. ZeRO-powered data parallelism (Pos+g+p) is at least as effective at reducing the per-device memory footprint as model parallelism, and it can be more effective when model parallelism cannot divide the model evenly. It also has comparable or better scaling efficiency. Furthermore, data parallelism is easy to use and widely applicable across different workloads, while model parallelism approaches today often require work from model developers to revise their model and from system developers to work out distributed operators, and existing work like Megatron only supports a limited set of operators and models.

That being said, there are still cases where model parallelism can be helpful: when the aggregated batch size of using data parallelism alone is too big to have good convergence.


In those cases, model parallelism can be combined with ZeRO-powered data parallelism to fit the model with an acceptable aggregated batch size.

ZeRO (or ZeRO-powered data parallelism) can be combined with any form of model parallelism to train very large models. The total memory saving can be computed as the product of the saving factor from model parallelism and that from ZeRO. More specifically, if model parallelism of degree Nm reduces the per-device model size from Ψ to Y, where Y = Ψ/Nm, the total memory consumption per device is given by (1/Nm) x (4Ψ + 12Ψ/Nd), (1/Nm) x (2Ψ + 14Ψ/Nd), and 16Ψ/(Nd x Nm), using Pos, Pos+g, and Pos+g+p, respectively. As a concrete example, a 32B parameter model with 16-way model parallelism and 64-way data parallelism will consume 8.4 GB, 4.4 GB, and 0.5 GB per device with Pos, Pos+g, and Pos+g+p, respectively, while without ZeRO it would consume at least 32 GB of memory per GPU.
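The combined ZeRO and model-parallelism memory formulas from this section, as a small sketch: psi is the total parameter count, nm the model parallelism degree, and nd the data parallelism degree, under the K = 12 mixed-precision ADAM memory model.

def zero_mp_per_device_gb(psi, nm, nd):
    p_os = (4 * psi + 12 * psi / nd) / nm
    p_os_g = (2 * psi + 14 * psi / nd) / nm
    p_os_g_p = 16 * psi / (nd * nm)
    return tuple(round(x / 1e9, 1) for x in (p_os, p_os_g, p_os_g_p))

# The 32B parameter example above, with Nm = 16 and Nd = 64:
print(zero_mp_per_device_gb(32e9, 16, 64))  # -> (8.4, 4.4, 0.5)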

With this multiplicative memory saving, combining ZeRO with model parallelism enables the current generation of hardware to efficiently run significantly larger models at the same model parallelism degree, without requiring model parallelism to go across node boundaries. Table 2 shows an upper bound on the largest model that can be run for a given degree of model parallelism and data parallelism. In the table, the baseline columns represent the existing approach of using data and model parallelism. Note that with 16-way model parallelism, in theory (without any activation and other memory overheads), the baseline can at most run a 32B parameter model, while ZeRO, with its various memory optimizations turned on, can fit 121 billion, 230 billion, and 2 trillion parameters using Pos, Pos+g, and Pos+g+p, respectively.

8 Step Towards 1 Trillion Parameters

The largest published models today have about 1B to 10B parameters, which are already challenging to train. The road towards trillion scale, 2 to 3 orders of magnitude larger, will be full of hurdles, surprises, and innovations. While we do not claim to know or address all of them, ZeRO addresses one of the most fundamental challenges from a system perspective: the ability to fit a model of this scale on the current generation of hardware while allowing it to train with good system scalability.

A Leap from the State of the Art The largest model that the state-of-the-art framework, Megatron, can train with acceptable throughput is a 16-20B parameter model on a DGX-2H system. Scaling further by extending model parallelism across multiple DGX nodes results in a significant efficiency drop due to limited inter-node bandwidth.

ZeRO boosts the system capability significantly with respect to the efficiently-runnable model size. It enables the current generation of hardware to run significantly larger models without requiring fine-grained model parallelism to go across node boundaries. As demonstrated in Table 1, ZeRO with all optimizations turned on (Pos+g+p) could fit more than 1 trillion parameters on 1024 GPUs using data parallelism only. Alternatively, when combined with model parallelism (as shown in Table 2), ZeRO could fit more than 1 trillion parameters on 1024 GPUs with 16-way model parallelism (within each DGX-2 node) and 64-way data parallelism across nodes. Running a model with a trillion parameters efficiently is no longer impossible!

Compute Power Gap Training a trillion parameter model end-to-end within an acceptable time range, however, could still require a significant amount of compute power, which is lacking in today's AI clusters.

To understand the resource requirement, we present a brief comparison with Bert-Large. Bert-Large can be trained in 67 minutes on a 1024 GPU DGX-2H cluster [11]. A 1 trillion parameter model can easily contain 3000x (1 trillion / 330 million) more computation than a Bert-Large model for a data sample.


Cluster            Node Count    GPUs Per Node    Total GPUs    GPU RAM
Azure NC24r3 v3    8             4                32            16GB
DGX-2H             25            16               400           32GB

Table 3: Hardware environments for evaluating ZeRO-OS.

Even if we assume the same sequence length and total number of samples required to train the model, training a 1T parameter model would take about 140 days, assuming the same hardware and similar computational efficiency. In practice, both the number of data samples and the sequence length are likely to increase with model size. It would require an exa-flop system to train a 1T parameter model in a reasonable time. But when such compute capacity becomes available, we hope ZeRO provides the system technology to run 1T models efficiently.
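The back-of-the-envelope estimate above, written out: scale Bert-Large's reported 67-minute training time by the roughly 3000x per-sample compute ratio (1 trillion / 330 million parameters), holding sequence length, sample count, and hardware efficiency fixed.

bert_large_minutes = 67
compute_ratio = 1e12 / 330e6           # about 3030x more compute per sample
minutes = bert_large_minutes * compute_ratio
print(minutes / (60 * 24))             # about 141 days on the same 1024-GPU cluster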

9 Implementation and Evaluation of ZeRO-OS

While a 1T parameter model requires exa-flop supercomputers, 80B-100B parameter models may be trained in a few weeks on a 1024 GPU system, which already exists (once again assuming the same sequence length and data size, and similar computational efficiency). As such, for the first step of the evaluation, we implemented optimizer state partitioning (Pos) and demonstrate its capability to support such models, which we hope will address the near-term growth in model size. We call this implementation ZeRO-OS. The full implementation of ZeRO and its support for trillion parameter models will come later.

9.1 ZeRO-OS Implementation

We implemented ZeRO-OS for PyTorch as a wrapper class around a torch optimizer. To use ZeRO-OS, torch users can simply swap out their favorite optimizer with the ZeRO-wrapped version of the same optimizer, requiring just a few lines of code change.

ZeRO-OS implements Pos to eliminate optimizer state redundancy and CB to reduce the all-reduce buffer sizes. The all-gather required by Pos can be performed either as an all-gather or as a sequence of broadcast operations. The former has less latency impact, but the latter has less memory overhead (using smaller buffers).

ZeRO-OS currently incurs 1.5x communication volume compared to baseline data parallelism, as we are using all-reduce instead of reduce-scatter. It is straightforward to eliminate the additional communication by switching to reduce-scatter. We observe that when the batch size per device is moderate (e.g., 16), this 1.5x communication volume is not really a bottleneck. It also serves the purpose of emulating the communication volume when we enable all of the Pos+g+p optimizations of ZeRO.

9.2 Evaluation Methodology

Before presenting the experimental results of our evaluation of ZeRO-OS, we give some details of our experimental setup.

Hardware We conducted our experiments in two different hardware environments. The first environment is a cluster consisting of 25 DGX-2 nodes, with 400 GPUs in total. The second environment is an Azure cluster consisting of 8 Azure NC24r3 v3 nodes.


                          Without ZeRO-OS                           With ZeRO-OS
Model size
(largest runnable bsz)    1.2B (bsz=8)       1.5B (OOM at bsz=1)    4B (bsz=16)        5.9B (bsz=4)
Model spec.               layers=40          layers=48              layers=64          layers=72
                          hidden=1536        hidden=1600            hidden=2304        hidden=2592
                          attn-head=16       attn-head=16           attn-head=24       attn-head=32
                          seq-length=1024    seq-length=1024        seq-length=1024    seq-length=1024

Table 4: ZeRO-OS enables training of larger models using data parallelism only. The test was done on 32 V100 GPUs (2 DGX-2 boxes).

Each node contains 4 V100 GPUs, for a total of 32 GPUs. Details of our hardware environments are summarized in Table 3.

Baseline Model Parallelism We use NVIDIA's Megatron as the baseline for model parallelism experiments because it is, to our knowledge, the state of the art. The most recent Megatron results report the ability to scale up to 20B parameter models using 32 DGX-2H nodes (a total of 512 Tesla SXM3 32GB V100 GPUs).

Baseline Data Parallelism We use Megatron's custom version of DistributedDataParallel (DDP) as the data parallelism baseline.

9.3 Scaling to 6B Parameters with Data-Parallelism Alone

In this study we investigate how ZeRO-OS increases the size of models that can be trained using data parallelism alone. This capability is beneficial to users who are unable to use model parallelism efficiently due to restrictions in their software (e.g., complex model structure) and/or hardware (poor interconnect bandwidth) environments. For such users, ZeRO-OS enables scaling to larger models than is possible with existing data parallelism, such as PyTorch's DDP. Specifically, we measure the largest models that can be trained with and without ZeRO-OS using 2 DGX-2H nodes, 32 GPUs in total, with 32GB RAM per GPU. The results of this experiment are presented in Table 4. They show that without ZeRO-OS, only models smaller than 1.5B parameters can be trained using data parallelism alone. In contrast, with ZeRO-OS, up to 6B parameter models, with a batch size of 4, can be trained using data parallelism alone. In summary, ZeRO-OS enables data parallelism to train models that are 4 times larger than what is possible using existing techniques.

9.4 Scaling to 80B - 100B Parameters

In this study, we investigate the largest model sizes that can be trained efficiently by ZeRO-OS. Here we combine ZeRO-powered data parallelism with model parallelism using Megatron. We also compare the results of ZeRO-OS with using Megatron alone. The results of this study are visualized in Figure 1. ZeRO-OS runs 80B parameter models efficiently, while existing systems like Megatron cannot scale efficiently beyond 20B parameters. ZeRO-OS also fits 100B parameter models without requiring inefficient inter-node model parallelism. For models of the same size, ZeRO-OS trains at higher throughput.

Table 5 provides more details of these experiments, including the configurations of the different models and the corresponding model parallelism configurations. Here, we can see that for a given model size, ZeRO-OS is able to fit the model using a lower degree of model parallelism and a higher batch size, which explains the improved throughput.

ZeRO-OS throughput on the 100B parameter run can be improved further. For example, using more GPUs would reduce the memory footprint per GPU, allowing a higher batch size and thus higher per-GPU and total throughput.


Model size           | Model spec.                                                          | ZeRO-OS throughput (config: MP x DP, bsz)         | Megatron throughput (config)   | Megatron across nodes throughput (config)
8B                   | layers=72, hidden=3072, attn-head=24, seq-length=1024                | 38 TFlops/GPU, 15.2 PFlops total (4x100, bsz=24)  | 23 TFlops/GPU (8x50, bsz=8)    | N/A
20B                  | layers=111, hidden=3808, attn-head=32, seq-length=1024               | 27 TFlops/GPU, 10.8 PFlops total (4x100, bsz=8)   | 9 TFlops/GPU (16x25, bsz=4)    | N/A
40B                  | layers=88, hidden=6144, attn-head=32, seq-length=1024                | 29 TFlops/GPU, 11.1 PFlops total (8x48, bsz=8)    | OOM (16x24, bsz=1)             | 4.92 TFlops/GPU (32x12, bsz=4)
60B                  | layers=132, hidden=6144, attn-head=32/64, seq-length=1024            | 20 TFlops/GPU, 7.7 PFlops total (16x24, bsz=8)    | OOM (16x24, bsz=1)             | 6.34 TFlops/GPU (64x6, bsz=4)
80B (see Footnote 2) | layers=173/112, hidden=6144/7680, attn-head=32/80, seq-length=1024   | 17 TFlops/GPU, 6.8 PFlops total (16x25, bsz=8)    | OOM (16x25, bsz=1)             | 4.03 TFlops/GPU (80x5, bsz=4)
100B                 | layers=317, hidden=5120, attn-head=80, seq-length=1024               | 7 TFlops/GPU, 2.8 PFlops total (16x25, bsz=4)     | OOM (16x25, bsz=1)             | 2.49 TFlops/GPU (80x5, bsz=2)

Table 5: Running GPT-like models with 8B to 100B parameters with ZeRO-OS vs Megatron alone on V100 GPUs / DGX-2 nodes. The number of GPUs is equal to the product of the model and data parallelism degrees (ranging between 384 and 400 GPUs in the experiments). Megatron runs out of memory when running models larger than 20B parameters with a model parallelism degree of 16 (within a single DGX-2 node), as shown in the middle column group. It is possible to run larger models with Megatron using inter-node model parallelism, but the performance degrades significantly, too low to be useful in practice, as shown in the last column group of the table.

Model size | ZeRO-OS: most efficient config (MPxDP, largest bsz); throughput (samples/sec) | Megatron: most efficient config; throughput (samples/sec) | ZeRO-OS throughput boost
8B         | 4x100, bsz=24; 210.8                                                          | 8x50, bsz=8; 129.2                                        | 1.63X
20B        | 4x100, bsz=8; 62.8                                                            | 16x25, bsz=4; 21.4                                        | 2.93X
40B        | 8x48, bsz=8; 31.77                                                            | 32x12, bsz=4; 5.38                                        | 5.9X

Table 6: Throughput improvement for billion-parameter models from ZeRO-OS on V100 GPUs / DGX-2 nodes. The number of GPUs is equal to the product of the model and data parallelism degrees (ranging between 384 and 400 GPUs in the experiments).

We are also working on more optimizations to reduce memory consumption further and fit larger batch sizes.

Another benefit of ZeRO-OS over Megatron is that it supports a wider range of configurations in the model architecture space. For example, Megatron requires the hidden dimension and attention head count to be divisible by the model parallelism degree. These constraints become more restrictive as we increase the model parallelism degree to large values such as 64 and 80. From this perspective, ZeRO enables running large models with a small model parallelism degree, giving us flexibility in the architectures we can train. For example, ZeRO-OS can train 80B models with model parallelism of 16 and thus attention head counts of 16 and 32 (multiples of 16), while Megatron requires model parallelism of 80 and thus an attention head count of at least 80, and hidden dimensions that are multiples of 80.

In addition, the largest trainable model we measured is close to our analysis, as shown in Table 2, demonstrating the soundness of the analysis and the implementation. The small gap exists because our upper-bound analysis of the largest trainable model size does not count the memory consumption of buffers and activations, which can be implementation and model specific.


Model size | ZeRO-OS: most efficient config (MPxDP, largest bsz); throughput (samples/sec) | Megatron: most efficient config; throughput (samples/sec) | ZeRO-OS throughput boost
1.5B GPT2  | 1x32, bsz=8; 40.2                                                             | 4x8, bsz=8; 10.7                                          | 3.75X

Table 7: GPT-2 throughput improvement achieved by ZeRO-OS. The test ran on 32 V100 Azure GPUs (16GB memory, 4 GPUs per node, PCIe, 40Gbps IB).

Model size | ZeRO-OS: throughput; GPUs; config (MPxDP, largest bsz) | Megatron: throughput; GPUs; config | ZeRO-OS resource reduction
8B         | 129.9; 240; 4x60, bsz=24                               | 129.2; 400; 8x50, bsz=8            | 1.7X
20B        | 24.72; 128; 4x32, bsz=8                                | 21.38; 400; 16x25, bsz=4           | 3.1X

Table 8: ZeRO-OS reduces resource requirements to achieve the same system throughput.

9.5 Up to 6x System Throughput Improvement

In this study, we measure how ZeRO-OS improves system throughput. On the DGX-2 cluster, as illustrated in Table 6, ZeRO-OS improves system throughput over Megatron by 1.63x for 8B parameter models and 2.93x for 20B parameter models on runs with 400 GPUs, and by nearly 6x for 40B parameter models using 384 GPUs. On the Azure cluster, as shown in Table 7, ZeRO-OS achieves a 3.75x throughput improvement over Megatron alone for the 1.5B parameter GPT-2 model. As discussed earlier, these throughput improvements come from the fact that ZeRO-OS is able to fit these models using a lower degree of model parallelism and larger batch sizes.

9.6 Up to 3x Resource Savings

In this study, we measure the resource savings enabled by the throughput improvements of ZeRO-OS. In particular, we want to find out how many fewer GPUs ZeRO-OS requires to achieve throughput similar to Megatron on the DGX-2 cluster. The results are summarized in Table 8. They show that ZeRO-OS enables resource savings of 1.7x for 8B parameter models and 3.1x for 20B parameter models. These results show that ZeRO-OS could significantly reduce the cost of training large models.

10 Concluding Remarks

We see great promise in the memory optimizations of ZeRO for training very large models with hundreds of billions and even trillions of parameters. What we have implemented in ZeRO-OS is only a limited subset of its features, yet it already demonstrates a significant boost in model size and training efficiency over existing systems and empowers 100B parameter training. We will share our code repository soon, but kindly understand that it is at an early stage. We hope to get feedback and help from the community, working together to unlock the full power of ZeRO and to train DL models that are impossible to train today!

Footnote 2: For 80B parameters, we created different models for 16-way model parallelism vs. 80-way model parallelism due to several implementation constraints in Megatron. For instance, the attention head count and hidden dimension must be divisible by the model parallelism degree. We tried to create an 80B model with 80-way parallelism that performed the best.


Acknowledgement

We thank Junhua Wang for his valuable support and advice. We thank Minjia Zhang and Elton Zheng for their great feedback and help on evaluating the work. We thank Gopi Kumar, Jack Zhang, Jing Zhao, Rangan Majumder, Saksham Singhal, Saurabh Tiwary, and Xia Song for many helpful discussions and suggestions.

References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.

[2] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.

[3] Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. GPipe: Efficient training of giant neural networks using pipeline parallelism. ArXiv, abs/1811.06965, 2018.

[4] Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, and Phillip B. Gibbons. PipeDream: Fast and efficient pipeline parallel DNN training. CoRR, abs/1806.03377, 2018.

[5] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake A. Hechtman. Mesh-TensorFlow: Deep learning for supercomputers. CoRR, abs/1811.02084, 2018.

[6] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2019.

[7] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake A. Hechtman. Mesh-TensorFlow: Deep learning for supercomputers. CoRR, abs/1811.02084, 2018.

[8] Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training. CoRR, abs/1812.06162, 2018.

[9] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. CoRR, abs/1604.06174, 2016.

[10] NVIDIA. Automatic mixed-precision, 2019.

[11] Shar Narasimhan. NVIDIA Clocks World's Fastest BERT Training Time ... https://devblogs.nvidia.com/training-bert-with-gpus/, 2019. [Online; accessed 25-September-2019].
