Taking Advantage of Low Precision to Accelerate Training and Inference Using PyTorch
Presented by: Myle Ott and Sergey Edunov, Facebook AI Research (FAIR)
Talk ID: S9832
Overview
Mixed precision training in PyTorch:
• 3-4x speedups in training wall time
• Reduced memory usage ==> bigger batch sizes
• No architecture changes required
Case study: Neural Machine Translation
• Train models in 30 minutes instead of 1 day+
• Semi-supervised training over much larger datasets
What are Tensor Cores?
• Optimized hardware units for mixed precision matrix-multiply-and-accumulate: D = A * B + C
Slide credit: Nvidia
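As a rough illustration of D = A * B + C, here is a minimal pure-Python sketch. It assumes FP16 inputs for A and B and FP32 accumulation, which is a common description of these units; the slide itself only gives the formula. The helpers `to_fp16`, `to_fp32` and `mma` are hypothetical names, and the rounding is emulated with the standard `struct` module rather than run on real hardware.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma(a, b, c):
    """Emulate D = A * B + C with FP16 inputs (A, B) and an FP32 accumulator."""
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = to_fp32(c[i][j])  # accumulator starts from C, kept in FP32
            for p in range(inner):
                # inputs are rounded to FP16, the running sum is rounded to FP32
                acc = to_fp32(acc + to_fp16(a[i][p]) * to_fp16(b[p][j]))
            d[i][j] = acc
    return d


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
C = [[0.5, 0.5], [0.5, 0.5]]
D = mma(A, B, C)
```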
If only it were this easy…
model.half()
Why not pure FP16?
FP16 has insufficient range/precision for some ops
Better to leave some ops in FP32:
• Large reductions, e.g., norms, softmax, etc.
• Pointwise ops where |f(x)| >> |x|, e.g., exp, pow, log, etc.
Why not pure FP16?
In practice, pure FP16 hurts optimization.
According to Nvidia:
• Sum of FP16 values whose ratio is > 2^11 is just the larger value
• Weight update: if w >> lr*dw then update doesn’t change w
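Both effects can be reproduced without a GPU. The sketch below emulates FP16 rounding with Python's `struct` module (format code `e`); `to_fp16` and `fp16_add` are hypothetical helper names, and the specific numbers are chosen only for illustration.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_add(a, b):
    """Add two numbers with FP16 operands and an FP16 result."""
    return to_fp16(to_fp16(a) + to_fp16(b))


# Ratio 4096 : 1 is 2**12, above 2**11, so the small term is lost.
swamped = fp16_add(4096.0, 1.0)

# Ratio 2048 : 2 is 2**10, below 2**11, so the sum is still exact.
exact = fp16_add(2048.0, 2.0)

# Weight update with w >> lr * dw: the FP16 weight does not move.
w, lr, dw = 1.0, 1e-3, 0.1
w_after = fp16_add(w, -lr * dw)
```

Here `swamped` stays at 4096.0 and `w_after` stays at 1.0, which is the stalled-update problem that FP32 optimization is meant to avoid.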
Why not pure FP16?
Solution: mixed precision training
Optimize in FP32 and use FP16 for almost* everything else
* Some operations should still happen in FP32:
• Large reductions, e.g., norms, softmax, etc.
• Pointwise ops where |f(x)| >> |x|, e.g., exp, pow, log, etc.
Optimizing in FP32
[Diagram, built up over five slides: FP16 Weights → (Forward Pass) → FP16 Loss → (Backprop) → FP16 Gradients → (Copy) → FP32 Master Gradients → (Apply) → FP32 Master Weights → (Copy) → FP16 Weights]
This adds overhead!
It’s only worth it because of the Tensor Cores. Don’t use mixed precision without Tensor Cores!
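The copy / apply / copy cycle from the "Optimizing in FP32" diagram can be sketched in plain Python. This is an emulation under stated assumptions, not the authors' implementation: the gradient is a made-up constant, plain SGD stands in for the optimizer, and `to_fp16`, `to_fp32` and `train` are hypothetical names.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def train(steps, lr, use_master_weights):
    """Run SGD on one weight with a constant, made-up gradient of 0.1."""
    w_fp16 = to_fp16(1.0)
    master_fp32 = to_fp32(1.0)
    for _ in range(steps):
        grad_fp16 = to_fp16(0.1)  # stands in for forward pass + backprop
        if use_master_weights:
            grad_fp32 = to_fp32(grad_fp16)  # copy FP16 gradients to FP32
            master_fp32 = to_fp32(master_fp32 - lr * grad_fp32)  # apply
            w_fp16 = to_fp16(master_fp32)  # copy back for the next forward pass
        else:
            # pure FP16: the update is rounded away at every step
            w_fp16 = to_fp16(w_fp16 - to_fp16(lr * grad_fp16))
    return w_fp16


pure_fp16 = train(steps=100, lr=1e-3, use_master_weights=False)
with_master = train(steps=100, lr=1e-3, use_master_weights=True)
```

In this toy run the pure FP16 weight never leaves 1.0, while the master-weight version ends near 0.99, because the small updates accumulate in FP32 before being rounded to FP16.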
Gradient underflow
• FP16 has a smaller representable range than FP32 (shown in blue)
• In practice gradients are quite small, so there’s a risk of underflow
[Figure: distribution of gradient values along a "Gradients" axis starting at 0]
• Underflow cannot be detected
• But if we scale the loss up by K, by the chain rule of derivatives, gradients will be K times bigger
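A small numeric sketch of why scaling helps, using emulated FP16 rounding. The gradient value 1e-8 and K = 1024 are illustrative assumptions, and `to_fp16` is a hypothetical helper built on the standard `struct` module.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


true_grad = 1e-8  # below the smallest positive FP16 value (about 6e-8)
K = 1024.0        # loss scale; a power of two keeps the scaling itself exact

# Without scaling, the gradient silently becomes zero in FP16.
unscaled_fp16 = to_fp16(true_grad)

# Scaling the loss by K scales the gradient by K (chain rule),
# which moves it into the representable FP16 range.
scaled_fp16 = to_fp16(true_grad * K)

# The scale is removed again in higher precision before the update.
recovered = scaled_fp16 / K
relative_error = abs(recovered - true_grad) / true_grad
```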
Gradient overflow
[Figure: distribution of gradient values along a "Gradients" axis, with Inf marked]
• If overflow is detected, scale the loss down
Avoiding under/overflow by loss scaling
[Diagram, built up over five slides: FP16 Weights → (Forward Pass) → FP16 Loss → (Loss Scaling) → Scaled FP16 Loss → (Backprop) → Scaled FP16 Gradients → (Copy) → Scaled FP32 Gradients → (Remove scale) → FP32 Gradients → (Apply) → FP32 Master Weights → (Copy) → FP16 Weights]
If gradients overflow (inf), throw away the batch
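One update with a fixed loss scale, including the "throw away the batch" rule, might look like the sketch below. It is an emulation under assumptions: gradients are given as plain numbers rather than computed by backprop, Python floats stand in for FP32, plain SGD is the optimizer, and `to_fp16` and `scaled_step` are hypothetical names.

```python
import math
import struct


def to_fp16(x):
    """Round to FP16; values too large for FP16 become +/- infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


def scaled_step(master_weights, true_grads, lr, loss_scale):
    """Return (new_master_weights, skipped) for one loss-scaled SGD step."""
    # Backprop through the scaled loss yields scaled FP16 gradients.
    scaled_fp16 = [to_fp16(g * loss_scale) for g in true_grads]
    # If any gradient overflowed, discard the whole batch.
    if any(math.isinf(g) or math.isnan(g) for g in scaled_fp16):
        return list(master_weights), True
    # Copy to higher precision and remove the scale.
    grads = [g / loss_scale for g in scaled_fp16]
    # Apply the update to the master weights.
    new_weights = [w - lr * g for w, g in zip(master_weights, grads)]
    return new_weights, False


weights = [1.0, 2.0]
grads = [1e-8, 0.5]

ok_weights, ok_skipped = scaled_step(weights, grads, lr=0.1, loss_scale=1024.0)
bad_weights, bad_skipped = scaled_step(weights, grads, lr=0.1, loss_scale=2.0 ** 20)
```

With a scale of 1024 both gradients survive and the step is applied; with a scale of 2^20 the larger gradient overflows, so the step is skipped and the weights are left unchanged.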
How to pick the scaling constant (K)
• Too small and gradients will underflow
• Too big and we’ll waste compute due to overflow
• In practice the optimal scaling constant changes during training
• We can adjust it dynamically!
Dynamic loss scaling
• Every time the gradient overflows (inf), reduce the scaling constant by a factor of 2
• If the gradients haven’t overflowed in the last N updates (~1000), then increase the scaling constant by a factor of 2
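These two rules fit in a few lines. The class below is a minimal sketch, not the apex or fairseq implementation; the class name, the parameter names and the initial scale of 2^15 are assumptions made for illustration.

```python
class DynamicLossScaler:
    """Halve the scale on overflow, double it after a window of clean updates."""

    def __init__(self, init_scale=2.0 ** 15, scale_factor=2.0, scale_window=1000):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window  # N in the slide (~1000)
        self._clean_updates = 0

    def update(self, overflow):
        """Call once per update with whether the gradients overflowed."""
        if overflow:
            self.scale /= self.scale_factor
            self._clean_updates = 0
        else:
            self._clean_updates += 1
            if self._clean_updates >= self.scale_window:
                self.scale *= self.scale_factor
                self._clean_updates = 0


# Small window so the behaviour is easy to follow by hand.
scaler = DynamicLossScaler(init_scale=8.0, scale_window=3)
history = []
for overflow in [True, False, False, False, False, False, True]:
    scaler.update(overflow)
    history.append(scaler.scale)
```

The training loop would multiply the loss by `scaler.scale` before backprop, check the gradients for inf, skip the update on overflow, and then call `update`.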
So far…
Tensor Cores make FP16 ops 4-9x faster
Mixed precision training:
• Forward/backward in FP16
• Optimize in FP32
• Requires maintaining two copies of the model weights
• Dynamically scale the loss to avoid gradient under/overflow
One more thing about FP16…
For maximal safety, perform ops that sum many values in FP32
• e.g., normalization layers, softmax, L1 or L2 norm, etc.
• This includes most Loss layers, e.g., CrossEntropyLoss
General advice: compute your loss in FP32 too
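To see why long sums are risky in FP16, the sketch below adds 4096 ones with a running FP16 accumulator and again with an FP32 accumulator. It is a naive sequential emulation with hypothetical helper names; real GPU kernels may reduce in a different order or use wider accumulators internally.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def running_sum(values, round_fn):
    """Sum FP16 inputs sequentially, rounding the accumulator with round_fn."""
    acc = 0.0
    for v in values:
        acc = round_fn(acc + to_fp16(v))
    return acc


values = [1.0] * 4096
sum_in_fp16 = running_sum(values, to_fp16)  # stalls once acc reaches 2048
sum_in_fp32 = running_sum(values, to_fp32)  # exact
```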
The full picture
[Diagram, built up over seven slides: FP16 Weights → (Forward Pass) → FP32 Loss → (Loss Scaling) → Scaled FP32 Loss → (Backprop) → Scaled FP16 Gradients → (Copy) → Scaled FP32 Gradients → (Remove scale) → FP32 Gradients → (Apply) → FP32 Master Weights → (Copy) → FP16 Weights]
If gradients overflow (inf), throw away the batch
Distributed gradient accumulation / all-reduce: option 1 (slower), option 2 (faster)
In PyTorch
To automate the recipe, start with Nvidia’s apex.amp library:
import torch
from apex import amp
optim = torch.optim.Adam(…)
model, optim = amp.initialize(model, optim, opt_level="O1")
(…)
with amp.scale_loss(loss, optim) as scaled_loss:
    scaled_loss.backward()
optim.step()
Making it even faster
apex.amp supports different optimization levels
opt_level="O1" is conservative and keeps many ops in FP32
opt_level="O2" is faster, but may require manually converting some ops to FP32 to achieve good results
More details at: https://nvidia.github.io/apex/
Making it even faster
A useful pattern:
x = torch.nn.functional.softmax(x, dtype=torch.float32).type_as(x)
When x is FP16 (i.e., a torch.HalfTensor):
• Computes the softmax in FP32 and casts back to FP16
When x is FP32 (i.e., a torch.FloatTensor):
• No impact on speed or memory
One more thing…
Must have GPU with Tensor Cores (Volta+), CUDA 9.1 or newer
Additionally:
• Batch size should be a multiple of 8
• M, N and K for matmul should be multiples of 8
• Dictionaries/embed layers should be padded to be a multiple of 8
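Rounding a size up to the next multiple of 8 is a one-line calculation. The helper name `pad_to_multiple` and the vocabulary size 32001 are hypothetical, chosen only to show the arithmetic.

```python
def pad_to_multiple(size, multiple=8):
    """Return the smallest value >= size that is divisible by `multiple`."""
    return ((size + multiple - 1) // multiple) * multiple


vocab_size = 32001                         # hypothetical dictionary size
padded_vocab = pad_to_multiple(vocab_size)
extra_rows = padded_vocab - vocab_size     # unused padding entries to add
```

The same helper can be applied to batch sizes and hidden dimensions so that the matmul dimensions M, N and K all end up as multiples of 8.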
Summary
Mixed precision training gives:
• Tensor Cores make FP16 ops 4-9x faster
• No architecture changes required
• Use Nvidia's apex library
Tradeoffs:
• Some extra bookkeeping required (mostly handled by apex)
• Best perf requires manual fixes for softmax, layernorm, etc.
Scaling Machine Translation
Teng Li, Ailing Zhang, Shubho Sengupta, Myle Ott, Sergey Edunov, David Grangier, Michael Auli
Sequence to Sequence Learning
Bonjour à tous ! Hello everybody!
• Sequence to sequence mapping
• Input = sequence, output = sequence
• Structured prediction problem
Sequence to Sequence Learning
• machine translation
• text summarization
• writing stories
• question generation
• dialogue, chatbots
• paraphrasing
• ...
Why do we need to scale?
• Large benchmark ~2.4 billion words + much more unlabeled data
• Training time: CNNs up to 38 days on 8 M40 GPUs (Gehring et al., 2017)
• Train many models
• Support Multilingual training
Reducing training time
Time in minutes to train "Transformer" translation model on Volta V100 GPUs (WMT En-De)
[Bar chart, built up over six slides; y-axis: Train Time (Minutes), 0 to 1600]
• Original: 1,429
• +16-bit: 495 (3x faster (wall time) using the same hardware, model architecture and bsz!)
• + cumul: 447
• +2x lr: 311
• 16 nodes: 37
• +overlap: 32
[Diagram for + cumul: timelines for GPU 1 to GPU 4 showing Forward/Backward, Gradient sync and Idle time, comparing "Sync After 1" with "Sync After 2"]
[Diagram for +overlap: timelines for GPU 1 to GPU 4 showing Forward, Backward, Gradient Sync and Idle time, comparing "Sync After Backward" with "Overlap Sync with Backward"]
Implemented in PyTorch's DistributedDataParallel
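The bar values from the training-time chart can be turned into cumulative speedups over the original run. The mapping of values to labels follows the order in which the bars appear across the slides; the dictionary and variable names are illustrative.

```python
# Train time in minutes, read from the chart (Transformer, WMT En-De, V100).
train_minutes = {
    "Original": 1429,
    "+16-bit": 495,
    "+ cumul": 447,
    "+2x lr": 311,
    "16 nodes": 37,
    "+overlap": 32,
}

baseline = train_minutes["Original"]

# Speedup of each configuration relative to the original run.
speedups = {name: baseline / minutes for name, minutes in train_minutes.items()}

for name, factor in speedups.items():
    print(f"{name:>9}: {train_minutes[name]:>5} min, {factor:5.1f}x")
```

The 16-bit step alone gives roughly 2.9x, which matches the "3x faster" note, and the final configuration is about 45x faster than the original, although that figure also reflects the move to 16 nodes.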
Semi-supervised machine translation
Myle Ott, Sergey Edunov, David Grangier, Michael Auli
Data augmentation for Translation
Back-translation (Bojar & Tamchyna, 2011; Sennrich et al., 2016)
[Diagram, built up over seven slides: (1) an intermediate model is trained on the bilingual German/English data; (2) the intermediate model translates monolingual German text ("Monolingual Source") into English ("Monolingual Generated"); (3) the final model is trained on the bilingual data together with the German/generated-English pairs]
Scaling from 100M to 5.8B words
[Bar chart: BLEU (Accuracy) on WMT'14 English-German, y-axis 0 to 36]
• Phrase-based (2014): 20.7
• GNMT (RNN, 2016): 24.6
• ConvS2S (2017): 25.2
• Transformer (Google, 2017): 28.4
• WTransformer (Salesforce, 2017): 28.9
• SAtt + RPR (Google, 2018): 29.2
• DeepL (2017): 33.3
• fairseq & sampled BT: 35
Chart annotations: "High quality, non-benchmark data"; "Only benchmark bilingual + monolingual data"
Model trains in 22.5h on 128 V100
WMT'18 Human evaluations
Ranked #1 in the human evaluation of the WMT'18 English-German translation task
Conclusion
Mixed precision training in PyTorch:
• 3-4x speedups in training wall time
• No architecture changes required
• Use Nvidia's apex library
Case study: Neural Machine Translation
• Train models in 30 minutes instead of 1 day+
• State-of-the-art translation quality using semi-supervised learning
Thank you! Questions?
Contact Us: Myle Ott, Sergey Edunov
References
• Scaling Neural Machine Translation: arxiv.org/abs/1806.00187
• Understanding Back-translation at Scale: arxiv.org/abs/1808.09381
• apex: nvidia.github.io/apex
Acknowledgements: Nvidia and PyTorch teams for helping us implement and optimize mixed precision training.