
Page 1

8th ADAC Workshop

Yet Another Accelerated SGD: ResNet-50 Training in 70.4 sec.

October 30, 2019
Fujitsu Laboratories Ltd.

Senior Researcher
Masafumi Yamazaki

Page 2

Background

The scale of Deep Neural Networks (DNNs) is growing.

[Figure: compute used for training neural networks over time, illustrating that the complexity of training neural networks keeps growing and the gap keeps getting wider. Source: Dario Amodei and Danny Hernandez, https://openai.com/blog/ai-and-compute/]

Large-scale parallelization is one of the most effective ways to accelerate training.

Page 3

Contents

1. Distributed Deep Learning
• History of DNN training speed challenges
• Our contributions

2. Key points for distributed training
• DNN training using data-parallel synchronous SGD
• Optimal mini-batch size
• Allreduce algorithms

3. Optimization points
• Backward/Allreduce parallelization
• Optimization of initialization and removal of unnecessary processing
• Optimization for MLPerf v0.6 (the rules changed from v0.5)

4. Evaluation and results

Page 4

History of DNN training speed

2012 (ILSVRC2012)
• Berkeley U., AlexNet: training in 5-6 days using 2 GPUs

2014 (ILSVRC2014)
• Google, GoogLeNet: training in about a week using a few GPUs (an estimate from their paper)

2015 (ILSVRC2015)
• Microsoft, ResNet

2017
• Facebook: ResNet-50 training in 1 hour using 256 GPUs
• PFN: training in 15 minutes using 1,024 GPUs

2018
• Tencent: training in 6.6 minutes using 2,048 GPUs
• Google: training in 1.8 minutes using 1,024 TPUs
• Sony: training in 2.0 minutes using 3,456 GPUs

Page 5

Our contribution: speed-up

In 2015, our group at Fujitsu Laboratories began working on large-scale distributed training.

Year       | Hardware              | #GPUs | DNN / Dataset        | Time          | Remarks
Feb. 2016  | Tatara (Kyushu Univ.) | 64    | AlexNet / ImageNet   | -             |
June 2016  | TSUBAME 2.5           | 256   | AlexNet / ImageNet   | -             | *1
Aug. 2018  | ABCI                  | ~4096 | ResNet-50 / ImageNet | (6.6 minutes) | accuracy did not reach 75%
April 2019 | ABCI                  | 2048  | ResNet-50 / ImageNet | 74.7 seconds  | arXiv:1903.12650
June 2019  | ABCI                  | 2048  | ResNet-50 / ImageNet | 70.4 seconds  | MLPerf v0.6

*1 SWoPP2016, "A proposal for accelerating DL processing with MPI":
• evaluated Allreduce algorithms
• proposed running computation and communication processes in parallel
• reported how accuracy worsened with large mini-batch sizes

Page 6

History of DNN training speed (continued)

The timeline from the previous slide, with two entries added for 2019:

2019
• MLPerf v0.6 (our work): training in 1.2 minutes using 2,048 GPUs
• Huawei: training in 1.0 minute using 1,024 A910 processors

Page 7

Our contribution: mini-batch size

Facebook was able to increase the mini-batch size up to 8K using techniques such as learning-rate warm-up. However, increasing the mini-batch size further degraded accuracy.

[Figure: ImageNet top-1 validation accuracy (60-80%) vs. mini-batch size, for runs on 2, 8, 64, and 1,024+ GPUs, with results from Facebook, Google Brain, PFN, Sony, and Tencent. Source: "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour", P. Goyal (Facebook) et al., 2017]

We achieved the target accuracy with a mini-batch size of up to 84K using 2,048 GPUs.

Page 8

Key points for distributed training

• Data parallel method based on synchronous SGD using Allreduce
• Optimal mini-batch size
• Allreduce algorithms

Page 9

Data parallel method

We have continued to accelerate training using a data parallel method based on synchronous SGD with Allreduce.

[Diagram: on a single GPU (#0), one training iteration processes one mini-batch as Forward → Backward → Update; the number of training images is 1,281,167 (ImageNet-1k)]

Page 10

Data parallel method

Since 2016, we have continued to accelerate training using a data parallel method based on synchronous SGD with Allreduce.

[Diagram: with data parallelism, each iteration is Forward → Backward → Allreduce → Update, and the mini-batch is split across GPUs #0-#3; the number of training images is 1,281,167 (ImageNet-1k)]

Therefore, each GPU processes less data when a large number of GPUs is used for training. A minimal sketch of this scheme is shown below.
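The following is an illustrative sketch of the data-parallel synchronous-SGD pattern described above, not the authors' MXNet/Horovod implementation: the toy model, worker count, and the NumPy stand-in for an NCCL/MPI Allreduce are assumptions for the example.

```python
import numpy as np

# Toy data-parallel synchronous SGD: a linear least-squares model, 4 simulated workers.
rng = np.random.default_rng(0)
num_workers, dim, lr = 4, 8, 0.1
w = rng.normal(size=dim)                       # weights replicated identically on every worker

def local_gradient(w, x, y):
    """Gradient of the mean-squared error on this worker's shard of the mini-batch."""
    return x.T @ (x @ w - y) / len(y)

def allreduce_mean(grads):
    """Stand-in for an NCCL/MPI Allreduce: every worker ends up with the averaged gradient."""
    return np.mean(grads, axis=0)

for step in range(100):
    # Each worker draws its own shard of the global mini-batch (data parallelism).
    shards = [(rng.normal(size=(32, dim)), rng.normal(size=32)) for _ in range(num_workers)]
    grads = [local_gradient(w, x, y) for x, y in shards]    # Forward + Backward on each worker
    g = allreduce_mean(grads)                               # Allreduce: the synchronization point
    w -= lr * g                                             # Update: identical on all workers
```

In the software stack named later in the talk (MXNet / Horovod), this gradient averaging is what Horovod's distributed optimizer performs over NCCL.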

Page 11

Optimal mini-batch size

The performance gets worse as the amount of data each GPU has to process decreases.

[Figure: training speed (arbitrary units, log scale) vs. number of GPUs, for 32, 64, 128, and 256 images per GPU. System: TSUBAME 2.5, Tesla K20X (1 GPU per node)]

Page 12

Optimal mini-batch size

We selected the optimal mini-batch size to obtain sufficient accuracy.

Weak scaling: increase the number of images per iteration in proportion to the number of accelerators.
• Pros: good scalability
• Cons: accuracy drops with a very large mini-batch

Strong scaling: keep the same global mini-batch and divide it across accelerators.
• Pros: same result as the single-accelerator run
• Cons: poor scalability (reduced parallelism within each GPU, communication overhead)

[Figure: training speed (arbitrary units, log scale) vs. number of GPUs, for 32, 64, 128, and 256 images per GPU. System: TSUBAME 2.5, Tesla K20X (1 GPU per node)]

A small numerical sketch of the two scaling modes follows.
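To make the weak-versus-strong-scaling trade-off concrete, here is a small arithmetic sketch; the 86,016 global mini-batch and 2,048 GPUs appear later in the talk, while the 8K strong-scaling batch is just an example value.

```python
# Weak scaling: the per-GPU batch is fixed, so the global mini-batch grows with the GPU count.
num_gpus = 2048
per_gpu_batch = 42                                # 86,016 / 2,048 in the final configuration
global_batch_weak = per_gpu_batch * num_gpus
print(global_batch_weak)                          # 86016 -> accuracy must survive this batch size

# Strong scaling: the global mini-batch is fixed, so each GPU's share shrinks as GPUs are added.
global_batch_strong = 8192                        # e.g. an 8K mini-batch
print(global_batch_strong / num_gpus)             # 4.0 images per GPU -> too little work per GPU
```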

Page 13

Comparison of Allreduce algorithms (overview; the entries are filled in on the following slides)

Algorithm                  | Data size to transfer | #Steps | Remarks
Recursive Halving/Doubling |                       |        |
Ring                       |                       |        |
Double Tree                |                       |        |

Page 14

Recursive Halving/Doubling algorithm

[Diagram: Allreduce among ranks 0-3 holding values 0, 1, 2, 3. In the halving steps, pairs of ranks exchange and reduce halves of the data (0+1, 2+3); in the doubling steps, the partial sums are combined and redistributed until every rank holds 0+1+2+3. Steps 1-4 are shown.]

Algorithm                  | Data size to transfer | #Steps      | Remarks
Recursive Halving/Doubling | N/2 ~ N/m             | 2 × log2(m) | We implemented

* N and m are the data size and the number of ranks, respectively.

A small simulation of the halving/doubling pattern is shown below.
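As an illustration of the pattern in the diagram (a sketch, not the authors' GPU/InfiniBand implementation), the following simulates a Rabenseifner-style Allreduce: reduce-scatter by recursive halving followed by allgather by recursive doubling. The rank count and vector length are arbitrary example values.

```python
import numpy as np

def allreduce_halving_doubling(vectors):
    """Simulated Allreduce: reduce-scatter by recursive halving, then allgather by
    recursive doubling. len(vectors) must be a power of two and the vector length
    divisible by that count."""
    m = len(vectors)
    n = len(vectors[0])
    data = [np.asarray(v, dtype=float).copy() for v in vectors]
    lo, hi = [0] * m, [n] * m            # segment of the vector each rank currently owns

    # Phase 1: recursive halving (reduce-scatter), log2(m) steps.
    dist = m // 2
    while dist >= 1:
        new_data = [d.copy() for d in data]
        new_lo, new_hi = lo[:], hi[:]
        for r in range(m):
            p = r ^ dist                 # exchange partner at this distance
            mid = (lo[r] + hi[r]) // 2
            klo, khi = (lo[r], mid) if r < p else (mid, hi[r])   # half this rank keeps
            # Receive the partner's contribution for the kept half and reduce.
            new_data[r][klo:khi] = data[r][klo:khi] + data[p][klo:khi]
            new_lo[r], new_hi[r] = klo, khi
        data, lo, hi = new_data, new_lo, new_hi
        dist //= 2

    # Phase 2: recursive doubling (allgather), log2(m) steps.
    dist = 1
    while dist < m:
        new_data = [d.copy() for d in data]
        new_lo, new_hi = lo[:], hi[:]
        for r in range(m):
            p = r ^ dist
            new_data[r][lo[p]:hi[p]] = data[p][lo[p]:hi[p]]      # receive partner's segment
            new_lo[r] = min(lo[r], lo[p])
            new_hi[r] = max(hi[r], hi[p])
        data, lo, hi = new_data, new_lo, new_hi
        dist *= 2
    return data                          # every rank now holds the full sum

ranks = [np.full(8, float(r)) for r in range(4)]   # ranks 0..3, as in the diagram
print(allreduce_halving_doubling(ranks)[0])        # [6. 6. ...] = 0+1+2+3 on every rank
```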

Page 15

Ring algorithm

[Diagram: Allreduce among ranks 0-3 arranged in a ring. Partial sums (e.g. 1+2, 1+2+3) travel around the ring during the first m−1 reduce steps, and the completed sums (0+1+2+3) circulate during the remaining steps; steps 1-6 are shown for four ranks.]

Algorithm | Data size to transfer | #Steps            | Remarks
Ring      | N/m                   | 2 × (m − 1)       | NCCL 2.3
2D-Ring   | N/mx, N/my            | 2 × (mx + my − 2) |

* N, m, mx, and my are the data size, the number of ranks, the x-ring size, and the y-ring size, respectively.

Page 16

Double Tree algorithm

[Diagram: the double binary tree topology; source: NCCL v2.4 comments]

Algorithm   | Data size to transfer | #Steps      | Remarks
Double Tree | N/2                   | 2 × log2(m) | NCCL 2.4

* N and m are the data size and the number of ranks, respectively.

Page 17

Comparison of Allreduce algorithms

Algorithm                  | Data size to transfer | #Steps            | Remarks
Recursive Halving/Doubling | N/2 ~ N/m             | 2 × log2(m)       | We implemented
Ring                       | N/m                   | 2 × (m − 1)       | NCCL 2.3
2D-Ring                    | N/mx, N/my            | 2 × (mx + my − 2) |
Double Tree                | N/2                   | 2 × log2(m)       | NCCL 2.4

* N, m, mx, and my are the data size, the number of ranks, the x-ring size, and the y-ring size, respectively.

A small step-count comparison based on this table is shown below.
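To see why the step counts matter at scale, here is a sketch that simply evaluates the #Steps column of the table above for m = 2,048 ranks; the 64 × 32 grid used for the 2D-Ring is a hypothetical split, and the model ignores the chunking, pipelining, and topology effects that real NCCL applies.

```python
import math

def allreduce_cost(algo, m, mx=None, my=None):
    """(#steps, per-step message size as a fraction of N) taken from the table above.
    A rough model only; real behaviour also depends on chunking and topology."""
    if algo == "halving_doubling":
        return 2 * int(math.log2(m)), f"N/2 .. N/{m}"
    if algo == "ring":
        return 2 * (m - 1), f"N/{m}"
    if algo == "2d_ring":
        return 2 * (mx + my - 2), f"N/{mx}, N/{my}"
    if algo == "double_tree":
        return 2 * int(math.log2(m)), "N/2"
    raise ValueError(algo)

m = 2048                                     # ranks, as in the 2,048-GPU runs
for algo in ("halving_doubling", "ring", "2d_ring", "double_tree"):
    steps, msg = allreduce_cost(algo, m, mx=64, my=32)   # 64 x 32 is a hypothetical 2D split
    print(f"{algo:17s} steps = {steps:4d}   per-step message ~ {msg}")

# Ring needs 2*(m-1) = 4094 steps at m = 2048, while the log2(m)-based algorithms need only 22,
# which matters most for the small, latency-bound messages highlighted on the next slides.
```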

Page 18

Evaluation of the Allreduce algorithms

[Figure: Allreduce latency [ms] (0.1-100, log scale) vs. data size (16 KB-32 MB) for 2D-Ring (NCCL 2.3) and our Recursive Halving/Doubling implementation, with the region of data sizes most often used by Allreduce in distributed training highlighted. System: ABCI, 128 nodes, 512 GPUs]

Page 19

Evaluation of the Allreduce algorithms

[Figure: the same measurement with Double Tree added: Allreduce latency [ms] vs. data size (16 KB-32 MB) for 2D-Ring (NCCL 2.3), our Recursive Halving/Doubling implementation, and Double Tree (NCCL 2.4), with the region of data sizes most often used by Allreduce in distributed training highlighted. System: ABCI, 128 nodes, 512 GPUs]

Page 20

Optimization points

• Accelerating training speed
• Optimization of initialization and removal of unnecessary processing
• Changes for MLPerf v0.6

Page 21

Accelerating training speed

Overlapping Allreduce communication with the backward computation.

[Diagram: without overlap, each iteration runs Forward → Backward → Allreduce → Update and the Allreduce time is pure overhead; with overlap, per-layer Allreduce operations run while the backward pass continues for the remaining layers]

The processing speed increased to over 1.5 M images/sec. However, we could not yet reach our goal for the overall training time. A minimal sketch of the overlap pattern is shown below.
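This is a toy illustration of the overlap idea only, not the authors' MXNet/Horovod implementation: a Python thread pool stands in for the separate communication stream, and the layer count and timings are made-up values.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def backward_layer(layer):
    time.sleep(0.01)                      # pretend to compute this layer's gradients
    return f"grad[{layer}]"

def allreduce(grad):
    time.sleep(0.02)                      # pretend to average the gradient across GPUs
    return f"avg({grad})"

comm = ThreadPoolExecutor(max_workers=1)  # stand-in for the communication stream
start = time.time()

pending = []
for layer in reversed(range(8)):          # backward runs from the last layer to the first
    grad = backward_layer(layer)
    pending.append(comm.submit(allreduce, grad))   # launch Allreduce without waiting for it

averaged = [f.result() for f in pending]  # synchronize once before the weight update
print(f"{len(averaged)} gradients averaged in {time.time() - start:.2f}s "
      f"(vs ~{8 * (0.01 + 0.02):.2f}s without overlap)")
```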

Page 22

Optimization of initialization and unnecessary processes

[Diagram: timeline of one run, split into Initialization, Training (per-epoch blocks), and Validation, marking where optimizations ①, ②, and ③ apply]

① Generate common initial weights on each GPU from a common random seed, instead of broadcasting the initial weights from one GPU (see the sketch below).
② Overlap NCCL initialization with framework initialization.
③ Eliminate unnecessary processing after each epoch (0.1-0.2 sec. per epoch).

These optimizations reduced the overall training time by 45 seconds (120 s → 75 s).
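A minimal NumPy sketch of idea ①; the seed value and layer shapes are hypothetical, and a real setup would seed the framework's weight initializer the same way on every rank.

```python
import numpy as np

def init_weights(seed, shapes):
    """Every rank calls this with the same seed, so all replicas start from
    identical weights without broadcasting them from rank 0."""
    rng = np.random.default_rng(seed)
    return [rng.normal(scale=0.01, size=shape) for shape in shapes]

shapes = [(64, 3, 7, 7), (1000, 2048)]        # hypothetical conv / fc layer shapes
rank0 = init_weights(seed=42, shapes=shapes)  # "GPU 0"
rank1 = init_weights(seed=42, shapes=shapes)  # "GPU 1"
assert all(np.array_equal(a, b) for a, b in zip(rank0, rank1))  # identical, no broadcast needed
```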

Page 23

Changes when submitting to MLPerf v0.6

Under the MLPerf v0.6 rules, the target validation accuracy was raised from 74.9% to 75.9%.

① Database for training
[Figure: a 256 × 256 pixel training image (before data augmentation); a cropped image was used for the April result, while the original image is used for MLPerf v0.6]

② Learning-rate scheduling
We tuned the learning-rate schedule and changed it to follow the MLPerf v0.6 rules (an illustrative schedule shape is sketched below).
[Figure: learning rate (0-20) vs. epoch (0-50) for the two schedules]
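The talk does not give the exact schedule, so the following is only an illustrative warm-up-plus-decay shape of the kind commonly used for large-mini-batch ResNet-50 training; every constant in it is a hypothetical placeholder, not a value from the submission.

```python
# Illustrative only: linear warm-up followed by polynomial decay.
def learning_rate(epoch, base_lr=16.0, warmup_epochs=5, total_epochs=50, power=2.0):
    if epoch < warmup_epochs:                          # linear warm-up from ~0 to base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * (1.0 - progress) ** power         # polynomial decay toward 0

print([round(learning_rate(e), 2) for e in range(0, 50, 5)])
```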

Page 24

Evaluation and results

• Hardware
• Software
• Results
  • Validation accuracy during training
  • Accuracy improvement
  • Scalability

Page 25

Evaluation environment: hardware

Compute nodes (ABCI compute node configuration)
• 4 GPUs per node
• 2 HCAs per node

InfiniBand interconnect
• The intra-rack network is a full bi-section fat-tree
• The inter-rack network is a fat-tree with 1/3 oversubscription

Page 26

Evaluation environment: software

MXNet / Horovod
• The original source is NVIDIA's tuned MXNet
• We customized and tuned it further

Other libraries
• CUDA, cuDNN v7.5, NCCL v2.4
• OpenMPI
• GCC 7.3
• Python 3.6

Page 27

Result: validation accuracy during training

[Figure: validation accuracy (0-100%) and learning rate (0-14) vs. epoch (0-80)]

We achieved 76% accuracy with an 84K mini-batch using 2K GPUs.

Page 28

Result: accuracy improvement

[Figure: ImageNet validation accuracy (65-80%) vs. mini-batch size, with arrows indicating the directions of further work: extending the mini-batch size (128K, 256K) and improving accuracy]

Page 29

Result: scalability

The number of images computed per GPU is kept the same (weak scaling).

[Figure: training throughput (images/second, 10^3 to 10^7, log scale) vs. number of GPUs (1 to 2,048), two series]

Parallel efficiency is 77% on 2K GPUs (how this figure is computed is sketched below).
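Under weak scaling, parallel efficiency is the measured throughput on N GPUs divided by N times the single-GPU throughput. The snippet below only illustrates that formula: the single-GPU rate is a hypothetical baseline and the 2,048-GPU throughput is back-calculated to be roughly consistent with the 77% figure from this slide and the >1.5 M images/sec mentioned earlier.

```python
# Parallel efficiency under weak scaling: throughput(N GPUs) / (N * throughput(1 GPU)).
n_gpus = 2048
measured = 1.58e6                     # images/sec on 2,048 GPUs (approximate, for illustration)
single_gpu = 1.0e3                    # images/sec on 1 GPU (hypothetical baseline)

efficiency = measured / (n_gpus * single_gpu)
print(f"parallel efficiency on {n_gpus} GPUs: {efficiency:.0%}")   # ~77%
```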

Page 30

Conclusion

We achieved a 70 sec. training time (world record†) and an 84K mini-batch size (world record††) for ResNet-50/ImageNet under the MLPerf v0.6 rules†††, using 512 ABCI compute nodes (2,048 GPUs).

                   | Mini-batch size (max) | Processor           | DL software | Time       | Validation accuracy
Facebook           | 8,192                 | Tesla P100 × 256    | Caffe2      | 1 hour     | 76.3%
Google             | 16,384                | full TPU Pod        | TensorFlow  | 30 min.    | 76.1%
Preferred Networks | 32,768                | Tesla P100 × 1,024  | Chainer     | 15 min.    | 74.9%
Tencent            | 65,536                | Tesla P40 × 2,048   | TensorFlow  | 6.6 min.   | 75.8%
Google             | 65,536                | TPU v3 × 1,024      | TensorFlow  | 1.8 min.   | 75.2%
Sony               | 55,296                | Tesla V100 × 3,456  | NNL         | 2.0 min.   | 75.3%
Fujitsu Labs.      | 86,016                | Tesla V100 × 2,048  | MXNet       | 1.17 min.  | 76.1%
Huawei             | ?                     | Ascend A910 × 1,024 | MindSpore   | 0.997 min. | >75.9%

† Our investigation as of March 2019.
†† Our investigation under the conditions of SGD and a fixed mini-batch size, as of March 2019.
††† Closed Division rules were used, except for the tuning of six hyper-parameters.
