=2) +? 38: · 5" $ (t update) ¤tsubame 2.5 tsubame-kfc/dl / % 33%'# (611% '# +8:1...
Post on 26-May-2020
TRANSCRIPT
"=2)�+?���38: $��������� ��&-����
����1,a) !(A>1 4.562 '(@<3 97;�3 1/B1
1�%*��"
2������ �������
3,0������
a) oyama.y.aa@m.titech.ac.jp, �#�
16/08/08SWoPP2016
Deep Learning (DL)
¤ “H�”Neural Network (NN)�.��1U0O&,¤ NN ��� 7�(�! ��)�0O���2L-
¤ <S�K@9J�N &,�G��9J4:�E5
¤ H�NN � Q$�'+������→ $>I8; =34¤ TSUBAME2.5 16���(48 GPU) 15PCNN 0O�199() (8.2#)¤ NN ?C�0O�! ����!���"���!�/A�����R6* 0O�=3
[Figure: an input image (source: http://image-net.org/) is fed to a Neural Network, which outputs per-class scores: Barn swallow = 0.95, Police dog = 0.03, Water beetle = 0.01, …]
Stochastic Gradient Descent (SGD)
¤ ������"� m��� ��(�����)��.1�����&,�@4 ∇Ei�/+�)�NN�-� W(t)�>%��#(¤ !7$@4 ∇E�.18(,2�,:PFlop)�?!DL� *<
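$$W^{(t+1)} = W^{(t)} - \eta \sum_{i=1}^{m} \nabla E_i(W^{(t)})$$
[Figure: one SGD step with m = 3. The per-sample steps −η∇E_1, −η∇E_2, −η∇E_3 on the error functions E_1, E_2, E_3 add up to −ηΣ_i ∇E_i, which moves W(t) to W(t+1) on E.]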
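The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the SPRINT implementation: the per-sample loss E_i(W) = 0.5 * ||W - s_i||^2, the sample values, and the function names `sgd_step` and `grad_squared_error` are all assumptions made for the example.

```python
def sgd_step(w, grads, eta):
    """Return W(t+1) = W(t) - eta * sum_i grad_i, applied per weight component."""
    total = [sum(g[j] for g in grads) for j in range(len(w))]
    return [wj - eta * tj for wj, tj in zip(w, total)]


def grad_squared_error(w, sample):
    """Gradient of the toy loss E_i(W) = 0.5 * ||W - sample||^2 (hypothetical loss)."""
    return [wj - sj for wj, sj in zip(w, sample)]


# A minibatch of m = 3 hypothetical samples, as in the m = 3 figure.
minibatch = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
w = [0.0, 0.0]
for _ in range(200):
    grads = [grad_squared_error(w, s) for s in minibatch]
    w = sgd_step(w, grads, eta=0.1)
```

With this toy loss the summed gradient vanishes at the mean of the samples, so `w` approaches `[2.0, 2.0]`.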
HKQG!�������<N3>
¤ !��������NN�S-(<N3>)0R*+�JM¤ /��!�������→ $#���AX�85�=I�B2¤ &��!�������→ ZE�$%�"3�F,���D41P;�Y� ��¤ OW7@���!��������96(0R*+�TV���'))<N3>�L.��C>3
�%����(��11UCNN�ImageNet2012�������0R:?
[Figure: Top-5 [%] (y-axis, 0 to 50) versus Epoch (x-axis, 0 to 600)]
T2
GW2T2 GW2
GPU� 48 1
�� � [�] 6.04 43.8
Epoch � [�] 910 28235
�������� 1076 50
�� top-5 ���[%] 17.2 15.7
SuPeRcomputation In Neural Training (SPRINT)
¤ AD.0�,������,�������)%)�*�:1��CQ@M� -ELDL���'
¤ GPU�����I8��RF�?B��.������N2>6<"� $)(���P7 (Asynchronous SGD, ASGD)¤ �+� ;/J9��&!#������=65��
¤ ���������� �→ #-�(�*!�+����,&"�'%(). ���$)��� ��
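$$W^{(t+1)} = W^{(t)} - \eta \sum_i \nabla E_i^{(t)}$$
$$W^{(t+2)} = W^{(t+1)} - \eta \sum_i \nabla E_i^{(t+1)}$$
$$W^{(t+3)} = W^{(t+2)} - \eta \sum_i \nabla E_i^{(t+2)}$$
[Figure: timeline with a GPU row and an update row, labeled with the gradients ∇E_i^(t+1), ∇E_i^(t+2), ∇E_i^(t+3) and the weights W^(t+1), W^(t+2)]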
Z]=5�BQ
¤ ��: `P�>,TK�-:���ASGD�Convolutional Neural Network (CNN)�N^��@b�,�DL�� '�1�Fc&#$������Mg�AO(!)�Ve�
¤ ����: DL�� '”SPRINT”�6.���>,23�Gd�7_.i���Empirical�AO(!)�LJ¤ %�*C4�CNNVY�08� Epoch23(!����"?<�@b�23)�Fc&#$������+8
¤ ��:¤ TSUBAME 2.5�TSUBAME-KFC/DL�Epoch23�Fc&#$������Fch\6%, 8%�Hd
¤ fD�CNN��� kWGR23�Fch\12%E9�Hd¤ Fc&#$������IS� a��>,[X��@b23�Uj�/;�LJ
SPRINT���
SPRINT�25
¤ GPU�����%-¤ 0���14) �GPU�SSD�*�¤ ������SSD�".��3/
[Figure: SPRINT architecture diagram. Legible labels: CPU, GPU, SSD, CNN, Allreduce, Momentum, Mutex.]
SPRINT�;>GPU�)��
1) NSubbatch=�����*#(�SSD��� ��%&'�!�2) D<(49MB)�68�$��%&'?8
¤ ,B?� NSubbatch=��*#(��+C�68��
3) 10�5 �A0����.-�� ��%&'�!�
[Figure: SPRINT architecture diagram with steps (1), (2), (3) marked. Legible labels: CPU, GPU, SSD, CNN, Allreduce, Momentum, Mutex.]
SPRINT�)+/�����
a) GPU�����0*�%#������,&
b) ����'!����$�(0*)����-&
c) Allreduce�"�$��/�
[Figure: SPRINT architecture diagram with steps (a), (b), (c) marked. Legible labels: CPU, GPU, SSD, CNN, Allreduce, Momentum, Mutex.]
��������
.-�)/���5(
1. ���* NNode����$GPU* NGPU�CNN01���!&
2. GPU�����2%�����1��������+3' "#TGPU, TUpdate�,4
3. TGPU, TUpdate�Epoch"# TEpoch�+3����� ��NMinibatch_avg�,4
[Figure: prediction flow. Inputs: (N_Node, N_GPU, …) and the CNN parameters (L, {x_l}, {m_l}, …). Intermediate values: T_GPU, T_Update. Outputs: T_Epoch, N_Minibatch_avg.]
&%� '���CNN*+("
¤ .#-��������$�-�softmax�*!��CNN��)¤ ,: VGG
[Simonyan, 2014]
���� ��L � ��"LC .#� ��"xl � �� l �� �ml � �� l ����"c .#����� �pl � �� l ������� �
[Figure: CNN layer diagram running from layer 0 through layers l−1 and l to layer L_C, layer L, and softmax. Annotated symbols: x_0, m_0, x_{l−1}, m_{l−1}, x_l, m_l, c, p_l, x_{l−1} − c + 1, x_{L_C}, m_{L_C}, x_{L_C}^2 m_{L_C}, m_L.]
/-��*0������� ' "#
¤ 2���� 1��������� ' "#�!.������34���6�)+¤ ���� 1,�'85�(��&$
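[Figure: timeline with a GPU row and an update row, annotated with the intervals T_GPU and T_Update]
$$W^{(t+1)} = W^{(t)} - \eta \sum_i \nabla E_i^{(t)}$$
$$W^{(t+2)} = W^{(t+1)} - \eta \sum_i \nabla E_i^{(t+1)}$$
$$W^{(t+3)} = W^{(t+2)} - \eta \sum_i \nabla E_i^{(t+2)}$$
$$T_{GPU} = T_{LoadImage} + T_{ComputeGradient} + T_{UpdateGradient} + \dots$$
$$T_{Update} = T_{SumGradient} + T_{Allreduce} + T_{UpdateWeights} + \dots$$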
54��,6���GPU�����E:08"#(TConputeGradient)
¤ ������2<=����( "#1-����)¤ 15BCNN�0100%/�����?���¤ SGEMM(����)����
¤ C����CNN9;�NSubbatch��'$¤ convolution: D 2 BFeed-forward08¤ fc: +3!BFeed-forward08¤ dedw, dedb, dedx_*: Back-propagation08
¤ CUDA����¤ :C����CNN9;�NSubbatch��'$¤ im2col: D 2 *>�A�¤ activation: 7,)&.@*¤ pooling: max-pooling¤ c2f: D 2 B→+3!B>�A�F
'%��!(���GPU�����6,")��(TConputeGradient)
1. SGEMM(��� ��)���¤ �0� /����� �2��� m�n�k���#1�3-
¤ 2x/2 � 2y/2 � 2z/2�2.����8&5��2���$4*�+�
cublasSgemm�� �����������
cublasSgemm�� ��
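$$T_{convolution}(l) = \alpha_{mnk} c^2 m_{l-1} m_l {x'_l}^2 N_{Subbatch} + \alpha_{mn} m_l {x'_l}^2 N_{Subbatch} + \alpha_{mk} c^2 m_{l-1} {x'_l}^2 N_{Subbatch} + \alpha_{nk} c^2 m_{l-1} m_l + \alpha_{m} {x'_l}^2 N_{Subbatch} + \alpha_{n} m_l + \alpha_{k} c^2 m_{l-1} + \beta$$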
[Figure: two 3-D scatter plots over the axes log2(m), log2(n), log2(k); color scale labeled 2e−16 sec, 2e−14 sec, 2e−12 sec, 2e−10 sec, 2e−8 sec, 2e−6 sec, 2e−4 sec, 2e−2 sec, 2e0 sec]
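The T_convolution expression can be read as a polynomial in the three SGEMM dimensions. The sketch below assumes the mapping m = x'_l^2 * N_Subbatch, n = m_l, k = c^2 * m_{l-1}, which is inferred from the subscripts of the α coefficients rather than stated explicitly. The coefficient values, the dictionary layout, and the function names are hypothetical placeholders, not fitted results.

```python
def sgemm_time(m, n, k, coef):
    """Time model with one coefficient per product of m, n, k, plus a constant beta."""
    return (coef["mnk"] * m * n * k
            + coef["mn"] * m * n
            + coef["mk"] * m * k
            + coef["nk"] * n * k
            + coef["m"] * m
            + coef["n"] * n
            + coef["k"] * k
            + coef["beta"])


def convolution_time(x_out, c, m_prev, m_cur, n_subbatch, coef):
    """T_convolution(l) under the assumed mapping of layer sizes to SGEMM dimensions."""
    m = x_out ** 2 * n_subbatch   # x'_l^2 * N_Subbatch
    n = m_cur                     # m_l
    k = c ** 2 * m_prev           # c^2 * m_{l-1}
    return sgemm_time(m, n, k, coef)


# Hypothetical coefficients, chosen only to make the example concrete.
COEF = {"mnk": 1e-12, "mn": 0.0, "mk": 0.0, "nk": 0.0,
        "m": 0.0, "n": 0.0, "k": 0.0, "beta": 1e-5}

t_conv = convolution_time(x_out=10, c=3, m_prev=64, m_cur=128, n_subbatch=4, coef=COEF)
```

Keeping the SGEMM model separate from the layer-to-dimension mapping lets the same coefficients be reused for any layer that reduces to a matrix product.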
$#� %���GPU ����.)"'��(TConputeGradient)
2. CUDA���¤ �������� *�(+������������
¤ &!��-����(+�,���
$$T_{im2col}(l) = \alpha {x'_l}^2 c^2 m_{l-1} N_{Subbatch} + \beta \quad (l \le L_C)$$
CUDA������ �(�)��������(��)
>=��6@��!Q,�"���SJ8C()(TSumGradient)
¤ GPU�"���SJ�;C�D���'&��8C��→ Q,�"����G1P.H�/%()�4+ E0¤ �"���<3M�$O����R*�;C9�BF�5KN���! A2
$$T_{SumGradient} = \alpha \times N_{Param} \times N_{GPU} \times \min(T_{Update}/T_{GPU}, 1)$$
SJ�;C9����BF�5KN
[Figure: timeline of T_GPU and T_Update intervals for the case T_Update > T_GPU; annotated counts: 2 and 3]
[Figure: timeline of T_GPU and T_Update intervals for the case T_Update < T_GPU; annotated counts: 0 and 1; annotated fractions: T_Update/T_GPU and 1 − T_Update/T_GPU]
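The T_SumGradient formula and its two regimes (T_Update > T_GPU and T_Update < T_GPU) can be evaluated directly. The sketch below only transcribes the formula; the function name and the numeric values for α, N_Param, and the timings are hypothetical.

```python
def sum_gradient_time(alpha, n_param, n_gpu, t_update, t_gpu):
    """T_SumGradient = alpha * N_Param * N_GPU * min(T_Update / T_GPU, 1)."""
    return alpha * n_param * n_gpu * min(t_update / t_gpu, 1.0)


# Hypothetical inputs: 10 million parameters, 4 GPUs per node.
ALPHA = 1e-9
N_PARAM = 10_000_000
N_GPU = 4

# T_Update > T_GPU: the min() factor saturates at 1.
t_saturated = sum_gradient_time(ALPHA, N_PARAM, N_GPU, t_update=0.5, t_gpu=0.4)

# T_Update < T_GPU: the cost is scaled by the fraction T_Update / T_GPU.
t_partial = sum_gradient_time(ALPHA, N_PARAM, N_GPU, t_update=0.1, t_gpu=0.4)
```

In this example `t_partial` is one quarter of `t_saturated`, because T_Update / T_GPU = 0.25 in the second case.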
0.��*2���6%�����Allreduce'�!"
¤ <4,3�-35�������#� �→ Allreduce'�!�(:;$)��� )!"���¤ -35�/9�7����(� )!"�18
¤ X1 ~ B(NGPU, p=min(TUpdate/TGPU, 1)), F(i) �7&+
[Figure: timeline for N_Node = 3 in which each node reaches MPI_Allreduce after a different count X_1, … and waits T_Barrier]
$$X_M = \max(X_1, X_2, \ldots, X_{N_{Node}})$$
$$T_{Barrier} = \alpha E(X_M - X_1) = \alpha \left\{ N_{GPU}(1 - p) - \sum_{i=1}^{N_{GPU}} F(i)^{N_{Node}} \right\}$$
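The closed form for T_Barrier can be checked numerically against a direct computation of E(X_M) − E(X_1). The sketch assumes that F(i) denotes P(X_1 < i) for the binomial variable X_1 ~ B(N_GPU, p); under that reading the sum over i = 1, …, N_GPU matches the exact expectation of the maximum. The definition of F is not legible in the source, so this is an assumption, and the function names and input values are hypothetical.

```python
from math import comb


def binom_pmf(n, p, x):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def barrier_time(alpha, n_gpu, n_node, t_update, t_gpu):
    """Closed form, with F(i) assumed to be P(X_1 < i)."""
    p = min(t_update / t_gpu, 1.0)

    def F(i):
        return sum(binom_pmf(n_gpu, p, x) for x in range(i))

    tail = sum(F(i) ** n_node for i in range(1, n_gpu + 1))
    return alpha * (n_gpu * (1.0 - p) - tail)


def barrier_time_direct(alpha, n_gpu, n_node, t_update, t_gpu):
    """alpha * (E[max of N_Node i.i.d. binomials] - E[X_1]), by enumeration."""
    p = min(t_update / t_gpu, 1.0)
    cdf = [sum(binom_pmf(n_gpu, p, x) for x in range(j + 1)) for j in range(n_gpu + 1)]
    e_max = 0.0
    for x in range(n_gpu + 1):
        below = cdf[x - 1] ** n_node if x > 0 else 0.0
        e_max += x * (cdf[x] ** n_node - below)
    return alpha * (e_max - n_gpu * p)


t_closed = barrier_time(alpha=0.01, n_gpu=8, n_node=4, t_update=0.2, t_gpu=0.8)
t_direct = barrier_time_direct(alpha=0.01, n_gpu=8, n_node=4, t_update=0.2, t_gpu=0.8)
```

With a single node the maximum equals X_1, so both functions return zero up to rounding, which is a quick sanity check on the assumed definition of F.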
10� ,2�� #'
¤ �!���($%&�+�.5�������� NMiniBatch_avg�Epoch%& TEpoch��� )
$$N_{Minibatch\_avg} = N_{Node} \times N_{GPU} \times N_{Subbatch} \times \frac{T_{Update}}{T_{GPU}}$$
$$T_{Epoch} = \frac{N_{File} \times T_{Update}}{N_{Minibatch\_avg}} = \frac{N_{File} \times T_{GPU}}{N_{Node} \times N_{GPU} \times N_{Subbatch}}$$
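The two formulas for N_Minibatch_avg and T_Epoch translate directly into code, and the two forms of T_Epoch can be checked against each other. The function names and the input values below (including the file count) are hypothetical and are not measurements from the talk.

```python
def average_minibatch_size(n_node, n_gpu, n_subbatch, t_update, t_gpu):
    """N_Minibatch_avg = N_Node * N_GPU * N_Subbatch * T_Update / T_GPU."""
    return n_node * n_gpu * n_subbatch * t_update / t_gpu


def epoch_time(n_file, n_node, n_gpu, n_subbatch, t_gpu):
    """T_Epoch = N_File * T_GPU / (N_Node * N_GPU * N_Subbatch)."""
    return n_file * t_gpu / (n_node * n_gpu * n_subbatch)


# Hypothetical configuration: 8 nodes, 8 GPUs per node, 8 samples per GPU iteration.
n_minibatch_avg = average_minibatch_size(8, 8, 8, t_update=0.2, t_gpu=0.8)
t_epoch = epoch_time(n_file=1_000_000, n_node=8, n_gpu=8, n_subbatch=8, t_gpu=0.8)

# First form of T_Epoch: N_File * T_Update / N_Minibatch_avg.
t_epoch_first_form = 1_000_000 * 0.2 / n_minibatch_avg
```

Note that T_Update cancels in the second form: under this model the epoch time depends only on T_GPU and on the total number of samples processed per GPU iteration.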
��������
"'���,&
¤ TSUBAME 2.5�TSUBAME-KFC/DL�SPRINT �¤ 15�17����CNN(CNN-A, CNN-B, CNN-C) (!¤ ILSVRC2012������ (!¤ CNN-A (!���� /+�SGEMM ��� /+����)# ��
¤ CNN-A,B,C (!���� /+������%/+ *0
¤ TSUBAME 2.5 (NNode = 1, 2, 4, …, 64)
    ¤ Intel Xeon X5670 × 2 (281 GFlop/s total, single precision)
    ¤ NVIDIA Tesla K20x × 3 (11.8 TFlop/s total, single precision) → NGPU = 3
    ¤ 4x QDR InfiniBand × 2 (8 GB/s total)
¤ TSUBAME-KFC/DL (NNode = 1, 2, 4, …, 16)
    ¤ Intel Xeon E5-2620 v2 × 2 (403 GFlop/s total, single precision)
    ¤ NVIDIA Tesla K80 × 4 (34.9 TFlop/s total, single precision) → NGPU = 8
    ¤ 4x FDR InfiniBand (7 GB/s)
��� ��&�*"� �� (TComputeGradient)
¤ TSUBAME 2.5�TSUBAME-KFC/DL�!��15�17�������CNN�*"� ����()$�'12%��
TSUBAME-KFC/DL(�)�TSUBAME 2.5(�)�������(TComputeGradient) ���(� )���( )
[Figure: T_ComputeGradient [sec] versus N_Subbatch, two panels (N_Subbatch 2 to 10 with 0.0 to 0.8 sec; N_Subbatch 1 to 5 with 0.00 to 0.30 sec). Legend: CNN-A (measured), CNN-A (predicted), CNN-B (measured), CNN-B (predicted), CNN-C (measured), CNN-C (predicted).]
SGEMM������#%
�#� ��)"GPU�������� (TGPU)
¤ TSUBAME 2.5�TSUBAME-KFC/DL�%����16%����+7%���!,.'¤ �NSubbatch��-(
TSUBAME-KFC/DL��CNN-A(15��)� �����TGPU���(�)���(�)
[Figure: stacked bar chart of T_GPU [sec] (0.0 to 1.0) for (N_Node, N_Subbatch) with N_Node in {1, 2, 4, 8, 16} and N_Subbatch in {1, 4, 8, 11}. Components: UpdateGradient, LockGradient_G, ComputeUpdateVal, ComputeGradient, DeformImage, LoadImage, FetchWeights, LockWeights_G.]
&-����3,5"����$�� (TUpdate)
¤ TSUBAME 2.5�TSUBAME-KFC/DL�/�%�33%'#�(611%'#�+8:1¤ NNode����NSubbatch �92→ �����+8����
TSUBAME-KFC/DL��CNN-A(15��)� �����TUpdate���(�)���(�)
[Figure: stacked bar chart of T_Update [sec] (0.00 to 0.30) for (N_Node, N_Subbatch) with N_Node in {1, 2, 4, 8, 16} and N_Subbatch in {1, 4, 8, 11}. Components: UpdateWeights, LockWeights_U, UpdateMomentum, Allreduce, AddMomentum, UpdateOldWeights, SumGradient, LockGradient_U.]
<0*.� (;4)��!���
Allreduce()4)�$�� �1.27
�������Epoch����� ���������
¤ Epoch����������6%
¤ �� ��������������8%
[Figure: N_Minibatch_avg (0 to 450) for (N_Node, N_Subbatch) with N_Node in {1, 2, 4, 8, 16} and N_Subbatch in {1, 4, 8, 11}]
TSUBAME-KFC/DL��CNN-A(15����)���������� �������
+2��!�71� ���AC
¤ 3�06+2����TSUBAME2.5 16����*8�-;��������138((@4)��25%,'���� ����AC¤ TSUBAME-KFC/DL�8���$9�*8�1.47>�#5)¤ (@��� ���B? ��Epoch%&:<</�.@�"=
System  NNode  NSubbatch  Epoch time [sec]                  Average minibatch size
                          measured  predicted  error [%]    measured  predicted  error [%]
KFC     8      8          2025      1779       12.1         147       165        12.2
KFC     8      11         2316      2226       3.90         173       171        1.28
T2.5    16     5          2725      2614       4.06         128       125        2.29
T2.5    16     4          2910      2840       2.40         116       118        1.71
1.47> 138�26%
ILSVRC2012��������#������!��((�(��)'�Epoch��$&�� )
1.35> 138�25%
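The table's error columns and the two ratios noted beside it can be recomputed from the tabulated numbers. The sketch assumes that the first value of each pair is the measurement and the second the prediction, and that the error is taken relative to the first value. That reading reproduces the epoch-time errors closely, while the minibatch-size errors differ slightly because the sizes are rounded to integers in the table. The variable and function names are illustrative.

```python
# (system, N_Node, N_Subbatch, epoch first, epoch second, epoch error [%],
#  minibatch first, minibatch second, minibatch error [%]) copied from the table.
ROWS = [
    ("KFC", 8, 8, 2025, 1779, 12.1, 147, 165, 12.2),
    ("KFC", 8, 11, 2316, 2226, 3.90, 173, 171, 1.28),
    ("T2.5", 16, 5, 2725, 2614, 4.06, 128, 125, 2.29),
    ("T2.5", 16, 4, 2910, 2840, 2.40, 116, 118, 1.71),
]


def relative_error_percent(reference, estimate):
    """Absolute difference as a percentage of the reference value."""
    return abs(reference - estimate) / reference * 100.0


recomputed = [
    (relative_error_percent(r[3], r[4]), relative_error_percent(r[6], r[7]))
    for r in ROWS
]

# Epoch-time ratio of T2.5 (16 nodes, N_Subbatch = 5) to KFC (8 nodes, N_Subbatch = 8).
ratio_first = ROWS[2][3] / ROWS[0][3]    # 2725 / 2025
ratio_second = ROWS[2][4] / ROWS[0][4]   # 2614 / 1779
```

Rounded to two decimals, the two ratios are 1.35 and 1.47, matching the annotations next to the table.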
%*���.)�����24
¤ %*����#�����!�,*30�- �'1�,*
[Figure: contour plot over N_Node (10 to 40) and N_Subbatch (2 to 10) with levels 5e+02, 1e+03, 2e+03, 5e+03, 1e+04, 2e+04, 5e+04, 1e+05 sec]
Epoch���$"���+(
&/������ ��138�25%���30
TSUBAME-KFC/DL��ILSVRC2012�����������Epoch ����
��������
;@Z]
¤ (.,����'�A�DL��"+�ER-#/2� [Yan et al, SIGKDD, 2015]¤ (.,����'(NN�C9�I��?H���')�A��DL��"+�ER-#0�OM
¤ -#0`h�#��`h�(.,����'`h�`hF NNVY��Epoch78�Kg
¤ a5DE���Bb�X(*&'! ���)�Qi������
¤ Bb78 e=�$1�%�)�;�Z] [Gupta et al, 2015]¤ (.,����'�VY�L3 �a5D`hDL��"+Rudra�OM¤ Rudra�A�Bb>[�NN�N_ER�*&'! ��� Gd
staleness (jW�JT�8�I��c9���:F)�U6�\^���� �P�
¤ 4f�*&'! ����staleness N_ER�;S�<������
��(% C3¤ ��
¤ TSUBAME 2.5TSUBAME-KFC/DLEpoch$&�2J���������2JNE6%, 8%5K
¤ M0 CNN����PA4=$&�2JNE12%1'5K¤ 2J���������8>��G��+�DB .H$&�?O��!)�:9
¤ (% C3¤ ���FLI FL,!)�-��" /;5K¤ +6 .H7<�@ ����#*�/;5K