Quantized Stochastic Gradient Descent · 2018-11-21
TRANSCRIPT
Quantized Stochastic Gradient Descent
Dan Alistarh, ETH Zurich
The Practical Problem: Training large machine learning models efficiently
• Large datasets:
  • ImageNet: 1.6 million images (~300 GB)
  • NIST 2000 Switchboard dataset: 2000 hours
• Large models:
  • ResNet-152 [He et al., 2015]: 152 layers, 60 million parameters
  • LACEA [Yu et al., 2016]: 22 layers, 65 million parameters

He et al. (2015), “Deep Residual Learning for Image Recognition”
Yu et al. (2016), “Deep convolutional neural networks with layer-wise context expansion and attention”
Data-parallel Stochastic Gradient Descent
• Each node, per minibatch: compute gradient → exchange gradient → update params.
• As models get bigger, the gradient-exchange step increasingly dominates the iteration time.

Idea [Seide et al., CNTK]: compress the gradients being exchanged.
Background on SGD
• Start from the gradient descent iteration
    x_{t+1} = x_t − η_t ∇f(x_t).
• Let g̃(x_t) = gradient at a randomly chosen data point, with E[g̃(x_t)] = ∇f(x_t).
• Assume the variance bound E‖g̃(x) − ∇f(x)‖² ≤ σ².
• The SGD iteration is then
    x_{t+1} = x_t − η_t g̃(x_t).

Theorem (standard): Given f convex and smooth, and R² = ‖x_0 − x*‖², if we run SGD for T = O(R²σ²/ε²) iterations, then
    E[ f( (1/T) Σ_{t=1}^{T} x_t ) ] − f(x*) ≤ ε.
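As a concrete illustration of the iteration and the averaging in the theorem above, here is a minimal SGD sketch on a synthetic least-squares problem (the data, step size, and objective are illustrative choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: f(x) = (1/2m) * ||A x - b||^2.
m, n = 1000, 20
A = rng.normal(size=(m, n))
x_true = rng.normal(size=n)
b = A @ x_true + 0.01 * rng.normal(size=m)

def stochastic_gradient(x, i):
    """Gradient at a single randomly chosen data point (a_i, b_i);
    its expectation over uniform i is the full gradient of f."""
    a_i = A[i]
    return a_i * (a_i @ x - b[i])

x = np.zeros(n)
eta = 0.01                     # constant step size eta_t
T = 20000
avg = np.zeros(n)              # averaged iterate (the theorem bounds f of this)
for t in range(T):
    i = rng.integers(m)
    x = x - eta * stochastic_gradient(x, i)
    avg += x / T

# The averaged iterate lands close to the generating parameters.
print(np.linalg.norm(avg - x_true))
```

Averaging the iterates, rather than taking the last one, is exactly what the convergence statement above is about.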
Basic Idea: QSGD [Alistarh et al., NIPS 2017]
• Quantization function
    Q(v_i) = ‖v‖₂ · sgn(v_i) · ξ_i(v),
  where ξ_i(v) = 1 with probability |v_i| / ‖v‖₂, and 0 otherwise.

Properties:
1. Unbiasedness:
    E[Q(v_i)] = ‖v‖₂ · sgn(v_i) · |v_i| / ‖v‖₂ = v_i.
2. Second-moment (variance) bound:
    E‖Q(v)‖² ≤ √n · ‖v‖².
3. Sparsity: if v has dimension n, then
    E[#non-zeroes in Q(v)] = E[Σ_i ξ_i(v)] ≤ ‖v‖₁ / ‖v‖₂ ≤ √n.

[Figure: each |v_i| / ‖v‖₂ is stochastically rounded to one of the levels −1, 0, 1.]

Convergence: E[Q(v)] = v, so QSGD still converges; the runtime is at most √n× more iterations.
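The properties above are easy to verify by simulation; the sketch below is a plain NumPy rendition of the one-level quantizer (not the paper's implementation) that checks unbiasedness and the sparsity bound:

```python
import numpy as np

rng = np.random.default_rng(1)

def qsgd_1bit(v, rng):
    """One-level stochastic quantization:
    Q(v_i) = ||v||_2 * sgn(v_i) * xi_i,
    where xi_i = 1 with probability |v_i| / ||v||_2 and 0 otherwise."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    xi = rng.random(v.shape) < np.abs(v) / norm
    return norm * np.sign(v) * xi

v = rng.normal(size=1000)

# Unbiasedness: the mean of many independent quantizations recovers v.
mean_q = np.mean([qsgd_1bit(v, rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(mean_q - v)))              # small

# Sparsity: E[#nonzeros] = ||v||_1 / ||v||_2 <= sqrt(n).
nnz = np.mean([np.count_nonzero(qsgd_1bit(v, rng)) for _ in range(200)])
print(nnz, np.linalg.norm(v, 1) / np.linalg.norm(v), np.sqrt(v.size))
```

Note that any single Q(v) is a very noisy estimate of v; it is only unbiased, which is what the SGD analysis needs.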
QSGD Compression
• Quantization function
    Q(v_i) = ‖v‖₂ · sgn(v_i) · ξ_i(v),
  where ξ_i(v) = 1 with probability |v_i| / ‖v‖₂ and 0 otherwise.
• Instead of n floats (v_1, …, v_n), transmit one float ‖v‖₂ plus the signs and locations of the ≈ √n non-zeroes: ≈ √n bits and ints, for a compression factor on the order of √n / log n.
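A back-of-the-envelope version of this bit count, under the simplifying assumption that each non-zero position is stored as a plain log₂(n)-bit integer (the actual encoding in the paper is more refined):

```python
import math

def qsgd_message_bits(n):
    """Rough expected message size for one-level QSGD on an n-dim gradient:
    one 32-bit float for ||v||_2, plus, for each of the <= sqrt(n) expected
    nonzeros, one sign bit and a log2(n)-bit position."""
    expected_nnz = math.sqrt(n)
    return 32 + expected_nnz * (1 + math.log2(n))

n = 1 << 20                          # a million-parameter gradient
qsgd_bits = qsgd_message_bits(n)
full_bits = 32 * n                   # sending raw 32-bit floats instead
print(qsgd_bits, full_bits, full_bits / qsgd_bits)
```

The ratio grows roughly like √n / log n (times the 32 bits per float), which is where the compression figure on the slide comes from.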
QSGD Compression (continued)
Moral: we're not too happy, since the √n increase in the number of iterations is quite large.
In the paper: we reduce √n to a small constant (2) by increasing the number of quantization levels. In practice, 4–8 bits are enough.
Experimental Setup
Where?
• Amazon p16xLarge (16× NVIDIA K80 GPUs), with NVIDIA DirectConnect
• Microsoft CNTK v2.0, with fast MPI-based communication
What?
• Tasks: image classification (ImageNet, CIFAR) and speech recognition (CMU AN4)
• Nets: ResNet, VGG, AlexNet, and (for speech) LSTM, with default parameters
Why?
• Accuracy vs. speed/scalability
How long?
• >25K GPU hours
Experiments: Accuracy
[Plots: ResNet-110 on CIFAR-10; AlexNet on ImageNet.]
More Accuracy: please see the paper.

AlexNet on ImageNet (accuracy after epochs 56 / 89 / 112; samples per second (MPI) on 1–16 GPUs):

Precision    Bucket  Acc@56  Acc@89  Acc@112  1 GPU   2 GPUs  4 GPUs  8 GPUs   16 GPUs
32bit        /       /       /       59.90%   240.80  301.45  328.00  272.90   192.10
QSGD 16bit   8192    /       /       /        /       388.80  508.80  500.90   335.60
QSGD 8bit    512     59.63%  60.36%  60.37%   /       424.90  544.60  739.10   535.00
QSGD 4bit    512     59.20%  59.97%  60.05%   /       466.50  598.70  964.90   748.50
QSGD 2bit    128     57.09%  58.09%  58.17%   /       449.20  609.15  1076.50  889.80
1bitSGD      /       59.57%  60.23%  60.31%   /       424.05  564.30  971.10   849.40
1bitSGD*     64      59.26%  59.92%  59.99%   /       370.80  476.50  761.20   712.70
1bitSGD*     512     58.44%  59.25%  59.29%   /       /       /       /        /

ResNet-50 on ImageNet (accuracy after epochs 60 / 96 / 120):

Precision    Bucket  Acc@60  Acc@96  Acc@120  1 GPU   2 GPUs  4 GPUs  8 GPUs   16 GPUs
32bit        /       70.75%  74.60%  74.68%   47.20   80.80   142.40  247.90   272.30
QSGD 16bit   8192    /       /       /        /       90.20   156.30  275.80   348.70
QSGD 8bit    512     70.66%  75.07%  75.19%   /       92.60   162.70  313.70   416.80
QSGD 4bit    512     70.66%  74.62%  74.76%   /       93.90   165.70  326.10   461.20
QSGD 2bit    128     /       /       /        /       93.30   178.35  330.45   472.25
1bitSGD      /       /       /       /        /       45.10   81.70   160.15   155.20
1bitSGD*     64      70.78%  74.93%  75.02%   /       88.10   156.50  296.70   442.40

ResNet-110 on CIFAR-10 (accuracy after epochs 80 / 128 / 160):

Precision    Bucket  Acc@80  Acc@128 Acc@160  1 GPU   2 GPUs  4 GPUs  8 GPUs   16 GPUs
32bit        /       87.09%  93.65%  93.86%   343.70  555.00  957.70  1229.10  831.60
QSGD 16bit   8192    /       /       /        /       551.00  942.70  1164.20  763.40
QSGD 8bit    512     86.92%  93.97%  94.19%   /       550.20  960.10  1193.10  759.70
QSGD 4bit    512     87.77%  93.52%  93.76%   /       571.10  957.40  1257.10  784.30
QSGD 2bit    128     85.77%  92.59%  92.64%   /       557.20  973.10  1227.90  780.40
1bitSGD      /       87.13%  94.02%  94.15%   /       465.60  643.30  610.90   406.90
1bitSGD*     64      /       /       /        /       550.40  884.80  1156.70  757.70

ResNet-152 on ImageNet (accuracy after epochs 60 / 96 / 120):

Precision    Bucket  Acc@60  Acc@96  Acc@120  1 GPU   2 GPUs  4 GPUs  8 GPUs   16 GPUs
32bit        /       /       /       77.00%   16.90   26.10   45.00   73.90    113.50
QSGD 16bit   8192    /       /       /        /       31.20   54.50   95.50    151.00
QSGD 8bit    512     72.95%  76.66%  76.74%   /       32.80   62.70   109.20   182.50
QSGD 4bit    512     /       /       /        /       33.60   60.20   121.90   203.20
QSGD 2bit    128     /       /       /        /       33.50   64.35   123.55   208.50
1bitSGD      /       /       /       /        /       10.55   22.10   41.40    63.15
1bitSGD*     64      /       /       /        /       30.40   55.50   108.10   193.50

VGG19 on ImageNet (accuracy after epochs 40 / 64 / 80):

Precision    Bucket  Acc@40  Acc@64  Acc@80   1 GPU   2 GPUs  4 GPUs  8 GPUs   16 GPUs
32bit        /       /       /       /        12.40   20.40   36.30   53.95    40.60
QSGD 16bit   8192    /       /       /        /       24.80   46.40   35.80    67.80
QSGD 8bit    512     /       /       /        /       24.20   47.50   119.50   106.60
QSGD 4bit    512     /       /       /        /       27.00   52.30   151.65   143.80
QSGD 2bit    128     /       /       /        /       24.60   49.35   160.35   170.50
1bitSGD      /       /       /       /        /       22.20   43.15   117.35   120.60
1bitSGD*     64      /       /       /        /       22.90   44.80   99.15    134.30

BN-Inception on ImageNet (accuracy after epochs 150 / 240 / 300):

Precision    Bucket  Acc@150 Acc@240 Acc@300  1 GPU   2 GPUs  4 GPUs  8 GPUs   16 GPUs
32bit        /       /       /       /        88.30   164.80  316.75  473.75   500.40
QSGD 16bit   8192    /       /       /        /       171.80  337.10  482.70   592.30
QSGD 8bit    512     /       /       /        /       173.60  342.50  552.90   696.30
QSGD 4bit    512     /       /       /        /       174.80  346.90  593.40   743.30
QSGD 2bit    128     /       /       /        /       173.40  343.70  591.80   747.50
1bitSGD      /       /       /       /        /       127.60  236.25  336.15   321.30
1bitSGD*     64      /       /       /        /       170.30  335.10  480.50   700.40
Figure 5: Results using MPI on an Amazon EC2 instance. “/” means the number is missing, for one of two reasons: (1) the run would have taken more than two weeks to converge, or (2) preliminary results showed the run diverging or performing significantly worse than full precision. Since data-parallel SGD does not apply on a single GPU, most 1-GPU cells are empty.

Generally, 4 bits are sufficient to recover or improve accuracy at the same rate (and this is also theoretically sound).
Experiments: Scalability
• All experiments ran on ImageNet.
• Amazon baseline: [plot; ≈2.3× speedup]
Experiments: Scalability
[Plot: speedups of ≈2×, 4×, 6.2×, and 7× across configurations.]
Accuracy vs. Time on LSTM
• Speech recognition (3-layer LSTM)
• AN4 dataset, 2× Titan X GPUs
[Plot: ≈2.5× speedup.]
What Else Can We Quantize?
[Diagram: processor (gradient computation), fast memory (storing the model), and large, slower memory (storing the samples).]
• Iteration: x_{t+1} = x_t − η_t · g̃(x_t, e_t), where e_t is the sampled data point.
Everything: Gradient, Samples, and Model
[Same diagram, with quantization applied at each level.]
• Iteration: x_{t+1} = x_t − η_t · Q_g( g̃( Q_m(x_t), Q_s(e_t) ) )
Can we quantize end-to-end and still get convergence guarantees?
ZipML: End-to-end Quantization for Linear Models [Alistarh, Li, Liu, Kara, Zhang & Zhang, ICML 2017]
• Samples are problematic.
  Original gradient: g̃(x_t) = a_t (a_tᵀ x_t − b_t)
  Naïve quantization: g̃(x_t) = Q(a_t) (Q(a_t)ᵀ x_t − b_t) — not unbiased!
• We can fix this by double sampling:
    g̃(x_t) = Q₁(a_t) (Q₂(a_t)ᵀ x_t − b_t)
  If the two quantized samples are independent, the gradient is again unbiased.
[Diagram: the sample source holds (a_t, b_t), sends (Q₁(a_t), Q₂(a_t), b_t) to the processor, which computes g̃(x_t).]
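Double sampling is easy to check by simulation. The sketch below uses unbiased stochastic rounding as a stand-in for Q (any unbiased quantizer works; this is not ZipML's actual quantizer) and compares the bias of the naïve and double-sampled estimates:

```python
import numpy as np

rng = np.random.default_rng(3)

def stochastic_round(a, s, rng):
    """Unbiased stochastic rounding of each entry to a grid of spacing 1/s
    (a stand-in for the sample quantizer Q)."""
    scaled = a * s
    lower = np.floor(scaled)
    level = lower + (rng.random(a.shape) < (scaled - lower))
    return level / s

n = 50
a, x = rng.normal(size=n), rng.normal(size=n)
b = 0.5
g_true = a * (a @ x - b)                   # full-precision gradient

trials = 50000
# Deliberately coarse quantizer (s=1, round to integers) to make bias visible.
Q1 = stochastic_round(np.tile(a, (trials, 1)), 1, rng)
Q2 = stochastic_round(np.tile(a, (trials, 1)), 1, rng)

# Naive: the same quantized sample appears twice -> E[Q(a) Q(a)^T] != a a^T.
naive = (Q1 * ((Q1 @ x) - b)[:, None]).mean(axis=0)
# Double sampling: two independent quantizations -> unbiased again.
double = (Q1 * ((Q2 @ x) - b)[:, None]).mean(axis=0)

print(np.linalg.norm(naive - g_true))      # stays bounded away from zero
print(np.linalg.norm(double - g_true))     # shrinks as trials grow
```

The naïve estimate is off because E[Q(a_i)²] = a_i² + Var(Q(a_i)): reusing one quantization inflates the diagonal, which independent Q₁, Q₂ avoid.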
Generalization & Implementation
• We can do end-to-end quantized linear regression with guarantees.
• General classification loss functions:
  1. For smooth losses, we can do polynomial approximation, then multi-sampling.
  2. For non-smooth losses (e.g., SVM), we can do “re-fetch sampling”.
• Implemented on a Xeon+FPGA platform (Intel-Altera HARP).
• Quantized data → 8–16× more data per transfer.
[Fig. 5: SGD on classification data; training loss vs. time. All curves represent 64 SGD epochs; curves shown: float CPU 1-thread, float CPU 10-threads, float FPGA, Qx FPGA. Speedups are Qx FPGA vs. float CPU 10-threads. (a) gisette, step size 1/2¹², Q2: 6.5×. (b) epsilon, step size 1/2¹², Q8: 3.6×. (c) mnist (trained for digit 7), step size 1/2¹⁵, Q1: 10.6×.]
[Fig. 6: SGD on regression data; same setup and curves. (a) cadata, step size 1/2¹², Q2: 3.7×. (b) synthetic100, step size 1/2⁹, Q4: 7.2×. (c) synthetic1000, step size 1/2⁹, Q8: 2×.]
TABLE IV: Data sets used in experimental evaluation.

Name           Training size  Testing size  # Features  # Classes
cadata         20,640         —             8           regression
music          463,715        —             90          regression
synthetic100   10,000         —             100         regression
synthetic1000  10,000         —             1000        regression
mnist          60,000         10,000        780         10
gisette        6,000          1,000         5,000       2
epsilon        10,000         10,000        2,000       2
…updates; that is, each thread works on a separate chunk of data and applies gradient updates to a common model without any synchronization. Although this results in a convergence speedup for sparse data sets, for dense and high-variance ones the convergence quality might be affected negatively because of the lack of synchronization. hogwild also makes use of vectorization, benefiting from AVX extensions.

Methodology: Since we apply the step size as a bit-shift operation, we choose whichever of the step sizes (1/2⁶, 1/2⁹, 1/2¹², 1/2¹⁵) results in the smallest loss for the full-precision data after 64 epochs. With the step size fixed, we run FPGA-SGD on all the precision variants we have implemented (Q1 — only for classification data — Q2, Q4, Q8, float). For each data set, we present the loss over time in Figures 5–6, which always contain a single-threaded and a 10-threaded CPU-SGD curve for float, an FPGA-SGD curve for float (for supported dimensions), and an FPGA-SGD curve for the smallest-precision data that converged within 5% of the original loss. We want to show how long each implementation needs to converge to the same loss, emphasizing the speedup quantized FPGA-SGD achieves over the full-precision variants.

Main results: In Figure 5 we observe that for all classification data sets, quantized FPGA-SGD achieves a speedup while maintaining convergence quality. For gisette in Figure 5a, Q2 reaches the same loss 6.9× faster than hogwild. Due to the high-variance data in epsilon, both the hogwild and Qx curves are shaky in Figure 5b; float FPGA-SGD behaves very nicely here, providing both a 1.8× speedup over hogwild and better convergence quality. The results for the music data set are very similar to epsilon, so we omit them for space. On the mnist data set in Figure 5c, we can even use Q1 without losing any convergence quality, showing that the characteristics of the data set heavily affect the choice of quantization precision, which justifies having multiple Qx implementations; here, Q1 FPGA-SGD provides a 10.6× speedup over hogwild and an 11× speedup over float FPGA-SGD. In Figure 6a we see that float FPGA-SGD converges as fast as Q2. The reason is the zero padding we apply to stay cache-line aligned (see Section III): the cadata data set has only 8 features, so a full row fits into one cache line (64 B, which holds 16 floats) even in float. Quantizing the data there brings no benefit for our implementation, since we pad the remaining bits in a cache line with zeros. The outcome of this experiment, as predicted by our earlier calculations, is that for data sets with a small number of features, quantization does not provide a speedup. For synthetic100 and synthetic1000, we need Q4 and Q8, respectively, to achieve the same convergence quality as float. We see in Figure 6c that hogwild converges slightly faster than float FPGA-SGD, but Q8 then provides a 2× speedup over that. The reason for hogwild
Classification (gisette dataset, n = 5K).
Regression (synthetic dataset, n = 100).
Summary

QSGD is a family of lossy compression schemes that trade off bit complexity against convergence time.
QSGD is theoretically optimal, can be practical, and extends to end-to-end compression.

Got Applications?
Experiments: More Scalability
[Plot: speedups of ≈2×, 6×, 6.2×, and 7×.]
Backup 1: Comparison with 1-bit SGD
[Plots: ≈2.3× and 6× speedups.]