Quantized Stochastic Gradient Descent · 2018-11-21
TRANSCRIPT
Quantized Stochastic Gradient Descent
Dan Alistarh, ETH Zurich
The Practical Problem: Training large machine learning models efficiently
• Large datasets:
  • ImageNet: 1.6 million images (~300 GB)
  • NIST 2000 Switchboard dataset: 2000 hours
• Large models:
  • ResNet-152 [He et al., 2015]: 152 layers, 60 million parameters
  • LACEA [Yu et al., 2016]: 22 layers, 65 million parameters

He et al. (2015), “Deep Residual Learning for Image Recognition”
Yu et al. (2016), “Deep convolutional neural networks with layer-wise context expansion and attention”
Data-parallel Stochastic Gradient Descent
• Each node, per minibatch: compute gradient → exchange gradient → update params.
• As models get bigger, the gradient-exchange step increasingly dominates the iteration time.

Idea [Seide et al., CNTK]: compress the gradients being exchanged.
Background on SGD
• Start from the gradient descent iteration
    x_{t+1} = x_t − η_t ∇f(x_t).
• Let g̃(x_t) = gradient at a randomly chosen data point, with E[g̃(x_t)] = ∇f(x_t).
• Assume the variance bound E‖g̃(x) − ∇f(x)‖² ≤ σ².
• The SGD iteration is then
    x_{t+1} = x_t − η_t g̃(x_t).

Theorem (standard): Given f convex and smooth, and R² = ‖x_0 − x*‖², if we run SGD for T = O(R²σ²/ε²) iterations, then
    E[ f( (1/T) Σ_{t=1}^{T} x_t ) ] − f(x*) ≤ ε.
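As a concrete illustration of the iteration and the averaging in the theorem above, here is a minimal SGD sketch on a synthetic least-squares problem (the data, step size, and objective are illustrative choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: f(x) = (1/2m) * ||A x - b||^2.
m, n = 1000, 20
A = rng.normal(size=(m, n))
x_true = rng.normal(size=n)
b = A @ x_true + 0.01 * rng.normal(size=m)

def stochastic_gradient(x, i):
    """Gradient at a single randomly chosen data point (a_i, b_i);
    its expectation over uniform i is the full gradient of f."""
    a_i = A[i]
    return a_i * (a_i @ x - b[i])

x = np.zeros(n)
eta = 0.01                     # constant step size eta_t
T = 20000
avg = np.zeros(n)              # averaged iterate (the theorem bounds f of this)
for t in range(T):
    i = rng.integers(m)
    x = x - eta * stochastic_gradient(x, i)
    avg += x / T

# The averaged iterate lands close to the generating parameters.
print(np.linalg.norm(avg - x_true))
```

Averaging the iterates, rather than taking the last one, is exactly what the convergence statement above is about.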
Basic Idea: QSGD [Alistarh et al., NIPS 2017]
• Quantization function
    Q(v_i) = ‖v‖₂ · sgn(v_i) · ξ_i(v),
  where ξ_i(v) = 1 with probability |v_i| / ‖v‖₂, and 0 otherwise.

Properties:
1. Unbiasedness:
    E[Q(v_i)] = ‖v‖₂ · sgn(v_i) · |v_i| / ‖v‖₂ = v_i.
2. Second-moment (variance) bound:
    E‖Q(v)‖² ≤ √n · ‖v‖².
3. Sparsity: if v has dimension n, then
    E[#non-zeroes in Q(v)] = E[Σ_i ξ_i(v)] ≤ ‖v‖₁ / ‖v‖₂ ≤ √n.

[Figure: each |v_i| / ‖v‖₂ is stochastically rounded to one of the levels −1, 0, 1.]

Convergence: E[Q(v)] = v, so QSGD still converges; the runtime is at most √n× more iterations.
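The properties above are easy to verify by simulation; the sketch below is a plain NumPy rendition of the one-level quantizer (not the paper's implementation) that checks unbiasedness and the sparsity bound:

```python
import numpy as np

rng = np.random.default_rng(1)

def qsgd_1bit(v, rng):
    """One-level stochastic quantization:
    Q(v_i) = ||v||_2 * sgn(v_i) * xi_i,
    where xi_i = 1 with probability |v_i| / ||v||_2 and 0 otherwise."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    xi = rng.random(v.shape) < np.abs(v) / norm
    return norm * np.sign(v) * xi

v = rng.normal(size=1000)

# Unbiasedness: the mean of many independent quantizations recovers v.
mean_q = np.mean([qsgd_1bit(v, rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(mean_q - v)))              # small

# Sparsity: E[#nonzeros] = ||v||_1 / ||v||_2 <= sqrt(n).
nnz = np.mean([np.count_nonzero(qsgd_1bit(v, rng)) for _ in range(200)])
print(nnz, np.linalg.norm(v, 1) / np.linalg.norm(v), np.sqrt(v.size))
```

Note that any single Q(v) is a very noisy estimate of v; it is only unbiased, which is what the SGD analysis needs.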
QSGD Compression
• Quantization function
    Q(v_i) = ‖v‖₂ · sgn(v_i) · ξ_i(v),
  where ξ_i(v) = 1 with probability |v_i| / ‖v‖₂ and 0 otherwise.
• Instead of n floats (v_1, …, v_n), transmit one float ‖v‖₂ plus the signs and locations of the ≈ √n non-zeroes: ≈ √n bits and ints, for a compression factor on the order of √n / log n.
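A back-of-the-envelope version of this bit count, under the simplifying assumption that each non-zero position is stored as a plain log₂(n)-bit integer (the actual encoding in the paper is more refined):

```python
import math

def qsgd_message_bits(n):
    """Rough expected message size for one-level QSGD on an n-dim gradient:
    one 32-bit float for ||v||_2, plus, for each of the <= sqrt(n) expected
    nonzeros, one sign bit and a log2(n)-bit position."""
    expected_nnz = math.sqrt(n)
    return 32 + expected_nnz * (1 + math.log2(n))

n = 1 << 20                          # a million-parameter gradient
qsgd_bits = qsgd_message_bits(n)
full_bits = 32 * n                   # sending raw 32-bit floats instead
print(qsgd_bits, full_bits, full_bits / qsgd_bits)
```

The ratio grows roughly like √n / log n (times the 32 bits per float), which is where the compression figure on the slide comes from.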
QSGD Compression (continued)
Moral: we're not too happy, since the √n increase in the number of iterations is quite large.
In the paper: we reduce √n to a small constant (2) by increasing the number of quantization levels. In practice, 4–8 bits are enough.
Experimental Setup
Where?
• Amazon p16xLarge (16× NVIDIA K80 GPUs), with NVIDIA DirectConnect
• Microsoft CNTK v2.0, with fast MPI-based communication
What?
• Tasks: image classification (ImageNet, CIFAR) and speech recognition (CMU AN4)
• Nets: ResNet, VGG, AlexNet, and (for speech) LSTM, with default parameters
Why?
• Accuracy vs. speed/scalability
How long?
• >25K GPU hours
Experiments: Accuracy
[Plots: ResNet-110 on CIFAR-10; AlexNet on ImageNet.]
More Accuracy: please see the paper.

AlexNet on ImageNet (accuracy after epochs 56 / 89 / 112; samples per second (MPI) on 1–16 GPUs):

Precision    Bucket  Acc@56  Acc@89  Acc@112  1 GPU   2 GPUs  4 GPUs  8 GPUs   16 GPUs
32bit        /       /       /       59.90%   240.80  301.45  328.00  272.90   192.10
QSGD 16bit   8192    /       /       /        /       388.80  508.80  500.90   335.60
QSGD 8bit    512     59.63%  60.36%  60.37%   /       424.90  544.60  739.10   535.00
QSGD 4bit    512     59.20%  59.97%  60.05%   /       466.50  598.70  964.90   748.50
QSGD 2bit    128     57.09%  58.09%  58.17%   /       449.20  609.15  1076.50  889.80
1bitSGD      /       59.57%  60.23%  60.31%   /       424.05  564.30  971.10   849.40
1bitSGD*     64      59.26%  59.92%  59.99%   /       370.80  476.50  761.20   712.70
1bitSGD*     512     58.44%  59.25%  59.29%   /       /       /       /        /

ResNet-50 on ImageNet (accuracy after epochs 60 / 96 / 120):

Precision    Bucket  Acc@60  Acc@96  Acc@120  1 GPU   2 GPUs  4 GPUs  8 GPUs   16 GPUs
32bit        /       70.75%  74.60%  74.68%   47.20   80.80   142.40  247.90   272.30
QSGD 16bit   8192    /       /       /        /       90.20   156.30  275.80   348.70
QSGD 8bit    512     70.66%  75.07%  75.19%   /       92.60   162.70  313.70   416.80
QSGD 4bit    512     70.66%  74.62%  74.76%   /       93.90   165.70  326.10   461.20
QSGD 2bit    128     /       /       /        /       93.30   178.35  330.45   472.25
1bitSGD      /       /       /       /        /       45.10   81.70   160.15   155.20
1bitSGD*     64      70.78%  74.93%  75.02%   /       88.10   156.50  296.70   442.40

ResNet-110 on CIFAR-10 (accuracy after epochs 80 / 128 / 160):

Precision    Bucket  Acc@80  Acc@128 Acc@160  1 GPU   2 GPUs  4 GPUs  8 GPUs   16 GPUs
32bit        /       87.09%  93.65%  93.86%   343.70  555.00  957.70  1229.10  831.60
QSGD 16bit   8192    /       /       /        /       551.00  942.70  1164.20  763.40
QSGD 8bit    512     86.92%  93.97%  94.19%   /       550.20  960.10  1193.10  759.70
QSGD 4bit    512     87.77%  93.52%  93.76%   /       571.10  957.40  1257.10  784.30
QSGD 2bit    128     85.77%  92.59%  92.64%   /       557.20  973.10  1227.90  780.40
1bitSGD      /       87.13%  94.02%  94.15%   /       465.60  643.30  610.90   406.90
1bitSGD*     64      /       /       /        /       550.40  884.80  1156.70  757.70

ResNet-152 on ImageNet (accuracy after epochs 60 / 96 / 120):

Precision    Bucket  Acc@60  Acc@96  Acc@120  1 GPU   2 GPUs  4 GPUs  8 GPUs   16 GPUs
32bit        /       /       /       77.00%   16.90   26.10   45.00   73.90    113.50
QSGD 16bit   8192    /       /       /        /       31.20   54.50   95.50    151.00
QSGD 8bit    512     72.95%  76.66%  76.74%   /       32.80   62.70   109.20   182.50
QSGD 4bit    512     /       /       /        /       33.60   60.20   121.90   203.20
QSGD 2bit    128     /       /       /        /       33.50   64.35   123.55   208.50
1bitSGD      /       /       /       /        /       10.55   22.10   41.40    63.15
1bitSGD*     64      /       /       /        /       30.40   55.50   108.10   193.50

VGG19 on ImageNet (accuracy after epochs 40 / 64 / 80):

Precision    Bucket  Acc@40  Acc@64  Acc@80   1 GPU   2 GPUs  4 GPUs  8 GPUs   16 GPUs
32bit        /       /       /       /        12.40   20.40   36.30   53.95    40.60
QSGD 16bit   8192    /       /       /        /       24.80   46.40   35.80    67.80
QSGD 8bit    512     /       /       /        /       24.20   47.50   119.50   106.60
QSGD 4bit    512     /       /       /        /       27.00   52.30   151.65   143.80
QSGD 2bit    128     /       /       /        /       24.60   49.35   160.35   170.50
1bitSGD      /       /       /       /        /       22.20   43.15   117.35   120.60
1bitSGD*     64      /       /       /        /       22.90   44.80   99.15    134.30

BN-Inception on ImageNet (accuracy after epochs 150 / 240 / 300):

Precision    Bucket  Acc@150 Acc@240 Acc@300  1 GPU   2 GPUs  4 GPUs  8 GPUs   16 GPUs
32bit        /       /       /       /        88.30   164.80  316.75  473.75   500.40
QSGD 16bit   8192    /       /       /        /       171.80  337.10  482.70   592.30
QSGD 8bit    512     /       /       /        /       173.60  342.50  552.90   696.30
QSGD 4bit    512     /       /       /        /       174.80  346.90  593.40   743.30
QSGD 2bit    128     /       /       /        /       173.40  343.70  591.80   747.50
1bitSGD      /       /       /       /        /       127.60  236.25  336.15   321.30
1bitSGD*     64      /       /       /        /       170.30  335.10  480.50   700.40
Figure 5: Results using MPI on an Amazon EC2 instance. “/” means the number is missing, for one of two reasons: (1) the run would have taken more than two weeks to converge, or (2) preliminary results showed the run diverging or performing significantly worse than full precision. Since data-parallel SGD does not apply on a single GPU, most 1-GPU cells are empty.

Generally, 4 bits are sufficient to recover or improve accuracy at the same rate (and this is also theoretically sound).
Experiments: Scalability
• All experiments ran on ImageNet.
• Amazon baseline: [plot; ≈2.3× speedup]
Experiments: Scalability
[Plot: speedups of ≈2×, 4×, 6.2×, and 7× across configurations.]
Accuracy vs. Time on LSTM
• Speech recognition (3-layer LSTM)
• AN4 dataset, 2× Titan X GPUs
[Plot: ≈2.5× speedup.]
What Else Can We Quantize?
[Diagram: processor (gradient computation), fast memory (storing the model), and large, slower memory (storing the samples).]
• Iteration: x_{t+1} = x_t − η_t · g̃(x_t, e_t), where e_t is the sampled data point.
Everything: Gradient, Samples, and Model
[Same diagram, with quantization applied at each level.]
• Iteration: x_{t+1} = x_t − η_t · Q_g( g̃( Q_m(x_t), Q_s(e_t) ) )
Can we quantize end-to-end and still get convergence guarantees?
ZipML: End-to-end Quantization for Linear Models [Alistarh, Li, Liu, Kara, Zhang & Zhang, ICML 2017]
• Samples are problematic.
  Original gradient: g̃(x_t) = a_t (a_tᵀ x_t − b_t)
  Naïve quantization: g̃(x_t) = Q(a_t) (Q(a_t)ᵀ x_t − b_t) — not unbiased!
• We can fix this by double sampling:
    g̃(x_t) = Q₁(a_t) (Q₂(a_t)ᵀ x_t − b_t)
  If the two quantized samples are independent, the gradient is again unbiased.
[Diagram: the sample source holds (a_t, b_t), sends (Q₁(a_t), Q₂(a_t), b_t) to the processor, which computes g̃(x_t).]
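Double sampling is easy to check by simulation. The sketch below uses unbiased stochastic rounding as a stand-in for Q (any unbiased quantizer works; this is not ZipML's actual quantizer) and compares the bias of the naïve and double-sampled estimates:

```python
import numpy as np

rng = np.random.default_rng(3)

def stochastic_round(a, s, rng):
    """Unbiased stochastic rounding of each entry to a grid of spacing 1/s
    (a stand-in for the sample quantizer Q)."""
    scaled = a * s
    lower = np.floor(scaled)
    level = lower + (rng.random(a.shape) < (scaled - lower))
    return level / s

n = 50
a, x = rng.normal(size=n), rng.normal(size=n)
b = 0.5
g_true = a * (a @ x - b)                   # full-precision gradient

trials = 50000
# Deliberately coarse quantizer (s=1, round to integers) to make bias visible.
Q1 = stochastic_round(np.tile(a, (trials, 1)), 1, rng)
Q2 = stochastic_round(np.tile(a, (trials, 1)), 1, rng)

# Naive: the same quantized sample appears twice -> E[Q(a) Q(a)^T] != a a^T.
naive = (Q1 * ((Q1 @ x) - b)[:, None]).mean(axis=0)
# Double sampling: two independent quantizations -> unbiased again.
double = (Q1 * ((Q2 @ x) - b)[:, None]).mean(axis=0)

print(np.linalg.norm(naive - g_true))      # stays bounded away from zero
print(np.linalg.norm(double - g_true))     # shrinks as trials grow
```

The naïve estimate is off because E[Q(a_i)²] = a_i² + Var(Q(a_i)): reusing one quantization inflates the diagonal, which independent Q₁, Q₂ avoid.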
Generalization & Implementation
• We can do end-to-end quantized linear regression with guarantees.
• General classification loss functions:
  1. For smooth losses, we can do polynomial approximation, then multi-sampling.
  2. For non-smooth losses (e.g., SVM), we can do “re-fetch sampling”.
• Implemented on a Xeon+FPGA platform (Intel-Altera HARP).
• Quantized data → 8–16× more data per transfer.
[Fig. 5: SGD on classification data; training loss vs. time. All curves represent 64 SGD epochs; curves shown: float CPU 1-thread, float CPU 10-threads, float FPGA, Qx FPGA. Speedups are Qx FPGA vs. float CPU 10-threads. (a) gisette, step size 1/2¹², Q2: 6.5×. (b) epsilon, step size 1/2¹², Q8: 3.6×. (c) mnist (trained for digit 7), step size 1/2¹⁵, Q1: 10.6×.]
[Fig. 6: SGD on regression data; same setup and curves. (a) cadata, step size 1/2¹², Q2: 3.7×. (b) synthetic100, step size 1/2⁹, Q4: 7.2×. (c) synthetic1000, step size 1/2⁹, Q8: 2×.]
TABLE IV: Data sets used in experimental evaluation.

Name           Training size  Testing size  # Features  # Classes
cadata         20,640         —             8           regression
music          463,715        —             90          regression
synthetic100   10,000         —             100         regression
synthetic1000  10,000         —             1000        regression
mnist          60,000         10,000        780         10
gisette        6,000          1,000         5,000       2
epsilon        10,000         10,000        2,000       2
…updates; that is, each thread works on a separate chunk of data and applies gradient updates to a common model without any synchronization. Although this results in a convergence speedup for sparse data sets, for dense and high-variance ones the convergence quality might be affected negatively because of the lack of synchronization. hogwild also makes use of vectorization, benefiting from AVX extensions.

Methodology: Since we apply the step size as a bit-shift operation, we choose whichever of the step sizes (1/2⁶, 1/2⁹, 1/2¹², 1/2¹⁵) results in the smallest loss for the full-precision data after 64 epochs. With the step size fixed, we run FPGA-SGD on all the precision variants we have implemented (Q1 — only for classification data — Q2, Q4, Q8, float). For each data set, we present the loss over time in Figures 5–6, which always contain a single-threaded and a 10-threaded CPU-SGD curve for float, an FPGA-SGD curve for float (for supported dimensions), and an FPGA-SGD curve for the smallest-precision data that converged within 5% of the original loss. We want to show how long each implementation needs to converge to the same loss, emphasizing the speedup quantized FPGA-SGD achieves over the full-precision variants.

Main results: In Figure 5 we observe that for all classification data sets, quantized FPGA-SGD achieves a speedup while maintaining convergence quality. For gisette in Figure 5a, Q2 reaches the same loss 6.9× faster than hogwild. Due to the high-variance data in epsilon, both the hogwild and Qx curves are shaky in Figure 5b; float FPGA-SGD behaves very nicely here, providing both a 1.8× speedup over hogwild and better convergence quality. The results for the music data set are very similar to epsilon, so we omit them for space. On the mnist data set in Figure 5c, we can even use Q1 without losing any convergence quality, showing that the characteristics of the data set heavily affect the choice of quantization precision, which justifies having multiple Qx implementations; here, Q1 FPGA-SGD provides a 10.6× speedup over hogwild and an 11× speedup over float FPGA-SGD. In Figure 6a we see that float FPGA-SGD converges as fast as Q2. The reason is the zero padding we apply to stay cache-line aligned (see Section III): the cadata data set has only 8 features, so a full row fits into one cache line (64 B, which holds 16 floats) even in float. Quantizing the data there brings no benefit for our implementation, since we pad the remaining bits in a cache line with zeros. The outcome of this experiment, as predicted by our earlier calculations, is that for data sets with a small number of features, quantization does not provide a speedup. For synthetic100 and synthetic1000, we need Q4 and Q8, respectively, to achieve the same convergence quality as float. We see in Figure 6c that hogwild converges slightly faster than float FPGA-SGD, but Q8 then provides a 2× speedup over that. The reason for hogwild
Classification (gisette dataset, n = 5K).
Regression (synthetic dataset, n = 100).
Summary

QSGD is a family of lossy compression schemes that trade off bit complexity against convergence time.
QSGD is theoretically optimal, can be practical, and extends to end-to-end compression.

Got Applications?
Experiments: More Scalability
[Plot: speedups of ≈2×, 6×, 6.2×, and 7×.]
Backup 1: Comparison with 1-bit SGD
[Plots: ≈2.3× and 6× speedups.]