
Page 1:

Quantized Stochastic Gradient Descent

Dan Alistarh, ETH Zurich

2

Page 2:

The Practical Problem: Training large machine learning models efficiently

• Large Datasets:
  • ImageNet: 1.6 million images (~300 GB)
  • NIST 2000 Switchboard dataset: 2,000 hours

• Large Models:
  • ResNet-152 [He et al. 2015]: 152 layers, 60 million parameters
  • LACE [Yu et al. 2016]: 22 layers, 65 million parameters

He et al. (2015), "Deep Residual Learning for Image Recognition"; Yu et al. (2016), "Deep convolutional neural networks with layer-wise context expansion and attention"

3

Page 3:

Data-parallel Stochastic Gradient Descent

Compute gradient

Exchange gradient

Update params

Minibatch 1, Minibatch 2, Minibatch 3

4
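To make the three steps above concrete, here is a minimal sketch of one data-parallel SGD step (an illustration only, not the CNTK implementation used later in the talk); `compute_gradient` is a placeholder for the local backward pass, and mpi4py's Allreduce stands in for the gradient exchange:

```python
# Minimal sketch of one data-parallel SGD step: each worker computes a
# gradient on its own minibatch, gradients are averaged across workers,
# and every worker applies the same update.
import numpy as np
from mpi4py import MPI  # assumes an MPI setup, as in the CNTK experiments

comm = MPI.COMM_WORLD

def data_parallel_sgd_step(params, minibatch, compute_gradient, lr):
    # 1. Compute the gradient on the local minibatch (placeholder function).
    local_grad = compute_gradient(params, minibatch)
    # 2. Exchange gradients: sum across workers, then average.
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    global_grad /= comm.Get_size()
    # 3. Update the parameters identically on every worker.
    return params - lr * global_grad
```

The exchange step is the communication bottleneck that the rest of the talk targets: as the model grows, the gradient vector exchanged each step grows with it.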

Page 4:

Data-parallel SGD (bigger models)

Compute gradient

Exchange gradient

Update params

Minibatch 1, Minibatch 2

5

Page 5:

Data-parallel SGD (biggerer model)

Compute gradient

Exchange gradient

Update params

Minibatch 1, Minibatch 2

6

Page 6:

Idea [Seide et al., CNTK]: compress the gradients…

Minibatch 1, Minibatch 2

7

Page 7:

Background on SGD

• Start from the gradient descent iteration:

  x_{t+1} = x_t − η_t ∇f(x_t).

• Let g̃(x_t) = the gradient at a randomly chosen data point, giving the SGD iteration

  x_{t+1} = x_t − η_t g̃(x_t),   where E[g̃(x_t)] = ∇f(x_t).

• Let E‖g̃(x) − ∇f(x)‖² ≤ σ²  (variance bound).

Theorem (standard): Given f convex and smooth, and R² = ‖x₀ − x*‖², if we run SGD for T = O(R²σ² / ε²) iterations, then

  E[ f( (1/T) Σ_{t=1}^{T} x_t ) ] − f(x*) ≤ ε.

8
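A toy illustration (hypothetical example, not code from the talk) of this iteration on a least-squares objective, returning the averaged iterate (1/T) Σ_t x_t that appears in the theorem:

```python
# Toy SGD: x_{t+1} = x_t - eta * g~(x_t), with g~ the gradient at one
# randomly chosen data point, plus the averaged iterate from the theorem.
import numpy as np

def averaged_sgd(A, b, T, lr, seed=0):
    """SGD on f(x) = (1/2n) * ||Ax - b||^2 using one random row per step."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    x_sum = np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)              # randomly chosen data point
        g = A[i] * (A[i] @ x - b[i])     # unbiased: E[g] = grad f(x)
        x = x - lr * g                   # x_{t+1} = x_t - eta * g~(x_t)
        x_sum += x
    return x_sum / T                     # averaged iterate from the theorem
```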

Page 8:

Basic Idea: QSGD [Alistarh et al., NIPS 2017]

• Quantization function

  Q(v_i) = ‖v‖₂ · sgn(v_i) · ξ_i(v)

  where ξ_i(v) = 1 with probability |v_i| / ‖v‖₂, and 0 otherwise.

Properties:

1. Unbiasedness: E[Q(v_i)] = ‖v‖₂ · sgn(v_i) · |v_i| / ‖v‖₂ = v_i

2. Second-moment (variance) bound: E‖Q(v)‖₂² ≤ √n · ‖v‖₂²

3. Sparsity: if v has dimension n, then E[#non-zeros in Q(v)] = E[Σ_i ξ_i(v)] ≤ ‖v‖₁ / ‖v‖₂ ≤ √n

[Figure: each coordinate v_i / ‖v‖₂ is stochastically rounded to one of the quantization levels −1, 0, +1.]

Convergence: E[Q(v)] = v.

Runtime: at most √n× more iterations.

9
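A minimal NumPy sketch of this quantizer (an illustration of the definition above, not the authors' implementation):

```python
# Q(v_i) = ||v||_2 * sgn(v_i) * xi_i(v), with xi_i(v) = 1 w.p. |v_i| / ||v||_2.
import numpy as np

def qsgd_quantize(v, rng=None):
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    p = np.abs(v) / norm            # P[xi_i = 1] = |v_i| / ||v||_2
    xi = rng.random(v.shape) < p    # one Bernoulli draw per coordinate
    return norm * np.sign(v) * xi   # unbiased: E[Q(v)] = v
```

Workers would exchange the quantized vector instead of the dense float gradient; by unbiasedness the SGD analysis still applies, at the cost of the extra variance captured by the √n bound above.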

Page 9:

QSGD Compression

• Quantization function

  Q(v_i) = ‖v‖₂ · sgn(v_i) · ξ_i(v)

  where ξ_i(v) = 1 with probability |v_i| / ‖v‖₂, and 0 otherwise.

[Diagram: the full-precision gradient is n floats v_1, v_2, …, v_n; the quantized gradient is encoded as one float ‖v‖₂ plus the signs and locations of its ≈ √n non-zero entries (≈ √n bits and ints).]

Compression ≈ √n / log n.

10

Page 10:

QSGD Compression

• Quantization function

  Q(v_i) = ‖v‖₂ · sgn(v_i) · ξ_i(v)

  where ξ_i(v) = 1 with probability |v_i| / ‖v‖₂, and 0 otherwise.

[Diagram: same encoding as on the previous slide: one float ‖v‖₂ plus the signs and locations of the ≈ √n non-zero entries (≈ √n bits and ints); compression ≈ √n / log n.]

Moral: we're not too happy, since the √n increase in the number of iterations is quite large.

In the paper: we reduce the √n to a small constant (2) by increasing the number of quantization levels. In practice, 4-8 bits are enough.

11
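A sketch of the multi-level idea described above, assuming stochastic rounding onto s uniform levels per sign (the paper's exact scheme, bucketing, and encoding may differ in details):

```python
# Multi-level stochastic quantizer: round |v_i| / ||v||_2 onto the grid
# {0, 1/s, 2/s, ..., 1}, so that E[Q_s(v)] = v.
import numpy as np

def qsgd_quantize_levels(v, s, rng=None):
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    scaled = np.abs(v) / norm * s                            # lies in [0, s]
    lower = np.floor(scaled)                                 # lower grid point
    level = lower + (rng.random(v.shape) < scaled - lower)   # stochastic rounding
    return norm * np.sign(v) * level / s                     # unbiased reconstruction
```

With s = 1 this reduces to the scheme on the previous slides; larger s spends a few more bits per coordinate in exchange for much lower variance, which is how the √n factor drops to a small constant.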

Page 11:

Experimental Setup

Where?
• Amazon p16xLarge (16x NVIDIA K80 GPUs), with NVIDIA DirectConnect
• Microsoft CNTK v2.0, with fast MPI-based communication

What?
• Tasks: image classification (ImageNet, CIFAR) and speech recognition (CMU AN4)
• Nets: ResNet, VGG, AlexNet, and (for speech) LSTM, with default parameters

Why?
• Accuracy vs. speed/scalability

How long?
• >25K GPU hours

12

Page 12:

Experiments: Accuracy

[Plots: ResNet-110 on CIFAR-10; AlexNet on ImageNet.]

13

Page 13:

More Accuracy: please see the paper

AlexNet on ImageNet
Precision | Bucket size | Acc (56) | Acc (89) | Acc (112) | Samples/s: 1 GPU | 2 GPUs | 4 GPUs | 8 GPUs | 16 GPUs
32bit | / | / | / | 59.90% | 240.80 | 301.45 | 328.00 | 272.90 | 192.10
QSGD 16bit | 8192 | / | / | / | / | 388.80 | 508.80 | 500.90 | 335.60
QSGD 8bit | 512 | 59.63% | 60.36% | 60.37% | / | 424.90 | 544.60 | 739.10 | 535.00
QSGD 4bit | 512 | 59.20% | 59.97% | 60.05% | / | 466.50 | 598.70 | 964.90 | 748.50
QSGD 2bit | 128 | 57.09% | 58.09% | 58.17% | / | 449.20 | 609.15 | 1076.50 | 889.80
1bitSGD | / | 59.57% | 60.23% | 60.31% | / | 424.05 | 564.30 | 971.10 | 849.40
1bitSGD* | 64 | 59.26% | 59.92% | 59.99% | / | 370.80 | 476.50 | 761.20 | 712.70
1bitSGD* | 512 | 58.44% | 59.25% | 59.29% | / | / | / | / | /

ResNet-50 on ImageNet
Precision | Bucket size | Acc (60) | Acc (96) | Acc (120) | Samples/s: 1 GPU | 2 GPUs | 4 GPUs | 8 GPUs | 16 GPUs
32bit | / | 70.75% | 74.60% | 74.68% | 47.20 | 80.80 | 142.40 | 247.90 | 272.30
QSGD 16bit | 8192 | / | / | / | / | 90.20 | 156.30 | 275.80 | 348.70
QSGD 8bit | 512 | 70.66% | 75.07% | 75.19% | / | 92.60 | 162.70 | 313.70 | 416.80
QSGD 4bit | 512 | 70.66% | 74.62% | 74.76% | / | 93.90 | 165.70 | 326.10 | 461.20
QSGD 2bit | 128 | / | / | / | / | 93.30 | 178.35 | 330.45 | 472.25
1bitSGD | / | / | / | / | / | 45.10 | 81.70 | 160.15 | 155.20
1bitSGD* | 64 | 70.78% | 74.93% | 75.02% | / | 88.10 | 156.50 | 296.70 | 442.40

ResNet-110 on CIFAR-10
Precision | Bucket size | Acc (80) | Acc (128) | Acc (160) | Samples/s: 1 GPU | 2 GPUs | 4 GPUs | 8 GPUs | 16 GPUs
32bit | / | 87.09% | 93.65% | 93.86% | 343.70 | 555.00 | 957.70 | 1229.10 | 831.60
QSGD 16bit | 8192 | / | / | / | / | 551.00 | 942.70 | 1164.20 | 763.40
QSGD 8bit | 512 | 86.92% | 93.97% | 94.19% | / | 550.20 | 960.10 | 1193.10 | 759.70
QSGD 4bit | 512 | 87.77% | 93.52% | 93.76% | / | 571.10 | 957.40 | 1257.10 | 784.30
QSGD 2bit | 128 | 85.77% | 92.59% | 92.64% | / | 557.20 | 973.10 | 1227.90 | 780.40
1bitSGD | / | 87.13% | 94.02% | 94.15% | / | 465.60 | 643.30 | 610.90 | 406.90
1bitSGD* | 64 | / | / | / | / | 550.40 | 884.80 | 1156.70 | 757.70

ResNet-152 on ImageNet
Precision | Bucket size | Acc (60) | Acc (96) | Acc (120) | Samples/s: 1 GPU | 2 GPUs | 4 GPUs | 8 GPUs | 16 GPUs
32bit | / | / | / | 77.00% | 16.90 | 26.10 | 45.00 | 73.90 | 113.50
QSGD 16bit | 8192 | / | / | / | / | 31.20 | 54.50 | 95.50 | 151.00
QSGD 8bit | 512 | 72.95% | 76.66% | 76.74% | / | 32.80 | 62.70 | 109.20 | 182.50
QSGD 4bit | 512 | / | / | / | / | 33.60 | 60.20 | 121.90 | 203.20
QSGD 2bit | 128 | / | / | / | / | 33.50 | 64.35 | 123.55 | 208.50
1bitSGD | / | / | / | / | / | 10.55 | 22.10 | 41.40 | 63.15
1bitSGD* | 64 | / | / | / | / | 30.40 | 55.50 | 108.10 | 193.50

VGG-19 on ImageNet
Precision | Bucket size | Acc (40) | Acc (64) | Acc (80) | Samples/s: 1 GPU | 2 GPUs | 4 GPUs | 8 GPUs | 16 GPUs
32bit | / | / | / | / | 12.40 | 20.40 | 36.30 | 53.95 | 40.60
QSGD 16bit | 8192 | / | / | / | / | 24.80 | 46.40 | 35.80 | 67.80
QSGD 8bit | 512 | / | / | / | / | 24.20 | 47.50 | 119.50 | 106.60
QSGD 4bit | 512 | / | / | / | / | 27.00 | 52.30 | 151.65 | 143.80
QSGD 2bit | 128 | / | / | / | / | 24.60 | 49.35 | 160.35 | 170.50
1bitSGD | / | / | / | / | / | 22.20 | 43.15 | 117.35 | 120.60
1bitSGD* | 64 | / | / | / | / | 22.90 | 44.80 | 99.15 | 134.30

BN-Inception on ImageNet
Precision | Bucket size | Acc (150) | Acc (240) | Acc (300) | Samples/s: 1 GPU | 2 GPUs | 4 GPUs | 8 GPUs | 16 GPUs
32bit | / | / | / | / | 88.30 | 164.80 | 316.75 | 473.75 | 500.40
QSGD 16bit | 8192 | / | / | / | / | 171.80 | 337.10 | 482.70 | 592.30
QSGD 8bit | 512 | / | / | / | / | 173.60 | 342.50 | 552.90 | 696.30
QSGD 4bit | 512 | / | / | / | / | 174.80 | 346.90 | 593.40 | 743.30
QSGD 2bit | 128 | / | / | / | / | 173.40 | 343.70 | 591.80 | 747.50
1bitSGD | / | / | / | / | / | 127.60 | 236.25 | 336.15 | 321.30
1bitSGD* | 64 | / | / | / | / | 170.30 | 335.10 | 480.50 | 700.40

Figure 5: Results of using MPI on an Amazon EC2 instance. "/" means a number is missing, for one of two reasons: (1) the run might take more than 2 weeks to converge; (2) preliminary results already showed that the run diverges or is significantly worse than full precision (these reasons cover each missing cell). Because there is no data-parallel SGD on 1 GPU, most cells in the 1-GPU column are empty.

Generally, 4 bits are sufficient for recovering or improving accuracy at the same rate (also theoretically sound).

14

Page 14:

Experiments: Scalability

• All experiments ran on ImageNet
• Amazon baseline:

[Plot: 2.3x speedup]

15

Page 15:

Experiments: Scalability

[Plots: annotated speedups 7x, 6.2x, 4x, 2x]

16

Page 16:

Accuracy vs. Time on LSTM

Speech Recognition (3-layer LSTM)

[Plot: 2.5x speedup]

• AN4 dataset, 2x Titan X GPUs

17

Page 17:

What Else Can We Quantize?

[Diagram: Processor (gradient computation); Fast memory (storing the model); Large, slower memory (storing the samples)]

• Iteration: x_{t+1} = x_t − η_t g̃(x_t, e_t)

18

Page 18:

Everything: Gradient, Samples, and Model

[Diagram: Processor (gradient computation); Fast memory (storing the model); Large, slower memory (storing the samples)]

• Iteration: x_{t+1} = x_t − η_t Q_g( g̃( Q_m(x_t), Q_s(e_t) ) )

Can we quantize end-to-end and still get convergence guarantees?

19
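Schematically, the iteration can be written as below; Q_g, Q_m, Q_s and grad_fn are placeholders for the gradient, model, and sample quantizers and for the stochastic gradient g̃, not actual system components:

```python
# Sketch of the end-to-end quantized iteration on this slide:
#   x_{t+1} = x_t - eta * Q_g( g~( Q_m(x_t), Q_s(e_t) ) )
def quantized_step(x, sample, grad_fn, Q_g, Q_m, Q_s, lr):
    g = grad_fn(Q_m(x), Q_s(sample))    # gradient from quantized model + sample
    return x - lr * Q_g(g)              # update with the quantized gradient
```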

Page 19:

ZipML: End-to-end Quantization for Linear Models [Alistarh, Li, Liu, Kara, Zhang & Zhang, ICML 2017]

• Samples are problematic:

  Original gradient:     g̃(x_t) = a_t (a_t^T x_t − b_t)

  Naïve quantization:    g̃(x_t) = Q(a_t) (Q(a_t)^T x_t − b_t)    Not unbiased!

• We can fix this by double sampling:

  g̃(x_t) = Q1(a_t) (Q2(a_t)^T x_t − b_t)

  If the samples are independent, the gradient is again unbiased!

[Diagram: the sample source holds (a_t, b_t), sends (Q1(a_t), Q2(a_t), b_t), and the processor computes g̃(x_t).]

20
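As a concrete illustration of the formula above, a minimal sketch (hypothetical code, not the ZipML implementation) of the double-sampled gradient; `quantize` stands for any unbiased quantizer such as the QSGD sketch earlier:

```python
# Double sampling for least squares: two independent quantizations of the
# same sample a_t keep the gradient estimate unbiased, unlike reusing Q(a_t).
import numpy as np

def double_sampled_gradient(a, b, x, quantize, rng=None):
    rng = rng or np.random.default_rng()
    q1 = quantize(a, rng)        # first quantized copy of the sample
    q2 = quantize(a, rng)        # second, independent quantized copy
    return q1 * (q2 @ x - b)     # E[q1 * (q2^T x - b)] = a * (a^T x - b)
```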

Page 20:

Generalization & Implementation

• We can do end-to-end quantized linear regression with guarantees
• General classification loss functions:

  1. For smooth losses, we can do polynomial approximation, and then multi-sampling (see the sketch below)
  2. For non-smooth losses (e.g. SVM), we can do "re-fetch sampling"

• Implemented on a Xeon+FPGA platform (Intel-Altera HARP)
• Quantized data -> 8-16x more data per transfer
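A rough sketch of item 1, under the assumption that "polynomial approximation + multi-sampling" means approximating the loss derivative by a polynomial in a_t^T x and using k+1 independent quantized copies of the sample for each degree-k term so the estimate stays unbiased (an interpretation, not necessarily the paper's exact construction):

```python
# coeffs: polynomial coefficients c_k of the approximated loss derivative
# (they may depend on the label b_t); the estimator of c_k * (a^T x)^k * a
# uses k+1 independent quantized copies of a, so each factor is unbiased.
import numpy as np

def poly_multisample_gradient(a, x, coeffs, quantize, rng=None):
    rng = rng or np.random.default_rng()
    g = np.zeros_like(x, dtype=float)
    for k, c in enumerate(coeffs):                     # term: c_k * (a^T x)^k * a
        copies = [quantize(a, rng) for _ in range(k + 1)]
        term = c * copies[0]                           # unbiased stand-in for "a"
        for q in copies[1:]:                           # k more copies for (a^T x)^k
            term = term * (q @ x)
        g += term
    return g                                           # unbiased by independence
```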

[Fig. 5: SGD on classification data; all curves represent 64 SGD epochs; training loss vs. time (s) for float CPU 1-thread, float CPU 10-threads, float FPGA, and Qx FPGA. Panels: (a) gisette, step size 1/2^12, Q2 FPGA 6.5x speedup; (b) epsilon, step size 1/2^12, Q8 FPGA 3.6x; (c) mnist (trained for digit 7), step size 1/2^15, Q1 FPGA 10.6x. Speedups shown for Qx FPGA vs. float CPU 10-threads.]

[Fig. 6: SGD on regression data; all curves represent 64 SGD epochs; training loss vs. time (s) for the same configurations. Panels: (a) cadata, step size 1/2^12, Q2 FPGA 3.7x speedup; (b) synthetic100, step size 1/2^9, Q4 FPGA 7.2x; (c) synthetic1000, step size 1/2^9, Q8 FPGA 2x. Speedups shown for Qx FPGA vs. float CPU 10-threads.]

TABLE IV: Data sets used in the experimental evaluation.
Name | Training size | Testing size | # Features | # Classes
cadata | 20,640 | / | 8 | regression
music | 463,715 | / | 90 | regression
synthetic100 | 10,000 | / | 100 | regression
synthetic1000 | 10,000 | / | 1000 | regression
mnist | 60,000 | 10,000 | 780 | 10
gisette | 6,000 | 1,000 | 5,000 | 2
epsilon | 10,000 | 10,000 | 2,000 | 2

updates; that is, each thread works on a separate chunk of data and applies gradient updates to a common model without any synchronization. Although this results in a convergence speedup for sparse data sets, for dense and high-variance ones the convergence quality might be affected negatively, because of the lack of synchronization. hogwild also makes use of vectorization, benefiting from AVX extensions.

Methodology: Since we apply the step size as a bit-shift operation, we choose the one of the following step sizes which results in the smallest loss for the full-precision data after 64 epochs: (1/2^6, 1/2^9, 1/2^12, 1/2^15). With the constant step size determined, we run FPGA-SGD on all the precision variations that we have implemented (Q1, only for classification data; Q2; Q4; Q8; float). For each data set, we present the loss function over time in Figures 5-6, which always contain a single-threaded and a 10-threaded CPU-SGD curve for float, an FPGA-SGD curve for float (for supported dimensions), and an FPGA-SGD curve for the smallest-precision data that has converged within 5% of the original loss. We would like to show the difference in time for all implementations to converge to the same loss, emphasizing the speedup we can achieve with quantized FPGA-SGD over the full-precision variants.

Main results: In Figure 5 we can observe that for all classification data sets the quantized FPGA-SGD achieves a speedup while maintaining convergence quality. For gisette in Figure 5a, Q2 reaches the same loss 6.9x faster than hogwild. Due to the high-variance data in epsilon, both the hogwild and Qx curves appear shaky in Figure 5b. We can see that float FPGA-SGD in this case behaves very nicely, providing both a 1.8x speedup over hogwild and better convergence quality. The results for the music data set are very similar to epsilon, so we omit them due to space constraints. On the mnist data set in Figure 5c, we can even use Q1 without losing any convergence quality, showing that the characteristics of the data set heavily affect the choice of quantization precision, which justifies having multiple Qx implementations. For this one, Q1 FPGA-SGD provides a 10.6x speedup over hogwild and an 11x speedup over float FPGA-SGD. In Figure 6a we see that float FPGA-SGD converges as fast as Q2. The reason for that is the zero padding we apply to be cache-line aligned (see Section III): the cadata data set has only 8 features, meaning a full row fits into one cache line (64 B, which can contain 16 floats) even if the data is in float. Quantizing the data there does not bring any benefits for our implementation, since we pad the remaining bits in a cache line with zeros. The outcome of this experiment, as predicted by our calculations previously, is that for data sets with a small number of features, quantization does not provide a speedup. For synthetic100 and synthetic1000, we need to use Q4 and Q8, respectively, to achieve the same convergence quality as float. We see in Figure 6c that hogwild convergence is slightly faster than float FPGA-SGD, but then Q8 provides a 2x speedup over that. The reason for hogwild

Classification (gisette dataset, n = 5K).


Regression (synthetic dataset, n = 100).

21

Page 21:

Summary

22

Got Applications?

QSGD is a family of lossy compression schemes which trade off bit complexity against convergence time.

QSGD is theoretically optimal, can be practical, and is extensible to end-to-end compression.

Page 22:

Experiments: More Scalability

[Plots: annotated speedups 7x, 6.2x, 6x, 2x]

23

Page 23:

Backup 1: Comparison with 1-bit SGD

24

Page 24:

[Plot: annotated speedups 6x, 2.3x]

25