anovellow-bitquantizationstrategyforcompressingdeep...

Research ArticleA Novel Low-Bit Quantization Strategy for Compressing DeepNeural Networks

Xin Long 1 XiangRong Zeng 1 Zongcheng Ben12 Dianle Zhou3 and Maojun Zhang1

1College of Systems Engineering National University of Defense Technology Changsha 410073 China2College of Computer National University of Defense Technology Changsha 410073 China3College of Intelligent Science National University of Defense Technology Changsha 410073 China

Correspondence should be addressed to XiangRong Zeng zengxronggmailcom

Received 1 September 2019 Revised 9 January 2020 Accepted 22 January 2020 Published 18 February 2020

Academic Editor Leonardo Franco

Copyright copy 2020 Xin Long et al )is is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

)e increase in sophistication of neural network models in recent years has exponentially expanded memory consumption andcomputational cost thereby hindering their applications on ASIC FPGA and other mobile devices )erefore compressing andaccelerating the neural networks are necessary In this study we introduce a novel strategy to train low-bit networks with weightsand activations quantized by several bits and address two corresponding fundamental issues One is to approximate activationsthrough low-bit discretization for decreasing network computational cost and dot-product memory)e other is to specify weightquantization and update mechanism for discrete weights to avoid gradient mismatch With quantized low-bit weights andactivations the costly full-precision operation will be replaced by shift operation We evaluate the proposed method on commondatasets and results show that this method can dramatically compress the neural network with slight accuracy loss

1 Introduction

Deep neural networks such as handwritten characterimage recognition and many burgeoning AI applicationshave achieved great success in recent years [1ndash3] All theseachievements rely on complex deep models In the 2012ILSVRC contest Krizhevsky constructed a multilayernetwork [4] with 60 million parameters and this networkhas exceeded all previous methods in terms of classificationaccuracy However training the entire network requires 2to 3 days Deep networks introduce a large number of layersdue to their complicated structure thereby increasing themodel size (such as 50 200 250 and 500MB for Goo-gleNet ResNet-101 AlexNet and VGG-Net respectively)[5] computational complexity and demand for energyconsumption )erefore embedding these properties ontomobile devices is a large challenge In deep neural net-works the computational cost and memory consumptionare mainly dominated by convolution operation which isexactly the dot-product between weight and activationvector Most existing techniques focus on weight sharing

pruning quantization and activations discretion [6ndash8])ey also exhibit large accuracy drop and high computa-tion during training and testing with float operation In thiswork we introduce a method to train low-bit networks Onone hand this study approximates activations through low-bit discretization On the other hand weight quantizationand special update mechanism for discrete weights areintroduced With quantized low-bit network weights andoutput activations the costly full-precision convolutionaloperation will be replaced by shift operation and marginalaccuracy cost will decrease slightly Our method will beimportant on embedded devices such as ASIC or FPGA forAI

2 Related Work

In this section we discuss related work from followingaspects

(i) Pruning and Sharing Parameter pruning andsharing has been used both to reduce the complexity

HindawiComputational Intelligence and NeuroscienceVolume 2020 Article ID 7839064 7 pageshttpsdoiorg10115520207839064

of neural network and to avoid model overfitting[6 9ndash11] propose method to find and prune theredundant connections with small weight valuesquantize the weights via weight sharing )e run-time memory saving and compression effect arevery limited by those simple methods

(ii) Structured Pruning and Sparsifying Generallyspeaking L1 norm L2 norm Group Lasso andother regularization terms are efficient ways tolearning sparse weight structures in numerous re-searches Wen et al [12] proposes StructuredSparsity Learning by using Group Lasso to sparsifymultiple DNN structures (filters channels and evenlayers) Besides the authors of [13ndash16] also try totrain network with sparsity regularizer and trans-form the problem of measuring channel importanceinto optimization problem

(iii) Special Neural Architecture Reducing calculationFLOPs and accelerating inference process ofneural networks by designing special architectureRelated researches including Mobile-Net [17 18]Squeeze-Net [19] and Shuffle-Net [20] by adoptingconvolutional filters of small size depth-wise con-volution operations

(iv) Weight and Activation Quantization Our proposedquantization method also falls into this categoryLow-bit quantization methods mean that the net-work weights and activations are expressed bydiscrete values according to special mathematicalmethod which could replace the costly originalfloating-point operations by only accumulations oreven binary logic operations )e authors of [21 22]firstly constrain the weights to the binary and ter-nary space It follows that both weights and acti-vations are mapped into binary space or ternaryspace ie binary neural networks (BNN) [7]XNOR-Net [8] and ternary neural networks (TNN)[23] which directly replace multiply-accumulateoperations by logic operations DoReFa-Net [24]not only quantizes weights and activations but alsoquantizes gradients to low-bit width floating-pointnumbers with discrete states in the backwardpropagation

3 Low-Bit Neural Networks

In this section we concentrate on training quantized low-bitnetworks Specifically the activations of layer output arequantized by either zero or powers of two to reduce storageand calculations )e weights of network are also restrictedin the same way to obtain a sparse model By constrainingweights and activations to zero or powers of two the costlyfloating-point multiplication operation can be replaced bycheaper shift operations [13]

31 Dot-product Function Deep neural networks generallyconsist of multiple layers and each neuron in different layerscomputes activation function

z f wTx + b1113872 1113873 (1)

where z is output activation x is input vector x is weightvector b is bias and f is a nonlinear function such as ReLUGiven the convolutional networks the computationalcomplexity is mainly dominated by the convolution oper-ation )e key point of quantization for compression onhardware applications can be summarized into two aspectsOne is the large memory required to store weights andactivations )e other is the computational cost required tocompute large numbers of dot-products)e difficulty lies infloating-point arithmetic which is limited in practical ap-plications [5] and is examined in this study Figure 1 showsthe schematic of standard convolution process and ourmethod (DST will be introduced in Section 33)

32 Low-Bit Activation Approximation In this section wehave proposed a novel approximation strategy for activa-tions quantization and corresponding suitable methods tokeep the efficiency of backpropagation

321 Forward Approximation Process In accordance withthe discussion above the activations of network are quan-tized by either zero or powers of two in this section )eoptimization model is formulated as follows

qi+1 2qi qi ge 0( 1113857

P(x) qi 0lexisin ti ti+1( 1113859

minus qi 0gexisin minus ti+1 minus ti1113858 11138571113896

⎧⎪⎪⎨

⎪⎪⎩(2)

where numerous parameter values within the interval(ti ti+1] ([minus ti+1 minus ti)) are quantized into a common valueqi(minus qi) and P(x) is our new defined discrete activationfunction We attempt to find the mean-squared error of allvalues for obtaining the optimal quantization method )usoptimization model (2) could be transformed into the fol-lowing model

qi+1 2qi

Plowast(x) argminP

1113946φ(x)(P(x) minus x)2dx

⎧⎪⎪⎨

⎪⎪⎩(3)

where φ(x) is the probability density function of x Fol-lowing Cairsquos implementation [4] we apply batch normali-zation to the dot-product in (1) to determine the closeness ofdistributions to Gaussian with zero mean and unit varianceAccordingly the optimal solution of (3) can be acquired byLloydrsquos algorithm [25] As a result the best partition is

P1 x 0le xle x11113864 1113865

P2 x x1 ltxlex21113864 1113865

middot middot middot

Pvminus 1 x xvminus 2 lt xlexvminus 11113864 1113865

Pv x xvminus 1 lt xleinfin1113864 1113865

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎩

(4)

where Pv denotes different value interval of x )e endpointsof each interval are

2 Computational Intelligence and Neuroscience

x1 12

q1 + q2( 1113857

x2 12

q2 + q3( 1113857


xvminus 1 12

qvminus 1 + qv( 1113857

xv 12

qv + qv+1( 1113857

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(5)

where we set up q1 0 and consider the symmetry of in-terval for xlt 0 )erefore the final optimization function ofour quantizer is

argminq2

1113946q22

0φ(x)x

2dx + 11139463q22

q22φ(x) q2 minus x( 1113857

2dx1113896

+ 1113944n

i31113946

(32)2iminus 2q2

(32)2iminus 3q2

φ(x) 2iminus 2q2 minus x1113872 1113873

2dx

+ 1113946+infin

(32)2iminus 2q2

φ(x) 2nminus 1q2 minus x1113872 1113873

2dx1113897 (6)

where φ(x) is the probability density function of standardnormal distribution n is the number of bits of activationfunction Only one variable is considered in (6) )us theabove-mentioned formula has a theoretical solutionHowever we adopt the genetic algorithm in the experimentsgiven the difficulty in solving segmented variable limit in-tegral Table 1 shows the optimal error of different q2 valuesWith the further refinement of q2 we still obtain the sameerror value of 00189

322 Backward Approximation Process Since dot-productvalues are equal within the same interval after using pro-posed forward approximation method zero derivative al-most everywhere )us we have proposed a better possible

backward solution here and the final experimental resultsprove its feasibility in backpropagation process

For 0le xlex1 we approximate all values in this intervalto be zero similar to ReLU function which does not need toupdate Considering Gaussian distribution of dot-productmentioned above plenty of activations fall into the intervalnear zero We keep the gradient of this part as it was For ourquantization method where activations are within intervalPv has tiny probability In this case we need to limit theirupdates preventing them from updating to other intervalsand keep network accuracy )e derivative of quantizationfunction has the following form

zC

zP

0 0le xlex1

1 x isin x1 xvminus 1( 1113859

1x minus xvminus 1 minus 1( 1113857( 1113857

xgtxvminus 1

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎩

(7)

For xlt 0 consider interval symmetry In our final ex-periment we find this method keeping the efficiency ofbackpropagation and making learning stable

33 Low-Bit Weight Quantization )e weight quantizationshown above can be solved using various methods such asBWN DoReFa-Net and XNOR [8 21 24] However wehave to save full-precision weights in backward computationin these networks this approach may cause frequent data

Input Weight

Conv

+Bias

ReLU

Activation

(a)

Input Weight Conv (bit count)

+Bias (false)

Activation

N bit

DST

M bit

Quantize

M bit

Propagation

(b)

Figure 1 Convolution operation pipeline (a) General convolution operation without quantization of weight and activation (b) Descriptionof proposed method with weight and activation quantized by low-bit

Table 1 Expect error of different q2 values

Scheme q2 00625 q2 0125 q2 025 q2 05 q2 1

n 3 04078 03298 02106 00825 00458n 4 03298 02103 00795 00239 00443n 5 02102 00791 00209 00223 00443n 6 00790 00205 00193 00223 00443n 7 00204 00189 00193 00223 00443n 8 00189 00189 00193 00223 00443

Computational Intelligence and Neuroscience 3

exchange between external memory and parameter storage[26] In this section we propose a simple discretizationfunction that maps weights into either zero or powers of two)is way replaces the floating-point operation by shift op-erations on hardware in backward process and avoids largecomputation and memory in hardware deployment

331 Weight Quantization in Forward Process At theoutset we have considered weight discretization in forwardprocess and updated them in constrained discrete domainHowever the weight is quantized into a discrete sequence ofequal ratios here which is difficult to update for the cor-responding prescribed quantized values in backpropagation)e nonuniform distribution of discrete values is the mainproblem Similar works such as BWN DoReFa-Net andXNOR the derivative of weight in those ways is zero almosteverywhere making it apparently incompatible with back-propagation and the gradient computation is based on thestored full-precision weights and frequent data exchange isrequired during the training phase In view of this we seek todirectly discrete network weights to either zero or powers oftwo in the backward process to avoid gradient mismatchproblem other than forward process

332 Weight Quantization in Backward Process We in-troduce a weight update mechanism for discrete values inthe backward process to avoid gradient mismatch Fromprevious works we find that the weight value can be con-strained to [minus 1 1] in our quantization method At the be-ginning we introduce discrete state transformation (DST)problem for later use We let Δw be the variation in weightw be the updated weight and wprime be the raw weight )us

Δw w minus wprime (8)

L is the minimum interval of defined quantizationweight for (0plusmn2minus 2plusmn2minus 1plusmn20) and L is 2minus 2 For conve-nience seven possible integer states (0plusmn1plusmn2plusmn4) areconsidered when we constrain weight to (0plusmn2minus 2plusmn2minus 1plusmn20)Continuous weights need to be mapped into these discreteinteger states Accordingly we adopt round operation

w_state roundx

2minus 21113874 1113875 (9)

where round is round operation in math and x is the ar-bitrary value within [minus 1 1] wstate plusmn3 is not the defineddiscrete weight stated above )us we introduce the bino-mial distribution to jump into the defined integer state onboth sides

wstateprime p times plusmn211113872 1113873 +(1 minus p) times plusmn221113872 1113873 (10)

where the positive and negative signs are both positive orboth negative at the same time and p has a probability of 0or 1 (we use random number for p which has equalprobability to be 0 or 1) Figure 2 shows the above-men-tioned process

Finally the weight state needs to be transformed intodefined weight value

w w_state times 2minus 2 (11)

In this way we can transform continuous weights intodefined discrete weights successfully We transform theweight variation into defined discrete state transition Firstwe decompose Δw into integer and decimal parts by theminimum interval of quantization weight

k sign(Δw) times floor|Δw|

L1113888 1113889

v Δw minus k times L

(12)

where floor represents the round down k is the integernumber of weight state transition and v is the fine tuningparameter of weight state )us the final state transitionnumber is

Δwprime (sign(Δw) times gate + k) times L (13)

where gate submits to submits to binomial distributionwhich has the opportunity p1 to be 1 and opportunity 1 minus p1to be 0 p1 is defined by fine tuning parameter v

p1 tanh(th times|vL|) (14)

where th is a positive constant to adjust the state fine tuningprobability p1 which will be explored in the experimentsFinally we use the DSTfunction which is introduced aboveto obtain the final quantized weight

w_new DST wprime + Δwprime( 1113857 (15)

In this way we constrain all weights to(0plusmn2minus 2plusmn2minus 1plusmn20) For other values the same theory asdescribed above applies

4 Results and Discussion

In this section we evaluate our proposed algorithm onMNIST (LeNet5) SVHN (VGG) and CIFAR10 (ResNet-18)for image classification by Pytorch Most previous works donot quantize the first and last layers In our method we donot quantize the first layer only Furthermore we report theaveraged results over three runs for each experiment byAdaptive Moment Estimation optimizer (ADAM)

41 Exploration the Quantization Combination of Weightsand Activations We illustrate the behavior of the differentcombinations of weights and activations with a standardResNet-18 on the CIFAR10 dataset We quantize all weightsinto (0plusmn20) (0plusmn2minus 1plusmn20) and (0plusmn2minus 2plusmn2minus 1plusmn20) For theactivation approximation we use q2 0125 025 05 and 1as shown in Figure 1 For convenience we set [p q2] todefine the quantization combination mode where p minus 2represents above (0plusmn2minus 2plusmn2minus 1plusmn20) and the value of q2

ndash4 ndash3 ndash2 0 2 3 4ndash1 1

p p1 ndash p 1 ndash p

Figure 2 Binominal choice of undefined states for wstate plusmn3


determines the activation approximation degree After crosscombination we set th 05 here and the results are shownin Figure 3

In general weight quantization causes some accuracydegradation Figure 3 confirms that accuracy increases withthe deep degree of weight quantization However differentapproximation methods for activation do not influence testaccuracy dramatically but fluctuation during training oc-curs Our method is also evaluated on other datasets Table 2shows the comparison results under same conditions and theresults from [27] As elaborated above BWN TWN andXNOR methods quantize weights to 1 or 2 bits of floatingpoint every layer but not in the entire network However ourmethod achieves 2 or 3 bits of fixed-point in the entirenetwork and can be used with shift operation on ASIC orFPGA To demonstrate the effectiveness of proposed methodwe also show the comparison results on CIFAR100 with morecomplex model (ResNet-34 ResNet-50) as shown in Table 3

42 Effect with the Change of th We explore the effect ofparameter th in this section As explained above th adjuststhe weight state fine tuning probability to influence the finallearning accuracy Figure 4 shows the results which indicateexcellent nonlinearity Here we test the combination[minus 3 0125] Evidently the curve has the best accuracy atapproximately th 05 whereas larger or smaller values may

result in slight improvement )e same result is obtained forother combinations after several experiments )us weadopt th 05 for all experiments in this study

43 Influence of the First and Last Layer Quantization)e first and last layers are critical to network quantizationresearch according to previous works In the current studyall our experiments do not quantize the first layer only Weattempt to investigate the influence of first layer quantiza-tion )e results are summarized in Table 4 We test theweight and activation quantization combination [minus 3 0125]here ldquo+rdquo and ldquominus rdquo indicate with or without weight quanti-zation of the corresponding layer

Evidently accuracy degradation may occur when quan-tizing the first or last layer Our method is slightly better thanBNNbut is not better than BWNwhich quantizes weights only

44 Parameter Sparsity Most of the current AI applicationsare based on ResNet)us we analyze parameter sparsity onResNet-18 Previous methods clip a large number of weightsby setting most weights of small values to zero but not to beexactly zero [28] By contrast our method can obtain precise

Table 2 Test error comparison on multiple datasets

Method Weight(bit)

Activation(bit) MNIST SVHN CIFAR10

BNN 1 1 127 253 846BWN 1 32 054 mdash 725TWN 2 32 065 mdash 744DoReFa 8 8 mdash 230 mdashOurs 3 3 096 214 748

Table 3 Accuracy comparison on CIFAR100

avb BNN XNOR OursResNet-34 48817832 53288129 61338722ResNet-50 52078160 59208532 62928865

00 05 10 15 20 25 30th

95

90

85

80

75

70

65

60

Accu

racy

Figure 4 Comparison of accuracy with different combinations ofquantized weights and activations )e horizontal axis shows theactivation approximation bits and the vertical axis represents thequantization bits of network weight

9400

9200

9000

8800

8600

8400

8200

Accu

racy

Activation Q20125 0250 0500 1000

p = 0p = ndash1p = ndash2

p = ndash3p = ndash4


Table 4 Accuracy comparison with quantization of first or lastconvolutional layer

CIFAR10MNIST BWN BNN Ours+ First minus last 92379937 91409866 92089886+ First + last 92219941 91309852 91969855minus First + last 92529938 91479871 92529875minus First minus last 92759946 91549873 92129904


zero value weights )e results of our method using thecombination [minus 3 0125] are shown in Table 5

Evidently our method can obtain large sparsity onconvolutional layer parameters and several top layers of thenetwork may be valuable for final evaluation )e back layeris sparser than the front one which may be pruned in ourfuture work As an attempt we prune the pretty sparse layers(conv19 conv20) finding accuracy dropping little andobtaining more compact layers More meaningfully trainingand inference time are reduced in a certain extent whichmaysignificant for hardware implementations

5 Conclusions

In deep networks computational cost and storage capacityare key factors that directly affect the learning performanceCompression and acceleration of networks aim to reduce theredundancy of complexmodels Accordingly we introduce amethod to train networks with weights and activationsquantized by several bits We find that our method dropsnetwork accuracy slightly whereas it decreases storage andcomputation sharply Interestingly our quantified modelhas evident sparsity whichmay be pruned on ASIC or FPGAfor AI in the future

Data Availability

)e data used to support the findings of this study are opendatasets which could be found in general websites and thedatasers are also freely available

Conflicts of Interest

)e authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

)is research was supported in part by the National NatureScience Foundation of China (Grants No 61602494) andNatural Science Foundation of Hunan Province

References

[1] G E Hinton N Srivastava and K Swersky Neural Networksfor Machine Learning Vol 264 University of Toronto Tor-onto Canada 2012

[2] D Bahdanau K Cho and Y Bengio ldquoNeural machinetranslation by jointly learning to align and translaterdquo 2014httparxivorgabs14090473

[3] K He X Zhang S Ren and J Sun ldquoDelving deep intorectifiers surpassing human-level performance on imagenetclassificationrdquo in Proceedings of the 2015 IEEE InternationalConference on Computer Vision (ICCV) pp 1026ndash1034Santiago Chile December 2015

[4] A Krizhevsky I Sutskever and G E Hinton ldquoImageNetclassification with deep convolutional neural networksrdquoCommunications of the ACM vol 60 no 6 pp 84ndash90 2017

[5] Z Cai X He J Sun and N Vasconcelos ldquoDeep learning withlow precision by half-wave Gaussian quantizationrdquo in Pro-ceedings of the 2017 IEEE Conference on Computer Vision andPattern Recognition (CVPR) pp 5406ndash5414 Honolulu HIUSA July 2017

[6] S Han H Mao and W J Dally ldquoDeep compressioncompressing deep neural networks with pruning trainedquantization and huffman codingrdquo in Proceedings of theInternational Conference on Learning Representations SanJuan PR USA May 2016

[7] M Courbariaux I Hubara D Soudry R El-Yaniv andY Bengio ldquoBinarized neural networks training deep neuralnetworks with weights and activations constrained to+ 1 orminus 1rdquo 2016 httparxivorgabs160202830

Table 5 Sparsity of ResNet-18 on CIFAR10

Layers (weight tensors) Full precision (1 minus sparsity) () Our method (1 minus sparsity) ()Conv1 (64 3 3 3) 100 100Conv2 (64 64 3 3) 100 8532Conv3 (64 64 3 3) 100 8671Conv4 (64 64 3 3) 100 8584Conv5 (64 64 3 3) 100 8510Conv6 (128 64 3 3) 100 8604Conv7 (128 128 3 3) 100 8346Conv8 (128 64 1 1) 100 8652Conv9 (128 128 3 3) 100 8288Conv10 (128 128 3 3) 100 8075Conv11 (256 128 3 3) 100 7745Conv12 (256 256 3 3) 100 7023Conv13 (256 128 1 1) 100 7774Conv14 (256 256 3 3) 100 5951Conv15 (256 256 3 3) 100 4264Conv16 (512 256 3 3) 100 2216Conv17 (512 512 3 3) 100 1072Conv18 (512 256 1 1) 100 4156Conv19 (512 512 3 3) 100 502Conv20 (512 512 3 3) 100 3461 minus Sparsity 100 2332Accuracy 9374 9252


[8] M Rastegari V Ordonez J Redmon and A FarhadildquoXNOR-Net ImageNet classification using binary convolu-tional neural networksrdquo 2016 httparxivorgabs160305279

[9] S Han X Liu H Mao et al ldquoEierdquo ACM Sigarch ComputerArchitecture News vol 44 no 3 pp 243ndash254 2016

[10] S Han X Liu H Mao et al ldquoDeep compression and EIEefficient inference engine on compressed deep neural net-workrdquo in Proceedings of the 2016 IEEE Hot Chips 28 Sym-posium (HCS) pp 1ndash6 Cupertino CA USA August 2016

[11] B Liu M Wang H Foroosh M Tappen and M PenksyldquoSparse convolutional neural networksrdquo in Proceedings of the2015 IEEE Conference on Computer Vision and Pattern Rec-ognition (CVPR) pp 806ndash814 Boston MA USA June 2015

[12] W Wen C Wu Y Wang Y Chen and H Li ldquoLearningstructured sparsity in deep neural networksrdquo 2016 httparxivorgabs160803665

[13] C Leng H Li S Zhu and R Jin ldquoExtremely low bit neuralnetwork squeeze the last bit out with ADMMrdquo 2017 httparxivorgabs170709870

[14] T Zhang S Ye K Zhang et al ldquoA systematic DNN weightpruning framework using alternating direction method ofmultipliersrdquo Computer VisionmdashECCV 2018 Springervol 11212 pp 191ndash207 Cham Switzerland 2018

[15] P Molchanov S Tyree T Karras T Aila and J KautzldquoPruning convolutional neural networks for resource efficientinferencerdquo 2017 httparxivorgabs161106440

[16] X Long Z Ben X Zeng Y Liu M Zhang and D ZhouldquoLearning sparse convolutional neural network via quanti-zation with low rank regularizationrdquo IEEE Access vol 7pp 51866ndash51876 2019

[17] M Sandler A Howard M Zhu A Zhmoginov andL-C Chen ldquoMobilenetv2 inverted residuals and linearbottlenecksrdquo in Proceedings of the 2018 IEEECVF Conferenceon Computer Vision and Pattern Recognition pp 4510ndash4520Salt Lake City UT USA June 2018

[18] A Howard M Zhu B Chen et al ldquoMobilenets efficientconvolutional neural networks for mobile vision applica-tionsrdquo 2017 httparxivorgabs170404861

[19] F N Iandola S Han M W Moskewicz K AshrafW J Dally and K Keutzer ldquoSqueezeNet AlexNet-level ac-curacy with 50x fewer parameters and lt05MB model sizerdquo2016 httparxivorgabs160207360

[20] X Zhang X Zhou M Lin and J Sun ldquoShufflenet an ex-tremely efficient convolutional neural network for mobiledevicesrdquo in Proceedings of the 2018 IEEECVF Conference onComputer Vision and Pattern Recognition pp 6848ndash6856 SaltLake City UT USA June 2018

[21] M Courbariaux Y Bengio and J-P David ldquoBinaryconnecttraining deep neural networks with binary weights duringpropagationsrdquo 2015 httparxivorgabs151100363

[22] F Li B Zhang and B Liu ldquoTernary weight networksrdquo 2016httparxivorgabs160504711

[23] N Mellempudi A Kundu D Mudigere D Das B Kaul andP Dubey ldquoTernary neural networks with fine-grainedquantizationrdquo 2017 httparxivorgabs170501462

[24] S Zhou Y Wu Z Ni X Zhou HWen and Y Zou ldquoDorefa-net training low bitwidth convolutional neural networks withlow bitwidth gradientsrdquo 2016 httparxivorgabs151100363

[25] S Lloyd ldquoLeast squares quantization in PCMrdquo IEEE Trans-actions on Information eory vol 28 no 2 pp 129ndash1371982

[26] L Deng P Jiao J Pei Z Wu and G Li ldquoGxnor-net trainingdeep neural networks with ternary weights and activationswithout full-precision memory under a unified discretizationframeworkrdquo Neural Networks vol 100 pp 49ndash58 2018

[27] S Wu G Li F Chen and L Shi ldquoTraining and inference withintegers in deep neural networksrdquo 2018 httparxivorgabs180204680

[28] A Torfi R A Shirvani S Soleymani and N M NasrabadildquoAttention-Based guided structured sparsity of deep neuralnetworksrdquo 2018 httparxivorgabs180209902


of neural network and to avoid model overfitting[6 9ndash11] propose method to find and prune theredundant connections with small weight valuesquantize the weights via weight sharing )e run-time memory saving and compression effect arevery limited by those simple methods

(ii) Structured Pruning and Sparsifying Generallyspeaking L1 norm L2 norm Group Lasso andother regularization terms are efficient ways tolearning sparse weight structures in numerous re-searches Wen et al [12] proposes StructuredSparsity Learning by using Group Lasso to sparsifymultiple DNN structures (filters channels and evenlayers) Besides the authors of [13ndash16] also try totrain network with sparsity regularizer and trans-form the problem of measuring channel importanceinto optimization problem

(iii) Special Neural Architecture Reducing calculationFLOPs and accelerating inference process ofneural networks by designing special architectureRelated researches including Mobile-Net [17 18]Squeeze-Net [19] and Shuffle-Net [20] by adoptingconvolutional filters of small size depth-wise con-volution operations

(iv) Weight and Activation Quantization Our proposedquantization method also falls into this categoryLow-bit quantization methods mean that the net-work weights and activations are expressed bydiscrete values according to special mathematicalmethod which could replace the costly originalfloating-point operations by only accumulations oreven binary logic operations )e authors of [21 22]firstly constrain the weights to the binary and ter-nary space It follows that both weights and acti-vations are mapped into binary space or ternaryspace ie binary neural networks (BNN) [7]XNOR-Net [8] and ternary neural networks (TNN)[23] which directly replace multiply-accumulateoperations by logic operations DoReFa-Net [24]not only quantizes weights and activations but alsoquantizes gradients to low-bit width floating-pointnumbers with discrete states in the backwardpropagation

3 Low-Bit Neural Networks

In this section we concentrate on training quantized low-bitnetworks Specifically the activations of layer output arequantized by either zero or powers of two to reduce storageand calculations )e weights of network are also restrictedin the same way to obtain a sparse model By constrainingweights and activations to zero or powers of two the costlyfloating-point multiplication operation can be replaced bycheaper shift operations [13]

31 Dot-product Function Deep neural networks generallyconsist of multiple layers and each neuron in different layerscomputes activation function

z f wTx + b1113872 1113873 (1)

where z is output activation x is input vector x is weightvector b is bias and f is a nonlinear function such as ReLUGiven the convolutional networks the computationalcomplexity is mainly dominated by the convolution oper-ation )e key point of quantization for compression onhardware applications can be summarized into two aspectsOne is the large memory required to store weights andactivations )e other is the computational cost required tocompute large numbers of dot-products)e difficulty lies infloating-point arithmetic which is limited in practical ap-plications [5] and is examined in this study Figure 1 showsthe schematic of standard convolution process and ourmethod (DST will be introduced in Section 33)

32 Low-Bit Activation Approximation In this section wehave proposed a novel approximation strategy for activa-tions quantization and corresponding suitable methods tokeep the efficiency of backpropagation

321 Forward Approximation Process In accordance withthe discussion above the activations of network are quan-tized by either zero or powers of two in this section )eoptimization model is formulated as follows

qi+1 2qi qi ge 0( 1113857

P(x) qi 0lexisin ti ti+1( 1113859

minus qi 0gexisin minus ti+1 minus ti1113858 11138571113896

⎧⎪⎪⎨

⎪⎪⎩(2)

where numerous parameter values within the interval(ti ti+1] ([minus ti+1 minus ti)) are quantized into a common valueqi(minus qi) and P(x) is our new defined discrete activationfunction We attempt to find the mean-squared error of allvalues for obtaining the optimal quantization method )usoptimization model (2) could be transformed into the fol-lowing model

qi+1 2qi

Plowast(x) argminP

1113946φ(x)(P(x) minus x)2dx

⎧⎪⎪⎨

⎪⎪⎩(3)

where φ(x) is the probability density function of x Fol-lowing Cairsquos implementation [4] we apply batch normali-zation to the dot-product in (1) to determine the closeness ofdistributions to Gaussian with zero mean and unit varianceAccordingly the optimal solution of (3) can be acquired byLloydrsquos algorithm [25] As a result the best partition is

P1 x 0le xle x11113864 1113865

P2 x x1 ltxlex21113864 1113865


Pvminus 1 x xvminus 2 lt xlexvminus 11113864 1113865

Pv x xvminus 1 lt xleinfin1113864 1113865

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎩

(4)

where Pv denotes different value interval of x )e endpointsof each interval are


x1 12

q1 + q2( 1113857

x2 12

q2 + q3( 1113857


xvminus 1 12

qvminus 1 + qv( 1113857

xv 12

qv + qv+1( 1113857

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(5)


argminq2

1113946q22

0φ(x)x

2dx + 11139463q22

q22φ(x) q2 minus x( 1113857

2dx1113896

+ 1113944n

i31113946

(32)2iminus 2q2

(32)2iminus 3q2


2dx

+ 1113946+infin

(32)2iminus 2q2


2dx1113897 (6)





zC

zP

0 0le xlex1



xgtxvminus 1

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎩

(7)



Input Weight

Conv

+Bias

ReLU

Activation

(a)


+Bias (false)

Activation

N bit

DST

M bit

Quantize

M bit

Propagation

(b)



Scheme q2 00625 q2 0125 q2 025 q2 05 q2 1

n 3 04078 03298 02106 00825 00458n 4 03298 02103 00795 00239 00443n 5 02102 00791 00209 00223 00443n 6 00790 00205 00193 00223 00443n 7 00204 00189 00193 00223 00443n 8 00189 00189 00193 00223 00443







w_state roundx

2minus 21113874 1113875 (9)








L1113888 1113889


(12)























Method Weight(bit)





00 05 10 15 20 25 30th

95

90

85

80

75

70

65

60

Accu

racy


9400

9200

9000

8800

8600

8400

8200

Accu

racy

Activation Q20125 0250 0500 1000









5 Conclusions


Data Availability




Acknowledgments


References

































x1 12

q1 + q2( 1113857

x2 12

q2 + q3( 1113857


xvminus 1 12

qvminus 1 + qv( 1113857

xv 12

qv + qv+1( 1113857

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(5)


argminq2

1113946q22

0φ(x)x

2dx + 11139463q22

q22φ(x) q2 minus x( 1113857

2dx1113896

+ 1113944n

i31113946

(32)2iminus 2q2

(32)2iminus 3q2


2dx

+ 1113946+infin

(32)2iminus 2q2


2dx1113897 (6)





zC

zP

0 0le xlex1



xgtxvminus 1

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎩

(7)



Input Weight

Conv

+Bias

ReLU

Activation

(a)


+Bias (false)

Activation

N bit

DST

M bit

Quantize

M bit

Propagation

(b)



Scheme q2 00625 q2 0125 q2 025 q2 05 q2 1

n 3 04078 03298 02106 00825 00458n 4 03298 02103 00795 00239 00443n 5 02102 00791 00209 00223 00443n 6 00790 00205 00193 00223 00443n 7 00204 00189 00193 00223 00443n 8 00189 00189 00193 00223 00443







w_state roundx

2minus 21113874 1113875 (9)








L1113888 1113889


(12)























Method Weight(bit)





00 05 10 15 20 25 30th

95

90

85

80

75

70

65

60

Accu

racy


9400

9200

9000

8800

8600

8400

8200

Accu

racy

Activation Q20125 0250 0500 1000









5 Conclusions


Data Availability




Acknowledgments


References






































w_state roundx

2minus 21113874 1113875 (9)








L1113888 1113889


(12)























Method Weight(bit)





00 05 10 15 20 25 30th

95

90

85

80

75

70

65

60

Accu

racy


9400

9200

9000

8800

8600

8400

8200

Accu

racy

Activation Q20125 0250 0500 1000









5 Conclusions


Data Availability




Acknowledgments


References









































Method Weight(bit)





00 05 10 15 20 25 30th

95

90

85

80

75

70

65

60

Accu

racy


9400

9200

9000

8800

8600

8400

8200

Accu

racy

Activation Q20125 0250 0500 1000









5 Conclusions


Data Availability




Acknowledgments


References



































5 Conclusions


Data Availability




Acknowledgments


References

































anovellow-bitquantizationstrategyforcompressingdeep...

Documents