
Page 1: Energy-Efficient Convolutional Neural Networks with Deterministic Bit-Stream Processing (DATE 2019)

S. Rasoul Faraji, M. Hassan Najafi, Bingzhe Li, Kia Bazargan, and David J. Lilja

[email protected]


Page 2: Overview


• An introduction to Stochastic Computing (SC)

• Advantages, weaknesses, binary to stochastic number generation

• SC-based Neural Networks

• Hybrid Stochastic-Binary Neural Networks

• Deterministic Approaches to SC

• Current methods, LD-based methods

• Proposed Hybrid NN Design

• Case Study: LeNet-5 NN

• Performance Evaluation

• Cost Comparison

• Conclusion


Page 3: Introduction


• Stochastic computing (SC)

• A re-emerging computing paradigm, first introduced in the 1960s

• An approximate computing approach for many years

• Logical computation on random (or unary) bit-streams

• All digits have the same weight, numbers limited to the [0, 1] interval

• Value: the probability of observing a 1 versus a 0 (e.g., 11100, 10101, and 1011011100 all represent 0.6)

• Advantages

• Noise tolerance: e.g., a single bit flip in 0010000011000000 (3/16) only shifts the value to 4/16 or 2/16

• Low hardware cost

• Progressive precision [Alaghi et al, DAC’13]

• Skew tolerance [Najafi et al, TC’17] [Alaghi et al., JETC’17]

[Figure: stochastic multiplication using an AND gate]
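As a minimal illustration of the AND-gate multiplication in the figure, here is a Python sketch; the streams and names are illustrative, not from the slides:

```python
# Stochastic multiplication: bitwise AND of two bit-streams.
# For independent streams, P(out = 1) = P(a = 1) * P(b = 1).

def sc_multiply(stream_a, stream_b):
    return [bit_a & bit_b for bit_a, bit_b in zip(stream_a, stream_b)]

a = [1, 0, 1, 1, 0, 1, 0, 1]   # encodes 5/8
b = [1, 1, 0, 1, 1, 0, 1, 1]   # encodes 6/8
out = sc_multiply(a, b)
print(sum(out) / len(out))     # 0.375 here; exact product is 5/8 * 6/8 = 0.46875
```

The gap between 0.375 and the exact product illustrates the random-fluctuation weakness that the deterministic approaches later in the deck remove.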

Page 4: Introduction


• Converting from binary to stochastic bit-stream representation

• Set the Constant Number register to your target value

• Use a number source as the second input of the comparator

• Output a 1 if “Number Source” ≤ “Constant Number”
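A minimal Python sketch of this converter, with a simple counter standing in for the number source (an LFSR or Sobol generator would be used in practice; the function name is illustrative):

```python
# Comparator-based binary-to-stochastic conversion: output a 1 whenever the
# number source falls below the Constant Number register, so the fraction
# of 1s in the stream equals constant / 2**n_bits.

def binary_to_bitstream(constant, n_bits):
    length = 1 << n_bits
    # A strict '<' against a 0..2**n - 1 counter yields exactly 'constant' ones;
    # the slide's '<=' convention corresponds to a source that starts at 1.
    return [1 if source < constant else 0 for source in range(length)]

stream = binary_to_bitstream(12, 4)   # target value 12/16 = 0.75
print(sum(stream), len(stream))       # -> 12 16
```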


Page 5: SC-based Neural Networks


• SC has been used for low-cost implementations of convolutional neural networks (CNNs)

• Multiplication is an essential and costly operation in conventional NN designs

• In the stochastic domain, it can be implemented with simple, standard AND gates

• Significant savings in hardware area and power costs

• However, two main barriers in wide adoption of SC-based NNs

• Quality degradation

• Random fluctuation in generating bit-streams and correlation between bit-streams

• High processing time

• Long bit-streams are inevitable for acceptable accuracy

• Translates to high energy consumption

• Higher than that of conventional fixed-point designs


Page 6: SC-based Neural Networks


• SkippyNN: An Embedded Stochastic-Computing Accelerator for Convolutional Neural Networks, DAC’19

• A Stochastic Computational Multi-Layer Perceptron with Backward Propagation, IEEE TC’18

• Sign-magnitude SC: getting 10X accuracy for free in stochastic computing for deep NN, DAC’18

• DPS: Dynamic Precision Scaling for SC-based deep neural networks, DAC’18

• An Energy-Efficient Stochastic Computational Deep Belief Network, DATE’18

• A New SC Multiplier with Application to Deep Convolutional Neural Networks, DAC’17

• Energy-Efficient Hybrid Stochastic-Binary Neural Networks for Near-Sensor Computing, DATE’17

• Structural Design Optimization for Deep Convolutional Neural Networks using SC, DATE’17

• Dynamic Energy-Accuracy Trade-off Using SC in Deep Neural Networks, DAC’16

• VLSI Implementation of Deep Neural Network Using Integral SC, TVLSI’17

• Scalable SC Accelerator for Convolutional Neural Networks, ASP-DAC’17

• Towards Acceleration of Deep Convolutional Neural Networks using SC, ASP-DAC’17

• SC-DCNN: Highly-Scalable Deep Convolutional Neural Network using SC, ASPLOS’17

• DSCNN: Hardware-Oriented Optimization for SC Based Deep Convolutional Neural Networks, ICCD’16

• A New SC Methodology for Efficient Neural Network Implementation, IEEE Tran. NN’16

• A Hardware Implementation of a Radial Basis Function NN Using Stochastic Logic, DATE’15

• Stochastic neural computation. I. (II.) Computational elements, IEEE Tran. on Computers, 2001


Page 7: Hybrid Stochastic-Binary NNs


• Hybrid stochastic-binary implementations of NNs have been proposed

• Improve the accuracy

• Reduce the energy consumption

• Lee et al. [1] used SC to implement the first convolutional layer of the NN

• The remaining layers were all implemented in the binary domain

• The hybrid designs in [2-4] also use approximate parallel counters, accumulative parallel counters, and binary arithmetic to improve the accuracy and energy efficiency of NN designs.

• Challenge

• None of these prior designs achieves the same classification rate as the conventional fixed-point binary design


[1] V. T. Lee et al., Energy-Efficient Hybrid Stochastic-Binary Neural Networks for Near-Sensor Computing. DATE'17.
[2] K. Kim et al., Dynamic Energy-Accuracy Trade-off Using Stochastic Computing in Deep Neural Networks. DAC'16.
[3] J. Li et al., Towards Acceleration of Deep Convolutional Neural Networks Using Stochastic Computing. ASP-DAC'17.
[4] H. Sim et al., Scalable Stochastic Computing Accelerator for Convolutional Neural Networks. ASP-DAC'17.

Page 8: Deterministic Approaches to SC


• Recent progress in SC has revolutionized the paradigm

[Jenson and Riedel ICCAD’16] [Najafi et al. TVLSI’17] [Najafi and Lilja ICCD’17]

• By properly structuring the bit-streams,

• Computation can be performed deterministically and accurately

• 1) Relatively Prime Stream Length

• 2) Clock Division

• 3) Rotation

• Every bit of one bit-stream pairs with every bit of the other exactly once
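For instance, the rotation approach can be sketched in a few lines of Python (an illustration of the pairing property, not the paper's hardware): one stream is repeated while the other is rotated by one position each period, so the product is exact after n*n cycles.

```python
# Deterministic multiplication via rotation: every bit of stream_a meets
# every bit of stream_b exactly once over n*n cycles.

def rotation_multiply(stream_a, stream_b):
    n = len(stream_a)
    ones = 0
    for r in range(n):
        rotated = stream_b[r:] + stream_b[:r]   # rotate by r positions
        ones += sum(a & b for a, b in zip(stream_a, rotated))
    return ones / (n * n)

a = [1, 1, 1, 0]                # 3/4
b = [1, 1, 0, 0]                # 2/4
print(rotation_multiply(a, b))  # 0.375 == (3/4) * (2/4), exact
```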


Page 9: Deterministic Approaches to SC


• Low-discrepancy (LD) sequences have been used to improve the speed of computation on stochastic bit-streams

• Halton-based bit-streams [Alaghi and Hayes, DATE’14]

• Sobol-based bit-streams [Liu and Han, DATE’17]

• With LD sequences,

• 1s and 0s in the bit-streams are uniformly spaced: removing random fluctuations

• Bit-streams quickly converge to the target value: reducing processing time

• Important property of Sobol sequences:

• The first 2^n numbers in any Sobol sequence precisely represent all possible n-bit precision numbers in the [0, 1) interval

• E.g., the simplest Sobol sequence:


0, 1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8, 1/16, 9/16, 5/16, 13/16, 3/16, 11/16, 7/16, 15/16
(the first 4 values cover all 2-bit numbers, the first 8 all 3-bit numbers, and the first 16 all 4-bit numbers)
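This simplest sequence is the base-2 van der Corput sequence, obtained by reversing the bits of the sample index; a short sketch (illustrative, not the paper's generator):

```python
# The simplest Sobol sequence via bit reversal: the first 2**n values
# enumerate all n-bit precision numbers in [0, 1).

def van_der_corput(index, n_bits):
    value = 0
    for _ in range(n_bits):
        value = (value << 1) | (index & 1)   # shift reversed bits in
        index >>= 1
    return value / (1 << n_bits)

print([van_der_corput(i, 4) for i in range(8)])
# [0.0, 0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]
```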

Page 10: Deterministic Approaches to SC


• 4) Direct LD [Najafi et al. ICCAD’18]

• Directly use different LD Sobol sequences to generate bit-streams (no need for special structuring)

• Every bit of one bit-stream pairs with every bit of the other exactly once


Page 11: Deterministic Approaches to SC


• 4) Direct LD [Najafi et al. ICCAD’18]

• 5) Integrated LD


Sequence 1: 0, 1/2, 1/4, 3/4 | 0, 1/2, 1/4, 3/4 | 0, 1/2, 1/4, 3/4 | 0, 1/2, 1/4, 3/4
Sequence 2: 0, 1/2, 3/4, 1/4 | 1/4, 0, 1/2, 3/4 | 3/4, 1/4, 0, 1/2 | 1/2, 3/4, 1/4, 0

Page 12: Deterministic Approaches to SC


• Both of these LD-based deterministic methods

• Produce completely accurate results when running for 2^(2N) cycles (for N-bit inputs)

• Produce a similar result (the same truncation error) when operating on short streams

• In the conventional random-stream based SC,

• The multiplication must be performed multiple times and the outputs averaged

• Each time generating and ANDing a different pair of bit-streams

• The output of processing LD bit-streams is deterministic with a standard deviation of zero

• Performing the operation once is sufficient to obtain a result free of variation
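The determinism is easy to reproduce in simulation. The sketch below pairs two different Sobol sequences (the bit-reversal sequence and the standard second Sobol dimension built from the primitive polynomial x + 1; this particular construction is an assumption of the sketch, not taken from the paper) and recovers the exact product of two N-bit inputs in 2^(2N) cycles:

```python
# Direct LD multiplication: convert each input with a different Sobol
# sequence, AND the streams, and count the 1s over 2^(2N) cycles.

def sobol_value(index, v):
    """XOR the direction integers v[j] selected by the set bits of index."""
    x = 0
    for j in range(len(v)):
        if (index >> j) & 1:
            x ^= v[j]
    return x

def direction_integers_seq2(n_bits):
    # Second Sobol dimension (polynomial x + 1): m_j = (2 * m_{j-1}) XOR m_{j-1}
    m = [1]
    for _ in range(n_bits - 1):
        m.append((m[-1] << 1) ^ m[-1])
    return [m[j] << (n_bits - 1 - j) for j in range(n_bits)]

def direct_ld_multiply(a, b, N):
    """Exact product of (a / 2^N) and (b / 2^N) in 2^(2N) deterministic cycles."""
    bits = 2 * N
    v1 = [1 << (bits - 1 - j) for j in range(bits)]  # bit-reversal sequence
    v2 = direction_integers_seq2(bits)               # second Sobol sequence
    ones = 0
    for t in range(1 << bits):
        bit_a = sobol_value(t, v1) < (a << N)        # SNG for input a
        bit_b = sobol_value(t, v2) < (b << N)        # SNG for input b
        ones += bit_a & bit_b                        # AND gate + counter
    return ones / (1 << bits)

print(direct_ld_multiply(13, 7, 4))   # 13/16 * 7/16 = 91/256 = 0.35546875
```

Running it once gives 0.35546875 exactly; there is no run-to-run variation to average out.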


Page 13: Deterministic Approaches to SC


• The performance (mean absolute error, %) of four different combinations of Sobol sequences in multiplying two 8-bit precision inputs


Page 14: Proposed Hybrid Design


• Convolutional layers account for most of the hardware area and power cost in CNNs

• The basic operation in these layers: dot-product

• Multiplying and accumulating the input data and the weights

• We use LD deterministic bit-stream processing for a low-cost and energy-efficient implementation of the first convolutional layer

• To avoid the issue of compounding errors over multiple layers

• Multiplications and Accumulation

• Multiplication is performed in the bit-stream domain using AND gates

• Two different Sobol sequences convert the two inputs of each multiplication to bit-stream representation

• The output bit-streams are converted back to binary format implicitly

• Accumulating them in the binary domain using conventional binary adders

• The correlation between the produced output bit-streams will not affect the accuracy of the accumulation
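A sketch of this dot-product datapath, reusing sobol_value and direction_integers_seq2 from the earlier sketch (one Sobol sequence is shared by all data inputs and another by all weights; weights are treated as non-negative magnitudes here, with signs handled as described on the next slide):

```python
# Hybrid dot product: AND gates multiply in the bit-stream domain; a binary
# adder tree accumulates the single-bit products every cycle.

def hybrid_dot_product(inputs, weights, N=8, cycles=None):
    bits = 2 * N
    length = cycles if cycles is not None else (1 << bits)  # truncate to trade accuracy for time
    v1 = [1 << (bits - 1 - j) for j in range(bits)]
    v2 = direction_integers_seq2(bits)
    acc = 0                                 # binary accumulator
    for t in range(length):
        s1 = sobol_value(t, v1)             # shared source for all inputs
        s2 = sobol_value(t, v2)             # shared source for all weights
        acc += sum((s1 < (x << N)) & (s2 < (w << N))
                   for x, w in zip(inputs, weights))
    return acc / length                     # sum of (x_i/2^N) * (w_i/2^N)
```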


Page 15: Proposed Hybrid Design


• Handling Negative Weights

• The weight inputs involve both positive and negative data

• The common approach of handling negative data in the stochastic domain:

• Extending the range of numbers from [0,1] to [-1,1] using a linear transformation and processing bit-streams in a so-called stochastic bipolar domain

• This, however, requires a longer processing time for the same accuracy

• Our approach

• The weights are divided into positive and negative subsets

• Converted to bit-stream representation assuming that all are positive values

• The multiplication outputs of the “positive” subset and the “negative” subset are first summed separately and then subtracted from each other to produce the final output value
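A sketch of this sign handling on top of hybrid_dot_product above (weights are signed integers whose magnitudes fit in N bits; illustrative only):

```python
# Split the weights by sign, process both subsets as positive magnitudes,
# and subtract the two binary sums to form the final output.

def signed_dot_product(inputs, weights, N=8, cycles=None):
    pos = [(x, w) for x, w in zip(inputs, weights) if w >= 0]
    neg = [(x, -w) for x, w in zip(inputs, weights) if w < 0]
    pos_sum = hybrid_dot_product([x for x, _ in pos], [w for _, w in pos], N, cycles)
    neg_sum = hybrid_dot_product([x for x, _ in neg], [w for _, w in neg], N, cycles)
    return pos_sum - neg_sum
```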


Page 16: Proposed Hybrid Design


• Proposed design for the first convolutional layer


Page 17: Case Study: LeNet-5 NN


• Implemented the LeNet-5 NN

• 784-11520-2880-1280-320-800-500-10 configuration as illustrated in the figure

• Two convolutional layers, two max-pooling layers, two fully connected layers, one softmax layer

• The first layer processes each pixel of the input image with 20 filters of 5x5 size


Page 18: Case Study: LeNet-5 NN


• Multiplications of the first layer are performed in the bit-stream domain (24x24x5x5 multiplications)

• The output bit-streams are accumulated each cycle using binary adders

• The produced binary output is passed to a ReLU as the activation function

• The remaining layers are all implemented in the binary domain
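Putting the pieces together, a functional sketch of this first layer for a single filter, built on signed_dot_product above (dimensions follow the slide; everything else is illustrative):

```python
# First convolutional layer, one 5x5 filter over a 28x28 image: 24x24 output
# positions, each a bit-stream dot product followed by a binary-domain ReLU.

def conv5x5_relu(image, kernel, N=8, cycles=None):
    out = [[0.0] * 24 for _ in range(24)]
    for i in range(24):                     # 28 - 5 + 1 = 24 output rows
        for j in range(24):
            window = [image[i + u][j + v] for u in range(5) for v in range(5)]
            taps = [kernel[u][v] for u in range(5) for v in range(5)]
            out[i][j] = max(0.0, signed_dot_product(window, taps, N, cycles))
    return out
```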


Page 19: Case Study: LeNet-5 NN


• MNIST database to train and test the NN (60,000 training and 10,000 testing 28x28 images)

• Baseline design: an 8-bit fixed-point implementation of the NN

• Training: over the 60,000 training images using the 8-bit fixed-point binary design

• Testing: over the 10,000 testing images with

• The fixed-point binary design

• The bit-stream-based designs

• The conventional stochastic random bit-stream-based method

• RNG: MATLAB's built-in random number generator

• The proposed deterministic LD bit-stream-based method

• RNG: different pairs of Sobol sequences from MATLAB's built-in Sobol sequence generator

• The designs differ only in the bit-stream generation part

• The core logic for the multiplications and accumulations is the same in all structures


Page 20: Performance Evaluation


• Classification of 10,000 test images from MNIST

• 8-bit fixed-point binary design: 0.80% misclassification rate

• Conventional random stream-based design: results from 20 simulation trials

• Proposed bit-stream-based designs: one run for each Sobol combination

• Results are deterministic and reproducible

• The proposed designs achieve a better classification rate than the conventional random-stream-based one for all numbers of operation cycles


Misclassification rates of the bit-stream-based designs for different numbers of operation cycles

Page 21: Performance Evaluation


• The optimum number of cycles for high-quality results with the proposed designs

• Only 8 cycles

• Running for one more cycle (a total of 9 cycles) increases the misclassification rate

• Reason: imprecise representation of input data

• The general trend, however, is a decreasing error rate as the number of operation cycles increases

• After 2^6 cycles,

• The same or a lower error rate than the 8-bit fixed-point design

• The imprecise computation with truncated bit-streams turned a few misclassifications into correct classifications

• If the application accepts higher misclassification rates

• Even 4 cycles might satisfy the quality expectations


Page 22: Cost Comparison


• Synthesis results for a 5x5 convolution engine

• Synthesis results of the first convolutional layer (24x24 convolution units in parallel for one filter)

Area of the conventional fixed-point design relative to the proposed bit-stream design: 23x and 37x with SNG cost not included; 19x and 30x with SNG cost included

Total energy: for the 5x5 convolution engine, 24 pJ for the fixed-point design (1 cycle) vs. 9.0 pJ for the proposed design (8 cycles); for the first convolutional layer, 14.5 nJ (1 cycle, fixed-point) vs. 4.1 nJ (8 cycles, proposed)

Page 23: Cost Comparison


• The proposed design thus achieves

• More than 70% energy saving compared to the non-pipelined fixed-point design

• With the conventional random stream-based design

• A significantly longer processing time is required to achieve the same classification rate

• This makes it energy-inefficient compared to both the fixed-point and the proposed designs

• Due to the random fluctuation problem

• 24x24 parallel 5x5 convolution engines were implemented to process a 28x28 input test image in parallel with one filter

• Parallel processing with all 20 filters requires 20 copies of this design

• SNGs and RNGs will then be shared among more convolution engines

• The overhead cost of bit-stream generation will be further reduced


Page 24: Conclusion


• We proposed a low-cost and energy-efficient design for hardware implementation of convolutional neural networks (CNNs)

• Fast and accurate multiplications in the first convolutional layer using low-discrepancy deterministic bit-streams and standard AND gates

• Compared to prior random bit-stream-based designs, the proposed design achieves a lower misclassification rate for the same processing time.

• Evaluated on the LeNet-5 NN with the MNIST dataset

• The same classification rate as the conventional fixed-point binary design

• But with 70% savings in the energy consumption of the first convolutional layer

• If slight inaccuracy is acceptable, even higher energy savings are possible by processing shorter bit-streams.


Page 25: Thank You



Questions?

M. Hassan Najafi, [email protected]