Embedded Machine Learning Axel Jantsch TU Wien, Vienna, Austria December 2020


Page 1: Embedded Machine Learning - Institute of Computer

Embedded Machine Learning

Axel Jantsch

TU Wien, Vienna, Austria

December 2020

Page 2: Embedded Machine Learning - Institute of Computer


Outline

1 Motivation and Challenges

2 HW Friendly Optimizations

3 CNN Accelerator Architectures

4 MSc Projects

5 Quantization

6 Inference Run Time Estimation

Page 3: Embedded Machine Learning - Institute of Computer

Motivation and Challenges

Page 4: Embedded Machine Learning - Institute of Computer

Why Embedded Machine Learning ?

• Machine learning is a powerful method to analyze data;

• Embedded applications produce huge amounts of sensor data;

• The data cannot, or should not, always be moved to central servers;

Page 5: Embedded Machine Learning - Institute of Computer

Compute Usage Trend

From: https://openai.com/blog/ai-and-compute/

Page 6: Embedded Machine Learning - Institute of Computer


Total Energy Consumption

SIA - SRC. Rebooting the IT Revolution: A Call to Action. Tech. rep. Semiconductor Industry Association and Semiconductor Research Corporation, Sept. 2015

Page 7: Embedded Machine Learning - Institute of Computer

Deployed Sensors

SIA - SRC. Rebooting the IT Revolution: A Call to Action. Tech. rep. Semiconductor Industry Association and Semiconductor Research Corporation, Sept. 2015

Page 8: Embedded Machine Learning - Institute of Computer

What is Special About “Embedded”?

• Resource limitation

• Connectivity

• Security

• Privacy

Page 9: Embedded Machine Learning - Institute of Computer

What is Special About “Embedded”?

Resource limitations

                     Embedded           Server farm
Computation [flop]   30-1800 · 10^12    86 · 10^18
Memory [bit]         10^10              10^15
Power [W]            5-100              10^3 - 10^6
Energy [Wh]          48-1000            200 · 10^6

Computation embedded refers to an Nvidia Jetson Nano running 1 min and 1 hour, respectively.
Computation server refers to the computation needed for the 40-day experiment with AlphaGo Zero.
Energy embedded refers to a mobile phone and to a car battery, respectively.
Energy server refers to the 40-day experiment for AlphaGo Zero.

Page 10: Embedded Machine Learning - Institute of Computer

Case for Embedded ML

• Embedded inference:
  • More energy efficient
  • Bandwidth constraints
  • Latency constraints
  • Not always on-line and connected to a cloud server
  • Security
  • Privacy

• Embedded continuous learning:
  • Customization and specialization
  • Security
  • Privacy

Page 11: Embedded Machine Learning - Institute of Computer

DNNs: Embedded and the Cloud

[Figure: placement options for the DNN workflow across embedded device and cloud: training, preprocessing, and inference placed either embedded or cloud-based; design-time training, periodic retraining, and embedded adaptation.]

Page 12: Embedded Machine Learning - Institute of Computer

Design Space

Platform Choices (platform selection, e.g. ARM NN, Nvidia Turing): processing unit selection, reconfiguration, batch processing, deep pipelining, resource reuse, hierarchical control, memory allocation, memory reuse, etc.

DNN Choices: convolutional layers, filter kernels, pooling layers, filter shape, stride, number of filters, fully connected layers, number of layers, regularization, etc.

Mapping Choices: neuron pruning, connection pruning, weight sparsifying, data type selection, approximation, retraining, regularization, etc.

Page 13: Embedded Machine Learning - Institute of Computer

Quiz Time

1 In which year will computing consume all world-wide produced energy, if

a current trends continue without major innovation in technology; Choices: 2027, 2032, 2037, 2042, 2047

b current trends continue with aggressive and major innovations in technology?

Page 14: Embedded Machine Learning - Institute of Computer

HW Friendly Optimizations

Page 15: Embedded Machine Learning - Institute of Computer

Optimization Categories

1 Minimize number of operations to be performed;

2 Simplify each operation;

3 Execute the operations as efficiently as possible.

Page 16: Embedded Machine Learning - Institute of Computer

HW Friendly Optimizations

• Loop reordering, unrolling, pipelining
• Tiling
• Batching
• Binarized CNNs

Page 17: Embedded Machine Learning - Institute of Computer

Loop Optimizations

Convolution layer algorithm:

for (to = 0; to < M; to++) {                  // output feature map
  for (ti = 0; ti < N; ti++) {                // input feature map
    for (row = 0; row < R; row++) {           // row
      for (col = 0; col < C; col++) {         // column
        for (i = 0; i < K; i++) {             // filter kernel
          for (j = 0; j < K; j++) {
            Ofmap[to][row][col] += W[to][ti][i][j]
                                 * Ifmap[ti][S*row+i][S*col+j];
          }
        }
      }
    }
  }
}

M ... number of output feature maps
N ... number of input feature maps
R ... number of rows
C ... number of columns
K ... filter kernel size
S ... stride
W ... weight matrix

Page 18: Embedded Machine Learning - Institute of Computer


Loop Optimizations

Loop reordering to improve cache efficiency;
Loop unrolling to improve parallelism;
Loop pipelining to improve parallelism.
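As an illustration of loop unrolling, the innermost filter loop of the convolution nest on the previous slide can be unrolled by hand. This is a minimal sketch, assuming the kernel size K is a multiple of the unroll factor 4; it exposes four independent multiply-accumulates per iteration that a compiler or a HW pipeline can schedule in parallel:

/* Sketch: innermost filter loop unrolled by a factor of 4 (assumes K % 4 == 0). */
for (j = 0; j < K; j += 4) {
    Ofmap[to][row][col] += W[to][ti][i][j]   * Ifmap[ti][S*row+i][S*col+j]
                         + W[to][ti][i][j+1] * Ifmap[ti][S*row+i][S*col+j+1]
                         + W[to][ti][i][j+2] * Ifmap[ti][S*row+i][S*col+j+2]
                         + W[to][ti][i][j+3] * Ifmap[ti][S*row+i][S*col+j+3];
}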

Page 19: Embedded Machine Learning - Institute of Computer

Loop Tiling

for (to = 0; to < M; to++) {                          // output feature map
  for (ti = 0; ti < N; ti++) {                        // input feature map
    for (row = 0; row < R; row += Tr) {               // tiled row loop
      for (col = 0; col < C; col++) {                 // column
        for (trr = row; trr < min(R, row+Tr); trr++) {  // rows within one tile
          for (i = 0; i < K; i++) {                   // filter kernel
            for (j = 0; j < K; j++) {
              Ofmap[to][trr][col] += W[to][ti][i][j]
                                   * Ifmap[ti][S*trr+i][S*col+j];
            }
          }
        }
      }
    }
  }
}

For efficient use of caches.

Page 20: Embedded Machine Learning - Institute of Computer

Batching

• Reuse of weights;
• Improves throughput;
• Increases latency.
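A minimal sketch of weight reuse through batching, assuming a fully connected layer with an M x N weight matrix W applied to a batch of B inputs (In, Out, and the loop bounds are illustrative, not from the slides). Each weight is fetched once and reused B times, which raises throughput but increases the latency of any single input, since results are only available once the whole batch has been processed:

/* Batched fully connected layer: each weight W[m][n] is loaded once
   and reused for all B inputs of the batch. */
for (m = 0; m < M; m++) {
    for (n = 0; n < N; n++) {
        float w = W[m][n];              /* one weight fetch...   */
        for (b = 0; b < B; b++) {       /* ...reused B times     */
            Out[b][m] += w * In[b][n];
        }
    }
}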

Page 21: Embedded Machine Learning - Institute of Computer

Binarized CNNs (BNN)

• Weights and internal computations are represented as 1-bit binary numbers;

• Instead of MAC operations, BNNs use XOR and bit-count;
• Attractive for HW and FPGAs
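A minimal sketch of the XOR/bit-count idea, assuming 32 binarized values (+1/-1) are packed into one 32-bit word with bit = 1 encoding +1 and bit = 0 encoding -1; the packing scheme and the function name are illustrative, and __builtin_popcount is the GCC/Clang bit-count builtin:

#include <stdint.h>

/* Dot product of two 32-element {+1,-1} vectors, each packed into one word. */
int bnn_dot32(uint32_t w, uint32_t x) {
    uint32_t diff = w ^ x;                     /* bits where the signs differ */
    int mismatches = __builtin_popcount(diff);
    return 32 - 2 * mismatches;                /* #matches minus #mismatches  */
}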

Page 22: Embedded Machine Learning - Institute of Computer

Quiz Time

2 Which optimization method is most promising in improving ML compute efficiency:

a Optimize network to minimize number of computations,

b Improve processing element and processing datapath architecture,

c Improve memory architecture,

d Optimize mapping of network onto target architecture,

e Minimize data type ?

Page 23: Embedded Machine Learning - Institute of Computer

CNN Accelerator Architectures

Page 24: Embedded Machine Learning - Institute of Computer

CNN Accelerator Architectures

• Systolic array architecture

• In-memory computing

Page 25: Embedded Machine Learning - Institute of Computer

Systolic Arrays

Page 26: Embedded Machine Learning - Institute of Computer

Tensor Processing Unit (TPU)

N. P. Jouppi et al. “In-datacenter performance analysis of a tensor processing unit”. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 2017, pp. 1–12

Page 27: Embedded Machine Learning - Institute of Computer

Tensor Processing Unit (TPU)

Page 28: Embedded Machine Learning - Institute of Computer

Tensor Processing Unit (TPU)

Page 29: Embedded Machine Learning - Institute of Computer

In-memory Computing

• Storage capacity and memory access bandwidth and latency dominate DNN execution.

• Avoid moving data.
• Distribute the MAC units in the memory architecture.

Page 30: Embedded Machine Learning - Institute of Computer

Wire Aware Accelerator (WAX)

Sumanth Gudaparthi et al. “Wire-Aware Architecture and Dataflow for CNN Accelerators”. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO ’52. Columbus, OH, USA: Association for Computing Machinery, 2019

Page 31: Embedded Machine Learning - Institute of Computer

Quiz Time

3 For optimizing CNN execution, is it more effective

a to re-use (and keep on-chip) input data,

b to re-use weights,

c to re-use intermediate data ?

Page 32: Embedded Machine Learning - Institute of Computer

MSc Projects

Page 33: Embedded Machine Learning - Institute of Computer

MSc Projects

• Embedded High Speed Image Classification/Detection/Segmentation with Quantized Neural Networks

• Evaluation of Quantization Aware Training Methods for INT8 and lower bit-widths for image classification, object detection and segmentation

• Power and Latency Optimization of the Xilinx ZCU102 Platform for a Pedestrian Intention Recognition Use Case

• Optimization of 3D Convolutions

• Optimization of Object Detection Networks on the Google Edge TPU

Page 34: Embedded Machine Learning - Institute of Computer

Questions ?

?

Page 35: Embedded Machine Learning - Institute of Computer

Quantization

Page 36: Embedded Machine Learning - Institute of Computer

Quantization - Regularization

• Using small bit widths for weights saves memory, bandwidth and computation;
• The bit width can be different for different layers of the DNN;
• Quantization schemes: dynamic fixed point, power of 2;
• Retraining after quantization recovers accuracy losses: regularization;
• Not all weights are equal: weighted regularization.

Matthias Wess, Sai Manoj Pudukotai Dinakarrao, and Axel Jantsch. “Weighted Quantization-Regularization in DNNs for Weight Memory Minimization towards HW Implementation”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37.10 (Oct. 2018)
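A minimal sketch of the two quantization schemes named above, assuming a per-layer configuration; the helper functions and parameters are illustrative and not the exact procedure of the cited paper. Dynamic fixed point snaps a weight to a b-bit grid with a layer-specific fractional length fl, while power-of-2 quantization keeps only the sign and a rounded exponent, so multiplications become shifts:

#include <math.h>

/* Dynamic fixed point: b-bit signed value, step size 2^(-fl). */
float quant_dfp(float w, int b, int fl) {
    float step  = ldexpf(1.0f, -fl);                    /* 2^(-fl)            */
    float max_q = (ldexpf(1.0f, b - 1) - 1.0f) * step;  /* largest code value */
    float q = roundf(w / step) * step;
    if (q >  max_q) q =  max_q;                         /* saturate           */
    if (q < -max_q) q = -max_q;
    return q;
}

/* Power of 2: round log2|w| to the nearest integer exponent, keep the sign. */
float quant_po2(float w, int min_exp, int max_exp) {
    if (w == 0.0f) return 0.0f;
    int e = (int)roundf(log2f(fabsf(w)));
    if (e < min_exp) e = min_exp;
    if (e > max_exp) e = max_exp;
    return copysignf(ldexpf(1.0f, e), w);
}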

Page 37: Embedded Machine Learning - Institute of Computer

Quantization - Motivation

• DNN quantization
  • Reduces data movement
  • Reduces logic energy

• Layerwise bit-width optimization

[Figure: AllConvNet on CIFAR-10, layers conv1-conv9 (channels 96, 96, 96, 192, 192, 192, 192, 192, 100, with two max-pooling stages); bar chart of weight memory [bit] per layer (0 to 1.2e7), comparing full-precision weights against quantized weights at layer-wise bit-widths 7, 7, 7, 4, 4, 3, 3, 7, 7.]

Page 38: Embedded Machine Learning - Institute of Computer

Quantization - Motivation

[Figure: weight-value histograms. (a) Layer 1, 600 weights: full resolution vs. 3-bit DFP; (b) Layer 2, 900 weights: full resolution vs. 2-bit DFP. Axes: weight value vs. weight count.]

Page 39: Embedded Machine Learning - Institute of Computer

CIFAR-100

[Figure: CIFAR-100 compression. Accuracy in % (40-65) vs. compression ratio (4.0-8.5) for equal DFP, layer-wise DFP, and equal Po2 quantization, each with and without retraining, compared against the baseline accuracy.]

Page 40: Embedded Machine Learning - Institute of Computer

Quantization - Conclusion

• Dynamic Fixed Point is an effective alternative to integer or floating point representation;
• Different layers require different precision;
• Retraining adjusts the network to the available weight values;
• Weight memory reduction of 4-8x is common;
• Reduced weight precision reduces weight memory and cost of operation.

Page 41: Embedded Machine Learning - Institute of Computer

Summary

• Embedded Machine Learning has many applications;
  • Bandwidth limitations;
  • Delay constraints;
  • Privacy;
  • Security;

• There are distinct challenges:
  • Limited resources;
  • Specialized HW platforms;
  • Huge design space for optimization and mapping.

Page 42: Embedded Machine Learning - Institute of Computer

Inference Run Time Estimation

Page 43: Embedded Machine Learning - Institute of Computer

Estimation Framework

Page 44: Embedded Machine Learning - Institute of Computer

Inference Run Time Estimation

Assumption:
• Inference time as a function of problem size is a combination of step and linear functions, due to limited parallel resources.

Example:
• Single convolutional layer sweep
• 32x32x64 input with k filters and kernel size 3

Page 45: Embedded Machine Learning - Institute of Computer

Inference Run Time Estimation

• Assumption: the inference time can be approximated by a combination of linear and step functions for each dimension, such as filters, channels, etc.

• The function is determined from selected measurements.

• Goal: automatic computation of estimation functions for latency and power consumption on various platforms.

Page 46: Embedded Machine Learning - Institute of Computer

Automatic Estimation Function Generation

Page 47: Embedded Machine Learning - Institute of Computer

Iterative Refinement

Page 48: Embedded Machine Learning - Institute of Computer

Iterative Refinement

Page 49: Embedded Machine Learning - Institute of Computer

Iterative Refinement

Page 50: Embedded Machine Learning - Institute of Computer

Next Point Selection

• Linear function criteria:
  • Point furthest away from previous points

• Step function criteria:
  • Point with most unique discrete levels
  • Point with largest range of values
  • Point farthest away from previous points

• Next point selection: point with highest score

Page 51: Embedded Machine Learning - Institute of Computer

Next Point Selection

• Parameter optimization for step function criteria -> grid search

• Step function criteria:
  • Point with most unique discrete levels
  • Point with largest range of values
  • Point farthest away from previous points
  • Interesting areas (cluster of purple values)

Page 52: Embedded Machine Learning - Institute of Computer

Method Evaluation

• Results after 3 iterations (5 measurement points)

• Execution times:
  • Full sweep: 3-4 h
  • Proposed approach: 2-5 minutes

Page 53: Embedded Machine Learning - Institute of Computer

2D Example

• Phase 1: Estimate the function in a single dimension: number of filters

• Result: step function

Page 54: Embedded Machine Learning - Institute of Computer

2D Example

• Phase 2: Test how d, w and h behave in the next dimension

• Next dimension: input channels d_in

• Result:
  • Step function: d0 = 0.1418, w0 = 8, h0 = 0.0106
  • Constant: c = 32
  • Step function: d1 = 0.044, w1 = 8, h1 = 0.0121

Page 55: Embedded Machine Learning - Institute of Computer

2D Example

Generated model:

$f(d_{in}, k) = 0.1418 + \lfloor \frac{d_{in}-1}{8} \rfloor \cdot 0.0106 + \lfloor \frac{k-1}{32} \rfloor \cdot \left( 0.044 + \lfloor \frac{d_{in}-1}{8} \rfloor \cdot 0.0121 \right)$

• Measurement points: 112
• Execution time: 32 minutes

Page 56: Embedded Machine Learning - Institute of Computer

2D Example - Error

Page 57: Embedded Machine Learning - Institute of Computer

2D Example - Error

Page 58: Embedded Machine Learning - Institute of Computer

2D Example - Error

Slice through 2D plane at k = 1024

$f(d_{in}, k) = 0.1418 + \lfloor \frac{d_{in}-1}{8} \rfloor \cdot 0.0106 + \lfloor \frac{k-1}{32} \rfloor \cdot \left( 0.044 + \lfloor \frac{d_{in}-1}{8} \rfloor \cdot 0.0121 \right)$

$f(d_{in}, 1024) = 1.5058 + \lfloor \frac{d_{in}-1}{8} \rfloor \cdot 0.3857$

Page 59: Embedded Machine Learning - Institute of Computer

2D Example - Error

Slice through 2D plane at din = 128

$f(d_{in}, k) = 0.1418 + \lfloor \frac{d_{in}-1}{8} \rfloor \cdot 0.0106 + \lfloor \frac{k-1}{32} \rfloor \cdot \left( 0.044 + \lfloor \frac{d_{in}-1}{8} \rfloor \cdot 0.0121 \right)$

$f(128, k) = 0.3008 + \lfloor \frac{k-1}{32} \rfloor \cdot 0.2255$

Page 60: Embedded Machine Learning - Institute of Computer

Latency Estimation Summary

• Exploiting the discrete nature of HW resources
• Fast estimation function for latency based on linear and step functions
• Automatic derivation of the estimation for a new platform
• Results for Nvidia GPU platforms are promising

Page 61: Embedded Machine Learning - Institute of Computer

References I

Sumanth Gudaparthi et al. “Wire-Aware Architecture and Dataflow for CNN Accelerators”. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO ’52. Columbus, OH, USA: Association for Computing Machinery, 2019.

N. P. Jouppi et al. “In-datacenter performance analysis of a tensor processing unit”. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 2017, pp. 1–12.

SIA - SRC. Rebooting the IT Revolution: A Call to Action. Tech. rep. Semiconductor Industry Association and Semiconductor Research Corporation, Sept. 2015.

Matthias Wess, Sai Manoj Pudukotai Dinakarrao, and Axel Jantsch. “Weighted Quantization-Regularization in DNNs for Weight Memory Minimization towards HW Implementation”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37.10 (Oct. 2018).

Yu Wang, Gu-Yeon Wei, and David Brooks. “Benchmarking TPU, GPU, and CPU Platforms for Deep Learning”. In: CoRR abs/1907.10701 (2019). arXiv: 1907.10701.

Page 62: Embedded Machine Learning - Institute of Computer

Questions ?

?
