Embedded Machine Learning Axel Jantsch TU Wien, Vienna, Austria December 2020


Page 1: Embedded Machine Learning - Institute of Computer

Embedded Machine Learning

Axel Jantsch

TU Wien, Vienna, Austria

December 2020

Page 2: Embedded Machine Learning - Institute of Computer


Outline

1 Motivation and Challenges

2 HW Friendly Optimizations

3 CNN Accelerator Architectures

4 MSc Projects

5 Quantization

6 Inference Run Time Estimation

Page 3: Embedded Machine Learning - Institute of Computer

Motivation and Challenges

Page 4: Embedded Machine Learning - Institute of Computer

Why Embedded Machine Learning ?

• Machine learning is a powerful method to analyze data;

• Embedded applications produce huge amounts of sensor data;

• The data cannot, or should not, always be moved to central servers;

Page 5: Embedded Machine Learning - Institute of Computer

Compute Usage Trend

From: https://openai.com/blog/ai-and-compute/

Page 6: Embedded Machine Learning - Institute of Computer


Total Energy Consumption

SIA - SRC. Rebooting the IT Revolution: A Call to Action. Tech. rep. Semiconductor Industry Association and Semiconductor Research Corporation, Sept. 2015

Page 7: Embedded Machine Learning - Institute of Computer

Deployed Sensors

SIA - SRC. Rebooting the IT Revolution: A Call to Action. Tech. rep. Semiconductor Industry Association and Semiconductor Research Corporation, Sept. 2015

Page 8: Embedded Machine Learning - Institute of Computer

What is Special About “Embedded”?

• Resource limitation

• Connectivity

• Security

• Privacy

Page 9: Embedded Machine Learning - Institute of Computer

What is Special About “Embedded”?

Resource limitations

                     Embedded           Server farm
Computation [flop]   30-1800 · 10^12    86 · 10^18
Memory [bit]         10^10              10^15
Power [W]            5-100              10^3 - 10^6
Energy [Wh]          48-1000            200 · 10^6

Computation embedded refers to an Nvidia Jetson Nano running 1 min and 1 hour, respectively.
Computation server refers to the computation needed for the 40-day experiment with AlphaGo Zero.
Energy embedded refers to a mobile phone and to a car battery, respectively.
Energy server refers to the 40-day experiment for AlphaGo Zero.

Page 10: Embedded Machine Learning - Institute of Computer

Case for Embedded ML

• Embedded inference:
  • More energy efficient
  • Bandwidth constraints
  • Latency constraints
  • Not always on-line and connected to a cloud server
  • Security
  • Privacy

• Embedded continuous learning:
  • Customization and specialization
  • Security
  • Privacy

Page 11: Embedded Machine Learning - Institute of Computer

DNNs: Embedded and the Cloud

[Figure: placement options for the DNN workflow across embedded device and cloud: training, preprocessing, and inference placed either embedded or cloud-based; design-time training, periodic retraining, and embedded adaptation.]

Page 12: Embedded Machine Learning - Institute of Computer

Design Space

Platform Choices (platform selection, e.g. ARM NN, Nvidia Turing): processing unit selection, reconfiguration, batch processing, deep pipelining, resource reuse, hierarchical control, memory allocation, memory reuse, etc.

DNN Choices: convolutional layers, filter kernels, pooling layers, filter shape, stride, number of filters, fully connected layers, number of layers, regularization, etc.

Mapping Choices: neuron pruning, connection pruning, weight sparsifying, data type selection, approximation, retraining, regularization, etc.

Page 13: Embedded Machine Learning - Institute of Computer

Quiz Time

1 In which year will computing consume all world-wide produced energy, if

a current trends continue without major innovation in technology; Choices: 2027, 2032, 2037, 2042, 2047

b current trends continue with aggressive and major innovations in technology?

Page 14: Embedded Machine Learning - Institute of Computer

HW Friendly Optimizations

Page 15: Embedded Machine Learning - Institute of Computer

Optimization Categories

1 Minimize number of operations to be performed;

2 Simplify each operation;

3 Execute the operations as efficiently as possible.

Page 16: Embedded Machine Learning - Institute of Computer

HW Friendly Optimizations

• Loop reordering, unrolling, pipelining
• Tiling
• Batching
• Binarized CNNs

Page 17: Embedded Machine Learning - Institute of Computer

Loop Optimizations

Convolution layer algorithm:

for (to = 0; to < M; to++) {                  // output feature map
  for (ti = 0; ti < N; ti++) {                // input feature map
    for (row = 0; row < R; row++) {           // row
      for (col = 0; col < C; col++) {         // column
        for (i = 0; i < K; i++) {             // filter kernel
          for (j = 0; j < K; j++) {
            Ofmap[to][row][col] += W[to][ti][i][j]
                                 * Ifmap[ti][S*row+i][S*col+j];
          }
        }
      }
    }
  }
}

M ... number of output feature maps
N ... number of input feature maps
R ... number of rows
C ... number of columns
K ... filter kernel size
S ... stride
W ... weight matrix

Page 18: Embedded Machine Learning - Institute of Computer


Loop Optimizations

Loop reordering to improve cache efficiency;
Loop unrolling to improve parallelism;
Loop pipelining to improve parallelism.
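As an illustration of loop unrolling, the innermost filter loop of the convolution nest on the previous slide can be unrolled by hand. This is a minimal sketch, assuming the kernel size K is a multiple of the unroll factor 4; it exposes four independent multiply-accumulates per iteration that a compiler or a HW pipeline can schedule in parallel:

/* Sketch: innermost filter loop unrolled by a factor of 4 (assumes K % 4 == 0). */
for (j = 0; j < K; j += 4) {
    Ofmap[to][row][col] += W[to][ti][i][j]   * Ifmap[ti][S*row+i][S*col+j]
                         + W[to][ti][i][j+1] * Ifmap[ti][S*row+i][S*col+j+1]
                         + W[to][ti][i][j+2] * Ifmap[ti][S*row+i][S*col+j+2]
                         + W[to][ti][i][j+3] * Ifmap[ti][S*row+i][S*col+j+3];
}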

Page 19: Embedded Machine Learning - Institute of Computer

Loop Tiling

for (to = 0; to < M; to++) {                          // output feature map
  for (ti = 0; ti < N; ti++) {                        // input feature map
    for (row = 0; row < R; row += Tr) {               // tiled row loop
      for (col = 0; col < C; col++) {                 // column
        for (trr = row; trr < min(R, row+Tr); trr++) {  // rows within one tile
          for (i = 0; i < K; i++) {                   // filter kernel
            for (j = 0; j < K; j++) {
              Ofmap[to][trr][col] += W[to][ti][i][j]
                                   * Ifmap[ti][S*trr+i][S*col+j];
            }
          }
        }
      }
    }
  }
}

For efficient use of caches.

Page 20: Embedded Machine Learning - Institute of Computer

Batching

• Reuse of weights;
• Improves throughput;
• Increases latency.
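A minimal sketch of weight reuse through batching, assuming a fully connected layer with an M x N weight matrix W applied to a batch of B inputs (In, Out, and the loop bounds are illustrative, not from the slides). Each weight is fetched once and reused B times, which raises throughput but increases the latency of any single input, since results are only available once the whole batch has been processed:

/* Batched fully connected layer: each weight W[m][n] is loaded once
   and reused for all B inputs of the batch. */
for (m = 0; m < M; m++) {
    for (n = 0; n < N; n++) {
        float w = W[m][n];              /* one weight fetch...   */
        for (b = 0; b < B; b++) {       /* ...reused B times     */
            Out[b][m] += w * In[b][n];
        }
    }
}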

Page 21: Embedded Machine Learning - Institute of Computer

Binarized CNNs (BNN)

• Weights and internal computations are represented as 1-bit binary numbers;

• Instead of MAC operations, BNNs use XOR and bit-count;
• Attractive for HW and FPGAs
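A minimal sketch of the XOR/bit-count idea, assuming 32 binarized values (+1/-1) are packed into one 32-bit word with bit = 1 encoding +1 and bit = 0 encoding -1; the packing scheme and the function name are illustrative, and __builtin_popcount is the GCC/Clang bit-count builtin:

#include <stdint.h>

/* Dot product of two 32-element {+1,-1} vectors, each packed into one word. */
int bnn_dot32(uint32_t w, uint32_t x) {
    uint32_t diff = w ^ x;                     /* bits where the signs differ */
    int mismatches = __builtin_popcount(diff);
    return 32 - 2 * mismatches;                /* #matches minus #mismatches  */
}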

Page 22: Embedded Machine Learning - Institute of Computer

Quiz Time

2 Which optimization method is most promising in improving ML compute efficiency:

a Optimize network to minimize number of computations,

b Improve processing element and processing datapath architecture,

c Improve memory architecture,

d Optimize mapping of network onto target architecture,

e Minimize data type ?

Page 23: Embedded Machine Learning - Institute of Computer

CNN Accelerator Architectures

Page 24: Embedded Machine Learning - Institute of Computer

CNN Accelerator Architectures

• Systolic array architecture

• In-memory computing

Page 25: Embedded Machine Learning - Institute of Computer

Systolic Arrays

Page 26: Embedded Machine Learning - Institute of Computer

Tensor Processing Unit (TPU)

N. P. Jouppi et al. “In-datacenter performance analysis of a tensor processing unit”. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 2017, pp. 1–12

Page 27: Embedded Machine Learning - Institute of Computer

Tensor Processing Unit (TPU)

Page 28: Embedded Machine Learning - Institute of Computer

Tensor Processing Unit (TPU)

Page 29: Embedded Machine Learning - Institute of Computer

In-memory Computing

• Storage capacity and memory access bandwidth and latency dominate DNN execution.

• Avoid moving data.
• Distribute the MAC units in the memory architecture.

Page 30: Embedded Machine Learning - Institute of Computer

Wire Aware Accelerator (WAX)

Sumanth Gudaparthi et al. “Wire-Aware Architecture and Dataflow for CNN Accelerators”. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO ’52. Columbus, OH, USA: Association for Computing Machinery, 2019

Page 31: Embedded Machine Learning - Institute of Computer

Quiz Time

3 For optimizing CNN execution, is it more effective

a to re-use (and keep on-chip) input data,

b to re-use weights,

c to re-use intermediate data ?

Page 32: Embedded Machine Learning - Institute of Computer

MSc Projects

Page 33: Embedded Machine Learning - Institute of Computer

MSc Projects

• Embedded High Speed Image Classification/Detection/Segmentation with Quantized Neural Networks

• Evaluation of Quantization Aware Training Methods for INT8 and lower bit-widths for image classification, object detection and segmentation

• Power and Latency Optimization of the Xilinx ZCU102 Platform for a Pedestrian Intention Recognition Use Case

• Optimization of 3D Convolutions

• Optimization of Object Detection Networks on the Google Edge TPU

Page 34: Embedded Machine Learning - Institute of Computer

Questions ?

?

Page 35: Embedded Machine Learning - Institute of Computer

Quantization

Page 36: Embedded Machine Learning - Institute of Computer

Quantization - Regularization

• Using small bit widths for weights saves memory, bandwidth and computation;
• The bit width can be different for different layers of the DNN;
• Quantization schemes: dynamic fixed point, power of 2;
• Retraining after quantization recovers accuracy losses: regularization;
• Not all weights are equal: weighted regularization.

Matthias Wess, Sai Manoj Pudukotai Dinakarrao, and Axel Jantsch. “Weighted Quantization-Regularization in DNNs for Weight Memory Minimization towards HW Implementation”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37.10 (Oct. 2018)
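A minimal sketch of the two quantization schemes named above, assuming a per-layer configuration; the helper functions and parameters are illustrative and not the exact procedure of the cited paper. Dynamic fixed point snaps a weight to a b-bit grid with a layer-specific fractional length fl, while power-of-2 quantization keeps only the sign and a rounded exponent, so multiplications become shifts:

#include <math.h>

/* Dynamic fixed point: b-bit signed value, step size 2^(-fl). */
float quant_dfp(float w, int b, int fl) {
    float step  = ldexpf(1.0f, -fl);                    /* 2^(-fl)            */
    float max_q = (ldexpf(1.0f, b - 1) - 1.0f) * step;  /* largest code value */
    float q = roundf(w / step) * step;
    if (q >  max_q) q =  max_q;                         /* saturate           */
    if (q < -max_q) q = -max_q;
    return q;
}

/* Power of 2: round log2|w| to the nearest integer exponent, keep the sign. */
float quant_po2(float w, int min_exp, int max_exp) {
    if (w == 0.0f) return 0.0f;
    int e = (int)roundf(log2f(fabsf(w)));
    if (e < min_exp) e = min_exp;
    if (e > max_exp) e = max_exp;
    return copysignf(ldexpf(1.0f, e), w);
}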

Page 37: Embedded Machine Learning - Institute of Computer

Quantization - Motivation

• DNN quantization
  • Reduces data movement
  • Reduces logic energy

• Layerwise bit-width optimization

[Figure: AllConvNet on CIFAR-10, layers conv1-conv9 (channels 96, 96, 96, 192, 192, 192, 192, 192, 100, with two max-pooling stages); bar chart of weight memory [bit] per layer (0 to 1.2e7), comparing full-precision weights against quantized weights at layer-wise bit-widths 7, 7, 7, 4, 4, 3, 3, 7, 7.]

Page 38: Embedded Machine Learning - Institute of Computer

Quantization - Motivation

[Figure: weight-value histograms. (a) Layer 1, 600 weights: full resolution vs. 3-bit DFP; (b) Layer 2, 900 weights: full resolution vs. 2-bit DFP. Axes: weight value vs. weight count.]

Page 39: Embedded Machine Learning - Institute of Computer

CIFAR-100

[Figure: CIFAR-100 compression. Accuracy in % (40-65) vs. compression ratio (4.0-8.5) for equal DFP, layer-wise DFP, and equal Po2 quantization, each with and without retraining, compared against the baseline accuracy.]

Page 40: Embedded Machine Learning - Institute of Computer

Quantization - Conclusion

• Dynamic Fixed Point is an effective alternative to integer or floating point representation;
• Different layers require different precision;
• Retraining adjusts the network to the available weight values;
• Weight memory reduction of 4-8x is common;
• Reduced weight precision reduces weight memory and cost of operation.

Page 41: Embedded Machine Learning - Institute of Computer

Summary

• Embedded Machine Learning has many applications;
  • Bandwidth limitations;
  • Delay constraints;
  • Privacy;
  • Security;

• There are distinct challenges:
  • Limited resources;
  • Specialized HW platforms;
  • Huge design space for optimization and mapping.

Page 42: Embedded Machine Learning - Institute of Computer

Inference Run Time Estimation

Page 43: Embedded Machine Learning - Institute of Computer

Estimation Framework

Page 44: Embedded Machine Learning - Institute of Computer

Inference Run Time Estimation

Assumption:
• Inference time as a function of problem size is a combination of step and linear functions, due to limited parallel resources.

Example:
• Single convolutional layer sweep
• 32x32x64 input with k filters and kernel size 3

Page 45: Embedded Machine Learning - Institute of Computer

Inference Run Time Estimation

• Assumption: the inference time can be approximated by a combination of linear and step functions for each dimension, such as filters, channels, etc.

• The function is determined from selected measurements.

• Goal: automatic computation of estimation functions for latency and power consumption on various platforms.

Page 46: Embedded Machine Learning - Institute of Computer

Automatic Estimation Function Generation

Page 47: Embedded Machine Learning - Institute of Computer

Iterative Refinement

Page 48: Embedded Machine Learning - Institute of Computer

Iterative Refinement

Page 49: Embedded Machine Learning - Institute of Computer

Iterative Refinement

Page 50: Embedded Machine Learning - Institute of Computer

Next Point Selection

• Linear function criteria:
  • Point furthest away from previous points

• Step function criteria:
  • Point with most unique discrete levels
  • Point with largest range of values
  • Point farthest away from previous points

• Next point selection: point with highest score

Page 51: Embedded Machine Learning - Institute of Computer

Next Point Selection

• Parameter optimization for step function criteria -> grid search

• Step function criteria:
  • Point with most unique discrete levels
  • Point with largest range of values
  • Point farthest away from previous points
  • Interesting areas (cluster of purple values)

Page 52: Embedded Machine Learning - Institute of Computer

Method Evaluation

• Results after 3 iterations (5 measurement points)

• Execution times:
  • Full sweep: 3-4 h
  • Proposed approach: 2-5 minutes

Page 53: Embedded Machine Learning - Institute of Computer

2D Example

• Phase 1: Estimate the function in a single dimension: number of filters

• Result: step function

Page 54: Embedded Machine Learning - Institute of Computer

2D Example

• Phase 2: Test how d, w and h behave in the next dimension

• Next dimension: input channels d_in

• Result:
  • Step function: d0 = 0.1418, w0 = 8, h0 = 0.0106
  • Constant: c = 32
  • Step function: d1 = 0.044, w1 = 8, h1 = 0.0121

Page 55: Embedded Machine Learning - Institute of Computer

2D Example

Generated model:

$f(d_{in}, k) = 0.1418 + \lfloor \frac{d_{in}-1}{8} \rfloor \cdot 0.0106 + \lfloor \frac{k-1}{32} \rfloor \cdot \left( 0.044 + \lfloor \frac{d_{in}-1}{8} \rfloor \cdot 0.0121 \right)$

• Measurement points: 112
• Execution time: 32 minutes

Page 56: Embedded Machine Learning - Institute of Computer

2D Example - Error

Page 57: Embedded Machine Learning - Institute of Computer

2D Example - Error

Page 58: Embedded Machine Learning - Institute of Computer

2D Example - Error

Slice through 2D plane at k = 1024

$f(d_{in}, k) = 0.1418 + \lfloor \frac{d_{in}-1}{8} \rfloor \cdot 0.0106 + \lfloor \frac{k-1}{32} \rfloor \cdot \left( 0.044 + \lfloor \frac{d_{in}-1}{8} \rfloor \cdot 0.0121 \right)$

$f(d_{in}, 1024) = 1.5058 + \lfloor \frac{d_{in}-1}{8} \rfloor \cdot 0.3857$

Page 59: Embedded Machine Learning - Institute of Computer

2D Example - Error

Slice through 2D plane at din = 128

$f(d_{in}, k) = 0.1418 + \lfloor \frac{d_{in}-1}{8} \rfloor \cdot 0.0106 + \lfloor \frac{k-1}{32} \rfloor \cdot \left( 0.044 + \lfloor \frac{d_{in}-1}{8} \rfloor \cdot 0.0121 \right)$

$f(128, k) = 0.3008 + \lfloor \frac{k-1}{32} \rfloor \cdot 0.2255$

Page 60: Embedded Machine Learning - Institute of Computer

Latency Estimation Summary

• Exploiting the discrete nature of HW resources
• Fast estimation function for latency based on linear and step functions
• Automatic derivation of the estimation for a new platform
• Results for Nvidia GPU platforms are promising

Page 61: Embedded Machine Learning - Institute of Computer

References I

Sumanth Gudaparthi et al. “Wire-Aware Architecture and Dataflow for CNN Accelerators”. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO ’52. Columbus, OH, USA: Association for Computing Machinery, 2019.

N. P. Jouppi et al. “In-datacenter performance analysis of a tensor processing unit”. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 2017, pp. 1–12.

SIA - SRC. Rebooting the IT Revolution: A Call to Action. Tech. rep. Semiconductor Industry Association and Semiconductor Research Corporation, Sept. 2015.

Matthias Wess, Sai Manoj Pudukotai Dinakarrao, and Axel Jantsch. “Weighted Quantization-Regularization in DNNs for Weight Memory Minimization towards HW Implementation”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37.10 (Oct. 2018).

Yu Wang, Gu-Yeon Wei, and David Brooks. “Benchmarking TPU, GPU, and CPU Platforms for Deep Learning”. In: CoRR abs/1907.10701 (2019). arXiv: 1907.10701.

Page 62: Embedded Machine Learning - Institute of Computer

Questions ?

?
