Embedded Machine Learning
Axel Jantsch
TU Wien, Vienna, Austria
December 2020
www.ict.tuwien.ac.at
Outline
1 Motivation and Challenges
2 HW Friendly Optimizations
3 CNN Accelerator Architectures
4 MSc Projects
5 Quantization
6 Inference Run Time Estimation
Motivation and Challenges
Why Embedded Machine Learning ?
• Machine learning is a powerful method to analyze data;
• Embedded applications produce huge amounts of sensor data;
• The data cannot or should not always be moved to central servers.
Compute Usage Trend
From: https://openai.com/blog/ai-and-compute/
Total Energy Consumption
SIA - SRC. Rebooting the IT Revolution: A Call to Action. Tech. rep. Semiconductor Industry Association and Semiconductor Research Corporation, Sept. 2015.
Deployed Sensors
SIA - SRC. Rebooting the IT Revolution: A Call to Action. Tech. rep. Semiconductor Industry Association and Semiconductor Research Corporation, Sept. 2015.
What is Special About “Embedded”?
• Resource limitation
• Connectivity
• Security
• Privacy
What is Special About “Embedded”?
Resource limitations
                     Embedded           Server farm
Computation [flop]   30-1800 · 10^12    86 · 10^18
Memory [bit]         10^10              10^15
Power [W]            5-100              10^3-10^6
Energy [Wh]          48-1000            200 · 10^6

Computation embedded refers to an Nvidia Jetson Nano running 1 min and 1 hour, respectively.
Computation server refers to the computation needed for the 40-day experiment with AlphaGo Zero.
Energy embedded refers to a mobile phone and to a car battery, respectively.
Energy server refers to the 40-day experiment for AlphaGo Zero.
Case for Embedded ML
• Embedded inference:
  • More energy efficient
  • Bandwidth constraints
  • Latency constraints
  • Not always on-line and connected to a cloud server
  • Security
  • Privacy
• Embedded continuous learning:
  • Customization and specialization
  • Security
  • Privacy
DNNs: Embedded and the Cloud
[Figure: deployment options for DNNs across embedded and cloud:
• Training and inference both cloud-based;
• Training cloud-based, inference embedded (with embedded preprocessing);
• Preprocessing embedded, inference cloud-based;
• Design-time training cloud-based, periodic retraining with embedded adaptation.]
Design Space
Platform Choices (e.g. ARM NN, Nvidia Turing):
• Processing unit selection
• Reconfiguration
• Batch processing
• Deep pipelining
• Resource reuse
• Hierarchical control
• Memory allocation
• Memory reuse
• etc.

DNN Choices:
• Number of layers
• Convolutional layers
• Filter kernels
• Number of filters
• Filter shape
• Stride
• Pooling layers
• Fully connected layers
• Regularization
• etc.

Mapping Choices:
• Neuron pruning
• Connection pruning
• Weight sparsifying
• Data type selection
• Approximation
• Retraining
• Regularization
• etc.
Quiz Time
1 In which year will computing consume all worldwide produced energy, if
  a current trends continue without major innovation in technology;
  b current trends continue with aggressive and major innovations in technology?
Choices: 2027, 2032, 2037, 2042, 2047
HW Friendly Optimizations
Optimization Categories
1 Minimize number of operations to be performed;
2 Simplify each operation;
3 Execute the operations as efficiently as possible.
HW Friendly Optimizations
• Loop reordering, unrolling, pipelining
• Tiling
• Batching
• Binarized CNNs
Loop Optimizations

Convolution layer algorithm:

for (to = 0; to < M; to++) {              // output feature map
  for (ti = 0; ti < N; ti++) {            // input feature map
    for (row = 0; row < R; row++) {       // row
      for (col = 0; col < C; col++) {     // column
        for (i = 0; i < K; i++) {         // filter
          for (j = 0; j < K; j++) {
            Ofmap[to][row][col] += W[to][ti][i][j]
                                 * Ifmap[ti][S*row+i][S*col+j];
} } } } } }

M ... number of output feature maps
N ... number of input feature maps
R ... number of rows
C ... number of columns
K ... filter kernel size
S ... stride
W ... weight matrix
Loop Optimizations
• Loop reordering to improve cache efficiency;
• Loop unrolling to improve parallelism;
• Loop pipelining to improve parallelism.
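As a minimal illustration of unrolling (a sketch, assuming the K = 3 filter kernel of the convolution algorithm above), the innermost j loop can be written once as a loop and once fully unrolled; the unrolled form removes loop-control overhead and exposes all three multiply-accumulates to the compiler at once:

```c
/* Dot product of one K = 3 filter row with an input row. */
float filter_row_looped(const float w[3], const float x[3]) {
    float acc = 0.0f;
    for (int j = 0; j < 3; j++)   /* innermost loop of the convolution */
        acc += w[j] * x[j];
    return acc;
}

/* Same computation with the j loop unrolled by a factor of 3:
   no branch and no counter update per multiply-accumulate. */
float filter_row_unrolled(const float w[3], const float x[3]) {
    return w[0] * x[0] + w[1] * x[1] + w[2] * x[2];
}
```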
Loop Tiling
for (to = 0; to < M; to++) {              // output feature map
  for (ti = 0; ti < N; ti++) {            // input feature map
    for (row = 0; row < R; row += Tr) {   // tiled row loop
      for (col = 0; col < C; col++) {     // column
        for (trr = row; trr < min(R, row + Tr); trr++) {
          for (i = 0; i < K; i++) {       // filter
            for (j = 0; j < K; j++) {
              Ofmap[to][trr][col] += W[to][ti][i][j]
                                   * Ifmap[ti][S*trr+i][S*col+j];
} } } } } } }

For efficient use of caches.
Batching
• Reuse of weights;
• Improves throughput;
• Increases latency.
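The reuse effect can be sketched on a hypothetical fully connected layer (names and sizes are illustrative, not from the slides): each weight row is fetched once and applied to all B inputs of the batch, so per-inference weight traffic shrinks by a factor of B, while the first result is only available after the whole batch has been processed.

```c
/* Fully connected layer with M outputs and N inputs, applied to a
   batch of B input vectors. Row m of W is loaded once and reused
   for all B inputs before moving on to the next row. */
void fc_batched(int B, int M, int N,
                const float *W,    /* M x N weights */
                const float *in,   /* B x N inputs  */
                float *out) {      /* B x M outputs */
    for (int m = 0; m < M; m++)
        for (int b = 0; b < B; b++) {  /* reuse weight row across batch */
            float acc = 0.0f;
            for (int n = 0; n < N; n++)
                acc += W[m * N + n] * in[b * N + n];
            out[b * M + m] = acc;
        }
}
```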
Binarized CNNs (BNN)
• Weights and internal computation are represented as 1-bit binary numbers;
• Instead of MAC operations, BNNs use XOR and bit-count;
• Attractive for HW and FPGAs.
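A sketch of the XOR/bit-count trick (assuming {−1, +1} values packed one per bit, with bit = 1 encoding +1, and the GCC/Clang popcount builtin): the 32-element binary dot product reduces to one XOR, one negation and one population count.

```c
#include <stdint.h>

/* Dot product of two 32-element {-1,+1} vectors packed into uint32_t.
   popcount(~(w ^ x)) counts the positions where the signs agree, so
   dot = matches - (32 - matches) = 2*matches - 32.
   __builtin_popcount is a GCC/Clang builtin. */
int bnn_dot32(uint32_t w, uint32_t x) {
    int matches = __builtin_popcount(~(w ^ x));
    return 2 * matches - 32;
}
```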
Quiz Time
2 Which optimization method is most promising in improving ML compute efficiency:
  a Optimize the network to minimize the number of computations,
  b Improve the processing element and processing datapath architecture,
  c Improve the memory architecture,
  d Optimize the mapping of the network onto the target architecture,
  e Minimize the data type?
CNN Accelerator Architectures
CNN Accelerator Architectures
• Systolic array architecture
• In-memory computing
Systolic Arrays
Tensor Processing Unit (TPU)
N. P. Jouppi et al. “In-datacenter performance analysis of a tensor processing unit”. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 2017, pp. 1–12.
In-memory Computing
• Storage capacity and memory access bandwidth and latency dominate DNNs.
• Avoid moving data.
• Distribute the MAC units in the memory architecture.
Wire Aware Accelerator (WAX)
Sumanth Gudaparthi et al. “Wire-Aware Architecture and Dataflow for CNN Accelerators”. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO ’52. Columbus, OH, USA: Association for Computing Machinery, 2019.
Quiz Time
3 For optimizing CNN execution, is it more effective
a to re-use (and keep on-chip) input data,
b to re-use weights,
c to re-use intermediate data ?
MSc Projects
MSc Projects
• Embedded High Speed Image Classification/Detection/Segmentation with Quantized Neural Networks
• Evaluation of Quantization Aware Training Methods for INT8 and lower bit-widths for image classification, object detection and segmentation
• Power and Latency Optimization of the Xilinx ZCU102 Platform for a Pedestrian Intention Recognition Use Case
• Optimization of 3D Convolutions
• Optimization of Object Detection Networks on the Google Edge TPU
Questions ?
Quantization
Quantization - Regularization
• Using small bit widths for weights saves memory, bandwidth and computation;
• Bit width can be different for different layers of the DNN;
• Quantization scheme: dynamic fixed point, power of 2;
• Retraining after quantization recovers accuracy losses: regularization;
• Not all weights are equal: weighted regularization.
Matthias Wess, Sai Manoj Pudukotai Dinakarrao, and Axel Jantsch. “Weighted Quantization-Regularization in DNNs for Weight Memory Minimization towards HW Implementation”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37.10 (Oct. 2018).
Quantization - Motivation
• DNN quantization
  • Reduces data movement
  • Reduces logic energy
• Layerwise bit-width optimization
[Figure: AllConvNet on CIFAR-10; weight memory [bit] per layer (conv1-conv9, channels 96/192/100, two max-pooling stages) before and after quantization; optimized layer bit-widths: conv1-conv3: 7, conv4-conv5: 4, conv6-conv7: 3, conv8-conv9: 7.]
Quantization - Motivation
[Figure: weight histograms (weight count over weight value). (a) Layer 1: 600 weights at full resolution, quantized to 3-bit DFP; (b) Layer 2: 900 weights at full resolution, quantized to 2-bit DFP.]
CIFAR-100
[Figure: CIFAR-100 compression; accuracy in % (40-65) over compression ratio (4.0-8.5) against the baseline; curves for equal DFP, layer-wise DFP and equal Po2, each with and without retraining.]
Quantization - Conclusion
• Dynamic fixed point is an effective alternative to integer or floating point representation;
• Different layers require different precision;
• Retraining adjusts the network to the available weight values;
• Weight memory reductions of 4-8x are common;
• Reduced weight precision reduces weight memory and cost of operation.
Summary
• Embedded Machine Learning has many applications:
  • Bandwidth limitations;
  • Delay constraints;
  • Privacy;
  • Security;
• There are distinct challenges:
  • Limited resources;
  • Specialized HW platforms;
  • Huge design space for optimization and mapping.
Inference Run Time Estimation
Estimation Framework
Inference Run Time Estimation
• Assumption: Inference time as a function of problem size is a combination of step and linear functions, due to limited parallel resources.
• Example:
  • Single convolutional layer sweep
  • 32x32x64 input with k filters and kernel size 3
Inference Run Time Estimation
• Assumption: The inference time can be approximated by a combination of linear and step functions for each dimension, such as filters, channels, etc.
• The estimation function is determined from selected measurements.
• Goals: automatic computation of estimation functions for latency, for power consumption, and for various platforms.
Automatic Estimation Function Generation
Iterative Refinement
Next Point Selection
• Linear function criteria:
  • Point furthest away from previous points
• Step function criteria:
  • Point with most unique discrete levels
  • Point with largest range of values
  • Point farthest away from previous points
• Next point selection: point with highest score
Next Point Selection
• Parameter optimization for step function criteria -> grid search
• Step function criteria:
  • Point with most unique discrete levels
  • Point with largest range of values
  • Point farthest away from previous points
  • Interesting areas (clusters of purple values)
Method Evaluation
• Results after 3 iterations (5 measurement points)
• Execution times:
  • Full sweep: 3-4 h
  • Proposed approach: 2-5 minutes
2D Example
• Phase 1: Estimate the function in a single dimension: number of filters
• Result: step function
2D Example
• Phase 2: Test how d, w and h behave in the next dimension
• Next dimension: input channels d_in
• Result:
  • Step function: d0 = 0.1418, w0 = 8, h0 = 0.0106
  • Constant: c = 32
  • Step function: d1 = 0.044, w1 = 8, h1 = 0.0121
2D Example
Generated model:

f(d_in, k) = 0.1418 + ⌊(d_in − 1)/8⌋ · 0.0106 + ⌊(k − 1)/32⌋ · (0.044 + ⌊(d_in − 1)/8⌋ · 0.0121)

• Measurement points: 112
• Execution time: 32 minutes
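The generated model can be evaluated directly; integer division implements the floor (constants taken from the slide above):

```c
/* Latency model from the 2D example:
   f(d_in, k) = 0.1418 + floor((d_in-1)/8)*0.0106
              + floor((k-1)/32)*(0.044 + floor((d_in-1)/8)*0.0121) */
double latency_model(int d_in, int k) {
    int sd = (d_in - 1) / 8;  /* step index along input channels */
    int sk = (k - 1) / 32;    /* step index along filters        */
    return 0.1418 + sd * 0.0106 + sk * (0.044 + sd * 0.0121);
}
```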
61
ww
w.i
ct.
tuw
ien
.ac.a
t
2D Example - Error
Slice through the 2D plane at k = 1024:

f(d_in, k) = 0.1418 + ⌊(d_in − 1)/8⌋ · 0.0106 + ⌊(k − 1)/32⌋ · (0.044 + ⌊(d_in − 1)/8⌋ · 0.0121)

f(d_in, 1024) = 1.5058 + ⌊(d_in − 1)/8⌋ · 0.3857
2D Example - Error
Slice through the 2D plane at d_in = 128:

f(d_in, k) = 0.1418 + ⌊(d_in − 1)/8⌋ · 0.0106 + ⌊(k − 1)/32⌋ · (0.044 + ⌊(d_in − 1)/8⌋ · 0.0121)

f(128, k) = 0.3008 + ⌊(k − 1)/32⌋ · 0.2255
Latency Estimation Summary
• Exploiting the discrete nature of HW resources
• Fast estimation functions for latency, based on linear and step functions
• Automatic derivation of the estimation for a new platform
• Results for Nvidia GPU platforms are promising
References I
Sumanth Gudaparthi et al. “Wire-Aware Architecture and Dataflow for CNN Accelerators”. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO ’52. Columbus, OH, USA: Association for Computing Machinery, 2019.

N. P. Jouppi et al. “In-datacenter performance analysis of a tensor processing unit”. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 2017, pp. 1–12.

SIA - SRC. Rebooting the IT Revolution: A Call to Action. Tech. rep. Semiconductor Industry Association and Semiconductor Research Corporation, Sept. 2015.

Matthias Wess, Sai Manoj Pudukotai Dinakarrao, and Axel Jantsch. “Weighted Quantization-Regularization in DNNs for Weight Memory Minimization towards HW Implementation”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37.10 (Oct. 2018).

Yu Wang, Gu-Yeon Wei, and David Brooks. “Benchmarking TPU, GPU, and CPU Platforms for Deep Learning”. In: CoRR abs/1907.10701 (2019). arXiv: 1907.10701.
Questions ?