TRANSCRIPT
MCUNet: Tiny Deep Learning on IoT Devices
NeurIPS 2020 (spotlight)
Ji Lin1, Wei-Ming Chen1,2, Yujun Lin1, John Cohn3, Chuang Gan3, Song Han1
1MIT  2National Taiwan University  3MIT-IBM Watson AI Lab
Background: The Era of AIoT on Microcontrollers (MCUs)

• Low-cost, low-power
• Rapid growth
• Wide applications: Smart Retail, Personalized Healthcare, Precision Agriculture, Smart Home, …

[Chart: MCU unit shipments, #Units (Billion), 2012-2019F]
Challenge: Memory Too Small to Hold DNN

                       Cloud AI    Mobile AI    Tiny AI
Memory (Activation)    16GB        4GB          320kB
Storage (Weights)      ~TB/PB      256GB        1MB

(320kB of memory is 13,000x smaller than mobile and 50,000x smaller than cloud.)
We need to reduce the peak activation size AND the model size to fit a DNN into MCUs.
Existing efficient networks only reduce the model size, NOT the activation size!

[Bar chart: Param (MB) and Peak Activation (MB) for ResNet-18, MobileNetV2-0.75, and MCUNet, all at ~70% ImageNet Top-1. MobileNetV2-0.75 cuts parameters by 4.6x vs. ResNet-18, but peak activation by only 1.8x.]
Challenge: Memory Too Small to Hold DNN

[Bar chart: Peak Memory (kB) vs. the 320kB constraint. ResNet-50 exceeds it by 22x, MobileNetV2 by 23x, and even int8-quantized MobileNetV2 by 5x; only MCUNet fits.]
MCUNet: System-Algorithm Co-design

(a) Search an NN model on an existing library (e.g., ProxylessNAS, MnasNet): NAS -> Library
(b) Tune the deep learning library given an NN model (e.g., TVM): NN Model -> Library
(c) MCUNet: system-algorithm co-design: TinyNAS (efficient neural architecture) and TinyEngine (efficient compiler/runtime), designed jointly.
TinyNAS: Two-Stage NAS for Tiny Memory Constraints

Search space design is crucial for NAS performance, and there is no prior expertise on MCU model design.

Full Network Space -> (memory/storage constraints) -> Optimized Search Space -> Model Specialization
TinyNAS: (1) Automated search space optimization

Revisit the ProxylessNAS search space: S = kernel size x expansion ratio x depth
• kernel size k ∈ {3, 5, 7}
• expansion ratio e ∈ {2, 4, 6} of the inverted bottleneck (pw1 -> dw -> pw2)
• depth d ∈ {2, 3, 4}

But models from this mobile-scale space are out of memory on MCUs!
TinyNAS: (1) Automated search space optimization

Extend the search space to cover a wide range of hardware capacities:
S' = kernel size x expansion ratio x depth x input resolution R x width multiplier W

Different R and W for different hardware capacities (i.e., a different optimized sub-space per device):
• R=224, W=1.0
• R=260, W=1.4 *
• R=?, W=? for MCUs: STM32 F412/F746/H743/… with 256kB/320kB/512kB/… SRAM

* Cai et al., Once-for-All: Train One Network and Specialize it for Efficient Deployment, ICLR'20
TinyNAS: (1) Automated search space optimization

Analyze the FLOPs distribution of satisfying models in each search space:
larger FLOPs -> larger model capacity -> more likely to give higher accuracy.

[CDF plot: cumulative probability vs. FLOPs (M) for candidate sub-spaces, labeled "width-res. | mFLOPs" (e.g., w0.3-r160 | 32.5, w0.4-r144 | 46.9, w0.5-r144 | 52.0, w0.7-r112 | 38.4, …), under the 320kB constraint. Reading each curve at p=80%: a bad design space reaches only 32.3M FLOPs (best acc: 74.2%), an intermediate one 45.4M (best acc: 76.4%), while a good design space reaches 50.3M FLOPs (best acc: 78.7%). A good design space is likely to achieve high FLOPs under the memory constraint.]
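To make the heuristic concrete, here is a minimal, self-contained sketch of how one could score each (width, resolution) sub-space by the FLOPs its memory-satisfying models reach at a target CDF percentile. The cost models below are rough assumed proxies for exposition, not the released TinyNAS code:

```python
import random

STAGES = 5  # number of searchable stages (illustrative)

def sample_cfg(w, r):
    """Randomly sample (kernel, expansion, depth) per stage from S'."""
    return {
        "w": w, "r": r,
        "k": [random.choice([3, 5, 7]) for _ in range(STAGES)],
        "e": [random.choice([2, 4, 6]) for _ in range(STAGES)],
        "d": [random.choice([2, 3, 4]) for _ in range(STAGES)],
    }

def cost(cfg):
    """Very rough analytic proxies for MFLOPs, peak SRAM (kB), Flash (kB)."""
    mflops = peak_kb = flash_kb = 0.0
    res = cfg["r"] // 4                    # assume a stride-4 stem first
    cin = max(8, int(16 * cfg["w"]))
    for k, e, d in zip(cfg["k"], cfg["e"], cfg["d"]):
        cout = int(cin * 1.5)
        for _ in range(d):
            mid = cin * e
            macs = res * res * (cin * mid + mid * k * k + mid * cout)
            mflops += macs / 1e6                                # pw1+dw+pw2
            peak_kb = max(peak_kb, res * res * (cin + mid) / 1024)  # int8
            flash_kb += (cin * mid + mid * k * k + mid * cout) / 1024
            cin = cout
        res = max(1, res // 2)             # stride-2 between stages
    return mflops, peak_kb, flash_kb

def space_score(w, r, sram_kb=320, flash_kb=1024, n=1000, p=0.8):
    """FLOPs that satisfying models reach with probability p (CDF quantile)."""
    ok = sorted(f for f, mem, sto in
                (cost(sample_cfg(w, r)) for _ in range(n))
                if mem <= sram_kb and sto <= flash_kb)
    return ok[int(p * (len(ok) - 1))] if ok else 0.0

candidates = [(0.3, 160), (0.4, 144), (0.5, 144), (0.7, 112)]
best = max(candidates, key=lambda wr: space_score(*wr))
print("chosen sub-space (width, resolution):", best)
```

The sub-space whose satisfying models reach the highest FLOPs with high probability is then used for the architecture search.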
TinyNAS: (2) Resource-constrained model specialization

• One-shot NAS through weight sharing: small sub-networks are nested in large sub-networks.
• Train a super network by randomly sampling sub-networks (kernel size, expansion, depth) and jointly fine-tuning multiple sub-networks per step.
• Then directly evaluate the accuracy of sub-nets.

* Cai et al., Once-for-All: Train One Network and Specialize it for Efficient Deployment, ICLR'20
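A hedged sketch of one such weight-sharing training step (PyTorch-style; set_active_subnet is an assumed helper for masking the super-net, not the authors' released API):

```python
import random
import torch
import torch.nn.functional as F

def train_step(supernet, optimizer, images, labels, n_subnets=4):
    """One weight-sharing step: sample a few sub-nets, share gradients."""
    optimizer.zero_grad()
    for _ in range(n_subnets):
        cfg = {  # a random (kernel size, expansion, depth) per stage
            "k": [random.choice([3, 5, 7]) for _ in range(5)],
            "e": [random.choice([2, 4, 6]) for _ in range(5)],
            "d": [random.choice([2, 3, 4]) for _ in range(5)],
        }
        supernet.set_active_subnet(cfg)   # assumed helper: mask the super-net
        loss = F.cross_entropy(supernet(images), labels)
        (loss / n_subnets).backward()     # accumulate into shared weights
    optimizer.step()
```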
TinyNAS: (2) Resource-constrained model specialization

• Elastic kernel size: start with the full 7x7 kernel; a smaller kernel (5x5, then 3x3) takes the centered weights of the larger one.
• Elastic depth: train with full depth, then allow the later layers in each unit to be skipped to reduce the depth.
• Elastic width: train with full width; to shrink the width, compute channel importance, sort and reorganize the channels, and keep the most important ones.
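Illustrative numpy sketches of the three elastic transforms (simplifications for exposition, not the released code):

```python
import numpy as np

def shrink_kernel(w_full, k):
    """Elastic kernel: a k x k kernel takes the centered weights of the
    full kernel (w_full: [C, 7, 7])."""
    s = (w_full.shape[-1] - k) // 2
    return w_full[:, s:s + k, s:s + k]

def shrink_depth(blocks, d):
    """Elastic depth: keep the first d blocks of a unit, skip later ones."""
    return blocks[:d]

def shrink_width(w, c_out):
    """Elastic width: rank output channels by importance (L1 norm of their
    weights), then keep the most important c_out channels.
    w: [C_out, C_in, k, k]."""
    importance = np.abs(w).sum(axis=(1, 2, 3))
    keep = np.sort(np.argsort(-importance)[:c_out])  # preserve channel order
    return w[keep]
```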
TinyNAS Better Utilizes the Memory

[Bar chart: peak memory (kB) per block over the first two stages, MobileNetV2 vs. TinyNAS. TinyNAS reduces the maximum and the average per-block peak memory by 2.2x and 1.6x, respectively.]

• TinyNAS designs networks with more uniform peak memory across blocks, allowing us to fit a larger model at the same amount of memory.
TinyEngine: A Memory-Efficient Inference Library

1. Reducing overhead by separating compilation and runtime

(a) Existing libraries are based on runtime interpretation (e.g., TF-Lite Micro, CMSIS-NN): the runtime interprets the model's meta information and performs memory allocation on the fly, and must ship all supported ops. This causes computation overhead, storage overhead, and memory overhead.

(b) TinyEngine: model-adaptive code generation. The memory schedule is derived at compile time (offline), and only the specialized ops the model needs are compiled in; the runtime does inference only.

[Bar chart: Peak Mem (kB), lower is better. Baseline ARM CMSIS-NN: 160kB; with code generation: 1.7x smaller.]
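As a rough illustration of the compile-time side (not TinyEngine's actual code generator), a Python script can walk the model once, bake a static activation-buffer schedule into the emitted C, and reference only the specialized kernels the model uses. The kernel names and the ping-pong arena below are hypothetical; a real scheduler packs buffers by exact tensor life-times to minimize the peak:

```python
layers = [          # (layer name, specialized op, output bytes); illustrative
    ("conv0", "conv3x3",   36864),
    ("dw1",   "dwconv3x3", 36864),
    ("pw1",   "conv1x1",   18432),
]

def emit_c(layers):
    """Emit C with a fixed arena and compile-time buffer offsets:
    no malloc, no runtime interpreter."""
    slot = max(size for _, _, size in layers)  # one slot fits any activation
    src = ["#include <stdint.h>",
           f"static uint8_t arena[{2 * slot}];  // fixed at compile time",
           "void run(void) {"]
    for i, (name, op, _) in enumerate(layers):
        off_in, off_out = (0, slot) if i % 2 == 0 else (slot, 0)
        src.append(f"  {op}_{name}(arena + {off_in}, arena + {off_out});")
    src.append("}")
    return "\n".join(src)

print(emit_c(layers))
```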
TinyEngine: A Memory-Efficient Inference Library

2. In-place depth-wise convolution

The dilemma of the inverted bottleneck: the channel count goes 1x -> 6x -> 1x, so peak memory is dominated by the 6x expanded activation at the depth-wise layer.

[Diagram: (a) Depth-wise convolution with separate input and output activations of N channels each: peak memory 2N. (b) In-place depth-wise convolution: since each channel is computed independently, channel i's output is computed into a one-channel temp buffer and then written back over input channel i: peak memory N+1.]

[Bar chart: Peak Mem (kB), lower is better. Baseline ARM CMSIS-NN: 160kB; code generation + in-place depth-wise: 3.4x smaller.]
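A numpy sketch of the idea (illustrative; assumes float activations, unit stride, and 'same' padding):

```python
import numpy as np

def dwconv3x3_inplace(x, w):
    """x: [N, H, W] float activations (updated in place); w: [N, 3, 3]."""
    n, h, wid = x.shape
    for i in range(n):                        # each channel is independent
        padded = np.pad(x[i], 1)              # pad ONE channel
        out = np.zeros((h, wid), x.dtype)     # temp buffer: ONE channel
        for dy in range(3):
            for dx in range(3):
                out += w[i, dy, dx] * padded[dy:dy + h, dx:dx + wid]
        x[i] = out                            # write back over the input
    return x
```

The live activations are the N input/output channels plus the one-channel temp buffer (N+1), instead of a separate N-channel input and N-channel output (2N).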
TinyEngine: Faster Inference Speed

Analyzing Million MAC/s improved by each technique (baseline: ARM CMSIS-NN at 52 MMAC/s):

(1) Code generator-based compilation eliminates the overheads of runtime interpretation: 52 -> 64 MMAC/s.
(2) Model-adaptive memory scheduling increases data reuse for each layer, via (a) model-level memory scheduling and (b) tile size configuration for Im2col: 64 -> 70 MMAC/s.
(3) Computation kernel specialization:
• Operation fusion (e.g., fuse Pad+Conv+ReLU+BN into one specialized kernel) minimizes the memory footprint and optimizes the overall computation: 70 -> 75 MMAC/s.
• Loop unrolling (e.g., fully unroll the 3x3 conv) eliminates the branch-instruction overheads of loops: 75 -> 79 MMAC/s.
• Loop tiling for each layer, with tiles chosen from the kernel size and available memory: 79 -> 82 MMAC/s.

Loop unrolling and tiling are specialized according to the network architecture; a toy generator illustrating the unrolling step follows.
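The sketch below (illustrative, not TinyEngine's emitter) specializes a padding-free 3x3 depth-wise kernel: because k=3 is known at compile time, the nine multiply-accumulates are fully unrolled, removing loop-branch overhead:

```python
def gen_dw3x3(name):
    """Emit a fully unrolled, padding-free 3x3 depth-wise C kernel."""
    macs = "\n".join(
        f"      acc += x[(y + {dy}) * W + (xx + {dx})] * k{dy}{dx};"
        for dy in range(3) for dx in range(3))
    return f"""void {name}(const int8_t *x, const int8_t *w,
               int32_t *out, int H, int W) {{
  const int8_t k00=w[0], k01=w[1], k02=w[2],
               k10=w[3], k11=w[4], k12=w[5],
               k20=w[6], k21=w[7], k22=w[8];
  for (int y = 0; y < H - 2; ++y)
    for (int xx = 0; xx < W - 2; ++xx) {{
      int32_t acc = 0;
{macs}
      out[y * (W - 2) + xx] = acc;
    }}
}}"""

print(gen_dw3x3("dwconv3x3_layer4"))
```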
TinyEngine: A Memory-Efficient Inference Library

[Summary charts: Peak Mem (kB): baseline ARM CMSIS-NN 160kB; code generation + in-place depth-wise: 3.4x smaller. Million MAC/s: 52 (CMSIS-NN) -> 64 (code generation) -> 70 (specialized Im2col) -> 75 (op fusion) -> 79 (loop unrolling) -> 82 (tiling): 1.6x faster overall.]

[Comparison on SmallCifar, MobileNetV2, ProxylessNAS, and MnasNet. Normalized speed (TinyEngine = 1.00): TF-Lite Micro 0.32-0.33, MicroTVM 0.82-0.94 (OOM on two of the four models), tuned CMSIS-NN 0.61-0.66; TinyEngine is 1.5x-3x faster. Peak Mem (kB): TinyEngine uses 2.7x-4.8x less memory than the baselines, several of which go OOM.]

• Consistent improvement on different networks
Experimental Results

We focus on large-scale datasets to reflect real-life use cases.
Datasets: (1) ImageNet-1000; (2) Wake Words (visual: Visual Wake Words; audio: Google Speech Commands)
System-Algorithm Co-design Gives the Best Results

• ImageNet classification on STM32F746 MCU (320kB SRAM, 1MB Flash), Top-1 accuracy:
  Baseline (MbV2* + CMSIS-NN): 40%
  System-only (MbV2* + TinyEngine): 44%
  Model-only (TinyNAS + CMSIS-NN): 56%
  Co-design (TinyNAS + TinyEngine): 62%

* scaled-down version: width multiplier 0.3, input resolution 80
Handling Diverse Hardware

• Specializing models (int4) for different MCUs (SRAM/Flash), ImageNet Top-1 accuracy:
  STM32F412 (256kB/1MB): 62.0%
  STM32F746 (320kB/1MB): 63.5%
  STM32F765 (512kB/1MB): 65.9%
  STM32H743 (512kB/2MB): 70.7%
  (MobileNetV2 + CMSIS-NN baseline: 53.8%; MCUNet: +17%)

The first to achieve >70% ImageNet accuracy on commercial MCUs.
Reduce Both Model Size and Activation Size

[Bar chart: Param (MB) and Peak Activation (MB) for ResNet-18, MobileNetV2-0.75, and MCUNet at ~70% ImageNet Top-1. MobileNetV2-0.75 reduces parameters 4.6x vs. ResNet-18 but peak activation only 1.8x; MCUNet reduces both, with 24.6x smaller parameters and 13.8x smaller peak activation.]
Visual Wake Words (VWW)

[Trade-off curves for MCUNet vs. MobileNetV2, ProxylessNAS, and Han et al.:
(a) VWW accuracy vs. measured latency (ms): MCUNet reaches the same accuracy 2.4x-3.4x faster, running at 5-10 FPS on the MCU.
(b) VWW accuracy vs. peak SRAM (kB): 3.7x smaller peak memory, fitting the 256kB constraint on the MCU where baselines go OOM.]
Audio Wake Words (Speech Commands)

[Trade-off curves for MCUNet vs. MobileNetV2 and ProxylessNAS:
(a) GSC accuracy vs. measured latency (ms): 2.8x faster at 2% higher accuracy, running at 5-10 FPS.
(b) GSC accuracy vs. peak SRAM (kB): 4.1x smaller peak memory under the 256kB constraint, where baselines go OOM.]
Visual Wake Word Detection
• Detecting whether a person is present in the frame
MCUNet: Tiny Deep Learning on IoT Devices
Project Page: http://tinyml.mit.edu
• Our study suggests that the era of tiny machine learning on IoT devices has arrived
Cloud AI: ResNet -> Mobile AI: MobileNet -> Tiny AI: MCUNet