
Page 1:

MCUNet: Tiny Deep Learning on IoT Devices

NeurIPS 2020 (spotlight)

Ji Lin^1, Wei-Ming Chen^1,2, John Cohn^3, Yujun Lin^1, Song Han^1, Chuang Gan^3

^1 MIT   ^2 National Taiwan University   ^3 MIT-IBM Watson AI Lab


Page 4:

Background: The Era of AIoT on Microcontrollers (MCUs)

• Low-cost, low-power
• Rapid growth
• Wide applications: Smart Retail, Personalized Healthcare, Precision Agriculture, Smart Home

[Chart: MCU shipments, #Units (Billion), 0-50, for years 2012-2019F]


Page 10:

Challenge: Memory Too Small to Hold DNN

                        Cloud AI    Mobile AI    Tiny AI (MCU)
Memory (Activation):    16GB        4GB          320kB
Storage (Weights):      ~TB/PB      256GB        1MB

MCU memory is 13,000x smaller than mobile (4GB -> 320kB) and 50,000x smaller than cloud (16GB -> 320kB).

We need to reduce the peak activation size AND the model size to fit a DNN into MCUs.

Page 11:

Existing efficient networks only reduce model size but NOT activation size!

[Chart: Param (MB) and Peak Activation (MB) for ResNet-18, MobileNetV2-0.75, and MCUNet, all at ~70% ImageNet Top-1. MobileNetV2-0.75 shrinks parameters by 4.6x relative to ResNet-18, but peak activation by only 1.8x.]


Page 14:

Challenge: Memory Too Small to Hold DNN

[Chart: Peak Memory (kB, 0-8000) against the 320kB constraint. ResNet-50 and MobileNetV2 exceed the constraint by roughly 22-23x; even int8-quantized MobileNetV2 exceeds it by 5x. Only MCUNet fits under 320kB.]

Page 15:

MCUNet: System-Algorithm Co-design


Page 18:

MCUNet: System-Algorithm Co-design

(a) Search an NN model on an existing library (e.g., ProxylessNAS, MnasNet): NAS -> Library
(b) Tune the deep learning library given an NN model (e.g., TVM): NN Model -> Library
(c) MCUNet: system-algorithm co-design. TinyNAS (efficient neural architecture) and TinyEngine (efficient compiler/runtime) are optimized jointly.



Page 22:

TinyNAS: Two-Stage NAS for Tiny Memory Constraints

Search space design is crucial for NAS performance, but there is no prior expertise on MCU model design.

Full Network Space -> Optimized Search Space (under memory/storage constraints) -> Model Specialization


Page 26:

TinyNAS: (1) Automated search space optimization

Revisit the ProxylessNAS search space: S = kernel size × expansion ratio × depth

• Kernel size k ∈ {3, 5, 7} for the depth-wise (dw) convolution between the two point-wise layers (pw1, pw2)
• Expansion ratio e ∈ {2, 4, 6}
• Depth d ∈ {2, 3, 4}

Page 27:

TinyNAS: (1) Automated search space optimization

Models from the ProxylessNAS search space run out of memory on MCUs!
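The out-of-memory failure can be made concrete with a quick back-of-the-envelope check. This is a hypothetical sketch, not the authors' code: a simplified int8 activation model for one inverted-bottleneck block, checked against the 320kB SRAM budget; the function name, base numbers, and example configurations are all illustrative.

```python
# Hypothetical sketch: estimate the peak activation memory of one int8
# inverted-bottleneck block and compare it against a 320kB SRAM budget.
# This ignores im2col buffers and runtime overhead on purpose.

def block_peak_activation_bytes(resolution, in_channels, expansion):
    """Peak = input buffer + expanded hidden buffer (1 byte/elem for int8).

    Simplified model: the pw1 output of size
    resolution^2 * in_channels * expansion dominates the footprint.
    """
    input_buf = resolution * resolution * in_channels
    hidden_buf = resolution * resolution * in_channels * expansion
    return input_buf + hidden_buf

SRAM_LIMIT = 320 * 1024  # the 320kB constraint (e.g., STM32F746)

# An ImageNet-scale early block vs. a scaled-down configuration.
full_res = block_peak_activation_bytes(112, 16, 6)
tiny = block_peak_activation_bytes(48, 8, 4)

print(full_res > SRAM_LIMIT)  # the full-size block alone overflows SRAM
print(tiny <= SRAM_LIMIT)
```

The point is that a single full-resolution block already exceeds the entire SRAM, so the search space itself must be shrunk before searching.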

Page 28:

TinyNAS: (1) Automated search space optimization

Extended search space to cover a wide range of hardware capacity:
S' = kernel size × expansion ratio × depth × input resolution R × width multiplier W


Page 31:

TinyNAS: (1) Automated search space optimization

Different R and W for different hardware capacity (i.e., a different optimized sub-space per device):

• R=224, W=1.0
• R=260, W=1.4 (Once-for-All; Cai et al., ICLR'20)
• R=?, W=? for MCUs (STM32F412/F746/H743/…, with 256kB/320kB/512kB/… SRAM)
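Per-device sub-space selection can be sketched as a filter over (W, R) pairs. This is a hypothetical illustration, not the paper's procedure: the grids, the scaling proxy (activation ~ W·R², weights ~ W²), and the base sizes are made-up stand-ins.

```python
# Hypothetical sketch: pick the (width multiplier W, resolution R) sub-spaces
# of S' that roughly fit a given MCU's SRAM/Flash. Illustrative numbers only.

WIDTH_MULTS = [0.3, 0.4, 0.5, 0.6, 0.7]
RESOLUTIONS = [96, 112, 128, 144, 160]

def fits_device(w, r, sram_kb, flash_kb,
                base_act_kb=1372, base_model_kb=3400):
    """True if the (w, r) sub-space fits the budget under a rough proxy:
    activation ~ W * R^2 and weights ~ W^2, relative to a made-up
    W=1.0, R=224 reference."""
    act_kb = base_act_kb * w * (r / 224) ** 2
    weight_kb = base_model_kb * w ** 2
    return act_kb <= sram_kb and weight_kb <= flash_kb

# One candidate sub-space list per device (SRAM kB, Flash kB):
for device, (sram, flash) in {"F412": (256, 1024), "F746": (320, 1024),
                              "H743": (512, 2048)}.items():
    ok = [(w, r) for w in WIDTH_MULTS for r in RESOLUTIONS
          if fits_device(w, r, sram, flash)]
    print(device, len(ok))  # bigger devices admit larger sub-spaces
```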


Page 37:

TinyNAS: (1) Automated search space optimization

Analyzing the FLOPs distribution of satisfying models in each search space:
larger FLOPs -> larger model capacity -> more likely to give higher accuracy.

[Chart: cumulative probability vs. FLOPs (M) of satisfying models for each config ("width-res. | mFLOPs"): w0.3-r160 | 32.5, w0.4-r112 | 32.4, w0.4-r128 | 39.3, w0.4-r144 | 46.9, w0.5-r112 | 38.3, w0.5-r128 | 46.9, w0.5-r144 | 52.0, w0.6-r112 | 41.3, w0.7-r96 | 31.4, w0.7-r112 | 38.4. Reading each curve at p=80%, the annotated spaces reach 32.3M-50.3M FLOPs. The good design space (50.3M at p=80%) is likely to achieve high FLOPs under the memory constraint and yields the best accuracy, 78.7%; weaker spaces yield best accuracies of 76.4% and 74.2%.]
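The space-selection heuristic above can be sketched in a few lines. This is a hypothetical illustration: `flops_at_probability` is an assumed helper, and the per-space FLOPs samples are made up, but the rule mirrors the slide, prefer the space whose satisfying models reach the highest FLOPs at p = 80%.

```python
# Hypothetical sketch of search-space selection: compare sub-spaces by the
# FLOPs their satisfying models reach at a fixed CDF probability (p = 80%).
# The sample lists are fake stand-ins for models sampled from each space.

def flops_at_probability(samples, p=0.8):
    """FLOPs value at cumulative probability p (a simple empirical quantile)."""
    s = sorted(samples)
    idx = min(int(p * len(s)), len(s) - 1)
    return s[idx]

spaces = {
    "w0.3-r160": [28, 30, 31, 32, 33, 34, 35, 36, 37, 38],  # fake M FLOPs
    "w0.5-r144": [40, 44, 46, 48, 50, 51, 52, 53, 54, 55],
}

best = max(spaces, key=lambda name: flops_at_probability(spaces[name]))
print(best)  # the space reaching higher FLOPs at p=80% is preferred
```

FLOPs at a fixed quantile is only a proxy for accuracy, which is exactly why it is cheap: no model in the candidate spaces has to be trained to rank them.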

Page 38:

TinyNAS: (2) Resource-constrained model specialization

Super network training: randomly sample sub-network configs (kernel size, expansion, depth) and jointly fine-tune multiple sub-networks.

• Small sub-networks are nested in large sub-networks.
• One-shot NAS through weight sharing.

* Cai et al., Once-for-All: Train One Network and Specialize It for Efficient Deployment, ICLR'20
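The sampling side of this training loop can be sketched as follows, a minimal hypothetical illustration assuming the (k, e, d) choice lists from the earlier slides; the actual fine-tuning step that updates the sampled path's shared weights is omitted.

```python
# Hypothetical sketch of one-shot super-network sampling: at each training
# step, randomly draw a sub-network config (kernel size, expansion, depth)
# per unit and fine-tune only that path, so all sub-nets share weights.

import random

KERNELS = [3, 5, 7]
EXPANSIONS = [2, 4, 6]
DEPTHS = [2, 3, 4]

def sample_subnet(rng, n_units=5):
    """One (k, e, d) choice per unit of the super network."""
    return [(rng.choice(KERNELS), rng.choice(EXPANSIONS), rng.choice(DEPTHS))
            for _ in range(n_units)]

rng = random.Random(0)          # seeded for reproducibility
cfg = sample_subnet(rng)
print(cfg)                      # e.g., a list of 5 (k, e, d) tuples
```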

Page 39:

TinyNAS: (2) Resource-constrained model specialization

After super network training, directly evaluate the accuracy of sub-nets (one-shot NAS through weight sharing).


Page 41:

TinyNAS: (2) Resource-constrained model specialization

Elastic Kernel Size: start with the full 7x7 kernel; a smaller kernel (5x5, then 3x3) takes the centered weights (index the central region).
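The centered-weight view of elastic kernels can be shown on a plain matrix. This is a hypothetical simplification: real kernels are 4-D conv weight tensors, and `center_crop` is an illustrative helper, not the authors' API.

```python
# Hypothetical sketch of elastic kernel size: a 7x7 kernel is trained in
# full, and a 5x5 or 3x3 kernel reuses its centered weights.

def center_crop(kernel, target):
    """Take the centered target x target sub-kernel of a square kernel."""
    size = len(kernel)
    start = (size - target) // 2
    return [row[start:start + target] for row in kernel[start:start + target]]

k7 = [[r * 7 + c for c in range(7)] for r in range(7)]  # toy 7x7 weights
k5 = center_crop(k7, 5)
k3 = center_crop(k5, 3)

assert len(k3) == 3 and k3[1][1] == k7[3][3]  # the centers coincide
```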

Page 42:

TinyNAS: (2) Resource-constrained model specialization

Elastic Depth: train each unit with full depth, then shrink the depth by allowing later layers in each unit to be skipped.
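Depth shrinking is just a prefix over each unit's layer stack; a toy sketch (the "layers" here are trivial callables, purely for illustration):

```python
# Hypothetical sketch of elastic depth: a unit is trained with its full
# stack of layers, and shallower variants simply skip the later layers,
# so the first `depth` layers are shared across all depth settings.

def run_unit(layers, depth, x):
    """Apply only the first `depth` layers of the unit to x."""
    for layer in layers[:depth]:
        x = layer(x)
    return x

# Toy "layers": each adds 1, so the output counts executed layers.
unit = [lambda v: v + 1] * 4

assert run_unit(unit, 4, 0) == 4   # full depth
assert run_unit(unit, 2, 0) == 2   # shrunk depth skips the last two layers
```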

Page 43:

TinyNAS: (2) Resource-constrained model specialization

Elastic Width: train with full width, compute per-channel importance (e.g., 0.02, 0.15, 0.85, 0.63), sort and reorganize channels by importance, and keep the most important channels when shrinking the width.
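The sort-and-shrink step can be sketched directly; the importance scores below mirror the slide's example values, and `sort_and_shrink` is a hypothetical helper (in practice the importance proxy might be, e.g., a weight norm, and the weight tensors must be reorganized to match).

```python
# Hypothetical sketch of elastic width with channel sorting: score each
# channel by an importance proxy, reorder channels so the most important
# come first, then shrink the width by keeping a prefix.

def sort_and_shrink(importance, width):
    """Return indices of the `width` most important channels, ranked."""
    order = sorted(range(len(importance)), key=lambda i: -importance[i])
    return order[:width]

importance = [0.02, 0.15, 0.85, 0.63]   # per-channel importance scores
kept = sort_and_shrink(importance, 2)

assert kept == [2, 3]                   # channels 0.85 and 0.63 survive
```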


Page 45:

TinyNAS Better Utilizes the Memory

• TinyNAS designs networks with more uniform peak memory across blocks, allowing us to fit a larger model in the same amount of memory.

[Chart: per-block peak memory (kB, 0-300) for the first two stages of MobileNetV2 vs. TinyNAS, with average and max marked; the max-to-average gaps of 2.2x and 1.6x are annotated.]


Page 50:

TinyEngine: A Memory-Efficient Inference Library

1. Reducing overhead with separated compilation & runtime

(a) Existing libraries are based on runtime interpretation (e.g., TF-Lite Micro, CMSIS-NN): the runtime interprets the NN model's meta info, performs memory allocation, and carries all supported ops, causing computation, memory, and storage overhead.
(b) TinyEngine: model-adaptive code generation. The memory schedule and specialized ops are produced at compile time (offline); only inference runs at runtime.

Page 51:

TinyEngine: A Memory-Efficient Inference Library

[Chart: Peak Mem (KB), lower is better. Baseline ARM CMSIS-NN: 160KB; with code generation: 1.7x smaller.]


Page 54:

TinyEngine: A Memory-Efficient Inference Library

2. In-place depth-wise convolution

The dilemma of the inverted bottleneck: expanding the channels 6x inflates peak activation memory 6x.

(a) Regular depth-wise convolution: separate input and output activation buffers for all N channels (peak memory: 2N).
(b) In-place depth-wise convolution: since each channel is computed independently, compute channel i into a temp buffer, then write it back over the input channel (peak memory: N+1).
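The buffer-reuse idea can be sketched on a toy per-channel op. This is a hypothetical illustration: "multiply the channel by 2" stands in for the real depth-wise convolution, and a real kernel must also handle spatial overlap within each channel; the point shown is the single temp buffer plus write-back.

```python
# Hypothetical sketch of in-place depth-wise convolution: because each
# output channel depends only on its own input channel, one temp buffer
# suffices. Compute channel i into the temp buffer, then overwrite the
# input channel. Peak memory drops from 2N channel buffers to N+1.

def depthwise_inplace(activation, channel_op):
    """activation: list of per-channel buffers, overwritten in place."""
    temp = None                      # the single extra channel buffer
    for i, channel in enumerate(activation):
        temp = channel_op(channel)   # compute into the temp buffer
        activation[i] = temp         # write back over the input channel
    return activation

act = [[1, 2], [3, 4], [5, 6]]       # N = 3 channels
out = depthwise_inplace(act, lambda ch: [v * 2 for v in ch])

assert out == [[2, 4], [6, 8], [10, 12]]
assert out is act                    # no second activation buffer allocated
```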

Page 55:

TinyEngine: A Memory-Efficient Inference Library

[Chart: Peak Mem (KB), lower is better. Baseline ARM CMSIS-NN: 160KB; with code generation and in-place depth-wise convolution: 3.4x smaller.]

Page 56:

TinyEngine: Faster Inference Speed

Analyzing the Million MAC/s improvement from each technique. Baseline: ARM CMSIS-NN at 52 MMAC/s.

Page 57:

TinyEngine: Faster Inference Speed

(1) Code generator-based compilation eliminates the overheads of runtime interpretation: 52 -> 64 MMAC/s.

Page 58:

TinyEngine: Faster Inference Speed

(2) Model-adaptive memory scheduling increases data reuse for each layer: (a) model-level memory scheduling, (b) tile size configuration for Im2col. Specialized Im2col: 64 -> 70 MMAC/s.

Page 59:

TinyEngine: Faster Inference Speed

(3) Computation kernel specialization: operation fusion (e.g., fuse Pad+Conv+ReLU+BN into one specialized kernel) to minimize the memory footprint and optimize the overall computation: 70 -> 75 MMAC/s.
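Operation fusion can be illustrated on a toy 1-D case. This is a hypothetical sketch, not TinyEngine code: the function and its parameters are invented, BN is folded into a per-output scale and bias (a standard inference-time simplification), and the four ops produce each output element in a single pass with no intermediate buffers.

```python
# Hypothetical sketch of a fused Pad+Conv+BN+ReLU kernel on a 1-D signal:
# instead of four passes with an intermediate buffer after each, every
# output element is produced in one pass.

def fused_pad_conv_bn_relu(x, w, scale, bias, pad=1):
    """One pass: zero-pad, 1-D convolve, scale/shift (folded BN), clamp."""
    padded = [0] * pad + x + [0] * pad
    out = []
    for i in range(len(x)):
        acc = sum(padded[i + j] * w[j] for j in range(len(w)))
        out.append(max(0.0, acc * scale + bias))   # BN + ReLU fused in
    return out

y = fused_pad_conv_bn_relu([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], 1.0, 0.0)
assert y == [3.0, 6.0, 5.0]
```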

Page 60:

TinyEngine: Faster Inference Speed

(3) Computation kernel specialization: loop unrolling (e.g., fully unroll the 3x3 conv) to eliminate the branch instruction overheads of loops: 75 -> 79 MMAC/s.
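The unrolling transformation itself looks like this. In TinyEngine it is applied to generated C, where removing loop branches actually matters; the Python below only illustrates the shape of the rewrite, and both functions are hypothetical.

```python
# Hypothetical sketch of loop unrolling for a 3x3 convolution window: the
# generic double loop becomes nine explicit multiply-accumulates, so the
# inner code carries no loop-branch overhead.

def dot3x3_looped(a, w):
    acc = 0
    for r in range(3):
        for c in range(3):
            acc += a[r][c] * w[r][c]
    return acc

def dot3x3_unrolled(a, w):
    # fully unrolled: nine MACs, zero branches
    return (a[0][0]*w[0][0] + a[0][1]*w[0][1] + a[0][2]*w[0][2]
          + a[1][0]*w[1][0] + a[1][1]*w[1][1] + a[1][2]*w[1][2]
          + a[2][0]*w[2][0] + a[2][1]*w[2][1] + a[2][2]*w[2][2])

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
assert dot3x3_unrolled(a, w) == dot3x3_looped(a, w) == 15
```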

Page 61:

TinyEngine: Faster Inference Speed

(3) Computation kernel specialization: loop tiling for each layer, with tile sizes chosen based on the kernel size and available memory (specialized loop unrolling/tiling according to the network architecture): 79 -> 82 MMAC/s.
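A minimal tiling sketch for the im2col matmul, with hypothetical names: instead of streaming the whole im2col buffer, rows are processed a tile at a time so the working set stays bounded. In TinyEngine the tile size would be derived per layer from the kernel size and free SRAM; here it is just a parameter.

```python
# Hypothetical sketch of loop tiling: dot each im2col row with the weights,
# tile_size rows at a time, so the working set is one tile rather than the
# full buffer. The result is independent of the tile size.

def tiled_matvec(rows, weights, tile_size):
    out = []
    for start in range(0, len(rows), tile_size):
        tile = rows[start:start + tile_size]   # working set: one tile
        out.extend(sum(r * w for r, w in zip(row, weights)) for row in tile)
    return out

rows = [[1, 2], [3, 4], [5, 6], [7, 8]]
assert tiled_matvec(rows, [1, 1], tile_size=2) == [3, 7, 11, 15]
assert tiled_matvec(rows, [1, 1], tile_size=3) == tiled_matvec(rows, [1, 1], 4)
```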

Page 62:

TinyEngine: A Memory-Efficient Inference Library

[Charts: Peak Mem (KB, lower is better): baseline ARM CMSIS-NN 160KB; with code generation and in-place depth-wise, 3.4x smaller. Million MAC/s (higher is better): 52 (baseline CMSIS-NN) -> 64 (code generation) -> 70 (specialized Im2col) -> 75 (op fusion) -> 79 (loop unrolling) -> 82 (tiling), 1.6x faster overall.]

Page 63:

TinyEngine: A Memory-Efficient Inference Library

• Consistent improvement on different networks (SmallCifar, MobileNetV2, ProxylessNAS, MnasNet)

[Chart: normalized speed (higher is better) of TF-Lite Micro, MicroTVM, tuned CMSIS-NN, and TinyEngine (= 1.00) on the four networks. TinyEngine is about 1.5-1.6x faster than tuned CMSIS-NN and about 3x faster than TF-Lite Micro and MicroTVM; several baselines run out of memory (OOM) on the larger models.]

[Chart: Peak Mem (KB, lower is better, 0-230) on the same networks. TinyEngine uses the least memory, annotated as 3.1x, 4.8x, 2.7x, and 3.3x smaller than the baselines; some baselines are OOM.]


Page 65:

Experimental Results

We focus on large-scale datasets to reflect real-life use cases.

Datasets: (1) ImageNet-1000; (2) Wake Words
• Visual: Visual Wake Words
• Audio: Google Speech Commands


Page 68:

System-Algorithm Co-design Gives the Best Results

• ImageNet classification on STM32F746 MCU (320kB SRAM, 1MB Flash)

ImageNet Top-1:
Baseline (MbV2* + CMSIS):           40%
System-only (MbV2* + TinyEngine):   44%
Model-only (TinyNAS + CMSIS):       56%
Co-design (TinyNAS + TinyEngine):   62%

* scaled-down version: width multiplier 0.3, input resolution 80



Handling Diverse Hardware

• Specializing models (int4) for different MCUs (SRAM/Flash)

[Figure: ImageNet top-1 accuracy of specialized models: 62.0% on STM32F412 (256kB/1MB), 63.5% on STM32F746 (320kB/1MB), 65.9% on STM32F765 (512kB/1MB), and 70.7% on STM32H743 (512kB/2MB), vs. 53.8% for MobileNetV2+CMSIS-NN (+17%).]

The first to achieve >70% ImageNet accuracy on commercial MCUs
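Specializing per device boils down to a constrained selection: among candidate models, keep only those whose peak activation fits the MCU's SRAM and whose weights fit its Flash, then take the most accurate. The sketch below illustrates that idea; the `Candidate` class, `specialize` function, and all candidate numbers are hypothetical, not from the paper.

```python
# Hypothetical sketch: pick the most accurate candidate model that fits a
# given MCU's SRAM/Flash budget, in the spirit of per-device specialization.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    peak_sram_kb: int   # peak activation memory during inference
    flash_kb: int       # model weights stored in Flash
    top1: float         # validation accuracy (%)

def specialize(candidates, sram_kb, flash_kb):
    """Return the best-accuracy candidate that fits the device, or None."""
    fitting = [c for c in candidates
               if c.peak_sram_kb <= sram_kb and c.flash_kb <= flash_kb]
    return max(fitting, key=lambda c: c.top1) if fitting else None

# Illustrative candidate pool (made-up numbers)
pool = [
    Candidate("small",  peak_sram_kb=190, flash_kb=900,  top1=62.0),
    Candidate("medium", peak_sram_kb=290, flash_kb=1000, top1=63.5),
    Candidate("large",  peak_sram_kb=450, flash_kb=1900, top1=70.7),
]

# A 256kB SRAM / 1MB Flash device only fits the small model
print(specialize(pool, 256, 1024).name)   # -> small
# A 512kB SRAM / 2MB Flash device fits the large model
print(specialize(pool, 512, 2048).name)   # -> large
```

Scaling one knob (width, resolution) alone cannot hit arbitrary SRAM/Flash pairs, which is why a searched, per-device model beats a uniformly scaled one.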

Reduce Both Model Size and Activation Size

[Figure: Parameter size and peak activation size (MB) for ResNet-18, MobileNetV2-0.75, and MCUNet, all at ~70% ImageNet top-1. MobileNetV2-0.75 shrinks parameters 4.6× vs. ResNet-18, but peak activation only 1.8×; MCUNet reduces both, with 24.6× smaller parameters and 13.8× smaller peak activation.]
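Why activation, not parameter count, is the bottleneck: during inference a layer needs its input and output activations resident in SRAM at the same time, while weights can stream from Flash. A minimal sketch under simplifying assumptions (int8 activations, a plain conv chain, no operator fusion); the `conv_activation_kb` helper and the toy layer list are illustrative, not the paper's measurement tool:

```python
# Illustrative sketch: per-layer SRAM need ≈ input + output activation size.
def conv_activation_kb(h, w, c, stride, c_out):
    """int8 activations: one byte per element; returns (in_kb, out_kb)."""
    in_kb = h * w * c / 1024
    out_kb = (h // stride) * (w // stride) * c_out / 1024
    return in_kb, out_kb

# toy layer list: (H, W, C_in, stride, C_out)
layers = [(144, 144, 3, 2, 16), (72, 72, 16, 2, 32), (36, 36, 32, 2, 64)]

peak = 0.0
for h, w, c, s, c_out in layers:
    in_kb, out_kb = conv_activation_kb(h, w, c, s, c_out)
    peak = max(peak, in_kb + out_kb)

# The early, high-resolution layers dominate the peak.
print(f"peak activation ≈ {peak:.0f} kB")  # -> peak activation ≈ 142 kB
```

Because the peak comes from a few early high-resolution layers, shrinking parameters alone (as MobileNetV2 does) barely moves it; the memory schedule and architecture must be designed together.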



Visual Wake Words (VWW)

[Figure: VWW accuracy for MCUNet, MobileNetV2, ProxylessNAS, and Han et al. (a) Accuracy vs. measured latency, with 5 FPS and 10 FPS marks: MCUNet is 2.4–3.4× faster at the same or higher accuracy. (b) Accuracy vs. peak SRAM, with the 256kB MCU constraint marked: MCUNet's peak memory is 3.7× smaller, while baselines go OOM.]

Audio Wake Words (Speech Commands)

[Figure: Google Speech Commands accuracy for MCUNet, MobileNetV2, and ProxylessNAS. (a) Accuracy vs. measured latency, with 5 FPS and 10 FPS marks: MCUNet is 2.8× faster with 2% higher accuracy. (b) Accuracy vs. peak SRAM, with the 256kB constraint marked: MCUNet's peak memory is 4.1× smaller, while baselines go OOM.]


Visual Wake Word Detection

• Detecting whether a person is present in the frame


MCUNet: Tiny Deep Learning on IoT Devices

Project Page: http://tinyml.mit.edu

• Our study suggests that the era of tiny machine learning on IoT devices has arrived

Cloud AI (ResNet) → Mobile AI (MobileNet) → Tiny AI (MCUNet)