
Page 1:

MCUNet: Tiny Deep Learning on IoT Devices

NeurIPS 2020 (spotlight)

Ji Lin^1, Wei-Ming Chen^1,2, John Cohn^3, Yujun Lin^1, Song Han^1, Chuang Gan^3

^1 MIT   ^2 National Taiwan University   ^3 MIT-IBM Watson AI Lab


Page 4:

Background: The Era of AIoT on Microcontrollers (MCUs)

• Low-cost, low-power
• Rapid growth
• Wide applications: Smart Retail, Personalized Healthcare, Precision Agriculture, Smart Home

[Chart: MCU shipments, #Units (Billion), 0-50, for years 2012-2019F]


Page 10:

Challenge: Memory Too Small to Hold DNN

                        Cloud AI    Mobile AI    Tiny AI (MCU)
Memory (Activation):    16GB        4GB          320kB
Storage (Weights):      ~TB/PB      256GB        1MB

MCU memory is 13,000x smaller than mobile (4GB -> 320kB) and 50,000x smaller than cloud (16GB -> 320kB).

We need to reduce the peak activation size AND the model size to fit a DNN into MCUs.

Page 11:

Existing efficient networks only reduce model size but NOT activation size!

[Chart: Param (MB) and Peak Activation (MB) for ResNet-18, MobileNetV2-0.75, and MCUNet, all at ~70% ImageNet Top-1. MobileNetV2-0.75 shrinks parameters by 4.6x relative to ResNet-18, but peak activation by only 1.8x.]


Page 14:

Challenge: Memory Too Small to Hold DNN

[Chart: Peak Memory (kB, 0-8000) against the 320kB constraint. ResNet-50 and MobileNetV2 exceed the constraint by roughly 22-23x; even int8-quantized MobileNetV2 exceeds it by 5x. Only MCUNet fits under 320kB.]

Page 15:

MCUNet: System-Algorithm Co-design


Page 18:

MCUNet: System-Algorithm Co-design

(a) Search an NN model on an existing library (e.g., ProxylessNAS, MnasNet): NAS -> Library
(b) Tune the deep learning library given an NN model (e.g., TVM): NN Model -> Library
(c) MCUNet: system-algorithm co-design. TinyNAS (efficient neural architecture) and TinyEngine (efficient compiler/runtime) are optimized jointly.



Page 22:

TinyNAS: Two-Stage NAS for Tiny Memory Constraints

Search space design is crucial for NAS performance, but there is no prior expertise on MCU model design.

Full Network Space -> Optimized Search Space (under memory/storage constraints) -> Model Specialization


Page 26:

TinyNAS: (1) Automated search space optimization

Revisit the ProxylessNAS search space: S = kernel size × expansion ratio × depth

• Kernel size k ∈ {3, 5, 7} for the depth-wise (dw) convolution between the two point-wise layers (pw1, pw2)
• Expansion ratio e ∈ {2, 4, 6}
• Depth d ∈ {2, 3, 4}

Page 27:

TinyNAS: (1) Automated search space optimization

Models from the ProxylessNAS search space run out of memory on MCUs!
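The out-of-memory failure can be made concrete with a quick back-of-the-envelope check. This is a hypothetical sketch, not the authors' code: a simplified int8 activation model for one inverted-bottleneck block, checked against the 320kB SRAM budget; the function name, base numbers, and example configurations are all illustrative.

```python
# Hypothetical sketch: estimate the peak activation memory of one int8
# inverted-bottleneck block and compare it against a 320kB SRAM budget.
# This ignores im2col buffers and runtime overhead on purpose.

def block_peak_activation_bytes(resolution, in_channels, expansion):
    """Peak = input buffer + expanded hidden buffer (1 byte/elem for int8).

    Simplified model: the pw1 output of size
    resolution^2 * in_channels * expansion dominates the footprint.
    """
    input_buf = resolution * resolution * in_channels
    hidden_buf = resolution * resolution * in_channels * expansion
    return input_buf + hidden_buf

SRAM_LIMIT = 320 * 1024  # the 320kB constraint (e.g., STM32F746)

# An ImageNet-scale early block vs. a scaled-down configuration.
full_res = block_peak_activation_bytes(112, 16, 6)
tiny = block_peak_activation_bytes(48, 8, 4)

print(full_res > SRAM_LIMIT)  # the full-size block alone overflows SRAM
print(tiny <= SRAM_LIMIT)
```

The point is that a single full-resolution block already exceeds the entire SRAM, so the search space itself must be shrunk before searching.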

Page 28:

TinyNAS: (1) Automated search space optimization

Extended search space to cover a wide range of hardware capacity:
S' = kernel size × expansion ratio × depth × input resolution R × width multiplier W


Page 31:

TinyNAS: (1) Automated search space optimization

Different R and W for different hardware capacity (i.e., a different optimized sub-space per device):

• R=224, W=1.0
• R=260, W=1.4 (Once-for-All; Cai et al., ICLR'20)
• R=?, W=? for MCUs (STM32F412/F746/H743/…, with 256kB/320kB/512kB/… SRAM)
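Per-device sub-space selection can be sketched as a filter over (W, R) pairs. This is a hypothetical illustration, not the paper's procedure: the grids, the scaling proxy (activation ~ W·R², weights ~ W²), and the base sizes are made-up stand-ins.

```python
# Hypothetical sketch: pick the (width multiplier W, resolution R) sub-spaces
# of S' that roughly fit a given MCU's SRAM/Flash. Illustrative numbers only.

WIDTH_MULTS = [0.3, 0.4, 0.5, 0.6, 0.7]
RESOLUTIONS = [96, 112, 128, 144, 160]

def fits_device(w, r, sram_kb, flash_kb,
                base_act_kb=1372, base_model_kb=3400):
    """True if the (w, r) sub-space fits the budget under a rough proxy:
    activation ~ W * R^2 and weights ~ W^2, relative to a made-up
    W=1.0, R=224 reference."""
    act_kb = base_act_kb * w * (r / 224) ** 2
    weight_kb = base_model_kb * w ** 2
    return act_kb <= sram_kb and weight_kb <= flash_kb

# One candidate sub-space list per device (SRAM kB, Flash kB):
for device, (sram, flash) in {"F412": (256, 1024), "F746": (320, 1024),
                              "H743": (512, 2048)}.items():
    ok = [(w, r) for w in WIDTH_MULTS for r in RESOLUTIONS
          if fits_device(w, r, sram, flash)]
    print(device, len(ok))  # bigger devices admit larger sub-spaces
```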


Page 37:

TinyNAS: (1) Automated search space optimization

Analyzing the FLOPs distribution of satisfying models in each search space:
larger FLOPs -> larger model capacity -> more likely to give higher accuracy.

[Chart: cumulative probability vs. FLOPs (M) of satisfying models for each config ("width-res. | mFLOPs"): w0.3-r160 | 32.5, w0.4-r112 | 32.4, w0.4-r128 | 39.3, w0.4-r144 | 46.9, w0.5-r112 | 38.3, w0.5-r128 | 46.9, w0.5-r144 | 52.0, w0.6-r112 | 41.3, w0.7-r96 | 31.4, w0.7-r112 | 38.4. Reading each curve at p=80%, the annotated spaces reach 32.3M-50.3M FLOPs. The good design space (50.3M at p=80%) is likely to achieve high FLOPs under the memory constraint and yields the best accuracy, 78.7%; weaker spaces yield best accuracies of 76.4% and 74.2%.]
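The space-selection heuristic above can be sketched in a few lines. This is a hypothetical illustration: `flops_at_probability` is an assumed helper, and the per-space FLOPs samples are made up, but the rule mirrors the slide, prefer the space whose satisfying models reach the highest FLOPs at p = 80%.

```python
# Hypothetical sketch of search-space selection: compare sub-spaces by the
# FLOPs their satisfying models reach at a fixed CDF probability (p = 80%).
# The sample lists are fake stand-ins for models sampled from each space.

def flops_at_probability(samples, p=0.8):
    """FLOPs value at cumulative probability p (a simple empirical quantile)."""
    s = sorted(samples)
    idx = min(int(p * len(s)), len(s) - 1)
    return s[idx]

spaces = {
    "w0.3-r160": [28, 30, 31, 32, 33, 34, 35, 36, 37, 38],  # fake M FLOPs
    "w0.5-r144": [40, 44, 46, 48, 50, 51, 52, 53, 54, 55],
}

best = max(spaces, key=lambda name: flops_at_probability(spaces[name]))
print(best)  # the space reaching higher FLOPs at p=80% is preferred
```

FLOPs at a fixed quantile is only a proxy for accuracy, which is exactly why it is cheap: no model in the candidate spaces has to be trained to rank them.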

Page 38:

TinyNAS: (2) Resource-constrained model specialization

Super network training: randomly sample sub-network configs (kernel size, expansion, depth) and jointly fine-tune multiple sub-networks.

• Small sub-networks are nested in large sub-networks.
• One-shot NAS through weight sharing.

* Cai et al., Once-for-All: Train One Network and Specialize It for Efficient Deployment, ICLR'20
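The sampling side of this training loop can be sketched as follows, a minimal hypothetical illustration assuming the (k, e, d) choice lists from the earlier slides; the actual fine-tuning step that updates the sampled path's shared weights is omitted.

```python
# Hypothetical sketch of one-shot super-network sampling: at each training
# step, randomly draw a sub-network config (kernel size, expansion, depth)
# per unit and fine-tune only that path, so all sub-nets share weights.

import random

KERNELS = [3, 5, 7]
EXPANSIONS = [2, 4, 6]
DEPTHS = [2, 3, 4]

def sample_subnet(rng, n_units=5):
    """One (k, e, d) choice per unit of the super network."""
    return [(rng.choice(KERNELS), rng.choice(EXPANSIONS), rng.choice(DEPTHS))
            for _ in range(n_units)]

rng = random.Random(0)          # seeded for reproducibility
cfg = sample_subnet(rng)
print(cfg)                      # e.g., a list of 5 (k, e, d) tuples
```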

Page 39:

TinyNAS: (2) Resource-constrained model specialization

After super network training, directly evaluate the accuracy of sub-nets (one-shot NAS through weight sharing).


Page 41:

TinyNAS: (2) Resource-constrained model specialization

Elastic Kernel Size: start with the full 7x7 kernel; a smaller kernel (5x5, then 3x3) takes the centered weights (index the central region).
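The centered-weight view of elastic kernels can be shown on a plain matrix. This is a hypothetical simplification: real kernels are 4-D conv weight tensors, and `center_crop` is an illustrative helper, not the authors' API.

```python
# Hypothetical sketch of elastic kernel size: a 7x7 kernel is trained in
# full, and a 5x5 or 3x3 kernel reuses its centered weights.

def center_crop(kernel, target):
    """Take the centered target x target sub-kernel of a square kernel."""
    size = len(kernel)
    start = (size - target) // 2
    return [row[start:start + target] for row in kernel[start:start + target]]

k7 = [[r * 7 + c for c in range(7)] for r in range(7)]  # toy 7x7 weights
k5 = center_crop(k7, 5)
k3 = center_crop(k5, 3)

assert len(k3) == 3 and k3[1][1] == k7[3][3]  # the centers coincide
```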

Page 42:

TinyNAS: (2) Resource-constrained model specialization

Elastic Depth: train each unit with full depth, then shrink the depth by allowing later layers in each unit to be skipped.
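Depth shrinking is just a prefix over each unit's layer stack; a toy sketch (the "layers" here are trivial callables, purely for illustration):

```python
# Hypothetical sketch of elastic depth: a unit is trained with its full
# stack of layers, and shallower variants simply skip the later layers,
# so the first `depth` layers are shared across all depth settings.

def run_unit(layers, depth, x):
    """Apply only the first `depth` layers of the unit to x."""
    for layer in layers[:depth]:
        x = layer(x)
    return x

# Toy "layers": each adds 1, so the output counts executed layers.
unit = [lambda v: v + 1] * 4

assert run_unit(unit, 4, 0) == 4   # full depth
assert run_unit(unit, 2, 0) == 2   # shrunk depth skips the last two layers
```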

Page 43:

TinyNAS: (2) Resource-constrained model specialization

Elastic Width: train with full width, compute per-channel importance (e.g., 0.02, 0.15, 0.85, 0.63), sort and reorganize channels by importance, and keep the most important channels when shrinking the width.
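The sort-and-shrink step can be sketched directly; the importance scores below mirror the slide's example values, and `sort_and_shrink` is a hypothetical helper (in practice the importance proxy might be, e.g., a weight norm, and the weight tensors must be reorganized to match).

```python
# Hypothetical sketch of elastic width with channel sorting: score each
# channel by an importance proxy, reorder channels so the most important
# come first, then shrink the width by keeping a prefix.

def sort_and_shrink(importance, width):
    """Return indices of the `width` most important channels, ranked."""
    order = sorted(range(len(importance)), key=lambda i: -importance[i])
    return order[:width]

importance = [0.02, 0.15, 0.85, 0.63]   # per-channel importance scores
kept = sort_and_shrink(importance, 2)

assert kept == [2, 3]                   # channels 0.85 and 0.63 survive
```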


Page 45:

TinyNAS Better Utilizes the Memory

• TinyNAS designs networks with more uniform peak memory across blocks, allowing us to fit a larger model in the same amount of memory.

[Chart: per-block peak memory (kB, 0-300) for the first two stages of MobileNetV2 vs. TinyNAS, with average and max marked; the max-to-average gaps of 2.2x and 1.6x are annotated.]


Page 50:

TinyEngine: A Memory-Efficient Inference Library

1. Reducing overhead with separated compilation & runtime

(a) Existing libraries are based on runtime interpretation (e.g., TF-Lite Micro, CMSIS-NN): the runtime interprets the NN model's meta info, performs memory allocation, and carries all supported ops, causing computation, memory, and storage overhead.
(b) TinyEngine: model-adaptive code generation. The memory schedule and specialized ops are produced at compile time (offline); only inference runs at runtime.

Page 51:

TinyEngine: A Memory-Efficient Inference Library

[Chart: Peak Mem (KB), lower is better. Baseline ARM CMSIS-NN: 160KB; with code generation: 1.7x smaller.]


Page 54:

TinyEngine: A Memory-Efficient Inference Library

2. In-place depth-wise convolution

The dilemma of the inverted bottleneck: expanding the channels 6x inflates peak activation memory 6x.

(a) Regular depth-wise convolution: separate input and output activation buffers for all N channels (peak memory: 2N).
(b) In-place depth-wise convolution: since each channel is computed independently, compute channel i into a temp buffer, then write it back over the input channel (peak memory: N+1).
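The buffer-reuse idea can be sketched on a toy per-channel op. This is a hypothetical illustration: "multiply the channel by 2" stands in for the real depth-wise convolution, and a real kernel must also handle spatial overlap within each channel; the point shown is the single temp buffer plus write-back.

```python
# Hypothetical sketch of in-place depth-wise convolution: because each
# output channel depends only on its own input channel, one temp buffer
# suffices. Compute channel i into the temp buffer, then overwrite the
# input channel. Peak memory drops from 2N channel buffers to N+1.

def depthwise_inplace(activation, channel_op):
    """activation: list of per-channel buffers, overwritten in place."""
    temp = None                      # the single extra channel buffer
    for i, channel in enumerate(activation):
        temp = channel_op(channel)   # compute into the temp buffer
        activation[i] = temp         # write back over the input channel
    return activation

act = [[1, 2], [3, 4], [5, 6]]       # N = 3 channels
out = depthwise_inplace(act, lambda ch: [v * 2 for v in ch])

assert out == [[2, 4], [6, 8], [10, 12]]
assert out is act                    # no second activation buffer allocated
```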

Page 55:

TinyEngine: A Memory-Efficient Inference Library

[Chart: Peak Mem (KB), lower is better. Baseline ARM CMSIS-NN: 160KB; with code generation and in-place depth-wise convolution: 3.4x smaller.]

Page 56:

TinyEngine: Faster Inference Speed

Analyzing the Million MAC/s improvement from each technique. Baseline: ARM CMSIS-NN at 52 MMAC/s.

Page 57:

TinyEngine: Faster Inference Speed

(1) Code generator-based compilation eliminates the overheads of runtime interpretation: 52 -> 64 MMAC/s.

Page 58:

TinyEngine: Faster Inference Speed

(2) Model-adaptive memory scheduling increases data reuse for each layer: (a) model-level memory scheduling, (b) tile size configuration for Im2col. Specialized Im2col: 64 -> 70 MMAC/s.

Page 59:

TinyEngine: Faster Inference Speed

(3) Computation kernel specialization: operation fusion (e.g., fuse Pad+Conv+ReLU+BN into one specialized kernel) to minimize the memory footprint and optimize the overall computation: 70 -> 75 MMAC/s.
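Operation fusion can be illustrated on a toy 1-D case. This is a hypothetical sketch, not TinyEngine code: the function and its parameters are invented, BN is folded into a per-output scale and bias (a standard inference-time simplification), and the four ops produce each output element in a single pass with no intermediate buffers.

```python
# Hypothetical sketch of a fused Pad+Conv+BN+ReLU kernel on a 1-D signal:
# instead of four passes with an intermediate buffer after each, every
# output element is produced in one pass.

def fused_pad_conv_bn_relu(x, w, scale, bias, pad=1):
    """One pass: zero-pad, 1-D convolve, scale/shift (folded BN), clamp."""
    padded = [0] * pad + x + [0] * pad
    out = []
    for i in range(len(x)):
        acc = sum(padded[i + j] * w[j] for j in range(len(w)))
        out.append(max(0.0, acc * scale + bias))   # BN + ReLU fused in
    return out

y = fused_pad_conv_bn_relu([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], 1.0, 0.0)
assert y == [3.0, 6.0, 5.0]
```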

Page 60:

TinyEngine: Faster Inference Speed

(3) Computation kernel specialization: loop unrolling (e.g., fully unroll the 3x3 conv) to eliminate the branch instruction overheads of loops: 75 -> 79 MMAC/s.
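The unrolling transformation itself looks like this. In TinyEngine it is applied to generated C, where removing loop branches actually matters; the Python below only illustrates the shape of the rewrite, and both functions are hypothetical.

```python
# Hypothetical sketch of loop unrolling for a 3x3 convolution window: the
# generic double loop becomes nine explicit multiply-accumulates, so the
# inner code carries no loop-branch overhead.

def dot3x3_looped(a, w):
    acc = 0
    for r in range(3):
        for c in range(3):
            acc += a[r][c] * w[r][c]
    return acc

def dot3x3_unrolled(a, w):
    # fully unrolled: nine MACs, zero branches
    return (a[0][0]*w[0][0] + a[0][1]*w[0][1] + a[0][2]*w[0][2]
          + a[1][0]*w[1][0] + a[1][1]*w[1][1] + a[1][2]*w[1][2]
          + a[2][0]*w[2][0] + a[2][1]*w[2][1] + a[2][2]*w[2][2])

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
assert dot3x3_unrolled(a, w) == dot3x3_looped(a, w) == 15
```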

Page 61:

TinyEngine: Faster Inference Speed

(3) Computation kernel specialization: loop tiling for each layer, with tile sizes chosen based on the kernel size and available memory (specialized loop unrolling/tiling according to the network architecture): 79 -> 82 MMAC/s.
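A minimal tiling sketch for the im2col matmul, with hypothetical names: instead of streaming the whole im2col buffer, rows are processed a tile at a time so the working set stays bounded. In TinyEngine the tile size would be derived per layer from the kernel size and free SRAM; here it is just a parameter.

```python
# Hypothetical sketch of loop tiling: dot each im2col row with the weights,
# tile_size rows at a time, so the working set is one tile rather than the
# full buffer. The result is independent of the tile size.

def tiled_matvec(rows, weights, tile_size):
    out = []
    for start in range(0, len(rows), tile_size):
        tile = rows[start:start + tile_size]   # working set: one tile
        out.extend(sum(r * w for r, w in zip(row, weights)) for row in tile)
    return out

rows = [[1, 2], [3, 4], [5, 6], [7, 8]]
assert tiled_matvec(rows, [1, 1], tile_size=2) == [3, 7, 11, 15]
assert tiled_matvec(rows, [1, 1], tile_size=3) == tiled_matvec(rows, [1, 1], 4)
```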

Page 62:

TinyEngine: A Memory-Efficient Inference Library

[Charts: Peak Mem (KB, lower is better): baseline ARM CMSIS-NN 160KB; with code generation and in-place depth-wise, 3.4x smaller. Million MAC/s (higher is better): 52 (baseline CMSIS-NN) -> 64 (code generation) -> 70 (specialized Im2col) -> 75 (op fusion) -> 79 (loop unrolling) -> 82 (tiling), 1.6x faster overall.]

Page 63:

TinyEngine: A Memory-Efficient Inference Library

• Consistent improvement on different networks (SmallCifar, MobileNetV2, ProxylessNAS, MnasNet)

[Chart: normalized speed (higher is better) of TF-Lite Micro, MicroTVM, tuned CMSIS-NN, and TinyEngine (= 1.00) on the four networks. TinyEngine is about 1.5-1.6x faster than tuned CMSIS-NN and about 3x faster than TF-Lite Micro and MicroTVM; several baselines run out of memory (OOM) on the larger models.]

[Chart: Peak Mem (KB, lower is better, 0-230) on the same networks. TinyEngine uses the least memory, annotated as 3.1x, 4.8x, 2.7x, and 3.3x smaller than the baselines; some baselines are OOM.]


Page 65:

Experimental Results

We focus on large-scale datasets to reflect real-life use cases.

Datasets: (1) ImageNet-1000; (2) Wake Words
• Visual: Visual Wake Words
• Audio: Google Speech Commands


Page 68:

System-Algorithm Co-design Gives the Best Results

• ImageNet classification on STM32F746 MCU (320kB SRAM, 1MB Flash)

ImageNet Top-1:
Baseline (MbV2* + CMSIS):           40%
System-only (MbV2* + TinyEngine):   44%
Model-only (TinyNAS + CMSIS):       56%
Co-design (TinyNAS + TinyEngine):   62%

* scaled-down version: width multiplier 0.3, input resolution 80



Handling Diverse Hardware

• Specializing models (int4) for different MCUs (SRAM/Flash)

[Figure: ImageNet top-1 accuracy of specialized models: 62.0% on STM32F412 (256kB/1MB), 63.5% on STM32F746 (320kB/1MB), 65.9% on STM32F765 (512kB/1MB), and 70.7% on STM32H743 (512kB/2MB), vs. 53.8% for MobileNetV2+CMSIS-NN (+17%).]

The first to achieve >70% ImageNet accuracy on commercial MCUs
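Specializing per device boils down to a constrained selection: among candidate models, keep only those whose peak activation fits the MCU's SRAM and whose weights fit its Flash, then take the most accurate. The sketch below illustrates that idea; the `Candidate` class, `specialize` function, and all candidate numbers are hypothetical, not from the paper.

```python
# Hypothetical sketch: pick the most accurate candidate model that fits a
# given MCU's SRAM/Flash budget, in the spirit of per-device specialization.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    peak_sram_kb: int   # peak activation memory during inference
    flash_kb: int       # model weights stored in Flash
    top1: float         # validation accuracy (%)

def specialize(candidates, sram_kb, flash_kb):
    """Return the best-accuracy candidate that fits the device, or None."""
    fitting = [c for c in candidates
               if c.peak_sram_kb <= sram_kb and c.flash_kb <= flash_kb]
    return max(fitting, key=lambda c: c.top1) if fitting else None

# Illustrative candidate pool (made-up numbers)
pool = [
    Candidate("small",  peak_sram_kb=190, flash_kb=900,  top1=62.0),
    Candidate("medium", peak_sram_kb=290, flash_kb=1000, top1=63.5),
    Candidate("large",  peak_sram_kb=450, flash_kb=1900, top1=70.7),
]

# A 256kB SRAM / 1MB Flash device only fits the small model
print(specialize(pool, 256, 1024).name)   # -> small
# A 512kB SRAM / 2MB Flash device fits the large model
print(specialize(pool, 512, 2048).name)   # -> large
```

Scaling one knob (width, resolution) alone cannot hit arbitrary SRAM/Flash pairs, which is why a searched, per-device model beats a uniformly scaled one.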

Reduce Both Model Size and Activation Size

[Figure: Parameter size and peak activation size (MB) for ResNet-18, MobileNetV2-0.75, and MCUNet, all at ~70% ImageNet top-1. MobileNetV2-0.75 shrinks parameters 4.6× vs. ResNet-18, but peak activation only 1.8×; MCUNet reduces both, with 24.6× smaller parameters and 13.8× smaller peak activation.]
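Why activation, not parameter count, is the bottleneck: during inference a layer needs its input and output activations resident in SRAM at the same time, while weights can stream from Flash. A minimal sketch under simplifying assumptions (int8 activations, a plain conv chain, no operator fusion); the `conv_activation_kb` helper and the toy layer list are illustrative, not the paper's measurement tool:

```python
# Illustrative sketch: per-layer SRAM need ≈ input + output activation size.
def conv_activation_kb(h, w, c, stride, c_out):
    """int8 activations: one byte per element; returns (in_kb, out_kb)."""
    in_kb = h * w * c / 1024
    out_kb = (h // stride) * (w // stride) * c_out / 1024
    return in_kb, out_kb

# toy layer list: (H, W, C_in, stride, C_out)
layers = [(144, 144, 3, 2, 16), (72, 72, 16, 2, 32), (36, 36, 32, 2, 64)]

peak = 0.0
for h, w, c, s, c_out in layers:
    in_kb, out_kb = conv_activation_kb(h, w, c, s, c_out)
    peak = max(peak, in_kb + out_kb)

# The early, high-resolution layers dominate the peak.
print(f"peak activation ≈ {peak:.0f} kB")  # -> peak activation ≈ 142 kB
```

Because the peak comes from a few early high-resolution layers, shrinking parameters alone (as MobileNetV2 does) barely moves it; the memory schedule and architecture must be designed together.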



Visual Wake Words (VWW)

[Figure: VWW accuracy for MCUNet, MobileNetV2, ProxylessNAS, and Han et al. (a) Accuracy vs. measured latency, with 5 FPS and 10 FPS marks: MCUNet is 2.4–3.4× faster at the same or higher accuracy. (b) Accuracy vs. peak SRAM, with the 256kB MCU constraint marked: MCUNet's peak memory is 3.7× smaller, while baselines go OOM.]

Audio Wake Words (Speech Commands)

[Figure: Google Speech Commands accuracy for MCUNet, MobileNetV2, and ProxylessNAS. (a) Accuracy vs. measured latency, with 5 FPS and 10 FPS marks: MCUNet is 2.8× faster with 2% higher accuracy. (b) Accuracy vs. peak SRAM, with the 256kB constraint marked: MCUNet's peak memory is 4.1× smaller, while baselines go OOM.]


Visual Wake Word Detection

• Detecting whether a person is present in the frame


MCUNet: Tiny Deep Learning on IoT Devices

Project Page: http://tinyml.mit.edu

• Our study suggests that the era of tiny machine learning on IoT devices has arrived

Cloud AI (ResNet) → Mobile AI (MobileNet) → Tiny AI (MCUNet)