TRANSCRIPT
Mixed-Signal Techniques for Embedded
Machine Learning Systems
Boris Murmann
June 16, 2019
Applications

Cloud ↔ Edge tradeoffs:
› Speed of response
› Bandwidth utilized
› Privacy
› Power consumed

Conversational interfaces: …natural? …seamless? …real-time?
Task Complexity, Memory and Classification Energy

[Figure: classification energy per task, spanning pJ to J]
› Face Detection (SVM)
› Digit Recognition
› Image Classification, 10 classes (Binary CNN)
› Image Classification, 1000 classes (MobileNet v1)
› Self-Driving (ResNet-50, HD)
Task Complexity, Memory and Classification Energy

[Same energy scale (pJ to J), annotated: "My main interest"]
Edge Inference System

[Block diagram: Sensors → A/D → Wake-up Detector → Pre-Processing → Deep Neural Network; memory/compute hierarchy: DRAM → SRAM Buffer → PE Array → Register File → Compute (MAC)]
Opportunities for Analog/Mixed-Signal Design

[Same block diagram, with mixed-signal opportunities highlighted: Analog Feature Extraction at the sensor front-end, Analog MAC in the compute units, and In-Memory Computing in the memory hierarchy]
Outline
Data-Compressive Imager for Object Detection
› Omid-Zohoor & Young, TCSVT 2018 & ISSCC 2019
Mixed-Signal ConvNet
› Bankman, ISSCC 2018 & JSSC 2019
RRAM-based ConvNet with In-Memory Compute
› Ongoing work
Wake-Up Detector with Hand-Crafted Features

[Block diagram: Sensor Signal → Feature Extractor → A/D → Simple Classifier → Label; a Training Algorithm uses Labeled Training Data to fit the classifier. The high-dimensional sensor data ("data deluge") is reduced to a low-dimensional representation.]
Analog Feature Extractor

› Low-rate and/or low-resolution ADC
› Low data rate digital I/O
› Reduced memory requirements

[Block diagram: Sensor Signal → analog Feature Extractor → A/D → Simple Classification → Label; the A/D now digitizes only the low-dimensional representation]
Histogram of Oriented Gradients
Analog Gradient Computation

For a pixel P with horizontal neighbors H1, H2 and vertical neighbors V1, V2:

G_H = H1 − H2 (horizontal)
G_V = V1 − V2 (vertical)

Bright patch: G_H = 400 mV − 100 mV = 300 mV
Dark patch (same scene at 1/4 illumination): G_H = (1/4)·400 mV − (1/4)·100 mV = 75 mV
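The bright/dark patch arithmetic above can be sketched in a few lines of Python (values taken from the slide; the helper name is mine):

```python
# Linear gradient: G_H = H1 - H2. A patch at 1/4 the illumination
# produces 1/4 the gradient magnitude, as in the slide's example.
def linear_gradient(h1_mv, h2_mv):
    """Horizontal gradient G_H = H1 - H2, in millivolts."""
    return h1_mv - h2_mv

bright = linear_gradient(400, 100)        # bright patch: 300 mV
dark = linear_gradient(400 / 4, 100 / 4)  # same patch at 1/4 illumination: 75 mV
print(bright, dark)
```

This dependence of gradient magnitude on illumination is what motivates the high dynamic range discussion on the next slide.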
High Dynamic Range Images

Gradient magnitude varies significantly across the image
Would require high-resolution ADCs (≥ 9b) to digitize the computed gradients
› But, we want to produce as little data as possible
Ratio-Based (“Log”) Gradients

G_H = H1 / H2
G_V = V1 / V2

Illumination invariant: a common illumination scale factor α cancels in the ratio,
G_H = (α·H1) / (α·H2) = H1 / H2
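A minimal sketch of the illumination-invariance argument, assuming a hypothetical dimming factor α applied equally to both pixels:

```python
# Ratio ("log") gradient: a common illumination scale factor cancels,
# so the gradient is the same for the bright and the dimmed patch.
def ratio_gradient(h1, h2):
    return h1 / h2

alpha = 0.25  # hypothetical dimming factor (same scene, 1/4 the light)
g_full = ratio_gradient(400, 100)
g_dim = ratio_gradient(alpha * 400, alpha * 100)
assert g_full == g_dim == 4.0  # illumination invariant
```

Contrast this with the linear gradient, which shrank from 300 mV to 75 mV under the same dimming.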
Ratio Quantization

G_H = Q(H1 / H2)
G_V = Q(V1 / V2)

[Number lines showing the pixel ratio V_pix1/V_pix2 quantized at thresholds* of 1/4.2, 1/1.3, 1.3, and 4.2 into negative edge / no edge / positive edge regions; 1.5b and 2.75b quantization variants shown]

*Empirically determined thresholds
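The three-level (1.5b) quantizer can be sketched as follows, using the slide's empirically determined threshold of 1.3 (and its reciprocal). The function name and default argument are illustrative, not the chip's actual circuit behavior:

```python
# Three-level (1.5b) ratio quantizer: maps a pixel ratio to
# negative edge (-1), no edge (0), or positive edge (+1).
def quantize_ratio(r, t=1.3):
    if r > t:
        return +1   # positive edge
    if r < 1 / t:
        return -1   # negative edge
    return 0        # no edge

# Ratios beyond the slide's outer thresholds are still saturated edges:
print(quantize_ratio(4.2), quantize_ratio(1 / 4.2), quantize_ratio(1.0))
```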
HOG Feature Compression with 1.5b Gradients

› Quantized gradients, fewer angle bins, quantized histogram magnitudes
› 25× less data in the HOG features compared to the 8-bit image
Log vs. Linear Gradients

[Three slides comparing log and linear gradient images under progressively less illumination]
Prototype Chip
Row Buffers with Pixel Binning (Image Pyramid)
Ratio-to-Digital Converter (RDC)
Data-Driven Spec Derivation

Error model for the gradient ratio: (H1 + ε1) / (H2 + ε2)
Chip Summary

• 0.13 µm CIS 1P4M
• 5 µm 4T pixels
• QVGA 320(V) × 240(H)
• 229 mW @ 30 FPS
Supply Voltages:
• Pixel: 2.5V
• Analog: 1.5V, 2.5V
• Digital: 0.9V
Sample Images
Results using Deformable Parts Model detection & custom database (PascalRAW)
Comparison to State of the Art
Information Preservation
Use Log Gradients as ConvNet Input?
Ongoing work; comparable performance using ResNet-10 (PascalRaw dataset)
Digital ConvNet Fabric
Can We Play Mixed-Signal Tricks in a ConvNet?
BinaryNet
Courbariaux et al., NIPS 2016

› Weights and activations constrained to +1 and −1; multiplication becomes XNOR
› Minimizes D/A and A/D overhead
› Nice option for small/medium-size problems and mixed-signal exploration
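The XNOR trick can be illustrated with a small sketch: encoding −1/+1 as bits 0/1, each product is an XNOR of the bit pair and the dot product reduces to a popcount. This is a generic illustration of the BinaryNet arithmetic, not the chip's implementation:

```python
# Binary dot product via XNOR-popcount. Bits 0/1 encode values -1/+1;
# the product w_i*x_i is +1 exactly when the bits match (XNOR = 1).
def binary_dot(w_bits, x_bits):
    """Returns sum of w_i*x_i with w, x in {-1,+1}, given their bit encodings."""
    n = len(w_bits)
    matches = sum(1 for w, x in zip(w_bits, x_bits) if w == x)  # popcount of XNOR
    return 2 * matches - n

# Check against the arithmetic definition on {-1,+1} values:
w, x = [1, 0, 1, 1], [1, 1, 0, 1]
ref = sum((2 * a - 1) * (2 * b - 1) for a, b in zip(w, x))
assert binary_dot(w, x) == ref
```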
Mixed-Signal Binary CNN Processor
Bankman et al., ISSCC 2018

1. Binary CNN with “CMOS-inspired” topology, engineered for minimal circuit-level path loading
2. Hardware architecture amortizes memory access across many computations, with all memory on chip (328 KB)
3. Energy-efficient switched-capacitor neuron for wide vector summation, replacing digital adder tree
Original BinaryNet Topology
Zhao et al., FPGA 2017 (CIFAR-10)

› 88.54% accuracy on CIFAR-10
› 1.67 MB weight memory (68% in FC layers)
› 27.9 mJ/classification on an FPGA
Mixed-Signal BinaryNet Topology

Sacrificed accuracy for regularity and energy efficiency:
› 86.05% accuracy on CIFAR-10
› 328 KB weight memory
› 3.8 mJ per classification
Neuron
Naïve Sequential Computation
Weight-Stationary

[Array detail: 256 columns; ×2 (north/south); ×4 (neuron mux)]
Weight-Stationary and Data-Parallel

[Parallel broadcast of inputs across the array]
Complete Architecture
Neuron Function

z = sgn( Σ_{i=0..1023} w_i·x_i + b )

› w: filter weights (2×2×256)
› x: image patch (2×2×256)
› b: filter bias, 9b sign-magnitude: b = (−1)^s · Σ_{i=0..7} 2^i·m_i
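A behavioral sketch of the neuron function above, including the 9b sign-magnitude bias. How the hardware resolves an exact zero sum is not stated on the slide, so ties go to +1 here:

```python
# 9b sign-magnitude bias: b = (-1)^s * sum_{i=0..7} 2^i * m_i
def bias(s, m):
    """s: sign bit (0/1); m: 8 magnitude bits, LSB first."""
    return (-1) ** s * sum((2 ** i) * mi for i, mi in enumerate(m))

# Neuron: z = sgn(sum w_i*x_i + b), with w, x in {-1, +1}.
def neuron(w, x, b):
    acc = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if acc >= 0 else -1  # tie at zero broken toward +1 (assumption)
```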
Switched-Capacitor Implementation

v_diff / V_DD = (C_u / C_tot) · ( Σ_{i=0..1023} w_i·x_i + b ),   b = (−1)^s · Σ_{i=0..7} 2^i·m_i

› Capacitor array sections: weights × inputs; bias & offset calibration
› Batch normalization folded into weight signs and bias
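A quick numeric check of the charge-sharing relation. The chip's C_tot also includes the bias/calibration capacitors; the simplification C_tot = 1024·C_u below is my assumption for illustration only:

```python
# Output swing of the switched-capacitor neuron: one MAC LSB moves the
# differential output by roughly VDD/N (here ~0.59 mV).
N = 1024            # vector length from the neuron equation
Cu = 1e-15          # 1 fF unit capacitor (from the behavioral-sim slide)
Ctot = N * Cu       # simplified total capacitance (assumption: bias caps ignored)
VDD = 0.6           # neuron array supply VNEU from the chip summary

def vdiff(mac_sum, b=0):
    return VDD * (Cu / Ctot) * (mac_sum + b)

print(vdiff(1))
```

Note that this per-LSB step (~0.59 mV) sits comfortably above the 460 µV RMS noise quoted on the next slide, consistent with the claimed margin.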
Behavioral Simulations

Significant margin in noise, offset, and mismatch (V_FS = 460 mV):
C_u = 1 fF, v_os = 4.6 mV RMS, v_n = 460 µV RMS
“Memory-Cell-Like” Processing Element

› Standard-cell-based, 42 transistors
› 1 fF metal-oxide-metal fringe capacitor
› 24107 F²
Die Photo

› TSMC 28nm HPL 1P8M
› 6 mm² area
› 328 KB SRAM
› 10 MHz clock
Supply Voltages:
› VDD (digital logic): 0.6V – 1.0V
› VMEM (SRAM): 0.53V – 1.0V
› VNEU (neuron array): 0.6V
› VCOMP (comparators): 0.8V
Measured Classification Accuracy

› 10 chips, 180 runs each through 10,000 CIFAR-10 test images
› VDD = 0.8V, VMEM = 0.8V
› 3.8 μJ/classification; 237 FPS, 899 μW; 0.43 μJ in 1.8V I/O
› Mean accuracy μ = 86.05% (σ = 0.40%), same as the perfect digital model
Comparison to Synthesized Digital

[Comparison: synthesized digital BinarEye (Moons et al., CICC 2018) vs. this mixed-signal design]
Digital vs. Mixed-Signal Binary CNN Processor

[Energy @ 86.05% CIFAR-10 accuracy, compared across: Synthesized Digital (Moons et al., CICC 2018), Hand-Designed Digital (projected), and Mixed-Signal (Bankman et al., ISSCC 2018)]
CIFAR-10 Energy vs. Accuracy

› Neuromorphic: [1] TrueNorth, Esser PNAS 2016
› GPU: [2] Zhao FPGA 2017
› FPGA: [2] Zhao FPGA 2017; [3] Umuroglu FPGA 2017
› MCU: [4] CMSIS-NN, Lai arXiv 2018
› Memory-like, mixed-signal: [5] Bankman ISSCC 2018
› BinarEye, digital: [6] Moons CICC 2018
› In-memory, mixed-signal: [7] Jia arXiv 2018
*Energy excludes off-chip DRAM
Limitations of Mixed-Signal BinaryNet
Poor programmability
Relatively limited accuracy (even on CIFAR-10) due to 1b arithmetic
Energy advantage over customized digital is not revolutionary
› Same SRAM, essentially same dataflow
Need a more “analog” memory system to unleash larger gains
› In-memory computing
BinaryNet Synapse versus Resistive RAM

BinaryNet switched-capacitor synapse:
• 0.93 fJ per 1b-MAC in 28 nm
• 24107 F²
• Single-bit
Resistive RAM (Tsai, 2018):
• Energy: TBD
• 25 F²
• Multi-bit (?)
Matrix-Vector Multiplication with Resistive Cells

Typically use two cells to achieve pos/neg weights (other schemes possible)

[Array diagram: 1024 rows × 256 columns]

RRAM Density – BinaryNet Example

› 25 F² cell @ F = 90 nm → side length s = 0.45 µm
› Per layer (2× for differential cells): 2 × 1024 × 0.45 µm × 256 × 0.45 µm = 0.106 mm²
› 0.84 mm² for the complete 8-layer network
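The area arithmetic on this slide checks out; a short script reproducing it:

```python
# RRAM density arithmetic from the slide: 25 F^2 cells at F = 90 nm,
# differential (2x) 1024 x 256 array -> area per BinaryNet layer.
F = 90e-9                      # feature size, 90 nm
cell_area = 25 * F ** 2        # 25 F^2 cell
s = cell_area ** 0.5           # cell side length = 0.45 um
rows, cols = 1024, 256
layer_area = 2 * rows * s * cols * s   # x2 for pos/neg weight cells

print(s, layer_area)           # ~0.45 um, ~0.106 mm^2 per layer
```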
Ongoing Research
What is the best architecture?
How many levels can be stored in each cell?
What is the most efficient readout?
Can we cope with nonidealities using training techniques?
(Content deleted for online distribution)
VGG-7 Experiment (4.8 Million Parameters)

› Conv layers: 3×3×3, 3×3×128, 3×3×128, 3×3×256, 3×3×256, 3×3×512; FC: 8192 → 10
› 2-bit weights, 2-bit activations

Accuracy on CIFAR-10:
› 2b quantization only: 93%
› 2b quantization + RRAM/ADC model: 92%
Work in progress!

Energy Model for Column in Conv6 Layer

Optimal energy/MAC = 0.38 fJ
Summary
Analog feature extraction is attractive for wake-up detectors
Adding analog compute in ConvNets can be beneficial when it
simultaneously lets us reduce data movement
› In-memory analog compute looks most promising
› Can consider SRAM or emerging memories (e.g. RRAM)
Expect significant progress as more application drivers for “machine
learning at the edge” emerge