A ½ mWatt, 128-MAC Sparsity-Aware Neural Processing Unit for Classification and Semantic Segmentation
Joseph Hassoun
Samsung Semiconductor, Neural Processing Lab
San Jose, CA
Contents
1. Motivation: Demand for Edge‐Neural Processing
2. High‐Performance Mobile NPU Architecture in several Samsung Products
3. HW/SW Co-Design for Edge-NPU
a) HW Down‐Scaling with Multi‐Dimensional Parallelism
b) SW Algorithm: Binarization of Neural Network
4. Edge‐NPU Hardware & Inference Performance
Demand for Edge‐Neural Processing
• Solving the challenges of edge computing in the Internet of Things (IoT) era
[Application domains: Retail, Healthcare, Smart Building, Travel & Hospitality, Manufacturing]
High-Performance Mobile NPU
• Butterfly-structure NPU with 1024 MACs (ISSCC 2019)
• 2 NPU cores & an NPU controller
• Achieves 6.9 TOPS and 11.5 TOPS/W (8b) in 5.5 mm² of area
• Power = 39 mW @ 0.5 V
[Chip micrograph: 5.5 mm²]
Source: Song et al., "7.1 An 11.5TOPS/W 1024-MAC Butterfly Structure Dual-Core Sparsity-Aware Neural Processing Unit in 8nm Flagship Mobile SoC," ISSCC 2019. AnandTech.com: https://www.anandtech.com/show/14072/the-samsung-galaxy-s10plus-review/4
Demand for Edge‐Neural Processing
[Devices: SmartThings, fitness tracker, smartphone, Galaxy Watch]
• Solving the challenges of edge computing in the Internet of Things (IoT) era
• Task: an Edge-NPU to support Samsung wearable devices
Convolution with 3D-Parallelism
[Figure: an input feature map (4 input channels, 16 pixels in the spatial dimension) is convolved with filters (16 output channels) to produce the partial/output feature map (16 output channels, 16 pixels in the spatial dimension)]
• 3D-data balanced parallelism for the convolution operation (sketched in the code below):
1. Input-channel dimension: reduces partial-sum store/reload
2. Output-channel dimension: reuses the IFM, mitigating SRAM energy per access
3. Spatial (X/Y) unrolling of the pixel dimension: shares weight parameters
• High-performance NPU with 1024 MACs = (4 input channels) x (16 output channels) x (16 pixels)
Source: Song et al., "7.1 An 11.5TOPS/W 1024-MAC Butterfly Structure Dual-Core Sparsity-Aware Neural Processing Unit in 8nm Flagship Mobile SoC," ISSCC 2019
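To make the mapping concrete, here is a minimal Python sketch (my own illustration, not Samsung code) of a convolution loop nest with the three hardware-unrolled dimensions from the slide; a 1x1 kernel is used for brevity, and the tile sizes are assumed to divide the tensor dimensions.

```python
import numpy as np

# Parallelism factors from the slide: 4 x 16 x 16 = 1024 MACs per cycle
P_ICH, P_OCH, P_PIX = 4, 16, 16

def conv_3d_parallel(ifm, weights):
    """Toy 1x1 convolution with the 3D-unrolled MAC dimensions made explicit.
    ifm:     (ICH, PIX) input feature map, spatial dims flattened
    weights: (OCH, ICH) filter weights
    returns: (OCH, PIX) output feature map
    """
    ICH, PIX = ifm.shape
    OCH = weights.shape[0]
    ofm = np.zeros((OCH, PIX))
    for ic0 in range(0, ICH, P_ICH):            # input channels: partial sums stay put (fewer store/reloads)
        for oc0 in range(0, OCH, P_OCH):        # output channels: the same IFM tile is reused (fewer SRAM reads)
            for px0 in range(0, PIX, P_PIX):    # pixels: one weight is shared across 16 positions
                # In hardware, all 4 * 16 * 16 = 1024 MACs below fire in a single cycle.
                for ic in range(ic0, ic0 + P_ICH):
                    for oc in range(oc0, oc0 + P_OCH):
                        for px in range(px0, px0 + P_PIX):
                            ofm[oc, px] += weights[oc, ic] * ifm[ic, px]
    return ofm

# Example: 8 input channels, 32 output channels, 64 pixels
out = conv_3d_parallel(np.random.randn(8, 64), np.random.randn(32, 8))
```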
Architecture
Source: Song et al., "7.1 An 11.5TOPS/W 1024-MAC Butterfly Structure Dual-Core Sparsity-Aware Neural Processing Unit in 8nm Flagship Mobile SoC," ISSCC 2019
[Block diagram of the NPU core]
DRU: Data Return Unit; DSU: Data Staging Unit; MAA: MAC Array
Contents
1. Motivation: Demand for Edge‐Neural Processing
2. High‐Performance Mobile NPU Architecture in Galaxy S10
3. HW/SW Co-Design for Edge-NPU
1) HW Down‐Scaling with Multi‐Dimensional Parallelism
2) SW Algorithm: Binarization of Neural Network
4. Edge‐NPU Hardware & Inference Performance
Scaling‐Down to Edge‐NPU
                    HP-NPU (1024 MACs)             Edge-NPU (128 MACs)
Dimension           #Parallel  HW module           #Parallel  HW module
Input Channel       4          2 Cores x 2 DSUs    2          1 Core w/ 2 DSUs
Output Channel      16         16 MAAs/Core        16         16 MAAs/Core
Spatial Dimension   16         16 Dual-MACs/MAA    4          4 Dual-MACs/MAA
• Scaling down the high-performance NPU (1K MACs) to an Edge-NPU with 128 MACs, while keeping the benefits of multi-dimensional parallelism (a quick sanity check follows).
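A quick sanity check of the parallelism factors in the table (plain Python; the numbers are copied from the table, the code is mine):

```python
# (input-channel, output-channel, spatial) parallelism from the table
hp_npu   = {"ich": 4, "och": 16, "pix": 16}   # 2 cores x 2 DSUs, 16 MAAs/core, 16 dual-MACs/MAA
edge_npu = {"ich": 2, "och": 16, "pix": 4}    # 1 core w/ 2 DSUs, 16 MAAs/core, 4 dual-MACs/MAA

def total_macs(cfg):
    # Total MACs = product of the per-dimension parallelism factors
    return cfg["ich"] * cfg["och"] * cfg["pix"]

assert total_macs(hp_npu) == 1024
assert total_macs(edge_npu) == 128
print(f"Scale-down factor: {total_macs(hp_npu) // total_macs(edge_npu)}x")  # 8x
```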
Edge-NPU Architecture
[Block diagram: one Edge-NPU core containing 2 DSUs (DSU 0, DSU 1), 16 MAAs (MAA0–MAA15), 16 DRUs (DRU0–DRU15), and a scratchpad, connected to the SoC & external memory. Weights arrive from DSU 0 (ich = 0) and DSU 1 (ich = 1).]
1. Each dispatcher in a DSU sends 1 weight and 2x2 (= 4) pixels of activation
2. Each MAA computes one output channel
3. Each MAA has 4 dual-MACs, each adding partial products from the 2 input channels (i.e., the 2 DSUs); see the behavioral sketch below
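A behavioral sketch of the datapath in this diagram (hypothetical Python model; the function names and signatures are mine, and the real datapath is fixed-point rather than Python numbers):

```python
def dual_mac(w0, a0, w1, a1, acc=0):
    """One dual-MAC: two multiplies, one per input channel (DSU 0 and DSU 1),
    with both partial products added into the running accumulator."""
    return acc + w0 * a0 + w1 * a1

def maa_cycle(w_dsu0, w_dsu1, pix_dsu0, pix_dsu1, acc):
    """One cycle of one MAA (one output channel): its 4 dual-MACs each handle
    one pixel of the 2x2 activation patch dispatched by the two DSUs.
    pix_dsu0 / pix_dsu1: 4 activation pixels per DSU; acc: 4 partial sums."""
    return [dual_mac(w_dsu0, pix_dsu0[p], w_dsu1, pix_dsu1[p], acc[p])
            for p in range(4)]

# Example: one output channel, zero-initialized accumulators
acc = maa_cycle(w_dsu0=1, w_dsu1=-1,
                pix_dsu0=[1, 2, 3, 4], pix_dsu1=[4, 3, 2, 1],
                acc=[0, 0, 0, 0])
print(acc)  # [-3, -1, 1, 3]
```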
Contents
1. Motivation: Demand for Edge‐Neural Processing
2. High‐Performance Mobile NPU Architecture in Galaxy S10
3. HW/SW Co-Design for Edge-NPU
1) HW Down‐Scaling with Multi‐Dimensional Parallelism
2) SW Algorithm: Binarization of Neural Network
4. Edge‐NPU Hardware & Inference Performance
Neural Network Binarization
[Figure: a floating-point conv layer: IFM, filter, output pixel, OFM]
• Naïve quantization to binary weights and low-bit activations (e.g., the sketch below) greatly reduces accuracy
• An algorithmic solution is required to preserve both performance and accuracy
• Fortunately, Group-Net (Zhuang et al., CVPR 2019) addresses this issue well using structure approximation
[Figure: the corresponding binarized conv layer]
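For illustration, a naive binarization sketch of the kind the first bullet warns about (my own XNOR-Net-style example with a per-output-channel scale; not necessarily the exact scheme evaluated in the talk):

```python
import numpy as np

def binarize_weights(w):
    """Naive 1-bit weight quantization: sign(w) scaled by a per-output-channel
    factor alpha = mean(|w|). w has shape (OCH, ICH, KH, KW)."""
    alpha = np.abs(w).mean(axis=(1, 2, 3), keepdims=True)
    return alpha * np.sign(w)

def quantize_activations(a, bits=4):
    """Uniform low-bit activation quantization (post-ReLU, so values are >= 0)."""
    scale = max(a.max(), 1e-8) / (2**bits - 1)
    return np.clip(np.round(a / scale), 0, 2**bits - 1) * scale
```

Applied with nothing else changed, this is the kind of quantization that costs the fully binarized ResNet-18 13.3% of Top-1 accuracy in the results shown later.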
Group‐Net Layer‐Wise Decomposition
Source: Zhuang et al., "Structured Binary Neural Networks for Accurate Image Classification and Semantic Segmentation," CVPR 2019
[Figure: a floating-point conv layer vs. its layer-wise decomposition with two binarized bases]
• Layer‐wise decomposition is the simplest form of Group‐Net decomposition
• For layer‐wise decomposition, replace each layer with a set of binarized bases
• Take a weighted average of the binarized bases' outputs to generate the layer output (see the forward-pass sketch below)
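A minimal forward-pass sketch of this layer-wise decomposition (illustrative Python; in Group-Net the bases and combination coefficients are trained end to end, which is not shown here):

```python
import numpy as np

def groupnet_layer_forward(x, weight_bases, alphas, conv):
    """One decomposed layer: each of the K bases is a binarized weight tensor,
    and the K binary-conv outputs are combined with learned coefficients.
    weight_bases: K floating-point weight tensors (binarized on the fly)
    alphas:       K scalar combination coefficients
    conv:         any convolution routine with signature conv(weights, x)"""
    out = 0.0
    for a, w in zip(alphas, weight_bases):
        out = out + a * conv(np.sign(w), x)   # 1-bit weights via sign()
    return out
```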
Group‐Wise Decomposition
• Instead of decomposing each layer separately, group-wise structural decomposition offers more flexibility and better accuracy
• Group-structure decomposition can reduce gradient deviation during backpropagation
• Optimal group structures can be learned using neural architecture search
Representative Accuracy Results

Model            Weight width   Activation width   Group-Net Bases   Top-1            Top-5
ResNet-18        32             32                 n/a               69.7%            89.4%
Fully Binarized  1              1                  1                 56.4% (-13.3%)   79.5% (-9.9%)
Group-Net        1              4                  1                 61.5% (-8.2%)    83.2% (-6.2%)
Group-Net        1              4                  3                 68.5% (-1.2%)    88.7% (-0.7%)
Group-Net        1              4                  5                 70.1% (+0.4%)    89.5% (+0.1%)
Source: Zhuang et al., "Structured Binary Neural Networks for Accurate Image Classification and Semantic Segmentation," CVPR 2019
Contents
1. Motivation: Demand for Edge‐Neural Processing
2. High‐Performance Mobile NPU Architecture in Galaxy S10
3. HW/SW Co-Design for Edge-NPU
1) HW Down‐Scaling with Multi‐Dimensional Parallelism
2) Algorithm Choice: Group-Wise Binarization of Neural Network
4. Edge‐NPU Hardware & Inference Performance
Edge-NPU: Scaling Factors & Power
• SW help: group convolution reduces computation per conv layer, increasing fps, and low-precision arithmetic gives a further power reduction
• HW: down-scaled hardware resources, 1024 MACs → 128 MACs
• Inference performance: power reduced by 73x and energy efficiency enhanced by 6.9x (sanity-checked below)
                          HP-NPU     Edge-NPU   Reduction (Edge vs. HP-NPU)
MACs                      1024       128        8x
Frequency                 67 MHz     50 MHz     -25%
Multiplier precision      8 x 8b     4 x 1b     ~16x
Accumulation precision    32b        10b        ~3x
Low-precision reduction   1.0        0.15       6.7x
Memory                    1568 kB    784 kB     2x
Power                     39 mW      0.53 mW    73x
TOPS/W                    3.52       24.1       6.9x
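The reduction column can be sanity-checked directly from the table values (plain Python; small rounding differences against the quoted figures are expected):

```python
hp   = {"macs": 1024, "freq_mhz": 67, "power_mw": 39.0, "tops_per_w": 3.52}
edge = {"macs": 128,  "freq_mhz": 50, "power_mw": 0.53, "tops_per_w": 24.1}

print(f"MAC reduction:   {hp['macs'] / edge['macs']:.0f}x")                 # 8x
print(f"Frequency:       {edge['freq_mhz'] / hp['freq_mhz'] - 1:+.0%}")     # -25%
print(f"Power reduction: {hp['power_mw'] / edge['power_mw']:.1f}x")         # ~73.6x, quoted as 73x
print(f"Efficiency gain: {edge['tops_per_w'] / hp['tops_per_w']:.1f}x")     # ~6.8x, quoted as 6.9x
```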
Edge-NPU: Inference Performance

                                      HP-NPU & ResNet-18   Edge-NPU & GroupNet*   Efficiency Gain
Model Optimization
  Model precision                     8b                   A4b & W1b (3 bases)    4x
  Class. accuracy (Top-1)             69.7%                68.5%                  -
  #Operations/frame                   4.30E+9              4.20E+8                10.3x
Hardware Optimization
  OPS                                 1.37E+11             1.3E+10                -
  Power                               39 mW                0.53 mW                73x
Inference Performance
  Frames per second                   31.9 fps             30.4 fps               95%
  Energy per frame                    1.2 mJ               5.8 µJ                 210x
• Co-optimization of SW and HW for edge computing: HP-NPU with ResNet-18 → Edge-NPU with GroupNet
• Inference performance: near the HP-NPU frame rate (95%) at 210x lower energy per frame (cross-checked below)
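And a cross-check of the headline inference numbers (values copied from the table; code is mine):

```python
ops_per_frame      = {"hp": 4.30e9, "edge": 4.20e8}
fps                = {"hp": 31.9,   "edge": 30.4}
energy_per_frame_j = {"hp": 1.2e-3, "edge": 5.8e-6}   # 1.2 mJ vs. 5.8 uJ

print(f"ops/frame reduction: {ops_per_frame['hp'] / ops_per_frame['edge']:.1f}x")            # ~10.2x, quoted 10.3x
print(f"fps retained:        {fps['edge'] / fps['hp']:.0%}")                                 # ~95%
print(f"energy reduction:    {energy_per_frame_j['hp'] / energy_per_frame_j['edge']:.0f}x")  # ~207x, quoted 210x
```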
                  HP Mobile NPU   Edge NPU
Model             ResNet-18       GroupNet ResNet-18
Top-1 Accuracy    69.7%           68.5%
Energy            1.2 mJ          5.8 µJ
Frames per Sec    31.9 fps        30.4 fps
Acknowledgement
• NPL: Ali Shafiee Ardestani, Jong Hoon Shin, David Thorsley, Hamzah Abdelaziz
• SAIT: Sehwan Lee, Jun-Woo Jang, Joon-Ho Song, Eunsoo Shim
• S.LSI: Jinook Song, Yunkyo Cho, Jun-Seok Park, Inyup Kang