deep learning hardware accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · deep learning...
TRANSCRIPT
![Page 1: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/1.jpg)
Deep Learning Hardware
Acceleration
Jorge Albericio+ Alberto Delmas Lascorz
Patrick Judd Sayeh Sharify
Tayler Hetherington*
Natalie Enright Jerger Tor Aamodt*
Andreas Moshovos
*
+ now at NVIDIA
![Page 2: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/2.jpg)
The University of Toronto has filed patent
applications for the mentioned technologies.
Disclaimer
![Page 3: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/3.jpg)
Time: ~ 60% - 90% inner products
Deep Learning: Where Time Goes?
100s
-
1000s
X
X
+
Convolutional Neural Networks: e.g., Image Classification
![Page 4: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/4.jpg)
4
100s
-
1000s
X
X
+
X
X
+
X
X
+X
X
+X
X
+X
X
+
Time: ~ 60% - 90% inner products
Deep Learning: Where Time Goes?
![Page 5: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/5.jpg)
SIMD: Exploit Computation Stucture
5
DaDianNao
4K terms/cycle
0
15
0
15
0
15
x
x
x
x
x16x256
![Page 6: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/6.jpg)
Our Approach
6
0
15
0
15
0
15
Filter 0
Filter 15
Improve by Exploiting Value Properties
Maintaining:
Massive Parallelism
SIMD Lanes
Wide Memory Accesses
No Modifications to the Networks
![Page 7: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/7.jpg)
Longer Term Goal
7
0
15
0
15
0
15
Filter 0
Filter 15
One Architecture to Rule them All
![Page 8: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/8.jpg)
Value Properties to Exploit? Many ~0 values
8
![Page 9: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/9.jpg)
Value Properties to Exploit? Varying Precision Needs
9
X
X
X
![Page 10: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/10.jpg)
Our Results: Performance
10
1.5x1.9x
3.1x1.60x
2.08x
0
1
1
2
2
3
3
4
CNVLUTIN STRIPES ENGINE P
100% 99%
PRAGMATIC
TARTAN +
4.3x
vs. DaDianNao which was ~300x over GPUs
Accuracy
ISCA’16 MICRO’16 Arxiv +ICLR Workshop
![Page 11: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/11.jpg)
• Proteus:
44% less memory bandwidth + footprint
Our Results: Memory Footprint and Bandwidth
11
![Page 12: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/12.jpg)
• Avoiding computations with ~0
• Performance from precision
• Performance from zero bits
• Reducing footprint and bandwidth
Roadmap
12
![Page 13: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/13.jpg)
#1: Skipping Ineffectual Activations
13
![Page 14: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/14.jpg)
Many ineffectual multiplications
Cnvlutin: ISCA’16
14
X0
X
+0
0
![Page 15: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/15.jpg)
• Zero Activations:
• Runtime calculated
• None that are always zero
Many Activations and Weights
are Intrinsically Ineffectual (zero)
15
Zero Weights:
Known in Advance
Not pursued in this work
45% of Runtime Values are zero
% Stable for any Input
None always zero
![Page 16: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/16.jpg)
Many ineffectual multiplications
Cnvlutin
16
X0
X
+0
0
![Page 17: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/17.jpg)
Many more ineffectual multiplications
Cnvlutin
17
X0
X
+0
0
a
b
![Page 18: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/18.jpg)
Cnvlutin
18
X
X
+
Beating Fast and
“Dumb” SIMD is Hard
On-the-fly ineffectual product elimination
Performance + energy
Optional: accuracy loss +performance
![Page 19: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/19.jpg)
Cnvlutin
No Accuracy Loss
+52% performance
-7% power
+5% area
Can relax the ineffectual criterion
better performance: 60%
even more w/ some accuracy loss
![Page 20: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/20.jpg)
Deep Learning: Convolutional Neural Networks
20
imagelayers10s-100
Beaver
maybe
![Page 21: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/21.jpg)
• > 60% -- 90% of time in Convolutional Layers
Deep Learning: Convolutional Networks
21
i
activations
activationsN
K
K
i
weights
N
input
output
![Page 22: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/22.jpg)
Why are there so many zero neurons?
22
N
Hypothesis: Filter = feature
z
![Page 23: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/23.jpg)
SIMD: Exploit Computation Stucture
23
DaDianNao
4K terms/cycle
0
15
0
15
0
15
x
x
x
x
x16x256
![Page 24: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/24.jpg)
• Processing all Activations:
• All Lanes operate in lock step
Skipping Ineffectual Activations: Key Challenge
Lane 15
Lane 0
0
15
0
15
0
15
x
x
x
x
![Page 25: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/25.jpg)
• 16 independent narrow activation streams
Naïve Solution: No Wide Memory Accesses
25
Lane 15
Lane 0
![Page 26: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/26.jpg)
Removing Zeroes: At the output of each layer
en
co
de
Layer
iLayer
i + 1N
eu
ron
Mem
Beaver
maybe
![Page 27: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/27.jpg)
Maintaining Wide Accesses But Skipping Zeroes
27
#1: Partition NM in 16 Slices over 16 Banks
Processing order does not matter
Neuron lane 15
Neuron lane 0
Neuron lane 1
![Page 28: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/28.jpg)
Maintaining Wide Accesses But Skipping Zeroes
28
#2: Fetch and Maintain One Container per Slice
Container: up to 16 non-zero neurons
Neuron lane 15
Neuron lane 0
Neuron lane 1
![Page 29: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/29.jpg)
Maintaining Wide Accesses But Skipping Zeroes
29
#3: Keep Neuron Lanes Supplied with One
Neuron Per Cycle
Lane 15
Lane 0
Lane 1
![Page 30: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/30.jpg)
Maintaining Wide Accesses But Skipping Zeroes
30
#4: When a container is exhausted, get the next
one within the slice
Lane 15
Lane 0
Lane 1
![Page 31: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/31.jpg)
Maintaining Wide Accesses But Skipping Zeroes
Container: stores only non-zeros
Encoding: Value, 4-bit offset
Could use 1 extra bit: encoded vs. raw
![Page 32: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/32.jpg)
• Zero-Free Neuron Array Format:
• Only non-zero neurons + offsets
• Brick-level
Inside Each Neuron Container
32
ZFNAf: Enabling the Skipping of Ineffectual Neurons
![Page 33: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/33.jpg)
Cnvlutin: No Accuracy Loss
33better
![Page 34: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/34.jpg)
• Treat
Neurons
close to zero
as zero
Loosening the Ineffectual Neuron Criterion
34
Open Questions:
Are these robust? How to find the best?
![Page 35: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/35.jpg)
#2: Exploiting Precision
35
X
X
X
![Page 36: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/36.jpg)
Another Property of CNNs
36
X
X
+
Operand Precision Required Fixed?
16 bits?
16 bits
![Page 37: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/37.jpg)
CNNs: Precision Requirements Vary
37
X
X
+
Operand Precision Required Fixed Varies
5 bits to 13 bits
16 bits
p bits
![Page 38: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/38.jpg)
Stripes
38
X
X
+
Execution Time = 16 / PPeformance + Energy Efficiency + Accuracy Knob
p bits
![Page 39: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/39.jpg)
• Devil in the Details: Carefully chose what to serialize and
what to reuse same input wires as baseline
Stripes: Key Concept
39
2 2x2b
Terms/Step2 1x2b 4 1x2b
![Page 40: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/40.jpg)
SIMD: Exploit Computation Stucture
40
DaDianNao
4K terms/cycle
0
15
0
15
0
15
x
x
x
x
x16
![Page 41: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/41.jpg)
Stripes Bit-Serial Engine
41
0
15
0
15
x
x
1
1
16
16
248
255
x
x
1
1
![Page 42: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/42.jpg)
• Each Tile:
• 16 Windows Concurrently – 16 neurons each
• 16 Filters
• 16 partial output neurons
Compensating for Bit-Serial’s Compute Bandwidth Loss
42
16
16
neurons
neurons
synapses
16
![Page 43: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/43.jpg)
Stripes
No Accuracy Loss
+192% performance
-57% energy
+32% area
More performance w/ accuracy loss
*
* W/O Older: LeNet + Covnet
![Page 44: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/44.jpg)
Stripes: Performance Boost
44
be
tte
r
![Page 45: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/45.jpg)
• Each Tile:
• No Weight Reuse
• Cannot Have 16 Windows
Fully-Connected Layers?
45
16
16
neurons
neurons
synapses
16
![Page 46: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/46.jpg)
• No Weight Reuse
• Cannot Have 16 Windows
Fully-Connected Layers
46
Input neurons Output
neurons
synapses
![Page 47: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/47.jpg)
• Bit-Parallel Engine
• V: activation
• I: weight
• Both 2 bits
TARTAN: Accelerating Fully-Connected Layers
47
![Page 48: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/48.jpg)
• Cycle 1:
• Activation: a1 and Weight: W
Bit-Parallel Engine: Processing one Activation x Weight
48
![Page 49: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/49.jpg)
• Cycle 2:
• Activation: a2 and Weight: W
• a1 x W + a2 x W over two cycles
Bit-Parallel Engine: Processing Another Pair
49
![Page 50: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/50.jpg)
• 2 x 1b activation inputs
• 2b or 2 x 1b weight inputs
TARTAN engine
50
activations
we
igh
ts
![Page 51: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/51.jpg)
• Cycle 1: load 2b weight into BRs
TARTAN: Convolutional Layer Processing
51
![Page 52: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/52.jpg)
• Cycle 2: Multiply W with bit 1 of activations a1 and a2
TARTAN: Weight x 1st bit of Two Activations
52
![Page 53: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/53.jpg)
• Cycle 3: multiply W with 2nd bit of a1 and a2
• Load new W’ into BR
• 3-stage pipeline to do 2: 2b activation x 2b weight
TARTAN: Weight x 2nd bit of Two Activations
53
![Page 54: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/54.jpg)
• What is different? Weights cannot be reused
• Cycle 1: Load first bit of two weights into Ars
TARTAN: Fully-Connected Layers: Loading Weights
54
Bit 1 of Two Different Weights
![Page 55: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/55.jpg)
• Cycle 2: Load 2nd bit of w1 and w2 into ARs
• Bit 2 of Two Different Weights
• Loaded Different Weights to Each Unit
TARTAN: Fully-Connected Layers: Loading Weights
55
![Page 56: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/56.jpg)
• Cycle 3: Move AR into BR and proceed as before over
two cycles
• 5-stage pipeline to do:
• TWO of (2b activation x 2b weight)
TARTAN: Fully-Connected Layers: Processing Activations
56
![Page 57: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/57.jpg)
• Bit-Serial TARTAN
•2.04x faster than DaDiannao
•1.25x more energy efficient at the same frequency
•1.5x area overhead
• 2-bit at-a-time TARTAN
•1.6x faster over DaDiannao
•Roughly same energy efficiency
•1.25x area overhead
TARTAN: Result Summary
57
![Page 58: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/58.jpg)
Bit-Pragmatic Engine
58
X
X
+
Operand Information Content Varies
![Page 59: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/59.jpg)
• Want to do A x B
• Let’s look at A
• Which bits really matter?
Inner-Products
59
X B
![Page 60: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/60.jpg)
• Only 8% of bits are non-zero once precision is reduced
• 15%-10% otherwise
Zero Bit Content: 16-bit fixed-point
60
![Page 61: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/61.jpg)
• Only 27% of bits are non-zero
Zero Bit Content: 8-bit Quantized (Tensorflow-like)
61
![Page 62: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/62.jpg)
Pragmatic Concept: Bit-Parallel Engine
62
![Page 63: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/63.jpg)
• Simply Modify Stripes?
• Too Large + Cross Lane Synchronization
Pragmatic Concept: Use Shift-and-Add
63
![Page 64: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/64.jpg)
Bit-Parallel Engine
64
0
15
0
15
x
x
16
16
16
16
![Page 65: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/65.jpg)
STRIPES
65
0
15
0
15
x
x
1
1
16
16
248
255
x
x
1
1
![Page 66: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/66.jpg)
Pragmatic: Naive STRIPES extension? Problem #1: Too Large
66
0
15
0
15
BIG
>>
>>
offset
offset
16
16
248
255
>>
>>
offset
offset
32
32
32
32
BIG = 3.7x area overhead just for the datapath
![Page 67: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/67.jpg)
• Process in groups of Max N Difference
• Example with N = 4
• Some opportunity loss, much lower area overhead
• Can skip groups of all zeroes
Solution to #1? 2-Stage Shifting
67
0
15
0
15
OK
>>
>>
1
1
16
16
20
20>>
0
1
10 00 00
00 00 10
![Page 68: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/68.jpg)
• Process in groups of Max N Difference
• Example with N = 4
• Some opportunity loss, much lower area overhead
Solution to #1? 2-Stage Shifting
68
0
15
0
15
OK
>>
>>
1
1
16
16
20
20>>
1
0
10 00 00
00 00 10
4
![Page 69: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/69.jpg)
• Different # of 1 bits
• Lanes go out of sync
• May have to fetch up to 256 different activations from
NM
• Keep Lanes Synchronized:
• No cost: All lanes
• Extra register for weights:
• Allow columns to advance by 1
• Some cost but much better performance
Lane Synchronization
70
![Page 70: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/70.jpg)
Speedup and Energy Efficiency vs. DaDianNao
71
![Page 71: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/71.jpg)
Bit-Pragmatic
72
X
X
+
No Accuracy Loss
+310% performance
- 48% Energy
+ 45% AreaBetter w/ 8-bit Quantization
4.3x with Encoding
![Page 72: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/72.jpg)
Reducing Memory Footprint and Bandwidth
73
![Page 73: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/73.jpg)
• Operand Precision Required Varies
Proteus
74
X
X
+
Proteus: Store in reduced precision in memory
Less Bandwidth, Less Energy
![Page 74: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/74.jpg)
• Weights (synapses) and Data (activations/neurons)
Proteus: Pick Per Layer Precision
75
Layered Extension:
Compatible with Existing Systems
![Page 75: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/75.jpg)
0
15
0
15
x
x
Conventional Format: Base Precision
Data Physically aligns with Unit Inputs
![Page 76: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/76.jpg)
Conventional Format: Base Precision
Need Shuffling Network to Route Synapses
4K input bits Any 4K output bit position
0
15
0
255
x
x
4K
x 4
K
Cro
ssb
ar
![Page 77: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/77.jpg)
0
15
0
15
x
x
Proteus’ Key Idea: Pack Along Data Lane Columns
78
Local Shufflers: 16b input 16b output
Much simpler
![Page 78: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/78.jpg)
Proteus
44% less memory
bandwidth
![Page 79: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/79.jpg)
• Training
• Prototype
•Design Space: lower-end confs
• Unified Architecture
•Dispatcher + Compute
•Other Workloads: Comp. Photo
• General Purpose Compute Class
What’s Next
80
![Page 80: Deep Learning Hardware Accelerationmoshovos/...media=wiki:moshovos_cnn_acc… · Deep Learning Hardware Acceleration Jorge Albericio+ Alberto Delmas Lascorz Patrick Judd Sayeh Sharify](https://reader036.vdocuments.us/reader036/viewer/2022063001/5f1aab70ff011a310d29e59a/html5/thumbnails/80.jpg)
• More properties to discover and exploit
• E.g., Filters do overlap significantly
• CNNs one class
• Other networks
• Use the same layers
• Relative importance different
• Training
A Value-Based Approach to Acceleration
81