designing an analog crossbar based
TRANSCRIPT
Photos placed in horizontal position
with even amount of white space
between photos and header
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and
Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S.
Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.
Sapan Agarwal, Alexander Hsia, Robin Jacobs-Gedrim, David R. Hughart,
Steven J. Plimpton, Conrad D. James, Matthew J. Marinella
Sandia National Laboratories
Designing an Analog Crossbar based
Neuromorphic Accelerator
1“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
The Von Neumann Bottleneck
CPU MemoryCommunications Bus
Cross chip communications ~ 1 pJ
DRAM Access >10 pJ
Ethernet ~ 1nJCurrent Transistors ~ 10 aJ
40kT Noise Limit ~ 0.2 aJ
Processor Layer Photonic Layer
Optical interconnects 100 fJ to 1 pJ
Communications require
orders of magnitude more
energy!
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Use Resistive Memories for Local Computation
• A resistive memory or ReRAM is a
programmable resistor• Apply small voltages allows the conductance
to be read: I = G × V
• Apply large voltages to change the resistance
Ta
Pt
+ +
++++
+ ++ ++
+ ++ +
+ +++
ON
Ta
Pt
+ +
++++
TaOx
+
OFF
V = I×R
I = G×V
multiplication
Addition: I=I1+I2
I1
I2
Current
VoltageVREAD VSET
VRESETSET
RESETWrite
Read Window
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Directly Process in the Memory Itself
4
Analog is efficiently and naturally
able to combine computation and
data access
Effectively, large-scale processing in
memory with a multiplier and adder
at each real-valued memory location
w11
w21
w31
w41
w12
w22
w32
w42
w13
w23
w33
w43
w14
w24
w34
w44
V1=x1 +-
+-
+-
+-
V2=x2
V3=x3
V4=x4
I1=x1*w11 + … + x4*w41
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Crossbars Can Perform Parallel Reads and Writes
w11
w21
w31
w41
w12
w22
w32
w42
w13
w23
w33
w43
w14
w24
w34
w44
V1=x1+-
+-
+-
+-
V2=x2
V3=x3
V4=x4x4=1
V t
x3=0.66V t
x2=0.33V t
x1=0V t
y1=0.25
V V V V
tt t t
y2=0.5 y3=0.75 y4=1
w11
w21
w31
w41
w12
w22
w32
w42
w13
w23
w33
w43
w14
w24
w34
w44
N r
ow
s
M columns
Energy to charge the crossbar is CV2
E ∝ C ∝ number of RRAMs ∝ N×M
E ~ O(N×M)
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
SRAM Arrays Require Charging Columns Multiple Times
WL[0]
WL[1]
WL[2]
BL[0] BL[1] BL[2]
SRAMs must be read one row at a time, charging M columns
Each column wire length is O(N).
Energy = N Rows × M Columns × O(N) wire length
Energy ~ O(N2×M)
O(N) times worse than a crossbar!
N r
ow
s
M columns
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Want To Accelerate Many Different Neural Algorithms
BackpropagationSparse
Coding
Liquid State
Machine
Input
Nodes
Output
NodesReservoir
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Crossbars Can Perform Parallel Reads and Writes
w11
w21
w31
w41
w12
w22
w32
w42
w13
w23
w33
w43
w14
w24
w34
w44
V1=x1+-
+-
+-
+-
V2=x2
V3=x3
V4=x4x4=1
V t
x3=0.66V t
x2=0.33V t
x1=0V t
y1=0.25
V V V V
tt t t
y2=0.5 y3=0.75 y4=1
w11
w21
w31
w41
w12
w22
w32
w42
w13
w23
w33
w43
w14
w24
w34
w44
N r
ow
s
M columns
Energy to charge the crossbar is CV2
E ∝ C ∝ number of RRAMs ∝ N×M
E ~ O(N×M)
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
General Purpose Neural Architecture
Neuromorphic core:
• Evaluate vector matrix multiplies along
rows or columns
• Train based on input vectors
Digital Core:
• Process neural core inputs/outputs
• For NxN crossbar, the crossbar accelerates
O(N2) operations leaving only O(N) operations
for the digital core
Run any neural algorithm on the
same hardware
R RR BusBus
R RR Bus Bus
R RR Bus Bus
Neural
Core(s)
Digital
Core
Neural
Core(s)
Digital
Core
Neural
Core(s)
Digital
Core
Neural
Core(s)
Digital
Core
Router
D/A
D/A
D/A D/A
A/D A/D
+ - + -
A/D
A/D
+ -
+ -
positive
weights
negative
weights
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
yi
Neural Core
jzje
y
1
1
i
ijij wyz
yj
O(N2)
Operations
O(N)
Operations
in
out
i
j
k
Can Run Neural Networks on this
Architecture
w11
w21
w31
w12
w22
w32
w31
w32
w33
Digital Core
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
yi
Digital Core
kjj zdz
dy )(
i
ijij wyz
O(N2) Read
Operations
O(N)
Operations
i
j
k
kerror
k
k
jkk w
j k
yiiy
O(N2) Write
Operations
Back Propagation
w11
w21
w31
w12
w22
w32
w31
w32
w33
w11
w21
w31
w12
w22
w32
w31
w32
w33
w11
w21
w31
w12
w22
w32
w31
w32
w33
jz
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Design & Model Detailed Architecture
12
Vector Matrix Multiply Matrix Vector Multiply Outer product Update
Neuron Circuitry Current Conveyor
Based Integrator
Comparator
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Row & Column Driver Circuitry
13
Row Driver Logic
Column Driver Logic
Voltage level shifter (drive
high V transistor with low V)
Array driver pass transistors
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Compare Architectures
14
1024 x1024 = 1M array operations, sum over 1 training cycle, 3 operations:
• Vector Matrix Multiply
• Matrix Vector Multiply
• Outer Product Update
Latency
35 – 800X over SRAM
Energy
430 – 6,900X over SRAM
Area
11 – 20X over SRAM
Used a commercial 14/16 nm PDK ***Requires 100 MΩ on state devices
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Neural Core Energy Analysis
15
Analog
ReRAM
Digital
ReRAM
SRAM
12,010 nJ 10,150 nJ 8,970 nJ
7,520 nJ 5,580 nJ 4,340 nJ
28 nJ 2.7 nJ 1.3 nJ
8 bits In/out
8 bit weights
4 bits In/out
8 bit weights
2 bits In/out
8 bit weights
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
16
Algorithms
Sandia Cross-Sim: Translates device measurements and crossbar circuits to algorithm-level performance
Architecture
Circuits
Devices
Materials
Target Algorithms• Deep Learning• Sparse Coding• Liquid State
Machines
x2
x2
x2
x2
w1,1
w2,1
w3,1
w4,1
w1,2
w2,2
w3,2
w4,2
w2,x
w2,x
w3,x
w4,x
...
...
...
...
Drift-diffusion model of ReRAM band diagram & transport (REOS, Charon)
DFT of model of oxide physics, bands
+++
+-
--
--
-
++
VTE
+
++
+
+
--
+
++
++
+
+
--
--
-
PtTaOxTa
Modified McPAT/CACTI: Model performance and energy requirements
In situ TEM of filament switching: Use DFT model to interpret EELS signature
Sandia’s Xyce Circuit Sim: Simulate crossbar circuits based on our devices
Memristor fabrication and measurements in MESAFab
Multiscale Model of a Neural Training Accelerator
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Numeric
Crossbar
Simulator
Xyce
Crossbar
Circuit Model
Learning
Algorithm
Neural Core
Simulator
Simple Python API:
# Do a matrix vector multiplication
result = neural_core.run_xbar_mvm(vector)
+ - + - + -
333231
232221
131211
www
www
www
Detailed but
slow
Fast but
approximate
Measured
Devices
TaOx –MNIST
Periodic Carry
Single Device
Ideal Numeric
https://cross-sim.sandia.gov
Algorithmic
Performance
Physical
Hardware
Crossbar
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
18
Simple API to model crossbars# ************** set parameters defining the crossbar
params.algorithm_params.weights.sim_type = “XYCE” # Use a XYCE based sim
params.algorithm_params.weights.maximum = 10 # clipping limits
params.algorithm_params.weights.minimum = -10 # clipping limits
params.xyce_parameters.xbar.device.TAHA_A1 = 4e-4 # Xyce Parameters
…
…
# ************** API for running neural operations
# All crossbar details are transparent to the user
# Create a neural_core object that models a crossbar
neural_core = MakeCore(params=params)
neural_core.set_matrix(weights) # set the initial weights
result = neural_core.run_xbar_vmm(vector) # Do a vector matrix multiply
result = neural_core.run_xbar_mvm(vector) # Do the transpose, a matrix vector mult.
neural_core.update_matrix(vector1,vector2) # Do an outer product update
https://cross-sim.sandia.gov
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Go from Measurement to AccuracyMeasured Pulsing ΔG Scatterplot
Cumulative
Probability of ΔG
D/A
D/A
D/A D/A
A/D A/D
+ - + -
A/D
A/D
+ -
+ -
positive
weights
negative
weights
R
R
R
R
R
R
Bus R
R
R
Bus
Neural
Core(s)
Digital
Core
Neural
Core(s)
Digital
Core
Neural
Core(s)
Digital
Core
Neural
Core(s)
Digital
Core
Router
Bus Bus
Bus Bus
TiN
Ta– 50 nm
TaOx – 10 nm
TiN
Fabricate
Device
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
If we need more bits per synapse, use multiple memristors
• Three 10 level ReRAMs could represent 1-1000!
• Adding to the weight requires reading every
ReRAM to account for any carries and serially
programming each ReRAM: VERY EXPENSIVE
Neuron
×100 ×10 ×1 • Use >10 levels to represent a base 10 system
• Ignore carry and program the crossbar in parallel.
• Periodically (once every few hundred cycles) read
the ReRAM and perform the carry
10 levels
represent the
weight
Extra levels
store the
carry
conductance
Multi-ReRAM Synapse: Periodic Carry
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Read and reset every 100 pulses
Do 300,000 small (0.02% of weight range) updates
• net of 1500 positive training pulses
Noise Sigma = 1.4% for single device
• (from )
• Write noise applied during updates and carries
Periodic Carry Compensates for Write Noise
Periodic
Carry
Single
Device
rangerangenoise GGG 1.0
Learn from a 0.5% Signal
1/5
-1/5
1/25 1/125
-1/25 -1/125
Carry
-1
1
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Pulse Number
Weig
ht
Positive
Pulses
Alternating
Pulses
1/5
-1/5
1/25 1/125
-1/25 -1/125
Carry
-1
1
Periodic Carry Mitigates Write Nonlinearity
Write NonlinearityAlternating Pulses Cause Weight Decay
Use center linear range of weights
• Train with 1% signal
• Ideal result is 0.6
Single
Device
Periodic
Carry
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
TaOx Results
A/D and D/A is modeled, Serial operations modeled• When resetting weight, need to adjust pulse size based on current state to compensate for nonlinearity
• When reading a single weight, need to adjust readout range to be smaller (change capacitor on the integrator)
Carry once every
1000 updates
1/4
-1/4
Carry
-1
1 TaOx – File Types
Periodic Carry
Ideal Numeric
Single Device
TaOx –MNIST
Periodic Carry
Single Device
Ideal Numeric
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Li-Ion Synaptic Transistor for Analog Computation (LISTA)
E. J. Fuller, et al, "Li-Ion Synaptic Transistor for Low Power Analog Computing," Advanced
Materials, vol. 29, no. 4, p. 1604310, 2017.
Off by ~ 1%
from Ideal
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Summary
Fundamental O(N) energy scaling advantage
Use CrossSim to co-design materials to algorithms Use periodic carry to overcome noise devices
Need high resistance 10-100 MΩ Devices
Need low write nonlinearities
25
Latency
35 – 800X over SRAM
Energy
430 – 6,900X over SRAM
Area
11 – 20X over SRAM
https://cross-sim.sandia.gov
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Extra Slides
26
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Overcoming the Power Limit
CPU MemoryCommunications Bus
Integrate Processing and Memory
Richard Goering, “Three Die Stack -- A Big Step “Up” for 3D-ICs with TSVs” Cadence blog
w11
w21
w31
w41
w12
w22
w32
w42
w13
w23
w33
w43
w14
w24
w34
w44
V1=x1+-
+-
+-
+-
V2=x2
V3=x3
V4=x4
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
VGI oo
VGI oo
VGI oo
Measure N resistors and determine the total output current with some signal to noise ratio (SNR)*
What is the minimum energy?
fNGVEnergy O
12
Power in each resistor ×number of resistors
Determined by noise and SNR
*we are assuming we need some fixed precision on the output, and don’t need full floating point accuracy
fGTkN
I
ob
4
Noise Thermal 2
2
2
2
I
NISNR o
24 SNRTkEnergy b
If we double the number of resistors, we can double the speed to get the same energy and SNR.
This is because the noise scales as sqrt(N) while the signal scales as N
NGVSNRTk
f o
b
2
2 14
1
The Noise Limited Energy to Read a Crossbar Column is Independent of Crossbar Size
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Experimental Device Non-idealities
Device: Write Variability, Write Nonlinearity, Asymmetry, Read Noise
Circuit: A/D, D/A noise, parasitics
29
G∝
W
Pulse Number (Vwrite=1V, tpulse=1µs)
= Ideal
Variability and Nonlinearity
= Variability Range
= Nonlinear
I∝
G∝
W
Time (Vread=100mV)
I0I0-∆I
I0+∆I
Read Noise
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Combined Effects of Nonidealities
0
90
98Linear Asymmetric, ν = 1
Asymmetric, ν = 5 Symmetric, ν = 5
AccuracyMNIST
Read Noise (σRN) Read Noise (σRN)
Write
No
ise
(σ
WN)
Write
No
ise
(σ
WN)
0
90
98Linear Asymmetric, ν = 1
Asymmetric, ν = 5 Symmetric, ν = 5
AccuracyFile Types
Read Noise (σRN) Read Noise (σRN)
Write
No
ise
(σ
WN)
Write
No
ise
(σ
WN)
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
What are the Neural ReRAM Device Requirements?
Small
Images
Large
Images
File
Types
Read Noise σ (% Range) 3% 5% 9%
Write Noise σ (% Range) 0.3% 0.4% 0.4%
Asymmetric Nonlinearity (ν) 0.1 0.1 0.1
Symmetric Nonlinearity (ν) >20 5 5
Maximum Current 160 nA 13 nA 40 nA
wmin
wmax
We
igh
t (C
on
du
cta
nce
)
Normalized Pulse Number
Positive Pulses Negative Pulses
0 0.5 1 0.5 0
Symmetric Nonlinearity
wmin
wmax
We
igh
t (C
on
du
cta
nce
)
Normalized Pulse Number
Positive Pulses Negative Pulses
0 0.5 1 0.5 0
Asymmetric Nonlinearity
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Full System Simulation
Range Bits
Row Input -1 to 1 8
Col Output -6 to 6 8
Col Input -1 to 1 8
Row Output -4 to 4 8
Row Update -0.01 to 0.01 7
Col Update -1 to 1 5
A/D & D/A Have
Minimal Impact
D/A
D/A
D/A D/A
A/D A/D
- + - +
A/D
A/D
-+
-+
positive
weights
negative
weights
Data set#Training/Test
Examples
Network Size
File Types 4,501 / 900 256×512×9
MNIST 60,000 /10,000 784×300×10
MNIST
Conductance
Wmax
-Wmax
0
0
Gmax
Weight
Gmax/2
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
TaOx Results
A/D and D/A is modeled, serial operations modeled
• When resetting weight, need to adjust pulse size based on current state to
compensate for nonlinearity
• When reading a single weight, need to adjust readout range to be smaller (change
capacitor on the integrator)
Carry once every 1000 updates
for the LSB, and every 2 updates
on others
1/4
-1/4
Carry
-1
1
TaOx – File Types
Periodic Carry
Ideal Numeric
Single Device
TaOx –MNIST
Periodic Carry
Single Device
Ideal Numeric
Update Count (x 10,000)
Dig
it 1
Wei
ght
Dig
it 0
Wei
ght
Weights During Training
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential
Information.”
LISTA ResultsWeight
Configuration
(base 7)
49
-49
-98
98
0
×7
343
-343
-686
686
0
×49
Carry
• Carry once every 1000 updates
• Use a single device per weight and
subtract a reference current
7
-7
-14
14
0
×1
LISTA - MNIST99
98
97
9695
Periodic
CarryIdeal
Numeric
Ideal
w/ A/D
Single
Device
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Neural Core Latency Analysis
35
Analog ReRAM
Min write time of 8 ns vs
1 ns incremental write1.28 µs
Digital ReRAM
All bit precisions
4 bit in/out 2 bit in/out
SRAM
All bit precisions
SRAM transpose
read expensive
0.08 µs 0.054 µs
1335 µs 44 µs
8 bit in/out
x1 x0.06 x0.04
x1040 x35
OPU = Outer Product Update
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”
Neural Core Area Analysis
36
Analog ReRAM
Digital ReRAM
SRAM
x1
x1.8
x11.1
8 bits In/out
8 bit weights
SRAM Array
MAC
Array
Drivers
Array
836k µm2
137k µm2
75k µm2
4 bits In/out
8 bit weights
2 bits In/out
8 bit weights
For the ReRAM, high voltage transistors require 8X area, improving this could give ~2X area savings
ReRAM Array
on logic
“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”