TRANSCRIPT

The N3XT Technology for Brain-Inspired Computing
H.-S. Philip Wong
Department of Electrical Engineering, Stanford University
Stanford SystemX Alliance, 2015.04.15
Scale Up Requires Energy Efficiency

Application | Hardware used | Estimated power consumption

Large scale:
- Emulating 4.5% of human brain (10^13 synapses, 10^9 neurons) | Blue Gene/P: 36,864 nodes, 147,456 cores | 2.9 MW (LINPACK)
- Deep sparse autoencoder (10^9 synapses, 10M images) | 1,000 CPUs (16,000 cores) | ~100 kW (cores only)

Small to moderate scale:
- Convolutional neural net (60M synapses, 650K neurons) | 2 GPUs | 1,200 W
- Restricted Boltzmann Machine (28M synapses, 69,888 neurons) | GPU: 550 W; CPU: 65 W
- Processing 1 s of speech with a deep neural network | GPU: 238 W; CPU (4 cores): 80 W

S. B. Eryilmaz et al., IEDM 2015
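To put these rows on a common footing, here is a back-of-the-envelope synapses-per-watt calculation using the figures quoted above; the human-brain reference point (~10^14 synapses at ~20 W) is a commonly cited estimate, not a number from this slide:

```python
# Rough synapses-per-watt comparison from the table above; a sketch using the
# figures as quoted on the slide, not new measurements.
systems = {
    "Blue Gene/P brain emulation": (1e13, 2.9e6),   # (synapses, watts)
    "Deep sparse autoencoder":     (1e9,  1e5),
    "ConvNet on 2 GPUs":           (6e7,  1.2e3),
    "RBM on GPU":                  (2.8e7, 550.0),
}
for name, (synapses, watts) in systems.items():
    print(f"{name:30s} {synapses / watts:10.2e} synapses/W")

# The human brain holds roughly 1e14 synapses on about 20 W (~5e12 synapses/W),
# orders of magnitude beyond every row above; hence the slide's point that
# scaling up requires energy efficiency.
```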
These nanotechnology innovations will have to be developed in close coordination with new computer architectures, and will likely be informed by our growing understanding of the brain—a remarkable, fault-tolerant system that consumes less power than an incandescent light bulb.
Approaches of Neuromorphic Hardware

Two axes organize the landscape: the hardware (conventional CPU/GPU/supercomputers vs. neuromorphic hardware, optionally with analog non-volatile memory synapses) and the models/algorithms (biology-based vs. conventional machine learning).

Biology-based models / algorithms:
- On conventional hardware: brain emulation on BlueGene [7]; HTM [3]
- On neuromorphic hardware: Human Brain Project [20]; TrueNorth [16]; SpiNNaker [19]
- With analog non-volatile memory synapses: Hebbian learning and spike-based ANNs on PCM, RRAM, CBRAM

Conventional ML algorithms:
- On conventional hardware: "Cats on YouTube" ANNs: ConvNets, DNNs, DBNs [10-13]
- With analog non-volatile memory synapses: ANN, RBM, sparse learning on PCM, RRAM
Many of these breakthroughs will require new kinds of nanoscale devices and materials integrated into three-dimensional systems and may take a decade or more to achieve.
N3XT Nanosystems: Computation Immersed in Memory

Memory layers and computing logic layers joined by ultra-dense vertical connections: impossible with today's technologies.
N3XT: Computation Immersed in Memory

Layer stack (silicon compatible), connected by ultra-dense, fine-grained vias (not TSVs), with thermal layers interleaved:
- MRAM: quick access
- 3D resistive RAM: massive storage
- 1D CNFET / 2D FET: compute, RAM access (two layers)
- 1D CNFET / 2D FET: compute, power, clock
Non-Volatile Memory (NVM)

- Metal oxide resistive switching memory (RRAM): a filament of oxygen vacancies forms and dissolves in a metal oxide (TiOx/HfOx) between a TiN top electrode and bottom electrode; devices demonstrated at 25 nm, 12 nm, and 10 nm.
- Phase change memory (PCM): a switching region of phase change material between top and bottom electrodes (with oxide isolation) toggles among fully set (polycrystalline GST), partially reset, and fully reset (amorphous) states.
- Conductive bridge memory (CBRAM): metal atoms (Cu ions) from an active top electrode form a filament through a solid electrolyte to the bottom electrode.

D. Kuzum et al., Nano Lett. 2013; Y. Wu et al., IEDM 2013; A. Calderoni et al., IMW 2014
Non-Volatile Memory (NVM) → Synapse

The same RRAM, PCM, and CBRAM devices shown above can serve as synapses:
- Analog programmable
- Scalable to a few nm
- Stackable in 3D

D. Kuzum et al., Nano Lett. 2013; Y. Wu et al., IEDM 2013; A. Calderoni et al., IMW 2014
Nanoscale Memory as Synaptic Weights

Synaptic updates in the brain are the basis for learning. Requirement: analog resistance change.

Phase change synapse: partial SET and partial RESET pulses control the size and nature of the amorphous region, yielding a 100-step grey scale (1% resolution) in resistance (roughly 10^3 to 10^5 Ω) as a function of pulse number.

D. Kuzum et al., Nano Lett., p. 2179 (2012)
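As an illustration of how such a grey scale is reached in practice, here is a minimal program-and-verify sketch; the PCMCell model, its ~3%-per-pulse step, and the noise level are invented for illustration and are not the procedure of the cited paper:

```python
import numpy as np

rng = np.random.default_rng(0)

class PCMCell:
    """Toy phase change cell: resistance rises with partial RESET (larger
    amorphous region) and falls with partial SET, bounded to the ~1 kOhm to
    100 kOhm window shown on the slide. Step size and noise are assumptions."""
    def __init__(self, r_ohm=1e3):
        self.r = r_ohm
    def partial_reset(self):
        self.r = min(self.r * (1.03 + 0.01 * rng.standard_normal()), 1e5)
    def partial_set(self):
        self.r = max(self.r * (0.97 + 0.01 * rng.standard_normal()), 1e3)

def program(cell, target_ohm, tol=0.02, max_pulses=1000):
    """Pulse-and-verify loop: read the cell, compare to target, nudge up or down."""
    for _ in range(max_pulses):
        if abs(cell.r - target_ohm) <= tol * target_ohm:
            break
        if cell.r < target_ohm:
            cell.partial_reset()
        else:
            cell.partial_set()
    return cell.r

print(program(PCMCell(), target_ohm=2e4))  # lands within ~2% of 20 kOhm
```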
Nanoscale Memory Can Emulate Biological Synaptic Behavior

STDP (spike-timing-dependent plasticity) measured on phase change synapses, plotted as synaptic weight change Δw (%) vs. spike timing Δt (ms):
- Various time constants (τ ≈ 11 to 29.5 ms for potentiation, LTP-1..3; τ ≈ -11.3 to -29 ms for depression, LTD-1..3)
- Various STDP kernels (four distinct Δw vs. Δt shapes demonstrated over a ±50 ms window)
- Weight update saturation: Δw vs. number of pre/post spike pairs, for time constants of 10 to 45 ms

D. Kuzum et al., Nano Lett., p. 2179 (2012)
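For reference, the exponential STDP kernel these measurements emulate can be written in a few lines; the amplitudes below are illustrative, and the time constants are picked from the range shown on the slide:

```python
import numpy as np

def stdp_dw(dt_ms, a_plus=100.0, a_minus=50.0, tau_plus=20.7, tau_minus=18.6):
    """Synaptic weight change dw (%) vs. spike timing dt = t_post - t_pre (ms).

    Pre-before-post (dt > 0) potentiates (LTP); post-before-pre (dt < 0)
    depresses (LTD). Amplitudes are illustrative; the time constants sit in
    the measured range on the slide (roughly 11 to 29.5 ms).
    """
    if dt_ms >= 0:
        return a_plus * np.exp(-dt_ms / tau_plus)    # LTP branch
    return -a_minus * np.exp(dt_ms / tau_minus)      # LTD branch

# Sweep the kernel over the +/-60 ms window plotted on the slide:
for dt in (-50, -20, -5, 5, 20, 50):
    print(f"dt = {dt:+4d} ms -> dw = {stdp_dw(dt):+7.2f} %")
```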
Hyper Dimensional (HD) Computing

- Information is represented in a high-dimensional space (e.g., D = 10,000)
- Variables and values are combined into a "holistic" record using vector algebra: multiplication for binding, addition for bundling
- A composed vector can in turn become a component in further composition
- A holistic record is decoded with (inverse) multiplication
- Approximate results of vector operations are identified with exact ones using content-addressable memory

ENIGMA

Elements of hyperdimensional computing, also known as vector symbolic architectures or holographic reduced representation (Kanerva, Cognitive Computation, 1(2):139-159, 2009)
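A minimal NumPy sketch of the bind/bundle/decode cycle just described; the record fields (NAME, ALICE, AGE, THIRTY) are made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                    # dimensionality, as on the slide

def rand_hv():
    return rng.choice([-1, 1], size=D)        # random bipolar hypervector

def bind(a, b):                               # multiplication: binding
    return a * b

def bundle(*vs):                              # addition: bundling (ties -> 0)
    return np.sign(np.sum(vs, axis=0))

# Compose a "holistic" record of two variable=value pairs:
NAME, ALICE, AGE, THIRTY = (rand_hv() for _ in range(4))
record = bundle(bind(NAME, ALICE), bind(AGE, THIRTY))

# Decode with (inverse) multiplication: bind is its own inverse for bipolar vectors.
noisy = bind(record, NAME)

# Identify the approximate result with an exact stored vector; this is the job
# the slide assigns to content-addressable memory:
codebook = {"ALICE": ALICE, "THIRTY": THIRTY, "NAME": NAME, "AGE": AGE}
print(max(codebook, key=lambda k: int(np.dot(codebook[k], noisy))))  # -> ALICE
```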
HD Computing Layers

- Application: letters / image features / phonemes / DNA sequences...
- Representation: projected into hyper-dimensional space as random vectors of 1k-10k bits (e.g., 011101010101..., 101100100101...)
- Computation: MAP kernels (Multiplication-Addition-Permutation): v1 * v2, sum(v1, v2, ...), perm(v1)
- Inference: measure 'distance' between learned and unseen vectors; the 'closest' stored vector labels the unknown input (recognition / classification / reasoning, etc.)

These layers span algorithms, system and circuit design, and associative memory enabled by novel device technologies.
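A toy end-to-end pass through these four layers, using an invented letter-trigram encoding and two tiny "language" classes purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 10_000
# Representation layer: one random bipolar hypervector per letter.
item = {c: rng.choice([-1, 1], size=D) for c in "abcdefghijklmnopqrstuvwxyz "}

def encode(text):
    """Computation layer: bind rotated letter vectors per trigram
    (perm = np.roll), then bundle all trigrams into one hypervector."""
    tris = [text[i:i + 3] for i in range(len(text) - 2)]
    vecs = [np.roll(item[a], 2) * np.roll(item[b], 1) * item[c]
            for a, b, c in tris]
    return np.sign(np.sum(vecs, axis=0))

# "Training": one bundled profile per class (tiny corpora, purely illustrative).
profiles = {"en": encode("the quick brown fox jumps over the lazy dog"),
            "nl": encode("de snelle bruine vos springt over de luie hond")}

# Inference layer: nearest profile by cosine similarity labels the unseen query.
query = encode("the dog jumps")
best = max(profiles, key=lambda k: np.dot(profiles[k], query) /
           (np.linalg.norm(profiles[k]) * np.linalg.norm(query) + 1e-9))
print(best)  # likely "en"
```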
Hyperdimensional (HD) Computing

Monolithic 3D enables:
- Energy-efficient classification
- Area-efficient HD projection → use RRAM variability & stochastic switching

3D RRAM + low-power access transistors + address decoders → low-power computation with high-density inter-layer vias
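One way to read "use RRAM variability & stochastic switching" is sketched below: weak pulses switch each cell with some probability p, so the array itself stores the random binary matrix used for HD projection. The scheme and the value of p are assumptions for illustration, not measured device behavior:

```python
import numpy as np

rng = np.random.default_rng(3)
D, n_features, p = 10_000, 64, 0.5   # p: assumed per-cell switching probability

# One weak pulse per cell: stochastic switching writes a random binary matrix
# directly into the 3D RRAM array (no separate PRNG hardware needed).
projection = (rng.random((D, n_features)) < p).astype(np.int8)

def project(x):
    """Random projection of a feature vector into a D-bit hypervector."""
    s = projection @ x                          # analog sum along each row
    return (s > np.median(s)).astype(np.int8)   # threshold to dense binary

hv = project(rng.standard_normal(n_features))
print(hv.shape, hv.mean())                      # ~10,000 bits, about half ones
```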
3D Enables In-Memory Computing

[Device cross-section: 4-layer 3D vertical RRAM (Layer 1 through Layer 4) with FinFET bit-line (BL) select; TiN bottom electrodes (BE), TiN/Ti (50 nm) top electrode (TE), TiN (20 nm) plane layers; 50 nm scale bar]

H. Li et al., Symp. VLSI Tech., 2016
MAP Kernels: 3D RRAM Approach

Key HD operations: multiplication, addition, permutation.
- Multiplication in the {-1, 1} system is equivalent to XOR in the {0, 1} system.
- Addition and permutation are likewise mapped onto the 3D RRAM array.

Measured data on 4-layer 3D vertical RRAM (layers L1-L4, 200 ns pulses, VDD/gnd drive):
- Addition: current (µA) vs. addition cycle from 10 to ~100G cycles (symbols: 1-pillar experiment; line: 2-pillar emulated).
- Logic evaluation: resistance vs. evaluation cycle out to ~1T cycles, with measured HRS of 400 kΩ-1 MΩ and LRS of ~10 kΩ; inputs A, B select the pillar address (00-11) while C, D set the evaluated logic values.
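The truth-table equivalence above, together with the other two MAP kernels, in a few lines of NumPy; this is a functional sketch of the operations, not of the RRAM circuit:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.integers(0, 2, size=16)           # binary hypervectors in {0, 1}
b = rng.integers(0, 2, size=16)

# Multiplication: XOR in {0, 1} equals the elementwise product in {-1, +1}
# under the mapping m(x) = 1 - 2x (0 -> +1, 1 -> -1):
assert np.array_equal(1 - 2 * (a ^ b), (1 - 2 * a) * (1 - 2 * b))

# Addition (bundling): majority vote over a set of vectors:
bundled = ((a + b + (a ^ b)) >= 2).astype(np.int8)

# Permutation: a fixed cyclic shift, invertible by shifting back:
permuted = np.roll(a, 1)
assert np.array_equal(np.roll(permuted, -1), a)
```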
3D Integration of Memory and Logic Circuits within the Same Layer & Across Layers

- Logic: Si CMOS
- Neuron circuits, communication, synapses/weights: CNFET / 2D FET + RRAM
- Synapses/weights: 3D RRAM
Nano-Engineered Computing Systems Technology

Aly et al., IEEE Computer, 2015
Collaborators
Gert Cauwenberghs, Siddharth Joshi, Emre Neftci (UC San Diego)
Jinfeng Kang (Peking U)
Chung Lam, SangBum Kim, Matt Brightsky
K.S. Lee, J.M. Shieh, W.K. Yen… (NDL, Taiwan)
Sponsors
Non-Volatile Memory Technology Research Initiative
E2CDA "ENIGMA"
Non-Volatile Memory Technology Research Initiative (NMTRI) @ Stanford University
End of Talk
Questions?
Open Research Questions
1. Functionality → performance/Watt, performance/m² → variability → reliability
2. Scale up (system size), scale down (device size)
3. Role of variability (functionality, performance)
4. Fan-in / fan-out, hierarchical connections, power delivery
5. Low voltage (wire energy ≅ device energy)
6. Stochastic learning behavior → statistical learning rules
7. Meta-plasticity (internal state variables)
8. Timing as an internal variable
9. Learning rules: biological? AI?
10. Algorithm-device co-design
11. Materials/fabrication: monolithic 3D integration is a must, and it MUST be low temperature