TRANSCRIPT

The N3XT Technology for Brain-Inspired Computing
H.-S. Philip Wong
Department of Electrical Engineering, Stanford University
Stanford SystemX Alliance, 2015.04.15
Scale Up Requires Energy Efficiency

Application | Hardware used | Estimated power consumption

Large scale:
- Emulating 4.5% of human brain (10^13 synapses, 10^9 neurons) | Blue Gene/P: 36,864 nodes, 147,456 cores | 2.9 MW (LINPACK)
- Deep sparse autoencoder (10^9 synapses, 10M images) | 1,000 CPUs (16,000 cores) | ~100 kW (cores only)

Small to moderate scale:
- Convolutional neural net (60M synapses, 650K neurons) | 2 GPUs | 1,200 W
- Restricted Boltzmann Machine (28M synapses, 69,888 neurons) | GPU: 550 W; CPU: 65 W
- Processing 1 s of speech with a deep neural network | GPU: 238 W; CPU (4 cores): 80 W

S. B. Eryilmaz et al., IEDM 2015
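To put these rows on a common footing, here is a back-of-the-envelope synapses-per-watt calculation using the figures quoted above; the human-brain reference point (~10^14 synapses at ~20 W) is a commonly cited estimate, not a number from this slide:

```python
# Rough synapses-per-watt comparison from the table above; a sketch using the
# figures as quoted on the slide, not new measurements.
systems = {
    "Blue Gene/P brain emulation": (1e13, 2.9e6),   # (synapses, watts)
    "Deep sparse autoencoder":     (1e9,  1e5),
    "ConvNet on 2 GPUs":           (6e7,  1.2e3),
    "RBM on GPU":                  (2.8e7, 550.0),
}
for name, (synapses, watts) in systems.items():
    print(f"{name:30s} {synapses / watts:10.2e} synapses/W")

# The human brain holds roughly 1e14 synapses on about 20 W (~5e12 synapses/W),
# orders of magnitude beyond every row above; hence the slide's point that
# scaling up requires energy efficiency.
```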
These nanotechnology innovations will have to be developed in close coordination with new computer architectures, and will likely be informed by our growing understanding of the brain—a remarkable, fault-tolerant system that consumes less power than an incandescent light bulb.
Approaches of Neuromorphic Hardware

Two axes organize the landscape: the hardware (conventional CPU/GPU/supercomputers vs. neuromorphic hardware, optionally with analog non-volatile memory synapses) and the models/algorithms (biology-based vs. conventional machine learning).

Biology-based models / algorithms:
- On conventional hardware: brain emulation on BlueGene [7]; HTM [3]
- On neuromorphic hardware: Human Brain Project [20]; TrueNorth [16]; SpiNNaker [19]
- With analog non-volatile memory synapses: Hebbian learning and spike-based ANNs on PCM, RRAM, CBRAM

Conventional ML algorithms:
- On conventional hardware: "Cats on YouTube" ANNs: ConvNets, DNNs, DBNs [10-13]
- With analog non-volatile memory synapses: ANN, RBM, sparse learning on PCM, RRAM
Many of these breakthroughs will require new kinds of nanoscale devices and materials integrated into three-dimensional systems and may take a decade or more to achieve.
N3XT Nanosystems: Computation Immersed in Memory

Memory layers and computing logic layers joined by ultra-dense vertical connections: impossible with today's technologies.
N3XT: Computation Immersed in Memory

Layer stack (silicon compatible), connected by ultra-dense, fine-grained vias (not TSVs), with thermal layers interleaved:
- MRAM: quick access
- 3D resistive RAM: massive storage
- 1D CNFET / 2D FET: compute, RAM access (two layers)
- 1D CNFET / 2D FET: compute, power, clock
Non-Volatile Memory (NVM)

- Metal oxide resistive switching memory (RRAM): a filament of oxygen vacancies forms and dissolves in a metal oxide (TiOx/HfOx) between a TiN top electrode and bottom electrode; devices demonstrated at 25 nm, 12 nm, and 10 nm.
- Phase change memory (PCM): a switching region of phase change material between top and bottom electrodes (with oxide isolation) toggles among fully set (polycrystalline GST), partially reset, and fully reset (amorphous) states.
- Conductive bridge memory (CBRAM): metal atoms (Cu ions) from an active top electrode form a filament through a solid electrolyte to the bottom electrode.

D. Kuzum et al., Nano Lett. 2013; Y. Wu et al., IEDM 2013; A. Calderoni et al., IMW 2014
Non-Volatile Memory (NVM) → Synapse

The same RRAM, PCM, and CBRAM devices shown above can serve as synapses:
- Analog programmable
- Scalable to a few nm
- Stackable in 3D

D. Kuzum et al., Nano Lett. 2013; Y. Wu et al., IEDM 2013; A. Calderoni et al., IMW 2014
Nanoscale Memory as Synaptic Weights

Synaptic updates in the brain are the basis for learning. Requirement: analog resistance change.

Phase change synapse: partial SET and partial RESET pulses control the size and nature of the amorphous region, yielding a 100-step grey scale (1% resolution) in resistance (roughly 10^3 to 10^5 Ω) as a function of pulse number.

D. Kuzum et al., Nano Lett., p. 2179 (2012)
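As an illustration of how such a grey scale is reached in practice, here is a minimal program-and-verify sketch; the PCMCell model, its ~3%-per-pulse step, and the noise level are invented for illustration and are not the procedure of the cited paper:

```python
import numpy as np

rng = np.random.default_rng(0)

class PCMCell:
    """Toy phase change cell: resistance rises with partial RESET (larger
    amorphous region) and falls with partial SET, bounded to the ~1 kOhm to
    100 kOhm window shown on the slide. Step size and noise are assumptions."""
    def __init__(self, r_ohm=1e3):
        self.r = r_ohm
    def partial_reset(self):
        self.r = min(self.r * (1.03 + 0.01 * rng.standard_normal()), 1e5)
    def partial_set(self):
        self.r = max(self.r * (0.97 + 0.01 * rng.standard_normal()), 1e3)

def program(cell, target_ohm, tol=0.02, max_pulses=1000):
    """Pulse-and-verify loop: read the cell, compare to target, nudge up or down."""
    for _ in range(max_pulses):
        if abs(cell.r - target_ohm) <= tol * target_ohm:
            break
        if cell.r < target_ohm:
            cell.partial_reset()
        else:
            cell.partial_set()
    return cell.r

print(program(PCMCell(), target_ohm=2e4))  # lands within ~2% of 20 kOhm
```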
Nanoscale Memory Can Emulate Biological Synaptic Behavior

STDP (spike-timing-dependent plasticity) measured on phase change synapses, plotted as synaptic weight change Δw (%) vs. spike timing Δt (ms):
- Various time constants (τ ≈ 11 to 29.5 ms for potentiation, LTP-1..3; τ ≈ -11.3 to -29 ms for depression, LTD-1..3)
- Various STDP kernels (four distinct Δw vs. Δt shapes demonstrated over a ±50 ms window)
- Weight update saturation: Δw vs. number of pre/post spike pairs, for time constants of 10 to 45 ms

D. Kuzum et al., Nano Lett., p. 2179 (2012)
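For reference, the exponential STDP kernel these measurements emulate can be written in a few lines; the amplitudes below are illustrative, and the time constants are picked from the range shown on the slide:

```python
import numpy as np

def stdp_dw(dt_ms, a_plus=100.0, a_minus=50.0, tau_plus=20.7, tau_minus=18.6):
    """Synaptic weight change dw (%) vs. spike timing dt = t_post - t_pre (ms).

    Pre-before-post (dt > 0) potentiates (LTP); post-before-pre (dt < 0)
    depresses (LTD). Amplitudes are illustrative; the time constants sit in
    the measured range on the slide (roughly 11 to 29.5 ms).
    """
    if dt_ms >= 0:
        return a_plus * np.exp(-dt_ms / tau_plus)    # LTP branch
    return -a_minus * np.exp(dt_ms / tau_minus)      # LTD branch

# Sweep the kernel over the +/-60 ms window plotted on the slide:
for dt in (-50, -20, -5, 5, 20, 50):
    print(f"dt = {dt:+4d} ms -> dw = {stdp_dw(dt):+7.2f} %")
```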
Hyper Dimensional (HD) Computing

- Information is represented in a high-dimensional space (e.g., D = 10,000)
- Variables and values are combined into a "holistic" record using vector algebra: multiplication for binding, addition for bundling
- A composed vector can in turn become a component in further composition
- A holistic record is decoded with (inverse) multiplication
- Approximate results of vector operations are identified with exact ones using content-addressable memory

ENIGMA

Elements of hyperdimensional computing, also known as vector symbolic architectures or holographic reduced representation (Kanerva, Cognitive Computation, 1(2):139-159, 2009)
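A minimal NumPy sketch of the bind/bundle/decode cycle just described; the record fields (NAME, ALICE, AGE, THIRTY) are made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                    # dimensionality, as on the slide

def rand_hv():
    return rng.choice([-1, 1], size=D)        # random bipolar hypervector

def bind(a, b):                               # multiplication: binding
    return a * b

def bundle(*vs):                              # addition: bundling (ties -> 0)
    return np.sign(np.sum(vs, axis=0))

# Compose a "holistic" record of two variable=value pairs:
NAME, ALICE, AGE, THIRTY = (rand_hv() for _ in range(4))
record = bundle(bind(NAME, ALICE), bind(AGE, THIRTY))

# Decode with (inverse) multiplication: bind is its own inverse for bipolar vectors.
noisy = bind(record, NAME)

# Identify the approximate result with an exact stored vector; this is the job
# the slide assigns to content-addressable memory:
codebook = {"ALICE": ALICE, "THIRTY": THIRTY, "NAME": NAME, "AGE": AGE}
print(max(codebook, key=lambda k: int(np.dot(codebook[k], noisy))))  # -> ALICE
```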
HD Computing Layers

- Application: letters / image features / phonemes / DNA sequences...
- Representation: projected into hyper-dimensional space as random vectors of 1k-10k bits (e.g., 011101010101..., 101100100101...)
- Computation: MAP kernels (Multiplication-Addition-Permutation): v1 * v2, sum(v1, v2, ...), perm(v1)
- Inference: measure 'distance' between learned and unseen vectors; the 'closest' stored vector labels the unknown input (recognition / classification / reasoning, etc.)

These layers span algorithms, system and circuit design, and associative memory enabled by novel device technologies.
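A toy end-to-end pass through these four layers, using an invented letter-trigram encoding and two tiny "language" classes purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 10_000
# Representation layer: one random bipolar hypervector per letter.
item = {c: rng.choice([-1, 1], size=D) for c in "abcdefghijklmnopqrstuvwxyz "}

def encode(text):
    """Computation layer: bind rotated letter vectors per trigram
    (perm = np.roll), then bundle all trigrams into one hypervector."""
    tris = [text[i:i + 3] for i in range(len(text) - 2)]
    vecs = [np.roll(item[a], 2) * np.roll(item[b], 1) * item[c]
            for a, b, c in tris]
    return np.sign(np.sum(vecs, axis=0))

# "Training": one bundled profile per class (tiny corpora, purely illustrative).
profiles = {"en": encode("the quick brown fox jumps over the lazy dog"),
            "nl": encode("de snelle bruine vos springt over de luie hond")}

# Inference layer: nearest profile by cosine similarity labels the unseen query.
query = encode("the dog jumps")
best = max(profiles, key=lambda k: np.dot(profiles[k], query) /
           (np.linalg.norm(profiles[k]) * np.linalg.norm(query) + 1e-9))
print(best)  # likely "en"
```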
Hyperdimensional (HD) Computing

Monolithic 3D enables:
- Energy-efficient classification
- Area-efficient HD projection → use RRAM variability & stochastic switching

3D RRAM + low-power access transistors + address decoders → low-power computation with high-density inter-layer vias
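One way to read "use RRAM variability & stochastic switching" is sketched below: weak pulses switch each cell with some probability p, so the array itself stores the random binary matrix used for HD projection. The scheme and the value of p are assumptions for illustration, not measured device behavior:

```python
import numpy as np

rng = np.random.default_rng(3)
D, n_features, p = 10_000, 64, 0.5   # p: assumed per-cell switching probability

# One weak pulse per cell: stochastic switching writes a random binary matrix
# directly into the 3D RRAM array (no separate PRNG hardware needed).
projection = (rng.random((D, n_features)) < p).astype(np.int8)

def project(x):
    """Random projection of a feature vector into a D-bit hypervector."""
    s = projection @ x                          # analog sum along each row
    return (s > np.median(s)).astype(np.int8)   # threshold to dense binary

hv = project(rng.standard_normal(n_features))
print(hv.shape, hv.mean())                      # ~10,000 bits, about half ones
```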
3D Enables In-Memory Computing

[Device cross-section: 4-layer 3D vertical RRAM (Layer 1 through Layer 4) with FinFET bit-line (BL) select; TiN bottom electrodes (BE), TiN/Ti (50 nm) top electrode (TE), TiN (20 nm) plane layers; 50 nm scale bar]

H. Li et al., Symp. VLSI Tech., 2016
MAP Kernels: 3D RRAM Approach

Key HD operations: multiplication, addition, permutation.
- Multiplication in the {-1, 1} system is equivalent to XOR in the {0, 1} system.
- Addition and permutation are likewise mapped onto the 3D RRAM array.

Measured data on 4-layer 3D vertical RRAM (layers L1-L4, 200 ns pulses, VDD/gnd drive):
- Addition: current (µA) vs. addition cycle from 10 to ~100G cycles (symbols: 1-pillar experiment; line: 2-pillar emulated).
- Logic evaluation: resistance vs. evaluation cycle out to ~1T cycles, with measured HRS of 400 kΩ-1 MΩ and LRS of ~10 kΩ; inputs A, B select the pillar address (00-11) while C, D set the evaluated logic values.
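The truth-table equivalence above, together with the other two MAP kernels, in a few lines of NumPy; this is a functional sketch of the operations, not of the RRAM circuit:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.integers(0, 2, size=16)           # binary hypervectors in {0, 1}
b = rng.integers(0, 2, size=16)

# Multiplication: XOR in {0, 1} equals the elementwise product in {-1, +1}
# under the mapping m(x) = 1 - 2x (0 -> +1, 1 -> -1):
assert np.array_equal(1 - 2 * (a ^ b), (1 - 2 * a) * (1 - 2 * b))

# Addition (bundling): majority vote over a set of vectors:
bundled = ((a + b + (a ^ b)) >= 2).astype(np.int8)

# Permutation: a fixed cyclic shift, invertible by shifting back:
permuted = np.roll(a, 1)
assert np.array_equal(np.roll(permuted, -1), a)
```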
3D Integration of Memory and Logic Circuits within the Same Layer & Across Layers

- Logic: Si CMOS
- Neuron circuits, communication, synapses/weights: CNFET / 2D FET + RRAM
- Synapses/weights: 3D RRAM
Nano-Engineered Computing Systems Technology

Aly et al., IEEE Computer, 2015
Collaborators
Gert Cauwenberghs, Siddharth Joshi, Emre Neftci (UC San Diego)
Jinfeng Kang (Peking U)
Chung Lam, SangBum Kim, Matt Brightsky
K.S. Lee, J.M. Shieh, W.K. Yen… (NDL, Taiwan)
Sponsors
Non-Volatile Memory Technology Research Initiative
E2CDA "ENIGMA"
Non-Volatile Memory Technology Research Initiative (NMTRI) @ Stanford University
End of Talk
Questions?
Open Research Questions
1. Functionality → performance/Watt, performance/m² → variability → reliability
2. Scale up (system size), scale down (device size)
3. Role of variability (functionality, performance)
4. Fan-in / fan-out, hierarchical connections, power delivery
5. Low voltage (wire energy ≅ device energy)
6. Stochastic learning behavior → statistical learning rules
7. Meta-plasticity (internal state variables)
8. Timing as an internal variable
9. Learning rules: biological? AI?
10. Algorithm-device co-design
11. Materials/fabrication: monolithic 3D integration is a must, and it MUST be low temperature