David J. Mountain
Senior Technical Director
Advanced Computing Systems Research Program
Neuromorphic Computing:
Our approach to developing applications
using a new model of computing
Outline
• Background Info
• Mapping applications onto a neuromorphic computer
• Example Applications*
– AES-256 Encryption
– Malware Identification
• Quantitative results
*I will not be using cats, MNIST digits or ImageNet pictures – there are lots of people demonstrating those applications
Neuromorphic Computing
• The integration of algorithms, architectures, and technologies,
informed by neuroscience, to create new computational
approaches.
Image: zmescience.com
Silicon Brain not required
Neural Networks
Feed-Forward Neural Network
[Figure: McCulloch/Pitts single-neuron diagram -- inputs x1, x2, x3 with weights w1, w2, w3 feeding a summing node and an activation function]
Single neuron equation: a multiply-accumulate (MACC) followed by an activation function, y = f(w1·x1 + w2·x2 + w3·x3)
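A minimal sketch of the single-neuron MACC-plus-activation just described; the step activation and the NumPy representation are illustrative choices, not the deck's implementation.

import numpy as np

def neuron(x, w, beta=0.0):
    """Single neuron: multiply-accumulate (MACC) the inputs, then apply a step activation.
    The bias term beta plays the role of the threshold used on the later slides."""
    macc = np.dot(w, x) + beta
    return 1.0 if macc > 0 else 0.0

# Example: three inputs with arbitrary weights.
print(neuron(np.array([1.0, 0.0, 1.0]), np.array([0.5, -0.2, 0.8])))   # 1.0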
• A neuron can be considered a threshold gate and used to
perform logic functions
• Many interconnected threshold gates can perform complex
logic functions
Threshold Gates
[Figure: threshold-gate implementations of basic logic]
AND: inputs x1, x2, x3 with weights 1, 1, 1 and bias β = -2.5 (fires only when all three inputs are 1)
OR: inputs x1, x2, x3 with weights 1, 1, 1 and bias β = -0.5 (fires when any input is 1)
NOT: input x1 with weight -1 and bias β = 0.5 (fires only when the input is 0)
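A small sketch of those gates as threshold neurons, assuming the convention that a neuron fires when its weighted sum plus β is positive; the Python form is illustrative.

def threshold_gate(inputs, weights, beta):
    """Fire (return 1) when the weighted sum of inputs plus the bias beta is positive."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + beta > 0 else 0

def AND3(x1, x2, x3): return threshold_gate([x1, x2, x3], [1, 1, 1], -2.5)
def OR3(x1, x2, x3):  return threshold_gate([x1, x2, x3], [1, 1, 1], -0.5)
def NOT(x1):          return threshold_gate([x1], [-1], 0.5)

assert AND3(1, 1, 1) == 1 and AND3(1, 1, 0) == 0
assert OR3(0, 0, 0) == 0 and OR3(0, 1, 0) == 1
assert NOT(0) == 1 and NOT(1) == 0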
McCulloch & Pitts
Threshold Logic, Neural Nets
A Logical Calculus of the Ideas Immanent in Nervous Activity,
Bulletin of Mathematical Biophysics, 1943
Images: (L) NESFA, Boskone V Conference, 1968 (R) Estate of Francis Bello/ScienceSource via Nautilus/A. Gefter
Frank Rosenblatt
Mark I Perceptron, circa 1960
Images: Hecht-Nielsen, R. Neurocomputing (Reading, Mass.: Addison-Wesley, 1990)
via rutherfordjournal.org.
Different computational primitives
will become the common case:
Majority function example
Digital implementations are relatively inefficient for large numbers of inputs;
MACC-centric design appears to have a large sweet spot
[Figure: 5-input majority gate as a single threshold neuron -- inputs a through e, all weights 1, β = -2.5]
[Chart: Majority Gate -- Throughput per Watt; normalized inputs/sec/Watt vs. number of inputs (0 to 1000), comparing MACC and digital implementations]
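A sketch of the point being made: an n-input majority function is a single MACC with a threshold regardless of n, whereas a digital gate-level implementation grows with the input count. The Python form is illustrative.

def majority(inputs):
    """n-input majority as one threshold neuron: all weights 1, bias -(n/2)."""
    n = len(inputs)
    return 1 if sum(inputs) - n / 2 > 0 else 0

assert majority([1, 1, 1, 0, 0]) == 1   # 3 of 5
assert majority([1, 1, 0, 0, 0]) == 0   # 2 of 5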
Our approach
• Architectures that scale to handle real applications: Ohmic Weave
• Methodologies and algorithms for designing/programming these systems: Loom
• Experience & experiments with applications to guide architectures and methodologies: AES-256, malware triage
Implementing neurons using physics
Image: Stan Williams, HP Labs via arstechnica.com
[Figure: neuron weights implemented as memristor conductances]
Multiply via Ohm's law: I = V/R; with conductance G = 1/R, each input contributes I = V·G.
Accumulate: the per-input currents sum on a shared wire, completing the MACC.
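A minimal numeric sketch of that physics, assuming idealized devices: each weight is a conductance, each input is a voltage, and the output current is the accumulated dot product.

import numpy as np

def crossbar_column_current(voltages, conductances):
    """Idealized analog MACC: I_out = sum_i V_i * G_i (Ohm's law per device,
    currents summed on the shared output wire)."""
    return float(np.dot(voltages, conductances))

# Example: three inputs (volts) against three synaptic conductances (siemens).
print(crossbar_column_current([0.2, 0.0, 0.2], [1e-3, 5e-4, 2e-3]))   # 6e-4 A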
Ohmic Weave: Single Tile
[Figure: single-tile layout -- input drivers, 256x256 memristor crossbar, differential (+/-) comparators, IO port]
256x256 memristor crossbar
128 differential comparators
All inputs and all outputs are sent to a central router
256 axons, 128 neurons, 65,536 synapses
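A hedged behavioral sketch of one tile, under the assumption (suggested by the 256-column / 128-comparator split and the +/- labels in the figure) that each neuron compares the currents of a positive and a negative column; the real circuit details will differ.

import numpy as np

def tile_forward(axon_inputs, g_plus, g_minus):
    """Behavioral model of one tile: 256 axon inputs drive a crossbar whose columns
    are grouped into 128 (+, -) pairs; a comparator fires when the positive column
    collects more current than the negative one."""
    i_plus = axon_inputs @ g_plus     # shape (128,)
    i_minus = axon_inputs @ g_minus   # shape (128,)
    return (i_plus > i_minus).astype(np.uint8)   # 128 neuron outputs

rng = np.random.default_rng(0)
axons = rng.integers(0, 2, size=256).astype(float)     # binary axon inputs
g_pos = rng.uniform(0, 1e-3, size=(256, 128))          # conductances (siemens)
g_neg = rng.uniform(0, 1e-3, size=(256, 128))
print(tile_forward(axons, g_pos, g_neg).shape)          # (128,)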
Ohmic Weave: 64-Tile General Purpose Processor*
64 port router
all-to-all connectivity
16k axons
8k neurons
4M synapses
*56 tera synaptic operations per second per watt (TSOPS/W), 1.1 TSOPS/mm²
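(A consistency check against the per-tile figures: 64 tiles × 256 axons = 16,384 axons, 64 × 128 = 8,192 neurons, and 64 × 65,536 = 4,194,304 ≈ 4M synapses.)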
Tools, Methodologies, Algorithms
• Loom – Ohmic Weave design tool
– Python classes with C and CUDA extensions
– Enables exploration of design trade-offs
• Limited precision weights
• Neural network topologies (layers, neurons per layer)
• Connectivity pruning
• Simulates Ohmic Weave designs on CPUs, GPUs
• Debug with a full view of internal state
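Loom itself is not public, so the following is only a stand-in sketch of the kind of trade-off exploration the bullets above describe: quantizing a weight matrix to limited-precision levels and pruning small connections, then measuring what changed. Function names and thresholds are invented for illustration.

import numpy as np

def quantize_weights(w, levels=16):
    """Map weights onto a small number of evenly spaced levels (limited precision)."""
    lo, hi = w.min(), w.max()
    step = (hi - lo) / (levels - 1)
    return lo + np.round((w - lo) / step) * step

def prune_connections(w, threshold=0.05):
    """Zero out (prune) connections whose magnitude falls below a threshold."""
    return np.where(np.abs(w) < threshold, 0.0, w)

w = np.random.default_rng(1).normal(0, 0.3, size=(256, 128))
w_q = prune_connections(quantize_weights(w, levels=16))
print("fraction pruned:", float((w_q == 0).mean()))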
Methodology: Block Based Design
• Decompose the problem into blocks
– Much like block based CMOS design
– Can pull blocks from a “circuit library”
• Loom can compose blocks into a single larger network
– Will optimize by removing unused neurons and connections
– Compresses to minimum number of layers
– Handles recurrence/loops
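Loom's composition pass is not shown here; the following is only a toy illustration of the block-based idea, where small trained blocks (here, threshold-gate functions) are wired together into one larger network description. All names are invented.

def make_block(name, weights, beta):
    """A 'library block': one threshold neuron with fixed weights and bias."""
    return {"name": name, "weights": weights, "beta": beta}

def eval_block(block, inputs):
    s = sum(w * x for w, x in zip(block["weights"], inputs)) + block["beta"]
    return 1 if s > 0 else 0

def compose(blocks, wiring):
    """Compose blocks into one network: wiring maps each block to the names of the
    signals (primary inputs or other blocks' outputs) that feed it."""
    def run(primary_inputs):
        signals = dict(primary_inputs)
        for block in blocks:                  # assumes blocks are listed in topological order
            ins = [signals[src] for src in wiring[block["name"]]]
            signals[block["name"]] = eval_block(block, ins)
        return signals
    return run

# Example: AND followed by NOT gives NAND, built from two library blocks.
and2 = make_block("and2", [1, 1], -1.5)
inv = make_block("inv", [-1], 0.5)
nand = compose([and2, inv], {"and2": ["a", "b"], "inv": ["and2"]})
print(nand({"a": 1, "b": 1})["inv"])   # 0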
Digital Hierarchical Neural Nets
• Digital functions must be 100% correct
• Divide and conquer by partitioning
– 64 inputs = 2^64 ≈ 1.8 × 10^19 possible training vectors
– 4 × 16 inputs = 4 × 2^16 ≈ 2.6 × 10^5 training vectors
• Reduce the training set size – But train to 100% accuracy
– The logic truth table becomes the training set
– The training data encompasses all possible data
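A worked sketch of the size argument: the exhaustive truth table for one monolithic 64-input function versus four 16-input blocks. The helper names are illustrative.

from itertools import product

def truth_table_size(num_inputs):
    """Number of input vectors needed to cover a logic function exhaustively."""
    return 2 ** num_inputs

print(truth_table_size(64))        # ~1.8e19: infeasible as a training set
print(4 * truth_table_size(16))    # 262,144: four 16-input blocks, easily enumerable

# Enumerating one 16-input block's input space is cheap:
block_inputs = list(product((0, 1), repeat=16))
print(len(block_inputs))           # 65,536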
Training
• Loom can train blocks given a training set or
truth table
– Uses the Concurrent Learning Algorithm*
– Can train for exact logic or for inexact classifiers
*M. McLean, "Concurrent Learning Algorithm and the Importance Map," in Network Science and Cybersecurity, ed. R. Pino, Springer, 2014, vol. 55, pp. 239-250.
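The Concurrent Learning Algorithm itself is described in the reference above and is not reproduced here; the sketch below only illustrates the workflow the slide describes, treating a complete truth table as the training set and iterating until the block is exact, using an ordinary perceptron-style update as a stand-in.

import numpy as np

def train_exact(truth_table, num_inputs, lr=0.1, max_epochs=1000):
    """Train one threshold neuron until it reproduces the truth table exactly.
    truth_table maps input tuples to 0/1; only linearly separable functions converge."""
    w = np.zeros(num_inputs)
    beta = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in truth_table.items():
            x = np.array(x, dtype=float)
            y = 1 if w @ x + beta > 0 else 0
            if y != target:
                w += lr * (target - y) * x
                beta += lr * (target - y)
                errors += 1
        if errors == 0:
            return w, beta      # 100% accuracy on the full truth table
    raise RuntimeError("did not reach 100% accuracy")

# Example: 2-input AND trained from its complete truth table.
and_table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
print(train_exact(and_table, 2))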
Libraries and Composability
• Once trained, a block can be reused
– Train once, use often
• We have a growing library, from simple logic (NANDs, NORs, latches) to more
sophisticated functions (majority gates)
• We have algorithms for composing multiple neural networks
into a single network
AES-256 Encryption
• Advanced Encryption Standard, 256-bit – 128 bits of data encrypted using a 256-bit key
– Algorithm uses 14 rounds of 4 steps each
– Published standard, result must be exact
• Ohmic Weave Implementation – 45 blocks, 21 unique types
• 16 SubBytes blocks, 3 MixCol blocks, 1 control block, 1 mux
• Each SubBytes block is unique because the keys are "baked in"
– 12,500 neurons in 10 layers
– Each block trained using CLA
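A hedged sketch of what "keys baked in" can mean for a SubBytes block: folding the preceding AddRoundKey XOR into the S-box lookup, so each block reduces to a fixed 8-bit-to-8-bit truth table that can be trained exactly. This is an illustrative reading, not necessarily the deck's exact decomposition; the S-box construction below follows the published AES standard.

def gf_mul(a, b):
    """Multiply in GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
        b >>= 1
    return p

def gf_inv(a):
    """Multiplicative inverse in GF(2^8); the S-box maps 0 to 0 here."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):        # a^254 = a^(-1) in GF(2^8)
        result = gf_mul(result, a)
    return result

def sbox(x):
    """Standard AES S-box: GF(2^8) inverse followed by the affine transform."""
    b = gf_inv(x)
    out = 0
    for i in range(8):
        bit = ((b >> i) ^ (b >> ((i + 4) % 8)) ^ (b >> ((i + 5) % 8)) ^
               (b >> ((i + 6) % 8)) ^ (b >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        out |= bit << i
    return out

def keyed_subbytes_table(key_byte):
    """Truth table for AddRoundKey (one key byte) folded into SubBytes."""
    return [sbox(x ^ key_byte) for x in range(256)]

assert sbox(0x00) == 0x63 and sbox(0x53) == 0xED   # standard check values
print(len(keyed_subbytes_table(0xA7)))             # 256-entry truth table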
AES-256 Conceptual Diagram
45 instances of 21 unique blocks
About 12,500 neurons in 10 layers
[Figure: plaintext enters through a state machine; SubBytes blocks (8-bit paths) feed ShiftRows routing to the other columns; MixCol blocks of types A, B, and C (16-bit paths) feed a 32-bit mux that produces the ciphertext]
Application: Malware Detection
• Classifies files as malware (e.g., a virus) or benign
– Scans the file one 6-byte n-gram at a time
– Matches 2000 critical n-grams, noting their presence in a 2000-bit latch
– Uses a neural network classifier to decide whether the pattern in the latch indicates malware
Conceptual Diagram
[Figure: pipeline -- 6-byte n-gram → 48:1536 decoder → 2000-entry pattern matcher → 2000-bit latch → malware detector stage 1 → malware detector stage 2 → malware? decision]
About 4007 instances of 5 unique blocks
About 5800 neurons in 5 layers
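A software-level sketch of that pipeline, with a placeholder critical n-gram list and stand-in classifier weights; in the real design the decoder, pattern matcher, latch, and detector stages are trained neural blocks.

import numpy as np

def ngram_presence(data, critical_ngrams):
    """Return a 0/1 vector: bit i is set if critical_ngrams[i] occurs anywhere in data."""
    index = {ng: i for i, ng in enumerate(critical_ngrams)}
    bits = np.zeros(len(critical_ngrams), dtype=np.uint8)
    for pos in range(len(data) - 5):
        hit = index.get(data[pos:pos + 6])
        if hit is not None:
            bits[hit] = 1
    return bits

def classify(bits, w1, b1, w2, b2):
    """Two-stage threshold classifier standing in for the trained detector blocks."""
    stage1 = (bits @ w1 + b1) > 0
    return bool((stage1 @ w2 + b2) > 0)

# Example with one placeholder "critical" n-gram (a real list would have 2000 entries).
critical = [b"\x4d\x5a\x90\x00\x03\x00"]
bits = ngram_presence(b"\x00\x11" + critical[0] + b"\x00" * 16, critical)
print(int(bits[0]))   # 1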
Mapped to Ohmic Weave using Loom
[Figure: the 64-tile array with all-to-all connectivity through the 64-port router; tiles are assigned to network layers 1 through 5, with the remaining tiles unused]
Malware detection using neural nets:
General purpose Ohmic Weave vs CPU
Function            Area (mm²)    Power (mW)
Row drivers         0.70          77.56
Memristive array    8.15          860.36
Comparators         15.01         38.78
Router              0.43          493.04
Total               24.29         1469.74
CPU: 6-core Intel Core i7-3930K; throughput is 1.92 Gbps
Ohmic Weave requires 12 copies to match that throughput
Numbers shown in the table are totals for 12 copies -- the comparison is performance neutral
14x Improvement in Area
54x Improvement in Power*
*Aggressive implementations of memristor technology and on-chip routing increase the power improvement to ~500x
[Chart: energy efficiency scaling comparison -- improvement factor (log scale, 1x to 1000x) vs. CMOS technology node from 45 nm to 8 nm, with data labels at 54x and 75x]
Roadmap Forward
• Fabricate and characterize circuits
• Continue to characterize memristor crossbars
• Build increasingly mature prototype boards
– Explore on-chip vs. off-chip training
– Validate routing choices
– Provide more realistic power comparisons
• Improve tools and simulation environment
• More applications and comparisons (to FPGA, GPU, ASIC, etc.)
Acknowledgments to the NMC team that
is making all this happen
Chris Krieger
Mark McLean
Josh Prucnal
Doug Palmer
along with a substantial number of
academic, national lab and industry
partners
A first attempt at road mapping NMC
• Strengths
– CMOS compatible
– Room-temperature operation
– Integrates with other approaches (traditional, approximate)
– Self-learning
• Neutral
– Security
– Programmer productivity
• Weaknesses
– Legacy code
– FP-intensive applications
– Serial speed
– Data selection
[Chart: NMC attributes mapped against nanotechnology challenges]
Nanotechnology challenges: efficient STDP/spiking devices, memristors, access devices (monolithic 3D), comparators and on-chip programming, interconnect density, controlled growth of interconnects, and efficient analog communications (or efficient transducers to optical, magnetic, etc.), including sensors.