
Feynman Machine: A Novel Neural Architecture for Cortical and Machine Intelligence

Eric Laukien, Richard Crowder, Fergal Byrne*
[email protected], [email protected], *[email protected]

Ogma Intelligent Systems Corp.
4425 Military Trail, Ste. 209, Jupiter, FL 33458

Abstract

Developments in the study of Nonlinear Dynamical Systems (NDS's) over the past thirty years have allowed access to new understandings of natural and artificial phenomena, yet much of this work remains unknown to the wider scientific community. In particular, the fields of Computational Neuroscience and Machine Learning rely heavily for their theoretical basis on ideas from 19th century Statistical Physics, Linear Algebra, and Statistics, which neglect or average out the important information content of time series signals generated between and within NDS's. In contrast, the Feynman Machine, our model of cortical and machine intelligence, is designed specifically to exploit the computational power of coupled, communicating NDS's. Recent empirical evidence of causal coupling in primate neocortex corresponds closely with our model. A high-performance software implementation has been developed, allowing us to examine the computational properties of this novel Machine Learning framework.

Introduction

Computer-assisted and numerical methods have only recently allowed us to study the properties of Nonlinear Dynamical Systems. Since Lorenz discovered his famous strange attractor (Lorenz 1963), applied mathematicians have steadily discovered an unexpected world of structure in the behaviour and interactions of such systems (Kantz and Schreiber 2004; Strogatz 2014). In particular, techniques based on delay embedding of trajectories on manifolds have been used to identify the true causal structure of numerous natural and artificial phenomena, either in the absence of classical correlation or despite spurious correlations (Sugihara et al. 2012). These techniques rely in the main on the Theorem of Floris Takens (Takens 1981), which proves (given certain assumptions) that a reconstruction of sufficient dimensionality is diffeomorphic (topologically equivalent) to the system generating a time series.

In simple terms, the mere representation, in a computer or brain area, of a vector of lagged values (x_t, x_{t−τ}, x_{t−2τ}, ..., x_{t−kτ}) from the time series x_t will trace out a trajectory in k-dimensional space which is, to all intents and purposes, the same thing as the real NDS which produces the time series, in the sense that forecasts of the futures of the representation and of the real system have the same mathematical properties. Importantly, this information transfer requires no knowledge of the underlying rules governing the evolution of the real system. This notion is extended, in brains and intelligent machines, to the interactions between brain regions and processing modules respectively.
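As a concrete illustration (not taken from the paper), the following NumPy sketch builds such a lagged-coordinate reconstruction from a scalar time series; the function name, the stand-in signal, and the choice of lag tau and dimension k are ours.

```python
import numpy as np

def delay_embed(x, k, tau):
    """Return the k-dimensional delay embedding of a scalar series x:
    each row is (x[t], x[t-tau], ..., x[t-(k-1)*tau]) for t >= (k-1)*tau."""
    t0 = (k - 1) * tau                       # earliest index with a full lag history
    columns = [x[t0 - j * tau: len(x) - j * tau] for j in range(k)]
    return np.stack(columns, axis=1)         # shape: (len(x) - t0, k)

# Example with a stand-in scalar signal; tau and k are free parameters.
x = np.sin(np.linspace(0.0, 40.0 * np.pi, 4000))
trajectory = delay_embed(x, k=3, tau=13)     # points tracing a curve in R^3
```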

The Feynman Machine (Figure 1) is a network or hierarchy of paired encoders and decoders, which operate as NDS's and communicate spatiotemporal sparse distributed representations (SDR's). It is superficially similar to a stacked autoencoder ladder network (Rasmus et al. 2015), but the encoders are designed to incorporate the past history of their activity and thus produce encodings which are predictive of the future evolution of the time series input.

In the hierarchy, inputs (which can be sensory or sensorimotor) are fed in to the bottom layer encoder, which produces a sparse binary SDR to be passed up to the next higher layer, a process which continues up the hierarchy. Each decoder takes the output of its paired encoder (which forms a lateral prediction signal) and combines it with the output of the next higher layer's decoder (a top-down feedback/context signal). At the bottom, the decoder outputs a prediction of the next (sensory or sensorimotor) input in the time series. During learning, the decoder compares its previous prediction with the actual input to its paired encoder, and this error vector is used to train both encoder and decoder.
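As a structural sketch of this data flow (not the OgmaNeo implementation), the upward, learning, and downward passes can be written as follows; the encoder/decoder objects and their encode, decode, and learn methods are assumed interfaces introduced only for illustration.

```python
def hierarchy_step(pairs, x, prev_preds):
    """One timestep of the sparse predictive hierarchy (illustrative only).
    pairs      : list of (encoder, decoder) objects, bottom layer first
    x          : current (sensory or sensorimotor) input vector
    prev_preds : each decoder's prediction from the previous timestep"""
    # Upward pass: each encoder turns its input into a sparse binary SDR,
    # which becomes the input of the next layer up.
    inputs, sdrs, inp = [], [], x
    for enc, _ in pairs:
        inputs.append(inp)                 # what this layer's encoder actually sees
        inp = enc.encode(inp)
        sdrs.append(inp)

    # Learning: each pair is trained on the error between its decoder's
    # previous prediction and the input its encoder has just received.
    for (enc, dec), actual, pred in zip(pairs, inputs, prev_preds):
        error = actual - pred
        enc.learn(error)
        dec.learn(error)

    # Downward pass: each decoder combines its paired encoder's SDR
    # (lateral prediction) with feedback from the decoder above.
    preds, feedback = [None] * len(pairs), None    # the top layer has no feedback
    for i in reversed(range(len(pairs))):
        _, dec = pairs[i]
        preds[i] = dec.decode(sdrs[i], feedback)
        feedback = preds[i]

    # preds[0] is the prediction of the next input in the time series.
    return preds[0], preds
```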

Our model was originally inspired by previous work on a theory of neocortical function based on communicating NDS's (Byrne 2015). There is now convincing evidence from empirical Neuroscience that this process is indeed occurring in primate neocortex (Tajima et al. 2015). Interestingly, the authors of this study used the methods of nonlinear time series analysis to extract the causal structure of interactions among patches of neocortex in the monkey brain. In more recent work (Tajima and Kanai 2017), they have proposed a role for delay embedding dimensions in Integrated Information Theory (Tononi et al. 2016).

The artificial Feynman Machine is described in full detail in (Laukien, Crowder, and Byrne 2016), so here we will briefly describe its most important characteristics in the Methods section. The Results section describes some initial experimental evidence of self-organised unsupervised learning of high-dimensional spatiotemporal structure, in the case of colour video prediction. Our Conclusion includes some comments on ongoing and future work.

Figure 1: The Sparse Predictive Hierarchy Feynman Machine (Laukien, Crowder, and Byrne 2016). [Figure shows a stack of paired encoders and decoders (Encoder 1/Decoder 1 up to Encoder N/Decoder N): encoder outputs pass up the hierarchy and decoder errors pass down; the bottom layer receives the sensory input and current actions and emits a sensory prediction and chosen action; an optional top-down supervising signal enters the top layer.]

Source code for the architecture and experiments detailed here is available for non-commercial use at https://github.com/OgmaCorp.

Methods

We briefly describe the operation of an artificial Feynman Machine, which is a hierarchical architecture (Figure 1) of paired encoders and decoders, each pair forming a spatiotemporal predictive autoencoder module. A variety of encoder designs can be used, as long as they have certain properties: a) they must incorporate past history in their processing, and b) the output must be a nonlinear function of the input. Our experiments indicate that the best-performing encoders are those that produce sparse binary outputs; in this paper we chose the Delay Encoder, described below.

The decoder in each pair takes the top-down feedback from higher layers and the output of its paired encoder, and attempts to reconstruct an output, usually the next input in the time series. We have found that a simple linear decoder (which uses a weight matrix for each of its inputs) is sufficient for sequence learning. The decoder learning algorithm is simple perceptron learning on each weight matrix.
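A minimal NumPy sketch of such a linear decoder is given below; the class and variable names (W_lat, W_fb), the weight initialisation, and the learning rate alpha are our own assumptions, not the paper's specification.

```python
import numpy as np

class LinearDecoder:
    """Linear decoder with one weight matrix per input source and
    perceptron-style learning on the prediction error (illustrative sketch)."""
    def __init__(self, n_lat, n_fb, n_out, alpha=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W_lat = rng.normal(0.0, 0.01, (n_out, n_lat))  # paired encoder's SDR
        self.W_fb = rng.normal(0.0, 0.01, (n_out, n_fb))    # top-down feedback
        self.alpha = alpha

    def decode(self, z, feedback):
        """Predict the next input from the lateral SDR z and the feedback signal."""
        self.z, self.fb = z, feedback
        return self.W_lat @ z + self.W_fb @ feedback

    def learn(self, error):
        """Perceptron-style update of each weight matrix, driven by the error
        between the previous prediction and the input that actually arrived."""
        self.W_lat += self.alpha * np.outer(error, self.z)
        self.W_fb += self.alpha * np.outer(error, self.fb)
```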

An Example Encoder Design: Delay Encoder

The Delay Encoder takes a vector x of (scalar or binary) inputs, passes them through a linear weight matrix W, combines the summed results with a bias b (producing the stimulus vector s), and applies a nonlinearity (a Rectified Linear Unit or ReLU function), giving the activation vector a. So far, this resembles the operation of a standard artificial neural network. The temporal memory is provided by adding a decayed copy of the activations from the previous timestep, a, to the stimulus vector before entering the ReLU (Figure 2). The activation vector is then passed through a k-sparse inhibition stage, producing a sparse binary vector of "firing" units z, which is the output of the encoder. The activations of any firing units are set to zero for use in the next timestep.

Figure 2: Activation diagram of the Delay Encoder. [Figure shows inputs x[1..m] passing through the weight matrix W to form the stimulus s[1..n]; decayed activations from the previous timestep (t-1) are added and the biases b subtracted before the ReLU, giving the activations a[1..n]; a k-sparse stage then selects the firing units to produce the binary output z[1..n], and the activations of firing units are zeroed for the next timestep.]

Outline of Delay Encoder Algorithm

Calculate the stimulus vector s ∈ R^n from the feed-forward weights W ∈ R^(n×m) and input vector x ∈ R^m:

s_t = W x_t    (1)

Update the activation vector a element-wise: preserve only the previous non-active activations, add the stimulus from Eq. (1), subtract the biases, and apply a rectified linear unit function (ReLU):

a_t = max(0, (J_n − z_{t−1}) ⊙ a_{t−1} + s_t − b_{t−1})    (2)

Choose the top k activation elements:

Γ_k = supp_k(a_t)    (3)

Set the appropriate elements to 1 or 0 to generate the encoder output z:

z_{t,i} = δ(i ∈ Γ_k)    (4)

Determine the exponentially decaying traces x̄_t and z̄_t of the inputs x_t and outputs z_t:

x̄_t = λ x̄_{t−1} + x_t    (5)

z̄_t = λ z̄_{t−1} + z_t    (6)
Learning in the Delay Encoder combines a number of previously studied ideas in machine learning. In particular, a form of Real-Time Recurrent Learning, or RTRL (Williams and Zipser 1989), which uses local information to maintain a trace of past activity and inputs, is combined with a form of Spike-Timing Dependent Plasticity, or STDP (Sjöström and Gerstner 2010; Markram, Gerstner, and Sjöström 2012; Gilson, Burkitt, and Van Hemmen 2010), which cross-correlates input and output states and their traces over time.

Each connection in the Delay Encoder computes an STDP function that indicates the preferred temporal direction the connection would like the corresponding cell to take. The STDP values of all connections are averaged together to obtain the total temporal direction the cell should move in order to maximize the number of inputs the cell is predicting. Once this direction is obtained, the cell can be updated to fire either sooner or later, depending on the desired temporal direction. This is done by maintaining another set of per-connection traces, which accumulate inputs and are reset to 0 when the cell emits a spike. The final update to the connection weights is then a product of the desired temporal direction and the eligibility trace for that connection.

Maintain a trace matrix T for previous activity, taking reset into account (derived from RTRL (Williams and Zipser 1989)):

T_t = λ T_{t−1} ⊙ (J_{n,m} − Z_{t−1}) + X_{t−1}    (7)

where Z_ij = z_i, X_ij = x_j and J_ij = 1, for i = 1..n, j = 1..m, when there are n outputs and m inputs. Each element T_ij contains a decaying memory of all inputs x_j since the last time the hidden cell z_i fired; the trace is erased to zero each time cell i fires.

Determine the "importance" Y of the feed-forward weights:

Y_ij = e^{W_ij}    (8)

Calculate the importance-weighted spike-timing dependent plasticity (STDP) value for each connection:

P = Y ⊙ (z_{t−1} ⊗ x̄_t − z̄_{t−1} ⊗ x_{t−1})    (9)

The second factor in Eq. (9) is the STDP matrix, formed as the difference of two outer products. The first outer product matrix, z_{t−1} ⊗ x̄_t, cross-correlates the previous firing pattern with the input trace vector, and the second, z̄_{t−1} ⊗ x_{t−1}, cross-correlates the output traces with the previous input vector.

Determine the temporal direction using an average of the connection STDPs:

d_i = (Σ_j P_ij) / (Σ_j Y_ij)    (10)

Finally, update the weights:

ΔW_ij = α (d_i T_ij)    (11)

To improve stability, normalize W so that each node's weight vector has unit norm:

W_ij ⇐ W_ij / ‖W_i‖₂    (12)

Update the biases:

Δb_t = β (s_{t−1} − b_{t−1})    (13)
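A companion NumPy sketch of the learning update, Eqs. (7)-(13), is given below; the function and argument names are ours, and the assignment of trace arguments follows the description of Eq. (9) above (current input trace, previous output trace).

```python
import numpy as np

def delay_encoder_learn(W, b, T, x_prev, z_prev, s_prev,
                        x_trace, z_trace_prev, alpha, beta, lam):
    """Learning update for the Delay Encoder, following Eqs. (7)-(13).
    x_prev, z_prev, s_prev : input, output and stimulus from the previous step
    x_trace                : current input trace, Eq. (5)
    z_trace_prev           : output trace from the previous step, Eq. (6)
    Shapes: W and T are (n, m); x_* are (m,); z_*, s_prev and b are (n,)."""
    # Eq. (7): per-connection trace, reset on the rows whose cells fired.
    T = lam * T * (1.0 - z_prev[:, None]) + x_prev[None, :]

    # Eq. (8): connection "importance" from the current weights.
    Y = np.exp(W)

    # Eq. (9): importance-weighted STDP matrix from traces and recent activity.
    P = Y * (np.outer(z_prev, x_trace) - np.outer(z_trace_prev, x_prev))

    # Eq. (10): desired temporal direction for each cell.
    d = P.sum(axis=1) / Y.sum(axis=1)

    # Eq. (11): weight update, gated by the per-connection trace.
    W = W + alpha * d[:, None] * T

    # Eq. (12): normalise each cell's weight vector to unit norm.
    W = W / np.linalg.norm(W, axis=1, keepdims=True)

    # Eq. (13): move the biases toward the previous stimulus.
    b = b + beta * (s_prev - b)
    return W, b, T
```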

Source code, documentation, and a white paper on the Delay Encoder algorithms are available at https://github.com/OgmaCorp.

Results

The Feynman Machine is capable of learning to represent the hierarchical spatiotemporal structure of high-velocity, high-dimensional data such as streaming video, and to use the sparse representations at all levels to generate the imagined future of that data. In order to demonstrate this, we trained a system with clips of video of natural and artificial scenes. After fewer than twenty presentations, the hierarchy was capable of replaying an entire clip with excellent subjective fidelity, simply by providing the first few frames and then running the system off its own predictions (Figure 3).

The hierarchy used in this case was 8 layers (encoder-decoder pairs), with 512x512 units at the input layer decreasing to 64x64 units in the top layer, trained on a downsampled video for 16 presentations. We used a desktop PC with a consumer GPU; training time was under 3 minutes. Videos of this and similar experiments can be accessed at https://www.youtube.com/ogmaai.

Figure 3: Video Prediction Experiment. An 8-second HD video sequence (example frame top left) was downsampled and fed as input to the hierarchy 16 times. The network was then fed the beginning of the sequence, and subsequently its predictions were fed back as input, producing a recalled sequence, of which 37 successive 400x300 pixel frames are shown.

We also apply the Feynman Machine architecture to the Noisy Lorenz Attractor as studied in (Hamilton, Berry, and Sauer 2016). The classic Lorenz system (Lorenz 1963) is made stochastic by adding noise terms η to each equation in the system. This dynamic noise has variance σ² = 0.8 in each dimension, causing the resulting time series to jump to nearby trajectories of the attractor (Figure 4, left). A lagged time series vector (x_t, x_{t−τ}, x_{t−2τ}) is produced from the x-coordinates for each time t, resulting in a 3D delay-embedded reconstruction (Figure 4, centre), where the lag τ is 13 timesteps. Each observation x_t is further perturbed by adding a large amount of observation noise, with variance σ² = 20. The resulting observation time series is then fed to the Feynman Machine (a 12-layer hierarchy, each layer having 48x48 units). Starting from a randomly initialised network, the system's predictions converge to within an RMSE of approximately 2.7 of the next true x-coordinate, comparable with results reported in (Hamilton, Berry, and Sauer 2016). This convergence occurs within fewer than 1000 steps of noisy observations.
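To make the data generation concrete, the sketch below produces noisy Lorenz x-observations and their 3D delay embedding; the Euler integrator, the step size dt, and the way the per-step noise enters the discretisation are our own assumptions, since the text above specifies only the noise variances and the lag τ = 13.

```python
import numpy as np

def noisy_lorenz_observations(n_steps, dt=0.01, var_dyn=0.8, var_obs=20.0, tau=13, seed=0):
    """Generate noisy Lorenz x-observations and their 3D delay embedding (illustrative)."""
    rng = np.random.default_rng(seed)
    sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0        # classic Lorenz parameters
    state = np.array([1.0, 1.0, 1.0])
    xs = np.empty(n_steps)
    for t in range(n_steps):
        x, y, z = state
        deriv = np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
        # Dynamic noise with variance 0.8 in each dimension, added at each step (assumed).
        state = state + dt * deriv + rng.normal(0.0, np.sqrt(var_dyn), 3)
        xs[t] = state[0]
    # Observation noise with variance 20 on the recorded x-coordinate.
    obs = xs + rng.normal(0.0, np.sqrt(var_obs), n_steps)
    # Lagged vectors (x_t, x_{t-tau}, x_{t-2*tau}) with tau = 13 timesteps.
    emb = np.stack([obs[2 * tau:], obs[tau:-tau], obs[:-2 * tau]], axis=1)
    return obs, emb

obs, emb = noisy_lorenz_observations(5000)
```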

A recording of this experiment, from which Figure 4 was taken, is available at https://youtu.be/fUHnEzPqCJo and includes a visualisation of the noisy observations. The system runs at c. 60 steps per second on a consumer laptop with a standard GPU.

Figure 4: Noisy Lorenz Attractor (based on the system studied in (Hamilton, Berry, and Sauer 2016)). The signal source (left) is used to reconstruct a delay embedding (centre), and noise is added (not shown). The Feynman Machine forms a prediction (right) based on the noisy observations.

A high-performance library, OgmaNeo, has been built using C++ and OpenCL, and runs on CPU-only or compatible CPU/GPU PCs. Bindings for C++, Python, and Java are publicly available, along with source code for the experiments described here and several others. Source code for both the library and numerous experiments can be accessed at https://github.com/OgmaCorp.

Conclusion

We have described a novel neural architecture which exploits the information communication power of coupled dynamical systems, using a hierarchical network structure inspired by the mesoscale connectome of the mammalian neocortex. Our currently best-performing encoder design, the Delay Encoder, is capable of adaptively learning to model high-velocity, high-dimensional data streams such as natural video sequences, and forms excellent predictions based on its model.

The system has also been tested on a number of other spatiotemporal tasks, as detailed in (Laukien, Crowder, and Byrne 2016).

The artificial Feynman Machine is a very new neural network architecture. While its function is based on solid theory in Dynamical Systems, and its structure is inspired by Neuroscience, the capability, range, and limitations of cognitive systems built on these principles are only beginning to be explored.

Current work is focused on applying the architecture in a Reinforcement Learning context. Initial exercising of an agent using the Feynman Machine as its core learning component has been carried out on the OpenAI Gym environments (OpenAI 2016), with encouraging results. We are also exploring the abilities of the architecture in sequence identification, anomaly detection, and vocalisation.

Future work includes exploration of further opportunities for cross-fertilisation with the fields of Nonlinear Dynamical Systems - in particular, causation analysis and detection (Sugihara et al. 2012), stochastic partial differential equations (Ovchinnikov and Wang 2015; Ovchinnikov et al. 2016), and hardware implementations (Appeltant 2012) - and Neuroscience (Tajima et al. 2015; Hawkins and Ahmad 2015).

References

Appeltant, L. 2012. Reservoir computing based on delay-dynamical systems. Thèse de Doctorat, Vrije Universiteit Brussel / Universitat de les Illes Balears.

Byrne, F. 2015. Symphony from Synapses: Neocortex as a Universal Dynamical Systems Modeller using Hierarchical Temporal Memory. arXiv preprint arXiv:1512.05245.

Gilson, M.; Burkitt, A.; and Van Hemmen, L. J. 2010. STDP in recurrent neuronal networks. Frontiers in Computational Neuroscience 4(23).

Hamilton, F.; Berry, T.; and Sauer, T. 2016. Kalman-Takens filtering in the presence of dynamical noise. arXiv preprint arXiv:1611.05414.

Hawkins, J., and Ahmad, S. 2015. Why neurons have thousands of synapses, a theory of sequence memory in neocortex. arXiv preprint arXiv:1511.00083.

Kantz, H., and Schreiber, T. 2004. Nonlinear Time Series Analysis, volume 7. Cambridge University Press.

Laukien, E.; Crowder, R.; and Byrne, F. 2016. Feynman Machine: The Universal Dynamical Systems Computer. arXiv preprint arXiv:1609.03971.

Lorenz, E. N. 1963. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences 20(2):130-141.

Markram, H.; Gerstner, W.; and Sjöström, P. J. 2012. Spike-timing-dependent plasticity: a comprehensive overview. Frontiers in Synaptic Neuroscience 4(2).

OpenAI. 2016. OpenAI Gym. Online Reinforcement Learning Testbed.

Ovchinnikov, I. V., and Wang, K. L. 2015. Stochastic Dynamics and Combinatorial Optimization. arXiv preprint arXiv:1505.00056.

Ovchinnikov, I. V.; Li, W.; Schwartz, R. N.; Hudson, A. E.; Meier, K.; and Wang, K. L. 2016. Collective neurodynamics: Phase diagram. arXiv preprint arXiv:1609.00001.

Rasmus, A.; Berglund, M.; Honkala, M.; Valpola, H.; and Raiko, T. 2015. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, 3546-3554.

Sjöström, J., and Gerstner, W. 2010. Spike-timing dependent plasticity. Scholarpedia 5(2):1362. Revision 151671.

Strogatz, S. H. 2014. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering. Westview Press.

Sugihara, G.; May, R.; Ye, H.; Hsieh, C.-h.; Deyle, E.; Fogarty, M.; and Munch, S. 2012. Detecting causality in complex ecosystems. Science 338(6106):496-500.

Tajima, S., and Kanai, R. 2017. Integrated information and dimensionality in continuous attractor dynamics. arXiv preprint arXiv:1701.05157.

Tajima, S.; Yanagawa, T.; Fujii, N.; and Toyoizumi, T. 2015. Untangling brain-wide dynamics in consciousness by cross-embedding. PLoS Computational Biology 11(11):e1004537.

Takens, F. 1981. Detecting strange attractors in turbulence. Springer.

Tononi, G.; Boly, M.; Massimini, M.; and Koch, C. 2016. Integrated information theory: from consciousness to its physical substrate. Nature Reviews Neuroscience 17(7):450-461.

Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1(2):270-280.
