Continual learning in humans and neuroscience-inspired AI
Lucas Weber,∗ Elia Bruni,† and Dieuwke Hupkes‡
University of Amsterdam
(Dated: June 28, 2018)
Abstract
The field of Artificial Intelligence (AI) research is more prosperous than ever. However, current
research and applications still aim to optimise single-task performance rather than algorithms
that generalize over multiple tasks and reuse prior knowledge for new challenges. This
makes current systems data-hungry, computationally expensive and inflexible. A main obstacle
on the way towards more flexible, generalizing algorithms is the phenomenon of catastrophic
forgetting in connectionist networks. While the generalizing abilities of artificial neural networks
suffer under catastrophic forgetting, biological neural networks seem to be relatively unaffected by it. In
this literature review we aim to understand which mechanisms implemented in the human
nervous system overcome catastrophic forgetting, and review to what extent these mechanisms are
already realized in AI systems. Our review is guided by Marr's levels of analysis
and comes to the conclusion that an integration of the partial solutions already realized in AI may
be able to overcome catastrophic forgetting more completely than prior solutions.
Keywords: neuroscience, interdisciplinary, catastrophic forgetting, connectionist networks
∗ [email protected]† [email protected]‡ [email protected]
CONTENTS
1. Introduction
2. Machine learning literature on catastrophic forgetting
3. Neuroscientific literature on catastrophic forgetting
3.1. Neuroscientific framework
3.2. Neuroscientific Theory and Evidence
3.2.1. Complementary Learning Systems Theory
3.2.2. Selective Constraints of Neuroplasticity
3.2.3. Neurogenesis within the hippocampus
4. Integration of neuroscientific insight into machine learning
4.1. Using complementary learning systems: from the DQN-model to deep generative replay
4.2. Constraining weight plasticity within the network
5. Discussion
References
1. INTRODUCTION
Reports in mass media and popular science in recent years come thick and fast with delineations
of how artificial intelligence (AI) research is breaking through over and over again,
making the lay reader expect the machine revolution just around the next corner. As
usual, news reports on the current developments in science are tremendously exaggerated:
neither is any company currently working on the creation of Skynet, nor will your boss
be fired tomorrow because machine learning made their job obsolete. However, the current
enthusiasm is not completely unfounded, since state-of-the-art algorithms have made great leaps
in their capabilities in recent years, surpassing human-level performance in important
tasks.
This success is mainly carried by so-called artificial neural networks (ANNs). ANNs are simplified
computational models of biological brains comprising graph-like structures. Nodes
of artificial neural networks correspond to biological neurons and are organized layer-wise.
From layer to layer the network calculates non-linear transformations, by weighting the output
from the previous layer(s) and applying a non-linearity (e.g. by setting negative values
to zero, Glorot et al. 2011). Especially popular are deep neural networks (DNNs),
which stack greater numbers of neural layers on top of each other, making up large
architectures with several million optimizable parameters. Examples of benchmark-shifting
neural network architectures come, amongst others, from the domains of computer vision (e.g.
AlexNet, Krizhevsky et al. 2012) and control of artificial environments (e.g. DQN, Mnih et al.
2015). In other domains, like speech recognition (Graves et al. 2013; Hinton et al. 2012),
machine learning has made great progress, even though it does not yet beat human expertise.
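The layer-wise computation described above can be sketched as a minimal forward pass. This is a hedged illustration only: the two-layer network, its sizes and its random weights are our own invention, not an architecture from the literature.

```python
import numpy as np

def relu(x):
    # non-linearity: set negative values to zero (cf. Glorot et al., 2011)
    return np.maximum(0.0, x)

def forward(x, weights):
    # each layer weights the previous layer's output and applies a non-linearity
    for W in weights:
        x = relu(W @ x)
    return x

rng = np.random.default_rng(0)
# hypothetical two-layer network: 3 inputs -> 4 hidden units -> 2 outputs
layers = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
out = forward(np.array([1.0, -0.5, 2.0]), layers)
print(out.shape)  # (2,)
```

A DNN simply stacks many more such layers, each adding optimizable parameters to the list of weight matrices.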
Advances in these areas are especially impressive, since these domains of learning largely
depend on unstructured data, which traditionally posed the more difficult form of data for
algorithms to learn from. As opposed to structured data, where every datapoint has a specific
inherent meaning and is organized in a predefined manner (e.g. lists of housing prices
mapped to the number of rooms in a house), in learning from unstructured data conclusions
have to be drawn from datapoints which only get their meaning through their context. As
an illustrative example of unstructured data we can consider multiple pixels in a picture
making up a pattern that looks like a dog. While a single brown pixel on the tip of the dog's
nose has no meaning without its surroundings, the pattern of multiple pixels together
is meaningful. Another example from speech recognition are sound frequencies that
have to be combined in certain patterns to make up comprehensible speech. While humans
are normally extraordinarily good at finding regularities in and making sense of this kind
of data, AI was traditionally better at inferring from structured information. The advance
of machines in these domains is reason for excitement, since most data in the world is of
an unstructured nature, giving AI a greater scope of possible application. On top of that, it may
make interactions with humans more intuitive when the data utilized by both
agents becomes more coherent: computer speech recognition and computer vision might
play a crucial role in handling increasingly complex technology, as opposed to classical, highly
structured interfaces.
However, under closer inspection, these recent advances may lose some of their glory.
Looking for the reasons for this success, one finds that it is, to a not insignificant
part, based on two factors: the availability of large amounts of data and computational
power. The increased distribution and usage of mobile information technology (statista.com
2018a,b) and the accompanying surge in produced digital data (Kitchin 2014) made it
possible to build large-scale on-line databases (e.g. ImageNet, Jia Deng et al. 2009). These
databases provide researchers with millions of labeled training examples, making it possible
to train larger architectures that approximate more complex functions (e.g. by building
deeper networks, Cabestany et al. 2005). Training these large architectures with excessive
amounts of data comes at a great computational cost, which brings us to the second reason
for the current surge in machine learning: Moore's law (Moore 1965) still holds and
results in a previously unmatched amount of computational power available to researchers. The
combination of an explosion in available data and computational power enabled the training of
larger and larger models, resulting in better and better performances.
An instance of this development is the previously mentioned AlexNet (Krizhevsky et al. 2012),
which caught the AI community by surprise by winning the ImageNet large scale visual
recognition challenge (ILSVRC) in 2012 (Russakovsky et al. 2015), almost halving the
error rate of all its competitors in the process. While the underlying technique (convolutional neural
networks [CNNs]) had been widely known in the AI community since the introduction
of LeNet by Lecun & Bengio in 1995, the refinement of this technique, and especially its
usage in a computationally expensive DNN trained on millions of labeled images, made this
success possible. To illustrate the dynamic that the development of increasing architecture size
has taken, we briefly mention state-of-the-art architectures like deep residual networks (ResNets)
(He et al. 2015), which comprise up to 152 layers of convolutional computations.
If increasing network size works so well, why would we want to change anything about it?
Why do we contradict the media's reports about the rise of machine intelligence? There are multiple
ways to answer this question.
First of all, while still holding at the moment, Moore's law is projected to end in the
near future (see e.g. Kumar 2012; Waldrop 2016): engineers are approaching the physical
limitations of possible transistor size, which may decelerate the growth in computational
efficiency of processing units in the near future. With its dependence on ever-growing amounts
of available computational power, the approach of building bigger and bigger systems to
yield better performances is then likely to decelerate the advance of AI as well. If the field
of AI does not want to rely on a revolution in electrical engineering and in the way we do
computations within the near future, it is well advised to avoid getting too heavily invested
in developing brute-force systems (e.g. deep ResNets). The optimization objective during the
development of new architectures should not only be the decrease of error rates, but also
how data- and compute-efficiently they obtain their results.
Second, the current approach leads to algorithms whose capabilities are limited to the
very specific domain they are trained for. Within their domain they are, without complete
retraining of the system (Lake et al. 2016), unable to adapt to new environments or
changes in task demands. Even though DNNs have solved the problem of finding patterns
within a task, finding regularities on a larger scope (between different tasks) is still a mostly
unsolved issue. This reveals a lack of self-organized transfer learning and larger-scale generalization.
These, however, are attributes necessary to achieve what psychologists term
general intelligence or g-factor (Spearman 1904), the ultimate goal in the creation of AI.
In psychometrics, g describes the positive correlation of performances of an individual across
different cognitive tasks (a transfer of g to machine intelligence is given by Legg & Hutter
2007). This generalization of cognitive abilities over multiple domains and tasks, central
to psychologists' definition of intelligence for over a century, is missing in current
AI. On the contrary, the current brute-force, data-hungry models perform extraordinarily
well, but are restricted to their predefined, very narrow domain. Stating that today's
systems are intelligent is therefore, per definitionem, false.
However, building a system comprising real general intelligence is an immensely complex
task. Luckily, researchers are able to draw inspiration from the most sophisticated cognitive
processor currently known: the human brain. Emerging from selective pressure over
thousands of years (Wynn 1988), the human mind is the most adaptive and productive
computational agent we know. To illustrate the extraordinary capabilities of human intelligence
compared to contemporary AI, we would like to cite Lake et al. (2016), who
give a comprehensible example concerning control in Atari 2600 video games: when both
humans without noteworthy experience in one of the games and a contemporary deep reinforcement
learning algorithm (DQN) (Mnih et al. 2013) learn to play Atari video games,
their learning curves differ tremendously. When the DQN is trained on an equivalent of 924
hours of unique playing time, and additionally revisits these 924 hours of playing eight
times, it still only reaches 19% of the performance of a human player who played the very same
game for 2 hours. This illustrates how much more efficiently humans make use of
the data they are given. Even though subsequent, enhanced variants of the same algorithm
(DQN+ and DQN++, Wang et al. 2015) were able to achieve up to 83% and even 98% of
the human player's performance, their learning curves are still far from being as steep. This
is especially significant in the early phase of learning: while humans demonstrate particularly
large performance gains in the initial phase of learning, DQN++ needs more time to show
improvements. After being trained for only two hours, like its human competitor, DQN++
only reaches 3.5% of human performance.
How is this possible? As explained above, g, as it is found in humans, requires the ability to generalize
and transfer knowledge from prior tasks and apply it to new challenges posed by the environment.
Most machine learning agents currently lack this ability.
To be able to transfer knowledge from one domain to another, a cognitive agent needs to be
able to learn sequentially from a myriad of different experiences over a lifetime and integrate
their commonalities. This sequential learning task appears very natural to humans, but it
is in fact of great difficulty for other cognitive agents. Learning opportunities often appear
unanticipated, only briefly, and temporally separated from each other. To nevertheless
make sense of this apparent mess of inputs, Lake et al. (2016) formulated guiding principles
that are likely to be central to how humans learn. They highlight
three principles necessary for efficient, generalizing sequential learning: (1) compositionality,
(2) learning-to-learn and (3) causality. We will briefly introduce these principles
here and relate them to the core issue of this paper, the obstacle of catastrophic forgetting
in sequentially learning systems, which we will explain in more depth in part 2.
(1) Compositionality is the idea that concepts are built out of more primitive building
blocks. While these concepts can be decomposed into their elements, they can themselves
also serve as building blocks for even more complex concepts, which can then be recombined
again, and so on. As an anecdotal illustration we can consider computer programming:
basic functions can be combined to build more complex functions, which in turn can
be recombined to make up even more sophisticated functions. In this way functions stack up
from machine code to high-level programming languages and sophisticated computer programs.
To be able to reach this high level of complexity, the lower-level concepts need to be of general
form and be shared among as many higher-level concepts as possible.
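The programming analogy above can be made concrete with a small sketch. The primitives and the `compose` helper below are hypothetical, invented purely for illustration: simple functions are the building blocks, and composed functions become new, reusable concepts.

```python
# Primitive building blocks (hypothetical examples)
def double(x):
    return 2 * x

def increment(x):
    return x + 1

def compose(*fs):
    # Build a more complex concept out of simpler ones;
    # the result can itself be composed again.
    def composed(x):
        for f in reversed(fs):
            x = f(x)
        return x
    return composed

double_then_increment = compose(increment, double)  # increment(double(x))
increment_then_double = compose(double, increment)  # double(increment(x))

print(double_then_increment(3))  # 7
print(increment_then_double(3))  # 8
```

Note that `double` is shared between both higher-level concepts, mirroring the requirement that lower-level concepts be general and widely reusable.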
Compositionality naturally connects to (2) learning-to-learn, first introduced by Harlow
(1949): when humans are confronted with situations that go beyond the data they have
encountered so far, they are able to infer, based on previously learned concepts, what is most
reasonable in the new situation and thereby deal with the new circumstances. Since
concepts are often (partially) shared between different tasks, learning will go faster and
require less data. While being very similar to transfer learning, an idea already very popular
in current AI, learning-to-learn places greater emphasis on being based in the aforementioned
compositionality. Transfer learning describes that parts of learned concepts are taken and
utilized to solve other tasks; it usually takes place when two similar tasks are trained in
a row. Transfer learning is already partly realized in deep learning CNNs, through feature
sharing between tasks, but only on a very small scale. Learning-to-learn, on the other hand,
is defined somewhat differently. It intends to take transfer learning to a higher, more human-like
level, by not only extracting shareable features and letting them loosely coexist next to each
other, but also relating these features (or concepts) to each other in a causal way.
This leads us to Lake's third principle. (3) Causality refers to knowledge about how the
observed data comes about. Systems that feature causality therefore not only concentrate
on the final product, but also on the process by which it is created. In general, causal models
are generative (as opposed to purely discriminative models). Causality gives generative models
the possibility to grasp how concepts relate to each other, making generative models that
embrace causality usually better at capturing regularities in the environment. Causality
comprises knowledge of state-to-state transitions in the environment and therefore goes
naturally hand in hand with sequential learning. When knowledge about state-to-state transitions
is learned as well, the system is able to relate concepts to each other and determine
how they usually interact. By inverting the idea of causality, a cognitive agent can infer and
reason about the causes of its current situation.
While making these points, Lake et al. (2016) refer to their implementation of the very same
ideas in Lake et al. (2015). Their generative model (called Bayesian Program Learning
[BPL]) recognizes and categorizes different characters from different alphabets by combining
a set of primitives according to the previously mentioned principles. Doing so, it is able to reach
super-human performance in one-shot learning of new character concepts, demonstrating
its ability to learn new concepts from sparse data with little computational power.
While promoting important ideas and yielding promising results, BPL has the problem that
it needs a lot of top-down, knowledge-based hand-crafting to obtain its impressive performance.
Further, its task is limited to a very specific, simple domain. This is the opposite
of the greater generalization abilities it is intended to promote. This
hand-crafting includes providing the generative model with the primitives it can
use to create its more complex characters. While it is practicable to provide a generative
model with appropriate primitives for a relatively simple character recognition task, it
becomes more difficult to do so for models learning more complicated functions. McClelland
et al. (2010) named this problem more eloquently as the need for 'considerable initial
knowledge about the hypothesis-space, space of possible concepts and structures for related
concepts' that is inherent to generative, probabilistic models. The idea of top-down design
of model architecture becomes less feasible when we aim for a more general AI agent. The
principal ideas, compositionality, causality and learning-to-learn, however, are to be considered
fundamental to building more intelligent AI systems. The implementation, though, has to
take another route: it has to be driven by an emergent approach that is able to add the
needed complexity to the model. The previously mentioned ANN architectures offer the needed
emergent complexity. One way to harness top-down ideas while sticking to an emergent,
connectionist framework is to create modular architectures, holding top-down inspiration in
the functionality of the modules and their interactions, while keeping the benefits of emerging
complex structure within the single components (Marblestone et al. 2016).
An example of the value of approaches integrating emergent structure with top-down guiding
knowledge is the so-called long short-term memory (LSTM) (Hochreiter & Schmidhuber
1997), which has enjoyed ever-growing popularity since its introduction. LSTMs are related
to the function of human working memory (Baddeley & Hitch 1974). Similar to human
working memory, LSTMs can hold small portions of information that will be needed later on,
by providing a temporary memory buffer that can store, retrieve or erase its contents
as needed. This modular addition to classical recurrent neural network architectures (RNNs)
allows a great increase in performance on sequential-behavior tasks that rely on the use of
information over a larger number of timesteps. Building on this idea, subsequent algorithms
(e.g. memory networks, Weston et al. 2014) are constructed even closer to the biological
archetype (e.g. by dividing memory and control functions), yielding even better performance
without adding excessive amounts of trainable parameters, by refining the modular structure
of the architecture.
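The store/retrieve/erase behaviour of the LSTM memory buffer can be sketched in a few lines. This is a minimal single-cell step following the standard gating scheme; the sizes, random weights and toy input sequence are our own assumptions, and bias terms are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W):
    z = np.concatenate([x, h])     # input plus previous hidden state
    f = sigmoid(W["f"] @ z)        # forget gate: erase buffer contents
    i = sigmoid(W["i"] @ z)        # input gate: store new contents
    o = sigmoid(W["o"] @ z)        # output gate: retrieve contents
    g = np.tanh(W["g"] @ z)        # candidate memory
    c_new = f * c + i * g          # updated memory buffer
    h_new = o * np.tanh(c_new)     # exposed hidden state
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.standard_normal((n_hid, n_in + n_hid)) * 0.1 for k in "fiog"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(5):                 # process a short toy input sequence
    h, c = lstm_step(rng.standard_normal(n_in), h, c, W)
print(h.shape)  # (4,)
```

The memory cell `c` is carried across timesteps, which is what lets the network use information over a larger number of timesteps than a plain RNN.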
In the transfer of Lake et al.'s (2016) principles to the proposed modular, emergent approach,
however, lies an old, well-known problem. To learn sequentially with the prospect
of achieving true intelligence, one needs to avoid the common problem of catastrophic
forgetting. A system that forgets catastrophically will not be able to utilize Lake's principles
of learning-to-learn, compositionality and causality. However, catastrophic forgetting is
inherent to classical emergent connectionist approaches.
In the following part 2 of this review we will explain catastrophic forgetting in more detail
and expand on early ideas in the AI community to resolve the problem. In the subsequent
part 3, we will look at how humans handle the problem of catastrophic forgetting. Since we
have already reasoned that human cognitive agents are able to learn sequentially and implement
Lake's principles, we should be able to find mechanisms by which catastrophic forgetting is
prevented in biological neural networks. While doing so, we will localize the ideas we find on
Marr's (1982) levels of analysis, to make it easier to put them into context. Thereafter, we will
present evidence from the respective disciplines that is likely to be useful in the construction
of new AI architectures. In part 4 we will present machine learning algorithms in
which those ideas are already implemented. In part 5 we will discuss how the previously
mentioned ideas might be integrated.
2. MACHINE LEARNING LITERATURE ON CATASTROPHIC FORGETTING
The phenomenon of catastrophic forgetting, also known as catastrophic interference, was
first brought up by McCloskey & Cohen (1989). It describes the interference of a newly
learned task with previously learned tasks in classical, sequentially trained connectionist networks.
The reason why these networks are prone to interference is that, when a task A is learned by the network,
information regarding this task is not saved locally, but in a distributed manner, spread
over many nodes in the network (see parallel distributed processing [PDP], Rumelhart et al.
1986). When a second task B is trained afterwards, the network will use the very same
connections to learn task B that were beforehand used to memorize task A. Training
on task B thereby overwrites the pattern for task A within the weight distribution.
What happens when the knowledge representations of two tasks interfere with each other
is easiest to understand when we consider the learned solution to a task in weight space.
Weight space is a multidimensional space in which every parameter of the network
represents one dimension; it represents all combinations of weight values that
a network can possibly adopt. What happens in weight space when we train the network on
two different tasks (A and B), one after the other? While being trained on task A, the weight
distribution of the system will slowly migrate through weight space and finally converge on a
weight combination that solves task A satisfactorily. When the network is subsequently
trained on the second task B, the weight distribution will migrate through weight space
towards a solution of task B. During this second training phase it neglects previously learned
information, veering away from the solution of task A and thereby causing
catastrophic forgetting. To prevent this from happening, the network needs to find a solution
within weight space for task B that also poses a solution to task A. Since networks are
usually overparameterized, it is very likely that there are multiple points in weight space
(certain weight combinations) that yield overlapping solutions for both tasks (Kirkpatrick
et al. 2017). If the learning algorithm can be constrained such that it finds a solution
residing in such an overlap area, task B can be learned without interfering with task A.
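The migration through weight space described above can be demonstrated with a minimal sketch: a hypothetical single-layer (linear) network trained with plain gradient descent, first on task A and then on task B. The two regression tasks and all parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(w_true):
    # hypothetical linear regression task: y = w_true . x
    X = rng.standard_normal((100, 2))
    return X, X @ w_true

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def train(w, X, y, lr=0.1, steps=200):
    # unconstrained gradient descent: the weights migrate freely
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

task_a = make_task(np.array([1.0, -2.0]))
task_b = make_task(np.array([-3.0, 0.5]))

w = np.zeros(2)
w = train(w, *task_a)            # converge on a solution for task A
loss_a_before = mse(w, *task_a)  # near zero
w = train(w, *task_b)            # then train on task B
loss_a_after = mse(w, *task_a)   # task A is catastrophically forgotten

print(loss_a_before, loss_a_after)
```

Because nothing constrains the second training phase, the weights veer away from task A's solution towards task B's, and the error on task A rises sharply; a constraint keeping the weights in an overlap region (were one to exist for these tasks) would prevent this.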
Catastrophic forgetting is an extreme case of the stability-plasticity dilemma (Carpenter
& Grossberg 1987). The stability-plasticity dilemma is concerned with how the global
rate of learning (or plasticity) in a network influences the stability of distributed
knowledge representations. In a parallel distributed system a certain amount of plasticity
Figure 1: In this example a two-weight system has found a solution for task A. Subsequently it is trained on a
second task B. During training the system is unconstrained by its prior knowledge and therefore neglects
task A while migrating through weight space towards a solution of task B. This results in catastrophic
forgetting of task A. To not forget catastrophically, the system needs to migrate towards the overlapping
area at the top.
is necessary to integrate new knowledge into the system: when plasticity is too low, a
so-called entrenchment effect can be observed, in which the rate of change of the
connection weights is too low to cause any noteworthy adaptation when the network is confronted with
new information. While new information will not erase prior knowledge, the network is
no longer able to adapt to new information either. However, if there is too much plasticity,
prior knowledge will constantly be overwritten and catastrophic forgetting will occur. Thus,
an optimal learner has to keep its connections partially plastic to be able to integrate new
knowledge, while at the same time constraining plasticity selectively so as not to overwrite prior
knowledge.
Catastrophic forgetting poses a substantial problem for the sequential learning of neural
networks, and thereby for the development of continually learning and generalizing systems.
This is why, over the years, different solutions to the problem have been proposed. One of the
first suggested solutions came from French (1992). He argues that reducing
overlap between different representations is key to avoiding catastrophic forgetting. Almost
all subsequent solutions follow this line of thinking. French introduced the technique of
node 'sharpening', in which activations of nodes that are already high are increased and
activations of nodes that are already low are decreased, making the activation pattern for a certain
representation more sharply separated and thereby disentangling it from the activation
patterns of other representations. He called the outcome 'semi-distributed representations'.
While his approach was partly successful in reducing catastrophic forgetting, it may reduce
the ability of the network to generalize. Generalization relies on the emergence of more abstract
features that contribute to different tasks and not only to a single one. Since it is
harder for the network to change prior knowledge representations when sharpening
is applied, it won't create abstract features; instead, it is more likely to find solutions
for different tasks separately from each other and store them in parallel in different parts of
the network. In another approach, Brousse & Smolensky (1989) and McRae & Hetherington
(1993) stated that humans do not learn from scratch, but can base new representations
on large pretrained networks. They describe that when tasks are highly internally structured
(like e.g. speaking a language), new data samples will have the same regularities as previous
data and therefore not elicit drastically different activations and changes in weights. However,
according to French (1994) and Sharkey & Sharkey (1995), this idea suffers from an inability
to generalize as well: according to Brousse & Smolensky (1989), only highly internally
structured tasks should be learned sequentially, meaning that new data must have the same
regularities as previous data. Differing tasks will most likely have different internal structure.
Therefore, generalizing knowledge from one domain to another, as is necessary for general
intelligence, won't be possible.
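French's sharpening idea can be sketched as follows. This is a minimal, hypothetical version, not French's exact procedure: the k most active hidden units are pushed towards 1 and all others towards 0 by a sharpening factor, and the parameter names are our own.

```python
import numpy as np

def sharpen(activations, k=2, factor=0.5):
    # Push the k highest activations up and all others down,
    # yielding a more sharply separated, 'semi-distributed'
    # activation pattern (after French, 1992).
    sharpened = activations.copy()
    top = np.argsort(activations)[-k:]
    for i in range(len(activations)):
        target = 1.0 if i in top else 0.0
        sharpened[i] += factor * (target - activations[i])
    return sharpened

h = np.array([0.9, 0.1, 0.6, 0.3])
print(sharpen(h))  # [0.95 0.05 0.8  0.15]
```

Because each representation now concentrates on fewer, more extreme units, overlap between representations shrinks, which reduces interference but, as noted above, also discourages shared abstract features.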
3. NEUROSCIENTIFIC LITERATURE ON CATASTROPHIC FORGETTING
In the previous part we depicted the problem that catastrophic forgetting poses for
AI systems seeking to become better sequential learners, to utilize causality, compositionality
and learning-to-learn, and thereby to become truly generalizing, intelligent agents. Almost
thirty years have passed since the first description of the problem, but it has still not been
resolved by the AI research community. We also depicted the early attempts to resolve
the issue, which mostly sought a solution through dexterous mathematical insight,
but were not able to completely sort out the problem, or brought other issues
with them (e.g. loss of the ability to discriminate properly in pretraining). Considering this
apparent persistence of catastrophic forgetting despite prevailing practice, broadening
our perspective to other disciplines (studying human information processing) for inspiration,
as suggested in the introduction, might be a good idea. In the following paragraphs we will
carve out the most promising fields of human-related research in which to find a solution, and
present the available evidence relating to the problem of catastrophic forgetting.
3.1. Neuroscientific framework
So, how do humans prevent catastrophic forgetting in their biological neural networks?
To be able to harness scientific insight about this question, and perhaps finally answer
it, we need to consider not only the discipline of neuroscience in its broadest sense, but
also theory from the cognitive sciences. Together, they span the field of Brain and Cognitive
Science, a strongly interdisciplinary endeavor. Ranging from loose cognitive-psychological
theory to deterministic molecular-biological mechanisms, the field is not straightforward to
comprehend from an outside view. To make it more palatable, we will present Marr's (1982)
influential framework of levels of analysis, which guides the discipline up to this day.
Based on Marr's levels it becomes easier to organize research in a meaningful
way and to appreciate the idea that there is most likely not a single approach, but multiple approaches
that conjointly may give us a satisfying and complete solution.
Marr introduces three levels of analysis: computational, algorithmic and implementational.
He describes the computational level as the level at which we state the problem we would like
to address, while providing no answer on how to solve it. In Brain and Cognitive Sciences it
is best described by the discipline of cognitive psychology/science. It offers modularized
cognitive concepts whose interconnections are loose and unformalized, providing only few
concrete mechanisms by which they come about and interact. It helps us state relevant
questions (e.g. what kind of modules do humans need to store short-term memories, manipulate
them and integrate them into prior knowledge?). We mostly obtain knowledge on this
Figure 2: The three levels of Marr, (a) computational, (b) algorithmic and (c) implementational, are
exemplified by the process of human vision on the right hand side. Every lower level is a realization of its
higher levels. Every higher level can be realized in different ways on lower levels. In our example of vision
the algorithm at the algorithmic level can not only be realized in vivo in biological neural networks, but
also in silico using artificial neural networks.
level through highly controlled, quantified psychological experiments and deductive reasoning.
Approaches that are solely inspired by the computational level of analysis, and not
informed by the other two, are top-down oriented, like Lake et al.'s BPL mentioned in the
introduction.
The algorithmic level helps to find solutions to the problems stated on the computational
level, holding the concrete mechanisms by which they may be solved. This level provides
the bridge between computation and implementation. It corresponds to the discipline
of cognitive neuroscience, which locates cognitive concepts within different brain areas
and thereby matches them to neural substrate. Researchers do this by utilizing evidence
from e.g. neuroanatomy, neuroimaging or electrophysiology. By finding correlations
between the use of certain cognitive resources and brain activity, the algorithmic level
builds a bridge between ideas about high-level concepts like cognition and the underlying
neural 'hardware'.
On the implementational level we define how the previously mentioned mechanisms or
algorithms are realized, i.e. the physical substrate that the mechanisms are performed on.
This physical substrate may be in silico, through transistors on a microchip, or in vivo,
through populations of neurons and their interactions. While some substrates may be more
suitable than others, in principle all theory from the higher levels may be implemented in
multiple different substrates. With regard to humans, this level is best represented by
molecular/behavioral neuroscience, which shows us how exactly single neurons function and how
they can be affected through different kinds of stimulation (e.g. neuromodulation or in vivo
electrical stimulation). Evolved from ideas on the implementational level of analysis, emergent
connectionist networks are a very successful account in state-of-the-art AI (e.g. in the form of
plain feedforward networks and recurrent networks).
Thus it appears that in the Brain and Cognitive Sciences we have the same approaches to
knowledge acquisition as we have modeling approaches in AI: a top-down, knowledge-guided
approach represented by Cognitive Science/Psychology and a bottom-up, emergent
approach represented by molecular neurobiology. Both approaches try to inform an
intermediate level of understanding. This intermediate level yields the mechanisms by
which neural and cognitive processing works.
In Marr's framework, every new level should be considered a realization of its predecessor,
meaning for example that the algorithmic level realizes the problem stated on the
computational level. It is worth mentioning that this does not mean that insights
from lower-level research cannot inform higher-level theories (for example, the discovery
of grid cells in the human cortex changed the way we think about spatial memory and
cognition [Moser et al., 2008]). Choosing the right level of analysis to conduct research has
been a controversial subject for a long time. The protracted debate about the supposed
superiority of one approach over another has seen no single winner. The opposite is the
case: holding on to a single framework has not proven fruitful, and it is by now a wide
consensus that a complete theory should be informed by all levels of analysis. As
systems in artificial intelligence grow more and more sophisticated, this idea will become
increasingly important in that discipline as well. For illustrative purposes, we would like to
give two brief examples of contemporary modeling approaches that ignore this notion.
Our first example is Bayesian Programme Learning (BPL), which we already mentioned
in the introduction. BPL is an algorithm inspired by cognitive science only and
thereby resides on the computational level. It categorizes handwritten characters
from different alphabets by combining a set of primitives. These primitives are possible
pen strokes that, when combined, make up a character. However, these primitives are fairly
simple and few in number, which is why it is straightforward to provide an appropriate
hypothesis space (set of primitives) for the Bayesian inference. As soon as the task becomes
more complex, the primitives will have to become more abstract and greater in number.
Finding such an appropriate hypothesis space and hand-crafting it into the system is not
trivial. Therefore a purely top-down oriented approach is not able to capture complexity as
in human information processing.
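To make the role of the hypothesis space concrete, the following toy sketch scores candidate stroke compositions by prior times likelihood, in the spirit of (but far simpler than) Lake et al.'s model. The primitives 'l' (line) and 'c' (curve) and the noise model are invented purely for illustration:

```python
def posterior(observed, hypotheses, prior, likelihood):
    # Bayesian scoring: weight each candidate composition of primitives
    # by prior * likelihood, then normalize to obtain a posterior
    scores = {h: prior[h] * likelihood(observed, h) for h in hypotheses}
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

# toy noise model: the observed stroke sequence matches the true one 90% of the time
lik = lambda obs, h: 0.9 if obs == h else 0.1
post = posterior("lc", ["lc", "cc"], {"lc": 0.5, "cc": 0.5}, lik)
```

Hand-specifying the hypothesis list and priors is trivial here; as soon as the primitives become numerous and abstract, exactly this step stops being tractable.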
On the other hand, even though they were successful in the past and still are, purely
emergent connectionist networks guided by insights from the implementational level, like
standard feedforward networks, pose the problems stated in the introduction: they lack
the ability to generalize broadly and to learn sequentially. Focusing on the implementational
level can produce more powerful networks that perform extraordinarily well on single
tasks, but such networks will most likely not be able to learn these tasks flexibly, adapt
to changing task requirements or transfer knowledge from one domain to another. Thus it
will not satisfy our quest for real general intelligence. A demonstration of this is how the
idea of pure feedforward neural networks has recently been led ad absurdum by creating
incredibly deep networks (e.g. ResNet-152). These networks may yield better performances,
but need an unreasonably large amount of training, consuming computational power and data
on an exaggerated scale, and are only possible by using clever hacks in the network
architecture (skip connections in the case of ResNet). From a biological perspective these
models lack all plausibility. When considering human vision, which is commonly modeled by
these kinds of networks, such an excessive number of layers in the feedforward sweep of
processing would lead to exaggerated perceptual delays in humans, since stage-to-stage
processing time in the ventral visual stream is approximated at 10 ms per neural population
(Panzeri et al. 2001). A biologically implemented ResNet would therefore be no match for
human object recognition (which is estimated at 120 ms) in terms of efficiency. It thus
seems unnecessary to maintain large numbers of expensive computational layers to reach
sufficient performance for object recognition. In accordance with this, Serre et al. (2007)
actually suggest that the depth of the human ventral visual pathway may be estimated at
only 10 processing stages.
An algorithm providing a generally intelligent solution should therefore be informed by all
levels of analysis and in doing so provide flexibility paired with complexity. We will keep
this in mind and see how it may also apply to a solution for the problem of catastrophic
forgetting.
3.2. Neuroscientific Theory and Evidence
So far we have laid out the problem of catastrophic forgetting in machine learning and
isolated the levels of description on which we are searching for a solution in neuroscience.
Now we will look at evidence from the brain and cognitive sciences that we might profit
from. Interestingly, even though humans do suffer from retroactive interference of newly
acquired information with older memories (Barnes & Underwood 1959), this interference is
never catastrophic. Since artificial neural networks are thought to function in the same
basic way as biological networks, the human nervous system must have implemented
countermeasures against this sequential learning problem that have not been adopted by its
highly simplified artificial counterparts. There are different ideas about what these
countermeasures might be.
3.2.1. Complementary Learning Systems Theory
The first idea we want to present here is a cognitive neuroscientific theory, which is
informed by molecular neuroscience as well as cognitive scientific ideas. Rooted in ideas of
Marr (1970, 1971) and Tulving (1985), the so-called complementary learning systems theory
(CLS) was first formalized by McClelland et al. (1995). CLS might give an account of how
catastrophic forgetting is avoided in humans. The theory proposes that human learning
functions via two separate memory systems. The first system is the neocortex, which, as
the name implies, is a very recent evolutionary acquisition shared only among mammals. It
is responsible for all higher cognitive functions of mammals (Lodato & Arlotta 2015). To
achieve the complexity of higher cognitive functions, the neocortex has to integrate ambiguous
information over long time spans. It does this by slowly estimating the statistics of the
individual's environment. The functionality of modern deep ANNs is mainly inspired by the
neocortex. Due to their similarity in architecture, the neocortex and deep ANNs share a large
set of properties, like their large capacity and their slow, statistical way of learning. Since
the neocortex is a statistical learner, it integrates general knowledge (i.e. semantic
knowledge) about the world that is no longer connected to the specific learning experience (i.e.
it stores no episodic memory).
The storage of episodic information is achieved by the second of the two memory systems in
CLS theory. It is located in the medial temporal lobe structure of the hippocampus. This
system is thought to be a fast learner with very limited storage capacity. The hippocampus'
main objective is to store episodic memories and preprocess them for later integration into
the statistically learning neocortex. To be able to store specific events, the hippocampus
has to orthogonalize incoming activation patterns to make them distinct from previous
experience, a process called pattern separation. Further, the hippocampus extracts
regularities from these distinct experiences and then trains the neocortex in an interleaved
fashion. This interleaved memory replay helps the neocortex in the learning process by
reactivating cortical connections central to the memory. Replay is essential, since the slowly
learning neocortex will hardly learn from a single exposure to an experience. Additionally,
interleaving different memories and replaying them can also be a means to prevent catastrophic
forgetting in the neocortex (McCloskey & Cohen 1989). This is similar to optimizing multiple
tasks in parallel during the training of ANNs, where interleaving examples of different tasks
can also help to overcome catastrophic forgetting. In support of the assumption that the
hippocampus' interleaved replay is important to avoid catastrophic forgetting, McClelland et al.
(1995) argue that in lower mammals that lack the hippocampal-neocortical division
(and thus do not have complementary learning systems), catastrophic forgetting might
actually take place. It is still an open question whether that is actually the case (French 1999).
As mentioned before, the architecture and functionality of current ANNs can be seen as
analogous to the human neocortex. The second memory system of the hippocampus has no
counterpart in most current AI systems. Since it seems to be important for preventing
catastrophic forgetting, however, it might be a worthwhile additional module. To better
understand how such a module might work, we should consider the inner dynamics of the
hippocampus in more detail (see also Figure 3).
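The interleaved training regime mentioned above can be sketched in a few lines: each pass over the data revisits old-task examples mixed with new ones, rather than presenting the new task in isolation. The helper below is a hypothetical illustration, not a model of hippocampal replay:

```python
import random

def interleaved_stream(old_task, new_task, seed=0):
    # mix examples from the consolidated (old) task with the new task,
    # so that every training pass revisits old knowledge instead of
    # letting the new task alone overwrite the weights
    mixed = list(old_task) + list(new_task)
    random.Random(seed).shuffle(mixed)
    return mixed

stream = interleaved_stream([("old", i) for i in range(3)],
                            [("new", i) for i in range(3)])
```

Training on such a mixed stream is what makes parallel multi-task optimization immune to the forgetting that purely sequential presentation causes.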
The hippocampus circuitry consists mainly of the trisynaptic pathway or loop (TSP) and
the monosynaptic pathway (MSP). The TSP is made up of the entorhinal cortex (ERC),
dentate gyrus (DG), CA3 and CA1. These neural populations are connected through forward
connections, while CA3 has additional recurrent autoconnections. The TSP is responsible
for the encoding of new information and pattern separation (the orthogonalization of single
experiences) (Schapiro et al. 2016). The encoded and orthogonalized information is then stored
Figure 3: The entorhinal cortex (EC) serves as both input and output module for the episodic memory
buffer of the hippocampus. The trisynaptic pathway (green) comprises the dentate gyrus (DG) and cornu
ammonis 3 and 1 (CA3, CA1). The DG orthogonalizes the input from the EC so that it can be stored in
CA3 without overlap with prior experiences. The monosynaptic pathway (red) comprises CA1 and the
EC (output). CA1 extracts statistical regularities from the episodic memory buffer in CA3,
which is necessary to subsequently train the slow, statistically learning neocortex.
in CA3 (Tulving, 1985). This episodic memory buffer is similar in mechanism to Hopfield
networks (Wiskott et al. 2006; see Amit 1989 for an introduction to Hopfield networks as a
neural circuit model).
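To illustrate the analogy, a minimal Hopfield network stores a pattern of +/-1 units with the Hebbian outer-product rule and retrieves it from a corrupted cue, much as an attractor-based buffer would complete a partial episode. This is a textbook sketch in the sense of Amit (1989), not a model of CA3:

```python
def train_hopfield(patterns):
    # Hebbian outer-product rule: w[i][j] accumulates the correlation
    # between units i and j across all stored patterns (no self-connections)
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / len(patterns)
    return w

def recall(w, state, steps=5):
    # synchronously update all +/-1 units until the network settles
    # into the stored attractor nearest to the (possibly noisy) cue
    n = len(state)
    s = list(state)
    for _ in range(steps):
        s = [1 if sum(w[i][j] * s[j] for j in range(n)) >= 0 else -1
             for i in range(n)]
    return s

stored = [1, -1, 1, -1, 1, -1]
w = train_hopfield([stored])
noisy = [1, -1, 1, -1, 1, 1]  # cue with one unit flipped
```

Pattern completion from a degraded cue is exactly the property that makes such attractor dynamics a plausible mechanism for an episodic buffer.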
The MSP, on the other hand, consisting of the ERC and CA1, is trained by CA3 in a statistical
manner, similar to the neocortex in CLS theory. The generalized knowledge representation
in the MSP is then used to train the neocortex via repetitive, interleaved memory replay.
Replay takes place predominantly during low-activity phases (e.g. during slow-wave sleep,
Stickgold 2005).
So far CLS theory seems like a reasonable and parsimonious solution to our problem. However,
there are three reasons that make it unlikely that the hippocampal-neocortical division,
and with it episodic memory replay, is the only mechanism contributing to the prevention of
catastrophic forgetting in humans. Firstly, there are no known cases of catastrophic
forgetting in higher mammals, even though lesion studies in animal models, or case studies of
humans with lesions due to strokes in the hippocampus, should lead to conditions similar to
catastrophic forgetting in neural network models: the biological brain would no longer be able
to interleave its new learning experiences with older ones. A lesioned hippocampus, however,
leads to a related but different condition: medial temporal lobe amnesia (MTL amnesia)
(Squire et al. 2004; Squire et al. 1991). In MTL amnesia, individuals suffer from a loss of
episodic memory, what we described above as orthogonal, pattern-separated memories stored in
the CA3 of the hippocampus. At the same time, patients have relatively unimpaired generalized
semantic memory (Race et al. 2013). Since their general semantic memory is unimpaired, they
do not seem to suffer from catastrophic forgetting. This appears to speak against a central
role of complementary learning systems in the prevention of catastrophic forgetting in humans,
because the lack or malfunction of the hippocampus would leave the neocortex exposed to
new experiences that are not interleaved with prior knowledge. However, one might argue
that with the lack of the memory replay unit, an MTL amnesia patient's neocortex is affected
by a new experience only once, namely at the time the event actually takes place, as opposed
to the exposure through multiple replays in healthy individuals. It may be that the neocortex
is simply not stimulated enough to undergo fundamental changes to its connections. While it
is still stimulated by the original experience, there is no replay of this memory, which is
essential for the slowly learning neocortex to efficiently alter its connectivity patterns.
This would 'freeze' the knowledge stored in the neocortex, making it inaccessible to new
information, but at the same time preventing it from losing older semantic knowledge. This is
indeed the case in MTL-amnesic patients: while retrospective knowledge, manifested before the
loss of the hippocampus, is relatively unimpaired, the acquisition of new knowledge almost
comes to a standstill, with new factual information being learned only after long time
intervals and many repetitions (Bayley & Squire 2002).
Another compelling refutation of the exclusive role of the hippocampal system as a
countermeasure against catastrophic forgetting is that all experiences an individual was ever
confronted with would have to be stored within the capacity-limited hippocampal system and
constantly be replayed, interleaved with new experience. We know that the hippocampus has
relatively limited capacity. Additionally, the number of memories that would need to be
replayed would grow linearly with lifetime. Thus, beyond a certain age, memory replay would
become infeasible. A solution to this issue would be the so-called pseudorehearsal introduced
by Robins (1995). Pseudorehearsal works without access to all prior training data (in our case
the memories of a lifetime); instead, it creates its own training examples (pseudoitems) by
passing random binary input into the network and using the output as new training examples for
interleaved training. Pseudoitems created in this way are described by Robins as a kind of
'map' that is able to reproduce the original weight distribution. As an anecdotal side note
from the authors, this intuitively makes sense, since sleep is the time during which memory
replay is thought to predominantly take place (Stickgold 2005), and sleep co-occurs with the
subjective experience of dreams, which often resemble a commingling of recent experience
and odd intermixes of past memories.
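Robins' procedure can be sketched directly: random binary inputs are pushed through the trained network, and the resulting input-output pairs serve as surrogate memories. Here `net_forward` stands in for any trained network; the parity function used below is only a toy stand-in:

```python
import random

def make_pseudoitems(net_forward, n_items, n_inputs, seed=0):
    # Robins-style pseudorehearsal: feed random binary vectors through
    # the trained network and keep the (input, output) pairs as
    # surrogate 'memories' to interleave with new training data
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        x = [rng.randint(0, 1) for _ in range(n_inputs)]
        items.append((x, net_forward(x)))
    return items

# toy stand-in for a trained network: output is the parity of the input
pseudo = make_pseudoitems(lambda x: sum(x) % 2, n_items=4, n_inputs=6)
```

Interleaving these pseudoitems with new training data approximately preserves the old weight configuration without storing a single real memory.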
Lastly, how do we keep the hippocampus itself free of catastrophic forgetting? It is itself a
connectionist network and should suffer from the same problem of catastrophic forgetting
that it is trying to avoid in the neocortex. An additional mechanism would be necessary to
protect the hippocampus from catastrophic forgetting itself (Wiskott et al. 2006).
Excitingly, such a mechanism actually exists and is laid out in research by Wiskott et al.
(2006). Adult neurogenesis, the generation of new neurons from neural stem cells, in the DG
of the hippocampus might be a countermeasure against catastrophic forgetting within the
hippocampus itself. The DG of the hippocampus is one of only two regions within the brain
that are capable of neurogenesis (Eriksson et al. 1998). We will consider the idea of
neurogenesis in more depth in Section 3.2.3 of this review.
3.2.2. Selective Constraints of neuroplasticity
The next neuroscientific insight we would like to present here is located on the molecular
level. In the human central nervous system, a change in plasticity (in other words, the
readiness of a synapse to change its connection strength) is able either to render connections
(in ANNs: weights) in a network modifiable or to fix their status quo. By selectively
increasing or decreasing plasticity, the brain is able to learn new tasks while conserving old
skills and knowledge (Yang et al. 2009). Importantly, changes in plasticity are selective,
which makes it possible to learn a new task without overwriting existing skills. This is
opposed to what happens in most ANNs during training: while the learning rate is often
dynamic, meaning that it changes during the learning progress, it is applied globally over
all connections and not adapted separately in different regions.
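The contrast with a global learning rate can be made explicit by giving every connection its own plasticity factor, so that protected weights receive an effective learning rate near zero. This is a hypothetical sketch of the idea, not a published update rule:

```python
def sgd_step(weights, grads, lr, plasticity):
    # per-connection update: each weight moves with effective rate
    # lr * plasticity[i]; weights with plasticity near 0 are protected,
    # while the rest remain free to change (mimicking branch-specific
    # plasticity, in contrast to a single global learning rate)
    return [w - lr * p * g for w, g, p in zip(weights, grads, plasticity)]

# the first weight is frozen (plasticity 0), the second moves normally
new_w = sgd_step([1.0, 1.0], [0.5, 0.5], lr=0.1, plasticity=[0.0, 1.0])
```

The interesting question, which the biology answers with SST interneurons and which EWC answers with Fisher information (Section 4.2), is where the plasticity values come from.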
When talking about plasticity changes in the human nervous system on a cellular level,
dendritic spines are considered essential. Spines are little protrusions on the postsynaptic
neuron, formed at synaptic connections between neurons. Changes in connection strength
between two neurons are due to morphological changes (remodeling) of the dendritic spines of
the postsynaptic neuron (Yang et al. 2009). Additionally, increased plasticity can also lead
to the formation of new spines and thereby the formation of new interneural connections, or
to the elimination of existing spines and thereby the loss of connections. These changes,
formations and eliminations of dendritic spines are the means of changing the connection
strength between two neurons and thus are the basis for learning.
To better understand how selective neuroplasticity comes about, we will take a look at the
molecular basis of spine remodeling. The changes in spine morphology depend on N-methyl-
D-aspartate (NMDA) receptor activity. Opposed to other excitatory receptors (e.g. AMPA in
Figure 4), NMDA is not only able to allow sodium (Na+) to enter the neuron, but also allows
the influx of calcium ions (Ca2+). Ca2+ influx renders the cell morphology changeable and
thus increases plasticity. If connection plasticity is elevated, connection strength can
either be increased (potentiation) (Bliss & Lømo 1973) or decreased (depotentiation) (Ito
1989) as a consequence of activity. The kind of change occurring depends on the time interval
between the synaptic activity and the Ca2+ spike. Thus, modulating the activity of NMDA
receptors, and consequently Ca2+ influx, changes the plasticity of the neural connection.
One means by which the brain influences the activity of NMDA receptors is the hormone
somatostatin (SST). According to Pittaluga et al. (2000), the hormone is able to increase
NMDA activity by removing the Mg2+ block from the receptor. Normally, the Mg2+ block is
only released due to high activity of the neuron and the resulting strong depolarization.
Only the removal of the Mg2+ block enables Ca2+ to pass through the membrane into the cell.
SST is distributed via interneurons, which are neurons that connect different neural circuits
with each other without taking a primary function within either of them. To make learning
without forgetting possible, it is important that SST-related plasticity changes can be
selective for certain branches of a neuron. This is indeed the case: changes do not occur
over all connections (i.e. branches) the neuron maintains, but only in branches that are
relevant for the task the cognitive agent is engaged in (Cichon & Gan 2015). When SST
release is disrupted (e.g. in SST-interneuron-deleted mice), SST no longer influences
NMDA receptor activity and the branch specificity of spine-morphology plasticity is lost. As
a result, the same branches show similar synaptic changes during the learning of different tasks,
Figure 4: (a) The NMDA receptor's Mg2+ block hampers calcium (Ca2+) ions from flowing into the
neuron. In this case only sodium (Na+) ions will enter the neuron, via the AMPA receptor on the right
side of the dendritic spine, which may cause the cell to fire but won't lead to a strengthening of the
synaptic connection. (b) If the Mg2+ block is removed, via a high level of depolarization of the neuron (a
high rate of activity) or somatostatin (SST) in the synaptic cleft, Ca2+ will enter the cell, which will
result in changes to the cell metabolism. (c) The changes in cell metabolism due to Ca2+ influx result in
additional AMPA receptors being integrated into the neuron's membrane. A higher density of
AMPA receptors increases the rate of Na+ ion influx upon stimulation, which makes the neuron more
likely to fire.
causing subsequent tasks to 'erase' memories of preceding tasks. This happens because new
tasks alter synapse strengths learned for the preceding tasks, which parallels what we know
from ANNs with globally high plasticity. Cichon & Gan (2015) also provide evidence for this
on the behavioural level: the previously mentioned SST-interneuron-deleted mice do indeed
exhibit catastrophic forgetting when learning two different tasks sequentially. This gives us
causal evidence that selective changes in neuroplasticity can serve as a countermeasure
against catastrophic forgetting when they are branch-specific.
A single SST interneuron directly targets a single other neuron, which we refer to as a
homosynaptic interaction. Not all neural interaction is homosynaptic. There are other
substances, referred to as neuromodulators, that act in a heterosynaptic fashion. Generally
speaking, heterosynaptic neuromodulation means that a neurotransmitter released by a neuron
does not only affect a single target neuron, but a whole population of neurons in close
proximity. This happens when the neuromodulator is not only released into the targeted
synaptic cleft, but also 'spilled over' into the extracellular space (ECS), where it can
diffuse and reach other, previously uninvolved neurons nearby. Besides spillover, some
neuromodulators are directly released into the ECS, for example classical neurotransmitters
like dopamine (Descarries et al. 1996) and serotonin (De-Miguel & Trueta 2005). Additionally,
there are highly diffusible gaseous substances like nitric oxide (NO), carbon monoxide (CO)
and hydrogen sulfide (H2S) (Wang 2002), which, being highly diffusible, have a greater area
of effect. By diffusing through the ECS, neuromodulators are able to alter the plasticity of
adjacent neurons as well, making them more prone to change their connection strength (or,
vice versa, making them more stable). This localized change of plasticity might be a means to
avoid catastrophic forgetting by rendering currently relevant parts of the neural network
changeable while keeping the rest of the network stable, thereby antagonizing interference
with old information. While the branch-specific, SST-induced changes to neuroplasticity act
on a rather fine-grained level, neuromodulators are able to render larger portions of cortex
more plastic; both, however, are able to influence memory interference.
3.2.3. Neurogenesis within the hippocampus
A third mechanism in the human nervous system that might be able to prevent catastrophic
forgetting was already mentioned in our section about CLS and is located inside the
hippocampus. As described before, an episodic memory unit that temporarily stores experiences
in order to replay them and thereby train a slow statistical learner like the neocortex will
itself face the problem of catastrophic forgetting. Interestingly, there is another mechanism
within the episodic memory buffer of the hippocampus to circumvent this problem. Prominently,
the DG of the hippocampus is one of two cortical areas capable of adult neurogenesis (Altman
& Das 1965, Gould & Gross 2002, Kempermann et al. 2004). Neurogenesis describes the constant
production of nervous cells from neural stem cells. These newly generated neurons in the DG
differ from older cells in that they exhibit a greater degree of synaptic plasticity
(Schmidt-Hieber et al. 2004), greater ease in forming new connections to other neurons (Gould
& Gross 2002) and greater mortality (apoptosis) (Eriksson et al. 1998). These properties draw
a picture of a nervous cell that can easily be integrated into an existing neural circuit,
but may also easily be obliterated when not proving useful. The extent of neurogenesis and
cell survival is decreased by age (Altman & Das 1965) and aversive, stressful experiences
(Gould & Tanapat 1999), and increased by diet (Lee et al. 2000), physical activity (van Praag
et al. 1999) and enriched environments (Kempermann et al. 1998). But how do new neurons help
to tackle catastrophic forgetting? French (1991)
suggested that, within a large network, sparsity of the representations in the hidden layers
is a means to reduce CF, since representations will be localized and will therefore not
interfere with each other. This strategy, however, has the effect of reducing generalization,
since solutions will just be stored in parallel and there is no pressure to generalize.
Wiskott et al. (2006) complement this idea by suggesting that newly generated neurons open up
the opportunity to learn new feature representations while, through the reduction of
plasticity in old neurons, preserving the ability to remember older feature representations.
In early life, people encounter many new environments with many new features, so the need for
the DG to be able to adapt is large. Later in life, new environments mostly consist of
recombinations of known features, which reduces the need to create new feature
representations. This is an intuitive account of the reduction of neurogenesis over an
individual's lifetime. The same goes for enriched environments: in a complex environment, the
capability to adapt and learn is more important than in an impoverished one. Higher rates of
neurogenesis make this possible.
4. INTEGRATION OF NEUROSCIENTIFIC INSIGHT INTO MACHINE LEARNING
Having illustrated different approaches from the different disciplines of the Brain and
Cognitive Sciences, we now take a look at the extent to which these ideas are already
implemented in contemporary AI systems.
4.1. Using complementary learning systems: from the DQN-model to deep generative replay
The aforementioned CLS theory has probably received the most attention in recent years when
it comes to tackling the problem of catastrophic forgetting in connectionist networks. In
their very influential approach to training a neural network architecture to control Atari
2600 games, Mnih et al. (2013, 2015) introduced a memory replay unit in the deep Q-network
(DQN). In their attempt to make a DNN architecture learn from less data, Mnih et al.
introduced a separate memory unit, saving all prior experiences and replaying them randomly
up to eight times after the initial training. Even though this idea contains much of what we
think might help overcome catastrophic forgetting, and corresponds functionally to some
extent to the episodic memory buffer in CA3 of the hippocampus that we are also referring to,
it exhibits some conceptual flaws that limit its capacities as a means against catastrophic
forgetting (to be fair, overcoming catastrophic forgetting was not Mnih et al.'s intention
here). As we explained in Section 3.2.1, an episodic memory buffer that saves all recent
experiences in a one-to-one manner, as in DQNs, is not feasible. An AI system comprising real
general intelligence needs to learn continually over a long period of time. An episodic
memory buffer like the DQN's would become exorbitantly large over time, and the replay of
randomly, uniformly sampled memories (every memory being equally likely) would become a
computationally heavy task. To lower the number of replayed memories, Schaul et al. (2016)
suggested prioritizing memories that are likely to yield a high reward over other memories
that might not be as important for success. By doing so, their model outperforms a uniformly
sampling system given the same amount of training. This selective, reward-guided replay is
biologically plausible (see Atherton et al. 2015; Hattori 2014). Even though this
constraining of memory sampling is already a step in the right direction, reducing the amount
of memory that has to be replayed, over an agent's lifetime too many memories would still
need to be stored.
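The core of such prioritized sampling can be sketched in miniature: memories are replayed with probability proportional to a priority score (in Schaul et al.'s case derived from TD error; the toy values below are invented for illustration):

```python
import random

def sample_prioritized(memories, priorities, k, seed=0):
    # draw a replay batch with probability proportional to each
    # memory's priority, instead of sampling uniformly
    rng = random.Random(seed)
    return rng.choices(memories, weights=priorities, k=k)

# a low-priority and a high-priority memory; the batch is dominated
# by the high-priority one
batch = sample_prioritized(["low", "high"], [0.01, 0.99], k=10)
```

Prioritization reduces how much must be replayed per update, but the buffer itself still grows with the agent's lifetime, which is the remaining problem.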
In Section 3.2.1 of this paper we presented Robins' (1995) idea of pseudo-pattern replay. In
pseudo-pattern replay there is no need to save actual memories in a one-to-one fashion;
instead, the patterns that the newly acquired memories are interleaved with are generated
from the prior weight distribution. Mocanu et al. (2016) pick up on this idea and describe
the Online Contrastive Learning with Generative Replay (OCLGR) model, which uses generative
Restricted Boltzmann Machines (gRBMs) to store past experiences. By saving past experiences
in gRBMs, the need to store them explicitly (in a one-to-one fashion) becomes obsolete. Using
this idea, the OCLGR outperforms regular experience replay models and adds more biological
plausibility to the approach by substantially reducing the memory requirements. Generative
replay is applicable to all common types of machine learning (reinforcement, supervised and
unsupervised learning). However, Mocanu et al. do not evaluate their model on its capability
to cope with catastrophic forgetting. Just recently, Shin et al. (2017) put a generative
replay model to the test to see how far it may help overcome catastrophic forgetting on the
MNIST dataset (LeCun et al. 2010). Their results imply that generative replay is compatible
with other contemporary countermeasures (e.g. elastic weight consolidation [EWC], Kirkpatrick
et al. 2017; learning without forgetting [LwF], Li & Hoiem 2017). Additionally, they state
that their approach is superior to weight-constraining approaches like EWC and LwF, since
there is no trade-off between the performance on old and new tasks. We will explain
weight-constraining approaches in more detail in the following Section 4.2.
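In outline, generative replay mixes each new-task batch with samples drawn from a generative model of past tasks instead of from stored data. The generator below is a trivial stand-in, not a gRBM or the model of Shin et al.:

```python
import random

def generative_replay_batch(new_data, generator, n_replay, seed=0):
    # instead of storing old experiences, draw surrogate samples from a
    # generative model of past tasks and mix them into the current batch
    rng = random.Random(seed)
    replayed = [generator(rng) for _ in range(n_replay)]
    batch = list(new_data) + replayed
    rng.shuffle(batch)
    return batch

# stand-in generator emitting fake 'old task' items
batch = generative_replay_batch([("new", 1)] * 3,
                                lambda rng: ("old", rng.randint(0, 1)),
                                n_replay=3)
```

The memory cost is now that of the generative model's parameters rather than of the ever-growing experience buffer, which is what makes the scheme viable for lifelong learning.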
4.2. Constraining weight plasticity within the network
Another approach to tackle catastrophic forgetting in connectionist networks is the se-
lective constraining of weights. As explained in section 2, catastrophic forgetting in neural
networks is caused by the plasticity of connections needed for a first task A remaining high
during the training of a subsequent second task B. When plasticity is high, the information for
task A will be forgotten, since the weights holding this information will adapt to task B.
On the other hand, when the plasticity of the weights is constrained globally, the network
will lose its ability to learn (see the plasticity-stability dilemma; Carpenter & Grossberg 1987,
cited in Gerstner & Kistler 2002). Current approaches to prevent catastrophic forgetting
therefore try to selectively constrain weights in a way that weights necessary for task A are
protected during learning of task B and vice versa. Several models implement this idea
based on the previously mentioned neurobiological insights.
The first of these implementations is elastic weight consolidation (EWC) (Kirkpatrick et al.
2017). EWC is inspired by ideas from the molecular neurobiological level: SST-expressing in-
terneurons are able to selectively constrain plasticity on certain branches of a cortical neuron,
while leaving it intact for other branches of the very same neuron (Yang et al. 2009). The
branch-wise constraints are functionally related: when a branch is necessary for task
A, its plasticity will be unconstrained during learning of task A and constrained during task B,
and vice versa. Kirkpatrick et al. (2017) take this idea of selectively constrained plasticity
and apply it to DNNs. However, while taking biology as an inspiration, they do not try to
model the underlying mechanics, but instead use a Bayesian approximation to determine
the importance of single connections (i.e. weights) for the current task. When the network is
trained on the following task, the algorithm determines, based on this Bayesian approximation,
how important the different weights were for the previously solved tasks and puts constraints
on them, so that they remain relatively unchanged during backpropagation and subsequent
weight updates. The trajectory through weight space changes accordingly (see figure 5).
Figure 5: Similar to Figure 1, a two-weight system is trained sequentially on two different tasks (A and B).
While being trained on task B in an unconstrained manner, the system will migrate towards a solution of
task B, neglecting prior knowledge of task A (see bottom trajectory). If weights are constrained by elastic
weight consolidation, the system will constantly be 'pulled back' towards the solution of task A while
migrating through weight space. Ultimately, it will converge on a weight combination that solves both
tasks satisfactorily (if such a solution exists).
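The elastic 'pull' illustrated in figure 5 comes from the quadratic penalty that Kirkpatrick et al. (2017) add to the loss of the new task, L(θ) = L_B(θ) + Σ_i (λ/2) F_i (θ_i − θ*_{A,i})², where F_i is the diagonal Fisher information (their Bayesian approximation of weight importance) and θ*_A are the weights found for task A. A minimal sketch of the penalty and its gradient (variable names are ours, not from the paper):

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """EWC penalty: (lam/2) * sum_i F_i * (theta_i - theta_star_i)^2.

    fisher holds the diagonal Fisher information estimated after task A;
    a large F_i marks a weight as important for A."""
    theta, theta_star, fisher = map(np.asarray, (theta, theta_star, fisher))
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)

def ewc_grad(theta, theta_star, fisher, lam=1.0):
    """Gradient of the penalty, added to the gradient of the task-B loss."""
    theta, theta_star, fisher = map(np.asarray, (theta, theta_star, fisher))
    return lam * fisher * (theta - theta_star)
```

Weights with large F_i are pulled back strongly towards their task-A values, while unimportant weights remain free to adapt to task B.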
Velez & Clune (2017) take their inspiration from the way neuromodulation in the hu-
man cortex is thought to work. Part 3.2.2 explained how neuromodulators are spread locally
within the human cortex and affect the plasticity of neural connections within their range.
Velez & Clune (2017) translate this by locally modulating the plasticity of connection weights
through the spread of an 'artificial neuromodulator' within their ANNs. This artificial neuromodulator
selectively increases the plasticity of the weights around the diffusion node. For differ-
ent tasks, different diffusion nodes are activated during training, and the network thereby
creates local functional clusters while the rest of the network remains relatively unaffected during
training. Their model is only validated on a very small, primitive network, and its
capabilities on large-scale, state-of-the-art architectures remain to be verified.
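As an illustration of the diffusion idea, the sketch below scales each weight's effective learning rate by its distance to the currently active diffusion node. The Gaussian falloff and the assignment of spatial positions to weights are our simplifying assumptions, not the exact mechanism of Velez & Clune (2017):

```python
import numpy as np

def modulated_update(weights, grads, positions, node_pos, base_lr=0.1, sigma=1.0):
    """One gradient step with diffusion-based neuromodulation (sketch).

    Each weight has a spatial position; an active task-specific diffusion
    node raises plasticity (the effective learning rate) in its vicinity,
    so updates stay confined to a local functional cluster."""
    dist = np.linalg.norm(positions - node_pos, axis=1)
    modulation = np.exp(-dist ** 2 / (2 * sigma ** 2))  # ~1 near the node, ~0 far away
    return weights - base_lr * modulation * grads
```

Activating a different node for each task confines the updates to a different local cluster, leaving the rest of the network nearly untouched during training.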
The main difference between these two weight-constraining approaches is that, on the one
hand, the diffusion-based implementation of Velez and Clune has greater biological plausibil-
ity than Kirkpatrick et al.'s EWC. On the other hand, EWC directly targets the functionality
of certain weights, while the diffusion-based approach lets functionality emerge within
predetermined local clusters around the diffusion nodes. In this regard, EWC might be su-
perior to the diffusion implementation, since the functional diffusion clusters are not as flexible
in their scope as the EWC constraints are. The scope of the EWC weight constraints is
determined only by the number of weights the task requires and is not handcrafted as in the
diffusion-based model. This makes EWC the more elegant and flexible solution.
In both weight-constraining approaches, the functional structure of the network is relatively
fixed. When a weight combination that maximizes the performance on one task is found in
the network, it is protected against change. This makes the network less flexible overall and
limits its capacity to a certain number of tasks. It is also possible that the network does not
find generalized solutions, since the approach minimizes overlap between the different
representations of tasks, not allowing the more parsimonious, flexible solution that might be
found when both tasks are optimized in parallel.
5. DISCUSSION
In this literature review we took a closer look at current developments in AI re-
search. We found the field prospering, especially in recent years: in several im-
portant machine-learning applications, performance benchmarks have been pushed to reach
human-level performance or even beyond. However, recent performance achievements
were often built on large amounts of data and computational power, and
the systems often lack the ability to generalize the acquired skills to other related
or slightly changed tasks. We stated that this ability, however, is central to acquiring true
intelligence. One obstacle on the way to more generalizing, intelligent systems is catastrophic
forgetting in connectionist networks. Years of approaching the problem with purely mathe-
matical insight did not resolve the issue satisfyingly. As a consequence, researchers turned
to neuroscience to draw inspiration from human cognitive agents, who do not suffer from
catastrophic forgetting. We introduced Marr's (1982) levels of analysis to help us better un-
derstand the neuroscientific research we encounter. Here we emphasized that, just like in
neuroscientific research, where complete theories are always informed by all levels of analysis,
an algorithm for a truly intelligent neuroscience-inspired AI system should likewise
be informed by all levels of analysis. We then brought up complementary learning
systems theory (CLS) and constrained neuroplasticity, two of the main ideas in contempo-
rary neuroscience about how catastrophic forgetting is avoided in humans. In addition,
we briefly explained how neurogenesis in the hippocampus might be able to prevent
catastrophic forgetting as well. Finally, we surveyed recent implementations of these ideas
in AI and the advantages and shortcomings the different approaches pose. In doing so, we
presented different examples of realisations of CLS, with an emphasis on the most promising
and recent 'deep generative replay' approach (Shin et al. 2017), which utilizes two complementary
learning systems, just like humans do, to interleave current and past experiences.
We further depicted two realizations of the molecular-neuroscientific idea of selectively con-
strained neuroplasticity: elastic weight consolidation (EWC) by Kirkpatrick et al. (2017)
and the diffusion-based approach of Velez & Clune (2017).
Both of the main approaches using constrained neuroplasticity have shortcomings. EWC
is only able to change plasticity in one direction: from plastic to stable. As soon as
the network's capacity is reached and the system is saturated, no more new information can
be learned and a blackout catastrophe, a phenomenon known from saturated Hopfield net-
works (Amit 1989), may occur. A blackout catastrophe renders the information in the network
unretrievable. To learn continually, the cognitive agent has to be able to selectively forget
information to prevent a blackout.
In the diffusion-based approach, on the other hand, the number and size of the functional clus-
ters are handcrafted into the system. Handcrafting limits the range of applicability of the
system. To obtain a model that can function without these top-down decisions, the model
will have to learn the scale of the clusters from data. The question remains which mechanism
might be able to supervise the assignment and size of diffusion nodes within an ANN. One
possible candidate for this is the human basal ganglia (Alexander & Crutcher 1990).
The basal ganglia (especially the ventral striatum) are central to human reward prediction
and processing (e.g. Schultz et al. 1992). The basal ganglia seem to maintain neuromod-
ulatory projections to the cortex (Alcaro et al. 2007, Graybiel 1990), and, as stated in part
3.2.2, dopamine is able to serve as a potent neuromodulator (Descarries et al. 1996), even
though dopaminergic neurons are relatively rare and mainly located in the
basal ganglia (Bjoerklund & Dunnett 2007). This might pose a natural connection to currently popular rein-
forcement learning algorithms in AI (like the previously mentioned DQN (Mnih et al. 2013); for
an introduction to reinforcement learning see Sutton & Barto 1998).
Converging evidence in neuroscience suggests that catastrophic forgetting in humans is
overcome not by a single mechanism, but by multiple mechanisms on different levels.
While there is direct causal evidence for the importance of constraints on branch-specific
neuroplasticity for the avoidance of catastrophic forgetting in animal models (Cichon & Gan
2015), there is also broad evidence for the relevance of complementary learning systems in
human learning and the interleaved fashion in which the hippocampus trains the slowly
learning neocortex (Kumaran et al. 2016). To overcome catastrophic forgetting in a way
equivalent to humans, developers of connectionist AI systems might need to integrate the
separately developed frameworks into an all-embracing solution. This integration might not
be straightforward, but the prospect of being able to understand and prevent catastrophic
forgetting in a more complete fashion might be worthwhile. Our outlook sees particular
promise in connecting reinforcement processing with selective changes in neuroplasticity,
facilitated through flexibly acting neuromodulatory nodes. An additional
episodic memory replay unit that creates pseudopatterns to replay recent experiences in an
interleaved manner can not only help to consolidate recent memories within the neocortex
(as intended by DQNs), but also bind the network to earlier learned tasks and prevent
catastrophic forgetting. A next step might be to create a memory replay unit that more
closely resembles the inner dynamics of the human hippocampus. A potential candidate
for matching artificial memory replay units more closely to the human hippocampus would be the
utilization of the REMERGE model (Kumaran & McClelland 2012). REMERGE models
hippocampal encoding, memory orthogonalization and retrieval in a down-scaled fashion.
When the episodic memory replay unit becomes more complex by more closely resembling its
biological archetype (the hippocampus), it might be necessary to also mimic the internal
hippocampal process of adult neurogenesis to protect the new module from forgetting catas-
trophically as well.
Even though the envisioned integration of the different neuroscience-inspired solutions to catas-
trophic forgetting poses a big challenge, the prospects it has to offer are compelling enough
to shoulder the effort. Deeper analysis of the ideas collected here is necessary to find rea-
sonable and effective ways for a fusion. The final result might be an algorithm grasping the
complexity of human neural processing on all levels of analysis in a complete fashion.
With such an algorithm available, and with it the possibility to create sequen-
tially learning AI systems, a big stepping-stone towards the development of true intelligence
would be taken. Sequentially learning models make it easier to train models in which composition-
ality, explained in the introduction, emerges, since sequential learning may enable connectionist
networks to recycle prior knowledge (Kirkpatrick et al. 2017).
Alcaro, A., Huber, R., & Panksepp, J. (2007). Behavioral Functions of the Mesolimbic Dopamin-
ergic System: an Affective Neuroethological Perspective. Brain Res Rev, 56(2), 283–321.
Alexander, G. E. & Crutcher, M. D. (1990). Functional architecture of basal ganglia circuits:
neural substrates of parallel processing. Trends in Neurosciences, 13(7), 266–271.
Altman, J. & Das, G. D. (1965). Autoradiographic and histological evidence of postnatal hip-
pocampal neurogenesis in rats. The Journal of comparative neurology, 124(3), 319–35.
Amit, D. (1989). Modeling brain function: the world of attractor neural networks.
Atherton, L. A., Dupret, D., & Mellor, J. R. (2015). Memory trace replay: The shaping of memory
consolidation by neuromodulation.
Baddeley, A. D. & Hitch, G. (1974). Working Memory. Psychology of Learning and Motivation, 8,
47–89.
Barnes, J. & Underwood, B. (1959). "Fate" of first-list associations in transfer theory.
Bayley, P. J. & Squire, L. R. (2002). Medial temporal lobe amnesia: Gradual acquisition of factual
information by nondeclarative memory. The Journal of neuroscience : the official journal of
the Society for Neuroscience, 22(13), 5741–8.
Bjoerklund, A. & Dunnett, S. B. (2007). Dopamine neuron systems in the brain: an update. Trends
in Neurosciences, 30(5), 194–202.
Bliss, T. V. P. & Lømo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate
area of the anaesthetized rabbit following stimulation of the perforant path. The Journal of
Physiology.
Brousse, O. & Smolensky, P. (1989). Virtual Memories and Massive Generalization in Connectionist
Combinatorial Learning. Technical Report CU-CS-431-89.
Cabestany, J., Prieto, A., & Sandoval, F. (2005). LNCS 3512 - Computational Intelligence and
Bioinspired Systems. 8th International Work-Conference on Artificial Neural Networks.
Carpenter, G. A. & Grossberg, S. (1987). A Massively Parallel Architecture for a Self-Organizing
Neural Pattern Recognition Machine. Computer Vision, Graphics, and Image Processing, 37, 54–115.
Cichon, J. & Gan, W.-b. (2015). Branch-specific dendritic Ca2+ spikes cause persistent synaptic plasticity. Nature, 520(7546), 180–185.
De-Miguel, F. F. & Trueta, C. (2005). Synaptic and extrasynaptic secretion of serotonin.
Descarries, L., Watkins, K. C., Garcia, S., Bosler, O., & Doucet, G. (1996). Dual character,
asynaptic and synaptic, of the dopamine innervation in adult rat neostriatum: A quantitative
autoradiographic and immunocytochemical analysis. Journal of Comparative Neurology.
Eriksson, P. S., Perfilieva, E., Bjork-Eriksson, T., Alborn, A.-M., Nordborg, C., Peterson, D. A., &
Gage, F. H. (1998). Neurogenesis in the adult human hippocampus. Nature Medicine, 4(11),
1313–1317.
French, R. (1992). Semi-distributed representations and catastrophic forgetting in connectionist
networks.
French, R. (1994). Dynamically constraining connectionist networks to produce distributed, or-
thogonal representations to reduce catastrophic interference. Proceedings of the 16th Annual
Cognitive Science Society Conference.
French, R. M. (1991). Catastrophic Forgetting in Connectionist Networks. In Encyclopedia of
Cognitive Science. Chichester: John Wiley & Sons, Ltd.
French, R. M. (1999). Catastrophic forgetting in connectionist networks.
Gerstner, W. & Kistler, W. M. (2002). Spiking neuron models : single neurons, populations,
plasticity. Cambridge University Press.
Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks.
Gould, E. & Gross, C. G. (2002). Neurogenesis in adult mammals: some progress and problems.
The Journal of neuroscience : the official journal of the Society for Neuroscience, 22(3), 619–
23.
Gould, E. & Tanapat, P. (1999). Stress and hippocampal neurogenesis. In Biological Psychiatry.
Graves, A., Mohamed, A.-r., & Hinton, G. (2013). Speech Recognition with Deep Recurrent Neural
Networks.
Graybiel, A. M. (1990). Neurotransmitters and neuromodulators in the basal ganglia. Trends in
neurosciences, 13(7), 244–54.
Harlow, H. F. (1949). The formation of learning sets. Psychological Review, 56(1), 51–65.
Hattori, M. (2014). A biologically inspired dual-network memory model for reduction of catas-
trophic forgetting. Neurocomputing.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition.
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V.,
Nguyen, P., Sainath, T., & Kingsbury, B. (2012). Deep Neural Networks for Acoustic Modeling
in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing
Magazine, 29(6), 82–97.
Hochreiter, S. & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8),
1735–80.
Ito, M. (1989). Long-term depression. Ann. Rev. Neurosci., 12, 85–102.
Jia Deng, Wei Dong, Socher, R., Li-Jia Li, Kai Li, & Li Fei-Fei (2009). ImageNet: A large-
scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern
Recognition.
Kempermann, G., Jessberger, S., Steiner, B., & Kronenberg, G. (2004). Milestones of neuronal
development in the adult hippocampus. Trends in Neurosciences, 27(8), 447–452.
Kempermann, G., Kuhn, H. G., & Gage, F. H. (1998). Experience-induced neurogenesis in the
senescent dentate gyrus. The Journal of neuroscience : the official journal of the Society for
Neuroscience, 18(9), 3206–12.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., & Rusu, A. A. (2017).
Overcoming catastrophic forgetting in neural networks. 114(13), 3521–3526.
Kitchin, R. (2014). The Data Revolution: Big Data, Open Data, Data Infrastructures and Their
Consequences. 1st edition edition.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional
Neural Networks. NIPS'12 Proceedings of the 25th International Conference on Neural
Information Processing Systems, 1, 1097–1105.
Kumar, S. (2012). Fundamental Limits to Moore’s Law.
Kumaran, D., Hassabis, D., & McClelland, J. L. (2016). What Learning Systems do Intelligent
Agents Need? Complementary Learning Systems Theory Updated. Trends in Cognitive
Sciences, 20(7), 512–534.
Kumaran, D. & McClelland, J. L. (2012). Generalization through the recurrent interaction of
episodic memories: A model of the hippocampal system. Psychological Review, 119(3), 573–
616.
Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through
probabilistic program induction. Science (New York, N.Y.), 350(6266), 1332–8.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2016). Building Machines That
Learn and Think Like People.
LeCun, Y. & Bengio, Y. (1995). Convolutional networks for images, speech, and time-series.
LeCun, Y., Cortes, C., & Burges, C. J. C. (2010). MNIST handwritten digit database.
Legg, S. & Hutter, M. (2007). Universal Intelligence: A Definition of Machine Intelligence.
Li, Z. & Hoiem, D. (2017). Learning without Forgetting.
Lodato, S. & Arlotta, P. (2015). Generating Neuronal Diversity in the Mammalian Cerebral Cortex.
Annual Review of Cell and Developmental Biology.
Marblestone, A., Wayne, G., & Kording, K. (2016). Towards an integration of deep learning and
neuroscience.
Marr, D. (1970). A Theory for Cerebral Neocortex. Proceedings of the Royal Society B: Biological
Sciences.
Marr, D. (1971). Simple Memory: A Theory for Archicortex. Philosophical Transactions of the
Royal Society B: Biological Sciences.
Marr, D. (1982). Vision : a computational investigation into the human representation and pro-
cessing of visual information. MIT Press.
McClelland, J. L., Botvinick, M. M., Noelle, D. C., Plaut, D. C., Rogers, T. T., Seidenberg,
M. S., & Smith, L. B. (2010). Letting structure emerge: Connectionist and dynamical systems
approaches to cognition. Trends in Cognitive Sciences.
McClelland, J. L., McNaughton, B. L., & O’Reilly, R. C. (1995). Why there are complementary
learning systems in the hippocampus and neocortex: Insights from the successes and failures
of connectionist models of learning and memory. Psychological Review.
McCloskey, M. & Cohen, N. J. (1989). Catastrophic Interference in Connectionist Networks: The
Sequential Learning Problem. Psychology of Learning and Motivation, 24, 109–165.
McRae, K. & Hetherington, P. (1993). Catastrophic Interference is Eliminated in Pretrained
Networks. Proceedings of the 15th Annual Conference of the Cognitive Science Society, (pp.
723–728).
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M.
(2013). Playing Atari with Deep Reinforcement Learning.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Ried-
miller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou,
I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control
through deep reinforcement learning. Nature, 518(7540).
Mocanu, D. C., Vega, M. T., Eaton, E., Stone, P., & Liotta, A. (2016). Online Contrastive
Divergence with Generative Replay: Experience Replay without Storing Data.
Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics, 38(8).
Moser, E. I., Kropff, E., & Moser, M.-B. (2008). Place Cells, Grid Cells, and the Brain’s Spatial
Representation System. Annual Review of Neuroscience, 31(1), 69–89.
Panzeri, S., Petersen, R. S., Schultz, S. R., Lebedev, M., & Diamond, M. E. (2001). The role of
spike timing in the coding of stimulus location in rat somatosensory cortex. Neuron, 29(3),
769–77.
Pittaluga, A., Bonfanti, A., & Raiteri, M. (2000). Somatostatin potentiates NMDA receptor
function via activation of InsP 3 receptors and PKC leading to removal of the Mg 2+ block
without depolarization 1. British Journal of Pharmacology, 130, 557–566.
Race, E., Keane, M. M., & Verfaellie, M. (2013). Living in the moment: patients with MTL
amnesia can richly describe the present despite deficits in past and future thought. Cortex; a
journal devoted to the study of the nervous system and behavior, 49(6), 1764–6.
Robins, A. (1995). Catastrophic Forgetting, Rehearsal and Pseudorehearsal. Connection Science.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-
propagating errors. Nature, 323(6088), 533–536.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet Large Scale Visual
Recognition Challenge. International Journal of Computer Vision.
Schapiro, A. C., Turk-Browne, N. B., Norman, K. A., & Botvinick, M. M. (2016). Statistical
learning of temporal community structure in the hippocampus. Hippocampus, 26(1), 3–8.
Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized Experience Replay.
Schmidt-Hieber, C., Jonas, P., & Bischofberger, J. (2004). Enhanced synaptic plasticity in newly
generated granule cells of the adult hippocampus. Nature.
Schultz, W., Apicella, P., Scarnati, E., & Ljungberg, T. (1992). Neuronal activity in monkey
ventral striatum related to the expectation of reward. The Journal of neuroscience : the
official journal of the Society for Neuroscience, 12(12), 4595–610.
Serre, T., Oliva, A., & Poggio, T. (2007). A feedforward architecture accounts for rapid catego-
rization. Proceedings of the National Academy of Sciences of the United States of America,
104(15), 6424–9.
Sharkey, N. E. & Sharkey, A. J. C. (1995). Backpropagation Discrimination Geometric Analysis
Interference Memory Modelling Neural Nets. Connection Science, 7(3-4), 301–330.
Shin, H., Lee, J. K., Kim, J., & Kim, J. (2017). Continual Learning with Deep Generative Replay.
Spearman, C. (1904). "General Intelligence," Objectively Determined and Measured.
The American Journal of Psychology, 15(2), 201.
Squire, L. R., Stark, C. E., & Clark, R. E. (2004). THE MEDIAL TEMPORAL LOBE. Annual
Review of Neuroscience.
Squire, L. R. & Zola-Morgan, S. (1991). The Medial Temporal Lobe Memory System. Science,
253(5026).
statista.com (2018a). Statistics on smartphone sales.
statista.com (2018b). Statistics on smartphone usage.
Stickgold, R. (2005). Sleep-dependent memory consolidation.
Sutton, R. S. & Barto, A. G. (1998). Reinforcement learning : an introduction. MIT Press.
Tulving, E. (1985). Memory and consciousness. Canadian Psychology/Psychologie canadienne,
26(1), 1–12.
van Praag, H., Kempermann, G., & Gage, F. H. (1999). Running increases cell proliferation and
neurogenesis in the adult mouse dentate gyrus. Nature Neuroscience, 2(3), 266–270.
Velez, R. & Clune, J. (2017). Diffusion-based neuromodulation can eliminate catastrophic forget-
ting in simple neural networks. PLOS ONE, 12(11), e0187736.
Waldrop, M. M. (2016). The chips are down for Moore's law. Nature, 530(7589), 144–147.
Wang, R. (2002). Two's company, three's a crowd: can H2S be the third endogenous gaseous
transmitter? The FASEB Journal, 16(13), 1792–1798.
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., & de Freitas, N. (2015). Dueling
Network Architectures for Deep Reinforcement Learning.
Weston, J., Chopra, S., & Bordes, A. (2014). Memory Networks.
Wiskott, L., Rasch, M. J., & Kempermann, G. (2006). A functional hypothesis for adult hippocam-
pal neurogenesis: Avoidance of catastrophic interference in the dentate gyrus. Hippocampus.
Wynn, T. (1988). Tools and the evolution of human intelligence. New York: Clarendon
Press/Oxford University Press.
Yang, G., Pan, F., & Gan, W.-b. (2009). Stably maintained dendritic spines are associated with
lifelong memories. Nature, 462(7275), 920–924.