8/3/2019 Barbara Hammer- Perspectives on Learning Symbolic Data with Connectionistic Systems
Perspectives on Learning Symbolic Data with
Connectionistic Systems
Barbara Hammer
University of Osnabrück, Department of Mathematics/Computer Science, D-49069 Osnabrück,
Germany, e-mail: [email protected].
Abstract. This paper deals with the connection of symbolic and subsymbolic systems. It focuses
on connectionistic systems processing symbolic data. We examine the capability of learning sym-
bolic data with various neural architectures which constitute partially dynamic approaches: dis-
crete time partially recurrent neural networks as a simple and well established model for process-
ing sequences, and advanced generalizations like holographic reduced representation, recursive
autoassociative memory, and folding networks for processing tree structured data. The methods
share the basic dynamics, but they differ in the specific training methods. We consider the following questions: What are the representational capabilities of the architectures from an algorithmic point of view? What are the representational capabilities from a statistical point of view? Are the architectures learnable in an appropriate sense? Are they efficiently learnable?
1 Introduction
Symbolic methods and connectionistic or subsymbolic systems constitute complementary approaches for processing data automatically and appropriately. Various learning
algorithms for learning an unknown regularity based on training examples exist in both
domains: decision trees, rule induction, inductive logic programming, version spaces,
. . . on the one side and Bayesian reasoning, vector quantization, clustering algorithms,
neural networks, . . . on the other side [24]. The specific properties of the learning algo-
rithms are complementary as well. Symbolic methods deal with high level information
formulated via logical formulas, for example; data processing is human-understandable;
hence it is often easy to involve prior knowledge, to adapt the training outputs to specific
domains, or to retrain the system on additional data; at the same time, training is often
complex, inefficient, and sensitive to noise. In comparison, connectionistic systems deal
with low level information. Since they perform pattern recognition, their behavior is not
human understandable and often, adaptation to specific situations or additional data re-
quires complete retraining. At the same time, training is efficient, noise tolerant, and
robust. Common data structures for symbolic methods are formulas or terms, i.e., high
level data with little redundant information and a priori unlimited structure where lots
of information lie in the interaction of the single data components. As an example, the
meaning of each of the symbols in the term father(John,Bill) is essentially connected to its respective position in the term. No symbol can be omitted without losing important information. Assuming Bill is the friend of Mary's brother, the above term could be substituted by father(John,friend(brother(Mary))), a term with a different length and
structure. Connectionistic methods process patterns, i.e., real vectors of a fixed dimen-
sion, which commonly comprise low level, noisy, and redundant information of a fixed
Fig. 1. Example for subsymbolic data: a hand-written digit.
and determined form. The precise value and location of the single components is often unimportant; information comes from the sum of local features. As an example, Fig. 1 depicts various representations of the digit; each picture can be represented by a vector of gray-levels; the various pictures differ considerably in detail while preserving important features such as the two curved lines of the digit.
Often, data possess both symbolic and subsymbolic aspects: As an example, database entries may combine the picture of a person, their income, and their occupation; web sites
consist of text, pictures, formulas, and links; arithmetical formulas may contain vari-
ables and symbols as well as real numbers. Hence appropriate machine learning meth-
ods have to process hybrid data. Moreover, people are capable of dealing with both
aspects at the same time. It would be interesting to see which mechanisms allow artifi-
cial learning systems to handle both aspects simultaneously. We will focus on connec-
tionistic systems capable of dealing with symbolic and hybrid data. Our main interests
are twofold: On the one hand, we would like to obtain an efficient learning system
which can be used for practical applications involving hybrid data. On the other hand,
we would like to gain insight into the questions of how symbolic data can be processed
with connectionistic systems in principle; do there exist basic limitations; does this point
of view allow further insight into the black-box dynamics of connectionistic systems?
Due to the nature of symbolic and hybrid data, there exist two ways of asking questions
about the theoretical properties of those mechanisms: the algorithmic point of view and
the statistical point of view. One can, for example, consider the question whether symbolic mechanisms can be learned with hybrid systems exactly; alternatively, the focus can lie on the property that the probability of poor performance on input data can be limited. Generally speaking, one can focus on the symbolic data; alternatively, one can focus on the connectionistic systems. It will turn out that this freedom leads to further insight into the systems as well as additional problems which have to be solved.
Various mechanisms extend connectionistic systems with symbolic aspects; a ma-
jor problem of networks dealing with symbolic or hybrid data lies in the necessity of
processing structures with a priori unlimited size. Mainly three different approaches
can be found in the literature: Symbolic data may be represented by a fixed number of
features and further processed with standard neural networks. Time series, as an exam-
ple, may be represented by a local time window of fixed length and additional global
features such as the overall trend [23]. Formulas may be represented by the involved
symbols and a measure of their complexity. This approach is explicitly static: Data are
encoded in a finite dimensional vector space via problem specific features before further
processing with a connectionistic system. Obviously, the representation of data is not
fitted to the specific learning task since learning is independent of encoding. Moreover,
it may be difficult or in general impossible to find a representation in a finite dimensional vector space such that all relevant information is preserved. As an example, the terms equal(a,a), equal(f(a),f(a)), equal(f(f(a)),f(f(a))), . . . could be represented by the number of occurrences of the symbol f at the first and second position in the respective term. The terms equal(g(a,g(a,a)),g(a,g(a,a))) and equal(g(g(a,a),a),g(g(a,a),a)) can no longer be represented in the same way without loss of information; we have to add an additional part encoding the order of the symbols.
Alternatively, the a priori unlimited structure of the inputs can be mapped to a pri-
ori unlimited processing time of the connectionistic system. Standard neural networks
are equipped with additional recurrent connections for this purpose. Data are processed
in a dynamic way involving the additional dimension of time. This can either be fully
dynamic, i.e., symbolic input and output data are processed over time, the precise dy-
namics and number of recurrent computation steps being unlimited and correlated to
the respective computation; or the model can be partially dynamic and implicitly static,
i.e., the precise dynamics are correlated to the structure of the respective symbolic data
only. In the first case, complex data may be represented via a limiting trajectory of the
system, via the location of neurons with highest activities in the neural system, or via synchronous spike trains, for example. Processing may be based on Hebbian or competitive activation such as in LISA or SHRUTI [15,39] or on an underlying potential which
is minimized such as in Hopfield networks [14]. There exist advanced approaches which
enable complex reasoning or language processing with fully dynamic systems; how-
ever, these models are adapted to the specific area of application and require a detailed
theoretical investigation for each specific approach.
In the second case, the recurrent dynamics directly correspond to the data structure and can be determined precisely provided the input or output structure, respectively, is known. One can think of the processing as an inherently static approach: The recurrence enables the systems to encode or decode data appropriately. After encoding, a standard connectionistic representation is available for the system. The difference to a feature based approach consists in the fact that the encoding is adapted to the specific learning task and need not be separated from the processing part; coding and processing constitute one connected system. A simple example of these dynamics are discrete time recurrent neural networks or Elman networks which can handle sequences of real vectors [6,9]. Knowledge of the respective structure, i.e., the length of the sequence, allows one to substitute the recurrent dynamics by an equivalent standard feedforward network.
Input sequences are processed step by step such that the computation for each entry
is based on the context of the already computed coding of the previous entries of the
sequence. A natural generalization of this mechanism allows neural encoding and de-
coding of tree structured data as well. Instead of linear sequences, one has to deal with
branchings. Concrete implementations of this approach are the recursive autoassocia-
tive memory (RAAM) [30] and labeled RAAM (LRAAM) [40], holographic reduced
representations (HRR) [29], and recurrent and folding networks [7]. They differ in the
method of how they are trained and in the question as to whether the inputs, the outputs, or both may be structured or real valued, respectively. The basic recurrent dynamics are the same for all approaches. The ability to deal with symbolic data, i.e., tree structures, relies on some either fixed or trainable recursive encoding and decoding of data with simple mappings computed by standard networks. Hence the approaches are uniform
and a general theory can be developed in contrast to often very specific fully dynamic
systems. However, the idea is limited to data structures whose dynamics can be mapped
to an appropriate recursive network. This includes recursive data like sequences or tree structures; possibly cyclic graphs are not yet covered.
We will start with the investigation of standard recurrent networks because they are
a well established and successful method and, at the same time, demonstrate a typical
behavior. Their in-principle capacity as well as their learnability can be investigated
from an algorithmic as well as a statistical point of view. From an algorithmic point of
view, the connection to classical approaches like finite automata and Turing machines
is interesting. Moreover, this connection allows partial insight into the way in which
the networks perform their tasks. There are only a few results concerning the learnability
of these dynamics from an algorithmic point of view. Afterwards, we will study the
statistical learnability and approximation ability of recurrent networks. These results
are transferred to various more general approaches for tree structured data.
2 Network Dynamics
First, the basic recurrent dynamics are defined. As usual, a feedforward network con-
sists of a weighted directed acyclic graph of neurons such that a global processing rule
is obtained via successive local computations of the neurons. Commonly, the neurons iteratively compute their activation $a_j = \sigma_j\big(\sum_i w_{ij} a_i + \theta_j\big)$, the sum running over the predecessors $i$ of neuron $j$, $w_{ij}$ denoting the real-valued weight assigned to the connection $i \to j$, $\theta_j \in \mathbb{R}$ denoting the bias of neuron $j$, and $\sigma_j: \mathbb{R} \to \mathbb{R}$ its activation function.
Starting with the neurons without predecessors, the so-called input neurons, which obtain their activation from outside, the neurons successively compute their activation
until the output of the network can be found at some specified output neurons. Hence
feedforward networks compute functions from a finite dimensional real-vector space
into a finite dimensional real-vector space. A network architecture only specifies the
directed graph and the activation functions, but not the weights and biases. Often, so-
called multilayer networks or multilayer architectures are used, meaning that the graph
decomposes into subsets, so-called layers, such that connections can only be found
between consecutive layers. It is well known that feedforward neural networks are uni-
versal approximators in an appropriate sense: Every continuous or measurable function,
respectively, can be approximated by some network with appropriate activation function
on any compact input domain or for inputs of arbitrarily high probability, respectively.
Moreover, such mappings can be learned from a finite set of examples. This, in more
detail, means that two requirements are met. First, neural networks yield valid general-
ization: The empirical error, i.e., the error on the training data, is representative for the
real error of the architecture, i.e., the error for unknown inputs, if a sufficiently large
training set has been taken into account. Concrete bounds on the required training set size can be derived. Second, effective training algorithms for minimizing the empirical
error on concrete training data can be found. Usually, training is performed with some
modification of backpropagation like the very robust and fast method RProp [32].
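As a concrete illustration, the layered computation described above can be sketched in Python with NumPy. The helper names, the 2-3-1 toy architecture, and the random weights are our own illustrative assumptions, not part of the original text.

```python
import numpy as np

def forward(layers, x):
    """Forward pass of a multilayer feedforward network: each layer is a
    triple (W, b, sigma) of weight matrix, bias vector, and activation
    function; the global mapping arises from successive local computations."""
    a = x
    for W, b, sigma in layers:
        a = sigma(W @ a + b)  # activation = sigma(weighted sum + bias)
    return a

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy 2-3-1 architecture with random weights.
rng = np.random.default_rng(0)
net = [(rng.normal(size=(3, 2)), rng.normal(size=3), sigmoid),
       (rng.normal(size=(1, 3)), rng.normal(size=1), sigmoid)]
y = forward(net, np.array([0.5, -1.0]))  # one output value in (0, 1)
```

A network architecture in the sense of the text would fix only the shapes and activation functions here, while training adapts the weight matrices and biases.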
Sequences of real vectors constitute simple symbolic structures. They are difficult
for standard connectionistic methods due to their unlimited length. We denote the set
of sequences with elements in an alphabet $\Sigma$ by $\Sigma^*$. A common way of processing sequences with standard networks consists in truncating, i.e., a sequence $(x_1, \ldots, x_t)$ with initially unknown length $t$ is substituted by only a part $(x_1, \ldots, x_T)$ with a priori fixed time horizon $T$. Obviously, truncation usually leads to information loss. Alternatively, one can equip feedforward networks with recurrent connections and use the further dimension of time. Here, we introduce the general concept of recurrent coding functions. Every mapping with appropriate domain and codomain induces a mapping on sequences or into sequences, respectively, via recursive application as follows:
Definition 1. Assume $\Sigma$ is some set. Any function $f: \Sigma \times \mathbb{R}^n \to \mathbb{R}^n$ and initial context $c_0 \in \mathbb{R}^n$ induce a recursive encoding
$$\mathrm{enc}_f: \Sigma^* \to \mathbb{R}^n, \quad \mathrm{enc}_f(x_1, \ldots, x_t) = \begin{cases} c_0 & \text{if } t = 0, \\ f(x_t, \mathrm{enc}_f(x_1, \ldots, x_{t-1})) & \text{otherwise.} \end{cases}$$
Any function $g = (g_1, g_2): \mathbb{R}^n \to \Sigma \times \mathbb{R}^n$ and final set $F \subseteq \mathbb{R}^n$ induce a recursive decoding
$$\mathrm{dec}_g: \mathbb{R}^n \to \Sigma^*, \quad \mathrm{dec}_g(x) = \begin{cases} (\,) & \text{if } x \in F, \\ (g_1(x), \mathrm{dec}_g(g_2(x))) & \text{otherwise.} \end{cases}$$
Note that $\mathrm{dec}_g(x)$ may be undefined if the decoding does not lead to values in $F$. Therefore one often restricts decoding to the decoding of sequences up to a fixed finite length in practice. Recurrent neural networks compute the composition of up to three functions $\mathrm{dec}_g \circ h \circ \mathrm{enc}_f$, depending on their respective domain and codomain, where $f$, $g$, and $h$ are computed by standard feedforward networks. Note that this notation is somewhat unusual in the literature. Mostly, recurrent networks are defined via their transition
function, referring to the standard dynamics of a discrete dynamical system. However,
the above definition has the advantage that the role of the single network parts can be
made explicit: Symbolic data are first encoded into a connectionistic representation, this
connectionistic representation is further processed with a standard network, finally, the
implicit representation is decoded to symbolic data. In practice, these three parts are not
well separated and one can indeed show that the transformation part can be included
in either encoding or decoding. Encoding and decoding need not compute a precise
encoding or decoding such that data can be restored perfectly. Encoding and decoding
are part of a system which as a whole should approximate some function. Hence only
those parts of the data have to be taken into account which contribute to the specific
learning task. Recurrent networks are mostly used for time series prediction, i.e., the
decoding $\mathrm{dec}_g$ is dropped. Long term prediction of time series, where the decoding part
is necessary, is a particularly difficult task and can rarely be found in applications.
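The recursive coding functions of Definition 1 can be sketched directly in Python. The stack-like toy choice of $f$ and $g$ below is a hypothetical example; note that for this particular pair, decoding restores the encoded sequence in reverse, since the encoding applies $f$ to the last entry on top.

```python
def enc(f, c0, seq):
    """Recursive encoding of Definition 1: enc(()) = c0 and
    enc((x1,...,xt)) = f(xt, enc((x1,...,x_{t-1}))), written iteratively."""
    code = c0
    for x in seq:
        code = f(x, code)
    return code

def dec(g, in_final, code, max_len=100):
    """Recursive decoding: the empty sequence if the code lies in the final
    set F (tested by `in_final`); otherwise the label g(code)[0] followed by
    the decoding of the new context g(code)[1].  As noted in the text,
    decoding is restricted to sequences up to a fixed length in practice."""
    out = []
    while not in_final(code) and len(out) < max_len:
        label, code = g(code)
        out.append(label)
    return out

# Toy instance: f pushes symbols onto a "stack" (a tuple), g pops them back.
f = lambda x, c: (x,) + c
g = lambda c: (c[0], c[1:])
code = enc(f, (), "abc")            # -> ('c', 'b', 'a')
labels = dec(g, lambda c: c == (), code)  # decodes the stack top-down
```

In a trained network, $f$ and $g$ would of course be feedforward networks operating on real vectors rather than exact stack operations.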
A second advantage of the above formalism is the possibility to generalize the dy-
namics to tree structured data. Note that terms and formulas possess a natural representation via a tree structure: The single symbols, i.e., the variables, constants, function
symbols, predicates, and logical symbols are encoded in some real-vector space via
unique values, e.g., natural numbers or unary vectors; these values correspond to the
labels of the nodes in a tree. The tree structure directly corresponds to the structure of
the term or formula; i.e., subterms of a single term correspond to subtrees of a node
Fig. 2. Example for a tree representation of symbolic data: two terms encoded as trees whose node labels are unary vectors such as (1,0,0), (0,1,0), and (0,0,1), each vector uniquely representing one symbol.
equipped with the label encoding the function symbol. See Fig. 2 for an example. In the following, we restrict the maximum arity of functions and predicates to some fixed value $k$. Hence the data we are interested in are trees where each node has at most $k$ successors. Expanding the tree by empty nodes if necessary, we can restrict ourselves to the case of trees where every node has fan-out exactly $k$. Hence we will deal with tree structures with fan-out $k$ as inputs or outputs of network architectures in the following.
Definition 2. A $k$-tree with labels in some set $\Sigma$ is either the empty tree, which we denote by $\xi$, or it consists of a root labeled with some $a \in \Sigma$ and $k$ subtrees $t_1, \ldots, t_k$, some of which may be empty. In the latter case we denote the tree by $a(t_1, \ldots, t_k)$. Denote the set of $k$-trees with labels in $\Sigma$ by $\Sigma^{*k}$.
The recursive nature of trees induces a natural dynamics for recursively encoding
or decoding trees to real vectors. We can define an induced encoding or decoding, re-
spectively, for each mapping with appropriate arity in the following way:
Definition 3. Denote by $\Sigma$ a set. Any mapping $f: \Sigma \times (\mathbb{R}^n)^k \to \mathbb{R}^n$ and initial context $c_0 \in \mathbb{R}^n$ induce a recursive encoding
$$\mathrm{enc}_f: \Sigma^{*k} \to \mathbb{R}^n, \quad t \mapsto \begin{cases} c_0 & \text{if } t = \xi, \\ f(a, \mathrm{enc}_f(t_1), \ldots, \mathrm{enc}_f(t_k)) & \text{if } t = a(t_1, \ldots, t_k). \end{cases}$$
Any mapping $g = (g_0, g_1, \ldots, g_k): \mathbb{R}^n \to \Sigma \times (\mathbb{R}^n)^k$ and set $F \subseteq \mathbb{R}^n$ induce a recursive decoding
$$\mathrm{dec}_g: \mathbb{R}^n \to \Sigma^{*k}, \quad x \mapsto \begin{cases} \xi & \text{if } x \in F, \\ g_0(x)(\mathrm{dec}_g(g_1(x)), \ldots, \mathrm{dec}_g(g_k(x))) & \text{otherwise.} \end{cases}$$
Again, $\mathrm{dec}_g$ might be a partial function. Therefore decoding is often restricted to the decoding of trees up to a fixed height in practice. The encoding recursively applies a mapping in order to obtain a code for a tree in a real-vector space. One starts at the leaves and recursively encodes the single subtrees. At each level the already computed codes of the respective subtrees are used as context. The recursive decoding is defined in a similar manner: Recursively applying some decoding function to a real vector yields the label of the root and codes for the $k$ subtrees. In the connectionistic setting, the two mappings used for encoding or decoding, respectively, can be computed by standard feedforward neural networks. As in the linear case, i.e., the case of simple recurrent networks, one can combine the mappings $\mathrm{enc}_f$, $\mathrm{dec}_g$, and $h$ depending on the specific learning task.
Note that this definition constitutes a natural generalization of standard recurrent
networks and hence allows for successful practical applications as well as general in-
vestigations concerning concrete learning algorithms, the connection to classical mech-
anisms like tree automata, and the theoretical properties of approximation ability and
learnability. However, it is not biologically motivated compared to standard recurrent
networks, and though this approach can shed some light on the possibility of dealing
with structured data in connectionistic systems, it does not necessarily enlighten the
way in which humans solve these tasks. We will start with a thorough investigation of
simple recurrent networks since they are biologically plausible and, moreover, signifi-
cant theoretical difficulties and benefits can already be found at this level.
3 Recurrent Neural Networks
Recurrent networks are a natural tool in any domain where time plays a role, such as
speech recognition, control, or time series prediction, to mention just a few [8,9,25,41]. They are also used for the classification of symbolic data such as DNA sequences [31].
Turing Capabilities
The fact that their inputs and outputs may be sequences suggests the comparison to
other mechanisms operating on sequences, such as classical Turing machines. One can
consider the internal states of the network as a memory or tape of the Turing machine.
Note that the internal states of the network may consist of real values, hence an infinite
memory is available in the network. In Turing machines, operations on the tape are per-
formed. Each operation can be simulated in a network by a recursive computation step
of the transition function. In a Turing machine, the end of a computation is indicated by
a specific final state. In a network, this behavior can be mimicked by the activation of
some specific neuron which indicates whether the computation is finished or still con-
tinues. The output of the computation can be found at the same time step at some other
specified neuron of the network. Note that computations of a Turing machine which do
not halt correspond to recursive computations of the network such that the value of the
specified halting neuron is different from some specified value. A schematic view of
such a computation is depicted in Fig. 3. A possible formalization is as follows:
Definition 4. A (possibly partial) function $F: \{0,1\}^* \to \{0,1\}^*$ can be computed by a recurrent neural network if feedforward networks $f: \{0,1\} \times \mathbb{R}^n \to \mathbb{R}^n$, $h: \mathbb{R}^n \to \mathbb{R}^n$, and $g: \mathbb{R}^n \to \mathbb{R}$ exist such that $F(x) = g(h^{t(x)}(\mathrm{enc}_f(x)))$ for all sequences $x$, where $t(x)$ denotes the smallest number of iterations such that, after iteratively applying $h$ to $\mathrm{enc}_f(x)$, the activation of some specified output neuron of the $h$ part is contained in a specified set encoding the end of the computation.
Note that simulations of Turing machines are merely of theoretical interest; such computation mechanisms will not be used in practice. However, the results shed some light on the power of recurrent networks. The network's capacity naturally depends on the choice of the activation functions. Common activation functions in the literature are piecewise polynomial or S-shaped functions such as:
Fig. 3. Turing computation with a recurrent network.
- the perceptron function: $H(x) = 0$ for $x \le 0$, $H(x) = 1$ for $x > 0$,
- the semilinear activation: $\mathrm{lin}(x) = x$ for $0 \le x \le 1$, $\mathrm{lin}(x) = H(x)$ otherwise,
- the sigmoidal function: $\mathrm{sgd}(x) = (1 + e^{-x})^{-1}$.
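These three activation functions read as follows in Python; the boundary convention at $x = 0$ for the perceptron function is our assumption, since the source is ambiguous there.

```python
import math

def perceptron(x):
    """Heaviside step: H(x) = 0 for x <= 0, H(x) = 1 for x > 0."""
    return 0.0 if x <= 0 else 1.0

def semilinear(x):
    """lin(x) = x on [0, 1], saturating at 0 and 1 outside that interval."""
    return min(max(x, 0.0), 1.0)

def sgd(x):
    """Standard sigmoid: sgd(x) = (1 + e^{-x})^{-1}."""
    return 1.0 / (1.0 + math.exp(-x))
```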
Obviously, recurrent networks with a finite number of neurons and the perceptron activation function have at most the power of finite automata, since their internal state set is finite. In [38] it is shown that recurrent networks with the semilinear activation function
are Turing universal, i.e., there exists for every Turing machine a finite size recurrent
network which computes the same function. The proof consists essentially in a simu-
lation of the stacks corresponding to the left and right half of the Turing tape via the
activation of two neurons. Additionally, it is shown that standard tape operations like
push and pop and Boolean operations can be computed with a semilinear network.
The situation is more complicated for the standard sigmoidal activation function since
exact classical computations, which require precise activations $0$ or $1$, as an example,
can only be approximated within a sigmoidal network. Hence the approximation errors
which add up in recurrent computations must be controlled. [16] shows the Turing uni-
versality of sigmoidal recurrent networks via simulating so-called clock machines, a
Turing-universal formalism which, unfortunately, leads to an exponential delay. However, it is commonly believed that standard sigmoidal recurrent networks are Turing universal with polynomial resources, too, although a formal proof is still missing.
In [37] the converse direction, the simulation of neural network computations with classical mechanisms, is investigated.
so-called non-uniform Boolean circuits. This is particularly interesting due to the fact
that non-uniform circuits are super-Turing universal, i.e., they can compute every (possibly non-computable) function, possibly requiring exponential time. Speaking in terms of neural networks: In addition to the standard operations, networks can use the unlimited storage capacity of the single digits in their real weights as an oracle; a linear number of such digits is available in linear time. Again, the situation is more difficult for the sigmoidal activation function. The super-Turing capability is demonstrated in [36], for example. [11] shows super-Turing universality in possibly exponential time with, as is necessary in every demonstration of super-Turing capability of recurrent networks, at least one irrational weight. Note that the latter results rely on an additional severe assumption: The operations on the real numbers are performed with infinite precision. Hence further investigation could naturally be put in line with the theory of computation on the real numbers [3].
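The computation scheme of Definition 4 can be sketched as a loop: iterate the transition network until a designated halting neuron fires, then read the output. The toy "network" below, which merely computes the parity of the input length, is an illustrative assumption, not a construction from the cited papers.

```python
def run_recurrent(enc_f, h, g, halt, x, max_steps=10_000):
    """Computation in the sense of Definition 4: encode the input sequence,
    iterate the transition map h until the halting test signals the end of
    the computation, then read the output via g.  Non-halting Turing
    computations correspond to the loop never ending, hence the explicit
    step bound here."""
    state = enc_f(x)
    for _ in range(max_steps):
        if halt(state):
            return g(state)
        state = h(state)
    return None  # computation did not halt within the bound

# Toy instance: the state is (counter, parity); h decrements the counter
# and flips the parity; the computation halts when the counter reaches 0.
enc_f = lambda x: (len(x), 0)
h = lambda s: (s[0] - 1, 1 - s[1])
parity = run_recurrent(enc_f, h, lambda s: s[1], lambda s: s[0] == 0, [1, 0, 1])
```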
Finite Automata and Languages
The transition dynamics of recurrent networks directly correspond to finite automata; hence the comparison to finite automata is very natural. For a formal definition, a finite automaton with $m$ states computes a function $g \circ \mathrm{enc}_\delta: \Sigma^* \to \{0,1\}$, where $\Sigma$ is a finite alphabet, $\delta: \Sigma \times \{1, \ldots, m\} \to \{1, \ldots, m\}$ is a transition function mapping an input letter and a context state to a new state, $s_0 \in \{1, \ldots, m\}$ is the initial state, and $g$ is a projection of the states to $\{0,1\}$. A language $L \subseteq \Sigma^*$ is accepted by an automaton if some automaton computing $g \circ \mathrm{enc}_\delta$ exists such that $L = \{x \in \Sigma^* \mid g(\mathrm{enc}_\delta(x)) = 1\}$.
Since neural networks are far more powerful (they are even super-Turing universal), it is not surprising that finite automata, and some context sensitive languages, too, can be simulated by recurrent networks. However, automata simulations have practical consequences: The constructions lead to effective techniques of automata rule insertion and extraction; moreover, the automaton behavior is even learnable from data, as demonstrated in computer simulations. It has been shown in [27], for example, that finite automata can be simulated by recurrent networks of the form $g \circ \mathrm{enc}_f$, $g$ being a simple projection and $f$ being computed by a standard feedforward network. The number of neurons which are sufficient in $f$ is upper bounded by a linear term in $m$, the number of states of the automaton. Moreover, the perceptron activation function, the sigmoidal activation function, or any other function with similar properties will do. One could ask whether
fewer neurons are sufficient, since one could encode $m$ states in the activation of only $\log m$ binary valued neurons. However, an abstract argument shows that, at least for perceptron networks, a number of $\Omega(\sqrt{m/\log m})$ neurons is necessary. Since this argument can be used at several places, we shortly outline the main steps:
The set of finite automata with $m$ states and binary inputs defines a class of functions computable with such a finite automaton, say $\mathcal{F}_m$. Assume a network with at most $N$ neurons could implement every $m$-state finite automaton. Then the class of functions computable with $N$-neuron architectures, say $\mathcal{F}_N$, would be at least as powerful as $\mathcal{F}_m$. Consider the $m$ sequences of length $m$ with an entry $1$ precisely at the $i$th position, $i = 1, \ldots, m$. Assume some arbitrary binary function $f$ is fixed on these sequences. Then there exists an $O(m)$-state automaton which implements $f$ on the sequences: We can use $m$ states for counting the position of the respective input entry, and we map to a specified final accepting state whenever the corresponding function value is $1$. As a consequence, we need to find, for those $m$ sequences and every dichotomy, some recurrent network which maps the sequences accordingly, too. However, the number of input sequences which can be mapped to arbitrary values is upper bounded by the so-called pseudodimension, a quantity measuring the richness of function classes, as we will see later. In particular, this quantity can be upper bounded by a term $O(w \log(w t))$ for perceptron networks with input sequences of length $t$ and $w$ weights. Hence the stated lower bound follows.
However, various researchers have demonstrated in theory as well as in practice that sigmoidal recurrent networks can recognize some context sensitive languages as well: It is proved in [13] that they can perform counting, i.e., recognize languages of the form $\{a^n b^n c^n \mid n \in \mathbb{N}\}$ or, generally speaking, languages where the multiplicities of various symbols have to match. Approaches like [20] demonstrate that a finite approximation of these languages can be learned from a finite set of examples. This capacity is of particular interest due to its importance for the capability of understanding natural languages with nested structures. The learning algorithms are usually standard algorithms for recurrent networks which we will explain later. Commonly, they do not guarantee the
correct long-term behavior of the networks, i.e., they lead only sometimes to the correct
behavior for long input sequences, although they perform surprisingly well on short
training samples. Learnability, for example in the sense of identification in the limit as introduced by Gold, is not guaranteed. Approaches which explicitly tackle the long term behavior and which, moreover, allow for a symbolic interpretation of the connectionistic processing are automata rule insertion and extraction: The possibly partial explicit knowledge of the automaton's behavior can be directly encoded in a recurrent network
used for connectionistic processing, if necessary with further retraining of the network.
Conversely, automata rules can be extracted from a trained network which describe the
behavior approximately and generalize to arbitrarily long sequences [5,26].
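The counting mechanism behind languages like {a^n b^n | n >= 0} can be sketched by hand (a minimal illustration in plain code; the function name and the two state variables are my own choices, standing in for the linear counting units of [13]):

```python
# Hand-set "counter network" for the language { a^n b^n | n >= 0 }:
# count plays the role of a linear recurrent unit computing #a - #b,
# b_seen is a saturating unit latching whether a 'b' has occurred.
def recognize_anbn(seq):
    count, b_seen = 0, 0
    for sym in seq:
        if sym == 'a':
            if b_seen:          # an 'a' after some 'b': not in a^n b^n
                return False
            count += 1          # the counter unit adds +1 on 'a'
        else:
            b_seen = 1          # the latch saturates at 1
            count -= 1          # ... and the counter subtracts 1 on 'b'
            if count < 0:       # more b's than a's: reject early
                return False
    return count == 0           # accept iff the counter returns to zero

assert recognize_anbn('aaabbb') and not recognize_anbn('aabbb')
```

A single counter suffices for arbitrary n, which is exactly what a finite automaton cannot provide.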
However, all these approaches are naturally limited due to the fact that common connectionistic data are subject to noise. Adequate recursive processing relies to some extent on the accuracy of the computation and the input data. The capacity is different if noise is present: At most finite state automata can be realized if the support of the noise is limited. If the support of the noise is not limited, e.g. if the noise is Gaussian, then the capacity reduces to the capacity of simple feedforward dynamics with a finite time window [21,22]. Hence, while recurrent networks can algorithmically process symbolic data in a finite approximation, the presence of noise limits their capacities.
Learning Algorithms
Naturally, an alternative point of view is the classical statistical scenario, i.e., possibly noisy data allow one to learn an unknown regularity with high accuracy and confidence for data of high probability. In particular, the behavior need not be correct for every input; the learning algorithms are only guaranteed to work well in typical cases, and in unlikely situations the system may fail. The classical PAC setting as introduced by Valiant formalizes this approach to learnability [42] as follows: Some unknown regularity, for which only a finite set of examples (x_1, y_1), ..., (x_m, y_m) is available, is to be learned. A learning algorithm chooses a function from a specified class of functions, e.g. given by a neural architecture, based on the training examples. There are two demands: The output of the algorithm should nearly coincide with the unknown regularity; mathematically, the probability that the algorithm outputs a function which differs considerably from the function to be learned should be small. Moreover, the algorithm should run in polynomial time, the parameters being the desired accuracy and confidence of the algorithm.
Usually, learning separates into two steps as depicted in Fig. 4: First, a function class with limited capacity is chosen, e.g. the number of neurons and weights is fixed, such that the function class is large enough to approximate the regularity to be learned and, at the same time, allows identification of an approximation based on the available training set, i.e., guarantees valid generalization to unseen samples. This is commonly addressed by the term structural risk minimization and obtained via a control of the so-called pseudodimension of the function class. We will address this topic later. In a second step, a concrete regularity is actually searched for in the specified function class, commonly via so-called empirical risk minimization, i.e., a function is chosen
[Figure: nested function classes of increasing complexity; the function to be learned and its empirical approximation; first step: choose a function class; second step: minimize the empirical error, yielding the output of the algorithm; the generalization error is indicated.]
Fig. 4. Structural and empirical risk minimization
which nearly coincides with the regularity to be learned on the training examples. According to these two steps, the generalization error divides into two parts: the structural error, i.e., the deviation of the empirical error on a finite set of data from the overall error for functions in the specified class, and the empirical error, i.e., the deviation of the output function from the regularity on the training set.

We shortly summarize various empirical risk minimization techniques for recurrent neural networks: Assume (x_1, y_1), ..., (x_m, y_m) are the training data and some neural architecture computing a function f_w, parameterized by the weights w, is chosen. Often, training algorithms choose appropriate weights w by minimizing the quadratic error sum_i d(f_w(x_i), y_i)^2, d being some appropriate distance, e.g. the Euclidean distance. Since in popular cases this term is differentiable with respect to the weights, a simple gradient descent can be used. The derivative with respect to one weight decomposes into various terms according to the sequential structure of the inputs and outputs, i.e., the number of recursive applications of the transition functions. A direct recursive computation of the single terms has complexity O(|w|^2 t), |w| being the number of weights and t the number of recurrent steps. In so-called real time recurrent learning, these weight updates are performed immediately after the computation such that initially unlimited time series can be processed. This method can be applied in online learning in robotics, for example. In analogy to standard backpropagation, the most popular learning algorithm for feedforward networks, one can speed up the computation and obtain the derivatives in time O(|w| t) via first propagating the signals forward through the entire network and all recursive steps and afterwards propagating the error signals backwards through the network and all recursive steps. However, the possibility of online adaptation while a sequence is still being processed is lost in this so-called backpropagation through time [28,44]. There exist combinations of both methods and variations for training continuous systems [33]. The true gradient is sometimes substituted by a truncated gradient in earlier approaches [6]. Since theoretical investigation suggests that pure gradient descent techniques will likely suffer from numerical instabilities (the gradients either blow up or vanish during propagation through the recursive steps), alternative methods propose random guessing, statistical approaches like the EM algorithm, or an explicit normalization of the error as in LSTM [1,12].
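The accumulation of gradient terms over the recursive steps, and their geometric decay, can be seen in a scalar sketch (a minimal setup of my own, not code from the paper): backpropagation through time multiplies one local derivative per step into the error signal, so the sensitivity to early inputs shrinks or grows geometrically.

```python
import numpy as np

# Scalar RNN h_t = tanh(w*h_{t-1} + u*x_t) with loss L = h_T.
# BPTT accumulates one gradient term per time step; the running factor
# prod_k w*(1 - h_k^2) makes gradients vanish (|.|<1) or blow up (|.|>1).
def bptt_grad(w, u, xs):
    hs = [0.0]
    for x in xs:                      # forward pass through all steps
        hs.append(np.tanh(w * hs[-1] + u * x))
    grad_w, delta = 0.0, 1.0          # delta = dL/dh_t, moving backwards
    for t in range(len(xs), 0, -1):
        local = 1.0 - hs[t] ** 2      # derivative of tanh at step t
        grad_w += delta * local * hs[t - 1]
        delta *= local * w            # propagate the error one step back
    return grad_w, abs(delta)         # |delta| ~ sensitivity to h_0

# The sensitivity of the output to the initial state decays with length:
_, s10 = bptt_grad(0.5, 1.0, [1.0] * 10)
_, s50 = bptt_grad(0.5, 1.0, [1.0] * 50)
assert s50 < s10 < 1.0
```

For |w| times the local derivative below one, the product underflows quickly, which is the vanishing-gradient problem mentioned above.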
Practice shows that training recurrent networks is harder than training feedforward
networks due to numerically ill-behaved gradients as shown in [2]. Hence the com-
plexity of training recurrent networks is a very interesting topic; moreover, the fact that
the empirical error can be minimized efficiently is one ingredient of PAC learnabil-
ity. Unfortunately, precise theoretical investigations can be found only for very limited
situations: It has been proved that fixed recurrent architectures with the perceptron ac-
tivation function can be trained in polynomial time [11]. Things change if architectural
parameters are allowed to vary. This means that the number of input neurons, for example, may change from one training problem to the next since most learning algorithms are uniform with respect to the architectural size. In this case, almost every realistic situation is NP-hard already for feedforward networks, although this has not yet been proved for a sufficiently general scenario. One recent result reads as follows: Assume a multilayer perceptron architecture is given where the number of input neurons is allowed to vary from one instance to the next, the input biases are dropped, and no solution without errors exists. Then it is NP-hard to find a network such that the number of misclassified points of the network, compared to the optimum achievable number, is limited by a term which may even be exponential in the network size [4]. People are working on adequate generalizations to more general or typical situations.
Approximation Ability
The ability of recurrent neural networks to simulate Turing machines manifests their
enormous capacity. From a statistical point of view, we are interested in a slightly different question: Given some finite set of examples (x_i, y_i), where the inputs x_i or outputs y_i may be sequences, does there exist a network which maps each x_i approximately onto the corresponding y_i? Which resources are required? If there is an underlying mapping, can it be approximated in an appropriate sense, too? The difference to the previous argumentation consists in the fact that there need not be a recursive underlying regularity producing y_i from x_i. At the same time we do not require to interpolate or simulate the underlying, possibly non-recursive behavior precisely in the long term limit.
One way to attack the above questions consists in a division of the problem into three parts: it is to be shown that sequences can be encoded and decoded, respectively, with a neural network, and that the induced mapping on the connectionistic representation can be approximated with a standard feedforward network. There exist two natural ways of encoding sequences in a finite dimensional vector space: Sequences of length at most t can be written in a vector space of dimension t, filling the empty spaces, if any, with a blank entry; we refer to this coding as vector-coding. Alternatively, the single entries in a sequence can be cut to a fixed precision and concatenated in a single real number; we refer to this method as real-value-coding. Hence, cutting entries to one decimal digit, the sequence [0.1, 0.2, 0.3, 0.4] becomes, for maximum length 7, the vector (0.1, 0.2, 0.3, 0.4, 0, 0, 0) or the real number 0.1234, as an example. One can show that both codings can be computed with a recurrent network. Vector-encoding and decoding can be performed with a network whose number of neurons is linear in the maximum input length and which possesses an appropriate activation function. Real-value-encoding is possible with only a fixed number of neurons for purely symbolic data, i.e., inputs from a finite alphabet. Sequences of real values require additional neurons which compute the discretization of the real values. Naturally, precise decoding of the discretization is not possible since this information is lost in the coding. Encoding such that unique codes result can be performed with O(m) neurons, m being the number of sequences to be encoded. Decoding real-value codes is possible,
too. However, a standard activation function requires a number of neurons increasing
with the maximum length even for symbolic data [11].
It is well known that feedforward networks with one hidden layer and appropriate
activation function are universal approximators. Hence one can conclude that approx-
imation of general functions is possible if the above encoding or decoding networks
are combined with a standard feedforward network which approximates the induced
mappings on the connectionistic codes. To be more precise, approximating measurable functions on inputs of arbitrarily high probability is possible through real-value encoding. Each continuous function can be approximated for inputs from a compact set through vector-encoding. In the latter case, the dimension used for the connectionistic representation necessarily increases with increasing length of the sequences [11].
Learnability
Having settled the universal approximation ability, we should make sure that the structural risk can be controlled within a fixed neural architecture, i.e., we have to show that a finite number of training examples is sufficient in order to nearly specify the unknown underlying regularity. Assume some probability measure P on the inputs is fixed. For the moment assume that we deal with real-valued outputs only. Then one standard way to guarantee the above property for a function class F is via the so-called uniform convergence of empirical distances (UCED) property, i.e.,

P^m( x : sup_{f in F} |d_P(f, g) - d_m(f, g, x)| > epsilon ) -> 0  (m -> infinity)

holds for every epsilon > 0, where d_P(f, g) = integral |f(x) - g(x)| dP(x) is the real error and d_m(f, g, x) = (1/m) sum_{i=1}^m |f(x_i) - g(x_i)| is the empirical error on a sample x = (x_1, ..., x_m). The UCED property guarantees that the empirical error of any learning algorithm is representative of the real generalization error. We refer to the above distance as the risk. A standard way to prove the UCED property consists in an estimation of a combinatorial quantity, the pseudodimension.
Definition 5. The pseudodimension of a function class F, VC(F), is the largest cardinality (possibly infinite) of a set of points which can be shattered. A set of points x_1, ..., x_m is shattered if reference points r_1, ..., r_m in R exist such that for every subset I of {x_1, ..., x_m} some function f in F exists with f(x_i) >= r_i precisely if x_i is contained in I.

The pseudodimension measures the richness of a function class: it is the size of the largest set of points such that every possible binary function can be realized on these points relative to the reference points. No generalization can be expected if a training set can be shattered. It is well known that the UCED property holds if the pseudodimension of a function class is finite [43]. Moreover, the number of examples required for valid generalization can be explicitly limited by roughly the order d/epsilon, d being the pseudodimension and epsilon the required accuracy.

Assume F is given by a recurrent architecture with w weights. Denote by F_t the restriction to inputs of length at most t. Then one can limit VC(F_t) by a polynomial in t and w. However, lower bounds exist which show that the pseudodimension necessarily depends on t in most interesting cases [17]. Hence VC(F) is infinite for unrestricted
sequences. As a consequence, the above argumentation proves learnability only for re-
stricted inputs. Moreover, since a finite pseudodimension (more precisely, a finite so-
called fat-shattering dimension) is necessary for distribution independent learnability
under realistic conditions, distribution independent bounds for the risk cannot exist in
principle [43]. Hence one has to add special considerations to the standard argumenta-
tion for recurrent architectures. Mainly two possibilities can be found in the literature:
One can either take specific knowledge about the underlying probability into consider-
ation, or one can derive posterior bounds which depend on the specific training set. The
results are as follows [11]:
Assume epsilon > 0 and one can find t such that the probability of sequences of length greater than t is bounded from above by epsilon. Then the risk is limited by roughly epsilon provided that the number of examples is roughly of order d_t/epsilon, d_t being the (finite) pseudodimension of the architecture restricted to input sequences of length t.

Assume training on a set of size m with maximum length t has been performed. Then the risk can be bounded by a term of roughly order sqrt(d_t/m), d_t being the (finite) pseudodimension of the architecture restricted to input sequences of length at most t. A more detailed analysis even allows one to drop the long sequences before measuring t [10].
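The shattering condition of Definition 5 can be checked by brute force for small point sets (an illustration with hypothetical names; a finite sample of functions can only certify shattering, never refute it for the full class):

```python
# A point set is shattered w.r.t. reference points r_i if every sign
# pattern of (f(x_i) >= r_i) is realized by some f in the class.
def shatters(functions, points, refs):
    patterns = {tuple(f(x) >= r for x, r in zip(points, refs))
                for f in functions}
    return len(patterns) == 2 ** len(points)

# Affine functions x -> a*x + b on the line realize all four sign
# patterns on two points, so their pseudodimension is at least 2.
affine = [lambda x, a=a, b=b: a * x + b
          for a in (-1, 0, 1) for b in (-1, 0, 1)]
assert shatters(affine, [0.0, 1.0], refs=[0.0, 0.0])
```

The default-argument trick in the lambda pins down a and b per function; without it all closures would share the loop's final values.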
Hence one can guarantee valid generalization, although only with additional con-
siderations compared to the feedforward case. Moreover, there may exist particularly
ugly situations for recurrent networks where training is possible only with an exponen-
tially increasing number of training examples [11]. This is the price one has to pay for
the possibility of dealing with structured data, in particular data with a priori unlimited
length. Note that the above argumentation holds only for architectures with real val-
ues as outputs. The case of structured outputs requires a more advanced analysis via so-called loss functions and yields similar results [10].
4 Advanced Architectures
The next step is to go from sequences to tree structured data. Since trees cover terms and formulas, this is a fairly general approach. The network dynamics and theoretical investigations are direct generalizations of simple recurrent networks. One can obtain a recursive neural encoding enc from labeled trees to a real vector space and a recursive neural decoding dec from the real vector space back to labeled trees if enc and dec are induced by standard networks. These codings can be composed with standard networks for the approximation of general functions. Depending on whether the inputs, the outputs, or both may be structured and depending on which part is trainable, we obtain different connectionistic mechanisms. A sketch of the first two mechanisms which are described in the following can be found in Fig. 5.
Recursive Autoassociative Memory
The recursive autoassociative memory (RAAM) as introduced by Pollack and generalized by Sperduti and Starita [30,40] consists of a recursive encoding enc and a recursive decoding dec, both induced by standard feedforward networks, together with a further standard feedforward network. An appropriate composition of these parts can approximate mappings where the inputs or the outputs may be k-trees or vectors, respectively. Training proceeds in
[Figure: a RAAM encodes a tree with labels a, b, c, d, e, f into a vector and decodes it again; a folding network only encodes the tree into a vector x.]
Fig. 5. Processing tree structures with connectionistic methods.
two steps: first, the composition dec o enc is trained on the identity on a given training set with truncated gradient descent such that the two parts constitute a proper encoding and decoding, respectively. Afterwards, a standard feedforward network is combined with either the encoding or the decoding and trained via standard backpropagation where the weights in the recursive coding are fixed. Hence arbitrary mappings on structured data can be approximated. Note that the encoding is fitted to the specific training set; it is not fitted to the specific approximation task. In all cases encoding and decoding must be learned even if only the inputs or only the outputs are structured.
In analogy to simple recurrent networks the following questions arise: Can any mapping be approximated in principle? Do the respective parts show valid generalization? Is training efficient? We will not consider the efficiency of training in the following since the question is not yet satisfactorily answered even for feedforward networks. The other questions are to be answered for both the coding parts and the feedforward approximation on the encoded values. Note that the latter task only deals with standard feedforward networks whose approximation and generalization properties are well established. Concerning the approximation capability of the coding parts we can borrow ideas from recurrent networks: A natural encoding of tree structured data consists in the prefix representation of a tree. For example, the 2-tree a(b(c, d), e) can uniquely be represented by the sequence [a, b, c, #, #, d, #, #, e, #, #] including the empty tree #. Depending on whether real-labeled trees are to be encoded precisely, or only symbolic data, i.e., labels from a finite alphabet, are dealt with, or a finite approximation of the real values is sufficient, the above sequence can be encoded in
a real-value code with a fixed dimension or a vector code whose dimension depends on the maximum height of the trees. The respective encoding or decoding can be computed with recursive architectures induced by a standard feedforward network [11]. The required resources are as follows: Vector-coding requires a number of neurons which increases exponentially with the maximum height of the trees. Real-value-encoding requires only a fixed number of neurons for symbolic data and a number of neurons which is quadratic in the number of patterns for real-valued labels. Real-value-decoding requires a number of neurons which increases with the height of the trees; the argument consists in a lower bound on the pseudodimension of function classes which perform proper decoding, and this number increases more than exponentially in the height. Learning the coding yields valid generalization provided prior information about the input distribution is available. Alternatively, one can derive posterior bounds on the generalization error which depend on the concrete training set. These results follow in the same way as for standard recurrent networks. Hence the RAAM constitutes a promising and in principle applicable mechanism. Due to the difficulty of proper decoding, applications can be found for small training examples only [40].
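The prefix representation used above can be sketched for labeled binary trees (a minimal illustration; the tuple encoding and the blank symbol '#' for the empty tree are my own conventions):

```python
# Prefix representation of labeled binary trees: a tree is a
# (label, left, right) tuple, None is the empty tree, written '#'.
def prefix_code(tree, blank='#'):
    if tree is None:
        return [blank]
    label, left, right = tree
    return [label] + prefix_code(left, blank) + prefix_code(right, blank)

def prefix_decode(seq, blank='#'):
    # consume the sequence left to right, rebuilding the tree
    def parse(i):
        if seq[i] == blank:
            return None, i + 1
        left, j = parse(i + 1)
        right, k = parse(j)
        return (seq[i], left, right), k
    tree, _ = parse(0)
    return tree

t = ('a', ('b', ('c', None, None), ('d', None, None)), ('e', None, None))
assert prefix_code(t) == ['a', 'b', 'c', '#', '#', 'd', '#', '#',
                          'e', '#', '#']
assert prefix_decode(prefix_code(t)) == t
```

The round trip shows that the prefix sequence determines the tree uniquely, which is what makes it a suitable intermediate representation for recursive coding networks.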
Folding Networks
Folding networks use ideas of the LRAAM [19]. They focus on clustering symbolic
data, i.e., the outputs are not structured, but real vectors. This limitation makes decoding
superfluous. For training, the encoding part and the feedforward network are composed
and simultaneously trained on the respective task via a gradient descent method, so-
called backpropagation through structure, a generalization of backpropagation through
time. Hence the encoding is fitted to the data and the respective learning task.
It follows immediately from the above discussion that folding networks can approximate every measurable function in probability using real-value codes, and they can approximate every continuous function on compact input domains with vector codes. Additionally, valid generalization can be guaranteed with a similar argumentation as above, with bounds depending on the input distribution or the concrete training set.
Due to the fact that the difficult part, proper decoding, is dropped, several applica-
tions of folding networks for large data sets can be found in the literature: classification
of terms and formulas, logo recognition, drug design, support of automatic theorem
provers, . . . [19,34,35]. Moreover, they can be related to finite tree automata in analogy
to the correlation of recurrent networks and finite automata [18].
Holographic Reduced Representation
Holographic reduced representation (HRR) is identical to a RAAM with a fixed encoding and decoding: a priori chosen functions given by so-called circular correlation and convolution, respectively [29]. Correlation (denoted by o) and convolution (denoted by *) constitute a specific way to combine two vectors into a vector of the same dimension such that correlation and convolution are approximately inverse to each other, i.e., a o (a * b) is approximately b. Hence one can encode a tree t(t_1, t_2) via computing the convolution of each entry with a specific vector indicating the role of the component and adding these three vectors: r_0 * t + r_1 * enc(t_1) + r_2 * enc(t_2), r_0, r_1, r_2 being the roles. The single entries can be
approximately restored via correlation: r_1 o (r_0 * t + r_1 * enc(t_1) + r_2 * enc(t_2)) is approximately enc(t_1). One can compute the deviation in the above equation under statistical assumptions. Commonly, the restored values are accurate provided the dimension of the vectors is sufficiently high, the height of the trees is limited, and the vectors are additionally cleaned up in an associative memory. It follows immediately from our above argumentation that these three conditions are necessary: decoding is a difficult task which requires exponentially increasing resources for standard computations. HRR is used in the literature for storing and recognizing language [29]. Since encoding and decoding are fixed, no further investigation of the approximation or generalization ability is necessary.
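Circular convolution, circular correlation, and the approximate unbinding they provide can be illustrated numerically (a sketch under my own naming; the FFT identities compute exactly the circular operations, while the dimension and random seed are arbitrary choices):

```python
import numpy as np

# Circular convolution (binding) and circular correlation (approximate
# unbinding), computed via the Fourier transform.
def conv(a, b):
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

def corr(a, b):
    return np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)).real

rng = np.random.default_rng(0)
n = 512                      # a high dimension keeps the crosstalk small
role1, role2, filler1, filler2 = rng.normal(0.0, 1.0 / np.sqrt(n), (4, n))

trace = conv(role1, filler1) + conv(role2, filler2)   # bind one tree level
restored = corr(role1, trace)                         # approx. filler1
assert np.dot(restored, filler1) > np.dot(restored, filler2)
```

Each unbinding step adds noise, which is why deep trees need a clean-up memory and a sufficiently high dimension, as stated above.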
5 Conclusions
Combinations of symbolic and connectionistic systems, more precisely, connectionistic
systems processing symbolic data have been investigated. A particular difficulty con-
sists in the fact that the informational content of symbolic data is not limited a priori. Hence a priori unlimited length is to be mapped to a connectionistic vector representation. We have focused on recurrent systems which map the unlimited length to a
priori unlimited processing time. Simple recurrent neural networks constitute a well
established model. Apart from the simplicity of the data they process, sequences, the
main theoretical properties are the same as for advanced mechanisms. One can inves-
tigate algorithmic or statistical aspects of learning, the first ones being induced by the
nature of the data, the second ones by the nature of the connectionistic system. We
covered algorithmic aspects mainly in comparison to standard mechanisms. The enormous capacity of recurrent networks has become apparent, although it is of merely theoretical interest. Concerning statistical learning theory, satisfactory results for the universal
approximation capability and the generalization ability have been established, although
generalization can only be guaranteed if specifics of the data are taken into account.
The idea of coding leads to an immediate generalization to tree structured data.
Well established approaches like RAAM, HRR, and folding networks fall within this
general definition. The theory established for recurrent networks can be generalized to
these advanced approaches immediately. The in-principle statistical learnability of these
mechanisms follows. However, some specific situations might be extremely difficult:
Decoding requires an increasing amount of resources. Hence the RAAM is applicable
for small data only, decoding in HRR requires an additional cleanup, whereas folding
networks can be found in real world applications.
Nevertheless, the results are encouraging since they prove the possibility to process
symbolic data with neural networks and constitute a theoretical foundation for the suc-
cess of some of the above mentioned methods. Unfortunately, the general approaches
generalize neither to cyclic structures like graphs, nor do they provide biological plausibility that could help explain human recognition of these data. For both aspects, fully dynamic approaches would be more promising, although it would be more difficult to find effective training algorithms for practical applications.
References
1. Y. Bengio and P. Frasconi. Credit assignment through time: Alternatives to backpropaga-
tion. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information
Processing Systems, Volume 5. Morgan Kaufmann, 1994.
2. Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient de-
scent is difficult. IEEE Transactions on Neural Networks, 5(2), 1994.
3. L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and Real Computation. Springer,
1998.
4. B. DasGupta and B. Hammer. On approximate learning by multi-layered feedforward
circuits. In: H. Arimura, S. Jain, A. Sharma (eds.), Algorithmic Learning Theory 2000,
Springer, 2000.
5. M. W. Craven and J. W. Shavlik. Using sampling and queries to extract rules from trained
neural networks. In: Proceedings of the Eleventh International Conference on Machine
Learning, Morgan Kaufmann, 1994.
6. J. L. Elman. Finding structure in time. Cognitive Science, 14, 1990.
7. P. Frasconi, M. Gori, and A. Sperduti. A general framework for adaptive processing of data sequences. IEEE Transactions on Neural Networks, 9(5), 1997.
8. C. L. Giles, G. M. Kuhn, and R. J. Williams. Special issue on dynamic recurrent neural
networks. IEEE Transactions on Neural Networks, 5(2), 1994.
9. M. Gori, M. Mozer, A. C. Tsoi, and R. L. Watrous. Special issue on recurrent neural networks
for sequence processing. Neurocomputing, 15(3-4), 1997.
10. B. Hammer. Approximation and generalization issues of recurrent networks dealing with
structured data. In: P. Frasconi, M. Gori, F. Kurfes, and A. Sperduti, Proceedings of the ECAI
workshop on Foundations of connectionist-symbolic integration: representation, paradigms,
and algorithms, 2000.
11. B. Hammer. Learning with recurrent neural networks. Lecture Notes in Control and Infor-
mation Sciences 254, Springer, 2000.
12. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8),
1997.
13. S. Holldobler, Y. Kalinke, and H. Lehmann. Designing a Counter: Another case study of Dy-
namics and Activation Landscapes in Recurrent Networks. In G. Brewka and C. Habel and
B. Nebel (eds.): KI97: Advances in Artificial Intelligence, Proceedings of the 21st German
Conference on Artificial Intelligence, LNAI 1303, Springer, 1997.
14. J.J. Hopfield and D.W. Tank. Neural computation of decisions in optimization problems.
Biological Cybernetics, 52, 1985.
15. J.E. Hummel and K.L. Holyoak. Distributed representation of structure: a theory of analog-
ical access and mapping. Psychological Review, 104, 1997.
16. J. Kilian and H. T. Siegelmann. The dynamic universality of sigmoidal neural networks.
Information and Computation, 128, 1996.
17. P. Koiran and E. D. Sontag. Neural networks with quadratic VC dimension. Journal of
Computer and System Sciences, 54, 1997.
18. A. Kuchler. On the correspondence between neural folding architectures and tree automata.
Technical report, University of Ulm, 1998.
19. A. Kuchler and C. Goller. Inductive learning in symbolic domains using structure-driven recurrent neural
networks. In G. Gorz and S. Holldobler, editors, KI-96: Advances in Artificial Intelligence.
Springer, 1996.
20. S. Lawrence, C.L. Giles, and S. Fong. Can recurrent neural networks learn natural language
grammars?. In: International Conference on Neural Networks, IEEE Press, 1996.
21. W. Maass and P. Orponen. On the effect of analog noise in discrete-time analog computation.
Neural Computation, 10(5), 1998.
22. W. Maass and E. D. Sontag. Analog neural nets with Gaussian or other common noise
distributions cannot recognize arbitrary regular languages. In M. C. Mozer, M. I. Jordan,
and T. Petsche, editors, Advances in Neural Information Processing Systems, Volume 9. The
MIT Press, 1998.
23. T. Masters. Neural, Novel & Hybrid Algorithms for Time Series Prediction. Wiley, 1995.
24. T. Mitchell. Machine Learning. McGraw-Hill, 1997.
25. M. Mozer. Neural net architectures for temporal sequence processing. In A. Weigend and
N. Gershenfeld, editors, Predicting the future and understanding the past. Addison-Wesley,
1993.
26. C. W. Omlin and C. L. Giles. Extraction of rules from discrete-time recurrent neural net-
works. Neural Networks, 9(1), 1996.
27. C. Omlin and C. Giles. Constructing deterministic finite-state automata in recurrent neural
networks. Journal of the ACM, 43(2), 1996.
28. B. A. Pearlmutter. Gradient calculations for dynamic recurrent neural networks: A survey.
IEEE Transactions on Neural Networks, 6(5), 1995.
29. T. Plate. Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3),
1995.
30. J. Pollack. Recursive distributed representation. Artificial Intelligence, 46, 1990.
31. M. Reczko. Protein secondary structure prediction with partially recurrent neural networks.
SAR and QSAR in environmental research, 1, 1993.
32. M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation: The
RPROP algorithm. In Proceedings of the Sixth International Conference on Neural Net-
works. IEEE, 1993.
33. J. Schmidhuber. A fixed size storage O(n^3) time complexity learning algorithm for fully
recurrent continually running networks. Neural Computation, 4(2), 1992.
34. T. Schmitt and C. Goller. Relating chemical structure to activity with the structure processing
neural folding architecture. In Engineering Applications of Neural Networks, 1998.
35. S. Schulz, A. Kuchler, and C. Goller. Some experiments on the applicability of folding
architectures to guide theorem proving. In Proceedings of the 10th International FLAIRS Conference, 1997.
36. H. T. Siegelmann. The simple dynamics of super Turing theories. Theoretical Computer
Science, 168, 1996.
37. H. T. Siegelmann and E. D. Sontag. Analog computation, neural networks, and circuits.
Theoretical Computer Science, 131, 1994.
38. H. T. Siegelmann and E. D. Sontag. On the computational power of neural networks. Journal
of Computer and System Sciences, 50, 1995.
39. L. Shastri. Advances in Shruti: A neurally motivated model of relational knowledge representation and rapid inference using temporal synchrony. Applied Intelligence, 11, 1999.
40. A. Sperduti. Labeling RAAM. Connection Science, 6(4), 1994.
41. J. Suykens, B. DeMoor, and J. Vandewalle. Static and dynamic stabilizing neural controllers
applicable to transition between equilibrium point. Neural Networks, 7(5), 1994.
42. L. Valiant. A theory of the learnable. Communications of the ACM, 27, 1984.
43. M. Vidyasagar. A Theory of Learning and Generalization. Springer, 1997.
44. R. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and
their computational complexity. In Y. Chauvin and D. Rumelhart, editors, Back-propagation:
Theory, Architectures and Applications. Erlbaum, 1992.