
Markovian bias of neural-based architectures with feedback connections

Peter Tino¹, Barbara Hammer², and Mikael Boden³

¹ University of Birmingham, Birmingham, UK, [email protected]
² Clausthal University of Technology, Germany, [email protected]
³ University of Queensland, Brisbane, Australia, [email protected]

Summary. Dynamic neural network architectures can deal naturally with sequential data through recursive processing enabled by feedback connections. We show how such architectures are predisposed for suffix-based Markovian input sequence representations in both supervised and unsupervised learning scenarios. In particular, in the context of such architectural predispositions, we study computational and learning capabilities of typical dynamic neural network architectures. We also show how efficient finite memory models can be readily extracted from untrained networks and argue that such models should be used as baselines when comparing dynamic network performances in a supervised learning task. Finally, potential applications of the Markovian architectural predisposition of dynamic neural networks in bioinformatics are presented.

    1 Introduction

There has been considerable research activity in connectionist processing of sequential symbolic structures. For example, researchers have been interested in formulating models of human performance in processing linguistic patterns of various complexity (e.g. [1]).

Of special importance are dynamic neural network architectures capable of naturally dealing with sequential data through recursive processing. Such architectures are endowed with feedback delay connections that enable the models to operate with a neural memory in the form of past activations of a selected subset of neurons, sometimes referred to as recurrent neurons. It is expected that activations of such recurrent neurons will code all the important information from the past that is needed to solve a given task. In other words, the recurrent activations can be thought of as representing, in some relevant way, the temporal context in which the current input item is observed. The relevance is given by the nature of the task, be it next-symbol prediction, or mapping sequences in a topographically ordered manner.

It was a bit surprising, then, when researchers reported that when training dynamic neural networks to process language structures, activations of recurrent neurons displayed a considerable amount of structural differentiation even prior to learning [2, 3, 1]. Following [1], we refer to this phenomenon as the architectural bias of dynamic neural network architectures.

    It has been recently shown, both theoretically and empirically, that the

structural differentiation of recurrent activations before training has much deeper implications [4, 5]. When initialized with small weights (the standard initialization procedure), typical dynamic neural network architectures will organize recurrent activations in a suffix-based Markovian manner, and it is possible to extract from such untrained networks predictive models comparable to efficient finite memory source models called variable memory length Markov models [6]. In addition, in such networks, recurrent activations tend to organize along a self-similar fractal substrate, the fractal and multifractal properties of which can be studied in a rigorous manner [7]. Also, it is possible to rigorously characterize learning capabilities of dynamic neural network architectures in the early stages of learning [5].

Analogously, the well-known Self-Organizing Map (SOM) [8] for topographic low-dimensional organization of high-dimensional vectorial data has been reformulated to enable processing of sequential data. Typically, standard SOM is equipped with additional feedback connections [9, 10, 11, 12, 13, 14]. Again, various forms of inherent predispositions of such models to Markovian organization of processed data have been discovered [15, 13, 16].

The aim of this paper is to present major developments in the phenomenon of architectural bias of dynamic neural network architectures in a unified framework. The paper has the following organization: First, a general state-space model formulation used throughout the paper is introduced in section 2. Then, still at a general high description level, we study in section 3 the link between Markovian suffix-based state space organization and contractive state transition maps. Basic tools for measuring the geometric complexity of fractal sets are introduced as well. Detailed studies of several aspects of the architectural bias phenomenon in the supervised and unsupervised learning scenarios are presented in sections 4 and 5, respectively. Section 6 brings a flavour of potential applications of the architectural bias in bioinformatics. Finally, key findings are summarized in section 7.

    2 General model formulation

We consider models that recursively process inputs $x(t)$ from a set $X \subseteq \mathbb{R}^{N_I}$ by updating their state $r(t)$ in a bounded set $R \subseteq \mathbb{R}^{N}$, and producing outputs $y(t) \in Y$, $Y \subseteq \mathbb{R}^{N_O}$. The state space model representation takes the form:

$$y(t) = h(r(t)) \quad\quad (1)$$

$$r(t) = f(x(t), r(t-1)). \quad\quad (2)$$

Processing of an input time series $x(t) \in X$, $t = 1, 2, \ldots$, starts by initializing the model with some $r(0) \in R$ and then recursively updating the state $r(t) \in R$ and output $y(t) \in Y$ via functions $f: X \times R \to R$ and $h: R \to Y$, respectively. This is illustrated in figure 1.

Fig. 1. The basic state space model used throughout the paper. Processing of an input time series $x(t) \in X$, $t = 1, 2, \ldots$, is done by recursively updating the state $r(t) \in R$ and output $y(t) \in Y$ via functions $f: X \times R \to R$ and $h: R \to Y$, respectively.

We consider inputs coming from a finite alphabet $\mathcal{A}$ of $A$ symbols. The set of all finite sequences over $\mathcal{A}$ is denoted by $\mathcal{A}^*$. The set $\mathcal{A}^*$ without the empty string is denoted by $\mathcal{A}^+$. The set of all sequences over $\mathcal{A}$ with exactly $n$ symbols ($n$-blocks) is denoted by $\mathcal{A}^n$. Each symbol $s \in \mathcal{A}$ is mapped to its real-vector representation $c(s)$ by an injective coding function $c: \mathcal{A} \to X$. The state transition process (2) can be viewed as a composition of fixed-input maps

$$f_s(r) = f(c(s), r), \quad s \in \mathcal{A}. \quad\quad (3)$$

In particular, for a sequence $s_{1:n} = s_1 \ldots s_{n-2} s_{n-1} s_n$ over $\mathcal{A}$ and $r \in R$, we write

$$f_{s_{1:n}}(r) = f_{s_n}(f_{s_{n-1}}(\ldots(f_{s_2}(f_{s_1}(r)))\ldots)) = (f_{s_n} \circ f_{s_{n-1}} \circ \ldots \circ f_{s_2} \circ f_{s_1})(r). \quad\quad (4)$$
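To make the recursion (1)-(4) concrete, here is a minimal, self-contained sketch (not from the original paper) of a state-space model driven by a symbol sequence. It assumes a tanh transition of the form (17) with small random weights and a one-hot coding function; all names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Alphabet and one-hot coding function c: A -> X (illustrative choice)
alphabet = ["a", "b", "c"]
A = len(alphabet)
c = {s: np.eye(A)[i] for i, s in enumerate(alphabet)}

# A generic state-space model (eqs. 1-2) with a tanh transition (cf. eq. 17).
N = 10                                    # state dimension
W_rx = 0.3 * rng.standard_normal((N, A))  # input-to-state weights
W_rr = 0.1 * rng.standard_normal((N, N))  # state-to-state weights (small)
t_f = np.zeros(N)

def f(x, r):
    """State transition r(t) = f(x(t), r(t-1))."""
    return np.tanh(W_rx @ x + W_rr @ r + t_f)

def f_seq(seq, r):
    """Composition f_{s_{1:n}} = f_{s_n} o ... o f_{s_1} of fixed-input maps (eq. 4)."""
    for s in seq:
        r = f(c[s], r)
    return r

r0 = np.zeros(N)
print(f_seq("abcab", r0))  # state after processing the sequence "abcab"
```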


    3 Contractive state transitions

In this section we will show that by constraining the system to contractive state transitions we obtain state organizations with a Markovian flavour. Moreover, one can quantify the geometric complexity of such state space organizations using tools of fractal geometry.

    3.1 Markovian state-space organization

Denote the Euclidean norm by $\|\cdot\|$. Recall that a mapping $F: R \to R$ is said to be a contraction with contraction coefficient $\rho \in [0, 1)$ if, for any $r, r' \in R$, it holds that

$$\|F(r) - F(r')\| \le \rho \, \|r - r'\|. \quad\quad (5)$$

$F$ is a contraction if there exists $\rho \in [0, 1)$ such that $F$ is a contraction with contraction coefficient $\rho$.

Assume now that each member of the family $\{f_s\}_{s \in \mathcal{A}}$ is a contraction with contraction coefficient $\rho_s$, and denote the weakest contraction rate in the family by

$$\rho_{\max} = \max_{s \in \mathcal{A}} \rho_s.$$

Consider a sequence $s_{1:n} = s_1 \ldots s_{n-2} s_{n-1} s_n \in \mathcal{A}^n$, $n \ge 1$. Then for any two prefixes $u$ and $v$ and for any state $r \in R$, we have

$$\|f_{u s_{1:n}}(r) - f_{v s_{1:n}}(r)\| = \|f_{s_{1:n-1} s_n}(f_u(r)) - f_{s_{1:n-1} s_n}(f_v(r))\| \quad\quad (6)$$
$$= \|f_{s_n}(f_{s_{1:n-1}}(f_u(r))) - f_{s_n}(f_{s_{1:n-1}}(f_v(r)))\| \quad\quad (7)$$
$$\le \rho_{\max} \, \|f_{s_{1:n-1}}(f_u(r)) - f_{s_{1:n-1}}(f_v(r))\|, \quad\quad (8)$$

and consequently

$$\|f_{u s_{1:n}}(r) - f_{v s_{1:n}}(r)\| \le \rho_{\max}^{n} \, \|f_u(r) - f_v(r)\| \quad\quad (9)$$
$$\le \rho_{\max}^{n} \, \mathrm{diam}(R), \quad\quad (10)$$

where $\mathrm{diam}(R)$ is the diameter of the set $R$, i.e. $\mathrm{diam}(R) = \sup_{x, y \in R} \|x - y\|$. By similar arguments, for $L < n$,

$$\|f_{s_{1:n}}(r) - f_{s_{n-L+1:n}}(r)\| \le \rho_{\max}^{L} \, \|f_{s_{1:n-L}}(r) - r\| \quad\quad (11)$$
$$\le \rho_{\max}^{L} \, \mathrm{diam}(R). \quad\quad (12)$$

There are two related lessons to be learnt from this exercise:

1. No matter what state $r \in R$ the model is in, the final states $f_{u s_{1:n}}(r)$ and $f_{v s_{1:n}}(r)$ after processing two sequences with a long common suffix $s_{1:n}$ will lie close to each other. Moreover, the greater the length $n$ of the common suffix, the closer the final states $f_{u s_{1:n}}(r)$ and $f_{v s_{1:n}}(r)$ lie.


2. If we could only operate with finite input memory of depth $L$, then all states reachable from an initial state $r_0 \in R$ could be collected in a finite set

$$C_L(r_0) = \{f_w(r_0) \mid w \in \mathcal{A}^L\}.$$

Consider now a long sequence $s_{1:n}$ over $\mathcal{A}$. Disregarding the initial transients for $1 \le t \le L-1$, the states $f_{s_{1:t}}(r_0)$ of the original model (2) can be approximated by the states $f_{s_{t-L+1:t}}(r_0) \in C_L(r_0)$ of the finite memory model to arbitrarily small precision, as long as the memory depth $L$ is long enough.

What we have just described can be termed Markovian organization of the model state space. Information processing states that result from processing sequences sharing a common suffix naturally cluster together. In addition, the spatial extent of the cluster reflects the length of the common suffix: longer suffixes lead to a tighter cluster structure.

If, in addition, the readout function $h$ is Lipschitz with coefficient $\gamma > 0$, i.e. for any $r, r' \in R$,

$$\|h(r) - h(r')\| \le \gamma \, \|r - r'\|,$$

then

$$\|h(f_{u s_{1:n}}(r)) - h(f_{v s_{1:n}}(r))\| \le \gamma \, \rho_{\max}^{n} \, \|f_u(r) - f_v(r)\|.$$

Hence, for arbitrary prefixes $u$, $v$, we have that for a sufficiently long common suffix length $n$, the model outputs resulting from driving the system with $u s_{1:n}$ and $v s_{1:n}$ can be made arbitrarily close. In particular, on the same long input sequence (after initial transients of length $L-1$), the model outputs (1) will be closely shadowed by those of the finite memory model operating only on the most recent $L$ symbols (and hence on states from $C_L(r)$), as long as $L$ is large enough.
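The suffix-based clustering described above is easy to observe numerically. The sketch below (an illustration, not code from the paper) drives a small-weight tanh system with two sequences that differ only in their prefixes and measures how the distance between the resulting states shrinks with the length of the common suffix, as predicted by (10); weights and sequences are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small-weight tanh state transition => each fixed-input map is a contraction.
A, N = 4, 8
W_rx = 0.5 * rng.standard_normal((N, A))
W_rr = 0.1 * rng.standard_normal((N, N))
codes = np.eye(A)                         # one-hot coding c(s)

def f_seq(seq, r):
    for s in seq:
        r = np.tanh(W_rx @ codes[s] + W_rr @ r)
    return r

r0 = np.zeros(N)
suffix = [0, 2, 1, 3, 3, 1, 0, 2, 2, 1]   # common suffix s_{1:n}
u = list(rng.integers(0, A, 20))          # arbitrary prefix u
v = list(rng.integers(0, A, 35))          # arbitrary prefix v

for n in (1, 3, 5, 10):                   # growing common suffix length
    d = np.linalg.norm(f_seq(u + suffix[-n:], r0) - f_seq(v + suffix[-n:], r0))
    print(f"common suffix length {n:2d}: distance between final states = {d:.2e}")
```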

    3.2 Measuring complexity of state space activations

Let us first introduce measures of the size of geometrically complicated objects called fractals (see e.g. [17]). Let $K \subseteq R$. For $\epsilon > 0$, an $\epsilon$-fine cover of $K$ is a collection of sets of diameter at most $\epsilon$ that cover $K$. Denote by $N_\epsilon(K)$ the smallest possible cardinality of an $\epsilon$-fine cover of $K$.

Definition 1. The upper and lower box-counting dimensions of $K$ are defined as

$$\dim_B^{+} K = \limsup_{\epsilon \to 0} \frac{\log N_\epsilon(K)}{-\log \epsilon} \quad \text{and} \quad \dim_B^{-} K = \liminf_{\epsilon \to 0} \frac{\log N_\epsilon(K)}{-\log \epsilon}, \quad\quad (13)$$

respectively.


Let $\delta > 0$. For $\epsilon > 0$, define

$$H_\epsilon^{\delta}(K) = \inf_{\mathcal{B} \in \mathcal{C}_\epsilon(K)} \sum_{B \in \mathcal{B}} (\mathrm{diam}(B))^{\delta}, \quad\quad (14)$$

where the infimum is taken over the set $\mathcal{C}_\epsilon(K)$ of all countable $\epsilon$-fine covers of $K$. Define

$$H^{\delta}(K) = \lim_{\epsilon \to 0} H_\epsilon^{\delta}(K).$$

Definition 2. The Hausdorff dimension of the set $K$ is

$$\dim_H K = \inf\{\delta \mid H^{\delta}(K) = 0\}. \quad\quad (15)$$

It is well known that

$$\dim_H K \le \dim_B^{-} K \le \dim_B^{+} K. \quad\quad (16)$$

The Hausdorff dimension is more subtle than the box-counting dimensions: the former can capture details not detectable by the latter. For a more detailed discussion see e.g. [17].

    Consider now a Bernoulli source S over the alphabet A and (without lossof generality) assume that all symbol probabilities are nonzero.

The system (2) can be considered an Iterated Function System (IFS) [18] $\{f_s\}_{s \in \mathcal{A}}$. If all the fixed-input maps $f_s$ are contractions (contractive IFS), there exists a unique set $K \subseteq R$, called the IFS attractor, that is invariant under the action of the IFS:

$$K = \bigcup_{s \in \mathcal{A}} f_s(K).$$

As the system (2) is driven by an input stream generated by $S$, the states $\{r(t)\}$ converge to $K$. In other words, after some initial transients, the state trajectory of the system (2) samples the invariant set $K$ (the "chaos game" algorithm).

It is possible to upper-bound the fractal dimensions of $K$ [17]:

$$\dim_B^{+} K \le d^{+}, \quad \text{where } \sum_{s \in \mathcal{A}} \rho_s^{\,d^{+}} = 1.$$

If the IFS obeys the open set condition, i.e. all $f_s(K)$ are disjoint, we can get a lower bound on the dimension of $K$ as well: assume there exist $0 < \rho_s^{-} < 1$, $s \in \mathcal{A}$, such that

$$\|f_s(r) - f_s(r')\| \ge \rho_s^{-} \, \|r - r'\|.$$

Then [17],

$$\dim_H K \ge d^{-}, \quad \text{where } \sum_{s \in \mathcal{A}} (\rho_s^{-})^{d^{-}} = 1.$$
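The following sketch illustrates the "chaos game" sampling of an IFS attractor and a crude box-counting estimate of its dimension; the affine IFS, the Bernoulli symbol source and the scales are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Contractive affine IFS on the unit square: f_s(r) = rho * r + (1 - rho) * c(s),
# with symbols coded as corners of the square (cf. the FPM construction in sec. 4.3).
rho = 0.5
corners = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

def chaos_game(symbols, r0=np.array([0.5, 0.5])):
    states, r = [], r0
    for s in symbols:
        r = rho * r + (1 - rho) * corners[s]
        states.append(r)
    return np.array(states)

# Drive the IFS with a Bernoulli source over three of the four symbols only,
# so the attractor is a Sierpinski-like fractal rather than the full square.
seq = rng.choice([0, 1, 2], size=200_000)
pts = chaos_game(seq)[1000:]                 # drop initial transients

# Crude box-counting estimate of dim_B: slope of log N_eps(K) versus -log eps.
for eps in (1/8, 1/16, 1/32, 1/64):
    boxes = np.unique(np.floor(pts / eps).astype(int), axis=0)
    print(f"eps = {eps:.4f}:  N_eps = {len(boxes):5d},  "
          f"log N / -log eps = {np.log(len(boxes)) / -np.log(eps):.3f}")
# The estimates approach log(3)/log(2) ~ 1.585 for this three-corner IFS with rho = 0.5.
```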


    4 Supervised learning setting

In this section, we will consider the implications of Markovian state-space organization for recurrent neural networks (RNN) formulated and trained in a supervised setting. We present the results for a simple 3-layer RNN topology. The results can be easily generalized to more complicated architectures.

    4.1 Recurrent Neural Networks and Neural Prediction Machines

Let $W^{r,x}$, $W^{r,r}$ and $W^{y,r}$ denote real $N \times N_I$, $N \times N$ and $N_O \times N$ weight matrices, respectively. For a neuron activation function $g$, we denote its element-wise application to a real vector $a = (a_1, a_2, \ldots, a_k)^T$ by $g(a)$, i.e. $g(a) = (g(a_1), g(a_2), \ldots, g(a_k))^T$. Then, the state transition function $f$ in (2) takes the form

$$f(x, r) = g(W^{r,x} x + W^{r,r} r + t_f), \quad\quad (17)$$

where $t_f \in \mathbb{R}^N$ is a bias term. Often, the output function $h$ reads:

$$h(r) = g(W^{y,r} r + t_h), \quad\quad (18)$$

with the bias term $t_h \in \mathbb{R}^{N_O}$. However, for the purposes of symbol stream processing, the output function $h$ is sometimes realized as a piece-wise constant function consisting of a collection of multinomial next-symbol distributions, one for each compartment of the partitioned state space $R$. RNNs with such an output function were termed Neural Prediction Machines (NPM) in [4]. More precisely, assuming the symbol alphabet $\mathcal{A}$ contains $A$ symbols, the output space $Y$ is the simplex

$$Y = \{y = (y_1, y_2, \ldots, y_A)^T \in \mathbb{R}^A \mid \sum_{i=1}^{A} y_i = 1 \text{ and } y_i \ge 0\}.$$

The next-symbol probabilities $h(r) \in Y$ are estimates of the relative frequencies of symbols, conditional on the RNN state $r$. Regions of constant probabilities are determined by vector-quantizing the set of recurrent activations that result from driving the network with the training sequence:

1. Given a training sequence $s_{1:n} = s_1 s_2 \ldots s_n$ over the alphabet $\mathcal{A} = \{1, 2, \ldots, A\}$, first reset the network to the initial state $r(0)$, and then, while driving the network with the sequence $s_{1:n}$, collect the recurrent layer activations $r(t)$, $1 \le t \le n$.

2. Run your favorite vector quantizer with $M$ codebook vectors $b_1, \ldots, b_M \in R$ on the collected activations. Vector quantization partitions the state space $R$ into $M$ Voronoi compartments $V_1, \ldots, V_M$ of the codebook vectors⁴ $b_1, \ldots, b_M$:

$$V_i = \{r \mid \|r - b_i\| = \min_j \|r - b_j\|\}. \quad\quad (19)$$

All points in $V_i$ are allocated to the codebook vector $b_i$ via a projection $\pi: R \to \{1, 2, \ldots, M\}$,

$$\pi(r) = i, \text{ provided } r \in V_i. \quad\quad (20)$$

⁴ Ties are broken according to index order.

3. Reset the network again to the initial state $r(0)$.

4. Set the [Voronoi compartment, next-symbol] counters $N(i, a)$, $i = 1, \ldots, M$, $a \in \mathcal{A}$, to zero.

5. Drive the network again with the training sequence $s_{1:n}$. For $1 \le t < n$, if $\pi(r(t)) = i$ and the next symbol $s_{t+1}$ is $a$, increment the counter $N(i, a)$ by one.

6. With each codebook vector associate the next-symbol probabilities⁵

$$P(a \mid i) = \frac{N(i, a)}{\sum_{b \in \mathcal{A}} N(i, b)}, \quad a \in \mathcal{A}, \; i = 1, 2, \ldots, M. \quad\quad (21)$$

7. The output map is then defined as

$$h(r) = P(\cdot \mid \pi(r)), \quad\quad (22)$$

where the $a$-th component of $h(r)$, $h_a(r)$, is the probability of the next symbol being $a \in \mathcal{A}$, provided the current RNN state is $r$.

Once the NPM is constructed, it can make predictions on a test sequence (a continuation of the training sequence) as follows: Let $r(n)$ be the vector of recurrent activations after observing the training sequence $s_{1:n}$. Given a prefix $u_1 u_2 \ldots u_\ell$ of the test sequence, the NPM makes a prediction about the next symbol $u_{\ell+1}$ by:

1. Resetting the network with $r(n)$.

2. Recursively updating the states $r(n+t)$, $1 \le t \le \ell$, while driving the network with $u_1 u_2 \ldots u_\ell$.

3. The probability of $u_{\ell+1}$ occurring is

$$h_{u_{\ell+1}}(r) = P(u_{\ell+1} \mid \pi(r(n+\ell))). \quad\quad (23)$$

⁵ For bigger codebooks, we may encounter data sparseness problems. In such cases, it is advisable to perform smoothing of the empirical symbol frequencies by applying Laplace correction with parameter $\kappa > 0$:

$$P(a \mid i) = \frac{\kappa + N(i, a)}{\kappa A + \sum_{b \in \mathcal{A}} N(i, b)}.$$

The parameter $\kappa$ can be viewed, in the Dirichlet prior interpretation of the multinomial distribution $P(a \mid i)$, as the effective number of times each symbol $a \in \mathcal{A}$ was observed in the context of state compartment $i$, prior to counting the conditional symbol occurrences $N(i, a)$ in the training sequence. Typically, $\kappa = 1$ or $\kappa = A^{-1}$.

    A schematic illustration of NPM is presented in figure 2.

Fig. 2. Schematic illustration of a Neural Prediction Machine. The output function is realized as a piece-wise constant function consisting of a collection of multinomial next-symbol distributions $P(\cdot \mid i)$, one for each compartment $i$ of the state space $R$.
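A compact sketch of the NPM construction (steps 1-7) and prediction (23) follows. It assumes an untrained small-weight tanh RNN, a few iterations of plain k-means as the "favorite vector quantizer", and Laplace smoothing with $\kappa = 1$; all of these are illustrative choices rather than the setup used in [4].

```python
import numpy as np

rng = np.random.default_rng(3)

A, N, M = 4, 20, 16                       # alphabet size, state dim., codebook size
W_rx = 0.2 * rng.standard_normal((N, A))  # untrained, small random weights
W_rr = 0.2 * rng.standard_normal((N, N))
codes = np.eye(A)

def drive(seq, r):
    """Collect recurrent activations while driving the RNN with seq."""
    states = []
    for s in seq:
        r = np.tanh(W_rx @ codes[s] + W_rr @ r)
        states.append(r)
    return np.array(states)

train = rng.integers(0, A, 5000)
R_states = drive(train, np.zeros(N))                    # step 1

# Step 2: a few iterations of plain k-means as the vector quantizer.
book = R_states[rng.choice(len(R_states), M, replace=False)]
for _ in range(20):
    idx = np.argmin(((R_states[:, None] - book[None]) ** 2).sum(-1), axis=1)
    book = np.array([R_states[idx == i].mean(0) if np.any(idx == i) else book[i]
                     for i in range(M)])
idx = np.argmin(((R_states[:, None] - book[None]) ** 2).sum(-1), axis=1)

# Steps 3-6: count next symbols per Voronoi compartment, with Laplace smoothing.
counts = np.ones((M, A))                                # kappa = 1 smoothing
for t in range(len(train) - 1):
    counts[idx[t], train[t + 1]] += 1
P = counts / counts.sum(1, keepdims=True)               # eq. (21)

# Step 7 / prediction (eq. 23): next-symbol distribution for a test prefix.
def predict(prefix, r_n):
    r = drive(prefix, r_n)[-1]
    i = np.argmin(((book - r) ** 2).sum(-1))
    return P[i]

print(predict([0, 1, 2, 3], R_states[-1]))
```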

4.2 Variable memory length Markov models and NPM built on untrained RNN

In this section we will present intuitive arguments for a strong link between Neural Prediction Machines (NPM) constructed on untrained RNNs and a class of efficient implementations of Markov models called variable memory length Markov models (VLMM) [6]. The arguments will be made more extensive and formal in section 4.5.

In Markov models (MMs) of (fixed) order $L$, the conditional next-symbol distribution over the alphabet $\mathcal{A}$, given the history of symbols $s_{1:t} = s_1 s_2 \ldots s_t \in \mathcal{A}^t$ observed so far, is written in terms of the last $L$ symbols ($L \le t$),

$$P(s_{t+1} \mid s_{1:t}) = P(s_{t+1} \mid s_{t-L+1:t}). \quad\quad (24)$$


In variable memory length Markov models (VLMMs), the memory depth is not fixed in advance: deeper prediction contexts are used only for those symbol histories where they are warranted. An analogous variable-depth structure emerges in NPMs built on contractive (e.g. untrained) RNNs. Dense areas in the RNN state space correspond to symbol histories with long common suffixes and are given more attention by the vector quantizer. Hence, the prediction contexts are analogous to those of VLMMs in that deep memory is used only when there are many different symbol histories sharing a long common suffix. In such situations it is natural to consider deeper past contexts when predicting the future. This is done automatically by the vector quantizer in the NPM construction, as it devotes more codebook vectors to the dense regions than to the sparsely inhabited areas of the RNN state space. More codebook vectors in dense regions imply tiny quantization compartments $V_i$ that in turn group symbol histories with long shared suffixes.

Note, however, that while prediction contexts of VLMMs are built in a supervised manner, i.e. deep memory is considered only if it is dictated by the conditional distributions over the next symbols, in contractive NPMs prediction contexts of variable length are constructed in an unsupervised manner: prediction contexts with deep memory are accepted if there is a suspicion that shallow memory may not be enough, i.e. when there are many different symbol histories in the training sequence that share a long common suffix.

    4.3 Fractal Prediction Machines

A special class of affine contractive NPMs operating with finite input memory has been introduced in [26] under the name Fractal Prediction Machines (FPM). Typically, the state space $R$ is an $N$-dimensional unit hypercube⁶ $R = [0, 1]^N$, $N = \lceil \log_2 A \rceil$. The coding function $c: \mathcal{A} \to X = R$ maps each of the $A$ symbols of the alphabet $\mathcal{A}$ to a unique vertex of the hypercube $R$. The state transition function (17) has a particularly simple form, as the activation function $g$ is the identity and the connection weight matrices are set to $W^{r,r} = \rho I$ and $W^{r,x} = (1 - \rho) I$, where $0 < \rho < 1$ is a contraction coefficient and $I$ is the $N \times N$ identity matrix. Hence, the state space evolution (2) reads:

$$r(t) = f(x(t), r(t-1)) = \rho \, r(t-1) + (1 - \rho) \, x(t). \quad\quad (25)$$

The reset state is usually set to the center of $R$, $r_0 = \{\tfrac{1}{2}\}^N$. After fixing the memory depth $L$, the FPM states are confined to the set

$$C_L(r_0) = \{f_w(r_0) \mid w \in \mathcal{A}^L\}.$$

The FPM construction proceeds in the same manner as that of the NPM, except that the states $f_{s_{1:t}}(r_0)$ coding the history of inputs up to the current time $t$ are approximated by their memory-$L$ counterparts $f_{s_{t-L+1:t}}(r_0)$ from $C_L(r_0)$, as discussed in section 3.1.

⁶ For $x \in \mathbb{R}$, $\lceil x \rceil$ is the smallest integer $y$ such that $y \ge x$.
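Below is a minimal sketch of the FPM state map (25) with symbols placed on hypercube vertices, together with a numerical check of the finite-memory approximation of section 3.1; the contraction coefficient, alphabet size and test sequence are illustrative.

```python
import numpy as np
from itertools import product

# FPM state space: unit hypercube of dimension N = ceil(log2(A)); symbols sit on vertices.
A = 4
N = int(np.ceil(np.log2(A)))
vertices = np.array(list(product([0.0, 1.0], repeat=N)))[:A]   # coding function c
rho = 0.5                                                      # contraction coefficient

def fpm_state(seq, r=None):
    """Eq. (25): r(t) = rho * r(t-1) + (1 - rho) * c(s_t), reset state at the centre."""
    r = np.full(N, 0.5) if r is None else r
    for s in seq:
        r = rho * r + (1 - rho) * vertices[s]
    return r

# Finite-memory approximation: the state of a long sequence is well approximated
# by the state of its length-L suffix (cf. eqs. 11-12), up to rho**L * diam(R).
seq = list(np.random.default_rng(4).integers(0, A, 100))
for L in (2, 4, 8):
    err = np.linalg.norm(fpm_state(seq) - fpm_state(seq[-L:]))
    print(f"L = {L}: approximation error = {err:.2e}  (bound {rho**L * np.sqrt(N):.2e})")
```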


    4.4 Echo and Liquid State Machines

A similar architecture has been proposed by Jaeger in [27, 28]. So-called Echo State Machines (ESM) combine an untrained recurrent reservoir with a trainable linear output layer. Thereby, the state space model has the following form

$$y(t) = h(x(t), r(t), y(t-1)) \quad\quad (26)$$

$$r(t) = f(x(t), r(t-1), y(t-1)). \quad\quad (27)$$

For training, the function $f$ is fixed and chosen in such a way that it has the echo state property, i.e. the network state $r(t)$ is uniquely determined by left-infinite input sequences. In practice, the dependence of $r$ on $y$ is often dropped, in which case this property is equivalent to the fading property of recurrent systems and can be tested by determining the Lipschitz constant of $f$ (which must be smaller than 1). The readout $h$ is trainable. It is often chosen as a linear function such that training can be done efficiently using linear regression. Unlike FPMs, ESMs usually work with very high dimensional state representations and a transition function obtained as the composition of a nonlinear activation function and a linear mapping with (sufficiently small) random weights. This combination should guarantee that the transition function has sufficiently rich dynamics to represent important properties of input series within its reservoir.

A similar idea is the basis for so-called Liquid State Machines (LSM) as proposed by Maass [29]. These are based on a fixed recurrent mechanism in combination with a trainable readout. Unlike FPM and ESM, LSMs consider continuous time, and they are usually based on biologically plausible circuits of spiking neurons with sufficiently rich dynamics. It has been shown in [29] that LSMs are approximation complete for operators with fading memory under mild conditions on the models. Possibilities to estimate the capacity of such models and their generalization ability have recently been presented in [30]. However, the fading property is not required for LSMs.

    4.5 On the computational power of recursive models

Here we will rigorously analyze the computational power of recursive models with contractive transition function. The arguments are partially taken from [5]. Given a sequence $s_{1:t} = s_1 \ldots s_t$, the $L$-truncation is given by

$$T_L(s_{1:t}) = \begin{cases} s_{t-L+1:t} & \text{if } t \ge L \\ s_{1:t} & \text{otherwise.} \end{cases}$$

Assume $\mathcal{A}$ is a finite input alphabet and $Y$ is an output set. A Definite Memory Machine (DMM) computes a function $f: \mathcal{A}^* \to Y$ such that $f(s) = f(T_L(s))$ for some fixed memory length $L$ and every sequence $s$. A Markov model (MM) fulfils $P(s_{t+1} \mid s_{1:t}) = P(s_{t+1} \mid T_L(s_{1:t}))$ for some $L$. Obviously, the function which maps a sequence to the next-symbol probability given by a Markov model defines a definite memory machine. Variable length Markov models are subsumed by standard Markov models taking $L$ as the maximum context length. Therefore, we will only consider definite memory machines in the following.

State space models as defined in eqs. (1-2) induce a function $h \circ f(r): \mathcal{A}^* \to Y$, $s_{1:t} \mapsto (h \circ f_{s_t} \circ \ldots \circ f_{s_1})(r)$ as defined in (4), given a fixed initial value $r$. As beforehand, we assume that the set of internal states $R$ is bounded. Consider the case that $f(x, \cdot)$ is a contraction with parameter $\rho$ for every input $x$, and $h$ is Lipschitz continuous with parameter $\gamma$. According to (12), $\|f_s(r) - f_{T_L(s)}(r)\| \le \rho^L \, \mathrm{diam}(R)$. Note that the function $h \circ f_{T_L(\cdot)}(r)$ can be given by a definite memory machine. Therefore, if we choose $L \ge \log_\rho\!\big(\epsilon / (\gamma \, \mathrm{diam}(R))\big)$, we find $\|h \circ f_s(r) - h \circ f_{T_L(s)}(r)\| \le \epsilon$. Hence, we can conclude that every recurrent system with contractive transition function can be approximated arbitrarily well by a finite memory model. This argument proves the Markovian property of RNNs with small weights:

Theorem 1. Assume $f(x, \cdot)$ is a contraction for every input $x$ (with a common contraction coefficient), and $h$ is Lipschitz continuous. Assume $r$ is a fixed initial state and $\epsilon > 0$. Then there exists a definite memory machine with function $g: \mathcal{A}^* \to Y$ such that $\|g(s) - h \circ f_s(r)\| \le \epsilon$ for every input sequence $s$.

As shown in [5], this argument can be extended to the situation where the input alphabet is not finite, but given by an arbitrary subset of a real-vector space.
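As a small illustration of the argument behind Theorem 1, the following sketch computes the memory depth $L$ needed so that $\gamma \rho^L \mathrm{diam}(R) \le \epsilon$, for assumed (purely illustrative) values of the contraction coefficient, Lipschitz constant and state-space diameter.

```python
import math

def required_memory_depth(rho, gamma, diam_R, eps):
    """Smallest L with gamma * rho**L * diam(R) <= eps (cf. eqs. 11-12 and Theorem 1)."""
    if not (0 < rho < 1):
        raise ValueError("rho must lie in (0, 1) for the contraction argument to apply")
    return max(0, math.ceil(math.log(eps / (gamma * diam_R)) / math.log(rho)))

# Example: a mildly contractive system, moderately steep readout, unit-diameter state space.
for eps in (1e-1, 1e-2, 1e-3):
    L = required_memory_depth(rho=0.7, gamma=2.0, diam_R=1.0, eps=eps)
    print(f"eps = {eps:.0e}  ->  memory depth L = {L}")
```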

It can easily be seen that standard recurrent neural networks with small weights implement contractions, since standard activation functions such as tanh are Lipschitz continuous and linear maps with sufficiently small weights are contractive. RNNs are often initialized with small random weights. Therefore, RNNs have a Markovian bias in the early stages of training, first pushing solutions towards simple dynamics before moving towards more complex models such as finite state automata and beyond.

It has been shown in [27] that ESMs without output feedback possess the so-called shadowing property, i.e. for every $\epsilon$ a memory length $L$ can be found such that input sequences whose last $L$ entries are similar or identical lead to images which are at most $\epsilon$ apart. As a consequence, such mechanisms can also be simulated by Markovian models. The same applies to FPMs (the transition is a contraction by construction) and NPMs built on untrained RNNs randomly initialized with small weights.

The converse has also been proved in [5]: every definite memory machine can be simulated by an RNN with small weights. Therefore, an equivalence of small-weight RNNs (and similar mechanisms) and definite memory machines holds. One can go a step further and consider the question in which cases


randomly initialized RNNs with small weights can simulate a given definite memory machine, provided a suitable readout is chosen.

The question boils down to whether the images of input sequences $s$ of a fixed memory length $L$ under the recursion $(f_{s_L} \circ \ldots \circ f_{s_1})(r)$ are of a form such that they can be mapped to the desired output values by a trainable readout. We can assume that the codes for the symbols $s_i$ are linearly independent. Further, we assume that the transfer function of the network is a nonlinear and non-algebraic function such as tanh. Then the terms $(f_{s_L} \circ \ldots \circ f_{s_1})(r)$ constitute functions of the weights which are algebraically independent, since they consist of the composition of linearly independent terms and the activation function. Thus, the images of sequences of length $L$ are almost surely disjoint, so that they can be mapped to arbitrary output values by a trainable readout, provided the readout function $h$ fulfils the universal approximation property as stated e.g. in [31] (e.g. FNNs with one hidden layer). If the dimensionality of the reservoir is large enough (dimensionality at least $A^L$), the outputs are almost surely linearly independent, so that even a linear readout is sufficient. Thus, NPMs and ESMs with random initialization constitute universal approximators for finite memory models, provided the dimensionality of the reservoir is sufficient.

    4.6 On the generalization ability of recursive models

Another important issue of supervised adaptive models is their generalization ability, i.e. the possibility to infer from learnt data to new examples. It is well known that the situation is complex for recursive models due to the relatively complex inputs: sequences of a priori unrestricted length can, in principle, encode behavior of a priori unlimited complexity. This observation can be mathematically substantiated by the fact that generalization bounds for recurrent networks as derived e.g. in [32, 33] depend on the input distribution. The situation is different for finite memory models, for which generalization bounds independent of the input distribution can be derived. A formal argument which establishes generalization bounds for RNNs with Markovian bias has been provided in [5], whereby the bounds have been derived also for the setting of continuous input values stemming from a real-vector space. Here, we only consider the case of a finite input alphabet $\mathcal{A}$.

Assume $\mathcal{F}$ is the class of functions which can be implemented by a recurrent model with Markovian bias. For simplicity, we restrict ourselves to the case of binary classification, i.e. outputs stem from $\{0, 1\}$, e.g. by adding an appropriate discretization to the readout $h$. Assume $P$ is an unknown underlying input-output distribution which has to be learnt. Assume a set of $m$ input-output pairs, $T_m = \{(s_1, y_1), \ldots, (s_m, y_m)\}$, is given; the pairs are sampled from $P$ in an i.i.d. manner⁷. Assume a learning algorithm outputs a function $g$ when trained on the samples from $T_m$. The empirical error of the algorithm

⁷ independent identically distributed


is defined as

$$E_{T_m}(g) = \frac{1}{m} \sum_{i=1}^{m} |g(s_i) - y_i|.$$

This is usually minimized during training. The real error is given by the quantity

$$E_P(g) = \int |g(s) - y| \, dP(s, y).$$

A learning algorithm is said to generalize to unseen data if the real error is minimized by the learning scheme.

The capability of a learning algorithm to generalize from the training set to new examples depends on the capacity of the function class used for training. One measure of this complexity is offered by the Vapnik-Chervonenkis (VC) dimension of the function class $\mathcal{F}$. The VC dimension of $\mathcal{F}$ is the size of the largest set such that every binary mapping on this set can be realized by a function in $\mathcal{F}$. It has been shown by Vapnik [34] that the following bound holds for sample size $m$, $m \ge \mathrm{VC}(\mathcal{F})$ and $m > 2/\epsilon$:

$$P^m\{T_m \mid \exists g \in \mathcal{F} \text{ s.t. } |E_P(g) - E_{T_m}(g)| > \epsilon\} \le 4 \left(\frac{2 e m}{\mathrm{VC}(\mathcal{F})}\right)^{\mathrm{VC}(\mathcal{F})} \exp(-\epsilon^2 m / 8).$$

Obviously, the number of different binary Markovian classifications with finite memory length $L$ and alphabet $\mathcal{A}$ of $A$ symbols is restricted by $2^{A^L}$; thus, the VC dimension of the class of Markovian classifications is limited by $A^L$. Hence its generalization ability is guaranteed. Recurrent networks with Markovian bias can be simulated by finite memory models. Thereby, the necessary input memory length $L$ can be estimated from the contraction coefficient $\rho$ of the recursive function, the Lipschitz constant $\gamma$ of the readout, and the extent of the state space $R$, by the term $\lceil \log_\rho\!\big(1/(2\gamma\,\mathrm{diam}(R))\big)\rceil$. This holds since we can approximate every binary classification up to $\epsilon = 0.5$ by a finite memory machine using this length, as shown in Theorem 1. This argument proves the generalization ability of ESMs, FPMs, or NPMs with fixed recursive transition function and finite input alphabet $\mathcal{A}$ for which a maximum memory length can be determined a priori. The argumentation can be generalized to the case of arbitrary (continuous) inputs as shown in [5].

If we train a model (e.g. an RNN with small weights) in such a way that a Markovian model with finite memory length arises, whereby the contraction coefficient of the recursive part (and hence the corresponding maximum memory length $L$) is not fixed a priori, these bounds do not apply. In this case, one can use arguments from the theory of structural risk minimization as provided e.g. in [35]. Denote by $\mathcal{F}_i$ the set of recursive functions which can be simulated using memory length $i$ (this corresponds to a contraction coefficient at most $(2\gamma\,\mathrm{diam}(R))^{-1/i}$). Obviously, the VC dimension of $\mathcal{F}_i$ is at most $A^i$. Assume a classifier is trained on a sample $T_m$ of $m$ i.i.d. input-output pairs $(s_j, y_j)$, $j = 1, 2, \ldots, m$,


generated from $P$. Assume the number of errors on $T_m$ of the trained classifier $g$ is $k$, and the recursive model $g$ is contained in $\mathcal{F}_i$. Then, according to [35] (Theorem 2.3), the generalization error $E_P(g)$ can be upper bounded, with probability $1 - \delta$, by the term

$$\frac{1}{m}\left( (2 + 4\ln 2)\,k + 4\ln 2^i + 4\ln\frac{4}{\delta} + 4\,A^i \ln\frac{2em}{A^i} \right).$$

(Here, we chose the prior probabilities used in [35] (Theorem 2.3) as $1/2^i$ for the probability that $g$ is in $\mathcal{F}_i$ and $1/2^k$ for the probability that at most $k$ errors occur.) Unlike the bounds derived e.g. in [15], this bound holds for any posterior finite memory length $L$ obtained during training. In practice, the memory length can be estimated from the contraction and Lipschitz constants as shown above. Thus, unlike for general RNNs, generalization bounds which do not depend on the input distribution, but rather on the input memory length, can be derived for RNNs with Markovian bias.
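For concreteness, the sketch below simply evaluates the displayed structural-risk expression for illustrative values of $m$, $k$, $i$, $A$ and $\delta$, using the VC-dimension reading $A^i$ adopted above; it is a plain transcription of the formula, not an implementation from [35].

```python
import math

def srm_bound(m, k, i, A, delta):
    """Structural-risk bound on E_P(g) for a classifier in F_i with k training errors
    on m samples (transcription of the displayed bound, assuming VC(F_i) = A**i)."""
    vc = A ** i
    return (1.0 / m) * ((2 + 4 * math.log(2)) * k
                        + 4 * math.log(2 ** i)
                        + 4 * math.log(4 / delta)
                        + 4 * vc * math.log(2 * math.e * m / vc))

# Example: 10,000 training sequences, 150 errors, memory length 3, 4-letter alphabet.
print(f"bound = {srm_bound(m=10_000, k=150, i=3, A=4, delta=0.05):.3f}")
```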

    4.7 Verifying the architectural bias

The fact that one can extract from untrained recurrent networks Neural Prediction Machines (NPM, section 4.1) that yield performance comparable to Variable Memory Length Markov Models (VLMM, section 4.2) has been demonstrated e.g. in [4]. This is of vital importance for model building and evaluation, since in the light of these findings, the baseline models in neural-based symbolic sequence processing tasks should in fact be VLMMs. If a neural network model after the training phase cannot beat an NPM extracted from an untrained model, there is obviously no point in training the model in the first place. This may sound obvious, but there are studies, e.g. in the field of cognitive aspects of natural language processing, where trained RNNs are compared with fixed order Markov models [1]. Note that such Markov models are often inferior to VLMMs.

In [4], five model classes were considered:

• RNN trained via Real Time Recurrent Learning [36] coupled with Extended Kalman Filtering [37, 38]. There were $A$ input and output units, one for each symbol in the alphabet $\mathcal{A}$, i.e. $N_I = N_O = A$. The inputs and desired outputs were encoded through binary one-of-$A$ encoding. At each time step, the output activations were normalized to obtain next-symbol probabilities.
• NPM extracted from the trained RNN.
• NPM extracted from the RNN prior to training.
• Fixed order Markov models.
• VLMM.

The models were tested on two data sets:


• A series of quantized activations of a laser in a chaotic regime. This sequence can be modelled quite successfully with finite memory predictors, although, because of the limited finite sequence length, predictive contexts of variable memory depth are necessary.
• An artificial language exhibiting deep recursive structures. Such structures cannot be grasped by finite memory models, and a fully trained RNN should be able to outperform models operating with finite input memory and NPMs extracted from untrained networks.

Model performance was measured by calculating the average (per symbol) negative log-likelihood on a hold-out set (not used during training). As expected, given a fine enough partition of the state space, the performances of RNNs and NPMs extracted from trained RNNs were almost identical. Because of finite sample size effects, fixed-order Markov models were beaten by VLMMs. Interestingly enough, when considering models with a comparable number of free parameters, the performance of NPMs extracted prior to training mimicked that of VLMMs. The Markovian bias of untrained RNNs was clearly visible.

On the laser sequence, the negative log-likelihood on the hold-out set improved slightly through expensive RNN training, but overall, almost the same performance could be achieved by the cheap construction of NPMs from randomly initialized untrained networks.

On the deep-recursion set, trained RNNs could achieve much better performance. However, it is essential to quantify how much useful information has really been induced during the training. As argued in this study, this can only be achieved by consulting VLMMs and NPMs extracted before training.

A large-scale comparison in [26] of Fractal Prediction Machines (FPM, section 4.3) with both VLMMs and fixed order Markov models revealed an almost equivalent performance (modulo model size) of FPM and VLMM. Again, fixed order Markov models were inferior to both FPM and VLMM.

    4.8 Geometric complexity of RNN state activations

It has been well known that RNN state space activations $r$ often live on a fractal support of a self-similar nature [39]. As outlined in section 3.2, in the case of contractive RNNs, one can actually estimate the size and geometric complexity of the RNN state space evolution.

Consider a Bernoulli source $S$ over the alphabet $\mathcal{A}$. When we drive an RNN with symbolic sequences generated by $S$, we get recurrent activations $\{r(t)\}$ that tend to group in clusters sampling the invariant set $K$ of the RNN iterated function system $\{f_s\}_{s \in \mathcal{A}}$.

Note that for each input symbol $s \in \mathcal{A}$, we have an affine mapping acting on $R$:

$$Q_s(r) = W^{r,r} r + W^{r,x} c(s) + t_f.$$


The range of possible net-in activations of recurrent units for input $s$ is then

$$C_s = \bigcup_{1 \le i \le N} [Q_s(R)]_i, \quad\quad (28)$$

where $[B]_i$ is the $i$-th slice of the set $B$, i.e. $[B]_i = \{r_i \mid r = (r_1, \ldots, r_N)^T \in B\}$, $i = 1, 2, \ldots, N$.

By finding upper bounds on the values of the derivative of the activation function,

$$g_s = \sup_{v \in C_s} |g'(v)|, \qquad g_{\max} = \max_{s \in \mathcal{A}} g_s,$$

and setting

$$\rho_s^{\max} = \sigma_+(W^{r,r}) \, g_s, \quad\quad (29)$$

where $\sigma_+(W^{r,r})$ is the largest singular value of $W^{r,r}$, we can bound the upper box-counting dimension of RNN state activations [40]: Suppose $\sigma_+(W^{r,r}) \, g_{\max} < 1$ and let $d^{\max}$ be the unique solution of $\sum_{s \in \mathcal{A}} (\rho_s^{\max})^{d} = 1$. Then,

$$\dim_B^{+} K \le d^{\max}.$$

Analogously, the Hausdorff dimension of RNN states can be lower-bounded as follows: Define

$$q = \min_{s, s' \in \mathcal{A},\, s \ne s'} \|W^{r,x}(c(s) - c(s'))\|$$

and assume $\sigma_+(W^{r,r}) < \min\{(g_{\max})^{-1},\, q / \mathrm{diam}(R)\}$. Let $d^{\min}$ be the unique solution of $\sum_{s \in \mathcal{A}} (\rho_s^{\min})^{d} = 1$, where $\rho_s^{\min} = \sigma_-(W^{r,r}) \inf_{v \in C_s} |g'(v)|$ and $\sigma_-(W^{r,r})$ is the smallest singular value of $W^{r,r}$. Then, $\dim_H K \ge d^{\min}$.

Closed-form (less tight) bounds are also possible [40]:

$$\dim_B^{+} K \le \frac{\log A}{-\left(\log N + \log \|W^{r,r}\|_{\max} + \log g_{\max}\right)},$$

where $\|W^{r,r}\|_{\max} = \max_{1 \le i, j \le N} |W^{r,r}_{ij}|$; and

$$\dim_H K \ge \frac{\log A}{-\log \rho_{\min}},$$

where $\rho_{\min} = \min_{s \in \mathcal{A}} \rho_s^{\min}$.

A more involved multifractal analysis of RNN state activations under more general stochastic sources driving the RNN input can be found in [7].
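A crude numerical illustration of the upper bound: if the derivative of tanh is bounded globally by 1 (a looser choice than the per-symbol suprema over $C_s$), every $\rho_s^{\max}$ collapses to $\sigma_+(W^{r,r})$, and the solution of $\sum_s \rho^{d} = 1$ is simply $d = \log A / (-\log \rho)$. The sketch below computes this for a random small-weight matrix; the weights and sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(5)

# Crude version of the upper bound on dim_B^+ K for a small-weight tanh RNN:
# use g_s <= 1 (global bound on tanh'), so rho_s^max = sigma_max(W_rr) for all s,
# and the solution of sum_s rho**d = 1 is d = log(A) / -log(rho).
A, N = 4, 20
W_rr = 0.05 * rng.standard_normal((N, N))

rho = np.linalg.svd(W_rr, compute_uv=False)[0]      # largest singular value
if rho < 1.0:
    d_upper = np.log(A) / -np.log(rho)
    print(f"sigma_max(W_rr) = {rho:.3f},  dim_B^+ K <= {d_upper:.3f}")
else:
    print("sigma_max(W_rr) >= 1: the contraction-based bound does not apply")
```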

    5 Unsupervised learning setting

In this section, we will study model formulations for constructing topographic maps of sequential data. While most approaches to topographic map formation assume that the data points are members of a finite-dimensional vector space of a fixed dimension, there has been considerable interest in extending topographic maps to more general data structures, such as sequences or


trees. Several modifications of the standard Self-Organizing Map (SOM) [8] to sequences and/or tree structures have been proposed in the literature [41, 15]. Several of these modifications equip the standard SOM with additional feedback connections that allow for natural processing of recursive data types. Typical examples of such models are the Temporal Kohonen Map [9], recurrent SOM [10], feedback SOM [11], recursive SOM [12], merge SOM [13] and SOM for structured data [14].

In analogy to well-known capacity results derived for supervised recursive models e.g. in [42, 43], some approaches to investigating the principled capacity of these models have been presented in the last few years. Thereby, depending on the model, the situation is diverse: whereas SOM for structured data and merge SOM (both with arbitrary weights) are equivalent to finite automata [44, 13], recursive SOM equipped with a somewhat simplified context model can simulate at least pushdown automata [45]. The Temporal Kohonen Map and recurrent SOM are restricted to finite memory models due to their restricted recurrence, even with arbitrary weights [45, 44]. However, the exact capacity of most unsupervised recursive models, as well as the biases which arise during training, have hardly been understood so far.

    5.1 Recursive SOM

In this section we will analyse the recursive SOM (RecSOM) [12]. There are two main reasons for concentrating on RecSOM. First, RecSOM transcends the simple local recurrence of leaky integrators of earlier SOM-motivated models for processing sequential data and consequently can represent much richer dynamical behavior [15]. Second, the underlying continuous state-space dynamics of RecSOM makes this model particularly suitable for analysis along the lines of section 3.1.

The map is formed at the recurrent layer. Each unit $i \in \{1, 2, \ldots, N\}$ in the map has two weight vectors associated with it:

• $w_i^{r,x} \in X$ — linked with an $N_I$-dimensional input $x(t)$ feeding the network at time $t$;
• $w_i^{r,r} \in R$ — linked with the context

$$r(t-1) = (r_1(t-1), r_2(t-1), \ldots, r_N(t-1))^T$$

containing map activations $r_i(t-1)$ from the previous time step.

Hence, the weight matrices $W^{r,x}$ and $W^{r,r}$ can be written as $W^{r,x} = ((w_1^{r,x})^T, (w_2^{r,x})^T, \ldots, (w_N^{r,x})^T)^T$ and $W^{r,r} = ((w_1^{r,r})^T, (w_2^{r,r})^T, \ldots, (w_N^{r,r})^T)^T$, respectively.

The activation of a unit $i$ at time $t$ is computed as

$$r_i(t) = \exp(-d_i(t)), \quad\quad (30)$$

where

    ri(t) = exp(di(t)), (30)

    where


$$d_i(t) = \alpha \, \|x(t) - w_i^{r,x}\|^2 + \beta \, \|r(t-1) - w_i^{r,r}\|^2. \quad\quad (31)$$

In eq. (31), $\alpha > 0$ and $\beta > 0$ are parameters that respectively quantify the influence of the input and the context on the map formation. The output of RecSOM is the index of the unit with maximum activation (best-matching unit),

$$y(t) = h(r(t)) = \operatorname{argmax}_{i \in \{1, 2, \ldots, N\}} r_i(t). \quad\quad (32)$$

Both weight vectors can be updated using SOM-type learning [12]:

$$\Delta w_i^{r,x} = \eta \, \nu(i, y(t)) \, (x(t) - w_i^{r,x}), \quad\quad (33)$$
$$\Delta w_i^{r,r} = \eta \, \nu(i, y(t)) \, (r(t-1) - w_i^{r,r}), \quad\quad (34)$$

where $0 < \eta < 1$ is the learning rate. The neighborhood function $\nu(i, k)$ is a Gaussian (of width $\sigma$) on the distance $d(i, k)$ of units $i$ and $k$ in the map:

$$\nu(i, k) = e^{-\frac{d(i,k)^2}{\sigma^2}}. \quad\quad (35)$$

The neighborhood width $\sigma$ linearly decreases in time to allow for forming a topographic representation of input sequences. A schematic illustration of RecSOM is presented in figure 3.

The time evolution (31) becomes

$$r_i(t) = \exp(-\alpha \|x(t) - w_i^{r,x}\|^2) \, \exp(-\beta \|r(t-1) - w_i^{r,r}\|^2). \quad\quad (36)$$

We denote the Gaussian kernel of inverse variance $\kappa > 0$, acting on $\mathbb{R}^N$, by $G_\kappa(\cdot, \cdot)$, i.e. for any $u, v \in \mathbb{R}^N$,

$$G_\kappa(u, v) = e^{-\kappa \|u - v\|^2}. \quad\quad (37)$$

Finally, the state space evolution (2) can be written for RecSOM as follows:

$$r(t) = f(x(t), r(t-1)) = (f_1(x(t), r(t-1)), \ldots, f_N(x(t), r(t-1))), \quad\quad (38)$$

where

$$f_i(x, r) = G_\alpha(x, w_i^{r,x}) \, G_\beta(r, w_i^{r,r}), \quad i = 1, 2, \ldots, N. \quad\quad (39)$$

The output response to an input time series $\{x(t)\}$ is the time series $\{y(t)\}$ of best-matching unit indices given by (32).
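A minimal RecSOM sketch implementing the activation (36)-(39) and one SOM-style update step (33)-(35) on a one-dimensional lattice is given below; the parameter values, lattice shape and training schedule are illustrative simplifications (e.g. the neighborhood width is kept fixed rather than annealed).

```python
import numpy as np

rng = np.random.default_rng(6)

# Minimal RecSOM on a 1-D lattice of N units; inputs are one-hot symbol codes.
A, N = 4, 30
alpha, beta = 2.0, 0.5          # input / context weighting (eq. 31)
eta, sigma = 0.1, 3.0           # learning rate and neighbourhood width (eqs. 33-35)
codes = np.eye(A)

W_x = rng.uniform(0, 1, (N, A))     # w_i^{r,x}
W_r = rng.uniform(0, 0.1, (N, N))   # w_i^{r,r}
grid = np.arange(N)                 # unit positions on the lattice

def activate(x, r_prev):
    """Eqs. (36)-(39): r_i = exp(-alpha ||x - w_x_i||^2) * exp(-beta ||r_prev - w_r_i||^2)."""
    d = alpha * ((x - W_x) ** 2).sum(1) + beta * ((r_prev - W_r) ** 2).sum(1)
    return np.exp(-d)

r = np.zeros(N)
for s in rng.integers(0, A, 2000):               # drive the map with a random symbol stream
    r_prev, x = r, codes[s]
    r = activate(x, r_prev)
    win = int(np.argmax(r))                      # best-matching unit (eq. 32)
    nu = np.exp(-((grid - win) ** 2) / sigma ** 2)   # neighbourhood function (eq. 35)
    W_x += eta * nu[:, None] * (x - W_x)             # eq. (33)
    W_r += eta * nu[:, None] * (r_prev - W_r)        # eq. (34)

print("winner for last input:", win)
```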

5.2 Markovian organization of receptive fields in contractive RecSOM

As usual in topographic maps, each unit on the map can be naturally represented by its receptive field (RF). The receptive field of a unit $i$ is the common suffix of all sequences for which that unit becomes the best-matching unit.


Fig. 3. Schematic illustration of RecSOM. Dimensions (neural units) of the state space $R$ are topologically organized on a grid structure. The degree of closeness of units $i$ and $k$ on the grid is given by the neighborhood function $\nu(i, k)$. The output of RecSOM is the index of the maximally activated unit (dimension of the state space).

It is common to visually represent trained topographic maps by printing out the RFs on the map grid.

Let us discuss how contractive fixed-input maps $f_s$ (3) shape the overall organization of RFs in the map. For each input symbol $s \in \mathcal{A}$, the autonomous dynamics

$$r(t) = f_s(r(t-1)) \quad\quad (40)$$

induces an output dynamics $y(t) = h(r(t))$ of best-matching (winner) units on the map. Since for each input symbol $s \in \mathcal{A}$ the fixed-input dynamics (40) is a contraction, it will be dominated by a unique attractive fixed point $r_s$. It follows that the output dynamics $\{y(t)\}$ on the map settles down at the unit $i_s$ corresponding to the mode of $r_s$. The unit $i_s$ will be most responsive to input subsequences ending with long blocks of symbol $s$. Receptive fields of other units on the map will be organized with respect to the closeness of the units to the fixed-input winners $i_s$, $s \in \mathcal{A}$. When symbol $s$ is seen at time $t$, the mode of the map activation profile $r(t)$ starts drifting towards the unit $i_s$. The more consecutive symbols $s$ we see, the more dominant the attractive fixed point of $f_s$ becomes and the closer the winner position is to $i_s$.

It has indeed been reported that maps of sequential data obtained by RecSOM often seem to have a Markovian flavor: the units on the map become sensitive to recently observed symbols. Suffix-based RFs of the neurons are


topographically organized in connected regions according to the last seen symbols. If two prefixes $s_{1:p}$ and $s_{1:q}$ of a long sequence $s_1 \ldots s_{p-2} s_{p-1} s_p \ldots s_{q-2} s_{q-1} s_q \ldots$ share a common suffix of length $L$, it follows from (10) that for any $r \in R$,

$$\|f_{s_{1:p}}(r) - f_{s_{1:q}}(r)\| \le \rho_{\max}^{L} \, \mathrm{diam}(R).$$

For sufficiently large $L$, the two activations $r^1 = f_{s_{1:p}}(r)$ and $r^2 = f_{s_{1:q}}(r)$ will be close enough to have the same location of the mode,⁸

$$i^* = h(r^1) = h(r^2),$$

and the two subsequences $s_{1:p}$ and $s_{1:q}$ yield the same best-matching unit $i^*$ on the map, irrespective of the position of the subsequences in the input stream. All that matters is that the prefixes share a sufficiently long common suffix. We say that such an organization of RFs on the map has a Markovian flavour, because it is shaped solely by the suffix structure of the processed subsequences, and it does not depend on the temporal context in which they

occur in the input stream⁹.

The RecSOM parameter $\beta$ weighs the significance of importing information about the possibly distant past into the processing of sequential data. Intuitively, when $\beta$ is sufficiently small, i.e. when information about the very recent inputs dominates processing in RecSOM, the resulting maps should have a Markovian flavour. This intuition was given a more rigorous form in [16]. In particular, theoretical bounds on the parameter $\beta$ that guarantee contractiveness of the fixed-input maps $f_s$ were derived. As argued above, contractive fixed-input mappings are likely to produce Markovian organizations of RFs on the RecSOM map.

Denote by $G_\alpha(x)$ the collection of activations coming from the feed-forward part of RecSOM,

$$G_\alpha(x) = \big(G_\alpha(x, w_1^{r,x}), G_\alpha(x, w_2^{r,x}), \ldots, G_\alpha(x, w_N^{r,x})\big). \quad\quad (41)$$

Then we have [16]:

Theorem 2. Consider an input $x \in X$. If for some $\rho \in [0, 1)$,

$$\beta \le \frac{\rho^2 e}{2 \, \|G_\alpha(x)\|^2}, \quad\quad (42)$$

then the mapping $f_x(r) = f(x, r)$ is a contraction with contraction coefficient $\rho$.

⁸ Or at least mode locations on neighboring grid points of the map.
⁹ Theoretically, there can be situations where (1) the locations of the modes of $r^1$ and $r^2$ are distinct, but the distance between $r^1$ and $r^2$ is small; or where (2) the modes of $r^1$ and $r^2$ coincide, while their distance is quite large. This follows from the discontinuity of the output map $h$. However, in our extensive experimental studies, we have registered only a negligible number of such cases.
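Contractiveness of a fixed-input RecSOM map $f_x$ can also be probed numerically: the sketch below samples random state pairs and measures the ratio $\|f_x(r) - f_x(r')\| / \|r - r'\|$ for several values of $\beta$. Weights and parameter values are illustrative, and the sampling is a numerical check rather than a proof.

```python
import numpy as np

rng = np.random.default_rng(7)

# Probe whether a RecSOM fixed-input map f_x (eq. 39) behaves contractively for
# given alpha, beta, by sampling random state pairs and measuring distance ratios.
A, N = 4, 30
alpha = 2.0
W_x = rng.uniform(0, 1, (N, A))
W_r = rng.uniform(0, 1, (N, N))
x = np.eye(A)[0]                               # a fixed input symbol
G_alpha_x = np.exp(-alpha * ((x - W_x) ** 2).sum(1))

def f_x(r, beta):
    return G_alpha_x * np.exp(-beta * ((r - W_r) ** 2).sum(1))

for beta in (0.5, 2.0, 20.0):
    ratios = []
    for _ in range(2000):
        r1, r2 = rng.uniform(0, 1, N), rng.uniform(0, 1, N)
        ratios.append(np.linalg.norm(f_x(r1, beta) - f_x(r2, beta))
                      / np.linalg.norm(r1 - r2))
    print(f"beta = {beta:5.1f}:  max sampled ratio ~ {max(ratios):.3f}")
```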


Corollary 1. Provided $\beta \le \frac{\rho^2 e}{2N}$ for some $\rho \in [0, 1)$, the mapping $f_x$ is a contraction with contraction coefficient $\rho$ for every input $x \in X$, since $\|G_\alpha(x)\|^2 \le N$.

A similar Markovian analysis can be carried out for the SOM for structured data (SOMSD) [14], whose context space $R^I$ consists of the neuron locations on the map lattice. Two properties are assumed:

1. SOMSD has granularity $\epsilon_1 > 0$ if for every $a \in \mathcal{A}$ and every $r^I \in R^I$ a neuron $i$ can be found such that $\|a - w_i^{r,x}\| \le \epsilon_1$ and $\|r^I - w_i^{r,r}\| \le \epsilon_1$.

2. SOMSD is distance preserving given $\epsilon_2 > 0$ if for every two neurons $i$ and $j$ the following inequality holds: $\big|\, \|I_i - I_j\| - \|w_i^{r,x} - w_j^{r,x}\| - \|w_i^{r,r} - w_j^{r,r}\| \,\big| \le \epsilon_2$. This inequality relates the distances of the locations of neurons on the lattice to the distances of their content.

Assume a SOMSD is given with initial context $r_0$. We denote the output response to the empty sequence $\varepsilon$ by $i_\varepsilon = h(r_0)$, the winner unit for an input sequence $u$ by $i_u := h \circ f_u(r_0)$, and the corresponding winner location on the map grid by $I_u$. We can define an explicit Markovian metric on sequences in the following way: $d_{\mathcal{A}^*}: \mathcal{A}^* \times \mathcal{A}^* \to \mathbb{R}$, $d_{\mathcal{A}^*}(\varepsilon, \varepsilon) = 0$, $d_{\mathcal{A}^*}(u, \varepsilon) = d_{\mathcal{A}^*}(\varepsilon, u) = \|I_\varepsilon - I_u\|$, and for two nonempty sequences $u = u_1 \ldots u_{p-1} u_p$, $v = v_1 \ldots v_{q-1} v_q$ over $\mathcal{A}$,

$$d_{\mathcal{A}^*}(u, v) = \|c(u_p) - c(v_q)\| + \lambda \, d_{\mathcal{A}^*}(u_1 \ldots u_{p-1}, v_1 \ldots v_{q-1}). \quad\quad (47)$$

Assume $\lambda < 1$. Then the output of SOMSD is Markovian in the sense that the metric defined in (47) approximates the metric induced by the winner locations of sequences. More precisely, the following inequality is valid:

$$\big| \, d_{\mathcal{A}^*}(u, v) - \|I_u - I_v\| \, \big| \le \frac{\epsilon_2 + 4(\epsilon_1 + \epsilon_2)\,C}{1 - \lambda},$$

where $C$ is a constant depending on the SOMSD weighting of the input and context terms.

This inequality is immediate if $u$ or $v$ is empty. For nonempty sequences $u = u_{1:p-1} u_p$ and $v = v_{1:q-1} v_q$, we find

$$\big| \, d_{\mathcal{A}^*}(u, v) - \|I_u - I_v\| \, \big| = \big| \, \|c(u_p) - c(v_q)\| + \lambda \, d_{\mathcal{A}^*}(u_{1:p-1}, v_{1:q-1}) - \|I_u - I_v\| \, \big|$$
$$\le \epsilon_2 + \big| \, \|w_{i_u}^{r,x} - w_{i_v}^{r,x}\| - \|c(u_p) - c(v_q)\| \, \big| + \big| \, \|w_{i_u}^{r,r} - w_{i_v}^{r,r}\| - \lambda \, d_{\mathcal{A}^*}(u_{1:p-1}, v_{1:q-1}) \, \big|.$$

The winner $i_u$ is the neuron for which the weighted cost of matching $c(u_p)$ and $I_{u_{1:p-1}}$ is minimal. We have assumed sufficient granularity, i.e. there exists a neuron with weights at most $\epsilon_1$ away from $c(u_p)$ and $I_{u_{1:p-1}}$. Therefore, $\|w_{i_u}^{r,x} - c(u_p)\|$ and $\|w_{i_u}^{r,r} - I_{u_{1:p-1}}\|$ can be bounded in terms of $\epsilon_1$ and the weighting parameters; applying the same argument to $i_v$ and iterating the recursion over the common structure of $u$ and $v$ yields the stated bound.


For the merge SOM (MSOM) [13], a Markovian bias arises directly from the learning dynamics. Thereby, the limit of Hebbian learning for vanishing neighborhood size is considered, which yields the update rules

$$\Delta w_{i_t}^{r,x} = \eta \, (x(t) - w_{i_t}^{r,x}) \quad\quad (49)$$
$$\Delta w_{i_t}^{r,r} = \eta \, (r_M(t-1) - w_{i_t}^{r,r}) \quad\quad (50)$$

after time step $t$, whereby $x(t)$ constitutes the input at time step $t$,

$$r_M(t-1) = \gamma \, w_{i_{t-1}}^{r,x} + (1 - \gamma) \, w_{i_{t-1}}^{r,r}$$

is the context at time step $t$, and $i_t$ is the winner at time step $t$; $\eta > 0$ is the learning rate and $\gamma \in (0, 1)$ is the merge parameter.

Assume an input series $\{x(t)\}$ is presented to MSOM and assume the initial context $r_0$ is the null vector. Then, if a sufficient number of neurons is available such that there exists a distinct winner for every input $x(t)$, the following weights constitute a stable fixed point of Hebbian learning as described in eqs. (49-50):

$$w_{i_t}^{r,x} = x(t) \quad\quad (51)$$
$$w_{i_t}^{r,r} = \sum_{j=1}^{t-1} \gamma (1 - \gamma)^{j-1} \, x(t-j). \quad\quad (52)$$

The proof of this fact can be found in [15, 13]. Note that, depending on the size of $\gamma$, the context representation yields sequence codes which are very similar to the recursive states provided by the fractal prediction machines (FPM) of section 4.3. For $\gamma < 0.5$ and unary inputs $x(t)$ (one-of-$A$ encoding), (52) yields a fractal whereby the global position in the state space is determined by the most recent entries of the input series. Hence, a Markovian bias is obtained by Hebbian learning.
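The sketch below evaluates fixed-point context codes of the form (52), as reconstructed above, for one-hot coded symbol streams, and shows that streams sharing a recent past receive nearby context vectors; the merge parameter value and the sequences are illustrative.

```python
import numpy as np

# Fixed-point MSOM context codes (eq. 52) for streams of one-hot coded symbols:
# sequences sharing a recent past are assigned nearby context vectors.
A = 3
codes = np.eye(A)
gamma = 0.5                                   # merge parameter (illustrative value)

def msom_context(seq):
    """w^{r,r} fixed point of eq. (52): sum_j gamma * (1 - gamma)**(j - 1) * x(t - j)."""
    ctx = np.zeros(A)
    for j, s in enumerate(reversed(seq), start=1):
        ctx += gamma * (1 - gamma) ** (j - 1) * codes[s]
    return ctx

suffix = [0, 2, 1, 1, 2]
u = [1, 1, 1, 0, 0, 2] + suffix               # different prefixes, same recent past
v = [2, 0, 2] + suffix
print(np.linalg.norm(msom_context(u) - msom_context(v)))      # small: suffix-based coding
print(np.linalg.norm(msom_context(u) - msom_context(v[:-1])))  # larger: suffixes differ
```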

This can clearly be observed in experiments; compare figure 4¹¹, which shows the result of an experiment conducted in [13]: an MSOM with 644 neurons is trained on DNA sequences. The data consist of windows around potential splice sites from C. elegans as described in [46]. The symbols T, C, G, A are encoded as the vertices of a tetrahedron in $\mathbb{R}^3$. The windows are concatenated and presented to the MSOM using merge parameter $\gamma = 0.5$. Thereby, the neighborhood function used for training is not a standard SOM lattice; rather, training is data-driven via ranks, as in neural gas networks [47]. Figure 4 depicts a projection of the learnt contexts. Clearly, a fractal structure which embeds sequences in the whole context space can be observed, substantiating the theoretical finding with experimental evidence.

Note that, unlike FPMs, which use a fixed fractal encoding, the encoding of MSOM emerges from training. Therefore, it takes the internal probability distribution of symbols into account, automatically assigning a larger portion of the space to those inputs which are presented more frequently to the map.

¹¹ We would like to thank Marc Strickert for providing the picture.


Fig. 4. Projection of the three-dimensional fractal contexts which are learnt when training an MSOM on DNA strings, whereby the symbols A, C, T, G are embedded in three dimensions.

    6 Applications in bioinformatics

In this section we briefly describe applications of the architectural bias phenomenon in the supervised learning scenario. The natural Markovian organization of biological sequences within the RNN state space may be exploited when solving some sequence-related problems of bioinformatics.

    6.1 Genomic data

Enormous amounts of genomic sequence data have been produced in recent years. Unfortunately, the emergence of experimental data for the functional parts of such large-scale genomic repositories is lagging behind. For example, several hundred thousand proteins are known, but only about 35,000 have had their structure determined. One third of all proteins are believed to be membrane associated, but fewer than a thousand have been experimentally well-characterized as such.

As commonly assumed, genomic sequence data specifies the structure and function of its products (proteins). Efforts are underway in the field of bioinformatics to link genomic data to experimental observations with the goal of predicting such characteristics for the yet unexplored data [48]. DNA, RNA and proteins are all chain molecules, made up of distinct monomers, namely nucleotides (DNA and RNA) and amino acids (proteins). We choose to cast such chemical compounds as symbol sequences. There are four different nucleotides and twenty different amino acids, here referred to as alphabets $\mathcal{A}_4$ and $\mathcal{A}_{20}$, respectively.


A functional region of DNA typically includes thousands of nucleotides. The average number of amino acids in a protein is around the 300 mark, but sequence lengths vary greatly. The combinatorial nature of genomic sequence data poses a challenge for machine learning approaches linking sequence data with observations. Even when thousands of samples are available, the sequence space is sparsely populated. The architectural bias of predictive models may thus influence the result greatly. This section demonstrates that a Markovian state-space organization, as imposed by the recursive model in this paper, discerns patterns with a biological flavor [49].

In bioinformatics, motifs are perceived as recurring patterns in sequence data (DNA, RNA and proteins) with an observed or conjectured biological significance (the character of which varies widely). Problems like promoter and DNA-binding site discovery [50, 51], identification of alternatively spliced exons [52] and glycosylation sites are successfully approached using the concept of sequence motifs. Motifs are usually quite short and, to be useful, may require a more permissive description than a list of consecutive sequence symbols.

Hence, we augment the motif alphabet A to include the so-called wildcard symbol *: A_* = A ∪ {*}. Let M = m_1 m_2 ... m_{|M|} be a motif, where m_i ∈ A_*. As outlined below, the wildcard symbol has special meaning in the motif sequence matching process.

    6.2 Recurrent architectures promote motif discovery

We are specifically interested in whether the Markovian state space organization promotes the discovery of a motif within a sequence. Following [49], position-dependent motifs are patterns of symbols relative to a fixed sequence position. A motif match is defined as

$$
\mathrm{match}(s_{1:n}, M) =
\begin{cases}
t, & \text{if } \forall i \in \{1, \ldots, |M|\}\ \big[\, m_i = * \ \vee\ m_i = s_{n-|M|+i} \,\big] \\
f, & \text{otherwise,}
\end{cases}
$$

where n ≥ |M|. Notice that the motif must occur at the end of the sequence.

To inspect how well sequences with and without a given motif M (t and f,

respectively) are separated in the state-space, Boden and Hawkins determined a motif match entropy for each of the Voronoi compartments V_1, ..., V_M, introduced in the context of NPM construction in section 4.1. The compartments were formed by vector quantizing the final activations produced by a neural-based architecture on a set of sequences (see Section 4). With each codebook vector b_i ∈ R^N, i = 1, 2, ..., M, we associate the probability P(t|i) of observing a state r ∈ V_i corresponding to a sequence motif match. This probability is estimated by tracking the matching and non-matching sequences while counting

the number of times V_i is visited. The entropy for each codebook is defined as

$$ H_i = -P(t|i)\log_2 P(t|i) - \big(1 - P(t|i)\big)\log_2\big(1 - P(t|i)\big) \qquad (53) $$


and describes the homogeneity within the Voronoi compartment by considering the proportion of motif positives and negatives. An entropy of 0 indicates perfect homogeneity (contained states are exclusively for matching or non-matching sequences). An entropy of 1 indicates random organization (a matching state is observed by chance). The average codebook entropy, H = (1/M) Σ_{i=1}^{M} H_i, is used to characterize the overall state-space organization.

In our experiments, the weights W^{r,x} and W^{r,r} were randomly drawn from a uniform distribution over [-0.5, +0.5]. This, together with the activation function g, ensured contractiveness of the state transition map f and hence (as explained in section 3.1) Markovian organization of the RNN state space.
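To make the evaluation pipeline concrete, the following minimal sketch (not the code used in [49]) collects final recurrent activations for a set of sequences and computes the entropies of eq. (53). The logistic activation, the one-hot symbol codes, and the skipping of empty compartments are assumptions of the sketch; the codebook vectors are taken as given (e.g. produced by any vector quantizer).

```python
import numpy as np

WILDCARD = "*"

def motif_match(seq, motif):
    """Position-dependent match: True iff the motif (wildcards allowed) aligns with the end of seq."""
    if len(seq) < len(motif):
        return False
    tail = seq[len(seq) - len(motif):]
    return all(m == WILDCARD or m == s for m, s in zip(motif, tail))

def final_state(seq, codes, W_x, W_r, bias):
    """Final activation of a randomly initialized recurrent layer (logistic units keep the map contractive)."""
    r = np.zeros(W_r.shape[0])
    for s in seq:
        r = 1.0 / (1.0 + np.exp(-(W_x @ codes[s] + W_r @ r + bias)))
    return r

def average_entropy(states, matches, codebooks):
    """Average of the per-compartment entropies H_i of eq. (53).

    states    -- array (S, N) of final activations, one row per sequence
    matches   -- boolean array (S,), True if the sequence ends in a motif match
    codebooks -- array (M, N) of codebook vectors b_i
    """
    assign = np.linalg.norm(states[:, None, :] - codebooks[None, :, :], axis=2).argmin(axis=1)
    entropies = []
    for i in range(len(codebooks)):
        hits = matches[assign == i]
        if hits.size == 0:
            continue                          # empty compartments are simply skipped here
        p = hits.mean()                       # estimate of P(t|i)
        if p == 0.0 or p == 1.0:
            entropies.append(0.0)             # perfectly homogeneous compartment
        else:
            entropies.append(-p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p))
    return float(np.mean(entropies))

# Usage: alphabet A4, one-hot codes, weights drawn uniformly from [-0.5, +0.5].
rng = np.random.default_rng(0)
codes = dict(zip("ACGT", np.eye(4)))
N = 10                                        # state-space dimensionality
W_x = rng.uniform(-0.5, 0.5, (N, 4))
W_r = rng.uniform(-0.5, 0.5, (N, N))
bias = rng.uniform(-0.5, 0.5, N)
```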

In an attempt to provide a simple baseline alternative to our recursive neural-based models, a feedforward neural network was devised, accepting the whole sequence at its input without any recursion. Given the maximum input sequence length max, the input layer has N_I · max units and the input code of a sequence s_{1:n} = s_1 ... s_n takes the form

$$ c(s_{1:n}) = c(s_1) \oplus \cdots \oplus c(s_n) \oplus \{0\}^{N_I \cdot (max - n)}, $$

where ⊕ denotes vector concatenation. Hence, c(s_{1:n}) is a concatenation of the input codes for the symbols in s_{1:n}, followed by N_I · (max - n) zeros. The hidden-layer response of the feedforward network to s_{1:n} is then

$$ f_{s_{1:n}} = f\big(c(s_{1:n})\big) = g\big(W^{r,x}\, c(s_{1:n}) + t_f\big), \qquad (54) $$

where the input-to-hidden layer weights (again randomly drawn from [-0.5, +0.5]) are collected in the N × (N_I · max) matrix W^{r,x}.
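A minimal sketch of this baseline encoding and of the response (54) follows; the one-hot symbol codes and the logistic choice for g are assumptions of the example.

```python
import numpy as np

def ff_input_code(seq, codes, max_len):
    """c(s_{1:n}): concatenated symbol codes, zero-padded to N_I * max_len entries (assumes len(seq) <= max_len)."""
    N_I = len(next(iter(codes.values())))
    c = np.zeros(N_I * max_len)
    for pos, s in enumerate(seq):
        c[pos * N_I:(pos + 1) * N_I] = codes[s]
    return c

def ff_hidden_response(c, W, t_f):
    """Hidden-layer response of eq. (54), with g taken to be the logistic function."""
    return 1.0 / (1.0 + np.exp(-(W @ c + t_f)))

# Usage: alphabet A4, one-hot codes, random weights in [-0.5, +0.5].
rng = np.random.default_rng(1)
codes = dict(zip("ACGT", np.eye(4)))
max_len, N = 12, 10
W = rng.uniform(-0.5, 0.5, (N, 4 * max_len))
t_f = rng.uniform(-0.5, 0.5, N)
h = ff_hidden_response(ff_input_code("ACGTTGA", codes, max_len), W, t_f)
```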

In each trial, a random motif M was first selected and then 500 random sequences (250 with a match, 250 without) were generated. Ten differently initialized networks (both RNN and the feedforward baseline) were then subjected to the sequences and the average entropy H was determined. By varying sequence and motif lengths, the proportion of wildcards in the motifs, the state space dimensionality and the number of codebook vectors extracted from sequence states, Boden and Hawkins established that recurrent architectures were clearly superior in discerning random but pre-specified motifs in the state space (see Figure 5). However, there was no discernible difference between the two types of networks (RNN and feedforward) for shift-invariant motifs (an alternative form which allows matches to appear at any point in the sequence). To leverage the bias, a recurrent network should thus continually monitor its state space.

These synthetic trials (see [49]) provide evidence that a recurrent network naturally organizes its state space at each step so as to enable classification of extended and flexible sequential patterns that led up to the current time step. The results have been shown to hold also for higher-order recurrent architectures [53].



Fig. 5. The average entropy H (y-axis) for recurrent networks (unfilled markers) and baseline feedforward networks (filled markers) when A4 (squares) and A20 (circles) were used to generate sequences of varying length (x-axis) and position-dependent motifs (with length equal to half the sequence length and populated with 33% wildcards). The number of codebooks, M, was 10.

    6.3 Discovering discriminative patterns in signal peptides

A long-standing problem of great biological significance is the detection of so-called signal peptides. Signal peptides are reasonably short protein segments that fulfil a transportation role in the cell. Basically, if a protein has a signal peptide, it translocates to the exoplasmic space. The signal peptide is cleaved off in the process.

Viewed as symbol strings, signal peptides are rather diverse. However, there are some universal characteristics. For example, they always appear at the N-terminus of the protein, they are 15-30 residues long, and they consist of a stretch of seven to 15 hydrophobic (lipid-favouring) amino acids and a few well-conserved, small and neutral monomers close to the cleavage site. In Figure 6, nearly 400 mammalian signal peptides are aligned at their respective cleavage site to illustrate a subtle pattern of conservation.

In an attempt to simplify the sequence classification problem, each position of the sequence can first be tagged as belonging (t) or not belonging (f) to a signal peptide [54]. SignalP is a feedforward network that is designed to look at all positions of a sequence (within a carefully sized context window) and trained to classify each into the two groups (t and f) [54].

The Markovian character of the recursive model's state space suggests that replacing the feedforward network with a recurrent network may provide a better starting point for a training algorithm.



Fig. 6. Mammalian signal peptide sequences aligned at their cleavage site (between positions 26 and 27). Each amino acid is represented by its one-letter code. The relative frequency of each amino acid in each aligned position is indicated by the height of each letter (as measured in bits). The graph is produced using WebLogo.

However, since biological sequences may exhibit dependencies in both directions (upstream as well as downstream), the simple recurrent network may be insufficient.

Hawkins and Boden [53, 55] replaced the feedforward network with a bi-directional recurrent network [56]. The bi-directional variant is simply two conventional (uni-directional) recurrent networks coupled so as to merge two states, each produced by traversing the sequence from one direction (either upstream or downstream relative to the point of interest). In Hawkins and Boden's study, the network's ability was only examined after training.
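The state-merging idea can be sketched as follows. This is a simplification rather than the exact setup of [53, 55, 56]; the logistic units, the concatenation merge and the per-position alignment are assumptions of the sketch.

```python
import numpy as np

def rnn_states(inputs, W_x, W_r, bias):
    """All states of a simple recurrent layer driven by a sequence of input vectors."""
    r, states = np.zeros(W_r.shape[0]), []
    for x in inputs:
        r = 1.0 / (1.0 + np.exp(-(W_x @ x + W_r @ r + bias)))
        states.append(r)
    return states

def bidirectional_states(inputs, fwd_params, bwd_params):
    """Per-position merge of an upstream (left-to-right) and a downstream (right-to-left) pass."""
    forward = rnn_states(inputs, *fwd_params)
    backward = rnn_states(inputs[::-1], *bwd_params)[::-1]   # re-align with the forward pass
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]
```

Each merged state can then be used to tag the corresponding sequence position as belonging (t) or not belonging (f) to a signal peptide.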

After training, the average test error (as measured over known signal peptides and known negatives) dropped by approximately 25% compared to feedforward networks. The position-specific errors for both classes of architectures are presented in Figure 7. Several setups of the bi-directional recurrent network are possible, but the largest performance increase was achieved when the number of symbols presented at each step was 10 (for both the upstream and downstream networks).

Similar to signal peptides, mitochondrial targeting peptides and chloroplast transit peptides traffic proteins; in their case, proteins are translocated to the mitochondrion and the chloroplast, respectively. Bi-directional recurrent networks applied to such sequence data show a similar pattern of improvement compared to feedforward architectures [53].

The combinatorial space of possible sequences and the comparatively small number of known samples make the application of machine learning to bioinformatics challenging. With significant data variance, it is essential to design models that guide the learning algorithm to meaningful portions of the model parameter space. By presenting exactly the same training data sets to two classes of models, we note that the architectural bias of recurrent networks seems to enhance the prospects of achieving valid generalization.



Fig. 7. The average error of recurrent and feedforward neural networks over the first 60 residues in a sequence test set, discriminating between residues that belong to a signal peptide and those that do not. The signal peptide always appears at the N-terminus of a sequence (position 1 and onwards; the average length is 23).

As noted in [53], the locations of the subtle sequential patterns evident in the data sets (cf. Figure 6) correspond well with those regions at which the recurrent networks excel. The observation lends empirical support to the idea that recursive models are well suited to this broad class of bioinformatics problems.

    7 Conclusion

We have studied the architectural bias of dynamic neural network architectures towards suffix-based Markovian input sequence representations. In the supervised learning scenario, the cluster structure that emerges in the recurrent layer prior to training is, due to the contractive nature of the state transition map, organized in a Markovian manner. In other words, the clusters emulate Markov prediction contexts in that histories of symbols are grouped according to the number of symbols they share in their suffix. Consequently, from such recurrent activation clusters it is possible to extract predictive models that correspond to variable memory length Markov models (VLMM). To appreciate how much information has been induced during training, the dynamic neural network performance should always be compared with that of VLMM baseline models.


Apart from trained models (possibly with small weights, i.e. with a Markovian bias), various mechanisms which rely only on the dynamics of fixed or randomly initialized recurrence have recently been proposed, including fractal prediction machines, echo state machines, and liquid state machines. As discussed in the paper, these models almost surely allow an approximation of Markovian models even with random initialization, provided the dimensionality of the reservoir is large.

Interestingly, the restriction to Markovian models has benefits with respect to the generalization ability of the models. Whereas generalization bounds for general recurrent models necessarily rely on the input distribution, due to the possibly complex input information, the generalization ability of Markovian models can be limited in terms of the relevant input length or, alternatively, the Lipschitz parameter of the models. This yields bounds which are independent of the input distribution and which become tighter as the contraction coefficient gets smaller. These bounds also hold for arbitrary real-valued inputs and for posterior bounds on the contraction coefficient achieved during training. The good generalization ability of RNNs with Markovian bias is particularly well suited to typical problems in bioinformatics, where the dimensionality is high compared to the number of available training examples. A couple of experiments which demonstrate this effect have been included in this article.

Unlike supervised RNNs, a variety of fundamentally different unsupervised recurrent networks has recently been proposed. Apart from comparatively old, biologically plausible leaky integrators such as the temporal Kohonen map or recurrent networks, which obviously rely on a Markovian representation of sequences, models which use a more elaborate temporal context, such as the map activation, the winner location, or the winner content, have been proposed. These models have, in principle, the larger capacity of finite state automata or pushdown automata, respectively, depending on the context model. Interestingly, this capacity reduces to that of Markovian models under certain realistic conditions discussed in this paper. These include a small mixing parameter for recursive networks, a sufficient granularity and distance preservation for SOMSD, and Hebbian learning for MSOM. The field of appropriate training of complex unsupervised recursive models is, so far, widely unexplored. Therefore, the identification of biases of the architecture and training algorithm towards the Markovian property is a very interesting result which allows us to gain more insight into the process of unsupervised sequence learning.

    References

1. Christiansen, M., Chater, N.: Toward a connectionist model of recursion in human linguistic performance. Cognitive Science 23 (1999) 417-437
2. Kolen, J.: Recurrent networks: state machines or iterated function systems? In Mozer, M., Smolensky, P., Touretzky, D., Elman, J., Weigend, A., eds.: Proceedings of the 1993 Connectionist Models Summer School. Erlbaum Associates, Hillsdale, NJ (1994) 203-210


3. Kolen, J.: The origin of clusters in recurrent neural network state space. In: Proceedings from the Sixteenth Annual Conference of the Cognitive Science Society, Hillsdale, NJ: Lawrence Erlbaum Associates (1994) 508-513
4. Tino, P., Cernansky, M., Benuskova, L.: Markovian architectural bias of recurrent neural networks. IEEE Transactions on Neural Networks 15 (2004) 6-15
5. Hammer, B., Tino, P.: Recurrent neural networks with small weights implement definite memory machines. Neural Computation 15 (2003) 1897-1929
6. Ron, D., Singer, Y., Tishby, N.: The power of amnesia. Machine Learning 25 (1996)
7. Tino, P., Hammer, B.: Architectural bias in recurrent neural networks: Fractal analysis. Neural Computation 15 (2004) 1931-1957
8. Kohonen, T.: The self-organizing map. Proceedings of the IEEE 78 (1990) 1464-1479
9. Chappell, G., Taylor, J.: The temporal Kohonen map. Neural Networks 6 (1993) 441-445
10. Koskela, T., Varsta, M., Heikkonen, J., Kaski, K.: Recurrent SOM with local linear models in time series prediction. In: 6th European Symposium on Artificial Neural Networks. (1998) 167-172
11. Horio, K., Yamakawa, T.: Feedback self-organizing map and its application to spatio-temporal pattern classification. International Journal of Computational Intelligence and Applications 1 (2001) 1-18
12. Voegtlin, T.: Recursive self-organizing maps. Neural Networks 15 (2002) 979-992
13. Strickert, M., Hammer, B.: Merge SOM for temporal data. Neurocomputing 64 (2005) 39-72
14. Hagenbuchner, M., Sperduti, A., Tsoi, A.: Self-organizing map for adaptive processing of structured data. IEEE Transactions on Neural Networks 14 (2003) 491-505
15. Hammer, B., Micheli, A., Strickert, M., Sperduti, A.: A general framework for unsupervised processing of structured data. Neurocomputing 57 (2004) 3-35
16. Tino, P., Farkas, I., van Mourik, J.: Dynamics and topographic organization of recursive self-organizing maps. Neural Computation 18 (2006) 2529-2567
17. Falconer, K.: Fractal Geometry: Mathematical Foundations and Applications. John Wiley and Sons, New York (1990)
18. Barnsley, M.: Fractals everywhere. Academic Press, New York (1988)
19. Ron, D., Singer, Y., Tishby, N.: The power of amnesia. In: Advances in Neural Information Processing Systems 6, Morgan Kaufmann (1994) 176-183
20. Guyon, I., Pereira, F.: Design of a linguistic postprocessor using variable memory length Markov models. In: International Conference on Document Analysis and Recognition, Montreal, Canada, IEEE Computer Society Press (1995) 454-457
21. Weinberger, M., Rissanen, J., Feder, M.: A universal finite memory source. IEEE Transactions on Information Theory 41 (1995) 643-652
22. Buhlmann, P., Wyner, A.: Variable length Markov chains. Annals of Statistics 27 (1999) 480-513
23. Rissanen, J.: A universal data compression system. IEEE Trans. Inform. Theory 29 (1983) 656-664
24. Giles, C., Omlin, C.: Insertion and refinement of production rules in recurrent neural networks. Connection Science 5 (1993)


25. Doya, K.: Bifurcations in the learning of recurrent neural networks. In: Proc. of 1992 IEEE Int. Symposium on Circuits and Systems. (1992) 2777-2780
26. Tino, P., Dorffner, G.: Predicting the future of discrete sequences from fractal representations of the past. Machine Learning 45 (2001) 187-218
27. Jaeger, H.: The echo state approach to analysing and training recurrent neural networks. Technical Report GMD Report 148, German National Research Center for Information Technology (2001)
28. Jaeger, H., Haas, H.: Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science 304 (2004) 78-80
29. Maass, W., Natschlager, T., Markram, H.: Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation 14 (2002) 2531-2560
30. Maass, W., Legenstein, R.A., Bertschinger, N.: Methods for estimating the computational power and generalization capability of neural microcircuits. In: Advances in Neural Information Processing Systems. Volume 17., MIT Press (2005) 865-872
31. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks 2 (1989) 359-366
32. Hammer, B.: Generalization ability of folding networks. IEEE Transactions on Knowledge and Data Engineering 13 (2001) 196-206
33. Koiran, P., Sontag, E.: Vapnik-Chervonenkis dimension of recurrent neural networks. In: European Conference on Computational Learning Theory. (1997) 223-237
34. Vapnik, V.: Statistical Learning Theory. Wiley-Interscience (1998)
35. Shawe-Taylor, J., Bartlett, P.L.: Structural risk minimization over data-dependent hierarchies. IEEE Trans. on Information Theory 44 (1998) 1926-1940
36. Williams, R., Zipser, D.: A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1 (1989) 270-280
37. Williams, R.: Training recurrent networks using the extended Kalman filter. In: Proc. 1992 Int. Joint Conf. Neural Networks. Volume 4. (1992) 241-246
38. Patel, G., Becker, S., Racine, R.: 2D image modelling as a time-series prediction problem. In Haykin, S., ed.: Kalman filtering applied to neural networks. Wiley (2001)
39. Manolios, P., Fanelli, R.: First order recurrent neural networks and deterministic finite state automata. Neural Computation 6 (1994) 1155-1173
40. Tino, P., Hammer, B.: Architectural bias in recurrent neural networks - fractal analysis. In Dorronsoro, J., ed.: Artificial Neural Networks - ICANN 2002. Lecture Notes in Computer Science, Springer-Verlag (2002) 1359-1364
41. de A. Barreto, G., Araujo, A., Kremer, S.: A taxonomy of spatiotemporal connectionist networks revisited: The unsupervised case. Neural Computation 15 (2003) 1255-1320
42. Kilian, J., Siegelmann, H.T.: On the power of sigmoid neural networks. Information and Computation 128 (1996) 48-56
43. Siegelmann, H.T., Sontag, E.D.: Analog computation via neural networks. Theoretical Computer Science 131 (1994) 331-360
44. Hammer, B., Micheli, A., Sperduti, A., Strickert, M.: Recursive self-organizing network models. Neural Networks 17 (2004) 1061-1086


45. Hammer, B., Neubauer, N.: On the capacity of unsupervised recursive neural networks for symbol processing. In d'Avila Garcez, A., Hitzler, P., Tamburrini, G., eds.: Workshop proceedings of NeSy'06. (2006)
46. Sonnenburg, S.: New methods for splice site recognition. Diploma thesis, Institut für Informatik, Humboldt-Universität Berlin (2002)
47. Martinetz, T., Berkovich, S., Schulten, K.: "Neural-gas" network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks 4 (1993) 558-569
48. Baldi, P., Brunak, S.: Bioinformatics: The machine learning approach. MIT Press, Cambridge, Mass. (2001)
49. Boden, M., Hawkins, J.: Improved access to sequential motifs: A note on the architectural bias of recurrent networks. IEEE Transactions on Neural Networks 16 (2005) 491-494
50. Bailey, T.L., Elkan, C.: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In: Proceedings of ISMB, Stanford, CA (1994) 28-36
51. Stormo, G.D.: DNA binding sites: representation and discovery. Bioinformatics 16 (2000) 16-23
52. Sorek, R., Shemesh, R., Cohen, Y., Basechess, O., Ast, G., Shamir, R.: A non-EST-based method for exon-skipping prediction. Genome Research 14 (2004) 1617-1623
53. Hawkins, J., Boden, M.: The applicability of recurrent neural networks for biological sequence analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2 (2005) 243-253
54. Dyrlov Bendtsen, J., Nielsen, H., von Heijne, G., Brunak, S.: Improved prediction of signal peptides: SignalP 3.0. Journal of Molecular Biology 340 (2004) 783-795
55. Hawkins, J., Boden, M.: Detecting and sorting targeting peptides with neural networks and support vector machines. Journal of Bioinformatics and Computational Biology 4 (2006) 1-18
56. Baldi, P., Brunak, S., Frasconi, P., Soda, G., Pollastri, G.: Exploiting the past and the future in protein secondary structure prediction. Bioinformatics 15 (1999) 937-946