
Markovian bias of neural-based architectures with feedback connections

Peter Tino¹, Barbara Hammer², and Mikael Boden³

¹ University of Birmingham, Birmingham, UK, [email protected]
² Clausthal University of Technology, Germany, [email protected]
³ University of Queensland, Brisbane, Australia, [email protected]

Summary. Dynamic neural network architectures can deal naturally with sequential data through recursive processing enabled by feedback connections. We show how such architectures are predisposed for suffix-based Markovian input sequence representations in both supervised and unsupervised learning scenarios. In particular, in the context of such architectural predispositions, we study computational and learning capabilities of typical dynamic neural network architectures. We also show how efficient finite memory models can be readily extracted from untrained networks and argue that such models should be used as baselines when comparing dynamic network performances in a supervised learning task. Finally, potential applications of the Markovian architectural predisposition of dynamic neural networks in bioinformatics are presented.

    1 Introduction

There has been considerable research activity in connectionist processing of sequential symbolic structures. For example, researchers have been interested in formulating models of human performance in processing linguistic patterns of various complexity (e.g. [1]).

Of special importance are dynamic neural network architectures capable of naturally dealing with sequential data through recursive processing. Such architectures are endowed with feedback delay connections that enable the models to operate with a neural memory in the form of past activations of a selected subset of neurons, sometimes referred to as recurrent neurons. It is expected that activations of such recurrent neurons will code all the important information from the past that is needed to solve a given task. In other words, the recurrent activations can be thought of as representing, in some relevant way, the temporal context in which the current input item is observed. The relevance is given by the nature of the task, be it next-symbol prediction, or mapping sequences in a topographically ordered manner.

It was a bit surprising, then, when researchers reported that when training dynamic neural networks to process language structures, activations of recurrent neurons displayed a considerable amount of structural differentiation even prior to learning [2, 3, 1]. Following [1], we refer to this phenomenon as the architectural bias of dynamic neural network architectures.

    It has been recently shown, both theoretically and empirically, that the

structural differentiation of recurrent activations before training has much deeper implications [4, 5]. When initialized with small weights (the standard initialization procedure), typical dynamic neural network architectures will organize recurrent activations in a suffix-based Markovian manner, and it is possible to extract from such untrained networks predictive models comparable to efficient finite memory source models called variable memory length Markov models [6]. In addition, in such networks, recurrent activations tend to organize along a self-similar fractal substrate, the fractal and multifractal properties of which can be studied in a rigorous manner [7]. Also, it is possible to rigorously characterize learning capabilities of dynamic neural network architectures in the early stages of learning [5].

Analogously, the well-known Self-Organizing Map (SOM) [8] for topographic low-dimensional organization of high-dimensional vectorial data has been reformulated to enable processing of sequential data. Typically, standard SOM is equipped with additional feedback connections [9, 10, 11, 12, 13, 14]. Again, various forms of inherent predispositions of such models to Markovian organization of processed data have been discovered [15, 13, 16].

The aim of this paper is to present major developments in the phenomenon of architectural bias of dynamic neural network architectures in a unified framework. The paper has the following organization: First, a general state-space model formulation used throughout the paper is introduced in section 2. Then, still at a general high description level, we study in section 3 the link between Markovian suffix-based state space organization and contractive state transition maps. Basic tools for measuring the geometric complexity of fractal sets are introduced as well. Detailed studies of several aspects of the architectural bias phenomenon in the supervised and unsupervised learning scenarios are presented in sections 4 and 5, respectively. Section 6 brings a flavour of potential applications of the architectural bias in bioinformatics. Finally, key findings are summarized in section 7.

    2 General model formulation

We consider models that recursively process inputs $x(t)$ from a set $X \subseteq \mathbb{R}^{N_I}$ by updating their state $r(t)$ in a bounded set $R \subseteq \mathbb{R}^{N}$, and producing outputs $y(t) \in Y$, $Y \subseteq \mathbb{R}^{N_O}$. The state space model representation takes the form:

$$y(t) = h(r(t)) \quad\quad (1)$$

$$r(t) = f(x(t), r(t-1)). \quad\quad (2)$$

Processing of an input time series $x(t) \in X$, $t = 1, 2, \ldots$, starts by initializing the model with some $r(0) \in R$ and then recursively updating the state $r(t) \in R$ and output $y(t) \in Y$ via functions $f: X \times R \to R$ and $h: R \to Y$, respectively. This is illustrated in figure 1.

Fig. 1. The basic state space model used throughout the paper. Processing of an input time series $x(t) \in X$, $t = 1, 2, \ldots$, is done by recursively updating the state $r(t) \in R$ and output $y(t) \in Y$ via functions $f: X \times R \to R$ and $h: R \to Y$, respectively.

We consider inputs coming from a finite alphabet $\mathcal{A}$ of $A$ symbols. The set of all finite sequences over $\mathcal{A}$ is denoted by $\mathcal{A}^*$. The set $\mathcal{A}^*$ without the empty string is denoted by $\mathcal{A}^+$. The set of all sequences over $\mathcal{A}$ with exactly $n$ symbols ($n$-blocks) is denoted by $\mathcal{A}^n$. Each symbol $s \in \mathcal{A}$ is mapped to its real-vector representation $c(s)$ by an injective coding function $c: \mathcal{A} \to X$. The state transition process (2) can be viewed as a composition of fixed-input maps

$$f_s(r) = f(c(s), r), \quad s \in \mathcal{A}. \quad\quad (3)$$

In particular, for a sequence $s_{1:n} = s_1 \ldots s_{n-2} s_{n-1} s_n$ over $\mathcal{A}$ and $r \in R$, we write

$$f_{s_{1:n}}(r) = f_{s_n}(f_{s_{n-1}}(\ldots(f_{s_2}(f_{s_1}(r)))\ldots)) = (f_{s_n} \circ f_{s_{n-1}} \circ \ldots \circ f_{s_2} \circ f_{s_1})(r). \quad\quad (4)$$
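To make the recursion (1)-(4) concrete, here is a minimal, self-contained sketch (not from the original paper) of a state-space model driven by a symbol sequence. It assumes a tanh transition of the form (17) with small random weights and a one-hot coding function; all names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Alphabet and one-hot coding function c: A -> X (illustrative choice)
alphabet = ["a", "b", "c"]
A = len(alphabet)
c = {s: np.eye(A)[i] for i, s in enumerate(alphabet)}

# A generic state-space model (eqs. 1-2) with a tanh transition (cf. eq. 17).
N = 10                                    # state dimension
W_rx = 0.3 * rng.standard_normal((N, A))  # input-to-state weights
W_rr = 0.1 * rng.standard_normal((N, N))  # state-to-state weights (small)
t_f = np.zeros(N)

def f(x, r):
    """State transition r(t) = f(x(t), r(t-1))."""
    return np.tanh(W_rx @ x + W_rr @ r + t_f)

def f_seq(seq, r):
    """Composition f_{s_{1:n}} = f_{s_n} o ... o f_{s_1} of fixed-input maps (eq. 4)."""
    for s in seq:
        r = f(c[s], r)
    return r

r0 = np.zeros(N)
print(f_seq("abcab", r0))  # state after processing the sequence "abcab"
```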


    3 Contractive state transitions

In this section we will show that by constraining the system to contractive state transitions we obtain state organizations with a Markovian flavour. Moreover, one can quantify the geometric complexity of such state space organizations using tools of fractal geometry.

    3.1 Markovian state-space organization

Denote the Euclidean norm by $\|\cdot\|$. Recall that a mapping $F: R \to R$ is said to be a contraction with contraction coefficient $\rho \in [0, 1)$ if, for any $r, r' \in R$, it holds that

$$\|F(r) - F(r')\| \le \rho \, \|r - r'\|. \quad\quad (5)$$

$F$ is a contraction if there exists $\rho \in [0, 1)$ such that $F$ is a contraction with contraction coefficient $\rho$.

Assume now that each member of the family $\{f_s\}_{s \in \mathcal{A}}$ is a contraction with contraction coefficient $\rho_s$, and denote the weakest contraction rate in the family by

$$\rho_{\max} = \max_{s \in \mathcal{A}} \rho_s.$$

Consider a sequence $s_{1:n} = s_1 \ldots s_{n-2} s_{n-1} s_n \in \mathcal{A}^n$, $n \ge 1$. Then for any two prefixes $u$ and $v$ and for any state $r \in R$, we have

$$\|f_{u s_{1:n}}(r) - f_{v s_{1:n}}(r)\| = \|f_{s_{1:n-1} s_n}(f_u(r)) - f_{s_{1:n-1} s_n}(f_v(r))\| \quad\quad (6)$$
$$= \|f_{s_n}(f_{s_{1:n-1}}(f_u(r))) - f_{s_n}(f_{s_{1:n-1}}(f_v(r)))\| \quad\quad (7)$$
$$\le \rho_{\max} \, \|f_{s_{1:n-1}}(f_u(r)) - f_{s_{1:n-1}}(f_v(r))\|, \quad\quad (8)$$

and consequently

$$\|f_{u s_{1:n}}(r) - f_{v s_{1:n}}(r)\| \le \rho_{\max}^{n} \, \|f_u(r) - f_v(r)\| \quad\quad (9)$$
$$\le \rho_{\max}^{n} \, \mathrm{diam}(R), \quad\quad (10)$$

where $\mathrm{diam}(R)$ is the diameter of the set $R$, i.e. $\mathrm{diam}(R) = \sup_{x, y \in R} \|x - y\|$. By similar arguments, for $L < n$,

$$\|f_{s_{1:n}}(r) - f_{s_{n-L+1:n}}(r)\| \le \rho_{\max}^{L} \, \|f_{s_{1:n-L}}(r) - r\| \quad\quad (11)$$
$$\le \rho_{\max}^{L} \, \mathrm{diam}(R). \quad\quad (12)$$

There are two related lessons to be learnt from this exercise:

1. No matter what state $r \in R$ the model is in, the final states $f_{u s_{1:n}}(r)$ and $f_{v s_{1:n}}(r)$ after processing two sequences with a long common suffix $s_{1:n}$ will lie close to each other. Moreover, the greater the length $n$ of the common suffix, the closer the final states $f_{u s_{1:n}}(r)$ and $f_{v s_{1:n}}(r)$ lie.


2. If we could only operate with finite input memory of depth $L$, then all states reachable from an initial state $r_0 \in R$ could be collected in a finite set

$$C_L(r_0) = \{f_w(r_0) \mid w \in \mathcal{A}^L\}.$$

Consider now a long sequence $s_{1:n}$ over $\mathcal{A}$. Disregarding the initial transients for $1 \le t \le L-1$, the states $f_{s_{1:t}}(r_0)$ of the original model (2) can be approximated by the states $f_{s_{t-L+1:t}}(r_0) \in C_L(r_0)$ of the finite memory model to arbitrarily small precision, as long as the memory depth $L$ is long enough.

What we have just described can be termed Markovian organization of the model state space. Information processing states that result from processing sequences sharing a common suffix naturally cluster together. In addition, the spatial extent of the cluster reflects the length of the common suffix: longer suffixes lead to a tighter cluster structure.

If, in addition, the readout function $h$ is Lipschitz with coefficient $\gamma > 0$, i.e. for any $r, r' \in R$,

$$\|h(r) - h(r')\| \le \gamma \, \|r - r'\|,$$

then

$$\|h(f_{u s_{1:n}}(r)) - h(f_{v s_{1:n}}(r))\| \le \gamma \, \rho_{\max}^{n} \, \|f_u(r) - f_v(r)\|.$$

Hence, for arbitrary prefixes $u$, $v$, we have that for a sufficiently long common suffix length $n$, the model outputs resulting from driving the system with $u s_{1:n}$ and $v s_{1:n}$ can be made arbitrarily close. In particular, on the same long input sequence (after initial transients of length $L-1$), the model outputs (1) will be closely shadowed by those of the finite memory model operating only on the most recent $L$ symbols (and hence on states from $C_L(r)$), as long as $L$ is large enough.
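The suffix-based clustering described above is easy to observe numerically. The sketch below (an illustration, not code from the paper) drives a small-weight tanh system with two sequences that differ only in their prefixes and measures how the distance between the resulting states shrinks with the length of the common suffix, as predicted by (10); weights and sequences are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small-weight tanh state transition => each fixed-input map is a contraction.
A, N = 4, 8
W_rx = 0.5 * rng.standard_normal((N, A))
W_rr = 0.1 * rng.standard_normal((N, N))
codes = np.eye(A)                         # one-hot coding c(s)

def f_seq(seq, r):
    for s in seq:
        r = np.tanh(W_rx @ codes[s] + W_rr @ r)
    return r

r0 = np.zeros(N)
suffix = [0, 2, 1, 3, 3, 1, 0, 2, 2, 1]   # common suffix s_{1:n}
u = list(rng.integers(0, A, 20))          # arbitrary prefix u
v = list(rng.integers(0, A, 35))          # arbitrary prefix v

for n in (1, 3, 5, 10):                   # growing common suffix length
    d = np.linalg.norm(f_seq(u + suffix[-n:], r0) - f_seq(v + suffix[-n:], r0))
    print(f"common suffix length {n:2d}: distance between final states = {d:.2e}")
```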

    3.2 Measuring complexity of state space activations

Let us first introduce measures of the size of geometrically complicated objects called fractals (see e.g. [17]). Let $K \subseteq R$. For $\epsilon > 0$, an $\epsilon$-fine cover of $K$ is a collection of sets of diameter at most $\epsilon$ that cover $K$. Denote by $N_\epsilon(K)$ the smallest possible cardinality of an $\epsilon$-fine cover of $K$.

Definition 1. The upper and lower box-counting dimensions of $K$ are defined as

$$\dim_B^{+} K = \limsup_{\epsilon \to 0} \frac{\log N_\epsilon(K)}{-\log \epsilon} \quad \text{and} \quad \dim_B^{-} K = \liminf_{\epsilon \to 0} \frac{\log N_\epsilon(K)}{-\log \epsilon}, \quad\quad (13)$$

respectively.


Let $\delta > 0$. For $\epsilon > 0$, define

$$H_\epsilon^{\delta}(K) = \inf_{\mathcal{B} \in \mathcal{C}_\epsilon(K)} \sum_{B \in \mathcal{B}} (\mathrm{diam}(B))^{\delta}, \quad\quad (14)$$

where the infimum is taken over the set $\mathcal{C}_\epsilon(K)$ of all countable $\epsilon$-fine covers of $K$. Define

$$H^{\delta}(K) = \lim_{\epsilon \to 0} H_\epsilon^{\delta}(K).$$

Definition 2. The Hausdorff dimension of the set $K$ is

$$\dim_H K = \inf\{\delta \mid H^{\delta}(K) = 0\}. \quad\quad (15)$$

It is well known that

$$\dim_H K \le \dim_B^{-} K \le \dim_B^{+} K. \quad\quad (16)$$

The Hausdorff dimension is more subtle than the box-counting dimensions: the former can capture details not detectable by the latter. For a more detailed discussion see e.g. [17].

    Consider now a Bernoulli source S over the alphabet A and (without lossof generality) assume that all symbol probabilities are nonzero.

The system (2) can be considered an Iterated Function System (IFS) [18] $\{f_s\}_{s \in \mathcal{A}}$. If all the fixed-input maps $f_s$ are contractions (contractive IFS), there exists a unique set $K \subseteq R$, called the IFS attractor, that is invariant under the action of the IFS:

$$K = \bigcup_{s \in \mathcal{A}} f_s(K).$$

As the system (2) is driven by an input stream generated by $S$, the states $\{r(t)\}$ converge to $K$. In other words, after some initial transients, the state trajectory of the system (2) samples the invariant set $K$ (the "chaos game" algorithm).

It is possible to upper-bound the fractal dimensions of $K$ [17]:

$$\dim_B^{+} K \le d^{+}, \quad \text{where } \sum_{s \in \mathcal{A}} \rho_s^{\,d^{+}} = 1.$$

If the IFS obeys the open set condition, i.e. all $f_s(K)$ are disjoint, we can get a lower bound on the dimension of $K$ as well: assume there exist $0 < \rho_s^{-} < 1$, $s \in \mathcal{A}$, such that

$$\|f_s(r) - f_s(r')\| \ge \rho_s^{-} \, \|r - r'\|.$$

Then [17],

$$\dim_H K \ge d^{-}, \quad \text{where } \sum_{s \in \mathcal{A}} (\rho_s^{-})^{d^{-}} = 1.$$
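The following sketch illustrates the "chaos game" sampling of an IFS attractor and a crude box-counting estimate of its dimension; the affine IFS, the Bernoulli symbol source and the scales are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Contractive affine IFS on the unit square: f_s(r) = rho * r + (1 - rho) * c(s),
# with symbols coded as corners of the square (cf. the FPM construction in sec. 4.3).
rho = 0.5
corners = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

def chaos_game(symbols, r0=np.array([0.5, 0.5])):
    states, r = [], r0
    for s in symbols:
        r = rho * r + (1 - rho) * corners[s]
        states.append(r)
    return np.array(states)

# Drive the IFS with a Bernoulli source over three of the four symbols only,
# so the attractor is a Sierpinski-like fractal rather than the full square.
seq = rng.choice([0, 1, 2], size=200_000)
pts = chaos_game(seq)[1000:]                 # drop initial transients

# Crude box-counting estimate of dim_B: slope of log N_eps(K) versus -log eps.
for eps in (1/8, 1/16, 1/32, 1/64):
    boxes = np.unique(np.floor(pts / eps).astype(int), axis=0)
    print(f"eps = {eps:.4f}:  N_eps = {len(boxes):5d},  "
          f"log N / -log eps = {np.log(len(boxes)) / -np.log(eps):.3f}")
# The estimates approach log(3)/log(2) ~ 1.585 for this three-corner IFS with rho = 0.5.
```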


    4 Supervised learning setting

In this section, we will consider the implications of Markovian state-space organization for recurrent neural networks (RNN) formulated and trained in a supervised setting. We present the results for a simple 3-layer RNN topology. The results can be easily generalized to more complicated architectures.

    4.1 Recurrent Neural Networks and Neural Prediction Machines

Let $W^{r,x}$, $W^{r,r}$ and $W^{y,r}$ denote real $N \times N_I$, $N \times N$ and $N_O \times N$ weight matrices, respectively. For a neuron activation function $g$, we denote its element-wise application to a real vector $a = (a_1, a_2, \ldots, a_k)^T$ by $g(a)$, i.e. $g(a) = (g(a_1), g(a_2), \ldots, g(a_k))^T$. Then, the state transition function $f$ in (2) takes the form

$$f(x, r) = g(W^{r,x} x + W^{r,r} r + t_f), \quad\quad (17)$$

where $t_f \in \mathbb{R}^N$ is a bias term. Often, the output function $h$ reads:

$$h(r) = g(W^{y,r} r + t_h), \quad\quad (18)$$

with the bias term $t_h \in \mathbb{R}^{N_O}$. However, for the purposes of symbol stream processing, the output function $h$ is sometimes realized as a piece-wise constant function consisting of a collection of multinomial next-symbol distributions, one for each compartment of the partitioned state space $R$. RNNs with such an output function were termed Neural Prediction Machines (NPM) in [4]. More precisely, assuming the symbol alphabet $\mathcal{A}$ contains $A$ symbols, the output space $Y$ is the simplex

$$Y = \{y = (y_1, y_2, \ldots, y_A)^T \in \mathbb{R}^A \mid \sum_{i=1}^{A} y_i = 1 \text{ and } y_i \ge 0\}.$$

The next-symbol probabilities $h(r) \in Y$ are estimates of the relative frequencies of symbols, conditional on the RNN state $r$. Regions of constant probabilities are determined by vector-quantizing the set of recurrent activations that result from driving the network with the training sequence:

1. Given a training sequence $s_{1:n} = s_1 s_2 \ldots s_n$ over the alphabet $\mathcal{A} = \{1, 2, \ldots, A\}$, first reset the network to the initial state $r(0)$, and then, while driving the network with the sequence $s_{1:n}$, collect the recurrent layer activations $r(t)$, $1 \le t \le n$.

2. Run your favorite vector quantizer with $M$ codebook vectors $b_1, \ldots, b_M \in R$ on the collected activations. Vector quantization partitions the state space $R$ into $M$ Voronoi compartments $V_1, \ldots, V_M$ of the codebook vectors⁴ $b_1, \ldots, b_M$:

$$V_i = \{r \mid \|r - b_i\| = \min_j \|r - b_j\|\}. \quad\quad (19)$$

All points in $V_i$ are allocated to the codebook vector $b_i$ via a projection $\pi: R \to \{1, 2, \ldots, M\}$,

$$\pi(r) = i, \text{ provided } r \in V_i. \quad\quad (20)$$

⁴ Ties are broken according to index order.

3. Reset the network again to the initial state $r(0)$.

4. Set the [Voronoi compartment, next-symbol] counters $N(i, a)$, $i = 1, \ldots, M$, $a \in \mathcal{A}$, to zero.

5. Drive the network again with the training sequence $s_{1:n}$. For $1 \le t < n$, if $\pi(r(t)) = i$ and the next symbol $s_{t+1}$ is $a$, increment the counter $N(i, a)$ by one.

6. With each codebook vector associate the next-symbol probabilities⁵

$$P(a \mid i) = \frac{N(i, a)}{\sum_{b \in \mathcal{A}} N(i, b)}, \quad a \in \mathcal{A}, \; i = 1, 2, \ldots, M. \quad\quad (21)$$

7. The output map is then defined as

$$h(r) = P(\cdot \mid \pi(r)), \quad\quad (22)$$

where the $a$-th component of $h(r)$, $h_a(r)$, is the probability of the next symbol being $a \in \mathcal{A}$, provided the current RNN state is $r$.

Once the NPM is constructed, it can make predictions on a test sequence (a continuation of the training sequence) as follows: Let $r(n)$ be the vector of recurrent activations after observing the training sequence $s_{1:n}$. Given a prefix $u_1 u_2 \ldots u_\ell$ of the test sequence, the NPM makes a prediction about the next symbol $u_{\ell+1}$ by:

1. Resetting the network with $r(n)$.

2. Recursively updating the states $r(n+t)$, $1 \le t \le \ell$, while driving the network with $u_1 u_2 \ldots u_\ell$.

3. The probability of $u_{\ell+1}$ occurring is

$$h_{u_{\ell+1}}(r) = P(u_{\ell+1} \mid \pi(r(n+\ell))). \quad\quad (23)$$

⁵ For bigger codebooks, we may encounter data sparseness problems. In such cases, it is advisable to perform smoothing of the empirical symbol frequencies by applying Laplace correction with parameter $\kappa > 0$:

$$P(a \mid i) = \frac{\kappa + N(i, a)}{\kappa A + \sum_{b \in \mathcal{A}} N(i, b)}.$$

The parameter $\kappa$ can be viewed, in the Dirichlet prior interpretation of the multinomial distribution $P(a \mid i)$, as the effective number of times each symbol $a \in \mathcal{A}$ was observed in the context of state compartment $i$, prior to counting the conditional symbol occurrences $N(i, a)$ in the training sequence. Typically, $\kappa = 1$ or $\kappa = A^{-1}$.

    A schematic illustration of NPM is presented in figure 2.

Fig. 2. Schematic illustration of a Neural Prediction Machine. The output function is realized as a piece-wise constant function consisting of a collection of multinomial next-symbol distributions $P(\cdot \mid i)$, one for each compartment $i$ of the state space $R$.
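A compact sketch of the NPM construction (steps 1-7) and prediction (23) follows. It assumes an untrained small-weight tanh RNN, a few iterations of plain k-means as the "favorite vector quantizer", and Laplace smoothing with $\kappa = 1$; all of these are illustrative choices rather than the setup used in [4].

```python
import numpy as np

rng = np.random.default_rng(3)

A, N, M = 4, 20, 16                       # alphabet size, state dim., codebook size
W_rx = 0.2 * rng.standard_normal((N, A))  # untrained, small random weights
W_rr = 0.2 * rng.standard_normal((N, N))
codes = np.eye(A)

def drive(seq, r):
    """Collect recurrent activations while driving the RNN with seq."""
    states = []
    for s in seq:
        r = np.tanh(W_rx @ codes[s] + W_rr @ r)
        states.append(r)
    return np.array(states)

train = rng.integers(0, A, 5000)
R_states = drive(train, np.zeros(N))                    # step 1

# Step 2: a few iterations of plain k-means as the vector quantizer.
book = R_states[rng.choice(len(R_states), M, replace=False)]
for _ in range(20):
    idx = np.argmin(((R_states[:, None] - book[None]) ** 2).sum(-1), axis=1)
    book = np.array([R_states[idx == i].mean(0) if np.any(idx == i) else book[i]
                     for i in range(M)])
idx = np.argmin(((R_states[:, None] - book[None]) ** 2).sum(-1), axis=1)

# Steps 3-6: count next symbols per Voronoi compartment, with Laplace smoothing.
counts = np.ones((M, A))                                # kappa = 1 smoothing
for t in range(len(train) - 1):
    counts[idx[t], train[t + 1]] += 1
P = counts / counts.sum(1, keepdims=True)               # eq. (21)

# Step 7 / prediction (eq. 23): next-symbol distribution for a test prefix.
def predict(prefix, r_n):
    r = drive(prefix, r_n)[-1]
    i = np.argmin(((book - r) ** 2).sum(-1))
    return P[i]

print(predict([0, 1, 2, 3], R_states[-1]))
```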

4.2 Variable memory length Markov models and NPM built on untrained RNN

In this section we will present intuitive arguments for a strong link between Neural Prediction Machines (NPM) constructed on untrained RNNs and a class of efficient implementations of Markov models called variable memory length Markov models (VLMM) [6]. The arguments will be made more extensive and formal in section 4.5.

In Markov models (MMs) of (fixed) order $L$, the conditional next-symbol distribution over the alphabet $\mathcal{A}$, given the history of symbols $s_{1:t} = s_1 s_2 \ldots s_t \in \mathcal{A}^t$ observed so far, is written in terms of the last $L$ symbols ($L \le t$),

$$P(s_{t+1} \mid s_{1:t}) = P(s_{t+1} \mid s_{t-L+1:t}). \quad\quad (24)$$


In variable memory length Markov models (VLMMs), the memory depth is not fixed in advance: deeper prediction contexts are used only for those symbol histories where they are warranted. An analogous variable-depth structure emerges in NPMs built on contractive (e.g. untrained) RNNs. Dense areas in the RNN state space correspond to symbol histories with long common suffixes and are given more attention by the vector quantizer. Hence, the prediction contexts are analogous to those of VLMMs in that deep memory is used only when there are many different symbol histories sharing a long common suffix. In such situations it is natural to consider deeper past contexts when predicting the future. This is done automatically by the vector quantizer in the NPM construction, as it devotes more codebook vectors to the dense regions than to the sparsely inhabited areas of the RNN state space. More codebook vectors in dense regions imply tiny quantization compartments $V_i$ that in turn group symbol histories with long shared suffixes.

Note, however, that while prediction contexts of VLMMs are built in a supervised manner, i.e. deep memory is considered only if it is dictated by the conditional distributions over the next symbols, in contractive NPMs prediction contexts of variable length are constructed in an unsupervised manner: prediction contexts with deep memory are accepted if there is a suspicion that shallow memory may not be enough, i.e. when there are many different symbol histories in the training sequence that share a long common suffix.

    4.3 Fractal Prediction Machines

A special class of affine contractive NPMs operating with finite input memory has been introduced in [26] under the name Fractal Prediction Machines (FPM). Typically, the state space $R$ is an $N$-dimensional unit hypercube⁶ $R = [0, 1]^N$, $N = \lceil \log_2 A \rceil$. The coding function $c: \mathcal{A} \to X = R$ maps each of the $A$ symbols of the alphabet $\mathcal{A}$ to a unique vertex of the hypercube $R$. The state transition function (17) has a particularly simple form, as the activation function $g$ is the identity and the connection weight matrices are set to $W^{r,r} = \rho I$ and $W^{r,x} = (1 - \rho) I$, where $0 < \rho < 1$ is a contraction coefficient and $I$ is the $N \times N$ identity matrix. Hence, the state space evolution (2) reads:

$$r(t) = f(x(t), r(t-1)) = \rho \, r(t-1) + (1 - \rho) \, x(t). \quad\quad (25)$$

The reset state is usually set to the center of $R$, $r_0 = \{\tfrac{1}{2}\}^N$. After fixing the memory depth $L$, the FPM states are confined to the set

$$C_L(r_0) = \{f_w(r_0) \mid w \in \mathcal{A}^L\}.$$

The FPM construction proceeds in the same manner as that of the NPM, except that the states $f_{s_{1:t}}(r_0)$ coding the history of inputs up to the current time $t$ are approximated by their memory-$L$ counterparts $f_{s_{t-L+1:t}}(r_0)$ from $C_L(r_0)$, as discussed in section 3.1.

⁶ For $x \in \mathbb{R}$, $\lceil x \rceil$ is the smallest integer $y$ such that $y \ge x$.
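Below is a minimal sketch of the FPM state map (25) with symbols placed on hypercube vertices, together with a numerical check of the finite-memory approximation of section 3.1; the contraction coefficient, alphabet size and test sequence are illustrative.

```python
import numpy as np
from itertools import product

# FPM state space: unit hypercube of dimension N = ceil(log2(A)); symbols sit on vertices.
A = 4
N = int(np.ceil(np.log2(A)))
vertices = np.array(list(product([0.0, 1.0], repeat=N)))[:A]   # coding function c
rho = 0.5                                                      # contraction coefficient

def fpm_state(seq, r=None):
    """Eq. (25): r(t) = rho * r(t-1) + (1 - rho) * c(s_t), reset state at the centre."""
    r = np.full(N, 0.5) if r is None else r
    for s in seq:
        r = rho * r + (1 - rho) * vertices[s]
    return r

# Finite-memory approximation: the state of a long sequence is well approximated
# by the state of its length-L suffix (cf. eqs. 11-12), up to rho**L * diam(R).
seq = list(np.random.default_rng(4).integers(0, A, 100))
for L in (2, 4, 8):
    err = np.linalg.norm(fpm_state(seq) - fpm_state(seq[-L:]))
    print(f"L = {L}: approximation error = {err:.2e}  (bound {rho**L * np.sqrt(N):.2e})")
```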


    4.4 Echo and Liquid State Machines

A similar architecture has been proposed by Jaeger in [27, 28]. So-called Echo State Machines (ESM) combine an untrained recurrent reservoir with a trainable linear output layer. Thereby, the state space model has the following form

$$y(t) = h(x(t), r(t), y(t-1)) \quad\quad (26)$$

$$r(t) = f(x(t), r(t-1), y(t-1)). \quad\quad (27)$$

For training, the function $f$ is fixed and chosen in such a way that it has the echo state property, i.e. the network state $r(t)$ is uniquely determined by left-infinite input sequences. In practice, the dependence of $r$ on $y$ is often dropped, in which case this property is equivalent to the fading property of recurrent systems and can be tested by determining the Lipschitz constant of $f$ (which must be smaller than 1). The readout $h$ is trainable. It is often chosen as a linear function such that training can be done efficiently using linear regression. Unlike FPMs, ESMs usually work with very high dimensional state representations and a transition function obtained as the composition of a nonlinear activation function and a linear mapping with (sufficiently small) random weights. This combination should guarantee that the transition function has sufficiently rich dynamics to represent important properties of input series within its reservoir.

A similar idea is the basis for so-called Liquid State Machines (LSM) as proposed by Maass [29]. These are based on a fixed recurrent mechanism in combination with a trainable readout. Unlike FPM and ESM, LSMs consider continuous time, and they are usually based on biologically plausible circuits of spiking neurons with sufficiently rich dynamics. It has been shown in [29] that LSMs are approximation complete for operators with fading memory under mild conditions on the models. Possibilities to estimate the capacity of such models and their generalization ability have recently been presented in [30]. However, the fading property is not required for LSMs.

    4.5 On the computational power of recursive models

Here we will rigorously analyze the computational power of recursive models with contractive transition function. The arguments are partially taken from [5]. Given a sequence $s_{1:t} = s_1 \ldots s_t$, the $L$-truncation is given by

$$T_L(s_{1:t}) = \begin{cases} s_{t-L+1:t} & \text{if } t \ge L \\ s_{1:t} & \text{otherwise.} \end{cases}$$

Assume $\mathcal{A}$ is a finite input alphabet and $Y$ is an output set. A Definite Memory Machine (DMM) computes a function $f: \mathcal{A}^* \to Y$ such that $f(s) = f(T_L(s))$ for some fixed memory length $L$ and every sequence $s$. A Markov model (MM) fulfils $P(s_{t+1} \mid s_{1:t}) = P(s_{t+1} \mid T_L(s_{1:t}))$ for some $L$. Obviously, the function which maps a sequence to the next-symbol probability given by a Markov model defines a definite memory machine. Variable length Markov models are subsumed by standard Markov models taking $L$ as the maximum context length. Therefore, we will only consider definite memory machines in the following.

State space models as defined in eqs. (1-2) induce a function $h \circ f(r): \mathcal{A}^* \to Y$, $s_{1:t} \mapsto (h \circ f_{s_t} \circ \ldots \circ f_{s_1})(r)$ as defined in (4), given a fixed initial value $r$. As beforehand, we assume that the set of internal states $R$ is bounded. Consider the case that $f(x, \cdot)$ is a contraction with parameter $\rho$ for every input $x$, and $h$ is Lipschitz continuous with parameter $\gamma$. According to (12), $\|f_s(r) - f_{T_L(s)}(r)\| \le \rho^L \, \mathrm{diam}(R)$. Note that the function $h \circ f_{T_L(\cdot)}(r)$ can be given by a definite memory machine. Therefore, if we choose $L \ge \log_\rho\!\big(\epsilon / (\gamma \, \mathrm{diam}(R))\big)$, we find $\|h \circ f_s(r) - h \circ f_{T_L(s)}(r)\| \le \epsilon$. Hence, we can conclude that every recurrent system with contractive transition function can be approximated arbitrarily well by a finite memory model. This argument proves the Markovian property of RNNs with small weights:

Theorem 1. Assume $f(x, \cdot)$ is a contraction for every input $x$ (with a common contraction coefficient), and $h$ is Lipschitz continuous. Assume $r$ is a fixed initial state and $\epsilon > 0$. Then there exists a definite memory machine with function $g: \mathcal{A}^* \to Y$ such that $\|g(s) - h \circ f_s(r)\| \le \epsilon$ for every input sequence $s$.

As shown in [5], this argument can be extended to the situation where the input alphabet is not finite, but given by an arbitrary subset of a real-vector space.
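As a small illustration of the argument behind Theorem 1, the following sketch computes the memory depth $L$ needed so that $\gamma \rho^L \mathrm{diam}(R) \le \epsilon$, for assumed (purely illustrative) values of the contraction coefficient, Lipschitz constant and state-space diameter.

```python
import math

def required_memory_depth(rho, gamma, diam_R, eps):
    """Smallest L with gamma * rho**L * diam(R) <= eps (cf. eqs. 11-12 and Theorem 1)."""
    if not (0 < rho < 1):
        raise ValueError("rho must lie in (0, 1) for the contraction argument to apply")
    return max(0, math.ceil(math.log(eps / (gamma * diam_R)) / math.log(rho)))

# Example: a mildly contractive system, moderately steep readout, unit-diameter state space.
for eps in (1e-1, 1e-2, 1e-3):
    L = required_memory_depth(rho=0.7, gamma=2.0, diam_R=1.0, eps=eps)
    print(f"eps = {eps:.0e}  ->  memory depth L = {L}")
```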

It can easily be seen that standard recurrent neural networks with small weights implement contractions, since standard activation functions such as tanh are Lipschitz continuous and linear maps with sufficiently small weights are contractive. RNNs are often initialized with small random weights. Therefore, RNNs have a Markovian bias in the early stages of training, first pushing solutions towards simple dynamics before moving towards more complex models such as finite state automata and beyond.

It has been shown in [27] that ESMs without output feedback possess the so-called shadowing property, i.e. for every $\epsilon$ a memory length $L$ can be found such that input sequences whose last $L$ entries are similar or identical lead to images which are at most $\epsilon$ apart. As a consequence, such mechanisms can also be simulated by Markovian models. The same applies to FPMs (the transition is a contraction by construction) and NPMs built on untrained RNNs randomly initialized with small weights.

The converse has also been proved in [5]: every definite memory machine can be simulated by an RNN with small weights. Therefore, an equivalence of small-weight RNNs (and similar mechanisms) and definite memory machines holds. One can go a step further and consider the question in which cases


randomly initialized RNNs with small weights can simulate a given definite memory machine, provided a suitable readout is chosen.

The question boils down to whether the images of input sequences $s$ of a fixed memory length $L$ under the recursion $(f_{s_L} \circ \ldots \circ f_{s_1})(r)$ are of a form such that they can be mapped to the desired output values by a trainable readout. We can assume that the codes for the symbols $s_i$ are linearly independent. Further, we assume that the transfer function of the network is a nonlinear and non-algebraic function such as tanh. Then the terms $(f_{s_L} \circ \ldots \circ f_{s_1})(r)$ constitute functions of the weights which are algebraically independent, since they consist of the composition of linearly independent terms and the activation function. Thus, the images of sequences of length $L$ are almost surely disjoint, so that they can be mapped to arbitrary output values by a trainable readout, provided the readout function $h$ fulfils the universal approximation property as stated e.g. in [31] (e.g. FNNs with one hidden layer). If the dimensionality of the reservoir is large enough (dimensionality at least $A^L$), the outputs are almost surely linearly independent, so that even a linear readout is sufficient. Thus, NPMs and ESMs with random initialization constitute universal approximators for finite memory models, provided the dimensionality of the reservoir is sufficient.

    4.6 On the generalization ability of recursive models

Another important issue of supervised adaptive models is their generalization ability, i.e. the possibility to infer from learnt data to new examples. It is well known that the situation is complex for recursive models due to the relatively complex inputs: sequences of a priori unrestricted length can, in principle, encode behavior of a priori unlimited complexity. This observation can be mathematically substantiated by the fact that generalization bounds for recurrent networks as derived e.g. in [32, 33] depend on the input distribution. The situation is different for finite memory models, for which generalization bounds independent of the input distribution can be derived. A formal argument which establishes generalization bounds for RNNs with Markovian bias has been provided in [5], whereby the bounds have been derived also for the setting of continuous input values stemming from a real-vector space. Here, we only consider the case of a finite input alphabet $\mathcal{A}$.

Assume $\mathcal{F}$ is the class of functions which can be implemented by a recurrent model with Markovian bias. For simplicity, we restrict ourselves to the case of binary classification, i.e. outputs stem from $\{0, 1\}$, e.g. by adding an appropriate discretization to the readout $h$. Assume $P$ is an unknown underlying input-output distribution which has to be learnt. Assume a set of $m$ input-output pairs, $T_m = \{(s_1, y_1), \ldots, (s_m, y_m)\}$, is given; the pairs are sampled from $P$ in an i.i.d. manner⁷. Assume a learning algorithm outputs a function $g$ when trained on the samples from $T_m$. The empirical error of the algorithm

⁷ independent identically distributed


is defined as

$$E_{T_m}(g) = \frac{1}{m} \sum_{i=1}^{m} |g(s_i) - y_i|.$$

This is usually minimized during training. The real error is given by the quantity

$$E_P(g) = \int |g(s) - y| \, dP(s, y).$$

A learning algorithm is said to generalize to unseen data if the real error is minimized by the learning scheme.

The capability of a learning algorithm to generalize from the training set to new examples depends on the capacity of the function class used for training. One measure of this complexity is offered by the Vapnik-Chervonenkis (VC) dimension of the function class $\mathcal{F}$. The VC dimension of $\mathcal{F}$ is the size of the largest set such that every binary mapping on this set can be realized by a function in $\mathcal{F}$. It has been shown by Vapnik [34] that the following bound holds for sample size $m$, $m \ge \mathrm{VC}(\mathcal{F})$ and $m > 2/\epsilon$:

$$P^m\{T_m \mid \exists g \in \mathcal{F} \text{ s.t. } |E_P(g) - E_{T_m}(g)| > \epsilon\} \le 4 \left(\frac{2 e m}{\mathrm{VC}(\mathcal{F})}\right)^{\mathrm{VC}(\mathcal{F})} \exp(-\epsilon^2 m / 8).$$

Obviously, the number of different binary Markovian classifications with finite memory length $L$ and alphabet $\mathcal{A}$ of $A$ symbols is restricted by $2^{A^L}$; thus, the VC dimension of the class of Markovian classifications is limited by $A^L$. Hence its generalization ability is guaranteed. Recurrent networks with Markovian bias can be simulated by finite memory models. Thereby, the necessary input memory length $L$ can be estimated from the contraction coefficient $\rho$ of the recursive function, the Lipschitz constant $\gamma$ of the readout, and the extent of the state space $R$, by the term $\lceil \log_\rho\!\big(1/(2\gamma\,\mathrm{diam}(R))\big)\rceil$. This holds since we can approximate every binary classification up to $\epsilon = 0.5$ by a finite memory machine using this length, as shown in Theorem 1. This argument proves the generalization ability of ESMs, FPMs, or NPMs with fixed recursive transition function and finite input alphabet $\mathcal{A}$ for which a maximum memory length can be determined a priori. The argumentation can be generalized to the case of arbitrary (continuous) inputs as shown in [5].

If we train a model (e.g. an RNN with small weights) in such a way that a Markovian model with finite memory length arises, whereby the contraction coefficient of the recursive part (and hence the corresponding maximum memory length $L$) is not fixed a priori, these bounds do not apply. In this case, one can use arguments from the theory of structural risk minimization as provided e.g. in [35]. Denote by $\mathcal{F}_i$ the set of recursive functions which can be simulated using memory length $i$ (this corresponds to a contraction coefficient at most $(2\gamma\,\mathrm{diam}(R))^{-1/i}$). Obviously, the VC dimension of $\mathcal{F}_i$ is at most $A^i$. Assume a classifier is trained on a sample $T_m$ of $m$ i.i.d. input-output pairs $(s_j, y_j)$, $j = 1, 2, \ldots, m$,


generated from $P$. Assume the number of errors on $T_m$ of the trained classifier $g$ is $k$, and the recursive model $g$ is contained in $\mathcal{F}_i$. Then, according to [35] (Theorem 2.3), the generalization error $E_P(g)$ can be upper bounded, with probability $1 - \delta$, by the term

$$\frac{1}{m}\left( (2 + 4\ln 2)\,k + 4\ln 2^i + 4\ln\frac{4}{\delta} + 4\,A^i \ln\frac{2em}{A^i} \right).$$

(Here, we chose the prior probabilities used in [35] (Theorem 2.3) as $1/2^i$ for the probability that $g$ is in $\mathcal{F}_i$ and $1/2^k$ for the probability that at most $k$ errors occur.) Unlike the bounds derived e.g. in [15], this bound holds for any posterior finite memory length $L$ obtained during training. In practice, the memory length can be estimated from the contraction and Lipschitz constants as shown above. Thus, unlike for general RNNs, generalization bounds which do not depend on the input distribution, but rather on the input memory length, can be derived for RNNs with Markovian bias.
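For concreteness, the sketch below simply evaluates the displayed structural-risk expression for illustrative values of $m$, $k$, $i$, $A$ and $\delta$, using the VC-dimension reading $A^i$ adopted above; it is a plain transcription of the formula, not an implementation from [35].

```python
import math

def srm_bound(m, k, i, A, delta):
    """Structural-risk bound on E_P(g) for a classifier in F_i with k training errors
    on m samples (transcription of the displayed bound, assuming VC(F_i) = A**i)."""
    vc = A ** i
    return (1.0 / m) * ((2 + 4 * math.log(2)) * k
                        + 4 * math.log(2 ** i)
                        + 4 * math.log(4 / delta)
                        + 4 * vc * math.log(2 * math.e * m / vc))

# Example: 10,000 training sequences, 150 errors, memory length 3, 4-letter alphabet.
print(f"bound = {srm_bound(m=10_000, k=150, i=3, A=4, delta=0.05):.3f}")
```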

    4.7 Verifying the architectural bias

The fact that one can extract from untrained recurrent networks Neural Prediction Machines (NPM, section 4.1) that yield performance comparable to Variable Memory Length Markov Models (VLMM, section 4.2) has been demonstrated e.g. in [4]. This is of vital importance for model building and evaluation, since in the light of these findings, the baseline models in neural-based symbolic sequence processing tasks should in fact be VLMMs. If a neural network model after the training phase cannot beat an NPM extracted from an untrained model, there is obviously no point in training the model in the first place. This may sound obvious, but there are studies, e.g. in the field of cognitive aspects of natural language processing, where trained RNNs are compared with fixed order Markov models [1]. Note that such Markov models are often inferior to VLMMs.

In [4], five model classes were considered:

• RNN trained via Real Time Recurrent Learning [36] coupled with Extended Kalman Filtering [37, 38]. There were $A$ input and output units, one for each symbol in the alphabet $\mathcal{A}$, i.e. $N_I = N_O = A$. The inputs and desired outputs were encoded through binary one-of-$A$ encoding. At each time step, the output activations were normalized to obtain next-symbol probabilities.
• NPM extracted from the trained RNN.
• NPM extracted from the RNN prior to training.
• Fixed order Markov models.
• VLMM.

The models were tested on two data sets:


• A series of quantized activations of a laser in a chaotic regime. This sequence can be modelled quite successfully with finite memory predictors, although, because of the limited finite sequence length, predictive contexts of variable memory depth are necessary.
• An artificial language exhibiting deep recursive structures. Such structures cannot be grasped by finite memory models, and a fully trained RNN should be able to outperform models operating with finite input memory and NPMs extracted from untrained networks.

Model performance was measured by calculating the average (per symbol) negative log-likelihood on a hold-out set (not used during training). As expected, given a fine enough partition of the state space, the performances of RNNs and NPMs extracted from trained RNNs were almost identical. Because of finite sample size effects, fixed-order Markov models were beaten by VLMMs. Interestingly enough, when considering models with a comparable number of free parameters, the performance of NPMs extracted prior to training mimicked that of VLMMs. The Markovian bias of untrained RNNs was clearly visible.

On the laser sequence, the negative log-likelihood on the hold-out set improved slightly through expensive RNN training, but overall, almost the same performance could be achieved by the cheap construction of NPMs from randomly initialized untrained networks.

On the deep-recursion set, trained RNNs could achieve much better performance. However, it is essential to quantify how much useful information has really been induced during the training. As argued in this study, this can only be achieved by consulting VLMMs and NPMs extracted before training.

A large-scale comparison in [26] of Fractal Prediction Machines (FPM, section 4.3) with both VLMMs and fixed order Markov models revealed an almost equivalent performance (modulo model size) of FPM and VLMM. Again, fixed order Markov models were inferior to both FPM and VLMM.

    4.8 Geometric complexity of RNN state activations

It has been well known that RNN state space activations $r$ often live on a fractal support of a self-similar nature [39]. As outlined in section 3.2, in the case of contractive RNNs, one can actually estimate the size and geometric complexity of the RNN state space evolution.

Consider a Bernoulli source $S$ over the alphabet $\mathcal{A}$. When we drive an RNN with symbolic sequences generated by $S$, we get recurrent activations $\{r(t)\}$ that tend to group in clusters sampling the invariant set $K$ of the RNN iterated function system $\{f_s\}_{s \in \mathcal{A}}$.

Note that for each input symbol $s \in \mathcal{A}$, we have an affine mapping acting on $R$:

$$Q_s(r) = W^{r,r} r + W^{r,x} c(s) + t_f.$$


The range of possible net-in activations of recurrent units for input $s$ is then

$$C_s = \bigcup_{1 \le i \le N} [Q_s(R)]_i, \quad\quad (28)$$

where $[B]_i$ is the $i$-th slice of the set $B$, i.e. $[B]_i = \{r_i \mid r = (r_1, \ldots, r_N)^T \in B\}$, $i = 1, 2, \ldots, N$.

By finding upper bounds on the values of the derivative of the activation function,

$$g_s = \sup_{v \in C_s} |g'(v)|, \qquad g_{\max} = \max_{s \in \mathcal{A}} g_s,$$

and setting

$$\rho_s^{\max} = \sigma_+(W^{r,r}) \, g_s, \quad\quad (29)$$

where $\sigma_+(W^{r,r})$ is the largest singular value of $W^{r,r}$, we can bound the upper box-counting dimension of RNN state activations [40]: Suppose $\sigma_+(W^{r,r}) \, g_{\max} < 1$ and let $d^{\max}$ be the unique solution of $\sum_{s \in \mathcal{A}} (\rho_s^{\max})^{d} = 1$. Then,

$$\dim_B^{+} K \le d^{\max}.$$

Analogously, the Hausdorff dimension of RNN states can be lower-bounded as follows: Define

$$q = \min_{s, s' \in \mathcal{A},\, s \ne s'} \|W^{r,x}(c(s) - c(s'))\|$$

and assume $\sigma_+(W^{r,r}) < \min\{(g_{\max})^{-1},\, q / \mathrm{diam}(R)\}$. Let $d^{\min}$ be the unique solution of $\sum_{s \in \mathcal{A}} (\rho_s^{\min})^{d} = 1$, where $\rho_s^{\min} = \sigma_-(W^{r,r}) \inf_{v \in C_s} |g'(v)|$ and $\sigma_-(W^{r,r})$ is the smallest singular value of $W^{r,r}$. Then, $\dim_H K \ge d^{\min}$.

Closed-form (less tight) bounds are also possible [40]:

$$\dim_B^{+} K \le \frac{\log A}{-\left(\log N + \log \|W^{r,r}\|_{\max} + \log g_{\max}\right)},$$

where $\|W^{r,r}\|_{\max} = \max_{1 \le i, j \le N} |W^{r,r}_{ij}|$; and

$$\dim_H K \ge \frac{\log A}{-\log \rho_{\min}},$$

where $\rho_{\min} = \min_{s \in \mathcal{A}} \rho_s^{\min}$.

A more involved multifractal analysis of RNN state activations under more general stochastic sources driving the RNN input can be found in [7].
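A crude numerical illustration of the upper bound: if the derivative of tanh is bounded globally by 1 (a looser choice than the per-symbol suprema over $C_s$), every $\rho_s^{\max}$ collapses to $\sigma_+(W^{r,r})$, and the solution of $\sum_s \rho^{d} = 1$ is simply $d = \log A / (-\log \rho)$. The sketch below computes this for a random small-weight matrix; the weights and sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(5)

# Crude version of the upper bound on dim_B^+ K for a small-weight tanh RNN:
# use g_s <= 1 (global bound on tanh'), so rho_s^max = sigma_max(W_rr) for all s,
# and the solution of sum_s rho**d = 1 is d = log(A) / -log(rho).
A, N = 4, 20
W_rr = 0.05 * rng.standard_normal((N, N))

rho = np.linalg.svd(W_rr, compute_uv=False)[0]      # largest singular value
if rho < 1.0:
    d_upper = np.log(A) / -np.log(rho)
    print(f"sigma_max(W_rr) = {rho:.3f},  dim_B^+ K <= {d_upper:.3f}")
else:
    print("sigma_max(W_rr) >= 1: the contraction-based bound does not apply")
```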

    5 Unsupervised learning setting

In this section, we will study model formulations for constructing topographic maps of sequential data. While most approaches to topographic map formation assume that the data points are members of a finite-dimensional vector space of a fixed dimension, there has been considerable interest in extending topographic maps to more general data structures, such as sequences or


trees. Several modifications of the standard Self-Organizing Map (SOM) [8] to sequences and/or tree structures have been proposed in the literature [41, 15]. Several of these modifications equip the standard SOM with additional feedback connections that allow for natural processing of recursive data types. Typical examples of such models are the Temporal Kohonen Map [9], recurrent SOM [10], feedback SOM [11], recursive SOM [12], merge SOM [13] and SOM for structured data [14].

In analogy to well-known capacity results derived for supervised recursive models e.g. in [42, 43], some approaches to investigating the principled capacity of these models have been presented in the last few years. Thereby, depending on the model, the situation is diverse: whereas SOM for structured data and merge SOM (both with arbitrary weights) are equivalent to finite automata [44, 13], recursive SOM equipped with a somewhat simplified context model can simulate at least pushdown automata [45]. The Temporal Kohonen Map and recurrent SOM are restricted to finite memory models due to their restricted recurrence, even with arbitrary weights [45, 44]. However, the exact capacity of most unsupervised recursive models, as well as the biases which arise during training, have hardly been understood so far.

    5.1 Recursive SOM

In this section we will analyse the recursive SOM (RecSOM) [12]. There are two main reasons for concentrating on RecSOM. First, RecSOM transcends the simple local recurrence of leaky integrators of earlier SOM-motivated models for processing sequential data and consequently can represent much richer dynamical behavior [15]. Second, the underlying continuous state-space dynamics of RecSOM makes this model particularly suitable for analysis along the lines of section 3.1.

The map is formed at the recurrent layer. Each unit $i \in \{1, 2, \ldots, N\}$ in the map has two weight vectors associated with it:

• $w_i^{r,x} \in X$ — linked with an $N_I$-dimensional input $x(t)$ feeding the network at time $t$;
• $w_i^{r,r} \in R$ — linked with the context

$$r(t-1) = (r_1(t-1), r_2(t-1), \ldots, r_N(t-1))^T$$

containing map activations $r_i(t-1)$ from the previous time step.

Hence, the weight matrices $W^{r,x}$ and $W^{r,r}$ can be written as $W^{r,x} = ((w_1^{r,x})^T, (w_2^{r,x})^T, \ldots, (w_N^{r,x})^T)^T$ and $W^{r,r} = ((w_1^{r,r})^T, (w_2^{r,r})^T, \ldots, (w_N^{r,r})^T)^T$, respectively.

The activation of a unit $i$ at time $t$ is computed as

$$r_i(t) = \exp(-d_i(t)), \quad\quad (30)$$

where

    ri(t) = exp(di(t)), (30)

    where


$$d_i(t) = \alpha \, \|x(t) - w_i^{r,x}\|^2 + \beta \, \|r(t-1) - w_i^{r,r}\|^2. \quad\quad (31)$$

In eq. (31), $\alpha > 0$ and $\beta > 0$ are parameters that respectively quantify the influence of the input and the context on the map formation. The output of RecSOM is the index of the unit with maximum activation (best-matching unit),

$$y(t) = h(r(t)) = \operatorname{argmax}_{i \in \{1, 2, \ldots, N\}} r_i(t). \quad\quad (32)$$

Both weight vectors can be updated using SOM-type learning [12]:

$$\Delta w_i^{r,x} = \eta \, \nu(i, y(t)) \, (x(t) - w_i^{r,x}), \quad\quad (33)$$
$$\Delta w_i^{r,r} = \eta \, \nu(i, y(t)) \, (r(t-1) - w_i^{r,r}), \quad\quad (34)$$

where $0 < \eta < 1$ is the learning rate. The neighborhood function $\nu(i, k)$ is a Gaussian (of width $\sigma$) on the distance $d(i, k)$ of units $i$ and $k$ in the map:

$$\nu(i, k) = e^{-\frac{d(i,k)^2}{\sigma^2}}. \quad\quad (35)$$

The neighborhood width $\sigma$ linearly decreases in time to allow for forming a topographic representation of input sequences. A schematic illustration of RecSOM is presented in figure 3.

The time evolution (31) becomes

$$r_i(t) = \exp(-\alpha \|x(t) - w_i^{r,x}\|^2) \, \exp(-\beta \|r(t-1) - w_i^{r,r}\|^2). \quad\quad (36)$$

We denote the Gaussian kernel of inverse variance $\kappa > 0$, acting on $\mathbb{R}^N$, by $G_\kappa(\cdot, \cdot)$, i.e. for any $u, v \in \mathbb{R}^N$,

$$G_\kappa(u, v) = e^{-\kappa \|u - v\|^2}. \quad\quad (37)$$

Finally, the state space evolution (2) can be written for RecSOM as follows:

$$r(t) = f(x(t), r(t-1)) = (f_1(x(t), r(t-1)), \ldots, f_N(x(t), r(t-1))), \quad\quad (38)$$

where

$$f_i(x, r) = G_\alpha(x, w_i^{r,x}) \, G_\beta(r, w_i^{r,r}), \quad i = 1, 2, \ldots, N. \quad\quad (39)$$

The output response to an input time series $\{x(t)\}$ is the time series $\{y(t)\}$ of best-matching unit indices given by (32).
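A minimal RecSOM sketch implementing the activation (36)-(39) and one SOM-style update step (33)-(35) on a one-dimensional lattice is given below; the parameter values, lattice shape and training schedule are illustrative simplifications (e.g. the neighborhood width is kept fixed rather than annealed).

```python
import numpy as np

rng = np.random.default_rng(6)

# Minimal RecSOM on a 1-D lattice of N units; inputs are one-hot symbol codes.
A, N = 4, 30
alpha, beta = 2.0, 0.5          # input / context weighting (eq. 31)
eta, sigma = 0.1, 3.0           # learning rate and neighbourhood width (eqs. 33-35)
codes = np.eye(A)

W_x = rng.uniform(0, 1, (N, A))     # w_i^{r,x}
W_r = rng.uniform(0, 0.1, (N, N))   # w_i^{r,r}
grid = np.arange(N)                 # unit positions on the lattice

def activate(x, r_prev):
    """Eqs. (36)-(39): r_i = exp(-alpha ||x - w_x_i||^2) * exp(-beta ||r_prev - w_r_i||^2)."""
    d = alpha * ((x - W_x) ** 2).sum(1) + beta * ((r_prev - W_r) ** 2).sum(1)
    return np.exp(-d)

r = np.zeros(N)
for s in rng.integers(0, A, 2000):               # drive the map with a random symbol stream
    r_prev, x = r, codes[s]
    r = activate(x, r_prev)
    win = int(np.argmax(r))                      # best-matching unit (eq. 32)
    nu = np.exp(-((grid - win) ** 2) / sigma ** 2)   # neighbourhood function (eq. 35)
    W_x += eta * nu[:, None] * (x - W_x)             # eq. (33)
    W_r += eta * nu[:, None] * (r_prev - W_r)        # eq. (34)

print("winner for last input:", win)
```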

5.2 Markovian organization of receptive fields in contractive RecSOM

As usual in topographic maps, each unit on the map can be naturally represented by its receptive field (RF). The receptive field of a unit $i$ is the common suffix of all sequences for which that unit becomes the best-matching unit.


Fig. 3. Schematic illustration of RecSOM. Dimensions (neural units) of the state space $R$ are topologically organized on a grid structure. The degree of closeness of units $i$ and $k$ on the grid is given by the neighborhood function $\nu(i, k)$. The output of RecSOM is the index of the maximally activated unit (dimension of the state space).

It is common to visually represent trained topographic maps by printing out the RFs on the map grid.

Let us discuss how contractive fixed-input maps $f_s$ (3) shape the overall organization of RFs in the map. For each input symbol $s \in \mathcal{A}$, the autonomous dynamics

$$r(t) = f_s(r(t-1)) \quad\quad (40)$$

induces an output dynamics $y(t) = h(r(t))$ of best-matching (winner) units on the map. Since for each input symbol $s \in \mathcal{A}$ the fixed-input dynamics (40) is a contraction, it will be dominated by a unique attractive fixed point $r_s$. It follows that the output dynamics $\{y(t)\}$ on the map settles down at the unit $i_s$ corresponding to the mode of $r_s$. The unit $i_s$ will be most responsive to input subsequences ending with long blocks of symbol $s$. Receptive fields of other units on the map will be organized with respect to the closeness of the units to the fixed-input winners $i_s$, $s \in \mathcal{A}$. When symbol $s$ is seen at time $t$, the mode of the map activation profile $r(t)$ starts drifting towards the unit $i_s$. The more consecutive symbols $s$ we see, the more dominant the attractive fixed point of $f_s$ becomes and the closer the winner position is to $i_s$.

It has indeed been reported that maps of sequential data obtained by RecSOM often seem to have a Markovian flavor: the units on the map become sensitive to recently observed symbols. Suffix-based RFs of the neurons are


topographically organized in connected regions according to the last seen symbols. If two prefixes $s_{1:p}$ and $s_{1:q}$ of a long sequence $s_1 \ldots s_{p-2} s_{p-1} s_p \ldots s_{q-2} s_{q-1} s_q \ldots$ share a common suffix of length $L$, it follows from (10) that for any $r \in R$,

$$\|f_{s_{1:p}}(r) - f_{s_{1:q}}(r)\| \le \rho_{\max}^{L} \, \mathrm{diam}(R).$$

For sufficiently large $L$, the two activations $r^1 = f_{s_{1:p}}(r)$ and $r^2 = f_{s_{1:q}}(r)$ will be close enough to have the same location of the mode,⁸

$$i^* = h(r^1) = h(r^2),$$

and the two subsequences $s_{1:p}$ and $s_{1:q}$ yield the same best-matching unit $i^*$ on the map, irrespective of the position of the subsequences in the input stream. All that matters is that the prefixes share a sufficiently long common suffix. We say that such an organization of RFs on the map has a Markovian flavour, because it is shaped solely by the suffix structure of the processed subsequences, and it does not depend on the temporal context in which they

occur in the input stream⁹.

The RecSOM parameter $\beta$ weighs the significance of importing information about the possibly distant past into the processing of sequential data. Intuitively, when $\beta$ is sufficiently small, i.e. when information about the very recent inputs dominates processing in RecSOM, the resulting maps should have a Markovian flavour. This intuition was given a more rigorous form in [16]. In particular, theoretical bounds on the parameter $\beta$ that guarantee contractiveness of the fixed-input maps $f_s$ were derived. As argued above, contractive fixed-input mappings are likely to produce Markovian organizations of RFs on the RecSOM map.

Denote by $G_\alpha(x)$ the collection of activations coming from the feed-forward part of RecSOM,

$$G_\alpha(x) = \big(G_\alpha(x, w_1^{r,x}), G_\alpha(x, w_2^{r,x}), \ldots, G_\alpha(x, w_N^{r,x})\big). \quad\quad (41)$$

Then we have [16]:

Theorem 2. Consider an input $x \in X$. If for some $\rho \in [0, 1)$,

$$\beta \le \frac{\rho^2 e}{2 \, \|G_\alpha(x)\|^2}, \quad\quad (42)$$

then the mapping $f_x(r) = f(x, r)$ is a contraction with contraction coefficient $\rho$.

⁸ Or at least mode locations on neighboring grid points of the map.
⁹ Theoretically, there can be situations where (1) the locations of the modes of $r^1$ and $r^2$ are distinct, but the distance between $r^1$ and $r^2$ is small; or where (2) the modes of $r^1$ and $r^2$ coincide, while their distance is quite large. This follows from the discontinuity of the output map $h$. However, in our extensive experimental studies, we have registered only a negligible number of such cases.
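Contractiveness of a fixed-input RecSOM map $f_x$ can also be probed numerically: the sketch below samples random state pairs and measures the ratio $\|f_x(r) - f_x(r')\| / \|r - r'\|$ for several values of $\beta$. Weights and parameter values are illustrative, and the sampling is a numerical check rather than a proof.

```python
import numpy as np

rng = np.random.default_rng(7)

# Probe whether a RecSOM fixed-input map f_x (eq. 39) behaves contractively for
# given alpha, beta, by sampling random state pairs and measuring distance ratios.
A, N = 4, 30
alpha = 2.0
W_x = rng.uniform(0, 1, (N, A))
W_r = rng.uniform(0, 1, (N, N))
x = np.eye(A)[0]                               # a fixed input symbol
G_alpha_x = np.exp(-alpha * ((x - W_x) ** 2).sum(1))

def f_x(r, beta):
    return G_alpha_x * np.exp(-beta * ((r - W_r) ** 2).sum(1))

for beta in (0.5, 2.0, 20.0):
    ratios = []
    for _ in range(2000):
        r1, r2 = rng.uniform(0, 1, N), rng.uniform(0, 1, N)
        ratios.append(np.linalg.norm(f_x(r1, beta) - f_x(r2, beta))
                      / np.linalg.norm(r1 - r2))
    print(f"beta = {beta:5.1f}:  max sampled ratio ~ {max(ratios):.3f}")
```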


Corollary 1. Provided $\beta \le \frac{\rho^2 e}{2N}$ for some $\rho \in [0, 1)$, the mapping $f_x$ is a contraction with contraction coefficient $\rho$ for every input $x \in X$, since $\|G_\alpha(x)\|^2 \le N$.

A similar Markovian analysis can be carried out for the SOM for structured data (SOMSD) [14], whose context space $R^I$ consists of the neuron locations on the map lattice. Two properties are assumed:

1. SOMSD has granularity $\epsilon_1 > 0$ if for every $a \in \mathcal{A}$ and every $r^I \in R^I$ a neuron $i$ can be found such that $\|a - w_i^{r,x}\| \le \epsilon_1$ and $\|r^I - w_i^{r,r}\| \le \epsilon_1$.

2. SOMSD is distance preserving given $\epsilon_2 > 0$ if for every two neurons $i$ and $j$ the following inequality holds: $\big|\, \|I_i - I_j\| - \|w_i^{r,x} - w_j^{r,x}\| - \|w_i^{r,r} - w_j^{r,r}\| \,\big| \le \epsilon_2$. This inequality relates the distances of the locations of neurons on the lattice to the distances of their content.

Assume a SOMSD is given with initial context $r_0$. We denote the output response to the empty sequence $\varepsilon$ by $i_\varepsilon = h(r_0)$, the winner unit for an input sequence $u$ by $i_u := h \circ f_u(r_0)$, and the corresponding winner location on the map grid by $I_u$. We can define an explicit Markovian metric on sequences in the following way: $d_{\mathcal{A}^*}: \mathcal{A}^* \times \mathcal{A}^* \to \mathbb{R}$, $d_{\mathcal{A}^*}(\varepsilon, \varepsilon) = 0$, $d_{\mathcal{A}^*}(u, \varepsilon) = d_{\mathcal{A}^*}(\varepsilon, u) = \|I_\varepsilon - I_u\|$, and for two nonempty sequences $u = u_1 \ldots u_{p-1} u_p$, $v = v_1 \ldots v_{q-1} v_q$ over $\mathcal{A}$,

$$d_{\mathcal{A}^*}(u, v) = \|c(u_p) - c(v_q)\| + \lambda \, d_{\mathcal{A}^*}(u_1 \ldots u_{p-1}, v_1 \ldots v_{q-1}). \quad\quad (47)$$

Assume $\lambda < 1$. Then the output of SOMSD is Markovian in the sense that the metric defined in (47) approximates the metric induced by the winner locations of sequences. More precisely, the following inequality is valid:

$$\big| \, d_{\mathcal{A}^*}(u, v) - \|I_u - I_v\| \, \big| \le \frac{\epsilon_2 + 4(\epsilon_1 + \epsilon_2)\,C}{1 - \lambda},$$

where $C$ is a constant depending on the SOMSD weighting of the input and context terms.

This inequality is immediate if $u$ or $v$ is empty. For nonempty sequences $u = u_{1:p-1} u_p$ and $v = v_{1:q-1} v_q$, we find

$$\big| \, d_{\mathcal{A}^*}(u, v) - \|I_u - I_v\| \, \big| = \big| \, \|c(u_p) - c(v_q)\| + \lambda \, d_{\mathcal{A}^*}(u_{1:p-1}, v_{1:q-1}) - \|I_u - I_v\| \, \big|$$
$$\le \epsilon_2 + \big| \, \|w_{i_u}^{r,x} - w_{i_v}^{r,x}\| - \|c(u_p) - c(v_q)\| \, \big| + \big| \, \|w_{i_u}^{r,r} - w_{i_v}^{r,r}\| - \lambda \, d_{\mathcal{A}^*}(u_{1:p-1}, v_{1:q-1}) \, \big|.$$

The winner $i_u$ is the neuron for which the weighted cost of matching $c(u_p)$ and $I_{u_{1:p-1}}$ is minimal. We have assumed sufficient granularity, i.e. there exists a neuron with weights at most $\epsilon_1$ away from $c(u_p)$ and $I_{u_{1:p-1}}$. Therefore, $\|w_{i_u}^{r,x} - c(u_p)\|$ and $\|w_{i_u}^{r,r} - I_{u_{1:p-1}}\|$ can be bounded in terms of $\epsilon_1$ and the weighting parameters; applying the same argument to $i_v$ and iterating the recursion over the common structure of $u$ and $v$ yields the stated bound.


For the merge SOM (MSOM) [13], a Markovian bias arises directly from the learning dynamics. Thereby, the limit of Hebbian learning for vanishing neighborhood size is considered, which yields the update rules

$$\Delta w_{i_t}^{r,x} = \eta \, (x(t) - w_{i_t}^{r,x}) \quad\quad (49)$$
$$\Delta w_{i_t}^{r,r} = \eta \, (r_M(t-1) - w_{i_t}^{r,r}) \quad\quad (50)$$

after time step $t$, whereby $x(t)$ constitutes the input at time step $t$,

$$r_M(t-1) = \gamma \, w_{i_{t-1}}^{r,x} + (1 - \gamma) \, w_{i_{t-1}}^{r,r}$$

is the context at time step $t$, and $i_t$ is the winner at time step $t$; $\eta > 0$ is the learning rate and $\gamma \in (0, 1)$ is the merge parameter.

Assume an input series $\{x(t)\}$ is presented to MSOM and assume the initial context $r_0$ is the null vector. Then, if a sufficient number of neurons is available such that there exists a distinct winner for every input $x(t)$, the following weights constitute a stable fixed point of Hebbian learning as described in eqs. (49-50):

$$w_{i_t}^{r,x} = x(t) \quad\quad (51)$$
$$w_{i_t}^{r,r} = \sum_{j=1}^{t-1} \gamma (1 - \gamma)^{j-1} \, x(t-j). \quad\quad (52)$$

The proof of this fact can be found in [15, 13]. Note that, depending on the size of $\gamma$, the context representation yields sequence codes which are very similar to the recursive states provided by the fractal prediction machines (FPM) of section 4.3. For $\gamma < 0.5$ and unary inputs $x(t)$ (one-of-$A$ encoding), (52) yields a fractal whereby the global position in the state space is determined by the most recent entries of the input series. Hence, a Markovian bias is obtained by Hebbian learning.
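The sketch below evaluates fixed-point context codes of the form (52), as reconstructed above, for one-hot coded symbol streams, and shows that streams sharing a recent past receive nearby context vectors; the merge parameter value and the sequences are illustrative.

```python
import numpy as np

# Fixed-point MSOM context codes (eq. 52) for streams of one-hot coded symbols:
# sequences sharing a recent past are assigned nearby context vectors.
A = 3
codes = np.eye(A)
gamma = 0.5                                   # merge parameter (illustrative value)

def msom_context(seq):
    """w^{r,r} fixed point of eq. (52): sum_j gamma * (1 - gamma)**(j - 1) * x(t - j)."""
    ctx = np.zeros(A)
    for j, s in enumerate(reversed(seq), start=1):
        ctx += gamma * (1 - gamma) ** (j - 1) * codes[s]
    return ctx

suffix = [0, 2, 1, 1, 2]
u = [1, 1, 1, 0, 0, 2] + suffix               # different prefixes, same recent past
v = [2, 0, 2] + suffix
print(np.linalg.norm(msom_context(u) - msom_context(v)))      # small: suffix-based coding
print(np.linalg.norm(msom_context(u) - msom_context(v[:-1])))  # larger: suffixes differ
```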

This can clearly be observed in experiments; compare figure 4¹¹, which shows the result of an experiment conducted in [13]: an MSOM with 644 neurons is trained on DNA sequences. The data consist of windows around potential splice sites from C. elegans as described in [46]. The symbols T, C, G, A are encoded as the vertices of a tetrahedron in $\mathbb{R}^3$. The windows are concatenated and presented to the MSOM using merge parameter $\gamma = 0.5$. Thereby, the neighborhood function used for training is not a standard SOM lattice; rather, training is data-driven via ranks, as in neural gas networks [47]. Figure 4 depicts a projection of the learnt contexts. Clearly, a fractal structure which embeds sequences in the whole context space can be observed, substantiating the theoretical finding with experimental evidence.

Note that, unlike FPMs, which use a fixed fractal encoding, the encoding of MSOM emerges from training. Therefore, it takes the internal probability distribution of symbols into account, automatically assigning a larger portion of the space to those inputs which are presented more frequently to the map.

¹¹ We would like to thank Marc Strickert for providing the picture.


Fig. 4. Projection of the three-dimensional fractal contexts which are learnt when training an MSOM on DNA strings, whereby the symbols A, C, T, G are embedded in three dimensions.

    6 Applications in bioinformatics

In this section we briefly describe applications of the architectural bias phenomenon in the supervised learning scenario. The natural Markovian organization of biological sequences within the RNN state space may be exploited when solving some sequence-related problems of bioinformatics.

    6.1 Genomic data

Enormous amounts of genomic sequence data have been produced in recent years. Unfortunately, the emergence of experimental data for the functional parts of such large-scale genomic repositories is lagging behind. For example, several hundred thousand proteins are known, but only about 35,000 have had their structure determined. One third of all proteins are believed to be membrane associated, but fewer than a thousand have been experimentally well-characterized as such.

As commonly assumed, genomic sequence data specifies the structure and function of its products (proteins). Efforts are underway in the field of bioinformatics to link genomic data to experimental observations with the goal of predicting such characteristics for the yet unexplored data [48]. DNA, RNA and proteins are all chain molecules, made up of distinct monomers, namely nucleotides (DNA and RNA) and amino acids (proteins). We choose to cast such chemical compounds as symbol sequences. There are four different nucleotides and twenty different amino acids, here referred to as alphabets $\mathcal{A}_4$ and $\mathcal{A}_{20}$, respectively.


A functional region of DNA typically includes thousands of nucleotides. The average number of amino acids in a protein is around the 300 mark, but sequence lengths vary greatly. The combinatorial nature of genomic sequence data poses a challenge for machine learning approaches linking sequence data with observations. Even when thousands of samples are available, the sequence space is sparsely populated. The architectural bias of predictive models may thus influence the result greatly. This section demonstrates that a Markovian state-space organization, as imposed by the recursive model in this paper, discerns patterns with a biological flavor [49].

In bioinformatics, motifs are perceived as recurring patterns in sequence data (DNA, RNA and proteins) with an observed or conjectured biological significance (the character of which varies widely). Problems like promoter and DNA-binding site discovery [50, 51], identification of alternatively spliced exons [52] and glycosylation sites are successfully approached using the concept of sequence motifs. Motifs are usually quite short and, to be useful, may require a more permissive description than a list of consecutive sequence symbols.

Hence, we augment the motif alphabet A to include the so-called wildcard symbol *: A_* = A ∪ {*}. Let M = m_1 m_2 ... m_{|M|} be a motif, where m_i ∈ A_*. As outlined below, the wildcard symbol has special meaning in the motif sequence matching process.

    6.2 Recurrent architectures promote motif discovery

We are specifically interested in whether the Markovian state space organization promotes the discovery of a motif within a sequence. Following [49], position-dependent motifs are patterns of symbols relative to a fixed sequence position. A motif match is defined as

$$
\mathrm{match}(s_{1:n}, M) =
\begin{cases}
t, & \text{if } \forall i \in \{1, \ldots, |M|\}\ \big[\, m_i = * \ \vee\ m_i = s_{n-|M|+i} \,\big] \\
f, & \text{otherwise,}
\end{cases}
$$

where n ≥ |M|. Notice that the motif must occur at the end of the sequence.

To inspect how well sequences with and without a given motif M (t and f,

respectively) are separated in the state-space, Boden and Hawkins determined a motif match entropy for each of the Voronoi compartments V_1, ..., V_M, introduced in the context of NPM construction in section 4.1. The compartments were formed by vector quantizing the final activations produced by a neural-based architecture on a set of sequences (see Section 4). With each codebook vector b_i ∈ R^N, i = 1, 2, ..., M, we associate the probability P(t|i) of observing a state r ∈ V_i corresponding to a sequence motif match. This probability is estimated by tracking the matching and non-matching sequences while counting

the number of times V_i is visited. The entropy for each codebook is defined as

$$ H_i = -P(t|i)\log_2 P(t|i) - \big(1 - P(t|i)\big)\log_2\big(1 - P(t|i)\big) \qquad (53) $$


and describes the homogeneity within the Voronoi compartment by considering the proportion of motif positives and negatives. An entropy of 0 indicates perfect homogeneity (contained states are exclusively for matching or non-matching sequences). An entropy of 1 indicates random organization (a matching state is observed by chance). The average codebook entropy, H = (1/M) Σ_{i=1}^{M} H_i, is used to characterize the overall state-space organization.

In our experiments, the weights W^{r,x} and W^{r,r} were randomly drawn from a uniform distribution over [-0.5, +0.5]. This, together with the activation function g, ensured contractiveness of the state transition map f and hence (as explained in section 3.1) Markovian organization of the RNN state space.
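To make the evaluation pipeline concrete, the following minimal sketch (not the code used in [49]) collects final recurrent activations for a set of sequences and computes the entropies of eq. (53). The logistic activation, the one-hot symbol codes, and the skipping of empty compartments are assumptions of the sketch; the codebook vectors are taken as given (e.g. produced by any vector quantizer).

```python
import numpy as np

WILDCARD = "*"

def motif_match(seq, motif):
    """Position-dependent match: True iff the motif (wildcards allowed) aligns with the end of seq."""
    if len(seq) < len(motif):
        return False
    tail = seq[len(seq) - len(motif):]
    return all(m == WILDCARD or m == s for m, s in zip(motif, tail))

def final_state(seq, codes, W_x, W_r, bias):
    """Final activation of a randomly initialized recurrent layer (logistic units keep the map contractive)."""
    r = np.zeros(W_r.shape[0])
    for s in seq:
        r = 1.0 / (1.0 + np.exp(-(W_x @ codes[s] + W_r @ r + bias)))
    return r

def average_entropy(states, matches, codebooks):
    """Average of the per-compartment entropies H_i of eq. (53).

    states    -- array (S, N) of final activations, one row per sequence
    matches   -- boolean array (S,), True if the sequence ends in a motif match
    codebooks -- array (M, N) of codebook vectors b_i
    """
    assign = np.linalg.norm(states[:, None, :] - codebooks[None, :, :], axis=2).argmin(axis=1)
    entropies = []
    for i in range(len(codebooks)):
        hits = matches[assign == i]
        if hits.size == 0:
            continue                          # empty compartments are simply skipped here
        p = hits.mean()                       # estimate of P(t|i)
        if p == 0.0 or p == 1.0:
            entropies.append(0.0)             # perfectly homogeneous compartment
        else:
            entropies.append(-p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p))
    return float(np.mean(entropies))

# Usage: alphabet A4, one-hot codes, weights drawn uniformly from [-0.5, +0.5].
rng = np.random.default_rng(0)
codes = dict(zip("ACGT", np.eye(4)))
N = 10                                        # state-space dimensionality
W_x = rng.uniform(-0.5, 0.5, (N, 4))
W_r = rng.uniform(-0.5, 0.5, (N, N))
bias = rng.uniform(-0.5, 0.5, N)
```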

In an attempt to provide a simple baseline alternative to our recursive neural-based models, a feedforward neural network was devised, accepting the whole sequence at its input without any recursion. Given the maximum input sequence length max, the input layer has N_I · max units and the input code of a sequence s_{1:n} = s_1 ... s_n takes the form

$$ c(s_{1:n}) = c(s_1) \oplus \cdots \oplus c(s_n) \oplus \{0\}^{N_I \cdot (max - n)}, $$

where ⊕ denotes vector concatenation. Hence, c(s_{1:n}) is a concatenation of the input codes for the symbols in s_{1:n}, followed by N_I · (max - n) zeros. The hidden-layer response of the feedforward network to s_{1:n} is then

$$ f_{s_{1:n}} = f\big(c(s_{1:n})\big) = g\big(W^{r,x}\, c(s_{1:n}) + t_f\big), \qquad (54) $$

where the input-to-hidden layer weights (again randomly drawn from [-0.5, +0.5]) are collected in the N × (N_I · max) matrix W^{r,x}.
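A minimal sketch of this baseline encoding and of the response (54) follows; the one-hot symbol codes and the logistic choice for g are assumptions of the example.

```python
import numpy as np

def ff_input_code(seq, codes, max_len):
    """c(s_{1:n}): concatenated symbol codes, zero-padded to N_I * max_len entries (assumes len(seq) <= max_len)."""
    N_I = len(next(iter(codes.values())))
    c = np.zeros(N_I * max_len)
    for pos, s in enumerate(seq):
        c[pos * N_I:(pos + 1) * N_I] = codes[s]
    return c

def ff_hidden_response(c, W, t_f):
    """Hidden-layer response of eq. (54), with g taken to be the logistic function."""
    return 1.0 / (1.0 + np.exp(-(W @ c + t_f)))

# Usage: alphabet A4, one-hot codes, random weights in [-0.5, +0.5].
rng = np.random.default_rng(1)
codes = dict(zip("ACGT", np.eye(4)))
max_len, N = 12, 10
W = rng.uniform(-0.5, 0.5, (N, 4 * max_len))
t_f = rng.uniform(-0.5, 0.5, N)
h = ff_hidden_response(ff_input_code("ACGTTGA", codes, max_len), W, t_f)
```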

In each trial, a random motif M was first selected and then 500 random sequences (250 with a match, 250 without) were generated. Ten differently initialized networks (both RNN and the feedforward baseline) were then subjected to the sequences and the average entropy H was determined. By varying sequence and motif lengths, the proportion of wildcards in the motifs, the state space dimensionality and the number of codebook vectors extracted from sequence states, Boden and Hawkins established that recurrent architectures were clearly superior in discerning random but pre-specified motifs in the state space (see Figure 5). However, there was no discernible difference between the two types of networks (RNN and feedforward) for shift-invariant motifs (an alternative form which allows matches to appear at any point in the sequence). To leverage the bias, a recurrent network should thus continually monitor its state space.

These synthetic trials (see [49]) provide evidence that a recurrent network naturally organizes its state space at each step so as to enable classification of extended and flexible sequential patterns that led up to the current time step. The results have been shown to hold also for higher-order recurrent architectures [53].



Fig. 5. The average entropy H (y-axis) for recurrent networks (unfilled markers) and baseline feedforward networks (filled markers) when A4 (squares) and A20 (circles) were used to generate sequences of varying length (x-axis) and position-dependent motifs (with length equal to half the sequence length and populated with 33% wildcards). The number of codebooks, M, was 10.

    6.3 Discovering discriminative patterns in signal peptides

A long-standing problem of great biological significance is the detection of so-called signal peptides. Signal peptides are reasonably short protein segments that fulfil a transportation role in the cell. Basically, if a protein has a signal peptide, it translocates to the exoplasmic space. The signal peptide is cleaved off in the process.

Viewed as symbol strings, signal peptides are rather diverse. However, there are some universal characteristics. For example, they always appear at the N-terminus of the protein, they are 15-30 residues long, and they consist of a stretch of seven to 15 hydrophobic (lipid-favouring) amino acids and a few well-conserved, small and neutral monomers close to the cleavage site. In Figure 6, nearly 400 mammalian signal peptides are aligned at their respective cleavage site to illustrate a subtle pattern of conservation.

In an attempt to simplify the sequence classification problem, each position of the sequence can first be tagged as belonging (t) or not belonging (f) to a signal peptide [54]. SignalP is a feedforward network that is designed to look at all positions of a sequence (within a carefully sized context window) and trained to classify each into the two groups (t and f) [54].

The Markovian character of the recursive model's state space suggests that replacing the feedforward network with a recurrent network may provide a better starting point for a training algorithm.



Fig. 6. Mammalian signal peptide sequences aligned at their cleavage site (between positions 26 and 27). Each amino acid is represented by its one-letter code. The relative frequency of each amino acid in each aligned position is indicated by the height of each letter (as measured in bits). The graph is produced using WebLogo.

However, since biological sequences may exhibit dependencies in both directions (upstream as well as downstream), the simple recurrent network may be insufficient.

Hawkins and Boden [53, 55] replaced the feedforward network with a bi-directional recurrent network [56]. The bi-directional variant is simply two conventional (uni-directional) recurrent networks coupled so as to merge two states, each produced by traversing the sequence from one direction (either upstream or downstream relative to the point of interest). In Hawkins and Boden's study, the network's ability was only examined after training.
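The state-merging idea can be sketched as follows. This is a simplification rather than the exact setup of [53, 55, 56]; the logistic units, the concatenation merge and the per-position alignment are assumptions of the sketch.

```python
import numpy as np

def rnn_states(inputs, W_x, W_r, bias):
    """All states of a simple recurrent layer driven by a sequence of input vectors."""
    r, states = np.zeros(W_r.shape[0]), []
    for x in inputs:
        r = 1.0 / (1.0 + np.exp(-(W_x @ x + W_r @ r + bias)))
        states.append(r)
    return states

def bidirectional_states(inputs, fwd_params, bwd_params):
    """Per-position merge of an upstream (left-to-right) and a downstream (right-to-left) pass."""
    forward = rnn_states(inputs, *fwd_params)
    backward = rnn_states(inputs[::-1], *bwd_params)[::-1]   # re-align with the forward pass
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]
```

Each merged state can then be used to tag the corresponding sequence position as belonging (t) or not belonging (f) to a signal peptide.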

After training, the average test error (as measured over known signal peptides and known negatives) dropped by approximately 25% compared to feedforward networks. The position-specific errors for both classes of architectures are presented in Figure 7. Several setups of the bi-directional recurrent network are possible, but the largest performance increase was achieved when the number of symbols presented at each step was 10 (for both the upstream and downstream networks).

Similar to signal peptides, mitochondrial targeting peptides and chloroplast transit peptides traffic proteins; in their case, proteins are translocated to the mitochondrion and the chloroplast, respectively. Bi-directional recurrent networks applied to such sequence data show a similar pattern of improvement compared to feedforward architectures [53].

The combinatorial space of possible sequences and the comparatively small number of known samples make the application of machine learning to bioinformatics challenging. With significant data variance, it is essential to design models that guide the learning algorithm to meaningful portions of the model parameter space. By presenting exactly the same training data sets to two classes of models, we note that the architectural bias of recurrent networks seems to enhance the prospects of achieving valid generalization.



Fig. 7. The average error of recurrent and feedforward neural networks over the first 60 residues in a sequence test set, discriminating between residues that belong to a signal peptide and those that do not. The signal peptide always appears at the N-terminus of a sequence (position 1 and onwards; the average length is 23).

As noted in [53], the locations of the subtle sequential patterns evident in the data sets (cf. Figure 6) correspond well with those regions at which the recurrent networks excel. The observation lends empirical support to the idea that recursive models are well suited to this broad class of bioinformatics problems.

    7 Conclusion

We have studied the architectural bias of dynamic neural network architectures towards suffix-based Markovian input sequence representations. In the supervised learning scenario, the cluster structure that emerges in the recurrent layer prior to training is, due to the contractive nature of the state transition map, organized in a Markovian manner. In other words, the clusters emulate Markov prediction contexts in that histories of symbols are grouped according to the number of symbols they share in their suffix. Consequently, from such recurrent activation clusters it is possible to extract predictive models that correspond to variable memory length Markov models (VLMM). To appreciate how much information has been induced during training, the dynamic neural network performance should always be compared with that of VLMM baseline models.


Apart from trained models (possibly with small weights, i.e. with a Markovian bias), various mechanisms which rely only on the dynamics of fixed or randomly initialized recurrence have recently been proposed, including fractal prediction machines, echo state machines, and liquid state machines. As discussed in the paper, these models almost surely allow an approximation of Markovian models even with random initialization, provided the dimensionality of the reservoir is large.

Interestingly, the restriction to Markovian models has benefits with respect to the generalization ability of the models. Whereas generalization bounds for general recurrent models necessarily rely on the input distribution, due to the possibly complex input information, the generalization ability of Markovian models can be limited in terms of the relevant input length or, alternatively, the Lipschitz parameter of the models. This yields bounds which are independent of the input distribution and which become tighter as the contraction coefficient gets smaller. These bounds also hold for arbitrary real-valued inputs and for posterior bounds on the contraction coefficient achieved during training. The good generalization ability of RNNs with Markovian bias is particularly well suited to typical problems in bioinformatics, where the dimensionality is high compared to the number of available training examples. A couple of experiments which demonstrate this effect have been included in this article.

Unlike supervised RNNs, a variety of fundamentally different unsupervised recurrent networks has recently been proposed. Apart from comparatively old, biologically plausible leaky integrators such as the temporal Kohonen map or recurrent networks, which obviously rely on a Markovian representation of sequences, models which use a more elaborate temporal context, such as the map activation, the winner location, or the winner content, have been proposed. These models have, in principle, the larger capacity of finite state automata or pushdown automata, respectively, depending on the context model. Interestingly, this capacity reduces to that of Markovian models under certain realistic conditions discussed in this paper. These include a small mixing parameter for recursive networks, a sufficient granularity and distance preservation for SOMSD, and Hebbian learning for MSOM. The field of appropriate training of complex unsupervised recursive models is, so far, widely unexplored. Therefore, the identification of biases of the architecture and training algorithm towards the Markovian property is a very interesting result which allows us to gain more insight into the process of unsupervised sequence learning.

    References

1. Christiansen, M., Chater, N.: Toward a connectionist model of recursion in human linguistic performance. Cognitive Science 23 (1999) 417-437
2. Kolen, J.: Recurrent networks: state machines or iterated function systems? In Mozer, M., Smolensky, P., Touretzky, D., Elman, J., Weigend, A., eds.: Proceedings of the 1993 Connectionist Models Summer School. Erlbaum Associates, Hillsdale, NJ (1994) 203-210


3. Kolen, J.: The origin of clusters in recurrent neural network state space. In: Proceedings from the Sixteenth Annual Conference of the Cognitive Science Society, Hillsdale, NJ: Lawrence Erlbaum Associates (1994) 508-513
4. Tino, P., Cernansky, M., Benuskova, L.: Markovian architectural bias of recurrent neural networks. IEEE Transactions on Neural Networks 15 (2004) 6-15
5. Hammer, B., Tino, P.: Recurrent neural networks with small weights implement definite memory machines. Neural Computation 15 (2003) 1897-1929
6. Ron, D., Singer, Y., Tishby, N.: The power of amnesia. Machine Learning 25 (1996)
7. Tino, P., Hammer, B.: Architectural bias in recurrent neural networks: Fractal analysis. Neural Computation 15 (2004) 1931-1957
8. Kohonen, T.: The self-organizing map. Proceedings of the IEEE 78 (1990) 1464-1479
9. Chappell, G., Taylor, J.: The temporal Kohonen map. Neural Networks 6 (1993) 441-445
10. Koskela, T., Varsta, M., Heikkonen, J., Kaski, K.: Recurrent SOM with local linear models in time series prediction. In: 6th European Symposium on Artificial Neural Networks. (1998) 167-172
11. Horio, K., Yamakawa, T.: Feedback self-organizing map and its application to spatio-temporal pattern classification. International Journal of Computational Intelligence and Applications 1 (2001) 1-18
12. Voegtlin, T.: Recursive self-organizing maps. Neural Networks 15 (2002) 979-992
13. Strickert, M., Hammer, B.: Merge SOM for temporal data. Neurocomputing 64 (2005) 39-72
14. Hagenbuchner, M., Sperduti, A., Tsoi, A.: Self-organizing map for adaptive processing of structured data. IEEE Transactions on Neural Networks 14 (2003) 491-505
15. Hammer, B., Micheli, A., Strickert, M., Sperduti, A.: A general framework for unsupervised processing of structured data. Neurocomputing 57 (2004) 3-35
16. Tino, P., Farkas, I., van Mourik, J.: Dynamics and topographic organization of recursive self-organizing maps. Neural Computation 18 (2006) 2529-2567
17. Falconer, K.: Fractal Geometry: Mathematical Foundations and Applications. John Wiley and Sons, New York (1990)
18. Barnsley, M.: Fractals everywhere. Academic Press, New York (1988)
19. Ron, D., Singer, Y., Tishby, N.: The power of amnesia. In: Advances in Neural Information Processing Systems 6, Morgan Kaufmann (1994) 176-183
20. Guyon, I., Pereira, F.: Design of a linguistic postprocessor using variable memory length Markov models. In: International Conference on Document Analysis and Recognition, Montreal, Canada, IEEE Computer Society Press (1995) 454-457
21. Weinberger, M., Rissanen, J., Feder, M.: A universal finite memory source. IEEE Transactions on Information Theory 41 (1995) 643-652
22. Buhlmann, P., Wyner, A.: Variable length Markov chains. Annals of Statistics 27 (1999) 480-513
23. Rissanen, J.: A universal data compression system. IEEE Trans. Inform. Theory 29 (1983) 656-664
24. Giles, C., Omlin, C.: Insertion and refinement of production rules in recurrent neural networks. Connection Science 5 (1993)


25. Doya, K.: Bifurcations in the learning of recurrent neural networks. In: Proc. of 1992 IEEE Int. Symposium on Circuits and Systems. (1992) 2777-2780
26. Tino, P., Dorffner, G.: Predicting the future of discrete sequences from fractal representations of the past. Machine Learning 45 (2001) 187-218
27. Jaeger, H.: The echo state approach to analysing and training recurrent neural networks. Technical Report GMD Report 148, German National Research Center for Information Technology (2001)
28. Jaeger, H., Haas, H.: Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science 304 (2004) 78-80
29. Maass, W., Natschlager, T., Markram, H.: Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation 14 (2002) 2531-2560
30. Maass, W., Legenstein, R.A., Bertschinger, N.: Methods for estimating the computational power and generalization capability of neural microcircuits. In: Advances in Neural Information Processing Systems. Volume 17., MIT Press (2005) 865-872
31. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks 2 (1989) 359-366
32. Hammer, B.: Generalization ability of folding networks. IEEE Transactions on Knowledge and Data Engineering 13 (2001) 196-206
33. Koiran, P., Sontag, E.: Vapnik-Chervonenkis dimension of recurrent neural networks. In: European Conference on Computational Learning Theory. (1997) 223-237
34. Vapnik, V.: Statistical Learning Theory. Wiley-Interscience (1998)
35. Shawe-Taylor, J., Bartlett, P.L.: Structural risk minimization over data-dependent hierarchies. IEEE Trans. on Information Theory 44 (1998) 1926-1940
36. Williams, R., Zipser, D.: A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1 (1989) 270-280
37. Williams, R.: Training recurrent networks using the extended Kalman filter. In: Proc. 1992 Int. Joint Conf. Neural Networks. Volume 4. (1992) 241-246
38. Patel, G., Becker, S., Racine, R.: 2D image modelling as a time-series prediction problem. In Haykin, S., ed.: Kalman filtering applied to neural networks. Wiley (2001)
39. Manolios, P., Fanelli, R.: First order recurrent neural networks and deterministic finite state automata. Neural Computation 6 (1994) 1155-1173
40. Tino, P., Hammer, B.: Architectural bias in recurrent neural networks - fractal analysis. In Dorronsoro, J., ed.: Artificial Neural Networks - ICANN 2002. Lecture Notes in Computer Science, Springer-Verlag (2002) 1359-1364
41. de A. Barreto, G., Araujo, A., Kremer, S.: A taxonomy of spatiotemporal connectionist networks revisited: The unsupervised case. Neural Computation 15 (2003) 1255-1320
42. Kilian, J., Siegelmann, H.T.: On the power of sigmoid neural networks. Information and Computation 128 (1996) 48-56
43. Siegelmann, H.T., Sontag, E.D.: Analog computation via neural networks. Theoretical Computer Science 131 (1994) 331-360
44. Hammer, B., Micheli, A., Sperduti, A., Strickert, M.: Recursive self-organizing network models. Neural Networks 17 (2004) 1061-1086


45. Hammer, B., Neubauer, N.: On the capacity of unsupervised recursive neural networks for symbol processing. In d'Avila Garcez, A., Hitzler, P., Tamburrini, G., eds.: Workshop proceedings of NeSy'06. (2006)
46. Sonnenburg, S.: New methods for splice site recognition. Diploma thesis, Institut für Informatik, Humboldt-Universität Berlin (2002)
47. Martinetz, T., Berkovich, S., Schulten, K.: "Neural-gas" network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks 4 (1993) 558-569
48. Baldi, P., Brunak, S.: Bioinformatics: The machine learning approach. MIT Press, Cambridge, Mass. (2001)
49. Boden, M., Hawkins, J.: Improved access to sequential motifs: A note on the architectural bias of recurrent networks. IEEE Transactions on Neural Networks 16 (2005) 491-494
50. Bailey, T.L., Elkan, C.: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In: Proceedings of ISMB, Stanford, CA (1994) 28-36
51. Stormo, G.D.: DNA binding sites: representation and discovery. Bioinformatics 16 (2000) 16-23
52. Sorek, R., Shemesh, R., Cohen, Y., Basechess, O., Ast, G., Shamir, R.: A non-EST-based method for exon-skipping prediction. Genome Research 14 (2004) 1617-1623
53. Hawkins, J., Boden, M.: The applicability of recurrent neural networks for biological sequence analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2 (2005) 243-253
54. Dyrlov Bendtsen, J., Nielsen, H., von Heijne, G., Brunak, S.: Improved prediction of signal peptides: SignalP 3.0. Journal of Molecular Biology 340 (2004) 783-795
55. Hawkins, J., Boden, M.: Detecting and sorting targeting peptides with neural networks and support vector machines. Journal of Bioinformatics and Computational Biology 4 (2006) 1-18
56. Baldi, P., Brunak, S., Frasconi, P., Soda, G., Pollastri, G.: Exploiting the past and the future in protein secondary structure prediction. Bioinformatics 15 (1999) 937-946