
SPECIAL ISSUE ON SYSTEMS BIOLOGY, JANUARY 2008

The 4M (Mixed Memory Markov Model) Algorithm for Finding Genes in Prokaryotic Genomes

Mathukumalli Vidyasagar, Fellow, IEEE, Sharmila S. Mande, Ch. V. Siva Kumar Reddy, and V. V. Raja Rao

Abstract—In this paper, we present a new algorithm called 4M (mixed memory Markov model) for finding genes from the genomes of prokaryotes. This is achieved by modeling the known coding regions of the genome as a set of sample paths of one multistep Markov chain and the known non-coding regions as a set of sample paths of another multistep Markov chain. The new feature of the 4M algorithm is that different states are allowed to have different memory lengths, in contrast to a fixed multistep Markov model used in GeneMark in its various versions. At the same time, compared with an algorithm like Glimmer3 that uses an interpolation of Markov models of different memory lengths, the statistical significance of the conclusions drawn from the 4M algorithm is quite easy to quantify. Thus, when a whole genome annotation is carried out and several new genes are predicted, it is extremely easy to rank these predictions in terms of the confidence one has in the predictions. The basis of the 4M algorithm is a simple rank condition satisfied by the matrix of frequencies associated with a Markov chain.

The 4M algorithm is validated by applying it to 75 organisms belonging to practically all known families of bacteria and archaea. The performance of the 4M algorithm is compared with those of Glimmer3, GeneMark2.5d, and GeneMarkHMM2.6g. It is found that, in a vast majority of cases, the 4M algorithm finds many more genes than it misses, compared with any of the other three algorithms. Next, the 4M algorithm is used to carry out whole genome annotation of 13 organisms by using 50% of the known genes as the training input for the coding model and 20% of the known non-genes as the training input for the non-coding model. After this, all of the open reading frames are classified. It is found that the 4M algorithm is highly specific in that it picks out virtually all of the known genes, while predicting that only a small number of the open reading frames whose status is unknown are genes.

Index Terms—Algorithm, gene prediction, K-L divergence, Markov model, prokaryotes.

I. INTRODUCTION

A. Gene-Finding Problem

ALL LIVING things consist of DNA, which is a very complex molecule arranged in a double helix. DNA consists of a series of nucleotides, where each nucleotide is denoted by the base it contains, namely, A (Adenine), C (Cytosine), G (Guanine), or T (Thymine). The genome of an organism is the listing of one strand of DNA as an enormously long sequence of symbols from the four-symbol alphabet {A, C, G, T}.

Manuscript received January 22, 2007; revised September 1, 2007.
The authors are with the Advanced Technology Centre, Tata Consultancy Services, Software Units Layout, Madhapur, Hyderabad 500081, India (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TAC.2007.911360

Certain parts of the genome correspond to genes that get converted into proteins, while the rest are non-coding regions. In prokaryotes, or "lower" organisms, the genes are in one continuous stretch, whereas in eukaryotes, or "higher" organisms, the genes consist of a series of exons, interrupted by introns. The junctions between exons and introns are called splice sites, and the detection of splice sites is a very difficult problem. For this reason, the focus in this paper is on finding genes in prokaryotes.

It is easy to state some necessary but not sufficient conditions for a stretch of genome to be a gene, which are given as follows.

• The sequence must begin with the start codon ATG. In some organisms, GTG is also a start codon.
• The sequence must end with one of the three stop codons, namely, TAA, TAG, or TGA.
• The length of the sequence must be an exact multiple of three.

A stretch of genome that satisfies these conditions is referred to as an open reading frame (ORF).
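As a concrete illustration of these conditions, the following is a minimal Python sketch (not taken from the paper) that scans one strand for ORFs; treating GTG as an alternative start codon and reporting, for each stop codon, the ORF starting at the first in-frame start codon are illustrative assumptions.

```python
# Minimal sketch: enumerate candidate ORFs on one strand of a genome.
# Assumptions (not from the paper): GTG is accepted as an alternative
# start codon; for each stop codon, the first in-frame start is reported.

START_CODONS = {"ATG", "GTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(sequence, min_length=0):
    """Return (start, end) index pairs of ORFs: start codon ... stop codon,
    length an exact multiple of three, no internal in-frame stop codon."""
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3          # include the stop codon
                if end - start >= min_length:
                    orfs.append((start, end))
                start = None
    return orfs

if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTGGGTAACC"))   # -> [(2, 17)]
```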

B. Statistical Approaches to Gene-Finding

There are in essence two distinct approaches to gene-finding, namely, string-matching and statistical modeling. In string-matching algorithms, one looks for symbol-for-symbol matching, whereas in statistical modeling one looks for similarity of the statistical behavior. If one were to examine genes with the same function across two organisms, then it is likely that the two DNA sequences would match quite well at a symbol-for-symbol level. For instance, if one were to compare the gene that generates insulin in a mouse and in a human, the two strings would be very similar, except that occasionally one would have to introduce a "gap" in one sequence or the other. This particular problem, namely, to determine the best possible match between two sequences after inserting a few gaps here and there, is known as the optimal gapped alignment problem. If the two strings to be aligned have lengths $m$ and $n$, respectively, then it is possible to give an optimal alignment based on dynamic programming, whose complexity is $O(mn)$. Parallel implementations of this alignment algorithm are also possible. For further details, see [7] and [5].

On the other hand, if one were to examine two different genes with distinct functionalities but from within the same organism or within the same family of organisms, then the two genes would not be similar at a symbol-for-symbol level. However, it is widely believed that they would match at a statistical level. The idea behind statistical prediction of genes can be summarized as follows. Suppose we are given several known strings of genes and several known strings of non-genes.


We think of the known gene strings as sample paths of one stochastic process generated by a "coding model" and of the known non-gene strings as sample paths of another stochastic process generated by a "non-coding model." Now suppose we are given an ORF, and we wish to classify it as being a gene or a non-gene. The logical approach is to use "log-likelihood ratio" classification. Thus, we compute the two probabilities (or likelihoods) of the string, that is, the likelihood of the string according to the coding model and according to the non-coding model, respectively. If the likelihood under the coding model is larger, then we classify the ORF as a gene, whereas if the likelihood under the non-coding model is larger, then we classify it as a non-gene. In the unlikely event that both likelihoods are comparable, the method is inconclusive.
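The sketch below illustrates this log-likelihood ratio test for two fixed-memory Markov models. The dictionary representation of a model (mapping a length-k context and next symbol to a conditional probability) and the small probability floor for unseen transitions are illustrative assumptions, not the paper's implementation.

```python
import math

def log_likelihood(seq, model, k, floor=1e-6):
    """Log-probability of seq under a k-step Markov model. `model` maps
    (context, symbol) -> P(symbol | context), where context is the preceding
    k symbols; unseen pairs get a small floor to avoid log(0)."""
    total = 0.0
    for i in range(k, len(seq)):
        context, symbol = seq[i - k:i], seq[i]
        total += math.log(model.get((context, symbol), floor))
    return total

def classify_orf(seq, coding_model, noncoding_model, k=5, margin=0.0):
    """Log-likelihood ratio test: clearly positive -> gene,
    clearly negative -> non-gene, otherwise inconclusive."""
    ratio = (log_likelihood(seq, coding_model, k)
             - log_likelihood(seq, noncoding_model, k))
    if ratio > margin:
        return "gene", ratio
    if ratio < -margin:
        return "non-gene", ratio
    return "inconclusive", ratio
```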

The existing statistical gene prediction methods differ only in the manner in which the coding model and the non-coding model are constructed. In Genescan [18], the basic premise is that, in coding regions, the four nucleic acid symbols occur roughly with a period of three. Thus, there is no non-coding model as such. Instead, a discrete Fourier transform is taken of the occurrence of each nucleotide symbol, and the value is compared against a threshold. In Genscan [2], [3], the sample paths are viewed as outputs of a hidden Markov model that can also emit a "blank" symbol. Because of this, the length of the observed string does not match the length of the state sequence, thus leading to an extremely complicated dynamic programming problem. In GeneMark [12], the observed sequences are interpreted as the outputs of a multistep Markov process. Probably the most widely used classification method is Glimmer, which has several refinements to suit specific requirements. Some references to Glimmer can be found in [4] and [17]. We shall return to Glimmer and GeneMark again in Section V, when we present our computational results and compare them against those produced by Glimmer.

C. Contributions of This Paper

Glimmer uses an interpolated Markov model (IMM) whereby the sample paths are fit with multistep Markov models whose memory varies from 0 to 8. (Note that a Markov process with zero memory is an i.i.d. process.) This requires the estimation of a fairly large number of parameters. To overcome this difficulty, Glimmer uses high-order Markov models only when there is sufficient data to get a reliable estimate. Initially, GeneMark used a fifth-order Markov chain, but subsequent versions use refined versions, including a hidden Markov model (HMM). In contrast, the premise of our study is that, even in a multistep Markov process, different states have different effective memory. This leads to a "mixed memory Markov model." Hence, the algorithm is called 4M.

We begin by fitting the sample paths of both the coding regions and the non-coding regions with a fifth-order Markov model each. The reason for using a fifth-order model (thus exactly replicating hexamer frequencies) is that lower order models result in noticeably poorer performance, whereas higher order models do not seem to improve the performance very much. Using two fifth-order Markov models for the coding and non-coding regions results in two models having $4^5 = 1024$ states each. Then, by using a simple singular value condition, many of these states are combined into one common state. In some cases, the resulting reduced-size Markov model has as few as 150 states, which is an 85% reduction! In addition to the singular value test to choose the level of reduction permissible, we also use the Kullback-Leibler (K-L) divergence rate [9] to bound the error introduced by reducing the size of the state space. This upper bound on the K-L divergence rate can be used to choose a threshold parameter in the rank condition in an intelligent fashion. In addition, the K-L divergence rate is also used to demonstrate that the three-periodicity effect is very pronounced in coding regions but not in the non-coding regions. The statistical significance of the 4M algorithm is rather easy to analyze. As a result, when some ORFs are predicted to be genes using the 4M algorithm, our confidence in the prediction can be readily quantified, and the predictions can be ranked in order of decreasing confidence. In this way, the most confident predictions (if they are also interesting from a biological standpoint) can be followed up for experimental verification.

All in all, the conclusion is that the 4M algorithm performs comparably well or somewhat better than Glimmer and GeneMark; however, in the case of the 4M algorithm, it is quite easy and straightforward to compute the statistical significance of the conclusions drawn.

II. 4M ALGORITHM

A. Multistep Markov Models

Recall that the statistical approach to gene-finding depends on being able to construct two distinct models, one for the coding regions and one for the non-coding regions. From a purely mathematical standpoint, we have only one problem at hand, namely, given a set of sample paths of a stationary stochastic process, construct a model for these paths. In other words, the coding model and the non-coding model are constructed using exactly the same methodology, but applied to distinct sets of sample paths. Thus, let us concentrate on this problem formulation.

Suppose $n$ is a positive integer, and define the alphabet $\mathbb{A} = \{1, \ldots, n\}$. (In the case of genomics, $n = 4$.) Suppose $\{X_t\}$ is a stationary stochastic process assuming values in $\mathbb{A}$, and we have at hand several sample paths of this process. The objective is to construct a stochastic model for the process on the basis of these observations.

Suppose an integer $k$ is specified, and we know the statistics of the process up to order $k + 1$. This means that the probabilities of occurrence of all $(k+1)$-tuples are specified for the process $\{X_t\}$. Let $f_u$ denote the probability of occurrence of the string $u$. Thus, if $u$ is a string of length $l$, say $u = u_1 \cdots u_l$, then
$$ f_u = \Pr\{X_{t+1} = u_1, X_{t+2} = u_2, \ldots, X_{t+l} = u_l\}. $$
Since the process is stationary, the above probability is independent of $t$. Note that the frequencies must satisfy a set of "consistency conditions" as follows:
$$ \sum_{w \in \mathbb{A}} f_{uw} = f_u = \sum_{w \in \mathbb{A}} f_{wu} \quad \text{for every string } u. $$

There is a well-known procedure that perfectly reproduces the specified statistics by modeling the given process as a $k$-step Markov process.


For brevity, let us use the notation $f_{u_1 \cdots u_l}$ to mean $f_u$ when $u = u_1 \cdots u_l$, and so on. Assuming that the process is a $k$-step Markov process means that, if $u = u_1 \cdots u_l$ is a string of length $l$ larger than $k$, then
$$ \Pr\{X_{t+1} = u_l \mid X_t = u_{l-1}, \ldots, X_{t-l+2} = u_1\} = \Pr\{X_{t+1} = u_l \mid X_t = u_{l-1}, \ldots, X_{t-k+1} = u_{l-k}\}. $$
In short, it is assumed that this conditional probability is not affected by the values of the symbols more than $k$ steps in the past. Moreover, the transition probability of this multistep Markov process is computed as
$$ \Pr\{X_{t+1} = w \mid X_t = u_k, \ldots, X_{t-k+1} = u_1\} = \frac{f_{u_1 \cdots u_k w}}{f_{u_1 \cdots u_k}}. $$

The above model, though it is often called a $k$-step Markov model, is also a traditional (one-step) Markov model over the larger state space $\mathbb{A}^k$. Suppose $u = u_1 \cdots u_k$ and $v = v_1 \cdots v_k$ are two states. Then a transition from $u$ to $v$ is possible only if the last $k-1$ symbols of $u$ (read from left to right) are the same as the first $k-1$ symbols of $v$; in other words, it must be the case that
$$ v = u_2 \cdots u_k w \quad \text{for some } w \in \mathbb{A}. $$
In this case, the probability of transition from the state $u$ to the state $v$ is given by
$$ a_{uv} = \frac{f_{u_1 \cdots u_k w}}{f_{u_1 \cdots u_k}}. \tag{1} $$
For all other $v$, the transition probability equals zero. It is clear that, though the state transition matrix has dimension $n^k \times n^k$, every row contains at most $n$ nonzero entries. Such a $k$-step Markov model perfectly reproduces the $l$-tuple frequencies for all $l \leq k + 1$. Given a long string $u = u_1 \cdots u_l$ of length $l > k$, we can write
$$ f_u = f_{u_1 \cdots u_k} \prod_{i=k+1}^{l} \frac{f_{u_{i-k} \cdots u_i}}{f_{u_{i-k} \cdots u_{i-1}}}. $$
As a result,
$$ \log f_u = \log f_{u_1 \cdots u_k} + \sum_{i=k+1}^{l} \log \frac{f_{u_{i-k} \cdots u_i}}{f_{u_{i-k} \cdots u_{i-1}}}. $$
Now, in the above summation, all numbers are of reasonable size.

To round out the discussion, suppose that the statistics of the process are not known precisely, but need to be inferred on the basis of observing a set of sample paths. This is exactly the problem we have at hand. In this case, one can still apply (1), but with the actual (but unknown) probabilities in the numerator and the denominator replaced by their empirically observed frequencies. Each of these gives an unbiased estimate of the corresponding probability.
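A minimal sketch of this estimation step is given below; it produces the dictionary model representation used in the earlier classification sketch, and that representation is itself an assumption made for illustration, not the paper's code.

```python
from collections import defaultdict

def train_markov_model(sequences, k=5):
    """Estimate a k-step Markov model from sample paths: empirical counts of
    k-mers and (k-mer, next symbol) pairs give the conditional probabilities
    in (1), with unknown probabilities replaced by observed frequencies."""
    context_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    for seq in sequences:
        for i in range(k, len(seq)):
            context, symbol = seq[i - k:i], seq[i]
            context_counts[context] += 1
            pair_counts[(context, symbol)] += 1
    return {(context, symbol): count / context_counts[context]
            for (context, symbol), count in pair_counts.items()}
```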

B. 4M Algorithm

Here, we introduce the basic idea behind the 4M algorithm and then present the algorithm itself. We begin with a multistep Markov model and then reduce the size of the state space further by using a criterion for determining whether some states are "Markovian."

The basis for the 4M algorithm is a simple property of Markov processes. Consider a Markov chain $\{X_t\}$ evolving over a finite alphabet $\mathbb{A} = \{1, \ldots, n\}$. Let $u, v, w \in \mathbb{A}$ and consider the frequency of the triplet $uvw$. Clearly, for any process (Markovian or not), we have
$$ f_{uvw} = \Pr\{X_{t+2} = w \mid X_{t+1} = v, X_t = u\} \cdot f_{uv}. $$
Note that in the above formula we simplify notation by writing $f_{uv}$ for the frequency of the doublet $(u, v)$, $f_{uvw}$ for the frequency of the triplet $(u, v, w)$, and so on. Now, if the process is Markovian, then we have
$$ f_{uvw} = \Pr\{X_{t+2} = w \mid X_{t+1} = v\} \cdot f_{uv} = \frac{f_{vw}}{f_v}\, f_{uv}. $$
Hence, if we examine the matrix
$$ F_v = [f_{uvw}], \qquad u \in \mathbb{A} \text{ (rows)},\ w \in \mathbb{A} \text{ (columns)} $$
it will have rank one. This is because, with $v$ fixed, it looks like
$$ F_v = \begin{bmatrix} f_{1v1} & \cdots & f_{1vn} \\ \vdots & & \vdots \\ f_{nv1} & \cdots & f_{nvn} \end{bmatrix} = \frac{1}{f_v} \begin{bmatrix} f_{1v} \\ \vdots \\ f_{nv} \end{bmatrix} \begin{bmatrix} f_{v1} & \cdots & f_{vn} \end{bmatrix}. $$

There is nothing special about using only a single symbol $v$. Suppose $v$ is a string of finite length, denoted as usual by $v = v_1 \cdots v_l$. Then, just as above, we have that
$$ f_{uvw} = \Pr\{\text{the next symbol is } w \mid \text{the preceding symbols are } uv\} \cdot f_{uv}. $$
Thus, if we fix an integer $j$ and examine the matrix
$$ F_v = [f_{uvw}], \qquad u \in \mathbb{A}^j,\ w \in \mathbb{A} $$
then, when the process is Markovian, $F_v$ again has rank one. Conversely, if the semi-infinite matrix $[f_{uvw}]$, with $u$ ranging over all finite strings and $w \in \mathbb{A}$, has rank one for every $v$, then the process is Markovian.

Now suppose we drop the assumption that the process is Markovian, and suppose the matrix $F_v$ has rank one for a particular fixed $v$. An elementary exercise in linear algebra shows that, in such a case, we must have
$$ \frac{f_{uvw}}{f_{uv}} = \frac{f_{vw}}{f_v} \quad \text{for all } u, w. \tag{2} $$
This follows by reversing the above reasoning. Accordingly, let us define a state $v$ to be Markovian of order $l$ (where $l$ is the length of $v$) if $F_v$ has rank one. The distinction here is that we are now speaking about an individual state being Markovian as opposed to the entire process.


The above rank one property and definition can also be extended to multistep Markov processes. Suppose $\{X_t\}$ is an $s$-step Markov process. This means that, for each fixed $v \in \mathbb{A}^s$, we have
$$ \frac{f_{uvw}}{f_{uv}} = \frac{f_{vw}}{f_v} \quad \text{for all strings } u \text{ and all } w \in \mathbb{A}. $$
Hence, for each fixed $v \in \mathbb{A}^s$, the matrix
$$ F_v = [f_{uvw}], \qquad u \in \mathbb{A}^j,\ w \in \mathbb{A} $$
has rank one. Following earlier reasoning, we can define a state $v$ of length $l$ to be a Markovian state of order $l$ if the matrix $F_v = [f_{uvw}]$ has rank one.

Let us now consolidate all of this discussion and apply it to reducing the size of the state space of a multistep Markov process. Suppose, as before, that $\{X_t\}$ is a stationary stochastic process assuming values in a finite set $\mathbb{A}$. We have already seen that, in order to reproduce perfectly the $(k+1)$-tuple frequencies, it suffices to construct a $k$-step Markov model. Suppose that such a multistep model has indeed been constructed. This model makes use of the $(k+1)$-tuple frequencies $f_{uw}$, $u \in \mathbb{A}^k$, $w \in \mathbb{A}$. Now suppose that, for some integer $l < k$ and some string $v \in \mathbb{A}^l$, it is the case that the matrix
$$ F_v = [f_{uvw}], \qquad u \in \mathbb{A}^{k-l},\ w \in \mathbb{A} \tag{3} $$
has rank one. By elementary linear algebra, this implies that
$$ \frac{f_{uvw}}{f_{uv}} = \frac{f_{vw}}{f_v} \quad \text{for all } u \in \mathbb{A}^{k-l},\ w \in \mathbb{A}. \tag{4} $$
In other words, the conditional probability of finding a symbol at a particular location depends only on the preceding $l$ symbols $v$ and not on the symbols that precede $v$. Hence, if $F_v$ has rank one, then we can "collapse" all states of the form $uv$ for all $u \in \mathbb{A}^{k-l}$ into a single state $v$. For this reason, we call $v$ a Markovian state if $F_v$ has rank one. The interpretation of $v$ being a Markovian state is that, when this string occurs, the process has a "memory" of only $l$ time steps and not $k$ in general.

To implement the reduction in state space, we therefore proceed as follows.
Step 1) Compute the vector of $(k+1)$-tuple frequencies.
Step 2) Set $l = 1$ and, for each $v \in \mathbb{A}^l$, compute the matrix $F_v$. If $F_v$ has rank one, then collapse all states of the form $uv$ for all $u \in \mathbb{A}^{k-l}$ into a single state $v$. Repeat this test for all $v \in \mathbb{A}^l$.
Step 3) Increase the value of $l$ by one and repeat until $l = k - 1$.

When the search process is complete, the initial set of $n^k$ states will have been collapsed into some intermediate number, whose value depends on the $(k+1)$-tuple frequencies. Since we are modeling the process as a $k$-step Markov process, in general, for a string $u = u_1 \cdots u_l$ of length $l > k$, we can write
$$ \Pr\{X_{t+1} = u_l \mid X_t = u_{l-1}, \ldots\} = \frac{f_{u_{l-k} \cdots u_l}}{f_{u_{l-k} \cdots u_{l-1}}}. $$
Now, if a suffix of the conditioning string is Markovian, say the string $v = u_{l-j} \cdots u_{l-1}$ of length $j < k$, then we can take advantage of (3) and make the substitution
$$ \frac{f_{u_{l-k} \cdots u_l}}{f_{u_{l-k} \cdots u_{l-1}}} \;\longrightarrow\; \frac{f_{u_{l-j} \cdots u_l}}{f_{u_{l-j} \cdots u_{l-1}}} $$
in the above formula. This is the reason for calling the algorithm a "mixed memory Markov model," since different $k$-tuples have memories of different lengths.

The preceding theory is "exact" provided we use true probabilities in the various computations. However, in setting up the multistep Markov model, we are using empirically observed frequencies and not true probabilities. Hence, it is extremely unlikely that any matrix $F_v$ will exactly have rank one. At this point, we take advantage of the fact that we wish to do classification and not modeling. This means that, in constructing the coding model and the non-coding model, it is really not necessary to get the likelihoods exactly right; it is sufficient for them to be of the right order of magnitude. Hence, in implementing the 4M algorithm, we take a matrix $F_v$ as having "effective rank one" if it satisfies the condition
$$ \sigma_2(F_v) \leq \epsilon\, \sigma_1(F_v) \tag{5} $$
where $\sigma_1(F_v)$ and $\sigma_2(F_v)$ denote the largest two singular values of the matrix $F_v$, and $\epsilon$ is an adjustable threshold parameter. We point out in Section VI that, by using the K-L divergence rate between Markov models, it is possible to choose the threshold $\epsilon$ "intelligently." Setting a state $v$ to be a Markovian state even if $F_v$ is not exactly a rank one matrix is equivalent to making the approximation
$$ \frac{f_{uvw}}{f_{uv}} \approx \frac{f_{vw}}{f_v}. \tag{6} $$
In other words, in the original $k$-step Markov model, the entries in the rows corresponding to all states of the form $uv$ are modified according to (6).
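A sketch of the effective-rank-one test (5) and the resulting search for Markovian states, using NumPy singular values; the frequency-dictionary input, the default threshold, and the function names are illustrative assumptions rather than the paper's code.

```python
import itertools
import numpy as np

ALPHABET = "ACGT"

def build_F(freqs, v, k):
    """Matrix F_v = [f_{uvw}]: rows indexed by prefixes u of length k - len(v),
    columns by single symbols w. `freqs` maps (k+1)-mers to probabilities."""
    prefixes = ["".join(p) for p in
                itertools.product(ALPHABET, repeat=k - len(v))]
    return np.array([[freqs.get(u + v + w, 0.0) for w in ALPHABET]
                     for u in prefixes])

def is_effectively_rank_one(F, eps):
    """Condition (5): second-largest singular value at most eps times the largest."""
    s = np.linalg.svd(F, compute_uv=False)
    return s[0] > 0 and s[1] <= eps * s[0]

def markovian_states(freqs, k=5, eps=0.05):
    """Collect strings v of length 1 .. k-1 whose matrix F_v passes the test,
    i.e., states whose memory can be shortened to len(v). eps is illustrative."""
    found = []
    for l in range(1, k):
        for v in itertools.product(ALPHABET, repeat=l):
            v = "".join(v)
            if is_effectively_rank_one(build_F(freqs, v, k), eps):
                found.append(v)
    return found
```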

III. K-L DIVERGENCE RATE

Here, we introduce the notion of the K-L divergence rate between stochastic processes and its applications to Markov chains. Then, we derive an expression for the K-L divergence rate between the original $k$-step Markov model and the 4M-reduced model. This formula is of interest because the two processes have a common output space, but not a common state space. These results are applied to some problems in genomics in Section IV.

A. K-L Divergence

Let $n$ be an integer, and let $\mathbb{S}_n$ denote the $n$-simplex, namely,
$$ \mathbb{S}_n = \Big\{ p \in \mathbb{R}^n : p_i \geq 0\ \forall i,\ \sum_{i=1}^{n} p_i = 1 \Big\}. $$
Thus, $\mathbb{S}_n$ is just the set of probability distributions for an $n$-valued random variable. Suppose $p, q \in \mathbb{S}_n$ are two such probability distributions.


Then the K-L divergence between the two vectors is defined as
$$ D(p \| q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}. \tag{7} $$
Note that, in order for $D(p \| q)$ to be finite, $p$ needs to be dominated by $q$, that is, $q_i = 0$ must imply $p_i = 0$. We write $p \ll q$ or $q \gg p$ to denote that $p$ is dominated by $q$ or that $q$ dominates $p$. Here, we adopt the usual convention that $0 \log (0/0) = 0$.

The K-L divergence has several possible interpretations, of which only one is given here. Suppose we are given data generated by an i.i.d. sequence whose one-dimensional marginal distribution is $r$. There are two competing hypotheses, namely, that the probability distribution is $p$ and that the probability distribution is $q$, neither of which may be "the truth" $r$. If we observe a sequence $x_1, \ldots, x_m$, where $m$ is the length of the observation and each $x_t$ has one of the $n$ possible values, we compute the likelihood of the observation under each of the two hypotheses and choose the more likely one, that is, the hypothesis that is more compatible with the observed data. In this case, it is easy to show that the expected value of the per-symbol log-likelihood ratio is precisely equal to $D(r \| q) - D(r \| p)$. Thus, in the long run, we will choose the hypothesis $p$ if $D(r \| p) < D(r \| q)$ and the hypothesis $q$ if $D(r \| q) < D(r \| p)$. In other words, in the long run, we will choose the hypothesis that is "closer" to the "truth" $r$. Therefore, even though the K-L divergence is not truly a distance (it does not satisfy either the symmetry property or the triangle inequality), it does induce a partial ordering on the set $\mathbb{S}_n$. The difference $D(r \| q) - D(r \| p)$ is the per-symbol contribution to the log-likelihood ratio. As a parenthetical aside, see [11] for a very general discussion of divergences generated by an arbitrary convex function. For an appropriate choice of the convex function, the corresponding divergence will in fact satisfy a one-sided triangle inequality. However, the popular choice that yields the K-L divergence is not one such function.
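A small sketch of (7) and of the hypothesis-selection interpretation just described; the three distributions below are made-up numbers for illustration.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_i p_i log2(p_i / q_i), with the convention 0 log(0/0) = 0.
    Returns infinity if p is not dominated by q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total

# Truth r and two hypotheses p, q over a four-letter alphabet (illustrative).
r = [0.40, 0.30, 0.20, 0.10]
p = [0.35, 0.30, 0.20, 0.15]
q = [0.25, 0.25, 0.25, 0.25]
# In the long run, the hypothesis with the smaller divergence from r is chosen.
print(kl_divergence(r, p), kl_divergence(r, q))
```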

B. K-L Divergence Rate

The traditional K-L divergence measure is perfectly fine when the classification problem involves a sequence of independent observations. However, in trying to model a stochastic process via observing it, it is not always natural to assume that the observations are independent. It is therefore desirable to have a generalization of the K-L divergence to the case where the samples may be dependent. Such a generalization is given by the K-L divergence rate.

Suppose $\mathbb{A}$ is some set, and $\{X_t\}$ is a stochastic process assuming values in the set $\mathbb{A}$. Thus, the stochastic process itself assumes values in the infinite Cartesian product space $\mathbb{A}^\infty$. Suppose $P, Q$ are two probability laws, that is, probability measures on the product space $\mathbb{A}^\infty$. In principle, we could define the K-L divergence $D(P \| Q)$ between the two laws by extending the standard definition, using Radon-Nikodym derivatives and so on. The trouble is that most of the time the divergence would be infinite and would convey no useful information. Thus, blindly computing the divergence between the two laws of a stochastic process gives no useful information most of the time.

To get around this difficulty, it is better to use the K-L divergence rate. It appears that the K-L divergence rate was introduced in [9]. If $P$ and $Q$ are two probability laws on $\mathbb{A}^\infty$ and if $\mathbb{A}$ is a finite set, we define
$$ \bar{D}(P \| Q) = \lim_{m \to \infty} \frac{1}{m}\, D(P_m \| Q_m) \tag{8} $$
where $P_m$ and $Q_m$ are the marginal distributions of $P$ and $Q$, respectively, onto the $m$-dimensional product $\mathbb{A}^m$, and $D(\cdot \| \cdot)$ is just the conventional K-L divergence (without the rate). The idea is that, in many cases, the "pure" K-L divergence $D(P_m \| Q_m)$ approaches infinity as $m \to \infty$. However, dividing by $m$ moderates the rate of growth. Moreover, if the ratio has a finite limit as $m \to \infty$, then the K-L divergence rate gives a measure of the asymptotic rate at which the "pure" divergence blows up as $m \to \infty$.

The K-L divergence rate has essentially the same interpretation as the K-L divergence. Suppose we are observing a stochastic process whose law is $\mu$. We are trying to decide between two competing hypotheses: the process has the law $P$, and the process has the law $Q$. After $m$ samples, the expected value of the log-likelihood ratio is asymptotically equal to $m\,[\bar{D}(\mu \| Q) - \bar{D}(\mu \| P)]$.

The paper [16] gives a good historical overview of the properties of the K-L divergence rate. Specifically, in general the K-L divergence rate may not exist between arbitrary probability measures, but it seems to exist under many reasonable conditions. For example, it is known [6] that, if $P$ is a stationary law and $Q$ is the law of a finite-state Markov process, then the K-L divergence rate is well defined. It is shown in [14] that the K-L divergence rate exists if both laws correspond to ergodic processes.

C. K-L Divergence Rate Between Markov Processes

In [16], an explicit formula is given for the K-L divergence rate between two Markov processes over a common (finite) state space. We give an alternate version of the formula derived in [16], which generalizes very cleanly to multistep Markov processes.

Suppose $P$ and $Q$ are the laws of two Markov processes over a finite set $\mathbb{A} = \{1, \ldots, n\}$. Thus,
$$ A = [a_{ij}], \qquad B = [b_{ij}] $$
are stochastic matrices, and $\pi, \rho$ are corresponding stationary vectors. Thus, $\pi A = \pi$ and $\rho B = \rho$. If $P$ is the law of the Markov process $(A, \pi)$ and $Q$ is the law of the Markov process $(B, \rho)$, then it is shown in [16] that
$$ \bar{D}(P \| Q) = \sum_{i=1}^{n} \pi_i\, D(a^i \| b^i) \tag{9} $$
where $a^i, b^i$ denote the $i$th rows of the matrices $A$ and $B$, respectively. In order for the divergence rate to be finite, the two state transition matrices $A$ and $B$ must satisfy the condition
$$ a_{ij} > 0 \Rightarrow b_{ij} > 0 \quad \text{for all } i, j $$
or, in the earlier notation, we must have $a^i \ll b^i$ for all $i$. We denote this condition by $A \ll B$ or $B \gg A$.


Now we give an alternate formulation of (9) that is in some sense a little more intuitive.

Theorem 1: Suppose $A, B$ are stochastic matrices, and let $\pi, \rho$ denote associated stationary probability distributions. Thus, $\pi A = \pi$ and $\rho B = \rho$. Let $P$ denote the law of the Markov process $(A, \pi)$ and let $Q$ denote the law of the Markov process $(B, \rho)$. Let $\phi_A$ denote the frequency vector of doublets $(i, j)$, $i, j \in \mathbb{A}$, under the Markov chain $(A, \pi)$, so that the $(i, j)$ entry of $\phi_A$ equals $\pi_i a_{ij}$. Similarly, let $\phi_B$ denote the frequency vector of doublets $(i, j)$ under the Markov chain $(B, \rho)$. Suppose $\phi_A \ll \phi_B$. Then, the K-L divergence rate between the Markov chains is given by
$$ \bar{D}(P \| Q) = D(\phi_A \| \phi_B) - D(\pi \| \rho) \tag{10} $$
where $D(\cdot \| \cdot)$ is the conventional K-L divergence between probability vectors.

1) Remarks: Formula (10) gives a nice interpretation of the K-L divergence rate between Markov chains: it is just the difference between the divergence of the doublet frequencies and the divergence of the singlet frequencies. Moreover, it is easy to extend it to $k$-step Markov chains. The K-L divergence rate is just the difference between the divergence of the $(k+1)$-tuple frequencies and that of the $k$-tuple frequencies.

The proof is omitted in the interests of brevity. It can be found in [21].

In [16], the authors do not give an explicit formula for the K-L divergence rate between multistep Markov chains. There is an analogous formula to (10) in the case of $k$-step Markov models. Define $\phi_A^{(k+1)}$ and $\phi_A^{(k)}$ to be the frequency vectors of $(k+1)$-tuples and $k$-tuples for the first Markov chain, and define the symbols $\phi_B^{(k+1)}, \phi_B^{(k)}$ in the obvious fashion. Then
$$ \bar{D}(P \| Q) = D\big(\phi_A^{(k+1)} \,\big\|\, \phi_B^{(k+1)}\big) - D\big(\phi_A^{(k)} \,\big\|\, \phi_B^{(k)}\big). \tag{11} $$
The proof, based on the fact that a $k$-step Markov model is just a one-step Markov model on $\mathbb{A}^k$, is easy and is left to the reader.
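A sketch of (11) computed from empirical tuple frequencies of two sets of sample paths; the dictionary representation and the use of observed (rather than exact) frequencies are illustrative assumptions.

```python
import math

def tuple_freqs(sequences, length):
    """Empirical frequencies of all substrings of the given length."""
    counts, total = {}, 0
    for seq in sequences:
        for i in range(len(seq) - length + 1):
            t = seq[i:i + length]
            counts[t] = counts.get(t, 0) + 1
            total += 1
    return {t: c / total for t, c in counts.items()} if total else {}

def kl(p, q):
    """K-L divergence between two frequency dictionaries (base-2 logs)."""
    total = 0.0
    for key, pv in p.items():
        if pv > 0.0:
            qv = q.get(key, 0.0)
            if qv == 0.0:
                return math.inf
            total += pv * math.log2(pv / qv)
    return total

def divergence_rate(seqs_a, seqs_b, k=5):
    """Formula (11): divergence of (k+1)-tuple frequencies minus divergence
    of k-tuple frequencies, for two k-step Markov models."""
    return (kl(tuple_freqs(seqs_a, k + 1), tuple_freqs(seqs_b, k + 1))
            - kl(tuple_freqs(seqs_a, k), tuple_freqs(seqs_b, k)))
```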

D. K-L Divergence Rate When the 4M Algorithm is Used

Theorem 1 gives the K-L divergence rate between two Markov processes over the same state space. In this paper, we begin with a $k$-step Markov process and approximate it by some other Markov process by applying the 4M algorithm. When we do so, the resulting processes no longer share a common state space. Thus, Theorem 1 no longer applies. The next theorem gives a formula for the K-L divergence rate when the 4M algorithm is used to achieve this reduction. Note that the problem of computing the K-L divergence rate between two entirely arbitrary HMMs with a common output space is still an open problem.

Theorem 2: Suppose $\{X_t\}$ is a stationary stochastic process and that the frequencies of all $(k+1)$-tuples are specified. Let $P$ denote the approximation of $\{X_t\}$ by a $k$-step Markov process. Suppose now that we apply the 4M algorithm and choose various tuples as "Markovian states." Let $v_1, \ldots, v_r$ denote the Markovian states and let $l_i$ denote the length of the Markovian state $v_i$. Finally, let $Q$ denote the resulting stochastic process. Then, the K-L divergence rate between the original $k$-step Markov model and the 4M-reduced Markov model (or equivalently between the laws of the processes $P$ and $Q$) is given by
$$ \bar{D}(P \| Q) = \sum_{i=1}^{r} \sum_{u \in \mathbb{A}^{k - l_i}} \sum_{w \in \mathbb{A}} f_{u v_i w} \log \frac{f_{u v_i w} / f_{u v_i}}{f_{v_i w} / f_{v_i}}. \tag{12} $$
Proof: Note that the full $k$th-order Markov model has exactly $n$ nonzero entries in each row labeled by $u \in \mathbb{A}^k$, and these entries are $f_{uw}/f_u$ as $w$ varies over $\mathbb{A}$. One can think of the reduced-order model obtained by the 4M algorithm as containing the same rows, except that, if $v_i$ is a Markovian state, then the entries in all rows of the form $u v_i$ are changed from $f_{u v_i w}/f_{u v_i}$ to $f_{v_i w}/f_{v_i}$. The vector of $k$-tuple frequencies $(f_u)_{u \in \mathbb{A}^k}$ is a stationary distribution of the original $k$th-order Markov model. Now (12) readily follows from (11).

Note that, in applying the 4M algorithm, we approximate the ratio $f_{u v_i w}/f_{u v_i}$ by the ratio $f_{v_i w}/f_{v_i}$ for each string $v_i$ that is deemed to be a Markovian state. Hence, the quantity inside the logarithm in (12) should be quite close to one, and its logarithm should be close to zero.

IV. COMPUTATIONAL RESULTS—I: APPLICATIONS OF THE K-L DIVERGENCE RATE

This section contains the first set of computational results. Here, we study the three-periodicity of coding regions using the K-L divergence rate. The same K-L divergence rate is also used to show that there is virtually no three-periodicity effect in the non-coding regions. Then, we analyze the effect of reducing the size of the state space using the 4M algorithm in terms of the generalization error.

A. List of Organisms Analyzed

The 4M algorithm was applied to 75 prokaryotic genomes of microbial organisms. These genomes comprised both bacteria as well as archaea. To save space, in the tables showing the computational results we give the names of the various organisms in a highly abbreviated form. Table V gives a list of all the organisms for which the computational results are presented here, together with the abbreviations used.

B. Three-Periodicity of Coding Regions

There is an important feature that needs to be built into any stochastic model of genome sequences. It has been observed that, if one were to treat the coding regions as sample paths of a stationary Markov process, then the results are pretty poor. The reason is that genomic sequences exhibit a pronounced three-periodicity. This means that the conditional probability
$$ \Pr\{X_{t+1} = w \mid X_t = u_k, \ldots, X_{t-k+1} = u_1\} $$
is not independent of $t$, but is instead periodic in $t$ with a period of three. Thus, instead of constructing one $k$-step Markov model, we must in fact construct three such models. These are referred to as the Frame 0, Frame 1, and Frame 2 models. We begin from the start of the genome and label the first nucleotide as Frame 0, the second nucleotide as Frame 1, the third nucleotide as Frame 2, and loop back to label the fourth nucleotide as Frame 0, and so on. Then the 4M reduction using the rank condition is applied to each frame. Since a three-periodic Markov chain over a state space $S$ can also be written as a stationary Markov chain over the enlarged state space $S \times \{0, 1, 2\}$ (by appending the frame index to the state), there are no conceptual difficulties because of three-periodicity.
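A small sketch of how the three frame-specific models might be trained; the convention that a position's frame is its index modulo three within each training sequence, and the dictionary model representation, are illustrative assumptions rather than the paper's code.

```python
from collections import defaultdict

def train_three_periodic_models(coding_sequences, k=5):
    """Fit one k-step Markov model per frame (Frame 0, 1, 2), where the frame
    of a predicted position is its index modulo 3 from the start of the
    training sequence."""
    context_counts = [defaultdict(int) for _ in range(3)]
    pair_counts = [defaultdict(int) for _ in range(3)]
    for seq in coding_sequences:
        for i in range(k, len(seq)):
            frame = i % 3
            context, symbol = seq[i - k:i], seq[i]
            context_counts[frame][context] += 1
            pair_counts[frame][(context, symbol)] += 1
    return [{pair: c / context_counts[frame][pair[0]]
             for pair, c in pair_counts[frame].items()}
            for frame in range(3)]
```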


TABLE I: DIVERGENCES BETWEEN MARKOV MODELS OF CODING REGIONS


In this subsection, we use the K-L divergence rate introduced in Section III to assess the significance of three-periodicity in both coding and non-coding regions in various organisms. The study is carried out as follows. In 13 organisms, we constructed three-periodic models for the known coding regions as well as the known non-coding regions, using the value $k = 5$. This means that for each organism we constructed six different fifth-order Markov models that perfectly reproduced the observed hexamer frequencies (three coding-region models and three non-coding-region models, one per frame). For the three coding-region models, we computed six different divergence rates, namely the divergence rate between every ordered pair of distinct frame models. Then, we did the same for the three non-coding-region models. (Remember that the K-L divergence rate is not symmetric.) Tables I and II show these six divergence rates for 13 of the organisms listed in Section IV-A, for both coding regions as well as non-coding regions. Actually, we computed such divergences for all 75 organisms, but only 13 are presented here. To make the tables fit within the two-column format, we use an obvious abbreviated notation for the frame models of the coding or non-coding regions, as appropriate. Note that, throughout, the base of the logarithm used in (12) is 2; thus, all logarithms are binary logarithms.

From Tables I and II, it is clear that the three-periodicity effect in the non-coding regions is noticeably less than in the coding regions, in the sense that the K-L divergence rates between the three frames of the non-coding models are essentially negligible compared with the corresponding divergence rates in the coding regions. In fact, except for Mycoplasma genitalium, the divergences in the non-coding regions are essentially negligible. M. genitalium is a peculiar organism, in which the codon TGA codes for the amino acid tryptophan, instead of being a stop codon as it is in practically all other organisms. Thus, when we construct algorithms for predicting genes, we would be justified in ignoring the three-periodicity effect in the non-coding regions.

TABLE II: DIVERGENCES BETWEEN MARKOV MODELS OF NON-CODING REGIONS

TABLE III: SIZES OF 4M-REDUCED MARKOV MODELS

C. Reduction in Size of State Space

Here, we study the reduction in the size of the state space when the 4M algorithm is used. In applying the 4M algorithm, we used the value $k = 5$. It has been verified numerically that smaller values of $k$ do not give good predictions, while larger values of $k$ do not lead to any improvement in performance. Thus, $k = 5$ seems to be the right value. Hence, for each organism, we constructed three coding region models and one non-coding region model, each model being fifth-order Markovian. Recall that, due to the three-periodicity of the coding regions, we need three models, one for each frame in the coding region. Since the non-coding region does not show so much of a three-periodicity effect (as demonstrated in the preceding subsection), we ignore that possibility and construct just one model for the non-coding region. Each of the fifth-order Markovian models has $4^5 = 1024$ states, consisting of pentamers of nucleotides. Then, for each of these four models, we applied the 4M reduction, with the threshold $\epsilon$ in (5) set at a somewhat arbitrary fixed value. A more systematic way to choose $\epsilon$ is given in Section VI. Recall that the larger the threshold $\epsilon$, the larger the number of states that will satisfy the "near rank one" condition (5), and the greater the reduction in the size of the state space.

The CPU time for computing the hexamer frequencies of a genome with about one million base pairs is approximately 10 s on an Intel Pentium IV processor running at 2.8 GHz, while the state space reduction takes just 0.3 s, or about 3% of the time needed to compute the frequencies. Thus, once the hexamer frequencies are constructed, the extra effort needed to apply the 4M algorithm is negligible.


TABLE IV: RESULTS OF WHOLE GENOME ANNOTATION USING THE 4M ALGORITHM USING 50% TRAINING DATA


Table III shows the size of the 4M-reduced state space for each of the 13 organisms studied. All of the numbers in the table should be compared with 1024, which is the number of states of the full fifth-order Markov chain. Moreover, in Glimmer and its variants, one uses up to eighth-order Markov models for certain organisms, meaning that in the worst case the size of the state space could be as high as $4^8 = 65{,}536$. From this table, it is clear that in most cases the 4M algorithm leads to a fairly significant reduction in the size of the state spaces. There are some dramatic reductions, such as in the case of B. sub, for which the reduction in the size of the state space is of the order of 85%. Moreover, in almost all cases, the size of the state space is reduced by at least 50%.

V. COMPUTATIONAL RESULTS—II: GENE PREDICTION

Now we come to the main topic of this paper, namely, finding genes. One can identify two distinct philosophies in gene-prediction algorithms. Some algorithms, including the one presented here, can be described as "bootstrapping." Thus, we begin with some known genes, construct a stochastic model based on those, and then use that model to classify the remaining ORFs as potential genes. The most promising predictions are then validated either through experiment or through comparison with known genes of other similar organisms. The validated genes are added to the training sample and the process is repeated. This is why the process may be called bootstrapping. In contrast, Glimmer (in its several variants), which is among the most popular and most accurate prediction algorithms at present, can be described as an ab initio scheme. In Glimmer, all ORFs longer than 500 base pairs are used as the training set for the coding regions. The premise is that almost all of these are likely to be genes anyway. In principle, we could apply the 4M algorithm with the same initial training set and the results would not be too different from those presented here.

For a genome with one million base pairs, the 4M algorithm required approximately 10 s of CPU time and approximately 5 Mb of storage for training the coding and non-coding models, compared with 10 s of CPU time and 50 Mb of storage for Glimmer3. The prediction problem took about 60 s of CPU time and 20 Mb of storage for 4M versus 13 s of CPU time and 4 Mb of storage for Glimmer3. Our implementation of the 4M algorithm was done using Python, which is very efficient for the programmer but very inefficient in terms of CPU time. Thus, we believe that there is considerable scope for reduction in both the CPU time as well as the storage requirements when implementing the 4M algorithm.

TABLE V: ABBREVIATIONS OF ORGANISM NAMES AND COMPARISON OF 4M VERSUS GLIMMER3



TABLE VI: COMPARISON OF 4M ALGORITHM VERSUS GENEMARK2.5D AND GENEMARKHMM2.6g



A. Classification of Annotated Genes Using 4M and Other Methods: Comparison of Results

Here, we take the database of "annotated" genes for each of 75 organisms and classify them using 4M, Glimmer3, GeneMark2.5d, and GeneMarkHMM2.6g. The database of "annotated" genes represents a kind of consensus. Some of the genes in the database are experimentally validated, while others are sufficiently similar at a symbol-for-symbol level to other known genes that these too are believed to be genes. Thus, it is essential that any algorithm should pick up most if not all of these annotated genes.

The test was conducted as follows. To construct the coding and non-coding models for the 4M algorithm, we took some known genes to train the coding model and some known non-coding regions to train the non-coding model. The fraction of known genes used to train the coding model was 50%, that is, we used every other gene. For the non-coding model, we picked around 20% of the known non-coding regions at random. Throughout, we used a three-periodic model for the coding regions and a "uniform" (i.e., nonperiodic) model for the non-coding regions. These models were then 4M-reduced using the threshold $\epsilon$. Then the remaining (known) coding and non-coding regions were classified using the log-likelihood method.

In the tables, we have used the following notation.
• Total Genes denotes the total number of genes in the annotated database.
• 4M & Gl denotes the genes picked up by both the 4M algorithm and Glimmer3.
• $\overline{\text{4M}}$ & $\overline{\text{Gl}}$ denotes the genes missed by both algorithms.
• 4M & $\overline{\text{Gl}}$ denotes the genes picked up by the 4M algorithm but missed by Glimmer3.
• $\overline{\text{4M}}$ & Gl denotes the genes missed by the 4M algorithm but picked up by Glimmer3.
Similar notation is used in Table VI, with Glimmer3 replaced by GeneMark2.5d (denoted by GMK) and GeneMarkHMM (denoted by GHMM).

First, we compare the performance of the 4M algorithm against that of Glimmer3, as detailed in Table V. It is worth pointing out that, in the results presented here, there is no "postprocessing" of the raw output of the 4M algorithm, as is common with other algorithms. The key points of comparison are the numbers in the next-to-last and last columns. From this table, it can be seen that, except in the case of seven organisms (B. jap, C. vio, G. vio, the three members of the Pseudomonas family, and R. etl), 4M finds at least as many genes as it misses, compared with Glimmer3. On the other side, in 32 organisms, the number of annotated genes found by 4M and missed by Glimmer3 is more than double the number missed by 4M and found by Glimmer3.

Next, a glance at Table VI reveals that 4M overwhelmingly outperforms GeneMark2.5d, which is an older algorithm based on modeling the genes using a fifth-order Markov model. Since 4M is also based on a fifth-order Markov model but with some reduction in the size of the state space, the vastly superior performance of 4M is intriguing to say the least. Compared with GeneMarkHMM2.6g, the superiority of 4M is not so pronounced; nevertheless, 4M has the better performance.

TABLE VII: COMPARISON OF 4M ALGORITHM VERSUS GLIMMER3 ON SHORT GENES


To summarize, 4M somewhat outperforms Glimmer3 and GeneMarkHMM2.6g in most cases, and considerably outperforms GeneMark2.5d.

Finally, we compared the performance of the 4M algorithm with Glimmer3 on short genes. It is widely accepted that long genes are easy to find using just about any algorithm and that the real test of an algorithm is its ability to find short genes. Since 4M significantly outperforms both versions of GeneMark in any case, we present only the comparison of 4M against Glimmer3 on three sets of genes: those of length less than 150 base pairs, between 151 and 300 base pairs, and between 301 and 500 base pairs. In presenting the results, we omitted any organism where the number of "ultrashort genes" of length less than 150 base pairs was less than 20. These results are found in Table VII. From this table, it is clear that 4M vastly outperforms Glimmer3 in predicting ultrashort genes and is somewhat superior in finding short genes.

In the case of the organism M. genitalium, which has the exceptional property that the codon TGA codes for the amino acid tryptophan instead of being a stop codon, the 4M algorithm performs poorly when this fact is not incorporated. However, when this fact is incorporated, the performance of 4M improves dramatically. This is why there are two rows corresponding to M. genitalium: the first row assumes that TGA is a stop codon, and the second assumes that TGA is not a stop codon. We were at first rather startled by the extremely poor performance of the 4M algorithm in the case of M. genitalium, considering that the algorithm performed so well on the rest of the organisms. This caused us to investigate the organism further and led us to discover from the literature that, in fact, M. genitalium has a nonstandard genetic code. The "moral of the story" is that, by purely statistical analysis, we could find out that there was something unusual about this organism.



More interestingly, even in the case of the extremely well-studied organism E. coli, neither the 4M algorithm nor Glimmer3 performs particularly well. This kind of poor performance is usually indicative of some nonstandard behavior on the part of the organism, as in the case of M. genitalium. This issue needs to be studied further.

B. Whole Genome Annotation Using the 4M Algorithm

Here, we carry out "whole genome annotation" of 13 organisms using the 4M algorithm. First, we identify all of the ORFs in the entire genome. Then, we train the coding model using every other gene in the database of annotated genes, and we use about 20% of the known non-coding regions to train the non-coding model. Both models are 4M-reduced. Then, the entire set of ORFs is classified using the log-likelihood classification scheme. The legend for the column headings in Table IV is as follows: organism, the number of ORFs, the number of annotated genes, the number of annotated genes that are "picked up" by 4M and predicted to be genes, the number of "other" ORFs whose current status is unknown, and finally, the number of these "other" ORFs that are predicted by 4M to be genes. To save space, we present results for only 13 organisms.

Now let us discuss the results in Table IV. It is reasonable to expect that, since half of the annotated genes are used as the training set, the other half of the annotated genes are "picked up" by 4M as being genes. However, what is surprising is how few of the other ORFs whose status is unknown are predicted to be genes. Actually, for each organism, the 4M algorithm predicts several hundred ORFs to be additional genes. However, since there are many overlaps amongst these ORFs, we eliminate the overlaps to predict a single gene for each set of overlapping regions. Thus, there is a clear differentiation between the statistical properties of the annotated genes and the ORFs whose status is unknown, and the 4M algorithm is able to differentiate between them. These "additional predicted genes" are good candidates for experimental verification. In order to prioritize them, they can be ranked in terms of the normalized log-likelihood ratio
$$ \frac{1}{|x|} \log \frac{L_c(x)}{L_{nc}(x)} $$
where $|x|$ denotes the length of the ORF $x$, and $L_c(x)$ and $L_{nc}(x)$ denote its likelihoods under the coding and non-coding models, respectively. The reason for normalizing the log-likelihood ratio is that, as $|x|$ becomes large, the "raw" log-likelihood ratio will also become large. Thus, comparing the raw log-likelihood ratios of two ORFs is not meaningful. However, comparing the normalized log-likelihood ratio allows us to identify the predictions about which we are most confident. This is the advantage of having a stochastic modeling methodology whose significance is easy to analyze.
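A sketch of this ranking step; the model representation and the scoring helper mirror the earlier illustrative sketches and are assumptions, not the paper's code.

```python
import math

def log_likelihood(seq, model, k, floor=1e-6):
    """Log-probability of seq under a k-step Markov model stored as a
    {(context, symbol): probability} dictionary."""
    return sum(math.log(model.get((seq[i - k:i], seq[i]), floor))
               for i in range(k, len(seq)))

def rank_predictions(orfs, coding_model, noncoding_model, k=5):
    """Sort candidate ORFs by normalized log-likelihood ratio, largest first,
    so that long and short ORFs can be compared on an equal footing."""
    def score(orf):
        return (log_likelihood(orf, coding_model, k)
                - log_likelihood(orf, noncoding_model, k)) / len(orf)
    return sorted(orfs, key=score, reverse=True)
```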

VI. CONCLUSION AND FUTURE WORK

We have studied the problem of finding genes from prokaryotic genomes using stochastic modeling techniques. The gene-finding problem is formulated as one of classifying a given sequence of bases using two distinct Markovian models for the coding regions and the non-coding regions, respectively. For the coding regions, we construct a three-periodic fifth-order Markovian model, whereas for the non-coding regions we construct a fifth-order Markovian model (ignoring the three-periodicity effect). Then, we introduced a new method known as 4M that allows us to assign variable length memories to each symbol, thus permitting a substantial reduction in the size of the state space of the Markovian model.

The disparities between various models have been quantified using the K-L divergence rate between Markov processes. The K-L divergence rate has a number of useful applications, some of which are brought out in this paper. For instance, using this measure, it has been conclusively demonstrated that the three-periodicity effect is much more pronounced in coding regions than in non-coding regions. This is why we could ignore three-periodicity in non-coding regions. An explicit formula has been given for the K-L divergence rate between a fifth-order Markov model and the 4M-reduced model. This formula allows us to quantify the classification error resulting from this model order reduction.

Using this new algorithm, we annotated 75 different microbial genomes from several classes of bacteria and archaea. The performance of the 4M algorithm was then compared with those of Glimmer3, GeneMark2.5d, and GeneMarkHMM2.6g. It has been shown that the 4M algorithm somewhat outperforms Glimmer3 and considerably outperforms both versions of GeneMark. When it comes to finding ultrashort and short genes, the 4M algorithm significantly outperforms even Glimmer3.

We also carried out whole genome annotations of all ORFs in several organisms using the 4M algorithm. We found that, while the 4M algorithm detects an overwhelming majority of the annotated genes as genes, it picks up a surprisingly small fraction of the remaining ORFs as genes. Thus, the 4M algorithm is able to differentiate very clearly between the "known" genes and the "unknown" ORFs. Moreover, since the 4M algorithm uses a simple log-likelihood test, it is possible to rank all the "predicted" genes in terms of decreasing log-likelihood ratio. In this way, the most confident predictions can be tried out first.

Formula (12) can be used to choose the threshold $\epsilon$ in (5) in an adaptive manner. Let $P$ denote the law of the full fifth-order Markov model (this can be either the coding-region model or the non-coding-region model), and let $Q_\epsilon$ denote the law of the corresponding 4M-reduced model, obtained using a threshold $\epsilon$. We should choose $\epsilon$ to be as large as possible while maintaining the constraint
$$ \bar{D}(P \| Q_\epsilon) \leq \delta $$
where $\delta$ is a new adjustable parameter. If we choose a very small value of $\delta$, then the log-likelihood ratio between the coding and non-coding models will hardly be affected if $P$ is replaced by $Q_\epsilon$. This will be a more intelligent and adaptive way to choose $\epsilon$.


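A minimal sketch of this adaptive rule under stated assumptions: `reduce_model` and `divergence_rate` are placeholders standing for routines that perform the 4M reduction and evaluate (12), and the default value of delta is illustrative.

```python
def choose_threshold(candidate_eps, reduce_model, divergence_rate, delta=0.01):
    """Return the largest threshold eps whose 4M-reduced model stays within a
    K-L divergence rate of delta from the full model. `reduce_model(eps)`
    builds the reduced model; `divergence_rate(reduced)` evaluates (12)."""
    best = None
    for eps in sorted(candidate_eps):
        if divergence_rate(reduce_model(eps)) <= delta:
            best = eps      # constraint satisfied; try a larger threshold
        else:
            break           # assume the divergence grows with eps beyond here
    return best
```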

ACKNOWLEDGMENT

The authors would like to thank B. Nittala and M. Haque for assisting with the interpretation of some of the computational results.

REFERENCES

[1] P. Baldi and S. Brunak, Bioinformatics: A Machine Learning Approach. Cambridge, MA: MIT Press, 2001.
[2] C. Burge and S. Karlin, "Prediction of complete gene structures in human genomic DNA," J. Molec. Biol., vol. 268, pp. 78–94, 1997.
[3] C. Burge and S. Karlin, "Finding genes in genomic DNA," Curr. Opin. Struct. Biol., vol. 8, pp. 346–354, 1998.
[4] A. L. Delcher, D. Harmon, S. Kasif, O. White, and S. L. Salzberg, "Improved microbial gene identification with GLIMMER," Nucleic Acids Res., vol. 27, no. 23, pp. 4636–4641, 1999.
[5] W. J. Ewens and G. R. Grant, Statistical Methods in Bioinformatics, 2nd ed. New York: Springer-Verlag, 2006.
[6] R. M. Gray, Entropy and Information Theory. New York: Springer-Verlag, 1990.
[7] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge, U.K.: Cambridge Univ. Press, 1997.
[8] F. Jelinek, Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press, 1997.
[9] B.-H. Juang and L. R. Rabiner, "A probabilistic distance measure for hidden Markov models," AT&T Tech. J., vol. 64, no. 2, pp. 391–408, Feb. 1985.
[10] A. Krogh, I. S. Mian, and D. Haussler, "A hidden Markov model that finds genes in E. coli DNA," Nucleic Acids Res., vol. 22, no. 22, pp. 4768–4778, 1994.
[11] F. Liese and I. Vajda, "On divergences and informations in statistics and information theory," IEEE Trans. Inf. Theory, vol. 52, no. 10, pp. 4394–4412, Oct. 2006.
[12] A. V. Lukashin and M. Borodovsky, "GeneMark.hmm: New solutions for gene finding," Nucleic Acids Res., vol. 26, no. 4, pp. 1107–1115, 1998.
[13] W. H. Majoros and S. L. Salzberg, "An empirical analysis of training protocols for probabilistic gene finders," BMC Bioinformat., vol. 5, p. 206, 2004.
[14] K. Marton and P. C. Shields, "The positive-divergence and blowing up properties," Israel J. Math., vol. 86, pp. 331–348, 1994.
[15] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–285, Feb. 1989.
[16] Z. Rached, F. Alajaji, and L. L. Campbell, "The Kullback-Leibler divergence rate between Markov sources," IEEE Trans. Inf. Theory, vol. 50, no. 5, pp. 917–921, May 2004.
[17] S. L. Salzberg, A. L. Delcher, S. Kasif, and O. White, "Microbial gene identification using interpolated Markov models," Nucleic Acids Res., vol. 26, no. 2, pp. 544–548, 1998.
[18] S. Tiwari, S. Ramachandran, A. Bhattacharya, S. Bhattacharya, and R. Ramaswamy, "Prediction of probable genes by Fourier analysis of genomic sequences," Computat. Appl. Biosci., vol. 13, no. 3, pp. 263–270, Jun. 1997.
[19] J. C. Venter et al., "The sequence of the human genome," Science, vol. 291, pp. 1304–1351, 2001.
[20] M. Vidyasagar, "A realization theory for hidden Markov models: The partial realization problem," in Proc. Symp. Math. Theory Netw. Syst., Kyoto, Japan, Jul. 2006, pp. 2145–2150.
[21] M. Vidyasagar, "Bounds on the Kullback-Leibler divergence rate between hidden Markov models," in Proc. IEEE Conf. Decision Control, Dec. 12–17, 2007.

Mathukumalli Vidyasagar (F'83) was born in Guntur, India, on September 29, 1947. He received the B.S., M.S., and Ph.D. degrees from the University of Wisconsin, Madison, in 1965, 1967, and 1969, respectively, all in electrical engineering.

Between 1969 and 1989, he was a Professor of Electrical Engineering with various universities in the United States and Canada. His last overseas job was with the University of Waterloo, Waterloo, ON, Canada, from 1980 to 1989. In 1989, he returned to India as the Director of the newly-created Centre for Artificial Intelligence and Robotics (CAIR) and built up CAIR into a leading research laboratory of about 40 scientists working on aircraft control, robotics, neural networks, and image processing. In 2000, he joined Tata Consultancy Services (TCS), Hyderabad, India, India's largest IT firm, as an Executive Vice President in charge of Advanced Technology. In this capacity, he created the Advanced Technology Centre (ATC), which currently consists of about 80 engineers and scientists working on e-security, advanced encryption methods, bioinformatics, Open Source/Linux, and smart-card technologies. He is the author or coauthor of nine books and more than 130 papers in archival journals.

Dr. Vidyasagar is a Fellow of the Indian Academy of Sciences, the Indian National Science Academy, the Indian National Academy of Engineering, and the Third World Academy of Sciences. He was the recipient of several honors in recognition of his research activities, including the Distinguished Service Citation from the University of Wisconsin at Madison, the 2000 IEEE Hendrik W. Bode Lecture Prize, and the 2008 IEEE Control Systems Award.

Sharmila S. Mande received the Ph.D. degree in physics from the Indian Institute of Science, Bangalore, India, in 1991.

Her research interests include genome informatics, protein crystallography, protein modeling, protein-protein interaction, and comparative genomics. She performed research work with the University of Groningen, The Netherlands, the University of Washington, Seattle, the Institute of Microbial Technology, Chandigarh, India, and the Post Graduate Institute of Medical Education and Research, Chandigarh, before joining Tata Consultancy Services, Hyderabad, India, in 2001 as head of the Bio-Sciences Division, which is part of TCS' Innovation Lab.

Ch. V. Siva Kumar Reddy received the M.Tech. degree in computer science and engineering from the University of Hyderabad, Hyderabad, India.

He is currently with Tata Consultancy Services (TCS)'s Bio-Sciences Division, which is part of TCS' Innovation Lab, Hyderabad. His research interests include computational methods in gene prediction.

V. Raja Rao received the M.S. degree in computer science and engineering from the Indian Institute of Technology, Mumbai, India, in 2002.

He then joined the Bioinformatics Division, Tata Consultancy Services (TCS), Hyderabad, India, and was involved in the development of various bioinformatics products. His research interests include computational methods in gene prediction and parallel computing for bioinformatics. He is currently consulting for TCS at Sequenom Inc., San Diego, CA.