
Computers in Biology and Medicine 35 (2005) 173–181
http://www.intl.elsevierhealth.com/journals/cobm

On exact string matching of unique oligonucleotides

Heikki Hyyrö a, Martti Juhola a,∗, Mauno Vihinen b

a Department of Computer Sciences, 33014 University of Tampere, Tampere, Finland
b Institute of Medical Technology, University of Tampere, Finland and Tampere University Hospital, 33520 Tampere, Finland

Received 22 April 2003; accepted 18 November 2003

Abstract

Unique, gene-specific oligonucleotides are used for many genetic investigations such as polymerase chain reaction, gene cloning, microarray technology and antisense DNA studies. It is a computationally demanding task to extract these oligonucleotides from DNA databases. We studied the problem from the point of view of the string matching problem. We implemented and tested several exact string matching algorithms and modified the implementations to be as effective as possible. Ten different implementations were tested on yeast genomic sequence data. The run times for the best algorithms were significantly improved compared to conventional approaches, while in principle, i.e. in respect of theoretical time complexity, these algorithms do not actually differ essentially from each other.
© 2003 Elsevier Ltd. All rights reserved.

Keywords: Exact string matching; Keyword tree (trie); Genomic data sequences; DNA sequences; Oligonucleotides

1. Introduction

Our objective was to find a string matching algorithm suitable for large data sets, especially for DNA sequences. Recent developments in biology have led to a proliferation of generated data. During the last few years several genomes, including the human genome, have been identified. The human sequence alone contains some 3 billion base pairs. To analyze the ever-increasing volume of biological data, efficient algorithms are needed for various purposes. Since the databases are vast, time complexity and efficiency are crucial. Oligonucleotides complementary to DNA sequences are used, for example, in polymerase chain reaction (PCR) and other cloning, multiplication and biological diagnostic methods. Many experimental methods rely on the hybridization of oligo- or polynucleotides, including PCR technology, (genome) sequencing, and gene expression studies such as Southern and Northern blotting, SAGE and microarrays. Gene function may be modulated by short oligonucleotides, either by antisense technology or by RNA interference (RNAi). The reliability of these experiments is based on the specificity and uniqueness of the oligonucleotides used as primers. Now that several complete genomes are available, it is possible to identify unique signature sequences for every gene in a genome. There is a great demand for such unique signature sequences for use in several kinds of experiments and possibly even in therapy with antisense oligonucleotides. Analyzing complete genomes in a reasonable time is a major problem. Therefore, we studied the applicability of various exact string matching methods.

∗ Corresponding author. Tel.: +358-3-2157972; fax: +358-8-2156070.
E-mail address: [email protected] (M. Juhola).

0010-4825/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.
doi:10.1016/j.compbiomed.2003.11.003

We tested several algorithms for the analysis of the yeast genome (Saccharomyces cerevisiae), which is commonly employed as a test object and which is readily available from databases. The problem was to identify unique oligonucleotides occurring in only one gene in the genome. In the future, our objective will naturally be to analyze the human genome. For that purpose, it is very important to study and develop string matching algorithms. We started with much smaller, well-defined genomes, which are, nonetheless, challenging in this respect. Also, to be practically feasible, the analysis of smaller genomes requires very advanced and fast algorithms.

2. Sequence data acquisition

The yeast genome used in the tests was acquired from the NCBI database [1]. The first step was the preprocessing of the sequence data: extracting the encoded genes, subject to a few preconditions, from the genome. The search algorithms were then implemented and tested. We implemented some general string matching algorithms and refined them for our purpose. Running times were used as the criterion to estimate the power of the algorithms.

Inasmuch as DNA sequence distributions are biased and nonuniform, it is necessary to test under conditions that would be encountered in an actual bioinformatics analysis in order to measure performance realistically. Hence, the unique substrings searched for by the implementation had to satisfy the following 13 conditions, which have been utilized in the search for oligonucleotide sequences mainly for microarray analysis, but which are also useful for other purposes [2,3]:

(1) The length of the oligonucleotide is 25 nucleotides; (2) it is preferably located at the beginning of a gene; it includes (3) 12 or fewer A or T nucleotides and (4) 10 or fewer C or G nucleotides; no window of 8 nucleotides includes more than (5) 6 A, (6) 6 T, (7) 4 C or (8) 4 G nucleotides; it includes at most (9) 6 successive A, (10) 6 successive T, (11) 5 successive C or (12) 5 successive G nucleotides; and (13) the inverse complement of the oligonucleotide can match at most 6 symbols from the beginning of the oligonucleotide.
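The composition rules above lend themselves to a small filter function. The following is a minimal sketch, not the authors' implementation. It covers conditions 1 and 3–12 only; conditions 2 and 13 are left out. It also assumes that "12 or fewer A or T" is a separate limit for each base (and likewise for C and G), since a joint limit of 12 + 10 = 22 symbols could not be met by any sequence of 25 nucleotides. All names are hypothetical.

```python
import re

# Per-base limits read off conditions (3)-(12). Treating "A or T" as a
# separate limit for each base is an assumption of this sketch.
TOTAL_LIMIT = {"A": 12, "T": 12, "C": 10, "G": 10}   # conditions 3-4
WINDOW_LIMIT = {"A": 6, "T": 6, "C": 4, "G": 4}      # conditions 5-8
RUN_LIMIT = {"A": 6, "T": 6, "C": 5, "G": 5}         # conditions 9-12
WINDOW = 8
OLIGO_LENGTH = 25                                    # condition 1


def longest_run(seq, base):
    """Length of the longest run of consecutive `base` symbols in `seq`."""
    return max((len(run) for run in re.findall(base + "+", seq)), default=0)


def max_in_window(seq, base, width=WINDOW):
    """Largest number of `base` symbols in any window of `width` symbols."""
    last_start = max(len(seq) - width, 0)
    return max(seq[i:i + width].count(base) for i in range(last_start + 1))


def passes_composition_rules(oligo):
    """Check conditions 1 and 3-12; conditions 2 and 13 are not covered."""
    if len(oligo) != OLIGO_LENGTH or set(oligo) - set("ACGT"):
        return False
    return all(
        oligo.count(base) <= TOTAL_LIMIT[base]
        and max_in_window(oligo, base) <= WINDOW_LIMIT[base]
        and longest_run(oligo, base) <= RUN_LIMIT[base]
        for base in "ACGT"
    )
```

A function of this shape can be passed as the validity test to any of the search methods, which keeps the filtering rules separate from the string matching itself.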

The yeast genome comprises 16 chromosomes. The sequences incorporated a total of 6269 genes in 8.9 megabases. Eventually we also included two 100-nucleotide extension areas (one before and the other after the gene) in each gene, which resulted in a data set of 10.2 megabases. In the following discussion, we always refer to this extended data, even though it is not explicitly mentioned.

The main objective of the search task was to identify all unique oligonucleotides that satisfied all the conditions given. The search was performed with several algorithms. The main ones are described in the following.


3. String matching methods

Since most experimental methods are based on oligonucleotides of 18–30 base pairs in length, we searched for unique oligonucleotides of a fixed length of 25 base pairs. The first, naive approach was of the brute-force type and performed an exhaustive search. These methods were used to obtain baseline results for comparing the different methods. The same task was also implemented by means of several conventional string matching algorithms for certain parts of the process as described below.

Method 1: Conventional string matching approach

1. Select a gene from which no unique sequence of 25 symbols has so far been identified and which has not yet been completely analyzed. If all genes have been entirely processed, stop.
2. Select a candidate sequence of 25 symbols, preferably from the beginning of the gene.
3. Analyze whether the candidate sequence satisfies all 13 conditions given above. If so, move to the following step; otherwise write the candidate sequence to the list of processed and rejected sequences, and return to the preceding step.
4. Determine whether the selected candidate sequence appears in another gene. This is performed with a suitable exact string matching method. If the string occurs elsewhere, reject the candidate and return to step 2. Otherwise, accept the candidate as a valid unique oligonucleotide, write it as processed and go back to step 1.
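The four steps of Method 1 can be sketched as follows. This is an illustrative outline under stated assumptions, not the code used in the study. Python's built-in substring test stands in for "a suitable exact string matching method", and `is_valid` is a hypothetical hook for the 13 conditions. The function name and the toy genes in the usage note are made up for the example.

```python
def unique_oligos_method1(genes, k=25, is_valid=lambda seq: True):
    """Return {gene index: first unique k-symbol sequence found in that gene}.

    `genes` is a list of strings. Genes for which no unique sequence
    exists are simply absent from the result.
    """
    result = {}
    for g, gene in enumerate(genes):                      # step 1
        rejected = set()
        for start in range(len(gene) - k + 1):            # step 2: scan from the gene start
            candidate = gene[start:start + k]
            if candidate in rejected:
                continue
            if not is_valid(candidate):                    # step 3
                rejected.add(candidate)
                continue
            # Step 4: exact search in every other gene. The `in` operator
            # is a stand-in for Boyer-Moore, Knuth-Morris-Pratt, etc.
            if any(candidate in other
                   for j, other in enumerate(genes) if j != g):
                rejected.add(candidate)
                continue
            result[g] = candidate
            break
    return result
```

With the toy input `["AATCG", "AATGG", "CCCTT"]` and `k=3`, the sketch returns `ATC`, `ATG` and `CCC` for the three genes, because `AAT` is shared by the first two genes and is rejected in both.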

In the worst case with regard to the time complexity of the method, all the 25-symbol sequences of every gene would be accepted in step 3 and would be found at the end of the last gene considered in step 4. If we assume that the exact string matching method in step 4 runs in linear time and performs exactly as many elementary computational operations as there are nucleotides in the gene (its length in symbols), the number of computational operations in the worst case is

$$c = N_{25}\,(L - l), \tag{1}$$

where $N_{25}$ is the number of 25-symbol sequences in all genes, $L$ is the total length of all genes and $l$ is the length of a single gene. Assuming that no gene is shorter than 25 nucleotides, the value of $N_{25}$ can be computed by subtracting 24 times the number $N$ of genes from $L$. The difference in (1) can be approximated by subtracting the mean length of the genes from $L$. Thus, we obtain the following estimate for the number of operations:

$$10.0 \times 10^{6} \times 10.2 \times 10^{6} \approx 10^{14}. \tag{2}$$

We can likewise estimate the best case of the time complexity to cover $N$ times $L$ operations:

$$6269 \times 10.2 \times 10^{6} = 6.4 \times 10^{10}. \tag{3}$$
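As a quick check of the arithmetic in estimates (1)–(3), the figures can be recomputed from the numbers given in the text (6269 genes, 10.2 megabases of extended data, 25-symbol sequences). The variable names are chosen for this illustration only.

```python
L = 10.2e6          # total length of the extended gene data, in nucleotides
N = 6269            # number of genes
K = 25              # oligonucleotide length

n25 = L - (K - 1) * N          # a gene of length l contains l - 24 windows
mean_gene_length = L / N       # stands in for l in Eq. (1)

worst_case = n25 * (L - mean_gene_length)   # Eqs. (1) and (2)
best_case = N * L                           # Eq. (3)

print(f"N25        = {n25:.3e}")
print(f"worst case = {worst_case:.2e} operations")
print(f"best case  = {best_case:.2e} operations")
```

The worst case exceeds the best case by a factor of roughly 1600, which is about the mean number of candidate sequences per gene.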

The next method designed for the string matching task is a logical step from the previous algorithm. It takes advantage of parallel searching for each gene. All the candidate sequences of a gene are investigated at the same time by employing a keyword tree or trie (e.g. [4]) as in the example in Fig. 1. The time complexity for building a keyword tree is O(n), where n is equal to the sum of the lengths of the patterns searched for.


Fig. 1. An example of a trie (keyword tree) structure including the three-nucleotide-long sequences AAT, AAG, TAT, TCA, TCG and CTA.
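To make the structure of Fig. 1 concrete, the sketch below builds a trie for the same six sequences using nested dictionaries. The representation and the function names are choices made for this illustration. The source does not say how its tries were stored.

```python
def build_trie(words):
    """Build a trie as nested dicts: each key is a symbol, each value a subtree."""
    root = {}
    for word in words:
        node = root
        for symbol in word:
            node = node.setdefault(symbol, {})
    return root


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.values())


def contains(trie, word):
    """True if `word` ends at a leaf, i.e. is one of the stored sequences.

    This relies on all stored sequences having the same length, as in Fig. 1.
    """
    node = trie
    for symbol in word:
        if symbol not in node:
            return False
        node = node[symbol]
    return not node   # an empty dict marks a leaf


trie = build_trie(["AAT", "AAG", "TAT", "TCA", "TCG", "CTA"])
```

Building the trie touches each symbol of each pattern once, which is the O(n) construction cost mentioned in the text. The six sequences share prefixes, so the trie holds 13 nodes below the root rather than 18.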

Method 2: Use of trie

1. Select a gene that has not yet been considered. If there is none, stop.
2. Build a trie from all sequences of 25 symbols in the selected gene that satisfy the 13 conditions.
3. Traverse, with the aid of the trie, all other genes and search for all occurrences of sequences of 25 symbols stored in the trie. Mark every such sequence found as rejected.
4. Accept all sequences of the trie that were not rejected in the preceding step. Mark the current gene as processed. Return to step 1.
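A compact sketch of Method 2 with the brute-force trie search is given below: at every position of every other gene, the trie is walked from the root for up to 25 symbols. It is an outline under assumptions, not the authors' code. The nested-dict trie, the function names and the `is_valid` hook for the 13 conditions are all hypothetical.

```python
def build_trie(kmers):
    """Nested-dict trie over equal-length sequences."""
    root = {}
    for kmer in kmers:
        node = root
        for symbol in kmer:
            node = node.setdefault(symbol, {})
    return root


def trie_matches_at(trie, text, pos, k):
    """Brute-force trie search: walk from the root along text[pos:pos+k]."""
    node = trie
    for symbol in text[pos:pos + k]:
        node = node.get(symbol)
        if node is None:
            return False
    return True


def unique_oligos_method2(genes, k=25, is_valid=lambda seq: True):
    """Return {gene index: set of k-symbol sequences found in no other gene}."""
    accepted = {}
    for g, gene in enumerate(genes):                          # step 1
        candidates = {gene[i:i + k] for i in range(len(gene) - k + 1)}
        candidates = {c for c in candidates if is_valid(c)}
        trie = build_trie(candidates)                         # step 2
        rejected = set()
        for j, other in enumerate(genes):                     # step 3
            if j == g:
                continue
            for pos in range(len(other) - k + 1):
                if trie_matches_at(trie, other, pos, k):
                    rejected.add(other[pos:pos + k])
        accepted[g] = candidates - rejected                   # step 4
    return accepted
```

Each position of the other genes costs at most `k` symbol comparisons, which corresponds to the factor 25 in the search term of the brute-force estimate.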

The time complexity of the method depends on the technique used for the trie search in step 3. If the search is executed in the brute-force way and we assume that no sequences are rejected, we can estimate the worst case for forming the tries, $T_f$, and for the trie searches, $T_s$ ($M$ is the mean length of the genes):

$$T_f + T_s \approx 25 N_{25} + 25 N (L - M) \approx 25 \times 10.0 \times 10^{6} + 25 \times 6269 \times 10.2 \times 10^{6} \approx 1.6 \times 10^{12}. \tag{4}$$

If the Aho–Corasick algorithm [4] is applied in step 3, trie searches become faster, since backtracking in the sequence is avoided. On the other hand, the correction function of the Aho–Corasick algorithm requires some processing. We used the following crude time complexity estimate for forming the Aho–Corasick trees, $T_{AC}$, and for the tree searches, $T_s$:

$$T_{AC} + T_s \approx 2 \times 25 \times 10.0 \times 10^{6} + 5 \times 6269 \times 10.2 \times 10^{6} \approx 3.2 \times 10^{11}. \tag{5}$$
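For readers unfamiliar with the Aho–Corasick algorithm [4], the following is a minimal textbook-style sketch of the automaton and the search loop. It is not the implementation measured in the study. The list-based state tables and function names are choices made for this example, and the failure links play the role of what the text calls the correction function.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for a set of patterns."""
    goto, fail, out = [{}], [0], [[]]          # state 0 is the root
    for pattern in patterns:
        state = 0
        for symbol in pattern:
            if symbol not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        out[state].append(pattern)
    # Breadth-first pass: the failure link of a state points to the longest
    # proper suffix of its path that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for symbol, child in goto[state].items():
            queue.append(child)
            f = fail[state]
            while f and symbol not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(symbol, 0)
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start position, pattern) for every occurrence in `text`."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, symbol in enumerate(text):
        while state and symbol not in goto[state]:
            state = fail[state]                # follow failure links, never re-read text
        state = goto[state].get(symbol, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits


automaton = build_automaton(["AAT", "AAG", "TAT", "TCA", "TCG", "CTA"])
hits = find_all("GAATCGTATT", automaton)
```

The text is read strictly left to right and each symbol is consumed once, which is why the search term of the Aho–Corasick estimate carries a small constant instead of the factor 25.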

The estimate (5) is not very far from the best-case estimate (3) of Method 1.

Next, we added more parallel processing by dealing with all the genes at the same time. This was again based on tries, but in a different way compared to the preceding method. All sequences of 25 symbols from all genes were compared to each other by forming a common trie. When a new sequence is inserted into the trie, it is apparent whether the sequence creates a new leaf node in the trie. If it does, the sequence is unique at least up to that moment. When an existing leaf is reached, the sequence is not unique and has to be rejected. The accepted strings that remain form the pool of unique oligonucleotides at the end of the analysis.

Because the DNA data was so large, a single trie would have required an enormous amount of memory. Therefore, the Karp–Rabin fingerprint algorithm [5] was applied in order to divide the search problem into smaller subtasks. First, all the 25-symbol sequences were distributed into subsets according to their fingerprints. A sequence cannot have two different fingerprints; thus, identical sequences always belong to the same fingerprint set.
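The partitioning idea can be illustrated with a rolling fingerprint in the style of Karp and Rabin [5]. The sketch below assumes a base-4 polynomial hash reduced modulo the number of sets. The source does not specify which fingerprint function or how many sets were used, so the encoding, the modulus and the function names are all assumptions.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def fingerprint_sets(text, k, num_sets):
    """Map each fingerprint to the start positions of the k-mers that have it.

    The fingerprint of a k-mer is its value as a base-4 number modulo
    `num_sets`, updated in constant time when the window slides by one.
    """
    sets = defaultdict(list)
    if len(text) < k:
        return sets
    high = pow(4, k - 1, num_sets)       # weight of the leading symbol
    fp = 0
    for symbol in text[:k]:
        fp = (fp * 4 + CODE[symbol]) % num_sets
    sets[fp].append(0)
    for i in range(1, len(text) - k + 1):
        leaving = CODE[text[i - 1]]
        entering = CODE[text[i + k - 1]]
        fp = ((fp - leaving * high) * 4 + entering) % num_sets
        sets[fp].append(i)
    return sets


sets = fingerprint_sets("ACGTACGTAC", k=4, num_sets=7)
```

Because the fingerprint is a function of the sequence alone, two occurrences of the same k-mer always land in the same set, so each set can be checked for duplicates independently of the others.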

Method 3: Advanced trie processing

1. Distribute all sequences of 25 symbols into their fingerprint sets, including only sequences that satisfy all 13 conditions.
2. Select a fingerprint set not yet processed. If there is none, stop.
3. Insert, one by one, all sequences of the current fingerprint set into the trie so that sequences which already appear in the trie are rejected. If a new leaf is inserted, write the index of the current gene in the corresponding leaf and add a reference to the leaf to the list maintained for this indexing.
4. Traverse all leaves of the trie constructed so that the sequence of every leaf that is not marked as rejected is written as a representative of the current gene. Return to step 2.
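The steps of Method 3 can be sketched as follows. This is an outline under stated assumptions rather than the authors' program: the fingerprint is a simple base-4 value modulo the number of sets, the trie is a nested dictionary, and a sequence is rejected on any second occurrence, which is a literal reading of step 3. The `is_valid` hook and all names are hypothetical.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def fingerprint(kmer, num_sets):
    """Assumed fingerprint: the k-mer as a base-4 number, modulo num_sets."""
    fp = 0
    for symbol in kmer:
        fp = (fp * 4 + CODE[symbol]) % num_sets
    return fp


def unique_oligos_method3(genes, k=25, num_sets=64, is_valid=lambda seq: True):
    """Return {gene index: list of k-symbol sequences occurring exactly once}."""
    # Step 1: store only (gene, start) locations, grouped by fingerprint.
    sets = defaultdict(list)
    for g, gene in enumerate(genes):
        for start in range(len(gene) - k + 1):
            kmer = gene[start:start + k]
            if is_valid(kmer):
                sets[fingerprint(kmer, num_sets)].append((g, start))

    unique = defaultdict(list)
    for members in sets.values():                 # step 2: one small trie per set
        trie, leaves = {}, []
        for g, start in members:                  # step 3: insert one by one
            node = trie
            for symbol in genes[g][start:start + k]:
                node = node.setdefault(symbol, {})
            if "leaf" in node:                    # reached an existing leaf
                node["leaf"]["rejected"] = True
            else:
                node["leaf"] = {"gene": g, "start": start, "rejected": False}
                leaves.append(node["leaf"])
        for leaf in leaves:                       # step 4: collect survivors
            if not leaf["rejected"]:
                g, start = leaf["gene"], leaf["start"]
                unique[g].append(genes[g][start:start + k])
    return dict(unique)
```

Every candidate is inserted exactly once and never searched for again, which is the property the text credits for the speed of this method. Only one fingerprint set's trie is alive at a time, which bounds the memory needed.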

We can estimate the time complexity of this method. Let $F$ be the number of operations required to separate the data into fingerprint sets, $S$ the number of fingerprint sets and $A$ the average number of operations needed to form a trie. We used the estimate

$$F + SA \approx 25L + 25S\,\frac{L}{S} \approx 25 \times 10.2 \times 10^{6} + 25 \times 10.2 \times 10^{6} \approx 5.1 \times 10^{8}. \tag{6}$$
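The three operation-count estimates (4)–(6) can be put side by side with a few lines of arithmetic. The figures are those of the text, with $N_{25}$ rounded to $10.0 \times 10^{6}$ and $L - M$ approximated by $L$ as in (4). The variable names are illustrative.

```python
L = 10.2e6      # total length of the extended gene data
N = 6269        # number of genes
N25 = 10.0e6    # number of 25-symbol sequences, rounded as in the text
K = 25          # oligonucleotide length

brute_force_trie = K * N25 + K * N * L        # Eq. (4)
aho_corasick = 2 * K * N25 + 5 * N * L        # Eq. (5)
fingerprint_tries = K * L + K * L             # Eq. (6), since S * (L / S) = L

for name, ops in [("brute-force trie", brute_force_trie),
                  ("Aho-Corasick", aho_corasick),
                  ("fingerprint tries", fingerprint_tries)]:
    print(f"{name:18s} {ops:.1e} operations")
```

On these crude counts the fingerprint approach needs several hundred times fewer operations than the Aho–Corasick variant, because the number of genes no longer appears as a factor.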

Consequently, Method 3 appeared to be by far the fastest of the three methods.

4. Results and conclusion

In the beginning, we explored how many sequences of 25 symbols would satisfy the 13 conditions given above. As expected, there were many such sequences; from the total of 10.2 million nucleotides only 1600 oligonucleotides had to be discarded.

All tests were performed on a PC with a 600 MHz Pentium III processor and 512 Mbytes of RAM. The whole data was read into the main memory and all the tests were run from this identical starting point (reading, e.g., gene by gene from the hard disk would have been very slow).

Because the test battery was clearly too large for Method 1, we restricted the test set to a fraction of the whole material and, on the basis of the run times, derived estimates for the whole data. A random sample of 1000 sequences was used for each string matching algorithm in Method 1. The methods tested were the simple brute-force algorithm [6], also called the naïve method [7], with a time complexity of O(nm) (n is the length of the whole data and m the length of the pattern searched for), the Knuth–Morris–Pratt algorithm [7–10], also with linear time complexity with respect to n, the linear-time (when searching for only the first occurrence) Boyer–Moore algorithm [6,7,11–13], the Boyer–Moore–Horspool algorithm [6,14], which is sub-linear on average, the Quick Search algorithm [15,16], also sub-linear on average, and the Karp–Rabin algorithm [5,7], which is linear on average. The results for Method 1 are shown in Table 1. In the worst case even the fastest search algorithm in Table 1 would use $10.0 \times 10^{6} \times 216/1000$ s for all sequences of 25 symbols. This would take about 25 days, which is not feasible, especially when considering that the human genome possibly contains 20 times more genes.

Table 1
Run times for 1000 oligonucleotides of the length of 25 symbols and estimates for the analyses of the complete yeast genome (10.2 million nucleotides) using conventional exact string matching algorithms

| Algorithm | Run time (s) for 1000 oligonucleotides | Estimated run time (h) for complete genome |
|---|---|---|
| Brute-force | 441 | 1225 |
| Knuth–Morris–Pratt | 514 | 1428 |
| Boyer–Moore | 216 | 600 |
| Boyer–Moore–Horspool | 287 | 797 |
| Quick Search | 303 | 842 |
| Karp–Rabin | 1953 | 5425 |

Table 2
Run times for 100 genes including sequences of the length of 25 symbols and estimates for all such sequences in the DNA data of the yeast genome when the exact trie string matching algorithms were used

| Algorithm | Run time (s) for 100 genes | Estimated run time (h) for all 6269 genes |
|---|---|---|
| Brute-force trie | 377 | 6.6 |
| Aho–Corasick | 186 | 3.2 |
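The estimated run times in Tables 1 and 2 are consistent with a simple linear extrapolation from the sample to the full data. The short script below reproduces them from the measured seconds. The scaling rule is inferred from the reported numbers rather than stated in the text, and the names are illustrative.

```python
N25 = 10.0e6      # number of 25-symbol sequences, rounded as in the text
N_GENES = 6269


def estimated_hours(seconds, sample_size, total):
    """Scale a sample run time linearly to `total` items and convert to hours."""
    return seconds * total / sample_size / 3600


table1_seconds = {
    "Brute-force": 441,
    "Knuth-Morris-Pratt": 514,
    "Boyer-Moore": 216,
    "Boyer-Moore-Horspool": 287,
    "Quick Search": 303,
    "Karp-Rabin": 1953,
}
table2_seconds = {"Brute-force trie": 377, "Aho-Corasick": 186}

table1_hours = {name: round(estimated_hours(s, 1000, N25))
                for name, s in table1_seconds.items()}
table2_hours = {name: round(estimated_hours(s, 100, N_GENES), 1)
                for name, s in table2_seconds.items()}

fastest_days = table1_hours["Boyer-Moore"] / 24
```

For the fastest conventional algorithm, Boyer–Moore, the extrapolated 600 hours correspond to the 25 days quoted in the text.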

Astonishingly, the Karp–Rabin algorithm was considerably slower than the others, because its fingerprint computations have quite a large constant coefficient, although they are asymptotically computed in linear time. The Knuth–Morris–Pratt algorithm was also slow compared to even the brute-force algorithm. Obviously, the prefixes of the substrings searched for matched so infrequently that even the brute-force algorithm was able to compete with it. The situation may change when analyzing larger genomes.

In the test of Method 2, the whole gene was used instead of individual sequences. Because of the greater testing workload we executed tests for a smaller number of candidates (Table 2). These results are significantly more affordable than those in Table 1. The Aho–Corasick trie method would have required only 3.2 h for the complete yeast genome.

Method 3 was tested with two different approaches. First, the fingerprint sets were written to separate files, which were processed file by file. Second, a list was made for each fingerprint to contain the locations in the data of the sequences with that fingerprint. Because our early tests on these methods indicated that the run time for processing the whole genome would be quite small, we ended up using the whole genome as the test data. The results are presented in Table 3. List manipulation was far better than the use of files, provided the data could be kept in the main memory. The run time was approximately 3.5 min. This extremely good result compared to those of Methods 1 and 2 partially arises from the fact that Method 3 processed each candidate sequence only once. The fingerprint calculation was also very fast, taking only 2 s.

Table 3
Run times for 100 fingerprints including sequences of the length of 25 symbols and estimates for all such sequences in the DNA data of the yeast genome when the advanced trie string matching algorithms were used

| Algorithm | Run time (s), fingerprint processing | Run time (s), processing the whole genome |
|---|---|---|
| Fingerprint tries, files | 141 | 261 |
| Fingerprint tries, RAM | 2 | 213 |

Table 4
Proportions of valid and invalid oligonucleotides in the genome of Saccharomyces cerevisiae according to the location of the oligonucleotides in the gene. The genes were divided into 9 parts in such a way that parts 1 and 9 correspond to the 100-nucleotide extension areas and parts 2–8 correspond to an equal division of a gene into 7 parts

| Location in the genes | Valid oligonucleotides (%) | Invalid oligonucleotides (%) |
|---|---|---|
| Part 1 | 97.1 | 2.9 |
| Part 2 | 95.5 | 4.5 |
| Part 3 | 96.4 | 3.6 |
| Part 4 | 98.0 | 2.0 |
| Part 5 | 98.3 | 1.7 |
| Part 6 | 98.4 | 1.6 |
| Part 7 | 98.5 | 1.5 |
| Part 8 | 98.8 | 1.2 |
| Part 9 | 99.5 | 0.5 |
| All (parts 1–9) | 97.8 | 2.2 |

The result of processing the whole genome was that about 98% of the candidate sequences were found to be valid unique oligonucleotides. We also observed that the locations of the invalid oligonucleotides were biased towards the beginnings of the genes. These results are shown in Table 4.

The advanced trie processing with list manipulation run in the main memory was clearly superior to the other methods. The principal reason for this result is that it considered every candidate sequence only once.

There exist many algorithms, such as the widely employed BLAST [17], which are utilized in bioinformatics and for DNA sequences in particular. Nevertheless, BLAST does not address an exact string matching task of the kind handled by the algorithms considered here, and it uses a very different distance measure; hence BLAST and other corresponding algorithms could not be compared in the current research.


5. Summary

Unique oligonucleotides are applied in several genetic investigations in bioinformatics, e.g. microarray technology. Searching for such exact strings in DNA databases is a computationally challenging task, since the databases are very large and unique oligonucleotides are frequent in them. We implemented 10 string pattern matching methods and developed the most efficient of them further in order to speed up processing.

We used the yeast genome as the test data, a common choice since it is freely available, well known and still small compared to the human genome. Although the string matching methods explored were quite equal with respect to their theoretical time complexities, strong differences were observed between them in the tests. The methods applying fingerprint tries were by far the most efficient.

Acknowledgements

The first author gratefully acknowledges the Academy of Finland and Tampere Graduate School in Information Sciences and Engineering for their financial support. The third author is grateful to the Sigrid Juselius Foundation and the Medical Research Fund of Tampere University Hospital.

References

[1] National Center for Biotechnology Information, web site: http://www.ncbi.nlm.nih.gov.

[2] D.J. Lockhart, H. Dong, M.C. Byrne, M.T. Follettie, M.V. Gallo, M.S.M. Mittmann, C. Wang, M. Kobayashi, H.E.L. Brown, Expression monitoring by hybridization to high-density oligonucleotide arrays, Nature Biotechnol. 14 (1996) 1675–1680.

[3] L. Wodicka, H. Dong, M. Mittmann, M. Ho, D.J. Lockhart, Genome-wide expression monitoring in Saccharomyces cerevisiae, Nature Biotechnol. 15 (1997) 1359.

[4] A. Aho, M. Corasick, Efficient string matching: an aid to bibliographic search, Commun. ACM 18 (1975) 333–340.

[5] R. Karp, M. Rabin, Efficient randomized pattern matching algorithms, IBM J. Res. Develop. 31 (1987) 249–260.

[6] G.A. Stephen, String Searching Algorithms, World Scientific, Singapore, 1994.

[7] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, 1997.

[8] J.H. Morris, V.R. Pratt, A linear pattern-matching algorithm, Technical Report 40, University of California, Berkeley, 1970.

[9] M. Crochemore, W. Rytter, Text Algorithms, Oxford University Press, New York, 1994.

[10] D.E. Knuth, J.H. Morris, V.R. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323–350.

[11] R.S. Boyer, J.S. Moore, A fast string searching algorithm, Commun. ACM 20 (1977) 762–772.

[12] W. Rytter, A correct preprocessing algorithm for Boyer–Moore string searching, SIAM J. Comput. 9 (1980) 509–512.

[13] R. Cole, Tight bounds on the complexity of the Boyer–Moore pattern matching algorithm, SIAM J. Comput. 23 (1994) 1075–1091.

[14] N. Horspool, Practical fast searching in strings, Software Practice and Experience 10 (1980) 501–506.

[15] D.M. Sunday, A very fast substring search algorithm, Commun. ACM 33 (1990) 132–142.

[16] T. Lecroq, Experimental results on string matching algorithms, Software Practice and Experience 25 (1995) 710–728.

[17] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, Basic local alignment search tool, J. Mol. Biol. 215 (1990) 403–410.


Heikki Hyyrö received his M.Sc. degree in Computer Science from the University of Tampere, Finland, in 2000. He recently completed his Ph.D. degree with Tampere Graduate School in Information Sciences and Engineering. His research focuses on string matching algorithms and their applications in bioinformatics.

Martti Juhola received his M.Sc., Ph.Lic. and Ph.D. degrees in Computer Science from the University of Turku, Finland, in 1982, 1985, and 1987 respectively. Previously, he was an academic assistant, lecturer, and researcher at the University of Turku, and later a professor at the University of Kuopio, Finland. Currently, he is professor of Computer Science at the University of Tampere. His research interests include bioinformatics, medical informatics, medical signal analysis, artificial intelligence, neural networks, pattern recognition, and population studies.

Mauno Vihinen received his M.Sc., Ph.Lic. and Ph.D. degrees in Biochemistry at the University of Turku, Finland, in 1985, 1990, and 1990 respectively. He has held several positions at the universities of Turku, Helsinki and Tampere and has been a visiting scientist at the Karolinska Institute, Stockholm and the University of California, San Diego. Currently, Vihinen is Professor of Bioinformatics at the Institute of Medical Technology, University of Tampere. He is interested in protein structure-function relationships, especially in relation to human diseases, and gene and protein expression in relation to bioinformatics.