
REGULATORY GENOMIC SEQUENCES

RECOGNITION OF EUKARYOTIC PROMOTERS USING GENETIC ALGORITHM BASED ON ITERATIVE DISCRIMINANT ANALYSIS

Levitsky V.G.*, Katokhin A.V., Lavryushev S.V.

Institute of Cytology and Genetics, SB RAS, Novosibirsk, Russia, e-mail: [email protected]

*Corresponding author

Key words: promoter recognition, genetic algorithm, discriminant analysis, nucleosome potential

Resume

Motivation: The efficiency of methods for recognizing promoters of different types has been shown to depend on how well they account for the hidden complexity of the promoter context. Combining various algorithms increases the recognition accuracy considerably; therefore, combined methods are the most advantageous.

Results: A new approach to recognizing promoter regions of eukaryotic genes is proposed and illustrated with Drosophila melanogaster. Its novelty lies in using a genetic algorithm to search for the optimal partition of the promoter region into local nonoverlapping fragments and to select the most significant dinucleotide frequencies for the fragments obtained. The method was applied to recognizing TATA-containing (TATA+) and DPE-containing (DPE+) promoters of Drosophila melanogaster genes.

Availability: The promoter recognition program is included in the GeneExpress system, section RegScan: http://wwwmgs.bionet.nsc.ru/mgs/programs/proga/.

Introduction

The structure of core promoters is surprisingly diverse and unique to each promoter, presumably reflecting the diversity of interactions between the proteins of the transcription complex and promoter DNA (Goodrich et al., 1996; Kolchanov et al., 2002). Consequently, research into the specific features of promoter context organization is becoming increasingly important (Hannenhalli, Levy, 2001). A number of methods for promoter recognition have been proposed. They detect sets of specific context characteristics of promoter DNA, taking into account their location relative to the transcription start: Markov chain models (Ohler et al., 1999), neural networks (Knudsen, 1999; Reese, 2001), and discriminant analysis (Davuluri et al., 2001). However, recent approaches that combine several algorithms or use both contextual and physicochemical properties of DNA have proved the most efficient (Ohler et al., 2001).

In this work, we propose a new approach to recognizing eukaryotic promoters through detection of local contextual characteristics using a genetic algorithm (GA) based on iterative application of discriminant analysis. GA has proved to be an efficient tool for optimizing functionals that depend on numerous parameters (Willett, 1995). Here, the functional is the linear discriminant function defined by the frequencies of dinucleotides within local promoter regions. This approach allowed us to recognize, along with other promoter types, the Drosophila TATA-less promoters, which lack pronounced context signals.

Methods and Algorithms

The algorithm developed for promoter recognition combines several approaches and methods.

Optimization of parameters of promoter recognition functions. First, (i) a partition of the promoter region into local nonoverlapping fragments is searched for; then, (ii) the most significant dinucleotide frequencies within the fragments obtained are selected. Both stages use the GA with iterative discriminant analysis of the distribution of dinucleotide frequencies over the fragments of the current partition (Levitsky, Katokhin, 2001; Levitsky et al., 2001a). The search for the optimal partition starts by assigning a set of initial partitions at random (Fig. 1a). A partition into P local regions (here, P = 12) specifies N = 16 × P values of dinucleotide frequencies. Samples of (1) promoters and (2) random sequences obtained by shuffling individual promoter sequences were used to construct the recognition function. Let us determine the Mahalanobis distance $R^2$ (Mahalanobis, 1936) for samples (1) and (2):

$$R^2 = \sum_{n=1}^{N} \sum_{k=1}^{N} \left\{ \left[ f_n^{(1)} - f_n^{(2)} \right] S^{-1}_{n,k} \left[ f_k^{(1)} - f_k^{(2)} \right] \right\}. \qquad (1)$$


http://wwwmgs.bionet.nsc.ru/mgs/programs/recon/

BGRS'2002

Here, $f_n^{(1)} \equiv f_{i,p}^{(1)}$ is the mean frequency of the ith dinucleotide in the pth partition fragment for the sample of promoter sequences; $f_n^{(2)}$, the corresponding frequency for the sample of random sequences (n = (p − 1) × 16 + i; p = 1, ..., 12; i = 1, ..., 16; n = 1, ..., N); and the matrix $S^{-1}$ is calculated according to the dinucleotide frequencies $f_n^{(1)}$ and $f_n^{(2)}$.
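As an illustration of the feature layout n = (p − 1) × 16 + i and of the distance in equation (1), the sketch below builds the dinucleotide frequency vector for a given partition and computes $R^2$ between two samples. It is a minimal sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, only dinucleotides lying entirely inside a fragment are counted, and S is assumed to be the pooled within-sample covariance matrix, which the text does not specify.

```python
from itertools import product

DINUCS = ["".join(pair) for pair in product("ACGT", repeat=2)]  # 16 dinucleotides


def fragment_frequencies(seq, start, end):
    """Frequencies of the 16 dinucleotides lying entirely inside seq[start:end]."""
    counts = dict.fromkeys(DINUCS, 0)
    total = 0
    for j in range(start, end - 1):
        pair = seq[j:j + 2]
        if pair in counts:  # pairs with ambiguous letters (e.g. N) are skipped
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DINUCS]


def feature_vector(seq, borders):
    """Concatenate frequencies over fragments [b0, b1), [b1, b2), ..., so that
    0-based element (p - 1) * 16 + (i - 1) is dinucleotide i of fragment p."""
    vec = []
    for start, end in zip(borders, borders[1:]):
        vec.extend(fragment_frequencies(seq, start, end))
    return vec


def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def pooled_covariance(sample1, sample2):
    """Pooled within-sample covariance matrix (an assumed definition of S)."""
    n = len(sample1[0])
    scatter = [[0.0] * n for _ in range(n)]
    for rows in (sample1, sample2):
        mean = column_means(rows)
        for row in rows:
            dev = [x - m for x, m in zip(row, mean)]
            for a in range(n):
                for b in range(n):
                    scatter[a][b] += dev[a] * dev[b]
    dof = len(sample1) + len(sample2) - 2
    return [[value / dof for value in line] for line in scatter]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("S is singular; use fewer features or more sequences")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mahalanobis_r2(sample1, sample2):
    """R^2 = d^T S^{-1} d, with d the difference of the sample mean vectors."""
    d = [a - b for a, b in zip(column_means(sample1), column_means(sample2))]
    weights = solve(pooled_covariance(sample1, sample2), d)
    return sum(di * wi for di, wi in zip(d, weights))
```

Because the 16 frequencies of one fragment sum to one, S built from complete 16-frequency blocks is singular, so in this sketch $R^2$ can only be evaluated on a subset of the frequencies.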

GA is constructed using elementary operations of two types: "mutations" (changes in the positions of borders between regions of the same partition, with a constant number of regions and a constant minimal size of any region; Fig. 1b-e) and "recombinations" (exchange of fragments between two partitions; Fig. 2). The partition most suitable for recognition is determined through successive "mutations" and "recombinations" of the partitions analyzed.

The method for selecting the most significant contextual characteristics. A subset of $m_p \le 16$ dinucleotides is determined for each partition region, so that the total number of selected characteristics is

$$M = \sum_{p=1}^{P} m_p. \qquad (2)$$


Fig. 1. Examples of modifications used in the search for the optimal partition: (a) arbitrary distribution; (b) shift of the border between adjacent regions; (c) shift of a region relative to the neighboring regions; (d) symmetrical shift of the region's borders relative to its center; and (e) joining and splitting.

Fig. 2. Examples of crossovers used in the search for the optimal partition: (a) introduction of a "break" (dotted lines) into two initial partitions 1 and 2 (indicated by different colors); (b) fragments of partitions after the break; and (c) exchange of fragments, removal of the break, and formation of the final partitions 1′ and 2′.
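The operations of Figs. 1 and 2 can be sketched on a partition stored as a sorted list of border positions. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, only the random initial partition (Fig. 1a), the border shift (Fig. 1b), and a single-break crossover (Fig. 2) are shown, both parents are assumed to cover the same window, and offspring with too-short fragments are simply rejected. Whether the original program also fixes the number of fragments after recombination is not stated in the text, and this sketch does not enforce it.

```python
import random


def is_valid(borders, min_size):
    """True if every fragment between consecutive borders is long enough."""
    return all(b - a >= min_size for a, b in zip(borders, borders[1:]))


def random_partition(length, n_fragments, min_size, rng):
    """Fig. 1a: random borders with every fragment at least min_size long."""
    slack = length - n_fragments * min_size
    if slack < 0:
        raise ValueError("window too short for this many fragments")
    cuts = sorted(rng.randint(0, slack) for _ in range(n_fragments - 1))
    inner = [cut + (i + 1) * min_size for i, cut in enumerate(cuts)]
    return [0] + inner + [length]


def shift_border(borders, index, delta, min_size):
    """Fig. 1b: move one internal border; keep the parent if the move is invalid."""
    if not 0 < index < len(borders) - 1:
        raise ValueError("only internal borders can be moved")
    mutated = borders[:]
    mutated[index] += delta
    return mutated if is_valid(mutated, min_size) else borders[:]


def crossover(part1, part2, break_pos, min_size):
    """Fig. 2: cut both partitions at break_pos and exchange the right-hand parts.

    Assumes 0 <= break_pos < window length, so both children keep the
    outer borders shared by the parents."""
    child1 = [b for b in part1 if b <= break_pos] + [b for b in part2 if b > break_pos]
    child2 = [b for b in part2 if b <= break_pos] + [b for b in part1 if b > break_pos]
    if is_valid(child1, min_size) and is_valid(child2, min_size):
        return child1, child2
    return part1[:], part2[:]  # reject offspring with too-short fragments
```

For example, `random_partition(400, 12, 10, random.Random(0))` yields 13 borders for a 400-bp window split into 12 fragments of at least 10 bp each; the minimal fragment size of 10 bp is an arbitrary value chosen for the illustration.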

Construction of the promoter recognition function. The value of the recognition function is calculated for an arbitrary nucleotide sequence at each position of a sliding window 400 bp long (fragment X):

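$$\varphi(X) = \frac{1}{R^2} \sum_{n=1}^{M} \sum_{k=1}^{M} \left\{ \left[ f_n(X) - \frac{1}{2}\left( f_n^{(1)} + f_n^{(2)} \right) \right] S^{-1}_{n,k} \left[ f_k^{(1)} - f_k^{(2)} \right] \right\}, \qquad (3)$$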

where $f_n(X)$ are the dinucleotide frequencies in the fragment X calculated for its partition. The distance $R^2$ is calculated using equation (1) by summing over the M selected dinucleotides. Values of the function $\varphi(X)$ close to +1 correspond to a higher probability that X is a promoter. To recognize promoters at a specified significance level $\alpha$, the recognition function $\varphi(X)$ (3) was transformed as follows:

$$\varphi_\alpha(X) = \begin{cases} 1 - \dfrac{|1 - \varphi(X)|}{P_\alpha \sigma_\varphi}, & \text{if } |1 - \varphi(X)| < P_\alpha \sigma_\varphi, \\ 0, & \text{otherwise.} \end{cases} \qquad (4)$$

Here, $P_\alpha$ is the quantile of the standard normal distribution for the two-sided level $\alpha$ (for example, $P_{0.95} = 1.96$) and $\sigma_\varphi$ is the standard deviation of the recognition function $\varphi(X)$ values over the sample of promoter sequences.

Assessment of the recognition accuracy. The correlation coefficient (CC) characterizes the overall recognition accuracy, taking into account both the rate of false positives and the rate of false negatives:


$$CC = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}, \qquad (5)$$

where TP and FP are the numbers of true and false promoter predictions, and TN and FN are the numbers of true and false "non-promoter" predictions.

Recognition of promoters using calculated recognition functions. For an arbitrary nucleotide sequence, the profile of nucleosome potential (NP; Levitsky, Katokhin, 2000) was calculated in addition to the promoter recognition function (4). The NP profile makes it possible to discard the predictions of function (4) at sequence positions whose NP values are not typical of promoter regions. We have earlier demonstrated that the region [−50; +1] relative to the transcription start exhibits decreased mean NP values (Levitsky, Katokhin, 2000; Levitsky et al., 2001a). To calculate the threshold values $NP_{ST}$ used in promoter recognition, we took the mean value $NP_{PR}$ over the region [−50; +1] relative to the transcription start for each promoter sample studied in this work, with its standard deviation $\sigma_{PR}$, and the mean NP value for the sample of D. melanogaster introns, $NP_{INT}$, with its standard deviation $\sigma_{IN}$:

$$NP_{ST} = \frac{NP_{PR}\,\sigma_{PR} + NP_{INT}\,\sigma_{IN}}{\sigma_{PR} + \sigma_{IN}}. \qquad (6)$$

The values of $NP_{INT}$ were taken into account because we found that the majority of false predictions fell into intron regions, and we have earlier demonstrated that introns display a high NP (Levitsky et al., 2001b). Consequently, using $NP_{INT}$ allows false predictions of the promoter recognition function (4) to be excluded. Thus, according to the combined approach developed, a position X in the analyzed sequence is recognized as a promoter if the following two conditions are met:

$$\varphi_\alpha(X) > 0, \qquad (7a)$$

$$NP(X) < NP_{ST}. \qquad (7b)$$

Here, $\varphi_\alpha(X)$ is the promoter recognition function (4); NP(X), the mean NP value over the region [−50; +1] relative to the putative transcription start (position X); and $NP_{ST}$ is calculated according to equation (6).
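The threshold of equation (6) and the two conditions (7a) and (7b) can be sketched as follows. The function names are hypothetical. Two readings are assumed here rather than confirmed by the text: equation (4) is taken as 1 − |1 − φ(X)| / (P_α σ_φ) inside the acceptance band, so that the best matches score highest, and the mean NP values in equation (6) are weighted by their own standard deviations, which is the weighting that reproduces the thresholds reported in Results and Discussion (0.33 and 0.02). The value of σ_φ used in the usage note is made up for illustration.

```python
def np_threshold(np_pr, sd_pr, np_int, sd_in):
    """Threshold NP_ST between the promoter and intron mean NP values, eq. (6)."""
    return (np_pr * sd_pr + np_int * sd_in) / (sd_pr + sd_in)


def phi_alpha(phi, sd_phi, quantile=1.96):
    """Transformed recognition function, eq. (4), as read in this sketch."""
    deviation = abs(1.0 - phi)
    limit = quantile * sd_phi
    return 1.0 - deviation / limit if deviation < limit else 0.0


def is_promoter(phi, sd_phi, np_value, np_st, quantile=1.96):
    """Combined rule: condition (7a) on phi_alpha and condition (7b) on NP."""
    return phi_alpha(phi, sd_phi, quantile) > 0 and np_value < np_st


# Sample statistics reported in Results and Discussion.
np_st_tata = np_threshold(0.04, 0.73, 0.534, 1.05)
np_st_dpe = np_threshold(-0.625, 0.85, 0.534, 1.05)
```

With a hypothetical σ_φ = 0.3, `is_promoter(0.9, 0.3, -0.5, np_st_tata)` accepts the position, whereas the same φ value with NP(X) = 0.6 is rejected by condition (7b).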

Results and Discussion

To construct the promoter recognition function, we took 236 fragments [−300; +100] of D. melanogaster promoter sequences, phased relative to the transcription start, from the database EnDPD (Katokhin, Levitsky, 2000). Two promoter samples, TATA-containing (TATA+) and DPE-containing (DPE+), were formed. The TATA+ promoter sample comprised 68 sequences with $Score_{TATA} \ge -5$, where $Score_{TATA}$ is the value of the weight matrix (Bucher, 1990) in the region [−40; −5] relative to the transcription start. DPE-containing promoters are characterized by the DPE (Downstream Promoter Element; Kutach, Kadonaga, 2000), a context signal weaker than the TATA box. Earlier, these promoters were assigned to the heterogeneous group of TATA-less promoters. The DPE+ sample comprised 31 sequences from the Drosophila Core Promoter Database (http://www-biology.ucsd.edu/labs/Kadonaga/DCPD.html; Kutach, Kadonaga, 2000), tested by BLAST analysis against the 5′ ends of the corresponding ESTs.

Application of the GA utilizing iterative discriminant analysis at the first stage (search for the optimal partition) gave the following accuracy estimates for discrimination from the random sequences: $CC_{TATA+}$ = 0.92 and $CC_{DPE+}$ = 0.70. After the second GA stage (selection of the most significant contextual characteristics), the estimates amounted to $CC_{TATA+}$ = 0.92 and $CC_{DPE+}$ = 0.82. Note that the two-stage GA substantially increased the recognition accuracy for DPE+ promoters. Thus, selection of significant contextual characteristics is most effective in increasing the recognition accuracy for promoters with weakly pronounced contextual signals. A further increase in the recognition accuracy is achieved by considering the values of the promoter recognition function (4) together with the nucleosome potential, so that predictions at positions lacking the NP values typical of promoters are discarded.
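The CC estimates quoted above are correlation coefficients in the sense of equation (5). A minimal sketch of that calculation is given below; the function name is hypothetical, and the counts in the usage note are invented for illustration, since the text reports only the resulting CC values and not the underlying TP, TN, FP, and FN counts.

```python
from math import sqrt


def correlation_coefficient(tp, tn, fp, fn):
    """Correlation coefficient of eq. (5) from the four prediction counts."""
    denominator = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention used in this sketch when a margin is empty
    return (tp * tn - fn * fp) / denominator
```

For instance, with 68 promoters and 68 random sequences, hypothetical counts TP = TN = 60 and FP = FN = 8 give `correlation_coefficient(60, 60, 8, 8)` of about 0.76, and a perfect separation gives 1.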
The mean values of $NP_{PR}$ for the promoter samples used in this work amount to $NP_{TATA+}$ = 0.04 and $NP_{DPE+}$ = −0.625; the standard deviations ($\sigma_{PR}$) equal 0.73 and 0.85, respectively. The mean NP value for the sample of intron fragments amounts to $NP_{INT}$ = 0.534 with a standard deviation of $\sigma_{IN}$ = 1.05. Thus, according to equation (6), we obtain the following threshold values: $NP_{ST}$ = 0.33 for TATA+ and $NP_{ST}$ = 0.02 for DPE+ promoters.

The combined approach for eukaryotic promoter recognition is implemented as a web-available program included in the GeneExpress system, section RegScan: http://wwwmgs.bionet.nsc.ru/mgs/programs/proga/. It determines the positions of putative promoters in sequences up to 32000 bp long. The promoter type and the use of the NP filter are selected by the user.

Let us illustrate the efficiency of the combined approach developed. Figure 3 shows the profiles of the function $\varphi_\alpha(X)$ (4) at $\alpha$ = 0.95 for the promoters of the genes zen (TATA+; Fig. 3a) and Cyt-C2 (DPE+; Fig. 3c) and the profiles of their nucleosome potential (Figs. 3b and 3d, respectively). The annotated sequences of the genes zen (FBgn0004053) and Cyt-C2 (FBgn0000409) were retrieved from FlyBase (http://flybase.bio.indiana.edu/genes/). Pronounced minima of the nucleosome potential, presumably corresponding to regions with less dense nucleosome packaging (Levitsky et al., 2001a), are evident in the regions of the transcription starts. It is also apparent that the peaks of the promoter recognition function fall into these regions. Note that the peaks of the recognition function for the Cyt-C2 gene located near positions 1200, 1660, and 1700 (indicated with crosses; Fig. 3c) were discarded, as the corresponding NP values (Fig. 3d) fail to meet condition (7b).

[Figure 3 plots. Panels A and C: recognition function value (0 to 1) versus position, bp. Panels B and D: nucleosome potential (−3 to 3) versus position, bp. Position range 2500 to 3500 bp in panels A and B and 1300 to 2300 bp in panels C and D; the transcription start is marked in each panel.]

Fig. 3. Profiles of (a) the TATA+ promoter recognition function calculated for the gene zen; (b) its nucleosome potential; (c) the DPE+ promoter recognition function calculated for the gene Cyt-C2; and (d) its nucleosome potential. Arrows indicate transcription starts; crosses, the recognition function peaks discarded according to condition (7b).

Acknowledgements

The work was supported in part by the Russian Foundation for Basic Research (grants № 01-07-90376, 02-07-90355, and 00-04-49229); Russian Ministry of Industry, Science, and Technologies (grant № 43.073.1.1.1501); Siberian Branch of the Russian Academy of Sciences (Integration Project № 65); US National Institutes of Health (grant № 2 R01-HG-01539-04A2); and US Department of Energy (grant № 535228 CFDA 81.049).

References

1. Bucher P. (1990). Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J. Mol. Biol. 212, 563-578.

2. Davuluri R.V., Grosse I., Zhang M.Q. (2001). Computational identification of promoters and first exons in the human genome. Nat. Genet. 29(4), 412-417.

3. Goodrich J.A., Cutler G., Tjian R. (1996). Contacts in context: promoter specificity and macromolecular interactions in transcription. Cell. 84(6), 825-830.

4. Hannenhalli S., Levy S. (2001). Promoter prediction in the human genome. Bioinformatics. 17, Suppl. 1, S90-S96.

5. Katokhin A.V., Levitsky V.G. (2000). Drosophila Promoter Database EnDPD: project and the first steps of its realization. In: Proc. IId Int. Conf. on Bioinf. Genome Regulation and Structure, Novosibirsk (eds. Kolchanov N.A. et al.), III, 105-108.

6. Knudsen S. (1999). Promoter2.0: for the recognition of PolII promoter sequences. Bioinformatics. 15, 356-361.

7. Kolchanov N.A., Ignatieva E.V., Ananko E.A., Podkolodnaya O.A., Stepanenko I.L., Merkulova T.I., Pozdnyakov M.A., Podkolodny N.L., Naumochkin A.N., Romashchenko A.G. (2002). Transcription Regulatory Regions Database (TRRD): its status in 2002. Nucl. Acids Res. 30(1), 312-317.

8. Kutach A.K., Kadonaga J.T. (2000). The downstream promoter element DPE appears to be as widely used as the TATA box in Drosophila core promoters. Mol. Cell. Biol. 20(13), 4754-4764.

9. Levitskii V.G., Katokhin A.V. (2001). Computer analysis and recognition of Drosophila melanogaster gene promoters. Mol. Biol. (Mosk.). 35(6), 970-978.


10. Levitsky V.G., Katokhin A.V. (2000). Characteristic modular promoter structure and its application to development of recognition program software. In: Proc. IId Intern. Conf. on Bioinf. Genome Regulation and Structure, Novosibirsk (eds. Kolchanov N.A. et al.), I, 86-89.

11. Levitsky V.G., Podkolodnaya O.A., Kolchanov N.A., Podkolodny N.L. (2001a). Nucleosome formation potential of eukaryotic DNA: tools for calculation and promoters analysis. Bioinformatics. 17, 998-1010.

12. Levitsky V.G., Podkolodnaya O.A., Kolchanov N.A., Podkolodny N.L. (2001b). Nucleosome formation potential of exons, introns, and Alu repeats. Bioinformatics. 17, 1062-1064.

13. Mahalanobis P.C. (1936). On the generalised distance in statistics. Proc. Natl Inst. Sci. India. 12, 49-55.

14. Ohler U., Harbeck S., Niemann H., Noth E., Reese M.G. (1999). Interpolated Markov chains for eukaryotic promoter recognition. Bioinformatics. 15, 362-369.

15. Ohler U., Niemann H., Liao G., Rubin G.M. (2001). Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition. Bioinformatics. 17(Suppl.1), S199-206.

16. Reese M.G. (2001). Application of a time-delay neural network to promoter annotation in the Drosophila melanogaster genome. Comput. Chem. 26(1), 51-56.

17. Willett P. (1995). Genetic algorithms in molecular recognition and design. Trends Biotechnol. 13(12), 516-521.

