M.Sc. dissertation report by Kalyan Kumar Pasumarthy
NON-CODING RNA PREDICTION OF CLINICALLY
IMPORTANT MYCOPLASMA BY COMPARATIVE
GENOMIC ANALYSIS
Dissertation submitted to the Madurai Kamaraj University
In partial fulfillment for the requirement of
Masters of Science in Biotechnology
Submitted by
Reg No: A242009
SCHOOL OF BIOTECHNOLOGY
MADURAI KAMARAJ UNIVERSITY
MADURAI 625 021
May 2004
To THE SMALL AND POWERFUL
Non-coding RNA
DECLARATION
I declare that this dissertation entitled Non-coding RNA prediction of
clinically important Mycoplasma using comparative genome analysis submitted by
me in partial fulfillment for the requirement of Masters of Science in Biotechnology to
the Madurai Kamaraj University is based on the work carried out by me in the School of
Biotechnology, Madurai Kamaraj University, Madurai under the guidance and
supervision of Dr. Z. A. Rafi, Reader, School of Biotechnology, Madurai Kamaraj
University, Madurai. I also declare that this dissertation or any part of it has not been
submitted elsewhere for any other degree or diploma.
Madurai-21 Regn. No.:A242009
May 7, 2004
ACKNOWLEDGEMENTS
I owe my gratitude to Dr. Z. A. Rafi for his guidance and supervision in
this project. His care and concern have been the driving force for me all through this work.
I am thankful for his constant advice and encouragement. I am thankful to Prof.
S. Krishnaswamy for introducing me to the field of Bioinformatics.
I would also like to thank my classmates Anurag, Basanth, Dinesh, Geeta,
Hridesh, Kaiser, Netrapal, Subhanjan, Sucharitha, Vijay, for their support and company
during the past two years, that made my stay in Madurai a memorable one. I would like
to thank Deepak for his help in creating a C programme.
My special thanks are due to my roommate and friend Santosh for his
constructive criticism of my mistakes. I acknowledge my special friend Ayushi, who has
been a rich source of encouragement and entertainment during the last phase at MKU.
I am indebted to the entire School of Biotechnology for making my M.Sc
an intellectually stimulating experience.
I also acknowledge the Dept. of Science and Technology, Government of
India, for its financial support over the last five years through the Kishore Vaigyanik Protsahan
Yojana, and the Dept. of Biotechnology, Government of India, for supporting this project.
CONTENTS
1. Briefing
2. Introduction
3. Review of Literature
4. Materials
5. Methods
6. Results
7. Discussion
8. References
BRIEFING
Small untranslated RNA molecules are found in all kingdoms of life.
Many of those discovered to date are conserved between closely related
organisms and have a characteristic secondary structure. They regulate
diverse functions, chiefly gene expression. Non-coding RNAs (ncRNAs)
are difficult to detect biochemically or to predict by traditional sequence analysis.
To search for ncRNAs that may play an important role in the life cycle of
pathogenic Mycoplasma, we used a well-established computational strategy that
distinguishes conserved RNA secondary structures from a background of other conserved
sequences using probabilistic models of expected mutational patterns in pairwise
sequence alignments.
We report here a complete genome screen for ncRNAs carried out with this
method on the six completely sequenced Mycoplasma genomes available, using
comparative sequence analysis. The screen yielded several putative ncRNAs.
The majority of the predicted ncRNA sequences are 130-160 nucleotides long, and
the number of ncRNAs predicted was in proportion to genome size
for all but one genome. Our candidate ncRNAs showed similarity to a few
biochemically characterized ncRNAs in bacteria as well as eukaryotes. This suggests that
ncRNAs are broadly conserved across the kingdoms of life. This finding
makes our putative ncRNAs suitable candidates for drug discovery and
developmental studies of Mycoplasma.
INTRODUCTION
The central dogma of molecular biology defines a general pathway for the
expression of genetic information: it is stored in DNA, transcribed into transient mRNA and
decoded on ribosomes with the help of adapter RNAs to produce proteins, which in turn
perform all enzymatic and structural functions in the cell. According to this view, RNAs
play a rather accessory role and the complexity of a given organism is defined by the
constellation of proteins encoded by its genome. However, the discovery of RNAs
performing enzymatic and other functional roles in the cell complicated this
picture.
The discovery of the catalytic nature of RNase P and of the self-splicing activity of group I
introns suggested that the functions of RNA go far beyond a passive role in the expression of
protein coding genes. More recent discoveries have attributed a variety of regulatory roles to
RNA, including control of plasmid replication, transposition in prokaryotes and
eukaryotes, phage development, viral replication, bacterial virulence, global circuits in
bacteria in response to environmental changes, and developmental control in lower
eukaryotes.
These functions suggest that RNAs once regarded as
non-functional are not merely molecular fossils left from time immemorial. Analyses
of several sequenced genomes suggest that protein-coding genes alone are not enough to
account for the complexity of higher organisms. Genomic analysis has shown that as
an organism’s complexity increases, the protein coding fraction of its genome
decreases. It is estimated that about 98% of the transcriptional output of eukaryotic genomes and up to
10% of that of prokaryotic genomes is RNA that does not encode any protein.
In this context, ncRNAs are defined as a heterogeneous group of transcripts with
a wide functional spectrum. Broadly, ncRNAs can be divided into two classes:
1. Housekeeping ncRNAs, which are constitutively expressed and required for the normal
functions and viability of the cell.
2. Regulatory ncRNAs, which by contrast are expressed at certain stages
of an organism’s development or cell differentiation, or in response to external
stimuli.
Many of these ncRNAs were discovered by chance while researchers were
studying individual genetic systems. ncRNA species have been difficult to detect by
targeted experimental procedures or by traditional computational approaches.
In the present study, an attempt has been made to screen the completely sequenced,
clinically important Mycoplasma* genomes for ncRNAs by
comparative sequence analysis.
*M. penetrans, M. mycoides, M. gallisepticum, M. pulmonis, M. pneumoniae, M. genitalium
LITERATURE
Gene finding algorithms generally assume that the target is a protein
coding gene that produces mRNA, and so they fail to detect ncRNAs.
However, a few computational strategies have recently emerged to detect ncRNAs.
They can be classified into the following four categories:
Sequence similarity analysis: This simply searches a newly sequenced genome for
similarity to known ncRNAs [Lowe et al., 1991; Lowe et al., 1999; Zwieb et al,
1999].
Transcriptional signal analysis: This is based on the fact that ncRNAs are transcribed but
not translated. It is a systematic search for ncRNA genes that have
transcriptional signals but no translational signals [Argaman et al., 2001; Olivas et al.,
1997].
Statistical analysis: This involves analysing the base composition statistics of non-coding
regions in comparison with coding regions [Schattner, 2002].
Comparative genomic analysis: Sequences conferring important characteristics are
conserved across related genomes, and the same is assumed to hold for
ncRNAs. A comparative analysis of related genomes is used to screen for
ncRNAs conserved across them [Elena Rivas 2001; Elena Rivas et al., 2001;
Wassarman et al., 1999].
The aim of the current study is to find ncRNAs that may play an important
role in the pathogenesis of clinically important Mycoplasma. The study
was carried out using the comparative genomic analysis approach. This approach was chosen
because these Mycoplasma species share a common characteristic, the ability to cause
disease. A comparative genomic analysis is therefore assumed to highlight the group of ncRNAs
that contribute to pathogenesis.
In our approach to predicting ncRNAs by comparative genomics we used the
computational tool QRNA [Elena Rivas 2001], the heart of our project. The following
paragraphs describe how it evolved and how it works.
Earlier RNA gene finding approaches had been explored, but
with limited success [Elena Rivas 2000]. An early hypothesis was that
biologically functional RNAs may have more stable predicted secondary
structures than would be expected for a random sequence of the same base composition
[Chen JH et al., 1990; Le SY et al., 1988; 1990]. Although this hypothesis holds to a certain
extent, it has been reported that the stability of predicted secondary structures alone
does not give a usable signal: the predicted stability of structural RNAs is
not sufficiently distinguishable from that of random sequences to serve as
the basis for a reliable ncRNA gene finding algorithm [Elena Rivas 2000]. Nonetheless,
conserved RNA secondary structure remained the best hope for an exploitable statistical
signal in ncRNA genes. Hence, secondary structure prediction was coupled to comparative
sequence analysis to obtain additional statistical signals [Elena Rivas 2001].
Comparative sequence analysis for ncRNA genes builds on
earlier work that used the BLASTN programme to locate genomic regions with significant
sequence similarity between two related bacterial species. A computational tool,
CRITICA, analyzed the pattern of mutation in these ungapped, aligned conserved regions
for evidence of coding structure [Badger 1999]. For example, mutations to synonymous
codons get positive scores, while aligned triplets that translate to dissimilar amino acids
get negative scores. The programme then extends any
ungapped seed alignments classified as coding into complete ORFs.
QRNA extends the CRITICA approach to identify structural RNA regions. The
extensions include:
1. using fully probabilistic models;
2. adding a third model of pairwise alignments constrained by structural RNA
evolution;
3. allowing gapped alignments; and
4. allowing for the possibility that only part of the pairwise alignment may represent
a coding region or structural RNA, because a primary sequence alignment may
extend into flanking non-coding or nonstructural conserved sequence.
These extensions add complexity to the approach, so probabilistic modelling
methods and formal grammars are used to guide the construction: pair hidden Markov
models (pair-HMMs) and a pair stochastic context-free grammar (pair-SCFG) are used to produce three
evolutionary models, for coding regions, structural RNA and everything else. Given the three
probabilistic models and a pairwise sequence alignment to be tested, QRNA calculates
the Bayesian posterior probability that the alignment should be classified as coding,
structural RNA or something else.
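The classification step can be illustrated with a small numerical sketch. It assumes, and this is an assumption rather than something stated in the text, that the three model scores printed by QRNA are log2 likelihoods (bits) and that the three models have equal prior probability. The scores are those of the window shown in Fig7 (OTH = 184.281, COD = 166.408, RNA = 179.710); the function names are illustrative and are not part of QRNA.

```python
import math


def log2_sum_exp2(values):
    """Return log2(sum(2**v for v in values)) without overflow."""
    top = max(values)
    return top + math.log2(sum(2.0 ** (v - top) for v in values))


def posterior_log_odds(scores):
    """For each model m, return log2(P(m|data) / (1 - P(m|data))).

    Assumption: `scores` are log2 likelihoods and the priors are uniform,
    so P(m|data) is proportional to 2**scores[m].
    """
    result = {}
    for model, score in scores.items():
        others = [s for name, s in scores.items() if name != model]
        result[model] = score - log2_sum_exp2(others)
    return result


# Scores of the window shown in Fig7.
fig7_scores = {"OTH": 184.281, "COD": 166.408, "RNA": 179.710}
log_odds = posterior_log_odds(fig7_scores)
winner = max(fig7_scores, key=fig7_scores.get)

for model in ("OTH", "COD", "RNA"):
    print(f"{model}: {log_odds[model]:+.3f}")
print("winner =", winner)
```

Under these assumptions the three values agree, to three decimal places, with the sigmoidalOTH, sigmoidalCOD and sigmoidalRNA lines of Fig7 (4.571, -17.932 and -4.571). This suggests, but does not prove, that those lines are posterior log-odds scores of this kind.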
QRNA screens for conserved RNA secondary structures. It also detects
various non-genic sequences with conserved RNA structures, including rho-independent
terminators, rRNA spacers, transcriptional attenuators in ribosomal protein and amino
acid biosynthetic operons, other cis-regulatory RNA structures, and even certain
repetitive elements forming pseudoknots, stem-loops, palindromic sequences, etc.
The predicted targets are referred to as ncRNA genes, but it must be
understood that this really means a conserved RNA secondary structure that may or may
not turn out to be an independent functional ncRNA gene upon further analysis.
MATERIALS
System configuration
Hardware specification:
Machine Name : Pentium IV
CPU Speed : 2.8GHz
RAM Memory : 512MB
Hard disk : 80GB
Operating system specifications:
Red hat Linux 9.0
Microsoft Windows XP
Packages installed and Applications used:
Red Hat Linux 9.0
EMBOSS-2.8.0
WU BLAST 2.0
QRNA
Microsoft Office
Perl 5.0
Selected Genomes for the study:
Mycoplasma penetrans
Mycoplasma mycoides
Mycoplasma gallisepticum
Mycoplasma pulmonis
Mycoplasma pneumoniae
Mycoplasma genitalium
METHODS
Downloading the genomes of Mycoplasma:
For each of the selected organisms, a folder containing the genome in various formats was
downloaded from the NCBI ftp site ftp://ftp.ncbi.nlm.nih/Bacteria/Mycoplasma_Species. The
files used were the whole genome nucleotide sequence in fasta format
(accession_number.fna file) and the protein table, which lists the
start and end coordinates of the protein coding regions
(accession_number.ptt file).
Preparing range file of intergenic regions:
Range file preparation involved three steps, starting from the
coordinates of the protein coding regions.
Getting coordinates of the protein coding regions –
The protein table file was opened in Microsoft Word and the options Convert: Table
to Text and Text to Table were used to make a table containing only the protein coding region
coordinates.
Getting coordinates for intergenic regions –
The protein coding coordinates were pasted into a Microsoft Excel file and simple
arithmetic was used to obtain the coordinates of the intergenic regions.
Making a range file –
Finally, the intergenic region coordinates were copied
into a Notepad file and given as input to a C programme that selects only the intergenic
regions longer than 49 nucleotides, because biochemically characterized ncRNA genes are at
least 50 nucleotides long. The resulting range file was used as input for the emboss
applications.
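The three steps above can be sketched in a few lines of code. This is a minimal illustration of the arithmetic, not the Word, Excel and C workflow actually used: the function names are hypothetical, coordinates are taken to be 1-based and inclusive as in the protein table, and the stretch after the last gene is not handled. The input is the first six coding regions listed in Fig1.

```python
def intergenic_regions(coding, min_len=50):
    """Return (start, end, length) gaps between coding regions.

    `coding` holds (start, end) pairs in 1-based inclusive coordinates.
    Overlapping genes give a gap of length 0, which the length filter
    then discards together with every gap shorter than `min_len`.
    """
    regions = []
    prev_end = 0  # position just before the first nucleotide
    for start, end in sorted(coding):
        gap_start, gap_end = prev_end + 1, start - 1
        length = max(0, gap_end - gap_start + 1)
        if length >= min_len:
            regions.append((gap_start, gap_end, length))
        prev_end = max(prev_end, end)
    return regions


def range_file_lines(regions, comment):
    """Format regions as a comment line plus one 'start end' pair per line."""
    return ["#" + comment] + [f"{start} {end}" for start, end, _ in regions]


# First six coding regions from the protein table shown in Fig1.
coding = [(735, 1829), (1829, 2761), (2846, 4798),
          (4813, 7323), (7295, 8548), (8552, 9184)]
regions = intergenic_regions(coding)
lines = range_file_lines(regions, "example range file")
print("\n".join(lines))
```

For these six genes only the regions 1 to 734 and 2762 to 2845 pass the 50 nucleotide filter, which matches the first two rows of the filtered list in Fig2.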
Extracting the intergenic regions from the genomes:
The extractseq application in the emboss suite was used to write each intergenic
region of the genome as a separate sequence in fasta format. This procedure was repeated for each
genome.
Syntax: extractseq -regions @rangefile -separate
Making Genome databases:
The database formatting programme supplied with the WU BLAST 2.0
suite was used to make the databases. Each database comprised five genomes, excluding the
genome that was to be searched against it.
Syntax: xdformat -n -o database_name
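The leave-one-out composition of the databases can be written down explicitly. The sketch below only enumerates which genomes go into which database; the variable and function names are illustrative, and no database is actually built.

```python
GENOMES = [
    "M.gallisepticum", "M.genitalium", "M.mycoides",
    "M.penetrans", "M.pneumoniae", "M.pulmonis",
]


def leave_one_out(genomes):
    """Map each query genome to the list of the other genomes."""
    return {query: [g for g in genomes if g != query] for query in genomes}


databases = leave_one_out(GENOMES)
for query, members in databases.items():
    print(f"{query}: search against {', '.join(members)}")
```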
Similarity search:
The intergenic regions of each genome were searched with the
blastn programme from the WU BLAST 2.0 suite, using default parameters, against the database
that does not contain that organism’s genome.
Syntax: blastn database_name nucleotide_query >output_file_name
Parsing WU BLAST 2.0 outputs:
The WU BLAST 2.0 output file must be parsed before it can be passed to QRNA.
Parsing was done with the perl script blastn2qrnadepth.pl, supplied
with the QRNA-2.0.1 suite, using default parameters. The script writes three output files;
the one with the file_name.q extension is used as input for the QRNA
application.
Syntax: blastn2qrnadepth.pl -g query_organism file_name
Non-coding RNA prediction:
The file_name.q file obtained from the parsed blast output was used as input
for QRNA with a scanning window of 150 nucleotides that moves 50 nucleotides at a time. The option -B
was used to avoid false positive scores.
Syntax: qrna -w 150 -x 50 -B input_file_name > output_file_name
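The effect of the window and slide settings can be sketched as follows. The coordinates are 0-based and inclusive, like the window positions printed by qrna (for example 0-149). How QRNA itself treats the final, shorter window is not documented here, so the truncation rule in this sketch is an assumption, and the function name is hypothetical.

```python
def scan_windows(alignment_length, window=150, slide=50):
    """Yield (first, last) 0-based inclusive coordinates of each window.

    Assumption: the last window is truncated at the end of the alignment.
    """
    start = 0
    while start < alignment_length:
        last = min(start + window, alignment_length) - 1
        yield (start, last)
        if last == alignment_length - 1:
            break
        start += slide


# The alignment shown in Fig7 is 664 columns long.
windows = list(scan_windows(664))
print(len(windows), windows[0], windows[1], windows[-1])
```

Each alignment column is covered by up to three overlapping windows, which is why a larger slide reduces the CPU time at the cost of coarser localisation.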
Extraction of loci identified as ncRNA:
The perl script phase_count_fast.pl was used with default parameters to prune the QRNA output to
the independent genomic regions that are identified as RNAs.
The nucleotide sequences of the predicted ncRNAs were extracted by the same
procedure used for extracting the intergenic regions.
Syntax: phase_count_fast.pl file_name query_org database_org
RESULTS
Intergenic regions are a rich source of ncRNAs. As a first
step, the contribution of the intergenic regions to the genome of each organism was
calculated. Graph 1 shows the length of the selected genomes and Graph 2 displays the
percentage of intergenic regions in each genome. Graph 2 makes it clear that
intergenic sequence forms a small part of the genome compared with the protein coding regions. This agrees
with the common feature of prokaryotes, which possess only a small percentage of
intergenic regions [Mattick 2001].
The number of intergenic sequences was high, and
many of them were short stretches. Since biochemically
characterized ncRNA genes have a minimum length of 50 nucleotides, only the stretches
of 50 nucleotides or more were considered. This
culling was done by an in-house C programme. Graph 3 displays the number of intergenic regions
before and after culling. About half of the intergenic
regions (52% overall, ranging from 38% in M. penetrans to 75% in M. genitalium) were eliminated by this criterion.
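The fraction of intergenic regions removed by the length filter can be checked directly from the counts given with Graph 3. The short sketch below does only that arithmetic; the variable names are illustrative.

```python
# Numbers of intergenic regions before and after the length filter (Graph 3).
before = {"M.pen": 1037, "M.myc": 1016, "M.gal": 726,
          "M.pul": 782, "M.pne": 689, "M.gen": 480}
after = {"M.pen": 643, "M.myc": 572, "M.gal": 290,
         "M.pul": 376, "M.pne": 282, "M.gen": 122}


def percent_removed(n_before, n_after):
    """Percentage of regions discarded by the filter."""
    return 100.0 * (n_before - n_after) / n_before


removed = {org: percent_removed(before[org], after[org]) for org in before}
overall = percent_removed(sum(before.values()), sum(after.values()))

for org, pct in removed.items():
    print(f"{org}: {pct:.1f}% removed")
print(f"overall: {overall:.1f}% removed")
```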
The current analysis is based on the prediction of conserved secondary
structures from sequence similarity between the available genomes.
Hence, databases of groups of the organisms under study were created. Each database was a
collection of the genomes of the other five organisms, excluding the one under
study, and the intergenic regions of the organism under study were searched for similarity
against it. Table 1 lists each organism and the contents of the database
against which it was searched. The table also gives
the number of hits fed as input to QRNA after filtering with the
perl script blastn2qrnadepth.pl, which removes hits below the
threshold as described in the Methods. These numbers show the relative
extent of similarity between the organisms with respect to genome size. The
results of the perl script are displayed in Graph 4. The graph indicates that
the number of similarity hits generally increases with genome size,
except in M. gallisepticum. This suggests that this
organism may have sequence characteristics different from those of the other
selected organisms.
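One way to make the comparison with genome size concrete is to express the hit counts of Table 1 per 100,000 nucleotides of genome, using the genome sizes listed with Graph 1. This normalisation is an illustration added here and was not part of the original analysis; the names are hypothetical.

```python
# Genome sizes (nucleotides, Graph 1) and blastn hit counts (Table 1).
genome_size = {"M.gen": 580074, "M.pne": 816394, "M.pul": 963879,
               "M.gal": 996422, "M.myc": 1211703, "M.pen": 1358633}
blast_hits = {"M.gen": 386, "M.pne": 565, "M.pul": 1012,
              "M.gal": 850, "M.myc": 1787, "M.pen": 1852}


def hits_per_100kb(hits, size):
    """Number of hits per 100,000 nucleotides of genome."""
    return hits / (size / 100_000)


density = {org: hits_per_100kb(blast_hits[org], genome_size[org])
           for org in genome_size}

# Print the organisms in order of increasing genome size.
for org in sorted(genome_size, key=genome_size.get):
    print(f"{org}: {genome_size[org]:>9,d} nt  "
          f"{blast_hits[org]:>5d} hits  {density[org]:6.1f} hits/100 kb")
```

Listed in order of genome size, M. gallisepticum has fewer hits (850) than the slightly smaller M. pulmonis genome (1012), which is the departure from the general trend noted above.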
The similarity hits that passed the threshold were
evaluated by QRNA using a window scanning approach. A window size of 150
nucleotides and a slide of 50 nucleotides were chosen to minimize the CPU time taken
by QRNA.
The numbers of putative ncRNAs reported by QRNA for each organism are shown in
Graph 5. Here again, the number of predicted ncRNAs increases
with genome size, except in M. gallisepticum.
The spread of the lengths of the putative ncRNAs is plotted in Graph 6. The
graph shows the range, i.e. the shortest and the longest ncRNA predicted for each
organism, together with the average length, marked by a horizontal line.
1.
Mycoplasma genetalium G37 complete genome - 0..580074
480 proteins
Location     Strand  Length  PID      Gene   Synonym  Code  COG  Product
735..1829    +       364     3844620  MG001  -        -     -    DNA polymerase III, subunit beta (dnaN)
1829..2761   +       310     1045670  MG002  -        -     -    dnaJ-like protein
2846..4798   +       650     1045671  MG003  -        -     -    DNA gyrase subunit B (gyrB)
4813..7323   +       836     1045672  MG004  -        -     -    DNA gyrase subunit A (gyrA)
7295..8548   +       417     1045673  MG005  -        -     -    seryl-tRNA synthetase (serS)
8552..9184   +       210     1045674  MG006  -        -     -    thymidylate kinase (tmk)
……
2.
735    1829
1829   2761
2846   4798
4813   7323
7295   8548
8552   9184
……     ……
3.
2762   2845
4799   4812
7224   7294
8549   8551
……     ……
Fig1: (1) Protein table format of the Mycoplasma genitalium genome, showing the annotation of the protein coding regions and the names of the characterized and putative proteins. (2) Coordinates of the protein coding regions alone, obtained by a series of Table to Text and Text to Table conversions in Microsoft Word. (3) Coordinates of the intergenic sequences alone, obtained by simple arithmetic in Microsoft Excel.
Graph1: GENOME LENGTH COMPARISON OF THE MYCOPLASMA
[Bar chart "Genome Size Comparison": genome length (axis 0 to 1,500,000) for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen.]
Organism          Genome size (nucleotides)
M.genitalium      580,074
M.pneumoniae      816,394
M.pulmonis        963,879
M.gallisepticum   996,422
M.mycoides        1,211,703
M.penetrans       1,358,633
Graph2: BAR GRAPH SHOWING THE PERCENTAGE OF INTERGENIC
REGION IN THE GENOME OF MYCOPLASMA
[Stacked bar chart: percentage (0% to 100%) of intergenic region and protein coding region for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen.]
M.gen - Mycoplasma genitalium
M.pne - Mycoplasma pneumoniae
M.pul - Mycoplasma pulmonis
M.gal - Mycoplasma gallisepticum
M.myc - Mycoplasma mycoides
M.pen - Mycoplasma penetrans
1.
Starting  Ending  Length
1         734     734
2762      2845    84
4799      4812    14
7324      7294    0
8549      8551    3
9185      9156    0
9922      9923    2
11253     11251   0
12041     12068   28
12726     12701   0
13566     13569   4
14434     14395   0
15317     15555   239
…….       …….     ….
2.
Starting  Ending  Length
1         734     734
2762      2845    84
15317     15555   239
…….       …….     …..
Fig2: (1) Intergenic sequence coordinates and their lengths in Mycoplasma genitalium, as obtained by simple arithmetic in Microsoft Excel. Intergenic regions range from 1 nucleotide to thousands of nucleotides in length (not shown here).
(2) Intergenic regions after culling by the C programme, which removes the regions shorter than 50 nucleotides. The number of intergenic regions decreases considerably after culling.
#this is Mycoplasma genetalium G37 range file
1 734
2762 2845
15317 15555
19760 19824
20356 20543
28449 28650
36714 36977
38979 39127
47423 47580
…… ……
Fig3: The first few coordinates of the range file created for Mycoplasma genitalium for use in the emboss application.
[Bar chart "Curing of Intergenic Regions": number of intergenic regions (axis 0 to 1200) before and after culling, for each organism.]
          M.pen  M.myc  M.gal  M.pul  M.pne  M.gen
Before    1037   1016   726    782    689    480
After     643    572    290    376    282    122
Graph3: GRAPH SHOWING THE CULLING OF THE INTERGENIC SEQUENCES BY THE C
PROGRAMME THAT SELECTS THE REGIONS WHOSE LENGTH IS GREATER THAN OR EQUAL
TO 50 NUCLEOTIDES ONLY
>L43967_1_734 Mycoplasma genetalium G37 intergenic sequence
TAAGTTATTATTTAGTTAATACTTTTAACAATATTATTAAGGTATTTAAAAAATACTATTATAGTATTTAACATAGTTAAATACCTTCCTTAATACTGTTAAATTATATTCAATCAATACATATATAATATTATTAAAATACTTGATAAGTATTATTTAGATATTAGACAAATACTAATTTTATATTGCTTTAATACTTAATAAATACTACTTATGTATTAAGTAAATATTACTGTAATACTAATAACAATATTATTACAATATGCTAGAATAATATTGCTAGTATCAATAATTACTAATATAGTATTAGGAAAATACCATAATAATATTTCTACATAATACTAAGTTAATACTATGTGTAGAATAATAAATAATCAGATTAAAAAAATTTTATTTATCTGAAACATATTTAATCAATTGAACTGATTATTTTCAGCAGTAATAATTACATATGTACATAGTACATATGTAAAATATCATTAATTTCTGTTATATATAATAGTATCTATTTTAGAGAGTATTAATTATTACTATAATTAAGCATTTATGCTTAATTATAAGCTTTTTATGAACAAAATTATAGACATTTTAGTTCTTATAATAAATAATAGATATTAAAGAAAATAAAAAAATAGAAATAAATATCATAACCCTTGATAACCCAGAAATTAATACTTAATCAAAAATGAAAATATTAATTAATAAAAGTGAATTGAATAAAATTTTGGGAAAAA
>L43967_2762_2845 Mycoplasma genitalium G37 intergenic sequence
AAAACCTTTCATTTTTAATGTGTTATAATTATTTGTTATGCCATAAATTTAGTTTGTGGCAAAAGCTTCTGTACTGTTTATTTA
>L43967_15317_15555 Mycoplasma genitalium G37 intergenic sequence
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATTGGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCAACTGAGCTATACTTCCAAGCATAATCCTAAGGGTATTTAACTAATTATTATAACAATTTTAATTTAACCAAAATACCCCTCGAATTTTAACAGTTTTTATAATCAAAACAGCTAATTTT
>L43967_19760_19824 Mycoplasma genitalium G37 intergenic sequence
ATAAATTTAATAGTGTTGAAAGACAAACATTATTAATTTTTGATCAGCTAAATAAAACAAAGCAA
>L43967_20356_20543 Mycoplasma genetalium G37 intergenic sequence
CTCAAAAAACTAATACATCAAACTTCAACCGTTTACTTTTTTATGAACAAGCACTACAAAGGTTTTATGAAGAATTATTTCAAATAGATTATTTAAGAAGATTTGAAAACATTCCCATTAAAGATAAGAATCAAATTGCGCTTTTTAAAACTGTTTTTGATGATTACAAAACCATTGATTTAGCAGAA
Fig4: Result of the extractseq application in the emboss suite, which extracts the sequences of interest. The figure shows, in fasta format, the first few intergenic sequences of Mycoplasma genitalium obtained from extractseq with the range file and the whole genome sequence as input.
Organism          Database created   Organisms in database                                                        No. of Blastn hits
M.penetrans       ggmpnpudb          M.gallisepticum, M.genitalium, M.mycoides, M.pneumoniae, M.pulmonis          1852
M.mycoides        ggpppdb            M.gallisepticum, M.genitalium, M.penetrans, M.pneumoniae, M.pulmonis         1787
M.gallisepticum   gempppdb           M.genitalium, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis              850
M.pulmonis        ggmpepndb          M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pneumoniae         1012
M.pneumoniae      ggmpepudb          M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pulmonis           565
M.genitalium      gampppdb           M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis           386
Table1: The databases created with the WU BLAST 2.0 application, the organism whose
intergenic sequences were searched against each database, and the number of Blastn hits obtained.
BLASTN 2.0MP-WashU [03-Mar-2004] [linux24-i686-ILP32F64 2004-03-03T16:23:09]
Copyright (C) 1996-2004 Washington University, Saint Louis, Missouri USA.
All Rights Reserved.
Reference: Gish, W. (1996-2004) http://blast.wustl.edu
Notice: this program and its default parameter settings are optimized to find
nearly identical sequences rapidly. To identify weak protein similarities
encoded in nucleic acid, use BLASTX, TBLASTN or TBLASTX.
Query= L43967_1_734 Mycoplasma genetalium G37 intergenic sequence
        (734 letters; record 1)
Database: gal.fasta
          5 sequences; 5,347,031 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done
WARNING: hspmax=1000 was exceeded by 1 of the database sequences, causing the associated cutoff score, S2, to be transiently set as high as 81.
                                                                      Smallest
                                                                        Sum
                                                               High  Probability
Sequences producing High-scoring Segment Pairs:               Score  P(N)      N
gb|U00089| Mycoplasma pneumoniae M129, intergenic sequence      692  1.8e-26   1
emb|BX293980.1| Mycoplasma mycoides subsp. mycoides SC ge...    675  1.1e-25   1
dbj|BA000026| Mycoplasma penetrans, intergenic sequence         602  2.1e-22   1
emb|AL445566| Mycoplasma pulmonis (strain UAB CTIP) inter...    539  1.3e-19   2
gb|AE015450.1| Mycoplasma gallisepticum strain R intergen...    528  4.6e-19   1
>gb|U00089| Mycoplasma pneumoniae M129, intergenic sequence
        Length = 816,394
Plus Strand HSPs:
Score = 692 (109.9 bits), Expect = 1.8e-26, P = 1.8e-26
Identities = 410/664 (61%), Positives = 410/664 (61%), Strand = Plus / Plus
Query:  90 TTAATACTGTTAAATTATATTCAATCAATACATATATAATATTATTAAAATACT-TGATA 148
           | ||||||  |    |||||   ||  | | |  | |  | |||  | |||| | |  ||
Sbjct: 130 TAAATACTAATCTTCTATATAGTATAGAGAAACTTTTTCT-TTAACATAATATTATCTTA 188
Query: 149 AGTATTATTTAGATATTAGACAAAT-ACTAATTTTA-TATTGCTTTAATACT-TAATAAA 205
           | |||||||||  || || | |  | | || | ||| ||||  |||| || | |  |||
Sbjct: 189 A-TATTATTTACCTACTA-ATAGCTTAATATTATTAGTATTTATTTAGTATTATGCTAA- 245
Query: 206 TACTACTTATGTATTAAGTAAATATTACTGTAATACTAATAA-C-AATATTATTAC-AAT 262
           |||||   |  |||||  | ||||||| | || || || ||  | |||||||||   |||
Sbjct: 246 TACTATGCAGATATTATCTTAATATTA-TCTA-TAGTATTAGGCTAATATTATTCTTAAT 303
Query: 263 ATGCTAGAATAATATTGCTAGTATCAATAATTACTAATATAGTATTAGGAAAATACCATA 322
           ||  ||   |||  |  |||  | || ||  || ||    | ||||| || ||||| | |
Sbjct: 304 ATT-TAT--TAAGGTA-CTAA-AGCATTACCTA-TAGGTGA-TATTATGACAATACTAAA 356
Query: 323 ATAAT-ATTTCTAC-ATAATACTAAGTTAATACTATGTGTAGAATAATAAATAATCAGAT 380
            |  | | |  ||  |   || ||  | || | ||| | |  || | ||     | || |
Sbjct: 357 GTGGTTAGTATTATTAGGGTATTAT-TCAA-AGTAT-TCTCCAACACTATTCCCTTAGCT 413
Fig5: Output of the blastn programme from WU BLAST 2.0, run with the intergenic sequences of Mycoplasma genitalium against the database containing the intergenic sequences of the other five Mycoplasma genomes: M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis. (The alignment is only partially shown.) The blastn programme was run with default parameters.
Graph4: GRAPH SHOWING NUMBER OF BLAST
HITS FOR EACH GENOME
[Bar chart: number of Blast hits for each organism.]
Organism  No. of Blast hits
M.pen     1852
M.myc     1787
M.gal     850
M.pul     1012
M.pne     565
M.gen     386
>L43967_15317_15555-1>179-Mycoplasma
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATTGGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCAACTGAGCTATACTTCCAAGCATAATCCTAAGGGTAT-TTAACTA-ATTATTATAACAATTT
>gb-U00089--19096>19275-Mycoplasma
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATTGGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCAACTGAGCTATACTTCCAGGCAAAATCTTC-GTACAGGTTCGCTTCATAATTATATTAATTT
>L43967_15317_15555-1<239-Mycoplasma
AAAATTA-GCTGTTTTGATTATAAAAACTG-TTAAAATTCGAGGGGTATTTTGGTTAAATTAAAATTGTTATAATAATTA-GTTA-AATACCCTTAGGATT-ATGCTTGGAAGTATAGCTCAGTTGGTTAGAGCACACCCCTGATAAGGGTGAGGTCGATGGTTCAAGTCCATTTACTTCCACCAATAAT---GGGGATGTAGCTCAACTGATAGAGCACCTGATTTGCACTCAGGAGGTTGAGGGT
>gb-AE015450.1--417273>417511-Mycoplasma
AATTTTACGC-GTTGTTATTACCAATCGAAATTAAAAATTAAGCAG-ATATTCTTTAA--TGAGCT-GA-AT--TAATTATGTTATAATTCATATGGCAATCACGACTGGAAGTATAGCTCAGCTGGTTAGAGCACACCCCTGATAAGGGTGAGGTCGATGGTTCAAGTCCATTTACTTCCACCAGTTTTTTTGGGGACGTAGCTCAATTGATAGAGCACCTGATTTGCACTCAGGAGGTCGAGGGT
>L43967_19760_19824-5<65-Mycoplasma
TTGCTTTGTTTTATTTAGCTGATCAA-AAATTAATAATGTTTGTCTTTCAACACTATTAAAT
>emb-BX293980.1--57200>57261-Mycoplasma
TTGTTTTGTTTTATTTAATTGATCAATAAATTGATTTAGTTTATCTTTATTTATTAATAAAT
Fig6.1: One of the output files of the perl script blastn2qrnadepth.pl, run with the blastn result of M.genitalium intergenic sequences versus the Mycoplasma database as input. This first file has the extension .q (here genblast.q) and is the file used as input for the qrna programme in the QRNA-2.0.1 suite. It consists of a collection of sequences in fasta format, in which each consecutive pair of sequences forms the two components of an alignment, with the gaps left in place.
FILE: genblast
DIR: /home/kalyankpy/coput2/blast//
FIRST TRIMMING
Minimum length = 1
Maximum Evalue = 0.01
Minimum %id = 0
Maximum %id = 100
SECOND TRIMMING
Alignments culled by = SC
Depth of alignments = 1
shift = 1
113-QUERY: L43967_546708_546877 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 1121
After First trimming: 88
After Second trimming: 2
57-QUERY: L43967_325878_326027 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 152
After First trimming: 3
After Second trimming: 3
72-QUERY: L43967_386409_386461 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 292
After First trimming: 0
After Second trimming: 0
68-QUERY: L43967_364415_364533 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 155
After First trimming: 1
After Second trimming: 1
……………………………………………………………………………………………….
Total #Queries 122
Total #Alignments 53927 ave_len = 309.5
After first trimming 18851 ave_len = 552.6
After second trimming 386 ave_len = 404.2
Fig6.2: The second output file of the perl script blastn2qrnadepth.pl, run with the blastn result of M.genitalium intergenic sequences versus the Mycoplasma database as input. This file has the extension .q.rep (here genblast.q.rep) and reports the BLASTN alignments that were pruned while creating the .q file, according to the options used in the perl script.
#---------------------------------------------------------------------------------
# qrna 2.0.1 (Tue Aug 19 11:30:55 CDT 2003) using squid 1.5m (Sept 1997)
#---------------------------------------------------------------------------------
# PAM model = BLOSUM62
#---------------------------------------------------------------------------------
# RNA model = /mix_tied_linux.cfg
# RIBOPROB matrix = /RIBOPROB85-60.mat
#---------------------------------------------------------------------------------
# seq file = /home/kalyankpy/perlscriptresult/genblast.q
# #seqs: 772 (max_len = 3420)
#---------------------------------------------------------------------------------
# window version: window = 150 slide = 50 -- length range = [0,9999999]
#---------------------------------------------------------------------------------
# 1 [both strands] (sre_shuffled)
>L43967_1_734-90>722-Mycoplasma (664)
>gb-U00089--130>767-Mycoplasma (664)
length of whole alignment after removing common gaps: 664
Divergence time (variable): 0.401
[alignment ID = 61.75 MUT = 29.67 GAP = 8.58]
length alignment: 150 (id=61.33) (mut=32.67) (gap=6.00)(sre_shuffled)
posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43)
posY: 0-149 [0-144](145) -- (0.37 0.11 0.06 0.46)
L43967_1_734-90 TTAATTTTATTAAAACTATAACTTATTTTTTATAAACATTCTATGTTTTT
gb-U00089--130> TTTATTTTATTAAAATTATAATGTATTTTTGTTAAATTTT.TAATTCTTT
L43967_1_734-90 TAAA.CAAATGAGAAATATAGTAATAAAGCAAATT.TTTTCACCAT.TTT
gb-U00089--130> CAGTGCACATA.CCTATTCGCTAGTTAA.ACGATAAAGTTAAAGAAATTT
L43967_1_734-90 TTTATTATATCA.AAATTTAAAGAAAAATCTGAAAATTATCTATAATGTG
gb-U00089--130> TTCTTTATATTCTAAATTT.AAAAATCTTCTCAATATAATACATAAT.TC
LOCAL_DIAG_VITERBI -- [Inside SCFG]
OTH ends *(+) = (0..[150]..149)
OTH ends (-) = (0..[150]..149)
COD ends *(+) = (120..[27]..146)
COD ends (-) = (41..[12]..52)
RNA ends *(+) = (0..[21]..20)
RNA ends (-) = (0..[150]..149)
winner = OTH
OTH = 184.281 COD = 166.408 RNA = 179.710
logoddspostOTH = 0.000 logoddspostCOD = -17.873 logoddspostRNA = -4.571
sigmoidalOTH = 4.571 sigmoidalCOD = -17.932 sigmoidalRNA = -4.571
Fig7: The qrna output file obtained with the command: qrna -w 150 -x 50 -a -B genblast.q. qrna is a C program that evaluates a given alignment for its ability to form a structural RNA. The figure shows part of the output of a qrna run with the scanning-window option (here window size = 150 and slide = 50 nucleotides).
Every new BLAST alignment starts with two lines: “>Query_name” followed by “>Subject_name”. “Divergence time” indicates the particular time parameterization of QRNA used. By default QRNA chooses the divergence time
(in this case 0.401) from the percentage identity of the alignment (61.75%). Each newly analyzed window starts with the line: length alignment:
For each window and for each sequence in the alignment we have a line of the form: posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43). The first pair of numbers gives the first and last coordinates of the window with respect to the beginning of the alignment. The pair of numbers in brackets gives the mapping of the window into the coordinate system of sequence X (after removing the gaps). The adjacent number in parentheses is the length of that segment in sequence X. Finally, the four decimal numbers in parentheses are the fractions of A, C, G and T respectively in the segment of sequence X covered by that window.
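The layout of these position lines can be illustrated with a small parser. This is a minimal sketch and not part of the QRNA distribution: the function name parse_pos_records and the field names are hypothetical, and the regular expression assumes exactly the layout of the two lines shown in Fig7.

```python
import re

# Pattern for records such as:
#   posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43)
# The layout is assumed from the Fig7 excerpt only.
POS_RECORD = re.compile(
    r"pos(?P<seq>[XY]):\s*(?P<win_from>\d+)-(?P<win_to>\d+)\s*"
    r"\[(?P<seq_from>\d+)-(?P<seq_to>\d+)\]\((?P<seq_len>\d+)\)\s*--\s*"
    r"\((?P<freqs>[\d.\s]+)\)"
)


def parse_pos_records(text):
    """Return one dict per posX/posY record found in text."""
    records = []
    for m in POS_RECORD.finditer(text):
        values = [float(x) for x in m.group("freqs").split()]
        if len(values) != 4:
            raise ValueError("expected four base fractions (A, C, G, T)")
        records.append({
            "seq": m.group("seq"),
            "win_from": int(m.group("win_from")),
            "win_to": int(m.group("win_to")),
            "seq_from": int(m.group("seq_from")),
            "seq_to": int(m.group("seq_to")),
            "seq_len": int(m.group("seq_len")),
            "freqs": dict(zip("ACGT", values)),
        })
    return records


sample = (
    "posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43) "
    "posY: 0-149 [0-144](145) -- (0.37 0.11 0.06 0.46)"
)
for rec in parse_pos_records(sample):
    window_len = rec["win_to"] - rec["win_from"] + 1
    gaps = window_len - rec["seq_len"]  # gap columns for this sequence
    at = rec["freqs"]["A"] + rec["freqs"]["T"]
    print(rec["seq"], "gaps:", gaps, "A+T:", round(at, 2))
```

For the window in Fig7 the bracketed coordinates and the segment length agree (0 to 145 covers 146 nucleotides), so the difference between the window length and the segment length gives the number of gap columns for that sequence.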
For each model, and for each strand, you are given the actual local regions (they could be more than one per model and strand) that score according to the model. The notation is (from..[Length]..to). Coordinates for both strands are given relative to the positive strand. The * indicates the strand with the strongest signal for a given model.
For the given scoring algorithm (here local Viterbi, the default) we get three rows of numbers:
Row 1: The scores of the alignment under each of the three models. The null model is a fourth model, which assumes that the two sequences in the alignment are independent of each other.
Row 2: The two (COD and RNA) log-odds posterior probabilities with respect to the OTH model.
Row 3: The three sigmoidal scores, each calculated using the other two models as null models. The model with the highest sigmoidal score is the winner.
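The relationship between the three rows can be checked numerically. The sketch below is an illustration and not QRNA's source code. It assumes that the Row 1 scores are in bits (log2), that Row 2 is the plain difference from the OTH score, and that each sigmoidal score compares one model against the sum of the other two with equal weights. The function names log2_sum and qrna_rows are hypothetical. Under these assumptions the values agree with the numbers in Fig7 to within rounding.

```python
import math


def log2_sum(a, b):
    """Return log2(2**a + 2**b), computed stably for large bit scores."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))


def qrna_rows(oth, cod, rna):
    """Recompute rows 2 and 3 from the row 1 scores (all in bits).

    Assumption: the three models are weighted equally.
    """
    scores = {"OTH": oth, "COD": cod, "RNA": rna}
    # Row 2: log-odds of each model with respect to OTH.
    logodds = {model: s - oth for model, s in scores.items()}
    # Row 3: each model against the other two taken together.
    sigmoidal = {}
    for model, s in scores.items():
        others = [v for k, v in scores.items() if k != model]
        sigmoidal[model] = s - log2_sum(others[0], others[1])
    winner = max(sigmoidal, key=sigmoidal.get)
    return logodds, sigmoidal, winner


# Row 1 values taken from the window shown in Fig7.
logodds, sigmoidal, winner = qrna_rows(oth=184.281, cod=166.408, rna=179.710)
print("winner =", winner)
for model in ("OTH", "COD", "RNA"):
    print(model, round(logodds[model], 3), round(sigmoidal[model], 3))
```

Because the scores are logarithms, the sum of two models is dominated by the larger one; this is why the sigmoidal score of RNA in this window is almost identical to its log-odds against OTH.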
---------------Some General Statistics-------------------
FILE: ./genblast.2qrna
method: LOCAL_DIAG_VITERBI
Cutoff: 5
max id: 100
# blastn hits: 386
# windows: 2574
---------------------------------------------------------
---------------Statistics by Windows---------------------
# windows: 2574
RNA>0: 41/2574
RNA>cutoff: 2/2574
COD>0: 2/2574
COD>cutoff: 0/2574
in phases: 2045/2574
RNA: 2/2045
COD: 0/2045
OTH: 2043/2045
in transitions: 0/2574
RNA/COD: 0/0
RNA/OTH: 0/0
COD/OTH: 0/0
RNA/COD/OTH: 0/0
---------------------------------------------------------
---------------Statistics for RNA loci ():-------------------
# loci: 1
ave_length: 196.00
1-loci L43967_167180_175806-Mycoplasma 7457 7653 (197) 2 RNA -39.20 26.19
Fig8: Output of the phase_count_fast.pl Perl script, which extracts the RNA loci (those with an RNA score larger than a cutoff set with option -u; here -u is at its default of 5). The script identified 1 independent locus in Mycoplasma genitalium that scores as RNA above 5 bits, out of the 2574 windows from the 386 blastn alignments. The coordinates of each locus are listed in the following form:
num-loci name_seq(seq_length) loc_from loc_to (loc_length) number_wind type_loc COD_sc RNA_sc
Therefore,
1-loci L43967_167180_175806-Mycoplasma 7457 7653 (197) 2 RNA -39.20 26.19
means that the first M.genitalium locus corresponds to the intergenic sequence named L43967_167180_175806-Mycoplasma. The locus has a length of 197 nucleotides and covers the region from 7457 to 7653. Two different windows have contributed to this RNA locus; the average sigmoidal score for the coding model is -39.20 bits, while the average sigmoidal score for the RNA model is 26.19 bits.
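A locus line of this form can be split into named fields for further filtering. The following is a minimal sketch, not part of phase_count_fast.pl: the Locus type and the parse_locus function are hypothetical names, and the parser assumes the nine whitespace-separated fields seen in the example line above.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    index: int        # running number of the locus
    name: str         # name of the query intergenic sequence
    start: int        # loc_from
    end: int          # loc_to
    length: int       # loc_length, printed in parentheses
    n_windows: int    # number of windows contributing to the locus
    kind: str         # type of locus, e.g. RNA
    cod_score: float  # average sigmoidal score, coding model (bits)
    rna_score: float  # average sigmoidal score, RNA model (bits)


def parse_locus(line):
    """Parse one locus line; assumes nine whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError("expected 9 fields, got %d" % len(fields))
    return Locus(
        index=int(fields[0].split("-")[0]),
        name=fields[1],
        start=int(fields[2]),
        end=int(fields[3]),
        length=int(fields[4].strip("()")),
        n_windows=int(fields[5]),
        kind=fields[6],
        cod_score=float(fields[7]),
        rna_score=float(fields[8]),
    )


line = "1-loci L43967_167180_175806-Mycoplasma 7457 7653 (197) 2 RNA -39.20 26.19"
locus = parse_locus(line)
# The printed length counts both end coordinates.
print(locus.name, locus.end - locus.start + 1, locus.rna_score)
```

With the fields named, loci can be selected by score or length, for example by keeping only entries whose rna_score exceeds the chosen cutoff.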
Graph5: REPRESENTATION OF THE PUTATIVE ncRNAS PREDICTED BY QRNA
(Y-axis: No. of ncRNAs, up to 60. X-axis: M.pen, M.myc, M.gal, M.pul, M.pne, M.gen. Bar labels as extracted: 52, 40, 12, 39, 4, 10)
Graph6: GRAPH SHOWING THE LENGTH RANGE OF NON-CODING RNAs.
(Y-axis: Length (nt), 0 to 350. X-axis: M.pen, M.myc, M.gal, M.pul, M.pne, M.gen. Vertical bars represent the spread of scores and horizontal bar represent the average)
Fig9: Analysis of the BLASTN alignments between M.genitalium intergenic sequences and the intergenic sequence database of M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae and M.pulmonis. Alignments have been grouped by percentage identity. Each panel is a histogram of the number of alignments binned in each percentage identity interval. The green histogram shows the total number of windows analyzed. The blue histogram shows the windows that score as RNA or coding sequence above a cutoff of 0 bits.
a) Figure showing the number of sequences scored as coding regions in the windows analyzed.
b) Figure showing the number of sequences scored as RNAs in the windows analyzed.
Fig10: Analysis of the BLASTN alignments between M.genitalium intergenic sequences and the intergenic sequence database of M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae and M.pulmonis. Alignments have been grouped by percentage identity. Each panel shows the scores of all the alignments as a function of the percentage identity of the alignments. “*” represents the average of the RNA or coding sequence scores. The error bars correspond to one standard deviation.
a) Figure showing the average of the scores scored as coding regions in the windows analyzed.
b) Figure showing the average of the scores scored as RNAs in the windows analyzed.
Fig 9a: Figure showing the number of sequences scored as coding regions in the windows analyzed.
(Plot: genblast.qrna.COD.id--sigmoidal LOD, qrna 2.0.1. X-axis: % ID, 50 to 100. Y-axis: NUMBER OF WINDOWS, 0 to 30.)
<len> = 303 +/- 198 ID=[100:0] total_counts [361] real COD-phase_counts_above: 0 [16//361]
Fig 9b: Figure showing the number of sequences scored as RNAs in the windows analyzed.
(Plot: genblast.qrna.RNA.id--sigmoidal LOD, qrna 2.0.1. X-axis: % ID, 50 to 100. Y-axis: NUMBER OF WINDOWS, 0 to 30.)
<len> = 303 +/- 198 ID=[100:0] total_counts [361] real RNA-phase_counts_above: 0 [28//361]
Fig 10a: Figure showing the average of the scores scored as coding regions in the windows analyzed.
(Plot: genblast.qrna.COD.id--sigmoidal LOD, qrna 2.0.1. X-axis: % ID, 50 to 100. Y-axis: COD sigmoidal LODSCORE, -80 to 60.)
<len> = 303 +/- 198 ID=[100:0] ave COD lodscore above: 0 [16//361]
Fig 10b: Figure showing the average of the scores scored as RNAs in the windows analyzed.
(Plot: genblast.qrna.RNA.id--sigmoidal LOD, qrna 2.0.1. X-axis: % ID, 50 to 100. Y-axis: RNA sigmoidal LODSCORE, -60 to 20.)
<len> = 303 +/- 198 ID=[100:0] ave RNA lodscore above: 0 [28//361]
DISCUSSION
The intergenic regions in prokaryotes are small; however, they have long been
shown to play a significant role in these organisms. The proportion of intergenic sequence
in the Mycoplasma genomes varied from 9.2% in M.genitalium (the smallest genome) to 18% in
M.mycoides (the largest genome). The number of intergenic regions ranged from 122 (fewest, in
M.genitalium) to 643 (most, in M.mycoides). The average length of intergenic regions ranged from
234 nucleotides (in M.penetrans) to 441 nucleotides (in M.genitalium). This indicates that the
average length of intergenic regions is greater in a smaller genome than in a larger one.
This could be due to the large number of short interspersed regions (intergenic regions of
only a few nucleotides) in M.penetrans, which reduces the average length.
QRNA was run with the sequence-shuffling option, which estimates the
false positives that could arise from the given sequence composition and length. Earlier
ncRNA predictions of this kind in E.coli showed 85% true positives (Rivas and Eddy 2001). The
loci predicted in the present study are regions of conserved secondary structure that include
ncRNAs; they need not be individual ncRNAs alone.
To assess the significance of the prediction, the predicted loci were searched for
similarity against the already known and biochemically characterized ncRNAs obtained from the
ncRNA database at http://biobases.ibch.poznan.pl/nc (updated till 2002).
The putative non-coding RNAs were searched against known Mycoplasma ncRNA
data (only two ncRNAs have been characterized, both in Mycoplasma capricolum). The results
indicated that one of the putative ncRNAs from the current study showed a good percentage of
identity (60%) with one of the two biochemically characterized Mycoplasma ncRNAs, viz.,
Mc_MCS4 ncRNA from Mycoplasma capricolum. Mc_MCS4 has already been
shown to have extensive similarity with the eukaryotic U6 snRNA. This strengthens the case
that our candidate ncRNA is a functional entity. Since the number of ncRNAs in the
biochemically determined database was small, the database was expanded to include other
prokaryotic ncRNAs.
The results indicated that a stretch of nucleotides in the putative ncRNA
showed significant similarity to MicF RNAs from E.coli, S.typhi and K.pneumoniae. Since MicF
is known to regulate the expression of OmpF, and the similar stretch of nucleotides
is conserved across all these species, the putative ncRNA stretch may be
a MicF counterpart in Mycoplasma. Another ncRNA showing significant similarity to E.coli
OxyS RNA was also noticed. OxyS RNA is known to modulate gene expression in response to
hydrogen peroxide, a chemical commonly produced by mammals in response to infection. This
suggests a defense mechanism operating in Mycoplasma.
The database was further expanded to include eukaryotic ncRNAs, comprising
the characterized miRNAs, development-regulating RNAs and protein-function-modifying
RNAs. The putative ncRNAs were found to have more than 60% identity with a number of
miRNAs from mouse, human, A.thaliana and C.elegans. Fig 11a shows a blastn hit with
71% identity against one of the putative ncRNAs from M.mycoides. This suggests that the
putative ncRNA has a conserved secondary structure similar to the well characterized stem
loop region of C.briggsae miRNA. Fig 11b shows a blastn hit with an identity of 63% between the
same M.mycoides and a characterized development-regulating ncRNA
of Homo sapiens.
>cbr-mir-268 MI0000541 Caenorhabditis briggsae miR-268 stem-loop
Length = 79
Minus Strand HSPs:
Score = 95 (20.3 bits), Expect = 0.22, P = 0.19
Identities = 33/46 (71%), Positives = 33/46 (71%), Strand = Minus / Plus
Query: 64 CAAAC-CTCTAAACTT-CTAAGAACTTCTTCTTCTTCTTCTTCTTC 21
          || || | || | ||  || |   |||||| || ||||||||||||
Sbjct: 34 CAGACACACTCA-CTGACTCACTGCTTCTTGTTTTTCTTCTTCTTC 78
Fig 11a: A blastn hit with 71% identity obtained for one of the putative ncRNAs from M.mycoides. This
suggests that the putative ncRNA has a conserved secondary structure similar to the well
characterized stem loop region of C.briggsae miRNA.
Significant hits were also found with development-regulating ncRNAs, including
those from Homo sapiens.
>Hs_NTT
Length = 17,572
Plus Strand HSPs:
Score = 116 (23.5 bits), Expect = 0.025, P = 0.024
Identities = 60/94 (63%), Positives = 60/94 (63%), Strand = Plus / Plus
Query:    11 TATTTAATATTTATAATTGCTATTTAGCATCTTAAAA-AAGA-CG-TCTTT-AAA-TATA 65
             ||  |||| |  || |||  |  | || |  |||| | |||  |  ||||  ||| ||||
Sbjct:  5336 TACATAAT-TAGATCATTTATTCTAAGTAAATTAAGAGAAGCTCTATCTTCCAAAATATA 5394
Query:    66 GATAGTTATACTAATTAGAAAATAGTTAAT-AAG 98
             ||||  | ||  ||| |||| |   ||||| |||
Sbjct:  5395 GATATCTCTAGCAAT-AGAAGAGTTTTAATTAAG 5427
Fig 11b: A sample sequence hit with an identity of about 63% between the same M.mycoides and
a characterized development-regulating ncRNA of Homo sapiens.
These results indicate that the ncRNAs are conserved across other kingdoms of
life. Since ncRNAs are generally conserved across a wide spectrum of organisms, they may
play varied roles in different cellular processes, though these roles are yet to be proved
biochemically (which remains a challenging task).
Because the very existence and expression profile of ncRNAs is not predictable, their
functional analysis remains challenging. With the predicted ncRNAs in hand, the task can be
approached with a reduced burden.
REFERENCES
1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped
BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Research 1997, 25:3389
2. Argaman L, Hershberg R, Vogel J, Bejerano G, Wagner EG, Margalit H, Altuvia S:
Novel small RNA-encoding genes in the intergenic regions of Escherichia coli. Current
Biology 2001, 11:941
3. Badger JH, Olsen GJ: CRITICA: Coding Region Identification Tool Invoking
Comparative Analysis. Molecular Biology and Evolution 1999, 16:512
4. Caprara MG, Nilsen TW: RNA: versatility in form and function. Nature Structural
Biology 2000, 7:831
5. Rivas E, Eddy SR: Secondary structure alone is generally not statistically
significant for the detection of non-coding RNAs. Bioinformatics 2000, 16:583
6. Rivas E, Eddy SR: QRNA: A non-coding RNA genefinder using comparative
genome sequence analysis (ftp://ftp.genetics.wustl.edu/pub/eddy/software/qrna.tar.z) 2001
7. Rivas E, Klein RJ, Jones TA, Eddy SR: Computational
identification of non-coding RNAs in Escherichia coli by comparative genomics.
Current Biology 2001, 11:1369
8. Rivas E, Eddy SR: Non-coding RNA gene detection using comparative
sequence analysis. BMC Bioinformatics 2001, 2:8
9. Erdmann VA, Barciszewska MZ, Szymanski M, Hochberg A, de Groot N, Barciszewski J:
The non-coding RNAs as riboregulators. Nucleic Acids Research 2001, 29:189
10. Gish W: WU-BLAST 2.0 (http://blast.wustl.edu/) 2003
11. Huttenhofer A, Kiefmann M, Meier-Ewert S, O’Brien J, Lehrach H, Bachellerie JP,
Brosius J: RNomics: an experimental approach that identifies 201 candidates for
novel, small, non-messenger RNAs in mouse. EMBO Journal 2001, 20:2943
12. Lowe TM, Eddy SR: tRNAscan-SE: a program for improved detection of transfer
RNA genes in genomic sequence. Nucleic Acids Research 1997, 25:955
13. Lowe TM, Eddy SR: A computational screen for methylation guide snoRNAs in yeast.
Science 1999, 283:1168
14. Szymanski M, Barciszewski J: Beyond the proteome: non-coding regulatory
RNAs. Genome Biology 2002, 3: 0005.1
15. Mattick JS: Non-coding RNAs: the architects of eukaryotic complexity. EMBO
Reports 2001, 2:986
16. Olivas WM, Muhlrad D, Parker R: Analysis of the yeast genome: identification of new
non-coding and small ORF-containing RNAs. Nucleic Acids Research 1997, 25:4619
17. Eddy SR: Non-coding RNA genes. Current Opinion in Genetics and Development
1999, 9:695
18. Eddy SR: Non-coding RNA genes and the modern RNA world. Nature Reviews
Genetics 2001, 2:919
19. Schattner P: Searching for RNA genes using base-composition statistics. Nucleic
Acids Research 2002, 30:2076
20. Wassarman KM, Zhang A, Storz G: Small RNAs in Escherichia coli. Trends in
Microbiology 1999, 7:37
21. Zwieb C, Wower I, Wower J: Comparative sequence analysis of tmRNA. Nucleic Acids
Research 1999, 27:2063