NON-CODING RNA PREDICTION OF CLINICALLY IMPORTANT MYCOPLASMA BY COMPARATIVE GENOMIC ANALYSIS Dissertation submitted to the Madurai Kamaraj University In partial fulfillment for the requirement of Masters of Science in Biotechnology Submitted by Reg No: A242009 SCHOOL OF BIOTECHNOLOGY MADURAI KAMARAJ UNIVERSITY MADURAI 625 021 May 2004

Upload: kalyankpy

Post on 15-Nov-2014


PDF file of the M.Sc. dissertation report by Kalyan Kumar Pasumarthy


Page 1: MSc Project

NON-CODING RNA PREDICTION OF CLINICALLY

IMPORTANT MYCOPLASMA BY COMPARATIVE

GENOMIC ANALYSIS

Dissertation submitted to the Madurai Kamaraj University

In partial fulfillment for the requirement of

Masters of Science in Biotechnology

Submitted by

Reg No: A242009

SCHOOL OF BIOTECHNOLOGY

MADURAI KAMARAJ UNIVERSITY

MADURAI 625 021

May 2004


To THE SMALL AND POWERFUL

Non-coding RNA


DECLARATION

I declare that this dissertation entitled Non-coding RNA prediction of

clinically important Mycoplasma using comparative genome analysis submitted by

me in partial fulfillment for the requirement of Masters of Science in Biotechnology to

the Madurai Kamaraj University is based on the work carried out by me in the School of

Biotechnology, Madurai Kamaraj University, Madurai under the guidance and

supervision of Dr. Z. A. Rafi, Reader, School of Biotechnology, Madurai Kamaraj

University, Madurai. I also declare that this dissertation or any part of it has not been

submitted elsewhere for any other degree or diploma.

Madurai-21 Regn. No.:A242009

May 7, 2004


ACKNOWLEDGEMENTS

I owe my gratitude to Dr. Z. A. Rafi for his guidance and supervision in this project. His care and concern have been the driving force for me all through this work, and I am thankful for his constant advice and encouragement. I am thankful to Prof. S. Krishnaswamy for introducing me to the field of bioinformatics.

I would also like to thank my classmates Anurag, Basanth, Dinesh, Geeta,

Hridesh, Kaiser, Netrapal, Subhanjan, Sucharitha, Vijay, for their support and company

during the past two years, that made my stay in Madurai a memorable one. I would like

to thank Deepak for his help in creating a C programme.

My special thanks are due to my roommate and friend Santosh for his

constructive criticism for my mistakes. I acknowledge my special friend Ayushi who has

been my rich source of encouragement and entertainment during the last phase at MKU.

I am indebted to the entire School of Biotechnology for making my M.Sc

an intellectually stimulating experience.

I also acknowledge the Dept. of Science and Technology, Government of India, for its financial support over the last five years through the Kishore Vaigyanik Protsahan Yojana, and the Dept. of Biotechnology, Government of India, for supporting this project.


CONTENTS

1. Briefing

2. Introduction

3. Review of Literature

4. Materials

5. Methods

6. Results

7. Discussion

8. References


BRIEFING

Small untranslated RNA molecules are found in all kingdoms of life. Many of those discovered to date are conserved between closely related organisms and have a characteristic secondary structure. They perform diverse functions, chiefly the regulation of gene expression. Non-coding RNAs (ncRNAs) are difficult to detect biochemically or to predict by traditional sequence analysis.

To search for ncRNAs that may play an important role in the life cycle of pathogenic Mycoplasma, we used a well-established computational strategy that distinguishes conserved RNA secondary structures from a background of other conserved sequences, using probabilistic models of the mutational patterns expected in pairwise sequence alignments.

We report here a complete genome screen for ncRNAs, carried out with this method on the six completely sequenced Mycoplasma genomes available, using comparative sequence analysis. The screen yielded several putative ncRNAs. The majority of the predicted ncRNA sequences are 130-160 nucleotides long, and the number of ncRNAs predicted was proportional to genome size for all but one genome. Our candidate ncRNAs showed similarity to a few biochemically characterized ncRNAs in bacteria as well as eukaryotes, which suggests that these ncRNAs are broadly conserved across the kingdoms of life. This finding makes our putative ncRNAs suitable candidates for drug discovery and developmental studies of Mycoplasma.


INTRODUCTION

The central dogma of molecular biology defines a general pathway for the expression of genetic information: it is stored in DNA, transcribed into transient mRNA, and decoded on ribosomes with the help of adapter RNAs to produce proteins, which in turn perform the enzymatic and structural functions of the cell. According to this view, RNAs play a rather accessory role, and the complexity of a given organism is defined by the constellation of proteins encoded by its genome. However, the discovery of RNAs that perform enzymatic and other functional roles in the cell complicated this picture.

The discovery of the catalytic activity of RNase P and of the self-splicing activity of group I introns suggested that the functions of RNA go far beyond a passive role in the expression of protein-coding genes. More recent discoveries have attributed a variety of regulatory roles to RNA, including control of plasmid replication, transposition in prokaryotes and eukaryotes, phage development, viral replication, bacterial virulence, global regulatory circuits in bacteria that respond to environmental changes, and developmental control in lower eukaryotes.

These functions suggest that RNAs once regarded as non-functional are not merely molecular fossils. Analyses of several sequenced genomes suggest that protein-coding genes alone are not enough to account for the complexity of higher organisms. Genomic analysis has shown that the protein-coding fraction of the genome decreases as an organism's complexity increases. It is estimated that about 98% of the transcriptional output of eukaryotic genomes, and up to 10% of that of prokaryotic genomes, is RNA that does not encode any protein.


In this context, ncRNAs are defined as a heterogeneous group of transcripts with a wide functional spectrum. Broadly, ncRNAs can be divided into two classes:

1. Housekeeping RNAs that are constitutively expressed and required for normal

functions and viability of the cell.

2. Regulatory ncRNAs, by contrast, include those that are expressed at certain stages

of an organism’s development or cell differentiation, or as a response to external

stimuli.

Many of these ncRNAs were discovered by chance while researchers were

studying individual genetic systems. NcRNA species have been difficult to detect by

targeted experimental procedures or by traditional computational approaches.

In the present study, an attempt has been made to screen the completely sequenced and clinically important Mycoplasma* genomes for ncRNAs by comparative sequence analysis.

*M.penetrans, M.mycoides, M.gallisepticum, M.pulmonis, M.pneumoniae, M.genitalium


LITERATURE

Gene-finding algorithms generally assume that the target is a protein-coding gene that produces mRNA, so they fail to detect ncRNAs. However, a few computational strategies have recently emerged to detect ncRNAs. They can be classified into the following four categories:

Sequence similarity analysis: This is simply searching a newly sequenced genome for

similarity against the known ncRNAs [Lowe et al., 1991; Lowe et al., 1999; Zwieb et al,

1999].

Transcriptional signal analysis: This is based on the fact that ncRNAs are transcribed but not translated. It is a systematic approach that searches for ncRNA genes that have transcriptional signals but no translational signals [Argman et al., 2001; Olivas et al., 1997].

Statistical analysis: This involves the analysis of base composition statistics of non-

coding regions in comparison to coding regions [Shattner, 2002]

Comparative genomic analysis: Sequences conferring important characteristics are

conserved across related genomes. Similar assumptions have been made in case of

ncRNAs also. A comparative analysis approach of related genes is used to screen

ncRNAs across the related genomes [Elena Rivas 2001; Elena Rivas et al., 2001;

Wassarman et al., 1999].

The aim of the current study is to find ncRNAs that may play an important role in the pathogenesis of clinically important Mycoplasma. The study was carried out using the comparative genomic analysis approach. This approach was selected because these Mycoplasma species share a common disease-causing ability, so a comparative genomic analysis is expected to highlight the group of ncRNAs that contribute to pathogenesis.


In our approach to predicting ncRNAs by comparative genomics we used a computational tool, QRNA [Elena Rivas 2001], which is the heart of our project. The following section describes how it was developed and how it works.

Some RNA gene finding approaches had been explored earlier, but with limited success [Elena Rivas 2000]. An early hypothesis was that biologically functional RNAs may have more stable predicted secondary structures than would be expected for a random sequence of the same base composition [Chen JH et al., 1990; Le SY et al., 1988; 1990]. Although this hypothesis is true to a certain extent, it has been reported that predicted stability alone does not give a usable signal, because the predicted stability of structural RNAs is not sufficiently distinguishable from that of random sequence to serve as the basis for a reliable ncRNA gene finding algorithm [Elena Rivas 2000]. Nonetheless, conserved RNA secondary structure remained the best hope for an exploitable statistical signal in ncRNA genes. Hence, secondary structure prediction was coupled with comparative sequence analysis to obtain additional statistical signals [Elena Rivas 2001].

Comparative sequence analysis for ncRNA genes builds on earlier work that used the BLASTN programme to locate genomic regions with significant sequence similarity between two related bacterial species. A computational tool, CRITICA, analyzed the pattern of mutation in these ungapped, aligned conserved regions for evidence of coding structure [Badger 1999]. For example, mutations to synonymous codons get positive scores, while aligned triplets that translate to dissimilar amino acids get negative scores. The programme then extends any ungapped seed alignments classified as coding into complete ORFs.


QRNA is an extension of CRITICA to identify structural RNA regions. The

extensions include:

1. using fully probabilistic models;

2. adding a third model of pairwise alignments constrained by structural RNA

evolution;

3. allowing gapped alignments; and

4. allowing for the possibility that only part of the pairwise alignment may represent

a coding region or structural RNA, because a primary sequence alignment may

extend into flanking non-coding or nonstructural conserved sequence.

These extensions add complexity to the approach. QRNA uses probabilistic modeling methods and formal languages to guide the construction of its models. Pair hidden Markov models (pair-HMMs) and a pair stochastic context-free grammar (pair-SCFG) are used to produce three evolutionary models: coding, structural RNA, and something else. Given the three probabilistic models and a pairwise sequence alignment to be tested, QRNA calculates the Bayesian posterior probability that the alignment should be classified as coding, structural RNA, or something else.
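The classification step can be illustrated with a small Python sketch. It assumes that each model reports a log-likelihood score in bits (log base 2) and that the three models have equal prior probability. Both are assumptions made for illustration only, the scores are invented, and the function name `posterior_probabilities` is hypothetical rather than part of QRNA.

```python
def posterior_probabilities(scores_bits, priors=None):
    """Turn per-model log2 likelihood scores into posterior probabilities."""
    models = list(scores_bits)
    if priors is None:
        # Assumption: every model is equally likely a priori.
        priors = {m: 1.0 / len(models) for m in models}
    # Subtract the best score before exponentiating to avoid overflow.
    best = max(scores_bits.values())
    weights = {m: priors[m] * 2.0 ** (scores_bits[m] - best) for m in models}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


# Invented scores for one alignment window (OTH = other, COD = coding, RNA = structural RNA).
scores = {"OTH": 12.0, "COD": 3.0, "RNA": 10.0}
post = posterior_probabilities(scores)
winner = max(post, key=post.get)
print(winner, {m: round(p, 4) for m, p in post.items()})
```

The model with the largest posterior is the one the window would be assigned to under these assumptions.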

QRNA screens for conserved RNA secondary structures. It therefore also detects various non-genic sequences with conserved RNA structures, including rho-independent terminators, rRNA spacers, transcriptional attenuators in ribosomal protein and amino acid biosynthetic operons, other cis-regulatory RNA structures, and even certain repetitive elements that form pseudoknots, stem loops, or palindromic sequences.

The predicted targets are referred to as ncRNA genes, but it must be understood that this really means a conserved RNA secondary structure that may or may not turn out to be an independent functional ncRNA gene upon further analysis.


MATERIALS

System configuration

Hardware specification:

Machine Name : Pentium IV

CPU Speed : 2.8GHz

RAM Memory : 512MB

Hard disk : 80GB

Operating system specifications:

Red hat Linux 9.0

Microsoft Windows XP

Packages installed and Applications used:

Red Hat Linux 9.0
EMBOSS-2.8.0
WU BLAST 2.0
QRNA
Microsoft Office
Perl 5.0

Selected Genomes for the study:

Mycoplasma penetrans
Mycoplasma mycoides
Mycoplasma gallisepticum
Mycoplasma pulmonis
Mycoplasma pneumoniae
Mycoplasma genitalium


METHODS

Downloading the genomes of Mycoplasma:

For each of the selected organisms, a folder containing the genome in various formats was downloaded from the NCBI ftp site ftp://ftp.ncbi.nlm.nih/Bacteria/Mycoplasma_Species. The files used were the whole-genome nucleotide sequence in fasta format (accession_number.fna) and the protein table (accession_number.ptt), which lists the start and end coordinates of the protein coding regions.

Preparing range file of intergenic regions:

Range file preparation involves three steps starting from manipulation of

the coordinates of protein coding regions.

Getting coordinates of the protein coding regions –

The protein table file was opened in Microsoft Word, and the Convert options (Table to Text and Text to Table) were used to produce a table containing only the coordinates of the protein coding regions.
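The same coordinates can also be pulled out with a few lines of script. The following Python sketch is an illustrative alternative, not the procedure used in this work; it assumes that every data row of the protein table begins with a `start..end` location field, as in the rows reproduced in Fig1, and the function name `coding_coordinates` is hypothetical.

```python
import re

# A data row is assumed to start with "start..end" followed by whitespace.
LOCATION = re.compile(r"^(\d+)\.\.(\d+)\s")


def coding_coordinates(ptt_lines):
    """Return (start, end) pairs from the Location column of protein table rows."""
    coords = []
    for line in ptt_lines:
        match = LOCATION.match(line.strip() + " ")
        if match:  # title and column-header lines do not match and are skipped
            coords.append((int(match.group(1)), int(match.group(2))))
    return coords


example = [
    "Location Strand Length PID Gene Synonym Code COG Product",
    "735..1829 + 364 3844620 MG001 - - - DNA polymerase III, subunit beta (dnaN)",
    "1829..2761 + 310 1045670 MG002 - - - dnaJ-like protein",
]
coords = coding_coordinates(example)
print(coords)
```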

Getting coordinates for intergenic regions –

The protein coding coordinates were pasted into a Microsoft Excel sheet, and simple arithmetic was used to obtain the coordinates of the intergenic regions: each intergenic region starts one nucleotide after the end of a coding region and ends one nucleotide before the start of the next.

Making a Range file –

In the final step, the intergenic region coordinates were copied into a Notepad file, which was given as input to a C programme that selects only the intergenic regions longer than 49 nucleotides. This cutoff was chosen because biochemically characterized ncRNA genes have a minimum length of 50 nucleotides. The resulting range file was used as input for the emboss applications.
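The spreadsheet arithmetic and the length filter can be sketched together in Python. This is an illustration under stated assumptions, not the in-house C programme: the function name `intergenic_regions` is hypothetical, coordinates are taken to be 1-based and inclusive, and the region between the last gene and the end of the genome is not handled.

```python
def intergenic_regions(genes, min_len=50):
    """Return (start, end, length) for gaps between coding regions of at least min_len."""
    regions = []
    prev_end = 0  # position before the first nucleotide
    for start, end in sorted(genes):
        gap_start, gap_end = prev_end + 1, start - 1
        length = gap_end - gap_start + 1  # zero or negative when genes touch or overlap
        if length >= min_len:
            regions.append((gap_start, gap_end, length))
        prev_end = max(prev_end, end)
    return regions


# First six coding regions of the Mycoplasma genitalium protein table (see Fig1).
genes = [(735, 1829), (1829, 2761), (2846, 4798),
         (4813, 7323), (7295, 8548), (8552, 9184)]
kept = intergenic_regions(genes)
print(kept)
```

For these six genes only the regions 1 to 734 and 2762 to 2845 pass the filter, which agrees with the first two entries of the range file shown in Fig3.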

Extracting the intergenic regions from the genomes:

The extractseq application in the emboss suite was used to extract each intergenic region of the genome as a separate sequence in fasta format. This procedure was repeated for each genome.

Syntax: extractseq -regions @rangefile -separate
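What this extraction step produces can be mimicked in a few lines of Python. The sketch below is not extractseq itself; the function name `extract_regions` and the toy genome are hypothetical, coordinates are assumed to be 1-based and inclusive, and the header layout is modelled on the records shown in Fig4.

```python
def extract_regions(genome, regions, accession, description):
    """Cut each (start, end) region out of the genome and label it fasta-style."""
    records = []
    for start, end in regions:
        header = f">{accession}_{start}_{end} {description}"
        # 1-based inclusive coordinates map to the Python slice [start - 1:end].
        records.append((header, genome[start - 1:end]))
    return records


toy_genome = "ACGTACGTAC"  # stand-in for a whole-genome sequence
records = extract_regions(toy_genome, [(1, 4), (7, 10)], "TOY", "intergenic sequence")
for header, sequence in records:
    print(header)
    print(sequence)
```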


Making Genome databases:

A database formatting programme supplied with the WU BLAST 2.0 suite was used to make the databases. Each database contained five genomes, that is, all of the selected genomes except the one whose intergenic regions were to be searched against it.

Syntax: xdformat -n -o database_name

Similarity search:

The intergenic regions of each genome were searched for similarity with the blastn programme from the WU BLAST 2.0 suite, using default parameters, against the database that does not contain that organism's genome.

Syntax: blastn database_name nucleotide_query >output_file_name
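The leave-one-out design of the databases and searches can be written down compactly. The Python sketch below only builds the plan of which genomes go into which database; it does not call xdformat or blastn, and the function name `leave_one_out` is hypothetical.

```python
GENOMES = [
    "M.penetrans", "M.mycoides", "M.gallisepticum",
    "M.pulmonis", "M.pneumoniae", "M.genitalium",
]


def leave_one_out(genomes):
    """Map each query genome to the list of the other genomes forming its database."""
    return {query: [g for g in genomes if g != query] for query in genomes}


plan = leave_one_out(GENOMES)
for query, database in plan.items():
    print(query, "->", ", ".join(database))
```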

Parsing WU BLAST 2.0 outputs:

The WU BLAST 2.0 output file must be parsed before it can be used by QRNA. Parsing was done with the perl script blastn2qrnadepth.pl, available with the QRNA-2.0.1 suite, using default parameters. Parsing produces three output files; the one with the .q extension (file_name.q) is used as input for the QRNA application.

Syntax: blastn2qrnadepth.pl -g query_organism file_name

Non-coding RNA prediction:

The file_name.q file obtained from the parsed blast output was used as input for QRNA, with a window size of 150 nucleotides and the window moved 50 nucleotides at each step. The option -B was used to avoid false positive scores.

Syntax: qrna -w 150 -x 50 -B input_file_name > output_file_name
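The window scanning can be pictured with a short Python sketch that lists the alignment positions covered by each window for a window of 150 and a slide of 50. The function name `windows` is hypothetical, and the way the last, shorter window is handled here is an assumption; QRNA's own treatment of alignment ends may differ.

```python
def windows(alignment_length, window=150, slide=50):
    """Return 0-based inclusive (start, end) spans of a sliding window scan."""
    spans = []
    start = 0
    while start < alignment_length:
        end = min(start + window, alignment_length) - 1
        spans.append((start, end))
        if end == alignment_length - 1:
            break  # the alignment end has been reached
        start += slide
    return spans


spans = windows(664)  # 664 is the alignment length shown in the example QRNA output
print(len(spans), spans[0], spans[-1])
```

The first span, positions 0 to 149, matches the `posX: 0-149` line of the QRNA output reproduced in this report.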


Extraction of loci identified as ncRNA:

The perl script phase_count_fast.pl was used with default parameters to prune the QRNA output down to the independent genomic regions identified as RNAs. The nucleotide sequences of the predicted ncRNAs were then extracted by the same procedure used for extracting the intergenic regions.

Syntax: phase_count_fast.pl file_name query_org database_org


RESULTS

Intergenic regions are a rich source of ncRNAs. As a first step, the contribution of the intergenic regions to each genome was calculated. Graph 1 shows the lengths of the selected genomes and Graph 2 displays the percentage of each genome made up of intergenic regions. The graphs make it clear that intergenic sequence makes up a much smaller fraction of the genome than the protein coding regions. This agrees with a common feature of prokaryotes, which possess only a small percentage of intergenic sequence [Mattick 2001].

The number of intergenic sequences identified was high, and many of them were short stretches. Since biochemically characterized ncRNA genes have a minimum length of 50 nucleotides, only stretches of 50 nucleotides or more were considered. This curing was done by an in-house C programme. Graph 3 displays the number of intergenic regions present before and after curing. Nearly half of the intergenic regions were eliminated by this criterion.
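The effect of curing can be checked with simple arithmetic on the counts plotted in Graph 3. The Python sketch below uses those counts as printed with the graph; the function name `percent_removed` is hypothetical. Over all six genomes the eliminated fraction comes to about 52%, with considerable variation between genomes.

```python
# Number of intergenic regions before and after curing (values from Graph 3).
before = {"M.pen": 1037, "M.myc": 1016, "M.gal": 726,
          "M.pul": 782, "M.pne": 689, "M.gen": 480}
after = {"M.pen": 643, "M.myc": 572, "M.gal": 290,
         "M.pul": 376, "M.pne": 282, "M.gen": 122}


def percent_removed(before, after):
    """Percentage of regions eliminated by the 50-nucleotide cutoff, per genome."""
    return {org: 100.0 * (before[org] - after[org]) / before[org] for org in before}


removed = percent_removed(before, after)
total_before, total_after = sum(before.values()), sum(after.values())
overall = 100.0 * (total_before - total_after) / total_before
for org, pct in removed.items():
    print(f"{org}: {pct:.1f}% removed")
print(f"overall: {overall:.1f}% removed")
```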

The current analysis is based on the prediction of conserved secondary structures and on comparative genomics, which relies on similarity between the existing genomes. Hence, a database was created for each organism under study, consisting of the genomes of the other five organisms. The organism under study was then searched for similarity against this database. Table 1 lists each organism and the contents of the database against which it was searched. The table also gives the number of similarity hits fed as input to QRNA after processing with the perl script blastn2qrnadepth.pl, which filters out hits below the threshold level as described in the Methods. These numbers indicate the relative degree of similarity between the organisms with respect to genome size. The results of the perl script are displayed in Graph 4. The graph indicates that the number of similarity hits increased with genome size for almost all the selected genomes; the exception was M.gallisepticum. This suggests that this organism may have sequence characteristics different from those of the other selected organisms.
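One way to make this observation concrete is to order the genomes by size and flag any genome that has fewer hits than a smaller one. The Python sketch below does this with the genome sizes listed with Graph 1 and the hit counts from Table 1; the function name `breaks_trend` is hypothetical, and this ordering test is only one possible reading of "proportionate increase".

```python
genome_size = {"M.gen": 580074, "M.pne": 816394, "M.pul": 963879,
               "M.gal": 996422, "M.myc": 1211703, "M.pen": 1358633}
blast_hits = {"M.gen": 386, "M.pne": 565, "M.pul": 1012,
              "M.gal": 850, "M.myc": 1787, "M.pen": 1852}


def breaks_trend(sizes, hits):
    """Return genomes that have fewer hits than some smaller genome."""
    outliers = []
    best_so_far = 0
    for org in sorted(sizes, key=sizes.get):  # smallest genome first
        if hits[org] < best_so_far:
            outliers.append(org)
        best_so_far = max(best_so_far, hits[org])
    return outliers


outliers = breaks_trend(genome_size, blast_hits)
print(outliers)
```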

The similarity hits that passed the set threshold were evaluated by QRNA using a window scanning approach. A window size of 150 nucleotides and a step of 50 nucleotides were chosen to minimize the CPU time taken by QRNA.

The putative ncRNAs reported by QRNA for each organism are shown in Graph 5. Here again, the number of ncRNAs predicted increased with genome size, except for M.gallisepticum.

The spread of the lengths of the putative ncRNAs is plotted in Graph 6. The graph shows the range, that is, the shortest and the longest ncRNA predicted for each organism, together with the average length, indicated by the horizontal line.


1.
Mycoplasma genitalium G37 complete genome - 0..580074
480 proteins
Location     Strand  Length  PID      Gene   Synonym  Code  COG  Product
735..1829    +       364     3844620  MG001  -        -     -    DNA polymerase III, subunit beta (dnaN)
1829..2761   +       310     1045670  MG002  -        -     -    dnaJ-like protein
2846..4798   +       650     1045671  MG003  -        -     -    DNA gyrase subunit B (gyrB)
4813..7323   +       836     1045672  MG004  -        -     -    DNA gyrase subunit A (gyrA)
7295..8548   +       417     1045673  MG005  -        -     -    seryl-tRNA synthetase (serS)
8552..9184   +       210     1045674  MG006  -        -     -    thymidylate kinase (tmk)
……

2.
735   1829
1829  2761
2846  4798
4813  7323
7295  8548
8552  9184
……

3.
2762  2845
4799  4812
7324  7294
8549  8551
……

Fig1: (1) Protein table format of the Mycoplasma genitalium genome, showing the annotation of the protein coding regions and the names of the characterized and putative proteins. (2) Coordinates of the protein coding regions alone, obtained after a series of conversions with the Table to Text and Text to Table options in Microsoft Word. (3) Coordinates of the intergenic sequences alone, obtained by simple arithmetic in Microsoft Excel.


Graph1: GENOME LENGTH COMPARISON OF THE MYCOPLASMA

[Bar chart "Genome Size Comparison": genome length, axis from 0 to 1,500,000 nucleotides, for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen.]

Organism          Genome size
M.genitalium      580,074
M.pneumoniae      816,394
M.pulmonis        963,879
M.gallisepticum   996,422
M.mycoides        1,211,703
M.penetrans       1,358,633

Graph2: BAR GRAPH SHOWING THE PERCENTAGE OF INTERGENIC REGION IN THE GENOME OF MYCOPLASMA

[Stacked bar chart: intergenic region and protein coding region as a percentage of the genome, axis from 0% to 100%, for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen.]

M.gen - Mycoplasma genitalium
M.pne - Mycoplasma pneumoniae
M.pul - Mycoplasma pulmonis
M.gal - Mycoplasma gallisepticum
M.myc - Mycoplasma mycoides
M.pen - Mycoplasma penetrans


1.
Starting  Ending  Length
1         734     734
2762      2845    84
4799      4812    14
7324      7294    0
8549      8551    3
9185      9156    0
9922      9923    2
11253     11251   0
12041     12068   28
12726     12701   0
13566     13569   4
14434     14395   0
15317     15555   239
……

2.
Starting  Ending  Length
1         734     734
2762      2845    84
15317     15555   239
……

Fig2: (1) Intergenic sequence coordinates and their lengths in Mycoplasma genitalium, as obtained by simple arithmetic in Microsoft Excel. Intergenic regions range in length from 1 nucleotide to thousands of nucleotides (not shown here).

(2) Intergenic regions after curing by the C programme, which removes the regions shorter than 50 nucleotides. The number of intergenic regions decreases considerably after curing.

#this is Mycoplasma genetalium G37 range file
1 734
2762 2845
15317 15555
19760 19824
20356 20543
28449 28650
36714 36977
38979 39127
47423 47580
……

Fig3: The first few coordinates of the range file created for Mycoplasma genitalium for use in the emboss application.


[Bar chart "Curing of Intergenic Regions": number of intergenic regions, axis from 0 to 1200, before and after curing.]

         M.pen  M.myc  M.gal  M.pul  M.pne  M.gen
Before   1037   1016   726    782    689    480
After    643    572    290    376    282    122

Graph3: GRAPH SHOWING THE CULLING OF THE INTERGENIC SEQUENCES BY THE C PROGRAMME THAT SELECTS ONLY THE REGIONS WHOSE LENGTH IS GREATER THAN OR EQUAL TO 50 NUCLEOTIDES


>L43967_1_734 Mycoplasma genitalium G37 intergenic sequence
TAAGTTATTATTTAGTTAATACTTTTAACAATATTATTAAGGTATTTAAAAAATACTATTATAGTATTTAACATAGTTAAATACCTTCCTTAATACTGTTAAATTATATTCAATCAATACATATATAATATTATTAAAATACTTGATAAGTATTATTTAGATATTAGACAAATACTAATTTTATATTGCTTTAATACTTAATAAATACTACTTATGTATTAAGTAAATATTACTGTAATACTAATAACAATATTATTACAATATGCTAGAATAATATTGCTAGTATCAATAATTACTAATATAGTATTAGGAAAATACCATAATAATATTTCTACATAATACTAAGTTAATACTATGTGTAGAATAATAAATAATCAGATTAAAAAAATTTTATTTATCTGAAACATATTTAATCAATTGAACTGATTATTTTCAGCAGTAATAATTACATATGTACATAGTACATATGTAAAATATCATTAATTTCTGTTATATATAATAGTATCTATTTTAGAGAGTATTAATTATTACTATAATTAAGCATTTATGCTTAATTATAAGCTTTTTATGAACAAAATTATAGACATTTTAGTTCTTATAATAAATAATAGATATTAAAGAAAATAAAAAAATAGAAATAAATATCATAACCCTTGATAACCCAGAAATTAATACTTAATCAAAAATGAAAATATTAATTAATAAAAGTGAATTGAATAAAATTTTGGGAAAAA
>L43967_2762_2845 Mycoplasma genitalium G37 intergenic sequence
AAAACCTTTCATTTTTAATGTGTTATAATTATTTGTTATGCCATAAATTTAGTTTGTGGCAAAAGCTTCTGTACTGTTTATTTA
>L43967_15317_15555 Mycoplasma genitalium G37 intergenic sequence
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATTGGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCAACTGAGCTATACTTCCAAGCATAATCCTAAGGGTATTTAACTAATTATTATAACAATTTTAATTTAACCAAAATACCCCTCGAATTTTAACAGTTTTTATAATCAAAACAGCTAATTTT
>L43967_19760_19824 Mycoplasma genitalium G37 intergenic sequence
ATAAATTTAATAGTGTTGAAAGACAAACATTATTAATTTTTGATCAGCTAAATAAAACAAAGCAA
>L43967_20356_20543 Mycoplasma genitalium G37 intergenic sequence
CTCAAAAAACTAATACATCAAACTTCAACCGTTTACTTTTTTATGAACAAGCACTACAAAGGTTTTATGAAGAATTATTTCAAATAGATTATTTAAGAAGATTTGAAAACATTCCCATTAAAGATAAGAATCAAATTGCGCTTTTTAAAACTGTTTTTGATGATTACAAAACCATTGATTTAGCAGAA

Fig4: Result of the extractseq application in the emboss suite, which extracts the sequences of interest. The figure shows, in fasta format, the first few intergenic sequences of Mycoplasma genitalium obtained from extractseq with the range file and the whole genome sequence as input.


Organism         Database created  Organisms in database                                                 No. of Blastn hits
M.penetrans      ggmpnpudb         M.gallisepticum, M.genitalium, M.mycoides, M.pneumoniae, M.pulmonis    1852
M.mycoides       ggpppdb           M.gallisepticum, M.genitalium, M.penetrans, M.pneumoniae, M.pulmonis   1787
M.gallisepticum  gempppdb          M.genitalium, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis        850
M.pulmonis       ggmpepndb         M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pneumoniae   1012
M.pneumoniae     ggmpepudb         M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pulmonis     565
M.genitalium     gampppdb          M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis     386

Table1: The databases created with the WU BLAST 2.0 application, the organisms each database contains, the organism searched against each database, and the number of Blastn hits obtained.


BLASTN 2.0MP-WashU [03-Mar-2004] [linux24-i686-ILP32F64 2004-03-03T16:23:09]

Copyright (C) 1996-2004 Washington University, Saint Louis, Missouri USA.
All Rights Reserved.

Reference: Gish, W. (1996-2004) http://blast.wustl.edu

Notice: this program and its default parameter settings are optimized to find
nearly identical sequences rapidly. To identify weak protein similarities
encoded in nucleic acid, use BLASTX, TBLASTN or TBLASTX.

Query= L43967_1_734 Mycoplasma genetalium G37 intergenic sequence
        (734 letters; record 1)

Database: gal.fasta
        5 sequences; 5,347,031 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done

WARNING: hspmax=1000 was exceeded by 1 of the database sequences, causing the associated cutoff score, S2, to be transiently set as high as 81.

Sequences producing High-scoring Segment Pairs:                 High Score  Smallest Sum Probability P(N)  N

gb|U00089| Mycoplasma pneumoniae M129, intergenic sequence      692         1.8e-26                        1
emb|BX293980.1| Mycoplasma mycoides subsp. mycoides SC ge...    675         1.1e-25                        1
dbj|BA000026| Mycoplasma penetrans, intergenic sequence         602         2.1e-22                        1
emb|AL445566| Mycoplasma pulmonis (strain UAB CTIP) inter...    539         1.3e-19                        2
gb|AE015450.1| Mycoplasma gallisepticum strain R intergen...    528         4.6e-19                        1


>gb|U00089| Mycoplasma pneumoniae M129, intergenic sequence
        Length = 816,394

Plus Strand HSPs:

 Score = 692 (109.9 bits), Expect = 1.8e-26, P = 1.8e-26
 Identities = 410/664 (61%), Positives = 410/664 (61%), Strand = Plus / Plus

Query:  90 TTAATACTGTTAAATTATATTCAATCAATACATATATAATATTATTAAAATACT-TGATA 148
           | ||||||  |    |||||   ||  | | |  | |  | |||  | |||| | |  ||
Sbjct: 130 TAAATACTAATCTTCTATATAGTATAGAGAAACTTTTTCT-TTAACATAATATTATCTTA 188

Query: 149 AGTATTATTTAGATATTAGACAAAT-ACTAATTTTA-TATTGCTTTAATACT-TAATAAA 205
           | |||||||||  || || | |  | | || | ||| ||||  |||| || | |  |||
Sbjct: 189 A-TATTATTTACCTACTA-ATAGCTTAATATTATTAGTATTTATTTAGTATTATGCTAA- 245

Query: 206 TACTACTTATGTATTAAGTAAATATTACTGTAATACTAATAA-C-AATATTATTAC-AAT 262
           |||||   |  |||||  | ||||||| | || || || ||  | |||||||||   |||
Sbjct: 246 TACTATGCAGATATTATCTTAATATTA-TCTA-TAGTATTAGGCTAATATTATTCTTAAT 303

Query: 263 ATGCTAGAATAATATTGCTAGTATCAATAATTACTAATATAGTATTAGGAAAATACCATA 322
           ||  ||   |||  |  |||  | || ||  || ||    | ||||| || ||||| | |
Sbjct: 304 ATT-TAT--TAAGGTA-CTAA-AGCATTACCTA-TAGGTGA-TATTATGACAATACTAAA 356

Query: 323 ATAAT-ATTTCTAC-ATAATACTAAGTTAATACTATGTGTAGAATAATAAATAATCAGAT 380
            |  | | |  ||  |   || ||  | || | ||| | |  || | ||     | || |
Sbjct: 357 GTGGTTAGTATTATTAGGGTATTAT-TCAA-AGTAT-TCTCCAACACTATTCCCTTAGCT 413

Fig5: Output of the blastn programme from WU BLAST 2.0, run with the intergenic sequences of Mycoplasma genitalium against the database containing the intergenic sequences of the other five Mycoplasma genomes: M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis. (The alignment is only partially shown.) The blastn programme was run with default parameters.


Graph4: GRAPH SHOWING NUMBER OF BLAST HITS FOR EACH GENOME

                   M.pen  M.myc  M.gal  M.pul  M.pne  M.gen
No. of Blast hits  1852   1787   850    1012   565    386


>L43967_15317_15555-1>179-Mycoplasma
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATTGGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCAACTGAGCTATACTTCCAAGCATAATCCTAAGGGTAT-TTAACTA-ATTATTATAACAATTT

>gb-U00089--19096>19275-Mycoplasma
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATTGGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCAACTGAGCTATACTTCCAGGCAAAATCTTC-GTACAGGTTCGCTTCATAATTATATTAATTT

>L43967_15317_15555-1<239-Mycoplasma
AAAATTA-GCTGTTTTGATTATAAAAACTG-TTAAAATTCGAGGGGTATTTTGGTTAAATTAAAATTGTTATAATAATTA-GTTA-AATACCCTTAGGATT-ATGCTTGGAAGTATAGCTCAGTTGGTTAGAGCACACCCCTGATAAGGGTGAGGTCGATGGTTCAAGTCCATTTACTTCCACCAATAAT---GGGGATGTAGCTCAACTGATAGAGCACCTGATTTGCACTCAGGAGGTTGAGGGT

>gb-AE015450.1--417273>417511-Mycoplasma
AATTTTACGC-GTTGTTATTACCAATCGAAATTAAAAATTAAGCAG-ATATTCTTTAA--TGAGCT-GA-AT--TAATTATGTTATAATTCATATGGCAATCACGACTGGAAGTATAGCTCAGCTGGTTAGAGCACACCCCTGATAAGGGTGAGGTCGATGGTTCAAGTCCATTTACTTCCACCAGTTTTTTTGGGGACGTAGCTCAATTGATAGAGCACCTGATTTGCACTCAGGAGGTCGAGGGT

>L43967_19760_19824-5<65-Mycoplasma
TTGCTTTGTTTTATTTAGCTGATCAA-AAATTAATAATGTTTGTCTTTCAACACTATTAAAT

>emb-BX293980.1--57200>57261-Mycoplasma
TTGTTTTGTTTTATTTAATTGATCAATAAATTGATTTAGTTTATCTTTATTTATTAATAAAT

Fig6.1: One of the output files of the perl script blastn2qrnadepth.pl, run with the blastn result of the M.genitalium intergenic sequences versus the Mycoplasma database as input. This first file has the extension .q (here genblast.q) and is the file used as input for the qrna programme in the QRNA-2.0.1 suite. It consists of a collection of sequences in fasta format, in which each consecutive pair of sequences forms the two components of an alignment, with the gaps left in place.


1. FILE: genblast
DIR: /home/kalyankpy/coput2/blast//

FIRST TRIMMING
Minimum length = 1
Maximum Evalue = 0.01
Minimum %id = 0
Maximum %id = 100

SECOND TRIMMING
Alignments culled by = SC
Depth of alignments = 1
shift = 1

113-QUERY: L43967_546708_546877 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 1121
After First trimming: 88
After Second trimming: 2

57-QUERY: L43967_325878_326027 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 152
After First trimming: 3
After Second trimming: 3

72-QUERY: L43967_386409_386461 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 292
After First trimming: 0
After Second trimming: 0

68-QUERY: L43967_364415_364533 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 155
After First trimming: 1
After Second trimming: 1

……

Total #Queries 122
Total #Alignments 53927 ave_len = 309.5
After first trimming 18851 ave_len = 552.6
After second trimming 386 ave_len = 404.2

Fig6.2: The second output file of the perl script blastn2qrnadepth.pl, run with the blastn result of the M.genitalium intergenic sequences versus the Mycoplasma database as input. This file has the extension .q.rep (here genblast.q.rep) and reports the BLASTN alignments that were pruned while creating the .q file, according to the options used in the perl script.

#---------------------------------------------------------------------------------
# qrna 2.0.1 (Tue Aug 19 11:30:55 CDT 2003) using squid 1.5m (Sept 1997)
#---------------------------------------------------------------------------------
# PAM model = BLOSUM62
#---------------------------------------------------------------------------------
# RNA model = /mix_tied_linux.cfg
# RIBOPROB matrix = /RIBOPROB85-60.mat
#---------------------------------------------------------------------------------
# seq file = /home/kalyankpy/perlscriptresult/genblast.q
# #seqs: 772 (max_len = 3420)
#---------------------------------------------------------------------------------
# window version: window = 150 slide = 50 -- length range = [0,9999999]
#---------------------------------------------------------------------------------
# 1 [both strands] (sre_shuffled)
>L43967_1_734-90>722-Mycoplasma (664)
>gb-U00089--130>767-Mycoplasma (664)

length of whole alignment after removing common gaps: 664
Divergence time (variable): 0.401
[alignment ID = 61.75 MUT = 29.67 GAP = 8.58]

length alignment: 150 (id=61.33) (mut=32.67) (gap=6.00)(sre_shuffled)
posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43)
posY: 0-149 [0-144](145) -- (0.37 0.11 0.06 0.46)

L43967_1_734-90 TTAATTTTATTAAAACTATAACTTATTTTTTATAAACATTCTATGTTTTT
gb-U00089--130> TTTATTTTATTAAAATTATAATGTATTTTTGTTAAATTTT.TAATTCTTT

L43967_1_734-90 TAAA.CAAATGAGAAATATAGTAATAAAGCAAATT.TTTTCACCAT.TTT
gb-U00089--130> CAGTGCACATA.CCTATTCGCTAGTTAA.ACGATAAAGTTAAAGAAATTT

L43967_1_734-90 TTTATTATATCA.AAATTTAAAGAAAAATCTGAAAATTATCTATAATGTG
gb-U00089--130> TTCTTTATATTCTAAATTT.AAAAATCTTCTCAATATAATACATAAT.TC

LOCAL_DIAG_VITERBI -- [Inside SCFG]

OTH ends *(+) = (0..[150]..149)
OTH ends (-) = (0..[150]..149)
COD ends *(+) = (120..[27]..146)
COD ends (-) = (41..[12]..52)
RNA ends *(+) = (0..[21]..20)
RNA ends (-) = (0..[150]..149)
winner = OTH
OTH = 184.281 COD = 166.408 RNA = 179.710
logoddspostOTH = 0.000 logoddspostCOD = -17.873 logoddspostRNA = -4.571
sigmoidalOTH = 4.571 sigmoidalCOD = -17.932 sigmoidalRNA = -4.571

Fig7: This is the qrna output obtained with the command: qrna -w 150 -x 50 -a -B genblast.q. qrna is a C programme written to evaluate a given alignment for its ability to form a structural RNA. The figure above is the partial output of a qrna run with the scanning window option (here window size = 150 and slide size = 50 nucleotides).

Every new blast alignment starts with two lines: “>Query_name” followed by “>Subject_name”. “Divergence time” indicates the particular time parameterization of QRNA used. By default QRNA decides on the divergence time (in this case 0.401) given the percentage identity of the alignment (61.75%). Each new analyzed window starts with the line: length alignment:

For each window and for each sequence in the alignment we have a line of the form: posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43) The first pair of numbers gives the first and last coordinates of the window with respect to the beginning of the alignment. The pair of numbers in square brackets gives the mapping of the window into the coordinate system of sequence X (after removing the gaps). The adjacent number in parentheses is the length of that segment in sequence X. Finally, the four decimal numbers in parentheses are the fractions of A, C, G and T respectively in the segment of sequence X covered by that window.
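As an illustration of the line layout described above, the following Python sketch parses one posX/posY line into its parts. The regular expression and the function name `parse_pos_line` are assumptions made for this example and are not part of the QRNA suite. The sample line is the one quoted in the text.

```python
import re

# Pattern for lines such as:
#   posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43)
POS_LINE = re.compile(
    r"pos(?P<seq>[XY]):\s*(?P<w_from>\d+)-(?P<w_to>\d+)\s*"
    r"\[(?P<s_from>\d+)-(?P<s_to>\d+)\]\((?P<s_len>\d+)\)\s*--\s*"
    r"\((?P<freqs>[\d.\s]+)\)"
)


def parse_pos_line(line):
    """Split a posX/posY line into window, segment and base fractions."""
    m = POS_LINE.search(line)
    if m is None:
        raise ValueError("not a posX/posY line: %r" % line)
    freqs = [float(x) for x in m.group("freqs").split()]
    return {
        "seq": m.group("seq"),
        # window coordinates relative to the start of the alignment
        "window": (int(m.group("w_from")), int(m.group("w_to"))),
        # same window mapped onto the ungapped sequence
        "segment": (int(m.group("s_from")), int(m.group("s_to"))),
        "segment_length": int(m.group("s_len")),
        # fractions are listed in the order A, C, G, T
        "base_fractions": dict(zip("ACGT", freqs)),
    }


rec = parse_pos_line("posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43)")
print(rec["window"], rec["segment"], rec["segment_length"])
```

In this example the window covers 150 alignment columns but only 146 nucleotides of sequence X, so four of the columns are gaps in X.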

For each model, and for each strand, the output lists the actual local regions (there can be more than one per model and strand) that score according to the model. The notation is (from..[Length]..to). Coordinates for both strands are given relative to the positive strand. The * indicates the strand with the strongest signal for a given model.

For the given scoring algorithm (here local Viterbi, the default) we get three rows of numbers:
Row 1: The scores of the alignment under each of the three models. The null model is a fourth model which assumes that the two sequences in the alignment are independent of each other.
Row 2: The two (COD and RNA) log-odds posterior probabilities with respect to the OTH model.
Row 3: The three sigmoidal scores, each calculated using the other two models as null models. The model with the highest sigmoidal score is the winner.
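The relation between the first and third rows can be checked numerically. The Python sketch below assumes that the scores are in bits and that the sigmoidal score of a model is its score minus the log2 of the summed likelihoods of the other two models, with equal priors. This formula is an interpretation of the description above rather than a quotation from the QRNA source, but it agrees with the Row 3 values of Fig7 to within rounding. The function names are illustrative.

```python
import math


def log2_sum(values):
    """log2 of the sum of 2**v over values, computed stably."""
    m = max(values)
    return m + math.log2(sum(2.0 ** (v - m) for v in values))


def sigmoidal_scores(scores):
    """Score of each model against the other two taken together (bits).

    Assumes equal prior weight for all models (an assumption of this sketch).
    """
    result = {}
    for model, score in scores.items():
        others = [v for k, v in scores.items() if k != model]
        result[model] = score - log2_sum(others)
    return result


# Row 1 of the window shown in Fig7.
row1 = {"OTH": 184.281, "COD": 166.408, "RNA": 179.710}
sig = sigmoidal_scores(row1)
winner = max(sig, key=sig.get)

for model in ("OTH", "COD", "RNA"):
    print("sigmoidal%s = %.3f" % (model, sig[model]))
print("winner =", winner)
```

Under the same assumption, Row 2 is the difference between each score and the OTH score, for example 166.408 - 184.281 = -17.873 for COD.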

---------------Some General Statistics-------------------
FILE: ./genblast.2qrna
method: LOCAL_DIAG_VITERBI
Cutoff: 5
max id: 100
# blastn hits: 386
# windows: 2574
---------------------------------------------------------

---------------Statistics by Windows---------------------
# windows: 2574
RNA>0: 41/2574
RNA>cutoff: 2/2574
COD>0: 2/2574
COD>cutoff: 0/2574
in phases: 2045/2574
RNA: 2/2045
COD: 0/2045
OTH: 2043/2045
in transitions: 0/2574
RNA/COD: 0/0
RNA/OTH: 0/0
COD/OTH: 0/0
RNA/COD/OTH: 0/0
---------------------------------------------------------

---------------Statistics for RNA loci ():-------------------
# loci: 1
ave_length: 196.00

1-loci L43967_167180_175806-Mycoplasma 7457 7653 (197) 2 RNA -39.20 26.19

Fig8: This is the output of the phase_count_fast.pl perl script, which extracts the RNA loci (those with an RNA score larger than a cutoff set with option -u; here -u is left at its default of 5). The script identified 1 independent locus in Mycoplasma genitalium that scores as RNA above 5 bits, out of the 2574 windows from the 386 blastn alignments. The coordinates of each locus are listed in the following form:
num-loci name_seq(seq_length) loc_from loc_to (loc_length) number_wind type_loc COD_sc RNA_sc
Therefore,
1-loci L43967_167180_175806-Mycoplasma 7457 7653 (197) 2 RNA -39.20 26.19
means that the first M.genitalium locus corresponds to the intergenic sequence named L43967_167180_175806-Mycoplasma. The locus has a length of 197 nucleotides and covers the region from 7457 to 7653. Two different windows have contributed to this RNA locus; the average sigmoidal score for the coding model is -39.20 bits, while the average sigmoidal score for the RNA model is 26.19 bits.
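The locus line explained in Fig8 can be read programmatically as follows. This Python sketch is only an illustration of the field layout. The function name `parse_locus_line` and the dictionary keys are hypothetical, and the parser expects exactly the nine whitespace-separated fields seen in the example line.

```python
def parse_locus_line(line):
    """Parse one locus line from the phase_count_fast.pl report.

    Assumes nine whitespace-separated fields, as in the example of Fig8.
    """
    fields = line.split()
    if len(fields) != 9:
        raise ValueError("expected 9 fields, got %d" % len(fields))
    num, name, loc_from, loc_to, loc_len, n_win, kind, cod, rna = fields
    return {
        "index": int(num.split("-")[0]),       # "1-loci" -> 1
        "name": name,
        "from": int(loc_from),
        "to": int(loc_to),
        "length": int(loc_len.strip("()")),    # "(197)" -> 197
        "windows": int(n_win),
        "type": kind,
        "cod_score": float(cod),               # average sigmoidal COD score
        "rna_score": float(rna),               # average sigmoidal RNA score
    }


locus = parse_locus_line(
    "1-loci L43967_167180_175806-Mycoplasma 7457 7653 (197) 2 RNA -39.20 26.19"
)
# The reported length is consistent with inclusive coordinates.
print(locus["to"] - locus["from"] + 1, locus["length"])
```

The inclusive coordinate range 7457 to 7653 spans 197 positions, which matches the length given in parentheses.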

Graph5: REPRESENTATION OF THE PUTATIVE ncRNAS PREDICTED BY QRNA
[Bar chart. Y-axis: No. of ncRNAs (scale up to 60). X-axis: M.pen, M.myc, M.gal, M.pul, M.pne, M.gen. Value labels as extracted: 52, 40, 12, 39, 4, 10.]

Graph6: GRAPH SHOWING THE LENGTH RANGE OF NON-CODING RNAs.
[Plot title: Range of Non-coding RNA. Y-axis: Length (nt), 0 to 350. X-axis: M.pen, M.myc, M.gal, M.pul, M.pne, M.gen.]
(Vertical bars represent the spread of scores and horizontal bars represent the average)

Fig9: Analysis of the BLASTN alignments between M.genitalium intergenic sequences and the intergenic sequence database of M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae and M.pulmonis. Alignments have been grouped by percentage identity. Each figure is a histogram of the number of alignments binned in each percentage identity interval. The green histogram shows the total number of windows analyzed. The blue histogram shows the windows that score as RNA or coding sequence above a cutoff of 0 bits.

a) Figure showing the number of sequences scored as coding regions in the windows analyzed.
b) Figure showing the number of sequences scored as RNAs in the windows analyzed.

Fig10: Analysis of the BLASTN alignments between M.genitalium intergenic sequences and the intergenic sequence database of M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae and M.pulmonis. Alignments have been grouped by percentage identity. Each figure shows the scores of all the alignments as a function of the percentage identity of the alignments. “*” represents the average of the RNA or coding sequence scores. The error bars correspond to one standard deviation.

a) Figure showing the average of the scores scored as coding regions in the windows analyzed.
b) Figure showing the average of the scores scored as RNAs in the windows analyzed.

[Histogram, qrna 2.0.1. Title: genblast.qrna.COD.id--sigmoidal LOD. X-axis: % ID (50 to 100). Y-axis: NUMBER OF WINDOWS (0 to 30). Annotations: <len> = 303 +/- 198, ID=[100:0], total_counts [361]; real COD-phase_counts_above: 0 [16//361].]

Fig 9a: Figure showing the number of sequences scored as coding regions in the windows analyzed.

[Histogram, qrna 2.0.1. Title: genblast.qrna.RNA.id--sigmoidal LOD. X-axis: % ID (50 to 100). Y-axis: NUMBER OF WINDOWS (0 to 30). Annotations: <len> = 303 +/- 198, ID=[100:0], total_counts [361]; real RNA-phase_counts_above: 0 [28//361].]

Fig 9b: Figure showing the number of sequences scored as RNAs in the windows analyzed.

[Plot, qrna 2.0.1. Title: genblast.qrna.COD.id--sigmoidal LOD. X-axis: % ID (50 to 100). Y-axis: COD sigmoidal LODSCORE (-80 to 60). Annotations: <len> = 303 +/- 198, ID=[100:0]; ave COD lodscore above: 0 [16//361].]

Fig 10a: Figure showing the average of the scores scored as coding regions in the windows analyzed.

[Plot, qrna 2.0.1. Title: genblast.qrna.RNA.id--sigmoidal LOD. X-axis: % ID (50 to 100). Y-axis: RNA sigmoidal LODSCORE (-60 to 20). Annotations: <len> = 303 +/- 198, ID=[100:0]; ave RNA lodscore above: 0 [28//361].]

Fig 10b: Figure showing the average of the scores scored as RNAs in the windows analyzed.

DISCUSSION

The intergenic regions in prokaryotes are small; however, they have long been shown to play a significant role in these organisms. The percentage of intergenic regions in the Mycoplasma genomes varied from 9.2% in M.genitalium (the smallest genome) to 18% in M.mycoides (the largest genome). The number of intergenic regions ranged from 122 (lowest, in M.genitalium) to 643 (highest, in M.mycoides). The average length of the intergenic regions ranged from 234 nucleotides (in M.penetrans) to 441 nucleotides (in M.genitalium). This indicates that the average length of intergenic regions is greater in a smaller genome than in a larger genome. This could be due to the large number of small interspersed regions (intergenic regions of only a few nucleotides) in M.penetrans, which reduces the average length.

QRNA was run with the option that shuffles the sequences. This estimates the false positives that could arise from the given sequence composition and length. Earlier results from similar ncRNA predictions in E.coli showed 85% true positives (Rivas and Eddy 2001). The predicted loci in the present study are regions of conserved secondary structure that include ncRNAs, and need not be individual ncRNAs alone.

To assess the significance of the prediction, the predicted loci were searched for similarity against the known, biochemically characterized ncRNAs obtained from the ncRNA database at http://biobases.ibch.poznan.pl/nc (updated till 2002).

The putative non-coding RNAs were searched against known Mycoplasma ncRNA data (only two ncRNAs have been characterized in Mycoplasma capricolum). The results indicated that one of the putative ncRNAs from the current study showed a good percentage of identity (60%) with one of the two biochemically characterized Mycoplasma ncRNAs, viz. the Mc_MCS4 ncRNA from Mycoplasma capricolum. Mc_MCS4 has already been shown to have extensive similarity to the eukaryotic U6 snRNA. This strengthens the case that our candidate ncRNA is a functional entity. Since the number of ncRNAs in the biochemically characterized database was small, the database was expanded to include other prokaryotic ncRNAs.

The results indicated that a stretch of nucleotides in the putative ncRNA showed significant similarity to MicF RNAs from E.coli, S.typhi and K.pneumoniae. Since MicF is known to regulate the expression of OmpF, and the stretch of nucleotides showing similarity is conserved across all these species, the putative ncRNA stretch may be a MicF counterpart in Mycoplasma. Another putative ncRNA showing significant similarity to E.coli OxyS RNA was also noticed. OxyS RNA is known to modulate gene expression in response to hydrogen peroxide, a common chemical produced by mammals in response to infection. This suggests a defense mechanism operating in Mycoplasma.

The database was further expanded to include eukaryotic ncRNAs, comprising the characterized miRNAs, development-regulating RNAs and protein-function-modifying RNAs. The putative ncRNAs were found to have more than 60% identity with a number of miRNAs from mouse, humans, A.thaliana and C.elegans. Fig 11a shows a blastn hit with 71% identity against one of the putative ncRNAs from M.mycoides. This indicates that the putative ncRNA has a conserved secondary structure similar to the well characterized stem-loop region of the C.briggsae miRNA. Fig 11b shows a blastn hit with 63% identity between a putative ncRNA from the same M.mycoides and a characterized development-regulating RNA of Homo sapiens.

>cbr-mir-268 MI0000541 Caenorhabditis briggsae miR-268 stem-loop
Length = 79

Minus Strand HSPs:

Score = 95 (20.3 bits), Expect = 0.22, P = 0.19
Identities = 33/46 (71%), Positives = 33/46 (71%), Strand = Minus / Plus

Query: 64 CAAAC-CTCTAAACTT-CTAAGAACTTCTTCTTCTTCTTCTTCTTC 21
          || || | || | ||  || |   |||||| || ||||||||||||
Sbjct: 34 CAGACACACTCA-CTGACTCACTGCTTCTTGTTTTTCTTCTTCTTC 78

Fig 11a: A blastn hit with 71% identity obtained for one of the putative ncRNAs from M.mycoides. This indicates that the putative ncRNA has a conserved secondary structure similar to the well characterized stem-loop region of the C.briggsae miRNA.
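The identity figure reported for this hit can be recomputed directly from the two aligned strings. The short Python sketch below counts the alignment columns in which both sequences carry the same nucleotide. The function name `alignment_identity` is illustrative, and the two strings are copied from the hit in Fig 11a.

```python
def alignment_identity(query, subject, gap_chars="-."):
    """Return (identical columns, total columns) for a pairwise alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(
        1 for q, s in zip(query, subject)
        if q == s and q not in gap_chars
    )
    return matches, len(query)


# Aligned query and subject strings of the cbr-mir-268 hit (Fig 11a).
query = "CAAAC-CTCTAAACTT-CTAAGAACTTCTTCTTCTTCTTCTTCTTC"
sbjct = "CAGACACACTCA-CTGACTCACTGCTTCTTGTTTTTCTTCTTCTTC"

matches, columns = alignment_identity(query, sbjct)
percent = 100.0 * matches / columns
print("%d/%d (%.1f%%)" % (matches, columns, percent))
```

This gives 33 identical columns out of 46, about 71.7%, which the blastn report shows as 33/46 (71%).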

Significant hits were also found with the development-regulating ncRNAs, including those from Homo sapiens.

>Hs_NTT
Length = 17,572

Plus Strand HSPs:

Score = 116 (23.5 bits), Expect = 0.025, P = 0.024
Identities = 60/94 (63%), Positives = 60/94 (63%), Strand = Plus / Plus

Query:   11 TATTTAATATTTATAATTGCTATTTAGCATCTTAAAA-AAGA-CG-TCTTT-AAA-TATA 65
            ||  |||| |  || |||  |  | || |  |||| | |||  |  ||||  ||| ||||
Sbjct: 5336 TACATAAT-TAGATCATTTATTCTAAGTAAATTAAGAGAAGCTCTATCTTCCAAAATATA 5394

Query:   66 GATAGTTATACTAATTAGAAAATAGTTAAT-AAG 98
            ||||  | ||  ||| |||| |   ||||| |||
Sbjct: 5395 GATATCTCTAGCAAT-AGAAGAGTTTTAATTAAG 5427

Fig 11b: A sample hit with about 63% identity between a putative ncRNA from the same M.mycoides and a characterized development-regulating RNA of Homo sapiens.

These results indicate that the ncRNAs are conserved across other kingdoms of life. Since ncRNAs are generally conserved across a wide spectrum of organisms, they may play different roles in different cellular processes, though these roles are yet to be proved biochemically (which remains a challenging task).

Because the very existence and expression profile of ncRNAs are not predictable, their functional analysis remains challenging. With a set of predicted ncRNAs in hand, the task can be handled with a reduced burden.

REFERENCES

1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997, 25:3389

2. Argaman L, Hershberg R, Vogel J, Bejerano G, Wagner EG, Margalit H, Altuvia S: Novel small RNA-encoding genes in the intergenic regions of Escherichia coli. Current Biology 2001, 11:941

3. Badger JH, Olsen GJ: CRITICA: Coding Region Identification Tool Invoking Comparative Analysis. Molecular Biology and Evolution 1999, 16:512

4. Caprara MG, Nilsen TW: RNA: versatility in form and function. Nature Structural Biology 2000, 7:831

5. Rivas E, Eddy SR: Secondary structure alone is generally not statistically significant for the detection of non-coding RNAs. Bioinformatics 2000, 16:583

6. Rivas E, Eddy SR: QRNA: A non-coding RNA genefinder using comparative genome sequence analysis (ftp://ftp.genetics.wustl.edu/pub/eddy/software/qrna.tar.z) 2001

7. Rivas E, Klein RJ, Jones TA, Eddy SR: Computational identification of non-coding RNAs in Escherichia coli by comparative genomics. Current Biology 2001, 11:1369

8. Rivas E, Eddy SR: Non-coding RNA gene detection using comparative sequence analysis. BMC Bioinformatics 2001, 2:8

9. Erdmann VA, Barciszewska MZ, Szymanski M, Hochberg A, de Groot N, Barciszewski J: The non-coding RNAs as riboregulators. Nucleic Acids Research 2001, 29:189

10. Gish W: WU-BLAST 2.0 (http://blast.wustl.edu/) 2003

11. Huttenhofer A, Kiefmann M, Meier-Ewert S, O’Brien J, Lehrach H, Bachellerie JP, Brosius J: RNomics: an experimental approach that identifies 201 candidates for novel, small, non-messenger RNAs in mouse. EMBO Journal 2001, 20:2943

12. Lowe TM, Eddy SR: tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research 1997, 25:955

13. Lowe TM, Eddy SR: A computational screen for methylation guide snoRNAs in yeast. Science 1999, 283:1168

14. Szymanski M, Barciszewski J: Beyond the proteome: non-coding regulatory RNAs. Genome Biology 2002, 3:0005.1

15. Mattick JS: Non-coding RNAs: the architects of eukaryotic complexity. EMBO Reports 2001, 2:986

16. Olivas WM, Muhlrad D, Parker R: Analysis of the yeast genome: identification of new non-coding and small ORF-containing RNAs. Nucleic Acids Research 1997, 25:4619

17. Eddy SR: Non-coding RNA genes. Current Opinion in Genetics and Development 1999, 9:695

18. Eddy SR: Non-coding RNA genes and the modern RNA world. Nature Reviews Genetics 2001, 2:919

19. Schattner P: Searching for RNA genes using base-composition statistics. Nucleic Acids Research 2002, 30:2076

20. Wassarman KM, Zhang A, Storz G: Small RNAs in Escherichia coli. Trends in Microbiology 1999, 7:37

21. Zwieb C, Wower I, Wower J: Comparative sequence analysis of tmRNA. Nucleic Acids Research 1999, 27:2063
