finding the genes in microbial genomes natalia ivanova mgm workshop february 3, 2009

23
Advancing Science with DNA Sequence Finding the genes in Finding the genes in microbial genomes microbial genomes Natalia Ivanova Natalia Ivanova MGM Workshop MGM Workshop February 3, 2009 February 3, 2009

Upload: april

Post on 11-Jan-2016

40 views

Category:

Documents


0 download

DESCRIPTION

Finding the genes in microbial genomes Natalia Ivanova MGM Workshop February 3, 2009. Introduction Tools out there Basic principles behind tools Known problems of the tools: why you may need manual curation. Outline. Introduction (who said annotating prokaryotic genomes is easy?) - PowerPoint PPT Presentation

TRANSCRIPT

Advancing Science with DNA Sequence

Finding the genes in Finding the genes in microbial genomesmicrobial genomes

Natalia IvanovaNatalia IvanovaMGM WorkshopMGM Workshop

February 3, 2009February 3, 2009

Advancing Science with DNA Sequence

1.1.IntroductionIntroduction

2.2.Tools out thereTools out there

3.3.Basic principles behind Basic principles behind toolstools

4.4.Known problems of the Known problems of the tools: why you may need tools: why you may need manual curationmanual curation

Outline

Advancing Science with DNA Sequence

1.1.Introduction Introduction (who said (who said annotating prokaryotic genomes is annotating prokaryotic genomes is easy?)easy?)

2.2.Tools out thereTools out there

3.3.Basic principles behind toolsBasic principles behind tools

4.4.Known problems of the Known problems of the tools: why you may need tools: why you may need manual curationmanual curation

Outline

Advancing Science with DNA Sequence

Sequence features in prokaryotic genomes: stable RNA-coding genes (rRNAs, tRNAs, RNA component of RNaseP, tmRNA) protein-coding genes (CDSs) transcriptional features (mRNAs, operons, promoters, terminators, protein-binding sites, DNA bends) translational features (RBS, regulatory antisense RNAs, mRNA secondary structures, translational recoding and programmed frameshifts, inteins) pseudogenes (tRNA and protein-coding genes) …

Finding the genes in microbial genomesfeaturesfeatures

Well-annotated bacterial genome in Artemis genome viewer:

rRNA tRNAoperon

promoter terminatorprotein-coding gene

CDSprotein-binding site

Advancing Science with DNA Sequence

1.1. Introduction Introduction (who said annotating (who said annotating prokaryotic genomes is easy?)prokaryotic genomes is easy?)

2.2.Tools out thereTools out there (don’t bother to (don’t bother to write down the names and links, all write down the names and links, all presentations will be available on the web presentations will be available on the web site)site)

3.3.Basic principles behind toolsBasic principles behind tools4.4.Known problems of the tools: Known problems of the tools:

why you may need manual why you may need manual curationcuration

Outline

Advancing Science with DNA Sequence

• IMG-ERhttp://img.jgi.doe.gov/erIMG-ER submission page:http://durian.jgi-psf.org/~imachen/cgi-bin/Submission/main.cgi

• RASThttp://rast.nmpdr.org/

• JCVI Annotation Servicehttp://www.tigr.org/tigr-scripts/AnnotationEngine/ann_engine.cgi

Output: rRNAs and tRNAs, CDSs, functional annotationsoutput in several formats

Tools out there: servers for microbial genome annotation - I

Output: stable RNA-encoding genes, CDSs, functional annotationsoutput in GenBank format

Output: CDSs, stable RNAs? functional annotationsformat?

Advancing Science with DNA Sequence

Tools out there: servers for microbial genome annotation - II

• AMIGENEhttp://www.genoscope.cns.fr/agc/tools/amiga/Form/form.php

• RefSeqhttp://www.ncbi.nlm.nih.gov/genomes/MICROBES/genemark.cgihttp://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi

• EasyGenehttp://www.cbs.dtu.dk/services/EasyGene/

Output: CDSs, output in tbl format

Output: CDSs, size restriction <1Mb

Output: CDSs, output in gff format

Advancing Science with DNA Sequence

• Artemishttp://www.sanger.ac.uk/Software/Artemis/

• Manateehttp://manatee.sourceforge.net/

• Argohttp://www.broad.mit.edu/annotation/argo/

Major difference: viewer vs editor?

Tools out there: genome browsers for manual annotation of microbial

genomes

Linux versions only;genome needs to be annotated by the JCVI Annotation Service

Windows and Linux versions;works with files in many formats, annotated by any pipeline

Windows and Linux;works with files in many formats

Advancing Science with DNA Sequence

• Large structural RNAs (23S and 16S rRNAs)RNAmmer http://www.cbs.dtu.dk/services/RNAmmer/

• Small structural RNAs (5S rRNA, tRNAs, tmRNA, RNaseP RNA component)Rfam database, INFERNAL search toolhttp://www.sanger.ac.uk/Software/Rfam/http://rfam.janelia.org/http://infernal.janelia.org/

ARAGORNhttp://130.235.46.10/ARAGORN1.1/HTML/aragorn1.2.html

tRNAScan-SEhttp://lowelab.ucsc.edu/tRNAscan-SE/

Tools out there: tools for finding stable (“non-coding”) RNAs - I

Web service: sequence search is limited to 2 kb

Web service: sequence search is limited to 15 kb, finds tRNAs and tmRNAs only

Web service: sequence search is limited to 5 Mb, finds tRNAs only

Advancing Science with DNA Sequence

• Short regulatory RNAs (riboswitches, etc.)Rfam database, INFERNAL search toolhttp://www.sanger.ac.uk/Software/Rfam/http://rfam.janelia.org/http://infernal.janelia.org/

Other (less popular) tools:Pipeline for discovering cis-regulatory ncRNA motifs:

http://bio.cs.washington.edu/supplements/yzizhen/pipeline/

RNAzhttp://www.tbi.univie.ac.at/~wash/RNAz/

Tools out there: tools for finding “non-coding” RNAs - II

Web service: sequence search is limited to 2 kb;Provides list of pre-calculated RNAs for publicly available genomes

Advancing Science with DNA Sequence

Reading frames: translations of the nucleotide sequence with an offset of 0, 1 and 2 nucleotides (three possible translations in each direction)Open reading frame (ORF): reading frame between a start and stop codon

Tools out there: finding protein-coding genes (not ORFs!)

Advancing Science with DNA Sequence

Tools out there: most popular CDS-finding tools

• CRITICA• Glimmer family (Glimmer2, Glimmer3, RBS finder)

http://glimmer.sourceforge.net/• GeneMark family (GeneMark-hmm, GeneMarkS)

http://exon.gatech.edu/GeneMark/• EasyGene• AMIGENE• PRODIGAL (default JGI gene finder)

http://compbio.ornl.gov/prodigal/

Combinations and variations of the above• REGANOR (CRITICA + Glimmer3 + pre-processing)

• RAST (Glimmer2 + pre- and post-processing)

Advancing Science with DNA Sequence

Tools out there: metagenome annotation

• BLASTx• Fgenesb

http://linux1.softberry.com/berry.phtml?topic=fgenesb&group=programs&subgroup=gfindb

• GeneMark (GeneMark-hmm for reads, GeneMarkS for longer contigs)http://exon.gatech.edu/GeneMark/

• MetaGenehttp://metagene.cb.k.u-tokyo.ac.jp/metagene/

• GISMO ?http://www.cebitec.uni-bielefeld.de/groups/brf/software/gismo/

Full-service servers• IMG/M-ER – uses GeneMark for Sanger, proxygenes for 454

http://img.jgi.doe.gov/submit• MG-RAST

http://metagenomics.nmpdr.org/

Advancing Science with DNA Sequence

1.1. Introduction Introduction (who said annotating (who said annotating prokaryotic genomes is easy?)prokaryotic genomes is easy?)

2.2. Tools out thereTools out there (don’t bother to (don’t bother to write down the names and links, all write down the names and links, all presentations will be available on the web presentations will be available on the web site)site)

3.3. Basic principles behind toolsBasic principles behind tools (very basic, see specific papers for details)(very basic, see specific papers for details)

4.4. Known problems of the tools: Known problems of the tools: why you may need manual why you may need manual curationcuration

Outline

Advancing Science with DNA Sequence

Basic principles: finding CDSs using evidence-based vs ab initio algorithms

Two major approaches to prediction of protein-coding genes:

• “evidence-based” (ORFs with translations homologous to the known proteins are CDSs)

Advantages: finds “unusual” genes (e. g. horizontally transferred); relatively low rate of false positive predictions

Limitations: cannot find “unique” genes; low sensitivity towards short genes; prone to propagation of false positive results of ab initio annotation tools

• ab initio (ORFs with nucleotide composition similar to CDSs are also CDSs)

Advantages: finds “unique” genes; high sensitivity

Limitations: often misses “unusual” genes; high rate of false positives

Advancing Science with DNA Sequence

Basic principles: finding protein-coding genes with ab initio methods

An ORF that is likely to be protein-coding is found by searching for “coding potential”

“Coding potential” is defined by comparing nucleotide sequence of an ORF to a hidden Markov model (HMM)

HMM is generated using a training set from the genome or from average frequencies observed for multiple genomes

Probability that an ORF is a protein-coding gene is computed

N-terminal (5’) boundary is found by finding a start codon (ATG, GTG, TTG) next to a ribosomal binding site (RBS, Shine-Dalgarno sequence)

Different genomes have different frequencies of start codons

RBS is found by (Gibbs sampling) multiple sequence alignment of upstream sequences and represented by a weighted positional frequency matrix

Or RBS is found by multiple sequence alignment and represented as one of the states in an HMM model

Or ...

Example: the overall HMM architecture used in EasyGene (from Larsen & Krogh, BMC Bioinformatics, 2003).

Advancing Science with DNA Sequence

Features and differences between gene finding tools

• Training set selection (evidence-based vs purely ab initio) Example: CRITICA and EasyGene use evidence-based training sets

(BLASTn with counting synonymous/non-synonymous codons in CRITICA, BLASTx in EasyGene); Glimmer uses ab initio training set of long non-overlapping ORFs; GeneMark uses ab initio heuristic model

• Statistical model of coding and non-coding regions (codon frequencies, dicodon frequencies, hidden Markov models)

Example: CRITICA uses dicodon frequencies to model coding regions; Glimmer uses interpolated Markov models (IMM) of up to 5-th order; GeneMark uses order 2 hmm for coding regions, order 0 hmm for non-coding regions

• Statistical model architecture (i. e. which parts of the CDS are explicitly modeled – may include RBS, spacer region, start codon, second codon, internal codons, stop codon, etc.)

Example: EasyGene explicitely models RBS, spacer region, start codon, second codon, internal codons, stop codon, codons surrounding stop codon, non-coding sequence; all other tools have less comprehensive architectures of HMM

• Additional algorithms for refinement of predictions (RBS finder, overlap resolution, etc.)

Example: Glimmer2.0 has a scoring schema for overlap resolution; Glimmer3. uses a dynamic programming algorithm to select the highest-scoring set of predictions consistent with the maximum allowed overlap

Advancing Science with DNA Sequence

1.1. Introduction Introduction (who said annotating (who said annotating prokaryotic genomes is easy?)prokaryotic genomes is easy?)

2.2. Tools out thereTools out there (don’t bother to write (don’t bother to write down the names and links, all presentations down the names and links, all presentations will be available on the web site)will be available on the web site)

3.3. Basic principles behind toolsBasic principles behind tools (very basic, see specific papers for details)(very basic, see specific papers for details)

4.4. Known problems of the tools: Known problems of the tools: why you may need manual why you may need manual curationcuration (more on manual curation in (more on manual curation in the next talk by Thanos Lykidis)the next talk by Thanos Lykidis)

Outline

Advancing Science with DNA Sequence

Known problems: RNAs

Genome Sequencing center

16S rRNA, nt

Synechococcus sp. CC9311

UCSD, TIGR 1477

Synechococcus sp. CC9605

JGI 1440

Synechococcus elongatus PCC 7942

JGI 1490

Synechococcus sp. JA-2-3BA(2-13)

TIGR 1323

Synechococcus sp. JA-3-3Ab

TIGR 1324

Synechococcus sp. RCC307

Genoscope 1498

Synechococcus sp. WH7803

Genoscope 1497, 1464

Large rRNAs: RNAmmer is very accurate, but it has been developed very recently

variation of rRNA sizes in closely related strains

most 16S rRNA are missing anti-Shine-Dalgarno sequence

Small structural RNAs: covariance models are generally accurate, but may miss some tRNAs in Archaea

Check for the full complement of tRNAs with all necessary anti-codons

No model for pyrrolysine tRNA Small regulatory RNAs:

search is accurate but slow (too many models)

Annotations of regulatory RNAs are missing from many genomes

Advancing Science with DNA Sequence

Known problems: CDSs

Short CDSs: many are missed, others are overpredicted short ribosomal proteins (30-40 aa long) are often missed short proteins in the promoter region are often overpredicted N-terminal sequences are often inaccurate (many features

of the sequence around start codon are not accounted for) Glimmer2.0 is calling genes longer than they should be GeneMark, Glimmer3.0 err both ways, but mostly call genes

shorter Pseudogenes all tools are looking for ORFs (needs valid start and stop codons) “unique” genes are often predicted on the opposite strand of a

pseudogene Proteins with unusual translational features (recoding,

programmed frameshifts) these genes are often mistaken for pseudogenes see pseudogenes

Advancing Science with DNA Sequence

Supplemental tools

TIS (translation initiation site) prediction/correction

TICO http://tico.gobics.de/TriTISA http://mech.ctb.pku.edu.cn/protisa/TriTISA

Two tools often disagree about the best TIS, especially in high GC genomes

Operon predictionJPOP http://csbl.bmb.uga.edu/downloads/#jpophttp://www.cse.wustl.edu/~jbuhler/research/operons/http://www.sph.umich.edu/~qin/hmm/

Proteins with unusual translational features – selenocysteine-containing genesbSECISearch http://genomics.unl.edu/bSECISearch/

Advancing Science with DNA Sequence

Known problems: different gene finding tools applied to the same genome

total features

total CDSs

non-pseudo CDSs pseudo

total rRNA

total tRNA

total misc RNA

manual 7124 7042 6699 343 18 63 1

GeneMark 7059 6974 6974 0 18 62 2

ORNL 7076 6994 6994 0 18 63 1

RAST 5503 5422 5422 0 18 63 0

REGANOR 6420 6339 6339 0 18 63 0

Glimmer3 8218 8218 8218 0 0 0 0

missed by automated annotation

false positive (deleted by manual curation)

too short (extended by manual curation)

too long (truncated by manual curation)

total modifications

#% CDSs #

% CDSs # % CDSs #

% CDSs #

% CDSs

GeneMark 347 4.9 282 4.0 846 12.0 106 1.5 1581 22.4

ORNL 610 8.6 560 7.9 300 4.2 1152 16.3 2622 37.2

RAST 1783 25.3 167 2.3 83 1.1 2228 31.6 4261 60.5

REGANOR 904 12.8 203 2.8 207 2.9 1059 15.0 2373 33.6

Glimmer3 237 3.3 1408 19.9 500 7.1 542 7.6 2687 38.1

Advancing Science with DNA Sequence

Conclusions

• There are plenty of tools for automated annotation of microbial genome, including several “full-service” servers and annotation pipelines

• Even “full-service” pipelines identify a limited range of features and development of automated or semi-automated tools for identification of operons, promoters, terminators etc. is highly desirable, but not likely in the absence of experimental data

• Nearly all of the annotation tools and servers are using different strategies, algorithms, models, settings, etc., so the results may and will vary

• Different automated gene finders have different advantages and limitations; the best strategy is using any of them or a combination followed by evidence-based manual curation