Finding the genes in microbial genomes. Natalia Ivanova, MGM Workshop, May 15, 2012

Page 1: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

Finding the genes in microbial genomes

Natalia Ivanova
MGM Workshop

May 15, 2012

Page 2: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Outline

1. Introduction
2. Tools out there
3. Basic principles behind tools and known problems
4. Metagenomes

Page 3: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Finding the genes in microbial genomes: features

Well-annotated bacterial genome in Artemis genome viewer (figure)

Sequence features in prokaryotic genomes:

• stable RNA-coding genes (rRNAs, tRNAs, RNA component of RNaseP, tmRNA)
• protein-coding genes (CDSs)
• transcriptional features (mRNAs, operons, promoters, terminators, protein-binding sites, DNA bends)
• translational features (RBS, regulatory antisense RNAs, mRNA secondary structures, translational recoding and programmed frameshifts, inteins)
• pseudogenes (tRNA and protein-coding genes)
• …

Page 4: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Outline

1. Introduction
2. Tools out there (don’t bother to write down the names and links, all presentations will be available on the web site)
3. Known problems
4. Metagenomes

Page 5: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Publicly available genome annotation services

• IMG-ER: http://img.jgi.doe.gov/
• RAST: http://rast.nmpdr.org/
• JCVI Annotation Service: http://www.jcvi.org/cms/research/projects/annotation-service/
• NCBI PGAAP: http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html

Page 6: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Features predicted by most pipelines

• “Non-coding” RNAs
  – 23S, 16S and 5S rRNAs
  – tRNAs
  – (tmRNA, RNaseP RNA component)
• Protein-coding genes
  – Topological features (signal peptides, transmembrane regions)
• Repeats (CRISPRs, IS elements)
• Regulatory RNAs (riboswitches)
• Pseudogenes and frameshifts

Page 7: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

RNA prediction

• Best way: covariance models (they capture the secondary structure)
• Used for small structural RNAs (5S rRNA, tRNAs, tmRNA, RNaseP RNA component)
  – Rfam database, INFERNAL search tool: http://www.sanger.ac.uk/Software/Rfam/ http://rfam.janelia.org/ http://infernal.janelia.org/
  – tRNAscan-SE: http://lowelab.ucsc.edu/tRNAscan-SE
• Does not work for large RNAs (23S, 16S). Alternatives:
  – BLASTn
  – RNAmmer: http://www.cbs.dtu.dk/services/RNAmmer/

Page 8: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

CDS (not ORF!) prediction

• Reading frames: translations of the nucleotide sequence with an offset of 0, 1 and 2 nucleotides (three possible translations in each direction)
• Open reading frame (ORF): reading frame between a start and stop codon
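The two definitions above can be sketched in a few lines of Python. This is only an illustration, not how any of the tools in this talk are implemented: the function names, the `min_codons` parameter and the choice to report the longest ORF per stop codon are assumptions made for the example. It also shows why an ORF is not yet a CDS: every start-to-stop stretch in any of the six frames is reported, whether or not it encodes a protein.

```python
# Illustrative sketch: enumerate ORFs in all six reading frames.
# Start/stop codon sets are the ones listed on the gene-model slide.
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAG", "TAA", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=1):
    """Return (strand, frame, start, end) for each ORF.

    Coordinates are 0-based, end-exclusive, and relative to the strand
    that was scanned (for "-" that is the reverse-complemented string).
    For each stop codon only the most upstream start is kept (assumption).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):  # offsets of 0, 1 and 2 nucleotides
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


print(find_orfs("CCATGAAATAGCC"))  # -> [('+', 2, 2, 11)]
```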

Page 9: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Two major approaches to CDS prediction

Two major approaches to prediction of protein-coding genes:

• ab initio (ORFs with nucleotide composition similar to CDSs are also CDSs)
  – Advantages: finds “unique” genes; high sensitivity; very fast!
  – Limitations: often misses “unusual” genes; high rate of false positives
• “evidence-based” (ORFs with translations homologous to the known proteins are CDSs)
  – Advantages: finds “unusual” genes (e.g. horizontally transferred); relatively low rate of false positive predictions
  – Limitations: cannot find “unique” genes; low sensitivity on short genes; prone to propagation of false positive results of ab initio annotation tools; slow!

Page 10: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

How ab initio tools work – very briefly

• Statistical model of coding and non-coding regions

• Statistical model(s) of other components (RBS)

• Additional algorithms for refinement of predictions (overlap resolution, etc.)
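As a rough illustration of the first bullet, the sketch below scores an ORF by a log-odds ratio of codon frequencies under a "coding" model versus a "non-coding" background. It is a deliberately simplified stand-in: real gene finders use richer statistical models, and the function names, the pseudocount and the uniform background used in the usage lines are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(sequences, pseudocount=1.0):
    """Estimate in-frame codon frequencies from training sequences.

    The pseudocount (assumed value) keeps unseen codons from getting
    a probability of zero.
    """
    counts = Counter({c: pseudocount for c in CODONS})
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in counts:  # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values())
    return {c: counts[c] / total for c in CODONS}


def coding_log_odds(orf, coding_freq, noncoding_freq):
    """Sum of log(P_coding / P_noncoding) over the in-frame codons.

    Positive scores mean the ORF looks more like the coding training set.
    """
    orf = orf.upper()
    score = 0.0
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        if codon in coding_freq:
            score += math.log(coding_freq[codon] / noncoding_freq[codon])
    return score


# Usage with a toy training set and a uniform background (both assumed).
coding = codon_frequencies(["ATGGCTGCTGCTTAA"])
background = {c: 1.0 / 64 for c in CODONS}
print(coding_log_odds("GCTGCT", coding, background) > 0)
```

The same idea explains the limitations listed for ab initio tools: a gene whose composition differs from the training set (for example a horizontally transferred one) scores low, while a non-coding ORF with gene-like composition scores high.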

Prokaryotic gene model used by all ab initio gene finders:

• Ribosome-binding site within certain distance of the start codon
• One of 3 start codons: ATG, GTG, TTG
• One of 3 stop codons: TAG, TAA, TGA
• No frame interruptions (open reading frame)
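The gene model can be written down as a simple check on a candidate CDS. The sketch below is a minimal illustration under stated assumptions: the function name `fits_gene_model` is hypothetical, the RBS is approximated by an exact match to a Shine-Dalgarno-like core motif ("GGAGG"), and the 20 nt upstream window is an arbitrary choice. Real tools score the RBS statistically rather than by exact match.

```python
STARTS = ("ATG", "GTG", "TTG")
STOPS = ("TAG", "TAA", "TGA")


def fits_gene_model(seq, start, end, rbs_motif="GGAGG", rbs_window=20):
    """Check a forward-strand candidate CDS seq[start:end] against the model.

    start/end are 0-based, end-exclusive. rbs_motif and rbs_window are
    illustrative assumptions, not values taken from any gene finder.
    """
    cds = seq[start:end].upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # One of 3 start codons and one of 3 stop codons.
    if codons[0] not in STARTS or codons[-1] not in STOPS:
        return False
    # No frame interruptions: no in-frame stop before the final codon.
    if any(c in STOPS for c in codons[1:-1]):
        return False
    # RBS-like motif within a fixed distance upstream of the start codon.
    upstream = seq[max(0, start - rbs_window):start].upper()
    return rbs_motif in upstream


upstream = "TTAGGAGGTTTACC"
cds = "ATGAAACCCTAA"
print(fits_gene_model(upstream + cds, len(upstream), len(upstream) + len(cds)))
print(fits_gene_model(cds, 0, len(cds)))  # no upstream sequence, so no RBS
```

A gene with no upstream RBS, an interrupted frame or a different start codon fails this check even if it is real, which is why genes that do not fit the model tend to be missed or truncated.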

Page 11: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Gene finders: ab initio tools; evidence-based refinement

Ab initio tools used by the pipelines:

• Glimmer family (Glimmer2, Glimmer3, RBS finder) -> NCBI, RAST, JCVI: http://glimmer.sourceforge.net/
• GeneMark family (GeneMark-hmm, GeneMarkS) -> NCBI: http://exon.gatech.edu/GeneMark/
• PRODIGAL -> IMG-ER, NCBI: http://compbio.ornl.gov/prodigal/

Evidence-based refinement: mostly undocumented in-house developed tools.

Types of corrections: missed genes (RAST, JCVI, NCBI), frameshifts (JCVI, NCBI), start sites (RAST)

Page 12: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Known problems of all annotation pipelines

• RNAs
  – Incomplete rRNAs
  – Trans-spliced tRNAs in archaeal genomes
  – Small structural RNAs not predicted at all
• Protein-coding genes that don’t fit into the prokaryotic gene model used by ab initio gene finders
  – no RBS (leaderless transcripts)
  – interrupted translation frame (sequencing errors or translational exceptions)
  – non-canonical start

Gene model (diagram labels): ribosome binding site; start codon: ATG, GTG, TTG; stop codon: TAG, TAA, TGA; open reading frame

Genome                              Sequencing center   16S rRNA, nt
Synechococcus sp. CC9311            UCSD, TIGR          1477
Synechococcus sp. CC9605            JGI                 1440
Synechococcus elongatus PCC 7942    JGI                 1490
Synechococcus sp. JA-2-3BA(2-13)    TIGR                1323
Synechococcus sp. JA-3-3Ab          TIGR                1324
Synechococcus sp. RCC307            Genoscope           1498
Synechococcus sp. WH7803            Genoscope           1497, 1464
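One simple way to spot possibly incomplete rRNAs in a table like this is to compare each annotated 16S length with the longest one among related genomes. The sketch below uses the lengths from the table; the function name and the 95% cutoff are arbitrary assumptions for illustration, and a short annotation is only a hint to look closer, not proof of an error.

```python
# 16S rRNA lengths (nt) copied from the table; WH7803 is listed with two values.
rrna_16s = {
    "Synechococcus sp. CC9311": [1477],
    "Synechococcus sp. CC9605": [1440],
    "Synechococcus elongatus PCC 7942": [1490],
    "Synechococcus sp. JA-2-3BA(2-13)": [1323],
    "Synechococcus sp. JA-3-3Ab": [1324],
    "Synechococcus sp. RCC307": [1498],
    "Synechococcus sp. WH7803": [1497, 1464],
}


def flag_short_rrnas(lengths_by_genome, min_fraction=0.95):
    """Return genomes with rRNA annotations shorter than
    min_fraction * (longest annotation in the set).

    min_fraction is an assumed, arbitrary cutoff.
    """
    longest = max(n for lengths in lengths_by_genome.values() for n in lengths)
    cutoff = min_fraction * longest
    flagged = {}
    for genome, lengths in lengths_by_genome.items():
        short = [n for n in lengths if n < cutoff]
        if short:
            flagged[genome] = short
    return flagged


for genome, lengths in flag_short_rrnas(rrna_16s).items():
    print(genome, lengths)
```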

Page 13: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Symptoms of gene finding problems

• Some type of mandatory feature (rRNAs, tRNAs, CDSs) is missing

• “Truncated” genes (shorter than homologs) => unusual translation initiation features (non-canonical start codons, leaderless transcripts)

• Many “unique” genes without protein family assignment or BLASTp hit => sequencing errors (frameshifts)

• Undetected selenocysteines, programmed frameshifts in ~50 well-conserved protein families

Page 14: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Supplemental tools

TIS (translation initiation site) prediction/correction

• TICO: http://tico.gobics.de/
• TriTISA: http://mech.ctb.pku.edu.cn/protisa/TriTISA

The two tools often disagree about the best TIS, especially in high-GC genomes

Operon prediction

• JPOP: http://csbl.bmb.uga.edu/downloads/#jpop
• http://www.cse.wustl.edu/~jbuhler/research/operons/
• http://www.sph.umich.edu/~qin/hmm/

Proteins with unusual translational features: selenocysteine-containing genes

• bSECISearch: http://genomics.unl.edu/bSECISearch/

Page 15: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Metagenomes sequenced with new technologies: low-coverage problems

• Both 454 and Illumina require high sequence coverage in order to achieve high sequence quality (25x to >100x)
• High sequence coverage cannot be achieved for metagenome data

How does this affect metagenome annotation?

• ~70% of 454 Titanium reads have at least 1 sequencing artifact (basecalls in homopolymeric runs); there is no clear pattern of error distribution
• >100 bp Illumina reads have ~3% error rate; the error rate is higher towards the end of the read; the majority of errors are substitutions

Figure (panels: genome, metagenome; axes: sequence, coverage)

Page 16: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Just one example…

Figure (labels: predicted gene; 3 frameshifts)

• 4-read contig, 1476 nt, no misassembly
• Contig has 27 homopolymers (3 nt and more), 3 of them have errors
• No correlation with homopolymer type or error type
• Reads were quality trimmed prior to assembly
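Counting homopolymers of 3 nt and more, as in the example above, takes only a regular expression. The sketch below is an illustration with a hypothetical function name; it finds the runs but says nothing about which of them carry errors, which in the example required comparison against the reads.

```python
import re


def homopolymer_runs(seq, min_len=3):
    """Return (start, base, length) for every run of one base >= min_len.

    start is 0-based. Only A, C, G and T are considered.
    """
    # A base followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(seq.upper())]


print(homopolymer_runs("ACGTTTACCCCGA"))  # -> [(3, 'T', 3), (7, 'C', 4)]
```

Each run is a position where a 454 basecall may add or drop a base, so a single miscalled run inside a gene is enough to produce a frameshift.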

Page 17: Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Metagenome annotation tools (more details will be given)

• GeneMark (GeneMark-hmm for reads, GeneMarkS for longer contigs): http://exon.gatech.edu/GeneMark/
• MetaGene: http://metagene.cb.k.u-tokyo.ac.jp/metagene/
• FragGeneScan: http://omics.informatics.indiana.edu/FragGeneScan/

Full-service annotation pipelines

• IMG/M-ER (“metagenome gene calling” + other options): http://img.jgi.doe.gov/submit
• MG-RAST: http://metagenomics.nmpdr.org/
• CAMERA annotation pipeline: http://camera.calit2.net/