Prokka - rapid bacterial genome annotation - ABPHM 2013
Rapid automatic microbial genome annotation using Prokka
Dr Torsten Seemann
Applied Bioinformatics and Public Health Microbiology - Wed 15 May 2013 @ 1530h - Moller Centre, Cambridge, UK
Background
We come from a land down-under
The team
● Simon Gladman
  o VelvetOptimiser author, presenting Galaxy poster
● Paul Harrison
  o author of Nesoni toolkit
● David Powell
  o author of VAGUE, software wizard, theoretician
● Dieter Bulach
  o sequence magician, closes genomes at will
... and we are recruiting.
History
[Timeline: 2003 to 2014; milestone labels not recoverable]
Introduction
Process
● De novo assembly
● Align reads to a reference
De novo assembly
Reconstruct the original genome sequence from the sequence reads only.
● Input: millions of short sequences (reads)
● Output: a few long sequences (contigs)
Ideally, one sequence per replicon.
Annotation
Adding biological information to sequences.
ACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTAGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAAACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGTCCGTCCGTGGGCCACGGCCACCGCTTTTTTTTTTGCC
delta toxin (PubMed: 15353161)
ribosome binding site
transfer RNA: Leu-(UUR)
tandem repeat: CCGT x 3
homopolymer: 10 x T
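Two of the labels above (the tandem repeat and the homopolymer) can be found by pattern matching alone. The following is a minimal sketch, not part of Prokka; the function names, the thresholds and the toy sequence are assumptions made for illustration.

```python
import re

def find_homopolymers(seq, min_len=8):
    """Return (start, end, base, length) for single-base runs of at least min_len.

    Coordinates are 1-based and inclusive.
    """
    hits = []
    for m in re.finditer(r"A+|C+|G+|T+", seq.upper()):
        length = m.end() - m.start()
        if length >= min_len:
            hits.append((m.start() + 1, m.end(), m.group()[0], length))
    return hits

def find_tandem_repeats(seq, unit_len=4, min_copies=3):
    """Return (start, end, unit, copies) for a fixed-length unit repeated back to back."""
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        if len(set(unit)) == 1:
            continue  # a run of one base is a homopolymer, reported separately
        copies = (m.end() - m.start()) // unit_len
        hits.append((m.start() + 1, m.end(), unit, copies))
    return hits

# Toy sequence (not the one on the slide): CCGT x 3, then later 10 x T.
toy = "AACCGTCCGTCCGTGG" + "T" * 10 + "GCC"
print(find_tandem_repeats(toy))  # [(3, 14, 'CCGT', 3)]
print(find_homopolymers(toy))    # [(17, 26, 'T', 10)]
```

The regular expression reports the leftmost match, so on other sequences the unit may come back as a rotation of the repeat (for example TCCG instead of CCGT) when the flanking base happens to fit.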
What's in an annotation?
● Location
  o which sequence? chromosome 2
  o where on the sequence? 100..659
  o what strand? -ve
● Feature type
  o what is it? protein coding gene
● Attributes
  o protein product? alcohol dehydrogenase
  o enzyme code? EC:1.1.1.1
  o subcellular location? cytoplasm
  o note? essential for ABPHM
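The three parts of an annotation (location, feature type, attributes) map naturally onto a small record type. The sketch below is an illustration only: the class name, the field names and the `source` label are assumptions, and the GFF3 output is simplified (the sequence name is written `chromosome_2` because GFF3 sequence IDs cannot contain unescaped spaces).

```python
from dataclasses import dataclass, field
from urllib.parse import quote

@dataclass
class Annotation:
    seqid: str          # which sequence?
    start: int          # where on the sequence? (1-based, inclusive)
    end: int
    strand: str         # "+" or "-"
    feature_type: str   # what is it?
    attributes: dict = field(default_factory=dict)

    def to_gff3(self, source="example"):
        """Render one tab-separated GFF3 line (nine columns)."""
        phase = "0" if self.feature_type == "CDS" else "."
        # Percent-encode characters that are special in the attributes column.
        attrs = ";".join(
            f"{key}={quote(str(value), safe=' :')}"
            for key, value in self.attributes.items()
        )
        columns = [self.seqid, source, self.feature_type, str(self.start),
                   str(self.end), ".", self.strand, phase, attrs]
        return "\t".join(columns)

adh = Annotation(
    seqid="chromosome_2", start=100, end=659, strand="-", feature_type="CDS",
    attributes={
        "product": "alcohol dehydrogenase",
        "EC_number": "1.1.1.1",
        "note": "essential for ABPHM",
    },
)
print(adh.to_gff3())
```

Keeping the attributes in a dictionary means extra qualifiers can be added without changing the record layout.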
Bacterial feature types
● protein coding genes
  o promoter (-10, -35)
  o ribosome binding site (RBS)
  o coding sequence (CDS)
      signal peptide, protein domains, structure
  o terminator
● non coding genes
  o transfer RNA (tRNA)
  o ribosomal RNA (rRNA)
  o non-coding RNA (ncRNA)
● other
  o repeat patterns, operons, origin of replication, ...
Automatic annotation
Key bacterial features
● tRNA
  o easy to find and annotate: anti-codon
● rRNA
  o easy to find and annotate: 5S 16S 23S
● CDS
  o straightforward to find candidates
      false positives are often small ORFs
      wrong start codon
  o partial genes, remnants
  o pseudogenes
  o assigning function is the bulk of the workload
Automatic annotation
Two strategies for identifying coding genes:
● sequence alignment
  o find known protein sequences in the contigs
      transfer the annotation across
  o will miss proteins not in your database
  o may miss partial proteins
● ab initio gene finding
  o find candidate open reading frames
      build model of ribosome binding sites
      predict coding regions
  o may choose the incorrect start codon
  o may miss atypical genes, overpredict small genes
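The first step of ab initio gene finding, listing candidate open reading frames, can be sketched in a few lines. This is a deliberately naive illustration, not how any real gene finder works: it scans only the forward strand, it has no ribosome binding site model, and it always takes the most upstream start codon, which is exactly how the "incorrect start codon" and "overpredict small genes" problems arise. The codon sets and the `min_codons` threshold are assumptions.

```python
START_CODONS = {"ATG", "GTG", "TTG"}   # common bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    Coordinates are 1-based and inclusive, and include the stop codon.
    min_codons counts codons from the start codon up to, not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i                      # most upstream start wins
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs

# Toy sequence: ATG AAA TTT GGG TAA embedded in frame 2.
print(find_orfs("CCATGAAATTTGGGTAACC"))  # [(3, 17, 2)]
```

Lowering `min_codons` makes the list grow quickly, which shows why small ORFs dominate the false positives.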
Some good existing tools
| Software | ab initio | alignment | Availability | Speed |
|---|---|---|---|---|
| RAST | yes | yes | web only | 12-24 hours |
| xBASE | yes | no | web only | >8 hours |
| BG7 | no | yes | standalone | >10 hours |
| PGAAP (NCBI) | yes | yes | email / we | >1 month |
Why another tool?
● Convenience
  o I have sequence, just tell me what's in it, please.
● Speed
  o exploit multi-core computers (aim < 15 min)
● Standards compliant
  o GFF3/GBK for viewing, TBL/FSA for GenBank submission
● Rich consistent trustworthy output
  o /product /gene /EC_number
● Provenance
  o a record of where/how/why it was annotated so
Why "Prokka" ?
● Unique in Google
● I like the letter "k"
● Easy to type
● It sounds Aussie
● Loosely fits "Prokaryotic Annotation"
● It rhymes with "Quokka"
  o Australian cat-sized nocturnal marsupial herbivore
  o first Aussie mammal seen by Europeans - "giant rat"
Prokka pipeline (simplified)
[Pipeline diagram]
● Input: FASTA contigs
● tRNA: Aragorn
● rRNA: RNAmmer
● ncRNA: Infernal (Rfam)
● CDS: Prodigal
  o sig_peptide: SignalP
  o protein domains: HMMER3
  o protein annotation: BLAST+
● Databases: Rfam, Swiss, Pfam, TIGR, User
● Output: GFF3, GBK, ASN1
What can you trust?
Predicting protein function
Sequence similarity is a proxy for homology
● Sequence based (alignment)
  o tools: BLAST, BLAT, FASTA, Exonerate
  o databases: RefSeq, UniProt, ...
● Model based ("fuzzy sequence" matching)
  o PSSM: position specific scoring matrix
      tools: RPS-BLAST, PSI-BLAST
      databases: CDD, COG, SMART
  o HMM: hidden Markov models
      tools: HMMER, HHblits
      databases: Pfam, TIGRfams
Sequence databases
I'll just BLAST against the non-redundant database. -- Anonymous
● Which one?
  o nucleotide (nt) or protein (nr)
● It's actually quite redundant
  o only eliminates exact matching sequences
● It's not picky
  o nearly anything is admitted, garbage in garbage out
● It's too big
  o searching takes too long
Hierarchical searching
● Facts
  o searching against smaller databases is faster
  o searching against similar sequences is faster
● Idea
  o start with small set of close proteins
  o advance to larger sets of more distant proteins
● Prokka
  o your own custom "trusted" set (optional)
  o core bacterial proteome (default)
  o genus specific proteome (optional)
  o whole protein HMMs: PRK clusters, TIGRfams
  o protein domain HMMs: Pfam
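The hierarchical idea amounts to trying an ordered list of searches and stopping at the first hit. The sketch below shows only that control flow; it is not Prokka's code. The tier names are shortened from the slide, the dictionary lookups stand in for real BLAST or HMMER searches, and the "hypothetical protein" fallback is the last-resort label named later in the talk.

```python
def annotate_protein(protein, tiers, fallback="hypothetical protein"):
    """Try each (tier_name, search_function) in order; the first hit wins.

    Each search function takes a protein sequence and returns a product
    name, or None when it finds no acceptable match.
    Returns (product, tier_name); tier_name is None for the fallback.
    """
    for tier_name, search in tiers:
        product = search(protein)
        if product is not None:
            return product, tier_name
    return fallback, None

# Stand-in databases: exact-match dictionaries instead of real searches.
trusted_set = {"MKTAYIAK": "delta toxin"}
core_proteome = {"MSTNPKPQ": "alcohol dehydrogenase"}

tiers = [
    ("trusted", trusted_set.get),        # smallest, closest set first
    ("core", core_proteome.get),         # then larger, more distant sets
]

print(annotate_protein("MSTNPKPQ", tiers))  # ('alcohol dehydrogenase', 'core')
print(annotate_protein("MXXXXXXX", tiers))  # ('hypothetical protein', None)
```

Returning the tier name alongside the product keeps a record of which database supplied each annotation.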
Core bacterial proteome
● Many bacterial proteins are conserved
  o experimentally validated
  o small number of them
  o good annotations
● Prokka provides this database
  o derived from UniProt-Swissprot
  o only bacterial proteins
  o only accept evidence level 1 (aa) or 2 (RNA)
  o reject "Fragment" entries
  o extract /gene /EC_number /product /db_xref
● First step gets ~50% of the genes
  o BLAST+ blastp, multi-threading to use all CPUs
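The filtering rules for the core database can be written as a single predicate. This is a sketch under assumptions: the records are plain dictionaries with invented field names (`kingdom`, `evidence_level`, `is_fragment`), the toy entries are made up, and no real UniProt parser is involved.

```python
def keep_for_core_db(entry):
    """Apply the slide's rules: bacterial, evidence level 1 or 2, not a fragment."""
    return (
        entry["kingdom"] == "Bacteria"
        and entry["evidence_level"] in (1, 2)
        and not entry["is_fragment"]
    )

def build_core_db(entries):
    """Keep passing entries and extract only the qualifiers listed on the slide."""
    wanted = ("gene", "EC_number", "product", "db_xref")
    return [
        {key: entry.get(key) for key in wanted}
        for entry in entries
        if keep_for_core_db(entry)
    ]

# Invented toy records, for illustration only.
toy_entries = [
    {"kingdom": "Bacteria", "evidence_level": 1, "is_fragment": False,
     "gene": "geneA", "EC_number": "1.1.1.1", "product": "toy protein A",
     "db_xref": "toy:0001"},
    {"kingdom": "Bacteria", "evidence_level": 3, "is_fragment": False,
     "gene": "geneB", "product": "toy protein B"},      # evidence too weak
    {"kingdom": "Eukaryota", "evidence_level": 1, "is_fragment": False,
     "gene": "geneC", "product": "toy protein C"},      # not bacterial
    {"kingdom": "Bacteria", "evidence_level": 2, "is_fragment": True,
     "gene": "geneD", "product": "toy protein D"},      # fragment
]

core = build_core_db(toy_entries)
print([entry["gene"] for entry in core])  # ['geneA']
```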
The remainder
● Prokka has genus specific databases
  o aim to capture "genus specific" naming conventions
  o derived from proteins in completed genomes
  o proteins are clustered and majority annotation wins
  o some annotations are rubbish though
● Custom model databases
  o I took COG/PRK MSAs and made HMMs
● Existing model databases
  o Pfam, TIGRfams are well curated
● And if all else fails
  o we always have our friend "hypothetical protein"
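The "majority annotation wins" rule for a cluster of proteins can be sketched with a counter. The function name and the example product names are assumptions, and the clustering step itself is not shown.

```python
from collections import Counter

def majority_annotation(products):
    """Return the most frequent product name among the members of one cluster.

    Ties go to the name that appears first in the input.
    """
    if not products:
        raise ValueError("cluster has no annotations")
    counts = Counter(products)
    winner, _count = counts.most_common(1)[0]
    return winner

cluster = [
    "DNA gyrase subunit A",
    "DNA gyrase subunit A",
    "hypothetical protein",
]
print(majority_annotation(cluster))  # DNA gyrase subunit A
```

The same rule also explains the "some annotations are rubbish" caveat: if most members of a cluster carry a poor name, the poor name wins.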
Provenance
Recording where an annotation came from.
Prokka uses Genbank "evidence qualifier" tags:
Wet lab
  /experiment="EXISTENCE:Northern blot"
Dry lab
  /inference="similar to DNA sequence:INSD:AACN010222672.1"
  /inference="profile:tRNAscan:2.1"
  /inference="protein motif:InterPro:IPR001900"
  /inference="ab initio prediction:Glimmer:3.0"
Example from Prokka
Feature type: tRNA
Location: contig000341 @ 655..730 +
Attributes:
  /gene="tRNA-Leu(UUR)"
  /anticodon=(pos:678..680,aa:Leu)
  /product="transfer RNA-Leu(UUR)"
  /inference="profile:Aragorn:1.2"
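The /inference qualifiers shown in the provenance slides all follow a three-part pattern of category, then tool or database, then version or identifier. Below is a minimal sketch for building and splitting such strings; the helper names are assumptions, and only the three-part form seen in the slides is handled.

```python
def make_inference(category, evidence, detail):
    """Build a qualifier such as /inference="profile:Aragorn:1.2"."""
    return f'/inference="{category}:{evidence}:{detail}"'

def parse_inference(qualifier):
    """Split an /inference qualifier into (category, evidence, detail)."""
    prefix = '/inference="'
    if not (qualifier.startswith(prefix) and qualifier.endswith('"')):
        raise ValueError(f"not an /inference qualifier: {qualifier!r}")
    body = qualifier[len(prefix):-1]
    parts = body.split(":", 2)   # the detail part may itself contain colons
    if len(parts) != 3:
        raise ValueError(f"expected three colon-separated parts: {body!r}")
    return tuple(parts)

print(make_inference("profile", "Aragorn", "1.2"))
# /inference="profile:Aragorn:1.2"
print(parse_inference('/inference="ab initio prediction:Glimmer:3.0"'))
# ('ab initio prediction', 'Glimmer', '3.0')
```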
Software quality
Software goals
● Follow basic conventions
  o "prokka" should say something helpful
  o "prokka -h" or "prokka --help" should show help
● All options should be optional
  o "prokka contigs.fa" should do something useful
● Fail gracefully
  o check your dependencies exist
  o produce useful error messages
  o generate a log file (provenance!)
● Use standard input and output file formats
  o or at least tab-separated values if you insist...
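The "fail gracefully" goals can be illustrated with a dependency check that runs before any real work starts. This is a sketch in Python rather than Prokka's own implementation; the function names, the logger name and the example list of executables are assumptions.

```python
import logging
import shutil

def missing_tools(tools):
    """Return the executables from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

def check_dependencies(tools, log=None):
    """Log a useful message for every missing tool; return True if all exist."""
    log = log or logging.getLogger("annotator")
    missing = missing_tools(tools)
    for tool in missing:
        log.error("Required tool '%s' was not found on PATH; "
                  "please install it and try again.", tool)
    return not missing

# Example use (assumed executable names for tools named in the talk):
#   logging.basicConfig(filename="run.log", level=logging.INFO)
#   if not check_dependencies(["aragorn", "prodigal", "blastp", "hmmscan"]):
#       raise SystemExit(1)
```

Checking everything up front and reporting all missing tools at once avoids failing halfway through a long run.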
Prokka in context
● Prokka is not
  o particularly original
  o technically or algorithmically significant
  o foolproof to install (some dependencies)
  o for everyone
BUT
● Prokka
  o is an ongoing project which will only improve :-)
  o checks it will run properly before wasting your time
  o does what it claims, and does it quickly
  o is being used widely
Conclusions
Prokka in the wild
● Pathogen Informatics @ Sanger UK
  o Andrew Page
  o 50,000 draft genomes in 2 weeks (24 sec each!)
● Austin Hospital @ Melbourne AU
  o Ben Howden - Dept Infectious Diseases
  o assembly & annotation of MiSeq clinical isolates
● VBC @ Monash AU
  o assemble & annotate all of SRA
● Many more
  o Public Health Agency of Canada
  o Some of you next week hopefully!
Planned features
● Modularity
  o alternate sub-tools e.g. Aragorn vs tRNAscan-SE
  o every sub-system should be optional
  o facilitate entry into Galaxy Toolshed
● Better support for
  o metagenome assemblies, viruses and archaea
  o broken genes, pseudogenes, assembly breakpoints
● Faster
  o smaller core databases
  o better parallelisation and less disk I/O
● Prokka-Web
  o web server version currently in beta testing
Acknowledgements
● Organisers
  o Conference committee - for inviting me
  o Wellcome Trust - Laura Hubbard
● Original Prokka testers
  o Simon Gladman & Dieter Bulach (internal)
  o Tim Stinear & Scott Chandry (external)
● Funding
  o VLSCI / LSCC
  o Monash University
● Family
  o Naomi, Oskar, Zoe - for tolerating my absences
Contact
Email [email protected]
Twitter @torstenseemann
Blog TheGenomeFactory.blogspot.com
Web www.bioinformatics.net.au
Thank you.