Preliminary Results · 2018. 3. 15.
Preliminary Results
Team 1 Functional Annotation
Wenyi Qiu, Tianze Song, Saurabh Gulati, Ryan Place, Dongjo Ban, Qinwei Zhuang, Kunal Agarwal, Frank Ambrosio
Overview
● Functional Annotation
● Goal
● Preliminary Results
● Finalized Pipeline
Functional Annotation Review
Homology Based Annotation
● Uses databases of genomic features with known function
● Accuracy depends on database quality
○ Garbage in, garbage out
● Databases of AMR genomic features are updated on a regular basis
● CARD and VFDB are examples of databases used for homology-based annotation
Functional Annotation Review
Ab Initio Annotation
● Looks for intrinsic characteristics of particular gene feature types
● Signal Peptide and Transmembrane Proteins can be identified in this way
○ These regions are of particular importance to this project because of their significance to AMR
Goals
● The ultimate goal of the Functional Annotation group is to functionally annotate 258 Klebsiella spp. genomes
● We aim to provide the Comparative Genomics group with the data required to perform a Genome-Wide Association Study (GWAS) to determine which (if any) genomic features are associated with three phenotype classes
○ Susceptible
○ Resistant
○ Heteroresistant
PROKKA - Overview
● Input and Output
○ Input: FASTA file (nucleotide)
○ Output: Annotation files
● Command
○ prokka --force --outdir <OUTPUT DIRECTORY> --kingdom Bacteria --genus Klebsiella --gram neg --prefix <PREFIX TO IDENTIFY SAMPLE> --rfam --rnammer <INPUT FASTA FILE>
● Scalability
○ Average run time for 1 genome is around 16 minutes (~3 days for 258 genomes run one after another)
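The "~3 days for 258" figure follows from multiplying the per-genome time by the number of genomes. The sketch below reproduces that arithmetic and shows what running several jobs at once would do, assuming every genome takes the same 16 minutes and that jobs do not slow each other down. The `parallel_jobs` parameter is a hypothetical illustration, not something the slides report using.

```python
import math

def batch_runtime_minutes(n_genomes, minutes_per_genome, parallel_jobs=1):
    """Wall-clock minutes when genomes are processed in waves of `parallel_jobs`.

    Assumption: every genome takes the same time and jobs do not compete
    for resources (real runs will vary).
    """
    waves = math.ceil(n_genomes / parallel_jobs)
    return waves * minutes_per_genome

serial = batch_runtime_minutes(258, 16)
print(f"serial: {serial} min = {serial / 60 / 24:.2f} days")   # 4128 min = 2.87 days

four_jobs = batch_runtime_minutes(258, 16, parallel_jobs=4)
print(f"4 jobs: {four_jobs} min = {four_jobs / 60:.1f} hours")  # 1040 min = 17.3 hours
```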
PROKKA - Preliminary Results
Typical result from running PROKKA on our Klebsiella reference.
PROKKA - Preliminary Results
Typical result from running PROKKA on a Klebsiella SKESA assembly (SRR666627.skesa.fasta).
PROKKA - Output
● Gives the annotations as output in multiple formats, e.g. GFF3, GenBank, TSV, SQN, FFN, TBL and TXT.
Adding Annotation Databases to PROKKA
Prokka offers 2 options for adding to the list of homology-based annotations:
1. Use the --proteins flag
a. Takes a FASTA file as input
b. Simple to implement
2. Create a new “genus” database in PROKKA
a. Takes a list of NCBI taxids as an input
Comprehensive Antibiotic Resistance DB
● Added to PROKKA analysis using the --proteins flag
● This run took 19.9 mins
● Adding more databases will increase processing time
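As a minimal sketch of how the `--proteins` option slots into the PROKKA command shown earlier, the helper below assembles the argument list without running anything. The function name and the file names (`card_proteins.fasta`, `sample.fasta`, `out_dir`) are hypothetical placeholders; only the flags themselves come from the slides.

```python
import shlex

def prokka_command(fasta, outdir, prefix, extra_proteins=None):
    """Build (but do not run) a PROKKA command line for one Klebsiella genome."""
    cmd = [
        "prokka", "--force",
        "--outdir", outdir,
        "--kingdom", "Bacteria",
        "--genus", "Klebsiella",
        "--gram", "neg",
        "--prefix", prefix,
        "--rfam", "--rnammer",
    ]
    if extra_proteins:
        # Extra trusted protein FASTA (e.g. one exported from CARD).
        cmd += ["--proteins", extra_proteins]
    cmd.append(fasta)  # the input FASTA goes last
    return cmd

# Hypothetical file names, for illustration only.
cmd = prokka_command("sample.fasta", "out_dir", "sample",
                     extra_proteins="card_proteins.fasta")
print(" ".join(shlex.quote(part) for part in cmd))
```

Returning a list rather than a string means the command can be handed to `subprocess.run` without shell quoting issues.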
PilerCR
● Input and Output
○ Input: FASTA file (nucleotide)
○ Output: Putative CRISPR arrays report file
● Command
○ pilercr -in <sequence_file> -out <report_file>
● Scalability
○ Average run time for 1 genome is less than 5 seconds
Results from de novo assemblies
Number of CRISPR arrays found    Number of samples
0                                207
1                                17
2                                29
3                                2
4                                1
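The PilerCR counts above can be summarized with a few lines of arithmetic. This sketch only re-uses the numbers in the table; the variable names are illustrative.

```python
# Number of CRISPR arrays found -> number of samples (values from the table).
crispr_counts = {0: 207, 1: 17, 2: 29, 3: 2, 4: 1}

total_samples = sum(crispr_counts.values())
samples_with_arrays = total_samples - crispr_counts[0]
total_arrays = sum(n_arrays * n_samples for n_arrays, n_samples in crispr_counts.items())

print(total_samples)        # 256
print(samples_with_arrays)  # 49
print(total_arrays)         # 85
print(f"{samples_with_arrays / total_samples:.1%} of samples have at least one array")  # 19.1%
```

The counts sum to 256 samples, two fewer than the 258 genomes named in the goals; the slides do not say why.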
SignalP
Command: signalp -t <organism_type> -f <output_format> <input_file> > <output_file>
● organism_type: euk, gram+, gram-
● output_format: short, long, summary, all
Runtime: ~4 minutes for one NCBI genome (GCF000240185.1)
SignalP
Command: signalp -t <organism_type> -f <output_format> <input_file> > <output_file>
● organism_type: euk, gram+, gram-
● output_format: short, long, summary, all
Runtime: ~3 minutes for one de novo assembled genome (SRR3467249)
Phobius
Command: phobius.pl -<output_format> <input_file> > <output_file>
● output_format: short, long, raw
Runtime: 12-16 minutes for one NCBI genome (GCF000240185.1)
LipoP
Command: LipoP -<output_format> -<input_file> > <output_file>
● output_format: short, long
Runtime: ~2 minutes for one NCBI genome (GCF000240185.1)
Figure labels: cytoplasmic; signal peptide (peptidase I); higher, more reliable prediction. Not shown: SpII (lipoprotein SP), CleavI, CleavII.
LipoP
Command: LipoP -<output_format> -<input_file> > <output_file>
● output_format: short, long
Runtime: ~2 minutes for one de novo assembled genome (SRR3467249)
TMHMM
Command: tmhmm -<output_format> <input_file> > <output_file>
● output_format: short, long
Runtime: ~4 minutes for one NCBI genome (GCF000240185.1)
TMHMM
Command: tmhmm -<output_format> <input_file> > <output_file>
● output_format: short, long
Runtime: ~6 minutes for one de novo assembled genome (SRR3467249)
Figure panels: Signal Peptide Predictions; Transmembrane Predictions
DeepARG
● Input and output:
○ Input: FASTA or tabular file
○ Output: Predicted ARG list (containing some statistics)
● What it does:
○ Predicts Antibiotic Resistance Genes (ARGs)
● Why it may be useful:
○ It comes with a new database (DeepARG-DB) built by combining existing ARG databases (CARD and ARDB).
○ Better than the traditional “best-hit” method in that it can also predict ARGs whose percent identity is low (30%~50%) but whose e-values are small (< 1e-10)
● Scalability:
○ Running time: 3 min 27 s
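To make the "low identity but small e-value" point concrete, the sketch below contrasts a strict identity cutoff with a filter that keeps hits in the 30-50% identity range when their e-value is below 1e-10. This is only an illustration of the thresholds quoted above: it is not DeepARG's model, the 80% cutoff is an assumed example of a strict best-hit rule, and the hits are invented.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    query: str
    subject: str
    identity: float  # percent identity of the alignment
    evalue: float

# Hypothetical alignment hits, invented for illustration only.
hits = [
    Hit("gene_1", "ARG_x", 95.0, 1e-80),
    Hit("gene_2", "ARG_y", 42.0, 1e-25),  # low identity, very small e-value
    Hit("gene_3", "ARG_z", 35.0, 1e-3),   # low identity, weak e-value
]

def strict_identity_calls(hits, min_identity=80.0):
    """Queries kept by an (assumed) strict percent-identity cutoff."""
    return [h.query for h in hits if h.identity >= min_identity]

def low_identity_candidates(hits, max_evalue=1e-10, identity_range=(30.0, 50.0)):
    """Queries in the low-identity range that still have a small e-value."""
    low, high = identity_range
    return [h.query for h in hits
            if low <= h.identity <= high and h.evalue < max_evalue]

print(strict_identity_calls(hits))    # ['gene_1']
print(low_identity_candidates(hits))  # ['gene_2']
```

Here `gene_2` is the kind of hit a strict identity cutoff discards even though its e-value is far below 1e-10.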
DeepARG
VFDB
Update: 47 VFs found in the full set with the keyword “Klebsiella”
Searching the core set of virulence factors took less than 3 s per genome
makeblastdb -in VFDB_setA_nt.fas -dbtype nucl
blastn -query query.fasta -db VFDB_setA_nt.fas -xdrop_gap 150 -outfmt 6 > outputfile
Searching the full set takes less than 7 s per genome
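The `-outfmt 6` option writes one tab-separated line per hit with BLAST+'s twelve default columns. A minimal sketch of reading that output and keeping the best-scoring hit per query is shown below; the two example lines and their identifiers are invented, and the helper names are hypothetical.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                  "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def read_blast6(handle):
    """Yield one dict per hit from a tab-separated -outfmt 6 file."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        record = dict(zip(BLAST6_COLUMNS, row))
        for key in ("pident", "evalue", "bitscore"):
            record[key] = float(record[key])
        yield record

def best_hit_per_query(records):
    """Keep the hit with the highest bitscore for each query sequence."""
    best = {}
    for rec in records:
        current = best.get(rec["qseqid"])
        if current is None or rec["bitscore"] > current["bitscore"]:
            best[rec["qseqid"]] = rec
    return best

# Invented example lines standing in for the contents of `outputfile`.
example = io.StringIO(
    "contig_1\tVF_a\t98.5\t900\t13\t0\t1\t900\t1\t900\t0.0\t1580\n"
    "contig_1\tVF_b\t81.0\t450\t80\t5\t10\t460\t3\t452\t1e-90\t340\n"
)
best = best_hit_per_query(read_blast6(example))
print(best["contig_1"]["sseqid"])  # VF_a
```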
InterProScan
● Command: /projects/data/team1_functionalAnnotation/interproscan-5.28-67.0/interproscan.sh -dp -appl PfamA,CDD,COILS,Gene3D,HAMAP,MobiDBLite,PIRSF,PRINTS,ProDom,PROSITEPATTERNS,PROSITEPROFILES,SFLD,SMART,SUPERFAMILY,TIGRFAM,Phobius -goterms -iprlookup -pa -t n -i <input file> -f gff3
● Input and Output
○ Input: FASTA file (nucleotide or protein)
○ Output: TSV, GFF3, XML, JSON, HTML, SVG
● Runtime: 9 minutes for an NCBI complete Klebsiella genome; 146 minutes for a Weiss Lab Klebsiella sample
● Because of the way InterProScan works, the fewer FASTA records in a multi-FASTA file the better:
○ Prodigal gene prediction output for SRR3467249: ~5000 files when split by /^>/
■ 1 min per file or 10 min for 10 merged files
○ SKESA output of SRR3467249: 127 files when split by /^>/
■ 9-14 min for 10 merged files
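Splitting a multi-FASTA at each `>` header and regrouping the records into small batches, as described above, can be sketched as follows. The batch size of 10 mirrors the "10 merged files" in the timings; the function names and toy sequences are hypothetical.

```python
def read_fasta_records(lines):
    """Yield (header, sequence) pairs, starting a new record at each '>' line."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def chunk_records(records, chunk_size=10):
    """Group records into lists of at most `chunk_size` records each."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Toy input: 25 short records, invented for illustration.
toy_lines = []
for i in range(25):
    toy_lines += [f">orf_{i}", "ATGAAA", "TTTTAA"]

chunks = list(chunk_records(read_fasta_records(toy_lines), chunk_size=10))
print([len(c) for c in chunks])  # [10, 10, 5]
```

Each chunk could then be written to its own file and passed to `interproscan.sh -i`.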
Interproscan Output
DOOR2
● Make a file containing all the Klebsiella operons, then build a BLAST database from that file using makeblastdb with the dbtype set to protein (-dbtype prot)
● blastp -db kop.fasta -query SSR3467249_output_orfs.fa -evalue 1e-10 -outfmt "6 stitle qseqid sseqid pident qcovs evalue bitscore" > blasted_SSR3467249
Eggnog Mapper
● Input and Output
○ Input: FASTA file (nucleotide or protein)
○ Output: hmm_hits file, seed_orthologs file and annotations file
● Command
○ emapper.py -i [ input file ] --output [ output file ] -m [diamond/hmmer] --usemem --cpu 04 --translate
● Scalability
○ Run times were measured with 4 CPU cores on the NCBI Klebsiella reference genome
○ DIAMOND: useful for large datasets and for annotating organisms with close relatives among the species covered by eggNOG
Eggnog Output
Annotation file
The HMM method shows much lower coverage of the reference genome; this may be due to the reduced size of the database
Sample                      Tool      Annotations   Time, 4 CPU cores (min)   NCBI total annotations   Coverage
Assembly gene prediction    DIAMOND   4631          60                        -                        -
Assembly gene prediction    HMM       3171          60                        -                        -
Reference genome (NCBI)     DIAMOND   5472          108                       5779                     0.946877
Reference genome (NCBI)     HMM       1034          72                        5779                     0.178924
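The Coverage column is consistent with dividing the number of eggNOG-mapper annotations by the NCBI total for the reference genome. The short check below reproduces the two reported values; the function name is illustrative, and the formula is inferred from the numbers rather than stated on the slide.

```python
def coverage(annotated, ncbi_total):
    """Fraction of NCBI-annotated features that also received an annotation."""
    return annotated / ncbi_total

# Values from the reference genome rows of the table.
diamond_cov = coverage(5472, 5779)
hmm_cov = coverage(1034, 5779)

print(f"DIAMOND: {diamond_cov:.6f}")  # 0.946877
print(f"HMM:     {hmm_cov:.6f}")      # 0.178924
```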
Next Step: Innovative Ideas to Scale-Up FA
1. Custom Prokka Database
a. Combine existing databases with annotations from CARD, VFDB, etc.
i. This will remove any duplicates between the databases to save time
b. Add newly discovered gene features from the literature.
i. The CARD database is updated frequently, so any database that we make should reflect this
2. Use the Core/Pan/Accessory Genome
a. Create a Core Genome using ROARY
b. Apply a core set of annotations to all new genomes
Using the Pan/Core Genome to Innovate
1.
● Using the Core Genome we can assign the same “core annotations” to each genome
● With the Core Genome annotated we are left with the Accessory Genome
● Annotating the Accessory Genome will be far less computationally expensive
Using the Pan/Core Genome to Innovate
2.
● Using the Pan Genome we can generate a database of all the Pan Genome features
● This database will include annotations from all of the databases used to annotate the set of genomes used to compute the Pan Genome for Klebsiella spp.
● This database can then be installed in PROKKA and used instead of the other more general databases
Using the Pan/Core Genome to Innovate
● Using ROARY we can define the Core Genome and Pan Genome for Klebsiella spp.
● Once the Core Genome and Pan Genome have been identified the Accessory Genome is designated
○ Accessory Genome = Pan Genome - Core Genome
● Using these data we have two ideas to innovate the Functional Annotation process
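The relation Accessory Genome = Pan Genome - Core Genome maps directly onto set operations. The sketch below assumes a strict definition of the core (genes present in every genome) and uses invented gene names; ROARY's own clustering and thresholds are not reproduced here.

```python
def pan_core_accessory(genomes):
    """Return (pan, core, accessory) gene sets from {genome: iterable of genes}.

    Assumption: "core" means present in every genome (strict intersection).
    """
    gene_sets = [set(genes) for genes in genomes.values()]
    if not gene_sets:
        return set(), set(), set()
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets)
    return pan, core, pan - core

# Hypothetical gene content for three genomes.
genomes = {
    "genome_A": ["gyrA", "rpoB", "blaX"],
    "genome_B": ["gyrA", "rpoB"],
    "genome_C": ["gyrA", "rpoB", "virY"],
}
pan, core, accessory = pan_core_accessory(genomes)
print(sorted(pan))        # ['blaX', 'gyrA', 'rpoB', 'virY']
print(sorted(core))       # ['gyrA', 'rpoB']
print(sorted(accessory))  # ['blaX', 'virY']
```

Under this scheme only the accessory genes would need per-genome annotation once the core set has been annotated.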
ROARY
● Rapid Large-Scale Prokaryote Genome Analysis
● Inputs: A set of annotated genomes (GFF3 files, as produced by PROKKA)
● Produces the Pan Genome from a set of annotated genomes (output compatible with SCOARY)
● Accurate
● Scalable
ROARY - Accuracy
Page AJ, Cummins CA, Hunt M, et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015;31(22):3691-3693. doi:10.1093/bioinformatics/btv421.
Roary - Scalability
Page AJ, Cummins CA, Hunt M, et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015;31(22):3691-3693. doi:10.1093/bioinformatics/btv421.
SCOARY
● Comparative genomics tool
● Takes Pan Genome as input from ROARY
● Requires associated phenotype data
● Pan-GWAS tool
● Used to identify gene features that are associated with phenotypic traits such as AMR
● Comparative Genomics team may find this tool useful
● Provides one approach to determine which genomic features are associated with resistant, susceptible and heteroresistant phenotypes of the isolates provided by the Weiss lab
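A pan-GWAS of this kind boils down to asking, gene by gene, whether presence of the gene tracks a phenotype. One common way to test that is Fisher's exact test on a 2x2 table of gene presence against a binary trait. The sketch below is a simplified stand-in under that assumption, not SCOARY's code; the isolates are invented, and the three phenotype classes are collapsed to resistant versus not resistant for illustration.

```python
from math import comb

def contingency(gene_present, trait_positive):
    """Count isolates into a 2x2 table: (present&pos, present&neg, absent&pos, absent&neg)."""
    a = b = c = d = 0
    for isolate, present in gene_present.items():
        positive = trait_positive[isolate]
        if present and positive:
            a += 1
        elif present:
            b += 1
        elif positive:
            c += 1
        else:
            d += 1
    return a, b, c, d

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Hypothetical isolates: gene presence and a binary "resistant" trait.
gene_present = {"i1": True, "i2": True, "i3": True, "i4": False, "i5": False, "i6": False}
resistant = {"i1": True, "i2": True, "i3": True, "i4": False, "i5": False, "i6": False}

table = contingency(gene_present, resistant)
print(table)                                     # (3, 0, 0, 3)
print(round(fisher_exact_two_sided(*table), 3))  # 0.1
```

With only six isolates even a perfect association gives p = 0.1, which illustrates why sample size matters for this kind of analysis.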
Possible innovation by clustering nucleotide sequences
● The basic idea is to reduce sequence redundancy and speed up the analysis (functional annotation in our case).
● Ideal flowchart:
1. Prodigal results (predicted genes) from group 2
2. Nucleotide sequence clustering (e.g. using CD-HIT)
3. Annotate the resulting collection of representative genes
4. Trace back from the representative genes to their cluster members and apply the annotation results to all the genomes.
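Steps 3 and 4 of the flowchart can be sketched as: read the clusters, annotate only each representative, then copy that annotation to every member. The parser below assumes the usual CD-HIT `.clstr` layout (a `>Cluster N` line followed by member lines, with the representative marked by a trailing `*`); the gene names and annotations are invented, and the function names are hypothetical.

```python
import re

MEMBER_NAME = re.compile(r">(.+?)\.\.\.")

def parse_clstr(lines):
    """Map each cluster representative to the list of all members of its cluster."""
    clusters, current = [], None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = {"rep": None, "members": []}
            clusters.append(current)
        elif line and current is not None:
            name = MEMBER_NAME.search(line).group(1)
            current["members"].append(name)
            if line.endswith("*"):  # CD-HIT marks the representative with '*'
                current["rep"] = name
    return {c["rep"]: c["members"] for c in clusters}

def propagate(clusters, representative_annotations):
    """Give every cluster member the annotation of its representative."""
    return {member: representative_annotations.get(rep)
            for rep, members in clusters.items()
            for member in members}

# Invented example in the assumed .clstr layout.
clstr_text = """>Cluster 0
0\t1500nt, >g1_gene0001... *
1\t1497nt, >g2_gene0417... at +/99.20%
>Cluster 1
0\t900nt, >g2_gene0002... *
"""
clusters = parse_clstr(clstr_text.splitlines())
annotations = propagate(clusters, {"g1_gene0001": "hypothetical efflux pump",
                                   "g2_gene0002": "hypothetical porin"})
print(annotations["g2_gene0417"])  # hypothetical efflux pump
```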
Possible innovation by clustering nucleotide sequences
● Choosing between CD-HIT4 and UCLUST5 (the most-cited tools)
○ The overall speed of CD-HIT4 was better than that of UCLUST5 (as reported in the paper)
○ We prefer CD-HIT4 at this moment since it performs better than UCLUST5 as the size of the dataset increases
○ CD-HIT4 has a GitHub page while UCLUST5 does not (the 64-bit version of UCLUST is not free)
Fu, L., Niu, B., Zhu, Z., Wu, S., & Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), 3150-3152.
REFERENCES
● Huerta-Cepas, Jaime, et al. "Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper." Molecular biology and evolution 34.8 (2017): 2115-2122.
● Page AJ, Cummins CA, Hunt M, et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015;31(22):3691-3693. doi:10.1093/bioinformatics/btv421.
● Jones, P., Binns, D., Chang, H.-Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A., Nuka, G., Pesseat, S., Quinn, A., Sangrador-Vegas, A., Scheremetjew, M., Yong, S.-Y., López Serrano, R., & Hunter, S. (2014). InterProScan 5: Genome-scale protein function classification. Bioinformatics, 30. doi:10.1093/bioinformatics/btu031.
● Fu, L., Niu, B., Zhu, Z., Wu, S., & Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), 3150-3152.
● Arango-Argoty, G., Garner, E., Pruden, A., Heath, L. S., Vikesland, P., & Zhang, L. (2018). DeepARG: A deep learning approach for predicting antibiotic resistance genes from metagenomic data. Microbiome, 6(1), 23.