preliminary results · 2018. 3. 15. · input: fasta file (nucleotide or protein) output: tsv,...

Preliminary Results Team 1 Functional Annotation Wenyi Qiu, Tianze Song, Saurabh Gulati, Ryan Place, Dongjo Ban, Qinwei Zhuang, Kunal Agarwal, Frank Ambrosio


Page 1: Preliminary Results · 2018. 3. 15. · Input: fasta file (nucleotide or protein) Output: TSV, GFF3, XML, JSON, HTML, SVG Runtime NCBI complete Klebsiella genome: 9 minutes and Weiss

Preliminary Results

Team 1 Functional Annotation

Wenyi Qiu, Tianze Song, Saurabh Gulati, Ryan Place, Dongjo Ban, Qinwei Zhuang, Kunal Agarwal, Frank Ambrosio

Overview

● Functional Annotation

● Goal

● Preliminary Results

● Finalized Pipeline

Functional Annotation Review

Homology Based Annotation

● Uses databases of genomic features with known function

● Accuracy is dependent on database quality

○ Garbage in, garbage out

● Databases for AMR genomic features are updated on a regular basis

● CARD and VFDB are examples of databases of homologous genomic features

Functional Annotation Review

Ab Initio Annotation

● Looks for intrinsic characteristics of particular gene feature types

● Signal Peptide and Transmembrane Proteins can be identified in this way

○ These regions are of particular importance to this project because of their significance to AMR

Goals

● The ultimate goal of the Functional Annotation group is to functionally annotate 258 Klebsiella spp. genomes

● We aim to provide the Comparative Genomics group with the data required to perform a Genome Wide Association Study (GWAS) to determine which (if any) genomic features are associated with three phenotype classes

○ Susceptible

○ Resistant

○ Heteroresistant

PROKKA - Overview

● Input and Output

○ Input: Fasta file (nucleotide)

○ Output: Annotation files

● Command

○ prokka --force --outdir <OUTPUT DIRECTORY> --kingdom Bacteria --genus Klebsiella --gram neg --prefix <PREFIX TO IDENTIFY SAMPLE> --rfam --rnammer <INPUT FASTA FILE>

● Scalability

○ Run time on average for 1 genome is around 16 minutes (~3 days for 258)
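The command and the scalability estimate above can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual batch script: the input path, output directory, and prefix are hypothetical placeholders, and the runtime figure assumes genomes are processed one after another.

```python
MINUTES_PER_GENOME = 16  # average PROKKA run time reported on the slide
N_GENOMES = 258          # number of Klebsiella genomes to annotate


def prokka_command(fasta, outdir, prefix):
    """Return the slide's PROKKA invocation as an argument list."""
    return [
        "prokka", "--force",
        "--outdir", outdir,
        "--kingdom", "Bacteria",
        "--genus", "Klebsiella",
        "--gram", "neg",
        "--prefix", prefix,
        "--rfam", "--rnammer",
        fasta,
    ]


def serial_runtime_days(n_genomes, minutes_each):
    """Total wall-clock time in days if genomes run one at a time."""
    return n_genomes * minutes_each / (60 * 24)


# Hypothetical file layout, used only for illustration.
cmd = prokka_command(
    fasta="assemblies/SRR3467249.fasta",
    outdir="prokka_out/SRR3467249",
    prefix="SRR3467249",
)
print(" ".join(cmd))
print(f"{serial_runtime_days(N_GENOMES, MINUTES_PER_GENOME):.1f} days")
```

The argument list could be handed to `subprocess.run` on a machine where PROKKA is installed. Running several genomes in parallel would shorten the total time roughly in proportion to the number of workers.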

PROKKA - Preliminary Results

Typical result from running PROKKA on our Klebsiella reference.

PROKKA - Preliminary Results

Typical result from running PROKKA on a Klebsiella Skesa assembly (SRR666627.skesa.fasta)

PROKKA - Output

● Writes the annotations in multiple formats, e.g. GFF3, GenBank, TSV, SQN, FFN, TBL and TXT.

Adding Annotation Databases to PROKKA

Prokka offers 2 options for adding to the list of homology based annotations:

1. Use the --proteins flag

a. Takes a FASTA file as input

b. Simple to implement

2. Create a new “genus” database in PROKKA

a. Takes a list of NCBI taxids as an input

Comprehensive Antibiotic Resistance DB

● Added to PROKKA analysis using the --proteins flag

● This run took 19.9 mins

● Adding more databases will increase processing time

PilerCR

● Input and Output

○ Input: Fasta file (nucleotide)

○ Output: Putative CRISPR Arrays report file

● Command

○ pilercr -in <sequence_file> -out <report_file>

● Scalability

○ Run time on average for 1 genome is less than 5 seconds

Results from de novo assemblies:

| Number of CRISPR arrays found | Number of samples |
|---|---|
| 0 | 207 |
| 1 | 17 |
| 2 | 29 |
| 3 | 2 |
| 4 | 1 |
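The PilerCR counts above can be summarised with a short Python sketch. The numbers are copied from the table; the derived quantities (total samples, samples with at least one array, total arrays) are simple arithmetic on them and are not figures reported on the slide.

```python
# Number of CRISPR arrays found -> number of samples, as listed in the table.
crispr_counts = {0: 207, 1: 17, 2: 29, 3: 2, 4: 1}

total_samples = sum(crispr_counts.values())
samples_with_arrays = total_samples - crispr_counts[0]
total_arrays = sum(n_arrays * n_samples for n_arrays, n_samples in crispr_counts.items())

print(f"samples in table: {total_samples}")
print(f"samples with at least one putative array: {samples_with_arrays}")
print(f"fraction with arrays: {samples_with_arrays / total_samples:.1%}")
print(f"putative arrays in total: {total_arrays}")
```

Note that the table accounts for 256 samples, so roughly one in five assemblies carries at least one putative CRISPR array.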

SignalP

Command: signalp -t <organism_type> -f <output_format> <input_file> > <output_file>

● organism_type: euk, gram+, gram-

● output_format: short, long, summary, all

Runtime: ~4 minutes for one NCBI genome (GCF000240185.1)

SignalP

Command: signalp -t <organism_type> -f <output_format> <input_file> > <output_file>

● organism_type: euk, gram+, gram-

● output_format: short, long, summary, all

Runtime: ~3 minutes for one de novo assembled genome (SRR3467249)

Phobius

Command: phobius.pl -<output_format> <input_file> > <output_file>

● output_format: short, long, raw

Runtime: 12-16 minutes for one NCBI genome (GCF000240185.1)

LipoP

Command: LipoP -<output_format> <input_file> > <output_file>

● output_format: short, long

Runtime: ~2 minutes for one NCBI genome (GCF000240185.1)

[Figure: LipoP prediction output. Labels: cytoplasmic; signal peptide (signal peptidase I); a higher score means a more reliable prediction. Not shown: SpII (lipoprotein signal peptide), CleavI, CleavII.]

LipoP

Command: LipoP -<output_format> <input_file> > <output_file>

● output_format: short, long

Runtime: ~2 minutes for one de novo assembled genome (SRR3467249)

TMHMM

Command: tmhmm -<output_format> <input_file> > <output_file>

● output_format: short, long

Runtime: ~4 minutes for one NCBI genome (GCF000240185.1)

TMHMM

Command: tmhmm -<output_format> <input_file> > <output_file>

● output_format: short, long

Runtime: ~6 minutes for one de novo assembled genome (SRR3467249)

[Figures: Signal Peptide Predictions; Transmembrane Predictions]

DeepARG

● Input and output:

○ Input: Fasta or tabular file

○ Output: Predicted ARG list (containing some statistics)

● What it does:

○ Predicts Antibiotic Resistance Genes (ARGs)

● Why it may be useful:

○ Its authors developed a new database (DeepARG-DB) by combining existing ARG databases (CARD and ARDB).

○ Better than the traditional “best-hit” method in that it can also predict ARGs whose percent identity is low (30%-50%) but whose e-values are small (< 1e-10)

● Scalability:

○ Running time: 3 min 27 s

DeepARG

VFDB

Update: 47 VFs found in the full set with the keyword “Klebsiella”

A search through the core set of virulence factors took less than 3 s for each genome

makeblastdb -in VFDB_setA_nt.fas -dbtype nucl

blastn -query query.fasta -db VFDB_setA_nt.fas -xdrop_gap 150 -outfmt 6 > outputfile

A search through the full set takes less than 7 s for each genome
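The blastn command above writes tabular output (`-outfmt 6`), whose twelve default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. A minimal Python sketch of listing the virulence factors hit in one genome follows. The identity and length thresholds are assumptions chosen for illustration, not values from the slides, and the sample rows are invented.

```python
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast6(text):
    """Parse default BLAST tabular output (-outfmt 6) into a list of dicts."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(BLAST6_COLUMNS, line.split("\t")))
        row["pident"] = float(row["pident"])
        row["length"] = int(row["length"])
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        hits.append(row)
    return hits


def virulence_factors_hit(hits, min_identity=90.0, min_length=100):
    """Return the distinct subject IDs passing the (assumed) thresholds."""
    return sorted(
        {h["sseqid"] for h in hits
         if h["pident"] >= min_identity and h["length"] >= min_length}
    )


# Invented example rows; real IDs would come from the VFDB FASTA headers.
sample_output = (
    "contig_1\tVF_A\t98.5\t900\t13\t0\t100\t999\t1\t900\t0.0\t1500\n"
    "contig_2\tVF_B\t75.0\t300\t75\t0\t1\t300\t1\t300\t1e-40\t160\n"
    "contig_3\tVF_A\t95.0\t850\t42\t0\t5\t854\t1\t850\t0.0\t1300\n"
)
hits = parse_blast6(sample_output)
factors = virulence_factors_hit(hits)
print(factors)
```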

Interproscan

● Command: /projects/data/team1_functionalAnnotation/interproscan-5.28-67.0/interproscan.sh -dp -appl PfamA,CDD,COILS,Gene3D,HAMAP,MobiDBLite,PIRSF,PRINTS,ProDom,PROSITEPATTERNS,PROSITEPROFILES,SFLD,SMART,SUPERFAMILY,TIGRFAM,Phobius -goterms -iprlookup -pa -t n -i <input file> -f gff3

● Input and Output

○ Input: fasta file (nucleotide or protein)

○ Output: TSV, GFF3, XML, JSON, HTML, SVG

● Runtime: 9 minutes for the NCBI complete Klebsiella genome; 146 minutes for a Weiss Lab Klebsiella sample

● Because of the way Interproscan works, the fewer sequences in a multi-FASTA file, the better:

○ Gene prediction (Prodigal) output for SRR3467249: ~5000 files when split by /^>/

■ 1 min per file or 10 min for 10 merged files

○ Skesa output of SRR3467249: 127 files when split by /^>/

■ 9-14 min for 10 merged files
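Splitting a multi-FASTA on /^>/ and merging a fixed number of records per file, as described above, can be sketched in Python as follows. The chunk size and the toy sequences are illustrative assumptions; the slides do not say how the team performed the split.

```python
def read_fasta_records(text):
    """Split multi-FASTA text into records, each starting at a '>' header line."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">"):
            if current:
                records.append("\n".join(current))
            current = [line]
        elif line.strip() and current:
            current.append(line)
    if current:
        records.append("\n".join(current))
    return records


def chunk_records(records, per_chunk):
    """Merge records into multi-FASTA chunks of at most per_chunk records."""
    if per_chunk < 1:
        raise ValueError("per_chunk must be at least 1")
    return [
        "\n".join(records[i:i + per_chunk]) + "\n"
        for i in range(0, len(records), per_chunk)
    ]


# Toy input with five records (invented sequences).
multi_fasta = "".join(f">orf_{i}\nATGGCTAAATGA\n" for i in range(1, 6))
records = read_fasta_records(multi_fasta)
chunks = chunk_records(records, per_chunk=2)
print(len(records), "records ->", len(chunks), "chunks")
```

Each chunk could then be written to its own file and submitted as a separate Interproscan job.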

Interproscan Output

Door2

● Make a file containing all the Klebsiella operons and then make a blast database from that file using makeblastdb and specifying the dbtype as protein

● blastp -db kop.fasta -query SSR3467249_output_orfs.fa -evalue 1e-10 -outfmt "6 stitle qseqid sseqid pident qcovs evalue bitscore" > blasted_SSR3467249
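The blastp command above requests seven custom columns (stitle, qseqid, sseqid, pident, qcovs, evalue, bitscore), so one ORF can match several operon entries. A minimal Python sketch of keeping only the highest-scoring hit per ORF is shown below. Choosing the best hit by bitscore is an assumption made here for illustration, and the sample rows are invented.

```python
import csv
import io

DOOR2_COLUMNS = ["stitle", "qseqid", "sseqid", "pident", "qcovs", "evalue", "bitscore"]


def best_hit_per_query(tabular_text):
    """Return {qseqid: row} keeping the row with the highest bitscore per query."""
    reader = csv.DictReader(
        io.StringIO(tabular_text),
        fieldnames=DOOR2_COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # subject titles may contain quote characters
    )
    best = {}
    for row in reader:
        query = row["qseqid"]
        if query not in best or float(row["bitscore"]) > float(best[query]["bitscore"]):
            best[query] = row
    return best


# Invented example rows in the same column order as the -outfmt string.
sample_output = (
    "operon 12 hypothetical protein\torf_1\tsubj_a\t88.0\t95\t1e-50\t200\n"
    "operon 7 transporter\torf_1\tsubj_b\t99.0\t100\t1e-90\t350\n"
    "operon 3 kinase\torf_2\tsubj_c\t70.0\t80\t1e-20\t90\n"
)
best = best_hit_per_query(sample_output)
for query, row in best.items():
    print(query, row["sseqid"], row["stitle"])
```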

Eggnog Mapper

● Input and Output

○ Input: Fasta file (nucleotide or protein)

○ Output: hmm_hits file, seed_orthologs file and annotations file

● Command

○ emapper.py -i [ input file ] --output [ output file ] -m [diamond/hmmer] --usemem --cpu 04 --translate

● Scalability

○ Running time was measured on the NCBI Klebsiella reference genome with 4 cores

○ Diamond: useful for large datasets and for annotating organisms with close relatives among the species covered by eggNOG

Eggnog Output

Annotation file

The HMM method shows much lower coverage of the reference genome; this might be due to the reduced size of the database.

| Sample | Tool | Annotations | Time, 4 CPU cores (min) | NCBI Total Annotations | Coverage |
|---|---|---|---|---|---|
| Assembly gene prediction | DIAMOND | 4631 | 60 | - | - |
| Assembly gene prediction | HMM | 3171 | 60 | - | - |
| Reference genome (NCBI) | DIAMOND | 5472 | 108 | 5779 | 0.946877 |
| Reference genome (NCBI) | HMM | 1034 | 72 | 5779 | 0.178924 |
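The Coverage column is consistent with dividing each tool's annotation count by the NCBI total for the reference genome. A short Python check, using only the numbers in the table, might look like this:

```python
NCBI_TOTAL_ANNOTATIONS = 5779  # NCBI annotations for the reference genome

# Annotation counts from the two reference-genome runs in the table.
reference_runs = {"DIAMOND": 5472, "HMM": 1034}


def coverage(tool_annotations, ncbi_total):
    """Fraction of the NCBI annotation count recovered by a tool."""
    if ncbi_total <= 0:
        raise ValueError("ncbi_total must be positive")
    return tool_annotations / ncbi_total


for tool, annotations in reference_runs.items():
    print(f"{tool}: {coverage(annotations, NCBI_TOTAL_ANNOTATIONS):.6f}")
# prints:
# DIAMOND: 0.946877
# HMM: 0.178924
```

This ratio only compares counts; it does not check that the same genes were annotated in both sets.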

Next Step: Innovative Ideas to Scale-Up FA

1. Custom Prokka Database

a. Combine existing databases with annotations from CARD, VFDB, etc.

i. This will remove any duplicates between the databases to save time

b. Add newly discovered gene features from the literature.

i. The CARD database is updated frequently, so any database that we make should reflect this

2. Use the Core/Pan/Accessory Genome

a. Create a Core Genome using ROARY

b. Apply a core set of annotations to all new genomes

Using the Pan/Core Genome to Innovate

1.

● Using the Core Genome we can assign the same “core annotations” to each genome

● With the Core Genome annotated we are left with the Accessory Genome

● Annotating the Accessory Genome will be far less computationally expensive

Using the Pan/Core Genome to Innovate

2.

● Using the Pan Genome we can generate a database of all the Pan Genome features

● This database will include annotations from all of the databases used to annotate the set of genomes used to compute the Pan Genome for Klebsiella spp.

● This database can then be installed in PROKKA and used instead of the other more general databases

Using the Pan/Core Genome to Innovate

● Using ROARY we can define the Core Genome and Pan Genome for Klebsiella spp.

● Once the Core Genome and Pan Genome have been identified the Accessory Genome is designated

○ Accessory Genome = Pan Genome - Core Genome

● Using these data we have two ideas to innovate the Functional Annotation process
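The relation Accessory Genome = Pan Genome - Core Genome is plain set arithmetic. The Python sketch below uses invented gene names and a strict definition of "core" (present in every genome). That strictness is a simplifying assumption: ROARY's own core definition is threshold-based, so real results may differ.

```python
def pan_core_accessory(genomes):
    """Return (pan, core, accessory) gene sets from {genome_name: genes}."""
    gene_sets = [set(genes) for genes in genomes.values()]
    if not gene_sets:
        raise ValueError("at least one genome is required")
    pan = set().union(*gene_sets)        # genes seen in any genome
    core = set.intersection(*gene_sets)  # genes seen in every genome (strict)
    accessory = pan - core               # everything that is not core
    return pan, core, accessory


# Toy gene content for three hypothetical isolates.
genomes = {
    "isolate_1": {"geneA", "geneB", "geneC"},
    "isolate_2": {"geneA", "geneB", "geneD"},
    "isolate_3": {"geneA", "geneB", "geneC", "geneE"},
}
pan, core, accessory = pan_core_accessory(genomes)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

With the core set annotated once, only the accessory genes of each new genome would still need annotation.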

ROARY

● Rapid Large-Scale Prokaryote Genome Analysis

● Inputs: A set of annotated genomes (PROKKA output is compatible with ROARY)

● Produces the Pan Genome from that set of annotated genomes

● Accurate

● Scalable

ROARY - Accuracy

Page AJ, Cummins CA, Hunt M, et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015;31(22):3691-3693. doi:10.1093/bioinformatics/btv421.

Roary - Scalability

Page AJ, Cummins CA, Hunt M, et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015;31(22):3691-3693. doi:10.1093/bioinformatics/btv421.

SCOARY

● Comparative Genomics Tools

● Takes Pan Genome as input from ROARY

● Requires associated phenotype data

● Pan-GWAS tool

● Used to identify gene features that are associated with phenotypic traits such as AMR

● Comparative Genomics team may find this tool useful

● Provides one approach to determine which genomic features are associated with resistant, susceptible and heteroresistant phenotypes of the isolates provided by the Weiss lab

Possible innovation by clustering nucleotide sequences

● The basic idea is to reduce sequence redundancy and so speed up the analysis (functional annotation, in our case).

● Ideal flowchart:

1. Take the Prodigal results (predicted genes) from group 2

2. Cluster the nucleotide sequences (e.g. using CD-HIT)

3. Annotate the resulting collection of representative genes

4. Trace back from the representative genes and apply the annotation results to all the genomes.
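Steps 3 and 4 of the flowchart can be sketched in Python, assuming the clustering step produced a CD-HIT style `.clstr` file in which each cluster starts with a `>Cluster N` line and the representative member's line ends with `*`. The sequence IDs and annotation strings below are invented, and real cluster files may need more careful handling.

```python
import re

# Member lines look like: "0<TAB>1200nt, >genomeA_00001... *"
MEMBER_ID = re.compile(r">(.+?)\.\.\.")


def parse_clstr(text):
    """Return {representative_id: [member_ids]} from CD-HIT .clstr text."""
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters.append({"rep": None, "members": []})
            continue
        match = MEMBER_ID.search(line)
        if match is None or not clusters:
            raise ValueError(f"unexpected line: {line!r}")
        seq_id = match.group(1)
        clusters[-1]["members"].append(seq_id)
        if line.endswith("*"):  # the representative sequence is marked with '*'
            clusters[-1]["rep"] = seq_id
    return {c["rep"]: c["members"] for c in clusters}


def propagate_annotations(rep_annotations, clusters):
    """Copy each representative's annotation to every member of its cluster."""
    result = {}
    for rep, members in clusters.items():
        annotation = rep_annotations.get(rep)
        if annotation is None:
            continue  # representative was not annotated
        for member in members:
            result[member] = annotation
    return result


clstr_text = (
    ">Cluster 0\n"
    "0\t1200nt, >genomeA_00001... *\n"
    "1\t1197nt, >genomeB_00417... at +/99.10%\n"
    ">Cluster 1\n"
    "0\t900nt, >genomeA_00002... *\n"
)
clusters = parse_clstr(clstr_text)
annotations = propagate_annotations(
    {"genomeA_00001": "example product 1", "genomeA_00002": "example product 2"},
    clusters,
)
print(annotations)
```

Only the representatives need to go through the annotation tools; every other gene inherits its label from the cluster it belongs to.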

Possible innovation by clustering nucleotide sequences

● Choice between CD-HIT and UCLUST (the most-cited tools)

○ The overall speed of CD-HIT was better than that of UCLUST (as reported in the paper cited below)

○ We prefer CD-HIT at this moment since it performs better than UCLUST as the size of the datasets increases

○ CD-HIT has a GitHub page while UCLUST does not (the 64-bit version of UCLUST is not free)

Fu, L., Niu, B., Zhu, Z., Wu, S., & Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), 3150-3152.

REFERENCES

● Huerta-Cepas, Jaime, et al. "Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper." Molecular biology and evolution 34.8 (2017): 2115-2122.

● Page AJ, Cummins CA, Hunt M, et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015;31(22):3691-3693. doi:10.1093/bioinformatics/btv421.

● Jones, P., Binns, D., Chang, H.-Y., et al. (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics, 30. doi:10.1093/bioinformatics/btu031.

● Fu, L., Niu, B., Zhu, Z., Wu, S., & Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), 3150-3152.

● Arango-Argoty, G., Garner, E., Pruden, A., Heath, L. S., Vikesland, P., & Zhang, L. (2018). DeepARG: A deep learning approach for predicting antibiotic resistance genes from metagenomic data. Microbiome, 6(1), 23.