preliminary results · 2018. 3. 15. · input: fasta file (nucleotide or protein) output: tsv,...

Preliminary Results

Team 1 Functional Annotation

Wenyi Qiu, Tianze Song, Saurabh Gulati, Ryan Place, Dongjo Ban, Qinwei Zhuang, Kunal Agarwal, Frank Ambrosio

Overview

● Functional Annotation

● Goal

● Preliminary Results

● Finalized Pipeline

Functional Annotation Review

Homology Based Annotation

● Uses databases of genomic features with known function

● Accuracy is dependent on database quality

○ Garbage in, garbage out

● Databases for AMR genomic features are added to on a regular basis

● CARD and VFDB are examples of databases of homologous genomic features

Functional Annotation Review

Ab Initio Annotation

● Looks for intrinsic characteristics of particular gene feature types

● Signal Peptides and Transmembrane Proteins can be identified in this way

○ These regions are of particular importance to this project because of their significance to AMR

Goals

● The ultimate goal of the Functional Annotation group is to functionally annotate 258 Klebsiella spp. genomes

● We aim to provide the Comparative Genomics group with the data required to perform a Genome-Wide Association Study (GWAS) to determine which (if any) genomic features are associated with three phenotype classes:

○ Susceptible

○ Resistant

○ Heteroresistant

PROKKA - Overview

● Input and Output

○ Input: FASTA file (nucleotide)

○ Output: Annotation files

● Command

○ prokka --force --outdir <OUTPUT DIRECTORY> --kingdom Bacteria --genus Klebsiella --gram neg --prefix <PREFIX TO IDENTIFY SAMPLE> --rfam --rnammer <INPUT FASTA FILE>

● Scalability

○ Average run time for 1 genome is around 16 minutes (~3 days for 258 genomes run one after another)
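The ~3 day figure follows from multiplying the per-genome time by the number of genomes. A minimal sketch of that arithmetic is below; the function name and the `parallel_jobs` parameter are illustrative assumptions, and the parallel estimate assumes concurrent jobs do not slow each other down.

```python
import math


def estimate_batch_days(minutes_per_genome, n_genomes, parallel_jobs=1):
    """Estimate wall-clock days needed to process n_genomes.

    Assumes every genome takes the same time and that parallel jobs
    (a hypothetical setup, not described in the slides) run independently.
    """
    rounds = math.ceil(n_genomes / parallel_jobs)
    return rounds * minutes_per_genome / (60 * 24)


# Figures from the slide: ~16 minutes per genome, 258 genomes.
serial_days = estimate_batch_days(16, 258)
four_job_days = estimate_batch_days(16, 258, parallel_jobs=4)
print(f"serial: {serial_days:.2f} days, 4 parallel jobs: {four_job_days:.2f} days")
```

The same helper can be reused with the runtimes reported for the other tools.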

PROKKA - Preliminary Results

Typical result from running PROKKA on our Klebsiella reference.

PROKKA - Preliminary Results

Typical result from running PROKKA on a Klebsiella SKESA assembly (SRR666627.skesa.fasta)

PROKKA - Output

● Gives the annotations as output in multiple formats, i.e. GFF3, GenBank, TSV, SQN, FFN, TBL and TXT.

Adding Annotation Databases to PROKKA

Prokka offers 2 options for adding to the list of homology based annotations:

1. Use the --proteins flag

a. Takes a FASTA file as input

b. Simple to implement

2. Create a new “genus” database in PROKKA

a. Takes a list of NCBI taxids as an input

Comprehensive Antibiotic Resistance DB

● Added to PROKKA analysis using the --proteins flag

● This run took 19.9 mins

● Adding more databases will increase processing time
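As a rough illustration of the --proteins option, the sketch below assembles the PROKKA command from the earlier slide as an argument list and appends --proteins when a FASTA of extra proteins is supplied. The file names (`card_proteins.fasta`, the output directory) are hypothetical placeholders, and the command is only built here, not executed.

```python
def build_prokka_command(fasta, outdir, prefix, proteins=None):
    """Return the PROKKA argument list used in this project.

    `proteins` is an optional FASTA file of trusted proteins (for example a
    CARD export); the path is supplied by the caller and is not checked here.
    """
    cmd = [
        "prokka", "--force",
        "--outdir", str(outdir),
        "--kingdom", "Bacteria",
        "--genus", "Klebsiella",
        "--gram", "neg",
        "--prefix", str(prefix),
        "--rfam", "--rnammer",
    ]
    if proteins is not None:
        cmd += ["--proteins", str(proteins)]
    cmd.append(str(fasta))  # the input FASTA goes last
    return cmd


# Hypothetical file names, for illustration only.
with_card = build_prokka_command(
    "SRR3467249.skesa.fasta", "prokka_out/SRR3467249", "SRR3467249",
    proteins="card_proteins.fasta",
)
print(" ".join(with_card))
# To actually run it: subprocess.run(with_card, check=True)
```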

PilerCR

● Input and Output

○ Input: FASTA file (nucleotide)

○ Output: Putative CRISPR arrays report file

● Command

○ pilercr -in <sequence_file> -out <report_file>

● Scalability

○ Average run time for 1 genome is less than 5 seconds

Results from de novo assemblies

Number of CRISPR arrays found    Number of samples
0                                207
1                                17
2                                29
3                                2
4                                1
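The distribution above can be summarized with a few lines of arithmetic. The sketch below only uses the counts reported in the table; the variable names are illustrative.

```python
# Counts copied from the PilerCR table: arrays found -> number of samples.
crispr_array_counts = {0: 207, 1: 17, 2: 29, 3: 2, 4: 1}

total_samples = sum(crispr_array_counts.values())
samples_with_arrays = total_samples - crispr_array_counts[0]
total_arrays = sum(n_arrays * n_samples
                   for n_arrays, n_samples in crispr_array_counts.items())
fraction_with_arrays = samples_with_arrays / total_samples

print(f"{samples_with_arrays} of {total_samples} samples "
      f"({fraction_with_arrays:.1%}) carry at least one putative CRISPR array; "
      f"{total_arrays} arrays in total")
```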

SignalP

Command: signalp -t <organism_type> -f <output_format> <input_file> > <output_file>

● organism_type: euk, gram+, gram-

● output_format: short, long, summary, all

Runtime: ~4 minutes for one NCBI genome (GCF_000240185.1)

Runtime: ~3 minutes for one de novo assembled genome (SRR3467249)

Phobius

Command: phobius.pl -<output_format> <input_file> > <output_file>

● output_format: short, long, raw

Runtime: 12-16 minutes for one NCBI genome (GCF_000240185.1)

LipoP

Command: LipoP -<output_format> -<input_file> > <output_file>

● output_format: short, long

Runtime: ~2 minutes for one NCBI genome (GCF_000240185.1)

[Figure: LipoP prediction output. Labels: cytoplasmic; signal peptide (signal peptidase I); higher, more reliable prediction. Not shown: SpII (lipoprotein signal peptide), CleavI, CleavII.]

Runtime: ~2 minutes for one de novo assembled genome (SRR3467249)

TMHMM

Command: tmhmm -<output_format> <input_file> > <output_file>

Runtime: ~4 minutes for one NCBI genome (GCF_000240185.1)

Runtime: ~6 minutes for one de novo assembled genome (SRR3467249)

[Figure panels: Signal Peptide Predictions; Transmembrane Predictions]

DeepARG

● Input and output:

○ Input: FASTA or tabular file

○ Output: Predicted ARG list (containing some statistics)

● What it does:

○ Predicts Antibiotic Resistance Genes (ARGs)

● Why it may be useful:

○ Its authors developed a new database (DeepARG-DB) by combining existing ARG databases (CARD and ARDB).

○ Better than the traditional “best-hit” method in that it can also predict ARGs whose percent identity is low (30%~50%) but whose e-values are small (< 1e-10)

● Scalability:

○ Running time: 3 min 27 s

DeepARG

VFDB

Update: 47 VFs found in the full set with keyword “Klebsiella”

Searching the core set of virulence factors took less than 3 s for each genome

makeblastdb -in VFDB_setA_nt.fas -dbtype nucl

blastn -query query.fasta -db VFDB_setA_nt.fas -xdrop_gap 150 -outfmt 6 > outputfile

Searching the full set takes less than 7 s for each genome
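The blastn command above writes tabular output (-outfmt 6), whose default 12 columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. A minimal sketch of reading that file and keeping the best hit per query follows. The identity and e-value cutoffs and the example rows are assumptions made for illustration, not values used by the team.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_blast6(handle):
    """Read BLAST -outfmt 6 rows into dictionaries with numeric scores."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(BLAST6_FIELDS, row))
        for field in ("pident", "evalue", "bitscore"):
            hit[field] = float(hit[field])
        hits.append(hit)
    return hits


def best_hit_per_query(hits, min_identity, max_evalue):
    """Keep, for each query, the passing hit with the highest bit score."""
    best = {}
    for hit in hits:
        if hit["pident"] < min_identity or hit["evalue"] > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up rows standing in for the contents of `outputfile`.
example = (
    "contig_1\tVFG000001\t98.50\t900\t13\t0\t1\t900\t1\t900\t0.0\t1600\n"
    "contig_1\tVFG000002\t85.00\t400\t60\t0\t10\t409\t1\t400\t1e-90\t350\n"
    "contig_2\tVFG000003\t70.00\t120\t36\t0\t5\t124\t1\t120\t1e-05\t60\n"
)
hits = parse_blast6(io.StringIO(example))
best = best_hit_per_query(hits, min_identity=80.0, max_evalue=1e-10)
for query, hit in best.items():
    print(query, hit["sseqid"], hit["pident"], hit["evalue"])
```

With a real result file, `parse_blast6(open("outputfile"))` would take the place of the in-memory example.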

Interproscan

● Command: /projects/data/team1_functionalAnnotation/interproscan-5.28-67.0/interproscan.sh -dp -appl PfamA,CDD,COILS,Gene3D,HAMAP,MobiDBLite,PIRSF,PRINTS,ProDom,PROSITEPATTERNS,PROSITEPROFILES,SFLD,SMART,SUPERFAMILY,TIGRFAM,Phobius -goterms -iprlookup -pa -t n -i <input file> -f gff3

● Input and Output

○ Input: FASTA file (nucleotide or protein)

○ Output: TSV, GFF3, XML, JSON, HTML, SVG

● Runtime: 9 minutes for an NCBI complete Klebsiella genome; 146 minutes for a Weiss Lab Klebsiella sample

● Because of the way Interproscan works, the fewer FASTA records in a multi-FASTA file the better:

○ Prodigal gene prediction output for SRR3467249: ~5000 files when split by /^>/

■ 1 min per file or 10 min for 10 merged files

○ SKESA output of SRR3467249: 127 files when split by /^>/

■ 9-14 min for 10 merged files
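Splitting a multi-FASTA on /^>/ and re-merging the records in groups of 10, as described above, can be sketched as follows. The function names and the tiny example sequences are illustrative assumptions; the batch size of 10 mirrors the "10 merged files" in the timings.

```python
def read_fasta(text):
    """Split FASTA text on header lines (those starting with '>')."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def batch_records(records, batch_size):
    """Group records into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [records[i:i + batch_size]
            for i in range(0, len(records), batch_size)]


def format_fasta(records):
    """Write (header, sequence) pairs back out as FASTA text."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in records)


# Tiny made-up input; real input would be the Prodigal or SKESA FASTA.
fasta_text = (">orf1\nATGAAA\nTTT\n>orf2\nATGCCC\n>orf3\nATGGGG\n"
              ">orf4\nATGTTT\n>orf5\nATGAAC\n")
records = read_fasta(fasta_text)
batches = batch_records(records, batch_size=2)
for index, batch in enumerate(batches):
    print(f"batch {index}: {len(batch)} records")
```

Each batch could then be written to its own file with `format_fasta` and passed to interproscan.sh via -i.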

Interproscan Output

Door2

● Make a file containing all the Klebsiella operons, then build a BLAST database from that file using makeblastdb with the dbtype set to protein (-dbtype prot)

● blastp -db kop.fasta -query SSR3467249_output_orfs.fa -evalue 1e-10 -outfmt "6 stitle qseqid sseqid pident qcovs evalue bitscore" > blasted_SSR3467249

Eggnog Mapper

● Input and Output

○ Input: FASTA file (nucleotide or protein)

○ Output: hmm_hits file, seed_orthologs file and annotations file

● Command

○ emapper.py -i [ input file ] --output [ output file ] -m [diamond/hmmer] --usemem --cpu 04 --translate

● Scalability

○ Running time was measured for 1 NCBI Klebsiella reference genome with 4 cores

○ Diamond: useful for large datasets and for annotating organisms with close relatives among the species covered by eggNOG

Eggnog Output

Annotation file

The HMM method shows much lower coverage of the reference genome; this may be due to the reduced size of the database

Sample                     Tool     Annotations  Time, 4 CPU cores (min)  NCBI Total Annotations  Coverage
ASSEMBLY GENE PREDICTION   DIAMOND  4631         60                       -                       -
ASSEMBLY GENE PREDICTION   HMM      3171         60                       -                       -
REFERENCE GENOME (NCBI)    DIAMOND  5472         108                      5779                    0.946877
REFERENCE GENOME (NCBI)    HMM      1034         72                       5779                    0.178924
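The Coverage column is consistent with dividing the number of eggNOG annotations by the NCBI total for the reference genome. A short sketch of that calculation, using only the numbers in the table (the function name is illustrative):

```python
def annotation_coverage(annotated, reference_total):
    """Fraction of reference annotations recovered by a tool."""
    if reference_total <= 0:
        raise ValueError("reference_total must be positive")
    return annotated / reference_total


# Values from the reference genome rows of the table.
diamond_coverage = annotation_coverage(5472, 5779)
hmm_coverage = annotation_coverage(1034, 5779)
print(f"DIAMOND: {diamond_coverage:.6f}  HMM: {hmm_coverage:.6f}")
```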

Next Step: Innovative Ideas to Scale-Up FA

1. Custom Prokka Database

a. Combine existing databases with annotations from CARD, VFDB, etc.

i. This will remove any duplicates between the databases to save time

b. Add newly discovered gene features from the literature.

i. The CARD database is updated frequently, so any database that we make should reflect this

2. Use the Core/Pan/Accessory Genome

a. Create a Core Genome using ROARY

b. Apply a core set of annotations to all new genomes
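Removing duplicates when combining databases (idea 1a) could look like the sketch below, which merges several lists of protein records and keeps only the first record for each distinct sequence. It assumes exact sequence identity is the duplicate criterion; near-identical entries would need a clustering step instead. The record names are made up.

```python
def merge_unique_proteins(*databases):
    """Merge (header, sequence) lists, dropping exact duplicate sequences.

    The first header seen for a sequence wins; comparison ignores case and
    a trailing '*' stop symbol.
    """
    first_header_by_sequence = {}
    for records in databases:
        for header, sequence in records:
            key = sequence.upper().rstrip("*")
            if key not in first_header_by_sequence:
                first_header_by_sequence[key] = header
    return [(header, sequence)
            for sequence, header in first_header_by_sequence.items()]


# Hypothetical records standing in for CARD and VFDB protein entries.
card_records = [("card|geneA", "MKTLLV"), ("card|geneB", "MLLAAG")]
vfdb_records = [("vfdb|geneX", "mktllv*"), ("vfdb|geneY", "MAAGTR")]

merged = merge_unique_proteins(card_records, vfdb_records)
for header, sequence in merged:
    print(header, sequence)
```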

Using the Pan/Core Genome to Innovate

● Using ROARY we can define the Core Genome and Pan Genome for Klebsiella spp.

● Once the Core Genome and Pan Genome have been identified, the Accessory Genome is designated

○ Accessory Genome = Pan Genome - Core Genome

● Using these data we have two ideas to innovate the Functional Annotation process

1.

● Using the Core Genome we can assign the same “core annotations” to each genome

● With the Core Genome annotated we are left with the Accessory Genome

● Annotating the Accessory Genome will be far less computationally expensive

2.

● Using the Pan Genome we can generate a database of all the Pan Genome features

● This database will include annotations from all of the databases used to annotate the set of genomes used to compute the Pan Genome for Klebsiella spp.

● This database can then be installed in PROKKA and used instead of the other more general databases
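The relation Accessory Genome = Pan Genome - Core Genome is plain set arithmetic. The sketch below uses a strict definition (core = genes present in every genome), which is a simplifying assumption; pan-genome tools typically work on gene clusters and may use a presence threshold instead. The isolate and gene names are hypothetical.

```python
def pan_core_accessory(genomes):
    """Return (pan, core, accessory) gene sets.

    `genomes` maps a genome name to the set of gene names it carries.
    Core is defined strictly here as genes found in every genome.
    """
    gene_sets = list(genomes.values())
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets) if gene_sets else set()
    accessory = pan - core
    return pan, core, accessory


# Made-up gene content for three isolates.
genomes = {
    "isolate_A": {"gyrA", "rpoB", "blaKPC"},
    "isolate_B": {"gyrA", "rpoB"},
    "isolate_C": {"gyrA", "rpoB", "mcr-1"},
}
pan, core, accessory = pan_core_accessory(genomes)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

Under idea 1, only the accessory genes of a new genome would still need annotating once the core annotations have been assigned.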

ROARY

● Rapid Large-Scale Prokaryote Genome Analysis

● Inputs: A set of annotated genomes (compatible with PROKKA)

● Produces the Pan Genome from a set of annotated genomes (compatible with ROARY)

● Accurate

● Scalable

ROARY - Accuracy

Page AJ, Cummins CA, Hunt M, et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015;31(22):3691-3693. doi:10.1093/bioinformatics/btv421.

Roary - Scalability

Page AJ, Cummins CA, Hunt M, et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015;31(22):3691-3693. doi:10.1093/bioinformatics/btv421.

SCOARY

● Comparative Genomics Tools

● Takes Pan Genome as input from ROARY

● Requires associated phenotype data

● Pan-GWAS tool

● Used to identify gene features that are associated with phenotypic traits such as AMR

● Comparative Genomics team may find this tool useful

● Provides one approach to determine which genomic features are associated with resistant, susceptible and heteroresistant phenotypes of the isolates provided by the Weiss lab
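To make the idea of a pan-GWAS association concrete, the sketch below applies a two-sided Fisher's exact test to one gene's presence/absence counts in two phenotype groups. This is not SCOARY's implementation, only an illustration of the kind of per-gene 2x2 test such an analysis relies on; the counts are invented.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Rows could be phenotype groups (e.g. resistant, susceptible) and columns
    gene present / gene absent. Integer weights keep the comparison exact.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def weight(x):
        # Number of tables with x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    as_or_more_extreme = sum(weight(x) for x in range(low, high + 1)
                             if weight(x) <= observed)
    return as_or_more_extreme / total


# Invented counts: gene present in 8 of 10 resistant and 1 of 6 susceptible isolates.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(f"p = {p_value:.4f}")
```

In a real analysis the test would be repeated for every gene in the pan genome, so the p-values would need multiple-testing correction.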

Possible innovation by clustering nucleotide sequences

● The basic idea is to reduce sequence redundancy and speed up the analysis (functional annotation, in our case).

● Ideal flowchart:

1. Prodigal results (predicted genes) from group 2

2. Nucleotide sequence clustering (e.g. using CD-HIT)

3. Annotate the resulting non-redundant set of representative genes

4. Trace back from the representative genes and apply the annotation results to all the genomes.
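Step 4 of the flowchart (tracing back from representatives to all cluster members) might be sketched as below. It assumes the usual CD-HIT .clstr layout, in which each cluster starts with a ">Cluster N" line, member lines contain ">name..." and the representative line ends with "*". The sequence names and annotation labels are made up.

```python
import re

# Member lines look like: "0\t1500nt, >geneA_iso1... *"
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(text):
    """Parse CD-HIT .clstr text into a list of clusters."""
    clusters = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = {"representative": None, "members": []}
            clusters.append(current)
            continue
        match = MEMBER_RE.search(line)
        if match is None or current is None:
            continue
        name = match.group(1)
        current["members"].append(name)
        if line.endswith("*"):  # CD-HIT marks the representative with '*'
            current["representative"] = name
    return clusters


def propagate_annotations(clusters, representative_annotations):
    """Copy each representative's annotation to every member of its cluster."""
    annotations = {}
    for cluster in clusters:
        label = representative_annotations.get(cluster["representative"])
        if label is None:
            continue
        for member in cluster["members"]:
            annotations[member] = label
    return annotations


# Made-up cluster file content and representative annotations.
clstr_text = (
    ">Cluster 0\n"
    "0\t1500nt, >geneA_iso1... *\n"
    "1\t1498nt, >geneA_iso2... at +/99.20%\n"
    ">Cluster 1\n"
    "0\t900nt, >geneB_iso1... *\n"
)
clusters = parse_clstr(clstr_text)
annotations = propagate_annotations(
    clusters, {"geneA_iso1": "beta-lactamase", "geneB_iso1": "efflux pump"})
print(annotations)
```

Note that CD-HIT truncates long sequence names in the .clstr file by default, so names here are assumed short enough to survive intact.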

Possible innovation by clustering nucleotide sequences

● Choosing between CD-HIT [4] and UCLUST [5] (the two most-cited tools)

○ The overall speed of CD-HIT was better than that of UCLUST (as reported in the paper)

○ We prefer CD-HIT at this moment since it performs better than UCLUST as the size of the datasets increases

○ CD-HIT has a GitHub page while UCLUST does not (the 64-bit version of UCLUST is not free)

Fu, L., Niu, B., Zhu, Z., Wu, S., & Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), 3150-3152.

REFERENCES

● Huerta-Cepas, Jaime, et al. "Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper." Molecular biology and evolution 34.8 (2017): 2115-2122.

● Page AJ, Cummins CA, Hunt M, et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015;31(22):3691-3693. doi:10.1093/bioinformatics/btv421.

● Jones P, Binns D, Chang HY, et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014;30. doi:10.1093/bioinformatics/btu031.

● Fu, L., Niu, B., Zhu, Z., Wu, S., & Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), 3150-3152.

● Arango-Argoty, G., Garner, E., Pruden, A., Heath, L. S., Vikesland, P., & Zhang, L. (2018). DeepARG: A deep learning approach for predicting antibiotic resistance genes from metagenomic data. Microbiome, 6(1), 23.
