Julie Shay CCBC poster, May 11, 2016


Upload: iridacommunity

Post on 23-Jan-2017


The utility of draft bacterial genomes for gene function analysis and genomic island prediction

Julie A. Shay, Claire Bertelli, Bhavjinder K. Dhillon, and Fiona S.L. Brinkman

Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada

Canada’s Federal Genomics Research and Development Initiative

Acknowledgments

This work was made possible by funding from Genome Canada, Genome British Columbia, and the GRDI. Funding for project personnel was also provided by Cystic Fibrosis Canada, the Swiss National Science Foundation, the CIHR/MSFHR Bioinformatics Training Program, and the Michael Smith Foundation for Health Research.

References

Comparing draft vs. complete genomes: two examples

The problem: growing gap between draft and complete genomes

Genomic Island (GI) analysis

•Draft and complete genomes were run on IslandViewer (ref. 5), a web-based GI prediction tool that incorporates two methods: SIGI-HMM (ref. 10) and IslandPath-DIMOB (ref. 11)

[Diagram: contigs (GenBank format) and a user-selected reference genome; contig alignment to the reference genome with Mauve (ref. 3); concatenation of contigs based on the alignment; normal IslandViewer analysis pipeline.]

[Diagram: isolate; draft genome and complete genome; compare.]
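The contig concatenation step can be illustrated with a minimal Python sketch. It assumes the alignment has already produced, for each contig, a reference start coordinate and a strand; the `PlacedContig` type, its field names, and the optional `spacer` are hypothetical and are not part of IslandViewer or Mauve.

```python
from typing import NamedTuple


class PlacedContig(NamedTuple):
    """A contig plus its placement on the reference (hypothetical structure)."""
    name: str
    sequence: str
    ref_start: int   # leftmost reference coordinate the contig aligns to
    reverse: bool    # True if the contig aligns to the reverse strand


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def concatenate_contigs(contigs: list[PlacedContig], spacer: str = "") -> str:
    """Order contigs by reference position, orient them, and join them.

    `spacer` is an assumption of this sketch: an optional run of N's that
    marks contig boundaries in the pseudo-chromosome.
    """
    ordered = sorted(contigs, key=lambda c: c.ref_start)
    oriented = [
        reverse_complement(c.sequence) if c.reverse else c.sequence
        for c in ordered
    ]
    return spacer.join(oriented)


if __name__ == "__main__":
    example = [
        PlacedContig("contig_2", "GGGTTT", ref_start=500, reverse=False),
        PlacedContig("contig_1", "AACC", ref_start=10, reverse=True),
    ]
    print(concatenate_contigs(example))
```

Recording where each contig ends in the concatenated sequence is useful later, because those positions are the contig breaks at which GI predictions may be lost.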

Gene function category analysis

•Open reading frames (ORFs) were assigned to clusters of orthologous groups (COGs) (ref. 12) using RPS-BLAST (ref. 13)
•COG superfamily distributions were compared between complete genomes and missing regions of drafts
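One way to compare a COG category between the complete genome and the regions missing from a draft is sketched below. The poster does not state which statistical test was used, so the one-sided hypergeometric test here is an assumption, as are the function names. The sketch also assumes that the missing-region ORFs are a subset of the complete-genome ORFs.

```python
from collections import Counter
from math import comb


def category_proportions(cog_letters):
    """Proportion of ORFs in each COG category, given one letter per ORF."""
    counts = Counter(cog_letters)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def enrichment_p_value(k_missing_in_cat, n_missing, k_genome_in_cat, n_genome):
    """One-sided hypergeometric P(X >= k_missing_in_cat).

    n_genome         total ORFs in the complete genome
    k_genome_in_cat  ORFs in the category in the complete genome
    n_missing        ORFs located in regions missing from the draft
    k_missing_in_cat missing ORFs that belong to the category
    """
    upper = min(n_missing, k_genome_in_cat)
    numerator = sum(
        comb(k_genome_in_cat, i) * comb(n_genome - k_genome_in_cat, n_missing - i)
        for i in range(k_missing_in_cat, upper + 1)
    )
    return numerator / comb(n_genome, n_missing)


if __name__ == "__main__":
    # Toy numbers, for illustration only (not data from the poster).
    print(category_proportions("LLJK"))
    print(enrichment_p_value(2, 2, 3, 10))
```

A small p-value for a category would indicate that it is overrepresented among the missing regions, that is, underrepresented in the draft genome.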

Genes of interest

•Main data set: 36 Listeria monocytogenes isolates (ref. 1), with draft Illumina genomes and the subsequently completed genomes of the identical isolates
•Other data set: draft genomes from the Pseudomonas aeruginosa reference panel (ref. 2) and similar completed genomes (an identical reference was available for 2 strains)
•Draft genomes were aligned to the completed reference with Mauve Contig Mover (ref. 3)
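Once the draft contigs are aligned to the completed reference, the regions missing from the draft are the reference intervals that no contig covers. The sketch below assumes the alignment has been reduced to a list of 1-based, inclusive `(start, end)` intervals on the reference. That input format and the function names are hypothetical and do not reflect Mauve's actual output.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]


def missing_regions(aligned, genome_length):
    """Return reference intervals not covered by any aligned contig."""
    gaps = []
    position = 1  # next reference position not yet known to be covered
    for start, end in merge_intervals(aligned):
        if start > position:
            gaps.append((position, start - 1))
        position = max(position, end + 1)
    if position <= genome_length:
        gaps.append((position, genome_length))
    return gaps


if __name__ == "__main__":
    aligned = [(1, 100), (90, 200), (301, 400)]
    print(missing_regions(aligned, genome_length=500))
```

Genes whose coordinates fall inside the returned intervals can then be treated as missing from the draft.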

SIGI-HMM (ref. 10)
•codon usage bias
•HMM-based method

IslandPath-DIMOB (ref. 11)
•dinucleotide bias
•presence of a mobility gene
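To make the idea of dinucleotide bias concrete, the sketch below compares the dinucleotide frequencies of a sequence window with those of the whole genome. It is a simplified illustration only: the averaging of absolute differences is an assumption of this sketch and is not the statistic implemented in IslandPath-DIMOB.

```python
from collections import Counter
from itertools import product

DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_frequencies(seq):
    """Relative frequency of each of the 16 dinucleotides in a sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    # Dinucleotides containing ambiguous bases (e.g. N) are ignored.
    total = sum(counts[d] for d in DINUCLEOTIDES)
    if total == 0:
        return {d: 0.0 for d in DINUCLEOTIDES}
    return {d: counts[d] / total for d in DINUCLEOTIDES}


def dinucleotide_bias(window, genome):
    """Mean absolute difference in dinucleotide frequency, window vs. genome."""
    w = dinucleotide_frequencies(window)
    g = dinucleotide_frequencies(genome)
    return sum(abs(w[d] - g[d]) for d in DINUCLEOTIDES) / len(DINUCLEOTIDES)


if __name__ == "__main__":
    print(dinucleotide_bias("AAAA", "CCCC"))
```

A window whose composition differs strongly from the genome average is a candidate for horizontally acquired DNA. IslandPath-DIMOB additionally requires a mobility gene, as listed above.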

•The “Replication, recombination, and repair” superfamily was significantly underrepresented in draft genomes of both L. monocytogenes and P. aeruginosa
•In particular, transposons tend to be missing from draft genomes

Pipeline

Many GIs are present at contig breaks, and these GIs are more likely to be missed by analysis of draft genomes

[Figure: Number of GI predictions in Listeria genomes (y-axis, 0 to 180) by distance in base pairs from contig edge (x-axis bins: 0, 1 to 9, 10 to 99, 100 to 999, 1000 to 9999, 10000 to 99999, 100000 to 999999). Series: predictions missed in draft genome analysis; predictions correctly identified in draft genome analysis.]
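The binning used in the figure above can be reproduced with a short sketch. The poster does not say exactly how the distance from a GI to a contig edge was measured. Here it is assumed to be 0 when a contig break falls inside the GI, and otherwise the smallest gap between a GI boundary and a break. The function names are hypothetical.

```python
import bisect


def distance_to_nearest_break(gi_start, gi_end, breaks):
    """Distance in bp from a GI to the closest contig break.

    Returns 0 if any break lies within [gi_start, gi_end].
    """
    if not breaks:
        raise ValueError("at least one contig break is required")
    sorted_breaks = sorted(breaks)
    i = bisect.bisect_left(sorted_breaks, gi_start)
    if i < len(sorted_breaks) and sorted_breaks[i] <= gi_end:
        return 0
    candidates = []
    if i > 0:
        candidates.append(gi_start - sorted_breaks[i - 1])  # break upstream
    if i < len(sorted_breaks):
        candidates.append(sorted_breaks[i] - gi_end)  # break downstream
    return min(candidates)


def distance_bin(distance):
    """Map a non-negative distance to a power-of-ten bin label."""
    if distance == 0:
        return "0"
    lower = 10 ** (len(str(distance)) - 1)
    return f"{lower} to {lower * 10 - 1}"


if __name__ == "__main__":
    breaks = [1000, 5000]
    d = distance_to_nearest_break(1200, 1500, breaks)
    print(d, distance_bin(d))
```

Counting missed and correctly identified predictions per bin would then give the two series plotted in the figure.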

[Figure: Thousands of genomes in database (y-axis, 0 to 250) by year (x-axis, 2008 to 2015). Series: NCBI SRA bacterial genomes; NCBI complete bacterial genomes.]

[A] RNA processing and modification
[B] Chromatin structure and dynamics
[C] Energy production and conversion
[D] Cell cycle control, cell division, chromosome partitioning
[E] Amino acid transport and metabolism
[F] Nucleotide transport and metabolism
[G] Carbohydrate transport and metabolism
[H] Coenzyme transport and metabolism
[I] Lipid transport and metabolism
[J] Translation, ribosomal structure and biogenesis
[K] Transcription
[L] Replication, recombination and repair
[M] Cell wall / membrane / envelope biogenesis
[N] Cell motility
[O] Posttranslational modification, protein turnover, chaperones
[P] Inorganic ion transport and metabolism
[Q] Secondary metabolites biosynthesis, transport and catabolism
[R] General function prediction only
[S] Function unknown
[T] Signal transduction mechanisms
[U] Intracellular trafficking, secretion, and vesicular transport
[V] Defense mechanisms
[W] Extracellular structures
[Z] Cytoskeleton

AMR genes
Methods: Identified using Resistance Gene Identifier (ref. 4) with the Comprehensive Antibiotic Resistance Database
Results: Not significantly underrepresented in Listeria or Pseudomonas draft genomes

Virulence factors
Methods: Predicted using a conservative reciprocal-best-BLAST-hit approach from VFDB, PATRIC, and Victor’s virulence factors (refs. 5, 6, 7)
Results: Not significantly underrepresented in Listeria or Pseudomonas draft genomes

tRNA genes
Methods: Predicted using tRNAscan-SE (ref. 8) and ARAGORN (ref. 9)
Results: Significantly underrepresented in Listeria and Pseudomonas draft genomes
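The reciprocal-best-hit idea used for virulence factor prediction can be sketched as follows. The sketch assumes BLAST results have already been parsed into `(query, subject, bitscore)` tuples. The input format, the function names, and the use of bit score to pick the best hit are assumptions, and the thresholds that made the poster's approach "conservative" are not given in the source and so are not modelled.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bitscore) tuples.
    On ties, the first hit encountered is kept.
    """
    best = {}
    for query, subject, bitscore in hits:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_hits, reverse_hits):
    """Pairs (gene, reference) that are each other's best hit.

    forward_hits: genome genes searched against the reference set.
    reverse_hits: reference set searched against the genome genes.
    """
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return sorted(
        (gene, ref) for gene, ref in forward.items() if reverse.get(ref) == gene
    )


if __name__ == "__main__":
    # Toy identifiers, for illustration only.
    forward = [("g1", "vfA", 200.0), ("g1", "vfB", 150.0), ("g2", "vfB", 90.0)]
    reverse = [("vfA", "g1", 200.0), ("vfB", "g1", 150.0)]
    print(reciprocal_best_hits(forward, reverse))
```

Requiring the hit to be reciprocal discards cases such as `g2` in the toy data, whose best match prefers a different gene.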

[Figure: Percent missing from draft Listeria genomes (y-axis) by gene type (x-axis): all protein-coding genes, AMR genes, virulence factors, tRNA genes.]

[Figure: Proportion of total ORFs (y-axis, 0 to 0.6) by COG superfamily (x-axis: A B C D E F G H I J K L M N O P Q R S T U V W Y Z). Series: completed genome; regions missing from draft genome.]

Conclusion

Note: This image only shows genomes submitted to NCBI, so it underestimates the extent of the gap between draft and complete genomes

1) Gilmour MW, et al. 2010 BMC Genomics 11:120.
2) De Soyza A, et al. 2013 MicrobiologyOpen 2(6):1010-23.
3) Darling AE, et al. 2010 PLoS One 5(6):e11147.
4) McArthur AG, et al. 2013 Antimicrob Agents Chemo 57(7):3348-57.
5) Dhillon BK, et al. 2015 NAR gkv401.
6) Chen L, et al. 2011 NAR gkr989.
7) Wattam AR, et al. 2014 NAR 42(D1):D581-9.
8) Lowe TM & Eddy SR 1997 NAR 25(5):955-64.
9) Laslett D & Canback B 2004 NAR 32(1):11-6.
10) Waack S, et al. 2006 BMC Bioinformatics 7:142.
11) Hsiao W, et al. 2003 Bioinformatics 19(3):418-20.
12) Tatusov RL, et al. 2003 BMC Bioinformatics 4:1.
13) Altschul SF, et al. 1997 NAR 25(17):3389-402.

•Draft genomes have limitations: certain gene types, particularly those associated with mobile elements, are disproportionately missing
•Draft genome analysis is still valuable for VFs/AMR for the species examined, but more species should be studied