robertson immemxi final march 2016

Utility of the Salmonella In Silico Typing Resource (SISTR) to outbreak investigations

James Robertson1, Catherine Yoshida1, Peter Kruczkiewicz2, Eduardo N. Taboada2 and John H. E. Nash3

1 National Microbiology Laboratory @ Guelph, Public Health Agency of Canada
2 National Microbiology Laboratory @ Lethbridge, Public Health Agency of Canada
3 National Microbiology Laboratory @ Toronto, Public Health Agency of Canada

Salmonella is a leading public health concern

Salmonella is a leading food-borne pathogen both in Canada and around the world.

Globally, there are an estimated 94 million Salmonella infections every year.

Human costs:
• acute illness
• loss of life (155,000 deaths)

Societal costs:
• health care costs
• lost productivity
• legal costs
• impact on the food industry

Potential Sources

Challenges in Salmonella typing and epidemiology

• A small number of highly prevalent, globally distributed serovars account for most outbreaks (e.g. Enteritidis, Typhimurium)
– Epidemiologically unrelated isolates within the same serovar are difficult to investigate
– Additional subtyping resolution within a serovar is needed (e.g. phage typing)
• Increasing use of genotypic methods (i.e. molecular typing)
– Driven by the need for methods with higher discriminatory power
– A number of different approaches have been applied to molecular typing of Salmonella

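[Figure: typing methods ordered by discriminatory power, illustrated with two sequences that differ at a single base: GATCGATCGATCG vs. GATCAATCGATCG]

Method                 Serotyping   MLST      cgMLST     wgSNPs
Discriminatory power   Low          Low-Mid   Mid-High   High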

• Based on reaction of antibodies to surface antigens

• Broad usage and common nomenclature in use since the 1930s

• Multi-Locus Sequence Typing: developed by Maiden et al. (1998)

• Indexes genetic variation in 7 core (i.e. “housekeeping”) genes

• cgMLST extends this principle to hundreds or thousands of loci

• Provides a portable naming scheme which correlates with historical serotypes

• Utilizes individual SNPs and gives very high resolution

• Results are not portable to other public health professionals
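
The difference between allele-based typing (MLST, cgMLST) and SNP-based typing can be shown with the two 13-base sequences from the slide. The following is a minimal sketch, assuming the two sequences represent one aligned locus; the function names are hypothetical and are not part of SISTR.

```python
def snp_positions(seq_a, seq_b):
    """Return 1-based positions where two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i + 1 for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def allele_differs(seq_a, seq_b):
    """Allele-based view: the locus is either the same allele or a different one."""
    return seq_a != seq_b


# The two sequences shown on the slide.
ref = "GATCGATCGATCG"
alt = "GATCAATCGATCG"

snps = snp_positions(ref, alt)
print("SNP positions:", snps)
print("Different allele:", allele_differs(ref, alt))
```

An allele-based scheme records only that the locus carries a different allele, which is what makes its nomenclature portable; a SNP-based analysis also records where the sequences differ, which gives higher resolution.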

• Initial dataset of 4330 genomes
• 94.6% concordance between predicted and reported serovar
• In silico serovar predictions based on O and H antigens
• cgMLST refinement of serovar assignment and analysis
• Uses minimally processed genome assemblies
• Very fast: ~30 seconds to process a genome
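
Serovar prediction from O and H antigens can be pictured as a lookup of an antigenic formula. The sketch below is illustrative only: the table is a tiny hand-picked subset of simplified formulae (O antigen : H1 : H2), and `ANTIGEN_TABLE` and `predict_serovar` are hypothetical names, not SISTR's database or API.

```python
# Simplified antigenic formulae (O antigen, H1, H2) for three serovars.
# Assumption: a small demonstration table, not the table SISTR uses.
ANTIGEN_TABLE = {
    ("9", "g,m", "-"): "Enteritidis",
    ("4", "i", "1,2"): "Typhimurium",
    ("4", "r", "1,2"): "Heidelberg",
}


def predict_serovar(o_antigen, h1, h2):
    """Look up a serovar by antigen combination; return 'unresolved' if unknown."""
    return ANTIGEN_TABLE.get((o_antigen, h1, h2), "unresolved")


print(predict_serovar("9", "g,m", "-"))
print(predict_serovar("4", "i", "1,2"))
print(predict_serovar("4", "z", "1,5"))
```

Combinations that the table cannot resolve are where a secondary signal, such as the cgMLST refinement mentioned above, would come in.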

What does SISTR do?

In silico analysis of WGS data
• assembly statistics
• serovar prediction
• in silico typing (MLST, cgMLST)
• AMR prediction

Comparative genomic analyses
• cgMLST
• accessory gene content
• core SNPs

Epidemiologic analysis
• geospatial distribution
• temporal distribution
• source association

https://lfz.corefacility.ca/sistr-app/

SISTR cgMLST

• Current cgMLST scheme in SISTR based on 330 core genes with high “assignability” (i.e. very low levels of “missing” data)

• Will include international Salmonella cgMLST scheme (i.e. once it is developed!)

• cgMLST information is used to:
– Assess quality of WGS data (complete, partial, missing loci)
– Supplement genoserotyping predictions
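
One way to picture the complete / partial / missing quality check is to classify each cgMLST locus by how much of it was recovered from the assembly. This is a minimal sketch under stated assumptions: the input is a hypothetical mapping from locus name to the fraction of the locus covered by its best match, and the thresholds are illustrative rather than those SISTR applies.

```python
from collections import Counter


def classify_locus(coverage):
    """Classify one locus by the fraction (0.0 to 1.0) of its length recovered."""
    if coverage >= 1.0:
        return "complete"
    if coverage > 0.0:
        return "partial"
    return "missing"


def summarize_loci(coverages):
    """Count complete, partial, and missing loci for one genome."""
    return Counter(classify_locus(c) for c in coverages.values())


# Hypothetical coverage values for four loci of one genome.
example = {"locus_1": 1.0, "locus_2": 0.62, "locus_3": 0.0, "locus_4": 1.0}
summary = summarize_loci(example)
print(dict(summary))
```

A genome with many partial or missing loci would be flagged as a lower-quality assembly before its typing results are interpreted.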

Testing the accuracy of SISTR

• ~45,000 Salmonella genomes were downloaded from the SRA

• Raw reads were assembled using FLASH and SPAdes
• Assemblies were loaded into SISTR, and the predicted serovar was compared with the reported serovar (where available)
• Assemblies were checked for contamination using Kraken
• Quality was assessed using QUAST

Recovery rates of 330 cgMLST genes from assembled SRA genomes

[Pie chart, N = 45,079 genomes. Categories: genomes with complete 330 genes; genomes with >300 genes; genomes with <300 genes. Data labels: 41,781; 1,393; 1,905]

SISTR Accuracy

[Pie chart, N = 32,321: Concordant 29,884; Discordant 2,347]

• 93.7% overall concordance with the serovar specified
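
A concordance figure of this kind is the share of genomes whose predicted serovar matches the reported one, counted only over genomes that have a reported serovar. The sketch below uses made-up toy records, not the study data; the record layout and the case-insensitive comparison are assumptions.

```python
def concordance(records):
    """Fraction of (reported, predicted) pairs that agree.

    Pairs with no reported serovar are skipped, mirroring the
    'where available' comparison described above.
    """
    compared = [(rep, pred) for rep, pred in records if rep]
    if not compared:
        raise ValueError("no genomes with a reported serovar")
    matches = sum(
        1 for rep, pred in compared
        if rep.strip().lower() == pred.strip().lower()
    )
    return matches / len(compared)


# Toy records: (reported serovar, predicted serovar).
toy = [
    ("Enteritidis", "Enteritidis"),
    ("Typhimurium", "Typhimurium"),
    ("Derby", "Typhimurium"),
    (None, "Enteritidis"),           # no reported serovar: excluded
    ("enteritidis ", "Enteritidis"),  # same serovar, different formatting
]
rate = concordance(toy)
print(f"{rate:.1%}")
```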

• Two outbreaks of Salmonella Enteritidis were retrospectively sequenced
• Examined the feasibility of WGS for outbreak investigations
• Compared results of traditional molecular and microbial tests to WGS

[Figure panels: SISTR (cgMLST) tree; PARSNP (core SNP) tree; SNP tree (Wuyts et al. 2015)]

• All three methods produce concordant trees.
• cgMLST has a tendency to over-group.

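Outbreak Clustering Categories

[Diagram: example clusterings of outbreaks A, B and C in four categories: Correct (A, B and C each in their own cluster); Incorrectly split (A divided across two clusters); Over-grouped (A and C merged as A+C); Incorrectly split and grouped (A divided across merged clusters A+B and A+C)]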
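
The four categories can be expressed as a rule that compares each known outbreak with the clusters a typing method produces. This is a sketch of one plausible reading of the categories, not the scoring procedure used in the study; the input layout, the function name, and outbreak D are assumptions added for illustration.

```python
from collections import defaultdict


def categorize_outbreaks(isolates):
    """Categorize each outbreak from (isolate_id, outbreak, cluster_id) records."""
    clusters_of = defaultdict(set)   # outbreak -> clusters holding its isolates
    outbreaks_in = defaultdict(set)  # cluster -> outbreaks found in it
    for _isolate_id, outbreak, cluster in isolates:
        clusters_of[outbreak].add(cluster)
        outbreaks_in[cluster].add(outbreak)

    categories = {}
    for outbreak, clusters in clusters_of.items():
        split = len(clusters) > 1
        grouped = any(len(outbreaks_in[c]) > 1 for c in clusters)
        if split and grouped:
            categories[outbreak] = "incorrectly split and grouped"
        elif split:
            categories[outbreak] = "incorrectly split"
        elif grouped:
            categories[outbreak] = "over-grouped"
        else:
            categories[outbreak] = "correct"
    return categories


# Toy data: A and B share cluster 1, C is divided, D is clustered correctly.
toy_isolates = [
    ("a1", "A", 1), ("a2", "A", 1),
    ("b1", "B", 1), ("b2", "B", 1),
    ("c1", "C", 2), ("c2", "C", 3),
    ("d1", "D", 4), ("d2", "D", 4),
]
result = categorize_outbreaks(toy_isolates)
print(result)
```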

Concordance between cgMLST and SNP trees

Study   Correct   Over-grouped   Split   Combination   Serovar(s)
1       1         1              0       0             Enteritidis
2       2         3              0       0             Enteritidis
3       5         1              0       0             Enteritidis, Typhimurium, Derby
4       2         7              0       0             Enteritidis
5       2         0              0       0             Enteritidis
6       5         2              0       0             Enteritidis
Total   18        13             0       0

Conclusions

• SISTR is a robust and accurate platform for Salmonella in silico typing, with 93.7% concordance between specified and predicted serovar
• The prototype 330-gene cgMLST scheme is readily retrievable from HTS assemblies of varying quality levels
• The current scheme provides coarse-grained separation of Salmonella genetic lineages that will be useful in outbreak analysis

Acknowledgements

Team: Ed Taboada, Peter Kruczkiewicz, Catherine Yoshida, John Nash

Research partners:
Public Health Agency of Canada:
  OIE Laboratory for Salmonellosis – National Microbiology Lab (NML) @ Guelph
  Genomics Core and Bioinformatics Core – NML @ Winnipeg
  Public Health Genomics team – NML @ Winnipeg
IRIDA project team
Animal Health Veterinary Laboratory Agency – UK
Austrian Institute of Technology – Austria

Funding:
Genomics Research and Development Initiative
Genome Canada (IRIDA project)
Public Health Agency of Canada
