2015 12-09 nmdd
![Page 1: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/1.jpg)
WGS data for bacterial typing
Karin Lagesen
@karinlag
NMDD presentation
2015-12-09
![Page 2: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/2.jpg)
Bacterial genomes
Four letters: A, C, T, G
Two strands complementary:
A : T, C : G
Genes: DNA that encodes proteins
Often regarded as the “functional” regions of the genome
Bacteria: genes make up approx. 90% of the genome
ATCCGGAG GAGGACGG
Mutations: single-letter (character) changes
TGAGGGACCAAACCGAT
TGAGGGACGAAACCGAT
Bacterial genomes are most often circular
Campylobacter jejuni genome: 1.68 million base pairs
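A minimal sketch of the two ideas on this slide, strand complementarity and single-letter mutations. The function names are illustrative assumptions; the two 17-letter sequences are the ones shown on the slide.

```python
# Base pairing rules from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def point_differences(seq_a, seq_b):
    """List (position, base_a, base_b) for each mismatch between two
    sequences of equal length (positions are 0-based)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


original = "TGAGGGACCAAACCGAT"
mutated = "TGAGGGACGAAACCGAT"

print(reverse_complement("ATCCGGAG"))        # CTCCGGAT
print(point_differences(original, mutated))  # [(8, 'C', 'G')]
```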
![Page 3: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/3.jpg)
Bacterial typing
Typing: identifying a bacterial isolate at the strain level
Goal: discriminate between different bacterial isolates
● Effectively: a distance measure is often sought
Traditionally done by distinguishing isolates based on phenotypic characteristics
Molecular strain typing has taken over
Goal: figure out how different sequences are
![Page 4: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/4.jpg)
Advances in bacterial genomics
| Phyla | Number of genomes | % of total |
|---|---|---|
| Actinobacteria | 4,059 | 13 |
| Bacteroidetes/Chlorobi group | 932 | 3 |
| Cyanobacteria | 340 | 1 |
| Firmicutes | 9,628 | 31 |
| Proteobacteria | 14,268 | 46 |
| Spirochaetes | 525 | 2 |
| Other | 1,500 | 5 |
Number of sequenced genomes for 6 selected phyla and the percent of all genomes found
in the phyla
Source: GenBank prokaryotes.txt file downloaded 4 February 2015
Land et al., Functional & Integrative Genomics, 2015
![Page 5: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/5.jpg)
Development of sequencing technologies
[Figure: timeline of sequencing technologies; the only surviving label is 2002]
![Page 6: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/6.jpg)
Genome assembly
http://knowgenetics.org/whole-genome-sequencing/
Sequencing machine
Reads
![Page 7: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/7.jpg)
Molecular bacterial typing
[Figure: typing methods placed on two axes]
Vertical axis: how differences are counted (Categorical, Ordinal, Continuous)
Horizontal axis: amount of sequence used (One region, Some regions, Many regions, All)
Methods shown: Single gene, MLST, MLVA, MLSA
![Page 8: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/8.jpg)
MLVA – Multi-locus VNTR analysis
Find loci with known repeats
Discover copy number of repeat; this becomes the identifier for the locus
Strain identified by copy numbers for a defined set of loci
Similarity is # of identical loci numbers
http://www.applied-maths.com/applications/mlva
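A small sketch of the MLVA similarity described above: each isolate is a set of copy numbers, one per locus, and similarity is the number of loci with identical copy numbers. The locus names and copy numbers are made up for illustration.

```python
def mlva_similarity(profile_a, profile_b):
    """Count the loci at which two isolates have the same repeat copy number.

    Only loci present in both profiles are compared.
    """
    common_loci = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in common_loci if profile_a[locus] == profile_b[locus])


# Hypothetical VNTR loci and copy numbers, not taken from any real scheme.
isolate_1 = {"VNTR1": 4, "VNTR2": 7, "VNTR3": 2, "VNTR4": 10}
isolate_2 = {"VNTR1": 4, "VNTR2": 6, "VNTR3": 2, "VNTR4": 10}

print(mlva_similarity(isolate_1, isolate_2))  # 3
```

Note that a copy number of 6 versus 7 counts the same as 6 versus 20: the loci are treated as categories, not as quantities.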
![Page 9: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/9.jpg)
Multi Locus Sequence Typing
Set of genes
Each gene variant (allele) is assigned a categorical number
Cluster types on # shared variants
The combination of numbers becomes the Sequence type (ST)
Similarity is # of identical loci numbers
MLST: 7 genes
rMLST: ribosomal genes
http://www.applied-maths.com/applications/mlst
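The numbering step can be sketched as follows. This is a toy version under stated assumptions: real schemes look allele numbers and STs up in a curated database, whereas here numbers are handed out in order of first appearance, and only three invented genes are used instead of seven.

```python
def assign_allele_numbers(sequences_by_isolate):
    """Give each distinct sequence of one gene an arbitrary categorical number."""
    allele_numbers = {}  # sequence -> number, in order of first appearance
    calls = {}           # isolate -> allele number
    for isolate, seq in sequences_by_isolate.items():
        if seq not in allele_numbers:
            allele_numbers[seq] = len(allele_numbers) + 1
        calls[isolate] = allele_numbers[seq]
    return calls


def allelic_profiles(genes):
    """Combine per-gene allele numbers into one profile (tuple) per isolate."""
    per_gene = {gene: assign_allele_numbers(seqs) for gene, seqs in genes.items()}
    gene_order = sorted(genes)
    isolates = list(genes[gene_order[0]])
    return {iso: tuple(per_gene[g][iso] for g in gene_order) for iso in isolates}


def shared_loci(profile_a, profile_b):
    """Similarity: number of loci with identical allele numbers."""
    return sum(a == b for a, b in zip(profile_a, profile_b))


# Hypothetical gene fragments for three isolates.
genes = {
    "geneA": {"iso1": "ACGT", "iso2": "ACGT", "iso3": "ACGA"},
    "geneB": {"iso1": "TTGA", "iso2": "TTGC", "iso3": "TTGA"},
    "geneC": {"iso1": "GGCA", "iso2": "GGCA", "iso3": "GGCA"},
}

profiles = allelic_profiles(genes)
print(profiles)  # {'iso1': (1, 1, 1), 'iso2': (1, 2, 1), 'iso3': (2, 1, 1)}
print(shared_loci(profiles["iso1"], profiles["iso2"]))  # 2
```

A single changed letter in one gene is enough to give a new allele number, and therefore a different profile.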
![Page 10: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/10.jpg)
Clustering categorical data
Feil, Nature Rev. Microbiol. 2004
![Page 11: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/11.jpg)
Phylogeny – tracing ancestry
Many algorithms
● Distance matrix methods (sequence similarity)
● Maximum parsimony methods
● Maximum likelihood methods
Based on similarity between sequences
Can become very computationally intensive, especially for longer sequences (e.g. WGS)
Examples:
● 16S rRNA phylogenetic trees
● Multi Locus Sequence Analyses – phylogenies of concatenated MLST genes
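As an illustration of the first family above, a distance matrix can be built from the proportion of differing positions (p-distance) between already aligned sequences. This is only the input to a distance method, not a tree-building algorithm, and the three short sequences are invented.

```python
def p_distance(seq_a, seq_b):
    """Proportion of positions that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def distance_matrix(aligned):
    """Return (names, matrix) where matrix[i][j] is the p-distance
    between the i-th and j-th sequence, in sorted name order."""
    names = sorted(aligned)
    matrix = [[p_distance(aligned[r], aligned[c]) for c in names] for r in names]
    return names, matrix


# Hypothetical aligned sequences.
aligned = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGTACGA"}

names, matrix = distance_matrix(aligned)
for name, row in zip(names, matrix):
    print(name, row)
# A [0.0, 0.125, 0.25]
# B [0.125, 0.0, 0.125]
# C [0.25, 0.125, 0.0]
```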
![Page 12: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/12.jpg)
Campylobacter 16S tree
Friis et al. PLOS One 2013
![Page 13: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/13.jpg)
Molecular bacterial typing
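[Figure: typing methods placed on two axes, now including whole-genome methods]
Vertical axis: how differences are counted (Categorical, Ordinal, Continuous)
Horizontal axis: amount of sequence used (One region, Some regions, Many regions, All)
Methods shown: Single gene, MLST, MLVA, MLSA, wgMLST, Core genome, Core SNPs, Pairwise SNPs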
![Page 14: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/14.jpg)
Ideal whole genome comparisons
Bacterial species definition:
● 70% of the genomes should be able to anneal to each other, i.e. «match»
Converted to whole genome sequences:
● Based on % identity between conserved regions
● Average Nucleotide Identity (ANI) of ~95%
All-against-all sequence alignment is required
● Time complexity: O(n²)
● Not feasible in most cases
Alternatives:
● Focus on core regions of the genome (core genes)
● Find just the variations (SNPs), make trees from those
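To see why all-against-all comparison grows quadratically, here is a quick count of pairs, reading n as the number of genomes (an assumption, since the slide does not define n). Each pair would also need a full genome alignment, which is itself costly.

```python
from itertools import combinations


def pairwise_comparisons(n_genomes):
    """Number of unordered genome pairs: n * (n - 1) / 2."""
    return n_genomes * (n_genomes - 1) // 2


# Cross-check the formula against an explicit enumeration for a small set.
genomes = ["g1", "g2", "g3", "g4"]
pairs = list(combinations(genomes, 2))
print(len(pairs), pairwise_comparisons(len(genomes)))  # 6 6

for n in (10, 131, 1000):
    print(n, pairwise_comparisons(n))
# 10 45
# 131 8515
# 1000 499500
```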
![Page 15: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/15.jpg)
Core genome – # ”shared genes”
Sequences q and s have a matching region
● n: length of match
● k: % of matching characters in matching region
Regarded as ”shared” iff k and n are large enough
Similarity = # ”shared” genes
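A sketch of the "shared gene" rule, assuming the matches between two genomes have already been computed by some alignment tool. The `Match` type, the example values, and the thresholds (90% identity, 500 letters) are all illustrative assumptions; the slide gives no cut-offs.

```python
from typing import NamedTuple


class Match(NamedTuple):
    gene: str
    identity: float  # k: % of matching characters in the matching region
    length: int      # n: length of the match


def count_shared_genes(matches, min_identity, min_length):
    """Count genes whose best-reported match passes both thresholds."""
    shared = {
        m.gene
        for m in matches
        if m.identity >= min_identity and m.length >= min_length
    }
    return len(shared)


# Hypothetical matches of genes from genome q against genome s.
matches = [
    Match("gene1", 98.0, 900),  # passes both thresholds
    Match("gene2", 60.0, 850),  # long match, but identity too low
    Match("gene3", 95.0, 120),  # high identity, but match too short
    Match("gene4", 91.5, 700),  # passes both thresholds
]

print(count_shared_genes(matches, min_identity=90.0, min_length=500))  # 2
```

Because a gene is either above or below the thresholds, small sequence differences inside a shared gene do not change the count.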
![Page 16: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/16.jpg)
Core genome tree, Campylobacter
Friis et al. PLOS One 2013
![Page 17: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/17.jpg)
Core SNP trees
Approach A: External core gene set
● Map each genome’s reads to genes
● Examine reads mapping to the same gene to find sequence variations (variant calling)
● Create genome/SNP matrix
Approach B: Intrinsic core set
● Use suffix graphs to get Maximal Unique Matches (MUMs)
● Extend alignments from MUMs to get shared core set
● Find variants in alignments
● Create genome/SNP matrix
Similarity: genomes that share the same SNPs
Tools: Snippy, snpTree, Parsnp
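Both approaches end in the same last step, the genome/SNP matrix. A minimal sketch of that step, assuming the core regions are already aligned so that every genome has the same length; this is not how Snippy, snpTree or Parsnp are implemented, and the sequences are invented.

```python
def snp_matrix(alignment):
    """Keep only the alignment columns where genomes differ.

    Returns (positions, matrix): the 0-based variable positions and,
    per genome, the bases found at those positions.
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    positions = [
        i for i in range(length)
        if len({alignment[name][i] for name in names}) > 1
    ]
    matrix = {
        name: "".join(alignment[name][i] for i in positions) for name in names
    }
    return positions, matrix


# Hypothetical core-genome alignment of three genomes.
core = {"g1": "ACGTAC", "g2": "ACGTAT", "g3": "TCGTAT"}

positions, matrix = snp_matrix(core)
print(positions)  # [0, 5]
print(matrix)     # {'g1': 'AC', 'g2': 'AT', 'g3': 'TT'}
```

The resulting matrix is far smaller than the genomes themselves, which is what makes tree building on it tractable.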
![Page 18: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/18.jpg)
Campylobacter jejuni, core SNP tree
Maximum likelihood phylogeny derived from the core-genome alignment of 131 C. jejuni
isolates. Isolates with a known hyper-invasive phenotype have their taxa identifier names
highlighted in red. The three clades identified as containing hyper-invasive strains have
branches indicated in red
Baig et al. BMC Genomics 2015 16:852 doi:10.1186/s12864-015-2087-y
![Page 19: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/19.jpg)
k-mer based SNP trees
k-mer: piece of sequence, k nucleotides long
Split genomes/reads into k-mers
Find k-mers in different genomes that vary in their middle character
Create genome/SNP matrix
● Note: this is not core, but pairwise all-against-all
Create trees
Similarity is # shared SNPs
Genome A: TGAGGGACCAAACCGAT
Genome B: TGAGGGACGAAACCGAT
Tool: kSNP
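The middle-character idea can be sketched on the two example genomes above. This is a simplified illustration, not kSNP itself: it ignores reverse complements and skips flanks that occur with more than one middle base within a genome, and k = 5 is an arbitrary choice for such short sequences.

```python
def kmers(seq, k):
    """All overlapping pieces of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def index_by_flanks(seq, k):
    """Map (left flank, right flank) -> set of middle bases seen in seq."""
    half = k // 2
    table = {}
    for kmer in kmers(seq, k):
        flanks = (kmer[:half], kmer[half + 1:])
        table.setdefault(flanks, set()).add(kmer[half])
    return table


def middle_variants(genome_a, genome_b, k):
    """Find k-mers shared by both genomes except for their middle base.

    Returns a sorted list of (left, right, base_a, base_b).
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that there is a middle character")
    index_a = index_by_flanks(genome_a, k)
    index_b = index_by_flanks(genome_b, k)
    variants = []
    for flanks in sorted(index_a.keys() & index_b.keys()):
        bases_a, bases_b = index_a[flanks], index_b[flanks]
        # Only unambiguous flanks: exactly one middle base per genome.
        if len(bases_a) == 1 and len(bases_b) == 1 and bases_a != bases_b:
            left, right = flanks
            variants.append((left, right, next(iter(bases_a)), next(iter(bases_b))))
    return variants


genome_a = "TGAGGGACCAAACCGAT"
genome_b = "TGAGGGACGAAACCGAT"

print(middle_variants(genome_a, genome_b, k=5))  # [('AC', 'AA', 'C', 'G')]
```

No alignment or reference genome is needed, which is why this works pairwise on any set of genomes.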
![Page 20: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/20.jpg)
Acinetobacter whole genome SNP tree
Sahl et al., PLOS One, 2013
![Page 21: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/21.jpg)
Classification of distance measures
Categorical
● Loci defined as either equal/different
● Similarity calculated as # shared loci
Ordinal
● Regions defined as “shared” based on sequence similarity levels
● Similarity calculated as # shared sequences
Continuous
● Find all sequence differences (SNPs)
● Similarity calculated as # shared SNPs
![Page 22: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/22.jpg)
(Some) sources of variation
Small changes
● Nucleotide substitution
● Insertions and deletions
Recombination
● Shuffling regions of the genome
“Jumping genes”: insertion sequences and transposons
● Small sequences that jump
● Can move other sequences with them
![Page 23: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/23.jpg)
Horizontal gene transfer.
![Page 24: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/24.jpg)
Gene tree != genome tree
Rose et al., Biology Direct 2007
![Page 25: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/25.jpg)
So… what do we do?
No real answers (yet)
Could sequence everything, but that is expensive
However: gain so much more with sequencing
● Very high discriminatory power (resolution)
● Access to virulence genes, ++
Be aware of possible fragility in MLST data
● One mutation = changed ST
● Should probably double check STs with MLSA
Compare MLSTs with WGS data, see how consistent the STs are with the whole genome
![Page 26: 2015 12-09 nmdd](https://reader031.vdocuments.us/reader031/viewer/2022030304/5879a6011a28ab082c8b6f2d/html5/thumbnails/26.jpg)
Questions? And thank you!