A Probabilistic Approach to Whole Genome based Phylogeny
Johanne Ahrenfeldt
PhD student - Genomic Epidemiology
DTU Bioinformatics, Technical University of Denmark
Overview
• Whole Genome Based Phylogeny
• A data set with known phylogeny
• Base calling revisited
• A probabilistic approach to distance calculation
Whole genome based phylogeny
• Use of whole genome sequencing (WGS) is increasing as the price is (or used to be) falling
• A very useful tool for outbreak analysis of infectious diseases
• Used in the Haiti cholera outbreak to find the source of the infection
Mapping
[Figure: reads are mapped to a reference genome to give a consensus sequence; base calling is shown for Genome 1 to Genome 6 against the reference genome.]
Base calling
Position    A    T    G    C
1          40    6    4   10
2          50   10    4    6
3           5   50    0    5
4           0    0   60    0
5           5    0    2   53
6           3    5   50    2
7           0    1    0   59
8          55    0    5    0
Each position is evaluated by calculating a Z-score
X is the number of reads having the most common nucleotide at that position, and Y is the number of reads supporting other nucleotides.
To trust a base call we require Z > 1.96 and > 90% of reads supporting the same base.
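The two criteria can be sketched in a few lines of Python. The Z-score formula itself did not survive in the transcript, so the form Z = (X - Y) / sqrt(X + Y) used here is an assumption, and the function name `call_base` and the use of "N" for an untrusted call are purely illustrative.

```python
import math


def call_base(counts, z_min=1.96, frac_min=0.9):
    """Return the most common base, or "N" if the call is not trusted.

    counts maps each base to the number of reads supporting it.
    """
    total = sum(counts.values())
    if total == 0:
        return "N"  # no reads cover this position

    base, x = max(counts.items(), key=lambda kv: kv[1])
    y = total - x  # reads supporting any other base

    # Assumed form of the Z-score; the slide does not show the formula.
    z = (x - y) / math.sqrt(x + y)

    if z > z_min and x / total > frac_min:
        return base
    return "N"


# Counts taken from positions 7 and 3 of the base-calling table.
print(call_base({"A": 0, "T": 1, "G": 0, "C": 59}))
print(call_base({"A": 5, "T": 50, "G": 0, "C": 5}))
```

Under these assumptions position 7 passes both criteria, while position 3 passes the Z-score test but fails the 90% rule, which is why both conditions are checked.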
A data set with known phylogeny
To test various methods it would be useful to have a data set with known phylogeny. The tree structure is then known, and each method's ability to infer the correct phylogeny can be evaluated.
Such a data set was made by in vitro evolution of E. coli.
J. Ahrenfeldt, C. Skaarup, H. Hasman, A. G. Pedersen, F. M. Aarestrup and O. Lund. Bacterial whole genome-based phylogeny: construction of a new benchmarking dataset and assessment of some existing methods. BMC Genomics (2017) 18:19
In vitro evolution of E. coli
Data set with known phylogeny
Tree by NDtree - regular base calling
Base calling revisited
• Do we get all the information from simple base calling?
• How about vectorizing the counts and comparing the vectors at each position to get the difference? If the scalar product of two normalized count vectors is 1, the angle between them is 0 degrees, since cos(0°) = 1, telling us that the vectors point in the same direction, i.e. the base proportions are identical.
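As an illustration of the vector comparison, here is a minimal Python sketch that normalizes two count vectors and converts their scalar product into an angle. The function names are hypothetical and the sketch does not reproduce any particular tool.

```python
import math


def cosine_similarity(u, v):
    """Scalar product of two count vectors after normalizing each to length 1."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("a position with no reads has no direction")
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (norm_u * norm_v)


def angle_degrees(u, v):
    """Angle between two count vectors, in degrees."""
    # Clamp to [-1, 1] to guard against floating-point rounding.
    c = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.degrees(math.acos(c))


# Same base proportions at different depth: the angle is (close to) 0 degrees.
print(angle_degrees((10, 2, 0, 0), (100, 20, 0, 0)))
# Reads supporting entirely different bases: the vectors are orthogonal.
print(angle_degrees((10, 0, 0, 0), (0, 10, 0, 0)))
```

Note that the angle ignores sequencing depth: 10 and 100 supporting reads give the same direction, so the measure cannot tell a well-supported position from a poorly supported one.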
Base calling
Position    A    T    G    C
1          40    6    4   10
2          50   10    4    6
3           5   50    0    5
4           0    0   60    0
5           5    0    2   53
6           3    5   50    2
7           0    1    0   59
8          55    0    5    0

Position    A    T    G    C
1          30    6    4   10
2          40   10    4    6
3           5   40    0    5
4           0    0   50    0
5           5    0    2   43
6           3    5   40    2
7           0    1    0   49
8           4    0    5    0
Vectorized counts
Base calling revisited
• Do we get all the information from simple base calling?
• How about vectorizing the counts and comparing the vectors at each position to get the difference? If the scalar product of two normalized count vectors is 1, the angle between them is 0 degrees, since cos(0°) = 1, telling us that the vectors point in the same direction, i.e. the base proportions are identical.
(Been there, done that, didn't really work.)
• Bayesian probabilistic phylogeny
Bayesian probabilistic phylogeny
• Assume two consensus sequences x and y
• At each position i we have counts Xi(a) for base a in genome x, and similarly Yi(a) for the other genome. The total counts at i are called Xi and Yi
• Distance could be the expected number of positions where the two genomes differ
• The correct base at position i is called xi / yi.
• Calculate the probability for each position
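Taken together, the bullets above suggest a distance of the following form. This is a sketch written in the slide's notation, not a formula shown in the transcript: by linearity of expectation, the expected number of differing positions is the sum of the per-position probabilities that the true bases differ.

```latex
d(x, y) \;=\; \mathbb{E}\bigl[\#\{\, i : x_i \neq y_i \,\}\bigr]
        \;=\; \sum_{i} P\bigl(x_i \neq y_i \mid X_i, Y_i\bigr)
```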
Frequentist interpretation of Bayes' theorem
In the frequentist interpretation, probability measures a “proportion of outcomes.” For example, suppose an experiment is performed many times. P(A) is the proportion of outcomes with property A, and P(B) that with property B. P(B | A ) is the proportion of outcomes with property B out of outcomes with property A, and P(A | B ) the proportion of those with A out of those with B.
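For reference, the theorem that relates these proportions can be written as follows (standard form, with A and B as in the paragraph above):

```latex
P(A \mid B) \;=\; \frac{P(B \mid A)\, P(A)}{P(B)}
```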
Bayesian probabilistic phylogeny
The probability of observing the count given that x and y are different
The probability of x and y being different. The prior. Tested at 0.01
The probability of observing the count, given that x and y are the same.
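The formula these three annotations point to did not survive in the transcript. A plausible reconstruction, assuming it is Bayes' theorem applied to the hypothesis that the true bases differ at position i, is:

```latex
P\bigl(x_i \neq y_i \mid X_i, Y_i\bigr) \;=\;
  \frac{P\bigl(X_i, Y_i \mid x_i \neq y_i\bigr)\, P(x_i \neq y_i)}
       {P\bigl(X_i, Y_i \mid x_i \neq y_i\bigr)\, P(x_i \neq y_i)
        \;+\; P\bigl(X_i, Y_i \mid x_i = y_i\bigr)\, P(x_i = y_i)}
```

Here P(x_i ≠ y_i) is the prior (tested at 0.01), so P(x_i = y_i) = 0.99.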
Single position calculation
Example 1
Prior P(X=Y): 0.99. Sets the prior belief that two positions are identical (= fraction of identical positions).

                      Genome x
                      A (10)        C (2)         G (0)         T (0)         Total 12
Genome y   A (100)    6.4*10^-21    6.9*10^-37    1.1*10^-42    1.1*10^-42
           C (20)     1.4*10^-180   1.5*10^-196   2.4*10^-202   2.4*10^-202
           G (0)      5.9*10^-243   6.4*10^-259   1*10^-264     1*10^-264
           T (0)      5.9*10^-243   6.4*10^-259   1*10^-264     1*10^-264
           Total 120

Diagonal   6.4*10^-21
Total      6.4*10^-21
P(x!=y)    1.2*10^-18    5.9*10^-12
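The single-position calculation can be sketched in Python as below. The read model is an assumption, since the slides do not state it: each read is taken to show the true base with probability 0.99, and the likelihood of the counts is a binomial in the number of reads supporting the candidate base. The function names, and the choice to sum diagonal and off-diagonal cells without further normalization, are also illustrative.

```python
from math import comb

BASES = "ACGT"


def count_likelihood(counts, true_base, p_correct=0.99):
    """Binomial likelihood of the read counts if true_base is the real base.

    Assumed model: each read shows the true base with probability p_correct.
    """
    n = sum(counts.values())
    k = counts[true_base]
    return comb(n, k) * p_correct**k * (1 - p_correct) ** (n - k)


def prob_different(x_counts, y_counts, prior_same=0.99, p_correct=0.99):
    """Posterior probability that the two genomes differ at this position."""
    same = 0.0  # sum over the diagonal of the table (x base == y base)
    diff = 0.0  # sum over the off-diagonal cells
    for a in BASES:
        for b in BASES:
            joint = count_likelihood(x_counts, a, p_correct) * count_likelihood(
                y_counts, b, p_correct
            )
            if a == b:
                same += joint
            else:
                diff += joint
    weighted_diff = (1 - prior_same) * diff
    return weighted_diff / (weighted_diff + prior_same * same)


# Counts from Example 1: genome x (12 reads) and genome y (120 reads).
x = {"A": 10, "C": 2, "G": 0, "T": 0}
y = {"A": 100, "C": 20, "G": 0, "T": 0}
print(count_likelihood(x, "A") * count_likelihood(y, "A"))  # the A/A cell
print(prob_different(x, y))
```

With these assumptions the A/A cell evaluates to roughly 6.4*10^-21, in line with the table. The posterior comes out very small but is not guaranteed to match the slide's figure exactly, because the exact weighting used there is not shown. For much deeper coverage the products would underflow, so a real implementation would likely work in log space.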
Single position calculation
Example 2
Prior P(X=Y): 0.99. Sets the prior belief that two positions are identical (= fraction of identical positions).

                      Genome x
                      A (10)        C (2)         G (0)         T (0)         Total 12
Genome y   A (10)     4.3*10^-11    4.6*10^-27    7.2*10^-33    7.2*10^-33
           C (1)      9.4*10^-32    1.0*10^-47    1.5*10^-53    1.5*10^-53
           G (0)      5.9*10^-35    6.4*10^-51    1*10^-56      1*10^-56
           T (5)      2.4*10^-21    2.6*10^-37    4.1*10^-43    4.1*10^-43
           Total 16

Diagonal   4.3*10^-11
Total      4.3*10^-11
P(x!=y)    5.8*10^-13    2.9*10^-06
Single position calculation
Example 3
Prior P(X=Y): 0.99. Sets the prior belief that two positions are identical (= fraction of identical positions).

                      Genome x
                      A (10)        C (0)         G (5)         T (0)         Total 15
Genome y   A (2)      3.6*10^-35    1.3*10^-58    3.8*10^-45    1.3*10^-58
           C (10)     4.7*10^-17    1.7*10^-40    5.0*10^-27    1.7*10^-40
           G (0)      2.7*10^-41    1*10^-64      2.8*10^-51    1*10^-64
           T (5)      1.5*10^-27    5.8*10^-51    1.6*10^-37    5.8*10^-51
           Total 17

Diagonal   3.6*10^-35
Total      4.7*10^-17
P(x!=y)    1    5000000
Tree
Acknowledgements
Anders Krogh
Anders Gorm Pedersen
Ole Lund