A Probabilistic Approach to Whole Genome based Phylogeny

Johanne Ahrenfeldt, PhD student - Genomic Epidemiology


DTU Bioinformatics, Technical University of Denmark

Overview

• Whole Genome Based Phylogeny

• A data set with known phylogeny

• Base calling revisited

• A probabilistic approach to distance calculation


Whole genome based phylogeny

• Use of whole genome sequencing (WGS) is increasing as the price is (used to be) falling

• A very useful tool for outbreak analysis of infectious diseases

• Used in the Haiti cholera outbreak to find the source of the infection


Mapping

[Figure: mapping of reads to a reference genome, followed by base calling to give a consensus sequence; Genomes 1 to 6 are shown against the reference genome.]


Base calling

Position    A    T    G    C
1          40    6    4   10
2          50   10    4    6
3           5   50    0    5
4           0    0   60    0
5           5    0    2   53
6           3    5   50    2
7           0    1    0   59
8          55    0    5    0

Each position is evaluated by calculating a Z-score

X is the number of reads having the most common nucleotide at that position, and Y is the number of reads supporting other nucleotides.

To trust a base call we require:

• Z > 1.96

• > 90% of reads supporting the same base
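The Z-score formula itself did not survive in this transcript. A minimal sketch of the two checks, assuming the form Z = (X - Y) / sqrt(X + Y) used in the published description of NDtree and assuming that both thresholds must hold; the function name `call_base` and its parameters are hypothetical:

```python
import math


def call_base(counts, z_min=1.96, support_min=0.90):
    """Return the trusted base for one position, or None if it is not trusted.

    counts maps each nucleotide to its read count at this position.
    Assumed Z-score form: Z = (X - Y) / sqrt(X + Y).
    """
    total = sum(counts.values())
    if total == 0:
        return None  # no reads cover this position
    # X: reads with the most common nucleotide; Y: reads with any other one.
    base, x = max(counts.items(), key=lambda item: item[1])
    y = total - x
    z = (x - y) / math.sqrt(x + y)
    if z > z_min and x / total > support_min:
        return base
    return None


print(call_base({"A": 0, "T": 0, "G": 60, "C": 0}))  # G
print(call_base({"A": 5, "T": 50, "G": 0, "C": 5}))  # None
```

With these thresholds, a position with 50 of 60 reads on one base passes the Z-score test (Z is about 5.2) but fails the 90% requirement.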


A data set with known phylogeny

To test various methods, it would be useful to have a dataset with a known phylogeny: the true tree structure is then known, and each method's ability to infer the correct phylogeny can be evaluated.

Such a dataset was made by in vitro evolution of E. coli.

J. Ahrenfeldt, C. Skaarup, H. Hasman, A. G. Pedersen, F. M. Aarestrup and O. Lund. Bacterial whole genome-based phylogeny: construction of a new benchmarking dataset and assessment of some existing methods. BMC Genomics (2017) 18:19


In vitro evolution of E. coli


Data set with known phylogeny


Tree by NDtree - regular base calling


Base calling revisited

• Do we get all the information from simple base calling?

• How about vectorizing the counts and comparing these for each position to get the difference? If the scalar product of two normalized vectors is 1, the angle between them is 0 degrees, as cos(0) = 1, telling us that the vectors are identical.


Base calling

Position    A    T    G    C
1          40    6    4   10
2          50   10    4    6
3           5   50    0    5
4           0    0   60    0
5           5    0    2   53
6           3    5   50    2
7           0    1    0   59
8          55    0    5    0

Position    A    T    G    C
1          30    6    4   10
2          40   10    4    6
3           5   40    0    5
4           0    0   50    0
5           5    0    2   43
6           3    5   40    2
7           0    1    0   49
8           4    0    5    0


Vectorized counts
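A minimal sketch of the vector comparison, assuming each position is represented by its four read counts in A, T, G, C order; the helper names are hypothetical and the example counts are illustrative:

```python
import math


def cosine_similarity(u, v):
    """Scalar product of u and v after scaling both to unit length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("a position with no reads has no direction")
    # Clamp to guard against rounding slightly outside [-1, 1].
    return max(-1.0, min(1.0, dot / (norm_u * norm_v)))


def angle_degrees(u, v):
    """Angle between two count vectors; 0 degrees means identical direction."""
    return math.degrees(math.acos(cosine_similarity(u, v)))


x = (5, 50, 0, 5)  # counts for A, T, G, C in one genome
y = (5, 40, 0, 5)  # counts at the same position in another genome
print(cosine_similarity(x, y), angle_degrees(x, y))
```

Because the cosine depends only on direction, count vectors that differ only in sequencing depth compare as identical.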


Base calling revisited

• Vectorizing the counts and comparing them for each position: been there, done that, didn't really work

• Bayesian probabilistic phylogeny


Bayesian probabilistic phylogeny

• Assume two consensus sequences x and y

• At each position i we have counts X_i(a) for base a in genome x, and similarly Y_i(a) for the other genome. The total counts at i are called X_i and Y_i

• Distance could be the expected number of positions where the two genomes differ

• The correct base at position i is called x_i in genome x and y_i in genome y

• Calculate the probability that x_i and y_i differ for each position
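Written out, the distance proposed above is a sum of per-position probabilities, by linearity of expectation. The notation d(x, y), N for the number of positions, and the explicit conditioning on the observed counts are assumptions made here for illustration:

```latex
d(x, y)
  = \mathbb{E}\Big[\sum_{i=1}^{N} \mathbf{1}[x_i \neq y_i] \,\Big|\, \text{counts}\Big]
  = \sum_{i=1}^{N} P\big(x_i \neq y_i \mid X_i(\cdot),\, Y_i(\cdot)\big)
```

Each term is the probability that the true bases differ at position i given the read counts there, so positions with ambiguous counts contribute a fraction of a difference rather than being discarded.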


Frequentist interpretation of Bayes' theorem

In the frequentist interpretation, probability measures a “proportion of outcomes.” For example, suppose an experiment is performed many times. P(A) is the proportion of outcomes with property A, and P(B) that with property B. P(B | A ) is the proportion of outcomes with property B out of outcomes with property A, and P(A | B ) the proportion of those with A out of those with B.
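The "proportion of outcomes" reading can be checked by counting. A small sketch, assuming an illustrative experiment that is not from the talk (two fair dice, with A = "first die is even" and B = "the sum is at least 9"); all names are hypothetical:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))


def first_die_even(outcome):
    return outcome[0] % 2 == 0


def sum_at_least_nine(outcome):
    return outcome[0] + outcome[1] >= 9


def proportion(event, given=None):
    """Proportion of outcomes with `event` among those satisfying `given`."""
    pool = [o for o in outcomes if given is None or given(o)]
    return Fraction(sum(1 for o in pool if event(o)), len(pool))


p_a = proportion(first_die_even)
p_b = proportion(sum_at_least_nine)
p_b_given_a = proportion(sum_at_least_nine, given=first_die_even)
p_a_given_b = proportion(first_die_even, given=sum_at_least_nine)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
assert p_a_given_b == p_b_given_a * p_a / p_b
print(p_a_given_b)  # 3/5
```

Exact fractions are used so that the identity holds without rounding error.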


Bayesian probabilistic phylogeny


The probability of observing the count given that x and y are different

The probability of x and y being different. The prior. Tested at 0.01

The probability of observing the count, given that x and y are the same.
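The formula that these three annotations label is not preserved in the transcript. A form consistent with them is Bayes' theorem applied to the hypothesis that the two bases differ. The sketch below assumes a deliberately simple read-error model (each read independently shows one particular wrong base with probability `error_rate`) and equal weight for every base, or base pair, within each hypothesis. These choices and all function names are illustrative assumptions, not the speaker's model, so the numbers will not match the worked examples in the talk:

```python
from itertools import product

BASES = "ACGT"


def read_likelihood(counts, true_base, error_rate=0.01):
    """P(counts | true base), up to a constant factor shared by all bases.

    Assumption: reads are independent; a read shows a given wrong base with
    probability error_rate and the true base with 1 - 3 * error_rate.
    """
    p_correct = 1.0 - 3.0 * error_rate
    likelihood = 1.0
    for base in BASES:
        p = p_correct if base == true_base else error_rate
        likelihood *= p ** counts.get(base, 0)
    return likelihood


def posterior_different(counts_x, counts_y, prior_diff=0.01, error_rate=0.01):
    """Posterior probability that the true bases of x and y differ."""
    same = 0.0  # sum over the 4 diagonal (a == b) combinations
    diff = 0.0  # sum over the 12 off-diagonal (a != b) combinations
    for a, b in product(BASES, repeat=2):
        joint = (read_likelihood(counts_x, a, error_rate)
                 * read_likelihood(counts_y, b, error_rate))
        if a == b:
            same += joint
        else:
            diff += joint
    # Assumption: uniform weight within each hypothesis.
    like_same = same / 4
    like_diff = diff / 12
    numerator = like_diff * prior_diff
    return numerator / (numerator + like_same * (1.0 - prior_diff))


# Both genomes dominated by A: a difference is very unlikely.
print(posterior_different({"A": 10, "C": 2}, {"A": 100, "C": 20}))
# x dominated by A, y dominated by C: a difference is almost certain.
print(posterior_different({"A": 10, "G": 5}, {"A": 2, "C": 10, "T": 5}))
```

Plain floating-point products are adequate at these read depths; for much deeper coverage the same calculation would need to be done with log-probabilities to avoid underflow.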


Single position calculation: Example 1

Prior P(X=Y) = 0.99. Sets the prior belief that two positions are identical (= fraction of identical positions)

Read counts in genome x: A 10, C 2, G 0, T 0 (total 12)
Read counts in genome y: A 100, C 20, G 0, T 0 (total 120)

Genome y \ Genome x    A              C              G              T
A                      6.4*10^-21     6.9*10^-37     1.1*10^-42     1.1*10^-42
C                      1.4*10^-180    1.5*10^-196    2.4*10^-202    2.4*10^-202
G                      5.9*10^-243    6.4*10^-259    1*10^-264      1*10^-264
T                      5.9*10^-243    6.4*10^-259    1*10^-264      1*10^-264

Diagonal    6.4*10^-21
Total       6.4*10^-21
P(x!=y)     1.2*10^-18    5.9*10^-12


Single position calculation: Example 2

Prior P(X=Y) = 0.99. Sets the prior belief that two positions are identical (= fraction of identical positions)

Read counts in genome x: A 10, C 2, G 0, T 0 (total 12)
Read counts in genome y: A 10, C 1, G 0, T 5 (total 16)

Genome y \ Genome x    A              C              G              T
A                      4.3*10^-11     4.6*10^-27     7.2*10^-33     7.2*10^-33
C                      9.4*10^-32     1.0*10^-47     1.5*10^-53     1.5*10^-53
G                      5.9*10^-35     6.4*10^-51     1*10^-56       1*10^-56
T                      2.4*10^-21     2.6*10^-37     4.1*10^-43     4.1*10^-43

Diagonal    4.3*10^-11
Total       4.3*10^-11
P(x!=y)     5.8*10^-13    2.9*10^-06


Single position calculation: Example 3

Prior P(X=Y) = 0.99. Sets the prior belief that two positions are identical (= fraction of identical positions)

Read counts in genome x: A 10, C 0, G 5, T 0 (total 15)
Read counts in genome y: A 2, C 10, G 0, T 5 (total 17)

Genome y \ Genome x    A              C              G              T
A                      3.6*10^-35     1.3*10^-58     3.8*10^-45     1.3*10^-58
C                      4.7*10^-17     1.7*10^-40     5.0*10^-27     1.7*10^-40
G                      2.7*10^-41     1*10^-64       2.8*10^-51     1*10^-64
T                      1.5*10^-27     5.8*10^-51     1.6*10^-37     5.8*10^-51

Diagonal    3.6*10^-35
Total       4.7*10^-17
P(x!=y)     1    5000000


Tree


Acknowledgements

Anders Krogh

Anders Gorm Pedersen

Ole Lund