
Post on 19-Sep-2018


Lecture 3. Phylogeny methods: Branch and bound, distance methods

Joe Felsenstein

Department of Genome Sciences and Department of Biology

Lecture 3. Phylogeny methods: Branch and bound, distance methods – p.1/25

Greedy search by sequential addition

[Figure: trees produced at each step of sequential addition, labeled with their scores. The three possible placements of D score 8, 7, and 9; the five possible placements of E score 11, 9, 9, 9, and 9.]

Greedy search by addition of species in a fixed order (A, B, C, D, E) in the best place each time.
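A minimal sketch of sequential addition, assuming the score being minimized is the Fitch parsimony count. The sequences, the function names, and the nested-tuple tree representation are all hypothetical choices made for illustration. The unrooted tree is stored as (first species, subtree of the rest), so every node of that subtree stands for one branch where a new species could be attached.

```python
def fitch_changes(tree, site):
    """Return (possible states, minimum changes) for one site by Fitch's method."""
    if isinstance(tree, str):                  # a tip: its observed base, no changes
        return {site[tree]}, 0
    left_states, left_cost = fitch_changes(tree[0], site)
    right_states, right_cost = fitch_changes(tree[1], site)
    common = left_states & right_states
    if common:                                 # the two subtrees agree: no extra change
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Sum the Fitch counts over all sites."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for k in range(n_sites):
        site = {name: s[k] for name, s in seqs.items()}
        total += fitch_changes(tree, site)[1]
    return total


def placements(subtree, new_tip):
    """Yield every tree obtained by attaching new_tip to one branch of subtree."""
    yield (subtree, new_tip)                   # attach on the branch above this node
    if isinstance(subtree, tuple):
        left, right = subtree
        for grown in placements(left, new_tip):
            yield (grown, right)
        for grown in placements(right, new_tip):
            yield (left, grown)


def sequential_addition(seqs, order):
    """Add species in the given order, each time keeping the best-scoring placement."""
    first, rest = order[0], order[1]
    for name in order[2:]:
        rest = min(placements(rest, name),
                   key=lambda r: parsimony_score((first, r), seqs))
    tree = (first, rest)
    return tree, parsimony_score(tree, seqs)


# Hypothetical data: two sites group A, B against C, D, E; one site groups D with E.
example_seqs = {"A": "AAC", "B": "AAC", "C": "GGC", "D": "GGT", "E": "GGT"}
best_tree, best_score = sequential_addition(example_seqs, list("ABCDE"))
```

With three species already placed there are 3 branches to try for the fourth, and then 5 for the fifth, matching the counts in the figure. The search is greedy: it never revisits an earlier placement, so it is not guaranteed to find the best tree.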

Goloboff’s time-saving trick

[Figure: a tree whose subtrees are labeled by ranges of species (A, B to G, H to K, L, M to R, S to U, V to Z), with some of them merged into larger subtrees (A to G, H to R).]

Goloboff’s economy in computing scores of rearranged trees. Once the “views” have been computed, they can be taken to represent subtrees, without going inside those subtrees.

Star decomposition

[Figure: a star tree on species A, B, C, D, E, F and several trees derived from it by successively grouping lineages.]

“Star decomposition” search for best tree can happen in multiple ways

Disk-covering

[Figure: a tree on species A, B, C, D, E, F with branch lengths ranging from 0.02 to 0.1.]

“Disk covering”: assembly of a tree from overlapping estimated subtrees

Shortest Hamiltonian path problem

[Figure: four panels, (a) through (d).]

Search tree for this problem

[Figure: search tree for the problem. From “start”, each level adds one more point to the path (“add 1”, “add 2”, “add 3”, etc.). The tips are complete orderings such as (1,2,3,4,5,6,7,8,9,10), (1,2,3,4,5,6,7,9,8,10), (1,2,3,4,5,6,7,10,8,9), (1,2,3,4,5,6,7,8,10,9), (1,2,3,4,5,6,7,9,10,8), and (1,2,3,4,5,6,7,10,9,8).]
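A minimal sketch of branch and bound on the shortest Hamiltonian path problem, assuming points in the plane with straight-line distances. The function name and the example points are hypothetical. Each level of the recursion adds one more point to the path, as in the search tree. The bound used here is the simplest one available: a partial path that is already as long as the best complete path found so far cannot lead to anything better, so the search does not go further down that part of the tree.

```python
import math


def shortest_hamiltonian_path(points):
    """Return (length, order) of the shortest path visiting every point once."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    best = {"length": math.inf, "order": None}

    def extend(path, used, length):
        if length >= best["length"]:           # bound: cannot beat the best so far
            return
        if len(path) == n:                     # a complete ordering, and a better one
            best["length"], best["order"] = length, path[:]
            return
        last = path[-1]
        # branch: try the nearest unused points first, to find good bounds early
        candidates = sorted((k for k in range(n) if k not in used),
                            key=lambda k: dist[last][k])
        for k in candidates:
            path.append(k)
            used.add(k)
            extend(path, used, length + dist[last][k])
            path.pop()
            used.remove(k)

    for start in range(n):                     # first level: choose the first point
        extend([start], {start}, 0.0)
    return best["length"], best["order"]


# Hypothetical example: four points on a line, listed out of order.
length, order = shortest_hamiltonian_path([(0, 0), (3, 0), (1, 0), (2, 0)])
```

The result is still guaranteed to be optimal, because only parts of the search tree that cannot contain a better solution are skipped. The same idea applies to the search tree of trees, with the parsimony score in place of path length, since adding a species can never lower the score.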

Search tree of trees

[Figure: search tree of trees. Starting from the single tree of A, B, C, each level adds one more species (D, then E) in every possible place, giving 3 four-species trees and 15 five-species trees.]

same, with parsimony scores in place of trees

[Figure: the search tree of trees from the previous slide, with each tree replaced by its parsimony score.]

Polynomial time and exponential time

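[Figure: log-log plot of time (from $10^0$ to $10^6$) against problem size (from 1 to 100) for two curves, the polynomial $6n^3 + 4n - 3$ and the exponential $e^{0.5n}$.]

How does the time taken by an algorithm depend on the size of the problem? If the time is a polynomial in the problem size (even one with big coefficients), then for a big enough case that algorithm is faster than one whose time depends on the size exponentially.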


NP completeness and NP hardness

[Figure: diagram of the classes P, NP, NP Complete, and NP Hard, annotated “is P = NP?” and “does this part exist?”]

(This diagram is not quite correct; see the diagrams on the Wikipedia page for “NP-hard”.)

P = problems that can be solved by a polynomial time algorithm.

NP = problems for which a proposed solution can be checked in polynomial time.

NP complete = problems in NP to which every problem in NP can be converted in polynomial time, so that if one of them is in P, all of NP is.

NP hard = problems at least as hard as every problem in NP. They need not be in NP themselves, and they are not solvable in polynomial time unless P = NP.

Distance methods

These have been attractive, particularly to mathematical scientists who love geometry. This has its good and bad effects.

1. Take the sequences in all pairs.

2. For each pair compute a distance. (As we will see, this is best thought of as the length of the 2-species tree for those species).

3. Try to find that tree which best fits the table of distances.
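Steps 1 and 2 can be sketched as follows, assuming aligned sequences of equal length and using the uncorrected fraction of differing sites as the distance. That choice of distance, the function names, and the example sequences are assumptions made for illustration. Step 3, the search for the best-fitting tree, is not shown.

```python
from itertools import combinations


def fraction_different(seq_a, seq_b):
    """Fraction of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def distance_table(seqs):
    """Compute a distance for every pair of species."""
    return {(a, b): fraction_different(seqs[a], seqs[b])
            for a, b in combinations(sorted(seqs), 2)}


# Hypothetical aligned sequences.
example = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACGA"}
table = distance_table(example)
```

Here `table[("A", "C")]` is 0.25, since two of the eight sites differ.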


A phylogeny with branch lengths

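[Figure: unrooted tree with tips A, B, C, D, E. The branches to the tips have lengths A 0.08, B 0.10, C 0.05, D 0.07, E 0.06. B and D join at an interior node that is connected to the center by a branch of length 0.05; C and E join at an interior node that is connected to the center by a branch of length 0.03; A attaches at the center.]

and the pairwise distances it predicts

       A      B      C      D      E
A      0      0.23   0.16   0.20   0.17
B      0.23   0      0.23   0.17   0.24
C      0.16   0.23   0      0.20   0.11
D      0.20   0.17   0.20   0      0.21
E      0.17   0.24   0.11   0.21   0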
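A sketch of how a tree predicts distances: each predicted distance is the sum of the branch lengths on the path between the two tips. The topology and the assignment of branch lengths below are an assumed reading of the figure, and the interior node names x, y, z and the function name are hypothetical.

```python
from itertools import combinations

# Assumed tree: A attaches at the center x; B and D join at y; C and E join at z.
edges = [("A", "x", 0.08), ("x", "y", 0.05), ("y", "B", 0.10), ("y", "D", 0.07),
         ("x", "z", 0.03), ("z", "C", 0.05), ("z", "E", 0.06)]


def predicted_distances(edges, tips):
    """Return the path length between every pair of tips, rounded to 2 places."""
    neighbors = {}
    for a, b, length in edges:
        neighbors.setdefault(a, []).append((b, length))
        neighbors.setdefault(b, []).append((a, length))

    def path_lengths(start):
        # Walk outward from start; in a tree each node is reached by one path only.
        found = {start: 0.0}
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt, length in neighbors[node]:
                if nxt not in found:
                    found[nxt] = found[node] + length
                    stack.append(nxt)
        return found

    return {(a, b): round(path_lengths(a)[b], 2) for a, b in combinations(tips, 2)}


d = predicted_distances(edges, "ABCDE")
```

For example, the path from B to E passes along four branches, 0.10 + 0.05 + 0.03 + 0.06, giving 0.24.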


A phylogeny with branch lengths

[Figure: the same tree of A, B, C, D, E with its seven branches labeled v1 to v7.]

Least squares trees

Least squares methods minimize

$$Q = \sum_{i=1}^{n} \sum_{j \neq i} w_{ij} (D_{ij} - d_{ij})^2$$

over all trees, using the distances $d_{ij}$ that they predict. Cavalli-Sforza and Edwards suggested $w_{ij} = 1$; Fitch and Margoliash suggested $w_{ij} = 1/D_{ij}^2$.
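A minimal sketch of the criterion $Q$ with the two weightings, for a fixed set of predicted distances. The function names, the nested-dictionary layout, and the three-species numbers are hypothetical.

```python
def least_squares_q(observed, predicted, weight=lambda D: 1.0):
    """Q = sum over i and j != i of w_ij * (D_ij - d_ij)^2.

    The default weight of 1 is the Cavalli-Sforza and Edwards choice.
    """
    names = list(observed)
    return sum(weight(observed[i][j]) * (observed[i][j] - predicted[i][j]) ** 2
               for i in names for j in names if j != i)


def fitch_margoliash_weight(D):
    """The Fitch and Margoliash choice: w_ij = 1 / D_ij^2."""
    return 1.0 / D ** 2


# Hypothetical observed distances D and tree-predicted distances d.
observed = {"A": {"B": 0.3, "C": 0.4},
            "B": {"A": 0.3, "C": 0.5},
            "C": {"A": 0.4, "B": 0.5}}
predicted = {"A": {"B": 0.3, "C": 0.5},
             "B": {"A": 0.3, "C": 0.5},
             "C": {"A": 0.5, "B": 0.5}}

q_unweighted = least_squares_q(observed, predicted)
q_fm = least_squares_q(observed, predicted, fitch_margoliash_weight)
```

Because the sum runs over all ordered pairs with $j \neq i$, each pair of species is counted twice. Only the A, C distance is misfit here, by 0.1, so the unweighted $Q$ is about $2 \times 0.01 = 0.02$ and the Fitch and Margoliash $Q$ is about $2 \times 0.01 / 0.16 = 0.125$.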


Statistical assumptions of least squares trees

The implicit assumption is that the distances are (independently?) Normally distributed with expectation $d_{ij}$ and variance proportional to $1/w_{ij}$:

$$D_{ij} \sim N(d_{ij}, K/w_{ij})$$

Thus the different weightings correspond to different assumptions about the error in the distances. Also, there is assumed to be no covariance of distances.

In fact, the distances will covary, since a change in an interior branch of the tree increases (or decreases) all distances whose paths go through that branch.

Matrix approach to fitting branch lengths

If we stack the distances up into a column vector $\mathbf{D}$, we can solve the least squares equation (obtained by taking derivatives of the quadratic form $Q$):

$$\mathbf{D}^T = (D_{12}, D_{13}, D_{14}, D_{15}, D_{23}, D_{24}, D_{25}, D_{34}, D_{35}, D_{45})$$

$$X^T \mathbf{D} = \left(X^T X\right) \mathbf{v}$$

where the “design matrix” $X$ for the given tree topology has 1’s whenever a given branch lies on the path between those two species. Here is the design matrix for the tree we just saw.

Branches:   1  2  3  4  5  6  7     which D

            1  1  0  0  0  0  1     1, 2
            1  0  1  0  0  1  0     1, 3
            1  0  0  1  0  0  1     1, 4
            1  0  0  0  1  1  0     1, 5
X =         0  1  1  0  0  1  1     2, 3
            0  1  0  1  0  0  0     2, 4
            0  1  0  0  1  1  1     2, 5
            0  0  1  1  0  1  1     3, 4
            0  0  1  0  1  0  0     3, 5
            0  0  0  1  1  1  1     4, 5

[Figure: the tree of A, B, C, D, E with its seven branches labeled v1 to v7.]
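A sketch of solving $X^T \mathbf{D} = (X^T X)\mathbf{v}$ for the branch lengths, using the design matrix above and plain Gauss-Jordan elimination. The function names are hypothetical. The distances are the ones shown earlier for this tree, with species 1 to 5 taken to be A to E and $D_{34}$ taken as 0.20, the value consistent with a tree-additive table; both are assumptions of this sketch.

```python
X = [
    [1, 1, 0, 0, 0, 0, 1],  # D12
    [1, 0, 1, 0, 0, 1, 0],  # D13
    [1, 0, 0, 1, 0, 0, 1],  # D14
    [1, 0, 0, 0, 1, 1, 0],  # D15
    [0, 1, 1, 0, 0, 1, 1],  # D23
    [0, 1, 0, 1, 0, 0, 0],  # D24
    [0, 1, 0, 0, 1, 1, 1],  # D25
    [0, 0, 1, 1, 0, 1, 1],  # D34
    [0, 0, 1, 0, 1, 0, 0],  # D35
    [0, 0, 0, 1, 1, 1, 1],  # D45
]
# Distances in the same order as the rows of X (D34 assumed to be 0.20).
D = [0.23, 0.16, 0.20, 0.17, 0.23, 0.17, 0.24, 0.20, 0.11, 0.21]


def normal_equations(X, D):
    """Form X^T X and X^T D."""
    cols = range(len(X[0]))
    xtx = [[sum(row[i] * row[j] for row in X) for j in cols] for i in cols]
    xtd = [sum(row[i] * d for row, d in zip(X, D)) for i in cols]
    return xtx, xtd


def solve(a, b):
    """Solve a v = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


xtx, xtd = normal_equations(X, D)
v = [round(x, 6) for x in solve(xtx, xtd)]
```

Since these distances fit the tree exactly, the least squares solution recovers the branch lengths with no residual: v1 to v7 come out as 0.08, 0.10, 0.05, 0.07, 0.06, 0.03, 0.05. With real data the distances would not fit exactly, and some fitted branch lengths could even be negative unless constrained.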


The Jukes-Cantor model for DNA

[Figure: the four bases A, G, C, T, with a two-way arrow between each pair of bases, each labeled with rate u/3.]

Derivation of the probability of change

1. Imagine events occurring at rate $\frac{4}{3}u$ per unit time which replace a base by one of the 4 bases chosen at random.

2. Persuade yourself that this is no different in outcome from events occurring at rate $u$ per unit time that replace it by one of the other 3 chosen at random.

3. The probability that a branch of length $t$ has none of these (first kind of) events is $\exp\left(-\frac{4}{3}ut\right)$. (Think of the zero term of a Poisson distribution.)

4. If it does have one or more of these events, you end up with one of the 4 bases chosen at random.

5. Therefore the probability of a net change is the probability of at least one event, times the chance of 3/4 that the randomly chosen base differs from the original:

$$\frac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$$
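A short sketch of the result above and of its inverse, which is the usual Jukes-Cantor distance: solving $p = \frac{3}{4}(1 - e^{-\frac{4}{3}ut})$ for the branch length gives $ut = -\frac{3}{4}\ln(1 - \frac{4}{3}p)$. The function names are hypothetical, and branch length is measured as $ut$, the expected number of changes per site.

```python
import math


def jc_prob_change(branch_length):
    """Probability of a net change at a site after a branch of length u*t."""
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))


def jc_distance(p):
    """Branch length u*t implied by an observed fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


corrected = jc_distance(0.49)
```

A fraction of differences of 0.49 corresponds to a branch length of about 0.7945, noticeably larger than 0.49 because of multiple changes at the same site. As the branch length grows, the probability of a net change approaches 0.75 and the correction becomes undefined at or beyond that value.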


The distance for the Jukes-Cantor model

[Figure: differences per site (vertical axis, 0 to 1) plotted against branch length (horizontal axis). The curve levels off at 0.75; a value of 0.49 differences per site corresponds to a branch length of 0.7945.]

If you don’t correct for “multiple hits”

[Figure: two trees on species A, B, C. Left tree branch lengths: 0.155, 0.155, 0.0206. Right tree branch lengths: 0.20, 0.20, 0.00.]

Left: the true tree. Right: a tree fitting the uncorrected distances

References, page 1

Maddison, D. R. 1991. The discovery and importance of multiple islands of most-parsimonious trees. Systematic Zoology 40: 315-328. [Discusses heuristic search strategy involving ties, multiple starts]

Farris, J. S. 1970. Methods for computing Wagner trees. Systematic Zoology 19: 83-92. [Early parsimony algorithms paper is one of first to mention sequential addition strategy]

Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4: 406-425. [First mention of star-decomposition search for best trees, sort of]

Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Molecular Biology and Evolution 13: 964-969. [Assembles trees out of quartets]

Huson, D., S. Nettles, L. Parida, T. Warnow, and S. Yooseph. 1998. The disk-covering method for tree reconstruction. pp. 62-75 in Proceedings of “Algorithms and Experiments” (ALEX98), Trento, Italy, Feb. 9-11, 1998, ed. R. Battiti and A. A. Bertossi. [“Disk-covering method” for long stringy trees]

References, page 2

Foulds, L. R. and R. L. Graham. 1982. The Steiner problem in phylogeny is NP-complete. Advances in Applied Mathematics 3: 43-49. [Parsimony is NP-hard]

Graham, R. L. and L. R. Foulds. 1982. Unlikelihood that minimal phylogenies for a realistic biological study can be constructed in reasonable computational time. Mathematical Biosciences 60: 133-142. [ ... and more]

Hendy, M. D. and D. Penny. 1982. Branch and bound algorithms to determine minimal evolutionary trees. Mathematical Biosciences 59: 277-290. [Introduced branch-and-bound for phylogenies]

Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, Massachusetts. [For this lecture the material is chapters 4 and 5]

Semple, C. and M. Steel. 2003. Phylogenetics. Oxford University Press, Oxford. [Also covers search strategies]

References, page 3

Felsenstein, J. 1984. Distance methods for inferring phylogenies: a justification. Evolution 38: 16-24. [Argument for statistical interpretation of distance methods]

Farris, J. S. 1985. Distance data revisited. Cladistics 1: 67-85. [Reply to my 1984 paper]

Felsenstein, J. 1986. Distance methods: reply to Farris. Cladistics 2: 130-143. [Reply to Farris 1985]

Farris, J. S. 1986. Distances and statistics. Cladistics 2: 144-157. [Debate was cut off after this]

References, page 4

Bryant, D., and P. Waddell. 1998. Rapid evaluation of least-squares and minimum-evolution criteria on phylogenetic trees. Molecular Biology and Evolution 15: 1346-1359. [Quicker least squares distance trees]

Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, Massachusetts. [See chapter 11]

Semple, C. and M. Steel. 2003. Phylogenetics. Oxford University Press, Oxford. [See pp. 145-160]

Yang, Z. 2007. Computational Molecular Evolution. Oxford University Press, Oxford. [See pages 89-93]
