
Identifiability in Phylogenetics Using Algebraic Matroids

Ben Hollering and Seth Sullivant

North Carolina State University

April 9, 2020

Ben Hollering (NCSU) Identifiability using Matroids April 9, 2020 1 / 32

Phylogenetics

Problem

Given a collection of species, find the tree that explains their evolutionary history.

[Figure: three candidate trees on the species Gorilla, Chimp, and Human]


Building Trees with DNA Sequence Data

DNA bases are A, T, G, C

DNA sequences of related species all evolved from some common ancestor

Align sequences for a gene that appears in all species

Human : GATCTCAAGGAC

Chimp : GGCCTCAAGGAT

Gorilla : GATCTCCAGGCA


Phylogenetic Models

[Figure: tree with leaves Gorilla, Chimp, Human]

Human : GATCTCAAGGAC

Chimp : GGCCTCAAGGAT

Gorilla : GATCTCCAGGCA

We label the leaves of the tree with the base that each species has at a fixed site in its DNA

Each tree gives a family of distributions on columns in the alignment

Maximum likelihood estimation can then be used to find the tree that maximizes the probability of the data


Phylogenetic Models

[Figure: tree with leaves labeled by the bases A, T, C]

Human : AATGGGACATGC

Chimp : AATGGCACATGT

Gorilla : AACGGGACATAA



Phylogenetic Models

Assume each site evolves independently

Phylogenetic models are hidden variable graphical models

Each leaf $v$ is an observed random variable $X_v \in \{A, C, G, T\}$

Each internal node $v$ is a hidden random variable $Y_v$

Associate a transition matrix $M^e$ to each edge $e = (u, v)$ and a distribution $\pi$ to the root

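[Figure: rooted tree with hidden root $Y_1$ and hidden internal node $Y_2$; edge $M^1$ joins $Y_1$ to leaf $X_1$, edge $M^0$ joins $Y_1$ to $Y_2$, and edges $M^2$, $M^3$ join $Y_2$ to leaves $X_2$, $X_3$]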


Phylogenetic Models


The probability of observing $(x_1, x_2, x_3) \in \{A, C, G, T\}^3$ is

$$P(x_1, x_2, x_3) = \sum_{y_1} \sum_{y_2} \pi_{y_1} \, M^0_{y_1, y_2} \, M^1_{y_1, x_1} \, M^2_{y_2, x_2} \, M^3_{y_2, x_3}$$
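The sum over the hidden states can be evaluated directly. The following is a minimal sketch, not code from the talk: it assumes a uniform root distribution and Jukes-Cantor-shaped transition matrices with arbitrarily chosen entries, and the names `jc_matrix` and `pattern_probability` are hypothetical. States 0 to 3 stand for A, C, G, T.

```python
from fractions import Fraction
from itertools import product

def jc_matrix(a):
    """4x4 Markov matrix with off-diagonal entry a and diagonal 1 - 3a (assumed shape)."""
    return [[a if i != j else 1 - 3 * a for j in range(4)] for i in range(4)]

def pattern_probability(pi, M0, M1, M2, M3, x1, x2, x3):
    """Sum over the hidden root state y1 and the hidden internal state y2."""
    return sum(
        pi[y1] * M0[y1][y2] * M1[y1][x1] * M2[y2][x2] * M3[y2][x3]
        for y1 in range(4)
        for y2 in range(4)
    )

# Illustrative parameter values only; exact rational arithmetic avoids rounding.
pi = [Fraction(1, 4)] * 4
M0, M1, M2, M3 = (jc_matrix(Fraction(1, k)) for k in (10, 20, 30, 40))

# One probability per site pattern (x1, x2, x3): 4^3 = 64 coordinates.
probs = {
    pattern: pattern_probability(pi, M0, M1, M2, M3, *pattern)
    for pattern in product(range(4), repeat=3)
}
total = sum(probs.values())
print(len(probs), total)
```

Because every transition matrix has rows summing to one, the 64 pattern probabilities form a point of the probability simplex.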


Types of Phylogenetic Models

First require that $M^e = \exp(Q^e t_e)$ for a rate matrix $Q^e$ and parameter $t_e$

Further restrictions can be imposed on the rate matrices

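$$\text{CFN: } \begin{pmatrix} * & \alpha \\ \alpha & * \end{pmatrix} \qquad \text{K3P: } \begin{pmatrix} * & \beta & \alpha & \gamma \\ \beta & * & \gamma & \alpha \\ \alpha & \gamma & * & \beta \\ \gamma & \alpha & \beta & * \end{pmatrix}$$

$$\text{JC: } \begin{pmatrix} * & \alpha & \alpha & \alpha \\ \alpha & * & \alpha & \alpha \\ \alpha & \alpha & * & \alpha \\ \alpha & \alpha & \alpha & * \end{pmatrix} \qquad \text{K2P: } \begin{pmatrix} * & \beta & \alpha & \beta \\ \beta & * & \beta & \alpha \\ \alpha & \beta & * & \beta \\ \beta & \alpha & \beta & * \end{pmatrix}$$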
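To see what $M^e = \exp(Q^e t_e)$ looks like in the simplest case, the sketch below exponentiates a JC rate matrix and compares the result with the standard closed form for that model. It assumes the usual convention that each diagonal entry $*$ makes its row sum to zero; the truncated Taylor series, the function names, and the values of `alpha` and `t` are illustrative choices, not part of the talk.

```python
import math

def mat_mul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def mat_exp(Q, t, terms=40):
    """Truncated Taylor series for exp(Q t); adequate for small, well-scaled Q t."""
    n = len(Q)
    Qt = [[q * t for q in row] for row in Q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term holds (Q t)^k / k! after this update
        term = [[x / k for x in row] for row in mat_mul(term, Qt)]
        result = [[r + s for r, s in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result

alpha, t = 0.1, 2.0  # illustrative values
# JC rate matrix: off-diagonal alpha, diagonal -3*alpha so that rows sum to zero.
Q = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
M = mat_exp(Q, t)

# Closed form for JC: diagonal 1/4 + 3/4 e^{-4 alpha t}, off-diagonal 1/4 - 1/4 e^{-4 alpha t}.
decay = math.exp(-4 * alpha * t)
closed_diag = 0.25 + 0.75 * decay
closed_off = 0.25 - 0.25 * decay
print(M[0][0], closed_diag)
print(M[0][1], closed_off)
```

The resulting matrix has the same symmetry pattern as the rate matrix, which is why the restrictions on $Q^e$ carry over to the transition matrices.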


Algebraic Perspective on Phylogenetic Models

Once we fix a tree $T$ with $n$ leaves we get a polynomial map in the entries of $\pi$ and the $M^e$

$\psi_T : \Theta_T \to \mathbb{R}^{4^n}$

The phylogenetic model associated to $T$ is $M_T = \operatorname{im}(\psi_T) \subseteq \mathbb{R}^{4^n}$

$\Theta_T \subset \mathbb{R}^d$ is the space of numerical parameters (rate matrices $Q^e$ and time parameters $t_e$)

This gives a family of parametric algebraic statistical models indexed by the discrete parameter $T$

Let $V_T$ be the Zariski closure of the model


Phylogenetic Mixture Models

Mixture models can be used to model more complicated evolutionary events such as horizontal gene transfer or hybridization

The 2-tree mixture model for trees $T_1$ and $T_2$ is parameterized by

$\psi_{T_1, T_2} : \Theta_{T_1} \times \Theta_{T_2} \times [0, 1] \to \Delta_{4^n - 1}$

defined by

$\psi_{T_1, T_2}(\theta_1, \theta_2, \lambda) = \lambda \psi_{T_1}(\theta_1) + (1 - \lambda) \psi_{T_2}(\theta_2)$

This gives a family of parametric algebraic statistical models indexed by multisets $\{T_1, T_2\}$

The Zariski closure of the image is the join variety $V_{T_1} * V_{T_2}$


Identifiability

Definition

A parametric statistical model is identifiable if it gives a 1-1 map from parameters to probability distributions.

Identifiability is needed for consistency of inference

In phylogenetics, the identifiability of the tree parameter is particularly important

Can $T$ or $\{T_1, T_2\}$ be recovered from DNA sequence data?


Generic Identifiability of Discrete Parameters

Definition

Let $\{M_s\}_{s=1}^{k}$ be a collection of algebraic models that sit inside the probability simplex $\Delta_r$. The discrete parameter $s$ is generically identifiable if for each 2-subset $\{s_1, s_2\} \subseteq [k]$

$\dim(M_{s_1} \cap M_{s_2}) < \min(\dim(M_{s_1}), \dim(M_{s_2}))$


Algebraic Tools for Testing Generic Identifiability

Let $k[p] = k[p_1, p_2, \dots, p_r]$ denote the polynomial ring in indeterminates $p_1, p_2, \dots, p_r$

Definition

Let $S \subseteq k^r$. The vanishing ideal of $S$, denoted $I(S)$, is

$I(S) = \{f \in k[p] : f(a) = 0 \text{ for all } a \in S\} \subseteq k[p]$

The ideal $I_T = I(M_T)$ is called the ideal of phylogenetic invariants of $T$


Algebraic Tools for Testing Generic Identifiability

Proposition

Let $M_1$ and $M_2$ be two irreducible algebraic models which sit inside the probability simplex $\Delta_r$. If there exist polynomials $f_1$ and $f_2$ such that

$f_1 \in I(M_1) \setminus I(M_2)$ and $f_2 \in I(M_2) \setminus I(M_1)$

then $\dim(M_1 \cap M_2) < \min(\dim(M_1), \dim(M_2))$.

Since the models are irreducible, the ideals $I(M_s)$ are prime

If the models have the same dimension, then it suffices to show $I(M_1) \neq I(M_2)$

Finding polynomials f1 and f2 can be quite difficult


Generic Identifiability of Tree Parameters

The tree parameters of the JC, CFN, K2P, and K3P models are generically identifiable

The tree parameters of the 2-tree JC and K2P mixture models are generically identifiable (Allman-Petrovic-Rhodes-Sullivant 2009)

The tree parameters of the 3-tree JC mixture model are generically identifiable (Long-Sullivant 2015)


Matroids

A matroid is a combinatorial object used to axiomatize independence

Characterized by a ground set $E$ and a collection of independent sets $I \subseteq 2^E$

Definition

A matroid is a pair $(E, I)$, where $I \subseteq 2^E$ satisfies

1. $\emptyset \in I$
2. If $S \subseteq T$ and $T \in I$, then $S \in I$
3. If $S, T \in I$ and $\#S < \#T$, then there exists $e \in T \setminus S$ such that $S \cup \{e\} \in I$


Linear Matroids

Definition

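A linear matroid is one where $E \subset k^n$ is a finite subset, and $S \in I$ if and only if $S$ is linearly independent over $k$

Example (Linear Matroid)

$$A = \begin{pmatrix} 1 & 1 & -1 & -2 \\ 3 & 1 & 2 & 4 \\ 0 & -1 & 1 & 2 \end{pmatrix}$$

$E = [4]$ indexes the columns of $A$

Together with $\emptyset$, the independent sets are

{1}, {2}, {3}, {4}, {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {1,2,3}, {1,2,4}.

Columns 3 and 4 are proportional, so no independent set contains both.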
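The list of independent sets in this example can be checked by brute force. The sketch below is illustrative only: it computes exact ranks of column subsets of $A$ over the rationals with a small hand-written elimination routine (the helper name `rank` is an assumption of this sketch) and reports the nonempty independent sets with 1-based column labels.

```python
from fractions import Fraction
from itertools import combinations

A = [[1, 1, -1, -2],
     [3, 1, 2, 4],
     [0, -1, 1, 2]]

def rank(rows):
    """Exact rank over the rationals by Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

# A column subset is independent exactly when its rank equals its size.
independent_sets = [
    tuple(j + 1 for j in cols)
    for size in range(1, 5)
    for cols in combinations(range(4), size)
    if rank([[row[j] for j in cols] for row in A]) == size
]
print(independent_sets)
```

The subsets that fail the test are exactly those containing both columns 3 and 4, plus the full set of four columns, since the matrix has rank 3.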


Algebraic Matroids

Since $I(M_s)$ is a prime ideal it defines an algebraic matroid on the set of coordinates $E = \{p_i : i \in [r + 1]\}$ with independent sets

$\{S \subseteq E : I(M_s) \cap \mathbb{C}[S] = \langle 0 \rangle\}$

Let $M_s = \operatorname{im}(\phi)$ with $\phi(\theta_1, \dots, \theta_d) = (\phi_1(\theta), \dots, \phi_{r+1}(\theta))$ and let

$$J(\phi) = \left( \frac{\partial \phi_j}{\partial \theta_i} \right), \quad 1 \le i \le d, \; 1 \le j \le r + 1$$

The matroid defined by the columns of $J(\phi)$ over the fraction field $\mathbb{C}(\theta)$ is the same matroid defined by $I(M_s)$

Let $M(M_s)$ be the independence matroid of the model defined in either of these ways
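A small worked example may help to see why the two descriptions agree. The map below is a hypothetical toy parameterization chosen for illustration, not one of the phylogenetic models from the talk.

```latex
% Toy map (assumed for illustration): \phi(\theta_1, \theta_2) = (\theta_1, \theta_1^2, \theta_2)
\[
  I(\operatorname{im}\phi) = \langle p_2 - p_1^2 \rangle,
  \qquad
  J(\phi) =
  \begin{pmatrix}
    1 & 2\theta_1 & 0 \\
    0 & 0 & 1
  \end{pmatrix}.
\]
% Ideal side: the generator lies in \mathbb{C}[p_1, p_2], so \{p_1, p_2\} is dependent,
% while I \cap \mathbb{C}[p_1, p_3] = I \cap \mathbb{C}[p_2, p_3] = \langle 0 \rangle.
% Jacobian side: columns 1 and 2 are proportional over \mathbb{C}(\theta),
% while columns \{1, 3\} and \{2, 3\} are linearly independent.
\[
  \text{independent sets: } \emptyset,\ \{p_1\},\ \{p_2\},\ \{p_3\},\ \{p_1, p_3\},\ \{p_2, p_3\}.
\]
```

Both routes give the same rank-2 matroid on three elements, with $\{p_1, p_2\}$ as its only dependent pair.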


Proving Identifiability with Algebraic Matroids

Proposition (H - Sullivant)

Let $M_1$ and $M_2$ be two irreducible algebraic models which sit inside the probability simplex $\Delta_r$. Without loss of generality assume $\dim(M_1) \ge \dim(M_2)$. If there exists a subset $S$ of the coordinates such that

$S \in M(M_2) \setminus M(M_1)$

then $\dim(M_1 \cap M_2) < \min(\dim(M_1), \dim(M_2))$.

Allows us to prove identifiability results without computing $I(M_s)$

Still requires symbolic computation over k(θ)


Specializing the Jacobian

Proposition

Let $k$ be a field of characteristic zero and $\phi$ be a rational map. Then the matrix obtained by plugging generic parameter values into $J(\phi)$ gives a linear matroid over $k$ which is the same as that defined by $J(\phi)$ with symbolic parameters over $k(\theta)$

$M(J(\phi), k(\theta))$ = independence matroid over $k(\theta)$

$M(J(\phi), k)$ = independence matroid over $k$ obtained by plugging in random values for $\theta$


Certifying Identifiability with Algebraic Matroids

Algorithm 1: matroidSeparate

Input: Two maps $\phi_1, \phi_2$ parameterizing models $M_1$ and $M_2$ in $k^n$ with $\dim(M_1) \ge \dim(M_2)$, a number of trials $t$.
Output: A certificate $S$

for $i = 0$ to $t$ do
    Randomly select $T \subseteq [n]$ such that $|T| \le \dim(M_2)$;
    if $T \in M(J(\phi_2), k) \setminus M(J(\phi_1), k)$ then
        if $T \in M(J(\phi_2), k(\theta)) \setminus M(J(\phi_1), k(\theta))$ then
            $S = T$;
            break;
return $S$ or report that no certificate was found.

Still requires symbolic computation over k(θ)

Embarrassingly parallel
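The numerical filter in the loop of Algorithm 1 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: the two maps are hypothetical toy parameterizations with hand-written Jacobians, the Jacobians are specialized once at random integer parameters, and the symbolic confirmation over $k(\theta)$ is omitted.

```python
import random
from fractions import Fraction

def rank(rows):
    """Exact rank over the rationals by Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def independent(jac, cols):
    """True if the chosen columns of a numeric Jacobian are linearly independent."""
    return rank([[row[c] for c in cols] for row in jac]) == len(cols)

# Hypothetical toy maps: phi1(a, b) = (a, a^2, b) and phi2(a, b) = (a, b, a*b).
def jacobian1(a, b):
    return [[1, 2 * a, 0], [0, 0, 1]]

def jacobian2(a, b):
    return [[1, 0, b], [0, 1, a]]

def matroid_separate(J1, J2, dim2, trials, rng):
    """Search for T independent for model 2 but dependent for model 1 (numeric step only)."""
    n = len(J1[0])
    for _ in range(trials):
        size = rng.randint(1, dim2)
        T = tuple(sorted(rng.sample(range(n), size)))
        if independent(J2, T) and not independent(J1, T):
            return T
    return None  # no certificate found

rng = random.Random(2020)
theta1 = [rng.randint(1, 10**6) for _ in range(2)]  # random nonzero specialization
theta2 = [rng.randint(1, 10**6) for _ in range(2)]
certificate = matroid_separate(jacobian1(*theta1), jacobian2(*theta2),
                               dim2=2, trials=200, rng=rng)
print(certificate)
```

In this toy pair the only separating subset is the pair of coordinates with 0-based indices (0, 1), because the first model satisfies $p_2 = p_1^2$ while the second imposes no relation between its first two coordinates. Each trial is independent of the others, which is what makes the search easy to parallelize.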


The Schwartz-Zippel Lemma

Lemma (Schwartz-Zippel)

Let $f \in k[x_1, \dots, x_n]$ be a non-zero polynomial of total degree $\alpha$. Let $E$ be a finite subset of $k$ and let $r_1, \dots, r_n$ be selected at random independently and uniformly from $E$. Then

$$P(f(r_1, \dots, r_n) = 0) \le \frac{\alpha}{|E|}.$$

$S \notin M(J(\phi_1), k(\theta))$ if the corresponding minor of $J(\phi_1)$ vanishes

Main algorithm can be modified to avoid symbolic computation and produce a certificate that holds with probability $1 - \varepsilon$ by using this lemma
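The lemma turns repeated random evaluation into a one-sided test for whether a polynomial such as a Jacobian minor is identically zero. The sketch below is a generic illustration with hypothetical names and example polynomials; it is not the modified algorithm from the talk. With sample set size $|E|$ and degree bound $\alpha$, repeating $t$ independent evaluations bounds the error by $(\alpha / |E|)^t$.

```python
import math
import random

def trials_needed(degree, sample_size, eps):
    """Smallest t with (degree / sample_size)**t <= eps."""
    return math.ceil(math.log(eps) / math.log(degree / sample_size))

def probably_zero(f, nvars, degree, eps, rng, sample_size=None):
    """Randomized zero test: False is always correct; True is wrong with probability <= eps."""
    sample_size = sample_size or 100 * degree
    for _ in range(trials_needed(degree, sample_size, eps)):
        point = [rng.randrange(sample_size) for _ in range(nvars)]
        if f(*point) != 0:
            return False  # a nonzero value proves f is not the zero polynomial
    return True

# Example minors of toy Jacobians (assumed for illustration):
# columns {0, 1} of [[1, 2a, 0], [0, 0, 1]] give the minor 1*0 - (2a)*0, identically zero;
# columns {0, 2} of [[1, 0, b], [0, 1, a]] give the minor 1*a - b*0 = a, not identically zero.
def zero_minor(a, b):
    return 1 * 0 - (2 * a) * 0

def nonzero_minor(a, b):
    return 1 * a - b * 0

rng = random.Random(7)
t = trials_needed(degree=2, sample_size=200, eps=1e-9)
print(t)
print(probably_zero(zero_minor, 2, 2, 1e-9, rng))
print(probably_zero(nonzero_minor, 2, 2, 1e-9, rng))
```

Enlarging the sample set lowers the number of evaluations needed for a given $\varepsilon$, so the choice of `sample_size` trades arithmetic cost per evaluation against the number of trials.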


Six-to-Infinity Theorem

Theorem (Six-To-Infinity Theorem (Matsen-Mossel-Steel 2008))

Suppose that the tree parameters $T_1, T_2$ are identifiable for a 2-tree mixture model for trees with six leaves. Then the tree parameters are identifiable for trees with $n$ leaves for all $n \ge 6$.

Only finitely many cases to check since it is enough to check for every pair of 2-multisets of 6-leaf trees


Identifiability for CFN and K3P

Theorem (H - Sullivant)

The tree parameters of the 2-tree CFN mixture model are generically identifiable for trees with at least six leaves and the tree parameters of the 2-tree K3P mixture model are generically identifiable for trees with at least four leaves.

Proof idea:

By the Six-To-Infinity Theorem of Matsen, Mossel, and Steel (2008) it is enough to prove identifiability for six-leaf trees

There are 22,773 cases to check up to symmetry

Run the main algorithm for each case to find a certificate of identifiability

In one case the algorithm failed, but we were able to compute a degree-bounded Gröbner basis in this case


Why Did the Algorithm Fail?

Different prime ideals can have the same matroid

We conjecture that the ideals we get from the trees below define the same matroid despite being different ideals

[Figure: four six-leaf trees $T_1$, $S_1$, $T_2$, $S_2$ on leaves 1 to 6]


Phylogenetic Networks

Recent tool that has emerged to model evolutionary phenomena that are non-treelike such as horizontal gene transfer

Solid edges are called tree edges

Dotted edges are reticulation edges which represent horizontal gene transfer

Networks can be thought of as cycles connected by trees

[Figure: a phylogenetic network on leaves 1, 2, 3, 4]


Phylogenetic Networks

As the number of cycles and number of allowable reticulation edges increases the model becomes increasingly complicated

A good starting point is a single cycle with a single reticulation vertex, called a cycle network

Deleting a reticulation edge $e_i$ from the network $N$ gives a tree $T_i$

[Figure: (a) the cycle network $N$ on leaves 1 to 4 with edges $e_1, \dots, e_8$; (b) the tree $T_1$ with edges $e_1, e_2, e_3, e_4, e_6, e_7, e_8$; (c) the tree $T_2$ with edges $e_1, \dots, e_7$]


Phylogenetic Network Models

A model for trees $\psi_T$ gives us a model $\psi_N$ for cycle networks where

$\psi_N = \lambda \psi_{T_1} + (1 - \lambda) \psi_{T_2}$

This is not the same as a mixture model since the parameters on each tree are not independent

[Figure: the cycle network $N$ and the trees $T_1$, $T_2$, as on the previous slide]


Identifiability for Phylogenetic Network Models

If $T$ is one of the trees obtained from a network $N$ then $\operatorname{im}(\psi_T) \subseteq \operatorname{im}(\psi_N)$, so in general the cycle-network parameter is not identifiable

Gross and Long suggested limiting the question to large cycle networks (cycle size $k \ge 4$)

They proved that the network parameter is identifiable for large cycle networks under the JC model

Similar to the tree case, they show that the question can be reduced to a finite number of cases and then computed ideals explicitly in these cases


Identifiability for Phylogenetic Network Models

Theorem (H - Sullivant)

The semi-directed network parameter of large-cycle K2P and K3P network models is generically identifiable.

Proof idea:

Use results of Gross and Long to reduce to a finite number of cases

Use our matroid algorithm to prove identifiability in each case


Summary

Algebraic matroids can be used to show discrete parameters are generically identifiable

Using matroids allows us to avoid computing $I(M)$

Using the Schwartz-Zippel Lemma we can completely avoid computing over $k(\theta)$ and give a certificate of generic identifiability with probability $1 - \varepsilon$

We used it to prove that the tree parameters of 2-tree CFN and K3P mixture models are generically identifiable

We also used this method to prove that the network parameter in K2P and K3P large-cycle network models is generically identifiable


References

Elizabeth S. Allman, Sonia Petrovic, John A. Rhodes, and Seth Sullivant. Identifiability of two-tree mixtures for group-based models. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(3):710–722, 2010.

Elizabeth Gross and Colby Long. Distinguishing phylogenetic networks. SIAM Journal on Applied Algebra and Geometry, 2(1):72–93, 2018.

Colby Long and Seth Sullivant. Identifiability of 3-class Jukes-Cantor mixtures. Adv. in Appl. Math., 64:89–110, 2015.

Frederick A. Matsen, Elchanan Mossel, and Mike Steel. Mixed-up trees: the structure of phylogenetic mixtures. Bull. Math. Biol., 70(4):1115–1139, 2008.

Zvi Rosen. Computing algebraic matroids. arXiv preprint arXiv:1403.8148, 2014.

Seth Sullivant. Algebraic statistics, volume 194 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2018.

