Identifiability in Phylogenetics Using
Algebraic Matroids
Ben Hollering and Seth Sullivant
North Carolina State University
April 9, 2020
Ben Hollering (NCSU) Identifiability using Matroids April 9, 2020 1 / 32
Phylogenetics
Problem
Given a collection of species, find the tree that explains their evolutionary history.
[Figure: three candidate trees, with leaf orderings (Gorilla, Chimp, Human), (Human, Chimp, Gorilla), and (Chimp, Gorilla, Human)]
Building Trees with DNA Sequence Data
DNA bases are A, T, G, C
DNA sequences of related species all evolved from some common ancestor
Align sequences for a gene that appears in all species
Human : GATCTCAAGGAC
Chimp : GGCCTCAAGGAT
Gorilla : GATCTCCAGGCA
Phylogenetic Models
[Figure: tree with leaves Gorilla, Chimp, Human]
Human : GATCTCAAGGAC
Chimp : GGCCTCAAGGAT
Gorilla : GATCTCCAGGCA
We label the leaves of the tree with the base that each species has at a fixed site in their DNA
Each tree gives a family of distributions on columns in the alignment
Maximum Likelihood Estimation can then be used to find the tree that maximizes the probability of the data
Phylogenetic Models
[Figure: the tree with its leaves labeled A, T, C, the bases at a single site of the alignment]
Human : AATGGGACATGC
Chimp : AATGGCACATGT
Gorilla : AACGGGACATAA
Phylogenetic Models
Assume each site evolves independently
Phylogenetic models are hidden variable graphical models
Each leaf v is an observed random variable X_v ∈ {A, C, G, T}
Each internal node v is a hidden random variable Y_v
Associate a transition matrix M^e to each edge e = (u, v) and a distribution π to the root
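[Figure: rooted tree with hidden nodes Y1 (root) and Y2, leaves X1, X2, X3, and edge matrices M0 (Y1 to Y2), M1 (Y1 to X1), M2 (Y2 to X2), M3 (Y2 to X3)]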
Phylogenetic Models
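[Figure: rooted tree with hidden nodes Y1 (root) and Y2, leaves X1, X2, X3, and edge matrices M0 (Y1 to Y2), M1 (Y1 to X1), M2 (Y2 to X2), M3 (Y2 to X3)]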
The probability of observing (x1, x2, x3) ∈ {A, C, G, T}^3 is
\[
P(x_1, x_2, x_3) = \sum_{y_1} \sum_{y_2} \pi_{y_1} \, M^0_{y_1,y_2} \, M^1_{y_1,x_1} \, M^2_{y_2,x_2} \, M^3_{y_2,x_3}
\]
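As a rough illustration, the sum over the hidden states y1 and y2 can be evaluated directly. This is only a sketch: the helper `jc_like` and the numerical values below are assumptions made for the example and do not come from the talk.

```python
from itertools import product

BASES = "ACGT"

def leaf_distribution(pi, M0, M1, M2, M3):
    """Return P(x1, x2, x3) for every leaf pattern by summing over y1 and y2."""
    n = len(pi)
    P = {}
    for x1, x2, x3 in product(range(n), repeat=3):
        total = 0.0
        for y1 in range(n):
            for y2 in range(n):
                total += (pi[y1] * M0[y1][y2] * M1[y1][x1]
                          * M2[y2][x2] * M3[y2][x3])
        P[(BASES[x1], BASES[x2], BASES[x3])] = total
    return P

def jc_like(p):
    """Hypothetical transition matrix: move to each other base with probability p."""
    return [[1 - 3 * p if i == j else p for j in range(4)] for i in range(4)]

pi = [0.25] * 4  # assumed uniform root distribution
P = leaf_distribution(pi, jc_like(0.05), jc_like(0.02), jc_like(0.03), jc_like(0.04))
print(len(P), sum(P.values()))
```

The 64 values form one point of the model, so they should sum to 1 up to rounding.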
Types of Phylogenetic Models
First require that M^e = exp(Q^e t_e) for a rate matrix Q^e and parameter t_e
Further restrictions can be imposed on the rate matrices
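\[
\text{CFN: } \begin{pmatrix} * & \alpha \\ \alpha & * \end{pmatrix}
\qquad
\text{K3P: } \begin{pmatrix} * & \beta & \alpha & \gamma \\ \beta & * & \gamma & \alpha \\ \alpha & \gamma & * & \beta \\ \gamma & \alpha & \beta & * \end{pmatrix}
\]
\[
\text{JC: } \begin{pmatrix} * & \alpha & \alpha & \alpha \\ \alpha & * & \alpha & \alpha \\ \alpha & \alpha & * & \alpha \\ \alpha & \alpha & \alpha & * \end{pmatrix}
\qquad
\text{K2P: } \begin{pmatrix} * & \beta & \alpha & \beta \\ \beta & * & \beta & \alpha \\ \alpha & \beta & * & \beta \\ \beta & \alpha & \beta & * \end{pmatrix}
\]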
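A minimal sketch of how a transition matrix might be obtained from a K3P rate matrix. It assumes the usual convention that each diagonal entry ∗ is chosen so the row sums to zero, and it approximates the matrix exponential with a truncated Taylor series, which is an illustrative choice rather than anything prescribed in the talk.

```python
import math

def k3p_rate_matrix(alpha, beta, gamma):
    """K3P rate matrix in the layout above; the diagonal makes each row sum to zero (assumed)."""
    d = -(alpha + beta + gamma)
    return [
        [d, beta, alpha, gamma],
        [beta, d, gamma, alpha],
        [alpha, gamma, d, beta],
        [gamma, alpha, beta, d],
    ]

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(Q, t, terms=40):
    """Approximate exp(Q t) by a truncated Taylor series (fine for small Q t)."""
    n = len(Q)
    Qt = [[q * t for q in row] for row in Q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, Qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

M = mat_exp(k3p_rate_matrix(0.1, 0.2, 0.3), t=0.5)
print([round(sum(row), 12) for row in M])
```

Setting alpha = beta = gamma recovers JC, and beta = gamma recovers K2P in the layout shown on the slide.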
Algebraic Perspective on Phylogenetic Models
Once we fix a tree T with n leaves we get a polynomial map in the entries of π and the M^e
ψ_T : Θ_T → R^{4^n}
The phylogenetic model associated to T is M_T = im(ψ_T) ⊆ R^{4^n}
Θ_T ⊂ R^d is the space of numerical parameters (rate matrices Q^e and time parameters t_e)
This gives a family of parametric algebraic statistical models indexed by the discrete parameter T
Let V_T be the Zariski closure of the model
Phylogenetic Mixture Models
Mixture models can be used to model more complicated evolutionary events such as horizontal gene transfer or hybridization
The 2-tree mixture model for trees T1 and T2 is parameterized by
ψ_{T1,T2} : Θ_{T1} × Θ_{T2} × [0, 1] → Δ_{4^n − 1}
defined by
ψ_{T1,T2}(θ1, θ2, λ) = λ ψ_{T1}(θ1) + (1 − λ) ψ_{T2}(θ2)
This gives a family of parametric algebraic statistical models indexed by multisets {T1, T2}
The Zariski closure of the image is the join variety V_{T1} ∗ V_{T2}
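The mixture map is a convex combination of two points of the simplex. A small sketch with made-up distributions (the vectors `p1`, `p2` and the weight are assumptions for illustration, standing in for ψ_{T1}(θ1) and ψ_{T2}(θ2)):

```python
from fractions import Fraction

def two_tree_mixture(p1, p2, lam):
    """Return lam * p1 + (1 - lam) * p2 for two distributions on the same patterns."""
    if not 0 <= lam <= 1:
        raise ValueError("lam must lie in [0, 1]")
    if len(p1) != len(p2):
        raise ValueError("distributions must have the same length")
    return [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]

# Toy stand-ins for the two tree distributions (not real model output).
p1 = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4), Fraction(0)]
p2 = [Fraction(1, 4)] * 4
mixed = two_tree_mixture(p1, p2, Fraction(1, 3))
print(mixed, sum(mixed))
```

Exact rational arithmetic is used here only so that the result visibly stays in the simplex.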
Identifiability
Definition
A parametric statistical model is identifiable if it gives a one-to-one map from parameters to probability distributions.
Identifiability is needed for consistency of inference
In phylogenetics, the identifiability of the tree parameter is particularly important
Can T or {T1, T2} be recovered from DNA sequence data?
Generic Identifiability of Discrete Parameters
Definition
Let {M_s}_{s=1}^{k} be a collection of algebraic models that sit inside the probability simplex Δ_r. The discrete parameter s is generically identifiable if for each 2-subset {s1, s2} ⊂ [k]
dim(M_{s1} ∩ M_{s2}) < min(dim(M_{s1}), dim(M_{s2}))
Algebraic Tools for Testing Generic Identifiability
Let k[p] = k[p1, p2, . . . , pr] denote the polynomial ring in indeterminates p1, p2, . . . , pr
Definition
Let S ⊆ k^r. The vanishing ideal of S, denoted I(S), is
I(S) = {f ∈ k[p] : f(a) = 0 for all a ∈ S} ⊆ k[p]
The ideal I_T = I(M_T) is called the ideal of phylogenetic invariants of T
Algebraic Tools for Testing Generic Identifiability
Proposition
Let M1 and M2 be two irreducible algebraic models which sit inside the probability simplex Δ_r. If there exist polynomials f1 and f2 such that
f1 ∈ I(M1) \ I(M2) and f2 ∈ I(M2) \ I(M1)
then dim(M1 ∩M2) < min(dim(M1),dim(M2)).
Since the models are irreducible, the ideals I(M_s) are prime
If the models are the same dimension, then it suffices to show I(M1) ≠ I(M2)
Finding polynomials f1 and f2 can be quite difficult
Generic Identifiability of Tree Parameters
The tree parameters of the JC, CFN, K2P, and K3P models are generically identifiable
The tree parameters of the 2-tree JC and K2P mixture models are generically identifiable (Allman-Petrovic-Rhodes-Sullivant 2009)
The tree parameters of the 3-tree JC mixture model are generically identifiable (Long-Sullivant 2015)
Matroids
A matroid is a combinatorial object used to axiomatize independence
Characterized by a ground set E and a collection of independent sets I ⊆ 2^E
Definition
A matroid is a pair (E, I), where I ⊆ 2^E satisfies
1. ∅ ∈ I
2. If S ⊆ T and T ∈ I, then S ∈ I
3. If S, T ∈ I and #S < #T, then there exists e ∈ T \ S such that S ∪ {e} ∈ I
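The three axioms can be checked by brute force on a small ground set. The sketch below is illustrative only; the function name `is_matroid` and the two example families are assumptions, not part of the talk.

```python
from itertools import combinations

def is_matroid(ground, independent):
    """Check the three independence axioms for a family of subsets of `ground`."""
    ground = frozenset(ground)
    I = {frozenset(S) for S in independent}
    if any(not S <= ground for S in I):
        return False
    # Axiom 1: the empty set is independent.
    if frozenset() not in I:
        return False
    # Axiom 2: closed under subsets (deleting one element at a time suffices).
    if any(T - {e} not in I for T in I for e in T):
        return False
    # Axiom 3: exchange property.
    for S in I:
        for T in I:
            if len(S) < len(T) and not any(S | {e} in I for e in T - S):
                return False
    return True

# Uniform matroid U_{2,3}: every subset of {1, 2, 3} with at most 2 elements.
uniform_2_3 = [S for k in range(3) for S in combinations([1, 2, 3], k)]
# Fails exchange: {3} cannot be extended by an element of {1, 2}.
not_a_matroid = [(), (1,), (2,), (3,), (1, 2)]

print(is_matroid({1, 2, 3}, uniform_2_3), is_matroid({1, 2, 3}, not_a_matroid))
```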
Linear Matroids
Definition
A linear matroid is one where E ⊂ k^n is a finite subset, and S ∈ I if and only if S is linearly independent over k
Example (Linear Matroid)
\[
A = \begin{pmatrix} 1 & 1 & -1 & -2 \\ 3 & 1 & 2 & 4 \\ 0 & -1 & 1 & 2 \end{pmatrix}
\]
E = [4], indexing the columns of A
The independent sets are
∅, {1}, {2}, {3}, {4}, {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {1,2,3}, {1,2,4}.
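The independent sets in this example can be recomputed by testing each subset of columns for full column rank. A minimal sketch using exact rational Gaussian elimination (the helper names are assumptions for illustration):

```python
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    M = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for c in range(len(M[0]) if M else 0):
        pivot = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == len(M):
            break
    return r

def independent_sets(A):
    """All column subsets of A (1-indexed) that are linearly independent."""
    n = len(A[0])
    found = []
    for k in range(n + 1):
        for S in combinations(range(n), k):
            submatrix = [[row[j] for j in S] for row in A]
            if rank(submatrix) == k:
                found.append(tuple(j + 1 for j in S))
    return found

A = [[1, 1, -1, -2],
     [3, 1, 2, 4],
     [0, -1, 1, 2]]
print(independent_sets(A))
```

Columns 3 and 4 are proportional, which is why no independent set contains both.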
Algebraic Matroids
Since I(M_s) is a prime ideal, it defines an algebraic matroid on the set of coordinates E = {p_i : i ∈ [r + 1]} with independent sets
{S ⊆ E : I(M_s) ∩ C[S] = ⟨0⟩}
Let M_s = im(φ) with φ(θ_1, . . . , θ_d) = (φ_1(θ), . . . , φ_{r+1}(θ)) and let
\[
J(\phi) = \left( \frac{\partial \phi_j}{\partial \theta_i} \right), \quad 1 \le i \le d, \; 1 \le j \le r + 1
\]
The matroid defined by the columns of J(φ) over the fraction field C(θ) is the same as the matroid defined by I(M_s)
Let M(M_s) be the independence matroid of the model defined in either of these ways
Proving Identifiability with Algebraic Matroids
Proposition (H - Sullivant)
Let M1 and M2 be two irreducible algebraic models which sit inside the probability simplex Δ_r. Without loss of generality assume dim(M1) ≥ dim(M2). If there exists a subset S of the coordinates such that
S ∈ M(M2) \ M(M1)
then dim(M1 ∩M2) < min(dim(M1),dim(M2)).
Allows us to prove identifiability results without computing I(M_s)
Still requires symbolic computation over k(θ)
Specializing the Jacobian
Proposition
Let k be a field of characteristic zero and φ be a rational map. Then the matrix obtained by plugging generic parameter values into J(φ) gives a linear matroid over k which is the same as that defined by J(φ) with symbolic parameters over k(θ)
M(J(φ), k(θ)) = independence matroid over k(θ)
M(J(φ), k) = independence matroid over k obtained by plugging in random values for θ
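A toy sketch of the specialization step. The map φ(a, b) = (a^2, ab, b^2) is an assumed example, not a phylogenetic model; its Jacobian is written out by hand and evaluated at random integer points, and the column matroid is computed with exact rank tests.

```python
import random
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    M = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for c in range(len(M[0]) if M else 0):
        pivot = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][c] / M[r][c]
            M[i] = [x - f * y for x, y in zip(M[i], M[r])]
        r += 1
        if r == len(M):
            break
    return r

def jacobian(a, b):
    """Jacobian of phi(a, b) = (a^2, a*b, b^2): rows are parameters, columns are coordinates."""
    return [[2 * a, b, 0],
            [0, a, 2 * b]]

def column_matroid(J):
    """Independent column subsets (0-indexed) of the numeric matrix J."""
    n = len(J[0])
    return {S for k in range(n + 1) for S in combinations(range(n), k)
            if rank([[row[j] for j in S] for row in J]) == k}

rng = random.Random(0)
points = [(rng.randint(1, 10**6), rng.randint(1, 10**6)) for _ in range(3)]
matroids = [column_matroid(jacobian(a, b)) for a, b in points]
print(all(m == matroids[0] for m in matroids), sorted(matroids[0]))
```

For this toy map every set of at most two coordinates is independent and the full set is dependent, matching the single relation p1 p3 = p2^2 on the image.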
Certifying Identifiability with Algebraic Matroids
Algorithm 1: matroidSeparate
Input: Two maps φ1, φ2 parameterizing models M1 and M2 in k^n with dim(M1) ≥ dim(M2), and a number of trials t.
Output: A certificate S
for i = 1 to t do
    Randomly select T ⊆ [n] such that |T| ≤ dim(M2);
    if T ∈ M(J(φ2), k) \ M(J(φ1), k) then
        if T ∈ M(J(φ2), k(θ)) \ M(J(φ1), k(θ)) then
            S = T;
            break;
return S or report that no certificate was found.
Still requires symbolic computation over k(θ)
Embarrassingly parallel
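A sketch of the numerical screening stage of the algorithm only. It draws random subsets and random parameter values, and returns the first subset that is independent for the second Jacobian and dependent for the first. The symbolic confirmation over k(θ) is deliberately omitted, and the two toy maps, the function names, and the sampling range are all assumptions for illustration.

```python
import random
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    M = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for c in range(len(M[0]) if M else 0):
        pivot = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][c] / M[r][c]
            M[i] = [x - f * y for x, y in zip(M[i], M[r])]
        r += 1
        if r == len(M):
            break
    return r

def is_independent(J, S):
    """True if the columns of J indexed by S are linearly independent."""
    return rank([[row[j] for j in S] for row in J]) == len(S)

def jac1(a, b):
    """Jacobian of the toy map phi1(a, b) = (a, a^2, b)."""
    return [[1, 2 * a, 0], [0, 0, 1]]

def jac2(a, b):
    """Jacobian of the toy map phi2(a, b) = (a, b, a*b)."""
    return [[1, 0, b], [0, 1, a]]

def matroid_separate_numeric(jac1, jac2, n, dim2, num_params, trials, rng):
    """Return a candidate certificate T (0-indexed) or None; no symbolic check is done."""
    for _ in range(trials):
        size = rng.randint(1, dim2)
        T = tuple(sorted(rng.sample(range(n), size)))
        theta = [rng.randint(1, 10**6) for _ in range(num_params)]
        if is_independent(jac2(*theta), T) and not is_independent(jac1(*theta), T):
            return T
    return None

certificate = matroid_separate_numeric(
    jac1, jac2, n=3, dim2=2, num_params=2, trials=200, rng=random.Random(1))
print(certificate)
```

In this toy pair the only separating set is the first two coordinates, since a and a^2 are algebraically dependent in the first map but a and b are independent in the second. Each trial is independent of the others, which is what makes the search easy to run in parallel.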
The Schwartz-Zippel Lemma
Lemma (Schwartz-Zippel)
Let f ∈ k[x1, . . . , xn] be a non-zero polynomial of total degree α. Let E be a finite subset of k and let r1, . . . , rn be selected at random independently and uniformly from E. Then
\[
P(f(r_1, \ldots, r_n) = 0) \le \frac{\alpha}{|E|}.
\]
S ∉ M(J(φ1), k(θ)) if the corresponding minor of J(φ1) vanishes
Main algorithm can be modified to avoid symbolic computation and produce a certificate that holds with probability 1 − ε by using this lemma
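One way to read the lemma in practice: to test whether a minor of degree at most α vanishes identically, evaluate it at a random point of a set E with |E| ≥ α/ε. A nonzero value proves the minor is nonzero; a zero value is wrong with probability at most ε. The sketch below uses two hand-written toy minors; the function names and example polynomials are assumptions for illustration.

```python
import math
import random

def sample_set_size(degree, eps):
    """Smallest |E| with degree / |E| <= eps, from the Schwartz-Zippel bound."""
    return math.ceil(degree / eps)

def vanishes_at_random_point(f, num_vars, degree, eps, rng):
    """Evaluate f at a uniform random point of {0, ..., |E| - 1}^num_vars.

    False means f is certainly a nonzero polynomial.
    True means f is the zero polynomial, up to error probability eps.
    """
    size = sample_set_size(degree, eps)
    point = [rng.randrange(size) for _ in range(num_vars)]
    return f(*point) == 0

# Toy 2x2 minors written out by hand (integer arithmetic, so evaluation is exact).
zero_minor = lambda a, b: 1 * 0 - (2 * a) * 0   # identically zero
nonzero_minor = lambda a, b: 1 * a - b * 0      # equals a, degree 1

rng = random.Random(0)
print(vanishes_at_random_point(zero_minor, 2, 1, 1e-9, rng),
      vanishes_at_random_point(nonzero_minor, 2, 1, 1e-9, rng))
```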
Six-to-Infinity Theorem
Theorem (Six-To-Infinity Theorem (Matsen-Mossel-Steel 2008))
Suppose that the tree parameters T1, T2 are identifiable for a 2-tree mixture model for trees with six leaves. Then the tree parameters are identifiable for trees with n leaves for all n ≥ 6.
Only finitely many cases to check since it is enough to check for every pair of 2-multisets of 6 leaf trees
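To get a feel for the size of this finite check, the raw number of cases can be counted before any symmetry reduction. This sketch assumes the trees are leaf-labelled unrooted binary trees, of which there are (2n − 5)!! on n leaves; that assumption and the function names are not taken from the talk.

```python
from math import comb

def num_unrooted_binary_trees(n):
    """Number of leaf-labelled unrooted binary trees on n >= 3 leaves: (2n - 5)!!."""
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

def num_two_multisets(m):
    """Number of 2-multisets {T1, T2} drawn from m objects."""
    return comb(m + 1, 2)

trees = num_unrooted_binary_trees(6)
multisets = num_two_multisets(trees)
pairs = comb(multisets, 2)  # unordered pairs of distinct 2-multisets
print(trees, multisets, pairs)
```

Relabelling the six leaves permutes these cases, so the number of cases that actually need to be checked is much smaller than the raw count.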
Identifiability for CFN and K3P
Theorem (H - Sullivant)
The tree parameters of the 2-tree CFN mixture model are generically identifiable for trees with at least six leaves and the tree parameters of the 2-tree K3P mixture model are generically identifiable for trees with at least four leaves.
Proof idea:
By the Six-To-Infinity Theorem of Matsen, Mossel, and Steel (2008) it is enough to prove identifiability for six leaf trees
There are 22,773 cases to check up to symmetry
Run the main algorithm for each case to find a certificate of identifiability
The algorithm failed in one case, but there we were able to compute a degree-bounded Gröbner basis
Why Did the Algorithm Fail?
Different prime ideals can have the same matroid
We conjecture that the ideals we get from the trees below are different but have the same matroid
[Figure: four six-leaf trees on leaves 1 to 6, labeled T1, S1, T2, and S2]
Phylogenetic Networks
A recent tool for modeling non-treelike evolutionary phenomena such as horizontal gene transfer
Solid edges are called tree edges
Dotted edges are reticulation edges, which represent horizontal gene transfer
Networks can be thought of as cycles connected by trees
[Figure: a phylogenetic network on leaves 1, 2, 3, 4]
Phylogenetic Networks
As the number of cycles and number of allowable reticulation edges increases the model becomes increasingly complicated
A good starting point is a single cycle with a single reticulation vertex, called a cycle network
Deleting a reticulation edge ei from the network N gives a tree Ti
[Figure: (a) a cycle network N on leaves 1 to 4 with edges e1, . . . , e8; (b) the tree T1; (c) the tree T2]
Phylogenetic Network Models
A model for trees ψ_T gives us a model ψ_N for cycle networks where
ψ_N = λ ψ_{T1} + (1 − λ) ψ_{T2}
This is not the same as a mixture model, since the parameters on each tree are not independent
[Figure: (a) the cycle network N on leaves 1 to 4 with edges e1, . . . , e8; (b) the tree T1; (c) the tree T2]
Identifiability for Phylogenetic Network Models
If T is one of the trees obtained from a network N then im(ψ_T) ⊆ im(ψ_N), so in general the cycle-network parameter is not identifiable
Gross and Long suggested limiting the question to large cycle networks (cycle size k ≥ 4)
They proved that the network parameter is identifiable for large cycle networks under the JC model
Similar to the tree case, they showed that the question can be reduced to a finite number of cases and then computed the ideals explicitly in these cases
Identifiability for Phylogenetic Network Models
Theorem (H - Sullivant)
The semi-directed network parameter of large-cycle K2P and K3P network models is generically identifiable.
Proof idea:
Use results of Gross and Long to reduce to a finite number of cases
Use our matroid algorithm to prove identifiability in each case
Summary
Algebraic matroids can be used to show discrete parameters are generically identifiable
Using matroids allows us to avoid computing I(M)
Using the Schwartz-Zippel Lemma we can completely avoid computing over k(θ) and give a certificate of generic identifiability with probability 1 − ε
We used it to prove that the tree parameters of 2-tree CFN and K3P mixture models are generically identifiable
We also used this method to prove that the network parameter inK2P and K3P large-cycle network models is generically identifiable
References
Elizabeth S. Allman, Sonia Petrovic, John A. Rhodes, and Seth Sullivant. Identifiability of two-tree mixtures for group-based models. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(3):710–722, 2010.
Elizabeth Gross and Colby Long. Distinguishing phylogenetic networks. SIAM Journal on Applied Algebra and Geometry, 2(1):72–93, 2018.
Colby Long and Seth Sullivant. Identifiability of 3-class Jukes-Cantor mixtures. Adv. in Appl. Math., 64:89–110, 2015.
Frederick A. Matsen, Elchanan Mossel, and Mike Steel. Mixed-up trees: the structure of phylogenetic mixtures. Bull. Math. Biol., 70(4):1115–1139, 2008.
Zvi Rosen. Computing algebraic matroids. arXiv preprint arXiv:1403.8148, 2014.
Seth Sullivant. Algebraic statistics, volume 194 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2018.