STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization)
Laura Salter Kubatko
Departments of Statistics and Evolution, Ecology, and Organismal Biology
The Ohio State University
Tutorial info at: www.stat.osu.edu/~lkubatko/uga2012.html
UGA Bioinformatics Symposium 2012
Outline

What is STEM-hy?
  Assumptions and Methods
Using STEM-hy for Species Tree Estimation
  Data Preparation
  Example 1: Small Example with Intraspecific Sampling
  Example 2: Small Example with Missing Data
  Example 3: Bootstrap Analysis
Using STEM-hy for Hybridization Analysis
  Background: STEM's Hybrid Species Models
  Example 4: Hybridization in Heliconius
Practice Datasets
What is STEM-hy?
- STEM-hy is a program to perform maximum likelihood analysis for estimation of the species tree from multilocus data under the coalescent process. It includes the capability of evaluating hybrid taxa.
- Basic functions:
  - Return the ML species tree.
  - Search the space of all species trees and return the k trees with the highest likelihoods found.
  - Compute the likelihood of a user-specified tree with branch lengths.
  - Find optimal branch lengths on a user-specified tree.
  - Carry out a bootstrap analysis to obtain bootstrap support values for nodes in the species tree.
  - Evaluate hypotheses of hybridization in a model selection framework.
Assumptions
- No recombination within loci
- Free recombination between loci
- No gene flow following speciation
- The only source of variability in single-gene histories is the coalescent process
- There is a single θ for the entire tree, for each locus
- Evolutionary rates may vary across loci
Methods: ML Estimate of the Species Tree
- Liu et al. (2009) showed that the ML estimate of the species tree can be computed by sequentially clustering minimum observed divergence times between pairs of species across genes.
- They also showed that when gene trees are known without error, the ML species tree is a consistent estimator.
- A similar result was obtained by Roch & Mossel (2010); they call their estimator the GLASS tree (an acronym for Global LAteSt Split, based on the algorithm they developed to compute it).
- STEM computes the ML estimate of the species tree this way.
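The clustering idea can be sketched in a few lines. This is a minimal illustration, not STEM's own code: the function name `max_tree` and the dictionary layout are assumptions made for the example, and the distances are the minimum divergence times (in coalescent units) from Example 1 of this tutorial.

```python
from itertools import combinations


def max_tree(names, dist):
    """Single-linkage clustering of minimum divergence times.

    names: list of species labels.
    dist:  dict mapping frozenset({a, b}) to the minimum divergence time
           between species a and b across all genes.
    Returns a Newick string whose node heights are the clustering times.
    """
    # Each cluster is (newick text, node height, set of member species).
    clusters = [(n, 0.0, frozenset([n])) for n in names]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Between-cluster distance is the smallest species-pair distance.
            d = min(dist[frozenset((a, b))]
                    for a in clusters[i][2] for b in clusters[j][2])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        (ni, hi, mi), (nj, hj, mj) = clusters[i], clusters[j]
        merged = ("(%s:%.5f,%s:%.5f)" % (ni, d - hi, nj, d - hj), d, mi | mj)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0] + ";"


names = ["Species1", "Species2", "Species3", "Species4"]
dist = {
    frozenset(("Species1", "Species2")): 1.2,
    frozenset(("Species1", "Species3")): 2.86,
    frozenset(("Species1", "Species4")): 3.46,
    frozenset(("Species2", "Species3")): 1.0,
    frozenset(("Species2", "Species4")): 2.4,
    frozenset(("Species3", "Species4")): 1.1,
}
print(max_tree(names, dist))
```

On these inputs the sketch joins Species2 and Species3 at 1.0, adds Species4 at 1.1, and adds Species1 at 1.2, which is the tree reported for Example 1.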
Methods: Estimation of ML Times for an Arbitrary Species Tree
- The results of Liu et al. (2009) can be extended to derive the ML estimates of the speciation times for an arbitrary species tree.
- Thus, the likelihood of any species tree can be readily computed by using this result to obtain ML branch lengths.
- This is important in that it allows us to compare alternative phylogenetic hypotheses.
Methods: Searching Species Tree Space for Trees of High Likelihood
- A simulated annealing algorithm is used to search the space of all species trees for trees that have high likelihoods.
- The k best trees found during the search are saved and printed to a file (k is set by the user).
- Exploration of the likelihood surface is particularly important for many of these problems.
- The details of the simulated annealing algorithm are similar to those given in Salter & Pearl (2001).
Features of STEM-hy
- No limits (that I know of) on the number of taxa or the number of loci.
- Can handle intraspecific sampling.
- Allows information concerning mutation rate for each locus to be used in the analysis.
- Can handle different taxon samples across genes.
- Version 1.0 is written in Java (using Clojure).
Data Preparation - Gene Trees
- STEM-hy takes as its input one gene tree for each locus.
- Thus, a first step in an analysis using STEM-hy is to estimate gene trees with branch lengths for each locus.
- Any method can be used to do this, but note a couple of requirements:
  - Branch lengths are assumed to be in units of expected number of substitutions per site per unit time.
  - Branch lengths must be estimated subject to a molecular clock. This is not checked by the program.
- Gene trees must be fully resolved; however, polytomies can be included by setting branch lengths to 0 for an arbitrary resolution of the polytomy.
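Because the clock requirement is not checked by the program, it can be worth checking beforehand that every tip of a gene tree is the same distance from the root. The sketch below is one way to do that with a small hand-written Newick reader; the function names `tip_depths` and `is_ultrametric` and the tolerance are assumptions for illustration, and the reader handles only plain tip labels with branch lengths (no internal node labels or quoted names).

```python
import re


def tip_depths(newick):
    """Return {tip name: root-to-tip distance} for a Newick string."""
    # Drop an optional leading rate multiplier such as "[1.0]".
    newick = re.sub(r"^\s*\[[^\]]*\]", "", newick)
    tokens = re.findall(r"[(),:;]|[^(),:;\s]+", newick)
    pos = 0

    def subtree():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                          # consume "("
            tips = subtree()
            while tokens[pos] == ",":
                pos += 1
                tips += subtree()
            if tokens[pos] != ")":
                raise ValueError("expected ')' in Newick string")
            pos += 1                          # consume ")"
        else:
            tips = [(tokens[pos], 0.0)]       # a tip label
            pos += 1
        length = 0.0
        if pos < len(tokens) and tokens[pos] == ":":
            length = float(tokens[pos + 1])
            pos += 2
        # Push every tip below this node further from the root.
        return [(name, d + length) for name, d in tips]

    return dict(subtree())


def is_ultrametric(newick, tol=1e-9):
    """True if all root-to-tip distances agree to within tol."""
    depths = list(tip_depths(newick).values())
    return max(depths) - min(depths) <= tol


tree = ("[1.0](((Name1:0.00123,Name2:0.00123):0.00123,"
        "(Name3:0.00121,Name4:0.00121):0.00125):0.0010,"
        "((Name5:0.0010,Name6:0.0010):0.0014,"
        "(MyName7:0.0012,Name8:0.0012):0.0012):0.00106);")
print(is_ultrametric(tree))
```

The tolerance matters in practice: trees written with rounded branch lengths may differ in the last printed digit, so a looser `tol` may be appropriate for real data.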
Data Preparation - Population Genetics Parameters
- A value of the parameter θ = 4Nµ must be provided. Note that this is the "per-site θ", not a "per-locus" value as used by other population genetics programs.
- This will be used to convert gene tree branch lengths to coalescent units (number of 2N generations) by dividing all gene tree branch lengths by θ.
- Estimates of θ could be obtained by standard methods. Typical values of θ will be between 0.001 and 0.1.
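As a small worked example of the conversion, the snippet below divides branch lengths by θ. The function name is hypothetical, and the sketch deliberately leaves out per-locus rate multipliers, since how they enter the calculation is not spelled out here.

```python
def to_coalescent_units(branch_lengths, theta):
    """Divide branch lengths (substitutions per site) by the per-site theta."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    return [b / theta for b in branch_lengths]


# Branch lengths of the kind used in the small example, with theta = 0.001.
lengths = [0.00123, 0.00246, 0.00346]
scaled = to_coalescent_units(lengths, theta=0.001)
print([round(x, 5) for x in scaled])
```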
Data Preparation - Population Genetics and Mutation Parameters
- Each locus can also be given a rate multiplier.
- These can adjust for
  - Variation in mutation rate across loci.
  - Ploidy (e.g., haploid loci such as mtDNA should be given a rate of 0.5).
- At the least, one should estimate rate variation from the data by something like the following:
  - Compute average pairwise sequence divergence of each sequence to the outgroup.
  - Divide all of these values by their overall mean, and assign that number as the rate multiplier for each gene.
  - Adjust specific genes for ploidy, if necessary.
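The recipe above might look like the following in code. This is only a sketch under stated assumptions: divergence is measured as an uncorrected p-distance (one choice among several), alignments are held in plain dictionaries, and the names `p_distance` and `rate_multipliers` as well as the toy sequences are made up for the example.

```python
def p_distance(a, b):
    """Proportion of differing sites, skipping gaps and ambiguous bases."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x not in "-N?" and y not in "-N?"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def rate_multipliers(alignments, outgroup):
    """alignments: {gene: {taxon: sequence}}; returns {gene: relative rate}."""
    mean_div = {}
    for gene, seqs in alignments.items():
        # Average divergence of every ingroup sequence to the outgroup.
        dists = [p_distance(s, seqs[outgroup])
                 for taxon, s in seqs.items() if taxon != outgroup]
        mean_div[gene] = sum(dists) / len(dists)
    overall = sum(mean_div.values()) / len(mean_div)
    # Dividing by the overall mean makes the multipliers average to 1.
    return {gene: d / overall for gene, d in mean_div.items()}


toy = {
    "geneA": {"out": "AAAAAAAAAA", "t1": "AAAAAAAAAT", "t2": "AAAAAAAATT"},
    "geneB": {"out": "AAAAAAAAAA", "t1": "AAAAAATTTT", "t2": "AAAAATTTTT"},
}
rates = rate_multipliers(toy, outgroup="out")
print(rates)
```

A ploidy adjustment (for example, multiplying a haploid locus by 0.5) would be applied to the resulting values afterwards.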
Example 1: Small Example with Intraspecific Sampling
- Start with a small example where we can work things out by hand
- Four species, eight lineages, and two loci (N = 2)
- Suppose that the gene trees for the two loci are

[Figure: the two gene trees on lineages 1 to 8. Node heights shown for locus 1: 3.46, 2.46, 1.23, 1.20, 2.40, 1.00, 1.20. Node heights shown for locus 2: 3.50, 2.86, 2.54, 1.23, 1.20, 1.00, 1.10.]
STEM Example 1: Small Example with Intraspecific Sampling
- Now we can run STEM and look at output
- First, let's compute the relevant distances by hand:

{D1ab} (locus 1):
       2     3     4     5     6     7     8
1   1.23  2.46  2.46  3.46  3.46  3.46  3.46
2      -  2.46  2.46  3.46  3.46  3.46  3.46
3      -     -  1.2   3.46  3.46  3.46  3.46
4      -     -     -  3.46  3.46  3.46  3.46
5      -     -     -     -  1.0   2.4   2.4
6      -     -     -     -     -  2.4   2.4
7      -     -     -     -     -     -  1.2

{D2ab} (locus 2):
       2     3     4     5     6     7     8
1   1.23  2.56  2.56  2.86  2.86  3.5   3.5
2      -  2.56  2.56  2.86  2.86  3.5   3.5
3      -     -  1.2   2.86  2.86  3.5   3.5
4      -     -     -  2.86  2.86  3.5   3.5
5      -     -     -     -  1.0   3.5   3.5
6      -     -     -     -     -  3.5   3.5
7      -     -     -     -     -     -  1.1
Taking the minimum over the lineages sampled from each pair of species reduces each lineage-level matrix to a species-level matrix.

Locus 1:
       S1    S2    S3    S4
S1      -   1.2  3.46  3.46
S2            -   1.0   2.4
S3                  -   1.2
S4                        -

Locus 2:
       S1    S2    S3    S4
S1      -   1.2  2.86   3.5
S2            -   1.0   3.5
S3                  -   1.1
S4                        -
Taking the entry-wise minimum of the two species-level matrices across loci gives the combined matrix:

       S1    S2    S3    S4
S1      -   1.2  2.86  3.46
S2            -   1.0   2.4
S3                  -   1.1
S4                        -
Sequentially clustering the smallest entries of the combined matrix gives the species tree: Species 2 and 3 join at 1.0, Species 4 joins them at 1.1, and Species 1 joins at 1.2.

[Figure: species tree (1,(4,(2,3))) with node heights 1.2, 1.1, and 1.0.]
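The hand calculation above can be reproduced with a short script. It is a sketch only: the lineage-to-species assignment (lineages 1 to 3 in S1, 4 and 5 in S2, 6 and 7 in S3, 8 in S4) is taken to match the settings file used for this example, and the helper names `full` and `species_min` are made up here.

```python
from itertools import combinations

# Assumed assignment of the eight lineages to the four species.
species = {"S1": [1, 2, 3], "S2": [4, 5], "S3": [6, 7], "S4": [8]}


def full(upper):
    """Turn upper-triangle rows for lineages 1..8 into a pair lookup."""
    d = {}
    for i, row in enumerate(upper, start=1):
        for j, value in enumerate(row, start=i + 1):
            d[frozenset((i, j))] = value
    return d


locus1 = full([
    [1.23, 2.46, 2.46, 3.46, 3.46, 3.46, 3.46],
    [2.46, 2.46, 3.46, 3.46, 3.46, 3.46],
    [1.2, 3.46, 3.46, 3.46, 3.46],
    [3.46, 3.46, 3.46, 3.46],
    [1.0, 2.4, 2.4],
    [2.4, 2.4],
    [1.2],
])
locus2 = full([
    [1.23, 2.56, 2.56, 2.86, 2.86, 3.5, 3.5],
    [2.56, 2.56, 2.86, 2.86, 3.5, 3.5],
    [1.2, 2.86, 2.86, 3.5, 3.5],
    [2.86, 2.86, 3.5, 3.5],
    [1.0, 3.5, 3.5],
    [3.5, 3.5],
    [1.1],
])


def species_min(lineage_dist):
    """Minimum lineage distance for every pair of species."""
    return {frozenset((a, b)): min(lineage_dist[frozenset((x, y))]
                                   for x in species[a] for y in species[b])
            for a, b in combinations(species, 2)}


per_locus = [species_min(locus1), species_min(locus2)]
# Combine loci by taking the entry-wise minimum.
d_ab = {pair: min(m[pair] for m in per_locus) for pair in per_locus[0]}
for pair, value in sorted(d_ab.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), value)
```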
Example 1: Small Example with Intraspecific Sampling
Step 1: Prepare the gene trees
Option 1: Place all gene trees in a single file called genetrees.tre:
- Newick format required
- One gene tree per line
- Rate multipliers must be given in brackets in front of each gene tree

[1.0](((Name1:0.00123,Name2:0.00123):0.00123,(Name3:0.00121,Name4:0.00121):0.00125):0.0010,((Name5:0.0010,Name6:0.0010):0.0014,(MyName7:0.0012,Name8:0.0012):0.0012):0.00106);
[1.0]((((Name1:0.00123,Name2:0.00123):0.00133,(Name3:0.0012,Name4:0.0012):0.00134):0.0003,(Name5:0.0010,Name6:0.0010):0.00186):0.00064,(MyName7:0.0011,Name8:0.0011):0.0024);
Example 1: Small Example with Intraspecific Sampling
Step 1: Prepare the gene trees
Option 2: Place sets of gene trees in separate files
- File names will be supplied to STEM-hy in the settings file
- Rate multipliers will also be supplied in the settings file
- All genes in a single file are assumed to have the same rate

genetrees1.tre:
(((Name1:0.00123,Name2:0.00123):0.00123,(Name3:0.00121,Name4:0.00121):0.00125):0.0010,((Name5:0.0010,Name6:0.0010):0.0014,(MyName7:0.0012,Name8:0.0012):0.0012):0.00106);

genetrees2.tre:
((((Name1:0.00123,Name2:0.00123):0.00133,(Name3:0.0012,Name4:0.0012):0.00134):0.0003,(Name5:0.0010,Name6:0.0010):0.00186):0.00064,(MyName7:0.0011,Name8:0.0011):0.0024);
Example 1: Small Example with Intraspecific Sampling
Step 2: Prepare the settings file - input option 1
YAML format: headings with indented parameters defined below

properties:
  run: 1    #0=user-tree, 1=MLE, 2=search, 3=hybridization, 4=bootstrap
  theta: 0.001
  num saved trees: 15
  beta: 0.0005
  seed: 3435893
species:
  Species1: Name1, Name2, Name3
  Species2: Name4, Name5
  Species3: Name6, MyName7
  Species4: Name8
Example 1: Small Example with Intraspecific Sampling
Step 2: Prepare the settings file - input option 2
YAML format: headings with indented parameters defined below

properties:
  run: 1    #0=user-tree, 1=MLE, 2=search, 3=hybridization, 4=bootstrap
  theta: 0.001
  num saved trees: 15
  beta: 0.0005
  seed: 3435893
species:
  Species1: Name1, Name2, Name3
  Species2: Name4, Name5
  Species3: Name6, MyName7
  Species4: Name8
files:
  genetrees1.tre: 1.0    # notice the space after each ':'
  genetrees2.tre: 1.0
Example 1: Small Example with Intraspecific Sampling
Step 2: Prepare the settings file
Note: Some parameters will only be used for certain "run" settings. They are ignored otherwise, and can be omitted from the settings file.
Example 1: Small Example with Intraspecific Sampling - Results
- Analysis 1: Find the ML species tree (run with run: 1)
- Run at the command line with: java -jar stem-hy.jar

*****************************************
           Welcome to STEM 2.0
*****************************************
The settings file was successfully parsed...
Using theta = 0.0010
The settings file contained 4 species and 8 lineages.
The species-to-lineage mappings are:
Species4: Name8
Species3: MyName7, Name6
Species2: Name4, Name5
Species1: Name1, Name2, Name3
Example 1: Small Example with Intraspecific Sampling - Results
- Analysis 1: Find the ML species tree (run with run: 1)
- Run at the command line with: java -jar stem.jar
- Results are written to the file mle.tre

****************Results*****************
D AB Matrix:
[ 0.00000 1.20000 2.86000 3.46000]
[ 0.00000 0.00000 1.00000 2.40000]
[ 0.00000 0.00000 0.00000 1.10000]
[ 0.00000 0.00000 0.00000 0.00000]
Likelihood Species Tree (Newick format):
(Species1:1.20000,(Species4:1.10000,(Species2:1.00000,Species3:1.00000):0.10000):0.10000);
Log likelihood for tree: -52.43701947216076
****************** Done ****************
Example 1: Small Example with Intraspecific Sampling - Results
- Analysis 2: Find likelihood of all 15 trees (run with run: 2)
- Output files:

*****************************************
           Welcome to STEM 2.0
*****************************************
The settings file was successfully parsed...
...
Beginning search now (this could take a while)...
Search completed.
Here are the results (also written to file 'search.tre'):
[-52.43702] (Species1:1.20000,(Species4:1.10000,(Species2:1.00000,Species3:1.00000):0.10000):0.10000);
[-53.63718] (Species1:1.20000,(Species3:1.00000,(Species2:1.00000,Species4:1.00000):0.00000):0.20000);
[-56.63684] ((Species4:1.10000,Species1:1.10000):0.00000,(Species2:1.00000,Species3:1.00000):0.10000);
[-56.63720] (Species4:1.10000,(Species1:1.10000,(Species2:1.00000,Species3:1.00000):0.10000):0.00000);
[-60.23760] (Species4:1.10000,(Species2:1.00000,(Species1:1.00000,Species3:1.00000):0.00000):0.10000);
[-62.63758] ((Species1:1.00000,Species3:1.00000):0.00000,(Species2:1.00000,Species4:1.00000):0.00000);
[-62.63778] (Species3:1.00000,(Species1:1.00000,(Species2:1.00000,Species4:1.00000):0.00000):0.00000);
[-62.63790] (Species2:1.00000,((Species1:1.00000,Species4:1.00000):0.00000,Species3:1.00000):0.00000);
[-62.63806] (Species2:1.00000,((Species1:1.00000,Species3:1.00000):0.00000,Species4:1.00000):0.00000);
****************** Done ****************
Example 1: Small Example with Intraspecific Sampling - Results
- Analysis 3: Find the likelihood of a particular species tree (run with run: 0)
- Place the tree(s) of interest in the file user.tre in the same directory as STEM-hy

((Species1:0.000222,Species3:0.000222):0.016444,(Species2:0.000222,Species4:0.000222):0.016444);

- Branch lengths must be included. STEM-hy gives the likelihood of the tree with the user-specified branch lengths, as well as the ML branch lengths along the user tree.
Example 1: Small Example with Intraspecific Sampling - Results
*****************************************
           Welcome to STEM 2.0
*****************************************
The settings file was successfully parsed...
...
Read 1 species tree[s] from 'user.tre'
****************Results*****************
User tree:
((Species1:0.000222,Species3:0.000222):0.016444,(Species2:0.000222,Species4:0.000222):0.016444)
Log likelihood for tree: -153.62929947216077
**************Optimized Trees************
Optimized user tree:
((Species1:0.99995,Species3:0.99995):0.00005,(Species2:0.99995,Species4:0.99995):0.00005);
Log likelihood: -62.63865947216076
****************** Done ****************
Example 2: Missing Data Example
- genetrees.tre:

[1.0](((Name1:0.00123,Name2:0.00123):0.00123,(Name3:0.00121,Name4:0.00121):0.00125):0.0010,((Name5:0.0010,Name6:0.0010):0.0014,(MyName7:0.0012,Name8:0.0012):0.0012):0.00106);
[1.0](((Name1:0.00123,Name2:0.00123):0.00133,(Name3:0.0012,Name4:0.0012):0.00134):0.00032,(Name5:0.0010,Name6:0.0010):0.00186);
[1.0](((Name1:0.00123,Name2:0.00123):0.00123,(Name3:0.00121,Name4:0.00121):0.00125):0.0010,((Name5:0.0010,Name6:0.0010):0.0014,(MyName7:0.0012,Name8:0.0012):0.0012):0.00106);
[1.0]((((Name1:0.00123,Name2:0.00123):0.00133,(Name3:0.0012,Name4:0.0012):0.00134):0.0003,(Name5:0.0010,Name6:0.0010):0.00186):0.00064,(MyName7:0.0011,Name8:0.0011):0.0024);
[1.0](((Name1:0.00123,Name2:0.00123):0.00123,(Name3:0.00121,Name4:0.00121):0.00125):0.0010,((Name5:0.0010,Name6:0.0010):0.0014,(MyName7:0.0012,Name8:0.0012):0.0012):0.00106);
[1.0]((((Name1:0.00123,Name2:0.00123):0.00133,(Name3:0.0012,Name4:0.0012):0.00134):0.0003,(Name5:0.0010,Name6:0.0010):0.00186):0.00064,(MyName7:0.0011,Name8:0.0011):0.0024);
[1.0](((Name1:0.00123,Name2:0.00123):0.00123,(Name3:0.00121,Name4:0.00121):0.00125):0.0010,((Name5:0.0010,Name6:0.0010):0.0014,(MyName7:0.0012,Name8:0.0012):0.0012):0.00106);
[1.0]((((Name1:0.00123,Name2:0.00123):0.00133,(Name3:0.0012,Name4:0.0012):0.00134):0.0003,(Name5:0.0010,Name6:0.0010):0.00186):0.00064,(MyName7:0.0011,Name8:0.0011):0.0024);
Example 2: Missing Data Example
- Look at gene trees:

[Figure: three gene tree shapes. 4 loci share a tree on all eight lineages (Name1 to Name8); 1 locus has a tree on Name1 to Name6 only, with MyName7 and Name8 missing; 3 loci share a second tree on all eight lineages.]
Example 2: Missing Data Example
Note: The settings file remains unchanged. Below is the output.

****************Results*****************
D AB Matrix:
[ 0.00000 1.20000 2.86000 3.46000]
[ 0.00000 0.00000 1.00000 2.40000]
[ 0.00000 0.00000 0.00000 1.10000]
[ 0.00000 0.00000 0.00000 0.00000]
Maximum Likelihood Species Tree (Newick format):
(Species1:1.20000,(Species4:1.10000,(Species2:1.00000,Species3:1.00000):0.10000):0.10000);
log likelihood for tree: -967.874444171144
****************** Done ****************
Example Data: Heliconius Butterflies
[Figure: the Heliconius example, showing H. melpomene, H. heurippa, H. cydno, and H. hecale.]
Example 3: Bootstrap Analysis
- The current version of STEM-hy can be used to estimate bootstrap proportions on the ML tree, as well as to construct a bootstrap consensus tree.
- Sequence data must be provided in PHYLIP format (separate files need to be used for each gene).
- Each gene is bootstrapped a user-specified number of times, B, to produce B bootstrap samples (alignments) for each gene.
- Gene trees are estimated for each bootstrap sample using the program SSA. This program uses a simulated annealing method to estimate gene trees under the assumption of a molecular clock.
- B species trees are reconstructed using STEM-hy and printed both to the screen and to the file bootstrap.results.
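The resampling step for a single gene can be illustrated as follows. This is a generic nonparametric bootstrap of alignment columns, written as a sketch rather than as a description of what STEM-hy does internally; the function names and the toy sequences are assumptions, and only the lineage labels are borrowed from the Heliconius example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: {taxon: sequence}, all sequences of equal length.
    """
    lengths = {len(s) for s in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to equal length")
    n_sites = lengths.pop()
    # Draw n_sites column indices; the same index may be drawn repeatedly.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[c] for c in cols)
            for taxon, seq in alignment.items()}


def bootstrap_samples(alignment, b, seed=0):
    """Return b bootstrap alignments for one gene."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(b)]


gene = {"M95": "ACGTACGTAC", "Hh": "ACGTTCGTAA",
        "M187": "ACGAACGTAC", "Strib40": "ACGAACGTTC"}
samples = bootstrap_samples(gene, b=100, seed=3435893)
print(len(samples), len(samples[0]["M95"]))
```

Each resampled alignment would then be passed to a gene tree estimator, one bootstrap replicate at a time.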
Example 3: Bootstrap Analysis
For this example, we'll consider four taxa and six genes in Heliconius butterflies.

The settings file is shown below. Relative to Example 1, the run, bootstrap samples, phylip files, theta, and species entries change.

properties:
  run: 4    #0=user-tree, 1=MLE, 2=search, 3=hybridization, 4=bootstrap
  bootstrap samples: 100
  phylip files: co 4tax.phy,dll 4tax.phy,inv 4tax.phy,sd 4tax.phy,tpi 4tax.phy,white 4tax.phy
  theta: 0.01
  num saved trees: 15
  beta: 0.0005
  seed: 3435893
species:
  H. melpomene: M95
  H. hecale: Hh
  H. cordula: M187
  H. heurippa: Strib40
Example 3: Bootstrap Analysis
Below is the output. All bootstrap trees are written to a file called bootstrap.results and can be read into another program and summarized.

...
The species-to-lineage mappings are:
H. heurippa: Strib40
H. cordula: M187
H. hecale: Hh
H. melpomene: M95
Bootstrapping trees (this might take a while)...
****************Results*****************
The maximum likelihood species tree estimate is:
(H. hecale:6.82133,(H. melpomene:0.74608,(H. heurippa:0.07658,H. cordula:0.07658):0.66950):6.07525);
The 100 bootstrapped species trees:
(H. heurippa:0.29907,(H. hecale:0.17664,(H. melpomene:0.12424,H. cydno:0.12424):0.05240):0.12243);
(H. hecale:1.52825,(H. melpomene:0.35022,(H. heurippa:0.31089,H. cydno:0.31089):0.03933):1.17803);
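One way to summarize the bootstrap trees is to count how often each clade appears. The sketch below does this for a list of rooted Newick strings; the names `clades` and `clade_support` are made up for the example, the parser handles only simple labels and branch lengths, and only the two bootstrap trees shown above are used as input.

```python
import re
from collections import Counter


def clades(newick):
    """Return the non-trivial clades (sets of tip names) of a rooted tree."""
    tokens = [t.strip() for t in re.findall(r"[(),:;]|[^(),:;]+", newick)]
    tokens = [t for t in tokens if t]
    pos = 0
    found = set()

    def subtree():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            tips = set(subtree())
            while tokens[pos] == ",":
                pos += 1
                tips |= subtree()
            pos += 1                      # consume ")"
            found.add(frozenset(tips))
        else:
            tips = {tokens[pos]}          # a tip label, spaces allowed
            pos += 1
        if pos < len(tokens) and tokens[pos] == ":":
            pos += 2                      # skip ":" and the branch length
        return tips

    all_tips = frozenset(subtree())
    found.discard(all_tips)               # the root clade carries no information
    return found


def clade_support(trees):
    """Fraction of trees in which each clade appears."""
    counts = Counter(c for t in trees for c in clades(t))
    return {c: n / len(trees) for c, n in counts.items()}


boot = [
    "(H. heurippa:0.29907,(H. hecale:0.17664,(H. melpomene:0.12424,H. cydno:0.12424):0.05240):0.12243);",
    "(H. hecale:1.52825,(H. melpomene:0.35022,(H. heurippa:0.31089,H. cydno:0.31089):0.03933):1.17803);",
]
for clade, support in clade_support(boot).items():
    print(sorted(clade), support)
```

In a real analysis the list would hold all trees read from bootstrap.results, one per line.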
Some Notes on Program Versions
- There are some important differences between STEM v1.1a and STEM v2.0 / STEM-hy v1.0
- Multifurcations are handled differently.
  - STEM v1.1a and lower: Zero-length branches are set to 0.00001.
  - STEM v2.0 / STEM-hy v1.0: Zero-length branches are treated as missing data.
- Other big differences are improvements to input format and increased functionality in later versions.
STEM’s Hybrid Species Model
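[Figure: a hybrid species tree on taxa A, B, and C with internal branch length τ. The hybrid lineage traces to one parent with probability γ and to the other with probability 1 − γ. Below it are the two parental species trees, each with internal branch length τ and a mutation process acting along the gene trees.]

Gene tree topology probabilities under each parental tree:

             First parental tree     Second parental tree
P(C(AB))     1 − (2/3)exp(−τ)        (1/3)exp(−τ)
P(A(BC))     (1/3)exp(−τ)            1 − (2/3)exp(−τ)
P(B(AC))     (1/3)exp(−τ)            (1/3)exp(−τ)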
Species tree subject to hybridization
Hybridization parameter to model the extent of the contribution from each parent
Possible parental species trees
Probabilities associated with each gene tree topology for each parental tree under the coalescent model
Sequence evolution proceeds along gene trees
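The topology probabilities for the two parental trees can be combined into probabilities under the hybrid model, as in this sketch. Two assumptions are made for illustration: the parental tree in which A and B are sisters is the one weighted by γ, and both parental trees share the same internal branch length τ. The function names are hypothetical.

```python
import math


def topology_probs(tau, sister_pair):
    """Gene tree topology probabilities for taxa A, B, C on one parental tree.

    tau: internal branch length in coalescent units.
    sister_pair: "AB" or "BC", the sister pair in the parental tree.
    """
    minor = math.exp(-tau) / 3.0          # each discordant topology
    major = 1.0 - 2.0 * minor             # the topology matching the tree
    probs = {"C(AB)": minor, "A(BC)": minor, "B(AC)": minor}
    probs["C(AB)" if sister_pair == "AB" else "A(BC)"] = major
    return probs


def hybrid_topology_probs(tau, gamma):
    """Mixture of the two parental trees, weighted by gamma and 1 - gamma."""
    s1 = topology_probs(tau, "AB")        # assumed to carry weight gamma
    s2 = topology_probs(tau, "BC")        # assumed to carry weight 1 - gamma
    return {t: gamma * s1[t] + (1.0 - gamma) * s2[t] for t in s1}


print(hybrid_topology_probs(tau=1.0, gamma=0.3))
```

With γ = 1 or γ = 0 the mixture reduces to a single parental tree, so a pure bifurcating history is a special case of the hybrid model.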
Inference of Trees Subject to Hybridization
Assumptions:
- Hybridization results in a mosaic genome, so that a sampled gene has a probability distribution that its history originated from one of several parental species trees
- Genes in the sample are independent given the species tree
- Hybridization events happen only between sister taxa
- No factors other than coalescence and hybridization lead to incongruence between gene trees and the species tree
Likelihood Calculation for the Three-taxon Case
- Let $f(g_i \mid S)$ be the probability density of gene tree $g_i$ given species tree $S$ under the coalescent model (Rannala and Yang, 2003)
- The likelihood function for the three-taxon case is

$$\prod_{i=1}^{N} \left\{ \gamma f(g_i \mid S_1) + (1-\gamma) f(g_i \mid S_2) \right\}$$

where
- $S_1$ and $S_2$ are two possible parental species trees
- $\gamma \in [0, 1]$
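In practice a product like this is evaluated on the log scale. The sketch below assumes the per-gene densities under each parental tree have already been computed elsewhere; the function name and the numbers in the toy call are placeholders, not results from any dataset.

```python
import math


def mixture_log_likelihood(f1, f2, gamma):
    """Log of prod_i { gamma * f(g_i|S1) + (1 - gamma) * f(g_i|S2) }.

    f1, f2: per-gene densities under parental trees S1 and S2.
    gamma:  hybridization parameter in [0, 1].
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if len(f1) != len(f2):
        raise ValueError("need one density per gene under each tree")
    return sum(math.log(gamma * a + (1.0 - gamma) * b)
               for a, b in zip(f1, f2))


# Placeholder densities for two genes, for illustration only.
f_s1 = [0.2, 0.4]
f_s2 = [0.1, 0.3]
for g in (0.0, 0.5, 1.0):
    print(g, mixture_log_likelihood(f_s1, f_s2, g))
```

Evaluating the function over a grid of γ values is a simple way to see how strongly the data favor one parental tree over the other.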
N∏i=1
{γf (gi |S1) + (1− γ)f (gi |S2)}
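[Figure: hybrid species tree on A, B, C with internal branch length τ and mixing weights γ and 1 − γ; the two parental trees on A, B, C give gene tree densities f(g | S1) and f(g | S2), each followed by the mutation process]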
Beyond Three Taxa . . .
- Propose a method which incorporates any number of hybridization events, provided they occur between sister taxa
- Each putative hybridization event is assigned a parameter, $\gamma_1, \gamma_2, \ldots$
- The likelihood is computed by looking at all combinations of possible parental species trees, weighted appropriately by the $\gamma_j$ parameters
A Big(-ger) Example
Recall our motivating example:
[Figure: four trees on taxa A, B, C, D, E, F]
Consider the hybrid species tree:
[Figure: hybrid species tree with leaves A, C, E, F and hybrid taxa B and D]
The Likelihood Function
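[Figure: the hybrid species tree (leaves A, C, E, F with hybrid taxa B and D) expands into four parental species trees on A, B, C, D, E, F, weighted as follows]
S1: $\gamma_1\gamma_2$
S2: $\gamma_1(1-\gamma_2)$
S3: $(1-\gamma_1)\gamma_2$
S4: $(1-\gamma_1)(1-\gamma_2)$
$$\prod_{i=1}^{N} \left\{ \gamma_1\gamma_2 f(g_i \mid S_1) + \gamma_1(1-\gamma_2) f(g_i \mid S_2) + (1-\gamma_1)\gamma_2 f(g_i \mid S_3) + (1-\gamma_1)(1-\gamma_2) f(g_i \mid S_4) \right\}$$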
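The weighting of parental trees generalizes to any number of hybridization events by enumerating every combination of choices. The sketch below is illustrative only: the function names are hypothetical, and the mapping of choices to S1 through S4 follows the weights shown in this likelihood function.

```python
from itertools import product

def parental_tree_weights(gammas):
    """Map each combination of choices to its weight.
    choice[j] is True when event j contributes gamma_j and False when it
    contributes (1 - gamma_j)."""
    weights = {}
    for choice in product((True, False), repeat=len(gammas)):
        weight = 1.0
        for gamma, takes_gamma_side in zip(gammas, choice):
            weight *= gamma if takes_gamma_side else 1.0 - gamma
        weights[choice] = weight
    return weights

def mixture_density(gammas, densities):
    """Per-gene mixture density: sum over parental trees of weight * f(g_i|S)."""
    weights = parental_tree_weights(gammas)
    return sum(weights[choice] * densities[choice] for choice in weights)

# Two events, as in the six-taxon example; the gamma values are made up.
labels = {(True, True): "S1", (True, False): "S2",
          (False, True): "S3", (False, False): "S4"}
for choice, weight in parental_tree_weights([0.3, 0.6]).items():
    print(labels[choice], round(weight, 4))
```

With h hybridization events there are 2^h parental trees, so the enumeration stays cheap for the small numbers of events considered here.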
Comments on Computation
- Parameters in the likelihood function: $\gamma_1$, $\gamma_2$, branch lengths
- For a given hybrid species tree and sample of gene trees with divergence times, maximum likelihood branch lengths can be analytically determined
- Fitting the likelihood model for a hypothesized hybrid species tree only requires optimization of the $\gamma$ parameters
- Implemented in a modified version of the program STEM, called STEM-hy (“stemmy”)
Selecting the Best Hybrid Species Tree
- For the example hybrid species tree, pick the best hybrid model from among the possible models using the AIC:

Model  Tree         $\gamma_1$  $\gamma_2$  Number of Parameters
1      A B C D E F  0           0           5
2      A B C D E F  0           1           5
3      A B C D E F  1           0           5
4      A B C D E F  1           1           5
Selecting the Best Hybrid Species Tree
Model  Tree                         $\gamma_1$  $\gamma_2$  Number of Parameters
5      A B C E F (D hybrid)         0           (0,1)       6
6      A B C E F (D hybrid)         1           (0,1)       6
7      A C D E F (B hybrid)         (0,1)       0           6
8      A C D E F (B hybrid)         (0,1)       1           6
9      A C E F (B and D hybrid)     (0,1)       (0,1)       7
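The nine candidate models in these tables follow a simple pattern: each γ is either fixed at 0, fixed at 1, or estimated in (0,1), and each estimated γ adds one parameter to the base count of 5. A short sketch that reproduces the table's numbering (the names are hypothetical, and the base count of 5 is taken from the tables for this six-taxon example):

```python
from itertools import product

BASE_PARAMS = 5            # parameter count in the tables when no gamma is estimated
STATES = ("0", "1", "(0,1)")

def enumerate_models(n_events=2):
    """List (gamma settings, number of parameters) for every candidate model,
    ordered by number of parameters."""
    models = []
    for combo in product(STATES, repeat=n_events):
        k = BASE_PARAMS + sum(1 for state in combo if state == "(0,1)")
        models.append((combo, k))
    # sorted() is stable, so models with equal k keep their enumeration order.
    return sorted(models, key=lambda model: model[1])

for number, (combo, k) in enumerate(enumerate_models(), start=1):
    print(number, combo, k)
```

For two hybridization events this yields 3 × 3 = 9 models: four with k = 5, four with k = 6, and one with k = 7.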
STEM-hy: Assumptions
- In practice, the $\gamma_i$ are not given (neither are the times of speciation or hybridization events). The algorithm finds MLEs for these parameters.
- STEM-hy inherits all of STEM’s other assumptions (e.g., no gene flow after speciation if no hybridization, gene tree variability is not taken into consideration, etc.).
STEM-hy: Assumptions
- One important point: STEM-hy looks for evidence of hybridization in the presence of incomplete lineage sorting.
- By using the model in STEM-hy to compute likelihoods, the coalescent process is incorporated.
- The AIC is used to compare models:
  - $\mathrm{AIC} = -2 \ln L(M \mid D) + 2k$
    where $M$ is the model, $D$ is the data, and $k$ is the number of parameters in the model. $\ln L(M \mid D)$ is the log-likelihood from STEM-hy for the hybridization model under consideration.
Example 4: Hybridization in Heliconius
- Input data format is the same as for previous analyses:
  - Gene trees are placed in the file called genetrees.tre (option 1) or the files containing the gene trees are listed in the settings file (option 2).
  - The settings file (in YAML format) is used to give user settings (e.g., θ).
- The run option is set to 3.
Example 4: Hybridization in Heliconius
- The user must additionally provide information about hybridization:
  - The only option at present is to use a user-specified tree; the present version of the program assumes that the overall species phylogeny is known.
  - The user-specified tree is one of the possible parental trees; it doesn’t matter which one.
  - The putative hybrid species are identified in the settings.yaml file.
Example 4: Hybridization in Heliconius
[Figure: species tree for H. melpomene, H. heurippa, H. cydno, and H. hecale, with numbered nodes 1, 2, 3 and clade labels ABCD, BCD, CD, BD]
STEM-hy Example: Heliconius Butterflies
- Example genetrees.tre file:

[0.37137]((Hheurippa:0.005989,(Hcydno:0.001322,Hmelpomene:0.001322):0.004667):0.022778,Hhecale:0.028767);
[1.17059]((Hmelpomene:0.049843,(Hcydno:0.000001,Hheurippa:0.000001):0.049843):0.001,Hhecale:0.049943);
[0.11434](((Hcydno:0.021024,Hheurippa:0.021024):0.020051,Hmelpomene:0.041076):0.002610,Hhecale:0.043685);
[1.35454](((Hheurippa:0.010740,Hcydno:0.010740):0.003498,Hmelpomene:0.014238):0.037654,Hhecale:0.051892);
[0.39096](((Hheurippa:0.008764,Hmelpomene:0.008764):0.001686,Hcydno:0.010450):0.003969,Hhecale:0.014419);
[1.22683](((Hheurippa:0.002431,Hcydno:0.002431):0.062919,Hmelpomene:0.065350):0.0000001,Hhecale:0.065351);
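Before running an analysis it can help to sanity-check the gene tree file. The sketch below is not part of STEM-hy; it is a hypothetical helper that splits a genetrees.tre string into its bracketed prefix value and Newick string, and lists the tip labels so they can be compared against the settings file. It assumes each tree is written as `[value]newick;`, as in the example above.

```python
import re

# One record: "[value]" followed by a Newick string ending in ";".
TREE_PATTERN = re.compile(r"\[([^\]]+)\]\s*([^;]+;)")

def parse_gene_trees(text):
    """Return a list of (bracketed_value, newick_string) pairs."""
    return [(float(value), newick.strip())
            for value, newick in TREE_PATTERN.findall(text)]

def taxa(newick):
    """Tip labels: a name that directly follows '(' or ',' and precedes ':'."""
    return sorted(re.findall(r"[(,]([^(),:;]+):", newick))

sample = (
    "[0.37137]((Hheurippa:0.005989,(Hcydno:0.001322,Hmelpomene:0.001322)"
    ":0.004667):0.022778,Hhecale:0.028767);"
    "[1.17059]((Hmelpomene:0.049843,(Hcydno:0.000001,Hheurippa:0.000001)"
    ":0.049843):0.001,Hhecale:0.049943);"
)

for value, newick in parse_gene_trees(sample):
    print(value, taxa(newick))
```

A check like this catches mismatched or misspelled tip labels early, before they surface as errors in the likelihood run.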
STEM-hy Example: Heliconius Butterflies
- Example settings file:

properties:
  run: 3
  theta: 0.001
  beta: 0.0005
  burnin: 100
  seed: 3435893
  bound_total_iter: 20
  num_saved_trees: 10
  hybrid_species: H. heurippa
  hybrid_tree: user-heliconius.tre
species:
  H. melpomene: M95
  H. hecale: Hh
  H. cordula: M187
  H. heurippa: Strib40
Example 4: Hybridization in Heliconius
- Example user-heliconius.tre:
(((H. heurippa:0.000085,H. cydno:0.000085):0.347479,H. melpomene:0.355979):3.332091,H. hecale:3.68807);
Example 4: Hybridization in Heliconius
****************Results*****************
....
Parental trees:

gamma(H. heurippa) = 1
((H. cydno:0.00009,(H. heurippa:0.00009,H. melpomene:0.00009):0.00000):3.68801,H. hecale:3.68810);
Lik: -357.4325907499209
AIC: 720.8651814998418
k: 3

gamma(H. heurippa) = 0
(((H. heurippa:0.00009,H. cydno:0.00009):0.35589,H. melpomene:0.35598):3.33212,H. hecale:3.68810);
Lik: -349.9185707499209
AIC: 705.8371414998418
k: 3

Hybrid trees:

(((H. heurippa:0.00009,H. cydno:0.00009):0.35589,H. melpomene:0.35598):3.33212,H. hecale:3.68810);
Lik: -349.5409832924012
gamma(H. heurippa): 0.6600000000000004
AIC: 707.0819665848024
k: 4

****************** Done ****************
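The AIC values in this output can be reproduced from the reported log-likelihoods and parameter counts using AIC = −2 ln L + 2k. The sketch below does that and ranks the three models; the model labels are added here for readability and are not part of the program output.

```python
def aic(log_lik, k):
    """Akaike information criterion: -2 ln L + 2k."""
    return -2.0 * log_lik + 2.0 * k

# (log-likelihood, k) copied from the Results block above.
models = {
    "parental, gamma = 1": (-357.4325907499209, 3),
    "parental, gamma = 0": (-349.9185707499209, 3),
    "hybrid, gamma = 0.66": (-349.5409832924012, 4),
}

scores = {name: aic(log_lik, k) for name, (log_lik, k) in models.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {score:.4f}, delta = {score - scores[best]:.4f}")
```

The hybrid model has the highest likelihood, but its extra parameter costs more than the gain in fit, so the parental tree with γ = 0 has the lowest AIC (705.84 against 707.08 for the hybrid model).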
What hybrid species can be considered?
- Care must be taken in selecting hybrid species:
  - Both members of a sister group cannot be selected as hybrid taxa in a single analysis. However, two analyses can be run (one with each member of the sister group identified as the hybrid) and results will be comparable across runs.
  - The outgroup cannot be selected as a hybrid.
- Both of these restrictions result from the fact that, for now, hybridization is only considered between sister taxa.
- More general hybridization relationships can be considered “by hand” using the user-specified tree feature of STEM-hy.
STEM-hy: Strengths and Weaknesses
- STEM-hy makes some fairly strong assumptions:
  - Error in estimating gene trees and branch lengths is not incorporated! But the possibility of carrying out bootstrap analysis helps.
  - Information in the sequence data is not used directly; it is only used as summarized by estimated gene divergence times.
  - There is a single value of θ for the entire tree.
- There are trade-offs involved, and STEM-hy does some things well:
  - It is quick (even the tree search does not take long).
  - It can handle missing data easily and intuitively.
  - Simulations demonstrate reasonable performance (unlikely to be misleading; may be uninformative).
Challenge Datasets
- I’ve created four datasets under varying conditions:
  M1 No hybridization, long intervals between speciation events.
  M2 No hybridization, short intervals between speciation events.
  M3 Low levels of hybridization: B is a hybrid of A and C (species tree as in M1 and M2).
  M4 Extensive hybridization: B is a hybrid of A and C (species tree as in M1 and M2).
- All data sets have 6 species, 2 individuals/species, and 10 loci.
- GOAL: match each data set to one of the conditions listed above.
Solutions are at www.stat.osu.edu/~lkubatko/uga2012.html
STEM-hy Information, References, etc.
Recommended citations - species tree estimation:
- Kubatko, L. S., B. C. Carstens, and L. L. Knowles. 2009. STEM: Species Tree Estimation using Maximum likelihood under coalescence. Bioinformatics 25(7): 971-973.
- Liu, L., L. Yu, and D. K. Pearl. 2009. Maximum tree: a consistent estimator of the species tree. Journal of Mathematical Biology 60(1): 95-106.
- Mossel, E. and S. Roch. 2010. Incomplete lineage sorting: Consistent phylogeny estimation from multiple loci. IEEE/ACM Transactions on Computational Biology and Bioinformatics 7(1): 166-171.
Recommended citations - hybridization:
- Kubatko, L. S. 2009. Identifying Hybridization Events in the Presence of Coalescence via Model Selection. Systematic Biology 58(5): 478-488.
Thank you!
STEM-hy is available at http://www.stat.osu.edu/~lkubatko/software/STEM/
Questions concerning the programs can be sent to [email protected].