New methods for estimating species trees from genome-scale data · tandy.cs.illinois.edu/mit-dec4.pdf
TRANSCRIPT
![Page 1: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/1.jpg)
New methods for estimating species trees from genome-scale data
Tandy Warnow, The University of Illinois
![Page 2: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/2.jpg)
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website, University of Arizona
Phylogeny (evolutionary tree)
![Page 3: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/3.jpg)
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website, University of Arizona
Sampling multiple genes from multiple species
![Page 4: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/4.jpg)
phylogenomics
[Figure: example alignments for gene 1, gene 2, ..., gene 999, gene 1000; each row below lists one gene's four aligned sequences]
ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG
CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G
AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG
CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT
“gene” here refers to a portion of the genome (not a functional gene)
Orangutan
Gorilla
Chimpanzee
Human
I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome
![Page 5: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/5.jpg)
Gene tree discordance
[Figure: gene trees for gene 1 and gene 1000 on Orang., Gorilla, Chimp, Human]
Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity
![Page 6: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/6.jpg)
Gene trees inside the species tree (Coalescent Process)
Present
Past
Courtesy James Degnan
Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.
![Page 7: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/7.jpg)
1KP: Thousand Transcriptome Project
• 103 plant transcriptomes, 400-800 single-copy “genes”
• Next phase will be much bigger
• Wickett, Mirarab et al., PNAS 2014
G. Ka-Shu Wong, U Alberta
N. Wickett, Northwestern
J. Leebens-Mack, U Georgia
N. Matasci, iPlant
T. Warnow, S. Mirarab, N. Nguyen, UT-Austin
Challenge:
• Massive gene tree heterogeneity consistent with ILS
![Page 8: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/8.jpg)
Avian Phylogenomics Project
E. Jarvis, HHMI
G. Zhang, BGI
• Approx. 50 species, whole genomes, 14,000 loci
• Jarvis, Mirarab, et al., Science 2014
M. T. P. Gilbert, Copenhagen
S. Mirarab, Md. S. Bayzid, T. Warnow, UT-Austin
Plus many many other people…
Major challenge:
• Massive gene tree heterogeneity consistent with ILS.
![Page 9: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/9.jpg)
This talk
• Gene tree heterogeneity due to incomplete lineage sorting, modelled by the multi-species coalescent (MSC)
• Statistically consistent estimation of species trees under the MSC, and the impact of gene tree estimation error
• ASTRAL (Bioinformatics 2014, 2015): coalescent-based species tree estimation method that has high accuracy on large datasets (1000 species and genes)
• “Statistical binning” (Science 2014) – improving gene tree estimation, and hence species tree estimation
• Open questions
![Page 10: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/10.jpg)
This talk
• Gene tree heterogeneity due to incomplete lineage sorting, modelled by the multi-species coalescent (MSC)
• Statistically consistent estimation of species trees under the MSC, and the impact of gene tree estimation error
• ASTRAL (Bioinformatics 2014, 2015): coalescent-based species tree estimation method that has high accuracy on large datasets (1000 species and genes)
• “Statistical binning” (Science 2014) – improving gene tree estimation, and hence species tree estimation
• Open questions
Controversial!
![Page 11: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/11.jpg)
Incomplete Lineage Sorting (ILS)
• Confounds phylogenetic analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi, etc.
• There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS, focused around statistical consistency guarantees (theory) and performance on data.
![Page 12: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/12.jpg)
Statistical Consistency
[Figure: error as a function of the amount of data]
![Page 13: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/13.jpg)
Two competing approaches
[Figure: alignments for gene 1, gene 2, ..., gene k over the species. Labels: Analyze separately, Summary Method; Concatenation]
![Page 14: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/14.jpg)
![Page 15: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/15.jpg)
What about summary methods?
![Page 16: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/16.jpg)
What about summary methods?
Techniques: Most frequent gene tree? Consensus of gene trees? Other?
![Page 17: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/17.jpg)
Statistically consistent under ILS?
• Coalescent-based summary methods:
– MP-EST (Liu et al. 2010): maximum pseudo-likelihood estimation of rooted species tree based on rooted triplet tree distribution – YES
– NJst (Liu and Yu, 2011) – YES
– And others, including some newer methods (BUCKy-pop, ASTRAL, ASTRID, etc.) – YES
• Co-estimation methods: *BEAST (Heled and Drummond 2009): Bayesian co-estimation of gene trees and species trees – YES
• Single-site methods (SVDquartets, METAL, SNAPP, and others)
![Page 18: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/18.jpg)
1KP: Thousand Transcriptome Project
• 103 plant transcriptomes, 400-800 single-copy “genes”
• Next phase will be much bigger
• Wickett, Mirarab et al., PNAS 2014
G. Ka-Shu Wong, U Alberta
N. Wickett, Northwestern
J. Leebens-Mack, U Georgia
N. Matasci, iPlant
T. Warnow, S. Mirarab, N. Nguyen, UT-Austin
Challenges:
• Massive gene tree heterogeneity consistent with ILS
• Could not use MP-EST due to missing data (many gene trees could not be rooted) and large number of species
![Page 19: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/19.jpg)
1KP: Thousand Transcriptome Project
• 103 plant transcriptomes, 400-800 single-copy “genes”
• Next phase will be much bigger
• Wickett, Mirarab et al., PNAS 2014
G. Ka-Shu Wong, U Alberta
N. Wickett, Northwestern
J. Leebens-Mack, U Georgia
N. Matasci, iPlant
T. Warnow, S. Mirarab, N. Nguyen, UT-Austin
Solution:
• New coalescent-based method ASTRAL
• ASTRAL is statistically consistent, polynomial time, and uses unrooted gene trees.
![Page 20: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/20.jpg)
ASTRAL and ASTRAL-2
• Estimates the species tree from gene trees by finding the species tree that has the maximum quartet support, using dynamic programming
• Theorem: ASTRAL is statistically consistent under the MSC, even when solved in constrained mode (drawing bipartitions from the input gene trees)
• The constrained version of ASTRAL runs in polynomial time
• Open source software at https://github.com/smirarab
• Published in ECCB/Bioinformatics 2014 (Mirarab et al.) and ISMB/Bioinformatics 2015 (Mirarab and Warnow)
• Used in Wickett, Mirarab et al. (PNAS 2014) and Prum, Berv et al. (Nature 2015) (and in many other papers)
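To make "quartet support" concrete, here is a minimal brute-force sketch. It is not ASTRAL's dynamic programming: it simply counts, over every set of four taxa, how many gene trees induce the same quartet topology as a candidate species tree. Trees are written as nested Python tuples of taxon names, and the function names (`leaf_sets`, `quartet_topology`, `quartet_support`) are illustrative assumptions, not part of the ASTRAL software.

```python
from itertools import combinations

def leaf_sets(tree):
    """Return (all leaves, leaf set below every internal node) of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = leaf_sets(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found

def quartet_topology(tree_info, four):
    """Return the split {pair, pair} the tree induces on four taxa, or None if unresolved/missing."""
    leaves, found = tree_info
    four = frozenset(four)
    if not four <= leaves:
        return None  # the tree lacks one of the taxa
    for cluster in found:
        inside = four & cluster
        if len(inside) == 2:  # the edge above this cluster separates two taxa from the other two
            return frozenset([inside, four - inside])
    return None

def quartet_support(species_tree, gene_trees):
    """Count gene-tree quartet topologies that agree with the candidate species tree."""
    species_info = leaf_sets(species_tree)
    gene_infos = [leaf_sets(g) for g in gene_trees]
    score = 0
    for four in combinations(sorted(species_info[0]), 4):
        target = quartet_topology(species_info, four)
        if target is None:
            continue
        score += sum(quartet_topology(info, four) == target for info in gene_infos)
    return score

gene_trees = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    ((("A", "B"), "C"), "D"),
]
print(quartet_support((("A", "B"), ("C", "D")), gene_trees))
print(quartet_support((("A", "C"), ("B", "D")), gene_trees))
```

The brute-force loop takes time proportional to the number of four-taxon subsets times the number of gene trees, so it is only useful for checking small examples.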
![Page 21: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/21.jpg)
Simulation study
• Variable parameters:
– Number of species: 10 – 1000
– Number of genes: 50 – 1000
– Amount of ILS: low, medium, high
– Deep versus recent speciation
• 11 model conditions (50 replicates each) with heterogeneous gene tree error
• Compare to NJst, MP-EST, concatenation (CA-ML)
• Evaluate accuracy using FN rate: the percentage of branches in the true tree that are missing from the estimated tree
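The FN rate defined in the last bullet can be sketched as follows. This is a minimal illustration, assuming both trees are on the same taxon set and are written as nested tuples; branches are compared as non-trivial bipartitions of the taxa. The helper names and the example trees (built from the bird names on the slide) are hypothetical.

```python
def leaf_sets(tree):
    """Return (all leaves, leaf set below every internal node) of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = leaf_sets(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found

def bipartitions(tree):
    """Non-trivial bipartitions, each stored as the side that excludes a fixed anchor taxon."""
    taxa, found = leaf_sets(tree)
    anchor = min(taxa)
    splits = set()
    for cluster in found:
        side = taxa - cluster if anchor in cluster else cluster
        if 2 <= len(side) <= len(taxa) - 2:
            splits.add(side)
    return splits

def fn_rate(true_tree, estimated_tree):
    """Fraction of branches of the true tree that are missing from the estimated tree."""
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0  # no internal branches to miss
    return len(true_splits - bipartitions(estimated_tree)) / len(true_splits)

true_tree = ((("Finch", "Falcon"), "Owl"), ("Eagle", "Pigeon"))
estimated = ((("Finch", "Owl"), "Falcon"), ("Eagle", "Pigeon"))
print(fn_rate(true_tree, estimated))
```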
[Figure labels: True (model) species tree; True gene trees; Sequence data; Estimated gene trees; Estimated species tree. Example taxa: Finch, Falcon, Owl, Eagle, Pigeon]
ASTRAL-II
look at all pairs of leaves chosen each from one of the children of u. For each such pair of leaves, there are $\binom{u'}{2}$ quartet trees that put that pair together, where $u'$ is the number of leaves outside the node u. This will examine each pair of nodes in each of the input k nodes exactly once and would therefore require $O(n^2 k)$ computations. The final score can be normalized by the maximum number of input quartet trees that include a pair of taxa.
Given the similarity matrix, we calculate an UPGMA tree and add all its bipartitions to the set X. This heuristic adds relatively few bipartitions, but the matrix is used in the next heuristic, which is our main addition mechanism.
Greedy: We estimate the greedy consensus of the gene trees at different threshold levels (0, 1/100, 2/100, 5/100, 1/10, 1/4, 1/3). For each polytomy in each greedy consensus tree, we resolve the polytomy in multiple ways and add bipartitions implied by those resolutions to the set X. First, we resolve the polytomy by applying the UPGMA algorithm to the similarity matrix, starting from the clades given by the polytomy. Then, we sample one taxon from each side of the polytomy randomly, and use the greedy consensus of the gene trees restricted to this subsample to find a resolution of the polytomy (we randomly resolve any multifurcations in this greedy consensus on the induced subsample). We repeat this process at least 10 times, but if the subsampled greedy consensus trees include sufficiently frequent bipartitions (defined as > 1%), we do more rounds of random sampling (we increase the number of iterations by two every time this happens). For each random subsample around a polytomy, we also resolve it by calculating an UPGMA tree on the subsampled similarity matrix. Finally, for the two first greedy threshold values and the first 10 random subsamples, we also use a third strategy that can potentially add a larger number of bipartitions. For each subsampled taxon x, we resolve the polytomy as a caterpillar tree by sorting the remaining taxa according to their similarity with x.
Gene tree polytomies: When gene trees include polytomies, we also add new bipartitions to set X. We first compute the greedy consensus of the input gene trees with threshold 0 and if the greedy consensus has polytomies, we resolve them using UPGMA; we repeat this process twice to account for uncertainty in greedy consensus estimation. Then, for each gene tree polytomy, we use the two resolved consensus trees to infer a resolution of the polytomy and we add the implied resolutions to set X.
3.3 Multi-furcating input gene trees
Extending ASTRAL to inputs that include polytomies requires solving the weighted quartet tree problem when each node of the input defines not a tripartition, but a multi-partition of the set of taxa. We start by a basic observation: every resolved quartet tree induced by a gene tree maps to two nodes in the gene tree regardless of whether the gene tree is binary or not. In other words, induced quartet trees that map to only one node of the gene tree are unresolved. When maximizing the quartet support, these unresolved gene tree quartet trees are inconsequential and need to be ignored. Now, consider a polytomy of degree d. There are $\binom{d}{3}$ ways to select three sides of the polytomy. Each of these ways of selecting three sides defines a tripartition of a subset of taxa. Any selection of two taxa from one side of this tripartition and one taxon from each of the remaining two sides still defines an induced resolved quartet tree,
[Figure 1 panels: (a) True gene tree discordance: density versus RF distance (true species tree vs true gene trees), by rate (1e−06, 1e−07) and tree height (10M, 2M, 500K). (b) Gene tree estimation error: density versus RF distance (true vs estimated).]
Fig. 1. Characteristics of the simulation (a) RF distance between the true species tree and the true gene trees (50 replicates of 1000 genes) for Dataset I. Tree height directly affects the amount of true discordance; the speciation rate affects true gene tree discordance only with 10M tree length. (b) RF distance between true gene trees and estimated gene trees for Dataset I. See also Figure S1 for inter and intra-replicate gene tree error distributions.
and each induced resolved quartet tree would still map to exactly two nodes in our multi-furcating tree. Thus, all the algorithmic assumptions of ASTRAL remain intact, as long as for each multi-furcating node in an input gene tree, we treat it as a collection of $\binom{d}{3}$ tripartitions. Note that in the presence of polytomies, the running time analysis can change because analyzing each multi-furcating node requires time cubic in its degree and the degree can increase in principle with n. Thus, the running time depends on the patterns of the multi-furcations and cannot be studied in a general case.
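The treatment of a polytomy as a collection of tripartitions can be illustrated with a short sketch. It assumes a multi-furcating node is given as a list of d taxon sets (its sides); the function names are hypothetical and this is not the ASTRAL-II implementation. It requires Python 3.8 or later for `math.comb`.

```python
from itertools import combinations
from math import comb

def polytomy_tripartitions(sides):
    """Enumerate every way of choosing three sides of a degree-d node: C(d, 3) tripartitions."""
    return list(combinations(sides, 3))

def resolved_quartets(a, b, c):
    """Count quartets with two taxa from one side and one taxon from each of the other two."""
    x, y, z = len(a), len(b), len(c)
    return comb(x, 2) * y * z + x * comb(y, 2) * z + x * y * comb(z, 2)

# A hypothetical polytomy of degree d = 4 over six taxa.
sides = [{"A", "B"}, {"C"}, {"D", "E"}, {"F"}]
tripartitions = polytomy_tripartitions(sides)
print(len(tripartitions), comb(len(sides), 3))
for a, b, c in tripartitions:
    print(sorted(a), sorted(b), sorted(c), resolved_quartets(a, b, c))
```

Because the enumeration visits every triple of sides, its cost grows cubically with the degree of the node, which matches the running-time caveat in the paragraph above.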
Statistical Consistency: ASTRAL-I was statistically consistent, and changes from ASTRAL-I to ASTRAL-II either affect running time, or enlarge the search space, which does not negate consistency.
Theorem 3: ASTRAL-II is statistically consistent for binary complete input gene trees.
4 EXPERIMENTAL SETUP
Simulation Procedure: We used SimPhy, a tool developed by Mallo et al. (2015), to simulate species trees and gene trees (produced in mutation units), and then used Indelible to simulate sequences down the gene trees with varying length and model parameters. We estimated gene trees on these simulated gene alignments, which we then used in coalescent-based analyses.
We simulated 10 model conditions, which we divide into two datasets, with one model condition appearing in both datasets. We
Used SimPhy, Mallo and Posada, 2015
![Page 22: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/22.jpg)
[Figure: species tree topological error (FN) versus number of species (10, 50, 100, 200, 500, 1000); methods: ASTRAL-II, MP-EST]
1000 genes, “medium” levels of recent ILS
Tree accuracy when varying the number of species
![Page 23: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/23.jpg)
[Figure: species tree topological error (FN) versus number of species (10, 50, 100, 200, 500, 1000); methods: ASTRAL-II, MP-EST]
1000 genes, “medium” levels of recent ILS
Tree accuracy when varying the number of species
![Page 24: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/24.jpg)
[Figure: species tree topological error (FN) versus number of species (10, 50, 100, 200, 500, 1000); methods: ASTRAL-II, NJst, MP-EST]
1000 genes, “medium” levels of recent ILS
Tree accuracy when varying the number of species
![Page 25: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/25.jpg)
[Figure: running time (hours) versus number of species (10, 50, 100, 200, 500, 1000); methods: ASTRAL-II, NJst, MP-EST]
Running time when varying the number of species
1000 genes, “medium” levels of recent ILS
![Page 26: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/26.jpg)
![Page 27: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/27.jpg)
ASTRAL-II on biological datasets (ongoing collaborations)
• 1200 plants with ~ 400 genes (1KP consortium)
• 250 avian species with 2000 genes (with LSU, UF, and Smithsonian)
• 200 avian species with whole genomes (with Genome 10K, international)
• 250 suboscine species (birds) with ~2000 genes (with LSU and Tulane)
• 140 Insects with 1400 genes (with U. Illinois at Urbana-Champaign)
• 50 Hummingbird species with 2000 genes (with U. Copenhagen and Smithsonian)
• 40 raptor species (birds) with 10,000 genes (with U. Copenhagen and Berkeley)
• 38 mammalian species with 10,000 genes (with U. of Bristol, Cambridge, and Nat. Univ. of Ireland)
![Page 28: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/28.jpg)
Avian Phylogenomics Project
E. Jarvis, HHMI
G. Zhang, BGI
• Approx. 50 species, whole genomes, 14,000 loci
• Jarvis, Mirarab, et al., Science 2014
M. T. P. Gilbert, Copenhagen
S. Mirarab, Md. S. Bayzid, T. Warnow, UT-Austin
Plus many many other people…
Major challenge:
• Massive gene tree heterogeneity consistent with incomplete lineage sorting
• Very poor resolution in the 14,000 gene trees (average bootstrap support 25%)
• Standard coalescent-based species tree estimation methods contradicted concatenation analysis and prior studies
![Page 29: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/29.jpg)
Statistical Consistency for summary methods
[Figure: error as a function of the amount of data]
Data are gene trees, presumed to be randomly sampled true gene trees.
![Page 30: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/30.jpg)
• Summary methods combine estimated gene trees, not true gene trees.
• Multiple studies show that summary methods can be less accurate than concatenation in the presence of high gene tree estimation error.
• Genome-scale data includes a range of markers, not all of which have substantial signal. Furthermore, removing sites due to model violations reduces signal.
• Some researchers also argue that “gene trees” should be based on very short alignments, to avoid intra-locus recombination.
TYPICAL PHYLOGENOMICS PROBLEM: many poor gene trees
![Page 31: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/31.jpg)
• Summary methods combine estimated gene trees, not true gene trees.
• Multiple studies show that summary methods can be less accurate than concatenation in the presence of high gene tree estimation error.
• Genome-scale data includes a range of markers, not all of which have substantial signal. Furthermore, removing sites due to model violations reduces signal.
• Some researchers also argue that “gene trees” should be based on very short alignments, to avoid intra-locus recombination.
Gene tree estimation error: key issue in the debate
![Page 32: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/32.jpg)
Avian Phylogenomics Project
E. Jarvis, HHMI
G. Zhang, BGI
• Approx. 50 species, whole genomes, 14,000 loci
• Published Science 2014
M. T. P. Gilbert, Copenhagen
S. Mirarab, Md. S. Bayzid, T. Warnow, UT-Austin
Plus many many other people…
Most gene trees had very low bootstrap support, suggestive of gene tree estimation error
![Page 33: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/33.jpg)
Avian Phylogenomics Project
E. Jarvis, HHMI
G. Zhang, BGI
• Approx. 50 species, whole genomes, 14,000 loci
M. T. P. Gilbert, Copenhagen
S. Mirarab, Md. S. Bayzid, T. Warnow, UT-Austin
Plus many many other people…
Solution: Statistical Binning
• Improves coalescent-based species tree estimation by improving gene trees (Mirarab, Bayzid, Boussau, and Warnow, Science 2014)
• Avian species tree estimated using Statistical Binning with MP-EST (Jarvis, Mirarab, et al., Science 2014)
![Page 34: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/34.jpg)
Ideas behind statistical binning
[Figure axis: Number of sites in an alignment]
• “Gene tree” error tends to decrease with the number of sites in the alignment
• Concatenation (even if not statistically consistent) tends to be reasonably accurate when there is not too much gene tree heterogeneity
![Page 35: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/35.jpg)
12 DECEMBER 2014 • VOL 346 ISSUE 6215 1337 SCIENCE sciencemag.org
INTRODUCTION: Reconstructing species trees for rapid radiations, as in the early diversification of birds, is complicated by biological processes such as incomplete lineage sorting (ILS) that can cause different parts of the genome to have different evolutionary histories. Statistical methods, based on the multispecies coalescent model and that combine gene trees, can be highly accurate even in the presence of massive ILS; however, these methods can produce species trees that are topologically far from the species tree when estimated gene trees have error. We have developed a statistical binning technique to address gene tree estimation error and have explored its use in genome-scale species tree estimation with MP-EST, a popular coalescent-based species tree estimation method.
Statistical binning enables an accurate coalescent-based estimation of the avian tree
AVIAN GENOMICS
Siavash Mirarab, Md. Shamsuzzoha Bayzid, Bastien Boussau, Tandy Warnow*
RESEARCH ARTICLE SUMMARY
The statistical binning pipeline for estimating species trees from gene trees. Loci are grouped into bins based on a statistical test for combinability, before estimating gene trees.
[Figure labels: Statistical binning technique; Statistical binning pipeline; Traditional pipeline (unbinned); Sequence data; Incompatibility graph; Gene alignments; Binned supergene alignments; Estimated gene trees; Supergene trees; Species tree]
RATIONALE: In statistical binning, phylogenetic trees on different genes are estimated and then placed into bins, so that the differences between trees in the same bin can be explained by estimation error (see the figure). A new tree is then estimated for each bin by applying maximum likelihood to a concatenated alignment of the multiple sequence alignments of its genes, and a species tree is estimated using a coalescent-based species tree method from these supergene trees.
RESULTS: Under realistic conditions in our simulation study, statistical binning reduced the topological error of species trees estimated using MP-EST and enabled a coalescent-based analysis that was more accurate than concatenation even when gene tree estimation error was relatively high. Statistical binning also reduced the error in gene tree topology and species tree branch length estimation, especially when the phylogenetic signal in gene sequence alignments was low. Species trees estimated using MP-EST with statistical binning on four biological data sets showed increased concordance with the biological literature. When MP-EST was used to analyze 14,446 gene trees in the avian phylogenomics project, it produced a species tree that was discordant with the concatenation analysis and conflicted with prior literature. However, the statistical binning analysis produced a tree that was highly congruent with the concatenation analysis and was consistent with the prior scientific literature.
CONCLUSIONS: Statistical binning reduces the error in species tree topology and branch length estimation because it reduces gene tree estimation error. These improvements are greatest when gene trees have reduced bootstrap support, which was the case for the avian phylogenomics project. Because using unbinned gene trees can result in overestimation of ILS, statistical binning may be helpful in providing more accurate estimations of ILS levels in biological data sets. Thus, statistical binning enables highly accurate species tree estimations, even on genome-scale data sets.
The list of author affiliations is available in the full article online.
*Corresponding author. E-mail: [email protected]
Cite this article as S. Mirarab et al., Science 346, 1250463 (2014). DOI: 10.1126/science.1250463
Read the full article at http://dx.doi.org/10.1126/science.1250463
Published by AAAS
Note: Supergene trees computed using fully partitioned maximum likelihood. Vertex-coloring graph with balanced color classes is NP-hard; we used heuristic.
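The binning step can be pictured as coloring an incompatibility graph so that genes in conflict never share a bin, while keeping the bins roughly equal in size. The sketch below is a generic greedy heuristic written for illustration only; the slides do not say which heuristic was actually used. It assumes genes are numbered 0..n-1 and that conflicts are given as pairs of gene indices (in the pipeline these would come from the statistical test for combinability, which is not modeled here).

```python
def balanced_greedy_coloring(num_genes, conflicts):
    """Assign genes to bins so that no bin contains two conflicting genes.

    Greedy rule (an assumption for this sketch): visit genes from most to least
    conflicted and put each one into the smallest bin that has no conflict with it,
    opening a new bin only when every existing bin conflicts.
    """
    neighbors = {g: set() for g in range(num_genes)}
    for a, b in conflicts:
        neighbors[a].add(b)
        neighbors[b].add(a)

    bins = []
    order = sorted(range(num_genes), key=lambda g: -len(neighbors[g]))
    for gene in order:
        compatible = [b for b in bins if not (neighbors[gene] & b)]
        if compatible:
            min(compatible, key=len).add(gene)  # keep bin sizes balanced
        else:
            bins.append({gene})
    return bins

# Six hypothetical genes; genes 0, 1 and 2 are pairwise incompatible.
bins = balanced_greedy_coloring(6, [(0, 1), (0, 2), (1, 2)])
print([sorted(b) for b in bins])
```

A greedy pass like this gives no optimality guarantee, which is consistent with the note that the balanced coloring problem is NP-hard.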
![Page 36: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/36.jpg)
Statistical binning vs. unbinned
Datasets: 11-taxon strong ILS datasets with 50 genes from Chung and Ané, Systematic Biology
Binning produces bins with approximately 5 to 7 genes each
[Figure: average FN rate (0 to 0.25) for MP-EST, MDC*(75), MRP, MRL, GC; Unbinned versus Statistical-75]
![Page 37: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/37.jpg)
Theorem 3 (PLOS One, Bayzid et al. 2015): Unweighted statistical binning pipelines are not statistically consistent under GTR+MSC
As the number of sites per locus increases:
• All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014)
• For each bin, with probability converging to 1, the genes in the bin have the same tree topology (but can have different numeric parameters), and there is only one bin for any given tree topology
• For each bin, a fully partitioned maximum likelihood (ML) analysis of its supergene alignment converges to a tree with the common gene tree topology.
As the number of loci increases:
• Every gene tree topology appears with probability converging to 1.
Hence as both the number of loci and number of sites per locus increase, with probability converging to 1, every gene tree topology appears exactly once in the set of supergene trees. It is impossible to infer the species tree from the flat distribution of gene trees!
![Page 38: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/38.jpg)
Fig 1. Pipeline for unbinned analyses, unweighted statistical binning, and weighted statistical binning.
Bayzid MS, Mirarab S, Boussau B, Warnow T (2015) Weighted Statistical Binning: Enabling Statistically Consistent Genome-Scale Phylogenetic Analyses. PLoS ONE 10(6): e0129183. doi:10.1371/journal.pone.0129183 http://127.0.0.1:8081/plosone/article?id=info:doi/10.1371/journal.pone.0129183
![Page 39: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/39.jpg)
Theorem 2 (PLOS One, Bayzid et al. 2015): WSB pipelines are statistically consistent under GTR+MSC
Easy proof: As the number of sites per locus increases:
• All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014)
• For every bin, with probability converging to 1, the genes in the bin have the same tree topology
• Fully partitioned GTR ML analysis of each bin converges to a tree with the common topology of the genes in the bin
Hence as the number of sites per locus and number of loci both increase, WSB followed by a statistically consistent summary method will converge in probability to the true species tree. Q.E.D.
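The difference between the two pipelines can be sketched in a few lines. The sketch assumes that weighting means each supergene tree is passed to the summary method once per gene in its bin, so the weighted input has as many trees as there are loci. The function names and the Newick strings are made up for illustration.

```python
def unweighted_supergene_trees(bins, supergene_trees):
    """One supergene tree per bin, regardless of bin size."""
    return list(supergene_trees)


def weighted_supergene_trees(bins, supergene_trees):
    """Repeat each supergene tree once per gene in its bin."""
    weighted = []
    for genes, tree in zip(bins, supergene_trees):
        weighted.extend([tree] * len(genes))
    return weighted


# Toy input: three genes fall in the first bin, one in the second.
example_bins = [["g1", "g2", "g3"], ["g4"]]
example_trees = ["((A,B),(C,D));", "((A,C),(B,D));"]
example_weighted = weighted_supergene_trees(example_bins, example_trees)
```

With the toy input, the weighted list contains the first topology three times and the second once, so the frequencies seen by the summary method reflect how many genes support each supergene tree rather than being flat across bins.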
![Page 40: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/40.jpg)
Table 1. Model trees used in the Weighted Statistical Binning study. We show number of taxa, species tree branch length (relative to base model), and average topological discordance between true gene trees and true species tree.

| Dataset | Species tree branch length scaling | Average Discordance (%) |
|---|---|---|
| Avian (48) | 2X | 35 |
| Avian (48) | 1X | 47 |
| Avian (48) | 0.5X | 59 |
| Mammalian (37) | 2X | 18 |
| Mammalian (37) | 1X | 32 |
| Mammalian (37) | 0.5X | 54 |
| 10-taxon "Lower ILS" | | 40 |
| 10-taxon "Higher ILS" | | 84 |
| 15-taxon "High ILS" | | 82 |

doi:10.1371/journal.pone.0129183.t001
![Page 41: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/41.jpg)
[Figure panels: (a) MP-EST on varying gene sequence length; (b) ASTRAL on varying gene sequence length; (c) MP-EST on varying numbers of genes; (d) MP-EST on varying levels of ILS]
Species tree estimation error for MP-EST and ASTRAL, and also concatenation using ML, on avian simulated datasets: 48 taxa, moderately high ILS (AD=47%), 1000 genes, and varying gene sequence length.
Binning can improve species tree topology estimation
Bayzid et al., (2015). PLoS ONE 10(6): e0129183
![Page 42: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/42.jpg)
Cumulative distribution of the bootstrap support values of true positive (left) and false positive (right) edges. If a curve for method X is above the curve for method Y, then X has higher BS for true positives and lower BS for false positives. Values in the shaded area indicate false positive branches with support at 75% or higher. Results are shown for 1000 genes with 500bp, on the avian simulated datasets.
Binning can reduce incidence of high support false positive edges
Bayzid et al., (2015). PLoS ONE 10(6): e0129183
![Page 43: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/43.jpg)
Weighted Statistical Binning: empirical
WSB generally benign to highly beneficial for moderate to large datasets:
– Improves gene tree estimation
– Improves species tree topology
– Improves species tree branch length
– Reduces incidence of highly supported false positive branches
![Page 44: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/44.jpg)
Weighted Statistical Binning: empirical
However, WSB can reduce accuracy under some conditions. Current simulations have only established this for model conditions that simultaneously have:
• Very small numbers of species (at most 10)
• Very high ILS (AD > 80%)
• Low bootstrap support for gene trees
Most likely there are other conditions as well.
![Page 45: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/45.jpg)
Species tree estimation error for MP-EST and ASTRAL on 10-taxon datasets
Bayzid MS, Mirarab S, Boussau B, Warnow T (2015). PLoS ONE 10(6): e0129183
Simphy Model Tree
• 200 genes with 100bp (GTRGAMMA)
• 10 replicates per condition
Notes:
• Moderate ILS: binning neutral or beneficial using BS=50%
• Very high ILS: binning neutral for BS=50%, but increases MP-EST error with BS=75%
[Figure panels: AD=40%; AD=84%]
![Page 46: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/46.jpg)
Liu and Edwards, Comment in Science, October 2015
Attempted proof that WSB pipelines are statistically inconsistent for bounded number of sites per locus:
• The proof fails for multiple reasons, including the use of unpartitioned ML instead of fully partitioned ML
Simulation study
• 5-taxon, strict molecular clock, very high ILS (AD=82%)
• performed WSB using unpartitioned ML instead of fully partitioned ML.
• erroneous (ectopic) data in supergene alignments, biasing against WSB
• Our re-analysis of their data produced better results than they reported, but WSB did reduce accuracy on their data.
![Page 48: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/48.jpg)
Fig. 1 Binning simulation.
Liang Liu, and Scott V. Edwards Science 2015;350:171
Published by AAAS
Liu and Edwards, Comment in Science, October 2015
Attempted proof that WSB pipelines are statistically inconsistent for bounded number of sites per locus:
• The proof fails for multiple reasons, including the use of unpartitioned ML instead of fully partitioned ML
Simulation study
• 5-taxon, strict molecular clock, very high ILS (AD=82%)
• Our re-analysis of their data produced better results for statistical binning (both weighted and unweighted) than they reported.
• They performed WSB using unpartitioned ML instead of fully partitioned ML (biasing against statistical binning)
• They had erroneous (ectopic) data in their supergene alignments, biasing against statistical binning
Figure of model tree from L&E, Science 9 October 2015: 171
This model tree fits into the category of conditions described in Bayzid et al. PLOS One 2015, in which WSB reduced accuracy (very small numbers of taxa, very high ILS).
![Page 53: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/53.jpg)
Fig. 1 Binning simulation.
Liang Liu, and Scott V. Edwards Science 2015;350:171
Published by AAAS
Liu and Edwards, Comment in Science, October 2015
Attempted proof that WSB pipelines are statistically inconsistent for bounded number of sites per locus:
• The proof fails for multiple reasons, including the use of unpartitioned ML instead of fully partitioned ML
Simulation study
• 5-taxon, strict molecular clock, very high ILS (AD=82%)
Figure of model tree from L&E, Science 9 October 2015: 171
Mirarab et al. response (Science, October 2015):
• Our re-analysis of their data (with correct supergene alignments) shows that WSB reduces accuracy, but not by as much as they report.
• Our analyses of slightly larger datasets with the same properties (pectinate, very high ILS, strict clock) showed WSB neutral to beneficial.
![Page 54: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/54.jpg)
Fig. 1 Binning simulation.
Liang Liu, and Scott V. Edwards Science 2015;350:171
Published by AAAS
Liu and Edwards, Comment in Science, October 2015
Attempted proof that WSB pipelines are statistically inconsistent for bounded number of sites per locus:
• The proof fails for multiple reasons, including the use of unpartitioned ML instead of fully partitioned ML
Simulation study
• 5-taxon, strict molecular clock, very high ILS (AD=82%)
Figure of model tree from L&E, Science 9 October 2015: 171
All the conditions in which WSB has been shown to reduce accuracy have the following properties:
• High ILS (AD > 80%)
• Small numbers of taxa (at most 10)
• Low bootstrap support on gene trees
and most also obeyed the strict molecular clock.
Bayzid et al. (PLOS One March 2015) advises against the use of WSB under these conditions.
![Page 56: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/56.jpg)
L&E ask a good question: performance on bounded number of sites!
• Question: Do any summary methods converge to the species tree as the number of loci increases, but where each locus has only a constant number of sites?
• Answers: Roch & Warnow, Syst Biol, March 2015:
– Strict molecular clock: Yes for some new methods, even for a single site per locus
– No clock: Unknown for all methods, including MP-EST, ASTRAL, etc.
S. Roch and T. Warnow. "On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods", Systematic Biology, 64(4):663-676, 2015
![Page 57: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/57.jpg)
Rephrasing L&E Technical Comment as a conjecture
• For any positive integer L, if all loci have at most L sites, then WSB pipelines cannot be statistically consistent under the MSC.
• Comments:
– Open
– Will be hard to settle either way
![Page 58: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/58.jpg)
Consistency first kind: both number of loci and number of sites go to infinity
Consistency second kind: number of loci goes to infinity, number of sites bounded by L (arbitrary constant)
No theoretical difference between MP-EST, ASTRAL, and WSB (according to current knowledge)
Table from PLOS Currents, Warnow 2015 (Concatenation Analyses in the Presence of Incomplete Lineage Sorting, PLOS Currents Tree of Life, http://currents.plos.org/treeoflife/article/concatenation-analyses-in-the-presence-of-incomplete-lineage-sorting/)

Statistical consistency of some standard methods
We present the current status with respect to statistical consistency (of the first or second kind) of some standard phylogenomic estimation methods. The first column is for the first meaning of statistical consistency, which states that the species tree estimated by the method will converge to the true species tree as the number of loci and number of sites per locus both increase. The second column is for the second meaning, which states that the species tree estimated by the method will converge to the true species tree as the number of loci increases, even for bounded number of sites per locus. We also cite the paper in which the theoretical result is established.

| Method | Consistency, first kind | Consistency, second kind |
|---|---|---|
| MP-EST | YES | UNKNOWN |
| ASTRAL | YES | UNKNOWN |
| Unpartitioned concatenated maximum likelihood | NO (1) | NO (1) |
| Fully partitioned maximum likelihood | UNKNOWN | UNKNOWN |
| Unweighted statistical binning followed by consistent summary method (e.g., ASTRAL) | NO (10) | NO (10) |
| Weighted statistical binning followed by consistent summary method (e.g., ASTRAL) | YES (10) | UNKNOWN |
| *BEAST | YES | UNKNOWN |

Appendix 1: Quote from Roch and Steel's Paper
An outline of the proof of the main theorem is as follows: We show that the expected proportion of sites that are constant can be made arbitrary large with low rates of evolution (the lower bounds are formalized in Claim 4) and that the empirical frequencies of site patterns is concentrated around the expected values (Claim 2). When there are a large enough number of invariable sites, it can be shown that likelihood scores and parsimony scores converge to the same answer (formalized in Claim 1). Thus trees that have better parsimony score have better likelihood under these scenarios. Therefore, it suffices to show that parsimony is not statistically consistent under arbitrary low rates of evolution.
Sebastien Roch and Mike Steel, "Likelihood-based tree reconstruction on a concatenation of sequence datasets can be statistically inconsistent", Theoretical Population Biology 100 (2015): 56-62
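The two kinds of consistency described on this slide can be written compactly. This is one possible reading, in notation introduced here rather than taken from the slides: $T^*$ is the true species tree, and $\hat{T}_{m,k}$ is the tree the method returns on $m$ loci with $k$ sites per locus.

```latex
% Notation assumed for this sketch (not from the slides):
%   T^*            true species tree
%   \hat{T}_{m,k}  estimate from m loci with k sites per locus
\begin{align*}
\text{First kind:}  \quad & \lim_{m,k \to \infty} \Pr\bigl[\hat{T}_{m,k} = T^*\bigr] = 1 \\
\text{Second kind:} \quad & \text{for every constant } L \text{ and every } k \le L,\quad
  \lim_{m \to \infty} \Pr\bigl[\hat{T}_{m,k} = T^*\bigr] = 1
\end{align*}
```

Under this reading, an UNKNOWN in the second column means that neither the limit statement nor a counterexample has been established for that method.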
![Page 59: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/59.jpg)
Future Directions
• Better coalescent-based summary methods (that are more robust to gene tree estimation error)
• Better techniques for estimating gene trees given multi-locus data, or for co-estimating gene trees and species trees
• Better theory about robustness to gene tree estimation error (or lack thereof) for coalescent-based summary methods
• Better "single site" methods
![Page 63: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/63.jpg)
Acknowledgments
Mirarab et al., Science 2014 (Statistical Binning)
Roch and Warnow, Systematic Biology 2014 (Points of View)
Bayzid et al., Science 2015 (Response to Liu and Edwards Comment)
Mirarab and Warnow, Bioinformatics 2015 (ASTRAL-2)
Warnow PLOS Currents: Tree of Life 2014 (concatenation analysis)
Papers available at http://tandy.cs.illinois.edu/papers.html
ASTRAL and statistical binning software at https://github.com/smirarab
Funding: NSF, David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Computing Center), Grainger Foundation, and HHMI (to SM).
![Page 64: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/64.jpg)
Comparing Binned and Un-binned MP-EST on the Avian Dataset
[Figure: two avian species trees with branch support values, Binned MP-EST (unweighted/weighted) and Unbinned MP-EST; labeled clades include Cursores, Columbea, Otidimorphae, and Australaves; legend: Conflict with other lines of strong evidence]
Binned MP-EST is largely consistent with the ML concatenation analysis. The trees presented in Science 2014 were the ML concatenation and Binned MP-EST
Bayzid et al., (2015). PLoS ONE 10(6): e0129183
![Page 65: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/65.jpg)
Species tree estimation error for MP-EST and ASTRAL on 15-taxon datasets
Bayzid MS, Mirarab S, Boussau B, Warnow T (2015). PLoS ONE 10(6): e0129183.
Model Tree:
• Very high ILS: AD=82%
• Strict molecular clock
• GTR+Gamma sequence evolution (Indelible)
• 10 replicates per condition
Notes:
• BS=75% often improved accuracy (p=0.04)
• BS=50% sometimes reduced accuracy, but differences were not statistically significant.
• MP-EST more accurate than ASTRAL on these data.
![Page 66: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/66.jpg)
Research Questions
• Why does concatenation using ML produce such good accuracy under many conditions?
• Why does statistical binning improve accuracy under many conditions?
• What kind of method should be used to compute a species tree, or does this depend on the estimated amount of ILS and gene tree accuracy in the dataset?
• What is a biologically reasonable amount of ILS?
• How can we use multiple loci to help improve the estimation of individual gene trees? (Note: co-estimation under the MSC is very powerful, but current methods are not able to analyze even moderate-sized datasets.)
![Page 67: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/67.jpg)
• Bayzid et al. (PLOS One 2015): “Thus, statistical binning seems to be beneficial when both ILS level and gene tree bootstrap support are not too high, will be neutral when bootstrap support values are high (so little or no binning occurs), but can be detrimental when ILS levels are extremely high but gene tree bootstrap support is low enough that binning occurs. Thus, one consequence of this study is the suggestion that when ILS levels are very high and the average gene tree bootstrap support is low, then either statistical binning should not be used, or it should be used in a very conservative fashion—with the parameter B set very low.”
• Mirarab et al. (Science 2015): Studies on 10- and 15-taxon datasets similar to the Liu and Edwards 5-taxon datasets showed binning was neutral to beneficial. Hence dataset size also seems to be relevant (i.e., binning might be potentially detrimental on very small datasets).
When not to use statistical binning
![Page 68: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/68.jpg)
Summary
Unpartitioned concatenation using maximum likelihood is statistically inconsistent under the MSC (Roch and Steel 2014, see Warnow PLOS Currents 2015). However, concatenation can be highly accurate (and even more accurate than the best coalescent-based methods currently available) under low enough ILS. Concatenation is controversial.
Many coalescent-based summary methods (e.g., MP-EST and ASTRAL) converge in probability to the true species tree as the number of gene trees increases. However, all proofs to date have assumed error-free gene trees, and gene tree estimation error clearly impacts species tree estimation accuracy (and not just bootstrap support). Some new summary methods can have excellent accuracy even on large datasets (e.g., ASTRAL-2). However, summary methods are controversial.
Statistical binning (Mirarab et al. Science 2014) and weighted statistical binning (Bayzid et al., PLOS One) often (but not always) improve gene tree estimation, and hence coalescent-based species tree estimation from multiple genes. However, statistical binning is controversial.
![Page 71: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/71.jpg)
Sketch of L&E argument
• For any sequence length L, there is a model species tree such that nearly all sites on nearly all genes evolve without any changes, and so nearly all gene trees have maximum bootstrap support below the threshold value.
• As the number of loci increases, the bins produced by WSB will have the same gene tree distribution as for the true species tree (or the deviation will not impact any downstream argument).
• On each bin, ML concatenation will converge to some tree that is not the species tree.
• (Hence, applying a coalescent-based method to these supergene trees will not converge to the species tree as the number of loci increases.)
![Page 72: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I](https://reader033.vdocuments.us/reader033/viewer/2022052023/60391304ed564b43a0140391/html5/thumbnails/72.jpg)
Sketch of L&E argument
• For any sequence length L, there is a model species tree such that nearly all sites on nearly all genes evolve without any changes, and so nearly all gene trees have maximum bootstrap support below the threshold value.
• As the number of loci increases, the bins produced by WSB will have the same gene tree distribution as for the true species tree (or the deviation will not impact any downstream argument).
• On each bin, ML concatenation will converge to some tree that is not the species tree.
• (Hence, applying a coalescent-based method to these supergene trees will not produce the species tree, even as the number of loci increases.)
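The first step of the sketch can be made concrete with a small calculation. The example below assumes the Jukes-Cantor (JC69) model with branch lengths in expected substitutions per site, so that the number of substitution events at one site over a gene tree of total length T is Poisson with mean T, and sites evolve independently. The model choice, the function names, and the 0.99 target are assumptions for illustration and are not taken from the L&E argument itself.

```python
import math


def prob_gene_unchanged(seq_len, tree_length):
    """P(no substitution event at any of seq_len sites) under the JC69 assumption.

    tree_length is the total gene tree length in expected substitutions per site.
    """
    return math.exp(-seq_len * tree_length)


def max_tree_length(seq_len, target_prob):
    """Largest total tree length for which a gene is unchanged with prob >= target_prob."""
    return -math.log(target_prob) / seq_len


if __name__ == "__main__":
    # For any fixed L, a short enough tree makes almost every gene invariant.
    for seq_len in (100, 1000, 10000):
        length = max_tree_length(seq_len, 0.99)
        print(seq_len, length, prob_gene_unchanged(seq_len, length))
```

Longer sequences need proportionally shorter trees, but for every fixed L such a tree exists, which is what the first bullet requires. An alignment with no changes carries no signal for any branch; the later steps (binning and ML concatenation on each bin) are not modeled here.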