New methods for estimating species trees from genome-scale data
Tandy Warnow, The University of Illinois
Source: tandy.cs.illinois.edu/mit-dec4.pdf


Orangutan Gorilla Chimpanzee Human

From the Tree of Life Website, University of Arizona

Phylogeny (evolutionary tree)

Orangutan Gorilla Chimpanzee Human

From the Tree of Life Website, University of Arizona

Sampling multiple genes from multiple species

phylogenomics

Figure: sequence alignments for gene 1, gene 2, ..., gene 999, and gene 1000, each with one row per species (Orangutan, Gorilla, Chimpanzee, Human). The four alignments shown are:

ACTGCACACCG
ACTGC-CCCCG
AATGC-CCCCG
-CTGCACACGG

CTGAGCATCG
CTGAGC-TCG
ATGAGC-TC-
CTGA-CAC-G

AGCAGCATCGTG
AGCAGC-TCGTG
AGCAGC-TC-TG
C-TA-CACGGTG

CAGGCACGCACGAA
AGC-CACGC-CATA
ATGGCACGC-C-TA
AGCTAC-CACGGAT

“gene” here refers to a portion of the genome (not a functional gene)

I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome

Gene tree discordance

Figure: two gene trees, gene 1 and gene 1000, on Orang., Gorilla, Chimp, and Human.

Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity

Gene trees inside the species tree (Coalescent Process)

Present

Past

Courtesy James Degnan

Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.
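How often such a mismatch happens can be made concrete for the smallest case. The sketch below is not from the talk; it uses the standard three-taxon result under the multi-species coalescent, where a rooted species tree ((A,B),C) with internal branch length t (in coalescent units) yields the matching gene tree topology with probability 1 - (2/3)e^(-t) and each of the two discordant topologies with probability (1/3)e^(-t). The function name and taxon labels are illustrative assumptions.

```python
import math


def triplet_gene_tree_probs(t):
    """Gene tree topology probabilities for the species tree ((A,B),C).

    t is the internal branch length in coalescent units (assumed t >= 0).
    The labels A, B, C are placeholders, not taxa from the slides.
    """
    # The A and B lineages fail to coalesce on the internal branch with
    # probability exp(-t); then all three topologies are equally likely.
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


if __name__ == "__main__":
    for t in (0.1, 1.0, 3.0):
        print(t, triplet_gene_tree_probs(t))
```

As t shrinks toward 0 the three topologies approach equal probability, which is the high-ILS regime discussed in the rest of the talk.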

1KP: Thousand Transcriptome Project

•  103 plant transcriptomes, 400-800 single copy “genes”
•  Next phase will be much bigger
•  Wickett, Mirarab et al., PNAS 2014

G. Ka-Shu Wong, U Alberta
N. Wickett, Northwestern
J. Leebens-Mack, U Georgia
N. Matasci, iPlant
T. Warnow, S. Mirarab, N. Nguyen, UT-Austin

Challenge:
•  Massive gene tree heterogeneity consistent with ILS

Avian Phylogenomics Project

•  Approx. 50 species, whole genomes, 14,000 loci
•  Jarvis, Mirarab, et al., Science 2014

E Jarvis, HHMI
G Zhang, BGI
MTP Gilbert, Copenhagen
S. Mirarab, Md. S. Bayzid, UT-Austin
T. Warnow, UT-Austin
Plus many many other people…

Major challenge:
•  Massive gene tree heterogeneity consistent with ILS.

This talk

•  Gene tree heterogeneity due to incomplete lineage sorting, modelled by the multi-species coalescent (MSC)
•  Statistically consistent estimation of species trees under the MSC, and the impact of gene tree estimation error
•  ASTRAL (Bioinformatics 2014, 2015): coalescent-based species tree estimation method that has high accuracy on large datasets (1000 species and genes)
•  “Statistical binning” (Science 2014): improving gene tree estimation, and hence species tree estimation
•  Open questions

Controversial!

Incomplete Lineage Sorting (ILS)

•  Confounds phylogenetic analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi, etc.
•  There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS, focused around statistical consistency guarantees (theory) and performance on data.

Statistical Consistency

Figure: error plotted against amount of data.

Two competing approaches

Figure labels: gene 1, gene 2, ..., gene k; Species; Analyze separately; Summary Method; Concatenation

What about summary methods?

Techniques: Most frequent gene tree? Consensus of gene trees? Other?

Statistically consistent under ILS?

•  Coalescent-based summary methods:
   -  MP-EST (Liu et al. 2010): maximum pseudo-likelihood estimation of rooted species tree based on rooted triplet tree distribution: YES
   -  NJst (Liu and Yu, 2011): YES
   -  And others, including some newer methods (BUCKy-pop, ASTRAL, ASTRID, etc.): YES
•  Co-estimation methods: *BEAST (Heled and Drummond 2009): Bayesian co-estimation of gene trees and species trees: YES
•  Single-site methods (SVDquartets, METAL, SNAPP, and others)

1KP: Thousand Transcriptome Project

Challenges:
•  Massive gene tree heterogeneity consistent with ILS
•  Could not use MP-EST due to missing data (many gene trees could not be rooted) and large number of species

1KP: Thousand Transcriptome Project

Solution:
•  New coalescent-based method ASTRAL
•  ASTRAL is statistically consistent, polynomial time, and uses unrooted gene trees.

ASTRAL and ASTRAL-2

•  Estimates the species tree from gene trees by finding the species tree that has the maximum quartet support, using dynamic programming
•  Theorem: ASTRAL is statistically consistent under the MSC, even when solved in constrained mode (drawing bipartitions from the input gene trees)
•  The constrained version of ASTRAL runs in polynomial time
•  Open source software at https://github.com/smirarab
•  Published in ECCB/Bioinformatics 2014 (Mirarab et al.) and ISMB/Bioinformatics 2015 (Mirarab and Warnow)
•  Used in Wickett, Mirarab et al. (PNAS 2014) and Prum, Berv et al. (Nature 2015) (and in many other papers)
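To make "quartet support" concrete, here is a minimal brute-force sketch that scores one candidate species tree by counting how many gene-tree quartets it agrees with. It is not ASTRAL's dynamic programming and does not search over trees; it only evaluates the quantity being maximized. The nested-tuple tree encoding, the function names, and the assumption that every gene tree contains every taxon are all illustrative choices, not part of the talk.

```python
from itertools import combinations


def clades(tree):
    """Return the leaf set below every node of a nested-tuple tree."""
    found = []

    def visit(node):
        if isinstance(node, str):
            below = frozenset([node])
        else:
            below = frozenset().union(*(visit(child) for child in node))
        found.append(below)
        return below

    visit(tree)
    return found


def quartet_topology(tree_clades, quartet):
    """Return the unrooted pairing ab|cd induced on four taxa, or None."""
    q = frozenset(quartet)
    for side in tree_clades:
        inside = q & side
        if len(inside) == 2:
            # Two quartet taxa below this node, two elsewhere: resolved.
            return frozenset([inside, q - inside])
    return None  # the tree leaves this quartet unresolved


def quartet_support(candidate, gene_trees):
    """Count gene-tree quartets that match the candidate species tree.

    Assumes every gene tree has exactly the same taxa as the candidate.
    """
    candidate_clades = clades(candidate)
    taxa = max(candidate_clades, key=len)
    gene_clades = [clades(g) for g in gene_trees]
    score = 0
    for quartet in combinations(sorted(taxa), 4):
        target = quartet_topology(candidate_clades, quartet)
        if target is None:
            continue
        score += sum(quartet_topology(gc, quartet) == target for gc in gene_clades)
    return score


genes = [
    (("Human", "Chimp"), ("Gorilla", "Orang")),
    (("Human", "Chimp"), ("Gorilla", "Orang")),
    (("Human", "Gorilla"), ("Chimp", "Orang")),
]
tree_a = ((("Human", "Chimp"), "Gorilla"), "Orang")
tree_b = ((("Human", "Gorilla"), "Chimp"), "Orang")
print(quartet_support(tree_a, genes), quartet_support(tree_b, genes))
```

This enumeration grows with the fourth power of the number of taxa, which is why a practical method needs something like the dynamic programming named on the slide.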

Simulation study

• Variable parameters:
   - Number of species: 10 to 1000
   - Number of genes: 50 to 1000
   - Amount of ILS: low, medium, high
   - Deep versus recent speciation
• 11 model conditions (50 replicates each) with heterogeneous gene tree error
• Compare to NJst, MP-EST, concatenation (CA-ML)
• Evaluate accuracy using FN rate: the percentage of branches in the true tree that are missing from the estimated tree
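The FN rate defined in the last bullet can be computed from the bipartitions (internal branches) of the two trees. The following is a minimal sketch under stated assumptions: trees are encoded as nested tuples of taxon names, both trees have the same taxa, and the function names are hypothetical. It returns a fraction rather than a percentage.

```python
def bipartitions(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of names."""
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_taxa = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Skip trivial splits (a single leaf on either side).
        if 1 < len(below) < len(all_taxa) - 1:
            splits.add(frozenset([below, all_taxa - below]))
        return below

    visit(tree)
    return splits


def fn_rate(true_tree, estimated_tree):
    """Fraction of true-tree branches missing from the estimated tree."""
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0
    missing = true_splits - bipartitions(estimated_tree)
    return len(missing) / len(true_splits)


true_tree = ((("Finch", "Falcon"), "Owl"), ("Eagle", "Pigeon"))
estimated = ((("Finch", "Owl"), "Falcon"), ("Eagle", "Pigeon"))
print(fn_rate(true_tree, estimated))
```

In this toy case the true tree has two internal branches and the estimate recovers only the one separating Eagle and Pigeon from the rest.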

Figure: simulation pipeline with the labels True (model) species tree, True gene trees, Sequence data, Estimated gene trees, and Estimated species tree (example taxa: Finch, Falcon, Owl, Eagle, Pigeon).

ASTRAL-II

... look at all pairs of leaves chosen each from one of the children of u. For each such pair of leaves, there are $\binom{u'}{2}$ quartet trees that put that pair together, where $u'$ is the number of leaves outside the node u. This will examine each pair of nodes in each of the input k nodes exactly once and would therefore require $O(n^2 k)$ computations. The final score can be normalized by the maximum number of input quartet trees that include a pair of taxa.

Given the similarity matrix, we calculate an UPGMA tree and add all its bipartitions to the set X. This heuristic adds relatively few bipartitions, but the matrix is used in the next heuristic, which is our main addition mechanism.

Greedy: We estimate the greedy consensus of the gene trees at different threshold levels (0, 1/100, 2/100, 5/100, 1/10, 1/4, 1/3). For each polytomy in each greedy consensus tree, we resolve the polytomy in multiple ways and add bipartitions implied by those resolutions to the set X. First, we resolve the polytomy by applying the UPGMA algorithm to the similarity matrix, starting from the clades given by the polytomy. Then, we sample one taxon from each side of the polytomy randomly, and use the greedy consensus of the gene trees restricted to this subsample to find a resolution of the polytomy (we randomly resolve any multifurcations in this greedy consensus on the induced subsample). We repeat this process at least 10 times, but if the subsampled greedy consensus trees include sufficiently frequent bipartitions (defined as > 1%), we do more rounds of random sampling (we increase the number of iterations by two every time this happens). For each random subsample around a polytomy, we also resolve it by calculating an UPGMA tree on the subsampled similarity matrix. Finally, for the first two greedy threshold values and the first 10 random subsamples, we also use a third strategy that can potentially add a larger number of bipartitions. For each subsampled taxon x, we resolve the polytomy as a caterpillar tree by sorting the remaining taxa according to their similarity with x.

Gene tree polytomies: When gene trees include polytomies, we also add new bipartitions to set X. We first compute the greedy consensus of the input gene trees with threshold 0 and if the greedy consensus has polytomies, we resolve them using UPGMA; we repeat this process twice to account for uncertainty in greedy consensus estimation. Then, for each gene tree polytomy, we use the two resolved consensus trees to infer a resolution of the polytomy and we add the implied resolutions to set X.

Fig. 1. Characteristics of the simulation. (a) True gene tree discordance: RF distance between the true species tree and the true gene trees (50 replicates of 1000 genes) for Dataset I. Tree height directly affects the amount of true discordance; the speciation rate affects true gene tree discordance only with 10M tree length. (b) Gene tree estimation error: RF distance between true gene trees and estimated gene trees for Dataset I. See also Figure S1 for inter and intra-replicate gene tree error distributions. (Both panels plot density against RF distance; legend: rate 1e−06, 1e−07; tree height 10M, 2M, 500K.)

3.3 Multi-furcating input gene trees

Extending ASTRAL to inputs that include polytomies requires solving the weighted quartet tree problem when each node of the input defines not a tripartition, but a multi-partition of the set of taxa. We start with a basic observation: every resolved quartet tree induced by a gene tree maps to two nodes in the gene tree regardless of whether the gene tree is binary or not. In other words, induced quartet trees that map to only one node of the gene tree are unresolved. When maximizing the quartet support, these unresolved gene tree quartet trees are inconsequential and need to be ignored. Now, consider a polytomy of degree d. There are $\binom{d}{3}$ ways to select three sides of the polytomy. Each of these ways of selecting three sides defines a tripartition of a subset of taxa. Any selection of two taxa from one side of this tripartition and one taxon from each of the remaining two sides still defines an induced resolved quartet tree, and each induced resolved quartet tree would still map to exactly two nodes in our multi-furcating tree. Thus, all the algorithmic assumptions of ASTRAL remain intact, as long as for each multi-furcating node in an input gene tree, we treat it as a collection of $\binom{d}{3}$ tripartitions. Note that in the presence of polytomies, the running time analysis can change because analyzing each multi-furcating node requires time cubic in its degree and the degree can increase in principle with n. Thus, the running time depends on the patterns of the multi-furcations and cannot be studied in a general case.
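The counting argument for polytomies can be illustrated with a short sketch. It enumerates the $\binom{d}{3}$ tripartitions around a node of degree d and, for one tripartition, counts the selections of two taxa from one side and one taxon from each of the other two. The representation of a node as a list of taxon sets and the function names are assumptions made for illustration; this is not code from ASTRAL-II.

```python
from itertools import combinations
from math import comb


def polytomy_tripartitions(sides):
    """All ways to pick three sides of a node whose sides are taxon sets."""
    return list(combinations(sides, 3))


def resolved_quartets(x, y, z):
    """Selections of two taxa from one side and one from each other side."""
    return (
        comb(len(x), 2) * len(y) * len(z)
        + comb(len(y), 2) * len(x) * len(z)
        + comb(len(z), 2) * len(x) * len(y)
    )


# A hypothetical node of degree 4 with sides {A,B}, {C}, {D}, {E,F}.
sides = [{"A", "B"}, {"C"}, {"D"}, {"E", "F"}]
tripartitions = polytomy_tripartitions(sides)
print(len(tripartitions))
print(resolved_quartets({"A", "B"}, {"C"}, {"E", "F"}))
```

Because the number of tripartitions grows as the cube of the degree, a single high-degree node can dominate the cost, which matches the running-time caveat in the text.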

Statistical Consistency: ASTRAL-I was statistically consistent, and changes from ASTRAL-I to ASTRAL-II either affect running time, or enlarge the search space, which does not negate consistency.

Theorem 3: ASTRAL-II is statistically consistent for binary complete input gene trees.

4 EXPERIMENTAL SETUP

Simulation Procedure: We used SimPhy, a tool developed by Mallo et al. (2015), to simulate species trees and gene trees (produced in mutation units), and then used Indelible to simulate sequences down the gene trees with varying length and model parameters. We estimated gene trees on these simulated gene alignments, which we then used in coalescent-based analyses.

We simulated 10 model conditions, which we divide into two datasets, with one model condition appearing in both datasets. We

Used SimPhy, Mallo and Posada, 2015

Tree accuracy when varying the number of species

Figure: species tree topological error (FN), axis ticks 4% to 16%, against number of species (10, 50, 100, 200, 500, 1000) for ASTRAL-II, NJst, and MP-EST.

1000 genes, “medium” levels of recent ILS

Running time when varying the number of species

Figure: running time (hours), axis ticks 0 to 20, against number of species (10, 50, 100, 200, 500, 1000) for ASTRAL-II, NJst, and MP-EST.

1000 genes, “medium” levels of recent ILS


ASTRAL-II on biological datasets (ongoing collaborations)

• 1200 plants with ~ 400 genes (1KP consortium)

• 250 avian species with 2000 genes (with LSU, UF, and Smithsonian)

• 200 avian species with whole genomes (with Genome 10K, international)

• 250 suboscine species (birds) with ~2000 genes (with LSU and Tulane)

• 140 Insects with 1400 genes (with U. Illinois at Urbana-Champaign)

• 50 Hummingbird species with 2000 genes (with U. Copenhagen and Smithsonian)

• 40 raptor species (birds) with 10,000 genes (with U. Copenhagen and Berkeley)

• 38 mammalian species with 10,000 genes (with U. of Bristol, Cambridge, and Nat. Univ. of Ireland)


Avian Phylogenomics Project

Major challenge:
•  Massive gene tree heterogeneity consistent with incomplete lineage sorting
•  Very poor resolution in the 14,000 gene trees (average bootstrap support 25%)
•  Standard coalescent-based species tree estimation methods contradicted concatenation analysis and prior studies

Statistical Consistency for summary methods

Figure: error plotted against amount of data.

Data are gene trees, presumed to be randomly sampled true gene trees.

•  Summary methods combine estimated gene trees, not true gene trees.
•  Multiple studies show that summary methods can be less accurate than concatenation in the presence of high gene tree estimation error.
•  Genome-scale data includes a range of markers, not all of which have substantial signal. Furthermore, removing sites due to model violations reduces signal.
•  Some researchers also argue that “gene trees” should be based on very short alignments, to avoid intra-locus recombination.

TYPICAL PHYLOGENOMICS PROBLEM: many poor gene trees

Gene tree estimation error: key issue in the debate

Avian Phylogenomics Project

•  Approx. 50 species, whole genomes, 14,000 loci
•  Published in Science 2014

Most gene trees had very low bootstrap support, suggestive of gene tree estimation error

Avian Phylogenomics Project

Solution: Statistical Binning
•  Improves coalescent-based species tree estimation by improving gene trees (Mirarab, Bayzid, Boussau, and Warnow, Science 2014)
•  Avian species tree estimated using Statistical Binning with MP-EST (Jarvis, Mirarab, et al., Science 2014)

Ideas behind statistical binning

Figure: horizontal axis is the number of sites in an alignment.

•  “Gene tree” error tends to decrease with the number of sites in the alignment
•  Concatenation (even if not statistically consistent) tends to be reasonably accurate when there is not too much gene tree heterogeneity

Science, 12 December 2014, Vol 346, Issue 6215, p. 1337 (sciencemag.org)

RESEARCH ARTICLE SUMMARY

AVIAN GENOMICS

Statistical binning enables an accurate coalescent-based estimation of the avian tree

Siavash Mirarab, Md. Shamsuzzoha Bayzid, Bastien Boussau, Tandy Warnow*

INTRODUCTION: Reconstructing species trees for rapid radiations, as in the early diversification of birds, is complicated by biological processes such as incomplete lineage sorting (ILS) that can cause different parts of the genome to have different evolutionary histories. Statistical methods, based on the multispecies coalescent model and that combine gene trees, can be highly accurate even in the presence of massive ILS; however, these methods can produce species trees that are topologically far from the species tree when estimated gene trees have error. We have developed a statistical binning technique to address gene tree estimation error and have explored its use in genome-scale species tree estimation with MP-EST, a popular coalescent-based species tree estimation method.

RATIONALE: In statistical binning, phylogenetic trees on different genes are estimated and then placed into bins, so that the differences between trees in the same bin can be explained by estimation error (see the figure). A new tree is then estimated for each bin by applying maximum likelihood to a concatenated alignment of the multiple sequence alignments of its genes, and a species tree is estimated using a coalescent-based species tree method from these supergene trees.

RESULTS: Under realistic conditions in our simulation study, statistical binning reduced the topological error of species trees estimated using MP-EST and enabled a coalescent-based analysis that was more accurate than concatenation even when gene tree estimation error was relatively high. Statistical binning also reduced the error in gene tree topology and species tree branch length estimation, especially when the phylogenetic signal in gene sequence alignments was low. Species trees estimated using MP-EST with statistical binning on four biological data sets showed increased concordance with the biological literature. When MP-EST was used to analyze 14,446 gene trees in the avian phylogenomics project, it produced a species tree that was discordant with the concatenation analysis and conflicted with prior literature. However, the statistical binning analysis produced a tree that was highly congruent with the concatenation analysis and was consistent with the prior scientific literature.

CONCLUSIONS: Statistical binning reduces the error in species tree topology and branch length estimation because it reduces gene tree estimation error. These improvements are greatest when gene trees have reduced bootstrap support, which was the case for the avian phylogenomics project. Because using unbinned gene trees can result in overestimation of ILS, statistical binning may be helpful in providing more accurate estimations of ILS levels in biological data sets. Thus, statistical binning enables highly accurate species tree estimations, even on genome-scale data sets.

Figure: The statistical binning pipeline for estimating species trees from gene trees. Loci are grouped into bins based on a statistical test for combinability, before estimating gene trees. Figure labels: Statistical binning technique; Statistical binning pipeline; Traditional pipeline (unbinned); Sequence data; Incompatibility graph; Gene alignments; Binned supergene alignments; Estimated gene trees; Supergene trees; Species tree.

The list of author affiliations is available in the full article online.
*Corresponding author. E-mail: [email protected]
Cite this article as S. Mirarab et al., Science 346, 1250463 (2014). DOI: 10.1126/science.1250463
Read the full article at http://dx.doi.org/10.1126/science.1250463

Note: Supergene trees computed using fully partitioned maximum likelihood. Vertex coloring of a graph with balanced color classes is NP-hard; we used a heuristic.
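The binning step can be pictured as coloring the incompatibility graph so that no two incompatible genes share a bin while bin sizes stay close. The sketch below is a simple greedy stand-in written for illustration; it is not necessarily the heuristic used in the paper, and it assumes the incompatible pairs have already been determined by some combinability test. All names are hypothetical.

```python
def balanced_greedy_bins(genes, incompatible_pairs):
    """Greedy balanced coloring: genes in one bin are pairwise compatible."""
    neighbors = {g: set() for g in genes}
    for a, b in incompatible_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    bins = []
    # Place the most constrained genes first (highest degree in the graph).
    for gene in sorted(genes, key=lambda g: -len(neighbors[g])):
        # Bins that contain no gene incompatible with this one.
        candidates = [b for b in bins if not (neighbors[gene] & b)]
        if candidates:
            # Prefer the smallest admissible bin to keep sizes balanced.
            min(candidates, key=len).add(gene)
        else:
            bins.append({gene})
    return bins


genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
pairs = [("g1", "g2"), ("g3", "g4")]
bins = balanced_greedy_bins(genes, pairs)
print(bins)
```

On this toy input the sketch produces two bins of three genes each; a greedy rule like this gives no guarantee of optimal balance in general, which is consistent with the problem being NP-hard.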

Statistical binning vs. unbinned

Datasets: 11-taxon strong ILS datasets with 50 genes from Chung and Ané, Systematic Biology

Binning produces bins with approximately 5 to 7 genes each

Figure: average FN rate (axis ticks 0 to 0.25) for MP-EST, MDC*(75), MRP, MRL, and GC, comparing Unbinned and Statistical-75.

Theorem 3 (PLOS One, Bayzid et al. 2015): Unweighted statistical binning pipelines are not statistically consistent under GTR+MSC

As the number of sites per locus increases:
•  All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014)
•  For each bin, with probability converging to 1, the genes in the bin have the same tree topology (but can have different numeric parameters), and there is only one bin for any given tree topology
•  For each bin, a fully partitioned maximum likelihood (ML) analysis of its supergene alignment converges to a tree with the common gene tree topology.

As the number of loci increases:
•  Every gene tree topology appears with probability converging to 1.

Hence as both the number of loci and number of sites per locus increase, with probability converging to 1, every gene tree topology appears exactly once in the set of supergene trees. It is impossible to infer the species tree from the flat distribution of gene trees!

Fig 1. Pipeline for unbinned analyses, unweighted statistical binning, and weighted statistical binning.

Bayzid MS, Mirarab S, Boussau B, Warnow T (2015) Weighted Statistical Binning: Enabling Statistically Consistent Genome-Scale Phylogenetic Analyses. PLoS ONE 10(6): e0129183. doi:10.1371/journal.pone.0129183

Theorem 2 (PLOS One, Bayzid et al. 2015): WSB pipelines are statistically consistent under GTR+MSC

Easy proof: As the number of sites per locus increases:
•  All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014)
•  For every bin, with probability converging to 1, the genes in the bin have the same tree topology
•  Fully partitioned GTR ML analysis of each bin converges to a tree with the common topology of the genes in the bin

Hence as the number of sites per locus and number of loci both increase, WSB followed by a statistically consistent summary method will converge in probability to the true species tree. Q.E.D.
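The difference between the two pipelines comes down to what is handed to the summary method. In the sketch below, the unweighted pipeline passes each supergene tree once, while the weighted pipeline repeats each supergene tree once per gene in its bin, so topology frequencies are preserved. This reflects my understanding of weighting by bin size in Bayzid et al. (2015); the function name, Newick strings, and bin contents are hypothetical.

```python
from collections import Counter


def summary_method_input(bins, supergene_trees, weighted):
    """Build the list of trees passed to the summary method.

    bins[i] is the list of genes in bin i; supergene_trees[i] is its tree.
    """
    trees = []
    for bin_genes, tree in zip(bins, supergene_trees):
        # Weighted binning repeats the tree once per gene in the bin.
        copies = len(bin_genes) if weighted else 1
        trees.extend([tree] * copies)
    return trees


bins = [["g1", "g2", "g3"], ["g4"]]
supergene_trees = ["((A,B),C);", "((A,C),B);"]

unweighted = Counter(summary_method_input(bins, supergene_trees, weighted=False))
weighted = Counter(summary_method_input(bins, supergene_trees, weighted=True))
print(unweighted)
print(weighted)
```

In the unweighted output both topologies appear once, which is the flat distribution described in Theorem 3; the weighted output keeps the 3-to-1 ratio of the underlying genes.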

Table 1. Model trees used in the Weighted Statistical Binning study. We show number of taxa, species tree branch length (relative to base model), and average topological discordance between true gene trees and true species tree.

Dataset                   Species tree branch length scaling   Average Discordance (%)
Avian (48)                2X                                   35
Avian (48)                1X                                   47
Avian (48)                0.5X                                 59
Mammalian (37)            2X                                   18
Mammalian (37)            1X                                   32
Mammalian (37)            0.5X                                 54
10-taxon “Lower ILS”      -                                    40
10-taxon “Higher ILS”     -                                    84
15-taxon “High ILS”       -                                    82

doi:10.1371/journal.pone.0129183.t001

Binning can improve species tree topology estimation

Figure panels: (a) MP-EST on varying gene sequence length; (b) ASTRAL on varying gene sequence length; (c) MP-EST on varying numbers of genes; (d) MP-EST on varying levels of ILS.

Species tree estimation error for MP-EST and ASTRAL, and also concatenation using ML, on avian simulated datasets: 48 taxa, moderately high ILS (AD=47%), 1000 genes, and varying gene sequence length.

Bayzid et al., (2015). PLoS ONE 10(6): e0129183

Page 42: New methods for esmang species trees from genome-scale datatandy.cs.illinois.edu/mit-dec4.pdf · StatisticalConsistency: ASTRAL-Iwasstatisticallyconsistent, and changes from ASTRAL-I

Cumulative distribution of the bootstrap support values of true positive (left) and false positive (right) edges. If a curve for method X is above the curve for method Y, then X has higher BS for true positives and lower BS for false positives. Values in the shaded area indicate false positive branches with support at 75% or higher. Results are shown for 1000 genes with 500 bp, on the avian simulated datasets.

Binning can reduce incidence of high support false positive edges

Bayzid et al., (2015). PLoS ONE 10(6): e0129183
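The quantities summarized in this figure can be sketched as follows. This is only an illustration: it assumes each estimated branch has already been labeled as a true or false positive by comparison with the true tree, and it uses a plain empirical CDF, which may be oriented differently from the curves in the published figure. The function names and the `(support, is_true)` input format are hypothetical.

```python
def split_by_truth(branches):
    """Split (support, is_true) pairs into true positive and false positive support lists."""
    true_pos = [support for support, is_true in branches if is_true]
    false_pos = [support for support, is_true in branches if not is_true]
    return true_pos, false_pos


def empirical_cdf(values, thresholds):
    """Fraction of values that are less than or equal to each threshold."""
    if not values:
        return [0.0 for _ in thresholds]
    return [sum(v <= t for v in values) / len(values) for t in thresholds]


def high_support_false_positive_rate(false_pos_support, cutoff=75):
    """Fraction of false positive branches with support at or above the cutoff."""
    if not false_pos_support:
        return 0.0
    return sum(s >= cutoff for s in false_pos_support) / len(false_pos_support)


if __name__ == "__main__":
    # Hypothetical branches: (bootstrap support in percent, branch is in the true tree)
    branches = [(100, True), (92, True), (60, True), (80, False), (40, False), (10, False)]
    tp, fp = split_by_truth(branches)
    print(empirical_cdf(tp, [50, 75, 100]))
    print(high_support_false_positive_rate(fp))
```

The 75% cutoff mirrors the shaded region described in the caption; any other threshold can be passed through `cutoff`.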


Weighted Statistical Binning: empirical

WSB generally benign to highly beneficial for moderate to large datasets:

–  Improves gene tree estimation

–  Improves species tree topology

–  Improves species tree branch length

–  Reduces incidence of highly supported false positive branches


Weighted Statistical Binning: empirical

However, WSB can reduce accuracy under some conditions. Current simulations have only established this for model conditions that simultaneously have:

•  Very small numbers of species (at most 10)
•  Very high ILS (AD>80%)
•  Low bootstrap support for gene trees

Most likely there are other conditions as well.
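The three conditions above can be written as a small screening check. This is a hedged sketch rather than a rule from the study: the class and function names are hypothetical, and the cutoff for "low bootstrap support" (50% mean support here) is an assumption, since the slide does not give a number. On biological data, AD is not observable and would itself have to be estimated.

```python
from dataclasses import dataclass


@dataclass
class DatasetSummary:
    num_taxa: int
    average_discordance: float       # AD, in percent
    mean_gene_tree_bootstrap: float  # mean branch support of gene trees, in percent


def wsb_risk_conditions(data, low_support_cutoff=50.0):
    """Evaluate each condition under which simulations showed WSB reducing accuracy.

    low_support_cutoff is an assumed threshold, not a value from the slides.
    """
    return {
        "very_small_taxon_set": data.num_taxa <= 10,
        "very_high_ils": data.average_discordance > 80.0,
        "low_gene_tree_support": data.mean_gene_tree_bootstrap < low_support_cutoff,
    }


def wsb_may_reduce_accuracy(data, low_support_cutoff=50.0):
    """True only when all three conditions hold simultaneously."""
    return all(wsb_risk_conditions(data, low_support_cutoff).values())


if __name__ == "__main__":
    ten_taxon_high_ils = DatasetSummary(num_taxa=10, average_discordance=84.0,
                                        mean_gene_tree_bootstrap=30.0)
    avian_like = DatasetSummary(num_taxa=48, average_discordance=47.0,
                                mean_gene_tree_bootstrap=30.0)
    print(wsb_may_reduce_accuracy(ten_taxon_high_ils))
    print(wsb_may_reduce_accuracy(avian_like))
```

Because other harmful conditions most likely exist, a `False` result here should not be read as a guarantee that binning helps.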


Species tree estimation error for MP-EST and ASTRAL on 10-taxon datasets

Bayzid MS, Mirarab S, Boussau B, Warnow T (2015). PLoS ONE 10(6): e0129183

SimPhy model tree
•  200 genes with 100 bp (GTRGAMMA)
•  10 replicates per condition

Notes:
•  Moderate ILS: binning neutral or beneficial using BS=50%
•  Very high ILS: binning neutral for BS=50%, but increases MP-EST error with BS=75%

[Figure panels: AD=40% and AD=84%]


Liu and Edwards, Comment in Science, October 2015

Attempted proof that WSB pipelines are statistically inconsistent for bounded number of sites per locus:
•  The proof fails for multiple reasons, including the use of unpartitioned ML instead of fully partitioned ML

Simulation study
•  5-taxon, strict molecular clock, very high ILS (AD=82%)
•  performed WSB using unpartitioned ML instead of fully partitioned ML
•  erroneous (ectopic) data in supergene alignments, biasing against WSB
•  Our re-analysis of their data produced better results than they reported, but WSB did reduce accuracy on their data.


Fig. 1 Binning simulation.

Liang Liu, and Scott V. Edwards Science 2015;350:171

Published by AAAS

Liu and Edwards, Comment in Science, October 2015

Attempted proof that WSB pipelines are statistically inconsistent for bounded number of sites per locus:
•  The proof fails for multiple reasons, including the use of unpartitioned ML instead of fully partitioned ML

Simulation study
•  5-taxon, strict molecular clock, very high ILS (AD=82%)
•  Our re-analysis of their data produced better results for statistical binning (both weighted and unweighted) than they reported
•  They performed WSB using unpartitioned ML instead of fully partitioned ML (biasing against statistical binning)
•  They had erroneous (ectopic) data in their supergene alignments, biasing against statistical binning

Figure of model tree from L&E, Science 9 October 2015: 171

This model tree fits into the category of conditions described in Bayzid et al. PLOS One 2015, in which WSB reduced accuracy (very small numbers of taxa, very high ILS).



Mirarab et al. response (Science, October 2015):
•  Our re-analysis of their data (with correct supergene alignments) shows that WSB reduces accuracy, but not by as much as they report.
•  Our analyses of slightly larger datasets with the same properties (pectinate, very high ILS, strict clock) showed WSB neutral to beneficial.


All the conditions in which WSB has been shown to reduce accuracy have the following properties:
•  High ILS (AD>80%)
•  Small numbers of taxa (at most 10)
•  Low bootstrap support on gene trees
and most also obeyed the strict molecular clock. Bayzid et al. (PLOS One March 2015) advise against the use of WSB under these conditions.


•  Question: Do any summary methods converge to the species tree as the number of loci increases, but where each locus has only a constant number of sites?

•  Answers: Roch & Warnow, Syst Biol, March 2015:
–  Strict molecular clock: Yes for some new methods, even for a single site per locus
–  No clock: Unknown for all methods, including MP-EST, ASTRAL, etc.

L&E ask a good question: performance on bounded number of sites!


S. Roch and T. Warnow. "On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods", Systematic Biology, 64(4):663-676, 2015


•  For any positive integer L, if all loci have at most L sites, then WSB pipelines cannot be statistically consistent under the MSC.

•  Comments:
–  Open
–  Will be hard to settle either way

Rephrasing L&E Technical Comment as a conjecture


From "Concatenation Analyses in the Presence of Incomplete Lineage Sorting", PLOS Currents Tree of Life (http://currents.plos.org/treeoflife/article/concatenation-analyses-in-the-presence-of-incomplete-lineage-sorting/):

An outline of the proof of the main theorem is as follows: We show that the expected proportion of sites that are constant can be made arbitrary large with low rates of evolution (the lower bounds are formalized in Claim 4) and that the empirical frequencies of site patterns is concentrated around the expected values (Claim 2). When there are a large enough number of invariable sites, it can be shown that likelihood scores and parsimony scores converge to the same answer (formalized in Claim 1). Thus trees that have better parsimony score have better likelihood under these scenarios. Therefore, it suffices to show that parsimony is not statistically consistent under arbitrary low rates of evolution.

Quoted from Sebastien Roch and Mike Steel, "Likelihood-based tree reconstruction on a concatenation of sequence datasets can be statistically inconsistent", Theoretical Population Biology 100 (2015): 56-62

Statistical consistency of some standard methods

We present the current status with respect to statistical consistency (of the first or second kind) of some standard phylogenomic estimation methods. The first column is for the first meaning of statistical consistency, which states that the species tree estimated by the method will converge to the true species tree as the number of loci and number of sites per locus both increase. The second column is for the second meaning, which states that the species tree estimated by the method will converge to the true species tree as the number of loci increases, even for bounded number of sites per locus. We also cite the paper in which the theoretical result is established.

Consistency (first kind)   Consistency (second kind)   Method
YES                        UNKNOWN                     MP-EST
YES                        UNKNOWN                     ASTRAL
NO (1)                     NO (1)                      Unpartitioned concatenated maximum likelihood
UNKNOWN                    UNKNOWN                     Fully partitioned maximum likelihood
NO (10)                    NO (10)                     Unweighted statistical binning followed by consistent summary method (e.g., ASTRAL)
YES (10)                   UNKNOWN                     Weighted statistical binning followed by consistent summary method (e.g., ASTRAL)
YES                        UNKNOWN                     *BEAST

Table from PLOS Currents, Warnow 2015

Consistency first kind: both number of loci and number of sites go to infinity
Consistency second kind: number of loci goes to infinity, number of sites bounded by L (arbitrary constant)

No theoretical difference between MP-EST, ASTRAL, and WSB (according to current knowledge)
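The two kinds of consistency can be stated a little more formally. The notation below is an assumption introduced only for this sketch: T^* is the true species tree and \hat{T}_{m,k} is the species tree estimated from m loci with k sites per locus. How the joint limit in the first statement is taken is left informal here, as it is on the slide.

```latex
% Consistency of the first kind: loci and sites per locus both grow.
\lim_{m \to \infty,\; k \to \infty} \Pr\!\left[\hat{T}_{m,k} = T^{*}\right] = 1

% Consistency of the second kind: loci grow, sites per locus stay bounded.
\text{for every constant } L \ge 1 \text{ and every } k \le L:\quad
\lim_{m \to \infty} \Pr\!\left[\hat{T}_{m,k} = T^{*}\right] = 1
```

Read this way, an "UNKNOWN" entry for the second kind means that neither the second statement nor its negation has been proven for that method.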


Future Directions

•  Better coalescent-based summary methods (that are more robust to gene tree estimation error)

•  Better techniques for estimating gene trees given multi-locus data, or for co-estimating gene trees and species trees

•  Better theory about robustness to gene tree estimation error (or lack thereof) for coalescent-based summary methods

•  Better "single site" methods


Acknowledgments

Mirarab et al., Science 2014 (Statistical Binning)
Roch and Warnow, Systematic Biology 2014 (Points of View)
Bayzid et al., Science 2015 (Response to Liu and Edwards Comment)
Mirarab and Warnow, Bioinformatics 2015 (ASTRAL-2)
Warnow PLOS Currents: Tree of Life 2014 (concatenation analysis)

Papers available at http://tandy.cs.illinois.edu/papers.html
ASTRAL and statistical binning software at https://github.com/smirarab

Funding: NSF, David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Computing Center), Grainger Foundation, and HHMI (to SM).


[Figure: avian species trees from Binned MP-EST (branch support shown as unweighted/weighted) and Unbinned MP-EST, with the clades Cursores, Columbea, Otidimorphae, and Australaves labeled. Legend: Binned MP-EST (unweighted/weighted); Unbinned MP-EST; Conflict with other lines of strong evidence.]

Comparing Binned and Un-binned MP-EST on the Avian Dataset

Binned MP-EST is largely consistent with the ML concatenation analysis. The trees presented in Science 2014 were the ML concatenation and Binned MP-EST.

Bayzid et al., (2015). PLoS ONE 10(6): e0129183


Species tree estimation error for MP-EST and ASTRAL on 15-taxon datasets

Bayzid MS, Mirarab S, Boussau B, Warnow T (2015). PLoS ONE 10(6): e0129183.

Model Tree:
•  Very high ILS: AD=82%
•  Strict molecular clock
•  GTR+Gamma sequence evolution (Indelible)
•  10 replicates per condition

Notes:
•  BS=75% often improved accuracy (p=0.04)
•  BS=50% sometimes reduced accuracy, but differences were not statistically significant.
•  MP-EST more accurate than ASTRAL on these data.


Research Questions
•  Why does concatenation using ML produce such good accuracy under many conditions?
•  Why does statistical binning improve accuracy under many conditions?
•  What kind of method should be used to compute a species tree, or does this depend on the estimated amount of ILS and gene tree accuracy in the dataset?
•  What is a biologically reasonable amount of ILS?
•  How can we use multiple loci to help improve the estimation of individual gene trees? (Note: co-estimation under the MSC is very powerful, but current methods are not able to analyze even moderate-sized datasets.)


•  Bayzid et al. (PLOS One 2015): “Thus, statistical binning seems to be beneficial when both ILS level and gene tree bootstrap support are not too high, will be neutral when bootstrap support values are high (so little or no binning occurs), but can be detrimental when ILS levels are extremely high but gene tree bootstrap support is low enough that binning occurs. Thus, one consequence of this study is the suggestion that when ILS levels are very high and the average gene tree bootstrap support is low, then either statistical binning should not be used, or it should be used in a very conservative fashion—with the parameter B set very low.”

•  Mirarab et al. (Science 2015): Studies on 10- and 15-taxon datasets similar to the Liu and Edwards 5-taxon datasets showed binning was neutral to beneficial. Hence dataset size also seems to be relevant (i.e., binning might be potentially detrimental on very small datasets).

When not to use statistical binning


Summary

Unpartitioned concatenation using maximum likelihood is statistically inconsistent under the MSC (Roch and Steel 2014, see Warnow PLOS Currents 2015). However, concatenation can be highly accurate (and even more accurate than the best coalescent-based methods currently available) under low enough ILS. Concatenation is controversial.

Many coalescent-based summary methods (e.g., MP-EST and ASTRAL) converge in probability to the true species tree as the number of gene trees increases. However, all proofs to date have assumed error-free gene trees, and gene tree estimation error clearly impacts species tree estimation accuracy (and not just bootstrap support). Some new summary methods can have excellent accuracy even on large datasets (e.g., ASTRAL-2). However, summary methods are controversial.

Statistical binning (Mirarab et al. Science 2014) and weighted statistical binning (Bayzid et al., PLOS One) often (but not always) improve gene tree estimation, and hence coalescent-based species tree estimation from multiple genes. However, statistical binning is controversial.


Sketch of L&E argument
•  For any sequence length L, there is a model species tree such that nearly all sites on nearly all genes evolve without any changes, and so nearly all gene trees have maximum bootstrap support below the threshold value.

•  As the number of loci increases, the bins produced by WSB will have the same gene tree distribution as for the true species tree (or the deviation will not impact any downstream argument).

•  On each bin, ML concatenation will converge to some tree that is not the species tree.

•  (Hence, applying a coalescent-based method to these supergene trees will not converge to the species tree as the number of loci increases.)
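The first step of this sketch (for any fixed L, branch lengths can be chosen so that nearly all sites are unchanged) can be illustrated numerically. The sketch assumes a substitution model with the same exit rate from every state (for example Jukes-Cantor), independent sites, and a gene tree whose total branch length is measured in expected substitutions per site, so that the number of substitution events at a site is Poisson with that mean. Treating the tree length as fixed across loci is a simplification, and the function names are hypothetical.

```python
import math


def prob_site_unchanged(total_tree_length):
    """Probability that no substitution occurs at one site anywhere on the tree."""
    return math.exp(-total_tree_length)


def prob_locus_unchanged(total_tree_length, num_sites):
    """Probability that all sites of a locus are unchanged (sites assumed independent)."""
    return math.exp(-total_tree_length * num_sites)


def tree_length_for_target(num_sites, target):
    """Largest total tree length for which a locus of num_sites sites is
    entirely unchanged with probability at least target."""
    return -math.log(target) / num_sites


if __name__ == "__main__":
    for num_sites in (100, 1000, 10000):
        length = tree_length_for_target(num_sites, target=0.99)
        print(num_sites, length, prob_locus_unchanged(length, num_sites))
```

For any bound L on the number of sites, the required tree length shrinks roughly in proportion to 1/L, which is why the argument fixes L first and then chooses the model tree.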
