ismb tutorial: computational methods for comparative
TRANSCRIPT
Phylogenomictreeconstruction
ISMBTutorial:Computationalmethodsforcomparativeregulatorygenomics
Lecture3SiavashMirarab
�1
Topicsinthislecture
• Phylogenomics:premiseandchallenges
• Speciestreesversusgenetrees
• Causesfordiscordance
• Phylogeneticinferencedespitediscordance
• Questions
• Models
• Methodchoices
�2
Phylogeny
OrangutanGorilla ChimpanzeeHuman
Phylogeny
�3
Treeoflife
source: http://www.evogeneao.com/
“Nothing in biology makes sense except in the light of evolution.” Dobzhansky, 1973
Treeoflife
�4
Treeoflife
source: http://www.evogeneao.com/
“Nothing in biology makes sense except in the light of evolution.” Dobzhansky, 1973
“Nothing in evolution makes sense except in the light of phylogeny.” multiple coinage
Treeoflife
�4
5
ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGGOrangutanGorilla ChimpanzeeHuman
PhylogeneticreconstructionfromDNA
5
ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGGOrangutanGorilla ChimpanzeeHuman
CTGCACACCGCTGCACACCG
CTGCACACGG
PhylogeneticreconstructionfromDNA
5
ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGGOrangutanGorilla ChimpanzeeHuman
CTGCACACCGCTGCACACCG
CTGCACACGG
PhylogeneticreconstructionfromDNA
5
Orangutan
Gorilla
ChimpanzeeHuman
ACTGCACACCG
ACTGC-CCCCG
AATGC-CCCCG
-CTGCACACGG
D
ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGGOrangutanGorilla ChimpanzeeHuman
CTGCACACCGCTGCACACCG
CTGCACACGG
PhylogeneticreconstructionfromDNA
5
Orangutan
Gorilla
ChimpanzeeHuman
ACTGCACACCG
ACTGC-CCCCG
AATGC-CCCCG
-CTGCACACGG
D TP (D|T )
Orangutan
Gorilla
Chimpanzee
Human
ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGGOrangutanGorilla ChimpanzeeHuman
CTGCACACCGCTGCACACCG
CTGCACACGG
PhylogeneticreconstructionfromDNA
5
Orangutan
Gorilla
ChimpanzeeHuman
ACTGCACACCG
ACTGC-CCCCG
AATGC-CCCCG
-CTGCACACGG
D TP (D|T )
Orangutan
Gorilla
Chimpanzee
Human
ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGGOrangutanGorilla ChimpanzeeHuman
CTGCACACCGCTGCACACCG
CTGCACACGG
statistical support
81%
PhylogeneticreconstructionfromDNA
Phylogenomics:promise
6
gene 999gene 2ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG
CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G
AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG
CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT
gene 1000gene 1
Data (#genes)
Species tree error
Orangutan
Gorilla
Chimpanzee
Human
81%
“gene” here simply means a (recombination-free) parts of the genome
more data ⟶ better inference
Phylogenomics:promise
6
gene 999gene 2ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG
CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G
AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG
CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT
gene 1000gene 1
Data (#genes)
Species tree error
Orangutan
Gorilla
Chimpanzee
Human
81%
“gene” here simply means a (recombination-free) parts of the genome
more data ⟶ better inference
7
Phylogenomics: onlyafewyearslater
Genetreediscordance
Orang.Gorilla ChimpHuman Orang.Gorilla Chimp Human
gene1000gene 1
�8
Genetreediscordance
OrangutanGorilla ChimpHuman
The species tree
A gene treeOrang.Gorilla ChimpHuman Orang.Gorilla Chimp Human
gene1000gene 1
�8
Genetreediscordance
OrangutanGorilla ChimpHuman
The species tree
A gene treeOrang.Gorilla ChimpHuman Orang.Gorilla Chimp Human
Causes of gene tree discordance include:– Duplication and loss – Horizontal Gene Transfer (HGT) and Hybridization – Incomplete Lineage Sorting (ILS)
gene1000gene 1
�8
Geneduplicationandloss
• Insomespecies(e.g.plants)geneduplicationandlossisrampant.
picture from evolution-textbook.org (Fig 5-20), redrawn from Eisen, Genome Research, 1998�9
Geneduplicationandloss
• Insomespecies(e.g.plants)geneduplicationandlossisrampant.
• RecallfromColin’stalk:Paralogs:genesthatdivergedinaduplicationevent;e.g:2Aand3BOrthologs:genesthatdivergedinaspeciationevent;e.g:1Band3B
picture from evolution-textbook.org (Fig 5-20), redrawn from Eisen, Genome Research, 1998�9
Geneduplicationandloss
• Insomespecies(e.g.plants)geneduplicationandlossisrampant.
• RecallfromColin’stalk:Paralogs:genesthatdivergedinaduplicationevent;e.g:2Aand3BOrthologs:genesthatdivergedinaspeciationevent;e.g:1Band3B
• Agenetreethatincludesparalogousgenesmaydifferfromthespeciestree
– Westriveforfindingorthologousgenes,butwemayfail
picture from evolution-textbook.org (Fig 5-20), redrawn from Eisen, Genome Research, 1998�9
HorizontalGeneTransfer(HGT)
• Horizontalgenetransfer:anorganismpicksupDNAfromanotherorganismortheenvironmentratherfromitsancestors
• Itmayreplaceasimilargenepresentinthetargetormaycreateanewcopy
• Rampantinprokaryotes;observedinplantsandothereukaryotes
�10[Degnan & Rosenberg, 2009]
HorizontalGeneTransfer(HGT)
• Horizontalgenetransfer:anorganismpicksupDNAfromanotherorganismortheenvironmentratherfromitsancestors
• Itmayreplaceasimilargenepresentinthetargetormaycreateanewcopy
• Rampantinprokaryotes;observedinplantsandothereukaryotes
• Atreemaybeinsufficient.Aphylogeneticnetworkmaybeneeded.
�10[Degnan & Rosenberg, 2009]
[Degnan & Rosenberg, 2009]
Hybridization
• Aviablespeciesiscreatedasaresultsofhybridizationbetweentwodifferentspecies
– Thecontributionoftheparentspeciestothenewspeciesneednotbeequal.
• Atreeisnotsufficient;networksneeded
�11
[Degnan & Rosenberg, 2009]
Hybridization
• Aviablespeciesiscreatedasaresultsofhybridizationbetweentwodifferentspecies
– Thecontributionoftheparentspeciestothenewspeciesneednotbeequal.
• Atreeisnotsufficient;networksneeded
• Whatdoesaspeciesevenmean?
– 17differentdefinitions…amatterofgreatdebate
– Speciesdelineationisanactiveareaofresearch
�11
• A random process related to the coalescence of alleles across various populations
Tracing alleles through generations
IncompleteLineageSorting(ILS)
• A random process related to the coalescence of alleles across various populations
Tracing alleles through generations
IncompleteLineageSorting(ILS)
• A random process related to the coalescence of alleles across various populations
• Omnipresent:
• Possible for every gene tree
• Likely for short branches or large population sizes
Tracing alleles through generations
IncompleteLineageSorting(ILS)
Phylogeneticinferencedespitediscordance
Genetreediscordancehastobeaccountedfor.
Howso?
�13
Sofar…
welearnedvariousreasonsfordiscordance.
Fourmainquestionsinphylogenomics
• Reconciliation:
• Mapagiven(i.e.,known)genetreeontoaspeciestree
• Explainshowagenetreeevolvedinsidethespeciestree
�14
Fourmainquestionsinphylogenomics
• Reconciliation:
• Mapagiven(i.e.,known)genetreeontoaspeciestree
• Explainshowagenetreeevolvedinsidethespeciestree
• Inferthespeciestreegivenacollectionofknown(orinferred)genetrees
�14
Fourmainquestionsinphylogenomics
• Reconciliation:
• Mapagiven(i.e.,known)genetreeontoaspeciestree
• Explainshowagenetreeevolvedinsidethespeciestree
• Inferthespeciestreegivenacollectionofknown(orinferred)genetrees
• Inferagenetreegivenaknownspeciestreeandsequencedataforthatgene(a.k.atreefixing)
�14
Fourmainquestionsinphylogenomics
• Reconciliation:
• Mapagiven(i.e.,known)genetreeontoaspeciestree
• Explainshowagenetreeevolvedinsidethespeciestree
• Inferthespeciestreegivenacollectionofknown(orinferred)genetrees
• Inferagenetreegivenaknownspeciestreeandsequencedataforthatgene(a.k.atreefixing)
• Co-estimategenetreesandthespeciestreegiventhesequencedataforacollectionofgenes
�14
Fourmainquestionsinphylogenomics
• Reconciliation:
• Mapagiven(i.e.,known)genetreeontoaspeciestree
• Explainshowagenetreeevolvedinsidethespeciestree
• Inferthespeciestreegivenacollectionofknown(orinferred)genetrees
• Inferagenetreegivenaknownspeciestreeandsequencedataforthatgene(a.k.atreefixing)
• Co-estimategenetreesandthespeciestreegiventhesequencedataforacollectionofgenes
• …andothers…
�14
Approachestophylogenomicsinference
A. Parsimony-basedB. Model-basedC. Summary-based
�15
Parsimony-based
• Describediscordancesbetweengenetreesandthespeciestreeusingtheminimumnumberof“events”
• Events:Duplications,losses,transfers,deepcoalescences
• Reliesheavilyonfast(lineartime)parsimoniousreconciliationbetweengenetreesandthespeciestree
FigurefromDoyonetal.,BriefingsinBioinformatics,2011doi:10.1093/bib/bbr045�16
Parsimony-based
• Describediscordancesbetweengenetreesandthespeciestreeusingtheminimumnumberof“events”
• Events:Duplications,losses,transfers,deepcoalescences
• Reliesheavilyonfast(lineartime)parsimoniousreconciliationbetweengenetreesandthespeciestree
FigurefromDoyonetal.,BriefingsinBioinformatics,2011doi:10.1093/bib/bbr045�16
Orang.GorillaChimp
Human Orang.Gorilla ChimpHuman
Orang.Gorilla
ChimpHuman
Orang.Chimp Human
ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG
CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G
AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG
CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT
Model-based A. Designagenerativemodelofgenetreeevolution
�17
Gene tree
Sequence data(Alignments)
Gene tree Gene tree Gene tree
Sequence data(Alignments)
Gene evolution model
Sequence evolution model P(D |G)
P(G |S)
OrangutanGorilla ChimpHuman
Orang.GorillaChimp
Human Orang.Gorilla ChimpHuman
Orang.Gorilla
ChimpHuman
Orang.Chimp Human
ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG
CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G
AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG
CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT 18
Gene tree
Sequence data(Alignments)
Gene tree Gene tree Gene tree
Sequence data(Alignments)
Gene evolution model
Sequence evolution model P(D |G)
P(G |S)
OrangutanGorilla ChimpHuman
B. MLorBayesianinferenceunderthemodel
Model-based
Modelsofgenetreeevolution
• ILS:modeledbytheMulti-SpeciesCoalescentmodel(MSC):anextensionoftheKingman’scoalescenttomultiplespecies
�19
Modelsofgenetreeevolution
• ILS:modeledbytheMulti-SpeciesCoalescentmodel(MSC):anextensionoftheKingman’scoalescenttomultiplespecies
• Duploss:typicallymodeledusingbirthdeathprocesses,requiringarateofbirthanddeath(oftenfixed)
�19
Modelsofgenetreeevolution
• ILS:modeledbytheMulti-SpeciesCoalescentmodel(MSC):anextensionoftheKingman’scoalescenttomultiplespecies
• Duploss:typicallymodeledusingbirthdeathprocesses,requiringarateofbirthanddeath(oftenfixed)
• HGT,geneflow/hybridization:thespeciesphylogenyismodeledasanetwork(DAG)andgenetreesarestochasticallyembeddedinthenetwork.
�19
Modelsofgenetreeevolution
• ILS:modeledbytheMulti-SpeciesCoalescentmodel(MSC):anextensionoftheKingman’scoalescenttomultiplespecies
• Duploss:typicallymodeledusingbirthdeathprocesses,requiringarateofbirthanddeath(oftenfixed)
• HGT,geneflow/hybridization:thespeciesphylogenyismodeledasanetwork(DAG)andgenetreesarestochasticallyembeddedinthenetwork.
• Modelsofcombinedeffectsalsoexist
• DTLSR,DLCoal,ODT,Hybridization+ILS
�19
Modelsofgenetreeevolution
• ILS:modeledbytheMulti-SpeciesCoalescentmodel(MSC):anextensionoftheKingman’scoalescenttomultiplespecies
• Duploss:typicallymodeledusingbirthdeathprocesses,requiringarateofbirthanddeath(oftenfixed)
• HGT,geneflow/hybridization:thespeciesphylogenyismodeledasanetwork(DAG)andgenetreesarestochasticallyembeddedinthenetwork.
• Modelsofcombinedeffectsalsoexist
• DTLSR,DLCoal,ODT,Hybridization+ILS
• Caution:inferenceundermanyofthesemodelsisdifficult
�19
Summary-basedmethods
• Useexpectationsunderastatisticalmodelbutavoidcomputingthelikelihood
• Oftenarestatisticallyconsistentundersomemodelofgenetreeevolution
• Oftenbasedonsummarystatisticsordistancemeasures
�20
Summary-basedmethods
• Useexpectationsunderastatisticalmodelbutavoidcomputingthelikelihood
• Oftenarestatisticallyconsistentundersomemodelofgenetreeevolution
• Oftenbasedonsummarystatisticsordistancemeasures
• Usuallyworkintwosteps:
• Genetreesareindependentlyinferredfromsequencedata
• Genetreesarecombinedtobuildthespeciestree
• Let’sseeanexample…
�20
UnderMSCmodelofILS
For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability of appearing in gene trees (Allman, et al. 2010)
21
Orang.
Gorilla Chimp
HumanOrang.
GorillaChimp
Human
Orang.
Gorilla
Chimp
Human
θ2=15% θ3=15%θ1=70%Gorilla
Orang.
Chimp
Human
d=0.8
The most frequent gene tree = The most likely species tree
22
Orang.Gorilla ChimpHuman Rhesus
Morethan4species
For >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013)
22
Orang.Gorilla ChimpHuman Rhesus
1. Break gene trees into (n 4 ) quartets of species
2. Find the dominant tree for all quartets of taxa
3. Combine quartet trees
Some tools (e.g.. BUCKy-p [Larget, et al., 2010])
Morethan4species
For >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013)
22
Orang.Gorilla ChimpHuman Rhesus
1. Break gene trees into (n 4 ) quartets of species
2. Find the dominant tree for all quartets of taxa
3. Combine quartet trees
Some tools (e.g.. BUCKy-p [Larget, et al., 2010])
Morethan4species
For >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013)
Alternative:
Weight all 3(n 4 ) quartet topologies by
their frequency and find the optimal tree
MaximumQuartetSupportSpeciesTree
• Optimization problem:
• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly
23
Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees
the set of quartet trees induced by T
a gene treeScore(T ) =
mX
1
|Q(T ) \Q(ti)|
MaximumQuartetSupportSpeciesTree
• Optimization problem:
• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly
23
Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees
the set of quartet trees induced by T
a gene tree
NP-Hard [Lafond & Scornavaccaori, 2016]
Score(T ) =mX
1
|Q(T ) \Q(ti)|
[Mirarab, et al., Bioinformatics, 2014] [Mirarab and Warnow, Bioinformatics, 2015]
• Solve the Maximum Quartet Support problem exactly using dynamic programming
24
ASTRAL
[Mirarab, et al., Bioinformatics, 2014] [Mirarab and Warnow, Bioinformatics, 2015]
• Solve the Maximum Quartet Support problem exactly using dynamic programming
• Constrains the search space to make large datasets feasible
• The constrained version remains statistically consistent
• Running time of constrained version increases polynomially with the input size
24
ASTRAL
[Mirarab, et al., Bioinformatics, 2014] [Mirarab and Warnow, Bioinformatics, 2015]
• Solve the Maximum Quartet Support problem exactly using dynamic programming
• Constrains the search space to make large datasets feasible
• The constrained version remains statistically consistent
• Running time of constrained version increases polynomially with the input size
• Is adopted widely for incomplete lineage sorting 24
ASTRAL
Sofar…
Threeapproaches:A. Parsimony-basedB. Model-basedC. Summary-based
FourQuestions:A. ReconciliationB. SpeciestreeinferenceC. GenetreeinferenceD. Co-estimation
Threetypesofdiscordance:A. DuplicationandlossesB. HGT&HybridizationC. ILS
*Eachcapturedbymanystatisticalmodels
�25
Manymanytools
• Speciestreeinference
• ILS:ASTRAL,STAR,MP-EST,GLASS,NJst/ASTRID,DISTIQUE,STELLS,…
• Duploss:DupTree,iGTP,DynaDup,MulRF,…
• Hybridization:Phylonet,PhyloNetworks,SNaQ,…
• Agnostic:Bucky,MRP,MRL,guenomu,…
• Genetreecorrection
• TreeFix,Giga,RefineTree,SPIDIR,SPIMAP,PrIME-GSR,SYNERGY,ALE,ODT(seealsoEnsemblCompara)
• Reconciliation
• Notung,DLCoalRecon,PrIME,Phylonet,TreeMap,Korak,CoRe-Pa,Jane,Mowgli,AnGST,…
• Co-estimation
• PHYLDOG,*BEAST,BEST,BBCA,PhyloNet,…
�26
Manymanytools
• Speciestreeinference
• ILS:ASTRAL,STAR,MP-EST,GLASS,NJst/ASTRID,DISTIQUE,STELLS,…
• Duploss:DupTree,iGTP,DynaDup,MulRF,…
• Hybridization:Phylonet,PhyloNetworks,SNaQ,…
• Agnostic:Bucky,MRP,MRL,guenomu,…
• Genetreecorrection
• TreeFix,Giga,RefineTree,SPIDIR,SPIMAP,PrIME-GSR,SYNERGY,ALE,ODT(seealsoEnsemblCompara)
• Reconciliation
• Notung,DLCoalRecon,PrIME,Phylonet,TreeMap,Korak,CoRe-Pa,Jane,Mowgli,AnGST,…
• Co-estimation
• PHYLDOG,*BEAST,BEST,BBCA,PhyloNet,…
Noscalablespecies/genetreeinferencemethodcancurrentlyaddressallcausesofdiscordance
�26
Whatdoyouchoose?
• Whatcause(s)ofdiscordancedoyoubelievetobeprevalent?
• Doyouneedtoworryaboutduplicationandloss?
• Orperhapsyouhavearelativelyreliablewayoffindingorthology?Maybeusingwholegenomealignments.
• Doyouexpecthybridization,geneflow,orHGTforyourtaxa?
• IsILSlikelytobepresent?
• shortinternalbranches
• veryhighpopulationsizes
�27
Whatdoyouchoose?
• Statisticalnoisecannotbeignored
�28[Jarvis, et al., Science, 2014]
medianmean
0
5%
10%
15%
20%
0% 25% 50% 75% 100%branch bootstrap support
bran
ches
(per
cent
age)
Whatdoyouchoose?
• Statisticalnoisecannotbeignored
• Inferringgenetreesfromshortsequencesisoftenerror-prone
• Tree-fixingmethodsmayhelp
• Co-estimationislesspronetogenetreeestimationerror
�28[Jarvis, et al., Science, 2014]
medianmean
0
5%
10%
15%
20%
0% 25% 50% 75% 100%branch bootstrap support
bran
ches
(per
cent
age)
Whatdoyouchoose?
• Whatisthedatasetsize?
• Methodsdiffervastlyintermsofscalability
• Fast/scalable:parsimony-basedandsummarymethods(+MLgenetrees)
• Slower:statisticalmodels,especiallywhenconsideringmultiplecausesofdiscordance
• Slowest:co-estimation
• Notallapproachesareimplementedinanoptimizedmanner
• Somemethodsareeasiertoparallelizethanothers
�29
Takeawaymessages
• Causesofgenetreediscordancearevariedandcanco-exist
• Inferenceunderdiscordanceisanongoingareaofresearch
• Datasetsizematters!
• Thinkingaboutuncertaintyanderrorisimportant
�30
Reference
• Maddison, Wayne P. “Gene Trees in Species Trees.” Systematic Biology 46, no. 3 (September 1, 1997): 523–36. https://doi.org/10.2307/2413694.
• Degnan, James H., and Noah A. Rosenberg. “Gene Tree Discordance, Phylogenetic Inference and the Multispecies Coalescent.” Trends in Ecology and Evolution 24, no. 6 (June 1, 2009): 332–40. https://doi.org/10.1016/j.tree.2009.01.009.
• Doyon, J.-P. JP, Vincent Ranwez, Vincent Daubin, and V. Berry. “Models, Algorithms and Programs for Phylogeny Reconciliation.” Briefings in Bioinformatics 12, no. 5 (September 22, 2011): 392–400. https://doi.org/10.1093/bib/bbr045.
• Szöllõsi, G J, E Tannier, Vincent Daubin, and Bastien Boussau. “The Inference of Gene Trees with Species Trees.” Systematic Biology 64, no. 1 (July 28, 2014): e42–62. https://doi.org/10.1093/sysbio/syu048.
• Warnow, Tandy. Computational phylogenetics: An introduction to designing methods for phylogeny estimation. Cambridge University Press, 2017.
31