Finding Errors in Phylogenomic Data Using TreeShrink
Uyen Mai, University of California San Diego
TreeShrink Software: https://github.com/uym2/treeshrink
Observations
• Sequence data often include various sources of error
  • contamination
  • mistaken orthology
  • misalignment
• Erroneous sequences can appear as disproportionately long branches in the gene trees
From Gatesy et al. (2014)
• Deep coalescent nodes also appear on long branches
• Detecting long branches can be helpful in screening for errors in gene trees
[Figure: a gene tree from the mammalian dataset (Song et al., PNAS, 2012); scale bar 0.2. Called-out taxa: Platypus, Mouse, Rat, Kangaroo Rat, Guinea Pig, Tree Shrew, Shrew, Chicken, Opossum, Macaque.]
[Figure: a gene tree from the 1KP plant dataset (Wickett et al., PNAS, 2014); scale bar 0.7. Called-out taxa: Uronema sp., Nephroselmis pyriformis, Pyramimonas parkeae.]
Q: How to detect long branches?
A: Remove the leaves whose removal maximally reduces the tree diameter
Diameter: the longest path between any two leaves (species)
For unrooted trees?
[Figure: a gene tree from the 1KP plant dataset (Wickett et al., PNAS, 2014); scale bar 0.2]
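The diameter defined above can be computed with two sweeps over the tree. The following is a minimal sketch, not TreeShrink's own code: the edge-list representation, the function names, and the toy tree `EXAMPLE_EDGES` are assumptions made for illustration.

```python
from collections import defaultdict

def farthest_leaf(adj, start):
    """Return (leaf, distance) for the leaf farthest from `start`."""
    best = (start, 0.0)
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        # A leaf has exactly one neighbor in an unrooted tree.
        if len(adj[node]) == 1 and dist > best[1]:
            best = (node, dist)
        for nbr, length in adj[node]:
            if nbr != parent:
                stack.append((nbr, node, dist + length))
    return best

def tree_diameter(edges):
    """Diameter of an unrooted tree given as (u, v, branch_length) edges.

    Sweep 1 finds the leaf farthest from an arbitrary node; sweep 2 finds
    the leaf farthest from that one. The second distance is the diameter.
    """
    adj = defaultdict(list)
    for u, v, length in edges:
        adj[u].append((v, length))
        adj[v].append((u, length))
    start = next(iter(adj))
    a, _ = farthest_leaf(adj, start)
    b, d = farthest_leaf(adj, a)
    return a, b, d

# Hypothetical toy tree: leaf D sits on one very long branch.
EXAMPLE_EDGES = [
    ("A", "x", 0.1), ("B", "x", 0.2), ("x", "y", 0.1),
    ("C", "y", 0.1), ("D", "y", 2.0),
]
```

In the toy tree the diameter path runs between B and D, so the long branch leading to D dominates the diameter.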
More than a threefold reduction in diameter!
[Figure: gene tree from the 1KP plant dataset; scale bar 0.6. Called-out taxon: Sphagnum lescurii.]
Diameter tracking
[Figure: gene tree from the 1KP plant dataset; scale bar 0.6. Called-out taxon: Sphagnum lescurii.]
If we are to remove    “shrinkable”
1 leaf                 d_0/d_1 ≈ 3.5
2 leaves               d_1/d_2 ≈ 1.1
…                      …
k leaves               d_{k-1}/d_k
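Diameter tracking can be illustrated by exhaustive search on a small tree. This is a sketch under stated assumptions, not the algorithm TreeShrink uses: it tries every set of i leaves, which is only feasible for tiny inputs. It relies on the fact that restricting a tree to a subset of leaves preserves the path lengths between the remaining leaves. The names `min_diameters`, `shrink_ratios`, and `EXAMPLE_DIST` are hypothetical.

```python
from itertools import combinations

def min_diameters(dist, k):
    """Return [d_0, d_1, ..., d_k] by exhaustive search.

    dist maps frozenset({a, b}) to the path length between leaves a and b.
    d_i is the smallest diameter achievable by removing exactly i leaves.
    """
    leaves = sorted(set().union(*dist))
    if k > len(leaves) - 2:
        raise ValueError("at least two leaves must remain")

    def diameter(kept):
        return max(dist[frozenset(pair)] for pair in combinations(kept, 2))

    result = []
    for i in range(k + 1):
        best = min(
            diameter([leaf for leaf in leaves if leaf not in removed])
            for removed in combinations(leaves, i)
        )
        result.append(best)
    return result

def shrink_ratios(diams):
    """Ratios d_{i-1} / d_i for i = 1..k."""
    return [diams[i - 1] / diams[i] for i in range(1, len(diams))]

# Hypothetical leaf-to-leaf distances; leaf D sits on a long branch.
EXAMPLE_DIST = {
    frozenset(pair): length
    for pair, length in {
        ("A", "B"): 0.3, ("A", "C"): 0.3, ("A", "D"): 2.2,
        ("B", "C"): 0.4, ("B", "D"): 2.3, ("C", "D"): 2.1,
    }.items()
}
```

On the toy distances, removing D alone shrinks the diameter from 2.3 to 0.4, while a second removal only takes it from 0.4 to 0.3, so the first ratio is large and the second is close to 1.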
What to remove?
Let ν_i = (the diameter after i-1 removals) / (the diameter after i removals) = d_{i-1} / d_i

removal    shrinkable
i = 1      ν_1 = d_0 / d_1
i = 2      ν_2 = d_1 / d_2
i = 3      ν_3 = d_2 / d_3
i = 4      ν_4 = d_3 / d_4
…          …

[Figure: diameter-shrinking plot; x-axis: removal (i), y-axis: ratio (ν_i)]
[Figure: a gene tree (scale bar 0.2) with its diameter-shrinking plot] “Flat” plot: no outliers!
Diameter-shrinking plot
[Figure: four example diameter-shrinking plots; x-axis: removal (i), y-axis: ratio (ν_i)]
Q: How to automate the process?
A: Use TreeShrink!
TreeShrink: Algorithm
Mai and Mirarab. BMC Genomics 2018
https://github.com/uym2/TreeShrink
Step 1: Compute the sets of 1, 2, …, k leaves whose removal reduces the diameter maximally
Step 2: Compute the diameter-shrinking plots for 1, …, k
Step 3: Use a statistical test to detect outliers and suggest them for removal
TreeShrink: Installation
With conda:
conda install -c smirarab treeshrink
From source:
git clone https://github.com/uym2/TreeShrink.git
cd TreeShrink
python setup.py install [--user]
TreeShrink: Usage
run_treeshrink.py [-h] [-i INDIR] [-t TREE]
                  [-a ALIGNMENT] [-o OUTDIR] [-q QUANTILES]
                  [-m MODE] [-c] [-k K]
TreeShrink: Inputs
-i INDIR The parent input directory where the trees (and alignments) can be found.
-t TREE The name of the input tree/trees. If the input directory is specified (see -i option), each subdirectory under it must contain a tree with this name. Otherwise, all the trees can be included in this one file. Default: input.tre
-a ALIGNMENT The name of the input alignment; can only be used when the input directory is specified (see -i option). Each subdirectory under it must contain an alignment with this name. Default: input.fasta
TreeShrink: Outputs
-o OUTDIR Output directory. Default: the same as input directory (if it is specified) or in the same directory with the input trees.
‣ The output directory will include
  ‣ The removing list: the species removed from each input tree
  ‣ The shrunk trees: the trees with the suggested species removed
  ‣ The filtered alignments: the alignments (if provided) with the suggested species removed
TreeShrink: Example
run_treeshrink.py -t test_data/mm10.trees -o test_data/mm10_treeshrink
• Inside the generated folder test_data/mm10_treeshrink/
  • the shrunk trees mm10_shrunk_0.05.trees
  • the removing set mm10_shrunk_RS_0.05.txt
TreeShrink: include alignment
• Alignments can optionally be included to be filtered with the trees
• The alignments do not have any impact on outlier detection. They will be filtered based on the results of the filter applied to the trees.
• To include alignments, use -a together with -i
TreeShrink: Example
> ls allgenes/*
allgenes/4048: tree.nwk alignment.fasta
allgenes/4103: tree.nwk alignment.fasta
allgenes/4218: tree.nwk alignment.fasta
allgenes/4234: tree.nwk alignment.fasta
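Before running with -i, it can help to confirm that every gene subdirectory holds the expected files. The helper below is a hypothetical convenience, not part of TreeShrink; it only assumes the layout shown in the listing, and its default tree name mirrors the documented -t default (input.tre).

```python
from pathlib import Path

def missing_inputs(indir, tree_name="input.tre", aln_name=None):
    """Return {subdirectory: [missing file names]} for incomplete gene folders.

    tree_name and aln_name correspond to the values passed to -t and -a.
    Pass aln_name=None when no alignments are used.
    """
    wanted = [tree_name] + ([aln_name] if aln_name else [])
    problems = {}
    for sub in sorted(p for p in Path(indir).iterdir() if p.is_dir()):
        missing = [name for name in wanted if not (sub / name).is_file()]
        if missing:
            problems[sub.name] = missing
    return problems
```

For the listing above, `missing_inputs("allgenes", "tree.nwk", "alignment.fasta")` would return an empty dictionary when every folder is complete.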
TreeShrink: Example
run_treeshrink.py -i allgenes -t tree.nwk -a alignment.fasta
allgenes/4048: tree.nwk tree_shrunk_0.05.nwk tree_shrunk_RS_0.05.txt alignment.fasta alignment_shrunk0.05.fasta
allgenes/4103: tree.nwk tree_shrunk_0.05.nwk tree_shrunk_RS_0.05.txt alignment.fasta alignment_shrunk0.05.fasta
allgenes/4218: tree.nwk tree_shrunk_0.05.nwk tree_shrunk_RS_0.05.txt alignment.fasta alignment_shrunk0.05.fasta
allgenes/4234: tree.nwk tree_shrunk_0.05.nwk tree_shrunk_RS_0.05.txt alignment.fasta alignment_shrunk0.05.fasta
TreeShrink: -m option
• TreeShrink includes three modes
  • per-gene
  • all-genes
  • per-species
• By default, TreeShrink automatically selects an appropriate mode (usually per-species is chosen)
• Use -m to manually change the mode
TreeShrink: -q and -b
• To control the sensitivity of TreeShrink, use -q and -b
-q QUANTILES The false-tolerance threshold. Multiple thresholds can be specified. Default: 0.05
-b MINIMPACT To be used with per-species mode. The minimum impact (in percent) on the diameter required for a species to be removed. TreeShrink never removes a species whose impact on the diameter is less than MINIMPACT%. Default: 5
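One plausible reading of how -q and -b combine is sketched below. It is an assumption, not taken from TreeShrink's source: it treats the impact as a diameter ratio, turns MINIMPACT percent into a floor of 1 + MINIMPACT/100, and takes the larger of that floor and the threshold produced by the statistical test. This matches the logged threshold of 1.050000 for a run with -b 5, but the function names and the exact rule are hypothetical.

```python
def effective_threshold(stat_threshold, min_impact_percent=5.0):
    """Combine a test-derived ratio threshold with the -b floor (assumed rule)."""
    floor = 1.0 + min_impact_percent / 100.0
    return max(stat_threshold, floor)

def is_removable(impact, stat_threshold, min_impact_percent=5.0):
    """True if a species' impact ratio exceeds the effective threshold."""
    return impact > effective_threshold(stat_threshold, min_impact_percent)
```

Under this reading, a lower quantile threshold can never push the cutoff below the -b floor, which is why -b acts as a guard against removing species with negligible impact.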
TreeShrink: -k and -s
• Use -k and -s to set the size of the diameter-shrinking plot (x-axis)
• In the per-gene mode, this number is the maximum number of species that could be removed per tree
-k K The size of the diameter-shrinking plot; i.e. maximum number of leaves that can be removed. Default: auto-select based on the data
-s KSCALING If -k is not given, we use k=min(n/a,b*sqrt(n)) by default; using this option, you can set the a,b constants; Default: '5,2'
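The default rule k = min(n/a, b*sqrt(n)) can be checked by hand with a few lines. This sketch only evaluates the formula as stated; how TreeShrink rounds the result is not given here, so the value is left unrounded, and the function names are hypothetical.

```python
import math

def parse_kscaling(text="5,2"):
    """Parse a KSCALING string such as '5,2' into the constants (a, b)."""
    a, b = (float(part) for part in text.split(","))
    return a, b

def default_k(n, a=5.0, b=2.0):
    """Evaluate min(n/a, b*sqrt(n)) for a tree with n leaves (unrounded)."""
    return min(n / a, b * math.sqrt(n))
```

With the default constants, n/a is the smaller term for trees with fewer than 100 leaves, and b*sqrt(n) takes over for larger trees: 25 leaves give 5.0, while 400 leaves give 40.0 rather than 80.0.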
TreeShrink: Example
run_treeshrink.py -t test_data/mm50.trees -q "0.05 0.10" -b 5 -k 5 -m per-species -o test_data/mm50_treeshrink_multi
• Generates the folder test_data/mm50_treeshrink_multi/, which contains two sets of outputs
  • at α = 0.05: mm50_shrunk_0.05.trees and mm50_shrunk_RS_0.05.txt
  • at α = 0.10: mm50_shrunk_0.1.trees and mm50_shrunk_RS_0.1.txt
TreeShrink: Logging
Launching TREESHRINK version 1.3.3
TREESHRINK was called as follow
run_treeshrink.py -t test_data/mm50.trees -m per-species -q 0.05 0.10 -b 5 -k 5 -o test_data/mm50_treeshrink_multi
Solving k-shrink with k = 5
Solving k-shrink with k = 5
Solving k-shrink with k = 5
Solving k-shrink with k = 5
…
TreeShrink will run in 'Per-species' mode ...
CAV: will be cut in 2 trees where its impact is above 1.212920 for quantile 0.05
CAV: will be cut in 5 trees where its impact is above 1.119273 for quantile 0.10
DAS: will be cut in 1 trees where its impact is above 1.050000 for quantile 0.05
DAS: will be cut in 1 trees where its impact is above 1.050000 for quantile 0.10
…
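The per-species lines in the log are regular enough to summarize with a small parser. This is a sketch that assumes the exact wording shown in the log above (version 1.3.3); other versions may word these lines differently, and the names `CUT_LINE` and `parse_cut_lines` are hypothetical.

```python
import re

# Matches lines such as:
#   CAV: will be cut in 2 trees where its impact is above 1.212920 for quantile 0.05
CUT_LINE = re.compile(
    r"(?P<species>\S+): will be cut in (?P<trees>\d+) trees "
    r"where its impact is above (?P<impact>\d+(?:\.\d+)?) "
    r"for quantile (?P<quantile>\d+(?:\.\d+)?)"
)

def parse_cut_lines(log_text):
    """Return (species, n_trees, impact_threshold, quantile) tuples found in the log."""
    return [
        (m["species"], int(m["trees"]), float(m["impact"]), float(m["quantile"]))
        for m in CUT_LINE.finditer(log_text)
    ]
```

Grouping the parsed tuples by quantile shows how many trees each species would be cut from at each threshold, for example CAV in 2 trees at 0.05 and in 5 trees at 0.10.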
References
Publication
Mai, Uyen, and Siavash Mirarab. “TreeShrink: Fast and Accurate Detection of Outlier Long Branches in Collections of Phylogenetic Trees.” BMC Genomics 19, no. S5 (2018): 272. doi:10.1186/s12864-018-4620-2.
TreeShrink Software
https://github.com/uym2/treeshrink
Contact
Uyen Mai [email protected]