Source: tandy.cs.illinois.edu/TreeShrink-Uyen.pdf


Finding Errors in Phylogenomic Data Using TreeShrink

Uyen Mai
University of California San Diego

[email protected]

TreeShrink Software: https://github.com/uym2/treeshrink


Observations

• Sequence data often include various sources of error
  • contamination
  • mistaken orthology
  • misalignment
• Erroneous sequences can appear as disproportionately long branches in the gene trees


From Gatesy et al. (2014)

• Deep coalescent nodes also appear on long branches

• Detecting long branches can be helpful in screening for errors in gene trees


[Figure: a gene tree from the mammalian dataset (Song et al., PNAS, 2012); scale bar 0.2. Taxa labelled separately on the slide: Platypus, Mouse, Rat, Kangaroo Rat, Guinea Pig, Tree Shrew, Shrew, Chicken, Opossum, Macaque.]


[Figure: a gene tree from the 1KP plants dataset (Wickett et al., PNAS, 2014); scale bar 0.7. Taxa labelled separately on the slide: Uronema sp, Nephroselmis pyriformis, Pyramimonas parkeae.]


Q: How to detect long branches?

A: Remove leaves to maximally reduce the diameter


Diameter: the longest path between any two leaves (species)

For unrooted trees?

[Figure: a gene tree from the 1KP plant dataset (Wickett et al., PNAS, 2014) with its diameter marked; scale bar 0.2]
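The diameter definition above can be sketched in a few lines. This is a minimal illustration, assuming the unrooted tree is given as a list of (node, node, branch length) edges with non-negative lengths; the function names and the toy tree are hypothetical and are not part of TreeShrink. It uses the standard two-pass search: walk from any leaf to the farthest leaf, then from that leaf to the leaf farthest from it.

```python
from collections import defaultdict

def farthest_leaf(adj, start):
    """Return (leaf, distance) for the leaf farthest from `start`."""
    best = (start, 0.0)
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        # A leaf has exactly one neighbour in an unrooted tree.
        if len(adj[node]) == 1 and dist > best[1]:
            best = (node, dist)
        for neighbour, length in adj[node]:
            if neighbour != parent:
                stack.append((neighbour, node, dist + length))
    return best

def tree_diameter(edges):
    """Return (leaf_a, leaf_b, length) of the longest leaf-to-leaf path."""
    adj = defaultdict(list)
    for u, v, length in edges:
        adj[u].append((v, length))
        adj[v].append((u, length))
    leaves = [node for node in adj if len(adj[node]) == 1]
    a, _ = farthest_leaf(adj, leaves[0])   # first pass: one end of the diameter
    b, length = farthest_leaf(adj, a)      # second pass: the other end
    return a, b, length

# Hypothetical toy tree: leaf D sits on a very long branch.
toy_edges = [("A", "x", 0.1), ("B", "x", 0.2), ("x", "y", 0.3),
             ("C", "y", 0.1), ("D", "y", 2.0)]
print(tree_diameter(toy_edges))
```

In the toy tree the longest leaf-to-leaf path joins B and D, so the long branch leading to D dominates the diameter.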


Diameter tracking

More than a threefold reduction in diameter!

[Figure: plant gene tree (scale bar 0.6); the leaf labelled separately on the slide is Sphagnum lescurii]


Diameter tracking

If we are to remove 1 leaf
“shrinkable”: $d_0/d_1 \approx 3.5$

[Figure: plant gene tree (scale bar 0.6); the leaf labelled separately on the slide is Sphagnum lescurii]


Diameter tracking

If we are to remove 1 leaf
“shrinkable”: $d_0/d_1 \approx 3.5$

If we are to remove 2 leaves
“shrinkable”: $d_1/d_2 \approx 1.1$

[Figure: plant gene tree (scale bar 0.6); the leaf labelled separately on the slide is Sphagnum lescurii]


Diameter tracking

If we are to remove k leaves
“shrinkable”: $d_{k-1}/d_k$

[Figure: plant gene tree (scale bar 0.6); the leaf labelled separately on the slide is Sphagnum lescurii]


Diameter-shrinking plot

What to remove?

[Figure: diameter-shrinking plot; x-axis: removal $i$ (ticks 5 to 20), y-axis: ratio $\nu_i$ (ticks 1 to 5)]

removal    shrinkable
$i = 1$    $\nu_1 = d_0 / d_1$
$i = 2$    $\nu_2 = d_1 / d_2$
$i = 3$    $\nu_3 = d_2 / d_3$
$i = 4$    $\nu_4 = d_3 / d_4$
…          …


Diameter-shrinking plot

What to remove?

Let $\nu_i = \dfrac{\text{the diameter after } i-1 \text{ removals}}{\text{the diameter after } i \text{ removals}}$

[Figure: a gene tree (scale bar 0.2) and its diameter-shrinking plot; x-axis: removal $i$ (ticks 5 to 15), y-axis: ratio $\nu_i$ (ticks 1 to 5)]

“flat” plot -> no outliers!
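The ratios $\nu_i$ can be illustrated with a brute-force sketch. It assumes the leaf-to-leaf path lengths are already known (pruning other leaves does not change the path length between two remaining leaves) and simply tries every subset of $i$ leaves to find the smallest remaining diameter $d_i$. The function names and the toy distances are hypothetical, and the exhaustive search is only practical for tiny trees; it is not how TreeShrink itself solves the problem.

```python
from itertools import combinations

def diameter(dist, leaves):
    """Largest pairwise distance among `leaves`."""
    return max(dist[frozenset(pair)] for pair in combinations(leaves, 2))

def shrink_ratios(dist, leaves, k):
    """Return [nu_1, ..., nu_k] with nu_i = d_{i-1} / d_i.

    d_i is the smallest diameter left after removing exactly i leaves,
    found here by trying every subset (exponential; illustration only).
    Requires k <= len(leaves) - 2 so that a diameter always exists.
    """
    d = [diameter(dist, leaves)]
    for i in range(1, k + 1):
        d.append(min(
            diameter(dist, [x for x in leaves if x not in removed])
            for removed in combinations(leaves, i)
        ))
    return [d[i - 1] / d[i] for i in range(1, k + 1)]

# Hypothetical leaf-to-leaf path lengths; D sits on a very long branch.
pairs = {("A", "B"): 0.3, ("A", "C"): 0.5, ("B", "C"): 0.6,
         ("A", "D"): 2.4, ("B", "D"): 2.5, ("C", "D"): 2.1}
dist = {frozenset(p): length for p, length in pairs.items()}
ratios = shrink_ratios(dist, ["A", "B", "C", "D"], k=2)
print(ratios)
```

In this toy example the first removal (leaf D) shrinks the diameter by a factor of about 4.2, which is the kind of spike a diameter-shrinking plot is meant to expose.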


Diameter-shrinking plot

[Figure: four gene trees, each shown with its diameter-shrinking plot; x-axis: removal $i$, y-axis: ratio $\nu_i$ (ticks 1 to 5). Scale bars 2.0, 0.2, 0.3 and 0.4 appear on the slide.]


Q: How to automate the process?

A: Use TreeShrink!

Mai and Mirarab. BMC Genomics 2018

https://github.com/uym2/TreeShrink


TreeShrink: Algorithm

Step 1: Compute the sets of 1, 2, …, k leaves whose removal reduces the diameter maximally

Step 2: Compute the diameter-shrinking plots for 1, …, k

Step 3: Use a statistical test to detect outliers and suggest them for removal


TreeShrink: Installation

conda install -c smirarab treeshrink


TreeShrink: Installation

git clone https://github.com/uym2/TreeShrink.git

python setup.py install [--user]


TreeShrink: Usage

run_treeshrink.py [-h] [-i INDIR] [-t TREE]
                  [-a ALIGNMENT] [-o OUTDIR] [-q QUANTILES]
                  [-m MODE] [-c] [-k K]


TreeShrink: Inputs

-i INDIR The parent input directory where the trees (and alignments) can be found.

-t TREE The name of the input tree/trees. If the input directory is specified (see -i option), each subdirectory under it must contain a tree with this name. Otherwise, all the trees can be included in this one file. Default: input.tre

-a ALIGNMENT The name of the input alignment; can only be used when the input directory is specified (see -i option). Each subdirectory under it must contain an alignment with this name. Default: input.fasta


TreeShrink: Outputs

-o OUTDIR Output directory. Default: the input directory (if it is specified), otherwise the directory that contains the input trees.

The output directory will include:
‣ The removing list: the species removed from each input tree.
‣ The shrunk trees: the trees with the suggested species removed.
‣ The filtered alignments: the alignments (if provided) with the suggested species removed.


TreeShrink: Example

run_treeshrink.py -t test_data/mm10.trees -o test_data/mm10_treeshrink

• Inside the generated folder test_data/mm10_treeshrink/
  • the shrunk trees: mm10_shrunk_0.05.trees
  • the removing set: mm10_shrunk_RS_0.05.txt


TreeShrink: include alignment

• Alignments can optionally be included to be filtered with the trees

• The alignments do not have any impact on outlier detection. They will be filtered based on the results of the filter applied to the trees.

• To include alignments, use -a together with -i


TreeShrink: Example

> ls allgenes/*

allgenes/4048: tree.nwk alignment.fasta

allgenes/4103: tree.nwk alignment.fasta

allgenes/4218: tree.nwk alignment.fasta

allgenes/4234: tree.nwk alignment.fasta
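Before running with -i, it can help to confirm that every gene subdirectory really contains the files named by -t and -a. The helper below is a hypothetical convenience script, not part of TreeShrink; its name and return format are assumptions made for illustration. The default tree name follows the -t default stated on the Inputs slide.

```python
from pathlib import Path

def missing_inputs(indir, tree="input.tre", alignment=None):
    """List the expected input files that are absent under `indir`.

    Every subdirectory of `indir` is expected to hold a file named `tree`
    and, if `alignment` is given, a file with that name as well.
    """
    missing = []
    for gene_dir in sorted(p for p in Path(indir).iterdir() if p.is_dir()):
        for name in (tree, alignment):
            if name is not None and not (gene_dir / name).is_file():
                missing.append(gene_dir / name)
    return missing
```

For the layout listed above, `missing_inputs("allgenes", "tree.nwk", "alignment.fasta")` would return an empty list when every gene directory is complete.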


TreeShrink: Example

run_treeshrink.py -i allgenes -t tree.nwk -a alignment.fasta

allgenes/4048: tree.nwk tree_shrunk_0.05.nwk tree_shrunk_RS_0.05.txt alignment.fasta alignment_shrunk0.05.fasta

allgenes/4103: tree.nwk tree_shrunk_0.05.nwk tree_shrunk_RS_0.05.txt alignment.fasta alignment_shrunk0.05.fasta

allgenes/4218: tree.nwk tree_shrunk_0.05.nwk tree_shrunk_RS_0.05.txt alignment.fasta alignment_shrunk0.05.fasta

allgenes/4234: tree.nwk tree_shrunk_0.05.nwk tree_shrunk_RS_0.05.txt alignment.fasta alignment_shrunk0.05.fasta


TreeShrink: -m option

• TreeShrink includes three modes
  • per-gene
  • all-genes
  • per-species

• By default, TreeShrink automatically selects an appropriate mode (usually per-species is chosen)

• Use -m to manually change the mode


TreeShrink: -q and -b

• To control the sensitivity of TreeShrink, use -q and -b

-q QUANTILES The false-tolerance threshold. Multiple thresholds can be specified. Default: 0.05

-b MINIMPACT To be used with per-species mode. The minimum impact (percent) on the diameter required for a species to be removed. TreeShrink never removes a species whose impact on the diameter is less than MINIMPACT%. Default: 5
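One way to read the -b rule is as a floor on the diameter ratio: with MINIMPACT = 5, a species is only a candidate for removal when the diameter before its removal is at least 1.05 times the diameter after. That reading is an assumption, chosen because it matches the 1.050000 threshold that the logging slide shows for -b 5; the function below is a hypothetical sketch, not TreeShrink code.

```python
def meets_min_impact(d_before, d_after, min_impact=5.0):
    """True if removal shrinks the diameter by at least `min_impact` percent.

    Assumption: "impact" is the ratio d_before / d_after, so a 5% minimum
    impact corresponds to a ratio of at least 1.05.
    """
    ratio = d_before / d_after
    return ratio >= 1.0 + min_impact / 100.0
```

Under this reading, a removal that takes the diameter from 1.10 to 1.00 qualifies, while one that takes it from 1.02 to 1.00 does not.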


TreeShrink: -k and -s

• Use -k and -s to set the size of the diameter-shrinking plot (x-axis)

• In the per-gene mode, this number is the maximum number of species that could be removed per tree

-k K The size of the diameter-shrinking plot; i.e. maximum number of leaves that can be removed. Default: auto-select based on the data

-s KSCALING If -k is not given, we use k=min(n/a,b*sqrt(n)) by default; using this option, you can set the a,b constants; Default: '5,2'
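The default rule k = min(n/a, b*sqrt(n)) is easy to evaluate by hand. The sketch below assumes n is the number of leaves in the tree, which the slide does not state, and it leaves the result unrounded because the slide does not say how TreeShrink rounds. The function names are hypothetical.

```python
import math

def parse_kscaling(text="5,2"):
    """Split an 'a,b' string such as the -s default '5,2' into two floats."""
    a, b = (float(part) for part in text.split(","))
    return a, b

def default_k(n, a=5.0, b=2.0):
    """Evaluate k = min(n / a, b * sqrt(n)) without rounding."""
    return min(n / a, b * math.sqrt(n))

a, b = parse_kscaling("5,2")
for n in (25, 100, 400):
    print(n, default_k(n, a, b))
```

With the defaults, the n/a term is the smaller one for trees of up to 100 leaves and the b*sqrt(n) term takes over for larger trees, so k grows more slowly as trees get bigger.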


TreeShrink: Example

run_treeshrink.py -t test_data/mm50.trees -q "0.05 0.10" -b 5 -k 5 -m per-species -o test_data/mm50_treeshrink_multi

• Generates the folder test_data/mm50_treeshrink_multi/, which contains two sets of outputs
  • at α = 0.05: mm50_shrunk_0.05.trees and mm50_shrunk_RS_0.05.txt
  • at α = 0.10: mm50_shrunk_0.1.trees and mm50_shrunk_RS_0.1.txt


TreeShrink: Logging

Launching TREESHRINK version 1.3.3

TREESHRINK was called as follow

run_treeshrink.py -t test_data/mm50.trees -m per-species -q 0.05 0.10 -b 5 -k 5 -o test_data/mm50_treeshrink_multi

Solving k-shrink with k = 5
Solving k-shrink with k = 5
Solving k-shrink with k = 5
Solving k-shrink with k = 5
…

TreeShrink will run in 'Per-species' mode ...
CAV: will be cut in 2 trees where its impact is above 1.212920 for quantile 0.05
CAV: will be cut in 5 trees where its impact is above 1.119273 for quantile 0.10
DAS: will be cut in 1 trees where its impact is above 1.050000 for quantile 0.05
DAS: will be cut in 1 trees where its impact is above 1.050000 for quantile 0.10
…
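The per-species lines in the log are regular enough to summarize with a small script. The parser below is a hypothetical helper written against the exact wording shown on this slide (version 1.3.3); other versions may phrase the message differently, so treat the pattern as an assumption.

```python
import re

# Pattern copied from the wording of the log lines shown on the slide.
CUT_LINE = re.compile(
    r"(?P<species>\S+): will be cut in (?P<n_trees>\d+) trees "
    r"where its impact is above (?P<threshold>[\d.]+) "
    r"for quantile (?P<quantile>[\d.]+)"
)

def parse_cut_lines(log_text):
    """Return (species, n_trees, threshold, quantile) for every matching line."""
    return [
        (m["species"], int(m["n_trees"]), float(m["threshold"]), float(m["quantile"]))
        for m in CUT_LINE.finditer(log_text)
    ]
```

Each tuple gives the species label, the number of trees it will be cut from, the impact threshold, and the quantile the threshold belongs to, which makes it easy to compare how many removals each quantile setting produces.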


References

Publication
Mai, Uyen, and Siavash Mirarab. “TreeShrink: Fast and Accurate Detection of Outlier Long Branches in Collections of Phylogenetic Trees.” BMC Genomics 19, no. S5 (2018): 272. doi:10.1186/s12864-018-4620-2.

TreeShrink Software
https://github.com/uym2/treeshrink

Contact
Uyen Mai [email protected]