talevich bosc2010 bio-phylo


Upload: bosc-2010

Post on 11-May-2015


Page 1: Talevich bosc2010 bio-phylo

Bio.Phylo: A unified phylogenetics toolkit for Biopython

Eric Talevich

Institute of Bioinformatics, University of Georgia

June 29, 2010

Page 2: Talevich bosc2010 bio-phylo

Abstract

Bio.Phylo is a new phylogenetics library for:

• Exploring, modifying and annotating trees

• Reading & writing standard file formats

• Quick visualization

• Gluing together computational pipelines

Availability: Biopython 1.54

Page 3: Talevich bosc2010 bio-phylo

A quick survey of file formats

Newick (a.k.a. New Hampshire) is a simple nested-parens format: (A, (B, C), (D, E))
• Extended & tweaked, led to NHX (and parsing problems)

Nexus is a collection of formats, including Newick trees
• More than just tree data... still tough to parse

PhyloXML is an XML-based replacement for NHX
• Annotations formalized as XML elements; extensible with user-defined element types

NeXML is an XML-based successor to Nexus
• Ontology-based: key-value assignments have semantic meaning
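To make the "nested-parens" idea concrete, here is a minimal sketch that turns a names-only Newick string into nested Python lists. The helper name `parse_newick_names` is hypothetical, and the parser deliberately ignores branch lengths, comments, quoted labels and internal node names, all of which real Newick/NHX files may contain. It illustrates the format only; it is not how Bio.Phylo parses trees.

```python
def parse_newick_names(text):
    """Parse a names-only Newick string into nested lists of leaf names.

    Hypothetical helper: branch lengths, comments, quoted labels and
    internal node names are NOT handled.
    """
    # Drop whitespace and the optional trailing semicolon
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def peek():
        # Current character, or "" once the input is exhausted
        return text[pos] if pos < len(text) else ""

    def parse_node():
        nonlocal pos
        if peek() == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while peek() == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if peek() != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # consume ")"
            return children
        # Leaf: read up to the next delimiter
        start = pos
        while peek() and peek() not in "(),":
            pos += 1
        return text[start:pos]

    result = parse_node()
    if pos != len(text):
        raise ValueError("unexpected text at position %d" % pos)
    return result


nested = parse_newick_names("(A, (B, C), (D, E))")
print(nested)  # ['A', ['B', 'C'], ['D', 'E']]
```

Each pair of parentheses becomes one list, so the depth of list nesting mirrors the depth of the clade in the tree.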

Page 4: Talevich bosc2010 bio-phylo

Demo: What’s in a tree?

1. Read a simple Newick file

2. Inspect through IPython

3. Draw with PyLab/matplotlib

4. Promote to a PhyloXML tree

5. Set branch colors

6. Write a PhyloXML file

Page 5: Talevich bosc2010 bio-phylo

# In a terminal, make a simple Newick file
% cat > simple.dnd <<EOF
> (((A,B),(C,D)),(E,F,G))
> EOF

# Then launch the IPython interpreter and read the file
% ipython -pylab
>>> from Bio import Phylo
>>> tree = Phylo.read('simple.dnd', 'newick')
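The same read step can be tried without creating a file, since `Phylo.read` also accepts a file-like handle. The sketch below assumes a Biopython release that runs on Python 3 (newer than the 1.54 shown in the talk) and adds a terminating semicolon to the Newick string; the variable names are illustrative only.

```python
from io import StringIO

from Bio import Phylo

# Same topology as simple.dnd, held in memory instead of on disk
newick_text = "(((A,B),(C,D)),(E,F,G));"
tree = Phylo.read(StringIO(newick_text), "newick")

# Collect the tip labels in the order they appear in the file
tip_names = [tip.name for tip in tree.get_terminals()]
print(tip_names)
```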

Page 6: Talevich bosc2010 bio-phylo

# String representation shows the object structure
>>> print(tree)
Tree(weight=1.0, rooted=False, name='')
    Clade(branch_length=1.0)
        Clade(branch_length=1.0)
            Clade(branch_length=1.0)
                Clade(branch_length=1.0, name='A')
                Clade(branch_length=1.0, name='B')
            Clade(branch_length=1.0)
                Clade(branch_length=1.0, name='C')
                Clade(branch_length=1.0, name='D')
        Clade(branch_length=1.0)
            Clade(branch_length=1.0, name='E')
            Clade(branch_length=1.0, name='F')
            Clade(branch_length=1.0, name='G')

Page 7: Talevich bosc2010 bio-phylo

# Draw an ASCII-art dendrogram
>>> Phylo.draw_ascii(tree, column_width=52)
                                ______________ A
                 ______________|
                |              |______________ B
  ______________|
 |              |               ______________ C
 |              |______________|
_|                             |______________ D
 |
 |               ______________ E
 |              |
 |______________|______________ F
                |
                |______________ G
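When the dendrogram needs to go into a log or report rather than onto the screen, `draw_ascii` can write to a handle through its `file` argument. A small sketch, assuming a Python 3 compatible Biopython; the in-memory buffer stands in for whatever log file a pipeline would use.

```python
from io import StringIO

from Bio import Phylo

tree = Phylo.read(StringIO("(((A,B),(C,D)),(E,F,G));"), "newick")

# Capture the ASCII drawing in a string instead of printing it
buffer = StringIO()
Phylo.draw_ascii(tree, file=buffer, column_width=52)
art = buffer.getvalue()

print(art)
```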

Page 8: Talevich bosc2010 bio-phylo

>>> tree.rooted = True

>>> Phylo.draw_graphviz(tree)

[Figure: Graphviz drawing of the rooted tree, with tips labeled A through G]

Page 9: Talevich bosc2010 bio-phylo

# Promote a basic tree to PhyloXML
>>> from Bio.Phylo.PhyloXML import Phylogeny
>>> phy = Phylogeny.from_tree(tree)
>>> print(phy)
Phylogeny(rooted=True, name='')
    Clade(branch_length=1.0)
        Clade(branch_length=1.0)
            Clade(branch_length=1.0)
                Clade(branch_length=1.0, name='A')
                Clade(branch_length=1.0, name='B')
            Clade(branch_length=1.0)
                Clade(branch_length=1.0, name='C')
                Clade(branch_length=1.0, name='D')
        Clade(branch_length=1.0)
            Clade(branch_length=1.0, name='E')
            Clade(branch_length=1.0, name='F')
            Clade(branch_length=1.0, name='G')
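As a script rather than an interactive session, the promotion step might look like the sketch below. It assumes a Python 3 compatible Biopython and reads the tree from an in-memory string; the check at the end only confirms that the topology and the rooted flag carried over to the new object.

```python
from io import StringIO

from Bio import Phylo
from Bio.Phylo.PhyloXML import Phylogeny

tree = Phylo.read(StringIO("(((A,B),(C,D)),(E,F,G));"), "newick")
tree.rooted = True

# Build a PhyloXML Phylogeny carrying the same clades
phy = Phylogeny.from_tree(tree)

same_tips = ([t.name for t in tree.get_terminals()]
             == [t.name for t in phy.get_terminals()])
print(type(tree).__name__, "->", type(phy).__name__, "| same tips:", same_tips)
```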

Page 10: Talevich bosc2010 bio-phylo

Branch color

>>> phy.root.color = (128, 128, 128)
Or:
>>> phy.root.color = '#808080'
Or:
>>> phy.root.color = 'gray'

Find clades by attribute values:
>>> mrca = phy.common_ancestor({'name':'E'}, {'name':'F'})
>>> mrca.color = 'salmon'

Directly index a clade:
>>> phy.clade[0,1].color = 'blue'

>>> Phylo.draw_graphviz(phy, prog='neato')
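The coloring steps can be checked without drawing anything by reading the colors back as RGB values. This is a sketch under the assumption of a Python 3 compatible Biopython; the `rgb` helper is hypothetical and only unpacks the `red`, `green` and `blue` attributes of the stored color object.

```python
from io import StringIO

from Bio import Phylo
from Bio.Phylo.PhyloXML import Phylogeny

tree = Phylo.read(StringIO("(((A,B),(C,D)),(E,F,G));"), "newick")
phy = Phylogeny.from_tree(tree)


def rgb(clade):
    """Hypothetical helper: return a clade's color as an (r, g, b) tuple."""
    return (clade.color.red, clade.color.green, clade.color.blue)


# A color name is converted to a color object on assignment
phy.root.color = "gray"

# Most recent common ancestor of tips E and F, i.e. the (E,F,G) clade
mrca = phy.common_ancestor({"name": "E"}, {"name": "F"})
mrca.color = "salmon"

# Index path 0 then 1 from the root: the (C,D) clade
cd_clade = phy.clade[0, 1]
cd_clade.color = "blue"

print(rgb(phy.root), rgb(cd_clade))
print([tip.name for tip in mrca.get_terminals()])
```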

Page 11: Talevich bosc2010 bio-phylo

[Figure: Graphviz (neato) drawing of the tree with colored branches, tips labeled A through G]

Page 12: Talevich bosc2010 bio-phylo

# Save the color annotations in phyloXML
>>> Phylo.write(phy, 'simple-color.xml', 'phyloxml')

<phy:phyloxml xmlns:phy="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <clade>
      <branch_length>1.0</branch_length>
      <color>
        <red>128</red>
        <green>128</green>
        <blue>128</blue>
      </color>
      <clade>
        <branch_length>1.0</branch_length>
        <clade>
          <branch_length>1.0</branch_length>
          <clade>
            <name>A</name>
            ...
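One way to confirm that the color annotation survives serialization is a write/read round trip. The sketch below assumes a Python 3 compatible Biopython and writes into a temporary directory so nothing is left behind; the file name matches the slide, everything else is illustrative.

```python
import os
import tempfile
from io import StringIO

from Bio import Phylo
from Bio.Phylo.PhyloXML import Phylogeny

tree = Phylo.read(StringIO("(((A,B),(C,D)),(E,F,G));"), "newick")
phy = Phylogeny.from_tree(tree)
phy.root.color = "gray"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "simple-color.xml")
    # Write the annotated tree, then parse the file back in
    Phylo.write(phy, path, "phyloxml")
    restored = Phylo.read(path, "phyloxml")

root_rgb = (restored.root.color.red,
            restored.root.color.green,
            restored.root.color.blue)
print(root_rgb)
```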

Page 13: Talevich bosc2010 bio-phylo

Thanks

Holla:

• Brad Chapman and Christian Zmasek, GSoC 2009 mentors

• The Biopython developers, feat. Peter J. A. Cock, Frank Kauff & Cymon J. Cox

• Hilmar Lapp & the NESCent Phyloinformatics program

• Google’s Open Source Programs Office

• My professor, Dr. Natarajan Kannan

• Developers like you

Page 14: Talevich bosc2010 bio-phylo

Q&A

• Which 3rd-party applications should we wrap in Bio.Phylo.Applications? (e.g. RAxML, MrBayes)

• Which other libraries should we support interoperability with? (PyCogent, ape)

• What other algorithms are simple, stable and relevant? (Consensus, rooting)

• Features for systematics? (Geography, PopGen integration?)

Page 15: Talevich bosc2010 bio-phylo

Extra: Tree methods

>>> dir(tree)

collapse
collapse_all
common_ancestor
count_terminals
depths
distance
find_any
find_clades
find_elements
get_nonterminals
get_path
get_terminals
is_bifurcating
is_monophyletic
is_parent_of
is_preterminal
ladderize
prune
split
total_branch_length
trace

See: http://biopython.org/DIST/docs/api/Bio.Phylo.BaseTree.TreeMixin-class.html
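A few of these methods applied to the demo tree, as a rough sketch (assuming a Python 3 compatible Biopython). Only topology queries are used, because the input string carries no branch lengths.

```python
from io import StringIO

from Bio import Phylo

tree = Phylo.read(StringIO("(((A,B),(C,D)),(E,F,G));"), "newick")

# Number of tips in the tree
n_tips = tree.count_terminals()

# False here: the (E,F,G) clade has three children
bifurcating = tree.is_bifurcating()

# Clades on the way from the root down to tip C (root itself excluded)
path_to_c = tree.get_path(name="C")

print(n_tips, bifurcating, len(path_to_c))
```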

Page 16: Talevich bosc2010 bio-phylo

Extra: The Bio.Phylo class hierarchy

Figure: Inheritance relationship among the core classes

Page 17: Talevich bosc2010 bio-phylo

Extra: PhyloXML classes

$ pydoc Bio.Phylo.PhyloXML

Accession, Alphabet, Annotation, BaseTree, BinaryCharacters, BranchColor, Clade, CladeRelation, Confidence

Date, Distribution, DomainArchitecture, Events, Id, MolSeq, Other, Phylogeny, Phyloxml

Point, Polygon, Property, ProteinDomain, Reference, Sequence, SequenceRelation, Taxonomy, Uri

See: http://biopython.org/wiki/PhyloXML