talevich bosc2010 bio-phylo
Bio.Phylo: A unified phylogenetics toolkit for Biopython
Eric Talevich
Institute of Bioinformatics, University of Georgia
June 29, 2010
Abstract
Bio.Phylo is a new phylogenetics library for:
• Exploring, modifying and annotating trees
• Reading & writing standard file formats
• Quick visualization
• Gluing together computational pipelines
Availability: Biopython 1.54
A quick survey of file formats
Newick (a.k.a. New Hampshire) is a simple nested-parens format: (A, (B, C), (D, E))
• Extended & tweaked, led to NHX (and parsing problems)
Nexus is a collection of formats, including Newick trees
• More than just tree data... still tough to parse
PhyloXML is an XML-based replacement for NHX
• Annotations formalized as XML elements; extensible with user-defined element types
NeXML is an XML-based successor to Nexus
• Ontology-based: key-value assignments have semantic meaning
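To make the "nested-parens" idea concrete, here is a minimal sketch of a recursive-descent reader for the bare-bones Newick shown above. It is not Bio.Phylo's parser: the function name parse_newick is hypothetical, and the sketch assumes leaf names only (no branch lengths, quoting, comments or internal node labels), which are exactly the extensions that make real-world Newick hard to parse.

```python
def parse_newick(text):
    """Parse a names-only Newick string into nested lists of leaf names.

    Illustrative sketch only: branch lengths, quoted labels, comments and
    internal node labels are deliberately not supported.
    """
    text = text.strip().rstrip(";")
    pos = 0

    def skip_whitespace():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def parse_node():
        nonlocal pos
        skip_whitespace()
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while True:
                skip_whitespace()
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1  # consume "," and read the next sibling
                    children.append(parse_node())
                elif text[pos] == ")":
                    pos += 1  # consume ")" and close this clade
                    return children
                else:
                    raise ValueError("unexpected character %r" % text[pos])
        # Leaf: read a name up to the next structural character.
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos].strip()

    tree = parse_node()
    skip_whitespace()
    if pos != len(text):
        raise ValueError("trailing characters after the tree")
    return tree


print(parse_newick("(A, (B, C), (D, E))"))
```

Each pair of parentheses becomes one nested list, so the clade structure of the tree is visible directly in the result.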
Demo: What’s in a tree?
1. Read a simple Newick file
2. Inspect through IPython
3. Draw with PyLab/matplotlib
4. Promote to a PhyloXML tree
5. Set branch colors
6. Write a PhyloXML file
# In a terminal, make a simple Newick file
# Then launch the IPython interpreter and read the file
% cat > simple.dnd <<EOF
> (((A,B),(C,D)),(E,F,G))
> EOF
% ipython -pylab
>>> from Bio import Phylo
>>> tree = Phylo.read('simple.dnd', 'newick')
# String representation shows the object structure
>>> print(tree)
Tree(weight=1.0, rooted=False, name='')
    Clade(branch_length=1.0)
        Clade(branch_length=1.0)
            Clade(branch_length=1.0)
                Clade(branch_length=1.0, name='A')
                Clade(branch_length=1.0, name='B')
            Clade(branch_length=1.0)
                Clade(branch_length=1.0, name='C')
                Clade(branch_length=1.0, name='D')
        Clade(branch_length=1.0)
            Clade(branch_length=1.0, name='E')
            Clade(branch_length=1.0, name='F')
            Clade(branch_length=1.0, name='G')
# Draw an ASCII-art dendrogram
>>> Phylo.draw_ascii(tree, column_width=52)
                                ______________ A
                 ______________|
                |              |______________ B
  ______________|
 |              |               ______________ C
 |              |______________|
_|                             |______________ D
 |
 |               ______________ E
 |              |
 |______________|______________ F
                |
                |______________ G
>>> tree.rooted = True
>>> Phylo.draw_graphviz(tree)
[Figure: Graphviz drawing of the tree, with leaves labeled A through G]
# Promote a basic tree to PhyloXML
>>> from Bio.Phylo.PhyloXML import Phylogeny
>>> phy = Phylogeny.from_tree(tree)
>>> print(phy)
Phylogeny(rooted=True, name='')
    Clade(branch_length=1.0)
        Clade(branch_length=1.0)
            Clade(branch_length=1.0)
                Clade(branch_length=1.0, name='A')
                Clade(branch_length=1.0, name='B')
            Clade(branch_length=1.0)
                Clade(branch_length=1.0, name='C')
                Clade(branch_length=1.0, name='D')
        Clade(branch_length=1.0)
            Clade(branch_length=1.0, name='E')
            Clade(branch_length=1.0, name='F')
            Clade(branch_length=1.0, name='G')
Branch color
>>> phy.root.color = (128, 128, 128)
Or:
>>> phy.root.color = '#808080'
Or:
>>> phy.root.color = 'gray'
Find clades by attribute values:
>>> mrca = phy.common_ancestor({'name':'E'}, {'name':'F'})
>>> mrca.color = 'salmon'
Directly index a clade:
>>> phy.clade[0,1].color = 'blue'
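The three equivalent assignments above (RGB tuple, hex string, color name) suggest that the color attribute normalizes its input to a single RGB triple. The sketch below illustrates that idea in plain Python. It is not Bio.Phylo's implementation: to_rgb and NAMED_COLORS are hypothetical names, and the name table is a tiny subset using the standard web values for these three colors.

```python
# Hypothetical lookup table: a small subset of named colors (standard web RGB values).
NAMED_COLORS = {
    "gray": (128, 128, 128),
    "blue": (0, 0, 255),
    "salmon": (250, 128, 114),
}


def to_rgb(spec):
    """Normalize an (r, g, b) tuple, a '#rrggbb' string or a color name to an RGB tuple."""
    if isinstance(spec, str):
        if spec.startswith("#"):
            if len(spec) != 7:
                raise ValueError("hex colors must look like '#rrggbb'")
            # Decode the three two-digit hexadecimal channels.
            return tuple(int(spec[i:i + 2], 16) for i in (1, 3, 5))
        try:
            return NAMED_COLORS[spec.lower()]
        except KeyError:
            raise ValueError("unknown color name: %r" % spec)
    red, green, blue = spec
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError("color channels must be in the range 0-255")
    return (int(red), int(green), int(blue))


# All three spellings of gray used above normalize to the same triple.
print(to_rgb((128, 128, 128)), to_rgb("#808080"), to_rgb("gray"))
```

Normalizing at assignment time means the writer only ever has to deal with one representation, which matches the red/green/blue elements in the phyloXML output.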
>>> Phylo.draw_graphviz(phy, prog='neato')
[Figure: Graphviz (neato) drawing of the colored tree, with leaves labeled A through G]
# Save the color annotations in phyloXML
>>> Phylo.write(phy, 'simple-color.xml', 'phyloxml')
<phy:phyloxml xmlns:phy="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <clade>
      <branch_length>1.0</branch_length>
      <color>
        <red>128</red>
        <green>128</green>
        <blue>128</blue>
      </color>
      <clade>
        <branch_length>1.0</branch_length>
        <clade>
          <branch_length>1.0</branch_length>
          <clade>
            <name>A</name>
            ...
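Because the color annotation is stored as ordinary XML elements, it can be read back with any XML library, not only Bio.Phylo. The sketch below uses the standard library's xml.etree.ElementTree on a small hand-written document shaped like the excerpt above; the sample document and the helper name clade_colors are illustrative assumptions, not actual Bio.Phylo output. It also assumes unprefixed child elements as in the excerpt; files that declare a default namespace would need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like the excerpt above (not real Bio.Phylo output).
SAMPLE = """<phy:phyloxml xmlns:phy="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <clade>
      <branch_length>1.0</branch_length>
      <color>
        <red>128</red>
        <green>128</green>
        <blue>128</blue>
      </color>
      <clade>
        <branch_length>1.0</branch_length>
        <name>A</name>
      </clade>
    </clade>
  </phylogeny>
</phy:phyloxml>"""


def clade_colors(xml_text):
    """Return (name, (r, g, b)) for every clade that carries a color element."""
    root = ET.fromstring(xml_text)
    results = []
    for clade in root.iter("clade"):
        color = clade.find("color")  # direct child only, so nested clades are not mixed up
        if color is None:
            continue
        rgb = tuple(int(color.findtext(channel)) for channel in ("red", "green", "blue"))
        results.append((clade.findtext("name"), rgb))
    return results


print(clade_colors(SAMPLE))
```

Here only the unnamed root clade has a color, so the result contains a single entry whose name is None.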
Thanks
Holla:
• Brad Chapman and Christian Zmasek, GSoC 2009 mentors
• The Biopython developers, feat. Peter J. A. Cock, Frank Kauff & Cymon J. Cox
• Hilmar Lapp & the NESCent Phyloinformatics program
• Google’s Open Source Programs Office
• My professor, Dr. Natarajan Kannan
• Developers like you
Q&A
• Which 3rd-party applications should we wrap in Bio.Phylo.Applications? (e.g. RAxML, MrBayes)
• Which other libraries should we support interoperability with? (PyCogent, ape)
• What other algorithms are simple, stable and relevant? (Consensus, rooting)
• Features for systematics? (Geography, PopGen integration?)
Extra: Tree methods
>>> dir(tree)
collapse
collapse_all
common_ancestor
count_terminals
depths
distance
find_any
find_clades
find_elements
get_nonterminals
get_path
get_terminals
is_bifurcating
is_monophyletic
is_parent_of
is_preterminal
ladderize
prune
split
total_branch_length
trace
See: http://biopython.org/DIST/docs/api/Bio.Phylo.BaseTree.TreeMixin-class.html
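A few of these methods in action, as a short sketch. It assumes a reasonably recent Biopython on Python 3, reads the tree from an in-memory string rather than a file, and writes explicit branch lengths of 1 into the Newick string so that distance has something to sum (an assumption made for this example; the demo file had no lengths).

```python
from io import StringIO

from Bio import Phylo

# Same topology as the demo tree, with explicit unit branch lengths.
handle = StringIO("(((A:1,B:1):1,(C:1,D:1):1):1,(E:1,F:1,G:1):1);")
tree = Phylo.read(handle, "newick")

# Count and list the leaves.
n_leaves = tree.count_terminals()
leaf_names = [leaf.name for leaf in tree.get_terminals()]

# Most recent common ancestor of E and F, found by attribute value.
mrca = tree.common_ancestor({"name": "E"}, {"name": "F"})
mrca_leaves = sorted(leaf.name for leaf in mrca.get_terminals())

# The (E,F,G) clade has three children, so the tree is not strictly bifurcating.
bifurcating = tree.is_bifurcating()

# Sum of branch lengths along the path between A and C.
dist_a_c = tree.distance({"name": "A"}, {"name": "C"})

print(n_leaves, leaf_names, mrca_leaves, bifurcating, dist_a_c)
```

The targets are passed as attribute dictionaries, the same lookup style used with common_ancestor in the branch color slide.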
Extra: The Bio.Phylo class hierarchy
Figure: Inheritance relationship among the core classes
Extra: PhyloXML classes
$ pydoc Bio.Phylo.PhyloXML
Accession, Alphabet, Annotation, BaseTree, BinaryCharacters, BranchColor, Clade, CladeRelation, Confidence,
Date, Distribution, DomainArchitecture, Events, Id, MolSeq, Other, Phylogeny, Phyloxml,
Point, Polygon, Property, ProteinDomain, Reference, Sequence, SequenceRelation, Taxonomy, Uri
See: http://biopython.org/wiki/PhyloXML