nexml - phylogenetic data as xml
DESCRIPTION
NeXML is an exchange standard for representing phyloinformatic data, inspired by the commonly used NEXUS format, but more robust and easier to process.
TRANSCRIPT
Nexml
A future data exchange standard for phylogenetics
Rutger Vos
Introduction (1/7) The problem
Increased automation in evolutionary informatics is hampered by poorly defined “standards”
Introduction: The problem, EvoInfo interests, This subproject, Nexus issues, Parsing, Extensibility, XML goodies
Design: Principles, Re-use, Patterns, Inheritance, References
Implementation: Approach, ERD, Inheritance, Anatomy, Characters, Trees
Current status: Schema blocks, Parsers & writers, Experiments, To do
Resources
Introduction (2/7) EvoInfo.nescent.org interests
Addressing interoperability problems by coding our way out of them
• Syntax: Nexml
• Semantics: CDAO
• Transport: PhyloWS
Introduction (3/7) This subproject’s mission
• To create a file format like nexus*, but:
  o Fix (some) problems with nexus
  o Give access to data at higher level
  o Be extensible
  o Expose data to xml goodies
*Maddison, Swofford and Maddison, 1997. NEXUS: An Extensible File Format for Systematic Information. Syst. Biol. 46(4):590-621
Introduction (4/7) Nexus issues
• Hard/impossible to validate
• No explicit versions
  o Nothing ever deprecated
• No public extensions
  o Leads to hacks such as ‘mixed’ data, ‘hot comments’
  o Phylogenetics post-’80s in private blocks
https://www.nescent.org/wg_evoinfo/NEXUS_Problems
Introduction (5/7) Parsing plain text versus parsing XML
• Processing nexus data involves lexing + parsing + processing
• XML allows choosing a parser library; the data can then be processed as a structure that hides tokenization issues
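As a rough illustration of the second point, the sketch below uses Python's standard xml.etree.ElementTree module on a small, hypothetical fragment. The element names and labels are simplified stand-ins, not a schema-valid nexml document. No hand-written lexer is involved: the parser library deals with quoting, whitespace and escaping.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment (an assumption for illustration only;
# a real nexml file carries namespaces and a root element).
DOC = """
<otus id="tax1">
  <otu id="t1" label="Homo sapiens"/>
  <otu id="t2" label="Pan troglodytes"/>
</otus>
"""

# The library tokenizes and parses; we only walk the resulting structure.
root = ET.fromstring(DOC)

# Collect id -> label pairs straight from the element attributes.
labels = {otu.get("id"): otu.get("label") for otu in root.iter("otu")}
print(labels)
```

The equivalent nexus code would first need a tokenizer that understands quoted labels, comments and whitespace rules before any data could be read.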
Introduction (6/7) Extensibility
• ‘Extensible’ file format should provide the ability to:
  o Define new data types that implement described ‘interfaces’
  o Attach typed data structures to core types
  o Attach custom XML
Introduction (7/7) XML goodies
• Large stack of off-the-shelf tools:
  o XML parser libraries
  o Web service toolkits
  o Native XML databases
  o Editors / IDEs
  o Serialization / data binding tools
Design (1/5) Design principles
• Re-use of prior art
• Follow design patterns
• Referencing
• Verbose and compact representations
Design (2/5) Re-use of prior art
• Generic key/value attachments using RDFa
• Trees and networks following graphml
• General file structure following nexus concepts, i.e. blocks that reference each other
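To make the graphml-style tree representation concrete, here is a minimal sketch that reads a tree written as flat node and edge lists and rebuilds the parent-to-children structure. The fragment is hypothetical, and the edge attribute names source and target are an assumption borrowed from graphml conventions; the slides do not spell them out.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment: a three-node tree as node and edge lists.
# The 'source'/'target' attribute names are assumed, following graphml.
DOC = """
<tree id="tree1">
  <node id="n1"/>
  <node id="n2"/>
  <node id="n3"/>
  <edge id="e1" source="n1" target="n2"/>
  <edge id="e2" source="n1" target="n3"/>
</tree>
"""

def children_by_parent(tree):
    """Return a dict mapping each parent node id to its child node ids."""
    children = defaultdict(list)
    for edge in tree.iter("edge"):
        children[edge.get("source")].append(edge.get("target"))
    return dict(children)

def find_roots(tree):
    """Return ids of nodes that are never the target of an edge."""
    targets = {edge.get("target") for edge in tree.iter("edge")}
    return [n.get("id") for n in tree.iter("node") if n.get("id") not in targets]

tree = ET.fromstring(DOC)
adjacency = children_by_parent(tree)
roots = find_roots(tree)
print(adjacency, roots)
```

Because the topology is a plain edge list, the same reading code would also cope with networks, where a node can be the target of more than one edge.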
Design (3/5) XML design patterns
• http://www.xmlpatterns.com
• “Declare before use”
• “Metadata first”
• “Venetian blinds”
• Abstract inheritance through extension, concrete inheritance through restriction
Design (4/5) Inheritance
IDTagged (required id attribute)
Labelled (optional label attribute)
Annotated (optional dict elements)
Base (optional base/lang/href attributes)
AbstractElement (in root schema)
ConcreteElement (in instance document)
[Figure: inheritance diagram linking the types above; four arrows are labelled “extends” and one is labelled “restricts”]
Design (5/5) Referencing
• Elements sometimes refer to other elements, much like in nexus
• In nexml, elements refer to the id of other elements by the name of the referenced element:
<otu id="t1"/>
<!-- referenced later: -->
<node id="n1" otu="t1"/>
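A consumer of such a file has to resolve these references itself. The sketch below shows one way to do that with Python's standard library; the wrapping doc element and the extra nodes are hypothetical additions so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element around the otu/node pattern shown above.
DOC = """
<doc>
  <otu id="t1"/>
  <otu id="t2"/>
  <node id="n1" otu="t1"/>
  <node id="n2" otu="t2"/>
  <node id="n3"/>
</doc>
"""

def resolve_otus(root):
    """Map each node id to the otu element its 'otu' attribute points at."""
    otus_by_id = {otu.get("id"): otu for otu in root.iter("otu")}
    resolved = {}
    for node in root.iter("node"):
        ref = node.get("otu")
        if ref is None:
            continue  # node without an otu reference, e.g. an internal node
        if ref not in otus_by_id:
            raise ValueError(f"node {node.get('id')} refers to unknown otu {ref}")
        resolved[node.get("id")] = otus_by_id[ref]
    return resolved

root = ET.fromstring(DOC)
links = {node_id: otu.get("id") for node_id, otu in resolve_otus(root).items()}
print(links)
```

Building the id index before visiting the referring elements mirrors the “declare before use” pattern: the referenced element is expected to exist by the time it is needed.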
Implementation (1/6) Approach
• Schema design
• Community feedback through wiki, email, telecon, projects (evoinfo, ppod, MIAPA) etc.
• Development of processors (perl, java, python, c++, javascript, VB) in parallel
• Experiments with xml tools (ws, db, data binding tools)
Implementation (2/6) Entity relationships
Implementation (3/6) Inheritance tree for elements
Implementation (4/6) anatomy of a “block”
<characters id="c1" xsi:type="nex:DnaSeqs" otus="t1">
  <meta id="m1" datatype="xsd:string" xsi:type="nex:LiteralMeta" property="dwc:catalogNumber" content="12345"/>
  Contents…
</characters>
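A minimal sketch of reading such a block in Python follows. The xsi namespace URI is the standard XML Schema instance namespace; the URI bound to the nex prefix is a placeholder assumption, since the slide does not show the namespace declarations.

```python
import xml.etree.ElementTree as ET

# Standard XML Schema instance namespace.
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# The nex namespace URI below is a placeholder (assumption).
DOC = """
<characters xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xmlns:nex="http://example.org/nex"
            id="c1" xsi:type="nex:DnaSeqs" otus="t1">
  <meta id="m1" datatype="xsd:string" xsi:type="nex:LiteralMeta"
        property="dwc:catalogNumber" content="12345"/>
</characters>
"""

block = ET.fromstring(DOC)

# ElementTree exposes namespaced attributes under the key "{uri}localname".
block_type = block.get(f"{{{XSI}}}type")

# Gather the key/value annotations attached to the block.
annotations = {m.get("property"): m.get("content") for m in block.iter("meta")}
print(block_type, annotations)
```

The xsi:type value tells a processor which concrete subclass of the block it is dealing with, while the meta elements carry annotations that a processor can read or ignore without affecting the core data.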
Implementation (5/6) Character Classes
              Cells              Sequence
DNA           DnaCells           DnaSeqs
RNA           RnaCells           RnaSeqs
Protein       ProteinCells       ProteinSeqs
Standard      StandardCells      StandardSeqs
Continuous    ContinuousCells    ContinuousSeqs
Restriction   RestrictionCells   RestrictionSeqs
Implementation (6/6) Tree Classes
           Int          Float
Tree       IntTree      FloatTree
Network    IntNetwork   FloatNetwork
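One way a processor might use these class names is to pick a numeric type from the declared class. The sketch below assumes that Int and Float describe the type of the edge lengths, which the slide does not state explicitly; the lookup table and the parse_length helper are hypothetical.

```python
# Hypothetical lookup from tree class name to the Python type used for
# edge lengths. That Int/Float refer to edge lengths is an assumption.
LENGTH_TYPES = {
    "IntTree": int,
    "FloatTree": float,
    "IntNetwork": int,
    "FloatNetwork": float,
}

def parse_length(tree_class, raw):
    """Convert a raw edge-length string according to the declared tree class."""
    name = tree_class.split(":")[-1]  # drop a prefix such as "nex:"
    if name not in LENGTH_TYPES:
        raise ValueError(f"unknown tree class: {tree_class}")
    # int() rejects strings such as "0.25", so integer-typed trees
    # refuse fractional lengths instead of silently truncating them.
    return LENGTH_TYPES[name](raw)

print(parse_length("nex:IntTree", "3"))
print(parse_length("nex:FloatTree", "0.25"))
```

Keeping the type in the class name means a validator can reject a mistyped length before any analysis code runs.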
Current status (1/4) Schema blocks
• Done:
  o OTUs
  o characters: dna, rna, nucleotide, protein, categorical, continuous, restriction (compact and verbose)
  o trees: graphml trees and networks, various edge formats and rootings
Current status (2/4) Parsers and writers
• Nexml parsers and writers:
  o Phenex
  o TreeBASE
  o Mesquite
  o Bio::Phylo
  o DendroPy
  o DAMBE
  o Etc.
Current status (3/4) Experiments
• Included schema in soap wsdl
• Indexed files in dbxml
• Created large files from tolweb, rbcl
• XInclude with tinyseq xml
• REST service described using nexml
Current status (4/4) To do
• Cross-reference with glossary, ontology
• Substitution model descriptions
• Publish standard
• Compact trees
• Distances
• Splits
Resources
NeXML Base URL: http://nexml.org
• Wiki: /wiki
• Mailing list: /mail
• Issue tracker: /tracker
• SVN repository: /code
EvoInfo: http://evoinfo.nescent.org
CDAO: http://www.evolutionaryontology.org
Acknowledgements
• Contributions: Jason Caravas, Mark Holder, Peter Midford, Jeet Sukumaran, Xuhua Xia, Chase Miller, Anurag Priyam, Jaime Huerta-Cepas, Matt Yoder, Andrew Hill, Sam Smits, Mike Keesey, Apurv Verma, Mark Jensen
• Feedback: wg-evoinfo, pPOD, Wayne Maddison, David Maddison
• Additional funding, support: NESCent, GSoC