nexml - phylogenetic data as xml

NeXML: A future data exchange standard for phylogenetics. Rutger Vos

Upload: rutger-vos

Post on 24-Jun-2015


DESCRIPTION

NeXML is an exchange standard for representing phyloinformatic data — inspired by the commonly used NEXUS format, but more robust and easier to process.

TRANSCRIPT

Page 1: NeXML - phylogenetic data as XML

NeXML: A future data exchange standard for phylogenetics

Rutger Vos

Page 2: NeXML - phylogenetic data as XML

Introduction (1/7): The problem

Increased automation in evolutionary informatics is hampered by poorly defined “standards”.

Outline:
Introduction: The problem, EvoInfo interests, This subproject, Nexus issues, Parsing, Extensibility, XML goodies
Design: Principles, Re-use, Patterns, Inheritance, References
Implementation: Approach, ERD, Inheritance, Anatomy, Characters, Trees
Current status: Schema blocks, Parsers & writers, Experiments, To do
Resources

Page 3: NeXML - phylogenetic data as XML

Introduction (2/7): EvoInfo.nescent.org interests

Addressing interoperability problems by coding our way out of them

• Syntax: NeXML
• Semantics: CDAO
• Transport: PhyloWS

Page 4: NeXML - phylogenetic data as XML

Introduction (3/7): This subproject’s mission

• To create a file format like NEXUS*, but:
  o Fix (some) problems with NEXUS
  o Give access to data at a higher level
  o Be extensible
  o Expose data to XML goodies

*Maddison, Swofford and Maddison, 1997. NEXUS: An Extensible File Format for Systematic Information. Syst. Biol. 46(4):590-621

Page 5: NeXML - phylogenetic data as XML

Introduction (4/7): NEXUS issues

• Hard/impossible to validate
• No explicit versions
  o Nothing ever deprecated
• No public extensions
  o Leads to hacks such as ‘mixed’ data, ‘hot comments’
  o Phylogenetics post-’80s in private blocks

https://www.nescent.org/wg_evoinfo/NEXUS_Problems

Page 6: NeXML - phylogenetic data as XML

Introduction (5/7): Parsing plain text versus parsing XML

• Processing NEXUS data involves lexing + parsing + processing
• With XML you can choose an existing parser library and process the data as a structure, which hides tokenization issues
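To illustrate the second point, here is a minimal sketch using Python's standard-library XML parser. The fragment is hypothetical and simplified (real NeXML documents declare namespaces and carry more structure); the point is only that the code works with elements and attributes and never has to tokenize text itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment (assumption: no namespaces declared).
DOC = """
<nexml>
  <otus id="taxa1">
    <otu id="t1" label="Homo sapiens"/>
    <otu id="t2" label="Pan troglodytes"/>
  </otus>
</nexml>
"""


def otu_labels(xml_text):
    """Return {otu id: label} for every otu element in the document."""
    root = ET.fromstring(xml_text)
    # The parser library has already handled quoting, whitespace and
    # escaping; we only walk the resulting element tree.
    return {otu.get("id"): otu.get("label") for otu in root.iter("otu")}


labels = otu_labels(DOC)
print(labels)
```

Swapping in a different parser library would change the parsing call, not the logic that consumes the structure.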

Page 7: NeXML - phylogenetic data as XML

Introduction (6/7): Extensibility

• An ‘extensible’ file format should provide the ability to:
  o Define new data types that implement described ‘interfaces’
  o Attach typed data structures to core types
  o Attach custom XML

Page 8: NeXML - phylogenetic data as XML

Introduction (7/7): XML goodies

• Large stack of off-the-shelf tools:
  o XML parser libraries
  o Web service toolkits
  o Native XML databases
  o Editors / IDEs
  o Serialization / data binding tools

Page 9: NeXML - phylogenetic data as XML

Design (1/5): Design principles

• Re-use of prior art
• Follow design patterns
• Referencing
• Verbose and compact representations

Page 10: NeXML - phylogenetic data as XML

Design (2/5): Re-use of prior art

• Generic key/value attachments using RDFa
• Trees and networks following GraphML
• General file structure following NEXUS concepts, i.e. blocks that reference each other
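The GraphML-style representation of trees can be sketched as a flat list of nodes plus edges that point from a source node to a target node. The fragment below is a hypothetical, namespace-free simplification: the attribute names source, target, length and root are assumptions modelled on GraphML conventions, not copied from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical tree fragment; attribute names are assumptions.
TREE = """
<tree id="tree1">
  <node id="n1" root="true"/>
  <node id="n2" otu="t1"/>
  <node id="n3" otu="t2"/>
  <edge id="e1" source="n1" target="n2" length="0.5"/>
  <edge id="e2" source="n1" target="n3" length="1.5"/>
</tree>
"""


def children_by_node(tree_xml):
    """Build {node id: [child node ids]} from node and edge elements."""
    tree = ET.fromstring(tree_xml)
    children = {node.get("id"): [] for node in tree.iter("node")}
    for edge in tree.iter("edge"):
        # Each edge is an explicit element linking two node ids.
        children[edge.get("source")].append(edge.get("target"))
    return children


adjacency = children_by_node(TREE)
print(adjacency)
```

Because edges are explicit elements, the same layout can describe a network: a node may simply appear as the target of more than one edge.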

Page 11: NeXML - phylogenetic data as XML

Design (3/5): XML design patterns

• http://www.xmlpatterns.com
• “Declare before use”
• “Metadata first”
• “Venetian blinds”
• Abstract inheritance through extension, concrete inheritance through restriction

Page 12: NeXML - phylogenetic data as XML

Design (4/5): Inheritance

[Diagram: a chain of six types joined by four “extends” links and one “restricts” link]

• IDTagged (required id attribute)
• Labelled (optional label attribute)
• Annotated (optional dict elements)
• Base (optional base/lang/href attributes)
• AbstractElement (in root schema)
• ConcreteElement (in instance document)

Page 13: NeXML - phylogenetic data as XML

Design (5/5): Referencing

• Elements sometimes refer to other elements, much like in NEXUS
• In NeXML, an element refers to another element through an attribute that is named after the referenced element and holds its id:

  <otu id="t1"/>
  <!-- referenced later: -->
  <node id="n1" otu="t1"/>
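A processor can resolve such references by first indexing elements by id and then looking up the value of the referring attribute. This is a minimal sketch over a hypothetical, namespace-free fragment; the label attribute and the enclosing nexml, otus and tree elements are simplifying assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment built around the otu/node example above.
DOC = """
<nexml>
  <otus id="taxa1">
    <otu id="t1" label="Homo sapiens"/>
  </otus>
  <tree id="tree1">
    <node id="n1" otu="t1"/>
    <node id="n2"/>
  </tree>
</nexml>
"""


def resolve_otus(xml_text):
    """Return {node id: label of the referenced otu} for nodes with an otu attribute."""
    root = ET.fromstring(xml_text)
    # Index every otu element by its id so references are a dict lookup.
    otus = {otu.get("id"): otu for otu in root.iter("otu")}
    resolved = {}
    for node in root.iter("node"):
        ref = node.get("otu")  # attribute named after the referenced element
        if ref is not None:
            resolved[node.get("id")] = otus[ref].get("label")
    return resolved


node_taxa = resolve_otus(DOC)
print(node_taxa)
```

Nodes without an otu attribute (here n2) are simply skipped, as internal nodes need not be linked to a taxon.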

Page 14: NeXML - phylogenetic data as XML

Implementation (1/6): Approach

• Schema design
• Community feedback through wiki, email, telecon, projects (evoinfo, ppod, MIAPA) etc.
• Processors (Perl, Java, Python, C++, JavaScript, VB) developed in parallel
• Experiments with XML tools (web services, databases, data binding tools)

Page 15: NeXML - phylogenetic data as XML

Implementation (2/6): Entity relationships

[Figure: entity-relationship diagram]

Page 16: NeXML - phylogenetic data as XML

Implementation (3/6): Inheritance tree for elements

Page 17: NeXML - phylogenetic data as XML

Implementation (4/6): Anatomy of a “block”

<characters
    id="c1"
    xsi:type="nex:DnaSeqs"
    otus="t1">
  <meta id="m1" datatype="xsd:string" xsi:type="nex:LiteralMeta" property="dwc:catalogNumber" content="12345"/>
  Contents…
</characters>
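As a rough sketch of how a consumer might read such a block, the code below pulls out the block's xsi:type and its literal annotations. It assumes the meta element sits inside the characters element and, for brevity, declares only the xsi namespace; a real document would also declare the nex, xsd and dwc prefixes.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Simplified block: only the xsi prefix is declared (assumption for brevity).
BLOCK = f"""
<characters xmlns:xsi="{XSI}" id="c1" xsi:type="nex:DnaSeqs" otus="t1">
  <meta id="m1" datatype="xsd:string" xsi:type="nex:LiteralMeta"
        property="dwc:catalogNumber" content="12345"/>
</characters>
"""


def literal_annotations(block_xml):
    """Return {property: content} for every LiteralMeta annotation in the block."""
    block = ET.fromstring(block_xml)
    return {
        meta.get("property"): meta.get("content")
        for meta in block.iter("meta")
        # ElementTree exposes xsi:type under its expanded name {namespace}type.
        if meta.get(f"{{{XSI}}}type") == "nex:LiteralMeta"
    }


annotations = literal_annotations(BLOCK)
block_type = ET.fromstring(BLOCK).get(f"{{{XSI}}}type")
print(block_type, annotations)
```

The values of xsi:type and property are qualified names stored as plain strings, so a stricter processor would resolve their prefixes against the declared namespaces rather than compare text.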

Page 18: NeXML - phylogenetic data as XML

Implementation (5/6): Character classes

Data type     Cells              Sequence
DNA           DnaCells           DnaSeqs
RNA           RnaCells           RnaSeqs
Protein       ProteinCells       ProteinSeqs
Standard      StandardCells      StandardSeqs
Continuous    ContinuousCells    ContinuousSeqs
Restriction   RestrictionCells   RestrictionSeqs
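The naming scheme in this table is regular: a data type prefix followed by a layout suffix. The helper below is a hypothetical convenience function (not part of any NeXML library) that builds a class name from the two parts.

```python
# Data type prefixes and layout suffixes as they appear in the class names.
DATATYPES = ("Dna", "Rna", "Protein", "Standard", "Continuous", "Restriction")
LAYOUTS = {"cells": "Cells", "sequence": "Seqs"}


def character_class(datatype, layout):
    """Return the class name for a data type and layout, e.g. DnaSeqs."""
    name = datatype.capitalize()  # "DNA" or "dna" both become "Dna"
    if name not in DATATYPES:
        raise ValueError(f"unknown data type: {datatype!r}")
    if layout not in LAYOUTS:
        raise ValueError(f"unknown layout: {layout!r}")
    return name + LAYOUTS[layout]


print(character_class("DNA", "sequence"))
print(character_class("protein", "cells"))
```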

Page 19: NeXML - phylogenetic data as XML

Implementation (6/6): Tree classes

           Int          Float
Tree       IntTree      FloatTree
Network    IntNetwork   FloatNetwork
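One plausible reading of the Int/Float split is that it fixes the numeric type of edge lengths; that reading is an assumption here, as are the source, target and length attribute names. Under those assumptions, a reader could pick the number parser from the tree's xsi:type:

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Assumption: the class name determines how edge lengths are parsed.
LENGTH_PARSERS = {
    "nex:IntTree": int,
    "nex:FloatTree": float,
    "nex:IntNetwork": int,
    "nex:FloatNetwork": float,
}

# Hypothetical fragment; only the xsi prefix is declared for brevity.
TREE = f"""
<tree xmlns:xsi="{XSI}" id="tree1" xsi:type="nex:IntTree">
  <node id="n1"/>
  <node id="n2"/>
  <edge id="e1" source="n1" target="n2" length="3"/>
</tree>
"""


def edge_lengths(tree_xml):
    """Return {edge id: length}, parsed according to the tree's xsi:type."""
    tree = ET.fromstring(tree_xml)
    parse = LENGTH_PARSERS[tree.get(f"{{{XSI}}}type")]
    return {edge.get("id"): parse(edge.get("length")) for edge in tree.iter("edge")}


lengths = edge_lengths(TREE)
print(lengths)
```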

Page 20: NeXML - phylogenetic data as XML

Current status (1/4): Schema blocks

• Done:
  o OTUs
  o characters: DNA, RNA, nucleotide, protein, categorical, continuous, restriction (compact and verbose)
  o trees: GraphML trees and networks, various edge formats and rootings

Page 21: NeXML - phylogenetic data as XML

Current status (2/4): Parsers and writers

• NeXML parsers and writers:
  o Phenex
  o TreeBASE
  o Mesquite
  o Bio::Phylo
  o DendroPy
  o DAMBE
  o Etc.

Page 22: NeXML - phylogenetic data as XML

Current status (3/4): Experiments

• Included schema in SOAP WSDL
• Indexed files in dbxml
• Created large files from tolweb, rbcL
• XInclude with TinySeq XML
• REST service described using NeXML

Page 23: NeXML - phylogenetic data as XML

Current status (4/4): To do

• Cross-reference with glossary, ontology
• Substitution model descriptions
• Publish standard
• Compact trees
• Distances
• Splits

Page 24: NeXML - phylogenetic data as XML

Resources

NeXML base URL: http://nexml.org
• Wiki: /wiki
• Mailing list: /mail
• Issue tracker: /tracker
• SVN repository: /code

EvoInfo: http://evoinfo.nescent.org
CDAO: http://www.evolutionaryontology.org

Page 25: NeXML - phylogenetic data as XML

Acknowledgements

• Contributions: Jason Caravas, Mark Holder, Peter Midford, Jeet Sukumaran, Xuhua Xia, Chase Miller, Anurag Priyam, Jaime Huerta-Cepas, Matt Yoder, Andrew Hill, Sam Smits, Mike Keesey, Apurv Verma, Mark Jensen

• Feedback: wg-evoinfo, pPOD, Wayne Maddison, David Maddison

• Additional funding, support: NESCent, GSoC