Chado-XML
![Page 1: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/1.jpg)
Chado and interoperability
Chris Mungall, BDGP
Pinglei Zhou, FlyBase-Harvard
![Page 2: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/2.jpg)
Databases and applications
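How do we get databases and applications speaking to one another?
[Diagram: applications written in Java and Perl on one side; the Chado SQL database (modules: seq, cv, rad, genetic, phylo, pub) on the other; the link between them is marked "?".]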
![Page 3: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/3.jpg)
Databases and applications
Generic database interfaces only solve part of the problem
They let us embed SQL inside application code
[Diagram: a Java application reaches Chado through JDBC and a PostgreSQL driver (method calls in, row objects out); a Perl application reaches it through DBI and a PostgreSQL driver (method calls in, Perl arrays out). Chado modules: seq, cv, rad, genetic, phylo, pub.]
![Page 4: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/4.jpg)
Why this alone isn’t enough
• Interfacing applications to databases is a tricky business…
• Issue: Language mismatch
• Issue: Data structure mismatch
• Issue: Repetitive code
• Issue: No centralized domain logic
![Page 5: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/5.jpg)
Language mismatch
    String sql = "SELECT srcfeature_id, fmax, fmin "
               + "FROM featureloc "
               + "WHERE feature_id =" + featId;
    try {
        Statement s = conn.createStatement();
        ResultSet rs = s.executeQuery(sql);
        rs.next();
        sourceFeatureId = rs.getInt("srcfeature_id");
        fmin = rs.getInt("fmin");
        fmax = rs.getInt("fmax");
    } catch (SQLException sqle) {
        System.err.println(this.getClass()
            + ": SQLException retrieving feature loc"
            + " for feature_id = " + featId);
        sqle.printStackTrace(System.err);
    }
![Page 6: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/6.jpg)
Data structure mismatch
• Different formalisms
– Databases: relations, set theoretic, relational algebra
– Applications: classes and structs, pointers, programs
![Page 7: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/7.jpg)
Repetitive code
• Database fetch pattern
– construct, ask, transform, repeat, stitch
• Example: fetching gene models
– fetch genes
– fetch transcripts
– fetch exons, polypeptides
– fetch ancillary data (props, cvs, pubs, syns, etc)
• Optimisation is difficult
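The fetch pattern above can be sketched in a few lines. This is only an illustration under stated assumptions: the two tables are toy stand-ins for Chado's feature and feature_relationship tables (the real schema stores types as cvterm references), the exon names are made up, and `fetch_children` / `fetch_gene_model` are hypothetical helpers, not part of any Chado tool.

```python
import sqlite3

# Toy stand-ins for Chado's feature and feature_relationship tables
# (heavily simplified; the type is stored as plain text here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT, type TEXT);
    CREATE TABLE feature_relationship (subject_id INTEGER, object_id INTEGER);
    INSERT INTO feature VALUES (1, 'CG3312', 'gene'),
                               (2, 'CG3312-RA', 'mRNA'),
                               (3, 'CG3312:1', 'exon'),
                               (4, 'CG3312:2', 'exon');
    INSERT INTO feature_relationship VALUES (2, 1), (3, 2), (4, 2);
""")

def fetch_children(parent_id):
    # construct + ask: one query per parent feature
    rows = conn.execute(
        "SELECT f.feature_id, f.uniquename, f.type "
        "FROM feature f JOIN feature_relationship r ON r.subject_id = f.feature_id "
        "WHERE r.object_id = ? ORDER BY f.feature_id",
        (parent_id,),
    ).fetchall()
    # transform: rows become application-side structures
    return [{"id": i, "name": n, "type": t, "parts": []} for i, n, t in rows]

def fetch_gene_model(uniquename):
    (gene_id,) = conn.execute(
        "SELECT feature_id FROM feature WHERE uniquename = ? AND type = 'gene'",
        (uniquename,),
    ).fetchone()
    gene = {"id": gene_id, "name": uniquename, "type": "gene", "parts": []}
    # repeat + stitch: another round trip for every transcript
    for transcript in fetch_children(gene_id):
        transcript["parts"] = fetch_children(transcript["id"])
        gene["parts"].append(transcript)
    return gene

model = fetch_gene_model("CG3312")
```

Every application that needs gene models ends up rewriting some variant of this loop, and the one-query-per-parent shape is what makes optimisation hard.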
![Page 8: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/8.jpg)
No centralized domain logic
• Examples of domain logic:
– project a feature onto a virtual contig
– revcomp or translate a sequence
– search by ontology term
– delete a gene model
• Domain logic should be reusable by different applications
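As a concrete instance of domain logic, here is a minimal reverse-complement sketch. The function name and the decision to leave ambiguity codes other than N untouched are assumptions made for the example.

```python
# Complement table for DNA including N; any other character is left unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]
```

Without a shared home for logic like this, each application that talks to the database tends to carry its own copy.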
![Page 9: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/9.jpg)
A solution: Object Oriented APIs
Different Perl applications share the same API
Different schemas can be added by writing adapters
[Diagram: two Perl applications exchange method calls and domain objects with a shared Perl API; a chado adapter and a biosql adapter sit between the API and their respective databases, each connecting through a DBI driver.]
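The adapter arrangement can be sketched as follows. All class and method names here are hypothetical, and the adapters return canned objects where a real one would query its schema.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Gene:
    """Domain object handed to applications, independent of any schema."""
    name: str
    source: str

class Adapter(ABC):
    @abstractmethod
    def fetch_gene(self, name: str) -> Gene:
        ...

class ChadoAdapter(Adapter):
    def fetch_gene(self, name: str) -> Gene:
        # A real adapter would query the chado schema here.
        return Gene(name=name, source="chado")

class BioSQLAdapter(Adapter):
    def fetch_gene(self, name: str) -> Gene:
        # A real adapter would query the biosql schema here.
        return Gene(name=name, source="biosql")

def describe(adapter: Adapter, name: str) -> str:
    # Application code sees only the API and the domain object.
    gene = adapter.fetch_gene(name)
    return f"{gene.name} from {gene.source}"
```

Swapping `ChadoAdapter()` for `BioSQLAdapter()` changes the backing schema without touching `describe`.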
![Page 10: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/10.jpg)
How do OO APIs help?
• Issue: Language mismatch
– Separation of interface from implementation
• Issue: Data structure mismatch
– API talks objects
– adapters hide and deal with conversion
• Issue: Repetitive code
– code centralized in both API and adapter
• Issue: No centralized domain logic
– object model encapsulates domain logic
– object model can be used independently of database
![Page 11: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/11.jpg)
How do OO APIs hinder?
• Writing or generating adapters
– brittle, difficult to maintain
• Restrictive
– canned parameterized queries vs query language
• Application language bound
– very difficult to use a Perl API from Java
• Application bound
– sometimes generic, but often limited to one application
• Opaque domain logic
![Page 12: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/12.jpg)
XML can help
• XML is application-language neutral
• XML can be used to specify:
– data
– transactions
– queries and query constraints
• XML can be used within both application languages and specialized XML languages
– XPath
– XSLT
– XQuery
![Page 13: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/13.jpg)
XML middleware
XML as lingua franca
Generic SQL to XML mapper
OO APIs can be implemented on top of XML layer
[Diagram: applications written in Perl, Java, or any other language send chado query params to an interface and receive chado xml back; behind the interface, an implementation in Perl reaches the database through a DBI driver.]
![Page 14: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/14.jpg)
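XML middleware
Implementation can be any language
[Diagram: the same layout as the previous slide. Applications in Perl, Java, or any other language send chado query params to the interface and receive chado xml; here the implementation is in Java and reaches the database through a JDBC driver.]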
![Page 15: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/15.jpg)
Mapping with XORT
• XORT is a specification of how to map XML to the relational model
– generic: independent of chado and biology
• XML::XORT is a Perl implementation of the XORT specification
– Other implementations possible
• DBIx::DBStag implements XORT xml->db
• Application language agnostic
– Easily wrapped for other languages
![Page 16: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/16.jpg)
Highlights
• Proposal: XML mapping specification for Chado
• Tools
• Real Case
![Page 17: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/17.jpg)
XORT Mapping
• Elements
– Table
– Column (except DB-specific values, e.g. primary keys in the Chado schema, which are not visible in XML)
• Attributes
– few and generic: transaction and reference control
• Element nesting
– column within table
– joined table within table (joining column is implicit)
– foreign key table within foreign key column
• Modules
– No module distinctions in chadoXML
• Limitations of DTD
– Cardinality, NULLness, data type
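The nesting rules above can be illustrated with a small sketch. This is not XML::XORT itself: `row_to_element`, the dict layout used to describe a foreign key, and the `hidden` parameter are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def row_to_element(table, row, hidden=("feature_id",)):
    """Map one row to XML: table -> element, column -> child element."""
    elem = ET.Element(table)
    for column, value in row.items():
        if column in hidden:
            continue  # DB-specific values such as primary keys stay out of the XML
        col = ET.SubElement(elem, column)
        if isinstance(value, dict):
            # foreign key table nested within the foreign key column
            col.append(row_to_element(value["table"], value["row"], hidden))
        else:
            col.text = str(value)
    return elem

feature = {
    "feature_id": 42,  # hypothetical primary key, omitted from the output
    "uniquename": "CG3312",
    "type_id": {"table": "cvterm", "row": {"name": "gene"}},
}
xml_text = ET.tostring(row_to_element("feature", feature), encoding="unicode")
```

Because the mapping only looks at table and column names, nothing in it is specific to Chado or to biology.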
![Page 18: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/18.jpg)
Transactions and Operations
• Lookup
– Select only
• Insert
– Insert explicitly
• Delete
– Unique identifier with unique key(s)
– One record per operation
• Update
– Two elements
– Unique identifier with unique key(s)
– One record per operation
• Force
– Combination of lookup, insert and update (if the lookup finds no record, insert; otherwise update)
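The "force" semantics can be sketched against a toy table. The table is a simplified stand-in for Chado's cv table, and `force_cv` is a hypothetical helper that shows the lookup-then-insert-or-update logic, not the XORT loader's actual code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cv (cv_id INTEGER PRIMARY KEY, name TEXT UNIQUE, definition TEXT)"
)

def force_cv(name, definition):
    """Look up by unique key; insert if absent, otherwise update."""
    row = conn.execute("SELECT cv_id FROM cv WHERE name = ?", (name,)).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO cv (name, definition) VALUES (?, ?)", (name, definition)
        )
        return "insert"
    conn.execute("UPDATE cv SET definition = ? WHERE cv_id = ?", (definition, row[0]))
    return "update"

first = force_cv("Sequence Ontology", "draft definition")
second = force_cv("Sequence Ontology", "revised definition")
```

The unique key (here `name`) is what identifies the record, which is why delete and update operate on one record per operation.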
![Page 19: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/19.jpg)
Referencing Objects
• By global accession
– Format: dbname:accession[.version]
– Only for dbxref, feature ?, cvterm ?
• By a pre-defined local id
– Allows reference to objects in same file
– Need not be in DB
– Can be any symbol
• By lookup using unique key value(s)
– Object can be in file or DB
• Implicitly, using foreign key to identify information in the related link table
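A global accession in the dbname:accession[.version] format could be split as below. The regular expression is an assumption: it treats the version as a trailing run of digits after the last dot and takes the dbname to be everything before the first colon. The versioned accession in the usage note is hypothetical.

```python
import re

# dbname : accession [ . version ]   (version assumed to be numeric)
_ACCESSION = re.compile(
    r"^(?P<dbname>[^:]+):(?P<accession>.+?)(?:\.(?P<version>\d+))?$"
)

def parse_accession(ref):
    """Split a global accession into (dbname, accession, version or None)."""
    match = _ACCESSION.match(ref)
    if match is None:
        raise ValueError(f"not a global accession: {ref!r}")
    return match.group("dbname"), match.group("accession"), match.group("version")
```

For example, `parse_accession("Gadfly:CG3312-RA")` yields a dbname of `Gadfly` and no version, while a made-up reference such as `"SomeDB:ABC123.2"` yields version `"2"`.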
![Page 20: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/20.jpg)
Object Reference By Global Accession
    <feature>
      <uniquename>CG3123</uniquename>
      <type_id>gene</type_id>
      <feature_relationship>
        <subject_id>Gadfly:CG3123-RA:1</subject_id>
        <type_id>producedby</type_id>
      </feature_relationship>
      ……
    </feature>
![Page 21: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/21.jpg)
Object Reference By Local ID
    <cv id="SO">
      <name>Sequence Ontology</name>
    </cv>

    <cvterm id="exon">
      <cv_id>SO</cv_id>
      <name>exon</name>
    </cvterm>

    <feature>
      <type_id>exon</type_id>
      ……
![Page 22: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/22.jpg)
Object Reference By key Value (s)
    <feature>
      <type_id>
        <cvterm>
          <cv_id>
            <cv>
              <name>Sequence Ontology</name>
            </cv>
          </cv_id>
          <name>exon</name>
        </cvterm>
      </type_id>
      ….
![Page 23: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/23.jpg)
ChadoXML Example

    <cv id="SO">
      <name>Sequence Ontology</name>
    </cv>
    <feature op="lookup" id="CG3312">
      <uniquename>CG3312</uniquename>
      <type_id>
        <cvterm>
          <name>gene</name>
          <cv_id>SO</cv_id>
        </cvterm>
      </type_id>
      <organism_id>……</organism_id>
      <feature_relationship>
        <subject_id>Gadfly:CG3312-RA</subject_id>
        <type_id>producedby</type_id>
      </feature_relationship>
    </feature>
![Page 24: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/24.jpg)
Schema-Driven Tools
• DTD Generator: DDL->DTD
• Validator
– DB not connected
  • Syntax verification: legal XML, correct element nesting
  • Some semantic verification: NULLness, cardinality, local ID reference
– DB connected: reference validation
• Loader: XML->DB
• Dumper: DB->XML
– Driven by XML "dumpspec"
• XORTDiff: diff two XORT XML files
![Page 25: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/25.jpg)
DumpSpec Driven Dumper
• Default behavior: given an object class and ID, dump all direct values and link tables, with refs to foreign keys.
• Non-default behavior specified by XML dumpspecs using same DTD with a few additions:
– attribute dump = all | cols | select | none
– attribute test = yes | no
– element _sql
– element _appdata
• Workaround with views, _sql
• Current use cases:
– Dump a gene for a gene detail page
– Dump a scaffold for Apollo
![Page 26: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/26.jpg)
DumpSpec Sample

    <feature dump="all">
      <uniquename test="yes">CG3312</uniquename>
      <!-- get all mRNAs of this gene -->
      <feature_relationship dump="all">
        <subject_id test="yes">
          <feature>
            <type_id><cvterm><name>mRNA</name></cvterm></type_id>
          </feature>
        </subject_id>
        <subject_id>
          <feature dump="all">
            <!-- get all exons of those mRNAs -->
            <feature_relationship dump="all">
              <subject_id test="yes">
                <feature>
                  <type_id><cvterm><name>exon</name></cvterm></type_id>
                </feature>
              </subject_id>
              <subject_id>
                <feature dump="all"/>
              </subject_id>
            </feature_relationship>
          </feature>
        </subject_id>
      </feature_relationship>
    </feature>
![Page 27: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/27.jpg)
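Use Case 1: Chado <-> Apollo Interaction
[Diagram: round trip between Chado and Apollo. Outbound: Chado, XML Dumper (driven by DumpSpec1), ChadoXML, GameBridge, GAME XML, Apollo. Return: Apollo, GAME XML, GameBridge, ChadoXML, XML Loader + validator, Chado. Also shown: DDL, XML Schema, and a "By_product" label.]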
![Page 28: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/28.jpg)
Use Case 2: Gene Page Dataflow
[Diagram: Chado, XML Dumper (driven by DumpSpec2), ChadoXML, acode.]
![Page 29: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/29.jpg)
To Do Lists
• External Object reference
• Dump with auto-generated XML Schema
• Human-friendly output
![Page 30: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/30.jpg)
Resources
• Today’s slides
• XORT package http://www.gmod.org
• Protocol draft submitted to Current Protocols in Bioinformatics
• Using chado to Store Genome Annotation Data
![Page 31: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/31.jpg)
XORT Key points
• Application language-neutral
– reusable from within multiple languages and applications
• Where does the domain logic live?
• Unlike objects, XML does not have ‘behaviour’
– One solution: ChadoXML Services
– Another solution: Inside the DBMS
![Page 32: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/32.jpg)
ChadoXML Services
[Diagram: an application talks to the XML::XORT interface (chado query params in, chado xml out, DBI driver underneath) and to a ChadoXML services interface (query params and chado xml in; chado xml and other xml out). Services behind that interface: format converters and dumpers (XSLT); sequence domain logic (C, Perl); ontology services (Java, XQuery, Lisp, Prolog); comp analysis logic (Java, Perl).]
![Page 33: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/33.jpg)
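ChadoXML Services
Can be independent of DB
[Diagram: the same services as on the previous slide, without the XML::XORT database layer. An application exchanges query params, chado xml and other xml with the ChadoXML services interface: format converters and dumpers (XSLT); sequence domain logic (C, Perl); ontology services (Java, XQuery, Lisp, Prolog); comp analysis logic (Java, Perl).]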
![Page 34: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/34.jpg)
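DB Functions API
Implementation inside database
[Diagram: an application issues Chado db function calls through JDBC/DBI and a PostgreSQL driver and receives SQL result objects; the DB Functions API is implemented inside the database as PL/pgSQL functions, backed by a sequence library in C.]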
![Page 35: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/35.jpg)
DB Functions API
Existing functions (PL/pgSQL function implementations inside Chado)
• cv module
– get_all_subject_ids(cvterm_id int);
– get_all_object_ids(cvterm_id int);
– fill_cvtermpath(cv_id int);
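Functions like these revolve around the transitive closure of term relationships. The sketch below computes such a closure in application code, purely to show the idea; it assumes that fill_cvtermpath populates a closure table along these lines, and it says nothing about how the PL/pgSQL versions are actually written. The edges in the usage note are a toy hierarchy, not real Sequence Ontology content.

```python
def transitive_closure(edges):
    """Map (subject, ancestor) -> shortest path length over (subject, object) edges."""
    parents = {}
    for subject, obj in edges:
        parents.setdefault(subject, set()).add(obj)

    paths = {}
    for start in parents:
        frontier, seen, distance = {start}, set(), 0
        while frontier:
            distance += 1
            # step one relationship up from every node in the current frontier
            frontier = {p for node in frontier for p in parents.get(node, ())} - seen
            seen |= frontier
            for ancestor in frontier:
                paths[(start, ancestor)] = distance
    return paths
```

With the toy edges `[("mRNA", "transcript"), ("transcript", "region")]`, the result contains the direct pairs at distance 1 and `("mRNA", "region")` at distance 2. Precomputing this inside the database is what makes "search by ontology term" cheap for every client.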
![Page 36: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/36.jpg)
DB Functions API
Existing functions (PL/pgSQL function implementations inside Chado)
• sequence module
– get_sub_feature_ids(feature_id int)
– featuresplice(fmin int, fmax int)
– get_subsequence(srcfeature_id int, fmin int, fmax int, strand int)
– next_uniquename()
![Page 37: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/37.jpg)
Putting it together: storing ontologies in chado
[Diagram: Protege produces OWL, which owl_to_oboxml.xsl converts to Obo XML; obo-edit also feeds Obo XML; oboxml_to_chadoxml.xsl converts Obo XML to ChadoXML; XML::XORT or DBIx::DBStag loads the ChadoXML into the cv module; fill_cvtermpath() then runs inside the database.]
Benefits
• Issue: Language mismatch
– XORT dumpspecs and SQL functions are a more natural fit for application languages
• Issue: Data structure mismatch
– XML maps naturally to objects and structs
• Issue: Repetitive code
– XORT dumpspecs centralize db-fetch code
– XORT loader centralizes db-store code
• Issue: No centralized domain logic
– domain logic can be encoded in:
  • PostgreSQL functions and triggers
  • ChadoXML services
![Page 39: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/39.jpg)
Other issues
• Speed?
– chained transformations may be slower (-)
– generic code is often slower (-)
– single point for optimization (+)
• Verbosity
– inevitable with a normalized database
– reduced with XORT macros
• Portability
– XORT highly portable (+)
– PostgreSQL functions must be manually ported to different DBMSs (-)
![Page 40: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/40.jpg)
Current plans
• XORT wrappers
• Improving efficiency
• Documentation
• Extend PostgreSQL function repertoire
• More ChadoXML XSLTs
• ChadoXML adapters
– CGL
– Apollo
– BioPerl - Bio::{Seq,Search,Tree,..}IO::chadoxml
![Page 41: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/41.jpg)
Conclusions
• ChadoXML
– a common GMOD format
– converted to other formats with XSLTs
• XORT
– centralises database interoperation logic
• PostgreSQL functions
– useful for certain kinds of domain logic
• Object APIs
– still required by many applications
– can be layered on top of XORT if so desired
![Page 42: Chado-XML](https://reader035.vdocuments.us/reader035/viewer/2022062703/55501238b4c90535638b4ac4/html5/thumbnails/42.jpg)
Thanks to…
• Richard Bruskiewich
• Scott Cain
• Allen Day
• Karen Eilbeck
• Dave Emmert
• William Gelbart
• Mark Gibson
• Don Gilbert
• Aubrey de Grey
• Nomi Harris
• Stan Letovsky
• Suzanna Lewis
• Aaron Mackey
• Sima Misra
• Emmanuel Mongin
• Simon Prochnik
• Gerald Rubin
• Susan Russo
• ShengQiang Shu
• Chris Smith
• Frank Smutniak
• Lincoln Stein
• Colin Wiel
• Mark Yandell
• Peili Zhang
• Mark Zytkovicz