intermine infrastructure lf meeting 20150428
TRANSCRIPT
InterMine in a nutshell
• Open-source data warehouse software• Integration of complex biological data• Parsers for common biological data formats• Extensible framework for custom data• Cookie-cutter interface, highly customizable• Interact using sophisticated web query tools• Programmatic access using web-service API
Open-source Project
• Source code available online• Distributed with the GNU
LGPL license• GitHub Repo:
https://github.com/intermine/intermine
• GitHub Organization: https://github.com/intermine
intermine / intermine> bio> biotestmine> config> flymine> humanmine> imbuild> intermine> testmodel .gitignore .travis.yml LICENSE LICENSE.LIBS README.md RELEASE_NOTES
InterMine system architecture
Web Application• Java Server Pages (JSP), HTML, JS, CSS• Interfaces with Java Servlets and IM web-services
Web Server• Tomcat 7.0.x, serves Web application ARchive file• ant based build system using Java SDK
Database Server• PostgreSQL 9.2 or above• range query, btree, gist enabled (refer docs here)
http://intermine.readthedocs.org/en/latest/system-requirements/
Data Model Overview
• Object-oriented data model• Divided into classes, their attributes and their
relationships; defined in XML• Represented as Java classes (pure Java
beans); auto-generated from XML, automatically map to tables in schema
• Core data model; based on Sequence Ontology (SO); refer: bio/core/core.xml and bio/core/genomic_additions.xml
http://intermine.readthedocs.org/en/latest/data-model/overview/
Data Model Overview
<?xml version="1.0"?><model name="example" package="org.intermine.model.bio">
<class name="Protein" is-interface="true" extends="SequenceFeature"> <attribute name="name" type="java.lang.String"/> <attribute name="accession" type="java.lang.String"/> <collection name="features" referenced-type="NewFeature" reverse-reference="protein"/> </class>
<class name="NewFeature" is-interface="true"> <attribute name="identifier" type="java.lang.String"/> <attribute name="confidence" type="java.lang.Double"/> <reference name="protein" referenced-type="Protein" reverse-reference="features"/> </class></model>
Model expects standard Java names for classes and attributes• classes: start with an upper case letter and be CamelCase, no underscores or spaces.• fields (attributes, references, collections): should start with a lower case letter and be
lowerCamelCase, no underscores or spaces.http://intermine.readthedocs.org/en/latest/data-model/model/
Creating & configuring a mine
• Build out scaffold for mine$ cd git/intermine$ bio/scripts/make_mine legumine
• Configure data to load and post-processing steps to run by customizing project.xml
• Data <source /> elements correspond to directory under bio/sources/*; defines parsers to retrieve data and encodes rules for integration
intermine / intermine> bio> biotestmine> config> flymine> legumine
> dbmodel> integrate> postprocess> webapp> default.intermine.integrate.properties> default.intermine.webapp.properties> project.xml
> humanmine> imbuild> intermine> testmodel .gitignore .travis.yml LICENSE LICENSE.LIBS README.md RELEASE_NOTES
http://intermine.readthedocs.org/en/latest/get-started/tutorial/#creating-a-new-mine
Creating & configuring a mine
<project type="bio"> <property name="target.model" value="genomic"/> <property name="source.location" location="../bio/sources/"/> <property name="common.os.prefix" value="common"/> <property name="intermine.properties.file" value="legumine.properties"/> <property name="default.intermine.properties.file" location="../default.intermine.integrate.properties"/> <sources> <source name=”legumine-gff" type="legumine-gff"> <property name="gff3.taxonId" value="3880"/> <property name="gff3.seqDataSourceName" value="LF"/> <property name="gff3.dataSourceName" value="LF"/> <property name="gff3.seqClsName" value="Chromosome"/> <property name="gff3.dataSetTitle" value="Genome Annotation"/> <property name="src.data.dir" location="/path/to/legumine/genome/gff/" /> </source> : : </sources> <post-processing> <post-process name="create-references" /> <post-process name="create-chromosome-locations-and-lengths"/> <post-process name="create-gene-flanking-features" /> : : </post-processing></project>
project.xml
http://intermine.readthedocs.org/en/latest/get-started/tutorial/#project-xml
Data Sources and Sets
• InterMine provides a vast library of data source parsers and loaders, covering data types not restricted to:genome sequence (fasta)annotation (gff)ontology (go, so)proteins (uniprot)interactions (psi-mi)pathway (kegg, reactome)homologs (panther, compara, homologene)publications (pubmed)chado (sequence, stock)
• Custom sources can be written by following the tutorial: http://intermine.readthedocs.org/en/latest/database/data-sources/custom/ or by referring to code from other mineshttp://intermine.readthedocs.org/en/latest/database/data-sources/library/
Building a mine
• Each InterMine instance requires 3 PostgreSQL databases:
legumine: core db mapping to data model items-legumine: db for storing intermediate Items during load userprofile-legumine: db for storing user specific data
• Running build requires special config file in the users’ home area, containing db connection params and other mine specific configs to override${HOME}/.intermine/legumine.properties
http://intermine.readthedocs.org/en/latest/get-started/tutorial/#properties-file
Model Merging & Data Integration
Model Merging• Each source contributes
towards the data model• bio/core/core.xml is always
used as the base for model merging
• The ant build-db command consumes the SOURCE_additions.xml
• Model is used to generate tables, Java classes and the webapp
http://intermine.readthedocs.org/en/latest/database/database-building/model-merging/
Data Integration• Key(s) for class of object
defines equivalence for objects of that class
• Primary key defines field(s) used to search for equivalence
• For objects which share same primary key, fields are merged and stored as single object
http://intermine.readthedocs.org/en/latest/database/database-building/primary-keys/
Post processing
• Operations are performed on integrated data
• Calculate/set fields difficult to work with while data loading, because they require 2 or more sources to be loaded already
• Order of steps is somewhat important
<post-processing> <post-process name="create-references" /> <post-process name="create-chromosome-locations-and-lengths"/> <post-process name="create-gene-flanking-features" /> <post-process name="do-sources" /> <post-process name="create-intron-features"> <property name="organisms" value="3880"/> </post-process> <post-process name="transfer-sequences"/> <post-process name="populate-child-features"/> <post-process name="create-location-range-index" /> <post-process name="create-overlap-view" /> <post-process name="create-attribute-indexes"/> <post-process name="summarise-objectstore"/> <post-process name="create-search-index"/></post-processing>
http://intermine.readthedocs.org/en/latest/database/database-building/post-processing/
Building & deploying a mine
Two types of build mechanisms:• Manual:$ cd dbmodel && ant clean build-db ## initialize db$ ant -Dsource=legumine-gff ## load data sources $ ant -Dsource=legumine-chr-fasta ## load more sources$ cd ../postprocess && ant ## run post-process
steps$ cd ../webapp ## build mine webapp$ ant clean remove-webapp default release-webapp
• Automated:$ ../bio/scripts/project_build -b -v localhost ~/legumine-
dumphttp://intermine.readthedocs.org/en/latest/database/database-building/build-script/
Lucene based search index
• Post-process "create-search-index" runs the database indexing, zips and stores in db
• On webapp (first) load, index is unpacked• By default, all id and text fields are ignored by the
indexer• Uses the Apache Lucene whitespace analyzer to
identify word boundaries• Control temp directory and classes/fields to be
ignored by altering MINE_NAME/dbmodel/resources/keyword_search.properties file
http://intermine.readthedocs.org/en/latest/webapp/keyword-search/
Alex Kalderimis et al. Nucl. Acids Res. 2014;42:W468-W472
InterMine web services
http://iodocs.labs.intermine.org
Federated Authentication
• Apart from the standard login scheme (username/password), InterMine supports industry standard OAuth2 based login flows, implemented by Google, GitHub, Agave, etc.
• ThaleMine relies on this infrastructure to authenticate users against the araport.org tenant registered within the Agave infrastructure
• Documentation available here: http://intermine.readthedocs.org/en/latest/webapp/properties/web-properties/#openauth2-settings-aka-openid-connect
Friendly reference mines
• FlyMine: https://github.com/intermine/intermine/• ThaleMine:
https://github.com/Arabidopsis-Information-Portal/intermine/
• MedicMine: https://github.com/jcvi-plant-genomics/intermine/
• PhytoMine: https://github.com/JoeCarlson/intermine/
Summary
• Advantages InterMine is a powerful biological data warehouse Performs complex data integration Allows fast and flexible querying Well documented programmatic interface Cookie-cutter, user-friendly web interface Facilitates cross-talk between “mines”
• Caveats Adding more data requires a full database rebuild (incremental loading is not
possible) because of the integration step
• About InterMine: Developed by the Micklem Lab at the University of Cambridge, UK Written in Java, backed by PostgreSQLdb, deployed under Tomcat.
Documentation and downloads available at http://www.intermine.org
Acknowledgments
• InterMine Team Gos Micklem Julie Sullivan Alex Kalderimis Richard Smith Sergio Contrino Josh Heimbach et al.
• Araport Team Chris Town Jason Miller Matt Vaughn Maria Kim Svetlana
Karamycheva Erik Ferlanti Chia-Yi Cheng Benjamin Rosen Irina Belyaeva