Introduction to InterMine Infrastructure Vivek Krishnakumar LF Meeting 04/28/2015



InterMine in a nutshell

• Open-source data warehouse software
• Integration of complex biological data
• Parsers for common biological data formats
• Extensible framework for custom data
• Cookie-cutter interface, highly customizable
• Interact using sophisticated web query tools
• Programmatic access using web-service API

Open-source Project

• Source code available online
• Distributed under the GNU LGPL license
• GitHub Repo: https://github.com/intermine/intermine
• GitHub Organization: https://github.com/intermine

intermine / intermine
  > bio
  > biotestmine
  > config
  > flymine
  > humanmine
  > imbuild
  > intermine
  > testmodel
    .gitignore
    .travis.yml
    LICENSE
    LICENSE.LIBS
    README.md
    RELEASE_NOTES

Richard N. Smith et al. Bioinformatics 2012;28:3163-3165

InterMine system architecture


Web Application
• JavaServer Pages (JSP), HTML, JS, CSS
• Interfaces with Java Servlets and IM web-services

Web Server
• Tomcat 7.0.x, serves the Web application ARchive (WAR) file
• ant-based build system using Java SDK

Database Server
• PostgreSQL 9.2 or above
• range query, btree, gist enabled (see the system requirements docs)

http://intermine.readthedocs.org/en/latest/system-requirements/

Data Model Overview

• Object-oriented data model
• Divided into classes, their attributes and their relationships; defined in XML
• Represented as Java classes (pure Java beans); auto-generated from XML, automatically mapped to tables in the schema
• Core data model; based on Sequence Ontology (SO); refer: bio/core/core.xml and bio/core/genomic_additions.xml

http://intermine.readthedocs.org/en/latest/data-model/overview/

Data Model Overview

<?xml version="1.0"?>
<model name="example" package="org.intermine.model.bio">
  <class name="Protein" is-interface="true" extends="SequenceFeature">
    <attribute name="name" type="java.lang.String"/>
    <attribute name="accession" type="java.lang.String"/>
    <collection name="features" referenced-type="NewFeature" reverse-reference="protein"/>
  </class>
  <class name="NewFeature" is-interface="true">
    <attribute name="identifier" type="java.lang.String"/>
    <attribute name="confidence" type="java.lang.Double"/>
    <reference name="protein" referenced-type="Protein" reverse-reference="features"/>
  </class>
</model>
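To make the class/attribute/relationship structure concrete, here is a minimal sketch that reads the example model with Python's standard XML parser and summarizes each class. The function name `summarize_model` is made up for this illustration; InterMine itself generates Java classes and tables from this XML, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# The example model from the slide, embedded so the sketch is self-contained.
MODEL_XML = """<?xml version="1.0"?>
<model name="example" package="org.intermine.model.bio">
  <class name="Protein" is-interface="true" extends="SequenceFeature">
    <attribute name="name" type="java.lang.String"/>
    <attribute name="accession" type="java.lang.String"/>
    <collection name="features" referenced-type="NewFeature" reverse-reference="protein"/>
  </class>
  <class name="NewFeature" is-interface="true">
    <attribute name="identifier" type="java.lang.String"/>
    <attribute name="confidence" type="java.lang.Double"/>
    <reference name="protein" referenced-type="Protein" reverse-reference="features"/>
  </class>
</model>
"""


def summarize_model(xml_text):
    """Return {class name: {extends, attributes, references, collections}}."""
    root = ET.fromstring(xml_text)
    summary = {}
    for cls in root.findall("class"):
        summary[cls.get("name")] = {
            # None when the class has no explicit parent
            "extends": cls.get("extends"),
            # attribute name -> Java type
            "attributes": {a.get("name"): a.get("type") for a in cls.findall("attribute")},
            # one-to-one link name -> target class
            "references": {r.get("name"): r.get("referenced-type") for r in cls.findall("reference")},
            # one-to-many link name -> target class
            "collections": {c.get("name"): c.get("referenced-type") for c in cls.findall("collection")},
        }
    return summary


if __name__ == "__main__":
    for name, info in summarize_model(MODEL_XML).items():
        print(name, info)
```

Here the `features` collection on `Protein` and the `protein` reference on `NewFeature` are the two ends of one relationship, which is what the `reverse-reference` attributes express.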

Model expects standard Java names for classes and attributes
• classes: start with an upper case letter and be CamelCase, no underscores or spaces.
• fields (attributes, references, collections): start with a lower case letter and be lowerCamelCase, no underscores or spaces.
http://intermine.readthedocs.org/en/latest/data-model/model/
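The naming rules above can be approximated with two regular expressions. This is only a rough pre-check under the stated rules (leading case, no underscores or spaces); the helper names are hypothetical and InterMine's own validation may accept or reject names differently.

```python
import re

# Class names: leading upper case letter, then letters/digits only.
CLASS_NAME = re.compile(r"[A-Z][A-Za-z0-9]*")
# Field names: leading lower case letter, then letters/digits only.
FIELD_NAME = re.compile(r"[a-z][A-Za-z0-9]*")


def is_valid_class_name(name):
    """True if name looks like a CamelCase class name."""
    return CLASS_NAME.fullmatch(name) is not None


def is_valid_field_name(name):
    """True if name looks like a lowerCamelCase field name."""
    return FIELD_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["NewFeature", "new_feature", "Sequence Feature"]:
        print("class", repr(candidate), is_valid_class_name(candidate))
    for candidate in ["primaryIdentifier", "Primary_identifier"]:
        print("field", repr(candidate), is_valid_field_name(candidate))
```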

Creating & configuring a mine

• Build out scaffold for mine
    $ cd git/intermine
    $ bio/scripts/make_mine legumine

• Configure data to load and post-processing steps to run by customizing project.xml

• Each data <source /> element corresponds to a directory under bio/sources/*, which defines the parsers that retrieve the data and encodes the rules for integration

intermine / intermine
  > bio
  > biotestmine
  > config
  > flymine
  > legumine
      > dbmodel
      > integrate
      > postprocess
      > webapp
        default.intermine.integrate.properties
        default.intermine.webapp.properties
        project.xml
  > humanmine
  > imbuild
  > intermine
  > testmodel
    .gitignore
    .travis.yml
    LICENSE
    LICENSE.LIBS
    README.md
    RELEASE_NOTES

http://intermine.readthedocs.org/en/latest/get-started/tutorial/#creating-a-new-mine

Creating & configuring a mine

<project type="bio">
  <property name="target.model" value="genomic"/>
  <property name="source.location" location="../bio/sources/"/>
  <property name="common.os.prefix" value="common"/>
  <property name="intermine.properties.file" value="legumine.properties"/>
  <property name="default.intermine.properties.file" location="../default.intermine.integrate.properties"/>
  <sources>
    <source name="legumine-gff" type="legumine-gff">
      <property name="gff3.taxonId" value="3880"/>
      <property name="gff3.seqDataSourceName" value="LF"/>
      <property name="gff3.dataSourceName" value="LF"/>
      <property name="gff3.seqClsName" value="Chromosome"/>
      <property name="gff3.dataSetTitle" value="Genome Annotation"/>
      <property name="src.data.dir" location="/path/to/legumine/genome/gff/"/>
    </source>
    :
    :
  </sources>
  <post-processing>
    <post-process name="create-references"/>
    <post-process name="create-chromosome-locations-and-lengths"/>
    <post-process name="create-gene-flanking-features"/>
    :
    :
  </post-processing>
</project>

project.xml

http://intermine.readthedocs.org/en/latest/get-started/tutorial/#project-xml
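As a quick sanity check before a build, a project.xml can be inspected to list the configured sources and the post-processing steps in order. The sketch below uses a shortened copy of the file above; `list_sources` and `list_post_processes` are illustrative helper names, not part of InterMine.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the project.xml shown on the slide.
PROJECT_XML = """<project type="bio">
  <property name="target.model" value="genomic"/>
  <sources>
    <source name="legumine-gff" type="legumine-gff">
      <property name="gff3.taxonId" value="3880"/>
      <property name="src.data.dir" location="/path/to/legumine/genome/gff/"/>
    </source>
  </sources>
  <post-processing>
    <post-process name="create-references"/>
    <post-process name="create-chromosome-locations-and-lengths"/>
  </post-processing>
</project>
"""


def list_sources(xml_text):
    """Return {source name: {"type": ..., "properties": {...}}}."""
    root = ET.fromstring(xml_text)
    sources = {}
    for source in root.findall("./sources/source"):
        props = {}
        for prop in source.findall("property"):
            # A property carries either a "value" or a "location" attribute.
            props[prop.get("name")] = prop.get("value") or prop.get("location")
        sources[source.get("name")] = {"type": source.get("type"), "properties": props}
    return sources


def list_post_processes(xml_text):
    """Return post-process step names in the order they will run."""
    root = ET.fromstring(xml_text)
    return [step.get("name") for step in root.findall("./post-processing/post-process")]


if __name__ == "__main__":
    print(list_sources(PROJECT_XML))
    print(list_post_processes(PROJECT_XML))
```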

Data Sources and Sets

• InterMine provides a vast library of data source parsers and loaders, covering data types including but not restricted to:
    genome sequence (fasta)
    annotation (gff)
    ontology (go, so)
    proteins (uniprot)
    interactions (psi-mi)
    pathway (kegg, reactome)
    homologs (panther, compara, homologene)
    publications (pubmed)
    chado (sequence, stock)

• Custom sources can be written by following the tutorial:
    http://intermine.readthedocs.org/en/latest/database/data-sources/custom/
  or by referring to code from other mines:
    http://intermine.readthedocs.org/en/latest/database/data-sources/library/

Building a mine

• Each InterMine instance requires 3 PostgreSQL databases:
    legumine: core db mapping to data model
    items-legumine: db for storing intermediate Items during load
    userprofile-legumine: db for storing user-specific data

• Running a build requires a special config file in the user's home directory, containing db connection params and other mine-specific configs to override:
    ${HOME}/.intermine/legumine.properties

http://intermine.readthedocs.org/en/latest/get-started/tutorial/#properties-file
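The three databases map onto three blocks of connection settings in the properties file. The sketch below renders such blocks for a given mine name. The key pattern (`db.production`, `db.common-tgt-items`, `db.userprofile-production`) follows the example properties file in the InterMine tutorial as I understand it and should be checked against the page linked above; the user name and password are placeholders, and the function names are made up for this illustration.

```python
def database_names(mine):
    """Map each datasource alias to its database name for the given mine."""
    return {
        "production": mine,                                # core db mapping to the data model
        "common-tgt-items": f"items-{mine}",               # intermediate Items during load
        "userprofile-production": f"userprofile-{mine}",   # user-specific data
    }


def render_db_properties(mine, server="localhost", user="PSQL_USER", password="PSQL_PWD"):
    """Render the db connection lines of a MINE.properties file (assumed key names)."""
    lines = []
    for alias, db_name in database_names(mine).items():
        prefix = f"db.{alias}.datasource"
        lines.append(f"{prefix}.serverName={server}")
        lines.append(f"{prefix}.databaseName={db_name}")
        lines.append(f"{prefix}.user={user}")
        lines.append(f"{prefix}.password={password}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The output would be saved as ${HOME}/.intermine/legumine.properties
    print(render_db_properties("legumine"))
```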

Model Merging & Data Integration

Model Merging
• Each source contributes towards the data model
• bio/core/core.xml is always used as the base for model merging
• The ant build-db command consumes the SOURCE_additions.xml
• Model is used to generate tables, Java classes and the webapp

http://intermine.readthedocs.org/en/latest/database/database-building/model-merging/
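Conceptually, model merging is a union of class definitions: each source's additions file adds classes or fields on top of the core model. The sketch below illustrates the idea with plain dictionaries. It is a simplified stand-in, not InterMine's merging code, and the example classes and fields are chosen only for illustration.

```python
def merge_models(base, additions):
    """Return a new model with the fields from `additions` layered onto `base`.

    Both arguments map class name -> {field name: type}. A field declared
    twice with different types is treated as an error in this sketch.
    """
    merged = {cls: dict(fields) for cls, fields in base.items()}
    for cls, fields in additions.items():
        target = merged.setdefault(cls, {})  # new classes are simply added
        for field, field_type in fields.items():
            existing = target.get(field)
            if existing is not None and existing != field_type:
                raise ValueError(f"{cls}.{field}: {existing} conflicts with {field_type}")
            target[field] = field_type
    return merged


if __name__ == "__main__":
    core = {"Gene": {"primaryIdentifier": "java.lang.String"}}
    source_additions = {
        "Gene": {"proteins": "Protein"},
        "Protein": {"accession": "java.lang.String"},
    }
    print(merge_models(core, source_additions))
```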

Data Integration
• Key(s) for a class of object define equivalence for objects of that class
• Primary key defines field(s) used to search for equivalence
• For objects which share the same primary key, fields are merged and stored as a single object

http://intermine.readthedocs.org/en/latest/database/database-building/primary-keys/
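The merge-by-primary-key behaviour described above can be sketched as follows. This is a toy model of the idea only: the identifiers are invented, and conflicts are resolved by keeping the first value seen, which is an assumption of this sketch and not necessarily how InterMine resolves conflicting values.

```python
def integrate(records, key_fields):
    """Merge records that share the same values for all key_fields.

    Records missing any key field cannot be matched and are kept as-is.
    For merged records, the first non-None value seen for a field wins.
    """
    merged = {}   # primary key tuple -> merged object
    unkeyed = []  # records that lack a complete key
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if any(part is None for part in key):
            unkeyed.append(dict(record))
            continue
        target = merged.setdefault(key, {})
        for field, value in record.items():
            if value is not None:
                target.setdefault(field, value)
    return list(merged.values()) + unkeyed


if __name__ == "__main__":
    # Two sources describe the same (made-up) gene; a third record is a different gene.
    from_gff = {"primaryIdentifier": "gene-0001", "length": 1200}
    from_other_source = {"primaryIdentifier": "gene-0001", "symbol": "abc1"}
    another_gene = {"primaryIdentifier": "gene-0002", "length": 300}
    for obj in integrate([from_gff, from_other_source, another_gene], ["primaryIdentifier"]):
        print(obj)
```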

Post processing

• Operations are performed on integrated data

• Calculate/set fields difficult to work with while data loading, because they require 2 or more sources to be loaded already

• Order of steps is somewhat important

<post-processing>
  <post-process name="create-references"/>
  <post-process name="create-chromosome-locations-and-lengths"/>
  <post-process name="create-gene-flanking-features"/>
  <post-process name="do-sources"/>
  <post-process name="create-intron-features">
    <property name="organisms" value="3880"/>
  </post-process>
  <post-process name="transfer-sequences"/>
  <post-process name="populate-child-features"/>
  <post-process name="create-location-range-index"/>
  <post-process name="create-overlap-view"/>
  <post-process name="create-attribute-indexes"/>
  <post-process name="summarise-objectstore"/>
  <post-process name="create-search-index"/>
</post-processing>

http://intermine.readthedocs.org/en/latest/database/database-building/post-processing/

Building & deploying a mine

Two types of build mechanisms:
• Manual:
    $ cd dbmodel && ant clean build-db                ## initialize db
    $ cd ../integrate && ant -Dsource=legumine-gff    ## load data sources
    $ ant -Dsource=legumine-chr-fasta                 ## load more sources
    $ cd ../postprocess && ant                        ## run post-process steps
    $ cd ../webapp                                    ## build mine webapp
    $ ant clean remove-webapp default release-webapp

• Automated:
    $ ../bio/scripts/project_build -b -v localhost ~/legumine-dump

http://intermine.readthedocs.org/en/latest/database/database-building/build-script/
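The manual build sequence can be captured as a small plan that is printed by default and only executed on request. This sketch assumes the source-loading `ant -Dsource=...` commands run from the `integrate` directory of the mine scaffold; the function names and the `/path/to/...` location are hypothetical.

```python
import os
import subprocess


def manual_build_plan(sources):
    """Return the manual build as a list of (subdirectory, command) steps."""
    plan = [("dbmodel", ["ant", "clean", "build-db"])]                      # initialize db
    plan += [("integrate", ["ant", f"-Dsource={name}"]) for name in sources]  # load each source
    plan.append(("postprocess", ["ant"]))                                   # run post-process steps
    plan.append(("webapp", ["ant", "clean", "remove-webapp", "default", "release-webapp"]))
    return plan


def run_plan(plan, mine_dir, dry_run=True):
    """Print each step, and execute it with subprocess unless dry_run is set."""
    for subdir, command in plan:
        cwd = os.path.join(mine_dir, subdir)
        print(f"(cd {cwd} && {' '.join(command)})")
        if not dry_run:
            # check=True stops the build at the first failing step
            subprocess.run(command, cwd=cwd, check=True)


if __name__ == "__main__":
    plan = manual_build_plan(["legumine-gff", "legumine-chr-fasta"])
    run_plan(plan, "/path/to/intermine/legumine", dry_run=True)
```

Keeping the dry run as the default makes it easy to review the step order before committing to a long-running build.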

Lucene based search index

• Post-process "create-search-index" runs the database indexing, zips the index and stores it in the db
• On webapp (first) load, index is unpacked
• By default, all id and text fields are ignored by the indexer
• Uses the Apache Lucene whitespace analyzer to identify word boundaries
• Control temp directory and classes/fields to be ignored by altering the MINE_NAME/dbmodel/resources/keyword_search.properties file

http://intermine.readthedocs.org/en/latest/webapp/keyword-search/
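A whitespace analyzer splits text at whitespace only, so punctuation stays attached to tokens and case is preserved. The sketch below imitates that behaviour with Python's `str.split()` as an approximation; it does not call Lucene, and the sample text is invented.

```python
def whitespace_tokens(text):
    """Split text at runs of whitespace; punctuation and case are left untouched."""
    return text.split()


if __name__ == "__main__":
    description = "Nodule Cysteine-Rich peptide (NCR001), chr1"
    print(whitespace_tokens(description))
    # Hyphenated words stay whole and "(NCR001)," keeps its punctuation,
    # so a search term would need to match the token in that form.
```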

Alex Kalderimis et al. Nucl. Acids Res. 2014;42:W468-W472

InterMine web services

http://iodocs.labs.intermine.org
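For programmatic access, a query is typically expressed as a small XML document and sent to the mine's web service. The sketch below only builds the request URL and makes no network call. The base URL is a placeholder, and the `/service/query/results` path with `query` and `format` parameters reflects my understanding of the InterMine web-service API; verify it against the API docs linked above before relying on it.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Example path query: identifiers and symbols of genes for taxon 3880.
QUERY_XML = (
    '<query model="genomic" view="Gene.primaryIdentifier Gene.symbol">'
    '<constraint path="Gene.organism.taxonId" op="=" value="3880"/>'
    "</query>"
)


def query_results_url(base_url, query_xml, fmt="tab"):
    """Build a GET URL for the (assumed) query results endpoint."""
    params = urlencode({"query": query_xml, "format": fmt})
    return f"{base_url.rstrip('/')}/service/query/results?{params}"


if __name__ == "__main__":
    # Placeholder mine location, not a real deployment.
    url = query_results_url("http://example.org/legumine", QUERY_XML)
    print(url)
    # Decoding the URL again shows the query survives the encoding intact.
    print(parse_qs(urlparse(url).query)["query"][0] == QUERY_XML)
```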

Federated Authentication

• Apart from the standard login scheme (username/password), InterMine supports industry standard OAuth2 based login flows, implemented by Google, GitHub, Agave, etc.

• ThaleMine relies on this infrastructure to authenticate users against the araport.org tenant registered within the Agave infrastructure

• Documentation available here: http://intermine.readthedocs.org/en/latest/webapp/properties/web-properties/#openauth2-settings-aka-openid-connect

Friendly reference mines

• FlyMine: https://github.com/intermine/intermine/
• ThaleMine: https://github.com/Arabidopsis-Information-Portal/intermine/
• MedicMine: https://github.com/jcvi-plant-genomics/intermine/
• PhytoMine: https://github.com/JoeCarlson/intermine/

Summary

• Advantages
    InterMine is a powerful biological data warehouse
    Performs complex data integration
    Allows fast and flexible querying
    Well documented programmatic interface
    Cookie-cutter, user-friendly web interface
    Facilitates cross-talk between "mines"

• Caveats
    Adding more data requires a full database rebuild (incremental loading is not possible) because of the integration step

• About InterMine:
    Developed by the Micklem Lab at the University of Cambridge, UK
    Written in Java, backed by a PostgreSQL db, deployed under Tomcat
    Documentation and downloads available at http://www.intermine.org

Acknowledgments

• InterMine Team: Gos Micklem, Julie Sullivan, Alex Kalderimis, Richard Smith, Sergio Contrino, Josh Heimbach et al.

• Araport Team: Chris Town, Jason Miller, Matt Vaughn, Maria Kim, Svetlana Karamycheva, Erik Ferlanti, Chia-Yi Cheng, Benjamin Rosen, Irina Belyaeva

THANK YOU