Bio4j: A pioneer graph based database for the integration of biological Big Data

Upload: graphdevroom

Post on 23-Jun-2015


DESCRIPTION

Canceled talk about Bio4j; at least you have the slides here, xD!

TRANSCRIPT

Page 1: Bio4j: A pioneer graph based database for the integration of biological Big Data

Bio4j: A pioneer graph based database for the integration of biological Big Data

www.ohnosequences.com www.bio4j.com

Page 2: Bio4j: A pioneer graph based database for the integration of biological Big Data

What’s Bio4j?

Bio4j is a bioinformatics graph-based DB including most of the data available in:

UniProt (Swiss-Prot + TrEMBL)

Gene Ontology (GO)

UniRef (50,90,100)

NCBI Taxonomy

RefSeq

Enzyme DB

Page 3: Bio4j: A pioneer graph based database for the integration of biological Big Data

It provides a completely new and powerful framework for protein-related information querying and management.

Since it relies on a high-performance graph engine, data is stored in a way that semantically represents its own structure.

Page 4: Bio4j: A pioneer graph based database for the integration of biological Big Data

Bio4j uses Neo4j technology, a "high-performance graph engine with all the features of a mature and robust database".

Thanks to both its Neo4j foundation and the API provided, Bio4j is also very scalable, allowing anyone to easily incorporate their own data and make the most of it.

Page 5: Bio4j: A pioneer graph based database for the integration of biological Big Data

Everything in Bio4j is open source!

Released under AGPLv3.

Page 6: Bio4j: A pioneer graph based database for the integration of biological Big Data

Bioinformatics DBs and Graphs

Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud

Upcoming features

Highly interconnected overlapping knowledge spread throughout different DBs

Page 7: Bio4j: A pioneer graph based database for the integration of biological Big Data

However, in most cases all this data is modeled in relational databases, and sometimes even just as plain CSV files.

As the amount and diversity of data grows, domain models become crazily complicated!

Page 8: Bio4j: A pioneer graph based database for the integration of biological Big Data

With a relational paradigm, the double implication

Entity ⇔ Table

does not go both ways.

You get ‘auxiliary’ tables that have no relationship with the small piece of reality you are modeling.

You need ‘artificial’ IDs only for connecting entities (and these are mixed with IDs that somehow live in reality).

Entity-relationship models are cool, but in the end you always have to deal with ‘raw’ tables plus SQL.

Integrating/incorporating new knowledge into already existing databases is hard, and sometimes not even possible without changing the domain model.

Page 9: Bio4j: A pioneer graph based database for the integration of biological Big Data

Life in general and biology in particular are probably not 100% like a graph…

but one thing’s sure, they are not a set of tables!

Page 10: Bio4j: A pioneer graph based database for the integration of biological Big Data

NoSQ… what!?

Let’s see what Wikipedia says…

NoSQL (not only SQL)

“NoSQL is a broad class of database management systems that differ from the classic model of the relational database management system (RDBMS) in some significant ways. These data stores may not require fixed table schemas, usually avoid join operations and typically scale horizontally.”

Page 11: Bio4j: A pioneer graph based database for the integration of biological Big Data

NoSQL data models

Page 12: Bio4j: A pioneer graph based database for the integration of biological Big Data

Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google's BigTable.

MongoDB (from "humongous") is an open source document-oriented NoSQL database system written in the C++ programming language.

Page 13: Bio4j: A pioneer graph based database for the integration of biological Big Data

Neo4j is a high-performance NoSQL graph database with all the features of a mature and robust database.

The programmer works with an object-oriented, flexible network structure rather than with strict and static tables.

It offers all the benefits of a fully transactional, enterprise-strength database.

For many applications, Neo4j offers performance improvements on the order of 1000x or more compared to relational DBs.

Page 14: Bio4j: A pioneer graph based database for the integration of biological Big Data

OK, but why start all this? Were you that bored…?!

It all started somehow around our need for massive access to protein GO (Gene Ontology) annotations.

At that point I had to develop my own MySQL DB based on the official GO SQL database, and problems started from the beginning:

- I went crazy ‘deciphering’ how to extract UniProt protein annotations from the official GO table schema
- UniProt and GO official protein annotations were not always consistent
- Populating my own DB took really long due to all the joins and subqueries needed to get and store the protein annotations

Soon enough we also needed massive access to basic protein information.

Page 15: Bio4j: A pioneer graph based database for the integration of biological Big Data

These processes had to be automated for BG7, our bacterial genome annotation system (specifically designed for NGS data).

The UniProt web services available were too limited:

- Slow
- Limited number of queries
- Too little information available

So I downloaded the whole UniProt DB in XML format (Swiss-Prot + TrEMBL) and started to have some fun with it!

Page 16: Bio4j: A pioneer graph based database for the integration of biological Big Data

We got used to having massive direct access to all this protein-related information…

So why not add other resources that we needed quite often in most projects and that were becoming a sort of bottleneck compared to those already included in Bio4j?

Then came:

- Isoform sequences
- Protein interactions and features
- UniRef 50, 90, and 100
- RefSeq
- NCBI Taxonomy
- Enzyme Expasy DB

Page 17: Bio4j: A pioneer graph based database for the integration of biological Big Data

Let’s dig a bit into the Bio4j structure:

Data sources and their relationships:

Page 18: Bio4j: A pioneer graph based database for the integration of biological Big Data

The Graph DB model: representation

Core abstractions:

- Nodes
- Relationships between nodes
- Properties on both
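The three core abstractions listed above can be sketched in a few lines of plain Java. This is only an illustrative in-memory model, not the Neo4j or Bio4j API: the class names (`PropertyGraphSketch`, `Node`, `Relationship`) and the sample property values are assumptions made for the example. The accession P12345 and the relationship type PROTEIN_GO_ANNOTATION are borrowed from other slides of this deck.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory property graph; not the Neo4j or Bio4j API.
final class PropertyGraphSketch {

    /** A node: an entity carrying arbitrary key/value properties. */
    static final class Node {
        final Map<String, Object> properties = new HashMap<>();
        final List<Relationship> outgoing = new ArrayList<>();
        final List<Relationship> incoming = new ArrayList<>();

        Node with(String key, Object value) {
            properties.put(key, value);
            return this;
        }

        /** Creates a typed, directed relationship from this node to end. */
        Relationship relateTo(Node end, String type) {
            Relationship rel = new Relationship(this, end, type);
            outgoing.add(rel);
            end.incoming.add(rel);
            return rel;
        }
    }

    /** A relationship: a typed, directed edge that can also carry properties. */
    static final class Relationship {
        final Node start;
        final Node end;
        final String type;
        final Map<String, Object> properties = new HashMap<>();

        Relationship(Node start, Node end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    public static void main(String[] args) {
        Node protein = new Node().with("accession", "P12345");
        Node goTerm = new Node().with("id", "GO:0005739"); // illustrative value
        Relationship annotation = protein.relateTo(goTerm, "PROTEIN_GO_ANNOTATION");
        annotation.properties.put("evidence", "IEA"); // illustrative value

        // Walk the incoming relationships of the GO term back to the proteins.
        for (Relationship rel : goTerm.incoming) {
            System.out.println(rel.start.properties.get("accession")
                    + " -[" + rel.type + "]-> " + rel.end.properties.get("id"));
        }
    }
}
```

Keeping both an outgoing and an incoming list on each node is what makes traversal in either direction a local operation, with no join table in between.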

Page 19: Bio4j: A pioneer graph based database for the integration of biological Big Data

Let’s dig a bit into the Bio4j structure:

How are things modeled? Couldn’t be simpler!

Entities → Nodes

Associations / Relationships → Edges

Page 20: Bio4j: A pioneer graph based database for the integration of biological Big Data

Some examples of nodes would be:

- Protein
- GO term
- Genome Element

and relationships:

Protein -[PROTEIN_GO_ANNOTATION]-> GO term

Page 21: Bio4j: A pioneer graph based database for the integration of biological Big Data

We have developed a tool intended to serve both as a reference manual and as a first point of contact with the Bio4j domain model: Bio4jExplorer

Bio4jExplorer allows you to:

• Navigate through all nodes and relationships

• Access the javadocs of any node or relationship

• Graphically explore the neighborhood of a node/relationship

• Look up the indexes that may serve as an entry point for a node

• Check incoming/outgoing relationships of a specific node

• Check start/end nodes of a specific relationship

Page 22: Bio4j: A pioneer graph based database for the integration of biological Big Data

Entry points and indexing

There are two kinds of entry points for the graph:

Auxiliary relationships going from the reference node, e.g.

- MAIN_DATASET: leads to both main datasets: Swiss-Prot and Trembl
- CELLULAR_COMPONENT: leads to the root of GO cellular component sub-ontology

Node indexing

There are two types of node indexes:

- Exact: Only exact values are considered hits
- Fulltext: Regular expressions can be used
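To make the difference between the two index types concrete, here is a minimal plain-Java sketch. It is not how Neo4j implements its indexes: the class `NodeIndexSketch`, its method names and the node ids are hypothetical. An exact index is modeled as a map lookup, and a pattern-based index as a regular-expression scan over the indexed values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of exact vs. pattern-based index lookups.
final class NodeIndexSketch {
    // Maps an indexed value (e.g. a protein full name) to a node id.
    private final Map<String, Long> entries = new LinkedHashMap<>();

    void add(String value, long nodeId) {
        entries.put(value, nodeId);
    }

    /** Exact lookup: only an identical value is a hit; null when absent. */
    Long exact(String value) {
        return entries.get(value);
    }

    /** Pattern lookup: every value fully matching the regex (ignoring case) is a hit. */
    List<Long> matching(String regex) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        List<Long> hits = new ArrayList<>();
        for (Map.Entry<String, Long> entry : entries.entrySet()) {
            if (pattern.matcher(entry.getKey()).matches()) {
                hits.add(entry.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NodeIndexSketch index = new NodeIndexSketch();
        // Sample data; the node ids are made up for the example.
        index.add("Aspartate aminotransferase, mitochondrial", 1L);
        index.add("Aspartate aminotransferase, cytoplasmic", 2L);
        index.add("Alcohol dehydrogenase", 3L);

        System.out.println(index.exact("Alcohol dehydrogenase")); // exact hit
        System.out.println(index.exact("alcohol"));               // no hit: not identical
        System.out.println(index.matching("aspartate.*"));        // both aspartate entries
    }
}
```

The trade-off this shows is the usual one: exact lookups are cheap and unambiguous, so they suit identifiers such as accessions, while pattern lookups are more flexible and suit free-text fields such as names.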

Page 23: Bio4j: A pioneer graph based database for the integration of biological Big Data

Retrieving protein info (Bio4jModel Java API)

import java.util.List;
// (imports of the Bio4jModel classes are omitted, as on the original slide)

//--creating manager and node retriever----
Bio4jManager manager = new Bio4jManager("/mybio4jdb");
NodeRetriever nR = new NodeRetriever(manager);

//--retrieving the protein by its accession----
ProteinNode protein = nR.getProteinNodeByAccession("P12345");

//--getting more related info----
List<InterproNode> interpros = protein.getInterpro();
OrganismNode organism = protein.getOrganism();
List<GoTermNode> goAnnotations = protein.getGOAnnotations();
List<ArticleNode> articles = protein.getArticleCitations();

//--printing the PubMed ID of every cited article----
for (ArticleNode article : articles) {
    System.out.println(article.getPubmedId());
}

//And don't forget to close the Bio4jManager
manager.shutDown();

Page 24: Bio4j: A pioneer graph based database for the integration of biological Big Data

Proteins with Interpro motif ‘IPR000847’ (Bio4jModel Java API)

//--creating manager and node retriever----
Bio4jManager manager = new Bio4jManager("/mybio4jdb");
NodeRetriever nR = new NodeRetriever(manager);

InterproNode interpro = nR.getInterproById("IPR000847");

//--relationship type used to select the protein -> Interpro edges----
ProteinInterproRel rel = new ProteinInterproRel(null);

//--iterating over the incoming relationships of the Interpro node----
Iterator<Relationship> iterator =
    interpro.getNode().getRelationships(rel, Direction.INCOMING).iterator();

while (iterator.hasNext()) {
    //--the start node of each incoming relationship is a protein----
    ProteinNode p = new ProteinNode(iterator.next().getStartNode());
    System.out.println(p.getAccession());
}

//And don't forget to close the Bio4jManager
manager.shutDown();

Page 25: Bio4j: A pioneer graph based database for the integration of biological Big Data

Querying Bio4j with Cypher

Getting a keyword by its ID:

START k=node:keyword_id_index(keyword_id_index = "KW-0181")
return k.name, k.id

Finding circuits/simple cycles of length 3 where at least one protein is from the Swiss-Prot dataset:

START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
MATCH d <-[r:PROTEIN_DATASET]- p,
      circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2)
                    -[:PROTEIN_PROTEIN_INTERACTION]-> (p3)
                    -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
return p.accession, p2.accession, p3.accession

Check this blog post for more info and our Bio4j Cypher cheatsheet
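The length-3 cycle pattern in the second query (p → p2 → p3 → p) can also be spelled out as a plain-Java sketch over an adjacency map. This is an illustration of the pattern only, under a few assumptions: the class `TriangleSketch` and the accessions "A" to "D" are hypothetical, the Swiss-Prot dataset filter is left out, and the three proteins are required to be distinct.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the p -> p2 -> p3 -> p pattern; not Bio4j code.
final class TriangleSketch {
    // Directed interaction graph: accession -> accessions it points to.
    private final Map<String, Set<String>> interactions = new LinkedHashMap<>();

    void addInteraction(String from, String to) {
        interactions.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
    }

    private Set<String> targets(String accession) {
        return interactions.getOrDefault(accession, Collections.<String>emptySet());
    }

    /** Finds all directed cycles p -> p2 -> p3 -> p over three distinct proteins. */
    List<List<String>> cyclesOfLengthThree() {
        List<List<String>> cycles = new ArrayList<>();
        for (String p : interactions.keySet()) {
            for (String p2 : targets(p)) {
                if (p2.equals(p)) continue;
                for (String p3 : targets(p2)) {
                    if (p3.equals(p) || p3.equals(p2)) continue;
                    // The cycle closes only if p3 points back to p.
                    if (targets(p3).contains(p)) {
                        cycles.add(Arrays.asList(p, p2, p3));
                    }
                }
            }
        }
        return cycles;
    }

    public static void main(String[] args) {
        TriangleSketch graph = new TriangleSketch();
        graph.addInteraction("A", "B");
        graph.addInteraction("B", "C");
        graph.addInteraction("C", "A");
        graph.addInteraction("C", "D"); // dead end, not part of any cycle

        // Each cycle is reported once per starting protein:
        // prints [[A, B, C], [B, C, A], [C, A, B]]
        System.out.println(graph.cyclesOfLengthThree());
    }
}
```

The three nested loops are exactly what the declarative pattern hides: the query engine expands the same paths, but the query states only the shape being looked for.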

Page 26: Bio4j: A pioneer graph based database for the integration of biological Big Data

Get protein by its accession number and return its full name

gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name
==> Aspartate aminotransferase, mitochondrial

Get proteins (accessions) associated with an InterPro motif (limited to 4 results)

gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]
==> E2GK26
==> G3PMS4
==> G3Q865
==> G3PIL8

Check our Bio4j Gremlin cheatsheet

Page 27: Bio4j: A pioneer graph based database for the integration of biological Big Data

REST Server

You can also query/navigate through Bio4j with the REST API!

The default representation is JSON, both for responses and for data sent with POST/PUT requests.

Get protein by its accession number (Q9UR66):

http://server_url:7474/db/data/index/node/protein_accession_index/protein_accession_index/Q9UR66

Get outgoing relationships for protein Q9UR66:

http://server_url:7474/db/data/node/Q9UR66_node_id/relationships/out
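As a rough sketch, the two REST URLs above can be assembled and a JSON request prepared from plain Java using only the standard library. The class `Bio4jRestSketch`, its method names, the host `localhost` and the numeric node id 42 are assumptions made for the example; only the URL paths come from the slide.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper that builds the REST URLs shown on the slide.
final class Bio4jRestSketch {
    private final String baseUrl; // e.g. "http://localhost:7474" (assumed host)

    Bio4jRestSketch(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    /** URL of the index lookup for a protein accession. */
    URI proteinByAccession(String accession) {
        return URI.create(baseUrl
                + "/db/data/index/node/protein_accession_index/protein_accession_index/"
                + accession);
    }

    /** URL listing the outgoing relationships of the node with the given id. */
    URI outgoingRelationships(long nodeId) {
        return URI.create(baseUrl + "/db/data/node/" + nodeId + "/relationships/out");
    }

    /** Prepares (but does not send) a GET request that asks for a JSON response. */
    static HttpURLConnection prepareJsonGet(URI uri) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) uri.toURL().openConnection();
        connection.setRequestMethod("GET");
        connection.setRequestProperty("Accept", "application/json");
        return connection;
    }

    public static void main(String[] args) {
        Bio4jRestSketch rest = new Bio4jRestSketch("http://localhost:7474");
        System.out.println(rest.proteinByAccession("Q9UR66"));
        // 42 stands in for the node id returned by the index lookup above.
        System.out.println(rest.outgoingRelationships(42L));
    }
}
```

In practice the second call depends on the first: the index lookup returns the node representation, and the node id (or the relationship URLs inside that representation) is what the follow-up request uses.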

Page 28: Bio4j: A pioneer graph based database for the integration of biological Big Data

Visualizations (1) REST Server Data Browser

Navigate through Bio4j data in real time !

Page 29: Bio4j: A pioneer graph based database for the integration of biological Big Data

Visualizations (2) Bio4j + Gephi

Get really cool graph visualizations using Bio4j and the Gephi visualization and exploration platform

Page 30: Bio4j: A pioneer graph based database for the integration of biological Big Data

Visualizations (3) Bio4j GO Tools

Page 31: Bio4j: A pioneer graph based database for the integration of biological Big Data

Why would I use Bio4j?

- Massive access to protein/genome/taxonomy… related information
- Network analysis
- Integration of your own DBs/resources around common information
- Development of services tailored to your needs built around Bio4j
- Visualizations

Besides many others I cannot think of myself… If you have something in mind for which Bio4j might be useful, please let us know so we can all see how it could help you meet your needs! ;)

Page 32: Bio4j: A pioneer graph based database for the integration of biological Big Data

Bio4j + Cloud (1)

Interoperability and data distribution

We use AWS (Amazon Web Services) everywhere we can around Bio4j, which gives us the following benefits:

Releases are available as public EBS snapshots, so AWS users can create fully loaded Bio4j DB volumes and attach them to their instances in just a few seconds.

CloudFormation templates:

- Basic Bio4j DB Instance

- Bio4j REST Server Instance

Page 33: Bio4j: A pioneer graph based database for the integration of biological Big Data

Bio4j + Cloud (2)

Backup and Storage using S3 (Simple Storage Service)

We use S3 both for backup (indirectly through the EBS snapshots) and storage (directly storing RefSeq sequences as independent S3 files)

What kind of benefits do we get from this?

• Easy to use

• Flexible

• Cost-Effective

• Reliable

• Scalable and high-performance

• Secure

Page 34: Bio4j: A pioneer graph based database for the integration of biological Big Data

Bio4j + Cloud (3)

Web servers and service providers in the cloud

Deploying your own web server in AWS using Bio4j as back-end is really simple.

A good example of this would be Bio4jTestServer, a continuously developed server showcasing Web Services based on Bio4j.

Page 35: Bio4j: A pioneer graph based database for the integration of biological Big Data

Upcoming features

- Relationship indexing for relationships going to and coming from supernodes

No one’s perfect, and Bio4j is no exception. Relationship fetching can become a bottleneck whenever you have to deal with supernodes (unless you index these relationships). Fortunately, this is something that Neo4j is going to fix in the next version(s).

- More resources available (Reactome…)

- Improvements in the importing process

- A more complete version of Bio4jModel

This will allow users to perform almost all sorts of queries without having to worry about the Neo4j core API.

- New tools, services and visualizations built around Bio4j
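The supernode bottleneck described in the first item above can be illustrated with a small plain-Java sketch. It does not reflect how Neo4j stores relationships; the class `SupernodeSketch`, its methods and the relationship type `HYPOTHETICAL_OTHER_TYPE` are assumptions for the example. It contrasts fetching relationships of one type by scanning everything attached to a node with fetching them through a per-type index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of one heavily connected node (a "supernode").
final class SupernodeSketch {

    static final class Rel {
        final String type;
        final String otherEnd;

        Rel(String type, String otherEnd) {
            this.type = type;
            this.otherEnd = otherEnd;
        }
    }

    private final List<Rel> all = new ArrayList<>();
    private final Map<String, List<Rel>> byType = new HashMap<>();

    void add(String type, String otherEnd) {
        Rel rel = new Rel(type, otherEnd);
        all.add(rel);
        byType.computeIfAbsent(type, k -> new ArrayList<>()).add(rel);
    }

    /** Unindexed fetch: visits every relationship attached to the node. */
    List<Rel> scan(String type) {
        List<Rel> hits = new ArrayList<>();
        for (Rel rel : all) {
            if (rel.type.equals(type)) {
                hits.add(rel);
            }
        }
        return hits;
    }

    /** Indexed fetch: goes straight to the relationships of the requested type. */
    List<Rel> lookup(String type) {
        return byType.getOrDefault(type, Collections.<Rel>emptyList());
    }

    public static void main(String[] args) {
        SupernodeSketch dataset = new SupernodeSketch();
        // A dataset node with many PROTEIN_DATASET relationships and one of another type.
        for (int i = 0; i < 100_000; i++) {
            dataset.add("PROTEIN_DATASET", "protein-" + i);
        }
        dataset.add("HYPOTHETICAL_OTHER_TYPE", "something-else");

        // Both calls find the same single relationship, but scan() walks 100,001 entries.
        System.out.println(dataset.scan("HYPOTHETICAL_OTHER_TYPE").size());
        System.out.println(dataset.lookup("HYPOTHETICAL_OTHER_TYPE").size());
    }
}
```

The cost of `scan` grows with the total number of relationships on the node, while `lookup` depends only on how many relationships of the requested type exist, which is why indexing matters most for nodes such as a whole dataset.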

Page 36: Bio4j: A pioneer graph based database for the integration of biological Big Data

Community

Bio4j has a fast-growing internet presence:

- Twitter: check @bio4j for updates
- Blog: go to http://blog.bio4j.com
- Mailing list: ask any question you may have in our list
- LinkedIn: check the Bio4j group
- GitHub issues: don’t be shy! Open a new issue if you think something’s going wrong.

Page 37: Bio4j: A pioneer graph based database for the integration of biological Big Data

and... Who’s behind all this?

Bio4j is being developed by Oh no sequences! Team and Era7 Bioinformatics members:

- Pablo Pareja Tobes: Main developer (that’s me!)
- Eduardo Pareja Tobes: Technology and architecture main advisor
- Raquel Tobes: Bioinformatics main advisor
- Marina Manrique: Bioinformatics support
- Eduardo Pareja: Scientific advisor

Page 38: Bio4j: A pioneer graph based database for the integration of biological Big Data

That’s it !

Thanks for your time ;)