neo4j and bioinformatics

Neo4j and Bioinformatics www.ohnosequences.com www.bio4j.com

Upload: pablo-pareja-tobes

Post on 04-Jul-2015

DESCRIPTION

Slide deck from the webinar "Neo4j and bioinformatics"

TRANSCRIPT

Page 1: Neo4j and bioinformatics

Neo4j and Bioinformatics

Page 2: Neo4j and bioinformatics

But who’s this guy talking here?

I am currently working as a bioinformatics consultant/developer/researcher at Oh no sequences!

Oh no what!?

We are the R&D group at Era7 Bioinformatics. We like bioinformatics, cloud computing, NGS, category theory, bacterial genomics… well, lots of things.

What about Era7 Bioinformatics?

Era7 Bioinformatics is a bioinformatics company specializing in sequence analysis, knowledge management and sequencing data interpretation. Our expertise centers on biological sequence analysis, particularly the management and analysis of Next Generation Sequencing data.

Page 3: Neo4j and bioinformatics

In bioinformatics we have highly interconnected, overlapping knowledge spread across different DBs.

Page 4: Neo4j and bioinformatics

However, in most cases all this data is modeled in relational databases, and sometimes just as plain CSV files.

As the amount and diversity of data grows, domain models become crazily complicated!

Page 5: Neo4j and bioinformatics

With a relational paradigm, the double implication Entity ⇔ Table does not go both ways.

You get ‘auxiliary’ tables that have no relationship with the small piece of reality you are modeling.

You need ‘artificial’ IDs just to connect entities (and these get mixed with IDs that somehow live in reality).

Entity-relationship models are cool but in the end you always have to deal with ‘raw’ tables plus SQL.

Integrating or incorporating new knowledge into existing databases is hard, and sometimes not even possible, without changing the domain model.

Page 7: Neo4j and bioinformatics

NoSQL data models

Page 8: Neo4j and bioinformatics

Neo4j is a high-performance NoSQL graph database with all the features of a mature and robust database.

The programmer works with an object-oriented, flexible network structure rather than with strict and static tables, while keeping all the benefits of a fully transactional, enterprise-strength database.

For many applications, Neo4j offers performance improvements on the order of 1000x or more compared to relational DBs.

Page 9: Neo4j and bioinformatics

What’s Bio4j?

Bio4j is a graph-based bioinformatics DB that includes most of the data available in:

- Uniprot KB (Swiss-Prot + Trembl)
- Gene Ontology (GO)
- UniRef (50, 90, 100)
- NCBI Taxonomy
- RefSeq
- Enzyme DB

Page 10: Neo4j and bioinformatics

What’s Bio4j?

It provides a completely new and powerful framework for querying and managing protein-related information.

Since it relies on a high-performance graph engine, data is stored in a way that semantically represents its own structure.

Page 11: Neo4j and bioinformatics

What’s Bio4j?

Bio4j uses Neo4j technology, a "high-performance graph engine with all the features of a mature and robust database".

Because it is built on Neo4j and comes with its own API, Bio4j is also very scalable, allowing anyone to easily incorporate their own data and make the most of it.

Page 12: Neo4j and bioinformatics

What’s Bio4j?

Everything in Bio4j is open source, released under AGPLv3!

Page 13: Neo4j and bioinformatics

Bio4j in numbers

The current version (0.7) includes:

- Relationships: 530,642,683
- Nodes: 76,071,411
- Relationship types: 139
- Node types: 38

Page 14: Neo4j and bioinformatics

Let’s dig a bit into Bio4j’s structure…

Data sources and their relationships:

Page 16: Neo4j and bioinformatics

The Graph DB model: representation

Core abstractions:

- Nodes
- Relationships between nodes
- Properties on both

Page 17: Neo4j and bioinformatics

How are things modeled?

Couldn’t be simpler!

- Entities → Nodes
- Associations / Relationships → Edges

Page 18: Neo4j and bioinformatics

Some examples of nodes would be:

- Protein
- GO term
- Genome Element

and relationships:

- Protein -[PROTEIN_GO_ANNOTATION]-> GO term
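
To make the model above concrete, here is a minimal in-memory sketch of the three core abstractions (nodes, typed relationships, properties on both). It is only an illustration: the class and method names (`PropertyGraphSketch`, `connect`, `neighbours`) are hypothetical and are not part of the Bio4j or Neo4j APIs, and the GO term id is a placeholder value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory property graph; NOT the Bio4j or Neo4j API.
final class PropertyGraphSketch {

    // A node has a type (e.g. "Protein") and free-form properties.
    static final class Node {
        final String type;
        final Map<String, Object> properties = new HashMap<>();
        final List<Relationship> outgoing = new ArrayList<>();

        Node(String type) {
            this.type = type;
        }
    }

    // A relationship is a typed, directed edge that can carry properties too.
    static final class Relationship {
        final String type;
        final Node start;
        final Node end;
        final Map<String, Object> properties = new HashMap<>();

        Relationship(String type, Node start, Node end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Create a directed relationship and register it on its start node.
    static Relationship connect(Node start, String type, Node end) {
        Relationship relationship = new Relationship(type, start, end);
        start.outgoing.add(relationship);
        return relationship;
    }

    // Follow outgoing relationships of one type and collect the end nodes.
    static List<Node> neighbours(Node start, String relationshipType) {
        List<Node> result = new ArrayList<>();
        for (Relationship relationship : start.outgoing) {
            if (relationship.type.equals(relationshipType)) {
                result.add(relationship.end);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Node protein = new Node("Protein");
        protein.properties.put("accession", "P12345");

        Node goTerm = new Node("GO term");
        goTerm.properties.put("id", "GO:0000000"); // placeholder id, not real data

        connect(protein, "PROTEIN_GO_ANNOTATION", goTerm);

        for (Node node : neighbours(protein, "PROTEIN_GO_ANNOTATION")) {
            System.out.println(node.type + " " + node.properties.get("id"));
        }
    }
}
```

In this sketch, navigating from a protein to its GO terms is a walk over the node's own relationship list, with no join table or artificial foreign key in between.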

Page 19: Neo4j and bioinformatics

We have developed a tool, Bio4jExplorer, intended to serve both as a reference manual and as a first point of contact with the Bio4j domain model.

Bio4jExplorer allows you to:

• Navigate through all nodes and relationships
• Access the javadocs of any node or relationship
• Graphically explore the neighborhood of a node/relationship
• Look up the indexes that may serve as an entry point for a node
• Check incoming/outgoing relationships of a specific node
• Check start/end nodes of a specific relationship

Page 20: Neo4j and bioinformatics

Entry points and indexing

There are two kinds of entry points for the graph:

Auxiliary relationships going from the reference node, e.g.

- MAIN_DATASET: leads to both main datasets: Swiss-Prot and Trembl
- CELLULAR_COMPONENT: leads to the root of GO cellular component sub-ontology

Node indexing

There are two types of node indexes:

- Exact: Only exact values are considered hits
- Fulltext: Regular expressions can be used

Page 21: Neo4j and bioinformatics

Retrieving protein info (Bio4jModel Java API)

// Create the manager and the node retriever
Bio4jManager manager = new Bio4jManager("/mybio4jdb");
NodeRetriever nR = new NodeRetriever(manager);

// Look up a protein by its accession
ProteinNode protein = nR.getProteinNodeByAccession("P12345");

// Getting more related info...
List<InterproNode> interpros = protein.getInterpro();
OrganismNode organism = protein.getOrganism();
List<GoTermNode> goAnnotations = protein.getGOAnnotations();
List<ArticleNode> articles = protein.getArticleCitations();

// Print the PubMed ID of every article citing the protein
for (ArticleNode article : articles) {
    System.out.println(article.getPubmedId());
}

// Don't forget to close the manager
manager.shutDown();

Page 22: Neo4j and bioinformatics

Querying Bio4j with Cypher

Getting a keyword by its ID:

START k=node:keyword_id_index(keyword_id_index = "KW-0181")
return k.name, k.id

Finding circuits/simple cycles of length 3 where at least one protein is from the Swiss-Prot dataset:

START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
MATCH d <-[r:PROTEIN_DATASET]- p,
      circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
return p.accession, p2.accession, p3.accession

Check this blog post for more info and our Bio4j Cypher cheat sheet
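
For readers unfamiliar with pattern matching, the cycle query can be read as three nested neighbour lookups. The sketch below spells that out in plain Java over an adjacency map. It is an illustration only: `InteractionCycles` and the toy protein names are hypothetical, it does not touch Bio4j or Neo4j, and unlike Cypher it assumes the three proteins must be distinct.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: directed 3-cycles in an interaction map (not Bio4j code).
final class InteractionCycles {

    // Returns every cycle p -> p2 -> p3 -> p whose first protein is in startSet
    // (the role played by the Swiss-Prot dataset in the Cypher query).
    static List<List<String>> cyclesOfLengthThree(
            Map<String, Set<String>> interactions, Set<String> startSet) {
        List<List<String>> cycles = new ArrayList<>();
        for (String p : startSet) {
            for (String p2 : interactions.getOrDefault(p, Set.of())) {
                if (p2.equals(p)) {
                    continue; // skip self-interactions
                }
                for (String p3 : interactions.getOrDefault(p2, Set.of())) {
                    if (p3.equals(p) || p3.equals(p2)) {
                        continue; // the three proteins must be distinct
                    }
                    // Close the cycle: p3 must interact back with p.
                    if (interactions.getOrDefault(p3, Set.of()).contains(p)) {
                        cycles.add(List.of(p, p2, p3));
                    }
                }
            }
        }
        return cycles;
    }

    public static void main(String[] args) {
        // Toy data: A -> B -> C -> A, plus a dangling edge C -> D.
        Map<String, Set<String>> interactions = new HashMap<>();
        interactions.put("A", Set.of("B"));
        interactions.put("B", Set.of("C"));
        interactions.put("C", Set.of("A", "D"));

        System.out.println(cyclesOfLengthThree(interactions, Set.of("A"))); // prints [[A, B, C]]
    }
}
```

The appeal of the Cypher version is that the graph engine performs these lookups by following stored relationships, so the query states the shape of the pattern rather than the loops.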

Page 23: Neo4j and bioinformatics

Querying Bio4j with Gremlin, a graph traversal language

Get protein by its accession number and return its full name:

gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name
==> Aspartate aminotransferase, mitochondrial

Get proteins (accessions) associated with an InterPro motif (limited to 4 results):

gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]
==> E2GK26
==> G3PMS4
==> G3Q865
==> G3PIL8

Check our Bio4j Gremlin cheat sheet

Page 24: Neo4j and bioinformatics

REST Server

You can also query/navigate through Bio4j with the Neo4j REST API! The default representation is JSON, both for responses and for data sent with POST/PUT requests.

Get protein by its accession number (Q9UR66):

http://server_url:7474/db/data/index/node/protein_accession_index/protein_accession_index/Q9UR66

Get outgoing relationships for protein Q9UR66:

http://server_url:7474/db/data/node/Q9UR66_node_id/relationships/out
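
As a rough sketch of how a client might build these two requests, the snippet below assembles the URLs with the JDK's `java.net.http` classes (Java 11+). The class name `Bio4jRestSketch`, the base URL `http://localhost:7474` and the node id `42` are assumptions made for illustration; the URL paths simply follow the patterns shown above. Nothing is sent over the network here.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper that only builds requests; it is not part of Bio4j.
final class Bio4jRestSketch {

    // Index lookup: .../index/node/<index name>/<key>/<value>
    static URI proteinByAccession(String baseUrl, String accession) {
        return URI.create(baseUrl
                + "/db/data/index/node/protein_accession_index/protein_accession_index/"
                + accession);
    }

    // Outgoing relationships of a node, addressed by its internal node id.
    static URI outgoingRelationships(String baseUrl, long nodeId) {
        return URI.create(baseUrl + "/db/data/node/" + nodeId + "/relationships/out");
    }

    // JSON is the default representation, but asking for it explicitly does no harm.
    static HttpRequest jsonGet(URI uri) {
        return HttpRequest.newBuilder(uri)
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        String baseUrl = "http://localhost:7474"; // assumed server location

        HttpRequest lookup = jsonGet(proteinByAccession(baseUrl, "Q9UR66"));
        HttpRequest relationships = jsonGet(outgoingRelationships(baseUrl, 42L)); // 42 is a made-up node id

        System.out.println(lookup.method() + " " + lookup.uri());
        System.out.println(relationships.method() + " " + relationships.uri());
    }
}
```

To actually run a request against a live server, it could be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. In practice the node id would come from the response to the index lookup.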

Page 25: Neo4j and bioinformatics

Visualizations (1) REST Server Data Browser

Navigate through Bio4j data in real time!

Page 26: Neo4j and bioinformatics

Visualizations (2) Bio4j GO Tools

Page 27: Neo4j and bioinformatics

Visualizations (3) Bio4j + Gephi

Get really cool graph visualizations using Bio4j and the Gephi visualization and exploration platform.

Page 28: Neo4j and bioinformatics

Bio4j + Cloud

We use AWS (Amazon Web Services) everywhere we can around Bio4j, giving us the following benefits:

Interoperability and data distribution

Releases are available as public EBS snapshots, so AWS users can create volumes holding a 100% ready Bio4j DB and attach them to their instances in just a few seconds.

CloudFormation templates:

- Basic Bio4j DB Instance
- Bio4j REST Server Instance

Backup and Storage using S3 (Simple Storage Service)

We use S3 both for backup (indirectly, through the EBS snapshots) and for storage (directly, storing RefSeq sequences as independent S3 files).

Page 29: Neo4j and bioinformatics

Why would I use Bio4j?

- Massive access to protein/genome/taxonomy… related information
- Network analysis
- Integration of your own DBs/resources around common information
- Development of services tailored to your needs built around Bio4j
- Visualizations

Besides many others I cannot think of myself… If you have something in mind for which Bio4j might be useful, please let us know so we can all see how it could help you meet your needs! ;)

Page 30: Neo4j and bioinformatics

Community

Bio4j has a fast growing internet presence:

- Twitter: check @bio4j for updates

- Blog: go to http://blog.bio4j.com

- Mailing list: ask any question you may have on our list.

- LinkedIn: check the Bio4j group

- GitHub issues: don’t be shy! Open a new issue if you think something’s going wrong.

Page 31: Neo4j and bioinformatics

OK, but why start all this? Were you that bored…?!

It all started somehow around our need for massive access to protein GO (Gene Ontology) annotations.

At that point I had to develop my own MySQL DB based on the official GO SQL database, and problems started from the very beginning:

- I went crazy ‘deciphering’ how to extract Uniprot protein annotations from the official GO table schema
- Uniprot and GO official protein annotations were not always consistent
- Populating my own DB took a really long time because of all the joins and subqueries needed to get and store the protein annotations

Soon enough we also needed massive access to basic protein information.

Page 32: Neo4j and bioinformatics

These processes had to be automated for BG7, our bacterial genome annotation system (specifically designed for NGS data).

The Uniprot web services available were too limited:

- Slow
- Limited number of queries
- Too little information available

So I downloaded the whole Uniprot DB in XML format (Swiss-Prot + Trembl) and started to have some fun with it!

Page 33: Neo4j and bioinformatics

BG7 algorithm

1. Selection of the specific reference protein set
2. Prediction of possible genes by BLAST similarity
3. Gene definition: merging compatible similarity regions, detecting start and stop
4. Resolving overlapping predicted genes
5. RNA prediction by BLAST similarity
6. Final annotation and complete deliverables. Quality control.

www.era7bioinformatics.com

Page 34: Neo4j and bioinformatics

We got used to having massive direct access to all this protein-related information…

So why not add other resources that we needed quite often in most projects, and which were becoming a sort of bottleneck compared to those already included in Bio4j?

Then we incorporated:

- Isoform sequences

- Protein interactions and features

- Uniref 50, 90, and 100

- RefSeq

- NCBI Taxonomy

- Enzyme Expasy DB

Page 35: Neo4j and bioinformatics

Bio4j + MG7 + 48 BLAST XML files (~1 GB each)

Some numbers:

• 157,639,502 nodes
• 742,615,705 relationships
• 632,832,045 properties
• 148 relationship types
• 44 node types

And it works just fine!

Page 37: Neo4j and bioinformatics

What’s MG7?

MG7 lets you choose different parameters to set the thresholds for filtering the BLAST hits:

i. E-value
ii. Identity and query coverage

It can export the results of the analysis to different data formats, such as:

• XML
• CSV
• GEXF (Graph Exchange XML Format)

It also provides the user with heat maps and graph visualizations, and includes a user-friendly interface, MG7Viewer, that gives access to the alignment responsible for each functional or taxonomic read assignment and displays the frequencies in the taxonomic tree.
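
The threshold-based filtering of BLAST hits can be sketched as a pair of composable predicates. This is only a guess at the general idea, not MG7's code: the `Hit` class, its field names and the threshold values are all assumptions, and since the slide does not say whether the two criteria are alternatives or are applied together, the sketch supports both by letting the caller combine them.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch of threshold-based BLAST hit filtering; not MG7's actual code.
final class BlastHitFilterSketch {

    // Minimal hit representation; the field names are assumptions.
    static final class Hit {
        final String readId;
        final double eValue;
        final double identityPercent;
        final double queryCoveragePercent;

        Hit(String readId, double eValue, double identityPercent, double queryCoveragePercent) {
            this.readId = readId;
            this.eValue = eValue;
            this.identityPercent = identityPercent;
            this.queryCoveragePercent = queryCoveragePercent;
        }
    }

    // Criterion (i): keep hits whose E-value is at or below the threshold.
    static Predicate<Hit> maxEValue(double threshold) {
        return hit -> hit.eValue <= threshold;
    }

    // Criterion (ii): keep hits with enough identity and enough query coverage.
    static Predicate<Hit> minIdentityAndCoverage(double minIdentity, double minCoverage) {
        return hit -> hit.identityPercent >= minIdentity
                && hit.queryCoveragePercent >= minCoverage;
    }

    static List<Hit> filter(List<Hit> hits, Predicate<Hit> criterion) {
        return hits.stream().filter(criterion).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up hits, for illustration only.
        List<Hit> hits = List.of(
                new Hit("read1", 1e-30, 98.0, 95.0),
                new Hit("read2", 1e-3, 99.0, 40.0),
                new Hit("read3", 1e-12, 70.0, 90.0));

        // Using the E-value criterion alone keeps read1 and read3.
        for (Hit hit : filter(hits, maxEValue(1e-10))) {
            System.out.println("E-value filter keeps " + hit.readId);
        }

        // Combining both criteria keeps only read1.
        Predicate<Hit> both = maxEValue(1e-10).and(minIdentityAndCoverage(90.0, 80.0));
        for (Hit hit : filter(hits, both)) {
            System.out.println("Combined filter keeps " + hit.readId);
        }
    }
}
```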

Page 38: Neo4j and bioinformatics

Heat-map Viz

Page 39: Neo4j and bioinformatics

Graph Viz

Page 40: Neo4j and bioinformatics

MG7 Viewer

Page 41: Neo4j and bioinformatics

Mining Bio4j data

Finding topological patterns in Protein-Protein Interaction networks

Page 42: Neo4j and bioinformatics

Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j
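
One simple way to compute a lowest common ancestor is to walk from the first node up to the root, then discard every ancestor that the other nodes do not share. The sketch below does this over a plain child-to-parent map. It is an assumption-laden illustration, not the approach used with Bio4j: `TaxonomyLca` is a hypothetical name, the tree is a heavily simplified toy taxonomy, and the parent map is assumed to be a proper tree (no cycles).

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: lowest common ancestor over a child -> parent map.
final class TaxonomyLca {

    // The node itself followed by its ancestors, ending at the root.
    static List<String> pathToRoot(Map<String, String> parentOf, String node) {
        List<String> path = new ArrayList<>();
        for (String current = node; current != null; current = parentOf.get(current)) {
            path.add(current);
        }
        return path;
    }

    // A node counts as its own ancestor, so lca(x, descendant of x) is x.
    static Optional<String> lowestCommonAncestor(
            Map<String, String> parentOf, Collection<String> nodes) {
        Iterator<String> iterator = nodes.iterator();
        if (!iterator.hasNext()) {
            return Optional.empty();
        }
        // Candidates ordered from lowest (the first node) to highest (the root).
        List<String> common = pathToRoot(parentOf, iterator.next());
        while (iterator.hasNext()) {
            common.retainAll(new HashSet<>(pathToRoot(parentOf, iterator.next())));
        }
        // The first surviving candidate is the lowest shared ancestor.
        return common.stream().findFirst();
    }

    public static void main(String[] args) {
        // Toy tree with most ranks left out; not actual NCBI taxonomy data.
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("Bacteria", "root");
        parentOf.put("Enterobacteriaceae", "Bacteria");
        parentOf.put("Escherichia", "Enterobacteriaceae");
        parentOf.put("Salmonella", "Enterobacteriaceae");
        parentOf.put("Bacillus", "Bacteria");

        System.out.println(lowestCommonAncestor(parentOf,
                List.of("Escherichia", "Salmonella")).orElse("none")); // prints Enterobacteriaceae
        System.out.println(lowestCommonAncestor(parentOf,
                List.of("Escherichia", "Salmonella", "Bacillus")).orElse("none")); // prints Bacteria
    }
}
```

In a graph database the parent lookups would be traversals along the taxonomy's parent relationships rather than map reads, but the logic stays the same.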

Page 43: Neo4j and bioinformatics

Future directions (1)

Gene flux tool

A new tool for bacterial comparative genomics: massive tracing of vertical and horizontal gene flux between genome elements, based on the analysis of the similarity between their proteins. It would analyze similarity relationships, which could be fixed to a 90% or 100% similarity threshold.

Pathways tool

Data from Metacyc is going to be included in Bio4j. This data would make it possible to dissect the metabolic pathways in which a genome element, organism or community (metagenomic samples) is involved. Gephi could be used to represent the metabolic pathways for each of them.

Page 44: Neo4j and bioinformatics

Future directions (2)

Detector of common annotations in gene clusters

Many biological problems involve searching for common annotations in a set of genes. Some examples:

- a set of overexpressed genes
- a set of proteins with local structural similarities (WIP)
- a set of genes bearing SNPs in cancer samples
- a set of exclusive genes in a pathogenic bacterial strain

Detecting common annotations can help in inferring important functional connections.

Page 45: Neo4j and bioinformatics

That’s it!

Thanks for your time ;)