graphconnect europe 2016 - building a repository of biomedical ontologies with neo4j - simon jupp

Post on 08-Jan-2017

Building a repository of biomedical ontologies with Neo4j

Simon Jupp
jupp@ebi.ac.uk, @simonjupp
Samples, Phenotypes and Ontologies Team
European Bioinformatics Institute
Cambridge, UK

Biological data heavily interlinked

[Figure: genome, proteome and metabolome data derived from tissue (mRNA array, miRNA array, antibody array, CE-MS, and an LC-MS/MS spectrum plotting intensity against m/z), linked to pathways, protein interactions and drug targets]

We need terminology standards

Dyschromatopsia

Search PubMed for “color blindness”

Search PubMed for “Dyschromatopsia”

Search PubMed for "abnormality of the eye"

The ontology of color blindness

HP:0011518 (Dichromacy)

UBERON:0000970 (Eye)

HP:0000551 (Abnormality of color vision)

HP:0007641 (Dyschromatopsia)

Relations shown: Is-a, Is-a, Disease-location

synonym: “Colorblindness”

definition: “A form of colorblindness in which only two of the three fundamental colors can be distinguished due to a lack of one of the retinal cone pigments.”

Genotype Phenotype

Sequence

Proteins

Gene products Transcript

Pathways

Cell type

BRENDA tissue / enzyme source

Development

Anatomy

Phenotype

Plasmodium life cycle

- Sequence types and features
- Genetic context

- Molecule role
- Molecular function
- Biological process
- Cellular component

- Protein covalent bond
- Protein domain
- UniProt taxonomy

- Pathway ontology
- Event (INOH pathway ontology)
- Systems biology
- Protein-protein interaction

- Arabidopsis development
- Cereal plant development
- Plant growth and developmental stage
- C. elegans development
- Drosophila development
- Human developmental anatomy, abstract version
- Human developmental anatomy, timed version

- Mosquito gross anatomy
- Mouse adult gross anatomy
- Mouse gross anatomy and development
- C. elegans gross anatomy
- Arabidopsis gross anatomy
- Cereal plant gross anatomy
- Drosophila gross anatomy
- Dictyostelium discoideum anatomy
- Fungal gross anatomy (FAO)
- Plant structure
- Maize gross anatomy
- Medaka fish anatomy and development
- Zebrafish anatomy and development

- NCI Thesaurus
- Mouse pathology
- Human disease
- Cereal plant trait
- PATO (attribute and value)
- Mammalian phenotype
- Human phenotype
- Habronattus courtship
- Loggerhead nesting
- Animal natural history and life history

- eVOC (Expressed Sequence Annotation for Humans)

Ontologies for life sciences

Ontology Lookup Service

• Ontology search engine (Solr)
• Graph database of terms (Neo4j)
• Powerful RESTful API (built with Spring Data Neo4j and Spring Data REST)
• Open source project
• Generic infrastructure (can load any ontology represented in OWL)

https://github.com/EBISPOT/OLS

Repository of over 140 biomedical ontologies (4.5 million terms, 11 million relations)

http://www.ebi.ac.uk/ols/beta

Web Ontology Language – (OWL)

• W3C standard vocabulary for describing ontologies
• Powerful knowledge representation

However
• OWL ontologies aren’t graphs, but…
  … they can be represented as an RDF graph
  … people want to use them as graphs
• Plenty of RDF databases around
• But incomplete w.r.t. OWL semantics
• SPARQL is an acquired taste

OWL to Neo4j schema
• Each node is labelled with one of {Class, Property, Individuals} AND the ontology name
• All OWL annotations become properties (labels, id, descriptions etc.)
• Superclasses (named classes and simple existential restrictions) become edges in Neo4j
• E.g. in OWL: “heart” subClassOf (part-of some “cardiovascular system”)
  In Neo4j: “heart” part-of “cardiovascular system”
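The superclass-to-edge rule can be sketched in a few lines of Python. This is only an illustration, not the OLS loader: the `SubClassAxiom` and `Edge` types are hypothetical, and the relationship types `SUBCLASSOF` and `Related` (with a `label` property) are taken from the Cypher queries on the later slides.

```python
from typing import NamedTuple, Optional


class SubClassAxiom(NamedTuple):
    """Hypothetical, simplified view of an OWL subclass axiom."""
    sub: str                    # the subclass, e.g. "heart"
    sup: str                    # the named class on the right-hand side
    prop: Optional[str] = None  # None: named superclass; else "prop some sup"


class Edge(NamedTuple):
    """Hypothetical description of the Neo4j relationship to create."""
    source: str
    rel_type: str
    label: Optional[str]
    target: str


def axiom_to_edge(axiom: SubClassAxiom) -> Edge:
    # A named superclass becomes a plain SUBCLASSOF edge.
    if axiom.prop is None:
        return Edge(axiom.sub, "SUBCLASSOF", None, axiom.sup)
    # A simple existential (prop some sup) is flattened to a direct edge
    # that carries the property name as its label.
    return Edge(axiom.sub, "Related", axiom.prop, axiom.sup)


heart = SubClassAxiom("heart", "cardiovascular system", "part_of")
print(axiom_to_edge(heart))
```

Flattening the existential this way gives up some OWL semantics in exchange for a graph that can be traversed directly.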

What are the subtypes of “colorblindness”?
MATCH (n:Class {obo_id: 'HP:0007641'})<-[r:SUBCLASSOF*]-(types:Class)
RETURN n, r, types

What parts of the eye are related to diseases?
MATCH (eye:Class {obo_id: 'UBERON:0000970'})<-[r:Related {label: "part_of"}]-(eye_part:Class)<-[r1:Related {label: "has_disease_location"}]-(disease:Class)
RETURN eye, r, r1, eye_part, disease

Finding common ancestors via shortest path
MATCH p = shortestPath( (a:Class)-[r:SUBCLASSOF*]-(b:Class) )
RETURN nodes(p)

What is the common taxonomic superfamily of Gibbons and Chimpanzees? (or Hylobatidae and Pan troglodytes!)

https://commons.wikimedia.org/wiki/File:Hylobates_lar_pair_of_white_and_black_01.jpg
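In a tree-shaped hierarchy, the shortest undirected path between two classes passes through their nearest common ancestor, which is why the query above answers the question. The same idea can be sketched in plain Python by walking parent links instead of calling `shortestPath`. The `PARENT` table below is a deliberately simplified, hand-written taxonomy used only for illustration (intermediate ranks are omitted); it is not data from OLS.

```python
# Hypothetical, simplified child -> parent table (intermediate ranks omitted).
PARENT = {
    "Pan troglodytes": "Pan",
    "Pan": "Hominidae",
    "Hominidae": "Hominoidea",
    "Hylobatidae": "Hominoidea",
    "Hominoidea": "Catarrhini",
}


def ancestors(node, parent=PARENT):
    """Return the node followed by each of its ancestors, nearest first."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def nearest_common_ancestor(a, b, parent=PARENT):
    """Return the first ancestor of b that is also an ancestor of a."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    return None


print(nearest_common_ancestor("Hylobatidae", "Pan troglodytes"))
```

This sketch assumes each class has a single parent; ontologies with multiple inheritance need a real graph search, which is what the Cypher version provides.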

OLS visualisations
• Partonomy for heart from the UBERON anatomy ontology

MATCH path = (n:Class)-[r:SUBCLASSOF|PartOf*]->(ancestor)

REST API (Spring Data REST + Neo4j)

• Crawlable API - Hypermedia driven (HAL)

• Get ontology and term metadata
  • /ontologies
  • /ontologies/{name}
  • /ontologies/{name}/terms
  • /ontologies/{name}/terms/{termid}

• Get related terms and navigate ontology structure
  • /ontologies/{name}/terms/{termid}/parent
  • /ontologies/{name}/terms/{termid}/children
  • /ontologies/{name}/terms/{termid}/descendants
  • /ontologies/{name}/terms/{termid}/ancestors
  • /ontologies/{name}/terms/{termid}/{relation} e.g. part_of

http://www.ebi.ac.uk/ols/beta/api
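The endpoint templates above can be turned into concrete URLs with a small helper. This is a sketch under assumptions: `ols_url` is a hypothetical function, the ontology name `hp` is a guess at how the Human Phenotype Ontology is registered, and the slides do not say what form `{termid}` takes, so the helper simply percent-encodes whatever it is given.

```python
from urllib.parse import quote

# Base URL as given on the slide.
BASE = "http://www.ebi.ac.uk/ols/beta/api"


def ols_url(*segments, base=BASE):
    """Join path segments onto the API base, percent-encoding each one.

    Encoding every segment keeps characters such as ':' or '/' inside a
    term identifier from being read as part of the path.
    """
    return "/".join([base, *(quote(segment, safe="") for segment in segments)])


# Assumed ontology name and term id form, for illustration only.
print(ols_url("ontologies"))
print(ols_url("ontologies", "hp", "terms"))
print(ols_url("ontologies", "hp", "terms", "HP:0007641", "children"))
```

Because the API is hypermedia driven, a client would normally follow the links returned in each response rather than build URLs by hand; a helper like this is mainly useful for the first request.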

Building the index
• We check all 140 external ontology files nightly for changes
• We have a master build index
• When an ontology updates, we remove the old version and reload it using the Neo4j BatchInserter (potentially fragile)
• We push the master index to various production data centers
• Provides load balancing

Nightly crawl of all >140 registered ontologies
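The slides do not say how a changed ontology file is detected. One common approach, shown here purely as an assumption, is to compare a checksum of each downloaded file with the one recorded at the previous build and reload only the ontologies whose checksum differs. The function names and the in-memory `known` dictionary are hypothetical.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return a SHA-256 hex digest of an ontology file's bytes."""
    return hashlib.sha256(content).hexdigest()


def ontologies_to_reload(fetched: dict, known: dict) -> list:
    """Return names whose content changed, and record the new digests.

    fetched maps ontology name -> file bytes from tonight's crawl.
    known maps ontology name -> digest from the last build (updated in place).
    """
    changed = []
    for name, content in sorted(fetched.items()):
        new_digest = digest(content)
        if known.get(name) != new_digest:
            changed.append(name)
            known[name] = new_digest
    return changed


known = {}
print(ontologies_to_reload({"hp": b"v1", "efo": b"v1"}, known))  # first crawl
print(ontologies_to_reload({"hp": b"v2", "efo": b"v1"}, known))  # hp changed
```

Only the names returned would then go through the remove-and-reload step described above.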

Conclusion
• We’ve built a scalable repository of biomedical ontologies with Neo4j
• Generic OWL indexer (simplified OWL)
• Powerful REST API built with Spring
• Acts as standalone OWL ontology server
• Now being deployed externally
• Beta: ~2000 users / 10 million requests per month
• Would like to discuss
  • Batch Inserter
  • Migrating to Spring Data Neo4j 4

Acknowledgements
• Samples, Phenotypes and Ontologies Team - Tony Burdett, James Malone, Dani Welter, Catherine Leroy, Sira Sarntivijai, Ilinca Tudose, Helen Parkinson
• Matt Pearce – Flax (BioSOLR project)
• Michal Bachman and GraphAware team (Neo4j training)
• Funding
  • European Molecular Biology Laboratory (EMBL)
  • European Union projects: DIACHRON, BioMedBridges and CORBEL
