
Data Integration and Ontologies for Biomedical Applications

Amarnath Gupta, Vadim Astakhov, Christopher Condit

A Word about Data in Science
Excerpts from a Report by NSF’s Office of Cyberinfrastructure

Data. … data are any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data.

Metadata. Metadata are a subset of data, and are data about data. Metadata summarize data content, context, structure, inter-relationships, and provenance (information on history and origins). They add relevance and purpose to data, and enable the identification of similar data in different data collections.

Ontology. An ontology is the systematic description of a given phenomenon. It often includes a controlled vocabulary and relationships, captures nuances in meaning, and enables knowledge sharing and reuse.

What is data integration?
For applications where there are a number of data sources (recall previous slide) that are
• Geographically distributed
• Autonomous
• Hosted on different platforms, possibly on systems with different query capabilities (e.g., different DBMSs, files, spreadsheets)
• Perhaps even built on different data models
• Defined with different schemas
• BUT about one common, general theme

One may want to construct
A general-purpose information system such that
• All these data sources can be co-accessed as if they belong to a single data source
• It can produce “combined information objects” on demand for ad hoc queries, to facilitate problem-specific analyses performed through other software products (workflows, atlases, statistical packages …)

Data integration refers to a body of techniques to produce such an information system

Source: Mark Ellisman

Data Integration vis-à-vis Data Grid
A different aspect of data management

Storage Resource Transparency

Storage Location Transparency

E:\srbVault\image.jpg
/users/srbVault/image.jpg
Select … from srb.mdas.td where...

Data Identifier Transparency

image_0.jpg … image_100.jpg

Data Replica Transparency

image.sql, image.cgi, image.wsdl

Virtual Data Transparency

Semantic Data Organization (with behavior)
patientRecordsCollection
myActiveNeuroCollection

Inter-organizational Information

Storage Management

Courtesy: Reagan Moore and Arun Jagatheesan

Need for Integration: The BIRN Case

BIRN involves a consortium of more than 30 universities and 40 research groups

Three test bed projects and associated collaborative projects centered around brain imaging of human neurological diseases and associated animal models.

Research questions require integration over multiple heterogeneous sources

The Need for Integration (Function BIRN)

Are chronic, but not first-onset patients, associated with superior temporal gyrus dysfunction?

[Figure: a Mediator builds an Integrated View over wrapped sources. Source labels: fMRI, Receptor Density, ERP, Structure, Clinical, Web (PubMed, Expasy), WebQTL. Inset chart by Treatment Group: ARIP - 20MG, ARIP - 30MG, RISP - 06MG, PLACEBO.]

Information Integration Problem (MouseBIRN)
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?

[Figure: Data Federation. Wrappers 1 to 4 connect four sources to an Information Bus: protein localization (NCMIR), neurotransmission (SENSELAB), sequence info (CaPROT), morphometry (SYNAPSE).]

The Need for Integration (FBIRN)


Science questions require integration of data across:

• multiple scales: from 3D volumetric data of nerve components to image feature data of protein distribution in the brain

• different categories of data: genomics data with neuro-imaging and molecular distribution data

• multiple universities/labs: to obtain larger population sample, so that the findings are statistically more robust

• multiple integrated schemas: defined over different combinations of these sources, each designed for a different study

(Some) Dimensions of Information Integration in Cyberinfrastructure Projects

Source Information Model

Integration Engine’s Information Model

The integration model (or the registration model)

Specification of semantic correspondences across sources

The 3-party power play among “global schema”, “local schema”, “ontology”

Query paradigms over integrated data

The mechanics of query planning and query execution

About Semantic Correspondences
The general problem

For any data integration across multiple sources there needs to be a way to
• Specify how two objects from different data sources may correspond
• Specify how the “joining” of these two objects would create a composite data object

What’s the big deal?
• Identical objects versus equivalent objects
• Complete objects versus partial objects
• Multi-scale representations of the same object
• Handling definitional differences
• Taking into account natural variability
• Contextual correspondence

Are these always specifiable through ontological standards like OWL?
Do we need to have “correspondence checking” services?

About the 3-party Power Play
While we want to create a single (cyber-) infrastructure with a data integration component, different applications have different integration scenarios

• Is there a single global schema?
• Do new applications (and hence global schemas) get added all the time over existing sources and ontologies?
• Are the sources fixed? Do new sources get added all the time? Do sources come and go?
• Are sources added dynamically as “data sets” that users want to integrate “on the fly”?
• Do local schemata come with their own ontologies?
• Is there a global ontology that all local ontologies must map to?
• How does the global schema (if one exists) relate to the global and local ontologies?
• Do new (or modified) ontologies get added all the time?
• Do the local schemata evolve all the time?

Is there a general way to manage this?
Do we need to architect any cyberinfrastructure components differently?

Source Information Models

BIRN

Data Sources
• Relational DBMS
• Standard data types
• Semantic data types (attribute-domain references to ontologies)
• Some data and computation sources expose a set of functions
• Key constraints

Ontology Sources
Simplifying assumptions
• Ontologies can be approximated by edge-labeled directed graphs stored in relational systems
• Graph traversal functions can be mimicked as database functions
BONFIRE
• Glue ontology for simple inter-ontology mappings and extensions

Image and Spatial Data Sources
• Discussed later
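The simplifying assumptions for ontology sources can be illustrated with a small sketch. It stores an ontology as an edge-labeled directed graph in one relational table and mimics a graph traversal function with a recursive SQL query. The table name `edge`, its columns, the sample terms, and the function `ancestors` are all hypothetical; this is not the BIRN implementation.

```python
import sqlite3

# Hypothetical edge table: one row per labeled, directed ontology edge.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src TEXT, label TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?)",
    [
        ("Purkinje cell", "is-a", "neuron"),
        ("neuron", "is-a", "cell"),
        ("dendrite", "part-of", "neuron"),
    ],
)


def ancestors(term, label):
    """Return every term reachable from `term` along edges carrying `label`."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT dst FROM edge WHERE src = ? AND label = ?
            UNION
            SELECT e.dst FROM edge e JOIN reach r ON e.src = r.node
            WHERE e.label = ?
        )
        SELECT node FROM reach ORDER BY node
        """,
        (term, label, label),
    ).fetchall()
    return [row[0] for row in rows]


print(ancestors("Purkinje cell", "is-a"))
```

Because the traversal is an ordinary query, a mediator could call it like any other database function exposed by a source.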

Source Information Models

PAKT (marine biogeography)

Data Sources
• Relational
• Spatial (vectors) supported by GIS and Spatial DBMS
• Spatial (raster – continuously partitionable arrays)
  ArcGIS (map algebra), nested, non-aligned, multiple resolution
• Spatially-indexed time series
• Function-exposing sources (WSDL)
  Parameter and result data types are interpretable or BLOBs

Ontology Sources
• Any ontology specified in a subset of OWL
• Any DAG-structured data source

Integration Engine’s Information Model
BIRN

Sources from the mediator’s view
• Base relations may have binding patterns
• Distinction between data and metadata is not strictly observed
  SRB metadata catalog is treated as a relational source with some special functions
• Files are accessed by reference to data-grid URIs (SRB ids)

Integration Model
• Essentially Global-as-view (GAV) mediation
• “Semantic” aspect of the mediation executed through opaque functions over ontology sources
• Key constraints not used during standard query processing but are used for keyword queries
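As a rough illustration of GAV mediation, the sketch below defines one global relation as a union of conjunctive queries over source relations and unfolds a query over it. The relation names (`scan`, `src1.patient`, and so on) and the `unfold` function are hypothetical, and the sketch assumes view bodies use only head variables, so no variable renaming is needed.

```python
from itertools import product

# Hypothetical GAV definitions: global relation -> (head variables, bodies).
# Each body is a conjunction of source atoms written as (relation, arguments).
VIEWS = {
    "scan": (
        ("S", "Dx", "File"),
        [
            [("src1.patient", ("S", "Dx")), ("src1.fmri", ("S", "File"))],
            [("src2.subject", ("S", "Dx")), ("src2.image", ("S", "File"))],
        ],
    ),
}


def unfold(query):
    """Unfold a conjunctive query over global relations into a union of
    conjunctive queries over source relations."""
    choices = []
    for rel, args in query:
        if rel not in VIEWS:
            choices.append([[(rel, args)]])  # already a source atom
            continue
        head, bodies = VIEWS[rel]
        subst = dict(zip(head, args))  # head variable -> query argument
        choices.append(
            [
                [(r, tuple(subst.get(a, a) for a in r_args)) for r, r_args in body]
                for body in bodies
            ]
        )
    # One unfolded conjunctive query per combination of chosen bodies.
    return [sum(combo, []) for combo in product(*choices)]


plans = unfold([("scan", ("P", "'schizophrenia'", "F"))])
for plan in plans:
    print(plan)
```

Each printed plan mentions source relations only, which is what a planner would then reorder and send to the wrappers.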

Integration Engine’s Information Model
BIRN (contd.)

The 3-party power play
• Many integrated views used by several global schemata on a relatively fixed set of sources
• Ontologies are used in two ways
  A global view may be defined using ontology functions
  Keyword queries use simple ontological relationships
• Some terms in the global schema are mapped to ontologies through semantic typing
  Otherwise the global schema and integrated views are independent from the ontology
• Some data are warped to a common atlas coordinate system to enable atlas queries
  Atlas mapping ≡ spatial annotation

Integration Engine’s Information Model
BIRN Integration architecture

[Figure: architecture diagram. Component labels: Query Client, Atlas Client, Onto Client, Mediator, Registry, Spatial, OTIS, Ontological Query Processor, Atlas Query Processor, Data Grid Access, Wrapper Access.]

Gateway
• Has XML API for source registration, source schema update
• Has XML API for queries
• Can be accessed as web service
Registry
• API-based access to schema elements and view definitions
• Implemented over MySQL for portability
• Spatial registry for image data
Planner and Executor
• Described later
Wrappers
• Local and remote
OTIS
• Inverted index for ontological terms
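An inverted index over ontological terms, as listed for OTIS, can be sketched as a map from each term to the data pointers annotated with it. The pointer format (source, relation, attribute), the sample annotations, and the functions `build_index` and `lookup` are assumptions made for illustration only.

```python
from collections import defaultdict


def build_index(annotations):
    """Map each ontology term to the set of (source, relation, attribute)
    pointers whose values were annotated with that term."""
    index = defaultdict(set)
    for pointer, terms in annotations.items():
        for term in terms:
            index[term.lower()].add(pointer)
    return index


def lookup(index, *terms):
    """Return the pointers associated with every given term (AND semantics)."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


# Hypothetical annotations of source attributes with ontology terms.
annotations = {
    ("src1", "cell_image", "cell_type"): ["Purkinje cell", "neuron"],
    ("src2", "morphometry", "structure"): ["neuron", "dendrite"],
}
index = build_index(annotations)
print(lookup(index, "neuron"))
print(lookup(index, "neuron", "dendrite"))
```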

Integration Engine’s Information Model (just for comparison)
GEON

Sources from the Integration Engine’s Viewpoint
• Metadata (item-level information) maintained in a GEON standard called ADN (Alexandria-DLESE-NASA)
• Item-detail level information is either any relationalizable data or shapefiles
• Any WMS or WFS service is a valid source for map information management
• Does not permit an external ontology source; all ontologies have to be defined in the GEON framework

Integration Model
• Every source schema is registered to an ontology

Integration Engine’s Information Model
PAKT (briefly)

Type extensibility of the mediator
• Nested relational query language extended by tree and a restricted set of graph pattern operations
• Construction operations important
Passive extensibility
• Source more powerful than the mediator
• Source exports a set of type-based optimization rules to the mediator
Active extensibility
• Mediator extends its set of interpreted types
Ontology management
• Ontological queries processed by a separate co-processor that interoperates with the mediator
• Query planner partitions the query between the ontological and mediated query processors

Query Paradigms
What are the different kinds of queries scientists and applications pose to an integrated system?

Metadata-based file access
• 21,038 raw image files per subject
• 2.4 GB of raw image data per subject
• 25 GB to 40 GB of processed image data per subject
• 10 million slices of functional imaging data in Phase II
• 7 Terabytes of image data for all of the Phase II analyses (conservative estimate of 25 GB/subject)

Ontologically supported mediated queries
“Find most recent FMRI data of all patients with low scores in working memory tasks having volumetric changes of hippocampus over 10% in 2 years”

Keyword queries
FMRI “working memory task” hippocampus

Ontologically supported keyword queries
Associative searches
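One way to read "ontologically supported keyword queries" is that each keyword is expanded through the ontology before matching. The sketch below assumes a hypothetical `NARROWER` map from a term to its narrower terms and a naive substring match; neither is taken from the BIRN system.

```python
# Hypothetical ontology fragment: term -> narrower (more specific) terms.
NARROWER = {"hippocampus": ["CA1", "CA3", "dentate gyrus"]}


def expand_keywords(keywords, narrower):
    """Return each keyword together with all terms reachable below it."""
    expanded = {}
    for keyword in keywords:
        seen, stack = [keyword], [keyword]
        while stack:
            for child in narrower.get(stack.pop(), []):
                if child not in seen:
                    seen.append(child)
                    stack.append(child)
        expanded[keyword] = seen
    return expanded


def matches(record_text, expanded):
    """A record matches when every keyword, or one of its expansions,
    occurs in the record text (case-insensitive substring match)."""
    text = record_text.lower()
    return all(
        any(term.lower() in text for term in terms) for terms in expanded.values()
    )


expanded = expand_keywords(["FMRI", "hippocampus"], NARROWER)
print(expanded["hippocampus"])
print(matches("fMRI scan of CA1 during a working memory task", expanded))
```

A plain keyword query for "hippocampus" would miss the CA1 record; the expansion is what makes the match possible.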

BIRN Data Grid Usage

[Figure: total number of files (in thousands) and total size of storage (in gigabytes), October 2002 to June 2006. Annotations: 16+ Terabytes, 16 million files.]

View Definition and Query Language

Views defined as union of conjunctive queries
May contain function terms
Expressed in XML Datalog with aggregate functions

Query
q(X, F(Y)) :- r1(X, Z), r2(Z, Y)
where F(Y) is an aggregate function applied to the set of Y values, and X holds the group-by variables.

Planner and Executor translate this to:
q’(X, Y) :- r1(X, Z), r2(Z, Y)
q(X, W) :- F(gb(q’(X, Y)))
where the group-by function “gb”, together with the aggregate function F, is pushed to the data source whenever possible, or otherwise evaluated at the Mediator.

The query language allows nested queries: inner queries are assigned to intermediate variables that are used by the main query.
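The two-step translation can be mimicked in a few lines: first evaluate the join q’, then group by X and apply F. This is only a sketch of mediator-side evaluation over in-memory tuples; the sample relations are made up, and duplicates are kept (bag semantics), which is an assumption rather than something the slides state.

```python
from collections import defaultdict


def q_prime(r1, r2):
    """q'(X, Y) :- r1(X, Z), r2(Z, Y): join r1 and r2 on the shared variable Z."""
    by_z = defaultdict(list)
    for z, y in r2:
        by_z[z].append(y)
    return [(x, y) for x, z in r1 for y in by_z.get(z, [])]


def q(r1, r2, F):
    """q(X, W) :- F(gb(q'(X, Y))): group q' by X, then apply F to each group."""
    groups = defaultdict(list)
    for x, y in q_prime(r1, r2):
        groups[x].append(y)
    return {x: F(ys) for x, ys in groups.items()}


# Hypothetical source relations r1(X, Z) and r2(Z, Y).
r1 = [("a", 1), ("a", 2), ("b", 3)]
r2 = [(1, 10), (2, 20), (3, 5), (3, 7)]
print(q(r1, r2, sum))
print(q(r1, r2, max))
```

When both relations live at one source that supports grouping, the same computation could instead be shipped to that source as a single grouped query.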

Mediator – data integration platform

Ontological Sources - term-graph (UMLS / BONFIRE)

• Node → term
• Edge → relation

general: is-a, part-of
domain specific:
volumetric-subpart: brain-region -> brain-region
measured-by: psych-parameter -> cognitive test

The Data Integration Process

[Figure: two parallel pipelines. Brain region: MRI images, ADTs, tools, derived data, metadata schemas. Neuron: microscope images, other ADTs (e.g., system of pipes), tools, other derived data, metadata schemas.]

UMLS

Spatial Atlas:

• Hook to spatial region

• Hook to UMLS concept

Support for ADTs

Oracle at Sources

Representation of ontologies

• UMLS and Neuronames implemented in Oracle

Extending Ontologies

[Figure: the same two pipelines with generic data sources, combined (+). Brain region: MRI data, data, tools, derived data, metadata schemas. Neuron: images, data (e.g., system of pipes), tools, derived data, metadata schemas.]

Mapping Relations

Ontology Mapping
maps data values from a source to an ontology term of a known ontology (UMLS)
Purkinje corpuscle → ‘Purkinje cell’

Joinable relation
pairs attributes from different relations
Joins-with(src1, rel1, attrib1, src2, rel2, attrib2)

Value-Map
maps a mediator-supported data value to the source-supported value
Gender
• 0/1 at some source
• ‘M’/‘F’ at another source
• male/female for mediator
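The three mapping relations can be pictured as small lookup tables. In the sketch below the source names `src1` and `src2`, the relation and attribute names, and the choice that 0 means male are all assumptions; the slides only say that one source uses 0/1 and another uses ‘M’/‘F’.

```python
# Ontology mapping: source data value -> ontology term (example from the slide).
ONTOLOGY_MAP = {"Purkinje corpuscle": "Purkinje cell"}

# Joins-with(src1, rel1, attrib1, src2, rel2, attrib2); names are hypothetical.
JOINS_WITH = [("src1", "subject", "gender", "src2", "patient", "sex")]

# Value-Map: mediator value -> source value, per (source, attribute).
# Which of 0/1 stands for male is an assumption.
VALUE_MAP = {
    ("src1", "gender"): {"male": 0, "female": 1},
    ("src2", "sex"): {"male": "M", "female": "F"},
}


def to_source(source, attrib, mediator_value):
    """Translate a mediator value into the encoding a given source expects."""
    return VALUE_MAP[(source, attrib)][mediator_value]


def to_mediator(source, attrib, source_value):
    """Translate a source-encoded value back into the mediator's vocabulary."""
    inverse = {v: k for k, v in VALUE_MAP[(source, attrib)].items()}
    return inverse[source_value]


print(to_source("src2", "sex", "female"))
print(to_mediator("src1", "gender", 0))
```

With such maps in place, a selection like gender = male in a mediated query can be rewritten per source before the sub-queries are sent out.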

Architecture extension to support semantic integration

[Figure: extended mediator architecture. Components: Gateway, Pre-Processor, Planner, Executor, Post-Processor, Registry, OTIS, Wrapper Server, Source1, Source2, Source3. Annotations: query in XML format; request associated sources and generate views; return data pointers associated with terms; batch subqueries by source; execute query; manage Executor-Wrapper interaction; combine output tuple set.]

A Functional View of the Mediation Process

Planner
• Query Expression (UCQ+ + Nesting + Grouping & Aggregate)
• View Unfolding
• Flattening of Nested Queries
• Normalization to DNF
• Predicate Reordering (binding patterns + maximal chunk)
• Maximal Feasible Plan
• Algebraic Plan
• Cost/Selectivity-based Optimization
• Pre-Executable Plan

Execution Engine
• Pre-Executable Plan
• Executable Plan
• Execution Control
• Result Building
• Post-processing + aggregate
• Result Reporting
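The predicate reordering step can be sketched as a greedy search for a feasible order under binding patterns. Here a pattern such as "bf" means the first argument must be bound and the second is free; variables are strings starting with an upper-case letter. The relation names and the `reorder` function are hypothetical, and the sketch ignores the "maximal chunk" grouping and any cost model.

```python
def is_var(arg):
    """Variables start with an upper-case letter; anything else is a constant."""
    return isinstance(arg, str) and arg[:1].isupper()


def reorder(atoms, patterns):
    """Greedily order atoms so that every 'b' position of an atom is bound,
    either by a constant or by a variable produced earlier in the plan."""
    bound, plan, pending = set(), [], list(atoms)
    while pending:
        for atom in pending:
            rel, args = atom
            needed = [a for a, p in zip(args, patterns[rel]) if p == "b"]
            if all(not is_var(a) or a in bound for a in needed):
                plan.append(atom)
                bound.update(a for a in args if is_var(a))
                pending.remove(atom)
                break
        else:
            raise ValueError("no feasible ordering for: %r" % (pending,))
    return plan


# Hypothetical binding patterns: r2 can only be called with its first argument bound.
patterns = {"r1": "ff", "r2": "bf"}
plan = reorder([("r2", ("Z", "Y")), ("r1", ("X", "Z"))], patterns)
print(plan)
```

Here r1 is scheduled first because it binds Z, which r2 needs as input.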

Processing Ontological Queries

PAKT: Spatial and Taxonomic Queries

Processing Ontological Queries with Inferencing

[Figure: query domains (Geo-Spatial, Biological, Physiochemical, Habitat) and their sources (OBIS, WOA, Benth_Hab), with Q5 to Q10 shown on the extended diagram.]

Q1: where is species X found?
OBIS(scientific_name,lat,long)

Q2: for a given polygon, what species are found?
OBIS(scientific_name,m_lat,m_long,m_lat,m_long)

Q3: where is species X found given certain physical parameter?
OBIS(scientific_name,lat,long) WOA(physio,lat,long)

Q4: what are the aggregated physical properties of species X?
OBIS(scientific_name,lat,long) WOA(physio,lat,long)

Q5: where is habitat X found?
CMECS(habitat,physio) BH(habitat_grp,shape)

Q6: for a given polygon A, what habitats are found?
CMECS(habitat,physio) BH(habitat_grp,shape) PolygonA

Q7: where is habitat X found given certain physical parameter?
CMECS(habitat,physio) BH(habitat_grp,shape) WOA(physio,lat,long)

Q8: what are the aggregated physical properties of habitat X?
CMECS(habitat,physio) BH(habitat_grp,shape) WOA(physio,lat,long)

Q9: what species can be found at habitat X?
CMECS(habitat,physio) BH(habitat_grp,shape) OBIS(scientific_name,lat,long)

Q10: what habitats is a species X found at?
CMECS(habitat,physio) BH(habitat_grp,shape) OBIS(scientific_name,lat,long)

Italics: input

Underline: output
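Q2 can be illustrated with a small planar sketch: filter OBIS-style rows (scientific_name, lat, long) by a point-in-polygon test. The ray-casting test treats latitude and longitude as flat coordinates, which is a simplification; the sample rows and species names are placeholders, not OBIS data.

```python
def point_in_polygon(lat, lon, polygon):
    """Ray-casting test; `polygon` is a list of (lat, lon) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Does this edge straddle the line of constant latitude through the point?
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that line.
            cross = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross:
                inside = not inside
    return inside


def species_in_polygon(obis_rows, polygon):
    """Q2: distinct species names observed inside the polygon."""
    return sorted(
        {name for name, lat, lon in obis_rows if point_in_polygon(lat, lon, polygon)}
    )


# Placeholder observations and a square polygon.
rows = [("species A", 5.0, 5.0), ("species B", 20.0, 5.0), ("species A", 1.0, 1.0)]
square = [(0.0, 0.0), (0.0, 10.0), (10.0, 10.0), (10.0, 0.0)]
print(species_in_polygon(rows, square))
```

A spatial DBMS source would normally evaluate this predicate itself; the sketch only shows what the mediator would have to do for a source without spatial operators.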

Frequent Query Patterns

Example queries are joins of
• Left query patterns: habitat-spatial, and
• Right query patterns: spatial-environmental/species distribution

Onto-module’s queries (left patterns)
CMECS(habitat,physio) BH(habitat_grp,shape)

Mediator’s queries (right patterns)
BH(..,shape) WOA(physio,lat,long)
BH(..,shape) OBIS(scientific_name,lat,long)

[Figure labels: PolygonA( ), API]

Mediator Demonstration

Ontology for Neuroscience
Why do we need an ontology for neuroscience?

• Critical for linking together neuroscience data and making it understandable to both human and machine
• Ontologies provide a set of concepts and the relationships between them, e.g., “is_a” and “has_a”, and may be expressed in a formal language that supports reasoning
• Large scale ontology projects like the Gene Ontology and Unified Medical Language System have limited coverage of neuroscience-specific concepts
• Several existing projects are developing terminology resources for neuroanatomy down to the cellular scale

Our goal
• Develop an ontology that describes the subcellular anatomy of the nervous system, including cell types and their subcellular properties and multicellular domains
• Create a knowledge base of the subcellular anatomy of the nervous system to aid in database interoperability and construction of neuronal models of cells and multicellular domains

Method

• The conceptual framework for describing the subcellular anatomy of the nervous system was based on Peters, Palay & Webster, The Fine Structure of the Nervous System, Ed. 2
• The knowledge base was constructed as a directed graph using the open source tool Protégé (http://protege.stanford.edu), a freely available knowledge management tool written in Java. A large community supports ongoing development, including multiple plug-ins for visualization, analysis and integration with other tools
• The ontology is expressed in OWL-DL. OWL (Web Ontology Language) is a markup language for publishing and sharing data using ontologies on the Internet. OWL is a vocabulary extension of RDF (Resource Description Framework) and is derived from the DAML+OIL Web Ontology Language. OWL-DL supports description logic, which has desirable computational properties for reasoning systems, e.g., Kv3.2 is located in the plasma membrane; if an axon terminal expresses Kv3.2, then it has a plasma membrane.
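The Kv3.2 example can be mimicked with a single hand-written rule over triples: if X expresses M and M is located in C, then X has part C. This is plain Python standing in for a description logic reasoner, not OWL-DL itself; the predicate names `expresses`, `located_in`, and `has_part` are hypothetical.

```python
# Asserted facts as (subject, predicate, object) triples.
facts = {
    ("Kv3.2", "located_in", "plasma membrane"),
    ("axon terminal", "expresses", "Kv3.2"),
}


def infer_has_part(facts):
    """Apply one rule: expresses(X, M) and located_in(M, C) => has_part(X, C)."""
    located = {(m, c) for m, p, c in facts if p == "located_in"}
    return {
        (x, "has_part", c)
        for x, p, m in facts
        if p == "expresses"
        for m2, c in located
        if m2 == m
    }


print(infer_has_part(facts))
```

A real reasoner would derive this from class and property axioms in the ontology rather than from a rule coded by hand.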

Subcellular Ontology

[Figure: class diagram of the subcellular ontology. Node labels: Intercellular Junction, Multi-cellular Domain, Pinceau, Node of Ranvier, Extracellular Space, Glomerulus, Neuropil, Synaptic Cleft, Subcellular Space, Nerve Cell, Neuron, Glia, Microglia, Macroglia, Compartment, Dendrite, Axon, Cell body, Spine, Dendritic Spine, Shaft, Component, Post synaptic Component, PSD, SER, Actin Filament, Ribosome, Cytoplasm, Organelle, Cytoskeleton, Cilium, Specialization, Inclusion, Plasma Membrane, Molecule, Property, Orientation, Distribution, Morphometrics, Shape. Legend: subclass and has-a edges.]

System Development Facts

Our system is developed on top of an IBM integrated ontology toolkit, which implements a high performance ontology repository built on a relational database

• Completely follows W3C’s OWL and the SPARQL query language
• Uses a description logic reasoner for class-level inference and a set of logic rules translated from DLP for instance-level inference
• Hence, inference completeness and soundness on DLP can be guaranteed
• Back-end database schema design supports efficient querying and inference, with performance superior to Jena, Sesame, etc.

System Architecture

[Figure: layered architecture. Component labels: Biologist-Friendly GUI (i.e., OntoQuest), SKIL APIs, Query Mediator, Cache, Updater, Reasoner, IBM ToolKit, connected through SQL.]

We designed a GUI that is friendly to domain users and a library of customized APIs

• Updater: enables inserting classes and instances incrementally into the ontology repository
• Query Mediator: forms the user’s request as a graph query against the global view; decomposes it into sub-queries in the form of SQL and SPARQL and sends them to CCDB and CKB; reassembles the results and renders an appropriate view (e.g., graphic) for the user
• Reasoner: infers useful information from incomplete inputs to help with data curation or to guide users through exploration
• Cache: further enhances system efficiency by caching or prefetching frequent query results
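The Cache component can be pictured as memoization in front of the query path. The sketch below uses Python's standard `functools.lru_cache`; the backend function `run_subqueries` and the query strings are stand-ins, since the slides do not describe how the actual cache is keyed or invalidated.

```python
from functools import lru_cache

backend_calls = []  # records how often the (hypothetical) backend is hit


def run_subqueries(query):
    """Stand-in for decomposing a request and querying CCDB and CKB."""
    backend_calls.append(query)
    return ("result for " + query,)


@lru_cache(maxsize=128)
def cached_query(query):
    """Serve repeated queries from memory instead of re-running sub-queries."""
    return run_subqueries(query)


cached_query("instances of Dendrite")
cached_query("instances of Dendrite")  # served from the cache
print(len(backend_calls))
```

A cache like this would need to be cleared whenever the Updater inserts new classes or instances; `cached_query.cache_clear()` is the standard way to do that with `lru_cache`.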

The system is still under development – some of the functionalities are not completed or need to be improved

Step 0: startup screen

Step 1: click to show subclass hierarchy by default

Step 1a: other clicking options for expanding different types of hierarchies

Step 1b: show allowed compartment types for Neuroepithelial_Cell and those for Neuron

Step 2: get the detailed info (instances and properties) of the subclass Dendrite of Neuron_Compartment

Step 2a: click on ccdb_ID column to get values for all instances; right click a cell to view the CCDB image page

Step 2b: the CCDB image page corresponding to the selected instance Dendritic_Tree_1 is shown here

Step 2’: some concepts (like Cellular_Dependent_Continuant here) have properties but no instances in CKB

Step 3: right-clicking a concept in the hierarchy pops up a list of view functions to choose from

Step 3a: the OWLdoc page is shown for the chosen concept Neuron

Step 4: aggregate the has_Component values of all Dendrite instances; the last row shows a statistical summary

Note that instances of Dendrite include those of its subclasses (such as Dendrite_Tree)

Step 5: drill down to view instances of Dendrite_Tree and aggregate over several numeric property values

• What are the cellular components of a dendrite?
29 instances of dendrite
1. Microtubules
2. Mitochondria
3. Hypolemmal cisternae
4. Plasma membrane
5. Smooth endoplasmic reticulum
6. Rough endoplasmic reticulum
7. Polyribosomes
8. Neurofilaments
Average diameter = 3.2 µm
Average length = 150 µm

• How many dendrites does a Purkinje cell have?
3 instances of Purkinje cell dendritic tree
1. Avg branch order = 22
2. Number of primary dendrites = 1.3
3. Avg number of branches = 760

**Computes aggregate properties from instances
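Computing aggregate properties from instances, with subclass instances counted under their superclass, might look like the sketch below. The class hierarchy, instance records, property name, and all values are made up for illustration; they are not the CKB data behind the figures above.

```python
from statistics import mean

# Hypothetical subclass links and instance records.
SUBCLASS_OF = {"Dendrite_Tree": "Dendrite", "Dendrite": "Neuron_Compartment"}
INSTANCES = [
    {"id": "d1", "cls": "Dendrite", "diameter_um": 3.0},
    {"id": "t1", "cls": "Dendrite_Tree", "diameter_um": 4.0},
    {"id": "a1", "cls": "Axon", "diameter_um": 1.0},
]


def is_a(cls, ancestor):
    """True if `cls` equals `ancestor` or is one of its (transitive) subclasses."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def aggregate(cls, prop):
    """Average a numeric property over all instances of `cls` and its subclasses."""
    values = [i[prop] for i in INSTANCES if is_a(i["cls"], cls) and prop in i]
    if not values:
        return {"count": 0, "mean": None}
    return {"count": len(values), "mean": mean(values)}


print(aggregate("Dendrite", "diameter_um"))
```

The Dendrite_Tree instance contributes to the Dendrite average, which mirrors the behavior noted in the walkthrough.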

“Rules” for cellular assembly

Thank you!

Questions? Comments? Integrated Queries?
