Data Integration and Ontologies for Biomedical Applications
Amarnath Gupta, Vadim Astakhov, Christopher Condit
A Word about Data in Science
Excerpts from a Report by NSF’s Office of Cyberinfrastructure
Data. … data are any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data.
Metadata. Metadata are a subset of data, and are data about data. Metadata summarize data content, context, structure, inter-relationships, and provenance (information on history and origins). They add relevance and purpose to data, and enable the identification of similar data in different data collections.
Ontology. An ontology is the systematic description of a given phenomenon; it often includes a controlled vocabulary and relationships, captures nuances in meaning, and enables knowledge sharing and reuse.
What is data integration?
For applications where there are a number of data sources (recall previous slide) that are
• Geographically distributed
• Autonomous
• Hosted on different platforms, possibly on systems with different query capabilities (e.g., different DBMSs, files, spreadsheets)
• Perhaps even built on different data models
• Described by different schemas
• BUT about one common, general theme
One may want to construct a general-purpose information system such that
• All these data sources can be co-accessed as if they belonged to a single data source
• It can produce “combined information objects” on demand for ad hoc queries, to facilitate problem-specific analyses performed through other software products (workflows, atlases, statistical packages …)
Data integration refers to a body of techniques to produce such an information system
Source: Mark Ellisman
Data Integration vis-à-vis Data Grid
A different aspect of data management
Storage Resource Transparency
Storage Location Transparency
E:\srbVault\image.jpg   /users/srbVault/image.jpg   Select … from srb.mdas.td where...
Data Identifier Transparency
image_0.jpg … image_100.jpg
Data Replica Transparency
image.sql   image.cgi   image.wsdl
Virtual Data Transparency
Semantic data Organization (with behavior)
patientRecordsCollection   myActiveNeuroCollection
Inter-organizational Information
Storage Management
Courtesy: Reagan Moore and Arun Jagatheesan
Need for Integration: The BIRN Case
BIRN involves a consortium of more than 30 universities and 40 research groups
Three test bed projects and associated collaborative projects centered around brain imaging of human neurological diseases and associated animal models.
Research questions require integration over multiple heterogeneous sources
The Need for Integration (Function BIRN)
Are chronic, but not first-onset patients, associated with superior temporal gyrus dysfunction?
[Figure: a Mediator produces an Integrated View over wrapped sources. Labels: fMRI, Receptor Density, ERP, Structure, Clinical, Web (PubMed, Expasy), WebQTL, Wrapper, Mediator. Inset chart: values by Treatment Group (ARIP - 20MG, ARIP - 30MG, RISP - 06MG, PLACEBO).]
Information Integration Problem (MouseBIRN)
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
[Figure: Data Federation. Wrapper 1, Wrapper 2, Wrapper 3, and Wrapper 4 connect sources to an Information Bus. Sources: protein localization (NCMIR), neurotransmission (SENSELAB), sequence info (CaPROT), morphometry (SYNAPSE).]
Science questions require integration of data across:
• multiple scales: from 3D volumetric data of nerve components to image feature data of protein distribution in the brain
• different categories of data: genomics data with neuro-imaging and molecular distribution data
• multiple universities/labs: to obtain a larger population sample, so that the findings are statistically more robust
• a number of integrated schemas: defined over different combinations of these sources and designed for different studies
(Some) Dimensions of Information Integration in Cyberinfrastructure Projects
Source Information Model
Integration Engine’s Information Model
The integration model (or the registration model)
Specification of semantic correspondences across sources
The 3-party power play among “global schema”, “local schema”, “ontology”
Query paradigms over integrated data
The mechanics of query planning and query execution
About Semantic Correspondences
The general problem
For any data integration across multiple sources there needs to be a way to
• Specify how two objects from different data sources may correspond
• Specify how the “joining” of these two objects would create a composite data object
What’s the big deal?
• Identical objects versus equivalent objects
• Complete objects versus partial objects
• Multi-scale representations of the same object
• Handling definitional differences
• Taking into account natural variability
• Contextual correspondence
Are these always specifiable through ontological standards like OWL?
Do we need to have “correspondence checking” services?
About the 3-party Power Play
While we want to create a single (cyber-)infrastructure with a data integration component, different applications have different integration scenarios
• Is there a single global schema?
• Do new applications (and hence global schemata) get added all the time over existing sources and ontologies?
• Are the sources fixed? Do new sources get added all the time? Do sources come and go?
• Are sources added dynamically as “data sets” that users want to integrate “on the fly”?
• Do local schemata come with their own ontologies? Is there a global ontology that all local ontologies must map to?
• How does the global schema (if one exists) relate to the global and local ontologies?
• Do new (or modified) ontologies get added all the time?
• Do the local schemata evolve all the time?
Is there a general way to manage this?
Do we need to architect any cyberinfrastructure components differently?
Source Information Models
BIRN
Data Sources
• Relational DBMS
• Standard data types
• Semantic data types (attribute-domain references to ontologies)
• Some data and computation sources expose a set of functions
• Key constraints
Ontology Sources
Simplifying assumptions
• Ontologies can be approximated by edge-labeled directed graphs stored in relational systems
• Graph traversal functions can be mimicked as database functions
BONFIRE
• Glue ontology for simple inter-ontology mappings and extensions
Image and Spatial Data Sources
• Discussed later
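The simplifying assumptions for ontology sources can be illustrated with a small sketch. It assumes a hypothetical `edge(src, label, dst)` table and a hypothetical `ancestors` helper; neither is taken from BIRN. It only shows how an edge-labeled graph stored relationally can be traversed with an ordinary recursive SQL query, here run through Python's built-in `sqlite3` module.

```python
import sqlite3

# Hypothetical edge table: one row per labeled edge of the ontology graph.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src TEXT, label TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?)",
    [
        ("Purkinje cell", "is-a", "neuron"),
        ("neuron", "is-a", "cell"),
        ("dendrite", "part-of", "neuron"),
    ],
)

def ancestors(term, label):
    """Return all terms reachable from `term` by following `label` edges."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT dst FROM edge WHERE src = ? AND label = ?
            UNION
            SELECT e.dst FROM edge e JOIN reach r ON e.src = r.node
            WHERE e.label = ?
        )
        SELECT node FROM reach ORDER BY node
        """,
        (term, label, label),
    ).fetchall()
    return [r[0] for r in rows]
```

Calling `ancestors("Purkinje cell", "is-a")` walks the is-a edges transitively, which is the kind of graph traversal function a mediator could treat as an opaque database function.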
Source Information Models
PAKT (marine biogeography)
Data Sources
• Relational
• Spatial (vectors) supported by GIS and Spatial DBMS
• Spatial (raster – continuously partitionable arrays)
  • ArcGIS (map algebra), nested, non-aligned, multiple resolution
• Spatially-indexed time series
• Function-exposing sources (WSDL)
  • Parameter and result data types are interpretable or BLOBs
Ontology Sources
• Any ontology specified in a subset of OWL
• Any DAG-structured data source
Integration Engine’s Information Model
BIRN
Sources from the mediator’s view
• Base relations may have binding patterns
• Distinction between data and metadata is not strictly observed
  • SRB metadata catalog is treated as a relational source with some special functions
• Files are accessed by reference to data-grid URIs (SRB ids)
Integration Model
• Essentially Global-as-view (GAV) mediation
• “Semantic” aspect of the mediation is executed through opaque functions over ontology sources
• Key constraints are not used during standard query processing but are used for keyword queries
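In global-as-view mediation each global relation is defined as a view over source relations, so a query on the global schema is answered by replacing view atoms with their definitions. A minimal sketch of that unfolding step follows. The view `subject_scan`, the source relation names, and the `unfold` helper are all hypothetical, and the sketch assumes view bodies use only head variables (no renaming of body-only variables).

```python
from itertools import product

# Hypothetical GAV definitions: view name -> (head variables, alternative bodies).
VIEWS = {
    "subject_scan": (
        ("S", "F"),
        [
            [("src1_subject", ("S",)), ("src1_scan", ("S", "F"))],
            [("src2_session", ("S", "F"))],
        ],
    ),
}

def unfold(query):
    """Expand each view atom of a conjunctive query into source atoms.

    query: list of (predicate, args). Returns a list of conjunctive
    queries (a union), one per combination of view alternatives.
    """
    options = []
    for pred, args in query:
        if pred not in VIEWS:
            options.append([[(pred, args)]])  # source atom: keep as is
            continue
        head, bodies = VIEWS[pred]
        subst = dict(zip(head, args))  # map head variables to query arguments
        options.append(
            [[(p, tuple(subst.get(a, a) for a in a_s)) for p, a_s in body]
             for body in bodies]
        )
    return [sum(combo, []) for combo in product(*options)]
```

Unfolding `[("subject_scan", ("X", "Y"))]` yields two conjunctive queries, one per source that contributes to the view, which together form the union the mediator has to plan.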
Integration Engine’s Information Model
BIRN (contd.)
The 3-party power play
• Many integrated views are used by several global schemata on a relatively fixed set of sources
• Ontologies are used in two ways
  • A global view may be defined using ontology functions
  • Keyword queries use simple ontological relationships
• Some terms in the global schema are mapped to ontologies through semantic typing
• Otherwise the global schema and integrated views are independent from the ontology
• Some data are warped to a common atlas coordinate system to enable atlas queries
  • Atlas mapping ≡ spatial annotation
Integration Engine’s Information Model
BIRN Integration architecture
[Figure: architecture diagram. Labels: Query Client, Onto Client, Atlas Client, Mediator, Registry, Spatial Registry, OTIS, Ontological Query Processor, Atlas Query Processor, Wrapper Access, Data Grid Access.]
Gateway
• Has an XML API for source registration and source schema update
• Has an XML API for queries
• Can be accessed as a web service
Registry
• API-based access to schema elements and view definitions
• Implemented over MySQL for portability
• Spatial registry for image data
Planner and Executor
• Described later
Wrappers
• Local and remote
OTIS
• Inverted index for ontological terms
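An inverted index over ontological terms can be pictured as a map from each term to the places in the sources where that term annotates data. The sketch below is only an illustration of the idea: the `(source, relation, attribute)` pointer format, the sample annotations, and the function names are assumptions, not the OTIS implementation.

```python
from collections import defaultdict

def build_index(annotations):
    """annotations: iterable of (term, source, relation, attribute) tuples."""
    index = defaultdict(set)
    for term, *pointer in annotations:
        # Normalize case so lookups are case-insensitive.
        index[term.lower()].add(tuple(pointer))
    return index

# Hypothetical annotations linking ontology terms to source attributes.
annotations = [
    ("Hippocampus", "src1", "mri_scan", "region"),
    ("hippocampus", "src2", "cell_image", "anatomical_site"),
    ("Purkinje cell", "src2", "cell_image", "cell_type"),
]
index = build_index(annotations)

def lookup(term):
    """Return the sorted data pointers associated with an ontology term."""
    return sorted(index.get(term.lower(), set()))
```

A keyword such as `hippocampus` then resolves to every source attribute annotated with it, which is the information a planner needs to decide which sources to query.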
Integration Engine’s Information Model (just for comparison)
GEON
Sources from the Integration Engine’s Viewpoint
• Metadata (item-level information) is maintained in a GEON standard called ADN (Alexandria-DLESE-NASA)
• Item-detail level information is either any relationalizable data or shapefiles
• Any WMS or WFS service is a valid source for map information management
• Does not permit an external ontology source; all ontologies have to be defined in the GEON framework
Integration Model
• Every source schema is registered to an ontology
Integration Engine’s Information Model
PAKT (briefly)
Type extensibility of the mediator
• Nested relational query language extended by tree and a restricted set of graph pattern operations
• Construction operations important
Passive extensibility
• Source more powerful than the mediator
• Source exports a set of type-based optimization rules to the mediator
Active extensibility
• Mediator extends its set of interpreted types
Ontology management
• Ontological queries are processed by a separate co-processor that interoperates with the mediator
• Query planner partitions the query between the ontological and mediated query processors
Query Paradigms
What are the different kinds of queries scientists and applications pose to an integrated system?
Metadata-based file access
• 21,038 raw image files per subject
• 2.4 GB of raw image data per subject
• 25 GB to 40 GB of processed image data per subject
• 10 million slices of functional imaging data in Phase II
• 7 Terabytes of image data for all of the Phase II analyses (conservative estimate of 25 GB/subject)
Ontologically supported mediated queries
• “Find most recent FMRI data of all patients with low scores in working memory tasks having volumetric changes of hippocampus over 10% in 2 years”
Keyword queries
• FMRI “working memory task” hippocampus
Ontologically supported keyword queries
Associative searches
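One way to read "ontologically supported keyword queries" is that a keyword is expanded through ontology relationships before it is matched against the sources. The following sketch assumes a tiny hypothetical part-of table and an `expand` helper; it is not the BIRN mechanism, only an illustration of keyword expansion.

```python
# Hypothetical part-of edges (child -> parent) from an anatomy ontology.
PART_OF = {
    "CA1": "hippocampus",
    "dentate gyrus": "hippocampus",
    "hippocampus": "temporal lobe",
}

def expand(term):
    """Return `term` plus every term that is transitively part of it."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in PART_OF.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found
```

With this table a keyword query on `hippocampus` would also match data annotated with `CA1` or `dentate gyrus`, but not data annotated only with the enclosing `temporal lobe`.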
[Figure: BIRN Data Grid Usage, Oct-02 through Jun-06. Series: Total Number of Files (in thousands) and Total Size of Storage (in Gigabytes). Callouts: 16 million files; 16+ Terabytes.]
View Definition and Query Language
• Views are defined as unions of conjunctive queries
• May contain function terms
• Expressed in XML Datalog with aggregate functions
Query
q(X,F(Y)) :- r1(X,Z), r2(Z,Y)
where F(Y) is an aggregate function applied to the set of Y values, and X are the group-by variables.
Planner and Executor translate this to:
q’(X,Y) :- r1(X,Z), r2(Z,Y)
q(X,W) :- F(gb(q’(X,Y)))
where the group-by function “gb” with aggregate function F is pushed to the data source whenever possible, or else evaluated at the Mediator.
The query language allows nested queries: inner queries are assigned to intermediate variables that are used by the main query
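The two-step translation can be mimicked in a few lines for the case where the aggregate is evaluated at the mediator. The relations `r1` and `r2` below hold made-up tuples, and the functions `q_prime` and `q` are named after the rules for readability. The sketch keeps duplicates (bag semantics), which is an assumption rather than something the slides state.

```python
from collections import defaultdict

# Hypothetical base relations r1(X, Z) and r2(Z, Y) as lists of tuples.
r1 = [("s1", "a"), ("s1", "b"), ("s2", "b")]
r2 = [("a", 10), ("b", 20), ("b", 30)]

def q_prime(r1, r2):
    """q'(X, Y) :- r1(X, Z), r2(Z, Y): join on the shared variable Z."""
    by_z = defaultdict(list)
    for z, y in r2:
        by_z[z].append(y)
    return [(x, y) for x, z in r1 for y in by_z[z]]

def q(r1, r2, F):
    """q(X, W) :- F(gb(q'(X, Y))): group q' by X, then apply F per group."""
    groups = defaultdict(list)
    for x, y in q_prime(r1, r2):
        groups[x].append(y)
    return {x: F(ys) for x, ys in groups.items()}
```

Passing `sum` or `max` as `F` gives one aggregated value per group-by value of X. When a source supports grouping itself, the same computation would be pushed down instead of being run here.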
Mediator – data integration platform
Ontological Sources - term-graph (UMLS / BONFIRE)
• Node → term
• Edge → relation
general: is-a, part-of
domain specific:
volumetric-subpart: brain-region -> brain-region
measured-by: psych-parameter -> cognitive test
The Data Integration Process
[Figure: two source stacks. Brain region: MRI images, ADT’s, Tools, derived data, Metadata Schemas. Neuron: microscope images, other ADT’s (e.g., system of pipes), Tools, other derived data, Metadata Schemas.]
UMLS
Spatial Atlas:
• Hook to spatial region
• Hook to UMLS concept
Support for ADT’s
Oracle at Sources
Representation of ontologies
• UMLS and Neuronames implemented in Oracle
Extending Ontologies
[Figure: the same two source stacks, each extended (+). Brain region: MRI data, data, Tools, derived data, Metadata Schemas. Neuron: images, data (e.g., system of pipes), Tools, derived data, Metadata Schemas.]
Mapping Relations
Ontology Mapping
• Maps data values from a source to an ontology term of a known ontology (UMLS)
• Purkinje corpuscle → ‘Purkinje cell’
Joinable relation
• Pairs attributes from different relations
• Joins-with(src1, rel1, attrib1, src2, rel2, attrib2)
Value-Map
• Maps a mediator-supported data value to a source-supported value
• Gender: 0/1 at some source, ‘M’/‘F’ at another source, male/female for the mediator
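The value-map idea can be sketched as a pair of lookups. The source names and the choice of which code means which gender are assumptions made for illustration; the slides only say that one source uses 0/1, another 'M'/'F', and the mediator male/female.

```python
# Hypothetical value maps: mediator value -> source-specific encoding.
# Which of 0/1 denotes "male" is an assumption, not stated in the slides.
VALUE_MAPS = {
    "source_a": {"male": 0, "female": 1},
    "source_b": {"male": "M", "female": "F"},
}

def to_source(source, mediator_value):
    """Rewrite a mediator-level constant into the source's encoding."""
    return VALUE_MAPS[source][mediator_value]

def to_mediator(source, source_value):
    """Rewrite a value returned by a source into the mediator's encoding."""
    inverse = {v: k for k, v in VALUE_MAPS[source].items()}
    return inverse[source_value]
```

A query constant such as `female` is translated with `to_source` before a subquery is shipped, and returned tuples are translated back with `to_mediator` so that results from both sources can be combined.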
Architecture extension to support semantic integration
[Figure: architecture diagram. Components: Gateway, Pre-Processor, Planner, Executor, Post-Processor, Registry, OTIS, Wrapper Server, Source1, Source2, Source3. Annotations: Query in XML format; Request associated sources and generate views; Return data pointers associated with terms; Batch subqueries by source; Execute Query; Manage Executor-Wrapper Interaction; Combine output tuple set.]
A Functional View of the Mediation Process
Planner
• Query Expression (UCQ+ + Nesting + Grouping & Aggregate)
• View Unfolding
• Flattening of Nested Queries
• Normalization to DNF
• Predicate Reordering (binding patterns + maximal chunk)
• Maximal Feasible Plan
• Algebraic Plan
• Cost/Selectivity-based Optimization
• Pre-Executable Plan
Execution Engine
• Pre-Executable Plan
• Executable Plan
• Execution Control
• Result Building
• Post-processing + aggregate
• Result Reporting
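Predicate reordering under binding patterns can be illustrated with a greedy sketch: a predicate may be called only once all of its input variables are bound, and calling it binds its output variables. The tuple format and the `order_predicates` function are hypothetical, and the "maximal chunk" part of the step is not modeled here.

```python
def order_predicates(predicates, bound):
    """Greedy feasible ordering under binding patterns.

    predicates: list of (name, inputs, outputs); `inputs` must be bound
    before the predicate can be called, `outputs` become bound afterwards.
    bound: variables bound by the query itself (constants).
    Returns the ordered names, or None if no feasible order exists.
    """
    bound = set(bound)
    remaining = list(predicates)
    plan = []
    while remaining:
        # Pick the first predicate whose inputs are all bound.
        ready = next((p for p in remaining if set(p[1]) <= bound), None)
        if ready is None:
            return None  # some predicate can never be called
        plan.append(ready[0])
        bound |= set(ready[2])
        remaining.remove(ready)
    return plan
```

For example, if `r2` needs Z bound and `r1` produces Z from a bound X, the only feasible order calls `r1` first. Because the set of bound variables only grows, the greedy choice never rules out an order that would otherwise be feasible.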
Processing Ontological Queries
PAKT: Spatial and Taxonomic Queries
Processing Ontological Queries with Inferencing
[Figure: query patterns over the source groups Geo-Spatial, Biological, Physiochemical, and Habitat, with sources OBIS, WOA, and Benth_Hab. Legend entries: italics = input, underline = output, extended.]
Q1: where is species X found? OBIS(scientific_name,lat,long)
Q2: for a given polygon, what species are found? OBIS(scientific_name,m_lat,m_long,m_lat,m_long)
Q3: where is species X found given a certain physical parameter? OBIS(scientific_name,lat,long) WOA(physio,lat,long)
Q4: what are the aggregated physical properties of species X? OBIS(scientific_name,lat,long) WOA(physio,lat,long)
Q5: where is habitat X found?
Q6: for a given polygon A, what habitats are found?
Q7: where is habitat X found given a certain physical parameter?
Q8: what are the aggregated physical properties of habitat X?
Q9: what species can be found at habitat X?
Q10: at what habitats is species X found?
Relations appearing in Q5 to Q10: CMECS(habitat,physio), BH(habitat_grp,shape), WOA(physio,lat,long), OBIS(scientific_name,lat,long), PolygonA
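Q3 joins OBIS and WOA on location. The sketch below shows that join under strong simplifying assumptions: the rows are invented, both sources are assumed to report on identical (lat, long) grid cells, and the physical parameter is a hypothetical `temp_c` field. Real OBIS and WOA records would need proper spatial matching.

```python
# Hypothetical rows following OBIS(scientific_name, lat, long)
# and WOA(physio, lat, long); values are made up for illustration.
OBIS = [
    ("species_x", 60, -5),
    ("species_x", 45, -30),
    ("species_y", 60, -5),
]
WOA = [
    ({"temp_c": 8.0}, 60, -5),
    ({"temp_c": 15.0}, 45, -30),
]

def where_found(species, max_temp_c):
    """Locations where `species` occurs and temperature is at most max_temp_c."""
    physio_at = {(lat, lon): physio for physio, lat, lon in WOA}
    return [
        (lat, lon)
        for name, lat, lon in OBIS
        if name == species
        and (lat, lon) in physio_at
        and physio_at[(lat, lon)]["temp_c"] <= max_temp_c
    ]
```

Here the species name and the temperature bound play the role of the inputs, and the returned locations are the output of the query.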
Frequent Query Patterns
Example queries are joins of
• Left query patterns: habitat-spatial, and
• Right query patterns: spatial-environmental/species distribution
[Figure: Onto-module’s queries, CMECS(habitat,physio) BH(habitat_grp,shape), joined through an API with Mediator’s queries, BH(..,shape) WOA(physio,lat,long) and BH(..,shape) OBIS(scientific_name,lat,long). Additional label: PolygonA( ).]
Mediator Demonstration
Ontology for Neuroscience
Why do we need an ontology for neuroscience?
• Critical for linking together neuroscience data and making it understandable to both human and machine
• Ontologies provide a set of concepts and the relationships between them, e.g., “is_a” and “has_a”, and may be expressed in a formal language that supports reasoning
• Large-scale ontology projects like the Gene Ontology and the Unified Medical Language System have limited coverage of neuroscience-specific concepts
• Several existing projects are developing terminology resources for neuroanatomy down to the cellular scale
Our goal
• Develop an ontology that describes the subcellular anatomy of the nervous system, including cell types and their subcellular properties and multicellular domains
• Create a knowledge base of the subcellular anatomy of the nervous system to aid in database interoperability and construction of neuronal models of cells and multicellular domains
Method
• The conceptual framework for describing the subcellular anatomy of the nervous system was based on Peters, Palay & Webster, The Fine Structure of the Nervous System, Ed. 2
• The knowledge base was constructed as a directed graph using the open source tool Protégé (http://protege.stanford.edu), a freely available knowledge management tool written in Java. A large community supports ongoing development, including multiple plug-ins for visualization, analysis and integration with other tools
• The ontology is expressed in OWL-DL. OWL (Web Ontology Language) is a markup language for publishing and sharing data using ontologies on the Internet. OWL is a vocabulary extension of RDF (Resource Description Framework) and is derived from the DAML+OIL Web Ontology Language. OWL-DL supports description logic, which has desirable computational properties for reasoning systems, e.g., Kv3.2 is located in the plasma membrane; if an axon terminal expresses Kv3.2, then it has a plasma membrane.
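The Kv3.2 example can be written as a single rule: if a compartment expresses a molecule and that molecule is located in some part, then the compartment has that part. The sketch below hard-codes this one rule in plain Python. It is not a description logic reasoner, and the dictionary and function names are hypothetical.

```python
# Facts taken from the example in the text.
located_in = {"Kv3.2": "plasma membrane"}   # molecule -> where it is located
expresses = {("axon terminal", "Kv3.2")}    # (compartment, molecule)

def infer_parts(expresses, located_in):
    """Apply the rule: expresses(c, m) and located_in(m, p) => has_part(c, p)."""
    return {(c, located_in[m]) for c, m in expresses if m in located_in}
```

Running `infer_parts(expresses, located_in)` derives that the axon terminal has a plasma membrane, a fact that was never asserted directly. An OWL-DL reasoner would derive it from class axioms rather than from a hand-written rule.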
Subcellular Ontology
[Figure: class diagram of the subcellular ontology. Legend: subclass, has-a. Labels: Intercellular Junction, Multi-cellular Domain, Pinceau, Node of Ranvier, Extracellular Space, Glomerulus, Neuropil, Synaptic Cleft, Subcellular Space, Nerve Cell, Neuron, Glia, Microglia, Macroglia, Compartment, Dendrite, Axon, Cell body, Spine, Dendritic Spine, Shaft, Component, Post synaptic Component, PSD, SER, Actin Filament, Ribosome, Cytoplasm, Organelle, Cytoskeleton, Cilium, Specialization, Inclusion, Plasma Membrane, Molecule, Property, Orientation, Distribution, Morphometrics, Shape.]
System Development Facts
Our system is developed on top of an IBM integrated ontology toolkit, which implements a high-performance ontology repository built on a relational database
• Fully follows W3C’s OWL and the SPARQL query language
• Uses a description logic reasoner for class-level inference and a set of logic rules translated from DLP for instance-level inference
• Hence, inference completeness and soundness on DLP can be guaranteed
• Back-end database schema design supports efficient querying and inference; performance is superior to Jena, Sesame, etc.
System Architecture
[Figure: architecture diagram. Labels: Biologist-Friendly GUI (i.e., OntoQuest), SKIL APIs, Query Mediator, Cache, Updater, Reasoner, IBM ToolKit, SQL.]
We designed a domain-user-friendly GUI and a library of customized APIs
• Updater: enables inserting classes and instances incrementally into the ontology repository
• Query Mediator: forms the user’s request as a graph query against the global view; decomposes it into sub-queries in SQL and SPARQL and sends them to CCDB and CKB; reassembles the results and renders an appropriate view (e.g., graphic) for the user
• Reasoner: infers useful information from incomplete inputs to help with data curation or to guide users through exploration
• Cache: further enhances system efficiency by caching or prefetching frequent query results
The system is still under development; some of the functionality is incomplete or needs improvement
Step 0: startup screen
Step 1: click to show subclass hierarchy by default
Step 1a: other clicking options for expanding different types of hierarchies
Step 1b: show allowed compartment types for Neuroepithelial_Cell and those for Neuron
Step 2: get the detailed info (instances and properties) of the subclass Dendrite of Neuron_Compartment
Step 2a: click on ccdb_ID column to get values for all instances; right click a cell to view the CCDB image page
Step 2b: the CCDB image page corresponding to the selected instance Dendritic_Tree_1 is shown here
Step 2’: some concepts (like Cellular_Dependent_Continuant here) have properties but no instances in CKB
Step 3: right-clicking a concept in the hierarchy pops up a list of view functions to choose from
Step 3a: the OWLdoc page is shown for the chosen concept Neuron
Step 4: aggregate the has_Component values of all Dendrite instances; the last row shows statistics summary
You may also have noticed that instances of Dendrite include those of its subclasses (such as Dendrite_Tree)
Step 5: drill down to view instances of Dendrite_Tree; aggregate over several numeric property values
• What are the cellular components of a dendrite?
29 instances of dendrite
1. Microtubules
2. Mitochondria
3. Hypolemmal cisternae
4. Plasma membrane
5. Smooth endoplasmic reticulum
6. Rough endoplasmic reticulum
7. Polyribosomes
8. Neurofilaments
Average diameter = 3.2 µm
Average length = 150 µm
• How many dendrites does a Purkinje cell have?
3 instances of Purkinje cell dendritic tree
1. Avg branch order = 22
2. Number of primary dendrites = 1.3
3. Avg number of branches = 760
** Computes aggregate properties from instances
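Aggregating over instances, with subclass instances counted as instances of the parent class, can be sketched as follows. The class names follow the demo steps, but the instance records, the `diameter_um` property, and its values are invented for illustration and do not reproduce the figures above.

```python
from statistics import mean

# Hypothetical subclass links (child -> parent), named after the demo steps.
subclass_of = {"Dendrite_Tree": "Dendrite", "Dendrite": "Neuron_Compartment"}

# Hypothetical instances with a made-up numeric property.
instances = [
    {"cls": "Dendrite", "diameter_um": 3.0},
    {"cls": "Dendrite_Tree", "diameter_um": 4.0},
    {"cls": "Axon", "diameter_um": 1.0},
]

def is_a(cls, ancestor):
    """True if `cls` equals `ancestor` or is a transitive subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = subclass_of.get(cls)
    return False

def average(prop, cls):
    """Average `prop` over instances of `cls`, including its subclasses."""
    values = [i[prop] for i in instances if is_a(i["cls"], cls)]
    return mean(values) if values else None
```

Asking for the average over `Dendrite` includes the `Dendrite_Tree` instance, while asking over `Dendrite_Tree` alone does not include plain `Dendrite` instances.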
“Rules” for cellular assembly
Thank you!
Questions? Comments? Integrated Queries?