Bertram Ludäscher, Data and Knowledge System, San Diego Supercomputer Center, U.C. San Diego

EDBT'02, Prague: Scientific Data Integration for Complex Multiple-Worlds Scenarios: Databases Meets Knowledge Representation


Page 1: EDBT Panel, March 2002, Prague: Scientific Data Integration for Complex Multiple-Worlds Scenarios: Databases Meets Knowledge Representation

Bertram Ludäscher

Data and Knowledge System

San Diego Supercomputer Center

U.C. San Diego

Page 2: A Home Buyer’s Information Integration Problem

What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?

Information Integration

Sources: Realtor, Demographics, School Rankings, Crime Stats

“Simple Multiple-Worlds” Mediation Problem => XML-Based Mediator

Page 3: A Neuroscientist’s Information Integration Problem

What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?

Information Integration

Sources: protein localization (NCMIR), neurotransmission (SENSELAB), sequence info (CaPROT), morphometry (SYNAPSE)

“Complex Multiple-Worlds” Mediation Problem => Model-Based Mediator

Page 4: A Geoscientist’s Information Integration Problem

What are the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?

Information Integration

Sources: Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB)

“Complex Multiple-Worlds” Mediation

Page 5: Scientific Data Integration Challenges: Heterogeneities in the 4S’s ...

• System Aspects
– platforms, devices, physical distribution, transport protocols, access APIs, impedance mismatch, user interfaces, application integration ...

• Syntaxes
– heterogeneous data formats (one for each tool ...)

• Structures
– heterogeneous schemas (one for each DB ...)
– heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs)

• Semantics
– unclear semantics: e.g., incoherent terminology, multiple taxonomies, ...

Page 6: Data Integration: Approaches / Solutions

Diagram labels: Syntax, Structure, Semantics, System aspects

• (Data-)Grid / Middleware
– system: distributed data & computing (SDSC SRB, Globus, web services, WSDL)
– source = file or DB

• XML-Based Mediators
– structure: XML queries and views
– source = XML-DB

• Model-Based/Semantic Mediators
– semantics: conceptual models and declarative views
– source = Knowledge Base (DB+CMs+ICs)

• Semantic Web Formalisms
– semantics: ontologies, description logics (RDF(S), DAML+OIL, ...)

• Knowledge/Semantic-Grid
– combination

Page 7: What’s in a Link?

• Syntactic Joins
(X,Y) := X.SSN = Y.SSN   [equality]
(X,Y) := X.UMLS-ID = Y.UID

• “Speciality” Joins
(X,Y,Score) := BLAST(X,Y,Score)   [similarity]

• Semantic/Rule-Based Joins
(X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S>0.8   [homology, lub]
(X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y.   [rule-based]
e.g., X=-secretase, B=beta amyloid, Y=Alzheimer’s disease

• Challenge:
– compile semantic joins into efficient syntactic ones
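The three kinds of links above can be sketched in Python. This is only an illustration under stated assumptions, not the mediator's implementation: the record layouts, the `similarity` function (a stand-in for a BLAST score, here a character-overlap ratio from `difflib`), the `isa` table, and all sample data are hypothetical. The class test uses direct class membership rather than a least upper bound in a hierarchy, and the enzyme in the slide's example is abbreviated as "secretase".

```python
from difflib import SequenceMatcher

# Hypothetical sample records; field names are assumptions.
people = [{"name": "a", "ssn": "111"}, {"name": "b", "ssn": "222"}]
claims = [{"claim": "c1", "ssn": "222"}, {"claim": "c2", "ssn": "333"}]

def syntactic_join(xs, ys, key):
    """Equality join: link X and Y when X.key == Y.key."""
    return [(x, y) for x in xs for y in ys if x[key] == y[key]]

def similarity(a, b):
    """Stand-in for a BLAST score: a ratio in [0, 1] computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()

def speciality_join(xs, ys):
    """Similarity join: link every pair and report its score."""
    return [(x, y, similarity(x, y)) for x in xs for y in ys]

# Hypothetical isa facts (protein -> class) and sequences.
isa = {"p1": "calcium_binding", "p2": "calcium_binding", "p3": "kinase"}
seqs = {"p1": "MGKSNSKLK", "p2": "MGKSNSKLR", "p3": "TTTTTTTTT"}

def semantic_join(isa, seqs, threshold=0.8):
    """Rule-based join: X isa C, Y isa C, score(X, Y) > threshold."""
    out = []
    for x in isa:
        for y in isa:
            # x < y reports each unordered pair once.
            if x < y and isa[x] == isa[y]:
                if similarity(seqs[x], seqs[y]) > threshold:
                    out.append((x, y, isa[x]))
    return out

# Rule-based link: X produces B, B increased_in Y.
produces = [("secretase", "beta amyloid")]
increased_in = [("beta amyloid", "Alzheimer's disease")]

def rule_join(produces, increased_in):
    """Link X to Y through an intermediate B and keep the path as the label."""
    return [(x, y, ["produces", b, "increased_in"])
            for (x, b) in produces
            for (b2, y) in increased_in if b == b2]
```

The semantic join here scans all pairs and scores each one, which is exactly the cost the slide's challenge refers to: a mediator would try to reduce it to joins on precomputed keys.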

Page 8: XML-Based vs. Model-Based Mediation

[Figure: side-by-side comparison of the two mediation stacks, both built on Raw Data and (XML) Objects.]

XML-Based Mediation:
• XML Models over XML Elements
• Structural Constraints (DTDs): Parent, Child, Sibling, ... e.g., A = (B*|C),D; B = ...
• No Domain Constraints
• Integrated-DTD := XQuery(Src1-DTD,...)

Model-Based Mediation:
• Conceptual Models (C1, C2, C3, R)
• Classes, Relations, Ontologies: is-a, has-a, ...
• Logical Domain Constraints (IF ... THEN ...)
• “Glue” Maps: Domain Maps, Process Maps
• Integrated-CM := CM-QL(Src1-CM,...)

CM ~ {Descr.Logic, ER, UML, RDF/XML(-Schema), …}

CM-QL ~ {F-Logic, DAML+OIL, …}

Page 9: NCMIR ANATOM Domain Map

• concepts
• relations
• logic rules

Page 10: Semantics-Aware Browsing and Querying

[Figure: anatomical concepts linked by "has a" edges, shown together with Source 1, Source 2, and Source 3. Concepts: Cerebellum, Cerebellar Cortex, Granule Cell Layer, Purkinje Cell Layer, Molecular Layer, Purkinje Neuron, Purkinje Cell Dendrite, Dendritic spines, Dendritic shaft, Endoplasmic reticulum.]

Page 11: Formalizing Glue Knowledge: Domain Map for SYNAPSE and NCMIR

Domain Map = labeled graph with concepts ("classes") and roles ("associations")
• additional semantics: expressed as logic rules (F-logic)

Domain Expert Knowledge:

Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).

[Figure: Domain Map (DM) drawn as a graph, and the same DM in Description Logic.]
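A domain map as a labeled graph can be sketched in a few lines of Python. This is a minimal illustration, not the system's data structure: the `DomainMap` class and its method names are hypothetical, the edges are one reading of the domain expert knowledge above, and the `closure` method is a plain-Python stand-in for what the slides express as logic rules (F-logic).

```python
from collections import defaultdict

class DomainMap:
    """Labeled directed graph: nodes are concepts, edge labels are roles."""

    def __init__(self):
        # (concept, role) -> set of related concepts
        self.edges = defaultdict(set)

    def add(self, subject, role, obj):
        self.edges[(subject, role)].add(obj)

    def related(self, subject, role):
        # .get avoids creating empty entries in the defaultdict.
        return set(self.edges.get((subject, role), set()))

    def closure(self, start, roles):
        """Concepts reachable from start along edges labeled with any of roles."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for role in roles:
                for nxt in self.related(node, role):
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
        return seen

# Edges read off the expert-knowledge text (an assumption of this sketch).
dm = DomainMap()
dm.add("Purkinje cell", "has", "dendrite")
dm.add("Pyramidal cell", "has", "dendrite")
dm.add("dendrite", "has", "higher-order branch")
dm.add("higher-order branch", "contains", "spine")
dm.add("spine", "isa", "ion-regulating component")
dm.add("spine", "has", "ion-binding protein")

# Everything a Purkinje cell has or contains, directly or indirectly.
parts = dm.closure("Purkinje cell", {"has", "contains"})
```

Restricting the traversal to chosen roles keeps "isa" edges out of the part-of answer, so "ion-regulating component" is a classification of spines rather than a part of the cell.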

Page 12: Source Registration/Data Contextualization

A source registers its data with an existing ontology; using description logics, it may also refine the mediator’s domain map ... [ICDE01]

Sources can register new concepts at the mediator ...

Page 13: Source Registration: Semantic Annotations

Page 14: Multiple Ways of Querying Data

[Figure: concepts Brain, Cerebellum, Purkinje Cell Layer, Purkinje cell, and neuron, connected by three "has a" edges and one "is a" edge.]

• Spatial Representation (Atlases)
• Ontologies
• Transformations

Page 15: Model-Based Mediator Architecture

[Figure: architecture diagram with the following components.]

• USER/Client: CM Queries & Results (exchanged in XML)
• CM (Integrated View); Integrated View Definition IVD
• Mediator Engine: FL rule proc., LP rule proc., Graph proc., XSB Engine; GCM
• Domain Maps (DMs), Process Maps (PMs), “Glue” Maps (GMs); semantic context CON(S)
• GCM / CM S1, GCM / CM S2, GCM / CM S3
• CM-Wrapper and (XML-Wrapper) for each of the sources S1, S2, S3

First Results & Demos: [SSDBM’00] [VLDB’00] [ICDE’01] [HBP’01] [EDBT’02] [BNCOD’02]

Conceptual Model =
• Object Model
• Knowledge Base
• Contextualization

Page 16: Model-Based Mediation Methodology ...

• Lift Sources to export CMs:
CM(S) = OM(S) + KB(S) + CON(S)

• Object Model OM(S):
– complex objects (frames), class hierarchy, OO constraints

• Knowledge Base KB(S):
– explicit representation of (“hidden”) source semantics
– logic rules over OM(S)

• Contextualization CON(S):
– situate OM(S) data using “glue maps” (GMs):
domain maps DMs (ontology) = terminological knowledge: concepts + roles
process maps PMs = “procedural knowledge”: states + transitions
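The decomposition CM(S) = OM(S) + KB(S) + CON(S) can be pictured as a small Python data structure. This is a sketch under assumptions, not the actual export format: the `ConceptualModel` class, its field layout, the `density_rule` rule, and the sample object are all hypothetical. Frames are modeled as dictionaries, KB rules as functions that derive new attributes, and the contextualization as a lookup from a source-local value to a domain-map concept.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ConceptualModel:
    """CM(S) = OM(S) + KB(S) + CON(S) for one source S (illustrative layout)."""
    om: list[dict]  # OM(S): objects as frames
    kb: list[Callable[[dict], dict]] = field(default_factory=list)  # KB(S): rules over OM(S)
    con: dict[str, str] = field(default_factory=dict)  # CON(S): local value -> domain-map concept

    def export(self):
        """Apply the KB rules to each object, then attach its domain-map concept."""
        out = []
        for obj in self.om:
            derived = dict(obj)
            for rule in self.kb:
                derived.update(rule(derived))
            # Situate the object: map its local region code to a concept, if known.
            derived["concept"] = self.con.get(derived.get("region"))
            out.append(derived)
        return out

def density_rule(obj):
    # Hypothetical "hidden" source semantics made explicit as a rule.
    return {"spine_density": obj["spines"] / obj["length_um"]}

cm = ConceptualModel(
    om=[{"id": "d1", "region": "pcl", "spines": 30, "length_um": 10.0}],
    kb=[density_rule],
    con={"pcl": "Purkinje Cell Layer"},
)
exported = cm.export()
```

Keeping the three parts separate means a source can add rules or change its contextualization without touching the stored objects.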

Page 17: ... Model-Based Mediation Methodology

• Integrated View Definition (IVD)
– declarative (logic) rules with object-oriented features
– defined over CM(S), domain maps, process maps
– needs “mediation engineers” = domain + KRDB experts

• Knowledge-Based Querying and Browsing (runtime):
– mediator composes the user query Q with the IVD
... rewrites (Q o IVD), sends subqueries to sources
... post-processes returned results (e.g., situate in context)
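The runtime steps above (compose Q with the IVD, rewrite, send subqueries, post-process) can be sketched as follows. This is a toy illustration under assumptions, not the mediator engine: the `SOURCES` and `IVD` tables, the function names, and the sample records are hypothetical, and a query is reduced to a single concept name.

```python
# Hypothetical sources: each answers a subquery (a concept name) with records.
SOURCES = {
    "S1": lambda concept: [{"protein": "p1", "site": concept}] if concept == "spine" else [],
    "S2": lambda concept: [{"protein": "p2", "site": concept}] if concept == "dendrite" else [],
}

# Hypothetical IVD: an integrated-view concept defined over source-level concepts.
IVD = {"Purkinje cell": ["dendrite", "spine"]}

def rewrite(query_concept):
    """Compose the query with the IVD: expand it to source-level concepts."""
    return IVD.get(query_concept, [query_concept])

def mediate(query_concept):
    """Send one subquery per expanded concept to every source and merge results."""
    results = []
    for concept in rewrite(query_concept):
        for name, source in SOURCES.items():
            for rec in source(concept):
                # Post-process: situate each result in the context of the query.
                results.append({**rec, "source": name, "context": query_concept})
    return results

answers = mediate("Purkinje cell")
```

In this sketch every subquery goes to every source; a real mediator would use source capabilities and registrations to send each subquery only where it can be answered.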

Page 18: Mediation Scenarios & Techniques

Federated Databases | XML-Based Mediation | Model-Based Mediation
One-World | One-/Multiple-Worlds | Complex Multiple-Worlds
Common Schema | Mediated Schema | Common Glue Maps
SQL, rules | XML query languages | DOOD query languages
Schema Transformations | Syntax-Aware Mappings | Semantics-Aware Mappings
Syntactic Joins | Syntactic Joins | “Semantic” Joins via Glue Maps
DB expert | DB expert | KRDB + domain experts

Page 19: Some Observations

• Scientific Data Integration is different
– e.g., complex and hidden semantics, ...

• Co-Education (CS=>DS, DS=>CS) takes time
– NIH BioInformatics Research Network (BIRN) – Neuroscientists
– DOE Scientific Data Management Center (SDM)
– Starting with Ecologists, Geoscientists, ...

• A good thing about standards:
• There are so many to choose from:
– SQL, HTTP, HTML, XML, XQuery, XSLT, XML Schema, RDF(S), DAML+OIL, DAML-S, UMLS, GO, XMI, SOAP, WSDL, ...

• Syntax is overrated (and its impact underestimated?)
– nobody likes LISP any more, but everybody likes XML ...

• 2nd Marriage of Knowledge Representation & Databases:
– Semantic Web
– (child from 1st marriage: Deductive Databases; aren’t they cute siblings? ;)
=> model-based/semantic mediators

Page 20: The Road Ahead: Scientific Data Integration with the Semantic Web !?

[Figure: diagram labeled with Scientific Data, Integrated Data Views, Internet2, Data-Grid, SOAP, WSDL, XML, XMLDB, RDB, ORDB, XQuery, RDF, OIL, DAML, DAML-S, DOOD rules, Logic, description logics, subsumption, inference, ontologies, and an "Ivory Tower".]

Page 21: Some Related References: Mediation of Neuroscience Data

• Model-Based Mediation with Domain Maps, B. Ludäscher, A. Gupta, M. E. Martone, 17th Intl. Conference on Data Engineering (ICDE), Heidelberg, Germany, IEEE Computer Society, April 2001.

• Navigating Virtual Information Sources with Know-ME, X. Qian, B. Ludäscher, M. E. Martone, A. Gupta, demonstration track, Intl. Conference on Extending Database Technology (EDBT), Prague, Czech Republic, March 2002.

• Model-Based Information Integration in a Neuroscience Mediator System, B. Ludäscher, A. Gupta, M. E. Martone, demonstration track, 26th Intl. Conference on Very Large Databases (VLDB), Cairo, Egypt, September 2000.

• Knowledge-Based Integration of Neuroscience Data Sources, A. Gupta, B. Ludäscher, M. E. Martone, 12th Intl. Conference on Scientific and Statistical Database Management (SSDBM), Berlin, Germany, IEEE Computer Society, July 2000.

• A Cell-Centered Database for Electron Tomographic Data, M. E. Martone, A. Gupta, M. Wong, X. Qian, G. Sosinsky, S. Lamont, B. Ludäscher, and M. H. Ellisman. Journal of Structural Biology, 2002. to appear