Towards Semantic Typing Support for Scientific Workflows
Bertram Ludäscher
Knowledge-Based Information Systems Lab, San Diego Supercomputer Center, University of California San Diego
http://seek.ecoinformatics.org http://www.geongrid.org
B. Ludäscher – Scientific Data Management 2
Outline
1. Motivation: Traditional vs Scientific Data Integration
2. Semantic (a.k.a. Model-Based) Mediation
3. Scientific Workflows (a.k.a. Analysis Pipelines)
4. DB Theory Appetizer: Web Service Composition Through Declarative Queries
Information Integration Challenges
• System aspects: “Grid” Middleware
  – distributed data & computing
  – Web Services, WSDL/SOAP, OGSA, …
  – sources = functions, files, data sets, …
• Syntax & Structure: (XML-Based) Data Mediators
  – wrapping, restructuring
  – (XML) queries and views
  – sources = (XML) databases
• Semantics: Model-Based/Semantic Mediators
  – conceptual models and declarative views
  – Knowledge Representation: ontologies, description logics (RDF(S), OWL, ...)
  – sources = knowledge bases (DB+CMs+ICs)
[Figure labels: Syntax, Structure, Semantics, System aspects]
  – reconciling S4 heterogeneities
  – “gluing” together resources
  – bridging information and knowledge gaps computationally
Information Integration from a DB Perspective
• Information Integration Problem
  – Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user questions Q1, ..., Qn that can be answered using the Si
  – Find: the answers to Q1, ..., Qn
• The Database Perspective:
  – source = “database”
    • Si has a schema (relational, XML, OO, ...)
    • Si can be queried
  – define virtual (or materialized) integrated/global view G over S1, ..., Sk using database query languages (SQL, XQuery, ...)
    • questions become queries Qi against G(S1, ..., Sk)
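The database perspective can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the two in-memory sources `S1` and `S2`, their schemas, the placeholder rows, and the function names `G` and `Q` are hypothetical and do not come from any particular mediator system.

```python
# Hypothetical sources with different schemas (placeholder data only).
S1 = [{"isbn": "i1", "author": "a1", "title": "t1"}]          # dict-shaped rows
S2 = [("i2", "a2", "t2"), ("i1", "a1", "t1")]                 # tuple-shaped rows


def G():
    """Virtual integrated view G(S1, S2): computed on demand, never stored."""
    seen = set()
    for row in S1:
        t = (row["isbn"], row["author"], row["title"])
        if t not in seen:
            seen.add(t)
            yield t
    for t in S2:
        if t not in seen:
            seen.add(t)
            yield t


def Q(author):
    """A user question posed against G instead of the individual sources."""
    return sorted(title for (_isbn, a, title) in G() if a == author)


print(Q("a1"))
```

The user query only mentions `G`; which source contributed a row stays hidden behind the view definition.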
Standard (XML-Based) Mediator Architecture
[Figure: mediator architecture]
• USER/Client
  – 1. Query Q(G(S1, ..., Sk))
  – 6. {answers(Q)}
• MEDIATOR
  – Integrated Global (XML) View G
  – Integrated View Definition: G(..) ← S1(..) … Sk(..)
  – 3. Q1 Q2 Q3
  – 4. {answers(Q1)} {answers(Q2)} {answers(Q3)}
• Sources S1, S2, …, Sk
  – each source exposes an (XML) View through a Wrapper
  – web services as wrapper APIs
Query Planning for Mediators
• Given:
  – User query Q: answer(…) ← … G …
  – … & { G ← … S … } global-as-view (GAV)
  – … & { S ← … G … } local-as-view (LAV)
  – … & { false ← … S … G … } integrity constraints (ICs)
• Find:
  – equivalent (or min. containing, max. contained) query plan Q’: answer(…) ← … S …
• Results:
  – A variety of results/algorithms; depending on classes of queries, views, and ICs: P, NP, …, undecidable
  – many variants still open
From Scientific Data Integration to Process & Application Integration (and back…)
• Data Integration
  – Database mediation + knowledge-based extension
  – Query rewriting w/ GAV, LAV, ICs, access patterns
• “Process/Application” Integration
  – Scientific models (ocean, atmosphere, ecology, …), assimilation models (e.g., real-time data feeds), …
  – Data sets
  – Legacy tools
  – Components = web services
  – Applications = composite components (“workflows”)
  – Need for semantic type extensions
Geologic Map Integration
• Given:
  – Geologic maps from different state geological surveys (shapefiles w/ different data schemas)
  – Different ontologies:
    • Geologic age ontology
    • Rock type ontologies:
      – Multiple hierarchies (chemical, fabric, texture, genesis) from Geological Survey of Canada (GSC)
      – Single hierarchy from British Geological Survey (BGS)
• Problem
  – Support uniform queries against the multiple geologic maps using different ontologies
  – Support registration w/ ontology A, querying w/ ontology B
Ontology Mappings: Motivation
• Establish correspondences between ontologies
  – Integrate data sets which are registered to different ontologies
  – Query data sets through different ontologies
[Figure: Data set 1 and Data set 2 each “register” to an ontology (Ontology A, Ontology B); further labels: Ontology mappings, queries]
A Multi-Hierarchical Rock Classification Ontology (GSC)
Composition
Genesis
Fabric
Texture
Some enabling operations on “ontology data”
• Concept expansion (shown on the Composition hierarchy):
  – what else to look for when asking for ‘Mafic’
• Generalization (shown on the Composition hierarchy):
  – finding data that is “like” X and Y
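The two operations can be sketched over a tiny is-a hierarchy. This is only an illustration: the `PARENT` table below is a made-up fragment, not the actual GSC composition hierarchy, and the function names `expand`, `ancestors`, and `generalize` are hypothetical.

```python
# Illustrative child -> parent links; NOT the actual GSC ontology.
PARENT = {
    "Basalt": "Mafic",
    "Gabbro": "Mafic",
    "Granite": "Felsic",
    "Mafic": "Igneous",
    "Felsic": "Igneous",
    "Igneous": "Rock",
}


def expand(concept):
    """Concept expansion: the concept itself plus all of its descendants."""
    result = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def ancestors(concept):
    """Path from a concept up to the root, starting with the concept itself."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def generalize(x, y):
    """Generalization: the most specific concept that subsumes both x and y."""
    above_y = set(ancestors(y))
    for c in ancestors(x):
        if c in above_y:
            return c
    return None
```

A query for ‘Mafic’ would then be evaluated against every term in `expand("Mafic")`, while `generalize("Basalt", "Granite")` names the concept under which data “like” both can be found.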
Implementation in OWL: Not only “for the machine” …
Geologic Map Integration
[Figure labels: domain knowledge; Knowledge representation; Ontologies!?; Nevada]
GEON Metamorphism Equation:
Geoscientists + Computer Scientists → Igneous Geoinformaticists (+/- Energy, +/- a few hundred million years)
Geology Workbench: Registering Data to an Ontology
Step 1: Choose Classes
[Screenshot callouts: Click on Submission; Data set name; Select a shapefile; Choose an ontology class]
Geology Workbench: Data Registration
Step 2: Choose Columns for Selected Classes
[Screenshot: shapefile columns AREA, PERIMETER, AZ_1000, AZ_1000_ID, GEO, PERIOD, ABBREV, DESCR, D_SYMBOL, P_SYMBOL; callout: “It contains information about geologic age”]
Geology Workbench: Data Registration
Step 3: Resolve Mismatches
[Screenshot callouts: Two terms are not matched to any ontology terms; Manually mapping algonkian into the ontology]
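The mismatch-resolution step can be sketched as follows. This is a simplified illustration, not the Geology Workbench implementation: the term set `AGE_TERMS`, the manual mapping of `algonkian` to `proterozoic`, and the function `register` are all assumptions.

```python
# Hypothetical ontology terms and manual mappings, for illustration only.
AGE_TERMS = {"paleozoic", "mesozoic", "cenozoic", "proterozoic"}
MANUAL = {"algonkian": "proterozoic"}  # assumed target of the manual mapping


def register(values, terms, manual):
    """Match column values to ontology terms; report what stays unmatched."""
    matched, unmatched = {}, []
    for v in values:
        key = v.strip().lower()
        if key in terms:
            matched[v] = key              # direct match
        elif key in manual:
            matched[v] = manual[key]      # resolved by a manual mapping
        else:
            unmatched.append(v)           # still needs the data provider
    return matched, unmatched
```

Values left in `unmatched` correspond to the terms the screenshot flags for manual resolution.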
Geology Workbench: Ontology-enabled Map Integrator
[Screenshot callouts: Click on the name; Choose interesting classes; All areas with the age Paleozoic]
Geology Workbench: Change Ontology
[Screenshot callouts: Submit a mapping; Ontology mapping between British Rock Classification and Canadian Rock Classification; Switch from Canadian Rock Classification to British Rock Classification; Run it; New query interface]
Ontologies and Data Management
[Figure labels: Schema, Conceptual Model, Ontology, Data, Metadata, Design Artifact; relationship: “use concepts from” (explicitly or implicitly)]
• How to define and refine an ontology?
• How to register a dataset to an ontology?
Biomedical Informatics Research Network, http://nbirn.net
Refining an Ontology – the logic way, enables “Source Contextualization”
Connecting Datasets to Ontologies: “Semantic Registration”
Dataset
  Date        Site  Transect  SP_Code  Count
  2000-09-08  CARP  1         CRGI     0
  2000-09-08  CARP  4         LOCH     0
  2000-09-08  CARP  7         MUCA     1
  2000-09-22  NAPL  7         LOCH     1
  2000-09-18  NAPL  1         PAPA     5
  2000-09-28  BULL  1         CYOS     57
Ontology (snippet)
  DataCollectionEvent ⊑ contains.Measurement
  Measurement ⊑ measureOf.MeasurableItem ⊓ hasContext.MeasurementContext
  MeasurementContext ⊑ hasTime.DateTime ⊓ hasLocation.Location
  MeasurableItem ⊑ hasUnit.Unit ⊓ hasValue.UnitValue
  SpeciesCount ⊑ MeasurableItem ⊓ hasSpecies.Species ⊓ hasUnit.RatioUnit
  …
  SpeciesAbundance ⊑ Measurement ⊓ measureOf.SpeciesCount
  AbundanceCollectionEvent ⊑ DataCollectionEvent ⊓ contains.SpeciesAbundance
  Location ⊑ position.Coordinate
  LTERSite ⊑ Location
  SBLTERSite ⊑ LTERSite ⊓ position.SBLTERCoordinate
  {naples,…} ⊑ SBLTERSite
How can we “register” the dataset to concepts in the Ontology?
Purpose of Semantic Registration
• Expose “hidden” information:
  – What do attributes represent?
  – What do specific values represent?
  – What conceptual “objects” are in the dataset?
• Capture connections between the dataset and ontology to:
  – Find existing datasets (or parts of datasets) via ontological concepts (discovery)
  – Enable integration of datasets (mediation)
  – Generate metadata for new data products (in a pipeline)
Semantic Registration Framework
Step 1: Data provider selects relevant ontological concepts (for the dataset)
Step 2: The semantic registration system creates a structural representation based on the chosen concepts (the data provider refines it if needed)
Step 3: The data provider maps the dataset information to the generated structural representation
Step 1: Selecting Relevant Concepts
Dataset
  Date        Site  Transect  SP_Code  Count
  2000-09-08  CARP  1         CRGI     0
  2000-09-08  CARP  4         LOCH     0
  2000-09-08  CARP  7         MUCA     1
  2000-09-22  NAPL  7         LOCH     1
  2000-09-18  NAPL  1         PAPA     5
  2000-09-28  BULL  1         CYOS     57
Concepts from an Ontology
• DataCollectionEvent
  • AbundanceCollectionEvent
• Measurement
  • Abundance
    • SpeciesAbundance
• MeasurableItem
  • SpeciesCount
• Location
  • LTERSite
    • SBLTERSite
      • naples
• Species
  • …
• MeasurementContext
  • …
Step 2: Generate Object Model
[Figure: generated object model]
• AbundanceCollectionEvent
  – contains → SpeciesAbundance
    • measureOf → SpeciesCount
      – hasSpecies → Species
      – hasUnit → RatioUnit
      – hasValue → RatioValue
• Context edges: hasTime → DateTime, hasLoc → SBLTERSite
Concepts from an Ontology
• DataCollectionEvent
  • AbundanceCollectionEvent
• Measurement
  • Abundance
    • SpeciesAbundance
• MeasurableItem
  • SpeciesCount
• Location
  • LTERSite
    • SBLTERSite
      • naples
• Species
  • …
• MeasurementContext
  • …
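Step 3 of the registration framework (mapping dataset information onto the generated object model) might look roughly like the sketch below. The class layout follows the object model named on the slide, but several choices are assumptions: attaching `has_time` and `has_loc` directly to the event, the unit label `"individuals"`, dropping the Transect column, and the function name `register_row`.

```python
from dataclasses import dataclass


@dataclass
class SpeciesCount:
    has_species: str   # taken from SP_Code
    has_unit: str      # RatioUnit; the label used below is an assumption
    has_value: int     # taken from Count


@dataclass
class SpeciesAbundance:
    measure_of: SpeciesCount


@dataclass
class AbundanceCollectionEvent:
    has_time: str      # taken from Date (assumed to attach to the event)
    has_loc: str       # taken from Site (assumed to attach to the event)
    contains: SpeciesAbundance


# The dataset rows shown in the talk: Date, Site, Transect, SP_Code, Count.
ROWS = [
    ("2000-09-08", "CARP", 1, "CRGI", 0),
    ("2000-09-08", "CARP", 4, "LOCH", 0),
    ("2000-09-08", "CARP", 7, "MUCA", 1),
    ("2000-09-22", "NAPL", 7, "LOCH", 1),
    ("2000-09-18", "NAPL", 1, "PAPA", 5),
    ("2000-09-28", "BULL", 1, "CYOS", 57),
]


def register_row(row):
    """Map one flat row to the object model; Transect is ignored in this sketch."""
    date, site, _transect, sp_code, count = row
    return AbundanceCollectionEvent(
        has_time=date,
        has_loc=site,
        contains=SpeciesAbundance(SpeciesCount(sp_code, "individuals", count)),
    )


events = [register_row(r) for r in ROWS]
```

Once rows are held as typed objects, a question such as "all SpeciesAbundance measurements at a given site" can be posed against ontology concepts rather than column names.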
Scientific Workflows
Promoter Identification Workflow (PIW)
Source: Matt Coleman (LLNL)
Source: NIH BIRN (Jeffrey Grethe, UCSD)
Ecology: GARP Analysis Pipeline for Invasive Species Prediction
[Figure: GARP analysis pipeline]
• Processing steps: EcoGrid Query, Layer Integration, Sample Data, Data Calculation (+A1, +A2, +A3), Map Generation, Validation (with User), Generate Metadata, Archive To EcoGrid
• Data products:
  – (a) Species presence & absence points (native range; invasion area)
  – (b) Environmental layers (native range; invasion area)
  – (c) Integrated layers (native range; invasion area)
  – (d) Training sample; Test sample
  – (e) GARP rule set
  – (f) Native range prediction map; Invasion area prediction map
  – (g) Model quality parameter
  – (h) Selected prediction maps
• Data sources: Registered EcoGrid Databases
Source: NSF SEEK (Deana Pennington et al., UNM)
Scientific Workflows: Some Findings
• More dataflow than (business) workflow
• Need for “programming extension”
  – Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …)
• Need for abstraction and nested workflows
• Need for data transformations
• Need for rich user interaction & workflow steering:
  – pause / revise / resume
  – select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF
• Need for high-throughput transfers (“grid-enabling”, “streaming”)
• Need for persistence of intermediate products
  – data provenance (“virtual data” concept)
Our Starting Point: Dataflow Process Networks and Ptolemy II
see! try! read!
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
Kepler Team, Projects, Sponsors
• Ilkay Altintas (SDM)
• Chad Berkley (SEEK)
• Shawn Bowers (SEEK)
• Jeffrey Grethe (BIRN)
• Christopher H. Brooks (Ptolemy II)
• Zhengang Cheng (SDM)
• Efrat Jaeger (GEON)
• Matt Jones (SEEK)
• Edward A. Lee (Ptolemy II)
• Kai Lin (GEON)
• Ashraf Memon (GEON)
• Bertram Ludaescher (BIRN, GEON, SDM, SEEK)
• Steve Mock (NMI)
• Steve Neuendorffer (Ptolemy II)
• Mladen Vouk (SDM)
• Yang Zhao (Ptolemy II)
• …
[Figure label: Ptolemy II]
Commercial Workflow/Dataflow Systems
SCIRun: Problem Solving Environments for Large-Scale Scientific Computing
• SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations
• Component model, based on generalized dataflow programming
Steve Parker (cs.utah.edu)
E-Science and Link-Up Buddies
• … <UPDATE ME> …
  – Taverna, Scufl, Freefluo, ..
  – DiscoveryNet
  – Triana
  – ICENI
  – …
Dataflow Process Networks: Putting Computation Models first!
• Synchronous Dataflow Network (SDF)
  – Statically schedulable single-threaded dataflow
    • Can execute multi-threaded, but the firing-sequence is known in advance
  – Maximally well-behaved, but also limited expressiveness
• Process Network (PN)
  – Multi-threaded dynamically scheduled dataflow
  – More expressive than SDF (dynamic token rate prevents static scheduling)
  – Natural streaming model
• Other Execution Models (“Domains”)
  – Implemented through different “Directors”
[Figure labels: actor, actor, typed i/o ports, FIFO, advanced push/pull]
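The process-network idea (actors running concurrently and communicating only through FIFO channels) can be sketched with Python threads and queues. This is an illustration of the model of computation, not Ptolemy II or Kepler code; the names `DONE`, `source`, `actor`, and `run_pipeline` are hypothetical.

```python
import queue
import threading

DONE = object()  # end-of-stream token passed along every channel


def source(items, out):
    """Source actor: writes each item to its output FIFO, then signals the end."""
    for x in items:
        out.put(x)
    out.put(DONE)


def actor(f, inp, out):
    """Actor: blocking read on the input FIFO, apply f, write to the output FIFO."""
    while True:
        token = inp.get()
        if token is DONE:
            out.put(DONE)
            return
        out.put(f(token))


def run_pipeline(items, *functions):
    """Wire one thread per actor into a linear network and collect the results."""
    channels = [queue.Queue() for _ in range(len(functions) + 1)]
    threads = [threading.Thread(target=source, args=(items, channels[0]))]
    for f, inp, out in zip(functions, channels, channels[1:]):
        threads.append(threading.Thread(target=actor, args=(f, inp, out)))
    for t in threads:
        t.start()
    results = []
    while True:
        token = channels[-1].get()
        if token is DONE:
            break
        results.append(token)
    for t in threads:
        t.join()
    return results
```

Each channel has a single writer and a single reader, so token order is preserved even though scheduling is left to the thread runtime, which is the dynamic scheduling that distinguishes PN from SDF.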
Promoter Identification Workflow in Ptolemy-II [SSDBM’03]
[Figure label: Execution Semantics]
[Figure callouts on the Ptolemy-II PIW]
  – hand-crafted control solution; also: forces sequential execution!
  – designed to fit (shown twice)
  – hand-crafted Web-service actor
  – Complex backward control-flow
  – No data transformations available
Simplified Process Network PIW
• Back to purely functional dataflow process network (= a data streaming model!)
• Re-introducing map(f) to Ptolemy-II (was there in PT Classic)
  – no control-flow spaghetti
  – data-intensive apps
  – free concurrent execution
  – free type checking
  – automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]
[Figure callouts]
  – map(f)-style iterators
  – Powerful type checking
  – Generic, declarative “programming” constructs
  – Generic data transformation actors
  – Forward-only, abstractable sub-workflow piw(GeneId)
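Going from piw(GeneId) to PIW := map(piw) can be sketched in a few lines. The body of `piw` below is a hypothetical stand-in (the real sub-workflow calls several services), and `lift` is an illustrative name for the map(f) construct.

```python
def piw(gene_id):
    """Hypothetical stand-in for the single-gene sub-workflow piw(GeneId)."""
    return {"gene": gene_id, "promoter": f"promoter-of-{gene_id}"}


def lift(f):
    """map(f): turn a function on one item into a function on a list of items."""
    return lambda xs: [f(x) for x in xs]


# PIW := map(piw) over [GeneId]
PIW = lift(piw)

print(PIW(["g1", "g2"]))
```

The sub-workflow itself contains no loop or feedback edge; iteration over the gene list is supplied entirely by the generic `lift` construct.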
Optimization by Declarative Rewriting
• PIW as a declarative, referentially transparent functional process
  – optimization via functional rewriting possible, e.g. map(f o g) = map(f) o map(g)
• Details:
  – Technical report & PIW specification in Haskell
[Figure callouts]
  – map(f o g) instead of map(f) o map(g)
  – Combination of map and zip
http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
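The rewriting law can be checked on a small example. This Python sketch is not the Haskell specification from the technical report; `compose`, `fmap`, and the sample functions `f` and `g` are illustrative, and the law is only guaranteed when `f` and `g` are free of side effects.

```python
def compose(f, g):
    """Function composition: (f o g)(x) = f(g(x))."""
    return lambda x: f(g(x))


def fmap(f):
    """map(f) as a list-to-list function."""
    return lambda xs: [f(x) for x in xs]


f = lambda x: x + 1
g = lambda x: x * 2
xs = [1, 2, 3]

two_pass = compose(fmap(f), fmap(g))(xs)   # map(f) o map(g): two traversals
one_pass = fmap(compose(f, g))(xs)         # map(f o g): one traversal

print(two_pass == one_pass)
```

The fused form avoids materializing the intermediate list, which is the kind of optimization a declarative workflow description makes available to the system.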
Web Services & Scientific Workflows in Kepler
• Web services = individual components (“actors”)
• “Minute-Made” Application Integration:
  – Plugging-in and harvesting web service components is easy and fast
• Rich SWF modeling semantics (“directors” and more):
  – Different and precise dataflow models of computation
  – Clear and composable component interaction semantics
  – Web service composition and application integration tool
• Coming soon:
  – Shrink-wrapped, pre-packaged “Kepler-to-Go” (v0.8)
  – SWFs with structural and semantic data types (better design support)
  – Grid-enabled web services (for big data, big computations, …)
  – Different deployment models (SWF → WS, web site, applet, …)
KEPLER Core Capabilities (1/2)
• Designing scientific workflows
  – Composition of actors (tasks) to perform a scientific WF
• Actor prototyping
• Accessing heterogeneous data
  – Data access wizard to search and retrieve Grid-based resources
  – Relational DB access and query
  – Ability to link to EML data sources
KEPLER Core Capabilities (2/2)
• Data transformation actors to link heterogeneous data
• Executing scientific workflows
  – Distributed and/or local computation
  – Various models for computational semantics and scheduling
  – SDF and PN: Most common for scientific workflows
• External computing environments:
  – C++, Python, C (… Perl--planned ...)
• Deploying scientific tasks and workflows as web services themselves (… planned …)
The KEPLER GUI (Vergil)
Drag and drop utilities, director and actor libraries.
Distributed SWFs in KEPLER
• Web and Grid Service plug-ins
  – WSDL, and whatever comes after GWSDL
  – ProxyInit, GlobusGridJob, GridFTP, DataAccessWizard
• WS Harvester
  – Imports all the operations of a specific WS (or of all the WSs in a UDDI repository) as Kepler actors
• WS-deployment interface (…ongoing work…)
• XSLT and XQuery transformers to link non-fitting services together
A Generic Web Service Actor
Given a WSDL and the name of an operation of a web service, dynamically customizes itself to implement and execute that method.
Configure - select service operation
Set Parameters and Commit
Web Service Harvester
• Imports the web services in a repository into the actor library.
• Has the capability to search for web services based on a keyword.
Composing 3rd-Party WSs
[Figure: Output of previous web service → User interaction & Transformations → Input of next web service]
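The composition pattern (a transformation step between two services whose output and input do not fit) can be sketched with plain functions standing in for the services. Everything here is hypothetical: `lookup_service`, `analysis_service`, their record shapes, and the placeholder sequence are made up for illustration and do not correspond to any real web service.

```python
def lookup_service(gene_id):
    """Stand-in for a first web service; returns a nested record."""
    return {"result": {"id": gene_id, "sequence": "ACGT"}}


def analysis_service(sequence):
    """Stand-in for a second web service; expects a plain string."""
    return len(sequence)


def transform(response):
    """Transformation step: extract the field the next service needs."""
    return response["result"]["sequence"]


def pipeline(gene_id):
    """Output of the first service, transformed, becomes input of the next."""
    return analysis_service(transform(lookup_service(gene_id)))
```

In a workflow system the `transform` step would be an explicit actor (for example an XSLT or XQuery transformer), so the two services stay unchanged.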
Result launched via the BrowserUI actor
Querying Example
KEPLER and YOU
• Kepler …
  – is a community-based, cross-project, open source collaboration
  – uses web services as basic building blocks
  – has a joint CVS repository, mailing lists, web site, …
  – is gaining momentum thanks to contributors and contributions
• BSD-style license allows commercial spin-offs
  – a pre-packaged, shrink-wrapped version (“Kepler-to-GO”) coming soon to a place near you…
Now back to the “Semantics Stuff”
Semantic Types for Scientific Workflows
From Semantic to Structural Mappings
Summary I: Putting it all together for the Science Environment for Ecological Knowledge (SEEK)
• Large collaborative NSF/ITR project: UNM, UCSB, UCSD, UKansas, ...
• Goals: global access to ecologically relevant data; rapidly locate and utilize distributed computation; (semi-)automate, streamline analysis process – “Knowledge Discovery Workflows”
[Figure: SEEK architecture]
• AM: Analysis & Modeling System (KEPLER)
  – Analytical Pipeline (AP) with steps labeled ASx, ASy, ASz, TS1, TS2
  – Library of Analysis Steps, Pipelines & Results (label: ASr)
  – Example of “AP0”: Invasive species over time
• Semantic Mediation System (SMS)
  – Semantic Mediation Engine, Data Binding, Query Processing
  – Logic Rules (ECO2-CL), Parameter Ontologies (ECO2), Parameters w/ Semantics
• EcoGrid
  – provides unified access to Distributed Data Stores, Parameter Ontologies, & Stored Analyses, and runtime capabilities via the Execution Environment
  – Labels: SRB, KNB, MC, Species, WrpDar, ...; ECO2, TaxOn, EML, etc.
  – Raw data sets wrapped for integration w/ EML, etc.
  – Execution Environment: SAS, MATLAB, FORTRAN, etc.
• Semantic Mediation System & Analysis and Modeling System use WSDL/UDDI to access services within the EcoGrid, enabling analytically driven data discovery and integration
• SEEK is the combination of EcoGrid data resources and information services, coupled with advanced semantic and modeling capabilities
Planning with Limited Access Patterns (back to GAV mediation …)
• User query Q:
  answer(ISBN, Author, Title) ← book(ISBN, Author, Title), catalog(ISBN, Author), not library(ISBN).
• Limited (web service) APIs (access patterns):
  – Src1.books: in: ISBN out: Author, Title
  – Src1.books: in: Author out: ISBN, Title
  – Src2.catalog: in: {} out: ISBN, Author
  – Src3.library: in: {} out: ISBN
• Note: Q is not executable, but feasible (equivalent to executable Q’: catalog ; book ; not library)
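How the executable order Q’ arises can be sketched with a greedy scan over the literals of Q. This is a simplified illustration rather than the algorithm from the cited work: the encoding of literals and access patterns, the rule that a negated literal needs all of its variables bound, and the names `callable_now` and `executable_order` are assumptions.

```python
# Each literal of Q: (relation, variables, negated)
Q = [
    ("book", ("ISBN", "Author", "Title"), False),
    ("catalog", ("ISBN", "Author"), False),
    ("library", ("ISBN",), True),
]

# Access patterns from the slide: relation -> list of input-position sets.
PATTERNS = {
    "book": [{0}, {1}],     # in: ISBN, or in: Author
    "catalog": [set()],     # no inputs required
    "library": [set()],     # no inputs required
}


def callable_now(literal, bound):
    """Can this literal be evaluated given the currently bound variables?"""
    rel, variables, negated = literal
    if negated and not set(variables) <= bound:
        return False  # negation acts as a filter: all variables must be bound
    return any(all(variables[i] in bound for i in ins) for ins in PATTERNS[rel])


def executable_order(query):
    """Repeatedly pick the first literal whose required inputs are bound."""
    bound, plan, rest = set(), [], list(query)
    while rest:
        nxt = next((lit for lit in rest if callable_now(lit, bound)), None)
        if nxt is None:
            break  # remaining literals can never be called
        plan.append(nxt)
        rest.remove(nxt)
        if not nxt[2]:
            bound |= set(nxt[1])  # positive literals bind their variables
    return plan, rest


plan, rest = executable_order(Q)
print([rel for rel, _vars, _neg in plan])
```

In the written order, `book` cannot be called first because neither ISBN nor Author is bound yet; after `catalog` has been called both are available, which yields the order catalog ; book ; not library.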
Query Feasibility is as hard as Containment
• Theorem [EDBT’04]: For UCQneg queries Q: Q is feasible iff ans(Q) ⊑ Q
• The answerable part ans(Q) can be computed in quadratic time.
  – Idea: scan Q for answerable literals, rescan, repeat until ans(Q) is reached
• Checking query containment Q1 ⊑ Q2 is hard:
  – Already NP-complete for CQ (conjunctive queries)
  – Undecidable for FO (first-order logic queries)
Conjunctive Query Containment
• Given: conjunctive queries Q1, Q2 (aka Select-Project-Join queries)
• Problem: Is answers(D, Q1) ⊆ answers(D, Q2) for all databases D?
• If yes, we say that “Q1 is contained in Q2”; short: Q1 ⊑ Q2
• Examples:
  Q1: answer(X) ← student(X, cs)
  Q2: answer(X) ← student(X, Dept), advisor(X, Y), dept(Y, cs)
  Q3: answer(X) ← student(X, Dept)
• Quiz:
  – Q1 ⊑ Q2 ?
  – No: not every student X necessarily has an advisor Y who is in the cs department!
  – Q1 ⊑ Q3 ?
  – Yes: every cs student is a student in some department (crux of the “proof”: Dept = cs)
• Homework: What about Q1 ⊑ Q2 if we know that every student must have an advisor from the same department?
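The quiz answers can be checked with a brute-force containment test based on containment mappings: Q1 is contained in Q2 exactly when the variables of Q2 can be mapped to terms of Q1 so that the heads coincide and every body atom of Q2 lands on a body atom of Q1. The sketch below is an illustration in Python, not the Prolog program from the talk; the query encoding, the Prolog-style convention that variables start with an uppercase letter, and the name `contained_in` are assumptions.

```python
from itertools import product


def is_var(term):
    """Prolog-style convention: variables start with an uppercase letter."""
    return term[0].isupper()


def contained_in(q1, q2):
    """True iff q1 is contained in q2, via a containment mapping from q2 into q1."""
    head1, body1 = q1
    head2, body2 = q2
    vars2 = sorted({t for _rel, args in [head2] + body2 for t in args if is_var(t)})
    terms1 = sorted({t for _rel, args in [head1] + body1 for t in args})
    atoms1 = set(body1)
    # Try every assignment of q2's variables to q1's terms (exponential search).
    for image in product(terms1, repeat=len(vars2)):
        h = dict(zip(vars2, image))
        sub = lambda args: tuple(h.get(t, t) for t in args)  # constants stay fixed
        if sub(head2[1]) != head1[1]:
            continue
        if all((rel, sub(args)) in atoms1 for rel, args in body2):
            return True
    return False


Q1 = (("answer", ("X",)), [("student", ("X", "cs"))])
Q2 = (("answer", ("X",)), [("student", ("X", "Dept")),
                           ("advisor", ("X", "Y")),
                           ("dept", ("Y", "cs"))])
Q3 = (("answer", ("X",)), [("student", ("X", "Dept"))])

print(contained_in(Q1, Q2), contained_in(Q1, Q3))
```

For Q1 and Q3 the mapping Dept = cs succeeds, matching the crux of the proof above; for Q1 and Q2 no mapping exists because Q1 has no advisor atom.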
The World’s Shortest Conjunctive Query Containment Checker (an NP-complete problem): 7 lines in Prolog …
Quiz:
  1. Find the bug in the 7 lines of code
  2. Fix the bug (hint: add one more line of code)
Moral: Short programs can be buggy too
Summary II: Got milk/eggs/meat/wool? Or: “Die eierlegende Wollmilchsau …” (the “egg-laying wool-milk-sow”, i.e. a do-everything solution)
• Data Integration
  – query rewriting under GAV/LAV
  – w/ binding pattern constraints
  – distributed query processing
• Semantic Mediation
  – semantic integrity constraints, reasoning w/ plans, automated deduction
  – deductive database/logic programming technology, AI “stuff” ...
  – Semantic Web technology
• Scientific Workflow Management
  – more procedural than database mediation (the scientist is the “query planner”)
  – deployment using web services
Building the EcoGrid
[Figure: EcoGrid map]
• Site labels: AND, LUQ, HBR, NTL, VCR
• Node types: Metacat node, Legacy system, SRB node, DiGIR node, VegBank node, Xanthoria node
• Networks:
  – LTER Network (24)
  – Natural History Collections (>> 100)
  – Organization of Biological Field Stations (180)
  – UC Natural Reserve System (36)
  – Partnership for Interdisciplinary Studies of Coastal Oceans (4)
  – Multi-agency Rocky Intertidal Network (60)
Source: Matthew Jones (UCSB)
Heterogeneous Data integration
• Requires advanced metadata and processing
  – Attributes must be semantically typed
  – Collection protocols must be known
  – Units and measurement scale must be known
  – Measurement relationships must be known
    • e.g., that ArealDensity = Count/Area
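The measurement relationship ArealDensity = Count/Area can be sketched as a derived quantity that refuses inputs with unsuitable units. This is a minimal illustration: the `Quantity` class, the unit labels `"individuals"` and `"m^2"`, and the function name `areal_density` are assumptions, not part of any SEEK component.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    value: float
    unit: str  # hypothetical unit labels, e.g. "individuals", "m^2"


def areal_density(count, area):
    """ArealDensity = Count / Area; rejects inputs whose units do not fit."""
    if count.unit != "individuals":
        raise ValueError(f"expected a count, got unit {count.unit!r}")
    if area.unit != "m^2":
        raise ValueError(f"expected an area in m^2, got unit {area.unit!r}")
    if area.value <= 0:
        raise ValueError("area must be positive")
    return Quantity(count.value / area.value, "individuals/m^2")
```

Without the unit metadata, two datasets reporting raw counts over different plot sizes would look directly comparable when they are not.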
Ecological ontologies
• What was measured (e.g., biomass)
• Type of measurement (e.g., Energy)
• Context of measurement (e.g., Psychotria limonensis)
• How it was measured (e.g., dry weight)
• SEEK intends to enable community-created ecological ontologies using OWL
  – Represents a controlled vocabulary for ecological metadata
• More about this in Bertram’s talk
Semantic Mediation
• Label data with semantic types (e.g. concept expressions in OWL)
• Label inputs and outputs of analytical components with semantic types
• Use reasoning engines to generate transformation steps
  – Observe analytical constraints
• Use reasoning engine to discover relevant components
[Figure labels: Data, Ontology, Workflow Components]
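Discovering relevant components from semantic types can be sketched as a subsumption check between the type of a dataset and the input type of each component. The sketch is illustrative only: the `SUBCLASS_OF` links loosely follow the ontology snippet in this talk, while the component names, their port types, and the functions `subsumed_by` and `discover` are hypothetical. A real system would delegate the subsumption test to a description logic reasoner.

```python
# Illustrative subsumption links (child -> parent), loosely following the
# ontology snippet in this talk.
SUBCLASS_OF = {
    "SpeciesAbundance": "Abundance",
    "Abundance": "Measurement",
    "SBLTERSite": "LTERSite",
    "LTERSite": "Location",
}

# Hypothetical components with semantically typed input and output ports.
COMPONENTS = {
    "plot_measurements": {"in": "Measurement", "out": "Plot"},
    "site_lookup": {"in": "Location", "out": "Coordinate"},
}


def subsumed_by(c, d):
    """True iff concept c is d or a (transitive) subconcept of d."""
    while c is not None:
        if c == d:
            return True
        c = SUBCLASS_OF.get(c)
    return False


def discover(data_type):
    """Components whose input port accepts data of the given semantic type."""
    return sorted(
        name for name, ports in COMPONENTS.items()
        if subsumed_by(data_type, ports["in"])
    )
```

A dataset labeled `SpeciesAbundance` is accepted by any component that takes a `Measurement`, but not the other way around, which is what distinguishes semantic typing from matching on identical labels.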