Towards Semantic Typing Support for Scientific Workflows

Bertram Ludäscher

Knowledge-Based Information Systems Lab
San Diego Supercomputer Center
University of California San Diego

http://seek.ecoinformatics.org http://www.geongrid.org

http://www.sdsc.edu/

B. Ludäscher – Scientific Data Management 2

Outline

1. Motivation: Traditional vs Scientific Data Integration

2. Semantic (a.k.a. Model-Based) Mediation

3. Scientific Workflows (a.k.a. Analysis Pipelines)

4. DB Theory Appetizer: Web Service Composition Through Declarative Queries



Information Integration Challenges

• System aspects: “Grid” Middleware
  – distributed data & computing
  – Web Services, WSDL/SOAP, OGSA, …
  – sources = functions, files, data sets, …
• Syntax & Structure: (XML-Based) Data Mediators
  – wrapping, restructuring
  – (XML) queries and views
  – sources = (XML) databases
• Semantics: Model-Based/Semantic Mediators
  – conceptual models and declarative views
  – Knowledge Representation: ontologies, description logics (RDF(S), OWL, ...)
  – sources = knowledge bases (DB+CMs+ICs)

Syntax, Structure, Semantics, System aspects: reconciling S4 heterogeneities

“gluing” together resources

bridging information and knowledge gaps computationally



Information Integration from a DB Perspective

• Information Integration Problem
  – Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user questions Q1, ..., Qn that can be answered using the Si
  – Find: the answers to Q1, ..., Qn
• The Database Perspective:
  – source = “database”: Si has a schema (relational, XML, OO, ...)
  – Si can be queried
  – define virtual (or materialized) integrated/global view G over S1, ..., Sk using database query languages (SQL, XQuery, ...)
  – questions become queries Qi against G(S1, ..., Sk)
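To make the database perspective concrete, here is a minimal Python sketch of a virtual global view G defined over two sources, with a user question posed as a query against G. The source contents, the (region, age) layout, and all function names are assumptions made for illustration; none of them come from the slides.

```python
# Hypothetical wrapped sources: each exposes its data as (region, age) tuples.
S1 = [("az-01", "Paleozoic"), ("az-02", "Cenozoic")]
S2 = [("nv-07", "Paleozoic")]


def G():
    """Virtual global view: G(region, age) is the union of S1 and S2.

    Nothing is materialized; the sources are read each time G is queried.
    """
    yield from S1
    yield from S2


def answer(age):
    """User query against G: answer(region) <- G(region, age)."""
    return sorted(region for region, a in G() if a == age)


if __name__ == "__main__":
    print(answer("Paleozoic"))
```

The user only sees G; which source contributed a tuple is hidden behind the view definition.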



Standard (XML-Based) Mediator Architecture

[Figure: mediator architecture. The USER/Client sends a query Q(G(S1,..., Sk)) to the MEDIATOR, which holds the integrated global (XML) view G and its integrated view definition G(..) ← S1(..) … Sk(..). Each source S1, S2, …, Sk sits behind a wrapper that exports an (XML) view; web services act as wrapper APIs. Numbered steps shown: 1. Query Q(G(S1,..., Sk)); 3. Q1 Q2 Q3; 4. {answers(Q1)} {answers(Q2)} {answers(Q3)}; 6. {answers(Q)}.]



Query Planning for Mediators

• Given:
  – User query Q: answer(…) ← … G ...
  – … & { G ← … S … } global-as-view (GAV)
  – … & { S ← … G … } local-as-view (LAV)
  – … & { false ← … S … G … } integrity constraints (ICs)
• Find:
  – equivalent (or min. containing, max. contained) query plan Q’: answer(…) ← … S …
• Results:
  – A variety of results/algorithms; depending on classes of queries, views, and ICs: P, NP, …, undecidable
  – many variants still open



From Scientific Data Integration to Process & Application Integration (and back…)

• Data Integration
  – Database mediation + Knowledge-based extension
  – Query rewriting w/ GAV, LAV, ICs, access patterns
• “Process/Application” Integration
  – Scientific models (ocean, atmosphere, ecology, …), assimilation models (e.g., real-time data feeds), …
  – Data sets
  – Legacy tools
  – Components = web services
  – Applications = composite components (“workflows”)
  – Need for semantic type extensions



Geologic Map Integration

• Given:
  – Geologic maps from different state geological surveys (shapefiles w/ different data schemas)
  – Different ontologies:
    • Geologic age ontology
    • Rock type ontologies:
      – Multiple hierarchies (chemical, fabric, texture, genesis) from Geological Survey of Canada (GSC)
      – Single hierarchy from British Geological Survey (BGS)
• Problem
  – Support uniform queries against the multiple geologic maps using different ontologies
  – Support registration w/ ontology A, querying w/ ontology B



Ontology Mappings: Motivation

• Establish correspondences between ontologies
  – Integrate data sets which are registered to different ontologies
  – Query data sets through different ontologies

[Figure: Data set 1 and Data set 2 are each registered to an ontology (Ontology A, Ontology B); ontology mappings connect the two ontologies, and queries are posed through them.]



A Multi-Hierarchical Rock Classification Ontology (GSC)

Composition

Genesis

Fabric

Texture



Some enabling operations on “ontology data”

Composition

Concept expansion:
• what else to look for when asking for ‘Mafic’



Some enabling operations on “ontology data”

Composition

Generalization:
• finding data that is “like” X and Y
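The two operations on “ontology data” can be sketched in Python over a toy is-a hierarchy. The hierarchy below is a made-up fragment for illustration, not the actual GSC composition hierarchy, and the function names are assumptions: concept expansion collects a concept together with everything classified beneath it, and generalization finds the most specific concept covering two given concepts.

```python
# Hypothetical fragment of a composition hierarchy: child -> parent.
PARENT = {
    "Mafic": "Composition",
    "Felsic": "Composition",
    "Basalt": "Mafic",
    "Gabbro": "Mafic",
    "Granite": "Felsic",
}


def ancestors(concept):
    """Return [concept, parent, grandparent, ...] up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def expand(concept):
    """Concept expansion: the concept plus all concepts below it."""
    all_concepts = set(PARENT) | set(PARENT.values())
    return sorted(c for c in all_concepts if concept in ancestors(c))


def generalize(x, y):
    """Generalization: most specific concept covering both x and y, or None."""
    above_y = set(ancestors(y))
    return next((c for c in ancestors(x) if c in above_y), None)


if __name__ == "__main__":
    print(expand("Mafic"))
    print(generalize("Basalt", "Granite"))
```

A query for ‘Mafic’ would then also retrieve data registered to any concept in expand("Mafic").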



Implementation in OWL: Not only “for the machine” …


Geologic Map Integration

domain knowledge

Knowledge representation

Ontologies!?

Nevada

GEON Metamorphism Equation:

Geoscientists + Computer Scientists +/- Energy +/- a few hundred million years → Igneous Geoinformaticists


Geology Workbench: Registering Data to an Ontology

Step 1: Choose Classes

Click on Submission Data set name

Select a shapefile

Choose an ontology class



Geology Workbench: Data Registration

Step 2: Choose Columns for Selected Classes

AREA

PERIMETER

AZ_1000

AZ_1000_ID

GEO

PERIOD

ABBREV

DESCR

D_SYMBOL

P_SYMBOL

It contains information about geologic age



Geology Workbench: Data Registration

Step 3: Resolve Mismatches

Two terms are not matched to any ontology terms

Manually mapping algonkian into the ontology



Geology Workbench: Ontology-enabled Map Integrator

Click on the name

Choose interesting classes

All areas with the age Paleozoic



Geology Workbench: Change Ontology

Submit a mapping

Ontology mapping between British Rock Classification and Canadian Rock Classification

Switch from Canadian Rock Classification to

British Rock Classification

Run it New query interface



Ontologies and Data Management

Schema Schema Schema Schema

Conceptual Model

Conceptual Model

Ontology

Data

Metadata

Design Artifact

use concepts from (explicitly or implicitly)

• How to define and refine an ontology?
• How to register a dataset to an ontology?



Biomedical Informatics Research Network
http://nbirn.net

Refining an Ontology – the logic way, enables “Source Contextualization”



Connecting Datasets to Ontologies: “Semantic Registration”

Date        Site  Transect  SP_Code  Count
2000-09-08  CARP  1         CRGI     0
2000-09-08  CARP  4         LOCH     0
2000-09-08  CARP  7         MUCA     1
2000-09-22  NAPL  7         LOCH     1
2000-09-18  NAPL  1         PAPA     5
2000-09-28  BULL  1         CYOS     57


DataCollectionEvent ⊑ contains.Measurement
Measurement ⊑ measureOf.MeasurableItem ⊓ hasContext.MeasurementContext
MeasurementContext ⊑ hasTime.DateTime ⊓ hasLocation.Location
MeasurableItem ⊑ hasUnit.Unit ⊓ hasValue.UnitValue
SpeciesCount ⊑ MeasurableItem ⊓ hasSpecies.Species ⊓ hasUnit.RatioUnit
…
SpeciesAbundance ⊑ Measurement ⊓ measureOf.SpeciesCount
AbundanceCollectionEvent ⊑ DataCollectionEvent ⊓ contains.SpeciesAbundance
Location ⊑ position.Coordinate
LTERSite ⊑ Location
SBLTERSite ⊑ LTERSite ⊓ position.SBLTERCoordinate
{naples,…} ⊑ SBLTERSite

How can we “register” the dataset to concepts in the Ontology?

Ontology (snippet)

Dataset



Purpose of Semantic Registration

Expose “hidden” information:
  – What do attributes represent?
  – What do specific values represent?
  – What conceptual “objects” are in the dataset?

Capture connections between the dataset and ontology to:
  – Find existing datasets (or parts of datasets) via ontological concepts (discovery)
  – Enable integration of datasets (mediation)
  – Generate metadata for new data products (in a pipeline)



Semantic Registration Framework

Step 1: Data provider selects relevant ontological concepts (for the dataset)

Step 2: The semantic registration system creates a structural representation based on the chosen concepts (the data provider refines it if needed)

Step 3: The data provider maps the dataset information to the generated structural representation
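As a rough illustration of Steps 1 and 3, the Python sketch below maps dataset columns directly to ontology concepts. This is a simplification: the framework described above maps into a generated structural representation, which is not reproduced here. The rows are taken from the example dataset; the column-to-concept assignments and the function names are assumptions for illustration.

```python
# Two rows of the example dataset.
ROWS = [
    {"Date": "2000-09-08", "Site": "CARP", "Transect": 1, "SP_Code": "CRGI", "Count": 0},
    {"Date": "2000-09-22", "Site": "NAPL", "Transect": 7, "SP_Code": "LOCH", "Count": 1},
]

# Assumed registration: dataset column -> ontology concept.
# Columns without an entry (here: Transect) stay unregistered.
COLUMN_TO_CONCEPT = {
    "Date": "DateTime",
    "Site": "SBLTERSite",
    "SP_Code": "Species",
    "Count": "SpeciesCount",
}


def annotate(row):
    """Return (concept, value) pairs for the registered columns of one row."""
    return [(COLUMN_TO_CONCEPT[col], value)
            for col, value in row.items() if col in COLUMN_TO_CONCEPT]


def columns_for(concept):
    """Discovery: which columns of this dataset carry the given concept?"""
    return sorted(col for col, c in COLUMN_TO_CONCEPT.items() if c == concept)


if __name__ == "__main__":
    print(annotate(ROWS[0]))
    print(columns_for("Species"))
```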



Step 1: Selecting Relevant Concepts



Concepts from an Ontology

Dataset

• DataCollectionEvent
  • AbundanceCollectionEvent
• Measurement
  • Abundance
    • SpeciesAbundance
• MeasurableItem
  • SpeciesCount
• Location
  • LTERSite
    • SBLTERSite
      • naples
• Species
  • …
• MeasurementContext
  • …



Step 1: Selecting Relevant Concepts



Concepts from an Ontology

Dataset







• Species• …




Step 2: Generate Object Model

Concepts from an Ontology

[Figure: generated object model. Nodes: AbundanceCollectionEvent, SpeciesAbundance, SpeciesCount, Species, RatioUnit, RatioValue, DateTime, SBLTERSite. Edge labels: contains, measureOf, hasSpecies, hasUnit, hasValue, hasTime, hasLoc.]







• Species• …



Scientific Workflows


Promoter Identification Workflow (PIW)

Source: Matt Coleman (LLNL)



Source: NIH BIRN (Jeffrey Grethe, UCSD)



Ecology: GARP Analysis Pipeline for Invasive Species Prediction

[Figure: GARP analysis pipeline. Step labels: EcoGrid Query, Layer Integration, Sample Data (+A1, +A2, +A3), Data Calculation, Map Generation, Validation, User, Generate Metadata, Archive To EcoGrid; inputs come from registered EcoGrid databases. Data products: (a) species presence & absence points (native range; invasion area); (b) environmental layers (native range; invasion area); (c) integrated layers (native range; invasion area); (d) training sample and test sample; (e) GARP rule set; (f) native range prediction map and invasion area prediction map; (g) model quality parameter; (h) selected prediction maps.]

Source: NSF SEEK (Deana Pennington et al., UNM)



Scientific Workflows: Some Findings

• More dataflow than (business) workflow
• Need for “programming extension”
  – Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …)
• Need for abstraction and nested workflows
• Need for data transformations
• Need for rich user interaction & workflow steering:
  – pause / revise / resume
  – select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF
• Need for high-throughput transfers (“grid-enabling”, “streaming”)
• Need for persistence of intermediate products
  – data provenance (“virtual data” concept)


Our Starting Point: Dataflow Process Networks and Ptolemy II

see!

try!

read!

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/


Kepler Team, Projects, Sponsors

• Ilkay Altintas SDM
• Chad Berkley SEEK
• Shawn Bowers SEEK
• Jeffrey Grethe BIRN
• Christopher H. Brooks Ptolemy II
• Zhengang Cheng SDM
• Efrat Jaeger GEON
• Matt Jones SEEK
• Edward A. Lee Ptolemy II
• Kai Lin GEON
• Ashraf Memon GEON
• Bertram Ludaescher BIRN, GEON, SDM, SEEK
• Steve Mock NMI
• Steve Neuendorffer Ptolemy II
• Mladen Vouk SDM
• Yang Zhao Ptolemy II
• …

Ptolemy II



Commercial Workflow/Dataflow Systems



SCIRun: Problem Solving Environments for Large-Scale Scientific Computing

• SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations

• Component model, based on generalized dataflow programming

Steve Parker (cs.utah.edu)



E-Science and Link-Up Buddies

• … <UPDATE ME> …
  – Taverna, Scufl, Freefluo, ..
  – DiscoveryNet
  – Triana
  – ICENI
  – …



Dataflow Process Networks: Putting Computation Models first!

• Synchronous Dataflow Network (SDF)
  – Statically schedulable single-threaded dataflow
    • Can execute multi-threaded, but the firing-sequence is known in advance
  – Maximally well-behaved, but also limited expressiveness
• Process Network (PN)
  – Multi-threaded dynamically scheduled dataflow
  – More expressive than SDF (dynamic token rate prevents static scheduling)
  – Natural streaming model
• Other Execution Models (“Domains”)
  – Implemented through different “Directors”

[Figure: two actors with typed i/o ports connected by a FIFO channel; advanced push/pull]
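The process-network idea (actors as concurrent threads that communicate only through FIFO channels with blocking reads) can be sketched in plain Python. This is not Ptolemy II or Kepler code; the STOP end-of-stream marker and the names source, actor, and run_pipeline are assumptions of this sketch.

```python
import queue
import threading

STOP = object()  # end-of-stream marker (an assumption of this sketch)


def source(out, items):
    """Emit all items on the output channel, then signal end of stream."""
    for item in items:
        out.put(item)
    out.put(STOP)


def actor(f, inp, out):
    """Read tokens from inp, apply f, write results to out, until STOP."""
    while True:
        token = inp.get()  # blocking read on the FIFO channel
        if token is STOP:
            out.put(STOP)
            return
        out.put(f(token))


def run_pipeline(items, *functions):
    """Run one thread per actor, chained by FIFO queues; collect the output."""
    channels = [queue.Queue() for _ in range(len(functions) + 1)]
    threads = [threading.Thread(target=source, args=(channels[0], items))]
    for f, inp, out in zip(functions, channels, channels[1:]):
        threads.append(threading.Thread(target=actor, args=(f, inp, out)))
    for t in threads:
        t.start()
    results = []
    while True:
        token = channels[-1].get()
        if token is STOP:
            break
        results.append(token)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3], lambda x: x + 1, lambda x: x * 10))
```

No firing schedule is fixed in advance: each actor runs whenever a token is available, which is what makes the model a natural fit for streaming.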



Promoter Identification Workflow (PIW)

Source: Matt Coleman (LLNL)



Promoter Identification Workflow in Ptolemy-II [SSDBM’03]

[Figure annotations: execution semantics; hand-crafted control solution, also: forces sequential execution!; designed to fit; hand-crafted Web-service actor; complex backward control-flow; no data transformations available]



Simplified Process Network PIW

• Back to purely functional dataflow process network (= a data streaming model!)
• Re-introducing map(f) to Ptolemy-II (was there in PT Classic)
  – no control-flow spaghetti
  – data-intensive apps
  – free concurrent execution
  – free type checking
  – automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]

[Figure annotations: map(f)-style iterators; powerful type checking; generic, declarative “programming” constructs; generic data transformation actors; forward-only, abstractable sub-workflow piw(GeneId)]



Optimization by Declarative Rewriting

• PIW as a declarative, referentially transparent functional process
  – optimization via functional rewriting possible, e.g. map(f o g) = map(f) o map(g)
• Details:
  – Technical report & PIW specification in Haskell

[Figure annotations: map(f o g) instead of map(f) o map(g); combination of map and zip]

http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
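The rewriting rule map(f o g) = map(f) o map(g) can be checked on a small Python example. The helper names compose and pmap and the two sample functions are assumptions for illustration; the PIW specification itself is in Haskell and is not reproduced here. The rule only holds in general when f and g are free of side effects, which is what referential transparency guarantees.

```python
def compose(f, g):
    """Function composition: compose(f, g)(x) == f(g(x))."""
    return lambda x: f(g(x))


def pmap(f):
    """Lift a per-item function f to a step that processes a whole list."""
    return lambda xs: [f(x) for x in xs]


def f(x):
    return x + 1


def g(x):
    return x * 2


# Two passes over the data: first map g, then map f.
two_pass = compose(pmap(f), pmap(g))

# One fused pass: apply f o g to each item.
one_pass = pmap(compose(f, g))

if __name__ == "__main__":
    data = [1, 2, 3]
    print(two_pass(data), one_pass(data))
```

The fused form avoids building the intermediate list, which matters for data-intensive workflows.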



Web Services & Scientific Workflows in Kepler

• Web services = individual components (“actors”)
• “Minute-Made” Application Integration:
  – Plugging-in and harvesting web service components is easy and fast
• Rich SWF modeling semantics (“directors” and more):
  – Different and precise dataflow models of computation
  – Clear and composable component interaction semantics
  – Web service composition and application integration tool
• Coming soon:
  – Shrink-wrapped, pre-packaged “Kepler-to-Go” (v0.8)
  – SWFs with structural and semantic data types (better design support)
  – Grid-enabled web services (for big data, big computations, …)
  – Different deployment models (SWF WS, web site, applet, …)



KEPLER Core Capabilities (1/2)

• Designing scientific workflows
  – Composition of actors (tasks) to perform a scientific WF
• Actor prototyping
• Accessing heterogeneous data
  – Data access wizard to search and retrieve Grid-based resources
  – Relational DB access and query
  – Ability to link to EML data sources



KEPLER Core Capabilities (2/2)

• Data transformation actors to link heterogeneous data
• Executing scientific workflows
  – Distributed and/or local computation
  – Various models for computational semantics and scheduling
  – SDF and PN: Most common for scientific workflows
• External computing environments:
  – C++, Python, C (… Perl--planned ...)
• Deploying scientific tasks and workflows as web services themselves (… planned …)



The KEPLER GUI (Vergil)

Drag and drop utilities, director and actor libraries.



Running the workflow



Distributed SWFs in KEPLER

• Web and Grid Service plug-ins
  – WSDL, and whatever comes after GWSDL
  – ProxyInit, GlobusGridJob, GridFTP, DataAccessWizard
• WS Harvester
  – Imports all the operations of a specific WS (or of all the WSs in a UDDI repository) as Kepler actors
• WS-deployment interface (…ongoing work…)
• XSLT and XQuery transformers to link non-fitting services together



A Generic Web Service Actor

Given a WSDL and the name of an operation of a web service, dynamically customizes itself to implement and execute that method.

Configure - select service operation



Set Parameters and Commit

Set parameters and commit



WS Actor after Instantiation



Web Service Harvester

• Imports the web services in a repository into the actor library.
• Has the capability to search for web services based on a keyword.



Composing 3rd-Party WSs

Output of previous web service

User interaction & Transformations

Input of next web service



Classifying with Kepler



SWF Designed in Kepler



Result launched via the BrowserUI actor


Querying Example


KEPLER and YOU

• Kepler …
  – is a community-based, cross-project, open source collaboration
  – uses web services as basic building blocks
  – has a joint CVS repository, mailing lists, web site, …
  – is gaining momentum thanks to contributors and contributions
• BSD-style license allows commercial spin-offs
  – a pre-packaged, shrink-wrapped version (“Kepler-to-GO”) coming soon to a place near you…


Now back to the “Semantics Stuff”


Semantic Types for Scientific Workflows



From Semantic to Structural Mappings



Structural and Semantic Mappings



Summary I: Putting it all together for the Science Environment for Ecological Knowledge

• Large collaborative NSF/ITR project: UNM, UCSB, UCSD, UKansas, ..
• Goals: global access to ecologically relevant data; rapidly locate and utilize distributed computation; (semi-)automate, streamline analysis process – “Knowledge Discovery Workflows”
• EcoGrid provides unified access to Distributed Data Stores, Parameter Ontologies, & Stored Analyses, and runtime capabilities via the Execution Environment
• Semantic Mediation System & Analysis and Modeling System use WSDL/UDDI to access services within the EcoGrid, enabling analytically driven data discovery and integration
• SEEK is the combination of EcoGrid data resources and information services, coupled with advanced semantic and modeling capabilities

[Figure: SEEK architecture, layers connected via WSDL/UDDI.
 AM: Analysis & Modeling System (KEPLER): Analytical Pipeline (AP) with steps ASx, ASy, ASz, ASr, TS1, TS2; Library of Analysis Steps, Pipelines & Results; example of “AP0”: Invasive species over time.
 Semantic Mediation System (SMS): Semantic Mediation Engine, Data Binding, Query Processing, Logic Rules ECO2-CL, Parameters w/ Semantics, Parameter Ontologies (ECO2, TaxOn, EML, etc.).
 EcoGrid: SRB, KNB, MC, Species, WrpDar, ...; raw data sets wrapped for integration w/ EML, etc.
 Execution Environment: SAS, MATLAB, FORTRAN, etc.]



Outline

1. Motivation: Traditional vs Scientific Data Integration

2. Semantic (a.k.a. Model-Based) Mediation

3. Scientific Workflows (a.k.a. Analysis Pipelines)

4. DB Theory Appetizer: Web Service Composition Through Declarative Queries



Planning with Limited Access Patterns (back to GAV mediation …)

• User query Q: answer(ISBN, Author, Title) ← book(ISBN, Author, Title), catalog(ISBN, Author), not library(ISBN).
• Limited (web service) APIs (access patterns):
  – Src1.books: in: ISBN out: Author, Title
  – Src1.books: in: Author out: ISBN, Title
  – Src2.catalog: in: {} out: ISBN, Author
  – Src3.library: in: {} out: ISBN
• Note: Q is not executable, but feasible (equivalent to executable Q’: catalog ; book ; not library)
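The executable plan Q’ (catalog ; book ; not library) can be traced with a small Python sketch. The toy data and the function names are assumptions for illustration; each function only accepts the inputs its access pattern allows.

```python
# Hypothetical source contents.
BOOKS = [("111", "Knuth", "T1"), ("222", "Ullman", "T2"), ("333", "Date", "T3")]
CATALOG = [("111", "Knuth"), ("222", "Ullman")]
LIBRARY = ["222"]


def books_by_isbn(isbn):
    """Src1.books, in: ISBN, out: Author, Title."""
    return [(author, title) for i, author, title in BOOKS if i == isbn]


def catalog():
    """Src2.catalog, in: {}, out: ISBN, Author."""
    return list(CATALOG)


def library():
    """Src3.library, in: {}, out: ISBN."""
    return list(LIBRARY)


def answer():
    """Executable plan Q': catalog ; book ; not library."""
    in_library = set(library())
    result = []
    for isbn, author in catalog():               # binds ISBN and Author
        for a, title in books_by_isbn(isbn):     # callable now: ISBN is bound
            if a == author and isbn not in in_library:
                result.append((isbn, author, title))
    return result


if __name__ == "__main__":
    print(answer())
```

Calling book first would fail, because neither ISBN nor Author is bound at that point; reordering the literals is what turns Q into an executable plan.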



Query Feasibility is as hard as Containment

• Theorem [EDBT’04]: For UCQneg queries Q: Q is feasible iff ans(Q) ⊑ Q
• The answerable part ans(Q) can be computed in quadratic time.
  – Idea: scan Q for answerable literals, rescan, repeat until ans(Q) is reached
• Checking query containment Q1 ⊑ Q2 is hard:
  – Already NP-complete for CQ (conjunctive queries)
  – Undecidable for FO (first-order logic queries)
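The scan-and-rescan idea for computing the answerable part can be sketched in Python for a single conjunctive query. The encoding is an assumption of this sketch: a literal is a (relation, arguments, is_positive) triple, all arguments are variables, an access pattern is the tuple of input positions, and a negated literal counts as answerable only once all of its variables are bound. The precise definitions in [EDBT’04] may differ in detail.

```python
def answerable_part(literals, patterns):
    """Return the answerable literals of a query, in an executable order.

    literals: list of (relation, args, is_positive), args being variable names.
    patterns: relation -> list of access patterns (tuples of input positions).
    """
    bound = set()
    ordered = []
    remaining = list(literals)
    changed = True
    while changed:                      # rescan until nothing new is answerable
        changed = False
        for lit in list(remaining):
            relation, args, positive = lit
            if positive:
                # Some access pattern must have all its input positions bound.
                ok = any(all(args[i] in bound for i in inputs)
                         for inputs in patterns[relation])
            else:
                # Assumed safe negation: every variable must already be bound.
                ok = all(a in bound for a in args)
            if ok:
                ordered.append(lit)
                remaining.remove(lit)
                if positive:
                    bound.update(args)  # outputs become bound
                changed = True
    return ordered


Q = [
    ("book", ("I", "A", "T"), True),
    ("catalog", ("I", "A"), True),
    ("library", ("I",), False),
]
PATTERNS = {"book": [(0,), (1,)], "catalog": [()], "library": [()]}

if __name__ == "__main__":
    print([relation for relation, _, _ in answerable_part(Q, PATTERNS)])
```

Each pass scans at most all literals and at least one literal is moved per successful pass, so the number of literal checks is at most quadratic in the size of the query.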



Conjunctive Query Containment

• Given: conjunctive queries Q1, Q2 (aka Select-Project-Join queries)
• Problem: Is answers(D, Q1) ⊆ answers(D, Q2) for all databases D?
• If yes, we say that “Q1 is contained in Q2”; short: Q1 ⊑ Q2
• Examples:
  Q1: answer(X) ← student(X, cs)
  Q2: answer(X) ← student(X, Dept), advisor(X, Y), dept(Y, cs)
  Q3: answer(X) ← student(X, Dept)
• Quiz:
  – Q1 ⊑ Q2 ?
  – No: not every student X necessarily has an advisor Y who is in the cs department!
  – Q1 ⊑ Q3 ?
  – Yes: every cs student is a student in some department (crux of the “proof”: Dept = cs)
• Homework: What about Q1 ⊑ Q2 if we know that every student must have an advisor from the same department?
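The quiz answers can be checked mechanically with the classical homomorphism test: Q1 ⊑ Q2 holds iff the variables of Q2 can be mapped to terms of Q1 so that the head of Q2 maps to the head of Q1 and every body atom of Q2 maps to a body atom of Q1. The Python sketch below is a brute-force backtracking search over such mappings; the query encoding and function names are assumptions for illustration. It ignores integrity constraints, so it does not settle the homework question.

```python
def is_var(term):
    """Prolog convention: variables start with an uppercase letter."""
    return term[0].isupper()


def contained_in(q1, q2):
    """True iff q1 ⊑ q2 for conjunctive queries given as (head_args, body_atoms)."""
    (head1, body1), (head2, body2) = q1, q2

    def extend(theta, args2, args1):
        """Extend the mapping theta so that args2 maps onto args1, or return None."""
        if len(args2) != len(args1):
            return None
        theta = dict(theta)
        for t2, t1 in zip(args2, args1):
            if is_var(t2):
                if theta.setdefault(t2, t1) != t1:
                    return None         # variable already mapped elsewhere
            elif t2 != t1:
                return None             # constants must match exactly
        return theta

    def search(theta, atoms):
        """Map each remaining atom of q2 to some atom of q1 (backtracking)."""
        if not atoms:
            return True
        (relation, args2), rest = atoms[0], atoms[1:]
        for relation1, args1 in body1:
            if relation1 == relation:
                new_theta = extend(theta, args2, args1)
                if new_theta is not None and search(new_theta, rest):
                    return True
        return False

    start = extend({}, head2, head1)
    return start is not None and search(start, list(body2))


Q1 = (("X",), [("student", ("X", "cs"))])
Q2 = (("X",), [("student", ("X", "Dept")), ("advisor", ("X", "Y")), ("dept", ("Y", "cs"))])
Q3 = (("X",), [("student", ("X", "Dept"))])

if __name__ == "__main__":
    print(contained_in(Q1, Q2), contained_in(Q1, Q3))
```

The search is exponential in the worst case, consistent with containment being NP-complete for conjunctive queries.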



The World’s Shortest Conjunctive Query Containment Checker (an NP-complete problem): 7 lines in Prolog …

Quiz:
1. Find the bug in the 7 lines of code
2. Fix the bug (hint: add one more line of code)

Moral: Short programs can be buggy too



Summary II: Got milk/eggs/meat/wool? Or: “Die eierlegende Wollmilchsau …” (the egg-laying wool-milk-sow)

• Data Integration
  – query rewriting under GAV/LAV
  – w/ binding pattern constraints
  – distributed query processing
• Semantic Mediation
  – semantic integrity constraints, reasoning w/ plans, automated deduction
  – deductive database/logic programming technology, AI “stuff”...
  – Semantic Web technology
• Scientific Workflow Management
  – more procedural than database mediation (the scientist is the “query planner”)
  – deployment using web services






Building the EcoGrid

AND

LUQ

HBR

NTL

Metacat node

Legacy system

LTER Network (24)
Natural History Collections (>> 100)
Organization of Biological Field Stations (180)
UC Natural Reserve System (36)
Partnership for Interdisciplinary Studies of Coastal Oceans (4)
Multi-agency Rocky Intertidal Network (60)

SRB node

DiGIR node

VCR

VegBank node

Xanthoria node

Source: Matthew Jones (UCSB)



Heterogeneous Data integration

• Requires advanced metadata and processing
  – Attributes must be semantically typed
  – Collection protocols must be known
  – Units and measurement scale must be known
  – Measurement relationships must be known
    • e.g., that ArealDensity = Count/Area
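A measurement relationship such as ArealDensity = Count/Area only becomes usable by a system once values carry their units. The Python sketch below shows one way this could look; the Measurement class, the unit strings, and the 20 m^2 plot area are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str  # e.g. "individuals" or "m^2"


def areal_density(count, area):
    """ArealDensity = Count / Area, carrying the units along."""
    if area.value <= 0:
        raise ValueError("area must be positive")
    return Measurement(count.value / area.value, f"{count.unit}/{area.unit}")


if __name__ == "__main__":
    count = Measurement(57, "individuals")   # Count column of the example dataset
    area = Measurement(20, "m^2")            # hypothetical plot area
    print(areal_density(count, area))
```

Two datasets, one reporting counts and the other densities, can then be integrated only if the area and its unit are known for the first.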



Ecological ontologies

• What was measured (e.g., biomass)
• Type of measurement (e.g., Energy)
• Context of measurement (e.g., Psychotria limonensis)
• How it was measured (e.g., dry weight)
• SEEK intends to enable community-created ecological ontologies using OWL
  – Represents a controlled vocabulary for ecological metadata
• More about this in Bertram’s talk



• Label data with semantic types (e.g. concept expressions in OWL)

• Label inputs and outputs of analytical components with semantic types

• Use reasoning engines to generate transformation steps
  – Observe analytical constraints

• Use reasoning engine to discover relevant components

Semantic Mediation

Data Ontology Workflow Components
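The labeling and discovery steps listed above can be sketched in Python as a subsumption check between semantic types. The is-a links below follow the ontology snippet shown earlier (SpeciesAbundance ⊑ Measurement, SBLTERSite ⊑ LTERSite ⊑ Location, SpeciesCount ⊑ MeasurableItem); the component names and their input types are hypothetical, and a real system would delegate the check to an OWL reasoner rather than walk a dictionary.

```python
# Named is-a links taken from the ontology snippet: concept -> direct superconcept.
SUBCLASS_OF = {
    "SpeciesAbundance": "Measurement",
    "SpeciesCount": "MeasurableItem",
    "SBLTERSite": "LTERSite",
    "LTERSite": "Location",
}

# Hypothetical workflow components, each labeled with the semantic type of its input.
COMPONENTS = {
    "PlotAbundance": "Measurement",
    "MapSites": "Location",
}


def subsumed_by(concept, target):
    """True if concept equals target or lies below it in the is-a hierarchy."""
    while concept is not None:
        if concept == target:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False


def can_connect(output_type, input_type):
    """An output may feed an input if its type is subsumed by the input's type."""
    return subsumed_by(output_type, input_type)


def relevant_components(data_type):
    """Discovery: components whose input accepts data of the given semantic type."""
    return sorted(name for name, input_type in COMPONENTS.items()
                  if can_connect(data_type, input_type))


if __name__ == "__main__":
    print(relevant_components("SBLTERSite"))
```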

