ASU, 7/25/2003
Scientific Data Management: From Data Integration to Analytical Pipelines
Data & Knowledge Systems
San Diego Supercomputer Center
University of California, San Diego
Bertram Ludäscher, [email protected]

Outline
• Motivation: Scientific Data Integration Problems
• “Semantic” (Model-based) Mediation
• Scientific Workflows and Analytical Pipelines

Acknowledgements
• National Science Foundation (NSF) – www.nsf.gov
• GEOsciences Network (NSF) – www.geongrid.org
• Biomedical Informatics Research Network (NIH) – www.nbirn.net
• Science Environment for Ecological Knowledge (NSF) – seek.ecoinformatics.org
• Scientific Data Management Center (DOE) – sdm.lbl.gov/sdmcenter/

An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
? Information Integration
Sources: amazon.com, A1books.com, half.com, barnes&noble.com
addall.com
“One-World” Scenario: XML-based mediator
Mediator (virtual DB) (vs. Datawarehouse)

A Home Buyer’s Information Integration Problem
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?
? Information Integration
Sources: Realtor, Demographics, School Rankings, Crime Stats
“Multiple-Worlds” Mediation

Some BIRNing Data Integration Questions
• Data Integration Approaches:
– Let’s just share data, e.g., link everything from a web page!
– ... or better put everything into a relational or XML database
– ... and do remote access using the Grid
– ... or just use Web services!
• Nice try. But:
– “Find the files where the amygdala was segmented.”
– “Which other structures were segmented in the same files?”
– “Did the volume of any of those structures differ much from normal?”
– “What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?”
Biomedical Informatics Research Network, http://nbirn.net

A Neuroscientist’s Information Integration Problem
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
? Information Integration
Sources: protein localization (NCMIR), neurotransmission (SENSELAB), sequence info (CaPROT), morphometry (SYNAPSE)
“Complex Multiple-Worlds” Mediation

Information Integration Challenges: Heterogeneities = S^4 ...
• System Aspects
– platforms, devices, distribution, APIs, protocols, …
• Syntaxes
– heterogeneous data formats (one for each tool ...)
• Structures
– heterogeneous schemas (one for each DB ...)
– heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, …)
• Semantics
– unclear & “hidden” semantics: e.g., incoherent terminology, multiple taxonomies, implicit assumptions, ...

Information Integration Challenges
• System aspects: “Grid” Middleware
– distributed data & computing
– Web Services, WSDL/SOAP, OGSA, …
– sources = functions, files, data sets, …
• Syntax & Structure: (XML-Based) Data Mediators
– wrapping, restructuring
– (XML) queries and views
– sources = (XML) databases
• Semantics: Model-Based/Semantic Mediators
– conceptual models and declarative views
– Knowledge Representation: ontologies, description logics (RDF(S), OWL ...)
– sources = knowledge bases (DB+CMs+ICs)
Reconciling S^4 heterogeneities: Syntax, Structure, Semantics, System aspects
“Gluing” together multiple data sources
Bridging information and knowledge gaps computationally

Information Integration from a DB Perspective
• Information Integration Problem
– Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user questions Q1, ..., Qn that can be answered using the Si
– Find: the answers to Q1, ..., Qn
• The Database Perspective: source = “database”
– Si has a schema (relational, XML, OO, ...)
– Si can be queried
– define virtual (or materialized) integrated views V over S1, ..., Sk using database query languages (SQL, XQuery, ...)
– questions become queries Qi against V(S1, ..., Sk)
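
As a rough illustration of the database perspective, here is a minimal sketch of a virtual integrated view over two sources, assuming each source is just an in-memory list of records. The source contents, field names, and the functions `view_v` and `cheapest` are hypothetical and only echo the book-shopper example.

```python
from typing import Any, Dict, Iterator, List

Row = Dict[str, Any]

# Hypothetical sources S1 and S2: each has its own schema and can be queried.
S1: List[Row] = [
    {"title": "Tractatus", "price_usd": 12.0, "shipping_usd": 4.0},
    {"title": "Investigations", "price_usd": 20.0, "shipping_usd": 4.0},
]
S2: List[Row] = [
    {"book": "Tractatus", "total": 14.5},
]


def view_v() -> Iterator[Row]:
    """Virtual integrated view V(S1, S2): computed on demand, never stored."""
    for r in S1:
        yield {
            "title": r["title"],
            "total_usd": r["price_usd"] + r["shipping_usd"],
            "source": "S1",
        }
    for r in S2:
        yield {"title": r["book"], "total_usd": r["total"], "source": "S2"}


def cheapest(title: str) -> Row:
    """A user question phrased as a query against V instead of against each Si."""
    offers = [row for row in view_v() if row["title"] == title]
    return min(offers, key=lambda row: row["total_usd"])


best = cheapest("Tractatus")
```

A materialized variant would store the output of `view_v()` once and refresh it when the sources change.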

Standard (XML-Based) Mediator Architecture
• USER/Client: Query Q(G(S1, ..., Sk))
• (XML) Queries & Results flow between the client and the MEDIATOR
• MEDIATOR: Integrated Global (XML) View G, given by the Integrated View Definition G(..) ← S1(..) … Sk(..)
• Sources S1, S2, ..., Sk: each behind a Wrapper that exports an (XML) View
• wrappers implemented as web services

Scientific Data Integration ... Questions to Queries ...
What are the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?
? Information Integration
Sources: Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB)
“Complex Multiple-Worlds” Mediation
From raw data to domain knowledge: Database mediation, Data modeling, Knowledge Representation (ontologies, concept spaces)
GeoSciences Network

Towards Shared Conceptualizations: Data Contextualization via Concept Spaces

Rock Classification Ontology
[Figure: ontology with branches Composition, Genesis, Fabric, Texture]

Some enabling operations on “ontology data” (Composition)
• Concept expansion: what else to look for when asking for ‘Mafic’
• Generalization: finding data that is “like” X and Y
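
The two operations can be sketched over a simple is-a hierarchy. The taxonomy fragment below is a hypothetical stand-in for the composition branch of the rock ontology, and `expand`, `ancestors`, and `generalize` are illustrative names, not part of any GEON tool.

```python
from typing import Dict, List, Optional, Set

# Hypothetical fragment of a composition taxonomy (child -> parent).
IS_A: Dict[str, str] = {
    "Mafic": "Composition",
    "Felsic": "Composition",
    "Basalt": "Mafic",
    "Gabbro": "Mafic",
    "Granite": "Felsic",
    "Rhyolite": "Felsic",
}


def expand(concept: str) -> Set[str]:
    """Concept expansion: the concept plus every concept below it."""
    result = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in IS_A.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def ancestors(concept: str) -> List[str]:
    """Path from a concept up to the root, starting with the concept itself."""
    path = [concept]
    while path[-1] in IS_A:
        path.append(IS_A[path[-1]])
    return path


def generalize(x: str, y: str) -> Optional[str]:
    """Generalization: the most specific concept that covers both x and y."""
    covers_y = set(ancestors(y))
    for candidate in ancestors(x):
        if candidate in covers_y:
            return candidate
    return None
```

A query for ‘Mafic’ would then be rewritten to search for every term in `expand("Mafic")`, and data “like” X and Y would be looked up under `generalize(X, Y)`.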

Show formations where AGE = ‘Paleozoic’ (without age ontology)
Show formations where AGE = ‘Paleozoic’ (with age ontology)
[Figure: Nevada geologic map; domain knowledge captured as a knowledge representation, the AGE ONTOLOGY]

Example: Geologic Map Integration
[Figure: Nevada map with domain knowledge and AGE ONTOLOGY]
GEON Metamorphism Equation:
Geoscientists + Computer Scientists → Igneous Geoinformaticists
+/- Energy
+/- a few hundred million years

GEON and “Semantic” Data Integration
[Figure: Rocky Mountains, Midatlantic Region]

Mediator Demo

Getting Formal: Source Contextualization & Ontology Refinement in Logic
Biomedical Informatics Research Network, http://nbirn.net

Distributed Query Processing Challenges: Part I, The Basics
• “Scientific data” integration (BIRN, GEON, ...) is a variant of the data integration problem studied by the database CS community
• Given
– user query against integrated view
– view to source mappings (GAV/LAV)
– sources with limited access patterns
• Compute a distributed query plan P s.t.
– P has a feasible execution order
– P is optimized wrt. time/space/networking complexity
(CS & theory)
GeoSciences Network
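
One way to read “feasible execution order” under limited access patterns: a source call can run only once all of its input arguments are bound. The following is a minimal sketch under that reading; the goal names and variables in the usage are hypothetical, and optimization (the second requirement on P) is not addressed at all.

```python
from typing import List, Optional, Set, Tuple

# A goal is (name, variables that must be bound on input, variables it binds on output).
Goal = Tuple[str, Set[str], Set[str]]


def feasible_order(goals: List[Goal], bound: Set[str]) -> Optional[List[str]]:
    """Return an order in which every goal has its inputs bound, or None.

    Greedy choice is enough here because running a goal only ever adds
    bound variables, so it can never make another goal infeasible.
    """
    remaining = list(goals)
    known = set(bound)
    order: List[str] = []
    while remaining:
        ready = next((g for g in remaining if g[1] <= known), None)
        if ready is None:
            return None  # some goal can never get its inputs bound
        order.append(ready[0])
        known |= ready[2]
        remaining.remove(ready)
    return order


# Hypothetical sources with input/output patterns.
plan_goals: List[Goal] = [
    ("align", {"seq", "db"}, {"hits"}),
    ("fetch_seq", {"gene_id"}, {"seq"}),
    ("lookup_gene", {"name"}, {"gene_id"}),
]
plan = feasible_order(plan_goals, {"name", "db"})
```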

Real-time Observatories, Applications, and Data management Network (ROADNet)
• Autonomous field sensors
– Seismic, oceanic, climate, ecological, …, video, audio, …
• RT Data Acquisition:
– ANZA Seismic Network (1981-present): 13 Broadband Stations, 3 Borehole Strong Motion Arrays, 5 Infrasound Stations, 1 Bridge Monitoring System
– Kyrgyz Seismic Network (1991-present): 10 Broadband Stations
– IRIS PASSCAL Transportable Array (1997-present): 15-60 Broadband and Short Period Stations
– IDA Global Seismic Network (~1990-present): 38 Broadband Stations
• High Performance Wireless Research Network (HPWREN)
– High performance backbone network: 45Mbps duplex point-to-point links, backbone nodes at quality locations, network performance monitors at backbone sites
– High speed access links: hard to reach areas, typically 45Mbps or 802.11 radios, point-to-point or point-to-multipoint
• Data Grid Technology (SRB)
– collaborative access to distributed heterogeneous data, single sign-on authentication and seamless authorization, data scaling to Petabytes and 100s of millions of files, data replication, etc.

A P2P Problem from ROADNet
• Networks of ORBs send each other various data streams
• Avoid actual loops in the presence of virtual loops:
– A → B → C
– A: c1 → B
– B: c2 → C
– C: c3 → A
– ...
• Idea: L(c1) ∩ L(c2) ∩ L(c3) ∩ … = {}
• In the real system: unix regexps
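
A minimal sketch of the loop check, assuming each connection ci forwards exactly the stream names that fully match its regular expression. Deciding emptiness of an intersection of regular languages in general needs automata constructions, so this sketch only tests a finite catalog of known stream names. The filter patterns and stream names are hypothetical.

```python
import re
from typing import Iterable, List


def looping_streams(cycle_filters: List[str], stream_names: Iterable[str]) -> List[str]:
    """Return the names accepted by every filter on the cycle.

    Such a stream would travel A -> B -> C -> A and loop for real;
    an empty result means the virtual loop is harmless for this catalog.
    """
    compiled = [re.compile(pattern) for pattern in cycle_filters]
    return [s for s in stream_names if all(c.fullmatch(s) for c in compiled)]


# Hypothetical stream names and filters c1, c2, c3.
catalog = ["AZ_PFO_BHZ", "AZ_PFO_BHN", "II_PFO_BHZ"]
risky = looping_streams([r"AZ_.*", r".*_BHZ", r"AZ_PFO_.*"], catalog)
safe = looping_streams([r"AZ_.*", r".*_BHZ", r"II_.*"], catalog)
```

This is a check against observed names only, not a proof that L(c1) ∩ L(c2) ∩ L(c3) is empty.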

Scientific Workflows and Analytical Pipelines

Scientific Workflows/Analytical Pipelines over Brain Data
Representation of the workflow for cortical reconstruction using FreeSurfer. Raw anatomical MR images are first pre-processed and then must be manually edited to correct defects in the pre-processing. Once verified for correctness, the pre-processed images can then be analyzed. During the processing, various “snapshots” of the data are returned to the BIRN Virtual Data Grid.
Biomedical Informatics Research Network, http://nbirn.net

Example: Promoter Identification Workflow (PIW) (simplified)
• scientific data sets flow between the steps
• abstraction of tasks into higher conceptual levels
• branching/merging of tasks and looping

SEEK: Vision & Overview
• Large collaborative NSF/ITR project: UNM, UCSB, UCSD, UKansas, ..
• Fundamental improvements for researchers: Global access to ecologically relevant data; Rapidly locate and utilize distributed computation; Capture, reproduce, extend analysis process
• SEEK is the combination of EcoGrid data resources and information services, coupled with advanced semantic and modeling capabilities
• EcoGrid provides unified access to Distributed Data Stores, Parameter Ontologies, & Stored Analyses, and runtime capabilities via the Execution Environment
• Semantic Mediation System & Analysis and Modeling System use WSDL/UDDI to access services within the EcoGrid, enabling analytically driven data discovery and integration
[Figure: SEEK architecture.
AM (Analysis and Modeling System): Analytical Pipeline (AP) with steps ASx, ASy, ASz, ASr, TS1, TS2; Parameters w/ Semantics; example of “AP0”: Invasive species over time.
SMS (Semantic Mediation System): Semantic Mediation Engine, Data Binding, Query Processing, Logic Rules, ECO2, ECO2-CL, Parameter Ontologies.
EcoGrid (accessed via WSDL/UDDI): SRB, KNB, MC, Species, WrpDar, ...; raw data sets wrapped for integration w/ EML, etc.; ECO2, TaxOn, EML, etc.; Execution Environment (SAS, MATLAB, FORTRAN, etc.); Library of Analysis Steps, Pipelines & Results.]

SEEK Components
• EcoGrid
– Seamless access to distributed, heterogeneous data: ecological, biodiversity, environmental data
– “Semantically” mediated and metadata driven
– Centralized search & management portal(s)
• Analysis and Modeling System
– Capture, reproduce, and extend analysis process
  • Declarative means for documenting analysis
  • “Pipeline” system for linking generic analysis steps
  • Strong version control for analysis steps
– Easy-to-use interface between data and analysis
• Semantic Mediation System:
– “smart” data discovery, “type-correct” pipeline construction & data binding:
– determine whether/how to link analytic steps
– determine how data sets can be combined
– determine whether/how data sets are appropriate inputs for analysis steps

AMS Overview
• Objective
– Create a semi-automated system for analyzing data and executing models that provides documentation, archiving, and versioning of the analyses, models, and their outputs (visual programming language?)
• Scope
– Any type of analysis or model in ecology and biodiversity science
– Massively streamline the analysis and modeling process
– Archiving, rerunning analyses in SAS, Matlab, R, SysStat, C(++), …
– …

SMS Requirements from AMS
• ... assist users in determining the appropriateness of combining various analytical steps and data sources based on semantic mediation ...
• Semantic mediation should occur in three areas:
1. determine whether it is appropriate to link together particular analytic steps.
2. mediate between multiple data sets to determine in what ways they can be combined.
3. determine whether the selected data sources are appropriate inputs for the selected analysis.

Some functional requirements
• SMS should have the ability to ...
FR1: recognize data types (XML Schema types!? EML types?) of registered EcoGrid data sets
FR2: recognize semantic types (OWL and/or RDF(S)!?) of registered EcoGrid data sets
FR3: recognize registered EcoGrid ontologies. Note: semantic types reference those ontologies
FR4: recognize data type signature (XML Schema? WSDL?) of analytical steps (ASs)
FR5: recognize semantic type signature of analytical steps
FR6: recognize semantic constraints (OWL? First-order? What syntax? KIF? Prolog?). Note: data schemas and signatures of analytical steps have those

... some functional requirements
• Ability to ...
FR8: check well-typedness (data and semantics) of a data set wrt. an analytical step
FR9: check compatibility of two data sets wrt. "generalized operations" between those data sets (e.g., "semantic" join and union)
FR10: check well-typedness (data and semantics) of chained analytical steps
FR11: introduce data type conversions (e.g., int → float)
FR12: perform and "explain" semantic type substitutions (e.g., if some AS works for Cs and D-isa-C, it also works for Ds)
FR13: [optional] generate type-correct APs from a given schema of desired output and (optionally) input parameters
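
FR10 to FR12 can be illustrated by a small link checker that compares the (data type, semantic type) of one step’s output with the next step’s input. This is only a sketch under assumed registries: the is-a facts, the conversion table, and the function names are hypothetical and do not describe the SEEK implementation.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical registries, for illustration only.
IS_A: Dict[str, str] = {"SpeciesCount": "Abundance", "Abundance": "Measurement"}
CONVERSIONS: Set[Tuple[str, str]] = {("int", "float")}  # allowed data type conversions


def isa_chain(sub: str, sup: str) -> Optional[List[str]]:
    """Return the is-a path from sub up to sup, or None if sub is not a kind of sup."""
    path = [sub]
    while path[-1] != sup:
        if path[-1] not in IS_A:
            return None
        path.append(IS_A[path[-1]])
    return path


def check_link(out_type: Tuple[str, str], in_type: Tuple[str, str]) -> List[str]:
    """Check that an output (data type, semantic type) can feed an input.

    Returns explanation notes; raises TypeError if the link is ill-typed.
    """
    notes: List[str] = []
    out_data, out_sem = out_type
    in_data, in_sem = in_type
    if out_data != in_data:
        if (out_data, in_data) not in CONVERSIONS:
            raise TypeError(f"no conversion {out_data} -> {in_data}")
        notes.append(f"insert conversion {out_data} -> {in_data}")  # FR11
    chain = isa_chain(out_sem, in_sem)
    if chain is None:
        raise TypeError(f"{out_sem} is not a kind of {in_sem}")
    if len(chain) > 1:
        notes.append("substitution justified by " + " isa ".join(chain))  # FR12
    return notes


notes = check_link(("int", "SpeciesCount"), ("float", "Abundance"))
```

Checking a whole pipeline (FR10) would apply `check_link` to each pair of consecutive steps.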

Use Cases
• Clients of the SMS include the AMS, the EcoGrid, and "scientific workflow engineers".
– UC1: Client requests type signature (data and semantic types) of a registered EcoGrid data set (DS)
– UC2: Client requests "other semantic constraints" of a DS.
– UC3: Client requests type signature (data and semantic types) of an analytical step (AS)
– UC4: Client requests "other semantic constraints" of an AS.
– UC5: Client requests type signature of an AP.
– UC6: Client requests type checking of AP.
– UC7: Client requests registered data sets compatible with the inputs of an AS (e.g., if AS is scale sensitive, then all data sets must have the same scale; a flag is raised if data needs scaling).
– UC8: Client requests all registered ASs which can produce a given parameter (the latter is part of a registered ontology)
– UC9: Client requests candidate predecessor and successor steps for a given AS.

Planned Components
SW1: Formal language(s) for representing/instantiating data types, semantic types, ontologies, and "other semantic constraints".
SW2: System for data type checking and inference (includes introduction of data type conversion steps)
SW3: System for semantic type checking and inference
SW4: [optional] System for "planning" APs given some of: output parameters, data sets, and input parameters

THE PROBLEM – Reconcile this:
• Simple, intuitive graph/pipeline language,
• … which is expressive enough to handle real-world flows (SciDAC: PIW),
• … and allows some static analysis
• while trying to leverage existing work:
– e.g., Ptolemy-II directors: Process Networks (PN), Synchronous Dataflow (SDF), ...,
– or workflow standards and systems

(Analytical) Pipelines …. (Scientific) Workflows
• Spectrum of languages & formalisms:
– Pipelines (a la Unix)
– Dataflow languages:
  • Kahn’s process networks (PN)
  • Synchronous dataflow networks (SDF)
– “Web page-flow”:
  • Active XML, WebML, …
  • Hesitating-weak-alternating-tree-automata-ML
  • …
– (Business) Workflows:
  • WfMC’s XPDL, WSFL, BPELWS, …

Kahn Process Networks (PN)
• Concurrent processes communicating through one-way FIFO channels with unbounded capacity
• A functional process F maps a set of input sequences into a set of output sequences (sounds like XSM!)
• increasing chain of sets of sequences → outputs may not increase!
• Consider increasing chains (wrt. prefix ordering “<”) of streams
• PN is continuous if lub(Xs) exists for all increasing chains Xs and
– F(lub(Xs)) = lub(F(Xs))
• Continuous implies monotonic:
– if Xs < Ys then F(Xs) < F(Ys)

Process Networks (cont’d)
• PN in essence: simultaneous relations between sequences
• Network of functional processes can be described by a mapping X = F(X, I)
– X denotes all the sequences in the network (inputs I + outputs)
• X that forms a solution is a fixed point
• Continuity implies exactly one “minimal” fixed point
– minimal in the sense of prefix ordering for any inputs I
– execution of the network: given I, find the minimal fixed point (works because of the monotonic property)
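
The fixed-point view can be made concrete with a toy network that has one feedback channel. In this sketch (an illustration, not taken from the slides) the channel y satisfies y = F(y, I), where F emits 0 and then y[k] + I[k], so y carries the running sums of the input. Starting from the empty sequence and applying F repeatedly yields an increasing chain of prefixes that ends at the minimal fixed point for a finite input.

```python
from typing import List


def step(y: List[int], inputs: List[int]) -> List[int]:
    """One application of F: emit 0, then y[k] + inputs[k] for each available pair."""
    return [0] + [a + b for a, b in zip(y, inputs)]


def least_fixed_point(inputs: List[int]) -> List[int]:
    """Iterate F from the empty sequence until the channel content stops growing."""
    y: List[int] = []  # bottom element of the prefix order
    while True:
        nxt = step(y, inputs)
        if nxt == y:
            return y
        y = nxt


sums = least_fixed_point([1, 2, 3])
```

Each iterate is a prefix of the next one because `step` is monotonic in `y`, which is the property the slide relies on.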

Synchronous Data Flow Networks (SDF)
• Special case of PN
• Ptolemy-II SDF overview
– SDF supports efficient execution of Dataflow graphs that lack control structures
– with control structures → Process Networks (PN)
– requires that the rates on the ports of all actors be known beforehand
– rates do not change during execution
– in systems with feedback, delays, which are represented by initial tokens on relations, must be explicitly noted
SDF uses this rate and delay information to determine the execution sequence of the actors before execution begins.
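
A first step of such static scheduling is commonly described through balance equations: for every edge, firings of the producer times tokens produced must equal firings of the consumer times tokens consumed. The sketch below solves them for a connected graph; it is a generic illustration, not Ptolemy-II code, and it ignores delays and the actual firing order. The actor names in the usage are hypothetical.

```python
from fractions import Fraction
from math import gcd
from typing import Dict, List, Tuple

# Edge: (producer, tokens produced per firing, consumer, tokens consumed per firing)
Edge = Tuple[str, int, str, int]


def repetitions(edges: List[Edge]) -> Dict[str, int]:
    """Smallest positive firing counts satisfying r[src] * produced == r[dst] * consumed.

    Assumes a non-empty, connected graph.
    """
    rate: Dict[str, Fraction] = {edges[0][0]: Fraction(1)}
    changed = True
    while changed:  # propagate relative rates along the edges
        changed = False
        for src, prod, dst, cons in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
                changed = True
    for src, prod, dst, cons in edges:
        if rate[src] * prod != rate[dst] * cons:
            raise ValueError("inconsistent rates: no periodic schedule exists")
    scale = 1
    for r in rate.values():  # least common multiple of all denominators
        scale = scale * r.denominator // gcd(scale, r.denominator)
    counts = {actor: int(r * scale) for actor, r in rate.items()}
    common = 0
    for value in counts.values():
        common = gcd(common, value)
    return {actor: value // common for actor, value in counts.items()}


firings = repetitions([("A", 2, "B", 3), ("B", 1, "C", 2)])
```

Here A must fire three times for every two firings of B and one firing of C, so all buffers return to their initial fill after one period.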

Extended Kahn-MacQueen Process Networks
• A process is considered active from its creation until its termination
• An active process can block when trying to read from a channel (read-blocked), when trying to write to a channel (write-blocked), or when waiting for a queued topology change request to be processed (mutation-blocked)
• A deadlock is when all the active processes are blocked
– real deadlock: all the processes are blocked on a read
– artificial deadlock: all processes are blocked, at least one process is blocked on a write → increase the capacity of the receiver with the smallest capacity amongst all the receivers on which a process is blocked on a write. This breaks the deadlock.
– If the increase results in a capacity that exceeds the value of maximumQueueCapacity, then instead of breaking the deadlock, an exception is thrown. This can be used to detect erroneous models that require unbounded queues.
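
The deadlock rule above can be sketched as a single decision function. The data structures are hypothetical, mutation-blocked processes are left out, and doubling the capacity is an assumption: the slide only says the capacity is increased.

```python
from typing import Dict


class QueueCapacityError(RuntimeError):
    """Raised when breaking a deadlock would exceed the configured maximum."""


def resolve_deadlock(
    blocked: Dict[str, str],
    write_targets: Dict[str, str],
    capacity: Dict[str, int],
    maximum_queue_capacity: int,
) -> str:
    """Classify a deadlock and, if it is artificial, grow one receiver.

    blocked: process -> "read" or "write" (every active process is blocked)
    write_targets: write-blocked process -> receiver it is blocked on
    capacity: receiver -> current capacity (updated in place)
    """
    writers = [p for p, kind in blocked.items() if kind == "write"]
    if not writers:
        return "real"  # everyone waits on a read: nothing can be done
    receivers = sorted({write_targets[p] for p in writers})
    smallest = min(receivers, key=lambda r: capacity[r])
    new_capacity = capacity[smallest] * 2  # growth policy is an assumption
    if new_capacity > maximum_queue_capacity:
        raise QueueCapacityError(
            f"receiver {smallest} would need capacity {new_capacity}"
        )
    capacity[smallest] = new_capacity
    return "artificial"


caps = {"q1": 4, "q2": 2}
kind = resolve_deadlock(
    {"p1": "write", "p2": "read", "p3": "write"},
    {"p1": "q1", "p3": "q2"},
    caps,
    maximum_queue_capacity=8,
)
```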

Analytical Pipelines: An Open Source Tool

A commercial tool for Analytical Pipelines

MAP: Data Massaging a la Blue-Titan/Perl

Compiling Abstract Scientific Workflows into Web Service Workflows
SSDBM’03

The Problem
• Scientist would like to ...
– create a high-level “abstract” WF and
– not bother about web service URLs, parameter passing, low-level data transformations, ...
• How to go from ...
– a high-level Abstract Workflow (AWF) to
– an Executable (web service) Workflow (EWF)?
• Idea:
– Using nested definitions, express AWF in terms of other AWFs and EWFs; unfold definitions at compile-time
– Abstract-as-View approach

WF Language Constructs (AWF+EWF)
Edge Type | Explanation
Data-In | specifies input data of a task
Conditional Data-In | as before, but data flows only if cond is satisfied
Data-Out | specifies output of a task
Conditional Data-Out | as before, but data flows only if cond is satisfied
Data-Connect | connects output data (of previous steps) to input data (of subsequent steps)
Conditional Data-Connect | as before, but data flows only if cond is satisfied
Parameter | specifies a control parameter of a task

Conceptual Workflow
[Figure: conceptual workflow. Step labels: Compute clusters (min. distance); Select gene-set (cluster-level); For each gene; Retrieve transcription factors; Arrange transcription factors; For each promoter; Compute subsequence labels; With all promoter models; Compute joint promoter model; Retrieve matching cDNA; Retrieve genomic sequence; Extract promoter region (begin, end); Create consensus sequence; Align promoters.]

Abstract Workflow (AWF)
(= chain program over relations with i/o patterns)
% AWF
piw(DB, Gene, TFBSModel) :-
    cDNASequence(Gene, CDNASeq),                      % cDNA for the input gene
    localAlignment(DB, CDNASeq, RankedPromoters),     % ranked promoter hits in DB
    firstRest(Promoter, RankedPromoters, RankedPromoters1),
    promoter_detail(Promoter, PromoterId, Start, End, Orientation),
    cDNASequence(PromoterId, GenomicSeq),
    trim_sequence(GenomicSeq, Start, End, Orientation, ShortSeq),
    convertSeq(Orientation, ShortSeq, PosSeq),
    transfac(PosSeq, TFBSModel).

AWF to EWF in graph form
[Figure: AWF to EWF unfolding. Node labels: piw (AWF); promoters, tfbs_models; AAV; gene_seq, localAlignment; convertToAcc#; genbank_service, embl_service, ddbj_service (EWF); … Edge labels: DB, Gene, GeneId, Promoters, TFBSModels, CDNASeq, GenbankId, EMBLId, DDBJId, cDNASeq.]

AWF → EWF Translation
1. Check whether AWF is well-formed and well-typed; if not, corresponding warnings are issued (a semantic type mismatch may not only be a workflow design error, but often indicates the incompleteness of the underlying ontology).
2. Next the AWF is successively unfolded, using the AAV view definitions.
• (Compiling AWF into EWF using AAV is similar to rewriting a query against a global schema into queries against the sources.)
3. The unfolded logic query plan then undergoes several rewriting steps until a certain normal form (DNF/UCQ) is reached. If the join variables (= the connection edges) are not of the same data type (but at least of compatible semantic types) then the insertion of conversion rules is attempted; if this fails, an error is reported.
4. For each list of conjunctive goals, the system tries to find an executable goal order, i.e., one which satisfies all i/o restrictions imposed by the web service descriptions of executable tasks.
• Implementation: a set of Java and Prolog programs, rules, ontologies and repositories
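
Steps 2 and 3 can be pictured as unfolding abstract tasks until only executable tasks remain, which yields a union of conjunctions (DNF). The sketch below is a simplified illustration in Python rather than the Java/Prolog implementation mentioned above: the `AAV` table is a hypothetical, cut-down version of the PIW definitions, arguments and type checks are omitted, and definitions are assumed to be non-recursive.

```python
from itertools import product
from typing import Dict, List

# Hypothetical AAV definitions: abstract task -> alternative bodies,
# each body a conjunction of task names. Names without an entry are executable.
AAV: Dict[str, List[List[str]]] = {
    "piw": [["cDNASequence", "localAlignment", "transfac"]],
    "cDNASequence": [["genbank"], ["embl"], ["ddbj"]],
    "localAlignment": [["blast"], ["fasta"]],
}


def unfold(task: str) -> List[List[str]]:
    """Unfold a task into a list of alternative plans over executable tasks."""
    if task not in AAV:
        return [[task]]  # already executable
    plans: List[List[str]] = []
    for body in AAV[task]:
        alternatives = [unfold(t) for t in body]
        for combo in product(*alternatives):  # one choice per goal in the body
            plans.append([t for part in combo for t in part])
    return plans


plans = unfold("piw")
```

Each resulting plan is one conjunctive goal list, for which step 4 would then search for an executable order.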

AWF for Matt’s Promoter Identification Workflow
[Figure: AWF graph. Labels: expressionArray, selectGeneSet, geneList, managegeneLoop [while geneList not EMPTY], updatedGeneList, [geneList EMPTY], LOOP1: [for each gene], gene, Loop1Final.]

EWF for Matt’s Extended Promoter Identification Workflow (w/ loops & conditions)
[Figure 1: EWF graph. Labels as extracted: geneId, inspectedTFBSs, shortSeq, sequence, geneNo, seqName, coreValue, sort, threshold, matrix, indiv, TRANSFACMatInspector, complementSequence, plusSeq, partialSeq, minusSeq, manageClustalWLoop, ClustalWSequence, prepareClustalWInput, ListmultipleSeqAlignment, typepwalignment, noMoreGenes, CWSequence. Conditions: [orient < 0], [orient > 0], [geneListNOTEmpty], [geneListEmpty]. loop back: geneList, updatedGeneList.]

Unfolded EWF
[Figure: unfolded EWF graph. Labels as extracted: geneId, inspectedTFBSs, shortSeq, sequence, geneNo, seqName, coreValue, sort, threshold, matrix, indiv, TRANSFACMatInspector, complementSequence, plusSeq, partialSeq, minusSeq, manageClustalWLoop, ClustalWSequence, prepareClustalWInput, ListmultipleSeqAlignment, typepwalignment, noMoreGenes, format, RequestId, cDNASeq, seq1, BlastRID, Genbank1, programdb1, BlastPromoter, fullGenomicSequence, RIdlist_udis, Genbank2, doptcmd2db2cmd1, promoters, seq2, orientation, end, start, hitId, to, from, orient, trimSequence, promoterList, outputNextPromoter, updatedPromoterList. Conditions: [orient < 0], [orient > 0], [geneListNOTEmpty], [geneListEmpty]. loop back: geneList, updatedGeneList.]

Generated EWF Plan (using BIRN Mediation Tool)

Abstract-As-View (AAV) Definitions: Control-Flow Issues
% AAV
cDNASequence(GeneId, CDNASeq) :-
    genbank(GeneId, CDNASeq)
    ; fail(genbank), embl(GeneId, CDNASeq)
    ; fail(genbank), fail(embl), ddbj(GeneId, CDNASeq).

localAlignment(DB, CDNASeq, RankedPromoterList) :-
    blast(CDNASeq, DB, xml, RankedPromoterList)
    ; fail(blast), fasta(CDNASeq, DB, RankedPromoterList)
    ; fail(blast), fail(fasta),
      blat(CDNASeq, querytype, sortcriteria, outputtype, RankedPromoterList).

convertSeq(Orientation, ShortSeq, PosSeq) :-
    negative(Orientation), complement(ShortSeq, PosSeq)
    ; equals(ShortSeq, PosSeq).
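
The control-flow pattern in these definitions (use the next service only if all earlier ones failed) can be sketched as an ordered failover. The service callables below are hypothetical stubs, not real Genbank/EMBL/DDBJ clients, and treating any exception or a `None` result as “fail” is an assumption.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Service = Tuple[str, Callable[[str], Optional[str]]]


def first_success(gene_id: str, services: Sequence[Service]) -> Tuple[str, str]:
    """Try services in order; a later one runs only if all earlier ones failed."""
    failed: List[str] = []
    for name, call in services:
        try:
            result = call(gene_id)
        except Exception:
            result = None  # an error counts as fail(name)
        if result is not None:
            return name, result
        failed.append(name)
    raise LookupError(f"all services failed: {failed}")


calls: List[str] = []


def genbank_stub(gene_id: str) -> Optional[str]:
    calls.append("genbank")
    raise ConnectionError("service unavailable")


def embl_stub(gene_id: str) -> Optional[str]:
    calls.append("embl")
    return "ACGT"


def ddbj_stub(gene_id: str) -> Optional[str]:
    calls.append("ddbj")
    return "ACGT"


answer = first_success(
    "gene-1", [("genbank", genbank_stub), ("embl", embl_stub), ("ddbj", ddbj_stub)]
)
```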

Further Problems
• Reconcile:
– Simple, intuitive graph/pipeline language,
– … which is expressive enough to handle real-world flows (PIW),
– … and allows some static analysis
– while trying to leverage existing work:
  • e.g., Ptolemy-II directors: Process Networks (PN), Synchronous Dataflow (SDF), ...,
  • or workflow standards and systems
• Semi-automatic web service composition:
– use of semantic and data types to define data transformations:
  • map: prev_step.out → next_step.in

(Ptolemy II-Based Architecture)
SciDAC Extensions to Ptolemy-II; Web Service plug-in
[Figure: architecture. The User works with the WF-Pilot: Design (Ptolemy-II), Execution (Ptolemy-II), Execution monitoring (Ptolemy-II). Directors: PN, SDF, .., XPDL/OFBiz Style Ptolemy-II Director. ETs such as Genbank and BLAST are run by web service invocation. WF-Validator (AWF & EWF Validation): takes an AWF and yields a Valid-AWF or Validation Errors, using the Abstract Task (AT) Repository (AAV rules), Data & Parameter Ontologies, the Executable Task (ET) Repository (ET schemas), and the Datatype & Conversion Repository (conversion rules), through query rewriting, semantic type checking, data type conversion, and web service matching.]
ET -- Web service
AT -- (“Mini workflow” of ETs; Composition of ETs and ATs) may become a web service if deployed

Gene Sequence Processing
Designing PIW in Scientific Workflow Management System
User specified parameters:
• The accession numbers, separated by commas,
• The number of promoters to investigate,
• The name of the file to hold the fasta format promoter regions.

Looking Inside an Abstract Task: Gene Sequence Processing
[Figure: “Look Inside” views of nested abstract tasks]

Running the PIW Model
Run or Resume the Model

Summary (Scientific Workflows)
• Spectrum of dataflow/control-flow/workflow approaches:
– SDF, PN, …, AXML, WebML, … XPDL
• Scientist user needs to “visually program” them
• System support needed:
– Translation from simple, conceptual (“declarative”) WFs to executable Web/Grid service plans
– Static analysis to check
  • dynamic properties (deadlocks, starvation, …),
  • feasibility wrt. given sources
  • type compatibilities
• Macro/micro-level planning (overall control flow, local schema mappings)

Summary: Mediation Scenarios & Techniques
Federated Databases | XML-Based Mediation | Model-Based Mediation
One-World | One-/Multiple-Worlds | Complex Multiple-Worlds
Common Schema | Mediated Schema | Common Glue Maps
SQL, rules | XML query languages | DOOD query languages
Schema Transformations | Syntax-Aware Mappings | Semantics-Aware Mappings
Syntactic Joins | Syntactic Joins | “Semantic” Joins via Glue Maps
DB expert | DB expert | KRDB + domain experts
Glue?

Combine Everything: Die eierlegende Wollmilchsau (“the egg-laying wool-milk-sow”, i.e., an all-in-one solution):
• Database Federation/Mediation
– query rewriting under GAV/LAV
– w/ binding pattern constraints
– distributed query processing
• Semantic Mediation
– semantic integrity constraints, reasoning w/ plans, automated deduction
– deductive database/logic programming technology, AI “stuff” ...
– Semantic Web technology
• Scientific Workflow Management
– more procedural than database mediation (often the scientist is the query planner)
– deployment using web services