Post on 24-Dec-2015
Kepler: Towards a Grid-Enabled System for Scientific Workflows
Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludäscher*, Steve Mock
*ludaesch@SDSC.EDU
San Diego Supercomputer Center (SDSC)
University of California, San Diego (UCSD)
B. Ludäscher et al. – Grid-Enabling Kepler 2
Outline
• Motivation: Scientific Workflows (SEEK, SDM, GEON, ..)
• Current Features of the Kepler Scientific Workflows System
• Extending Kepler:
  – Grid-Enabling Kepler:
    • 3rd party transfer
  – WF planning & optimization
    • Shipping and Handling Algebra (SHA)
    • Web Service Composition as Declarative Query Plans
  – Semantic Types for Scientific Workflows
• Conclusions
Kepler Team, Projects, Sponsors
• Ilkay Altintas SDM
• Chad Berkley SEEK
• Shawn Bowers SEEK
• Jeffrey Grethe BIRN
• Christopher H. Brooks Ptolemy II
• Zhengang Cheng SDM
• Efrat Jaeger GEON
• Matt Jones SEEK
• Edward A. Lee Ptolemy II
• Kai Lin GEON
• Bertram Ludäscher BIRN, GEON, SDM, SEEK
• Steve Mock NMI
• Steve Neuendorffer Ptolemy II
• Jing Tao SEEK
• Mladen Vouk SDM
• Yang Zhao Ptolemy II
• …
Example: SEEK – Science Environment for Ecological Knowledge (large NSF ITR)
• Analysis & Modeling System
  – Design and execution of ecological models and analysis
  – End user focus
  – application-/upperware
• Semantic Mediation System
  – Data Integration of hard-to-relate sources and processes
  – Semantic Types and Ontologies
  – upper middleware
• EcoGrid
  – Access to ecology data and tools
  – middle-/underware
Architecture Overview (cf. Cyberinfrastructure)
Ecology: GARP Analysis Pipeline for Invasive Species Prediction
[Figure: GARP analysis pipeline.
Steps: EcoGridQuery, LayerIntegration, SampleData, DataCalculation, MapGeneration, Validation, User, GenerateMetadata, ArchiveToEcogrid, all backed by registered EcoGrid databases.
Data products, for the native range and the invasion area: (a) species presence & absence points; (b) environmental layers; (c) integrated layers; (d) training sample and test sample; (e) GARP rule set; (f) native range and invasion area prediction maps; (g) model quality parameter; (h) selected prediction maps.]
Source: NSF SEEK (Deana Pennington et al., UNM)
Genomics Example: Promoter Identification Workflow (PIW)
Source: Matt Coleman (LLNL)
Source: NIH BIRN (Jeffrey Grethe, UCSD)
Scientific “Workflows”: Some Findings
• More dataflow than (business control-/) workflow
  – DiscoveryNet, Kepler, SCIRun, Scitegic, Taverna, Triana, …
• Need for “programming extension”
  – Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …)
• Need for abstraction and nested workflows
• Need for data transformations (WS1 → DT → WS2)
• Need for rich user interaction & workflow steering:
  – pause / revise / resume
  – select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF
• Need for high-throughput transfers (“grid-enabling”, “streaming”)
• Need for persistence of intermediate products and provenance
In a Flux: Workflow “Standards”
Source: W.M.P. van der Aalst et al.
http://tmitwww.tm.tue.nl/research/patterns/
http://tmitwww.tm.tue.nl/staff/wvdaalst/Publications/publications.html
Commercial & Open Source Scientific “Workflow” (well, Dataflow) Systems
Kensington Discovery Edition from InforSense
Taverna
Triana
SCIRun: Problem Solving Environments for Large-Scale Scientific Computing
• SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations
• New collaboration under Kepler/SDM
• Component model, based on generalized dataflow programming
Steve Parker (cs.utah.edu)
Our Starting Point: Ptolemy II & Dataflow Process Networks
see!
try!
read!
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
Why Ptolemy II?
• Ptolemy II Objective:
  – “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.”
• Data & Process oriented: Dataflow process networks
• Natural Data Streaming Support
• User-Orientation
  – “application-ware”, not middle-/under-ware
  – Workflow design & exec console (Vergil GUI)
• PRAGMATICS
  – mature, actively maintained, well-documented (500+pp)
  – open source system
  – developed across multiple projects (NSF/ITRs SEEK and GEON, DOE SciDAC SDM, …)
  – hoping to leverage e-sister projects (e.g. Taverna, …)
Dataflow Process Networks: Putting Computation Models (“Orchestration”) first!
• Synchronous Dataflow Network (SDF)
  – Statically schedulable single-threaded dataflow
    • Can execute multi-threaded, but the firing-sequence is known in advance
  – Maximally well-behaved, but also limited expressiveness
• Process Network (PN)
  – Multi-threaded dynamically scheduled dataflow
  – More expressive than SDF (dynamic token rate prevents static scheduling)
  – Natural streaming model
• Other Execution Models (“Domains”)
  – Implemented through different “Directors”
[Figure: two actors with typed i/o ports, connected by a FIFO channel; advanced push/pull]
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
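The contrast between fixed-rate (SDF-style) and dynamic-rate (PN-style) actors can be sketched in Haskell, the language this talk uses for the PIW specification. This is an illustration under stated assumptions only: lazy lists stand in for FIFO channels, actors are plain stream functions, and the names `scale`, `add`, `keepEven`, `sdfPart`, and `pnNetwork` are made up here. None of this is Ptolemy II's director API.

```haskell
-- Sketch only: lazy lists model FIFO channels, functions model actors.
type Stream a = [a]

-- Fixed-rate actor: consumes one token and produces one token per firing.
scale :: Int -> Stream Int -> Stream Int
scale k = map (k *)

-- Fixed-rate actor with two input ports: one token from each port per firing.
add :: Stream Int -> Stream Int -> Stream Int
add = zipWith (+)

-- Dynamic-rate actor: produces zero or one token per consumed token,
-- so its output rate is not known before the data arrives.
keepEven :: Stream Int -> Stream Int
keepEven = filter even

-- All rates are fixed here, so a firing sequence could be computed statically.
sdfPart :: Stream Int -> Stream Int
sdfPart xs = add (scale 2 xs) (scale 3 xs)

-- Adding keepEven makes the token rate data-dependent (PN-style).
pnNetwork :: Stream Int -> Stream Int
pnNetwork = keepEven . sdfPart
```

Because the streams are lazy, both networks run on unbounded input: `take 3 (pnNetwork [1 ..])` evaluates to `[10,20,30]`.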
Actor-/Dataflow Orientation vs Object-/Control flow Orientation
Marrying or Divorcing Control- & Dataflow
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
Overview: Scientific Workflows in Kepler
• Modeling and Workflow Design
• Web services = individual components (“actors”)
• “Minute-Made” Application Integration:
  – Plugging-in and harvesting web service components is easy, fast
• Rich SWF modeling semantics (“directors”):
  – Different and precise dataflow models of computation
  – Clear and composable component interaction semantics
• Web service composition and application integration tool
• Coming soon:
  – Shrink-wrapped, pre-packaged “Kepler-to-Go”
  – Structural and semantic typing (better design support)
  – Grid-enabled web services (for big data, big computations, …)
  – Different deployment models (web service, web site, applet, …)
The KEPLER GUI: Vergil (Steve Neuendorffer, Ptolemy II)
Drag and drop utilities, director and actor libraries.
Running a Genomics WF (Ilkay Altintas, SDM)
Support for Multiple Workflow Granularities
[Figure labels: Boulders; Sand; Powder; Plumbing. Abstraction: Sand to Rocks]
Directors and Combining Different Component Interaction Semantics
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
Application Examples: Mineral Classification with Kepler … (Efrat Jaeger, GEON)
Standard BrowserUI: Client-Side SVG
SWF Reengineering (Ashraf, Efrat, Kai, GEON)
Result launched via BrowserUI actor (coupling with ESRI’s ArcIMS)
Distributed Workflows in KEPLER
• Web and Grid Service plug-ins
  – WSDL (now) and Grid services (stay tuned …)
  – ProxyInit, GlobusGridJob, GridFTP, DataAccessWizard
  – SSH, SCP, SDSC SRB, OGS?-???… coming
• WS Harvester
  – Import query-defined WS operations as Kepler actors
• XSLT and XQuery Data Transformers
  – to link not “designed-to-fit” web services
• WS-deployment interface (planned)
Generic Web Service Actor (Ilkay Altintas)
Given a WSDL and the name of an operation of a web service, dynamically customizes itself to implement and execute that method.
Configure - select service operation
Set Parameters and Commit
Specialized WS Actor (after instantiation)
Web Service Harvester (Ilkay Altintas, SDM)
• Imports the web services in a repository into the actor library.
• Has the capability to search for web services based on a keyword.
Composing 3rd-Party WSs (NMI, Steve Mock)
Output of previous web service
User interaction & Transformations
Input of next web service
A Special Generic Ingestion Actor for EML Data (SEEK, Chad Berkley)
Ingests any data format described by EML metadata
Converts raw data to Ptolemy format
Data can then be operated on with other actors
Promoter Identification Workflow (PIW)
Source: Matt Coleman (LLNL)
Promoter Identification Workflow in Ptolemy-II [SSDBM’03]
Execution Semantics
[Figure callouts on the Ptolemy-II PIW:]
• hand-crafted control solution; also: forces sequential execution!
• designed to fit
• hand-crafted Web-service actor
• Complex backward control-flow
• No data transformations available
Promoter Identification Workflow in FP
genBankG       :: GeneId -> GeneSeq
genBankP       :: PromoterId -> PromoterSeq
blast          :: GeneSeq -> [PromoterId]
promoterRegion :: PromoterSeq -> PromoterRegion
transfac       :: PromoterRegion -> [TFBS]
gpr2str        :: (PromoterId, PromoterRegion) -> String

d0 = Gid "7"                 -- start with some gene-id
d1 = genBankG d0             -- get its gene sequence from GenBank
d2 = blast d1                -- BLAST to get a list of potential promoters
d3 = map genBankP d2         -- get list of promoter sequences
d4 = map promoterRegion d3   -- compute list of promoter regions and ...
d5 = map transfac d4         -- ... get transcription factor binding sites
d6 = zip d2 d4               -- create list of pairs promoter-id/region
d7 = map gpr2str d6          -- pretty print into a list of strings
d8 = concat d7               -- concat into a single "file"
d9 = putStr d8               -- output that file
Cleaned up Process Network PIW
• Back to purely functional dataflow process network (= also a data streaming model!)
• Re-introducing map(f) to Ptolemy-II (was there in PT Classic)
  – no control-flow spaghetti
  – data-intensive apps
  – free concurrent execution
  – free type checking
  – automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]
[Figure callouts:]
• map(f)-style iterators
• Powerful type checking
• Generic, declarative “programming” constructs
• Generic data transformation actors
• Forward-only, abstractable sub-workflow piw(GeneId)
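The step from piw(GeneId) to PIW := map(piw) over [GeneId] can be sketched as follows. This is a minimal sketch, not the authors' specification: the service functions keep the names from the FP slide, but their bodies are hypothetical stubs that only build strings (the real ones are web service calls), the newtype definitions and `piwAll` are assumptions, and `transfac` is left out because the slide's final output does not use its result.

```haskell
-- Hypothetical data types; the slide only names them.
newtype GeneId = Gid String
newtype GeneSeq = GeneSeq String
newtype PromoterId = Pid String
newtype PromoterSeq = PromoterSeq String
newtype PromoterRegion = Region String

-- Stub services (assumptions): each one just tags its input.
genBankG :: GeneId -> GeneSeq
genBankG (Gid g) = GeneSeq ("seq-of-" ++ g)

blast :: GeneSeq -> [PromoterId]
blast (GeneSeq s) = [Pid (s ++ "/p1"), Pid (s ++ "/p2")]

genBankP :: PromoterId -> PromoterSeq
genBankP (Pid p) = PromoterSeq ("seq-of-" ++ p)

promoterRegion :: PromoterSeq -> PromoterRegion
promoterRegion (PromoterSeq s) = Region ("region-of-" ++ s)

gpr2str :: (PromoterId, PromoterRegion) -> String
gpr2str (Pid p, Region r) = p ++ " -> " ++ r

-- Forward-only sub-workflow for a single gene id.
piw :: GeneId -> [String]
piw gid = map gpr2str (zip pids regions)
  where
    pids = blast (genBankG gid)
    regions = map (promoterRegion . genBankP) pids

-- The lifted workflow: the same sub-workflow mapped over a list of gene ids.
piwAll :: [GeneId] -> [[String]]
piwAll = map piw
```

Because `piw` has no side effects, `piwAll` needs no extra control flow, and the elements of the input list could be processed concurrently without changing the result.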
Optimization by Declarative Rewriting I
• PIW as a declarative, referentially transparent functional process
  – optimization via functional rewriting possible, e.g. map(f o g) = map(f) o map(g)
• Technical report & PIW specification in Haskell
[Figure callouts: map(f o g) instead of map(f) o map(g); Combination of map and zip]
http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
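The rewrite rule map(f o g) = map(f) o map(g) can be checked on a small example. The functions below are arbitrary placeholders chosen for illustration. The `zipWith` pair is one plausible reading of "combination of map and zip" and is an assumption, not taken from the technical report.

```haskell
-- Two traversals of the list: first double, then increment.
unfused :: [Int] -> [Int]
unfused = map (+ 1) . map (* 2)

-- One traversal: the composed function is applied per element.
fused :: [Int] -> [Int]
fused = map ((+ 1) . (* 2))

-- map over zip builds an intermediate list of pairs ...
pairUnfused :: [Int] -> [Int] -> [Int]
pairUnfused xs ys = map (uncurry (+)) (zip xs ys)

-- ... which zipWith avoids while computing the same result.
pairFused :: [Int] -> [Int] -> [Int]
pairFused = zipWith (+)
```

For example, `unfused [1,2,3]` and `fused [1,2,3]` both give `[3,5,7]`. In a workflow setting the fused form corresponds to one actor invocation per token instead of two.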
Optimizing II: Streams & Pipelines
• Clean functional semantics facilitates algebraic workflow (program) transformations (Bird-Meertens); e.g. mapS f • mapS g → mapS (f • g)
Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney
Middle/Underware Access: Querying Databases
• Database connection actor:
  – Opening a database connection and passing it to all actors accessing this database.
• Database query actor:
  – A generic actor that queries a database and provides its result.
• DBConnection type and DBConnectionToken:
  – A new IOPort type and a token to distinguish a database connection from any general type.
Database Connection Actor
• OpenDBConnection actor:
  – Input: database connection information
  – Output: DBConnectionToken (reference to a DB connection instance, via a DBConnection output port)
Database Query Actor
• Database Query actor:
  – Input: SQL query string and a DB connection token
  – Parameters:
    • output type: XML, Record, or String
    • tuple-at-a-time vs set-at-a-time
  – Process:
    • execute query
    • produce results according to parameters
Querying Example
An (oversimplified) Model of the Grid
• Hosts: {h1, h2, h3, …}
• Data@Hosts: d1@{hi}, d2@{hj}, …
• Functions@Hosts: f1@{hi}, f2@{hj}, …
• Given: data/workflow:
  – … as a functional plan: […; Y := f(X); Z := g(Y); …]
  – … as a logic plan: […; f(X,Y) ∧ g(Y,Z); …]
• Find Host Assignment: di → hi, fj → hj for all di, fj … s.t. […; d3@h3 := f@h2(d1@h1), …] is a valid plan
[Figure: X → f → Y → g → Z]
Shipping and Handling Algebra (SHA)
[Figure: logical view of f@A with input x@b and output y@c, and the physical views of SHA plans (1), (2), and (3)]
plan Y@C = F@A of X@B =
1. [ X@B to A, Y@A := F@A(X@A), Y@A to C ]
2. [ F@A => B, Y@B := F@B(X@B), Y@B to C ]
3. [ X@B to C, F@A => C, Y@C := F@C(X@C) ]
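The three physical plans for Y@C = F@A of X@B can be written down as data and compared under a cost model. The sketch below is an assumption-laden illustration: the `Step` type, the function names, and especially the cost model (shipping costs the size of the shipped item, applying a function is free) are hypothetical and are not part of SHA as presented on the slide.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

type Host = String

data Step
  = ShipData String Host Host        -- "X@B to A": move data between hosts
  | ShipFun String Host Host         -- "F@A => B": move a function between hosts
  | Apply String String String Host  -- "Y@h := F@h(X@h)": run the function at one host
  deriving (Eq, Show)

-- The three plans from the slide, for function host a, input host b, output host c.
plans :: Host -> Host -> Host -> [[Step]]
plans a b c =
  [ [ShipData "X" b a, Apply "Y" "F" "X" a, ShipData "Y" a c]
  , [ShipFun "F" a b, Apply "Y" "F" "X" b, ShipData "Y" b c]
  , [ShipData "X" b c, ShipFun "F" a c, Apply "Y" "F" "X" c]
  ]

-- Hypothetical cost model: shipping costs the item's size, local steps are free.
stepCost :: (String -> Int) -> Step -> Int
stepCost size (ShipData d from to)
  | from == to = 0
  | otherwise = size d
stepCost size (ShipFun f from to)
  | from == to = 0
  | otherwise = size f
stepCost _ (Apply _ _ _ _) = 0

planCost :: (String -> Int) -> [Step] -> Int
planCost size = sum . map (stepCost size)

-- Pick the plan with the lowest total shipping cost.
cheapest :: (String -> Int) -> Host -> Host -> Host -> [Step]
cheapest size a b c = minimumBy (comparing (planCost size)) (plans a b c)
```

With a large input X, a small output Y, and a small function F, plan 2 (ship the function to the data) comes out cheapest under this model, which matches the intuition behind moving computation to big data.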
Grid-Enabling PTII: Handles
[Figure: actors A and B in Kepler space; grid services GA and GB in Grid space; arrows numbered 1–7 as listed below]
1. A → GA: get_handle
2. GA → A: return &X
3. A → B: send &X
4. B → GB: request &X
5. GB → GA: request &X
6. GA → GB: send *X
7. GB → B: send done(&X)
Example: &X = “GA.17”, *X = <some_huge_file>
Candidate Formalisms:
• GridFTP
• SSH, SCP
• SDSC SRB
• OGS?-??? … WSRF?
Logical token transfer (3) requires get_handle(1,2); then exec_handle(4,5,6,7) for completion.
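The handle protocol can be mimicked with two in-memory stores standing in for the grid services GA and GB. Everything here is a hypothetical sketch: the names `getHandle` and `execHandle` follow the slide's get_handle and exec_handle, but the types, the handle numbering, and the association-list stores are assumptions, and no real transfer mechanism (GridFTP, SRB, ...) is involved.

```haskell
type Handle = String              -- &X, e.g. "GA.17"
type Payload = String             -- *X, the (possibly huge) data itself
type Store = [(Handle, Payload)]  -- what one grid service currently holds

-- Steps 1-2: the source service registers the payload and returns a handle.
-- Only the handle goes back to the actor in Kepler space.
getHandle :: String -> Payload -> Store -> (Handle, Store)
getHandle server payload store = (h, (h, payload) : store)
  where
    h = server ++ "." ++ show (length store + 1)

-- Steps 4-7: the target service resolves the handle at the source service,
-- copies the payload into its own store, and reports the handle as done.
-- Returns Nothing when the source does not know the handle.
execHandle :: Handle -> Store -> Store -> Maybe (Handle, Store)
execHandle h source target = do
  payload <- lookup h source
  return (h, (h, payload) : target)
```

Step 3, the logical token transfer from A to B, then amounts to passing the short `Handle` string between actors while the payload moves directly between the two stores.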
Extensions: Semantic Type
• Take concepts and relationships from an ontology to “semantically type” the data-in/out ports
• Application: e.g., design support:
  – smart/semi-automatic wiring, generation of “massaging actors”
[Figure: actor m1 (normalize) with ports p3 and p4]
Takes Abundance Count Measurements for Life Stages
Returns Mortality Rate Derived Measurements for Life Stages
Semantic Types
• The semantic type signature
  – Type expressions over the (OWL) ontology
[Figure: actor m1 (normalize) with ports p3 and p4]
SemType m1 ::
Observation & itemMeasured.AbundanceCount &
hasContext.appliesTo.LifeStageProperty
->
DerivedObservation & itemMeasured.MortalityRate &
hasContext.appliesTo.LifeStageProperty
Extended Type System (here: OWL Semantic Types)
SemType m1 :: Observation & itemMeasured.AbundanceCount & hasContext.appliesTo.LifeStageProperty -> DerivedObservation & itemMeasured.MortalityRate & hasContext.appliesTo.LifeStageProperty
Substructure association:
XML raw-data =(X)Query=> object model =link=> OWL ontology
Semantic Types for Scientific Workflows
Deriving Data Transformations from Semantic Service Registration
[Bowers-Ludaescher,DILS’04]
Structural and Semantic Mappings
[Bowers-Ludaescher,DILS’04]
Workflow Planning as Planning Queries with Limited Access Patterns
• User query Q: answer(ISBN, Author, Title) ← book(ISBN, Author, Title), catalog(ISBN, Author), not library(ISBN).
• Limited (web service) Access Patterns (API)
  – Src1.books: in: ISBN out: Author, Title
  – Src1.books: in: Author out: ISBN, Title
  – Src2.catalog: in: {} out: ISBN, Author
  – Src3.library: in: {} out: ISBN
• Q is not executable, but feasible (equivalent to executable Q’: catalog ; book ; not library)
ICDE (poster), EDBT, PODS (papers), [Nash-Ludaescher,2004]
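Finding the executable order catalog ; book ; not library can be sketched as a greedy reordering of the query's literals. This is a simplified illustration, not the algorithm from the cited papers: it only covers the case where reordering is enough, it identifies access-pattern inputs by variable name rather than by argument position, and all type and function names are assumptions.

```haskell
type Var = String

data Literal = Literal
  { relation :: String
  , args :: [Var]
  , positive :: Bool
  } deriving (Eq, Show)

-- For each relation, the alternative sets of variables that must be bound on input.
type Patterns = String -> [[Var]]

-- A literal can be called once some access pattern has all its inputs bound;
-- a negated literal additionally needs all of its variables bound.
callable :: Patterns -> [Var] -> Literal -> Bool
callable patterns bound lit =
  usable && (positive lit || all (`elem` bound) (args lit))
  where
    usable = any (all (`elem` bound)) (patterns (relation lit))

-- Greedily emit any callable literal; positive literals bind their variables.
-- Returns Nothing if the remaining literals can never be called.
plan :: Patterns -> [Literal] -> Maybe [Literal]
plan patterns = go []
  where
    go _ [] = Just []
    go bound lits =
      case filter (callable patterns bound) lits of
        [] -> Nothing
        (lit : _) -> (lit :) <$> go (bound ++ boundBy lit) (filter (/= lit) lits)
    boundBy lit = if positive lit then args lit else []

-- The query and access patterns from the slide.
exampleQuery :: [Literal]
exampleQuery =
  [ Literal "book" ["ISBN", "Author", "Title"] True
  , Literal "catalog" ["ISBN", "Author"] True
  , Literal "library" ["ISBN"] False
  ]

examplePatterns :: Patterns
examplePatterns "book" = [["ISBN"], ["Author"]]
examplePatterns "catalog" = [[]]
examplePatterns "library" = [[]]
examplePatterns _ = []
```

Here `fmap (map relation) (plan examplePatterns exampleQuery)` yields `Just ["catalog","book","library"]`. The greedy choice is safe in this setting because the set of bound variables only grows, so a literal that is callable now stays callable later.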
Conclusions
• Summary
  – Kepler Scientific Workflow System
  – Open source, cross-project collaboration (SEEK, GEON, SDM, …)
  – Actor & Dataflow-oriented Modeling, Design, Execution (Ptolemy II heritage)
  – Prototyping, static analysis, web services, data transformations
• Next Steps
  – First official release (“Kepler-to-Go”) April/May ’04
    • e-Science meeting NeSC, Edinburgh
  – Grid-enabling
    • 3rd party transfer, planning, optimization, …
  – Semantic Typing [DILS’04]
  – Provenance, Fault tolerance, …
  – Link-Up w/ e.g. Taverna, Pegasus, …
  – Become a member or co-developer (You!)