Applying the Virtual Data Provenance Model
IPAW 2006
Yong Zhao, Ian Foster, Michael Wilde
4 May 2006
www.griphyn.org/vds
Virtual Data Origins: The Grid Physics Network
Enhance scientific productivity through…
  Discovery, application and management of data and processes at all scales
  Using a worldwide data grid as a scientific workstation
The key to this approach is Virtual Data: creating and managing datasets through workflow “recipes” and provenance recording.
The purpose of Virtual Data
Better understanding of data (and tools)
Assess what happened at run-time
Easier to express and execute work
Discover useful workflow patterns
Adapt workflow patterns to new needs
Virtual Data Example: Galaxy Cluster Search
[Figure: workflow DAG over Sloan data, and a plot of the galaxy cluster size distribution (x-axis: Number of Galaxies, 1 to 100; y-axis: Number of Clusters, 1 to 100000)]
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago
How are Workflow and Provenance connected?
Workflow: specifies what to do
Provenance: tracks what was done
Virtual Data integrates these capabilities
[Figure: workflow states Waiting, Executable, Executing, Executed; operations Query, Edit, Schedule; Execution environment; labels “What I Want to Do”, “What I Am Doing”, “What I Did”]
Temporal aspects of provenance
Prospective provenance
  The workflow recipes for how to produce data
  Metadata annotating code and data
Retrospective provenance
  Run-time records of data production environment: where, how long, how much
Expressing Workflow in VDL
TR grep (in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
TR sort (in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
DV grep (a1=@{in:file1}, a2=@{out:file2});
DV sort (a1=@{in:file2}, a2=@{out:file3});
[Figure: file1 → grep → file2 → sort → file3]
Define a “function” wrapper for an application
Define “formal arguments” for the application
Define a “call” to invoke application
Provide “actual” argument values for the invocation
Connect applications via output-to-input dependencies
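The output-to-input chaining between the two DV calls can be illustrated outside VDL. The Python sketch below is not part of VDS; the `Derivation` class and the `dependencies` function are hypothetical names introduced only to show how a dependency edge falls out of matching file names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """One DV: a call to a transformation with its input and output files."""
    name: str
    inputs: tuple
    outputs: tuple


def dependencies(derivations):
    """Return (producer, consumer, file) edges implied by shared file names."""
    producer_of = {f: d.name for d in derivations for f in d.outputs}
    edges = []
    for d in derivations:
        for f in d.inputs:
            if f in producer_of:
                edges.append((producer_of[f], d.name, f))
    return edges


# The two calls from the slide: grep reads file1 and writes file2,
# sort reads file2 and writes file3.
dvs = [
    Derivation("grep", ("file1",), ("file2",)),
    Derivation("sort", ("file2",), ("file3",)),
]
edges = dependencies(dvs)
print(edges)
```

No edge is declared explicitly: sort depends on grep only because grep's output file is sort's input file.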
Terminology
Virtual data
  Defining data by the logical workflow needed to create it virtualizes it with respect to location, existence, failure, and representation
VDL – Virtual Data Language
  A language (text and XML) that defines the functions and function calls of a workflow
VDC – Virtual Data Catalog
  The database and schema that store VDL definitions
VDS – Virtual Data System
  The tools to define, store, manipulate and execute virtual data workflows and query data provenance
Executing VDL Workflows
[Figure: pipeline stages Workflow spec, Create Execution Plan, Grid Workflow Execution; components: VDL Program, Virtual Data Catalog, Virtual Data Workflow Generator, Abstract workflow, Pegasus Planner, Job Planner, Job Cleanup, DAGman DAG, DAGman & Condor-G]
(Presenter note: show world and results in large DAG on right, as animated overlay)
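The one constraint any execution plan has to respect is that a job runs only after the jobs producing its inputs. The sketch below shows that ordering with Python's standard `graphlib` module; it does not reproduce what the Pegasus planner or DAGman actually do, and the `abstract_workflow` dictionary and `ready_batches` function are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical abstract workflow: each job maps to the set of jobs it depends on.
abstract_workflow = {
    "grep": set(),
    "sort": {"grep"},
}


def ready_batches(workflow):
    """Group jobs into batches; jobs within a batch have no mutual dependencies."""
    sorter = TopologicalSorter(workflow)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = ready_batches(abstract_workflow)
print(batches)
```

Jobs in the same batch could be submitted to the grid concurrently; each batch becomes ready only when the previous one has finished.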
Dimensions of Provenance Data
Virtual Data Catalog
  Virtual data relationships
  Metadata annotations
  Derivation lineage
Virtual Data Catalog Schema
[Figure: entity-relationship diagram]
Entities and attributes:
  Invocation: dvID, host, start, duration, exitcode, stats
  Call: nmspace, name, version
  Procedure: nmspace, name, version
  FormalArg: argname, type, direction
  ActualArg: argname, value
  Workflow: wfid, fromDV, toDV
  Dataset: nmspace, name
  Annotation: object, pred, type/val, user, date
Relationship labels (with 1 and * cardinalities): passes, executes, calls, binds, references, describes, uses, includes
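To make the catalog structure concrete, the following sketch builds a few of the slide's entities as SQLite tables. It is an illustration under assumptions, not the actual VDC schema: the table names, the foreign keys, and the sample rows are hypothetical, and only the column names are taken from the slide.

```python
import sqlite3

# Hypothetical relational rendering of four entities from the slide.
SCHEMA = """
CREATE TABLE vdc_procedure (
    id INTEGER PRIMARY KEY,
    nmspace TEXT, name TEXT, version TEXT
);
CREATE TABLE vdc_call (
    id INTEGER PRIMARY KEY,
    nmspace TEXT, name TEXT, version TEXT,
    procedure_id INTEGER REFERENCES vdc_procedure(id)
);
CREATE TABLE vdc_actual_arg (
    call_id INTEGER REFERENCES vdc_call(id),
    argname TEXT, value TEXT
);
CREATE TABLE vdc_invocation (
    id INTEGER PRIMARY KEY,
    dv_id INTEGER REFERENCES vdc_call(id),
    host TEXT, start TEXT, duration REAL, exitcode INTEGER, stats TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up sample rows: one procedure, one call to it, one run of that call.
conn.execute("INSERT INTO vdc_procedure VALUES (1, 'example', 'grep', '1.0')")
conn.execute("INSERT INTO vdc_call VALUES (1, 'example', 'grep1', '1.0', 1)")
conn.execute("INSERT INTO vdc_actual_arg VALUES (1, 'a1', 'file1')")
conn.execute(
    "INSERT INTO vdc_invocation VALUES (1, 1, 'node01', '2006-05-04 10:00:00', 1.5, 0, '')"
)

# Walk from a run-time record back to the procedure it executed.
row = conn.execute("""
    SELECT p.name, i.host, i.exitcode
    FROM vdc_invocation i
    JOIN vdc_call c ON i.dv_id = c.id
    JOIN vdc_procedure p ON c.procedure_id = p.id
""").fetchone()
print(row)
```

Keeping invocations in their own table lets one call accumulate many run-time records, which is what joins retrospective provenance to the prospective definitions.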
Run Time Environment and Provenance Collection
[Figure: pipeline stages Specify Workflow, Create and run DAG, Grid Workflow Execution (on worker nodes); components: VDL, Virtual Data Catalog, Virtual Data Workflow Generator, Abstract workflow, Pegasus Planner, DAGman script, DAGman & Condor-G; on the worker nodes: launcher (×2), grep and sort over file1, file2, file3, Provenance data (×2), Provenance collector]
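The launcher's role in the figure, wrapping a job and emitting a run-time record, can be approximated in a few lines. This is a minimal sketch and not the VDS launcher: the `launch` function and the exact record fields are assumptions, chosen to mirror the Invocation attributes (host, start, duration, exitcode).

```python
import socket
import subprocess
import sys
import time
from datetime import datetime, timezone


def launch(argv):
    """Run argv as a child process and return a provenance record of the run."""
    start = datetime.now(timezone.utc).isoformat()
    t0 = time.monotonic()
    completed = subprocess.run(argv)
    return {
        "host": socket.gethostname(),
        "start": start,
        "duration": time.monotonic() - t0,  # wall-clock seconds
        "exitcode": completed.returncode,
        "args": list(argv),
    }


# A trivial job stands in for grep or sort so the sketch runs anywhere.
record = launch([sys.executable, "-c", "pass"])
print(record)
```

In a real deployment the record would be shipped to a collector rather than printed; that transport is left out here.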
Provenance Query Types
Virtual data relationships
Annotations
Lineage
Multi-dimension
Compositional
Context for Query Examples: Functional MRI Analysis
[Figure: fMRI workflow DAG. Calls: align_warp/1, /3, /5, /7; reslice/2, /4, /6, /8; softmean/9; slicer/10, /12, /14; convert/11, /13, /15. Files: 3a.h, 3a.i through 6a.h, 6a.i; ref.h, ref.i; 3a.w through 6a.w; 3a.s.h, 3a.s.i through 6a.s.h, 6a.s.i; atlas.h, atlas.i; atlas_x.ppm, atlas_y.ppm, atlas_z.ppm; atlas_x.jpg, atlas_y.jpg, atlas_z.jpg]
Workflow courtesy James Dobson, Dartmouth Brain Imaging Center
Virtual Data Relationships
Simple query by signature:
  Show procedures in namespace /pub/bin/std that have inputs of type SubjectImage and outputs of type ThumbNailImage.
Actual arguments and runtime provenance:
  Show alignlinear calls (including all arguments), in XML format, with argument model=rigid, and which generated more than 10,000 page faults, on ia64 processors.
  Show calls to procedure alignlinear, and their runtimes, with argument model=rigid that ran in less than 30 minutes on non-ia64 processors.
Aggregate query:
  Show me the average runtime of all alignlinear calls with argument model=rigid that ran in less than 30 minutes.
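The aggregate query combines an actual-argument filter with run-time provenance, which maps naturally onto a relational join. The sketch below expresses it in SQL over a deliberately flattened, hypothetical pair of tables; the table layout and the sample runs are assumptions, not the VDC schema or real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invocation (id INTEGER PRIMARY KEY, proc TEXT, arch TEXT, duration REAL);
CREATE TABLE actual_arg (invocation_id INTEGER, argname TEXT, value TEXT);
""")
# Hypothetical runs: (id, procedure, architecture, runtime in seconds).
conn.executemany("INSERT INTO invocation VALUES (?, ?, ?, ?)", [
    (1, "alignlinear", "ia64", 600.0),
    (2, "alignlinear", "x86", 1200.0),
    (3, "alignlinear", "x86", 2400.0),  # over 30 minutes, so excluded
    (4, "alignlinear", "x86", 300.0),   # model=affine, so excluded
])
conn.executemany("INSERT INTO actual_arg VALUES (?, ?, ?)", [
    (1, "model", "rigid"),
    (2, "model", "rigid"),
    (3, "model", "rigid"),
    (4, "model", "affine"),
])

# Average runtime of alignlinear calls with model=rigid that ran under 30 minutes.
avg_runtime = conn.execute("""
    SELECT AVG(i.duration)
    FROM invocation i
    JOIN actual_arg a ON a.invocation_id = i.id
    WHERE i.proc = 'alignlinear'
      AND a.argname = 'model' AND a.value = 'rigid'
      AND i.duration < 30 * 60
""").fetchone()[0]
print(avg_runtime)
```

Adding `AND i.arch <> 'ia64'` to the WHERE clause would give the non-ia64 variant of the query.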
Provenance for Large-scale ATLAS Simulation
How much compute time was delivered?
| years | mon  | year |
+-------+------+------+
| .45   | 6    | 2004 |
| 20    | 7    | 2004 |
| 34    | 8    | 2004 |
| 40    | 9    | 2004 |
| 15    | 10   | 2004 |
| 15    | 11   | 2004 |
| 8.9   | 12   | 2004 |
+-------+------+------+
Selected statistics for one of these jobs:
  start: 2004-09-30 18:33:56
  duration: 76103.33
  pid: 6123
  exitcode: 0
  args: 8.0.5 JobTransforms-08-00-05-09/share/dc2.g4sim.filter.trf CPE_6785_556 ... -6 6 2000 4000 8923 dc2_B4_filter_frag.txt
  utime: 75335.86
  stime: 28.88
  minflt: 862341
  majflt: 96386
Which Linux kernel releases were used?
How many jobs were run on a Linux 2.4.28 Kernel?
(Data from work of Robert Gardner and Marco Mambelli, University of Chicago)
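A monthly compute-time table like the one above can be derived from per-job invocation records by grouping on the start month. The sketch below assumes that `duration` is in seconds and counts as delivered compute time, which the slide does not state. Only the first record comes from the slide; the others are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime

SECONDS_PER_YEAR = 365 * 24 * 3600


def compute_years_by_month(records):
    """Sum job durations (seconds) per (year, month) of start, in years."""
    totals = defaultdict(float)
    for start, duration in records:
        t = datetime.strptime(start, "%Y-%m-%d %H:%M:%S")
        totals[(t.year, t.month)] += duration / SECONDS_PER_YEAR
    return dict(totals)


records = [
    ("2004-09-30 18:33:56", 76103.33),  # the job shown on the slide
    ("2004-09-12 08:00:00", 3600.0),    # hypothetical
    ("2004-10-01 00:00:00", 7200.0),    # hypothetical
]
totals = compute_years_by_month(records)
print(totals)
```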
Annotation queries
Return annotations for any object type: procedures, calls, arguments, and files
Select based on subject, predicate, (object, type)
Select a set of virtual data objects:
  find all objects (of any type) annotated with predicate p of type t and value v;
  find objects of a specific type annotated with predicate p of type t and value v;
  find objects (one type or any type) annotated by same set of attribute predicates.
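The first two selections differ only in whether the object type is constrained. A small in-memory sketch makes that explicit; the `Annotation` tuple, the `find_objects` function, and the sample annotations are all hypothetical and stand in for whatever the catalog stores.

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    obj: str        # name of the annotated object
    obj_type: str   # procedure, call, argument, or file
    pred: str       # predicate, e.g. "privacy"
    val_type: str   # type of the value, e.g. "string"
    value: object


def find_objects(annotations, pred, val_type, value, obj_type: Optional[str] = None):
    """Objects annotated with (pred, val_type, value); any object type if obj_type is None."""
    return sorted({
        a.obj for a in annotations
        if a.pred == pred and a.val_type == val_type and a.value == value
        and (obj_type is None or a.obj_type == obj_type)
    })


annotations = [
    Annotation("3472-4_anonymized.img", "file", "privacy", "string", "anonymized"),
    Annotation("align_warp", "procedure", "privacy", "string", "anonymized"),
    Annotation("3472-3.img", "file", "privacy", "string", "identified"),
]
any_type = find_objects(annotations, "privacy", "string", "anonymized")
files_only = find_objects(annotations, "privacy", "string", "anonymized", obj_type="file")
print(any_type, files_only)
```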
fMRI Virtual Data Queries
Which transformations can process a “subject image”?
  Q: xsearchvdc -q tr_meta dataType subject_image input
  A: fMRIDC.AIR::align_warp
List anonymized subject-images for young subjects:
  Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young
  A: 3472-4_anonymized.img
Show files that were derived from patient image 3472-3:
  Q: xsearchvdc -q lfn_tree 3472-3_anonymized.img
  A: 3472-3_anonymized.img
       3472-3_anonymized.sliced.hdr
         atlas.hdr
         atlas.img
         …
         atlas_z.jpg
       3472-3_anonymized.sliced.img
Annotation Queries
Show all the annotations of [datasets that have metadata annotation studyModality with values speech, visual or audio].
Show all the developerName annotations of [procedures that accept or produce an argument of type Study with annotation studyModality=audio]
Lineage Queries
Basic lineage graph queries refer to information that has been propagated along derivation relationships:
  find datasets derived from dataset d
  find ancestor datasets to dataset d that have type t
  find datasets that were derived within 2 levels of procedure p
  find datasets that are the result of workpattern wp;
  find the procedure calls in workflow w whose inputs have been processed by any subgraph matching workpattern wp.
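The "derived from dataset d" and "within 2 levels" queries are both reachability questions on the derivation graph. The sketch below answers them with a breadth-first traversal that records the derivation depth; the `derived` edge table is an assumed, partial reading of the fMRI example, and `descendants` is a hypothetical helper.

```python
from collections import deque

# Assumed dataset-to-dataset derivation edges (dataset -> directly derived datasets).
derived = {
    "3a.h": ["3a.w"],
    "3a.w": ["3a.s.h", "3a.s.i"],
    "3a.s.h": ["atlas.h", "atlas.i"],
    "3a.s.i": ["atlas.h", "atlas.i"],
    "atlas.h": ["atlas_x.ppm"],
    "atlas_x.ppm": ["atlas_x.jpg"],
}


def descendants(graph, start, max_depth=None):
    """Map each dataset derived from start to its derivation depth (breadth-first)."""
    depth_of = {}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for child in graph.get(node, ()):
            if child not in depth_of:
                depth_of[child] = depth + 1
                queue.append((child, depth + 1))
    return depth_of


all_derived = descendants(derived, "3a.h")
within_two = descendants(derived, "3a.h", max_depth=2)
print(within_two)
```

Running the same traversal over reversed edges would give the ancestor query; filtering its result by dataset type gives "ancestors of d that have type t".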
Planned approach: Workpatterns
Match graph patterns of transformations, calls, and invocations; fixed or varying numbers of nodes
Match on expressions with argument name, argument values, argument types, and/or annotations
Workpattern query yields set of workflows with subgraphs that match the workpattern
The target search space of a workpattern query can be the entire database, or a specific set of workflows selected by a prior search.
Workflow Patterns
Given the workpattern:
  align (model=affine)
  reslice (axis=x, intensify=3)
  softmean
Show me all output datasets of softmean calls that were linear-aligned with model=affine, i.e., “where softmean was preceded in the workflow, directly or indirectly, by an alignlinear call with argument model=affine”
Show me all output datasets of softmean that were resliced with intensify=3. (Looking for a softmean that is directly preceded by the requested pattern)
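The two queries differ in whether the pattern steps must be adjacent ("directly preceded") or may be separated by other calls ("directly or indirectly"). The sketch below shows one way to match both over a call graph. It is not the planned VDS workpattern engine: the `successors` edges are an assumed fragment of the fMRI workflow, the path syntax with `*` for "any chain of intermediate calls" is an assumption, and argument predicates such as model=affine are left out.

```python
# Assumed call graph: call -> calls that consume its outputs.
successors = {
    "align_warp/1": ["reslice/2"],
    "reslice/2": ["softmean/9"],
    "softmean/9": ["slicer/10"],
    "slicer/10": ["convert/11"],
}


def proc(call):
    """Procedure name of a call id such as 'reslice/2'."""
    return call.split("/")[0]


def match_from(graph, call, pattern):
    """Yield call paths starting at `call` that match `pattern`.

    `pattern` is a list of procedure names; '*' stands for zero or more
    intermediate calls and must be followed by a procedure name.
    """
    head, rest = pattern[0], pattern[1:]
    if head == "*":
        if rest:
            yield from match_from(graph, call, rest)      # '*' matches nothing
        for nxt in graph.get(call, ()):                   # '*' absorbs this call
            for path in match_from(graph, nxt, pattern):
                yield [call] + path
    elif proc(call) == head:
        if not rest:
            yield [call]
        else:
            for nxt in graph.get(call, ()):
                for path in match_from(graph, nxt, rest):
                    yield [call] + path


def find_matches(graph, pattern_text):
    """All matching paths for a pattern such as 'align_warp/*/softmean'."""
    pattern = pattern_text.split("/")
    nodes = sorted(set(graph) | {n for targets in graph.values() for n in targets})
    return [path for call in nodes for path in match_from(graph, call, pattern)]


indirect = find_matches(successors, "align_warp/*/softmean")
direct = find_matches(successors, "align_warp/softmean")
print(indirect, direct)
```

With these edges the wildcard pattern finds the path through reslice, while the adjacent form finds nothing because align_warp never feeds softmean directly.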
Workflow Pattern Searches
[Figure: fMRI workflow DAG: align_warp/1, /3, /5, /7 → reslice/2, /4, /6, /8 → softmean/9 → slicer/10, /12, /14 → convert/11, /13, /15; annotated with the workpatterns below]
align_warp/*/softmean
softmean/slicer*
Multidimension Provenance Queries
“find transformations with signature S that have been called with arguments V and which match an annotation query
the metadata values for a specified set of predicates from a transformation list returned by another query
the minimum, maximum, and average run times of a set of procedure calls matching workpattern wp and annotation query q.”
Multi-dimension queries
Find procedures that take in ImageAtlas and Date, have been called with atlas.std.2005.img, and have annotation QALevel > 5.6
Display metadata tags study-type on result datasets that were linearly aligned with parameter model=affine and with an input dataset annotated with center set to UofChicago
Show me the output dataset names (and all their metadata tags) that were linearly aligned with model=affine and with input LFN metadata center=UChicago
Show annotations school of output datasets of softmean with values in set {UIUC, UChicago, UIC}.
Show annotations school with values in set {UIUC, UChicago, UIC} of outputs of softmean that were aligned with model=affine (graph relationship)
Modification and composition queries
Change arguments in a set of calls
Change procedures in a set of workflows
Edit subgraphs of a workflow, creating new workflows
Edit metadata throughout a workflow
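Modification queries select a set of calls and produce an edited workflow rather than a report. The sketch below illustrates the first two kinds of edit on a hypothetical list-of-dicts workflow representation; `edit_calls` and the sample calls are assumptions, and the original workflow is copied so the edit creates a new workflow instead of mutating the recorded one.

```python
import copy

# Hypothetical workflow: one dict per call.
workflow = [
    {"call": "align_warp/1", "proc": "align_warp", "args": {"model": "rigid"}},
    {"call": "reslice/2", "proc": "reslice", "args": {"axis": "x"}},
    {"call": "align_warp/3", "proc": "align_warp", "args": {"model": "rigid"}},
]


def edit_calls(workflow, proc, new_args=None, new_proc=None):
    """Return a new workflow in which every call to `proc` has been edited."""
    edited = copy.deepcopy(workflow)
    for call in edited:
        if call["proc"] == proc:
            call["args"].update(new_args or {})
            if new_proc is not None:
                call["proc"] = new_proc
    return edited


# Change an argument in every align_warp call.
affine = edit_calls(workflow, "align_warp", new_args={"model": "affine"})
print([c["args"] for c in affine])
```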
Workflow Transformation
[Figure: fMRI workflow DAG: align_warp/1, /3, /5, /7 → reslice/2, /4, /6, /8 → softmean/9 → slicer/10, /12, /14 → convert/11, /13, /15]
More accurate alignment tool?
Better rendering algorithm?
Workflow transformation
[Figure: fMRI workflow DAG overlaid with replacement procedures: warpnslice (×4) and render]
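One reading of the overlay is that each align_warp and reslice pair is replaced by a single warpnslice call. The sketch below shows such a subgraph edit under that assumption. The `fuse` function and the call representation are hypothetical, the file lists are an assumed reading of the figure, and the calls are expected in producer-before-consumer order.

```python
def fuse(calls, first, second, fused):
    """Replace each `first` call whose output feeds exactly one `second` call
    with a single `fused` call; the intermediate files disappear."""
    consumers = {}
    for c in calls:
        for f in c["inputs"]:
            consumers.setdefault(f, []).append(c)
    result, absorbed = [], set()
    for c in calls:
        if id(c) in absorbed:
            continue
        if c["proc"] == first:
            partners = {id(d): d for f in c["outputs"]
                        for d in consumers.get(f, []) if d["proc"] == second}
            if len(partners) == 1:
                d = next(iter(partners.values()))
                internal = set(c["outputs"]) & set(d["inputs"])
                result.append({
                    "proc": fused,
                    "inputs": c["inputs"] + [f for f in d["inputs"] if f not in internal],
                    "outputs": d["outputs"],
                })
                absorbed.add(id(d))
                continue
        result.append(c)
    return result


# Assumed fragment of the fMRI workflow.
calls = [
    {"proc": "align_warp", "inputs": ["3a.h", "3a.i", "ref.h", "ref.i"], "outputs": ["3a.w"]},
    {"proc": "reslice", "inputs": ["3a.w"], "outputs": ["3a.s.h", "3a.s.i"]},
    {"proc": "softmean", "inputs": ["3a.s.h", "3a.s.i"], "outputs": ["atlas.h", "atlas.i"]},
]
new_workflow = fuse(calls, "align_warp", "reslice", "warpnslice")
print([c["proc"] for c in new_workflow])
```

The sketch ignores the case where the intermediate file is also read by some other call; a real editor would have to keep the file in that situation.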
Conclusion
Separation of dimensions facilitates schema design and comprehension
Workflow and provenance are inextricably linked
Integration of dimensions in query is powerful
Graph matching and editing paradigms are essential tools in this unified treatment
Efficient implementation of these searches will take time and research
Elegance and usability are key factors in the query language
Acknowledgements
…many thanks to the entire OSG Collaboration and our application science partners in ATLAS, CMS, LIGO, SDSS, Dartmouth DBIC and fMRIDC, SCEC, and Argonne’s Computational Biology and Climate Science Groups of the Mathematics and Computer Science Division.
The Virtual Data System group is:
  ISI/USC: Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi
  U of Chicago: Catalin Dumitrescu, Ian Foster, Luiz Meyer (UFRJ, Brazil), Doug Scheftner, Jens Voeckler, Mike Wilde, Yong Zhao
GriPhyN and iVDGL are supported by the National Science Foundation
Many of the research efforts involved in this work are supported by the US Department of Energy, Office of Science.