Applying the Virtual Data Provenance Model

IPAW 2006
Yong Zhao, Ian Foster, Michael Wilde

4 May 2006


www.griphyn.org/vds

Virtual Data Origins: The Grid Physics Network

Enhance scientific productivity through…
  Discovery, application and management of data and processes at all scales
  Using a worldwide data grid as a scientific workstation

The key to this approach is Virtual Data – creating and managing datasets through workflow “recipes” and provenance recording.


The purpose of Virtual Data

Better understanding of data (and tools)
Assess what happened at run-time
Easier to express and execute work
Discover useful workflow patterns
Adapt workflow patterns to new needs


Virtual Data Example: Galaxy Cluster Search

[Figure: a workflow DAG over Sloan data, and a plot titled "Galaxy cluster size distribution" (x-axis: Number of Galaxies, ticks 1 to 100; y-axis: Number of Clusters, ticks 1 to 100000)]

Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago


How are Workflow and Provenance Connected?

Workflow – specifies what to do
Provenance – tracks what was done
Virtual Data integrates these capabilities

[Figure labels: workflow states Waiting, Executable, Executing, Executed; operations Schedule, Edit, Query; Execution environment; "What I Want to Do", "What I Am Doing", "What I Did"]


Temporal aspects of provenance

Prospective provenance
  The workflow recipes for how to produce data
  Metadata annotating code and data
Retrospective provenance
  Run-time records of data production environment: where, how long, how much


Expressing Workflow in VDL

TR grep (in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}

TR sort (in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}

DV grep (a1=@{in:file1}, a2=@{out:file2});
DV sort (a1=@{in:file2}, a2=@{out:file3});

[Figure: file1 → grep → file2 → sort → file3]

Annotations on the code:
  Define a “function” wrapper for an application
  Define “formal arguments” for the application
  Define a “call” to invoke application
  Provide “actual” argument values for the invocation
  Connect applications via output-to-input dependencies
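The way the two DV calls above chain through file2 can be sketched in Python. This is only an illustration, not VDS code: the `Transformation` and `Derivation` classes and the `dependency_edges` helper are hypothetical names introduced here, and the matching rule (an output file name equal to an input file name) is an assumption about how the dependency is inferred.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformation:
    """Hypothetical stand-in for a VDL TR: a name plus formal arguments."""
    name: str
    formals: tuple  # (argument name, direction) pairs; direction is "in" or "out"


@dataclass(frozen=True)
class Derivation:
    """Hypothetical stand-in for a VDL DV: binds formal arguments to files."""
    tr: Transformation
    actuals: tuple  # (argument name, logical file name) pairs

    def files(self, direction):
        """Logical file names bound to formals of the given direction."""
        wanted = {name for name, d in self.tr.formals if d == direction}
        return [f for name, f in self.actuals if name in wanted]


def dependency_edges(derivations):
    """List (producer, consumer, file) where one call's output is another's input."""
    producer_of = {f: dv for dv in derivations for f in dv.files("out")}
    edges = []
    for dv in derivations:
        for f in dv.files("in"):
            if f in producer_of:
                edges.append((producer_of[f].tr.name, dv.tr.name, f))
    return edges


grep = Transformation("grep", (("a1", "in"), ("a2", "out")))
sort = Transformation("sort", (("a1", "in"), ("a2", "out")))
calls = [
    Derivation(grep, (("a1", "file1"), ("a2", "file2"))),
    Derivation(sort, (("a1", "file2"), ("a2", "file3"))),
]
print(dependency_edges(calls))  # [('grep', 'sort', 'file2')]
```

Nothing in the calls names the ordering explicitly; it falls out of the shared logical file name, which is the output-to-input dependency the slide describes.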


Terminology

Virtual data
  Defining data by the logical workflow needed to create it virtualizes it with respect to location, existence, failure, and representation
VDL – Virtual Data Language
  A language (text and XML) that defines the functions and function calls of a workflow
VDC – Virtual Data Catalog
  The database and schema that store VDL definitions
VDS – Virtual Data System
  The tools to define, store, manipulate and execute virtual data workflows and query data provenance


Executing VDL Workflows

[Figure: pipeline in three stages: Workflow spec, Create Execution Plan, Grid Workflow Execution. Components shown: VDL Program, Virtual Data Catalog, Virtual Data Workflow Generator, Abstract workflow, Pegasus Planner, Job Planner, Job Cleanup, DAGman DAG, DAGman & Condor-G]


Dimensions of Provenance Data

[Figure: the Virtual Data Catalog with three dimensions: Virtual data relationships, Metadata annotations, Derivation lineage]


Virtual Data Catalog Schema

[Figure: entity-relationship diagram]

Entities and attributes:
  Invocation: dvID, host, start, duration, exitcode, stats
  Call: nmspace, name, version
  Procedure: nmspace, name, version
  FormalArg: argname, type, direction
  ActualArg: argname, value
  Workflow: wfid, fromDV, toDV
  Dataset: nmspace, name
  Annotation: object, pred, type/val, user, date

Relationship labels: passes, executes, calls, binds, references, describes, uses, includes (multiplicities 1 and *)
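As a rough aid to reading the schema, the entities can be sketched as Python dataclasses. The entity and attribute names come from the diagram; everything else is an assumption made for illustration: the field types, the direct object references standing in for the relationship lines (the endpoints are not legible in this transcript), and the `unbound_formals` helper, which is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FormalArg:
    argname: str
    type: str
    direction: str  # assumed encoding: "in" or "out"


@dataclass
class Procedure:
    nmspace: str
    name: str
    version: str
    formal_args: list = field(default_factory=list)


@dataclass
class ActualArg:
    argname: str
    value: str


@dataclass
class Call:
    nmspace: str
    name: str
    version: str
    procedure: Procedure  # assumed: each call refers to one procedure
    actual_args: list = field(default_factory=list)


@dataclass
class Invocation:
    call: Call  # stands in for the dvID attribute
    host: str
    start: str
    duration: float
    exitcode: int
    stats: dict = field(default_factory=dict)


def unbound_formals(call):
    """Formal argument names of the procedure that the call leaves unbound."""
    bound = {a.argname for a in call.actual_args}
    return [f.argname for f in call.procedure.formal_args if f.argname not in bound]


# Sample values below are invented.
sort = Procedure("example", "sort", "1.0",
                 [FormalArg("a1", "file", "in"), FormalArg("a2", "file", "out")])
call = Call("example", "sort_file2", "1.0", sort, [ActualArg("a1", "file2")])
run = Invocation(call, "node17", "2006-05-04T10:00:00", 1.5, 0)

print(unbound_formals(call))    # ['a2']
print(run.call.procedure.name)  # sort
```

The chain from `Invocation` to `Call` to `Procedure` mirrors the split between run-time records, calls, and definitions.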


Run Time Environment and Provenance Collection

[Figure: three stages: Specify Workflow, Create and run DAG, Grid Workflow Execution (on worker nodes). Components shown: VDL, Virtual Data Catalog, Virtual Data Workflow Generator, Abstract workflow, Pegasus Planner, DAGman script, DAGman & Condor-G. On the worker nodes: two launcher boxes alongside grep and sort in the file1, file2, file3 chain, two "Provenance data" flows, and a Provenance collector]
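The figure suggests that a launcher runs each application on the worker node and emits provenance data. A minimal sketch of that idea in Python follows. It is not the VDS launcher: the `launch` function is hypothetical, and the record fields are chosen to echo the Invocation attributes (host, start, duration, exitcode).

```python
import socket
import subprocess
import sys
import time
from datetime import datetime, timezone


def launch(argv, stdin_path=None, stdout_path=None):
    """Run argv and return a dict of run-time facts about the execution."""
    stdin = open(stdin_path, "rb") if stdin_path else None
    stdout = open(stdout_path, "wb") if stdout_path else None
    start = datetime.now(timezone.utc)
    t0 = time.monotonic()
    try:
        proc = subprocess.run(argv, stdin=stdin, stdout=stdout)
        duration = time.monotonic() - t0
    finally:
        # Close any redirected files whether or not the run succeeded.
        for handle in (stdin, stdout):
            if handle:
                handle.close()
    return {
        "host": socket.gethostname(),
        "start": start.isoformat(),
        "duration": duration,
        "exitcode": proc.returncode,
        "args": list(argv),
    }


# Launch a trivial child process and inspect its record.
record = launch([sys.executable, "-c", "print('hello')"])
print(record["exitcode"], record["host"])
```

A collector would then gather such records from all worker nodes; how they are transported and stored is left out here.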


Provenance Query Types

Virtual data relationships
Annotations
Lineage
Multi-dimension
Compositional


Context for Query Examples: Functional MRI Analysis

[Figure: fMRI workflow DAG. Inputs 3a, 4a, 5a, 6a (each a .h/.i pair) together with ref.h and ref.i feed align_warp/1, /3, /5, /7, which produce 3a.w, 4a.w, 5a.w, 6a.w; reslice/2, /4, /6, /8 produce the corresponding .s.h/.s.i pairs; softmean/9 produces atlas.h and atlas.i; slicer/10, /12, /14 produce atlas_x.ppm, atlas_y.ppm, atlas_z.ppm; convert/11, /13, /15 produce atlas_x.jpg, atlas_y.jpg, atlas_z.jpg]

Workflow courtesy James Dobson, Dartmouth Brain Imaging Center


Virtual Data Relationships

Simple query by signature:
  Show procedures in namespace /pub/bin/std that have inputs of type SubjectImage and outputs of type ThumbNailImage.

Actual arguments and runtime provenance:
  Show alignlinear calls (including all arguments), in XML format, with argument model=rigid, and which generated more than 10,000 page faults, on ia64 processors.
  Show calls to procedure alignlinear, and their runtimes, with argument model=rigid that ran in less than 30 minutes on non-ia64 processors.

Aggregate query:
  Show me the average runtime of all alignlinear calls with argument model=rigid that ran in less than 30 minutes.
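To make the runtime and aggregate queries concrete, here is a sketch over an in-memory SQLite database using Python's `sqlite3` module. The three-table layout, the column names, the use of `majflt` as the page-fault count, and all rows are assumptions made for illustration; this is not the VDC schema or its query language.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE calls       (id INTEGER PRIMARY KEY, proc_name TEXT);
CREATE TABLE actual_args (call_id INTEGER, argname TEXT, value TEXT);
CREATE TABLE invocations (call_id INTEGER, arch TEXT, duration REAL, majflt INTEGER);
""")
db.executemany("INSERT INTO calls VALUES (?, ?)",
               [(1, "alignlinear"), (2, "alignlinear"), (3, "alignlinear"), (4, "reslice")])
db.executemany("INSERT INTO actual_args VALUES (?, ?, ?)",
               [(1, "model", "rigid"), (2, "model", "rigid"),
                (3, "model", "affine"), (4, "axis", "x")])
# Durations are in seconds; every row is made-up sample data.
db.executemany("INSERT INTO invocations VALUES (?, ?, ?, ?)",
               [(1, "ia64", 600.0, 12000), (2, "i686", 1200.0, 50),
                (3, "i686", 900.0, 10), (4, "ia64", 30.0, 5)])

# Shared join: invocations of alignlinear calls with argument model=rigid.
RIGID_CALLS = """
FROM invocations i
JOIN calls c       ON c.id = i.call_id
JOIN actual_args a ON a.call_id = c.id
WHERE c.proc_name = 'alignlinear' AND a.argname = 'model' AND a.value = 'rigid'
"""

# Aggregate query: average runtime of the calls that ran in under 30 minutes.
average = db.execute(
    "SELECT AVG(i.duration)" + RIGID_CALLS + "AND i.duration < 30 * 60"
).fetchone()[0]

# Runtime provenance: calls with more than 10,000 page faults on ia64.
heavy = db.execute(
    "SELECT c.id" + RIGID_CALLS + "AND i.majflt > 10000 AND i.arch = 'ia64'"
).fetchall()

print(average)  # 900.0
print(heavy)    # [(1,)]
```

Both queries join the same three kinds of record: the call, its actual arguments, and its run-time invocation.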


Provenance for Large-scale ATLAS Simulation

How much compute time was delivered?

| years | mon | year |
+-------+-----+------+
| .45   | 6   | 2004 |
| 20    | 7   | 2004 |
| 34    | 8   | 2004 |
| 40    | 9   | 2004 |
| 15    | 10  | 2004 |
| 15    | 11  | 2004 |
| 8.9   | 12  | 2004 |
+-------+-----+------+

Selected statistics for one of these jobs:
  start: 2004-09-30 18:33:56
  duration: 76103.33
  pid: 6123
  exitcode: 0
  args: 8.0.5 JobTransforms-08-00-05-09/share/dc2.g4sim.filter.trf CPE_6785_556 ... -6 6 2000 4000 8923 dc2_B4_filter_frag.txt
  utime: 75335.86
  stime: 28.88
  minflt: 862341
  majflt: 96386

Which Linux kernel releases were used?
How many jobs were run on a Linux 2.4.28 kernel?

(Data from work of Robert Gardner and Marco Mambelli, University of Chicago)
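The figures above support some quick arithmetic, sketched here in Python. Two readings are assumptions rather than statements from the slide: that the "years" column counts compute-years per month, and that duration, utime, and stime are all in seconds, with duration being wall-clock time.

```python
# (years, month, year) rows copied from the figures above.
ROWS = [(0.45, 6, 2004), (20, 7, 2004), (34, 8, 2004), (40, 9, 2004),
        (15, 10, 2004), (15, 11, 2004), (8.9, 12, 2004)]
total_years = sum(years for years, _, _ in ROWS)

# Statistics of the single job shown above.
duration, utime, stime = 76103.33, 75335.86, 28.88
cpu_fraction = (utime + stime) / duration

print(f"{total_years:.2f} compute-years delivered, June to December 2004")
print(f"job spent {cpu_fraction:.1%} of its duration in user plus system CPU time")
```

Under those assumptions the monthly figures sum to about 133 compute-years, and the sample job was almost entirely CPU-bound.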


Annotation queries

Return annotations for any object type: procedures, calls, arguments, and files
  select based on subject, predicate, (object, type)

Select a set of virtual data objects:
  find all objects (of any type) annotated with predicate p of type t and value v
  find objects of a specific type annotated with predicate p of type t and value v;
  find objects (one type or any type) annotated by same set of attribute predicates.
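The three object-selection queries above can be sketched over a small in-memory list of annotations. The `Annotation` tuple, the helper names `find_objects` and `same_predicates`, and every sample row are hypothetical; only the query shapes follow the slide.

```python
from collections import namedtuple

# Field names loosely follow the Annotation entity; the rows are invented.
Annotation = namedtuple("Annotation", "obj obj_kind pred type val")

ANNOTATIONS = [
    Annotation("3472-3_anonymized.img", "file", "privacy", "string", "anonymized"),
    Annotation("3472-3_anonymized.img", "file", "subjectType", "string", "adult"),
    Annotation("3472-4_anonymized.img", "file", "privacy", "string", "anonymized"),
    Annotation("3472-4_anonymized.img", "file", "subjectType", "string", "young"),
    Annotation("align_warp", "procedure", "developerName", "string", "someone"),
]


def find_objects(annotations, pred, type_, val, obj_kind=None):
    """Objects annotated with (pred, type, val), optionally of one object kind."""
    return sorted({a.obj for a in annotations
                   if (a.pred, a.type, a.val) == (pred, type_, val)
                   and (obj_kind is None or a.obj_kind == obj_kind)})


def same_predicates(annotations, obj):
    """Other objects annotated by exactly the same set of predicates as obj."""
    preds = {}
    for a in annotations:
        preds.setdefault(a.obj, set()).add(a.pred)
    return sorted(o for o, p in preds.items() if o != obj and p == preds[obj])


print(find_objects(ANNOTATIONS, "privacy", "string", "anonymized"))
# ['3472-3_anonymized.img', '3472-4_anonymized.img']
print(find_objects(ANNOTATIONS, "subjectType", "string", "young", obj_kind="file"))
# ['3472-4_anonymized.img']
print(same_predicates(ANNOTATIONS, "3472-3_anonymized.img"))
# ['3472-4_anonymized.img']
```

Because annotations attach to any object kind, the same filter serves procedures, calls, arguments, and files alike.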


fMRI Virtual Data Queries

Which transformations can process a “subject image”?
  Q: xsearchvdc -q tr_meta dataType subject_image input
  A: fMRIDC.AIR::align_warp

List anonymized subject-images for young subjects:
  Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young
  A: 3472-4_anonymized.img

Show files that were derived from patient image 3472-3:
  Q: xsearchvdc -q lfn_tree 3472-3_anonymized.img
  A: 3472-3_anonymized.img
     3472-3_anonymized.sliced.hdr
     atlas.hdr
     atlas.img
     …
     atlas_z.jpg
     3472-3_anonymized.sliced.img


Annotation Queries

Show all the annotations of [datasets that have metadata annotation studyModality with values speech, visual or audio].

Show all the developerName annotations of [procedures that accept or produce an argument of type Study with annotation studyModality=audio].


Lineage Queries

Basic lineage graph queries refer to information that has been propagated along derivation relationships:
  find datasets derived from dataset d
  find ancestor datasets to dataset d that have type t
  find datasets that were derived within 2 levels of procedure p
  find datasets that are the result of workpattern wp;
  find the procedure calls in workflow w whose inputs have been processed by any subgraph matching workpattern wp.
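The first few lineage queries reduce to graph traversal over derivation records. The Python sketch below uses a fragment of the fMRI example; the edges are inferred from the file and step names on the slides, and the function names are hypothetical. The depth limit counts derivation steps from a dataset, which only approximates the "within 2 levels of procedure p" query.

```python
from collections import deque

# Each record: (call, input datasets, output datasets).
CALLS = [
    ("align_warp/1", ["3a.h", "3a.i", "ref.h", "ref.i"], ["3a.w"]),
    ("reslice/2", ["3a.w"], ["3a.s.h", "3a.s.i"]),
    ("softmean/9", ["3a.s.h", "3a.s.i"], ["atlas.h", "atlas.i"]),
    ("slicer/10", ["atlas.h", "atlas.i"], ["atlas_x.ppm"]),
    ("convert/11", ["atlas_x.ppm"], ["atlas_x.jpg"]),
]


def _children(dataset):
    """Datasets produced by any call that reads `dataset`."""
    return [o for _, ins, outs in CALLS if dataset in ins for o in outs]


def _parents(dataset):
    """Datasets read by any call that writes `dataset`."""
    return [i for _, ins, outs in CALLS if dataset in outs for i in ins]


def _walk(start, step, max_levels=None):
    """Breadth-first closure of `step` from `start`, optionally depth-limited."""
    seen, queue = set(), deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_levels is not None and depth >= max_levels:
            continue
        for nxt in step(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


def derived_from(dataset, max_levels=None):
    """Datasets derived, directly or indirectly, from `dataset`."""
    return _walk(dataset, _children, max_levels)


def ancestors_of(dataset):
    """Datasets that `dataset` was derived from."""
    return _walk(dataset, _parents)


print(sorted(derived_from("3a.w", max_levels=1)))  # ['3a.s.h', '3a.s.i']
print(sorted(ancestors_of("3a.w")))                # ['3a.h', '3a.i', 'ref.h', 'ref.i']
print("atlas_x.jpg" in derived_from("3a.h"))       # True
```

Filtering the ancestor set by a type annotation would give the "ancestors of type t" query.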


Planned approach: Workpatterns

Match graph patterns of transformations, calls, and invocations; fixed or varying numbers of nodes

Match on expressions with argument name, argument values, argument types, and/or annotations

Workpattern query yields set of workflows with subgraphs that match the workpattern

The target search space of a workpattern query can be the entire database or a specific set of workflows selected by a prior search.


Workflow Patterns

Given the workpattern:
  align (model=affine)
  reslice (axis=x, intensify=3)
  softmean

show me all output datasets of softmean calls that were linear-aligned with model=affine. I.e., “where softmean was preceded in the workflow, directly or indirectly, by an alignlinear call with argument model=affine”

Show me all output datasets of softmean that were resliced with intensify=3. (Looking for a softmean that is directly preceded by the requested pattern)
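The difference between "directly preceded" and "preceded directly or indirectly" can be sketched as a search over a small call graph in Python. The slides describe workpatterns as a planned approach, so this is not the VDS matcher: the graph encoding, the helper names, and the argument values are all assumptions.

```python
# call id -> (procedure, arguments, ids of calls that consume its output)
CALLS = {
    1: ("align", {"model": "affine"}, [2]),
    2: ("reslice", {"axis": "x", "intensify": 3}, [5]),
    3: ("align", {"model": "rigid"}, [4]),
    4: ("reslice", {"axis": "x", "intensify": 1}, [5]),
    5: ("softmean", {}, []),
}


def matches(call_id, procedure, **args):
    """True if the call runs `procedure` with all of the given argument values."""
    name, actual, _ = CALLS[call_id]
    return name == procedure and all(actual.get(k) == v for k, v in args.items())


def predecessors(call_id, direct_only=False):
    """Calls whose output reaches call_id, directly or (by default) transitively."""
    direct = {c for c, (_, _, consumers) in CALLS.items() if call_id in consumers}
    if direct_only:
        return direct
    found = set(direct)
    for c in direct:
        found |= predecessors(c)
    return found


def preceded_by(target, procedure, direct_only=False, **args):
    """Ids of `target` calls preceded by a call matching procedure and args."""
    return [c for c in CALLS
            if matches(c, target)
            and any(matches(p, procedure, **args)
                    for p in predecessors(c, direct_only))]


print(preceded_by("softmean", "align", model="affine"))                    # [5]
print(preceded_by("softmean", "reslice", direct_only=True, intensify=3))   # [5]
print(preceded_by("softmean", "align", direct_only=True, model="affine"))  # []
```

The last query returns nothing because the align call reaches softmean only through reslice, which is the distinction the two example queries draw.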


Workflow Pattern Searches

[Figure: the fMRI workflow DAG (align_warp/1 through convert/15) with two workpattern searches overlaid]

align_warp/*/softmean
softmean/slicer*


Multidimension Provenance Queries

“find transformations with signature S that have been called with arguments V and which match:
  an annotation query
  the metadata values for a specified set of predicates from a transformation list returned by another query
  the minimum, maximum, and average run times of a set of procedure calls matching workpattern wp and annotation query q.”


Multi-dimension queries

Find procedures that take in ImageAtlas and Date, have been called with atlas.std.2005.img, and have annotation QALevel > 5.6

Display metadata tags study-type on result datasets that were linearly aligned with parameter model=affine and with an input dataset annotated with center set to UofChicago

Show me the output dataset names (and all their metadata tags) that were linearly aligned with model=affine and with input LFN metadata center=UChicago

Show annotations school of output datasets of softmean with values in set {UIUC, UChicago, UIC}.

Show annotations school with values in set {UIUC, UChicago, UIC} of outputs of softmean that were aligned with model=affine (graph relationship)
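The first query above combines three dimensions: procedure signature, actual call arguments, and annotations. One way to picture this is to evaluate each dimension as a set and intersect the results, as in the Python sketch below. The procedure names, the records, and the `find_procedures` helper are invented; only the types, the file name, and the QALevel threshold come from the slide.

```python
# Definitions: input types (signature) and annotations per procedure.
PROCEDURES = {
    "align_atlas": {"inputs": {"ImageAtlas", "Date"}, "annotations": {"QALevel": 6.1}},
    "quick_align": {"inputs": {"ImageAtlas", "Date"}, "annotations": {"QALevel": 4.0}},
    "softmean": {"inputs": {"Image"}, "annotations": {"QALevel": 9.0}},
}
# Calls: which procedure was invoked with which actual arguments.
CALLS = [
    {"procedure": "align_atlas", "args": ["atlas.std.2005.img", "2005-01-01"]},
    {"procedure": "quick_align", "args": ["atlas.std.2005.img", "2005-01-01"]},
    {"procedure": "align_atlas", "args": ["atlas.std.2004.img", "2004-01-01"]},
]


def find_procedures(input_types, called_with, min_qa):
    """Procedures matching a signature, a call argument, and a QALevel bound."""
    by_signature = {n for n, p in PROCEDURES.items() if input_types <= p["inputs"]}
    by_call = {c["procedure"] for c in CALLS if called_with in c["args"]}
    by_annotation = {n for n, p in PROCEDURES.items()
                     if p["annotations"].get("QALevel", 0) > min_qa}
    return sorted(by_signature & by_call & by_annotation)


print(find_procedures({"ImageAtlas", "Date"}, "atlas.std.2005.img", 5.6))
# ['align_atlas']
```

Keeping the dimensions separate until the final intersection lets each one be answered from its own part of the catalog.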


Modification and composition queries

Change arguments in a set of calls
Change procedures in a set of workflows
Edit subgraphs of a workflow, creating new workflows
Edit metadata throughout a workflow
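The first two modification queries can be sketched as functions that return an edited copy of a workflow and leave the original untouched. This is a Python illustration under assumed data structures: the list-of-dicts workflow, the function names, and the replacement procedure name `align_warp_v2` are hypothetical.

```python
import copy

WORKFLOW = [
    {"id": 1, "procedure": "align_warp", "args": {"model": "rigid"}},
    {"id": 2, "procedure": "reslice", "args": {"axis": "x"}},
    {"id": 3, "procedure": "align_warp", "args": {"model": "rigid"}},
]


def change_arguments(workflow, procedure, **new_args):
    """Return a new workflow where calls to `procedure` get updated arguments."""
    edited = copy.deepcopy(workflow)
    for call in edited:
        if call["procedure"] == procedure:
            call["args"].update(new_args)
    return edited


def change_procedure(workflow, old, new):
    """Return a new workflow where calls to `old` invoke `new` instead."""
    edited = copy.deepcopy(workflow)
    for call in edited:
        if call["procedure"] == old:
            call["procedure"] = new
    return edited


affine = change_arguments(WORKFLOW, "align_warp", model="affine")
swapped = change_procedure(WORKFLOW, "align_warp", "align_warp_v2")

print([c["args"]["model"] for c in affine if c["procedure"] == "align_warp"])
# ['affine', 'affine']
print(WORKFLOW[0]["args"]["model"])  # rigid
```

Returning a copy keeps the original workflow available as provenance for the edited one.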


Workflow Transformation

[Figure: the fMRI workflow DAG (align_warp/1 through convert/15), annotated with two questions]

More accurate alignment tool?
Better rendering algorithm?


Workflow transformation

[Figure: the fMRI workflow DAG (align_warp/1 through convert/15), overlaid with the labels warpnslice (four times) and render]


Conclusion

Separation of dimensions facilitates schema design and comprehension
Workflow and provenance are inextricably linked
Integration of dimensions in query is powerful
Graph matching and editing paradigms are essential tools in this unified treatment
Efficient implementation of these searches will take time and research
Elegance and usability are key factors in the query language


Acknowledgements

…many thanks to the entire OSG Collaboration and our application science partners in ATLAS, CMS, LIGO, SDSS, Dartmouth DBIC and fMRIDC, SCEC, and Argonne’s Computational Biology and Climate Science Groups of the Mathematics and Computer Science Division.

The Virtual Data System group is:
  ISI/USC: Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi
  U of Chicago: Catalin Dumitrescu, Ian Foster, Luiz Meyer (UFRJ, Brazil), Doug Scheftner, Jens Voeckler, Mike Wilde, Yong Zhao
  www.griphyn.org/vds

GriPhyN and iVDGL are supported by the National Science Foundation.

Many of the research efforts involved in this work are supported by the US Department of Energy, Office of Science.