UP & DOWN: Improving Provenance Precision by Combining Workflow- and Trace-Level Information

Saumen Dey∗, Khalid Belhajjame†, David Koop‡, Tianhong Song∗, Paolo Missier§, Bertram Ludäscher∗

Abstract

Workflow-level provenance declarations can improve the precision of coarse provenance traces by reducing the number of “false” dependencies (not every output of a step depends on every input). Conversely, fine-grained execution provenance can be used to improve the precision of input-output dependencies of workflow actors. We present a new logic-based approach for improving provenance precision by combining downward and upward inference, i.e., from workflows to traces and vice versa.

1 Introduction

Many scientific workflow systems have been instrumented to capture workflow execution events as provenance. Typically, such events record the computational steps that were invoked as part of a workflow execution as well as the data that were input to and output by each step. These recorded events can be used to trace data provenance by identifying the input values that contributed to an output data value generated as a result of the workflow execution. In practice, however, this application is limited by the black-box nature of the actors, i.e., the computational modules that implement the steps of the workflow.

To illustrate this, consider the actor A shown in Fig. 1. Assume that an execution of A uses the input values vI1 and vI2 for the input ports I1 and I2, respectively, and that the output ports O1 and O2 produce the values vO1 and vO2, respectively. Because the actor is a black box, we cannot assert whether the value vO1 was derived from (i) vI1, (ii) vI2, (iii) both vI1 and vI2, or (iv) none of the inputs. Similarly, we cannot assert how the value vO2 was derived from vI1 and vI2. By derivation, we refer to the transformation of a data value into another, an update of a data value resulting in a new one, or the construction of a new data value based on a pre-existing one [13]. All we can safely conclude is that the invocation of the actor used the input values vI1 and vI2 and generated the values vO1 and vO2. While useful, this information is not sufficient to trace fine-grained dependencies of data values produced by workflow executions.

∗University of California, {scdey,thsong,ludaesch}@ucdavis.edu
†Université Paris-Dauphine, [email protected]
‡New York University, [email protected]
§Newcastle University, [email protected]

Figure 1: An actor A with two input ports I1 and I2 and two output ports O1 and O2. Do the outputs depend on none, one, or both of the inputs?

Figure 2: Three examples (a), (b), (c) of internal port dependencies of the actor A shown in Fig. 1. There are 13 more such dependency models.

For the actor A shown in Fig. 1, there are 16 possible ways in which the output ports can depend on the input ports, three of which are shown in Fig. 2. If we determine that one of the input values was incorrect, e.g., due to a malfunctioning sensor, we would potentially need to invalidate more useful results than if we knew that the input was only used for an insignificant output. This is one of many reasons why finding fine-grained dependencies is important.

In general, if an actor has n input ports and m output ports, then there are 2^(mn) possible internal port dependency models for that actor. With k such actors in a workflow, there are 2^(kmn) possible internal port dependency models of the workflow.

In this paper, we propose a framework that takes a workflow specification and a set of provenance information and generates all possible data dependency models. By a dependency model, we mean a graph in which the nodes represent the input and output ports of the actors and the edges specify derivation dependencies between the ports. Notice that the framework generates multiple dependency models, all of which are possible. Fig. 2 shows three possible dependency models of the workflow with the single actor A shown in Fig. 1. However, given the execution trace of a workflow run, only one of the possible data dependency models reflects the true dependencies between the ports, whereas the remaining data dependency models are false positives. Figure 3 depicts the overall framework of our proposal for reducing the number of false positive data dependency models.

Figure 3: Components of our solution for inferring dependencies: (1) infer dependencies from the workflow specification, (2) infer dependencies from the provenance graph structure, (3) infer dependencies from the input-output values, (4) generate possible dependency models, and (5) validate provenance data.

During the design of a workflow, the designer knows some internal details of the actors, e.g., (i) data on an output port can be produced only after using data from a specific subset of input ports, or (ii) two outputs are generated from the exact same set of inputs. In step (1), we assume that our workflow model captures this additional design-level information and analyze it to generate additional dependencies. In step (2), we analyze the provenance graph structure and infer further dependencies by computing the transitive closure of data dependencies. Given a set of provenance traces, obtained either by (i) collecting actual workflow execution traces or (ii) probing the input values, executing the workflow, and collecting provenance from the execution traces, in step (3) we analyze input and output data values to understand the input-output dependencies. Steps (1), (2), and (3) generate additional dependency information, which the framework encodes as constraints and uses while generating all the possible models of the workflow in step (4). Each of these models is an “improved” version of the given workflow, i.e., it carries more dependency information, and these models are in turn used in step (5) to validate a provenance graph of unknown origin (i.e., unknown workflow) and to improve it by injecting more dependencies. This improved provenance can then be used for better result analysis, debugging, etc.

Running Example. We use the Climate Data Collector workflow to showcase the features of the proposed framework. We assume a set of sensors that collect temperature, pressure, and other climate observations, including various flags. Given a sensor id, the ReadSensor actor reads from the sensors and returns temperature, pressure, and three flags, flagA, flagB, and flagC. The second actor, SensorLogic, takes these three flags and returns two flags, weatherCode and temperatureCode. The flag weatherCode specifies whether the weather was normal, and temperatureCode indicates whether the temperature was read in Celsius or in Fahrenheit. Another actor, ConvertToKelvin, uses the temperatureCode flag and the temperature reading to convert the temperature to Kelvin. A final actor, RangeCalculation, accepts the pressure and temperature and computes a range.

Figure 4: Climate Data Collector workflow, an example workflow that processes climate data read from a sensor.

This workflow, defined in VisTrails, is shown in Figure 4.

If we consider all of these actors as black boxes, the ReadSensor, SensorLogic, ConvertToKelvin, and RangeCalculation actors individually have 32, 64, 4, and 4 input-output dependency models, respectively, and thus the workflow as a whole has 32768 possible dependency models. With so many models it is impossible to make any specific claims about dependencies, but if we use provenance information, we can reduce the number of possibilities.
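
These counts follow directly from the 2^(mn) formula of Section 1, assuming the port counts implied by the actor descriptions above (ReadSensor: 1 input, 5 outputs; SensorLogic: 3 inputs, 2 outputs; ConvertToKelvin and RangeCalculation: 2 inputs, 1 output each):

2^(1·5) · 2^(3·2) · 2^(2·1) · 2^(2·1) = 32 · 64 · 4 · 4 = 32768.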

For example, all of the outputs of ReadSensor depend on its single input, and there is no output if the sensor is not available. This information reduces the 32 possible dependency models to exactly one, which makes result analysis easier. Unfortunately, neither the workflow specification nor the provenance clearly captures these dependencies. We use this example workflow in Section 4 to show how we can capture additional dependency information, how we can use this information, and how we can improve the possible dependency models.

The paper is organized as follows. We start by stating the assumptions behind the workflow model and the provenance model used in our approach in Section 2. We then explain the proposed framework and its logical architecture in Section 3. In Section 4, we discuss our implementation using the example workflow, and we describe related work in Section 5. We conclude the paper and discuss ongoing work in Section 6.

2 Models

Workflow Model. We base our workflow model on [3, 4]. A workflow W = (Vw, Ew) is a directed graph whose nodes Vw = A ∪ P ∪ C are actors A, ports P, and channels C.


Figure 5: Logical architecture showing how the components (Constraint Generator, Model Generator, Model Reducer, and Provenance Validator) interact to improve a workflow specification.

Actors (a.k.a. processes) are computational entities. An invocation of an actor reads data from channels and writes data to channels. The edges Ew = in ∪ out ∪ pdep are either input edges in ⊆ C × P, output edges out ⊆ P × C, or port dependency edges pdep ⊆ P × P. We also maintain an additional relation port(P,A,Pt), where P is a port of actor A and Pt specifies the type of the port (input or output).
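
As a small illustration of these relations, the facts below encode a hypothetical two-actor workflow; all actor, port, and channel names are invented for the example, and pdep(p1,p2) is read as “port p2 depends on port p1”, matching the direction used by the inference rules in Section 3.

% Actor a1 reads channel c1 on input port p1 and writes channel c2 from output port p2;
% actor a2 reads c2 on input port p3 and writes c3 from output port p4.
port(p1,a1,input).  port(p2,a1,output).
port(p3,a2,input).  port(p4,a2,output).
in(c1,p1).  out(p2,c2).
in(c2,p3).  out(p4,c3).
% A known internal port dependency of a1: output p2 depends on input p1.
pdep(p1,p2).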

Provenance Model. The starting point for our provenance model is [12]. A provenance graph is a directed acyclic graph G = (Vg, Eg) where the nodes Vg = D ∪ I represent either data tokens D or invocations I. The edges Eg = used ∪ genBy ∪ derBy are either used edges used ⊆ I × D, generated-by edges genBy ⊆ D × I, or derived-by edges derBy ⊆ D × D. Here, a used edge (i,d) ∈ Eg means that invocation i consumes d as input; a generated-by edge (d,i) ∈ Eg means that d is an output token, generated by invocation i; and a derived-by edge (di,dj) ∈ Eg means that data di was derived using data dj. We use two more relations, data and invoc. The relation data(D,P,A,R,V) specifies that data D appeared at port P of actor A during the run R with a value V. The relation invoc(I,A) maps an invocation I to its actor A. In addition, we use the auxiliary relation ddep(X,Y) to specify that data Y depends on data X, and ddep*(X,Y) is the transitive closure of ddep(X,Y).

(1) ddep(X,Y) :- used(I,X), genBy(Y,I).
(2) ddep*(X,Y) :- ddep(X,Y).
(3) ddep*(X,Y) :- ddep(X,Z), ddep*(Z,Y).
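
As a small, self-contained illustration of rules (1)-(3), consider the hypothetical trace facts below: invocation i1 used d1 and generated d2, and invocation i2 used d2 and generated d3. All names are invented, and ddep* is written ddepT here only because predicate names cannot contain '*' in systems such as clingo.

% Hypothetical trace facts.
used(i1,d1). genBy(d2,i1).
used(i2,d2). genBy(d3,i2).

% Rules (1)-(3), with ddep* renamed to ddepT.
ddep(X,Y)  :- used(I,X), genBy(Y,I).
ddepT(X,Y) :- ddep(X,Y).
ddepT(X,Y) :- ddep(X,Z), ddepT(Z,Y).

Evaluating this program yields ddep(d1,d2) and ddep(d2,d3), and, via the transitive closure, ddepT(d1,d3).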

3 Proposed Framework and Architecture

This framework takes a workflow specification W and a set of provenance graphs Gs and infers the internal (i.e., internal to an actor) port dependencies by analyzing W and Gs. It then encodes all these inferred dependencies as constraints, which are used to reduce the number of possible workflow models. It uses the Generate-and-Test pattern from Answer Set Programming (ASP) [11], i.e., we use sets of stable models [9] to represent all possible models that are consistent with the generated constraints.

Let us consider the following snippet of the ASP program used in this framework:

pd(X,Y) v pnd(X,Y) :- in(X,A), out(A,Y).

Here, the relation pd(X,Y) means that Y depends on X, and the relation pnd(X,Y) means that Y does not depend on X. If there are n in(X,A) facts and m out(A,Y) facts for an actor A in W, then there are 2^(mn) dependency models for A. In one extreme possible model, all pd(X,Y) atoms are “true”, and in the other extreme possible model, all pnd(X,Y) atoms are “true”; in the remaining possible models, some pd(X,Y) atoms and some pnd(X,Y) atoms are “true”. The framework applies the constraints in the following way: if a possible model has some pnd(X,Y) as “true” while, for the same (X,Y) pair, a constraint has been observed to be “true” as discussed in Section 1, then there is a contradiction, because the constraint specifies that node Y depends on node X. The framework applies the negation-as-failure principle and excludes this model.
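
To illustrate the generate-and-test step on a small scale, the sketch below encodes a single hypothetical actor a with inputs i1, i2 and outputs o1, o2, plus one observed dependency; the observed/2 predicate is ours, standing in for whichever constraint the analyses of this section produce, and the disjunction is written ';' as in clingo rather than the DLV-style 'v' used above.

% Hypothetical actor a with two inputs and two outputs.
in(i1,a). in(i2,a).
out(a,o1). out(a,o2).

% Generate: every input-output pair either is (pd) or is not (pnd) a dependency.
pd(X,Y) ; pnd(X,Y) :- in(X,A), out(A,Y).

% Test: a pair observed to be a dependency must not be marked pnd.
observed(i1,o1).
:- pnd(X,Y), observed(X,Y).

Without the constraint, clingo enumerates all 2^(2·2) = 16 answer sets (e.g., clingo 0 sketch.lp); with it, the 8 models containing pnd(i1,o1) are excluded, and every remaining model contains pd(i1,o1).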

The proposed framework, as shown in Fig. 5, has several components, including the Constraint Generator, Model Generator, Model Reducer, and Provenance Validator. The Constraint Generator takes a workflow W and a set of provenance traces Gs and performs three analysis tasks to generate constraints, using the techniques we present next.

We analyze provenance graphs, capture such inferred data dependencies, and use them to reduce the number of models. More specifically, in our provenance model we capture used(I,D), genBy(D,I), and derBy(D1,D2) edges. The used(I2,D) and genBy(D,I1) edges, together with the mappings invoc(I1,A1) and invoc(I2,A2), unambiguously state that actor A2 depends on actor A1. However, the edges used(I,D1) and genBy(D2,I) alone cannot unambiguously state that data artifact D2 depends on data artifact D1. The derBy(D1,D2) edges, in contrast, describe dependencies among data artifacts, and thus these edges can be used to infer the internal port dependencies of an actor. This framework analyzes these derBy/2 relations (i.e., derBy is a binary relation) and infers further data dependencies.

We use the provenance graph shown in Fig. 6 to describe how this framework analyzes these derBy/2 relations. This provenance graph has seven data artifacts, T through Z, and two derBy/2 edges, derBy(Z,Y) and derBy(Z,T). Considering the derBy(Z,T) edge, we infer that Z depends on W.

Figure 6: A provenance graph over data artifacts T through Z and processes A1 through A4, with two derived-by edges, derBy(Z,Y) and derBy(Z,T). Circles and rectangles represent data artifacts and processes, respectively.

Thus, given the derBy(Z,Y) and derBy(Z,T) edges, we conclude that the output port associated with Z depends on the input ports associated with W and Y, which reduces the number of possible dependency models.

More generally, given provenance traces Gs, the derBy/2 relations in all of the Gs are analyzed. A derBy edge can specify a dependency between (i) an output and an input of the same actor, or (ii) an output of one actor and an input of another actor. The framework infers these constraints using an algorithm represented by datalog rules 4 and 5, shown below. Based on these two inferences, we get that port P2 depends on port P1, which is workflow-specification-level information obtained by analyzing the derBy/2 relations from the provenance graph. Note that case (i) provides an exact dependency, while case (ii) provides an overestimate.

(4) pdep(P1,P2) :- derBy(Y1,X1), data(X1,P1,A,R,_), data(Y1,P2,A,R,_).

(5) pdep(P1,P2) :- derBy(Y1,X1), data(X1,P1,A,R,_), data(D,P2,A,R,_),
                   ddep*(D,Y1), ddep*(X1,D).
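
To make rule 4 concrete, the hypothetical facts below describe a single run r1 of an actor a in which data item x1 appeared at port p1, data item y1 appeared at port p2, and the trace recorded derBy(y1,x1); the rule then derives pdep(p1,p2), i.e., port p2 depends on port p1. (All names are invented; ddep* would again be written ddepT, and rule 5 works analogously, chaining through the transitive closure when the derBy edge spans two actors.)

% Hypothetical trace fragment for one run r1 of actor a.
data(x1,p1,a,r1,v_in).
data(y1,p2,a,r1,v_out).
derBy(y1,x1).

% Rule 4: a derived-by edge between two data items observed at ports of the
% same actor in the same run yields an internal port dependency.
pdep(P1,P2) :- derBy(Y1,X1), data(X1,P1,A,R,_), data(Y1,P2,A,R,_).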

Input-output port dependencies can also be inferred by analyzing the input and output data values∗. The framework computes these constraints using an algorithm represented by rule 6, shown below.

(6) pdep(IP,OP) :- data(D1,IP,A,R1,IV1), data(D1,IP,A,R2,IV2),
                   data(D2,OP,A,R1,OV1), data(D2,OP,A,R2,OV2),
                   ¬ R1=R2, ¬ IV1=IV2, ¬ OV1=OV2.
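
Rule 6 can be illustrated with two hypothetical runs r1 and r2 of an actor a: the value at input port ip changes from 10 to 20 and the value at output port op changes from 100 to 200 with it, so the rule derives pdep(ip,op). The sketch follows the rule's reading that the same data identifiers (d1, d2) name a port's data slot across runs, and it adds the port/3 type guard from Section 2 (our addition, not part of rule 6 as printed) so that IP ranges only over input ports and OP only over output ports.

% Hypothetical probing runs r1 and r2 of actor a.
port(ip,a,input). port(op,a,output).
data(d1,ip,a,r1,10).  data(d1,ip,a,r2,20).
data(d2,op,a,r1,100). data(d2,op,a,r2,200).

% Rule 6 with the port-type guard added; '!=' is the clingo form of the inequality.
pdep(IP,OP) :- data(D1,IP,A,R1,IV1), data(D1,IP,A,R2,IV2),
               data(D2,OP,A,R1,OV1), data(D2,OP,A,R2,OV2),
               port(IP,A,input), port(OP,A,output),
               R1 != R2, IV1 != IV2, OV1 != OV2.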

The Model Generator computes all possible dependency models in which output ports may depend on input ports for each actor and removes the models that contradict the constraints generated by the Constraint Generator, as discussed at the beginning of this section. This component returns only those models that have no contradictions. The Model Reducer takes all the possible workflow models generated by the Model Generator and combines them into one workflow model. This workflow model is an improved specification of the input workflow model W, in which some of the overestimates of the internal port dependencies have been removed.

∗In this work, we assume a simple model for input and output data value analysis: an output port of an actor depends on an input port if the value on the output port changes when the value of that input changes while the values of all other inputs are kept the same. In future work, we plan to investigate further complexities of an actor.

Figure 7: The Climate Data Collector workflow (actors ReadSensor, SensorLogic, ConvertToKelvin, RangeCalculation, and StandardOutput, with ports p01 through p18) as represented by this framework. There are 32768 possible dependency models for this workflow.

In the Provenance Validator, the framework uses this improved workflow specification (i) to validate provenance graphs of unknown origin and (ii) to improve provenance graphs by removing inconsistencies.
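
The paper does not spell out how the Model Reducer combines the surviving models into one. One plausible realization in ASP, given here only as an assumption on our part and not as the authors' implementation, is to compute the cautious consequences of the generate-and-test program, i.e., the pd/2 atoms that are true in every remaining answer set; those dependencies hold in all surviving models and can safely be added to the improved specification.

% One way to obtain the shared pd/2 atoms (assumed clingo invocation):
%   clingo --enum-mode=cautious 0 sketch.lp

Any pd(X,Y) reported there, e.g., pd(i1,o1) in the earlier sketch, is a dependency shared by every surviving model.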

4 Prototypical Implementation

In this section, we describe a prototypical implementation of the proposed framework.

Fig. 7 shows the Climate Data Collector workflow represented using this framework; actors are drawn as rectangles and have ports “p01” through “p18”. The edges represent channels, which connect output ports of one actor to input ports of another actor.

Fig. 8 shows that there are 32 possible dependency models for the SensorLogic actor. Without additional information, the framework will generate all 32 models. The other actors likewise have many possible dependency models (as discussed in Section 1); they are not displayed because they mirror the SensorLogic case.

The framework analyzes the workflow specification, provenance traces, and user-specified constraints as discussed in Section 1, and generates the constraints shown in Fig. 9, which it applies while generating the possible dependency models. Fig. 9(a) shows the constraints generated from the additional specifications† provided by the workflow designer. Fig. 9(b) shows the constraints generated by the framework by analyzing the input-output value dependencies. Fig. 9(c) and Fig. 9(d) show the constraints generated by the framework by analyzing the derBy/2 edges from the provenance graphs.

Fig. 10 shows the improved workflow specification based on the given workflow, provenance graphs, and constraints.

†The definition of this specification is available in [5].


Figure 8: Possible dependency models for the SensorLogic actor. There are 32 possible dependency models for this actor alone.

Based on the constraints generated by the framework, there is only one model each for the actors ReadSensor, ConvertToKelvin, and RangeCalculation. There are still 6 models for the SensorLogic actor. Thus, there are 6 possible models in total for the Climate Data Collector workflow, and the framework generates all of them.

Using this example, we have demonstrated that the proposed framework can improve the workflow specification, and that this information can later be used to validate provenance graphs of unknown origin.

pdep(p02,p01). pdep(p03,p01). pdep(p04,p01). pdep(p05,p01). pdep(p06,p01).
pdep(p10,p07). pdep(p10,p08). pdep(p11,p09).
pdep(p14,p12). pdep(p14,p13).
pdep(p17,p15). pdep(p17,p16).

Figure 9: All of the constraints generated by the framework, grouped into panels (a)-(d) as described above.

5 Related Work

The problem of defining or inferring data dependencies in scientific workflows has been investigated by a handful of researchers. For example, Bowers et al. proposed a declarative language for specifying fine-grained dependencies at the level of the workflow definition, which are propagated and applied to the workflow trace events produced as a result of the workflow execution [2]. Their approach is complementary to ours, and we adopt a similar language to encode the dependencies that are derived from workflow specifications and their corresponding trace events.

Figure 10: Improved workflow specification of the Climate Data Collector workflow after including one possible dependency model. There are 6 such possible models, and this framework generates all of them.

However, that approach did not tackle the problem of inferring data dependencies.

Ghoshal et al. investigated the use of static analysis techniques to derive dependencies (or what they term mappings) between the inputs and outputs of an actor [10]. However, while they assume that the source code of the program implementing an actor is available, we tackle the problem of deriving data dependencies when the source code of the actors is not available.

Garijo et al. proposed a framework in which they identified data transformation and manipulation patterns (which they term motifs) that are commonly found in scientific workflows [8]. They distinguish two types of motifs: data-intensive activities that are observed in workflows (data-oriented motifs), and the different manners in which activities are implemented within workflows (workflow-oriented motifs). The first type is relevant to our work. However, in our framework we are particularly interested in identifying whether there is a dependency between an input and an output of an actor, rather than the kind of data manipulation performed by the actor.

The above work and others, e.g., [1, 14], are related because they aim to understand the kind of manipulation carried out by workflows. However, none of them tackle the problem of identifying fine-grained dependencies between input and output ports for black-box actors. In our prior work [6], we used the same Generate-and-Test ASP-based approach to find the possible orders of events; in this work, we instead focus on finding fine-grained dependencies using that approach.

6 Discussion

In this paper, we have presented a framework for inferring fine-grained dependencies between the input and output ports of actors. We have described a probing-based method for inferring such dependencies.


In doing so, we made the assumption that dependencies are static, in the sense that they hold across all invocations of the actors. In practice, however, dependencies may be dynamic, in the sense that they may change across invocations of the actor. For example, the VisTrails system [7] provides a conditional actor (If) that uses modified dataflow logic to execute only one of two upstream workflows to generate an output, based on a boolean input. There are also instances where two ports are provided so that an input can be specified in different formats, meaning that only one port's value will eventually be used, depending on which inputs are provided.

The framework that we presented in this paper needs to be refined in order to cater for the identification of dynamic dependencies. In particular, the user should be able to understand the conditions under which a given output is likely to depend, or not, on a given input. Note also that, so far, we have treated all input-output port dependencies equally. A classification of dependencies is needed to provide the user (workflow designer or provenance user) with a better understanding of the kind of relationship between the input and output ports, for example, distinguishing control-flow dependencies from data-flow dependencies. The latter can be further classified to specify the kind of contribution a given input value had in the construction of the output value.

Acknowledgments. Supported in part by NSF ACI-0830944 and IIS-1118088.

References

[1] P. Alper, K. Belhajjame, C. A. Goble, and P. Karagoz. Enhancing and abstracting scientific workflow provenance for data publishing. In EDBT/ICDT Workshops, pages 313–318. ACM, 2013.

[2] S. Bowers, T. M. McPhillips, and B. Ludäscher. Declarative rules for inferring fine-grained data provenance from scientific workflow execution traces. In Proc. IPAW ’12, pages 82–96, 2012.

[3] V. Cuevas-Vicenttín, S. Dey, B. Ludäscher, M. Li Yuan Wang, and T. Song. Modeling and querying scientific workflow provenance in the D-OPM. In 7th Workshop on Workflows in Support of Large-Scale Science (WORKS), 2012.

[4] V. Cuevas-Vicenttín, P. Kianmajd, B. Ludäscher, P. Missier, F. Chirigati, Y. Wei, D. Koop, and S. Dey. The PBase scientific workflow provenance repository. IDCC, 2014.

[5] S. Dey, S. Köhler, S. Bowers, and B. Ludäscher. Datalog as a lingua franca for provenance querying and reasoning. In TaPP ’12, 2012.

[6] S. Dey, S. Riddle, and B. Ludäscher. Provenance analyzer: exploring provenance semantics with logic rules. In TaPP ’13. USENIX, 2013.

[7] J. Freire, D. Koop, E. Santos, C. Scheidegger, C. Silva, and H. T. Vo. The Architecture of Open Source Applications, chapter VisTrails. Lulu.com, 2011.

[8] D. Garijo, P. Alper, K. Belhajjame, O. Corcho, Y. Gil, and C. A. Goble. Common motifs in scientific workflows: An empirical analysis. In eScience, pages 1–8. IEEE Computer Society, 2012.

[9] M. Gelfond and V. Lifschitz. The stable model semantics for logic programming. In Proc. 5th Int’l Conf. on Logic Programming, volume 161, 1988.

[10] D. Ghoshal, A. Chauhan, and B. Plale. Static compiler analysis for workflow provenance. In Proc. 8th Workshop on Workflows in Support of Large-Scale Science, WORKS ’13, pages 17–27. ACM, 2013.

[11] V. Lifschitz. What is answer set programming? In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1594–1597, 2008.

[12] L. Moreau, B. Clifford, J. Freire, J. Futrelle, Y. Gil, P. Groth, N. Kwasnikowska, S. Miles, P. Missier, J. Myers, B. Plale, Y. Simmhan, E. Stephan, and J. Van den Bussche. The open provenance model core specification (v1.1). FGCS, 27(6):743–756, 2011.

[13] L. Moreau and P. Missier. PROV-DM: The PROV data model, 2013.

[14] R. J. Sethi, H. Jo, and Y. Gil. Re-using workflow fragments across multiple data domains. In Proc. SCC ’12, pages 90–99, Washington, DC, USA, 2012. IEEE Computer Society.
