Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Post on 24-Jan-2015


DESCRIPTION

Presented at the Open Knowledge Conference 2011 in Berlin. This work is being done under the heading of DataONE. More information can be found at http://notebooks.dataone.org/workflows

TRANSCRIPT

Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Richard Littauer, Karthik Ram, Bertram Ludäscher, William Michener, Rebecca Koskela

DataONE

Scientific Workflows

Tools that help scientists:
• Automate repetitive or difficult work
• Provide reproducibility to their experiments
• Track provenance
• Share their data with other scientists

Workflow Workbenches

These facilitate:
• Creation
• Mapping
• Scheduling
• Execution
• Visualisation
• Re-use

Workflow Workbenches

• Not all scientists are coders.
• By using front-end visualizations and eliminating the need for lower-level coding (e.g., shell scripts), workbenches make it easier for scientists to do and share their work.

Workflow Workbenches

• This is a common way in which workflows are ‘sold’.
• However, the reality isn't quite there yet.
• Often it is just replacing one style of coding (conventional) with another (workflows).
• We’re trying to see if we can get to the bottom of how the promises cash out.

Our Study

• However, few studies have looked at how these workflows actually work.
• How do we classify workflows?
• Where do existing workflow systems fall short?
• How can the process of creating workflows be improved?
• How about executing them?
• And sharing them?

Our Study

• Some studies have been done.
• For example, as many as 30% of workflow components have been assessed to be so-called data conversion shims [4].
• This large percentage and the difficulty of developing custom shims suggest that workflow design technology can still be improved.
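To make the term concrete: a shim is a small adapter step that converts one component's output into the format the next component expects. The sketch below is illustrative only. The two components (`fetch_records`, `mean_temperature`), the CSV layout, and the values are hypothetical and are not taken from any workflow in the study.

```python
import csv
import io


def fetch_records() -> str:
    """Hypothetical upstream component: returns its results as CSV text."""
    return "site,temp_c\nA,12.5\nB,14.0\n"


def csv_to_dicts(text: str) -> list[dict]:
    """The shim: turns CSV text into the list of typed records the next step expects."""
    rows = csv.DictReader(io.StringIO(text))
    return [{"site": r["site"], "temp_c": float(r["temp_c"])} for r in rows]


def mean_temperature(records: list[dict]) -> float:
    """Hypothetical downstream component: needs numeric values, not raw text."""
    return sum(r["temp_c"] for r in records) / len(records)


# The shim sits between the two components and does no scientific work itself.
result = mean_temperature(csv_to_dicts(fetch_records()))
```

The shim contributes nothing to the analysis itself, which is why a high share of such steps points to friction in workflow design.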

Our Study

• But most importantly, these studies have not significantly changed the way we use workflows.
• In some cases, studies run on the same data came up with different results, which suggests that open data alone does not lead to reproducible science [5].
• Therefore, a greater understanding of workflows and how we can most adequately implement them into open science is called for.

Our Study

• We are analyzing a wide variety of workflow systems and publicly available workflows.
• Our main repository: http://www.myexperiment.org
• Est. 2007
• 4500+ users
• 1850+ workflows (mostly Taverna 1, 2, and RapidMiner)
• Minable by SPARQL
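As a rough idea of what mining a repository by SPARQL could look like, the sketch below builds a query and encodes it as a GET request URL. The endpoint and the `ex:` vocabulary are placeholders invented for illustration. They are not myExperiment's actual endpoint or schema, which would need to be looked up before running a real query.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder endpoint: an assumption, not the real repository address.
ENDPOINT = "https://example.org/sparql"

# Hypothetical vocabulary (ex:Workflow, ex:title, ex:downloads), for illustration only.
QUERY = """
PREFIX ex: <https://example.org/vocab#>
SELECT ?workflow ?title ?downloads
WHERE {
  ?workflow a ex:Workflow ;
            ex:title ?title ;
            ex:downloads ?downloads .
}
ORDER BY DESC(?downloads)
LIMIT 10
"""


def build_query_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query in the `query` parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)
```

No request is sent here. The URL could be passed to any HTTP client once the real endpoint and vocabulary are substituted.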

Our Study

Methods:
• For each workflow, we’re gathering three tiers of information: Meta-Data, Description, and ‘Worth’.

Tier 1

Metadata:
• Workflow source
• Workflow system
• Works on run
• Area of research
• Type
• Description
• User
• User total uploads
• Published citations
• Downloads
• Date uploaded

Tier 2

Description:
• Foreign components
• QA/QC steps
• Visual output
• Number of inputs
• Intermediate input
• Linear
• Embedded
• Embedded details
• Number of databases
• Type conversion
• Tag conversion
• Multiple outputs
• Processing
• Stats
• Scalable
• Smart reruns
• Provenance retained
• Multipurpose
• Research mining
• Query
• Loop
• Grid
• Accounts necessary
• External results

Tier 3

‘Worth’:
• Sufficiency of metadata
• Sufficiency of natural language description
• Reuse in published articles
• Relevant issues based on the system it was created in
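One way the three tiers could be held in a single record per workflow is sketched below. This is an assumed layout, not the study's actual data format. The field subset, the string encoding of the date, the yes/no and count values for Tier 2, and the integer score for Tier 3 are all hypothetical, and the example values are invented.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowRecord:
    # Tier 1: metadata (only a subset of the listed fields is shown)
    source: str
    system: str
    downloads: int
    date_uploaded: str  # ISO date string, an assumed encoding
    # Tier 2: description, one entry per feature (flag or count)
    description: dict = field(default_factory=dict)
    # Tier 3: 'worth', assumed here to be small integer scores
    worth: dict = field(default_factory=dict)


# Invented example values, not data from the study.
record = WorkflowRecord(
    source="myExperiment",
    system="Taverna 2",
    downloads=120,
    date_uploaded="2010-05-01",
    description={"linear": True, "number_of_inputs": 2, "loop": False},
    worth={"sufficiency_of_metadata": 2},
)
```

Keeping Tier 2 and Tier 3 as dictionaries lets features be added without changing the record type, at the cost of weaker checking of feature names.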

Research Hypotheses

1. Most workflows perform simple but repetitive data acquisition tasks, as opposed to complex operations.
2. Workflows are becoming more complex over time.
3. Workflows become more powerful over time.
4. Workflows become more complex as one gains more experience.

Research Hypotheses

5. Workflow re-use is proportional to the complexity of tasks performed by the workflow.
6. Workflow re-use is proportional to the sufficiency of the documentation.
7. Workflow re-use is proportional to the age of the workflow.
8. Workflow re-use is proportional to the proficiency of the creator.

Data

• Still being gathered and analysed.
• We’re using myExperiment download rate as a proxy for workflow reuse.
• One of the issues with this is the number of workflows being created by each user.
• However, this still should allow for a diachronic analysis.
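A download-rate proxy of this kind might be computed as sketched below. The definition used (downloads per day since upload) and the per-user averaging, which is one possible way to keep prolific uploaders from dominating, are assumptions made for illustration and are not the study's stated method. The sample rows are invented.

```python
from collections import defaultdict
from datetime import date


def download_rate(downloads: int, uploaded: date, as_of: date) -> float:
    """Downloads per day since upload; uses at least one day to avoid dividing by zero."""
    days = max((as_of - uploaded).days, 1)
    return downloads / days


def mean_rate_per_user(workflows: list[dict], as_of: date) -> dict:
    """Average the rate over each uploader's workflows."""
    rates = defaultdict(list)
    for wf in workflows:
        rates[wf["user"]].append(
            download_rate(wf["downloads"], wf["uploaded"], as_of)
        )
    return {user: sum(r) / len(r) for user, r in rates.items()}


# Invented sample rows, not data from the study.
sample = [
    {"user": "u1", "downloads": 100, "uploaded": date(2011, 1, 1)},
    {"user": "u1", "downloads": 60, "uploaded": date(2011, 3, 12)},
    {"user": "u2", "downloads": 30, "uploaded": date(2011, 4, 1)},
]
per_user = mean_rate_per_user(sample, as_of=date(2011, 4, 11))
```

Because the rate is normalised by time since upload, workflows uploaded in different years can be compared, which is what a diachronic analysis needs.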

Conclusion

Old publishing model:
Write paper. Submit paper. Drink wine.

New publishing model:
Write paper. Submit paper. Get feedback. Submit data. Replication (?)

Better publishing model:
Write paper using workflows. Submit paper. Get feedback. Submit data. Submit workflows. Replication that works.

As this is done, questions of how effective workflows are, and how they can be utilized in the new research and publishing paradigm, might be answered.

References

• [1] Kepler Project. http://www.kepler-project.org
• [2] Taverna. http://www.taverna.org.uk/
• [3] VisTrails. http://www.vistrails.org/
• [4] Cui Lin, Shiyong Lu, Xubo Fei, Darshan Pai, and Jing Hua. 2009. A Task Abstraction and Mapping Approach to the Shimming Problem in Scientific Workflows. In Proceedings of the 2009 IEEE International Conference on Services Computing (SCC '09). IEEE Computer Society, Washington, DC, USA. http://dx.doi.org/10.1109/SCC.2009.77
• [5] Coombes, K. R., Wang, J. & Baggerly, K. A. Microarrays: retracing steps. Nature Med. 13, 1276–1277 (2007).

DataONE Workflows Project: http://notebooks.dataone.org/workflows
Mendeley Research Group: http://www.mendeley.com/groups/1189721/scientific-workflows-and-workflow-systems/
