Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Richard Littauer, Karthik Ram, Bertram Ludäscher, William Michener, Rebecca Koskela (DataONE)

Upload: richard-littauer

Post on 24-Jan-2015

DESCRIPTION

Presented at the Open Knowledge Conference 2011 in Berlin. This work is being done under the heading of DataONE. More information can be found at http://notebooks.dataone.org/workflows

TRANSCRIPT

Page 1: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Richard Littauer, Karthik Ram, Bertram Ludäscher, William Michener, Rebecca Koskela

DataONE

Page 5: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Scientific Workflows

Tools that help scientists:
• Automate repetitive or difficult work
• Provide reproducibility to their experiments
• Track provenance
• Share their data with other scientists
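To make these four points concrete, here is a minimal sketch of a workflow in Python. It is not taken from any workflow system mentioned in this talk. The step names, the `run_workflow` helper, and the hashing scheme are all hypothetical, chosen only to show how automation, reproducibility, and provenance tracking fit together.

```python
import hashlib
import json


def fingerprint(value):
    """Return a short, stable hash of any JSON-serialisable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_workflow(steps, data):
    """Run each (name, function) step in order and record provenance."""
    provenance = []
    for name, func in steps:
        result = func(data)
        # Provenance: which step turned which input into which output.
        provenance.append({
            "step": name,
            "input": fingerprint(data),
            "output": fingerprint(result),
        })
        data = result
    return data, provenance


# Hypothetical steps: drop missing readings, then average what remains.
def drop_missing(readings):
    return [r for r in readings if r is not None]


def mean(readings):
    return sum(readings) / len(readings)


STEPS = [("drop_missing", drop_missing), ("mean", mean)]

if __name__ == "__main__":
    result, log = run_workflow(STEPS, [2.0, None, 4.0])
    print(result)
    for record in log:
        print(record)
```

Because the step list and the provenance log are plain data, both can be shared alongside the results, and rerunning the same steps on the same input yields the same log.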

Page 6: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Workflow Workbenches

Page 9: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Workflow Workbenches

These facilitate:
• Creation
• Mapping
• Scheduling
• Execution
• Visualisation
• Re-use

Page 17: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Workflow Workbenches

• Not all scientists are coders.

• By using front-end visualizations and eliminating the need for lower-level coding (e.g., shell scripts), workbenches make it easier for scientists to do and share their work.

Page 21: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Workflow Workbenches

• This is a common way in which workflows are ‘sold’.
• However, the reality isn't quite there yet.
• Often, it just replaces one style of coding (conventional) with another (workflows).
• We’re trying to find out how these promises cash out in practice.

Page 22: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Our Study

• However, few studies have looked at how these workflows actually work.

Page 27: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Our Study

• How do we classify workflows?
• Where do existing workflow systems fall short?
• How can the process of creating workflows be improved?
• How about executing them?
• And sharing them?

Page 30: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Our Study

• Some studies have been done.

• For example, as many as 30% of workflow components have been assessed to be so-called data conversion shims [4].

• This large percentage and the difficulty of developing custom shims suggest that workflow design technology can still be improved.
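For readers unfamiliar with the term, a shim is a component that does no scientific work of its own and exists only to convert data so that two otherwise incompatible components can be connected. The sketch below is an illustration only: the two components, the CSV layout, and all function names are hypothetical and are not drawn from [4].

```python
import csv
import io


def fetch_records():
    """Hypothetical upstream component: returns observations as CSV text."""
    return "site,temp_c\nA,11.5\nB,9.0\n"


def csv_to_rows(text):
    """The shim: converts CSV text into a list of typed dictionaries.

    It adds no scientific logic; it exists only so the two components fit.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [{"site": row["site"], "temp_c": float(row["temp_c"])} for row in reader]


def warmest_site(rows):
    """Hypothetical downstream component: expects dictionaries, not text."""
    return max(rows, key=lambda row: row["temp_c"])["site"]


if __name__ == "__main__":
    print(warmest_site(csv_to_rows(fetch_records())))
```

Here one of the three components is a shim. Every new pairing of formats needs another such converter, which is the design burden the slide points to.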

Page 33: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Our Study

• But most importantly, these studies have not significantly changed the way we use workflows.

• In some cases, studies run on the same data came up with different results, which suggests that open data alone does not lead to reproducible science [5].

• Therefore, we need a greater understanding of workflows and of how best to integrate them into open science.

Page 39: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Our Study

• We are analyzing a wide variety of workflow systems and publicly available workflows.

• Our main repository: http://www.myexperiment.org
  • Est. 2007
  • 4500+ users
  • 1850+ workflows (mostly Taverna 1, Taverna 2, and RapidMiner)
  • Minable by SPARQL
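As a rough illustration of what "minable by SPARQL" could look like, the sketch below builds a query URL without sending it. The endpoint address, the `mecontrib:` namespace, and the use of `dcterms:title` and `dcterms:created` are assumptions that should be checked against the repository's own documentation. Only the `query` request parameter comes from the standard SPARQL protocol.

```python
from urllib.parse import urlencode

# Assumed endpoint and vocabulary; check both against the repository's own
# documentation before relying on them.
ENDPOINT = "http://rdf.myexperiment.org/sparql"

QUERY_TEMPLATE = """\
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX mecontrib: <http://rdf.myexperiment.org/ontologies/contributions/>
SELECT ?workflow ?title ?created
WHERE {{
  ?workflow rdf:type mecontrib:Workflow ;
            dcterms:title ?title ;
            dcterms:created ?created .
}}
ORDER BY ?created
LIMIT {limit}
"""


def build_request_url(limit=10):
    """Return a GET URL that submits the query via the SPARQL protocol."""
    query = QUERY_TEMPLATE.format(limit=int(limit))
    return ENDPOINT + "?" + urlencode({"query": query})


if __name__ == "__main__":
    print(build_request_url(limit=5))
```

Keeping the query as a template makes it easy to page through the repository in batches rather than requesting every workflow at once.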

Page 41: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Our Study

• Methods:
• For each workflow, we’re gathering three tiers of information:
  • Metadata
  • Description
  • ‘Worth’

Page 42: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Tier 1

Metadata:
• Workflow source
• Workflow system
• Works on run
• Area of research
• Type
• Description
• User
• User total uploads
• Published citations
• Downloads
• Date uploaded

Page 43: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Tier 2

Description:
• Foreign components
• QA/QC steps
• Visual output
• Number of inputs
• Intermediate input
• Linear
• Embedded
• Embedded details
• Number of databases
• Type conversion
• Tag conversion
• Multiple outputs
• Processing
• Stats
• Scalable
• Smart reruns
• Provenance retained
• Multipurpose
• Research mining
• Query
• Loop
• Grid
• Accounts necessary
• External results

Page 44: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Tier 3

‘Worth’:
• Sufficiency of metadata
• Sufficiency of natural language description
• Reuse in published articles
• Relevant issues based on the system it was created in
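One way the three tiers might be held together for a single workflow is sketched below. The class name, field names, and the choice of which attributes to include are illustrative assumptions. The slides list the attributes but do not specify a storage format.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional


@dataclass
class WorkflowRecord:
    """One classified workflow; field names are illustrative."""
    # Tier 1: metadata (a subset of the attributes listed above)
    source: str
    system: str
    uploaded: date
    downloads: int = 0
    # Tier 2: description; yes/no or count features keyed by name
    features: Dict[str, object] = field(default_factory=dict)
    # Tier 3: 'worth'; None until a reviewer has judged it
    metadata_sufficient: Optional[bool] = None
    description_sufficient: Optional[bool] = None

    def is_fully_classified(self) -> bool:
        """True once both Tier 3 judgements have been recorded."""
        return None not in (self.metadata_sufficient, self.description_sufficient)


if __name__ == "__main__":
    record = WorkflowRecord(
        source="myExperiment",
        system="Taverna 2",
        uploaded=date(2010, 6, 1),
        downloads=120,
        features={"linear": True, "number_of_inputs": 2},
    )
    print(record.is_fully_classified())
```

Tier 3 fields default to `None` rather than `False` so that "not yet judged" stays distinguishable from "judged insufficient".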

Page 48: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Research Hypotheses

1. Most workflows perform simple but repetitive data acquisition tasks, as opposed to complex operations.
2. Workflows are becoming more complex over time.
3. Workflows become more powerful over time.
4. Workflows become more complex as their creator gains more experience.

Page 52: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Research Hypotheses

5. Workflow re-use is proportional to the complexity of tasks performed by the workflow.
6. Workflow re-use is proportional to the sufficiency of the documentation.
7. Workflow re-use is proportional to the age of the workflow.
8. Workflow re-use is proportional to the proficiency of the creator.
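The slides do not say how "proportional to" will be tested. One possible approach, offered here purely as an assumption, is a rank correlation between the re-use measure and the other variable, which detects any monotonic relationship rather than strict proportionality. The sketch below implements Spearman's coefficient in plain Python, and the numbers in the usage example are invented.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


if __name__ == "__main__":
    # Invented numbers: component counts and download counts for five workflows.
    complexity = [3, 8, 5, 12, 7]
    downloads = [40, 150, 60, 300, 90]
    print(spearman(complexity, downloads))
```

A coefficient near +1 would support a hypothesis such as number 5, a value near 0 would not, and a value near -1 would contradict it.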

Page 56: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Data

• Still being gathered and analysed.

• We’re using myExperiment download rate as a proxy for workflow reuse.
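The slides do not define "download rate" precisely. A minimal sketch, assuming it means total downloads divided by the number of days the workflow has been online, is given below. The function name and the per-day unit are assumptions.

```python
from datetime import date


def download_rate(downloads, uploaded, as_of):
    """Downloads per day between upload and the observation date."""
    days_online = (as_of - uploaded).days
    if days_online <= 0:
        raise ValueError("as_of must be later than the upload date")
    return downloads / days_online


if __name__ == "__main__":
    # Invented example: 300 downloads over the first 120 days of 2011.
    print(download_rate(300, date(2011, 1, 1), date(2011, 5, 1)))
```

Using a rate rather than a raw count keeps older workflows, which have had longer to accumulate downloads, from looking more re-used than newer ones.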

Page 57: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Data

• One issue with this is the number of workflows being created by each user.

• However, this should still allow for a diachronic analysis.

Page 59: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Conclusion

Old publishing model:

Write paper. Submit paper. Drink wine.

New publishing model:

Write paper. Submit paper. Get feedback. Submit data. Replication (?)

Page 62: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

Conclusion

Better publishing model:

Write paper using workflows. Submit paper. Get feedback. Submit data. Submit workflows. Replication that works.

As this is done, questions of how effective workflows are, and how they can be utilized in the new research and publishing paradigm, might be answered.

Page 63: Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model

References

• [1] Kepler Project. http://www.kepler-project.org
• [2] Taverna. http://www.taverna.org.uk/
• [3] VisTrails. http://www.vistrails.org/
• [4] Cui Lin, Shiyong Lu, Xubo Fei, Darshan Pai, and Jing Hua. 2009. A Task Abstraction and Mapping Approach to the Shimming Problem in Scientific Workflows. In Proceedings of the 2009 IEEE International Conference on Services Computing (SCC '09). IEEE Computer Society, Washington, DC, USA. http://dx.doi.org/10.1109/SCC.2009.77
• [5] Coombes, K. R., Wang, J. & Baggerly, K. A. Microarrays: retracing steps. Nature Med. 13, 1276–1277 (2007).

DataONE Workflows Project: http://notebooks.dataone.org/workflows
Mendeley Research Group: http://www.mendeley.com/groups/1189721/scientific-workflows-and-workflow-systems/