


Page 1: New Information Sharing in Science 2.0: Challenges and Opportunitiesemanuele/info_sharing.pdf · 2010. 1. 6. · information sharing, collaboration, provenance, scientific work-flows

Information Sharing in Science 2.0: Challenges and Opportunities

Emanuele Santos, University of Utah, Salt Lake City, UT
[email protected]

Juliana Freire, University of Utah, Salt Lake City, UT
[email protected]

Cláudio Silva, University of Utah, Salt Lake City, UT
[email protected]

ABSTRACT

Scientists are beginning to utilize wikis, mashups, blogs and other Web 2.0 technologies as a means to improve collaboration and information sharing. The term Science 2.0 has been used to refer to emerging scientific applications that make use of these Web 2.0 technologies. In this paper we discuss the benefits and opportunities that come from Science 2.0 as well as the challenges involved in building Science 2.0 applications.

Author Keywords

information sharing, collaboration, provenance, scientific workflows

ACM Classification Keywords

H.5.3 Information Interfaces and Presentation: Group and Organization Interfaces - Collaborative computing

INTRODUCTION

The success of Web 2.0 technologies, such as wikis, blogs, and social-networking sites, together with the proliferation of Internet-enabled wireless portable devices, has opened up new opportunities to improve collaboration and information sharing among scientists. Recently, the idea of Science 2.0 has started to gain attention. Science 2.0 entails the use of open-access Web 2.0 technologies for carrying out scientific activities [15]. It can also be viewed as a new kind of science, which introduces new methods for carrying out scientific research [11].

By democratizing science, Science 2.0 has the potential to benefit both the scientific community and the general public. The Web is open and for this reason anyone can participate, publish and consume information, regardless of whether one is a member of a large established research group with plentiful resources, an independent researcher, or a high-school student.

Because the Web provides virtually unlimited space, scientific results can be described in much greater detail. Whereas a scientific peer-reviewed publication represents a snapshot of a given problem and solution, on the Web scientists can essentially publish their notebooks and the many different alternatives they tried, as well as the data sets and analysis scripts they used. With all this information, readers can better understand not only the results but also the exploratory process that led to those results. In some cases, they may also be able to reproduce and validate the results. Besides better information dissemination, Science 2.0 also leads to improved collaboration. As a scientist posts her results (and questions) online, she can receive immediate feedback as well as discuss (or blog) with a number of researchers throughout the world [8, 13].

SCIENCE 2.0 TODAY: OPPORTUNITIES

Collaborative Content Creation and Curation. Wikis have become a popular means to share data and scientific findings [14]. Scientific wikis have been used in different ways. OpenWetWare (http://openwetware.org) is a wiki for sharing electronic lab notebooks in biological sciences and engineering. Started in 2005, today it serves more than 5,500 registered users. WikiPathways (http://wikipathways.org) has been used by the biology community for sharing and maintaining a pathway database [9]. It provides the infrastructure for a mass collaboration approach to curate the pathways. To facilitate participation in pathway curation, WikiPathways extended the popular MediaWiki software to include a custom graphical pathway editing tool and integrated databases comprising major gene, protein, and small-molecule systems.

WikiGenes (http://wikigenes.org) is a portal that provides access to gene, protein and chemical compound databases [6]. Like any wiki, WikiGenes consists of thousands of articles collaboratively edited by users. Most wiki engines keep track of every modification made to a page, but these modifications are rarely exposed to users because, after a large number of changes, the information can be very hard to understand. A key difference between WikiGenes and other wikis is that it links every piece of text and every word directly to its author. It also allows registered authors to rate other authors' contributions, helping to build reputation within the community.
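To make the idea of word-level attribution concrete, the following is a minimal sketch under our own assumptions; it is not WikiGenes' actual data model. The class name AttributedText and its methods are hypothetical. Each word is stored together with the author who last wrote it, so authorship survives later edits by other users.

```python
from collections import Counter


class AttributedText:
    """Hypothetical word-level attribution store (illustration only)."""

    def __init__(self):
        # Each token is a (word, author) pair.
        self.tokens = []

    def append(self, author, text):
        """Add words at the end of the page, credited to `author`."""
        self.tokens.extend((word, author) for word in text.split())

    def replace(self, author, start, end, text):
        """Replace tokens[start:end]; the new words are credited to `author`."""
        self.tokens[start:end] = [(word, author) for word in text.split()]

    def author_of(self, index):
        """Return the author of the word at position `index`."""
        return self.tokens[index][1]

    def contribution(self):
        """Count how many words of the current text each author wrote."""
        return Counter(author for _, author in self.tokens)

    def text(self):
        return " ".join(word for word, _ in self.tokens)


page = AttributedText()
page.append("alice", "BRCA1 is a tumor suppressor gene")
page.replace("bob", 3, 5, "human tumor suppressor")
print(page.text())          # BRCA1 is a human tumor suppressor gene
print(page.author_of(3))    # bob
```

A per-author word count such as contribution() could serve as one input to a reputation measure, although how contributions are actually weighted is a design decision this sketch does not address.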

myExperiment (http://myexperiment.org) is also used for sharing content. But instead of data sets and documents, scientists use myExperiment to share workflows. These workflows represent computational tasks (e.g., to integrate information from different sources, to perform analyses or build visualizations) that combine different tools and services. Drawing upon social networking Web sites, users can rate, tag and recommend workflows.

Figure 1. Accessing the workflow that generated a 3D visualization and a histogram on a wiki page. By clicking on the image, VisTrails will load the workflow so users can interact directly with the visualization and explore the workflow evolution history.

Social Data Analysis. Another class of Web 2.0 site that has recently emerged allows users to upload and collaboratively analyze different types of data [12, 13]. Also based on the idea of social networking, Many Eyes [13] supports collaboration around a collection of data visualizations at a large scale. Registered users can upload data, create interactive visualizations, and view, discuss and rate data sets and visualizations.

Reproducible Publications. With the goal of simplifying the process of publishing reproducible results, the VisTrails system (http://www.vistrails.org) allows users to create documents whose digital artifacts (e.g., figures) include detailed provenance information, i.e., the specification of the computational process (or workflow) and associated parameters used to produce the artifact. Using the provenance, readers can reproduce the result by re-executing the workflow as well as experiment with other parameters. Because VisTrails maintains information about how workflows evolve over time [7], besides the individual workflow used to derive a given artifact, readers can also access the different (versions of) workflows that led to that workflow, and obtain a better understanding of the trial-and-error process followed to derive the artifact (Figure 1).
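The notion of workflow evolution history can be illustrated with a small version tree. The sketch below is a simplification under our own assumptions and is not the VisTrails implementation: the class VersionTree, its methods, and the parameter names are hypothetical, and a workflow is reduced to a dictionary of parameters. Each version stores only its parent and the parameters it changed; a complete workflow specification is recovered by replaying the changes from the root.

```python
class VersionTree:
    """Hypothetical version tree: each version records a change to its parent."""

    def __init__(self):
        self.parent = {0: None}   # version id -> parent id (0 is the empty root)
        self.change = {0: {}}     # version id -> parameters set in that step
        self._next_id = 1

    def derive(self, parent_id, **params):
        """Create a new version by changing `params` relative to `parent_id`."""
        version_id = self._next_id
        self._next_id += 1
        self.parent[version_id] = parent_id
        self.change[version_id] = dict(params)
        return version_id

    def lineage(self, version_id):
        """Return the chain of versions from the root to `version_id`."""
        path = []
        while version_id is not None:
            path.append(version_id)
            version_id = self.parent[version_id]
        return path[::-1]

    def materialize(self, version_id):
        """Replay all changes along the lineage to get the full parameter set."""
        params = {}
        for step in self.lineage(version_id):
            params.update(self.change[step])
        return params


tree = VersionTree()
v1 = tree.derive(0, dataset="head.vtk", isovalue=50)
v2 = tree.derive(v1, isovalue=80)       # one alternative the scientist tried
v3 = tree.derive(v1, colormap="hot")    # a different branch from the same parent
print(tree.lineage(v2))       # [0, 1, 2]
print(tree.materialize(v2))   # {'dataset': 'head.vtk', 'isovalue': 80}
```

Because branches are never overwritten, a reader who starts from the version behind a published figure can walk back through lineage() and inspect the alternatives that were explored along the way.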

CHALLENGES

Although the openness of Science 2.0 has clear benefits and has the potential to make science more efficient, it also presents important challenges. For example: If everyone can publish and edit information on a wiki, can we trust that the information is accurate? If unpublished results are posted on a wiki, can we prevent others from stealing that work? Will scientists spend all the extra time required to publish their notebooks and associated data online? And if they do, will we be flooded with information and unable to find what is really important? If a document links to information in a wiki, will that information be there one year from now? We discuss these issues below.

The Importance of Provenance. Provenance, from the Latin word provenire ("to come from"), means the origin, or the source, of something, or the history of the ownership or location of an object. Because Science 2.0 applications facilitate collaboration and allow potentially large groups of people to create and modify artifacts (e.g., documents, databases, and workflows), maintaining detailed provenance of these artifacts is crucial. The provenance information can be used to determine authorship, enforce intellectual property rights, validate the integrity of artifacts and assess their quality, and in some cases to reproduce the artifact.

Existing Science 2.0 applications, however, provide little or no support for provenance. Wikis, for example, do track changes to documents, but they fail to capture modifications to artifacts included in the documents. In myExperiment, it is not possible to track sequences of modifications to a workflow by multiple users, unless explicitly stated by the users.
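As an illustration of what artifact-level provenance could look like, the sketch below keeps an append-only log of modifications made by different users and chains the entries with hashes so that tampering can be detected. It is a sketch under our own assumptions, not a description of any existing system: the function names, the field names, and the choice of SHA-256 hash chaining are ours.

```python
import hashlib
import json


def _digest(entry):
    """Hash an entry deterministically (keys sorted before serialization)."""
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()


def record(log, artifact_id, author, action, content):
    """Append one provenance entry, linked to the previous entry by its hash."""
    entry = {
        "artifact": artifact_id,
        "author": author,
        "action": action,
        "content_hash": hashlib.sha256(content.encode()).hexdigest(),
        "prev": log[-1]["entry_hash"] if log else "",
    }
    entry["entry_hash"] = _digest(entry)  # computed before this key is added
    log.append(entry)
    return entry


def verify(log):
    """Check that no entry was altered and that the chain is unbroken."""
    prev = ""
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev"] != prev or _digest(body) != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True


def authors(log, artifact_id):
    """List, in order, everyone who created or modified the artifact."""
    return [e["author"] for e in log if e["artifact"] == artifact_id]


log = []
record(log, "figure1", "alice", "create", "isovalue=50")
record(log, "figure1", "bob", "modify", "isovalue=80")
print(authors(log, "figure1"))  # ['alice', 'bob']
print(verify(log))              # True
```

The same log supports two of the uses named above: authors() answers the authorship question for a sequence of modifications by multiple users, and verify() gives a basic integrity check.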

Integrating Science 2.0 Applications. If we examine existing Science 2.0 applications, we can see that each application addresses specific aspects of the scientific process, such as data curation, result reproducibility, or information sharing. To support the scientific process more fully, it is necessary to integrate these tools and the information they hold. One alternative is to use mashups that combine components that come from multiple Web sites [3]. For example, this would be useful for a scientist adding an entry to his lab notebook on OpenWetWare about an experiment he did on finding biological pathways. He could add the new pathway he found to WikiPathways and link to the workflow he used to do that on myExperiment. However, achieving this integration can be challenging given the large number of tools used, the variety of data formats, and the lack of standard and reliable linking mechanisms [5].

Usable Tools. Part of the success of Web 2.0 technologies can be attributed to the fact that they provide usable tools and interfaces for users to share information and collaborate. But to share (reproducible) scientific results, substantial work may be required, e.g., to organize and upload source code and data sets. Scientists will be less inclined to generate reproducible publications or to publish lab notebooks online if great effort and time are required. Thus, there is a need for tools that facilitate this process and that make the process of information sharing more transparent and better integrated with the tools scientists already use for carrying out their experiments. For example, a scientist who uses VisTrails to analyze and visualize data can easily publish her results in a single step, using a single copy-and-paste operation (see [10] for examples of how scientists can publish their experimental results using VisTrails).

Finding and Making Sense of Information. Although Science 2.0 sites can help make sense of the scientific data tsunami, they also add to the information overflow problem. An important challenge is thus how to quickly find the information one is looking for. Generic search engines such as Google or Yahoo attempt to maximize the coverage of their index, and they are very good for broad searches. For specific queries, however, they are often ineffective. For example, if a biologist needs to locate online databases related to molecular biology and searches on Google for the keywords "molecular biology database", over 19 million documents are returned. Among these, she will find pages that contain databases, but the results also include a very large number of pages from journals, scientific articles, personal Web pages, etc. Domain-specific (vertical) search engines will be fundamental to help users more effectively search for information in Science 2.0 sites [1, 2]. More specific and structured queries may be required for some of the information available in these sites. Handling such queries, and attempting to integrate the information on-the-fly, is an open research problem [5].
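The difference between a broad index and a vertical one can be sketched with a toy relevance filter. This is only an illustration under our own assumptions: the focused crawlers cited above rely on learned classifiers, whereas the sketch substitutes a plain keyword-density score, and the term list, threshold, and example URLs are hypothetical.

```python
import re

# Hypothetical vocabulary for pages that expose a searchable biology database.
DOMAIN_TERMS = {"database", "gene", "protein", "sequence", "query", "search"}


def relevance(text, terms=DOMAIN_TERMS):
    """Fraction of the words in `text` that belong to the domain vocabulary."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(word in terms for word in words) / len(words)


def vertical_filter(pages, threshold=0.2):
    """Keep only pages scoring at least `threshold`, best match first."""
    scored = [(relevance(text), url) for url, text in pages.items()]
    return [url for score, url in sorted(scored, reverse=True) if score >= threshold]


pages = {
    "db.example.org": "Search the protein sequence database by gene",
    "journal.example.org": "This article reviews recent advances in molecular biology",
    "home.example.org": "My homepage with photos and a database of recipes",
}
print(vertical_filter(pages))  # ['db.example.org']
```

Here the journal article and the personal page are dropped even though the latter mentions the word "database", which mirrors the kind of noise described in the molecular biology example.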

Data Longevity. An important problem that needs to be addressed in Science 2.0 is data longevity. Since data and applications live in distributed and autonomous sites, there is no guarantee that they will exist forever. Even when an artifact is published together with its provenance, the quality of the artifact is diminished if part of the provenance comes from a site that disappeared, or was produced by a workflow that no longer runs (e.g., uses old libraries or defunct Web services). This problem of data preservation has attracted substantial attention recently and it is the main topic of the National Science Foundation DataNet program [4].
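The effect of a vanished resource on published artifacts can be sketched as a reachability check over the provenance graph. The sketch assumes, as a simplification of ours, that provenance is available as an acyclic mapping from each artifact to the items it was derived from; the function name and the example resource names are hypothetical.

```python
def is_reproducible(node, depends_on, available):
    """Return True if everything `node` was derived from can still be reached.

    depends_on: dict mapping a derived item to the list of items it uses.
    available:  set of leaf resources (data sets, services) that still exist.
    The provenance graph is assumed to be acyclic.
    """
    deps = depends_on.get(node)
    if deps is None:
        # A leaf resource: it is usable only if it still exists.
        return node in available
    return all(is_reproducible(dep, depends_on, available) for dep in deps)


depends_on = {
    "figure1": ["workflow_v3"],
    "workflow_v3": ["dataset.csv", "alignment_service"],
    "figure2": ["dataset.csv"],
}
available = {"dataset.csv"}  # the Web service has disappeared

print(is_reproducible("figure1", depends_on, available))  # False
print(is_reproducible("figure2", depends_on, available))  # True
```

A check of this kind only reports which artifacts are affected; preserving the missing data or service is the harder problem discussed above.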

Authorship Attribution. Many scientists are concerned with the openness provided by Science 2.0. Tenure cases and patents are dependent on being the first to publish a new discovery, and by publishing results on the Web one risks having ideas stolen [15]. Keeping detailed provenance, as is done in WikiGenes [6], is essential for keeping track of authorship. myExperiment also allows users to give credit to other users. However, this non-traditional form of credit assignment is far from being accepted by the scientific community.

CONCLUSIONS

In this position paper, we discussed some of the benefits and new opportunities brought about by Science 2.0. We also discussed important challenges involved in building Science 2.0 applications and argued that the ability to track provenance is essential for these applications. The adoption of Science 2.0 has the potential to transform the way science is done, both by improving the process of information sharing and by enabling new forms of collaboration. However, its success depends heavily on our ability to solve challenging computer science problems.

ACKNOWLEDGMENTS

This work is partially supported by the NSF (under grants IIS-0844572, CNS-0751152, IIS-0746500, IIS-0513692, CCF-0401498, EIA-0323604, CNS-0514485, IIS-0534628, CNS-0528201, OISE-0405402), the DOE, and an IBM Faculty Award. E. Santos is partially supported by a CAPES/Fulbright fellowship.

REFERENCES

1. L. Barbosa and J. Freire. An adaptive crawler for locating hidden-web entry points. In Proceedings of WWW, pages 441–450, 2007.

2. S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific web resource discovery. Computer Networks, 31(11-16):1623–1640, 1999.

3. B. Chan, L. Wu, J. Talbot, M. Cammarano, and P. Hanrahan. Vispedia: Interactive visual exploration of Wikipedia data via search-based integration. IEEE Transactions on Visualization and Computer Graphics, 14(6):1213–1220, Nov.-Dec. 2008.

4. Sustainable digital data preservation and access network partners (DataNet). http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503141. Accessed on 19 February 2009.

5. M. J. Franklin, A. Y. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Record, 34(4):27–33, 2005.

6. R. Hoffmann. A wiki for the life sciences where authorship matters. Nature Genetics, 40(9):1047–1051, 2008.

7. J. Freire, C. T. Silva, S. P. Callahan, E. Santos, C. E. Scheidegger, and H. T. Vo. Managing rapidly-evolving scientific workflows. In International Provenance and Annotation Workshop (IPAW), LNCS 4145, pages 10–18, 2006.

8. An experiment in massively collaborative mathematics. http://en.wordpress.com/tag/polymath1/. Accessed on 19 February 2009.

9. A. R. Pico, T. Kelder, M. P. van Iersel, K. Hanspers, B. R. Conklin, and C. Evelo. WikiPathways: Pathway editing for the people. PLoS Biology, 6(7), 2008.

10. E. Santos and H. Vo. Reproducible and interactive documents. Online video clip, September 2008. http://www.sci.utah.edu/~emanuele/movies/vistrails-publishing-new.mov. Accessed on 31 October 2008.

11. B. Shneiderman. Science 2.0. Science, 319(5868):1349–1350, Jan 2008.

12. Swivel. http://www.swivel.com.

13. F. B. Viegas, M. Wattenberg, F. van Ham, J. Kriss, and M. McKeon. Many Eyes: A site for visualization at internet scale. IEEE Transactions on Visualization and Computer Graphics, 13(6):1121–1128, 2007.

14. M. Waldrop. Wikiomics. Nature, 455:22–25, September 2008.

15. M. M. Waldrop. Science 2.0. Scientific American, 298(5):68–73, May 2008.
