acs denver dirks potenzone 30 aug2011

Enriched research documents at the cutting edge: When research papers no longer make sense on paper Rudy Potenzone SciencePoint Solutions Lee Dirks Education & Scholarly Communication Microsoft Research | Connections Presented at the American Chemical Society National Meeting Denver CO, August 30, 2011 at the Skolnick Award Symposium in Honor of Sandy Lawson

Upload: rudy-potenzone

Post on 07-May-2015

DESCRIPTION

Presentation at the American Chemical Society Meeting in Denver CO, August 30, 2011 at the Skolnick Award Symposium honoring Sandy Lawson.

TRANSCRIPT

Page 1: Acs denver dirks potenzone 30 aug2011

Enriched research documents at the cutting edge: When research papers no longer make sense on paper

Rudy Potenzone
SciencePoint Solutions

Lee Dirks
Education & Scholarly Communication
Microsoft Research | Connections

Presented at the American Chemical Society National Meeting
Denver CO, August 30, 2011
at the Skolnick Award Symposium in Honor of Sandy Lawson

Page 2: Acs denver dirks potenzone 30 aug2011

Agenda

• Part 1 – The Scientific Paper
• Part 2 – Emergence of the ePaper
• Part 3 – Of Workflows and Add-ins
• Part 4 – Impact of the ePaper
• Part 5 – A Glimpse to the Future

Page 3: Acs denver dirks potenzone 30 aug2011

Agenda

• Part 1 – The Scientific Paper
• Part 2 – Emergence of the ePaper
• Part 3 – Of Workflows and Add-ins
• Part 4 – Impact of the ePaper
• Part 5 – A Glimpse to the Future

Page 4: Acs denver dirks potenzone 30 aug2011

A Brief History of Enriched Scientific Papers

• Research papers on paper have long been accompanied by enriched content
• Embedded figures and associated electronic items
– Chemical structures that include full bonding and structural information
– Crystallographic databases
– Spectral databases
– Biological sequence and pathway databases
– Supplemental material repositories

Page 5: Acs denver dirks potenzone 30 aug2011

Issues with External Repositories

• Often not complete
• Poorly audited, with some notable exceptions
• References between the paper and the files are often lost or incorrect
• There is a real loss of context due to the separation of all the information
• Reproducibility is not certain!

Page 6: Acs denver dirks potenzone 30 aug2011

Agenda

• Part 1 – The Scientific Paper
• Part 2 – Emergence of the ePaper
• Part 3 – Of Workflows and Add-ins
• Part 4 – Impact of the ePaper
• Part 5 – A Glimpse to the Future

Page 7: Acs denver dirks potenzone 30 aug2011

My Bio – a Content Perspective

• The NIH/EPA Chemical Information System
– SANSS, MSSS, FRSS, etc.
• Chemical Abstracts Service
– CA, Registry, CASREACT, CHEMCATS, SciFinder
• MDL Information Systems/Elsevier
– ACD, various synthesis, Beilstein
• LION bioscience, Ingenuity Systems
– SRS and Ingenuity Pathway Analysis (IPA)
• CambridgeSoft
– ACX, etc.

Page 8: Acs denver dirks potenzone 30 aug2011

Why Are We NOT Focusing On Authoring Tools?

Page 9: Acs denver dirks potenzone 30 aug2011

On the Verge of a Major Revolution

• Technology that enables authors to create elaborate versions of results of research
• Capturing the full context of research in progress:
– The formal scientific report
– The very METHODS used
– Full data repository
– Complete workflows
• With the resulting documentation offering information for completely reproducible results

Page 10: Acs denver dirks potenzone 30 aug2011

Envisioning a New Era of Research Reporting

• Dynamic Documents
• Reputation & Influence
• Reproducible Research
• Interactive Data
• Collaboration

Page 11: Acs denver dirks potenzone 30 aug2011

Benefits of a Scientific ePaper

• Helping to improve the quality of science
• Facilitating the intellectual transfer of the core discoveries
• Fully documenting the provenance of the research
• Preserving the knowledge with complete context
• Services easily accessible on top of the data
– a new value-added layer
– visualization and analysis
– discovery through simulation and modeling
– etc.
• Accessible Reproducible Research!!

Page 12: Acs denver dirks potenzone 30 aug2011

Reproducible Research

Scientific publications have at least two goals:
1. to announce a result, and
2. to convince readers that the result is correct.
3. to preserve knowledge.

Jill P. Mesirov. Accessible Reproducible Research. Science, Vol. 327 (5964), p. 415, 22 Jan 2010 (from http://www.sciencemag.org/cgi/content/full/327/5964/415/DC1)

Page 13: Acs denver dirks potenzone 30 aug2011

Fully Reproducible Content Driving Better Science

• Rich Original Content
• Content Sharing Services
• Full Data Content Embedded
• Workflow Process Embedded

Page 14: Acs denver dirks potenzone 30 aug2011

Agenda

• Part 1 – The Scientific Paper
• Part 2 – Emergence of the ePaper
• Part 3 – Of Workflows and Add-ins
• Part 4 – Impact of the ePaper
• Part 5 – A Glimpse to the Future

Page 15: Acs denver dirks potenzone 30 aug2011

Redefining the Document

• Microsoft introduced its open document format, Office Open XML (OpenXML), in Office 2007
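An OpenXML document is a ZIP package of XML parts, which is what allows an add-in to store structured data alongside the visible text. The sketch below builds a toy package in memory and reads it back with the Python standard library. It is an illustration under stated assumptions: the part names follow the usual .docx layout, but the package deliberately omits the content-type and relationship parts a real file needs, so Word would not open it. The `customXml/item1.xml` payload and the function names `build_toy_package` and `read_package` are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p><w:r>'
    "<w:t>Acetic acid was used as solvent.</w:t>"
    "</w:r></w:p></w:body></w:document>"
)

# Hypothetical data part: structured content stored next to the text.
DATA_XML = '<cml xmlns="http://www.xml-cml.org/schema"><molecule id="m1"/></cml>'


def build_toy_package() -> bytes:
    """Write two XML parts into an in-memory ZIP container."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as pkg:
        pkg.writestr("word/document.xml", DOCUMENT_XML)
        pkg.writestr("customXml/item1.xml", DATA_XML)
    return buf.getvalue()


def read_package(raw: bytes):
    """Return the sorted part names and the visible text of the document."""
    with zipfile.ZipFile(io.BytesIO(raw)) as pkg:
        parts = sorted(pkg.namelist())
        body = ET.fromstring(pkg.read("word/document.xml"))
        text = "".join(t.text or "" for t in body.iter(f"{{{W_NS}}}t"))
    return parts, text


parts, text = read_package(build_toy_package())
print(parts)
print(text)
```

The same reading code works on a real .docx file if `raw` is replaced by the file's bytes, since the container format is the same.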

Page 16: Acs denver dirks potenzone 30 aug2011

Project "Chem4Word" – Chemical Drawing in Microsoft Word
Semantic chemistry for students and publishers

<?xml version="1.0" ?>
<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
 <molecule id="m1">
  <atomArray>
   <atom id="a1" elementType="C" x2="-2.9149999618530273" y2="0.7699999809265137" />
   <atom id="a2" elementType="C" x2="-1.5813208400249916" y2="1.5399999809265137" />
   <atom id="a3" elementType="O" x2="-0.24764171819695613" y2="0.7699999809265134" />
   <atom id="a4" elementType="O" x2="-1.5813208400249912" y2="3.0799999809265137" />
   <atom id="a5" elementType="H" x2="-4.248679083681063" y2="1.5399999809265137" />
   <atom id="a6" elementType="H" x2="-2.914999961853028" y2="-0.7700000190734864" />
   <atom id="a7" elementType="H" x2="-4.248679083681063" y2="-1.907348645691087E-8" />
   <atom id="a8" elementType="H" x2="1.0860374036310796" y2="1.5399999809265132" />
  </atomArray>
  <bondArray>
   <bond atomRefs2="a1 a2" order="1" />
   <bond atomRefs2="a2 a3" order="1" />
   <bond atomRefs2="a2 a4" order="2" />
   <bond atomRefs2="a1 a5" order="1" />
   <bond atomRefs2="a1 a6" order="1" />
   <bond atomRefs2="a1 a7" order="1" />
   <bond atomRefs2="a3 a8" order="1" />
  </bondArray>
 </molecule>
</cml>

V1.0 now available (binary and open source)

http://research.microsoft.com/chem4word/

Data: Semantics stored in Chemical Markup Language (CML)

Intent: Recognizes chemical dictionary and ontology terms

Author/edit 1D and 2D chemistry. Change chemical layout styles.

Intelligence: Verifies validity of authored chemistry

http://www.nytimes.com/2010/04/08/technology/personaltech/08askk.html?_r=1

Relationships: Navigate and link referenced chemistry
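Because the chemistry is stored as CML rather than as a picture, any XML-aware tool can read it back. The sketch below parses the molecule shown on this slide with the Python standard library and derives its molecular formula and bond orders. The 2D coordinates are omitted to keep the listing short, and the function names `molecular_formula` and `bond_orders` are our own for illustration; they are not part of Chem4Word.

```python
import xml.etree.ElementTree as ET
from collections import Counter

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# The molecule from the slide, with x2/y2 coordinates left out for brevity.
CML_TEXT = """<?xml version="1.0" ?>
<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
 <molecule id="m1">
  <atomArray>
   <atom id="a1" elementType="C" /> <atom id="a2" elementType="C" />
   <atom id="a3" elementType="O" /> <atom id="a4" elementType="O" />
   <atom id="a5" elementType="H" /> <atom id="a6" elementType="H" />
   <atom id="a7" elementType="H" /> <atom id="a8" elementType="H" />
  </atomArray>
  <bondArray>
   <bond atomRefs2="a1 a2" order="1" /> <bond atomRefs2="a2 a3" order="1" />
   <bond atomRefs2="a2 a4" order="2" /> <bond atomRefs2="a1 a5" order="1" />
   <bond atomRefs2="a1 a6" order="1" /> <bond atomRefs2="a1 a7" order="1" />
   <bond atomRefs2="a3 a8" order="1" />
  </bondArray>
 </molecule>
</cml>"""


def molecular_formula(cml_text: str) -> str:
    """Count elementType attributes and format them in Hill order."""
    root = ET.fromstring(cml_text)
    counts = Counter(a.get("elementType") for a in root.iterfind(".//cml:atom", CML_NS))
    if "C" in counts:
        # Hill order: carbon, then hydrogen, then the rest alphabetically.
        order = ["C"] + (["H"] if "H" in counts else [])
        order += sorted(e for e in counts if e not in ("C", "H"))
    else:
        order = sorted(counts)
    return "".join(f"{e}{counts[e] if counts[e] > 1 else ''}" for e in order)


def bond_orders(cml_text: str) -> dict:
    """Map each bonded atom pair to its integer bond order."""
    root = ET.fromstring(cml_text)
    return {
        tuple(b.get("atomRefs2").split()): int(b.get("order"))
        for b in root.iterfind(".//cml:bond", CML_NS)
    }


print(molecular_formula(CML_TEXT))        # C2H4O2 (acetic acid)
print(bond_orders(CML_TEXT)[("a2", "a4")])  # 2, the C=O double bond
```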

Page 17: Acs denver dirks potenzone 30 aug2011

GenePattern Reproducible Research Add-in

Source code and binary: http://GenepatternWordAddin.codeplex.com

Services: Connects to GenePattern database

Data: Resulting data (and provenance) stored within Word document

Data: Control and execute query pipelines into GenePattern

Relationships: Inline graphics are synchronized to dataset

Page 18: Acs denver dirks potenzone 30 aug2011

Research Information Centre (RIC) Project
Virtual Research Environment (VRE) Toolkit for SharePoint

Version 1.1 (Open Source under Ms-PL): http://ric.codeplex.com/

Collaborative environment for research groups

Personal site for each researcher and project site for each project

Document management, federated search, social networking, real-time communication, blogs, wikis

Project Overview:
http://research.microsoft.com/ric/
http://research.microsoft.com/vre/

Page 19: Acs denver dirks potenzone 30 aug2011

oreChem – The Chemical Semantic Web

• Peter Murray-Rust
• Jim Downing
• Nico Adams
• Carl Lagoze
• Geoffrey Fox
• Jeremy Frey
• Simon Coles
• Lee Giles
• Karl Mueller
• Prasenjit Mitra

[Diagram labels: molecules, text, experiments, measurements, documents, data, scientists]

Mash-up (re-use) of data

Semantic storage

Compound document authoring

Demonstrating:
• Large collaboration project focusing on interoperability
• At-source capture of chemistry data
• Chemical structure search
• Compound object authoring
• Retrospective harvesting of chemistry data
• Reuse through common ORE data model
• Semantic authoring
• Virtualized triple storage

Page 20: Acs denver dirks potenzone 30 aug2011

“RSC Publishing and Southampton University drive the chemical semantic web…”

Enabling the Chemical Semantic Web

Page 21: Acs denver dirks potenzone 30 aug2011

Recent developments of interest

Elsevier's Article of the Future Competition
Grand Challenge & Article of the Future contest: an ongoing collaboration between Elsevier and the scientific community to redefine how a scientific article is presented online.

PLoS Currents: Influenza
In conjunction with NIH & Google Knol, a rapid research note service that provides an open-access online resource for immediate, open communication and discussion of new scientific data, analyses, and ideas in the field of influenza. All content is moderated by an expert group of influenza researchers but, in the interest of timeliness, does not undergo in-depth peer review.

Nature Precedings
Connects thousands of researchers and provides a platform for sharing new and preliminary findings with colleagues on a global scale, via pre-print manuscripts, posters and presentations. Claim priority and receive feedback on your findings prior to formal publication.

Mendeley (and Papers)
Called the “iTunes” for academic papers; 400,000+ users have signed up and a staggering 30+ million scientific papers have been uploaded.

Page 22: Acs denver dirks potenzone 30 aug2011

Several Commercial Data Sharing + Analysis Services

• Swivel
• IBM’s “Many Eyes”
• Gapminder & Google’s Trendalyzer
• Metaweb’s “Freebase”
• CSA’s “Illustrata”

Page 23: Acs denver dirks potenzone 30 aug2011

Harvard’s “Dataverse” Project

http://thedata.org

Via web application software, data citation standards, and statistical methods, the Dataverse Network project increases scholarly recognition and distributed control for authors, journals, archives, teachers, and others who produce or organize data; facilitates data access and analysis for researchers and students; and ensures long-term preservation whether or not the data are in the public domain. [From the Institute for Quantitative Social Science (IQSS) at Harvard University]

Page 24: Acs denver dirks potenzone 30 aug2011
Page 25: Acs denver dirks potenzone 30 aug2011

Taverna

• Taverna is an open source and domain-independent Workflow Management System
– A suite of tools used to design and execute scientific workflows and aid in silico experimentation.
• Taverna has been created by the myGrid team and funded through OMII-UK. The project has guaranteed funding until 2014.
• The Taverna Suite is written in Java and includes the Taverna Engine (used for enacting workflows) that powers both the Taverna Workbench (desktop client) and the Taverna Server.
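To make "design and execute scientific workflows" concrete, here is a minimal, generic sketch of what a workflow engine does: steps declare which results they depend on, and the engine runs them in dependency order. This is not Taverna's API or file format. The function `run_workflow`, the step names, and the sample masses are all assumptions made for illustration, and the ordering relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def run_workflow(steps, dependencies, inputs):
    """Run each step once all of its upstream results are available.

    steps:        step name -> function taking the dict of results so far
    dependencies: step name -> set of names (inputs or steps) it needs
    inputs:       name -> value supplied by the user
    """
    results = dict(inputs)
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        if name in results:
            continue  # a workflow input, nothing to compute
        results[name] = steps[name](results)
        executed.append(name)
    return results, executed


# Hypothetical three-step analysis over a few measured masses.
steps = {
    "mean": lambda r: sum(r["masses"]) / len(r["masses"]),
    "deviations": lambda r: [m - r["mean"] for m in r["masses"]],
    "report": lambda r: (
        f"mean={r['mean']:.1f}, "
        f"max deviation={max(abs(d) for d in r['deviations']):.1f}"
    ),
}
dependencies = {
    "mean": {"masses"},
    "deviations": {"masses", "mean"},
    "report": {"mean", "deviations"},
}

results, executed = run_workflow(steps, dependencies, {"masses": [2.0, 4.0, 6.0]})
print(executed)
print(results["report"])
```

Keeping the dependency graph separate from the step functions is what lets a workflow be shared, re-run, or partly replaced without rewriting the rest.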

Page 26: Acs denver dirks potenzone 30 aug2011

More on Taverna

• Integrated with other myGrid tools
– social networking and workflow sharing environment for scientists
– curated catalogue of Web services for Life Sciences

Page 27: Acs denver dirks potenzone 30 aug2011

Log what, where, when, who

For data and for publications

[Figure: Combechem process record, 30 January 2004 (gvh, hrm, gms), relating the To Do List / Plan, the Process Record captured in the lab, and the resulting Provenance]

Ingredient List
• Fluorinated biphenyl 0.9 g
• Br11OCB 1.59 g
• Potassium Carbonate 2.07 g
• Butanone 40 ml

To Do List / Plan
• Dissolve 4-fluorinated biphenyl in butanone
• Add K2CO3 powder
• Heat at reflux for 1.5 hours
• Cool and add Br11OCB
• Heat at reflux until completion
• Cool and add water (30 ml)
• Extract with DCM (3 x 40 ml)
• Combine organics, dry over MgSO4 & filter
• Remove solvent in vacuo
• Fuse compound to silica & column in ether/petrol

Process Record
• Process steps shown: Add, Reflux, Cool, Liquid-liquid extraction, Dry, Filter (Buchner), Remove Solvent by Rotary Evaporation, Fuse, Column Chromatography
• Sample of 4-fluorinated biphenyl: Weigh, 0.9031 g
• Butanone: Measure, 40 ml
• Sample of K2CO3 powder: Weigh, 2.0719 g
• Sample of Br11OCB: Weigh, 1.5918 g
• Water: Measure, 30 ml
• DCM: Measure, 3 of 40 ml
• MgSO4: Measure, excess
• Silica; Ether/Petrol Ratio

Annotations
• Butanone dried via silica column and measured into 100 ml RB flask.
• Used 1 ml extra solvent to wash out container.
• Started reflux at 13.30. (Had to change heater stirrer.) Only reflux for 45 min, next step 14:15.
• Inorganics dissolve, 2 layers. Added brine ~20 ml.
• Organics are yellow solution.
• Washed MgSO4 with DCM ~50 ml.

Observation Types
• weight - grammes
• measure - ml, drops
• annotate - text
• temperature - K, °C

Key: Process, Input, Literal, Observation

Future Questions
• Whether to have many subclasses of processes or fewer with annotations
• How to depict destructive processes
• How to depict taking lots of samples
• What is the observation/process boundary? e.g. MRI scan

Provenance
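The "what, where, when, who" of this slide maps naturally onto a small record type. The sketch below is one hypothetical way to capture such entries and serialize them for storage next to the data. The class names `ProvenanceRecord` and `LabLog` are our own, the "where" and "who" values are placeholders, and the 13:00 timestamp is invented for illustration; only the 0.9031 g weighing, the 30 January 2004 date and the 13.30 reflux start come from the slide.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    what: str                      # process or observation performed
    where: str                     # location or instrument
    when: datetime                 # time of the action
    who: str                       # person responsible
    value: Optional[float] = None  # measured quantity, if any
    unit: Optional[str] = None     # unit of the measured quantity


class LabLog:
    """Append-only list of provenance records."""

    def __init__(self) -> None:
        self._records: List[ProvenanceRecord] = []

    def log(self, record: ProvenanceRecord) -> None:
        self._records.append(record)

    def to_json(self) -> str:
        # datetime is not JSON-serializable, so store it as ISO 8601 text.
        rows = [{**asdict(r), "when": r.when.isoformat()} for r in self._records]
        return json.dumps(rows, indent=2)


log = LabLog()
log.log(ProvenanceRecord(
    what="Weigh sample of 4-fluorinated biphenyl",
    where="bench balance (placeholder)",
    when=datetime(2004, 1, 30, 13, 0),   # assumed time, not from the slide
    who="researcher A (placeholder)",
    value=0.9031,
    unit="g",
))
log.log(ProvenanceRecord(
    what="Start reflux",
    where="fume hood (placeholder)",
    when=datetime(2004, 1, 30, 13, 30),  # "Started reflux at 13.30"
    who="researcher A (placeholder)",
))
print(log.to_json())
```

Making the record immutable (`frozen=True`) reflects the idea that a provenance entry is a statement about the past and should not be edited after the fact.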

Page 28: Acs denver dirks potenzone 30 aug2011

myGrid Open Suite of Tools

Client User Interfaces

Workflow GUI Workbench and 3rd party plug-ins

Workflow Repository

Service Catalogue

Programming and APIs

Web Portals

Activity and Service Plug-in Manager

Provenance Store

Workflow Server

Open Provenance Model

Secure Service Access, and Programming APIs

Page 29: Acs denver dirks potenzone 30 aug2011

Recycling, Reuse, Repurposing

http://www.myexperiment.org/

• Share

• Search

• Re-use

• Re-purpose

• Execute

• Communicate

• Record

Page 30: Acs denver dirks potenzone 30 aug2011

Project Trident – Scientific Workflow Workbench
Built on Windows Workflow Foundation

Author, Execute and Monitor Workflows

Version 1.2 (Open Source under Apache 2.0 License): http://tridentworkflow.codeplex.com/

Compose and modify workflows via drag & drop canvas

View data products, performance metrics, and provenance data

Page 31: Acs denver dirks potenzone 30 aug2011

KNIME

• KNIME (Konstanz Information Miner)
• A user-friendly and comprehensive Open-Source platform for:
– Data integration
– Processing
– Analysis
– Exploration
• Growing vendor adoption
– PerkinElmer, Schrödinger, Tripos, CCG, ChemAxon, etc.

Page 32: Acs denver dirks potenzone 30 aug2011

Accelrys Pipeline Pilot: Chemistry

Page 33: Acs denver dirks potenzone 30 aug2011

Accelrys Pipeline Pilot: ADME

Page 34: Acs denver dirks potenzone 30 aug2011

Accelrys Pipeline Pilot: Biology

Page 35: Acs denver dirks potenzone 30 aug2011

Accelrys Pipeline Pilot: Genomics

Page 36: Acs denver dirks potenzone 30 aug2011

Envisioning a New Era of Research Reporting

• Dynamic Documents
• Reputation & Influence
• Reproducible Research
• Interactive Data
• Collaboration

Imagine…
• Live research reports
– multiple end-user ‘views’
– dynamically tailored presentations
• An authoring environment that absorbs and encapsulates
– research workflows
– outputs from the lab experiments
• A report that can be dropped into an electronic lab workbench and reconstitute an entire experiment
• Dynamically mash up data and workflows across experiments
• Apply new analyses and visualizations and perform new in silico experiments

Page 37: Acs denver dirks potenzone 30 aug2011

Agenda

• Part 1 – The Scientific Paper
• Part 2 – Emergence of the ePaper
• Part 3 – Of Workflows and Add-ins
• Part 4 – Impact of the ePaper
• Part 5 – A Glimpse to the Future

Page 38: Acs denver dirks potenzone 30 aug2011

Impact of These Innovations

• On Science
• On the Business of Science
• On the Scientific Community
• And Other Emotional Factors . . .

Page 39: Acs denver dirks potenzone 30 aug2011

Overall Impacts

Authors will be somewhat inconvenienced by having to learn new things . . . but as readers and consumers they will clearly benefit!

Across industry and academia it will be a positive advance

The vendors will be skeptical and reluctant to change, but will move with the spending community!

Page 40: Acs denver dirks potenzone 30 aug2011

On the Scientific Community

• This will provide a significantly more capable platform for science
– Extending collaboration
– Easing validation of research
– Offering transfer of knowledge and ease of extension of research projects
• But it DOES further erode the status quo system of rewards and tenure!

Page 41: Acs denver dirks potenzone 30 aug2011

And Other Emotional Factors
Is There An Elephant In This Room??

• The Publishers??

• CAS?? Other A&I companies??

• Well what about Electronic Lab Notebooks??

Page 42: Acs denver dirks potenzone 30 aug2011

On the Business of Science

• Publishers will need to continue to evolve to find a role as “cool provider” of these tools and become a “hot” distribution center

• A&I companies will need to redefine their role

• Software vendors have a real opportunity, if they can adapt . . .

Page 43: Acs denver dirks potenzone 30 aug2011

The Value of the A & I Layers
Abstracting and Indexing in the Future

The Old Days
• Abstracting was Key
• True Assessment of Content

Today
• Indexing is Key
• Precision and Recall
• “Beats” Google every time

Going Forward
• Indexing with Context “Built-In”
• Will Abstracting or more correctly ‘Content Monitoring’ become the value add?
• Or be a reliable data aggregator?

Page 44: Acs denver dirks potenzone 30 aug2011

Agenda

• Part 1 – The Scientific Paper
• Part 2 – Emergence of the ePaper
• Part 3 – Of Workflows and Add-ins
• Part 4 – Impact of the ePaper
• Part 5 – A Glimpse to the Future

Page 45: Acs denver dirks potenzone 30 aug2011

• Rich Content Sources
• Direct Search Tools
• Reproducible Science
• Complete Provenance

Challenge Or Opportunity

Page 46: Acs denver dirks potenzone 30 aug2011

The Opportunity Before Us

• Faster Development in an Increasingly Complex World
– Improving reproducibility of scientific results
– Data sharing and collaboration services
– Reliable maintenance of provenance
– Faster availability and efficient query tools
– Secure and/or controlled access to data
– Finding related data and research partners
– Assurance that data will be preserved
• A Brave New World for Scientific Discovery and Research
– Cross-domain partnerships
– Enhanced broad availability of data and prior research
• Improved Knowledge Transfer
– Both upstream and downstream
– Realizing the promise of translational medicine

Page 47: Acs denver dirks potenzone 30 aug2011

Thank You!

Rudy Potenzone
SciencePoint Solutions

[email protected]

Lee Dirks
Education & Scholarly Communication
Microsoft Research | Connections

[email protected] or [email protected]
http://www.microsoft.com/scholarlycomm/
Facebook: Scholarly Communication at Microsoft