Transcript
Page 1: The culture of researchData

The Culture of Research Data

Peter Murray-Rust, ContentMine.org and University of Cambridge

LEARN, London, UK 2016-01-29

The technology for Managing Research Data is already here… but we need a change of culture

Open Notebook Science

Publishers must be forced to serve us, not tyrannize us

Page 2: The culture of researchData

Just read the big letters

He’s got zillions of slides…

Page 3: The culture of researchData

My European Heroes

Young People (ContentMine)

NEELIE KROES

Page 4: The culture of researchData

The Right to Read is the Right to Mine

http://contentmine.org

Page 5: The culture of researchData

Themes

• Highly domain-dependent (chem, cryst, phylo)
• Requires community and centrality
• University repositories are NOT the solution
• Openness makes it dramatically easier/better
• The publisher-academic complex is a major problem.
• Infrastructure must be open and under our control

Page 6: The culture of researchData

WE pay for scholarly publications that WE can’t read

The Publisher-Academic complex [1]

[Diagram linking Taxpayer, Student, Researcher, Academia, and Publishers; labels on the links: "$$", "in-kind", "$$, MS review", "Glory+?"]

[1] The Military-Industrial-Academic complex (1961) (Dwight D Eisenhower, US President)

Page 7: The culture of researchData

Elsevier wants to control Open Data

[asked by Michelle Brook]

Page 8: The culture of researchData

Some topics

• Github / software mgt informs data mgt
• Open notebook science
• Open source malaria + LabTrove
• Open phylogenetics
• Computational chemistry
• Crystallography
• Early career researchers can change the world, if we let them.
• Are “publishers” tyrants or servants?

Page 9: The culture of researchData

Every Research Data Manager should be using me

Page 10: The culture of researchData

Why I reposit software in GitHub

I WANT TO!!!

BETTER
QUICKER
SECURE
AUDIT, BACKTRACKABLE
EASY

get collaborators

Most early career software creators have repos

How many people have USED Git?

Page 11: The culture of researchData

Free/Open Software Development

[Diagram labels: CODE REPOSITORY, World community, CODE, rewrite, validate, fork, re-use, inspires; named in the diagram: Github, BitBucket, StackOverflow, Apache, OSI]

Example: ContentMine at http://github.com/ContentMine/quickscrape

BORN-OPEN-SOURCE

NO WALLS

Page 12: The culture of researchData

GIT housekeeps AUTOMATICALLY, eternally

Daily record of commits and merges. Can backtrack to ANY previous version
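The backtracking idea can be illustrated with a toy model. The sketch below is not Git and does not use Git's object format; it is a minimal content-addressed history, with a hypothetical `SnapshotStore` class and made-up measurement values, showing why every earlier version stays retrievable once each commit is stored under its own hash.

```python
import hashlib


class SnapshotStore:
    """Toy history: every commit is kept and can be fetched again by its id."""

    def __init__(self):
        self._objects = {}   # commit id -> (parent id, message, content)
        self.head = None     # id of the most recent commit

    def commit(self, content: str, message: str) -> str:
        # The id depends on the parent, the message and the content,
        # so a later commit can never silently overwrite an earlier one.
        payload = f"{self.head}\n{message}\n{content}".encode("utf-8")
        commit_id = hashlib.sha1(payload).hexdigest()
        self._objects[commit_id] = (self.head, message, content)
        self.head = commit_id
        return commit_id

    def checkout(self, commit_id: str) -> str:
        """Return the content exactly as it was at the given commit."""
        return self._objects[commit_id][2]

    def log(self):
        """Walk from the newest commit back to the first one."""
        history, current = [], self.head
        while current is not None:
            parent, message, _ = self._objects[current]
            history.append((current[:7], message))
            current = parent
        return history


# Illustrative data only: the numbers below are invented for the example.
store = SnapshotStore()
first = store.commit("temperature,yield\n20,0.41\n", "initial measurements")
second = store.commit("temperature,yield\n20,0.41\n30,0.52\n", "add 30 C run")

# Backtrack to the first version even though a newer one exists.
original = store.checkout(first)
```

Real Git adds branching, merging and efficient storage on top of this idea, but the audit trail works on the same principle: old states are addressed, not overwritten.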

Page 13: The culture of researchData

Community involvement

https://github.com/ContentMine/quickscrape/pulls

Contributions from people “outside project”

Page 14: The culture of researchData

Continuous Integration (Jenkins)

Build status legend: Compile Fail, Inactive, Fail Tests, Pass Tests

Every time I commit a change, 50 projects are recompiled and tested.

Impossible to do this manually!
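As a rough illustration of what a continuous integration server does after each commit, the sketch below rebuilds and retests a list of projects and sorts each one into the four statuses shown on the slide. It does not use the Jenkins API; the `Project` class, the project names, and the stand-in build and test functions are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class Project:
    name: str
    build: Callable[[], bool]      # returns True if the project compiles
    run_tests: Callable[[], bool]  # returns True if every test passes
    active: bool = True


def status(project: Project) -> str:
    """Classify one project using the four statuses from the slide."""
    if not project.active:
        return "Inactive"
    if not project.build():
        return "Compile Fail"
    if not project.run_tests():
        return "Fail Tests"
    return "Pass Tests"


def on_commit(projects):
    """Rebuild and retest every project, as would happen after each commit."""
    return {p.name: status(p) for p in projects}


# Stand-in projects: the lambdas simulate build and test outcomes.
projects = [
    Project("parser", build=lambda: True, run_tests=lambda: True),
    Project("viewer", build=lambda: True, run_tests=lambda: False),
    Project("legacy", build=lambda: False, run_tests=lambda: True),
    Project("archive", build=lambda: True, run_tests=lambda: True, active=False),
]

report = on_commit(projects)
summary = Counter(report.values())
```

With 50 real projects the loop is the same; only the build and test functions change, which is why the work is delegated to a machine.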

Page 15: The culture of researchData

Software management is a success!

Research DATA management is a mess.

Page 16: The culture of researchData

Traditional Research and Publication

[Diagram labels: “Lab” work, paper/thesis, Write, rewrite, Re-experiment, publish, DATA, “???”, “Validation??”]

output “belongs” to publisher

Every process is LOSSY

Page 17: The culture of researchData

How NOT to publish data

HT Henry Rzepa

From Henry Rzepa:

this article http://doi.org/10.1126/science.aad6252 which provides a 22 Mbyte PDF of data (mostly bitmaps of NMR spectra) and comes in at 404 pages long. [1]

But this one http://doi.org/10.1021/jacs.5b05902 [comp chem] is 505 pages long (the current record holder?)

[1] DATA Behind paywall

Page 18: The culture of researchData

The 505-page PDF was a machine-readable log file that could and should have been in a repo

Computational Chemistry

Page 19: The culture of researchData

MORE of the PDF DATA Destruction

Blind humans and Machines cannot read this

Page 20: The culture of researchData

ALWAYS put your (computational, instrumental, observational) data directly into a repository

Page 21: The culture of researchData

Let’s see some visionaries

Page 22: The culture of researchData

JD Bernal’s 1965 vision

However large an array of facts, however rapidly they accumulate, it is possible to keep them in order and to extract from time to time digests containing the most generally significant information, while indicating how to find those items of specialized interest. To do so, however, requires the will and the means. (Bernal, 1965)

Quoted by PMR in http://journals.iucr.org/d/issues/1998/06/01/ba0011/ba0011.pdf

Page 23: The culture of researchData

PMR’s Tribute

Planned Memorial Meeting July 14th 2014 Cambridge

OPEN NOTEBOOK SCIENCE

Page 24: The culture of researchData

https://en.wikipedia.org/wiki/Bermuda_Principles

• Automatic release of sequence assemblies larger than 1 kb (preferably within 24 hours).

• Immediate publication of finished annotated sequences.

• Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society.

HUMAN GENOME project used Open Notebooks

Page 25: The culture of researchData

Open is FASTER, BETTER, MORE EFFICIENT

Page 27: The culture of researchData

Open Notebook Science, ONS

Jean-Claude Bradley 2006

All data immediately available to all. NO INSIDER INFORMATION.

Page 28: The culture of researchData

Open Notebook Science

[Diagram labels: Open engineered repository, World community, TOOLS, INSTRUMENT, MODEL, CODE, DATA, knowledge, validate, merge, calibrate]

Problems are solved communally; nothing is needlessly duplicated; “publication” is continuous; data are SEMANTIC

Machines and humans working together

Page 29: The culture of researchData

Here are three examples

Page 30: The culture of researchData

Mat Todd (Sydney) and MANY collaborators

http://opensourcemalaria.org/ (Chrome for interactivity)

Mat Todd, Univ Sydney, runs an Open Notebook community to create new antimalarials.

Page 31: The culture of researchData

Notebook managed on Git.

Page 32: The culture of researchData

Interactive OPEN chemical search tool from cheminfo.org

Page 33: The culture of researchData

Interactive OPEN molecular display Jmol (Bob Hanson et al)

Page 35: The culture of researchData

Interactive OPEN chemical search tool from cheminfo.org

Page 36: The culture of researchData

data is associated with the proposed scientific endeavour prior to or at the point of creation rather than by annotating the data with commentary after the experiment has taken place

University of Southampton

Page 37: The culture of researchData

Data thrives on Community

Page 38: The culture of researchData

Henry Rzepa does Open Notebook Computational Chemistry…

http://www.rzepa.net/blog/?p=14272

This is a current open notebook discussion, http://www.ch.imperial.ac.uk/rzepa/blog/?p=15552 (see comments, currently 67).

… on his blog

Page 41: The culture of researchData

COMMUNITY INVOLVEMENT

Page 42: The culture of researchData

Crystallography – a model for Data Management

• Pro-active, friendly international community
• Committed active International Union (IUCr)
• Data publication valued (1960-present)
• Community develops semantics/dictionaries
• Committed volunteer software innovators
• Heavily Open approach
• Massive and valuable re-use of data
• Culture of validation/reproducibility
• Respect and credit for tool development

Page 43: The culture of researchData

IUCr DICTIONARIES

Page 44: The culture of researchData

IUCr VALIDATION CRITERIA/TOOLS

Page 45: The culture of researchData

DATA

Page 46: The culture of researchData

PUBLICLY VALIDATED TRUSTABLE SCIENCE

Page 47: The culture of researchData

Where to reposit published crystallography?

Proteins -> PDB, Open
BUT
Inorganics -> ICSD, Closed
Organics -> Cambridge (CCDC), Closed
SO
The community has built a Crystallography Open Database

Page 48: The culture of researchData

Restrictions on Re-use of Crystallographic data

NOTE: The CCDC is based on data contributed by scientists as part of publication and validation

Crystallographic data from publications now belongs to CCDC

Page 49: The culture of researchData

Open Source and Open Data

www.crystallography.net

Page 50: The culture of researchData

Interactive OPEN crystal search tool

Page 52: The culture of researchData

Panton Fellows (Early Career Researchers)

Panton Principles of Open Scientific Data 2010

Publish data openly (CC0) and record your wishes

Page 53: The culture of researchData

Sophie Kershaw, Panton Fellow : Doctoral Training in Oxford

Page 54: The culture of researchData

Sophie Kershaw, Panton Fellow

Page 55: The culture of researchData

Rotation-Based Learning (RBL)

Phase 1: Initiator
• No communication permitted between groups
• Attempt to reproduce existing literature
• Deliver a coherent research story by the end of Phase 1

Phase 2: Successor
• Communication between groups still prohibited
• Validate and develop the inherited research story
• Critique your predecessors

• Role of research producer vs. research user
• Can this approach help to foster awareness of reproducibility issues?

Throughout Phases 1 & 2:
• Daily lectures on open science culture & techniques
• First-hand application to own research work
• Version control using GitHub
• Daily group supervision

Page 56: The culture of researchData

So first-year grad students should be trained by…

… third-year graduate students

Page 57: The culture of researchData

https://en.wikipedia.org/wiki/Tree_of_life CC BY-SA

Page 58: The culture of researchData

Authors don’t deposit data (Ross Mounce)

Page 59: The culture of researchData

http://www.slideshare.net/rossmounce/the-pluto-project-ievobio-2014

Page 60: The culture of researchData

And we did it as Open Notebook Science

all data and code on Github

Discussion on public Discourse Tool

Page 61: The culture of researchData

https://en.wikipedia.org/wiki/Tree_of_life CC BY-SA

Page 62: The culture of researchData

4300 images in Github

Page 63: The culture of researchData

“Root”

We analysed every pixel

Page 64: The culture of researchData

Many diagrams had author errors

Page 65: The culture of researchData

Supertree created from 4300 papers

Page 66: The culture of researchData

Aves

Apterygidae

Marsupialia

Monotremata

Mammalia

Reptilia

Amphibia

Arthropoda

Myriapodia

Okapia johnstoni

Pyrus

Stuffed Tree of Life

Page 67: The culture of researchData

Supertree for 924 species

Tree

Page 68: The culture of researchData

Can we mine for animals?

Yes

With the PhylogenyCmunity [*]

[*] overlaps with “Tree of Life”, “Evolutionary Biology”, “Taxonomy”, “Species”

Page 69: The culture of researchData

So now we can legally mine the whole literature in the UK

Yes! And we are starting to do it…

NORMA

Page 70: The culture of researchData

So why not Git for Data?

Page 71: The culture of researchData

DAT is Git for Data!!

Page 72: The culture of researchData

DAT! Queen Mary UL reposits DNA

Page 73: The culture of researchData

The John S. and James L. Knight Foundation is an American private, non-profit foundation dedicated to supporting "transformational ideas that promote quality journalism, advance media innovation, engage communities and foster the arts."

DAT supports public data

Page 74: The culture of researchData

@Senficon (Julia Reda): Text & Data mining in times of #copyright maximalism:

"Elsevier stopped me doing my research" http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/

#opencon #TDM

Elsevier stopped me doing my research
Chris Hartgerink

Page 75: The culture of researchData

I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hampers research progress.

To this end, I am content mining results reported in the psychology literature. Content mining the literature is a valuable avenue of investigating research questions with innovative methods. For example, our research group has written an automated program to mine research papers for errors in the reported results and found that 1/8 papers (of 30,000) contains at least one result that could directly influence the substantive conclusion [1].

In new research, I am trying to extract test results, figures, tables, and other information reported in papers throughout the majority of the psychology literature. As such, I need the research papers published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention to redistribute the downloaded materials, had legal access to them because my university pays a subscription, and I only wanted to extract facts from these papers.

Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days. This boils down to a server load of 0.0021GB/[min], 0.125GB/h, 3GB/day.

Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that this could be considered stealing of content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading (which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.

I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly hampering me in my research.

[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22. doi: 10.3758/s13428-015-0664-2

Chris Hartgerink’s blog post
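The server-load figures quoted in the blog post can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the post (about 30 GB over about 10 days, at most 9 papers per minute); the variable names are illustrative, and the spacing between requests is a derived figure, not something the post reports.

```python
# Figures taken from the blog post.
TOTAL_GB = 30.0          # approximate total downloaded
DAYS = 10.0              # approximate duration of the download
PAPERS_PER_MINUTE = 9    # self-imposed rate limit

# Average load at successively finer time scales.
gb_per_day = TOTAL_GB / DAYS
gb_per_hour = gb_per_day / 24
gb_per_minute = gb_per_hour / 60

# Derived, not stated in the post: the minimum gap between requests
# that keeps the rate at or below 9 papers per minute.
seconds_between_requests = 60 / PAPERS_PER_MINUTE

print(f"{gb_per_day:.1f} GB/day, {gb_per_hour:.3f} GB/h, {gb_per_minute:.4f} GB/min")
print(f"wait at least {seconds_between_requests:.2f} s between downloads")
```

The results (3.0 GB/day, 0.125 GB/h, 0.0021 GB/min) match the figures in the post, and the rate limit works out to one request roughly every 6.67 seconds.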

Page 76: The culture of researchData

Some Children of the Digital Enlightenment

• David Carroll & Joe McArthur: OAButton
• Rayna Stamboliyska & Pierre-Carl Langlais
• Jon Tennant
• Ross Mounce
• Jenny Molloy
• Erin McKiernan
• Jack Andraka
• Michelle Brook
• Heather Piwowar
• The ContentMine Team
• Rufus Pollock
• Jonathan Gray
• Sophie Kay

Jean-Claude Bradley [1], a chemist, developed Open Notebook Science: making the entire primary record of a research project publicly available online as it is recorded. (WP)

J-C promoted these ideas with UNDERGRADUATE scientists.

[1] Unfortunately J-C died in 2014; we held a memorial meeting in Cambridge

Sophie Kay

Page 79: The culture of researchData

OPEN                            CLOSED
Zenodo                          Figshare
Git
Dat
OpenOffice                      Word, PPT
LabTrove, cheminfo.org          Chemdraw
Crystallography Open Database   Cambridge Crystallographic Data Centre

WriteLatex / Overleaf

ReadCube, Symplectic,

Page 80: The culture of researchData

Henry. Where and what is your latest repository and can I demonstrate it? This will be better than pointing to some dead Quixote site. And any blog posts would be useful. Happy to talk today if you are free.

Chempound is still running, e.g. http://chempound.ch.ic.ac.uk:8090/content/f0705698-39fa-4279-b736-f2fdca571e7b/

Timed out... Is it running?

There have been firewall problems in the past. I thought they were fixed. Will check.

Do you have blog posts which show either (a) how the repository is set up and (b) an Open Notebook approach to a project - where you discuss a problem before it is formally published? Both of these would be very useful.

This is a current open notebook discussion, http://www.ch.imperial.ac.uk/rzepa/blog/?p=15552 (see comments, currently 67).
This is an earlier one, http://www.rzepa.net/blog/?p=14272 (with 86 comments) and also incorporates Jsmol to visualise all the data.
This one starts discussion as an open notebook http://www.ch.imperial.ac.uk/rzepa/blog/?p=12115 with the resulting formal publication at 10.1002/jcc.23985
This was the original open notebook post http://www.ch.imperial.ac.uk/rzepa/blog/?p=984 with the resulting formal publication at 10.1038/NCHEM.596
This one incorporates open data into its citation list http://www.ch.imperial.ac.uk/rzepa/blog/?p=15505 and is also an open notebook follow up to my PhD thesis work, formally published in 1975 or so, thus operating in reverse to the above.
This shows some end outcomes: http://www.ch.imperial.ac.uk/rzepa/blog/?p=15313
This shows the principles: http://www.ch.imperial.ac.uk/rzepa/blog/?p=10972
This is an introductory tutorial http://www.ch.imperial.ac.uk/rzepa/blog/?p=14454
This is a critique http://www.ch.imperial.ac.uk/rzepa/blog/?p=13826
This is “convincing case” http://www.ch.imperial.ac.uk/rzepa/blog/?p=13248
This is about metadata http://www.ch.imperial.ac.uk/rzepa/blog/?p=12932
And its use http://www.ch.imperial.ac.uk/rzepa/blog/?p=12526
You have seen this data nightmare before: http://www.ch.imperial.ac.uk/rzepa/blog/?p=12728
This is about ORCID http://www.ch.imperial.ac.uk/rzepa/blog/?p=12513

Page 82: The culture of researchData

Open Source software inspires Open Science

Jean-Claude Bradley 2006

Page 88: The culture of researchData

Ross Mounce (Bath), Panton Fellow

• Sharing research data: http://www.slideshare.net/rossmounce
• How-to figures from PLOS/One [link]:

Ross shows how to bring figures to life:
• PLOSOne at http://bit.ly/PLOStrees
• PLOS at http://bit.ly/phylofigs (demo)
