VO Sandpit, November 2009 Tracking the impact of data – how? Sarah Callaghan [email protected] @sorcha_ni 1st Altmetrics conference, London, 25-26 September 2014

VO Sandpit, November 2009

Tracking the impact of data – how?

Sarah Callaghan [email protected]

@sorcha_ni

1st Altmetrics conference, London, 25-26 September 2014

Who are we and why do we care about data?

The UK’s Natural Environment Research Council (NERC) funds six data centres which between them have responsibility for the long-term management of NERC’s environmental data holdings.

We deal with a variety of environmental measurements, along with the results of model simulations in:

•Atmospheric science

•Earth sciences

•Earth observation

•Marine Science

•Polar Science

•Terrestrial & freshwater science, Hydrology and Bioinformatics

•Space Weather

OpenAIRE Portal

www.openaire.eu

Develop an Open Access, participatory infrastructure for scientific information that includes:

•Publications

•Datasets

•Projects

•Interlinking

Data, Reproducibility and Science

Science should be reproducible – other people doing the same experiments in the same way should get the same results.

Observational data is not reproducible (unless you have a time machine!)

Therefore we need to have access to the data to confirm the science is valid!

http://www.flickr.com/photos/31333486@N00/1893012324/sizes/o/in/photostream/

It used to be “easy”…

Suber cells and mimosa leaves. Robert Hooke, Micrographia, 1665

The Scientific Papers of William Parsons, Third Earl of Rosse 1800-1867

…but datasets have gotten so big, it’s not useful to publish them in hard copy anymore

Hard copy of the Human Genome at the Wellcome Collection

Creating a dataset is hard work!

"Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com

Managing and archiving data so that it’s understandable by other researchers is difficult and time-consuming too.

We want to reward researchers for putting that effort in!

Most people have an idea of what a publication is

Some examples of data (just from the Earth Sciences)

1. Time series, some still being updated e.g. meteorological measurements

2. Large 4D synthesised datasets, e.g. Climate, Oceanographic, Hydrological and Numerical Weather Prediction model data generated on a supercomputer

3. 2D scans e.g. satellite data, weather radar data

4. 2D snapshots, e.g. cloud camera

5. Traces through a changing medium, e.g. radiosonde launches, aircraft flights, ocean salinity and temperature

6. Datasets consisting of data from multiple instruments as part of the same measurement campaign

7. Physical samples, e.g. fossils

What is a Dataset?

DataCite’s definition (http://www.datacite.org/sites/default/files/Business_Models_Principles_v1.0.pdf):

Dataset: "Recorded information, regardless of the form or medium on which it may be recorded including writings, films, sound recordings, pictorial reproductions, drawings, designs, or other graphic representations, procedural manuals, forms, diagrams, work flow, charts, equipment descriptions, data files, data processing or computer programs (software), statistical records, and other research data."

(from the U.S. National Institutes of Health (NIH) Grants Policy Statement via DataCite's Best Practice Guide for Data Citation).

In my opinion a dataset is something that is:

•The result of a defined process

•Scientifically meaningful

•Well-defined (i.e. clear definition of what is in the dataset and what isn’t)

What metrics do we use for our data?

Metric: Number of discovery dataset records in the DCS
Breakdown: Quarterly
CEDA numbers: NEODC 26; BADC 242; UKSSDC 11
Notes: Compliance with NERC data management policy. Reflects how many data sets NERC has. The number of dataset discovery records visible from the NERC data discovery service.

Metric: Web site visits
Breakdown: Quarterly
CEDA numbers: BADC 61,600; NEODC 10,200
Notes: Active use and visibility of the data centre. Site visits from standard web log analysis systems, such as Webalizer. Sensible web crawler filters should have been applied.

Metric: Web site page views
Breakdown: Quarterly
CEDA numbers: BADC 219,900; NEODC 25,800
Notes: See web visits notes.

Metric: Queries closed this period
Breakdown: Quarterly
CEDA numbers: 362 helpdesk queries; 838 dataset applications
Notes: Active use and visibility of the data centre. Queries marked as resolved within the quarter. A query is a request for information, a problem or ad hoc data request.

Metric: Queries received in period
Breakdown: Quarterly
CEDA numbers: 388 helpdesk queries; 860 dataset applications
Notes: Active use and visibility of the data centre. See closed query notes.

Data centre metrics – produced 15th July 2014

Metric: Percent queries dealt with in 3 working days
Breakdown: Quarterly
CEDA numbers: 84.06 (11.57% resolved after 3 days); 87.67 (10.23% resolved after 3 days). Queries receiving initial response within 1 working day: Helpdesk 93.57%; Dataset applications 97.91%
Notes: Responsiveness. See closed query notes.

Metric: Identifiable users actively downloading
Breakdown: None
CEDA numbers: Over year to date: BADC 4065; NEODC 362
Notes: Use and visibility of the data centre. An estimate of the number of users using data access services over the year.

Metric: Number of metadata records in data centre web site
Breakdown: None
CEDA numbers: BADC 240; NEODC 33
Notes: INSPIRE compliance. Reflects how many data sets NERC has.

Metric: Number of datasets available to view via the data centre web site
Breakdown: None
CEDA numbers: (Metric in development)
Notes: INSPIRE compliance. Usable services.

Metric: Number of datasets available to download via the data centre web site
Breakdown: None
CEDA numbers: (Metric in development)
Notes: INSPIRE compliance. Usable services.

Data centre metrics – produced 15th July 2014

Metric: NERC funded Data centre staff (FTE)
Breakdown: None
CEDA numbers: 14 (estimate for FY 14/15)
Notes: Data management costs. Efficiency. Number of full time equivalent posts employed to perform data centre functions.

Metric: Direct costs of Data Stewardship in data centre
Breakdown: None
CEDA numbers: (reportable at end of financial year)
Notes: Data management costs. Efficiency. Cost to NERC.

Metric: Capital Expenditure directly related to Data Stewardship at data centre
Breakdown: None
CEDA numbers: (reportable at end of financial year)
Notes: Data management costs. Efficiency.

Metric: Direct Receipts from Data Licenses and Sales
Breakdown: None
CEDA numbers: £0 (CEDA does not charge for data)
Notes: Commercial value of data products and services.

Metric: Number of projects with Outline Data Management Plans
Breakdown: None
CEDA numbers: (Metric in development)
Notes: Means of tracking projects’ adoption of good DM practice. Outline DMP is at proposal stage.

Metric: Number of projects with Full Data Management Plans
Breakdown: None
CEDA numbers: (Metric in development)
Notes: Means of tracking projects’ adoption of good DM practice. Full DMP is at funded stage.

Metric: Users by area
Breakdown and CEDA numbers:
•UK: 2534 (61%)
•Europe: 494 (12%)
•Rest of the world: 1024 (25%)
•Unknown: 79 (2%)
Notes: Active use. Visibility of the data centre internationally. Percentage of user base in terms of geographical spread.

Metric: Users by institute type
Breakdown and CEDA numbers:
•University: 2934 (71%)
•Government: 694 (17%)
•NERC: 160 (4%)
•Other: 277 (7%)
•Commercial: 42 (1%)
•School: 35 (1%)
Notes: Active use. Visibility of the data centre sectorially. Percentage of user base in terms of the users’ host institute type.
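
As a rough check on the "Users by area" percentages above, the shares can be recomputed from the raw counts. This is only a sketch: the function name `percentage_shares` is made up for illustration, and rounding to whole percent is an assumption about how the slide figures were produced.

```python
# Counts copied from the "Users by area" rows of the metrics table.
users_by_area = {
    "UK": 2534,
    "Europe": 494,
    "Rest of the world": 1024,
    "Unknown": 79,
}


def percentage_shares(counts):
    """Return each category's share of the total, rounded to a whole percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


shares = percentage_shares(users_by_area)
for name, share in shares.items():
    print(f"{name}: {users_by_area[name]} users, {share}%")
```

The same function can be applied to the "Users by institute type" counts; because each share is rounded separately, the rounded percentages need not add up to exactly 100.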

After the data is downloaded, what happens then?

Short answer:

We don’t know!!

Unless the data user comes back to us to tell us.

Or we stumble across a paper which

•Cites us

•Or mentions us in a way that we can find

•And tells us which dataset the authors used.

This is why we’re working with other groups (like CODATA, Force11, RDA, DataCite, Thomson Reuters,…) to promote data citation.

The Noble Eight-Fold Path to Citing Data

1. Importance

2. Credit and attribution

3. Evidence

4. Unique Identification

5. Access

6. Persistence

7. Specificity and verifiability

8. Interoperability and flexibility

Principles are supplemented with a glossary, references and examples: http://force11.org/datacitation

How we (NERC) cite data

We use digital object identifiers (DOIs) as part of our dataset citation because:

• They are actionable, interoperable, persistent links for (digital) objects

• Scientists are already used to citing papers using DOIs (and they trust them)

• Academic journal publishers are starting to require datasets be cited in a stable way, i.e. using DOIs.

• We have a good working relationship with the British Library and DataCite

NERC’s guidance on citing data and assigning DOIs can be found at: http://www.nerc.ac.uk/research/sites/data/doi.asp
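
To make the idea of a DOI-based dataset citation concrete, here is a minimal sketch that assembles a citation string ending in a resolvable DOI link. It loosely follows the "Creator (Year): Title. Publisher. Identifier" pattern recommended by DataCite, which may differ from the exact format in NERC's guidance. The `DatasetRecord` class, its field names, and the example record (authors, title, DOI) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Minimal citation metadata; the field names are illustrative only."""
    creators: str
    year: int
    title: str
    publisher: str
    doi: str


def format_citation(record):
    """Build a citation string that ends in a resolvable DOI link."""
    return (f"{record.creators} ({record.year}): {record.title}. "
            f"{record.publisher}. https://doi.org/{record.doi}")


# Hypothetical record: the authors, title and DOI below are made up.
example = DatasetRecord(
    creators="Smith, J.; Jones, A.",
    year=2014,
    title="Example rain gauge measurements",
    publisher="Example Data Centre",
    doi="10.1234/example-dataset",
)
print(format_citation(example))
```

Prefixing the DOI with the resolver address is what makes the citation an actionable link: a reader who follows it should arrive at the dataset's landing page.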

Dataset catalogue page (and DOI landing page)

Dataset citation

Clickable link to Dataset in the archive

Another example of a cited dataset

http://www.charme.org.uk/

Data metrics – the state of the art!

Data citation isn’t common practice (unfortunately)

Data citation counts don’t exist yet

To count how often BADC data is used we have to:

1. Search Google Scholar for “BADC”, “British Atmospheric Data Centre”

2. Scan the results and weed out false positives

3. Read the papers to figure out what datasets the authors are talking about (if we can)

4. Count the mentions and citations (if any)
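
The manual procedure above could be partly automated along these lines. This is only a sketch under stated assumptions: the function `count_mentions` is hypothetical, the three text snippets are made up rather than taken from real papers, and whole-word matching is a crude stand-in for the false-positive check in step 2. Working out which dataset was used (step 3) would still need a human reader.

```python
import re

# Search terms from step 1.
SEARCH_TERMS = ["BADC", "British Atmospheric Data Centre"]

# Whole-word, case-sensitive matching: avoids hits inside longer words,
# but cannot replace a human check for false positives.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in SEARCH_TERMS) + r")\b"
)


def count_mentions(papers):
    """Map each paper title to the number of data centre mentions in its text."""
    counts = {}
    for title, text in papers.items():
        hits = PATTERN.findall(text)
        if hits:
            counts[title] = len(hits)
    return counts


# Made-up snippets standing in for search results; not real papers.
papers = {
    "Paper A": "Data were obtained from the British Atmospheric Data Centre (BADC).",
    "Paper B": "We thank the BADC for providing radiosonde data.",
    "Paper C": "The ABADCO index was not used in this study.",
}
mentions = count_mentions(papers)
print(mentions)
```

Here "Paper C" is dropped because "BADC" only appears inside a longer word, which is the kind of false positive step 2 is meant to weed out.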

http://www.lol-cat.org/little-lovely-lolcat-and-big-work/

We’re working with DataCite and Thomson Reuters to get data citation counts.

Altmetrics and social media for data?

Mainly focussing on citation as a first step, as it’s most commonly accepted by researchers.

We have a social media presence @CEDAnews

- Mainly used for announcements about service availability

We definitely want ways of showing our funders that we provide a good service to our users and the research community.

And we want to be able to tell our depositors what impact their data has had!

RDA Bibliometrics for Data WG – preliminary survey results

•Launched 3rd September

•As of 17th September – 63 responses

•100% completion

•Survey link still live: https://www.surveymonkey.com/s/RDA_bibliometrics_data

Respondents by field:

•Science: 3

•Earth sciences: 16

•Physics: 4

•Scientometrics and bibliometrics: 4

•Engineering: 2

•Chemistry: 1

•Biology (inc. zoology): 2

•STEM: 1

•Medicine & biomedical research: 8

•Energy: 1

•Admin for research: 2

•Computer science: 4

•Social science, policy and economics: 4

•Librarian and digital curation: 11
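
As a quick consistency check, the per-field counts on this slide can be summed and compared with the 63 responses reported. The sketch below does that and lists the three largest groups. The variable names are illustrative, and reading the numbers as respondent counts per field is an assumption (it is consistent with the total).

```python
# Response counts per field, copied from the survey slide.
responses = {
    "Science": 3,
    "Earth sciences": 16,
    "Physics": 4,
    "Scientometrics and bibliometrics": 4,
    "Engineering": 2,
    "Chemistry": 1,
    "Biology (inc. zoology)": 2,
    "STEM": 1,
    "Medicine & biomedical research": 8,
    "Energy": 1,
    "Admin for research": 2,
    "Computer science": 4,
    "Social science, policy and economics": 4,
    "Librarian and digital curation": 11,
}

total = sum(responses.values())

# Three largest groups, biggest first.
top_three = sorted(responses.items(), key=lambda item: item[1], reverse=True)[:3]

print(f"Total responses: {total}")
for field, count in top_three:
    print(f"{field}: {count} ({100 * count / total:.0f}%)")
```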

Current use

Future and missing

In the future, what would you like to use to evaluate the impact of data?

Most popular suggestions:

•Data citations

•Actual use in professional practice

•Download statistics

•Mentions in social media

•DOIs/PIDs

•Altmetrics

•Well regarded indicators

Also pleas for:

•Easy to use and set up

•Radically different tools

•Whatever tool can provide reliable information

•Best estimate of societal benefit in $$ terms

What is currently missing and/or needs to be created for bibliometrics for data to become widely used?

Most popular suggestions:

•Culture change!

•Principles and standards for consistent practice (and enforcement of these)

•Use of PIDs

•Mature tools for data citation, publishing, discovery and impact analysis

•Openness in papers and patents

Also:

•Research on what current metrics actually measure

•Infrastructure

•Free apps

Please help!

Survey link still live! https://www.surveymonkey.com/s/RDA_bibliometrics_data

Please pass on the link to anyone who might be interested and encourage others to fill in the survey!

Share your experience with altmetrics – join the RDA WG on Publishing Data Bibliometrics: https://rd-alliance.org/group/rdawds-publishing-data-bibliometrics-wg.html

Thank you!

Sarah Callaghan

[email protected]

@sorcha_ni

http://weknowmemes.com/generator/meme/379914/

Work funded by the European Commission as part of the project OpenAIREplus (FP7-INFRA-2011-2, Grant Agreement no. 283595)