TRANSCRIPT
VO Sandpit, November 2009
Tracking the impact of data – how?
Sarah Callaghan, [email protected]
@sorcha_ni
1st Altmetrics conference, London, 25-26 September 2014
Who are we and why do we care about data?

The UK’s Natural Environment Research Council (NERC) funds six data centres which between them have responsibility for the long-term management of NERC's environmental data holdings.

We deal with a variety of environmental measurements, along with the results of model simulations, in:
• Atmospheric science
• Earth sciences
• Earth observation
• Marine science
• Polar science
• Terrestrial & freshwater science, hydrology and bioinformatics
• Space weather
OpenAIRE Portal
www.openaire.eu

Develop an Open Access, participatory infrastructure for scientific information that includes:
• Publications
• Datasets
• Projects
• Interlinking
Data, Reproducibility and Science
Science should be reproducible – other people doing the same experiments in the same way should get the same results.
Observational data is not reproducible (unless you have a time machine!)
Therefore we need to have access to the data to confirm the science is valid!
http://www.flickr.com/photos/31333486@N00/1893012324/sizes/o/in/photostream/
It used to be “easy”…
Suber cells and mimosa leaves. Robert Hooke, Micrographia, 1665
The Scientific Papers of William Parsons, Third Earl of Rosse 1800-1867
…but datasets have grown so big that it’s no longer practical to publish them in hard copy
Creating a dataset is hard work!
"Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com
Managing and archiving data so that it’s understandable by other researchers is difficult and time consuming too.
We want to reward researchers for putting that effort in!
Some examples of data (just from the Earth Sciences)
1. Time series, some still being updated e.g. meteorological measurements
2. Large 4D synthesised datasets, e.g. Climate, Oceanographic, Hydrological and Numerical Weather Prediction model data generated on a supercomputer
3. 2D scans e.g. satellite data, weather radar data
4. 2D snapshots, e.g. cloud camera
5. Traces through a changing medium, e.g. radiosonde launches, aircraft flights, ocean salinity and temperature
6. Datasets consisting of data from multiple instruments as part of the same measurement campaign
7. Physical samples, e.g. fossils
What is a Dataset?
DataCite’s definition (http://www.datacite.org/sites/default/files/Business_Models_Principles_v1.0.pdf):
Dataset: "Recorded information, regardless of the form or medium on which it may be recorded including writings, films, sound recordings, pictorial reproductions, drawings, designs, or other graphic representations, procedural manuals, forms, diagrams, work flow, charts, equipment descriptions, data files, data processing or computer programs (software), statistical records, and other research data."
(from the U.S. National Institutes of Health (NIH) Grants Policy Statement via DataCite's Best Practice Guide for Data Citation).
In my opinion, a dataset is something that is:
• The result of a defined process
• Scientifically meaningful
• Well-defined (i.e. a clear definition of what is in the dataset and what isn’t)
Data centre metrics – produced 15th July 2014

| Metric | Breakdown | CEDA numbers | Notes |
|---|---|---|---|
| Number of discovery dataset records in the DCS | Quarterly | NEODC 26; BADC 242; UKSSDC 11 | Compliance with NERC data management policy. Reflects how many datasets NERC has. The number of dataset discovery records visible from the NERC data discovery service. |
| Web site visits | Quarterly | BADC: 61,600; NEODC: 10,200 | Active use and visibility of the data centre. Site visits from standard web log analysis systems, such as Webalizer. Sensible web crawler filters should have been applied. |
| Web site page views | Quarterly | BADC: 219,900; NEODC: 25,800 | See web site visits notes. |
| Queries closed this period | Quarterly | 362 helpdesk queries; 838 dataset applications | Active use and visibility of the data centre. Queries marked as resolved within the quarter. A query is a request for information, a problem or an ad hoc data request. |
| Queries received in period | Quarterly | 388 helpdesk queries; 860 dataset applications | Active use and visibility of the data centre. See closed queries notes. |
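The crawler filtering mentioned in the web-site-visits notes can be sketched as follows. This is a minimal illustration, not CEDA's actual filtering rules: the log lines and bot patterns below are invented, and real log analysers such as Webalizer maintain much larger crawler lists.

```python
# Sketch of filtering crawler hits out of web server logs before counting
# visits. The log lines and bot name patterns are hypothetical examples.
import re

BOT_PATTERN = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)

def human_visits(log_lines):
    """Keep only lines whose user-agent field does not match a known crawler."""
    kept = []
    for line in log_lines:
        # Assume combined log format: the user-agent is the last quoted field.
        agent = line.rsplit('"', 2)[-2] if line.count('"') >= 2 else ""
        if not BOT_PATTERN.search(agent):
            kept.append(line)
    return kept

logs = [
    '1.2.3.4 - - [15/Jul/2014] "GET /data HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '5.6.7.8 - - [15/Jul/2014] "GET /data HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
]
print(len(human_visits(logs)))  # 1
```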
Data centre metrics – produced 15th July 2014

| Metric | Breakdown | CEDA numbers | Notes |
|---|---|---|---|
| Percent queries dealt with in 3 working days | Quarterly | 84.06% (11.57% resolved after 3 days); 87.67% (10.23% resolved after 3 days). Queries receiving initial response within 1 working day: helpdesk 93.57%; dataset applications 97.91% | Responsiveness. See closed queries notes. |
| Identifiable users actively downloading | None | Over year to date: BADC: 4,065; NEODC: 362 | Use and visibility of the data centre. An estimate of the number of users using data access services over the year. |
| Number of metadata records in data centre web site | None | BADC: 240; NEODC: 33 | INSPIRE compliance. Reflects how many datasets NERC has. |
| Number of datasets available to view via the data centre web site | None | (Metric in development) | INSPIRE compliance. Usable services. |
| Number of datasets available to download via the data centre web site | None | (Metric in development) | INSPIRE compliance. Usable services. |
Data centre metrics – produced 15th July 2014

| Metric | Breakdown | CEDA numbers | Notes |
|---|---|---|---|
| NERC-funded data centre staff (FTE) | None | 14 (estimate for FY 14/15) | Data management costs. Efficiency. Number of full-time equivalent posts employed to perform data centre functions. |
| Direct costs of data stewardship in data centre | None | (Reportable at end of financial year) | Data management costs. Efficiency. Cost to NERC. |
| Capital expenditure directly related to data stewardship at data centre | None | (Reportable at end of financial year) | Data management costs. Efficiency. |
| Direct receipts from data licences and sales | None | £0 (CEDA does not charge for data) | Commercial value of data products and services. |
| Number of projects with Outline Data Management Plans | None | (Metric in development) | Means of tracking projects’ adoption of good DM practice. Outline DMP is at proposal stage. |
| Number of projects with Full Data Management Plans | None | (Metric in development) | Means of tracking projects’ adoption of good DM practice. Full DMP is at funded stage. |
| Users by area | None | UK 2,534 (61%); Europe 494 (12%); Rest of the world 1,024 (25%); Unknown 79 (2%) | Active use. Visibility of the data centre internationally. Percentage of user base in terms of geographical spread. |
| Users by institute type | None | University 2,934 (71%); Government 694 (17%); NERC 160 (4%); Other 277 (7%); Commercial 42 (1%); School 35 (1%) | Active use. Visibility of the data centre sectorially. Percentage of user base in terms of the user’s host institute type. |
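As a sanity check, the "Users by area" percentages can be recomputed from the raw counts (the rounding to whole percentages is an assumption about how the reported figures were derived):

```python
# Recomputing the "Users by area" percentage breakdown from the raw user
# counts reported in the metrics table above.
counts = {"UK": 2534, "Europe": 494, "Rest of the world": 1024, "Unknown": 79}
total = sum(counts.values())  # 4131 identifiable users

percentages = {area: round(100 * n / total) for area, n in counts.items()}
print(percentages)  # {'UK': 61, 'Europe': 12, 'Rest of the world': 25, 'Unknown': 2}
```

The recomputed values match the 61/12/25/2 split in the table.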
After the data is downloaded, what happens then?

Short answer: we don’t know!

Unless the data user comes back to us to tell us, or we stumble across a paper which:
• Cites us
• Or mentions us in a way that we can find
• And tells us what dataset the authors used

This is why we’re working with other groups (such as CODATA, Force11, RDA, DataCite and Thomson Reuters) to promote data citation.
The Noble Eight-Fold Path to Citing Data

1. Importance
2. Credit and attribution
3. Evidence
4. Unique identification
5. Access
6. Persistence
7. Specificity and verifiability
8. Interoperability and flexibility

The principles are supplemented with a glossary, references and examples: http://force11.org/datacitation
How we (NERC) cite data

We use digital object identifiers (DOIs) as part of our dataset citations because:
• They are actionable, interoperable, persistent links for (digital) objects
• Scientists are already used to citing papers using DOIs (and they trust them)
• Academic journal publishers are starting to require datasets be cited in a stable way, i.e. using DOIs.
• We have a good working relationship with the British Library and DataCite
NERC’s guidance on citing data and assigning DOIs can be found at: http://www.nerc.ac.uk/research/sites/data/doi.asp
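As an illustration of what such a citation looks like in practice, here is a minimal sketch that composes a DataCite-style citation string (Creator (Year): Title. Publisher. Identifier). The metadata values and the `10.5285/xxxx` DOI below are placeholders, not a real NERC dataset record:

```python
# Sketch of formatting a DataCite-style dataset citation with a resolvable
# DOI link. All field values here are hypothetical.
def cite_dataset(creator, year, title, publisher, doi):
    """Format one dataset citation: Creator (Year): Title. Publisher. DOI link."""
    return f"{creator} ({year}): {title}. {publisher}. https://doi.org/{doi}"

example = cite_dataset(
    creator="Smith, J.",
    year=2014,
    title="Example surface temperature dataset",
    publisher="British Atmospheric Data Centre",
    doi="10.5285/xxxx",  # placeholder, not a real identifier
)
print(example)
```

Formatting the DOI as an `https://doi.org/...` link keeps the citation both human-readable and actionable.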
Dataset catalogue page (and DOI landing page)
Dataset citation
Clickable link to Dataset in the archive
Data metrics – the state of the art!
Data citation isn’t common practice (unfortunately)
Data citation counts don’t exist yet
To count how often BADC data is used we have to:
1. Search Google Scholar for “BADC”, “British Atmospheric Data Centre”
2. Scan the results and weed out false positives
3. Read the papers to figure out what datasets the authors are talking about (if we can)
4. Count the mentions and citations (if any)
http://www.lol-cat.org/little-lovely-lolcat-and-big-work/
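The manual workflow above can be caricatured in code: once the full texts are in hand, counting candidate mentions is a simple pattern match (the paper texts below are invented examples; in practice steps 2 and 3, weeding out false positives and identifying the datasets, still need a human reader):

```python
# Sketch of step 1 and 4 of the workflow above: finding and counting
# candidate mentions of the data centre in a set of paper full texts.
# The paper texts are hypothetical examples.
import re

MENTION = re.compile(r"\bBADC\b|British Atmospheric Data Centre", re.IGNORECASE)

papers = {
    "paper_a": "Radiosonde data were obtained from the British Atmospheric Data Centre.",
    "paper_b": "We ran a regional climate model at 12 km resolution.",
    "paper_c": "Winds are from the BADC archive of Met Office observations.",
}

mentions = {pid: len(MENTION.findall(text)) for pid, text in papers.items()}
candidates = [pid for pid, n in mentions.items() if n > 0]
print(sorted(candidates))  # ['paper_a', 'paper_c']
```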
We’re working with DataCite and Thomson Reuters to get data citation counts.
Altmetrics and social media for data?
We’re mainly focusing on citation as a first step, as it’s the measure most commonly accepted by researchers.
We have a social media presence @CEDAnews
- Mainly used for announcements about service availability
We definitely want ways of showing our funders that we provide a good service to our users and the research community.
And we want to be able to tell our depositors what impact their data has had!
RDA Bibliometrics for Data WG – preliminary survey results

• Launched 3rd September
• As of 17th September – 63 responses, 100% completion
• Survey link still live: https://www.surveymonkey.com/s/RDA_bibliometrics_data

Responses by field: Science 3; Earth sciences 16; Physics 4; Scientometrics and bibliometrics 4; Engineering 2; Chemistry 1; Biology (inc. zoology) 2; STEM 1; Medicine & biomedical research 8; Energy 1; Admin for research 2; Computer science 4; Social science, policy and economics 4; Librarian and digital curation 11
Future and missing

In the future, what would you like to use to evaluate the impact of data?
Most popular suggestions:
• Data citations
• Actual use in professional practice
• Download statistics
• Mentions in social media
• DOIs/PIDs
• Altmetrics
• Well-regarded indicators

Also pleas for:
• Easy to use and set up
• Radically different tools
• Whatever tool can provide reliable information
• Best estimate of societal benefit in $$ terms

What is currently missing and/or needs to be created for bibliometrics for data to become widely used?
Most popular suggestions:
• Culture change!
• Principles and standards for consistent practice (and enforcement of these)
• Use of PIDs
• Mature tools for data citation, publishing, discovery and impact analysis
• Openness in papers and patents

Also:
• Research on what current metrics actually measure
• Infrastructure
• Free apps
Please help! The survey link is still live: https://www.surveymonkey.com/s/RDA_bibliometrics_data

Please pass on the link to anyone who might be interested and encourage others to fill in the survey!
Share your experience with altmetrics – join the RDA WG on Publishing Data Bibliometrics: https://rd-alliance.org/group/rdawds-publishing-data-bibliometrics-wg.html
Thank you!

Sarah Callaghan
[email protected]
@sorcha_ni
http://weknowmemes.com/generator/meme/379914/
Work funded by the European Commission as part of the project OpenAIREplus (FP7-INFRA-2011-2, Grant Agreement no. 283595)