Metrics & Citation for Software (and Data)
Daniel S. Katz, [email protected] & [email protected]
@danielskatz
Program Director, Division of Advanced Cyberinfrastructure
(http://www.slideshare.net/danielskatz/metrics-citation-for-software-and-data)
Workshop on Supporting Scientific Discovery through
Norms and Practices for Software and Data Citation and
Attribution, Washington DC, 29 Jan 2015
National Science Foundation
• Federal agency created in 1950 “to promote the
progress of science; to advance the national
health, prosperity, and welfare; to secure the
national defense…”
• Annual budget of $7.3 billion (FY 2015)
• Funds 24 percent of all federally supported
basic research at US colleges and universities
• In many fields such as mathematics, computer
science and the social sciences, NSF is the
major source of federal funds
NSF Organization Chart (January 2015)
National Science Foundation, 4201 Wilson Boulevard, Arlington, Virginia 22230
TEL: 703.292.5111 | FIRS: 800.877.8339 | TDD: 800.281.8749
• OFFICE OF THE DIRECTOR, 703.292.8000: France A. Córdova, Director; Vacant, Deputy Director
• Richard Buckius, Chief Operating Officer
• NATIONAL SCIENCE BOARD (NSB), 703.292.7000: Dan E. Arvizu, Chair; Kelvin K. Droegemeier, Vice Chair
• NATIONAL SCIENCE BOARD OFFICE, 703.292.7000: Michael Van Woert, Executive Officer
• OFFICE OF INSPECTOR GENERAL (OIG), 703.292.7100: Allison C. Lerner, Inspector General
• OFFICE OF THE GENERAL COUNSEL (OGC), 703.292.8060: Lawrence Rudolph, General Counsel; Peggy Hoyle, Deputy GC
• OFFICE OF DIVERSITY & INCLUSION (ODI), 703.292.8020: Vacant, Head
• OFFICE OF LEGISLATIVE & PUBLIC AFFAIRS (OLPA), 703.292.8070: Dana Toupousis, Acting Head
• OFFICE OF INTERNATIONAL & INTEGRATIVE ACTIVITIES (OIIA), 703.292.8040: Wanda Ward, Head
• DIRECTORATE FOR BIOLOGICAL SCIENCES (BIO), 703.292.8400: James L. Olds, Assistant Director; Jane Silverthorne, Deputy AD
– DIVISION OF BIOLOGICAL INFRASTRUCTURE (DBI), 703.292.8470: Scott Edwards, Division Director
– DIVISION OF ENVIRONMENTAL BIOLOGY (DEB), 703.292.8480: Alan Tessler, Acting Division Director
– DIVISION OF INTEGRATIVE ORGANISMAL SYSTEMS (IOS), 703.292.8420: William Zamer, Acting Division Director
– DIVISION OF MOLECULAR & CELLULAR BIOSCIENCES (MCB), 703.292.8440: Gregory Warr, Acting Division Director
– OFFICE OF EMERGING FRONTIERS (EF), 703.292.8508: Charles Liarakos, Acting Division Director
• DIRECTORATE FOR COMPUTER & INFORMATION SCIENCE & ENGINEERING (CISE), 703.292.8900: James F. Kurose, Assistant Director; Suzanne Iacono, Deputy AD
– DIVISION OF ADVANCED CYBERINFRASTRUCTURE (ACI), 703.292.8970: Irene Qualters, Division Director
– DIVISION OF COMPUTING & COMMUNICATION FOUNDATIONS (CCF), 703.292.8910: Rao Kosaraju, Division Director
– DIVISION OF COMPUTER & NETWORK SYSTEMS (CNS), 703.292.8950: Keith Marzullo, Division Director
– DIVISION OF INFORMATION & INTELLIGENT SYSTEMS (IIS), 703.292.8930: Lynne E. Parker, Division Director
• DIRECTORATE FOR EDUCATION & HUMAN RESOURCES (EHR), 703.292.8600: Joan Ferrini-Mundy, Assistant Director; James W. Lewis, Deputy AD
– DIVISION OF GRADUATE EDUCATION (DGE), 703.292.8630: Valerie Wilson, Acting Division Director
– DIVISION OF HUMAN RESOURCE DEVELOPMENT (HRD), 703.292.8640: Sylvia James, Division Director
– DIVISION OF RESEARCH ON LEARNING IN FORMAL & INFORMAL SETTINGS (DRL), 703.292.8620: Sarah McDonald, Acting Division Director
– DIVISION OF UNDERGRADUATE EDUCATION (DUE), 703.292.8670: Susan Singer, Division Director
• DIRECTORATE FOR ENGINEERING (ENG), 703.292.8300: Pramod P. Khargonekar, Assistant Director; Grace Wang, Deputy AD
– DIVISION OF CHEMICAL, BIOENGINEERING, ENVIRONMENTAL & TRANSPORT SYSTEMS (CBET), 703.292.8320: JoAnn Lighty, Division Director
– DIVISION OF CIVIL, MECHANICAL & MANUFACTURING INNOVATION (CMMI), 703.292.8360: Deborah Goodings, Acting Division Director
– DIVISION OF ELECTRICAL, COMMUNICATIONS & CYBER SYSTEMS (ECCS), 703.292.8339: Samir El-Ghazaly, Division Director
– DIVISION OF ENGINEERING EDUCATION & CENTERS (EEC), 703.292.8380: Don L. Millard, Acting Division Director
– DIVISION OF INDUSTRIAL INNOVATION & PARTNERSHIPS (IIP), 703.292.8050: Joseph Hennessey, Acting Division Director
– OFFICE OF EMERGING FRONTIERS IN RESEARCH & INNOVATION (EFRI), 703.292.8301: Sohi Rastegar, Senior Advisor
• DIRECTORATE FOR GEOSCIENCES (GEO), 703.292.8500: Roger Wakimoto, Assistant Director; Margaret Cavanaugh, Deputy AD
– DIVISION OF ATMOSPHERIC & GEOSPACE SCIENCES (AGS), 703.292.8520: Paul Shepson, Division Director
– DIVISION OF EARTH SCIENCES (EAR), 703.292.8550: Carol Frost, Division Director
– DIVISION OF OCEAN SCIENCES (OCE), 703.292.8580: Deborah Bronk, Division Director
– DIVISION OF POLAR PROGRAMS (PLR), 703.292.8030: Kelly Falkner, Division Director
• DIRECTORATE FOR MATHEMATICAL & PHYSICAL SCIENCES (MPS), 703.292.8800: Fleming Crim, Assistant Director; Celeste M. Rohlfing, Deputy AD
– DIVISION OF ASTRONOMICAL SCIENCES (AST), 703.292.8820: James Ulvestad, Division Director
– DIVISION OF CHEMISTRY (CHE), 703.292.8840: Steven Bernasek, Division Director
– DIVISION OF MATERIALS RESEARCH (DMR), 703.292.8810: Mary Galvin-Donoghue, Division Director
– DIVISION OF MATHEMATICAL SCIENCES (DMS), 703.292.8870: Michael Vogelius, Division Director
– DIVISION OF PHYSICS (PHY), 703.292.8890: Denise Caldwell, Division Director
– OFFICE OF MULTIDISCIPLINARY ACTIVITIES (OMA), 703.292.8800: Clark Cooper, Office Head
• DIRECTORATE FOR SOCIAL, BEHAVIORAL, & ECONOMIC SCIENCES (SBE), 703.292.8700: Fay L. Cook, Assistant Director; Clifford Gabriel, Deputy AD (Acting)
– DIVISION OF BEHAVIORAL & COGNITIVE SCIENCES (BCS), 703.292.8740: Mark Weiss, Division Director
– DIVISION OF SOCIAL & ECONOMIC SCIENCES (SES), 703.292.8760: Jeryl Mumpower, Division Director
– NATIONAL CENTER FOR SCIENCE AND ENGINEERING STATISTICS (NCSES), 703.292.8780: John Gawalt, Division Director
• OFFICE OF INFORMATION & RESOURCE MANAGEMENT (OIRM), 703.292.8100: Joanne S. Tornow, Head / Chief Human Capital Officer; Amy Northcutt, Chief Information Officer
– DIVISION OF ADMINISTRATIVE SERVICES (DAS), 703.292.8190: Mercedes Eugenia, Division Director
– DIVISION OF INFORMATION SYSTEMS (DIS), 703.292.8150: Dorothy Aronson, Division Director
– DIVISION OF HUMAN RESOURCE MANAGEMENT (HRM), 703.292.8180: Judy Sunley, Division Director
• OFFICE OF BUDGET, FINANCE, & AWARD MANAGEMENT (BFA), 703.292.8200: Martha A. Rubenstein, Head / Chief Financial Officer; Joanna E. Rom, Deputy Head
– BUDGET DIVISION (BUD), 703.292.8260: Michael Sieverts, Division Director
– DIVISION OF ACQUISITION AND COOPERATIVE SUPPORT (DACS), 703.292.8240: Jeffery Lupis, Division Director
– DIVISION OF FINANCIAL MANAGEMENT (DFM), 703.292.8280: Shirl Ruffin, Division Director / Deputy CFO
– DIVISION OF GRANTS & AGREEMENTS (DGA), 703.292.8210: Karen Tiplady, Division Director
– DIVISION OF INSTITUTION & AWARD SUPPORT (DIAS), 703.292.8230: Mary Santonastasso, Division Director
– LARGE FACILITIES OFFICE, 703.292.4416: Matthew Hawkins, Acting Deputy Director
Advanced Cyberinfrastructure (ACI) Division
• Supports and coordinates the development,
acquisition, and provision of state-of-the-art
cyberinfrastructure resources, tools, and
services
• Supports forward-looking research and
education to expand the future capabilities of
cyberinfrastructure
• Serves the growing community of scientists and
engineers, across all disciplines, whose work
relies on the power of advanced computation,
data-handling, and networking
Cyberinfrastructure
“Cyberinfrastructure consists of
computing systems,
data storage systems,
advanced instruments and
data repositories,
visualization environments, and
people,
all linked together by
software and
high performance networks,
to improve research productivity and
enable breakthroughs not otherwise possible.”
-- Craig Stewart
Software as Infrastructure
[Figure labels: Science, Software, Computing Infrastructure]
• Software (including services) essential for
the bulk of science
- About half the papers in recent issues of
Science were software-intensive projects
- Research becoming dependent upon
advances in software
- Significant software development being
conducted across NSF: NEON, OOI,
NEES, NCN, iPlant, etc.
• Wide range of software types: system, applications, modeling,
gateways, analysis, algorithms, middleware, libraries
• Software is not a one-time effort; it must be sustained
• Development, production, and maintenance are people-intensive
• Software lifetimes are long compared with hardware lifetimes
• Software has under-appreciated value
For software to be sustainable,
it must become infrastructure
NSF Software Infrastructure Projects
• 5 rounds of funding, 65 SSEs
• 4 rounds of funding, 35 SSIs
• 2 rounds of funding, 14 S2I2 conceptualizations
• SSE & SSI – NSF 14-520: Cross-NSF, all Directorates participating
• Next SSEs due Feb 2015; Next SSIs due June 2015
See http://bit.ly/sw-ci for current projects
SI2 Solicitation and Decision Process
• Proposal reviews well -> my role becomes matchmaking
– I want to find program officers with funds, and convince them that they should spend their funds on the proposal
• Unidisciplinary project (e.g., bioinformatics app)
– Work with single program officer, who either likes the proposal or not
• Multidisciplinary project (e.g., molecular dynamics)
– Work with multiple program officers, ...
• Omnidisciplinary project (e.g., http, math library)
– Try to work with all program officers, often am told “it’s your responsibility”
To judge software, need to understand/forecast impact
Measuring Impact – Scenarios
1. Developer of open source physics simulation
– Possible metrics
• How many downloads? (easiest to measure, least value)
• How many contributors?
• How many uses?
• How many papers cite it?
• How many papers that cite it are cited? (hardest to measure,
most value)
2. Developer of open source math library
– Possible metrics are similar, but citations are less
likely
– What if users don’t download it?
• It’s part of a distro
• It’s pre-installed (and optimized) on an HPC system
• It’s part of a cloud image
• It’s a service
• Future impacts – let proposers suggest
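The two citation-based metrics in scenario 1 can be illustrated with a short sketch. Everything in it is hypothetical: the citation graph, the product name `physics_sim`, and the function names are invented for illustration and do not come from the talk. It counts the papers that cite a software product, and then the citations received by those citing papers.

```python
from typing import Dict, Set

# Hypothetical citation graph: each key cites every item in its set.
CITES: Dict[str, Set[str]] = {
    "paper_1": {"physics_sim"},
    "paper_2": {"physics_sim", "paper_1"},
    "paper_3": {"paper_1"},
    "paper_4": {"paper_2", "paper_3"},
}


def citing(target: str, cites: Dict[str, Set[str]]) -> Set[str]:
    """Return every item that directly cites `target`."""
    return {source for source, targets in cites.items() if target in targets}


def first_order(product: str, cites: Dict[str, Set[str]]) -> int:
    """How many papers cite the product?"""
    return len(citing(product, cites))


def second_order(product: str, cites: Dict[str, Set[str]]) -> int:
    """How many citations do the papers that cite the product receive?"""
    return sum(len(citing(paper, cites)) for paper in citing(product, cites))


if __name__ == "__main__":
    print("papers citing the product:", first_order("physics_sim", CITES))
    print("citations of those papers:", second_order("physics_sim", CITES))
```

The sketch assumes a complete citation graph is available, which is the hard part in practice: software is often used without being cited at all, as scenario 2 points out.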
ACI Software Cluster Programs
• In these programs, ACI works with other NSF
units to support projects that lead to software
as an element of infrastructure
• Issue: amount of software that is
infrastructure grows over time, and grows
faster than NSF funding
Q: How can NSF ensure that software as
infrastructure continues to appear, without
funding all of it?
A: Incentives
• The devil is in the details
Other Software Discussions
• Working Towards Sustainable Software for
Science: Practice and Experience (WSSSPE)
– http://wssspe.researchcomputing.org.uk
– 3 workshops held
• Lessons:
– Many of the issues in developing sustainable software are social, not technical
– Software work is inadequately visible in ways that “count” within the reputation system underlying science
Where We Are
• To judge software, need to understand/forecast impact
• Q: How can NSF ensure that software as infrastructure
continues to appear, without funding all of it?
• A: Incentives
• Many of the issues in developing sustainable software are
social, not technical
• Software work is inadequately visible in ways that “count”
within the reputation system underlying science
Hypothesis: better measurement of
contributions can lead to rewards
(incentives), leading to career paths,
willingness to join communities, leading to
more sustainable software
A Problem
Credit for finding: Amy Brand, Digital Science
Another Problem
Credit for finding: Amy Brand, Digital Science
Last Problem
Credit for finding: Amy Brand, Digital Science
Moving Forward - NSF
• Recent CISE/ACI & SBE/SES Dear Colleague
Letter: Supporting Scientific Discovery through
Norms and Practices for Software and Data
Citation and Attribution (NSF 14-059,
http://www.nsf.gov/pubs/2014/nsf14059/nsf14059.jsp)
– Need well-developed metrics to assess the
impact and quality of scientific software and
data
– Explore new norms and practices for software
and data citation and attribution, so that data
producers, software and tool developers, and
data curators are credited
• 6 projects and 3 collaborative workshops funded
Moving Forward - Dan
• Products (software, paper, data set) are
registered
– Credit map (weighted list of contributors—
people, products, etc.) is an input
– DOI is an output
[Figure: credit map for a Paper, with weighted links to Author A, Author B..., Paper M..., Software X..., and Data K...; weights shown: 0.2, 0.2, 0.05, 0.2, 0.1]
Moving Forward - Dan
– Enables transitive credit [1]
• E.g., paper 1 provides 25% credit to software A, and software A provides 10% credit to library X -> library X gets 2.5% credit for paper 1
• Helps developer show: “my tools are important”
– Issues:
• Social: Trust in person who registers a product
• Technological: How [2], Registration system
[Figure: two linked credit maps. Nodes: Author 1..., Paper 4..., Software 12... (weights shown: 0.1, 0.1, 0.3) and Paper, Author A, Author B..., Paper M..., Software X..., Data K... (weights shown: 0.2, 0.2, 0.05, 0.2, 0.1)]
[1] D. S. Katz, "Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products," Journal of Open Research Software, v.2(1): e20, 2014. DOI: 10.5334/jors.be
[2] D. S. Katz, A. M. Smith, "Implementing Transitive Credit with JSON-LD," 2nd Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE2), 2014. URL: http://arxiv.org/abs/1407.5117
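The transitive credit example above (25% of paper 1 to software A, 10% of software A to library X, hence 2.5% of paper 1 to library X) can be worked through in a few lines. This is a minimal sketch, not the method from the cited papers. Only the 0.25 and 0.10 weights come from the slide. The remaining entries, the names, the rule that each map sums to 1, and the choice to report an intermediate product with its full direct share are assumptions made for illustration.

```python
from typing import Dict, Tuple

CreditMap = Dict[str, float]

# product -> {contributor or product: weight}.
# 0.25 and 0.10 are from the slide; the other entries are hypothetical
# filler so that each map sums to 1 (an assumed convention).
CREDIT_MAPS: Dict[str, CreditMap] = {
    "paper_1": {"software_A": 0.25, "paper_1_authors": 0.75},
    "software_A": {"library_X": 0.10, "software_A_developers": 0.90},
}


def validate(credit_map: CreditMap, tol: float = 1e-9) -> None:
    """Check that weights are non-negative and sum to 1 (assumed rule)."""
    if any(weight < 0 for weight in credit_map.values()):
        raise ValueError("weights must be non-negative")
    if abs(sum(credit_map.values()) - 1.0) > tol:
        raise ValueError("weights must sum to 1")


def transitive_credit(
    product: str,
    maps: Dict[str, CreditMap],
    weight: float = 1.0,
    _path: Tuple[str, ...] = (),
) -> Dict[str, float]:
    """Credit that `product` passes to everything it credits,
    directly or through other products, by multiplying weights."""
    if product in _path:
        raise ValueError(f"cycle detected at {product!r}")
    totals: Dict[str, float] = {}
    for target, share in maps.get(product, {}).items():
        credit = weight * share
        totals[target] = totals.get(target, 0.0) + credit
        nested = transitive_credit(target, maps, credit, _path + (product,))
        for name, value in nested.items():
            totals[name] = totals.get(name, 0.0) + value
    return totals


if __name__ == "__main__":
    for credit_map in CREDIT_MAPS.values():
        validate(credit_map)
    for name, value in transitive_credit("paper_1", CREDIT_MAPS).items():
        print(f"{name}: {value:.3f}")
```

Summing these per-paper results over every registered paper would give a library developer a single figure to support the claim that “my tools are important”.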
Moving Forward – Project CRediT
• Goal: develop a contributor role taxonomy to enable greater granularity & transparency around contributions to scholarly published output in science
• http://projectcredit.net
• Rationale:
– Publishers: Increase transparency; Reduce author disputes; Simplify process of chasing authors; Identifying peer reviewers
– Research funders: Supporting grant applications; Understanding impact; Awarding credit; Identifying peer reviewers; Identifying new funding opportunities
– Researchers: Gaining credit for true contribution; Credit for ‘new’/specific roles; Identify collaborators; Benefit junior reviewers; Reduce authorship politics?
– Research institutions: Support tenure & appointment; New esteem & credit metrics for staff; Understanding impact
• Comments to [email protected] & [email protected]
Moving Forward – Project CRediT
Role | Description
Study conception | Idea; formulation of research question; statement of hypothesis
Methodology | Development or design of methodology; creation of models
Computation | Programming, software development; designing computer programs; implementation of computer code and supporting algorithms
Formal analysis | Application of statistical, mathematical, or other formal techniques to analyze study data
Investigation; performed the experiments | Conducting the research and investigation process, specifically performing the experiments
Investigation; data/evidence collection | Conducting the research and investigation process, specifically data/evidence collection
Resources | Provision of study materials, reagents, patients, laboratory samples, animals, instrumentation, or other analysis tools
Data curation | Management activities to annotate (produce metadata) and maintain research data for initial use and later re-use
Writing/manuscript preparation: writing the initial draft | Preparation, creation, and/or presentation of published work, specifically writing the initial draft
Writing/manuscript preparation: critical review, commentary, or revision | Preparation, creation, and/or presentation of published work, specifically critical review, commentary, or revision
Writing/manuscript preparation: visualization/data presentation | Preparation, creation, and/or presentation of published work, specifically visualization/data presentation
Supervision | Responsibility for supervising research; project orchestration; principal investigator or other lead stakeholder
Project administration | Coordination or management of research activities leading to this publication
Funding acquisition | Acquisition of the financial support for the project leading to this publication
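One way to see what a role taxonomy buys is to record contributions against it. The sketch below is illustrative only: the role labels are copied from the table above, but the `Role` enum, the contributor names, and the `contributors_with` helper are hypothetical and are not part of Project CRediT.

```python
from enum import Enum
from typing import Dict, List, Set


class Role(Enum):
    """Role labels taken from the table above; the enum itself is a sketch."""
    STUDY_CONCEPTION = "Study conception"
    METHODOLOGY = "Methodology"
    COMPUTATION = "Computation"
    FORMAL_ANALYSIS = "Formal analysis"
    INVESTIGATION_EXPERIMENTS = "Investigation; performed the experiments"
    INVESTIGATION_COLLECTION = "Investigation; data/evidence collection"
    RESOURCES = "Resources"
    DATA_CURATION = "Data curation"
    WRITING_INITIAL_DRAFT = "Writing/manuscript preparation: writing the initial draft"
    WRITING_REVIEW = "Writing/manuscript preparation: critical review, commentary, or revision"
    WRITING_VISUALIZATION = "Writing/manuscript preparation: visualization/data presentation"
    SUPERVISION = "Supervision"
    PROJECT_ADMINISTRATION = "Project administration"
    FUNDING_ACQUISITION = "Funding acquisition"


# Hypothetical contributors to a single publication.
CONTRIBUTIONS: Dict[str, Set[Role]] = {
    "A. Researcher": {Role.STUDY_CONCEPTION, Role.WRITING_INITIAL_DRAFT},
    "B. Developer": {Role.COMPUTATION, Role.DATA_CURATION},
    "C. Lead": {Role.SUPERVISION, Role.FUNDING_ACQUISITION, Role.WRITING_REVIEW},
}


def contributors_with(role: Role, contributions: Dict[str, Set[Role]]) -> List[str]:
    """Return the contributors who hold `role`, sorted by name."""
    return sorted(name for name, roles in contributions.items() if role in roles)


if __name__ == "__main__":
    print(contributors_with(Role.COMPUTATION, CONTRIBUTIONS))
```

With contributions recorded this way, software work shows up as its own role (Computation) rather than being hidden in an author list.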
Moving Forward – Software Discovery Index
• NIH workshop, May 2014, within Big Data to
Knowledge (BD2K) initiative
– http://softwarediscoveryindex.org/
• Explored challenges facing the biomedical
research community in locating, citing, and
reusing biomedical software
• Identified fundamental prerequisite for success:
an automated, broadly accessible system
enabling comprehensive identification of
biomedical software.
• SDI Objectives:
– to assign standard and unambiguous identifiers to reference all software
– to track specific metadata features that describe that software
– to enable robust querying of all relevant information for users
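The three objectives map naturally onto three operations: register, describe, and query. The toy in-memory index below is a sketch under stated assumptions, not a description of how the SDI is or will be built. The `sdi-demo:` identifier scheme, the class and field names, and the example metadata keys are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SoftwareRecord:
    identifier: str
    name: str
    version: str
    metadata: Dict[str, str] = field(default_factory=dict)


class SoftwareIndex:
    """Toy index: assigns identifiers, stores metadata, answers queries."""

    def __init__(self) -> None:
        self._records: Dict[str, SoftwareRecord] = {}

    def register(self, name: str, version: str, **metadata: str) -> str:
        """Objective 1: assign an unambiguous identifier (made-up scheme)."""
        identifier = f"sdi-demo:{len(self._records) + 1:06d}"
        # Objective 2: keep the descriptive metadata with the record.
        self._records[identifier] = SoftwareRecord(
            identifier, name, version, dict(metadata)
        )
        return identifier

    def get(self, identifier: str) -> SoftwareRecord:
        """Resolve an identifier to its record."""
        return self._records[identifier]

    def query(self, **criteria: str) -> List[SoftwareRecord]:
        """Objective 3: return records whose metadata match every criterion."""
        return [
            record
            for record in self._records.values()
            if all(record.metadata.get(k) == v for k, v in criteria.items())
        ]


if __name__ == "__main__":
    index = SoftwareIndex()
    index.register("aligner", "1.0", domain="genomics", language="C")
    index.register("aligner", "2.0", domain="genomics", language="C")
    for record in index.query(domain="genomics"):
        print(record.identifier, record.name, record.version)
```

Registering each version separately is a design choice in this sketch: it lets a citation point at the exact version used.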
Moving Forward – Software Discovery Index
• Complementary with BD2K Data Discovery Index (DDI)
• Data vs. Software Characteristics
• Research Resource Identifiers (RRIDs) as prototype?
– http://scicrunch.com/resources
• Note strong biomedical focus of SDI and DDI
– initial case or limiting?
[Table: Issue | Data | Software, with one row per issue]
– Storage-limited
– Number of {datasets | software}
– Complex metadata
– Cited consistently and effectively
– Consistently accessible long-term
– Dependent on other data and software
(Credit: Chris Wellington & Vivien Bonazzi, NIH)
Moving Forward - Scholarly Contributions
Workshop & FORCE11
• FORCE11 – Open community aiming to improve future
research communication and e-Scholarship
– http://force11.org
• Scholarly Communications Workshop @ FORCE2015,
Oxford, UK, Jan 11 2015
• Goals:
1. Develop collaborative, interdisciplinary group to technically
implement a scholarly contribution roles ontology in
context of VIVO-ISF
2. Skeleton of scholarly products and the contribution roles
that people have towards each
3. Plan for technical next steps and development of proposal
to get funding to support this work
• Interest led to Force11 Attribution working group
– Webpage: http://www.force11.org/group/attributionwg
– Mailing List: [email protected]
Moving Forward - Community
• Lots of challenges remain – within and across projects
• Career paths – Is there a role for non-tenure-track researchers
who produce software, data, etc. in universities?
– Assuming yes, do universities recognize and support this? If not,
how to get them to?
• What is needed to support reproducibility of science, in terms of
data and software?
• Versioning & provenance
• Lots of entities with similar interests in both software and data,
e.g. JISC, RCUK, NIH, DOE, Sloan & Moore, Mozilla, Apache
– Identifier work from Zenodo/GitHub, DataCite, CrossRef, VIVO, ...
• Need institutional buy-in, incorporation in researcher profiles
• Publisher involvement is essential
– Software papers vs software?
• Future of Google Scholar?
• Continued participation in WSSSPE invited, leading to actions
• Other ideas and questions are welcome, now or later
Resources
• NSF Software as Infrastructure Vision:
http://www.nsf.gov/publications/pub_summ.jsp?ods_key=nsf12113
• Implementation of NSF Software Vision:
http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=504817
• Software Infrastructure for Sustained Innovation (SI2) Program
– Scientific Software Elements (SSE) & Scientific Software Integration (SSI) solicitation:
http://www.nsf.gov/publications/pub_summ.jsp?ods_key=nsf14520
– 2013 PI meeting: https://sites.google.com/site/si2pimeeting/
– 2014 PI meeting: https://sites.google.com/site/si2pimeeting2014/
– Awards: http://bit.ly/sw-ci
• Working towards Sustainable Software for Science: Practice and Experiences (WSSSPE)
– Home: http://wssspe.researchcomputing.org.uk (includes links to all slides & papers)
– 1st workshop paper: http://arxiv.org/abs/1404.7414
– 2nd workshop site: http://wssspe.researchcomputing.org.uk/wssspe2/
• NSF 14-059: “Dear Colleague Letter - Supporting Scientific Discovery through Norms and
Practices for Software and Data Citation and Attribution”
– http://www.nsf.gov/pubs/2014/nsf14059/nsf14059.jsp
• Transitive Credit Papers
– http://dx.doi.org/10.5334/jors.be
– http://arxiv.org/abs/1407.5117
• Project CRediT: http://projectcredit.net
• NIH Software Discovery Index: http://softwarediscoveryindex.org/
• FORCE11: http://force11.org/
– Attribution Working Group: http://www.force11.org/group/attributionwg
Credits:
• SI2 Program:
– Current program officers: Daniel S. Katz, Rudolf Eigenmann, William Y. B. Chang,
John C. Cherniavsky, Almadena Y. Chtchelkanova, Cheryl L. Eavey, Evelyn
Goldfield, Sol Greenspan, Daryl W. Hess, Peter H. McCartney, Bogdan Mihaila,
Dimitrios V. Papavassiliou, Andrew D. Pollington, Barbara Ransom, Thomas
Russell, Massimo Ruzzene, Nigel A. Sharp, Paul Werbos, Eva Zanzerkia
– Formerly-involved program officers: Manish Parashar, Gabrielle Allen, Sumanta
Acharya, Eduardo Misawa, Jean Cottam-Allen, Thomas Siegmund
• WSSSPE:
– Organizers: Daniel S. Katz, Gabrielle Allen, Neil Chue Hong, Karen Cranston,
Manish Parashar, David Proctor, Matthew Turk, Colin C. Venters, Nancy Wilkins-Diehr
– WSSSPE1 summary paper authors: Daniel S. Katz, Sou-Cheng T. Choi, Hilmar
Lapp, Ketan Maheshwari, Frank Löffler, Matthew Turk, Marcus D. Hanwell, Nancy
Wilkins-Diehr, James Hetherington, James Howison, Shel Swenson, Gabrielle D.
Allen, Anne C. Elster, Bruce Berriman, Colin Venters
– Keynote speakers: Phil Bourne, Arfon Smith, Kaitlin Thaney, Neil Chue Hong
• Project CRediT
– Leads: Liz Allen, Amy Brand, full group at http://projectcredit.net/
• NIH Software Discovery Index
– http://softwarediscoveryindex.org/
• Force11 community
– http://force11.org/