Managing data responsibly to enable research integrity
TRANSCRIPT
Managing data responsibly to enable research integrity
Heather Coates | Digital Scholarship & Data Management Librarian
http://ulib.iupui.edu/digitalscholarship/datasupport/
Introduction to Research Ethics, Quaid G504 (Fall 2016)
What is research integrity?
security
privacy
trust
honesty
accuracy
efficiency
objectivity
personal responsibility
ownership
stewardship
governance
Why does RDM matter?
RDM as a component of RCR
Roles & Responsibilities
Practical RDM
WHY DOES RDM MATTER?
The value of data increases with their use.
-Paul Uhlir
The World of Data Around Us
[Chart: information created vs. available storage, in petabytes worldwide, 2005–2010; the gap between the two represents transient information or unfilled demand for storage. Source: John Gantz, IDC Corporation: The Expanding Digital Universe]
The World of Data Around Us: Data Loss
• Natural disaster
• Facilities infrastructure failure
• Storage failure
• Server hardware/software failure
• Application software failure
• External dependencies (e.g., PKI failure)
• Format obsolescence
• Legal encumbrance
• Human error
• Malicious attack by human or automated agents
• Loss of staffing competencies
• Loss of institutional commitment
• Loss of financial stability
• Changes in user expectations and requirements
CC image by Sharyn Morrow on Flickr
CC image by momboleum on Flickr
Poor Data Management Affects Everyone
“MEDICARE PAYMENT ERRORS NEAR $20B” (CNN) December 2004
Miscoding and billing errors from doctors and hospitals totaled $20,000,000,000 in FY2003 (a 9.3% error rate). The error rate measured claims that were paid despite being medically unnecessary, inadequately documented, or improperly coded. In some instances, Medicare asked health care providers for medical records to back up their claims and got no response. The survey did not document instances of alleged fraud. This error rate was actually an improvement over the previous fiscal year (9.8%).
“AUDIT: JUSTICE STATS ON ANTI-TERROR CASES FLAWED” (AP) February 2007
The Justice Department Inspector General found that only two of the 26 sets of data concerning terrorism cases were accurate. The Justice Department uses these statistics to argue for its budget. The Inspector General said the inaccuracies “appear to be the result of decentralized and haphazard methods of collections … and do not appear to be intentional.”
“OOPS! TECH ERROR WIPES OUT Alaska Info” (AP) March 2007
A technician managed to delete both the data and the backup for the $38 billion Alaska oil revenue fund – money received by residents of the state. Correcting the error cost the state an additional $220,700 (which, of course, was deducted from the payments to Alaska residents).
"Good data management practice allows reliable verification of results and permits new and innovative research built on existing information. This is important if the full value of public investment in research is to be realized."
Managing and Sharing Data: Best Practices for Researchers, UK Data Archive
Benefits: Good Data Practices & Open Data
• Open data addresses social justice issues
• Open data enhances social welfare
• Open data enables effective governance and policy making
• Open data grows the economy
• Open data improves the integrity of the scholarly record
• Open data facilitates the education and training of new generations
• Open data enables validation or replication to support published results
• Open data accelerates the pace of discovery
• Good data practices (GDP) increase the impact of your work when you share your data, code & other products
• GDP improves the quality and consistency of the research data you produce (save $$$)
• GDP improves the efficiency of your research (save time)
Personal Experience
“Please forgive my paranoia about protocols, standards, and data review. I'm in the latter stages of a long career with USGS (30 years, and counting), and have experienced much. Experience is the knowledge you get just after you needed it.
Several times, I've seen colleagues called to court in order to testify about conditions they have observed. Without a strong tradition of constant review and approval of basic data, they would've been in deep trouble under cross-examination. Instead, they were able to produce field notes, data approval records, and the like, to back up their testimony.
It's one thing to be questioned by a college student who is working on a project for school. It's another entirely to be grilled by an attorney under oath with the media present.”
– Nelson Williams, Scientist, US Geological Survey
RDM AS A COMPONENT OF RCR
Concepts of Data Management
• Data ownership
• Data collection
• Data storage
• Data protection
• Data retention
• Data analysis
• Data sharing
• Data reporting
Steneck, 2004
DataONE Data Life Cycle
The purpose of data management planning is to ensure that research data produced by a project are high quality, well organized, thoroughly documented, preserved, and accessible so that the validity of the data can be determined at any time.
ORI Guidelines for Responsible Data Management
The goal of data management is to produce self-describing data sets.
DataONE Primer on Data Management
Why is good data management so challenging?
ambiguity effect
availability heuristic
confirmation bias
experimenter’s or expectation bias
framing effect
hindsight bias
neglect of probability
optimism bias
planning fallacy
well-traveled road effect
ROLES & RESPONSIBILITIES
Funder progress towards openness
• 1985: National Research Council
• 1999: Office of Mgmt & Budget, Circular A-110 revisions
• 2003: NIH Data Sharing Policy
• 2008: NIH Public Access Policy
• 2011: NSF DMP requirement
• 2012: NEH Office of Digital Humanities DMP requirement
• 2013: NSF biosketch change
• 2013: OSTP Memo on Public Access to Results of Federally-Funded Research
• 2014: OSTP Memo on Improving the Mgmt of & Access to Scientific Collections
• 2014: OMB Circular A-81 (Uniform Guidance) takes effect
• 2015: Federal funding agencies release plans responding to 2013 OSTP memo
• 2016: Federal funding agency plans take effect (DMP req)
Funder Policies: DMP & data sharing
• Agency for Healthcare Research & Quality
• Centers for Disease Control & Prevention
• National Institutes of Health
• National Science Foundation
More agency policies at datasharing.sparcopen.org
Publisher Policies: Data availability
• DataDryad Publishers
• PLoS Journals
• Nature Publishing Group
• American Economic Review
• BioMedCentral
• JORD - Social Science Journals with a research data policy
• Data policies of Economic Journals
https://ulib.iupui.edu/digitalscholarship/datasupport/publisher_policies
Institutional Policies
• Vary greatly, with lots of gaps
• Distributed – address specific local or state requirements for specific types of data
• Often focus on institutional data rather than research data
• Do not provide practical guidance
• Do not distinguish between institutional and personal responsibilities
Roles/Responsibilities of project personnel
• On your own
– Fill in the team members responsible for key data activities
• In small groups (2-3 people), share and discuss
– What kind of training is provided for team members to complete these tasks accurately?
• Whole group discussion
– What barriers do you face in tracking roles and responsibilities?
– What barriers do you face in providing training?
Worksheet columns: Team Member Name | Project Role | Activity | Description

Project design [+ documentation]
• Determining the aims of the project, the methods used to achieve those aims, and identifying the products resulting from the project.
• Translating the aims of the project into measurable research questions or hypotheses.

Instrument/measure/data collection tool design [+ documentation]
• Creating tools that adapt the research questions or hypotheses into questions that can be addressed by discrete data points.
• Validating tools through external review or pilot testing.

Data collection [+ documentation]
• Conducting surveys, interviews, experiments, and other project procedures according to the protocol in order to generate data.

Data processing [+ documentation]: entry, proofing/cleaning, preparation for analysis
• Entering analog data into a spreadsheet or database. Documenting procedures, date, and person responsible.
• Checking data entry for accuracy and completeness. Documenting procedures, date, and person responsible.
• Checking data for missing data, errors, and outliers. Documenting procedures, date, and person responsible.
• Deciding what data to include/exclude. Documenting decision-making process and criteria used.

Data analysis [+ documentation]
• Selecting analytical tools to be used. Documenting decision-making process and criteria used.
• Conducting data characterization and screening tests, running analyses, generating results. Documenting process and files generated.
• Deciding what data are relevant to the project aims and objectives. Documenting decision-making process and criteria used.

Data reporting
• Creating summary tables, graphs, and other visuals to represent the data.
• Writing up the project details and relevant results in the packages/format requested by the client, as specified by the deliverables agreed upon in the contract.
PRACTICAL RDM
Basic Principles: Good Research Data Practices
1. Have a plan & use it
2. Follow the 3-2-1 rule of data storage
3. Document
4. Be consistent; when you aren’t, document the deviation
5. Use common, standardized terminology
6. Monitor the quality of the data as it is being created
7. Report enough detail about your research so that others in your field can reproduce it and others outside your field can evaluate it
8. Be as open as possible
9. Think about how your research might be useful to others
1: Functional data management plans support teams
• A tool for planning all the key activities related to data before you have a messy pile of bits on your hands
• A working document that reflects how a study is conducted
• Communication device for the team
• Documents the team members and their roles
• Customized to address the issues most relevant to your research
1: Planning…learning from Good Clinical Data Management Practices
• Begin with the end in mind OR Produce report-ready outputs
• Plan, test, revise, plan, test, revise…implement
• Include all stakeholders in the design of the protocol, data collection tools, data management plan, etc.
• Document, document, document– Specify documents required for reproducible research
– Facilitates clear communication and shared understanding throughout the project
– Specify roles and responsibilities from the beginning
2: Follow the 3-2-1 Rule
The accepted rule for backup best practices is the three-two-one rule. It can be summarized as: if you’re backing something up, you should have:
• At least three copies (in different places),
• in two different formats,
• with one of those copies off-site.
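The rule above can be sketched in code. This is a minimal illustration using only the Python standard library; the file names and backup locations are hypothetical examples, not part of the slides, and a checksum step is added so each copy can be verified against the original.

```python
# Sketch of the 3-2-1 rule: three copies, two formats/locations, one off-site.
# Paths and the dataset name below are hypothetical.
import hashlib
import shutil
from pathlib import Path

def checksum(path: Path) -> str:
    """Return a SHA-256 digest so each copy can be verified against the original."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def backup(source: Path, destinations: list[Path]) -> bool:
    """Copy `source` to each destination directory and confirm every copy is intact."""
    original = checksum(source)
    for dest in destinations:
        dest.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(source, dest))  # copy2 preserves timestamps
        if checksum(copy) != original:
            return False  # a copy is corrupt; investigate before trusting it
    return True

# Three copies in total: the working file plus two backups, one of which
# lives off-site (here, a hypothetical mounted cloud-sync folder):
# backup(Path("survey_2016.csv"),
#        [Path("/mnt/external_drive/backups"),
#         Path.home() / "CloudSync" / "backups"])
```

Verifying checksums after copying is what turns "I have three copies" into "I have three copies I can trust."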
3: Document: How much?
More than you think you will need BUT less than everything
Information Entropy (from Michener et al., 1997)
[Chart: the detail available about data declines over time after the time of data development]
• Specific details about problems with individual items or specific dates are lost relatively rapidly
• General details about datasets are lost through time
• Accident or technology change may make data unusable
• Retirement or career change makes access to “mental storage” difficult or unlikely
• Loss of the data developer leads to loss of remaining information
3: Document, document, document
Documentation should capture crucial details needed for post publication peer review or validation of results
• Study: research questions/aims, IRB protocol, informed consents/authorizations, etc.
• Data collection instruments or tools OR data sources
• Data collection process or workflow
• Can take many forms, but should be consistent with standards or norms of practice for your field (e.g., data dictionary, data model, codebook, readme.txt, lab notebook)
3A: Know what you have - Data Inventory
• On your own
– Fill in as much of the data inventory as you can
• In small groups (2-3 people), share and discuss
– Benefits of knowing exactly what data you have?
– How hard would it be to complete this fully and accurately?
• Whole group discussion
– How might this be helpful throughout various phases of the project?
– How might it be helpful to have an inventory for completed and active projects?
3A: Know what you have - Data Inventory Example
• Funding source
• Program or initiative
• Project title
• PI First Name
• PI Surname
• Other Researchers/Data Contacts
• Project Start Date dd-mm-yyyy
• Project End Date dd-mm-yyyy
• New datasets created?
• How many datasets created
• Data location(s)
• Dataset Type (qualitative, quantitative, mixed methods, model)
• Sharing data?
– Deposit location?
– Licensing?
– Embargo?
http://www.data-archive.ac.uk/create-manage/strategies-for-centres/data-inventory
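A data inventory like the one above can also be kept in a machine-readable form so it stays current as the project grows. The sketch below uses Python's standard-library csv module; the field names follow a subset of the slide's fields, and the sample values are invented for illustration.

```python
# Minimal machine-readable data inventory (standard library only).
# Field names follow a subset of the slide; sample values are hypothetical.
import csv
from pathlib import Path

FIELDS = ["funding_source", "project_title", "pi_surname",
          "project_start", "project_end", "datasets_created",
          "data_location", "dataset_type", "sharing"]

def write_inventory(rows: list[dict], path: Path) -> None:
    """Write one CSV row per dataset/project so the inventory is easy to update."""
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)

# A hypothetical entry:
example = {
    "funding_source": "NSF", "project_title": "Example study",
    "pi_surname": "Doe", "project_start": "01-09-2016",
    "project_end": "31-08-2018", "datasets_created": "2",
    "data_location": "lab server + repository",
    "dataset_type": "quantitative", "sharing": "yes",
}
# write_inventory([example], Path("data_inventory.csv"))
```

A CSV kept under version control answers "what data do we have, and where?" without depending on any one person's memory.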
3B: Documentation Strategies
• Lab notebooks (print or electronic)
• Codebooks
• Data Dictionaries
• Procedures Manuals
• Protocols
• Readme.txt
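Of the strategies above, a data dictionary is often the quickest win. A minimal sketch: one record per variable, capturing type, meaning, and allowed values. The variable names and coding scheme below are hypothetical examples, not from the slides.

```python
# A minimal data dictionary: one entry per variable.
# Variable names, formats, and codes below are hypothetical.
DATA_DICTIONARY = {
    "participant_id": {
        "type": "string",
        "description": "Anonymized participant identifier",
        "allowed": "P### (e.g., P001)",
    },
    "visit_date": {
        "type": "date",
        "description": "Date of study visit",
        "format": "YYYY-MM-DD (ISO 8601)",
    },
    "smoking_status": {
        "type": "integer",
        "description": "Self-reported smoking status",
        "codes": {0: "never", 1: "former", 2: "current", 9: "missing"},
    },
}

def describe(variable: str) -> str:
    """Render one entry as a readable line for a codebook or readme.txt."""
    entry = DATA_DICTIONARY[variable]
    return f"{variable} ({entry['type']}): {entry['description']}"
```

Because the dictionary is structured data rather than free text, the same source can generate a codebook, a readme.txt, and validation rules.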
3C: Structured documentation [metadata] is crucial for discovery, reuse, and interoperability
• Metadata describes the who, what, when, where, how, why of the data
• Metadata = documentation for machines (standardized, structured)
• Purpose is to enable evaluation, discovery, organization, management, re-use, authority/identification, and preservation
• Standards are commonly agreed upon terms and definitions in a structured format
• Good documentation builds trust in your data – provenance, data integrity, transparency, audit trail, etc.
4: Be consistent
• We’re human – recognize the challenge
• Prevention – design research instruments & processes to prevent mistakes
• Pilot everything to identify potential problem areas
• When you aren’t, document the deviation
• Train your project personnel to be consistent & monitor performance
• Do internal audits, quality checks, data screening periodically to detect inconsistencies
5: Use common, standardized terminology
• For things/concepts
– Diagnoses
– Species/cell lines
– Locations
– Variable names
– Samples & materials
• For formats, too
– Dates
– Codes
– Identifiers
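Dates are a classic case where standardized formats pay off. The sketch below normalizes inconsistently entered dates to ISO 8601 (YYYY-MM-DD) using the standard library; the list of "observed" input formats is a hypothetical example of what raw data entry might contain.

```python
# Normalize inconsistent date entries to ISO 8601 (standard library only).
# The input formats listed are hypothetical examples of messy data entry.
from datetime import datetime

KNOWN_FORMATS = ["%m/%d/%Y", "%d-%m-%Y", "%Y-%m-%d", "%B %d, %Y"]

def to_iso(raw: str) -> str:
    """Return the date as YYYY-MM-DD, or raise if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"Unrecognized date format: {raw!r}")

# to_iso("3/7/2016")     -> "2016-03-07"
# to_iso("July 4, 2016") -> "2016-07-04"
```

Note that ambiguous entries like "03-07-2016" are resolved by format order, which is exactly why agreeing on one format up front beats cleaning up afterward.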
6: Monitor the quality of the data
• Don’t wait until data collection is over
• Quality Assurance
• Quality Control
• Build it into the project timeline
• Make it someone’s job
• Document what you find and how it was corrected
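A quality-control pass of the kind described above can be very simple. The sketch below flags missing and out-of-range values for one variable; the variable name and plausible range are hypothetical examples.

```python
# Simple data screening run during collection, not after it.
# The variable name ("age") and valid range below are hypothetical.
def screen(records: list[dict], field: str, low: float, high: float) -> dict:
    """Flag record indices with missing or out-of-range values for one variable."""
    missing, out_of_range = [], []
    for i, rec in enumerate(records):
        value = rec.get(field)
        if value is None or value == "":
            missing.append(i)
        elif not (low <= float(value) <= high):
            out_of_range.append(i)
    return {"missing": missing, "out_of_range": out_of_range}

records = [
    {"age": 34},
    {"age": None},   # missing entry
    {"age": 212},    # likely a typo for 21
]
# screen(records, "age", 0, 120) flags record 1 as missing, record 2 as out of range
```

Running checks like this on each batch of incoming data catches the "212-year-old participant" while the data collector can still remember, and document, what actually happened.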
7: Better reporting
• Report enough detail about your research so that others in your field can reproduce it and others outside your field can evaluate it
• This includes ALL aspects of the study: study design, data collection methods, sampling, population, data screening & processing, QA/QC procedures, analytical procedures, visualization procedures, etc.
I know you can’t fit this into a journal article but you can write up pieces of this as the study is conducted to support publications and reporting to the funder. Plus it makes writing those products much easier.
8: Be as open as possible (Open Science)
Open isn’t an all or nothing choice
• Study registration
• Open notebook science
• Data sharing (raw data, processed data, data supporting published results)
• Open Data
• Open Access publishing (deposit pre/post print in a repository, choose an OA journal, choose the Gold OA option)
Want to learn more? Center for Open Science Why Open Research?
9: Think ahead
How might your research be useful to others? To yourself in 5/15/50 years? To your students or trainees? To historians?
• Consider what you will forget in that time and document it
• Consider whether your data will be useful beyond the life of the project. If so, put it somewhere safe like an institutional or subject repository to share it and ensure long-term access.
Stakeholders in (Academic) RDM
• Research Administration
• Research Compliance
• University IT
• University Libraries
• University Archives
• Consortia (e.g., CIC)
• NIH CTSA Hubs/NCATS
• Research & Technology Corporation: http://iurtc.iu.edu/
Case studies: Discussion
1. http://retractionwatch.com/2015/11/05/got-the-blues-you-can-still-see-blue-after-all-paper-on-sadness-and-color-perception-retracted/
2. https://ori.hhs.gov/content/case-summary-anderson-david
Resources
1. Uhlir, P. F. (2010). Information Gulags, Intellectual Straightjackets, and Memory Holes. Data Science Journal, 9, ES1-ES5.
2. DataONE Education Module: Data Management. DataONE. Retrieved December 2013 from http://www.dataone.org/sites/all/documents/L01_DataManagement.pptx
3. Scientists are hoarding data and it’s ruining medical research: http://www.buzzfeed.com/bengoldacre/deworming-trials
4. Losing data from the National Centre for E-Social Science (NCESS) Portal: http://datastories.jiscinvolve.org/wp/2015/08/10/losing-data-from-the-national-centre-for-e-social-science-ncess-portal/
5. Biomedical data sharing and reuse: Attitudes and practices of clinical and scientific research staff: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0129506
6. Over half of psychology studies fail reproducibility test: http://www.nature.com/news/over-half-of-psychology-studies-fail-reproducibility-test-1.18248
7. Value of Open Data Sharing: https://www.fosteropenscience.eu/sites/default/files/pdf/2536.pdf
8. Michener, W. K., Brunt, J. W., Helly, J. J., Kirchner, T. B., & Stafford, S. G. (1997). Nongeospatial metadata for the ecological sciences. Ecological Applications, 7(1), 330-342.
9. Society for Clinical Data Management. (2013). Good Clinical Data Management Practices. Washington, D.C.
10. UK Data Archive. (2015). Prepare and manage data. From http://ukdataservice.ac.uk/manage-data