ada slide presentation rsc day_feb2017_v2

Introduction to ADA: The Australian Data Archive as a Trusted Repository for Research Data Dr. Steve McEachern Director, ADA 2017 Research Support Community Day Colombo Theatre, UNSW 13 February, 2017

Upload: susanmrob

Post on 03-Mar-2017


Page 1: Ada slide presentation rsc day_feb2017_v2

Introduction to ADA: The Australian Data Archive as a Trusted Repository for Research Data

Dr. Steve McEachern
Director, ADA

2017 Research Support Community Day
Colombo Theatre, UNSW
13 February, 2017

Page 2: Ada slide presentation rsc day_feb2017_v2

ADA in Brief

• The Social Science Data Archive (now ADA) was set up in 1981, housed in the Research School of Social Sciences, with a mission to collect and preserve Australian social science data on behalf of the social science research community

• The Archive holds over 5000 datasets from around 1500 studies, including national election studies, public opinion polls, social attitudes surveys, censuses, aggregate statistics, administrative data and many other sources.

• Data holdings are sourced from academic, government and private sectors.

Page 3: Ada slide presentation rsc day_feb2017_v2

So what is a data archive?

• ‘A “trusted system” that provides... an accessible and comprehensive service empowering researchers to locate, request, retrieve and use data resources in a simple, seamless and cost effective way, while at the same time protecting the privacy, confidentiality and intellectual property rights of those involved.’

Social Sciences and Humanities Research Council of Canada. “National Data Archive Consultation Final Report: Building Infrastructure for Access to and Preservation of Research Data in Canada” URL: http://www.sshrc.ca/web/whatsnew/initiatives/da_finalreport_e.pdf [20 November 2003].

Page 4: Ada slide presentation rsc day_feb2017_v2

ADA Subarchives

• Social Science – predominantly survey or polling based quantitative social science data

• Historical – an archive of Australian census data tables from 1834 to the present day

• Indigenous – A thematic archive bringing together research data about Aboriginal and Torres Strait Islanders

• Longitudinal – major longitudinal cohort and panel surveys of the Australian population

• Qualitative – a new collection which provides specialist data archiving and access services to qualitative researchers

• Crime & Justice – major collections of data in crime, law and justice, including criminal justice administrative data

• International – a central point of access for links to international data sources around the world

Page 5: Ada slide presentation rsc day_feb2017_v2

ADA Data Holdings

ADA data holdings cover a wide variety of subject areas:

• Ageing
• Business and management
• Census data
• Culture
• Demography
• Drugs, alcohol and tobacco
• Economics
• Education, employment and work
• Environment, Conservation, Land use
• Family studies
• Foreign affairs
• Gambling
• Health
• Housing
• Law, Crime, Courts
• Mass media, communication and language
• Migration, immigration and multiculturalism
• Politics and elections
• Public opinion and social attitudes
• Psychology
• Quality of life
• Science, Technology
• Social welfare
• Sociology
• Tourism, recreation and leisure
• Travel and transport

Page 6: Ada slide presentation rsc day_feb2017_v2

Example studies

• Australian Survey of Social Attitudes (ANU, UWA, UQ, …)
• Longitudinal Surveys of Australian Youth (NCVER)
• Australian Election Studies (ANU, QUT)
• ANUPolls, Morgan Gallup Polls, Age Polls, Lowy polls (1947 – Present)
• Colonial census tables and images, 1838-1901 (ABS)
• Census tabulations, 1966 – Present (ABS)
• National Drug Strategy Household Survey, 1994 – Present (AIHW)
• Australian Workplace Relations Survey, 1990, 1995, 2014 (forthcoming) - Dept of Employment
• Negotiating the Life Course (ANU, AIFS, UQ)

Page 7: Ada slide presentation rsc day_feb2017_v2

Forthcoming

• Longitudinal studies
– Department of Social Services
  • HILDA, LSAC, LSIC, BNLA
– National Centre for Vocational Education Research
  • LSAY new wave
– Department of Health
  • Australian Longitudinal Studies on Women's Health (ALSWH) and Men's Health (Ten to Men)
– Bruce: Child Support study
• Exercise, Recreation and Sport Survey 2001-2010 (Australian Sports Commission)
• Giving Australia survey (DSS)

Page 8: Ada slide presentation rsc day_feb2017_v2

The ADA website

Page 9: Ada slide presentation rsc day_feb2017_v2

The ADA Study Page

Page 10: Ada slide presentation rsc day_feb2017_v2

Dataset study pages

Study information is based on the DDI-C (Data Documentation Initiative) standard, and includes:

• Study: information including the investigators, abstract, sample, data collection methods, and access requirements.

• Variables: a list of variables available in a quantitative dataset

• Related Materials: additional documentation (reports, questionnaires, technical information), links and other related studies (e.g. others in the series) that may interest you

Page 11: Ada slide presentation rsc day_feb2017_v2

Who uses ADA?

• 2016
– 12000 online analyses (usually crosstabulations)
– 1100 data file downloads
• Registrations:
– Approx. 1000 new users each year
• User types:
– Undergraduates: 41% of analysis, 16% of downloads
– Postgraduates: 33% / 40%
– Researchers: 11% / 40%
– Others (media, government, NGO, etc.): 15% / 4%
• Institution types: (approx.)
– Australian universities: 70%
– International universities: 15%
– Government departments and agencies: 10%
– Other: 5%
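To give a feel for the scale behind the user-type percentages, the shares can be turned into approximate counts. This is only a rough sketch: the totals and percentages are taken from the slide, while the function and variable names are made up for illustration, and the slide figures are themselves rounded.

```python
# Totals and percentage shares copied from the slide (2016 figures).
TOTAL_ANALYSES = 12000
TOTAL_DOWNLOADS = 1100

# user type -> (share of online analyses in %, share of downloads in %)
SHARES = {
    "Undergraduates": (41, 16),
    "Postgraduates": (33, 40),
    "Researchers": (11, 40),
    "Others": (15, 4),
}


def approximate_counts(shares, total_analyses, total_downloads):
    """Convert percentage shares into approximate absolute counts."""
    return {
        user_type: (
            round(total_analyses * analysis_pct / 100),
            round(total_downloads * download_pct / 100),
        )
        for user_type, (analysis_pct, download_pct) in shares.items()
    }


counts = approximate_counts(SHARES, TOTAL_ANALYSES, TOTAL_DOWNLOADS)
for user_type, (analyses, downloads) in counts.items():
    print(f"{user_type}: ~{analyses} analyses, ~{downloads} downloads")
```

On these figures, undergraduates account for the largest share of online analyses, while postgraduates and researchers together account for most downloads.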

Page 12: Ada slide presentation rsc day_feb2017_v2

Data dissemination options

Page 13: Ada slide presentation rsc day_feb2017_v2

The ADA study page

Study information is available through the tabs at the top of the study:

• Study: information including the investigators, abstract, sample, data collection methods, and access requirements.

• Variables: a list of variables available in a quantitative dataset

• Related Materials: additional documentation, links and other related studies (e.g. others in the series) that may interest you

The study page is also the access point for the ADA Nesstar system, for:

• Analysis of quantitative data online,

• Download of data to your own computer.

Note: you will need to log in to your ADA user account in order to access the Nesstar system.

Page 14: Ada slide presentation rsc day_feb2017_v2

Types of access

• Browse (viewing metadata):
– Open access
• Analyse (Online analysis): free user registration
– General access studies: Free access for registered users
– Restricted studies: User still requires approval to access
• Data download:
– For unrestricted data: submit a user request, and sign ADA general user undertaking (reviewed by ADA staff)
– For restricted data: restricted access request form and specific user undertaking (reviewed by ADA and depositor of data)
– Special access: depends on the particular access requirements
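The access tiers above can be read as a small decision table. The sketch below is one possible way to express it, not ADA's actual implementation: the enum, the function name and the step labels are hypothetical, and "special access" is left out because the slide says it depends on the particular study.

```python
from enum import Enum


class Action(Enum):
    BROWSE = "browse"
    ANALYSE = "analyse"
    DOWNLOAD = "download"


def access_requirements(action, restricted=False):
    """Return the steps a user completes before `action`, per the slide's tiers."""
    if action is Action.BROWSE:
        return []  # viewing metadata is open access
    if action is Action.ANALYSE:
        steps = ["free user registration"]
        if restricted:
            steps.append("approval to access")
        return steps
    if action is Action.DOWNLOAD:
        if restricted:
            return [
                "restricted access request form",
                "specific user undertaking",
                "review by ADA and depositor",
            ]
        return [
            "user request",
            "general user undertaking",
            "review by ADA staff",
        ]
    raise ValueError(f"unknown action: {action!r}")


print(access_requirements(Action.DOWNLOAD, restricted=True))
```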

Page 15: Ada slide presentation rsc day_feb2017_v2

Browsing: The ADA Study Page

Page 16: Ada slide presentation rsc day_feb2017_v2

Exploring data in Nesstar

• The information about the study (from the ADA study page) is also available in Nesstar. Click on the Dataset icon to explore the study.

• For quantitative analysis, you can also view basic statistics and charts for individual variables in this section, by exploring the Variables tab

Page 17: Ada slide presentation rsc day_feb2017_v2

Exploring variables in Nesstar

Page 18: Ada slide presentation rsc day_feb2017_v2

Creating a cross-tabulation

Page 19: Ada slide presentation rsc day_feb2017_v2

Downloading data

• Nesstar is also used as the ADA data download system, to export the data files for the study to your own computer.

• To download data, you need to have been approved for download access for the study you are interested in.

• This can be done by submitting a Request for Data Access:
– a) from the “Request Analysis and Download access” link from a study page, OR
– b) from your personal User page (http://users.ada.edu.au)

• This request then goes to the ADA User Services team for approval.

• Once your download access has been approved, you will receive an email notification from ADA, and a link to the study will be added to your User Page.

Page 20: Ada slide presentation rsc day_feb2017_v2

Managing and Depositing Data: ADA and DDI

Page 21: Ada slide presentation rsc day_feb2017_v2

Data deposit: ADAPT

Page 22: Ada slide presentation rsc day_feb2017_v2

Archival processing

Manual system with some automation tools

1. Deposit:
– Review of ADAPT submission
– Storage via ADAPT to file store
2. Data processing:
– File format conversion (usually to SPSS for processing)
– Privacy/confidentiality review
– Data cleaning (in consultation with depositor)
3. Metadata processing:
– DDI-C metadata creation in Nesstar Publisher
4. Publishing:
– Archival storage and access format creation
– Data publication to Nesstar server
– Metadata publication to Nesstar and ADA CMS
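The four processing steps form an ordered pipeline, which could be modelled roughly as follows. This is an illustrative sketch only: the stage and task names come from the slide, but the `Stage` class and `next_stage` helper are hypothetical and say nothing about how ADA's tools are actually built.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Stage:
    name: str
    tasks: List[str]


# Stages and tasks in the order given on the slide.
PIPELINE = [
    Stage("Deposit", ["Review of ADAPT submission", "Storage via ADAPT to file store"]),
    Stage("Data processing", [
        "File format conversion",
        "Privacy/confidentiality review",
        "Data cleaning",
    ]),
    Stage("Metadata processing", ["DDI-C metadata creation in Nesstar Publisher"]),
    Stage("Publishing", [
        "Archival storage and access format creation",
        "Data publication to Nesstar server",
        "Metadata publication to Nesstar and ADA CMS",
    ]),
]


def next_stage(completed: Set[str]) -> Optional[Stage]:
    """Return the first stage not yet completed, or None when all are done."""
    for stage in PIPELINE:
        if stage.name not in completed:
            return stage
    return None


stage = next_stage({"Deposit"})
print(stage.name, "->", stage.tasks)
```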

Page 23: Ada slide presentation rsc day_feb2017_v2

Future directions

Page 24: Ada slide presentation rsc day_feb2017_v2

Future trends

• Mandated rather than recommended data archiving
– How do we scale?
– Looking at self-deposit systems
• Open access to data as the default
– Government: PM&C Open Data Policy, data.gov(.au/.uk)
– Research: Horizon2020, ESRC, NSF, ARC/NHMRC??
• Broader range of data types available
– Qualitative data: YES
– Social media data:
  • Raw feed (firehose): NO
  • Processed data: ??? (how to support access)
– Administrative data: ???
• Broader range of users of that data
– Different disciplines: health, environment, comp. sci.
– Different users: public/media/government
– Different geographies: internationally

Page 25: Ada slide presentation rsc day_feb2017_v2

Core needs for social science data

• Collection
• Preservation
• Integration
• Analysis
• Dissemination

Page 26: Ada slide presentation rsc day_feb2017_v2

ADA trusted digital repository project

• Funded by ANDS 2016-17
• Aims:
– Completion of the Data Seal of Approval self-assessment and certification process
  • http://www.datasealofapproval.org/en/
  • 16 requirements
  • Assessment on 0-4 scale
  • All requirements must be at least a 1
– Implementation of improvements to ADA systems and procedures to improve certification assessment
– Review of the DSA certification process and criteria to assess suitability for the Australian research data environment
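The self-assessment rule stated above (16 requirements, each scored 0-4, none below 1) is simple enough to check mechanically. The following is a minimal sketch under that reading of the slide; the function names and the dictionary layout are assumptions, and it does not reproduce the DSA's own tooling or review process.

```python
REQUIREMENT_COUNT = 16  # number of DSA requirements, per the slide
MIN_LEVEL = 1           # every requirement must score at least this
MAX_LEVEL = 4           # top of the 0-4 assessment scale


def failing_requirements(scores):
    """Return the requirement numbers scored below MIN_LEVEL.

    `scores` maps requirement number (1..16) to a self-assessed level (0..4).
    """
    if len(scores) != REQUIREMENT_COUNT:
        raise ValueError(f"expected {REQUIREMENT_COUNT} scores, got {len(scores)}")
    for number, level in scores.items():
        if not 0 <= level <= MAX_LEVEL:
            raise ValueError(f"requirement {number}: level {level} is outside 0-{MAX_LEVEL}")
    return sorted(number for number, level in scores.items() if level < MIN_LEVEL)


def meets_minimum(scores):
    """True when no requirement falls below the minimum level."""
    return not failing_requirements(scores)


# Hypothetical self-assessment: everything at level 3 except requirement 7.
example = {number: 3 for number in range(1, REQUIREMENT_COUNT + 1)}
example[7] = 0
print(failing_requirements(example), meets_minimum(example))
```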

Page 27: Ada slide presentation rsc day_feb2017_v2

DSA requirements

• “Fundamental to the following guidelines are five criteria, that together determine whether or not the digital research data may be qualified as sustainably archived:
– The research data can be found on the Internet.
– The research data are accessible, while taking into account relevant legislation with regard to personal information and intellectual property of the data.
– The research data are available in a usable format.
– The research data are reliable.
– The research data can be referred to.”

• http://www.datasealofapproval.org/media/filer_public/2013/09/27/dsa-booklet_1_june2010.pdf

Page 28: Ada slide presentation rsc day_feb2017_v2

The guidelines

• “The associated guidelines relate to the implementation of these criteria and focus on three stakeholders: the data producer, the data repository and the data consumer.
1. The data producer is responsible for the quality of the digital research data.
2. The data repository is responsible for the quality of storage and availability of the data and data management.
3. The data consumer is responsible for the quality of use of the digital research data.”
– http://www.datasealofapproval.org/media/filer_public/2013/09/27/dsa-booklet_1_june2010.pdf

• Guidelines: https://drive.google.com/file/d/0B4qnUFYMgSc-eDRSTE53bDUwd28/view

Page 29: Ada slide presentation rsc day_feb2017_v2

Repositories and archives project

• With UNSW Library (Maude Frances)
• Exploring mechanisms for deposit and preservation of data through repository to the data archive
• Questions we are exploring:
– Where should we deposit the data?
– Who should store the data?
– What metadata should we collect?
– Who should manage the metadata?
– How to transfer content (data and metadata) between repository and archive?
– How to determine the “source of truth”? (e.g. who should mint the DOI?)

Page 30: Ada slide presentation rsc day_feb2017_v2

ADA Dataverse

• Redevelopment of our database and website infrastructure
– New website
– New data catalogue
• New functionality:
– Self-deposit of data
– Open data access
– API access (both for deposit and access, e.g. through R)
– Shibboleth authentication
• Currently in early testing
– For completion in 2017 (probably Q3)

• Functionality intended to support additional DSA requirements
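As a rough idea of what API access might look like, the sketch below builds a query URL in the style of the public Dataverse Search API (`/api/search` with `q`, `type` and `per_page` parameters). The slides do not describe ADA's API, so treat this as an assumption about a standard Dataverse installation: the base URL is a placeholder, the function name is hypothetical, and no request is actually sent. The slide mentions R as one client; Python is used here only for illustration.

```python
from urllib.parse import urlencode

# Placeholder host: substitute the address of a real Dataverse installation.
BASE_URL = "https://dataverse.example.org"


def build_search_url(query, item_type="dataset", per_page=10, base_url=BASE_URL):
    """Build a Dataverse-style Search API URL without sending a request."""
    params = {"q": query, "type": item_type, "per_page": per_page}
    return f"{base_url}/api/search?{urlencode(params)}"


print(build_search_url("election study"))
```

The returned URL could then be passed to any HTTP client; deposit operations would additionally need an API token, which is omitted here.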

Page 31: Ada slide presentation rsc day_feb2017_v2

ADA Dataverse

Page 33: Ada slide presentation rsc day_feb2017_v2
Page 34: Ada slide presentation rsc day_feb2017_v2
Page 35: Ada slide presentation rsc day_feb2017_v2

Data documentation standards

Page 36: Ada slide presentation rsc day_feb2017_v2

DDI-Codebook

• Two flavours of DDI – Codebook and Lifecycle
• Focus on DDI-C, four sections:
1. Document description: characteristics of the DDI XML document itself
2. Study description: characteristics of the Study (project) that the DDI is describing (including Related Materials: documents associated with the project, such as questionnaires, codebooks, etc.)
3. File description: characteristics of the physical data files
4. Variable description: characteristics of the variables in the data file
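The four sections map onto the top-level elements of a DDI-Codebook document (`docDscr`, `stdyDscr`, `fileDscr`, `dataDscr` under a `codeBook` root). The sketch below builds a bare skeleton with Python's standard library; it omits the DDI namespace and most required content, so it is not schema-valid, and the function name and sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_ddi_skeleton(study_title, file_name, variable_name, variable_label):
    """Build a minimal, non-validating DDI-Codebook outline."""
    root = ET.Element("codeBook")

    # 1. Document description: about the DDI XML document itself
    doc_title_stmt = ET.SubElement(
        ET.SubElement(ET.SubElement(root, "docDscr"), "citation"), "titlStmt"
    )
    ET.SubElement(doc_title_stmt, "titl").text = f"DDI record for {study_title}"

    # 2. Study description: about the study (project) being described
    study_title_stmt = ET.SubElement(
        ET.SubElement(ET.SubElement(root, "stdyDscr"), "citation"), "titlStmt"
    )
    ET.SubElement(study_title_stmt, "titl").text = study_title

    # 3. File description: about the physical data files
    file_txt = ET.SubElement(ET.SubElement(root, "fileDscr"), "fileTxt")
    ET.SubElement(file_txt, "fileName").text = file_name

    # 4. Variable description: about the variables in the data file
    var = ET.SubElement(ET.SubElement(root, "dataDscr"), "var", name=variable_name)
    ET.SubElement(var, "labl").text = variable_label

    return root


skeleton = build_ddi_skeleton("Example study", "example.sav", "age", "Age of respondent")
print(ET.tostring(skeleton, encoding="unicode"))
```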

Page 37: Ada slide presentation rsc day_feb2017_v2

Dublin Core

• Title
• Creator
• Subject
• Description
• Publisher
• Contributor
• Date
• Type
• Format
• Identifier
• Source
• Language
• Relation
• Coverage
• Rights
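A Dublin Core record is essentially a flat set of these fifteen elements. As a minimal sketch, the code below serialises a dictionary into XML using the Dublin Core element namespace; the `metadata` wrapper element, the function name and the sample values are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The fifteen Dublin Core elements listed on the slide.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_dublin_core(record):
    """Serialise a {element: value} dict into a simple Dublin Core XML tree."""
    unknown = set(record) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name in DC_ELEMENTS:       # emit in a stable, conventional order
        if name in record:
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = record[name]
    return root


# Made-up sample values, for illustration only.
dc_record = to_dublin_core({
    "creator": "Example Investigator",
    "title": "Example study title",
    "date": "2017",
})
print(ET.tostring(dc_record, encoding="unicode"))
```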

Page 38: Ada slide presentation rsc day_feb2017_v2

DCAT (W3C)

DCAT standard is relatively simple, and includes four basic objects:

• Dataset: “a collection of data, published or curated by a single agent, and available for access or download in one or more formats”

• Data catalog(ue): “a curated collection of metadata about datasets”

• Catalog(ue) record: “a record in a data catalog, describing a single dataset”

• Distribution: “represents a specific available form of a dataset”

• Key object for SRC is the Dataset
– others are distribution-related
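The relationships between the four DCAT objects can be sketched with plain data classes. This is a simplified illustration of the structure described above, not an implementation of the W3C vocabulary: the field names are chosen for readability rather than taken from DCAT property names, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    """A specific available form of a dataset (e.g. one file format)."""
    file_format: str
    access_url: str


@dataclass
class Dataset:
    """A collection of data published or curated by a single agent."""
    title: str
    publisher: str
    distributions: List[Distribution] = field(default_factory=list)


@dataclass
class CatalogRecord:
    """A record in a catalogue describing a single dataset."""
    dataset: Dataset
    modified: str


@dataclass
class Catalog:
    """A curated collection of metadata about datasets."""
    title: str
    records: List[CatalogRecord] = field(default_factory=list)

    def datasets(self) -> List[Dataset]:
        return [record.dataset for record in self.records]


# Invented sample values, for illustration only.
survey = Dataset(
    title="Example survey",
    publisher="Example archive",
    distributions=[
        Distribution("SPSS", "https://example.org/survey.sav"),
        Distribution("CSV", "https://example.org/survey.csv"),
    ],
)
catalog = Catalog("Example catalogue", [CatalogRecord(survey, modified="2017-02-13")])
print([d.file_format for d in catalog.datasets()[0].distributions])
```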

Page 39: Ada slide presentation rsc day_feb2017_v2

ADA systems architecture

Page 40: Ada slide presentation rsc day_feb2017_v2

Approach

• Core archive website:
– http://www.ada.edu.au
• Sub-archives focussed on specialised thematic or methodological areas
– e.g. http://www.ada.edu.au/indigenous/home
• “Add-on” systems for complex analysis or visualisation tasks:
– Nesstar
– GIS: http://gis-test.ada.edu.au
– Longitudinal visualisation: Panemalia
– Historical census data: http://hccda.ada.edu.au

Page 41: Ada slide presentation rsc day_feb2017_v2

OAIS architecture

Page 42: Ada slide presentation rsc day_feb2017_v2

Data sharing policies in Australia

Page 43: Ada slide presentation rsc day_feb2017_v2

Policy trends in data access

• Mandated rather than recommended data archiving
• Open access to data as the default (NSF, Office of the President, data.gov(.au,.uk))
• Broader range of data types available
• Broader range of users of that data

Page 44: Ada slide presentation rsc day_feb2017_v2

Policy drivers

• Funders: Return on investment:
– Government data: Treasury, PM&C
– Research data: ARC/NHMRC, Horizon 2020
• Journal publishers: Reputation:
– Open access journals (e.g. PLOS One) and
– For-profit publishers (e.g. Nature, Science, Elsevier) concerned about loss of credibility from fraudulent research
• Learned societies and disciplines: Good science AND reputation:
– American Political Science Association: DA-RT initiative
– American Economic Association:

Page 45: Ada slide presentation rsc day_feb2017_v2

Government data

• Australia: Australian Government Public Data Policy Statement

– The Australian Government commits to optimise the use and reuse of public data; to release non-sensitive data as open by default; and to collaborate with the private and research sectors to extend the value of public data for the benefit of the Australian public.

– Public data includes all data collected by government entities for any purposes including; government administration, research or service delivery.

– Non-sensitive data is anonymised data that does not identify an individual or breach privacy or security requirements.

– https://www.dpmc.gov.au/sites/default/files/publications/aust_govt_public_data_policy_statement_1.pdf

Page 46: Ada slide presentation rsc day_feb2017_v2

Research data

• Australian Code for the Responsible Conduct of Research

• https://www.nhmrc.gov.au/guidelines-publications/r39 (Joint ARC/NHMRC publication)

• Section 2: Management of research data and primary materials

• Then provides related links to ethics statements and similar

Page 47: Ada slide presentation rsc day_feb2017_v2

ACRCR Section 2: Responsibilities of Institutions

Section 2.1.1: In general, the minimum recommended period for retention of research data is 5 years from the date of publication. However, in any particular case, the period for which data should be retained should be determined by the specific type of research. For example:

• for short-term research projects that are for assessment purposes only, such as research projects completed by students, retaining research data for 12 months after the completion of the project may be sufficient

• for most clinical trials, retaining research data for 15 years or more may be necessary

• for areas such as gene therapy, research data must be retained permanently (eg patient records)

• if the work has community or heritage value, research data should be kept permanently at this stage, preferably within a national collection.
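The retention guidance above can be summarised as a lookup from research type to a minimum period. The sketch below is illustrative only: the category keys and function name are invented, the periods are the minimums quoted on the slide ("15 years or more" is treated as 15), and the caller is assumed to supply the relevant start date (publication date in the general case, project completion for student projects).

```python
from datetime import date

# Minimum retention in years; None means "retain permanently".
# Category keys are hypothetical labels for the cases listed on the slide.
RETENTION_YEARS = {
    "general": 5,                   # from the date of publication
    "student_assessment": 1,        # 12 months after project completion
    "clinical_trial": 15,           # "15 years or more"
    "gene_therapy": None,           # permanent
    "community_or_heritage": None,  # permanent
}


def earliest_disposal_date(category, start):
    """Return the earliest date data could be disposed of, or None if never."""
    years = RETENTION_YEARS[category]
    if years is None:
        return None
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # start was 29 February and the target year is not a leap year
        return start.replace(year=start.year + years, day=28)


print(earliest_disposal_date("general", date(2017, 2, 13)))
```

These are minimums; institutional policy or the nature of the data may require longer retention.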

Page 48: Ada slide presentation rsc day_feb2017_v2

ARC statement

"Researchers and institutions have an obligation to care for and maintain research data in accordance with the Australian Code for the Responsible Conduct of Research (2007). The ARC considers data management planning an important part of the responsible conduct of research and strongly encourages the depositing of data arising from a Project in an appropriate publicly accessible subject and/or institutional repository"

Page 49: Ada slide presentation rsc day_feb2017_v2

ANDS suggest three questions

1. Where will your research data be stored at completion of the project?

2. What access will you provide to the data set on completion of the project?

3. How will you enable others to reuse your research data?

Page 50: Ada slide presentation rsc day_feb2017_v2

Horizon 2020

• http://ec.europa.eu/research/participants/docs/h2020-funding-guide/cross-cutting-issues/open-access-data-management/open-access_en.htm

• (All grants): Develop a data management plan (DMP) within 6 months of commencement of project

• Pilot program (2014-17):

– Deposit research data described in DMP, preferably in a research data repository

– As far as possible, projects must then take measures to enable third parties to access, mine, exploit, reproduce and disseminate (free of charge for any user) this research data.

– Guidelines recommend FAIR principles

Page 51: Ada slide presentation rsc day_feb2017_v2

FAIR principles

• Findable
• Accessible
• Interoperable
• Reusable

• Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016).