open data for open science: implications for european universities geoffrey boulton eua, brussels...
DESCRIPTION
Why is “open data” a big current issue? The data deluge from powerful acquisition tools, coupled with powerful tools for storing, manipulating, analysing, displaying and transmitting data, and citizens’ interest in scrutinising scientific claims, have created new challenges and new opportunities that require new forms of openness and novel social dynamics in science. In the last century universities have played a key role in the progress of science. Will they address the challenges and exploit the opportunities as science changes around them?
TRANSCRIPT
Open Data for Open Science: implications for European universities
Geoffrey Boulton
EUA, Brussels 2012
Some emerging conclusions from a Royal Society Policy Report: “Science as an Open Enterprise”
Open data as the engine of the “scientific revolution”
Publish scientific theories – and the experimental and observational data on which they are based – to permit others to scrutinise them, to identify errors, to support, reject or refine theories and to reuse data for further understanding and knowledge.
Henry Oldenburg - Father of “open science”
Why is “open data” a big current issue? The data deluge from powerful acquisition tools
coupled with
powerful tools for storing, manipulating, analysing, displaying and transmitting data
and
citizens’ interest in scrutinising scientific claims
have created
new challenges & new opportunities that require new forms of openness and novel social dynamics in science
In the last century universities have played a key role in the progress of science. Will they address the challenges and exploit the opportunities as science changes around them?
Challenges
• Maintaining scientific self-correction (closing the concept-data gap)
• Responding to citizens’ demands for evidence in “public interest science”
Opportunities
• Exploiting data-intensive science – a 4th paradigm?
• The potential of linked data
• “Data is the new raw material for business”
• Exposing malpractice and fraud
• Stimulating citizen science
• A second “open science” revolution?
Aspiration: all scientific literature online, all data online, and for them to interoperate
Openness of data per se has no value. Open science is more than disclosure.
For effective communication, we need intelligent openness. Data must be:
• Accessible
• Intelligible
• Assessable
• Re-usable
Only when these four criteria are fulfilled are data properly open.
Metadata must be audience-sensitive
METADATA (data about data)
Scientific data rarely fits neatly into an EXCEL spreadsheet!
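One way to picture the four criteria of intelligent openness is as checks against a dataset's metadata record. The sketch below is illustrative only: the `DatasetRecord` class, its field names, and the mapping of each field to a criterion are assumptions made for this example, not part of the Royal Society report or of any metadata standard.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; every field name is an assumption."""
    title: str
    access_url: str = ""    # where the data can be obtained
    description: str = ""   # explanation pitched at the intended audience
    provenance: str = ""    # how, and by whom, the data were produced
    licence: str = ""       # terms under which the data may be reused
    data_format: str = ""   # e.g. "CSV" or "NetCDF"


def openness_report(record):
    """Map each of the four criteria to a simple presence check (assumed mapping)."""
    return {
        "accessible": bool(record.access_url),
        "intelligible": bool(record.description),
        "assessable": bool(record.provenance),
        "re-usable": bool(record.licence and record.data_format),
    }


def is_intelligently_open(record):
    """Data count as properly open only when all four criteria are met."""
    return all(openness_report(record).values())


# Example: a record that is disclosed online but carries no licence.
example = DatasetRecord(
    title="Glacier survey (illustrative)",
    access_url="https://example.org/data/glacier-survey",
    description="Ice thickness measurements with units and method notes",
    provenance="Collected by a hypothetical field team",
    data_format="CSV",
)
report = openness_report(example)
```

In this example the record is accessible, intelligible and assessable, but not re-usable, so it would not count as properly open under the slide's definition. Real assessments would need far richer, audience-specific checks than these presence tests.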
Mathematics related discussions
Tim Gowers - crowd-sourced mathematics
An unsolved problem posed on his blog.
32 days – 27 people – 800 substantive contributions
Emerging contributions rapidly developed or discarded
Problem solved!
“It’s like driving a car whilst normal research is like pushing it”
What inhibits such processes? The criteria for credit and promotion.
Precursor of a second open science revolution?
Boundaries of openness?
• Legitimate commercial interests
• Privacy (note that anonymisation is impossible)
• Safety & Security
But the boundaries are fuzzy & complex
Benefits/costs of open data to the science process
Pathfinder disciplines where benefit is recognised and habits are changing
• Bioinformatics (-omics disciplines)
• Biological science
• Particle physics
• Nanotechnology
• Environmental science
• Longitudinal societal data
• Astronomy & space science
Costs
Tier 1 – International databases – e.g. Worldwide Protein Data Bank: >65 staff; $6.5M pa; 1% of cost of collecting data
Tier 3 – Institutional data management – UK 2011, average UK university repository: 1.36 FTE (managerial, administrative, technical)
e.g. Gene Expression Omnibus (GEO) – 2700 GEO uploads by non-contributors in 2000 led to 1150 papers (>1000 additional papers over the 16 that would be expected from investment of $400,000)
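The figures quoted above can be checked with simple arithmetic. The sketch below uses only the numbers on the slide; the variable names are invented for illustration, and the reading of "1% of cost of collecting data" as a straightforward ratio of curation cost to collection cost is an assumption.

```python
# Figures as quoted on the slide (US dollars).
pdb_curation_cost_usd = 6_500_000   # Worldwide Protein Data Bank, per annum
pdb_curation_share_percent = 1      # curation as a share of data collection cost

# Assumed reading: curation cost = 1% of collection cost.
implied_collection_cost_usd = pdb_curation_cost_usd * 100 // pdb_curation_share_percent

geo_papers = 1150                   # papers attributed to reuse of GEO data
expected_papers = 16                # papers expected from the same investment
investment_usd = 400_000

additional_papers = geo_papers - expected_papers
return_multiplier = geo_papers / expected_papers
cost_per_expected_paper_usd = investment_usd / expected_papers

print(f"Implied cost of collecting the data: ${implied_collection_cost_usd:,}")
print(f"Additional papers from reuse: {additional_papers}")
print(f"Papers per expected paper: {return_multiplier:.1f}x")
print(f"Cost per expected paper: ${cost_per_expected_paper_usd:,.0f}")
```

On these numbers, the implied cost of collecting the data is $650 million, the 1134 additional papers are consistent with the ">1000" on the slide, and reuse yields roughly 72 times the papers expected from the same investment.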
Levels of data curation
Tier 1 – International databases
Tier 2 – National (e.g. Research Councils)
Tier 3 – Institutions (Universities & Institutes)
Tier 4 – “Small science” researchers & research groups
Financial sustainability?
Upward data migration
Data loss
Priorities for action - 1
1) Change the mindset: publicly funded data is a public resource
2) Credit for useful data and productive, novel collaboration (the Tim Gowers phenomenon)
3) Mandatory access to data underlying publications
4) Common standards for communicating data
5) Sustainability (the power needs of current modes of data storage will outstrip the global electricity supply within the decade)
Priorities for action - 2
• R & D on software tools (enabling dynamic data; managing the data lifecycle; tracking provenance, citation, indexing and searching, standards & inter-operability, sustainability; note that the ICT industry is often way ahead, and the US prioritises investment here)
• Institutional responsibility for the knowledge they create (cumulative small science data > cumulative big science data)
• Data scientists (they are being trained, and the commercial demand is large)
“Big Iron” is a national infrastructure priority
“Big data” is a science priority – the big costs are people and software, not computers
Actors in stimulating change
• Employers & custodians of research (universities/institutes)
• Funders of research (the cost of curation is a cost of research)
• Publishers of research
• Business – exploiting the opportunity
• Government – is it important? If so, do we need to act?
• Scientists: persuading them to act
Challenges for universities
• Will they rise to the scientific challenge, or leave things to the information business?
• Will they be responsible for the knowledge they create?
• The university library; doing the wrong things with the wrong people?
• Supporting the data manipulation needs of their researchers?
• Supporting intelligent openness
• Open data and commercial imperatives
The stance of the European Commission & European Academies and the international dimension