Data Intensive Science:

Shades of Grey

Keith G Jeffery a *, Anne Asserson b

a Keith G Jeffery Consultants, Shrivenham, SN6 8AH, UK

b University of Bergen, Bergen, 5009, Norway

©Keith G Jeffery, Anne Asserson CRIS14 Rome May 2014 1


Structure

• Introduction

– Data Intensive Science

– Grey

– Research Information

– Open Government Data

• Reliable Information

– Quality

– Context

– Availability

• Rich Metadata

– CERIF

– 3-layer model

• Conclusion


Data Intensive Science

• Research datasets / databases

– High volume

– High velocity (of change)

– Complex structures

– Streamed

• Data mining

– Patterns

– Induction

• Not all ‘patterns’ or ‘rules’ are valid hypotheses


Data Intensive Science

• Related Concepts

– Open data

• Available

• Toll free

– Big Data

• Volume

• Complexity

– Cloud computing

• Virtualisation

• Elasticity

• Pay-as-you-go


Grey

• That which is not white

– NOT peer reviewed

• Typically

– PhD / MS theses

– Technical reports

– Lab notebooks

– Manuals

• But also

– Newsletters

– Advertising

• And, importantly

– Datasets

– Software

– Licences

• Patents are peer reviewed

– Special process

• PhD theses are peer reviewed

– Twice if composed of published papers

• Technical reports undergo internal peer review

– May be basis of commercial success

• Increasingly research datasets are peer reviewed

– Especially biomedical


Research Information

• White: assured

– by peer review

– Publishers

• Impact factor

– Gold OA: Beall’s list http://scholarlyoa.com/2014/01/02/list-of-predatory-publishers-2014/

– San Francisco declaration http://am.ascb.org/dora/

• Grey: how to assure

– Quality

– Relevance

– Access

• So it can be reviewed

• Review methods:

– Usage

– Citation

– Annotation

– Impact (commercial/social take-up)


Open Government Data

• Motivation

– Transparency

– Commercialisation

• Derivation

– Commonly summarised from publicly-funded research

• Vast majority .pdf; then .csv, then .xls

• Metadata: DC (Dublin Core) or CKAN

• ENGAGE Project


Reliable Information

• Quality

– Represents accurately world of interest

• Context

– Environment within which collected: related entities

• Persons, organisations, projects, funding, equipment, publications…

• Availability

– Persistence (preservation / curation)

– Conditions of use (open access)


Reliable Information: Quality

• Data integrity

– Schema

– Constraints

• Accuracy, precision

• Incomplete and inconsistent information

• Temporal validity

• Independent validation

– Quality rating
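The integrity, constraint and temporal-validity checks listed above could be sketched as follows. This is a minimal illustration rather than anything taken from the slides: the field names, the temperature bounds and the `check_record` helper are all assumptions chosen for the example.

```python
from datetime import date

# Hypothetical schema: field name -> (expected type, required?)
SCHEMA = {
    "station_id": (str, True),
    "measured_on": (date, True),
    "temperature_c": (float, True),
    "comment": (str, False),
}


def check_record(record, valid_from, valid_to):
    """Return a list of quality problems found in one record."""
    problems = []

    # Schema check: required fields are present and values have the right type.
    for name, (expected_type, required) in SCHEMA.items():
        value = record.get(name)
        if value is None:
            if required:
                problems.append(f"missing: {name}")
            continue
        if not isinstance(value, expected_type):
            problems.append(f"wrong type: {name}")

    # Constraint check: illustrative plausibility bounds, assumed for this sketch.
    temp = record.get("temperature_c")
    if isinstance(temp, float) and not -90.0 <= temp <= 60.0:
        problems.append("out of range: temperature_c")

    # Temporal validity: the measurement must fall inside the declared period.
    measured = record.get("measured_on")
    if isinstance(measured, date) and not valid_from <= measured <= valid_to:
        problems.append("outside validity period: measured_on")

    return problems


good = {"station_id": "A1", "measured_on": date(2014, 5, 1), "temperature_c": 11.5}
bad = {"station_id": "A1", "measured_on": date(2015, 1, 1), "temperature_c": 120.0}

good_problems = check_record(good, date(2014, 1, 1), date(2014, 12, 31))
bad_problems = check_record(bad, date(2014, 1, 1), date(2014, 12, 31))
```

An independent validator could turn the length of the returned list into a simple quality rating, although how such a rating is defined is left open here.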


Reliable Information: Context

• Related entities

– Give confidence that the dataset is understood in context

• Purpose, subject area, research method, associated information

– Used to evaluate dataset for relevance and quality

• Relevance: subject area, geospatial / temporal coordinates

• Quality: organisation, person, publications, facility, equipment, citations


Reliable Information: Availability

• Persistence

– Media migration

• Who can read a 7 inch floppy disk? Or a 3420 IBM tape?

– Declared syntax and semantics

• Machine readable AND machine understandable

– Preservation of related software

• Changing languages, compilers / interpreters

• Changing operating environment (sequential, parallel, distributed, data dependencies)

• Specifications

• Access

– Open

– Toll-free (conditions, licences)


CERIF

• (Common European Research Information Format)

• EU Recommendation to member states

• Used in 42 countries

• National standard in 10

• Maintained, developed, promoted by euroCRIS (not for profit) www.eurocris.org


Dataset Relationships (1)

• Project

• Organisation

– Collector/creator

– Owner

– Funder

– User

• Person

– Collector/creator

– Owner

– Funder

– User

• Name, Description, Keywords

• Classification scheme(s)

• GeoBBox

– Measurement for precision

• Funding

• Facility

• Equipment


Dataset Relationships (2)

• Publication

– Scholarly

– Licence

– Data Management policy (including preservation)

• Product

– Dataset

– Software

– Dataset schema

• Citation

• Measurement

– Volume

– Velocity (of change)

– Accuracy

– Precision

• Medium

– classification

Temporal coordinates are managed by timestamps on linking relations (e.g. Project-Product) or, if the content refers to an era, by classification.

This provides provenance through time-stamped role-based relationships.
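A time-stamped, role-based relationship of this kind could be sketched as a small data structure. The sketch below is an assumption-laden illustration, not CERIF syntax: the `Link` class, its field names, the `holders` helper and the identifiers are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Link:
    """One time-stamped, role-based relation between a dataset and another entity."""
    dataset_id: str
    entity_type: str              # e.g. "Person", "Organisation", "Project"
    entity_id: str
    role: str                     # e.g. "creator", "owner", "funder", "user"
    start: date
    end: Optional[date] = None    # None means the relation is still current


def holders(links: List[Link], dataset_id: str, role: str, on: date) -> List[str]:
    """Return ids of entities that held `role` for `dataset_id` on the date `on`."""
    return [
        link.entity_id
        for link in links
        if link.dataset_id == dataset_id
        and link.role == role
        and link.start <= on
        and (link.end is None or on <= link.end)
    ]


# Hypothetical provenance: ownership passes from org-A to org-B at the end of 2012.
links = [
    Link("ds-1", "Organisation", "org-A", "owner", date(2010, 1, 1), date(2012, 12, 31)),
    Link("ds-1", "Organisation", "org-B", "owner", date(2013, 1, 1)),
    Link("ds-1", "Person", "p-7", "creator", date(2010, 1, 1)),
]

owner_2011 = holders(links, "ds-1", "owner", date(2011, 6, 1))
owner_2014 = holders(links, "ds-1", "owner", date(2014, 5, 1))
```

Because each relation carries its own start and end dates, the question "who owned this dataset at a given time" can be answered without overwriting earlier records, which is the sense in which the relationships give provenance.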


Datasets in CERIF: The Debate

• Keep as Product

– Seems to work

– Mainly ‘attributes’ are in linked relations and linking relations

– If dataset is made special, what about software, dataset schema, …

• Create new entity

– Gives higher ‘status’

– Additional attributes required for datasets over product

– Dataset is important and increasingly so; software not yet

Need:

1. Use cases

2. Mapping to existing CERIF

3. Analyse problems


The Vision: Metadata Stack

[Diagram: three metadata layers, DISCOVERY (DC, eGMS…), CONTEXT (CERIF) and DETAIL (subject or topic specific); arrows labelled "Generate" and "Point to"; side labels "Linked open data" and "Formal Information Systems"]
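One way to read the stack is that a compact discovery record is generated from the richer contextual record, which in turn points to the detailed, subject-specific metadata. A minimal sketch under that assumption follows. The record layout, the `to_discovery` helper, the sample values and the example.org address are hypothetical; only the output keys follow Dublin Core element names.

```python
def to_discovery(context):
    """Generate a flat, Dublin Core-style discovery record from a contextual record."""
    creators = [
        rel["entity"] for rel in context["relations"] if rel["role"] == "creator"
    ]
    return {
        "identifier": context["id"],
        "title": context["name"],
        "description": context["description"],
        "subject": list(context["keywords"]),
        "creator": creators,
    }


# Hypothetical contextual record for one dataset (illustrative, not real CERIF syntax).
context = {
    "id": "ds-1",
    "name": "Example temperature series",
    "description": "Daily water temperature readings.",
    "keywords": ["oceanography", "temperature"],
    "relations": [
        {"entity": "person:p-7", "role": "creator"},
        {"entity": "org:org-B", "role": "owner"},
    ],
    # "Point to": where the detailed, subject-specific metadata is held.
    "detail": "https://example.org/detail/ds-1",
}

discovery = to_discovery(context)
```

The discovery record deliberately drops the role-based relations and the pointer to the detail layer: it is meant for finding a dataset, while judging relevance and quality still needs the contextual layer.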


Open Data and Information Processing

• LOD, Semantic Web, RDF: browsing, ease of use

• Relational (links): integrity, performance

• Arrows between the two: "generate" and "provide access to"

• Example: summary data in semantic web/LOD environment (RDF) with associated processing

• Example: research datasets in relational DB environment with associated analysis, visualisation, data mining…

• Manual download, manual connection to software, manual integration

• Automated download, automatic connection to software, automated integration


The Vision: The Models

[Diagram: four models (Processing Model, User Model, Data Model, Resource Model) with the annotations "Complete ICT environment for research"; "Complete cohort of researchers, research managers, innovators, media"; "interaction with data, processing, persons"; "providing what the user requires"; "representing research"; "representing ICT"]


CONCLUSION

• We assert three points:

– (1) in the context of data-intensive science, the importance of grey;

– (2) the need for reliability mechanisms to ensure the quality and relevance of grey; and

– (3) the need for rich metadata to support the usage of grey.

• Grey includes research datasets and open government data

USE CERIF FOR DATA INTENSIVE SCIENCE