2016-12-14 GBIF and reuse of research data. GBIF seminar in Bergen

Post on 13-Apr-2017


CC-BY Dag Endresen

•  BIG DATA – a new research paradigm

•  Data curation plan (data-life-cycle)

•  Publish and archive your research data

•  Use shared universal data standards

•  Write metadata, good data documentation

•  "Data paper" and data citation

•  Academic credits for data publishing

•  Use digital, stable and universal identity-numbers (DOI)

DATA EXPLOSION

•  More and more and more data is produced.

•  The challenge ahead is not to produce more data, but knowledge, understanding and capacity to navigate and use very large volumes of data.

•  90% of the data that currently exists was created in just the last two years.

•  Data curation is critical to ensure that data is appropriately structured, available and reusable.

EXPONENTIAL GROWTH FOR DIGITAL DATA

The digital universe will double every two years between now and 2020. The growth is mostly unstructured data (including sensor data from camera traps and weather stations, images, video, sound clips). A major factor behind the expansion is the growth of machine-generated data (from 11% in 2005 to over 40% in 2020).

Image source: EMC/IDC Digital Universe Study, 2012
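As a rough worked example of what "doubling every two years" implies, the sketch below converts a doubling time into an annual growth rate and a cumulative growth factor. The function names are illustrative, and the eight-year span (from the 2012 study year to 2020) is an assumption chosen only to show the arithmetic.

```python
def annual_growth_rate(doubling_time_years: float) -> float:
    """Return the yearly growth rate implied by a fixed doubling time."""
    return 2 ** (1 / doubling_time_years) - 1


def growth_factor(years: float, doubling_time_years: float) -> float:
    """Return how many times larger the data volume is after `years`."""
    return 2 ** (years / doubling_time_years)


rate = annual_growth_rate(2)   # doubling every two years
factor = growth_factor(8, 2)   # assumed span: the eight years from 2012 to 2020

print(f"Annual growth: {rate:.1%}")
print(f"Growth over 8 years: {factor:.0f}x")
```

Under these assumptions the volume grows by roughly 41% per year, which compounds to a sixteen-fold increase over eight years.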

UNSTRUCTURED DATA

"Data! Data! Data!" he cried impatiently. "I can’t make bricks without clay." (Sherlock Holmes in “The Adventure of the Copper Beeches” by Sir Arthur Conan Doyle).

Unstructured data accounts for an estimated 80% of all data in organizations and a whopping 95% of all new data generated daily (Grimes 2008).

Why create a data management plan?

Graphics by Jørgen Stamp, CC-BY

DATA LOSS

Digital data are fragile and susceptible to loss for a wide variety of reasons:

•  Natural disaster

•  Facilities infrastructure failure

•  Storage failure

•  Server hardware/software failure

•  Application software failure

•  Format obsolescence

•  Legal encumbrance

•  Human error

•  Malicious attack

•  Loss of staffing competencies

•  Loss of institutional commitment

•  Loss of financial stability

•  Changes in user expectations

Source: OpenAIRE & EUDAT, CC-BY-4.0, 2013

Image CC BY-NC-SA 2.0 by Dave Hill https://www.flickr.com/photos/dmh650/4031607067

DATA MANAGEMENT PLAN

•  Making your data available to others ensures that your research is truly reproducible.

•  Managing your research data saves time because it ensures that you and others in your collaboration will be able to find, understand, and use the data.

•  Sharing your research data enables wider dissemination of your work.

•  Enabling others to use your data reinforces open scientific inquiry and can lead to new and unanticipated discoveries.

Graphics by Jørgen Stamp, CC-BY

"FAIR" DATA

Findable

–  Assign persistent IDs, provide rich metadata, register in a searchable resource... (such as GBIF)

Accessible

–  Retrievable by their ID using a standard protocol, metadata remain accessible even if data aren’t...

Interoperable

–  Use formal, broadly applicable languages, use standard vocabularies, qualified references... (e.g. Darwin Core, …)

Reusable

–  Rich, accurate metadata, clear licences, provenance, use of community standards... (e.g. Dublin Core, EML, …)

www.force11.org/group/fairgroup/fairprinciples

Slide source: OpenAIRE & EUDAT, CC-BY-4.0, 2013

DATA CITATION PRINCIPLES

1.  Data to be treated as legitimate, citable products of research.

2.  Data citations to give scholarly credit and attribution.

3.  In scholarly literature, whenever claims are based on data, the data should always be cited.

4.  Persistent method for identification of data that is machine actionable, globally unique and universal.

5.  Data citations facilitate access to data, or at least to metadata.

6.  Unique identifiers that persist even beyond the lifespan of the data.

7.  Data citations identify and give access to the specific data that support verification of the claim (provenance, time-slice, version).

8.  Flexible, but with attention to interoperability of practices across communities.

Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. Martone M. (ed.) San Diego CA: FORCE11; 2014

Long-term archiving for your research data

Graphics by Jørgen Stamp, CC-BY

BACKUP AND ARCHIVING – NOT THE SAME THING!

Backup

–  Periodic snapshots of data in case the current version is destroyed or lost.

–  Backups are copies of files stored for short-term or near-long-term.

–  Often performed on a somewhat frequent schedule.

Archiving

–  Preserve data for historical reference.

–  Usually the final version, stored for long-term, and generally not copied over.

–  Often performed at the end of a project or during major milestones.

Source: OpenAIRE & EUDAT, CC-BY-4.0, 2013

ONLINE DATA ARCHIVING CENTER

Rather than leaving your research data on a local server or in cloud storage, archive your data with a trusted digital repository. Many repositories create metadata and documentation to ensure that the data will be discoverable in the future.

DATA ONE

Source: GBIF News story, September 2014, DataONE: http://www.gbif.org/page/8199

NATIONAL DATA CENTER

Sigma2 AS

Photo: CC-BY Intel Free Press (Wikimedia Commons), "A Peek Inside Facebook's Oregon Data Center"

UNINETT Sigma2 AS and the Norwegian Center for Research Data (NSD) provide a national infrastructure service for archiving Norwegian research data. An infrastructure data repository provides many benefits compared to local institutional data archiving.

•  Standardized protocols.

•  Improved access for users of data from outside their own institution.

Metadata

WHAT IS METADATA?

Photo: CC-BY ‘Metadata is a love note to the future’ by Cea+ www.flickr.com/photos/centralasian/8071729256

Commonly defined as ‘data about data’, metadata helps to make data findable and understandable. Metadata can be:

Descriptive: information about the content and context of the data.

Structural: information about the structure of the data.

Administrative: information about the file type, rights management and preservation processes.


Source: CC-BY EUDAT, 2015

METADATA CATALOG

Image CC-BY ‘University of Michigan Library Card Catalog’ by David Fulmer www.flickr.com/photos/annarbor/4350629792

Comprehensive metadata will:

•  Facilitate data discovery

•  Help users determine the applicability of the data

•  Enable interpretation and reuse

•  Allow any limitations to be understood

•  Clarify ownership and restrictions on reuse

•  Offer permanence as it transcends people and time

•  Provide interoperability

WHY USE METADATA?

Source: CC-BY EUDAT, 2015

INFORMATION ENTROPY

The Loss of Information about Data (Metadata) Over Time, Michener et al., 1997

Create metadata at the time of data creation.

Information will be forgotten and there won’t be time or effort left to capture it later.

Metadata benefits from quality control at an early stage too.

TIME MATTERS!

Photo CC-BY-SA ‘egg timer – hour glass running out’ by Open Democracy www.flickr.com/photos/opendemocracy/523438942

Source: CC-BY EUDAT, 2015

DATASET TITLE

Titles are critical in helping readers find your data.

–  While individuals are searching for the most appropriate data sets, they are most likely going to use the title as the first criterion to determine if a dataset meets their needs.

–  Treat the title as the opportunity to sell your dataset.

A complete title includes: What, Where, When, Who, and Scale.

An informative title includes: topic, timeliness of the data, specific information about place and geography.

Source: CC-BY EUDAT, 2015

WHAT IS THE BETTER DATASET TITLE?

Rivers

or

Rivers in Rondane national park from 1:126,700 Forest Service visitor maps (1961-1983)

Rivers (what) in Rondane national park (where) from 1:126,700 (scale) Forest Service (who) visitor maps (1961-1983) (when)

Source: CC-BY EUDAT, 2015
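The what/where/when/who/scale checklist can be sketched as a small helper that assembles a title and reports which elements are still missing. The `DatasetTitle` class and its field names are hypothetical, introduced only for illustration; the example values come from the Rondane title above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DatasetTitle:
    """Hypothetical container for the five elements of a complete title."""
    what: Optional[str] = None
    where: Optional[str] = None
    scale: Optional[str] = None
    who: Optional[str] = None
    when: Optional[str] = None

    def missing(self) -> list:
        """Names of the title elements that have not been filled in."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def compose(self) -> str:
        """Join the available elements in the order used in the slide example."""
        parts = [self.what]
        if self.where:
            parts.append(f"in {self.where}")
        if self.scale or self.who:
            source = " ".join(p for p in (self.scale, self.who) if p)
            parts.append(f"from {source}")
        if self.when:
            parts.append(f"({self.when})")
        return " ".join(p for p in parts if p)


good = DatasetTitle(
    what="Rivers",
    where="Rondane national park",
    scale="1:126,700",
    who="Forest Service visitor maps",  # source type folded into "who" for simplicity
    when="1961-1983",
)
poor = DatasetTitle(what="Rivers")

print(good.compose())
print(poor.missing())
```

The bare title "Rivers" reports four missing elements, which is one way to make the difference between the two titles explicit before a dataset is published.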

WRITE FOR MACHINES, NOT JUST HUMANS

Remember: a computer will read your metadata.

Do not use symbols that could be misinterpreted, for example: ! @ # % { } | / \ < > ~

Don’t use tabs, indents, or line feeds/carriage returns.

When copying and pasting from other sources, use a text editor (e.g., Notepad) to eliminate hidden characters.

Source: CC-BY EUDAT, 2015
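A minimal sketch of how these rules could be checked automatically before a metadata field is submitted. The helper names are hypothetical, and the set of risky symbols is simply the one listed above; a real metadata editor may apply different rules.

```python
import re

# Symbols the slide lists as open to misinterpretation in metadata fields.
RISKY_SYMBOLS = set("!@#%{}|/\\<>~")


def find_risky_symbols(text: str) -> list:
    """Return the risky symbols present in `text`, sorted for stable output."""
    return sorted(set(text) & RISKY_SYMBOLS)


def normalize_whitespace(text: str) -> str:
    """Replace tabs, line feeds and carriage returns with single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def strip_hidden_characters(text: str) -> str:
    """Drop non-printable characters such as zero-width spaces."""
    return "".join(ch for ch in text if ch.isprintable())


# Text as it might arrive after copying and pasting from another source.
raw = "Rivers in\tRondane\r\nnational park\u200b <1961-1983>"

# Whitespace is normalized first so that tabs and line breaks become spaces
# instead of being deleted along with the hidden characters.
clean = strip_hidden_characters(normalize_whitespace(raw))

print(clean)
print(find_risky_symbols(clean))
```

The sketch only reports risky symbols rather than deleting them, since deciding on a replacement (for example parentheses instead of angle brackets) is a judgment for the author.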

Peer review before data-publishing

"Data paper"

Authors get scientific credit for data publication. Meeting concerns over data quality. Meeting concerns over data citation mechanism.

http://www.gbif.org/publishingdata/datapapers

PEER REVIEW OPTION FOR BIODIVERSITY DATASETS

METADATA TOPICS / HEADLINES

Dataset description

Project description

People and Organizations (including roles)

Coverage

•  Taxonomic coverage

•  Geographic coverage

•  Temporal coverage

Methods

Intellectual property rights, licensing

Keywords
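The headlines above can double as a completeness checklist. The sketch below assumes a metadata record held as a plain Python dictionary; the key names are illustrative and are not the element names of any particular metadata standard, and the draft record is made up for the example.

```python
# Section names follow the headlines listed above (illustrative keys only).
REQUIRED_SECTIONS = [
    "dataset_description",
    "project_description",
    "people_and_organizations",
    "taxonomic_coverage",
    "geographic_coverage",
    "temporal_coverage",
    "methods",
    "rights_and_license",
    "keywords",
]


def missing_sections(metadata: dict) -> list:
    """List required sections that are absent or empty in a metadata record."""
    return [name for name in REQUIRED_SECTIONS if not metadata.get(name)]


# A deliberately incomplete, made-up draft record.
draft = {
    "dataset_description": "Placeholder description of an example dataset.",
    "people_and_organizations": [{"name": "Example Person", "role": "author"}],
    "keywords": ["example"],
    "methods": "",  # present but empty, so it is still reported as missing
}

print(missing_sections(draft))
```

Running a check like this while the data are being created fits the earlier advice to write metadata early, before the information is forgotten.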

RATIONALE FOR DATA PAPER

•  A scholarly publication of a searchable metadata document describing a dataset or a group of datasets.

•  Promote and publicize the existence of the data.

•  Provide scholarly credit to data publishers through citable journal publications.

•  Describe the data in a structured human- and machine-readable form.

Persistent and universal identity-number

The purpose of identifiers… is to name things, making it possible to refer to them.

“Each identifier refers to one and only one thing” (Coyle 2006).

“An association between a string and a thing” (Kunze 2003).

“A stated association between a symbol and a thing; that the symbol may be used to unambiguously refer to the thing within a given context” (Campbell 2007).

Many things (in GBIF) are named 123

Catalog number: 123, GBIF ID: 543392241, urn:catalog:CAS:BOT:123, Bigelowia juncea

Catalog number: 123, GBIF ID: 1030591721, UAMb:Herb:123, Sphagnum girgensohnii

Catalog number: 123, GBIF ID: 893477175, Parides erithalion

Catalog number: 123, GBIF ID: 1050327334, Cinchona ledgeriana

Catalog number: 123, GBIF ID: 931031820, Bromus kalmii

Catalog number: 123, GBIF ID: 283363, urn:occurrence:Arctos:MVZ:Egg:123:164, Mercurialis ovata

Catalog number: 123, GBIF ID: 231564351, Umbrina canariensis

Catalog number: 123, GBIF ID: 896547722, urn:occurrence:Arctos:MVZ:Egg:123:164, Contopus sordidulus veliei

NAME AMBIGUITY:
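The ambiguity can be made concrete with a short check. The sketch below copies the catalog numbers, GBIF IDs and scientific names from the records listed above and counts how often each identifier occurs; the tuple layout is an illustrative choice, not a GBIF data format.

```python
from collections import Counter

# (catalog number, GBIF ID, scientific name), copied from the records above.
records = [
    ("123", 543392241, "Bigelowia juncea"),
    ("123", 1030591721, "Sphagnum girgensohnii"),
    ("123", 893477175, "Parides erithalion"),
    ("123", 1050327334, "Cinchona ledgeriana"),
    ("123", 931031820, "Bromus kalmii"),
    ("123", 283363, "Mercurialis ovata"),
    ("123", 231564351, "Umbrina canariensis"),
    ("123", 896547722, "Contopus sordidulus veliei"),
]

catalog_counts = Counter(catalog for catalog, _, _ in records)
gbif_id_counts = Counter(gbif_id for _, gbif_id, _ in records)

# Number of records sharing the local catalog number "123".
print(catalog_counts["123"])

# Highest number of records sharing any one GBIF ID; 1 means all are unique.
print(max(gbif_id_counts.values()))
```

A locally assigned catalog number is only unique within its own collection, whereas an identifier assigned within a single shared namespace distinguishes all eight records.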

HTTP – PURL – UUID

http://purl.org/gbifnorway/id/41d9cbb4-4590-4265-8079-ca44d46d27c3

Including machine-readable formats

urn:uuid:41d9cbb4-4590-4265-8079-ca44d46d27c3

dc:identifier "urn:uuid:41d9cbb4-4590-4265-8079-ca44d46d27c3"
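The two forms shown above can be derived from the same UUID. The sketch below uses Python's standard `uuid` module; the `PURL_BASE` constant is simply the prefix of the example URL above, and how identifiers are actually minted for GBIF Norway is not described in the slides, so treat this as an illustration only.

```python
import uuid

# Prefix of the resolvable (PURL) form, taken from the example above.
PURL_BASE = "http://purl.org/gbifnorway/id/"

identifier = uuid.UUID("41d9cbb4-4590-4265-8079-ca44d46d27c3")

urn_form = identifier.urn                 # machine-readable URN form
http_form = PURL_BASE + str(identifier)   # resolvable HTTP form

print(urn_form)
print(http_form)

# A new record could be given a fresh random (version 4) UUID in the same way.
new_identifier = uuid.uuid4()
print(new_identifier.urn)
```

Keeping the UUID as the stable core and wrapping it in an HTTP prefix means the same identifier works both as a globally unique name and as a link that can be resolved.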

Data License (machine-readable license)

LICENSING FOR DATA PUBLISHED THROUGH GBIF

http://www.gbif.org/terms/licences

In 2014 the GBIF Governing Board established support in GBIF for three licenses.

GBIF Portal (status December 2016):

•  CC0: 57%

•  CC-BY 4.0: 31%

•  CC-BY-NC 4.0: 13%

DATA LICENSE REGULATES THE POSSIBILITY FOR REUSE OF DATA

•  CC0 data are made available for any use without restriction or particular requirements on the part of users.

•  CC BY data are made available for any use provided that attribution is appropriately given for the sources of data used.

•  CC NC data are made available for non-commercial use; however, how should "commercial use" be delimited?

•  CC SA data are made available on the condition that derived products are also shared alike as CC SA; note that this could block desired commercial products.

•  CC ND data are made available read-only for verification; no modifications or derived products are allowed (blocking reuse)!
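A machine-readable license is typically expressed as a URI, which makes it possible to check automatically whether a dataset carries one of the three licenses supported in GBIF. The sketch below is a minimal illustration: the function name is hypothetical, and the exact URI strings are an assumption that should be checked against http://www.gbif.org/terms/licences.

```python
# URIs for the three licenses named above. The exact strings are an
# assumption; verify them against the GBIF licensing page before use.
GBIF_SUPPORTED_LICENSES = {
    "CC0 1.0": "http://creativecommons.org/publicdomain/zero/1.0/legalcode",
    "CC BY 4.0": "http://creativecommons.org/licenses/by/4.0/legalcode",
    "CC BY-NC 4.0": "http://creativecommons.org/licenses/by-nc/4.0/legalcode",
}


def _normalize(uri: str) -> str:
    """Lower-case the URI, drop a trailing slash and treat https as http."""
    uri = uri.strip().lower().rstrip("/")
    return uri.replace("https://", "http://", 1)


def is_supported(license_uri: str) -> bool:
    """True if the URI matches one of the three supported licenses."""
    supported = {_normalize(uri) for uri in GBIF_SUPPORTED_LICENSES.values()}
    return _normalize(license_uri) in supported


print(is_supported("https://creativecommons.org/licenses/by/4.0/legalcode"))
print(is_supported("http://creativecommons.org/licenses/by-nd/4.0/legalcode"))
```

The second call uses a CC ND license, which is not among the three supported options and, as noted above, would block reuse.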

NORWEGIAN LICENSE FOR PUBLIC DATA (NLOD)

•  NLOD Norwegian license for public data is compatible with CC BY 4.0.

•  http://data.norge.no/nlod/no/1.0

•  Recommended to use CC BY 4.0 for broader compatibility and understanding also outside of Norway (alternatively declare both).

H2020 – OPEN DATA BY DEFAULT FROM 2017

Source: OpenAIRE & EUDAT, CC-BY-4.0, 2013

Conclusion

WHY MANAGE AND PUBLISH YOUR OWN RESEARCH DATA?

•  Make your own research easier!

•  Stop yourself drowning in irrelevant data.

•  Save your own data for later use.

•  Avoid accusations of fraud or bad science (e.g. p-hacking).

•  Share your research data for re-use.

•  Get credit for your data.

•  Meet funder/institution requirements.

Because well-managed data opens up opportunities for re-use and sharing, and makes for better science!

Source: OpenAIRE & EUDAT, CC-BY-4.0, 2013

Node team at NHM, University of Oslo

Dag Endresen, Node manager

Christian Svindseth, Database manager

Fridtjof Mehlum, Research director

Einar Timdal, Associate professor

Geir Søli, Associate professor

Vidar Bakken, Consultant

Artsdatabanken, Trondheim

Wouter Koch

Nils Valland

NTNU University Museum

Anders Finstad, GBIF Science committee

Research Council of Norway

Per Backe-Hansen, Head of delegation

Contact us at: gbif-drift@nhm.uio.no
