RDA Work Groups Outputs and Adoption - Early WG Report back session

RDA and Adoption Early WG Report back session September 23, 2014

Upload: research-data-alliance

Post on 11-Jun-2015


DESCRIPTION

RDA Fourth Plenary - RDA Work Groups Outputs and Adoption - Early WG Report back session, Tuesday 23rd Sept 2014, Amsterdam, the Netherlands

TRANSCRIPT

Page 1: RDA Work Groups Outputs and Adoption - Early WG Report back session

RDA and Adoption

Early WG Report back session September 23, 2014

Page 2: Happy Birthday!

http://cdn.cakecentral.com/d/d3/900x900px-LL-d3548099_gallery6680631282672149.jpeg

Page 3: What did we learn?

§  Motivated groups of people can do a lot
§  But we are relying too much on volunteer labour contributed on top of over-full lives
§  Looks like the RDA-challenge goal of 12-18 months is achievable
§  But IGs also provide valuable space for longer-term interaction
§  We need to reduce friction in our processes
§  But the organisation is maturing rapidly

Page 4: RDA and Outputs

§  RDA will only deliver on its promise if it produces deliverables, and those deliverables are adopted outside the groups that created them
§  Consequential TAB foci:
    §  proposals for new groups – adoption plans?
    §  tracking groups underway – fit for purpose?
    §  monitoring of adoption once groups conclude – actually adopted?
§  So, how can we most usefully think through the process of adoption?

Page 5: Diffusion and RDA

§  Adoption can be seen as the end result of a diffusion process. This diffusion process involves
    §  awareness
    §  interest
    §  evaluation
    §  trial
    §  adoption
§  RDA has a role to play in
    §  supporting each stage
    §  making the transitions from one stage to the next more likely

Page 6: Important questions

1.  How do we talk about data?
2.  How can we describe the data?
3.  Can we optimize addressing the data?
4.  How can we get trust in our infrastructure?

Page 7: What

§  Base infrastructure
§  (Coincidence, also social groups!)
§  Let's agree on terms. (DFT)
§  Descriptions for interoperability. (DTR)
§  Scaling across PID systems. (PIT)
§  Building policies into the infrastructure. (PP)

Page 8: These groups

§  Amplify each other
§  Use each other's outputs
§  Have to interlock properly
§  Will continue the effort after they finish.

Page 9: Data Foundation and Terminology

Chairs: Gary Berg-Cross, Raphael Ritz, Peter Wittenburg

Page 10: Task

Bob Kahn: You need to know where you are talking about.

DFT mission: understand what the core of the data domain is, and develop definitions of core terms based on data models. DFT is part of coming to an agreed culture in RDA.

Scope: only speak about the domain of registered data
§  knowing that there is a lot of non-registered data
§  knowing that some disciplines are further away from what we are discussing as necessity

Page 11: DFT WG Activities & Accomplishments

§  Drafted 4 related Model Documents on core work:
    1.  Data Models 1: Overview – 22+ models
    2.  Data Models 2: Analysis & Synthesis
    3.  Data Models 3: Term Snapshot
    4.  Data Models 4: Use Cases (work with other RDA WGs on use cases to illustrate data concepts)
§  Developed Semantic MediaWiki Term Tool to capture initial list of terms and definitions for discussions, demo held at P3 (open for others and “persistent”)
§  Candidate List evolved to Consolidated List

Page 12: Our Core Terms in Simple Words :-)

§  digital object (DO)
§  persistent identifier
§  PID resolution system
§  metadata
§  aggregation
§  digital collection
§  (digital) repository
§  bitstream
§  state information

Need to put the relations between terms into the documents.
On purpose no formal ontology (yet) and no terminologist's exactness, since we made definitions for data practitioners first.

Page 13: Definitions & Process

§  A digital collection is an aggregation of DOs that is identified by a PID and described by metadata.
§  Note: A digital collection is a (complex) DO.
§  Variants
    §  A collection is a form of aggregation of elements that has an identity of its own separate from the identity of the elements.
    §  Collection is defined as a “group of objects gathered together for some intellectual, artistic curatorial purpose”.
    §  A digital collection is a type of aggregation formed by a collection process on existing data and data sets where the collected data is in digital form.
    §  Collection is a type of aggregation obeying part-role relations and is a digital object since it has a PID to be referable and metadata describing its properties.
    §  A Digital Collection is an organized aggregation or other grouping of distinct DOs that are related by some criteria and where the collection is described by metadata. A Digital Collection may also be identified by a unique persistent identifier, in which case the collection may be construed as a DO. (Kahn et al.)
§  Conclusion points
    §  purpose and process of aggregation/collection building and part relations are not relevant for the definition
    §  remember: only speak about the domain of registered DOs.
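The first definition above (an aggregation of DOs, identified by a PID and described by metadata) can be pictured as a small data model. This is only an illustrative sketch: the class names, field names and the example PID strings are assumptions made for this example and are not part of the DFT documents.

```python
from dataclasses import dataclass


@dataclass
class DigitalObject:
    """A registered digital object: a bitstream with a PID and metadata."""
    pid: str            # persistent identifier (hypothetical format)
    metadata: dict      # descriptive metadata
    bitstream: bytes = b""


@dataclass
class DigitalCollection:
    """An aggregation of DOs that has its own PID and its own metadata."""
    pid: str
    metadata: dict
    members: tuple = ()  # PIDs of the aggregated DOs

    def as_digital_object(self) -> DigitalObject:
        # The collection is itself a (complex) DO: here its content is
        # simply the list of member PIDs, one per line.
        content = "\n".join(self.members).encode("utf-8")
        return DigitalObject(pid=self.pid, metadata=self.metadata, bitstream=content)


scan = DigitalObject(pid="example/0001", metadata={"title": "Scan 1"}, bitstream=b"...")
collection = DigitalCollection(
    pid="example/c01", metadata={"title": "Scans"}, members=(scan.pid,)
)
```

Note that the model records nothing about why or how the collection was built, in line with the conclusion that purpose and process are not relevant for the definition.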

Page 14: Interactions with others

•  Interacted with RDA WGs and IGs.
•  Participated in Munich meeting and Chairs telcos.
•  Part of WG forum discussions
•  also “active” interactions with about 120 groups

Discipline               RDA/EU & EUDAT Interviews   Interactions   Total
Humanities & Soc Sci     8                           13             21
Environmental            7                           2              9
Life Sciences            10                          7              17
Natural Sciences         11                          13             24
Engineering & CS         -                           14             14
Various disciplines      -                           24             24
Others                   4                           3              7
Total                    40                          74             114

Page 15: Adoption

•  What does adoption mean in case of a set of terms?
    •  it’s about the interaction process itself within and outside of RDA
    •  it’s about influencing conceptualization and thus harmonizing “language”
    •  it’s about changing cultures
    •  we have done a lot – many departments & communities
•  why so relevant:
    •  report from 120 interactions tells us that data practices are a nightmare (report is available)
    •  data organizations are so different that data federation including “logical information” is too expensive
    •  current data science is not reproducible

Page 16: Objectives until/for P4

1.  Go out and intensify interaction based on Snapshot
    §  create condensed statements for different groups (2-page flyer)
    §  interact with other groups in RDA and early adopters
    §  interact with the many communities (outside RDA) we already contacted (in Europe ESFRI RI projects: 17th October, Brussels)
    §  encourage people to use the term wiki
2.  Come to new consolidated agreements
    §  consolidated definitions until P5
    §  present the consolidated definitions and tend core term set
    §  identify some people from communities that have adoption talks (no PR!)
3.  Finish some unsolved issues
    §  synthesis: generic, flexible enough model to capture terms and their relationships
    §  add more use cases
    §  see how to continue maintenance

Page 17: Thanks for your attention.

Page 18: Data Type Registries WG Outcomes

Page 19: Problem: Implicit Assumptions in Data

§  Data sharing requires that data can be parsed, understood, and reused by people and applications other than those that created the data
§  How do we do this now?
    §  For documents – formats are enough, e.g., PDF, and then the document explains itself to humans
    §  This doesn’t work well with data – numbers are not self-explanatory
    §  What does the number 7 mean in cell B27?
§  Data producers may not have explicitly specified certain details in the data: measurement units, coordinate systems, variable names, etc.
§  Need a way to precisely characterize those assumptions such that they can be identified by humans and machines that were not closely involved in its creation
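The "number 7 in cell B27" problem can be illustrated with a minimal sketch in which a value carries a reference to a type record that states the otherwise implicit assumptions. Everything here is hypothetical: the record fields, the identifier `example/temperature-celsius` and the in-memory dictionary standing in for a registry are assumptions for illustration, not the data model of the WG prototype.

```python
from dataclasses import dataclass


@dataclass
class DataType:
    """A type record that makes implicit assumptions explicit."""
    type_id: str      # hypothetical identifier of the type record
    name: str
    unit: str
    description: str


@dataclass
class TypedValue:
    """A bare value plus a pointer to the type record that explains it."""
    value: float
    type_id: str


# Hypothetical in-memory stand-in for a type registry.
REGISTRY = {
    "example/temperature-celsius": DataType(
        type_id="example/temperature-celsius",
        name="air temperature",
        unit="degree Celsius",
        description="Air temperature measured 2 m above ground",
    ),
}


def explain(cell: TypedValue, registry=REGISTRY) -> str:
    """Resolve the type reference and render the value with its unit and name."""
    record = registry[cell.type_id]
    return f"{cell.value} {record.unit} ({record.name})"


cell_b27 = TypedValue(value=7, type_id="example/temperature-celsius")
print(explain(cell_b27))  # prints "7 degree Celsius (air temperature)"
```

A person or program that was not involved in creating the data can now interpret the value by looking up the type reference instead of guessing.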

Page 20: Goals: Explicate and Share Assumptions using Types and Type Registries

§  Evaluate and identify a few assumptions in data that can be codified and shared in order to…
§  Produce a functioning Registry system that can easily be evaluated by organizations before adoption
    §  Highly configurable for changing scope of captured and shared assumptions depending on the domain or organization
    §  Supports several Type record dissemination variations
§  Design for allowing federation between multiple Registry instances
§  The group’s emphasis is not on
    §  Identifying every possible assumption and data characteristic applicable for all domains
    §  Technology

Page 21: Results

§  Produced a community consensus system – in this case the consensus was between the group members
    §  Input from people with different backgrounds, including technologists, scientists, policy analysts, etc., was considered
§  Released a functioning prototype that can be adapted (with no s/w changes) for domain-specific use
    §  Not a turnkey solution
    §  Adapt – Evaluate – Adopt cycle is expected at each organization or community
§  Federation between different instances is technically possible
    §  Organizational policies were not discussed due to lack of time
§  CNRI, a member of the group, has designed and implemented a prototype, the latest of which is at: http://typeregistry.org
§  With the help of an RDA-provided scholar, we seeded the Registry with Types that pertain to the geosciences community
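One way to picture federation between Registry instances is a lookup that tries several registries in turn. This is a sketch under strong assumptions: plain dictionaries stand in for registry instances, and the function name and type identifiers are invented for illustration. It says nothing about how the prototype actually federates.

```python
def resolve_type(type_id, registries):
    """Return the first record found for type_id, trying each registry in order."""
    for registry in registries:
        record = registry.get(type_id)
        if record is not None:
            return record
    raise KeyError(type_id)


# Two hypothetical registry instances, e.g. one per community or organization.
geoscience_registry = {"example/depth": {"name": "depth", "unit": "metre"}}
institutional_registry = {"example/sample-id": {"name": "sample identifier"}}

record = resolve_type("example/sample-id", [geoscience_registry, institutional_registry])
```

The order of the list decides which instance wins when two registries define the same identifier. That is exactly the kind of organizational policy question the group did not have time to discuss.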

Page 22: Points to Keep in Mind

§  Data Type Registry is neither a turnkey system nor an immediate ROI application
§  Every organization should nominate a domain expert for defining the scope of Type records and for seeding their Registry instance
§  Cross-domain interpretation beyond some basic computability needs social processes in place
§  Data systems such as Type Registries are low-level infrastructure systems with wide applicability
    §  Network effect plays a significant role in the success of any infrastructure

Page 23: Adoption and Impact

§  We expect multiple groups to put significant efforts into exercising the prototype:
    §  the EUDAT project in Europe,
    §  National Institute of Standards and Technology (NIST) in the US,
    §  the International DOI Foundation
§  (Wo Chang, Digital Data Manager at NIST, shares his evaluation plans)

Page 24: Conclusion – For Now

§  Adoption plans will continue
§  The group, or some part of it, will continue to work, we hope with RDA’s blessing and maybe support. We will have more to say at P5
§  Future-proofing data is hard work, but is essential for long-term data-driven science

Page 25: WG PID Information Types Outcomes

Page 26: Problem & Goal

§  PIDs are associated with additional information and this information needs to be typed
§  Harmonization across disciplines and PID providers
§  What are PID Information Types?
    §  Specify a framework for defining types
    §  Agree on some essential types
    §  Provide technical solutions for interaction with PID types
§  Provide the tools first, then create types individually

Page 27: Results

Insights gained:
§  Types depend on use cases and semantics differ between disciplines
§  There is no single set of types fitting all cases
§  Community processes must define types from practical adoption

Final deliverables available:
§  Type examples and illustrating use cases
§  Types registered in the Type Registry prototype
§  API description and prototypic implementation
§  Client demonstrator GUI

Page 28: Registered types enable cross-services

[Figure: PID records carrying typed Format, Checksum and Size entries, read by a verification service]
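The idea of a verification service that relies only on typed PID information can be sketched as follows. This is an illustration under assumptions: the record is a plain dictionary with hypothetical keys `format`, `size` and `checksum`, and SHA-256 is chosen arbitrarily as the checksum algorithm. It does not use the PIT API.

```python
import hashlib


def verify(pid_record: dict, content: bytes) -> dict:
    """Compare content against the typed Size and Checksum entries of a PID record."""
    return {
        "size": len(content) == pid_record["size"],
        "checksum": hashlib.sha256(content).hexdigest() == pid_record["checksum"],
    }


content = b"example payload"

# Hypothetical PID record; in practice the entries would be resolved from a PID system.
record = {
    "format": "text/plain",
    "size": len(content),
    "checksum": hashlib.sha256(content).hexdigest(),
}

ok = verify(record, content)
tampered = verify(record, b"tampered")
```

Because the service depends only on the registered types, it can in principle work across PID providers whose records expose the same typed entries, whatever order they store them in.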

Page 29: Adoption & Impact

§  Register your types so they can be adopted and reused, making it easier for others to use your data
    §  Information on how to register new types available in the report
§  Adopt types already being used in your domain to increase interoperability
§  Decouple object management from contents
    §  Simplify client access to data across domains, implementations and changes in information models
    §  More lightweight access to information on less accessible objects

Page 30: Possible follow-ups

§  Adoption of these capabilities by PID infrastructure providers
§  Discipline-specific types, preferably from practical adoption
§  Establish a type ecosystem
§  Refine data model
§  Enhance REST API

Page 31: Conclusions

§  Draft final report available via the website
§  Demonstrator web GUI: http://smw-rda.esc.rzg.mpg.de/PitApiGui/

Page 32: Practical Policies Outcomes

Page 33: WG Practical Policies

Page 34: Scenario

§  Create research data repository
§  Data: 2 TB, 500,000 files + growing
    + integrity
    + access (IG FIM)
    + publish (publication+PID)
    + …
§  Some assertions: policies & rules attached to the data

Policy: Assertion or assurance that is enforced about a collection or a dataset

Page 35: Problem

Computer actionable policies
§  Enforce management,
§  Automate administrative tasks,
§  Validate assessment criteria,
§  Automate scientific analyses
§  etc.

A generic set of policies that can be revised and adapted by user communities and site managers does not exist.
§  Domain scientists who want to build up a collection or a repository
§  Data centers for automating policies
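A computer actionable policy can be thought of as an assertion that a program evaluates over every object in a collection. The sketch below shows an integrity policy in that style. The object layout, function names and the use of SHA-256 are assumptions for illustration only; the WG's own implementation examples target iRODS and GPFS, not this code.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# A policy is modelled as a function that checks one object and
# returns True when the assertion holds for it.
Policy = Callable[[dict], bool]


def integrity_policy(obj: dict) -> bool:
    """Assert that the stored checksum still matches the content."""
    return hashlib.sha256(obj["content"]).hexdigest() == obj["checksum"]


def enforce(collection: List[dict], policies: Dict[str, Policy]) -> List[Tuple[str, str]]:
    """Return (object name, policy name) for every violated assertion."""
    violations = []
    for obj in collection:
        for name, policy in policies.items():
            if not policy(obj):
                violations.append((obj["name"], name))
    return violations


# Hypothetical two-file collection; the second checksum is deliberately wrong.
collection = [
    {"name": "a.dat", "content": b"abc", "checksum": hashlib.sha256(b"abc").hexdigest()},
    {"name": "b.dat", "content": b"xyz", "checksum": "0" * 64},
]

violations = enforce(collection, {"integrity": integrity_policy})
```

Keeping each policy as a separate named function is one way to make policies shareable and adaptable: a site can swap in its own rule under the same name without touching the enforcement loop.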

Page 36: Goals

§  To bring together practitioners in policy making and policy implementation (nearly all RDA WG/IGs)
§  To identify typical application scenarios for policies such as replication, preservation etc.
§  To collect and to register practical policies
§  To enable sharing, revising, adapting, and re-using of computer actionable policies

Page 37: Survey of 30 Institutions for Highest Priority Policies

In close cooperation with the Engagement Group

Policy                    Importance
Integrity                 217
Preservation              150
Access control            126
Provenance                108
Data Management plans     99
Publication               75
Replication               66
Data staging              52
Federation                37
Metadata sharing          23
Regulatory                16
Collection properties     7
Identifiers               7
Data sharing              7
Versioning                7
Licensing                 6
Format                    6
Data Life Cycle           6
Arrangement               5
Processing                5

Page 38: Identification of 11 important policy areas:

[Figure: Collection-based Policies, with the labels Contextual Metadata Extraction, Disposition, Data Retention, Integrity, Storage Cost Reports, Restricted Searching, Notification, Data Access Control, Use Agreements, Data backup, Data Format Control]

Page 39: Results

Identification of 11 important policy areas:
§  Contextual metadata extraction
§  Data access control
§  Data backup
§  Data format control
§  Data retention
§  Disposition
§  Integrity (including replication)
§  Notification
§  Restricted searching
§  Storage cost reports
§  Use agreements

Page 40: Results

https://www.rd-alliance.org/filedepot?cid=104&fid=556

Templates
§  Interactions of policies and DO attributes
§  Policy descriptions
§  Technology independent
§  Reviews of the provided policy areas in progress

Page 41: Results

https://www.rd-alliance.org/filedepot?cid=104&fid=553
§  Examples for implementations:
    §  English language descriptions
    §  iRODS
    §  GPFS
§  ~50 pages

Page 42: Impact

Result: List of policy categories and policies
§  Improved data center administration
§  By sharing policies, communities can interoperate and share data more effectively
§  Transparency: basis of establishing trust
§  Implemented policies: can be used as examples and be adapted to specific requirements and other data management systems

Page 43: Adoption

Target Communities:
§  Groups managing data collections
§  Data centers

First adopters are the institutions/organizations who contributed to the results, e.g. RENCI, KIT, OSC, DARIAH, RZG, etc.:
§  EUDAT
§  CESNET
§  (DataNet Federation Consortium, WDS ? )

Page 44: Conclusions

§  “Outcomes Policy Templates: Practical Policy Working Group, September 2014” https://www.rd-alliance.org/filedepot?cid=104&fid=556
§  “Implementations: Practical Policy Working Group, September 2014” https://www.rd-alliance.org/filedepot?cid=104&fid=553
§  Work in Progress: Reviews

Page 45: Conclusions: Next Steps

§  More interaction with other technical groups
    → Data Fabric
    → Publication policies
§  More interaction with domain specific groups
    → Adopters

For information please contact
§  Reagan Moore [email protected] and
§  Rainer Stotzka [email protected]

Page 46: WG Practical Policies

Breakout Session: Tuesday September 23, 14:00 – 15:30

Agenda:
1.  Introduction
2.  Presentation of deliverables
3.  David Antos & Petr Benedikt: “Policy implementations on GPFS”
4.  Discussions:
    §  Policy reviews
    §  Adding new policies
    §  Interoperability with other WG/IGs
    §  Adoption

Page 47: P5 and Adoption Day

§  More groups will be presenting at P5
§  Starting to see how different WG outputs can fit together
    §  Ex: Data Fabric
§  Planning to have a major focus at P5 on adoption of WG outputs
§  Also thinking through how best to accelerate adoption and support groups that want to integrate RDA outputs

Page 48: How you can help!

§  Get involved in WGs, IGs to ensure outputs meet your needs and the needs of your organisation
§  Encourage your organisation to become aware of RDA outputs and evaluate or trial them
§  Look for places where RDA can make a difference