RDA Work Groups Outputs and Adoption - Early WG Report Back Session
RDA Fourth Plenary, Tuesday 23rd Sept 2014, Amsterdam, the Netherlands
RDA and Adoption
Early WG Report back session September 23, 2014
2 Happy Birthday!
http://cdn.cakecentral.com/d/d3/900x900px-LL-d3548099_gallery6680631282672149.jpeg
3
§ Motivated groups of people can do a lot
§ But we are relying too much on volunteer labour contributed on top of over-full lives
§ Looks like the RDA-challenge goal of 12-18 months is achievable
§ But IGs also provide valuable space for longer-term interaction
§ We need to reduce friction in our processes
§ But the organisation is maturing rapidly
What did we learn?
4
§ RDA will only deliver on its promise if it produces deliverables, and those deliverables become adopted outside the groups that created them
§ Consequential TAB foci:
  § proposals for new groups – adoption plans?
  § tracking groups underway – fit for purpose?
  § monitoring of adoption once groups conclude – actually adopted?
§ So, how can we most usefully think through the process of adoption?
RDA and Outputs
5
§ Adoption can be seen as the end result of a diffusion process. This diffusion process involves
  § awareness
  § interest
  § evaluation
  § trial
  § adoption
§ RDA has a role to play in
  § supporting each stage
  § making the transitions from one stage to the next more likely
Diffusion and RDA
6
1. How do we talk about data?
2. How can we describe the data?
3. Can we optimize addressing the data?
4. How can we get trust in our infrastructure?
Important questions
7
§ Base infrastructure
§ (Coincidence, also social groups!)
§ Let's agree on Terms. (DFT)
§ Descriptions for Interoperability. (DTR)
§ Scaling across PID systems. (PIT)
§ Building policies into the infrastructure. (PP)
What
8
§ Amplify each other
§ Use each other's outputs
§ Have to interlock properly
§ Will continue the effort after they finish.
These groups
Data Foundation and Terminology
Chairs: Gary Berg-Cross, Raphael Ritz, Peter Wittenburg
10 Task
Bob Kahn:
You need to know where you are talking about.
DFT mission: understand what the core of the data domain is, develop definitions of core terms based on data models. DFT is part of coming to an agreed culture in RDA.
Scope:
AND only speak about the domain of registered data.
§ knowing that there is a lot of non-registered data
§ knowing that some disciplines are further away from what we are discussing as a necessity
11 DFT WG Activities & Accomplishments
§ Drafted 4 related Model Documents on core work:
  1. Data Models 1: Overview – 22+ models
  2. Data Models 2: Analysis & Synthesis
  3. Data Models 3: Term Snapshot
  4. Data Models 4: Use Cases (work with other RDA WGs on use cases to illustrate data concepts)
§ Developed Semantic MediaWiki Term Tool to capture initial list of terms and definitions for discussions; demo held at P3 (open for others and “persistent”)
Candidate List Evolved to Consolidated List
12
§ digital object (DO)
§ persistent identifier
§ PID resolution system
§ metadata
§ aggregation
§ digital collection
§ (digital) repository
§ bitstream
§ state information
Need to put relations between terms into the documents.
On purpose no formal ontology (yet) and no terminologist's exactness, since we made definitions for data practitioners first.
Our Core Terms in Simple Words
13
§ A digital collection is an aggregation of DOs that is identified by a PID and described by metadata.
§ Note: A digital collection is a (complex) DO.
§ Variants
  § A collection is a form of aggregation of elements that has an identity of its own separate from the identity of the elements.
  § Collection is defined as a “group of objects gathered together for some intellectual, artistic or curatorial purpose”.
  § A digital collection is a type of aggregation formed by a collection process on existing data and data sets where the collected data is in digital form.
  § Collection is a type of aggregation obeying part-role relations and is a digital object since it has a PID to be referable and metadata describing its properties.
  § A Digital Collection is an organized aggregation or other grouping of distinct DOs that are related by some criteria and where the collection is described by metadata. A Digital Collection may also be identified by a unique persistent identifier, in which case the collection may be construed as a DO. (Kahn et al.)
§ Conclusion points
  § purpose and process of aggregation/collection building and part relations not relevant for definition
  § remember: only speak about domain of registered DOs.
Definitions & Process
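The core definition above (a digital collection is an aggregation of DOs that has its own PID and metadata, and is therefore itself a DO) can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, field names, and the `hdl:0000/...` identifiers are hypothetical and are not taken from the DFT documents.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """A registered digital object: identified by a PID, described by metadata."""
    pid: str
    metadata: dict
    bitstream: bytes = b""


@dataclass
class DigitalCollection(DigitalObject):
    """An aggregation of DOs that is itself a (complex) DO."""
    members: list = field(default_factory=list)

    def member_pids(self):
        # The collection keeps an identity separate from its members' identities.
        return [member.pid for member in self.members]


# Hypothetical PIDs, for illustration only.
scan = DigitalObject(pid="hdl:0000/scan-1", metadata={"title": "Scan 1"}, bitstream=b"\x00\x01")
notes = DigitalObject(pid="hdl:0000/notes-1", metadata={"title": "Notes"})
collection = DigitalCollection(
    pid="hdl:0000/coll-1",
    metadata={"title": "Example collection"},
    members=[scan, notes],
)
```

Because `DigitalCollection` subclasses `DigitalObject`, anything that accepts a DO also accepts a collection, which mirrors the note that a collection is a (complex) DO.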
14 Interactions with others
• Interacted with RDA WGs and IGs.
• Participated in Munich meeting and Chairs telcos.
• Part of WG forum discussions
• also “active” interactions with about 120 groups
Discipline             RDA/EU & EUDAT Interviews   Interactions   Total
Humanities & Soc Sci   8                           13             21
Environmental          7                           2              9
Life Sciences          10                          7              17
Natural Sciences       11                          13             24
Engineering & CS       -                           14             14
Various disciplines    -                           24             24
others                 4                           3              7
Total                  40                          74             114
15 Adoption
• What does adoption mean in case of a set of terms?
  • it’s about the interaction process itself within and outside of RDA
  • it’s about influencing conceptualization and thus harmonizing “language”
  • it’s about changing cultures
  • we have done a lot – many departments & communities
• why so relevant:
  • report from 120 interactions tells us that data practices are a nightmare (report is available)
  • data organizations are so different that data federation including “logical information” is too expensive
  • current data science is not reproducible
16 Objectives until/for P4
1. Go out and intensify interaction based on Snapshot
  § create condensed statements for different groups (2-page flyer)
  § interact with other groups in RDA and early adopters
  § interact with the many communities (outside RDA) we already contacted (in Europe ESFRI RI projects: 17th October, Brussels)
  § encourage people using the term wiki
2. Come to new consolidated agreements
  § consolidated definitions until P5
  § present the consolidated definitions and tend core term set
  § identify some people from communities that have adoption talks (no PR!)
3. Finish some unsolved issues
  § synthesis: generic flexible enough model to capture terms and their relationships
  § add more use cases
  § see how to continue maintenance
Thanks for your attention.
Data Type Registries WG Outcomes
19
§ Data sharing requires that data can be parsed, understood, and reused by people and applications other than those that created the data
§ How do we do this now?
  § For documents – formats are enough, e.g., PDF, and then the document explains itself to humans
  § This doesn’t work well with data – numbers are not self-explanatory
  § What does the number 7 mean in cell B27?
§ Data producers may not have explicitly specified certain details in the data: measurement units, coordinate systems, variable names, etc.
§ Need a way to precisely characterize those assumptions such that they can be identified by humans and machines that were not closely involved in its creation
Problem: Implicit Assumptions in Data
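The "number 7 in cell B27" problem can be made concrete with a short sketch: the same value is uninterpretable on its own, but readable once it points at a type record that spells out the assumptions. The type identifier, the record fields (`name`, `unit`), and the `describe` helper are all hypothetical and chosen for illustration; they are not the schema of the Data Type Registry prototype.

```python
# Hypothetical type records; field names are illustrative, not the DTR schema.
TYPE_RECORDS = {
    "example/sea-surface-temp": {
        "name": "sea surface temperature",
        "unit": "degree Celsius",
    }
}


def describe(value, type_id, registry=TYPE_RECORDS):
    """Render a value together with the assumptions recorded for its type."""
    record = registry.get(type_id)
    if record is None:
        # Without a type record the consumer can only guess at the meaning.
        return f"{value} (untyped: meaning unknown)"
    return f"{value} {record['unit']} ({record['name']})"


cell_b27 = 7
print(describe(cell_b27, "example/sea-surface-temp"))
print(describe(cell_b27, "example/unregistered"))
```

The point of the sketch is only that the unit and variable name live in a shared record rather than in the producer's head, so both humans and programs can look them up.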
20
§ Evaluate and identify a few assumptions in data that can be codified and shared in order to…
§ Produce a functioning Registry system that can easily be evaluated by organizations before adoption
  § Highly configurable for changing scope of captured and shared assumptions depending on the domain or organization
  § Supports several Type record dissemination variations
§ Design for allowing federation between multiple Registry instances
§ The group’s emphasis is not on
  § Identifying every possible assumption and data characteristic applicable for all domains
  § Technology
Goals: Explicate and Share Assumptions using Types and Type Registries
21
§ Produced a community consensus system – in this case the consensus was between the group members
  § Input from folks from different backgrounds including technologists, scientists, policy analysts, etc., is considered
§ Released a functioning prototype that can be adapted (with no software changes) for domain-specific use
  § Not a turnkey solution
  § Adapt – Evaluate – Adopt cycle is expected at each organization or community
§ Federation between different instances is technically possible
§ Organizational policies were not discussed due to the lack of time
§ CNRI, a member of the group, has designed and implemented a prototype, the latest of which is at: http://typeregistry.org
§ With the help of an RDA-provided scholar, we seeded the Registry with Types that pertain to the geosciences community
Results
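The slides state that federation between Registry instances is technically possible but do not describe a mechanism. One assumed way to picture it is a lookup that tries the local instance first and then asks peer instances. The `TypeRegistry` class, its `resolve` method, and the type identifiers below are hypothetical and do not reflect the CNRI prototype's API.

```python
class TypeRegistry:
    """Hypothetical registry instance that can delegate lookups to peers."""

    def __init__(self, name, records=None, peers=None):
        self.name = name
        self.records = dict(records or {})
        self.peers = list(peers or [])

    def resolve(self, type_id, _seen=None):
        """Return (registry name, record) for type_id, or None if unknown."""
        seen = _seen if _seen is not None else set()
        if id(self) in seen:
            # Already visited: guards against cycles between federated peers.
            return None
        seen.add(id(self))
        if type_id in self.records:
            return self.name, self.records[type_id]
        for peer in self.peers:
            found = peer.resolve(type_id, seen)
            if found is not None:
                return found
        return None


geo = TypeRegistry("geo", {"example/depth": {"unit": "m"}})
local = TypeRegistry("local", {"example/temp": {"unit": "degree Celsius"}}, peers=[geo])
geo.peers.append(local)  # mutual federation; the cycle guard keeps lookups finite
```

Here `local.resolve("example/depth")` is answered by the `geo` instance, while an unknown identifier returns `None` even though the two instances point at each other.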
22
§ Data Type Registry is neither a turnkey system nor an immediate ROI application
§ Every organization should nominate a domain expert for defining the scope of Type records and for seeding their Registry instance
§ Cross-domain interpretation beyond some basic computability needs social processes in place
§ Data systems such as Type Registries are low-level infrastructure systems with wide applicability
  § Network effect plays a significant role in the success of any infrastructure
Points to Keep in Mind
23
§ We expect multiple groups to put significant efforts into exercising the prototype:
  § the EUDAT project in Europe,
  § National Institute of Standards and Technology (NIST) in the US,
  § the International DOI Foundation
§ (Wo Chang, Digital Data Manager at NIST, shares his evaluation plans)
Adoption and Impact
24
§ Adoption plans will continue
§ The group, or some part of it, will continue to work, we hope with RDA’s blessing and maybe support. We will have more to say at P5
§ Future-proofing data is hard work, but is essential for long-term data-driven science
Conclusion – For Now
WG PID Information Types Outcomes
26
§ PIDs are associated with additional information and this information needs to be typed
§ Harmonization across disciplines and PID providers
§ What are PID Information Types?
  § Specify a framework for defining types
  § Agree on some essential types
  § Provide technical solutions for interaction with PID types
§ Provide the tools first, then create types individually
Problem & Goal
27
Insights gained:
§ Types depend on use cases and semantics differ between disciplines
§ There is no single set of types fitting all cases
§ Community processes must define types from practical adoption
Final deliverables available:
§ Type examples and illustrating use cases
§ Types registered in the Type Registry prototype
§ API description and prototypic implementation
§ Client demonstrator GUI
Results
28 Registered types enable cross-services
[Figure: PID records carrying the registered types Format, Checksum and Size]
Verification service
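A verification service of the kind named on this slide could, in outline, read the Size and Checksum entries from a PID record and compare them with the retrieved content. The sketch below assumes a record represented as a plain dictionary and a checksum written as `algorithm:hexdigest`; both conventions, and the `verify` function itself, are assumptions made for illustration rather than the PIT API.

```python
import hashlib


def verify(pid_record, content):
    """Compare content against the typed Size and Checksum entries of a PID record."""
    problems = []
    if "size" in pid_record and pid_record["size"] != len(content):
        problems.append("size mismatch")
    if "checksum" in pid_record:
        # Assumed convention: "<algorithm>:<hex digest>", e.g. "sha256:ab12...".
        algorithm, _, expected = pid_record["checksum"].partition(":")
        actual = hashlib.new(algorithm, content).hexdigest()
        if actual != expected:
            problems.append("checksum mismatch")
    return problems


content = b"example data"
record = {
    "format": "text/plain",
    "size": len(content),
    "checksum": "sha256:" + hashlib.sha256(content).hexdigest(),
}
```

Because the service only relies on the registered types, it does not need to know which repository or discipline the object comes from, which is the cross-service benefit the slide title points at.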
29
§ Register your types so they can be adopted and reused, making it easier for others to use your data
  § Information on how to register new types available in the report
§ Adopt types already being used in your domain to increase interoperability
§ Decouple object management from contents
  § Simplify client access to data across domains, implementations and changes in information models
  § More lightweight access to information on less accessible objects
Adoption & Impact
30
§ Adoption of these capabilities by PID infrastructure providers
§ Discipline-specific types, preferably from practical adoption
§ Establish a type ecosystem
§ Refine data model
§ Enhance REST API
Possible follow-ups
31
§ Draft final report available via the website
§ Demonstrator web GUI: http://smw-rda.esc.rzg.mpg.de/PitApiGui/
Conclusions
Practical Policies Outcomes
WG Practical Policies 33
WG Practical Policies 34
§ Create research data repository
§ Data: 2 TB, 500,000 files + growing + integrity + access (IG FIM) + publish (publication+PID) + …
§ Some assertions: policies & rules attached to the data
Scenario
Policy: Assertion or assurance that is enforced about a collection or a dataset
WG Practical Policies 35
Computer actionable policies
§ Enforce management,
§ Automate administrative tasks,
§ Validate assessment criteria,
§ Automate scientific analyses
§ etc.
A generic set of policies that can be revised and adapted by user communities and site managers does not exist.
§ Domain scientists who want to build up a collection or a repository
§ Data centers for automating policies
Problem
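To illustrate what "computer actionable" means here, a policy can be written as a named, machine-checkable assertion about each item in a collection, and an assessment then reports which items violate it. The `Policy` class, the `assess` function, the two example rules, and the item fields are hypothetical; they are not taken from the WG's templates or from the iRODS or GPFS implementations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    """A named assertion about a single item of a collection."""
    name: str
    holds: Callable[[dict], bool]


def assess(collection, policies):
    """Return {policy name: [ids of items that violate the policy]}."""
    report = {}
    for policy in policies:
        report[policy.name] = [
            item["id"] for item in collection if not policy.holds(item)
        ]
    return report


# Two illustrative rules; thresholds and field names are assumptions.
policies = [
    Policy("integrity: checksum recorded", lambda item: bool(item.get("checksum"))),
    Policy("replication: at least 2 copies", lambda item: item.get("replicas", 0) >= 2),
]
collection = [
    {"id": "file-1", "checksum": "abc", "replicas": 2},
    {"id": "file-2", "checksum": "", "replicas": 1},
]
report = assess(collection, policies)
```

Keeping the rule separate from the loop that applies it is what would let a community revise or swap policies without touching the enforcement code.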
WG Practical Policies 36
§ To bring together practitioners in policy making and policy implementation (nearly all RDA WG/IGs)
§ To identify typical application scenarios for policies such as replication, preservation etc.
§ To collect and to register practical policies
§ To enable sharing, revising, adapting, and re-using of computer actionable policies
Goals
WG Practical Policies 37
Policy                   Importance
Integrity                217
Preservation             150
Access control           126
Provenance               108
Data Management plans    99
Publication              75
Replication              66
Data staging             52
Federation               37
Metadata sharing         23
Regulatory               16
Collection properties    7
Identifiers              7
Data sharing             7
Versioning               7
Licensing                6
Format                   6
Data Life Cycle          6
Arrangement              5
Processing               5
Survey of 30 Institutions for Highest Priority Policies
In close cooperation with the Engagement Group
[Figure: Collection-based Policies, showing Contextual Metadata Extraction, Disposition, Data Retention, Integrity, Storage Cost Reports, Restricted Searching, Notification, Data Access Control, Use Agreements, Data Backup, Data Format Control]
Identification of 11 important policy areas:
WG Practical Policies 39
Identification of 11 important policy areas:
§ Contextual metadata extraction
§ Data access control
§ Data backup
§ Data format control
§ Data retention
§ Disposition
§ Integrity (including replication)
§ Notification
§ Restricted searching
§ Storage cost reports
§ Use agreements
Results
WG Practical Policies 40
https://www.rd-alliance.org/filedepot?cid=104&fid=556
Templates
§ Interactions of policies and DO attributes
§ Policy descriptions
§ Technology independent
§ Reviews of the provided policy areas in progress
Results
WG Practical Policies 41
https://www.rd-alliance.org/filedepot?cid=104&fid=553
§ Examples for implementations:
  § English language descriptions
  § iRODS
  § GPFS
§ ~50 pages
Results
WG Practical Policies 42
Result: List of policy categories and policies
§ Improved data center administration
§ By sharing policies, communities can interoperate and share data more effectively
§ Transparency: basis of establishing trust
§ Implemented policies: can be used as examples and be adapted to specific requirements and other data management systems
Impact
WG Practical Policies 43
Target Communities:
§ Groups managing data collections
§ Data centers
First adopters are the institutions/organizations who contributed to the results, e.g. RENCI, KIT, OSC, DARIAH, RZG, etc.:
§ EUDAT
§ CESNET
§ (DataNet Federation Consortium, WDS ? )
Adoption
WG Practical Policies 44
§ “Outcomes Policy Templates: Practical Policy Working Group, September 2014” https://www.rd-alliance.org/filedepot?cid=104&fid=556
§ “Implementations: Practical Policy Working Group, September 2014” https://www.rd-alliance.org/filedepot?cid=104&fid=553
§ Work in Progress: Reviews
Conclusions
WG Practical Policies 45
§ More interaction with other technical groups
  → Data Fabric
  → Publication policies
§ More interaction with domain specific groups
  → Adopters
For information please contact
§ Reagan Moore [email protected] and
§ Rainer Stotzka [email protected]
Conclusions: Next Steps
WG Practical Policies 46
Breakout Session: Tuesday September 23, 14:00 – 15:30
Agenda:
1. Introduction
2. Presentation of deliverables
3. David Antos & Petr Benedikt: “Policy implementations on GPFS”
4. Discussions:
  § Policy reviews
  § Adding new policies
  § Interoperability with other WG/IGs
  § Adoption
WG Practical Policies
47
§ More groups will be presenting at P5
§ Starting to see how different WG outputs can fit together
  § Ex: Data Fabric
§ Planning to have a major focus at P5 on adoption of WG outputs
§ Also thinking through how best to accelerate adoption and support groups that want to integrate RDA outputs
P5 and Adoption Day
48
§ Get involved in WGs, IGs to ensure outputs meet your needs and the needs of your organisation
§ Encourage your organisation to become aware of RDA outputs and evaluate or trial them
§ Look for places where RDA can make a difference
How you can help!