Decentralized Collaborative Knowledge Management using Git

Natanael Arndt ᵃ,∗, Patrick Naumann ᵇ, Norman Radtke ᵃ, Michael Martin ᵃ, Edgard Marx ᵃ,ᵇ

ᵃ Agile Knowledge Engineering and Semantic Web (AKSW), Institute of Computer Science, Leipzig University, Augustusplatz 10, 04109 Leipzig, Germany

ᵇ Hochschule für Technik, Wirtschaft und Kultur Leipzig (HTWK), Gustav-Freytag-Str. 42A, 04277 Leipzig, Germany

Abstract

The World Wide Web and the Semantic Web are designed as a network of distributed services and datasets. The distributed character of the Web brings manifold collaborative possibilities for interchanging data, yet the commonly adopted collaborative solutions for RDF data, such as SPARQL endpoints and wiki systems, are centralized. To actually support distributed collaboration, a system is needed that supports the divergence of datasets, provides the possibility to conflate diverged states, and allows the synchronization of different distributed datasets. We present the Quit Stack, inspired by and built on the successful Git system. The approach is based on a formal expression of the evolution and consolidation of distributed datasets. During the collaborative curation process, the system automatically versions the RDF dataset and tracks provenance information. It provides support for branching, merging, and synchronizing distributed RDF datasets. The merging process is guarded by merge strategies specific to RDF data. Finally, with our reference implementation we show reasonable performance and demonstrate the practical usability of the system.

Keywords: RDF, Semantic Web, Git, Distributed Collaboration, Distributed Version Control System, Knowledge Engineering
2010 MSC: 68P10, 68P20

1. Introduction

Apart from documents, datasets are gaining more attention on the World Wide Web. An increasing number of the datasets on the Web are available as Linked Data, also called the Linked Open Data Cloud¹ or Giant Global Graph². Collaboration of people and machines is a major aspect of the World Wide Web as well as of the Semantic Web. Currently, access to RDF data on the Semantic Web is possible by applying the Linked Data principles [9] and the SPARQL specification [41], which enables clients to access and retrieve data stored and published via SPARQL endpoints. RDF resources in the Semantic Web are interconnected and often correspond to previously created vocabularies and patterns. This way of reusing existing knowledge facilitates the modelling and representation of information and may optimally reduce the development costs of a knowledge base. However,

∗ Corresponding author
Email addresses: [email protected] (Natanael Arndt), [email protected] (Patrick Naumann), [email protected] (Norman Radtke), [email protected] (Michael Martin), [email protected] (Edgard Marx)
URL: http://aksw.org/NatanaelArndt (Natanael Arndt), http://aksw.org/NormanRadtke (Norman Radtke), http://aksw.org/MichaelMartin (Michael Martin), http://aksw.org/EdgardMarx (Edgard Marx)
¹ http://lod-cloud.net/
² http://dig.csail.mit.edu/breadcrumbs/node/215

reusing existing RDF resources (terminological as well as instance resources) causes problems in locating, applying, and managing them. The administrative burden of these resources increases immensely insofar as the original sources change (partially) and the reuse of these RDF resources takes place collaboratively and in a decentralized manner. For example, this type of reuse occurs during the creation of a domain-specific vocabulary or a specific set of instances developed by organizationally independent collaborators. Presumably, collaboration in such a setup is either primarily agile or organized top-down, in which case it has to be completely supervised. However, this complete supervision requires a high amount of effort. As a result, structural and content interferences as well as varying models and contradictory statements are inevitable.

Projects from a number of domains are striving for distributed models to collaborate on common knowledge bases. In the domain of e-humanities, the projects Pfarrerbuch³, Catalogus Professorum⁴ [40], Héloïse – European Network on Digital Academic History⁵ [39], and Professorial Career Patterns of the Early Modern History⁶ are good examples of the need to explore and track the provenance and evolution of the domain data. In the context of managing historical prosopographical data, the source of the statements is relevant to evaluate their credibility and to consider the influence of their environment. In libraries, metadata of electronic library resources are gathered and shared among stakeholders. The AMSL⁷ project aims to collaboratively curate and manage electronic library resources as Linked Data [4, 38]. In a collaborative data curation setup we need to identify the origin of any statement introduced into a dataset. This is essential in order to be able to track back the conclusion of license contracts and to identify sources of defective metadata. But even enterprises have a need for managing data in distributed setups. The LUCID – Linked Value Chain Data⁸ project [20] researches the communication of data along supply chains. The LEDS – Linked Enterprise Data Services⁹ project focuses on how to organize and support distributed collaboration on datasets for the management of background knowledge and business procedures.

³ http://aksw.org/Projects/Pfarrerbuch
⁴ http://aksw.org/Projects/CatalogusProfessorum
⁵ http://heloisenetwork.eu/
⁶ http://catalogus-professorum.org/projects/pcp-on-web/

Collaboration on Linked Data sets is currently mainly done by keeping a central version of a dataset, with collaborators editing the same instance simultaneously. Available systems to enable collaboration on Linked Data are central SPARQL endpoints and wiki systems [19, 18, 33]. In both cases a common version of a dataset is kept in a central infrastructure, and thus collaboration happens on a single shared instance. This central approach to a synchronized state has drawbacks in scenarios in which the existence of multiple different versions of the dataset is preferable. Furthermore, the evolution of a dataset in a distributed setup does not necessarily happen in a linear manner. Multiple different versions of a respective dataset occur if simultaneous access to the central dataset is not possible for all participants (for instance, if they are working from mobile devices with a limited network connection). Also, if a consensus on the statements in a dataset has not yet been reached, multiple viewpoints need to be expressed as different versions of the dataset. Hence, a system that fosters the evolution of a dataset in a distributed collaborative setup needs to

• support divergence of datasets,

• conflate diverged states of datasets, and

• synchronize different distributed derivatives of the respective dataset.

As a consequence of conflating diverged datasets, the utilized system also needs to

• identify possibly occurring conflicts and contradictions, as well as

• offer workflows for resolving identified conflicts and contradictions.

⁷ http://amsl.technology/
⁸ http://www.lucid-project.org/
⁹ http://www.leds-projekt.de/

In the early days of computers, the term software crisis was coined to describe the immaturity of the software engineering process and the software engineering domain. Dijkstra described the situation as follows:

[…] as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.¹⁰

The process of creating software could be made more reliable and controllable by introducing software engineering methods. In the 1970s, software configuration management enabled structured collaborative processes, in which version control is an important aspect to organize the evolution of software. Early version control systems (VCS), such as CVS and Subversion, allowed the creation of central repositories. The latest version in the repository represents the current state of development, and the linear versioning history draws the evolution process of the software. Distributed VCSs (DVCS), such as Darcs, Mercurial, and Git, were developed to allow every member of a distributed team to fork the current state of a program's source code and individually contribute new features or bug-fixes as pull requests, which can then be merged into a master branch representing the current stable version.

Learning from software engineering history, where DVCSs helped to overcome the software crisis, we claim that adapting DVCSs to Linked Data is a means to support decentralized and distributed collaboration processes in knowledge management. In the context of Linked Data, the subject of collaboration is datasets instead of source code files; central VCS systems correspond to central SPARQL endpoints and wiki systems. Similar to source code development with DVCSs, individual local versions of a dataset are curated by data scientists and domain experts. Tracking provenance during the process of changing data is a basic requirement for any version control system. Therefore, it is important to record the provenance of data at any step of a process involving possible changes of a dataset (e.g. creation, curation, linking).

Our aim is to provide a system that enables distributed collaboration of data scientists and domain experts on RDF datasets. Yet, we focus on a generic solution for the problem of collaboration on RDF datasets. By generic we mean that the solution should not make any assumptions about the application domain. This includes that it should rely on the pure RDF data model and not rely on or add support for additional semantics such as OWL or SKOS. On the informal Semantic Web Layer Cake model¹¹ we are thus focusing on the syntactic data interchange layer in combination with the generic SPARQL query language. To support distributed collaboration on the data interchange layer we propose a methodology of using a Git repository for storing the data in combination with a SPARQL 1.1 interface to access it. We introduce the Quit Stack ("Quads in Git") as an integration layer to make the collaboration features of Git repositories accessible to applications operating on RDF datasets.

¹⁰ https://www.cs.utexas.edu/users/EWD/transcriptions/EWD03xx/EWD340.html
¹¹ https://www.w3.org/2007/03/layerCake.svg
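To make this concrete, a minimal client-side sketch follows. It assumes a hypothetical Quit-style endpoint at http://localhost:5000/sparql (the URL and port are our assumption, not fixed by the approach) and sends a SPARQL 1.1 Update via HTTP POST, as specified by the SPARQL 1.1 Protocol; the resulting change would then be recorded by Git as a commit in the underlying repository.

# A minimal sketch of a client talking to a Quit-style SPARQL 1.1
# endpoint. Endpoint URL and example IRIs are illustrative assumptions.
import urllib.request

ENDPOINT = "http://localhost:5000/sparql"  # hypothetical endpoint

update = """
PREFIX ex: <http://example.org/>
INSERT DATA { GRAPH ex:graph { ex:alice ex:knows ex:bob . } }
"""

# SPARQL 1.1 Update is sent via HTTP POST with the
# application/sparql-update content type.
req = urllib.request.Request(
    ENDPOINT,
    data=update.encode("utf-8"),
    headers={"Content-Type": "application/sparql-update"},
)
urllib.request.urlopen(req)

# The versioning side stays plain Git: the commit created for this
# update can be inspected with `git log` in the repository behind
# the store.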

In this paper we combine multiple aspects of supporting collaboration on RDF datasets. In addition to the individual improvements to those aspects, which were discussed independently, we can now present a combined and comprehensive system. For describing and transforming the operations implemented in Git to operations on RDF datasets, we introduce a formal model. The formal model was published in [2]; in this paper we present a more elaborate formal model, which was especially extended and improved regarding the atomic partitioning. The formal model is used for expressing changes and the evolution of datasets with support for tracking, reverting, branching, and merging. This model provides support for named graphs and for blank nodes. To actually pursue merging operations and to identify and resolve conflicts, we propose various merge strategies which can be utilized in different scenarios. To present and further examine the possibilities to support distributed collaborative dataset curation processes, we have created a reference implementation. Initial functionality for tracking and analyzing data provenance using the Quit Store was examined in [3]. Since then we have improved and clarified the handling of the provenance data and can present an extended data model. We enable access to provenance-related metadata retrieved from RDF data that is managed in a Git repository. The system further supports collaborative processes by providing provenance mechanisms to track down and debug the sources of errors. We published the description of the initial prototype of the Quit Store in [5]. In this paper we present an improved implementation, which provides support for provenance tracking and exploitation. Further, we were able to increase the overall performance of the system by implementing a reworked architecture, which is mainly covered in section 9. The new system also has a practically usable interface to create, query, and merge different branches, where the user can select between multiple merge strategies.

The paper is structured as follows. We present the description of an application domain with the relevant use cases we are targeting in section 2. Requirements for a decentralized collaboration setup are formulated in section 3. The state of the art is presented and discussed in section 4, followed by relevant preliminaries, such as Git, a discussion of RDF serialization, and the comparison of RDF graphs as well as blank nodes, in section 5. An introduction to the preliminaries of the formal model with basic definitions is given in section 6. Based on the definitions in section 6, the basic operations on a versioning graph of distributed evolving datasets are defined in section 7. As an extension of the basic operations, different merge strategies are presented in section 8. The approach and methodology of the system as well as the prototypical reference implementation are specified in detail in section 9. The presented concepts are evaluated regarding correctness and performance, using our prototypical implementation, in section 10. Finally, we discuss the results of the paper in section 11, and a conclusion is given together with a prospect of future work in section 12.

Throughout the paper we are using the following RDF prefix mappings:

@prefix quit: <http://quit.aksw.org/vocab/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix local: <http://quit.local/> .
@prefix ex: <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

In most cases we could reuse existing terms from the prov: and foaf: vocabularies to express the provenance information produced by the Quit Store. The quit: vocabulary was created for defining additional terms and instances which were not defined in existing vocabularies; furthermore, it also provides terms to express the software configuration of the Quit Stack. We are using the namespace local: to denote terms and instances which are only valid in the local context of a system. The namespace ex: is used for exemplary instances which are placeholders for instances whose IRIs depend on the exact application domain.

2. Domain and Derived Use Cases

The research group Agile Knowledge Engineering and Semantic Web (AKSW) together with historians from the Herzog August Library in Wolfenbüttel (HAB) run a research project in cooperation with partners from the Working Group of the German Professor's catalogs and the European research network for academic history, Héloïse. The project's aim is to develop a new research method across the humanities and computer science, employing the Semantic Web and distributed online databases for the study and evaluation of collected information on groups of historical persons, focusing on German professors' career patterns during the 18th and 19th centuries. The project deals with many challenges, as the individual datasets have been developed by different communities, some of them, such as the Catalogus Professorum Lipsiensium, more than 600 years old. These communities have different singularities that make the development and management of a common vocabulary very challenging. For instance, the German professor's dataset of the state of Saxony contains a list of as many as six working places and respective positions of each professor across time as well as a detailed description of the archive the information was extracted from. This information was not available in previously published datasets and would therefore cause a vocabulary change. However, in order to perform the changes, there is a need for (a) creating an independent vocabulary or (b) implementing the changes based on a mutual agreement among the different research groups. Creating independent vocabularies (a) in the individual working groups would not help to integrate the datasets. Due to the organizational distribution of the individual research groups, a common vocabulary expressing a mutual agreement among the groups (b) can only be the result of a collaborative process. Based on the origin of the individual datasets and their diverse granularity and structure, the individual discussions have varying starting points.

This description of an application domain is exemplary for other interdisciplinary projects in fields such as the digital humanities, but it can also stand for many other collaborative scenarios. Our aim is to provide a system to support the distributed collaboration of data scientists and domain experts on RDF datasets. To break this domain description down, we have identified several use cases which can contribute to a comprehensive system supporting such collaborative evolutionary processes. In the following we describe these use cases:

• UC 1: Collaborative and Decentralized Design

• UC 2: Crowdsourcing Information with Citizen Scientists

• UC 3: Integrating With Existing Systems

• UC 4: Backup

• UC 5: Provenance Recording and Exploration

UC1 Collaborative and Decentralized Design. Vocabularies and datasets are used to encode a common understanding for describing information which represents and expresses a given domain from the point of view of the creators at a certain time. Until a formulation of the common understanding or definition of a term or dataset item is reached, different proposals are discussed in parallel. The process of creating a vocabulary or a dataset is long and involves many aspects such as the targeted users, the domain, the used technology, and the maintenance of the vocabulary over time. It can be collaborative and agile, and it usually implicates many iterations until it reaches a mature state. For instance, the creation and evolution of the foaf: vocabulary went through 10 public iterations of the specification from 2005 until 2014¹². In a more generalized way, distributed collaboration allows users to work together through individual contributions. It played an important role in the evolution of the World Wide Web. Good examples of collaborative systems are Wikipedia¹³ and Wikidata¹⁴. The Wikipedia project provides a platform for collaboration and exchange which allows volunteers around the world to collectively curate its content in a crowd-sourcing manner. However, users might have different needs and requirements. Heterogeneous distributed collaboration systems can support a multitude of data perspectives, and a decentralized fashion allows users to evolve their dataset copy distinctly while still sharing common portions of it.

¹² http://xmlns.com/foaf/spec/
¹³ https://www.wikipedia.org/
¹⁴ https://www.wikidata.org/

Figure 1: Integrating a crowdsourcing process into a collaborative curation process.

UC2 Crowdsourcing Information with Citizen Scientists. Collaborative processes can also be used to incorporate external non-professional collaborators into a research workflow. This shall be demonstrated with the following prototypical user story. Alice is collaborating with a team on some scientific set of open data (fig. 1, 1). Alice has published the data together with an exploration interface on the Web (fig. 1, 2). While Alice's team of curators is working hard on the data, some citizen scientists send e-mails to Alice reporting errors in the data (fig. 1, 3). Since Alice and her team are working hard, there is no time to read through all the prose of the citizen scientists and then manually seek the property in the dataset which has to be corrected. To incorporate the contributions of the citizen scientists into the existing collaboration workflow, Alice adds a data editor to the exploration interface on the Web (fig. 1, 4). The exploration interface can then be used by the citizen scientists to fix errors they have found. All changes made by the citizen scientists are then transferred to the collaboration repository (fig. 1, 5), which is synchronized with the repositories of the team of curators (fig. 1, 6). Now each member of the team of curators can review the proposed changes and incorporate them into the dataset.

UC3 Integrating With Existing Systems. Tools for creating and editing RDF knowledge bases exist, but they might lack support for collaborative scenarios. It would involve a high complexity to extend existing single-place editing and exploration systems with collaboration functionality. A possibility to attach single-place systems to a collaboration infrastructure by using well defined interfaces would allow the creation of collaborative setups without the need for new tools. An architecture employing well defined interfaces for existing tools would also follow and support the single responsibility principle as formulated in the context of the "UNIX philosophy": Make each program do one thing well [22, 23]. Users can thus continue using their existing tools, daily workflows, and familiar interfaces.

UC4 Backup. Data storage systems can be attacked, hacked, and fail, and data can be corrupted. Some users might desire to prevent data loss by periodically performing dataset backups, for instance by synchronizing them with a remote system. By doing so, it is easier to restore the data whenever necessary. RDF data is no different from any other data in that it is always important to create backups. A version controlled backup system also allows data to be restored even after faulty changes. Further, a backup of the work of each collaborator allows other parties to continue the overall work even if collaborating parties quit. Providing means to back up data helps to avoid content loss as well as the time spent restoring it. Integrating a tool for tracking the changes of the data and submitting the data to a safe location into the daily work of the data creator avoids gaps in the backup and distraction from the main tasks.

UC5 Provenance Recording and Exploration. When integrating datasets from different sources or performing update operations, recording of provenance information can be desired. According to the Oxford online dictionary, provenance has to do with the "origin or earliest known history of something"¹⁵. The storage of data's provenance all the way down to the atomic level (insertions and deletions) can be very useful to backtrack the data transformation process and spotlight possible errors on different levels during the data life-cycle [8] as well as in the data itself. Therefore, the provenance can be explored to improve data management and engineering processes. To support developers of provenance systems, Groth et al. [25] provide general aspects that should be considered by any system which deals with provenance information. The aspects are categorized as follows [25]:

• Content describes what should be contained in provenance data, whereas entities, contributing sources, processes generating artifacts, versioning, justification, and entailment are relevant dimensions.

• Management refers to concerns about how provenance should be captured and maintained, including publication and access, dissemination, and how a system scales.

• Use is about how user-specific problems can be solved using recorded provenance. Mentioned are understanding, interoperability, comparison, accountability, trust, imperfections, and debugging.

¹⁵ https://en.oxforddictionaries.com/definition/provenance

3. Requirements

Based on our use cases we formulate the following requirements for a distributed collaboration system on RDF datasets. The requirements point out aspects of supporting a distributed setup of data curators collaborating on RDF datasets as well as aspects of the system which are needed to record and explore the evolution of the data. Requirements for collaborative vocabulary development were already formulated in [27]. We adopt some of these requirements where they overlap with our use cases. In contrast to [27], we focus on the technical collaboration in a distributed network rather than the specific process of creating vocabularies. First we present the three major requirements which are directly implied by our aim to support distributed collaboration (REQ 1 to 3). The major requirements are followed by five requirements which are needed to support the distributed collaboration process (REQ 4 to 6), resp. are necessary for employing the system in a Semantic Web context (REQ 7 and 8).

REQ1 Support Divergence. In collaborative scenarios, contributors may differ in their motivation to contribute, for example because of their organizational role, may differ in their opinion on the subject of collaboration, or may provide contradicting contributions in any other way. Thus the existence of multiple different versions of the dataset is preferable, for instance to express dissensus, disagreement, or situations in which a consensus has not yet been reached. Especially in distributed collaborative setups (cf. UC 1 and 2) it can happen that collaborating parties do not always share a common understanding of a certain topic. But also because of organizational structures, the evolution of a dataset does not necessarily happen in a linear manner, for instance if partial subjects are discussed in sub-working groups. Thus the system needs to be able to handle diverging states of a dataset.

REQ2 Conflate Diverged States. The aim of collaboration is to eventually contribute to a common subject. To combine the contributions of diverged states of a dataset into a common dataset, a possibility to conflate the diverged states is needed. Because diverged states can encode dissensus, it is necessary to identify possible conflicts before actually merging the datasets. The definition of a conflict depends on the semantics of the data and the application domain; thus different strategies are needed to identify conflicts. When conflicts are identified, the system needs to offer workflows for resolving the identified conflicts and contradictions.

REQ3 Synchronize. Collaborating on a single centralized copy of a dataset causes many coordination difficulties, for instance if simultaneous access to the central dataset is not possible for all participants (e.g. from mobile devices). Thus we are focusing on a distributed setup of collaboration systems. To support the collaboration of multiple distributed parties across system boundaries (cf. UC 1 and 2) it is necessary to exchange data between systems. The system should support the synchronization of its available states of the dataset with the available states of remote setups. This is also important to keep up a collaborative process if participating systems fail (cf. UC 4). This synchronization process can happen in real-time or asynchronously.

REQ4 Provenance of Contributions. Provenance information has to be attached to a contribution to the common dataset: at least a change reason, author information, and the date of the commit. For automated interaction with the system, the executed operations have to be documented, e.g. the query or the data source. To utilize the provenance information recorded during the evolution of a dataset, it needs to be represented in a queriable graph. Access to this graph has to be provided through a query interface. The interface has to return a structured representation of the metadata recorded for the selected versions of a dataset. This is necessary as a prerequisite for the aspects publication and access, resp. dissemination, of the provenance information (cf. UC 5: Management; Communication support (R1) and Provenance of information (R2) in [27]).
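As an illustration of such a query interface, a hedged sketch follows. The exact provenance data model is introduced later in the paper, so the query below assumes only generic PROV terms; the graph layout it implies is our illustrative assumption, not the normative Quit model.

# A hedged sketch of REQ 4: commit metadata recorded in a provenance
# graph is selected with generic PROV terms. The assumed shape
# (?commit a prov:Activity ...) is illustrative only.
provenance_query = """
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?commit ?author ?time WHERE {
    ?commit a prov:Activity ;
            prov:wasAssociatedWith ?author ;
            prov:endedAtTime ?time .
}
ORDER BY DESC(?time)
"""
# The query string would be POSTed to the provenance query interface,
# analogous to the update request in the earlier sketch.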

REQ5 Random Access to any Version. For a collaboration system it should be possible to randomly access any version of the dataset without the need to roll back the latest versions resp. reset the storage system, for instance when collaborators are currently working on different versions of the data (cf. UC 1). This allows queries across versions and can support the debugging process (Use, cf. UC 5). To restore any time slice of a backup it is also necessary to access and restore an arbitrary version (cf. UC 4).

REQ6 Deltas Among Versions. It is required to calculate the difference between versions generated by the contributions of collaborators. The difference should express the actual changes to the dataset rather than changes to the serialization of the data in the repository. This is necessary when reviewing external contributions to a dataset (cf. UC 2). It is also a prerequisite for exploring the provenance information, for the debug operation (Use, cf. UC 5), and for analyzing backups of datasets (cf. UC 4). The calculated difference should be expressed in a machine readable format (cf. Deltas among versions (R7) in [27]).

REQ7 Support of RDF Data Sets and Modularization of Graphs. The system should be able to handle multiple RDF graphs (i.e. RDF datasets) in a repository. This allows users resp. collaborators to organize the stored knowledge in individual organizational units, as required by their application. This requirement also provides the functionality to implement the requirement Modularity (R9) as formulated in [27]. The method works with different granularities of modularization of RDF datasets. This is of interest when the system should be integrated with existing systems (cf. UC 3).

REQ8 Standard Data Access Interface. Different implementations of collaboration interfaces can access and collaborate on a common repository (cf. UC 3). Collaborators can use different RDF editors to contribute to the repository (cf. UC 1). To some extent the methodology should even be robust to manual editing of the RDF files contained in the repository. In contrast to the requirement Editor agnostic (R8) as formulated in [27], we do not require syntax independence in the repository and understand the editor agnosticism as transparency of the provided application interface. Besides adherence to the RDF data format, the system also has to provide a data access and update interface following the SPARQL 1.1 [41] standard.

4. State of the Art

In the following we look into existing approaches that partially solve the targeted problem. First we consider abstract models for expressing changes and evolution, such as methodologies and vocabularies for managing the decentralized evolution of RDF datasets, in section 4.1. Second, we examine implementations dealing with the versioning of RDF data in section 4.2, followed by applications built on top of RDF versioning systems in section 4.3.

4.1. Theoretical Foundations and Vocabularies

Currently, various vocabularies exist that allow the description of provenance. As a World Wide Web Consortium (W3C) Recommendation, the PROV ontology (PROV-O) [34] is the de-facto standard for the representation and exchange of domain-independent provenance. The Open Provenance Model [37] predates PROV-O, but both use very similar approaches as their core components. Both vocabularies enable the description of provenance data as relations between agents, entities, and activities or their respective equivalents.

Another popular standard for general-purpose metadata is Dublin Core, resp. the Dublin Core Metadata Terms [14]. The main difference to the prior ontologies lies in their perspective on expressing provenance. Both vocabularies provide means for expressing provenance metadata: while PROV-O is more focused on the activities that lead to a specific entity, Dublin Core focuses on the resulting entities.

One advantage of using domain-independent vocabularies as a core is their applicability to systems and tools that operate without any domain-specific knowledge. PROV-O-Viz¹⁶ is an example of a visualization tool only working with data expressed according to the PROV ontology.

¹⁶ http://provoviz.org/

Berners-Lee and Connolly [9] give a general overview of the problem of synchronization and how to calculate deltas on RDF graphs. This work considers the transfer of changes to datasets by applying patches. They introduce a conceptual ontology that describes patches in "a way to uniquely identify what is changing" and "to distinguish between the pieces added and those subtracted".

Haase and Stojanovic [26] introduce their concept of ontology evolution as follows: "Ontology evolution can be defined as the timely adaptation of an ontology to the arisen changes and the consistent management of these changes. […] An important aspect in the evolution process is to guarantee the consistency of the ontology when changes occur, considering the semantics of the ontology change." This work focuses on the linear evolution process of an individual dataset rather than a decentralized evolution process. For dealing with inconsistency resp. consistency, they define three levels: structural, logical, and user-defined consistency. In the remainder of the paper they mainly focus on the implications of the evolution with respect to OWL (DL) rather than a generic approach.

Auer and Herre [7] propose a generic framework to support the versioning and evolution of RDF graphs. The main concept introduced in this work is the concept of atomic graphs, which provides a practical approach for dealing with blank nodes in change sets. Additionally, they introduce a formal hierarchical system for structuring a set of changes and evolution patterns leading to the changes of a knowledge base.

4.2. Practical Knowledge Base Versioning Systems

Table 1 provides an overview of the related work and compares the presented approaches with regard to different aspects. One aspect for categorizing versioning systems is the archiving policy. Fernández et al. [17] define three archiving policies: IC – Independent Copies, where each version is stored and managed as a different, isolated dataset; CB – Change-based (delta), where differences between versions are computed and stored based on a basic language of changes describing the change operations; and TB – Timestamp-based, where each statement is annotated with its temporal validity. These three archiving policies do not cover the full range of possible archiving systems, and thus we additionally define the archiving policy FB – Fragment-based. This policy stores snapshots of each changed fragment of an archive. Depending on the requirements, fragments can be defined at any level of granularity (e.g. resources, subgraphs, or individual graphs in a dataset). An index is maintained which references the fragments belonging to a version of the dataset. This approach addresses the issue of IC of fully repeating all triples across versions. In contrast to CB it is not necessary to reconstruct individual versions by applying the change operations.
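The FB policy can be illustrated with a short sketch. The following minimal Python model is ours and purely illustrative (it is not taken from any of the cited systems): fragments are content-addressed snapshots, and a per-version index maps graph names to fragment hashes, so any version can be reconstructed without replaying deltas, while unchanged fragments are stored only once.

# A minimal, hypothetical model of fragment-based (FB) archiving.
import hashlib

store = {}     # content hash -> serialized fragment (snapshot)
versions = {}  # version id -> {graph name -> content hash}

def commit_version(version_id, dataset):
    """dataset: mapping of graph name -> serialized graph content."""
    index = {}
    for graph_name, content in dataset.items():
        digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
        store.setdefault(digest, content)  # unchanged fragments are reused
        index[graph_name] = digest
    versions[version_id] = index

def checkout(version_id):
    """Reconstruct a version directly from its fragments (no delta replay)."""
    return {name: store[digest]
            for name, digest in versions[version_id].items()}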

Besides the archiving policy used, we compare whether the systems allow collaboration on datasets with multiple graphs (quad support, REQ 7) and whether it is possible to access individual versions in the repository (random access, REQ 5). Finally, we want to find out how the existing systems support the solution of our three basic requirements Support Divergence (REQ 1), Conflate Diverged States (REQ 2), and Synchronize (REQ 3) by allowing the creation of multiple branches, the merging of branches, and distributed setups for collaboration using push and pull mechanisms.

TailR, as presented by Meinhardt et al. [35], is a system for preserving the history of arbitrary RDF datasets on the web. It follows a combined delta and snapshot storage approach. The system is comparable to the approach presented by Frommhold et al. [21], as both are linear change tracking systems. Neither of the systems provides support for branches to allow the independent evolution of RDF graphs.

Another approach is implemented by stardog¹⁷, a triple store with integrated version control capabilities. The versioning module provides functionality for tagging and calculating the difference between revisions¹⁸. Snapshots contain all named graphs from the time the snapshot was taken. RDF data and snapshots are stored in a relational database. The current state of the database can be queried via a SPARQL interface. While older states of the database can be restored, to our knowledge, they cannot be queried directly. An additional graph containing the version history is provided.

Cassidy and Ballantine [12] present a version control system for RDF graphs based on the model of Darcs. Their approach covers the versioning operations commute, revert, and merge. Even though Darcs is considered a DVCS, in contrast to other DVCSs the presented approach only supports linear version tracking. Further, the merge operation is implemented using patch commutation, which requires history rewriting and thus loses the context of the original changes.

Graube et al. [24] propose the R43ples approach, which uses named graphs for storing revisions as deltas. For expressing the provenance information it uses the RMO vocabulary¹⁹, which is an extended and more domain-specific version of the PROV ontology. For querying and updating the triple store, an extended SPARQL protocol language is introduced.

R&Wbase by Vander Sande et al. [45] is a tool for versioning an RDF graph. It tracks changes which are stored in individual named graphs that are combined at query time; this makes it impossible to use the system to manage RDF datasets with multiple named graphs. The system also implements an understanding of coexisting branches within a versioning graph, which is very close to the concept of Git.

In the dat²⁰ project a tool for distributing and synchronizing data is developed, the aim being to synchronize any file type peer to peer. It has no support for managing branches and merging diverged versions and does not focus on RDF data.

¹⁷ http://stardog.com/
¹⁸ http://www.stardog.com/docs/#_versioning, https://github.com/stardog-union/stardog-examples/blob/d7ac8b5/examples/cli/versioning/README.md
¹⁹ https://github.com/plt-tud/r43ples/blob/master/doc/ontology/RMO.ttl
²⁰ http://dat-data.com/

Approach                     | Archiving Policy   | Quad Support | Random Access | Branches/Merge | Synchronize (Push/Pull)
Frommhold et al. [21]        | CB                 | yes          | no            | noᵈ            | no
Meinhardt et al. [35]        | hybrid (IC and CB) | noᵃ          | yes           | noᵈ            | (yes)ʰ
stardog                      | IC                 | yes          | no            | no             | no
Cassidy and Ballantine [12]  | CB                 | no           | no            | no/(yes)ᵈ,ᵉ    | (yes)ⁱ
Vander Sande et al. [45]     | TB                 | noᵇ          | yes           | yes/(yes)ᶠ     | no
Graube et al. [24]           | CB                 | yesᵇ,ᶜ       | yes           | yes/(yes)ᵍ     | no
dat                          | FB (chunks)        | n/a          | yes           | no             | yes

ᵃ The granularity of versioning is repositories; ᵇ The context is used to encode revisions; ᶜ Graphs are separately put under version control; ᵈ Only linear change tracking is supported; ᵉ If a workspace is duplicated, subsequent patches can be applied to the original copy, which is called merge in this system; ᶠ Naive merge implementation; ᵍ The publication mentions a merging interface; ʰ No pull mechanism but history replication via memento API; ⁱ Synchronization happens by exchanging patches

Table 1: Comparison of the different (D)VCS systems for RDF data. Custom implementations exist for all of these systems and they do not re-use existing VCSs. At the level of abstraction all of these systems can be located on the data interchange layer.


Looking at the above knowledge base versioning systems as listed in table 1, it is clear that only two of them can fulfill the requirement Support Divergence by providing a branching model, namely R&Wbase [45] and R43ples [24]. Even so, they currently have only very limited support for merge operations to fulfill the requirement Conflate Diverged States. Moving on to the support for the requirement Synchronize, we see TailR [35], the approach by Cassidy and Ballantine [12], and dat. Given its limited support for RDF, we can ignore dat, while TailR and the approach by Cassidy and Ballantine do not bring support for a branching system and thus cannot fulfill the first two requirements. One can argue that it is more likely that a system with proper support for branching and merging can be extended with an appropriate synchronization system than the other way around, because once all conflicts are resolved and the conflation is performed locally, only the storage structure needs to be transferred. Thus the remaining relevant related work is R&Wbase [45] and R43ples [24].

4.3. Applications for Exploiting Knowledge Versioning

Git4Voc, as proposed by Halilaj et al. [27], is a methodology and collection of best practices for collaboratively creating RDF vocabularies using Git repositories. To support vocabulary authors in the process of creating RDF and OWL vocabularies, Git4Voc is implemented using pre- and post-commit hooks for validating the vocabulary and generating documentation. For validating the vocabulary specification, a combination of local and online tools is used. In preparation of the presented Git4Voc system, Halilaj et al. have formulated important requirements for collaboration on RDF data. We have partially incorporated these requirements in section 3. Based on Git4Voc, Halilaj et al. have created the VoCol [27] system as an integrated development environment for vocabularies. For VoCol, the three core activities modeling, population, and testing are formulated. Neither VoCol nor Git4Voc is focused on providing a versioning system for RDF data in general; rather, they are tools built on top of a versioning system to specifically support the development of vocabularies.

The Git2PROV²¹ tool [15] allows the generation of a provenance document using the PROV ontology for any public Git repository. It can be used as a web service or can be executed locally. Since our aim is to provide provenance for RDF data on graph and triple level, Git2PROV is not suited as a component, since it is only able to handle provenance on a per-file level.

5. Preliminaries

In the following we give a brief overview of and introduction to technologies and design considerations that are relevant for our methodology. At first we introduce the DVCS Git by describing its general architecture and some technological details relevant for the realization of our approach in section 5.1. Then we discuss design considerations regarding the storage and serialization of RDF data in section 5.2 and an approach to support blank nodes in section 5.3.

5.1. Git

Git²² is a DVCS designed to be used in software development. It is used for managing over 64 million projects on GitHub²³, is also used on other platforms such as Bitbucket or GitLab, is hosted on self-controlled servers, and is used in a peer-to-peer manner. Git offers various branching and merging strategies, and synchronization with multiple remote repositories. Due to this flexibility, best practices and workflows have been developed to support software engineering teams in organizing different versions of a program's source code, such as gitflow²⁴ and the Forking Workflow²⁵.

²¹ http://git2prov.org/
²² https://git-scm.com/
²³ https://github.com/about, 2017-08-15

In contrast to other VCSs such as Subversion or CVS²⁶, Git is a DVCS. As such, in Git, users work on a local version of a remote Git repository, which is a complete clone of the remote repository. Git operations, such as commit, merge, and revert, as well as (interactive) rebase, are executed on the local system. Out of the box, Git already provides the capability to store provenance information alongside commits. The repository contains commits, which represent a certain version of the working directory. Each version of the working directory contains the current state of its files at the given version. Even though Git is mainly intended to work with text files, its storage system does not distinguish between text and arbitrary binary files. Files are stored as binary large objects (blobs) and are referenced by commits, while equal files are stored only once.

Thus, for extending our provenance information we have to dig a little deeper into Git's internal structure. The internal storage structure of Git is a file-system based key-value store. Git uses different types of objects to store both structural information and data in files. The types used by Git to organize its data are blobs, trees, and commits; the types are linked as shown in fig. 2. The individual data items are addressed by their sha1 hash²⁷, which is also used as commit ID resp. blob ID.

Figure 2: Internal structure used by Git.

The content of any file that is put under version control is stored as a blob, while folders are stored as trees. A tree is a list of references to other trees and blobs. Each revision is represented by a commit object consisting of metadata, references to parent commits, and a reference to a tree object, which is considered as root. References such as branches and tags are simply files within Git, pointing to a commit as their entry point into the revision history.
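This object model can be inspected directly with Git's plumbing commands. The following minimal sketch (not part of the Quit Stack itself; the repository path is a hypothetical placeholder) walks from a commit to its root tree using git cat-file.

# A minimal sketch inspecting Git's object structure
# (commit -> tree -> blob) via the plumbing command `git cat-file`.
import subprocess

REPO_PATH = "/tmp/dataset-repo"  # hypothetical repository location

def cat_file(obj_id: str) -> str:
    """Pretty-print a Git object (commit, tree, or blob) by its ID."""
    return subprocess.check_output(
        ["git", "-C", REPO_PATH, "cat-file", "-p", obj_id], text=True)

# The commit object lists the root tree, parent commit(s), author,
# committer, and message.
commit = cat_file("HEAD")
print(commit)

# The first line of a commit is "tree <sha1>"; the tree object lists
# the blobs (files) and subtrees (folders) of this version.
tree_id = commit.splitlines()[0].split()[1]
print(cat_file(tree_id))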

²⁴ http://nvie.com/posts/a-successful-git-branching-model/
²⁵ https://www.atlassian.com/git/tutorials/comparing-workflows/forking-workflow
²⁶ Concurrent Versions System, http://savannah.nongnu.org/projects/cvs
²⁷ Secure Hash Algorithm

With the Git transfer protocol it is possible to synchronize distributed repositories without the need for a central instance.

5.2. Serialization of RDF Data

RDF 1.1 specifies multiple formats which can be used for serializing RDF graphs (RDF/XML²⁸, Turtle²⁹, RDFa³⁰, N-Triples³¹) and RDF datasets (TriG³², JSON-LD³³, N-Quads³⁴). RDF graphs and RDF datasets can be serialized in different formats, and thus the same RDF statements can result in completely different textual representations and varying file sizes. Even the same graph or dataset serialized twice in the same serialization format can be textually different. To allow a better readability and processability of the differences between two versions in the VCS (cf. section 3, "Deltas Among Versions"), we have to find an easy-to-compare default serialization format. For our approach we have decided to use the N-Quads serialization [11] in Git repositories. N-Quads is a line-based, plain text format, which represents one statement per line. Since Git also treats lines as atoms, it will automatically treat statements in N-Quads as atomic units. In contrast to N-Triples, N-Quads supports the encoding of complete RDF datasets (multiple graphs); N-Triples is a subset of N-Quads, using only the default graph. Another candidate would be TriG (Turtle extended by support for RDF datasets), where, in contrast to N-Quads, one line does not necessarily represent one statement. Also, due to the abbreviation features (using ; or , as delimiters) as well as multi-line literals, automatic line merges can destroy the syntax. Similar problems would occur with the other serialization formats listed above. To further ensure stability and comparability of the files we maintain a sorted list of statements during the serialization.
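The effect of this decision can be sketched in a few lines. The following minimal example (using the rdflib library; the file name and IRIs are illustrative, and rdflib 6 or later is assumed so that serialize() returns a string) writes a dataset as N-Quads with sorted lines, so that a Git diff of the resulting file corresponds line-by-line to added and removed statements.

# A minimal sketch: serialize an RDF dataset as sorted N-Quads.
from rdflib import Dataset, URIRef

ds = Dataset()
g = ds.graph(URIRef("http://example.org/graph"))
g.add((URIRef("http://example.org/s"),
       URIRef("http://example.org/p"),
       URIRef("http://example.org/o")))

# One statement per line; sorting yields a stable, comparable layout.
nquads = ds.serialize(format="nquads")
lines = sorted(line for line in nquads.splitlines() if line.strip())
with open("dataset.nq", "w") as f:
    f.write("\n".join(lines) + "\n")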

Halilaj et al. [27] propose the usage of Turtle in Git repositories to address the requirement of being editor agnostic. Since a transformation to any other serialization format is possible, e.g. using rapper³⁵ or Jena RIOT³⁶, our approach does not put additional constraints on the serialization format used in an editor application. Further, as stated above, we find N-Quads a better fit for Git versioning than Turtle.

²⁸ https://www.w3.org/TR/2014/REC-rdf-syntax-grammar-20140225/
²⁹ https://www.w3.org/TR/2014/REC-turtle-20140225/
³⁰ https://www.w3.org/TR/2015/NOTE-rdfa-primer-20150317/
³¹ https://www.w3.org/TR/2014/REC-n-triples-20140225/
³² https://www.w3.org/TR/2014/REC-trig-20140225/
³³ https://www.w3.org/TR/2014/REC-json-ld-20140116/
³⁴ https://www.w3.org/TR/2014/REC-n-quads-20140225/
³⁵ http://librdf.org/raptor/rapper.html
³⁶ https://jena.apache.org/documentation/io/

5.3. Blank Nodes in Versioning

Using RDF as an exchange format, blank nodes are still a problem we have to deal with. Blank nodes are identifiers with a local scope and so might be different for each participating platform.

The RDF 1.1 recommendation suggests replacing blank nodes with IRIs [13], which is called skolemization. However, replacing all inserted blank nodes with skolem IRIs would alter the stored dataset. Our preferred solution is thus to break down all operations on the dataset to atomic graphs, as proposed by Auer and Herre [7].
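For illustration, the following minimal rdflib sketch shows the skolemization alternative mentioned above, which we avoid precisely because it rewrites the stored dataset; the example data is hypothetical.

# A minimal sketch of skolemization with rdflib, shown only to
# illustrate the alternative that our approach deliberately avoids.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:alice ex:knows [ ex:name "Bob" ] .
""", format="turtle")

# skolemize() returns a new graph in which every blank node is
# replaced by a globally unique skolem IRI, i.e. the data changes.
sk = g.skolemize()
for triple in sk:
    print(triple)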

6. Definitions

In this section we are introducing a formalization forexpressing changes to RDF graphs based on additions anddeletions of atomic subgraphs. This foundational formalmodel is used to describe the more complex operationsin sections 7 and 8. As it is commonly used in RDFand as it is also defined in [13] we define an RDF graphresp. just graph as a set of RDF triples. With the excep-tion, that we consider isomorphic sub-graphs as identicaland de-duplicate these sub-graphs during our operations.37

An RDF dataset resp. just dataset is a collection of RDFgraphs as defined in [13].

According to Auer and Herre [7] an Atomic Graph isdefined as follows:

Definition 1 (Atomic Graph). A graph is called atomicif it can not be split into two nonempty graphs whose re-spective sets of blank nodes are disjoint.

This implies that all graphs containing exactly onestatement are atomic. Furthermore a graph is atomic ifit contains a statement with at least on blank node andeach pair of occurring blank nodes is connected by a se-quence of statements where subject and object are blanknodes. If one of these statements additionally contains asecond blank node, the same takes effect for this blanknode recursively. A recursive definition of Atomic Graphsis given under the term Minimum Self-Contained Graph(MSG) by Tummarello et al. [44].

Let A be the set of all Atomic Graphs and let ≈ be the equivalence relation such that G ≈ H holds for any G, H ∈ A iff G and H are isomorphic as defined for RDF graphs in [13]. Essentially, two graphs are isomorphic in this sense if a bijection between these graphs exists which is the identity mapping for non-blank nodes and predicates and a bijection between blank nodes. By P := A/≈ we denote the quotient set of A by ≈. We assume a canonical labeling for blank nodes for any graph. The existence of such a labeling has been shown by Hogan [30]. Thus a system of representatives of P is given by the set P ⊂ A of all canonically labeled atomic graphs.

Based on this we now define the Canonical Atomic Partition of a graph as follows:

37 This is not the same definition as for lean graphs [29], but our graphs are similar to lean graphs regarding the aspect that we eliminate internal redundancy.

Definition 2 (Canonical Atomic Partition). Given an RDF graph G, let PG ⊂ A denote the partition of G into atomic graphs. We define a mapping r : PG → P, such that r(a) = p, where p is the canonically labeled representative of a.

The Canonical Atomic Partition of the graph G is defined as

P(G) := {r(x) | x ∈ PG}

It holds that P(G) ⊂ P and especially P(G) ⊂ A.

Each of the contained sets consists of exactly one statement for all statements without blank nodes. For statements with blank nodes, it consists of the whole subgraph connected to a blank node and all neighboring blank nodes. This especially means that all sets in the Atomic Partition are disjoint regarding the contained blank nodes (cf. [7, 44]). Further, they are disjoint regarding the contained triples (because it is a partition).

Since P(G) is a set of atomic graphs, the union of its elements is again a graph, which is isomorphic to G: ∪P(G) ≈ G.
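The following sketch illustrates how such a partition can be computed by grouping statements that share blank nodes. A graph is modeled as a set of (s, p, o) triples; is_blank() and the canonical labeling function canonicalize() (cf. Hogan [30]) are assumed as given rather than implemented here, and all names are illustrative.

def atomic_partition(graph, is_blank, canonicalize):
    remaining = set(graph)
    parts = set()
    while remaining:
        part = {remaining.pop()}
        # Blank nodes whose surrounding statements still have to be collected.
        todo = {n for s, p, o in part for n in (s, o) if is_blank(n)}
        seen = set()
        while todo:
            node = todo.pop()
            seen.add(node)
            linked = {t for t in remaining if node in (t[0], t[2])}
            remaining -= linked
            part |= linked
            todo |= {n for s, p, o in linked
                     for n in (s, o) if is_blank(n)} - seen
        parts.add(canonicalize(frozenset(part)))
    return parts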

Because we are building a system for distributed collaboration on datasets, we need to find a way to express the changes which lead from one dataset to another. To express these changes we start by comparing two graphs by calculating the difference.

Definition 3 (Difference). Let G and G′ be two graphs, and P(G) resp. P(G′) their Canonical Atomic Partitions.

C+ := ∪(P(G′) \ P(G))
C− := ∪(P(G) \ P(G′))
∆(G, G′) := (C+, C−)

Looking at the resulting tuple (C+, C−) we can also say that the inverse of ∆(G, G′) is ∆−1(G, G′) = ∆(G′, G), obtained by swapping the positive and negative sets.
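On top of a partition function as sketched above, the difference of definition 3 reduces to plain set operations; a minimal sketch with illustrative names:

def delta(p_g, p_g_prime):
    # Arguments are the Canonical Atomic Partitions P(G) and P(G');
    # the union over the selected atomic graphs yields C+ and C-.
    c_plus = frozenset().union(*(p_g_prime - p_g))
    c_minus = frozenset().union(*(p_g - p_g_prime))
    return c_plus, c_minus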

We now have the tuple of additions and deletions which describes the difference between two graphs. Thus we can say that applying the changes in this tuple to the initial graph G leads to G′. Furthermore, we can define a Changeset which can be applied on a graph G as follows:

Definition 4 (Changeset). Given an RDF graph G, a changeset is a tuple of two graphs (C+G, C−G) in relation to G, with

P(C+G) ∩ P(G) = ∅
P(C−G) ⊂ P(G)
P(C+G) ∩ P(C−G) = ∅
P(C+G) ∪ P(C−G) ≠ ∅



Since blank nodes cannot be identified across graphs, there cannot be any additions of properties to a blank node, nor can properties be removed from a blank node. If a change to a statement involving a blank node takes place, this operation is transformed into the removal of one atomic graph and the addition of another atomic graph. Thus P(C+G) and P(G) have to be disjoint. This means an addition cannot introduce just new statements to an existing blank node. Parallel to the addition, a blank node can only be removed if it is completely removed with all its statements. This is ensured by P(C−G) being a subset of P(G). Simple statements without blank nodes can simply be added and removed. Further, since P(C+G) and P(C−G) are disjoint, we avoid the removal of atomic graphs which are added in the same changeset and vice versa. Since at least one of P(C+G) or P(C−G) cannot be empty, we avoid changes with no effect.
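A sketch of how these four conditions can be checked mechanically, with P standing for the canonical atomic partition function from above (an assumption for illustration, not the Quit Store API):

def is_valid_changeset(g, c_plus, c_minus, P):
    p_g, p_add, p_del = P(g), P(c_plus), P(c_minus)
    return (p_add.isdisjoint(p_g)        # P(C+) and P(G) are disjoint
            and p_del <= p_g             # P(C-) is a subset of P(G)
            and p_add.isdisjoint(p_del)  # nothing is added and removed at once
            and bool(p_add | p_del))     # at least one effective change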

Definition 5 (Application of a Change). Given an RDF graph G, let CG = (C+G, C−G) be a changeset on G. The function Apl is defined for the arguments G, CG resp. G, (C+G, C−G) and is determined by

Apl(G, (C+G, C−G)) := ∪((P(G) \ P(C−G)) ∪ P(C+G))

We say that CG is applied to G with the result G′.
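Under the same assumptions, the application of a change is again a plain set computation; a minimal sketch:

def apply_changeset(g, c_plus, c_minus, P):
    # Apl(G, (C+, C-)): drop the removed atomic graphs, add the new
    # ones, and take the union of the resulting partition.
    parts = (P(g) - P(c_minus)) | P(c_plus)
    return frozenset().union(*parts)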

7. Operations

Based on the formal model for expressing changes on RDF graphs, we can now introduce our formal syntax for representing versioning operations. Because we are aiming for version tracking in a decentralized evolution environment, linear versioning operations are not enough; our operations also have to support non-linear versioning. The presented operations are commit to record changes in section 7.1, branch to support divergence in section 7.2, merge in section 7.3, and revert to undo changes in section 7.4. Each version is related to the respective RDF dataset and thus the versioning operations are related to the respective transformations of the RDF dataset.

7.1. Commit

Figure 3 depicts an initial commit A without any ancestor resp. parent commit and a commit B referring to its parent A.

Figure 3: Two commits with an ancestor reference.

Let G0 be a graph under version control. A({G0}) is a commit containing the graph G0. G will be the new version of G0 after a change (C+G0, C−G0) was applied to G0; Apl(G0, (C+G0, C−G0)) = G. Now we create a new commit containing G which refers to the ancestor commit from which it was derived: B{A}({G}). Another change applied to G would result in G′ and thus a new commit C{B{A}}({G′}) is created. In the further writing, the indices and arguments of commits are sometimes omitted for better readability, while clarity should still be maintained by using distinct letters. Further, the changeset on top of the arrow is also omitted if it is obvious.

The evolution of a graph is the process of subsequently applying changes to the graph using the Apl function as defined in definition 5. Each commit expresses the complete evolution process of a set of graphs, since it refers to its ancestor, which in turn refers to its ancestor as well. Initial commits holding the initial version of the graph do not refer to any ancestor.

7.2. Branch

Since a commit refers to its ancestor and not vice versa, nothing hinders us from creating another commit D{B{A}}({G′′}). Taking the commits A, B{A}, C{B}, and D{B} results in a directed rooted in-tree, as depicted in fig. 4. The commit D is now a new branch or fork based on B, which has diverged from C. We know that G ̸≈ G′ and G ̸≈ G′′, while we do not know about the relation between G′ and G′′.

Figure 4: Two branches evolved from a common commit.

Because we do not know anything about the relation between G′ and G′′, we can consider them as independent. From now on the graph G is independently evolving in two branches, where independent means possibly independent: two contributors performing a change do not have to know of each other and do not need a direct communication channel. The contributors could actually communicate, but communication is not required for those actions. Thus, by branching a dataset's evolution, the contributors can work from distributed places and no central instance for synchronization is required. We thus define the branch as follows:

Definition 6 (Branching). Branching is the (independent) evolution of a graph G with two graphs G1 and G2 as result, where Apl(G, C1) = G1 and Apl(G, C2) = G2. The changes C1 and C2 might be unequal, but can be the same. The same applies for G1 and G2; they can be different after the independent evolution, but can be similar as well.

7.3. Merge Different Branches

After creating a second branch, the tree of commits is diverged, as shown in the example of fig. 4. We now want to merge the branches, in order to get a version of the graph containing the changes made in those different branches, or at least taking all of these changes into account. The notation of the merge is defined as follows:

Definition 7 (Merge of two Evolved Graphs). Given are two commits C{β}({G′}) and D{γ}({G′′}). Merging the two graphs G′ and G′′ with respect to the change history expressed by the commits C and D is a function

Merge(C({G′}), D({G′′})) = M{C,D}({Gµ})

The Merge function takes two commits as arguments and creates a new commit dependent on the input commits; this new commit is called merge commit. The graph Gµ is the merged graph resulting from G′ and G′′. The merge commit resulting from the merge operation has two ancestor commits, to which it refers. If we take our running example, the merge commit is M{C{B},D{B}}({Gm}). Taking the commits A, B{A}, C{B}, D{B}, and M{C,D}, we get a directed acyclic graph, as depicted in fig. 5.

Figure 5: Merging commits from two branches into a common version of the graph.

Note that the definition does not make any assumptions about the ancestors of the two input commits. It depends on the actual implementation of the Merge function whether it is required that both commits have any common ancestors. Furthermore, different merge strategies can produce different results; thus it is possible to have multiple merge commits with different resulting graphs but with the same ancestors. Possible merge strategies are presented in section 8.

7.4. Revert a Commit

Reverting the commit B{A}({G}) is done by creating an inverse commit B−1{B}({G̃0}) (while the commit A is specified as A({G0})). This inverse commit is then directly applied to B. The resulting graph G̃0 is calculated by taking the inverse difference ∆−1(G0, G) = ∆(G, G0) and applying the resulting change to G. After this operation G̃0 = G0.

Figure 6: A commit reverting the previous commit.

A versioning log containing three commits is shown in fig. 6. The last commit reverts its parent and thus the graph in B−1 is again equal or at least equivalent to the graph in the first commit (A). While it is obvious how to revert the previous commit, it might be a problem if other commits exist between the commit to be reverted and the current top of the versioning log. In this case a merge is applied (cf. section 8). For this merge, the merge base is the commit to be reverted, branch A is the parent commit of the commit which is to be reverted, and branch B the current latest commit. Arising conflicts when reverting a commit can be resolved in the same way as for merge commits.

8. Merge Strategies

Since we are not only interested in the abstract branching and merging model of the commits, we want to know what a merge operation means for the created graph in the commit. In the following we present some possible implementations of the merge operation. Note that merging in this context is, in general, not to be understood as the union of two graphs as in the RDF 1.1 Semantics Recommendation [29].

8.1. Union Merge

Merging two graphs G′ and G′′ could be considered trivially as the union operation for the two graphs: G′ ∪ G′′ = G′′′. This merge, as mentioned above, is well defined in the RDF 1.1 Semantics Recommendation [29] in section "4.1 Shared blank nodes, unions and merges". But this operation would not take into account the actual change operations leading to the versions of the graphs. Furthermore, the union merge does not allow the implementation of conflict detection or resolution operations. The union merge might be intended in situations where data is only added from different locations.

8.2. All Ours/All Theirs

Two other merge strategies which would not produce merge conflicts are ours and theirs, which just take the whole graph G′ =: G′′′ resp. G′′ =: G′′′, while ignoring the other graph respectively. This strategy might be chosen to completely discard the changes from a certain branch.

8.3. Three-Way-Merge: An Unsupervised Approach for Merging Branched Knowledge Bases

A methodology used in DVCS for software source code files, such as Git and Mercurial, is the three-way-merge38. The merge consists of three phases: (1) finding a common merge base for the two commits to merge, (2) comparing the files between the merge base and the individual branches and inferring which lines were added and removed, and (3) creating a merged version by combining the changes made in the two branches.

38 How does Git merge work: https://www.quora.com/How-does-Git-merge-work, 2016-05-10



A | B | base | result | C+A | C−A | C+B | C−B |
G′| G′′| G   | Gm     | ∆(G, G′)  | ∆(G, G′′) |
  |   |      |        |     |     |     |     | Non-existing statements
X | X | X    | X      |     |     |     |     | Atomic graph existent in all graphs will also be in the result
X |   |      | X      | X   |     |     |     | An atomic graph added to G′ is also added to the result
  | X |      | X      |     |     | X   |     | An atomic graph added to G′′ is also added to the result
  | X | X    |        |     | X   |     |     | An atomic graph removed from G′ is also not added to the result
X |   | X    |        |     |     |     | X   | An atomic graph removed from G′′ is also not added to the result
X | X |      | X      | X   |     | X   |     | An atomic graph added to both branches is also added to the result
  |   | X    |        |     | X   |     | X   | An atomic graph removed from both branches is also not added to the result

Table 2: Decision table for the different situations on a three-way-merge (X = atomic graph exists, empty = atomic graph does not exist).

Implementing this function for combining the changes is the actual problem and task of the selected merge algorithm. A merge algorithm can take several aspects into account when deciding whether to include a line into the merged version or not. If the algorithm cannot decide on a certain line, it produces a merge conflict. For source code files in Git this is for instance the case if two close-by lines were changed in different branches. Since the order of source code lines is crucial, a merge conflict is produced.

We transform this situation to RDF datasets. We take into account the versions of the graphs in the two commits to be merged, C{B}({G′}) and D{B}({G′′}). To find the merge base (1), this strategy relies on the existence of a common ancestor δ, such that for C and D there must exist an ancestor path C{…{δ,…}} resp. D{…{δ,…}} to δ. In our case we find the most recent common ancestor δ := B({G}). Now we have to compare (2) the graphs between the merge base and the individual branches and calculate their differences:

(C+C, C−C) = ∆(G, G′)
(C+D, C−D) = ∆(G, G′′)

In contrast to source code files, no order is relevant in the RDF data model. Thus we can just take the resulting sets of the comparison and merge them into a new version (3) as follows:

G′′′ = ∪((P(G′) ∩ P(G′′)) ∪ P(C+C) ∪ P(C+D))

A more visual representation of the three-way-merge is given as a decision matrix in table 2. The table shows in the first three columns all combinations of whether a statement is included in one of the two branches and their merge base. The fourth column shows whether a statement is present in the merge result as defined for the three-way-merge. The other four columns visualize the presence of a statement in the deltas between the two branches and the merge base respectively.
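Expressed on canonical atomic partitions, the whole three-way merge collapses to a few set operations. The following sketch assumes a partition function as in section 6 and uses illustrative names; it is not the Quit Store implementation.

def three_way_merge(p_base, p_ours, p_theirs):
    # Arguments are the partitions P(G), P(G'), P(G'') of the merge
    # base and the two branches.
    added_ours = p_ours - p_base      # corresponds to P(C+_C)
    added_theirs = p_theirs - p_base  # corresponds to P(C+_D)
    # Atomic graphs kept by both branches plus all additions (cf. table 2).
    return (p_ours & p_theirs) | added_ours | added_theirs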

This merge strategy purely takes the integrity of the RDF data model into account. This especially means that semantic contradictions have to be dealt with in other ways. One possibility to highlight possible contradictions as merge conflicts is the context merge strategy (cf. section 8.4). But also beyond a merge operation, semantic contradictions within the resulting RDF graph can be handled using continuous integration tools, as pointed out in section 11.

Figure 7: An example for a conflict using the Context Merge.

8.4. Context Merge: A Supervised Approach for Identifying Conflicts

Since the Three-Way-Merge does not produce any merge conflicts, it can happen that semantic conflicts are introduced during a merge. Even though the result is a valid RDF graph, two statements could contradict each other. Looking at fig. 7 we see that in commit A the statement "Obama is president of the USA" was introduced, while in commit B the statement "Trump is president of the USA" is added. The result of the Three-Way-Merge would be the whole graph as shown in fig. 7.

Since we do not want to apply any specially encoded semantic rules to identify conflicts, we have to rely on the pure RDF data model. Thus we can take into account the semantics of nodes and edges and the semantics of additions and deletions as we have seen them in the Three-Way-Merge. Let us transfer the principle of producing conflicts in a Three-Way-Merge as implemented in Git from source code files to graphs. In files a conflict is produced as soon as the merge strategy cannot decide on two lines coming from two different commits, in which order they should be stored. The lines thus are overlapping. The Context Merge for RDF is based on the Three-Way-Merge in the way that it performs the merge by taking into account the two commits and their merge base. In contrast to the merge as defined in section 8.3, it produces merge conflicts as soon as the changes of both merged commits overlap at a node. If an atomic graph is added resp. removed in both commits there is obviously no contradiction and hence no conflict. The merge process marks each subject and object of an added or removed atomic graph with its originating commit (cf. fig. 7, Sub Graphs). As soon as a node is marked for both commits, this node is added to the list of conflicting nodes. The user is presented with all atomic graphs of both change sets which contain nodes listed as conflicting nodes.

The possible changes in question to be marked as a conflict are those statements39 which were added or removed in one of the merged branches. Looking at table 2 we see that both branches agree on the last two lines, while they do not agree on lines 3 to 6.

To perform a context merge we first need to calculate the change sets of the two commits to be merged, denoted A and B (cf. fig. 7):

(C+A, C−A) = ∆(G, G′)
(C+B, C−B) = ∆(G, G′′)

As a precondition to identify the conflicts we thus only need the statements where the two branches do not agree. We denote the set of statements present (resp. absent) in A, where A and B disagree, as follows (disagreed statements):

C̃+A\B = C+A \ C+B
C̃−A\B = C−A \ C−B

Also, to identify the nodes of a statement, the set of all subject nodes and object nodes of G is defined as:

N(G) := {x | ∃p, o : (x, p, o) ∈ G ∨ ∃s, p : (s, p, x) ∈ G}

The set of potentially conflicting nodes is the intersection of the nodes of the disagreed statements:

IN = N(C̃+A\B ∪ C̃−A\B) ∩ N(C̃+B\A ∪ C̃−B\A)

Now we have to find the respective statements which have to be marked as conflicts; thus the set of all statements in G which contain a node in I is defined on G and I as:

EI(G) := {(s, p, o) ∈ G | s ∈ I ∨ o ∈ I}

Thus we have the following sets of potentially conflicting statements:

EIN(C̃+A\B), EIN(C̃−A\B), EIN(C̃+B\A), EIN(C̃−B\A)

39 For simplicity, we are dealing with statements in the following definitions and formulas rather than atomic graphs. To transfer the method to atomic graphs, a slightly changed definition of N(G) and EI(G) with respect to atomic graphs is needed.

The set of statements which will be contained in the result without question is:

(P(G′) ∩ P(G′′)) ∪ (C+A \ EIN(C̃+A\B)) ∪ (C+B \ EIN(C̃+B\A))

Assuming a function R which gives us the conflict resolution after a user interaction, we end up with a merge method as follows:

G′′′ = ∪((P(G′) ∩ P(G′′))
        ∪ (C+A \ EIN(C̃+A\B))
        ∪ (C+B \ EIN(C̃+B\A))
        ∪ R(EIN(C̃+A\B), EIN(C̃−A\B), EIN(C̃+B\A), EIN(C̃−B\A)))

This merge strategy relies on the local context in graphs by interpreting subjects and objects of statements as nodes, while predicates are seen as edges. In a context where a different treatment of predicates is needed, the method can be extended to also mark statements with overlapping usage of predicates.
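A sketch of this conflict identification on plain statements (cf. footnote 39), with graphs modeled as sets of (s, p, o) triples; all names are illustrative, not the Quit Store API.

def nodes(statements):
    # N(G): all subjects and objects occurring in the statements.
    return {s for s, p, o in statements} | {o for s, p, o in statements}

def context_conflicts(base, ours, theirs):
    add_a, del_a = ours - base, base - ours      # (C+_A, C-_A)
    add_b, del_b = theirs - base, base - theirs  # (C+_B, C-_B)
    # Disagreed statements of each branch.
    da_plus, da_minus = add_a - add_b, del_a - del_b
    db_plus, db_minus = add_b - add_a, del_b - del_a
    # I_N: nodes touched by disagreed changes of both branches.
    conflict_nodes = nodes(da_plus | da_minus) & nodes(db_plus | db_minus)
    # E_I: disagreed statements containing a conflicting node; these
    # are presented to the user for resolution.
    def marked(statements):
        return {(s, p, o) for s, p, o in statements
                if s in conflict_nodes or o in conflict_nodes}
    return (marked(da_plus), marked(da_minus),
            marked(db_plus), marked(db_minus))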

9. Versioning System

Following the foundational work of the last chapters inthis chapter we are describing the architecture of our sys-tem. An overview with the individual components is givenin fig. 8. The Quit API provides query and update inter-faces following the Semantic Web standards SPARQL andRDF, as well as interfaces to control the Git repository,which are described in section 9.1. The storage and accessfacilities for provenance information regarding the aspectsContent, Management, and Use (cf. [25]) are described insections 9.2 and 9.3.

9.1. Quit API

As an interface accessible to other applications, we provide a standard SPARQL 1.1 endpoint. The endpoint supports SPARQL 1.1 Select and Update to provide a read/write interface on the versioned RDF dataset. For performing additional Git operations we provide a maintenance interface.

The prototypical implementation of the Quit Store40 is developed using Python41, with the Flask API42 and RDFlib43 to provide a SPARQL 1.1 interface via HTTP.

40 https://github.com/AKSW/QuitStore
41 https://www.python.org/
42 http://flask.pocoo.org/
43 https://rdflib.readthedocs.io/en/stable/



Figure 8: The components of the Quit Store.

The operations on the Git repository are pursued using the libgit244 bindings for Python, pygit245. The underlying storage of the RDF dataset is implemented by an in-memory Quad-Store as provided by RDFlib and a local Git repository which is kept in sync with the corresponding named graphs in the store. Every graph is stored in a canonicalized N-Quads serialization (cf. section 5.2).
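A minimal sketch of how a changed graph serialization could end up in a new Git commit via pygit2; the file name, signature, and message are assumptions for illustration, not the actual Quit Store code.

import pygit2

def commit_graph(repo_path, filename, nquads, message):
    repo = pygit2.Repository(repo_path)
    with open(f"{repo_path}/{filename}", "w") as f:
        f.write(nquads)  # the sorted N-Quads serialization, cf. section 5.2
    repo.index.add(filename)
    repo.index.write()
    tree = repo.index.write_tree()
    author = pygit2.Signature("quit", "quit@example.org")
    parents = [] if repo.head_is_unborn else [repo.head.target]
    repo.create_commit("HEAD", author, author, message, tree, parents)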

Figure 9: Creating a dataset from all available revisions of named graphs.

Dataset Access. To provide access to the versioned RDF dataset, multiple virtual endpoints are provided. The default endpoint provides access to the currently latest version of the dataset (cf. listing 1, line 1). For each branch (cf. listing 1, line 2) and for each version in the history (cf. listing 1, line 3) of the dataset an additional endpoint is provided. To implement this we build on the basic concept of Git, which is to reuse as many of its objects as possible for a new revision.

44 http://libgit2.github.com/
45 http://www.pygit2.org/

In a Git repository, snapshots of modified files are stored as Git Blobs, instead of a snapshot of the complete dataset. This corresponds to the archiving policy fragment based (FB) as defined in section 4.2. Exploiting this storage structure, we can randomly check out any Git commit in linear time. This allows us to look up all versions of a graph in Git's internal tree structure to create a virtual dataset (Virtual Graph), shown in fig. 9, containing the state of all graphs at that commit, and to run queries against it. The Virtual Graph (cf. fig. 8) thus represents the complete history of the dataset. For better accessibility it always keeps the latest version of each graph, and thus the latest dataset, available in the Quad Store (cf. fig. 9). Additionally it maintains the most recently used graphs in the Quad Store as well. Received SPARQL queries are forwarded to the Virtual Graph, which distinguishes between Update and Select queries. For Select queries it ensures that all graphs of the respective dataset are available in the Quad Store and then evaluates the query against it. Update queries are also evaluated on the internal Quad Store. The effective changes on the dataset are then applied to the corresponding Git Blob of the ancestor commit. The resulting Git Blob is then enclosed in a new Git commit.

1 http://quit.local/sparql
2 http://quit.local/sparql/<branchname>
3 http://quit.local/sparql/<commitId>

Listing 1: Dataset endpoint URLs.
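For instance, a client could query the state of a hypothetical develop branch with a plain SPARQL protocol request; the host name follows listing 1, while the branch name and query are assumptions for illustration.

import requests

response = requests.get(
    "http://quit.local/sparql/develop",  # endpoint per listing 1, line 2
    params={"query": "SELECT ?s ?p ?o WHERE { GRAPH ?g { ?s ?p ?o } } LIMIT 10"},
    headers={"Accept": "application/sparql-results+json"},
)
print(response.json())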

Git Interface. The major operations on a versioning system are commit to create a new version in the version log, as well as merge and revert as operations on the version log. The commit operation is implemented by Update queries through the dataset access interface. To also allow the execution of branch, merge, and revert we make the operations provided by Git available through a web interface (cf. listing 2).

1 http://quit.local...
2 /branch/<oldbranch>:<newbranch>
3 /merge/<branch>:<target>?method=<strategy>
4 /revert/<target>?commit=<commitId>

Listing 2: Merge and Revert Interfaces.

Synchronization Between Instances. So far all operations were executed on a local versioning graph, which can diverge and be merged, but still no collaboration with remote participants is possible. To allow collaboration on the World Wide Web, the versioning graph can be published (push) to a remote repository, from where other collaborators can copy (clone) the whole graph. If a collaborator has already cloned a previous version of the versioning graph, she can update her local graph by executing a pull. The default Git operations pull and push are also available via the Quit Store HTTP interface as shown in listing 3. They allow the user to synchronize with remote repositories.

1 http://quit.local/push/<remote name>/<local>:<remote branch>
2 http://quit.local/fetch/<remote name>/<remote branch>
3 http://quit.local/pull/<remote name>/<remote branch>:<local>

Listing 3: Push and Pull Interfaces.

9.2. Storing Provenance

Our system is built as a tool stack on top of Git, extending it with semantic capabilities. We focus on a generic solution for presenting provenance information, which can be applied to arbitrary domains. Since our approach gains all of its versioning and storage capabilities from the underlying Git repository, most of the metadata is already captured and maintained (cf. Use and Management, [25]). With respect to the Quit Stack the two main concerns are (1) to make the already existing provenance information from Git semantically available and (2) to check how and to which extent additional and domain-specific metadata can be stored in the version control structure.

Our effort is to transform the metadata stored in the data model of Git (cf. section 5.1) to RDF, making use of PROV-O. Figure 10 provides an overview of the provenance data which we provide for a commit. The Git Commit sub-graph shows the information which can be extracted from the metadata associated with a commit in Git. Commits in Git can be mapped to instances of the class prov:Activity associated with their author and committer. We follow the idea of De Nies et al. [15] and represent the start and end time of the activity with the author and commit date of Git, which can be interpreted as the time until a change was accepted. PROV-O has no concept for commenting on activities; therefore we follow the suggestion of PROV-O and use rdfs:comment for commit messages. Git users are instances of prov:Agent, stored with their provided name and email. We represent Git names as rdfs:label since they do not necessarily contain a full name nor a nick name. Additionally, the role of the user is provided through a prov:Association. To represent the roles used by Git we provide quit:Author and quit:Commiter. Further, we store the commit ID using the custom property quit:hex.

In addition to the information which we can extract from the Git data model, we enrich the commits stored in Git with additional metadata. The two additional provenance operations which we support are Import and Transformation, which are shown on the left side in fig. 10. For an import we store a quit:Import, a subclass of prov:Activity, together with a quit:dataSource property. The Import sub-graph contains the source specification of a dataset by recording its original URL on the Web. The Transformation sub-graph describes a change on the dataset, which is represented as a quit:Transformation activity and a quit:query property for recording the SPARQL Update Query which was executed and resulted in the new commit. To enable the highest portability of this additional provenance data we want to persist it alongside Git's commit data structure. The main problem here is that Git itself offers no built-in feature for storing any user-defined metadata for commits or files. What Git offers instead is a functionality called git notes, which is a commentary function on commits. Hereby, Git creates a private branch where text files, named after the commits they comment on, reside. The problem with this functionality is that the notes are not integrated into the commit storage: they are not included in the calculation of the object hash resp. commit ID and thus they are not protected against unperceived changes. Because we want to rely on all provenance information equally, our decision is to provide and obtain additional metadata as part of the commit message, as shown in listing 4. Commit messages are unstructured data, meaning it will not break the commit when additional structured data is provided at the start or end of a message. More specific key words for identifying provenance operations other than Source and Update can be added as needed.

tree 31159f4524edf41e306c3c5148ed7734db1e777d
parent 3fe8fd20a44b1737e18872ba8a049641f52fb9ef
author pnaumann <[email protected]> 1487675007 +0100
committer pnaumann <[email protected]> 1487675007 +0100

Source: http://dbpedia.org/data/Leipzig.n3

Example Import

Listing 4: Git commit with additional data.
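A minimal sketch of how such structured keys could be recovered from an otherwise unstructured commit message; the exact parsing rule is an assumption for illustration.

def parse_metadata(message: str) -> dict:
    # Collect "Key: value" lines, e.g. {"Source": "http://dbpedia.org/..."}.
    metadata = {}
    for line in message.splitlines():
        key, sep, value = line.partition(": ")
        if sep and key and " " not in key:
            metadata[key] = value.strip()
    return metadata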

In addition to the metadata which we can extract from a Git commit or which we encode in the commit message, we have extended the graph with more information on the content of the dataset. In this case we store every named graph contained in a file as an instance of prov:Entity. We link the prov:Entity with the respective commit via prov:wasGeneratedBy. Each entity is attributed to the general graph URI using the prov:specializationOf property. The general graph is the URI of the graph under version control. Optionally, we can also extend the provenance graph with the complete update information. For this purpose the quit:updates property is used to reference the update node. The update node is linked to the original graph and the two named graphs containing the added and deleted statements.

9.3. Accessing Provenance Information

To access our provenance information we follow the recommendation of the "PROV-AQ: Provenance Access and Query" W3C Working Group Note46. We provide two kinds of SPARQL interfaces: one service for the provenance graph, and an individual interface for each state in the history of the dataset (cf. section 9.1). The provenance graph is built from the metadata provided by Git, combined with the additional metadata stored in the commit messages. To be able to query this information we have to transform it to RDF and store the resulting graph. This is done during the initial start-up of the store system.

46 https://www.w3.org/TR/prov-aq/



Figure 10: The provenance graph of a commit.

The provenance graph is built from the commits stored in Git, by traversing the Git commit history of every branch from its end. The depth of the synchronized history as well as the selection of the relevant branches are configurable according to the users' needs. This way the needed storage space can be reduced for devices with low storage capacities, at the cost of time for parsing graphs on-the-fly later on.

Quit Blame. As an example for the usage of provenance, and similar to the functionality of git blame, we have also built a method to retrieve the origin of each individual statement in a dataset and associate it with its entry in the provenance graph. This is especially relevant for accountability and debugging as part of the aspect Use as described by [25]. Given an initial commit, we traverse the Git history to find for each statement the commit in which it was inserted, and annotate it with the metadata for that commit.

Looking at the example in fig. 11, the three statements that exist in the fourth commit should be matched with the commits 1, 4, and 3 respectively, since those are the commits where the statements were introduced. To implement this behavior on our provenance graph we utilize the SPARQL query depicted in listing 5. As input to the query we list all statements for which we want to identify the origin, with subject ?s, predicate ?p, object ?o, and named graph ?context (cf. listing 5, line 13).

Figure 11: Example for an insert/delete chain in Git used by git-blame.

For the execution of the query we loop through the list of commits starting at the current commit, which is bound to the variable ?commit. The query is then executed on the provenance graph until an originating commit was found for all statements.

1  SELECT ?s ?p ?o ?context ?commit ?name ?date WHERE {
2    ?commit prov:endedAtTime ?date ;
3            prov:wasAssociatedWith ?user ;
4            quit:updates ?update .
5    ?user foaf:mbox ?email ;
6          rdfs:label ?name .
7    ?update quit:graph ?context ;
8            quit:additions ?additions .
9    GRAPH ?additions {
10     ?s ?p ?o
11   }
12   VALUES (?s ?p ?o ?context) {
13     ...
14   }
15 }

Listing 5: Query for the git blame implementation.
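A sketch of the surrounding loop, assuming an RDFlib graph holding the provenance data; blame_query stands for listing 5 with its VALUES block filled with the statements in question, and commits is the history from newest to oldest.

def blame(provenance_graph, blame_query, commits, statements):
    origins = {}
    for commit in commits:  # newest to oldest, starting at the current commit
        if len(origins) == len(statements):
            break  # every statement has found its originating commit
        rows = provenance_graph.query(blame_query,
                                      initBindings={"commit": commit})
        for row in rows:
            origins.setdefault((row.s, row.p, row.o, row.context), commit)
    return origins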



10. Evaluation and Limitations

For evaluating the proposed framework we consider the correctness of the framework regarding the recorded changes as well as the performance, memory, and storage footprint. In order to pursue this task we have taken our implementation of the Quit Store40. This currently is a prototypical implementation to prove the concept of our framework; thus we are not aiming at competitive performance results. The hardware setup of the machine running the benchmarks is a virtual machine on a Hyper-V cluster with an Intel(R) Xeon(R) CPU E5-2650 v3 with a maximum frequency of 2.30GHz and 62.9GiB of main memory. As operating system Ubuntu 16.10 (yakkety) 64-bit is used.

$ ./generate -pc 4000 -ud -tc 4000 -ppt 1

Listing 6: The BSBM generate command with its arguments.

For the benchmarking framework we have decided to use the Berlin SPARQL Benchmark (BSBM) [10], since it is made for executing SPARQL Query and SPARQL Update operations. The initial dataset as generated using the BSBM, shown in listing 6, contains 46,370 statements, with 1,379,201 statements to be added and removed during the benchmark. To also execute update operations we are using the Explore and Update use case. We have executed 40 warm-up and 1500 query mix runs, which resulted in 4592 commits on the underlying Git repository, using the testdriver as shown in listing 7.

$ ./testdriver http://localhost:5000/sparql \
    -runs 1500 -w 40 -dg "urn:bsbm" -o run.xml \
    -ucf usecases/exploreAndUpdate/sparql.txt \
    -udataset dataset_update.nt \
    -u http://localhost:5000/sparql

Listing 7: The BSBM testdriver command with its arguments.

The setup for reproducing the evaluation is also available at: https://github.com/AKSW/QuitEval.

10.1. Correctness of Version Tracking

For checking the correctness of the recorded changes in the underlying Git repository we have created a verification setup. The verification setup takes the Git repository, the initial dataset, the query execution log (run.log) produced by the BSBM setup, and a reference store. The repository is set to its initial commit, while the reference store is initialized with the initial dataset. Each update query in the execution log is applied to the reference store. When an effect, i.e. a change in the number of statements47, is detected on the reference store, the Git repository is forwarded to the next commit. Then the content of the reference store is serialized and compared statement by statement to the content of the Git repository at this point in time. This scenario is implemented in the verify.py script in the evaluation tool collection. We have executed this scenario and could ensure that the recorded repository contains the same data as the store after executing the same queries.

47 Since the Quit Store only creates commits for effective changes it is necessary to identify and skip queries without effect. This heuristic does not affect the result of the actual comparison, because all queries are still executed on the reference store, to which the content of the Quit Store is compared in each step.

10.2. Correctness of the Merge Method

The functional correctness of the three-way merge method (cf. section 8.3) was verified using a repository filled with data from the graph generated by the BSBM. Since the three-way merge is conflict free, it can be evaluated in an automated manner. To create the evaluation setup and run the verification of the results, a script was created. This script takes a Git repository and creates a setup of three commits. An initial commit contains a graph file, which serves as the base of two branches. Each of the branches is forked from this initial commit and contains an altered graph file. The files in the individual commits contain random combinations of added and removed statements, while the script also produces the graph which is expected after merging the branches.

After creating the two branches with different graphs, they are merged using git merge. The result is then compared statement by statement to the expected graph and the result is presented to the user. We have executed the verification 1000 times and no merge conflict or failure in the merge result occurred.

10.3. Context Merge

To demonstrate and evaluate the conflict identification of the Context Merge, we have created a repository which contains two branches holding the data as depicted in fig. 7. Further, it contains a second graph with a resource with a misspelled label, which is corrected on the master branch, while the resource is moved from the http://example.org/… namespace to http://aksw.org/… in the develop branch. The output of the context merge method is depicted in the screenshot in fig. 12.

10.4. Query Throughput

The query throughput performance of the reference implementation was analyzed in order to identify obstacles in the conception of our approach. In fig. 13 the queries per second for the different categories of queries in the BSBM are given and compared to the baseline. Our baseline is a simple RDF store implemented using the Python RDFlib and a SPARQL interface implemented using Flask. We compare the execution of our store with version tracking enabled, and with additionally enabled delta compression, to a setup of only the in-memory store of the Python RDFlib without version tracking. As expected, the versioning has a big impact on the update queries (INSERT DATA and DELETE WHERE), while the explore queries (SELECT and CONSTRUCT) are not further impacted.



Figure 12: A merge on the Quit Store using the context strategy with identified conflicts.

Figure 13: Execution of the different BSBM queries (queries per second; baseline, quit versioning, quit versioning w/ delta compression, old quit).

We could reach 247 QMpH48 with Quit's versioning (quit), 235 QMpH with additionally enabled delta compression (cf. section 10.5), and 641 QMpH for the baseline. This is an improvement of 3.7× over the speed of the old implementation at 67 QMpH (old quit), at the state of development as published in our previous paper [2].

10.5. Storage Consumption

During the executions of the BSBM we have monitored the impact of the Quit Store repository on the storage system. The impact on the storage system is visualized in fig. 14. We have measured the size of the repository on the left y-axis, and the size of the graph as well as the number of added and deleted statements on the right y-axis, and have put them in relation to the number of commits generated at that point in time on the x-axis.

48 QMpH: Query Mixes per Hour; Query Mixes are defined by the BSBM.

Figure 14: Storage consumption and number of commits in the repository during the execution of the BSBM.

Figure 15: Query execution time comparison of Quit Store and R43ples with INSERT DATA and DELETE DATA queries.

We compare the execution of the BSBM on a Quit Store without using Git's delta compression (garbage collection49) and with the delta compression enabled. The repository increased from initially 12.4 MiB to a size of 92.6 GiB without delta compression at the end of the benchmark. The benchmark started with 46,370 initial statements and the dataset finally grew to 1,196,420 statements. Per commit, between 12 and 776 statements were added to the graph, while between 1 and 10 statements were removed. Enabling the compression feature during the benchmark could compress the repository to 18.3 GiB. This is a compression rate of 80.2% at the end (at commit 4593). During the run of the evaluation the compression rate is fluctuating; starting of course at 0%, it lies between 76.9% at commit 4217 and 94.5% at commit 4218 near the end of the evaluation run.

10.6. Update Comparison

For comparing our prototypical implementation we have selected the implementation of the R43ples Store. We have also considered the R&Wbase implementation but were not able to get it properly running.

49 https://git-scm.com/docs/git-gc



Figure 16: Query execution time comparison of Quit Store and R43ples with SELECT queries.

To run the R43ples Store and the Quit Store we have created Docker images50 for both systems, which we have made available at the Docker Hub51. We have tried to execute the BSBM on the R43ples Store, but the queries were not actually executed and no revisions were created by the store during the run. Thus we have decided to create a custom comparison setup to measure the execution time of INSERT DATA and DELETE DATA queries. We have defined a custom BSBM use case, which consists of alternating INSERT DATA and DELETE DATA queries. Our custom comparison setup takes the query log produced by the BSBM test driver when running our custom BSBM use case and executes the according queries on the store. The system is also available in our QuitEval repository (as mentioned above). We have executed 4000 queries on both stores, which produced 2002 commits in the Quit Store and 4000 revisions in the R43ples Store. The measured query execution times are depicted in fig. 15. The two clusters which can be seen in the R43ples insert and delete query execution times originate from two clusters in the queries, near a length of 150 statements and near 500 statements inserted resp. deleted.

10.7. Random Access

After executing the update comparison we have also taken the resulting stores with all revisions and have performed a random access query execution comparison between R43ples and the Quit Store. The results of this comparison are depicted in fig. 16. Because R43ples and the Quit Store did not record the same number of revisions, one x-axis is plotted for each store. We have executed a simple select query (cf. listing 8) on a sample of 101 of the recorded revisions on each store.

50 Docker is a container system, see also https://www.docker.com/
51 R43ples docker image: https://hub.docker.com/r/aksw/r43ples/, Quit Store docker image: https://hub.docker.com/r/aksw/quitstore/

1 SELECT ?s ?p ?o WHERE {
2   GRAPH <urn:bsbm> { ?s ?p ?o . } } LIMIT 10

Listing 8: SELECT query used for the random access comparison.

11. Discussion

Based on the implementation of Quit we have pursued an evaluation regarding the correctness of the model (cf. sections 10.1 and 10.2) and monitored the performance of our implementation (cf. sections 10.4 to 10.7). With the improved implementation we could reach a 3.7× improvement of the query throughput over the speed of the old implementation, at the state of development as published in our previous paper [2]. This improvement can mainly be traced back to a completely reworked architecture, which is presented in section 9, and a number of optimizations based on profiling results, especially by reducing I/O overhead. The theoretical foundations could be confirmed. While the results show an acceptable performance, the comparison to the baseline should be an incentive for improvement.

The Quit Store is based on Git, which uses a snapshot based repository architecture. This snapshot storage was expected to increase severely as the stored dataset increases (cf. section 10.5), which could be confirmed during our evaluation. One of the disadvantages of this setup is also visible in fig. 14 towards the end of the evaluation run: starting at about commit 3500, while the dataset slows down its growth, the repository is still growing by a copy of the changed graph plus the added statements. Here the delta compression comes into play, which could make up for the growth of the repository, and we could show that the snapshot based approach does not necessarily put a high load on storage requirements. Further, we could show that the impact on the query execution performance was negligible (cf. section 10.4 and fig. 13). The evaluation of more advanced compression systems for the stored RDF dataset or the employment of a binary RDF serialization format, such as HDT [16], is still subject to future work.

To position our approach in relation to the related work we have compared the Quit Store with the R43ples implementation with respect to the execution time of INSERT DATA and DELETE DATA queries (cf. section 10.6 and fig. 15). We could show that the Quit Store's query execution time is related to the number of statements added resp. deleted. In contrast, the R43ples Store's query execution time increases with the size of the repository. To also compare both stores with respect to the availability of the stored revisions we have performed random access select queries (cf. section 10.7 and fig. 16). With our random access query interface we can execute standard SPARQL queries on the store at the status of any revision, at an overhead not related to the position of the revision in the repository. Due to the change based approach followed by R43ples, the store has very high costs to retrieve old revisions.



This impact of the changes could possibly be reduced by only storing effective changes as revisions in the store. Further, the approach chosen for R43ples of extending the syntax of standard SPARQL queries by custom keywords makes it harder to integrate the store with existing tools and systems.

We are able to track the provenance of any update operation on the dataset (UC 5 and REQ 4). With the provenance graph we are also able to explore the recorded data using a standard SPARQL interface, and due to its graph structure we are also able to represent any kind of branched and merged history. Using quit blame we are able to track down the origin of any individual statement in a dataset. Due to the atomic level of provenance tracking, the provenance information can be derived in two dimensions. From changes to individual atomic graphs we can directly draw conclusions regarding the affected resource, graph, and dataset. In this way statements like "The dataset was changed by X on date Y" and also "The resource Z was changed by X on date Y" are possible. Also, on the granularity level of the operations, multiple individual additions or deletions of atomic graphs are stored in a common commit alongside the executed query. Thus the provenance information is available on all levels of both dimensions, data granularity and granularity of the change operation.

With the context merge strategy we are able to identify possible conflicts and bring them to the user's attention. This was demonstrated in section 10.3 with an example of two branches with conflicts in two graphs of the dataset. But the system can also be extended by custom merge tools. Besides the identification of conflicts during a merge process, users can also define rules specific to the application domain, for instance by using ontological restrictions or rules (for instance using SHACL52). Support for ensuring adherence to special semantic constraints, application specific data models, and certain levels of data quality is provided by continuous integration systems on the Git repository, as presented in [31, 32, 36]. Employing continuous integration is already widely used in the Git ecosystem and can now also be adapted to RDF knowledge bases.

In the digital humanities projects Pfarrerbuch, with research communities in Hungary, Saxony-Anhalt, and Saxony, as well as in the Catalogus Professorum project, the management and support of the diversity across the different datasets were made easy thanks to the adoption of Quit. The use of Quit allows different researchers to develop their datasets independently while sharing core components. For shared datasets it is now also possible to merge changes when there is a consensus to do so. This allows the digitalization team to continuously work on the data extraction and semantification, while the team of data curators can explore the extracted data and perform changes on the data.

52 https://www.w3.org/TR/shacl/

Quit also made it easy to explore differences across the different dataset versions by using the diff feature, which previously had to be done manually. Further, it was possible to detect issues regarding the incorrect use of namespaces during the conversion process of the Hungarian Pastors dataset by using the provenance functionality. As a result it was also possible to solve this issue by reverting the respective changes on the dataset and deploying the updated dataset version.

12. Conclusion

Learning from software engineering history, we could successfully adopt the distributed version control system Git. With the presented system we now have a generic tool to support distributed teams in collaborating on RDF datasets. By supporting commit it is possible to track changes made to a dataset; by branching the evolution of a dataset, different points of view can be expressed (cf. REQ 1). Diverged branches can be consolidated using the merge operation, while the user can select between different merge strategies (cf. REQ 2). Using the push and pull operations of Quit, different instances can synchronize their changes and collaborate in a distributed setup of systems (cf. REQ 3).

The Quit Store tool provides a SPARQL 1.1 read/write interface to query and change an RDF dataset in a quad store which is part of a network of distributed RDF data repositories (cf. REQ 8). The store can manage a dataset with multiple graphs (cf. REQ 7), individual versions can be randomly selected from the repository (cf. REQ 5), and individual versions of the dataset can be compared using Quit Diff (cf. REQ 6). Based on the presented approach, the application in distributed and collaborative data curation scenarios is now possible. It enables the setup of platforms similar to GitHub, specialized to the needs of data scientists and data engineers for creating datasets using local working copies and sending pull requests, while automatically keeping track of the data's versioning and provenance.

We have examined how metadata and datasets stored in a Git repository can be enriched, processed, and used semantically. We have added methodologies for how Git commits, their metadata, and datasets can be used for storing and exploiting provenance information (cf. REQ 4). We could show that the concept of git blame can be transferred to semantic data using our provenance graph. With the presented system we can provide access to the automatically tracked provenance information with Semantic Web technology in a distributed collaborative environment.

In the future, Quit can support the application of RDF in enterprise scenarios such as supply chain management as described by Frommhold et al. [20]. An integration with the decentralized evolution model of distributed semantic social networks [6, 42], as well as the use case of synchronization in mobile scenarios [43], is possible. Further, we are planning to lift the collaborative curation and annotation in distributed scenarios such as presented in



the Structured Feedback protocol [1] to the next level by directly recording the user feedback as commits in a Quit Store, which can enable complex distributed collaboration strategies. As there is a big ecosystem of methodologies and tools around Git for supporting the software development process, the Quit Store can support the creation of such an ecosystem for RDF dataset management.

13. Acknowledgements

We want to thank the editors for organizing this special issue, our shepherd Olaf Hartig, and the reviewers for the critical and helpful reviews. Also we want to thank Sören Auer, Rafael Arndt, and Claudius Henrichs for their valuable remarks, important questions, and for supporting us in proofreading. This work was partly supported by a grant from the German Federal Ministry of Education and Research (BMBF) for the LEDS Project under grant agreement No 03WKCG11C and the DFG project Professorial Career Patterns of the Early Modern History: Development of a scientific method for research on online available and distributed research databases of academic history under grant agreement No 317044652.

References

[1] Arndt, N., Junghanns, K., Meissner, R., Frischmuth, P., Radtke, N., Frommhold, M., Martin, M., Apr. 2016. Structured feedback: A distributed protocol for feedback and patches on the web of data. In: Proceedings of the Workshop on Linked Data on the Web co-located with the 25th International World Wide Web Conference (WWW 2016). Vol. 1593 of CEUR Workshop Proceedings. Montreal, Canada. URL http://ceur-ws.org/Vol-1593/article-02.pdf

[2] Arndt, N., Martin, M., Jun. 2017. Decentralized evolution and consolidation of RDF graphs. In: 17th International Conference on Web Engineering (ICWE 2017). ICWE 2017. Rome, Italy. URL https://svn.aksw.org/papers/2017/ICWE_DecentralizedEvolution/public.pdf

[3] Arndt, N., Naumann, P., Marx, E., May 2017. Exploring the evolution and provenance of Git versioned RDF data. In: Fernández, J. D., Debattista, J., Umbrich, J. (Eds.), 3rd Workshop on Managing the Evolution and Preservation of the Data Web (MEPDaW) co-located with 14th European Semantic Web Conference (ESWC 2017). Portoroz, Slovenia. URL http://ceur-ws.org/Vol-1824/mepdaw_paper_2.pdf

[4] Arndt, N., Nuck, S., Nareike, A., Radtke, N., Seige, L., Riechert, T., Oct. 2014. AMSL: Creating a linked data infrastructure for managing electronic resources in libraries. In: Horridge, M., Rospocher, M., van Ossenbruggen, J. (Eds.), Proceedings of the ISWC 2014 Posters & Demonstrations Track. Vol. 1272 of CEUR Workshop Proceedings. Riva del Garda, Italy, pp. 309–312. URL http://ceur-ws.org/Vol-1272/paper_66.pdf

[5] Arndt, N., Radtke, N., Martin, M., Sep. 2016. Distributed collaboration on RDF datasets using Git: Towards the Quit Store. In: 12th International Conference on Semantic Systems Proceedings (SEMANTiCS 2016). SEMANTiCS ’16. Leipzig, Germany. URL https://svn.aksw.org/papers/2016/Semantics_Quit/public.pdf

[6] Arndt, N., Tramp, S., Oct. 2014. Xodx: A node for the distributed semantic social network. In: Horridge, M., Rospocher, M., van Ossenbruggen, J. (Eds.), Proceedings of the ISWC 2014 Posters & Demonstrations Track. Vol. 1272 of CEUR Workshop Proceedings. Riva del Garda, Italy, pp. 465–468. URL http://ceur-ws.org/Vol-1272/paper_154.pdf

[7] Auer, S., Herre, H., Jun. 2006. A versioning and evolution framework for RDF knowledge bases. In: Proceedings of the Sixth International Andrei Ershov Memorial Conference - Perspectives of System Informatics (PSI’06), 27-30 June, Novosibirsk, Akademgorodok, Russia. Vol. 4378. URL http://www.informatik.uni-leipzig.de/~auer/publication/PSI-evolution.pdf

[8] Auer, S., Lehmann, J., Ngomo, A.-C. N., 2011. Introduction to linked data and its lifecycle on the web. In: Proceedings of the 7th International Conference on Reasoning Web: Semantic Technologies for the Web of Data. RW’11. Springer-Verlag, Berlin, Heidelberg, pp. 1–75. URL http://dl.acm.org/citation.cfm?id=2033313.2033314

[9] Berners-Lee, T., Connolly, D., 2001. Delta: An ontology for the distribution of differences between RDF graphs. Tech. rep., W3C. URL http://www.w3.org/DesignIssues/Diff

[10] Bizer, C., Schultz, A., 2009. The Berlin SPARQL benchmark. International Journal on Semantic Web and Information Systems.

[11] Carothers, G., Feb. 2014. RDF 1.1 N-Quads: A line-based syntax for RDF datasets. Recommendation, W3C. URL https://www.w3.org/TR/2014/REC-n-quads-20140225/

[12] Cassidy, S., Ballantine, J., 2007. Version control for RDF triple stores. In: Filipe, J., Shishkov, B., Helfert, M. (Eds.), ICSOFT 2007, Proceedings of the Second International Conference on Software and Data Technologies. INSTICC Press, Barcelona, Spain, pp. 5–12.

[13] Cyganiak, R., Wood, D., Lanthaler, M., Feb. 2014. RDF 1.1 concepts and abstract syntax. Recommendation, W3C. URL https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/

[14] DCMI Usage Board, 2012. DCMI metadata terms. Tech. rep., Dublin Core Metadata Initiative. URL http://dublincore.org/documents/dcmi-terms/

[15] De Nies, T., Magliacane, S., Verborgh, R., Coppens, S., Groth, P., Mannens, E., Van de Walle, R., 2013. Git2PROV: Exposing version control system content as W3C PROV. In: Proceedings of the 2013th International Conference on Posters & Demonstrations Track - Volume 1035. pp. 125–128.

[16] Fernández, J. D., Martínez-Prieto, M. A., Gutiérrez, C., Polleres, A., Arias, M., 2013. Binary RDF representation for publication and exchange (HDT). J. Web Sem. 19, 22–41. URL http://dblp.uni-trier.de/db/journals/ws/ws19.html#FernandezMGPA13

[17] Fernández, J. D., Polleres, A., Umbrich, J., 2015. Towards efficient archiving of dynamic linked open data. In: Debattista, J., d’Aquin, M., Lange, C. (Eds.), DIACHRON@ESWC. Vol. 1377 of CEUR Workshop Proceedings. CEUR-WS.org, pp. 34–49. URL http://dblp.uni-trier.de/db/conf/esws/diachron2015.html#FernandezPU15

[18] Frischmuth, P., Arndt, N., Martin, M., Sep. 2016. OntoWiki 1.0: 10 years of development - what’s new in OntoWiki. In: Joint Proceedings of the Posters and Demos Track of the 12th International Conference on Semantic Systems - SEMANTiCS2016 and the 1st International Workshop on Semantic Change & Evolving Semantics (SuCCESS’16). CEUR Workshop Proceedings. Leipzig, Germany. URL http://ceur-ws.org/Vol-1695/paper11.pdf

[19] Frischmuth, P., Martin, M., Tramp, S., Riechert, T., Auer, S., 2015. OntoWiki - An Authoring, Publication and Visualization Interface for the Data Web. Semantic Web Journal 6 (3), 215–240. URL http://www.semantic-web-journal.net/system/files/swj490_0.pdf

[20] Frommhold, M., Arndt, N., Tramp, S., Petersen, N., 2016. Publish and Subscribe for RDF in Enterprise Value Networks. In: Proceedings of the Workshop on Linked Data on the Web co-located with the 25th International World Wide Web Conference (WWW 2016). URL http://events.linkeddata.org/ldow2016/papers/LDOW2016_paper_05.pdf

[21] Frommhold, M., Piris, R. N., Arndt, N., Tramp, S., Petersen, N., Martin, M., Sep. 2016. Towards Versioning of Arbitrary RDF Data. In: 12th International Conference on Semantic Systems Proceedings (SEMANTiCS 2016). SEMANTiCS ’16. Leipzig, Germany. URL https://www.researchgate.net/publication/303924732_Towards_Versioning_of_Arbitrary_RDF_Data

[22] Gancarz, M., 1995. The Unix Philosophy. Digital Press.

[23] Gancarz, M., 2003. Linux and the Unix Philosophy. Digital Press.

[24] Graube, M., Hensel, S., Urbas, L., 2016. Open semantic revision control with R43ples: Extending SPARQL to access revisions of named graphs. In: Proceedings of the 12th International Conference on Semantic Systems. SEMANTiCS 2016. ACM, New York, NY, USA, pp. 49–56. URL https://dx.doi.org/10.1145/2993318.2993336

[25] Groth, P., Gil, Y., Cheney, J., Miles, S., 2012. Requirements for provenance on the web. International Journal of Digital Curation 7 (1), 39–56.

[26] Haase, P., Stojanovic, L., 2005. Consistent evolution of OWL ontologies. In: Proceedings of the Second European Semantic Web Conference, Heraklion, Greece.

[27] Halilaj, L., Grangel-González, I., Coskun, G., Auer, S., 2016. Git4Voc: Git-based versioning for collaborative vocabulary development. In: Li, T., Scherp, A., Ostrowski, D., Wang, W. (Eds.), IEEE Tenth International Conference on Semantic Computing (ICSC). pp. 285–292. URL http://arxiv.org/pdf/1601.02433

[28] Halilaj, L., Petersen, N., Grangel-González, I., Lange, C., Auer, S., Coskun, G., Lohmann, S., Nov. 2016. VoCol: An integrated environment to support version-controlled vocabulary development. In: Blomqvist, E., Vitali, F., Ciancarini, P. (Eds.), 20th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2016). No. 10024 in Lecture Notes in Computer Science. Springer Verlag, Heidelberg. URL https://www.researchgate.net/publication/301765577_VoCol_An_Integrated_Environment_to_Support_Vocabulary_Development_with_Version_Control_Systems

[29] Hayes, P. J., Patel-Schneider, P. F., Feb. 2014. RDF 1.1 semantics. Recommendation, W3C. URL https://www.w3.org/TR/2014/REC-rdf11-mt-20140225/

[30] Hogan, A., May 2015. Skolemising blank nodes while preserving isomorphism. In: Proceedings of the 24th International Conference on World Wide Web. WWW ’15. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, pp. 430–440. URL https://doi.org/10.1145/2736277.2741653

[31] Kontokostas, D., Westphal, P., Auer, S., Hellmann, S., Lehmann, J., Cornelissen, R., 2014. Databugger: A test-driven framework for debugging the web of data. In: Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion. WWW Companion ’14. International World Wide Web Conferences Steering Committee, pp. 115–118. URL http://jens-lehmann.org/files/2014/www_demo_databugger.pdf

[32] Kontokostas, D., Westphal, P., Auer, S., Hellmann, S., Lehmann, J., Cornelissen, R., Zaveri, A., 2014. Test-driven evaluation of linked data quality. In: Proceedings of the 23rd International Conference on World Wide Web. WWW ’14. International World Wide Web Conferences Steering Committee, pp. 747–758. URL http://svn.aksw.org/papers/2014/WWW_Databugger/public.pdf

[33] Krötzsch, M., Vrandečić, D., Völkel, M., Nov. 2006. Semantic MediaWiki. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L. M. (Eds.), The Semantic Web - ISWC 2006: 5th International Semantic Web Conference, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 935–942. URL https://dx.doi.org/10.1007/11926078_68

[34] Lebo, T., Sahoo, S., McGuinness, D., Belhajjame, K., Cheney, J., Corsar, D., Garijo, D., Soiland-Reyes, S., Zednik, S., Zhao, J., Apr. 2013. PROV-O: The PROV ontology. Recommendation, W3C. URL http://www.w3.org/TR/2013/REC-prov-o-20130430/

[35] Meinhardt, P., Knuth, M., Sack, H., 2015. TailR: A platform for preserving history on the web of data. In: Proceedings of the 11th International Conference on Semantic Systems. SEMANTICS ’15. ACM, New York, NY, USA, pp. 57–64. URL https://dx.doi.org/10.1145/2814864.2814875

[36] Meissner, R., Junghanns, K., Sep. 2016. Using DevOps principles to continuously monitor RDF data quality. In: 12th International Conference on Semantic Systems Proceedings (SEMANTiCS 2016). CEUR Workshop Proceedings. Leipzig, Germany. URL https://svn.aksw.org/papers/2016/Semantics_DevOps/public.pdf

[37] Moreau, L., Clifford, B., Freire, J., Futrelle, J., Gil, Y., Groth, P., Kwasnikowska, N., Miles, S., Missier, P., Myers, J., et al., 2011. The open provenance model core specification (v1.1). Future Generation Computer Systems 27 (6), 743–756.

[38] Nareike, A., Arndt, N., Radtke, N., Nuck, S., Seige, L., Riechert, T., Sep. 2014. AMSL: Managing electronic resources for libraries based on semantic web. In: Plödereder, E., Grunske, L., Schneider, E., Ull, D. (Eds.), Proceedings of the INFORMATIK 2014: Big Data – Komplexität meistern. Vol. P-232 of GI-Edition Lecture Notes in Informatics. Gesellschaft für Informatik e.V., Stuttgart, Germany, pp. 1017–1026, © 2014 Gesellschaft für Informatik. URL https://dl.gi.de/bitstream/handle/20.500.12116/2713/1017.pdf

[39] Riechert, T., Beretta, F., 2016. Collaborative research on academic history using linked open data: A proposal for the Heloise Common Research Model. CIAN-Revista de Historia de las Universidades 19 (0). URL http://e-revistas.uc3m.es/index.php/CIAN/article/view/3147

[40] Riechert, T., Morgenstern, U., Auer, S., Tramp, S., Martin, M., 2010. Knowledge engineering for historians on the example of the Catalogus Professorum Lipsiensis. In: Patel-Schneider, P. F., Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J. Z., Horrocks, I., Glimm, B. (Eds.), Proceedings of the 9th International Semantic Web Conference (ISWC 2010). Vol. 6497 of Lecture Notes in Computer Science. Springer, Shanghai, China, pp. 225–240. URL http://svn.aksw.org/papers/2010/ISWC_CP/public.pdf

[41] The W3C SPARQL Working Group, Mar. 2013. SPARQL 1.1 overview. Recommendation, W3C. URL https://www.w3.org/TR/2013/REC-sparql11-overview-20130321/

[42] Tramp, S., Ermilov, T., Frischmuth, P., Auer, S., 2011. Architecture of a distributed semantic social network. In: Federated Social Web Europe 2011, Berlin, June 3rd-5th 2011.

[43] Tramp, S., Frischmuth, P., Arndt, N., Ermilov, T., Auer, S., 2011. Weaving a Distributed, Semantic Social Network for Mobile Users. In: Proceedings of the ESWC2011. URL http://svn.aksw.org/papers/2011/ESWC_MobileSocialSemanticWeb/public.pdf

[44] Tummarello, G., Morbidoni, C., Bachmann-Gmür, R., Erling, O., Nov. 2007. RDFSync: Efficient remote synchronization of RDF models. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (Eds.), The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 537–551. URL https://dx.doi.org/10.1007/978-3-540-76298-0_39

[45] Vander Sande, M., Colpaert, P., Verborgh, R., Coppens, S., Mannens, E., Van de Walle, R., 2013. R&Wbase: Git for triples. In: Bizer, C., Heath, T., Berners-Lee, T., Hausenblas, M., Auer, S. (Eds.), LDOW. Vol. 996 of CEUR Workshop Proceedings. URL http://dblp.uni-trier.de/db/conf/www/ldow2013.html#SandeCVCMW13
