
Genoma: Distributed Provenance as a Service for IoT-based Systems*

Nanjangud C. Narendra, Anshu Shukla, Sambit Nayak
Ericsson Research
Bangalore, India
{nanjangud.narendra,anshu.shukla,sambit.nayak}@ericsson.com

Asha Jagadish, Rachana Kalkur
Manipal Academy of Higher Education (MAHE)
Manipal, India
{asha.jagadish,rachana.kalkur}@gmail.com

Abstract—One of the key aspects of IoT-based systems, which we believe has not been getting the attention it deserves, is provenance. Provenance refers to those actions that record the usage of data in the system, along with the rationale for said usage. Historically, most provenance methods in distributed systems have been tightly coupled with the underlying data processing frameworks in such systems. However, in this paper, we argue that IoT provenance requires a different treatment, given the heterogeneity and dynamism of IoT-based systems. In particular, provenance in IoT-based systems should be decoupled as far as possible from the underlying data processing substrates.

To that end, in this paper, we present Genoma, our ongoing work on a system for provenance-as-a-service in IoT-based systems. By “provenance-as-a-service” we mean the following: distributed provenance across IoT devices, edge and cloud; and agnostic of the underlying data processing substrate. Genoma comprises a set of services that act together to provide useful provenance information to users across the system. We also show how we are realizing Genoma via an implementation prototype built on Apache Atlas and TinkerGraph, through which we are investigating several key research issues in distributed IoT provenance.

Index Terms—data provenance, data management, IoT, cloud, edge

I. INTRODUCTION

One of the key aspects of IoT systems is their capability to provide voluminous amounts of streaming and transient data [1]. This data is used by various kinds of users for various purposes. For example, in a manufacturing enterprise comprising multiple geographically distributed factories, machine tools in the factories generate data from sensors located on them. In addition, other parts of the factory assembly line are also instrumented with sensors which generate data continuously. All this data is typically stored on an edge device in the factory such as a PC or laptop computer; it may be processed and analyzed before it (or the processed data thereof) is then transferred to the enterprise’s central cloud server. Even at that central cloud server, the data is further processed, perhaps via analytics algorithms, in order to glean insights into the manufacturing process across the various factories. These insights could also lead to actuations, i.e., actions on the factory equipment, viz., reconfiguration, recalibration, etc., in order to achieve efficiency or quality improvements in the manufacturing process. All this data streaming, processing and usage, which could be implemented by multiple agents (human and automated) at various levels of the enterprise hierarchy, needs to be tracked in order to determine how the data was used, and also to what extent the data itself is useful. This data tracking is typically referred to as provenance [2].

*Thanks to Harald Gustafsson and Ola Angelsmark for their feedback on the paper.

Existing provenance tools typically suffer from two issues: first, they are usually embedded into the system from which they extract and record provenance information; and second, and more crucially, they are centralized, storing all provenance data in a central location. Both issues render current provenance tools unsuitable for IoT-based systems, which require provenance to be loosely coupled and distributed. To that end, in this paper, we present Genoma, our (under development) system for provenance as a service, which meets these requirements. By “as a service” we mean that Genoma can be offered as a service that can be “plugged into” any existing IoT-based system and provide the needed provenance capabilities for the system. Genoma is built using currently available best-of-breed open source tools such as Apache Atlas [3] on the cloud, and TinkerGraph [4] on the edge. With Genoma, we show how IoT-based systems can be instrumented in a loosely coupled (i.e., agnostic of the underlying data processing substrates used) and distributed manner. Indeed, one of Genoma’s key features is that it allows provenance to be implemented on the edge, even in the presence of intermittent network connectivity to the cloud.

This paper is organized as follows. In the next section, we present some background which we will use throughout the rest of our paper; this section also covers related work in this area. We present the data model and architecture of Genoma in Section III. In Section IV we present our ongoing implementation of Genoma. Section V discusses the key research issues affecting distributed IoT provenance as a service, and how we are addressing them as part of the Genoma implementation. Finally, the paper concludes in Section VI with suggestions for future work.

II. BACKGROUND AND RELATED WORK

Provenance documents the inputs, systems, entities and processes that influence data of interest in an IoT-based system. Provenance can be defined in terms of the lifecycle as depicted in Fig. 1. In other words, users first model the provenance data to be captured; the data is captured and then recorded and stored in the storage system; it is then retrieved by the user, who then performs inferencing on it in order to understand the provenance data and extract useful insights. In addition, since provenance data is expected to be an immutable record of usage, it is expected to be read-only; hence it can only be stored, archived or deleted, not overwritten.

978-1-5386-4980-0/19/$31.00 ©2019 IEEE

Fig. 1. Provenance Lifecycle
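The read-only property described above (records may be stored, archived or deleted, but never overwritten) can be illustrated with a minimal sketch. The class and method names below (`ProvenanceRecord`, `AppendOnlyStore`) are hypothetical and are not part of Genoma; the sketch only shows one way such a constraint might be enforced in memory.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    """An immutable provenance record (hypothetical structure)."""
    record_id: str
    payload: str


class AppendOnlyStore:
    """Records can be stored, archived or deleted, but never overwritten."""

    def __init__(self) -> None:
        self._active: Dict[str, ProvenanceRecord] = {}
        self._archive: Dict[str, ProvenanceRecord] = {}

    def store(self, record: ProvenanceRecord) -> None:
        # Reject any attempt to reuse an ID that is already present.
        if record.record_id in self._active or record.record_id in self._archive:
            raise ValueError(f"record {record.record_id!r} already exists")
        self._active[record.record_id] = record

    def retrieve(self, record_id: str) -> Optional[ProvenanceRecord]:
        if record_id in self._active:
            return self._active[record_id]
        return self._archive.get(record_id)

    def archive(self, record_id: str) -> None:
        # Move the record from active storage to the archive unchanged.
        self._archive[record_id] = self._active.pop(record_id)

    def delete(self, record_id: str) -> None:
        self._active.pop(record_id, None)
        self._archive.pop(record_id, None)
```

Freezing the record dataclass prevents in-place mutation, while the store itself rejects a second write under an existing ID.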

The two most popular provenance standards are ProvDM [5] and OPM [6]. The former was developed as a W3C standard, primarily for document provenance, and hence it focuses on modifications made to documents by users working on them collaboratively. OPM, in contrast, views activities on data sets as tasks executed in a predefined sequence and aims at recording modifications to data at each task in the sequence.

In the cloud, one of the key works on provenance has been PASS [7]. It automatically collects, stores, manages, and provides search for provenance. In effect, it is both a provenance solution and a substrate on which other provenance systems can be supported. However, it is tightly coupled to the underlying cloud data substrate on which it is built. Another relevant work is Ground [8], which is a data context service in the cloud on which provenance solutions can be built. However, it is not itself a provenance solution.

The emergence of IoT has given rise to IoT provenance and, in particular, streaming provenance. Some research challenges for streaming provenance are discussed in [9], such as preserving the throughput of the underlying streaming infrastructure, independence from the underlying streaming infrastructure, and minimization of processing and storage load. In particular, some initial work on the third challenge is described in [10]. These challenges are also applicable to Genoma and will be incorporated as part of future work.

With blockchain [11] gaining prominence as a promising technology for tamper-proof storage of data, it is only natural that this would spur research into using blockchain for provenance. Some initial work has been described in [12]–[14]. Overall, research in this area has focused on integrating OPM with blockchain for immutable data trails, blockchain-based cloud data provenance, and using publicly auditable contracts to encode data usage policies in a privacy-friendly way (for GDPR, etc.). We view this research as complementary to Genoma, and we will consider incorporating it as part of future work.

To summarize, while some initial work on IoT provenance has been gaining traction [15], [16], there has been very little work on distributed and loosely coupled IoT provenance, in particular, the concept of “provenance as a service”. Hence we believe Genoma is one of the first works on distributed IoT provenance with emphasis on loose coupling to the underlying data processing substrate. In the rest of this paper, we will present our Genoma architecture and ongoing implementation.

III. Genoma DATA MODEL AND ARCHITECTURE

A. Genoma Data Model

Genoma provides two kinds of distributed provenance: data and workflow. Data provenance provides stream provenance in a manner similar to that presented in [16]. Workflow provenance is built on data provenance, and records the workflows that were created and executed to act on the data, e.g., streaming, processing, transformation, etc. Genoma’s data model therefore is a directed acyclic graph (DAG) comprising vertices, which represent entities, and edges, which represent relationships between entities. Each vertex has an ID, a type and a version number: the type distinguishes between different vertex types, the ID differentiates between multiple instances of the same vertex type, and the version number tracks changes to an existing vertex instance.

Genoma’s data provenance model comprises the following vertices:

• Parameter: represents a specific sensory input from a sensor

• Measurement: represents a specific value of a parameter. Storing such values does not strictly fall under the purview of provenance; rather it is done by the underlying data processing system. However, Genoma stores measurements whenever there is a significant state change in the workflow operating on the data in question. The state changes considered are: workflow start, workflow stop, workflow suspend (in case of failure or reconfiguration) and workflow resume.

• Job: represents a type of action on the underlying streaming data

• Workflow: represents a collection of jobs in a predefined sequence

• Workflow Instance: represents an instance of a workflow

• Task: represents an instance of a job; hence a workflow instance is a collection of tasks

• User: represents an entity (human/automated) responsible for running the workflow instance

Genoma’s workflow provenance model comprises the following vertices:

• Source: represents the source of a measurement; this could be either the sensor itself or an entity (e.g., a machine in a factory) where the sensor is installed

• Location: represents a location. By location we mean a combination of a logical and physical location. The logical location could refer to a particular logical property of the location, e.g., edge1 or edge2. The physical location could refer to the actual geographic location (e.g., GPS coordinates) or a logical location determined by the domain (e.g., second machine in the factory). A single location can contain one or more sources.

• Environment: represents an individual computation environment (computing nodes, etc.) at a location. This would provide details of the compute and storage capabilities available at the location.

• Namespace: refers to the namespace under which the parameter is defined

• Topic: represents a topic on a broker such as MQTT [17]; this would represent a parameter or group of parameters which are being subscribed to via the broker.

• Storage: represents a resource either on edge or cloud where the data parameter in question is stored, either permanently or temporarily. It belongs to a particular Environment.

Edges between these vertices are represented in Table I. Please note that Table I is to be read from row to column, i.e., regarding the Parameter row and Source column, it is to be read as “Parameter ContainedIn Source”. Please also note that in order to avoid possible cyclic references, we have only depicted one-way relationships, e.g., we have not depicted the inverse relationship “Source Contains Parameter”, since this is meant to be inferred from Table I.

Hence edges in the Genoma data model represent relationships between the vertices, which can inform users of how the sensor data was processed and used. These relationships also allow users to navigate between vertices after retrieving the appropriate provenance data from the database.

Please note that the Genoma data model is the same, whether on edge or cloud, allowing seamless transfer of provenance data from edge to cloud.
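As an illustration of the data model described in this subsection, the following minimal sketch represents vertices identified by type, ID and version, with labelled one-way edges and a check that rejects any edge that would introduce a cycle. The class names and the in-memory representation are assumptions made for illustration; they do not reflect how Genoma stores its graph in TinkerGraph or Apache Atlas.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

VertexKey = Tuple[str, str, int]  # (type, ID, version)


@dataclass(frozen=True)
class Vertex:
    vertex_type: str   # e.g. "Parameter", "Source"
    vertex_id: str     # distinguishes instances of the same type
    version: int = 1   # tracks changes to an existing instance

    @property
    def key(self) -> VertexKey:
        return (self.vertex_type, self.vertex_id, self.version)

    def next_version(self) -> "Vertex":
        # A change yields a new version rather than overwriting the old one.
        return Vertex(self.vertex_type, self.vertex_id, self.version + 1)


@dataclass
class ProvenanceGraph:
    vertices: Dict[VertexKey, Vertex] = field(default_factory=dict)
    edges: List[Tuple[VertexKey, str, VertexKey]] = field(default_factory=list)

    def add_edge(self, src: Vertex, label: str, dst: Vertex) -> None:
        # Keep the graph acyclic: refuse an edge if src is reachable from dst.
        if self._reachable(dst.key, src.key):
            raise ValueError(f"edge {label!r} would create a cycle")
        self.vertices.setdefault(src.key, src)
        self.vertices.setdefault(dst.key, dst)
        self.edges.append((src.key, label, dst.key))

    def _reachable(self, start: VertexKey, target: VertexKey) -> bool:
        stack, seen = [start], set()
        while stack:
            current = stack.pop()
            if current == target:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(d for s, _, d in self.edges if s == current)
        return False

    def neighbours(self, vertex: Vertex, label: Optional[str] = None) -> List[Vertex]:
        # Follow outgoing edges, optionally filtered by relationship label.
        return [self.vertices[d] for s, l, d in self.edges
                if s == vertex.key and (label is None or l == label)]
```

For example, adding the edge “Parameter ContainedIn Source” succeeds, whereas subsequently adding the inverse edge “Source Contains Parameter” is rejected, which mirrors the one-way convention of Table I.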

B. Genoma architecture

Based on its distributed nature, the architecture of Genoma is as shown in Fig. 2.

Genoma has two main parts. On the edge, it captures and records provenance data via subscriptions to brokers (such as MQTT) on edge devices such as low-end computers or Raspberry Pis. Provenance capture and recording are separated in Genoma, in order to ensure independence from the underlying data processing system. After storage on the edge, provenance data is transmitted to the cloud based on policies set by the Policy Modeler. On the cloud, provenance data is collected via two means: from the edge via the Provenance Collector, and directly from the data stream processing system.

All collected provenance data eventually gets stored in Apache Atlas on the cloud. Our implementation on Apache Atlas includes changes to its type system to accommodate our provenance data model. On the edge, we have used TinkerGraph along with file system storage, for two reasons: TinkerGraph’s graph model is the same as that of JanusGraph, which is used in Apache Atlas; and TinkerGraph can run on low-end edge devices (unlike Apache Atlas). Since TinkerGraph is only an in-memory data store, persistent storage on the edge is currently implemented via the file system. Future work will investigate the use of lightweight graph databases for storing provenance data on the edge.

Although all collected provenance data is eventually supposed to be stored on the cloud, Genoma’s distributed storage approach ensures that provenance data can be stored and processed (albeit temporarily) on the edge to accommodate network connectivity issues between edge and cloud, which are expected to occur in large-scale distributed IoT infrastructures.

On both edge and cloud, Genoma contains components for provenance modeling and viewing. On the cloud, Genoma leverages Apache Atlas’s GUI, whereas on the edge, Genoma uses the Gremlin GUI compatible with TinkerGraph.

IV. Genoma IMPLEMENTATION

Our implementation approach for Genoma, as derived from Fig. 2, is as depicted in Fig. 3.

Our edge implementation assumes the existence of a low-end device, such as a Raspberry Pi or low-end laptop with RAM not exceeding 2 GB and storage not exceeding 80 GB. Due to the lack of any provenance tools that can work on such devices, we are currently building on the edge a mini-provenance engine as introduced in Section III-B and depicted in Fig. 3. The following components are being built on the edge:

• Provenance Capture: we have built a connector that subscribes to topics from a VerneMQ [18] broker running MQTT [17]. This connector parses the received data and extracts the appropriate provenance information from it.

• Provenance Recording: this is a connector that receives provenance data from the Provenance Capture connector and converts it into the format of our data model. This data is then stored on the edge device via the file system storage. Hence the separation of provenance recording and storage ensures that Genoma can act as a true “provenance as a service” since it would be agnostic of the underlying data stream processing system.

• Provenance Visualization: this component allows the user to visualize the stored provenance data. Visualization is implemented using TinkerGraph’s GUI feature.

• Provenance Transmitter: this component transmits provenance data to the cloud as per policies set by the Provenance Policy Modeler, which is explained in more detail below. Policies are set based on, among other things, storage availability on the edge, frequency of data transmission as set by the user, and specific pull requests for provenance data from the cloud as defined in the Policy Modeler on the cloud.

TABLE I
EDGES IN THE PROVENANCE DATA MODEL

Columns: Vertex, Parameter, Measurement, Job, WF, WF Instance, Task, User, Source, Location, Environment, Namespace, Topic, Storage

Row entries, in left-to-right order:
Parameter: ContainedIn
Measurement: BelongsTo, GeneratedBy
Job: BelongsTo
WF: BelongsTo, ContainedIn
WF Instance: BelongsTo, Contains, CreatedBy
Task: BelongsTo, SituatedIn, UsedIn
Source: SituatedIn, ContainedIn
Environment: SituatedIn, ContainedIn
Topic: Contains, BelongsTo, SituatedIn, ContainedIn, StoredIn

Fig. 2. Genoma Architecture

Fig. 3. Genoma Implementation Approach

Fig. 4 shows an example Genoma provenance graph visualization on the edge using TinkerGraph’s GUI feature.
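The Provenance Capture and Provenance Recording components described above can be sketched as follows. This is a minimal illustration under several assumptions: the JSON payload fields (`parameter`, `value`, `event`, `workflow_instance`, `timestamp`) are hypothetical, the broker subscription itself is omitted so that the functions simply receive a topic and a payload, the assignment of the `BelongsTo` and `Contains` labels to these vertex pairs is assumed, and persistence is modelled as appending JSON lines to a file. Only the four workflow state changes named in Section III-A trigger capture.

```python
import json
from typing import Dict, List, Optional

# Workflow state changes at which provenance is captured.
STATE_CHANGES = {"start", "stop", "suspend", "resume"}


def capture(topic: str, payload: bytes) -> Optional[Dict]:
    """Provenance Capture: parse a broker message; keep only state changes."""
    message = json.loads(payload.decode("utf-8"))
    if message.get("event") not in STATE_CHANGES:
        return None  # ordinary data points are left to the data processing system
    return {
        "topic": topic,
        "parameter": message["parameter"],
        "value": message["value"],
        "event": message["event"],
        "workflow_instance": message["workflow_instance"],
        "timestamp": message["timestamp"],
    }


def record(captured: Dict) -> List[Dict]:
    """Provenance Recording: convert captured data into vertex/edge records."""
    measurement_id = f'{captured["parameter"]}@{captured["timestamp"]}'
    return [
        {"kind": "vertex", "type": "Topic", "id": captured["topic"], "version": 1},
        {"kind": "vertex", "type": "Parameter", "id": captured["parameter"], "version": 1},
        {"kind": "vertex", "type": "Measurement", "id": measurement_id, "version": 1,
         "value": captured["value"], "event": captured["event"]},
        # Edge labels follow Table I; the endpoints chosen here are assumed.
        {"kind": "edge", "label": "BelongsTo",
         "from": measurement_id, "to": captured["parameter"]},
        {"kind": "edge", "label": "Contains",
         "from": captured["topic"], "to": captured["parameter"]},
    ]


def store(records: List[Dict], path: str) -> None:
    """Append records to a JSON Lines file on the edge device's file system."""
    with open(path, "a", encoding="utf-8") as handle:
        for item in records:
            handle.write(json.dumps(item, sort_keys=True) + "\n")
```

Because `capture` is the only function that knows the message format, a different data stream processing system would require replacing only that function, which is the intent of separating capture from recording.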

On the cloud, we are building the following components as extensions to Apache Atlas:

• Enhanced Type System for Apache Atlas: as depicted in Fig. 3, this is an enhancement of Apache Atlas’s type system to incorporate the following: the additional constructs from our data model; links between these constructs so as to represent edges in our data model; and version attributes to represent versioning of nodes and edges in the data model, which would thereby represent data lineage, i.e., the origin of the data, where it was used, and how it was used, as part of overall workflow provenance which can be visualized using Apache Atlas’s user interface.

• Provenance Collector: this component performs the following functions: it receives the provenance data sent by the Provenance Transmitter; and it also (based on policies specified via the Provenance Policy Modeler on the cloud) sends requests for provenance data. The collected data is stored in Apache Atlas’s database for subsequent query and retrieval by users and external applications.

Fig. 4. Genoma Visualization on the Edge

An example data lineage graph derived from Genoma data stored in Apache Atlas, and visualized via Apache Atlas’s GUI, is depicted in Fig. 5.

Fig. 5. Data Lineage Visualization on Apache Atlas

We are also building a Provenance Policy Modeler, which will be installed on both the edge and cloud. The objectives of this modeler are the following:

• On the Edge: at a minimum, ensure that the (resource-constrained) edge device does not run out of storage space by transmitting provenance data to the cloud; in addition, transmit provenance data as per rules set by the user on the edge.

• On the Cloud: send requests for provenance data to the edge device as per policies set by the user on the cloud.
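The edge-side policy objectives listed above can be sketched as a simple decision function. The thresholds, field names and the priority order of the three triggers (cloud pull request, storage pressure, user-defined interval) are assumptions chosen for illustration and are not taken from the Genoma implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EdgePolicy:
    """Hypothetical edge transmission policy."""
    max_storage_fraction: float = 0.8    # transmit once this share of storage is used
    transmit_interval_s: float = 3600.0  # user-defined transmission frequency


def transmit_reason(policy: EdgePolicy,
                    used_bytes: int,
                    capacity_bytes: int,
                    seconds_since_last: float,
                    pull_requested: bool) -> Optional[str]:
    """Return why provenance data should be sent to the cloud now, or None."""
    if pull_requested:
        return "pull-request"       # the cloud explicitly asked for data
    if capacity_bytes > 0 and used_bytes / capacity_bytes >= policy.max_storage_fraction:
        return "storage-pressure"   # avoid running out of edge storage
    if seconds_since_last >= policy.transmit_interval_s:
        return "interval-elapsed"   # periodic transmission set by the user
    return None
```

Returning a reason rather than a bare boolean makes it straightforward to log which rule fired, which may help when tuning the balance between edge storage and network overhead.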

In addition to all the above, on the cloud, we implement the same provenance capture and recording services as on the edge, but with two crucial differences: (a) the provenance capture service here directly interfaces with the underlying data stream processing system on the cloud (currently we are using Apache Pulsar [19]); (b) the provenance recording service converts the collected data into Apache Atlas’s data format. A snippet of the Genoma type system represented in Apache Atlas is illustrated in Fig. 6.

Fig. 6. Genoma Type System Illustration
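To give a sense of what a type-system extension of this kind might look like, the sketch below builds a type-definition payload in the general shape of Apache Atlas’s v2 typedef JSON (entity definitions plus a relationship definition). It is not the type system shown in Fig. 6: the type names (`genoma_parameter`, `genoma_source`), the attribute names, the choice of `DataSet` as supertype and the relationship category are all hypothetical, and the field names should be checked against the Atlas version in use.

```python
import json
from typing import Dict


def attribute(name: str, type_name: str, optional: bool = True) -> Dict:
    """One attribute definition (field names assumed from Atlas v2 typedefs)."""
    return {
        "name": name,
        "typeName": type_name,
        "isOptional": optional,
        "cardinality": "SINGLE",
        "isUnique": False,
        "isIndexable": False,
    }


def entity_def(name: str) -> Dict:
    """An entity type carrying the ID and version attributes of a vertex."""
    return {
        "name": name,
        "superTypes": ["DataSet"],  # assumption: reuse an Atlas base type
        "attributeDefs": [
            attribute("vertexId", "string", optional=False),
            attribute("vertexVersion", "int", optional=False),
        ],
    }


def relationship_def(name: str, from_type: str, to_type: str) -> Dict:
    """A relationship type representing one edge label of the data model."""
    return {
        "name": name,
        "relationshipCategory": "ASSOCIATION",
        "endDef1": {"type": from_type, "name": "source",
                    "isContainer": False, "cardinality": "SINGLE"},
        "endDef2": {"type": to_type, "name": "parameters",
                    "isContainer": False, "cardinality": "SET"},
    }


def build_typedefs() -> Dict:
    return {
        "enumDefs": [],
        "structDefs": [],
        "classificationDefs": [],
        "entityDefs": [entity_def("genoma_parameter"), entity_def("genoma_source")],
        "relationshipDefs": [
            relationship_def("genoma_parameter_contained_in_source",
                             "genoma_parameter", "genoma_source"),
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_typedefs(), indent=2))
```

The payload covers the three enhancements named above in miniature: additional constructs (entity types), links between them (a relationship type), and a version attribute on each vertex type.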

Hence our Genoma implementation meets our key requirements, viz., agnostic of the underlying data processing system (hence “provenance as a service”), and distributed across edge and cloud. This raises a number of key research challenges, which we discuss in the next section. While doing so, we also report our ongoing work in addressing the research challenges.

V. DISCUSSION

We have identified the following key research challenges for Genoma going forward.

Real-time Provenance Capture and Recording: As discussed in [10], provenance capture should be low overhead so as not to overload the underlying data stream processing system, as well as to ensure that no provenance data is missed. Provenance data can be missed if the speed of provenance data capture is lower than the speed at which the underlying data is transmitted. To that end, [9] presents a “time-value-centric (TVC)” approach towards data provenance capture. In Genoma, we have addressed this issue by only capturing provenance data at specific state changes, viz., start of a workflow, end of the workflow, workflow suspension, and workflow resumption. Workflows can be suspended for many reasons, viz., faults in the underlying data stream processing system; movement of data processing to another location (e.g., changing the location of a broker or any other data processing location); and change in data insert rate due to operators changing the frequency at which data is transmitted.

Provenance Data Storage and Transmission: There has been some research on provenance data storage, especially in the cloud [7], [20]. However, this research has focused more on efficient storage, with emphasis on optimizing queries on provenance data. In particular, these works have focused on how to enhance current storage systems to make them more “provenance-aware”. In Genoma, on the other hand, our emphasis is more on efficient provenance storage on resource-constrained edge devices, to which end we are developing our Policy Modeler to ensure optimal usage of edge device storage. Regarding efficient provenance data transmission from edge to cloud, the following research issues dominate, and we are investigating them as part of our work on Genoma: (a) optimal data transmission policies balancing storage availability on the edge and network overhead of provenance data transmission; (b) data compression techniques to facilitate faster provenance data transmission; (c) decompression at the cloud, along with techniques for correlating recently received provenance data with already stored provenance data, for the purpose of establishing data lineage and thereby facilitating accurate provenance data visualization on the cloud. Techniques such as those described in [21] may be applicable here.

Securing Provenance Data: This is a crucial topic that has received some attention in the literature, with emphasis on securing data provenance on the cloud [22]–[25]. Given that the Genoma model is graph-based, we are currently extending the approach from [22] to develop a lattice-based, role-based access control model for provenance graphs in Genoma. This model will allow administrators to specify varying levels of access to Genoma nodes and/or edges for a user, based on the user’s level in the organizational hierarchy and on the extent to which they are authorized to view (subsets of) stored provenance data. The latter, in particular, is based on the well-known attribute-based, role-based access control model [26]. We will be reporting more details of our security approach in a future paper.
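The access-control idea outlined above can be illustrated with a minimal sketch of a lattice of security labels applied to a provenance graph. This is not the model under development for Genoma; it assumes a textbook lattice in which a label is a pair of a hierarchy level and a set of compartments, and a user may view a vertex only if the user’s label dominates the vertex’s label. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source vertex ID, relationship label, target vertex ID)


@dataclass(frozen=True)
class Label:
    """A point in the lattice: a hierarchy level plus a set of compartments."""
    level: int
    compartments: FrozenSet[str] = frozenset()

    def dominates(self, other: "Label") -> bool:
        # Partial order: at least the same level, and a superset of compartments.
        return self.level >= other.level and self.compartments >= other.compartments


def visible_subgraph(user: Label,
                     vertex_labels: Dict[str, Label],
                     edges: Iterable[Edge]) -> Tuple[Set[str], List[Edge]]:
    """Return the vertices the user may view and the edges between them."""
    visible = {v for v, label in vertex_labels.items() if user.dominates(label)}
    # An edge is shown only when both of its endpoints are visible.
    kept = [(s, l, d) for s, l, d in edges if s in visible and d in visible]
    return visible, kept
```

In this sketch the level could stand for the user’s position in the organizational hierarchy and the compartments for the subsets of provenance data they are authorized to view; two labels with different compartments may be incomparable, which is what distinguishes a lattice from a simple ranking.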

VI. CONCLUSIONS

In this paper, we have presented our ongoing work on Genoma, our distributed IoT Provenance as a Service solution. Genoma’s key features, missing in current work on provenance, are: it is agnostic of the underlying data stream processing system; it is offered “as a Service (aaS)”; and it is distributed across edge and cloud by design. We introduced the architecture of Genoma, along with our implementation approach.

Apart from the points mentioned in Section V, our future work will also involve identifying key benchmarks for Genoma, such as processing overhead or energy consumption, and evaluating Genoma against them.

REFERENCES

[1] N. C. Narendra, S. Nayak, and A. Shukla, “Managing large-scale transient data in iot systems,” CoRR, vol. abs/1803.09102, 2018. [Online]. Available: http://arxiv.org/abs/1803.09102

[2] P. Buneman, S. Khanna, and T. Wang-Chiew, “Why and where: A characterization of data provenance,” in International conference on database theory. Springer, 2001, pp. 316–330.

[3] “Apache Atlas.” [Online]. Available: https://atlas.apache.org/

[4] “Tinkergraph.” [Online]. Available: https://github.com/tinkerpop/blueprints/wiki/TinkerGraph

[5] K. Belhajjame, R. B’Far, J. Cheney, S. Coppens, S. Cresswell, Y. Gil, P. Groth, G. Klyne, T. Lebo, J. McCusker et al., “Prov-dm: The prov data model,” W3C Recommendation, 2013.

[6] L. Moreau, B. Clifford, J. Freire, J. Futrelle, Y. Gil, P. Groth, N. Kwasnikowska, S. Miles, P. Missier, J. Myers, B. Plale, Y. Simmhan, E. Stephan, and J. V. den Bussche, “The open provenance model core specification (v1.1),” Future Generation Computer Systems, vol. 27, no. 6, pp. 743–756, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167739X10001275

[7] K.-K. Muniswamy-Reddy, D. A. Holland, U. Braun, and M. I. Seltzer, “Provenance-aware storage systems.” in USENIX Annual Technical Conference, General Track, 2006, pp. 43–56.

[8] J. M. Hellerstein, V. Sreekanti, J. E. Gonzalez, J. Dalton, A. Dey, S. Nag, K. Ramachandran, S. Arora, A. Bhattacharyya, S. Das et al., “Ground: A data context service.” in CIDR, 2017.

[9] A. Misra, M. Blount, A. Kementsietsidis, D. Sow, and M. Wang, “Advances and challenges for scalable provenance in stream processing systems,” in International Provenance and Annotation Workshop. Springer, 2008, pp. 253–265.

[10] N. N. Vijayakumar and B. Plale, “Towards low overhead provenance tracking in near real-time stream filtering,” in International Provenance and Annotation Workshop. Springer, 2006, pp. 46–54.

[11] M. Crosby, P. Pattanayak, S. Verma, and V. Kalyanaraman, “Blockchain technology: Beyond bitcoin,” Applied Innovation, vol. 2, pp. 6–10, 2016.

[12] R. Neisse, G. Steri, and I. Nai-Fovino, “A blockchain-based approach for data accountability and provenance tracking,” in Proceedings of the 12th International Conference on Availability, Reliability and Security. ACM, 2017, p. 14.

[13] A. Ramachandran and M. Kantarcioglu, “Smartprovenance: A distributed, blockchain based dataprovenance system,” in Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy. ACM, 2018, pp. 35–42.

[14] X. Liang, S. Shetty, D. Tosh, C. Kamhoua, K. Kwiat, and L. Njilla, “Provchain: A blockchain-based data provenance architecture in cloud environment with enhanced privacy and availability,” in Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. IEEE Press, 2017, pp. 468–477.

[15] H. Olufowobi, R. Engel, N. Baracaldo, L. A. D. Bathen, S. Tata, and H. Ludwig, “Data provenance model for internet of things (iot) systems,” in Service-Oriented Computing – ICSOC 2016 Workshops, K. Drira, H. Wang, Q. Yu, Y. Wang, Y. Yan, F. Charoy, J. Mendling, M. Mohamed, Z. Wang, and S. Bhiri, Eds. Cham: Springer International Publishing, 2017, pp. 85–91.

[16] B. Glavic, K. S. Esmaili, P. M. Fischer, and N. Tatbul, “Efficient stream provenance via operator instrumentation,” ACM Transactions on Internet Technology (TOIT), vol. 14, no. 1, p. 7, 2014.

[17] “MQTT.” [Online]. Available: https://mqtt.org/

[18] “VerneMQ.” [Online]. Available: https://vernemq.com/

[19] “Apache Pulsar.” [Online]. Available: https://pulsar.apache.org/

[20] P. Macko and N. Ward, “Provenance data storage.” [Online]. Available: https://pdfs.semanticscholar.org/4ba8/93bc0ab3ad0c203159b254d7c1e200051394.pdf

[21] A. P. Chapman, H. V. Jagadish, and P. Ramanan, “Efficient provenance storage,” in Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008, pp. 993–1006.

[22] U. Braun, A. Shinnar, and M. I. Seltzer, “Securing provenance.” in HotSec, 2008.

[23] X. Wang, K. Zeng, K. Govindan, and P. Mohapatra, “Chaining for securing data provenance in distributed information networks,” in Military Communications Conference, 2012-MILCOM 2012. IEEE, 2012, pp. 1–6.

[24] A. Bates, B. Mood, M. Valafar, and K. Butler, “Towards secure provenance-based access control in cloud environments,” in Proceedings of the third ACM conference on Data and application security and privacy. ACM, 2013, pp. 277–284.

[25] M. R. Asghar, M. Ion, G. Russello, and B. Crispo, “Securing data provenance in the cloud,” in Open problems in network security. Springer, 2012, pp. 145–160.

[26] D. R. Kuhn, E. J. Coyne, and T. R. Weil, “Adding attributes to role-based access control,” Computer, vol. 43, no. 6, pp. 79–81, 2010.
