apache spark and apache kafka at the rescue of distributed rdf

Apache Spark and Apache Kafka at the rescue ofdistributed RDF Stream Processing engines

Xiangnan Ren1,2,3, Olivier Cure3, Houda Khrouf1, Zakia Kazi-Aoul2, YousraChabchoub2

1 ATOS - 80 Quai Voltaire, 95870 Bezons, France{xiang-nan.ren, houda.khrouf}@atos.net

2 ISEP - LISITE, Paris 75006, France{zakia.kazi, yousra.chabchoub}@isep.fr

3 LIGM (UMR 8049), CNRS, UPEM, F-77454, Marne-la-Vallee, [email protected]

Abstract. Due to the growing need to timely process and derive valu-able information and knowledge from data produced in the SemanticWeb, RDF stream processing (RSP) has emerged as an important re-search domain. In this paper, we describe the design of an RSP enginethat is built upon state of the art Big Data frameworks, namely ApacheKafka and Apache Spark. Together, they support the implementation of aproduction-ready RSP engine that guarantees scalability, fault-tolerance,high availability, low latency and high throughput. Moreover, we high-light that the Spark framework considerably eases the implementation ofcomplex applications requiring libraries as diverse as machine learning,graph processing, query processing and stream processing.

1 Introduction

The Resource Description Framework (RDF) is a flexible data format that wasoriginally designed for the Web, and is now gaining popularity in the Internet ofThings (IoT). A majority of IoT data are dynamically generated from varioussources, e.g., sensors, and require inference services to meet their full analyticpotential. This trend leads to the notion of RDF Stream Processing (RSP) whichgains more and more attention as a research topic.

Due to the baseline defined by some RSP benchmarks such as LSBench [5]and SRBench [7], modern RSP engines need to address the following aspects:support of SPARQL main operators, output correctness and engine performance.To cope with these fundamental requirements, several centralized engines havebeen proposed in the last decade, such as C-SPARQL [1] and CQELS [4]. Limitedby a centralized design, these systems can hardly deal with the volume growthand velocity increase of RDF streams. Therefore, a distributed solution for RSPis needed to deal with practical workloads. The current distributed RSP enginessuch as CQELS-Cloud [6] and Katts [2] make a significant progress on engineperformance and scalability. Nevertheless, all available distributed RSP systemslack important features, e.g., support for common SPARQL operators and are

not ready for production. Moreover, they do not integrate the state of the artapproaches that are currently guaranteeing fault tolerance, highly availabilitywhich can enforce the system’s robustness.

To meet the expectations of Big Data projects, a novel RSP engine is re-quired. Motivated by the WAVES project1, this work provides insights on the in-tegration of Apache Kafka, a distributed messagging broker, and Apache Spark,a distributed computing framework. These two Big Data frameworks will ensurerobustness, reliability and scalability properties.

Applications based on stream processing frequently require different librariesand tools, to (1) perform incremental and iterative tasks, e.g., query processingand machine learning, (2) process graph data models and (3) handle all thestreaming machinery, e.g., continuous query, windowing operations. Currently,two distributed stream processing frameworks have been emphasized by theiradoption in large industrial projects: Apache Spark and Apache Flink. Spark-Streaming is based on micro-batch execution mechanism, and provides the sub-second delay. Flink is another popular massively parallel data processing enginewhich supports real-time data processing and CEP. Due to the enrichment andthe maturity of the platform ecosystems, we choose Spark Streaming as theframework of our RSP engine.

2 Use case

Our motivating use case concerns the industrial application of water resourcenetwork management. It aims to provide real-time analytics over RDF datastreams. The observations, also denoted as events, are dynamically generatedfrom various sensors and are hence anchored in spatio-temporal analytics. Themeasures we are considering correspond to pressure, flow, chlorine, temperatureand turbidity. Their real-time analysis permit to detect water network anomalies,such as water leaks, and can have important impacts at the economical andenvironmental levels. Nevertheless, our goal is to design a generic RSP enginethat can easily adapt to use cases concerned with other domains. Intuitively, thegoal is to seamlessly integrate novel ontologies, data streams and sets of querieswithin a highly distributed, reasoning-enabled, continuous query and complexevent processing system.

3 Architecture

Figure 1 gives a high-level overview of our system architecture, in which we in-troduce the use of the principal components. The incoming RDF events are dy-namically filtered and converted into a compressed RDF serialization. Note thatan RDF event is essentially a set of triples. To identify each event and its streamsource, the representation of triple pattern (s, p, o) is extended, i.e., an event isformed as a set of triples as e = (streamID, eventID, t, {(sn, pn, on)}n=1,...,N )

1 More details at http://waves-rsp.org

2

where t is the event timestamp. Then, the obtained RDF event streams are con-tinuously sent to the Kafka message broker. We use Kafka to manage incomingevent streams. Typically, each pre-defined Kafka topic is associated to some spe-cific RDF events (e.g., the event of flow observation or chlorine observation, etc.).Finally, Spark-Streaming concurrently receives, caches and deserializes incomingdata streams.

Fig. 1: Water Resource Management Context

We combine Spark SQL, MLib and GraphX with Spark Streaming libraries toform the computing core of our system. It mainly covers the RDF data analyticin two aspects: basic timely SPARQL query processing and data mining overRDF streams. Indeed, the canonical SPARQL query processing with externaland contextual data becomes valuable, which may also support stream reasoning.Moreover, the system is also planned to meet the requirement of advanced dataanalytic, such as classification and anomaly detection. Besides, Spark providesa seamless connection among its available libraries, since they share the samedata collection abstraction, i.e. RDD. In the following, we present the tasks thathighlight the use of these libraries.

Spark Streaming is an extension of the core Spark API and enables near real-time stream processing. We use Spark Streaming as the basic stream processinglayer. It receives input data from Kafka, divides and parallelizes the data intobatches, which are next processed by Spark Core.

Spark SQL provides a high-level abstraction to support distributed relationaloperations. The work in [3] gives a road map to choose the appropriate approachfor SPARQL query processing on Spark. In our system, we convert the inputRDF event into a data collection called DataFrame before query execution.Then, we use Sesame API to parse a (continuous) SPARQL query and returnthe query algebra tree. The obtained algebra tree allows us to reconstruct anequivalent and optimized algebra tree on Spark, namely logical plan. Finally, wedynamically generate the code from the logical plan for query processing.

MLib is a native Spark library for machine learning. We use MLib to runthe data analytics pipeline. One of the target scenario is anomaly detection.Once the event is received by Spark Streaming, we push events to the analyticlayer which uses classification algorithms and decision tree. For instance, wecompare measures for the same geographical sector on two different dates usinga clustering approach such as k-means.

3

GraphX facilitates parallel graph processing on Spark. We mainly use GraphXto generate semantic-aware dictionaries for our knowledge bases. Intuitively, itenables to seamlessly compute connected components, for instance to computethe transitive closure of some hierarchies or triples involving the owl:sameAsproperty. This computation only requires to provide two RDDs, containing ver-tices and edges of the RDF graph, to a given function. The output can easily beprocessed within our Spark core program.

4 Conclusion

RDF stream processing is an emerging area that is still in its infancy and hencerequires to conduct much more research. In this work, we present our envisionedRSP system which is based on Spark Streaming and Kafka. These mature ecosys-tems bring important properties expected in Big Data applications and haveproved to ease the design of application requiring libraries as diverse as machinelearning, graph computations, query processing and of course stream manage-ment. In future work, we plan to extend our query processing with a trade-offbetween two common reasoning approaches, namely materialization and queryreformulation.

5 Acknowledgment

This work is funded by the Fonds Unique Interministeriel (FUI #17) throughthe WAVES project, available at http://waves-rsp.org.

References

1. D. F. Barbieri, D. Braga, S. Ceri, E. D. Valle, and M. Grossniklaus. C-SPARQL:SPARQL for continuous querying. In Proceedings of the 18th International Confer-ence on World Wide Web (WWW), pages 1061–1062, 2009.

2. L. Fischer, T. Scharrenbach, and A. Bernstein. Scalable linked data stream process-ing via network-aware workload scheduling. In Proceedings of the 9th InternationalWorkshop on Scalable Semantic Web Knowledge Base Systems, pages 81–96, 2013.

3. H. Naacke, O. Cure, and B. Amann. SPARQL query processing with Apache Spark.ArXiv e-prints, 2016.

4. D. L. Phuoc, M. Dao-Tran, J. X. Parreira, and M. Hauswirth. A native and adaptiveapproach for unified processing of linked streams and linked data. In the 10thInternational Semantic Web Conference (ISWC), pages 370–388, 2011.

5. D. L. Phuoc, M. Dao-Tran, M. Pham, P. A. Boncz, T. Eiter, and M. Fink. Linkedstream data processing engines: Facts and figures. In the 11th International Seman-tic Web Conference (ISWC), pages 300–312, 2012.

6. D. L. Phuoc, H. N. M. Quoc, C. L. Van, and M. Hauswirth. Elastic and scalableprocessing of linked stream data in the cloud. In The 12th International SemanticWeb Conference (ISWC), pages 280–297, 2013.

7. Y. Zhang, M. Pham, O. Corcho, and J. Calbimonte. Srbench: A streaming RDF/S-PARQL benchmark. In the 11th International Semantic Web Conference (ISWC),pages 641–657, 2012.

4

apache spark and apache kafka at the rescue of distributed rdf

Documents