

Stream Processing with StreamSQL, Apache Kafka

Kshitij Kumar 000456998

Quang Duy Tran

INFO-H-415 Advanced Databases

Project Report

December 2017


Contents

1 Introduction
  1.1 Streaming data
  1.2 Stream processing and Batch processing
  1.3 Lambda Architecture
  1.4 Landscape
  1.5 Stream Data Model Considerations

2 SQLStream
  2.1 Components
  2.2 Pipeline
  2.3 Streaming Data considerations within SQLStream
  2.4 SQLStream objects
  2.5 Querying operations
  2.6 Building Applications
  2.7 SQLStream, Relational Databases
  2.8 Application Properties

3 Apache Kafka, KSQL

4 Application
  4.1 Clickstream Analysis
  4.2 Taxi Trips

5 Some performance comparisons

6 Conclusion

List of Figures

1  Schematic diagram of a Stream Processing System
2  Batch processing
3  Stream processing
4  Simple layout of Lambda Architecture
5  A stream processing based architecture
6  Medusa architecture
7  SQLStream overview
8  A typical SQLStream pipeline
9  Schematic of a SQLStream application
10 Adding a data source in StreamLab
11 Stream Analysis functions
12 Map and table based visualization in s-Dashboard
13 Map based visualization in s-Dashboard, featuring filtered records
14 A streaming analysis scenario with KSQL
15 Latency measurements


1 Introduction

1.1 Streaming data

Streaming data is data generated continuously over time, at high speed, from a large number of different sources. The term is usually used in the context of big data. The important characteristics associated with streams, which also motivate the development of a different type of processing system, are:

• Unboundedness – the most important characteristic of streaming data: it is generated continuously.

• High velocity – data typically arrives at a rate that makes it infeasible to store it and work with it in main memory.

• Low latency – the unboundedness and high velocity create a need for fast processing: real time or near real time.

Figure 1 illustrates a basic schematic of a Stream Data Management System.

Figure 1: Schematic diagram of a Stream Processing System

Streaming data and real time applications based on it are plentiful. Somestream data sources are:

• Internet of Things, Sensors: Smart Cities, Driverless cars

• Internet, web traffic – clickstream

• Images – Satellite, CCTV

• Finance – Stocks, Transactions


• Factory manufacturing processes

• Logs - generated by automated systems

Some of the typical use cases include:

• Fraud detection based on financial transactions

• Investment decision based on stock prices

• Personalisation of an application, website etc. based on user preference in real time

• Real time dashboards and monitoring, eg road traffic monitoring andidentification of traffic rule violations

• Monitoring of a manufacturing process in a factory

1.2 Stream processing and Batch processing

Traditional, bounded data is served well by batch processing systems. Since the data is known and finite, the whole dataset is fetched before any processing is performed (Figure 2). A query may take hours or even days to finish and deliver its results. To handle streams, where low latency is an important criterion, existing processing systems need to be optimised for this scenario.

Figure 2: Batch processing

In contrast, stream processing reads data as it is generated over time and processes it continuously. The faster and earlier the data is processed, the more valuable it is. Typically, it is analyzed in memory before being written to disk, or not written at all, because the data is unpredictable and infinite (Figure 3).

Figure 3: Stream processing


The need for stream processing systems has increased in the recent past, mainly because of the availability of multiple sources of streaming data and the perceived opportunity to utilise them for faster decision making by lowering latency.

1.3 Lambda Architecture

Historically, stream processing has been associated with incorrect results. This can be attributed to the fact that approximation algorithms were used to perform the high velocity operations typically demanded by these applications.

In most present day big data architectures, a stream processing system is used along with a batch processing system. The idea is to combine the best of both worlds. Systems can exploit the low latency of the streaming side, albeit at a slight loss of accuracy. The batch system comes in after some time, providing the correct results. This kind of system is usually realised with the Lambda Architecture. Figure 4 gives a simple layout.

Figure 4: Simple layout of Lambda Architecture

Typically, the system consists of a data source, in this example a Kafka cluster serving the messages. A stream processing engine and a batch processing engine then ingest this data simultaneously and perform essentially the same calculation. In an Apache-based stack, these are typically Apache Storm and Apache Hadoop respectively. Their results are written to the application database when available and merged subsequently. The application can then query this database. Such systems are prevalent and have many advantages, but, as is evident, maintenance is often tricky: it requires building and provisioning two independent pipelines and a final merge.

Lately, stream processing offerings have been getting better at providing higher accuracy. It is definitely possible, and there are reportedly many systems, composed entirely of stream processing pipelines. Some proponents even make the case for completely letting go of batch systems for most use cases, arguing that stream processing systems are already beginning to match batch systems on other parameters, in addition to offering lower latency. [2][3]


Figure 5: A stream processing based architecture

1.4 Landscape

Around the beginning of this millennium, as the requirements of handling stream data were being realised, many systems were developed within academia. Some examples include Stanford University's STREAM (Stanford Stream Data Manager) [4], Aurora [5], Borealis [6] and Medusa [7], among others. There is not much active development on these systems anymore. Figure 6 shows a schematic of the Medusa system, from its homepage [7].

Figure 6: Medusa architecture

Subsequently, various other independent systems have been developed in industry, which typically build on the ANSI SQL standard to provide streaming functionality. PipelineDB [8] and StreamSQL [9] are some available products. StreamSQL is discussed in greater detail in the later sections of this report.


Recently, with the emergence of large amounts of data, many new tools are being developed in the industrial space. Many of these have been open sourced or, in some cases, their designs have been made available on the web. The Apache ecosystem has its fair share of offerings: Apache Storm, Apache Spark Streaming, Apache Flink, Apache Samza and others. All of the above are distributed real-time processing engines and typically use Apache Kafka within the pipeline for message communication.

All of the above serve the same purpose of real time computation, but the inherent distinctions in design choices make for an interesting variety when selecting one for an application. Apache Storm is oriented specifically towards streams and is the most popular system in use. Apache Spark is inherently a batch processing engine, and its Streaming extension models streams as mini batches. Interestingly, Apache Flink was designed with stream processing in mind, but it also provides batch processing capabilities which are modeled on top of the streaming ones. Additionally, around August 2017, Apache Kafka launched a developer's preview of KSQL, which is based on its own Streams API. This makes it possible to run real time analytical queries directly on Kafka topics.

Cloud based services are also available, which remove the need for managing the immense amount of hardware required to maximize the performance of these distributed processing systems. The most prominent ones are Google Cloud Dataflow, Amazon Kinesis and Microsoft Azure Stream Analytics.

1.5 Stream Data Model Considerations

The basic considerations in a stream data model are:

1. Windowing: The unboundedness and velocity of data arriving at the processing engine make it infeasible to store everything in active storage. Hence, a sliding window of the most recently arrived data is used for computations, and sometimes summaries are also kept for ad-hoc queries.

2. Time: In a real time system, latency is arguably the most important factor. Hence, there is a notion of event time and processing time. Event time, as the name suggests, corresponds to the actual time of the occurrence of the event under consideration, whereas processing time is the time of arrival of the event at the processing engine.


2 SQLStream

SQLStream is enterprise software which serves as a real-time platform and provides SQL-compliant stream analytics for both developers and analysts.

2.1 Components

It has the following components:

• SQLstream s-Server: a fully compliant, distributed, scalable and optimized SQL query engine for unstructured data streams.

• SQLLine client: a command line interface for s-Server.

• StreamLab: a visual platform for interactively exploring data streams and building applications in a web based graphical user interface.

• s-Dashboard: an HTML5 platform for building push-based, real-time visualizations on s-Server.

• s-Studio: an integrated development environment for stream exploration, application development, and administration of the s-Server platform. It enables dynamic updates to live applications, adding new queries as needed or changing existing queries or views.

Figure 7: SQLStream overview

2.2 Pipeline

The data pipeline within a SQLstream application is quite straightforward. There is an input data source, which provides the stream data flowing in at one end; this is subsequently analyzed based on application specific logic, with the final computation results as an output at the other end. The analysis is typically carried out with custom views which act as continuous queries. There is a SQLStream specific construct, a pump, which essentially encapsulates


the selection and insertion of data in the application context. Figure 8 provides a rough schematic of a typical pipeline.

Figure 8: A typical SQLStream pipeline

2.3 Streaming Data considerations within SQLStream

The pipeline is processed as a continuous flow. A streaming query is a continuous, standing query that executes over streaming data and is implemented as a view. It is similar to a traditional relational database view, with the important distinction of being continuously updated as the stream comes in.

Moreover, some queries use operations which require finite data to be able to calculate their result, for example aggregates and joins. These are typically handled by the concept of windowing. Windows can take two forms: one based on a fixed number of rows and another based on time, e.g. rolling or periodic time. Rolling time considers continuous, overlapping subsets of time, e.g. the last second, the last five seconds and so on. Periodic time considers disjoint regular intervals of time, e.g. an hourly report.
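These window forms can be illustrated with a small, hedged sketch in s-Server-style streaming SQL; the stream name "orders" and column "amount" are hypothetical, not part of any application in this report:

```sql
-- Rolling (time-based) window: average over the last 5 seconds,
-- re-evaluated as each new row arrives.
SELECT STREAM ROWTIME, AVG("amount") OVER (RANGE INTERVAL '5' SECOND PRECEDING) AS "avg_amount"
FROM "orders";

-- Row-based window: average over the 100 most recent rows.
SELECT STREAM ROWTIME, AVG("amount") OVER (ROWS 100 PRECEDING) AS "avg_amount"
FROM "orders";

-- Periodic (tumbling) window: an hourly report, grouping rows by the hour of arrival.
SELECT STREAM FLOOR("orders".ROWTIME TO HOUR) AS "hour", SUM("amount") AS "total"
FROM "orders"
GROUP BY FLOOR("orders".ROWTIME TO HOUR);
```

The rolling queries emit an updated result for every incoming row, while the periodic query emits one summary row per hour.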

Streaming data is time-sensitive and is represented as a sequence of time-stamped messages; the timestamp is called ROWTIME in the context of SQLStream.

2.4 SQLStream objects

The main objects are described below:

• Stream: A stream is the primary building block in the whole system.It implements a publish-subscribe protocol and can be written and readby multiple agents.

• Foreign Stream: A stream defined in the context of a schema andassociated with an external system.

• Foreign Table: An object that records the metadata necessary for SQLstream s-Server to be able to access a table or a similar data structure in a remote database.

• View: A relation that provides a reusable definition of a query and acts as a continuous query.


• Pump: A SQLstream schema object providing continuous-query functionality in the form of an INSERT INTO ... SELECT FROM statement. It selects data from a source, a stream or table, and inserts it into a sink, another stream or table.

• Objects for communicating with external applications: ForeignData Wrappers, Server Objects and User-Defined Routines
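As a hedged sketch of a pump (the "error_events" sink, the pump name and the filter condition are hypothetical, and a clickstream-like source stream such as the one defined in Section 4 is assumed):

```sql
-- Sink stream for the pump to insert into (hypothetical).
CREATE OR REPLACE STREAM "error_events" ("userid" INTEGER, "request" VARCHAR(256));

-- A pump is created STOPPED and started explicitly; while running it
-- continuously selects error rows from the source and inserts them into the sink.
CREATE OR REPLACE PUMP "error_pump" STOPPED AS
INSERT INTO "error_events" ("userid", "request")
SELECT STREAM "userid", "request"
FROM "clickstream_source"
WHERE "status" >= 400;

ALTER PUMP "error_pump" START;
```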

2.5 Querying operations

The querying operations are implemented on the existing framework providedby the ANSI SQL standard. Most of the definitions and semantics still holdwith some additional constructs.

In a streaming context, due to the unboundedness of streams, queryingstatements like SELECT essentially run forever. It is a blocking call whichruns until the next row of the stream becomes available or the statement isexplicitly closed.

It is also possible to connect to tables from external databases and performcombined operations like joins, aggregates etc with multiple streams and tables.

Additionally, operations like aggregates and joins require a finite set of records and must have all of their data available to produce correct results. This is handled by windowing, as mentioned in a previous section. A specific keyword, WINDOW, is used to define the window parameters: type and size.

All the basic building blocks, including SQLstream data types and streaming SQL operators, are similar to SQL. The standard SQL operators, such as CREATE, DROP, SELECT, INSERT, MERGE, DELETE and ALTER statements, and the arithmetic and logical operators, are all available. The operators for transforming and filtering incoming data include WHERE, JOIN, GROUP BY and WINDOW.

Connecting to external applications is possible through JDBC, which makesSQLstream s-Server look like a database. External applications can then sendqueries to streams and obtain results received from a stream.

2.6 Building Applications

A typical application flow is detailed in Figure 9. In the figure, the primary development environment is StreamLab, the web based graphical user interface.


Figure 9: Schematic of a SQLStream application

The pipeline begins with the creation of a source. The different options can be selected from the menu by dragging and dropping them into the sources frame. The parameters for the data source can be modified here. Figure 10 shows this screen.

Figure 10: Adding a data source in StreamLab

Once the source is selected, the next analysis steps are constructed as guides, in SQLStream terminology, which are basically views that transform the data. Many different pre-built functions, called commands, are available here. A screenshot of a StreamLab window showing some of the components and operations that can be performed is shown in Figure 11.


Figure 11: Stream Analysis functions

The possible commands are listed below:

• Basic: rename, split, merge etc. These are also shown in the top row of Figure 11.

• Aggregate: aggregates a particular column, using COUNT, SUM, or AVERAGE.

• Calculate: performs arithmetical calculations on a numerical column.

• Categorize: categorizes a continuous value by applying conditions to the column.

• Running Average: creates a new column that is a running average of another column over a given window of time.

• Time Sort: uses a sliding time-based window of incoming rows to sort by the selected column or by ROWTIME.

• Window: performs a windowed aggregation on a selected column.

• GroupRank: groups rows by the values in one column (group by) and then ranks the rows within that group according to another column (rank by) across a window.

• Partition Window: partitions rows into groups using the values in a particular column.

• Table Lookup: enriches available data with external data from databases.

Once the calculations are performed, it is possible to store the results in an external database or files. SQLStream has an inbuilt component, s-Dashboard, for visualization of the data. There are many prebuilt visualizations


available, based on time series graphs. There is also the possibility to plot geospatial data on maps. Examples are shown in Figures 12 and 13.

Figure 12: Map and table based visualization in s-Dashboard

Figure 13: Map based visualization in s-Dashboard, featuring filtered records


2.7 SQLStream, Relational Databases

SQLStream shares its data model with relational databases: it is fundamentally a common model centered on processing relation oriented constructs, e.g. rows, queries, and views. The querying constructs are also shared: the data manipulation and definition languages standardized as SQL.

The differences which exist are primarily there to account for the unboundedness of streams. s-Server uses predetermined queries over streaming data, with continuous processing and continuous querying. A relational database, on the other hand, is used for ad-hoc queries over batch data, with querying performed on a fixed set of data. Queries in the streaming scenario are usually scoped over explicit time windows based on the application requirements.

Moreover, the two work well together and can be seamlessly integrated. As an example, s-Server can use static predetermined queries to process data for a relational database and can also respond to incoming messages by triggering dynamic queries on the stored data.
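A hedged sketch of such an integration: the foreign table name "users_table", the server object "DatabaseServer" and the OPTIONS keys are all assumptions for illustration, and a clickstream-like stream is assumed to exist.

```sql
-- Hypothetical foreign table backed by a remote relational database,
-- made visible to s-Server through a previously defined server object.
CREATE OR REPLACE FOREIGN TABLE "users_table"
SERVER "DatabaseServer"
OPTIONS (schema_name 'public', table_name 'users');

-- Enrich each incoming stream row with a lookup against the external table.
SELECT STREAM c."userid", u."city", c."request"
FROM "clickstream_source" AS c
JOIN "users_table" AS u ON c."userid" = u."user_id";
```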

2.8 Application Properties

SQLStream's philosophy is to implement an end-to-end system within the stream management system, which communicates with an application for computation intensive and complex calculations. In other words, SQL is in control, and the programming language is embedded within the application.


3 Apache Kafka, KSQL

Apache Kafka is a stream processing platform based on a publish-subscribe message queue architecture. It is designed for distributed deployment and has a Java based library, Kafka Streams, for stream processing. There is also the option to connect to external systems. Kafka organises messages in topics, with multiple producers and consumers.

At the end of August 2017, KSQL was launched as a developer preview. KSQL is a streaming SQL engine which internally uses Kafka Streams, but with KSQL it is now possible to directly query the data without writing external applications. Various stream based operators are available, including aggregations, joins, windowing and sessionization.

Most of the semantics are similar to the ones discussed above for the SQLStream system. From the roadmap, it seems that more important capabilities are being added in the next version of KSQL.
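As a hedged sketch of what a KSQL query looks like (assuming a clickstream-like stream has been declared; the table name and window size are hypothetical, and the syntax follows the developer preview):

```sql
-- Continuously count page views per user over 30-second tumbling windows.
CREATE TABLE pageviews_per_user AS
  SELECT userid, COUNT(*) AS pageviews
  FROM clickstream
  WINDOW TUMBLING (SIZE 30 SECONDS)
  GROUP BY userid;
```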

Figure 14: A streaming analysis scenario with KSQL


4 Application

Some application scenarios were considered to gain first-hand experience in using and understanding stream processing systems.

4.1 Clickstream Analysis

The proliferation of websites and the need to generate revenue from an online presence have led to the emergence of clickstream analytics. A clickstream is fundamentally a stream based record of clicks on a website. This kind of website activity is logged, and gaining insights from this data is lucrative. It is also possible to collect this data from user browsers.

A clickstream typically contains information about a page request based on the protocol, e.g. HTTP GET, HTTP 404 error etc., IP information of the user, the web agent of the user, the time of the event and possibly additional information. In our analysis, we use the following schema:

clickstream (
    _time bigint,
    time varchar,
    ip varchar,
    request varchar,
    status int,
    userid int,
    bytes bigint,
    agent varchar)

A brief explanation of the attributes of the stream:

• _time and time correspond to the time of the click. The first is a numeric attribute storing the milliseconds since January 1, 1970, and the other stores the time in a varchar format corresponding to a datetime.

• ip is the user ip address

• request, status are the protocol request and status packet information

• userid is the user id assigned to users stored in the system. These are typically available for registered users of the website.

• bytes is the size of the packet used for the information communication. For a simple HTTP GET request, it will just contain the size of the text with the packet information as required by the communication protocol; if a file is being transmitted, the size of the file is also part of the message.

• agent is the web agent, typically a browser which is being used to accessthe website.


This data is generated within Kafka. A Kafka and KSQL installation ships with some data generator topics already created, and these were used to generate the data for the analysis.

Two other tables, users and status_codes, were also created, within separate Kafka topics. Their descriptions are as follows:

users (
    user_id int,
    registered_At long,
    username varchar,
    first_name varchar,
    last_name varchar,
    city varchar,
    level varchar)

status_codes (
    code int,
    definition varchar)

The aforementioned three data sources, one streaming and the other two tables, in the form of Kafka topics, were added as data sources for the SQLStream s-Server. For KSQL, the possible data sources are only Kafka topics, which are provided during the creation of the corresponding stream or table.

The initial setup for s-Server looks as follows:

--To access the external kafka topic, a server object needs to be created
--which encodes the connection parameters
CREATE OR REPLACE SERVER "KafkaServer" TYPE 'KAFKA'
FOREIGN DATA WRAPPER ECDA;

--This is the namespace of the application
CREATE OR REPLACE SCHEMA "clickstream";
SET SCHEMA '"clickstream"';

--The following creates the stream
CREATE OR REPLACE FOREIGN STREAM "clickstream_source" (
    "_time" bigint NOT NULL,
    "time" varchar NOT NULL,
    "ip" varchar NOT NULL,
    "request" varchar NOT NULL,
    "status" int NOT NULL,
    "userid" int NOT NULL,
    "bytes" bigint NOT NULL,
    "agent" varchar NOT NULL)
SERVER "KafkaServer"
OPTIONS (
    topic 'clickstream',
    seed_brokers 'localhost',
    starting_time 'latest',
    parser 'JSON',
    character_encoding 'UTF-8',
    skip_header 'false');

--Similarly, for the other two topics
CREATE OR REPLACE FOREIGN STREAM "users" (
    "user_id" int NOT NULL,
    "registered_At" long NOT NULL,
    "username" varchar NOT NULL,
    "first_name" varchar NOT NULL,
    "last_name" varchar NOT NULL,
    "city" varchar NOT NULL,
    "level" varchar NOT NULL)
SERVER "KafkaServer"
OPTIONS (
    topic 'users',
    seed_brokers 'localhost',
    starting_time 'latest',
    parser 'JSON',
    character_encoding 'UTF-8',
    skip_header 'false');

CREATE OR REPLACE FOREIGN STREAM "status_codes" (
    "code" int NOT NULL,
    "definition" varchar NOT NULL)
SERVER "KafkaServer"
OPTIONS (
    topic 'status_codes',
    seed_brokers 'localhost',
    starting_time 'latest',
    parser 'JSON',
    character_encoding 'UTF-8',
    skip_header 'false');

The initial setup for KSQL is below:

CREATE STREAM clickstream (
    _time bigint,
    time varchar,
    ip varchar,
    request varchar,
    status int,
    userid int,
    bytes bigint,
    agent varchar)
with (
    kafka_topic = 'clickstream',
    value_format = 'json');

CREATE TABLE users (
    user_id int,
    registered_At long,
    username varchar,
    first_name varchar,
    last_name varchar,
    city varchar,
    level varchar)
with (
    key = 'user_id',
    kafka_topic = 'users',
    value_format = 'json');

CREATE TABLE status_codes (
    code int,
    definition varchar)
with (
    key = 'code',
    kafka_topic = 'status_codes',
    value_format = 'json');

Some of the queries that were tried:

• Monitoring user activity: This basically requires the creation of a continuous query which joins the two data sources, clickstream and users, subsequently performing a selection for a particular user over the time span of the last few minutes.

--SQLStream s-Server
--Note: a WINDOW can be time based (RANGE keyword) or row based (ROWS keyword)
CREATE OR REPLACE STREAM user_clickstream AS
SELECT STREAM userid, u.username, ip, u.city, request, status, bytes
FROM clickstream OVER (RANGE INTERVAL '3' SECOND PRECEDING) AS c
LEFT JOIN users u ON c.userid = u.user_id;

CREATE VIEW user_activity AS
SELECT STREAM username, ip, city, COUNT(*) AS count
FROM user_clickstream
GROUP BY username, ip, city HAVING COUNT(*) > 1;

--ksql
CREATE STREAM user_clickstream AS
SELECT userid, u.username, ip, u.city, request, status, bytes
FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id;

CREATE TABLE user_activity AS
SELECT username, ip, city, COUNT(*) AS count
FROM user_clickstream
WINDOW TUMBLING (size 180 second)
GROUP BY username, ip, city HAVING COUNT(*) > 1;

• Some aggregation based queries to get a summary of the maximum, minimum and average number of requests and errors over the last time window, e.g. 3 minutes.

• Reconstruction of user sessions was implemented. A session is assumed to be over when there are 5 minutes of inactivity for a particular user.
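The session reconstruction can be sketched in KSQL (a hedged sketch: the output table name is an assumption, and the session window syntax follows the developer preview):

```sql
-- One row per (user, session); a session closes after 300 seconds
-- (5 minutes) of inactivity for that user.
CREATE TABLE user_sessions AS
  SELECT userid, COUNT(*) AS events_in_session
  FROM clickstream
  WINDOW SESSION (300 SECONDS)
  GROUP BY userid;
```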

4.2 Taxi Trips

This dataset was used in the Distributed Event-Based Systems (DEBS) 2015 grand challenge and is based on an open data source consisting of taxi trip information in New York City, USA [11]. For our analysis, we used the smaller dataset corresponding to the first 20 days of the data, with 2 million events and an extracted csv file size of about 375 MB. The full dataset consists of records for the whole of 2013 and 173 million events.

The schema and description of the attributes of the data set are as follows:

taxi_trip (
    --an md5sum of the identifier of the taxi (vehicle bound)
    medallion varchar,
    --an md5sum of the identifier for the taxi license
    hack_license varchar,
    --time when the passenger(s) were picked up
    pickup_datetime varchar,
    --time when the passenger(s) were dropped off
    dropoff_datetime varchar,
    --duration of the trip
    trip_time_in_secs int,
    --trip distance in miles
    trip_distance double,
    --longitude coordinate of the pickup location
    pickup_longitude double,
    --latitude coordinate of the pickup location
    pickup_latitude double,
    --longitude coordinate of the drop-off location
    dropoff_longitude double,
    --latitude coordinate of the drop-off location
    dropoff_latitude double,
    --the payment method: credit card or cash
    payment_type varchar,
    --fare amount in dollars
    fare_amount double,
    --surcharge in dollars
    surcharge double,
    --tax in dollars
    mta_tax double,
    --tip in dollars
    tip_amount double,
    --bridge and tunnel tolls in dollars
    tolls_amount double,
    --total paid amount in dollars
    total_amount double)

Since this data is a regular, static dataset, it had to be converted into a streaming dataset for analysis. This was done by simulating a log file. An initially empty log file was created and the sorted CSV file was read from the beginning. A fixed number of lines was copied from the CSV file to the log file at regular intervals. This number effectively sets the event rate of the stream, and we varied it across 50, 100, 250, 500 and 1000.
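A minimal version of this simulation can be written as follows (a hedged Python sketch; file names, the batch generator and the one-second default interval are our own choices, not part of the original tooling):

```python
import itertools
import time

def batches(lines, rate):
    """Split an iterable of CSV lines into batches of `rate` lines;
    each batch is one interval's worth of simulated events."""
    it = iter(lines)
    while True:
        batch = list(itertools.islice(it, rate))
        if not batch:
            return  # end of the recorded data
        yield batch

def simulate_stream(csv_path, log_path, rate, interval=1.0):
    """Append `rate` lines per `interval` seconds from the sorted CSV
    file to an initially empty log file, mimicking a live event stream."""
    with open(csv_path) as src, open(log_path, "w") as log:
        for batch in batches(src, rate):
            log.writelines(batch)
            log.flush()  # make new events visible to the stream reader
            time.sleep(interval)
```

Varying `rate` from 50 to 1000 with a fixed one-second interval reproduces the event rates used in the experiments.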

The main aim of this analysis was to gain experience with the web-based graphical interface StreamLab and the dashboard it provides. Hence, everything was modeled as a graphical data flow.

The source was configured as a log file with a CSV parser. Some simple calculation-based queries were performed to obtain an overview of the events in a preceding time frame, e.g. 30 minutes or 1 hour. Specifically, we observed the longest and shortest trips based on trip_time_in_secs and trip_distance, as well as the cheapest and most expensive trips.


5 Performance comparisons

SQLstream is an enterprise product and offers a trial license for a period of two months. The free version of the product is limited to 1 GB of data per day. The pricing is $40k for the base annual subscription, plus $10k per server and $1k per core. It can also be used via Amazon Web Services on a pay-as-you-use pricing model.

SQLstream, according to its website, is the only SQL:2008-compliant stream processing engine. It has reported processing data at over 1M events/second per CPU core on benchmarks and at over 1M events/second per server on real-world applications.

Additionally, Amazon Kinesis Analytics is based in part on certain technology components licensed from SQLstream s-Server [10].

SQLstream provides an end-to-end system for stream data analysis. The GUI-based tools allow fast prototyping and make the product usable by people not well versed in SQL. The built-in dashboards provide good visualizations for exploration and data understanding. The setup is also quite simple: a couple of packaged installers take care of the whole installation procedure. Overall, the experience is quite good. One important consideration is that it is enterprise software and quite expensive.

On the other hand, most big data systems built from Apache ecosystem components already use Apache Kafka. Hence, the availability of SQL-based real-time querying directly on Kafka topics opens up interesting uses. The system is open source, which is an important consideration for the development of a project and the ease of adopting it. The syntax of KSQL is concise and slightly easier to understand and write than that of SQLstream s-Server. Currently, KSQL is still a developer preview, and development is continuing so that it reaches the maturity required for production deployment.

Latency is arguably the single most important factor for quantitatively comparing stream processing systems. Based on benchmarking done on some distributed systems with streaming data, the latency ranges obtained for common systems are shown in Figure 15 [12].


Figure 15: Latency measurements

Based on our analysis, the performance of KSQL and SQLstream was comparable. Although extensive benchmarking was not pursued, on a single compute node with 16 GB RAM and a 7th-generation Intel i7 processor, the latency of the aggregation queries on the taxi trips data was well under 1 s for all event rates from 50 to 1000 events per second.


6 Conclusion

Stream processing is a broad topic with an immense scope. Within the scope of this project, however, the aim was to get a first overall view of stream processing in general and to test it through an application modeled in a suitable technology. SQLstream is a complete, independent, end-to-end package, with a stream processing engine, a command line interface for the engine, graphical user interfaces (both desktop and web-based) and visualization tools, making it amenable to a first look without the potential problems of setting up and learning the different and ever-changing landscape of technologies. To test it alongside more recent and open-source developments, it was combined with Apache Kafka as the stream-producing data source. KSQL also came out as a developer preview during the course of the project, so it too was tried with a simple application scenario. Some things unfortunately could not be pursued in the present context. Especially important is the area of distributed processing, which is one of the cornerstones of the wide adoption of present-day stream processing systems.


References

[1] Jure Leskovec, Anand Rajaraman, Jeff Ullman. Mining of Massive Datasets. Cambridge University Press, United Kingdom, 2014.

[2] Tyler Akidau, https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 [Online]

[3] Jay Kreps, https://www.oreilly.com/ideas/questioning-the-lambda-architecture [Online]

[4] Stanford University STREAM system, http://infolab.stanford.edu/stream/ [Online]

[5] Aurora system, http://cs.brown.edu/research/aurora/ [Online]

[6] Borealis system, http://cs.brown.edu/research/borealis/public/ [Online]

[7] Medusa system, http://nms.csail.mit.edu/projects/medusa/ [Online]

[8] PipelineDB, https://www.pipelinedb.com [Online]

[9] SQLstream, http://sqlstream.com/ [Online]

[10] SQLstream and Amazon Kinesis Analytics, http://sqlstream.com/datasheet/cloud-sqlstreamamazon-kinesis-analytics/ [Online]

[11] DEBS 2015 Grand Challenge: Dataset and information, http://debs.org/debs-2015-grand-challenge-taxi-trips/ [Online]

[12] Benchmarking of some stream processing systems, https://data-artisans.com/blog/extending-the-yahoo-streaming-benchmark [Online]

[13] A GitHub repository providing links to various stream processing systems, https://github.com/manuzhang/awesome-streaming [Online]
