abstracting big data processing tools for smart cities · abstracting big data processing tools for...

24
Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the Distributed Smart City (WDSC’2018) Computer Science Department – IME USP This research is part of the INCT of the Future Internet for Smart Cities funded by CNPq, proc. 465446/2014-0, CAPES proc. 88887.136422/2017-00, and FAPESP, proc. 2014/50937-1. Fernanda de Camargo Magano is supported by CNPq. October 2nd, 2018 Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 1 / 24

Upload: others

Post on 28-Jun-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

Abstracting Big Data Processing Tools for Smart Cities

Fernanda de Camargo Maganoand Kelly Rosa Braghetto

Workshop on the Distributed Smart City (WDSC’2018)Computer Science Department – IME USP

This research is part of the INCT of the Future Internet for Smart Cities funded by CNPq,proc. 465446/2014-0, CAPES proc. 88887.136422/2017-00, and FAPESP, proc.

2014/50937-1. Fernanda de Camargo Magano is supported by CNPq.

October 2nd, 2018

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 1 / 24

Page 2: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

Index

1 Introduction

2 ConceptsBig Data ProcessingBig Data Tools’ APIs

3 Related Works

4 Comparison of Big Data Frameworks

5 An Architecture to Abstract Big Data ToolsIntegration with a Smart City PlatformAPI for Dataflow SpecificationValidation and Analysis

6 Conclusion Remarks

7 References

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 2 / 24

Page 3: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

Introduction

Context

Urban Big Data

• Evolution of Internet of Things and cheaper technology

• Participatory sensing (mobile phones, social network, among others)

• Large volumes of data from heterogeneous sources

• Important role of data processing and analysis for smart cities

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 3 / 24

Page 4: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

Introduction

Problem

Big Data tools

• Have good resources, but are hard to be used by data scientists ordevelopers beginners to these frameworks

• Require from their users knowledge in programming, parallel anddistributed computing

• Have not standardized languages and, therefore, are not completelyinteroperable

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 4 / 24

Page 5: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

Introduction

Goals

The main goal of this work is to make the use of Big Data processingframeworks easier for smart cities applications, by abstracting thespecificities of these tools. For this, we propose:

• An interface (API) to specify dataflows for processing data in realtime and batches

• A software system that integrates a smart cities platform withBig Data processing frameworks, using the proposed API anddeveloping mappers for different tools

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 5 / 24

Page 6: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

Concepts Big Data Processing

Big Data Processing Tools and the Dataflow Model

• Several tools share almost the same basic concepts

• Apache tools: Storm, Spark, Apex, Flink and Samza• Open source, with active communities and widely used

• They use the Dataflow Model• Expressive model: describes batches, micro-batches and streams• Directed graph to represent data dependencies

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 6 / 24

Page 7: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

Concepts Big Data Processing

Real-time Dataflow Example

Clean and  separate traffic

data Best route

Collected data

Operator

Data

Check shorter paths

Select streetswith less

traffic

Combineinformation

Figure 1: Best route selector application for smart cities

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 7 / 24

Page 8: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

Concepts Big Data Tools’ APIs

User APIs: Declarative and Topological

Declarative:

• High level

• Expressed as methods of objects representing collections

• Advanced operations (e.g., state and windows managing)

Topological or compositional:

• Programs expressed using graphs

• Explicit connection among nodes

• Specification of the code executed by the nodes

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 8 / 24

Page 9: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

Related Works

Related Works

Cho, Shiokawa, and Kitagawa (2016)

• JSFlow framework: uses dataflow algebra based in JSON

• Extends Jaql – a declarative and functional language

• Disadvantage: prototype only uses Spark

Misale et al. (2017)

• Describes the dataflow model and user APIs

• Disadvantage: theoretical work

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 9 / 24

Page 10: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

Comparison of Big Data Frameworks

Comparison of Big Data Frameworks

• The features were chosen based on smart cities applications

• The comparison led to the proposed abstraction and API

Frameworks/Features Real-time processing Latency Throughput

Consistency guarantees

Apache Flink Native Low High Exactly-once

Apache StormNative

Micro-batches with Storm Trident Very low HighExactly-once

(only for Trident)

Apache Spark Micro-batchesNot proper forlow latencies High Exactly-once

Apache Samza Native Low High At least onceApache Apex Native Low High Exactly-once

Table 1: Tools comparison - processing and consistency guarantees

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 10 / 24

Page 11: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

Comparison of Big Data Frameworks

Comparison of Big Data Frameworks

ConnectorsFrameworks/

Features Written in APIsIntegration with

Kafka and HadoopIntegration with

RabbitMQApache Flink Java, Scala Declarative Yes YesApache Storm Java, Clojure Compositional Yes No

Apache SparkJava, Scala,

Python, R Declarative Yes YesApache Samza Java, Scala Declarative Yes No

Apache Apex Java, ScalaApexStream - declarativeDAG API - compositional Yes Yes

Table 2: Tools comparison - APIs and connectors

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 11 / 24

Page 12: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

Comparison of Big Data Frameworks

Frameworks Architecture Structure

• Frameworks have a core layer

• Provide user APIs

• Offer libraries (for IO operators, SQL, ML)

• Can run above Hadoop ecossytem

Hadoop (YARN + HDFS)

Apex Core

Apex Malhar Operators Library

REST API

Customized Operators

Figure 2: Example of framework architecture – Apache Apex1

1Based on figure from http://dt-docs.readthedocs.io/en/latest/rts/Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 12 / 24

Page 13: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

Comparison of Big Data Frameworks

Apache Apex - Dataflows representation

Code 1: Word count - DAG API

LineReader lineReader = dag.addOperator("input", new

LineReader());

Parser parser = dag.addOperator("parser", new Parser());

UniqueCounter counter = dag.addOperator("counter", new

UniqueCounter());

ConsoleOutputOperator cons = dag.addOperator("console", new

ConsoleOutputOperator());

dag.addStream("lines", lineReader.output, parser.input);

dag.addStream("words", parser.output, counter.data);

dag.addStream("counts", counter.count, cons.input);

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 13 / 24

Page 14: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

Comparison of Big Data Frameworks

Apache Apex - Dataflows representation

Code 2: Word count - ApexStream API

StreamFactory.fromFolder("/tmp")

.flatMap(input -> Arrays.asList(input.split(" ")),

name("Words"))

.window(new WindowOption.GlobalWindow(),

new TriggerOption().accumulatingFiredPanes()

.withEarlyFiringsAtEvery(1))

.countByKey(input -> new Tuple.PlainTuple<>(new

KeyValPair<>(input, 1L)), name("countByKey"))

.map(input -> input.getValue(), name("Counts"))

.print(name("Console"))

.populateDag(dag);

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 14 / 24

Page 15: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

An Architecture to Abstract Big Data Tools Integration with a Smart City Platform

Our Proposed Architecture• The architecture meets the goals of this project by including an

interface (API) and a software system.

Dataflow Manager

Data Controller Flink Mapper Apex Mapper Tool Mapper

InterSCity Platform

FlinkApex

Generic Tool

API allows the user to represent the dataflow

1

2 3 4 5

Data exchange via API

Microservice

Flow of data

Dataflows representation

Figure 3: Proposed microservices architectureFernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 15 / 24

Page 16: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

An Architecture to Abstract Big Data Tools Integration with a Smart City Platform

Integration with InterSCity Platform

Figure 4: Smart-cities platform architecture 2

2Image from https://gitlab.com/interscity/interscity-platformFernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 16 / 24

Page 17: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

An Architecture to Abstract Big Data Tools API for Dataflow Specification

Our API for Dataflow Specification

<<abstract>> IO<T>

+ type: String

+ read(): Data<T>+ write(Data<T> d): void

DataTransformation<T>

+ format: String + input: Data<T> + output: Data<T>

+ select(): Data<ResultSet> + join(): Data<ResultSet>+ count(): Data<ResultSet>+ filter(String condition): Data<ResultSet>   

             . . .

Dataflow<T>+ prop: Properties

+ Dataflow(IO<T> io, DataStream<T> data,DataTransformation<T> trans): void+ setEnv(Properties prop): void

<<dataType>> Data<T>

+ format: String + type: char

+ Data(T obj): void

1

*

1

*

1

*

Figure 5: UML class diagram of the proposed API (simplified)

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 17 / 24

Page 18: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

An Architecture to Abstract Big Data Tools API for Dataflow Specification

Abstraction - Input and Output

frame

<<abstract>> IO<T>

+ type: String

+ read(): Data<T>+ write(Data<T> d): void

DataTransformation<T>

+ format: String + input: Data<T> + output: Data<T>

+ select(): Data<ResultSet> + join(): Data<ResultSet>+ count(): Data<ResultSet>+ filter(String condition): Data<ResultSet>

              ...

Map<T>+ map(UDF function):D T+ flatMap(UDF function):Data<T>

UDF<T>(UserDefinedFunction)+ name: String+ javaClass: Class

+ checkParams(String name,Class javaClass): boolean

ContextFunction<T>+ status: boolean

+ decisionMaking(): Data<T>

<<dataType>> Data<T>

+ format: String + type: char

+ Data(T obj): void

DataBatch<T>

+ size: Integer

+ DataBatch(T obj): void

DataStream<T>

+ state: boolean+ window: Window<T>

+ DataStream(T obj): void

Kafka<T>+ topic: String+ channel: String+ url: String

+ Kafka(): void

RabbitMQ<T>+ topic: String+ channel: String+ url: String

+ RabbitMQ(): void

Extends Extends

Reduce<T>+ reduce(UDF function):Data<T>

Extends Extends Extends

Extends Extends Extends

Agregação 1..*

0 ..*

Agregação1..*

0 ..*

File<T>+ format: String+ path: String

+ File(): void+ readHadoop(Key, Value): FileInputFormat+ readCsv(): CsvInputFormat+ readText(): TextInputFormat

Window<T>+ type: String+ keyed: Boolean+ size: int

+ apply(UDF function): Data<T>

<<abstract>> IO<T>

+ type: String

+ read(): Data<T>+ write(Data<T> d): void

DataTransformation<T>

+ format: String + input: Data<T> + output: Data<T>

+ select(): Data<ResultSet> + join(): Data<ResultSet>+ count(): Data<ResultSet>+ filter(String condition): Data<ResultSet>   

             . . .

Dataflow<T>+ prop: Properties

+ Dataflow(IO<T> io, DataStream<T> data,DataTransformation<T> trans): void+ setEnv(Properties prop): void

<<dataType>> Data<T>

+ format: String + type: char

+ Data(T obj): void

1

*

1

*

1

*

Figure 6: Classes for input and output connectors

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 18 / 24

Page 19: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

An Architecture to Abstract Big Data Tools API for Dataflow Specification

Abstraction - Data Transformation

DataTransformation<T>

+ format: String + input: Data<T> + output: Data<T>

+ select(): Data<ResultSet> + join(): Data<ResultSet>+ count(): Data<ResultSet>+ filter(String condition): Data<ResultSet>                 ...

Map<T>+ map(): Data<T> + flatMap(): Data<T>

UDF<T>(UserDefinedFunction)+ name: String+ javaClass: Class+ checkParams(String name,Class javaClass): boolean

ContextFunction<T>+ status: boolean

+ decisionMaking():Data<T>

Reduce<T>+ reduce(): Data<T> + fold(): Data<T>

Extends Extends Extends

Aggregation

1..*

0 ..*

Aggregation1..*

0 ..*

Window<T>+ type: String+ keyed: Boolean+ size: int

+ apply(UDF function):Data<T>

Figure 7: Classes for data transformations

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 19 / 24

Page 20: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

An Architecture to Abstract Big Data Tools Validation and Analysis

Validation

• Urban mobility case study

• Compare implementation codes with and without using theabstraction (directly done using the Big Data tools)

• Use of metrics to measure API usability (Scheller e Kuhn, 2015)

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 20 / 24

Page 21: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

An Architecture to Abstract Big Data Tools Validation and Analysis

Case Study - Application to Predict Bus Arrival Time

Bus stops

informationupdate

Roads averagespeedupdate

Currentbus

positionestimation

User waitingat bus stop

update

user arrival at abus stop

userdeparture at a

bus stop

Context 1identification:

bus location

bus withunknownlocation

buswith

knownlocation

Bus stopinformationestimation

Context 2identification:user location

(bus_id, line, position, timestamp)

(bus_id, line,last_known_pos, last_pos_timestamp)

(road_id, speed, timestamp)

(bus_id, line,previous_pos ...)

(id_user, position, timestamp)

(user_id, line, wait_time)

(stop, line, wait_time)

(stop, line,estimated_wait_time)

Output  data

Activities

Context

Known bus location

Unknown bus location

User context used internally

Data(bus_id, line,estimated_pos ...)

(user_id, stop, departure_timestamp)

(user_id, stop,arrival_timestamp)

Figure 8: Dataflow of the server system

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 21 / 24

Page 22: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

Conclusion Remarks

Conclusion Remarks

Our contributions in this work are

• A comparison among different Big Data tools

• The proposal of an API to support the specification of dataflows

• A microservices architecture on top of a smart city platform, tomap the dataflows to different Big Data frameworks.

Our ongoing work includes

• The implementation of mapper microservices

• The evaluation of the system by means of a smart city applicationwhich processes urban mobility data

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 22 / 24

Page 23: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

References

References I

Cho, Hirotoshi, Hiroaki Shiokawa, and Hiroyuki Kitagawa (2016). “JsFlow:Integration of massive streams and batches via JSON-based dataflowalgebra”. In: 2016 19th International Conference on Network-BasedInformation Systems (NBiS). IEEE, pp. 188–195.

Misale, Claudia et al. (2017). “A comparison of big data frameworks on alayered dataflow model”. In: Parallel Processing Letters 27(01),p. 1740003.

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 23 / 24

Page 24: Abstracting Big Data Processing Tools for Smart Cities · Abstracting Big Data Processing Tools for Smart Cities Fernanda de Camargo Magano and Kelly Rosa Braghetto Workshop on the

References

Abstracting Big Data Processing Tools for Smart Cities

Fernanda de Camargo Magano, Kelly Rosa [email protected] [email protected]

Open source code at GitLab:https://gitlab.com/interscity/abstraction-layer

InterSCity website:http://interscity.org

Fernanda Magano, Kelly Braghetto (USP) Abstracting Big Data Tools for Smart Cities October 2nd, 2018 24 / 24