
© 2016 IBM Corporation

Hybrid solution analysis of streaming

sensor data with Spark Streaming & Kafka

Virender Thakur, IBM Big Data Specialist

Big Data Developers, NY City

April 27, 2016

© 2015 IBM Corporation

Agenda

Sensor data, the Internet of Things, and hybrid clouds

Overview of technology components used in the solution

Spark Streaming

Apache Kafka

IBM Bluemix

Node-RED

Secure Gateway

Demo Scenario Overview

Demo

Questions

Internet of Things (IoT)

• Connecting devices through the Internet

  • Computing devices, appliances, humans, and other living beings

• Insight gained by analyzing the data produced by IoT devices can be used to improve a vast array of items and experiences throughout the world

• The following graphic is from a 2015 report published by IBM

Hybrid Cloud

The Internet of Things is helping to drive an explosion in the use of hybrid clouds

Gaining Insight from IoT Data

Agenda

Sensor data, the Internet of Things, and hybrid clouds

Overview of technology components used in the solution

Spark Streaming

Apache Kafka

IBM Bluemix

Node-RED

Secure Gateway

Demo Scenario Overview

Demo

Questions

Apache Spark

Spark Streaming

• Scalable, high-throughput, fault-tolerant stream processing

• Data can be ingested from many sources

• Data can be processed using complex algorithms, including Spark machine learning and graph processing

• Processed data can be pushed out to filesystems, databases, and dashboards

Spark Streaming

• Receives live input data streams

• Divides the data into batches

• Batches are processed by the Spark engine

• Stream of results in batches are generated

Discretized Streams (DStreams)

• Basic abstraction provided by Spark Streaming

• Represents a continuous stream of data

• Either an input data stream received from a source, or a processed data stream generated by transforming an input stream

• Represented internally by a continuous series of RDDs

• An RDD is Spark’s abstraction of an immutable, distributed dataset

• Each RDD in a DStream contains data from a certain interval

Window Operations

• Allow transformations to be applied over a sliding window of data

• The window slides over a source DStream

• Source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream

• A window operation requires two parameters:

  • Window length = the duration of the window

  • Slide interval = the interval at which the window operation is performed
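To make the two parameters concrete, here is a hypothetical stand-alone sketch of the windowing arithmetic using plain Scala collections, no Spark required (`WindowSketch` and its names are invented for this example): with a 10-second batch interval, a 30-second window length spans 3 batches, and a 20-second slide interval emits a window every 2 batches.

```scala
object WindowSketch {
  // Given per-batch data, return the contents of each windowed "RDD":
  // each window covers windowBatches consecutive batches and advances
  // by slideBatches between evaluations.
  def windows[A](batches: Seq[Seq[A]], windowBatches: Int, slideBatches: Int): Seq[Seq[A]] =
    (windowBatches to batches.length by slideBatches)
      .map(end => batches.slice(end - windowBatches, end).flatten)

  def main(args: Array[String]): Unit = {
    // 5 batches of readings; window = 3 batches (30 s), slide = 2 batches (20 s)
    val batches = Seq(Seq(1), Seq(2), Seq(3), Seq(4), Seq(5))
    // first window combines batches 1-3, the next combines batches 3-5
    println(windows(batches, 3, 2))
  }
}
```

Note that with a slide interval shorter than the window length, consecutive windows overlap (batch 3 appears in both windows above), which is exactly what a rolling average needs.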

Apache Kafka

• Distributed, partitioned, replicated commit log service

• Maintains feeds of messages in categories called topics

• Producers are processes that publish messages to a Kafka topic

• Consumers subscribe to topics and process the feed of published messages

• Runs as a cluster of one or more broker servers

• Clients and servers communicate over TCP

Anatomy of a Topic

• Kafka maintains a partitioned log for each topic

• A partition is:

  • an ordered, immutable sequence of messages that is continually appended to

  • distributed over the servers in the Kafka cluster and replicated for fault tolerance

• Messages in a partition are each assigned a unique sequential ID number (the offset)

• Kafka retains all published messages for a configurable period of time, whether or not they have been consumed
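The append-only log with sequential offsets can be sketched in a few lines. This toy `PartitionLog` is purely illustrative, an assumption-laden simplification: real partitions are persisted to disk, replicated across brokers, and subject to retention.

```scala
// Toy model of one Kafka partition: an ordered log where append
// assigns the next sequential offset, and consumers read by offset.
class PartitionLog {
  private var messages = Vector.empty[String]

  // Append a message and return the offset it was assigned
  def append(msg: String): Long = {
    messages = messages :+ msg
    messages.length - 1L
  }

  // Reads do not remove messages: they remain available regardless of
  // whether they have been consumed (until retention expires, in real Kafka)
  def read(offset: Long): String = messages(offset.toInt)
}

object PartitionLogDemo {
  def main(args: Array[String]): Unit = {
    val log = new PartitionLog
    println(log.append("m0")) // offset 0
    println(log.append("m1")) // offset 1
    println(log.read(0))      // m0
  }
}
```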

Apache Kafka Decouples Data Pipelines

IBM Bluemix

Open-standards, cloud-based platform for building, running, and managing applications

• Build your apps, your way: use the most prominent compute technologies to power your app (Cloud Foundry, Docker, OpenStack)

• Extend apps with services: a catalog of IBM, third-party, and open source services allows the developer to stitch an application together quickly

• Scale more than just instances: development, monitoring, deployment, and logging tools allow the developer to run and manage the entire application

• Layered security: IBM secures the platform and infrastructure and provides you with the tools to secure your apps

• Deploy and manage hybrid apps seamlessly: get a seamless dev and management experience across a number of hybrid implementation options

• Flexible pricing: try compute options and services for free and, when you’re ready, pay only for what you use; pay-as-you-go and subscription models offer choice and flexibility

IBM Bluemix

Underpinned by three key open compute technologies: Cloud Foundry, Docker, and OpenStack. It extends each of these with a growing number of services, robust DevOps tooling, integration capabilities, and a seamless developer experience.

• Flexible compute options to run apps/services: instant runtimes, containers, virtual machines

• Platform deployment options that meet your workload requirements: Bluemix Public and Bluemix Dedicated (powered by IBM SoftLayer), and Bluemix Local* (in your data center; *coming Summer 2015)

• DevOps tooling, integration and API management, and your own hosted apps/services

• Catalog of services that extend apps’ functionality: Web, Data, Mobile, Analytics, Cognitive, IoT, Security, and your own

Node-RED

• Browser-based flow editor

• Visually wire together hardware devices, APIs and online services

• Built on Node.js and its event-driven, non-blocking model

• Includes a comprehensive built-in library of nodes

• Flows are deployed to the runtime in a single click

• JavaScript functions can be created within the editor using a rich text editor

Bluemix Secure Gateway

• Provides secure connectivity and establishes a tunnel between Bluemix and a remote location (on-premises or cloud)

• The Secure Gateway UI or REST API is used to connect to your client and create a destination point

• To increase security, you can add TLS to encrypt the data

• The behavior of your gateways and destinations can be monitored in the Secure Gateway Dashboard

Agenda

Sensor data, the Internet of Things, and hybrid clouds

Overview of technology components used in the solution

Spark Streaming

Apache Kafka

IBM Bluemix

Node-RED

Secure Gateway

Demo Scenario Overview

Demo

Questions

Demo Flow

[Diagram: sensor data flows through the IBM Secure Gateway to Bluemix; results are viewed in a browser]

Bluemix Node-RED Application

[Diagram: sensor data flowing into a Kafka topic]

Kafka Setup

# /usr/iop/current/kafka-broker/bin/kafka-topics.sh --create \
    --zookeeper rvm.svl.ibm.com:2181 --replication-factor 1 --partitions 1 --topic measures

# /usr/iop/current/kafka-broker/bin/kafka-console-producer.sh \
    --broker-list rvm.svl.ibm.com:6667 --topic measures > /dev/null 2>&1

# /usr/iop/current/kafka-broker/bin/kafka-console-consumer.sh \
    --zookeeper rvm.svl.ibm.com:2181 --from-beginning --topic measures

Spark Streaming

• Batch interval = 10 sec

• Sensors are read every 2 sec

• Window length = 30 sec (average over a 30-second rolling window)

• Slide interval = 20 sec (report the average every 20 seconds)

• Report if the object temperature exceeds 25 °C

Spark Stream Application – Scala Main Class

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSensor {

  def main(args: Array[String]): Unit = {

    val streamingRateSeconds = 10

    val conf = new SparkConf().setAppName("StreamingSensor")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(streamingRateSeconds))
  }
}

Spark Stream Application – Kafka Integration

val zkQuorum = "rvm.svl.ibm.com:2181“

val inputGroup = “MyGroup"

val topic = "measures"

val topicMap = Map( topic -> 1 )

val kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, inputGroup,

topicMap).map(_._2)

Spark Stream Application – Ingest CSV Data

// define the schema using a case class
case class Sensor(sensorid: String, temp: Int, humidity: Int, objectTemp: Int)

object SensorObject {
  // function to parse a line of CSV data into the Sensor class
  def parseSensor(str: String): Sensor = {
    val p = str.split(",")
    Sensor(p(0), p(1).toInt, p(2).toInt, p(3).toInt)
  }
}

val SensorReadings = kafkaStream.map(SensorObject.parseSensor)
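The parser can be exercised outside Spark. The sketch below reproduces the case class and parser so it is self-contained, and adds a hypothetical `parseSafe` guard (not in the original) that drops malformed CSV lines instead of letting an exception kill the streaming job:

```scala
import scala.util.Try

object ParseSensorDemo {
  case class Sensor(sensorid: String, temp: Int, humidity: Int, objectTemp: Int)

  // same parsing logic as the slide: split the CSV line and convert fields
  def parseSensor(str: String): Sensor = {
    val p = str.split(",")
    Sensor(p(0), p(1).toInt, p(2).toInt, p(3).toInt)
  }

  // hypothetical guard: a malformed line yields None rather than an exception
  def parseSafe(str: String): Option[Sensor] = Try(parseSensor(str)).toOption

  def main(args: Array[String]): Unit = {
    println(parseSafe("sensor1,22,40,27")) // Some(Sensor(sensor1,22,40,27))
    println(parseSafe("bad,line"))         // None
  }
}
```

In the streaming job, such a guard would be used as `kafkaStream.flatMap(parseSafe)` in place of `map(parseSensor)`, silently discarding unparseable records.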

Spark Stream Application – Calculate Average Sensor Data

val windowLength = 30
val slideInterval = 20

val SensorWindow = SensorReadings.window(Seconds(windowLength), Seconds(slideInterval))

// pair each reading as (sensorid, (objectTemp, count))
val SensorCountByKey = SensorWindow.map(sensor => (sensor.sensorid, (sensor.objectTemp, 1)))

// sum temperatures and counts per sensor
val SensorAddByKey = SensorCountByKey.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

// divide the sum by the count to get the average per sensor
val SensorAverage = SensorAddByKey.map(x => (x._1, x._2._1.toFloat / x._2._2.toFloat))
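The (sum, count) pairing above can be checked without a cluster. This hypothetical `averageByKey` applies the same map/reduce steps to a plain Scala collection (using `groupBy` where Spark would shuffle by key):

```scala
object AverageSketch {
  // Mirror of the DStream pipeline: pair each reading with a count of 1,
  // sum (temp, count) per key, then divide to obtain the average.
  def averageByKey(readings: Seq[(String, Int)]): Map[String, Float] =
    readings
      .map { case (id, temp) => (id, (temp, 1)) }
      .groupBy(_._1)
      .map { case (id, pairs) =>
        val (sum, count) = pairs.map(_._2).reduce((x, y) => (x._1 + y._1, x._2 + y._2))
        (id, sum.toFloat / count.toFloat)
      }

  def main(args: Array[String]): Unit = {
    // s1 averages to 25.0, s2 has a single reading of 25
    println(averageByKey(Seq(("s1", 20), ("s1", 30), ("s2", 25))))
  }
}
```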

Spark Stream Application – Look for Data Exceeding Threshold

val objectTempThreshold = 25
val numExceptions = 5

// foreachRDD performs a function on each RDD in the DStream
SensorWindow.foreachRDD { rdd =>
  // filter sensor data for object temperature above the threshold,
  // sample up to numExceptions readings, and print them
  rdd.filter(sensor => sensor.objectTemp > objectTempThreshold)
    .map(sensor => (sensor.sensorid, sensor.objectTemp))
    .takeSample(false, numExceptions)
    .foreach(println)
}

Spark Stream Application – Start Streaming

ssc.start()
ssc.awaitTermination() // block so the streaming application keeps running

Submit Spark Streaming Application

# spark-submit --class "StreamingSensor" --master yarn-client \
    target/scala-2.10/Spark-Steaming-of-Sensor-Data-assembly-0.0.1.jar

Notes:

• spark-submit does not automatically include the package containing KafkaUtils

• You need to include KafkaUtils in your project JAR

• To do that, create an all-inclusive uber-JAR (e.g., using sbt-assembly)

• sbt-assembly is an sbt plugin that creates a fat JAR of an sbt project with all of its dependencies
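A hypothetical sbt setup for producing such an uber-JAR is sketched below. The project name matches the jar above; the Spark and plugin versions are illustrative guesses for the Spark 1.x / Scala 2.10 era shown and should be checked against your cluster.

```scala
// build.sbt (sketch)
name := "Spark-Steaming-of-Sensor-Data"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  // marked "provided" because the cluster supplies Spark at runtime
  "org.apache.spark" %% "spark-core"            % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.6.1" % "provided",
  // not "provided": this is the package containing KafkaUtils,
  // so it must be bundled into the assembly JAR
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1"
)

// project/assembly.sbt (sketch)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
```

Running `sbt assembly` would then produce the fat JAR under target/scala-2.10/.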

Agenda

Sensor data, the Internet of Things, and hybrid clouds

Overview of technology components used in the solution

Spark Streaming

Apache Kafka

IBM Bluemix

Node-RED

Secure Gateway

Demo Scenario Overview

Demo

Questions

Demo Review

[Diagram: CSV sensor readings, taken every 2 sec, flow through the IBM Secure Gateway to Bluemix; results are viewed in a browser]

• Average data over a 30-second rolling window

• Report the average every 20 seconds

• Report if the object temperature exceeds 25 °C

Agenda

Sensor data, the Internet of Things, and hybrid clouds

Overview of technology components used in the solution

Spark Streaming

Apache Kafka

IBM Bluemix

Node-RED

Secure Gateway

Demo Scenario Overview

Demo

Questions
