
© 2016 IBM Corporation

Hybrid solution analysis of streaming

sensor data with Spark Streaming & Kafka

Virender Thakur, IBM Big Data Specialist

Big Data Developers, NY City

April 27, 2016

© 2015 IBM Corporation

Agenda

Sensor data, the Internet of Things, and hybrid clouds

Overview of technology components used in the solution

Spark Streaming

Apache Kafka

IBM Bluemix

Node-RED

Secure Gateway

Demo Scenario Overview

Demo

Questions

Internet of Things (IoT)

• Connecting devices through the Internet

  • Computing devices, appliances, humans, and other living beings

• Insight gained by analyzing the data produced by IoT devices can be used to improve a vast array of items and experiences throughout the world

• The following graphic is from a 2015 report published by IBM

Hybrid Cloud

The Internet of Things is helping to drive an explosion in the use of hybrid clouds

Gaining Insight from IoT Data

Agenda

Sensor data, the Internet of Things, and hybrid clouds

Overview of technology components used in the solution

Spark Streaming

Apache Kafka

IBM Bluemix

Node-RED

Secure Gateway

Demo Scenario Overview

Demo

Questions

Apache Spark

Spark Streaming

• Scalable, high-throughput, fault-tolerant stream processing

• Data can be ingested from many sources

• Data can be processed using complex algorithms, including Spark machine learning and graph processing

• Processed data can be pushed out to filesystems, databases, and dashboards

Spark Streaming

• Receives live input data streams

• Divides the data into batches

• Batches are processed by the Spark engine

• Stream of results in batches are generated

Discretized Streams (DStreams)

• Basic abstraction provided by Spark Streaming

• Represents a continuous stream of data

• Either an input data stream received from a source, or a processed data stream generated by transforming an input stream

• Represented internally by a continuous series of RDDs

• An RDD is Spark’s abstraction of an immutable, distributed dataset

• Each RDD in a DStream contains data from a certain interval

Window Operations

• Allow transformations to be applied over a sliding window of data

• The window slides over a source DStream

• Source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream

• A window operation requires two parameters:

  • Window length = the duration of the window

  • Slide interval = the interval at which the window operation is performed
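To make the two parameters concrete, here is a hypothetical stand-alone sketch of the windowing arithmetic using plain Scala collections, no Spark required (`WindowSketch` and its names are invented for this example): with a 10-second batch interval, a 30-second window length spans 3 batches, and a 20-second slide interval emits a window every 2 batches.

```scala
object WindowSketch {
  // Given per-batch data, return the contents of each windowed "RDD":
  // each window covers windowBatches consecutive batches and advances
  // by slideBatches between evaluations.
  def windows[A](batches: Seq[Seq[A]], windowBatches: Int, slideBatches: Int): Seq[Seq[A]] =
    (windowBatches to batches.length by slideBatches)
      .map(end => batches.slice(end - windowBatches, end).flatten)

  def main(args: Array[String]): Unit = {
    // 5 batches of readings; window = 3 batches (30 s), slide = 2 batches (20 s)
    val batches = Seq(Seq(1), Seq(2), Seq(3), Seq(4), Seq(5))
    // first window combines batches 1-3, the next combines batches 3-5
    println(windows(batches, 3, 2))
  }
}
```

Note that with a slide interval shorter than the window length, consecutive windows overlap (batch 3 appears in both windows above), which is exactly what a rolling average needs.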

Apache Kafka

• Distributed, partitioned, replicated commit log service

• Maintains feeds of messages in categories called topics

• Producers are processes that publish messages to a Kafka topic

• Consumers subscribe to topics and process the feed of published messages

• Runs as a cluster of one or more broker servers

• Clients and servers communicate over TCP

Anatomy of a Topic

• Kafka maintains a partitioned log for each topic

• A partition is:

  • an ordered, immutable sequence of messages that is continually appended to

  • distributed over the servers in the Kafka cluster and replicated for fault tolerance

• Messages in a partition are each assigned a unique sequential ID number (the offset)

• Kafka retains all published messages for a configurable period of time, whether or not they have been consumed
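The append-only log with sequential offsets can be sketched in a few lines. This toy `PartitionLog` is purely illustrative, an assumption-laden simplification: real partitions are persisted to disk, replicated across brokers, and subject to retention.

```scala
// Toy model of one Kafka partition: an ordered log where append
// assigns the next sequential offset, and consumers read by offset.
class PartitionLog {
  private var messages = Vector.empty[String]

  // Append a message and return the offset it was assigned
  def append(msg: String): Long = {
    messages = messages :+ msg
    messages.length - 1L
  }

  // Reads do not remove messages: they remain available regardless of
  // whether they have been consumed (until retention expires, in real Kafka)
  def read(offset: Long): String = messages(offset.toInt)
}

object PartitionLogDemo {
  def main(args: Array[String]): Unit = {
    val log = new PartitionLog
    println(log.append("m0")) // offset 0
    println(log.append("m1")) // offset 1
    println(log.read(0))      // m0
  }
}
```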

Apache Kafka Decouples Data Pipelines

IBM Bluemix

Open-standards, cloud-based platform for building, running, and managing applications

• Build your apps, your way: use the most prominent compute technologies to power your app (Cloud Foundry, Docker, OpenStack)

• Extend apps with services: a catalog of IBM, third-party, and open source services allows the developer to stitch an application together quickly

• Scale more than just instances: development, monitoring, deployment, and logging tools allow the developer to run and manage the entire application

• Layered security: IBM secures the platform and infrastructure and provides you with the tools to secure your apps

• Deploy and manage hybrid apps seamlessly: get a seamless dev and management experience across a number of hybrid implementation options

• Flexible pricing: try compute options and services for free and, when you’re ready, pay only for what you use; pay-as-you-go and subscription models offer choice and flexibility

IBM Bluemix

Underpinned by three key open compute technologies: Cloud Foundry, Docker, and OpenStack. It extends each of these with a growing number of services, robust DevOps tooling, integration capabilities, and a seamless developer experience.

• Flexible compute options to run apps/services: instant runtimes, containers, virtual machines

• Platform deployment options that meet your workload requirements: Bluemix Public and Bluemix Dedicated (powered by IBM SoftLayer), and Bluemix Local* (in your data center; *coming Summer 2015)

• DevOps tooling, integration and API management, and your own hosted apps/services

• Catalog of services that extend apps’ functionality: Web, Data, Mobile, Analytics, Cognitive, IoT, Security, and your own

Node-RED

• Browser-based flow editor

• Visually wire together hardware devices, APIs and online services

• Built on Node.js and its event-driven, non-blocking model

• Includes a comprehensive built-in library of nodes

• Flows are deployed to the runtime in a single click

• JavaScript functions can be created within the editor using a rich text editor

Bluemix Secure Gateway

• Provides secure connectivity and establishes a tunnel between Bluemix and a remote location (on-premises or cloud)

• The Secure Gateway UI or REST API is used to connect to your client and create a destination point

• To increase security, you can add TLS to encrypt the data

• The behavior of your gateways and destinations can be monitored in the Secure Gateway Dashboard

Agenda

Sensor data, the Internet of Things, and hybrid clouds

Overview of technology components used in the solution

Spark Streaming

Apache Kafka

IBM Bluemix

Node-RED

Secure Gateway

Demo Scenario Overview

Demo

Questions

Demo Flow

[Diagram: sensor data flows through the IBM Secure Gateway to Bluemix; results are viewed in a browser]

Bluemix Node-RED Application

[Diagram: sensor data flowing into a Kafka topic]

Kafka Setup

# /usr/iop/current/kafka-broker/bin/kafka-topics.sh --create \
    --zookeeper rvm.svl.ibm.com:2181 --replication-factor 1 --partitions 1 --topic measures

# /usr/iop/current/kafka-broker/bin/kafka-console-producer.sh \
    --broker-list rvm.svl.ibm.com:6667 --topic measures > /dev/null 2>&1

# /usr/iop/current/kafka-broker/bin/kafka-console-consumer.sh \
    --zookeeper rvm.svl.ibm.com:2181 --from-beginning --topic measures

Spark Streaming

• Batch interval = 10 sec

• Sensors are read every 2 sec

• Window length = 30 sec (average over a 30-second rolling window)

• Slide interval = 20 sec (report the average every 20 seconds)

• Report if the object temperature exceeds 25 °C

Spark Stream Application – Scala Main Class

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSensor {

  def main(args: Array[String]): Unit = {

    val streamingRateSeconds = 10

    val conf = new SparkConf().setAppName("StreamingSensor")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(streamingRateSeconds))
  }
}

Spark Stream Application – Kafka Integration

val zkQuorum = "rvm.svl.ibm.com:2181“

val inputGroup = “MyGroup"

val topic = "measures"

val topicMap = Map( topic -> 1 )

val kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, inputGroup,

topicMap).map(_._2)

Spark Stream Application – Ingest CSV Data

// define the schema using a case class
case class Sensor(sensorid: String, temp: Int, humidity: Int, objectTemp: Int)

object SensorObject {
  // function to parse a line of CSV data into the Sensor class
  def parseSensor(str: String): Sensor = {
    val p = str.split(",")
    Sensor(p(0), p(1).toInt, p(2).toInt, p(3).toInt)
  }
}

val SensorReadings = kafkaStream.map(SensorObject.parseSensor)
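The parser can be exercised outside Spark. The sketch below reproduces the case class and parser so it is self-contained, and adds a hypothetical `parseSafe` guard (not in the original) that drops malformed CSV lines instead of letting an exception kill the streaming job:

```scala
import scala.util.Try

object ParseSensorDemo {
  case class Sensor(sensorid: String, temp: Int, humidity: Int, objectTemp: Int)

  // same parsing logic as the slide: split the CSV line and convert fields
  def parseSensor(str: String): Sensor = {
    val p = str.split(",")
    Sensor(p(0), p(1).toInt, p(2).toInt, p(3).toInt)
  }

  // hypothetical guard: a malformed line yields None rather than an exception
  def parseSafe(str: String): Option[Sensor] = Try(parseSensor(str)).toOption

  def main(args: Array[String]): Unit = {
    println(parseSafe("sensor1,22,40,27")) // Some(Sensor(sensor1,22,40,27))
    println(parseSafe("bad,line"))         // None
  }
}
```

In the streaming job, such a guard would be used as `kafkaStream.flatMap(parseSafe)` in place of `map(parseSensor)`, silently discarding unparseable records.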

Spark Stream Application – Calculate Average Sensor Data

val windowLength = 30
val slideInterval = 20

val SensorWindow = SensorReadings.window(Seconds(windowLength), Seconds(slideInterval))

// pair each reading as (sensorid, (objectTemp, count))
val SensorCountByKey = SensorWindow.map(sensor => (sensor.sensorid, (sensor.objectTemp, 1)))

// sum temperatures and counts per sensor
val SensorAddByKey = SensorCountByKey.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

// divide the sum by the count to get the average per sensor
val SensorAverage = SensorAddByKey.map(x => (x._1, x._2._1.toFloat / x._2._2.toFloat))
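The (sum, count) pairing above can be checked without a cluster. This hypothetical `averageByKey` applies the same map/reduce steps to a plain Scala collection (using `groupBy` where Spark would shuffle by key):

```scala
object AverageSketch {
  // Mirror of the DStream pipeline: pair each reading with a count of 1,
  // sum (temp, count) per key, then divide to obtain the average.
  def averageByKey(readings: Seq[(String, Int)]): Map[String, Float] =
    readings
      .map { case (id, temp) => (id, (temp, 1)) }
      .groupBy(_._1)
      .map { case (id, pairs) =>
        val (sum, count) = pairs.map(_._2).reduce((x, y) => (x._1 + y._1, x._2 + y._2))
        (id, sum.toFloat / count.toFloat)
      }

  def main(args: Array[String]): Unit = {
    // s1 averages to 25.0, s2 has a single reading of 25
    println(averageByKey(Seq(("s1", 20), ("s1", 30), ("s2", 25))))
  }
}
```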

Spark Stream Application – Look for Data Exceeding Threshold

val objectTempThreshold = 25
val numExceptions = 5

// foreachRDD performs a function on each RDD in the DStream
SensorWindow.foreachRDD { rdd =>
  // filter sensor data for object temperature above the threshold,
  // sample up to numExceptions readings, and print them
  rdd.filter(sensor => sensor.objectTemp > objectTempThreshold)
    .map(sensor => (sensor.sensorid, sensor.objectTemp))
    .takeSample(false, numExceptions)
    .foreach(println)
}

Spark Stream Application – Start Streaming

ssc.start()
ssc.awaitTermination() // block so the streaming application keeps running

Submit Spark Streaming Application

# spark-submit --class "StreamingSensor" --master yarn-client \
    target/scala-2.10/Spark-Steaming-of-Sensor-Data-assembly-0.0.1.jar

Notes:

• spark-submit does not automatically include the package containing KafkaUtils

• You need to include KafkaUtils in your project JAR

• To do that, create an all-inclusive uber-JAR (e.g., using sbt-assembly)

• sbt-assembly is an sbt plugin that creates a fat JAR of an sbt project with all of its dependencies
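A hypothetical sbt setup for producing such an uber-JAR is sketched below. The project name matches the jar above; the Spark and plugin versions are illustrative guesses for the Spark 1.x / Scala 2.10 era shown and should be checked against your cluster.

```scala
// build.sbt (sketch)
name := "Spark-Steaming-of-Sensor-Data"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  // marked "provided" because the cluster supplies Spark at runtime
  "org.apache.spark" %% "spark-core"            % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.6.1" % "provided",
  // not "provided": this is the package containing KafkaUtils,
  // so it must be bundled into the assembly JAR
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1"
)

// project/assembly.sbt (sketch)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
```

Running `sbt assembly` would then produce the fat JAR under target/scala-2.10/.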

Agenda

Sensor data, the Internet of Things, and hybrid clouds

Overview of technology components used in the solution

Spark Streaming

Apache Kafka

IBM Bluemix

Node-RED

Secure Gateway

Demo Scenario Overview

Demo

Questions

Demo Review

[Diagram: CSV sensor readings, taken every 2 sec, flow through the IBM Secure Gateway to Bluemix; results are viewed in a browser]

• Average data over a 30-second rolling window

• Report the average every 20 seconds

• Report if the object temperature exceeds 25 °C

Agenda

Sensor data, the Internet of Things, and hybrid clouds

Overview of technology components used in the solution

Spark Streaming

Apache Kafka

IBM Bluemix

Node-RED

Secure Gateway

Demo Scenario Overview

Demo

Questions
