real-time processing in hadoop for iot use cases - phoenix hug

61
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Real-Time Processing in Hadoop Phoenix Hadoop User Group Shane Kumpf & Mac Moore Solutions Engineers, Hortonworks July 2015

Upload: skumpf

Post on 08-Aug-2015

85 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Real-Time Processing in HadoopPhoenix Hadoop User Group

Shane Kumpf & Mac MooreSolutions Engineers, HortonworksJuly 2015

Page 2: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Agenda

Introduction & about Hortonworks HDP Overview of logistics industry scenario Overview of streaming architecture on HDP Streaming Demo #1 Integrating Predictive Analytics in streaming scenarios Streaming Demo with Predictive additions Q & A

Page 2

Page 3: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Preface: Enabling Technologies

Page 3

Page 4: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Preface: Enabling Technologies

Page 4

Enablers: Key technologies from mass consumer-scale deployments.

Page 5: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Preface: Enabling Technologies

Page 5

• Problems solved at scale, via fundamentally new approaches…• Make it possible, even simple, to produce new products/applications that would have been too cost prohibitive – or simply impossible - beforehand.

• Where foundation tech like Li-Ion batteries, retina displays, GPS & tiny HD cameras (from smartphones) have enabled Electric cars, quad-copters, VR displays, & more…

• Hadoop has similarly led to breakthroughs in big data scale & capability, and enables new real-time advanced analytic applications.

Page 6: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Why did Hadoop emerge?

April 2015

Page 7: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Traditional systems under pressure

Challenges• Constrains data to app

• Can’t manage new data

• Costly to Scale

Business Value

Clickstream

Geolocation

Web Data

Internet of Things

Docs, emails

Server logs

20122.8 Zettabytes

202040 Zettabytes

LAGGARDS

INDUSTRY LEADERS

1

2 New Data

ERP CRM SCM

New

Traditional

Page 8: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP

Spring 2015

Hortonworks. We do Hadoop.

Page 9: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP

Customer Momentum• 430+ customers (Q1 2015)

Hortonworks Data Platform• Completely open multi-tenant platform for any app & any data.

• A centralized architecture of consistent enterprise services for resource management, security, operations, and governance.

Partner for Customer Success• Open source community leadership focus on enterprise needs

• Unrivaled world class support

• Founded in 2011

• Original 24 architects, developers, operators of Hadoop from Yahoo!

• 600+ Employees

• 1000+ Ecosystem Partners

Page 10: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Customer Partnerships matterDriving our innovation through

Apache Software Foundation Projects

Apache Project Committers PMC Members

Hadoop 27 21

Pig 5 5

Hive 18 6

Tez 16 15

HBase 6 4

Phoenix 4 4

Accumulo 2 2

Storm 3 2

Slider 11 11

Falcon 5 3

Flume 1 1

Sqoop 1 1

Ambari 34 27

Oozie 3 2

Zookeeper 2 1

Knox 13 3

Ranger 10 n/a

TOTAL 161 108Source: Apache Software Foundation. As of 11/7/2014.

Hortonworkers are the architects and engineers that lead development of open source Apache Hadoop at the ASF

• ExpertiseUniquely capable to solve the most complex issues & ensure success with latest features

• ConnectionProvide customers & partners direct input into the community roadmap

• PartnershipWe partner with customers with subscription offering. Our success is predicated on yours.

27

Cloudera: 11

Facebook: 5

LinkedIn: 2

IBM: 2

Others: 23

Yahoo10

Page 11: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Technology Partnerships matter

Apache Project Hortonworks

Relationship Named Partner

CertifiedSolution Resells Joint

Engr

Microsoft

HP

SAS

SAP

IBM

Pivotal

Redhat

Teradata

Informatica

Oracle

It is not just about packaging and certifying software…

Our joint engineering with our partners drives open source standards for Apache Hadoop

HDP is Apache Hadoop

Page 12: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

HDP delivers a Centralized Architecture

Modern Data Architecture

• Unifies data and processing.

• Enables applications to have access to all your enterprise data through an efficient centralized platform

• Supported with a centralized approach governance, security and operations

• Versatile to handle any applications and datasets no matter the size or type

Clickstream Web & Social

Geolocation Sensor & Machine

Server Logs

Unstructured

SOU

RCES

Existing Systems

ERP CRM SCM

ANAL

YTIC

S

Data Marts

Business Analytics

Visualization& Dashboards

ANAL

YTIC

S

Applications Business Analytics

Visualization& Dashboards

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

HDFS (Hadoop Distributed File System)

YARN: Data Operating System

Interactive Real-TimeBatch Partner ISVBatch BatchMPP

EDW

Page 13: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

HDP delivers a completely open data platform

Hortonworks Data Platform 2.2

Hortonworks Data Platform provides Hadoop for the Enterprise: a centralized architecture of core enterprise services, for any application and any data.

Completely Open

• HDP incorporates every element required of an enterprise data platform: data storage, data access, governance, security, operations

• All components are developed in open source and then rigorously tested, certified, and delivered as an integrated open source platform that’s easy to consume and use by the enterprise and ecosystem.

YARN: Data Operating System(Cluster Resource Management)

1 ° ° ° ° ° ° °

° ° ° ° ° ° ° °

Apa

che

Pig

° °

° °

° ° °

° ° °

HDFS (Hadoop Distributed File System)

GOVERNANCE BATCH, INTERACTIVE & REAL-TIME DATA ACCESS

Apache Falcon

Apa

che

Hiv

e

Cas

cad

ing

Apa

che

HB

ase

Apa

che

Acc

umul

o

Apa

che

So

lr

Apa

che

Sp

ark

Apa

che

Sto

rm

Apache Sqoop

Apache Flume

Apache Kafka

SECURITY

Apache Ranger

Apache Knox

Apache Falcon

OPERATIONS

Apache Ambari

Apache Zookeeper

Apache Oozie

Page 14: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Real World Use Case:Trucking Company

Spring 2015

Hortonworks. We do Hadoop.

Page 15: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Scenario Overview.

Page 16: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Trucking company w/ large fleet of trucks in Midwest

A truck generates millions of events for a given route; an event could be:

'Normal' events: starting / stopping of the vehicle

‘Violation’ events: speeding, excessive acceleration and breaking, unsafe tail distance

Company uses an application that monitors truck locations and violations from the truck/driver in real-time

Route?Truck?Driver?

Analysts query a broad history to understand if today’s violations are part of a larger problem with specific routes, trucks, or drivers

Page 17: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Distributed Storage: HDFS

Many Workloads: YARN

Trucking Company’s YARN-enabled Architecture

Stream Processing (Storm)

Inbound Messaging(Kafka)

Real-time Serving (HBase)

Alerts & Events(ActiveMQ)

Real-Time User Interface

One cluster with consistent security, governance & operations

SQL

Interactive Query(Hive on Tez)

Truck Sensors

Page 18: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Distributed Storage: HDFS

Many Workloads: YARN

Trucking Company’s YARN-enabled Architecture

Stream Processing (Storm)

Inbound Messaging(Kafka)

Real-time Serving (HBase)

Alerts & Events(ActiveMQ)

Real-Time User Interface

One cluster with consistent security, governance & operations

SQL

Interactive Query(Hive on Tez)

Truck Sensors

Page 19: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

What is Kafka? APACHE KAFKA

High throughput distributed messaging system

Publish-Subscribe semantics but re-imagined at the implementation level to operate at speed with big data volumes

Kafka @LinkedIn: 800 billion messages per day 175 terabytes of data written per day 650 terabytes of data read per day Over 13 million messages/2.75GB of data

per second

Kafka Cluster

producer

producer

producer

consumer

consumer

consumer

Page 20: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Kafka: Anatomy of a TopicPartition 0 Partition 1 Partition 2

0 0 0

1 1 1

2 2 2

3 3 3

4 4 4

5 5 5

6 6 6

7 7 7

8 8 8

9 9 9

10 10

11 11

12

Writes

Old

New

APACHE KAFKA

Partitioning allows topics to scale beyond a single machine/node

Topics can also be replicated, for high availability.

Page 21: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Distributed Storage: HDFS

Many Workloads: YARN

Trucking Company’s YARN-enabled Architecture

Stream Processing (Storm)

Inbound Messaging(Kafka)

Real-time Serving (HBase)

Alerts & Events(ActiveMQ)

Real-Time User Interface

One cluster with consistent security, governance & operations

SQL

Interactive Query(Hive on Tez)

Truck Sensors

Page 22: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Apache Storm

• Distributed, real time, fault tolerant Stream Processing platform.• Provides processing guarantees.• Key concepts include:

•Tuples•Streams•Spouts•Bolts•Topology

Page 22

Page 23: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Tuples and Streams

• What is a Tuple?– Fundamental data structure in Storm. Is a named list of values that can be of any data type.

Page 23

• What is a Stream?– An unbounded sequences of tuples.– Core abstraction in Storm and are what you “process” in Storm

Page 24: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 24 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Spouts

• What is a Spout?– Generates or a source of Streams– E.g.: JMS, Twitter, Log, Kafka Spout– Can spin up multiple instances of a Spout and dynamically adjust as needed

Page 24

Page 25: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 25 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Bolts

• What is a Bolt?– Processes any number of input streams and produces output streams– Common processing in bolts are functions, aggregations, joins, read/write to data stores, alerting

logic– Can spin up multiple instances of a Bolt and dynamically adjust as needed

• Bolts used in the Use Case:1. HBaseBolt: persisting and counting in Hbase2. HDFSBolt: persisting into HFDS as Avro Files using Flume3. MonitoringBolt: Read from Hbase and create alerts via email and a message to ActiveMQ if the

number of illegal driver incidents exceed a given threshhold.

Page 25

Page 26: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 26 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Topology

• What is a Topology?– A network of spouts and bolts wired together into a workflow

Page 26

Page 27: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 27 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Distributed Storage: HDFS

Many Workloads: YARN

Trucking Company’s YARN-enabled Architecture

Stream Processing (Storm)

Inbound Messaging(Kafka)

Real-time Serving (HBase)

Alerts & Events(ActiveMQ)

Real-Time User Interface

One cluster with consistent security, governance & operations

SQL

Interactive Query(Hive on Tez)

Truck Sensors

Page 28: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 28 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Key Constructs in Apache HBase• HBase = Key / Value store• Designed for petabyte scale• Supports low latency reads, writes and updates• Key features

– Updateable records– Versioned Records– Distributed across a cluster of machines– Low Latency– Caching

• Popular use cases:– User profiles and session state– Object store– Sensor apps

Page 28

Page 29: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 29 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Data Assignment

Page 29

HBase Table

Keys within HBaseDivided among

different RegionServers

Page 30: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 30 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Data Access

• Get– Retrieves a single cell, all cells with a matching rowkey, or all cells in a column

family with a matching rowkey

• Put– Inserts a new version of a cell.  

• Scan– The whole table, row by row, or a section of that table starting at a particular

start key and ending at a particular end key

• Delete– It is actually a version of put(Add a new version with put with a deletion

marker)

• SQL via Apache Phoenix– Unique capability in the NoSQL market

Page 30

Page 31: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 31 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Distributed Storage: HDFS

Many Workloads: YARN

Trucking Company’s YARN-enabled Architecture

Stream Processing (Storm)

Inbound Messaging(Kafka)

Real-time Serving (HBase)

Alerts & Events(ActiveMQ)

Real-Time User Interface

One cluster with consistent security, governance & operations

SQL

Interactive Query(Hive on Tez)

Truck Sensors

Page 32: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 32 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

20092006

1 ° ° ° ° °

° ° ° ° ° N

HDFS (Hadoop Distributed File

System)

MapReduceLargely Batch Processing

Hadoop w/ MapReduce

YARN: Data Operating System

1 ° ° ° ° ° ° ° ° °

° ° ° ° ° ° ° ° °

°

°N

HDFS (Hadoop Distributed File System)

Hadoop2 & YARN based Architecture

Silo’d clustersLargely batch systemDifficult to integrate

MR-279: YARN

Hadoop 2 & YARN

Interactive Real-TimeBatch

Architected & led development of YARN to enable the Modern Data Architecture

October 23, 2013

Page 33: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 33 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Benefits of YARN as the Data Operating System

• The container based model allows for running nearly any workload.– Enables the centralized architecture.– No longer is MapReduce the only data processing engine.– Docker containers managed by YARN. Yes Please!

• Decouples resource scheduling from application lifecycle.– Improved scalability and fault tolerence

• Dynamically allocated resources, resulting in HUGE utilization gains– Versus static allocation of “slots” in Hadoop 1.0

Page 33

Yahoo has over 30000 nodes running YARN across over 365PB of data. They calculate running about 400,000 jobs per day for about 10 million hours of compute time.

They also have estimated a 60% – 150% improvement on node usage per day since moving to YARN.

Page 34: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 34 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Distributed Storage: HDFS

Many Workloads: YARN

Trucking Company’s YARN-enabled Architecture

Stream Processing (Storm)

Inbound Messaging(Kafka)

Real-time Serving (HBase)

Alerts & Events(ActiveMQ)

Real-Time User Interface

One cluster with consistent security, governance & operations

SQL

Interactive Query(Hive on Tez)

Truck Sensors

Page 35: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 35 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Apache HDFS – Hadoop Distributed File System

• Very large scale distributed file system• 10K nodes, tens of millions files and PBs of data• Supports large files

• Designed to run on commodity hardware, assumes hardware failures

• Files are replicated to handle hardware failure• Detect failures and recovers from them automatically

• Optimized for Large Scale Processing• Data locations are exposed so that the computations can move to where data

resides• Data Coherency

• Write once and read many times access pattern• Files are broken up in chunks called ‘blocks’

• Blocks are distributed over nodes

Page 35

Page 36: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 36 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Streaming Demo - High Level Architecture

Distributed Storage: HDFS

YARN

Storm Stream Processing

Kakfa Spout

HBase

Dangerous Events TableHbase

BoltHDFSBolt

Truck Events

Active MQ

Monitoring Bolt

Web App

Truck Streaming Data

T(1) T(2) T(N)

Inbound Messaging(Kafka)

Truck Events Topic

Page 37: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 37 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Demo – Streaming Dashboard.

Page 38: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 38 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

A New Challenge.

Page 39: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 39 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

CDO’s vision: Build a Predictive Business, not a Reactive one

CDO’s Requirements Offline predictions

Identify investments that will increase safety and reduce company’s liabilities

Real-time predictions Anticipate driver violations before they

happen and take precautionary actions

Data Scientist’s Response Need to explore data & form a hypothesis Verify trends against TBs of events data via

machine learning Generate predictive models with Spark

MLlib on HDP Plug models into the Storm topology to predict

driver violations in real-time

♬ I’ve been waiting for this moment all my life ♬

Page 40: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 40 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Demo – Analyzing Events with Tableau.

Page 41: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 41 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Analyzing Raw Events – dangerous drivers

Page 41

Page 42: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 42 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Analyzing Raw Events – dangerous routes

Page 42

Page 43: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 43 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Analyzing Raw Events – violations by location

Page 43

Page 44: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 44 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Enriching truck events for analysis with Pig

HDFS Raw Truck EventsWeather Data Sets

Raw Weather Data

HCatalog (Metadata)

Payroll Data

HR & Payroll DBs

Load Raw Truck Events

Clean & Filter

Cleaned Events

TransformedEvents

Transform

Join withHR & weather data

EnrichedEvents

Enriched Events

Store

Tableau

Page 45: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 45 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Analyzing Enriched Events – noncertified and fatigued drivers more dangerous

Page 45

Page 46: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 46 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Analyzing Enriched Events – top 3 dangerous routes seem to be driven by fatigued drivers

Page 46

Page 47: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 47 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Analyzing Enriched Events – foggy weather leads to violations

Page 47

Page 48: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 48 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Analyzing Enriched Events – but top 3 safest routes are also foggy

Page 48

Page 49: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 49 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Integrating Predictive Analytics

Page 50: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 50 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Building the Predictive Model on HDP

Tableau Explore small subset of events to identify predictive features and make a hypothesis. E.g. hypothesis: “foggy weather causes driver violations”

1

Identify suitable ML algorithms to train a model – we will use classification algorithms as we have labeled events data

2

Transform enriched events data to a format that is friendly to Spark MLlib – many ML libs expect training data in a certain format

3

Train a logistic regression model in Spark on YARN, with above events as training input, and iterate to fine tune the generated model

4

Integrate Spark MLlib model in a Storm bolt to predict violations in real time

5

Page 51: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 51 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Truck Sensors

HDFS

YARN

Integrate Predictive Analytics in Stream Processing

Stream Processing (Storm)

Inbound Messaging(Kafka)

Interactive Query(Hive on Tez)

Real-time Serving (HBase)

Millions of Enriched Truck Events

Prediction Bolt

Plug Spark model into Storm bolt

Machine Learning(Spark)

Train Spark ML model with millions of truck events

Page 52: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 52 © Hortonworks Inc. 2011 – 2014. All Rights Reserved© Hortonworks Inc. 2012 Professional Services

Streaming Demo - Updated Architecture

Distributed Storage: HDFS

YARN

Storm Stream Processing

Kakfa Spout

HBase

PayRollTableHBase

BoltHDFSBolt

Truck Events

Active MQ

Monitoring Bolt

Web App

Truck Streaming Data

T(1) T(2) T(N)

Inbound Messaging(Kafka)

Truck Events Topic

PredictionBolt

Enrich

Event

Predict violation in real time & alert via MQ

Render Real time predictions on UI

Page 53: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 53 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Transforming training data for Spark MLlibEnriched Events Data

Event Type Is Driver Certified?

Wage Plan

Hours Driven

MilesDriven

Longitude Latitude WeatherFoggy

Weather Rainy

Weather Windy

Normal Yes Hourly 45 2721 -91.3 38.14 No No NoOverspeed No Miles 72 4152 -94.23 37.09 Yes Yes No

… … … … … … … … … …

Spark MLlib Training DataLabel Is Driver

Certified?Wage Plan

Hours Driven

MilesDriven

WeatherFoggy

Weather Rainy

Weather Windy

0 1 1 0.45 0.2721 0 0 01 0 0 0.72 0.4152 1 1 0… … … … … … … …

Normal events labeled as 0 and

violation events as 1

Feature scaling applied to hours and miles to improve

algorithm performance

Features with binary values denoted as 0 and 1

Page 54: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 54 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Running Spark ML on YARN

1spark-submit --class org.apache.spark.examples.mllib.BinaryClassification --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 truckml.jar --algorithm LR --regType L2 --regParam 1.0 /user/root/truck_training --numIterations 100

Run spark-submit script to launch a Spark job on YARN.

Training data location on HDFS

2 Monitor progress of Spark job in YARN Resource Mgr UI

Page 55: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 55 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Interpreting Spark Logistic Regression Results

Precision: 87.5% Recall: 88%

Top three predictors of violations 1. Foggy Weather 2. Rainy Weather 3. Driver Certification

Page 56: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 56 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Integrating Spark model in Storm

Kafka Spout

Storm Prediction Bolt

Initialize Spark model Parse truck event Enrich event with HBase data Predict violation with model Send Alert if violation predicted

Real-time Serving (HBase)

Active MQ

Ops Center LOB Dashboards

Page 57: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 57 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Summary: Solution Value.

Page 58: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 58 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Value of large scale ML on HDP Accelerate time to market/value

Test out multiple ML algorithms against TBs of training data in reasonable time frames

Confirm hypothesis against TBs of training data with confidence We confirmed that fog does impact safety and wage plans do not,

whereas BI tools indicated otherwise

Easily integrate predictive models in data driven apps Run predictive models in Storm or any other app in your enterprise

Run all of the above in a multi-tenant YARN cluster Large scale ML on YARN respects other tenants in an HDP cluster

Page 59: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 59 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Recommendations to CDO

Investment recommendations, in order of priority1. Invest in visibility sensors and auto braking systems to deal with foggy conditions2. Invest in slip resistant tires to fight rainy conditions3. Invest in certifying drivers to reduce violation probability

Power of real time predictions 40% reduction in violation rates by predicting high risk situations in real-time and

sending immediate alerts to drivers

Page 60: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 60 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Predictive Demo.

Page 61: Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Page 61 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Q & A