Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Data Science Company

Real Time Big Data

InfoFarm Seminar 18/11/2015


About Me


About InfoFarm


Agenda

•  Typical Big Data Landscape
•  The need for Real Time Big Data
•  Real Time Data Ingestion
•  Tools for Real Time Big Data
– Apache Spark
– Apache Storm
– Search

•  Q&A
•  Lunch


A Typical Big Data Landscape


•  Data Silo

•  Batch environment

•  Periodic analytics/statistics

•  Data Source for new systems


The need for Real Time Big Data

•  Obtaining analytical results faster
– Processing faster than once a day

•  Load evens out over the day

•  Past/Present/Future
– Alert for certain events
– Updating prediction models on-the-fly

•  Allow faster feedback to end users
– See results of your actions right away


Perfect fits for Real Time Processing

•  Anomaly Detection
– Abnormal readings of sensors
– Abnormal amounts of log files
– Fraud detection

•  Real Time updates to Recommender models
– Fast new recommendations in e-commerce
– Support for trending items
– Fast responses to events happening right now

•  Real Time updates of clustering models

•  Improving Classification based on current events

•  Can be run side-by-side with traditional historical models


Ingestion → Processing → Output


Ingestion


Apache Kafka

•  Fast

•  Scalable

•  Durable

•  Distributed


Apache Kafka - Overview

•  Producers write messages to Kafka topics

•  Consumers process messages from a topic

•  Kafka runs on a cluster of servers, where each server is called a broker


Apache Kafka - Topics

•  Topics are split up into different partitions

•  Partitions are replicated across the cluster

•  Order of messages is guaranteed within a partition

•  Messages are stored for a configurable period of time

•  Producers decide which partition they write to

•  Consumers keep track of the offset of the messages they have read
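
The topic mechanics above can be sketched with a small in-memory model. This is not the Kafka client API: `Topic`, `produce`, `Consumer` and `poll` are hypothetical names chosen for illustration, and hashing the message key is only one assumed way for a producer to pick a partition. Replication and retention are left out.

```python
import zlib
from collections import defaultdict


class Topic:
    """Toy stand-in for a topic: a fixed set of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # The producer decides the partition; here via a stable hash of the key.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition


class Consumer:
    """Tracks, per partition, the offset of the next message to read."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)  # partition index -> next offset

    def poll(self, partition):
        log = self.topic.partitions[partition]
        new_messages = log[self.offsets[partition]:]
        self.offsets[partition] = len(log)
        return new_messages


topic = Topic(num_partitions=3)
p = topic.produce("sensor-1", "reading-a")
topic.produce("sensor-1", "reading-b")  # same key, so same partition

consumer = Consumer(topic)
first = consumer.poll(p)   # both messages, in the order they were written
topic.produce("sensor-1", "reading-c")
second = consumer.poll(p)  # only the message appended since the last poll
```

Because the offset lives with the consumer, a second consumer with its own offsets could read the same partition from the start without affecting the first one.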

DEMO


The Hadoop Ecosystem

•  HDFS (Distributed File System), Amazon S3, Local FS
•  YARN (Resource Management)
•  MapReduce
•  HBase (NoSQL)
•  Hive (Data Mart)
•  Pig (Scripting)
•  Sqoop (SQL Import/Export)
•  Mahout (Machine Learning)
•  …


The Hadoop Ecosystem

•  HDFS (Distributed File System), Amazon S3, Local FS
•  YARN (Resource Management)
•  MapReduce
•  HBase (NoSQL)
•  Hive (Data Mart)
•  Pig (Scripting)
•  Sqoop (SQL Import/Export)
•  Mahout (Machine Learning)
•  …
•  Spark (Spark SQL, Spark MLlib)
•  Storm
•  …


Apache Storm


Spouts


•  Source of streams into the topology

•  Can be reliable or unreliable

•  Support for:
– Kafka
– Kestrel
– RabbitMQ
– JMS
– Amazon Kinesis
– Build your own (e.g. Twitter)


Bolts


•  Where all the processing happens

•  Filtering, functions, aggregations, joins, database updates, …

•  Bolts subscribe to the streams of other components (spouts or other bolts)

•  Must ack every tuple they process
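
To make the ack requirement concrete, here is a minimal sketch in plain Python rather than the Storm API. `ReliableSpout` and `FilterBolt` are hypothetical classes; in real Storm a bolt acks through its output collector instead of calling the spout directly, so the direct call below is a simplifying assumption.

```python
from collections import deque


class ReliableSpout:
    """Toy reliable spout: remembers each emitted tuple until it is acked."""

    def __init__(self, messages):
        self.queue = deque(enumerate(messages))  # (message id, value) pairs
        self.pending = {}  # message id -> value still awaiting an ack

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id, value = self.queue.popleft()
        self.pending[msg_id] = value
        return msg_id, value

    def ack(self, msg_id):
        # Fully processed: the tuple no longer needs to be replayed.
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        # Processing failed: put the tuple back on the queue for a replay.
        self.queue.append((msg_id, self.pending.pop(msg_id)))


class FilterBolt:
    """Keeps readings above a threshold and acks every tuple it processes."""

    def __init__(self, spout, threshold):
        self.spout = spout
        self.threshold = threshold
        self.emitted = []

    def execute(self, msg_id, value):
        if value > self.threshold:
            self.emitted.append(value)
        # Ack even when the tuple is filtered out, otherwise it stays pending.
        self.spout.ack(msg_id)


spout = ReliableSpout([3, 42, 7, 99])
bolt = FilterBolt(spout, threshold=10)
while True:
    next_item = spout.next_tuple()
    if next_item is None:
        break
    bolt.execute(*next_item)
```

After the loop, `bolt.emitted` holds the readings that passed the filter and `spout.pending` is empty. A bolt that forgot to ack filtered tuples would leave them pending, which is what a reliable spout treats as a reason to replay.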


Parallelism

•  Spouts & Bolts actually run as multiple instances on different machines

•  Making sure that the correct message goes to the correct instance is up to the developer


Stream Groupings


•  A stream grouping defines how a stream should be partitioned among the bolt's tasks

•  Some examples:
– Round robin (shuffle grouping)
– Based on key (fields grouping)
– All (all grouping)
– Specific instance (direct grouping)
– …
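
The groupings listed above can be illustrated with a few plain Python functions. This is a sketch of the routing idea only, not Storm code: all function names are hypothetical, tuples are modelled as dictionaries, and the round-robin version is deterministic whereas Storm's shuffle grouping distributes tuples randomly but evenly.

```python
import itertools
import zlib


def shuffle_grouping(num_tasks):
    """Round robin: spread tuples evenly, ignoring their content."""
    counter = itertools.cycle(range(num_tasks))
    return lambda tup: [next(counter)]


def fields_grouping(num_tasks, field):
    """Based on key: equal field values always reach the same task."""
    def choose(tup):
        digest = zlib.crc32(str(tup[field]).encode("utf-8"))
        return [digest % num_tasks]
    return choose


def all_grouping(num_tasks):
    """All: every task receives a copy of every tuple."""
    return lambda tup: list(range(num_tasks))


def direct_grouping(task_field):
    """Specific instance: the emitter names the target task explicitly."""
    return lambda tup: [tup[task_field]]


def route(tuples, grouping, num_tasks):
    """Deliver each tuple to the task(s) chosen by the grouping."""
    inboxes = [[] for _ in range(num_tasks)]
    for tup in tuples:
        for task in grouping(tup):
            inboxes[task].append(tup)
    return inboxes


words = [{"word": w} for w in
         ["storm", "spark", "storm", "kafka", "storm", "spark"]]
round_robin = route(words, shuffle_grouping(3), 3)
by_word = route(words, fields_grouping(3, "word"), 3)
broadcast = route(words, all_grouping(3), 3)
direct = route([{"task": 1, "value": "x"}], direct_grouping("task"), 3)
```

A word count is the usual reason to group by key: every occurrence of the same word has to reach the same task, otherwise each task would only hold a partial count.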


Storm Ups and Downs

•  Really real time

•  Very Powerful

•  Built for performance

•  Very low level (comparable to MapReduce)

•  Trivial tasks can become hard (sorting, joins, …)


Spark Streaming


Spark Architecture


Spark Streaming Concepts


Spark Streaming Input

•  Kafka
•  Flume
•  Kinesis
•  Twitter
•  ZeroMQ
•  HDFS
•  TCP Sockets


Windowing

•  You can group multiple batches together into a sliding window.

•  E.g. all the events from the last 60 seconds
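
The idea of grouping micro-batches into a sliding window can be sketched without Spark. `SlidingWindow` is a hypothetical helper, not part of the Spark API, and it assumes the window slides by exactly one batch interval; in Spark Streaming itself, windowing is provided by the DStream window operations.

```python
from collections import deque


class SlidingWindow:
    """Groups the most recent micro-batches into one window."""

    def __init__(self, window_length, batch_interval):
        if window_length % batch_interval != 0:
            raise ValueError(
                "window length must be a multiple of the batch interval")
        # Number of batches that fit in one window.
        self.batches = deque(maxlen=window_length // batch_interval)

    def add_batch(self, events):
        # The deque drops the oldest batch once the window is full.
        self.batches.append(list(events))
        return [event for batch in self.batches for event in batch]


# A 60-second window over 20-second batches holds three batches at a time.
window = SlidingWindow(window_length=60, batch_interval=20)
w1 = window.add_batch(["a", "b"])
w2 = window.add_batch(["c"])
w3 = window.add_batch([])
w4 = window.add_batch(["d", "e"])  # the batch with "a" and "b" has expired
```

Each call returns all events currently inside the window, so an aggregate such as a count over the last 60 seconds can be recomputed every time a new batch arrives.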


Spark Streaming Strengths

•  Works much like regular Spark processing: use a StreamingContext instead of a SparkContext

•  Full integration with other Spark libraries (Spark SQL, Spark MLlib, …)

•  Ease of development

•  Scalable, fault-tolerant, …


Spark Streaming Example


Getting to Your Data


Data output bottlenecks

•  Pig & Hive are quite slow

•  No visual feedback from results

•  Specific calculations (cubing) of metrics
– Reporting tools cannot handle the dimensions of the data


Elasticsearch

•  Document store (ideal for denormalized data)

•  Distributed

•  Highly Available

•  Open Source

•  Real Time (Inserts & Searches)


ES-Hadoop


Hive Integration

•  Writing to Elasticsearch from Hive

CREATE EXTERNAL TABLE artists (
    id      BIGINT,
    name    STRING,
    links   STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists');

-- insert data to Elasticsearch from another table called 'source'
INSERT OVERWRITE TABLE artists
    SELECT NULL, s.name, named_struct('url', s.url, 'picture', s.picture)
    FROM source s;


Hive Integration

•  Reading from Elasticsearch in Hive

CREATE EXTERNAL TABLE artists (
    id      BIGINT,
    name    STRING,
    links   STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.query' = '?q=me*');

-- stream data from Elasticsearch
SELECT * FROM artists;


Pig Integration

•  Writing to Elasticsearch from Pig

-- load data from HDFS into Pig using a schema
A = LOAD 'src/test/resources/artists.dat' USING PigStorage()
        AS (id:long, name, url:chararray, picture: chararray);

-- transform data
B = FOREACH A GENERATE name, TOTUPLE(url, picture) AS links;

-- save the result to Elasticsearch
STORE B INTO 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage();


Pig Integration

•  Reading from Elasticsearch in Pig

-- execute Elasticsearch query and load data into Pig
A = LOAD 'radio/artists'
        USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?q=me*');
DUMP A;


Spark Integration

•  Writing to Elasticsearch from Spark

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._

val conf = ...
val sc = new SparkContext(conf)

// create the RDD here
rdd.saveToEs("spark/docs")


Spark Integration

•  Reading from Elasticsearch in Spark

...
import org.elasticsearch.spark._
...
val conf = ...
val sc = new SparkContext(conf)

sc.esRDD("radio/artists", "?q=me*")


Storm Integration

•  Writing to Elasticsearch from Storm

import org.elasticsearch.storm.EsBolt;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 10);
builder.setBolt("es-bolt", new EsBolt("storm/docs"), 5)
       .shuffleGrouping("spout");


Storm Integration

•  Reading from Elasticsearch in Storm

import org.elasticsearch.storm.EsSpout;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("es-spout", new EsSpout("storm/docs", "?q=me*"), 5);
builder.setBolt("bolt", new PrinterBolt()).shuffleGrouping("es-spout");


Visualizing data


Kibana

•  Visualization tool on top of Elasticsearch

•  Allows ad-hoc querying & graphing

•  Support for real time updates

•  Create your own dashboards


Demo


Wrap Up


