real time big data
TRANSCRIPT
![Page 1: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/1.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Data Science Company
Real Time Big Data
InfoFarm Seminar18/11/2015
![Page 4: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/4.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Agenda
• Typical Big Data Landscape• The need for Real Time Big Data• Real Time Data Ingestion• Tools for Real Time Big Data– Apache Spark– Apache Storm– Search
• Q&A• Lunch
![Page 7: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/7.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
A Typical Big Data Landscape
• Data Silo
• Batch environment
• Periodical Analytics/statistics
• Data Source for new systems
![Page 8: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/8.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
The need for Real Time Big Data• Obtaining analytical results faster– Processing faster than once a day
• Load evens out over day
• Past/Present/Future– Alert for certain events– Updating Prediction models on-the-fly
• Allow faster feedback to end users– See results of your actions right away
![Page 9: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/9.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Perfect fits for Real Time Processing• Anomaly Detection
– Abnormal readings of sensors– Abnormal amounts of log files– Fraud detection
• Real Time updates to Recommender models– Fast new recommendations in e-commerce– Support for trending items– Fast responses to events happening right now
• Real Time updates of clustering models
• Improving Classification based un current events
• Can be run side-by-side with traditional historical models
![Page 13: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/13.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
![Page 14: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/14.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Apache Kafka
• Fast
• Scalable
• Durable
• Distributed
![Page 15: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/15.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Apache Kafka - Overview
• Producers write messages to Kafka topics
• Consumers process messages from a topic
• Kafka runs on a cluster of server where each server is called a broker
![Page 16: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/16.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Apache Kafka - Topics• Topics are split up in
different partitions• Partitions are
replicated across the cluster
• Order of messages is guaranteed
• Messages are stored for a period of time
• Producers decide which partition they write to
• Consumers keep the offset of which messages they have read
![Page 19: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/19.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
The Hadoop Ecosystem
HDFS Distributed File System Amazon S3 Local FS
YARN Resource Management
MapReduce
HBase NoSQL
Hive Data Mart
Pig ScripCng
Sqoop SQL
Import Export
Mahout Machine Learning
…
![Page 20: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/20.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
The Hadoop Ecosystem
HDFS Distributed File System Amazon S3 Local FS
YARN Resource Management
MapReduce
HBase NoSQL
Hive Data Mart
Pig ScripCng
Sqoop SQL
Import Export
Mahout Machine Learning
…
Spark Storm …
Spark SQL
Spark MLlib
![Page 21: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/21.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
![Page 24: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/24.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Spouts
• Source of streams into the topology• Can be reliable or unreliable
• Support for:– Kafka– Kestrel– RabbitMQ– JMS– Amazon Kinesis– Build your own (e.g. twitter)
![Page 26: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/26.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Bolts
• Where all the processing happens
• Filtering, functions, aggregations, joins, database updates, …
• You subscribe to streams of a different component (other bolts/spouts)
• Must ack every tuple they process
![Page 27: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/27.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Parallelism
• Spouts & Bolts actually run as multiple instances on different machines
• Making sure that the correct messages goes to the correct instance is up to the developer
![Page 29: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/29.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Stream Groupings
• Defines how a stream should be partitioned among the bolt's tasks
• Some examples:– Round Robin
– Based on key– All– Specific instance
– …
![Page 30: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/30.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Storm Ups and Downs
• Really real time• Very Powerful• Built for performance
• Very low level (comparable to MapReduce)
• Trivial tasks can become hard (sorting, joins, …)
![Page 34: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/34.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Spark Streaming Input
• Kafka• Flume• Kinesis• Twitter• ZeroMQ• HDFS• TCP Sockets
![Page 35: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/35.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Windowing
• You can group multiple batches together into a sliding window.
• E.g. all the events from the last 60 seconds
![Page 36: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/36.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Spark Streaming Strengths
• Works just like regular Spark processing, just replace SparkContext with StreamingContext
• Full integration with other Spark libraries (Spark SQL, Spark Mllib, …)
• Ease of development
• Scalable, fault-tolerant, …
![Page 37: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/37.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.beData Science Company
Spark Streaming Example
![Page 41: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/41.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Data output bottlenecks
• Pig & Hive are quite slow
• No visual feedback from results
• Specific calculations (cubing) of metrics – Reporting tools cannot handle the
dimensions of the data
![Page 42: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/42.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
![Page 43: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/43.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Elasticsearch
• Document store (ideal for denormalized data)
• Distributed• Highly Available
• Open Source
• Real Time (Inserts & Searches)
![Page 45: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/45.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Hive Integration
• Writing to Elasticsearch from Hive
CREATE EXTERNAL TABLE artists ( id BIGINT, name STRING, links STRUCT<url:STRING, picture:STRING>) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource' = 'radio/artists'); -‐-‐ insert data to Elasticsearch from another table called 'source' INSERT OVERWRITE TABLE artists SELECT NULL, s.name, named_struct('url', s.url, 'picture', s.picture) FROM source s;
![Page 46: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/46.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Hive Integration
• Reading from Elasticsearch in Hive
CREATE EXTERNAL TABLE artists ( id BIGINT, name STRING, links STRUCT<url:STRING, picture:STRING>) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource' = 'radio/artists', 'es.query' = '?q=me*'); -‐-‐ stream data from Elasticsearch SELECT * FROM artists;
![Page 47: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/47.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Pig Integration
• Writing to Elasticsearch from Pig
-‐-‐ load data from HDFS into Pig using a schema A = LOAD 'src/test/resources/artists.dat' USING PigStorage() AS (id:long, name, url:chararray, picture: chararray); -‐-‐ transform data B = FOREACH A GENERATE name, TOTUPLE(url, picture) AS links; -‐-‐ save the result to Elasticsearch STORE B INTO 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage();
![Page 48: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/48.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Pig Integration
• Reading from Elasticsearch in Pig
-‐-‐ execute Elasticsearch query and load data into Pig A = LOAD 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?me*'); DUMP A;
![Page 49: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/49.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Spark Integration
• Writing to Elasticsearch from Spark
import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.elasticsearch.spark._ val conf = ... val sc = new SparkContext(conf) -‐-‐ Create RDD here rdd.saveToEs("spark/docs")
![Page 50: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/50.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Spark Integration
• Reading from Elasticsearch in Spark
... import org.elasticsearch.spark._ ... val conf = ... val sc = new SparkContext(conf) sc.esRDD("radio/artists", "?q=me*")
![Page 51: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/51.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Storm Integration
• Writing to Elasticsearch from Storm
import org.elasticsearch.storm.EsBolt; TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("spout", new RandomSentenceSpout(), 10); builder.setBolt("es-‐bolt", new EsBolt("storm/docs"), 5) .shuffleGrouping("spout");
![Page 52: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/52.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Storm Integration
• Reading from Elasticsearch in Storm
import org.elasticsearch.storm.EsSpout; TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("es-‐spout", new EsSpout("storm/docs", "?q=me*), 5); builder.setBolt("bolt", new PrinterBolt()).shuffleGrouping("es-‐spout");
![Page 54: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/54.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Kibana
• Visualization tool on top of Elasticsearch
• Allows ad-hoc querying & graphing
• Support for real time updates
• Create your own dashboards
![Page 57: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/57.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
![Page 58: Real Time Big Data](https://reader031.vdocuments.us/reader031/viewer/2022030308/58ec9f221a28ab666b8b459d/html5/thumbnails/58.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Data Science Company
Real Time Big Data
InfoFarm Seminar18/11/2015