real time data processing using spark streaming · 2019-09-06 · kafka connectors •reliable...
TRANSCRIPT
1 © Cloudera, Inc. All rights reserved.
Hari Shreedharan, Software Engineer @ Cloudera
Committer/PMC Member, Apache Flume
Committer, Apache Sqoop
Contributor, Apache Spark
Author, Using Flume (O’Reilly)
Real Time Data Processing using Spark Streaming
2 © Cloudera, Inc. All rights reserved.
Motivation for Real-Time Stream Processing
Data is being created at unprecedented rates
• Exponential data growth from mobile, web, social • Connected devices: 9B in 2012 to 50B by 2020 • Over 1 trillion sensors by 2020 • Datacenter IP traffic growing at CAGR of 25%
How can we harness it data in real-time? • Value can quickly degrade → capture value immediately • From reactive analysis to direct operational impact • Unlocks new competitive advantages • Requires a completely new approach...
3 © Cloudera, Inc. All rights reserved.
Use Cases Across Industries
Credit Identify fraudulent transactions as soon as they occur.
Transportation Dynamic Re-routing Of traffic or Vehicle Fleet.
Retail • Dynamic Inventory Management • Real-time In-store Offers and recommendations
Consumer Internet & Mobile Optimize user engagement based on user’s current behavior.
Healthcare Continuously monitor patient vital stats and proactively identify at-risk patients.
Manufacturing • Identify equipment failures and react instantly • Perform Proactive maintenance.
Surveillance Identify threats and intrusions In real-time
Digital Advertising & Marketing Optimize and personalize content based on real-time information.
4 © Cloudera, Inc. All rights reserved.
From Volume and Variety to Velocity
Present
Batch + Stream Processing Time to Insight of Seconds
Big-Data = Volume + Variety
Big-Data = Volume + Variety + Velocity
Past Present
Hadoop Ecosystem evolves as well…
Past
Big Data has evolved
Batch Processing Time to insight of Hours
5 © Cloudera, Inc. All rights reserved.
Key Components of Streaming Architectures
Data Ingestion & Transportation Service
Real-Time Stream Processing Engine
Kafka Flume
System Management
Security
Data Management & Integration
Real-Time Data Serving
6 © Cloudera, Inc. All rights reserved.
Canonical Stream Processing Architecture
Kafka
Data Ingest
App 1
App 2
.
.
.
Kafka Flume
HDFS HBase
Data Sources
7 © Cloudera, Inc. All rights reserved.
Spark: Easy and Fast Big Data
•Easy to Develop •Rich APIs in Java, Scala, Python
• Interactive shell
•Fast to Run •General execution graphs
• In-memory storage
2-5× less code Up to 10× faster on disk,
100× in memory
8 © Cloudera, Inc. All rights reserved.
Spark Architecture
Driver
Worker
Worker
Worker
Data
RAM
Data
RAM
Data
RAM
9 © Cloudera, Inc. All rights reserved.
RDDs
RDD = Resilient Distributed Datasets • Immutable representation of data • Operations on one RDD creates a new one • Memory caching layer that stores data in a distributed, fault-tolerant cache • Created by parallel transformations on data in stable storage • Lazy materialization Two observations: a. Can fall back to disk when data-set does not fit in memory b. Provides fault-tolerance through concept of lineage
10 © Cloudera, Inc. All rights reserved.
Spark Streaming
Extension of Apache Spark’s Core API, for Stream Processing.
The Framework Provides
Fault Tolerance
Scalability
High-Throughput
11 © Cloudera, Inc. All rights reserved.
Spark Streaming
• Incoming data represented as Discretized Streams (DStreams)
• Stream is broken down into micro-batches
• Each micro-batch is an RDD – can share code between batch and streaming
12 © Cloudera, Inc. All rights reserved.
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
flatMap flatMap flatMap
save save save
batch @ t+1 batch @ t batch @ t+2 tweets DStream
hashTags DStream
Stream composed of small (1-10s) batch
computations
“Micro-batch” Architecture
13 © Cloudera, Inc. All rights reserved.
Use DStreams for Windowing Functions
14 © Cloudera, Inc. All rights reserved.
Spark Streaming
• Runs as a Spark job
• YARN or standalone for scheduling
• YARN has KDC integration
• Use the same code for real-time Spark Streaming and for batch Spark jobs.
• Integrates natively with messaging systems such as Flume, Kafka, Zero MQ….
• Easy to write “Receivers” for custom messaging systems.
15 © Cloudera, Inc. All rights reserved.
Sharing Code between Batch and Streaming
def filterErrors (rdd: RDD[String]): RDD[String] = {
rdd.filter(s => s.contains(“ERROR”))
}
Library that filters “ERRORS”
• Streaming generates RDDs periodically
• Any code that operates on RDDs can therefore be used in streaming as well
16 © Cloudera, Inc. All rights reserved.
Sharing Code between Batch and Streaming
val lines = sc.textFile(…)
val filtered = filterErrors(lines)
filtered.saveAsTextFile(...)
Spark:
val dStream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
val filtered = dStream.foreachRDD((rdd: RDD[String], time: Time) => {
filterErrors(rdd)
}))
filtered.saveAsTextFiles(…)
Spark Streaming:
17 © Cloudera, Inc. All rights reserved.
Reliability
• Received data automatically persisted to HDFS Write Ahead Log to prevent data loss
• set spark.streaming.receiver.writeAheadLog.enable=true in spark conf
• When AM dies, the application is restarted by YARN
• Received, ack-ed and unprocessed data replayed from WAL (data that made it into blocks)
• Reliable Receivers can replay data from the original source, if required
• Un-acked data replayed from source.
• Kafka, Flume receivers bundled with Spark are examples
• Reliable Receivers + WAL = No data loss on driver or receiver failure!
18 © Cloudera, Inc. All rights reserved.
Kafka Connectors
• Reliable Kafka DStream
• Stores received data to Write Ahead Log on HDFS for replay
• No data loss
• Stable and supported!
• Direct Kafka DStream
• Uses low level API to pull data from Kafka
• Replays from Kafka on driver failure
• No data loss
• Experimental
19 © Cloudera, Inc. All rights reserved.
Flume Connector
• Flume Polling DStream
• Use Spark sink from Maven to Flume’s plugin directory
• Flume Polling Receiver polls the sink to receive data
• Replays received data from WAL on HDFS
• No data loss
• Stable and Supported!
20 © Cloudera, Inc. All rights reserved.
Spark Streaming Use-Cases
• Real-time dashboards
• Show approximate results in real-time
• Reconcile periodically with source-of-truth using Spark
• Joins of multiple streams
• Time-based or count-based “windows”
• Combine multiple sources of input to produce composite data
• Re-use RDDs created by Streaming in other Spark jobs.
21 © Cloudera, Inc. All rights reserved.
What is coming?
• Run on Secure YARN for more than 7 days!
• Better Monitoring and alerting
• Batch-level and task-level monitoring
• SQL on Streaming
• Run SQL-like queries on top of Streaming (medium – long term)
• Python!
• Limited support coming in Spark 1.3
22 © Cloudera, Inc. All rights reserved.
Current Spark project status
• 400+ contributors and 50+ companies contributing
• Includes: Databricks, Cloudera, Intel, Yahoo! etc
• Dozens of production deployments
• Spark Streaming Survived Netflix Chaos Monkey – production ready!
• Included in CDH!
23 © Cloudera, Inc. All rights reserved.
More Info..
• CDH Docs: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_spark_installation.html
• Cloudera Blog: http://blog.cloudera.com/blog/category/spark/
• Apache Spark homepage: http://spark.apache.org/
• Github: https://github.com/apache/spark
24 © Cloudera, Inc. All rights reserved.
Thank you [email protected] 15% Discount Code for Cloudera Training PNWCUG_15 university.cloudera.com