![Page 1: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/1.jpg)
Spark Streaming State of the Union and Beyond
Tathagata “TD” Das @tathadas
Feb 19, 2015
![Page 2: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/2.jpg)
Who am I?
Project Management Committee (PMC) member of Spark
Lead developer of Spark Streaming
Formerly in AMPLab, UC Berkeley
Software developer at Databricks
![Page 3: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/3.jpg)
What is Databricks?
Founded by the creators of Spark in 2013
Largest organization contributing to Spark
End-to-end hosted service, Databricks Cloud
![Page 4: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/4.jpg)
What is Spark Streaming?
![Page 5: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/5.jpg)
Spark Streaming
Scalable, fault-tolerant stream processing system
[Diagram: Kafka, Flume, Kinesis, HDFS/S3 → Spark Streaming → file systems, databases, dashboards]
High-level API: joins, windows, … often 5x less code
Fault-tolerant: exactly-once semantics, even for stateful ops
Integration: integrate with MLlib, SQL, DataFrames, GraphX
![Page 6: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/6.jpg)
What can you use it for?
Real-time fraud detection in transactions
React to anomalies in sensors in real-time
Cat videos in tweets as soon as they go viral
![Page 7: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/7.jpg)
How does it work?
Data streams are chopped up into batches
Each batch is processed in Spark
Results are pushed out in batches
[Diagram: data streams → receivers → Spark Streaming → batches → results]
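The batching idea on this slide can be sketched in plain Python, without Spark. This is only an illustration under assumptions: the `chop_into_batches` and `count_words` helpers, the sample records, and the one-second interval are hypothetical and are not part of the Spark Streaming API.

```python
from collections import Counter, defaultdict


def chop_into_batches(records, batch_interval):
    """Group (timestamp, line) records into batches, one per time interval."""
    batches = defaultdict(list)
    for timestamp, line in records:
        batches[int(timestamp // batch_interval)].append(line)
    # Emit the batches in time order.
    return [batches[index] for index in sorted(batches)]


def count_words(batch):
    """Process one batch: split each line into words and count them."""
    return Counter(word for line in batch for word in line.split(" "))


# Hypothetical input: (arrival time in seconds, text line).
records = [(0.2, "a b"), (0.9, "b"), (1.1, "c a"), (2.5, "a")]

# Each batch is processed on its own and its result is pushed out.
for batch in chop_into_batches(records, batch_interval=1.0):
    print(dict(count_words(batch)))
```

Unlike a real streaming engine, this sketch skips intervals in which no record arrived, and it sees the whole input up front instead of receiving it over time.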
![Page 8: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/8.jpg)
Streaming Word Count
// create DStream from data over socket
val lines = ssc.socketTextStream("localhost", 9999)
// split lines into words
val words = lines.flatMap(_.split(" "))
// count the words
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
// print some counts on screen
wordCounts.print()
// start processing the stream
ssc.start()
![Page 9: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/9.jpg)
Word Count
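import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // implicit pair-DStream operations such as reduceByKey (required before Spark 1.3)

object NetworkWordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    // One-second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}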
![Page 10: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/10.jpg)
Word Count
Spark Streaming
object NetworkWordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Storm
public class WordCountTopology {
  public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
      super("python", "splitsentence.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
      return null;
    }
  }

  public static class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String word = tuple.getString(0);
      Integer count = counts.get(word);
      if (count == null)
        count = 0;
      count++;
      counts.put(word, count);
      collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word", "count"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new RandomSentenceSpout(), 5);
    builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
    builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

    Config conf = new Config();
    conf.setDebug(true);

    if (args != null && args.length > 0) {
      conf.setNumWorkers(3);
      StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.createTopology());
    } else {
      conf.setMaxTaskParallelism(3);
      LocalCluster cluster = new LocalCluster();
      cluster.submitTopology("word-count", conf, builder.createTopology());
      Thread.sleep(10000);
      cluster.shutdown();
    }
  }
}
![Page 11: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/11.jpg)
Languages
Can natively use Scala, Java, or Python
Can use any other language by using pipe()
![Page 12: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/12.jpg)
Integrates with Spark Ecosystem
Spark Core
Spark Streaming
Spark SQL MLlib GraphX
![Page 13: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/13.jpg)
Combine batch and streaming processing
Join data streams with static data sets
// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")
// Join each batch in stream with the dataset
kafkaStream.transform { batchRDD =>
  batchRDD.join(dataset).filter( ... )
}
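The per-batch join on this slide can be pictured with a small Spark-free sketch. Everything named here is an assumption made for illustration: the `join_with_dataset` helper, the `user_country` lookup, and the sample events are hypothetical, and a dict stands in for a static data set with unique keys.

```python
def join_with_dataset(batch, dataset):
    """Inner-join one batch of (key, event) pairs with a static {key: value} dataset."""
    return [(key, (event, dataset[key])) for key, event in batch if key in dataset]


# Hypothetical static data set, loaded once before the stream starts.
user_country = {"u1": "US", "u2": "IN"}

# Hypothetical stream, already chopped into batches of (user_id, action) pairs.
stream_batches = [
    [("u1", "click"), ("u3", "view")],
    [("u2", "click"), ("u1", "view")],
]

for batch in stream_batches:
    joined = join_with_dataset(batch, user_country)
    # Stand-in for the elided filter on the slide: keep only click events.
    clicks = [row for row in joined if row[1][0] == "click"]
    print(clicks)
```

The same static data set is reused for every batch, which is the point of the slide: batch data and streaming data share one programming model.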
![Page 14: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/14.jpg)
Combine machine learning with streaming
Learn models offline, apply them online
// Learn model offline
val model = KMeans.train(dataset, ...)
// Apply model online on stream
kafkaStream.map { event =>
  model.predict(event.feature)
}
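To make "apply them online" concrete, here is a minimal Spark-free sketch of scoring events against cluster centers that were learned earlier. The `predict` helper, the centers, and the events are hypothetical; the centers merely stand in for a model trained offline.

```python
import math


def predict(centers, point):
    """Return the index of the cluster center closest to the point."""
    distances = [math.dist(center, point) for center in centers]
    return distances.index(min(distances))


# Hypothetical centers, standing in for a model learned offline.
centers = [(0.0, 0.0), (10.0, 10.0)]

# Hypothetical stream of feature vectors, scored one event at a time.
events = [(1.0, 2.0), (9.0, 8.5), (0.5, 0.1)]
for features in events:
    print(features, "->", predict(centers, features))
```

The model is read-only while the stream is scored, so it can be shared freely across all events.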
![Page 15: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/15.jpg)
Combine SQL with streaming
Interactively query streaming data with SQL
// Register each batch in stream as table
kafkaStream.foreachRDD { batchRDD =>
  batchRDD.registerTempTable("latestEvents")
}
// Interactively query table
sqlContext.sql("select * from latestEvents")
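The pattern of re-registering the newest batch under a fixed table name can be tried out without Spark. The sketch below swaps in SQLite from the Python standard library for Spark SQL, so it only illustrates the idea; the `register_batch` helper, the column names, and the sample batches are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE latestEvents (user_id TEXT, action TEXT)")


def register_batch(conn, batch):
    """Replace the table contents with the newest batch."""
    with conn:  # commit both statements together
        conn.execute("DELETE FROM latestEvents")
        conn.executemany("INSERT INTO latestEvents VALUES (?, ?)", batch)


# Hypothetical stream, already chopped into batches of (user_id, action) rows.
batches = [
    [("u1", "click")],
    [("u2", "view"), ("u3", "click")],
]

for batch in batches:
    register_batch(conn, batch)
    # Any query issued now sees only the latest batch.
    rows = conn.execute("SELECT * FROM latestEvents ORDER BY user_id").fetchall()
    print(rows)
```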
![Page 16: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/16.jpg)
History
Late 2011 – research idea, AMPLab, UC Berkeley
“We need to make Spark faster”
“Okay...umm, how??!?!”
![Page 17: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/17.jpg)
History
Late 2011 – idea: AMPLab, UC Berkeley
Q2 2012 – prototype: rewrote large parts of Spark core; smallest job went from 900 ms to <50 ms
Q3 2012 – Spark core improvements open sourced in Spark 0.6
Feb 2013 – alpha release: 7.7k lines, merged in 7 days; released with Spark 0.7
![Page 18: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/18.jpg)
History
Jan 2014 – stable release: graduation with Spark 0.9
![Page 19: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/19.jpg)
Current state of Spark Streaming
![Page 20: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/20.jpg)
Adoption
Roadmap
Development
![Page 21: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/21.jpg)
What have we added in the last year?
![Page 22: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/22.jpg)
Python API
Core functionality in Spark 1.2, with sockets and files as sources
Kafka support coming in Spark 1.3
Other sources coming in future
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()
![Page 23: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/23.jpg)
Streaming MLlib algorithms
Continuous learning and prediction on streaming data
StreamingLinearRegression in Spark 1.1
StreamingKMeans in Spark 1.2
https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
val model = new StreamingKMeans()
  .setK(args(3).toInt)
  .setDecayFactor(1.0)
  .setRandomCenters(args(4).toInt, 0.0)
// Apply model to DStreams
model.trainOn(trainingDStream)
model.predictOnValues(testDStream.map { lp => (lp.label, lp.features) }).print()
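The role of the decay factor can be illustrated with a short sketch of a per-batch center update. It follows the update rule as described in the MLlib clustering guide (old center weighted by its discounted point count, batch mean weighted by the batch size); it is an assumption-laden illustration, not Spark's implementation, and `update_center` and its arguments are hypothetical names.

```python
def update_center(center, weight, batch_points, decay_factor):
    """Blend an existing center with the mean of the points in a new batch.

    center: current center; weight: number of points it already represents;
    batch_points: points assigned to this cluster in the current batch.
    """
    m = len(batch_points)
    if m == 0:
        return center, weight  # nothing was assigned in this batch
    dims = len(center)
    batch_mean = [sum(p[d] for p in batch_points) / m for d in range(dims)]
    discounted = weight * decay_factor  # how much the past still counts
    new_center = tuple(
        (center[d] * discounted + batch_mean[d] * m) / (discounted + m)
        for d in range(dims)
    )
    return new_center, weight + m


old_center, old_weight = (0.0, 0.0), 4
batch = [(2.0, 2.0), (4.0, 4.0)]

# decay_factor=1.0 weights all past data equally with the new batch.
print(update_center(old_center, old_weight, batch, decay_factor=1.0))
# decay_factor=0.0 forgets the past and uses only the new batch.
print(update_center(old_center, old_weight, batch, decay_factor=0.0))
```

With a decay factor of 1.0, as on the slide, the center is a running average over everything seen so far; smaller values let the model track data that drifts over time.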
![Page 24: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/24.jpg)
Other library additions
Amazon Kinesis integration [Spark 1.1]
More fault-tolerant Flume integration [Spark 1.1]
New Kafka API for more native integration [Spark 1.3]
![Page 25: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/25.jpg)
System Infrastructure
Automated driver fault-tolerance [Spark 1.0]
Graceful shutdown [Spark 1.0]
Write Ahead Logs for zero data loss [Spark 1.2]
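The general write-ahead-log idea behind "zero data loss" can be sketched in a few lines: a received record is made durable before it is acknowledged, so it can be replayed after a failure. This is a generic illustration and not how Spark Streaming implements its logs; the `WriteAheadLog` class, the file name, and the records are hypothetical.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Append-only log: a record is on disk before it is acknowledged."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())  # force the bytes to disk before returning
        return True  # only now is it safe to acknowledge the source

    def replay(self):
        """Return every logged record, for example after a restart."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as log:
            return [json.loads(line) for line in log]


log_path = os.path.join(tempfile.mkdtemp(), "received.log")
wal = WriteAheadLog(log_path)
for record in ["event-1", "event-2"]:
    wal.append(record)

# Simulate a failure: discard the in-memory object and recover from the file.
recovered = WriteAheadLog(log_path).replay()
print(recovered)
```

The cost of this durability is an extra synchronous write per record, which is why such logging is usually something an operator chooses to enable.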
![Page 26: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/26.jpg)
Contributors to Streaming
[Bar chart: number of contributors to Spark Streaming per release (Spark 0.9, Spark 1.0, Spark 1.1, Spark 1.2); y-axis from 0 to 40]
![Page 27: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/27.jpg)
Contributors - Full Picture
[Bar chart: number of contributors per release (Spark 0.9, Spark 1.0, Spark 1.1, Spark 1.2) for two series, Streaming and Core + Streaming (w/o SQL, MLlib, …); y-axis from 0 to 120]
All contributions to core Spark directly improve Spark Streaming
![Page 28: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/28.jpg)
Spark Packages
More contributions from the community in spark-packages
Alternate Kafka receiver
Apache Camel receiver
Cassandra examples
http://spark-packages.org/
![Page 29: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/29.jpg)
Who is using Spark Streaming?
![Page 30: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/30.jpg)
Spark Summit 2014 Survey
40% of Spark users were using Spark Streaming in production or prototyping
Another 39% were evaluating it
Not using 21%
Evaluating 39%
Prototyping 31%
Production 9%
![Page 31: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/31.jpg)
![Page 32: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/32.jpg)
80+ known deployments
![Page 33: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/33.jpg)
Intel China builds big data solutions for large enterprises
Multiple streaming applications for different businesses
- Real-time risk analysis for a top online payment company
- Real-time deal and flow metric reporting for a top online shopping company
![Page 34: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/34.jpg)
Complicated stream processing
- SQL queries on streams
- Join streams with large historical datasets
> 1TB/day passing through Spark Streaming
[Diagram: YARN, Spark Streaming, Kafka, RocketMQ, HBase]
![Page 35: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/35.jpg)
One of the largest publishing and education companies, which wants to accelerate its push into digital learning
Needed to combine student activities and domain events to continuously update the learning model of each student
Earlier implementation was in Storm; it has now moved to Spark Streaming
![Page 36: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/36.jpg)
Chose Spark Streaming because Spark combines batch, streaming, machine learning, and graph processing
[Diagram: YARN, Spark Streaming, Kafka, Cassandra, Apache Blur]
More information: http://dbricks.co/1BnFZZ8
![Page 37: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/37.jpg)
Leading advertising automation company with an exchange platform for in-feed ads
Processes clickstream data to optimize real-time bidding for ads
[Diagram: Mesos + Marathon, Spark Streaming, Kinesis, MySQL, Redis, RabbitMQ, SQS]
![Page 38: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/38.jpg)
http://techblog.netflix.com/2015/02/whats-trending-on-netflix.html
http://goo.gl/mJNf8X
![Page 39: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/39.jpg)
Neuroscience @ Freeman Lab, Janelia Farm
Spark Streaming and MLlib to analyze neural activities
Laser microscope scans Zebrafish brain → Spark Streaming → interactive visualization → laser ZAP to kill neurons!
http://www.jeremyfreeman.net/share/talks/spark-summit-2014/
![Page 40: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/40.jpg)
Neuroscience @ Freeman Lab, Janelia Farm
Streaming machine learning algorithms on time series data of every neuron
2TB/hour and increasing with brain size
80 HPC nodes
![Page 41: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/41.jpg)
Why are they adopting Spark Streaming?
Easy, high-level API
Unified API across batch and streaming
Integration with Spark SQL and MLlib
Ease of operations
![Page 42: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/42.jpg)
What’s coming next?
![Page 43: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/43.jpg)
Beyond Spark 1.3
Libraries
Streaming machine learning algorithms
- A/B testing
- Online Latent Dirichlet Allocation (LDA)
- More streaming linear algorithms
Streaming + SQL, Streaming + DataFrames
![Page 44: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/44.jpg)
Beyond Spark 1.3
Operational Ease
- Better flow control
- Elastic scaling
- Cross-version upgradability
- Improved support for non-Hadoop environments
![Page 45: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/45.jpg)
Beyond Spark 1.3
Performance
- Higher throughput, especially of stateful operations
- Lower latencies
Easy deployment of streaming apps in Databricks Cloud!
![Page 46: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/46.jpg)
You can help!
Roadmaps are heavily driven by community feedback
We have listened to community demands over the last year
- Write Ahead Logs for zero data loss
- New Kafka integration for stronger semantics
Let us know what you want to see in Spark Streaming
Spark user mailing list, or tweet it to me @tathadas
![Page 47: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/47.jpg)
Takeaways
Spark Streaming is a scalable, fault-tolerant stream processing system with a high-level API and a rich set of libraries
80+ known deployments in industry
More libraries and operational ease on the roadmap
![Page 48: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/48.jpg)
Backup slides
![Page 49: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/49.jpg)
Typesafe survey of Spark users
2136 developers, data scientists, and other tech professionals
http://java.dzone.com/articles/apache-spark-survey-typesafe-0
![Page 50: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/50.jpg)
Typesafe survey of Spark users
65% of Spark users are interested in Spark Streaming
![Page 51: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/51.jpg)
Typesafe survey of Spark users
2/3 of Spark users want to process event streams
![Page 52: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/52.jpg)
More use cases
![Page 53: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/53.jpg)
• Big data solution provider for enterprises
• Multiple applications for different businesses
  - Monitoring + optimizing online services of Tier-1 bank
  - Fraudulent transaction detection for Tier-2 bank
• Kafka → Spark Streaming → Cassandra, MongoDB
• Built their own Stratio Streaming platform on Spark Streaming, Kafka, Cassandra, MongoDB
![Page 54: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/54.jpg)
• Provides data analytics solutions for Communication Service Providers
  - 4 of 5 top mobile ops, 3 of 4 top internet backbone providers
  - Processes >50% of all US mobile traffic
• Multiple applications for different businesses
  - Real-time anomaly detection in cell tower traffic
  - Real-time call quality optimizations
• Kafka → Spark Streaming
http://spark-summit.org/2014/talk/building-big-data-operational-intelligence-platform-with-apache-spark
![Page 55: Spark streaming State of the Union - Strata San Jose 2015](https://reader030.vdocuments.us/reader030/viewer/2022032420/55a524da1a28abd90e8b45db/html5/thumbnails/55.jpg)
• Runs claims processing applications for healthcare providers
http://searchbusinessanalytics.techtarget.com/feature/Spark-Streaming-project-looks-to-shed-new-light-on-medical-claims
• Predictive models can look for claims that are likely to be held up for approval
• Spark Streaming allows model scoring in seconds instead of hours