Spark Streaming State of the Union – Strata San Jose 2015


Spark Streaming State of the Union and Beyond
Tathagata "TD" Das (@tathadas)

Feb 19, 2015

Who am I?

Project Management Committee (PMC) member of Spark
Lead developer of Spark Streaming
Formerly in AMPLab, UC Berkeley
Software developer at Databricks

What is Databricks?

Founded by the creators of Spark in 2013
Largest organization contributing to Spark
End-to-end hosted service, Databricks Cloud

What is Spark Streaming?

Spark Streaming

Scalable, fault-tolerant stream processing system

[Diagram: input sources (Kafka, Flume, Kinesis, Twitter, HDFS/S3) → Spark Streaming → file systems, databases, dashboards]

High-level API: joins, windows, … often 5x less code (see the sketch below)

Fault-tolerant: exactly-once semantics, even for stateful ops

Integration: integrates with MLlib, SQL, DataFrames, GraphX
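A rough sketch of what these windowed and stateful ops look like, building on the word count stream defined on the next slides (the checkpoint path is a placeholder):

// Stateful ops require a checkpoint directory (path is a placeholder)
ssc.checkpoint("checkpoints/")

// Word counts over a 30-second window, recomputed every 10 seconds
val windowedCounts = words.map(w => (w, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

// Running count across all batches, kept as per-key state
val runningCounts = words.map(w => (w, 1))
  .updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
    Some(newValues.sum + state.getOrElse(0))
  }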

What can you use it for?


Real-time fraud detection in transactions

React to anomalies in sensors in real-time

Cat videos in tweets as soon as they go viral

How does it work?

Data streams are chopped up into batches
Each batch is processed in Spark

Results pushed out in batches


[Diagram: data streams → receivers → Spark Streaming → batches of results]

Streaming Word Count

// create DStream from data over socket
val lines = ssc.socketTextStream("localhost", 9999)

// split lines into words
val words = lines.flatMap(_.split(" "))

// count the words
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

// print some counts on screen
wordCounts.print()

// start processing the stream
ssc.start()

Word Count


object NetworkWordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

For comparison, the same word count in Storm:

public class WordCountTopology {

  public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
      super("python", "splitsentence.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
      return null;
    }
  }

  public static class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String word = tuple.getString(0);
      Integer count = counts.get(word);
      if (count == null) count = 0;
      count++;
      counts.put(word, count);
      collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word", "count"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new RandomSentenceSpout(), 5);
    builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
    builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

    Config conf = new Config();
    conf.setDebug(true);

    if (args != null && args.length > 0) {
      conf.setNumWorkers(3);
      StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.createTopology());
    } else {
      conf.setMaxTaskParallelism(3);
      LocalCluster cluster = new LocalCluster();
      cluster.submitTopology("word-count", conf, builder.createTopology());
      Thread.sleep(10000);
      cluster.shutdown();
    }
  }
}

Languages

Can natively use Scala, Java, and Python

Can use any other language via pipe()
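As a minimal sketch of pipe() usage (reusing the splitsentence.py script from the Storm example above; any executable that reads stdin and writes stdout works):

// Pipe each batch through an external process: one record per stdin
// line in, one output record per stdout line back
val piped = lines.transform { batchRDD =>
  batchRDD.pipe("./splitsentence.py")
}
piped.print()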


Integrates with Spark Ecosystem


[Diagram: Spark Core, with Spark Streaming, Spark SQL, MLlib, and GraphX on top]

Combine batch and streaming processing

Join data streams with static data sets:

// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")

// Join each batch in stream with the dataset
kafkaStream.transform { batchRDD =>
  batchRDD.join(dataset).filter( ... )
}



Combine machine learning with streaming

Learn models offline, apply them online:

// Learn model offline
val model = KMeans.train(dataset, ...)

// Apply model online on stream
kafkaStream.map { event =>
  model.predict(event.feature)
}



Combine SQL with streaming

Interactively query streaming data with SQL:

// Register each batch in stream as table
kafkaStream.foreachRDD { batchRDD =>
  batchRDD.registerTempTable("latestEvents")
}

// Interactively query table
sqlContext.sql("select * from latestEvents")



History


Late 2011 – research idea at AMPLab, UC Berkeley

"We need to make Spark faster"
"Okay... umm, how??!?!"


Q2 2012 – prototype
Rewrote large parts of Spark core
Smallest job: 900 ms → <50 ms

Q3 2012 – Spark core improvements open sourced in Spark 0.6

Feb 2013 – Alpha release
7.7k lines, merged in 7 days
Released with Spark 0.7


Jan 2014 – Stable release: graduation with Spark 0.9

Current state of Spark Streaming

Adoption


Roadmap

Development


What have we added in the last year?

Python API

Core functionality in Spark 1.2, with sockets and files as sources

Kafka support coming in Spark 1.3

Other sources coming in future


lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

Streaming MLlib algorithms

val model = new StreamingKMeans()
  .setK(args(3).toInt)
  .setDecayFactor(1.0)
  .setRandomCenters(args(4).toInt, 0.0)

// Apply model to DStreams
model.trainOn(trainingDStream)
model.predictOnValues(
  testDStream.map { lp => (lp.label, lp.features) }
).print()


Continuous learning and prediction on streaming data
StreamingLinearRegression in Spark 1.1
StreamingKMeans in Spark 1.2
https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html

Other library additions

Amazon Kinesis integration [Spark 1.1]
More fault-tolerant Flume integration [Spark 1.1]
New Kafka API for more native integration [Spark 1.3]
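As a minimal sketch, the new direct Kafka API in Spark 1.3 looks roughly like this (broker address and topic name are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-less "direct" stream: Spark tracks Kafka offsets itself,
// which is what enables the stronger delivery semantics
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val directStream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))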


System Infrastructure

Automated driver fault-tolerance [Spark 1.0]
Graceful shutdown [Spark 1.0]
Write Ahead Logs for zero data loss [Spark 1.2]
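A sketch of how these features are switched on (app name and paths are placeholders; the Write Ahead Log needs a fault-tolerant directory such as HDFS):

// Enable the Write Ahead Log (Spark 1.2+) so received data survives failures
val conf = new SparkConf()
  .setAppName("MyStreamingApp")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("hdfs:///checkpoints/my-app")

// Graceful shutdown (Spark 1.0+): finish processing buffered data, then stop
ssc.stop(stopSparkContext = true, stopGracefully = true)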


Contributors to Streaming


[Bar chart: contributors to Spark Streaming per release, growing from Spark 0.9 to Spark 1.2 (y-axis 0–40)]

Contributors - Full Picture


[Bar chart: contributors per release from Spark 0.9 to Spark 1.2 (y-axis 0–120); series: Streaming alone vs Core + Streaming (w/o SQL, MLlib, …)]

All contributions to core Spark directly improve Spark Streaming

Spark Packages

More contributions from the community in spark-packages

Alternate Kafka receiver

Apache Camel receiver

Cassandra examples

http://spark-packages.org/


Who is using Spark Streaming?

Spark Summit 2014 Survey


40% of Spark users were using Spark Streaming in production or prototyping; another 39% were evaluating it

[Pie chart: Production 9%, Prototyping 31%, Evaluating 39%, Not using 21%]


80+ known deployments

Intel China builds big data solutions for large enterprises
Multiple streaming applications for different businesses:
-  Real-time risk analysis for a top online payment company
-  Real-time deal and flow metric reporting for a top online shopping company
Complicated stream processing: SQL queries on streams, joins of streams with large historical datasets
> 1 TB/day passing through Spark Streaming

[Architecture: Spark Streaming on YARN, with Kafka, RocketMQ, HBase]

One of the largest publishing and education companies, wanting to accelerate its push into digital learning
Needed to combine student activities and domain events to continuously update each student's learning model
Earlier implementation in Storm, now moved to Spark Streaming

[Architecture: Spark Streaming on YARN, with Kafka and Cassandra]

Chose Spark Streaming because Spark combines batch, streaming, machine learning, and graph processing in one engine

Apache Blur

More information: http://dbricks.co/1BnFZZ8

Leading advertising automation company with an exchange platform for in-feed ads
Processes clickstream data to optimize real-time bidding for ads

[Architecture: Spark Streaming on Mesos + Marathon, with Kinesis, MySQL, Redis, RabbitMQ, SQS]

http://techblog.netflix.com/2015/02/whats-trending-on-netflix.html

http://goo.gl/mJNf8X

Neuroscience @ Freeman Lab, Janelia Farm

Spark Streaming and MLlib to analyze neural activities

Laser microscope scans zebrafish brain → Spark Streaming → interactive visualization → laser ZAP to kill neurons!

http://www.jeremyfreeman.net/share/talks/spark-summit-2014/

Neuroscience @ Freeman Lab, Janelia Farm

Streaming machine learning algorithms on time series data of every neuron
2 TB/hour and increasing with brain size
80 HPC nodes

Why are they adopting Spark Streaming?

Easy, high-level API

Unified API across batch and streaming

Integration with Spark SQL and MLlib

Ease of operations


What’s coming next?

Beyond Spark 1.3

Libraries:
-  Streaming machine learning algorithms: A/B testing, online Latent Dirichlet Allocation (LDA), more streaming linear algorithms
-  Streaming + SQL, Streaming + DataFrames


Beyond Spark 1.3

Operational ease:
-  Better flow control
-  Elastic scaling
-  Cross-version upgradability
-  Improved support for non-Hadoop environments


Beyond Spark 1.3

Performance:
-  Higher throughput, especially for stateful operations
-  Lower latencies

Easy deployment of streaming apps in Databricks Cloud!


You can help!

Roadmaps are heavily driven by community feedback
We have listened to community demands over the last year:

-  Write Ahead Logs for zero data loss
-  New Kafka integration for stronger semantics

Let us know what you want to see in Spark Streaming

Spark user mailing list, tweet it to me @tathadas


Takeaways

Spark Streaming is a scalable, fault-tolerant stream processing system with a high-level API and a rich set of libraries

80+ known deployments in industry

More libraries and operational ease in the roadmap


Backup slides

Typesafe survey of Spark users

2,136 developers, data scientists, and other tech professionals

http://java.dzone.com/articles/apache-spark-survey-typesafe-0


65% of Spark users are interested in Spark Streaming


2/3 of Spark users want to process event streams


More use cases

•  Big data solution provider for enterprises
•  Multiple applications for different businesses
   -  Monitoring + optimizing online services of a Tier-1 bank
   -  Fraudulent transaction detection for a Tier-2 bank
•  Kafka → Spark Streaming → Cassandra, MongoDB
•  Built their own Stratio Streaming platform on Spark Streaming, Kafka, Cassandra, MongoDB

•  Provides data analytics solutions for Communication Service Providers
   -  4 of 5 top mobile operators, 3 of 4 top internet backbone providers
   -  Processes >50% of all US mobile traffic
•  Multiple applications for different businesses
   -  Real-time anomaly detection in cell tower traffic
   -  Real-time call quality optimizations
•  Kafka → Spark Streaming

http://spark-summit.org/2014/talk/building-big-data-operational-intelligence-platform-with-apache-spark

•  Runs claims processing applications for healthcare providers
•  Predictive models can look for claims that are likely to be held up for approval
•  Spark Streaming allows model scoring in seconds instead of hours

http://searchbusinessanalytics.techtarget.com/feature/Spark-Streaming-project-looks-to-shed-new-light-on-medical-claims
