yahoo compares storm and spark
DESCRIPTION
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo will talk about how these technologies are used on Yahoo's grids and reasons why to use one or the other. Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on Yarn for Yahoo (Although Tom really does most of that work). Tom Graves a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on Yarn for Yahoo.TRANSCRIPT
Spark and Storm at YahooW h y c h o o s e o n e o v e r t h e o t h e r ?
P R E S E N T E D B Y B o b b y E v a n s a n d T o m G r a v e s
2
Tom GravesBobby Evans ([email protected])
Committers and PMC/PPMC Members for› Apache Storm incubating (Bobby)› Apache Hadoop (Tom and Bobby)› Apache Spark (Tom and Bobby)› Apache TEZ (Tom and Bobby)
Low Latency Big Data team at Yahoo (Part of the Hadoop Team)› Apache Storm as a service
• 1,300+ nodes total, 250 node cluster (soon to be 4000 nodes).
› Apache Spark on YARN
• 40,000 nodes total, 5000+ node cluster
› Help with distributed ML and deep learning.
Where we come from
Yahoo Champaign:• 100+ engineers• Located in UIUC Research Park http://researchpark.illinois.edu/• Split between Advertising and Data Platform team and Hadoop team.• Hadoop team provides the Hadoop ecosystem as a service to all of Yahoo.• Site is 7 years old, and we are building a new building with room for 200.• We are Hiring
• [email protected]• http://bit.ly/1ybTXMe
4 Yahoo Confidential & Proprietary
Agenda
Spark Overview (1.1)Storm Overview (0.9.2)Things to ConsiderExample Architectures
5
Apache Spark
Spark Key Concepts
Resilient Distributed Datasets Collections of objects spread
across a cluster, stored in RAM or on Disk
Built through parallel transformations
Automatically rebuilt on failure
Operations Transformations
(e.g. map, filter, groupBy)
Actions(e.g. count, collect, save)
Write programs in terms of transformations on distributed
datasets
Working With RDDs
RDDRDD
RDDRDD
Transformations
ActionValu
e
linesWithSpark = textFile.filter(lambda line: "Spark” in line)
linesWithSpark.count()74
linesWithSpark.first()# Apache Spark
textFile = sc.textFile(”SomeFile.txt”)
> lines = sc.textFile(“hamlet.txt”)
> counts = lines.flatMap(lambda line: line.split(“ ”)) .map(lambda word => (word, 1)) .reduceByKey(lambda x, y: x + y)
Example: Word Count
“to be or”
“not to be”
“to”“be”“or”
“not”“to”“be”
(to, 1)(be, 1)(or, 1)
(not, 1)(to, 1)(be, 1)
(be, 2)(not, 1)
(or, 1)(to, 2)
updateFunc = (values: Seq[Int], state: Option[Int]) => {
val currentCount = values.foldLeft(0)(_ + _)
val previousCount = state.getOrElse(0)
Some(currentCount + previousCount)
}
…
lines = ssc.socketTextStream(args(0), args(1).toInt)
Words = lines.flatMap(lambda line: line.split(“ ”))wordDstream = words.map(lambda word => (word, 1))
stateDstream = wordDstream.updateStateByKey[Int](updateFunc)
ssc.start()
ssc.awaitTermination()
Spark Streaming Word Count
10
Apache Storm
Storm Concepts
1. Streams› Unbounded sequence of tuples
2. Spout› Source of Stream› E.g. Read from Twitter streaming API
3. Bolts› Processes input streams and produces
new streams› E.g. Functions, Filters, Aggregation,
Joins
4. Topologies› Network of spouts and bolts
Storm Architecture
Master Node
Cluster Coordination
Worker Processes
Worker
Nimbus
Zookeeper
Zookeeper
Zookeeper
Supervisor
Supervisor
Supervisor
Supervisor Worker
Worker
Worker
Launches Workers
Trident (Storm) Word Count
TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")).parallelismHint(6);
“to be or”“to”“be”“or”
(to, 1)(be, 1)(or, 1)
(be, 1)
(or, 1)(to, 1)
“not to be”“not”“to”“be”
(not, 1)(to, 1)(be, 1)
(be, 2)(not, 1)
(or, 1)(to, 2)
14
Use the Right Tool for the Job
https://www.flickr.com/photos/hikingartist/4193330368/
15
Things to Consider
ScaleLatency Iterative Processing› Are there suitable non-iterative alternatives?
Use What You KnowCode ReuseMaturity
16
When We Recommend Spark
Iterative Batch Processing (most Machine Learning)› There really is nothing else right now.› Has some scale issues.
Tried ETL (Not at Yahoo scale yet) Tried Shark/Interactive Queries (Not at Yahoo scale yet)
< 1 TB (or memory size of your cluster) Tuning it to run well can be a pain Data Bricks and others are working on scaling.
Streaming is all μ-batch so latency is at least 1 sec Streaming has single points of failure still All streaming inputs are replicated in memory
17
When We Recommend Storm
Latency < 1 second (single event at a time)› There is little else (especially not open source)
“Real Time” …› Analytics› Budgeting› ML› Anything
Lower Level API than Spark No built-in concept of look back aggregations Takes more effort to combine batch with streaming
18
Fictitious Example: My Commute App
Mobile App that lets users track their commute. Cities, users, companies, etc. compete daily for
› Shortest commute time› Greenest commute
Make money by selling location based ads and aggregate data to› Governments› Advertisers
Feel free to steal my crazy idea, I just want to be invited to the launch party, and I wouldn't say no to some stock.
19
Chicago vs. Champaign Urbana
Champaign Urbana: 14-15 min
Chicago: 20-30 min
Source: http://project.wnyc.org/commute-times-us/embed.html#5.00/42.000/-89.500
CU Chicago0
5
10
15
20
25
30
35
Bobby
20
Things to Consider
Scale › everyone in the world!!!
Latency › a few seconds max
Iterative Processing › Possibly for targeting, but there are alternatives
21
Architecture
App Web Service
(User, Commute ID, Location History, MPG)
HBase/NoSQL
Kafka Storm
HDFS Spark
Customer
22
Architecture (Alternative)
App Web Service
(User, Commute ID, Location History, MPG)
HBase/NOSQL
HDFS Spark
Customer
Go directly to Spark Streaming, but data loss potential goes up.
23
Architecture (Alternative 2)
App Web Service
(User, Commute ID, Location History, MPG)
HBase/NOSQL
Kafka Storm
Customer
Streaming Operations Only(Kappa Architecture)
24
Fictitious Example 2: Web Scale Monitoring
Look for trends that can indicate a problem.› Alert or provide automated corrections
Provide an interface to visualize› Current data very quickly› Historical data in depth
If you commercialize this one please give me/Yahoo a free license for life (open source works too)
25
Things to Consider
Scale › Lots of events from many different servers
Latency › a few seconds max, but the fewer the better
Iterative Processing › For in depth analysis definetly
26
Fictitious Example 2: Web Scale Monitoring
ServersHBase
Kafka Storm
HDFS Spark
UI
Alert!!
JDBC Server
Rules
ML and trend analysis
Questions?
[email protected] http://bit.ly/1ybTXMe