yahoo compares storm and spark

Spark and Storm at YahooW h y c h o o s e o n e o v e r t h e o t h e r ?

P R E S E N T E D B Y B o b b y E v a n s a n d T o m G r a v e s

2

Tom GravesBobby Evans ([email protected])

Committers and PMC/PPMC Members for› Apache Storm incubating (Bobby)› Apache Hadoop (Tom and Bobby)› Apache Spark (Tom and Bobby)› Apache TEZ (Tom and Bobby)

Low Latency Big Data team at Yahoo (Part of the Hadoop Team)› Apache Storm as a service

• 1,300+ nodes total, 250 node cluster (soon to be 4000 nodes).

› Apache Spark on YARN

• 40,000 nodes total, 5000+ node cluster

› Help with distributed ML and deep learning.

Where we come from

Yahoo Champaign:• 100+ engineers• Located in UIUC Research Park http://researchpark.illinois.edu/• Split between Advertising and Data Platform team and Hadoop team.• Hadoop team provides the Hadoop ecosystem as a service to all of Yahoo.• Site is 7 years old, and we are building a new building with room for 200.• We are Hiring

• [email protected]• http://bit.ly/1ybTXMe

http://researchpark.illinois.edu/



mailto:[email protected]



http://bit.ly/1ybTXMe



4 Yahoo Confidential & Proprietary

Agenda

Spark Overview (1.1)Storm Overview (0.9.2)Things to ConsiderExample Architectures

5

Apache Spark

Spark Key Concepts

Resilient Distributed Datasets Collections of objects spread

across a cluster, stored in RAM or on Disk

Built through parallel transformations

Automatically rebuilt on failure

Operations Transformations

(e.g. map, filter, groupBy)

Actions(e.g. count, collect, save)

Write programs in terms of transformations on distributed

datasets

Working With RDDs

RDDRDD

RDDRDD

Transformations

ActionValu

e

linesWithSpark = textFile.filter(lambda line: "Spark” in line)

linesWithSpark.count()74

linesWithSpark.first()# Apache Spark

textFile = sc.textFile(”SomeFile.txt”)

> lines = sc.textFile(“hamlet.txt”)

> counts = lines.flatMap(lambda line: line.split(“ ”)) .map(lambda word => (word, 1)) .reduceByKey(lambda x, y: x + y)

Example: Word Count

“to be or”

“not to be”

“to”“be”“or”

“not”“to”“be”

(to, 1)(be, 1)(or, 1)

(not, 1)(to, 1)(be, 1)

(be, 2)(not, 1)

(or, 1)(to, 2)

updateFunc = (values: Seq[Int], state: Option[Int]) => {

val currentCount = values.foldLeft(0)(_ + _)

val previousCount = state.getOrElse(0)

Some(currentCount + previousCount)

}

…

lines = ssc.socketTextStream(args(0), args(1).toInt)

Words = lines.flatMap(lambda line: line.split(“ ”))wordDstream = words.map(lambda word => (word, 1))

stateDstream = wordDstream.updateStateByKey[Int](updateFunc)

ssc.start()

ssc.awaitTermination()

Spark Streaming Word Count

10

Apache Storm

Storm Concepts

1. Streams› Unbounded sequence of tuples

2. Spout› Source of Stream› E.g. Read from Twitter streaming API

3. Bolts› Processes input streams and produces

new streams› E.g. Functions, Filters, Aggregation,

Joins

4. Topologies› Network of spouts and bolts

Storm Architecture

Master Node

Cluster Coordination

Worker Processes

Worker

Nimbus

Zookeeper

Zookeeper

Zookeeper

Supervisor

Supervisor

Supervisor

Supervisor Worker

Worker

Worker

Launches Workers

Trident (Storm) Word Count

TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")).parallelismHint(6);

“to be or”“to”“be”“or”

(to, 1)(be, 1)(or, 1)

(be, 1)

(or, 1)(to, 1)

“not to be”“not”“to”“be”

(not, 1)(to, 1)(be, 1)

(be, 2)(not, 1)

(or, 1)(to, 2)

14

Use the Right Tool for the Job

https://www.flickr.com/photos/hikingartist/4193330368/

15

Things to Consider

ScaleLatency Iterative Processing› Are there suitable non-iterative alternatives?

Use What You KnowCode ReuseMaturity

16

When We Recommend Spark

Iterative Batch Processing (most Machine Learning)› There really is nothing else right now.› Has some scale issues.

Tried ETL (Not at Yahoo scale yet) Tried Shark/Interactive Queries (Not at Yahoo scale yet)

< 1 TB (or memory size of your cluster) Tuning it to run well can be a pain Data Bricks and others are working on scaling.

Streaming is all μ-batch so latency is at least 1 sec Streaming has single points of failure still All streaming inputs are replicated in memory

17

When We Recommend Storm

Latency < 1 second (single event at a time)› There is little else (especially not open source)

“Real Time” …› Analytics› Budgeting› ML› Anything

Lower Level API than Spark No built-in concept of look back aggregations Takes more effort to combine batch with streaming

18

Fictitious Example: My Commute App

Mobile App that lets users track their commute. Cities, users, companies, etc. compete daily for

› Shortest commute time› Greenest commute

Make money by selling location based ads and aggregate data to› Governments› Advertisers

Feel free to steal my crazy idea, I just want to be invited to the launch party, and I wouldn't say no to some stock.

19

Chicago vs. Champaign Urbana

Champaign Urbana: 14-15 min

Chicago: 20-30 min

Source: http://project.wnyc.org/commute-times-us/embed.html#5.00/42.000/-89.500

CU Chicago0

5

10

15

20

25

30

35

Bobby

20

Things to Consider

Scale › everyone in the world!!!

Latency › a few seconds max

Iterative Processing › Possibly for targeting, but there are alternatives

21

Architecture

App Web Service

(User, Commute ID, Location History, MPG)

HBase/NoSQL

Kafka Storm

HDFS Spark

Customer

22

Architecture (Alternative)

App Web Service


HBase/NOSQL

HDFS Spark

Customer

Go directly to Spark Streaming, but data loss potential goes up.

23

Architecture (Alternative 2)

App Web Service


HBase/NOSQL

Kafka Storm

Customer

Streaming Operations Only(Kappa Architecture)

24

Fictitious Example 2: Web Scale Monitoring

Look for trends that can indicate a problem.› Alert or provide automated corrections

Provide an interface to visualize› Current data very quickly› Historical data in depth

If you commercialize this one please give me/Yahoo a free license for life (open source works too)

25

Things to Consider

Scale › Lots of events from many different servers

Latency › a few seconds max, but the fewer the better

Iterative Processing › For in depth analysis definetly

26

Fictitious Example 2: Web Scale Monitoring

ServersHBase

Kafka Storm

HDFS Spark

UI

Alert!!

JDBC Server

Rules

ML and trend analysis

Questions?

[email protected]

[email protected] http://bit.ly/1ybTXMe





yahoo compares storm and spark

Technology

lambda word

hadoop team

app webservice

yahoo scale

hadoopyahoo

lambda line

apache spark

bobbyapache