yahoo compares storm and spark

27
Spark and Storm at Yahoo Why choose one over the other? PRESENTED BY Bobby Evans and Tom Graves

Upload: chicago-hadoop-users-group

Post on 03-Nov-2014

20 views

Category:

Technology


4 download

DESCRIPTION

Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo will talk about how these technologies are used on Yahoo's grids and reasons why to use one or the other. Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on Yarn for Yahoo (Although Tom really does most of that work). Tom Graves a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on Yarn for Yahoo.

TRANSCRIPT

Page 1: Yahoo compares Storm and Spark

Spark and Storm at YahooW h y c h o o s e o n e o v e r t h e o t h e r ?

P R E S E N T E D B Y B o b b y E v a n s a n d T o m G r a v e s

Page 2: Yahoo compares Storm and Spark

2

Tom GravesBobby Evans ([email protected])

Committers and PMC/PPMC Members for› Apache Storm incubating (Bobby)› Apache Hadoop (Tom and Bobby)› Apache Spark (Tom and Bobby)› Apache TEZ (Tom and Bobby)

Low Latency Big Data team at Yahoo (Part of the Hadoop Team)› Apache Storm as a service

• 1,300+ nodes total, 250 node cluster (soon to be 4000 nodes).

› Apache Spark on YARN

• 40,000 nodes total, 5000+ node cluster

› Help with distributed ML and deep learning.

Page 3: Yahoo compares Storm and Spark

Where we come from

Yahoo Champaign:• 100+ engineers• Located in UIUC Research Park http://researchpark.illinois.edu/• Split between Advertising and Data Platform team and Hadoop team.• Hadoop team provides the Hadoop ecosystem as a service to all of Yahoo.• Site is 7 years old, and we are building a new building with room for 200.• We are Hiring

[email protected]• http://bit.ly/1ybTXMe

Page 4: Yahoo compares Storm and Spark

4 Yahoo Confidential & Proprietary

Agenda

Spark Overview (1.1)Storm Overview (0.9.2)Things to ConsiderExample Architectures

Page 5: Yahoo compares Storm and Spark

5

Apache Spark

Page 6: Yahoo compares Storm and Spark

Spark Key Concepts

Resilient Distributed Datasets Collections of objects spread

across a cluster, stored in RAM or on Disk

Built through parallel transformations

Automatically rebuilt on failure

Operations Transformations

(e.g. map, filter, groupBy)

Actions(e.g. count, collect, save)

Write programs in terms of transformations on distributed

datasets

Page 7: Yahoo compares Storm and Spark

Working With RDDs

RDDRDD

RDDRDD

Transformations

ActionValu

e

linesWithSpark = textFile.filter(lambda line: "Spark” in line)

linesWithSpark.count()74

linesWithSpark.first()# Apache Spark

textFile = sc.textFile(”SomeFile.txt”)

Page 8: Yahoo compares Storm and Spark

> lines = sc.textFile(“hamlet.txt”)

> counts = lines.flatMap(lambda line: line.split(“ ”)) .map(lambda word => (word, 1)) .reduceByKey(lambda x, y: x + y)

Example: Word Count

“to be or”

“not to be”

“to”“be”“or”

“not”“to”“be”

(to, 1)(be, 1)(or, 1)

(not, 1)(to, 1)(be, 1)

(be, 2)(not, 1)

(or, 1)(to, 2)

Page 9: Yahoo compares Storm and Spark

updateFunc = (values: Seq[Int], state: Option[Int]) => {

val currentCount = values.foldLeft(0)(_ + _)

val previousCount = state.getOrElse(0)

Some(currentCount + previousCount)

}

lines = ssc.socketTextStream(args(0), args(1).toInt)

Words = lines.flatMap(lambda line: line.split(“ ”))wordDstream = words.map(lambda word => (word, 1))

stateDstream = wordDstream.updateStateByKey[Int](updateFunc)

ssc.start()

ssc.awaitTermination()

Spark Streaming Word Count

Page 10: Yahoo compares Storm and Spark

10

Apache Storm

Page 11: Yahoo compares Storm and Spark

Storm Concepts

1. Streams› Unbounded sequence of tuples

2. Spout› Source of Stream› E.g. Read from Twitter streaming API

3. Bolts› Processes input streams and produces

new streams› E.g. Functions, Filters, Aggregation,

Joins

4. Topologies› Network of spouts and bolts

Page 12: Yahoo compares Storm and Spark

Storm Architecture

Master Node

Cluster Coordination

Worker Processes

Worker

Nimbus

Zookeeper

Zookeeper

Zookeeper

Supervisor

Supervisor

Supervisor

Supervisor Worker

Worker

Worker

Launches Workers

Page 13: Yahoo compares Storm and Spark

Trident (Storm) Word Count

TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")).parallelismHint(6);

“to be or”“to”“be”“or”

(to, 1)(be, 1)(or, 1)

(be, 1)

(or, 1)(to, 1)

“not to be”“not”“to”“be”

(not, 1)(to, 1)(be, 1)

(be, 2)(not, 1)

(or, 1)(to, 2)

Page 14: Yahoo compares Storm and Spark

14

Use the Right Tool for the Job

https://www.flickr.com/photos/hikingartist/4193330368/

Page 15: Yahoo compares Storm and Spark

15

Things to Consider

ScaleLatency Iterative Processing› Are there suitable non-iterative alternatives?

Use What You KnowCode ReuseMaturity

Page 16: Yahoo compares Storm and Spark

16

When We Recommend Spark

Iterative Batch Processing (most Machine Learning)› There really is nothing else right now.› Has some scale issues.

Tried ETL (Not at Yahoo scale yet) Tried Shark/Interactive Queries (Not at Yahoo scale yet)

< 1 TB (or memory size of your cluster) Tuning it to run well can be a pain Data Bricks and others are working on scaling.

Streaming is all μ-batch so latency is at least 1 sec Streaming has single points of failure still All streaming inputs are replicated in memory

Page 17: Yahoo compares Storm and Spark

17

When We Recommend Storm

Latency < 1 second (single event at a time)› There is little else (especially not open source)

“Real Time” …› Analytics› Budgeting› ML› Anything

Lower Level API than Spark No built-in concept of look back aggregations Takes more effort to combine batch with streaming

Page 18: Yahoo compares Storm and Spark

18

Fictitious Example: My Commute App

Mobile App that lets users track their commute. Cities, users, companies, etc. compete daily for

› Shortest commute time› Greenest commute

Make money by selling location based ads and aggregate data to› Governments› Advertisers

Feel free to steal my crazy idea, I just want to be invited to the launch party, and I wouldn't say no to some stock.

Page 19: Yahoo compares Storm and Spark

19

Chicago vs. Champaign Urbana

Champaign Urbana: 14-15 min

Chicago: 20-30 min

Source: http://project.wnyc.org/commute-times-us/embed.html#5.00/42.000/-89.500

CU Chicago0

5

10

15

20

25

30

35

Bobby

Page 20: Yahoo compares Storm and Spark

20

Things to Consider

Scale › everyone in the world!!!

Latency › a few seconds max

Iterative Processing › Possibly for targeting, but there are alternatives

Page 21: Yahoo compares Storm and Spark

21

Architecture

App Web Service

(User, Commute ID, Location History, MPG)

HBase/NoSQL

Kafka Storm

HDFS Spark

Customer

Page 22: Yahoo compares Storm and Spark

22

Architecture (Alternative)

App Web Service

(User, Commute ID, Location History, MPG)

HBase/NOSQL

HDFS Spark

Customer

Go directly to Spark Streaming, but data loss potential goes up.

Page 23: Yahoo compares Storm and Spark

23

Architecture (Alternative 2)

App Web Service

(User, Commute ID, Location History, MPG)

HBase/NOSQL

Kafka Storm

Customer

Streaming Operations Only(Kappa Architecture)

Page 24: Yahoo compares Storm and Spark

24

Fictitious Example 2: Web Scale Monitoring

Look for trends that can indicate a problem.› Alert or provide automated corrections

Provide an interface to visualize› Current data very quickly› Historical data in depth

If you commercialize this one please give me/Yahoo a free license for life (open source works too)

Page 25: Yahoo compares Storm and Spark

25

Things to Consider

Scale › Lots of events from many different servers

Latency › a few seconds max, but the fewer the better

Iterative Processing › For in depth analysis definetly

Page 26: Yahoo compares Storm and Spark

26

Fictitious Example 2: Web Scale Monitoring

ServersHBase

Kafka Storm

HDFS Spark

UI

Alert!!

JDBC Server

Rules

ML and trend analysis